Difference between revisions of "SMILA/Documentation/TikaPipelet"

Revision as of 08:16, 12 February 2013

Bundle: org.eclipse.smila.tika

Description

The TikaPipelet converts various document formats (such as PDF, Microsoft Office formats, OpenOffice formats, etc.) to plain text using Tika technology: the binary content of an attachment can thus be converted to plain text and stored in an attribute. In addition, metadata properties of the document (such as title, author, etc.) can be extracted and written to record attributes. To improve the Tika parsing process, the content type and filename of the document, stored in other record attributes, can optionally be passed via the parameters contentTypeAttribute and fileNameAttribute.

The TikaPipelet supports the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in JobManager workflows, records causing errors are dropped.

Supported document types

By default, SMILA contains only a subset of Tika that supports the conversion of:

  • Plain text documents
  • HTML/XML documents
  • RTF documents
  • Microsoft Office documents
  • OpenOffice documents (OpenDocument formats)

See below for hints on how to add Tika extractors for further formats.

Configuration

Property | Type | Read Type | Required | Description
inputType | String: ATTACHMENT, ATTRIBUTE | runtime | yes | Selects whether the input is found in an attachment or in an attribute of the record. Using "ATTRIBUTE" here usually does not make sense because the documents to convert are binary content.
outputType | String: ATTACHMENT, ATTRIBUTE | runtime | yes | Selects whether the output should be stored in an attachment or in an attribute of the record.
inputName | String | runtime | yes | Name of the input attachment or path to the input attribute (the String literal of the attribute is processed).
outputName | String | runtime | yes | Name of the output attachment or path to the output attribute for the plain text (the result is stored as a String literal of the attribute).
extractProperties | String | runtime | no | Specifies which metadata properties reported by Tika for the document should be written to which record attribute. See below for details.
contentTypeAttribute | String | runtime | no | Parameter referencing the attribute that contains the content type of the document. If specified, the content type is used to better guide the Tika parsing process. Tika also performs a MIME type detection, and the resulting value is stored in this attribute.
fileNameAttribute | String | runtime | no | Parameter referencing the attribute that contains the name of the file that was the source of the attachment content. If specified, the filename is used to better guide the Tika parsing process.
exportAsHtml | Boolean | runtime | no | Flag that specifies whether the output should be in HTML format (true) or not (false). Plain text output (false) is the default.
pageBreak | Boolean | runtime | no | Flag that specifies whether page breaks should be used to split the content into multiple output records (true) or not (false). The recordId of each output record is generated by concatenating the recordId of the input record with the page number, separated by #, e.g. testdoc.pdf#1. This parameter is only interpreted if exportAsHtml is false. Default is false.
pageNumberAttribute | String | runtime | no | Parameter that specifies the name of the attribute that should contain the extracted page number. This parameter is only interpreted if pageBreak is true. If not set, the page number is not stored (default).
keepHyphens | Boolean | runtime | no | If set to false, hyphens are removed from words at line breaks so that the separated syllables are joined into one word ("charac-<newline>teristics" becomes "characteristics"). If set to true, this dehyphenation is disabled. Default is false.
maxLength | Long | runtime | no | The maximum number of characters to extract. If a document contains more characters than specified, all remaining characters are omitted. To get all available characters, omit this parameter; note that this may lead to OutOfMemory exceptions with big documents. Default is -1 (unlimited).
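
As a hedged illustration of the page-splitting parameters described in the table, the following fragment sketches a configuration that writes one output record per page. It follows the same rec:Val notation as the example further down this page; the attribute name PageNumber is an assumption chosen for illustration and is not defined by the pipelet.

```xml
<proc:configuration>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="inputName">Content</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="outputName">Text</rec:Val>
  <!-- pageBreak is only interpreted if exportAsHtml is false -->
  <rec:Val key="exportAsHtml">false</rec:Val>
  <rec:Val key="pageBreak">true</rec:Val>
  <!-- "PageNumber" is a hypothetical attribute name used for illustration -->
  <rec:Val key="pageNumberAttribute">PageNumber</rec:Val>
</proc:configuration>
```

According to the pageBreak description, an input record with the recordId testdoc.pdf would then yield output records with IDs such as testdoc.pdf#1, each carrying its page number in the configured attribute.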

Some notes on "maxLength" in combination with other parameters:

  • If "exportAsHtml" is set to "true", the HTML tags are not counted when checking the limit, so the actual output will be longer than maxLength characters: output creation stops when the "real" text content of the HTML reaches maxLength characters. After that, no further tags are appended.
  • The extracted text is trimmed, so the actual output can be shorter than maxLength characters because leading and trailing whitespace is removed.
  • When "keepHyphens" and "exportAsHtml" are set to "false", the actual output can be shorter than maxLength characters because the hyphens and line breaks are removed from the limited output. With "exportAsHtml=true", this effect will probably not be noticeable because the output usually gets longer due to the HTML tags.
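
The notes above can be made concrete with a small configuration sketch. It is only an illustration under the parameter semantics stated on this page; the limit of 5000 characters is an arbitrary value chosen for the example.

```xml
<proc:configuration>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="inputName">Content</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="outputName">Text</rec:Val>
  <!-- HTML output: tags are not counted against maxLength,
       so the stored value may exceed 5000 characters -->
  <rec:Val key="exportAsHtml">true</rec:Val>
  <!-- arbitrary example limit; omit the parameter for unlimited output -->
  <rec:Val key="maxLength">5000</rec:Val>
  <!-- true disables dehyphenation, so no hyphens or line breaks
       are removed from the limited output -->
  <rec:Val key="keepHyphens">true</rec:Val>
</proc:configuration>
```

With exportAsHtml set to false instead, the stored text could end up somewhat shorter than the limit because of trimming and, if keepHyphens is false, dehyphenation.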


Configuring the Property Mapping

In addition to the plain text content, Tika can extract metadata properties from documents, such as title, author, publisher, dates of publication, etc. The names of these properties depend very much on the documents and on what is actually extracted. Some well-known names, such as those from Dublin Core (dc, dcterms), are used. For a complete list, please refer to the Tika documentation. To check what your documents provide, you can download Tika and use the Tika Application to see all extracted metadata.

To store such metadata properties in SMILA records, you must specify the names of the properties you want to store in the extractProperties parameter. Usually this parameter contains a sequence of maps. The map values have the following format:

Property | Type | Read Type | Required | Description
metadataName | String | runtime | yes | The name of the metadata property. This will be matched with the extracted metadata property names in a case-insensitive manner.
targetAttribute | String | runtime | no | The name of the record attribute to store the metadata value(s) in. If not set, the string provided in metadataName will be used as the attribute name.
singleResult | Boolean | runtime | no | Flag that specifies if only the first value (if multiple values exist) is used in the result (true) or if all values are used (false). Default is false.
storeMode | String | runtime | no | Specifies whether attributes already stored in the record target attribute will be left unchanged ("leave"), overwritten ("overwrite"), or whether the extracted properties will be added to potentially existing ones ("add"). Default is "add".
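
To illustrate the optional map entries, the following fragment is a minimal sketch of an extractProperties sequence that uses storeMode and omits targetAttribute. It assumes the same notation as the full example below; the chosen property names are only examples.

```xml
<rec:Seq key="extractProperties">
  <rec:Map>
    <!-- replace any existing Title value with the extracted one -->
    <rec:Val key="metadataName">title</rec:Val>
    <rec:Val key="targetAttribute">Title</rec:Val>
    <rec:Val key="singleResult">true</rec:Val>
    <rec:Val key="storeMode">overwrite</rec:Val>
  </rec:Map>
  <rec:Map>
    <!-- no targetAttribute: per the table, the metadataName string
         is used as the attribute name -->
    <rec:Val key="metadataName">Keywords</rec:Val>
    <!-- keep values that are already present in the record -->
    <rec:Val key="storeMode">leave</rec:Val>
  </rec:Map>
</rec:Seq>
```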

Example

The following example shows how to configure the pipelet to extract the text from the attachment called Content and store the extracted text in the attribute Text. Additionally, the metadata properties Company, Creator and Title, if present in the document, will be stored in record attributes.

For example, if a Word document with the value "ACME" as company and "John Doe" as creator is processed, the resulting record would contain the plain text in the attribute Text, the value ACME in the attribute Company, and the value John Doe in the attribute Creator.

<proc:configuration>
  <rec:Val key="inputName">Content</rec:Val>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="outputName">Text</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="contentTypeAttribute">MimeType</rec:Val>
  <rec:Val key="fileNameAttribute">FileName</rec:Val>
  <rec:Val key="exportAsHtml">false</rec:Val>
  <rec:Val key="pageBreak">false</rec:Val>
  <rec:Val key="keepHyphens">false</rec:Val>
  <rec:Val key="maxLength">100000</rec:Val>
  <rec:Seq key="extractProperties">					    
     <rec:Map>					    
        <rec:Val key="metadataName">company</rec:Val>
        <rec:Val key="targetAttribute">Company</rec:Val>			    					    
        <rec:Val key="singleResult">false</rec:Val>			    					    
     </rec:Map>
     <rec:Map>					    
        <rec:Val key="metadataName">creator</rec:Val>
        <rec:Val key="targetAttribute">Creator</rec:Val>			    					    
        <rec:Val key="singleResult">false</rec:Val>			    					    
     </rec:Map>
     <rec:Map>					    
        <rec:Val key="metadataName">title</rec:Val>
        <rec:Val key="targetAttribute">Title</rec:Val>			    					    
        <rec:Val key="singleResult">true</rec:Val>			    					    
     </rec:Map>		
  </rec:Seq>
</proc:configuration>

Typical Property-Names

  • Generic
    • "contributor"
    • "coverage"
    • "creator"
    • "description"
    • "format"
    • "identifier"
    • "language"
    • "modified"
    • "publisher"
    • "relation"
    • "rights"
    • "source"
    • "subject"
    • "title"
    • "type"
  • MS Office
    • "Application-Name"
    • "Application-Version"
    • "Author"
    • "Category"
    • "Comments"
    • "Company"
    • "Content-Status"
    • "Edit-Time"
    • "Keywords"
    • "Last-Author"
    • "Manager"
    • "Notes"
    • "Presentation-Format"
    • "Revision-Number"
    • "Security"
    • "Template"
    • "Total-Time"
    • "custom:"
    • "Version"
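
As a sketch of how the names listed above might be used, the following extractProperties fragment maps two of the MS Office properties to record attributes. Because metadataName is matched case-insensitively, the lower-case spelling should match "Last-Author"; the target attribute names LastAuthor and Application are assumptions made for this example.

```xml
<rec:Seq key="extractProperties">
  <rec:Map>
    <!-- matched case-insensitively against "Last-Author" -->
    <rec:Val key="metadataName">last-author</rec:Val>
    <!-- hypothetical target attribute name -->
    <rec:Val key="targetAttribute">LastAuthor</rec:Val>
    <rec:Val key="singleResult">true</rec:Val>
  </rec:Map>
  <rec:Map>
    <rec:Val key="metadataName">Application-Name</rec:Val>
    <!-- hypothetical target attribute name -->
    <rec:Val key="targetAttribute">Application</rec:Val>
  </rec:Map>
</rec:Seq>
```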


Extending Tika

SMILA does not contain the complete Tika distribution, because some converters need third party libraries with problematic licenses that we are not allowed to distribute. However, it should be easy to include those parts of Tika into your SMILA installation yourself:

TODO