Difference between revisions of "SMILA/Documentation/TikaPipelet"

From Eclipsepedia


Revision as of 03:54, 11 July 2013

Bundle: org.eclipse.smila.tika

Description

The TikaPipelet converts various document formats (such as PDF, Microsoft Office, OpenOffice, etc.) to plain text using Tika technology: a record attachment containing the binary content can thus be converted to plain text and stored in an attribute. In addition, metadata properties of the document (such as title, author, etc.) can be extracted and written to record attributes. To improve the Tika parsing process, the content type and file name of the document, stored in other record attributes, can optionally be passed via the parameters contentTypeAttribute and fileNameAttribute.

The TikaPipelet supports the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in JobManager workflows, records causing errors are dropped.

Supported document types

By default, SMILA contains only a subset of Tika. Therefore, not all document formats can be converted out-of-the-box by using the TikaPipelet. However, it is easy to extend SMILA so that the TikaPipelet supports all document formats; see the "Extending Tika" section below.

Document format | Supported out-of-the-box | Supported by using | Hints
Microsoft Office | yes | TikaPipelet | ---
OpenOffice (OpenDocument formats) | yes | TikaPipelet | ---
RTF | yes | TikaPipelet | ---
Plain text | yes | --- | No conversion; the given input text is used as the "converted" text
HTML/XML | yes | HtmlToTextPipelet | BoilerpipePipelet can also be used for HTML text extraction
PDF | no | Tika extension | Converted text will be empty with out-of-the-box SMILA; a warning will be written to the log

In its default setup, SMILA (more precisely its 'AddPipeline', which is the default indexing pipeline) uses the TikaPipelet only for the conversion of binary document formats. Text-based documents are handled by another pipelet, the HtmlToTextPipelet. However, after extending Tika, this can be simplified by using the TikaPipelet for all document formats.

Configuration

Property | Type | Read Type | Required | Description
inputType | String: ATTACHMENT, ATTRIBUTE | runtime | yes | Defines whether the input is found in an attachment or an attribute of the record. Usually, it doesn't make sense to use "ATTRIBUTE" here because the documents to convert are binary content.
outputType | String: ATTACHMENT, ATTRIBUTE | runtime | yes | Defines whether the output should be stored in an attachment or an attribute of the record.
inputName | String | runtime | yes | Name of the input attachment or path to the input attribute (process a String literal of attribute).
outputName | String | runtime | yes | Name of the output attachment or path to the output attribute for plain text (store result as String literal of attribute).
extractProperties | String | runtime | no | Specifies which metadata properties reported by Tika for the document should be written to which record attribute. See "Using the metadata extraction" below.
contentTypeAttribute | String | runtime | no | Parameter referencing the attribute that contains the content type of the document. If specified, the content type is used to better guide the Tika parsing process. Tika also performs MIME type detection, and the resulting value is stored in this attribute.
fileNameAttribute | String | runtime | no | Parameter referencing the attribute that contains the name of the file that was the source of the attachment content. If specified, the file name is used to better guide the Tika parsing process.
exportAsHtml | Boolean | runtime | no | Flag that specifies whether the output should be in HTML format (true) or in plain text (false). Default is false (plain text).
pageBreak | Boolean | runtime | no | Specifies whether page breaks are used to split the content (true) or not (false). Default is false. Unless the partsAttribute parameter is set, the content is split into multiple output records. All attributes of the input record are copied to the created output records, unless an output record already has an attribute of the same name, in which case the existing value is preserved. The _recordid of an output record is generated by concatenating the _recordid of the input record with the page number, separated by ###, e.g. testdoc.pdf###1. If partsAttribute is set, the pipelet does not create multiple output records from one input record but a single, so-called multi-part record. This parameter is only interpreted when exportAsHtml is set to false.
pageNumberAttribute | String | runtime | no | Specifies the name of the attribute that should contain the extracted page number. This parameter is only interpreted if pageBreak is set to true. If not set, the page number is not added (default).
partsAttribute | String | runtime | no | If set, the pages of a document are not written into separate records but into a single, so-called multi-part record. The individual pages are added as a sequence of maps to this record. The parameter defines the key of this sequence; a map within this sequence represents one part (i.e. one page) of the content. Default: not set. This parameter is only interpreted when pageBreak is set to true.
keepHyphens | Boolean | runtime | no | If set to "false", hyphens are removed from words at line breaks so that the separated syllables are joined into one word ("charac-<newline>teristics" becomes "characteristics"). If set to "true", this dehyphenation is disabled. Default is false.
maxLength | Long | runtime | no | The maximum number of characters to extract. If a document contains more characters than specified, all remaining characters are omitted. To get all available characters, just omit this parameter. However, this may lead to OutOfMemory exceptions with large documents. Default is -1 (unlimited).
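
The splitting rules described for pageBreak (without partsAttribute) can be modeled in a few lines. The following Python sketch only illustrates those rules and is not SMILA's actual implementation: the function name split_into_page_records and the use of plain dictionaries as records are assumptions made for this example. Page numbers are assumed to start at 1 and to be stored as strings, as the testdoc.pdf###1 example suggests.

```python
def split_into_page_records(record, pages, output_name="text",
                            page_number_attribute=None):
    """Model of pageBreak=true without partsAttribute (illustrative only).

    record: dict with the attributes of the input record (hypothetical shape)
    pages:  list of page texts, in document order
    """
    children = []
    for number, page_text in enumerate(pages, start=1):
        # The child id is the parent id plus '###' plus the page number.
        child = {
            "_recordid": f"{record['_recordid']}###{number}",
            output_name: page_text,
        }
        if page_number_attribute:
            child[page_number_attribute] = str(number)
        # Copy the parent's attributes, but never overwrite an attribute
        # the child already has.
        for key, value in record.items():
            child.setdefault(key, value)
        children.append(child)
    return children


parent = {"_recordid": "testdoc.pdf", "filename": "testdoc.pdf"}
for child in split_into_page_records(parent, ["page one", "page two"],
                                     page_number_attribute="pageNo"):
    print(child["_recordid"])
```

Note that the child's own _recordid and text survive the copy step because existing child attributes take precedence over those of the input record.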

Some notes on "maxLength" in combination with other parameters:

  • If "exportAsHtml" is set to "true", the HTML tags are not counted when checking the limit, so the actual output will be longer than maxLength characters: the output creation stops when the "real" text content of the HTML reaches maxLength characters. After that, no additional tags are appended either.
  • The extracted text is trimmed, so the actual output can be shorter than maxLength characters because leading and trailing whitespace is removed.
  • When "keepHyphens" and "exportAsHtml" are set to "false", the actual output can be shorter than maxLength characters, because the hyphens and line breaks are removed from the already limited output. With "exportAsHtml=true", this effect will probably not be noticeable because the HTML tags usually make the output longer.
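
The interaction of these effects for plain-text output can be illustrated with a small model. The Python sketch below assumes that the limit is applied first, then the dehyphenation, then the trimming, which is what the notes above suggest; the function name limit_and_clean is hypothetical and the code is not taken from SMILA. HTML output is not modeled.

```python
import re


def limit_and_clean(text, max_length=-1, keep_hyphens=False):
    """Rough model of maxLength, keepHyphens and trimming (plain text only)."""
    # A negative max_length means "unlimited", mirroring the default of -1.
    if max_length >= 0:
        text = text[:max_length]
    if not keep_hyphens:
        # Join words that were hyphenated across a line break.
        text = re.sub(r"-\r?\n", "", text)
    # Leading and trailing whitespace is removed last.
    return text.strip()


sample = "  charac-\nteristics of text  "
print(len(limit_and_clean(sample, max_length=20)))
```

With a limit of 20 characters, the sample yields only 15 characters of output, because the hyphen, the line break, and the surrounding whitespace are removed after the limit has been applied.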

Using the multi-parts output record feature

To turn on the multi-parts output, i.e. to split the content into multiple parts and store them in the same output record, use the following configuration:

<proc:configuration>
  <rec:Val key="inputName">Content</rec:Val>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="outputName">text</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="pageBreak">true</rec:Val>
  <rec:Val key="pageNumberAttribute">pageNo</rec:Val>
  <rec:Val key="partsAttribute">pages</rec:Val>
  ...  		
</proc:configuration>

The parameter partsAttribute is set to the value pages. The output for an example PDF may look like this:

{
  "_recordid": "file:/home/user/example.pdf",
  "filename": "example.pdf",
  "pages": 
  [
   {
    "text": "this is the content of page 1.",
    "pageNo": "1"
   },
   {
    "text": "this is the content of page 2.",
    "pageNo": "2"
   },
   ...
  ],
  "_attachments": ["Content"]
}
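
How such a multi-part record is assembled can be sketched as follows. This is a simplified Python model under stated assumptions, not the pipelet's code: build_multi_part_record is a hypothetical name, records are plain dictionaries, and the default argument values simply mirror the configuration shown above.

```python
def build_multi_part_record(record, pages, parts_attribute="pages",
                            output_name="text", page_number_attribute="pageNo"):
    """Model of pageBreak=true with partsAttribute set (illustrative only)."""
    result = dict(record)  # keep the input record's attributes and id
    parts = []
    for number, page_text in enumerate(pages, start=1):
        part = {output_name: page_text}
        if page_number_attribute:
            # Page numbers are strings in the example output above.
            part[page_number_attribute] = str(number)
        parts.append(part)
    # One map per page, stored as a sequence under the configured key.
    result[parts_attribute] = parts
    return result


record = {"_recordid": "file:/home/user/example.pdf", "filename": "example.pdf"}
multi = build_multi_part_record(record, ["this is the content of page 1.",
                                         "this is the content of page 2."])
print(len(multi["pages"]))
```

In contrast to the splitting into separate records, the _recordid stays unchanged and exactly one output record is produced per input record.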

Using the metadata extraction

In addition to the plain text content, Tika can extract metadata properties from documents, such as title, author, publisher, dates of publication, etc. The names of these properties depend very much on the documents and on what is actually extracted. Some well-known names like Dublin Core (dc, dcterms) are used. For a complete list, please refer to the Tika documentation. To check which metadata your own documents provide, you can download Tika and use the Tika application to see all extracted metadata.

To store such metadata properties in SMILA records, you must specify the names of the properties you want to store in the extractProperties parameter. Usually, this parameter contains a sequence of maps. Each map can contain the following entries:

Property | Type | Read Type | Required | Description
metadataName | String | runtime | yes | The name of the metadata property. It is matched against the extracted metadata property names in a case-insensitive manner.
targetAttribute | String | runtime | no | The name of the record attribute to store the metadata value(s) in. If not set, the string provided in metadataName is used as the attribute name.
singleResult | Boolean | runtime | no | Flag that specifies whether only the first value (if multiple values exist) is used in the result (true) or whether all values are used (false). Default is false.
storeMode | String | runtime | no | Specifies whether values already stored in the target attribute of the record are left unchanged ("leave"), overwritten ("overwrite"), or whether the extracted values are added to potentially existing ones ("add"). Default is "add".

Example

The following example shows how to configure the pipelet so as to extract text from an attachment called Content and store the results in the attribute Text. Additionally, metadata properties like company, creator, and title are extracted (if available) and stored in the respective attributes.

For example, when analyzing a Word document with the value "ACME" as company and "John Doe" as creator, the resulting record would contain the plain text in the attribute Text, the value "ACME" in the attribute Company, and the value "John Doe" in the attribute Creator.

<proc:configuration>
  <rec:Val key="inputName">Content</rec:Val>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="outputName">Text</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="contentTypeAttribute">MimeType</rec:Val>
  <rec:Val key="fileNameAttribute">FileName</rec:Val>
  <rec:Val key="exportAsHtml">false</rec:Val>
  <rec:Val key="pageBreak">false</rec:Val>
  <rec:Val key="keepHyphens">false</rec:Val>
  <rec:Val key="maxLength">100000</rec:Val>
  <rec:Seq key="extractProperties">
    <rec:Map>
      <rec:Val key="metadataName">company</rec:Val>
      <rec:Val key="targetAttribute">Company</rec:Val>
      <rec:Val key="singleResult">false</rec:Val>
    </rec:Map>
    <rec:Map>
      <rec:Val key="metadataName">creator</rec:Val>
      <rec:Val key="targetAttribute">Creator</rec:Val>
      <rec:Val key="singleResult">false</rec:Val>
    </rec:Map>
    <rec:Map>
      <rec:Val key="metadataName">title</rec:Val>
      <rec:Val key="targetAttribute">Title</rec:Val>
      <rec:Val key="singleResult">true</rec:Val>
    </rec:Map>
  </rec:Seq>
</proc:configuration>
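
The effect of such a property mapping can be modeled as follows. The Python sketch below is a simplified illustration of the rules in the table above (case-insensitive matching, default target attribute, singleResult, storeMode), not the pipelet's implementation. The function name apply_extract_properties is hypothetical, and storing every value as a list is an assumption made to keep the example short.

```python
def apply_extract_properties(record, metadata, mappings):
    """Model of extractProperties (illustrative only).

    record:   dict of record attributes, values are lists (assumption)
    metadata: dict of metadata name -> list of values, as reported by a parser
    mappings: list of dicts with the keys described in the table above
    """
    result = dict(record)
    # metadataName is matched case-insensitively.
    by_lower_name = {name.lower(): values for name, values in metadata.items()}
    for mapping in mappings:
        values = by_lower_name.get(mapping["metadataName"].lower())
        if not values:
            continue  # property not present in this document
        if mapping.get("singleResult", False):
            values = values[:1]
        # Without targetAttribute, the metadata name is the attribute name.
        target = mapping.get("targetAttribute", mapping["metadataName"])
        mode = mapping.get("storeMode", "add")
        if mode not in ("leave", "overwrite", "add"):
            raise ValueError(f"unknown storeMode: {mode}")
        existing = result.get(target)
        if existing is None or mode == "overwrite":
            result[target] = list(values)
        elif mode == "add":
            result[target] = list(existing) + list(values)
        # mode == "leave": the existing values stay untouched
    return result


metadata = {"Company": ["ACME"], "creator": ["John Doe"]}
mappings = [
    {"metadataName": "company", "targetAttribute": "Company"},
    {"metadataName": "creator", "targetAttribute": "Creator"},
]
print(apply_extract_properties({}, metadata, mappings))
```

For a document with the company "ACME" and the creator "John Doe", the model produces the attributes Company and Creator as described in the example above.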

Typical Property-Names
  • Generic
    • "contributor"
    • "coverage"
    • "creator"
    • "description"
    • "format"
    • "identifier"
    • "language"
    • "modified"
    • "publisher"
    • "relation"
    • "rights"
    • "source"
    • "subject"
    • "title"
    • "type"
  • MS Office
    • "Application-Name"
    • "Application-Version"
    • "Author"
    • "Category"
    • "Comments"
    • "Company"
    • "Content-Status"
    • "Edit-Time"
    • "Keywords"
    • "Last-Author"
    • "Manager"
    • "Notes"
    • "Presentation-Format"
    • "Revision-Number"
    • "Security"
    • "Template"
    • "Total-Time"
    • "custom:"
    • "Version"

Extending Tika

SMILA does not contain the complete Tika distribution, because some converters need third-party libraries with licenses that we are not allowed to distribute. However, it is easy (and absolutely legal!) to include those parts of Tika in your SMILA installation yourself:

  • Download the org.eclipse.smila.tika.deps bundle from http://ubuntuone.com/1n9PNxx6akZ0X1Bc7ahYrm
  • Replace the corresponding bundle of your SMILA distribution with the downloaded one by copying the downloaded bundle to the <path-to-your-SMILA>/plugins folder.

That's it! After a SMILA restart, all document formats supported by Tika will also be supported by SMILA's TikaPipelet.

For Developers

When working with SMILA in the Eclipse IDE:

  • Remove the org.eclipse.smila.tika.deps bundle from your workspace by deleting the project. (You can keep the project contents.)
  • Put the downloaded org.eclipse.smila.tika.deps.jar into your SMILA.extensions project and reload your target platform.