Difference between revisions of "SMILA/Documentation/TikaPipelet"

(New page: <span style="color:#ff0000">'''This pipelet is not yet available in our repository. We are currently in the process of creating CQs for required third party components and hopefully get pe...)
 
Line 5: Line 5:
 
=== Description ===
 
=== Description ===
  
The TikaPipelet converts various document formats (such as PDF, Microsoft Office formats, OpenOffice formats, etc.) to plain text using [[SMILA/Glossary#Tika|Tika]] technology: A binary attachment content can thus be converted to plain text and stored in an attribute. In addition to that, metadata properties of the document (like title, author, etc) can be extracted and written to record attibutes. To improve the Tika parsing process it is possible to optionally pass the MimeType and Filename of the document stored in other Record attributes via parameters ''MimeTypeAttribute'' and ''FileNameAttribute''.  
+
The TikaPipelet converts various document formats (such as PDF, Microsoft Office formats, OpenOffice formats, etc.) to plain text using [[SMILA/Glossary#Tika|Tika]] technology: A binary attachment content can thus be converted to plain text and stored in an attribute. In addition to that, metadata properties of the document (like title, author, etc) can be extracted and written to record attibutes. To improve the Tika parsing process it is possible to optionally pass the content-type and filename of the document stored in other Record attributes via parameters ''contentTypeAttribute'' and ''fileNameAttribute''.  
  
 
The TikaPipelet supports the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in jobmanager workflows, records causing errors are dropped.
 
The TikaPipelet supports the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in jobmanager workflows, records causing errors are dropped.
Line 36: Line 36:
 
|''ExtractProperties''||String||runtime||no||Specifies which metadata properties reported by Tika for the document should be written to which record attribute. See below for details.
 
|''ExtractProperties''||String||runtime||no||Specifies which metadata properties reported by Tika for the document should be written to which record attribute. See below for details.
 
|-
 
|-
|''MimeTypeAttribute''||String||runtime||no||Parameter referencing the attribute that contains the mimetype of the document. If specified the mimetype is used to better guide the Tika parsing process.
+
|''contentTypeAttribute''||String||runtime||no||Parameter referencing the attribute that contains the content-type of the document. If specified the content-type is used to better guide the Tika parsing process.
 
|-
 
|-
|''FileNameAttribute''||String||runtime||no||Parameter referencing the attribute that contains the name of the file that was the source of the attachment content. If specified the filename is used to better guide the Tika parsing process.
+
|''fileNameAttribute''||String||runtime||no||Parameter referencing the attribute that contains the name of the file that was the source of the attachment content. If specified the filename is used to better guide the Tika parsing process.
 
|-
 
|-
 
|}
 
|}

Revision as of 06:51, 9 January 2013

This pipelet is not yet available in our repository. We are currently creating CQs for the required third-party components and hope to get permission to use them in our project.

Bundle: org.eclipse.smila.tika

Description

The TikaPipelet converts various document formats (such as PDF, Microsoft Office formats, and OpenOffice formats) to plain text using Tika technology: binary attachment content can thus be converted to plain text and stored in an attribute. In addition, metadata properties of the document (such as title or author) can be extracted and written to record attributes. To improve the Tika parsing process, you can optionally pass the content type and filename of the document, stored in other record attributes, via the parameters contentTypeAttribute and fileNameAttribute.

The TikaPipelet supports the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in jobmanager workflows, records causing errors are dropped.

Supported document types

By default, SMILA contains only a subset of Tika that supports the conversion of:

  • plain text documents (of course ;-)
  • HTML/XML documents
  • RTF documents
  • Microsoft Office documents, both the old formats (doc, xls, ppt) and the new OOXML formats (docx, xlsx, pptx)
  • Microsoft Visio documents
  • OpenOffice documents (OpenDocument formats)

See below for hints on how to add Tika extractors for further formats.

Configuration

Property | Type | Read Type | Required | Description
inputType | String: ATTACHMENT, ATTRIBUTE | runtime | yes | Selects whether the input is found in an attachment or an attribute of the record. Usually it doesn't make sense to use "ATTRIBUTE" here because the documents to convert are binary content.
outputType | String: ATTACHMENT, ATTRIBUTE | runtime | yes | Selects whether the output should be stored in an attachment or an attribute of the record.
inputName | String | runtime | yes | Name of the input attachment or path to the input attribute (process a String literal of the attribute).
outputName | String | runtime | yes | Name of the output attachment or path to the output attribute for the plain text (store the result as a String literal of the attribute).
ExtractProperties | String | runtime | no | Specifies which metadata properties reported by Tika for the document should be written to which record attribute. See below for details.
contentTypeAttribute | String | runtime | no | Parameter referencing the attribute that contains the content-type of the document. If specified, the content-type is used to better guide the Tika parsing process.
fileNameAttribute | String | runtime | no | Parameter referencing the attribute that contains the name of the file that was the source of the attachment content. If specified, the filename is used to better guide the Tika parsing process.

Configuring the Property Mapping

In addition to the plain text content, Tika can extract metadata properties from documents, such as title, author, publisher, or dates of publication. The names of these properties depend heavily on the documents and on what is actually extracted. Some well-known naming schemes such as Dublin Core (dc, dcterms) are used. For a complete list, please refer to the Tika documentation. To check what is extracted from your own documents, you can download Tika and use the Tika Application to see all extracted metadata.

To store such metadata properties in SMILA records, you must specify the names of the properties you want to store in the ExtractProperties parameter. Usually this parameter contains a sequence of string values. The string values can have one of the following formats:

  • <Property-Name>: Add the values of this property to an attribute with the same name.
  • <Property-Name>-><Attribute-Name>: Add the values of the property to the attribute with the given name.
  • <Property-Name>->><Attribute-Name>: Store the values of the property in the attribute with the given name, removing existing values first.

The resulting attribute is

  • a single Value, if only one value has been extracted and the value is not appended to previously existing values
  • an AnySeq containing all values, if more than one value has been extracted or new values are appended to existing values.
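
As a minimal sketch, the three formats described above could be combined in one ExtractProperties sequence as follows. The property names are taken from the list of typical property names on this page; the target attribute names DocumentTitle and Organisation are hypothetical and chosen only for illustration.

```xml
<rec:Seq key="ExtractProperties">
  <!-- Plain name: values of the property Author are added to the attribute Author. -->
  <rec:Val>Author</rec:Val>
  <!-- Append form: values of the property title are added to the
       (hypothetical) attribute DocumentTitle. -->
  <rec:Val>title->DocumentTitle</rec:Val>
  <!-- Replace form: existing values of the (hypothetical) attribute
       Organisation are removed before the values of Company are stored. -->
  <rec:Val>Company->>Organisation</rec:Val>
</rec:Seq>
```

The ">" character is legal in XML text content, so the arrows can be written literally; "&gt;" is an equivalent spelling if a tool insists on escaping.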

Example

The following example shows how to configure the pipelet to extract the text from the attachment called Content and store the extracted text in the attribute Text. Additionally, the metadata properties Company, Creator and Author, if present in the document, are stored in attributes with the same names.

For example, if a Word document has the value "ACME" as company and "John Doe" as creator, the resulting record contains the plain text in the attribute Text, the value ACME in the attribute Company, and the value John Doe in the attribute Creator.

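<proc:configuration>
  <!-- Read the binary document from the attachment "Content" -->
  <rec:Val key="inputName">Content</rec:Val>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <!-- Write the extracted plain text to the attribute "Text" -->
  <rec:Val key="outputName">Text</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <!-- Optional parsing hints: record attributes holding content-type and filename -->
  <rec:Val key="contentTypeAttribute">MimeType</rec:Val>
  <rec:Val key="fileNameAttribute">FileName</rec:Val>
  <!-- Metadata properties to copy into attributes of the same name -->
  <rec:Seq key="ExtractProperties">
    <rec:Val>Company</rec:Val>
    <rec:Val>Creator</rec:Val>
    <rec:Val>Author</rec:Val>
  </rec:Seq>
</proc:configuration>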


Typical Property-Names

  • Generic
    • "contributor"
    • "coverage"
    • "creator"
    • "description"
    • "format"
    • "identifier"
    • "language"
    • "modified"
    • "publisher"
    • "relation"
    • "rights"
    • "source"
    • "subject"
    • "title"
    • "type"
  • MS Office
    • "Application-Name"
    • "Application-Version"
    • "Author"
    • "Category"
    • "Comments"
    • "Company"
    • "Content-Status"
    • "Edit-Time"
    • "Keywords"
    • "Last-Author"
    • "Manager"
    • "Notes"
    • "Presentation-Format"
    • "Revision-Number"
    • "Security"
    • "Template"
    • "Total-Time"
    • "custom:"
    • "Version"


Extending Tika

SMILA does not contain the complete Tika distribution, because some converters need third party libraries with problematic licenses that we are not allowed to distribute. However, it should be easy to include those parts of Tika into your SMILA installation yourself:

TODO
