Difference between revisions of "SMILA/Documentation/AperturePipelet"

Revision as of 08:27, 16 September 2011

This pipelet is not yet available in our repository. As soon as the new Aperture release is available we will submit appropriate CQs and hopefully get permission to use it in our project.

Bundle: org.eclipse.smila.processing.pipelets.aperture.AperturePipelet

Description

This pipelet converts various document formats (such as PDF and XLS) to plain text using Aperture technology. It reads the document's content from AttachmentContent and stores the plain-text result in AttachmentText. The optional MIME type of AttachmentContent, given in AttachmentMimeType, is used for the conversion. If no MIME type is provided, the pipelet identifies it using a MimeTypeIdentifier service.
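The MIME type fallback described above can be sketched roughly as follows. This is only an illustration in Python, not the pipelet's Java code: `identify_mime_type` is a hypothetical stand-in for the MimeTypeIdentifier service (a tiny magic-byte lookup), and the record is modeled as a plain dictionary.

```python
from typing import Optional

# Hypothetical stand-in for the MimeTypeIdentifier service: a tiny
# magic-byte lookup, NOT the detection that Aperture actually performs.
_MAGIC_BYTES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",  # ZIP container, also used by OOXML files
}


def identify_mime_type(content: bytes) -> Optional[str]:
    """Guess a MIME type from the leading bytes, or return None."""
    for magic, mime_type in _MAGIC_BYTES.items():
        if content.startswith(magic):
            return mime_type
    return None


def resolve_mime_type(record: dict, mime_attribute: Optional[str],
                      content: bytes) -> Optional[str]:
    """Prefer the MIME type stored in the record; otherwise detect it."""
    if mime_attribute is not None:
        declared = record.get(mime_attribute)
        if declared:
            return declared
    # Parameter or attribute not set: fall back to identification.
    return identify_mime_type(content)
```

A declared MIME type always wins in this sketch; detection runs only when the parameter or the attribute it references is missing.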

Configuration

Property | Type | Read Type | Description
inputType | String: ATTACHMENT, ATTRIBUTE | runtime | Selects whether the input is found in an attachment or an attribute of the record.
outputType | String: ATTACHMENT, ATTRIBUTE | runtime | Selects whether the output is stored in an attachment or an attribute of the record.
inputName | String | runtime | Name of the input attachment, or path to the input attribute (a String literal of the attribute is processed).
outputName | String | runtime | Name of the output attachment, or path to the output attribute for the plain text (the result is stored as a String literal of the attribute).
ExtractProperties | String | runtime | Defines which properties to extract from the input and copy into record attributes named after the extracted properties. The value can be a single value or a set of values.
AttachmentMimeType | String | runtime | References the attribute that contains the MIME type of the attachment content. If the parameter (or the attribute it references) is not set (null), MIME type detection is performed.

Note that all properties are required and must be provided.
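As a rough illustration of the table and the note above, the following Python sketch checks a configuration held in a dictionary. The helper names (`validate_config`, `as_list`) are hypothetical and not part of SMILA; the sketch takes the note at face value and treats every listed property as required.

```python
from typing import Any, Dict, List

# Property names taken from the configuration table.
REQUIRED_PROPERTIES = (
    "inputType", "outputType", "inputName", "outputName",
    "ExtractProperties", "AttachmentMimeType",
)
ALLOWED_TYPES = {"ATTACHMENT", "ATTRIBUTE"}


def as_list(value: Any) -> List[str]:
    """ExtractProperties may be a single value or a set of values."""
    if isinstance(value, str):
        return [value]
    return list(value)


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    errors = []
    for name in REQUIRED_PROPERTIES:
        if name not in config:
            errors.append("missing property: " + name)
    for name in ("inputType", "outputType"):
        if name in config and config[name] not in ALLOWED_TYPES:
            errors.append("invalid value for " + name + ": " + str(config[name]))
    return errors
```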

Example

The following example shows how to configure the pipelet to extract the text from the attachment called Content and store the extracted text in the attribute Text. Additionally, the Company, Manager and Creator properties, if present in the document, are stored in attributes named after their class URIs.

For example, if a Word document with "ACME" as company and "John Doe" as creator is processed, the resulting record contains the plain text in the attribute Text, the value ACME in the attribute http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Company, and the value John Doe in the attribute dc:creator.
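The outcome described above can be modeled with a small Python sketch. The record is represented as a plain dictionary and `apply_extraction` is a hypothetical helper, not a SMILA API; it only shows how requested properties that are present end up as attributes named after their URIs, while missing ones (here Manager) are skipped.

```python
from typing import Dict, Iterable

COMPANY = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Company"
MANAGER = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Manager"
CREATOR = "dc:creator"


def apply_extraction(record: Dict[str, str], text: str,
                     extracted: Dict[str, str], wanted: Iterable[str],
                     output_name: str = "Text") -> Dict[str, str]:
    """Return a copy of the record with the text and requested properties added."""
    result = dict(record)
    result[output_name] = text
    for uri in wanted:
        # Only properties actually found in the document are stored.
        if uri in extracted:
            result[uri] = extracted[uri]
    return result


# Values a converter might have found in the Word document of the example.
found = {COMPANY: "ACME", CREATOR: "John Doe"}
record = apply_extraction({}, "plain text of the document", found,
                          [COMPANY, MANAGER, CREATOR])
```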

ConverterConfig.xml

<proc:configuration>
  <rec:Val key="inputName">Content</rec:Val>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="outputName">Text</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="AttachmentMimeType">MimeType</rec:Val>
  <rec:Seq key="ExtractProperties">
    <rec:Val>http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Company</rec:Val>
    <rec:Val>http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Manager</rec:Val>
    <rec:Val>dc:creator</rec:Val>
  </rec:Seq>
</proc:configuration>
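To show how the structure of this configuration maps to key/value pairs, here is a minimal Python sketch that reads it with the standard library. The fragment above does not declare its namespaces, so the sketch wraps it with placeholder URIs (`urn:example:proc`, `urn:example:rec`); these are assumptions for illustration only, not the real SMILA namespace URIs, and `read_config` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption): the original fragment does not
# declare the proc/rec prefixes, so these are NOT the real SMILA URIs.
NS = {"proc": "urn:example:proc", "rec": "urn:example:rec"}
SEQ_TAG = "{" + NS["rec"] + "}Seq"

CONFIG_XML = """
<proc:configuration xmlns:proc="urn:example:proc" xmlns:rec="urn:example:rec">
  <rec:Val key="inputName">Content</rec:Val>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="outputName">Text</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="AttachmentMimeType">MimeType</rec:Val>
  <rec:Seq key="ExtractProperties">
    <rec:Val>http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Company</rec:Val>
    <rec:Val>http://schemas.openxmlformats.org/officeDocument/2006/extended-properties/Manager</rec:Val>
    <rec:Val>dc:creator</rec:Val>
  </rec:Seq>
</proc:configuration>
"""


def read_config(xml_text: str) -> dict:
    """Map each keyed Val to a string and each keyed Seq to a list of strings."""
    root = ET.fromstring(xml_text.strip())
    config = {}
    for child in root:
        key = child.get("key")
        if child.tag == SEQ_TAG:
            config[key] = [val.text.strip() for val in child.findall("rec:Val", NS)]
        else:
            config[key] = child.text.strip()
    return config


config = read_config(CONFIG_XML)
```

In this sketch, single values become strings and the ExtractProperties sequence becomes a list, matching the table's statement that it can hold a single value or a set of values.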
