Difference between revisions of "SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets"

Line 11: Line 11:
 
none.
 
none.
  
== org.eclipse.smila.processing.pipelets.SetAnnotationPipelet ==
+
== org.eclipse.smila.processing.pipelets.AddValuesPipelet ==
  
=== Description ===
+
Adds something to an attribute in the processed records. If the attribute does not contain a sequence already, the current value will be wrapped in one before the new values are added.
  
Sets a configurable annotation on each record in the ''input'' variable. This can be used to control the operation of services and pipelets that look at special annotations to distinguish between different operation modes.
+
== Configuration ==  
 
+
<blockquote>
+
Since annotations on the root metadata object of records can now be set inline in the <tt><invokeService></tt> or <tt><invokePipelet></tt> activity (see [[SMILA/Documentation/BPEL Workflow Processor]]), this pipelet is only needed to set annotations on attributes. This means probably that is will not be used very much (-;
+
</blockquote>
+
 
+
=== Configuration ===
+
  
 
{| border="1"
 
{| border="1"
Line 28: Line 22:
 
!Description
 
!Description
 
|-
 
|-
|''Name''
+
|''outputAttribute''
|String
+
|a string value
|Name of annotation ot set
+
|name of attribute to add values to.
|-
+
|''AnonValue''
+
|String
+
|an anonymous value of the annotation. Can occur multiple times.
+
|-
+
|''NamedValue:<name>''
+
|named value of the annotation for name <name>
+
 
|-
 
|-
|''Path''
+
|''valuesToAdd''
|String : attribute path
+
|anything, usually a value or a sequence of values
|Path to attribute to attach the annotation to. If not set, annotation is set on the root metadata object of the record. The index in the final step of the path is irrelevant, the annotation if always attached to the attribute, not on contained literals or objects.
+
|the values to add
 
|}
 
|}
  
==== Example ====
+
=== Example ===
  
The following example was used in the ''AddPipeline'' of the SMILA example application to set an annotation that advises LuceneService in the following invocation to add the records to the index: It creates an annotation named <tt>org.eclipse.smila.lucene.LuceneService</tt> with a named value ''executionMode=ADD'' (see documentation of ''LuceneService'' for details):
+
From a test pipeline: This adds two string values to whatever already exists in attribute "out" of the processed records.
  
 
<source lang="xml">
 
<source lang="xml">
<extensionActivity name="setAnnotations">
+
<proc:invokePipelet name="addValuesToNonExistingAttribute">
   <proc:invokePipelet>
+
   <proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" />
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.SetAnnotationPipelet" />
+
  <proc:variables input="request" />
    <proc:variables input="request" />
+
  <proc:configuration>
    <proc:PipeletConfiguration>
+
  <rec:Val key="outputAttribute">out</rec:Val>
      <proc:Property name="Name">
+
    <rec:Seq key="valuesToAdd">
        <proc:Value>org.eclipse.smila.lucene.LuceneService</proc:Value>
+
      <rec:Val>value1</rec:Val>
      </proc:Property>
+
       <rec:Val>value2</rec:Val>
      <proc:Property name="NamedValue:executionMode">
+
     </rec:Seq>
        <proc:Value>ADD</proc:Value>
+
   </proc:configuration>
       </proc:Property>
+
</proc:invokePipelet>
     </proc:PipeletConfiguration>
+
   </proc:invokePipelet>
+
</extensionActivity>
+
 
</source>
 
</source>
  
Line 96: Line 80:
 
| name of output attachment or path to output attribute for plain text (store result as literals of attribute)
 
| name of output attachment or path to output attribute for plain text (store result as literals of attribute)
 
|-
 
|-
|''removeContentTagsÄÄ
+
|''removeContentTags''
 
|String
 
|String
 
|comma separated list of HTML tags (case insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to ''"applet,frame,object,script,style"''. If the value is set, you must add the default tags explicitly to have their contents removed, too.
 
|comma separated list of HTML tags (case insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to ''"applet,frame,object,script,style"''. If the value is set, you must add the default tags explicitly to have their contents removed, too.
Line 114: Line 98:
  
 
<source lang="xml">
 
<source lang="xml">
<PipeletConfiguration xmlns="http://www.eclipse.org/smila/processor">
+
<extensionActivity>
     <Property name="inputType">
+
  <proc:invokePipelet name="invokeHtml2Txt">
        <Value>ATTACHMENT</Value>
+
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
    </Property>
+
     <proc:variables input="request" output="request" />
    <Property name="outputType">
+
    <proc:configuration>
        <Value>ATTRIBUTE</Value>
+
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
    </Property>
+
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
    <Property name="inputName">
+
      <rec:Val key="inputName">html</rec:Val>
        <Value>html</Value>
+
      <rec:Val key="outputName">text</rec:Val>
    </Property>
+
      <rec:Val key="meta:author">author</rec:Val>
    <Property name="outputName">
+
      <rec:Val key="meta:keywords">keywords</rec:Val>
        <Value>text</Value>
+
      <rec:Val key="meta:title">title</rec:Val>
    </Property>
+
      <rec:Val key="removeContentTags">h1,h2,h3,h4</rec:Val>
    <Property name="meta:author">
+
     </proc:configuration>
        <Value>authors</Value>
+
  </proc:invokePipelet>
    </Property>
+
</extensionActivity>
    <Property name="meta:keywords">
+
        <Value>keywords</Value>
+
    </Property>
+
    <Property name="tag:title">
+
        <Value>title</Value>
+
    </Property>
+
    <Property name="removeContentTags">
+
        <Value>h1,h2,h3,h4</Value>
+
     </Property>
+
</PipeletConfiguration>
+
 
</source>
 
</source>
 
  
 
== org.eclipse.smila.processing.pipelets.CopyPipelet ==
 
== org.eclipse.smila.processing.pipelets.CopyPipelet ==
Line 186: Line 159:
 
<source lang="xml">
 
<source lang="xml">
 
<!-- copy txt from attachment to attribute -->
 
<!-- copy txt from attachment to attribute -->
<extensionActivity name="invokeCopyContent">
+
<extensionActivity>
    <proc:invokePipelet>
+
  <proc:invokePipelet name="invokeCopyContent">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" />
+
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" />
        <proc:variables input="request" output="request" />
+
    <proc:variables input="request" output="request" />
        <proc:PipeletConfiguration>
+
    <proc:configuration>
            <proc:Property name="inputType">
+
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
                <proc:Value>ATTACHMENT</proc:Value>
+
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
            </proc:Property>      
+
      <rec:Val key="inputName">Content</rec:Val>
            <proc:Property name="outputType">
+
      <rec:Val key="outputName">TextContent</rec:Val>
                <proc:Value>ATTRIBUTE</proc:Value>
+
       <rec:Val key="mode">COPY</rec:Val>
            </proc:Property>
+
    </proc:configuration>
            <proc:Property name="inputName">
+
  </proc:invokePipelet>
                <proc:Value>Content</proc:Value>
+
</extensionActivity>
            </proc:Property>
+
</source>
            <proc:Property name="outputName">
+
                <proc:Value>TextContent</proc:Value>
+
            </proc:Property>        
+
            <proc:Property name="mode">
+
                <proc:Value>COPY</proc:Value>
+
            </proc:Property>      
+
        </proc:PipeletConfiguration>     
+
    </proc:invokePipelet>
+
</extensionActivity></source>
+
 
+
  
 
== org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet ==
 
== org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet ==
Line 215: Line 178:
 
=== Description ===
 
=== Description ===
  
Extracts Literal values from an attribute that has a nested MObject. The attributes in the nested MObject can have nested MOBjects themselves. To address a attribute in the nested structure a path needs to be specified. The pipelet supports different execution modes:  
+
Extracts Literal values from an attribute that has a nested maps. The attributes in the nested map can have nested maps themselves. To address a attribute in the nested structure a path needs to be specified. The pipelet supports different execution modes:  
*FIRST: selects only the first Literal of the specified attribute
+
*FIRST: selects only the first literal of the specified attribute
*LAST: selects only the last Literal of the specified attribute
+
*LAST: selects only the last literal of the specified attribute
*ALL_AS_LIST: selects all Literal values of the specified attribute and returns a list
+
*ALL_AS_LIST: selects all literal values of the specified attribute and returns a list
*ALL_AS_ONE: selects all Literal values of the specified attribute and concatenates them to a single string, using a seperator (default is blank)
+
*ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)
  
This Pipelet works only on attributes, not attachments!
+
This pipelet works only on attributes, not on attachments!
  
 
<b>Note</b>:
 
<b>Note</b>:
The Pipelet currently does not support lists of MObjects !!!
+
If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.
 
+
  
 
=== Configuration ===
 
=== Configuration ===
Line 240: Line 202:
 
|''outputPath''
 
|''outputPath''
 
|String
 
|String
|the path to the attribute to store the extracted value(s) as Literals in
+
|the name of the attribute to store the extracted value(s) as Literals in (not a path, only a top-level attribute, currently)
 
|-
 
|-
 
|''mode''
 
|''mode''
Line 257: Line 219:
  
 
<source lang="xml">
 
<source lang="xml">
<extensionActivity name="invokeContentExtraction">
+
<!-- extract content -->
    <proc:invokePipelet>
+
<extensionActivity>
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" />
+
  <proc:invokePipelet name="extract content">
        <proc:variables input="request" output="request" />
+
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" />
        <proc:PipeletConfiguration>
+
    <proc:variables input="request" output="request" />
            <proc:Property name="inputPath">
+
    <proc:configuration>
                <proc:Value>Contents/Value</proc:Value>
+
      <rec:Val key="inputPath">Contents/Value</rec:Val>
            </proc:Property>
+
      <rec:Val key="outputPath">Content</rec:Val>
            <proc:Property name="outputPath">
+
      <rec:Val key="mode">ALL_AS_ONE</rec:Val>
                <proc:Value>Content</proc:Value>
+
    </proc:configuration>
            </proc:Property>
+
  </proc:invokePipelet>
            <proc:Property name="mode">
+
                <proc:Value>ALL_AS_ONE</proc:Value>
+
            </proc:Property>    
+
        </proc:PipeletConfiguration>     
+
    </proc:invokePipelet>
+
 
</extensionActivity>
 
</extensionActivity>
 
 
</source>
 
</source>
  
== org.eclipse.smila.processing.pipelets.MimeTypeIdentifyService ==
+
== Bundle: org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet ==
  
 
=== Description ===
 
=== Description ===
This ProcessingService is used to identify the mimetype of a document.  
+
This pipelet is used to identify the MIME type of a document.  
It uses a <tt>org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier</tt> to perform the actual identification of the mimetype. Depending on what properties are specified the mime type is detected from the content or the file extension or both. If the identification does not return a mime type then, if configured, the service searches the metadata for a mimetype. The identified MimeType is store in an attribute in the record.
+
It uses an <tt>org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier</tt> service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.
  
  
 
=== Configuration ===
 
=== Configuration ===
  
* <tt>configuration/org.eclipse.smila.processing.pipelets./MimeTypeConfig.xml</tt>
+
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:
  
 
{| border = 1
 
{| border = 1
 
!Property!!Type!!Usage!!Description
 
!Property!!Type!!Usage!!Description
 
|-
 
|-
|FileExtensionAttribute||String||optional||name of the attribute containing the file extension
+
|''FileExtensionAttribute''||String||Optional||Name of the attribute containing the file extension
 
|-
 
|-
|ContentAttachment||String||optional||name of the attachment containing the file content
+
|''ContentAttachment''||String||Optional||Name of the attachment containing the file content
 
|-
 
|-
|MetaDataAttribute||String||optional||name of the attribute containing metadata information. e.g. a WebCrawler returns a response header containing mime type information
+
|''MetaDataAttribute''||String||Optional||Name of the attribute containing metadata information, e.g. a Web Crawler returns a response header containing applicable MIME type information
 
|-
 
|-
|MimeTypeAttribute||String||required||name of the attribute to store the identified MimeType in
+
|''MimeTypeAttribute''||String||Required||Name of the attribute to store the identified MIME type to
 
|}
 
|}
Note that at least one of the properties FileExtensionAttribute, ContentAttachment and MetaDataAttribute needs to be specified!
+
Note that at least one of the properties ''FileExtensionAttribute'', ''ContentAttachment'', and ''MetaDataAttribute'' must be specified!
  
 
==== Example ====
 
==== Example ====
  
The following example was used in the SMILA example application to identify MimeTypes of documents delivered by Filesystem- and WebCrawler.
+
The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.
  
'''MimeTypeConfig.xml'''
+
'''addpipeline.bpel'''
 
<source lang="xml">
 
<source lang="xml">
<PipeletConfiguration xmlns="http://www.eclipse.org/smila/processor">
+
<extensionActivity>
  <Property name="FileExtensionAttribute">
+
    <proc:invokePipelet name="detect MimeType">
    <Value>FileExtension</Value>
+
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" />
  </Property>
+
        <proc:variables input="request" output="request" />
  <Property name="MetaDataAttribute">
+
        <proc:configuration>
    <Value>MetaData</Value>
+
          <rec:Val key="FileExtensionAttribute">Extension</rec:Val>
  </Property>
+
          <rec:Val key="MetaDataAttribute">MetaData</rec:Val>
  <Property name="MimeTypeAttribute">
+
          <rec:Val key="MimeTypeAttribute">MimeType</rec:Val>
    <Value>MimeType</Value>
+
        </proc:configuration>
  </Property>  
+
    </proc:invokePipelet>
</PipeletConfiguration>
+
</extensionActivity>
 
</source>
 
</source>
  
 
[[Category:SMILA]] [[Category:SMILA/Pipelet]]
 
[[Category:SMILA]] [[Category:SMILA/Pipelet]]

Revision as of 09:12, 20 April 2011

This page describes the SMILA pipelets provided by bundle org.eclipse.smila.processing.pipelets.

org.eclipse.smila.processing.pipelets.CommitRecordsPipelet

Description

Commits each record in the input variable on the blackboard to the storages. Can be used to save the records immediately during the workflow instead of only when a workflow has been finished.

Configuration

none.

org.eclipse.smila.processing.pipelets.AddValuesPipelet

Adds values to an attribute in the processed records. If the attribute does not already contain a sequence, its current value is wrapped in one before the new values are added.

Configuration

Property | Type | Description
outputAttribute | a string value | name of attribute to add values to.
valuesToAdd | anything, usually a value or a sequence of values | the values to add

Example

From a test pipeline: This adds two string values to whatever already exists in attribute "out" of the processed records.

<proc:invokePipelet name="addValuesToNonExistingAttribute">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="outputAttribute">out</rec:Val>
    <rec:Seq key="valuesToAdd">
      <rec:Val>value1</rec:Val>
      <rec:Val>value2</rec:Val>
    </rec:Seq>
  </proc:configuration>
</proc:invokePipelet>
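
The configuration table describes valuesToAdd as "usually a value or a sequence of values", so a single value can presumably be configured directly instead of a sequence. The following is a minimal sketch under that assumption; the activity name addSingleValue and the value value3 are hypothetical and not taken from the SMILA test pipelines.

```xml
<proc:invokePipelet name="addSingleValue">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="outputAttribute">out</rec:Val>
    <!-- assumption: a single value is given instead of a rec:Seq -->
    <rec:Val key="valuesToAdd">value3</rec:Val>
  </proc:configuration>
</proc:invokePipelet>
```

If attribute "out" already holds a single value, the description above implies that this value is first wrapped in a sequence and value3 is then added to it.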

org.eclipse.smila.processing.pipelets.HtmlToTextPipelet

Description

Extracts plain text and metadata from an HTML document in an attribute or attachment of each record and writes them to configurable attributes or attachments.

The pipelet uses the CyberNeko HTML parser NekoHTML to parse HTML documents.

Configuration

Property | Type | Description
inputType | String : ATTACHMENT, ATTRIBUTE | selects if the HTML input is found in an attachment or attribute of the record
outputType | String : ATTACHMENT, ATTRIBUTE | selects if the plain text should be stored in an attachment or attribute of the record
inputName | String | name of input attachment or path to input attribute (process literals of attribute)
outputName | String | name of output attachment or path to output attribute for plain text (store result as literals of attribute)
removeContentTags | String | comma-separated list of HTML tags (case insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to "applet,frame,object,script,style". If the value is set, you must add the default tags explicitly to have their contents removed, too.
meta:<name> | String : attribute path | stores the content of the <META> tag with name="<name>" (case insensitive) in the attribute named by the value of the property. E.g. a property named "meta:author" with value "authors" causes the content attributes of <META name="author" content="..."> tags to be stored in the attribute authors of the respective record.
tag:title | String : attribute path | stores the content of the <TITLE> tag in the attribute named by the value of the property.

Example

This configuration extracts plain text from the HTML document in attachment "html" and stores it in the attribute "text". It removes the complete content of heading tags <h1>, ..., <h4>. Additionally it looks for <meta> tags with names "author", "keywords", and "title" and stores their contents in attributes "author", "keywords", and "title", respectively:

<extensionActivity>
  <proc:invokePipelet name="invokeHtml2Txt">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">html</rec:Val>
      <rec:Val key="outputName">text</rec:Val>
      <rec:Val key="meta:author">author</rec:Val>
      <rec:Val key="meta:keywords">keywords</rec:Val>
      <rec:Val key="meta:title">title</rec:Val>
      <rec:Val key="removeContentTags">h1,h2,h3,h4</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
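
Setting removeContentTags replaces the default list, so a configuration that should drop heading contents in addition to the defaults has to repeat the defaults explicitly. The sketch below illustrates this and also uses the tag:title property from the table above. The activity name invokeHtml2TxtKeepDefaults and the attribute name pageTitle are hypothetical.

```xml
<extensionActivity>
  <proc:invokePipelet name="invokeHtml2TxtKeepDefaults">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">html</rec:Val>
      <rec:Val key="outputName">text</rec:Val>
      <!-- store the content of the <TITLE> tag in a (hypothetical) attribute -->
      <rec:Val key="tag:title">pageTitle</rec:Val>
      <!-- documented defaults repeated explicitly, followed by the heading tags -->
      <rec:Val key="removeContentTags">applet,frame,object,script,style,h1,h2,h3,h4</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```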

org.eclipse.smila.processing.pipelets.CopyPipelet

Description

This pipelet can be used to copy a String value between attributes and/or attachments. It supports two execution modes:

  • COPY: copy the value from the input attribute/attachment to the output attribute/attachment
  • MOVE: same as COPY, but delete the value from the input attribute/attachment afterwards

Configuration

Property | Type | Description
inputType | String : ATTACHMENT, ATTRIBUTE | selects if the input is found in an attachment or attribute of the record
outputType | String : ATTACHMENT, ATTRIBUTE | selects if output should be stored in an attachment or attribute of the record
inputName | String | name of input attachment or path to input attribute (process a String literal of attribute)
outputName | String | name of output attachment or path to output attribute for plain text (store result as String literal of attribute)
mode | String : COPY, MOVE | execution mode. Copy the value or move (copy and delete) the value. Default is COPY.

Example

This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':

<!-- copy txt from attachment to attribute -->
<extensionActivity>
  <proc:invokePipelet name="invokeCopyContent">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">Content</rec:Val>
      <rec:Val key="outputName">TextContent</rec:Val>
      <rec:Val key="mode">COPY</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
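
For comparison, MOVE mode could be configured as sketched below. This variant moves a value between two attributes, so the input attribute no longer holds the value afterwards. The activity name invokeMoveText and the attribute names RawText and Text are hypothetical.

```xml
<!-- move text from one attribute to another (attribute names are assumptions) -->
<extensionActivity>
  <proc:invokePipelet name="invokeMoveText">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTRIBUTE</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">RawText</rec:Val>
      <rec:Val key="outputName">Text</rec:Val>
      <!-- MOVE = copy, then delete the value from the input attribute -->
      <rec:Val key="mode">MOVE</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```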

org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet

Description

Extracts literal values from an attribute that contains nested maps. The attributes in a nested map can contain nested maps themselves. To address an attribute in the nested structure, a path must be specified. The pipelet supports the following execution modes:

  • FIRST: selects only the first literal of the specified attribute
  • LAST: selects only the last literal of the specified attribute
  • ALL_AS_LIST: selects all literal values of the specified attribute and returns a list
  • ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)

This pipelet works only on attributes, not on attachments!

Note: If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.

Configuration

Property | Type | Description
inputPath | String | the path to the input attribute with Literals
outputPath | String | the name of the attribute to store the extracted value(s) as Literals in (not a path, only a top-level attribute, currently)
mode | String : FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE | execution mode. See above for details.
separator | String | the separation string used for mode ALL_AS_ONE. Default is a blank

Example

This configuration can be applied to records provided by the FeedAgent. It shows how to access the subattribute 'Value' of attribute 'Contents', concatenating all values to one:

<!-- extract content -->
<extensionActivity>
  <proc:invokePipelet name="extract content">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputPath">Contents/Value</rec:Val>
      <rec:Val key="outputPath">Content</rec:Val>
      <rec:Val key="mode">ALL_AS_ONE</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
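
The separator property from the table above only applies to mode ALL_AS_ONE. The sketch below concatenates the values with a semicolon instead of the default blank. The activity name and the separator value are hypothetical; the input path is the one from the example above.

```xml
<!-- extract content, joined with a custom separator -->
<extensionActivity>
  <proc:invokePipelet name="extract content with separator">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputPath">Contents/Value</rec:Val>
      <!-- outputPath must name a top-level attribute, not a path -->
      <rec:Val key="outputPath">Content</rec:Val>
      <rec:Val key="mode">ALL_AS_ONE</rec:Val>
      <!-- assumption: a semicolon is used instead of the default blank -->
      <rec:Val key="separator">;</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```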

Bundle: org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet

Description

This pipelet is used to identify the MIME type of a document. It uses an org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.


Configuration

The pipelet is configured using the <configuration> section inside the <invokePipelet> activity of the corresponding BPEL file. It provides the following properties:

Property | Type | Usage | Description
FileExtensionAttribute | String | Optional | Name of the attribute containing the file extension
ContentAttachment | String | Optional | Name of the attachment containing the file content
MetaDataAttribute | String | Optional | Name of the attribute containing metadata information, e.g. a Web Crawler returns a response header containing applicable MIME type information
MimeTypeAttribute | String | Required | Name of the attribute to store the identified MIME type to

Note that at least one of the properties FileExtensionAttribute, ContentAttachment, and MetaDataAttribute must be specified!

Example

The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.

addpipeline.bpel

<extensionActivity>
    <proc:invokePipelet name="detect MimeType">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="FileExtensionAttribute">Extension</rec:Val>
          <rec:Val key="MetaDataAttribute">MetaData</rec:Val>
          <rec:Val key="MimeTypeAttribute">MimeType</rec:Val>
        </proc:configuration>
    </proc:invokePipelet>
</extensionActivity>
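
The example above detects the MIME type from the file extension and the metadata only. To also take the file content into account, the ContentAttachment property could be added as sketched below. The attachment name Content is an assumption and must match whatever attachment the crawler actually delivers; the activity name is hypothetical as well.

```xml
<extensionActivity>
    <proc:invokePipelet name="detect MimeType from content">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <!-- assumption: the file content is stored in an attachment named Content -->
          <rec:Val key="ContentAttachment">Content</rec:Val>
          <rec:Val key="FileExtensionAttribute">Extension</rec:Val>
          <rec:Val key="MimeTypeAttribute">MimeType</rec:Val>
        </proc:configuration>
    </proc:invokePipelet>
</extensionActivity>
```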
