Difference between revisions of "SMILA/Documentation/2011.Simplification/org.eclipse.smila.processing.pipelets"

(Configuration)
(For SMILA 1.0: Simplification pages are obsolete now, redirect to SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets)
 
(11 intermediate revisions by 2 users not shown)
This page describes the SMILA pipelets provided by bundle <tt>org.eclipse.smila.processing.pipelets</tt>.
+
#REDIRECT [[SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets]]
 
+
== org.eclipse.smila.processing.pipelets.CommitRecordsPipelet ==
+
 
+
=== Description ===
+
 
+
Commits each record in the ''input'' variable on the blackboard to the storages. This can be used to save records immediately during a workflow instead of only after the workflow has finished.
+
 
+
=== Configuration ===
+
 
+
none.
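
Since the pipelet takes no configuration, an invocation only needs the pipelet class and the ''input'' variable. The following is a minimal sketch, assuming the same <tt>invokePipelet</tt> syntax as the other examples on this page. The activity name <tt>commitRecords</tt> is made up for illustration, and whether an empty <tt>proc:configuration</tt> element is required is not stated here.

<source lang="xml">
<extensionActivity>
  <!-- the name "commitRecords" is illustrative, not taken from SMILA -->
  <proc:invokePipelet name="commitRecords">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.CommitRecordsPipelet" />
    <!-- records referenced by the input variable are committed to the storages -->
    <proc:variables input="request" />
  </proc:invokePipelet>
</extensionActivity>
</source>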
+
 
+
== org.eclipse.smila.processing.pipelets.AddValuesPipelet ==
+
 
+
tbd.
+
 
+
=== Configuration ===
+
 
+
{| border="1"
+
!Property
+
!Type
+
!Description
+
|-
+
|''outputAttribute''
+
|String value
+
|name of attribute to add values to.
+
|-
+
|''valuesToAdd''
+
|anything, usually a value or a sequence of values
+
|the values to add
+
|}
+
 
+
==== Example ====
+
 
+
<source lang="xml">
<proc:invokePipelet name="addValuesToNonExistingAttribute">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="outputAttribute">out</rec:Val>
    <rec:Seq key="valuesToAdd">
      <rec:Val>value1</rec:Val>
      <rec:Val>value2</rec:Val>
    </rec:Seq>
  </proc:configuration>
</proc:invokePipelet>
</source>
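
According to the property table, ''valuesToAdd'' may also be a single value instead of a sequence. A configuration along the following lines should therefore be possible as well. This is a sketch derived from the table only, not taken from the SMILA sources; the activity name <tt>addSingleValue</tt> is illustrative.

<source lang="xml">
<!-- the name "addSingleValue" is illustrative -->
<proc:invokePipelet name="addSingleValue">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="outputAttribute">out</rec:Val>
    <!-- a single value instead of a rec:Seq of values -->
    <rec:Val key="valuesToAdd">value1</rec:Val>
  </proc:configuration>
</proc:invokePipelet>
</source>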
+
 
+
== org.eclipse.smila.processing.pipelets.HtmlToTextPipelet ==
+
 
+
=== Description ===
+
 
+
Extracts plain text and metadata from an HTML document in an attribute or attachment of each record and writes them to configurable attributes or attachments.
+
 
+
The pipelet uses the CyberNeko HTML parser [http://nekohtml.sourceforge.net/ NekoHTML] to parse HTML documents.
+
 
+
=== Configuration ===
+
 
+
{| border="1"
+
!Property
+
!Type
+
!Description
+
|-
+
|''inputType''
+
|String : ''ATTACHMENT, ATTRIBUTE''
+
|selects if the HTML input is found in an attachment or attribute of the record
+
|-
+
|''outputType''
+
|String : ''ATTACHMENT, ATTRIBUTE''
+
|selects if the plain text should be stored in an attachment or attribute of the record
+
|-
+
|''inputName''
+
|String
+
|name of input attachment or path to input attribute (process literals of attribute)
+
|-
+
|''outputName''
+
|String
+
| name of output attachment or path to output attribute for plain text (store result as literals of attribute)
+
|-
+
|''removeContentTags''
+
|String
+
|comma-separated list of HTML tags (case insensitive) whose complete content should be removed from the resulting plain text. If not set, it defaults to ''"applet,frame,object,script,style"''. If the value is set, you must add the default tags explicitly to have their contents removed, too.
+
|-
+
|''meta:<name>''
+
|String: attribute path
+
|stores the content of <tt><META></tt> tags with ''name="<name>"'' (case insensitive) in the attribute named by the value of the property. E.g. a property named ''"meta:author"'' with value ''"authors"'' causes the content attributes of <tt><META name="author" content="..."></tt> tags to be stored in the attribute ''authors'' of the respective record.
+
|-
+
|''tag:title''
+
|String: attribute path
+
|stores the content of the <tt><TITLE></tt> tag in the attribute named by the value of the property.
+
|}
+
 
+
==== Example ====
+
 
+
This configuration extracts plain text from the HTML document in attachment ''"Content"'' and stores it in the attribute ''"Content"''. Because ''removeContentTags'' is not set, the complete content of the default tags (''applet, frame, object, script, style'') is removed. Additionally it looks for <tt><meta></tt> tags with name ''"title"'' and stores their content in the attribute ''"Title"'':
+
 
+
<source lang="xml">
<extensionActivity>
  <proc:invokePipelet name="invokeHtml2Txt">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">Content</rec:Val>
      <rec:Val key="outputName">Content</rec:Val>
      <rec:Val key="meta:title">Title</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
</source>
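
As a further sketch, the following configuration reads the HTML from attachment ''"html"'', writes the plain text to attribute ''"text"'', removes the complete content of the heading tags <tt>h1</tt> to <tt>h4</tt>, and maps the <tt>author</tt> and <tt>keywords</tt> meta tags to attributes. Since setting ''removeContentTags'' replaces the default, the default tags are listed explicitly. The activity name and the target attribute ''"title"'' are illustrative assumptions.

<source lang="xml">
<extensionActivity>
  <!-- the name "invokeHtml2TxtWithMeta" is illustrative -->
  <proc:invokePipelet name="invokeHtml2TxtWithMeta">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">html</rec:Val>
      <rec:Val key="outputName">text</rec:Val>
      <!-- default tags repeated explicitly, plus the heading tags -->
      <rec:Val key="removeContentTags">applet,frame,object,script,style,h1,h2,h3,h4</rec:Val>
      <!-- <meta name="author"> goes to attribute "authors", <meta name="keywords"> to "keywords" -->
      <rec:Val key="meta:author">authors</rec:Val>
      <rec:Val key="meta:keywords">keywords</rec:Val>
      <!-- content of <title> goes to attribute "title" (attribute name is an assumption) -->
      <rec:Val key="tag:title">title</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
</source>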
+
 
+
 
+
== org.eclipse.smila.processing.pipelets.CopyPipelet ==
+
 
+
=== Description ===
+
 
+
This pipelet can be used to copy a String value between attributes and/or attachments. It supports two execution modes:
+
* COPY: copy the value from the input attribute/attachment to the output attribute/attachment
+
* MOVE: same as COPY, but after that delete the value from the input attribute/attachment
+
 
+
=== Configuration ===
+
 
+
{| border="1"
+
!Property
+
!Type
+
!Description
+
|-
+
|''inputType''
+
|String : ''ATTACHMENT, ATTRIBUTE''
+
|selects if the input is found in an attachment or attribute of the record
+
|-
+
|''outputType''
+
|String : ''ATTACHMENT, ATTRIBUTE''
+
|selects if output should be stored in an attachment or attribute of the record
+
|-
+
|''inputName''
+
|String
+
|name of input attachment or path to input attribute (process a String literal of attribute)
+
|-
+
|''outputName''
+
|String
+
| name of output attachment or path to output attribute (store result as String literal of attribute)
+
|-
+
|''mode''
+
|String : ''COPY, MOVE''
+
| execution mode. Copy the value or move (copy and delete) the value. Default is COPY.
+
|-
+
|}
+
 
+
==== Example ====
+
 
+
This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':
+
 
+
<source lang="xml">
<!-- copy txt from attachment to attribute -->
<extensionActivity>
  <proc:invokePipelet name="invokeCopyContent">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">Content</rec:Val>
      <rec:Val key="outputName">TextContent</rec:Val>
      <rec:Val key="mode">COPY</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
</source>
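
The MOVE mode is configured the same way. The following sketch moves a value from one attribute to another, so the input attribute no longer holds the value afterwards. The activity name and the attribute names ''"Title"'' and ''"OriginalTitle"'' are made up for illustration.

<source lang="xml">
<!-- move a value between two attributes; all names here are illustrative -->
<extensionActivity>
  <proc:invokePipelet name="invokeMoveTitle">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTRIBUTE</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">Title</rec:Val>
      <rec:Val key="outputName">OriginalTitle</rec:Val>
      <!-- MOVE = copy, then delete the value from the input attribute -->
      <rec:Val key="mode">MOVE</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
</source>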
+
 
+
 
+
== org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet ==
+
 
+
=== Description ===
+
 
+
Extracts Literal values from an attribute that has a nested MObject. The attributes in the nested MObject can have nested MObjects themselves. To address an attribute in the nested structure, a path needs to be specified. The pipelet supports different execution modes:
+
*FIRST: selects only the first Literal of the specified attribute
+
*LAST: selects only the last Literal of the specified attribute
+
*ALL_AS_LIST: selects all Literal values of the specified attribute and returns a list
+
*ALL_AS_ONE: selects all Literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)
+
 
+
This Pipelet works only on attributes, not attachments!
+
 
+
<b>Note</b>:
+
The Pipelet currently does not support lists of Maps!
+
 
+
 
+
=== Configuration ===
+
 
+
{| border="1"
+
!Property
+
!Type
+
!Description
+
|-
+
|''inputPath''
+
|String
+
|the path to the input attribute with Literals
+
|-
+
|''outputPath''
+
|String
+
|the path to the attribute to store the extracted value(s) as Literals in
+
|-
+
|''mode''
+
|String : ''FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE''
+
| execution mode. See above for details.
+
|-
+
|''separator''
+
|String
+
| the separation string used for mode ALL_AS_ONE. Default is a blank
+
|-
+
|}
+
 
+
==== Example ====
+
 
+
This configuration can be applied to records provided by the FeedAgent. It shows how to access the subattribute 'Value' of attribute 'Contents', concatenating all values to one:
+
 
+
<source lang="xml">
<!-- extract content -->
<extensionActivity>
  <proc:invokePipelet name="extract content">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputPath">Contents/Value</rec:Val>
      <rec:Val key="outputPath">Content</rec:Val>
      <rec:Val key="mode">ALL_AS_ONE</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
</source>
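
To join the values with something other than a blank, the ''separator'' property can be set in addition. The following sketch reuses the paths from the example above and joins the values with a semicolon. The activity name and the choice of separator are illustrative.

<source lang="xml">
<!-- extract content, joined with an explicit separator; the name is illustrative -->
<extensionActivity>
  <proc:invokePipelet name="extract content with separator">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputPath">Contents/Value</rec:Val>
      <rec:Val key="outputPath">Content</rec:Val>
      <rec:Val key="mode">ALL_AS_ONE</rec:Val>
      <!-- only used in mode ALL_AS_ONE; the default would be a blank -->
      <rec:Val key="separator">;</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
</source>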
+
 
+
== org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet ==
+
 
+
=== Description ===
+
This pipelet is used to identify the MIME type of a document.
+
It uses an <tt>org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier</tt> service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.
+
 
+
 
+
=== Configuration ===
+
 
+
The pipelet is configured using the <tt><proc:configuration></tt> section inside the <tt><proc:invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:
+
 
+
{| border = 1
+
!Property!!Type!!Usage!!Description
+
|-
+
|''FileExtensionAttribute''||String||Optional||Name of the attribute containing the file extension
+
|-
+
|''ContentAttachment''||String||Optional||Name of the attachment containing the file content
+
|-
+
|''MetaDataAttribute''||String||Optional||Name of the attribute containing metadata information, e.g. the response header returned by a Web Crawler, which may contain applicable MIME type information
+
|-
+
|''MimeTypeAttribute''||String||Required||Name of the attribute to store the identified MIME type to
+
|}
+
Note that at least one of the properties ''FileExtensionAttribute'', ''ContentAttachment'', and ''MetaDataAttribute'' must be specified!
+
 
+
==== Example ====
+
 
+
The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.
+
 
+
'''addpipeline.bpel'''
+
<source lang="xml">
<extensionActivity>
    <proc:invokePipelet name="detect MimeType">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="FileExtensionAttribute">Extension</rec:Val>
          <rec:Val key="MetaDataAttribute">MetaData</rec:Val>
          <rec:Val key="MimeTypeAttribute">MimeType</rec:Val>
        </proc:configuration>
    </proc:invokePipelet>
</extensionActivity>
</source>
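
The example above does not use content-based detection. A configuration that identifies the MIME type from the file content alone could look like the following sketch. It assumes the content is stored in an attachment named ''"Content"''; that attachment name and the activity name are illustrative.

<source lang="xml">
<extensionActivity>
    <!-- the name "detect MimeType from content" is illustrative -->
    <proc:invokePipelet name="detect MimeType from content">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <!-- attachment name "Content" is an assumption -->
          <rec:Val key="ContentAttachment">Content</rec:Val>
          <!-- required: attribute that receives the identified MIME type -->
          <rec:Val key="MimeTypeAttribute">MimeType</rec:Val>
        </proc:configuration>
    </proc:invokePipelet>
</extensionActivity>
</source>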
+
 
+
[[Category:SMILA]] [[Category:SMILA/Pipelet]]
+

Latest revision as of 06:03, 19 January 2012
