SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets
Revision as of 11:27, 18 January 2012
This page describes the SMILA pipelets provided by bundle org.eclipse.smila.processing.pipelets.
Contents
- 1 General
- 2 org.eclipse.smila.processing.pipelets.CommitRecordsPipelet
- 3 org.eclipse.smila.processing.pipelets.AddValuesPipelet
- 4 org.eclipse.smila.processing.pipelets.RemoveAttributePipelet
- 5 org.eclipse.smila.processing.pipelets.FilterPipelet
- 6 org.eclipse.smila.processing.pipelets.HtmlToTextPipelet
- 7 org.eclipse.smila.processing.pipelets.CopyPipelet
- 8 org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet
- 9 org.eclipse.smila.processing.pipelets.ReplacePipelet
- 10 org.eclipse.smila.processing.pipelets.ScriptPipelet
- 11 org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet
- 12 org.eclipse.smila.processing.pipelets.FileReaderPipelet
General
All pipelets in this bundle support the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in jobmanager workflows, records causing errors are dropped.
Read Type
- runtime: Parameters are read when processing records. The parameter value can be set per record.
- init: Parameters are read once from the pipelet configuration when the pipelet is initialized. The parameter value cannot be overridden in a record.
org.eclipse.smila.processing.pipelets.CommitRecordsPipelet
Description
Commits each record in the input variable on the blackboard to the storages. This can be used to save records immediately during the workflow rather than only after the workflow has finished.
Configuration
none.
org.eclipse.smila.processing.pipelets.AddValuesPipelet
Adds values to an attribute in the processed records. If the attribute does not already contain a sequence, its current value is wrapped in one before the new values are added.
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
outputAttribute | A string value | runtime | The name of the attribute to add values to |
valuesToAdd | Anything, usually a value or a sequence of values | runtime | The values to add |
Example
From a test pipeline: This adds two string values to whatever already exists in attribute "out" of the processed records.
    <proc:invokePipelet name="addValuesToNonExistingAttribute">
      <proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" />
      <proc:variables input="request" />
      <proc:configuration>
        <rec:Val key="outputAttribute">out</rec:Val>
        <rec:Seq key="valuesToAdd">
          <rec:Val>value1</rec:Val>
          <rec:Val>value2</rec:Val>
        </rec:Seq>
      </proc:configuration>
    </proc:invokePipelet>
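Since valuesToAdd accepts a single value as well as a sequence, a configuration that adds just one value might look like the following sketch. The activity name addImportTag, the attribute name tags, and the value imported are illustrative assumptions, not taken from an actual pipeline:

```xml
<proc:invokePipelet name="addImportTag">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <!-- hypothetical attribute name -->
    <rec:Val key="outputAttribute">tags</rec:Val>
    <!-- a single value instead of a rec:Seq -->
    <rec:Val key="valuesToAdd">imported</rec:Val>
  </proc:configuration>
</proc:invokePipelet>
```

According to the description above, an existing single value in tags would first be wrapped in a sequence, and the new value would then be added to it.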
org.eclipse.smila.processing.pipelets.RemoveAttributePipelet
Removes an attribute from each record.
Configuration
The configuration property is either read from the _parameters attribute of a record or from the pipelet configuration. If not set at all, the record remains unchanged.
Property | Type | Read Type | Description |
---|---|---|---|
removeAttribute | A string value | runtime | The name of the attribute to remove |
Example
To remove the complete structure in attribute _parameters, use:
    <extensionActivity>
      <proc:invokePipelet name="removeParameters">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.RemoveAttributePipelet" />
        <proc:variables input="result" output="result" />
        <proc:configuration>
          <rec:Val key="removeAttribute">_parameters</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
org.eclipse.smila.processing.pipelets.FilterPipelet
Copies only those record IDs to the result whose value in a configurable single-valued attribute matches a configurable regular expression. This is useful for conditional processing while still pushing multiple records through the pipeline in a single request: instead of using BPEL conditions, use a FilterPipelet to select only the matching records into a new variable, and use this variable as the input variable for the subsequent pipelets. You can still use the original BPEL variable in the BPEL <reply> activity at the end of the pipeline to return all records as the final result.
Configuration
The configuration properties are read either from the _parameters attribute of each record or from the pipelet configuration.
Property | Type | Read Type | Description |
---|---|---|---|
filterAttribute | A string value | runtime | The name of the attribute to match |
filterExpression | A string value | runtime | The regular expression to match the attribute value against |
Example
To get only those records in the textRecords BPEL variable that have a MimeType starting with text/, something like this could be used:
    <extensionActivity>
      <proc:invokePipelet name="invokeFilterPipelet">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.FilterPipelet" />
        <proc:variables input="request" output="textRecords" />
        <proc:configuration>
          <rec:Val key="filterAttribute">MimeType</rec:Val>
          <rec:Val key="filterExpression">text/.+</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
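The pattern described above, filtering into a new variable and then using that variable as input for later pipelets, could be sketched as follows. The variable name htmlRecords, the activity names, and the attachment and attribute names are illustrative assumptions; the second step uses the HtmlToTextPipelet described elsewhere on this page, and htmlRecords would presumably have to be declared as a BPEL variable in the pipeline:

```xml
<!-- step 1: select only the records whose MimeType is text/html -->
<extensionActivity>
  <proc:invokePipelet name="selectHtmlRecords">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.FilterPipelet" />
    <proc:variables input="request" output="htmlRecords" />
    <proc:configuration>
      <rec:Val key="filterAttribute">MimeType</rec:Val>
      <rec:Val key="filterExpression">text/html</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>

<!-- step 2: process only the selected records (hypothetical names) -->
<extensionActivity>
  <proc:invokePipelet name="convertSelectedHtml">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
    <proc:variables input="htmlRecords" output="htmlRecords" />
    <proc:configuration>
      <rec:Val key="inputType">ATTACHMENT</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">html</rec:Val>
      <rec:Val key="outputName">text</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```

The original request variable is left untouched by the filter step, so it can still be returned in the final <reply> activity.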
org.eclipse.smila.processing.pipelets.HtmlToTextPipelet
Description
Extracts plain text and metadata from an HTML document stored in an attribute or attachment of each record and writes the results to configurable attributes or attachments.
The pipelet uses the CyberNeko HTML parser NekoHTML to parse HTML documents.
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
inputType | String : ATTACHMENT, ATTRIBUTE | runtime | Defines whether the HTML input is found in an attachment or in an attribute of the record |
outputType | String : ATTACHMENT, ATTRIBUTE | runtime | Defines whether the plain text should be stored in an attachment or in an attribute of the record |
inputName | String | runtime | Name of input attachment or path to input attribute (process literals of attribute) |
outputName | String | runtime | Name of output attachment or path to output attribute for plain text (store result as literals of attribute) |
defaultEncoding | String | runtime | Optional, default encoding to apply to documents when not specified in the documents themselves |
removeContentTags | String | runtime | Comma-separated list of HTML tags (case-insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to "applet,frame,object,script,style". If the value is set, you must add the default tags explicitly to have their contents removed, too. |
meta:<name> | String: attribute path | init | Store the content of the <META> tag with name="<name>" (case insensitive) to the attribute named as the value of the property. E.g. a property named "meta:author" with value "authors" causes the content attributes of <META name="author" content="..."> tags to be stored in the attribute authors of the respective record. |
tag:title | String: attribute path | init | Store the content of the <TITLE> tag to the attribute named as the value of the property. |
Example
This configuration extracts plain text from the HTML document in attachment "html" and stores the result in the attribute "text". It removes the complete content of the heading tags <h1>, ..., <h4>; because removeContentTags is set without the default tags, the content of the default tags is no longer removed. In addition, it looks for <meta> tags with names "author", "keywords", and "title" and stores their contents in the attributes "author", "keywords", and "title", respectively:
    <extensionActivity>
      <proc:invokePipelet name="invokeHtml2Txt">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="inputType">ATTACHMENT</rec:Val>
          <rec:Val key="outputType">ATTRIBUTE</rec:Val>
          <rec:Val key="inputName">html</rec:Val>
          <rec:Val key="outputName">text</rec:Val>
          <rec:Val key="defaultEncoding">UTF-8</rec:Val>
          <rec:Val key="meta:author">author</rec:Val>
          <rec:Val key="meta:keywords">keywords</rec:Val>
          <rec:Val key="meta:title">title</rec:Val>
          <rec:Val key="removeContentTags">h1,h2,h3,h4</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
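As a variant, the following sketch reads the HTML from an attribute instead of an attachment, stores the <TITLE> content via tag:title, and extends the default removeContentTags list rather than replacing it. The activity name and the attribute names htmlContent, text, and title are illustrative assumptions:

```xml
<extensionActivity>
  <proc:invokePipelet name="invokeHtml2TxtFromAttribute">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTRIBUTE</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <!-- hypothetical attribute holding the HTML source -->
      <rec:Val key="inputName">htmlContent</rec:Val>
      <rec:Val key="outputName">text</rec:Val>
      <!-- store the <TITLE> content in attribute "title" -->
      <rec:Val key="tag:title">title</rec:Val>
      <!-- defaults repeated explicitly, plus noscript -->
      <rec:Val key="removeContentTags">applet,frame,object,script,style,noscript</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```

Repeating the default tags is necessary here because, as the table notes, setting removeContentTags replaces the default list instead of extending it.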
org.eclipse.smila.processing.pipelets.CopyPipelet
Description
This pipelet can be used to copy or move attribute values to other attributes or to copy or move a string value between attributes and/or attachments. It supports two execution modes:
- COPY: copy the value from the input attribute/attachment to the output attribute/attachment
- MOVE: same as COPY, but after that delete the value from the input attribute/attachment
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
inputType | String : ATTACHMENT, ATTRIBUTE | runtime | selects if the input is found in an attachment or attribute of the record |
outputType | String : ATTACHMENT, ATTRIBUTE | runtime | selects if output should be stored in an attachment or attribute of the record |
inputName | String | runtime | name of input attachment or input attribute |
outputName | String | runtime | name of output attachment or output attribute |
mode | String : COPY, MOVE | runtime | execution mode. Copy the value or move (copy and delete) the value. Default is COPY. |
Example
This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':
    <!-- copy txt from attachment to attribute -->
    <extensionActivity>
      <proc:invokePipelet name="invokeCopyContent">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="inputType">ATTACHMENT</rec:Val>
          <rec:Val key="outputType">ATTRIBUTE</rec:Val>
          <rec:Val key="inputName">Content</rec:Val>
          <rec:Val key="outputName">TextContent</rec:Val>
          <rec:Val key="mode">COPY</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
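For comparison, a MOVE between two attributes might be configured as in the following sketch, which effectively renames an attribute. The activity name and the attribute names RawTitle and Title are illustrative assumptions:

```xml
<!-- move the value from one attribute to another (hypothetical names) -->
<extensionActivity>
  <proc:invokePipelet name="invokeMoveTitle">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputType">ATTRIBUTE</rec:Val>
      <rec:Val key="outputType">ATTRIBUTE</rec:Val>
      <rec:Val key="inputName">RawTitle</rec:Val>
      <rec:Val key="outputName">Title</rec:Val>
      <!-- MOVE = copy, then delete the value from the input attribute -->
      <rec:Val key="mode">MOVE</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```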
org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet
Description
Extracts literal values from an attribute that has a nested map. The attributes in the nested map can have nested maps themselves. To address an attribute in the nested structure, a path needs to be specified. The pipelet supports different execution modes:
- FIRST: selects only the first literal of the specified attribute
- LAST: selects only the last literal of the specified attribute
- ALL_AS_LIST: selects all literal values of the specified attribute and returns a list
- ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)
This pipelet works only on attributes, not on attachments!
Note: If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
inputPath | String | runtime | the path to the input attribute with Literals |
outputPath | String | runtime | the name of the attribute to store the extracted value(s) as Literals in (not a path, only a top-level attribute, currently) |
mode | String : FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE | runtime | execution mode. See above for details. |
separator | String | runtime | the separation string used for mode ALL_AS_ONE. Default is a blank |
Example
This configuration can be applied to records provided by the FeedAgent. It shows how to access the subattribute 'Value' of attribute 'Contents', concatenating all values to one:
    <!-- extract content -->
    <extensionActivity>
      <proc:invokePipelet name="extract content">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="inputPath">Contents/Value</rec:Val>
          <rec:Val key="outputPath">Content</rec:Val>
          <rec:Val key="mode">ALL_AS_ONE</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
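The separator property only applies to mode ALL_AS_ONE. A sketch that joins the extracted values with a semicolon instead of the default blank could look like this; the activity name, the output attribute ContentParts, and the choice of separator are illustrative assumptions:

```xml
<extensionActivity>
  <proc:invokePipelet name="extract content with separator">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputPath">Contents/Value</rec:Val>
      <!-- top-level attribute name, not a path (hypothetical name) -->
      <rec:Val key="outputPath">ContentParts</rec:Val>
      <rec:Val key="mode">ALL_AS_ONE</rec:Val>
      <!-- used between the concatenated values instead of a blank -->
      <rec:Val key="separator">;</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```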
org.eclipse.smila.processing.pipelets.ReplacePipelet
Description
Searches for one or more patterns in the literal value of an attribute and replaces the occurrences found with the configured replacements.
You can choose from different matching types:
- entity: Every pattern is matched against the whole attribute value (with respect to the ignoreCase property) and the first matching pattern defines the new value of the attribute. If no pattern matches, the result is the current value of the attribute.
- substring: All patterns that are part of the attribute value are replaced.
- regexp: All patterns are interpreted as regular expressions, see java.util.regex.Matcher#replaceAll(String).
This pipelet works only on attributes, not on attachments!
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
inputAttribute | String | runtime | the name of the attribute that contains the literal to search in |
outputAttribute | String | runtime | the name of the attribute to store the result value as string, defaults to the input attribute |
type | String : entity, substring, regexp | init | Identifies the type of the pattern, see above for details. Defaults to substring. |
ignoreCase | Boolean | init | indicates that the case is ignored when matching patterns, defaults to false. |
mapping | Map | init | A mapping of multiple patterns and replacements. Each key is a pattern and its value the replacement. |
pattern | String | init | the pattern to apply to the literal value (see above for a description of possible types), required if no mapping is given |
replacement | String | init | the substitution string used to replace all occurrences of the pattern, defaults to the empty string |
Examples
This configuration can be used to map language ids to their label:
    <extensionActivity>
      <proc:invokePipelet name="set language label">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="inputAttribute">Language</rec:Val>
          <rec:Val key="outputAttribute">LanguageLabel</rec:Val>
          <rec:Val key="type">entity</rec:Val>
          <rec:Val key="ignoreCase" type="boolean">true</rec:Val>
          <rec:Map key="mapping">
            <rec:Val key="de">German</rec:Val>
            <rec:Val key="en">English</rec:Val>
            <rec:Val key="es">Spanish</rec:Val>
            <rec:Val key="fr">French</rec:Val>
            ...
          </rec:Map>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
This configuration can be used to cut the time information from a timestamp:
    <extensionActivity>
      <proc:invokePipelet name="cut time">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="inputAttribute">ModificationTime</rec:Val>
          <rec:Val key="outputAttribute">ModificationDate</rec:Val>
          <rec:Val key="type">regexp</rec:Val>
          <rec:Val key="pattern">[T ].*</rec:Val>
          <rec:Val key="replacement"></rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
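The two examples above cover the entity and regexp types. For the substring type, a mapping that transliterates German umlauts could be sketched as follows; the activity name and the attribute names Title and TitleAscii are illustrative assumptions:

```xml
<extensionActivity>
  <proc:invokePipelet name="transliterate umlauts">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputAttribute">Title</rec:Val>
      <rec:Val key="outputAttribute">TitleAscii</rec:Val>
      <!-- substring is the default type; stated here for clarity -->
      <rec:Val key="type">substring</rec:Val>
      <rec:Map key="mapping">
        <!-- each key is a pattern, each value its replacement -->
        <rec:Val key="ä">ae</rec:Val>
        <rec:Val key="ö">oe</rec:Val>
        <rec:Val key="ü">ue</rec:Val>
        <rec:Val key="ß">ss</rec:Val>
      </rec:Map>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```

Unlike the entity type, every occurrence of a pattern inside the attribute value is replaced, so the rest of the title is preserved.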
org.eclipse.smila.processing.pipelets.ScriptPipelet
Description
Executes a script for each record.
Scripts are executed via the Java Scripting API (JSR 223), so any compatible scripting engine can be used. JavaScript is available out of the box and is the default script language.
The context of the script will contain three variables:
- blackboard: a reference to the blackboard
- id: the ID of the current record
- record: the metadata of the current record
Please be aware that the intention of this pipelet is to write pipelines fast, but not to write fast pipelines - the script is parsed for every record. Don't use it for production environments where performance matters, but use it to develop an algorithm that you can put into your own pipelet.
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
type | String | init | the mime type of the scripting language, defaults to "text/javascript" |
scriptFile | String | runtime | the path of the file that contains the script - modifications of this file are observed on every execution of the pipelet |
script | String | init | The "inline" script, required unless scriptFile is specified (ignored in that case) |
resultAttribute | String | runtime | The name of an attribute that will receive the result of the script (usually the result of the last expression) |
Examples
This configuration can be used to concatenate the values of two attributes and save the result into a third one:
    <extensionActivity>
      <proc:invokePipelet name="create full name">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="script">record.getStringValue("firstName") + " " + record.getStringValue("lastName")</rec:Val>
          <rec:Val key="resultAttribute">fullName</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
This configuration can be used to execute a JavaScript file located at $SMILA_PATH$/configuration/example.js:
    <extensionActivity>
      <proc:invokePipelet name="execute script">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="scriptFile">configuration/example.js</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
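The type property can also be set explicitly. The following sketch states the default MIME type and stores the script result in an attribute; the activity name and the attribute names title and titleUpper are illustrative assumptions, and the script assumes that title is present in every record:

```xml
<extensionActivity>
  <proc:invokePipelet name="uppercase title">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <!-- same as the default; shown only to illustrate the property -->
      <rec:Val key="type">text/javascript</rec:Val>
      <!-- the value of the last expression becomes the result -->
      <rec:Val key="script">record.getStringValue("title").toUpperCase()</rec:Val>
      <rec:Val key="resultAttribute">titleUpper</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```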
org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet
Description
This pipelet is used to identify the MIME type of a document. It uses an org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.
Configuration
The pipelet is configured using the <configuration> section inside the <invokePipelet> activity of the corresponding BPEL file. It provides the following properties:
Property | Type | Read Type | Usage | Description |
---|---|---|---|---|
FileExtensionAttribute | String | init | Optional | Name of the attribute containing the file extension |
ContentAttachment | String | init | Optional | Name of the attachment containing the file content |
MetaDataAttribute | String | init | Optional | Name of the attribute containing metadata information, e.g. a Web Crawler returns a response header containing applicable MIME type information |
MimeTypeAttribute | String | init | Required | Name of the attribute to store the identified MIME type to |
Note that at least one of the properties FileExtensionAttribute, ContentAttachment, and MetaDataAttribute must be specified!
Example
The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.
addpipeline.bpel
    <extensionActivity>
      <proc:invokePipelet name="detect MimeType">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="FileExtensionAttribute">Extension</rec:Val>
          <rec:Val key="MetaDataAttribute">MetaData</rec:Val>
          <rec:Val key="MimeTypeAttribute">MimeType</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
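The example above does not use content-based detection. A sketch that lets the service inspect both the file content and the file extension could look like this; the activity name and the attachment name Content are illustrative assumptions:

```xml
<extensionActivity>
  <proc:invokePipelet name="detect MimeType from content">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <!-- hypothetical attachment holding the file content -->
      <rec:Val key="ContentAttachment">Content</rec:Val>
      <rec:Val key="FileExtensionAttribute">Extension</rec:Val>
      <!-- required: where to store the identified MIME type -->
      <rec:Val key="MimeTypeAttribute">MimeType</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```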
org.eclipse.smila.processing.pipelets.FileReaderPipelet
Description
This pipelet can be used to read content from a file and add it as an attachment.
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
pathAttribute | String | runtime | The name of the attribute with the path of the file to read from |
contentAttachment | String | runtime | The name of the attachment to store the content |
Example
    <!-- read from file and add attachment -->
    <extensionActivity>
      <proc:invokePipelet name="invokeReadFile">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.FileReaderPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="pathAttribute">path</rec:Val>
          <rec:Val key="contentAttachment">content</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>