SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets.boilerpipe
Revision as of 05:47, 22 May 2012
This page describes the SMILA pipelets provided by bundle org.eclipse.smila.processing.pipelets.boilerpipe.
General
All pipelets in this bundle support the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in jobmanager workflows, records causing errors are dropped.
Read Type
- runtime: Parameters are read when processing records. The parameter value can be set per Record.
- init: Parameters are read once from the Pipelet configuration when initializing the Pipelet. The parameter value cannot be overridden in a Record.
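As a rough illustration of the runtime read type, the sketch below shows a record that overrides one pipelet parameter for itself only. It assumes SMILA's record XML syntax with the rec: prefix bound as in the surrounding pipeline, and it assumes that per-record parameter values are looked up in a `_parameters` map of the record. The attribute name `_parameters`, the record ID, and the value `plainText` are assumptions for illustration and should be checked against the SMILA pipelet documentation.

```xml
<!-- Hypothetical record: overrides the runtime parameter "outputName" for this record only. -->
<rec:Record>
  <rec:Val key="_recordid">example-1</rec:Val>
  <!-- Assumption: runtime parameters are read from the "_parameters" map before the pipelet configuration. -->
  <rec:Map key="_parameters">
    <rec:Val key="outputName">plainText</rec:Val>
  </rec:Map>
</rec:Record>
```

A parameter with read type init would not be affected by such a record-level value, because it is read only once from the pipelet configuration.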
org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet
Extracts plain text from an HTML input using the Boilerpipe library (http://code.google.com/p/boilerpipe/).
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
inputType | String : ATTACHMENT, ATTRIBUTE | runtime | Defines whether the HTML input is found in an attachment or in an attribute of the record |
outputType | String : ATTACHMENT, ATTRIBUTE | runtime | Defines whether the plain text should be stored in an attachment or in an attribute of the record |
inputName | String | runtime | Name of attachment or attribute that contains the HTML input |
outputName | String | runtime | Name of attachment or attribute for plain text output |
encodingAttribute | String | runtime | Optional name of the attribute with the encoding of the input attachment. |
defaultEncoding | String | runtime | Optional fallback encoding, used if the encoding cannot be determined otherwise. |
filter | Sequence of String | init | A list of Boilerpipe filters to use. Entries may be class names or references to static methods or static variables (defaults to de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE). |
Example
Extract text from the HTML input in attachment "html" into the attribute "text" using the encoding given in attribute "http.encoding".
<proc:invokePipelet name="extractText">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="inputType">ATTACHMENT</rec:Val>
    <rec:Val key="inputName">html</rec:Val>
    <rec:Val key="outputType">ATTRIBUTE</rec:Val>
    <rec:Val key="outputName">text</rec:Val>
    <rec:Val key="encodingAttribute">http.encoding</rec:Val>
  </proc:configuration>
</proc:invokePipelet>
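The optional properties from the configuration table can be combined in the same way. The following variation is a sketch only: it writes the extracted text to an attachment, sets a fallback encoding, and selects a different extractor through the filter property. It assumes that a "Sequence of String" value is written as a rec:Seq element with rec:Val children, and it uses Boilerpipe's DefaultExtractor as an example of a static variable reference. The activity name, the output name "plaintext", and the encoding value are hypothetical.

```xml
<proc:invokePipelet name="extractTextToAttachment">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="inputType">ATTACHMENT</rec:Val>
    <rec:Val key="inputName">html</rec:Val>
    <rec:Val key="outputType">ATTACHMENT</rec:Val>
    <rec:Val key="outputName">plaintext</rec:Val>
    <!-- Fallback encoding for the input attachment -->
    <rec:Val key="defaultEncoding">UTF-8</rec:Val>
    <!-- Assumption: a sequence is written as rec:Seq; the entry is a static variable reference -->
    <rec:Seq key="filter">
      <rec:Val>de.l3s.boilerpipe.extractors.DefaultExtractor.INSTANCE</rec:Val>
    </rec:Seq>
  </proc:configuration>
</proc:invokePipelet>
```

Because filter has read type init, the extractor chosen here applies to all records processed by this pipelet instance.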