Difference between revisions of "SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets.boilerpipe"

Revision as of 05:47, 22 May 2012

This page describes the SMILA pipelets provided by bundle org.eclipse.smila.processing.pipelets.boilerpipe.

General

All pipelets in this bundle support the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in jobmanager workflows, records causing errors are dropped.

Read Type

  • runtime: Parameters are read when processing records. The parameter value can be set per record.
  • init: Parameters are read once from the pipelet configuration when the pipelet is initialized. The parameter value cannot be overridden in a record.

org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet

Extracts text from an HTML input using the Boilerpipe library.

Configuration

Property | Type | Read Type | Description
inputType | String: ATTACHMENT, ATTRIBUTE | runtime | Defines whether the HTML input is found in an attachment or in an attribute of the record.
outputType | String: ATTACHMENT, ATTRIBUTE | runtime | Defines whether the plain text is stored in an attachment or in an attribute of the record.
inputName | String | runtime | Name of the attachment or attribute that contains the HTML input.
outputName | String | runtime | Name of the attachment or attribute for the plain text output.
encodingAttribute | String | runtime | Optional name of the attribute that holds the encoding of the input attachment.
defaultEncoding | String | runtime | Optional fallback encoding, used if the encoding cannot be determined otherwise.
filter | Sequence of String | init | A list of Boilerpipe filters to use. Entries may be class names or references to static methods or static variables (defaults to de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE).
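
The filter property is the only one in the table that takes a sequence rather than a single value. The fragment below is a minimal sketch of how such a sequence might be written inside the pipelet configuration. It assumes that the record XML notation uses a rec:Seq element for sequences, and it picks de.l3s.boilerpipe.extractors.DefaultExtractor.INSTANCE purely as an illustrative alternative to the default; neither detail is confirmed by this page.

```xml
<!-- Sketch only: assumes sequences are written as rec:Seq in the configuration. -->
<proc:configuration>
  <!-- Each entry is a class name or a static method/variable reference. -->
  <rec:Seq key="filter">
    <rec:Val>de.l3s.boilerpipe.extractors.DefaultExtractor.INSTANCE</rec:Val>
  </rec:Seq>
</proc:configuration>
```

Because filter has read type init, a value like this would be read once when the pipelet is initialized and could not be changed per record.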


Example

Extract text from the HTML input in attachment "html" into the attribute "text" using the encoding given in attribute "http.encoding".

<proc:invokePipelet name="extractText">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="inputType">ATTACHMENT</rec:Val>
    <rec:Val key="inputName">html</rec:Val>
    <rec:Val key="outputType">ATTRIBUTE</rec:Val>
    <rec:Val key="outputName">text</rec:Val>
    <rec:Val key="encodingAttribute">http.encoding</rec:Val>
  </proc:configuration>
</proc:invokePipelet>
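
As a variation on the example above, the following sketch stores the extracted text in an attachment and relies on a fixed fallback encoding instead of an encoding attribute. The invocation name "extractTextToAttachment", the attachment names "html" and "plaintext", and the encoding value UTF-8 are hypothetical choices made for illustration.

```xml
<!-- Sketch only: invocation name, attachment names and encoding are illustrative. -->
<proc:invokePipelet name="extractTextToAttachment">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <!-- Read the HTML from the attachment "html". -->
    <rec:Val key="inputType">ATTACHMENT</rec:Val>
    <rec:Val key="inputName">html</rec:Val>
    <!-- Write the plain text to the attachment "plaintext". -->
    <rec:Val key="outputType">ATTACHMENT</rec:Val>
    <rec:Val key="outputName">plaintext</rec:Val>
    <!-- No encodingAttribute is set, so this fallback is the only encoding hint given. -->
    <rec:Val key="defaultEncoding">UTF-8</rec:Val>
  </proc:configuration>
</proc:invokePipelet>
```

All properties used here have read type runtime, so according to the Read Type section they could also be set per record.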