Difference between revisions of "SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets.boilerpipe"

Revision as of 08:44, 11 September 2012

This page describes the SMILA pipelets provided by bundle org.eclipse.smila.processing.pipelets.boilerpipe.

General

All pipelets in this bundle support the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in jobmanager workflows, records causing errors are dropped.

Read Type

runtime: Parameters are read when processing records. The parameter value can be set per record.
init: Parameters are read once from the pipelet configuration when the pipelet is initialized. The parameter value cannot be overwritten in the record.
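
To illustrate the difference between the two read types, here is a minimal sketch of a record that overrides a runtime parameter. It assumes SMILA's usual convention of carrying per-record parameters in a map attribute named "_parameters", which this page does not state. The record ID and the value "plainText" are hypothetical.

```xml
<!-- Sketch only: assumes per-record parameters live in a "_parameters" map (not confirmed on this page). -->
<rec:Record>
  <!-- Hypothetical record ID, for illustration only. -->
  <rec:Val key="_recordid">example-doc-1</rec:Val>
  <rec:Map key="_parameters">
    <!-- Overrides the runtime property "outputName" for this record only. -->
    <rec:Val key="outputName">plainText</rec:Val>
  </rec:Map>
</rec:Record>
```

Under that assumption, a property with read type init, such as "filter", would be ignored if placed in this map, because it is read only once from the pipelet configuration.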

org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet

Extracts text from an HTML input using the Boilerpipe library. In contrast to the HtmlToTextPipelet, it offers several algorithms for extracting textual content but does not extract HTML metadata.

Configuration

Property	Type	Read Type	Description
inputType	String : ATTACHMENT, ATTRIBUTE	runtime	Defines whether the HTML input is found in an attachment or in an attribute of the record
outputType	String : ATTACHMENT, ATTRIBUTE	runtime	Defines whether the plain text should be stored in an attachment or in an attribute of the record
inputName	String	runtime	Name of attachment or attribute that contains the HTML input
outputName	String	runtime	Name of attachment or attribute for plain text output
encodingAttribute	String	runtime	Optional name of the attribute with the encoding of the input attachment.
defaultEncoding	String	runtime	Optional fallback encoding, used if everything else fails.
filter	Sequence of String	init	A list of Boilerpipe filters to apply. Entries may be class names or references to static methods or static variables. The default is `de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE`. Note that every BoilerpipeExtractor implements the BoilerpipeFilter interface and is itself a pipeline of BoilerpipeFilters, so you should not configure more than one BoilerpipeExtractor. Also note that some extractors and filters have no default constructor and therefore cannot be used by this pipelet. Others have no public constructor but do provide a public static instance member.

Example

Extract text from the HTML input in attachment "html" into the attribute "text" using the encoding given in attribute "http.encoding" and using the extractor ArticleSentencesExtractor:

<proc:invokePipelet name="extractText">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="inputType">ATTACHMENT</rec:Val>
    <rec:Val key="inputName">html</rec:Val>
    <rec:Val key="outputType">ATTRIBUTE</rec:Val>
    <rec:Val key="outputName">text</rec:Val>
    <rec:Val key="encodingAttribute">http.encoding</rec:Val>
    <rec:Val key="filter">de.l3s.boilerpipe.extractors.ArticleSentencesExtractor.INSTANCE</rec:Val>
  </proc:configuration>
</proc:invokePipelet>

The same example but using the simple filter MarkEverythingContentFilter:

<proc:invokePipelet name="extractText">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="inputType">ATTACHMENT</rec:Val>
    <rec:Val key="inputName">html</rec:Val>
    <rec:Val key="outputType">ATTRIBUTE</rec:Val>
    <rec:Val key="outputName">text</rec:Val>
    <rec:Val key="encodingAttribute">http.encoding</rec:Val>
    <rec:Val key="filter">de.l3s.boilerpipe.filters.simple.MarkEverythingContentFilter.INSTANCE</rec:Val>
  </proc:configuration>
</proc:invokePipelet>
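
Because "filter" is documented as a sequence of strings, several plain filters can presumably be chained. The following is a minimal sketch under that assumption. It writes the list with SMILA's record sequence element rec:Seq, a form this page does not show, and it also sets "defaultEncoding" as a fallback. The two filters are static instances from the Boilerpipe library and were picked only for illustration. Neither is a BoilerpipeExtractor, in line with the note in the configuration table.

```xml
<proc:invokePipelet name="extractText">
  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
  <proc:variables input="request" />
  <proc:configuration>
    <rec:Val key="inputType">ATTACHMENT</rec:Val>
    <rec:Val key="inputName">html</rec:Val>
    <rec:Val key="outputType">ATTRIBUTE</rec:Val>
    <rec:Val key="outputName">text</rec:Val>
    <rec:Val key="encodingAttribute">http.encoding</rec:Val>
    <!-- Fallback encoding; UTF-8 is an illustrative choice. -->
    <rec:Val key="defaultEncoding">UTF-8</rec:Val>
    <!-- Assumption: a filter list is written as a rec:Seq of rec:Val entries. -->
    <rec:Seq key="filter">
      <!-- Classifies text blocks as content or boilerplate. -->
      <rec:Val>de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier.INSTANCE</rec:Val>
      <!-- Removes the blocks that were not marked as content. -->
      <rec:Val>de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter.INSTANCE</rec:Val>
    </rec:Seq>
  </proc:configuration>
</proc:invokePipelet>
```

Whether this particular combination gives good extraction results depends on the input. For most pages a single extractor such as the default ArticleExtractor is likely the simpler choice.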

@@ Line 11: / Line 11: @@
 == org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet ==
-Extracts text from an HTML input using the [http://code.google.com/p/boilerpipe/ Boilerpipe library].
+Extracts text from an HTML input using the [http://code.google.com/p/boilerpipe/ Boilerpipe library]. In contrast to the [[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.HtmlToTextPipelet | HtmlToTextPipelet]] it offers different algorithms for textual content extraction but does not extract HTML metadata.
 === Configuration ===
@@ Line 54: / Line 54: @@
 |Sequence of String
 |init
-|A list of boiler pipe filters to use, may contain class names, or static method or static variable references (defaults to de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE).
+|A list of boiler pipe filters to use. This may contain class names, static method or static variable references. Default is <tt>de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE</tt>. Please note that BoilerpipeExtractors implement the interface BoilerpipeFilter and are pipelines of BoilerpipeFilters. Therefore you should not use multiple BoilerpipeExtractors! Also please note that some Extractors and Filters do not have a default Constructor and therefore cannot be used by this Pipelet. Others may not have a public Constructor but a public static instance member.
 |}
@@ Line 60: / Line 60: @@
 === Example ===
-Extract text from the HTML input in attachment "html" into the attribute "text" using the encoding given in attribute "http.encoding".
+Extract text from the HTML input in attachment "html" into the attribute "text" using the encoding given in attribute "http.encoding" and using the extractor ArticleSentencesExtractor:
 <source lang="xml">
@@ Line 72: / Line 72: @@
      <rec:Val key="outputName">text</rec:Val>
      <rec:Val key="encodingAttribute">http.encoding</rec:Val>
+    <rec:Val key="filter">de.l3s.boilerpipe.extractors.ArticleSentencesExtractor.INSTANCE</rec:Val>
    </proc:configuration>
 </proc:invokePipelet>
 </source>
+The same example but using the simple filter MarkEverythingContentFilter:
+<source lang="xml">
+<proc:invokePipelet name="extractText">
+  <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
+  <proc:variables input="request" />
+  <proc:configuration>
+    <rec:Val key="inputType">ATTACHMENT</rec:Val>
+    <rec:Val key="inputName">html</rec:Val>
+    <rec:Val key="outputType">ATTRIBUTE</rec:Val>
+    <rec:Val key="outputName">text</rec:Val>
+    <rec:Val key="encodingAttribute">http.encoding</rec:Val>
+    <rec:Val key="filter">de.l3s.boilerpipe.filters.simple.MarkEverythingContentFilter.INSTANCE</rec:Val>
+  </proc:configuration>
+</proc:invokePipelet>
+</source>
 [[Category:SMILA]] [[Category:SMILA/Pipelet]]
