
Difference between revisions of "SMILA/Documentation/BPEL Workflow Processor"

(23 intermediate revisions by 4 users not shown)
Line 1: Line 1:
== Basic Configuration ==
+
This page describes how to configure the SMILA BPEL workflow processor and how to call SMILA pipelets from BPEL processes. We do not assume any BPEL knowledge here, i.e. this page should contain everything to enable you to create at least simple BPEL processes for being used in SMILA.
  
The BPEL WorkflowProcessor expects its configuration in subdirectory "org.eclipse.eilf.processing.bpel" of the configuration directory. See the test bundle "org.eclipse.eilf.processing.bpel.test" for an example. In this it expects a file named processor.properties that describes the main configuration. It can contain the following SMILA specific properties:
+
== Basic configuration ==
  
* pipeline.dir (default="pipelines"): The subdirectory of this directory that contains the BPEL process files (together with all needed XSD and WSDL files) and the ODE specific deploy.xml file. See below for details.
+
The BPEL WorkflowProcessor expects its configuration in <tt>configuration/org.eclipse.smila.processing.bpel</tt>. In this directory it expects a file named <tt>processor.properties</tt> that describes the main configuration. This file can contain the following SMILA specific properties:
* pipeline.timeout (default="300"): Maximum time in seconds allowed for processing a pipeline. If a pipeline invocation takes longer, it is aborted with an error. If you expect that longer evaluations are possible in your application (e.g. analysing very large documents) you may want to increase this value.
+
* record.filter (default = none): A record filter defining the attributes and annotations that should be contained in BPEL workflow objects. Of none is set, the workflow objects will contain only the record IDs to be processed. You should take care to add only those attributes and annotations to the filter that are actually used in any pipeline, for adding too many (and too huge) elements to the workflow object may decrease performance and uses more memory. As the WorkflowProcessor uses the Blackboard to filter objects, you must define the filters in "org.eclipse.eilf.blackboard/RecordFilters.xml".
+
  
As the BPEL WorkflowProcessor is based on the Apache ODE BPEL engine [http://ode.apache.org], you can also add all ODE specific configuration properties to this file, just use the prefix "ode." See ODE documentation for details. You have to add at least the configuration for a database connection which ODE needs for internal purposes (e.g. storing process definitions). For SMILA purposes usually a in-memory HSQLDB instance is completely sufficient, the HSQLDB library is incldued in bundle "org.apache.ode". Define it using the following properties:
+
* ''pipeline.dir'' (default="pipelines"): The name of a folder below <tt>configuration/org.eclipse.smila.processing.bpel</tt> which contains the BPEL process files (together with all needed XSD and WSDL files) and the ODE specific <tt>deploy.xml</tt> file. See below for details.
 +
* ''pipeline.timeout'' (default="300"): Maximum time in seconds allowed for processing a pipeline. If a pipeline invocation takes longer, it is aborted with an error. You may want to increase this value in case you expect longer processing times in your application (e.g. when analyzing very large documents).
 +
* ''record.filter'' (default = none): A record filter defining the attributes and annotations that should be contained in BPEL workflow objects. If none is set, the workflow objects will contain only the record IDs to be processed. Add only those attributes and annotations to the filter that are actually used in any pipeline, because adding too many (and too huge) elements to the workflow object may decrease performance and use more memory. As the WorkflowProcessor uses the Blackboard to filter objects, you must define the filters in <tt>org.eclipse.smila.blackboard/RecordFilters.xml</tt>.
 +
 
 +
As the BPEL WorkflowProcessor is based on the [http://ode.apache.org Apache ODE BPEL engine], you can also add all ODE specific configuration properties to this file; just use the prefix <tt>ode.</tt> See the ODE documentation for details. You have to add at least the configuration for a database connection, which ODE needs for internal purposes (e.g. storing process definitions). For SMILA purposes an in-memory [http://derby.apache.org Apache Derby] instance is usually sufficient; the required Derby library is included in SMILA. To use it, set the following properties:
  
 
<pre>
 
<pre>
 
ode.db.mode=internal
 
ode.db.mode=internal
ode.db.int.driver=org.hsqldb.jdbcDriver
+
ode.db.int.driver=org.apache.derby.jdbc.EmbeddedDriver
ode.db.int.jdbcurl=jdbc:hsqldb:mem:odedb
+
ode.db.int.jdbcurl=jdbc:derby:memory:odedb;create=true
 
ode.db.int.username=sa
 
ode.db.int.username=sa
 
ode.db.int.password=
 
ode.db.int.password=
 
</pre>
 
</pre>
  
If you want to use a "real" database you will have to make the JDBC driver available to bundle "org.apache.ode", and check the ODE documentation for how to prepare the database schema for ODE.
+
If you want to use a "real" database you will have to make the JDBC driver available to bundle "org.apache.ode", and check the ODE documentation on how to prepare the database schema for ODE.
 +
 
 +
In addition to the initial setup, you may update existing pipelines, provided they are not predefined (system) pipelines. You can do that with the [http://wiki.eclipse.org/SMILA/Documentation/Processing/JSON_REST_API REST API] or internally with the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/WorkflowProcessor.html WorkflowProcessor]. Such pipelines are stored in the [[SMILA/Documentation/ObjectStore/Bundle org.eclipse.smila.objectstore|ObjectStore  service]] in store <tt>bpel</tt>.
  
 
== Pipeline definition using BPEL ==
 
== Pipeline definition using BPEL ==
Line 25: Line 29:
 
<source lang="xml">
 
<source lang="xml">
 
<?xml version="1.0" encoding="utf-8" ?>
 
<?xml version="1.0" encoding="utf-8" ?>
<process name="$PIPELINENAME" targetNamespace="http://www.eclipse.org/eilf/processor"
+
<process name="$PIPELINENAME" targetNamespace="http://www.eclipse.org/smila/processor"  
    xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable" xmlns:xsd="http://www.w3.org/2001/XMLSchema"
+
  xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable"
    xmlns:proc="http://www.eclipse.org/eilf/processor" expressionLanguage="urn:oasis:names:tc:wsbpel:2.0:sublang:xpath1.0"
+
  xmlns:bpel="http://docs.oasis-open.org/wsbpel/2.0/process/executable"
    xmlns:rec="http://www.eclipse.org/eilf/record" xmlns:id="http://www.eclipse.org/eilf/id">
+
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"  
 +
  xmlns:proc="http://www.eclipse.org/smila/processor"  
 +
  xmlns:rec="http://www.eclipse.org/smila/record">
  
     <import location="processor.wsdl" namespace="http://www.eclipse.org/eilf/processor"
+
     <import location="processor.wsdl" namespace="http://www.eclipse.org/smila/processor"
 
         importType="http://schemas.xmlsoap.org/wsdl/" />
 
         importType="http://schemas.xmlsoap.org/wsdl/" />
  
 
     <partnerLinks>
 
     <partnerLinks>
         <partnerLink name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" myRole="service" />
+
         <partnerLink name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType"  
 +
            myRole="service" />
 
     </partnerLinks>
 
     </partnerLinks>
  
 
     <extensions>
 
     <extensions>
         <extension namespace="http://www.eclipse.org/eilf/processor" mustUnderstand="no" />
+
         <extension namespace="http://www.eclipse.org/smila/processor" mustUnderstand="no" />
 
     </extensions>
 
     </extensions>
  
Line 46: Line 53:
  
 
     <sequence>
 
     <sequence>
         <receive name="start" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process"
+
         <receive name="start" partnerLink="Pipeline" portType="proc:ProcessorPortType"  
            variable="request" createInstance="yes" />
+
            operation="process" variable="request" createInstance="yes" />
  
         <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process" variable="request" />
+
        <!-- pipelet invocations will be added here -->
        <exit />
+
 
 +
         <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType"  
 +
            operation="process" variable="request" />
 
     </sequence>
 
     </sequence>
 
</process>
 
</process>
 
</source>
 
</source>
  
To create a new pipeline copy this to a new file with suffix ".bpel" in the configuration directory "org.eclipse.eilf.processing.bpel/pipelines" and replace $PIPELINENAME with the desired name of your pipeline. Add the files from the "xml" directory in bundle org.eclipse.eilf.processing.bpel (id.xsd, record.xsd and processor.wsdl). Then create a file "deploy.xml" in the same directory like this and replace $PIPELINENAME by the name of your pipeline:
+
To create a new pipeline:
 +
# Copy the above snippet to a new file with the suffix <tt>.bpel</tt> and save it to the folder <tt>configuration/org.eclipse.smila.processing.bpel/''$pipeline.dir''</tt>.
 +
# Then replace ''$PIPELINENAME'' by the desired name of your pipeline.
 +
## Please note that the pipeline name must only contain characters from the following range: "a-zA-Z._-". If the pipeline name does not conform to these naming restrictions, the pipeline will not be accessible and SMILA will print a warning in the log file that the pipeline name is invalid.
 +
# Next, copy the files <tt>record.xsd</tt> and <tt>processor.wsdl</tt> from the <tt>xml</tt> directory in bundle <tt>org.eclipse.smila.processing.bpel</tt> to the same folder next to your .bpel file (if not already there).
 +
# Then, still in the same folder, create or edit a file named <tt>deploy.xml</tt> containing the following content but replace ''$PIPELINENAME'' by the name of the new pipeline:
  
 
<source lang="xml">
 
<source lang="xml">
<deploy xmlns="http://www.apache.org/ode/schemas/dd/2007/03" xmlns:proc="http://www.eclipse.org/eilf/processor">
+
<deploy xmlns="http://www.apache.org/ode/schemas/dd/2007/03" xmlns:proc="http://www.eclipse.org/smila/processor">
 +
 
 +
    <!-- other pipelines -->
 +
 
 
     <process name="proc:$PIPELINENAME">
 
     <process name="proc:$PIPELINENAME">
 
         <in-memory>true</in-memory>
 
         <in-memory>true</in-memory>
Line 68: Line 85:
 
</source>
 
</source>
  
You can now add service or pipelet invocations to your pipeline BPEL. To add another pipelet you have to add only another BPEL file and copy the <process> element in deploy.xml for the new pipeline.
+
You can now add pipelet invocations to your pipeline BPEL. To add another pipeline, you only have to add another BPEL file and copy the <tt><process></tt> element in <tt>deploy.xml</tt> for the new pipeline.
  
 
=== Pipelet invocations ===
 
=== Pipelet invocations ===
  
Pipelets (aka "Simple Pipelets") are classes that implement interface "org.eclipse.eilf.processing.SimplePipelet" (in bundle "org.eclipse.eilf.processing") and are listed in the "EILF-Pipelets" manifest header of the bundles that contain them. They are configured by the WorkflowProcessor on pipeline initialization. One instance is created for each time they occur in any pipeline, instances are not shared between multiple pipelines. Examples that come with the base SMILA distribution are
+
Pipelets are classes that implement the interface <tt>org.eclipse.smila.processing.Pipelet</tt> (in bundle <tt>org.eclipse.smila.processing</tt>) and have a corresponding definition file (the name must end in ".json") in the ''SMILA-INF'' directory of the bundle that provides them. They are configured by the WorkflowProcessor on pipeline initialization. One instance is created for each occurrence in any pipeline; instances are not shared between multiple pipelines.
* in bundle org.eclipse.eilf.processing.pipelets:
+
** org.eclipse.eilf.processing.pipelets.SetAnnotationPipelet: Sets a configured annotation for each record in the input variable.
+
** org.eclipse.eilf.processing.pipelets.CommitRecordsPipelet: Commits each record in the input variable on the blackboard to the storages.
+
** org.eclipse.eilf.processing.pipelets.HtmlToTextPipelet: Extract plain text and metadata from an HTML document in an attribute or attachment of each record and writes it to configurable attributes or attachments.
+
  
* in bundle org.eclipse.eilf.processing.pipelets.aperture:
+
The json file must at least contain the class name of the class that implements the Pipelet interface. E.g.
** org.eclipse.eilf.processing.pipelets.aperture.AperturePipelet: Uses Aperture to convert many kinds of documents to plain text.
+
<source lang="javascript">
 
+
{
* in bundle org.eclipse.eilf.processing.pipelets.xmlprocessing:
+
  "class" : "org.eclipse.smila.processing.pipelets.AddValuesPipelet"
 +
}
 +
</source>
 +
You can add more information to this file, e.g. a description of the configuration parameters the pipelet expects and supports. Examples that come with the base SMILA distribution are
 +
* in bundle [[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets|<tt>org.eclipse.smila.processing.pipelets</tt>]]:
 +
** <tt>org.eclipse.smila.processing.pipelets.AddValuesPipelet</tt>: Adds some attribute values to each record.
 +
** <tt>org.eclipse.smila.processing.pipelets.HtmlToTextPipelet</tt>: Extracts plain text and metadata from an HTML document in an attribute or attachment of each record and writes it to configurable attributes or attachments.
 +
* in bundle [[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.xmlprocessing|<tt>org.eclipse.smila.processing.pipelets.xmlprocessing</tt>]]:
 
** A collection of pipelets for XML processing (XSLT, XPath selection, ...) of documents.
 
** A collection of pipelets for XML processing (XSLT, XPath selection, ...) of documents.
 +
* in bundle [[SMILA/Documentation/Solr|<tt>org.eclipse.smila.solr</tt>]]:
 +
** Pipelets for adding records to a Solr index and searching it.
  
To use such a pipelet in your pipeline use the SMILA specific BPEL extension activity "invokePipelet" somewhere between <receive> and <reply> in your BPEL process:
+
To use such a pipelet in your pipeline, use the SMILA specific BPEL extension activity <tt><invokePipelet></tt> somewhere between <tt><receive></tt> and <tt><reply></tt> in your BPEL process:
  
 
<source lang="xml">
 
<source lang="xml">
<extensionActivity name="invokeSomePipelet">
+
<extensionActivity>
   <proc:invokePipelet>
+
   <proc:invokePipelet name="invokeSomePipelet">
     <proc:pipelet class="org.eclipse.eilf.pipelet.SomePipelet" />
+
     <proc:pipelet class="org.eclipse.smila.pipelet.SomePipelet" />
 
     <proc:variables input="request" output="request" />
 
     <proc:variables input="request" output="request" />
     <proc:PipeletConfiguration>
+
     <proc:configuration>
    <proc:Property name="configuration-property-name">
+
      <rec:Val key="single-parameter">value</rec:Val>
       <proc:Value>configuration-property-value"/>
+
      <rec:Seq key="multi-parameter">
    </proc:Property>
+
        <rec:Val>value1</rec:Val>
    <!-- more configuration properties -->
+
        <rec:Val>value2</rec:Val>
     </proc:PipeletConfiguration>
+
      </rec:Seq>
 +
      <rec:Map key="complex-parameter">
 +
        <rec:Val key="sub-parameter1">sub-value1</rec:Val>
 +
        <rec:Val key="sub-parameter2">sub-value2</rec:Val>
 +
      </rec:Map>
 +
      <!-- more configuration parameters -->
 +
     </proc:configuration>
 
   </proc:invokePipelet>
 
   </proc:invokePipelet>
 
</extensionActivity>
 
</extensionActivity>
 
</source>
 
</source>
  
Replace the class name with the class name of the pipelet to use and add configuration properties as needed - this should be documented by the pipelet provider. If the output variable is the same as the input variable (which is usually sufficient), you can omit the "output" attribute.
+
Replace the class name with the class name of the pipelet to use and add configuration parameters as needed - this should be documented by the pipelet provider. The configuration is a generic AnyMap object like the one used as record metadata, see [[SMILA/Documentation/Data Model and Serialization Formats]] for details. If the output variable is the same as the input variable (which is usually sufficient), you can omit the ''output'' attribute.
  
=== Service invocations ===
+
In versions later than 0.9, the invokePipelet activity accepts an additional "index" variable. This supports advanced pipelines that can be invoked with multiple records at once (usually used in the PipelineProcessingWorker to reduce BPEL overhead) and still use conditions to invoke pipelets based on attribute values. You can then run forEach loops over the record list and evaluate conditions on each record. For example, the following loop invokes the HtmlToTextPipelet only for records that have the mime type "text/html":
 
+
Services (aka "Processing Services") are classes that implement interface "org.eclipse.eilf.processing.ProcessingService" (in bundle "org.eclipse.eilf.processing") and registered as OSGi services for this interface with a service property named "eilf.processing.service.name" descrining the name of this service. They are initialised and configured independently of the workflow engine and invocations from different pipelines will use the same instance. An example in the base SMILA distribution is the "LuceneService" in bundle "org.eclipse.eilf.lucene" for adding records to a Lucene index.
+
 
+
To use such a service in your pipeline use the SMILA specific BPEL extension activity "invokeService" somewhere between <receive> and <reply> in your BPEL process:
+
  
 
<source lang="xml">
 
<source lang="xml">
<extensionActivity name="invokeLuceneService">
+
<forEach counterName="index" parallel="yes" name="iterateRecords">
   <proc:invokeService>
+
   <startCounterValue>1</startCounterValue>
     <proc:service name="LuceneService" />
+
  <finalCounterValue>count($request.records/rec:Record)</finalCounterValue>
    <proc:variables input="request" />
+
  <scope>
  </proc:invokeService>
+
     <if name="is HTML document">
</extensionActivity>
+
      <condition>$request.records/rec:Record[position()=$index]/rec:Val[@key="MimeType"]="text/html"</condition>
 +
      <extensionActivity>
 +
        <proc:invokePipelet name="extract text from HTML">
 +
          <proc:pipelet class="org.eclipse.smila.processing.bpel.pipelets.HtmlToTextPipelet" />
 +
          <proc:variables input="request" index="index" />
 +
          <proc:configuration>
 +
            ...
 +
          </proc:configuration>
 +
        </proc:invokePipelet>
 +
      </extensionActivity>
 +
    </if>
 +
  </scope>
 +
</forEach>
 
</source>
 
</source>
  
The output attributes of <proc:variables> is supported here in the same way as in <invokePipelet> (see above), but is omitted in this example because the result of the service invocation should be written to the input variable again. There is no further configuration content in this service invocation. Consult the service documentation about how to configure it.
+
The invokePipelet activity assumes that the index variable takes the values 1 to (number of records). Note that the pipelet is invoked in parallel for each record in this example.
  
=== Setting annotations in Service and Pipelet invocations ===
+
If an output variable is specified, it is assigned a list containing only the one processed record. This can be useful if more than one pipelet is to be invoked. If no output variable is specified, the input variable is not modified in this case.
 
+
Both kinds of invocations support setting annotations on the records to process inline in the extension activity XML. This is more important for invocation of services because the operation of some services is controlled by service-defined root annotations of the processed records, but nevertheless it works for invocation of pipelets  in the same way. To use this just add a <proc:setAnnotations> element to <proc:invokeService> or <proc:invokePipelet> that contains the annotations in the standard annotation XML format defined in [Data Model and XML representation]. E.g., the LuceneIndexService (Bundle org.eclipse.eilf.lucene) uses a annotation to define the index to be changed and the type of change (ADD, DELETE). In the example pipeline "addpipeline.bpel" of EILF.application this looks like this:
+
  
 +
'''Hint:''' If it is necessary that the input variable reflects the values of attributes changed by the pipelet (e.g. because it is needed in a following condition and you cannot use the single output record of the pipelet itself for the test) you must copy the output record back to the input variable using a bit of BPEL code:
 
<source lang="xml">
 
<source lang="xml">
<extensionActivity name="invokeLuceneService">
+
  ...
  <proc:invokeService>
+
  <extensionActivity>
     <proc:service name="LuceneIndexService" />
+
    <proc:invokePipelet name="some pipelet">
     <proc:variables input="request" output="request" />
+
     <proc:pipelet class="org.eclipse.smila.example.Pipelet" />
    <proc:setAnnotations>
+
     <proc:variables input="request" index="index" output="oneRecord"/>
      <rec:An n="org.eclipse.eilf.lucene.LuceneIndexService">
+
    </proc:invokePipelet>
  </extensionActivity>
        <rec:V n="indexName">test_index</rec:V>
+
<assign name="copy result into original variable for next tests">
        <rec:V n="executionMode">ADD</rec:V>
+
    <copy>
      </rec:An>
+
      <from>$oneRecord.records/rec:Record[1]</from>
     </proc:setAnnotations>
+
      <to>$request.records/rec:Record[position()=$index]</to>
   </proc:invokeService>
+
     </copy>
</extensionActivity>
+
   </assign>
 +
  ...
 
</source>
 
</source>
 
+
See the [https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/trunk/core/SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/AddPipeline.bpel AddPipeline.bpel] in the standard SMILA configuration for an example with complete context.
This sets an annotation named "org.eclipse.eilf.lucene.LuceneIndexService" with two named values "indexName=test_index" and "executionMode=ADD" to all records in the "request" variable. All existing annotations of the same name are removed before the new annotations are set.
+
 
+
  
 
=== Pipeline invocations ===
 
=== Pipeline invocations ===
  
You can also invoke one pipeline from another to group service or pipelet invocations that belong together. To do this you have to use the standard BPEL invoke activity to invoke a BPEL partner link for the sub pipeline:
+
You can also invoke one pipeline from another to group pipelet invocations that belong together. To do this you have to use the standard BPEL invoke activity to invoke a BPEL partner link for the sub pipeline:
  
* define a partner link in the <partnerLinks> section of the BPEL file, replace $SUBPIPELINENAME with the name of pipeline to invoke as defined in its <process> element:
+
* define a partner link in the <tt><partnerLinks></tt> section of the BPEL file, replace ''$SUBPIPELINENAME'' with the name of pipeline to invoke as defined in its <tt><process></tt> element:
  
 
<source lang="xml">
 
<source lang="xml">
Line 155: Line 189:
 
</source>
 
</source>
  
* add a invoke activity between <receive> and <reply>, replace $SUBPIPELINENAME with the pipeline name and adapt the inputVariable and outputVariable attributes if necessary (omitting "outputVariable" is not allowed here!):
+
* add a BPEL <tt><invoke></tt> activity between <tt><receive></tt> and <tt><reply></tt>, replace ''$SUBPIPELINENAME'' with the pipeline name and adapt the ''inputVariable'' and ''outputVariable'' attributes if necessary (omitting ''outputVariable'' is not allowed here!):
  
 
<source lang="xml">
 
<source lang="xml">
 
<invoke name="invokeSubPipeline" operation="process" portType="proc:ProcessorPortType"  
 
<invoke name="invokeSubPipeline" operation="process" portType="proc:ProcessorPortType"  
   partnerLink="$SUBPIPELINENAME" inputVariable="request" outputVariable="request"
+
   partnerLink="$SUBPIPELINENAME" inputVariable="request" outputVariable="request" />
/>
+
 
</source>
 
</source>
  
* add a declaration for the partner link in the deploy.xml entry of your pipeline:
+
* add a declaration for the partner link in the <tt>deploy.xml</tt> entry of your pipeline:
  
 
<source lang="xml">
 
<source lang="xml">
<deploy xmlns="http://www.apache.org/ode/schemas/dd/2007/03" xmlns:proc="http://www.eclipse.org/eilf/processor">
+
<deploy xmlns="http://www.apache.org/ode/schemas/dd/2007/03" xmlns:proc="http://www.eclipse.org/smila/processor">
 
     <process name="proc:$PIPELINENAME">
 
     <process name="proc:$PIPELINENAME">
 
         <in-memory>true</in-memory>
 
         <in-memory>true</in-memory>
Line 179: Line 212:
 
</source>
 
</source>
  
=== Advanced Process definition ===
+
=== Advanced process definition ===
  
You can of course use all other BPEL elements, too, to create your pipelines like conditions, iterations, parallel flows, invocation of external Web Services, etc. However, to describe them is beyond the scope of this introduction and requires "real" knowledge about BPEL (and WSDL, for invoking Web Services).
+
You can of course use all other BPEL elements to create your pipelines as well, such as conditions, iterations, parallel flows, or invocations of external Web Services. Describing them is beyond the scope of this introduction, however, and requires "real" knowledge of BPEL and XPath (and WSDL, for invoking Web Services).
  
 
[[Category:SMILA]]
 
[[Category:SMILA]]

Revision as of 09:56, 26 January 2012

This page describes how to configure the SMILA BPEL workflow processor and how to call SMILA pipelets from BPEL processes. We do not assume any BPEL knowledge here, i.e. this page should contain everything you need to create at least simple BPEL processes for use in SMILA.

Basic configuration

The BPEL WorkflowProcessor expects its configuration in configuration/org.eclipse.smila.processing.bpel. In this directory it expects a file named processor.properties that describes the main configuration. This file can contain the following SMILA specific properties:

  • pipeline.dir (default="pipelines"): The name of a folder below configuration/org.eclipse.smila.processing.bpel which contains the BPEL process files (together with all needed XSD and WSDL files) and the ODE specific deploy.xml file. See below for details.
  • pipeline.timeout (default="300"): Maximum time in seconds allowed for processing a pipeline. If a pipeline invocation takes longer, it is aborted with an error. You may want to increase this value in case you expect longer processing times in your application (e.g. when analyzing very large documents).
  • record.filter (default = none): A record filter defining the attributes and annotations that should be contained in BPEL workflow objects. If none is set, the workflow objects will contain only the record IDs to be processed. Add only those attributes and annotations to the filter that are actually used in any pipeline, because adding too many (and too huge) elements to the workflow object may decrease performance and use more memory. As the WorkflowProcessor uses the Blackboard to filter objects, you must define the filters in org.eclipse.smila.blackboard/RecordFilters.xml.

As the BPEL WorkflowProcessor is based on the Apache ODE BPEL engine, you can also add all ODE specific configuration properties to this file; just use the prefix ode. See the ODE documentation for details. You have to add at least the configuration for a database connection, which ODE needs for internal purposes (e.g. storing process definitions). For SMILA purposes an in-memory Apache Derby instance is usually sufficient; the required Derby library is included in SMILA. To use it, set the following properties:

ode.db.mode=internal
ode.db.int.driver=org.apache.derby.jdbc.EmbeddedDriver
ode.db.int.jdbcurl=jdbc:derby:memory:odedb;create=true
ode.db.int.username=sa
ode.db.int.password=

If you want to use a "real" database you will have to make the JDBC driver available to bundle "org.apache.ode", and check the ODE documentation on how to prepare the database schema for ODE.

In addition to the initial setup, you may update existing pipelines, provided they are not predefined (system) pipelines. You can do that with the REST API or internally with the WorkflowProcessor. Such pipelines are stored in the ObjectStore service in store bpel.

Pipeline definition using BPEL

The minimal BPEL process for SMILA pipelines looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<process name="$PIPELINENAME" targetNamespace="http://www.eclipse.org/smila/processor" 
  xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable"
  xmlns:bpel="http://docs.oasis-open.org/wsbpel/2.0/process/executable"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
  xmlns:proc="http://www.eclipse.org/smila/processor" 
  xmlns:rec="http://www.eclipse.org/smila/record">
 
    <import location="processor.wsdl" namespace="http://www.eclipse.org/smila/processor"
        importType="http://schemas.xmlsoap.org/wsdl/" />
 
    <partnerLinks>
        <partnerLink name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" 
            myRole="service" />
    </partnerLinks>
 
    <extensions>
        <extension namespace="http://www.eclipse.org/smila/processor" mustUnderstand="no" />
    </extensions>
 
    <variables>
        <variable name="request" messageType="proc:ProcessorMessage" />
    </variables>
 
    <sequence>
        <receive name="start" partnerLink="Pipeline" portType="proc:ProcessorPortType" 
            operation="process" variable="request" createInstance="yes" />
 
        <!-- pipelet invocations will be added here -->
 
        <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType" 
            operation="process" variable="request" />
    </sequence>
</process>

To create a new pipeline:

  1. Copy the above snippet to a new file with the suffix .bpel and save it to the folder configuration/org.eclipse.smila.processing.bpel/$pipeline.dir.
  2. Then replace $PIPELINENAME by the desired name of your pipeline.
    1. Please note that the pipeline name must only contain characters from the following range: "a-zA-Z._-". If the pipeline name does not conform to these naming restrictions, the pipeline will not be accessible and SMILA will print a warning in the log file that the pipeline name is invalid.
  3. Next, copy the files record.xsd and processor.wsdl from the xml directory in bundle org.eclipse.smila.processing.bpel to the same folder next to your .bpel file (if not already there).
  4. Then, still in the same folder, create or edit a file named deploy.xml containing the following content but replace $PIPELINENAME by the name of the new pipeline:
<deploy xmlns="http://www.apache.org/ode/schemas/dd/2007/03" xmlns:proc="http://www.eclipse.org/smila/processor">
 
    <!-- other pipelines -->
 
    <process name="proc:$PIPELINENAME">
        <in-memory>true</in-memory>
        <provide partnerLink="Pipeline">
            <service name="proc:$PIPELINENAME" port="ProcessorPort" />
        </provide>
    </process>
</deploy>

You can now add pipelet invocations to your pipeline BPEL. To add another pipeline, you only have to add another BPEL file and copy the <process> element in deploy.xml for the new pipeline.
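
As a rough sketch of what a deploy.xml with two pipelines might look like, assuming two BPEL files whose processes are named FirstPipeline and SecondPipeline (both names are hypothetical and chosen only for illustration), the <process> element is simply repeated once per pipeline:

```xml
<deploy xmlns="http://www.apache.org/ode/schemas/dd/2007/03" xmlns:proc="http://www.eclipse.org/smila/processor">

    <!-- hypothetical first pipeline, defined in its own .bpel file -->
    <process name="proc:FirstPipeline">
        <in-memory>true</in-memory>
        <provide partnerLink="Pipeline">
            <service name="proc:FirstPipeline" port="ProcessorPort" />
        </provide>
    </process>

    <!-- hypothetical second pipeline: a copy of the element above with the name replaced -->
    <process name="proc:SecondPipeline">
        <in-memory>true</in-memory>
        <provide partnerLink="Pipeline">
            <service name="proc:SecondPipeline" port="ProcessorPort" />
        </provide>
    </process>
</deploy>
```

Each name after the proc: prefix has to match the name attribute of the <process> element in the corresponding BPEL file.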

Pipelet invocations

Pipelets are classes that implement the interface org.eclipse.smila.processing.Pipelet (in bundle org.eclipse.smila.processing) and have a corresponding definition file (the name must end in ".json") in the SMILA-INF directory of the bundle that provides them. They are configured by the WorkflowProcessor on pipeline initialization. One instance is created for each occurrence in any pipeline; instances are not shared between multiple pipelines.

The JSON file must contain at least the name of the class that implements the Pipelet interface. For example:

{
  "class" : "org.eclipse.smila.processing.pipelets.AddValuesPipelet" 
}

You can add more information to this file, e.g. a description of the configuration parameters the pipelet expects and supports. Examples that come with the base SMILA distribution are

  • in bundle org.eclipse.smila.processing.pipelets:
    • org.eclipse.smila.processing.pipelets.AddValuesPipelet: adds some attribute values to each record.
    • org.eclipse.smila.processing.pipelets.HtmlToTextPipelet: extracts plain text and metadata from an HTML document in an attribute or attachment of each record and writes it to configurable attributes or attachments.
  • in bundle org.eclipse.smila.processing.pipelets.xmlprocessing:
    • A collection of pipelets for XML processing (XSLT, XPath selection, ...) of documents.
  • in bundle org.eclipse.smila.solr:
    • Pipelets for adding records to a Solr index and searching it.

To use such a pipelet in your pipeline, use the SMILA specific BPEL extension activity <invokePipelet> somewhere between <receive> and <reply> in your BPEL process:

<extensionActivity>
  <proc:invokePipelet name="invokeSomePipelet">
    <proc:pipelet class="org.eclipse.smila.pipelet.SomePipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="single-parameter">value</rec:Val>
      <rec:Seq key="multi-parameter">
        <rec:Val>value1</rec:Val>
        <rec:Val>value2</rec:Val>
      </rec:Seq>
      <rec:Map key="complex-parameter">
        <rec:Val key="sub-parameter1">sub-value1</rec:Val>
        <rec:Val key="sub-parameter2">sub-value2</rec:Val>
      </rec:Map>
      <!-- more configuration parameters -->
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>

Replace the class name with the class name of the pipelet to use and add configuration parameters as needed; the pipelet provider should document them. The configuration is a generic AnyMap object like the one used as record metadata, see SMILA/Documentation/Data Model and Serialization Formats for details. If the output variable is the same as the input variable (which is usually sufficient), you can omit the output attribute.
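
For illustration, a minimal invocation that omits the output attribute could look like the following sketch. The pipelet class org.example.pipelets.ExamplePipelet and the parameter key example-parameter are hypothetical; substitute the class and parameters of the pipelet you actually use.

```xml
<extensionActivity>
  <proc:invokePipelet name="invokeExamplePipelet">
    <!-- hypothetical pipelet class, replace with a real one -->
    <proc:pipelet class="org.example.pipelets.ExamplePipelet" />
    <!-- no output attribute: the input variable "request" also serves as output -->
    <proc:variables input="request" />
    <proc:configuration>
      <!-- hypothetical parameter key and value -->
      <rec:Val key="example-parameter">example value</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```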

In versions later than 0.9, the invokePipelet activity accepts an additional "index" variable. This supports advanced pipelines that can be invoked with multiple records at once (usually used in the PipelineProcessingWorker to reduce BPEL overhead) and can still use conditions to invoke pipelets based on attribute values. You can then run forEach loops over the record list and evaluate conditions on each record. For example, the following loop invokes the HtmlToTextPipelet only for records that have the MIME type "text/html":

<forEach counterName="index" parallel="yes" name="iterateRecords">
  <startCounterValue>1</startCounterValue>
  <finalCounterValue>count($request.records/rec:Record)</finalCounterValue>
  <scope>
    <if name="is HTML document">
      <condition>$request.records/rec:Record[position()=$index]/rec:Val[@key="MimeType"]="text/html"</condition>
      <extensionActivity>
        <proc:invokePipelet name="extract text from HTML">
          <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
          <proc:variables input="request" index="index" />
          <proc:configuration>
            ...
          </proc:configuration>
        </proc:invokePipelet>
      </extensionActivity>
    </if>
  </scope>
</forEach>

The invokePipelet activity assumes that the index variable takes the values 1 to (number of records). Note that the pipelet is invoked in parallel for each record in this example.

If an output variable is specified, it is assigned a list containing only the one processed record. This can be useful if more than one pipelet is to be invoked for the record. If no output variable is specified, the input variable is not modified in this case.
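
As a rough sketch of this use case, the following loop passes the single-record result of one pipelet on to a second pipelet. The pipelet classes are hypothetical, the variable oneRecord is assumed to be declared in the process in the same way as request, and the loop is run sequentially (parallel="no") here so that the iterations do not share oneRecord concurrently.

```xml
<forEach counterName="index" parallel="no" name="iterateRecords">
  <startCounterValue>1</startCounterValue>
  <finalCounterValue>count($request.records/rec:Record)</finalCounterValue>
  <scope>
    <sequence>
      <extensionActivity>
        <proc:invokePipelet name="firstPipelet">
          <!-- hypothetical pipelet class -->
          <proc:pipelet class="org.example.pipelets.FirstPipelet" />
          <!-- process only record number $index, write it to oneRecord -->
          <proc:variables input="request" index="index" output="oneRecord" />
        </proc:invokePipelet>
      </extensionActivity>
      <extensionActivity>
        <proc:invokePipelet name="secondPipelet">
          <!-- hypothetical pipelet class -->
          <proc:pipelet class="org.example.pipelets.SecondPipelet" />
          <!-- oneRecord holds exactly one record, so no index is needed -->
          <proc:variables input="oneRecord" />
        </proc:invokePipelet>
      </extensionActivity>
    </sequence>
  </scope>
</forEach>
```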

Hint: If it is necessary that the input variable reflects the values of attributes changed by the pipelet (e.g. because it is needed in a following condition and you cannot use the single output record of the pipelet itself for the test) you must copy the output record back to the input variable using a bit of BPEL code:

  ...
  <extensionActivity>
    <proc:invokePipelet name="some pipelet">
      <proc:pipelet class="org.eclipse.smila.example.Pipelet" />
      <proc:variables input="request" index="index" output="oneRecord"/>
    </proc:invokePipelet>
  </extensionActivity>
  <assign name="copy result into original variable for next tests">
    <copy>
      <from>$oneRecord.records/rec:Record[1]</from>
      <to>$request.records/rec:Record[position()=$index]</to>
    </copy>
  </assign>
  ...

See the AddPipeline.bpel in the standard SMILA configuration for an example with complete context.
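
An additional variable such as oneRecord has to be declared in the BPEL process before it can be used. A possible declaration is sketched below; the message type proc:ProcessorMessage is an assumption here, so use whatever type the request variable has in your pipeline template.

```xml
<variables>
  <!-- message type assumed; copy it from the existing "request" declaration -->
  <variable name="request" messageType="proc:ProcessorMessage" />
  <variable name="oneRecord" messageType="proc:ProcessorMessage" />
</variables>
```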

Pipeline invocations

You can also invoke one pipeline from another to group pipelet invocations that belong together. To do this you have to use the standard BPEL invoke activity to invoke a BPEL partner link for the sub pipeline:

  • define a partner link in the <partnerLinks> section of the BPEL file, replacing $SUBPIPELINENAME with the name of the pipeline to invoke as defined in its <process> element:
<partnerLinks>
  <partnerLink name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" myRole="service" />
  <partnerLink name="$SUBPIPELINENAME" partnerLinkType="proc:ProcessorPartnerLinkType" partnerRole="service" />
</partnerLinks>
  • add a BPEL <invoke> activity between <receive> and <reply>, replacing $SUBPIPELINENAME with the pipeline name, and adapt the inputVariable and outputVariable attributes if necessary (omitting outputVariable is not allowed here!):
<invoke name="invokeSubPipeline" operation="process" portType="proc:ProcessorPortType" 
  partnerLink="$SUBPIPELINENAME" inputVariable="request" outputVariable="request" />
  • add a declaration for the partner link in the deploy.xml entry of your pipeline:
<deploy xmlns="http://www.apache.org/ode/schemas/dd/2007/03" xmlns:proc="http://www.eclipse.org/smila/processor">
    <process name="proc:$PIPELINENAME">
        <in-memory>true</in-memory>
        <provide partnerLink="Pipeline">
            <service name="proc:$PIPELINENAME" port="ProcessorPort" />
        </provide>
        <invoke partnerLink="$SUBPIPELINENAME">
            <service name="proc:$SUBPIPELINENAME" port="ProcessorPort" />
        </invoke>
    </process>
</deploy>
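
Put together, the <sequence> of the calling pipeline could look roughly like the following sketch. The sub pipeline name SubPipeline and the activity names start and end are hypothetical, and the <receive> and <reply> attributes are assumed to match those in the pipeline template above.

```xml
<sequence>
  <!-- receives the records passed to the calling pipeline -->
  <receive name="start" partnerLink="Pipeline" portType="proc:ProcessorPortType"
    operation="process" variable="request" createInstance="yes" />
  <!-- calls the sub pipeline; partnerLink must match the <partnerLink> name -->
  <invoke name="invokeSubPipeline" operation="process" portType="proc:ProcessorPortType"
    partnerLink="SubPipeline" inputVariable="request" outputVariable="request" />
  <!-- returns the result of the sub pipeline to the caller -->
  <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType"
    operation="process" variable="request" />
</sequence>
```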

Advanced process definition

You can of course also use all other BPEL elements to create your pipelines, such as conditions, iterations, parallel flows, and invocations of external Web Services. However, describing them is beyond the scope of this introduction and requires "real" knowledge of BPEL and XPath (and of WSDL, for invoking Web Services).