SMILA/Project Concepts/BPEL Pipelining Concept

Description

In this model the orchestration of pipelets (= "pipeline") is defined by BPEL processes. We distinguish two separate kinds of pipelets:

  • "Big Pipelets" are implemented as OSGi services, can be shared by multiple pipelines and their configuration are seperated from the BPEL prociess defition.
  • "Simple Pipelets" are managed by a component of the BPEL engine integration, instances are not shared by multiple pipelines and their configuration is part of the BPEL process definition.

Discussion

Technical proposal

In this model the orchestration of pipelets (= "pipeline") is defined by BPEL processes. The pipelets are implemented as OSGi services. This should make it easier later to support the execution of unsafe pipelets in separate VMs, because several technologies for transparent remote communication with OSGi services are available (Tuscany, ECF, Riena).

In the following we assume that the service lifecycle of all services is controlled by OSGi Declarative Services (DS). This simplifies starting and stopping services and binding them to other services. To support initialization at service activation, DS defines a special method that is called when the service is activated, in which the necessary initialization can be done (reading configurations, connecting to used resources, creating internal structures, etc.). DS also defines a method to be called when a service is deactivated, which can be used for cleanup. The two methods must have these signatures:

protected void activate(ComponentContext context);
protected void deactivate(ComponentContext context);
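
For illustration, a minimal pipelet skeleton using these callbacks could look like the following sketch. The class and helper names are hypothetical; the DS component description would register the class as a service and set its service properties:

import java.util.Properties;
import org.osgi.service.component.ComponentContext;

/**
 * Hypothetical pipelet skeleton. The DS component description registers this
 * class as a service and sets the "smila.processing.service.name" property
 * (see below).
 */
public class SamplePipelet {

  /** Configuration loaded on activation, treated as read-only afterwards. */
  private Properties _config;

  /** Called by DS when the service is activated. */
  protected void activate(final ComponentContext context) {
    // read configurations, connect to used resources, create internal structures
    _config = loadConfiguration();
  }

  /** Called by DS when the service is deactivated. */
  protected void deactivate(final ComponentContext context) {
    // clean up: release resources acquired in activate()
    _config = null;
  }

  private Properties loadConfiguration() {
    // placeholder: would read from the configuration repository (see below)
    return new Properties();
  }
}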

Each pipelet service must have a service property "smila.processing.service.name" that specifies the name of this pipelet. The name must be unique for each service in a single VM and is defined in the DS component description. The pipelet name is used in the BPEL definition to refer to the pipelets. If multiple instances of the same pipelet class are needed, they can be distinguished by different pipelet names. The pipelet execution method is currently:

Id[] process(Id[] recordIds) throws ProcessingException;

That is, the pipelet is called by the workflow with a list of record IDs. The content of these records is expected to be available via the Blackboard service, so all access to and manipulation of the records is done through the Blackboard service. The result is also a list of record IDs. Usually these will be the same as the input IDs; pipelets that split records can produce a different list. This means that all data needed by the pipelet for processing must be on the blackboard:

  • record attributes and attachments
  • record annotations
  • workflow and record notes

The two latter items may also be used to pass parameters to a pipelet. However, we will need BPEL Extension Activities to be able to set them in the BPEL definition (see the end of this chapter).

Pipelets as well as the BPEL integration get their configurations from a central "configuration repository". This can be a simple directory with a defined structure at first, or later a more complex service supporting centralized configuration management and updating (and notification of clients about configuration changes). Pipelet configurations are separated from the BPEL pipelines, because a pipelet's existence does not depend on the existence of a pipeline engine and must not depend on the implementation of the pipeline engine. This makes it easier to use pipelets independently of a particular pipelining implementation, e.g. if we want to replace the BPEL engine with a JBPM engine or our own workflow engine implementation. It also makes it easier to share pipelet instances between pipelines, which is crucial for pipelets that use lots of memory (e.g. semantic text mining) or need resources that can only be accessed exclusively by one client (e.g. writing to a Solr core). Finally, it enables OSGi to restart the BPEL integration service without having to restart the pipelets (e.g. for software updates).

The BPEL integration is started by DS, too. Pipelets are bound to the BPEL integration as DS service references. This way the BPEL service can always keep track of the currently available pipelet services. It would even be possible to track which pipelet is used in which pipeline and thus to know a priori which pipelines are currently completely executable.
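
To make this invocation contract concrete, the following sketch shows a process() implementation that touches record data only through the blackboard. The blackboard accessor names and the _blackboard field are assumptions for illustration:

public Id[] process(final Id[] recordIds) throws ProcessingException {
  for (final Id id : recordIds) {
    try {
      // all record access goes through the blackboard, never the record itself
      final String text = _blackboard.getAttributeValue(id, "Text"); // assumed accessor
      _blackboard.setAttributeValue(id, "ProcessedText", transform(text)); // assumed accessor
    } catch (final Exception e) {
      throw new ProcessingException("error processing record " + id, e);
    }
  }
  // this pipelet neither splits nor merges records, so the input IDs are returned unchanged
  return recordIds;
}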

Pipelet instantiation variants

Usually we have one instance of a pipelet class with a single configuration. The pipelet name is then like a key to the combination "pipelet instance name = pipelet class + configuration". However, there may be cases in which it would be useful to have a single pipelet class available with different configurations. There are two ways to support this:

  • Have a single pipelet instance with a configuration consisting of the different parts. Which part of the configuration is actually used in an invocation must then be passed using a record annotation. E.g.: there is a service "pipelet-name" = pipelet.A + config X & config Y, i.e. it has loaded both configurations.

A record in the invocation contains annotations:

    • "pipelet-name/select-configuration" = X -> use config X for processing this record
    • "pipelet-name/select-configuration" = Y -> use config Y for processing this record

Note that this makes it possible to process different records with different configurations in a single invocation. Of course, in such a scenario one configuration should be marked as the default configuration to be used if no annotation is set (see the sketch after this list).

  • Have multiple pipelet instances with different names, each having one of these configurations. E.g. there are two service instances of the same pipelet class with different pipelet names:
    • service 1: "pipelet-name-1" = pipelet.B + config X
    • service 2: "pipelet-name-2" = pipelet.B + config Y

Then the pipelet name used in the BPEL invoke activity determines which configuration is used.
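
As referenced above, the following sketch shows how a pipelet in the first variant could select the configuration per record; all annotation and blackboard accessor names are assumptions for illustration:

// Hypothetical per-record configuration selection via a record annotation.
private Configuration selectConfiguration(final Id id) {
  final Annotation annotation = _blackboard.getAnnotation(id, "pipelet-name"); // assumed accessor
  if (annotation != null) {
    final Annotation selector = annotation.getAnnotation("select-configuration"); // assumed accessor
    if (selector != null) {
      return _configurations.get(selector.getSingleValue()); // e.g. "X" or "Y"
    }
  }
  return _configurations.get(_defaultConfigName); // fall back to the default configuration
}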

Pipelet Implementation rules

Pipelets can potentially be invoked more than once at the same time. This means that a pipelet should either be written in a multithreading-safe way (stateless, read-only configuration and member variables) or handle the synchronization of critical sections itself (e.g. Solr core writing).
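
For a pipelet with such a critical section, a straightforward way to follow this rule is a synchronized block around the non-thread-safe part, as in this sketch (prepareDocuments and writeToCore are hypothetical helpers):

/** Lock object guarding the shared, non-thread-safe resource (e.g. a Solr core). */
private final Object _indexLock = new Object();

public Id[] process(final Id[] recordIds) throws ProcessingException {
  // stateless preparation may run concurrently in several threads
  final List<Object> documents = prepareDocuments(recordIds);
  // critical section: only one thread at a time may write to the shared core
  synchronized (_indexLock) {
    writeToCore(documents);
  }
  return recordIds;
}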

Configuration repository

This is just an ad-hoc proposal to give an idea of what it could look like. The details are open to discussion.

For the moment we assume that the configuration repository is a single directory with subdirectories in the file system. The configurations for components are located in subdirectories of the repository root. The name of each subdirectory is the bundle name of the component. What happens inside a bundle's configuration directory is up to the bundle implementation. E.g. for the ODE BPEL integration bundle it contains a property file for the general BPEL engine configuration and a subdirectory containing the pipeline definitions. E.g.:

configuration
  |
  |-- org.eclipse.smila.processing.bpel
  |    |-- processor.properties
  |    \-- pipelines
  |         |-- pipeline-1.bpel
  |         |-- pipeline-2.bpel
  |         |-- ...
  |         |-- processor.wsdl
  |         |-- record.xsd
  |         |-- id.xsd
  |         | (predefined schema files necessary for reference. 
  |         |  Needed also during editing in BPEL designer)
  |         \-- deploy.xml 
  |           (technical reasons, we can get rid of this)
  |-- org.eclipse.smila.pipelet.A 
  |     |   (example: one instance managing multiple configurations)
  |     |-- config-X.xml
  |     \-- config-Y.xml
  |-- org.eclipse.smila.pipelet.B 
  |     |   (example: one instance per configuration)
  |     |-- pipelet-name-1
  |     |    \-- config-X.xml
  |     \-- pipelet-name-2
  |          \-- config-Y.xml
  |-- ...

This is quite similar to [Configuration handling], but with an optional additional folder level for "configuration sections" to structure the configurations better; e.g. for pipelets that require multiple instances with multiple configurations there can be one section per pipelet instance. Of course, bundles are free in how they use the configuration repository structure for their purposes, but we should describe some usage patterns, because that would make reading the repository easier for administrators.

(To discuss: do we need folder structures of arbitrary depth?)

SMILA should provide helper classes to make locating and parsing simple configurations easy. We can define a common XML format for basic configurations that most pipelets can use (e.g. something of similar structure to the Record Annotation format?). Simple property files can be supported, too. Then we can create a simple ConfigurationAccess service with methods like:

  • to navigate the Configuration repository:
String[] getSectionNames(String bundleName); 
// e.g. getSectionNames("org.eclipse.smila.pipelet.B") 
// returns ["pipelet-name-1", "pipelet-name-2"]
String[] getConfigNames(String bundleName);
// e.g. getConfigNames("org.eclipse.smila.pipelet.A") 
// returns ["config-X.xml", "config-Y.xml"]
String[] getConfigNames(String bundleName, String sectionName);
// e.g. getConfigNames("org.eclipse.smila.pipelet.B", "pipelet-name-1") 
// returns ["config-X.xml", "config-Y.xml"]
  • to access and parse the configurations in common XML format:
Configuration getConfig(String bundleName, String configName);
Configuration getConfig(String bundleName, String sectionName, String configName);
  • to access and read property files:
Properties getProperties(String bundleName, String configName);
Properties getProperties(String bundleName, String sectionName, String configName);
  • to access other configurations:
InputStream getStream(String bundleName, String configName);
InputStream getStream(String bundleName, String sectionName, String configName);

This would make access to simple configurations quite easy for a pipelet developer.
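
For example, a pipelet whose configuration lives in a section of its own (see the tree above) could resolve it during activation roughly like this; the _configAccess field stands for a ConfigurationAccess service bound via DS:

protected void activate(final ComponentContext context) throws ConfigurationException {
  final String bundle = "org.eclipse.smila.pipelet.B";
  // "pipelet-name-1" is the configuration section for this pipelet instance
  for (final String name : _configAccess.getConfigNames(bundle, "pipelet-name-1")) {
    final Configuration config = _configAccess.getConfig(bundle, "pipelet-name-1", name);
    // ... initialize internal structures from the parsed configuration
  }
}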

BPEL Extension Activities

The BPEL specification allows extending BPEL by using Extension Activities. An Extension Activity is basically a Java class with a given interface that is registered to the BPEL engine under a qualified name. It can then be used in BPEL with a statement like this:

<bpel:extensionActivity>
  <myns:NameOfExtension>
    <!-- arbitrary XML elements -->
  </myns:NameOfExtension>
</bpel:extensionActivity>

The implementation class is then called with the complete XML element of its description and can access all workflow variables defined in the BPEL process. This means the activity can be configured in the BPEL definition. E.g. for setting record annotations we can provide an extension activity similar to this:

<extensionActivity>
    <ext:setAnnotations>
        <ext:target variable="request"/>
        <rec:An n="pipelet-name">
            <rec:An n="select-configuration">
                <rec:V>X</rec:V>
            </rec:An>
        </rec:An>
    </ext:setAnnotations>
</extensionActivity>

This would set an annotation named "pipelet-name/select-configuration" with value "X" on all records in the request variable. Of course, it would also be possible to create a more specialized activity instead that defines a simpler syntax for describing the annotations to be set.
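
For illustration, the skeleton of such an implementation class could look like the sketch below. The exact interface (method name, variable access type) is dictated by the BPEL engine; the names and namespace URIs used here are assumptions, not ODE's actual API:

import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch of an extension activity implementation. The method signature and
 * the VariableAccess type are assumptions; the actual contract is defined by
 * the BPEL engine (ODE trunk in our case).
 */
public class SetAnnotationsActivity {

  private static final String EXT_NS = "http://www.eclipse.org/smila/processor/extensions"; // assumed
  private static final String REC_NS = "http://www.eclipse.org/smila/record"; // assumed

  /** Called by the engine with the complete XML element describing the activity. */
  public void run(final VariableAccess variables, final Element description) {
    // locate <ext:target variable="..."/> to find the message variable to modify
    final Element target = (Element) description.getElementsByTagNameNS(EXT_NS, "target").item(0);
    final Element message = variables.getVariable(target.getAttribute("variable")); // assumed accessor
    // copy the <rec:An> annotation elements onto each record in that variable;
    // for brevity this copies all <rec:An> elements, a real implementation
    // would take only the top-level ones from the activity description
    final NodeList annotations = description.getElementsByTagNameNS(REC_NS, "An");
    final NodeList records = message.getElementsByTagNameNS(REC_NS, "Record");
    for (int r = 0; r < records.getLength(); r++) {
      for (int a = 0; a < annotations.getLength(); a++) {
        records.item(r).appendChild(
          message.getOwnerDocument().importNode(annotations.item(a), true));
      }
    }
  }
}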

Current problems are:

  • This is not supported by the current ODE release (1.1.1), nor by release 1.2 (currently about to be released), but only by the trunk version (which will probably become release 1.3). The latest estimate for a release date was "in about two months".
  • It is also not supported by the current release (M3) of the Eclipse BPEL designer. According to Eclipse Bugzilla it should be added in M4, which in turn should be released in the near future. However, we will probably have to provide our own extensions to the BPEL designer anyway in order to allow user-friendly editing of the extension activities we provide.

Integrating the Simple Pipeline Model into BPEL Pipelining

Using extension activities it would even be possible to integrate the complete simple pipeline model into the BPEL pipelining model:

<extensionActivity>
    <ext:invokePipelet>
        <ext:pipelet name="pipelet-name"/> 
        <ext:variables input="request" output="result"/>
        <ext:invocationConfig>
          <!-- parameters of invocation, e.g. error handling? -->
        </ext:invocationConfig>
        <ext:pipeletConfig>
          <!-- pipelet XML configuration, schema: to define -->
        </ext:pipeletConfig>
    </ext:invokePipelet>
</extensionActivity>

We could provide an Extension Activity implementation that manages the "simple pipelet" lifecycle and configuration and translates the calls from the BPEL engine into a convenient pipelet invocation. Because the lifecycle of extension activities themselves is undefined (it seems that in ODE a new instance is created for each call), the extension activity is only a simple class that forwards the BPEL call to a "SimplePipeletManager" that manages the pipelet instances (either all in one manager or one manager per pipeline), configurations, invocations and error handling. The execution interface of the simple pipelet would be the same as that of the pipelet service described above:

Id[] process(Id[] recordIds) throws ProcessingException;

Simple pipelets would use the blackboard service to access the actual record data. Additionally, simple pipelets need a method to set the configuration:

void configure(PipeletConfiguration config) throws ConfigurationException;

(Question: Do we also need a "shutdown" method to be called when the pipelet is destroyed? Or can we require simple pipelets to be so simple that they do not need such a method?)
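
Taken together, the contract for simple pipelets could be captured in an interface like this sketch (the interface name is an assumption):

/** Hypothetical interface for simple pipelets, combining the two methods above. */
public interface SimplePipelet {

  /** Set the configuration parsed from the BPEL process definition. */
  void configure(PipeletConfiguration config) throws ConfigurationException;

  /** Process the records identified by the given IDs via the blackboard. */
  Id[] process(Id[] recordIds) throws ProcessingException;
}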

The tasks of the SimplePipeletManager are:

  • At pipeline start, or when a pipelet becomes available:
    • Instantiate Pipelet
    • Parse the PipeletConfiguration from the BPEL pipeline and call the pipelet's configure method.
  • Pipelet invocation (very similar to the invocation of "big pipelets", see [Blackboard Service Concept]; a sketch follows this list):
    • Parse records from "input" variable and sync them to blackboard
    • Call simple pipelet's execute method with record IDs
    • Create workflow objects from result IDs and blackboard content and write them back to "output" variable
  • In case of a pipelet error: indicate the error in a correct way to the BPEL engine.
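
As referenced in the list above, the invocation step inside the SimplePipeletManager could look like this sketch; the blackboard sync methods and the variable representation are assumptions:

// Hypothetical invocation flow inside the SimplePipeletManager.
public Element invoke(final String pipeletName, final Element inputVariable)
    throws ProcessingException {
  final SimplePipelet pipelet = _pipelets.get(pipeletName);
  // 1. parse records from the "input" variable and sync them to the blackboard
  final Id[] inputIds = _blackboard.syncRecords(inputVariable); // assumed method
  // 2. call the simple pipelet's execute method with the record IDs
  final Id[] resultIds = pipelet.process(inputIds);
  // 3. create workflow objects from the result IDs and blackboard content,
  //    to be written back to the "output" variable by the extension activity
  return _blackboard.createResultElement(resultIds); // assumed method
}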

Issues to solve:

  • Simple pipelets should be instantiated and configured at deployment of the BPEL pipeline. This way missing pipelet implementations and configuration errors can be reported during system startup instead of during the first execution. For this it is probably necessary to introspect the pipeline definition and search for occurrences of the extension activity, because the BPEL engine may not support this directly.
  • As in the Simple Pipeline Model itself, we must decide on a pipelet lookup and instantiation model that makes it easy to support OSGi dynamics: the SimplePipeletManager must be able to track the deactivation of bundles providing simple pipelets, so that it can destroy the provided pipelets and re-instantiate them when the bundle reappears. Two mechanisms are possible:
    • OSGi Service Factories: the providing bundle declares an OSGi service factory that the SimplePipeletManager can use to create the actual pipelet instances. This way the DS support for dynamic services can be used for simple pipelets, too. We can probably provide a default implementation of this factory, so that the providing bundle only has to contain a suitable component description that starts this factory customized for its own pipelet.
    • OSGi Extender Model: use a BundleListener or BundleTracker to check installed and removed bundles for contained pipelet implementations (declared in a contained XML file). See this document for details: [1].
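
To illustrate the extender variant, a BundleTracker could watch for bundles that ship a pipelet descriptor; the descriptor path and the manager callbacks are assumptions:

import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;
import org.osgi.framework.BundleEvent;
import org.osgi.util.tracker.BundleTracker;

/** Sketch: extender that (un)registers simple pipelets with the manager. */
public class PipeletExtender extends BundleTracker {

  private final SimplePipeletManager _manager; // assumed manager service

  public PipeletExtender(final BundleContext context, final SimplePipeletManager manager) {
    super(context, Bundle.ACTIVE, null);
    _manager = manager;
  }

  @Override
  public Object addingBundle(final Bundle bundle, final BundleEvent event) {
    // "SMILA-INF/pipelets.xml" is an assumed descriptor location
    if (bundle.getEntry("SMILA-INF/pipelets.xml") != null) {
      _manager.registerPipelets(bundle); // assumed callback
      return bundle;
    }
    return null; // not a pipelet bundle: do not track it
  }

  @Override
  public void removedBundle(final Bundle bundle, final BundleEvent event, final Object object) {
    _manager.unregisterPipelets(bundle); // assumed callback
  }
}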

Configuration using Eclipse BPEL designer

The Eclipse BPEL designer is itself extensible via extension points. Details have to be clarified by somebody with more experience in Eclipse/GUI/RCP programming, but it should be possible to:

  • Define a view displaying all available pipelets, maybe grouped.
  • Drag an available pipelet from this view into the BPEL pipeline, which generates an <extensionActivity> element with the <ext:invokePipelet> activity for the dragged pipelet.
  • Show a specialized properties tab for simple configuration of the pipelet such that the user does not have to write the contained XML. For this the pipelet provider must declare names, types, multiplicity, etc. of the pipelet's configuration properties. This should be done in an XML file provided with the pipelet bundle (schema to be defined).
  • Provide a view showing all pipelets used in all pipelines grouped by pipelines.

Note that this is not limited to simple pipelets, but can be used similarly to handle the "big pipelets". It has to be decided whether "big pipelet services" should also be called via an extension activity, for consistent handling of both types of pipelets. (Currently the implementation uses the standard BPEL invoke activity to call pipelet services.)
