= SMILA/Project Concepts/Blackboard Service Concept =

== Description ==
Design a service to ease management of SMILA records during workflow processing.

In this model the orchestration of pipelets (= "pipeline") is defined by BPEL processes. We distinguish two separate kinds of pipelets:

* "Big Pipelets" are implemented as OSGi services, can be shared by multiple pipelines, and their configurations are separated from the BPEL process definition.
* "Simple Pipelets" are managed by a component of the BPEL engine integration; their instances are not shared by multiple pipelines and their configuration is part of the BPEL process definition.
  
 
== Discussion ==

=== When to persist data into a storage ===
 
 
[[User:G.schmidt.brox.de]]: Several months ago we had a discussion about how to persist information into the search index. At that time I proposed a configuration mechanism for controlling the storage/indexing processes when leaving BPEL. We then moved this discussion in the direction of using BPEL pipelets for e.g. indexing purposes, because that way we are free to configure when and where to use this option. From my point of view such operations should follow a general paradigm: either we use by default a configurable process for indexing/storage at the end of BPEL processing, or we use pipelets for this case. Please share your thoughts.
 
 
* [[User:Juergen.schumacher.empolis.com|Juergen.schumacher.empolis.com]] Yes, that is probably another valid way to configure it. It would be a minor change: instead of writing the records back to the persistence layer at the commit after each workflow (which also invalidates the blackboard content), we could introduce a relatively simple pipelet to trigger the commit, after which the blackboard content must stay valid until the router has finished processing the records. One would then have more control over record persistence (the current concept causes each processed record to be persisted after each successfully finished workflow). The reason why I put this outside of BPEL was that I distinguished between "infrastructure" elements that are required to run with each workflow and "application" elements that differ between setups. To me, "insert to index" is mainly an "application" element, as I can think of SMILA setups that are not used to build search indices. In my picture, persistence was an "infrastructure" element: to be able to chain several workflows via queues it is necessary that the blackboard content is persisted after each single workflow such that the next workflow can access the result (strictly speaking this is not necessary after final workflows that are not followed by others anymore). So I thought it would be safer to enforce record persistence this way, and that a workflow creator can then concentrate more on the "create my application workflow" side of the problem instead of the "making my application work" side. If the team is more in favor of a more flexible solution, no problem. Just vote here (-:
 
  
 
== Technical proposal ==
The purpose of the Blackboard Service is the management of SMILA record data during processing in a SMILA component (Connectivity, Workflow Processor). The problem is that different processing engines could require different physical formats of the record data (see [[SMILA/Project Concepts/Data Model and XML representation]] for a discussion). Because this means either complex implementations of the logical data model or big data conversion problems, the idea is to keep the complete record data only on a "blackboard" which is not pushed through the workflow engine itself, and to extract only a small "workflow object" from the blackboard to feed the workflow engine. This workflow object would contain only the part of the complete record data that the workflow engine needs for loop or branch conditions (and the record ID, of course). Thus it could be efficient enough to do the conversion between blackboard and workflow object before and after each workflow service invocation. As a side effect, the blackboard service could hide the handling of record persistence from the services to make service development easier.
In this model the orchestration of pipelets (= "pipeline") is defined by BPEL processes. The pipelets are implemented as OSGi services. This should make it easier later to support the execution of unsafe pipelets in VMs of their own, because several technologies for transparent remote communication with OSGi services are available (Tuscany, ECF, Riena). In the following we assume that the lifecycle of all services is controlled by OSGi Declarative Services (DS). This simplifies starting and stopping services and binding them to other services. To support the initialization of services at service activation, DS defines a special method that is called when the service is activated, in which the necessary initialization can be done (reading of configurations, connecting to used resources, creating internal structures, etc.). DS also defines a method to be called when a service is deactivated that can be used for cleaning up. The two methods must have this signature:

<source lang="java">
protected void activate(ComponentContext context);
protected void deactivate(ComponentContext context);
</source>

Each pipelet service must have a service property "smila.pipelet.name" that specifies the name of this pipelet. The name must be unique for each service in a single VM and is defined in the DS component description. The pipelet name is used in the BPEL definition to refer to the pipelets. If multiple instances of the same pipelet class are needed, they can be distinguished using different pipelet names.

The pipelet execution method is currently:

<source lang="java">
Id[] process(Id[] recordIds) throws ProcessingException;
</source>

I.e. it is called by the workflow with a list of record IDs; the content of these records is supposed to be available via the Blackboard service, so all access to and manipulation of the records is done using the Blackboard service. The result is also a list of record IDs. Usually these will be the same as the input IDs; a different list can be produced by pipelets that split records. This means that all data needed by the pipelet for processing must be on the blackboard:

* record attributes and attachments
* record annotations
* workflow and record notes

The latter two items may also be used to pass parameters to a pipelet. However, we will need BPEL Extension Activities to be able to set them in the BPEL definition (see end of this chapter).

Pipelets as well as the BPEL integration get their configurations from a central "configuration repository". This can be a simple directory with a defined structure at first, or later a more complex service supporting centralized configuration management and updating (and notification of clients about configuration changes).

Pipelet configurations are separated from the BPEL pipelines, because the existence of pipelets does not depend on the existence of a pipeline engine and must not depend on the implementation of the pipeline engine. This makes it easier to use pipelets independently of a special pipelining implementation, e.g. if we want to replace the BPEL engine by a JBPM engine or a workflow engine implementation of our own. It also makes it easier to share pipelet instances between pipelines, which is crucial for pipelets that use lots of memory (e.g. semantic text mining) or need resources that can only be accessed exclusively by one client (e.g. writing to a Lucene index). Finally it enables OSGi to restart the BPEL integration service without having to restart the pipelets (e.g. for software updates).

The BPEL integration is started by DS, too. Pipelets are bound to the BPEL integration as DS service references. This way the BPEL service can always keep track of the currently available pipelet services. It would even be possible to track which pipelet is used in which pipeline and thus to know a priori which pipelines are currently completely executable.
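To make the model above more concrete, the following is a minimal sketch of what a pipelet implementing the described execution method could look like. The Id, ProcessingException, and Blackboard types here are simplified stand-ins for the SMILA interfaces discussed in this document (the real Blackboard API is proposed further below); the attribute accessors are hypothetical placeholders.

```java
// Simplified stand-ins for the SMILA types discussed in this document (hypothetical).
class Id {
    final String key;
    Id(String key) { this.key = key; }
}

class ProcessingException extends Exception {
    ProcessingException(String msg) { super(msg); }
}

interface Blackboard {
    String getTitle(Id id);         // placeholder for attribute read access
    void setTitle(Id id, String t); // placeholder for attribute modification
}

// A minimal, stateless pipelet: thread-safe because it keeps no mutable state,
// as required by the implementation rules discussed in this proposal.
class UppercaseTitlePipelet {
    private final Blackboard blackboard; // would be injected by DS in SMILA

    UppercaseTitlePipelet(Blackboard blackboard) { this.blackboard = blackboard; }

    // The execution method described above: record content stays on the
    // blackboard, only the record IDs pass through the pipelet.
    Id[] process(Id[] recordIds) throws ProcessingException {
        for (Id id : recordIds) {
            String title = blackboard.getTitle(id);
            if (title != null) {
                blackboard.setTitle(id, title.toUpperCase());
            }
        }
        return recordIds; // no split: the same IDs go back to the workflow
    }
}
```

Since this pipelet does not split records, it returns its input ID list unchanged; a splitting pipelet would return a longer list of fragment IDs instead.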
  
=== Basics ===

This figure gives an overview of how these services could be composed:

[[Image:SMILA-Blackboard-Service.png]]

Note that the use of the Blackboard service is not restricted to workflow processing; it can also be used in Connectivity to create the initial SMILA record from the data sent by Crawlers. This way the persistence services are hidden from Connectivity, too.

It is assumed that the workflow engine itself (which will usually be a third-party product) must be embedded into SMILA using some wrapper that translates incoming calls to workflow-specific objects and service invocations from the workflow into real SMILA service calls. At least with a BPEL engine like ODE it must be done this way. In the following this wrapper is called the Workflow Integration Service. This workflow integration service will also handle the necessary interaction between workflow engine and blackboard (see next section for details).

For ODE, the use of Tuscany SCA Java would simplify the development of this integration service because it could be based on the BPEL implementation type of Tuscany. However, in the first version we will create a SMILA-specific workflow integration service for ODE that can only orchestrate SMILA pipelets, because the Tuscany BPEL implementation type does not yet support service references (see this [http://mail-archives.apache.org/mod_mbox/ws-tuscany-user/200804.mbox/%3c5a75db780804160846u6161d069p17c09a9422b2da8b@mail.gmail.com%3e mail in the Tuscany user mailing list]).

* Update 2008-04-21: Tuscany is making progress on this: [http://mail-archives.apache.org/mod_mbox/ws-tuscany-dev/200804.mbox/%3c5a75db780804181720n248b697ar419eff7e945c8e36@mail.gmail.com%3e mail in dev mailing list]

=== Pipelet instantiation variants ===

Usually we have one instance of a pipelet class that has a single configuration. The pipelet name is then like a key to the combination "pipelet instance name = pipelet class + configuration". However, there may be cases in which it would be good to have a single pipelet class available with different configurations. There are two ways to support this:

* Have a single pipelet instance with a configuration consisting of the different parts. Which part of the configuration is actually used in an invocation must then be passed using a record annotation. E.g. there is a service "pipelet-name" = pipelet.A + config X & config Y, i.e. it has loaded both configurations. A record in the invocation contains annotations:
** "pipelet-name/select-configuration" = X -> use config X for processing this record
** "pipelet-name/select-configuration" = Y -> use config Y for processing this record
*: Note that this makes it possible to process different records with different configurations in a single invocation. Of course, in such a scenario one configuration should be marked as the default configuration to be used if no annotation is set.
* Have multiple pipelet instances with different names, each having one of these configurations. E.g. there are two service instances of the same pipelet class with different pipelet names:
** service 1: "pipelet-name-1" = pipelet.B + config X
** service 2: "pipelet-name-2" = pipelet.B + config Y
*: Then the pipelet name used in the BPEL invoke activity determines which configuration is used.

=== Pipelet Implementation rules ===

Pipelets can potentially be invoked more than once at the same time. This means that a pipelet either should be written in a multithreading-safe way (stateless, read-only configuration and member variables) or it must itself take care of synchronizing critical sections (e.g. Lucene index writing).

=== Configuration repository ===

This is just an ad-hoc proposal to give an idea of what it could look like. Its details are open to discussion.

For the moment we assume that the configuration repository is a single directory with subdirectories in the file system. The configurations for components are located in subdirectories of the repository root. The name of these subdirectories is the bundle name of the component. What happens inside of a bundle configuration directory is up to the bundle implementation. E.g. for the ODE BPEL integration bundle it contains a property file for general BPEL engine configuration and another subdirectory containing pipeline definitions:

<pre>
configuration
  |
  |-- org.eclipse.smila.processing.bpel
  |    |-- processor.properties
  |    \-- pipelines
  |        |-- pipeline-1.bpel
  |        |-- pipeline-2.bpel
  |        |-- ...
  |        |-- processor.wsdl
  |        |-- record.xsd
  |        |-- id.xsd
  |        | (predefined schema files necessary for reference.
  |        |  Needed also during editing in BPEL designer)
  |        \-- deploy.xml
  |          (technical reasons, we can get rid of this)
  |-- org.eclipse.smila.pipelet.A
  |    |  (example: one instance managing multiple configurations)
  |    |-- config-X.xml
  |    \-- config-Y.xml
  |-- org.eclipse.smila.pipelet.B
  |    |  (example: one instance per configuration)
  |    |-- pipelet-name-1
  |    |    \-- config-X.xml
  |    \-- pipelet-name-2
  |          \-- config-Y.xml
  |-- ...
</pre>
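As a sketch of how a component could locate its configuration files in such a layout: the repository root name "configuration" and the bundle/section names below are taken from the example tree above, but the helper class itself is hypothetical, not part of the proposed API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: maps (bundle, optional section, config name) to a file
// below the repository root, following the directory layout sketched above.
class ConfigLocator {
    private final Path root;

    ConfigLocator(Path root) { this.root = root; }

    // e.g. locate("org.eclipse.smila.pipelet.A", null, "config-X.xml")
    Path locate(String bundleName, String sectionName, String configName) {
        Path dir = root.resolve(bundleName);
        if (sectionName != null) {
            dir = dir.resolve(sectionName); // optional "configuration section" level
        }
        return dir.resolve(configName);
    }
}
```

A pipelet that needs one instance per configuration (like pipelet.B above) would pass its instance name as the section, while a single-instance pipelet (like pipelet.A) would pass null.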
  
This is quite similar to [Configuration handling], but with an optional additional folder level for "configuration sections" to structure the configurations better; e.g. for pipelets that require multiple instances for multiple configurations there can be one section per pipelet instance. Of course, bundles are free in how they use the configuration repository structure for their purposes. But we should describe some usage patterns, because that would make reading the repository easier for administrators.

(To discuss: do we need folder structures of arbitrary depth?)

SMILA should provide helper classes to make locating and parsing of simple configurations easy. We can define a common XML format for basic configurations that most pipelets can use (e.g. something similar in structure to the Record Annotation format). Simple property files can be supported, too. Then we can create a simple ConfigurationAccess service with methods like these:

* to navigate the configuration repository:

<source lang="java">
String[] getSectionNames(String bundleName);
// e.g. getSectionNames("org.eclipse.smila.pipelet.B")
// returns ["pipelet-name-1", "pipelet-name-2"]

String[] getConfigNames(String bundleName);
// e.g. getConfigNames("org.eclipse.smila.pipelet.A")
// returns ["config-X.xml", "config-Y.xml"]

String[] getConfigNames(String bundleName, String sectionName);
// e.g. getConfigNames("org.eclipse.smila.pipelet.B", "pipelet-name-1")
// returns ["config-X.xml"]
</source>

=== Workflow ===

The next picture illustrates how and which data flows through this system:

[[Image:SMILA-Blackboard-Activity.png]]

In more detail:

* Listener receives record from queue.
*: The record usually contains only the ID. In special cases it could optionally include some small attribute values or annotations that could be used to control routing inside the message broker.
* Listener calls blackboard to load record data from the persistence service and writes attributes contained in the message record to the blackboard.
* Listener calls workflow service with the ID from the message record.
* Workflow integration creates a workflow object for the ID.
*: The workflow object uses engine-specific classes (e.g. DOM for the ODE BPEL engine) to represent the record ID and some chosen attributes that are needed in the engine for condition testing or computation. It is a configuration option of the workflow integration which attributes are to be included. In a more advanced version it may be possible to analyse the workflow definition (e.g. the BPEL process) to determine which attributes are needed.
* Workflow integration invokes the workflow engine. This causes the following steps to be executed a couple of times:
** Workflow engine invokes a SMILA service (pipelet). At least for ODE BPEL this means that the engine calls the integration layer, which in turn routes the request to the invoked pipelet. So the workflow integration layer receives (potentially modified) workflow objects.
** Workflow integration writes the workflow objects to the blackboard and creates record IDs. The selected pipelet is called with these IDs.
** Pipelet processes the IDs and manipulates the blackboard content. The result is a new list of record IDs (usually identical to the argument list, and usually the list has length 1).
** Workflow integration creates new workflow objects from the result IDs and blackboard content and feeds them back to the workflow engine.
* Workflow engine finishes successfully and returns a list of workflow objects.
*: If it finishes with an exception, the Listener/Router instead has to invalidate the blackboard for all IDs related to the workflow such that they are not committed back to the storages, and it also has to signal the message broker that the received message has not been processed successfully, such that the message broker can move it to the dead letter queue.
* Workflow integration extracts the IDs from the workflow objects and returns them.
* Router creates outgoing messages with message records depending on blackboard content for the given IDs.
*: Two things may need configuration here: when to create an outgoing message to which queue (never, always, depending on conditions on attribute values or annotations) - this could also be done in the workflow by setting a "nextDestination" annotation for each record ID - and which attributes/annotations are to be included in the message record, if any.
* Router commits the IDs on the blackboard. This writes the blackboard content to the persistence services and invalidates the blackboard content for these IDs.
* Router sends the outgoing messages to the message broker.
* to access and parse the configurations in the common XML format:

<source lang="java">
Configuration getConfig(String bundleName, String configName);
Configuration getConfig(String bundleName, String sectionName, String configName);
</source>

* to access and read property files:

<source lang="java">
Properties getProperties(String bundleName, String configName);
Properties getProperties(String bundleName, String sectionName, String configName);
</source>

* to access other configurations:

<source lang="java">
InputStream getStream(String bundleName, String configName);
InputStream getStream(String bundleName, String sectionName, String configName);
</source>

This would make accessing simple configurations quite easy for a pipelet developer.

=== Content on Blackboard ===

The Blackboard contains two kinds of content:

'''Records:''' All records currently processed in this runtime process. The structure of a record is defined in [[SMILA/Project Concepts/Data Model and XML representation]]. Clients manipulate the records through Blackboard API methods. This way the records are completely under the control of the Blackboard, which may be used in advanced versions for optimised communication with the persistence services.

Records enter the blackboard by one of the following operations:
* create: creates a new record with a given ID. No data is loaded from persistence; if a record with this ID already exists in the storages, it will be overwritten when the created record is committed. E.g. used by Connectivity to initialize the record from incoming data.
* load: loads record data for the given ID from persistence (or prepares it to be loaded). Used by a client to indicate that it wants to process this record.
* split: creates a fragment of a given record, i.e. the record content is copied to a new ID derived from the given one by adding a fragment name (see [ID Concept] for details).

All these methods should take care of locking the record ID in the storages such that no second runtime process can try to manipulate the same record.

A record is removed from the blackboard with one of these operations:

* commit: all changes are written to the storages before the record is removed. The record is unlocked in the database.
* invalidate: the record is removed from the blackboard. The record is unlocked in the database. If the record was created new (not overwritten) on this blackboard, it should be removed from the storage completely.

'''Notes:''' Additional temporary data created by pipelets to be used by later pipelets in the same workflow, but not to be persisted in the storages. Notes can be either global or record specific (associated to a record ID). Record-specific notes are copied on record splits and removed when the associated record is removed from the blackboard. In any case a note has a name, and the value can be of any serializable Java class such that notes can be accessed from separated services in their own VMs.

: A nice extension would be workflow-instance-specific notes, such that a pipelet can pass non-persistent information to another pipelet invoked later in the workflow which is not associated to a single record, but does not conflict with information from different workflow instances like global notes would (based on the assumption that the workflow engine supports multi-threaded execution). This information would be removed from the blackboard after the workflow instance has finished. However, it has to be clarified how such notes can be associated to the workflow instance, even when accessed from a remote VM.

=== BPEL Extension Activities ===

The BPEL specification allows extending BPEL by using Extension Activities. An Extension Activity is basically a Java class with a given interface that is registered to the BPEL engine under a qualified name. It can then be used in BPEL by a statement like this:

<source lang="xml">
<bpel:extensionActivity>
  <myns:NameOfExtension>
    <!-- arbitrary XML elements -->
  </myns:NameOfExtension>
</bpel:extensionActivity>
</source>

The implementation class is then called with the complete XML element of its description and can access all workflow variables defined in the BPEL. This means the activity can be configured in the BPEL. E.g. for setting record annotations we can provide an extension activity similar to this:

<source lang="xml">
<extensionActivity>
    <ext:setAnnotations>
        <ext:target variable="request"/>
        <rec:An n="pipelet-name">
            <rec:An n="select-configuration">
                <rec:V>X</rec:V>
            </rec:An>
        </rec:An>
    </ext:setAnnotations>
</extensionActivity>
</source>

This would set an annotation named "pipelet-name/select-configuration" with value "X" on all records in the request variable. Of course it would also be possible to create a more specialized activity instead, which would define a simpler syntax to describe the annotations to be set.
==== Current problems ====

* This is not supported by the current release (1.1.1) of ODE, and also not in release 1.2 (currently about to be released), but only in the trunk version (which will probably become release 1.3). The latest estimate for a release date was "in about two months".
* It is also not supported by the current release (M3) of the Eclipse BPEL designer. According to the Eclipse Bugzilla it should be added in M4, which in turn should be released in the near future. However, we will probably have to provide our own extensions to the BPEL designer anyway in order to have user-friendly editing of the extension activities provided by us.

==== Integrating the Simple Pipeline Model into BPEL Pipelining ====

Using extension activities it would even be possible to integrate the complete simple pipeline model into the BPEL pipelining model:

<source lang="xml">
<extensionActivity>
    <ext:invokePipelet>
        <ext:pipelet name="pipelet-name"/>
        <ext:variables input="request" output="result"/>
        <ext:invocationConfig>
            <!-- parameters of invocation, e.g. error handling? -->
        </ext:invocationConfig>
        <ext:pipeletConfig>
            <!-- pipelet XML configuration, schema: to define -->
        </ext:pipeletConfig>
    </ext:invokePipelet>
</extensionActivity>
</source>

=== Service Interfaces ===

The Blackboard will be implemented as an OSGi service. The interface could look similar to the following definition. It is getting quite big, so maybe it makes sense to divide it up into different parts (handling of lifecycle, literal values, object values, annotations, notes, attachments?) for better readability. We'll see about this when implementing.

<source lang="java">
interface Blackboard {
    // record life cycle methods
    void create(ID id) throws BlackboardAccessException;
    void load(ID id) throws BlackboardAccessException;
    ID split(ID id, String fragmentName) throws BlackboardAccessException;
    void commit(ID id) throws BlackboardAccessException;
    void invalidate(ID id);

    // factory methods for attribute values and annotation objects.
    // Literal and Annotation are just interfaces,
    // the blackboard implementation can determine the actual types for optimization.
    Literal createLiteral();
    Annotation createAnnotation();

    // record content methods
    // - record metadata
    //   for referenced types see interfaces proposed in [[SMILA/Project Concepts/Data Model and XML representation]]
    //   for the string format of an attribute path see the definition of the Path class below.
    // -- basic navigation
    Iterator<String> getAttributeNames(ID id, Path path) throws BlackboardAccessException;
    Iterator<String> getAttributeNames(ID id) throws BlackboardAccessException; // convenience for getAttributeNames(id, null)
    boolean hasAttribute(ID id, Path path) throws BlackboardAccessException;

    // -- handling of literal values
    //    navigation support
    boolean hasLiterals(ID id, Path path) throws BlackboardAccessException;
    int getLiteralsSize(ID id, Path path) throws BlackboardAccessException;
    //    get all literal attribute values of an attribute (index of last step is irrelevant).
    //    A client should not expect the blackboard to reflect changes done to these objects automatically,
    //    but should always call one of the modification methods below to really set the changes.
    List<Literal> getLiterals(ID id, Path path) throws BlackboardAccessException;
    //    get single attribute value, index is specified in last step of path, defaults to 0.
    Literal getLiteral(ID id, Path path) throws BlackboardAccessException;

    //    modification of attribute values on blackboard
    void setLiterals(ID id, Path path, List<Literal> values) throws BlackboardAccessException;
    //    set single literal value, index of last attribute step is irrelevant
    void setLiteral(ID id, Path path, Literal value) throws BlackboardAccessException;
    //    add a single literal value, index of last attribute step is irrelevant
    void addLiteral(ID id, Path path, Literal value) throws BlackboardAccessException;

    //    remove literal specified by index in last step
    void removeLiteral(ID id, Path path) throws BlackboardAccessException;
    //    remove all literals of specified attribute
    void removeLiterals(ID id, Path path) throws BlackboardAccessException;

    // -- handling of sub-objects
    //    navigation: check if an attribute has sub-objects and get their number
    boolean hasObjects(ID id, Path path) throws BlackboardAccessException;
    int getObjectSize(ID id, Path path) throws BlackboardAccessException;

    //    deleting sub-objects
    //    remove sub-object specified by index in last step
    void removeObject(ID id, Path path) throws BlackboardAccessException;
    //    remove all sub-objects of specified attribute
    void removeObjects(ID id, Path path) throws BlackboardAccessException;

    // access semantic type of sub-object attribute values.
    // semantic types of literals are modified at the literal object.
    String getObjectSemanticType(ID id, Path path) throws BlackboardAccessException;
    void setObjectSemanticType(ID id, Path path, String typename) throws BlackboardAccessException;

    // -- annotations of attributes and sub-objects.
    //    annotations of literals are accessed via the Literal object.
    //    use null, "" or an empty attribute path to access root annotations of the record.
    //    use PathStep.ATTRIBUTE_ANNOTATION as index in the final step to access the annotation
    //    of the attribute itself.
    Iterator<String> getAnnotationNames(ID id, Path path) throws BlackboardAccessException;
    boolean hasAnnotations(ID id, Path path) throws BlackboardAccessException;
    boolean hasAnnotation(ID id, Path path, String name) throws BlackboardAccessException;

    List<Annotation> getAnnotations(ID id, Path path, String name) throws BlackboardAccessException;
    //    shortcut to get only the first annotation if one exists.
    Annotation getAnnotation(ID id, Path path, String name) throws BlackboardAccessException;
    void setAnnotations(ID id, Path path, String name, List<Annotation> annotations) throws BlackboardAccessException;
    void setAnnotation(ID id, Path path, String name, Annotation annotation) throws BlackboardAccessException;
    void addAnnotation(ID id, Path path, String name, Annotation annotation) throws BlackboardAccessException;
    void removeAnnotation(ID id, Path path, String name) throws BlackboardAccessException;
    void removeAnnotations(ID id, Path path) throws BlackboardAccessException;

    // - record attachments
    boolean hasAttachment(ID id, String name) throws BlackboardAccessException;
    byte[] getAttachment(ID id, String name) throws BlackboardAccessException;
    InputStream getAttachmentAsStream(ID id, String name) throws BlackboardAccessException;
    void setAttachment(ID id, String name, byte[] content) throws BlackboardAccessException;
    InputStream setAttachmentFromStream(ID id, String name, InputStream content) throws BlackboardAccessException;

    // - notes methods
    boolean hasGlobalNote(String name) throws BlackboardAccessException;
    Serializable getGlobalNote(String name) throws BlackboardAccessException;
    void setGlobalNote(String name, Serializable object) throws BlackboardAccessException;
    boolean hasRecordNote(ID id, String name) throws BlackboardAccessException;
    Serializable getRecordNote(ID id, String name) throws BlackboardAccessException;
    void setRecordNote(ID id, String name, Serializable object) throws BlackboardAccessException;

    // This is certainly not complete ... just to give an idea of how it could taste.
    // Lots of convenience methods can be added later.
}
</source>
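To make the intended usage pattern (create, manipulate via API methods, commit) more concrete, here is a small, hypothetical sketch of the client-side interaction; it implements only a tiny in-memory subset (single-valued string literals addressed by a flat attribute name, string IDs) and is in no way the proposed implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Greatly simplified in-memory stand-in for the blackboard (hypothetical):
// records are maps from a flat attribute name to a single string literal.
class MiniBlackboard {
    private final Map<String, Map<String, String>> records = new HashMap<>();

    // create: new record, nothing is loaded from persistence
    void create(String id) { records.put(id, new HashMap<>()); }

    // clients manipulate record content only through API methods
    void setLiteral(String id, String attribute, String value) {
        records.get(id).put(attribute, value);
    }

    String getLiteral(String id, String attribute) {
        return records.get(id).get(attribute);
    }

    // commit: would write to the persistence services; here it just removes
    // the record from the board and returns the final content.
    Map<String, String> commit(String id) { return records.remove(id); }
}
```

Because the real interface works on ID plus attribute Path and returns Literal objects, the shape of client code against the proposed API would be analogous, just with richer types.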
+
<source lang="java">
public class Path implements Serializable, Iterable<PathStep> {
    // The string format of an attribute path could be something like
    // "attributeName1[index1]/attributeName2[index2]/..." or
    // "attributeName1@index1/attributeName2@index2/...".
    // The first is probably better because it is similar to XPath?
    // The specification of the index is optional and defaults to 0.
    // Whether the index refers to a literal or a sub-object depends on the method getting the argument.

    public static final char SEPARATOR = '/';

    public Path();
    public Path(Path path);
    public Path(String path);

    // Extend the path by more steps. This modifies the object itself and returns it again
    // for further modifications, e.g. path.add("level1").add("level2");
    public Path add(PathStep step);
    public Path add(String attributeName);
    public Path add(String attributeName, int index);

    // Remove the tail element of this path. This modifies the object itself and returns it again
    // for further modifications, e.g. path.up().add("siblingAttribute");
    public Path up();

    public Iterator<PathStep> iterator();
    public boolean isEmpty();
    public PathStep get(int positionInPath);
    public String getName(int positionInPath);
    public int getIndex(int positionInPath);
    public int length();

    public boolean equals(Path other);
    public int hashCode();
    public String toString();
}
</source>
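The "attributeName[index]" string format sketched in the comments above could be parsed along these lines. This is only a standalone illustration of the proposed format, not the real Path/PathStep implementation; the parser and its names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch: parse "name[index]/name[index]/..." into (name, index) steps.
// The index part is optional and defaults to 0, as proposed above.
public class PathParseSketch {
    // one step: attribute name plus an optional "[digits]" suffix
    private static final Pattern STEP = Pattern.compile("([^/\\[\\]]+)(?:\\[(\\d+)\\])?");

    static List<String[]> parse(String path) {
        List<String[]> steps = new ArrayList<>();
        for (String part : path.split("/")) {
            Matcher m = STEP.matcher(part);
            if (!m.matches()) {
                throw new IllegalArgumentException("invalid path step: " + part);
            }
            String index = (m.group(2) != null) ? m.group(2) : "0";
            steps.add(new String[] { m.group(1), index });
        }
        return steps;
    }

    public static void main(String[] args) {
        for (String[] step : parse("attributeName1[2]/attributeName2")) {
            System.out.println(step[0] + " -> index " + step[1]); // second step defaults to index 0
        }
    }
}
```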
<source lang="java">
public class PathStep implements Serializable {
    public static final int ATTRIBUTE_ANNOTATION = -1;

    private String name;
    private int index = 0; // index of value in multivalued attributes. default is first value.

    public PathStep(String name);
    public PathStep(String name, int index);

    public String getName();
    public int getIndex();

    public boolean equals(PathStep other);
    public int hashCode();
    public String toString();
}
</source>
  
The business interface of a pipelet to be used in a SMILA workflow will be rather simple. It can get access to the local blackboard service using OSGi service lookup or by injection using OSGi Declarative Services. Therefore the main business method just needs to take a list of record IDs as an argument and return a new (or the same) list of IDs as the result. This is the same method that a workflow integration service needs to expose to the Listener/Router component, so it makes sense to use a common interface definition for both. This way it is possible to deploy processing runtimes with only a single pipelet without having to create dummy workflow definitions, because a pipelet can be wired up to the Listener/Router immediately. Because remote communication with separated pipelets (see below) will be implemented later, pipelets (and therefore workflow integrations, too) must be implemented as OSGi services such that the remote communication can be coordinated using SCA. Thus, the interfaces could look like this:

<source lang="java">
 
interface RecordProcessor {
    ID[] process(ID[] recordIds) throws ProcessingException;
}
</source>
  
<source lang="java">
interface Pipelet extends RecordProcessor {
    // specific methods for pipelets
}
</source>

<source lang="java">
interface WorkflowIntegration extends RecordProcessor {
    // specific methods for workflow integration services.
}
</source>
  
=== What about pipelets running in a separate VM? ===

*Not relevant for the initial implementation. This will be added in advanced versions and discussed in more detail then.*

We want to be able to have pipelets running in a separated VM if they are known to be unstable or non-terminating in error conditions. This can be supported by the blackboard service like this:

[[Image:SMILA-Blackboard-SeparatedService.png]]

The separated pipelet VM would have a proxy blackboard service that coordinates the communication with the master blackboard in the workflow processor VM. Only the record IDs need to be sent to the separated pipelets. However, the separated pipelet must be wrapped to provide control of the record life cycle on the proxy blackboard, especially because the changes done in the remote blackboard must be committed back to the master blackboard when the separated pipelet has finished successfully, or the proxy blackboard content must be invalidated without commit in case of a pipelet error. Possibly, this pipelet wrapper can also provide "watchdog" functionality to monitor the separated pipelet and terminate and restart it in case of endless loops or excessive memory consumption.

Revision as of 09:46, 7 August 2008

Description

In this model the orchestration of pipelets (= "pipeline") is defined by BPEL processes. We distinguish two separate kinds of pipelets:

  • "Big Pipelets" are implemented as OSGi services, can be shared by multiple pipelines, and their configuration is separated from the BPEL process definition.
  • "Simple Pipelets" are managed by a component of the BPEL engine integration; instances are not shared by multiple pipelines, and their configuration is part of the BPEL process definition.

Discussion

Technical proposal

In this model the orchestration of pipelets (= "pipeline") is defined by BPEL processes. The pipelets are implemented as OSGi services. This should make it easier to later support the execution of unsafe pipelets in their own VMs, because several technologies for transparent remote communication with OSGi services are available (Tuscany, ECF, Riena). In the following we assume that the lifecycle of all services is controlled by OSGi Declarative Services (DS). This simplifies starting and stopping services and binding them to other services. To support initialization at service activation, DS defines a special method that is called when the service is activated, in which the necessary initialization can be done (reading configurations, connecting to used resources, creating internal structures, etc.). DS also defines a method to be called when a service is deactivated, which can be used for cleaning up. The two methods must have this signature:

protected void activate(ComponentContext context);
protected void deactivate(ComponentContext context);

Each pipelet service must have a service property "smila.pipelet.name" that specifies the name of this pipelet. The name must be unique for each service in a single VM and is defined in the DS component description. The pipelet name is used in the BPEL definition to refer to the pipelets. If multiple instances of the same pipelet class are needed, they can be distinguished using different pipelet names. The pipelet execution method is currently:

Id[] process(Id[] recordIds) throws ProcessingException;

I.e. it is called by the workflow with a list of record IDs; the content of these records is supposed to be available via the Blackboard service, so all access to and manipulation of the records is done using the Blackboard service. The result is also a list of record IDs. Usually these will be the same as the input IDs; a different list can be produced by pipelets that split records. This means that all data needed by the pipelet for processing must be on the blackboard:

  • record attributes and attachments
  • record annotations
  • workflow and record notes
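The process() contract just described can be sketched as follows. This is a standalone illustration only: the blackboard is stubbed as a simple map, and the pipelet logic (uppercasing a "Title" attribute) is invented for the example; the real interfaces are the ones proposed in this document.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the process() contract: record IDs in, record IDs out,
// while all record data lives on the blackboard (stubbed here as a map).
public class ProcessContractSketch {
    // stub blackboard: record ID -> value of a single "Title" attribute
    static final Map<String, String> BLACKBOARD = new HashMap<>();

    // usually returns the same IDs; a splitting pipelet could return a longer list
    static String[] process(String[] recordIds) {
        for (String id : recordIds) {
            String title = BLACKBOARD.get(id);
            BLACKBOARD.put(id, title.toUpperCase()); // manipulate data via the blackboard only
        }
        return recordIds;
    }

    public static void main(String[] args) {
        BLACKBOARD.put("rec1", "hello");
        String[] out = process(new String[] { "rec1" });
        System.out.println(out[0] + " -> " + BLACKBOARD.get("rec1")); // rec1 -> HELLO
    }
}
```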

The two latter items may also be used to pass parameters to a pipelet. However, we will need BPEL Extension Activities to be able to set them in the BPEL definition (see the end of this chapter).

Pipelets as well as the BPEL integration get their configurations from a central "configuration repository". This can be a simple directory with a defined structure at first, or later a more complex service supporting centralized configuration management and updating (and notification of clients about configuration changes). Pipelet configurations are separated from the BPEL pipelines, because a pipelet's existence does not depend on the existence of a pipeline engine and must not depend on the implementation of the pipeline engine. This makes it easier to use pipelets independently of a particular pipelining implementation, e.g. if we want to replace the BPEL engine by a JBPM engine or our own workflow engine implementation. It also makes it easier to share pipelet instances between pipelines, which is crucial for pipelets that use lots of memory (e.g. semantic text mining) or need resources that can only be accessed exclusively by one client (e.g. writing to a Lucene index). Finally, it enables OSGi to restart the BPEL integration service without having to restart the pipelets (e.g. for software updates).

The BPEL integration is started by DS, too. Pipelets are bound to the BPEL integration as DS service references. This way the BPEL service can always keep track of the currently available pipelet services. It would even be possible to track which pipelet is used in which pipeline and thus to know a priori which pipeline is currently completely executable.

Pipelet instantiation variants

Usually we have one instance of a pipelet class that has a single configuration. The pipelet name is then like a key to the combination "pipelet instance name = pipelet class + configuration". However, there may be cases in which it would be good to have a single pipelet class available with different configurations. There are two ways to support this:

  • Have a single pipelet instance with a configuration consisting of the different parts. Which part of the configuration is actually used in an invocation must then be passed using a record annotation. E.g.: There is a service "pipelet-name" = pipelet.A + config X & config Y, i.e. it has loaded both configurations.

A record in the invocation contains annotations:

    • "pipelet-name/select-configuration" = X -> use config X for processing this record
    • "pipelet-name/select-configuration" = Y -> use config Y for processing this record

Note that this makes it possible to process different records with different configurations in a single invocation. Of course, in such a scenario one configuration should be marked as the default configuration to be used if no annotation is set.

  • Have multiple pipelet instances with different names, each having one of these configurations. E.g. there are two service instances of the same pipelet class with different pipelet names:
    • service 1: "pipelet-name-1" = pipelet.B + config X
    • service 2: "pipelet-name-2" = pipelet.B + config Y

Then the pipelet name used in the BPEL invoke activity determines which configuration is used.

Pipelet Implementation rules

Pipelets can potentially be invoked more than once at the same time. This means that a pipelet should either be written in a multithreading-safe way (stateless, read-only configuration and member variables) or it must itself take care of synchronizing critical sections (e.g. Lucene index writing).
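The second option can be sketched as below. This is a hedged, standalone illustration: the class, the StringBuilder standing in for an index writer, and the field name are all invented for the example; only the pattern (stateless per-record work, synchronized access to the shared resource) reflects the rule above.

```java
// Standalone sketch of guarding a critical section in a pipelet that writes to a
// shared resource (a StringBuilder stands in for e.g. a Lucene index writer).
public class SynchronizedPipeletSketch {
    private final Object indexLock = new Object();
    private final StringBuilder index = new StringBuilder(); // shared, needs guarding
    // configuration should be read-only after setup; here it is simply final
    private final String fieldName;

    public SynchronizedPipeletSketch(String fieldName) {
        this.fieldName = fieldName;
    }

    public String[] process(String[] recordIds) {
        for (String id : recordIds) {
            // per-record work that is stateless can run unsynchronized...
            String doc = fieldName + "=" + id;
            // ...but writing to the shared index is a critical section
            synchronized (indexLock) {
                index.append(doc).append('\n');
            }
        }
        return recordIds;
    }

    public String indexContent() {
        synchronized (indexLock) {
            return index.toString();
        }
    }
}
```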

Configuration repository

This is just an ad-hoc proposal to give an idea of how it could look. The details are open to discussion.

For the moment we assume that the configuration repository is a single directory with subdirectories in the file system. The configurations for components are located in subdirectories of the repository root. The name of each subdirectory is the bundle name of the component. What happens inside a bundle configuration directory is up to the bundle implementation. E.g. for the ODE BPEL integration bundle it contains a property file for general BPEL engine configuration and another subdirectory containing pipeline definitions. E.g.:

configuration
  |
  |-- org.eclipse.smila.processing.bpel
  |    |-- processor.properties
  |    \-- pipelines
  |         |-- pipeline-1.bpel
  |         |-- pipeline-2.bpel
  |         |-- ...
  |         |-- processor.wsdl
  |         |-- record.xsd
  |         |-- id.xsd
  |         | (predefined schema files necessary for reference. 
  |         |  Needed also during editing in BPEL designer)
  |         \-- deploy.xml 
  |           (technical reasons, we can get rid of this)
  |-- org.eclipse.smila.pipelet.A 
  |     |   (example: one instance managing multiple configurations)
  |     |-- config-X.xml
  |     \-- config-Y.xml
  |-- org.eclipse.smila.pipelet.B 
  |     |   (example: one instance per configuration)
  |     |-- pipelet-name-1
  |     |    \-- config-X.xml
  |     \-- pipelet-name-2
  |          \-- config-Y.xml
  |-- ...

This is quite similar to [Configuration handling], but with an optional additional folder level for "configuration sections" to structure the configurations better; e.g. for pipelets that require multiple instances for multiple configurations there can be one section per pipelet instance. Of course, bundles are free in how they use the configuration repository structure for their purposes. But we should describe some usage patterns, because that would make reading the repository easier for administrators.

(To discuss: do we need folder structures of arbitrary depth?)

SMILA should provide helper classes to make locating and parsing simple configurations easy. We can define a common XML format for basic configurations that most pipelets can use (e.g. something with a structure similar to the Record Annotation format?). Simple property files can be supported, too. Then we can create a simple ConfigurationAccess service with methods like:

  • to navigate the Configuration repository:
String[] getSectionNames(String bundleName); 
// e.g. getSectionNames("org.eclipse.smila.pipelet.B") 
// returns ["pipelet-name-1", "pipelet-name-2"]
String[] getConfigNames(String bundleName);
// e.g. getConfigNames("org.eclipse.smila.pipelet.A") 
// returns ["config-X.xml", "config-Y.xml"]
String[] getConfigNames(String bundleName, String sectionName);
// e.g. getConfigNames("org.eclipse.smila.pipelet.B", "pipelet-name-1") 
// returns ["config-X.xml", "config-Y.xml"]
  • to access and parse the configurations in common XML format:
Configuration getConfig(String bundleName, String configName);
Configuration getConfig(String bundleName, String sectionName, String configName);
  • to access and read property files:
Properties getProperties(String bundleName, String configName);
Properties getProperties(String bundleName, String sectionName, String configName);
  • to access other configurations:
InputStream getStream(String bundleName, String configName);
InputStream getStream(String bundleName, String sectionName, String configName);

This would make accessing simple configurations quite easy for a pipelet developer.
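The mapping from bundle name, optional section and config name to repository paths could look like the sketch below. ConfigurationAccess is only proposed above, so this standalone class and its names are assumptions; it just mirrors the example directory tree.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Standalone sketch of how a ConfigurationAccess-like service could resolve
// paths in the configuration repository described above.
public class ConfigPathSketch {
    private final Path root; // e.g. the "configuration" directory

    public ConfigPathSketch(Path root) {
        this.root = root;
    }

    // configuration/<bundleName>/<configName>
    public Path resolve(String bundleName, String configName) {
        return root.resolve(bundleName).resolve(configName);
    }

    // configuration/<bundleName>/<sectionName>/<configName>
    public Path resolve(String bundleName, String sectionName, String configName) {
        return root.resolve(bundleName).resolve(sectionName).resolve(configName);
    }

    public static void main(String[] args) {
        ConfigPathSketch access = new ConfigPathSketch(Paths.get("configuration"));
        System.out.println(access.resolve("org.eclipse.smila.pipelet.A", "config-X.xml"));
        System.out.println(access.resolve("org.eclipse.smila.pipelet.B", "pipelet-name-1", "config-X.xml"));
    }
}
```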

BPEL Extension Activities

The BPEL specification allows extending BPEL using Extension Activities. An Extension Activity is basically a Java class with a given interface that is registered with the BPEL engine under a qualified name. It can then be used in BPEL by a statement like this:

<bpel:extensionActivity>
  <myns:NameOfExtension>
    <!-- arbitrary XML elements -->
  </myns:NameOfExtension>
</bpel:extensionActivity>

The implementation class is then called with the complete XML element of its description and can access all workflow variables defined in the BPEL. This means the activity can be configured in the BPEL. E.g. for setting record annotations we can provide an extension activity similar to this:

<extensionActivity>
    <ext:setAnnotations>
        <ext:target variable="request"/>
        <rec:An n="pipelet-name">
            <rec:An n="select-configuration">
                <rec:V>X</rec:V>
            </rec:An>
        </rec:An>
    </ext:setAnnotations>
</extensionActivity>

This would set an annotation named "pipelet-name/select-configuration" with value "X" on all records in the request variable. Of course, it would also be possible to create a more specialized activity instead that would define a simpler syntax to describe the annotations to be set.

Current problems are:

  • This is not supported by the current release (1.1.1) and also not in release 1.2 (currently about to be released) of ODE, but only in the trunk version (this will be release 1.3 probably). The latest estimation for a release date was "in about two months".
  • It's also not supported by the current release (M3) of the Eclipse BPEL designer. According to Eclipse Bugzilla it should be added in M4, which in turn should be released in the near future. However, I think we will have to provide our own extensions to the BPEL designer anyway in order to have user-friendly editing of the extension activities provided by us.

Integrating the Simple Pipeline Model into BPEL Pipelining

Using extension activities it would even be possible to integrate the complete simple pipeline model into the BPEL pipelining model:

<extensionActivity>
    <ext:invokePipelet>
        <ext:pipelet name="pipelet-name"/> 
        <ext:variables input="request" output="result"/>
        <ext:invocationConfig>
          <!-- parameters of invocation, e.g. error handling? -->
        </ext:invocationConfig>
        <ext:pipeletConfig>
          <!-- pipelet XML configuration, schema: to define -->
        </ext:pipeletConfig>
    </ext:invokePipelet>
</extensionActivity>

We could provide an Extension Activity implementation that manages the "simple pipelet" lifecycle and configuration and translates the calls from the BPEL engine into a convenient pipelet invocation. Because the lifecycle of extension activities themselves is undefined (it seems that in ODE a new instance is created for each call), the extension activity is only a simple class that forwards the BPEL call to a "SimplePipeletManager" which manages the pipelet instances (either all at once or one set per pipeline), configurations, invocations and error handling. The execution interface of the simple pipelet would be the same as that of the pipelet service described above:

Id[] process(Id[] recordIds) throws ProcessingException;

Simple pipelets would use the blackboard service to access the actual record data. Additionally, simple pipelets need a method to set the configuration:

void configure(PipeletConfiguration config) throws ConfigurationException;

(Question: Do we also need a "shutdown" method to be called when the pipelet is destroyed? Or can we require simple pipelets to be so simple that they do not need such a method?)

The tasks of the SimplePipeletManager are:

  • Start of pipeline / pipelet becomes available:
    • Instantiate the pipelet.
    • Parse the PipeletConfiguration from the BPEL pipeline and call the pipelet's configure method.
  • Pipelet invocation (very similar to the invocation of "big pipelets", see [Blackboard Service Concept]):
    • Parse records from the "input" variable and sync them to the blackboard.
    • Call the simple pipelet's execute method with the record IDs.
    • Create workflow objects from the result IDs and blackboard content and write them back to the "output" variable.
  • In case of a pipelet error: take care of indicating the error in a correct way to the BPEL engine.
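The invocation steps above can be sketched as follows. This is a standalone illustration under heavy assumptions: the blackboard sync and BPEL variable handling are stubbed as comma-separated ID strings, and all names (invoke, syncToBlackboard, etc.) are invented, not the real SimplePipeletManager API.

```java
// Standalone sketch of the SimplePipeletManager invocation flow listed above,
// with blackboard sync and BPEL variables stubbed out.
public class SimplePipeletManagerSketch {

    interface SimplePipelet {
        String[] process(String[] recordIds) throws Exception;
    }

    // step 1: parse records from the "input" variable and sync them to the blackboard (stubbed)
    static String[] syncToBlackboard(String inputVariable) {
        return inputVariable.split(",");
    }

    // step 3: create workflow objects from result IDs and write them to the "output" variable (stubbed)
    static String writeOutputVariable(String[] resultIds) {
        return String.join(",", resultIds);
    }

    static String invoke(SimplePipelet pipelet, String inputVariable) {
        String[] ids = syncToBlackboard(inputVariable);
        try {
            String[] result = pipelet.process(ids); // step 2: call the pipelet with record IDs
            return writeOutputVariable(result);
        } catch (Exception e) {
            // step 4: indicate the pipelet error to the BPEL engine, e.g. as a fault
            throw new RuntimeException("pipelet failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(invoke(ids -> ids, "rec1,rec2")); // rec1,rec2
    }
}
```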

Issues to solve:

  • Simple pipelets should be instantiated and configured at deployment of the BPEL pipeline. This way missing pipelet implementations and configuration errors can be reported during system startup and not during first execution. For this it is probably necessary to introspect the pipeline definition and search for occurrences of the extension activity, because the BPEL engine may not support this directly.
  • Like in the Simple Pipeline Model itself we must decide on a pipelet lookup and instantiation model that makes it easy to support OSGi dynamics: The SimplePipeletManager must be able to track deactivation of bundles providing simple pipelets such that it can destroy the provided pipelets and re-instantiate them when the bundle reappears. Two mechanisms are possible:
    • OSGi Service Factories: The providing bundle declares an OSGi service factory that the SPM can use to create actual pipelet instances. This way we can use the DS support for dynamic services also for simple pipelets. We can probably provide a default implementation of this factory such that the providing bundle must only contain a suitable component description starting this factory customized for its own pipelet.
    • OSGi Extender Model: Use a BundleListener/Tracker to check installed or removed bundles for contained pipelet implementations (declared in a contained XML file). See this document for details: [1].

Configuration using Eclipse BPEL designer

The Eclipse BPEL designer is itself extensible using extension points. Details have to be clarified by somebody with more experience in Eclipse/GUI/RCP programming, but it should be possible to:

  • Define a view displaying all available pipelets, maybe grouped.
  • Drag an available pipelet from this view into the BPEL pipeline, which generates an <extensionActivity> element with the <ext:invokePipelet> activity for the dragged pipelet.
  • Show a specialized properties tab for simple configuration of the pipelet such that the user does not have to write the contained XML. For this the pipelet provider must declare names, types, multiplicity, etc. of the pipelet's configuration properties. This should be done in an XML file provided with the pipelet bundle (schema to be defined).
  • Provide a view showing all pipelets used in all pipelines grouped by pipelines.

Note that this is not limited to simple pipelets, but can be used similarly to handle the "big pipelets". It has to be decided whether "big pipelet services" should also be called using an extension activity, for consistent handling of both types of pipelets. (Currently the implementation uses the standard BPEL invoke activity to call pipelet services.)