SMILA/Project Concepts/Data Model and XML representation

== Description ==
This page describes the data model used in SMILA to represent data objects (records) in workflows. A related goal is to design a service that eases the management of SMILA records during workflow processing.
== Discussion ==
[[User:G.schmidt.brox.de|G.schmidt.brox.de]]: Currently I have some remarks regarding the record interface.
  
=== When to persist data into a storage ===

[[User:G.schmidt.brox.de]]: Several months ago we had a discussion about how to persist information into the search index. At that time I proposed a configuration mechanism for controlling the storage/indexing processes when leaving BPEL. We then moved the discussion in the direction of using BPEL pipelets for e.g. indexing purposes, because that way we are free to configure when and where to use this option. From my point of view such operations should follow a general paradigm: either we use a configurable process by default for indexing/storage at the end of BPEL processing, or we use pipelets for this case. Please share your thoughts.

* [[User:Juergen.schumacher.empolis.com|Juergen.schumacher.empolis.com]] Yes, that is probably another valid way to configure it. It would be a minor change: instead of writing the records back to the persistence layer at the commit after each workflow (which also invalidates the blackboard content), we could introduce a relatively simple pipelet to trigger the commit, after which the blackboard content must stay valid until the router has finished processing the records. One would then have more control over record persistence (the current concept causes each processed record to be persisted after each successfully finished workflow). The reason why I put this outside of BPEL was that I distinguished between "infrastructure" elements that are required to run with each workflow and "application" elements that differ between setups. To me, "insert into index" is mainly an "application" element, as I can think of SMILA setups that are not used to build search indices. In my picture, persistence is an "infrastructure" element: to be able to chain several workflows via queues, the blackboard content must be persisted after each single workflow so that the next workflow can access the result (strictly speaking, this is not necessary after final workflows that are not followed by others). So I thought it would be safer to enforce record persistence this way, and that a workflow creator can then concentrate on the "create my application workflow" side of his problem instead of the "make my application work" side. If the team is more in favor of a more flexible solution, no problem. Just vote here (-:

=== Handling of large attachments ===

* We are not able to return large data, such as videos or large XML data, due to the lack of a stream interface for attachments.
** [[User:Juergen.schumacher.empolis.com|Juergen.schumacher.empolis.com]] Yes, that's right. The problem is how to handle streams when sending a record with attachments via remote interfaces (I think we wanted to allow Crawler -> Controller; in any case it must be possible in the communication Controller -> Connectivity). You cannot send the stream around then; the receiver might not even be able to access the actual object, and even a callback from receiver to sender as in Ivan's proposal below might not be possible. Any idea how to handle this is appreciated. Maybe using blackboard services in Crawler components would be possible, because the blackboard supports pushing attachments as a stream directly to its bin storage. In this case a record could be transferred e.g. from CrawlerController to Connectivity by first pushing it from the CC blackboard to the Connectivity blackboard and then sending only the record ID to Connectivity. Just an idea.

=== Record IDs for crawled XML records ===

* I am developing a crawler that returns XML. This crawler is able to crawl our Berkeley DB storage, so I am able to return full Record structures. The open question is: how do I convert an embedded ID into an ID for the record? Via the normal XML/IndexOrder syntax I am not able to generate a dynamic key that contains several hierarchies that may change on record level.
** [[User:Juergen.schumacher.empolis.com|Juergen.schumacher.empolis.com]] I'm not sure that I understand your scenario. Why do you need to create a new ID? Is it not possible to just reuse the original ID? The record in your source XML DB still represents the original source object, so it should probably keep its ID. Another possibility would be to create a new ID with the ID hash of the original ID as the key value, because in the XML DB the hash serves as a kind of simple primary key.
*** [[User:G.schmidt.brox.de|G.schmidt.brox.de]] From my point of view you get right to the point. How can I create an ID directly? How is the configuration in the index order affected? E.g. when just copying the ID, how do I set this ID in an MObject/Record object in a way that it is not replaced? Further, I may need a transformation between those IDs, e.g. if I import from Record V1 to Record V2. I do not yet see a way to handle this.

=== Streaming attachments from crawlers ===

[[User:Churkin.ivan.gmail.com|Ivan Churkin]]:
# The Record object should be changed to remove the ability to set an attachment as byte[].
# The crawler developer will only specify the attachment's name in the Record object.
# The Crawler interface should be extended by adding a method:
<source lang="java">
interface Crawler {
  ...
  InputStream getAttachmentStream(int pos, String attachmentName);
}
</source>
# The Crawler Controller, after fetching a Record from the Crawler, will track the attachments and transfer the streams from the Crawler one by one.

[[User:Churkin.ivan.gmail.com|Ivan Churkin]]: I am only wondering... Juergen, you know the SCA features better, please explain: does SCA not support stream callbacks? Is it somehow possible to transfer a stream? If it is not possible, then shame on SCA; it would be better to write a custom TCP/IP based protocol :)

> Maybe using blackboard services in Crawler components would be possible

BTW: the blackboard interface already contains streams as arguments and as return types. So it cannot be bound by SCA already? What is the difference between the Crawler Controller and the Blackboard here?

[[User:Juergen.schumacher.empolis.com|Juergen.schumacher.empolis.com]] Sorry, I currently do not know what happens to methods with streams as arguments or return types in SCA when using different remote bindings. Certainly SCA cannot work wonders, and from my experience handling streams in any RPC protocol is not trivial. In the case of the CrawlerController talking to Connectivity via a web service interface through a firewall that simply does not allow Connectivity to talk back to the Controller (a valid deployment scenario from the very beginning of SMILA) ... what should SCA do there? So, in general we should design our interfaces to be remoting-friendly where they need to be. So far, the Blackboard has not been a major candidate for being accessed by remote clients. This may change, but it may also require a specific remote interface instead of remoting the complete local interface.

[[User:G.schmidt.brox.de|G.schmidt.brox.de]] About SCA and remoting usage: Jürgen, could you please ask the SCA team how to handle those points? (You are absolutely right.) Maybe they have best practices for interface design and so on. Further, we may need to think about a communication proxy for SCA or similar technologies.

[[User:Churkin.ivan.gmail.com|Ivan Churkin]] Many thanks, Juergen. Yes, there are problems if any RPC protocol must be supported. But maybe it is possible to restrict the protocol to a common one? Or is that not allowed here :(?

> firewall that simply does not allow Connectivity talking back to the Controller

Theoretically, remote callbacks may be avoided by caching to a file in an SCA proxy class.

== Technical proposal ==

The purpose of the Blackboard Service is the management of SMILA record data during processing in a SMILA component (Connectivity, Workflow Processor). The problem is that different processing engines could require different physical formats of the record data (see [[SMILA/Project Concepts/Data Model and XML representation]] for a discussion). Because this means either complex implementations of the logical data model or big data conversion problems, the idea is to keep the complete record data only on a "blackboard" which is not pushed through the workflow engine itself, and to extract only a small "workflow object" from the blackboard to feed the workflow engine. This workflow object contains only the part of the complete record data that the workflow engine needs for loop or branch conditions (and the record ID, of course). Thus it should be efficient enough to do the conversion between blackboard and workflow object before and after each workflow service invocation. As a side effect, the blackboard service can hide the handling of record persistence from the services, which makes service development easier.

=== Basics ===

This figure gives an overview of how these services could be composed:

[[Image:Blackboard-Service.png]]

Note that the use of the Blackboard service is not restricted to workflow processing; it can also be used in Connectivity to create the initial SMILA record from the data sent by Crawlers. This way the persistence services are hidden from Connectivity, too.

It is assumed that the workflow engine itself (which will usually be a third-party product) must be embedded into SMILA using some wrapper that translates incoming calls into workflow-specific objects, and service invocations from the workflow into real SMILA service calls. At least with a BPEL engine like ODE it must be done this way. In the following, this wrapper is called the Workflow Integration Service. It will also handle the necessary interaction between the workflow engine and the blackboard (see the next section for details).

For ODE, the use of Tuscany SCA Java would simplify the development of this integration service because it could be based on the BPEL implementation type of Tuscany. However, in the first version we will create a SMILA-specific workflow integration service for ODE that can only orchestrate SMILA pipelets, because the Tuscany BPEL implementation type does not yet support service references (see this [http://mail-archives.apache.org/mod_mbox/ws-tuscany-user/200804.mbox/%3c5a75db780804160846u6161d069p17c09a9422b2da8b@mail.gmail.com%3e mail in the Tuscany user mailing list]).

* Update 2008-04-21: Tuscany is making progress on this: [http://mail-archives.apache.org/mod_mbox/ws-tuscany-dev/200804.mbox/%3c5a75db780804181720n248b697ar419eff7e945c8e36@mail.gmail.com%3e mail in the dev mailing list]

=== Workflow ===

The next picture illustrates how and which data flows through this system:

[[Image:Blackboard-Activity.png]]

In more detail:
  
* Listener receives record from queue.
 
bq. The record usually contains only the ID. In special cases it could optionally  include some small attribute values or annotations that could be used to control routing inside the message broker.
 
* Listener calls blackboard to load record data from persistence service and writes attributes contained in message record to blackboard.
 
* Listener calls workflow service with ID from message record.
 
* Workflow integration creates workflow object for ID.
 
bq. The workflow object uses engine specific classes (e.g. DOM for the ODE BPEL engine) to represent the record ID and some chosen attributes that are needed in the engine for condition testing or computation. Which attributes are included is a configuration option of the workflow integration. In a more advanced version it may be possible to analyse the workflow definition (e.g. the BPEL process) to determine which attributes are needed.
 
* Workflow integration invokes the workflow engine. This causes the following steps to be executed a couple of times:
 
** Workflow engine invokes SMILA service (pipelet). At least for ODE BPEL this means that the engine calls the integration layer which in turn routes the request to the invoked pipelet. So the workflow integration layer receives (potentially modified) workflow objects.
 
** Workflow integration writes workflow objects to the blackboard and creates record IDs. The selected pipelet is called with these IDs.

** Pipelet processes the IDs and manipulates blackboard content. The result is a new list of record IDs (usually identical to the argument list, and usually the list has length 1).
 
** Workflow integration creates new workflow objects from the result IDs and blackboard content and feeds them back to the workflow engine.
 
* Workflow engine finishes successfully and returns a list of workflow objects.
 
bq. If it finishes with an exception instead, the Listener/Router has to invalidate the blackboard for all IDs related to the workflow so that they are not committed back to the storages, and it also has to signal the message broker that the received message has not been processed successfully, so that the message broker can move it to the dead letter queue.
 
* Workflow integration extracts IDs from workflow objects and returns them.
 
* Router creates outgoing messages with message records depending on blackboard content for given IDs.
 
bq. Two things may need configuration here: When to create an outgoing message to which queue (never, always, depending on conditions of attribute values or annotations) - this could also be done in workflow by setting a "nextDestination" annotation for each record ID. And which attributes/annotations are to be included in the message record - if any.
 
* Router commits IDs on blackboard. This writes the blackboard content to the persistence services and invalidates the blackboard content for these IDs.
 
* Router sends outgoing messages to message broker.
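The listener/router cycle described above could be sketched roughly as follows. All type and method names here (`Blackboard`, `WorkflowService`, `MessageBroker`) are simplified assumptions for illustration, not the actual SMILA interfaces:

```java
import java.util.*;

// Hypothetical sketch of one listener/router cycle; names are illustrative only.
public class ListenerRouterSketch {
    interface Blackboard {
        void load(String id);        // load record data from persistence
        void commit(String id);      // persist and invalidate blackboard content
        void invalidate(String id);  // drop content without persisting
    }
    interface WorkflowService {
        List<String> process(List<String> ids) throws Exception;
    }
    interface MessageBroker {
        void send(String queue, String recordId);
    }

    private final Blackboard blackboard;
    private final WorkflowService workflow;
    private final MessageBroker broker;

    public ListenerRouterSketch(Blackboard b, WorkflowService w, MessageBroker m) {
        this.blackboard = b; this.workflow = w; this.broker = m;
    }

    /** Processes a single incoming record ID as outlined in the steps above. */
    public void onMessage(String recordId, String nextQueue) {
        blackboard.load(recordId);                            // listener loads record data
        List<String> resultIds;
        try {
            resultIds = workflow.process(List.of(recordId));  // workflow engine runs
        } catch (Exception e) {
            blackboard.invalidate(recordId);                  // do not persist on failure
            throw new RuntimeException("message goes to the dead letter queue", e);
        }
        for (String id : resultIds) {
            blackboard.commit(id);                            // router persists the record
            broker.send(nextQueue, id);                       // router sends outgoing message
        }
    }
}
```

Note that in the full concept the outgoing message content would have to be assembled before the commit, because committing invalidates the blackboard content for that ID.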
 
  
=== Content on Blackboard ===

The Blackboard contains two kinds of content:

*Records:* All records currently processed in this runtime process. The structure of a record is defined in [[SMILA/Project Concepts/Data Model and XML representation]]. Clients manipulate the records through Blackboard API methods. This way the records are completely under the control of the Blackboard, which may be used in advanced versions for optimised communication with the persistence services.

Records enter the blackboard through one of the following operations:

* create: creates a new record with a given ID. No data is loaded from persistence; if a record with this ID already exists in the storages, it will be overwritten when the created record is committed. E.g. used by Connectivity to initialize the record from incoming data.
* load: loads the record data for the given ID from persistence (or prepares it to be loaded). Used by a client to indicate that it wants to process this record.
* split: creates a fragment of a given record, i.e. the record content is copied to a new ID derived from the given one by adding a fragment name (see [ID Concept] for details).

All these methods should take care of locking the record ID in the storages so that no second runtime process can try to manipulate the same record.

A record is removed from the blackboard with one of these operations:

* commit: all changes are written to the storages before the record is removed. The record is unlocked in the database.
* invalidate: the record is removed from the blackboard. The record is unlocked in the database. If the record was created new (not overwritten) on this blackboard, it should be removed from the storage completely.

*Notes:* Additional temporary data created by pipelets to be used by later pipelets in the same workflow, but not to be persisted in the storages. Notes can be either global or record specific (associated with a record ID). Record specific notes are copied on record splits and removed when the associated record is removed from the blackboard. In any case a note has a name, and the value can be of any serializable Java class so that notes can be accessed from separated services in their own VMs.

bq. A nice extension would be workflow instance specific notes, such that a pipelet can pass non-persistent information to another pipelet invoked later in the workflow. Such a note would not be associated with a single record, but would also not conflict with information from different workflow instances like global notes would (assuming the workflow engine supports multi-threaded execution). It would be removed from the blackboard after the workflow instance has finished. However, it has to be clarified how such notes can be associated with the workflow instance, even when accessed from a remote VM.

=== Considerations ===

What we need:

* A simple API for service developers to work with the records.
* Minimal constraints on what is possible to express.
* Any SMILA component must be able to process every incoming record without knowing about any other component in the installation that may have produced some service specific part of the record. It must also be able to reproduce these elements in its result if they were not explicitly deleted during service execution.
* This means that for service specific classes we cannot even rely on having the same classes in the same version installed in each composite at the same time.
* Records produced and stored with one version state of a SMILA installation must be re-processable with updated versions of the installation (at least if the major version of the framework has not changed).
* A nice XML representation must be possible.
* It must be simple to express XPath queries on objects for conditions in BPEL or message routers.

In my opinion, this means that we cannot have the data model extended by arbitrary service specific classes; instead, we must provide a data model that is able to express everything that a service might want to express. As a later extension we plan to allow the use of user-definable XML streaming for application specific object types, but this will not be implemented in the first version.

=== Physical Data Model ===

{info:Alternative Proposal}
This section has been largely obsoleted by [[SMILA/Project Concepts/Blackboard Service Concept]]. However, I still suggest to define a logical data model using interfaces that hide the physical implementation from the client, in order to make optimized implementations of the data model possible in different parts of the framework.
{info}

Problem: different processing engines require certain Java objects to work on. E.g.:

* The ODE BPEL engine needs to be called with DOM objects.
* ActiveBPEL uses other classes.
* One could think of a SMILA specific processing engine that could use a physical data model implementing the logical data model more efficiently.

Conversion between different physical models can become expensive if it has to be done very often. If a BPEL engine orchestrates a number of SMILA services, for example, it should not be necessary to actually convert the exchanged data objects each time a service is called and each time a service returns its result to the engine. And because the orchestration engine should be replaceable like everything else in the framework, we cannot commit to using e.g. DOM as the physical representation of our data objects, because then we would have conversion issues when using ActiveBPEL.

Proposal:

* Define the logical data model using a set of interfaces and a corresponding XML schema.
* SMILA services access and create data only through these interfaces; they do not need to know about the actual physical data model.
* Provide physical data models that implement these interfaces using appropriate object formats.

E.g. when using ODE as the orchestration engine, use a physical model that represents the data objects as DOM objects. These DOM objects can be passed to the BPEL engine directly. Each time a service is invoked from BPEL, only a small wrapper must be created, and the service can access the DOM objects as logical SMILA objects.

On the other hand, in a crawler or in a queue listener that does not use a BPEL engine, a more efficient implementation of the logical model could be used for better performance.

Data exchange between components that must use different physical data models could most easily be done by using the common XML format for serialization. Also, queue messages would always contain an XML string. Each listener can then decide for itself which implementation to use.

=== Service Interfaces ===

The Blackboard will be implemented as an OSGi service. The interface could look similar to the following definition. It is getting quite big, so maybe it makes sense to divide it into parts (handling of lifecycle, literal values, object values, annotations, notes, attachments?) for better readability. We will see about this when implementing.
  
<source lang="java">
interface Blackboard {
    // record life cycle methods
    void create(ID id) throws BlackboardAccessException;
    void load(ID id) throws BlackboardAccessException;
    ID split(ID id, String fragmentName) throws BlackboardAccessException;
    void commit(ID id) throws BlackboardAccessException;
    void invalidate(ID id);

    // factory methods for attribute values and annotation objects.
    // Literal and Annotation are just interfaces; the blackboard implementation
    // can determine the actual types for optimization.
    Literal createLiteral();
    Annotation createAnnotation();

    // record content methods
    // - record metadata
    //   for referenced types see the interfaces proposed in [[SMILA/Project Concepts/Data Model and XML representation]],
    //   for the string format of an attribute path see the definition of the Path class below.
    // -- basic navigation
    Iterator<String> getAttributeNames(ID id, Path path) throws BlackboardAccessException;
    Iterator<String> getAttributeNames(ID id) throws BlackboardAccessException; // convenience for getAttributeNames(id, null)
    boolean hasAttribute(ID id, Path path) throws BlackboardAccessException;

    // -- handling of literal values
    //    navigation support
    boolean hasLiterals(ID id, Path path) throws BlackboardAccessException;
    int getLiteralsSize(ID id, Path path) throws BlackboardAccessException;
    //    get all literal attribute values of an attribute (index of last step is irrelevant).
    //    a client should not expect the blackboard to reflect changes done to these objects automatically,
    //    but should always call one of the modification methods below to really set the changes.
    List<Literal> getLiterals(ID id, Path path) throws BlackboardAccessException;
    //    get a single attribute value; the index is specified in the last step of the path and defaults to 0.
    Literal getLiteral(ID id, Path path) throws BlackboardAccessException;
    //    modification of attribute values on the blackboard
    void setLiterals(ID id, Path path, List<Literal> values) throws BlackboardAccessException;
    //    set a single literal value, index of last attribute step is irrelevant
    void setLiteral(ID id, Path path, Literal value) throws BlackboardAccessException;
    //    add a single literal value, index of last attribute step is irrelevant
    void addLiteral(ID id, Path path, Literal value) throws BlackboardAccessException;
    //    remove the literal specified by the index in the last step
    void removeLiteral(ID id, Path path) throws BlackboardAccessException;
    //    remove all literals of the specified attribute
    void removeLiterals(ID id, Path path) throws BlackboardAccessException;

    // -- handling of sub-objects
    //    navigation: check if an attribute has sub-objects and get their number.
    boolean hasObjects(ID id, Path path) throws BlackboardAccessException;
    int getObjectSize(ID id, Path path) throws BlackboardAccessException;
    //    remove the sub-object specified by the index in the last step
    void removeObject(ID id, Path path) throws BlackboardAccessException;
    //    remove all sub-objects of the specified attribute
    void removeObjects(ID id, Path path) throws BlackboardAccessException;

    // access the semantic type of sub-object attribute values.
    // semantic types of literals are modified at the literal object.
    String getObjectSemanticType(ID id, Path path) throws BlackboardAccessException;
    void setObjectSemanticType(ID id, Path path, String typename) throws BlackboardAccessException;

    // -- annotations of attributes and sub-objects.
    //    annotations of literals are accessed via the Literal object.
    //    use null, "" or an empty attribute path to access root annotations of the record.
    //    use PathStep.ATTRIBUTE_ANNOTATION as the index in the final step to access the annotation
    //    of the attribute itself.
    Iterator<String> getAnnotationNames(ID id, Path path) throws BlackboardAccessException;
    boolean hasAnnotations(ID id, Path path) throws BlackboardAccessException;
    boolean hasAnnotation(ID id, Path path, String name) throws BlackboardAccessException;
    List<Annotation> getAnnotations(ID id, Path path, String name) throws BlackboardAccessException;
    //    shortcut to get only the first annotation, if one exists.
    Annotation getAnnotation(ID id, Path path, String name) throws BlackboardAccessException;
    void setAnnotations(ID id, Path path, String name, List<Annotation> annotations) throws BlackboardAccessException;
    void setAnnotation(ID id, Path path, String name, Annotation annotation) throws BlackboardAccessException;
    void addAnnotation(ID id, Path path, String name, Annotation annotation) throws BlackboardAccessException;
    void removeAnnotation(ID id, Path path, String name) throws BlackboardAccessException;
    void removeAnnotations(ID id, Path path) throws BlackboardAccessException;

    // - record attachments
    boolean hasAttachment(ID id, String name) throws BlackboardAccessException;
    byte[] getAttachment(ID id, String name) throws BlackboardAccessException;
    InputStream getAttachmentAsStream(ID id, String name) throws BlackboardAccessException;
    void setAttachment(ID id, String name, byte[] attachment) throws BlackboardAccessException;
    void setAttachmentFromStream(ID id, String name, InputStream attachment) throws BlackboardAccessException;

    // - notes methods
    boolean hasGlobalNote(String name) throws BlackboardAccessException;
    Serializable getGlobalNote(String name) throws BlackboardAccessException;
    void setGlobalNote(String name, Serializable object) throws BlackboardAccessException;
    boolean hasRecordNote(ID id, String name) throws BlackboardAccessException;
    Serializable getRecordNote(ID id, String name) throws BlackboardAccessException;
    void setRecordNote(ID id, String name, Serializable object) throws BlackboardAccessException;

    // This is certainly not complete ... just to give an idea of how it could taste.
    // Lots of convenience methods can be added later.
}
</source>

=== Description of Logical Data Model ===

This proposal is based on experiences made with the IAS (the [http://www.empolis.com/en/information_management/ empolis Information Access Suite]) data model. It is intended as a simplification of the IAS model, to overcome problems caused by its over-specification.

Record - top level element
* ID: see [[SMILA/Project Concepts/ID Concept]] for details
* metadata: MetadataObject - the actual data about the document
* attachments: Map<String, byte[]> - additional data not serializable to XML (or too inefficient to serialize), e.g.:
** binary content of documents
** huge annotations

MetadataObject
* attributes: Map<String, Attribute> - data about records according to some application model or ontology
* annotations: Map<String, List<Annotation>> - additional service specific data

Attribute
* name: String
* value: List<MetadataObject|Literal>
* annotations: Map<String, List<Annotation>>

Literal
* semantic type: String
* value: (String | Long | Double | Boolean | Date | Time | DateTime)?
* data type
* annotations: Map<String, List<Annotation>>

Annotation
* anonymous values: List<String>
* named values: Map<String, String>
* annotations: Map<String, List<Annotation>>

=== Java Interfaces of Logical Data Model ===

Note: this is just a preview. The details may be changed during implementation if other variants are more appropriate or convenient.

<source lang="java">
interface Annotation extends Annotatable {
    List<String> getAnonValues();
    void addAnonValue(String value);
    void removeAnonValues();

    void setNamedValue(String name, String value);
    String getNamedValue(String name);
    void removeNamedValues();
}
</source>
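One of the considerations above is that a nice XML representation must be possible. A record following the logical model might serialize roughly like this; all element names, attribute names, and the namespace are illustrative assumptions, not the final schema:

```xml
<Record xmlns="http://www.eclipse.org/smila/record"> <!-- namespace is a guess -->
  <Id>...</Id> <!-- as defined by the ID concept -->
  <Attribute name="Title">
    <Literal><Value>SMILA Data Model</Value></Literal>
  </Attribute>
  <Attribute name="Author"> <!-- multi-valued attribute -->
    <Literal><Value>Jane Doe</Value></Literal>
    <Literal><Value>John Doe</Value></Literal>
  </Attribute>
  <Annotation name="source"> <!-- annotation with a named value -->
    <Value name="crawler">web</Value>
  </Annotation>
  <!-- attachments (e.g. binary document content) are not serialized to XML -->
</Record>
```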
<source lang="java">
public class Path implements Serializable, Iterable<PathStep> {
    // The string format of an attribute path could be something like
    // "attributeName1[index1]/attributeName2[index2]/..." or
    // "attributeName1@index1/attributeName2@index2/...".
    // The first is probably better because it is similar to XPath.
    // The specification of the index is optional and defaults to 0.
    // Whether the index refers to a literal or a sub-object depends on the methods getting the argument.

    public static final char SEPARATOR = '/';

    public Path();
    public Path(Path path);
    public Path(String path);

    // extend the path by more steps. This modifies the object itself and returns it again
    // for further modifications, e.g. path.add("level1").add("level2");
    public Path add(PathStep step);
    public Path add(String attributeName);
    public Path add(String attributeName, int index);

    // remove the tail element of this path. This modifies the object itself and returns it again
    // for further modifications, e.g. path.up().add("siblingAttribute");
    public Path up();

    public Iterator<PathStep> iterator();
    public boolean isEmpty();
    public PathStep get(int positionInPath);
    public String getName(int positionInPath);
    public int getIndex(int positionInPath);
    public int length();

    public boolean equals(Path other);
    public int hashCode();
    public String toString();
}
</source>

<source lang="java">
interface Annotatable {
    boolean hasAnnotations();
    boolean hasAnnotation(String name);

    List<Annotation> getAnnotations(String name);
    Annotation getAnnotation(String name);

    void setAnnotations(String name, List<Annotation> annotations);
    void setAnnotation(String name, Annotation annotation);
    void addAnnotation(String name, Annotation annotation);

    void removeAnnotations(String name);
}
</source>
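To make the proposed path string format concrete, here is a small illustrative parser for the "attributeName[index]/..." variant. It only demonstrates the format; it is not the actual SMILA Path class:

```java
import java.util.*;

// Illustrative parser for the proposed "name[index]/name[index]" path format.
public class PathFormatDemo {
    public static final class Step {
        public final String name;
        public final int index;
        Step(String name, int index) { this.name = name; this.index = index; }
    }

    /** Splits a path string into steps; a missing "[index]" defaults to index 0. */
    public static List<Step> parse(String path) {
        List<Step> steps = new ArrayList<>();
        for (String part : path.split("/")) {
            int bracket = part.indexOf('[');
            if (bracket < 0) {
                steps.add(new Step(part, 0));  // index is optional
            } else {
                String name = part.substring(0, bracket);
                int index = Integer.parseInt(part.substring(bracket + 1, part.length() - 1));
                steps.add(new Step(name, index));
            }
        }
        return steps;
    }
}
```

For example, `parse("title/authors[2]")` yields the steps `(title, 0)` and `(authors, 2)`.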
<source lang="java">
public class PathStep implements Serializable {
    public static final int ATTRIBUTE_ANNOTATION = -1;

    private String name;
    private int index = 0; // index of the value in multivalued attributes; the default is the first value.

    public PathStep(String name);
    public PathStep(String name, int index);

    public String getName();
    public int getIndex();

    public boolean equals(PathStep other);
    public int hashCode();
    public String toString();
}
</source>

<source lang="java">
interface Record {
    ID getID();
    void setID(ID id);

    MObject getMetadata();
    void setMetadata(MObject metadata);

    boolean hasAttachments();
    byte[] getAttachment(String name);
    void putAttachment(String name, byte[] attachment);
}
</source>
 
The business interface of a pipelet to be used in an SMILA workflow will be rather simple. It can get access to the local blackboard service using OSGi service lookup or by injection using OSGi Declarative Services. Therefore the main business method just needs to take a list of record IDs as an argument and return a new (or the same) list of IDs as the result. This is the same method that a workflow integration service needs to expose to the Listener/Router component, therefore it makes sense to use a common interface definition for both. This way it is possible to deploy processing runtimes with only a single pipelet without having to create dummy workflow definitions, because a pipelet can be wired up to the Listener/Router immediately. Becasue remote communication with separated pipelets (see below) will be implemented later pipelets (and therefore workflow integrations, too) must be implemented as OSGi services such that the remote communication can be coordinated using SCA. Thus, interfaces could look like this:
 
  
 
<source lang="java">
 
<source lang="java">
interface RecordProcessor {
+
interface Attribute extends Annotatable {
     ID[] process(ID[] records) throws ProcessingException;
+
     String getName();
 +
 
 +
    boolean hasLiterals();
 +
    int valueSize();
 +
    List<Literal> getLiterals();
 +
    Literal getLiterale(); // return only first value in list, if any
 +
    void addLiteral(Literal literal);
 +
    void removeValues();
 +
 
 +
    boolean hasObjects();
 +
    int ObjectSize();
 +
    List<MObject> getObjects();
 +
    MObject getObject(); // return only first Object in list, if any
 +
    void addObject(MObject object);
 +
    void removeObjects();
 
}
 
}
 
</source>
 
</source>
  
 
<source lang="java">
 
<source lang="java">
interface Pipelet extends RecordProcessor {
+
interface AttributeValue extends Annotatable {
     // specific methods for pipelets
+
     String getSemanticType();
 +
    void setSemanticType(String);
 
}
 
}
 +
</source>
  
 +
<source lang="java">
 +
// MObject is short for Metadata Object
 +
interface MObject extends AttributeValue {
 +
    boolean hasAttributes();
 +
    boolean hasAttribute(String);
 +
    Attribute getAttribute(String);
 +
    void setAttribute(String, Attribute);
 +
}
 
</source>
 
</source>
  
 
<source lang="java">
 
<source lang="java">
interface WorkflowIntegration extends RecordProcessor {
+
interface Literal extends AttributeValue {
     // specific methods for workflow integration services.
+
     boolean hasValue();
 +
 
 +
    Object getValue();
 +
    String getStringValue(); // return toString() of value, if not a string
 +
    // other type specific methods return null, if value is not of requested type
 +
    Long getIntValue();
 +
    Double getFPValue();
 +
    Boolean getBoolValue();
 +
    Date getDateTimeValue();
 +
 
 +
    void setValue(Object) throws InvalidArgumentException;
 +
    void setStringValue(String);
 +
    void setIntValue(Long);
 +
    void setFPValue(Double);
 +
    void setBoolValue(Boolean);
 +
    void setDateTimeValue(Date);
 
}
 
}
 
</source>
 
</source>
  
 +
=== XML Schema of Logical Data Model ===
  
=== What about pipelets running in a seperate VM? ===
+
XML Schema design by example
  
*Not relevant for initial implementation. This will be added in advanced versions and discussed in more detail then.*
+
The following XML snippet illustrates how to possibly represent this data model in XML. This section should be seen as experimental.
  
We want to be able have pipelets running in separated VM if they are known to be unstable or non-terminating in error conditions. This can be supported by the blackboard service like this:
+
The XML schema is targeted at being relatively easy to use for XPath expressions in BPEL processes or elsewhere. The element and attribute have been abbreviated in order to minimze the length on the resulting document. This should have an positive impact on communication overhead and processing performance (of course, in reality also whitespace (linefeeds, indentation) should be left out).
  
[[Image:Blackboard-SeparatedService.png]]
+
The annotations used as examples are motivated by 0often used IAS properties.
 +
<source lang="xml">
 +
<RecordList xmlns="http://www.eclipse.org/smila/record" xmlns:id="http://www.eclipse.org/smila/id"
 +
xmlns:rec="http://www.eclipse.org/smila/record" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 +
xsi:schemaLocation="http://www.eclipse.org/smila/record record.xsd ">
 +
 
 +
<Record version="1.0">
 +
<id:ID version="1.0">
 +
<id:Source>share</id:Source>
 +
<id:Key>some.html</id:Key>
 +
</id:ID>
 +
<A n="mimetype">
 +
<!-- IAS retrieval filter: annotation attached to attribute, valid for complete attribute value -->
 +
<An n="filter">
 +
<V n="type">exclude</V>
 +
<An n="values">
 +
<V>text/plain</V>
 +
<V>text/html</V>
 +
</An>
 +
</An>
 +
<L>
 +
<V>text/html</V>
 +
<V st="appl:Mimetype">text/html</V>
 +
</L>
 +
</A>
 +
<A n="filesize"><!-- single numeric value attribute -->
 +
<L>
 +
<V t="int">1234</V>
 +
</L>
 +
</A>
 +
<A n="trustee"><!-- multivalued attribute without annotation for each value -->
 +
<L>
 +
<V>group1</V>
 +
<V>group2</V>
 +
</L>
 +
</A>
 +
<A n="topic"><!-- multivalued attribute with simple values with annotations -->
 +
<An n="importance"><!-- IAS query boost factor, refers to complete attribute -->
 +
<V>4.0</V>
 +
</An>
 +
<L>
 +
<V>Eclipse</V><!-- first value -->
 +
<An n="sourceRef"><!-- part of IAS textminer info for first value-->
 +
<V n="attribute">fulltext</V>
 +
<V n="startPos">37</V>
 +
<V n="endPos">42</V>
 +
</An>
 +
<An n="sourceRef">
 +
<V n="attribute">fulltext</V>
 +
<V n="startPos">137</V>
 +
<V n="endPos">142</V>
 +
</An>
 +
<An n="importance"><!-- extra IAS query boost factor for first value -->
 +
<V>2.0</V>
 +
</An>
 +
</L>
 +
<L>
 +
<V>SMILA</V><!-- second attribute value -->
 +
<An n="sourceRef"><!-- following annotations refer to second value -->
 +
<!-- similar to above -->
 +
</An>
 +
</L>
 +
</A>
 +
<A n="author"><!-- "set of aggregates" -->
 +
<O>
 +
<A n="firstName">
 +
<L>
 +
<V>Igor</V>
 +
</L>
 +
</A>
 +
<A n="lastName">
 +
<L>
 +
<V>Novakovic</V>
 +
</L>
 +
</A>
 +
</O>
 +
<O st="appl:Author">
 +
<A n="firstName">
 +
<L>
 +
<V>Georg</V>
 +
</L>
 +
</A>
 +
<A n="lastName">
 +
<L>
 +
<V>Schmidt</V>
 +
</L>
 +
</A>
 +
</O>
 +
</A>
 +
 
 +
<An n="action">
 +
<V>update</V>
 +
</An>
 +
 
 +
<Attachment>content</Attachment><!-- just a marker that an attachment exists in attachment store? -->
 +
<Attachment>fulltext</Attachment>
 +
</Record>
 +
</RecordList>
 +
</source>
 +
Some notes
 +
* <code><L></code> can contain multiple <code><V></code>, if the single values do not have annotations
 +
* The <code>st</code> attribute in <L> and <O> means some application specific "semantic" type while the <code>t</code> attribute in <code><V></code> means the native datatype of this value.
 +
* The version attribute is for parsers todo conversion between older XML formats and the current supported format if necessary.
 +
* The data model will be extended later to support XML streaming of user-definable object types, either as attribute values or in an extra part of the record.
  
The seperated pipelet VM would have a proxy blackboard service that coordinates the communication with the master blackboard in the workflow processor VM. Only the record ID needs to be sent to the separated pipelets. However, the separated pipelet must be wrapped to provide control of the record life cycle on the proxy blackboard, especially because the changes done in the remote blackboard must be committed back to the master blackboard when the separated pipelet has finished successful, or the proxy blackboard content must be invalidated without commit in case of an pipelet error. Possibly, this pipelet wrapper can also provide "watchdog" functionality to monitor the separated pipelet and terminate and restart it in case of endless loops or excessive memory consumption.
+
[[Category:SMILA]]

Latest revision as of 11:00, 11 November 2008

Description

This page describes the data model used in SMILA to represent data objects (records) in workflows.

Discussion

G.schmidt.brox.de: Currently I have some remarks regarding the record interface.

  • We are not able to return large data, such as videos or large XML data, due to the lack of a stream interface for attachments.
    • Juergen.schumacher.empolis.com Yes, that's right. The problem is how to handle streams when sending a record with attachments via remote interfaces (I think we wanted to allow this for Crawler -> Controller; in any case it must be possible in the communication Controller -> Connectivity). You cannot send the stream around then; the receiver might not even be able to access the actual object, and even a callback from receiver to sender as in Ivan's proposal below might not be possible. Any idea how to handle this is appreciated. Maybe using blackboard services in Crawler components would be possible, because the blackboard supports pushing attachments as streams directly to its bin storage. In this case a record could be transferred e.g. from CrawlerController to Connectivity by first pushing it from the CC-Blackboard to the Connectivity-Blackboard and then sending only the record ID to Connectivity. Just an idea.
  • I am developing a crawler that returns XML. This crawler is able to crawl our Berkeley DB storage. This way I am able to return full Record structures. The open question is: how do I convert an embedded ID into an ID for the record? Via normal XML/IndexOrder syntax I am not able to generate a dynamic key that contains several hierarchies that may change at record level.
    • Juergen.schumacher.empolis.com I'm not sure that I understand your scenario. Why do you need to create a new ID? Is it not possible just to reuse the original ID? The record in your source XML DB still represents the original source object, so probably it should keep its ID. Another possibility would be to create a new ID with the ID hash of the original ID as the key value, because in the XML DB the hash serves as a kind of simple primary key.
      • G.schmidt.brox.de From my point of view you get right to the point. How can I create an ID directly? How is the configuration at index order affected? E.g. when just copying the ID, how do I set this ID in an MObject/Record object in a way that it is not replaced? Further, I may need a transformation between those IDs, e.g. if I have an import from Record V1 to Record V2. I do not yet see a way to handle this.


Ivan Churkin:

  1. The Record object should be changed to remove the ability to set an attachment as byte[].
  2. The crawler developer will only put the attachment's name into the Record object.
  3. The Crawler interface should be improved by adding a method:
 interface Crawler {
     ...
     InputStream getAttachmentStream(int pos, String attachmentName);
 }
  4. The Crawler Controller, after fetching a Record from the Crawler, will track the attachments and transfer their streams from the Crawler one by one.
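This proposal could be sketched roughly as follows. All names besides `getAttachmentStream` are hypothetical illustrations, not the actual SMILA implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical sketch of the proposal above: the record only carries the
// attachment names, and the controller pulls each attachment as a stream
// from the crawler, one by one, after fetching the record.
interface Crawler {
    List<String> getAttachmentNames(int pos); // assumed helper, not part of the proposal
    InputStream getAttachmentStream(int pos, String attachmentName);
}

class AttachmentTransfer {
    // Copies one attachment stream into a byte buffer; a real controller
    // would push the stream into binary storage instead of memory.
    static byte[] fetch(Crawler crawler, int pos, String name) throws IOException {
        try (InputStream in = crawler.getAttachmentStream(pos, name)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

The point of the design is that the crawler never materializes attachment bytes inside the record; the controller decides when and where each stream is consumed.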

Ivan Churkin: I am only wondering... Juergen, you know the SCA features better, please explain. Does SCA not support stream callbacks? Is it possible somehow to transfer a stream? If it is not possible, then shame on SCA; it would be better to write a custom TCP/IP-based protocol :)

>Maybe using blackboard services in Crawler components would be possible

BTW: the blackboard interface already contains streams as arguments and as return types. So it cannot be bound by SCA already? What is the difference between CC and Blackboard here?

Juergen.schumacher.empolis.com Sorry, I currently do not know what happens with methods that have streams as arguments or return types in SCA when using different remote bindings. Certainly SCA cannot do wonders, and from my experience handling streams in any RPC protocol is not trivial. In the case of the CrawlerController talking to Connectivity via a WebService interface through a firewall that simply does not allow Connectivity to talk back to the Controller (a valid deployment scenario from the very beginning of SMILA) ... what should SCA do there? So, in general we should design our interfaces to be remoting-friendly where they need to be. So far, the Blackboard has not been a major candidate for being accessed by remote clients. This may change, but it also may require a specific remote interface instead of remoting the complete local interface.

G.schmidt.brox.de About SCA and remoting usage: Jürgen, could you please ask the SCA team how to handle those points? (You are absolutely right.) Maybe they have best practices for interface design and so on. Further, we may think about a probable need for a communication proxy for SCA or similar technologies.

Ivan Churkin Many thanks, Juergen. Yes, there are problems if any RPC protocol has to be supported. But maybe it is possible to restrict the protocol to a common one? Or is that not allowed here? :(

>firewall that simply does not allow Connectivity talking back to the Controller

Theoretically, remote callbacks may be avoided by the SCA proxy class caching to a file.

Technical proposal

Considerations

What we need:

  • Simple API for service developers to work with the records.
  • Minimal constraints on what is possible to express.
  • Any SMILA component must be able to process every incoming record without knowing about any other component in the installation that may have produced some service-specific part of the record. It must also be able to reproduce these elements in its result if they were not explicitly deleted during service execution.
  • This means that for service-specific classes we cannot even rely on having the same classes in the same version installed in each composite at the same time.
  • Records produced and stored with one version state of an SMILA installation must be re-processable with updated versions of the installation (at least as long as the major version of the framework has not changed).
  • A nice XML representation must be possible.
  • It must be simple to express XPath queries on objects for conditions in BPEL or message routers.

In my opinion this means that we cannot have the data model extended by any service-specific classes; instead, we must provide a data model that is able to express everything that a service might want to express. As a later extension we plan to allow the use of user-definable XML streaming for application-specific object types, but this will not be implemented in the first version.

Physical Data Model

Alternative proposal: This section has been largely obsoleted by SMILA/Project Concepts/Blackboard Service Concept. However, I still suggest defining a logical data model using interfaces that hide the physical implementation from the client, in order to make optimized implementations of the data model possible in different parts of the framework.

Problem: Different processing engines require working with certain Java objects. E.g.:

  • The ODE BPEL engine needs to be called with DOM objects.
  • ActiveBPEL uses other classes.
  • One could think of a SMILA specific processing engine that could use a physical data model that implements the logical data model more efficiently.

Conversion between different physical models can become expensive if it has to be done very often. This means, e.g., that if a BPEL engine is used to orchestrate a number of SMILA services, it should not be necessary to actually convert the exchanged data objects each time a service is called and each time a service returns its result to the engine. And because the orchestration engine should be replaceable like everything else in the framework, we cannot commit to using e.g. DOM as the physical representation of our data objects, because then we would have conversion issues when using ActiveBPEL.

Proposal:

  • Define logical data model using a set of interfaces and a corresponding XML schema.
  • SMILA services access and create data only by using these interfaces; they do not need to know about the actual physical data model.
  • Provide physical data models that implement these interfaces using appropriate object formats.

E.g. when using ODE as the orchestration engine, use a physical model that uses DOM to represent the data objects. These DOM objects can be passed to the BPEL engine directly. Each time a service is invoked from BPEL, only a small wrapper must be created, and the service can access the DOM objects as logical SMILA objects.

On the other hand, in a crawler or in a queue listener that does not use a BPEL engine, a more efficient implementation of the logical model could be used for better performance.

Data exchange between components that require different physical data models can most easily be done by using the common XML format for serialization. Also, queue messages would always contain an XML string. Each listener can then decide for itself which implementation to use.
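As an illustration of the DOM-backed physical model idea, here is a minimal sketch (hypothetical names, not the actual SMILA implementation) of a wrapper that exposes logical access over the abbreviated XML record elements while leaving the underlying DOM tree intact for the BPEL engine:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of a DOM-backed physical model: the wrapper gives
// logical access to <A n="..."> attribute elements of a record, while the
// wrapped DOM tree can still be handed to the BPEL engine unchanged.
class DomMObject {
    private final Element element;

    DomMObject(Element element) { this.element = element; }

    // Returns the first literal value of the named attribute, or null.
    String getLiteralValue(String attributeName) {
        NodeList attrs = element.getElementsByTagName("A");
        for (int i = 0; i < attrs.getLength(); i++) {
            Element a = (Element) attrs.item(i);
            if (attributeName.equals(a.getAttribute("n"))) {
                NodeList values = a.getElementsByTagName("V");
                return values.getLength() > 0 ? values.item(0).getTextContent() : null;
            }
        }
        return null;
    }
}
```

A service invoked from BPEL would then call `getLiteralValue("mimetype")` instead of navigating DOM nodes itself, while the engine keeps working on the very same document, so no conversion happens at the service boundary.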

Description of Logical Data Model

This proposal is based on experiences made with the IAS (the empolis Information Access Suite) data model. It is intended as a simplification of the IAS model, to overcome problems caused by its over-specification.

Record - Top level element

  • ID: see SMILA/Project Concepts/ID Concept for details
  • metadata: Metadata Object - the actual data about the document
  • attachments: Map<String, byte[]> - additional data not serializable to XML (or too inefficient), e.g.:
    • binary content of documents
    • Huge annotations

MetadataObject

  • attributes: Map<String, Attribute> - data about records according to some application model or ontology
  • annotations: Map<String, List<Annotation>> - additional service specific data

Attribute

  • name: String
  • value: List<MetadataObject|Literal>
  • annotations: Map<String, List<Annotation>>

Literal

  • semantic type: String
  • value: (String | Long | Double | Boolean | Date | Time | DateTime)?
  • data type
  • annotations: Map<String, List<Annotation>>

Annotation

  • anonymous values: List<String>
  • named values: Map<String, String>
  • annotations: Map<String, List<Annotation>>

Java Interfaces of Logical Data Model

Note: This is just a preview. The details may be changed during implementation if other variants are more appropriate or convenient.

interface Annotation extends Annotatable {
    List<String> getAnonValues();
    void addAnonValue(String value);
    void removeAnonValues();
 
    void setNamedValue(String name, String value);
    String getNamedValue(String name);
    void removeNamedValues();
}
interface Annotatable {
    boolean hasAnnotations();
    boolean hasAnnotation(String name);
 
    List<Annotation> getAnnotations(String name);
    Annotation getAnnotation(String name);
 
    void setAnnotations(String name, List<Annotation> annotations);
    void setAnnotation(String name, Annotation annotation);
    void addAnnotation(String name, Annotation annotation);
 
    void removeAnnotations(String name);
}
interface Record {
    ID getID();
    void setID(ID id);
 
    MObject getMetadata();
    void setMetadata(MObject metadata);
 
    boolean hasAttachments();
    byte[] getAttachment(String name);
    void putAttachment(String name, byte[] attachment);
}
interface Attribute extends Annotatable {
    String getName();
 
    boolean hasLiterals();
    int valueSize();
    List<Literal> getLiterals();
    Literal getLiteral(); // return only first value in list, if any
    void addLiteral(Literal literal);
    void removeValues();
 
    boolean hasObjects();
    int objectSize();
    List<MObject> getObjects();
    MObject getObject(); // return only first object in list, if any
    void addObject(MObject object);
    void removeObjects();
}
interface AttributeValue extends Annotatable {
    String getSemanticType();
    void setSemanticType(String semanticType);
}
// MObject is short for Metadata Object
interface MObject extends AttributeValue {
    boolean hasAttributes();
    boolean hasAttribute(String name);
    Attribute getAttribute(String name);
    void setAttribute(String name, Attribute attribute);
}
interface Literal extends AttributeValue {
    boolean hasValue();
 
    Object getValue();
    String getStringValue(); // return toString() of value, if not a string
    // other type-specific methods return null, if value is not of requested type
    Long getIntValue();
    Double getFPValue();
    Boolean getBoolValue();
    Date getDateTimeValue();
 
    void setValue(Object value) throws IllegalArgumentException;
    void setStringValue(String value);
    void setIntValue(Long value);
    void setFPValue(Double value);
    void setBoolValue(Boolean value);
    void setDateTimeValue(Date value);
}
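To make the Literal convention concrete, here is a minimal in-memory sketch. `SimpleLiteral` is a hypothetical illustration (not the SMILA implementation, and not a full implementation of the interface above): the type-specific getters return null when the stored value is not of the requested type, and setValue rejects unsupported types.

```java
import java.util.Date;

// Minimal in-memory illustration of the Literal contract sketched above:
// getters for a specific type return null on a type mismatch, while
// getStringValue() falls back to toString() of whatever value is stored.
class SimpleLiteral {
    private Object value;

    void setValue(Object value) {
        if (!(value instanceof String || value instanceof Long
                || value instanceof Double || value instanceof Boolean
                || value instanceof Date)) {
            throw new IllegalArgumentException("unsupported literal type: " + value);
        }
        this.value = value;
    }

    boolean hasValue() { return value != null; }
    Object getValue() { return value; }

    String getStringValue() { return value == null ? null : value.toString(); }
    Long getIntValue() { return value instanceof Long ? (Long) value : null; }
    Double getFPValue() { return value instanceof Double ? (Double) value : null; }
    Boolean getBoolValue() { return value instanceof Boolean ? (Boolean) value : null; }
    Date getDateTimeValue() { return value instanceof Date ? (Date) value : null; }
}
```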

XML Schema of Logical Data Model

XML Schema design by example

The following XML snippet illustrates one possible way to represent this data model in XML. This section should be seen as experimental.

The XML schema is targeted at being relatively easy to use for XPath expressions in BPEL processes or elsewhere. The element and attribute names have been abbreviated in order to minimize the length of the resulting document. This should have a positive impact on communication overhead and processing performance (of course, in reality whitespace (line feeds, indentation) should also be left out).

The annotations used as examples are motivated by often-used IAS properties.

<RecordList xmlns="http://www.eclipse.org/smila/record" xmlns:id="http://www.eclipse.org/smila/id"
	xmlns:rec="http://www.eclipse.org/smila/record" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.eclipse.org/smila/record record.xsd ">
 
	<Record version="1.0">
		<id:ID version="1.0">
			<id:Source>share</id:Source>
			<id:Key>some.html</id:Key>
		</id:ID>
		<A n="mimetype">
			<!-- IAS retrieval filter: annotation attached to attribute, valid for complete attribute value -->
			<An n="filter">
				<V n="type">exclude</V>
				<An n="values">
					<V>text/plain</V>
					<V>text/html</V>
				</An>
			</An>
			<L>
				<V>text/html</V>
				<V st="appl:Mimetype">text/html</V>
			</L>
		</A>
		<A n="filesize"><!-- single numeric value attribute -->
			<L>
				<V t="int">1234</V>
			</L>
		</A>
		<A n="trustee"><!-- multivalued attribute without annotation for each value -->
			<L>
				<V>group1</V>
				<V>group2</V>
			</L>
		</A>
		<A n="topic"><!-- multivalued attribute with simple values with annotations -->
			<An n="importance"><!-- IAS query boost factor, refers to complete attribute -->
				<V>4.0</V>
			</An>
			<L>
				<V>Eclipse</V><!-- first value -->
				<An n="sourceRef"><!-- part of IAS textminer info for first value-->
					<V n="attribute">fulltext</V>
					<V n="startPos">37</V>
					<V n="endPos">42</V>
				</An>
				<An n="sourceRef">
					<V n="attribute">fulltext</V>
					<V n="startPos">137</V>
					<V n="endPos">142</V>
				</An>
				<An n="importance"><!-- extra IAS query boost factor for first value -->
					<V>2.0</V>
				</An>
			</L>
			<L>
				<V>SMILA</V><!-- second attribute value -->
				<An n="sourceRef"><!-- following annotations refer to second value -->
					<!-- similar to above -->
				</An>
			</L>
		</A>
		<A n="author"><!-- "set of aggregates" -->
			<O>
				<A n="firstName">
					<L>
						<V>Igor</V>
					</L>
				</A>
				<A n="lastName">
					<L>
						<V>Novakovic</V>
					</L>
				</A>
			</O>
			<O st="appl:Author">
				<A n="firstName">
					<L>
						<V>Georg</V>
					</L>
				</A>
				<A n="lastName">
					<L>
						<V>Schmidt</V>
					</L>
				</A>
			</O>
		</A>
 
		<An n="action">
			<V>update</V>
		</An>
 
		<Attachment>content</Attachment><!-- just a marker that an attachment exists in attachment store? -->
		<Attachment>fulltext</Attachment>
	</Record>
</RecordList>

Some notes:

  • <L> can contain multiple <V> elements if the single values do not have annotations.
  • The st attribute in <L> and <O> denotes some application-specific "semantic" type, while the t attribute in <V> denotes the native datatype of the value.
  • The version attribute enables parsers to perform conversion between older XML formats and the currently supported format, if necessary.
  • The data model will be extended later to support XML streaming of user-definable object types, either as attribute values or in an extra part of the record.
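To illustrate the intended XPath-friendliness of the abbreviated element names, the following sketch (a hypothetical helper; namespace handling is omitted for brevity) reads an attribute value from a record fragment with a single short path expression:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Sketch showing how the short element names keep XPath expressions in
// BPEL conditions compact; namespace handling is omitted for brevity.
class XPathDemo {
    static String readAttributeValue(String xml, String attributeName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes()));
        // e.g. /Record/A[@n='filesize']/L/V for the filesize example above
        return XPathFactory.newInstance().newXPath()
            .evaluate("/Record/A[@n='" + attributeName + "']/L/V", doc);
    }
}
```

With the string return type of evaluate(), a missing attribute simply yields an empty string, which is convenient for conditions in BPEL or message routers.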