SMILA/Project Concepts/Data Model and XML representation

Description

This page describes the data model used in SMILA to represent data objects (records) in workflows.

Discussion

G.schmidt.brox.de: Currently I have some remarks regarding the record interface.

  • We are not able to return large data, such as videos or large xml data, due to the lack of a stream interface for attachments.
    • Juergen.schumacher.empolis.com Yes, that's right. The problem is how to handle streams when sending a record with attachments via remote interfaces (I think we wanted to allow Crawler -> Controller; in any case it must be possible in the communication Controller -> Connectivity). You cannot send the stream around then: the receiver might not even be able to access the actual object, and even a callback from receiver to sender, as in Ivan's proposal below, might not be possible. Any idea how to handle this is appreciated. Maybe using blackboard services in Crawler components would be possible, because the blackboard supports pushing attachments as streams directly to its bin storage. In this case a record could be transferred, e.g. from CrawlerController to Connectivity, by first pushing it from the CC-Blackboard to the Connectivity-Blackboard and then sending only the record ID to Connectivity. Just an idea.
  • I am developing a crawler that returns XML. This crawler is able to crawl our Berkeley DB storage. This way I am able to return full Record structures. The open question is: how do I convert an embedded ID into an ID for the record? Via normal XML/IndexOrder syntax I am not able to generate a dynamic key that contains several hierarchies that may change on record level.
    • Juergen.schumacher.empolis.com I'm not sure that I understand your scenario. Why do you need to create a new ID? Is it not possible just to reuse the original ID? The record in your source XML DB still represents the original source object, so probably it should keep its ID. Another possibility would be to create a new ID with the ID hash of the original ID as the key value, because in the XML DB the hash serves as a kind of simple primary key.
      • G.schmidt.brox.de From my point of view you get right to the point. How can I create an ID directly? How is the configuration in the index order affected? E.g. when just copying the ID, how do I set this ID in an MObject/Record object in a way that it is not replaced? Further, I may need a transformation between those IDs, e.g. if I have an import from Record V1 to Record V2. I do not yet see a way to handle this.


Ivan Churkin:

  1. The Record object should be changed so that an Attachment can no longer be set as byte[].
  2. The crawler developer will only specify the Attachment's name in the Record object.
  3. The Crawler interface should be extended by adding a method:
 interface Crawler {
   ....
   InputStream getAttachmentStream(int pos, String attachmentName);
 }
  4. The Crawler Controller, after fetching a Record from the Crawler, will track Attachments and transfer the streams from the Crawler one by one.
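
The controller side of this proposal could be sketched roughly as follows. Only getAttachmentStream(int pos, String attachmentName) comes from the discussion; StreamingCrawler, AttachmentStore and transferAttachments are hypothetical names chosen for illustration, and the sketch ignores the remoting problem raised above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reduction of the proposed Crawler interface to the one new method.
interface StreamingCrawler {
    InputStream getAttachmentStream(int pos, String attachmentName) throws IOException;
}

// Hypothetical sink standing in for whatever storage the controller writes to.
interface AttachmentStore {
    OutputStream open(int pos, String attachmentName) throws IOException;
}

final class AttachmentTransferSketch {
    // Copies the named attachments of the record at position pos one by one,
    // never holding more than one buffer of data in memory.
    // Returns the number of bytes copied per attachment name.
    static Map<String, Long> transferAttachments(StreamingCrawler crawler, AttachmentStore store,
            int pos, List<String> attachmentNames) throws IOException {
        Map<String, Long> copied = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        for (String name : attachmentNames) {
            long total = 0;
            try (InputStream in = crawler.getAttachmentStream(pos, name);
                 OutputStream out = store.open(pos, name)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                    total += n;
                }
            }
            copied.put(name, total);
        }
        return copied;
    }
}
```

The record itself would then carry only the attachment names, and the byte content would never have to be materialized as a byte[].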

Ivan Churkin: I am only wondering... Juergen, you know the SCA features better, so please explain: doesn't SCA support stream callbacks? Is it possible to transfer a stream somehow? If it's not possible, then shame on SCA, it's better to write a custom TCP/IP-based protocol :)

>Maybe using blackboard services in Crawler components would be possible

BTW: the blackboard interface already contains streams as arguments and as return types. So can it not be bound by SCA already? What is the difference between CC and Blackboard here?

Juergen.schumacher.empolis.com Sorry, I currently do not know what happens with methods with streams as arguments or return types in SCA when using different remote bindings. Certainly SCA cannot work wonders, and from my experience handling streams in any RPC protocol is not trivial. In the case of CrawlerController talking to Connectivity via a WebService interface through a firewall that simply does not allow Connectivity talking back to the Controller (a valid deployment scenario from the very beginning of SMILA) ... what should SCA do there? So, in general we should design our interfaces to be remoting-friendly where they need to be. So far, Blackboard has not been a major candidate for being accessed by remote clients. This may change, but it may also require a specific remote interface instead of remoting the complete local interface.

G.schmidt.brox.de About SCA and remoting usage: Jürgen, could you please ask the SCA team how to handle those points? (You are absolutely right.) Maybe they have best practices for interface design and so on. Further, we may think about a probable need for a communication proxy for SCA or similar technologies.

Ivan Churkin Many thanks Juergen, yes, there are problems if any RPC protocol has to be supported. But maybe it's possible to restrict the protocol to a common one? Or is that not allowed here :(?

>firewall that simply does not allow Connectivity talking back to the Controller

Theoretically, remote callbacks may be avoided by caching to a file in an SCA proxy class.

Technical proposal

Considerations

What we need:

  • Simple API for service developers to work with the records.
  • Minimal constraints on what is possible to express
  • Any SMILA component must be able to process every incoming record without knowing about any other component in the installation that may have produced some service specific part of the record. It must also be able to reproduce these elements in its result if they were not explicitly deleted during service execution.
  • This means that for service specific classes we cannot even rely on having the same classes in the same version installed in each composite at the same time.

  • Records produced and stored with one version of a SMILA installation must also be re-processable with updated versions of the installation (at least if the major version of the framework has not changed).
  • Nice XML representation possible
  • Simple to express XPath queries on objects for conditions in BPEL or message routers.

In my opinion, this means that we cannot extend the data model with any service specific classes; instead we must provide a data model that is able to express everything that a service might want to express. As a later extension we plan to allow the use of user-definable XML streaming for application specific object types, but this will not be implemented in the first version.

Physical Data Model

Note (Alternative Proposal): This section has largely been obsoleted by SMILA/Project Concepts/Blackboard Service Concept. However, I still suggest defining a logical data model using interfaces to hide the physical implementation from the client, in order to make optimized implementations of the data model possible in different parts of the framework.

Problem: Different processing engines need to work on specific Java objects, e.g.:

  • The ODE BPEL engine needs to be called with DOM objects.
  • ActiveBPEL uses other classes.
  • One could think of a SMILA specific processing engine that could use a physical data model that implements the logical data model more efficiently.

Conversion between different physical models can become expensive if it has to be done very often. This means, e.g., that if a BPEL engine is used to orchestrate a number of SMILA services, it should not be necessary to convert the exchanged data objects each time a service is called and each time a service returns its result to the engine. And because the orchestration engine should be replaceable like everything else in the framework, we cannot commit to using e.g. DOM as the physical representation of our data objects, because then we would have conversion issues when using ActiveBPEL.

Proposal:

  • Define logical data model using a set of interfaces and a corresponding XML schema.
  • SMILA services access and create data only by using these interfaces; they do not need to know about the actual physical data model.
  • Provide physical data models that implement these interfaces using appropriate object formats.

E.g. when using ODE as the orchestration engine, use a physical model that uses DOM to represent the data objects. These DOM objects can be passed to the BPEL engine directly. Each time a service is invoked from BPEL, only a small wrapper must be created and the service can access the DOM objects as logical SMILA objects.
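
The wrapper idea can be sketched as follows. This is a minimal illustration, not SMILA code: SimpleLiteral, DomLiteral and DomWrapperSketch are hypothetical names, and SimpleLiteral is a drastically reduced stand-in for the logical interfaces. The element name V is borrowed from the XML example further down this page.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical, heavily reduced logical interface: all a service gets to see.
interface SimpleLiteral {
    String getStringValue();
    void setStringValue(String value);
}

// Thin wrapper: holds no state of its own, every call goes straight to the DOM element.
final class DomLiteral implements SimpleLiteral {
    private final Element element;

    DomLiteral(Element element) {
        this.element = element;
    }

    @Override
    public String getStringValue() {
        return element.getTextContent();
    }

    @Override
    public void setStringValue(String value) {
        element.setTextContent(value);
    }
}

final class DomWrapperSketch {
    // Builds a one-element document <V>text</V> to stand in for data owned by the engine.
    static Element newValueElement(String text) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element v = doc.createElement("V");
        v.setTextContent(text);
        doc.appendChild(v);
        return v;
    }
}
```

Because the wrapper only holds a reference, creating it per service call is cheap, and changes made through the logical interface are immediately visible in the DOM object that the engine holds.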

On the other hand, in a crawler or in a queue listener that does not use a BPEL engine, a more efficient implementation of the logical model could be used for better performance.

Data exchange between components that require different physical data models is most easily done by using the common XML format for serialization. Also, queue messages would always contain an XML string. Each listener can then decide for itself which implementation to use.
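
Exchange via an XML string could look like the following sketch, which uses only standard JAXP classes and a DOM document as the example physical model. XmlExchangeSketch and its two methods are hypothetical names; the serialization actually used in SMILA may differ.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XmlExchangeSketch {
    // Sender side: serialize the DOM based physical model to an XML string,
    // e.g. to be used as the body of a queue message.
    static String toXmlString(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    // Receiver side: this listener chooses DOM again; another listener could
    // parse the same string into a different physical model.
    static Document fromXmlString(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }
}
```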

Description of Logical Data Model

This proposal is based on experiences made with the IAS (the empolis Information Access Suite) data model. It is intended as a simplification of the IAS model, to overcome problems caused by its over-specification.

Record - Top level element

  • ID: see SMILA/Project Concepts/ID Concept for details
  • metadata: MetadataObject - the actual data about the document
  • attachments: Map<String, byte[]> - additional data not serializable to XML (or too inefficient), e.g.:
    • binary content of documents
    • huge annotations

MetadataObject

  • attributes: Map<String, Attribute> - data about records according to some application model or ontology
  • annotations: Map<String, List<Annotation>> - additional service specific data

Attribute

  • name: String
  • value: List<MetadataObject|Literal>
  • annotations: Map<String, List<Annotation>>

Literal

  • semantic type: String
  • value: (String | Long | Double | Boolean | Date | Time | DateTime)?
  • data type
  • annotations: Map<String, List<Annotation>>

Annotation

  • anonymous values: List<String>
  • named values: Map<String, String>
  • annotations: Map<String, List<Annotation>>

Java Interfaces of Logical Data Model

Note: This is just a preview. The details may change during implementation if other variants turn out to be more appropriate or convenient.

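import java.util.Date;
import java.util.List;

// An annotation carries anonymous values, named values and nested annotations.
interface Annotation extends Annotatable {
    List<String> getAnonValues();
    void addAnonValue(String value);
    void removeAnonValues();
 
    void setNamedValue(String name, String value);
    String getNamedValue(String name);
    void removeNamedValues();
}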
// Implemented by every element that can carry annotations, grouped by annotation name.
interface Annotatable {
    boolean hasAnnotations();
    boolean hasAnnotation(String name);
 
    List<Annotation> getAnnotations(String name);
    Annotation getAnnotation(String name);
 
    void setAnnotations(String name, List<Annotation> annotations);
    void setAnnotation(String name, Annotation annotation);
    void addAnnotation(String name, Annotation annotation);
 
    void removeAnnotations(String name);
}
interface Record {
    ID getID();
    void setID(ID id);
 
    MObject getMetadata();
    void setMetadata(MObject metadata);
 
    boolean hasAttachments();
    byte[] getAttachment(String name);
    void putAttachment(String name, byte[] content);
}
interface Attribute extends Annotatable {
    String getName();
 
    boolean hasLiterals();
    int valueSize();
    List<Literal> getLiterals();
    Literal getLiteral(); // return only first value in list, if any
    void addLiteral(Literal literal);
    void removeValues();
 
    boolean hasObjects();
    int objectSize();
    List<MObject> getObjects();
    MObject getObject(); // return only first Object in list, if any
    void addObject(MObject object);
    void removeObjects();
}
interface AttributeValue extends Annotatable {
    String getSemanticType();
    void setSemanticType(String semanticType);
}
// MObject is short for Metadata Object
interface MObject extends AttributeValue {
    boolean hasAttributes();
    boolean hasAttribute(String name);
    Attribute getAttribute(String name);
    void setAttribute(String name, Attribute attribute);
}
interface Literal extends AttributeValue {
    boolean hasValue();
 
    Object getValue();
    String getStringValue(); // return toString() of value, if not a string
    // other type specific methods return null, if value is not of requested type
    Long getIntValue();
    Double getFPValue();
    Boolean getBoolValue();
    Date getDateTimeValue();
 
    void setValue(Object value) throws InvalidArgumentException;
    void setStringValue(String value);
    void setIntValue(Long value);
    void setFPValue(Double value);
    void setBoolValue(Boolean value);
    void setDateTimeValue(Date value);
}
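
The accessor semantics described in the comments of Literal can be illustrated with a small in-memory class. This is only a sketch under assumptions: InMemoryLiteral is a hypothetical name, it does not implement the interfaces above (annotations and semantic type are left out), and IllegalArgumentException stands in for the InvalidArgumentException named in the interface.

```java
import java.util.Date;

// Hypothetical in-memory value holder mirroring the getter and setter rules of Literal.
final class InMemoryLiteral {
    private Object value;

    boolean hasValue() {
        return value != null;
    }

    Object getValue() {
        return value;
    }

    // Returns toString() of the value if it is not already a string, null if unset.
    String getStringValue() {
        return value == null ? null : value.toString();
    }

    // Type specific getters return null if the value is not of the requested type.
    Long getIntValue() {
        return value instanceof Long ? (Long) value : null;
    }

    Double getFPValue() {
        return value instanceof Double ? (Double) value : null;
    }

    Boolean getBoolValue() {
        return value instanceof Boolean ? (Boolean) value : null;
    }

    Date getDateTimeValue() {
        return value instanceof Date ? (Date) value : null;
    }

    // Accepts only the value types the getters know about and rejects everything else.
    void setValue(Object newValue) {
        boolean supported = newValue == null || newValue instanceof String
            || newValue instanceof Long || newValue instanceof Double
            || newValue instanceof Boolean || newValue instanceof Date;
        if (!supported) {
            throw new IllegalArgumentException(
                "unsupported literal type: " + newValue.getClass().getName());
        }
        value = newValue;
    }
}
```

With a Long value of 1234 set, getIntValue() returns the Long, getFPValue() returns null, and getStringValue() returns "1234".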

XML Schema of Logical Data Model

XML Schema design by example

The following XML snippet illustrates one possible way to represent this data model in XML. This section should be seen as experimental.

The XML schema is targeted at being relatively easy to use for XPath expressions in BPEL processes or elsewhere. The element and attribute names have been abbreviated in order to minimize the length of the resulting document. This should have a positive impact on communication overhead and processing performance (of course, in reality whitespace such as linefeeds and indentation should also be left out).

The annotations used as examples are motivated by often used IAS properties.

<RecordList xmlns="http://www.eclipse.org/smila/record" xmlns:id="http://www.eclipse.org/smila/id"
	xmlns:rec="http://www.eclipse.org/smila/record" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.eclipse.org/smila/record record.xsd ">
 
	<Record version="1.0">
		<id:ID version="1.0">
			<id:Source>share</id:Source>
			<id:Key>some.html</id:Key>
		</id:ID>
		<A n="mimetype">
			<!-- IAS retrieval filter: annotation attached to attribute, valid for complete attribute value -->
			<An n="filter">
				<V n="type">exclude</V>
				<An n="values">
					<V>text/plain</V>
					<V>text/html</V>
				</An>
			</An>
			<L>
				<V>text/html</V>
				<V st="appl:Mimetype">text/html</V>
			</L>
		</A>
		<A n="filesize"><!-- single numeric value attribute -->
			<L>
				<V t="int">1234</V>
			</L>
		</A>
		<A n="trustee"><!-- multivalued attribute without annotation for each value -->
			<L>
				<V>group1</V>
				<V>group2</V>
			</L>
		</A>
		<A n="topic"><!-- multivalued attribute with simple values with annotations -->
			<An n="importance"><!-- IAS query boost factor, refers to complete attribute -->
				<V>4.0</V>
			</An>
			<L>
				<V>Eclipse</V><!-- first value -->
				<An n="sourceRef"><!-- part of IAS textminer info for first value-->
					<V n="attribute">fulltext</V>
					<V n="startPos">37</V>
					<V n="endPos">42</V>
				</An>
				<An n="sourceRef">
					<V n="attribute">fulltext</V>
					<V n="startPos">137</V>
					<V n="endPos">142</V>
				</An>
				<An n="importance"><!-- extra IAS query boost factor for first value -->
					<V>2.0</V>
				</An>
			</L>
			<L>
				<V>SMILA</V><!-- second attribute value -->
				<An n="sourceRef"><!-- following annotations refer to second value -->
					<!-- similar to above -->
				</An>
			</L>
		</A>
		<A n="author"><!-- "set of aggregates" -->
			<O>
				<A n="firstName">
					<L>
						<V>Igor</V>
					</L>
				</A>
				<A n="lastName">
					<L>
						<V>Novakovic</V>
					</L>
				</A>
			</O>
			<O st="appl:Author">
				<A n="firstName">
					<L>
						<V>Georg</V>
					</L>
				</A>
				<A n="lastName">
					<L>
						<V>Schmidt</V>
					</L>
				</A>
			</O>
		</A>
 
		<An n="action">
			<V>update</V>
		</An>
 
		<Attachment>content</Attachment><!-- just a marker that an attachment exists in attachment store? -->
		<Attachment>fulltext</Attachment>
	</Record>
</RecordList>

Some notes

  • <L> can contain multiple <V>, if the single values do not have annotations
  • The st attribute in <L> and <O> means some application specific "semantic" type while the t attribute in <V> means the native datatype of this value.
  • The version attribute allows parsers to convert between older XML formats and the currently supported format if necessary.
  • The data model will be extended later to support XML streaming of user-definable object types, either as attribute values or in an extra part of the record.
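
To give an impression of how XPath expressions work against this format, the following sketch evaluates an expression with the standard javax.xml.xpath API. RecordXPathSketch and evaluate are hypothetical names, and the prefix rec is simply bound to the record namespace used in the example above; a BPEL engine would configure the namespace binding in its own way.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class RecordXPathSketch {
    static final String RECORD_NS = "http://www.eclipse.org/smila/record";

    // Binds the prefix "rec" to the record namespace; XPath 1.0 has no default namespace.
    private static final NamespaceContext NAMESPACES = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "rec".equals(prefix) ? RECORD_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return RECORD_NS.equals(namespaceURI) ? "rec" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return RECORD_NS.equals(namespaceURI)
                ? Collections.singletonList("rec").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    // Parses the record XML and returns the string result of the XPath expression.
    static String evaluate(String recordXml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(recordXml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(NAMESPACES);
        return xpath.evaluate(expression, doc);
    }
}
```

For the example record, the expression /rec:RecordList/rec:Record/rec:A[@n='filesize']/rec:L/rec:V selects the file size value 1234, and a condition such as whether a trustee named group1 exists can be written as boolean(//rec:A[@n='trustee']/rec:L/rec:V[.='group1']).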