SMILA/Project Concepts/Data Model and XML representation

Revision as of 16:03, 23 October 2008 by G.schmidt.brox.de (Talk | contribs) (Discussion: Issues in Record interface definition and ID generation from XML snippets.)

Description

This page describes the data model used in SMILA to represent data objects (records) in workflows.

Discussion

AGS: Currently I have some remarks regarding the record interface.

  • We are not able to return large data, such as videos or large XML data, due to the lack of a stream interface for attachments.
  • I am developing a crawler that returns XML. This crawler is able to crawl our Berkeley DB storage, so I am able to return full Record structures. The open question is: how do I convert an embedded ID into an ID for the record? With the normal XML/IndexOrder syntax I am not able to generate a dynamic key that contains several hierarchies that may change on record level.

Please share your thoughts.

Technical proposal

Considerations

What we need:

  • Simple API for service developers to work with the records.
  • Minimal constraints on what is possible to express
  • Any SMILA component must be able to process every incoming record without knowing about any other component in the installation that may have produced some service specific part of the record. It must also be able to reproduce these elements in its result if they were not explicitly deleted during service execution.
  • This means that for service specific classes we cannot even rely on having the same classes in the same version installed in each composite at the same time.

  • Records produced and stored with one version of a SMILA installation must remain re-processable with updated versions of the installation (at least if the major version of the framework has not changed).
  • Nice XML representation possible
  • Simple to express XPath queries on objects for conditions in BPEL or message routers.

In my opinion, this means that we cannot have the data model extended by any service specific classes. Instead, we must provide a data model that is able to express everything that a service might want to express. As a later extension we plan to allow the use of user-definable XML streaming for application specific object types, but this will not be implemented in the first version.

Physical Data Model

Alternative Proposal: This section has largely been made obsolete by SMILA/Project Concepts/Blackboard Service Concept. However, I still suggest defining a logical data model using interfaces to hide the physical implementation from the client, so that optimized implementations of the data model are possible in different parts of the framework.

Problem: Different processing engines require working on certain Java objects, e.g.:

  • The ODE BPEL engine needs to be called with DOM objects.
  • ActiveBPEL uses other classes.
  • One could think of a SMILA specific processing engine that could use a physical data model that implements the logical data model more efficiently.

Conversion between different physical models can become expensive if it has to be done very often. This means, for example, that if a BPEL engine is used to orchestrate a number of SMILA services, it should not be necessary to convert the exchanged data objects each time a service is called and each time a service returns its result to the engine. And because the orchestration engine should be replaceable like everything else in the framework, we cannot commit to using e.g. DOM as the physical representation of our data objects, because then we would have conversion issues when using ActiveBPEL.

Proposal:

  • Define logical data model using a set of interfaces and a corresponding XML schema.
  • SMILA services access and create data only by using these interfaces; they do not need to know about the actual physical data model.
  • Provide physical data models that implement these interfaces using appropriate object formats.

E.g. when using ODE as the orchestration engine, use a physical model that uses DOM to represent the data objects. These DOM objects can be passed to the BPEL engine directly. Each time a service is invoked from BPEL, only a small wrapper must be created and the service can access the DOM objects as logical SMILA objects.

On the other hand, in a crawler or in a queue listener that does not use a BPEL engine, a more efficient implementation of the logical model could be used for better performance.
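The wrapper idea can be sketched roughly as follows. This is only an illustration under assumptions: the reduced interface `SimpleLiteral` and the classes `DomLiteral`, `PlainLiteral` and `PhysicalModelSketch` are hypothetical names that are not part of SMILA; only the `<V>` element name is borrowed from the XML example on this page. The service method sees nothing but the logical interface, while the DOM element behind `DomLiteral` could be handed to a DOM-based engine without conversion.

```java
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical, reduced logical interface (not the SMILA API).
interface SimpleLiteral {
    String getStringValue();
    void setStringValue(String value);
}

// Physical model 1: thin wrapper around a DOM element.
final class DomLiteral implements SimpleLiteral {
    private final Element element;

    DomLiteral(Element element) { this.element = element; }

    public String getStringValue() { return element.getTextContent(); }
    public void setStringValue(String value) { element.setTextContent(value); }

    // The wrapped element can be passed on to a DOM-based engine as is.
    Element getElement() { return element; }
}

// Physical model 2: plain Java object, e.g. for a crawler without a BPEL engine.
final class PlainLiteral implements SimpleLiteral {
    private String value;

    public String getStringValue() { return value; }
    public void setStringValue(String value) { this.value = value; }
}

final class PhysicalModelSketch {
    // A "service" that only knows the logical interface.
    static void normalize(SimpleLiteral literal) {
        literal.setStringValue(literal.getStringValue().trim().toLowerCase(Locale.ROOT));
    }

    // Creates a standalone <V> element holding the given text.
    static Element newValueElement(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element v = doc.createElement("V");
            v.setTextContent(text);
            doc.appendChild(v);
            return v;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element v = newValueElement("  Text/HTML ");
        normalize(new DomLiteral(v));   // modifies the DOM element in place
        System.out.println(v.getTextContent());

        PlainLiteral plain = new PlainLiteral();
        plain.setStringValue("  Text/HTML ");
        normalize(plain);               // same service code, different physical model
        System.out.println(plain.getStringValue());
    }
}
```

Both calls to `normalize` run the same service code; only the object behind the interface differs.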

Data exchange between components that require different physical data models is most easily done by using the common XML format for serialization. Also, queue messages would always contain an XML string. Each listener can then decide for itself which implementation to use.

Description of Logical Data Model

This proposal is based on experiences made with the IAS data model (Orenge objects with Properties). It is intended as a simplification of the IAS model, to overcome problems caused by its over-specification.

Record - Top level element

  • ID: see SMILA/Project Concepts/ID Concept for details
  • metadata: MetadataObject - the actual data about the document
  • attachments: Map<String, byte[]> - additional data not serializable to XML (or too inefficient), e.g.:
    • binary content of documents
    • Huge annotations

MetadataObject

  • attributes: Map<String, Attribute> - data about records according to some application model or ontology
  • annotations: Map<String, List<Annotation>> - additional service specific data

Attribute

  • name: String
  • value: List<MetadataObject|Literal>
  • annotations: Map<String, List<Annotation>>

Literal

  • semantic type: String
  • value: (String | Long | Double | Boolean | Date | Time | DateTime)?
  • data type
  • annotations: Map<String, List<Annotation>>

Annotation

  • anonymous values: List<String>
  • named values: Map<String, String>
  • annotations: Map<String, List<Annotation>>

Java Interfaces of Logical Data Model

Note: This is just a preview. The details may change during implementation if other variants turn out to be more appropriate or convenient.

interface Annotation extends Annotatable {
    List<String> getAnonValues();
    void addAnonValue(String value);
    void removeAnonValues();
 
    void setNamedValue(String name, String value);
    String getNamedValue(String name);
    void removeNamedValues();
}
interface Annotatable {
    boolean hasAnnotations();
    boolean hasAnnotation(String name);
 
    List<Annotation> getAnnotations(String name);
    Annotation getAnnotation(String name);
 
    void setAnnotations(String name, List<Annotation> annotations);
    void setAnnotation(String name, Annotation annotation);
    void addAnnotation(String name, Annotation annotation);
 
    void removeAnnotations(String name);
}
interface Record {
    ID getID();
    void setID(ID id);
 
    MObject getMetadata();
    void setMetadata(MObject metadata);
 
    boolean hasAttachments();
    byte[] getAttachment(String name);
    void putAttachment(String name, byte[] content);
}
interface Attribute extends Annotatable {
    String getName();
 
    boolean hasLiterals();
    int valueSize();
    List<Literal> getLiterals();
    Literal getLiteral(); // return only first value in list, if any
    void addLiteral(Literal literal);
    void removeValues();
 
    boolean hasObjects();
    int objectSize();
    List<MObject> getObjects();
    MObject getObject(); // return only first object in list, if any
    void addObject(MObject object);
    void removeObjects();
}
interface AttributeValue extends Annotatable {
    String getSemanticType();
    void setSemanticType(String semanticType);
}
// MObject is short for Metadata Object
interface MObject extends AttributeValue {
    boolean hasAttributes();
    boolean hasAttribute(String name);
    Attribute getAttribute(String name);
    void setAttribute(String name, Attribute attribute);
}
interface Literal extends AttributeValue {
    boolean hasValue();
 
    Object getValue();
    String getStringValue(); // return toString() of value, if not a string
    // other type specific methods return null, if value is not of requested type
    Long getIntValue();
    Double getFPValue();
    Boolean getBoolValue();
    Date getDateTimeValue();
 
    void setValue(Object value) throws InvalidArgumentException;
    void setStringValue(String value);
    void setIntValue(Long value);
    void setFPValue(Double value);
    void setBoolValue(Boolean value);
    void setDateTimeValue(Date value);
}
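To illustrate how a service might work with these interfaces, here is a minimal sketch that builds the nested "filter" annotation used in the XML example on this page. It assumes reduced copies of `Annotatable` and `Annotation` (only the methods needed here, with the single-value getter named `getNamedValue`) and a hypothetical in-memory implementation `InMemoryAnnotation`; none of this is the actual SMILA implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Reduced copies of the preview interfaces, limited to the methods used below.
interface Annotatable {
    boolean hasAnnotation(String name);
    List<Annotation> getAnnotations(String name);
    void addAnnotation(String name, Annotation annotation);
}

interface Annotation extends Annotatable {
    List<String> getAnonValues();
    void addAnonValue(String value);
    void setNamedValue(String name, String value);
    String getNamedValue(String name);
}

// Hypothetical in-memory implementation, for illustration only.
final class InMemoryAnnotation implements Annotation {
    private final List<String> anonValues = new ArrayList<>();
    private final Map<String, String> namedValues = new LinkedHashMap<>();
    private final Map<String, List<Annotation>> annotations = new LinkedHashMap<>();

    public boolean hasAnnotation(String name) { return annotations.containsKey(name); }

    public List<Annotation> getAnnotations(String name) {
        return annotations.getOrDefault(name, new ArrayList<>());
    }

    public void addAnnotation(String name, Annotation annotation) {
        annotations.computeIfAbsent(name, key -> new ArrayList<>()).add(annotation);
    }

    public List<String> getAnonValues() { return anonValues; }
    public void addAnonValue(String value) { anonValues.add(value); }
    public void setNamedValue(String name, String value) { namedValues.put(name, value); }
    public String getNamedValue(String name) { return namedValues.get(name); }
}

final class AnnotationSketch {
    // Builds: filter { type = exclude; values { "text/plain", "text/html" } }
    static Annotation buildFilter() {
        Annotation filter = new InMemoryAnnotation();
        filter.setNamedValue("type", "exclude");

        Annotation values = new InMemoryAnnotation();
        values.addAnonValue("text/plain");
        values.addAnonValue("text/html");
        filter.addAnnotation("values", values);
        return filter;
    }

    public static void main(String[] args) {
        Annotation filter = buildFilter();
        System.out.println(filter.getNamedValue("type"));
        System.out.println(filter.getAnnotations("values").get(0).getAnonValues());
    }
}
```

Because annotations are themselves annotatable, arbitrarily nested structures can be expressed with the same three building blocks: anonymous values, named values and sub-annotations.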

XML Schema of Logical Data Model

XML Schema design by example

The following XML snippet illustrates one possible way to represent this data model in XML. This section should be seen as experimental.

The XML schema is designed to be relatively easy to use in XPath expressions in BPEL processes or elsewhere. The element and attribute names have been abbreviated in order to minimize the length of the resulting document. This should have a positive impact on communication overhead and processing performance (of course, in practice whitespace such as linefeeds and indentation should also be left out).

The annotations used as examples are motivated by often used IAS properties.

<RecordList xmlns="http://www.eclipse.org/smila/record" xmlns:id="http://www.eclipse.org/smila/id"
	xmlns:rec="http://www.eclipse.org/smila/record" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.eclipse.org/smila/record record.xsd ">
 
	<Record version="1.0">
		<id:ID version="1.0">
			<id:Source>share</id:Source>
			<id:Key>some.html</id:Key>
		</id:ID>
		<A n="mimetype">
			<!-- IAS retrieval filter: annotation attached to attribute, valid for complete attribute value -->
			<An n="filter">
				<V n="type">exclude</V>
				<An n="values">
					<V>text/plain</V>
					<V>text/html</V>
				</An>
			</An>
			<L>
				<V>text/html</V>
				<V st="appl:Mimetype">text/html</V>
			</L>
		</A>
		<A n="filesize"><!-- single numeric value attribute -->
			<L>
				<V t="int">1234</V>
			</L>
		</A>
		<A n="trustee"><!-- multivalued attribute without annotation for each value -->
			<L>
				<V>group1</V>
				<V>group2</V>
			</L>
		</A>
		<A n="topic"><!-- multivalued attribute with simple values with annotations -->
			<An n="importance"><!-- IAS query boost factor, refers to complete attribute -->
				<V>4.0</V>
			</An>
			<L>
				<V>Eclipse</V><!-- first value -->
				<An n="sourceRef"><!-- part of IAS textminer info for first value-->
					<V n="attribute">fulltext</V>
					<V n="startPos">37</V>
					<V n="endPos">42</V>
				</An>
				<An n="sourceRef">
					<V n="attribute">fulltext</V>
					<V n="startPos">137</V>
					<V n="endPos">142</V>
				</An>
				<An n="importance"><!-- extra IAS query boost factor for first value -->
					<V>2.0</V>
				</An>
			</L>
			<L>
				<V>SMILA</V><!-- second attribute value -->
				<An n="sourceRef"><!-- following annotations refer to second value -->
					<!-- similar to above -->
				</An>
			</L>
		</A>
		<A n="author"><!-- "set of aggregates" -->
			<O>
				<A n="firstName">
					<L>
						<V>Igor</V>
					</L>
				</A>
				<A n="lastName">
					<L>
						<V>Novakovic</V>
					</L>
				</A>
			</O>
			<O st="appl:Author">
				<A n="firstName">
					<L>
						<V>Georg</V>
					</L>
				</A>
				<A n="lastName">
					<L>
						<V>Schmidt</V>
					</L>
				</A>
			</O>
		</A>
 
		<An n="action">
			<V>update</V>
		</An>
 
		<Attachment>content</Attachment><!-- just a marker that an attachment exists in attachment store? -->
		<Attachment>fulltext</Attachment>
	</Record>
</RecordList>

Some notes

  • <L> can contain multiple <V>, if the single values do not have annotations
  • The st attribute in <L> and <O> means some application specific "semantic" type while the t attribute in <V> means the native datatype of this value.
  • The version attribute allows parsers to convert between older XML formats and the currently supported format if necessary.
  • The data model will be extended later to support XML streaming of user-definable object types, either as attribute values or in an extra part of the record.
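As a rough illustration of the XPath usage the format is aimed at, the following sketch evaluates a few conditions against a shortened record taken from the example above, using the standard `javax.xml.xpath` API. The class name `RecordXPathSketch`, the helper methods and the prefix `rec` are choices made for this sketch only; a BPEL engine would evaluate such expressions through its own mechanisms.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class RecordXPathSketch {
    static final String REC_NS = "http://www.eclipse.org/smila/record";

    // Shortened record based on the example above (whitespace left out).
    static final String XML =
        "<Record xmlns=\"" + REC_NS + "\" version=\"1.0\">"
        + "<A n=\"filesize\"><L><V t=\"int\">1234</V></L></A>"
        + "<A n=\"trustee\"><L><V>group1</V><V>group2</V></L></A>"
        + "<An n=\"action\"><V>update</V></An>"
        + "</Record>";

    // Parses the record namespace-aware and evaluates one XPath 1.0 expression.
    static Object evaluate(String expression, QName resultType) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the prefix "rec" to the record namespace for use in expressions.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "rec".equals(prefix) ? REC_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return REC_NS.equals(uri) ? "rec" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return REC_NS.equals(uri)
                        ? Collections.singletonList("rec").iterator()
                        : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc, resultType);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean test(String expression) {
        return (Boolean) evaluate(expression, XPathConstants.BOOLEAN);
    }

    static double number(String expression) {
        return (Double) evaluate(expression, XPathConstants.NUMBER);
    }

    public static void main(String[] args) {
        // Condition on a top-level annotation, e.g. for routing.
        System.out.println(test("/rec:Record/rec:An[@n='action']/rec:V = 'update'"));
        // Numeric condition on an attribute value.
        System.out.println(test("/rec:Record/rec:A[@n='filesize']/rec:L/rec:V > 1000"));
        // Number of values of a multivalued attribute.
        System.out.println(number("count(/rec:Record/rec:A[@n='trustee']/rec:L/rec:V)"));
    }
}
```

Note that because the record elements live in a default namespace, XPath 1.0 expressions need a bound prefix (here `rec`) even though the document itself uses none.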
