
Difference between revisions of "SMILA/Documentation/XML storage"

Revision as of 06:21, 13 November 2008

XML storage

Introduction

The main use case of the XML Store shall be to store and retrieve XML documents as well as to obtain a set of documents by an XPath/XQuery expression.

Within SMILA it is used to store the XML version of a Record object. Several components therefore use it, but only via the Blackboard.

It shall also serve as an infrastructure block for any component that needs a private XML Storage. In this case the storage shall only be accessible to that component to avoid any conflicts.

The first API draft defines and implements the basic CRUD operations. In-place modifications of sub nodes are not yet needed (Prio 2 or 3).

It is suggested to publish the needed functionality as an OSGi Service with the possibility to run multiple instances which may or may not be running in the same JVM.

Xml Storage Service (XSS)

The intended usage of the XML Storage is very much that of a service or server (e.g. like a real DB server such as MySQL, Oracle, etc.) as opposed to a library-type implementation. Hence the implementation shall be done as an OSGi Service that is wired up with Declarative Services.

The service itself must support multiple requests at the same time and therefore needs to be multithreaded. The intention is to use a connection-type approach, as is the case for SQL DBs. This means that multiple clients may connect to the service, and each client may open multiple connections that are used to query/store XML documents concurrently.

An OSGi service is still run and called within the same JVM. This is in contrast to normal DB services that typically run in their own process and hence communication is done via TCP/IP, pipes etc.

In a later phase, when supporting clustering for horizontal scaling purposes, the XSS needs to hide the clustering capability from its client and manage all its aspects fully transparently, making it purely a matter of configuration.

Xml Storage Use

  • Retrieval of XML documents may be done either by string ID or by formulating an XQuery through the XQJ API. The former always returns at most a single document, while the latter returns a sequence of XML nodes and as such may return whole documents or parts of a document.
  • Common API uses should be encapsulated in their own API layer, so that each client doesn't have to perform all low-level functions itself.
  • Records are stored by their ID, which is calculated deterministically. Once calculated, the ID could be cached on the Record object to avoid recalculation.
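
The caching idea in the last bullet could be sketched as follows. This is only an illustration: the class name SketchRecord, the fields source and key, and the use of SHA-256 are assumptions made for the example and do not describe how SMILA actually calculates a record ID.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical record stand-in; not the SMILA Record class. */
final class SketchRecord {
    private final String source;
    private final String key;
    // Cached result of the deterministic ID calculation (null until first use).
    private volatile String cachedId;

    SketchRecord(String source, String key) {
        this.source = source;
        this.key = key;
    }

    /** Returns the ID, calculating it only on the first call. */
    String getId() {
        String id = cachedId;
        if (id == null) {
            // A concurrent caller may compute the same value twice; that is
            // harmless because the calculation is deterministic.
            id = computeId(source, key);
            cachedId = id;
        }
        return id;
    }

    /** Assumed ID scheme: hex-encoded SHA-256 over source and key. */
    static String computeId(String source, String key) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            sha.update(source.getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0); // separator so ("ab","c") differs from ("a","bc")
            sha.update(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : sha.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Two records built from the same source and key yield the same ID, so the ID can serve as the storage key without any central ID generator.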


Note
Because the storage scope is that of whole documents, we should also work with documents as a whole. Although it is possible to convert an element node obtained via XQuery into a document (this involves extracting the element and all its content as text and then parsing that text into a DOM), the process is obviously lengthy and costly. We should therefore store subsections of XML documents that we use often as entities of their own (i.e. without their parent/containing document context). Obviously they need to be linked to their parent (internally in the Storage API?) so that we can clean them up properly.
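
The conversion described in the note can be sketched with the XML APIs that ship with the JDK. The example uses javax.xml.xpath (XPath 1.0) in place of XQuery/XQJ, purely so that it runs without extra libraries. The class name SubDocumentSketch and the method extract are made up for the illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper showing the serialize-then-reparse round trip. */
final class SubDocumentSketch {
    private SubDocumentSketch() {}

    /** Selects one node by XPath and returns it as an independent document. */
    static Document extract(String xml, String xpath) {
        try {
            Document source = parse(xml);
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, source, XPathConstants.NODE);
            if (node == null) {
                throw new IllegalArgumentException("No node matches " + xpath);
            }
            // Step 1: write the element and all its content out as text.
            // Step 2: parse that text again into a new DOM.
            return parse(serialize(node));
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("XML processing failed", e);
        }
    }

    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    private static String serialize(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }
}
```

Every extraction pays for one serialization and one full parse, which is the cost the note refers to.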

Binary Storage

Although it is possible to save binary objects in Berkeley DB XML and possibly other XML DBs, it is better to provide separate OSGi Services for these distinctly different storage types. In addition, according to Ralf Schuman, who investigated this matter, the performance for larger binary objects seems not to be good with BDB.


API

It shall be possible to run different instances of XML storages, similar to the idea of having multiple instances of an MSSQL server running (on the same machine). Each such instance is controlled by configuration and identified by a name. The following items are part of the configuration:

  • Service instance name
  • segments
    • segments shall be used for grouping of XML documents
    • I'm not sure whether we want/need to declare the possible segments in advance. Declaring them might also serve to limit the possible segments if clients were otherwise allowed to create segments on the fly.
  • default segment
  • implementation bundle
    • The implementation bundle defines which bundle implements the service interface. For now we will have only one implementation, but there might be others. I also have the idea of providing two implementations: one streamlined for performance on a single-node installation, the other targeted at a distributed installation where all parameters are Serializable.
    • How is that called from code? How do we need to configure this the OSGi way? Maybe that is part of the manifest!?
  • host
    • For communication in a distributed environment we will use SCA for remoting. Although this aspect is transparent, we still need to tell a service instance that is just starting whether it is the real service server or just a proxy service stub routing to the server.

Example: the data shall be stored on host S, hence the service instance I on S runs in server mode. A client running on host C that wants to use instance I calls a service instance P of the same name that runs in proxy mode on C and merely remotes the communication to I.

    • IMO it is enough to declare the host name/IP of the service server... [This isn't working, since we also need SCA for inter-VM communication on the same host. How to handle this? It might be that we need to resort to sockets/ports here after all...]
    • After talking to DS, it might be that SCA handles this really transparently, such that it creates the proxy itself without programming intervention.
  • TBC
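
The server/proxy split from the example above might be sketched like this. All type names (XmlStore, ServerStore, ProxyStore, InstanceMode, StoreInstances) are hypothetical, and the proxy simply holds a direct reference to the server. In a real deployment that hop would be a remote call (via SCA, if that works out as hoped), which is not modelled here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal storage contract for this sketch (hypothetical). */
interface XmlStore {
    void put(String id, String xml);
    String get(String id); // returns null when the document is absent
}

/** Server mode: the instance that actually holds the data (I on host S). */
final class ServerStore implements XmlStore {
    private final Map<String, String> docs = new ConcurrentHashMap<>();

    @Override
    public void put(String id, String xml) {
        docs.put(id, xml);
    }

    @Override
    public String get(String id) {
        return docs.get(id);
    }
}

/** Proxy mode: same interface, every call is forwarded (P on host C). */
final class ProxyStore implements XmlStore {
    private final XmlStore server; // stands in for the remote link

    ProxyStore(XmlStore server) {
        this.server = server;
    }

    @Override
    public void put(String id, String xml) {
        server.put(id, xml);
    }

    @Override
    public String get(String id) {
        return server.get(id);
    }
}

/** The configuration item that tells a starting instance which role it has. */
enum InstanceMode { SERVER, PROXY }

final class StoreInstances {
    private StoreInstances() {}

    /** Creates an instance for the configured mode; server is only used for PROXY. */
    static XmlStore create(InstanceMode mode, XmlStore server) {
        if (mode == InstanceMode.SERVER) {
            return new ServerStore();
        }
        if (server == null) {
            throw new IllegalArgumentException("proxy mode needs a server to route to");
        }
        return new ProxyStore(server);
    }
}
```

Because both modes implement the same interface, client code does not need to know which one it was given.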

The implementation itself might necessitate more items to be configured. For BDB these are:

  • basically be able to set all props of these classes/objects
    • EnvironmentConfig
    • Environment
    • XmlManagerConfig
    • This could be done dynamically with a BeanHelper utility class and reflection, as done in Spring.
  • Segments of the generic interface map to containers, hence we need one container config per segment. This would indicate that segments must be declared in the general config.
  • TBC
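
A BeanHelper of the kind mentioned above could look roughly like the following sketch. It uses only plain java.lang.reflect and handles a few property types. EnvSettingsSketch is a made-up stand-in with setter-style properties, not the real BDB EnvironmentConfig class.

```java
import java.lang.reflect.Method;
import java.util.Map;

/** Sketch of a reflection-based property setter (hypothetical utility). */
final class BeanHelper {
    private BeanHelper() {}

    /** Calls setXxx(value) on the bean for every entry "xxx" -> "value". */
    static void applyProperties(Object bean, Map<String, String> props) {
        for (Map.Entry<String, String> entry : props.entrySet()) {
            String name = entry.getKey();
            String setterName = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            Method setter = findSetter(bean.getClass(), setterName);
            if (setter == null) {
                throw new IllegalArgumentException("No setter for property " + name);
            }
            try {
                setter.invoke(bean, convert(entry.getValue(), setter.getParameterTypes()[0]));
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot set property " + name, e);
            }
        }
    }

    // Picks the first public one-argument method with the given name.
    private static Method findSetter(Class<?> type, String setterName) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(setterName) && m.getParameterCount() == 1) {
                return m;
            }
        }
        return null;
    }

    // Converts the configured string to the setter's parameter type.
    private static Object convert(String raw, Class<?> type) {
        if (type == String.class) return raw;
        if (type == boolean.class || type == Boolean.class) return Boolean.valueOf(raw);
        if (type == int.class || type == Integer.class) return Integer.valueOf(raw);
        if (type == long.class || type == Long.class) return Long.valueOf(raw);
        throw new IllegalArgumentException("Unsupported property type: " + type.getName());
    }
}

/** Made-up configuration bean used only to exercise BeanHelper. */
class EnvSettingsSketch {
    private boolean allowCreate;
    private long cacheSize;

    public void setAllowCreate(boolean allowCreate) { this.allowCreate = allowCreate; }
    public boolean isAllowCreate() { return allowCreate; }
    public void setCacheSize(long cacheSize) { this.cacheSize = cacheSize; }
    public long getCacheSize() { return cacheSize; }
}
```

With such a helper, the service configuration could carry arbitrary name/value pairs for the implementation, and the generic layer would not need to know the individual settings.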

Until these points are settled, here is a first draft of the most needed methods:

  • Service
    • openConnection(String connectionString) : Connection
    • closeConnection(Connection con);
    • get/setAutocommit(Boolean value);
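
As a rough illustration of the draft above, the service could be expressed as a Java interface with a trivial in-memory implementation. Only openConnection, closeConnection and get/setAutocommit come from the draft. The XssConnection interface, its methods, and the use of the connection string as a segment name are assumptions made for this sketch.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical connection handle; these methods are not part of the draft. */
interface XssConnection {
    void store(String id, String xml);
    Optional<String> retrieve(String id);
    boolean isOpen();
}

/** The service methods listed in the draft. */
interface XmlStorageService {
    XssConnection openConnection(String connectionString);
    void closeConnection(XssConnection con);
    boolean getAutocommit();
    void setAutocommit(boolean value);
}

/** In-memory stand-in for a real backend such as BDB XML. */
final class InMemoryXmlStorageService implements XmlStorageService {
    // Shared by all connections; safe for concurrent clients.
    private final Map<String, String> documents = new ConcurrentHashMap<>();
    // Only recorded here; this sketch has no transactions to commit.
    private volatile boolean autocommit = true;

    @Override
    public XssConnection openConnection(String connectionString) {
        // Assumption: the connection string simply names the segment.
        return new InMemoryConnection(connectionString);
    }

    @Override
    public void closeConnection(XssConnection con) {
        if (con instanceof InMemoryConnection) {
            ((InMemoryConnection) con).open.set(false);
        }
    }

    @Override
    public boolean getAutocommit() { return autocommit; }

    @Override
    public void setAutocommit(boolean value) { autocommit = value; }

    private final class InMemoryConnection implements XssConnection {
        private final String segment;
        private final AtomicBoolean open = new AtomicBoolean(true);

        InMemoryConnection(String segment) { this.segment = segment; }

        @Override
        public void store(String id, String xml) {
            requireOpen();
            documents.put(segment + "/" + id, xml);
        }

        @Override
        public Optional<String> retrieve(String id) {
            requireOpen();
            return Optional.ofNullable(documents.get(segment + "/" + id));
        }

        @Override
        public boolean isOpen() { return open.get(); }

        private void requireOpen() {
            if (!open.get()) throw new IllegalStateException("connection is closed");
        }
    }
}
```

Several connections opened on the same service instance see the same documents, which mirrors the connection-type approach known from SQL DBs.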

High Level API

The high level API is contained in the Blackboard and need not be duplicated here. As a consequence, this service offers only a low level API.

  • void addRecord(Record record);
  • void updateRecord(Record record);
  • Record getRecord(Id id);

Performance and Scaling

At this time the focus lies on producing a working solution. Once that exists, bottlenecks can be identified and addressed, for example by:

  • segment the storage
  • distribute it to hardware nodes
