
SMILA/Documentation/Xml Storage::Implementation::Berkley XML DB

Revision as of 05:53, 13 November 2008 by (Talk | contribs) (Implementation Ideas)

Berkeley DB XML

Oracle Berkeley DB XML is an open source, embeddable XML database with XQuery-based access to documents stored in containers and indexed based on their content. Oracle Berkeley DB XML is built on top of Oracle Berkeley DB and inherits its rich features and attributes. Like Oracle Berkeley DB, it runs in process with the application with no need for human administration. Oracle Berkeley DB XML adds a document parser, XML indexer and XQuery engine on top of Oracle Berkeley DB to enable the fastest, most efficient retrieval of data.

The attached docs from Oracle can be found here.

Key Features and Limitations

  • Replication/Clustering
    • well suited to horizontal scaling with a single read/write master and many read-only nodes.
    • limited to 60 replication nodes on Windows and 1000 on Unix.
    • when the master dies, a new master can be elected automatically or by client code.
  • Size
    • total: max. of 256 Tera Bytes in
    • container size: limited by file system [1]
  • XML Encodings
    • ASCII, UTF-8, UTF-16 (big/little endian), UCS4 (big/little endian), the EBCDIC code pages IBM037, IBM1047 and IBM1140, ISO-8859-1 (aka Latin1) and Windows-1252. [2], [3]
  • XML Capabilities
    • XML documents can be grouped into one or more containers by any characteristic, e.g. there is no requirement to store only docs of the same namespace in one container
    • a single XQuery can span more than one container
    • documents can be stored at node level or as whole documents
    • in-place modification of XML nodes with XQuery
    • validation of XML docs is possible and configurable at container level
  • DB Capabilities
    • transactional
    • locks are fine grained and very configurable

Implementation Ideas

Scenario 1: Parallel Access from diff. Clients on same Host

It is possible to configure BDB so that several client processes can access the underlying data concurrently by sharing the underlying database files. This is called environment sharing. For this to work, BDB's transaction control must be activated as described in BerkeleyDBXML-Txn-JAVA.pdf.

However, this approach is limited to different clients on the same host. Placing the files on a shared resource such as a SAN or NFS is explicitly not a valid solution for accessing the same data from different hosts.

Given the targeted environment of a distributed system, this scenario seems an unlikely use case for SMILA. It could nonetheless make sense if all of the following conditions are met (IMHO this is unlikely to be the case):

  • the client's pre-processing overhead for storing the XML data is relatively large
  • that pre-processing takes place before transactional synchronization
  • parallelization with VM threads is less efficient than with native processes.

Known Problem of Environment Sharing on same Host

Ralf encountered a problem while testing this, which has been resolved via forum support from BDB.

It seems that setRegister(true) must be set. This tells the environment that only the first process to open the DB runs recovery, and no subsequent process does.
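A minimal sketch of such a configuration is shown here, assuming the com.sleepycat.db Java API that ships with Berkeley DB XML is on the classpath. The class name SharedEnvironmentSketch and the choice of subsystem flags are illustrative assumptions, not the actual SMILA configuration.

```java
import com.sleepycat.db.DatabaseException;
import com.sleepycat.db.Environment;
import com.sleepycat.db.EnvironmentConfig;
import java.io.File;
import java.io.FileNotFoundException;

// Hypothetical helper: opens an environment that several processes on one host can share.
class SharedEnvironmentSketch {

    static Environment open(File home) throws DatabaseException, FileNotFoundException {
        EnvironmentConfig config = new EnvironmentConfig();
        config.setAllowCreate(true);        // create the environment files if they are missing
        config.setInitializeCache(true);    // shared memory pool
        config.setInitializeLocking(true);  // locking subsystem
        config.setInitializeLogging(true);  // logging subsystem, needed for transactions
        config.setTransactional(true);      // transaction control, required for environment sharing
        config.setRunRecovery(true);        // request recovery when the environment is opened
        config.setRegister(true);           // register each process so that later ones skip recovery
        return new Environment(home, config);
    }
}
```

Every process that shares the environment would call the same helper with the same home directory.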

Scenario 2: Parallel Access from diff. Clients on diff. Hosts

BDBX is an embedded DB and as such runs inside the process of a single application. This, however, is hardly the use case of the DB in SMILA, which must be accessible to several modules in a distributed environment, i.e. the clients reside on different hosts.

One way of handling access from different clients to the same data is to put all requests into a message queue (MQ). The queue is polled by a client app (listener) that wraps the access to the data.
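The listener pattern described above can be sketched with plain Java concurrency classes. This is only an illustration: an in-memory BlockingQueue stands in for the MQ, a HashMap stands in for the XML container, and all class and method names (QueuedStoreSketch, StoreRequest, runListener, runDemo) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: clients only enqueue requests; one listener owns the store.
class QueuedStoreSketch {

    // A write request: document id plus XML payload (names are illustrative).
    static final class StoreRequest {
        final String docId;
        final String xml;

        StoreRequest(String docId, String xml) {
            this.docId = docId;
            this.xml = xml;
        }
    }

    // Marker that tells the listener to stop polling.
    static final StoreRequest STOP = new StoreRequest("", "");

    // Listener loop: takes requests one by one and applies them to the store.
    // The HashMap stands in for the embedded XML container.
    static Map<String, String> runListener(BlockingQueue<StoreRequest> queue) {
        Map<String, String> store = new HashMap<>();
        try {
            while (true) {
                StoreRequest request = queue.take();
                if (request == STOP) {
                    return store;
                }
                store.put(request.docId, request.xml);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return store;
        }
    }

    // Starts one client thread per document, waits for them, then drains the queue.
    // A real listener would run in its own thread or process while clients are active.
    static Map<String, String> runDemo(List<String> docIds) {
        BlockingQueue<StoreRequest> queue = new LinkedBlockingQueue<>();
        List<Thread> clients = new ArrayList<>();
        for (String docId : docIds) {
            Thread client = new Thread(
                () -> queue.add(new StoreRequest(docId, "<doc id='" + docId + "'/>")));
            clients.add(client);
            client.start();
        }
        for (Thread client : clients) {
            try {
                client.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        queue.add(STOP);
        return runListener(queue);
    }

    public static void main(String[] args) {
        Map<String, String> store = runDemo(Arrays.asList("doc-1", "doc-2", "doc-3"));
        System.out.println(store.size() + " documents stored");
    }
}
```

Because only the listener touches the store, the clients need no locking of their own; the price is that the listener is a single point of throughput.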

Under high load this setup must be scaled horizontally so that the listener does not become the bottleneck. To this end BDB allows replication to other nodes, with one master (read/write) and many read-only clients. In such a replicated setup two variants are possible:

  • 1. Replication is transparent to the clients and they just see the MQ, which handles the routing to the r/o nodes.
  • 2. The clients become the nodes themselves to which the data is replicated and thus don't have to share access to the data with any other application.

The latter approach is especially good for clients that only read from the DB and never write to it. If they need write access, the code must be set up so that the master node can be defined or discerned dynamically.
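The routing in variant 1 can be illustrated with a small sketch: writes always go to the master, reads are spread round-robin over the read-only nodes. The class ReplicaRouterSketch and the node names are hypothetical, and round-robin is just one possible distribution policy, not something prescribed by BDB.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical router: decides which node should serve a request.
class ReplicaRouterSketch {
    private final String master;
    private final List<String> readOnlyNodes;
    private final AtomicInteger nextRead = new AtomicInteger();

    ReplicaRouterSketch(String master, List<String> readOnlyNodes) {
        this.master = master;
        this.readOnlyNodes = new ArrayList<>(readOnlyNodes);
    }

    // Writes must hit the master; reads rotate over the read-only nodes.
    // If no read-only node is configured, the master serves reads as well.
    String route(boolean isWrite) {
        if (isWrite || readOnlyNodes.isEmpty()) {
            return master;
        }
        int index = Math.floorMod(nextRead.getAndIncrement(), readOnlyNodes.size());
        return readOnlyNodes.get(index);
    }

    public static void main(String[] args) {
        ReplicaRouterSketch router =
            new ReplicaRouterSketch("master", Arrays.asList("replica-1", "replica-2"));
        System.out.println(router.route(true));   // master
        System.out.println(router.route(false));  // replica-1
        System.out.println(router.route(false));  // replica-2
        System.out.println(router.route(false));  // replica-1
    }
}
```

After a master election the router would need to be told about the new master, which is the dynamic part mentioned above.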

If two clients reside on the same host but in different VMs, it might not be possible to share the replicated DB between these processes because replication is turned on. This is a limitation of the BDB replication framework, if I understand it correctly (see Replication-Java-GSG.pdf p. 20).

Deadlocking resolver

The current implementation avoids deadlocks by synchronizing all operations on the Oracle Berkeley DB XML container. This will be replaced in the near future by the solution proposed in the Oracle BDB XML forum.
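The idea of synchronizing container operations can be sketched as follows. This is not the actual SMILA code: the class SynchronizedContainerSketch and its methods are hypothetical, and a plain counter stands in for the container. Every operation runs under one lock, so no two operations can ever wait on each other.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical wrapper that serializes all container operations behind one lock.
class SynchronizedContainerSketch {
    private final ReentrantLock containerLock = new ReentrantLock();

    // Runs one operation while holding the lock and returns its result.
    <T> T execute(Supplier<T> operation) {
        containerLock.lock();
        try {
            return operation.get();
        } finally {
            containerLock.unlock();
        }
    }

    // Several threads perform operations concurrently; the counter stands in for the container.
    static int runDemo(int threadCount, int operationsPerThread) {
        SynchronizedContainerSketch container = new SynchronizedContainerSketch();
        int[] documentCount = {0};
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < operationsPerThread; j++) {
                    container.execute(() -> ++documentCount[0]);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return documentCount[0];
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1000)); // prints 4000
    }
}
```

The trade-off is that a single lock removes all concurrency inside the container, which is presumably why the approach is regarded as temporary.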
