
SMILA/Documentation/Xml Storage::Implementation::Berkley XML DB

Berkeley DB XML

Oracle Berkeley DB XML is an open source, embeddable XML database with XQuery-based access to documents stored in containers and indexed based on their content. Oracle Berkeley DB XML is built on top of Oracle Berkeley DB and inherits its rich features and attributes. Like Oracle Berkeley DB, it runs in process with the application with no need for human administration. Oracle Berkeley DB XML adds a document parser, XML indexer and XQuery engine on top of Oracle Berkeley DB to enable the fastest, most efficient retrieval of data.
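For orientation, here is a minimal usage sketch of the BDB XML Java API; the container name, document name and content are made up for illustration, and error handling is omitted.

  import com.sleepycat.dbxml.XmlContainer;
  import com.sleepycat.dbxml.XmlDocument;
  import com.sleepycat.dbxml.XmlManager;
  import com.sleepycat.dbxml.XmlUpdateContext;

  public class BasicUsage {
    public static void main(String[] args) throws Exception {
      // Default manager without a shared, transactional environment.
      XmlManager manager = new XmlManager();
      XmlContainer container = manager.createContainer("records.dbxml");
      XmlUpdateContext updateContext = manager.createUpdateContext();

      // Store an XML document under a name and read it back.
      container.putDocument("rec1", "<record id='1'>hello</record>", updateContext);
      XmlDocument document = container.getDocument("rec1");
      System.out.println(document.getContentAsString());

      // The Java bindings wrap native resources that must be released explicitly.
      container.delete();
      manager.delete();
    }
  }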

The attached docs from Oracle can be found [| here]

Key Features and Limitations

  • Replication
    • single Read/Write-Master, many Read Slaves
    • limited to 60/1000 replication nodes on Windows/Unix
    • if the master dies, a new master can be elected automatically.
  • XML Encodings
    • ASCII, UTF-8, UTF-16 (big/little endian), UCS-4 (big/little endian), the EBCDIC code pages IBM037, IBM1047 and IBM1140, ISO-8859-1 (aka Latin1) and Windows-1252.

(see http://xerces.apache.org/xerces-c/faq-parse.html#faq-21 and http://www.oracle.com/technology/products/berkeley-db/faq/xml_faq.html#1)

  • Segmentation
    • XML documents can be grouped into the same container or into separate ones by whatever characteristic you deem appropriate.
    • It is possible to query over more than one container at the same time (see the sketch below).
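The multi-container queries mentioned above go through XQuery's collection() function. A minimal sketch, assuming two illustrative containers records_a.dbxml and records_b.dbxml:

  import com.sleepycat.dbxml.XmlContainer;
  import com.sleepycat.dbxml.XmlManager;
  import com.sleepycat.dbxml.XmlQueryContext;
  import com.sleepycat.dbxml.XmlResults;
  import com.sleepycat.dbxml.XmlValue;

  public class MultiContainerQuery {
    public static void main(String[] args) throws Exception {
      XmlManager manager = new XmlManager();
      // Containers must be open so that collection() can resolve them by name.
      XmlContainer first = manager.openContainer("records_a.dbxml");
      XmlContainer second = manager.openContainer("records_b.dbxml");

      XmlQueryContext queryContext = manager.createQueryContext();
      String query =
          "for $r in (collection('records_a.dbxml'), collection('records_b.dbxml'))//record"
        + " where $r/@id = '42' return $r";

      // One XQuery expression spanning both containers.
      XmlResults results = manager.query(query, queryContext);
      while (results.hasNext()) {
        XmlValue value = results.next();
        System.out.println(value.asString());
      }

      results.delete();
      second.delete();
      first.delete();
      manager.delete();
    }
  }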

Implementation Ideas

Scenario 1: Parallel Access from Different Clients on the Same Host

It is possible to configure BDB such that several client processes can access the underlying data concurrently by sharing the underlying database files. This is called environment sharing. For this to work, BDB needs to be configured with its transaction control activated, as described in [| BerkeleyDBXML-Txn-JAVA.pdf].

However, this approach is limited to different clients on the same host. Placing the files on a shared resource such as a SAN or NFS is explicitly not a valid solution for accessing the same data from different hosts (see http://www.oracle.com/technology/products/berkeley-db/faq/db_faq.html#30).
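A minimal sketch of how each client process on the host might open such a shared, transactional environment; the environment home directory is an assumption, and the transaction guide linked above remains the authoritative reference for the required flags.

  import java.io.File;

  import com.sleepycat.db.Environment;
  import com.sleepycat.db.EnvironmentConfig;
  import com.sleepycat.dbxml.XmlManager;
  import com.sleepycat.dbxml.XmlManagerConfig;

  public class SharedEnvironment {
    public static XmlManager openManager(File envHome) throws Exception {
      EnvironmentConfig envConfig = new EnvironmentConfig();
      envConfig.setAllowCreate(true);        // create the environment files if missing
      envConfig.setInitializeCache(true);    // shared memory cache
      envConfig.setInitializeLocking(true);  // locking subsystem for concurrent access
      envConfig.setInitializeLogging(true);  // write-ahead logging
      envConfig.setTransactional(true);      // transaction control, required for sharing

      Environment environment = new Environment(envHome, envConfig);
      return new XmlManager(environment, new XmlManagerConfig());
    }
  }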

Given the targeted environment of a distributed system, this scenario seems an unlikely use case for EILF. The only situation where it could make sense nonetheless is if all of the following conditions are met (IMHO this is unlikely to be the case):

  • the pre-processing overhead of the client for storing the XML data is relatively large,
  • this processing time occurs before transactional synchronization, and
  • parallelization with VM threads is less efficient than with native processes.

Known Problem of Environment Sharing on the Same Host

Ralf encountered a problem while testing this, which has been resolved via forum support from BDB.

It seems that setRegister(true) must be turned on. This signals the environment that only the first process should start the DB with recovery, and not any subsequent processes.
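On top of the configuration sketched above, this amounts to two further calls (the exact combination is an assumption; the forum thread is the authoritative source):

  import com.sleepycat.db.EnvironmentConfig;

  public class RegisterConfig {
    // The first process runs recovery; later processes detect the registered
    // environment and skip recovery instead of failing.
    static void enableRegistration(EnvironmentConfig envConfig) {
      envConfig.setRegister(true);
      envConfig.setRunRecovery(true);
    }
  }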

Scenario 2: Parallel Access from Different Clients on Different Hosts

BDB XML is an embedded DB and as such it runs inside the process of a single application. This, however, is hardly the use case for the DB in EILF, which needs to be accessible to several modules in a distributed environment, i.e. the clients reside on different hosts.

One way of handling access from different clients to the same data is to put all requests into a message queue (MQ). Its contents are polled by a client application (listener) that wraps the access to the data.
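A hedged sketch of such a listener, assuming JMS as the message queue and a text message that carries the document content; the property name and error handling are made up for illustration.

  import javax.jms.Message;
  import javax.jms.MessageListener;
  import javax.jms.TextMessage;

  import com.sleepycat.dbxml.XmlContainer;
  import com.sleepycat.dbxml.XmlManager;

  public class StoreRequestListener implements MessageListener {
    private final XmlManager manager;
    private final XmlContainer container;

    public StoreRequestListener(XmlManager manager, XmlContainer container) {
      this.manager = manager;
      this.container = container;
    }

    // Called by the JMS provider for every store request in the queue.
    public void onMessage(Message message) {
      try {
        TextMessage textMessage = (TextMessage) message;
        String documentName = textMessage.getStringProperty("documentName"); // assumed property
        container.putDocument(documentName, textMessage.getText(), manager.createUpdateContext());
      } catch (Exception e) {
        e.printStackTrace(); // a real listener would report the failure back to the sender
      }
    }
  }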

Under high load, this scenario must be scaled horizontally so that the listener does not become the bottleneck. To this end, BDB allows replication to other nodes, with one master (read/write) and many read-only clients. In such a replicated scenario, two variants are possible:

  • 1. Replication is transparent to the clients: they just see the MQ, which handles the routing to the read-only nodes.
  • 2. The clients themselves become the nodes to which the data is replicated and thus don't have to share access to the data with any other application.

The latter approach is especially suitable for clients that only read from but do not write to the DB. If they need write access, then the code must be configured such that it is possible to define/discern the master node (dynamically).
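A hedged sketch of how a node might join such a replication group via the BDB replication manager; host name, port and thread count are assumptions, and the configuration calls should be checked against the BDB Java API of the release in use.

  import java.io.File;

  import com.sleepycat.db.Environment;
  import com.sleepycat.db.EnvironmentConfig;
  import com.sleepycat.db.ReplicationHostAddress;
  import com.sleepycat.db.ReplicationManagerStartPolicy;

  public class ReplicatedNode {
    public static Environment startNode(File envHome) throws Exception {
      EnvironmentConfig envConfig = new EnvironmentConfig();
      envConfig.setAllowCreate(true);
      envConfig.setInitializeCache(true);
      envConfig.setInitializeLocking(true);
      envConfig.setInitializeLogging(true);
      envConfig.setTransactional(true);
      envConfig.setThreaded(true);
      envConfig.setInitializeReplication(true);  // enable the replication subsystem
      envConfig.setReplicationManagerLocalSite(
          new ReplicationHostAddress("node1.example.org", 6000));  // this node's address (assumed)
      // The addresses of the other replication group members have to be
      // registered on the configuration as well before opening the environment.

      Environment environment = new Environment(envHome, envConfig);
      // Join the group and take part in master elections rather than
      // being pinned as master or client.
      environment.replicationManagerStart(3, ReplicationManagerStartPolicy.REP_ELECTION);
      return environment;
    }
  }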

If two clients reside on the same host but in different VMs, it might not be possible to share the replicated DB between these processes because replication is turned on. This is a limitation of the BDB replication framework, if I understand it correctly (see Replication-Java-GSG.pdf, p. 20).

Deadlock Resolution

The current implementation avoids deadlocks by synchronizing the Oracle Berkeley DB XML container operations. This will be changed in the near future by implementing the solution proposed in the Oracle BDB XML forum.
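A simplified sketch of what this synchronization amounts to; the class and method names are illustrative, not the actual SMILA implementation.

  import com.sleepycat.dbxml.XmlContainer;
  import com.sleepycat.dbxml.XmlManager;

  public class SynchronizedXmlStorage {
    private final XmlManager manager;
    private final XmlContainer container;

    public SynchronizedXmlStorage(XmlManager manager, XmlContainer container) {
      this.manager = manager;
      this.container = container;
    }

    // Only one thread at a time touches the container, so two operations
    // can never wait for each other's locks and deadlock.
    public synchronized void store(String name, String content) throws Exception {
      container.putDocument(name, content, manager.createUpdateContext());
    }

    public synchronized String fetch(String name) throws Exception {
      return container.getDocument(name).getContentAsString();
    }
  }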
