Difference between revisions of "SMILA/Specifications/RecordStorage"

Revision as of 10:18, 3 February 2009

!!! UNDER CONSTRUCTION !!!

Description

As Berkeley XML DB will not be available in eclipse in the near future, we need an open source alternative to store record metadata. There is no requirement to use an XML database; any storage that allows us to persist record metadata will suffice.

RecordStorage Interface

Here is a proposal for a RecordStorage interface. It contains only the basic functionality without any query support.

interface RecordStorage
{
    Record loadRecord(Id id);
    void storeRecord(Record record);
    void removeRecord(Id id);
    boolean existsRecord(Id id);
}


RecordStorage based on JDBC database

At the moment all access to records is based on the record ID. The record ID is the primary key when reading/writing records. It would be straightforward to store the records in a relational database, using just one table with one column for the record ID (the primary key) and a second one for the record itself. The record could be stored as a BLOB or CLOB:

  • BLOB: the record is simply serialized into a byte[] and stored as a BLOB
  • CLOB: the record's XML representation could be stored in a CLOB. Extra method calls to parse/convert the record from/to XML need to be applied when reading/writing the records (a performance impact in comparison to using a BLOB). But this would offer some options to include WHERE clauses accessing the CLOB in SQL queries

Because the String representation of IDs can be really long, an alternative could be to store a hash of the String. (This hash has to be computed on every database access.) In addition one could add another column to store the source attribute of the record ID. This would allow easy access to all records of a data source to handle the use-case "reindexing without crawling".
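The hashing idea could be sketched as follows. This is only an illustration: the class name IdHasher and the choice of SHA-256 are assumptions and are not part of SMILA or of this proposal.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of SMILA): derives a fixed-length key from a long ID string. */
final class IdHasher {

  private IdHasher() {
  }

  /** Returns the SHA-256 digest of the ID string as 64 lowercase hex characters. */
  static String hashId(final String idString) {
    if (idString == null) {
      throw new IllegalArgumentException("parameter idString is null");
    }
    try {
      final MessageDigest digest = MessageDigest.getInstance("SHA-256");
      final byte[] bytes = digest.digest(idString.getBytes(StandardCharsets.UTF_8));
      final StringBuilder hex = new StringBuilder(bytes.length * 2);
      for (final byte b : bytes) {
        hex.append(String.format("%02x", b)); // two hex digits per byte
      }
      return hex.toString();
    } catch (final NoSuchAlgorithmException e) {
      // SHA-256 is mandatory on every Java platform, so this should not happen
      throw new IllegalStateException("SHA-256 not available", e);
    }
  }
}
```

The resulting string has a fixed length of 64 characters, so it would fit into a short VARCHAR primary key column. The same function has to be applied to the ID on every read, write, delete and existence check.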

For advanced use-cases (e.g. Mashup) query support is needed (compare XQJ), e.g. to select all records of a certain mime type. It would be possible to add more columns or join tables for selected record attributes. Another option is to post-process the selected records, filtering out those that do not match the query filter. This is functionally equivalent to an SQL SELECT, but performance is much worse because all candidate records have to be loaded first.
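The post-processing option could look roughly like the sketch below. The class FilteringIterator is hypothetical and generic, so any element type can stand in for Record; the filter is an ordinary java.util.function.Predicate.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical helper: passes on only those elements of a source iterator that match a filter. */
final class FilteringIterator<T> implements Iterator<T> {

  private final Iterator<T> _source;
  private final Predicate<? super T> _filter;
  private T _next;
  private boolean _hasNext;

  FilteringIterator(final Iterator<T> source, final Predicate<? super T> filter) {
    _source = source;
    _filter = filter;
    advance();
  }

  /** Moves to the next matching element; every skipped element has still been loaded. */
  private void advance() {
    _hasNext = false;
    _next = null;
    while (_source.hasNext()) {
      final T candidate = _source.next();
      if (_filter.test(candidate)) {
        _next = candidate;
        _hasNext = true;
        return;
      }
    }
  }

  @Override
  public boolean hasNext() {
    return _hasNext;
  }

  @Override
  public T next() {
    if (!_hasNext) {
      throw new NoSuchElementException();
    }
    final T result = _next;
    advance();
    return result;
  }
}
```

The loop in advance() shows where the cost comes from: the storage has to deliver every record, and the filter is applied only afterwards in Java code.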

When implementing a JDBC RecordStorage one should take care to use database-neutral SQL statements, or make the statements configurable. A good practice could be to implement the reading/writing in DAO objects, so that database-specific implementations of the DAOs could be provided to make use of special features. Most databases offer improved support for BLOBs and CLOBs.

A good choice for an open source database is Apache Derby. The Apache License 2.0 is compatible with the EPL, the database has a low footprint (2MB) and can be used in process as well as in client/server network mode. It is also already committed to Orbit. For a production environment it would be easy to switch to any other JDBC database, like Oracle.

RecordStorage based on relational database using eclipseLink

EclipseLink offers various options to persist Java objects. Below we go into detail about using eclipseLink with JPA (Java Persistence API):

Overview on JPA

A mapping of Java classes to a relational database schema is created by using annotations in Java code or by providing an XML configuration. The classes to be persisted (called Entities) are in general represented by database tables, and their member variables by columns in those tables. There are some requirements to be met:

  • an entity class must provide a no-argument constructor (either public or protected)
  • entity classes must be top level classes, no enums or interfaces

There are two kinds of access types, of which only one can be used per entity:

  • field based: direct access on member variables
  • property based: JavaBean like access via getter- and setter-methods

An Entity must have a unique Id, this can be either

  • a simple key (just one member variable) (@Id)
  • a composite key using multiple member variables. This implies the usage of an additional primary key class that contains the same member variables (same name and type) as the entity class (@Id + @IdClass)
  • an embedded key (@EmbeddedId)
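As an illustration of the composite key case, a primary key class could look like the sketch below. The class name RecordKey and the fields _source and _key are hypothetical. The JPA annotations are only mentioned in comments so that the example compiles with the plain JDK; in a real model the class would be public, and the entity would carry @IdClass(RecordKey.class) plus @Id on its fields of the same name and type.

```java
import java.io.Serializable;
import java.util.Objects;

/**
 * Hypothetical primary key class for an entity with two @Id fields _source and _key.
 * JPA expects such a class to be serializable, to have a public no-argument
 * constructor and to define equals() and hashCode() over all key fields.
 */
class RecordKey implements Serializable {

  private static final long serialVersionUID = 1L;

  private String _source; // must match name and type of the entity's @Id field
  private String _key;    // must match name and type of the entity's @Id field

  public RecordKey() {
  }

  public RecordKey(final String source, final String key) {
    _source = source;
    _key = key;
  }

  @Override
  public boolean equals(final Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof RecordKey)) {
      return false;
    }
    final RecordKey that = (RecordKey) other;
    return Objects.equals(_source, that._source) && Objects.equals(_key, that._key);
  }

  @Override
  public int hashCode() {
    return Objects.hash(_source, _key);
  }
}
```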

Entities can have relations to other entities or contain embedded classes. Embedded classes are not entities themselves (but must meet the same requirements) and do not have a unique Id. They "belong" to the entity object embedding them. Version 1.0 of the JPA specification only requires support for one level of embedded objects. Whether more levels are supported depends on the implementation. Collections are also not allowed as embedded classes.

For more information see ejb-3_0-fr-spec-persistence.pdf.


PoC eclipseLink JPA RecordStorage using Oracle DB and Smila datamodel

  • a JPA RecordStorage is always based on a concrete implementation of the entities to persist. So it is not possible to implement a generic RecordStorage for any Record implementation but only for a specific implementation, in this case our default implementation RecordImpl, IdImpl, ...
  • the classes of the Smila datamodel cannot be used as is, but have to meet the requirements of JPA (e.g. a no-argument constructor)
  • the default implementation uses interfaces of other entity classes as types for member variables. These have to be replaced by concrete implementation types, for example
@Entity
public class RecordImpl implements Record, Serializable {
  @EmbeddedId
  private IdImpl _id;
  ...
}

instead of

@Entity
public class RecordImpl implements Record, Serializable {
  @EmbeddedId
  private Id _id;
  ...
}
  • the Smila coding conventions cause some issues: the names for tables/columns are automatically generated by JPA from class and member variable names. The leading _ used in Smila for member variables leads to invalid SQL statements (at least with Oracle). Therefore it is necessary to define the names of all tables, join tables and columns manually by using annotations or XML configuration
  • the Smila objects (Records, Id, Attribute, Annotation) are all structured recursively and most also make use of Collections or Maps as members
    • eclipseLink supports N levels of embedded objects
    • recursive embedding of objects is NOT supported
    • Collections/Maps of embedded classes are not supported in JPA 1.0 (will be supported in JPA 2.0). eclipseLink offers a so-called DescriptorCustomizer which could be used to implement such support (not considered in the PoC, needs further analysis)
    • so it is not possible to model the Smila classes as embedded objects, which would have been the most natural approach as all data belongs to a single record object
  • as an alternative one could try to model the relations between the various datamodel classes. This means all classes have to be annotated as separate Entity objects and each needs its own unique Id. Besides the Record object, no other object has a single member that could be used as a primary key. This is a major problem, as for example an object of type Attribute is only unique when the primary key is built over all member variables, which in addition are Lists. Class LiteralImpl has another problem, as the member _value is of type Object, which is represented as a Blob in the database. But Blobs cannot be used as part of a primary key
    • Daniel Stucky : we could check if automatically generated Ids are useful in this scenario. This means that all Smila Entity objects have to add a new member variable (e.g. int _persistentId ). Most likely this approach still fails because of the recursive structure


PoC eclipseLink JPA RecordStorage using Oracle DB and Dao Objects for Smila datamodel

The basic idea is to use a simpler datamodel than the Smila datamodel for persistence with JPA, by storing the Smila datamodel as serialized data in this simplified datamodel. Thereby the restrictions concerning recursion and Collection/Map support should be circumvented.

For the class RecordImpl a Dao class will be implemented that serializes the Record object into a member variable:

@Entity
@Table(name = "RECORDS")
public class RecordDao implements Serializable {
 
  @Id
  @Column(name = "ID")
  private String _idString;
 
  @Column(name = "RECORD")
  private byte[] _serializedRecord;
 
  protected RecordDao() {
  }
 
  public RecordDao(Record record) throws IOException {
    if (record == null) {
      throw new IllegalArgumentException("parameter record is null");
    }
    if (record.getId() == null) {
      throw new IllegalArgumentException("parameter record has no Id set");
    }
 
    final ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
    final ObjectOutputStream objectStream = new ObjectOutputStream(byteStream);
    objectStream.writeObject(record);
    objectStream.close();
    _serializedRecord = byteStream.toByteArray();
    _idString = record.getId().toString();
  }
 
  public Record toRecord() throws IOException, ClassNotFoundException {
    final ObjectInputStream objectStream = new ObjectInputStream(new ByteArrayInputStream(_serializedRecord));
    final Record record = (Record) objectStream.readObject();
    objectStream.close();
    return record;
  }
}

In the database a table RECORDS will be created having the columns

  • ID - VARCHAR
  • RECORD - BLOB

The ID is a String representation of the Id of the Record. It is used as the primary key in the database. It is not used during deserialization of the Record, because the Id object itself is automatically serialized with the Record object. The interface RecordStorage would be left unchanged. Internally the eclipseLink EntityManager works with RecordDao objects instead of RecordImpl objects.
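The serialization mechanism used by RecordDao can be tried out in isolation with the JDK alone. The following sketch uses a hypothetical helper class SerializationSketch and accepts any Serializable object as a stand-in for a Record; it is not part of the PoC code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical helper showing the byte[] round trip that RecordDao relies on. */
final class SerializationSketch {

  private SerializationSketch() {
  }

  /** Serializes the object into a byte[], as it would be stored in the RECORD column. */
  static byte[] toBytes(final Serializable object) throws IOException {
    final ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
    // try-with-resources closes (and thereby flushes) the stream even on errors
    try (ObjectOutputStream objectStream = new ObjectOutputStream(byteStream)) {
      objectStream.writeObject(object);
    }
    return byteStream.toByteArray();
  }

  /** Restores the object from the byte[]; the caller casts it to the expected type. */
  static Object fromBytes(final byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream objectStream = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return objectStream.readObject();
    }
  }
}
```

The restored object is a new instance that is equal to, but not identical with, the original. Everything reachable from the serialized object has to be Serializable as well, which is why the Id travels along with the Record.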

Enhanced Dao classes for restricted selections of attributes

For advanced use cases there is a need to select records by queries. This could be realised by adding fixed and/or configurable Record attributes to the Dao class. These would be used to allow filtering by selected attribute/value pairs. They are not used to reconstruct the Record during deserialization. They are just data stored in addition to the serialized record.

A fixed attribute could be the source element of the Record Id. It could be used to select records based on the data source (use-case build index without crawling). The RecordDao would be enhanced by the following member variable:

  @Column(name = "SOURCE")
  private String _source;

That would result in an additional column SOURCE in the table RECORDS.

For configurable attributes the RecordDao would be enhanced with a list of dao objects for attribute values called AttributeDao:

  @OneToMany(targetEntity = AttributeDao.class, cascade = CascadeType.ALL)
  @JoinTable(name = "RECORD_ATTRIBUTES", joinColumns = @JoinColumn(name = "RECORD_ID", referencedColumnName = "ID"), inverseJoinColumns = @JoinColumn(name = "ATTRIBUTE_ID", referencedColumnName = "ATT_ID"))
  List<AttributeDao> _attributes;

The AttributeDao class could be implemented like this:

@Entity
@Table(name = "ATTRIBUTES")
public class AttributeDao {
 
  @Id
  @GeneratedValue(strategy=GenerationType.AUTO)
  @Column(name = "ATT_ID")
  private String _id;
 
  @Column(name = "ATT_NAME")
  private String _name;
 
  @BasicCollection (
    fetch=FetchType.EAGER,
    valueColumn=@Column(name="ATT_VALUE"))
    @CollectionTable (
        name="ATTRIBUTE_VALUES",
        primaryKeyJoinColumns=
        {@PrimaryKeyJoinColumn(name="ATT_ID", referencedColumnName="ATT_ID")}
    )    
  private List<String> _values;
 
  protected AttributeDao() {
  }
 
  public AttributeDao(String name) {
    _name = name;
    _values = new ArrayList<String>();
  }
 
  public void addValue(String value)
  {
    _values.add(value);
  }
}

Summing up all modifications the final database schema could look like this: [db_schema.png]


A configuration tells the RecordStorage which Record attributes it should persist in the database in addition to the serialized Record. Neither Annotations nor Sub-Attributes are supported in this approach, just the Literals of the Attribute are stored. Of course more advanced enhancements are possible (but were not in the scope of this PoC).


The RecordStorage interface could add the following methods to support simple queries. Another option is to introduce another interface, e.g. RecordQueries, to leave the basic Record functionality in a separate interface.

interface RecordStorage
{
    // simple query functionality
    Iterator<Record> findRecordsBySource(String source);
    Iterator<Record> findRecordsByAttribute(String name, String value);
    Iterator<Record> findRecordsByNativeQuery(String whereClause); // possibly not generally usable if the underlying store does not support SQL
}

The first method can be realized by a JPQL NamedQuery. This is just an Annotation of the RecordDao class used by the RecordStorage implementation:

@NamedQueries({
    @NamedQuery(name="RecordDao.findBySource",
                query="SELECT r FROM RecordDao r WHERE r._source = :source"),
})

An implementation of the method could look like this:

public Iterator<Record> findRecordsBySource(final String source) {
    Query query = _em.createNamedQuery("RecordDao.findBySource");
    List<RecordDao> daos = query.setParameter("source", source).getResultList();
    return new RecordIterator(daos.iterator());
}

The helper class RecordIterator is needed to convert the RecordDao objects into Records during iteration.


The second method cannot be expressed as a JPQL NamedQuery annotation in JPA 1.0, as JPQL lacks the functionality to select values in a Collection. Therefore we have to use an eclipseLink enhancement and generate the query in Java code:
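The PoC code of RecordIterator is not shown on this page. A minimal sketch of such a converting iterator, written generically so that it runs without the Smila classes, might look like this; the names ConvertingIterator and Converter are assumptions made for illustration.

```java
import java.util.Iterator;

/** Hypothetical generic variant of RecordIterator: converts elements lazily during iteration. */
final class ConvertingIterator<S, T> implements Iterator<T> {

  /** Conversion step that may throw checked exceptions, like RecordDao.toRecord(). */
  interface Converter<S, T> {
    T convert(S source) throws Exception;
  }

  private final Iterator<S> _delegate;
  private final Converter<S, T> _converter;

  ConvertingIterator(final Iterator<S> delegate, final Converter<S, T> converter) {
    _delegate = delegate;
    _converter = converter;
  }

  @Override
  public boolean hasNext() {
    return _delegate.hasNext();
  }

  @Override
  public T next() {
    final S source = _delegate.next();
    try {
      return _converter.convert(source);
    } catch (final Exception e) {
      // Iterator.next() cannot throw checked exceptions, so they are wrapped
      throw new IllegalStateException("conversion failed", e);
    }
  }
}
```

With the classes of this page it could be used as new ConvertingIterator<RecordDao, Record>(daos.iterator(), RecordDao::toRecord). Each record is deserialized only when next() is called, so a large result list is not converted up front.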

public Iterator<Record> findRecordsByAttribute(final String name, final String value) {
 
    final Session session = JpaHelper.getServerSession(_emf); 
    final ReadAllQuery query = new ReadAllQuery(RecordDao.class); 
 
    ExpressionBuilder record = new ExpressionBuilder(); 
    Expression attributes = record.anyOf("_attributes");
    Expression criteria = attributes.get("_name").equal(name);
    criteria = criteria.and(attributes.anyOf("_values").equal(value));
    query.setSelectionCriteria(criteria);
    query.dontUseDistinct();
 
    List<RecordDao> daos = (List<RecordDao>)session.executeQuery(query);
    return new RecordIterator(daos.iterator());
}
  • Daniel Stucky : Initially there was a problem, as the generated SQL query used DISTINCT, which is not allowed in conjunction with Blobs. This was solved by using query.dontUseDistinct();

The last method is a more generic variant that allows selection of Records via native SQL. As the results of the SQL query must be Record objects, the caller is only allowed to supply the WHERE clause of the SQL statement, which is appended to the static SELECT * FROM RECORDS

public Iterator<Record> findRecordsByNativeQuery(final String whereClause) {
    String sqlString = "SELECT * FROM RECORDS " + whereClause;
    Query query = _em.createNativeQuery(sqlString, RecordDao.class);
    List<RecordDao> daos = query.getResultList();
    return new RecordIterator(daos.iterator());
}

All these enhancements are just suggestions. Other solutions to support selection of records via queries are possible.
