This page describes an initial integration of a semantic layer in SMILA. It currently consists essentially of an integration of Aduna's OpenRDF Sesame 2, an open source framework for storage, inferencing and querying of RDF data. We do not currently provide an RDF API of our own, but simply reuse the Sesame API. Based on experience from actual use cases for the semantic layer, this might change in the future.
All of the described code is contained in bundle org.eclipse.smila.ontology.
- 1 Introduction
- 2 Discussion
- 3 Sesame Ontology Manager
- 4 Pipelets using the Ontology
Ontologies can be used to describe background knowledge about an application domain. This knowledge can be used during indexing to derive additional attributes for documents, or during search to enhance or restrict queries. On the other hand, SMILA pipelets can also be used to add data to an existing ontology, i.e. to learn descriptions of and relations between entities of the application domain.
The de-facto standard format for describing such knowledge is RDF. RDF describes everything in the form of resources and triples (statements) about these resources. A resource is identified by a URI, e.g. http://www.eclipse.org/smila. A statement consists of a subject resource, a predicate resource and an object. The predicate describes the meaning of the statement. The object can either be a literal of different types, e.g.
<http://www.eclipse.org/smila> <label> 'SMILA'
<http://www.eclipse.org/smila> <createdIn> 2007
or another resource, e.g.
<http://www.eclipse.org/smila> <type> <http://www.eclipse.org/Project>
<http://www.eclipse.org/smila> <isPartOf> <http://www.eclipse.org/rt/>
<http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com> <isCommitterOf> <http://www.eclipse.org/smila>
Data is written to an RDF ontology by adding or removing statements. It can also be read using a statement pattern, e.g. by asking for all statements with the predicate isPartOf. Another possibility is to use an RDF query language like SPARQL, which allows formulating very complex patterns to access RDF data.
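To make the idea of reading by statement pattern more concrete, here is a minimal sketch in plain Java. It does not use the Sesame API: the class TriplePatternSketch, its methods and the shortened predicate names are hypothetical and only model triples as string arrays, with null acting as a wildcard in a pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory stand-in for an RDF store; this is NOT the Sesame API. */
final class TriplePatternSketch {

    /** Returns all triples matching the pattern; a null argument acts as a wildcard. */
    static List<String[]> match(List<String[]> triples, String s, String p, String o) {
        List<String[]> result = new ArrayList<String[]>();
        for (String[] t : triples) {
            boolean matches = (s == null || s.equals(t[0]))
                    && (p == null || p.equals(t[1]))
                    && (o == null || o.equals(t[2]));
            if (matches) {
                result.add(t);
            }
        }
        return result;
    }

    /** The statements from the introduction, as subject/predicate/object strings. */
    static List<String[]> exampleTriples() {
        String smila = "http://www.eclipse.org/smila";
        return Arrays.<String[]>asList(
                new String[] {smila, "label", "SMILA"},
                new String[] {smila, "createdIn", "2007"},
                new String[] {smila, "type", "http://www.eclipse.org/Project"},
                new String[] {smila, "isPartOf", "http://www.eclipse.org/rt/"},
                new String[] {"http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com",
                        "isCommitterOf", smila});
    }

    public static void main(String[] args) {
        // Ask for all statements with the predicate isPartOf.
        for (String[] t : match(exampleTriples(), null, "isPartOf", null)) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

A real repository would index the statements instead of scanning a list, and a query language such as SPARQL can combine several such patterns in one query.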
... add your thoughts here ...
Sesame Ontology Manager
The Sesame Ontology Manager is an OSGi service that manages Sesame repositories. It has a configuration file that describes a number of repositories that can be created or used and associates them with a name. Service consumers can then request a connection to one of the repositories by name. A repository is created on the first access from a client.
The configuration file is expected in the configuration area as org.eclipse.smila.ontology/sesameConfig.xml. You can find the schema definition in bundle org.eclipse.smila.ontology in directory schema. This is a fairly complete example with each kind of supported repository:
<?xml version="1.0" encoding="UTF-8"?>
<SesameConfiguration default="native" xmlns="http://www.eclipse.org/smila/ontology">
  <RepositoryConfig name="memory">
    <MemoryStore persist="true" syncDelay="1000" />
    <Stackable classname="org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer" />
  </RepositoryConfig>
  <RepositoryConfig name="native">
    <NativeStore forceSync="true" indexes="spoc,posc" />
  </RepositoryConfig>
  <RepositoryConfig name="database">
    <RdbmsStore driver="org.postgresql.Driver" maxTripleTables="1" indexed="true" sequenced="true">
      <Url>jdbc:postgresql://localhost/sesame</Url>
      <User>sesame</User>
      <Password>sesame</Password>
    </RdbmsStore>
  </RepositoryConfig>
  <RepositoryConfig name="remote">
    <HttpStore repositoryId="repository">
      <Url>http://localhost:8080/sesame</Url>
    </HttpStore>
  </RepositoryConfig>
</SesameConfiguration>
The root element must specify a default repository name, which must match the name attribute of one of the contained repository configurations. Each single repository configuration is described by a <RepositoryConfig> element. It must define a name for the repository. As the name is used as the name of a workspace directory to store the files of memory and native stores in, it should contain only characters suitable for directory names on the current platform. The element must contain one of the different <...Store> elements to describe the physical store type and can (except for HttpStores) contain multiple <Stackable> elements defining implementors of org.openrdf.sail.StackableSail that are stacked on the used store sail. If multiple stackables are specified they are added on top of the store in the order of appearance in the XML file. This means, the configuration file describes the actual Sail stack bottom-up. Sesame itself contains only two useful classes that could be used here:
See the Sesame API documentation for details.
The <MemoryStore> element creates the repository based on a main memory store. It has only two configuration attributes:
- persist (true/false, default is false): write repository content to workspace so that it can be read again after restarts
- syncDelay (integer, default is 0): the time (in milliseconds) to wait after a transaction was committed before writing the changed data to file. Setting it to 0 causes the data to be written immediately after a commit. A negative number prevents syncing until shutdown.
See Sesame User Guide: Memory store configuration for details.
The <NativeStore> element creates the repository based on Sesame's native file database format. It has two configuration attributes:
- indexes (string, no default value): An index string like "spoc,posc" that describes how the RDF data is indexed for better query performance.
- forceSync (true/false, default is false): Force sync to the hard disk on every write. This ensures that each change is actually persisted in the data files immediately, but decreases write performance.
See Sesame User Guide: Native store configuration for details.
The <RdbmsStore> element creates a repository that is stored in a relational database. Sesame currently supports PostgreSQL and MySQL. Bundles containing the JDBC driver are currently not part of the SMILA distribution, so you must add them yourself. The element has four possible attributes:
- driver: JDBC driver class name
- maxTripleTables: number of triple tables created by Sesame. The default is 1, which causes all statements to be stored in a single table. If more tables are allowed, Sesame creates separate tables per predicate. This may increase performance for large ontologies; however, allowing too many tables might decrease performance again.
- indexed: (true/false, default true): control creation of DB indexes. Usually you want this enabled for better performance.
- sequenced: (true/false, default true): (I did not find any explanation for this option in the Sesame documentation or source code, but if you know what it does, you can use it ;-)
The actual database location is configured by up to 3 sub-elements:
- Url: JDBC-URL
- User: Username for login
- Password: Password for login
If the database does not require authentication it may be possible to omit User and Password elements.
See Sesame User Guide: RDBMS store configuration for details.
The <HttpStore> element creates a repository that connects to a remote Sesame HTTP repository server. It has only one configuration attribute:
- repositoryId: name of repository in server.
The actual repository server location is configured by up to 3 sub-elements:
- Url: HTTP-URL
- User: Username for login
- Password: Password for login
If the repository server does not require authentication, User and Password can be omitted. It is not possible to add stackable sails to an HTTP repository. This must be configured on the repository server.
See Sesame User Guide: HTTP repository configuration for details.
JMX Management Agent
There is also a JMX management agent that can be used to import RDF data from files into an ontology, export complete repository contents to an RDF file and clear repositories. Additionally, it allows reading some information such as available repository names, known namespaces and sizes of repositories.
In the JDK JMX console it should look similar to this screenshot:
The first parameter of each operation is the name of a configured repository. The following list describes only the additional parameters:
- getRepositoryNames: returns the list of names of configured repositories. These names can be used as the first parameter of the other methods.
- getSize: returns the number of statements in the named repository.
- getNamespaces: returns a map of namespace prefixes to the complete names in the given repository.
- getContexts: returns a list of context names in the given repository.
- clear: All resources and statements are removed from a repository. The result is a message about the operation.
- importRDF: imports an RDF file into a repository. The second parameter is the path and filename of the import file, absolute or relative to SMILA's working directory. The third parameter is the base URI for relative resource URIs in this file. If only absolute resource URIs are used in the RDF file, the actual value of this parameter is irrelevant. The format of the file is determined by looking at the filename suffix; if no match is found, RDF/XML is assumed.
- exportRDF: exports the complete content of a repository to an RDF file. The second parameter is the path and filename of the export file, absolute or relative to SMILA's working directory. The format of the file is determined by looking at the filename suffix; if no match is found, RDF/XML is assumed.
There are also example batch files for using the import, export and clear operations from the command line. See SMILA/jmxclient for details and adapt them to your own needs.
Currently, supported file formats and extensions include:
- RDF/XML: .rdf, .rdfs, .owl
- N-Triples: .nt
- Turtle: .ttl
- N3: .n3
- TRIX: .trix
- TRIG: .trig
The .xml suffix is associated to both RDF/XML and TRIX in Sesame, so there may be problems using it.
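The suffix-based format detection described above can be sketched as follows. This is only an illustration of the rule in plain Java, not the code used by the management agent or by Sesame: the class RdfFormatBySuffixSketch is hypothetical, and matching suffixes case-insensitively is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of suffix-based RDF format detection; not the Sesame implementation. */
final class RdfFormatBySuffixSketch {

    // Suffixes and formats as listed on this page.
    private static final Map<String, String> FORMATS = new LinkedHashMap<>();
    static {
        FORMATS.put(".rdf", "RDF/XML");
        FORMATS.put(".rdfs", "RDF/XML");
        FORMATS.put(".owl", "RDF/XML");
        FORMATS.put(".nt", "N-Triples");
        FORMATS.put(".ttl", "Turtle");
        FORMATS.put(".n3", "N3");
        FORMATS.put(".trix", "TRIX");
        FORMATS.put(".trig", "TRIG");
        // ".xml" is deliberately left out because it is ambiguous (RDF/XML or TRIX).
    }

    /** Returns the format name for a file name, falling back to RDF/XML if no suffix matches. */
    static String formatFor(String fileName) {
        // Assumption: suffixes are compared case-insensitively.
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : FORMATS.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "RDF/XML";
    }

    public static void main(String[] args) {
        System.out.println(formatFor("ontologies/projects.ttl"));
        System.out.println(formatFor("export.dat"));
    }
}
```

In this sketch a file with the ambiguous .xml suffix simply falls through to the RDF/XML default, which may differ from what Sesame actually does.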
Pipelets using the Ontology
There are currently two pipelets that make use of the ontology service. These pipelets are also kind of experimental, so they may change completely in the future. More pipelets will be created when we implement "real" use cases.
The pipelets can be used to create and access information about a record in the ontology. The resource URI can be read from a special attribute named "rdf:about" or from the record ID; resource properties are written from or read into metadata attributes.
Both pipelets use the standard in-BPEL pipelet configuration. They have these properties in common:
- sesameRepository: name of repository to use. If not set, the default repository is used.
- recordFilter: name of a record filter that lists all attributes that should be interpreted as resource properties. Note that the rdf:about attribute must be contained in this filter if it is used to identify the resource associated with the object. The record filters must be defined in configuration/org.eclipse.smila.blackboard, because the pipelets use blackboard functionality to do the filtering. If not set, all attributes are mapped to properties.
The URI of the resource associated to a processed record is determined this way:
- If attribute "rdf:about" has literal values (after record filtering), the first one is used as the base value.
- Else the key part (unnamed) of the record ID is used as the base value.
- If the part of the base value before the first ':' character matches a namespace prefix in the used repository, the prefix is replaced by the full namespace name.
- The resulting value must be accepted by Sesame as a URI string. If not, the record is not processed any further.
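The steps above can be sketched in plain Java as follows. The class RecordUriSketch and its methods are hypothetical and do not reproduce the pipelet code. In particular, the final check uses java.net.URI as a stand-in for "accepted by Sesame as a URI string", which is an assumption and may be stricter or looser than Sesame's own check.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of how a record's resource URI is determined; not the pipelet code. */
final class RecordUriSketch {

    /** Replaces a known namespace prefix before the first ':' by the full namespace name. */
    static String expandPrefix(String value, Map<String, String> namespaces) {
        int colon = value.indexOf(':');
        if (colon > 0) {
            String fullName = namespaces.get(value.substring(0, colon));
            if (fullName != null) {
                return fullName + value.substring(colon + 1);
            }
        }
        return value;
    }

    /**
     * Returns the resource URI for a record, or null if the record should be skipped.
     * rdfAbout is the first literal value of "rdf:about" (null if absent),
     * recordKey is the key part of the record ID.
     */
    static String resolve(String rdfAbout, String recordKey, Map<String, String> namespaces) {
        String base = rdfAbout != null ? rdfAbout : recordKey;
        if (base == null) {
            return null;
        }
        String expanded = expandPrefix(base, namespaces);
        try {
            // Assumption: an absolute java.net.URI approximates what Sesame accepts.
            return new URI(expanded).isAbsolute() ? expanded : null;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Example namespace table with the "eclipse" prefix used on this page. */
    static Map<String, String> exampleNamespaces() {
        Map<String, String> namespaces = new HashMap<>();
        namespaces.put("eclipse", "http://www.eclipse.org/");
        return namespaces;
    }

    public static void main(String[] args) {
        System.out.println(resolve("eclipse:smila", "some-record-key", exampleNamespaces()));
    }
}
```

Note that a full URI such as http://www.eclipse.org/smila passes through unchanged in this sketch, because "http" is not a namespace prefix in the example table.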
The SesameRecordWriterPipelet writes attribute values to RDF properties. It creates a resource URI for the record as described above and, for each top-level attribute value of the record metadata object, it creates statements using this URI as the subject and the attribute name as a predicate URI. If an attribute name starts with a namespace prefix known in the used repository, it is expanded to the full namespace name. The statement object is created from a SMILA literal value as follows:
- If the semantic type of the literal is set to "rdfs:Resource" (in XML: <L st="rdfs:Resource">) or the attribute name is "rdf:type", a resource URI is created from the string value (after namespace prefix expansion) of the literal.
- Else a literal of the ontology literal datatype best matching the SMILA literal datatype is created (TODO: Not yet implemented for date/time values).
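The decision between a resource object and a literal object can be sketched like this. The class StatementObjectSketch is hypothetical; it leaves out namespace prefix expansion and datatype mapping and only renders the object in an N-Triples-like notation to show which branch was taken.

```java
/** Hypothetical sketch of the resource-vs-literal decision; not the pipelet code. */
final class StatementObjectSketch {

    /** True if the value should become a resource URI rather than a literal. */
    static boolean isResource(String attributeName, String semanticType) {
        return "rdf:type".equals(attributeName) || "rdfs:Resource".equals(semanticType);
    }

    /** Renders the statement object: <...> for resources, "..." for literals. */
    static String describeObject(String attributeName, String semanticType, String value) {
        if (isResource(attributeName, semanticType)) {
            return "<" + value + ">";
        }
        // Datatype mapping (int, double, ...) is omitted in this sketch.
        return "\"" + value + "\"";
    }

    public static void main(String[] args) {
        System.out.println(describeObject("rdf:type", null, "http://www.eclipse.org/Project"));
        System.out.println(describeObject("rdfs:label", null, "SMILA"));
    }
}
```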
So, to create a resource in the repository as in the example in the introduction, a record like this could be used:
<Record>
  <A n="rdf:about">
    <L>
      <V>eclipse:smila</V>
    </L>
  </A>
  <A n="rdf:type">
    <L st="rdfs:Resource"> <!-- st is optional here -->
      <V>eclipse:Project</V>
    </L>
  </A>
  <A n="eclipse:isPartOf">
    <L st="rdfs:Resource">
      <V>eclipse:rt</V>
    </L>
  </A>
  <A n="rdfs:label">
    <L>
      <V>SMILA</V>
    </L>
  </A>
  <A n="eclipse:createdIn">
    <L>
      <V t="int">2007</V>
    </L>
  </A>
</Record>
This assumes that the target repository knows the namespaces "rdf" and "rdfs" (standard namespaces) and "eclipse" (= "http://www.eclipse.org/").
A string literal with a language can be written by attaching an annotation named "xml:lang" with the language name as an anonymous value:
<A n="rdfs:label">
  <L>
    <V>SMILA</V>
    <An n="xml:lang">
      <V>de</V>
    </An>
  </L>
</A>
By default, new statements are just added to the repository. If all existing statements for a given subject and predicate should be removed before adding the new statements, add an annotation named "org.eclipse.smila.ontology" with a named value "clear" giving the language of the objects to be removed first. To remove all statements, use "all" as the language:
<A n="rdfs:label">
  <L>
    <V>SMILA</V>
    <An n="org.eclipse.smila.ontology">
      <V n="clear">all</V>
    </An>
  </L>
</A>
To create a statement with the record URI as the object and the attribute value as the subject, put the anonymous value "reverse" in the same annotation. This only works if the attribute value is specified to be a resource URI. E.g. the final statement of the introductory example can be created from the same record using:
<A n="eclipse:isCommitterOf">
  <L st="rdfs:Resource">
    <V>http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com</V>
    <An n="org.eclipse.smila.ontology">
      <V>reverse</V>
    </An>
  </L>
</A>
If you want to implement your own pipelets that create attribute values for writing to the repository, see org.eclipse.smila.ontology.records.SesameRecordHelper for constants and helper methods for these special conventions and annotations.
If an error occurs while writing a record to the repository, only the changes related to the current record are rolled back and the pipelet fails with an exception. Further records in the same message are not processed, but changes from records written before are not invalidated. (TODO: discuss general pipelet error handling behaviour?)
This pipelet reads statements from the repository about the URI associated with a record into attribute values of the record. Its operation is mostly inverse to the one of the SesameRecordWriterPipelet, so we can keep the description short here. Some notes:
- The URI is determined in the same way from attributes or record ID as described for the SesameRecordWriterPipelet above.
- Only statements are used that have the record URI as subject. No "reverse" attributes are created.
- Resource objects are converted to string literals with semantic type set to "rdfs:Resource".
- Literal objects are converted to best matching SMILA literal datatype (TODO: date/time objects).
- A language tag on a string literal is stored as an "xml:lang" annotation as described above.
- All statements are read from the repository, but only those attributes contained in the specified record filter are eventually written to the blackboard.
- If a predicate URI starts with a known namespace, the full namespace name is replaced by its prefix and a colon in the associated attribute name.
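The last note, replacing a full namespace name by its prefix, is the inverse of the prefix expansion used when writing. A minimal sketch in plain Java, with a hypothetical class PrefixCompactionSketch, is shown below. Choosing the longest matching namespace when several match is an assumption and not taken from the pipelet code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of turning a predicate URI into a prefixed attribute name. */
final class PrefixCompactionSketch {

    /** Replaces a known namespace name at the start of the URI by "prefix:". */
    static String compact(String uri, Map<String, String> namespaces) {
        String bestPrefix = null;
        String bestName = "";
        for (Map.Entry<String, String> entry : namespaces.entrySet()) {
            String name = entry.getValue();
            // Assumption: if several namespaces match, the longest one wins.
            if (uri.startsWith(name) && name.length() > bestName.length()) {
                bestPrefix = entry.getKey();
                bestName = name;
            }
        }
        return bestPrefix == null ? uri : bestPrefix + ":" + uri.substring(bestName.length());
    }

    /** Example namespace table: the two standard namespaces plus "eclipse". */
    static Map<String, String> exampleNamespaces() {
        Map<String, String> namespaces = new HashMap<>();
        namespaces.put("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        namespaces.put("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
        namespaces.put("eclipse", "http://www.eclipse.org/");
        return namespaces;
    }

    public static void main(String[] args) {
        System.out.println(compact("http://www.w3.org/2000/01/rdf-schema#label", exampleNamespaces()));
    }
}
```

URIs that do not start with any known namespace are returned unchanged in this sketch, so the attribute name would then be the full predicate URI.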
Errors that occur while reading data from the repository into the blackboard record are logged and otherwise ignored. Processing continues with the next record in the current list, if one exists.