SMILA/Documentation/SesameOntologyManager

This page describes an initial integration of a semantic layer in SMILA. It currently consists essentially of an integration of Aduna's OpenRDF Sesame 2, an open source framework for storing, inferencing over and querying RDF data. We currently do not provide an RDF API of our own, but simply reuse the Sesame API. This might change in the future, based on experience from actual use cases for the semantic layer.

Consequently, this page assumes that the reader is familiar with the basic Sesame concepts. A quick browse through Sesame's User Guide should help, especially chapters 3 and 8.

All of the described code is contained in bundle org.eclipse.smila.ontology.

Introduction

Ontologies can be used to describe background knowledge about an application domain. This knowledge can be used during indexing to derive additional attributes for documents, or during search to enhance or restrict queries. Conversely, SMILA pipelets can also be used to add data to an existing ontology, i.e. to learn descriptions of and relations between entities of the application domain.

The de-facto standard format for describing such knowledge is RDF. RDF describes everything in the form of resources and triples (statements) about these resources. A resource is identified by a URI, e.g. http://www.eclipse.org/smila. A statement consists of a subject resource, a predicate resource and an object. The predicate describes the meaning of the statement. The object can either be a literal of different types, e.g.

Subject	Predicate	Object
http://www.eclipse.org/smila	label	'SMILA'
http://www.eclipse.org/smila	createdIn	2007

or another resource, e.g.

Subject	Predicate	Object
http://www.eclipse.org/smila	type	http://www.eclipse.org/Project
http://www.eclipse.org/smila	isPartOf	http://www.eclipse.org/rt/
http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com	isCommitterOf	http://www.eclipse.org/smila

Data is written to an RDF ontology by adding or removing statements. It can also be read by statement pattern, e.g. by asking for all statements with the predicate isPartOf. Another possibility is to use an RDF query language like SPARQL, which allows formulating very complex patterns to access RDF data.
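
To make the idea of reading by statement pattern concrete, here is a minimal sketch in plain Python. It does not use Sesame; the `match` function and the in-memory list are hypothetical stand-ins that only illustrate how a pattern with wildcards selects statements from the example tables above.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, object]

# The example statements from the tables above, as plain tuples.
TRIPLES = [
    ("http://www.eclipse.org/smila", "label", "SMILA"),
    ("http://www.eclipse.org/smila", "createdIn", 2007),
    ("http://www.eclipse.org/smila", "type", "http://www.eclipse.org/Project"),
    ("http://www.eclipse.org/smila", "isPartOf", "http://www.eclipse.org/rt/"),
]


def match(triples: Iterable[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[object] = None) -> Iterator[Triple]:
    """Yield all triples that fit the pattern; None acts as a wildcard."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Ask for all statements with the predicate isPartOf.
part_of = list(match(TRIPLES, predicate="isPartOf"))
```

A real repository answers the same kind of question through the Sesame API; see the Sesame User Guide for the actual calls.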

Sesame Ontology Manager

Service

The Sesame Ontology Manager is an OSGi service that manages Sesame repositories. It has a configuration file that describes a number of repositories that can be created or used and associates them with a name. Service consumers can then request a connection to one of the repositories by name. A repository is created on the first access from a client.

The configuration file is expected in the configuration area as org.eclipse.smila.ontology/sesameConfig.xml. You can find the schema definition in bundle org.eclipse.smila.ontology in directory schema. This is a fairly complete example with each kind of supported repository:

<?xml version="1.0" encoding="UTF-8"?>
<SesameConfiguration default="native" xmlns="http://www.eclipse.org/smila/ontology">
    <RepositoryConfig name="memory">
        <MemoryStore persist="true" syncDelay="1000" />
        <Stackable classname="org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer" />
    </RepositoryConfig>
    <RepositoryConfig name="native">
        <NativeStore forceSync="true" indexes="spoc,posc" />
    </RepositoryConfig>
    <RepositoryConfig name="database">
        <RdbmsStore driver="org.postgresql.Driver" maxTripleTables="1" indexed="true" sequenced="true">
            <Url>jdbc:postgresql://localhost/sesame</Url>
            <User>sesame</User>
            <Password>sesame</Password>
        </RdbmsStore>
    </RepositoryConfig>
    <RepositoryConfig name="remote">
        <HttpStore repositoryId="repository">
            <Url>http://localhost:8080/sesame</Url>
        </HttpStore>
    </RepositoryConfig>
</SesameConfiguration>

The root element must specify a default repository name, which must match the name attribute of one of the contained repository configurations. Each single repository configuration is described by a <RepositoryConfig> element. It must define a name for the repository. As the name is used as the name of a workspace directory to store the files of memory and native stores in, it should contain only characters suitable for directory names on the current platform. The element must contain one of the different <...Store> elements to describe the physical store type and can (except for HttpStores) contain multiple <Stackable> elements defining implementors of org.openrdf.sail.StackableSail that are stacked on the used store sail. If multiple stackables are specified they are added on top of the store in the order of appearance in the XML file. This means, the configuration file describes the actual Sail stack bottom-up. Sesame itself contains only two useful classes that could be used here:

org.openrdf.sail.inferencer.fc.DirectTypeHierarchyInferencer
org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer

See the Sesame API documentation for details.
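
The two structural rules above (the default name must match a configured repository, and stackables are listed bottom-up) can be checked with a small script. This is only a sketch using Python's standard XML parser; the function name `read_sail_stacks` and the shortened sample are made up for illustration, and the real service validates the file against its schema instead.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.eclipse.org/smila/ontology}"

# Shortened sample: a memory store with two stacked inferencers, and a native store.
SAMPLE = """
<SesameConfiguration default="native" xmlns="http://www.eclipse.org/smila/ontology">
    <RepositoryConfig name="memory">
        <MemoryStore persist="true" syncDelay="1000" />
        <Stackable classname="org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer" />
        <Stackable classname="org.openrdf.sail.inferencer.fc.DirectTypeHierarchyInferencer" />
    </RepositoryConfig>
    <RepositoryConfig name="native">
        <NativeStore forceSync="true" indexes="spoc,posc" />
    </RepositoryConfig>
</SesameConfiguration>
"""


def read_sail_stacks(config_xml):
    """Return (default name, {repository name: sail stack listed bottom-up})."""
    root = ET.fromstring(config_xml)
    stacks = {}
    for repo in root.findall(NS + "RepositoryConfig"):
        children = list(repo)
        stores = [c for c in children if c.tag.endswith("Store")]
        if len(stores) != 1:
            raise ValueError("exactly one <...Store> element expected")
        store_name = stores[0].tag.split("}", 1)[-1]  # strip the XML namespace
        stackables = [c.get("classname") for c in children
                      if c.tag == NS + "Stackable"]
        # Document order equals stacking order: store first, then each stackable on top.
        stacks[repo.get("name")] = [store_name] + stackables
    default = root.get("default")
    if default not in stacks:
        raise ValueError("default repository %r is not configured" % default)
    return default, stacks


default_name, sail_stacks = read_sail_stacks(SAMPLE)
```

For the "memory" repository the resulting list starts with the store and ends with the topmost sail, which is the order in which the elements appear in the file.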

Memory Store

Creates the repository based on a main memory store. It has only two configuration attributes:

Attribute	Type	Default	Description
persist	true/false	false	write repository content to workspace so that it can be read again after restarts
syncDelay	integer	0	the time (in milliseconds) to wait after a transaction was committed before writing the changed data to file. Setting it to 0 causes the data to be written immediately after a commit. A negative number prevents syncing until shutdown.

See Sesame User Guide: Memory store configuration for details.

Native Store

Creates the repository based on Sesame's native file database format. It has two configuration attributes:

Attribute	Type	Default	Description
indexes	string	-	An index string like "spoc,posc" that describes how the RDF data is indexed for better query performance.
forceSync	true/false	false	Force sync to the hard disk on every write. This ensures that each change is actually persisted in the data files immediately, but decreases write performance.

See Sesame User Guide: Native store configuration for details.

Rdbms Store

Creates a repository that is stored in a relational database. Sesame currently supports PostgreSQL and MySQL. Bundles containing the JDBC driver are currently not part of the SMILA distribution, so you must add them yourself. The element has four possible attributes:

Attribute	Type	Default	Description
driver	string	(required)	JDBC driver class name
maxTripleTables	integer	1	number of triple tables created by Sesame. The default value causes all statements to be stored in a single table. If more tables are allowed, Sesame creates separate tables per predicate. This may increase performance for large ontologies; however, allowing too many tables might decrease performance again.
indexed	true/false	true	control creation of DB indexes. Usually you want this enabled for better performance.
sequenced	true/false	true	(I did not find any explanation for this option in the Sesame documentation or source code, but if you know what it does, you can use it ;-)

The actual database location is configured by up to 3 sub-elements:

Tag	Type	Required	Description
Url	string	yes	JDBC-URL
User	string	no	Username for login
Password	string	no	Password for login

If the database does not require authentication it may be possible to omit User and Password elements.

See Sesame User Guide: RDBMS store configuration for details.

Http Store

Creates a repository that connects to a remote Sesame HTTP repository server. There is only one configuration attribute on the HttpStore element:

Attribute	Type	Default	Description
repositoryId	string	(required)	name of repository in server.

The actual repository server location is configured by up to 3 sub-elements:

Tag	Type	Required	Description
Url	string	yes	HTTP-URL of repository server
User	string	no	Username for login
Password	string	no	Password for login

If the repository server does not require authentication, User and Password can be omitted. It is not possible to add stackable sails to an HTTP repository. This must be configured on the repository server.

See Sesame User Guide: HTTP repository configuration for details.

JMX Management Agent

There is also a JMX management agent that can be used to import RDF data from files into an ontology, export complete repository contents to an RDF file and clear repositories. Additionally, it provides some information such as available repository names, known namespaces and sizes of repositories.

In the JDK JMX console it should look similar to this screenshot:

[Screenshot: SMILA Sesame Ontology Manager in the JDK JMX console]

The first parameter of each repository-specific operation is the name of a configured repository; the following list describes only the additional parameters:

operation	description
getRepositoryNames	returns the list of names of configured repositories. These names can be used as the first parameter of the other methods.
getSize	returns the number of statements in the named repository.
getNamespaces	returns a map of namespace prefixes to the complete names in the given repository.
getContexts	returns a list of context names in the given repository.
clear	removes all resources and statements from a repository. The result is a message about the operation.
importRDF	imports an RDF file into a repository. The second parameter is the path and filename of the import file, absolute or relative to SMILA's working directory. The third parameter is the base URI for relative resource URIs in this file. If only absolute resource URIs are used in the RDF file, the actual value of this parameter is irrelevant. The format of the file is determined from the filename suffix; if no match is found, RDF/XML is assumed.
exportRDF	exports the complete repository content to an RDF file. The second parameter is the path and filename of the export file, absolute or relative to SMILA's working directory. The format of the file is determined from the filename suffix; if no match is found, RDF/XML is assumed.

There are also example batch files for using the import, export and clear operations from the command line. See SMILA/jmxclient for details and adapt them to your own needs.

Currently, supported file formats and extensions include:

RDF format	filename suffixes
RDF/XML	.rdf, .rdfs, .owl
N-Triples	.nt
Turtle	.ttl
N3	.n3
TRIX	.trix
TRIG	.trig

The .xml suffix is associated to both RDF/XML and TRIX in Sesame, so there may be problems using it.
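
The suffix-based format detection with its RDF/XML fallback can be pictured as a simple lookup. The following sketch is not the Sesame implementation; the function name `guess_rdf_format` is hypothetical, and the case-insensitive comparison is an assumption made for this example only.

```python
import os

# Suffix table as listed above. ".xml" is left out on purpose because it is ambiguous.
SUFFIX_TO_FORMAT = {
    ".rdf": "RDF/XML",
    ".rdfs": "RDF/XML",
    ".owl": "RDF/XML",
    ".nt": "N-Triples",
    ".ttl": "Turtle",
    ".n3": "N3",
    ".trix": "TRIX",
    ".trig": "TRIG",
}


def guess_rdf_format(filename):
    """Return the RDF format for a filename, falling back to RDF/XML."""
    suffix = os.path.splitext(filename)[1].lower()  # lower-casing is an assumption
    return SUFFIX_TO_FORMAT.get(suffix, "RDF/XML")


turtle_format = guess_rdf_format("data/eclipse.ttl")
fallback_format = guess_rdf_format("export.dat")
```

When in doubt about how a file will be interpreted, choosing one of the unambiguous suffixes from the table avoids relying on the fallback.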

Pipelets using the Ontology

There are currently four pipelets that make use of the ontology service. These pipelets are also kind of experimental, so they may change completely in the future. More pipelets will be created when we implement "real" use cases.

All pipelets use the standard in-BPEL pipelet configuration, which can be overridden per request using parameter annotations. They use a common property name to select the repository to work with:

Property	Type	Description
sesameRepository	string	name of repository to use. If not set, the default repository is used.

Writing/Reading complete records to/from the ontology

Two pipelets can be used to create and access information about a record in the ontology. The resource URI is read from a special attribute named "rdf:about" or derived from the record ID; resource properties are written from or read into metadata attributes.

These pipelets have another property in common:

Property	Type	Description
recordFilter	string	name of a record filter that lists all attributes that should be interpreted as resource properties. Note that the rdf:about attribute must be contained in this filter if it is used to identify the resource associated with the record. The record filters must be defined in `configuration/org.eclipse.smila.blackboard`, because the pipelets use blackboard functionality to do the filtering. If not set, all attributes are mapped to properties.

The URI of the resource associated to a processed record is determined this way:

If attribute "rdf:about" has literal values (after record filtering), the first one is used as the base value.
Else the key part (unnamed) of the record ID is used as the base value.
If the part of the base value before the first ':' character matches a namespace prefix in the used repository, the prefix is replaced by the full namespace name.
The resulting value must be accepted by Sesame as a URI string. If not, the record is not processed any further.
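
The steps above can be sketched in a few lines of Python. This is not the pipelet code: the function names are hypothetical, and the final check (the string must contain a ':') is only an assumed stand-in for the validation that Sesame performs when it creates a URI.

```python
def expand_prefix(value, namespaces):
    """Replace a known namespace prefix before the first ':' by its full name."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return value


def resolve_record_uri(about_values, record_key, namespaces):
    """Return the resource URI for a record, or None if it cannot be used."""
    # Steps 1 and 2: the first "rdf:about" value wins, else the key of the record ID.
    base = about_values[0] if about_values else record_key
    # Step 3: namespace prefix expansion.
    uri = expand_prefix(base, namespaces)
    # Step 4: assumed stand-in for Sesame's URI validation.
    return uri if ":" in uri else None


NAMESPACES = {"eclipse": "http://www.eclipse.org/"}
from_about = resolve_record_uri(["eclipse:smila"], "some-key", NAMESPACES)
from_key = resolve_record_uri([], "urn:doc1", NAMESPACES)
rejected = resolve_record_uri([], "not a uri", NAMESPACES)
```

Note that a value that already is a full URI, such as http://www.eclipse.org/smila, passes through unchanged unless "http" happens to be a namespace prefix in the repository.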

org.eclipse.smila.ontology.pipelets.SesameRecordWriterPipelet

This pipelet can write attribute values to RDF properties. It creates a resource URI for the record as described above, and for each top-level attribute value of the record metadata object it creates statements using this URI as the subject and the attribute name as the predicate URI. If an attribute name starts with a namespace prefix known in the used repository, it is expanded to the full namespace name. The statement object is created from a SMILA literal value as follows:

If the semantic type of the literal is set to "rdfs:Resource" (in XML: <V st="rdfs:Resource">) or the attribute name is "rdf:type", a resource URI is created from the string value (after namespace prefix expansion) of the literal.
Else a literal of the ontology literal datatype best matching the SMILA literal datatype is created (TODO: Not yet implemented for date/time values).

So, to create a resource in the repository as in the example in the introduction, a record like this could be used:

 
<Record>
  <A n="rdf:about">
    <L>
      <V>eclipse:smila</V>
    </L>
  </A>
  <A n="rdf:type">
    <L> 
      <V st="rdfs:Resource">eclipse:Project</V>  <!-- st is optional here -->
    </L>
  </A>
  <A n="eclipse:isPartOf">
    <L> 
      <V st="rdfs:Resource">eclipse:rt</V>
    </L>
  </A>
  <A n="rdfs:label">
    <L>
      <V>SMILA</V>
    </L>
  </A>
  <A n="eclipse:createdIn">
    <L>
      <V t="int">2007</V>
    </L>
  </A>
</Record>

(assuming the target repository knows the namespaces "rdf", "rdfs" (standard namespaces) and "eclipse" (= "http://www.eclipse.org/")).

A string literal with a language can be written by attaching an annotation named "xml:lang" with the language name as an anonymous value:

  <A n="rdfs:label">
    <L>
      <V>SMILA</V>
      <An n="xml:lang">
        <V>de</V>
      </An>
    </L>
  </A>

By default, new statements are just added to the repository. If existing statements for the given subject and predicate should be removed before adding the new statements, add an annotation named "org.eclipse.smila.ontology" with a named value "clear" giving the language of the objects to be removed first. To remove all statements, use "all" as the language:

  <A n="rdfs:label">
    <L>
      <V>SMILA</V>
      <An n="org.eclipse.smila.ontology">
        <V n="clear">all</V>
      </An>
    </L>
  </A>

To create a statement with the record URI as the object and the attribute value as the subject, put the anonymous value "reverse" in the same annotation. This only works if the attribute value is specified to be a resource URI. E.g. the final statement of the introductory example can be created from the same record using:

  <A n="eclipse:isCommitterOf">
    <L>
      <V st="rdfs:Resource">http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com</V>
      <An n="org.eclipse.smila.ontology">
        <V>reverse</V>
      </An>
    </L>
  </A>
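
The conventions described in this section (prefix expansion, resource versus literal objects, and the "reverse" marker) can be summarized in a short sketch. It is written in plain Python and is not the pipelet implementation; `build_statement` and its parameters are hypothetical, and raising an error for a reversed literal is an assumption, since the actual behaviour in that case is not described here.

```python
def expand_prefix(value, namespaces):
    """Replace a known namespace prefix before the first ':' by its full name."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return value


def build_statement(record_uri, attribute, value, namespaces,
                    semantic_type=None, reverse=False):
    """Return (subject, predicate, object) for one attribute value."""
    predicate = expand_prefix(attribute, namespaces)
    # Resource if marked with st="rdfs:Resource" or if the attribute is rdf:type.
    is_resource = semantic_type == "rdfs:Resource" or attribute == "rdf:type"
    if is_resource:
        obj = ("uri", expand_prefix(str(value), namespaces))
    else:
        obj = ("literal", value)
    if reverse:
        if not is_resource:
            # Assumption for this sketch: reject reversed literals outright.
            raise ValueError("reverse requires a resource value")
        return (obj[1], predicate, ("uri", record_uri))
    return (record_uri, predicate, obj)


NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "eclipse": "http://www.eclipse.org/",
}
SMILA = "http://www.eclipse.org/smila"
type_stmt = build_statement(SMILA, "rdf:type", "eclipse:Project", NS)
label_stmt = build_statement(SMILA, "rdfs:label", "SMILA", NS)
committer_stmt = build_statement(
    SMILA, "eclipse:isCommitterOf",
    "http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com",
    NS, semantic_type="rdfs:Resource", reverse=True)
```

In the reversed case the record URI ends up in the object position, which matches the committer statement from the introduction.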

Finally, the pipelet supports an additional configuration property:

Property	Type	Description
typeUri	string	optional: name of type to set for the resource, if no type statement is created from writing the record.

If you want to implement own pipelets that create attribute values for writing to the repository, see org.eclipse.smila.ontology.records.SesameRecordHelper for constants and helper methods for these special conventions and annotations.

If an error occurs while writing a record to the repository, only the changes related to the current record are rolled back and the pipelet fails with an exception. Further records in the same message are not processed anymore; changes from records written before are not invalidated. (TODO: discuss general pipelet error handling behaviour?)

org.eclipse.smila.ontology.pipelets.SesameRecordReaderPipelet

This pipelet reads statements from the repository about the URI associated with a record into attribute values of the record. Its operation is mostly inverse to the one of the SesameRecordWriterPipelet, so we can keep the description short here. Some notes:

The URI is determined in the same way from attributes or record ID as described for the SesameRecordWriterPipelet above.
Only statements are used that have the record URI as subject. No "reverse" attributes are created.
Resource objects are converted to string literals with semantic type set to "rdfs:Resource".
Literal objects are converted to best matching SMILA literal datatype (TODO: date/time objects).
A language tag on a string literal is stored as an "xml:lang" annotation as described above.
All statements are read from the repository, but only those attributes contained in the specified record filter are written to the blackboard eventually.
If predicate URIs start with known namespaces, the full namespace value is replaced by its prefix and a colon in the associated attribute name.
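
The last note describes the inverse of the prefix expansion used when writing. A minimal Python sketch of that abbreviation step follows; the function name `abbreviate` is hypothetical, and preferring the longest matching namespace is an assumption of this example, not documented pipelet behaviour.

```python
def abbreviate(uri, namespaces):
    """Replace a known full namespace name at the start of a URI by 'prefix:'."""
    # Try longer namespace names first so the most specific one wins (assumption).
    candidates = sorted(namespaces.items(), key=lambda item: len(item[1]), reverse=True)
    for prefix, name in candidates:
        if uri.startswith(name):
            return prefix + ":" + uri[len(name):]
    return uri


NAMESPACES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "eclipse": "http://www.eclipse.org/",
}
label_attribute = abbreviate("http://www.w3.org/2000/01/rdf-schema#label", NAMESPACES)
part_of_attribute = abbreviate("http://www.eclipse.org/isPartOf", NAMESPACES)
```

Predicate URIs outside all known namespaces are returned unchanged, so the attribute name is then the full URI.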

If an error occurs while reading the records, the pipelet invocation is aborted at this point. If multiple records were processed in this call, changes to already processed records are not reverted.

org.eclipse.smila.ontology.pipelets.CreateResourcePipelet

This pipelet can be used to look up and create resources of a certain type by their name. E.g. if some attribute contains the name of a person, this pipelet can search the ontology for a resource of type "person" with this name, and if no such resource exists, it can create a new one. In either case the URI of this resource is written to (another) attribute. The URI of a new resource is created from the label by removing all non-word characters (i.e. everything except a-z, A-Z, _, 0-9) from the string and appending the result to some configurable prefix value (see below).

The pipelet supports the following parameters:

Property	Type	Description
typeUri	string	required: URI of type to use for lookup and creation of resources. Namespace expansion is applied to this URI.
labelAttribute	string	required: attribute containing names of resources to lookup/create.
uriAttribute	string	required: attribute to write the URIs of found/created resources to.
labelPredicate	string	optional: URI of the predicate that specifies the name of the resource. If not set, it defaults to rdfs:label. Namespace expansion is applied to the property value.
uriPrefix	string	optional: prefix for newly created URIs. If not set, "urn:" is used.
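
The lookup-or-create behaviour and the URI construction rule can be sketched as follows. This is plain Python over an in-memory set of triples, not the pipelet itself; the function names are hypothetical, and writing both a type statement and a label statement for a new resource is an assumption made for this example.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def make_resource_uri(label, uri_prefix="urn:"):
    """Drop everything except a-z, A-Z, _ and 0-9, then prepend the prefix."""
    return uri_prefix + re.sub(r"[^A-Za-z0-9_]", "", label)


def find_or_create(triples, label, type_uri,
                   label_predicate=RDFS_LABEL, uri_prefix="urn:"):
    """Return the URI of a resource of the given type and label, creating it if needed."""
    typed = {s for s, p, o in triples if p == RDF_TYPE and o == type_uri}
    for s, p, o in triples:
        if p == label_predicate and o == label and s in typed:
            return s
    uri = make_resource_uri(label, uri_prefix)
    # Assumption: a new resource gets a type and a label statement.
    triples.add((uri, RDF_TYPE, type_uri))
    triples.add((uri, label_predicate, label))
    return uri


store = set()
created = find_or_create(store, "John Q. Public", "urn:Person")
found_again = find_or_create(store, "John Q. Public", "urn:Person")
```

Because only word characters survive, different labels such as "A B" and "AB" map to the same URI, which is worth keeping in mind when choosing labels and prefixes.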

org.eclipse.smila.ontology.pipelets.CreateRelationPipelet

This pipelet creates statements with subjects and objects read from record attributes. It creates a statement with a configurable predicate in the target repository for each combination of values in two configurable attributes. E.g. if one attribute of your records contains URIs of persons and another one URIs of companies they work for, this pipelet could be used to create statements in the ontology using some "worksFor" predicate to describe this relation.

The pipelet supports the following parameters:

Property	Type	Description
subjectAttribute	string	required: name of attribute containing the subjects for the statements to create. Regardless of the actual literal type, the string values of the literals are interpreted as URIs. Namespace expansion is applied.
objectAttribute	string	required: name of attribute containing the objects for the statements to create. If the literals are marked as resource, the objects will be URIs (namespaces expanded). Else the SMILA literal values will be written as Sesame literals with a matching datatype.
predicateUri	string	required: URI of the predicate for the statements. Namespace prefixes are expanded first.
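
"Each combination of values" means the cross product of the two attributes. The following Python sketch shows this with made-up example URIs; `create_relations` is a hypothetical name, and namespace expansion and literal handling are left out to keep it short.

```python
from itertools import product


def create_relations(subjects, objects, predicate_uri):
    """Return one (subject, predicate, object) statement per value combination."""
    return [(s, predicate_uri, o) for s, o in product(subjects, objects)]


# Hypothetical attribute values of one record.
persons = ["urn:person:alice", "urn:person:bob"]
companies = ["urn:company:acme", "urn:company:initech"]
statements = create_relations(persons, companies, "urn:worksFor")
```

With two persons and two companies this yields four statements, so attributes with many values can produce a large number of statements per record.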
