SMILA/Documentation/SesameOntologyManager

Latest revision as of 11:56, 23 January 2012

This page describes an initial integration of a semantic layer into SMILA. Currently it consists essentially of an integration of Aduna's OpenRDF Sesame 2, an open source framework for storage, inferencing, and querying of RDF data. We do not currently provide an RDF API of our own, but simply reuse the Sesame API. Based on experience from actual use cases for the semantic layer, this might change in the future.

Consequently, this page assumes that the reader is familiar with the basic Sesame concepts. A quick browse through Sesame's User Guide should help, especially chapters 3 and 8.

All of the described code is contained in bundle org.eclipse.smila.ontology.

Introduction

Ontologies can be used to describe background knowledge about an application domain. This knowledge can be used during indexing to derive additional attributes for documents, or during search to enhance or restrict queries. Conversely, SMILA pipelets can also be used to add data to an existing ontology, i.e. to learn descriptions of and relations between entities of the application domain.

The de-facto standard format for describing such knowledge is RDF. RDF describes everything in the form of resources and triples (statements) about these resources. A resource is identified by a URI, e.g. http://www.eclipse.org/smila. A statement consists of a subject resource, a predicate resource, and an object. The predicate describes the meaning of the statement. The object can either be a literal of different types, e.g.

Subject | Predicate | Object
http://www.eclipse.org/smila | label | 'SMILA'
http://www.eclipse.org/smila | createdIn | 2007

or another resource, e.g.

Subject | Predicate | Object
http://www.eclipse.org/smila | type | http://www.eclipse.org/Project
http://www.eclipse.org/smila | isPartOf | http://www.eclipse.org/rt/
http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com | isCommitterOf | http://www.eclipse.org/smila

Data is written to an RDF ontology by adding or removing statements. It can be read using a statement pattern, e.g. by asking for all statements with the predicate isPartOf. Another possibility is to use an RDF query language like SPARQL, which allows formulating very complex patterns to access RDF data.
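Reading by statement pattern can be illustrated with a small, self-contained sketch. It does not use the Sesame API: the class TriplePatternSketch, its Triple type, and the match method are hypothetical names invented for this illustration, and all values are kept as plain strings. A null argument acts as a wildcard, which mirrors the idea of asking for all statements with a given predicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative in-memory statement list; this is NOT the Sesame API. */
final class TriplePatternSketch {

    /** One RDF statement; all three parts are kept as plain strings here. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** A few of the example statements from the tables above. */
    static List<Triple> sampleStatements() {
        return Arrays.asList(
            new Triple("http://www.eclipse.org/smila", "label", "SMILA"),
            new Triple("http://www.eclipse.org/smila", "type", "http://www.eclipse.org/Project"),
            new Triple("http://www.eclipse.org/smila", "isPartOf", "http://www.eclipse.org/rt/"));
    }

    /** Returns all statements matching the pattern; a null argument matches anything. */
    static List<Triple> match(List<Triple> statements, String subject, String predicate, String object) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : statements) {
            if ((subject == null || subject.equals(t.subject))
                    && (predicate == null || predicate.equals(t.predicate))
                    && (object == null || object.equals(t.object))) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ask for all statements with the predicate isPartOf.
        for (Triple t : match(sampleStatements(), null, "isPartOf", null)) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
    }
}
```

In a real repository the same kind of lookup would go through a Sesame repository connection rather than a Java list; the sketch only shows the matching idea.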

Discussion

... add your thoughts here ...

Sesame Ontology Manager

Service

The Sesame Ontology Manager is an OSGi service that manages Sesame repositories. Its configuration file describes a number of repositories that can be created or used and associates each of them with a name. Service consumers can then request a connection to one of the repositories by its name. A repository is created on first access by a client.

The configuration file is expected in the configuration area at org.eclipse.smila.ontology/sesameConfig.xml. You can find the schema definition in bundle org.eclipse.smila.ontology in directory schema. The following example is fairly complete with respect to the supported repository types:

<?xml version="1.0" encoding="UTF-8"?>
<SesameConfiguration default="native" xmlns="http://www.eclipse.org/smila/ontology">
    <RepositoryConfig name="memory">
        <MemoryStore persist="true" syncDelay="1000" />
        <Stackable classname="org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer" />
    </RepositoryConfig>
    <RepositoryConfig name="native">
        <NativeStore forceSync="true" indexes="spoc,posc" />
    </RepositoryConfig>
    <RepositoryConfig name="database">
        <RdbmsStore driver="org.postgresql.Driver" maxTripleTables="1" indexed="true" sequenced="true">
            <Url>jdbc:postgresql://localhost/sesame</Url>
            <User>sesame</User>
            <Password>sesame</Password>
        </RdbmsStore>
    </RepositoryConfig>
    <RepositoryConfig name="remote">
        <HttpStore repositoryId="repository">
            <Url>http://localhost:8080/sesame</Url>
        </HttpStore>
    </RepositoryConfig>
</SesameConfiguration>

The root element must specify a default repository name, which must match the name attribute of one of the contained repository configurations. Each repository configuration is described by a <RepositoryConfig> element, which must define a name for the repository. Because the name is also used as the name of the workspace directory in which the files of memory and native stores are kept, it should contain only characters suitable for directory names on the current platform. The element must contain one of the <...Store> elements to describe the physical store type and can (except for HttpStores) contain multiple <Stackable> elements naming implementations of org.openrdf.sail.StackableSail that are stacked on top of the store sail. If multiple stackables are specified, they are added on top of the store in the order in which they appear in the XML file. This means that the configuration file describes the actual sail stack bottom-up. Sesame itself contains only two useful classes that could be used here:

  • org.openrdf.sail.inferencer.fc.DirectTypeHierarchyInferencer
  • org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer

See the Sesame API documentation for details.

Memory Store

Creates the repository based on a main memory store. It has only two configuration attributes:

Attribute | Type | Default | Description
persist | true/false | false | Writes repository content to the workspace so that it can be read again after restarts.
syncDelay | integer | 0 | The time (in milliseconds) to wait after a transaction was committed before writing the changed data to file. Setting it to 0 causes the data to be written immediately after a commit. A negative number prevents syncing until shutdown.

See Sesame User Guide: Memory store configuration for details.

Native Store

Creates the repository based on Sesame's native file database format. It has two configuration attributes:

Attribute | Type | Default | Description
indexes | string | - | An index string like "spoc,posc" that describes how the RDF data is indexed for better query performance.
forceSync | true/false | false | Forces a sync to the hard disk on every write. This makes sure that each change is actually persisted in the data files immediately, but decreases write performance.

See Sesame User Guide: Native store configuration for details.

Rdbms Store

Creates a repository that is stored in a relational database. Sesame currently supports PostgreSQL and MySQL. Bundles containing the JDBC driver are currently not part of the SMILA distribution, so you must add them yourself. The element has four possible attributes:

Attribute | Type | Default | Description
driver | string | (required) | The class name of the JDBC driver.
maxTripleTables | integer | 1 | The number of triple tables created by Sesame. The default value causes all statements to be stored in a single table. If more tables are allowed, Sesame creates a separate table per predicate. This may increase performance for large ontologies; however, allowing too many tables might decrease performance again.
indexed | true/false | true | Controls the creation of DB indexes. Usually, this should be enabled for better performance.
sequenced | true/false | true | (No explanation of this option was found in the Sesame documentation or source code. If you know what it does, you can use it.)

The actual database location is configured by up to three sub-elements:

Tag | Type | Required | Description
Url | string | yes | JDBC URL
User | string | no | User name for login
Password | string | no | Password for login

If the database does not require authentication, it may be possible to omit User and Password elements.

See Sesame User Guide: RDBMS store configuration for details.

HTTP Store

Creates a repository that connects to a remote Sesame HTTP repository server. There is only one configuration attribute on the HttpStore element:

Attribute | Type | Default | Description
repositoryId | string | (required) | Name of the repository on the server.

The actual repository server location is configured by up to three sub-elements:

Tag | Type | Required | Description
Url | string | yes | HTTP URL of the repository server
User | string | no | User name for login
Password | string | no | Password for login

If the repository server does not require authentication, User and Password can be omitted. It is not possible to add stackable sails to an HTTP repository; they must be configured on the repository server.

See Sesame User Guide: HTTP repository configuration for details.

JMX Management Agent

There is also a JMX management agent that can be used to import RDF data from files into an ontology, export complete repository contents to an RDF file, and clear repositories. Additionally, it allows reading some information such as the available repository names, the known namespaces, and the size of repositories.

In the JDK JMX console it should look similar to this screenshot:

(Screenshot: SMILA Sesame Ontology Manager.png)

Since the first parameter of each operation is usually the name of the desired repository, we will only describe additional parameters here:

Operation | Description
getRepositoryNames | Returns the names of all configured repositories. These names can be used as the first parameter of the other methods.
getSize | Returns the total number of statements in the named repository.
getNamespaces | Returns a map of namespace prefixes to the complete names in the named repository.
getContexts | Returns a list of context names in the named repository.
clear | Removes all resources and statements from the named repository. The result is a message about the operation.
importRDF | Imports an RDF file into the named repository. The second parameter is the path and filename of the import file, absolute or relative to SMILA's working directory. The third parameter is the base URI for relative resource URIs in this file. If only absolute resource URIs are used in the RDF file, the actual value of this parameter is irrelevant. The format of the file is determined from the filename suffix; if no match is found, RDF/XML is assumed.
exportRDF | Exports the complete content of a repository to an RDF file. The second parameter is the path and filename of the export file, absolute or relative to SMILA's working directory. The format of the file is determined from the filename suffix; if no match is found, RDF/XML is assumed.

There are also example batch files for importing, exporting, or running the clear operation from the command line. See SMILA/jmxclient for details and adapt them to your own needs.

Currently, supported file formats and extensions include:

RDF format | Filename suffixes
RDF/XML | .rdf, .rdfs, .owl
N-Triples | .nt
Turtle | .ttl
N3 | .n3
TRIX | .trix
TRIG | .trig

The .xml suffix is associated with both RDF/XML and TRIX in Sesame, so using it may cause problems.
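The suffix-based format selection with its RDF/XML fallback can be sketched as a simple lookup. This is only an illustration of the table above: the class RdfFormatBySuffix and its detect method are hypothetical, the formats are represented as plain strings, and the case-insensitive comparison is an assumption. The actual detection presumably happens inside Sesame.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative suffix lookup based on the table above; not the real implementation. */
final class RdfFormatBySuffix {

    private static final Map<String, String> SUFFIX_TO_FORMAT = new LinkedHashMap<>();

    static {
        SUFFIX_TO_FORMAT.put(".rdf", "RDF/XML");
        SUFFIX_TO_FORMAT.put(".rdfs", "RDF/XML");
        SUFFIX_TO_FORMAT.put(".owl", "RDF/XML");
        SUFFIX_TO_FORMAT.put(".nt", "N-Triples");
        SUFFIX_TO_FORMAT.put(".ttl", "Turtle");
        SUFFIX_TO_FORMAT.put(".n3", "N3");
        SUFFIX_TO_FORMAT.put(".trix", "TRIX");
        SUFFIX_TO_FORMAT.put(".trig", "TRIG");
        // .xml is deliberately left out because of the ambiguity noted above.
    }

    /** Returns the format name for a filename, or RDF/XML if no suffix matches. */
    static String detect(String filename) {
        // Assumption: suffixes are compared without regard to case.
        String lower = filename.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : SUFFIX_TO_FORMAT.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "RDF/XML"; // documented fallback when no suffix matches
    }

    public static void main(String[] args) {
        System.out.println(detect("ontology.ttl"));
        System.out.println(detect("export.dat"));
    }
}
```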

Pipelets using the Ontology

There are currently four pipelets included that make use of the ontology service. These pipelets are still somewhat experimental, so they may change completely in the future. More pipelets will be created when we implement "real" use cases.

All pipelets use the standard in-BPEL pipelet configuration which can be overridden per record by simple values in the _parameters map attribute. All pipelets use a common property name to select the repository to work with:

Property | Type | Description
sesameRepository | string | The name of the repository to use. If not set, the default repository is used.

Writing/Reading complete records to/from the ontology

There are two pipelets that can be used to write information about a record to the ontology or to read it from the ontology. The resource URI is read from a special attribute named rdf:about, and the resource properties are written from or read into metadata attributes.

These pipelets have another parameter in common:

Property | Type | Description
recordFilter | string | The name of a record filter that lists all attributes that should be interpreted as resource properties. Note that the rdf:about attribute must be contained in this filter if it is used to identify the resource associated with the record. The record filters must be defined in configuration/org.eclipse.smila.blackboard, because the pipelets use blackboard functionality to do the filtering. If not set, all attributes are mapped to properties.

The URI of the resource associated with a processed record is determined this way:

  • If attribute "rdf:about" has literal values (after record filtering), the first one is used as the base value.
  • If the part of the base value before the first ':' character matches a namespace prefix in the used repository, the prefix is replaced by the full name.
  • The resulting value must be accepted by Sesame as a URI string. If not, the record is not processed any further.
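The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipelet's code: ResourceUriResolver and its methods are hypothetical names, the namespace table is a plain Java map, and java.net.URI is used as a stand-in for Sesame's own URI validation, which may accept or reject different strings.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

/** Illustrative resolution of a record's resource URI; not the actual pipelet code. */
final class ResourceUriResolver {

    /** Replaces a known namespace prefix (the part before the first ':') by its full name. */
    static String expandPrefix(String value, Map<String, String> namespaces) {
        int colon = value.indexOf(':');
        if (colon < 0) {
            return value;
        }
        String namespace = namespaces.get(value.substring(0, colon));
        return namespace == null ? value : namespace + value.substring(colon + 1);
    }

    /** Returns the expanded URI string, or null if the record should be skipped. */
    static String resolve(String baseValue, Map<String, String> namespaces) {
        String expanded = expandPrefix(baseValue, namespaces);
        try {
            // Assumption: an absolute java.net.URI approximates "accepted by Sesame".
            return new URI(expanded).isAbsolute() ? expanded : null;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> namespaces = new HashMap<>();
        namespaces.put("eclipse", "http://www.eclipse.org/");
        System.out.println(resolve("eclipse:smila", namespaces));
    }
}
```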

org.eclipse.smila.ontology.pipelets.SesameRecordWriterPipelet

This pipelet writes attribute values to RDF properties. It creates a resource URI for the record from the value it finds in the corresponding attribute. Then, for each top-level attribute value of the record metadata object, it creates a statement using this URI as the subject and the attribute name as the predicate URI. If an attribute name starts with a namespace prefix known in the used repository, the prefix is expanded to the full namespace name. The statement object is created from the SMILA attribute value as follows:

  • If the attribute is a map and contains an attribute with a name of "rdf:about", a resource is created from the sub-structure (after namespace prefix expansion) and its URI is linked to the containing structure.
  • Otherwise, a literal of the ontology datatype that best matches the SMILA literal datatype is created (TODO: not yet implemented for date/time values).

The attribute in which the pipelet looks for the URI is configurable; see the parameter description below.

Any system attribute (i.e. any attribute whose name starts with an underscore "_") is ignored and not written to Sesame.

If references to objects should be created, the attributes containing object references have to be marked by a special attribute named _objectProperties:

  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  ...
  <rec:Seq key="_objectProperties">
    <rec:Val>eclipse:isPartOf</rec:Val>
  </rec:Seq>

So, to create a resource in the repository as in the example in the introduction, a record like this could be used:

 
<rec:Map>
  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  <rec:Map key="rdf:type">
    <rec:Val key="rdf:about">eclipse:Project</rec:Val>
    <rec:Val key="rdf:type">rdfs:Resource</rec:Val>
  </rec:Map>
  <rec:Seq key="eclipse:isPartOf">
    <rec:Map>
      <rec:Val key="rdf:about">eclipse:rt</rec:Val>
      <rec:Val key="rdf:type">eclipse:TopLevelProject</rec:Val>
    </rec:Map>
    <rec:Val>http://www.eclipse.org/</rec:Val>
  </rec:Seq>
  <rec:Val key="rdfs:label">SMILA</rec:Val>
  <rec:Val key="eclipse:createdIn">2007</rec:Val>
 
  <rec:Seq key="_objectProperties">
    <rec:Val>eclipse:isPartOf</rec:Val>
  </rec:Seq>
</rec:Map>

(This assumes that the target repository knows the namespaces "rdf" and "rdfs" (standard namespaces) and "eclipse" (= "http://www.eclipse.org/").)

A string literal with a language can be written using a map with the language name as key and the locale specific value as value of the map:

  <rec:Map key="rdfs:label">
    <rec:Val key="de">SMILA</rec:Val>
    <rec:Val key="en">SMILA</rec:Val>
  </rec:Map>

By default, new statements are just added to the repository. If all existing statements for a given subject and object should be removed before adding the new statements, special system attributes can be used.

To remove all statements before adding the new statements, add an attribute named "_deleteAll" with a value "true":

  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  <rec:Val key="_deleteAll">true</rec:Val>

To remove only some properties before adding the new statements, add a sequence attribute named "_deleteProperties" that lists the properties to be deleted beforehand:

  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  ...
  <rec:Seq key="_deleteProperties">
    <rec:Val>eclipse:isCommitterOf</rec:Val>
    <rec:Val>rdfs:label</rec:Val>
  </rec:Seq>

To create a statement with the record URI as the object and the attribute value as the subject, add a sequence attribute named "_reverseProperties" that lists the reverse properties. This only works if the attribute value is specified to be a resource URI. For example, the final statement of the introductory example can be created from the same record using:

  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  <rec:Val key="eclipse:isCommitterOf">http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com</rec:Val>
  <rec:Seq key="_reverseProperties">
    <rec:Val>eclipse:isCommitterOf</rec:Val>
  </rec:Seq>

Finally, the pipelet supports these additional parameters:

Property | Type | Description
typeUri | string | optional: name of the type to set for the resource if no type statement is created from writing the record.
uriAttribute | string | optional: name of the attribute that contains the URI of the resource. Default: rdf:about

If you want to implement your own pipelets that create attribute values for writing to the repository, see org.eclipse.smila.ontology.records.SesameRecordHelper for constants and helper methods for these special conventions and annotations.

If an error occurs while writing a record to the repository, only the changes related to the current record are rolled back, and the pipelet fails with an exception. Further records in the same message are not processed, and changes from records written earlier are not reverted. (TODO: discuss general pipelet error handling behavior?)

org.eclipse.smila.ontology.pipelets.SesameRecordReaderPipelet

This pipelet reads statements about the URI associated with a record from the repository into attribute values of the record. Its operation is mostly the inverse of that of the SesameRecordWriterPipelet, so the description here is kept short. Some notes:

  • Only statements are used that have the record URI as subject. No "reverse" attributes are created.
  • Resource objects are converted to string literals.
  • Literal objects are converted to the best matching SMILA literal datatype (TODO: date/time objects).
  • A string literal with a language tag is stored as an AnyMap as described above.
  • All statements are read from the repository, but only those attributes contained in the specified record filter are eventually written to the blackboard.
  • If a predicate URI starts with a known namespace, the full namespace value is replaced by its prefix and a colon in the associated attribute name.
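The last note, mapping a full predicate URI back to a prefixed attribute name, can be sketched like this. PredicateNameShortener is a hypothetical helper written for this page, and preferring the longest matching namespace when several match is an assumption; the pipelet may resolve such cases differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative inverse of namespace expansion; not the actual pipelet code. */
final class PredicateNameShortener {

    /** Replaces a known namespace at the start of the URI by "prefix:". */
    static String shorten(String predicateUri, Map<String, String> namespaces) {
        String bestPrefix = null;
        String bestNamespace = "";
        for (Map.Entry<String, String> entry : namespaces.entrySet()) {
            String namespace = entry.getValue();
            // Assumption: if several namespaces match, the longest one wins.
            if (predicateUri.startsWith(namespace) && namespace.length() > bestNamespace.length()) {
                bestPrefix = entry.getKey();
                bestNamespace = namespace;
            }
        }
        if (bestPrefix == null) {
            return predicateUri; // no known namespace: keep the full URI
        }
        return bestPrefix + ":" + predicateUri.substring(bestNamespace.length());
    }

    public static void main(String[] args) {
        Map<String, String> namespaces = new LinkedHashMap<>();
        namespaces.put("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
        namespaces.put("eclipse", "http://www.eclipse.org/");
        System.out.println(shorten("http://www.eclipse.org/isPartOf", namespaces));
    }
}
```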

If an error occurs while reading a record, the pipelet invocation is aborted at this point. If multiple records were processed in this call, changes to already processed records are not reverted.

The attribute in which the pipelet looks for the URI is configurable:

Property | Type | Description
uriAttribute | string | optional: name of the attribute that contains the URI of the resource. Default: rdf:about

org.eclipse.smila.ontology.pipelets.CreateResourcePipelet

This pipelet can be used to look up and create resources of a certain type by their name. For example, if some attribute contains the name of a person, this pipelet can search the ontology for a resource of type "person" with this name, and if no such resource exists, it can create a new one. In either case the URI of this resource is written to (another) attribute. The URI of a new resource is created from the label by removing all non-word characters (i.e. everything except a-z, A-Z, _, 0-9) from the string and appending the result to a configurable prefix value (see below).

The pipelet supports the following parameters:

Property | Type | Description
typeUri | string | required: URI of the type to use for lookup and creation of resources. Namespace expansion is applied to this URI.
labelAttribute | string | required: attribute containing the names of the resources to look up or create.
uriAttribute | string | required: attribute to write the URIs of found/created resources to.
labelPredicate | string | optional: URI of the predicate that specifies the name of the resource. If not set, it defaults to rdfs:label. Namespace expansion is applied to the property value.
uriPrefix | string | optional: prefix for newly created URIs. If not set, "urn:" is used. Namespace expansion is also applied to the complete new URI.
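The way a new resource URI is derived from a label can be sketched in a few lines. ResourceUriFromLabel is a hypothetical helper, not the pipelet's code; it only implements the rule described above (drop everything except a-z, A-Z, _, 0-9, then prepend the prefix) and leaves out the subsequent namespace expansion.

```java
/** Illustrative URI creation from a label; not the actual pipelet code. */
final class ResourceUriFromLabel {

    /** Default prefix as documented for the uriPrefix parameter. */
    static final String DEFAULT_PREFIX = "urn:";

    /** Removes all non-word characters from the label and appends the rest to the prefix. */
    static String createUri(String uriPrefix, String label) {
        String prefix = uriPrefix == null ? DEFAULT_PREFIX : uriPrefix;
        return prefix + label.replaceAll("[^a-zA-Z0-9_]", "");
    }

    public static void main(String[] args) {
        System.out.println(createUri(null, "Juergen Schumacher"));
        System.out.println(createUri("eclipse:", "SMILA 1.0"));
    }
}
```

Note that two different labels can map to the same URI under this rule, for example "SMILA 1.0" and "SMILA-10".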

org.eclipse.smila.ontology.pipelets.CreateRelationPipelet

This pipelet creates statements with subjects and objects read from record attributes. It creates a statement with a configurable predicate in the target repository for each combination of values in two configurable attributes. For example, if one attribute of your records contains URIs of persons and another one contains URIs of the companies they work for, this pipelet could be used to create statements in the ontology using some "worksFor" predicate to describe this relation.

By default, the values of the object attribute are interpreted as URIs. If they should be interpreted as literals, the parameter objectAttributeIsResource has to be set to false in the pipelet configuration.

The pipelet supports the following parameters:

Property | Type | Description
subjectAttribute | string | required: name of the attribute containing the subjects for the statements to create. Regardless of the actual literal type, the pipelet tries to interpret the string values of the literals as URIs. Namespace expansion is applied.
objectAttribute | string | required: name of the attribute containing the objects for the statements to create. Unless the pipelet parameter objectAttributeIsResource is explicitly set to false, the objects are URIs (namespaces expanded). Otherwise the SMILA literal values are written as Sesame literals with a matching datatype.
predicateUri | string | required: URI of the predicate for the statements. Namespace prefixes are expanded first.
objectAttributeIsResource | Boolean | optional: if true, the object attribute values are interpreted as URIs; if false, they are interpreted as literals. Default: true.
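Creating one statement for each combination of subject and object values amounts to a cross product of the two attribute value lists. The following sketch shows only that step: RelationStatements is a hypothetical class, statements are represented as three-element string arrays, and the example URIs are made up. Namespace expansion and the literal/resource distinction are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative cross product of subject and object values; not the actual pipelet code. */
final class RelationStatements {

    /** Returns one {subject, predicate, object} array per subject/object combination. */
    static List<String[]> create(List<String> subjects, String predicateUri, List<String> objects) {
        List<String[]> statements = new ArrayList<>();
        for (String subject : subjects) {
            for (String object : objects) {
                statements.add(new String[] {subject, predicateUri, object});
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        // Hypothetical example values for a "worksFor" relation.
        List<String> persons = Arrays.asList("urn:PersonA", "urn:PersonB");
        List<String> companies = Arrays.asList("urn:CompanyX", "urn:CompanyY");
        for (String[] s : create(persons, "urn:worksFor", companies)) {
            System.out.println(s[0] + " " + s[1] + " " + s[2]);
        }
    }
}
```

With two subject values and two object values this produces four statements, so attributes holding many values can generate a large number of statements.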