
Difference between revisions of "SMILA/Documentation/SesameOntologyManager"

(Introduction)
 
(12 intermediate revisions by 4 users not shown)
Line 1: Line 1:
This page describes an initial integration of a semantic layer in SMILA. It currently consists basically of an integration of [http://www.aduna-software.com/technologies/overview.view Aduna]'s [http://www.openrdf.org/ OpenRDF Sesame 2], an open source framework for storage, inferencing and querying of RDF data. We do not provide an own RDF API on our own currently, but just reuse the Sesame API. Based on experiences from actual use cases for the Semantic Layer this might change in the future.  
+
This page describes an initial integration of a semantic layer in SMILA. It currently consists essentially of an integration of [http://www.aduna-software.com/technology/sesame Aduna]'s [http://www.openrdf.org/ OpenRDF Sesame 2], an open source framework for storage, inferencing, and querying of RDF data. We do not currently provide an RDF API of our own, but simply reuse the Sesame API. Based on experiences from actual use cases for the Semantic Layer, this might change in the future.
  
 
Consequently, this page assumes that the reader is accustomed to the basic Sesame concepts. A quick browse through Sesame's [http://www.openrdf.org/doc/sesame2/users/ User Guide] should help, especially the chapters [http://www.openrdf.org/doc/sesame2/users/ch03.html 3] and [http://www.openrdf.org/doc/sesame2/users/ch08.html 8].
 
Consequently, this page assumes that the reader is accustomed to the basic Sesame concepts. A quick browse through Sesame's [http://www.openrdf.org/doc/sesame2/users/ User Guide] should help, especially the chapters [http://www.openrdf.org/doc/sesame2/users/ch03.html 3] and [http://www.openrdf.org/doc/sesame2/users/ch08.html 8].
Line 7: Line 7:
 
== Introduction ==
 
== Introduction ==
  
Ontologies can be used to describe background knowledge about an application domain that can be used during indexing to derive additional attributes of for documents or during search to enhance or restrict queries. On the other hand, SMILA pipelets can also used to add additional data to an existing ontology, i.e. to learn descriptions of and relations between entities of the application domain.  
+
Ontologies can be used to describe background knowledge about an application domain that can be used during indexing to derive additional attributes for documents or during search to enhance or restrict queries. On the other hand, SMILA pipelets can also be used to add additional data to an existing ontology, i.e. to learn descriptions of and relations between entities of the application domain.
  
The de-facto standard format for describing such knowledge is RDF. RDF describes everything in the form of resources and triples (statements) about these resources. A resource is identified by a URI, e.g. <tt>http://www.eclipse.org/smila</tt>. A statement consists of a subject resource, a predicate resource and object. The predicate describes the meaning of the statement. The object can either be a literal of different types, e.g.  
+
The de-facto standard format for describing such knowledge is RDF. RDF describes everything in the form of resources and triples (statements) about these resources. A resource is identified by a URI, e.g. <tt>http://www.eclipse.org/smila</tt>. A statement consists of a subject resource, a predicate resource, and an object. The predicate describes the meaning of the statement. The object can either be a literal of different types, e.g.
  
 
{| border="1"
 
{| border="1"
Line 24: Line 24:
 
| 2007  
 
| 2007  
 
|}
 
|}
 
  
 
or another resource, e.g.
 
or another resource, e.g.
Line 46: Line 45:
 
|}
 
|}
  
Data is written to an RDF ontology by adding or removing statements. It can be also read using a statement by e.g. asking for all statements with the predicate <tt>hasPartOf</tt>. Another possibilty is to use a RDF query language like [http://de.wikipedia.org/wiki/SPARQL SPARQL] that allows to formulate very complex pattern to access RDF data.
+
Data is written to an RDF ontology by adding or removing statements. It can also be read using a statement pattern, e.g. by asking for all statements with the predicate <tt>hasPartOf</tt>. Another possibility is to use an RDF query language like [http://en.wikipedia.org/wiki/SPARQL SPARQL], which allows formulating very complex patterns to access RDF data.
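As a rough illustration of the triple model, the statements about the example resource could be serialized in RDF/XML as sketched below. The property names are taken from the record example further down this page; the sketch assumes that the prefix <tt>eclipse</tt> stands for the namespace <tt>http://www.eclipse.org/</tt> and that the creation year is typed as an XML Schema integer.

<source lang="xml">
<!-- Sketch only: the namespace binding for "eclipse" and the datatype of createdIn are assumptions. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:eclipse="http://www.eclipse.org/">
  <!-- subject resource -->
  <rdf:Description rdf:about="http://www.eclipse.org/smila">
    <!-- statements with a resource as object -->
    <rdf:type rdf:resource="http://www.eclipse.org/Project"/>
    <eclipse:isPartOf rdf:resource="http://www.eclipse.org/rt"/>
    <!-- statements with a literal as object -->
    <rdfs:label>SMILA</rdfs:label>
    <eclipse:createdIn rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2007</eclipse:createdIn>
  </rdf:Description>
</rdf:RDF>
</source>

Each child element of <tt>rdf:Description</tt> corresponds to one statement with <tt>http://www.eclipse.org/smila</tt> as the subject.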
 
+
  
 
== Discussion ==
 
== Discussion ==
Line 57: Line 55:
 
=== Service ===
 
=== Service ===
  
The Sesame Ontology Manager is an OSGi service that manages Sesame repositories. It has a configuration file that describes a number of repositories that can be created or used and associates them with a name. Service consumers can then request a connection to one of the repositories by name. A repository is created on the first access from a client.
+
The Sesame Ontology Manager is an OSGi service that manages Sesame repositories. It has a configuration file that describes a number of repositories that can be created or used and associates them with a name. Service consumers can then request a connection to one of the repositories by its name. A repository is created with the first access from a client.
  
Then configuration file is expected in the configuration area as <tt>org.eclipse.smila.ontology/sesameConfig.xml</tt>. You can find the schema definition in bundle <tt>org.eclipse.smila.ontology</tt> in directory <tt>schema</tt>. This in is a faily complete with each kind of supported repository:
+
The configuration file is expected in the configuration area at <tt>org.eclipse.smila.ontology/sesameConfig.xml</tt>. You can find the schema definition in bundle <tt>org.eclipse.smila.ontology</tt> in directory <tt>schema</tt>. The following example is fairly complete with respect to the supported repositories:
  
 
<source lang="xml">
 
<source lang="xml">
Line 86: Line 84:
 
</source>
 
</source>
  
The root element must specify a default repository name, which must match the name attribute of one of the contained repository configurations. Each single repository configuration is described by a <tt><RepositoryConfig></tt> element. It must define a name for the repository. As the name is used as the name of a workspace directory to store the files of memory and native stores in, it should contain only characters suitable for directory names on the current platform. The element must contain one of the different <...Store> elements to describe the physical store type and can (except for HttpStores) contain multiple <Stackable> elements defining implementors of <tt>org.openrdf.sail.StackableSail</tt> that are stacked on the used store sail. If multiple stackables are specified they are added on top of the store in the order of appearance in the XML file. This means, the configuration file describes the actual Sail stack bottom-up. Sesame itself contains only two useful classes that could be used here:
+
The root element must specify a default repository name, which must match the ''name'' attribute of one of the contained repository configurations. Each single repository configuration is described by a <tt><RepositoryConfig></tt> element. It must define a name for the repository. As the name is used as the name of a workspace directory to store the files of memory and native stores in, it should contain only characters suitable for directory names on the current platform. The element must contain one of the different <...Store> elements to describe the physical store type and can (except for HttpStores) contain multiple <Stackable> elements defining implementors of <tt>org.openrdf.sail.StackableSail</tt> that are stacked on the used store sail. If multiple stackables are specified they are added on top of the store in the order of appearance in the XML file. This means, the configuration file describes the actual sail stack bottom-up. Sesame itself contains only two useful classes that could be used here:
  
 
* <tt>org.openrdf.sail.inferencer.fc.DirectTypeHierarchyInferencer</tt>
 
* <tt>org.openrdf.sail.inferencer.fc.DirectTypeHierarchyInferencer</tt>
Line 103: Line 101:
 
! Description
 
! Description
 
|-
 
|-
| persist  
+
| ''persist''
 
| true/false  
 
| true/false  
 
| false  
 
| false  
| write repository content to workspace so that it can be read again after restarts
+
| Writes repository content to the workspace so that it can be read again after restarts.
 
|-
 
|-
| syncDelay  
+
| ''syncDelay''
 
| integer  
 
| integer  
| is 0  
+
| 0  
| the time (in milliseconds) to wait after a transaction was commited before writing the changed data to file. Setting it to 0 causes the data to be written immediately after a commit. A negative number prevents syncing until shutdown.
+
| The time (in milliseconds) to wait after a transaction was committed before writing the changed data to file. Setting it to 0 causes the data to be written immediately after a commit. A negative number prevents syncing until shutdown.
 
|}
 
|}
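A repository entry using a memory store might look like the following fragment. Only the <tt>RepositoryConfig</tt> element, its ''name'' attribute, and the two attributes from the table are taken from this page. The element name <tt>MemoryStore</tt> is assumed from the "...Store" naming pattern, and the repository name and values are made up for illustration.

<source lang="xml">
<!-- Sketch only: "MemoryStore" and the repository name "memory" are assumed names. -->
<RepositoryConfig name="memory">
  <!-- keep the data across restarts and write it to file 1000 ms after each commit -->
  <MemoryStore persist="true" syncDelay="1000" />
</RepositoryConfig>
</source>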
  
Line 118: Line 116:
 
==== Native Store ====
 
==== Native Store ====
  
Creates the repositoty based on Sesame's native file database format. It has two configuration attributes:
+
Creates the repository based on Sesame's native file database format. It has two configuration attributes:
  
 
{| border="1"
 
{| border="1"
Line 126: Line 124:
 
! Description
 
! Description
 
|-
 
|-
| indexes  
+
| ''indexes''
 
| string  
 
| string  
 
| -  
 
| -  
| An index string like "spoc,posc" that describes which how the RDF data is indexed for better query performance.  
+
| An index string like "spoc,posc" that describes how the RDF data is being indexed for better query performance.  
 
|-
 
|-
| forceSync  
+
| ''forceSync''
 
| true/false  
 
| true/false  
 
| false  
 
| false  
| Force sync to the hard disk on every write. This makes it sure that each change is actually persisted in the data files immediately, but decreases write performance.
+
| Force sync to the hard disk on every write. This makes sure that each change is actually persisted in the data files immediately, but decreases write performance.
 
|}
 
|}
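The two attributes could be combined as in this fragment. The element name <tt>NativeStore</tt> and the repository name are assumptions for this sketch; please check the schema in the bundle for the exact names.

<source lang="xml">
<!-- Sketch only: "NativeStore" and the repository name "native" are assumed names. -->
<RepositoryConfig name="native">
  <!-- two indexes, no forced disk sync on every write (faster, but less safe) -->
  <NativeStore indexes="spoc,posc" forceSync="false" />
</RepositoryConfig>
</source>

Because the repository name is also used as a workspace directory name, a plain name such as <tt>native</tt> avoids problems with platform-specific characters.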
  
Line 149: Line 147:
 
! Description
 
! Description
 
|-
 
|-
| driver  
+
| ''driver''
 
| string  
 
| string  
 
| (required)  
 
| (required)  
| JDBC driver class name
+
| The class name of the JDBC driver.
 
|-
 
|-
| maxTripleTables  
+
| ''maxTripleTables''
 
| integer  
 
| integer  
 
| 1  
 
| 1  
| number of tripe tables created by Sesame. The default value causes all statements to be stored in a single table. If more tables are allowed, Sesame creates seperate tables per predicate. This may increase performance for large ontologies, however, allowing too many tables might decrease performance again.
+
| The number of triple tables created by Sesame. The default value causes all statements to be stored in a single table. If more tables are allowed, Sesame creates a separate table per predicate. This may increase performance for large ontologies, however, allowing too many tables might decrease performance again.
 
|-
 
|-
| indexed  
+
| ''indexed''
 
| true/false  
 
| true/false  
 
| true  
 
| true  
| control creation of DB indexes. Usually you want this enabled for better performance.
+
| Controls the creation of DB indexes. Usually, this should be enabled for better performance.
 
|-
 
|-
| sequenced  
+
| ''sequenced''
 
| true/false  
 
| true/false  
 
| true  
 
| true  
Line 170: Line 168:
 
|}
 
|}
  
The actual database location is configured by up to 3 sub-elements:
+
The actual database location is configured by up to three sub-elements:
  
 
{| border="1"
 
{| border="1"
Line 178: Line 176:
 
! Description
 
! Description
 
|-
 
|-
| Url  
+
| ''Url''
 
| string  
 
| string  
 
| yes  
 
| yes  
| JDBC-URL
+
| JDBC URL
 
|-
 
|-
| User
+
| ''User''
 
| string  
 
| string  
 
| no  
 
| no  
| Username for login
+
| User name for login
 
|-
 
|-
| Password  
+
| ''Password''
 
| string  
 
| string  
 
| no  
 
| no  
Line 194: Line 192:
 
|}
 
|}
  
If the database does not require authentication it may be possible to omit User and Password elements.
+
If the database does not require authentication, it may be possible to omit User and Password elements.
  
 
See [http://www.openrdf.org/doc/sesame2/users/ch07.html#section-rdbms-store-config Sesame User Guide: RDBMS store configuration] for details.
 
See [http://www.openrdf.org/doc/sesame2/users/ch07.html#section-rdbms-store-config Sesame User Guide: RDBMS store configuration] for details.
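Putting the attributes and sub-elements together, an RDBMS-backed repository could be configured roughly as follows. The element name <tt>RdbmsStore</tt>, the repository name, the PostgreSQL connection values, and the credentials are assumptions chosen for this sketch.

<source lang="xml">
<!-- Sketch only: "RdbmsStore", the repository name and all connection values are assumptions. -->
<RepositoryConfig name="database">
  <!-- a single triple table (the default), with DB indexes enabled -->
  <RdbmsStore driver="org.postgresql.Driver" maxTripleTables="1" indexed="true" sequenced="true">
    <Url>jdbc:postgresql://localhost/sesame</Url>
    <!-- User and Password may be omitted if the database needs no authentication -->
    <User>sesame</User>
    <Password>secret</Password>
  </RdbmsStore>
</RepositoryConfig>
</source>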
  
==== Http Store ====
+
==== HTTP Store ====
  
 
Creates a repository that connects to a remote Sesame HTTP repository server. There is only one configuration attribute on the HttpStore element:
 
Creates a repository that connects to a remote Sesame HTTP repository server. There is only one configuration attribute on the HttpStore element:
Line 208: Line 206:
 
! Description
 
! Description
 
|-
 
|-
| repositoryId  
+
| ''repositoryId''
 
| string  
 
| string  
 
| (required)  
 
| (required)  
Line 214: Line 212:
 
|}
 
|}
  
The actual repository server location is configured by up to 3 sub-elements:
+
The actual repository server location is configured by up to three sub-elements:
  
 
{| border="1"
 
{| border="1"
Line 222: Line 220:
 
! Description
 
! Description
 
|-
 
|-
| Url  
+
| ''Url''
 
| string  
 
| string  
 
| yes  
 
| yes  
|HTTP-URL of repository server
+
|HTTP URL of the repository server
 
|-
 
|-
| User  
+
| ''User''
 
| string  
 
| string  
 
| no  
 
| no  
| Username for login
+
| User name for login
 
|-
 
|-
| Password  
+
| ''Password''
 
| string  
 
| string  
 
| no  
 
| no  
 
| Password for login
 
| Password for login
 
|}
 
|}
 
  
 
If the repository server does not require authentication, User and Password can be omitted. It is not possible to add stackable sails to an HTTP repository. This must be configured on the repository server.
 
If the repository server does not require authentication, User and Password can be omitted. It is not possible to add stackable sails to an HTTP repository. This must be configured on the repository server.
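A connection to a remote repository might then be declared like this. The <tt>HttpStore</tt> element, its ''repositoryId'' attribute, and the sub-elements are described above; the repository name, the repository ID, the server URL, and the credentials are made-up values for illustration.

<source lang="xml">
<!-- Sketch only: repository name, repository ID, URL and credentials are assumptions. -->
<RepositoryConfig name="remote">
  <!-- no Stackable elements here: sails must be configured on the server -->
  <HttpStore repositoryId="smila">
    <Url>http://localhost:8080/openrdf-sesame</Url>
    <User>sesame</User>
    <Password>secret</Password>
  </HttpStore>
</RepositoryConfig>
</source>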
Line 245: Line 242:
 
=== JMX Management Agent ===
 
=== JMX Management Agent ===
  
There is also a JMX management agent that can be used to import RDF data from files into an ontology, export complete repository contents to an RDF file and clear repositories. Additionally it allows to read some information like available repository names, known namespaces and sizes of repositories.  
+
There is also a JMX management agent that can be used to import RDF data from files into an ontology, export complete repository contents to an RDF file, and clear repositories. Additionally, it allows reading some information such as the available repository names, the known namespaces, and the size of repositories.  
  
 
In the JDK JMX console it should look similar to this screenshot:
 
In the JDK JMX console it should look similar to this screenshot:
  
[[Image:SMILA Sesame Ontology Manager.jpg]]
+
[[Image:SMILA Sesame Ontology Manager.png]]
  
The first parameter of each operation is the name of a configured repository, in the following list only additional parameters will be described:
+
Since the first parameter of each operation is usually the name of the desired repository, we will only describe additional parameters here:
  
 
{| border="1"
 
{| border="1"
Line 257: Line 254:
 
! description
 
! description
 
|-
 
|-
| getRepositoryNames  
+
| ''getRepositoryNames''
| returns the list of names of configured repositories. These names can be used as the first parameter of the other methods.
+
| Returns the names of all configured repositories. These names can be used as the first parameter of the other methods.
 
|-
 
|-
| getSize  
+
| ''getSize''
| returns the number of statemets in the named repository.
+
| Returns the total number of statements in the named repository.
 
|-
 
|-
| getNamespaces  
+
| ''getNamespaces''
| return a map of namespace prefixes to the complete names in the given repository.
+
| Returns a map of namespace prefixes to the complete names in the named repository.
 
|-
 
|-
| getContexts  
+
| ''getContexts''
| return a list of context names in the given repository.
+
| Returns a list of context names in the named repository.
 
|-
 
|-
| clear  
+
| ''clear''
| All resources and statements are removed from a repository. The result is a message about the operation.
+
| Removes all resources and statements from the named repository. The result is a message about the operation.
 
|-
 
|-
| importRDF  
+
| ''importRDF''
| import an RDF file into a repository. The second parameter is the path and filename of the import file, absolute or relative to SMILA's working directory. The third parameter is the base URI for relative resource URIs in this file. If only absolute resource URIs are used in the RDF file, the actual value of this parameter is irrelevant. The format of the file is determined by looking at the filename suffix, if no match is found, RDF/XML is assumed.
+
| Imports an RDF file into the named repository. The second parameter is the path and filename of the import file, absolute or relative to SMILA's working directory. The third parameter is the base URI for relative resource URIs in this file. If only absolute resource URIs are used in the RDF file, the actual value of this parameter is irrelevant. The format of the file is determined by looking at the filename suffix; if no match is found, RDF/XML is assumed.
 
|-
 
|-
| exportRDF  
+
| ''exportRDF''
| export a complete repository content to an RDF file. The second parameter is the path and filename of the export file, absolute or relative to SMILA's working directory. The format of the file is determined by looking at the filename suffix, if no match is found, RDF/XML is assumed.  
+
| Exports the complete content of the named repository to an RDF file. The second parameter is the path and filename of the export file, absolute or relative to SMILA's working directory. The format of the file is determined by looking at the filename suffix; if no match is found, RDF/XML is assumed.
 
|}
 
|}
  
There are also exemplary batch files for using the import, export and clear operation from the commandline. See SMILA/jmxclient for details and adapt them to your own needs.
+
There are also exemplary batch files for importing, exporting, or running the clear operation from the command line. See <tt>SMILA/jmxclient</tt> for details and adapt them to your own needs.
  
 
Currently, supported file formats and extensions include:
 
Currently, supported file formats and extensions include:
Line 310: Line 307:
 
== Pipelets using the Ontology ==  
 
== Pipelets using the Ontology ==  
  
There are currently two pipelets that make use of the ontology service. These pipelets are also kind of experimental, so they may change completely in the future. More pipelets will be created when we implement "real" use cases.
+
There are currently four pipelets included that make use of the ontology service. These pipelets are also kind of experimental, so they may change completely in the future. More pipelets will be created when we implement "real" use cases.
  
The pipelets can be used to create and access information about an reord in the ontology: The resource URI can be read from a special attribute named "rdf:about" or the record ID, resource property is written from or read into metadata attributes.
+
All pipelets use the standard in-BPEL pipelet configuration which can be overridden per record by simple values in the <tt>_parameters</tt> map attribute. All pipelets use a common property name to select the repository to work with:
 
+
Both pipelets use the standard in-BPEL pipelet configuration. They have these properties in common:
+
  
 
{| border="1"
 
{| border="1"
Line 321: Line 316:
 
! Description
 
! Description
 
|-
 
|-
| sesameRepository  
+
| ''sesameRepository''
 
| string  
 
| string  
| name of repository to use. If not set, the default repository is used.
+
| The name of the repository to use. If not set, the default repository is used.
 +
|}
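A record could select a different repository for itself by carrying the property in its <tt>_parameters</tt> map, roughly as in this fragment. The repository name <tt>memory</tt> is a made-up value and would have to match a repository defined in the service configuration.

<source lang="xml">
<!-- Sketch only: "memory" is an assumed repository name. -->
<rec:Map>
  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  <!-- overrides the sesameRepository property of the pipelet configuration for this record -->
  <rec:Map key="_parameters">
    <rec:Val key="sesameRepository">memory</rec:Val>
  </rec:Map>
</rec:Map>
</source>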
 +
 
 +
=== Writing/Reading complete records to/from the ontology ===
 +
 
 +
There are two pipelets that can be used to create and access information about a record in/from the ontology: The resource URI can be read from a special attribute named ''rdf:about'', the resource property is written from or read into metadata attributes.
 +
 
 +
These pipelets have another parameter in common:
 +
 
 +
{| border="1"
 +
! Property
 +
! Type
 +
! Description
 
|-
 
|-
| recordFilter  
+
| ''recordFilter''
 
| string
 
| string
| name of a record filter that lists all attributes that should be interpreted as resource property. Note that the rdf:about attribute must be contained in this filter if it is used to identify the resource associated to the object. The record filters must be defined in <tt>configuration/org.eclipse.smila.blackboard</tt>, because the pipelets use blackboard functionality to do the filtering. If not set, all attributes are mapped to proerties.
+
| The name of a record filter that lists all attributes that should be interpreted as resource properties. Note that the ''rdf:about'' attribute must be contained in this filter if it is used to identify the resource associated to the object. The record filters must be defined in <tt>configuration/org.eclipse.smila.blackboard</tt>, because the pipelets use blackboard functionality to do the filtering. If not set, all attributes are mapped to properties.
 
|}
 
|}
  
The URI of the resource associated to a processed record is determined this way:
+
The URI of the resource associated to a processed record is determined this way:
 
* If attribute "rdf:about" has literal values (after record filtering), the first one is used as the base value.
 
* If attribute "rdf:about" has literal values (after record filtering), the first one is used as the base value.
* Else the key part (unnamed) of the record ID is used as the base value.
+
* If the part of the base value before the first ':' character matches a namespace prefix in the used repository, the prefix is replaced by the full name.
* If the part of the base value before the first ':' character matches a namespace prefix in the used repository, replace the prefix by the full name
+
* The resulting value must be accepted by Sesame as a URI string. If not, the record is not processed any further.
* the resulting value must be accepted by Sesame as an URI string. If not, the record is not processed any further.
+
  
 +
==== org.eclipse.smila.ontology.pipelets.SesameRecordWriterPipelet ====
  
=== org.eclipse.smila.ontology.pipelets.SesameRecordWriterPipelet ===
+
This pipelet can write attribute values to RDF properties. It creates a resource URI for the record from the value it finds in the corresponding attribute and, for each top-level attribute value of the record metadata object, creates statements using this URI as the subject and the attribute name as the predicate URI. If an attribute name starts with a namespace prefix known in the used repository, it is expanded to the full namespace name. The statement object is created from a SMILA literal value as follows:
 +
* If the attribute is a map and contains an attribute with a name of "rdf:about", a resource is created from the sub-structure (after namespace prefix expansion) and its URI is linked to the containing structure.
 +
* Else a literal of the ontology literal datatype best matching the SMILA literal datatype is created (TODO: Not yet implemented for date/time values).
  
This pipelet can write attribute values to RDF properties. It creates a resource URI for the record as described above and for each top-level attribute value of the record metadata object, it creates statements using this URI as the subject and the attribute name as a predicate URI. If an attribute name starts with a namespace prefix known in the used repository, it is expanded to the full namespace name. The statement object is created from a SMILA literal vaulue as follow:
+
The attribute in which the pipelet looks for the URI is configurable; see the parameter description below.
* If the semantic type of the literal is set to "rdfs:Resource" (in XML: <tt><L st="rdfs:Resource"></tt>) or the attribute name is "rdf:type", a resource URI is created from the string value (after namespace prefix expansion) of the literal.
+
 
* Else a literal of the ontology literal datatype best matching the SMILA literal datatype is created.
+
Any system attribute (i.e. any attribute whose name starts with an underscore "_") is ignored and not written into Sesame.
 +
 
 +
If references to objects should be created, the attributes containing object references have to be marked by a special attribute named ''_objectProperties'':
 +
 
 +
<source lang="xml">
 +
  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
 +
  ...
 +
  <rec:Seq key="_objectProperties">
 +
    <rec:Val>eclipse:isPartOf</rec:Val>
 +
  </rec:Seq>
 +
</source>
  
 
So, to create a resource in the repository as in the example in the introduction, a record like this could be used:
 
So, to create a resource in the repository as in the example in the introduction, a record like this could be used:
  
 
<source lang="xml">  
 
<source lang="xml">  
<Record>
+
<rec:Map>
   <A n="rdf:about">
+
   <rec:Val key="rdf:about">eclipse:smila</rec:Val>
    <L>
+
   <rec:Map key="rdf:type">
      <V>eclipse:smila</V>
+
     <rec:Val key="rdf:about">eclipse:Project</rec:Val>
    </L>
+
     <rec:Val key="rdf:type">rdfs:Resource</rec:Val>
   </A>
+
   </rec:Map>
  <A n="rdf:type">
+
   <rec:Seq key="eclipse:isPartOf">
     <L st="rdfs:Resource">  <!-- st is optional here -->
+
     <rec:Map>
      <V>eclipse:Project</V>
+
      <rec:Val key="rdf:about">eclipse:rt</rec:Val>
     </L>
+
       <rec:Val key="rdf:type">eclipse:TopLevelProject</rec:Val>
   </A>
+
     </rec:Map>
   <A n="eclipse:isPartOf">
+
    <rec:Val>http://www.eclipse.org/</rec:Val>
     <L st="rdfs:Resource">  
+
   </rec:Seq>
       <V>eclipse:rt</V>
+
   <rec:Val key="rdfs:label">SMILA</rec:Val>
     </L>
+
   <rec:Val key="eclipse:createdIn">2007</rec:Val>
   </A>
+
 
   <A n="rdfs:label">
+
  <rec:Seq key="_objectProperties">
    <L>
+
     <rec:Val>eclipse:isPartOf</rec:Val>
      <V>SMILA</V>
+
   </rec:Seq>
    </L>
+
</rec:Map>
   </A>
+
  <A n="eclipse:createdIn">
+
    <L>
+
      <V t="int">2007</V>
+
     </L>
+
   </A>
+
</Record>
+
 
</source>
 
</source>
  
(assuming the target repository knows the namespaces "rdf", "rdfs" (standard namespaces) and "eclipse" (= "http://www.eclipse.org/").
+
This assumes that the target repository knows the namespaces "rdf" and "rdfs" (standard namespaces) and "eclipse" (= "http://www.eclipse.org/").
  
A string literal with a language can be written by attaching an annotation named "xml:lang" with the language name as an aonomymous value:
+
A string literal with a language can be written using a map with the language name as key and the locale specific value as value of the map:
  
 
<source lang="xml">
 
<source lang="xml">
   <A n="rdfs:label">
+
   <rec:Map key="rdfs:label">
     <L>
+
     <rec:Val key="de">SMILA</rec:Val>
      <V>SMILA</V>
+
    <rec:Val key="en">SMILA</rec:Val>
      <An n="xml:lang">
+
   </rec:Map>
        <V>de</V>
+
      </An>
+
    </L>
+
   </A>
+
 
</source>
 
</source>
  
By default, new statements are just added to the repository. If all existing statements for a given subject and object should be removed before adding the new statements, add an Annotation named "org.eclipse.smila.ontology" with a named value "clear" giving the language of objects to be removed first. To remove all statements, use "all" as language:
+
By default, new statements are just added to the repository. If all existing statements for a given subject and object should be removed before adding the new statements, special system attributes can be used.
 +
 
 +
To remove all statements before adding the new statements, add an attribute named "_deleteAll" with a value "true":
  
 
<source lang="xml">
 
<source lang="xml">
   <A n="rdfs:label">
+
   <rec:Val key="rdf:about">eclipse:smila</rec:Val>
    <L>
+
  <rec:Val key="_deleteAll">true</rec:Val>
      <V>SMILA</V>
+
      <An n="org.eclipse.smila.ontology">
+
        <V n="clear">all</V>
+
      </An>
+
    </L>
+
  </A>
+
 
</source>
 
</source>
  
To create a statement with the record URI as the object and the attribute value as the subject, put the anon value "reverse" in the same annotation. This only works if the attribute value is specified to be a resource URI. E.g. the final statement of the introductory example can be created from the same record using:
+
To remove only some properties before adding the new statements, add a sequence attribute named "_deleteProperties" that lists the properties to be deleted beforehand:
  
 
<source lang="xml">
 
<source lang="xml">
   <A n="eclipse:isCommitterOf">
+
   <rec:Val key="rdf:about">eclipse:smila</rec:Val>
    <L st="rdfs:Resource">
+
  ...
      <V>http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com</V>
+
  <rec:Seq key="_deleteProperties">
      <An n="org.eclipse.smila.ontology">
+
    <rec:Val>eclipse:isCommitterOf</rec:Val>
        <V>reverse</V>
+
    <rec:Val>rdfs:label</rec:Val>
      </An>
+
   </rec:Seq>
    </L>
+
   </A>
+
 
</source>
 
</source>
  
Finally, the pipelet supports an additional configuration property:
+
To create a statement with the record URI as the object and the attribute value as the subject, add a sequence attribute named "_reverseProperties" to the record that lists the properties to be reversed. This only works if the attribute value is specified to be a resource URI. E.g. the final statement of the introductory example can be created from the same record using:
 +
 
 +
<source lang="xml">
 +
  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
 +
  <rec:Val key="eclipse:isCommitterOf">http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com</rec:Val>
 +
  <rec:Seq key="_reverseProperties">
 +
    <rec:Val>eclipse:isCommitterOf</rec:Val>
 +
  </rec:Seq>
 +
</source>
 +
 
 +
Finally, the pipelet supports these additional parameters:
  
 
{| border="1"
 
{| border="1"
Line 423: Line 435:
 
! Description
 
! Description
 
|-
 
|-
| typeUri  
+
| ''typeUri''
 
| string  
 
| string  
 
| optional: name of type to set for the resource, if no type statement is created from writing the record.
 
| optional: name of type to set for the resource, if no type statement is created from writing the record.
 +
|-
 +
| ''uriAttribute''
 +
| string
 +
| optional: attribute to write the URIs of found/created resources to. Default: ''rdf:about''
 
|}
 
|}
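Taken together, an invocation of the writer pipelet in a BPEL pipeline might be configured roughly as shown below. The property keys are the ones described on this page. The surrounding <tt>proc:</tt> elements, the activity name, and the variable names are assumptions about the in-BPEL configuration format, and the repository and filter names are made up.

<source lang="xml">
<!-- Sketch only: the proc: wrapper elements, names and values are assumptions. -->
<extensionActivity>
  <proc:invokePipelet name="writeToOntology">
    <proc:pipelet class="org.eclipse.smila.ontology.pipelets.SesameRecordWriterPipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <!-- repository to write to; the default repository is used if omitted -->
      <rec:Val key="sesameRepository">native</rec:Val>
      <!-- hypothetical record filter defined in configuration/org.eclipse.smila.blackboard -->
      <rec:Val key="recordFilter">ontology-filter</rec:Val>
      <!-- type to set if the record itself creates no type statement -->
      <rec:Val key="typeUri">eclipse:Project</rec:Val>
      <rec:Val key="uriAttribute">rdf:about</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
</source>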
  
 
If you want to implement your own pipelets that create attribute values for writing to the repository, see <tt>org.eclipse.smila.ontology.records.SesameRecordHelper</tt> for constants and helper methods for these special conventions and annotations.
 
If you want to implement your own pipelets that create attribute values for writing to the repository, see <tt>org.eclipse.smila.ontology.records.SesameRecordHelper</tt> for constants and helper methods for these special conventions and annotations.
  
If an error occurs while writing a record to the repository, only the changes related to the current record are rolled back, and the pipelet fails with an exception, further records in the same message are not processed anymore, changes from records written before are not invalidated. (TODO: discuss general pipelet error handling behaviour?)
+
If an error occurs while writing a record to the repository, only the changes related to the current record are rolled back and the pipelet fails with an exception. Further records in the same message are not processed anymore; changes from records written before are not invalidated. (TODO: discuss general pipelet error handling behavior?)
  
=== org.eclipse.smila.ontology.pipelets.SesameRecordReaderPipelet ===
+
==== org.eclipse.smila.ontology.pipelets.SesameRecordReaderPipelet ====
  
 
This pipelet reads statements from the repository about the URI associated with a record into attribute values of the record. Its operation is mostly inverse to the one of the SesameRecordWriterPipelet, so we can keep the description short here. Some notes:
 
This pipelet reads statements from the repository about the URI associated with a record into attribute values of the record. Its operation is mostly inverse to the one of the SesameRecordWriterPipelet, so we can keep the description short here. Some notes:
  
* The URI is determined in the same way from attributes or record ID as described for the SesameRecordWriterPipelet above.
 
 
* Only statements are used that have the record URI as subject. No "reverse" attributes are created.
 
* Only statements are used that have the record URI as subject. No "reverse" attributes are created.
* Resource objects are converted to string literals with semantic type set to "rdfs:Resource".
+
* Resource objects are converted to string literals.
* Literal objects are converted to best matching SMILA literal datatype.
+
* Literal objects are converted to best matching SMILA literal datatype (TODO: date/time objects).
* a language tag on a string literal is stored as an "xml:lang" annotation as described above.
+
* a language tag on a string literal is stored as a AnyMap as described above.
 
* All statements are read from the repository, but only those attributes contained in the specified record filter are written to the blackbaord eventually.  
 
* All statements are read from the repository, but only those attributes contained in the specified record filter are written to the blackbaord eventually.  
 
* If predicate URIs start with known name namespaces, the full namespace value is replaced by its prefix and a colon in the associated attribute name.
 
* If predicate URIs start with known name namespaces, the full namespace value is replaced by its prefix and a colon in the associated attribute name.
  
Errors during reading data from repository to the blackboard record are ignored and logged, processing continues with the next record in the current list, if one exists.
+
If an error occurs during reading the records, the pipelet invocation is aborted at this point. If multiple records where processed in this called, changes to already processed records are not reverted.
 +
 
 +
The pipelet can be configured in which attribute to find the URI:
 +
{| border="1"
 +
! Property
 +
! Type
 +
! Description
 +
|-
 +
| ''uriAttribute''
 +
| string
 +
| optional: attribute to write the URIs of found/created resources to. Default: ''rdf:about''
 +
|}
 +
 
 +
=== org.eclipse.smila.ontology.pipelets.CreateResourcePipelet ===
 +
 
 +
This pipelet can be used to lookup and create resources of a certain types by their name. E.g. if some attribute contains the name of a person, this pipelet can search the ontology for a resource of type "person" with this name, and if no such resource exists, it can create a new one. In either case the URI of this resource is written to (another) attribute. The URI of a new resource is created from the label by removing all non-word characters (i.e. everything except a-z, A-Z, _, 0-9) from the string and concatenating the result to some configurable prefix value (see below).
 +
 
 +
The pipelet supports the following parameters:
 +
 
 +
{| border="1"
 +
! Property
 +
! Type
 +
! Description
 +
|-
 +
| ''typeUri''
 +
| string
 +
| required: URI of type to use for lookup and creation of resources. Namespace expansion is applied to this URI.
 +
|-
 +
| ''labelAttribute''
 +
| string
 +
| required: attribute containing names of resources to lookup/create.
 +
|-
 +
| ''uriAttribute''
 +
| string
 +
| required: attribute to write the URIs of found/created resources to.
 +
|-
 +
| ''labelPredicate''
 +
| string
 +
| optional: URI of the predicate that specifies the name of the resource. If not set, it defaults to rdfs:label. Namespace expansion is applied to the property value.
 +
|-
 +
| ''uriPrefix''
 +
| string
 +
| optional: prefix for new created URIs. If not set, "urn:" is used. Namespace expansion is also applied to the complete new URI.
 +
|}
 +
 
 +
=== org.eclipse.smila.ontology.pipelets.CreateRelationPipelet ===
 +
 
 +
This pipelet creates statements with subjects and objects read from record attributes. It create a statement with a configurable predicate in the target repository for each combination of values in two configurable attributes. E.g. if one attribute of your records contains URIs of persons and another one URIs of companies they work for, this pipelet could be used to create statements in the ontology using some "worksFor" predicate to describe this relation.
 +
 
 +
By default the given objectAttributes are interpreted as URIs. If the objectAttributes should be interpreted as literals, the parameter ''objectAttributeIsResource'' has to be set to ''false'' in the pipelet configuration.
 +
 
 +
The pipelet supports the following parameters:
 +
 
 +
{| border="1"
 +
! Property
 +
! Type
 +
! Description
 +
|-
 +
| ''subjectAttribute''
 +
| string
 +
| required: name of attribute containing the subjects for the statements to create. Regardless of the actual literal type, the string values of the literals are tried to interpred as URIs. Namespace expansion is applied.
 +
|-
 +
| ''objectAttribute''
 +
| string
 +
| required: name of attribute containing the objects for the statements to create. If the pipelet parameter ''objectAttributeIsResource'' is not explicitly set to ''false'' the objects will be URIs (namespaces expanded). Else the SMILA literal values will be written as Sesame literals with a matching datatype.
 +
|-
 +
| ''predicateUri''
 +
| string
 +
| required: URI of the predicate for the statements. Namespace prefixes are expanded first.
 +
|-
 +
| ''objectAttributeIsResource''
 +
| Boolean
 +
| optional: if ''true'' the objects are interpreted as URIs, if ''false'' the objects attributes are interpreted as literals. Default: ''true''.
 +
|}
  
 
[[Category:SMILA]]
 
[[Category:SMILA]]

Latest revision as of 10:56, 23 January 2012

This page describes an initial integration of a semantic layer in SMILA. It currently consists mainly of an integration of Aduna's OpenRDF Sesame 2, an open source framework for storage, inferencing, and querying of RDF data. We currently do not provide an RDF API of our own, but simply reuse the Sesame API. Based on experiences from actual use cases for the Semantic Layer, this might change in the future.

Consequently, this page assumes that the reader is familiar with the basic Sesame concepts. A quick browse through Sesame's User Guide should help, especially chapters 3 and 8.

All of the described code is contained in bundle org.eclipse.smila.ontology.

Introduction

Ontologies can be used to describe background knowledge about an application domain. This knowledge can be used during indexing to derive additional attributes for documents, or during search to enhance or restrict queries. On the other hand, SMILA pipelets can also be used to add additional data to an existing ontology, i.e. to learn descriptions of and relations between entities of the application domain.

The de-facto standard format for describing such knowledge is RDF. RDF describes everything in the form of resources and triples (statements) about these resources. A resource is identified by a URI, e.g. http://www.eclipse.org/smila. A statement consists of a subject resource, a predicate resource, and an object. The predicate describes the meaning of the statement. The object can either be a literal of different types, e.g.

Subject Predicate Object
http://www.eclipse.org/smila label 'SMILA'
http://www.eclipse.org/smila createdIn 2007

or another resource, e.g.

Subject Predicate Object
http://www.eclipse.org/smila type http://www.eclipse.org/Project
http://www.eclipse.org/smila isPartOf http://www.eclipse.org/rt/
http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com isCommitterOf http://www.eclipse.org/smila

Data is written to an RDF ontology by adding or removing statements. It can be read using a statement pattern, e.g. by asking for all statements with the predicate isPartOf. Another possibility is to use an RDF query language like SPARQL, which allows formulating very complex patterns to access RDF data.
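
To make the statement model more concrete, the following minimal Python sketch stores the example statements from the tables above as plain tuples and selects them with a simple pattern in which None acts as a wildcard. It does not use the Sesame API; the function name match and the tuple representation are assumptions made for illustration only.

```python
# Example statements from the tables above as (subject, predicate, object) tuples.
SMILA = "http://www.eclipse.org/smila"
STATEMENTS = [
    (SMILA, "label", "SMILA"),
    (SMILA, "createdIn", 2007),
    (SMILA, "type", "http://www.eclipse.org/Project"),
    (SMILA, "isPartOf", "http://www.eclipse.org/rt/"),
    ("http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com",
     "isCommitterOf", SMILA),
]


def match(statements, subject=None, predicate=None, obj=None):
    """Return all statements matching the pattern; None matches anything."""
    pattern = (subject, predicate, obj)
    return [
        st for st in statements
        if all(p is None or p == v for p, v in zip(pattern, st))
    ]


# All statements with the predicate "isPartOf".
part_of = match(STATEMENTS, predicate="isPartOf")
```

A query language such as SPARQL generalizes this idea by combining several such patterns that share variables.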

Sesame Ontology Manager

Service

The Sesame Ontology Manager is an OSGi service that manages Sesame repositories. It has a configuration file that describes a number of repositories that can be created or used and associates them with a name. Service consumers can then request a connection to one of the repositories by its name. A repository is created with the first access from a client.

The configuration file is expected in the configuration area at org.eclipse.smila.ontology/sesameConfig.xml. You can find the schema definition in bundle org.eclipse.smila.ontology in directory schema. The following example is fairly complete with respect to the supported repositories:

<?xml version="1.0" encoding="UTF-8"?>
<SesameConfiguration default="native" xmlns="http://www.eclipse.org/smila/ontology">
    <RepositoryConfig name="memory">
        <MemoryStore persist="true" syncDelay="1000" />
        <Stackable classname="org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer" />
    </RepositoryConfig>
    <RepositoryConfig name="native">
        <NativeStore forceSync="true" indexes="spoc,posc" />
    </RepositoryConfig>
    <RepositoryConfig name="database">
        <RdbmsStore driver="org.postgresql.Driver" maxTripleTables="1" indexed="true" sequenced="true">
            <Url>jdbc:postgresql://localhost/sesame</Url>
            <User>sesame</User>
            <Password>sesame</Password>
        </RdbmsStore>
    </RepositoryConfig>
    <RepositoryConfig name="remote">
        <HttpStore repositoryId="repository">
            <Url>http://localhost:8080/sesame</Url>
        </HttpStore>
    </RepositoryConfig>
</SesameConfiguration>

The root element must specify a default repository name, which must match the name attribute of one of the contained repository configurations. Each single repository configuration is described by a <RepositoryConfig> element. It must define a name for the repository. As the name is used as the name of a workspace directory to store the files of memory and native stores in, it should contain only characters suitable for directory names on the current platform. The element must contain one of the different <...Store> elements to describe the physical store type and can (except for HttpStores) contain multiple <Stackable> elements defining implementors of org.openrdf.sail.StackableSail that are stacked on the used store sail. If multiple stackables are specified they are added on top of the store in the order of appearance in the XML file. This means, the configuration file describes the actual sail stack bottom-up. Sesame itself contains only two useful classes that could be used here:

  • org.openrdf.sail.inferencer.fc.DirectTypeHierarchyInferencer
  • org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer

See the Sesame API documentation for details.

Memory Store

Creates the repository based on a main memory store. It has only two configuration attributes:

Attribute Type Default Description
persist true/false false Writes repository content to the workspace so that it can be read again after restarts.
syncDelay integer 0 The time (in milliseconds) to wait after a transaction was committed before writing the changed data to file. Setting it to 0 causes the data to be written immediately after a commit. A negative number prevents syncing until shutdown.

See Sesame User Guide: Memory store configuration for details.

Native Store

Creates the repository based on Sesame's native file database format. It has two configuration attributes:

Attribute Type Default Description
indexes string - An index string like "spoc,posc" that describes how the RDF data is being indexed for better query performance.
forceSync true/false false Force sync to the hard disk on every write. This makes sure that each change is actually persisted in the data files immediately, but decreases write performance.

See Sesame User Guide: Native store configuration for details.

Rdbms Store

Creates a repository that is stored in a relational database. Sesame currently supports PostgreSQL and MySQL. Bundles containing the JDBC driver are currently not part of the SMILA distribution, so you must add them yourself. The element has four possible attributes:

Attribute Type Default Description
driver string (required) The class name of the JDBC driver.
maxTripleTables integer 1 The number of triple tables created by Sesame. The default value causes all statements to be stored in a single table. If more tables are allowed, Sesame creates a separate table per predicate. This may increase performance for large ontologies, however, allowing too many tables might decrease performance again.
indexed true/false true Controls the creation of DB indexes. Usually, this should be enabled for better performance.
sequenced true/false true (I did not find any explanation for this option in the Sesame documentation or source code, but if you know what it does, you can use it ;-)

The actual database location is configured by up to three sub-elements:

Tag Type Required Description
Url string yes JDBC URL
User string no User name for login
Password string no Password for login

If the database does not require authentication, it may be possible to omit User and Password elements.

See Sesame User Guide: RDBMS store configuration for details.

HTTP Store

Creates a repository that connects to a remote Sesame HTTP repository server. There is only one configuration attribute on the HttpStore element:

Attribute Type Default Description
repositoryId string (required) name of repository in server.

The actual repository server location is configured by up to three sub-elements:

Tag Type Required Description
Url string yes HTTP URL of the repository server
User string no User name for login
Password string no Password for login

If the repository server does not require authentication, User and Password can be omitted. It is not possible to add stackable sails to an HTTP repository. This must be configured on the repository server.

See Sesame User Guide: HTTP repository configuration for details.

JMX Management Agent

There is also a JMX management agent that can be used to import RDF data from files into an ontology, export complete repository contents to an RDF file, and clear repositories. Additionally, it allows reading some information such as the available repository names, the known namespaces, and the size of repositories.

In the JDK JMX console it should look similar to this screenshot:

SMILA Sesame Ontology Manager.png

Since the first parameter of each operation is usually the name of the desired repository, we will only describe additional parameters here:

operation description
getRepositoryNames Returns the names of all configured repositories. These names can be used as the first parameter of the other methods.
getSize Returns the total number of statements in the named repository.
getNamespaces Returns a map of namespace prefixes to the complete names in the named repository.
getContexts Returns a list of context names in the named repository.
clear Removes all resources and statements from the named repository. The result is a message about the operation.
importRDF Imports an RDF file into the named repository. The second parameter is the path and filename of the import file, absolute or relative to SMILA's working directory. The third parameter is the base URI for relative resource URIs in this file. If only absolute resource URIs are used in the RDF file, the actual value of this parameter is irrelevant. The format of the file is determined by looking at the filename suffix; if no match is found, RDF/XML is assumed.
exportRDF Exports the complete content of a repository to an RDF file. The second parameter is the path and filename of the export file, absolute or relative to SMILA's working directory. The format of the file is determined by looking at the filename suffix; if no match is found, RDF/XML is assumed.

There are also exemplary batch files for importing, exporting, or running the clear operation from the command line. See SMILA/jmxclient for details and adapt them to your own needs.

Currently, supported file formats and extensions include:

RDF format filename suffixes
RDF/XML .rdf, .rdfs, .owl
N-Triples .nt
Turtle .ttl
N3 .n3
TRIX .trix
TRIG .trig

The .xml suffix is associated with both RDF/XML and TRIX in Sesame, so there may be problems using it.
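
The suffix-based format detection with its RDF/XML fallback can be sketched as follows. This is only an illustration of the table above, not the actual Sesame or SMILA code; the function name guess_format and the case-insensitive comparison are assumptions.

```python
import os

# Suffix table as listed above. ".xml" is left out on purpose because
# it is ambiguous between RDF/XML and TRIX.
FORMAT_BY_SUFFIX = {
    ".rdf": "RDF/XML",
    ".rdfs": "RDF/XML",
    ".owl": "RDF/XML",
    ".nt": "N-Triples",
    ".ttl": "Turtle",
    ".n3": "N3",
    ".trix": "TRIX",
    ".trig": "TRIG",
}


def guess_format(filename, default="RDF/XML"):
    """Pick the RDF format from the filename suffix, else fall back to default."""
    suffix = os.path.splitext(filename)[1].lower()  # lower-casing is an assumption
    return FORMAT_BY_SUFFIX.get(suffix, default)
```

For example, guess_format("data/ontology.ttl") selects Turtle, while a file without a known suffix is treated as RDF/XML.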

Pipelets using the Ontology

There are currently four pipelets included that make use of the ontology service. These pipelets are still experimental, so they may change completely in the future. More pipelets will be created when we implement "real" use cases.

All pipelets use the standard in-BPEL pipelet configuration which can be overridden per record by simple values in the _parameters map attribute. All pipelets use a common property name to select the repository to work with:

Property Type Description
sesameRepository string The name of the repository to use. If not set, the default repository is used.

Writing/Reading complete records to/from the ontology

There are two pipelets that can be used to create and access information about a record in the ontology. The resource URI is read from a special attribute named rdf:about, and the resource properties are written from or read into metadata attributes.

These pipelets have another parameter in common:

Property Type Description
recordFilter string The name of a record filter that lists all attributes that should be interpreted as resource properties. Note that the rdf:about attribute must be contained in this filter if it is used to identify the resource associated to the object. The record filters must be defined in configuration/org.eclipse.smila.blackboard, because the pipelets use blackboard functionality to do the filtering. If not set, all attributes are mapped to properties.

The URI of the resource associated to a processed record is determined this way:

  • If attribute "rdf:about" has literal values (after record filtering), the first one is used as the base value.
  • If the part of the base value before the first ':' character matches a namespace prefix in the used repository, the prefix is replaced by the full name.
  • The resulting value must be accepted by Sesame as a URI string. If not, the record is not processed any further.
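
The prefix expansion step described in the list above can be sketched like this. The function name expand_prefix and the namespace table are assumptions for illustration; the final check whether Sesame accepts the result as a URI is not modeled here.

```python
def expand_prefix(value, namespaces):
    """Replace a known namespace prefix before the first ':' by its full name."""
    prefix, separator, local_name = value.partition(":")
    if separator and prefix in namespaces:
        return namespaces[prefix] + local_name
    # No ':' at all, or an unknown prefix such as "http": keep the value as is.
    return value


# Hypothetical namespace table of the used repository.
NAMESPACES = {
    "eclipse": "http://www.eclipse.org/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

resource_uri = expand_prefix("eclipse:smila", NAMESPACES)
```

A value that already is a full URI passes through unchanged, because its scheme is not a known prefix.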

org.eclipse.smila.ontology.pipelets.SesameRecordWriterPipelet

This pipelet writes attribute values to RDF properties. It creates a resource URI for the record from the value it finds in the corresponding attribute. For each top-level attribute value of the record metadata object, it creates a statement using this URI as the subject and the attribute name as the predicate URI. If an attribute name starts with a namespace prefix known in the used repository, it is expanded to the full namespace name. The statement object is created from a SMILA literal value as follows:

  • If the attribute is a map and contains an attribute with a name of "rdf:about", a resource is created from the sub-structure (after namespace prefix expansion) and its URI is linked to the containing structure.
  • Otherwise, a literal of the ontology literal datatype best matching the SMILA literal datatype is created (TODO: Not yet implemented for date/time values).

The attribute in which the pipelet looks for the URI can be configured. To find out how, see the parameter description below.

Any system attribute (i.e. any attribute whose name starts with an underscore "_") is ignored and not written to Sesame.

If references to objects should be created, the attributes containing object references have to be marked by a special attribute named _objectProperties:

  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  ...
  <rec:Seq key="_objectProperties">
    <rec:Val>eclipse:isPartOf</rec:Val>
  </rec:Seq>

So, to create a resource in the repository as in the example in the introduction, a record like this could be used:

 
<rec:Map>
  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  <rec:Map key="rdf:type">
    <rec:Val key="rdf:about">eclipse:Project</rec:Val>
    <rec:Val key="rdf:type">rdfs:Resource</rec:Val>
  </rec:Map>
  <rec:Seq key="eclipse:isPartOf">
    <rec:Map>
      <rec:Val key="rdf:about">eclipse:rt</rec:Val>
      <rec:Val key="rdf:type">eclipse:TopLevelProject</rec:Val>
    </rec:Map>
    <rec:Val>http://www.eclipse.org/</rec:Val>
  </rec:Seq>
  <rec:Val key="rdfs:label">SMILA</rec:Val>
  <rec:Val key="eclipse:createdIn">2007</rec:Val>
 
  <rec:Seq key="_objectProperties">
    <rec:Val>eclipse:isPartOf</rec:Val>
  </rec:Seq>
</rec:Map>

This assumes that the target repository knows the standard namespaces "rdf" and "rdfs" as well as the namespace "eclipse" (= "http://www.eclipse.org/").

A string literal with a language can be written using a map with the language name as key and the locale specific value as value of the map:

  <rec:Map key="rdfs:label">
    <rec:Val key="de">SMILA</rec:Val>
    <rec:Val key="en">SMILA</rec:Val>
  </rec:Map>

By default, new statements are just added to the repository. If all existing statements for a given subject and object should be removed before adding the new statements, special system attributes can be used.

To remove all statements before adding the new statements, add an attribute named "_deleteAll" with a value "true":

  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  <rec:Val key="_deleteAll">true</rec:Val>

To remove only some properties before adding the new statements, add a sequence attribute named "_deleteProperties" that lists the properties to be deleted beforehand.

  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  ...
  <rec:Seq key="_deleteProperties">
    <rec:Val>eclipse:isCommitterOf</rec:Val>
    <rec:Val>rdfs:label</rec:Val>
  </rec:Seq>

To create a statement with the record URI as the object and the attribute value as the subject, add a sequence attribute named "_reverseProperties" to the record that lists the reverse properties. This only works if the attribute value is specified to be a resource URI. E.g. the final statement of the introductory example can be created from the same record using:

  <rec:Val key="rdf:about">eclipse:smila</rec:Val>
  <rec:Val key="eclipse:isCommitterOf">http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com</rec:Val>
  <rec:Seq key="_reverseProperties">
    <rec:Val>eclipse:isCommitterOf</rec:Val>
  </rec:Seq>
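
The subject/object swap for reverse properties can be sketched as follows. This is a strongly simplified illustration, not the pipelet's implementation: records are modeled as plain Python dictionaries, and namespace expansion, nested maps, language maps, and the deletion attributes are left out. The function name record_to_statements is an assumption.

```python
def record_to_statements(record, uri_attribute="rdf:about"):
    """Turn a flat record (dict) into (subject, predicate, object) tuples."""
    record_uri = record[uri_attribute]
    reverse = set(record.get("_reverseProperties", []))
    statements = []
    for name, values in record.items():
        # The URI attribute and system attributes are not written as properties.
        if name == uri_attribute or name.startswith("_"):
            continue
        for value in (values if isinstance(values, list) else [values]):
            if name in reverse:
                # Reverse property: the record URI becomes the object.
                statements.append((value, name, record_uri))
            else:
                statements.append((record_uri, name, value))
    return statements


record = {
    "rdf:about": "eclipse:smila",
    "rdfs:label": "SMILA",
    "eclipse:isCommitterOf":
        "http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com",
    "_reverseProperties": ["eclipse:isCommitterOf"],
}
statements = record_to_statements(record)
```

Here the label statement keeps eclipse:smila as its subject, while the isCommitterOf statement uses the user URI as subject and eclipse:smila as object.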

Finally, the pipelet supports these additional parameters:

Property Type Description
typeUri string optional: name of type to set for the resource, if no type statement is created from writing the record.
uriAttribute string optional: name of the attribute in which to find the URI of the resource. Default: rdf:about

If you want to implement your own pipelets that create attribute values for writing to the repository, see org.eclipse.smila.ontology.records.SesameRecordHelper for constants and helper methods for these special conventions and annotations.

If an error occurs while writing a record to the repository, only the changes related to the current record are rolled back and the pipelet fails with an exception. Further records in the same message are not processed anymore, but changes from records written before are not invalidated. (TODO: discuss general pipelet error handling behavior?)

org.eclipse.smila.ontology.pipelets.SesameRecordReaderPipelet

This pipelet reads statements from the repository about the URI associated with a record into attribute values of the record. Its operation is mostly inverse to the one of the SesameRecordWriterPipelet, so we can keep the description short here. Some notes:

  • Only statements are used that have the record URI as subject. No "reverse" attributes are created.
  • Resource objects are converted to string literals.
  • Literal objects are converted to best matching SMILA literal datatype (TODO: date/time objects).
  • A language tag on a string literal is stored as an AnyMap as described above.
  • All statements are read from the repository, but only those attributes contained in the specified record filter are eventually written to the blackboard.
  • If predicate URIs start with known namespaces, the full namespace value is replaced by its prefix and a colon in the associated attribute name.
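
The replacement of a full namespace by its prefix, as described in the last item above, can be sketched like this. The function name compact_uri is an assumption, and so is the choice to prefer the longest matching namespace when several match; the text above does not specify that case.

```python
def compact_uri(uri, namespaces):
    """Replace a known namespace at the start of uri by 'prefix:'."""
    best = None
    for prefix, name in namespaces.items():
        # Assumption: prefer the longest matching namespace.
        if uri.startswith(name) and (best is None or len(name) > len(best[1])):
            best = (prefix, name)
    if best is None:
        return uri  # no known namespace: keep the full URI as attribute name
    return best[0] + ":" + uri[len(best[1]):]


# Hypothetical namespace table of the used repository.
NAMESPACES = {
    "eclipse": "http://www.eclipse.org/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

attribute_name = compact_uri("http://www.eclipse.org/isPartOf", NAMESPACES)
```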

If an error occurs while reading a record, the pipelet invocation is aborted at this point. If multiple records were processed in this call, changes to already processed records are not reverted.

The attribute in which the pipelet looks for the URI can be configured:

Property Type Description
uriAttribute string optional: name of the attribute in which to find the URI of the resource. Default: rdf:about

org.eclipse.smila.ontology.pipelets.CreateResourcePipelet

This pipelet can be used to look up and create resources of a certain type by their name. E.g. if some attribute contains the name of a person, this pipelet can search the ontology for a resource of type "person" with this name, and if no such resource exists, it can create a new one. In either case, the URI of this resource is written to (another) attribute. The URI of a new resource is created from the label by removing all non-word characters (i.e. everything except a-z, A-Z, _, 0-9) from the string and appending the result to a configurable prefix value (see below).

The pipelet supports the following parameters:

Property Type Description
typeUri string required: URI of type to use for lookup and creation of resources. Namespace expansion is applied to this URI.
labelAttribute string required: attribute containing names of resources to lookup/create.
uriAttribute string required: attribute to write the URIs of found/created resources to.
labelPredicate string optional: URI of the predicate that specifies the name of the resource. If not set, it defaults to rdfs:label. Namespace expansion is applied to the property value.
uriPrefix string optional: prefix for newly created URIs. If not set, "urn:" is used. Namespace expansion is also applied to the complete new URI.
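
The rule for building the URI of a new resource from its label can be sketched as follows. The function name create_uri is an assumption; the subsequent namespace expansion of the complete URI is not modeled here.

```python
import re


def create_uri(label, uri_prefix="urn:"):
    """Build a resource URI from a label by dropping all non-word characters."""
    # With re.ASCII, \W matches everything except a-z, A-Z, 0-9 and "_".
    local_name = re.sub(r"\W", "", label, flags=re.ASCII)
    return uri_prefix + local_name


person_uri = create_uri("Juergen Schumacher")
```

Note that different labels can collapse to the same URI under this rule, e.g. "A B" and "AB", so labels should be reasonably distinct within one type.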

org.eclipse.smila.ontology.pipelets.CreateRelationPipelet

This pipelet creates statements with subjects and objects read from record attributes. It creates a statement with a configurable predicate in the target repository for each combination of values in two configurable attributes. E.g. if one attribute of your records contains URIs of persons and another one URIs of companies they work for, this pipelet could be used to create statements in the ontology using some "worksFor" predicate to describe this relation.

By default, the values of the given objectAttribute are interpreted as URIs. If they should be interpreted as literals, the parameter objectAttributeIsResource has to be set to false in the pipelet configuration.

The pipelet supports the following parameters:

Property Type Description
subjectAttribute string required: name of the attribute containing the subjects for the statements to create. Regardless of the actual literal type, the pipelet tries to interpret the string values of the literals as URIs. Namespace expansion is applied.
objectAttribute string required: name of the attribute containing the objects for the statements to create. Unless the pipelet parameter objectAttributeIsResource is explicitly set to false, the objects will be URIs (namespaces expanded). Otherwise, the SMILA literal values will be written as Sesame literals with a matching datatype.
predicateUri string required: URI of the predicate for the statements. Namespace prefixes are expanded first.
objectAttributeIsResource Boolean optional: if true, the values of the object attribute are interpreted as URIs; if false, they are interpreted as literals. Default: true.
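
The "one statement per combination of values" behavior can be sketched as follows. Records are modeled as plain dictionaries mapping attribute names to value lists, and all URIs in the example are hypothetical; namespace expansion and the literal handling controlled by objectAttributeIsResource are not modeled.

```python
from itertools import product


def create_relations(record, subject_attribute, object_attribute, predicate_uri):
    """Create one statement per combination of subject and object values."""
    subjects = record.get(subject_attribute, [])
    objects = record.get(object_attribute, [])
    return [(s, predicate_uri, o) for s, o in product(subjects, objects)]


# Hypothetical record with two persons and two companies.
record = {
    "persons": ["urn:alice", "urn:bob"],
    "companies": ["urn:acme", "urn:initech"],
}
relations = create_relations(record, "persons", "companies", "urn:worksFor")
```

With two subjects and two objects this yields four statements, so the two attributes should only contain values that really are all related to each other.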