Putting the Semantics in SMILA
This is a collection of ideas to initiate some brainstorming. It may be a bit chaotic in parts (-;
Currently the SMILA implementation is free of any predefined semantics. This is intentional: we do not want to force users into a semantic scheme, which might make it complicated to adopt the framework when no real ontology is needed. They can put any data they want into the records and do not have to care about being consistent with an ontology that does not even meet their needs. The main purpose of SMILA is to provide a stable, reliable, scalable and performant infrastructure for enterprise information logistics applications that might or might not use a formalized ontology as their basis.
The other thing is that we do not want to restrict users to a certain kind of ontology (SKOS, OWL, NEPOMUK, XESAM, ...), because we do not feel in a position to decide which one might be the best (or even an acceptable) choice for all users. We also have the impression that the Semantic Web community (is there a single one?) has not yet settled on a commonly accepted standard that would be safe to use.
So, the "SeMantic web" part in the name SMILA is currently not justified. We know this ;-) Now that the first SMILA milestone has been completed, we want to move on and really add the SeMantic web to SMILA. This paper is intended to collect and discuss ideas on how to do this.
But first: two aspects of SMILA have been criticized: that its data model is not RDF based and that it uses complex ID objects instead of URLs to identify records. Here are some more detailed explanations of the reasons for both.
About SMILA's Data Model
The SMILA data model is therefore free of any predefined semantics; it just defines container objects for services to put their data in, such that other services can be configured to read it without having to know anything about the other services (like specific data formats). It is not based on RDF because, despite all generality, we wanted to distinguish different parts of the data:
- attributes: descriptions of documents, which might or might not be backed by an explicit ontology. This is the part of a SMILA record that matches an RDF description of a resource (see end of this page).
- annotations: additional data about the attributes and their values, used to control the operation of processing services or to represent additional result data of services. This will usually not be affected by the use of ontologies at all. Such annotations can be attached to each part of the metadata. Examples of annotations are (taken from IAS, we do not have many examples in current SMILA services yet):
  - filters, dynamic query weights and similar things to modify a search query.
  - additional information produced by textminer about concepts recognized in fulltext (positions, PoS tags, stemming info...)
- attachments: raw binary document data, usually.
(Note that though we are talking about "documents" here, SMILA does not identify "records" with "documents", this is just the most common use case. SMILA records can also represent general database entities, persons and other document-unlike objects.)
We think that it makes sense to distinguish between metadata attributes and annotations: attributes are defined by the application ontology (explicit or ad-hoc, with consistency checks or without - it does not matter), while annotations are defined by the services used (no ontology definition necessary, no consistency checking). This way the application developer can focus on designing the application ontology, while the services can attach any extra data they need to any part of the metadata without forcing the developer to always define or include the same placeholders.
In other words: the idea of the data model is to wrap the RDF-like object that describes a document in conformance with the application ontology in another object that allows services to attach additional data necessary for processing, but which no designer of an application ontology wants to care about. E.g., an OWL ontology can still define a datatype property with a simple string range, but services can do more with this attribute. The alternative would be to force the ontology to use object properties with ranges of some "AnnotatableString" class, which would seem quite unnatural to us.
A question is whether we can make this "wrapping" more explicit in the naming of the SMILA data model elements, such that the relation between pure RDF objects and SMILA records becomes more obvious.
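The wrapping idea can be illustrated with a minimal sketch. The class and method names below (`Record`, `Attribute`, `Literal`, `Annotation`, `rdf_view`) are made up for this illustration and are not the SMILA API; the point is only that annotations hang off attributes and values without appearing in the ontology-level view.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Service-specific extra data: values plus nested annotations."""
    named_values: Dict[str, str] = field(default_factory=dict)
    anonymous_values: List[str] = field(default_factory=list)
    annotations: Dict[str, "Annotation"] = field(default_factory=dict)


@dataclass
class Literal:
    """One attribute value that a service may annotate individually."""
    value: str
    annotations: Dict[str, Annotation] = field(default_factory=dict)


@dataclass
class Attribute:
    """The ontology-defined part: a named property with its values."""
    name: str
    literals: List[Literal] = field(default_factory=list)
    annotations: Dict[str, Annotation] = field(default_factory=dict)


@dataclass
class Record:
    record_id: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)
    attachments: Dict[str, bytes] = field(default_factory=dict)

    def rdf_view(self) -> Dict[str, List[str]]:
        """Drop annotations and attachments, keep only attribute values."""
        return {name: [lit.value for lit in attr.literals]
                for name, attr in self.attributes.items()}


# Usage: a service boosts one keyword value without touching the ontology.
record = Record(record_id="document-uri")
keyword = Attribute(name="f:keyword")
keyword.literals.append(Literal("short"))
boosted = Literal("nothing")
boosted.annotations["boost"] = Annotation(anonymous_values=["2.0"])
keyword.literals.append(boosted)
record.attributes[keyword.name] = keyword
record.attachments["content"] = b"<html>...</html>"
plain = record.rdf_view()
```

In this sketch `plain` contains only the keyword values; the boost annotation and the attachment stay in the wrapper object.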
About SMILA's ID
SMILA IDs are not just URIs. They need to be complex to be able to
- identify objects with the same ID in consecutive "incremental update" runs.
- identify objects from all kinds of data sources - the key relative to a source could be a URI itself or consist of multiple key values (DB tables without single PK columns)
- identify objects contained in containers (which again could be part of a container, etc)
- locate the original object using only data from the ID (and knowing how to apply it to the data source, of course).
We did not find a way to express all this in a single URI. Ideas are appreciated (;
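To make the requirements above more concrete, here is a minimal sketch of such a compound ID as a value object. The names (`RecordId`, `element_path`, `child`) and the example sources and keys are assumptions for illustration, not the actual SMILA ID classes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RecordId:
    source: str                          # identifier of the data source
    key: Tuple[Tuple[str, str], ...]     # one or more (name, value) key parts
    element_path: Tuple[str, ...] = ()   # path through nested containers

    def child(self, element_key: str) -> "RecordId":
        """ID of an object contained in the object this ID refers to."""
        return RecordId(self.source, self.key,
                        self.element_path + (element_key,))


# A database row without a single primary key column: two key parts.
row = RecordId("crm-db", (("customer_no", "4711"), ("order_no", "12")))

# A file inside a zip inside another zip on a file share.
archive = RecordId("share", (("path", "/docs/archive.zip"),))
nested = archive.child("inner.zip").child("report.pdf")

# A later incremental run builds the same ID from the same source data.
same_again = (RecordId("share", (("path", "/docs/archive.zip"),))
              .child("inner.zip").child("report.pdf"))
```

Because the sketch uses a frozen dataclass, two IDs built from the same parts compare equal and hash equally, which is what an incremental update needs in order to recognize an object it has seen before.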
Adding an Ontology to SMILA
Aside from all this, SMILA will of course have a default "Ontology model" for users who need ontology support for their services but do not want to develop their own ontology model.
Ontologies will be used to replace the IAS notions of "Models" as far as possible, i.e. they are used to store the declarative knowledge which semantic services can use for their operations. This includes (IAS-biased view!):
- defining possible attributes and their value ranges.
- defining resources with names and synonyms or other expressions that can be used to detect them in full text (textminer model)
- annotating properties to define more special structures like taxonomies or other orderings for filtering and query expansion
- annotating properties to derive similarity measures
- expressing rules for query completion or adaptation
In contrast to the IAS models, ontologies must allow much more dynamics and must scale much better. Also, the consuming services must be based on the assumption that the ontology changes and can contain a very large number of resources. This means:
- the ontology service cannot keep the complete ontology in memory, but must be based on an efficient database
- it should possibly support change management, such that clients that create internal structures from the ontology for more efficient computation (e.g. Textminer) can register their interest in changes to certain parts of the ontology and update their internal structures instead of having to rebuild them.
- ontology changes are also done by SMILA services - the ontology is not read-only to SMILA.
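The change management point can be sketched as a simple subscription mechanism. Everything here is hypothetical (the `ChangeNotifier` class, subscribing per predicate, the `skos:prefLabel` example); it only shows how a client like Textminer could update an internal lexicon incrementally instead of rebuilding it.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

Triple = Tuple[str, str, str]
Listener = Callable[[str, Triple], None]  # called with "added" or "removed"


class ChangeNotifier:
    """Lets clients subscribe to changes of triples with a given predicate."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, predicate: str, listener: Listener) -> None:
        self._listeners[predicate].append(listener)

    def publish(self, change: str, triple: Triple) -> None:
        # Only listeners registered for this triple's predicate are called.
        for listener in self._listeners.get(triple[1], []):
            listener(change, triple)


# A textminer-like client keeps a label -> resource lexicon up to date.
lexicon: Dict[str, str] = {}


def update_lexicon(change: str, triple: Triple) -> None:
    resource, _, label = triple
    if change == "added":
        lexicon[label] = resource
    else:
        lexicon.pop(label, None)


notifier = ChangeNotifier()
notifier.subscribe("skos:prefLabel", update_lexicon)
notifier.publish("added", ("ex:Berlin", "skos:prefLabel", "Berlin"))
notifier.publish("added", ("ex:Berlin", "rdf:type", "ex:City"))  # not subscribed
lexicon_after_add = dict(lexicon)
notifier.publish("removed", ("ex:Berlin", "skos:prefLabel", "Berlin"))
```

In a distributed setup the notifications would have to travel over some messaging channel; the sketch ignores that and calls the listeners directly.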
The idea is to add an "Ontology background service" to SMILA that can be used by all services (and crawlers?) that need access to an ontology. Services that do not need ontology access can still use the data objects produced by other services by accessing the produced attributes. This service could have three components:
- Ontology Store: an efficient database-based storage for ontologies, e.g. an RDF triple store based on an RDBMS. In-memory triple containers are not appropriate in distributed, high-volume scenarios.
- API: an easy-to-use API on top of the Ontology Store to access and manipulate the stored triples. It should support a query language like SPARQL.
- Reasoner: Adds "intelligence" to the ontology by using the semantics of the ontology (whatever this means).
Services can use the API or SPARQL directly to access the ontology data as well as access the Reasoner to do more sophisticated computations.
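As a rough sketch of how the three components could fit together, the following uses Python's standard `sqlite3` module as a stand-in for the RDBMS. The class `OntologyStore`, its `add`/`match` methods and the `superclasses` function are assumptions made for this example; simple pattern matching is of course far less than SPARQL, and the "reasoner" only computes the transitive closure of `rdfs:subClassOf`.

```python
import sqlite3
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class OntologyStore:
    """Triple store backed by a relational table (hypothetical sketch)."""

    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" only keeps the sketch self-contained; a real setup
        # would point to a database file or server instead.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS triples ("
            "s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL, "
            "PRIMARY KEY (s, p, o))"
        )

    def add(self, s: str, p: str, o: str) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR IGNORE INTO triples (s, p, o) VALUES (?, ?, ?)",
                (s, p, o),
            )

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        """Return all triples matching the given pattern (None = wildcard)."""
        clauses, params = [], []
        for column, value in (("s", s), ("p", p), ("o", o)):
            if value is not None:
                clauses.append(column + " = ?")
                params.append(value)
        where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
        query = "SELECT s, p, o FROM triples" + where + " ORDER BY s, p, o"
        return self._db.execute(query, params).fetchall()


def superclasses(store: OntologyStore, cls: str) -> Set[str]:
    """Tiny 'reasoner': direct and inherited superclasses of cls."""
    found: Set[str] = set()
    todo = [cls]
    while todo:
        current = todo.pop()
        for _, _, parent in store.match(s=current, p="rdfs:subClassOf"):
            if parent not in found:
                found.add(parent)
                todo.append(parent)
    return found


store = OntologyStore()
store.add("ex:Report", "rdfs:subClassOf", "ex:Document")
store.add("ex:Document", "rdfs:subClassOf", "ex:InformationObject")
store.add("ex:Report", "rdfs:label", "Report")
inherited = superclasses(store, "ex:Report")
labels = store.match(p="rdfs:label")
```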
- Base of SMILA ontology
  - just RDF? Too simplistic.
  - OWL? Which version?
- Query Language:
  - SPARQL, probably.
- Are there standard APIs for ontology access, querying or reasoners that could be implemented? That would make it easier to switch the ontology implementation.
  - OWL API? Is it relevant?
  - SPARQL also defines an XML serialization format of results and a Web Service interface definition.
- What can or must be provided by an Ontology Reasoner?
- Do we need an "upper ontology", i.e. a "common ontology" defining a set of classes and properties that are available generally? Or is it all user-definable (as in IAS)?
  - NEPOMUK ontologies (NFO, NMO)?
    - They could help to simplify the configuration of crawlers (apart from serving as the base of an application ontology, of course). Defining output attributes for crawler data is nothing else than defining a mapping from some implicit internal crawler ontology to the application ontology. Instead of having to configure the output attributes of a crawler in each application, a crawler implementation could use fixed attribute names (e.g. the file system crawler could use a subset of NFO property names as attributes) and, if necessary, Connectivity could define a mapping of those fixed names to application ontology names. This would collect all mapping configuration in a single place instead of distributing it among the crawler configs. It could even enable us to integrate non-SMILA RDF-based crawlers (like Aperture?) that send RDF objects (based on their own ontologies) to Connectivity, which converts them to records conforming to the SMILA application ontology.
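The crawler mapping idea from the last point could look roughly like this. The mapping table, the function name and the target names in the `f:` namespace are hypothetical; the NFO property names are only meant as plausible examples of fixed crawler attribute names.

```python
from typing import Dict, List

# Hypothetical central mapping, maintained in one place by Connectivity:
# fixed crawler attribute names -> application ontology names.
NFO_TO_APPLICATION: Dict[str, str] = {
    "nfo:fileName": "f:title",
    "nfo:fileSize": "f:size",
    "nfo:fileLastModified": "f:modificationDate",
}


def map_attributes(crawler_record: Dict[str, List[str]],
                   mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Rename crawler attributes to application ontology attributes."""
    mapped: Dict[str, List[str]] = {}
    for name, values in crawler_record.items():
        target = mapping.get(name)
        if target is None:
            continue  # unmapped crawler attributes are dropped in this sketch
        mapped.setdefault(target, []).extend(values)
    return mapped


crawled = {
    "nfo:fileName": ["report.html"],
    "nfo:fileSize": ["12345"],
    "nfo:fileCreated": ["2007-12-24"],  # no mapping defined, so dropped
}
mapped = map_attributes(crawled, NFO_TO_APPLICATION)
```

Whether unmapped attributes should be dropped, passed through unchanged or reported is a design decision the sketch does not try to settle.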
Representing RDF descriptions in SMILA data model
Though the SMILA data model is not explicitly RDF based, a mapping of RDF objects to SMILA records is easily possible. A document's metadata could be represented in RDF as follows:
<f:Document rdf:about="document-uri">
  <f:mimeType>text/html</f:mimeType>
  <f:size rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">12345</f:size>
  <f:modificationDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2008-01-01</f:modificationDate>
  <f:keyword>short</f:keyword>
  <f:keyword>nothing</f:keyword>
  <f:author rdf:resource="author-uri"/>
  <f:plaintext>This is a very short text about nothing at all</f:plaintext>
</f:Document>
This would be represented in SMILA as (just an informative example):
<Record xmlns="&rec;">
  <id:Id>
    <id:Source>source-uri</id:Source>
    <id:Key>document-uri</id:Key>
  </id:Id>
  <A n="&rdf;ID"> <!-- if URI is needed as attribute -->
    <L>
      <V>document-uri</V>
    </L>
  </A>
  <A n="&rdf;type"> <!-- just as an example -->
    <L st="&owl;Class">
      <V>&f;Document</V>
    </L>
  </A>
  <A n="&f;mimeType">
    <L>
      <V>text/html</V>
    </L>
  </A>
  <A n="&f;size">
    <L>
      <V t="int">12345</V>
    </L>
  </A>
  <A n="&f;modificationDate">
    <L>
      <V t="date">2008-01-01</V>
    </L>
  </A>
  <A n="&f;keyword">
    <L>
      <V>short</V>
      <V>nothing</V>
    </L>
  </A>
  <A n="&f;author">
    <L st="&p;Author">
      <V>author-uri</V>
    </L>
  </A>
  <A n="&f;plaintext">
    <L>
      <V>This is a very short text about nothing at all</V>
    </L>
  </A>
</Record>
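The mapping between the two representations can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration under several assumptions: the namespace `http://example.org/f#` for the `f:` prefix is made up, the record is modelled as a plain dictionary rather than a SMILA record object, only the typed-node form of RDF/XML shown above is handled, and datatypes are ignored.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened version of the RDF example, with assumed namespace declarations.
RDF_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:f="http://example.org/f#">
  <f:Document rdf:about="document-uri">
    <f:mimeType>text/html</f:mimeType>
    <f:keyword>short</f:keyword>
    <f:keyword>nothing</f:keyword>
    <f:author rdf:resource="author-uri"/>
  </f:Document>
</rdf:RDF>
"""


def qname(tag: str) -> str:
    """Turn ElementTree's '{namespace}local' spelling into a full URI."""
    namespace, local = tag[1:].split("}")
    return namespace + local


def rdf_description_to_record(description: ET.Element, source: str) -> dict:
    """Map one typed RDF node element to a record-like dictionary."""
    attributes = {RDF_NS + "type": [qname(description.tag)]}
    for prop in description:
        # Resource-valued properties carry rdf:resource, others carry text.
        value = prop.get("{%s}resource" % RDF_NS, prop.text)
        attributes.setdefault(qname(prop.tag), []).append(value)
    return {
        "id": {"source": source,
               "key": description.get("{%s}about" % RDF_NS)},
        "attributes": attributes,
    }


root = ET.fromstring(RDF_XML)
record = rdf_description_to_record(root[0], source="source-uri")
```

Repeated properties such as `f:keyword` end up as several values of one attribute, which corresponds to the multiple `<V>` elements in the record above.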
Ok, so far this does not have any advantage over using RDF directly. But remember, we want to be able to attach service-specific extra data to this record. E.g. the keywords could have been extracted by some intelligent service that wants to provide information about the occurrences of the keywords in the text, to make it possible for some other service to create markup. And this other service wants to attach the text decorated with the markup:
<Record xmlns="&rec;">
  <id:Id>
    <id:Source>source-uri</id:Source>
    <id:Key>document-uri</id:Key>
  </id:Id>
  <!-- some attributes not repeated here -->
  <A n="&f;keyword">
    <L>
      <V>short</V>
      <An n="sourceRef">
        <V n="attribute">&f;plaintext</V>
        <V n="startPos">16</V>
        <V n="endPos">21</V>
        <V n="partOfSpeech">adjective</V>
      </An>
    </L>
    <L>
      <V>nothing</V>
      <An n="sourceRef">
        <V n="attribute">&f;plaintext</V>
        <V n="startPos">33</V>
        <V n="endPos">40</V>
        <V n="partOfSpeech">noun</V>
      </An>
    </L>
  </A>
  <A n="&f;plaintext">
    <L>
      <V>This is a very short text about nothing at all</V>
      <An n="markup">
        <V><![CDATA[This is a very <b>short</b> text about <b>nothing</b> at all]]></V>
      </An>
    </L>
  </A>
</Record>
Or in a retrieval we might want to add annotations that define filters to restrict the possible result set. The following object describes a query for a document about "short" and "nothing" (with "nothing" being more important for the ranking than "short"), but restricts the results to documents with mime types "text/html" or "text/plain". Note that there is not even a value for attribute f:mimeType, because we do not want to search for documents with certain mime types but only want to filter a search result (a very important difference in similarity-based retrieval).
<Record xmlns="&rec;">
  <id:Id>
    <id:Source>query</id:Source>
    <id:Key>1</id:Key>
  </id:Id>
  <A n="&f;mimeType">
    <An n="filter">
      <V n="type">exclude</V>
      <An n="values">
        <V>text/plain</V>
        <V>text/html</V>
      </An>
    </An>
  </A>
  <A n="&f;keyword">
    <L>
      <V>short</V>
    </L>
    <L>
      <V>nothing</V>
      <An n="boost">
        <V>2.0</V>
      </An>
    </L>
  </A>
</Record>
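The difference between filtering and searching can be shown with a toy ranking function. This is a sketch with made-up names and a deliberately naive score (sum of boosts of matching keywords), following the prose description of the query: the mime type only decides whether a document may appear in the result, it never contributes to the ranking.

```python
from typing import Dict, List, Set


def search(documents: List[dict], keywords: List[str],
           boosts: Dict[str, float],
           allowed_mime_types: Set[str]) -> List[str]:
    """Rank documents by boosted keyword matches, then apply the filter."""
    results = []
    for doc in documents:
        if doc["mimeType"] not in allowed_mime_types:
            continue  # filter: removes the document, does not change scores
        score = sum(boosts.get(k, 1.0)
                    for k in keywords if k in doc["keywords"])
        if score > 0:
            results.append((score, doc["id"]))
    # Highest score first; ties are broken by document id.
    results.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in results]


documents = [
    {"id": "d1", "mimeType": "text/html", "keywords": ["short"]},
    {"id": "d2", "mimeType": "text/plain", "keywords": ["nothing"]},
    {"id": "d3", "mimeType": "application/pdf",
     "keywords": ["short", "nothing"]},
    {"id": "d4", "mimeType": "text/html", "keywords": []},
]
hits = search(documents, ["short", "nothing"], {"nothing": 2.0},
              {"text/html", "text/plain"})
```

In the record above, the filter annotation plays the role of `allowed_mime_types` and the boost annotation that of `boosts`; neither is a value of the attribute itself.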
How could we represent something like this in pure RDF? Remember, we still want to use something like OWL to define an application ontology in a natural way, without forcing the user to think about this annotation stuff.
Summary of Meeting with DFKI/Aperture on 2008-07-22
We (Igor, Daniel, Jürgen and Ralph Traphoener (empolis)) met with Leo Sauermann from the Aperture and NEPOMUK projects of DFKI to discuss how Aperture and SMILA can cooperate and to get feedback on SMILA from the point of view of the Semantic Web community. The main proposals were:
- Record IDs should be URLs
  - Problem: How to express complex IDs as a single URL.
  - Alternatively: Reduce IDs to data source, key and parent container key to reduce redundancy. A single key could be a URL in this case.
    - Problem: need to look up parents of parents of... to actually locate a record source.
  - Additionally: add hints about the container type to <Element> (e.g. mime type), such that a locator knows that the container must be e.g. unzipped to access the element.
    - Could be helpful. We are not completely sure yet if it is really necessary.
- The data model should be based on RDF
  - How to process RDF objects with BPEL? There is no XSD that could be used for the WSDL definition.
  - What would the resulting API look like?
  - Actual data could get more diverse because each service can add its own "ontology" (e.g. textminer tokens, retrieval filters) to add service-specific annotations -> decreased interoperability?
  - We have to think about it because the data model API is pretty central to the system and cannot be changed easily.
As an experiment: Something similar to the above example record with annotations in RDF:
<smila:Record rdf:about="$URL">
  <smila:datasource rdf:resource="$SOURCE_URL"/>
  <smila:parent rdf:resource="$PARENT_URL"/>
  <dc:format>text/html</dc:format>
  <dc:subject>short</dc:subject>
  <dc:subject>nothing</dc:subject>
  <retrieval:filter>
    <retrieval:EnumFilter>
      <smila:attribute rdf:resource="&dc;format"/>
      <retrieval:filterMode>inclusive</retrieval:filterMode>
      <retrieval:filterValue>text/html</retrieval:filterValue>
      <retrieval:filterValue>text/plain</retrieval:filterValue>
    </retrieval:EnumFilter>
  </retrieval:filter>
  <retrieval:boost>
    <retrieval:Boost>
      <smila:attribute rdf:resource="&dc;subject"/>
      <retrieval:boostFactor>2</retrieval:boostFactor>
    </retrieval:Boost>
  </retrieval:boost>
  <textmining:tokenlist>
    <textmining:TokenList>
      <smila:attribute rdf:resource="&smila;fulltext"/> <!-- source attribute -->
      <textmining:token>
        <textmining:Token>
          <smila:attribute rdf:resource="&dc;subject"/> <!-- target attribute -->
          <textmining:tokenSource>short</textmining:tokenSource>
          <textmining:tokenStart>13</textmining:tokenStart>
          <textmining:tokenEnd>17</textmining:tokenEnd>
          <textmining:tokenPos>ADJECTIVE</textmining:tokenPos>
        </textmining:Token>
      </textmining:token>
      <textmining:token>
        <textmining:Token>
          <smila:attribute rdf:resource="&dc;subject"/> <!-- target attribute -->
          <textmining:tokenSource>nothing</textmining:tokenSource>
          <textmining:tokenStart>27</textmining:tokenStart>
          <textmining:tokenEnd>32</textmining:tokenEnd>
          <textmining:tokenPos>NOUN</textmining:tokenPos>
        </textmining:Token>
      </textmining:token>
    </textmining:TokenList>
  </textmining:tokenlist>
</smila:Record>
Note that I have not yet found a way to attach the Boost annotation to only one value of property dc:subject. In this example the complete attribute is "boosted". This could be another major problem with RDF as a SMILA data model: something expressible in IAS is not representable in RDF.
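The effect described in this note can be reproduced on a plain list of triples. The sketch below uses shortened names and a hypothetical helper `boost_factors`; it resolves the Boost node the only way the RDF structure above allows, via the attribute it points to, and therefore ends up boosting both values of dc:subject.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Simplified triples of the experimental RDF record ("_:b1" is the Boost node).
triples: List[Triple] = [
    ("rec", "dc:subject", "short"),
    ("rec", "dc:subject", "nothing"),
    ("rec", "retrieval:boost", "_:b1"),
    ("_:b1", "smila:attribute", "dc:subject"),
    ("_:b1", "retrieval:boostFactor", "2"),
]


def objects(graph: List[Triple], subject: str, predicate: str) -> List[str]:
    """All objects of triples with the given subject and predicate."""
    return [o for (s, p, o) in graph if s == subject and p == predicate]


def boost_factors(graph: List[Triple],
                  record: str) -> Dict[Tuple[str, str], float]:
    """Boost factor per (attribute, value) pair of the record."""
    factors: Dict[Tuple[str, str], float] = {}
    for boost in objects(graph, record, "retrieval:boost"):
        factor = float(objects(graph, boost, "retrieval:boostFactor")[0])
        for attribute in objects(graph, boost, "smila:attribute"):
            # The Boost node names only the property, so every value of
            # that property on the record receives the same factor.
            for value in objects(graph, record, attribute):
                factors[(attribute, value)] = factor
    return factors


factors = boost_factors(triples, "rec")
```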
- SMILA application ontologies should be based on existing "standards":
  - Dublin Core, XMP to describe documents
  - NEPOMUK for different kinds of information objects
  - SKOS to model taxonomies
  - maybe FOAF for persons?
- Properties not covered by standards but common to typical SMILA applications should be pre-defined and published by SMILA.