SMILA/Documentation/Data Model and Serialization Formats

Revision as of 12:31, 25 May 2011

SMILA Data Model

  • Implementation bundle: org.eclipse.smila.datamodel
  • Current Version: 0.8.0

Concepts

The data to be processed in SMILA is represented as records. For example, one record could correspond to one document or to any resource which should be indexed or found in a search. A record consists of metadata and optional attachments.

SMILA data model version 0.8

  • Metadata contains typed values (literals) arranged in maps (key-anything associations) and sequences (lists of anything). Values can be strings, long integers, double-precision floating-point numbers, booleans, dates (year, month, day), or datetimes (date plus time of day, down to seconds). Maps and sequences can be nested arbitrarily; map keys are always strings. All metadata of one record is arranged in a single map.
  • Attachments can contain any binary content ("byte arrays"), possibly of larger size. During indexing they are not kept in memory all the time, but stored in a "binary storage" service and read only when actually needed.
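
The nesting rules above can be sketched as a small check. This is only an illustration in Python, not part of SMILA: the names VALUE_TYPES and is_valid_any are hypothetical, and Python's dict, list, int and float merely stand in for SMILA's maps, sequences, longs and doubles.

```python
from datetime import date, datetime

# Literal types named in the data model: string, long, double, boolean,
# date and datetime (Python types used as stand-ins, an assumption).
VALUE_TYPES = (str, int, float, bool, date, datetime)


def is_valid_any(obj):
    """Return True if obj is a value, a map with string keys, or a
    sequence, and all nested members are valid as well."""
    if isinstance(obj, dict):
        return all(isinstance(key, str) and is_valid_any(value)
                   for key, value in obj.items())
    if isinstance(obj, list):
        return all(is_valid_any(item) for item in obj)
    return isinstance(obj, VALUE_TYPES)


# Hypothetical metadata of one record: a single top-level map.
metadata = {
    "title": "Some page",
    "filesize": 1234,
    "created": date(2010, 12, 2),
    "author": [{"firstname": "John", "lastname": "Doe"}],
}
```

Binary content is rejected by this check on purpose, because it belongs in attachments rather than in metadata.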

A single entry in a record's metadata map is called a metadata element. Depending on the use case, metadata elements can be interpreted semantically as:

Attributes
Usually, attributes are used when referring to the metadata of an object which is to be processed from a given data source or which is retrieved as the result of a search request. For example, typical attributes characterizing a web page to be indexed are its URL, the size in bytes, the MIME type, the title, and the plain-text content. These attributes are defined by the application domain.
Parameters
Attributes may not be adequate or sufficient for all record types. For example, in search processing, a record represents not a single object from some data source but a search request. In this case, the top level of the record's metadata does not contain attributes from the application domain but request parameters that configure and influence the request execution. These parameters are defined by the pipelets used in the workflow that the search request triggers. Also, their names do not start with underscores. However, a request or result record may contain application-specific attributes on deeper nested levels. The Search API documentation gives an example that illustrates the difference between attributes and parameters.
Annotations
An annotation can be used to add a data structure to the record which was generated as the result of some processing step. E.g., a named-entity-recognition pipelet could add an annotation describing at which character position some entity was found, meaning that the record was annotated with this additional information. If annotations appear in the same maps as attributes, their names should be chosen in such a way that they will not conflict with attribute names from the application, e.g. by prefixing them with an underscore "_".
System attributes
These attributes are needed by SMILA in order to coordinate the processing of a record (see below). Their names start with an underscore "_", so that they will not conflict with names from the application domain.

System attributes

RecordID
Every record must contain a single-valued string attribute named "_recordid", which identifies the record. It must be unique among all processed records. Whoever creates and submits the record to the system (usually a crawler or agent) must ensure this. There is no predefined format for the record ID, so it can contain any string; creating UUIDs or something similar is entirely sufficient. The producer must also place any information needed to access the original data from which the record was produced into explicitly named attributes.
Source
Every record should also contain a second system attribute named "_source" which contains the ID of the data source (e.g. crawler definition) that produced it. This is used by DeltaIndexing or RecordStorage to perform operations on all records from the same source.
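
As a minimal sketch of what a producer has to do, the following Python function builds a metadata map with both system attributes set. The function name new_metadata is hypothetical and the UUID is just one possible ID scheme; SMILA itself does not prescribe any of this.

```python
import uuid


def new_metadata(source_id, **attributes):
    """Build a metadata map with the two system attributes set.

    The record ID is a random UUID here, which is one way (an
    assumption, not a requirement) to get a unique string.
    """
    metadata = {
        "_recordid": str(uuid.uuid4()),
        "_source": source_id,
    }
    # Information needed to access the original data goes into
    # explicitly named attributes, not into the record ID.
    metadata.update(attributes)
    return metadata


record = new_metadata("web", url="http://example.org/something")
```

Because the ID carries no meaning in this sketch, the URL needed to fetch the original page is stored in its own "url" attribute.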

XML format

The XML format of a record is designed to be quite compact:

<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">web:http://example.org/something</Val>
  <Val key="_source">web</Val>
  <Val key="url">http://example.org/something</Val>
  <Val key="filesize" type="long">1234</Val>
  <Val key="sizeInKb" type="double">1.2</Val>
  <Val key="checked" type="boolean">true</Val>
  <Val key="created" type="date">2010-12-02</Val>
  <Val key="lastModified" type="datetime">2010-12-02T16:20:54+01:00</Val>
  <Seq key="trustee">
    <Val>group1</Val>
    <Val>group2</Val>
  </Seq>
  <Seq key="author">
    <Map>
      <Val key="firstname">John</Val>
      <Val key="lastname">Doe</Val>
    </Map>
    <Map>
      <Val key="firstname">Lisa</Val>
      <Val key="lastname">Müller</Val>
    </Map>
  </Seq>  
  <Map key="contact">
    <Val key="email">Homer.Simpson@powerplant.com</Val>      
    <Map key="address">
      <Val key="street">742 Evergreen Terrace</Val>
      <Val key="city">Springfield</Val>
    </Map>
  </Map>
  <Seq key="emptylist" />
  <Map key="emptymap" />
 
  <Attachment>content</Attachment>
  <Attachment>fulltext</Attachment>
</Record>

Notes:

  • The Any objects are represented by <Val>, <Map>, and <Seq> elements.
  • An object that is part of a map must have an additional key attribute. Elements of sequences must not have the key attribute.
  • The type of a value is defined by an optional type attribute, the default is "string".
  • The format of date values is "yyyy-MM-dd" (see the Javadoc of SimpleDateFormat for the meaning of the format string).
  • The format of datetime values is "yyyy-MM-dd'T'HH:mm:ssZ", i.e. it must include timezone information.
  • The top-level <Map> element of a record is omitted from the XML.
  • In XML, the record does not contain the attachment values, but only their names so that a reader knows that there are attachments to be processed.

See package org.eclipse.smila.datamodel.xml for serialization helper classes.
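
To make the notes above concrete, here is a rough sketch of a reader for this XML format, written in Python with the standard library only. It is not the SMILA implementation: parse_any, parse_record and the CONVERTERS table are hypothetical names, and the sample is a shortened version of the record shown above.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

NS = "{http://www.eclipse.org/smila/record}"

# One converter per value of the "type" attribute; "string" is the default.
CONVERTERS = {
    "string": str,
    "long": int,
    "double": float,
    "boolean": lambda text: text == "true",
    "date": date.fromisoformat,
    "datetime": datetime.fromisoformat,
}


def parse_any(element):
    """Convert a <Val>, <Seq> or <Map> element into a Python object."""
    tag = element.tag.replace(NS, "")
    if tag == "Val":
        convert = CONVERTERS[element.get("type", "string")]
        return convert(element.text or "")
    if tag == "Seq":
        return [parse_any(child) for child in element]
    if tag == "Map":
        return {child.get("key"): parse_any(child) for child in element}
    raise ValueError("unexpected element: " + tag)


def parse_record(xml_text):
    """Return (metadata, attachment_names) for one <Record> element."""
    root = ET.fromstring(xml_text)
    metadata, attachments = {}, []
    for child in root:
        if child.tag == NS + "Attachment":
            attachments.append(child.text)  # only the name is in the XML
        else:
            # The top-level map is implicit: children carry the keys.
            metadata[child.get("key")] = parse_any(child)
    return metadata, attachments


SAMPLE = """<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">web:http://example.org/something</Val>
  <Val key="filesize" type="long">1234</Val>
  <Val key="lastModified" type="datetime">2010-12-02T16:20:54+01:00</Val>
  <Seq key="trustee"><Val>group1</Val><Val>group2</Val></Seq>
  <Map key="contact"><Val key="email">Homer.Simpson@powerplant.com</Val></Map>
  <Seq key="emptylist" />
  <Attachment>content</Attachment>
</Record>"""

metadata, attachments = parse_record(SAMPLE)
```

The sketch relies on the rule that elements inside a map carry a key attribute while elements inside a sequence do not, so a sequence becomes a list and a map becomes a dictionary.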

JSON format

The JSON format of a record looks like this:

{
  "_recordid" : "web:http://example.org/something",
  "_source" : "web",
  "url" : "http://example.org/something",
  "filesize" : 1234,
  "sizeInKb" : 1.2,
  "checked" : true,
  "created" : "2010-12-02",
  "lastModified" : "2010-12-02T16:20:54+01:00",
  "trustee" : [ "group1", "group2" ],
  "author" : 
  [ {
      "firstname" : "John",
      "lastname" : "Doe"
    },
    {
      "firstname" : "Lisa",
      "lastname" : "Müller"
    } ],
  "contact" : 
  {
    "email" : "Homer.Simpson@powerplant.com",
    "address" : 
    {
      "street" : "742 Evergreen Terrace",
      "city" : "Springfield"
    }
  },
  "_attachments": ["content", "fulltext"]
}

Notes:

  • Number value types are determined implicitly when parsing JSON:
    • If a number value can be parsed as a long integer, a long value will be created, else it will become a double value.
  • Date(time) values are not supported by JSON, so they are parsed as string values.
    • So when you write a date(time) value to JSON and read it again, it is converted to a string value!
    • You can get the date(time) value back by calling the appropriate conversion methods (AnyMap.getDate(), AnyMap.getDateTime(), Value.asDate(), Value.asDateTime()).
  • Map keys are always strings and must be enclosed in quotes.
  • Attachment names can be added as a system attribute "_attachments".

See package org.eclipse.smila.datamodel.json for serialization helper classes.
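
The type loss described in the notes can be reproduced with any JSON library. The sketch below uses Python's json module as a stand-in for SMILA's own parser, which is an assumption made purely for illustration; to_json is a hypothetical helper.

```python
import json
from datetime import date, datetime


def to_json(metadata):
    """Serialize metadata; date and datetime values become ISO strings."""
    def encode(value):
        if isinstance(value, (date, datetime)):
            return value.isoformat()
        raise TypeError("not serializable: %r" % (value,))
    return json.dumps(metadata, default=encode)


original = {"filesize": 1234, "sizeInKb": 1.2, "created": date(2010, 12, 2)}
restored = json.loads(to_json(original))

# Numbers keep their kind (integer or floating point), but the date
# comes back as a plain string and must be converted explicitly.
created_again = date.fromisoformat(restored["created"])
```

The explicit date.fromisoformat call plays the role that the conversion methods listed above play in SMILA: the reader has to know which string values are meant to be dates.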

Record Filters

Record filters produce reduced copies of a record: A record filter has a name and contains a list of metadata element names. When applied to a record, it produces a copy of the record that contains only the elements of the list.

Record filters are described in a simple XML format:

<RecordFilters>
  <Filter name="filter0" />
  <Filter name="filter1">
    <Element name="attribute" />
  </Filter>
  <Filter name="filter3">
    <Element name="attribute1" />
    <Element name="attribute2" />
    <Element name="attribute3" />
  </Filter>
  <Filter name="filter-all">
    <Element name="*"  />
  </Filter>
</RecordFilters>

Notes:

  • A filter always copies the system elements "_recordid" and "_source". Therefore, the apparently empty "filter0" in this definition produces records that still contain these system elements.
  • A filter may contain an arbitrary number of element names. An element name that does not appear in the record to copy is simply ignored.
  • A filter always removes attachments: The "filter-all" in this definition produces a copy of the record with all metadata elements, but not attachments.

Filters are usually applied by asking the blackboard for a filtered copy of the record's metadata. See Blackboard service API for details. To work with filters directly, see package org.eclipse.smila.datamodel.filter for utility classes.
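
The filter rules from the notes above can be sketched in a few lines of Python. This is an illustration only: apply_filter is a hypothetical function, and the record is modelled here as a dictionary with "metadata" and "attachments" entries, which is an assumption and not the structure of the SMILA classes.

```python
SYSTEM_ELEMENTS = ("_recordid", "_source")


def apply_filter(record, element_names):
    """Return a reduced, shallow copy of the record."""
    metadata = record["metadata"]
    if "*" in element_names:
        kept = dict(metadata)  # wildcard: keep all metadata elements
    else:
        # System elements are always copied; unknown names are ignored.
        wanted = set(element_names) | set(SYSTEM_ELEMENTS)
        kept = {key: value for key, value in metadata.items()
                if key in wanted}
    # A filter always removes attachments.
    return {"metadata": kept, "attachments": []}


record = {
    "metadata": {
        "_recordid": "web:http://example.org/something",
        "_source": "web",
        "attribute1": "a",
        "attribute2": "b",
    },
    "attachments": ["content", "fulltext"],
}

empty = apply_filter(record, [])                            # like "filter0"
some = apply_filter(record, ["attribute1", "attribute3"])   # one name missing
everything = apply_filter(record, ["*"])                    # like "filter-all"
```

Even the empty filter yields a record that can still be identified, since "_recordid" and "_source" survive every filter.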
