Difference between revisions of "SMILA/Documentation/2011.Simplification/Data Model and Serialization Formats"

Revision as of 03:51, 9 March 2011

SMILA Data Model

  • Implementation bundle: org.eclipse.smila.datamodel
  • Current Version: 0.8.0

Concepts

Data to be processed in SMILA is represented as records. For example, one record usually corresponds to one document or other resource to be indexed or found in a search. A record consists of metadata and, optionally, attachments.

SMILA data model version 0.8

  • Metadata contains typed values (literals) arranged in maps (key-anything associations) and sequences (lists of anything). Values can be strings, long integers, double precision floating point numbers, booleans, dates (year, month, day) or datetimes (date + time of day, down to seconds). Maps and sequences can be nested arbitrarily; map keys are always strings. All metadata of one record is arranged in a single map.
  • Attachments can contain any binary content ("byte arrays"), possibly of large size. During indexing they are not kept in memory all the time, but stored in a "binary storage" service and read only when actually needed.
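
The structure described above can be sketched with plain Python containers. This is only an illustration of the nesting rules, not SMILA's actual Java API: the names `VALUE_TYPES` and `is_valid_metadata` are hypothetical, and mapping "long" and "double" to Python's `int` and `float` is an assumption.

```python
from datetime import date, datetime

# Literal value types named in the text. datetime is a subclass of date in
# Python, so listing date alone would already cover both.
VALUE_TYPES = (str, int, float, bool, date, datetime)


def is_valid_metadata(node, top_level=True):
    """Return True if node uses only maps, sequences and literal values."""
    if isinstance(node, dict):
        # Map keys must be strings; map values may be any valid node.
        return all(
            isinstance(key, str) and is_valid_metadata(value, top_level=False)
            for key, value in node.items()
        )
    if top_level:
        return False  # all metadata of one record lives in a single map
    if isinstance(node, list):
        return all(is_valid_metadata(item, top_level=False) for item in node)
    return isinstance(node, VALUE_TYPES)


metadata = {
    "url": "http://example.org/something",
    "filesize": 1234,
    "created": date(2010, 12, 2),
    "author": [{"firstname": "John", "lastname": "Doe"}],
}
# Attachments are kept apart from the metadata: name -> binary content.
attachments = {"content": b"<html>...</html>"}
```

Keeping attachments in a separate mapping mirrors the idea that binary content is stored and loaded independently of the metadata.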

A single entry in a record's metadata map is called a metadata element. Depending on the use case, metadata elements can be semantically interpreted as:

attributes
This term is usually used when referring to metadata of an object from a data source to be processed or an object found in a search. For example, a web page to index may be described by attributes like its URL, the size in bytes, the MIME type, the title and the plain-text content. These attributes are defined by the application domain.
parameters
In some cases (e.g. search processing), a record does not represent a single object from a data source but something else, such as a search request. In such cases, the top level of the record metadata does not contain attributes from the application domain but request parameters that configure the request execution. These parameters are defined by the pipelets used in the workflow executing the request, and their names do not start with underscores. However, a request record or result record may contain application attributes on deeper nesting levels. For an illustrative example, refer to the Search API.
annotations
This term is sometimes used for structures that processing steps add to records as their results. For example, a named-entity-recognition pipelet could record which entity was found in which attribute of the record and at which character positions: the record is annotated with this information. If annotations appear in the same maps as attributes, their names should be chosen so that they do not conflict with attribute names from the application, e.g. by prefixing them with an underscore.
system attributes
These are needed by SMILA to coordinate the processing of a record (see below). Their names start with an underscore "_" so that they do not conflict with names from the application domain.

System attributes

RecordID
A record must contain a single-valued string attribute named "_recordid" that identifies the record. The ID must be unique across all processed records; this must be ensured by whoever initially creates the record and submits it to the system (usually crawlers or agents). There is no predefined format for the record ID: it can contain any string, so generating UUIDs, for example, is perfectly fine. Any information needed to access the original data from which the record was produced should be placed by the producer in explicitly named attributes.
Source
A record should contain a second system attribute named "_source" containing the ID of the data source (e.g. crawler definition) that produced it. This is used by DeltaIndexing or RecordStorage to perform operations on all records from one source.
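
As a minimal sketch of how a producer might fill in these system attributes, the following Python snippet generates a UUID-based record ID and checks a record for the two attributes. The function names `new_record` and `check_system_attributes` are hypothetical and do not come from SMILA; the use of UUIDs is just one of the options the text allows.

```python
import uuid


def new_record(source_id, **attributes):
    """Create record metadata with a fresh "_recordid" and the given "_source"."""
    record = dict(attributes)
    # System attributes are set last so application data cannot overwrite them.
    record["_recordid"] = str(uuid.uuid4())
    record["_source"] = source_id
    return record


def check_system_attributes(record):
    """Return a list of problems; an empty list means both attributes are present."""
    problems = []
    record_id = record.get("_recordid")
    if not isinstance(record_id, str) or not record_id:
        problems.append('"_recordid" is required and must be a non-empty string')
    if "_source" not in record:
        problems.append('"_source" is recommended but missing')
    return problems


record = new_record("web", url="http://example.org/something")
```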

XML format

The XML format for a Record is designed to be quite compact:

<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">web:http://example.org/something</Val>
  <Val key="_source">web</Val>
  <Val key="url">http://example.org/something</Val>
  <Val key="filesize" type="long">1234</Val>
  <Val key="sizeInKb" type="double">1.2</Val>
  <Val key="checked" type="boolean">true</Val>
  <Val key="created" type="date">2010-12-02</Val>
  <Val key="lastModified" type="datetime">2010-12-02T16:20:54+01:00</Val>
  <Seq key="trustee">
    <Val>group1</Val>
    <Val>group2</Val>
  </Seq>
  <Seq key="author">
    <Map>
      <Val key="firstname">John</Val>
      <Val key="lastname">Doe</Val>
    </Map>
    <Map>
      <Val key="firstname">Lisa</Val>
      <Val key="lastname">Müller</Val>
    </Map>
  </Seq>  
  <Map key="contact">
    <Val key="email">Homer.Simpson@powerplant.com</Val>      
    <Map key="adress">
      <Val key="street">742 Evergreen Terrace</Val>
      <Val key="city">Springfield</Val>
    </Map>
  </Map>
  <Seq key="emptylist" />
  <Map key="emptymap" />
 
  <Attachment>content</Attachment>
  <Attachment>fulltext</Attachment>
</Record>

Notes:

  • The Any objects are represented by <Val>, <Map> and <Seq> elements.
  • An object that is part of a map must have an additional "key" attribute. Elements of sequences must not have the "key" attribute.
  • The value type is defined by an optional "type" attribute; the default is "string".
  • The format of date values is "yyyy-MM-dd" (see [Javadoc of SimpleDateFormat] for the meaning of the format string).
  • The format of datetime values is "yyyy-MM-dd'T'HH:mm:ssZ", i.e. it must include timezone information.
  • The top-level <Map> of a record is omitted from the XML.
  • In XML the record does not contain the attachment values, but only their names so that a reader knows that there are attachments to process.
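
To illustrate the rules in these notes, here is a rough Python sketch that reads the XML format into nested dictionaries and lists. It is not SMILA's own parser: the function names are hypothetical, the boolean literals are assumed to be "true" and "false", and datetime offsets are assumed to be written as in the example ("+01:00").

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

NS = "{http://www.eclipse.org/smila/record}"

# One converter per value of the optional "type" attribute.
CONVERTERS = {
    "string": str,
    "long": int,
    "double": float,
    "boolean": lambda text: text == "true",
    "date": date.fromisoformat,          # "yyyy-MM-dd"
    "datetime": datetime.fromisoformat,  # offset written as "+01:00"
}


def parse_any(elem):
    """Convert a <Val>, <Map> or <Seq> element into a Python value."""
    tag = elem.tag.rsplit("}", 1)[-1]  # strip the namespace part of the tag
    if tag == "Val":
        convert = CONVERTERS[elem.get("type", "string")]  # default is "string"
        return convert(elem.text or "")
    if tag == "Map":
        return parse_map(elem)
    if tag == "Seq":
        items = []
        for child in elem:
            if child.get("key") is not None:
                raise ValueError("sequence elements must not have a key")
            items.append(parse_any(child))
        return items
    raise ValueError(f"unexpected element <{tag}>")


def parse_map(parent, skip=()):
    """Collect the keyed children of parent into a dict."""
    result = {}
    for child in parent:
        if child.tag in skip:
            continue
        key = child.get("key")
        if key is None:
            raise ValueError("map entries must have a key")
        result[key] = parse_any(child)
    return result


def parse_record(xml_text):
    """Return (metadata, attachment_names) for one <Record> document."""
    root = ET.fromstring(xml_text)
    attachment_tag = NS + "Attachment"
    names = [child.text for child in root if child.tag == attachment_tag]
    # The top-level map is omitted in XML, so <Record> itself acts as the map.
    return parse_map(root, skip=(attachment_tag,)), names
```

Calling `parse_record` on the example above would yield a dictionary for the metadata and the list of attachment names; the attachment contents themselves are not part of the XML.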

JSON format

The JSON format is not yet implemented, but will follow soon. It will look like this:

{
  "_recordid" : "web:http://example.org/something",
  "_source" : "web",
  "filesize" : 1234,
  "sizeInKb" : 1.2,
  "checked" : true,
  "created" : "2010-12-02",
  "lastModified" : "2010-12-02T16:20:54+01:00",
  "trustee" : [ "group1", "group2" ],
  "author" : 
  [ {
      "firstname" : "John",
      "lastname" : "Doe"
    },
    {
      "firstname" : "Lisa",
      "lastname" : "Müller"
    } ],
  "contact" : 
  {
    "email" : "Homer.Simpson@powerplant.com",
    "adress" : 
    {
      "street" : "742 Evergreen Terrace",
      "city" : "Springfield"
    }
  },
  "_attachments" : [ "content", "fulltext" ]
}

Notes:

  • Value types are determined implicitly when parsing JSON:
    • If a number value can be parsed as a long integer, a long value will be created, else it will become a double value.
    • If a string (value enclosed in quotes) matches the date or datetime format, a value of the respective type will be created, else it will become a string value.
  • Map keys are always strings and must be enclosed in quotes.
  • Attachment names can be added as a system attribute "_attachments".
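
The implicit typing rules can be sketched in Python as follows. Since the JSON format is described here only as planned, this is an illustration under assumptions rather than the actual behavior: the function names are hypothetical, "long" is assumed to mean a signed 64-bit integer, and the accepted date and datetime patterns are modeled on the example values.

```python
import json
import re
from datetime import date, datetime

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
DATETIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:?\d{2}")


def infer_string(text):
    """Turn a string into a date or datetime if it matches, else keep it."""
    try:
        if DATE_RE.fullmatch(text):
            return datetime.strptime(text, "%Y-%m-%d").date()
        if DATETIME_RE.fullmatch(text):
            return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S%z")
    except ValueError:
        pass  # looks like a date but is not a valid one, e.g. month 13
    return text


def parse_long_or_double(text):
    """Keep integers that fit into 64 bits, fall back to a double otherwise."""
    number = int(text)
    return number if -2**63 <= number < 2**63 else float(text)


def convert(node):
    """Apply string inference recursively to maps and sequences."""
    if isinstance(node, dict):
        return {key: convert(value) for key, value in node.items()}
    if isinstance(node, list):
        return [convert(item) for item in node]
    if isinstance(node, str):
        return infer_string(node)
    return node


def parse_record_json(text):
    # json.loads already separates integers from other numbers; parse_int
    # only adds the 64-bit range check.
    return convert(json.loads(text, parse_int=parse_long_or_double))
```

One consequence of implicit typing worth noting is that a plain string attribute whose content happens to look like a date is turned into a date value as well.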

Record Filters

Record Filters produce reduced copies of a record: A record filter has a name and contains a list of metadata element names. When applied to a record, it produces a copy of the record that contains only the elements of the list.

Record filters are described in a simple XML format:

<RecordFilters>
  <Filter name="filter0" />
  <Filter name="filter1">
    <Element name="attribute" />
  </Filter>
  <Filter name="filter3">
    <Element name="attribute1" />
    <Element name="attribute2" />
    <Element name="attribute3" />
  </Filter>
  <Filter name="filter-all">
    <Element name="*"  />
  </Filter>
</RecordFilters>

Notes:

  • A filter always copies the system elements "_recordid" and "_source". Therefore the apparently empty "filter0" in this definition produces records that still contain these system elements.
  • A filter may contain an arbitrary number of element names. If an element does not appear in the record to copy, it is simply ignored.
  • A filter always removes attachments: The "filter-all" in this definition produces a copy of the record with all metadata elements, but not attachments.

Filters are usually applied by asking the Blackboard for a filtered copy of the record metadata. See Blackboard service API for details. To work with filters directly see package org.eclipse.smila.datamodel.filter for utility classes.
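
The filter semantics described in the notes can be sketched in Python like this. It is not the implementation from org.eclipse.smila.datamodel.filter: the function names are hypothetical, and treating "*" as "copy all metadata elements" is inferred from the "filter-all" example.

```python
import copy
import xml.etree.ElementTree as ET

# System elements that every filter copies, according to the notes.
SYSTEM_ELEMENTS = ("_recordid", "_source")


def load_filters(xml_text):
    """Read a <RecordFilters> document into {filter name: [element names]}."""
    root = ET.fromstring(xml_text)
    return {
        flt.get("name"): [elem.get("name") for elem in flt.findall("Element")]
        for flt in root.findall("Filter")
    }


def apply_filter(element_names, metadata):
    """Return a reduced deep copy of the metadata map.

    Attachments are not passed in at all, which reflects that a filter
    never copies them.
    """
    if "*" in element_names:
        kept = metadata
    else:
        wanted = set(element_names) | set(SYSTEM_ELEMENTS)
        # Names that do not occur in the record are silently ignored.
        kept = {key: value for key, value in metadata.items() if key in wanted}
    return copy.deepcopy(kept)
```

A deep copy is used so that later changes to the filtered record do not leak back into the original.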
