SMILA/Documentation/Data Model and Serialization Formats

SMILA Data Model

Implementation bundle: org.eclipse.smila.datamodel
Current Version: 1.0.0

Concepts

The data to be processed in SMILA is represented as records. For example, one record could correspond to one document or to any resource which should be indexed or found in a search. A record consists of metadata and optional attachments.

Metadata

Metadata contains typed values (literals) arranged in maps (key-anything associations) and sequences (lists of anything). Values can be strings, long integers, double precision floating point numbers, booleans, dates (year, month, day) or datetimes (date + time of day, down to seconds). Maps and sequences can be nested arbitrarily; map keys are always strings. All metadata of one record is arranged in a single map.
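As a rough illustration of this structure, the following sketch models record metadata with plain java.util collections. It deliberately does not use the SMILA datamodel classes; the class name MetadataSketch and its method are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: models metadata with JDK collections, not with the SMILA API. */
final class MetadataSketch {

  /** Builds a metadata map with literal values, a sequence, and a nested map. */
  static Map<String, Object> sampleMetadata() {
    Map<String, Object> metadata = new LinkedHashMap<>();
    metadata.put("_recordid", "web:http://example.org/something"); // string literal
    metadata.put("filesize", 1234L);                                // long literal
    metadata.put("sizeInKb", 1.2d);                                 // double literal
    metadata.put("checked", Boolean.TRUE);                          // boolean literal

    List<Object> trustees = new ArrayList<>();                      // sequence of strings
    trustees.add("group1");
    trustees.add("group2");
    metadata.put("trustee", trustees);

    Map<String, Object> address = new LinkedHashMap<>();            // map nested in a map
    address.put("city", "Springfield");
    Map<String, Object> contact = new LinkedHashMap<>();
    contact.put("address", address);
    metadata.put("contact", contact);
    return metadata;
  }
}
```

All keys are strings, while the values are literals, lists or further maps, which mirrors the nesting rules described above.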

Attachments

Attachments can contain any binary content ("byte arrays"), possibly of large size. Whether the content is kept in memory or read from a persistence service on demand depends on the implementation of the interface. Currently the size is limited to 2 GB (the maximum size of a Java byte[]), but we are planning to extend this in the future.

A single entry in a record's metadata map is called a metadata element. Depending on the use case, metadata elements can be semantically interpreted as:

Attributes: Usually, attributes are used when referring to the metadata of an object which is to be processed from a given data source or which is retrieved as the result of a search request. For example, typical attributes characterizing a web page to be indexed are its URL, the size in bytes, the MIME type, the title, and the plain-text content. These attributes are defined by the application domain.

Parameters: Attributes may not be adequate or sufficient for all record types. For example, in search processing, a record represents not a single object from some data source but rather a search request object. In such a case, the record's metadata does not contain attributes from the application domain on the top level but rather request parameters that configure and influence the request execution. These parameters are defined by the pipelets which are used in the workflow that was triggered by the search request. Also, their names do not start with underscores. However, a request or result record may contain application-specific attributes on deeper nested levels. See Search API for an example illustrating the difference between attributes and parameters.

Annotations: An annotation can be used to add a data structure to the record which was generated as the result of some processing step. E.g., a named-entity-recognition pipelet could add an annotation describing at which character position some entity was found, meaning that the record was annotated with this additional information. If annotations appear in the same maps as attributes, their names should be chosen in such a way that they will not conflict with attribute names from the application, e.g. by prefixing them with an underscore "_".

System attributes: These attributes are needed by SMILA in order to coordinate the processing of a record (see below). Their names start with an underscore "_", so that they will not conflict with names from the application domain.

System attributes

RecordID: Every record must contain a single-valued string attribute named "_recordid" which is required to identify the record. It must be unique across all processed records. This must be ensured by whoever creates and submits the record to the system (usually crawlers or agents). There is no predefined format for the record ID; it can be any string, so creating UUIDs or something similar is entirely sufficient. Also, the producer must place any information needed to access the original data from which the record was produced into explicitly named attributes.
Source: Every record should also contain a second system attribute named "_source" which contains the ID of the data source (e.g. crawler definition) that produced it. This is used by DeltaIndexing or RecordStorage to perform operations on all records from the same source.

Date and DateTime formats

Internally, date and datetime values are represented as instances of java.util.Date, which means that they are stored as the number of milliseconds since January 1, 1970, 00:00:00 GMT. For the string serialization used in XML, JSON or BON (see below) the following rules apply:

The format of date values is "yyyy-MM-dd" (see SimpleDateFormat for the meaning of the format string). The year must have exactly 4 digits, the month and day must have 2 digits.
The format of datetime values is "yyyy-MM-dd'T'HH:mm:ssZ" or "yyyy-MM-dd'T'HH:mm:ss.SSSZ".
- For the date part the date value rules apply.
- Milliseconds are optional when parsing datetime values from strings, but if given, they must have exactly 3 digits.
- The timezone information must be included and must be either "Z" for GMT/UTC or of the forms "+hhmm" or "-hhmm", e.g. "+0100" for Central European Time (CET, MEZ), or "-0500" for Eastern Standard Time (EST). Of course, using "+0000" or "-0000" for GMT/UTC is fine, too.
- When a datetime value is created by parsing from a string (e.g. by parsing XML, JSON or BON, or using the DataFactory.parseFromString methods), it will be printed in exactly the same way when it is serialized again to XML, JSON or BON (see ValueFormatHelper.getDefaultDateTimeFormat).
- When a datetime value is created in Java directly from an instance of java.util.Date, it will be serialized using the default timezone of the creating JVM. The milliseconds will be included, too, even if they are just 000.
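The datetime rules above can be tried out with the JDK's SimpleDateFormat. The following is a minimal sketch, not the SMILA implementation: the class DateTimeSketch and its methods are hypothetical, and it does not enforce the exact digit counts required above. Because the pattern letter Z does not accept a literal "Z", the sketch maps it to "+0000" before parsing.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Illustrative helper, not part of SMILA: parses and prints datetime strings. */
final class DateTimeSketch {
  private static final String WITH_MILLIS = "yyyy-MM-dd'T'HH:mm:ss.SSSZ";
  private static final String WITHOUT_MILLIS = "yyyy-MM-dd'T'HH:mm:ssZ";

  /** Parses a datetime string; accepts "Z", "+hhmm" or "-hhmm" as timezone. */
  static Date parse(String text) {
    // Map the literal "Z" to an explicit offset that the Z pattern letter understands.
    String normalized = text.endsWith("Z")
        ? text.substring(0, text.length() - 1) + "+0000"
        : text;
    // Milliseconds are optional: choose the pattern by looking for the fraction separator.
    String pattern = normalized.indexOf('.') >= 0 ? WITH_MILLIS : WITHOUT_MILLIS;
    SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
    format.setLenient(false);
    try {
      return format.parse(normalized);
    } catch (ParseException e) {
      throw new IllegalArgumentException("Not a datetime value: " + text, e);
    }
  }

  /** Prints a datetime with milliseconds, using the given timezone for the offset. */
  static String format(Date value, TimeZone zone) {
    SimpleDateFormat format = new SimpleDateFormat(WITH_MILLIS, Locale.ROOT);
    format.setTimeZone(zone);
    return format.format(value);
  }
}
```

For example, DateTimeSketch.parse("2010-12-02T16:20:54.123+0100") and DateTimeSketch.parse("2010-12-02T15:20:54.123Z") yield the same java.util.Date, since both strings denote the same instant.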

XML format

The XML format of a record is designed to be quite compact:

<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">web:http://example.org/something</Val>
  <Val key="_source">web</Val>
  <Val key="url">http://example.org/something</Val>
  <Val key="filesize" type="long">1234</Val>
  <Val key="sizeInKb" type="double">1.2</Val>
  <Val key="checked" type="boolean">true</Val>
  <Val key="created" type="date">2010-12-02</Val>
  <Val key="lastModified" type="datetime">2010-12-02T16:20:54.123+0100</Val>
  <Seq key="trustee">
    <Val>group1</Val>
    <Val>group2</Val>
  </Seq>
  <Seq key="author">
    <Map>
      <Val key="firstname">John</Val>
      <Val key="lastname">Doe</Val>
    </Map>
    <Map>
      <Val key="firstname">Lisa</Val>
      <Val key="lastname">Müller</Val>
    </Map>
  </Seq>  
  <Map key="contact">
    <Val key="email">Homer.Simpson@powerplant.com</Val>      
    <Map key="address">
      <Val key="street">742 Evergreen Terrace</Val>
      <Val key="city">Springfield</Val>
    </Map>
  </Map>
  <Seq key="emptylist" />
  <Map key="emptymap" />
 
  <Attachment>content</Attachment>
  <Attachment>fulltext</Attachment>
</Record>

Notes:

The Any objects are represented by <Val>, <Map>, and <Seq> elements.
An object that is part of a map must have an additional key attribute. Elements of sequences must not have the key attribute.
The type of a value is defined by an optional type attribute; the default is "string".
See above for description of date and datetime formats.
The top-level <Map> element of a record is omitted from the XML.
Attachments are not supported in the XML format: the record contains only the attachment names, so that a reader knows that there are attachments to be processed. The attachments themselves (the bytes) are lost.

See package org.eclipse.smila.datamodel.xml for serialization helper classes.
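To make the notes above concrete, here is a small reader built only on the JDK's DOM parser. It is a sketch under assumptions, not the code from org.eclipse.smila.datamodel.xml: the class RecordXmlSketch is hypothetical, the result is a plain java.util map instead of SMILA objects, and date and datetime values are simply kept as strings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Illustrative reader, not the SMILA implementation: maps record XML to JDK collections. */
final class RecordXmlSketch {

  /** Parses a Record document; the omitted top-level map is rebuilt from the children. */
  static Map<String, Object> parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Element root = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
      return readMap(root);
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot parse record XML", e);
    }
  }

  private static Map<String, Object> readMap(Element parent) {
    Map<String, Object> map = new LinkedHashMap<>();
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      // Attachment elements only carry names and are skipped in this sketch.
      if (n instanceof Element && !"Attachment".equals(n.getLocalName())) {
        Element e = (Element) n;
        map.put(e.getAttribute("key"), readAny(e)); // map entries carry a key attribute
      }
    }
    return map;
  }

  private static List<Object> readSeq(Element parent) {
    List<Object> seq = new ArrayList<>();
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      if (n instanceof Element) {
        seq.add(readAny((Element) n)); // sequence elements have no key attribute
      }
    }
    return seq;
  }

  private static Object readAny(Element e) {
    switch (e.getLocalName()) {
      case "Map": return readMap(e);
      case "Seq": return readSeq(e);
      default: return readValue(e.getAttribute("type"), e.getTextContent());
    }
  }

  private static Object readValue(String type, String text) {
    switch (type) {
      case "long": return Long.valueOf(text);
      case "double": return Double.valueOf(text);
      case "boolean": return Boolean.valueOf(text);
      default: return text; // missing type means "string"; dates stay strings here
    }
  }
}
```

Calling RecordXmlSketch.parse with a record XML string returns the metadata as nested maps and lists; a missing type attribute is treated as "string", as described in the notes.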

Bug 351704: Due to a bug in the JDK's default implementation of XMLStreamReader, you should only use XML version 1.0. When deserializing, either omit the XML declaration entirely or use:

<?xml version="1.0" encoding="utf-8"?>

JSON format

The JSON format of a record looks like this:

{
  "_recordid" : "web:http://example.org/something",
  "_source" : "web",
  "url" : "http://example.org/something",
  "filesize" : 1234,
  "sizeInKb" : 1.2,
  "checked" : true,
  "created" : "2010-12-02",
  "lastModified" : "2010-12-02T16:20:54.123+0100",
  "trustee" : [ "group1", "group2" ],
  "author" : 
  [ {
      "firstname" : "John",
      "lastname" : "Doe"
    },
    {
      "firstname" : "Lisa",
      "lastname" : "Müller"
    } ],
  "contact" : 
  {
    "email" : "Homer.Simpson@powerplant.com",
    "address" : 
    {
      "street" : "742 Evergreen Terrace",
      "city" : "Springfield"
    }
  },
 "_attachments": ["content", "fulltext"]
}

Notes:

Number value types are determined implicitly when parsing JSON:
- If a number value can be parsed as a long integer, a long value will be created, else it will become a double value.
Date and DateTime are not supported by JSON natively, therefore date and datetime values are printed to JSON as simple strings using the format rules described above. On the other hand, when the JSON parser finds a string value that has a correct date or datetime format, it creates a date or datetime value. The original string is preserved, so when accessing the value "as a string" the client will get the original string. Also, when the object is written to JSON (or BON or XML) again, the original string will be used. So this autodetection should not cause problems even if some string value has the correct format, but is not meant to be a date or datetime.
Map keys are always strings and must be enclosed in quotes.
Attachments are not supported in the JSON format: only the names of attachments are preserved; the attachments themselves (the bytes) are lost.

See package org.eclipse.smila.datamodel.ipc for serialization helper classes.
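The type detection rules from the notes can be sketched in a few lines of Java. This is an illustration only, not the parser from org.eclipse.smila.datamodel.ipc: the class JsonTypeSketch is hypothetical, and the regular expressions check only the shape of a string, not whether month, day or time fields are in range.

```java
import java.util.regex.Pattern;

/** Illustrative only: mimics the type detection rules described in the notes. */
final class JsonTypeSketch {
  private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
  private static final Pattern DATETIME = Pattern.compile(
      "\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d{3})?(Z|[+-]\\d{4})");

  /** Returns a Long if the number token parses as a long integer, else a Double. */
  static Number parseNumber(String token) {
    try {
      return Long.valueOf(token);
    } catch (NumberFormatException e) {
      return Double.valueOf(token);
    }
  }

  /** Classifies a JSON string value as "date", "datetime" or "string". */
  static String detectStringType(String value) {
    if (DATE.matcher(value).matches()) {
      return "date";
    }
    if (DATETIME.matcher(value).matches()) {
      return "datetime";
    }
    return "string";
  }
}
```

With these rules, the token 1234 becomes a long and 1.2 becomes a double, while "2010-12-02" is classified as a date and "group1" stays a plain string.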

BON Binary Object Notation Format

Format introduction

The format consists of a sequence of tokens and data with two different types of tokens:

Event tokens are single bytes that describe an event (e.g. OBJECT-START, SEQUENCE-START).
Data tokens are the first part of an entity.

Every entity consists of up to three parts. The first part is a one-byte token that describes the data type that follows. For string types, the token is followed by the data length (second part). The last part is the data itself (except for booleans, whose value is stored within the token).

Integer values are stored in a compressed format. The sign and the integer length (number of bytes) are stored in the token byte. Strings are generally stored in UTF-8 format.

The handling of date and datetime values is exactly as in JSON. See above for details.

Attachments are fully supported.

Scalar Types

The current release features the following scalar types:

Integer: signed int64
- compressed, bytes are stored in network byte order (big endian)
- −9,223,372,036,854,775,808 to +9,223,372,036,854,775,807
Floating point values:
- double (8 bytes in network byte order IEEE format (java default))
Bool
String:
- UTF-8 coded Text Strings, max 2^31-1 bytes

Integer compressing

The token bytes 0..15 define the sign of the number (0-7 positive, 8-15 negative) and the number of bytes needed to store it. The bytes are stored in network byte order.

Examples for integer compression
value	token	data
17	0 (positive, 1 byte)	0x11
17985	1 (positive, 2 bytes)	0x46 0x41
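The compression scheme can be written down in a few lines. The following encoder is a sketch derived from the description and the examples in this document, not the SMILA implementation; the class name BonIntSketch is made up. It stores the magnitude without sign, as the -36364 example further below suggests.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative encoder for BON SCALAR-INT values; not the SMILA implementation. */
final class BonIntSketch {

  /** Encodes a long as a token byte (0..15) followed by the magnitude, big endian. */
  static byte[] encode(long value) {
    boolean negative = value < 0;
    // For Long.MIN_VALUE the negation overflows, but the resulting bit pattern is
    // still the unsigned magnitude 2^63, which is all the code below relies on.
    long magnitude = negative ? -value : value;
    int bits = 64 - Long.numberOfLeadingZeros(magnitude);
    int length = Math.max(1, (bits + 7) / 8);            // 1..8 data bytes
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write((negative ? 8 : 0) + (length - 1));        // tokens 0-7 positive, 8-15 negative
    for (int i = length - 1; i >= 0; i--) {
      out.write((int) (magnitude >>> (8 * i)) & 0xFF);   // most significant byte first
    }
    return out.toByteArray();
  }
}
```

BonIntSketch.encode(17) yields the bytes 00 11 and BonIntSketch.encode(17985) yields 01 46 41, matching the table above.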

Binary Type

The Binary type is used for arbitrary binary content in attachments. A single binary is currently limited to a size of max 2^31-1 bytes.

Token Bytes

There are two different types of tokens. Here is a complete list of all tokens which are supported by the current release of the format:

List of event tokens
token	description	byte
OBJECT-START	No version string	25
OBJECT-START	example for BON format extension	26
OBJECT-END		28
SEQUENCE-START		29
SEQUENCE-END		30
MAPPING-START		31
MAPPING-END		32
ATTACHMENTS-START		33
ATTACHMENTS-END		34

List of data tokens
token	description	byte
SCALAR-INT	positive length 1	0
	positive length 2	1
	...	...
	positive length 8	7
	negative length 1	8
	negative length 2	9
	...	...
	negative length 8	15
SCALAR-BOOL	true	16
	false	17
SCALAR-FLOAT	float (32 bit)	18 (reserved, not implemented)
SCALAR-FLOAT	double (64 bit)	19
SCALAR-FLOAT	long double (80 bit)	20 (reserved, not implemented)
SCALAR-STRING	1 length byte	21
	2 length byte	22
	3 length byte	23
	4 length byte	24
BINARY	length 1	35
	length 2	36
	length 3	37
	length 4	38
	length 5	39 (reserved, not implemented)
	...	...
	length 8	42 (reserved, not implemented)

Backward compatible extension concept

If we need to change the BON format, we pick an unused token number (e.g. 26) to indicate the new BON format. In this new format we can optionally store additional version information, e.g. the BON format version and the record schema version, as one byte.

Examples

Sample integer value: -36364

The BON representation:

Value (decimal)	Info	Comment
9	SCALAR-INT	negative int value with 2 bytes length
36364	int	int value without sign

and the hex representation:

09 8E 0C

Sample text: ähnlich

The BON representation:

Value (decimal)	Info	Comment
21	SCALAR-STRING	string with one byte length info
08	length info	the string follows
ähnlich		the string content

and the hex representation:

 
15 08 c3 a4 68 6e 6c 69 63 68
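The string example can be reproduced with the following sketch. It is not the SMILA implementation: the class BonStringSketch is hypothetical, and it assumes that the writer picks the shortest possible length field and stores the length in network byte order like the integer data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative encoder for BON SCALAR-STRING values; not the SMILA implementation. */
final class BonStringSketch {

  /** Encodes a string as token (21..24), length bytes (big endian) and UTF-8 content. */
  static byte[] encode(String text) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    // Assumption: use as few length bytes as the UTF-8 byte count requires (1..4).
    int lengthBytes = 1;
    while (lengthBytes < 4 && (utf8.length >>> (8 * lengthBytes)) != 0) {
      lengthBytes++;
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(21 + (lengthBytes - 1));                   // SCALAR-STRING token
    for (int i = lengthBytes - 1; i >= 0; i--) {
      out.write((utf8.length >>> (8 * i)) & 0xFF);       // length, most significant byte first
    }
    out.write(utf8, 0, utf8.length);                     // the string content
    return out.toByteArray();
  }
}
```

BonStringSketch.encode("\u00e4hnlich") returns the ten bytes shown above. Note that the length counts UTF-8 bytes, not characters, which is why the seven-character word has a length of 8.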

A complex example: This could be some text annotation or highlighting structure. The JSON representation is:

{
  "title": [
    ["STEM","the",0,2],
    ["STEM","title",4,8]
  ],
  "text": [ 
    ["STEM","the",0,2],
    ["STEM","text",4,7]
  ]
}

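The BON representation of the two inner sequences under "title" is shown below; the table does not list the sequence enclosing them or the "text" entry:

Value (decimal)	Info	Comment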
25	OBJECT-START	"---" (here: without Type:version)
31	MAPPING-START
21	SCALAR-STRING	string with one byte length info
5		length info of the string
title		the string content
29	SEQUENCE-START	start of the sequence "STEM,the,0,2"
21	SCALAR-STRING	string with one byte length info
4		length info for "STEM"
STEM		the string content
21	SCALAR-STRING	string with one byte length info
3		length info for "the"
the		the string content
0	SCALAR-INT (positive)	with one byte length
0		the INT value
0	SCALAR-INT (positive)	with one byte length
2		the INT value
30	SEQUENCE-END	end of the sequence "STEM,the,0,2"
29	SEQUENCE-START	start of the sequence "STEM,title,4,8"
21	SCALAR-STRING	string with one byte length info
4		length info for "STEM"
STEM		the string content
21	SCALAR-STRING	string with one byte length info
5		length info for "title"
title		the string content
0	SCALAR-INT (positive)	with one byte length
4		the INT value
0	SCALAR-INT (positive)	with one byte length
8		the INT value
30	SEQUENCE-END	end of the sequence "STEM,title,4,8"
32	MAPPING-END
28	OBJECT-END
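A recursive writer makes the nesting of such a structure explicit. The following sketch is not the SMILA implementation: the class BonWriterSketch is hypothetical, and it covers only maps, sequences, short strings and integers from 0 to 255. It assumes that every JSON array, including an array that encloses other arrays, is written with its own SEQUENCE-START and SEQUENCE-END pair.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

/** Illustrative BON writer for maps, sequences, short strings and small integers. */
final class BonWriterSketch {

  /** Writes OBJECT-START, the metadata map and OBJECT-END. */
  static byte[] writeObject(Map<String, ?> metadata) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(25);                                  // OBJECT-START
    writeAny(out, metadata);
    out.write(28);                                  // OBJECT-END
    return out.toByteArray();
  }

  private static void writeAny(ByteArrayOutputStream out, Object any) {
    if (any instanceof Map) {
      out.write(31);                                // MAPPING-START
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) any).entrySet()) {
        writeString(out, (String) entry.getKey());  // map keys are always strings
        writeAny(out, entry.getValue());
      }
      out.write(32);                                // MAPPING-END
    } else if (any instanceof List) {
      out.write(29);                                // SEQUENCE-START
      for (Object element : (List<?>) any) {
        writeAny(out, element);
      }
      out.write(30);                                // SEQUENCE-END
    } else if (any instanceof String) {
      writeString(out, (String) any);
    } else if (any instanceof Integer) {
      int value = (Integer) any;
      if (value < 0 || value > 255) {
        throw new IllegalArgumentException("Integer not covered by this sketch: " + value);
      }
      out.write(0);                                 // SCALAR-INT, positive, 1 byte
      out.write(value);
    } else {
      throw new IllegalArgumentException("Type not covered by this sketch: " + any);
    }
  }

  private static void writeString(ByteArrayOutputStream out, String text) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    if (utf8.length > 255) {
      throw new IllegalArgumentException("String too long for this sketch");
    }
    out.write(21);                                  // SCALAR-STRING, 1 length byte
    out.write(utf8.length);
    out.write(utf8, 0, utf8.length);
  }
}
```

Under these assumptions, writing the JSON structure above produces the rows of the table plus one additional SEQUENCE-START and SEQUENCE-END pair around the inner sequences of each entry, followed by the corresponding rows for "text".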

Another example with attachments: This could be some input record generated by a crawler (e.g. a mail crawler). The JSON representation is:

{
  "subject": "a test mail",
  "_attachments" : ["pdfFile", "zipFile"]
}

Note that "_attachments" is not a regular metadata field but contains the names of the attachments. Also note that the JSON representation does not contain the attachments themselves; it is shown here for documentation purposes only.

Value (decimal)	Info	Comment
25	OBJECT-START	"---" (here: without Type:version)
31	MAPPING-START
21	SCALAR-STRING	string with one byte length info
7		length info for the string
subject		the string content
21	SCALAR-STRING	string with one byte length info
11		length info for "a test mail"
a test mail		the string content
32	MAPPING-END
33	ATTACHMENTS-START
21	SCALAR-STRING	string with one byte length info
7		length info for the string
pdfFile		the string content
36	BINARY	binary with 2 byte length info
12345		length info for the binary content
03x0815 ....		the binary content
21	SCALAR-STRING	string with one byte length info
7		length info for the string
zipFile		the string content
37	BINARY	binary with 3 byte length info
98765		length info for the binary content
08x4711 ....		the binary content
34	ATTACHMENTS-END
28	OBJECT-END

Record Filters

Record filters produce reduced copies of a record: A record filter has a name and contains a list of metadata element names. When applied to a record, it produces a copy of the record that contains only the elements of the list.

Record filters are described in a simple XML format:

<RecordFilters>
  <Filter name="filter0" />
  <Filter name="filter1">
    <Element name="attribute" />
  </Filter>
  <Filter name="filter3">
    <Element name="attribute1" />
    <Element name="attribute2" />
    <Element name="attribute3" />
  </Filter>
  <Filter name="filter-all">
    <Element name="*"  />
  </Filter>
</RecordFilters>

Notes:

A filter always copies the system elements "_recordid" and "_source". Therefore, the apparently empty "filter0" in this definition produces records that still contain these system elements.
A filter may contain an arbitrary number of element names. An element that does not appear in the record to copy is simply ignored.
A filter always removes attachments: The "filter-all" in this definition produces a copy of the record with all metadata elements, but not attachments.
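The filter semantics described in these notes can be sketched on a plain metadata map. This is not the code from org.eclipse.smila.datamodel.filter: the class RecordFilterSketch is hypothetical, it works on java.util maps instead of SMILA records, and it makes only a shallow copy. Attachments are not part of the map, so they are dropped implicitly.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: applies a list of element names to a metadata map. */
final class RecordFilterSketch {
  /** System elements that every filter copies. */
  private static final List<String> SYSTEM_ELEMENTS = Arrays.asList("_recordid", "_source");

  /** Returns a reduced shallow copy containing system elements and the listed elements. */
  static Map<String, Object> apply(List<String> elementNames, Map<String, Object> metadata) {
    boolean all = elementNames.contains("*");            // "*" selects every element
    Map<String, Object> copy = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : metadata.entrySet()) {
      String name = entry.getKey();
      if (all || SYSTEM_ELEMENTS.contains(name) || elementNames.contains(name)) {
        copy.put(name, entry.getValue());
      }
    }
    return copy;                                          // names missing in metadata are ignored
  }
}
```

Applying an empty list corresponds to "filter0" and keeps only "_recordid" and "_source"; a list containing "*" corresponds to "filter-all".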

Filters are usually applied by asking the blackboard for a filtered copy of the record's metadata. See Blackboard service API for details. To work with filters directly, see package org.eclipse.smila.datamodel.filter for utility classes.

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
