Difference between revisions of "SMILA/Project Concepts/ID Concept"

Latest revision as of 10:37, 17 June 2009

Description

The purpose of an ID is to identify an object in the system. What is an object in SMILA?

simple case: a single document
what about compounds?
- archive files, e.g. ZIPs
- Big documents that should be indexed by page or by section

SMILA objects have a life cycle
- creation in crawler or agent
- enrichment, splitting, merging (possible?) during processing in SMILA
- persisting in storages (possibly in different states of processing) or indexes (usually at the end, but also possibly multiple times).
- process is repeated when the source object changes (index update) -> new object must have same object ID.
- using the ID it must be possible to refer to the source object.

Discussion

Technical proposal

Definition of concepts:

data source: a single location providing access to a collection of data (web server, file system, database, CMS, ...). Data is read from a data source using crawlers/agents. A data source must have a unique source ID within SMILA so that it can be referred to without having to deal with the technical details of access.

source object: entity in a data source. A crawler/agent can create multiple SMILA objects from a single source object (e.g. by extracting files from a ZIP archive). A source object can be identified with respect to its data source using a relatively simple key (URL, path, primary key, ...)

record: an entity representing a complete source object or a part of a source object to be processed by SMILA.
- Can be split into multiple records.
- Multiple records referring to different parts of the same source object can be merged again? Could be useful to split really large documents, process them section by section and merge the results again.
- Can be written to storages or indexes.
- Can be read from a storage in order to redo the rest of the processing (e.g. to rebuild an index after ontology changes).

Record ID design

A record ID must contain the following parts, and it must be possible to extract them from the ID again:

data source ID
key of source object in data source, relative to the definitions of the data source

These must be provided by the crawler/agent.

Source objects can have multiple key values, e.g. in database tables with a primary key consisting of multiple columns.

During processing, the record ID may be enhanced:

Part specification after splitting a compound
- Element: part of a container, e.g. path in archive (what about recursion: part of part of part...), attachment index in mails, etc. The element is identified by another key which is relative to the container element.
- Fragment: identified by page number, section number, section name, etc.

If merging is supported, multiple records belonging to the same source object can be merged into a single record. The merged ID must reflect this.

Do we want to pack all this into a single ID string (URL, whatever)? All kinds of quoting problems may arise (remember that the source object key could be a complex URL itself already). Thus, we probably want to use a structured ID object. Something like this:

<rec:Record>
	<id:ID>
		<id:Source><!-- String: ID of data source --></id:Source>
		<id:Key><!-- String: key of source object relative to data source --></id:Key>
 
		<!-- the elements above are mandatory, the following is optional -->
 
		<id:Element>
			<id:Key><!-- String: path in archive, attachment index --></id:Key>
			<!-- id:Element can be repeated for recursive archives -->
		</id:Element>
 
		<id:Fragment><!-- page number, section name/number --></id:Fragment>
		<!-- maybe repeated e.g. for books: Part, Chapter, Section, Subsection ... -->
	</id:ID>
 
	<!-- other metadata and non-binary content -->
 
</rec:Record>
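
To make the quoting concern concrete, here is a minimal Java sketch. It assumes a hypothetical flat format that simply joins the ID parts with '/' and does no escaping; the class name FlatIdDemo, the separator, and the file names are illustrative assumptions, not part of the proposal. Two different objects collapse into the same string, while keeping the parts in separate fields keeps them distinguishable.

```java
import java.util.Arrays;
import java.util.List;

final class FlatIdDemo {
    // Hypothetical flat encoding: join all parts with '/' and no escaping.
    static String flatten(String source, String key, String... elements) {
        StringBuilder sb = new StringBuilder(source).append('/').append(key);
        for (String element : elements) {
            sb.append('/').append(element);
        }
        return sb.toString();
    }

    // Structured alternative: every part stays a separate list entry.
    static List<String> structured(String source, String key, String... elements) {
        String[] parts = new String[elements.length + 2];
        parts[0] = source;
        parts[1] = key;
        System.arraycopy(elements, 0, parts, 2, elements.length);
        return Arrays.asList(parts);
    }

    public static void main(String[] args) {
        // Case 1: a plain file whose path happens to contain "old.zip" as a directory name.
        // Case 2: the entry "a.pdf" inside the archive "Archive/old.zip".
        String plainFile = flatten("share", "Archive/old.zip/a.pdf");
        String zipEntry = flatten("share", "Archive/old.zip", "a.pdf");
        System.out.println("flat IDs collide: " + plainFile.equals(zipEntry));

        boolean same = structured("share", "Archive/old.zip/a.pdf")
                .equals(structured("share", "Archive/old.zip", "a.pdf"));
        System.out.println("structured IDs collide: " + same);
    }
}
```

A flat format could of course be rescued with an escaping scheme, but then every component that reads or writes IDs has to apply it consistently, which is the problem the structured form avoids.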

For a source object with multiple key values it must be distinguishable which key value belongs to which key "column". Therefore id:Key can be optionally annotated with a name attribute:

<rec:Record>
	<id:ID>
		<id:Source><!-- String: ID of data source --></id:Source>
		<id:Key name="column1"><!-- key value in named column --></id:Key>
		<id:Key name="column2"><!-- key value in named column --></id:Key>
		...
	</id:ID>
</rec:Record>

Because id:Element uses the id:Key element to identify the element inside a compound, it would be technically possible to support compounds that need multiple key values to identify an element. We cannot think of an actual use case currently, though (-;

In Java:

public interface ID extends Serializable
{
    String getSource();
    Key getKey(); 
 
    List<Key> getElements();
    List<String> getFragments();
 
    ID createElementID(String elementName);
    ID createElementID(Key elementKey);
    ID createFragmentID(String fragmentName);
 
    ID mergeWith(Collection<ID> otherParts);
}

public interface Key extends Serializable
{
    static final String NONAME = "__SMILA:unnamedkey__";
 
    Iterator<String> getKeyNames();
    String getKey(String name);
    String getKey(); // shortcut for getKey(NONAME)
}

public interface IDFactory
{
    ID createID(String source, Key key);
    Key createKey(Map<String, String> keyValues);
 
    // convenience methods:
    ID createID(String source, String key);
    ID createID(String source, Map<String, String> keyValues);
    Key createKey(String key);
}
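
As an illustration of how the Key contract above might be satisfied, the following is a minimal sketch only. The class name SimpleKey is hypothetical, and for brevity the class mirrors the Key methods instead of implementing the interface. It assumes that a single unnamed key value is stored under the NONAME name and that named values are kept in a sorted map.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

final class SimpleKey {
    static final String NONAME = "__SMILA:unnamedkey__";

    // Sorted, read-only copy: iteration order is independent of insertion order.
    private final Map<String, String> values;

    // Key with one or more named values, e.g. the columns of a primary key.
    SimpleKey(Map<String, String> keyValues) {
        this.values = Collections.unmodifiableMap(new TreeMap<String, String>(keyValues));
    }

    // Convenience: a simple key with a single unnamed value.
    SimpleKey(String key) {
        this(Collections.singletonMap(NONAME, key));
    }

    Iterator<String> getKeyNames() { return values.keySet().iterator(); }

    String getKey(String name) { return values.get(name); }

    String getKey() { return getKey(NONAME); } // shortcut for the unnamed value

    @Override public boolean equals(Object o) {
        return o instanceof SimpleKey && values.equals(((SimpleKey) o).values);
    }

    @Override public int hashCode() { return values.hashCode(); }

    public static void main(String[] args) {
        SimpleKey path = new SimpleKey("PDF/big.pdf");
        System.out.println(path.getKey());

        Map<String, String> columns = new TreeMap<String, String>();
        columns.put("x", "0815");
        columns.put("y", "4711");
        SimpleKey row = new SimpleKey(columns);
        System.out.println(row.getKey("x") + " / " + row.getKey("y"));
    }
}
```

Calling getKey() on a key that has only named values returns null in this sketch; whether that or an exception is the desired behaviour is left open by the interface.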

IDs should be usable as hash keys:

IDs are immutable objects
Provide appropriate equals() and hashCode() implementations
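
The two requirements above can be sketched as follows. This is a simplified illustration, not the proposed implementation: the class name SimpleId is hypothetical, the source object key is reduced to a plain String, and elements are left out. All fields are final and set once, and equals() and hashCode() are derived from the same fields, so two IDs built independently for the same source object hit the same hash map entry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class SimpleId {
    private final String source;
    private final String key;
    private final List<String> fragments;

    SimpleId(String source, String key, List<String> fragments) {
        this.source = Objects.requireNonNull(source);
        this.key = Objects.requireNonNull(key);
        // Defensive copy, so later changes to the caller's list cannot alter the ID.
        this.fragments = Collections.unmodifiableList(new ArrayList<String>(fragments));
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SimpleId)) return false;
        SimpleId other = (SimpleId) o;
        return source.equals(other.source)
                && key.equals(other.key)
                && fragments.equals(other.fragments);
    }

    // Must use the same fields as equals().
    @Override public int hashCode() {
        return Objects.hash(source, key, fragments);
    }

    public static void main(String[] args) {
        Map<SimpleId, String> store = new HashMap<SimpleId, String>();
        List<String> noFragments = Collections.<String>emptyList();

        store.put(new SimpleId("share", "PDF/big.pdf", noFragments), "first crawl");
        // A later crawl builds a new ID object for the same source object.
        store.put(new SimpleId("share", "PDF/big.pdf", noFragments), "second crawl");

        System.out.println(store.size()); // 1: the second entry replaced the first
    }
}
```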

Examples

Assume a file system data source named "share", referring to a shared directory on a file server (e.g. "\\fileserv\share"). It looks like this:

\\fileserv\share
    |- PDF
    |   \- big.pdf
    \- Archive
        \- oldstuff.zip
            |- PDF
            |   \- old.pdf
            \- another.zip
                \- another.pdf

"big.pdf" initially gets this ID:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>PDF/big.pdf</id:Key>
</id:ID>

After splitting it by pages, the following ID refers to the first page of the document:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>PDF/big.pdf</id:Key>
    <id:Fragment>0</id:Fragment> <!-- or start counting at 1? -->
</id:ID>

Similarly for the ZIP. It starts as:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>Archive/oldstuff.zip</id:Key>
</id:ID>

When it is expanded, the contained file is referred to as

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>Archive/oldstuff.zip</id:Key>
    <id:Element>
        <id:Key>PDF/old.pdf</id:Key>
    </id:Element>
</id:ID>

which in turn can be split into pages to become:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>Archive/oldstuff.zip</id:Key>
    <id:Element>
        <id:Key>PDF/old.pdf</id:Key>
    </id:Element>
    <id:Fragment>0</id:Fragment>
</id:ID>

And finally, the first page of the PDF in the nested archive another.zip would have this ID:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>Archive/oldstuff.zip</id:Key>
    <id:Element>
        <id:Key>another.zip</id:Key>
        <id:Element>
            <id:Key>another.pdf</id:Key>
        </id:Element>
    </id:Element>
    <id:Fragment>0</id:Fragment>
</id:ID>
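
The derivation steps in the examples above can be sketched in Java as follows. This is a simplified illustration under stated assumptions: the class name PartId is hypothetical, keys are reduced to plain Strings, and the element path is kept as a flat list (outermost container first) in the spirit of getElements() from the ID interface. Each create method returns a new object and leaves the original ID untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PartId {
    final String source;
    final String key;
    final List<String> elements;  // container path, outermost first
    final List<String> fragments;

    PartId(String source, String key) {
        this(source, key, Collections.<String>emptyList(), Collections.<String>emptyList());
    }

    private PartId(String source, String key, List<String> elements, List<String> fragments) {
        this.source = source;
        this.key = key;
        this.elements = Collections.unmodifiableList(new ArrayList<String>(elements));
        this.fragments = Collections.unmodifiableList(new ArrayList<String>(fragments));
    }

    // ID of an element inside the container identified by this ID.
    PartId createElementID(String elementName) {
        List<String> next = new ArrayList<String>(elements);
        next.add(elementName);
        return new PartId(source, key, next, fragments);
    }

    // ID of a fragment (page, section, ...) of the object identified by this ID.
    PartId createFragmentID(String fragmentName) {
        List<String> next = new ArrayList<String>(fragments);
        next.add(fragmentName);
        return new PartId(source, key, elements, next);
    }

    @Override public String toString() {
        return source + " | " + key + " | elements=" + elements + " | fragments=" + fragments;
    }

    public static void main(String[] args) {
        PartId zip = new PartId("share", "Archive/oldstuff.zip");
        PartId firstPage = zip.createElementID("another.zip")
                .createElementID("another.pdf")
                .createFragmentID("0");
        System.out.println(firstPage);
        System.out.println(zip); // the archive's own ID is unchanged
    }
}
```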

Similarly, for a mail server as a data source "mail" we could have the following ID to refer to an attachment of a mail in folder INBOX. In this case, the element key is the index of the MIME message part in the message.

<id:ID>
    <id:Source>mail</id:Source>
    <id:Key>INBOX/42</id:Key>
    <id:Element>
      <id:Key>2</id:Key>
    </id:Element>
</id:ID>

A row in a database table with a primary key consisting of columns x and y would be identified like this:

<id:ID>
    <id:Source>db</id:Source>
    <id:Key name="x">0815</id:Key>
    <id:Key name="y">4711</id:Key>
</id:ID>

