Difference between revisions of "SMILA/Documentation/Importing/DeltaCheck"

From Eclipsepedia


Latest revision as of 09:59, 20 February 2012

Workers for Importing: Delta Check

Delta Checking is about determining if a record has changed since the last import run and needs to be sent to the processing job again, e.g. to update the index.

Worker Description

  • Worker name: deltaChecker
  • Parameters:
    • deltaImportStrategy: configures usage of DeltaService. It has four possible values that select one of two behaviours for this worker (see DeltaDelete for an overview):
      • disabled or initial: DeltaService is not used at all by this worker; the input records are written to the output unchanged. The worker could be removed from the workflow completely in this case, but for convenience it may remain in the workflow if performance is not critical, although it does no useful work in these modes.
      • additive or full: Perform normal operation: check state of input records, don't write unchanged records to output.
  • Input Slot: recordsToCheck, a recordBulks bucket
  • Output Slots:
    • updatedRecords, a recordBulks bucket. Output can be empty if no record needs an update.
    • updatedCompounds, an optional recordBulks bucket. If connected, compound objects (records with attribute _isCompound set to true) are not written to updatedRecords but to this slot instead. Output can be empty if no changed or new compound objects were crawled.

The worker calls the DeltaService for each incoming record. The job run ID is taken from a task property, while the source ID, record ID and hash code are taken from the record itself. The hash code is expected in the attribute "_deltaHash", which holds a single value. The possible cases are:

  • DeltaService reports record as UPTODATE: record is not added to the output.
  • DeltaService reports record as NEW: record is added to the output.
  • DeltaService reports record as CHANGED: attribute "_update" is set to true and the record is added to the output.
  • "_deltaHash" not set: DeltaService is not called and the record is added to the output.
  • Error in DeltaService: record is not written to output.
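
The cases above, together with the deltaImportStrategy parameter, can be summarised in a short sketch. This is only an illustration of the decision logic, not the worker's actual implementation: the function name check_record, the dict-based record and the delta_status callable (a stand-in for the DeltaService call) are all assumptions made for this example.

```python
from typing import Callable, Optional


def check_record(record: dict,
                 delta_status: Callable[[dict], str],
                 strategy: str = "full") -> Optional[dict]:
    """Return the record to write to the output, or None to drop it.

    `delta_status` is a hypothetical stand-in for the DeltaService call.
    It is assumed to return "UPTODATE", "NEW" or "CHANGED" and to raise
    an exception if the service fails.
    """
    # disabled / initial: DeltaService is not used, records pass through.
    if strategy in ("disabled", "initial"):
        return record
    # No hash available: DeltaService is not called, record is passed on.
    if "_deltaHash" not in record:
        return record
    try:
        status = delta_status(record)
    except Exception:
        # Error in DeltaService: record is not written to the output.
        return None
    if status == "UPTODATE":
        return None
    if status == "CHANGED":
        # Return a marked copy so the input record is left untouched.
        return {**record, "_update": True}
    if status == "NEW":
        return record
    raise ValueError(f"unexpected delta status: {status!r}")
```

Passing the service call in as a parameter keeps the sketch independent of any real DeltaService interface, which is not described on this page.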

ObjectStoreDeltaService

The DeltaCheck worker makes use of a DeltaService to check and update the state of a record. The bundle org.eclipse.smila.importing.state.objectstore provides an implementation of this service putting those state entries in the ObjectStore (and hence as separate files in a filesystem, if the filesystem implementation of objectstore is used), which should work well enough for a limited number of records per source.

The keys of the entries are created from the source ID, a '/' character and an entry key created from a digest calculated from the record ID. A small configuration file allows customizing this entry key, which may be necessary to manage a greater number of documents or to make use of advanced features of more sophisticated ObjectStore implementations.

The service uses the store deltaservice.

Entry key calculation and configuration

Entries are stored in different "shards". This "sharding" is necessary to make it possible to parallelize the checking for deleted records after the import run has finished: for each shard, one task is generated to find the entries in this shard that have not been visited in this run. The shard part of the entry key is determined by taking the first shard.length characters of the record ID digest. The longer this shard part is, the more shards can be created and the more the delete check can be parallelized. The default shard.length is 2 (which yields 256 shards, because the digest is a hexadecimal number).
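
The relation between shard.length and the number of shards can be illustrated with a small sketch. The digest algorithm used by the service is not stated here, so SHA-256 from Python's hashlib serves only as a stand-in that produces a hexadecimal digest; the function names are likewise made up for this example.

```python
import hashlib


def shard_of(record_id: str, shard_length: int = 2) -> str:
    """Return the shard part: the first `shard_length` hex digest characters.

    SHA-256 is an assumption for illustration only; the real service may
    use a different digest.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    return digest[:shard_length]


def shard_count(shard_length: int = 2) -> int:
    """Each hexadecimal character has 16 possible values."""
    return 16 ** shard_length
```

With the default shard.length of 2, shard_count() gives 16 * 16 = 256, matching the figure above; a length of 3 would allow up to 4096 shards and therefore up to 4096 parallel delete-check tasks.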

The rest of the digest can be "segmented", i.e. additional '/' can be added so that not all entries of a shard are stored in a single directory. By default, 1 additional '/' is added after the second character of the digest.

The configuration file is org.eclipse.smila.importing.state.objectstore/deltastore.properties:

# Object ID configuration for delta entries in object store.
shard.length=2
segment.count=2
segment.length=1
 
# first argument: shard (first characters of record ID digest)
# second argument: segmented record ID digest
key.pattern=%s/%s

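See the test case TestStateStoreConfiguration.java (https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/trunk/core/org.eclipse.smila.importing.state.objectstore.test/code/src/org/eclipse/smila/importing/state/objectstore/test/TestStateStoreConfiguration.java) for examples of the effects of these settings.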

DeltaService REST API

Currently there is only a simple REST API for DeltaService. It shows which sources currently have entries and deletes the entries of a single source or of all sources.

Show active sources

  • URL: /smila/importing/delta
  • Method: GET
  • Response Code: 200 OK, if successful
  • Response JSON:
{"sources": [
  {
    "id": "web",
    "url": "http://localhost:8080/smila/importing/delta/web"
  },
  {
    "id": "file",
    "url": "http://localhost:8080/smila/importing/delta/file"
  }
]}
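
A client could read this response roughly as follows. This is a minimal sketch using only the Python standard library; the host and port are taken from the sample response above and will differ per installation, and the function names are made up for this example.

```python
import json
from urllib.request import Request, urlopen

# Host and port as in the sample response; adjust for a real installation.
BASE_URL = "http://localhost:8080/smila/importing/delta"


def parse_sources(body: str) -> dict:
    """Map each source ID to its URL, given the response JSON as text."""
    return {source["id"]: source["url"]
            for source in json.loads(body)["sources"]}


def fetch_sources(base_url: str = BASE_URL) -> dict:
    """GET the list of sources that currently have entries."""
    with urlopen(Request(base_url, method="GET")) as response:
        return parse_sources(response.read().decode("utf-8"))
```

Parsing is kept separate from the HTTP call so that it can be tried against a saved response without a running server.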

Clear all sources

  • URL: /smila/importing/delta
  • Method: DELETE
  • Response Code: 200 OK, if successful
  • Response JSON: none

Get info about sources

  • URL: /smila/importing/delta/<sourcename>
  • Method: GET
  • Response Code:
    • 200 OK, if successful,
    • 404 NOT FOUND, if source does not have entries currently.
  • Response JSON:

Contains the ID of the source and the number of entries. If there are more than 10000 entries, the number is only estimated because exact counting could take a long time. To force an exact count, add ?countExact=true to the request URL.

{
  "id": "web",
  "count": "123456"
}
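
The request and its two documented outcomes could be handled as sketched below. The base URL again assumes the host and port of the sample responses, and the function names are hypothetical. Note that the sample response carries the count as a JSON string, so the sketch converts it to an integer.

```python
import json
from typing import Optional, Tuple
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Assumed host and port; adjust for a real installation.
BASE_URL = "http://localhost:8080/smila/importing/delta"


def source_info_url(source: str, exact: bool = False,
                    base_url: str = BASE_URL) -> str:
    """Build the request URL, optionally forcing an exact count."""
    url = f"{base_url}/{quote(source, safe='')}"
    return f"{url}?countExact=true" if exact else url


def parse_source_info(body: str) -> Tuple[str, int]:
    """Return (source ID, entry count) from the response JSON text."""
    info = json.loads(body)
    return info["id"], int(info["count"])


def fetch_source_info(source: str,
                      exact: bool = False) -> Optional[Tuple[str, int]]:
    """Return the source info, or None if the source has no entries (404)."""
    try:
        with urlopen(source_info_url(source, exact)) as response:
            return parse_source_info(response.read().decode("utf-8"))
    except HTTPError as error:
        if error.code == 404:
            return None
        raise
```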

Clear a single source

  • URL: /smila/importing/delta/<sourcename>
  • Method: DELETE
  • Response Code: 200 OK, if successful
  • Response JSON: none
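
Both DELETE operations differ only in the URL, so one helper can cover them. The following is a sketch under the same assumptions as the sample responses (host localhost, port 8080); the function names are made up for this example.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed host and port; adjust for a real installation.
BASE_URL = "http://localhost:8080/smila/importing/delta"


def build_clear_request(source: Optional[str] = None,
                        base_url: str = BASE_URL) -> Request:
    """Build a DELETE request for one source, or for all if source is None."""
    if source is None:
        url = base_url
    else:
        url = f"{base_url}/{quote(source, safe='')}"
    return Request(url, method="DELETE")


def clear(source: Optional[str] = None) -> int:
    """Send the DELETE request and return the HTTP status code."""
    with urlopen(build_clear_request(source)) as response:
        return response.status
```

Because calling clear() without an argument removes the entries of every source, a caller may prefer to require the source name explicitly and offer the clear-all variant under a separate name.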