
Difference between revisions of "SMILA/Documentation/Importing/DeltaCheck"

Revision as of 03:51, 25 November 2011

Workers for Importing: Delta Check

Delta Checking is about determining if a record has changed since the last import run and needs to be sent to the processing job again, e.g. to update the index.

Worker Description

TODO

ObjectStoreDeltaService

The DeltaCheck worker uses a DeltaService to check and update the state of a record. The first implementation of this service stores these state entries in the ObjectStore (and hence as separate files in a filesystem, if the filesystem implementation of the ObjectStore is used), which should work well enough for a limited number of records per source.

The key of each entry consists of the source ID, a '/' character, and an entry key derived from a digest of the record ID. A small configuration file lets you customize this entry key, which may be necessary to manage a larger number of documents or to make use of advanced features of more sophisticated ObjectStore implementations.

Entry key calculation and configuration

Entries are stored in different "shards". This "sharding" is necessary to parallelize the check for deleted records after the import run has finished: for each shard, one task is generated to find the entries in that shard that have not been visited in this run. The shard part of the entry key is determined by taking the first shard.length characters of the record ID digest. The longer this shard part is, the more shards can be created and the more the delete check can be parallelized. The default shard.length is 2, which yields 256 shards (16^2), because the digest is a hexadecimal number.
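The shard selection described above can be sketched as follows. This is only an illustration, not SMILA's implementation: the function names are hypothetical, and SHA-256 is an assumed stand-in because the digest algorithm is not named on this page.

```python
import hashlib


def record_digest(record_id: str) -> str:
    # Assumption: SHA-256 as the digest; the algorithm SMILA uses is not stated here.
    return hashlib.sha256(record_id.encode("utf-8")).hexdigest()


def shard_of(record_id: str, shard_length: int = 2) -> str:
    # The shard is the first shard_length characters of the hexadecimal digest.
    return record_digest(record_id)[:shard_length]


def shard_count(shard_length: int = 2) -> int:
    # Each hexadecimal character can take 16 values.
    return 16 ** shard_length


print(shard_of("file:///data/report.pdf"))  # two hex characters
print(shard_count(2))  # 256
print(shard_count(3))  # 4096
```

Raising shard.length by one multiplies the number of possible shards, and therefore the number of delete-check tasks that can run in parallel, by 16.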

The rest of the digest can be "segmented", i.e. additional '/' can be added so that not all entries of a shard are stored in a single directory. By default, 1 additional '/' is added after the second character of the digest.

The configuration file is org.eclipse.smila.importing.delta.objectstore/deltastore.properties:

# Object ID configuration for delta entries in object store.
shard.length=2
segment.count=2
segment.length=1
 
# first argument: shard (first characters of record ID digest)
# second argument: segmented record ID digest
key.pattern=%s/%s
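
As a rough sketch of how these settings could combine into a full object key (source ID, '/', entry key), the following reads the properties and applies key.pattern. The helper names are hypothetical, the segmented digest is passed in as an opaque placeholder because the exact segmentation rule is not reproduced here, and Python's % operator stands in for whatever formatting the service actually uses.

```python
CONFIG = """\
# Object ID configuration for delta entries in object store.
shard.length=2
segment.count=2
segment.length=1
key.pattern=%s/%s
"""


def parse_properties(text: str) -> dict:
    # Minimal parser for key=value lines; comments and blank lines are skipped.
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def object_key(source_id: str, shard: str, segmented_digest: str, props: dict) -> str:
    # First pattern argument: shard; second: segmented record ID digest.
    entry_key = props["key.pattern"] % (shard, segmented_digest)
    # Full key: source ID, a '/' character, then the entry key.
    return source_id + "/" + entry_key


props = parse_properties(CONFIG)
print(object_key("my-source", "ab", "SEGMENTED_DIGEST", props))
# prints: my-source/ab/SEGMENTED_DIGEST
```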

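See the test case TestDeltaStoreConfiguration.java (https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/trunk/core/org.eclipse.smila.importing.delta.objectstore.test/code/src/org.eclipse.smila.importing.delta.objectstore.test/TestDeltaStoreConfiguration.java) for examples of the effects of these settings.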