SMILA/Documentation/DeltaIndexingManager

Overview

The Delta Indexing Manager stores information about the last modification of each record (compound elements will be added soon) and can determine whether a record has changed since it was last processed. This decision is based on a hash value provided by a crawler. How the hash is computed depends on the crawler and its configuration; for example, the filesystem crawler usually computes the hash from the last modification date of files. The manager provides functionality to manage this information, to determine whether documents have changed, to mark documents that have not changed (visited flag), and to determine documents that are indexed but no longer exist in the data source.
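The decision logic described above can be illustrated with a small in-memory sketch. It is not SMILA code: the class `DeltaIndexSketch`, its method `startRun()`, and the use of plain strings in place of SMILA `Id` objects are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory sketch of the delta indexing decision; not SMILA code. */
class DeltaIndexSketch {
  // Last known hash per record id (plain strings stand in for SMILA Id objects).
  private final Map<String, String> hashes = new HashMap<>();
  // Ids seen during the current crawl run (the "visited" flag).
  private final Set<String> visited = new HashSet<>();

  /** Starts a new crawl run by resetting all visited flags. */
  void startRun() {
    visited.clear();
  }

  /** Returns true if the record is new or its hash differs from the stored one. */
  boolean checkForUpdate(String id, String hash) {
    boolean changed = !hash.equals(hashes.get(id));
    if (!changed) {
      visited.add(id); // unchanged records only need the visited flag
    }
    return changed;
  }

  /** Stores the new hash and marks the record as visited. */
  void visit(String id, String hash) {
    hashes.put(id, hash);
    visited.add(id);
  }

  /** Ids that are stored but were not visited in this run, i.e. gone from the source. */
  List<String> obsoleteIds() {
    List<String> obsolete = new ArrayList<>();
    for (String id : hashes.keySet()) {
      if (!visited.contains(id)) {
        obsolete.add(id);
      }
    }
    return obsolete;
  }
}
```

In this sketch a crawler would pass, for example, a file's last modification timestamp as the hash string. A record that no longer appears in the data source is never visited and therefore shows up in `obsoleteIds()`.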

API

  /**
   * Initializes the internal state for an import of a dataSourceID and establishes a lock so that the same
   * dataSourceID is not initialized multiple times concurrently.
   * 
   * @param dataSourceID
   *          dataSourceID
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void init(String dataSourceID) throws DeltaIndexingException;
 
  /**
   * Checks whether the hash of the given id is new or has changed (true) or not (false).
   * 
   * To reduce method calls, the entry is marked as visited when the return value is false.
   * 
   * @param id
   *          the id
   * @param hash
   *          the hash
   * 
   * @return true if the hash is new or has changed, false otherwise
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  boolean checkForUpdate(Id id, String hash) throws DeltaIndexingException;
 
  /**
   * updates the hash and marks this id as visited.
   * 
   * @param id
   *          the id
   * @param hash
   *          the hash
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void visit(Id id, String hash) throws DeltaIndexingException;
 
  /**
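   * Returns an iterator over the obsolete ids of a data source, i.e. the ids of records that are indexed but no
   * longer exist in the data source.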
   * 
   * @param dataSourceID
   *          the data source id
   * 
   * @return an iterator over the obsolete ids
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  Iterator<Id> obsoleteIdIterator(String dataSourceID) throws DeltaIndexingException;
 
  /**
   * Returns an iterator over the obsolete ids of the fragments of the given id.
   * 
   * @param id
   *          the id
   * 
   * @return an iterator over the obsolete ids
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  Iterator<Id> obsoleteIdIterator(Id id) throws DeltaIndexingException;
 
  /**
   * Clears the delta indexing information of all data sources.
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void clear() throws DeltaIndexingException;
 
  /**
   * Clears the delta indexing information of the given data source.
   * 
   * @param dataSourceID
   *          the data source id
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void clear(String dataSourceID) throws DeltaIndexingException;
 
  /**
   * Rolls back the changes made between init() and finish(). It should be called before finish().
   * 
   * @param dataSourceID
   *          the data source id
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void rollback(String dataSourceID) throws DeltaIndexingException;
 
  /**
   * Deletes the delta indexing entry of the given id.
   * 
   * @param id
   *          the id
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void delete(Id id) throws DeltaIndexingException;
 
  /**
   * removes the lock.
   * 
   * @param dataSourceID
   *          the data source id
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void finish(String dataSourceID) throws DeltaIndexingException;
 
  /**
   * Unlock all datasources.
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void unlockDatasources() throws DeltaIndexingException;
 
  /**
   * Checks whether the given data source is known to the delta indexing manager.
   * 
   * @param dataSourceId
   *          the data source id
   * 
   * @return true if the data source exists, false otherwise
   */
  boolean exists(String dataSourceId);
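The following sketch shows one plausible call order for a single import run using the methods listed above. It is an illustration under stated assumptions, not the SMILA connectivity code: the nested `Manager` interface is a reduced stand-in for `DeltaIndexingManager` that uses `String` ids instead of `Id` and unchecked exceptions instead of `DeltaIndexingException`, and `runImport` and `RecordingManager` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class ImportRunSketch {

  /** Reduced stand-in for DeltaIndexingManager (String ids, unchecked exceptions). */
  interface Manager {
    void init(String dataSourceID);
    boolean checkForUpdate(String id, String hash);
    void visit(String id, String hash);
    Iterator<String> obsoleteIdIterator(String dataSourceID);
    void delete(String id);
    void rollback(String dataSourceID);
    void finish(String dataSourceID);
  }

  /** Runs one import over (id, hash) pairs and returns the ids that need (re)indexing. */
  static List<String> runImport(Manager manager, String dataSourceID, Map<String, String> crawled) {
    List<String> toIndex = new ArrayList<>();
    manager.init(dataSourceID); // establishes the lock for this data source
    try {
      for (Map.Entry<String, String> record : crawled.entrySet()) {
        if (manager.checkForUpdate(record.getKey(), record.getValue())) {
          toIndex.add(record.getKey());
          manager.visit(record.getKey(), record.getValue()); // store new hash, mark visited
        }
      }
      // Collect the obsolete ids first, then delete them, so the store is not
      // modified while it is being iterated.
      List<String> obsolete = new ArrayList<>();
      manager.obsoleteIdIterator(dataSourceID).forEachRemaining(obsolete::add);
      for (String id : obsolete) {
        manager.delete(id);
      }
    } catch (RuntimeException e) {
      manager.rollback(dataSourceID); // undo changes before finish() is called
      throw e;
    } finally {
      manager.finish(dataSourceID); // removes the lock
    }
    return toIndex;
  }

  /** Scripted stub used only to exercise the sketch; it records the call order. */
  static class RecordingManager implements Manager {
    final List<String> calls = new ArrayList<>();
    final List<String> obsolete = new ArrayList<>();
    boolean failOnVisit;

    public void init(String dataSourceID) { calls.add("init"); }
    public boolean checkForUpdate(String id, String hash) { return true; }
    public void visit(String id, String hash) {
      if (failOnVisit) {
        throw new IllegalStateException("simulated failure");
      }
      calls.add("visit " + id);
    }
    public Iterator<String> obsoleteIdIterator(String dataSourceID) { return obsolete.iterator(); }
    public void delete(String id) { calls.add("delete " + id); }
    public void rollback(String dataSourceID) { calls.add("rollback"); }
    public void finish(String dataSourceID) { calls.add("finish"); }
  }
}
```

The `try`/`catch`/`finally` structure reflects the documented constraint that `rollback()` should be called before `finish()`, and it ensures the lock is released even when the run fails.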

Implementations

It is possible to provide different implementations for the DeltaIndexingManager interface. Below is a list of the currently available implementations.

org.eclipse.smila.connectivity.deltaindexing.impl

The default implementation stores the delta indexing information in memory. When the DeltaIndexingManager is stopped, the current state is written to files; when it is started, the state is read back from them. The files are located at

workspace\.metadata\.plugins\org.eclipse.smila.connectivity.deltaindexing

These files are named according to the dataSourceId. This implementation is only useful during development, because the in-memory storage will lead to out-of-memory errors when used with a high data load.

Configuration

There are no configuration options available for this bundle.

org.eclipse.smila.connectivity.deltaindexing.jpa.impl

This implementation uses EclipseLink JPA to store the delta indexing information in an Apache Derby database. The data is stored in the two tables DATA_SOURCES and DELTA_INDEXING:

DATA_SOURCES
Column	Type	Description
SOURCE_ID	VARCHAR	the data source Id
LOCKED	BOOLEAN	a flag if this data source was locked
LOCKED_BY	VARCHAR	the id of the thread that locked this data source

DELTA_INDEXING
Column	Type	Description
ID_HASH	VARCHAR	a hashed value of the Id object of the record
HASH	VARCHAR	the delta indexing hash value
SOURCE_ID	VARCHAR	the data source Id
VISITED	BOOLEAN	flag if this entry was already visited
ID	BLOB	the serialized Id object. This is needed to reconstruct the Id objects for method obsoleteIdIterator()
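The ID column exists because a hash such as ID_HASH cannot be turned back into the original Id. The sketch below illustrates the round trip from an id object to a byte array and back. It assumes plain Java serialization, which the text above does not confirm for SMILA, and `SimpleId`, `toBlob` and `fromBlob` are hypothetical names, not SMILA classes or methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class IdBlobSketch {

  /** Hypothetical stand-in for SMILA's Id class. */
  static class SimpleId implements Serializable {
    private static final long serialVersionUID = 1L;
    final String source;
    final String key;

    SimpleId(String source, String key) {
      this.source = source;
      this.key = key;
    }
  }

  /** Serializes the id into bytes that could be stored in a BLOB column. */
  static byte[] toBlob(SimpleId id) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(id);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reconstructs the id from the stored bytes, e.g. when listing obsolete ids. */
  static SimpleId fromBlob(byte[] blob) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
      return (SimpleId) in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```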

Configuration

The only configuration needed is a typical EclipseLink configuration property file. In it you can specify settings for logging and for the database connection. For more information, refer to the EclipseLink documentation. The configuration is located at configuration/org.eclipse.smila.connectivity.deltaindexing.jpa.impl/persistence.properties.

# EclipseLink properties
eclipselink.logging.level=INFO
eclipselink.target-server=None
eclipselink.target-database=org.eclipse.persistence.platform.database.DerbyPlatform
eclipselink.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
eclipselink.jdbc.url=jdbc:derby:workspace/.metadata/.plugins/org.eclipse.smila.connectivity.deltaindexing.jpa.impl/deltaindexingstorage;create=true
eclipselink.jdbc.password=smila
eclipselink.jdbc.user=smila
eclipselink.ddl-generation=drop-and-create-tables

After SMILA has been started for the first time, the DDL generation setting prints warnings complaining that it cannot create some tables. These warnings are not critical. You can get rid of them by setting eclipselink.ddl-generation=none, but only after SMILA has been started at least once (and the tables have been created).

Limitations

At the moment it is necessary to import all packages containing JDBC driver classes in org.eclipse.smila.connectivity.deltaindexing.jpa.impl. To change from Derby to another database, it is therefore not sufficient to change the configuration in persistence.properties; you also have to add Import-Package statements for the JDBC driver to the bundle's manifest. This will hopefully change with the next release of EclipseLink.

Notice: This Wiki is now read only and edits are no longer possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
