Difference between revisions of "SMILA/Documentation/DeltaIndexingManager"

From Eclipsepedia

m (Implementations: made section on how to select imp cleare hopefully)

Revision as of 06:34, 17 February 2010


Overview

The DeltaIndexingManager stores information about the last modification of each record and can determine whether a record has changed since it was last processed. This decision is based on a hash value provided by the crawler. How this hash is computed depends on the crawler and its configuration; for example, the filesystem crawler usually computes it from the file's last modification date. The DeltaIndexingManager provides functionality to manage this information, to determine whether already processed documents have changed, to mark documents that have not changed (visited flag), and to determine documents that are indexed but no longer exist in the data source.
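
The decision logic described above can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions, not the SMILA implementation: the class name DeltaIndexingSketch, the use of plain strings as record ids, and the startRun() helper are all hypothetical, and sessions and locking are left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory model of the delta indexing decision; NOT the SMILA implementation. */
final class DeltaIndexingSketch {
  /** Hash stored at the last processing, keyed by a (hypothetical) string record id. */
  private final Map<String, String> storedHashes = new HashMap<>();
  /** Ids that were seen during the current run. */
  private final Set<String> visited = new HashSet<>();

  /** Starts a new run: forgets which ids were visited but keeps the stored hashes. */
  void startRun() {
    visited.clear();
  }

  /** Returns true if the record is new or its hash changed; unchanged records are marked visited. */
  boolean checkForUpdate(String id, String hash) {
    if (hash.equals(storedHashes.get(id))) {
      visited.add(id);
      return false;
    }
    return true;
  }

  /** Stores the new hash and marks the record as visited. */
  void visit(String id, String hash) {
    storedHashes.put(id, hash);
    visited.add(id);
  }

  /** Ids that are stored but were not visited in this run, i.e. candidates for deletion. */
  Set<String> obsoleteIds() {
    Set<String> obsolete = new TreeSet<>(storedHashes.keySet());
    obsolete.removeAll(visited);
    return obsolete;
  }
}
```

In this model a crawler run would call checkForUpdate for every record it finds, call visit only for records that were reported as new or changed, and finally treat everything returned by obsoleteIds() as no longer present in the data source.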

Before you can use delta indexing, you have to create a working session with the DeltaIndexingManager by calling init(final String dataSourceID). This generates a new session, locks the given data source (if it is not already locked by another session), and returns the session ID. This session ID has to be used for all subsequent calls to the DeltaIndexingManager. Calling finish(final String sessionId) releases the lock and destroys the session.
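
The init/finish handshake can be pictured with the following minimal sketch. It is a stand-in written for illustration only: SessionLockSketch, isLocked() and the use of IllegalStateException and IllegalArgumentException are assumptions, whereas the real interface reports problems through DeltaIndexingException and DeltaIndexingSessionException.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Toy model of the session/lock handshake; names and exception types are stand-ins. */
final class SessionLockSketch {
  /** Maps a locked data source id to the id of the session holding the lock. */
  private final Map<String, String> lockedBy = new HashMap<>();

  /** Locks the data source and returns a new session id; fails if it is already locked. */
  synchronized String init(String dataSourceId) {
    if (lockedBy.containsKey(dataSourceId)) {
      throw new IllegalStateException("data source already locked: " + dataSourceId);
    }
    String sessionId = UUID.randomUUID().toString();
    lockedBy.put(dataSourceId, sessionId);
    return sessionId;
  }

  /** Releases the lock held by the session and destroys the session. */
  synchronized void finish(String sessionId) {
    if (!lockedBy.values().remove(sessionId)) {
      throw new IllegalArgumentException("invalid session id: " + sessionId);
    }
  }

  /** Reports whether the data source is currently locked by some session. */
  synchronized boolean isLocked(String dataSourceId) {
    return lockedBy.containsKey(dataSourceId);
  }

  public static void main(String[] args) {
    SessionLockSketch manager = new SessionLockSketch();
    String sessionId = manager.init("filesystem");
    try {
      // The checkForUpdate / visit calls of a crawler run would go here.
    } finally {
      manager.finish(sessionId); // always release the lock, even on failure
    }
    System.out.println("locked after finish: " + manager.isLocked("filesystem"));
  }
}
```

Wrapping the work in try/finally, as in main above, is one way to make sure a failed run does not leave the data source locked.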


API

For the current definition of the interface in trunk, see ViewVC or SVN.

public interface DeltaIndexingManager {
 
  /**
   * Initializes the internal state for an import of a dataSourceID and creates a session wherein it establishes a lock
   * to prevent the same dataSourceID from being initialized multiple times concurrently. It returns a unique Id for the
   * session that a client has to use to gain access to the locked data source.
   * 
   * @param dataSourceID
   *          dataSourceID
   * @return a String containing the sessionId
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  String init(final String dataSourceID) throws DeltaIndexingException;
 
  /**
   * Checks if the hash of the current id is new or has changed (true) or not (false).
   * 
   * To reduce method calls, the entry is marked as visited if the return value is false.
   * 
   * @param sessionId
   *          the id of the delta indexing session
   * @param id
   *          the id
   * @param hash
   *          the hash
   * 
   * @return true if the record is new or its hash has changed, false otherwise
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  boolean checkForUpdate(final String sessionId, final Id id, final String hash)
    throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Creates or updates the delta indexing entry. This is the method that makes the record known to delta indexing
   * (DI). It sets the hash, the isCompound flag and marks this id as visited.
   * 
   * @param sessionId
   *          the id of the delta indexing session
   * @param id
   *          the id
   * @param hash
   *          the hash
   * @param isCompound
   *          boolean flag if the record identified by id is a compound record (true) or not (false)
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void visit(final String sessionId, final Id id, final String hash, final boolean isCompound)
    throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Obsolete id iterator.
   * 
   * @param sessionId
   *          the id of the delta indexing session
   * @param dataSourceID
   *          the data source id
   * 
   * @return an iterator over the obsolete ids
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  Iterator<Id> obsoleteIdIterator(final String sessionId, final String dataSourceID)
    throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Obsolete id iterator for id fragments of compounds.
   * 
   * @param sessionId
   *          the id of the delta indexing session
   * @param id
   *          the id
   * 
   * @return an iterator over the obsolete ids
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  Iterator<Id> obsoleteIdIterator(final String sessionId, final Id id) throws DeltaIndexingSessionException,
    DeltaIndexingException;
 
  /**
   * Clears all entries of the data source locked by the given session. To call clear you first have to initialize a
   * session by calling init(). This prevents clearing a data source that is locked by another session.
   * 
   * @param sessionId
   *          the id of the delta indexing session
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void clear(final String sessionId) throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Rolls back changes that were made in the current session between init() and finish(). It should be called before
   * the session is finished.
   * 
   * @param sessionId
   *          the id of the delta indexing session
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void rollback(final String sessionId) throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Delete.
   * 
   * @param sessionId
   *          the id of the delta indexing session
   * @param id
   *          the id
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void delete(final String sessionId, final Id id) throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Finish this delta indexing session and remove the lock.
   * 
   * @param sessionId
   *          the id of the delta indexing session
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void finish(final String sessionId) throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /* methods that don't need a session */
 
  /**
   * Clears all entries of the DeltaIndexingManager including any active sessions! Note that this may cause exceptions in
   * clients using any of the closed sessions.
   * 
   * @admin this is an administrative management function to be called manually and not part of the normal workflow.
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void clear() throws DeltaIndexingException;
 
  /**
   * Unlocks the given data source and removes its sessions.
   * 
   * @admin this is an administrative management function to be called manually and not part of the normal workflow.
   * 
   * @param dataSourceID
   *          the data source id
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void unlockDatasource(final String dataSourceID) throws DeltaIndexingException;
 
  /**
   * Unlocks all data sources and removes all sessions.
   * 
   * @admin this is an administrative management function to be called manually and not part of the normal workflow.
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void unlockDatasources() throws DeltaIndexingException;
 
  /**
   * Gets an overview of which data sources are locked or unlocked.
   * 
   * @return a map containing the dataSourceId and the LockState
   */
  Map<String, LockState> getLockStates();
 
  /**
   * Checks if the entries for the given dataSourceId exist.
   * 
   * @param dataSourceId
   *          the data source id
   * 
   * @return true if entries exist for the given dataSourceId
   */
  boolean exists(final String dataSourceId);
 
  /**
   * Get the number of delta indexing entries for the given dataSourceID.
   * 
   * @param dataSourceID
   *          the data source id
   * @return the number of entries
   */
  long getEntryCount(final String dataSourceID);
 
  /**
   * Get the number of delta indexing entries for all data sources.
   * 
   * @return a map of dataSourceIds and the entry counts
   */
  Map<String, Long> getEntryCounts();
 
  /**
   * An enumeration defining the lock states of a data source in the DeltaIndexingManager.
   */
  public enum LockState {
    /**
     * The lock states.
     */
    LOCKED, UNLOCKED;
  }
}
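
As an illustration of the administrative methods, the following sketch filters the kind of map returned by getLockStates() for locked data sources. To keep it compilable on its own it declares a local copy of the LockState enum; the class LockStateReport and the method lockedDataSources are hypothetical helpers, not part of SMILA.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that evaluates a lock state map such as the one returned by getLockStates(). */
final class LockStateReport {
  /** Local copy of the enum shown in the interface, so that the sketch compiles on its own. */
  enum LockState { LOCKED, UNLOCKED }

  /** Returns the ids of all data sources reported as LOCKED, in the map's iteration order. */
  static List<String> lockedDataSources(Map<String, LockState> lockStates) {
    List<String> locked = new ArrayList<>();
    for (Map.Entry<String, LockState> entry : lockStates.entrySet()) {
      if (entry.getValue() == LockState.LOCKED) {
        locked.add(entry.getKey());
      }
    }
    return locked;
  }
}
```

An administrator could use such a list to decide which data sources to pass to unlockDatasource(), keeping in mind that unlocking removes the sessions of clients that may still be running.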

Implementations

SMILA currently comes with two implementations of the DeltaIndexingManager interface: an in-memory implementation and a database-backed implementation. Others may provide further implementations.

In general it makes sense to activate only one DeltaIndexingManager implementation at a time. This is achieved by starting only the desired implementation bundle. If multiple implementations are started, a client using the DeltaIndexingManager has to provide an OSGi filter when requesting the service; otherwise it receives a reference to an arbitrary implementation. Each component description includes a property named smila.connectivity.deltaindexing.impl that can be used for filtering. At the moment the only component that has a reference to the DeltaIndexingManager is the ConnectivityManager.
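
A filter on this property is an ordinary LDAP-style OSGi filter string. The sketch below only builds that string; ImplFilterSketch and filterFor are hypothetical names, and the code assumes plain values such as "memory" or "jpa" that need no escaping.

```java
/** Hypothetical helper that builds an OSGi filter for selecting a DeltaIndexingManager implementation. */
final class ImplFilterSketch {
  /** Property name taken from the component descriptions mentioned above. */
  static final String IMPL_PROPERTY = "smila.connectivity.deltaindexing.impl";

  /**
   * Builds a filter such as (smila.connectivity.deltaindexing.impl=jpa).
   * Assumes implValue contains no characters that require escaping in LDAP filters.
   */
  static String filterFor(String implValue) {
    return "(" + IMPL_PROPERTY + "=" + implValue + ")";
  }

  public static void main(String[] args) {
    // The resulting string could be passed as the filter argument of
    // BundleContext.getServiceReferences(String clazz, String filter),
    // or used as the target attribute of a Declarative Services reference.
    System.out.println(filterFor("jpa"));
  }
}
```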

Below is a list of the currently available implementations.

org.eclipse.smila.connectivity.deltaindexing.impl

This implementation stores the delta indexing information in memory. When the DeltaIndexingManager is stopped, the current state is written to files; when it is started, the state is read back from them. The files are located at
workspace\.metadata\.plugins\org.eclipse.smila.connectivity.deltaindexing
These files are named according to the dataSourceId. This implementation is only useful during development, as the in-memory storage is likely to cause an OutOfMemoryError under a high data load.

Filter Property

<property name="smila.connectivity.deltaindexing.impl" value="memory"/>

Configuration

There are no configuration options available for this bundle.



org.eclipse.smila.connectivity.deltaindexing.jpa.impl

This implementation uses EclipseLink JPA to store the delta indexing information in an Apache Derby database. The data is stored in two tables, DATA_SOURCES and DELTA_INDEXING:

DATA_SOURCES
Column    | Type    | Description
SOURCE_ID | VARCHAR | the data source Id
LOCKED    | BOOLEAN | flag indicating whether this data source is locked
LOCKED_BY | VARCHAR | the id of the thread that locked this data source


DELTA_INDEXING
Column         | Type    | Description
ID_HASH        | VARCHAR | the hashed value of the Id object of the record
HASH           | VARCHAR | the delta indexing hash value
SOURCE_ID      | VARCHAR | the data source Id
IS_COMPOUND    | BOOLEAN | flag if this entry is a compound object
PARENT_ID_HASH | VARCHAR | the hashed value of the parent Id object. This is only set if this Id is an element of a compound object, otherwise it is NULL
VISITED        | BOOLEAN | flag if this entry was already visited
MODIFIED       | BOOLEAN | flag if this entry was modified
ID             | BLOB    | the serialized Id object. This is needed to reconstruct the Id objects for method obsoleteIdIterator()


Filter Property

<property name="smila.connectivity.deltaindexing.impl" value="jpa"/>


Configuration

Todo: this section needs to take this new page into account: SMILA/Documentation/General_JPA_Configuration_in_SMILA


The only configuration needed is a typical EclipseLink configuration property file, in which you can specify logging and database connection settings. For more information, refer to the EclipseLink documentation [1]. The configuration is located at configuration/org.eclipse.smila.connectivity.deltaindexing.jpa.impl/persistence.properties.

# EclipseLink properties
eclipselink.logging.level=INFO
eclipselink.target-server=None
eclipselink.target-database=org.eclipse.persistence.platform.database.DerbyPlatform
eclipselink.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
eclipselink.jdbc.url=jdbc:derby:workspace/.metadata/.plugins/org.eclipse.smila.connectivity.deltaindexing.jpa.impl/deltaindexingstorage;create=true
eclipselink.jdbc.password=smila
eclipselink.jdbc.user=smila
eclipselink.ddl-generation=drop-and-create-tables

After SMILA has been started for the first time, the DDL generation setting prints warnings complaining that some tables cannot be created. These warnings are not critical. You can get rid of them by setting eclipselink.ddl-generation=none, but only after SMILA has been started at least once (so that the tables have been created).

Limitations

At the moment it is necessary to import all packages containing JDBC driver classes in org.eclipse.smila.connectivity.deltaindexing.jpa.impl. To change from Derby to another database it is therefore not sufficient to change the configuration in persistence.properties; you also have to add Import-Package statements for the JDBC driver you want to use to the bundle's manifest. This will hopefully change with the next release of EclipseLink.