
SMILA/Specifications/DeltaIndexingAndConnectivtyDiscussion09

Alternative Concept

Motivation

Why DeltaIndexing

DeltaIndexing speeds up repeated indexing of the same DataSource. It tracks the state of the DataSource as reflected in the index and can therefore determine whether a Record of a DataSource is new, has changed, or was deleted. This is computed from an ID and a HASH token, which is built over characteristic data that indicates whether a Record has changed (e.g. the last modification date). As a consequence, less data generally has to be accessed on the DataSource and only the relevant data is indexed. During each DeltaIndexing run, every Record that is encountered (new, changed, or NOT changed) is marked with a visited flag. At the end of the run, if no errors have occurred, all Records that have not been visited are determined. These are the Records that are still in the index but no longer available on the DataSource (they were deleted or moved), so they are deleted from the index as well.
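The run described above can be illustrated with a small in-memory sketch. It is not SMILA code: the class `DeltaIndexSketch` and its methods `beginRun`, `visit` and `finishRun` are hypothetical names, IDs and hashes are simplified to plain strings, and the "if no errors have occurred" condition is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for the DeltaIndexing state of one DataSource (hypothetical, not SMILA code). */
class DeltaIndexSketch {
    /** Stored hash plus visited flag for one Record ID. */
    private static final class State {
        String hash;
        boolean visited;
        State(String hash) { this.hash = hash; }
    }

    /** Record ID -> state; a TreeMap keeps the obsolete IDs in a stable order. */
    private final Map<String, State> entries = new TreeMap<>();

    /** Starts a run by clearing every visited flag. */
    void beginRun() {
        for (State s : entries.values()) {
            s.visited = false;
        }
    }

    /**
     * Marks the Record as visited and stores its hash.
     * Returns true if the Record is new or its hash changed, i.e. it has to be (re)indexed.
     */
    boolean visit(String id, String hash) {
        State s = entries.get(id);
        if (s == null) {
            s = new State(hash);
            s.visited = true;
            entries.put(id, s);
            return true;
        }
        boolean changed = !s.hash.equals(hash);
        s.hash = hash;
        s.visited = true;
        return changed;
    }

    /** Ends a run: removes and returns all IDs that were not visited (deleted or moved Records). */
    List<String> finishRun() {
        List<String> obsolete = new ArrayList<>();
        for (Map.Entry<String, State> e : entries.entrySet()) {
            if (!e.getValue().visited) {
                obsolete.add(e.getKey());
            }
        }
        entries.keySet().removeAll(obsolete);
        return obsolete;
    }
}
```

A caller would invoke `beginRun()`, call `visit(id, hash)` for every Record found on the DataSource, and only call `finishRun()` if the crawl completed without errors; the returned IDs are the ones to delete from the index.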


The Problems

  • One problem at the moment is that, because SMILA processes incoming Records asynchronously, DeltaIndexing does NOT really reflect the state of a Record in the index: there is no guarantee that a Record is indexed after it was successfully added to the Queue. Reflecting the index state could be achieved by implementing Notifications about the indexing result that update the DeltaIndexing state. If this is done, the computation of DeltaIndexing-Delete has to wait for all Queue entries to pass the workflow. This is a complex process that seems error-prone. Is it really necessary to reflect the index state, or is it enough to reflect the last crawl state?
  • The API of the ConnectivityManager includes parts of the API of the DeltaIndexingManager, which makes it more complex than necessary. It also implies that the ConnectivityManager has an internal state, as DeltaIndexing for a DataSource has to be initialized and finalized. This interface forces its clients to use DeltaIndexing and to follow a strict workflow (initialize, add records, optionally call DeltaIndexing-Delete and delete the returned IDs, finish). Even if this usage were configurable, the API is, simply put, ugly.


New Ideas

Usage of DeltaIndexingManager

  1. used by CrawlerController: This approach would not change much of the current programming logic; the CrawlerController would merely communicate with two references instead of one. This approach was realized.


  2. used by Crawlers: This is a radical change, as it also affects the Crawler interface. Crawlers could communicate directly with the DeltaIndexingManager and provide only those Records that pass DeltaIndexing (are new or need an update). CrawlerController and Crawler could implement a Producer/Consumer pattern, which should improve performance: there would be no more sending of arrays with DIInformation and retrieving the Record objects afterwards. DeltaIndexing-Delete information is computed in the Crawler and can be passed to the CrawlerController as regular Records (only the ID is set) with a delete flag that notifies the CrawlerController that this Record is to be deleted. This should reduce communication overhead, as the DIInformation does not have to be passed between multiple components and the whole process can work multithreaded. Of course, this adds a lot more logic to the Crawler and demands more knowledge from a Crawler developer. It would also mean that ID and HASH are generated in the Crawler. The downside is that each Crawler has to implement the DeltaIndexing workflow itself.
    We could even move all execution logic to the Crawler, which would make the CrawlerController obsolete. Crawlers would then handle everything themselves: communication with the DeltaIndexingManager, CompoundHandlers and ConnectivityManager. I think the best performance can be achieved this way, as the setup is very simple, with no unnecessary passing of data between components. But a lot of logic would have to be re-implemented in every Crawler. I wonder if there is a chance to minimize this.
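The Producer/Consumer idea with a delete flag could look roughly like the following sketch. All names (`CrawlerPipelineSketch`, `CrawlItem`, `startCrawler`, `consume`, `run`) are hypothetical and not part of SMILA; Records are reduced to their ID, the DeltaIndexing decisions are assumed to have been made already inside the Crawler, and a bounded `BlockingQueue` stands in for the hand-over between Crawler and CrawlerController.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Producer/Consumer hand-over between a Crawler and its controller (hypothetical sketch). */
class CrawlerPipelineSketch {
    /** A Record reduced to its ID plus a delete flag. */
    static final class CrawlItem {
        final String id;
        final boolean delete;
        CrawlItem(String id, boolean delete) { this.id = id; this.delete = delete; }
    }

    /** Sentinel that tells the consumer the crawl is complete. */
    static final CrawlItem END = new CrawlItem(null, false);

    /** Producer: emits only Records that passed DeltaIndexing, then the obsolete IDs flagged for deletion. */
    static Thread startCrawler(BlockingQueue<CrawlItem> queue,
                               List<String> changedIds, List<String> obsoleteIds) {
        Thread crawler = new Thread(() -> {
            try {
                for (String id : changedIds) {
                    queue.put(new CrawlItem(id, false));
                }
                for (String id : obsoleteIds) {
                    queue.put(new CrawlItem(id, true));
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        crawler.start();
        return crawler;
    }

    /** Consumer: records which action the controller would take for each item. */
    static List<String> consume(BlockingQueue<CrawlItem> queue) throws InterruptedException {
        List<String> actions = new ArrayList<>();
        for (CrawlItem item = queue.take(); item != END; item = queue.take()) {
            actions.add((item.delete ? "delete " : "add ") + item.id);
        }
        return actions;
    }

    /** Runs producer and consumer together; the small queue bound makes the producer wait for the consumer. */
    static List<String> run(List<String> changedIds, List<String> obsoleteIds) {
        BlockingQueue<CrawlItem> queue = new LinkedBlockingQueue<>(2);
        Thread crawler = startCrawler(queue, changedIds, obsoleteIds);
        try {
            List<String> actions = consume(queue);
            crawler.join();
            return actions;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
    }
}
```

In this sketch the consumer never sees DIInformation arrays; it only receives items to add or delete, which is the reduction in communication overhead the proposal aims for.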


Implemented Changes

Page | Date | Bug | Author(s)
New Feature: DeltaIndexing On/Off | 2009-06-10 | bug 279242 | DS
Separate Interfaces for ConnectivityManager and DeltaIndexingManager | 2008-06 ? | ? | DS?

Alternative API

For a better separation of tasks and an easy handling of locks on data sources during a delta indexing run, we could introduce the following interfaces. The implementations should only be proxies using the same DeltaIndexingManager service implementation, so that a DeltaIndexingSession may internally use another service if the initial one becomes unavailable.

interface DeltaIndexingManager
{
    /**
     * Initializes a new DeltaIndexingSession if the datasource is not locked.
     */
    DeltaIndexingSession init(String dataSourceID) throws DeltaIndexingException;
 
    /**
     * Clears all data sources that are not locked.
     */
    void clear() throws DeltaIndexingException;
 
    /**
     * Clears the data source if not locked.
     */
    void clear(String dataSourceID) throws DeltaIndexingException;
 
    /**
     * Unlocks all data sources by force.
     */
    void unlockDatasources() throws DeltaIndexingException;
 
    /**
     * Checks if a data source exists.
     */
    boolean exists(String dataSourceId);
}


interface DeltaIndexingSession
{
    /**
     * Checks if the id needs to be updated.
     */
    boolean checkForUpdate(Id id, String hash) throws DeltaIndexingException;
 
    /**
     * Marks the id as visited.
     */
    void visit(Id id, String hash) throws DeltaIndexingException;
 
    /**
     * Returns an iterator over all unvisited ids of the data source.
     */
    Iterator<Id> obsoleteIdIterator(String dataSourceID) throws DeltaIndexingException;
 
    /**
     * Returns an iterator over all unvisited ids of a parent id (compound objects).
     */
    Iterator<Id> obsoleteIdIterator(Id id) throws DeltaIndexingException;
 
    /**
     * Deletes the id.
     */
    void delete(Id id) throws DeltaIndexingException;
 
    /**
     * Finishes the DeltaIndexing run and unlocks the data source.
     */
    void finish(String dataSourceID) throws DeltaIndexingException;
}
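To show how a client might drive such a session, here is a minimal sketch. It does not use the SMILA types: `Session` is a reduced stand-in for the proposed DeltaIndexingSession with plain `String` ids and no checked exception, and `InMemorySession` is a toy implementation invented for this example. It also assumes that `visit` is called for every Record seen, whether or not it needed an update; locking via `init` is omitted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Client-side view of one DeltaIndexing run against a session (hypothetical sketch). */
class SessionUsageSketch {
    /** Reduced stand-in for the proposed DeltaIndexingSession. */
    interface Session {
        boolean checkForUpdate(String id, String hash);
        void visit(String id, String hash);
        Iterator<String> obsoleteIdIterator();
        void delete(String id);
        void finish();
    }

    /** Runs one crawl over (id -> hash) pairs and returns the resulting index actions in order. */
    static List<String> crawl(Session session, Map<String, String> idToHash) {
        List<String> actions = new ArrayList<>();
        for (Map.Entry<String, String> record : idToHash.entrySet()) {
            if (session.checkForUpdate(record.getKey(), record.getValue())) {
                actions.add("index " + record.getKey());
            }
            // Assumption: every Record seen is marked visited, changed or not.
            session.visit(record.getKey(), record.getValue());
        }
        // Copy the obsolete ids first so deleting cannot disturb the iteration.
        List<String> obsolete = new ArrayList<>();
        session.obsoleteIdIterator().forEachRemaining(obsolete::add);
        for (String id : obsolete) {
            actions.add("delete " + id);
            session.delete(id);
        }
        session.finish();
        return actions;
    }

    /** Toy session backed by a caller-supplied (id -> hash) map that persists between runs. */
    static final class InMemorySession implements Session {
        private final Map<String, String> hashes;
        private final Set<String> visited = new HashSet<>();

        InMemorySession(Map<String, String> hashes) { this.hashes = hashes; }

        public boolean checkForUpdate(String id, String hash) {
            return !hash.equals(hashes.get(id)); // unknown id or different hash
        }
        public void visit(String id, String hash) {
            hashes.put(id, hash);
            visited.add(id);
        }
        public Iterator<String> obsoleteIdIterator() {
            List<String> obsolete = new ArrayList<>(hashes.keySet());
            obsolete.removeAll(visited);
            return obsolete.iterator();
        }
        public void delete(String id) { hashes.remove(id); }
        public void finish() { visited.clear(); }
    }
}
```

The client only ever talks to the session object, so the data source id does not have to be repeated on each call; whether the real interface should keep the `dataSourceID` parameters shown above is left open here.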

This approach was not realized. However, a sessionId was introduced to distinguish between different sessions without relying on thread IDs. See https://bugs.eclipse.org/bugs/show_bug.cgi?id=279243
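As an illustration of why a session id is more robust than a thread id, the following sketch ties the lock on a data source to a generated id that any thread can present. It is an assumption about how such a scheme could work, not the implementation behind the bug above; `SessionLockSketch`, `init`, `owns` and `finish` are hypothetical names.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Locks data sources per session id instead of per thread (hypothetical sketch). */
class SessionLockSketch {
    /** dataSourceID -> session id of the run that currently holds the lock. */
    private final Map<String, String> locks = new ConcurrentHashMap<>();

    /** Locks the data source and returns a fresh session id; fails if it is already locked. */
    String init(String dataSourceID) {
        String sessionId = UUID.randomUUID().toString();
        if (locks.putIfAbsent(dataSourceID, sessionId) != null) {
            throw new IllegalStateException("data source is locked: " + dataSourceID);
        }
        return sessionId;
    }

    /** True if the given session holds the lock, regardless of the calling thread. */
    boolean owns(String sessionId, String dataSourceID) {
        return sessionId.equals(locks.get(dataSourceID));
    }

    /** Unlocks the data source; only the owning session may do so. */
    void finish(String sessionId, String dataSourceID) {
        if (!locks.remove(dataSourceID, sessionId)) {
            throw new IllegalStateException("session does not own data source: " + dataSourceID);
        }
    }
}
```

Because ownership is checked against the id rather than `Thread.currentThread()`, a run can be continued by a worker pool or a remote caller as long as it passes the same session id.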