
Difference between revisions of "SMILA/Specifications/DeltaIndexingAndConnectivtyDiscussion09"


Latest revision as of 02:56, 17 October 2009


Motivation for this page and usage

The current implementation of the DeltaIndexingManager has several problems or shortcomings, which are listed under the section Ideas and Problems (under discussion). If an idea is rather large, a separate page is usually better and should be created as a child of this page. The idea should still have its own section here, containing at least a link to that page.

The initiating authors should edit only their own sections and not those of others.

Each subsection/page should state:

  • context such as: author, date, based on SVN revision
  • motivation/problem
  • a solution proposal

Ideas that have been implemented are moved to their own page and referenced in Implemented Changes.

Ideas and Problems (under discussion)

DeltaIndexing reflects crawl state rather than index state

One problem at the moment is that, because SMILA processes incoming Records asynchronously, DeltaIndexing does NOT really reflect the state of a Record in the index: there is no guarantee that a Record is indexed after it was successfully added to the Queue. Reflecting the index state could be achieved by implementing Notifications that update the DeltaIndexing state using this information. If this is done, the computation of DeltaIndexing-Delete has to wait until all Queue entries have passed the workflow. This is a complex process that seems error-prone. Is it really necessary to reflect the index state, or is it enough to reflect the last crawl state?
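
To make the trade-off concrete, here is a minimal sketch of what notification-based tracking might involve. Everything in it is hypothetical (the class IndexStateTracker, its methods, and the use of plain String ids are not part of SMILA); it only illustrates why DeltaIndexing-Delete would have to wait for all queued Records.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker that distinguishes "added to the Queue" from "confirmed as indexed". */
class IndexStateTracker {
  enum State { QUEUED, INDEXED }

  private final Map<String, State> states = new HashMap<>();

  /** Called when a Record was added to the Queue; this is all the crawl state knows. */
  void onQueued(String id) {
    states.put(id, State.QUEUED);
  }

  /** Called by an assumed notification once the workflow has actually indexed the Record. */
  void onIndexed(String id) {
    states.put(id, State.INDEXED);
  }

  /** DeltaIndexing-Delete could only be computed once no Record is still in flight. */
  boolean readyForDeleteComputation() {
    return !states.containsValue(State.QUEUED);
  }
}
```

Reflecting only the last crawl state would make the onIndexed step and the waiting unnecessary, at the price of the mismatch described above.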

Extract Session Interface from DeltaIndexingManager

For a better separation of tasks and easier handling of locks on data sources during a delta indexing run, we could introduce the following interfaces. The implementations should only be proxies that use the same DeltaIndexingManager service implementation, so that a DeltaIndexingSession may internally switch to another service if the initial one becomes unavailable.

interface DeltaIndexingManager
{
    /**
     * Initializes a new DeltaIndexingSession if the datasource is not locked.
     */
    DeltaIndexingSession init(String dataSourceID) throws DeltaIndexingException;
 
    /**
     * Clear all data sources that are not locked.
     */
    void clear() throws DeltaIndexingException;
 
    /**
     * Clears the data source if not locked.
     */
    void clear(String dataSourceID) throws DeltaIndexingException;
 
    /**
     * Unlocks all data sources by force.
     */
    void unlockDatasources() throws DeltaIndexingException;
 
    /**
     * Checks if a data source exists.
     */
    boolean exists(String dataSourceId);
}


interface DeltaIndexingSession
{
    /**
     * Checks if the id needs to be updated.
     */
    boolean checkForUpdate(Id id, String hash) throws DeltaIndexingException;
 
    /**
     * Marks the id as visited.
     */
    void visit(Id id, String hash) throws DeltaIndexingException;
 
    /**
     * Returns an iterator over all unvisited ids of the data source.
     */
    Iterator<Id> obsoleteIdIterator(String dataSourceID) throws DeltaIndexingException;
 
    /**
     * Returns an iterator over all unvisited ids of a parent id (compound objects).
     */
    Iterator<Id> obsoleteIdIterator(Id id) throws DeltaIndexingException;
 
    /**
     * Deletes the id.
     */
    void delete(Id id) throws DeltaIndexingException;
 
    /**
     * Finishes the delta indexing run and unlocks the data source.
     */
    void finish(String dataSourceID) throws DeltaIndexingException;
}

This approach was not realized. However, a sessionId was introduced to distinguish between different sessions without relying on thread ids. See https://bugs.eclipse.org/bugs/show_bug.cgi?id=279243
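
As a rough illustration of how a client might have driven the two proposed interfaces during one run, here is a minimal in-memory sketch. It is not SMILA code: the class names are invented, plain Strings stand in for Id, unchecked exceptions replace DeltaIndexingException, and it assumes that the client calls visit() for every crawled id, whether or not it needed an update.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the proposed manager; ids are plain Strings, not SMILA Ids. */
class SketchDeltaIndexingManager {
  // dataSourceID -> (id -> hash), as recorded by earlier runs
  private final Map<String, Map<String, String>> entries = new HashMap<>();
  private final Set<String> locked = new HashSet<>();

  /** Mirrors init(dataSourceID): refuses to start a second session on a locked data source. */
  SketchSession init(String dataSourceID) {
    if (!locked.add(dataSourceID)) {
      throw new IllegalStateException("data source is locked: " + dataSourceID);
    }
    return new SketchSession(dataSourceID, entries.computeIfAbsent(dataSourceID, k -> new LinkedHashMap<>()));
  }

  /** Session object that holds the lock until finish() is called. */
  class SketchSession {
    private final String dataSourceID;
    private final Map<String, String> hashes;
    private final Set<String> visited = new HashSet<>();

    SketchSession(String dataSourceID, Map<String, String> hashes) {
      this.dataSourceID = dataSourceID;
      this.hashes = hashes;
    }

    boolean checkForUpdate(String id, String hash) {
      return !hash.equals(hashes.get(id)); // true for a new id or a changed hash
    }

    void visit(String id, String hash) {
      hashes.put(id, hash);
      visited.add(id);
    }

    List<String> obsoleteIds() {
      List<String> obsolete = new ArrayList<>();
      for (String id : hashes.keySet()) {
        if (!visited.contains(id)) {
          obsolete.add(id);
        }
      }
      return obsolete;
    }

    void delete(String id) {
      hashes.remove(id);
    }

    void finish() {
      locked.remove(dataSourceID);
    }
  }

  /** One delta indexing run; returns the actions a controller would forward for processing. */
  List<String> run(String dataSourceID, Map<String, String> crawled) {
    SketchSession session = init(dataSourceID);
    List<String> actions = new ArrayList<>();
    try {
      for (Map.Entry<String, String> e : crawled.entrySet()) {
        if (session.checkForUpdate(e.getKey(), e.getValue())) {
          actions.add("update " + e.getKey());
        }
        session.visit(e.getKey(), e.getValue()); // assumption: every crawled id is marked visited
      }
      for (String id : session.obsoleteIds()) {
        session.delete(id);
        actions.add("delete " + id);
      }
    } finally {
      session.finish(); // always release the lock, even if the run fails
    }
    return actions;
  }
}
```

Calling run("ds", ...) twice with different crawl results yields "update" actions for new or changed ids and "delete" actions for ids that were not seen in the second run.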


Discussion

modifications to the interfaces

TM 2009 10 15: I second the notion to extract a session interface, but I would also make a few renames and changes, like so:

public interface IDeltaIndexingManager {
 
  /**
   * Initializes the internal state for an import of a dataSourceID and creates a session wherein it establishes a lock
   * to avoid that the same dataSourceID is initialized multiple times concurrently. It returns an object for the session
   * that a client has to use to gain access to the locked data source.
   * 
   * @param dataSourceID
   *          dataSourceID
   * 
   * @return the delta indexing session
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  IDeltaIndexingSession createSession(final String dataSourceID) throws DeltaIndexingException;
 
  /* methods that don't need a session */
 
  /**
   * Clears all entries of the DeltaIndexingManager including sessions.
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void clear() throws DeltaIndexingException;
 
  /**
   * Unlocks the given data source and removes its sessions.
   * 
   * @param dataSourceID
   *          the data source id
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void unlockDatasource(final String dataSourceID) throws DeltaIndexingException;
 
  /**
   * Unlocks all data sources and removes all sessions.
   * 
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void unlockDatasources() throws DeltaIndexingException;
 
  /**
   * Gets an overview of which data sources are locked or unlocked.
   * 
   * @return a map containing the dataSourceId and the LockState
   */
  Map<String, LockState> getLockStates();
 
  /**
   * Checks if the entries for the given dataSourceId exist.
   * 
   * @param dataSourceId
   *          the data source id
   * 
   * @return true if entries exist for the given dataSourceId
   */
  boolean dataSourceExists(final String dataSourceId);
 
  /**
   * Get the number of delta indexing entries for the given dataSourceID.
   * 
   * @param dataSourceID
   *          the data source id
   * @return the number of entries
   */
  long getEntryCount(final String dataSourceID);
 
  /**
   * Get the number of delta indexing entries for all data sources.
   * 
   * @return a map of dataSourceIds and the entry counts
   */
  Map<String, Long> getEntryCounts();
 
  /**
   * An enumeration defining the lock states of a data source in the DeltaIndexingManager.
   */
  public enum LockState {
    /**
     * The lock states.
     */
    LOCKED, UNLOCKED;
  }
}
 
/**
 * The Interface IDeltaIndexingSession.
 * 
 * @author tmenzel
 */
public interface IDeltaIndexingSession {
 
  /**
   * Clear all entries of the given sessionId.
   * 
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void clear() throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Finish this delta indexing session and remove the lock.
   * 
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void commit() throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Deletes the delta indexing entry of the given id.
   * 
   * @param id
   *          the id
   * 
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void delete(final Id id) throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Deletes untouched ids. Rather than having the controller call {@link #delete(Id)} while iterating through the ids,
   * the implementation may delete all untouched ids internally in one go, which is more efficient.
   * 
   * @return the number of deleted ids
   * 
   * @throws DeltaIndexingSessionException
   *           the delta indexing session exception
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  long deleteUntouchedIds() throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Returns an iterator over all untouched (obsolete) ids of this session.
   * 
   * @return an iterator over the untouched ids
   * 
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  Iterator<Id> getUntouchedIds() throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Returns an iterator over all untouched (obsolete) id fragments of the given id.
   * 
   * @param id
   *          the id
   * 
   * @return an iterator over the untouched ids
   * 
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  Iterator<Id> getUntouchedIds(final Id id) throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Checks whether the hash of the given id is new or has changed (true) or not (false).
   * 
   * To reduce method calls, the entry is marked as visited if the return value is false.
   * 
   * @param id
   *          the id
   * @param hash
   *          the hash
   * 
   * @return true if the id is new or its hash has changed
   * 
   * @throws DeltaIndexingSessionException
   *           the delta indexing session exception
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  boolean hasChanged(final Id id, final String hash) throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Rolls back changes that were made in the current session between init() and finish(). It should be called before
   * the session is finished.
   * 
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void rollback() throws DeltaIndexingSessionException, DeltaIndexingException;
 
  /**
   * Creates or updates the delta indexing entry. This is THE method to make the record known to DI. It sets the hash
   * and the isCompound flag, and marks this id as visited.
   * 
   * @param id
   *          the id
   * @param hash
   *          the hash
   * @param isCompound
   *          boolean flag if the record identified by id is a compound record (true) or not (false)
   * 
   * @throws DeltaIndexingSessionException
   *           if the sessionId is invalid
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  void touch(final Id id, final String hash, final boolean isCompound) throws DeltaIndexingSessionException,
    DeltaIndexingException;
 
  /**
   * This is a combination of {@link #hasChanged(Id, String)} and {@link #touch(Id, String, boolean)} in one step.
   * <p>
   * It has a performance gain over calling the methods separately, but has the drawback that the record is always
   * touched, even if an exception occurs before the record is put into the queue. On the other hand, this does not
   * matter much, as the subsequent processing may also cause errors that are not reflected in the "touch" state.
   * 
   * @param id
   *          the id
   * @param hash
   *          the hash
   * @param isCompound
   *          boolean flag if the record identified by id is a compound record (true) or not (false)
   * 
   * @return true if the id is new or its hash has changed
   * 
   * @throws DeltaIndexingSessionException
   *           the delta indexing session exception
   * @throws DeltaIndexingException
   *           the delta indexing exception
   */
  boolean checkAndTouch(final Id id, final String hash, final boolean isCompound)
    throws DeltaIndexingSessionException, DeltaIndexingException;
 
}
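
To illustrate the intended semantics of checkAndTouch, deleteUntouchedIds, commit and rollback, here is a small in-memory sketch. It is an assumption-laden toy, not the proposed implementation: the class TouchSessionSketch is invented, String ids replace Id, the isCompound flag and the checked exceptions are left out, and commit/rollback are modelled by working on a copy of the stored entries.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory session; String ids replace SMILA's Id and isCompound is omitted. */
class TouchSessionSketch {
  private final Map<String, String> committed; // id -> hash, state of the last finished run
  private final Map<String, String> working;   // copy that is modified during this session
  private final Set<String> touched = new HashSet<>();

  TouchSessionSketch(Map<String, String> committed) {
    this.committed = committed;
    this.working = new HashMap<>(committed);
  }

  /** True if the id is new or its hash differs; unchanged entries are marked as visited. */
  boolean hasChanged(String id, String hash) {
    boolean changed = !hash.equals(working.get(id));
    if (!changed) {
      touched.add(id);
    }
    return changed;
  }

  /** Creates or updates the entry and marks it as visited. */
  void touch(String id, String hash) {
    working.put(id, hash);
    touched.add(id);
  }

  /** hasChanged and touch in one step: the entry is always touched. */
  boolean checkAndTouch(String id, String hash) {
    boolean changed = hasChanged(id, hash);
    touch(id, hash);
    return changed;
  }

  /** Removes all entries that were not visited in this session and returns their number. */
  long deleteUntouchedIds() {
    long deleted = 0;
    for (Iterator<String> it = working.keySet().iterator(); it.hasNext();) {
      if (!touched.contains(it.next())) {
        it.remove();
        deleted++;
      }
    }
    return deleted;
  }

  /** Makes the session's changes permanent. */
  void commit() {
    committed.clear();
    committed.putAll(working);
  }

  /** Discards all changes made in this session. */
  void rollback() {
    working.clear();
    working.putAll(committed);
    touched.clear();
  }
}
```

A controller would call checkAndTouch for each crawled record, forward only those for which it returns true, then call deleteUntouchedIds and commit; on failure it would call rollback instead.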


Usage of DeltaIndexingManager by CrawlerControler alone

Here is another idea, based on the changes introduced with SMILA/Specifications/DeltaIndexingAndConnectivtyDiscussion09/Separate_Interfaces_for_ConnectivityManager_and_DeltaIndexingManager but taking them further: not the CrawlerController but each Crawler communicates with the DeltaIndexingManager.

This is a radical change, as it also affects the Crawler interface. Crawlers could communicate directly with the DeltaIndexingManager and provide only those Records that pass DeltaIndexing (are new or need an update). CrawlerController and Crawler could implement a Consumer/Producer pattern, which should improve performance. There would be no more sending of arrays with DIInformation and retrieving the Record objects afterwards. DeltaIndexing-Delete information is computed in the Crawler and can be passed to the CrawlerController as regular Records (only the ID is set) with a delete flag to notify the CrawlerController that this Record is to be deleted. This should reduce communication overhead, as the DIInformation does not have to be passed between multiple components, and the whole process can work multithreaded. Of course, this adds a lot more logic to the Crawler and demands more knowledge from a Crawler developer. It would also mean that ID and HASH are generated in the Crawler. The downside is that each Crawler has to implement the DeltaIndexing workflow itself.
We could even move all execution logic to the Crawler. CrawlerController would become obsolete. Crawlers would then handle everything themselves: communication with DeltaIndexingManager, CompoundHandlers and ConnectivityManager. I think the best performance can be achieved this way, as the setup is very simple: no unnecessary passing of data between components. But a lot of logic has to be re-implemented in every Crawler. I wonder if there is a chance to minimize this.
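
The Consumer/Producer idea could be sketched roughly as follows. All names here (CrawlRecord, ProducerConsumerSketch, the END marker) are hypothetical and only illustrate the pattern: the Crawler thread produces Records that already passed DeltaIndexing, followed by delete Records that carry only an ID and a delete flag, while the CrawlerController consumes them from a bounded queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical record: for deletes only the id is set, together with the delete flag. */
class CrawlRecord {
  static final CrawlRecord END = new CrawlRecord(null, null, false); // end-of-crawl marker

  final String id;
  final String content;
  final boolean delete;

  CrawlRecord(String id, String content, boolean delete) {
    this.id = id;
    this.content = content;
    this.delete = delete;
  }
}

class ProducerConsumerSketch {
  /** Crawler side: puts changed records, then delete records, then the end marker. */
  static Thread startCrawler(BlockingQueue<CrawlRecord> queue, List<CrawlRecord> changed, List<String> obsoleteIds) {
    Thread crawler = new Thread(() -> {
      try {
        for (CrawlRecord record : changed) {
          queue.put(record); // blocks while the controller is busy and the queue is full
        }
        for (String id : obsoleteIds) {
          queue.put(new CrawlRecord(id, null, true));
        }
        queue.put(CrawlRecord.END);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    crawler.start();
    return crawler;
  }

  /** CrawlerController side: takes records until the end marker and routes them. */
  static List<String> consume(BlockingQueue<CrawlRecord> queue) {
    List<String> log = new ArrayList<>();
    try {
      for (CrawlRecord r = queue.take(); r != CrawlRecord.END; r = queue.take()) {
        log.add((r.delete ? "delete " : "add ") + r.id);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return log;
  }

  /** Runs one crawl with a small bounded queue between the two sides. */
  static List<String> crawl(List<CrawlRecord> changed, List<String> obsoleteIds) {
    BlockingQueue<CrawlRecord> queue = new LinkedBlockingQueue<>(2);
    startCrawler(queue, changed, obsoleteIds);
    return consume(queue);
  }
}
```

The bounded queue is a design choice of this sketch: it keeps the Crawler from running far ahead of the CrawlerController, so neither side has to hold the complete crawl result in memory.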

(an empty page exists for this already)

Implemented Changes

Page | Date | Bug | Author(s)
New Feature: DeltaIndexing On/Off | 2009-06-10 | bug 279242 | Daniel Stucky
Separate Interfaces for ConnectivityManager and DeltaIndexingManager | 2008-06 ? | ? | Daniel Stucky
