Difference between revisions of "SMILA/Specifications/DeltaIndexingAndConnectivtyDiscussion09"

Revision as of 04:50, 27 August 2008


WARNING: This page is under construction by Daniel Stucky
 



=== Motivation ===

==== Why DeltaIndexing ====

DeltaIndexing is used to speed up repeated indexing of the same DataSource. It reflects the state of the DataSource in the index and can thus determine whether a Record of a DataSource is new, has changed, or was deleted. An ID and a HASH token (which is built from characteristic data that indicates a change, e.g. the last modification date) are used to compute this information. As a consequence, less data has to be accessed on the DataSource and only the relevant data is indexed. During a DeltaIndexing run, all Records that are new, have changed, or have NOT changed are marked with a visited flag. At the end of the run, if no errors have occurred, all Records that have not been visited are determined. These are the Records that are still in the index but are no longer available on the DataSource (they were deleted or moved), so they are now deleted from the index as well.
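The bookkeeping described above can be sketched with a small in-memory class. This is a hypothetical illustration, not the actual SMILA implementation: IDs and hashes are plain strings, <code>visit</code> sets the visited flag, and everything left unvisited at the end of a run counts as obsolete.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch of the DeltaIndexing bookkeeping:
// IDs map to hash tokens, visited Records are flagged, and unvisited
// entries are reported as obsolete at the end of a run.
class DeltaIndexingState {
    private final Map<String, String> idToHash = new HashMap<>();
    private final Set<String> visited = new HashSet<>();

    /** Start a new run: clear the visited flags of the previous run. */
    void init() {
        visited.clear();
    }

    /** True if the Record is new or its hash token has changed. */
    boolean checkForUpdate(String id, String hash) {
        return !hash.equals(idToHash.get(id));
    }

    /** Store the current hash and mark the Record as visited. */
    void visit(String id, String hash) {
        idToHash.put(id, hash);
        visited.add(id);
    }

    /** IDs known from earlier runs but not visited in this run. */
    List<String> obsoleteIds() {
        List<String> obsolete = new ArrayList<>();
        for (String id : idToHash.keySet()) {
            if (!visited.contains(id)) {
                obsolete.add(id);
            }
        }
        return obsolete;
    }
}
```

Note that, exactly as described above, the obsolete IDs must only be acted on if the whole run completed without errors.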


==== The Problems ====

* One problem at the moment is that, because SMILA's processing of incoming Records is asynchronous, DeltaIndexing does NOT really reflect the state of a Record in the index: there is no guarantee that a Record is indexed after it was successfully added to the Queue. This could be achieved by implementing Notifications that update the DeltaIndexing state accordingly. If this is done, the computation of DeltaIndexing-Delete has to wait for all Queue entries to pass the workflow. This is a complex process which seems error-prone. Is it really necessary to reflect the index state, or is it enough to reflect the last crawl state?
* The API of the ConnectivityManager includes parts of the API of the DeltaIndexingManager, which makes it more complex than necessary. It also implies that the ConnectivityManager has an internal state, as DeltaIndexing for a DataSource has to be initialized and finalized. This interface forces its clients to make use of DeltaIndexing and to follow a strict workflow (initialize, add records, optionally call DeltaIndex-Delete and delete the returned IDs, finish). Even if this usage were configurable, the API is - simply spoken - ugly.


=== New Ideas ===

==== Separated Interfaces ====

I suggest separating the ConnectivityManager and DeltaIndexingManager interfaces. This makes both APIs clearer and more focused. We should think of SMILA more as a "construction kit" than as a "ready for all issues salvation". E.g. if someone wants to connect to SMILA without using Crawlers or Agents but still wants the benefits of DeltaIndexing, all the components he needs are there: he can implement his own importer using the DeltaIndexingManager and ConnectivityManager interfaces. There is no need to provide the whole functionality "en bloc".

<source lang="java">
interface ConnectivityManager
{
  int add(Record[] records) throws ConnectivityException;
  int update(Record[] records) throws ConnectivityException; // optional
  int delete(Id[] ids) throws ConnectivityException;
}
</source>


<source lang="java">
interface DeltaIndexingManager
{
    void init(String dataSourceID) throws DeltaIndexingException;
    boolean checkForUpdate(Id id, String hash) throws DeltaIndexingException;
    void visit(Id id, String hash) throws DeltaIndexingException;
    Iterator<Id> obsoleteIdIterator(String dataSourceID) throws DeltaIndexingException;
    void finish(String dataSourceId) throws DeltaIndexingException;
    ...
    // same functionality for Compound objects, remember not to overload methods when using SCA
}
</source>


Notes: If calls to ConnectivityManager are NOT relevant for the DeltaIndexing state (e.g. if it is enough that a call to add/delete succeeded, and successful addition to the Queue is not required), they could forgo the return value and the ConnectivityException; in the SCA interface these methods could then be annotated with @oneway to improve performance. Via callbacks it would still be possible to send back information asynchronously. But if feedback is required, the synchronous method call is much easier to use.
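To illustrate the "construction kit" idea, here is a hypothetical importer loop built only against the two separated interfaces. Record and Id are simplified to plain strings, and the nested interfaces are reduced stand-ins for the ones defined above; the loop itself follows the described workflow: init, check/visit each source item, push only new/changed Records, delete obsolete IDs, finish.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical custom importer using only the two separated interfaces.
class ImporterSketch {
    interface ConnectivityManager {
        int add(String[] records);
        int delete(String[] ids);
    }

    interface DeltaIndexingManager {
        void init(String dataSourceId);
        boolean checkForUpdate(String id, String hash);
        void visit(String id, String hash);
        Iterator<String> obsoleteIdIterator(String dataSourceId);
        void finish(String dataSourceId);
    }

    /** One import run over the (id -> hash) items of a DataSource. */
    static void runImport(String source, Map<String, String> sourceItems,
                          DeltaIndexingManager dim, ConnectivityManager cm) {
        dim.init(source);
        for (Map.Entry<String, String> item : sourceItems.entrySet()) {
            String id = item.getKey();
            String hash = item.getValue();
            if (dim.checkForUpdate(id, hash)) {
                cm.add(new String[] { id }); // push only new/changed Records
            }
            dim.visit(id, hash);             // mark as seen in this run
        }
        // everything not visited in this run is gone from the source
        Iterator<String> obsolete = dim.obsoleteIdIterator(source);
        while (obsolete.hasNext()) {
            cm.delete(new String[] { obsolete.next() });
        }
        dim.finish(source);
    }
}
```

The point is that neither interface knows about the other: the importer alone drives the workflow, so clients that do not want DeltaIndexing simply never touch the DeltaIndexingManager.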


==== Usage of DeltaIndexingManager ====

# used by the CrawlerController: This approach would not change much of the current programming logic, only that the CrawlerController would communicate with two references instead of one.
# used by the Crawlers: This is a radical change, as it also affects the Crawler interface. Crawlers could communicate directly with the DeltaIndexingManager and provide only those Records that pass DeltaIndexing (are new or need an update). CrawlerController and Crawler could implement a Consumer/Producer pattern, which should improve performance: no more sending of arrays with DIInformation and thereafter retrieving the Record objects. DeltaIndexing-Delete information is computed in the Crawler and can be passed to the CrawlerController as regular Records (only the ID is set) with a delete flag to notify the CrawlerController that this Record is to be deleted. This should reduce communication overhead, as the DIInformation does not have to be passed between multiple components, and the whole process can work multithreaded. Of course this adds a lot more logic to the Crawler and demands more knowledge from a Crawler developer. It would also mean that ID and HASH are generated in the Crawler. The downside is that each Crawler has to implement the DeltaIndexing workflow itself.<br>We could even move all execution logic to the Crawler; the CrawlerController would become obsolete. Crawlers would then handle everything themselves: communication with the DeltaIndexingManager, CompoundHandlers, and the ConnectivityManager. I think the best performance can be achieved this way, as the setup is very simple and no data is passed between components unnecessarily. But a lot of logic has to be re-implemented in every Crawler; I wonder if there is a chance to minimize this.


== New Feature: DeltaIndexing On/Off ==

=== Motivation ===

It should be possible to turn the usage of DeltaIndexing on and off, either to reduce complexity or to gain better performance.

=== Draft ===

A simple boolean logic (on/off) seems too simple, as I see 4 possible use cases (modes):

* FULL: DeltaIndexing is fully activated. This means that
** each Record is checked as to whether it needs to be updated
** for each Record an entry is made/updated in the DeltaIndexingManager
** Delta-Delete is executed at the end of the import
* ADDITIVE: as FULL, but Delta-Delete is not executed (we allow records in the index that do not exist anymore)
* INITIAL: For an initial import into an empty index, or a new source in an existing index, performance can be optimized by
** NOT checking if a record needs to be updated (we know that all records are new)
** adding an entry in the DeltaIndexingManager for each Record. This allows later imports to make use of DeltaIndexing
** NOT executing Delta-Delete (we know that no records are to be deleted)
* DISABLED: DeltaIndexing is fully deactivated. No checks are done, no entries are created/updated, no Delta-Delete is executed. Later runs cannot benefit from DeltaIndexing.

As always, Delta-Delete MUST NOT be executed if any errors occur during import as we do not want to delete records erroneously!
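The four modes differ only in three decisions: whether Records are checked, whether entries are maintained, and whether Delta-Delete runs. One way to sketch this without special-casing is an enum that fixes the three flags per mode; the names and fields here are illustrative, not the actual SMILA types.

```java
// Hypothetical sketch: each mode fixes the three DeltaIndexing decisions.
enum DeltaIndexingMode {
    FULL(true, true, true),
    ADDITIVE(true, true, false),
    INITIAL(false, true, false),
    DISABLED(false, false, false);

    final boolean checkForUpdates; // skip when all records are known to be new
    final boolean updateEntries;   // maintain ID/hash entries for later runs
    final boolean runDeltaDelete;  // only if the import finished without errors

    DeltaIndexingMode(boolean check, boolean update, boolean delete) {
        this.checkForUpdates = check;
        this.updateEntries = update;
        this.runDeltaDelete = delete;
    }
}
```

A client would then branch on the flags instead of on the mode name, which keeps the special cases out of the ConnectivityManager.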

=== Configuration ===

To configure the mode of DeltaIndexing execution, an additional parameter is needed in the IndexOrderConfiguration:

XML-Schema:

<source lang="xml">
...
<xs:element name="DeltaIndexingMode" type="DeltaIndexingModeType"/>

  <xs:simpleType name="DeltaIndexingModeType">
    <xs:annotation>
      <xs:appinfo>
        <jxb:class ref="org.eclipse.eilf.connectivity.framework.indexorder.messages.DeltaIndexingModeType"/>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:string">
      <xs:pattern value="FULL"/>
      <xs:pattern value="ADDITIVE"/>
      <xs:pattern value="INITIAL"/>
      <xs:pattern value="DISABLED"/>
    </xs:restriction>
  </xs:simpleType>
...
</source>

XML example:

<source lang="xml">
...
<DeltaIndexingMode>FULL</DeltaIndexingMode>
...
</source>


=== Implementation ===

==== Current Concept ====

The execution logic has to be added in parts to the CrawlerController (CrawlThread) and the ConnectivityManager. Therefore the mode has to be added to the ConnectivityManager interface. The problem is that initialize and finish still need to be called, and that the mode then controls if and how DeltaIndexing is used. This makes the usage and implementation of the ConnectivityManager more and more complex and obscure (too many special cases).


==== Alternative Concept ====

The execution logic has to be added either
* to the CrawlerController (CrawlThread) only, which then decides what actions to perform based on the given mode, or
* to the Crawlers themselves, if the more radical change is implemented.