Skip to main content
Jump to: navigation, search

SMILA/Specifications/DeltaIndexingAndConnectivtyDiscussion09/Usage of DeltaIndexingManager by CrawlerControler alone

< SMILA‎ | Specifications‎ | DeltaIndexingAndConnectivtyDiscussion09
Revision as of 13:38, 15 October 2009 by (Talk | contribs) (moved from parent page)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Usage of DeltaIndexingManager by CrawlerControler alone

{{note|implemented|date: ??}

  1. used by CrawlerController: This approach would not change much of the current programming logic. Only that the CrawlerController would communicate with two references instead of one. This approach was realized
  1. used by Crawlers: This is a radical change as this also affects the Crawler interface. Crawlers could directly communicate with the DeltaIndexingManager and provide only those Records that pass DeltaIndexing (are new, nedd an update). CrawlerController and Crawler could implement a Consumer/Producer pattern which should improve performance. No more sending of arrays with DIInformation and thereafter retrieving the Record objects. DeltaIndexing-Delete information is computed in the Crawler and can passed to the CrawlerController as regular Records (only the ID is set) and a delete flag to notify the CrawlerController that this Record is to be deleted. This should reduce communication overhead, as the DIInformation has not to be passed between multiple components and the whole process can work multithreaded. Of course this adds a lot more logic to the Crawler and demands more knowledge from a Crawler developer. It would also mean that ID and HASH are generated in the Crawler. The downside is that each Crawler has to implement the DeltaIndexing workflow themselves.
    We could even move all execution logic to the Crawler. CrawlerController would become obsolete. Then Crawlers would handle everything themselves - communication with DeltaIndexingManager, CoumpoundHandlers and ConnectivityManager. I think in this way the best performance can be achieved, as the setup is the very simple. No unnecessary passing of data between components. But a lot of logic has to be re-implemented in every Crawler. I wonder if there is a chance to minimize this.

Back to the top