Skip to main content
Jump to: navigation, search

SMILA/Specifications/DeltaIndexingAndConnectivtyDiscussion09/Usage of DeltaIndexingManager by CrawlerControler alone

Usage of DeltaIndexingManager by CrawlerControler alone

Here is another idea based on the changes introduced with SMILA/Specifications/DeltaIndexingAndConnectivtyDiscussion09/Separate_Interfaces_for_ConnectivityManager_and_DeltaIndexingManager but taking it further that not the CrawlerController communicates with DeltaIndexingManager but each Crawler.

{{note|implemented|date: ??}

  1. DeltaIndexing used by Crawlers: This is a radical change as this also affects the Crawler interface. Crawlers could directly communicate with the DeltaIndexingManager and provide only those Records that pass DeltaIndexing (are new, nedd an update). CrawlerController and Crawler could implement a Consumer/Producer pattern which should improve performance. No more sending of arrays with DIInformation and thereafter retrieving the Record objects. DeltaIndexing-Delete information is computed in the Crawler and can passed to the CrawlerController as regular Records (only the ID is set) and a delete flag to notify the CrawlerController that this Record is to be deleted. This should reduce communication overhead, as the DIInformation has not to be passed between multiple components and the whole process can work multithreaded. Of course this adds a lot more logic to the Crawler and demands more knowledge from a Crawler developer. It would also mean that ID and HASH are generated in the Crawler. The downside is that each Crawler has to implement the DeltaIndexing workflow themselves.
    We could even move all execution logic to the Crawler. CrawlerController would become obsolete. Then Crawlers would handle everything themselves - communication with DeltaIndexingManager, CoumpoundHandlers and ConnectivityManager. I think in this way the best performance can be achieved, as the setup is the very simple. No unnecessary passing of data between components. But a lot of logic has to be re-implemented in every Crawler. I wonder if there is a chance to minimize this.

Back to the top