Difference between revisions of "SMILA/Documentation/CrawlerController"

Revision as of 03:56, 21 April 2011

Overview

The CrawlerController is a component that manages and monitors Crawlers. Whenever a new crawl is triggered (via startCrawl()), a new instance of the Crawler in use is created, and the hash value of the crawler object is used as an id (called jobId) to identify records created by this crawler instance. This jobId is set as an annotation on all records and is also visible on the crawler instance in the JMX console.
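The following is a minimal sketch of the jobId idea, not the SMILA implementation. It assumes that "hash value" means the result of hashCode(); the class names JobIdSketch and DummyCrawler are hypothetical. Note that Java does not guarantee that two distinct objects have different hash codes, so such an id identifies an instance only with high probability.

```java
// Hypothetical sketch: derive a job id from a crawler instance's hash value.
class JobIdSketch {
    /** Stand-in for a real Crawler; this is not the SMILA interface. */
    static class DummyCrawler { }

    /** Renders the instance's hash code as a string job id (assumption: hashCode() is the hash used). */
    static String jobIdFor(Object crawler) {
        return String.valueOf(crawler.hashCode());
    }

    public static void main(String[] args) {
        DummyCrawler first = new DummyCrawler();
        DummyCrawler second = new DummyCrawler();
        // Each crawl run gets its own instance and therefore its own id.
        System.out.println("jobId of first run:  " + jobIdFor(first));
        System.out.println("jobId of second run: " + jobIdFor(second));
    }
}
```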

API

Current javadoc:

Implementations

It is possible to provide different implementations for the CrawlerController interface. At the moment there is one implementation available.

org.eclipse.smila.connectivity.framework.impl

This bundle contains the default implementation of the CrawlerController interface.

The CrawlerController implements the general processing logic common for all types of Crawlers. Its interface is a pure management interface that can be accessed by its Java interface or its wrapping JMX interface. It has references to the following OSGi services:

Crawler ComponentFactory
ConnectivityManager
DeltaIndexingManager (optional)
CompoundManager
ConfigurationManagement (t.b.d.)

Crawler Factories register themselves with the CrawlerController. Each time a crawl for a certain type of crawler is initiated, a new instance of that Crawler type is created via the Crawler ComponentFactory. This allows parallel crawling of data sources of the same type (e.g. several websites). Note that it is not possible to crawl the same data source concurrently!
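One way to picture the rule "one new Crawler instance per crawl, but never two crawls of the same data source at once" is a set of active data source ids. This is only a sketch under assumptions: CrawlGuardSketch, its nested Crawler interface, and the methods startCrawl/finishCrawl as written here are hypothetical and do not mirror the real CrawlerController code; the factory is modelled as a plain Supplier instead of an OSGi ComponentFactory.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of per-data-source crawl locking.
class CrawlGuardSketch {
    /** Hypothetical marker interface standing in for a Crawler. */
    interface Crawler { }

    // Ids of data sources that currently have a running crawl.
    private final Set<String> activeDataSources = ConcurrentHashMap.newKeySet();

    /** Creates a fresh crawler instance, or fails if the data source is already being crawled. */
    Crawler startCrawl(String dataSourceId, Supplier<Crawler> factory) {
        // add() returns false if the id is already present, i.e. a crawl is running.
        if (!activeDataSources.add(dataSourceId)) {
            throw new IllegalStateException("Data source is already being crawled: " + dataSourceId);
        }
        try {
            return factory.get();
        } catch (RuntimeException e) {
            // Release the lock if the instance could not be created.
            activeDataSources.remove(dataSourceId);
            throw e;
        }
    }

    /** Releases the data source so that it can be crawled again. */
    void finishCrawl(String dataSourceId) {
        activeDataSources.remove(dataSourceId);
    }
}
```

With this sketch, two websites of the same crawler type can be crawled in parallel because each call receives its own instance, while a second startCrawl for an id that is still active is rejected.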

This chart shows the current CrawlerController processing logic for one crawl run:

First the CrawlerController initializes DeltaIndexing for the current data source by calling DeltaIndexingManager::init(String) and also initializes a new Crawler (not shown)
it then executes the subprocess process crawler with the initialized Crawler
if no error has occurred so far, it performs the subprocess delete delta
finally it finishes the run by calling DeltaIndexingManager::finish(String)
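The four steps above can be sketched as a small control-flow skeleton. The interfaces DeltaIndexing and Step are hypothetical reductions introduced for illustration; only the method names init and finish come from the text. Whether finish is also called after an error is an assumption of this sketch and is not stated in the description.

```java
// Hypothetical sketch of one crawl run; not the SMILA implementation.
class CrawlRunSketch {
    /** Reduced stand-in for the DeltaIndexingManager service (only init and finish). */
    interface DeltaIndexing {
        void init(String dataSourceId);
        void finish(String dataSourceId);
    }

    /** Hypothetical hook for a subprocess such as "process crawler" or "delete delta". */
    interface Step {
        void run() throws Exception;
    }

    /** Runs one crawl and reports whether it completed without error. */
    static boolean crawlRun(String dataSourceId, DeltaIndexing delta, Step processCrawler, Step deleteDelta) {
        delta.init(dataSourceId);
        try {
            processCrawler.run();
            // Only reached if "process crawler" did not throw.
            deleteDelta.run();
            return true;
        } catch (Exception e) {
            return false;
        } finally {
            // Assumption: the delta indexing session is always closed.
            delta.finish(dataSourceId);
        }
    }
}
```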

Process Crawler

the CrawlerController checks if the given Crawler has more data available
YES: the CrawlerController checks each DataReference sent by the Crawler to see whether it needs to be updated, by calling DeltaIndexingManager::checkForUpdate(...)
- YES: the CrawlerController requests the complete record from the Crawler and checks if the record is a compound
  - YES: the subprocess process compounds is executed.
  - NO: no special actions are taken
- the record is added to the Queue by calling ConnectivityManager::add(...) and is marked as visited in the DeltaIndexingManager by calling DeltaIndexingManager::visit(...)
- NO: the DataReference is skipped. The DeltaIndexingManager has already set the visited flag for this Id internally
NO: return to the calling process
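The loop above can be illustrated with an in-memory model. Everything in this sketch is hypothetical: DataRef reduces a DataReference to an id plus a content hash, DeltaState stands in for the DeltaIndexingManager, and a plain list stands in for the Queue behind ConnectivityManager::add(...). The compound check is omitted here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch of the "process crawler" loop.
class ProcessCrawlerSketch {
    /** Reduced data reference: an id plus a hash of the content. */
    static final class DataRef {
        final String id;
        final String hash;
        DataRef(String id, String hash) {
            this.id = id;
            this.hash = hash;
        }
    }

    /** Stand-in for delta indexing state: last indexed hash per id, plus visited flags. */
    static final class DeltaState {
        final Map<String, String> hashes = new HashMap<>();
        final Set<String> visited = new HashSet<>();

        /** Returns true if the reference is new or changed; unchanged ids are flagged visited directly. */
        boolean checkForUpdate(DataRef ref) {
            boolean changed = !ref.hash.equals(hashes.get(ref.id));
            if (!changed) {
                visited.add(ref.id);
            }
            return changed;
        }

        /** Stores the new hash and marks the id as visited. */
        void visit(DataRef ref) {
            hashes.put(ref.id, ref.hash);
            visited.add(ref.id);
        }
    }

    /** Returns the ids that were added to the (simulated) queue. */
    static List<String> processCrawler(Iterator<DataRef> crawler, DeltaState delta) {
        List<String> queue = new ArrayList<>();
        while (crawler.hasNext()) {
            DataRef ref = crawler.next();
            if (delta.checkForUpdate(ref)) {
                // The complete record would be requested and checked for compounds here.
                queue.add(ref.id);
                delta.visit(ref);
            }
            // Otherwise the reference is skipped.
        }
        return queue;
    }
}
```

Running the loop twice against the same DeltaState shows the effect of delta indexing: the second run only queues references whose hash has changed.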

Process Compounds

Please see CompoundManagement for details on compound handling.

by calling CompoundManager::extract(Record, DataSourceConnectionConfig) the subprocess receives a CompoundCrawler that iterates over the elements of the compound record
the subprocess recursively calls the subprocess process crawler using the CompoundCrawler
the compound record is adapted according to the configuration (set to null, modified, or left unmodified) by calling CompoundManager::adaptCompoundRecord(Record, DataSourceConnectionConfig)
return to the calling process
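The recursion between process crawler and process compounds can be sketched as follows. The Rec class is a hypothetical stand-in for a record whose elements are already extracted; in SMILA the elements would come from the CompoundCrawler returned by the CompoundManager. The sketch assumes the "left unmodified" adaptation, so the compound record itself is queued after its elements.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of recursive compound handling.
class CompoundSketch {
    /** Reduced record: an id and, for compounds, the contained element records. */
    static final class Rec {
        final String id;
        final List<Rec> elements;

        Rec(String id, Rec... elements) {
            this.id = id;
            this.elements = Arrays.asList(elements);
        }

        boolean isCompound() {
            return !elements.isEmpty();
        }
    }

    /** Processes all records of a crawler; compounds are expanded recursively before being queued. */
    static void process(Iterable<Rec> crawler, List<String> queue) {
        for (Rec record : crawler) {
            if (record.isCompound()) {
                // The element list plays the role of the CompoundCrawler.
                process(record.elements, queue);
            }
            // Assumption: the compound record is left unmodified and queued as well.
            queue.add(record.id);
        }
    }
}
```

Because compounds may contain further compounds (for example an archive inside an archive), the recursion depth follows the nesting depth of the data.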

Delete Delta

by calling DeltaIndexingManager::obsoleteIdIterator(...) the subprocess receives an Iterator over all Ids that have to be deleted
for each Id ConnectivityManager::delete(...) is called
return to the calling process
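A minimal sketch of the delete delta step is shown below. It assumes that the obsolete ids are exactly those known from earlier runs that were not marked as visited in the current run; the text does not spell this out, so treat it as an assumption. The class DeleteDeltaSketch and its set-based signature are hypothetical, and adding an id to a list stands in for ConnectivityManager::delete(...).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the "delete delta" subprocess.
class DeleteDeltaSketch {
    /** Assumption: obsolete ids are the known ids that were not visited in this run. */
    static Iterator<String> obsoleteIdIterator(Set<String> knownIds, Set<String> visitedIds) {
        Set<String> obsolete = new TreeSet<>(knownIds); // sorted copy for a stable order
        obsolete.removeAll(visitedIds);
        return obsolete.iterator();
    }

    /** Returns the ids for which a delete would be issued. */
    static List<String> deleteDelta(Set<String> knownIds, Set<String> visitedIds) {
        List<String> deleted = new ArrayList<>();
        Iterator<String> ids = obsoleteIdIterator(knownIds, visitedIds);
        while (ids.hasNext()) {
            // Stands in for ConnectivityManager::delete(...).
            deleted.add(ids.next());
        }
        return deleted;
    }
}
```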

Note: The exact logic depends on the settings of DeltaIndexing in the data source configuration. Depending on the configured value, delta indexing logic is executed fully, partially or not at all.

Configuration

There are no configuration options available for this bundle.

JMX interface

Javadoc: org.eclipse.smila.connectivity.framework.CrawlerControllerAgent (http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/connectivity/framework/CrawlerControllerAgent.html)

Here is a screenshot of the CrawlerController in the JMX Console:

[Image: CrawlerControllerJMX.png]
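For readers unfamiliar with JMX, the following sketch shows how a management interface can be exposed as a standard MBean using only the JDK's javax.management API. It is not the CrawlerControllerAgent: the interface CrawlControlMBean, its methods, and the ObjectName "example.sketch:type=CrawlControl" are hypothetical and chosen only for illustration.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical sketch of wrapping a controller in a JMX MBean.
class JmxAgentSketch {
    /** Hypothetical management interface; standard MBeans require the "MBean" name suffix. */
    public interface CrawlControlMBean {
        int getActiveCrawlCount();
        void startCrawl(String dataSourceId);
    }

    /** Hypothetical implementation that only counts started crawls. */
    public static class CrawlControl implements CrawlControlMBean {
        private int active;

        @Override
        public synchronized int getActiveCrawlCount() {
            return active;
        }

        @Override
        public synchronized void startCrawl(String dataSourceId) {
            active++;
        }
    }

    /** Registers the MBean, starts two crawls through JMX, and reads the counter attribute. */
    static int demo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("example.sketch:type=CrawlControl");
            server.registerMBean(new CrawlControl(), name);
            try {
                String[] signature = {String.class.getName()};
                server.invoke(name, "startCrawl", new Object[] {"web1"}, signature);
                server.invoke(name, "startCrawl", new Object[] {"web2"}, signature);
                return (Integer) server.getAttribute(name, "ActiveCrawlCount");
            } finally {
                server.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException("JMX sketch failed", e);
        }
    }
}
```

While such an MBean is registered, a JMX console such as JConsole lists it under its ObjectName domain and lets an operator invoke the operation and read the attribute.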
