
SMILA/Documentation/Default configuration workflow overview

Latest revision as of 09:34, 24 January 2012

This page gives a short explanation of what happens behind the scenes when executing the SMILA in 5 Minutes example.

[Image: DefaultConfigurationWorkflow-1.0.png]

(download the archive DefaultConfigurationWorkflow-1.0.zip to get the original PowerPoint file of this diagram)

When crawling a web site with SMILA the following happens:

  1. The user starts a job with workflow updateIndex. Nothing else happens yet, the job waits for input to process.
  2. The user starts a job with workflow webCrawling in runOnce mode.
  3. The WebCrawler worker initiates the crawl process by reading the configured start URL. It extracts links and feeds them back to itself, and produces records with metadata and content. Additionally it marks links as visited so that other crawler worker instances will not produce duplicates.
  4. The DeltaChecker worker reads the records produced by the crawler and checks in the DeltaService if the crawled resources have changed since a previous crawl run. Unchanged resources are filtered out, only changed and new resources are sent to the next worker.
  5. The WebFetcher worker fetches the content of resources that do not have content yet. In this case these are the non-HTML resources, whose content was not needed by the crawler worker for link extraction.
  6. At the end of the crawl workflow, the UpdatePusher worker sends the crawled records with their content to the indexing job as added records and saves their current state in the delta service.
  7. Now the indexing job starts to work: the Bulkbuilder writes the records to be indexed into bulks, depending on whether they are to be added to or updated in the index, or whether they are to be deleted (which does not happen at this point).
  8. The PipelineProcessor worker picks up those record bulks and puts each record (in manageable numbers) on the blackboard ...
  9. ... and invokes a configured pipeline for either adding/updating or deleting records.
  10. The pipelets in the pipelines take the record data from the blackboard, transform the data, extract further metadata and plain text ...
  11. ... and manipulate the SolrIndex accordingly. The index can now be searched using yet another pipeline (not shown here).
  12. Finally (and not yet implemented), when the crawl workflow is done, the DeltaService can be asked for all records that have not been crawled in this run, so that delete records can be sent to the indexing workflow to remove these resources from the index.
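The first two steps above are plain REST calls to SMILA's job manager. The sketch below only builds the requests rather than sending them; the endpoint layout, the default port 8080, the `{"mode": "runOnce"}` body, and the job names `indexUpdateJob` and `webCrawlJob` are assumptions for illustration and may differ in your installation:

```python
import json

# Assumed REST root of the SMILA job manager on a default local installation.
BASE = "http://localhost:8080/smila/jobmanager"

def start_job_request(job_name, run_once=False):
    """Build the HTTP request that starts a run of the given job.

    A POST to jobs/<name>/ starts the run; a body of {"mode": "runOnce"}
    makes the run finish on its own once all input is processed
    (endpoint and body format are assumptions, not verified API docs).
    """
    url = f"{BASE}/jobs/{job_name}/"
    body = json.dumps({"mode": "runOnce"}) if run_once else None
    return ("POST", url, body)

# Step 1: start the indexing job; it then waits for input to process.
method, url, body = start_job_request("indexUpdateJob")

# Step 2: start the crawl job in runOnce mode; it ends when crawling is done.
method2, url2, body2 = start_job_request("webCrawlJob", run_once=True)
```

Sending these requests with any HTTP client reproduces steps 1 and 2; everything after that happens inside the workers.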

All records produced in this process are stored in the ObjectStore while being passed from one worker to the next. The job/task management uses Apache ZooKeeper to coordinate the work when multiple SMILA nodes are used to parallelize it.
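The hand-off between the DeltaChecker and UpdatePusher workers (steps 4 and 6) can be sketched as follows. This is a toy simulation, not SMILA's implementation: the in-memory dict standing in for the DeltaService, the SHA-1 hashing, and the record shape are all illustrative assumptions.

```python
import hashlib

# Stand-in for the DeltaService: maps record ID -> last seen content hash.
delta_service = {}

def delta_check(bulk):
    """DeltaChecker: pass on only records that are new or changed."""
    changed = []
    for record in bulk:
        digest = hashlib.sha1(record["content"].encode()).hexdigest()
        if delta_service.get(record["id"]) != digest:
            changed.append(record)
    return changed

def push_update(bulk):
    """UpdatePusher: save the current state so the next run can skip it."""
    for record in bulk:
        digest = hashlib.sha1(record["content"].encode()).hexdigest()
        delta_service[record["id"]] = digest

bulk = [{"id": "http://example.org/a", "content": "hello"},
        {"id": "http://example.org/b", "content": "world"}]

first = delta_check(bulk)   # first run: both records are new
push_update(first)
second = delta_check(bulk)  # second run: nothing changed, empty result
```

On a repeated crawl with unchanged content, the second pass filters everything out, which is exactly why only changed and new resources reach the fetcher and the index.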

Crawling a filesystem works similarly; the "fileCrawling" workflow simply replaces the "WebCrawler" and "WebFetcher" workers with "FileCrawler" and "FileFetcher" workers.
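Conceptually, such a workflow definition is just a named chain of workers, so swapping the two crawl-specific workers yields the filesystem variant. The fragment below only illustrates that idea; the actual SMILA workflow schema (field names, bucket connections between workers, parameters) is not shown here and these keys are assumptions:

```json
{
  "name": "fileCrawling",
  "startAction": { "worker": "FileCrawler" },
  "actions": [
    { "worker": "DeltaChecker" },
    { "worker": "FileFetcher" },
    { "worker": "UpdatePusher" }
  ]
}
```

The "webCrawling" workflow would look the same with "WebCrawler" and "WebFetcher" in place of the two file workers.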
