
SMILA/Documentation/Default configuration workflow overview

Revision as of 07:39, 24 January 2012

DefaultConfigurationWorkflow.png

(original slides can be found here: DefaultConfigurationWorkflow.zip)

  1. The user starts a job with workflow updateIndex. Nothing else happens yet; the job waits for input to process.
  2. The user starts a job with workflow webCrawling in runOnce mode.
  3. The WebCrawler worker initiates the crawl process by reading the configured start URL. It extracts links and feeds them back to itself, and produces records with metadata and content. Additionally, it marks links as visited so that other crawler worker instances will not produce duplicates.
  4. The DeltaChecker worker reads the records produced by the crawler and checks in the DeltaService whether the crawled resources have changed since a previous crawl run. Unchanged resources are filtered out; only changed and new resources are sent to the next worker.
  5. The WebFetcher worker fetches the content of resources that do not have any yet; in this case, non-HTML resources, whose content was not needed by the crawler worker for link extraction.
  6. At the end of the crawl workflow, the UpdatePusher worker sends the crawled records with their content to the indexing job as added records and saves their current state in the delta service.
  7. Now the indexing job starts to work: the Bulkbuilder writes the records to be indexed into bulks, depending on whether they are to be added to or updated in the index, or deleted (which does not happen at this point).
  8. The PipelineProcessor worker picks up those record bulks and puts each record (in manageable numbers) on the blackboard ...
  9. ... and invokes a configured pipeline for either adding/updating or deleting records.
  10. The pipelets in the pipelines take the record data from the blackboard ...
  11. ... and manipulate the SolrIndex accordingly. The index can now be searched using yet another pipeline (not shown here).
  12. Finally (and not yet implemented), when the crawl workflow is done, the DeltaService can be asked for all records that have not been crawled in this run, so that delete records can be sent to the indexing workflow to remove these resources from the index.
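Steps 1 and 2 above amount to two calls against SMILA's job manager REST API. The following is a minimal sketch of how such requests could be built, assuming a local SMILA instance on its default port; the endpoint layout, the payload, and the job names indexUpdateJob and crawlWebJob are illustrative assumptions, not taken from a running SMILA installation:

```python
import json

# Default SMILA REST base URL (assumption; adjust to your installation).
BASE_URL = "http://localhost:8080/smila/jobmanager"

def start_job_request(job_name, run_once=False):
    """Build the (url, body) pair for starting a job run.

    POSTing to /jobs/<name>/ starts a job run; a {"mode": "runOnce"}
    payload asks for a run that finishes once all input is processed.
    (Endpoint path and payload are assumptions for illustration.)
    """
    url = "%s/jobs/%s/" % (BASE_URL, job_name)
    body = json.dumps({"mode": "runOnce"}) if run_once else ""
    return url, body

# Step 1: start the indexing job; it then waits for input.
index_url, index_body = start_job_request("indexUpdateJob")
# Step 2: start the crawl job in runOnce mode; it ends when the crawl is done.
crawl_url, crawl_body = start_job_request("crawlWebJob", run_once=True)
```

Sending these requests (e.g. with any HTTP client) in that order reproduces the two job starts; the indexing job must be running first so the crawl's records have somewhere to go.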
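The DeltaChecker's filtering in step 4 can be pictured as a hash comparison against the state saved by a previous run. A toy sketch of that idea; the record fields and the hash scheme are illustrative and not SMILA's actual DeltaService API:

```python
import hashlib

def delta_hash(record):
    # Fingerprint the metadata the crawler produced (illustrative; the
    # real DeltaService computes its own delta hash).
    meta = "%s|%s" % (record.get("lastModified", ""), record.get("size", ""))
    return hashlib.sha1(meta.encode("utf-8")).hexdigest()

def filter_changed(records, delta_state):
    """Keep only new or changed records, as the DeltaChecker worker does."""
    return [r for r in records if delta_state.get(r["url"]) != delta_hash(r)]

# State saved by a previous crawl run (url -> hash).
previous = {"url": "http://example.org/a", "lastModified": "t1", "size": 10}
delta_state = {previous["url"]: delta_hash(previous)}

this_run = [
    {"url": "http://example.org/a", "lastModified": "t1", "size": 10},  # unchanged: dropped
    {"url": "http://example.org/b", "lastModified": "t1", "size": 20},  # new: kept
]
changed = filter_changed(this_run, delta_state)
```

Only the new resource survives the filter and is passed on to the WebFetcher worker; the unchanged one is dropped, which is what keeps re-crawls cheap.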
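The Bulkbuilder's job in step 7 is essentially to partition incoming records by operation and cut them into bounded bulks for asynchronous processing. A sketch of that grouping; the bulk size limit and the record shape are illustrative, not SMILA's actual limits:

```python
def build_bulks(records, max_bulk_size=2):
    """Group records by operation into bounded bulks, roughly what the
    Bulkbuilder does before handing bulks to the workflow."""
    bulks = {"add": [], "delete": []}
    for record in records:
        op = record.get("_operation", "add")  # records default to add/update
        if not bulks[op] or len(bulks[op][-1]) >= max_bulk_size:
            bulks[op].append([])  # no open bulk, or current one full: start a new one
        bulks[op][-1].append(record)
    return bulks

records = [
    {"id": "r1"}, {"id": "r2"}, {"id": "r3"},  # added/updated records
    {"id": "r4", "_operation": "delete"},      # a delete record (see step 12)
]
bulks = build_bulks(records)
# the three add records are split into bulks of at most two records each
```

Each resulting bulk then becomes one unit of work for a downstream worker such as the PipelineProcessor, which is what makes the workflow scale out.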
