


SMILA/Documentation/Default configuration workflow overview

[[Image:Schema_.jpg]]
 
== The diagram description ==
 
* 1. Data is imported via [[SMILA/Documentation/Crawler|Crawler]] (or [[SMILA/Documentation/Agent|Agent]]) by configuring a data source and a job name via the [[SMILA/Documentation/CrawlerController|Crawler Controller]] (resp. [[SMILA/Documentation/AgentController|Agent Controller]]) JMX API (see the JMX sketch after this list).
* 2. The [[SMILA/Documentation/CrawlerController|Crawler Controller]] initializes the [[SMILA/Documentation/Crawler|Crawler]] by assigning a data source and starting the import.
* 3. The [[SMILA/Documentation/Crawler|Crawler]] retrieves data references from the '''Data Source''' and returns them to the [[SMILA/Documentation/CrawlerController|Crawler Controller]].
* 4. The [[SMILA/Documentation/CrawlerController|Crawler Controller]] determines whether this particular data is new/modified or was already indexed by querying the '''Delta Indexing Service''' (a conceptual sketch of this check follows after this list).
* 5. If the data was not previously indexed, the [[SMILA/Documentation/CrawlerController|Crawler Controller]] instructs the [[SMILA/Documentation/Crawler|Crawler]] to retrieve the full data plus content as record (metadata + attachment).  
* 6. The [[SMILA/Documentation/Crawler|Crawler]] fetches the complete record from the '''Data Source'''. Each record has an ID and can contain metadata and attachments (binary content).
* 7. The [[SMILA/Documentation/CrawlerController|Crawler Controller]] sends the complete retrieved records to the [[SMILA/Documentation/ConnectivityManager|Connectivity Manager]].
* 8. The [[SMILA/Documentation/ConnectivityManager|Connectivity Manager]] routes the records to the configured job by pushing them to the [[SMILA/Documentation/Bulkbuilder|Bulkbuilder]].  
* 9. The [[SMILA/Documentation/Bulkbuilder|Bulkbuilder]] persists the record's attachment content via the [[SMILA/Documentation/Usage_of_Blackboard_Service|Blackboard]] in the [[SMILA/Documentation/Binary_Storage|Binary Storage]]. Only attachment references remain in the records. Should any subsequent processes require the record's full content, they can access it via the [[SMILA/Documentation/Usage_of_Blackboard_Service|Blackboard]].
* 10. Records are accumulated into bulks for asynchronous workflow processing. Record bulks are stored in '''ObjectStore'''.
* 11. An asynchronous workflow run is triggered by the record bulks generated by the [[SMILA/Documentation/Bulkbuilder|Bulkbuilder]]. This is managed by the [[SMILA/Documentation/JobManager|Jobmanager]] and [[SMILA/Documentation/TaskManager|Taskmanager]] components. Runtime/synchronization data is stored in '''Zookeeper''', persistent data is stored in '''ObjectStore'''.
* 12. The predefined asynchronous workflow ''indexUpdate'' contains a [[SMILA/Documentation/Worker/PipelineProcessingWorker|BPEL worker]] for embedding (i.e. executing) synchronous BPEL pipelines in the asynchronous workflow. Added records are passed to the predefined BPEL pipeline ''AddPipeline'', deleted records to the ''DeletePipeline''. A BPEL pipeline is a process that uses a set of ''pipelets'' to process a record's data (e.g. extracting text from various document or image file types); a simplified pipelet sketch follows after this list.
* 13. After processing the records, the pipelets store the gathered additional data via the [[SMILA/Documentation/Usage_of_Blackboard_Service|Blackboard]] service.
* 14. The Add- and DeletePipeline finally invoke the [[SMILA/Documentation/LuceneIndexPipelet|LuceneIndexPipelet]] to update the '''Lucene Index'''.
 
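The following is a minimal sketch of how step 1 could look from plain Java code, using the standard JMX client API. The host, port, MBean object name (<tt>SMILA:service=CrawlerController</tt>), operation name (<tt>startCrawlerTask</tt>) and the data source and job names are assumptions made for illustration only; check the JMX console of your SMILA installation for the actual names and signatures. The same call can also be issued interactively via a JMX console such as JConsole.

<source lang="java">
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StartCrawlSketch {

  public static void main(String[] args) throws Exception {
    // Connect to the JMX agent of a locally running SMILA instance.
    // Host and port are assumptions - adjust them to your installation.
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9004/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection connection = connector.getMBeanServerConnection();
      // Hypothetical object name of the Crawler Controller MBean.
      ObjectName crawlerController = new ObjectName("SMILA:service=CrawlerController");
      // Hypothetical operation: crawl the data source "file" and push the
      // imported records into the job "indexUpdateJob" (steps 1 and 2).
      connection.invoke(crawlerController, "startCrawlerTask",
          new Object[] { "file", "indexUpdateJob" },
          new String[] { String.class.getName(), String.class.getName() });
    } finally {
      connector.close();
    }
  }
}
</source>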
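
Steps 4 and 5 can be illustrated with the following conceptual sketch. It is not SMILA's actual Delta Indexing Service API; it only shows the idea behind the check: a record is fetched and (re)indexed only if a hash describing its current state (e.g. derived from its last-modified date) differs from the hash remembered for its ID during the previous run.

<source lang="java">
import java.util.HashMap;
import java.util.Map;

/**
 * Conceptual illustration of delta indexing (steps 4 and 5), not SMILA's API.
 * A hash per record ID is remembered; a record is only re-imported if that
 * hash has changed since the last crawl run.
 */
public class DeltaIndexingSketch {

  private final Map<String, String> knownHashes = new HashMap<String, String>();

  /** Returns true if the record is new or modified and must be fetched. */
  public boolean needsUpdate(String recordId, String currentHash) {
    String previousHash = knownHashes.get(recordId);
    return previousHash == null || !previousHash.equals(currentHash);
  }

  /** Remembers the hash after the record has been processed successfully. */
  public void markProcessed(String recordId, String currentHash) {
    knownHashes.put(recordId, currentHash);
  }
}
</source>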
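
Steps 12 and 13 revolve around pipelets. The sketch below is a strongly simplified, hypothetical pipelet that only illustrates the pattern of reading record data, deriving something new and writing it back; real pipelets implement SMILA's pipelet interface, are configured inside a BPEL pipeline and read and write their records through the [[SMILA/Documentation/Usage_of_Blackboard_Service|Blackboard]] service.

<source lang="java">
import java.util.Map;

/**
 * Strongly simplified pipelet sketch (steps 12 and 13). Real pipelets
 * implement SMILA's pipelet interface and access records via the Blackboard;
 * here a record is represented by a plain metadata map for illustration only.
 */
public class TitleNormalizerPipeletSketch {

  /** Derives a normalized title attribute from the record's "Title" value. */
  public void process(Map<String, Object> recordMetadata) {
    Object title = recordMetadata.get("Title");
    if (title instanceof String) {
      // Store the derived value back into the record, analogous to a real
      // pipelet storing extracted text via the Blackboard service (step 13).
      recordMetadata.put("NormalizedTitle", ((String) title).trim().toUpperCase());
    }
  }
}
</source>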
