SMILA/Documentation/Default configuration workflow overview

[[Image:Schema_.jpg]]

== The diagram description ==
* 1. Data is imported via a '''Crawler''' (or an '''Agent''') by configuring a data source and a job name through the '''Crawler Controller''' (or '''Agent Controller''', respectively) JMX API (see the JMX sketch after this list).
* 2. The '''Crawler Controller''' initializes the '''Crawler''' by assigning a data source to it and starting the import.
* 3. The '''Crawler''' retrieves data references from the '''Data Source''' and returns them to the '''Crawler Controller'''.
* 4. The '''Crawler Controller''' determines whether this particular data is new/modified or was already indexed by querying the '''Delta Indexing Service''' (a simplified sketch of this check follows the list).
* 5. If the data was not previously indexed, the '''Crawler Controller''' instructs the '''Crawler''' to retrieve the full data as a record (metadata plus attachments).
* 6. The '''Crawler''' fetches the complete record from the '''Data Source'''. Each record has an ID and can contain metadata and attachments (binary content).
* 7. The '''Crawler Controller''' sends the complete retrieved records to the '''Connectivity Manager'''.
* 8. The '''Connectivity Manager''' routes the records to the configured job by pushing them to the [[SMILA/Documentation/Bulkbuilder|Bulkbuilder]].
* 9. The [[SMILA/Documentation/Bulkbuilder|Bulkbuilder]] persists the record's attachment content via the [[SMILA/Documentation/Usage_of_Blackboard_Service|Blackboard]] in the '''Binary Storage'''; only attachment references remain in the records (see the bulk-building sketch after this list). The [[SMILA/Documentation/Usage_of_Blackboard_Service|Blackboard]] provides central access to the record storage layer for all SMILA components, so clients do not need to know about the underlying persistence technology. Should any subsequent process require the record's full content, it can access it via the [[SMILA/Documentation/Usage_of_Blackboard_Service|Blackboard]].
* 10. Records are accumulated into bulks for asynchronous workflow processing. Record bulks are stored in the '''ObjectStore'''.
* 11. An [[#W|asynchronous workflow run]] is triggered by the record bulks generated by the [[SMILA/Documentation/Bulkbuilder|Bulkbuilder]]. The run is managed by the [[SMILA/Documentation/JobManager|Jobmanager]] and [[SMILA/Documentation/TaskManager|Taskmanager]] components: runtime/synchronization data is stored in '''Zookeeper''', persistent data in the '''ObjectStore'''.
* 12. The predefined asynchronous workflow "indexUpdate" contains a [[SMILA/Documentation/Worker/PipelineProcessingWorker|BPEL worker]] for embedding (i.e. executing) synchronous BPEL pipelines in the asynchronous workflow. Added records are passed to the predefined BPEL pipeline ''AddPipeline'', deleted records to the ''DeletePipeline''. A BPEL pipeline is a process that uses a set of ''pipelets'' to process a record's data, e.g. extracting text from various document or image file types (see the pipelet sketch after this list). BPEL is an XML-based language for defining business processes by orchestrating loosely coupled (web) services; BPEL processes require a BPEL runtime environment (e.g. Apache ODE).
* 13. After processing the records, the pipelets store the gathered additional data via the [[SMILA/Documentation/Usage_of_Blackboard_Service|Blackboard]] service.
* 14. The ''AddPipeline'' and ''DeletePipeline'' finally invoke the [[SMILA/Documentation/LuceneIndexPipelet|LuceneIndexPipelet]] to update the '''Lucene Index'''.
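
The following sketch illustrates step 1 from plain Java: an import is triggered by invoking an operation on the Crawler Controller MBean over JMX. The service URL, the MBean name and the operation name and signature used here are assumptions made for this example only; consult the Crawler Controller documentation for the actual names. The same call can of course be issued interactively from a JMX console such as JConsole.

<source lang="java">
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/**
 * Step 1 from plain Java: connect to the JMX server and invoke an operation on the
 * Crawler Controller MBean. The service URL, MBean name, operation name and its
 * signature are ASSUMPTIONS made for this illustration, not verified SMILA names.
 */
public class StartImportSketch {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9004/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection connection = connector.getMBeanServerConnection();
      // Assumed MBean name; the data source "file" feeds the job "indexUpdateJob".
      ObjectName crawlerController = new ObjectName("SMILA:service=CrawlerController");
      connection.invoke(crawlerController, "startCrawlerTask",
          new Object[] { "file", "indexUpdateJob" },
          new String[] { String.class.getName(), String.class.getName() });
    } finally {
      connector.close();
    }
  }
}
</source>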
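
To make the delta-indexing check of steps 3 to 5 more concrete, here is a minimal, self-contained sketch: the hash of a freshly crawled data reference is compared against the hash remembered from the previous run, and only new or changed records are fetched in full. The class and method names are hypothetical and do not reflect the actual Delta Indexing Service API.

<source lang="java">
import java.util.HashMap;
import java.util.Map;

/**
 * Illustration of the delta-indexing decision from steps 3 to 5. The names are
 * hypothetical and do not reflect the real Delta Indexing Service API; only the
 * idea of comparing hashes between crawl runs is taken from the description above.
 */
public class DeltaIndexingSketch {

  /** Maps a record ID to the hash that was seen during the last successful import. */
  private final Map<String, String> storedHashes = new HashMap<>();

  /** True if the reference is new or changed and the full record must be fetched (step 5). */
  boolean needsUpdate(String recordId, String currentHash) {
    String lastHash = storedHashes.get(recordId);
    return lastHash == null || !lastHash.equals(currentHash);
  }

  /** Remember the hash once the record has been processed, for the next crawl run. */
  void markVisited(String recordId, String currentHash) {
    storedHashes.put(recordId, currentHash);
  }

  public static void main(String[] args) {
    DeltaIndexingSketch delta = new DeltaIndexingSketch();
    // A hash could be derived from last-modified date and file size, for example.
    String id = "file:/data/report.pdf";
    String hash = "2011-09-05T08:44|1024";
    System.out.println(delta.needsUpdate(id, hash)); // true: never seen before
    delta.markVisited(id, hash);
    System.out.println(delta.needsUpdate(id, hash)); // false: unchanged since last run
  }
}
</source>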
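
The next sketch illustrates steps 9 and 10: attachment content is stored separately and replaced by a reference, while the slimmed-down records are collected into bulks that later feed the asynchronous workflow. All types used here are simplified stand-ins, not the real Bulkbuilder, Blackboard or ObjectStore interfaces.

<source lang="java">
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-ins for steps 9 and 10; not the real SMILA interfaces. */
public class BulkBuildingSketch {

  /** Stand-in for a record: an ID, metadata, and at most one attachment reference. */
  static class Record {
    final String id;
    final Map<String, String> metadata = new HashMap<>();
    String attachmentRef; // only the reference stays in the record
    Record(String id) { this.id = id; }
  }

  /** Stand-in for the Binary Storage: attachment content addressed by a reference key. */
  private final Map<String, byte[]> binaryStorage = new HashMap<>();

  private final List<Record> currentBulk = new ArrayList<>();
  private final int bulkSize;

  BulkBuildingSketch(int bulkSize) { this.bulkSize = bulkSize; }

  /** Stores the attachment separately, keeps a reference in the record, and collects the record. */
  void add(Record record, byte[] attachment) {
    if (attachment != null) {
      String ref = record.id + "#attachment";
      binaryStorage.put(ref, attachment);
      record.attachmentRef = ref;
    }
    currentBulk.add(record);
    if (currentBulk.size() >= bulkSize) {
      flushBulk();
    }
  }

  /** In SMILA the finished bulk would go to the ObjectStore and trigger a workflow run. */
  private void flushBulk() {
    System.out.println("bulk of " + currentBulk.size() + " records is ready for processing");
    currentBulk.clear();
  }

  public static void main(String[] args) {
    BulkBuildingSketch builder = new BulkBuildingSketch(2);
    Record r1 = new Record("file:/data/a.txt");
    r1.metadata.put("Title", "A");
    builder.add(r1, "content of a".getBytes());
    Record r2 = new Record("file:/data/b.txt");
    r2.metadata.put("Title", "B");
    builder.add(r2, "content of b".getBytes()); // second record completes the bulk
  }
}
</source>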
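
Finally, a sketch for step 12: a pipelet is a small unit of processing logic that a pipeline applies to each record, for example to extract plain text. The interface below only mimics that idea with a hypothetical record view; real pipelets operate on records via the Blackboard and are wired into the BPEL pipeline definitions.

<source lang="java">
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pipelet contract for illustration; not the real SMILA pipelet API. */
public class PipeletSketch {

  /** Simplified record view: read and write named attribute values. */
  interface RecordAccess {
    String get(String attribute);
    void set(String attribute, String value);
  }

  /** A pipelet transforms one record and hands it on to the next step of the pipeline. */
  interface Pipelet {
    void process(RecordAccess record);
  }

  /** Example pipelet: derive a plain-text attribute from markup content (step 12's "extracting text"). */
  static class ExtractTextPipelet implements Pipelet {
    @Override
    public void process(RecordAccess record) {
      String raw = record.get("Content");
      if (raw != null) {
        record.set("Text", raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim());
      }
    }
  }

  public static void main(String[] args) {
    final Map<String, String> data = new HashMap<>();
    data.put("Content", "<html><body>Hello <b>SMILA</b></body></html>");
    RecordAccess record = new RecordAccess() {
      public String get(String attribute) { return data.get(attribute); }
      public void set(String attribute, String value) { data.put(attribute, value); }
    };
    new ExtractTextPipelet().process(record);
    System.out.println(data.get("Text")); // prints "Hello SMILA"
  }
}
</source>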