SMILA/Documentation/Default configuration workflow overview

== 3rd Party Table ==

The following table lists the third-party bundles used in the default configuration, the corresponding Eclipse CQ (contribution questionnaire) numbers where available, the steps of the diagram (described below) in which they are used, and their purpose.

{{CTable|tableWidth=64%}}
| 3rd Party Bundle || Eclipse CQ || Used in Diagram Steps || Purpose
|-
| com.sleepycat.db.linux.x86 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.db.linux.x86_64 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.db.win32 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.dbxml || 2574 || 8, 12 || Record persistence layer
|-
| com.sleepycat.dbxml.linux.x86 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.dbxml.linux.x86_64 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.dbxml.win32 || ... || 8, 12 || Record persistence layer
|-
| com.sun.jaxb || 2664 || 2 - 13 || XML binding
|-
| javax.el || 2683 || 1 || Dependency of the org.apache.tomcat bundle
|-
| javax.servlet.jsp || 2685 || 1 - 7 || JSP package. Used for the user interface (e.g. search form)
|-
| javax.xml.bind || 2686 || 2 - 13 || XML binding
|-
| javax.xml.stream || 2684 || 2 - 13 || XML storage
|-
| javax.xml.xquery || 2668 || 8, 12 || XML storage
|-
| net.sf.joost || 2590 || 11 - 13 || STX language processor
|-
| org.apache.activemq.core || 2580 || 9 - 11 || JMS implementation used for communication between Connectivity and the Data Flow Process (DFP)
|-
| org.apache.commons.io || 2677 || 1 - 14 || Basic helper classes for common I/O operations. Used throughout SMILA
|-
| org.apache.commons.lang || 2678 || 1 - 14 || Additional helper classes for the java.lang package. Used throughout SMILA
|-
| org.apache.commons.logging || 2682 || 1 - 14 || Abstraction layer for runtime logging. Used throughout SMILA
|-
| org.apache.commons.vfs || ... || 8, 12 || Abstraction layer for various (distributed) file systems. Used in the storage layer
|-
| org.apache.tomcat || 2561 || 1 || JSP/Servlet container providing the user interface (e.g. search form)
|-
| org.apache.log4j || 2555 || 1 - 14 || Runtime logging. Used throughout SMILA
|-
| org.apache.lucene || 2556 || 13, 14 || Lucene Index
|-
| org.apache.lucene.analysis || 2557 / 2603 || 13, 14 || Lucene Index
|-
| org.apache.lucene.search.highlight || 2558 || 13, 14 || Lucene Index
|-
| org.apache.ode || ... || 11 - 13 || BPEL runtime environment executing the pipelets in the Data Flow Process
|-
| org.custommonkey.xmlunit || 2617 || 11 - 13 || XML extension for JUnit. Used for testing pipelets
|-
| org.w3c.tidy || 2589 || 11 - 13 || HTML clean-up tool
|}
 


[[Image:Schema.jpg]]


== Diagram Description ==

  • 1. Creation or update of an index is initiated by a user request. The user sends the name of a configuration file (the so-called IndexOrderConfiguration) to the Crawler Controller.

The configuration file describes which Crawler should access a specific Data Source. Furthermore, it contains all necessary parameters for the crawling process.
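To make steps 1 and 2 concrete, here is a minimal sketch of this interaction; the CrawlerController interface, its method, and the configuration file name are hypothetical illustrations, not SMILA's actual API.

 // Hypothetical sketch of steps 1-2; all names are illustrative only.
 public interface CrawlerController {
     /**
      * Starts a crawl for the given index order configuration. The named file
      * identifies the Crawler implementation, the Data Source it should
      * access, and all parameters for the crawling process.
      */
     void startCrawl(String indexOrderConfigurationName);
 }

 // A user (or admin client) would trigger index creation like this:
 // crawlerController.startCrawl("file_system_index_order.xml");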

  • 2. The Crawler Controller initializes the Crawlers:
    • a. The Crawler Controller reads the index order configuration file and assigns the Data Source from the configuration file to the Crawler.
    • b. The Crawler Controller starts the Crawler’s thread.
  • 3. The Crawler retrieves data records from the Data Source and returns to the Crawler Controller those attributes of each record that are required to generate the record’s ID and hash.
  • 4. The Crawler Controller generates the ID and hash and then determines whether this particular record is new or has already been indexed by querying the Delta Indexing Service (a sketch of steps 3 through 8 follows below).
  • 5. If the record has not been indexed before, the Crawler Controller instructs the Crawler to retrieve the full content of the record (metadata and content).
  • 6. The Crawler fetches the complete record from the Data Source. Each record has an ID and can contain attributes, attachments (binary content), and annotations.
  • 7. The Crawler Controller sends the complete retrieved records to the Connectivity module.
  • 8. The Connectivity module in turn persists the record to the storage tier via the Blackboard Service.

The Blackboard provides central access to the record storage layer for all SMILA components and thus effectively constitutes an abstraction layer: clients do not need to know about the underlying persistence and storage technologies.

Record data without attachments is stored in the XML-Storage; attachments (binary content) are stored in the Bin-Storage.
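The crawl loop in steps 3 through 8 can be made concrete with a small sketch; every type and method name below is a hypothetical illustration of the described interaction, not SMILA's actual API.

 // Hypothetical sketch of steps 3-8; all names are illustrative only.
 final class CrawlLoopSketch {
     interface Record { /* attributes, attachments, annotations */ }
     interface Crawler extends Iterable<Record> {
         String[] keyAttributes(Record partial);  // step 3: attributes for ID/hash
         Record fetchFullContent(Record partial); // step 6: metadata + content
     }
     interface DeltaIndexingService {
         boolean isNewOrChanged(String id, String hash); // step 4
     }
     interface Connectivity {
         void add(Record completeRecord); // steps 7-8: persisted via the Blackboard
     }

     static void crawl(Crawler crawler, DeltaIndexingService deltaIndexing,
             Connectivity connectivity) {
         for (Record partial : crawler) { // step 3: partial records only
             String id = String.join("/", crawler.keyAttributes(partial));
             String hash = Integer.toHexString(
                     java.util.Arrays.hashCode(crawler.keyAttributes(partial)));
             if (deltaIndexing.isNewOrChanged(id, hash)) { // step 4
                 // steps 5-6: only now is the full record fetched
                 Record complete = crawler.fetchFullContent(partial);
                 connectivity.add(complete); // steps 7-8
             }
         }
     }
 }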

  • 9. In the next step the Connectivity module transmits the record to the Router, which is part of the Connectivity module.

The Router filters the record’s attributes according to its configuration. Note: usually only the ID is passed on (this is defined in a filter configuration file).

After processing and filtering the record, the Router pushes it into a JMS message queue (ActiveMQ) for further processing. Because the Router filters the record, only the necessary data (such as the ID) travels through the queue.

Should any subsequent processes require the record’s full content, they can access it via the Blackboard.
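A minimal sketch of step 9's hand-off, using the plain JMS API with ActiveMQ (listed in the 3rd Party Table above); the queue name, broker URL, and message layout are assumptions for illustration, not SMILA's actual configuration.

 import javax.jms.Connection;
 import javax.jms.MessageProducer;
 import javax.jms.Session;
 import javax.jms.TextMessage;

 import org.apache.activemq.ActiveMQConnectionFactory;

 // Sketch of step 9: after filtering, the Router publishes only the record ID
 // to the queue. Queue name and message layout are assumed for illustration.
 public final class RouterSketch {
     public static void publishRecordId(String recordId) throws Exception {
         Connection connection =
                 new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
         try {
             Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
             MessageProducer producer =
                     session.createProducer(session.createQueue("SMILA.connectivity"));
             // Only the ID travels through the queue; consumers load the full
             // record from the Blackboard when they need it.
             TextMessage message = session.createTextMessage(recordId);
             producer.send(message);
         } finally {
             connection.close();
         }
     }
 }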

  • 10. A Listener subscribing to the Queue’s topic receives the message and invokes further processing logic according to its configuration.
  • 11. The Listener passes the record to the respective pipeline: the Add pipeline if it is a new record, or the Delete pipeline if the record needs to be removed from the index (because it has been deleted from the original Data Source).

A pipeline is a BPEL process using a set of pipelets and services to process a Record’s data (e.g. extracting text from various document or image file types).

BPEL is an XML-based language to define business processes by means of orchestrating loosely coupled (web) services. BPEL processes require a BPEL runtime environment (e.g. Apache ODE).
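Steps 10 and 11 can be sketched as a JMS MessageListener; the JMS callback API is real, while the Pipeline interface and the "operation" message property are hypothetical stand-ins for the Listener's configuration.

 import javax.jms.JMSException;
 import javax.jms.Message;
 import javax.jms.MessageListener;
 import javax.jms.TextMessage;

 // Sketch of steps 10-11: a Listener receives the queued message and
 // dispatches the record to the Add or Delete pipeline. The Pipeline
 // interface and the "operation" property are hypothetical illustrations.
 public final class ListenerSketch implements MessageListener {
     interface Pipeline { void process(String recordId); } // hypothetical BPEL front-end

     private final Pipeline addPipeline;
     private final Pipeline deletePipeline;

     ListenerSketch(Pipeline addPipeline, Pipeline deletePipeline) {
         this.addPipeline = addPipeline;
         this.deletePipeline = deletePipeline;
     }

     @Override
     public void onMessage(Message message) {
         try {
             String recordId = ((TextMessage) message).getText();
             // route by a message property set by the Router (assumed layout)
             if ("DELETE".equals(message.getStringProperty("operation"))) {
                 deletePipeline.process(recordId); // record vanished from the source
             } else {
                 addPipeline.process(recordId);    // new or updated record
             }
         } catch (JMSException e) {
             throw new RuntimeException(e);
         }
     }
 }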

  • 12. After processing the record, the pipelets/services store the gathered additional data via the Blackboard service. The Listener then sends the record (which can be filtered, too) back to the queue, where other Data Flow Processes can access it.
  • 13. Finally, the pipeline can invoke the Lucene Index Service (depending on the configuration).
  • 14. The Lucene Index Service updates the Lucene Index.
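Step 14's index update can be sketched against the Lucene 2.x API of that era (see org.apache.lucene in the 3rd Party Table); the field names and the index directory are assumptions for illustration.

 import java.io.IOException;

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.store.FSDirectory;

 // Sketch of step 14 against the Lucene 2.4-era API; the field names ("id",
 // "content") and the index directory are assumptions for illustration.
 public final class LuceneUpdateSketch {
     public static void updateIndex(String recordId, String extractedText)
             throws IOException {
         IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("index"),
                 new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
         try {
             Document doc = new Document();
             doc.add(new Field("id", recordId, Field.Store.YES,
                     Field.Index.NOT_ANALYZED));
             doc.add(new Field("content", extractedText, Field.Store.NO,
                     Field.Index.ANALYZED));
             // updateDocument deletes any existing document with this ID, then adds
             writer.updateDocument(new Term("id", recordId), doc);
         } finally {
             writer.close();
         }
     }
 }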
