Difference between revisions of "SMILA/Documentation/Default configuration workflow overview"

Revision as of 11:08, 14 October 2008

Schema.jpg


The diagramme description

  • 1. The creation or update of an index is initiated by a user request. The user sends the name of a configuration file (the so-called IndexOrderConfiguration) to the Crawler Controller.

The configuration file describes which Crawler should access a specific Data Source. Furthermore, it contains all parameters necessary for the crawling process.

  • 2. The Crawler Controller initializes a Crawler:
    • a. The Crawler Controller reads the Index order configuration file and assigns the Data Source from the configuration file to the Crawler.
    • b. The Crawler Controller starts the Crawler’s thread.
  • 3. The Crawler sequentially retrieves data records from the Data Source and returns to the Crawler Controller those attributes of each record that are required to generate the record’s ID and its Hash.
  • 4. The Crawler Controller generates the ID and the Hash and then queries the Delta Indexing Service to determine whether this particular record is new or was already indexed.
  • 5. If the record was not previously indexed, the Crawler Controller instructs the Crawler to retrieve the full content of the record (metadata and content).
  • 6. The Crawler fetches the complete record from the Data Source. Each record has an ID and can contain attributes, attachments (which contain binary content) and annotations.
  • 7. The Crawler Controller sends the completely retrieved record to the Connectivity module.
  • 8. The Connectivity module in turn persists the record to the storage tier via the Blackboard Service.
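
The new-or-already-indexed decision in steps 3 to 5 can be illustrated with a minimal Java sketch. It is not SMILA's actual Delta Indexing Service: the class name `DeltaIndexSketch`, the ID scheme (data source name plus a key attribute), the use of SHA-256, and the in-memory map are all assumptions made for illustration. The sketch also treats a record whose hash has changed as one that needs indexing again, which the description above does not state explicitly.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the Delta Indexing Service:
// it remembers the last hash seen for each record ID.
final class DeltaIndexSketch {
    private final Map<String, String> hashById = new HashMap<>();

    // Assumed ID scheme (illustrative only): data source name plus a key attribute.
    static String recordId(String dataSource, String key) {
        return dataSource + ":" + key;
    }

    // Hashes the attributes in sorted key order so the result is deterministic.
    static String recordHash(Map<String, String> attributes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> e : new TreeMap<>(attributes).entrySet()) {
                String line = e.getKey() + "=" + e.getValue() + "\n";
                digest.update(line.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }

    // Returns true if the ID is unknown or its hash differs from the stored one,
    // and stores the given hash as the latest one for that ID.
    boolean needsIndexing(String id, String hash) {
        String previous = hashById.put(id, hash);
        return !hash.equals(previous);
    }
}
```

In this sketch the Crawler Controller would only ask the Crawler for the full record (step 5) when `needsIndexing` returns `true`.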

The Blackboard provides central access to the record storage layer for all SMILA components, thus effectively constituting an abstraction layer: clients do not need to know about the underlying persistence / storage technologies.

Structured data (text-based) is stored in XML-Storage; binaries (e.g. attachments, images) are stored in Bin-Storage.
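
The abstraction described above can be sketched in Java as a single facade that hides two separate stores. This is not the real Blackboard API: the class `BlackboardSketch` and its method names are hypothetical, and the two in-memory maps merely stand in for XML-Storage and Bin-Storage.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical facade in the spirit of the Blackboard:
// callers never see which store holds which kind of data.
final class BlackboardSketch {
    // Stand-in for XML-Storage (structured, text-based data).
    private final Map<String, Map<String, String>> structuredStore = new HashMap<>();
    // Stand-in for Bin-Storage (binary attachments).
    private final Map<String, byte[]> binaryStore = new HashMap<>();

    void setAttribute(String recordId, String name, String value) {
        structuredStore.computeIfAbsent(recordId, k -> new HashMap<>()).put(name, value);
    }

    // Returns null if the record or the attribute is unknown.
    String getAttribute(String recordId, String name) {
        Map<String, String> attributes = structuredStore.get(recordId);
        return attributes == null ? null : attributes.get(name);
    }

    // Stores a defensive copy so later changes to the caller's array have no effect.
    void setAttachment(String recordId, String name, byte[] content) {
        binaryStore.put(recordId + "/" + name, content.clone());
    }

    // Returns a copy of the attachment, or null if it is unknown.
    byte[] getAttachment(String recordId, String name) {
        byte[] content = binaryStore.get(recordId + "/" + name);
        return content == null ? null : content.clone();
    }
}
```

Because clients only call the facade, either store could be replaced by a different persistence technology without changing client code.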

  • 9. In the next step, the Connectivity module transmits the record to the Router, which is part of the Connectivity module.

The Router filters the record’s attributes according to its configuration. Note: usually only the ID is passed on (this is defined in a filter configuration file).

After processing and filtering the record, the Router pushes the record into a JMS message queue (ActiveMQ) for further processing. Should any subsequent processes require the record’s full content, they can access it via the Blackboard.
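
The filter-then-enqueue behaviour of the Router can be sketched as follows. To stay self-contained, the sketch uses a `java.util.concurrent.BlockingQueue` as a stand-in for the JMS queue and a set of attribute names as a stand-in for the filter configuration file; the class `RouterSketch` and its methods are hypothetical and do not reflect SMILA's or ActiveMQ's actual APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical Router: keeps only the configured attributes and enqueues the result.
final class RouterSketch {
    // Stand-in for the filter configuration: names of attributes to pass on.
    private final Set<String> allowedAttributes;
    // Stand-in for the JMS message queue.
    private final BlockingQueue<Map<String, String>> queue = new LinkedBlockingQueue<>();

    RouterSketch(Set<String> allowedAttributes) {
        this.allowedAttributes = allowedAttributes;
    }

    // Copies the allowed attributes into a new message and puts it on the queue.
    void route(Map<String, String> record) {
        Map<String, String> message = new HashMap<>();
        for (String name : allowedAttributes) {
            if (record.containsKey(name)) {
                message.put(name, record.get(name));
            }
        }
        queue.add(message);
    }

    // Returns the next message, or null if the queue is empty.
    Map<String, String> poll() {
        return queue.poll();
    }
}
```

With an allowed set containing only `id`, a consumer receives just the record's ID and would have to fetch anything else from the storage layer, which mirrors the note above.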

  • 10. A Listener subscribing to the queue’s topic receives the message and invokes further processing logic according to its configuration.
  • 11. The Listener passes the record to the respective pipeline: the Add pipeline if it is a new record, or the Delete pipeline if the record needs to be removed from the index (because it has been deleted from the original Data Source).

A pipeline is a BPEL process using a set of pipelets and services to process a Record’s data (e.g. extracting text from various document or image file types).

BPEL is an XML-based language for defining business processes by orchestrating loosely coupled (web) services. BPEL processes require a BPEL runtime environment (e.g. Apache ODE).

  • 12. After processing the record, the pipelets / services store the gathered additional data via the Blackboard service.
  • 13. Finally the pipeline invokes the Lucene Index Service.
  • 14. The Lucene Index Service updates the Lucene Index.
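
The dispatch in steps 10 and 11, and the idea of a pipeline as an ordered chain of pipelets, can be sketched in plain Java. This is only an analogy: real SMILA pipelines are BPEL processes executed by a BPEL runtime, not Java function chains. The class `PipelineSketch`, the `Operation` enum, and the modelling of a record as a plain string are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical pipeline: an ordered chain of pipelets applied to a record.
final class PipelineSketch {
    // The two cases named in step 11.
    enum Operation { ADD, DELETE }

    // Assumption: a pipelet is modelled as a function from record text to record text.
    private final List<UnaryOperator<String>> pipelets;

    @SafeVarargs
    PipelineSketch(UnaryOperator<String>... pipelets) {
        this.pipelets = Arrays.asList(pipelets);
    }

    // Runs every pipelet in order, feeding each one the previous result.
    String process(String record) {
        String current = record;
        for (UnaryOperator<String> pipelet : pipelets) {
            current = pipelet.apply(current);
        }
        return current;
    }

    // Stand-in for the Listener: picks the pipeline that matches the operation.
    static String dispatch(Operation operation, String record,
                           PipelineSketch addPipeline, PipelineSketch deletePipeline) {
        return operation == Operation.ADD
                ? addPipeline.process(record)
                : deletePipeline.process(record);
    }
}
```

For example, an Add pipeline built from `String::trim` and `String::toLowerCase` would turn `"  Hello World "` into `"hello world"`, while a Delete pipeline could consist of a single pipelet that marks the record for removal.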

3rd Party Table

{|
! 3rd Party Bundle !! our CQ !! Used in Diagramme Step !! Processes
|-
| com.sleepycat.db.linux.x86 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.db.linux.x86_64 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.db.win32 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.dbxml || 2574 || 8, 12 || Record persistence layer
|-
| com.sleepycat.dbxml.linux.x86 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.dbxml.linux.x86_64 || ... || 8, 12 || Record persistence layer
|-
| com.sleepycat.dbxml.win32 || ... || 8, 12 || Record persistence layer
|-
| com.sun.jaxb || 2664 || 2 - 13 || XML binding
|-
| javax.el || 2683 || 1 || dependency of tomcat
|-
| javax.servlet.jsp || 2685 || 1 - 7 || JSP package. Used for the user interface (e.g. search form)
|-
| javax.xml.bind || 2686 || 2 - 13 || XML binding
|-
| javax.xml.stream || 2684 || 2 - 13 || XML storage
|-
| javax.xml.xquery || 2668 || 8, 12 || XML storage
|-
| net.sf.joost || 2590 || 11 - 13 || used for XSL transformation in Pipelets
|-
| org.apache.activemq.core || 2580 || 9 - 11 || JMS implementation used for communication between Connectivity and DFP
|-
| org.apache.commons.io || 2677 || 1 - 14 || basic helper classes for common i/o operations. Used throughout SMILA.
|-
| org.apache.commons.lang || 2678 || 1 - 14 || additional helper classes for the java.lang package. Used throughout SMILA.
|-
| org.apache.commons.logging || 2682 || 1 - 14 || abstraction layer for runtime logging. Used throughout SMILA.
|-
| org.apache.commons.vfs || ... || 8, 12 || abstraction layer for various (distributed) file systems. Used in storage layer.
|-
| org.apache.tomcat || 2561 || 1 || JSP/Servlet container providing the user interface (e.g. search form)
|-
| org.apache.log4j || 2555 || 1 - 14 || runtime logging. Used throughout SMILA.
|-
| org.apache.lucene || 2556 || 13, 14 || Lucene Index
|-
| org.apache.lucene.analysis || 2557 / 2603 || 13, 14 || Lucene Index
|-
| org.apache.lucene.search.highlight || 2558 || 13, 14 || Lucene Index
|-
| org.apache.ode || ... || 11 - 13 || BPEL runtime environment executing the pipelets in the DFP.
|-
| org.custommonkey.xmlunit || 2617 || 11 - 13 || XML extension for JUnit. Used for testing pipelets.
|-
| org.w3c.tidy || 2589 || 11 - 13 || Used for XML transformation within pipelets.
|}
