SMILA/Documentation/Default configuration workflow overview


Schema.jpg (workflow schema diagram; the numbered steps below refer to it)


SMILA Workflow

  • 1. Creation or update of an index is initiated by a user request. The user sends the name of a configuration file (the so-called IndexOrderConfiguration) to the Crawler Controller.

The configuration file describes which Crawler should access a specific Data Source. Furthermore it contains all necessary parameters for the crawling process.

  • 2. The Crawler Controller initializes a Crawler:
    • a. The Crawler Controller reads the Index order configuration file and assigns the Data Source from the configuration file to the Crawler.
    • b. The Crawler Controller starts the Crawler’s thread.
  • 3. The Crawler sequentially retrieves data records from the Data Source and returns to the Crawler Controller those attributes of each record that are required for generating the record's ID and hash.
  • 4. The Crawler Controller generates the ID and hash and then determines, by querying the Delta Indexing Service, whether this particular record is new or has already been indexed.
  • 5. If the record was not previously indexed, the Crawler Controller instructs the Crawler to retrieve the full content of the record (metadata and content).
  • 6. The Crawler fetches the complete record from the Data Source. Each record has an ID and can contain attributes, attachments (containing binary content), and annotations.
  • 7. The Crawler Controller sends the completely retrieved records to the Connectivity module.
  • 8. The Connectivity module in turn persists the record to the storage tier via the Blackboard Service.

The Blackboard provides central access to the record storage layer for all SMILA components, thus effectively constituting an abstraction layer: clients do not need to know about the underlying persistence / storage technologies.

Structured data (text-based) is stored in XML-Storage; binaries (e.g. attachments, images) are stored in Bin-Storage.
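
The sketch below is a hypothetical Java interface, not the actual SMILA Blackboard API; it only illustrates the kind of abstraction described above, where clients read and write a record's structured attributes and binary attachments through one service without having to know which storage the data ends up in.

 // Hypothetical illustration only: not the real SMILA Blackboard interface.
 public interface RecordStore {
 
     /** Store or update a structured attribute; would end up in the XML storage. */
     void setAttribute(String recordId, String name, String value);
 
     /** Read a structured attribute back. */
     String getAttribute(String recordId, String name);
 
     /** Store binary content (e.g. the original file); would end up in the binary storage. */
     void setAttachment(String recordId, String name, byte[] content);
 
     /** Read binary content back. */
     byte[] getAttachment(String recordId, String name);
 
     /** Persist all pending changes for the given record. */
     void commit(String recordId);
 }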

  • 9. In the next step the Connectivity module transmits the records to the Router, which is part of the Connectivity module.

The Router filters the record's attributes according to its configuration. Note: usually only the ID is passed on (this is defined in a filter configuration file).

After processing and filtering the record, the Router pushes the record into a JMS message queue (ActiveMQ) for further processing. Should any subsequent processes require the record's full content, they can access it via the Blackboard.
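
As a rough illustration of step 9, the following snippet uses the plain JMS API against an ActiveMQ broker to put a minimal message (only a record ID) into a queue, which is essentially what the Router does after filtering. The broker URL, the queue name and the "Operation" property are assumptions made for this sketch, not SMILA's actual configuration.

 import javax.jms.Connection;
 import javax.jms.ConnectionFactory;
 import javax.jms.MessageProducer;
 import javax.jms.Queue;
 import javax.jms.Session;
 import javax.jms.TextMessage;
 import org.apache.activemq.ActiveMQConnectionFactory;
 
 public class RouterQueueSketch {
     public static void main(String[] args) throws Exception {
         // Connect to a local ActiveMQ broker; URL and queue name are assumptions of this sketch.
         ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
         Connection connection = factory.createConnection();
         connection.start();
         Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
 
         Queue queue = session.createQueue("SMILA.connectivity");
         MessageProducer producer = session.createProducer(queue);
 
         // After filtering, usually only the record ID is put on the queue;
         // consumers fetch the full record via the Blackboard when they need it.
         TextMessage message = session.createTextMessage("file:/docs/readme.txt");
         message.setStringProperty("Operation", "ADD"); // assumed property for add/delete routing
         producer.send(message);
 
         producer.close();
         session.close();
         connection.close();
     }
 }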

  • 10. A Listener subscribing to the queue’s topic receives the message and invokes further processing logic according to its configuration.
  • 11. The Listener passes the record to the respective pipeline – the Add pipeline if it is a new record, or the Delete pipeline if the record needs to be removed from the index (because it has been deleted from the original Data Source).

A pipeline is a BPEL process using a set of pipelets and services to process a Record’s data (e.g. extracting text from various document or image file types).

BPEL is an XML-based language for defining business processes by means of orchestrating loosely coupled (web) services. BPEL processes require a BPEL runtime environment (e.g. Apache ODE).
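
The following is a minimal sketch of the listener behaviour described in steps 10 and 11: a JMS consumer subscribes to the same (hypothetical) queue used in the previous sketch and dispatches each incoming record ID either to an Add or a Delete path. The pipeline invocation is only a placeholder; in SMILA the record would be handed to the BPEL engine.

 import javax.jms.Connection;
 import javax.jms.Message;
 import javax.jms.MessageConsumer;
 import javax.jms.MessageListener;
 import javax.jms.Session;
 import javax.jms.TextMessage;
 import org.apache.activemq.ActiveMQConnectionFactory;
 
 public class QueueListenerSketch {
     public static void main(String[] args) throws Exception {
         Connection connection =
                 new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
         connection.start();
         Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
 
         // Subscribe to the (hypothetical) queue the Router writes to.
         MessageConsumer consumer =
                 session.createConsumer(session.createQueue("SMILA.connectivity"));
         consumer.setMessageListener(new MessageListener() {
             public void onMessage(Message message) {
                 try {
                     String operation = message.getStringProperty("Operation"); // assumed property
                     String recordId = ((TextMessage) message).getText();
                     // Placeholder dispatch: a real Listener hands the record to the BPEL engine.
                     if ("DELETE".equals(operation)) {
                         System.out.println("DeletePipeline <- " + recordId);
                     } else {
                         System.out.println("AddPipeline <- " + recordId);
                     }
                 } catch (Exception e) {
                     e.printStackTrace();
                 }
             }
         });
 
         // Keep the JVM alive so the asynchronous listener can receive messages.
         Thread.sleep(60000);
         connection.close();
     }
 }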

  • 12. After processing the record, the pipelets / services store the gathered additional data via the Blackboard service.
  • 13. Finally, the pipeline invokes the Lucene Index Service.
  • 14. The Lucene Index Service updates the Lucene Index.
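
To make steps 13 and 14 more concrete, here is a small stand-alone example of updating a Lucene index, written against the Lucene 2.x API of that time. The index path, field names and the sample record ID are assumptions made for this sketch; it is not the implementation of the Lucene Index Service.

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 
 public class LuceneUpdateSketch {
     public static void main(String[] args) throws Exception {
         // Open the index directory (path is an assumption); create=true builds a fresh index.
         Directory dir = FSDirectory.getDirectory("workspace/test_index");
         IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
 
         // Build a Lucene document from the record's ID and its extracted text.
         Document doc = new Document();
         doc.add(new Field("recordId", "file:/docs/readme.txt",
                 Field.Store.YES, Field.Index.UN_TOKENIZED));
         doc.add(new Field("content", "extracted plain text of the record",
                 Field.Store.NO, Field.Index.TOKENIZED));
 
         // updateDocument replaces any previously indexed version of the same record,
         // so a re-crawled record does not produce duplicate index entries.
         writer.updateDocument(new Term("recordId", "file:/docs/readme.txt"), doc);
 
         writer.close();
         dir.close();
     }
 }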


3rd Party Table

N  | 3rd-party bundle                   | Our CQ no.  | Schema steps
1  | com.sleepycat.db.linux.x86         |             | 8, 12
2  | com.sleepycat.db.linux.x86_64      |             | 8, 12
3  | com.sleepycat.db.win32             |             | 8, 12
4  | com.sleepycat.dbxml                | 2574        | 8, 12
6  | com.sleepycat.dbxml.linux.x86      |             | 8, 12
7  | com.sleepycat.dbxml.linux.x86_64   |             | 8, 12
8  | com.sleepycat.dbxml.win32          |             | 8, 12
9  | com.sun.jaxb                       | 2664        | 2 - 13
10 | javax.el                           | 2683        | 1
11 | javax.servlet.jsp                  | 2685        | 1 - 7
12 | javax.xml.bind                     | 2686        | 2 - 13
13 | javax.xml.stream                   | 2684        | 2 - 13
14 | javax.xml.xquery                   | 2668        | 8, 12
15 | net.sf.joost                       | 2590        | 11 - 13
16 | org.apache.activemq.core           | 2580        | 9 - 11
17 | org.apache.commons.io              | 2677        | 1 - 14
18 | org.apache.commons.lang            | 2678        | 1 - 14
18 | org.apache.commons.logging         | 2682        | 1 - 14
20 | org.apache.commons.vfs             |             | 8, 12
21 | org.apache.tomcat                  | 2561        | 1
22 | org.apache.log4j                   | 2555        | 1 - 14
23 | org.apache.lucene                  | 2556        | 13, 14
24 | org.apache.lucene.analysis         | 2557 / 2603 | 13, 14
25 | org.apache.lucene.search.highlight | 2558        | 13, 14
26 | org.apache.ode                     |             | 11 - 13
27 | org.custommonkey.xmlunit           | 2617        | 11 - 13
28 | org.w3c.tidy                       | 2589        | 11 - 13
