Difference between revisions of "SMILA/Documentation/Architecture Overview"

From Eclipsepedia

(Description)
(Architecture Overview)
(32 intermediate revisions by 3 users not shown)
Line 1: Line 1:
This page gives a short overview of SMILA's current architecture.
+
== What is SMILA? ==
 +
=== Introduction ===
  
== Introduction ==
+
SMILA is a ''framework'' for creating scalable server-side systems that process large amounts of unstructured data in order to build applications in the area of search, linguistic analysis, information mining or similar. The goal is to enable you to easily integrate data source connectors, search engines, sophisticated analysis methods and other components while gaining scalability and reliability out-of-the-box.
SMILA is a framework that runs on top of an OSGi runtime and therefore follows its component model.
+
  
== Architecture Overview ==
+
As such, SMILA provides these main parts:
  
[[Image:SMILA Architecture Overview.png]]
+
* [[SMILA/Documentation/JobManager|'''JobManager''']]: a system for asynchronous, scalable processing of data using configurable ''workflows''. The system is able to reliably distribute the ''tasks'' to be done on big clusters of hosts. The workflows orchestrate easy-to-implement ''workers'' that can be used to integrate application-specific processing logic.
 +
* [[SMILA/Documentation/Importing/Concept|'''Crawlers''']]: concepts and basic implementations for scalable components that extract data from data sources.  
 +
* [[SMILA/Documentation/Pipelets|'''Pipelines''']]: a system for processing synchronous requests (e.g. search requests) by orchestrating easy-to-implement components (''pipelets'') in workflows defined in BPEL.
 +
* [[SMILA/Documentation/ObjectStore/Bundle_org.eclipse.smila.objectstore|'''Storage''']]: concepts for integrating big-data storages for efficient persistence of the processed data.
  
=== Description ===
+
Finally, all SMILA functionality is accessible to external clients via an ''HTTP ReST API'' using ''JSON'' as the data exchange format.
  
This architecture overview depicts generally two processes: preprocessing and information retrieval.
+
As an Eclipse system, SMILA is built to the ''OSGi'' standard and makes heavy use of the OSGi ''service'' component model.
  
'''Note:''' ''In case where SMILA is used for building a search application, we talk about indexing and search process.''
+
=== Architecture Overview ===
<ul>
+
<li>
+
<p>
+
The '''''preprocessing''''' process generally includes the interaction with the data source either by pushing (updated) data by agents or by pulling data by crawlers and pushing it into the system via the connectivity module. The information that can be pushed into the framework is in general document's metadata, content and diverse security relevant information i.e. access rights.
+
</p>
+
<p>
+
The connectivity module pushes the data into the bulkbuilder, that is the entry point to the asynchronous job management and persists the data in dedicated stores for further processing. The connectivity module has to provide a job name for which the record should be processed.
+
</p>
+
<p>
+
An indexing client can also use the REST API to push JSON objects (i.e. a document's metadata) into a running job directly without interacting with an agent or the connectivity module.
+
</p>
+
<p>
+
The bulkbuilder collects the records that have been pushed and stores them in bulks (storage entities comprising one or more records, so that the workers of the preprocessing workflow can process them efficiently in bulk).
+
</p>
+
<p>
+
Metadata and access rights are stored in the object store. Content, i.e. (large) binary data, is stored in the binary store. Besides these two stores, SMILA also offers a Delta Indexing store that keeps information about the objects/documents visited while crawling a data source.
+
Ontology store is a dedicated store for persisting and managing ontologies. The Blackboard service represents a high level API for accessing record information by BPEL pipelines.
+
</p>
+
<p>
+
After one bulk has been completed by matching configured time or size constraints, the bulk is released and the JobManager will determine follow up tasks for the next worker(s) as defined by the workflow of the active job.
+
</p>
+
<p>
+
The WorkerManager listens for available tasks from the TaskManager and lets its workers process them.
+
These include PipelineProcessingWorkers that execute synchronous BPEL workflows. These workers initialize a Blackboard with the records to be processed and start a BPEL engine which executes desired workflow. The workflow again is defined by the order of execution of some services either provided by the framework itself or implemented by application's developer.
+
</p>
+
<p>
+
Since the job processing synchronizes itself via ZooKeeper across the whole cluster, the tasks can be executed on different nodes, so the preprocessing can easily be spread and therefore parallelized across the whole cluster (provided that the storages are accessible from each node). Thus the asynchronous job processing components are the central framework components that enable horizontal scaling of the preprocessing process. Workers can also be configured to process multiple tasks in parallel on a single node.
+
</p>
+
</li>
+
<li>
+
<p>
+
The '''''information retrieval''''' provides a swift access to previously preprocessed and stored information. Since this process is synchronous there has to be some external component responsible for distributing the load and therefore enabling the horizontal scalability of the information retrieval process. The flexible definition and execution of application's business logic is provided here also by calling a BPEL engine with a desired workflow.
+
</p>
+
</li>
+
<p>
+
'''Hint:'''
+
For initial architecture proposal please see the [[SMILA/Attic/Architecture Overview|archived version]].
+
</p>
+
  
<p>
+
[[Image:SMILA Architecture Overview_1.0.png]]
Original slides can be found here: [[Media:SMILA Architecture.zip|SMILA Architecture.zip]]
+
</p>
+
  
 +
<font size="-1">
 +
Download [[Media:SMILA_Architecture_1.0.zip|this zip file]] containing the original PowerPoint file of this slide.
 +
</font>
  
----
+
A SMILA system consists of two distinct parts:
 +
* First, data has to be imported into the system and processed to build a search index, extract an ontology, or derive whatever else can be learned from the data.
 +
* Second, the learned information is used to answer retrieval requests from users, for example search or ontology exploration requests.
  
=== Component Documentation ===
+
In the first process usually some data source is crawled or an external client pushes the data from the source into the SMILA system using the HTTP ReST API. Often, the data consists of a large number of documents (e.g. a file system, web site, or content management system). To be processed, each document is represented in SMILA by a ''record'' describing the metadata of the document (name, size, access rights, authors, keywords...) and the original content of the document itself.
 +
 
 +
To process large amounts of data, SMILA must be able to distribute the work to be done on multiple SMILA nodes (computers). Therefore, the ''bulkbuilder'' separates the incoming data into ''bulks'' of records of a configurable size and writes them to an ObjectStore. For each of these bulks, the ''JobManager'' creates ''tasks'' for ''workers'' to process them and produce other bulks containing the result of their operation. When such a worker is available, it asks the ''TaskManager'' for tasks to be done, does the work and finally notifies the TaskManager about the result. ''Workflows'' define which workers should process a bulk in what sequence. Whenever a worker finishes a task for a bulk successfully, the JobManager can create follow-up tasks based on such a workflow definition. In case a worker fails processing a task (because the process or machine crashes or because of a network problem), the JobManager can decide to retry the task later and so ensure that the data is processed even in problematic conditions. The processing of the complete data set using such a workflow is called a ''job run''. The monitoring of the current state of such a job run is possible via the HTTP ReST API.
 +
 
 +
JobManager and TaskManager use [http://zookeeper.apache.org Apache ZooKeeper] to coordinate the state of a job run and the to-do and in-progress tasks across multiple computer nodes. In this way the job processing is distributed and parallelized.
 +
 
 +
To make implementing workers easy, the SMILA JobManager system contains the ''WorkerManager'' that enables you to concentrate on the actual worker functionality without having to worry about getting the TaskManager and ObjectStore interaction right.
 +
 
 +
To extract large amounts of data from the data source, the asynchronous job framework can also be used to implement highly scalable ''crawlers''. Crawling can be divided into several steps:
 +
* getting names of elements from the data source
 +
* checking if the element has changed since a previous crawl run (delta check)
 +
* getting the content of changed or new elements
 +
* pushing the element to a processing job.
 +
These steps can be implemented as separate workers too, so the crawl work can be parallelized and distributed quite easily. By using the JobManager to control the crawling, we also gain the same reliability and scalability for the crawling as for the processing. And implementing new crawlers is just as easy as implementing new workers.
 +
 
 +
Finally, the last step of such an asynchronous processing workflow writes the processed data to some target system, for example a search engine, an ontology manager, or a database, where it can be used to answer retrieval requests. These requests are handled by the second part of the system. They come from an external client application via the HTTP ReST API and are usually synchronous: a client sends a request and waits for the result so that it can present it to the end user, and it therefore expects the result to be produced quickly. At the same time, we want similar flexibility in configuring the processing of such synchronous requests as we have for asynchronous job processing. We therefore use a different workflow processor here, one based on a BPEL engine. The BPEL workflows in this processor (we call them ''pipelines'') orchestrate so-called ''pipelets'', which perform the individual steps needed to enrich and refine the original request and to produce the result. Implementing such a pipelet is probably even easier than implementing a worker ;-)
 +
 
 +
Finally, it is even possible to combine both workflow variants, because the asynchronous system contains a ''PipelineProcessing'' worker that performs a task by executing a synchronous pipeline. So it is possible to implement only a pipelet and make its functionality available in both kinds of workflows. Additionally, there is a ''PipeletProcessing'' worker, which executes just a single pipelet and thereby avoids the overhead of the synchronous workflow processor when one pipelet is sufficient to process the tasks.
 +
 
 +
== Want to know more? ==
  
 
For further up to date documentation of all implemented components please see:
 
  
{| class="table.gallery" border=0
+
* See SMILA in action: [[SMILA/Documentation_for_5_Minutes_to_Success|SMILA in 5 Minutes]]
|-
+
* Read the [[SMILA/Manual|Manual]]
| [[SMILA/Documentation|Component Documentation]]
+
|}
+
  
 
[[Category:SMILA]]
 

Revision as of 12:40, 1 March 2012


What is SMILA?

Introduction

SMILA is a framework for creating scalable server-side systems that process large amounts of unstructured data in order to build applications in the area of search, linguistic analysis, information mining or similar. The goal is to enable you to easily integrate data source connectors, search engines, sophisticated analysis methods and other components while gaining scalability and reliability out-of-the-box.

As such, SMILA provides these main parts:

  • JobManager: a system for asynchronous, scalable processing of data using configurable workflows. The system is able to reliably distribute the tasks to be done on big clusters of hosts. The workflows orchestrate easy-to-implement workers that can be used to integrate application-specific processing logic.
  • Crawlers: concepts and basic implementations for scalable components that extract data from data sources.
  • Pipelines: a system for processing synchronous requests (e.g. search requests) by orchestrating easy-to-implement components (pipelets) in workflows defined in BPEL.
  • Storage: concepts for integrating big-data storages for efficient persistence of the processed data.

Finally, all SMILA functionality is accessible to external clients via an HTTP ReST API using JSON as the data exchange format.

As an Eclipse system, SMILA is built to the OSGi standard and makes heavy use of the OSGi service component model.

Architecture Overview

SMILA Architecture Overview 1.0.png

Download this zip file containing the original PowerPoint file of this slide.

A SMILA system consists of two distinct parts:

  • First, data has to be imported into the system and processed to build a search index, extract an ontology, or derive whatever else can be learned from the data.
  • Second, the learned information is used to answer retrieval requests from users, for example search or ontology exploration requests.

In the first process usually some data source is crawled or an external client pushes the data from the source into the SMILA system using the HTTP ReST API. Often, the data consists of a large number of documents (e.g. a file system, web site, or content management system). To be processed, each document is represented in SMILA by a record describing the metadata of the document (name, size, access rights, authors, keywords...) and the original content of the document itself.
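As a rough illustration of what such a record might look like, here is a minimal Python sketch. The attribute names (id, metadata, content, accessRights and so on) are assumptions chosen for this example and are not taken from SMILA's actual record schema.

```python
import json


def make_record(record_id, metadata, content=None):
    """Bundle a document's metadata and, optionally, its original content.

    The field names used here are hypothetical, not SMILA's real schema.
    """
    record = {"id": record_id, "metadata": dict(metadata)}
    if content is not None:
        record["content"] = content
    return record


record = make_record(
    "file:/docs/report.txt",
    {
        "name": "report.txt",
        "size": 1024,
        "authors": ["A. Author"],
        "keywords": ["report"],
        "accessRights": ["staff"],
    },
    content="Quarterly report text ...",
)

# JSON is the exchange format of the HTTP ReST API, so a record-like
# object has to survive serialization unchanged.
payload = json.dumps(record)
```

The sketch keeps the content as plain text so that it serializes directly; large binary content would need a different transport.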

To process large amounts of data, SMILA must be able to distribute the work to be done on multiple SMILA nodes (computers). Therefore, the bulkbuilder separates the incoming data into bulks of records of a configurable size and writes them to an ObjectStore. For each of these bulks, the JobManager creates tasks for workers to process them and produce other bulks containing the result of their operation. When such a worker is available, it asks the TaskManager for tasks to be done, does the work and finally notifies the TaskManager about the result. Workflows define which workers should process a bulk in what sequence. Whenever a worker finishes a task for a bulk successfully, the JobManager can create follow-up tasks based on such a workflow definition. In case a worker fails processing a task (because the process or machine crashes or because of a network problem), the JobManager can decide to retry the task later and so ensure that the data is processed even in problematic conditions. The processing of the complete data set using such a workflow is called a job run. The monitoring of the current state of such a job run is possible via the HTTP ReST API.
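The interplay of bulks, tasks, follow-up tasks and retries described above can be sketched in a few lines of Python. This is a single-process toy model under simplifying assumptions, not SMILA's implementation: the function names build_bulks and run_job, the max_retries parameter and the representation of a workflow as a plain list of callables are all invented for this example.

```python
from collections import deque


def build_bulks(records, bulk_size):
    """Split incoming records into bulks of at most bulk_size records."""
    if bulk_size < 1:
        raise ValueError("bulk_size must be positive")
    return [records[i:i + bulk_size] for i in range(0, len(records), bulk_size)]


def run_job(records, bulk_size, workflow, max_retries=2):
    """Send every bulk through the workers in workflow, in order.

    workflow is a list of callables (hypothetical stand-ins for workers);
    each takes a bulk and returns the bulk produced by its operation.
    A failing task is queued again until max_retries is exhausted.
    """
    # A task is (index of the worker in the workflow, input bulk, attempts).
    todo = deque((0, bulk, 0) for bulk in build_bulks(records, bulk_size))
    results, failed = [], []
    while todo:
        step, bulk, attempts = todo.popleft()
        try:
            output = workflow[step](bulk)
        except Exception:
            if attempts < max_retries:
                todo.append((step, bulk, attempts + 1))  # retry later
            else:
                failed.append((step, bulk))
            continue
        if step + 1 < len(workflow):
            todo.append((step + 1, output, 0))  # follow-up task
        else:
            results.append(output)
    return results, failed


def make_flaky_upper():
    """Worker that fails on its first call to simulate a crash."""
    calls = {"n": 0}

    def upper(bulk):
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("simulated crash")
        return [r.upper() for r in bulk]

    return upper


results, failed = run_job(
    ["a", "b", "c"], 2, [make_flaky_upper(), lambda bulk: [r + "!" for r in bulk]]
)
```

Because the failed task is simply queued again, the order in which result bulks arrive is not guaranteed to match the input order.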

JobManager and TaskManager use Apache ZooKeeper to coordinate the state of a job run and the to-do and in-progress tasks across multiple computer nodes. In this way the job processing is distributed and parallelized.

To make implementing workers easy, the SMILA JobManager system contains the WorkerManager that enables you to concentrate on the actual worker functionality without having to worry about getting the TaskManager and ObjectStore interaction right.
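The division of labour described here can be illustrated with a small Python sketch: the worker author supplies only a function over records, while a surrounding loop deals with fetching tasks, reading and writing bulks, and reporting the outcome. The classes InMemoryObjectStore and InMemoryTaskManager, the function run_worker, the task fields and the status strings are hypothetical stand-ins, not SMILA's interfaces.

```python
class InMemoryObjectStore:
    """Stand-in for an object store: maps bulk ids to lists of records."""

    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value

    def get(self, key):
        return self.objects[key]


class InMemoryTaskManager:
    """Stand-in for a task manager that hands out tasks and records results."""

    def __init__(self, tasks):
        self.todo = list(tasks)
        self.finished = {}

    def get_task(self, worker_name):
        for i, task in enumerate(self.todo):
            if task["worker"] == worker_name:
                return self.todo.pop(i)
        return None

    def finish_task(self, task, status):
        self.finished[task["id"]] = status


def run_worker(worker_name, worker_fn, task_manager, store):
    """Fetch and process tasks for one worker until none are left."""
    while True:
        task = task_manager.get_task(worker_name)
        if task is None:
            break
        try:
            records = store.get(task["input"])
            store.put(task["output"], worker_fn(records))
            task_manager.finish_task(task, "SUCCESSFUL")
        except Exception:
            task_manager.finish_task(task, "FAILED")


store = InMemoryObjectStore()
store.put("bulk-1", ["hello", "world"])
tasks = InMemoryTaskManager(
    [{"id": "t1", "worker": "upper", "input": "bulk-1", "output": "bulk-1-out"}]
)
# The only application-specific code is the lambda passed as worker_fn.
run_worker("upper", lambda records: [r.upper() for r in records], tasks, store)
```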

To extract large amounts of data from the data source, the asynchronous job framework can also be used to implement highly scalable crawlers. Crawling can be divided into several steps:

  • getting names of elements from the data source
  • checking if the element has changed since a previous crawl run (delta check)
  • getting the content of changed or new elements
  • pushing the element to a processing job.

These steps can be implemented as separate workers too, so the crawl work can be parallelized and distributed quite easily. By using the JobManager to control the crawling, we also gain the same reliability and scalability for the crawling as for the processing. And implementing new crawlers is just as easy as implementing new workers.
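The four crawl steps listed above can be sketched as separate Python functions. This is a simplified, single-process illustration: the data source is a plain dictionary, the delta check compares an assumed "modified" marker with the value remembered from the previous run, and all function names are made up for this example.

```python
def list_elements(source):
    """Step 1: get the names of the elements in the data source."""
    return sorted(source)


def has_changed(name, source, delta_state):
    """Step 2: delta check against the marker stored by a previous crawl run."""
    marker = source[name]["modified"]
    changed = delta_state.get(name) != marker
    delta_state[name] = marker
    return changed


def fetch_content(name, source):
    """Step 3: get the content of a new or changed element."""
    return {"id": name, "content": source[name]["content"]}


def crawl(source, delta_state, push):
    """Run all steps; push is step 4 (hand the record to a processing job)."""
    pushed = 0
    for name in list_elements(source):
        if has_changed(name, source, delta_state):
            push(fetch_content(name, source))
            pushed += 1
    return pushed


source = {
    "a.txt": {"modified": 1, "content": "alpha"},
    "b.txt": {"modified": 1, "content": "beta"},
}
delta_state, job_input = {}, []

first = crawl(source, delta_state, job_input.append)   # everything is new
source["b.txt"] = {"modified": 2, "content": "beta v2"}
second = crawl(source, delta_state, job_input.append)  # only b.txt changed
```

Keeping each step in its own function mirrors the idea that each step could run as its own worker, so the expensive content fetch only happens for elements that pass the delta check.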

Finally, the last step of such an asynchronous processing workflow writes the processed data to some target system, for example a search engine, an ontology manager, or a database, where it can be used to answer retrieval requests. These requests are handled by the second part of the system. They come from an external client application via the HTTP ReST API and are usually synchronous: a client sends a request and waits for the result so that it can present it to the end user, and it therefore expects the result to be produced quickly. At the same time, we want similar flexibility in configuring the processing of such synchronous requests as we have for asynchronous job processing. We therefore use a different workflow processor here, one based on a BPEL engine. The BPEL workflows in this processor (we call them pipelines) orchestrate so-called pipelets, which perform the individual steps needed to enrich and refine the original request and to produce the result. Implementing such a pipelet is probably even easier than implementing a worker ;-)
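To make the idea of pipelets orchestrated by a pipeline more concrete, here is a small Python sketch. In SMILA the orchestration is defined in BPEL; in this illustration a plain list of functions stands in for the pipeline, and the pipelets, the toy index and the request fields are all assumptions made for the example.

```python
def normalize_query(request):
    """Pipelet: trim and lower-case the query string."""
    return {**request, "query": request["query"].strip().lower()}


def make_search_pipelet(index):
    """Pipelet factory: look up the query terms in a toy inverted index."""

    def search(request):
        hits = set()
        for term in request["query"].split():
            hits |= index.get(term, set())
        return {**request, "results": sorted(hits)}

    return search


def run_pipeline(pipelets, request):
    """Apply the pipelets in order; each enriches or refines the request."""
    for pipelet in pipelets:
        request = pipelet(request)
    return request


index = {"smila": {"doc1", "doc2"}, "osgi": {"doc2"}}
response = run_pipeline(
    [normalize_query, make_search_pipelet(index)],
    {"query": "  SMILA OSGi "},
)
```

Each pipelet takes a request and returns an enriched copy, so the client's single synchronous call yields the fully processed result.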

Finally, it is even possible to combine both workflow variants, because the asynchronous system contains a PipelineProcessing worker that performs a task by executing a synchronous pipeline. So it is possible to implement only a pipelet and make its functionality available in both kinds of workflows. Additionally, there is a PipeletProcessing worker, which executes just a single pipelet and thereby avoids the overhead of the synchronous workflow processor when one pipelet is sufficient to process the tasks.
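The following Python sketch illustrates how one pipelet could be reused in both kinds of workflows. The two adapter functions are only loosely modelled on the PipelineProcessing and PipeletProcessing workers named above; their names, signatures and the example pipelets are assumptions for this illustration.

```python
def add_language(record):
    """Hypothetical pipelet: annotate a record with a fixed language tag."""
    return {**record, "language": "en"}


def count_words(record):
    """Hypothetical pipelet: annotate a record with its word count."""
    return {**record, "words": len(record["text"].split())}


def pipeline_processing_worker(pipelets):
    """Wrap a whole pipeline as a worker that processes a bulk of records."""

    def worker(bulk):
        processed = []
        for record in bulk:
            for pipelet in pipelets:
                record = pipelet(record)
            processed.append(record)
        return processed

    return worker


def pipelet_processing_worker(pipelet):
    """Wrap a single pipelet as a worker, skipping the pipeline machinery."""

    def worker(bulk):
        return [pipelet(record) for record in bulk]

    return worker


bulk = [{"text": "hello big world"}, {"text": "hi"}]
via_pipeline = pipeline_processing_worker([add_language, count_words])(bulk)
via_pipelet = pipelet_processing_worker(count_words)(bulk)
```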

Want to know more?

For further up to date documentation of all implemented components please see: