SMILA/Documentation/Architecture Overview

From Eclipsepedia


Revision as of 09:05, 2 September 2011

This page gives a short overview of SMILA's current architecture.


Introduction

SMILA is a framework that runs on top of an OSGi runtime and therefore follows its component model.
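The core idea of OSGi's component model is that components publish service implementations in a registry and other components look them up at runtime. The following toy registry only illustrates that idea in plain Java; it is an analogy, not the real OSGi `BundleContext`/`ServiceRegistration` API, and the class name is made up for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of a service-oriented component model: components
// publish implementations under an interface type, and consumers look
// them up at runtime. The real OSGi service layer is far richer (service
// properties, dynamics, listeners); this only mimics the lookup idea.
public class ToyServiceRegistry {
    private final Map<Class<?>, Object> services = new HashMap<>();

    /** Publish a service implementation under its interface type. */
    public <T> void register(Class<T> type, T impl) {
        services.put(type, impl);
    }

    /** Look up a service by interface; returns null if none registered. */
    public <T> T getService(Class<T> type) {
        return type.cast(services.get(type));
    }
}
```

In OSGi proper, the registry is provided by the framework, services can come and go dynamically, and bundles declare their dependencies instead of hard-wiring them, which is what makes SMILA's components pluggable.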

Architecture Overview

[Image: SMILA Architecture Overview.png]

Description

This architecture overview depicts two general processes: preprocessing and information retrieval.

Note: When SMILA is used to build a search application, these are referred to as the indexing and search processes.

The preprocessing process generally covers the interaction with the data source: agents push (updated) data, or crawlers pull it, and in either case the data enters the system via the connectivity module. The information pushed into the framework typically comprises a document's metadata, its content, and security-relevant information such as access rights.

The connectivity module pushes the data into the bulkbuilder, which is the entry point to the asynchronous job management and persists the data in dedicated stores for further processing. The connectivity module has to provide the name of the job in which the record should be processed.

Metadata and access rights are stored in the object store. Content, i.e. (large) binary data, is stored in the binary store. Besides these two storages, SMILA also offers a delta indexing store that keeps track of the objects/documents visited while crawling a data source. The ontology store is a dedicated store for persisting and managing ontologies. The Blackboard service provides a high-level API through which BPEL pipelines access record information.

After a record has been sent to the bulkbuilder, this process collects records and stores them in bulks (storage entities comprising one to many records, for efficient bulk processing). Once a bulk is completed by matching defined time or size constraints, it is released, and the asynchronous job processing determines the follow-up tasks for the next worker(s) as defined by the workflow of the active job.

The WorkerManager listens for available tasks and has them processed by its workers. These include PipelineProcessingWorkers, which execute synchronous workflows defined by pipelines: such a worker initializes a Blackboard with the records to be processed and starts a BPEL engine that executes the desired workflow. The workflow, in turn, is defined by the order of execution of services provided either by the framework itself or implemented by the application developer.

Since the job processing synchronizes itself via ZooKeeper across the whole cluster, tasks can be executed on different nodes, so the preprocessing can easily be spread, and therefore parallelized, across the whole cluster (provided that the storages are accessible from each node). The asynchronous job processing components are thus the central framework components that enable horizontal scaling of the preprocessing process.
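The bulk-building behaviour described above (collect records, release a bulk once a size or time constraint is met) can be sketched in plain Java. All names here (`SimpleBulkBuilder`, `addRecord`, `release`) are illustrative assumptions for this sketch, not the actual SMILA bulkbuilder API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of bulk building: records are collected until a
// size or age constraint is reached, then the bulk is "released" for
// asynchronous job processing. Hypothetical names, not SMILA's API.
public class SimpleBulkBuilder {
    private final int maxRecords;     // size constraint
    private final long maxAgeMillis;  // time constraint
    private final List<String> current = new ArrayList<>();
    private final List<List<String>> releasedBulks = new ArrayList<>();
    private long bulkStartedAt = System.currentTimeMillis();

    public SimpleBulkBuilder(int maxRecords, long maxAgeMillis) {
        this.maxRecords = maxRecords;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Add a record; release the current bulk if a constraint is met. */
    public void addRecord(String record) {
        if (current.isEmpty()) {
            bulkStartedAt = System.currentTimeMillis();
        }
        current.add(record);
        boolean sizeReached = current.size() >= maxRecords;
        boolean ageReached =
            System.currentTimeMillis() - bulkStartedAt >= maxAgeMillis;
        if (sizeReached || ageReached) {
            release();
        }
    }

    /** Close the current bulk and hand it over for follow-up tasks. */
    public void release() {
        if (!current.isEmpty()) {
            releasedBulks.add(new ArrayList<>(current));
            current.clear();
        }
    }

    public List<List<String>> getReleasedBulks() {
        return releasedBulks;
    }
}
```

In the real framework, a released bulk does not land in a local list but triggers the job manager to create follow-up tasks for the workers defined in the active job's workflow.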

The information retrieval process provides swift access to previously preprocessed and stored information. Since this process is synchronous, an external component has to be responsible for distributing the load, thereby enabling the horizontal scalability of information retrieval. Flexible definition and execution of the application's business logic is again provided by calling a BPEL engine with the desired workflow.
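Because retrieval is synchronous, load distribution happens outside SMILA. A minimal round-robin dispatcher over search nodes, standing in for such an "external component", might look like the following sketch; the class, node names, and `nextNode` method are assumptions for illustration, not part of SMILA.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal thread-safe round-robin selection of a search node, standing
// in for the external load-distributing component mentioned in the text.
// All names are illustrative; SMILA itself does not ship this class.
public class RoundRobinDispatcher {
    private final List<String> nodes;
    private final AtomicInteger counter = new AtomicInteger(0);

    public RoundRobinDispatcher(List<String> nodes) {
        if (nodes == null || nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node required");
        }
        this.nodes = nodes;
    }

    /** Pick the next node in round-robin order. */
    public String nextNode() {
        int i = Math.floorMod(counter.getAndIncrement(), nodes.size());
        return nodes.get(i);
    }
}
```

In practice this role is usually filled by an off-the-shelf load balancer or reverse proxy in front of the cluster rather than hand-written code.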

Hint: For the initial architecture proposal, please see the archived version.

Original slides can be found here: SMILA Architecture.zip



Component Documentation

For further, up-to-date documentation of all implemented components, please see:

Component Documentation