This page gives a short overview of SMILA's current architecture.
SMILA is a framework for creating server-side systems that process large amounts of unstructured data in order to build applications in the area of search, linguistic analysis, information mining or similar. As such, SMILA provides these main parts:
- JobManager: a system for asynchronous, scalable processing of data using configurable workflows. The system is able to reliably distribute the tasks to be done on big clusters of hosts. The workflows orchestrate easy-to-implement workers that can be used to integrate application-specific processing logic.
- Crawlers: concepts and basic implementations for scalable components that extract data from data sources.
- Pipelines: a system for processing synchronous requests (e.g. search requests) by orchestrating easy-to-implement components ("pipelets") in workflows defined in BPEL.
- Storage: concepts for integrating big-data storages for persistence of the processed data.
Eventually, all SMILA functionality will be accessible to external clients via HTTP REST APIs using JSON as the data format. As an Eclipse system, SMILA is built on OSGi and makes heavy use of the OSGi service component model.
This architecture overview generally depicts two processes: preprocessing and information retrieval.
Note: When SMILA is used for building a search application, these are referred to as the indexing and search processes.
The preprocessing process generally involves interaction with the data source, either by pulling data with crawlers or by pushing it into the system via the BulkBuilder module. The information that can be pushed into the framework is, in general, a document's metadata, its content and various security-relevant information, i.e. access rights.
The BulkBuilder is the entry point to the asynchronous job management and persists the data in dedicated stores for further processing. A bulk is a set of records that is processed by various workers, which are orchestrated via an asynchronous workflow. Such a workflow is instantiated by defining a job, and the execution of such a job is called a job run.
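The relationship between workflow, job and job run can be pictured with a small data model. The following is a minimal sketch in Python, not SMILA's actual API: the class and field names (`Workflow`, `Job`, `JobRun`, `start_run`) and the example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Workflow:
    """An ordered chain of worker names (hypothetical model)."""
    name: str
    workers: tuple


@dataclass
class Job:
    """A job instantiates a workflow with concrete parameters."""
    name: str
    workflow: Workflow
    parameters: dict = field(default_factory=dict)
    _run_ids: count = field(default_factory=lambda: count(1), repr=False)

    def start_run(self):
        """Each execution of the job is a separate job run."""
        return JobRun(job=self, run_id=next(self._run_ids))


@dataclass
class JobRun:
    job: Job
    run_id: int
    state: str = "RUNNING"


# Illustrative names only; they do not refer to a documented SMILA configuration.
workflow = Workflow("exampleWorkflow", ("bulkbuilder", "exampleWorker"))
job = Job("exampleJob", workflow, {"store": "exampleStore"})
first_run = job.start_run()
second_run = job.start_run()
```

One job definition can thus be executed many times, and each run keeps its own identity and state.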
For better crawl performance, a crawler (e.g. file system crawler or web crawler) is now implemented as a set of different workers that also run in the asynchronous job management. This makes it possible to run the individual steps of a crawler in parallel (even on multiple hosts). The complete preprocessing therefore consists of two jobs: one for extracting the raw data from the data source into SMILA, and one for transforming it and loading it into some target, e.g. an index.
An indexing client can also use the REST API to push JSON objects (i.e. a document's metadata) and the document contents into the BulkBuilder. Such a client could run inside the data source and react to create, update and delete events there by sending the changed objects to SMILA for processing, so that SMILA does not need to crawl the data source regularly to stay up to date.
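A push from such a client might look roughly like the following Python sketch. It only builds the HTTP request and does not send it. The endpoint URL, the `id` field and the metadata attribute names are assumptions made for illustration and are not taken from the SMILA REST API documentation.

```python
import json
import urllib.request


def build_push_request(endpoint, record_id, metadata):
    """Build (but do not send) a POST request carrying one record as JSON."""
    record = {"id": record_id, **metadata}  # "id" is an assumed field name
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_push_request(
    # Placeholder URL, not a documented SMILA endpoint.
    "http://localhost:8080/example/bulkbuilder",
    "doc-42",
    {"title": "Quarterly report", "accessRights": ["staff"]},
)
# urllib.request.urlopen(request) would actually send the record.
```

A client reacting to change events in the data source would build one such request per changed object.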
Metadata, access rights and document contents are stored in the object store. In addition to this storage, SMILA offers a DeltaChecker worker that keeps information about the state of objects/documents during a crawl of a data source, so that in follow-up crawl runs only changed objects are pushed into the transformation workflow. The ontology store is a dedicated store for persisting and managing ontologies. The Blackboard service provides a high-level API through which BPEL pipelines access record information.
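The idea behind delta checking can be illustrated with a few lines of Python. This is a sketch under assumptions: it detects changes by comparing content hashes, whereas the actual DeltaChecker may rely on other criteria, and the function name `changed_records` is hypothetical.

```python
import hashlib


def changed_records(records, known_hashes):
    """Yield ids of new or modified records; update known_hashes in place."""
    for record_id, content in records:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if known_hashes.get(record_id) != digest:
            known_hashes[record_id] = digest
            yield record_id


state = {}  # survives between crawl runs
first_run = list(changed_records([("a", "v1"), ("b", "v1")], state))
second_run = list(changed_records([("a", "v1"), ("b", "v2")], state))
```

In the first run both records are new and are passed on. In the second run only `b` is passed on, because the content of `a` has not changed.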
Once a bulk has been completed by reaching the configured time or size constraints, it is released and the JobManager determines the follow-up tasks for the next worker(s) as defined by the workflow of the active job.
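How a bulk is closed by a size or time limit can be sketched as follows. The class `BulkBuffer` and its parameters are hypothetical, and the sketch simplifies one point: it checks the age of a bulk only when a record is added, whereas a real implementation would presumably also release a bulk when a timer expires.

```python
import time


class BulkBuffer:
    """Collects records and releases them as a bulk when a limit is reached."""

    def __init__(self, max_records, max_age_seconds, clock=time.monotonic):
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._records = []
        self._opened_at = None

    def add(self, record):
        """Add a record; return the finished bulk if a limit was hit, else None."""
        now = self._clock()
        if not self._records:
            self._opened_at = now  # the first record opens a new bulk
        self._records.append(record)
        too_big = len(self._records) >= self._max_records
        too_old = now - self._opened_at >= self._max_age
        if too_big or too_old:
            bulk, self._records = self._records, []
            return bulk
        return None


# A fake clock makes the example deterministic.
ticks = iter([0.0, 1.0, 2.0, 10.0, 20.0])
buffer = BulkBuffer(max_records=3, max_age_seconds=5, clock=lambda: next(ticks))
released = [buffer.add(r) for r in ["r1", "r2", "r3", "r4", "r5"]]
```

Here the first bulk is released because it reaches three records, and the second because it has been open for longer than five seconds.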
The WorkerManager listens for available tasks from the TaskManager and lets its workers process them. These workers include PipelineProcessorWorkers, which execute synchronous BPEL workflows. They initialize a Blackboard with the records to be processed and start a BPEL engine that executes the desired workflow. The workflow in turn is defined by the order of execution of services that are either provided by the framework itself or implemented by the application developer.
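The interplay of Blackboard and pipelets can be pictured with the following Python sketch. It is a deliberate simplification: a plain loop stands in for the BPEL engine, a dictionary stands in for the Blackboard, and the pipelet functions and their signature are hypothetical.

```python
def normalize_title(blackboard, record_ids):
    """Pipelet: trim and lowercase the title of every record."""
    for record_id in record_ids:
        record = blackboard[record_id]
        record["title"] = record["title"].strip().lower()
    return record_ids


def drop_empty(blackboard, record_ids):
    """Pipelet: pass on only records that still have a title."""
    return [rid for rid in record_ids if blackboard[rid]["title"]]


def run_pipeline(pipelets, records):
    """Load records onto a blackboard and run the pipelets in order."""
    blackboard = {record["id"]: dict(record) for record in records}
    record_ids = list(blackboard)
    for pipelet in pipelets:
        record_ids = pipelet(blackboard, record_ids)
    return [blackboard[rid] for rid in record_ids]


result = run_pipeline(
    [normalize_title, drop_empty],
    [{"id": "a", "title": "  Hello World "}, {"id": "b", "title": "   "}],
)
```

Each pipelet works on the shared record data and hands the ids of the records to be processed further to the next step, so the order of the pipelets defines the workflow.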
Since the job processing synchronizes itself via ZooKeeper across the whole cluster, tasks can be executed on different nodes, so the preprocessing can easily be distributed and thereby parallelized across the whole cluster (provided that the storages are accessible from each node in the cluster). The asynchronous job processing components are thus the central framework components that enable horizontal scaling of the preprocessing process. Workers can also be configured to process multiple tasks in parallel on a single node.
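Processing several tasks in parallel on a single node can be illustrated with a thread pool. This is only a sketch of the general idea; the task layout, the function names and the `max_parallel` parameter are assumptions and do not reflect SMILA's worker configuration.

```python
from concurrent.futures import ThreadPoolExecutor


def process_task(task):
    """Stand-in for a worker: count the characters in one bulk of records."""
    return task["bulk"], sum(len(record) for record in task["records"])


def run_tasks(tasks, max_parallel):
    """Process up to max_parallel tasks at the same time on this node."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns the results in the order of the input tasks.
        return list(pool.map(process_task, tasks))


tasks = [
    {"bulk": "bulk-1", "records": ["alpha", "beta"]},
    {"bulk": "bulk-2", "records": ["gamma"]},
    {"bulk": "bulk-3", "records": []},
]
results = run_tasks(tasks, max_parallel=2)
```

Because each task refers to its own bulk, the tasks are independent of each other and can be processed concurrently without coordination between them.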
Information retrieval provides swift access to previously preprocessed and stored information. Since this process is synchronous, some external component has to be responsible for distributing the load and thereby enabling horizontal scalability of the information retrieval process. Here, too, the flexible definition and execution of the application's business logic is provided by calling a BPEL engine with the desired workflow.
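What such an external load-distributing component does can be sketched with a simple round-robin dispatcher. The text does not prescribe a strategy, so round-robin is an assumption here, as are the class name `RoundRobinDispatcher` and the node names.

```python
from itertools import cycle


class RoundRobinDispatcher:
    """Assigns each incoming request to the next node in turn."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = cycle(nodes)

    def dispatch(self, request):
        """Return the node that should handle this request."""
        return next(self._nodes)


dispatcher = RoundRobinDispatcher(["node-1", "node-2"])
assignments = [dispatcher.dispatch(query) for query in ["q1", "q2", "q3"]]
```

Adding a node to the list increases the capacity for synchronous requests without changing the nodes themselves.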
Hint: For the initial architecture proposal, please see the archived version.
Original slides can be found here: SMILA Architecture.zip
For further up-to-date documentation of all implemented components, please see: