What is SMILA?
SMILA is a framework for creating scalable server-side systems that process large amounts of unstructured data in order to build applications in areas such as search, linguistic analysis, or information mining. The goal is to enable you to easily integrate data source connectors, search engines, sophisticated analysis methods and other components while gaining scalability and reliability out of the box.
As such, SMILA provides these main parts:
- JobManager: a system for asynchronous, scalable processing of data using configurable workflows. The system reliably distributes tasks across large clusters of hosts. The workflows orchestrate easy-to-implement workers that can be used to integrate application-specific processing logic.
- Crawlers: concepts and basic implementations for scalable components that extract data from data sources.
- Pipelines: a system for processing synchronous requests (e.g. search requests) by orchestrating easy-to-implement components (pipelets) in workflows defined in BPEL.
- Storage: concepts for integrating big-data storages for efficient persistence of the processed data.
Finally, all SMILA functionality is accessible to external clients via an HTTP ReST API using JSON as the data exchange format.
As an Eclipse system, SMILA is built in adherence to the OSGi standard and makes heavy use of the OSGi service component model.
A SMILA system consists of two distinct parts:
- First, data has to be imported into the system and processed to build a search index, extract an ontology, or derive whatever else can be learned from the data.
- Second, the learned information is used to answer retrieval requests from users, for example search or ontology exploration requests.
In the first part, usually either a data source is crawled or an external client pushes the data from the source into the SMILA system using the HTTP ReST API. Often, the data consists of a large number of documents (e.g. from a file system, web site, or content management system). To be processed, each document is represented in SMILA by a record describing the metadata of the document (name, size, access rights, authors, keywords...) and the original content of the document itself.
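As a rough illustration of the record idea described above, the following sketch models one document as a record that combines metadata with the original content. The function name `make_record` and all attribute names (`id`, `name`, `size`, `authors`, `keywords`, `content`) are assumptions chosen for this example and are not taken from SMILA's actual record schema.

```python
import json


def make_record(record_id, metadata, content):
    """Build a record: document metadata plus the original document content.

    All attribute names are illustrative assumptions, not SMILA's schema.
    """
    record = {"id": record_id}
    record.update(metadata)
    record["content"] = content
    return record


record = make_record(
    "file:/docs/report.txt",
    {"name": "report.txt", "size": 11, "authors": ["J. Doe"], "keywords": ["sample"]},
    "Hello SMILA",
)

# The HTTP API exchanges JSON, so a record has to be JSON-serializable.
print(json.dumps(record, indent=2))
```

Keeping the record a plain, JSON-serializable structure is what makes it easy to push it over HTTP or to store many of them together in a bulk.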
To process large amounts of data, SMILA must be able to distribute the work across multiple SMILA nodes (computers). Therefore, the bulkbuilder separates the incoming data into bulks of records of a configurable size and writes them to an ObjectStore. For each of these bulks, the JobManager creates tasks for workers to process them and produce other bulks containing the result of their operation. When such a worker is available, it asks the TaskManager for tasks to be done, does the work and finally notifies the TaskManager about the result. Workflows define which workers should process a bulk and in what sequence. Whenever a worker successfully finishes a task for a bulk, the JobManager can create follow-up tasks based on such a workflow definition. If a worker fails to process a task (because the process or machine crashes or because of a network problem), the JobManager can decide to retry the task later, thereby ensuring that the data is processed even under problematic conditions. The processing of the complete data set using such a workflow is called a job run. The current state of such a job run can be monitored via the HTTP ReST API.
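The flow described above (bulks of configurable size, tasks per bulk, follow-up tasks and retries) can be sketched as a small single-process simulation. This is only a toy model under stated assumptions: there is no ObjectStore, TaskManager or cluster here, and the names `build_bulks`, `run_job`, `max_retries` and the example workers are hypothetical, not part of the SMILA API.

```python
from collections import deque


def build_bulks(records, bulk_size):
    """Split incoming records into bulks of at most bulk_size records."""
    return [records[i:i + bulk_size] for i in range(0, len(records), bulk_size)]


def run_job(bulks, workflow, max_retries=2):
    """Send every bulk through the workers in workflow order, retrying failures.

    workflow is a list of (worker_name, function) pairs; each function takes a
    bulk and returns a result bulk. Returns (finished_bulks, failed_tasks).
    """
    # A task is (index of the workflow step, input bulk, attempts so far).
    todo = deque((0, bulk, 0) for bulk in bulks)
    finished, failed = [], []
    while todo:
        step, bulk, attempts = todo.popleft()
        name, worker = workflow[step]
        try:
            result = worker(bulk)
        except Exception:
            if attempts < max_retries:
                todo.append((step, bulk, attempts + 1))  # retry the task later
            else:
                failed.append((name, bulk))  # give up on this task
            continue
        if step + 1 < len(workflow):
            todo.append((step + 1, result, 0))  # follow-up task for next worker
        else:
            finished.append(result)
    return finished, failed


calls = {"count": 0}


def flaky_upper(bulk):
    """Example worker that crashes on its very first task."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated crash")
    return [r.upper() for r in bulk]


def add_length(bulk):
    """Example worker that annotates each record with its length."""
    return [(r, len(r)) for r in bulk]


bulks = build_bulks(["a", "bb", "ccc", "dddd", "eeeee"], bulk_size=2)
finished, failed = run_job(bulks, [("upper", flaky_upper), ("length", add_length)])
print(len(bulks), len(finished), len(failed))
```

Because the failed task is re-queued rather than dropped, all three bulks still reach the end of the workflow, although not necessarily in their original order.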
JobManager and TaskManager use Apache ZooKeeper to coordinate the state of a job run and the to-do and in-progress tasks across multiple computer nodes. In this way, the job processing is distributed and parallelized.
To make implementing workers easy, the SMILA JobManager system contains the WorkerManager, which lets you concentrate on the actual worker functionality without having to worry about getting the TaskManager and ObjectStore interaction right.
To extract large amounts of data from the data source, the asynchronous job framework can also be used to implement highly scalable crawlers. Crawling can be divided into several steps:
- getting names of elements from the data source
- checking if the element has changed since a previous crawl run (delta check)
- getting the content of changed or new elements
- pushing the element to a processing job.
These steps can be implemented as separate workers too, so the crawl work can be parallelized and distributed quite easily. By using the JobManager to control the crawling, we also gain the same reliability and scalability for the crawling as for the processing.
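The four crawl steps listed above can be sketched as follows. This is a minimal sequential sketch, assuming an in-memory data source and a version number as the change marker for the delta check; the function names, the `version` field and the `push` callback are hypothetical and only stand in for real crawler workers and a real processing job.

```python
def list_elements(source):
    """Step 1: get element names, each with a cheap change marker (a version)."""
    return [(name, entry["version"]) for name, entry in sorted(source.items())]


def has_changed(name, version, crawl_state):
    """Step 2 (delta check): the element is new or its marker differs."""
    return crawl_state.get(name) != version


def fetch_content(source, name):
    """Step 3: get the content, called only for new or changed elements."""
    return source[name]["content"]


def crawl(source, crawl_state, push):
    """Run all steps; push(record) stands in for submitting to a processing job."""
    pushed = 0
    for name, version in list_elements(source):
        if not has_changed(name, version, crawl_state):
            continue  # unchanged since the previous crawl run
        push({"name": name, "content": fetch_content(source, name)})  # step 4
        crawl_state[name] = version  # remember the marker for the next run
        pushed += 1
    return pushed


source = {
    "a.txt": {"version": 1, "content": "alpha"},
    "b.txt": {"version": 1, "content": "beta"},
}
state, received = {}, []
first = crawl(source, state, received.append)   # both elements are new
source["b.txt"] = {"version": 2, "content": "beta v2"}
second = crawl(source, state, received.append)  # only b.txt has changed
print(first, second)
```

Since each step is its own function with explicit inputs and outputs, each one could in principle run as a separate worker, which is the property that allows the crawl work to be parallelized.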