I would like to understand the goal of this MapReduce implementation based on SMILA-Workers a little bit better.
As a starting point, I think about the SMILA cluster, the data, and the possible access patterns. At the end I mention some
existing projects within the Hadoop ecosystem that deal with the same topics on a large scale. Here I collect some
initial questions to prepare a comparison between this SMILA-internal approach and an external Hadoop-based solution.
How many machines are typically used in a SMILA cluster?
What types of data do we process within SMILA workflows?
I think we process documents in general. Such documents have to be parsed, text analysis is done on single documents, and the index has to be created. Results from knowledge extraction are submitted to the triple store.
Therefore we use flexible workflows to pipe each document through all steps of the processing pipeline.
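To make sure I understand this per-document flow correctly, here is a minimal sketch of how I picture it. It is plain Python and not the SMILA API: the step functions `parse` and `analyze`, the record fields `raw`, `text` and `tokens`, and the helper `build_index` are all hypothetical names that I made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set

Record = Dict[str, object]
Step = Callable[[Record], Record]


def parse(record: Record) -> Record:
    # Hypothetical parsing step: strip surrounding whitespace from the raw content.
    return {**record, "text": str(record["raw"]).strip()}


def analyze(record: Record) -> Record:
    # Hypothetical text analysis step: lowercase, whitespace-based tokenization.
    return {**record, "tokens": str(record["text"]).lower().split()}


def run_pipeline(record: Record, steps: List[Step]) -> Record:
    # Pipe one document through every step, in order.
    for step in steps:
        record = step(record)
    return record


def build_index(records: Iterable[Record]) -> Dict[str, Set[str]]:
    # Toy inverted index: token -> ids of the documents containing it.
    index: Dict[str, Set[str]] = {}
    for record in records:
        for token in record["tokens"]:
            index.setdefault(token, set()).add(str(record["id"]))
    return index


doc = run_pipeline({"id": "d1", "raw": "  Hello SMILA  "}, [parse, analyze])
index = build_index([doc])
```

If this picture is wrong, for example because steps do not simply run one after another for a single document, that is exactly the kind of correction I am looking for.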
Does SMILA process the data stored in the index, in order to extract new information from existing records? Are documents bundled in order to process more of them in one step? If yes, how large are the bundles? Can such a workflow span multiple machines?
How many workers are assigned in this case? Does this depend on the configuration or is this calculated on the fly?
For a large set of documents, the batch processing approach of Hadoop would be an alternative. But if I just have a small set of files, the
pipelines, which are already available and preinitialized, could process single files much faster, so that the index is refreshed with less delay.
Is this the idea behind SMILA?
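The trade-off I have in mind can be written down as a toy cost model. This is only a sketch under strong assumptions: a batch job pays a fixed startup cost once, a preinitialized pipeline pays none, and both have a constant per-document cost. The numbers in the example are purely illustrative and not measurements of SMILA or Hadoop.

```python
from typing import Optional


def total_ms(num_docs: int, startup_ms: int, per_doc_ms: int) -> int:
    # Total processing time: one-off startup cost plus a constant cost per document.
    return startup_ms + num_docs * per_doc_ms


def break_even_docs(batch_startup_ms: int, batch_per_doc_ms: int,
                    pipeline_per_doc_ms: int) -> Optional[int]:
    # Smallest document count at which the batch job is no slower than the
    # preinitialized pipeline (pipeline startup is assumed to be zero).
    saving_per_doc = pipeline_per_doc_ms - batch_per_doc_ms
    if saving_per_doc <= 0:
        return None  # the batch job never catches up under this model
    return -(-batch_startup_ms // saving_per_doc)  # integer ceiling division


# Illustrative values only: 30 s batch startup, 10 ms vs. 50 ms per document.
small_pipeline = total_ms(10, 0, 50)
small_batch = total_ms(10, 30000, 10)
threshold = break_even_docs(30000, 10, 50)
```

Under these made-up values, ten documents are far cheaper in the pipeline, and the batch job only pays off once the set grows to several hundred documents. Real numbers for a typical SMILA installation would help a lot with the comparison.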
Besides thinking about the MapReduce implementation with workers, I suggest having a look at the new Hadoop architecture. With YARN we
are able to implement totally new types of distributed applications. So I think one could reuse the Hadoop framework to combine the SMILA concept with that
of a scalable data analysis and storage platform.
For reindexing large datasets, the core Hadoop system could be used. Workflows are then defined in Oozie, and the tasks within a workflow are either MapReduce jobs or Hive or Pig scripts.
The raw data is stored in HDFS or, if we need random access to single elements, in HBase. In this scenario, HBase can be used as a large shared memory to store and exchange the documents, records, and
intermediate data for each worker.
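The "shared memory" idea could look roughly like the following sketch. Please note that `SharedStore` is an in-memory stand-in that I invented for illustration. It is not the HBase client API, and the column names `raw`, `tokens` and `token_count` as well as the two worker functions are hypothetical.

```python
from typing import Dict


class SharedStore:
    """In-memory stand-in for a row/column table. NOT the HBase client API."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, object]] = {}

    def put(self, row_key: str, column: str, value: object) -> None:
        # Create the row on first write, then set the column value.
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str, column: str) -> object:
        # Returns None if the row or the column does not exist.
        return self._rows.get(row_key, {}).get(column)


def tokenize_worker(store: SharedStore, row_key: str) -> None:
    # Hypothetical first worker: reads the raw text, writes the tokens back.
    raw = str(store.get(row_key, "raw"))
    store.put(row_key, "tokens", raw.lower().split())


def count_worker(store: SharedStore, row_key: str) -> None:
    # Hypothetical second worker: depends only on what the first one stored.
    tokens = store.get(row_key, "tokens")
    store.put(row_key, "token_count", len(tokens))


store = SharedStore()
store.put("doc-1", "raw", "Pipelets on Hadoop")
tokenize_worker(store, "doc-1")
count_worker(store, "doc-1")
```

The point of the sketch is that the workers never talk to each other directly. They exchange documents and intermediate data only through the row key, so they could run on different machines as long as they see the same table.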
What do you think about such a combined approach for having bulk processing and individual document pipelines in one place?
The project would bring SMILA into the Hadoop ecosystem, and the concept of pipelets would then be something new in the Hadoop environment. The Giraph project already works towards building
new applications based on worker nodes that exchange messages with each other. This could be a starting point for the pipelets implementation, I think.