SMILA/Documentation/HowTo/How to setup SMILA in a cluster
Revision as of 07:44, 19 October 2012
Introduction
SMILA is primarily intended as a framework into which you can plug your own high-performance, highly scalable components (e.g. for storage). Nevertheless, SMILA can also be set up out of the box on a cluster by using its default implementations. This permits horizontal scaling, meaning that import and processing jobs/tasks are shared across the cluster nodes. (Remark: Vertical scaling on each cluster machine is also available, but this is not new, because a single-node SMILA provides it as well.)
The following steps describe how to set up SMILA on multiple cluster nodes.
Install external Solr server
If you want to use Solr for indexing, you need to set up a separate Solr server, because the Solr instance embedded in one SMILA instance cannot be shared with the other SMILA instances.
Single-Node Server
- Download a Solr 3.x archive from http://lucene.apache.org/solr/. This HowTo was tested with Solr 3.6.1.
- Unpack the archive to a local directory. This gives you a directory like /home/smila/solr/apache-solr-3.6.1.
- Copy the files from SMILA/configuration/org.eclipse.smila.solr to the Solr machine (solr.properties isn't needed), e.g. to /home/smila/solr/smila-cores.
- Go to /home/smila/solr/apache-solr-3.6.1/example and run:
java -Dsolr.solr.home=/home/smila/solr/smila-cores -jar start.jar
- Check if Solr is running at http://localhost:8983/solr/DefaultCore/admin/ (replace localhost with name of your Solr server, if necessary).
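The check in the last step can also be scripted, which is handy when the Solr server is started remotely. The following is a minimal sketch, assuming Python 3 is available; the helper names solr_admin_url and solr_is_up are hypothetical and not part of SMILA or Solr.

```python
import urllib.request


def solr_admin_url(host="localhost", port=8983, core="DefaultCore"):
    """Build the admin URL of a Solr core as used in this HowTo."""
    return "http://%s:%d/solr/%s/admin/" % (host, port, core)


def solr_is_up(url, timeout=5.0):
    """Return True if an HTTP GET on the URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts and HTTP error responses
        # (urllib.error.URLError and HTTPError are subclasses of OSError).
        return False


# Example use (performs a network request, so it is left commented out):
# print(solr_is_up(solr_admin_url("localhost")))
```

Replace the host argument with the name of your Solr server, if necessary.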
Distributed Server
For larger data volumes, you will need to set up Solr in a distributed way, too. However, a distributed Solr setup is not yet fully supported by the SMILA integration (especially during indexing).
Configuring SMILA on cluster node
On each cluster node, you have to make the following SMILA configuration changes.
Cluster configuration
You have to define which nodes belong to the cluster.
Configuration file: configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json
Add a new section "clusterNodes":
{ "clusterNodes": ["PC-1", "PC-2", "PC-3"], "taskmanager":{ ... }
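Since the same change has to be made on every node, it may help to apply it with a small script. The following is a sketch only: add_cluster_nodes is a hypothetical helper, and the empty "taskmanager" object is a stand-in for the real section, which is elided in this HowTo and not reproduced here.

```python
import json


def add_cluster_nodes(config, nodes):
    """Return a copy of the cluster configuration with "clusterNodes" set."""
    updated = {"clusterNodes": list(nodes)}
    # Keep all existing sections (e.g. "taskmanager") unchanged.
    updated.update({k: v for k, v in config.items() if k != "clusterNodes"})
    return updated


# Hypothetical stand-in for the existing file content.
existing = {"taskmanager": {}}
print(json.dumps(add_cluster_nodes(existing, ["PC-1", "PC-2", "PC-3"]), indent=2))
```

In practice you would load the existing clusterconfig.json with json.load, pass it to the helper, and write the result back with json.dump.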
Objectstore configuration
You have to define a shared data directory for all nodes.
Configuration file: configuration/org.eclipse.smila.objectstore.filesystem/objectstoreservice.properties
Set the root path to the shared directory:
root.path=/data/smila/shared ...
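Before starting SMILA, it can be worth confirming on each node that the shared directory is actually mounted and writable. This is a minimal sketch of such a check; check_shared_dir is a hypothetical helper, not a SMILA tool, and /data/smila/shared is just the example path from above.

```python
import os
import socket
import tempfile


def check_shared_dir(root_path):
    """Return True if this node can create and remove a file below root_path."""
    if not os.path.isdir(root_path):
        return False
    try:
        # The host name in the prefix only makes stray probe files easy to attribute.
        prefix = "probe-%s-" % socket.gethostname()
        fd, probe = tempfile.mkstemp(prefix=prefix, dir=root_path)
        os.close(fd)
        os.remove(probe)
        return True
    except OSError:
        return False


# Example use on a cluster node:
# print(check_shared_dir("/data/smila/shared"))
```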
Solr configuration
You have to point to the Solr server that you set up above.
Configuration file:configuration/org.eclipse.smila.solr/solr.properties
solr.embedded=false
...
solr.serverUrl=http://<SOLR-HOST>:8983/solr
Jetty configuration
To monitor the cluster node, you have to make the SMILA HTTP server accessible from other machines.
File: SMILA.ini
... -Djetty.host=0.0.0.0 ...
See also Enabling Remote Access to SMILA
Monitoring
You can use the REST API to monitor SMILA cluster activities.
Startup
After SMILA has been started, http://CLUSTER-NODE:8080/smila should show you the configured cluster nodes (SMILA 1.2):
... cluster: { nodes: [ "PC-1", "PC-2", "PC-3" ] } ...
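To verify this on several nodes without reading the output by eye, the reported node list can be compared with the expected one. The sketch below assumes that the response is JSON with a "cluster" object containing a "nodes" array, as the excerpt above suggests; missing_nodes is a hypothetical helper.

```python
import json


def missing_nodes(status, expected):
    """Return the expected node names that the status document does not list."""
    reported = status.get("cluster", {}).get("nodes", [])
    return sorted(set(expected) - set(reported))


# Sample modelled on the output shown above, written as strict JSON.
sample = json.loads('{"cluster": {"nodes": ["PC-1", "PC-2", "PC-3"]}}')
print(missing_nodes(sample, ["PC-1", "PC-2", "PC-3", "PC-4"]))  # prints ['PC-4']
```

An empty list means that every expected node is configured.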
Job run
After a job run has been started, you can check the number of tasks currently being processed on each node in the ZooKeeper state: http://CLUSTER-NODE:8080/zookeeper/smila/taskmanager/hosts
There you should see a list of the cluster nodes and the following output for each of them. (The sample output means that 6 tasks are currently being processed on the given cluster node.)
stat: ... data: "6"
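Note that the counter is delivered as a string, so it has to be converted before it can be summed or compared. The following sketch shows this; the mapping of node names to state objects is a hypothetical layout chosen for illustration, and only the string-valued "data" field is taken from the output shown above.

```python
def task_counts(hosts_state):
    """Map each node name to its task counter, converted from the "data" string."""
    return {node: int(state["data"]) for node, state in hosts_state.items()}


# Hypothetical sample; the real response layout may differ.
sample = {"PC-1": {"data": "6"}, "PC-2": {"data": "0"}, "PC-3": {"data": "2"}}
counts = task_counts(sample)
print(counts, "total:", sum(counts.values()))
```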
You can also count the inprogress tasks under http://CLUSTER-NODE:8080/smila/tasks, which gives the number of tasks currently being processed in the whole cluster. This number can be compared with the maxScaleUp setting for a worker in clusterconfig.json, which is the maximum number of tasks allowed to be processed on one node (see also Taskmanager REST API).
Some useful commands
Remove all documents from a Solr core (Unix shell command):
curl "http://localhost:8983/solr/DefaultCore/update?commit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
Optimize the Solr index (Unix shell command):
curl "http://localhost:8983/solr/DefaultCore/update?commit=true" -H "Content-Type: text/xml" --data-binary '<optimize/>'
Hint: When using Windows, replace each ' with ".
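If curl is not available, or the shell quoting differences are a nuisance, the same two requests can be built with the Python standard library. This is a sketch under the assumption that Python 3 is installed; solr_update_request is a hypothetical helper name, and the requests are only constructed here, not sent.

```python
import urllib.request


def solr_update_request(xml_command, host="localhost", port=8983, core="DefaultCore"):
    """Build a POST request that sends an XML update command and commits."""
    url = "http://%s:%d/solr/%s/update?commit=true" % (host, port, core)
    return urllib.request.Request(
        url,
        data=xml_command.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


delete_all = solr_update_request("<delete><query>*:*</query></delete>")
optimize = solr_update_request("<optimize/>")

# Sending is left commented out because it changes the index:
# urllib.request.urlopen(delete_all, timeout=30).close()
```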