SMILA/Documentation/HowTo/How to setup SMILA in a cluster
Revision as of 10:26, 19 October 2012
Introduction
SMILA is primarily designed as a framework into which you can plug your own or third-party high-performance, scalable components (e.g. for data storage). Nevertheless, it is also possible to set up SMILA out-of-the-box on a cluster using its default implementations. This enables horizontal scaling: import and processing jobs/tasks are shared across the cluster nodes. (Remark: there is also vertical scaling on each cluster machine, but that is nothing new, since a single-node SMILA has it as well.)
The following steps describe how to set up SMILA on multiple cluster nodes.
Install external Solr server
If you want to use Solr for indexing, you need to set up a separate Solr server, because the Solr instances embedded in SMILA cannot be shared with the other SMILA instances.
Single node server
- Download a Solr 3.x archive from http://lucene.apache.org/solr/. This HowTo was tested with Solr v. 3.6.1.
- Unpack the archive to a local directory; you will get a directory like /home/smila/solr/apache-solr-3.6.1.
- Copy the files from SMILA/configuration/org.eclipse.smila.solr to the Solr machine (solr.properties isn't needed here), e.g. to /home/smila/solr/smila-cores.
- Go to /home/smila/solr/apache-solr-3.6.1/example and run:
java -Dsolr.solr.home=/home/smila/solr/smila-cores -jar start.jar
- Check if Solr is running at http://localhost:8983/solr/DefaultCore/admin/ (replace localhost with the name of your Solr server, if necessary).
Distributed server
For larger data volumes you will need to set up Solr in a distributed way, too. However, using a distributed Solr setup is not yet fully supported by the SMILA integration (especially during indexing).
Configuring SMILA on cluster nodes
On each cluster node, you have to do the following SMILA configuration changes.
Cluster configuration
You have to define which nodes belong to the cluster.
Configuration file: SMILA/configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json
Enter new section "clusterNodes" stating the host names of the individual cluster nodes:
{
  "clusterNodes": ["PC-1", "PC-2", "PC-3"],
  "taskmanager": {
    ...
  }
}
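For orientation, a complete clusterconfig.json could look like the following sketch. Only the "clusterNodes" entry is taken from this HowTo; the worker name "myWorker" and the maxScaleUp value shown are purely illustrative assumptions, so keep the taskmanager section your installation already has and only add the "clusterNodes" entry:

```javascript
{
  // host names of all machines in the SMILA cluster (from this HowTo)
  "clusterNodes": ["PC-1", "PC-2", "PC-3"],

  // your installation's existing taskmanager settings; the worker name
  // "myWorker" and the maxScaleUp value are illustrative assumptions only
  "taskmanager": {
    "workers": [
      { "name": "myWorker", "maxScaleUp": 4 }
    ]
  }
}
```

The maxScaleUp value per worker limits how many tasks of that worker may run on one node; it is referred to again in the monitoring section below.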
Objectstore configuration
You have to define a shared data directory for all nodes ("shared" means that the selected directory must be accessible from every machine in your cluster under the same path).
Configuration file: SMILA/configuration/org.eclipse.smila.objectstore.filesystem/objectstoreservice.properties
(The directory/file may not exist in older SMILA versions; just create it.)
Set a root path to the shared directory:
root.path=/data/smila/shared
...

Tip: NFS or SMB/CIFS? When running on Linux, you can use either an NFS or an SMB/CIFS directory (mounted via Samba) for the objectstore. First tests seem to indicate that using an SMB/CIFS directory is much faster, especially if lots of small files are written (as is the case during crawling processes by the Delta or Visited Links service). We also had stability issues with an NFS mount, where a lot of "stale NFS file handle" errors occurred. Of course, the results may largely depend on your environment and could be completely different in your network.
Solr configuration
You have to point to the Solr server that we installed above.
Configuration file: SMILA/configuration/org.eclipse.smila.solr/solr.properties
solr.embedded=false
...
solr.serverUrl=http://<SOLR-HOST>:8983/solr
Jetty configuration
To monitor the cluster nodes, you have to make the SMILA HTTP server accessible from other machines.
File: SMILA/SMILA.ini
...
-Djetty.host=0.0.0.0
...
See also Enabling Remote Access to SMILA
Monitoring
You can use the REST API to monitor SMILA cluster activities.
Startup
After starting SMILA, accessing http://<CLUSTER-NODE>:8080/smila should return the configured cluster nodes in the response (SMILA 1.2):
...
cluster: {
  nodes: [ "PC-1", "PC-2", "PC-3" ]
}
...
Running jobs
After starting a job run, you can check the number of tasks currently being processed on each node in ZooKeeper's state at http://<CLUSTER-NODE>:8080/zookeeper/smila/taskmanager/hosts/.
There you should see a list of the cluster nodes, with output like the following for each of them (this sample means that 6 tasks are currently being processed on that node):
stat: ...
data: "6"
You can also count the inprogress tasks under http://<CLUSTER-NODE>:8080/smila/tasks; this is the number of tasks currently being processed in the whole cluster. This number can be compared with the maxScaleUp setting for a worker in clusterconfig.json, which is the maximum number of tasks of that worker allowed to be processed on one node (see also Taskmanager REST API).
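As a rough illustration of the count described above, the following shell sketch counts "inprogress" entries in a saved copy of the /smila/tasks response. Note that the JSON sample used here is a made-up assumption, not the documented SMILA response format, so adapt the grep pattern to the actual output of your installation:

```shell
# Made-up sample of a /smila/tasks response; in practice you would fetch it
# first, e.g.: curl http://<CLUSTER-NODE>:8080/smila/tasks > /tmp/tasks.json
cat > /tmp/tasks.json <<'EOF'
{ "tasks": [ { "state": "inprogress" },
             { "state": "inprogress" },
             { "state": "waiting" } ] }
EOF

# Count the in-progress entries in the saved response.
INPROGRESS=$(grep -o '"inprogress"' /tmp/tasks.json | wc -l)
echo "in-progress tasks: $INPROGRESS"
```

The resulting number can then be set against maxScaleUp times the number of cluster nodes to judge how loaded the cluster is.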
Some useful commands
Removing all documents from a Solr core (Unix-shell command):
curl http://localhost:8983/solr/DefaultCore/update?commit=true -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
Optimizing the Solr index (Unix-shell command):
curl http://localhost:8983/solr/DefaultCore/update?commit=true -H "Content-Type: text/xml" --data-binary '<optimize/>'
Hint: When using Windows, replace the single quote marks (') in the commands above with double quotes (").