
Difference between revisions of "SMILA/Documentation/HowTo/How to setup SMILA in a cluster"


Revision as of 09:15, 18 October 2012

Install external Solr server

If you want to use Solr for indexing, you need to set up a separate Solr server, because the Solr instance embedded in SMILA cannot be shared with other SMILA instances.

Single-Node Server

  • Download a Solr 3.x archive from http://lucene.apache.org/solr/. This HowTo was tested with Solr 3.6.1.
  • Unpack the archive to a local directory. This gives you a directory like /home/smila/solr/apache-solr-3.6.1.
  • Copy the files from SMILA/configuration/org.eclipse.smila.solr to the Solr machine, e.g. to /home/smila/solr/smila-cores (solr.properties isn't needed).
  • Go to /home/smila/solr/apache-solr-3.6.1/example and run:
java -Dsolr.solr.home=/home/smila/solr/smila-cores -jar start.jar

Distributed Server

For larger data volumes you will also need to set up Solr in a distributed way. However, a distributed Solr setup is not yet fully supported by the SMILA integration (especially during indexing).

Configuring SMILA on cluster node

On each cluster node, make the following changes to the SMILA configuration.

Cluster configuration

You have to define which nodes belong to the cluster.

Configuration file:
configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json

Add a new section "clusterNodes":

{
  "clusterNodes": ["PC-1", "PC-2", "PC-3"],    
  "taskmanager":{
  ...
}
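
If the same change has to be made on many nodes, it can be scripted. The following is a minimal sketch, assuming that clusterconfig.json is plain JSON without comments; the function name add_cluster_nodes and the sample input are hypothetical and not part of SMILA.

```python
import json


def add_cluster_nodes(config_text, nodes):
    """Return clusterconfig.json text with a "clusterNodes" list set.

    All other sections, such as "taskmanager", are kept unchanged.
    """
    config = json.loads(config_text)
    # Put "clusterNodes" first, as in the example above; the order is cosmetic.
    updated = {"clusterNodes": list(nodes)}
    updated.update((k, v) for k, v in config.items() if k != "clusterNodes")
    return json.dumps(updated, indent=2)


if __name__ == "__main__":
    # Hypothetical minimal input; a real clusterconfig.json has more settings.
    original = '{"taskmanager": {}}'
    print(add_cluster_nodes(original, ["PC-1", "PC-2", "PC-3"]))
```

To apply it, read the configuration file on each node, pass its content through the function and write the result back.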

Objectstore configuration

You have to define a shared data directory for all nodes.

Configuration file:
configuration/org.eclipse.smila.objectstore.filesystem/objectstoreservice.properties

Set a root path to the shared directory:

 root.path=/data/smila/shared
 ...

NFS or SMB/CIFS? When running under Linux, you can use either an NFS or an SMB/CIFS directory (mounted via Samba) for the objectstore. First tests seem to indicate that an SMB/CIFS directory is much faster, especially if lots of small files are written (as done during crawling by the Delta or VisitedLinks services).
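
To compare the two mount types on your own hardware, you can time how long it takes to write many small files into each share. The following is a rough sketch, not a SMILA tool: the function name, the file count and the file size are arbitrary assumptions, and the result only approximates the write pattern of the Delta and VisitedLinks services.

```python
import os
import tempfile
import time


def time_small_file_writes(directory, count=200, size=512):
    """Write `count` files of `size` bytes into `directory`, then delete them.

    Returns the elapsed time in seconds for the write phase only.
    """
    payload = b"x" * size
    # Use a private subdirectory so existing objectstore data is never touched.
    workdir = tempfile.mkdtemp(prefix="smila-write-test-", dir=directory)
    try:
        start = time.perf_counter()
        for i in range(count):
            path = os.path.join(workdir, "file-%d.bin" % i)
            with open(path, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # force the data out to the share
        return time.perf_counter() - start
    finally:
        # Clean up the test files and the private subdirectory.
        for name in os.listdir(workdir):
            os.remove(os.path.join(workdir, name))
        os.rmdir(workdir)


if __name__ == "__main__":
    # Uses the local temp directory here; pass the mounted share instead
    # (for example the root.path value) to measure the shared directory.
    target = tempfile.gettempdir()
    print("%.3f s for 200 small files in %s" % (time_small_file_writes(target), target))
```

Run it once against the NFS mount and once against the SMB/CIFS mount, from the same node, and compare the two timings.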


Solr configuration

You have to point to the Solr server that you set up above.

Configuration file:
configuration/org.eclipse.smila.solr/solr.properties

 solr.embedded=false
 ...
 solr.serverUrl=http://<SOLR-HOST>:8983/solr
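
When this change has to be repeated on every node, a small script can rewrite the two properties. The sketch below is an illustration under simplifying assumptions: it handles only plain key=value lines (not the other separators that Java properties files allow), and the function name set_properties and the host name solr-host are hypothetical.

```python
def set_properties(text, updates):
    """Return properties text with the given keys set to new values.

    Existing key=value lines are rewritten in place; missing keys are appended.
    Comments and unrelated lines are kept unchanged.
    """
    remaining = dict(updates)
    result = []
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip()
        is_setting = "=" in stripped and not stripped.startswith(("#", "!"))
        if is_setting and key in remaining:
            result.append("%s=%s" % (key, remaining.pop(key)))
        else:
            result.append(line)
    # Append any keys that were not present in the original text.
    result.extend("%s=%s" % item for item in remaining.items())
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Hypothetical original content of solr.properties.
    original = "solr.embedded=true\n"
    print(set_properties(original, {
        "solr.embedded": "false",
        "solr.serverUrl": "http://solr-host:8983/solr",  # placeholder host
    }))
```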

Jetty configuration

To monitor a cluster node, you have to make the SMILA HTTP server accessible from other machines.

File:
SMILA.ini

 ...
 -Djetty.host=0.0.0.0
 ...

See also Enabling Remote Access to SMILA
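
After restarting SMILA, you can verify from another machine that the HTTP port accepts connections. The following sketch only tests TCP reachability; it assumes port 8080, as used in the monitoring URLs of this HowTo, and the function name port_is_open is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    # Replace localhost with the name of the cluster node to check.
    print(port_is_open("localhost", 8080))
```

A result of False from a remote machine while the same check succeeds locally on the node suggests that Jetty is still bound to the local interface only, or that a firewall blocks the port.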


Monitoring

You can use the REST API to monitor SMILA cluster activities.

Startup

After SMILA has been started, http://CLUSTER-NODE:8080/smila should show you the configured cluster nodes:

 // TODO

Job run

After a job run has been started, you can check the number of tasks that are currently being processed on each node in the ZooKeeper state: http://CLUSTER-NODE:8080/zookeeper/smila/taskmanager/hosts

There you should see a list of the cluster nodes, with output like the following for each of them. The sample output means that 6 tasks are currently being processed on that cluster node:

  stat: ...
  data: "6"

You can also count the inprogress tasks under http://CLUSTER-NODE:8080/smila/tasks, which gives the number of tasks currently being processed in the whole cluster. Compare this number with the maxScaleUp setting for a worker in clusterconfig.json, which is the maximum number of tasks that may be processed on one node (see also Taskmanager REST API).
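
The comparison can be sketched in a few lines once the per-node values have been collected. This example assumes that you have already read the "data" value for each node from the ZooKeeper state and that a single maxScaleUp value applies; the function names and the sample numbers are hypothetical, and fetching the values over HTTP is deliberately left out because the exact response format is not shown here.

```python
def total_tasks(tasks_per_node):
    """Sum the per-node task counts; values are strings such as "6"."""
    return sum(int(data) for data in tasks_per_node.values())


def nodes_at_limit(tasks_per_node, max_scale_up):
    """Return the sorted names of nodes whose count has reached max_scale_up."""
    return sorted(
        node for node, data in tasks_per_node.items()
        if int(data) >= max_scale_up
    )


if __name__ == "__main__":
    # Hypothetical values copied by hand from the ZooKeeper state pages.
    observed = {"PC-1": "6", "PC-2": "2", "PC-3": "6"}
    print("tasks in cluster:", total_tasks(observed))
    print("nodes at maxScaleUp:", nodes_at_limit(observed, 6))
```

Nodes that stay at the limit for a long time while others are idle may indicate that the load is not spread evenly across the cluster.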


Some useful commands

Remove all documents from a Solr core (Unix shell command):

curl "http://localhost:8983/solr/DefaultCore/update?commit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'

Optimize the Solr index (Unix shell command):

curl "http://localhost:8983/solr/DefaultCore/update?commit=true" -H "Content-Type: text/xml" --data-binary '<optimize/>'

Hint: On Windows, replace every ' with ".
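
If curl is not available, the same two requests can be built with the Python standard library. This is a sketch that mirrors the curl commands above; the function names are hypothetical, and the example only constructs the requests without sending them.

```python
import urllib.request


def solr_update_request(core_url, xml_command):
    """Build a POST request that sends an XML update command and commits."""
    return urllib.request.Request(
        core_url.rstrip("/") + "/update?commit=true",
        data=xml_command.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )


def delete_all_request(core_url):
    """Request that removes all documents from the core."""
    return solr_update_request(core_url, "<delete><query>*:*</query></delete>")


def optimize_request(core_url):
    """Request that optimizes the index of the core."""
    return solr_update_request(core_url, "<optimize/>")


if __name__ == "__main__":
    # Only prints the request details; nothing is sent to Solr here.
    request = optimize_request("http://localhost:8983/solr/DefaultCore")
    print(request.get_method(), request.full_url)
```

To actually send a request, pass it to urllib.request.urlopen. Be careful with delete_all_request, since it empties the core.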