Bundle org.eclipse.smila.zookeeper

This bundle provides a service that manages an Apache ZooKeeper server embedded in SMILA. Other SMILA services use ZooKeeper as a kind of "memory shared across multiple cluster nodes" to coordinate application state and work. The main users are the JobManager and TaskManager services, which store information about running jobs and task queues in ZooKeeper.

The service performs the following tasks:

Create the ZooKeeper configuration from the cluster configuration.
Start and watch an embedded ZooKeeper server: a stand-alone server for single-node clusters or a QuorumPeer for multi-node clusters.
Restart the local ZooKeeper server if it fails during operation.
Clean up obsolete files in the local ZooKeeper data directory (garbage collection).
Create ZooKeeper clients connected to the cluster for use by other services.

Configuration

The configuration file is named zoo.cfg and can be found in configuration/org.eclipse.smila.zookeeper in an installation. This is the default file as contained in the bundle itself:

# ZooKeeper configuration file
 
# The number of milliseconds of each tick.
tickTime=2000
 
# The number of ticks that the initial synchronization phase can take.
initLimit=10
 
# The number of ticks that can pass between sending a request
# and getting an acknowledgement.
syncLimit=5
 
# The directory where the snapshot is stored. 
# If not defined here, set by Zookeeper service to somewhere in the SMILA workspace dir.
# dataDir=
 
# The directory where the transaction log is stored.
# If not defined here, set by Zookeeper service to the same value as dataDir
# dataLogDir=
 
# After snapCount transactions are written to a log file a snapshot is started and a new transaction log file is created. 
# The default snapCount is 100,000.
snapCount=100000
 
# The port at which the clients will connect.
clientPort=2181
 
# Session timeout.
maxSessionTimeout=30000
 
# Limit on number of concurrent connections (at the socket level) that
# a single client, identified by IP address, may make to a single member of
# the ZooKeeper ensemble. Default is 10.  Set high to avoid zk connection
# issues running standalone and pseudo-distributed.
# (0 -> unlimited)
maxClientCnxns=0
 
# The port used by followers to connect to the leader.
zk.serverPort=2888
 
# The port for leader election.
zk.electionPort=3888
 
# number of snapshots to keep in data directory.
zk.snapshotsToKeep=5

Usually you do not need to edit this file. Most properties are defined by ZooKeeper itself, so refer to the ZooKeeper documentation for their meaning or other properties to add. You may need to change the ...Port properties if these ports are already in use on your machines.
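To decide whether the ...Port properties need changing, it can help to check which of the configured ports are already occupied on a machine. The following is a minimal sketch, not part of SMILA: the helper names (parse_properties, port_properties, port_is_free) are hypothetical, and the check simply tries to bind a TCP socket, which can report a recently closed port as busy.

```python
import socket


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def port_properties(props):
    """Return all properties whose name ends in 'Port', as integers."""
    return {k: int(v) for k, v in props.items() if k.endswith("Port")}


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


# Excerpt of the default zoo.cfg shown above.
SAMPLE_CFG = """
clientPort=2181
zk.serverPort=2888
zk.electionPort=3888
"""

if __name__ == "__main__":
    for name, port in port_properties(parse_properties(SAMPLE_CFG)).items():
        state = "free" if port_is_free(port) else "in use"
        print(f"{name}={port}: {state}")
```

Run the check on every cluster machine before starting SMILA, since the ports must be available on all nodes.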

Configuration of data directories

For best performance, the ZooKeeper documentation recommends putting the snapshot and transaction log directories on separate hard disks on which nothing else is stored. Usually ZooKeeper works fine without this separation. However, under very high I/O load caused by other SMILA components, ZooKeeper operations can be delayed so much that performance suffers. In such cases, try moving the ZooKeeper data directories away from the device that stores the SMILA data directory. You can do this by setting the property dataDir and, optionally, dataLogDir, for example:

dataDir=/mnt/ramdisk/zk/snapshots
dataLogDir=/mnt/ramdisk/zk/txnlogs

If only dataDir is set, both snapshots and transaction logs are written to this directory. In more extreme cases you may want to separate the transaction logs from the snapshots as well. In this case, set the property dataLogDir to specify a different directory for the transaction logs.

Alternatively, you can use the environment variable SMILA_ZK_DATADIR to specify dataDir and dataLogDir without changing configuration files. Settings in zoo.cfg override the environment variable.
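The precedence described above can be summarized in a short sketch. This is an illustration of the documented rules, not the service's actual code: the function name resolve_data_dirs and the placeholder default path are hypothetical. It also assumes that dataLogDir follows dataDir from zoo.cfg when only dataDir is set there.

```python
import os


def resolve_data_dirs(cfg, env=None, workspace_default="workspace/zookeeper"):
    """Return (dataDir, dataLogDir) following the documented precedence.

    cfg               -- dict of properties read from zoo.cfg
    env               -- environment mapping (defaults to os.environ)
    workspace_default -- hypothetical stand-in for the directory the service
                         picks inside the SMILA workspace
    """
    env = os.environ if env is None else env
    # zoo.cfg wins over the environment variable, which wins over the default.
    data_dir = cfg.get("dataDir") or env.get("SMILA_ZK_DATADIR") or workspace_default
    # Without an explicit dataLogDir, transaction logs go to dataDir.
    data_log_dir = cfg.get("dataLogDir") or data_dir
    return data_dir, data_log_dir


if __name__ == "__main__":
    env = {"SMILA_ZK_DATADIR": "/mnt/ramdisk/zk"}
    print(resolve_data_dirs({}, env))
    print(resolve_data_dirs({"dataDir": "/mnt/disk2/zk"}, env))
```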

If your cluster machines do not have suitable additional devices, consider putting the ZooKeeper data on a RAM disk. In this case you may also want to change the value of the property zk.snapshotsToKeep to minimize the required size of the RAM disk. When certain limits are reached, ZooKeeper creates a new snapshot in dataDir and starts a new transaction log in dataLogDir. A separate thread in the SMILA ZooKeeper service cleans up these directories regularly by removing obsolete snapshot and transaction log files. The property zk.snapshotsToKeep determines how many of the most recent snapshot and log files are kept. The default value of 5 is a recommendation from the ZooKeeper documentation, but it should be safe to reduce it if necessary. The minimum value is 3. If you set a lower value, it is increased to 3 automatically.

In any case, a maximum size setting of 500 MB for the RAM disk should be sufficient. The RAM disk does not use the complete space immediately, but only on demand, so it is no problem to set the limit higher than absolutely necessary.
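The effect of zk.snapshotsToKeep, including the automatic increase to 3, can be sketched as follows. This is a simplified illustration and not the SMILA cleanup thread itself: the function names are hypothetical, only snapshot files are considered, and the file naming (snapshot.<hexadecimal transaction id>) is assumed to follow the usual ZooKeeper convention.

```python
MIN_SNAPSHOTS_TO_KEEP = 3  # lower configured values are raised to this minimum


def effective_snapshots_to_keep(configured):
    """Apply the documented minimum of 3 to the configured value."""
    return max(MIN_SNAPSHOTS_TO_KEEP, configured)


def snapshots_to_delete(file_names, configured_keep):
    """Return the snapshot files that are older than the newest N snapshots."""
    keep = effective_snapshots_to_keep(configured_keep)
    snapshots = [name for name in file_names if name.startswith("snapshot.")]
    # Newest first: the suffix is assumed to be a hexadecimal transaction id.
    snapshots.sort(key=lambda name: int(name.split(".", 1)[1], 16), reverse=True)
    return snapshots[keep:]


if __name__ == "__main__":
    files = ["snapshot.1", "snapshot.a", "snapshot.5", "snapshot.ff", "log.1"]
    # A configured value of 1 is treated as 3, so only the oldest snapshot goes.
    print(snapshots_to_delete(files, 1))
```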

If you still have problems with tasks being retried because keep-alive messages cannot be delivered, consider increasing the time-to-live configuration of the JobManager and TaskManager.

Notes on Cluster configuration

Using the cluster configuration service, you can define a "failsafety level" for tasks. This setting affects the ZooKeeper setup in the cluster: to tolerate n ZooKeeper failures, a ZooKeeper ensemble must consist of at least 2*n+1 nodes, because a majority of the ensemble must still be available to confirm write requests. Thus, by default, a cluster of m nodes can tolerate m/2-1 node failures, with m/2 rounded up for odd values of m. For example, clusters of 3 or 4 nodes tolerate 1 failure, and a cluster of 5 nodes tolerates 2.
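The majority rule can be checked with a few lines of arithmetic. The sketch below uses hypothetical function names. It relies on the fact that "m/2 rounded up, minus 1" equals the integer division (m-1)//2.

```python
def max_tolerated_failures(ensemble_size):
    """Failures a voting ensemble survives while a majority remains."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return (ensemble_size - 1) // 2


def min_ensemble_size(failures):
    """Smallest ensemble that tolerates the given number of failures."""
    return 2 * failures + 1


if __name__ == "__main__":
    for m in range(1, 8):
        print(f"{m} nodes tolerate {max_tolerated_failures(m)} failure(s)")
```

Note that an even-sized ensemble tolerates no more failures than the next smaller odd-sized one.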

For large clusters, using the complete cluster as the voting ensemble in ZooKeeper can become a performance problem, because each voting node must be asked to confirm every write request. Therefore it is possible to reduce the "failsafety level". Setting a lower failsafety level has the effect that only (max_fail_nodes*2 + 1) nodes are configured as the voting ZooKeeper ensemble, while the ZooKeeper servers on the remaining nodes are configured as "observers". Then at most max_fail_nodes nodes of the voting ensemble may fail and operation will still proceed. Of the "observer" nodes, any number may fail.

It is currently not possible to configure max_fail_nodes=0 (which would mean that there is only one voting server) in a multi-node cluster, because at least two voting ZooKeeper servers are needed for the voting algorithms to work. If you set this value anyway, the default value (maximum failsafety) is used instead.

In clusters with an even number of nodes greater than 3, there will always be at least one observer node, even with the default setting, because a cluster with an even number of nodes does not provide more failsafety than a cluster with one node less. In a 2-node cluster both nodes run voting servers, but there is no additional failsafety either. In other words, it does not make sense to use a 2-node cluster. Use 3 nodes instead.
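The rules in this section can be combined into a small planning sketch. It illustrates the described behavior and is not the actual cluster configuration service: the function plan_ensemble and its return format are hypothetical, and treating negative or too large max_fail_nodes values like the default is an assumption.

```python
def plan_ensemble(nodes, max_fail_nodes=None):
    """Split a cluster into voting servers and observers."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if nodes <= 2:
        # One node runs stand-alone; in a 2-node cluster both nodes vote,
        # but no failure can be tolerated.
        return {"voters": nodes, "observers": 0, "tolerated_failures": 0}
    max_possible = (nodes - 1) // 2
    # 0 is not allowed in a multi-node cluster and falls back to the default.
    # Assumption: out-of-range values are handled the same way.
    if max_fail_nodes is None or max_fail_nodes <= 0 or max_fail_nodes > max_possible:
        max_fail_nodes = max_possible
    voters = 2 * max_fail_nodes + 1
    return {
        "voters": voters,
        "observers": nodes - voters,
        "tolerated_failures": max_fail_nodes,
    }


if __name__ == "__main__":
    print(plan_ensemble(4))       # default: 3 voters, 1 observer
    print(plan_ensemble(10, 2))   # reduced level: 5 voters, 5 observers
```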

