Using Hudson/Distributed builds
Distributed building using Hudson
Hudson supports the "master/slave" mode, where the workload of building projects is delegated to multiple "slave" nodes, allowing a single Hudson installation to host a large number of projects, or to provide the different environments needed for builds/tests. This document describes this mode and how to use it.
How does this work?
A "master" is an installation of Hudson. When you weren't using the master/slave support, a master was all you had. Even in the master/slave mode, the role of a master remains the same. It will serve all HTTP requests, and it can still build projects on its own.
Slaves are computers that are set up to build projects for a master. Hudson runs a separate program called a "slave agent" on slaves. There are various ways to start slave agents, but in the end a slave agent and the Hudson master need to establish a bi-directional byte stream (for example, a TCP/IP socket).
When slaves are registered to a master, the master starts distributing load to the slaves. The exact delegation behavior depends on the configuration of each project. Some projects may choose to "stick" to a particular machine for a build, while others may choose to roam freely between slaves. For people accessing the Hudson website, things work mostly transparently. You can still browse javadoc, see test results, and download build results from the master, without ever noticing that builds were done by slaves.
Follow the Step by step guide to set up master and slave machines to quickly start using distributed builds.
Different ways of starting slave agents
Pick the right method depending on your environment and the OS that the master and slaves run on.
Have master launch slave agent via ssh
Hudson has a built-in SSH client implementation that it can use to talk to a remote sshd and start a slave agent. This is the most convenient and preferred method for Unix slaves, which normally have sshd out of the box. Click Manage Hudson, then Manage Nodes, then click "New Node." In this set up, you'll supply the connection information (the slave host name, user name, and ssh credential). Note that the slave will need the master's public ssh key copied to ~/.ssh/authorized_keys. Hudson will do the rest of the work by itself, including copying the binary needed for a slave agent, and starting/stopping slaves. If your project has external dependencies (like a special ~/.m2/settings.xml, or a special version of Java), you'll need to set those up yourself, though.
This is the most convenient set up on Unix.
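As an illustration of the key-copying step, here is a minimal sketch of a helper you could run on the slave once the master's public key file has been copied over. The function name and the file paths in the usage note are assumptions made for this example; Hudson does not ship such a helper.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# unless that exact line is already present.
authorize_key() {
    pubkey_file=$1
    auth_file=$2
    # sshd usually insists on restrictive permissions for these paths.
    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -q quiet, -x match the whole line, -F treat the key as a fixed string
    if ! grep -qxF -- "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}
```

For example, if the master's key was copied to /tmp/master.pub (an assumed location), you would run authorize_key /tmp/master.pub ~/.ssh/authorized_keys as the user Hudson logs in as. Running it twice leaves a single entry.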
Have master launch slave agent on Windows
For Windows slaves, Hudson can use the remote management facility built into Windows 2000 or later (WMI+DCOM, to be more specific). In this set up, you'll supply the username and the password of a user who has administrative access to the system, and Hudson will use them to remotely create a Windows service and remotely start/stop it.
This is the most convenient set up on Windows, but does not allow you to run programs that require display interaction (such as GUI tests).
Note: Unlike the other node configuration types, the node's name is very important here, because it is used as the address of the machine on which the service is created.
Write your own script to launch Hudson slaves
If the above turn-key solutions do not provide the flexibility you need, you can write your own script to start a slave. You place this script on the master, and tell Hudson to run this script whenever it needs to connect to a slave.
Typically, your script uses a remote program execution mechanism like SSH, RSH, or other similar means (on Windows, this could be done by the same protocols through cygwin or tools like psexec), but Hudson doesn't really assume any specific method of connectivity.
What Hudson expects from your script is that, in the end, it has to execute the slave agent program like java -jar slave.jar, on the right computer, and have its stdin/stdout connect to your script's stdin/stdout. For example, a script that does "ssh myslave java -jar ~/bin/slave.jar" would satisfy this.
(The point is that you let Hudson run this command, as Hudson uses this stdin/stdout as the communication channel to the slave agent. Because of this, running this manually from your shell will do you no good).
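A launch script along these lines can be very small. The following is only a sketch: the function name, the host name, and the default jar location are placeholders chosen for this example, and it assumes java is on the remote user's PATH.

```shell
#!/bin/sh
# Sketch of a launch script that Hudson runs on the master.
# Host name and jar path are placeholders, not values Hudson requires.

# Start the slave agent on the given host over ssh. ssh connects the
# remote java process's stdin/stdout to this script's stdin/stdout,
# which is the channel Hudson talks over.
launch_slave() {
    host=$1
    # A relative path is resolved against the remote user's home directory.
    jar=${2:-bin/slave.jar}
    ssh "$host" "java -jar $jar"
}
```

The script would end with a call such as launch_slave myslave. Since stdout is the communication channel, the script itself should not print anything to it.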
A copy of slave.jar can be downloaded from http://yourserver:port/jnlpJars/slave.jar . Many people write their scripts so that this 160K jar is downloaded each time the script runs, to make sure a consistent version of slave.jar is always used. The SSH Slaves plugin does this automatically, so slaves configured using this plugin always use the correct slave.jar.
Technically speaking, in this set up you should update slave.jar every time you upgrade Hudson to a new version. However, in practice slave.jar changes infrequently enough that it's also practical not to update until you see a fatal problem in start-up.
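One way to keep slave.jar current is to fetch it as part of the launch command. The sketch below assumes wget is installed on the slave and that the master URL is passed in by the caller; the function names are made up for this example.

```shell
# Build the remote command: fetch slave.jar from the master, then start it.
# wget -q keeps the download quiet, so only the agent writes to stdout.
remote_command() {
    master_url=$1   # for example http://yourserver:port
    printf 'wget -q -O slave.jar %s/jnlpJars/slave.jar && java -jar slave.jar' "$master_url"
}

# Run that command on the slave; $1 is the slave host, $2 the master URL.
launch_fresh_slave() {
    ssh "$1" "$(remote_command "$2")"
}
```

Because the java process only starts if the download succeeds, a failed download shows up as a failed launch rather than as an agent running a stale jar.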
Launching slaves this way often requires additional initial set up on slaves (especially on Windows, where a remote login mechanism is not available out of the box), but the benefit of this approach is that when the connection goes bad, you can use Hudson's web interface to re-establish the connection.
Launch slave agent via Java Web Start
Another way of doing this is to start a slave agent through Java Web Start (JNLP). In this approach, you'll interactively log on to the slave node, open a browser, and open the slave page. You'll then be presented with the JNLP launch icon. Upon clicking it, Java Web Start will kick in, and it launches a slave agent on the computer where the browser was running.
This mode is convenient when the master cannot initiate a connection to slaves, such as when it runs outside a firewall while the slaves are inside the firewall. On the other hand, if the machine with a slave agent goes down, the master has no way of re-launching it on its own.
On Windows, you can do this manually once, then from the launched JNLP slave agent, you can install it as a Windows service so that you don't need to interactively start the slave from then on.
If you need display interaction (e.g. for GUI tests) on Windows and you have a dedicated (virtual) test machine, this is a suitable option. Create a hudson user account, enable auto-login, and put a shortcut to the JNLP file in the Startup items (after having trusted the slave agent's certificate). This allows one to run tests as a restricted user as well.
Launch slave agent headlessly
This launch mode uses a mechanism very similar to Java Web Start, except that it runs without a GUI, making it convenient for running as a daemon on Unix. To do this, configure this slave to be a JNLP slave, take slave.jar as discussed above, and then from the slave, run a command like this:
$ java -jar slave.jar -jnlpUrl http://yourserver:port/computer/slave-name/slave-agent.jnlp
Make sure to replace "slave-name" with the name of your slave.
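To run the same command in the background, something like the following sketch could be used. The function names and the log file name are assumptions for this example, and slave.jar is expected in the current directory.

```shell
# Build the slave-agent.jnlp URL for a named slave.
jnlp_url() {
    master_url=${1%/}   # drop one trailing slash if present
    slave_name=$2
    printf '%s/computer/%s/slave-agent.jnlp' "$master_url" "$slave_name"
}

# Start the agent in the background, detached from the terminal.
# The log file name is an arbitrary choice for this sketch.
start_headless_slave() {
    nohup java -jar slave.jar -jnlpUrl "$(jnlp_url "$1" "$2")" \
        > slave-agent.log 2>&1 &
}
```

For instance, start_headless_slave http://yourserver:port slave-name starts the agent and returns immediately; any start-up errors end up in slave-agent.log.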
Also note that the slaves are a kind of a cluster, and operating a cluster (especially a large one or heterogeneous one) is always a non-trivial task. For example, you need to make sure that all slaves have JDKs, Ant, CVS, and/or any other tools you need for builds. You need to make sure that slaves are up and running, etc. Hudson is not a clustering middleware, and therefore it doesn't make this any easier.
Example: Configuration on Unix
This section describes my current set up of Hudson slaves that I use inside Sun for my day job. My master Hudson node is running on a SPARC Solaris box, and I have many SPARC Solaris slaves, Opteron Linux slaves, and a few Windows slaves.
- Each computer has a user called hudson and a group called hudson. All computers use the same UID and GID. (If you have access to NIS, this can be done more easily.) This is not a Hudson requirement, but it makes the slave management easier.
- On each computer, /var/hudson directory is set as the home directory of user hudson. Again, this is not a hard requirement, but having the same directory layout makes things easier to maintain.
- All machines run SSHD. Windows slaves run cygwin sshd.
- All machines have an NTP client installed, and synchronize their clocks regularly with the same NTP server.
- Master's /var/hudson has all the build tools beneath it: a few versions of Ant, Maven, and JDKs. JDKs are native programs, so I have JDK copies for all the architectures I need. The directory structure looks like this:
    /var/hudson
      +- .ssh
      +- bin
      |   +- slave  (more about this below)
      +- workspace  (Hudson creates this directory and stores all data files inside)
      +- tools
          +- ant-1.5
          +- ant-1.6
          +- maven-1.0.2
          +- maven-2.0
          +- java-1.4 -> native/java-1.4  (symlink)
          +- java-1.5 -> native/java-1.5  (symlink)
          +- native -> solaris-sparcv9  (symlink; different on each computer)
          +- solaris-sparcv9
          |   +- java-1.4
          |   +- java-1.5
          +- linux-amd64
              +- java-1.4
              +- java-1.5
- Master's /var/hudson/.ssh has private/public key and authorized_keys so that a master can execute programs on slaves through ssh, by using public key authentication.
- On master, I have a little shell script that uses rsync to synchronize master's /var/hudson to slaves (except /var/hudson/workspace). I use this to replicate tools on all slaves.
- /var/hudson/bin/launch-slave is a shell script that Hudson uses to execute jobs remotely. This shell script sets up PATH and a few other things before launching slave.jar. Below is a very simple example script.
    JAVA_HOME=/opt/SUN/jdk1.6.0_04
    PATH=$PATH:$JAVA_HOME/bin
    export PATH
    java -jar /var/hudson/bin/slave.jar
- Finally all computers have other standard build tools like svn and cvs installed and available in PATH.
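The rsync step in the list above could look roughly like the sketch below. This is not the actual script from this set up: the function names are invented, and excluding tools/native is an assumption added here because that symlink is described as different on each computer.

```shell
# Copy the master's /var/hudson tree to one slave, skipping the workspace.
# -a preserves permissions, times and symlinks. The leading slash in each
# exclude pattern anchors it to the top of the transferred tree.
sync_tools() {
    rsync -a --exclude=/workspace/ --exclude=/tools/native \
        /var/hudson/ "$1:/var/hudson/"
}

# Push to every slave named on the command line.
sync_all() {
    for host in "$@"; do
        sync_tools "$host"
    done
}
```

Calling sync_all with the slave host names as arguments pushes the tools to each of them in turn. The trailing slash on the source path makes rsync copy the contents of /var/hudson rather than create a nested hudson directory on the slave.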
Some slaves are faster, while others are slow. Some slaves are closer (network wise) to a master, others are far away. So doing a good build distribution is a challenge. Currently, Hudson employs the following strategy:
- If a project is configured to stick to one computer, that's always honored.
- Hudson tries to build a project on the same computer on which it was previously built.
- Hudson tries to move long builds to slaves, because the amount of network interaction between a master and a slave tends to be logarithmic in the duration of a build (in other words, even if project A takes twice as long to build as project B, it won't require double the network transfer). So this strategy reduces the network overhead.
If you have interesting ideas (or better yet, implementations), please let me know.
Transition from master-only to master/slave
Typically, you start with a master-only installation and then much later you add slaves as your projects grow. When you enable the master/slave mode, Hudson automatically configures all your existing projects to stick to the master node. This is a precaution to avoid disturbing existing projects, since most likely you won't be able to configure slaves correctly without trial and error. After you configure slaves successfully, you need to individually configure projects to let them roam freely. This is tedious, but it allows you to work on one project at a time.
Projects that are newly created on master/slave-enabled Hudson will be by default configured to roam freely.
Master on public network, slaves within firewall
One might consider setting up the Hudson master on the public network (so that people can see it), while leaving the build slaves within the firewall (because having a lot of machines on the internet is expensive.) There are two ways to make it work:
- Allow port-forwarding from the master to your slaves within the firewall. The port-forwarding should be restricted so that only the master with its known IP can connect to slaves. With this set up in the firewall, as far as Hudson is concerned it's as if the firewall doesn't exist.
- Use JNLP slaves and have slaves connect to the master, not the other way around. In this case it's the slaves that initiate the connection, so it works correctly with a NAT firewall.
Note that in both cases, once the master is compromised, all your slaves can be easily compromised (in other words, a malicious master can execute arbitrary programs on slaves), so both set-ups leave much to be desired in terms of isolating a security breach. The Build Publisher Plugin provides another way of doing this, in a more secure fashion.
Running Multiple Slaves on the Same Machine
It is possible to run multiple slave instances on a Windows machine, and have them installed as separate Windows services so they can start up on system startup. While the correct use of executors largely obviates the need for multiple slave instances on the same machine, there are some unique use cases to consider:
- You want more configurability between the configured nodes. Say you have one node set to be used as much as possible, and the other node to be used only when needed.
- You may have multiple Hudson master installations building different things, and so this configuration would allow you to have slaves for more than one master on the same box. That's right, with Hudson you really can serve two masters.
Follow these steps to get multiple slaves working on the same Windows box:
- Add the first slave node in Hudson and give it its own working dir (e.g. hudson-slave-a).
- Go to the slave page from the slave box and launch by JNLP, then use the menu to install it as a service instead.
- Once the service is running, you'll get hudson-slave.exe and hudson-slave.xml in your slave's work dir.
- Bring up Windows services and stop the Hudson Slave service.
- Open a shell prompt, cd into the slave work dir.
- First run "hudson-slave.exe uninstall" to uninstall the one that the jnlp-launched app installed. This should remove it from the service list.
- Now edit hudson-slave.xml. Modify the id and name values so that your multiple slaves are distinct. I called mine hudson-slave-a and Hudson Slave A.
- Run hudson-slave.exe install and then check the Windows service list to ensure it is there. Start it up, and watch Hudson to see if the slave instance becomes active.
- Now repeat this process for a second slave, beginning with configuring the new node in the master config.
When you create the second node, it is convenient to copy an existing node, so copy the first node you set up. Then you just tweak the Remote FS Root and a couple of other settings to make it distinct. When you are done you should have two (or more) Hudson slave services in the list of Windows services.
Some interesting pages on issues (and resolutions) occurring when using Windows slaves:
Some more general troubleshooting tips:
Every time Hudson launches a program locally/remotely, it prints out the command line to the log file. So when a remote execution fails, log in to the computer that runs the master using the same user account, and try to run the command from your shell. You tend to solve problems quickly this way.
Each slave has a log page showing the communication between the master and the slave agent. This log often shows error reports.
If you use a binary-unsafe remoting mechanism like telnet to launch a slave, add the -text option to slave.jar so that Hudson avoids sending binary data over the network.
When the same command runs outside Hudson just fine, make sure you are testing it with the same user account as Hudson runs under. In particular, if you run Hudson master on Windows, consult How to get command prompt as the SYSTEM user.
Feel free to send your trouble to