Hudson-ci/features/Backup

From Eclipsepedia

Hudson Backup Design

(Proposed by: Stuart Lorber)

Our Current Backup Strategy

We currently use a cloud-based solution to back up most of the servers in our company. The issue of backups came up because we add and purge upwards of 20 GB of data per day (and this number is growing), which does not allow us to use the cloud-based backup solution we use for our other company servers. The churn and space requirements are too high to keep any useful history, and the type of storage we need is cost prohibitive.

We debated what the backup solution should be for our Hudson server. We know that we need an offsite “hot backup” for our source control server but we’ve been debating the need for a hot backup for our Hudson server.

Assuming there is a suitable server available we would need to:

  1. Install the appropriate OS.
  2. Install Hudson.
  3. Restore our Hudson backup.

The last step would probably take the longest as we currently have approximately 1.5 million files and approximately 300,000 directories / symbolic links in our Hudson directory.

We’ve set up test Hudson servers so often that if a suitable machine is available, we can have a new server running within 24 to 36 hours (depending on artifact restoration time).

Because we cannot use our cloud-based backup solution and we are currently evaluating our acceptable backup requirements, our short term solution was to hook a USB 3 external hard drive up to our Hudson server and run a nightly CRON job that calls rsync with appropriate options (rsync -avzh --delete) to properly copy symbolic links and purge obsolete data. This backup may take 2 minutes or 2 hours depending on the amount of changed data.
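As a rough illustration of the nightly job described above, the following Python sketch builds the same rsync invocation and optionally adds rsync's --dry-run flag, which reports what would change without copying anything. The function names and the example paths are hypothetical; only the rsync options come from the text.

```python
import subprocess


def build_rsync_command(source, destination, dry_run=False):
    """Build the rsync argument list used for the nightly copy."""
    # -a preserves symbolic links, permissions and timestamps;
    # --delete purges files on the backup that no longer exist in the source.
    command = ["rsync", "-avzh", "--delete"]
    if dry_run:
        # Report what would be transferred or deleted without changing anything.
        command.append("--dry-run")
    # A trailing slash on the source copies the directory contents
    # rather than the directory itself.
    command += [source.rstrip("/") + "/", destination]
    return command


def run_backup(source, destination, dry_run=False):
    """Run rsync and return its exit code and captured output."""
    result = subprocess.run(
        build_rsync_command(source, destination, dry_run),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout
```

A CRON entry could call a script that invokes run_backup("/path/to/hudson", "/path/to/usb/hudson"), where both paths are placeholders for the real locations.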

We have been checking the logs generated by the nightly rsync but have not yet tested a restore.

During the time the backup is being done, Hudson jobs are still running. Therefore, the final state of the backup does not mark a definitive moment in time; it marks a span of time. We are concerned this will cause problems after a restore; the restored artifacts might be in an inconsistent (and therefore unreliable) state or, worse, this inconsistent state might prevent Hudson from starting.

Examples of inconsistent states or states that prevent Hudson from starting:

  1. Inconsistency between teams.xml and existing team jobs.
  2. Job configuration “corruption”; occurs if a job configuration references a plugin that is not restored.
  3. Jobs that are referenced but do not exist on the system (similar to #1).
  4. Jobs that are restored but are not referenced (therefore never visible to the user).
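Items 1, 3 and 4 above are all mismatches between what the configuration references and what exists on disk. A minimal sketch of such a check follows. It assumes the caller has already collected the referenced names (for example from teams.xml) and the names found on the file system; it does not parse any Hudson file, and the function name is hypothetical.

```python
def find_inconsistencies(referenced, on_disk):
    """Compare names referenced by configuration with names found on disk."""
    referenced, on_disk = set(referenced), set(on_disk)
    return {
        # Referenced in configuration but not restored: may prevent startup.
        "referenced_but_missing": sorted(referenced - on_disk),
        # Restored but not referenced: never visible to the user.
        "present_but_unreferenced": sorted(on_disk - referenced),
    }
```

Running this kind of check after a restore, before starting Hudson, would turn a failed startup into a readable list of items to reconcile.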

These problems require manual “debugging” to get Hudson to start. We are very comfortable with the Hudson file system, so we can clean up any problems and get the system started, but that does not guarantee a stable system. For example, the system can start but have jobs that won’t build.

A Hudson Backup Solution

We like the idea of having the backup directly reflecting the file system. This would support a Hudson restore operation. It would also allow a manual restore if there’s a problem with the Hudson solution. It also allows manual interrogation of the backup.

The backup should include all configuration data as well as all jobs and job artifacts. We want to have our build system fully restored and have as little downtime as possible; there should be no manual intervention.

In a situation like ours it would not work to put the system in a “safe restart” type of state during the backup process. This would require all running jobs to finish and would not start any queued jobs.

Some of our test jobs, particularly our UI regression tests, can take upwards of 3 hours to complete. This would prevent the backup from starting and prevent new jobs from starting for up to those 3 hours plus the time to complete the backup. In our situation, since we process jobs for approximately 16 hours per day and those hours are not contiguous, this is not a viable solution. Therefore the backup would need to run while the system is available for normal job processing.

Backups should determine new artifacts to back up at the time of the backup – as opposed to keeping a record of what’s been previously backed up. As previously mentioned, our short-term backup solution is to use USB drives with rsync. We plan to switch these USB drives weekly so one drive can be stored off site. Therefore we would want Hudson to look at the backup media and use that to determine what to back up.
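One way to decide what to back up from the media alone, with no stored record of earlier backups, is to compare each source file with its counterpart on the backup by size and modification time. The sketch below illustrates that idea under those assumptions; the function name is hypothetical, and symbolic links are skipped here because they would need separate handling.

```python
import os


def files_needing_backup(source, backup):
    """Return relative paths of regular files that are new or changed."""
    pending = []
    for dirpath, _dirnames, filenames in os.walk(source):
        for name in filenames:
            src_path = os.path.join(dirpath, name)
            if os.path.islink(src_path):
                continue  # symbolic links are not compared in this sketch
            rel = os.path.relpath(src_path, source)
            src_stat = os.stat(src_path)
            try:
                dst_stat = os.stat(os.path.join(backup, rel))
            except FileNotFoundError:
                pending.append(rel)  # not on the backup media yet
                continue
            size_differs = src_stat.st_size != dst_stat.st_size
            source_newer = int(src_stat.st_mtime) > int(dst_stat.st_mtime)
            if size_differs or source_newer:
                pending.append(rel)
    return sorted(pending)
```

Because the decision depends only on the media that is currently mounted, swapping in last week's drive simply produces a longer list.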

Job / Build Artifact Backups

One solution for backing up job artifacts might be to back up only completed builds. This would eliminate the situation of backing up job artifacts that are in an inconsistent state. One problem with this would be if a build is being backed up and the running job, based on the number of builds to be kept, causes one or more builds to be purged. This might cause the backup of a build in an inconsistent state, or a job failure because objects are locked and a build cannot be purged.
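The following sketch shows one way the "completed builds only" idea could work while tolerating a build being purged mid-copy. It copies each build into a temporary ".partial" directory and renames it only after the copy succeeds, so the backup never contains a half-copied build. How Hudson would detect a completed build is not specified here, so the is_complete callback, the function name and the ".partial" suffix are all assumptions for illustration.

```python
import os
import shutil


def backup_completed_builds(builds_dir, dest_dir, is_complete):
    """Copy completed build directories; return (copied, skipped) names."""
    copied, skipped = [], []
    for name in sorted(os.listdir(builds_dir)):
        src = os.path.join(builds_dir, name)
        if os.path.islink(src) or not os.path.isdir(src):
            continue  # only real build directories are considered
        if not is_complete(src):
            skipped.append(name)  # still running: leave for the next backup
            continue
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # already on the backup media
        partial = target + ".partial"
        shutil.rmtree(partial, ignore_errors=True)  # clear an earlier failed attempt
        try:
            shutil.copytree(src, partial, symlinks=True)
        except OSError:
            # The build was purged or changed underneath us; discard the copy.
            shutil.rmtree(partial, ignore_errors=True)
            skipped.append(name)
            continue
        os.rename(partial, target)  # publish only a fully copied build
        copied.append(name)
    return copied, skipped
```

This approach takes no locks on the source, so it would not block a running job from purging old builds; the cost is that a purged build is simply skipped.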

System Configuration Backups

If a user were in the process of changing a job configuration, would the backup process need to skip the file? Would the file be locked, causing the user a problem when saving changes? This would apply to any configuration change – Hudson configuration / security / nodes.

It’s reasonable for a user to lose the latest build (or parts of the latest build) if it’s still running. The stability and reliability of the restored system is the most important part of the backup.

Non-job artifacts are trickier; objects like log files for SCM polling, fingerprints, and node logs are continually updated.

Of course all plugins and configuration information need to be backed up.

Backup Configuration Options


Backup Configuration

Backup settings

 Backup directory _____________________________________ (?)  

 Backup schedule  _____________________________________ (?)  
 E-mail notification  

  [ ] Recipients ______________________________________ (?)

   [ ] Send e-mail for every backup                     (?)

 [ ] Continue on warning                                (?)

[ Backup Now ]         <-- BUTTON


  • Backup directory – Verify the location after user enters a value and exits the field (as is done, for instance, when a user enters a new job name and the job name already exists) or provide a button to verify the location and appropriate permissions. For Linux this location can be any type of read/writeable device that can be mounted on the server. The user can determine what type of device this can be and therefore set their safety and cost threshold. For Windows this location can be a network drive or a mounted device and, again, the user can decide their safety and cost threshold. In either case Hudson doesn’t need to be concerned with the details.
  • Backup schedule – Standard CRON configuration.
  • Email notification – Standard comma-separated list.
  • Send e-mail for every backup – This refers to successful backups. An email should always be sent to the recipient list for any unsuccessful backup, regardless of this setting.
  • Continue on warning – TBD.
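The settings above could be modelled and validated along these lines. This is a sketch only: the class and field names are hypothetical, and the schedule check assumes the common five-field CRON form, which may differ from what Hudson accepts.

```python
import os
from dataclasses import dataclass


@dataclass
class BackupSettings:
    directory: str
    schedule: str
    recipients: str = ""
    email_every_backup: bool = False
    continue_on_warning: bool = False

    def recipient_list(self):
        """Split the comma-separated recipients field into clean addresses."""
        return [r.strip() for r in self.recipients.split(",") if r.strip()]

    def problems(self):
        """Return human-readable validation problems (empty when valid)."""
        issues = []
        if not os.path.isdir(self.directory):
            issues.append("backup directory does not exist")
        elif not os.access(self.directory, os.R_OK | os.W_OK):
            issues.append("backup directory is not readable and writable")
        # Assumption: a schedule has five whitespace-separated CRON fields.
        if len(self.schedule.split()) != 5:
            issues.append("schedule must have five CRON fields")
        return issues
```

The problems() method corresponds to the verification that would run when the user leaves the Backup directory field or presses a verify button.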

Email notifications should contain information about a backup’s “Success” or “Failure” and a link to the backup log – similar to an email sent out about a job’s status.

Backup Logging

We need effective, descriptive, detailed logging. This can take many forms.

Logging can provide the user with what is being backed up on an “object by object”, “build by build” or “job by job” level. This can be configurable or the user can have all levels of logs available and choose the log to view.

“Object by object” logging would provide too much information that the average user would not find useful.

“Build by build” logging would provide sufficient information to show the progress of the backup and a clear logical view without providing too much information.

“Job by job” logging would provide too little information.


12:30:00.01 System configs backed up.
12:30:00.02 System plugins backed up.
12:30:00.03 Job TEAM1.ABC being backed up.
12:34:00.07 Job TEAM1.ABC backed up.


or


12:30:00.01 System configs backed up.
12:30:00.02 System plugins backed up.
12:30:00.03 Job TEAM1.ABC being backed up.
12:31:00.04 Job TEAM1.ABC build 62 being backed up.
12:32:00.05 Job TEAM1.ABC build 63 being backed up.
12:33:00.06 Job TEAM1.ABC build 64 being backed up.
12:34:00.07 Job TEAM1.ABC backed up.
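To show how a single stream of events could yield either of the two outputs above, here is a small sketch in which each event carries a level and the viewer chooses how much detail to see. The level names, the event tuple layout and the function name are assumptions made for illustration.

```python
from datetime import datetime

# Lower numbers are coarser; a viewer sees every event at or below its level.
LEVELS = {"job": 0, "build": 1, "object": 2}


def format_log(events, level="build"):
    """Format (timestamp, level, message) events up to the chosen detail level."""
    limit = LEVELS[level]
    lines = []
    for when, event_level, message in events:
        if LEVELS[event_level] <= limit:
            # Hundredths of a second, matching the HH:MM:SS.hh style shown above.
            stamp = when.strftime("%H:%M:%S.") + f"{when.microsecond // 10000:02d}"
            lines.append(f"{stamp} {message}")
    return lines
```

Recording events once at the finest level and filtering on display would let the user switch between “job by job” and “build by build” views of the same backup.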


A facility should be provided to “browse” the backup.

The user should be able to define the number of logs to retain.
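A retention setting of this kind could be applied after each backup by keeping the newest log files and deleting the rest. The sketch below assumes one file per backup log in a single directory and orders logs by modification time; the function name is hypothetical.

```python
import os


def prune_logs(log_dir, keep):
    """Delete all but the `keep` newest files in log_dir; return removed names."""
    if keep < 0:
        raise ValueError("keep must be zero or greater")
    with os.scandir(log_dir) as entries:
        logs = [(e.stat().st_mtime, e.name) for e in entries if e.is_file()]
    logs.sort(reverse=True)  # newest first
    removed = []
    for _mtime, name in logs[keep:]:
        os.remove(os.path.join(log_dir, name))
        removed.append(name)
    return sorted(removed)
```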

Backup Warnings Versus Failures

The definition of a failure should include:

  1. The backup media being unavailable.
  2. The backup media has insufficient / incorrect permissions.
  3. Anything that might prevent Hudson from starting after a restore.
  4. Anything that might result in a job configuration being in an unusable state after a restore. This would include a failure to backup plugins.
  5. Anything that prevents the Hudson backup from finishing.
    • This might be more common in early implementations of this functionality. As more situations arise that might cause a backup to fail this list will become longer. These situations might cause the error to be a warning rather than a failure. TBD.

The definition of a warning might include, for instance, a particular build that can’t be backed up.
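The distinction between failures and warnings, together with the “Continue on warning” setting, could combine into an overall result roughly as follows. Since the behaviour of that setting is still TBD, this is only one possible interpretation: any failure fails the backup, and warnings fail it only when the option is off. All names are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    WARNING = "warning"
    FAILURE = "failure"


def backup_outcome(problems, continue_on_warning):
    """Reduce a list of (Severity, description) pairs to "Success" or "Failure"."""
    if any(severity is Severity.FAILURE for severity, _ in problems):
        return "Failure"
    if problems and not continue_on_warning:
        # Only warnings remain, but the user chose not to tolerate them.
        return "Failure"
    return "Success"
```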

Allowing warnings in an initial implementation might not be a good idea unless it is certain that a backup with warnings would still leave the system in a stable state after a restore.

‘Failures’ versus ‘warnings’ must be evaluated on a case-by-case basis. It might be better to provide good logging and the tools, instructions or an FAQ for addressing issues.

Clearing the backup location of all data might be one of the “solutions” to resolving backup issues as a failure might leave the backup in an inconsistent state.

HUDSON_BUILD configuration

There is an option to set the HUDSON_BUILD parameter in the startup script. This allows the build artifacts to reside in a non-default location. We used this when looking for ways to achieve functionality that has since been implemented by the team copy plugin: we would try to run multiple instances of Hudson, each with HUDSON_BUILD pointing to a different location. We no longer use this parameter.

Has the Hudson User Group shown any interest in, or usage of, this parameter?

I think it’s reasonable to assume, for the initial release of a backup solution, that this scenario / parameter is not addressed.

Full Restore

A full system restore implies a non-functional system. Therefore, there is no concern about having the system available for job processing or configuration.

I would suggest the following scenario for the restore.

  1. Install Hudson.
    • I think an install is important (especially on a Linux system where startup scripts are laid down – i.e. /etc/init.d/Hudson).
  2. Start Hudson and install any required plugins for system startup.
  3. Configure the backup parameters to point to the location of the backup.
  4. Select the restore option.
  5. The restore option would:
    • a. Restore the version of Hudson used to do the last backup (basically just update the hudson.jar).
    • b. Restart Hudson so it’s using the proper version of the hudson.jar.*
    • c. Automatically place the system in a restricted state that only allows functionality required for the restore process. This should include the ability to view the restore logs.
    • d. Ensure all ownership and permissions are properly set on the Hudson directories and artifacts (Linux).
  6. Restart the system after the restore completes.

  • A requirement should be that the user restores to the same version of Hudson they backed up with. A higher version might not be a problem; a lower version would be. The Hudson .jar should be part of the backup. It should be restored first and, if it differs from the currently installed version, trigger a restart of the system before the restore starts. Allowing a different version to do the restore (earlier or later) adds an unneeded complication for both the user and the Hudson development team in the event of a restoration failure. The restore option can lay down the proper version of hudson.jar. The file is not locked when the system is running. When the system is restarted the new version will be used.
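The version rule above amounts to a simple decision made before any data is restored. The sketch below assumes purely numeric, dot-separated version strings, so qualifiers such as build suffixes would need extra handling, and it treats "3.2" and "3.2.0" as different. The function names and return values are hypothetical.

```python
def parse_version(text):
    """Turn a dotted numeric version such as "3.2.1" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def pre_restore_action(installed, backed_up):
    """Decide what must happen before the restore of data can begin."""
    if parse_version(installed) == parse_version(backed_up):
        return "proceed"
    # Versions differ (higher or lower): lay down the backed-up hudson.jar
    # and restart so the restore runs under the version that made the backup.
    return "replace_jar_and_restart"
```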

Restore Logging

Restore logging requires similar information / feedback to backup logging.

Logging can provide the user with what is being restored on an “object by object”, “build by build” or “job by job” level. This can be configurable or the user can have all levels of logs available and choose the log to view.

“Object by object” logging would provide too much information that the average user would not find useful.

“Build by build” logging would provide sufficient information to show the progress of the restore and a clear logical view without providing too much information.

“Job by job” logging would provide too little information.


12:30:00.01 System configs restored.
12:30:00.02 System plugins restored.
12:30:00.03 Job TEAM1.ABC being restored.
12:34:00.07 Job TEAM1.ABC restored.


or


12:30:00.01 System configs restored.
12:30:00.02 System plugins restored.
12:30:00.03 Job TEAM1.ABC being restored.
12:31:00.04 Job TEAM1.ABC build 62 being restored.
12:32:00.05 Job TEAM1.ABC build 63 being restored.
12:33:00.06 Job TEAM1.ABC build 64 being restored.
12:34:00.07 Job TEAM1.ABC restored.


The second example provides more feedback to the user faster and a finer grained audit trail without being overwhelming.

The logging output should have the same “style” for backups and restores.

Backing Up Promoted Jobs

There are two backup situations that are of interest to us. One is a full backup for a full system restore.

The only “incremental” backup that would interest us would be for promoted builds to be backed up to a separate location so they can be permanently saved. There are backup plugins that allow post promotion steps. However, the simple backup plugin only sets a flag and prevents the promoted build from being manually deleted.

An incremental backup that backs up promoted builds and does not concern itself with automated restoration or purging of old / removed promoted builds would be useful. Although the builds wouldn’t be easily restorable to the Hudson system a particular build’s artifacts could be recovered manually. This would support protection of “production” artifacts.

Promoted builds are production artifacts that we release to users. It is critical that we have access to these artifacts for software delivery as well as product support. We often need to reproduce a customer’s environment to try to reproduce a problem; therefore we need to have the exact installers / artifacts as the customer. Having a slightly different build, in these situations, is useless for production support.

Partial Restores

A partial backup / restore may cause too many problems. For instance, when dealing with teams, the teams.xml file must match existing teams on the file system. If the teams.xml file references a team that does not exist on disk, the system will not start. If the teams.xml file does not reference a team that does exist on the file system, the user will not see the team or any of its jobs. If the user then tries to “create” this team they’ll get an error because it already “exists” and cannot be “created”. In this case the teams.xml file would need to be modified to match what is on the file system.

In the case of restoring a build there should be checks to ensure that the build does not already exist on the system.

In the case of restoring a job there should be checks to ensure the job does not exist. There might be an option to allow the user to restore (or not restore) associated builds.

Before a job is restored any plugins referenced by the job should be required to be installed. This will prevent job “corruption”.

If partial restores are allowed there needs to be a clear “drill down” function to allow the user to select (or multi-select) the desired artifacts.