Revision as of 17:55, 30 April 2014


New Hudson Backup Design

(Author: Stuart Lorber)


Our Current Backup Strategy

The issue of backups came up because our adding and purging of upwards of 20 GB of data per day (and this number is growing) does not allow us to use the cloud-based backup solution we use for our other company servers. We have too much churning and space requirements to keep any useful history and the price for the type of storage we need is cost prohibitive.

We debated what our backup solution should be for our Hudson server. We know that we need an offsite “hot backup” for our source control server but we’ve been debating the need for a hot backup for our Hudson server.

Assuming there is a suitable server available we would need to:

  1. Install the appropriate OS.
  2. Install Hudson.
  3. Restore our Hudson backup.

The last step would probably take the longest as we currently have approximately 1.5 million files and approximately 300,000 directories / soft links in our Hudson directory.

We’ve set up test Hudson servers so often that if a suitable machine is available, we can have a new server up within 24 to 36 hours (depending on artifact restoration time).

Because we cannot use our cloud-based backup solution and we are currently evaluating our acceptable backup requirements, we’ve decided to hook a USB 3 external hard drive up to our Hudson server. We have a nightly CRON job that runs rsync with appropriate options (rsync -avzh --delete), which properly copies symbolic links and purges obsolete data.
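For concreteness, the nightly job might look like the following crontab fragment. The schedule, user name, and paths here are illustrative assumptions, not our actual configuration:

```shell
# /etc/cron.d/hudson-backup (illustrative): mirror the Hudson home nightly at 01:30.
# -a preserves permissions, ownership, and symbolic links; -v is verbose;
# -z compresses data in transit; -h prints human-readable sizes;
# --delete purges files from the backup that no longer exist on the source.
30 1 * * * hudson rsync -avzh --delete /var/lib/hudson/ /mnt/usb-backup/ >> /var/log/hudson-backup.log 2>&1
```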

We have been checking the logs generated by the nightly rsync but have not yet tested a restore.

The rsync backup may take 20 minutes or it might take 2 hours depending on the amount of changed data. During the time the backup is being done jobs are still running. Therefore, the final state of the backup does not mark a definitive moment in time; it marks a span of time. We are concerned that this will cause problems after a restore. Either the restored artifacts are in an inconsistent (and therefore unreliable) state or, worse, this inconsistent state might prevent Hudson from starting.

Examples that we’ve seen include:

  1. Inconsistency between teams.xml and existing team jobs.
  2. Job configuration “corruption” if a job configuration references a plugin that is not restored.
  3. Security information, on a job level, that points to a job that does not exist.
  4. Jobs that are referenced but do not exist on the system (similar to #1).
  5. Jobs that are restored but are not referenced (and are therefore never visible to the user).

These problems would require manual “debugging” to get Hudson to start. We are very comfortable with the Hudson file system, so we can clean up any problems and get Hudson started, but that doesn’t rule out future problems, or a system that starts but has jobs that won’t build.


A Hudson Backup Solution

We debated as to whether a backup solution integrated directly into Hudson should allow partial backups / restores.

We think a partial backup / restore may cause too many problems. For instance, when dealing with teams, the teams.xml file must match existing teams on the file system. If the teams.xml file references a team that does not exist on disk, the system will not start. If the teams.xml file does not reference a team that does exist on the file system, the user will not see the team or any of its jobs. If the user then tries to “create” this team they’ll get an error because it already “exists” and cannot be “created”.

On the other hand if the following rules are applied a partial restore should be possible:

  1. Only allow the restore of jobs.
  2. If the job is not a team job allow it to be restored if it doesn’t already exist. If it does exist warn the user and force the job to be deleted before restoration.
  3. If the job is a team job check if the team exists. If it doesn’t force the user to manually create it. If the team job exists follow #2 and force the user to delete the job before restoration.

Deleting the job before the restoration will force the job to be in a stable state that matches the backed up artifacts.
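As a rough sketch, the three rules could be enforced by a pre-restore check along these lines. The HUDSON_HOME and BACKUP_DIR variables, the teams/ and jobs/ directory names, and the function itself are assumptions for illustration, not Hudson’s actual layout:

```shell
#!/bin/sh
# Pre-restore check for a single job (hypothetical layout and names).
# Prints OK and returns 0 if the job may be restored; otherwise explains why not.
can_restore_job() {
  job="$1"; team="$2"   # team is empty for non-team jobs
  # Rule 3: a team job's team must already exist (created manually).
  if [ -n "$team" ] && [ ! -d "$HUDSON_HOME/teams/$team" ]; then
    echo "REFUSE: team '$team' does not exist; create it manually first"
    return 1
  fi
  # Rule 2: an existing job must be deleted before it can be restored, so the
  # job ends up in a stable state that matches the backed-up artifacts.
  if [ -d "$HUDSON_HOME/jobs/$job" ]; then
    echo "REFUSE: job '$job' already exists; delete it before restoring"
    return 1
  fi
  echo "OK: restore '$job' from $BACKUP_DIR/jobs/$job"
}
```

Refusing, rather than overwriting in place, is what keeps the restored job consistent with the backed-up artifacts.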


Preferences for the backup solution

We don’t like the idea of using zips or incremental zips. We’ve seen too many times where the backup plugin takes a long time to run and the zip then fails. It takes time, lots of CPU, and a lot of disk space. We’ve also had problems restoring a successful full backup and applying incremental backups. In addition, having to restore a full backup and then any incremental backups is confusing and, again, we haven’t had any success with it. We need a simple solution that lets us specify a backup source, where the restore process is fully automated and requires no manual intervention such as modifications to configuration files.

We’ve also had problems where zipping large amounts of data on the server causes transfers of artifacts to slaves to time out.

We like the idea of having the backup directly reflect the file system. This would support a Hudson restore operation. It would also allow a manual restore if there’s a problem with the Hudson solution. It also allows manual interrogation of the backup.

Eliminating any additional processing (zipping) or change of data “state” (files -> zips -> files) is probably a good thing.

The backup should include all configuration data as well as all jobs and job artifacts. We want to have our build system fully restored and have as little downtime as possible; there should be no manual intervention.

In a situation like ours it would not work to put the system in a “safe restart” type of state. This would require all running jobs to finish and would not start any jobs that are currently queued or are submitted during the safe restart state.

Some of our test jobs, particularly our UI regression tests, can take upwards of 3 hours to complete. This would prevent the backup from starting and prevent jobs from processing. In our situation, since we process jobs for approximately 16 hours per day and those hours are not contiguous, this is not a viable solution.

One solution for job backup might be to backup only completed builds. This would eliminate the situation of backing up job artifacts that are in an inconsistent state.

One problem with this would arise if a build is being backed up while the running job, based on the number of builds to be kept, tries to purge build(s). This would either back up a build in an inconsistent state or cause a job failure because objects are locked during a backup and the build cannot be purged.
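One way to approximate a “completed builds only” rule is to skip any build directory that does not yet look finished. In this sketch a build is treated as complete when its directory contains a build.xml result file; that marker, and the builds/ layout, are assumptions rather than confirmed Hudson behavior:

```shell
#!/bin/sh
# List builds of a job that appear complete, so only those are backed up.
# Assumption (not confirmed Hudson behavior): a finished build's directory
# contains a build.xml result file; in-flight builds do not yet have one.
list_completed_builds() {
  job_dir="$1"
  for b in "$job_dir"/builds/*/; do
    if [ -f "${b}build.xml" ]; then
      echo "$b"
    fi
  done
}
```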

If a user were in the process of changing a job configuration, would the backup process need to skip the file? Would the file get locked, preventing the user from saving changes? This would apply to any configuration changes – Hudson configuration / security / nodes.

I think it’s reasonable that a user may lose the latest build (or parts of the latest build). The stability and reliability of the restored system is the most important part of the backup.

Non-job artifacts are trickier. It seems that log files for SCM polling, fingerprints, and node logs are continually updated. Would there be any problem catching these in a particular state?

Of course all plugins and configuration information needs to be backed up.

Backups should determine new artifacts to back up at the time of the backup – as opposed to keeping a record of what’s been previously backed up. As previously mentioned, our short-term backup solution is to use USB drives with rsync. We plan to switch these USB drives weekly so one drive can be stored off site. Therefore we would want Hudson to look at the backup media and use that to determine what to back up.

Backup Configuration Options



Backup Configuration

Backup settings

Backup directory _____________________________________ (?)
Backup schedule  _____________________________________ (?)
E-mail notification
 [ ] Recipients ______________________________________ (?)
  [ ] Send e-mail for every backup                     (?)
[ ] Continue on warning                                (?)

[ Backup Now ] <-- BUTTON




• Backup directory – Verify the location after the user enters a value and exits the field (as is done, for instance, when a user enters a new job name and the name already exists), or provide a button to verify the location and appropriate permissions. For Linux this location can be any type of read/writeable device that can be mounted on the server; the user can decide what type of device to use and therefore set their own safety and cost threshold. For Windows this location can be a network drive or a mounted device and, again, the user can decide their safety and cost threshold. In either case Hudson doesn’t need to be concerned with the details.
• Backup schedule – Standard CRON configuration.
• E-mail notification – Standard comma-separated list.
• Send e-mail for every backup – This refers to successful backups. An email should be sent to the recipient list for any unsuccessful backup.
• Continue on warning – TBD.

What information should email notifications contain? Should they contain information similar to a job’s email? “Success” / “Failure” and a link to a log?

Will logs be kept for a backup? Will the number of logs to keep be user configurable?

Should other backup “protocols” be available that allow for non-mounted / remote backup media?


Backup Logging

We need effective, descriptive, detailed logging. This can take many forms.

Logging can provide the user what is being restored on an “object by object”, “build by build” or “job by job” level. Maybe this can be configurable or the user can have all levels of logs available and choose the log to view.

“Object by object” logging would provide too much information that the average user would not find useful.

“Build by build” logging would provide sufficient information to show the progress of the backup and a clear logical view without providing too much information.

“Job by job” logging would provide too little information.



12:30:00.01 System configs backed up.
12:30:00.02 System plugins backed up.
12:30:00.03 Job TEAM1.ABC being backed up.
12:34:00.07 Job TEAM1.ABC backed up.



or



12:30:00.01 System configs backed up.
12:30:00.02 System plugins backed up.
12:30:00.03 Job TEAM1.ABC being backed up.
12:31:00.04 Job TEAM1.ABC build 62 being backed up.
12:32:00.05 Job TEAM1.ABC build 63 being backed up.
12:33:00.06 Job TEAM1.ABC build 64 being backed up.
12:34:00.07 Job TEAM1.ABC backed up.



Should a facility be provided to “browse” the backup? Would this be possible only when the backup media is directly connected to the server?

Should some kind of verification option be available for users to run? This might be difficult. This might not be possible as artifacts on the server might change between the time the backup was done and the verification is requested.

Clearing backup media

Allow the option to clear the backup directory location. The user might want to clear their backup location after a failure and retry the backup, or there might be discrepancies between the Hudson environment and the backup environment. We found this when we first started working with rsync.

For an initial implementation of a backup solution this might be a necessary recovery tool.


Backup warnings versus failures

What should be considered a warning and allows a backup to continue and what should be considered a failure and abort a backup?

Should this be partially configurable? For instance, if a build for a particular job cannot be backed up, should it give the user a warning or abort the backup?

The definition of a failure should include:

  1. The backup media being unavailable.
  2. The backup media has insufficient / incorrect permissions.
  3. Anything that might prevent Hudson from starting. This would include any failure to restore Hudson configuration information.
  4. Anything that might result in a job configuration being in an unusable state. This would include a failure to backup plugins.
  5. Anything that prevents the Hudson backup from finishing.
    a. This might be more common in early implementations of this functionality. As more situations arise that cause a backup to fail but shouldn’t, this list will become shorter; those situations might be treated as warnings rather than failures. TBD.

The definition of a warning might include:

  1. A build for a particular job cannot be backed up.
  2. A build cannot be deleted.

Allowing warnings that allow a backup to continue might not be a good idea for an initial implementation as those warnings would need to be valid on a restore where side effects are unknown.

‘Failures’ versus ‘warnings’ must be evaluated on a case-by-case basis. It might be better to provide good logging and the tools or instructions or an FAQ of how to address these problems.

Clearing the backup location of all artifacts (as mentioned above) might be one of the “solutions” to resolving backup issues as a failed backup might leave the backup in an inconsistent state.


HUDSON_BUILD configuration

There is an option to set the HUDSON_BUILD parameter in the startup script. This allows the build artifacts to reside in a non-default location. We used this when looking for ways to do what has since been implemented with the team copy plugin. We would run multiple instances of Hudson and define the HUDSON_BUILD parameter to point to different locations. We no longer use this parameter.

Has there been any interest / usage of this per the Hudson User Group?

I think it’s reasonable to assume, for the initial release of a backup solution, that this scenario / parameter is not addressed.

Backing up promoted jobs

There are two backup situations that are of interest to us. One is a full backup for a full system restore.

The only “incremental” backup that would interest us would be for promoted builds to be backed up to a separate location so they can be permanently saved. There are backup plugins that allow post promotion steps (?). However, the simple backup plugin only sets a flag and prevents the promoted build from being manually deleted.

An incremental backup that backs up promoted builds and does not concern itself with automated restoration or purging of old / removed promoted builds would be useful. Although the builds wouldn’t be easily restorable to the Hudson system a particular build’s artifacts could be recovered manually. This would support protection of “production” artifacts.
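Such a one-way, append-only mirror of promoted builds might be sketched as below. The HUDSON_HOME and PROMOTED_ARCHIVE variables and the ‘promoted’ marker file are pure assumptions for illustration; real promotion state lives in build metadata:

```shell
#!/bin/sh
# Append-only mirror of promoted builds to a separate archive location.
# HUDSON_HOME, PROMOTED_ARCHIVE, and the 'promoted' marker file are
# hypothetical; real promotion state is stored in build metadata.
mirror_promoted() {
  find "$HUDSON_HOME/jobs" -type f -name promoted | while read -r marker; do
    build_dir=$(dirname "$marker")
    rel=${build_dir#"$HUDSON_HOME"/}
    mkdir -p "$PROMOTED_ARCHIVE/$rel"
    # -a preserves attributes; -n never overwrites, so nothing in the
    # archive is ever purged or replaced.
    cp -an "$build_dir/." "$PROMOTED_ARCHIVE/$rel/"
  done
}
```

Because nothing is ever deleted or overwritten, a promoted build’s exact installers and artifacts stay recoverable (manually) even after the build is purged from the master.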

Promoted builds are production artifacts that we release to users. It is critical that we have access to these artifacts for software delivery as well as product support. We often need to reproduce a customer’s environment to try to reproduce a problem; therefore we need to have the exact installers / artifacts as the customer. Having a slightly different build, in these situations, is useless for production support.

Restore

A full system restore implies a non-functional system. Therefore there is no concern about having the system available for job processing or configuration.

I would suggest the following scenario for the restore.


  1. Install Hudson.
    • I think an install is important (especially on a Linux system where startup scripts are laid down – i.e. /etc/init.d/Hudson).
  2. Start Hudson and install any required plugins for system startup. It would be nice to simply start Hudson without installing the few “required” plugins. These could be restored from the backup or pulled down and installed automatically.
  3. Configure the backup parameters to point to the location of the backup.
  4. Select the restore option.
  5. The restore option would:
    • a. Restore the version of Hudson used to do the last backup (basically just update the hudson.jar).
    • b. Restart Hudson so it’s using the proper version of the hudson.jar.*
    • c. Rerun step #2 if necessary.
    • d. Automatically place the system in a more restricted state than the ‘safe restart’ mode. I don’t think it should prevent login; however, no changes to the system configuration or job creation should be allowed. The system should be “view-only”.
    • e. Ensure all ownership and permissions are properly set on the Hudson directories and artifacts (Linux). The backup should handle this but….
  6. Restart the system after the restore completes.
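Steps 5a and 5e might reduce to something like the following if the backup is a plain file-system mirror. Every path and file name here is an assumption for illustration; a real restore would also chown the tree to the Hudson service account (chmod stands in so the sketch runs unprivileged):

```shell
#!/bin/sh
# Hypothetical restore from a file-system mirror of the Hudson home.
restore_from_mirror() {
  backup="$1"; home="$2"
  # 5a: put back the hudson.jar that made the last backup, so the restored
  # configuration is read by the same Hudson version.
  cp "$backup/hudson.jar" "$home/hudson.jar"
  # Restore plugins, jobs (with build artifacts), and top-level config files.
  cp -a "$backup/plugins" "$backup/jobs" "$home/"
  cp "$backup"/*.xml "$home/"
  # 5e: ensure the Hudson user can read and traverse everything (Linux).
  chmod -R u+rwX "$home"
}
```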



  • Should a requirement be that the user restores to the same version of Hudson they backed up with? A higher version might not be a problem. A lower version would be. Should the Hudson jar be part of the backup? Should it be restored first and cause a restart of the system before the restore starts? Allowing a different version to do the restore (earlier or later) adds an unneeded complication for both the user and the Hudson development team in the event of a restoration failure. The restore option can lay down the proper version of hudson.jar. The file is not locked when the system is running. When the system is restarted the new version will be used. Is there any issue on some of the Linux installs as to the location of the hudson.jar?

What happens if a user backs up from Ubuntu and restores to CentOS? The hudson.jar is generic but the location of hudson.jar might be different between the backup and restore.

Should the user be responsible for installing the correct version of Hudson? Should they be responsible for doing a standard installation and then replacing the jar before starting the restore process? This would eliminate some of the steps listed above. Is this error prone? Would the user possibly lay down the wrong version? For an initial implementation of a backup process I think it would be reasonable to back up the jar, update it if the installation is upgraded, and have the documentation tell the user to do an installation and then replace the installed jar (if different) with the jar that was previously backed up.



What are the implications of the system security scheme the user has in place? A fresh install will not have security enabled. Will the user initiating the restore have a problem? I don’t think they will. In the past, when we’ve messed up Active Directory security, we have had to delete the security.xml file, and security was not turned off until Hudson was restarted. I think this is an N/A but I wanted to mention it as it’s similar to other possible scenarios.

It would be nice if the user doing the restore could refresh and see jobs / builds being restored. At the very least they should be able to see a log showing some detail about the restoration; either a complete list of files being restored or a descriptive log entry. The latter might be more reasonable. Currently, if someone copies a job to Hudson it doesn’t appear until Hudson is restarted or the “reload configuration” option is selected.

Logging examples:



12:30:00.01 System configs restored.
12:30:00.02 System plugins restored.
12:30:00.03 Job TEAM1.ABC being restored.
12:34:00.07 Job TEAM1.ABC restored.



or



12:30:00.01 System configs restored.
12:30:00.02 System plugins restored.
12:30:00.03 Job TEAM1.ABC being restored.
12:31:00.04 Job TEAM1.ABC build 62 being restored.
12:32:00.05 Job TEAM1.ABC build 63 being restored.
12:33:00.06 Job TEAM1.ABC build 64 being restored.
12:34:00.07 Job TEAM1.ABC restored.



I like option #2. It provides more / quicker feedback to the user and a better / finer grained audit trail without being overwhelming.

The logging output should have the same “style” for backups and restores.

Backups and restores should be done in a “logical” order. For instance, rsync is pretty well behaved. It backs up directories in some kind of logical order: it doesn’t restore file A from location A and file B from location B and then file C from location A.

The Hudson backup and restore solution should follow similar rules so that any log output is easy to review. Alphabetic backup / restoration by team and then jobs within teams would be easy to understand.

Backup Configuration Options

Should there be an option to not purge builds that are no longer on the master system? This would essentially keep historical information that might be valuable.


Partial Restores

Should there be options for partial restores of jobs / builds?

In the case of restoring a build there should be checks to ensure that the build does not already exist on the system.

In the case of restoring a job there should be checks to ensure the job does not exist. There might be an option to allow the user to restore (or not restore) associated builds.

In the case of restoring a job that belongs to a team the team should exist. The user should be prompted that the team does not exist and that it needs to be created first. This would negate issues with the teams.xml file getting out of sync with the file system.

Before a job is restored any plugins referenced by the job should be required to be installed. This will prevent job “corruption”.

If partial restores are allowed there needs to be a clear “drill down” function to allow the user to select (or multi-select) the desired artifacts.
