


Difference between revisions of "Hudson-ci/features/Backup"

(Restore)
 
(32 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
+
== Hudson Backup Design ==
 
+
(Proposed by: '''Stuart Lorber''')
== New Hudson Backup Design ==
+
+
 
+
(Author: Stuart Lorber)
+
 
+
  
 
=== Our Current Backup Strategy ===
 
=== Our Current Backup Strategy ===
  
 +
We currently use a cloud-based solution to back up most of the servers in our company.  The issue of backups came up because we add and purge upwards of 20 GB of data per day (and this number is growing), which does not allow us to use the cloud-based backup solution we use for our other company servers.  We have too much churn and too large a space requirement to keep any useful history, and the type of storage we need is cost prohibitive.
  
The issue of backups came up because our adding and purging of upwards of 20 GB of data per day (and this number is growing) does not allow us to use the cloud-based backup solution we use for our other company servers.  We have too much churning and space requirements to keep any useful history and the price for the type of storage we need is cost prohibitive.
+
We debated what the backup solution should be for our Hudson server.  We know that we need an offsite “hot backup” for our source control server but we’ve been debating the need for a hot backup for our Hudson server.
 
+
We debated what our backup solution should be for our Hudson server.  We know that we need an offsite “hot backup” for our source control server but we’ve been debating the need for a hot backup for our Hudson server.
+
  
 
Assuming there is a suitable server available we would need to:
 
Assuming there is a suitable server available we would need to:
  
<ol>
+
#Install the appropriate OS.
<li>Install the appropriate OS.</li>
+
#Install Hudson.
<li>Install Hudson.</li>
+
#Restore our Hudson backup.
<li>Restore our Hudson backup.</li>
+
</ol>
+
  
The last step would probably take the longest as we currently have approximately 1.5 million files and approximately 300,000 directories / soft links in our Hudson directory.
+
The last step would probably take the longest as we currently have approximately 1.5 million files and approximately 300,000 directories / symbolic links in our Hudson directory.
  
We’ve set up test Hudson servers so often that if a suitable machine is available, we can have a new server up within 24 to 36 hours (depending on artifact restoration time).
+
We’ve set up test Hudson servers so often that if a suitable machine is available, we can have a new server running within 24 to 36 hours (depending on artifact restoration time).
  
Because we cannot use our cloud-based backup solution and we are currently evaluating our acceptable backup requirements we’ve decided to hook a USB 3 external hard drive up to our Hudson server.  We have a nightly CRON job that runs rsync with appropriate options (''rsync -avzh --delete'') that will properly copy symbolic links and will purge obsolete data.
+
Because we cannot use our cloud-based backup solution and we are currently evaluating our acceptable backup requirements, our short-term solution was to hook a USB 3 external hard drive up to our Hudson server and run a nightly CRON job that calls rsync with appropriate options (rsync -avzh --delete) to properly copy symbolic links and purge obsolete data.  This backup may take 2 minutes or 2 hours depending on the amount of changed data.
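A minimal sketch of how the nightly job could assemble that rsync call is shown here.  The two paths in the usage note are hypothetical placeholders, not our actual locations; only the options come from the text above.

```python
import subprocess


def build_rsync_command(source: str, destination: str) -> list[str]:
    """Build the nightly rsync invocation as an argument list."""
    return [
        "rsync",
        "-avzh",     # archive mode (keeps symbolic links), verbose, compress, human-readable sizes
        "--delete",  # purge files from the backup that no longer exist in the source
        source.rstrip("/") + "/",       # trailing slash: copy the directory's contents
        destination.rstrip("/") + "/",
    ]


def run_backup(source: str, destination: str) -> int:
    """Run rsync and return its exit status (0 means success)."""
    return subprocess.run(build_rsync_command(source, destination)).returncode
```

For example, a CRON entry could call a small script that runs `run_backup("/var/lib/hudson", "/mnt/usb-backup")` and records the exit status alongside the rsync log.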
  
We have been checking the logs generate by the nightly rsync but have not yet tested a restore.
+
We have been checking the logs generated by the nightly rsync but have not yet tested a restore.
  
The rsync backup may take 20 minutes or it might take 2 hours depending on the amount of changed data.  During the time the backup is being done jobs are still running.  Therefore, the final state of the backup does not mark a definitive moment in time; it marks a span of time.  We are concerned that this will cause problems after a restore.  Either the restored artifacts are in an inconsistent (and therefore unreliable) state or, worse, this inconsistent state might prevent Hudson from starting.   
+
During the time the backup is being done, Hudson jobs are still running.  Therefore, the final state of the backup does not mark a definitive moment in time; it marks a span of time.  We are concerned this will cause problems after a restore; the restored artifacts might be in an inconsistent (and therefore unreliable) state or, worse, this inconsistent state might prevent Hudson from starting.   
  
Examples that we’ve seen include:
+
Examples of inconsistent states or states that prevent Hudson from starting:
<ol>
+
#Inconsistency between teams.xml and existing team jobs.
<li>Inconsistency between teams.xml and existing team jobs.</li>
+
#Job configuration “corruption”; occurs if a job configuration references a plugin that is not restored.
<li>Job configuration “corruption” if a job configuration references a plugin that is not restored.</li>
+
#Jobs that are referenced but do not exist on the system (similar to #1).
<li>Security information, on a job level, that points to a job that does not exist.</li>
+
#Jobs that are restored but are not referenced (therefore never visible to the user).
<li>Jobs that are referenced but do not exist on the system (similar to #1).</li>
+
<li>Jobs that are restored but are not referenced (therefore never visible to the user).</li>
+
</ol>
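The referenced-versus-restored mismatches in the list above can be detected with a simple set comparison.  This is only a sketch: it assumes the job names referenced by teams.xml have already been extracted into a set (the parsing step is not shown because the file layout is not described here), and the function and key names are hypothetical.

```python
def find_inconsistencies(referenced_jobs: set[str], jobs_on_disk: set[str]) -> dict[str, list[str]]:
    """Compare the jobs the configuration references with the jobs actually restored."""
    return {
        # Referenced but not present on the file system.
        "missing_on_disk": sorted(referenced_jobs - jobs_on_disk),
        # Present on the file system but never referenced, so invisible to users.
        "unreferenced": sorted(jobs_on_disk - referenced_jobs),
    }
```

An empty result for both keys would indicate that, at least for job references, the restored state is consistent.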
+
 
+
These problems would require manual “debugging” to get Hudson to start.  We are very comfortable with the Hudson file system so we can clean up any problems and get Hudson started, but that doesn’t rule out future problems or a system that can start but has jobs that won’t build.
+
  
 +
These problems require manual “debugging” to get Hudson to start.  We are very comfortable with the Hudson file system so we can clean up any problems and get the system started, but that doesn’t guarantee there isn’t instability.  For example, the system can start but have jobs that won’t build.
  
 
=== A Hudson Backup Solution ===
 
=== A Hudson Backup Solution ===
  
 +
We like the idea of having the backup directly reflecting the file system.  This would support a Hudson restore operation.  It would also allow a manual restore if there’s a problem with the Hudson solution.  It also allows manual interrogation of the backup.
  
We debated as to whether a backup solution integrated directly into Hudson should allow partial backups / restores.
+
The backup should include all configuration data as well as all jobs and job artifacts.  We want to have our build system fully restored and have as little downtime as possible; there should be no manual intervention.
  
We think a partial backup / restore may cause too many problems.  For instance, when dealing with teams, the teams.xml file must match existing teams on the file system.  If the teams.xml file references a team that does not exist on disk the system will not start.  If the teams.xml file does not reference a team that does exist on the file system the user will not see the team or any of its jobs.  If the user then tries to “create” this team they’ll get an error because it already “exists” and cannot be “created”.
+
In a situation like ours it would not work to put the system in a “safe restart” type of state during the backup process.  This would require all running jobs to finish and would not start any queued jobs.
  
On the other hand if the following rules are applied a partial restore should be possible:
+
Some of our test jobs, particularly our UI regression tests, can take upwards of 3 hours to complete.  This would prevent the backup from starting and prevent new jobs from starting for up to those 3 hours plus the time to complete the backup.  In our situation, since we process jobs for approximately 16 hours per day and those hours are not contiguous, this is not a viable solution.  Therefore the backup would need to run while the system is available for normal job processing.
  
<ol>
+
Backups should determine new artifacts to back up at the time of the backup – as opposed to keeping a record of what’s been previously backed up.  As previously mentioned, our short-term backup solution is to use USB drives with rsync.  We plan to switch these USB drives weekly so one drive can be stored off site.  Therefore we would want Hudson to look at the backup media and use that to determine what to back up.
<li>Only allow the restore of jobs.</li>
+
<li>If the job is not a team job allow it to be restored if it doesn’t already exist.  If it does exist warn the user and force the job to be deleted before restoration.</li>
+
<li>If the job is a team job check if the team exists.  If it doesn’t, force the user to manually create it.  If the team job exists follow #2 and force the user to delete the job before restoration.</li>
+
</ol>
+
  
Deleting the job before the restoration will force the job to be in a stable state that matches the backed up artifacts.
+
=== Job / Build Artifact Backups ===
  
 +
One solution for backing up job artifacts might be to backup only completed builds.  This would eliminate the situation of backing up job artifacts that are in an inconsistent state.  One problem with this would be if a build is being backed up and the running job, based on the number of builds to be kept, causes build(s) to be purged.  This might cause the backup of a build in an inconsistent state or a job failure because objects are locked and a build cannot be purged.
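As a rough sketch of the "completed builds only" idea: the `Build` record and its `finished` flag are hypothetical, since how Hudson would expose a build's completion state to the backup is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    job: str
    number: int
    finished: bool  # hypothetical flag: True once the build has completed


def builds_to_back_up(builds: list[Build]) -> list[Build]:
    """Keep only completed builds so no half-written artifacts are copied."""
    return [build for build in builds if build.finished]
```

This filter does not address the purge race described above; a build selected here could still be removed by a running job while it is being copied.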
  
=== Preferences for the backup solution ===
+
=== System Configuration Backups ===
  
 +
If a user were in the process of changing a job configuration, would the backup process need to skip the file?  Would the file get locked and the user would have a problem saving changes?  This would apply to any configuration changes – Hudson configuration / security / nodes.
  
We don’t like the idea of using zips or incremental zips.  Too many times we’ve seen the backup plugin take a long time to run and the zip then fail.  It takes time, lots of CPU and a lot of disk space.  We’ve also had problems restoring a successful full backup and applying incremental backups.  In addition, having to restore a full backup and then any incremental backups is confusing and, again, I haven’t had any success.  We need a simple solution that lets us specify a backup source, with a restore process that is fully automated and requires no manual intervention such as modifications to configuration files.
+
It’s reasonable for a user to lose the latest build (or parts of the latest build) if it’s still running.  The stability and reliability of the restored system is the most important part of the backup.
  
We’ve also had problems where zipping large amounts of data on the server causes transfers of artifacts to slaves to time out.
+
Non-job artifacts are trickier; objects like log files for SCM polling, fingerprints, and node logs are continually updated.
 
+
We like the idea of having the backup directly reflecting the file system.  This would support a Hudson restore operation.  It would also allow a manual restore if there’s a problem with the Hudson solution.  It also allows manual interrogation of the backup.
+
 
+
Eliminating any additional processing (zipping) or change of data “state” (files -> zips -> files) is probably a good thing.
+
 
+
The backup should include all configuration data as well as all jobs and job artifacts.  We want to have our build system fully restored and have as little downtime as possible; there should be no manual intervention.
+
 
+
In a situation like ours it would not work to put the system in a “safe restart” type of state.  This would require all running jobs to finish and would not start any jobs that are currently queued or are submitted during the safe restart state.
+
 
+
Some of our test jobs, particularly our UI regression tests, can take upwards of 3 hours to complete.  This would prevent the backup from starting and prevent jobs from processing.  In our situation, since we process jobs for approximately 16 hours per day and those hours are not contiguous this is not a viable solution.
+
 
+
One solution for job backup might be to backup only completed builds.  This would eliminate the situation of backing up job artifacts that are in an inconsistent state.
+
 
+
One problem with this would be if a build is being backed up and the running job, based on the number of builds to be kept, causes build(s) to be purged.  This would cause either the backup of a build in an inconsistent state or cause a job failure because objects are locked during a backup and the build cannot be purged.
+
 
+
If a user were in the process of changing a job configuration would the backup process need to skip the file?  Would the file get locked and the user has a problem saving changes?  This would apply to any configuration changes – Hudson configuration / security / nodes.
+
 
+
I think it’s reasonable that a user may lose the latest build (or parts of the latest build).  The stability and reliability of the restored system is the most important part of the backup.
+
 
+
Non-job artifacts are trickier.  It seems that log files for SCM polling, fingerprints, and node logs are continually updated.  Would there be any problem catching these at a particular state?
+
  
 
Of course all plugins and configuration information need to be backed up.

Of course all plugins and configuration information need to be backed up.
  
Backups should determine new artifacts to back up at the time of the backup – as opposed to keeping a record of what’s been previously backed up.  As previously mentioned, our short-term backup solution is to use USB drives with rsync.  We plan to switch these USB drives weekly so one drive can be stored off site.  Therefore we would want Hudson to look at the back up media and use that to determine what to back up.
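One way to decide what to back up purely from the media, with no separate record of earlier backups, is to compare the two trees directly.  The sketch below is an assumption about how that comparison could work (size and modification time, roughly what rsync checks by default); the function name is hypothetical.

```python
import os
from pathlib import Path


def files_needing_backup(source_root: Path, backup_root: Path) -> list[Path]:
    """Return relative paths of source files that are absent from, or differ from, the backup."""
    pending = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = Path(dirpath) / name
            rel = src.relative_to(source_root)
            dst = backup_root / rel
            if not os.path.lexists(dst):
                pending.append(rel)  # never backed up to this media
                continue
            # lstat() inspects a symbolic link itself rather than its target.
            s, d = src.lstat(), dst.lstat()
            if s.st_size != d.st_size or s.st_mtime > d.st_mtime:
                pending.append(rel)  # changed since this media was last written
    return sorted(pending)
```

Because the comparison is made against whichever drive is mounted, swapping drives weekly needs no extra bookkeeping: a drive that has been off site simply shows more pending files.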
+
=== Backup Configuration Options ===
 
+
Backup Configuration Options
+
 
+
 
+
 
+
 
----
 
----
 +
'''Backup Configuration'''
  
=== Backup Configuration ===
+
<pre>
 +
Backup settings
  
 +
Backup directory _____________________________________ (?) 
  
'''Backup settings'''
+
  Backup schedule  _____________________________________ (?)
 
+
  E-mail notification
Backup directory _____________________________________ (?)
+
 
+
  Backup schedule  _____________________________________ (?)
+
 
+
  E-mail notification
+
  
 
   [ ] Recipients ______________________________________ (?)
 
   [ ] Recipients ______________________________________ (?)
Line 118: Line 78:
 
[ Backup Now ]        <-- BUTTON
 
[ Backup Now ]        <-- BUTTON
  
 
+
</pre>
 
----
 
----
  
 +
* Backup directory – Verify the location after user enters a value and exits the field (as is done, for instance, when a user enters a new job name and the job name already exists) or provide a button to verify the location and appropriate permissions.  For Linux this location can be any type of read/writeable device that can be mounted on the server.  The user can determine what type of device this can be and therefore set their safety and cost threshold.  For Windows this location can be a network drive or a mounted device and, again, the user can decide their safety and cost threshold.  In either case Hudson doesn’t need to be concerned with the details.
 +
* Backup schedule – Standard CRON configuration.
 +
* Email notification – Standard comma-separated list.
 +
* Send e-mail for every backup – This refers to successful backups.  An email should be sent to the recipient list for any unsuccessful backup.
 +
* Continue on warning – TBD.
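The options above could be modelled roughly as follows.  This is a sketch under stated assumptions: the class and field names are hypothetical, and the schedule check only counts the five fields of a standard CRON expression rather than validating each one.

```python
from dataclasses import dataclass, field


@dataclass
class BackupSettings:
    backup_directory: str
    schedule: str                                  # standard CRON expression
    recipients: list[str] = field(default_factory=list)
    email_on_success: bool = False                 # "Send e-mail for every backup"


def parse_recipients(raw: str) -> list[str]:
    """Split a comma-separated recipient list, dropping blanks."""
    return [address.strip() for address in raw.split(",") if address.strip()]


def looks_like_cron(expression: str) -> bool:
    """Shallow check: a standard CRON schedule has five whitespace-separated fields."""
    return len(expression.split()) == 5


def should_send_email(settings: BackupSettings, succeeded: bool) -> bool:
    """Failures are always mailed; successes only when the option is set."""
    return bool(settings.recipients) and (not succeeded or settings.email_on_success)
```

Verifying the backup directory itself (existence and permissions) is deliberately left out, since the text above leaves open whether that happens on field exit or through a separate button.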
  
 
+
Email notifications should contain information about a backup’s “Success” or “Failure” and a link to the backup log, similar to an email sent out about a job’s status.
• Backup directory – Verify the location after user enters a value and exits the field (as is done, for instance, when a user enters a new job name and the job name already exists) or provide a button to verify the location and appropriate permissions. For Linux this location can be any type of read/writeable device that can be mounted on the server.  The user can determine what type of device this can be and therefore set their safety and cost threshold.  For Windows this location can be a network drive or a mounted device and, again, the user can decide their safety and cost threshold.  In either case Hudson doesn’t need to be concerned with the details.
+
• Backup schedule – Standard CRON configuration.
+
• Email notification – Standard comma-separated list.
+
• Send e-mail for every backup – This refers to successful backups.  An email should be sent to the recipient list for any unsuccessful backup.
+
• Continue of warning – TBD.
+
 
+
What information should email notifications contain?  Should they contain information similar to a job’s email?  “Success” / “Failure” and a link to a log?
+
 
+
Will logs be kept for a backup?  Will the number of logs to keep be user configurable?
+
 
+
Should other backup “protocols” be available that allow for non-mounted / remote backup media?
+
 
+
  
 
=== Backup Logging ===
 
=== Backup Logging ===
 
  
 
We need effective, descriptive, detailed logging.  This can take many forms.
 
We need effective, descriptive, detailed logging.  This can take many forms.
  
Logging can provide the user what is being restored on an “object by object”, “build by build” or “job by job” level.  Maybe this can be configurable or the user can have all levels of logs available and choose the log to view.
+
Logging can provide the user with what is being backed up on an “object by object”, “build by build” or “job by job” level.  This can be configurable or the user can have all levels of logs available and choose the log to view.
  
 
“Object by object” logging would provide too much information that the average user would not find useful.
 
“Object by object” logging would provide too much information that the average user would not find useful.
Line 148: Line 100:
  
 
“Job by job” logging would provide too little information.
 
“Job by job” logging would provide too little information.
 
 
  
 
----
 
----
Line 157: Line 107:
 
12:34:00.07 Job TEAM1.ABC backed up. <br/>
 
12:34:00.07 Job TEAM1.ABC backed up. <br/>
 
----
 
----
 
  
 
or
 
or
 
  
 
----
 
----
Line 174: Line 122:
 
----
 
----
  
 +
A facility should be provided to “browse” the backup.
  
 +
The user should be able to define the number of logs to retain.
  
Should a facility be provided to “browse” the backup?  Would this be possible only when the backup media is directly connected to the server?
+
=== Backup Warnings Versus Failures ===
 
+
Should some kind of verification option be available for users to run?  This might be difficult.  This might not be possible as artifacts on the server might change between the time the backup was done and the verification is requested.
+
 
+
=== Clearing backup media ===
+
 
+
 
+
Allow the option to clear the backup directory location.  The user might want to clear their backup location after a failure and retry the backup, or there might be discrepancies between the Hudson environment and the backup environment.  We found this when we first started working with rsync.
+
 
+
For an initial implementation of a backup solution this might be a necessary recovery tool.
+
 
+
 
+
=== Backup warnings versus failures ===
+
 
+
 
+
What should be considered a warning that allows a backup to continue, and what should be considered a failure that aborts a backup?
+
 
+
Should this be partially configurable?  For instance, if a build for a particular job cannot be backed up, should it provide the user with a warning or abort the backup?
+
  
 
The definition of a failure should include:
 
The definition of a failure should include:
<ol>
 
<li>The backup media being unavailable.</li>
 
<li>The backup media has insufficient / incorrect permissions.</li>
 
<li>Anything that might prevent Hudson from starting.  This would include any failure to restore Hudson configuration information.</li>
 
<li>Anything that might result in a job configuration being in an unusable state.  This would include a failure to backup plugins.</li>
 
<li>Anything that prevents the Hudson backup from finishing.</li>
 
<li>a. This might be more common in early implementations of this functionality.  As more situations arise that might cause a backup to fail but shouldn’t this list will become shorter.  These situations might cause the error to be a warning rather than a failure.  TBD.</li>
 
</ol>
 
  
The definition of a warning might include:
+
#The backup media being unavailable.
<ol>
+
#The backup media has insufficient / incorrect permissions.
<li>A particular job instance.</li>
+
#Anything that might prevent Hudson from starting after a restore.
<li>A build cannot be deleted.</li>
+
#Anything that might result in a job configuration being in an unusable state after a restore.  This would include a failure to backup plugins.
</ol>
+
#Anything that prevents the Hudson backup from finishing.
 +
#*This might be more common in early implementations of this functionality.  As more situations arise that might cause a backup to fail this list will become longer.  These situations might cause the error to be a warning rather than a failure.  TBD.
  
Allowing warnings that allow a backup to continue might not be a good idea for an initial implementation as those warnings would need to be valid on a restore where side effects are unknown.
+
The definition of a warning might include, for instance, a particular build that can’t be backed up.
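The failure and warning definitions above could be captured as a small classification table.  The condition names here are hypothetical labels for the listed cases, and treating anything unlisted as a failure is an assumption chosen to err on the side of a stable restore.

```python
from enum import Enum


class Severity(Enum):
    WARNING = "warning"   # log it and let the backup continue
    FAILURE = "failure"   # abort the backup


# Hypothetical condition names mapped to the cases listed above.
WARNING_CONDITIONS = {"build_not_backed_up"}
FAILURE_CONDITIONS = {
    "media_unavailable",
    "media_permissions",
    "configuration_not_backed_up",
    "plugins_not_backed_up",
    "backup_did_not_finish",
}


def classify(condition: str) -> Severity:
    """Classify a backup problem; anything not explicitly a warning aborts the backup."""
    if condition in WARNING_CONDITIONS:
        return Severity.WARNING
    return Severity.FAILURE
```

Keeping the table explicit makes the case-by-case evaluation visible: moving a condition from one set to the other is a one-line, reviewable change.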
  
‘Failures’ versus ‘warnings’ must be evaluated on a case-by-case basis.  It might be better to provide good logging and the tools or instructions or an FAQ of how to address these problems.
+
Allowing warnings for an initial implementation might not be a good idea, as it is not yet clear that those warnings would still leave the system in a stable state after a restore.
  
Clearing the backup location of all artifacts (as mentioned above) might be one of the “solutions” to resolving backup issues as a failed backup might leave the backup in an inconsistent state.
+
‘Failures’ versus ‘warnings’ must be evaluated on a case-by-case basis.  It might be better to provide good logging and the tools, instructions or an FAQ for addressing issues.
  
 +
Clearing the backup location of all data might be one of the “solutions” to resolving backup issues as a failure might leave the backup in an inconsistent state.
  
 
=== HUDSON_BUILD configuration ===
 
=== HUDSON_BUILD configuration ===
 
  
 
There is an option to set the HUDSON_BUILD parm in the startup script.  This allows the build artifacts to reside in a non-default location.  We used this when looking for ways to do functionality that’s been implemented with the team copy plugin.  We would try to run multiple instances of Hudson and define the HUDSON_BUILD parm to point to different locations.  We no longer use this parm.
 
There is an option to set the HUDSON_BUILD parm in the startup script.  This allows the build artifacts to reside in a non-default location.  We used this when looking for ways to do functionality that’s been implemented with the team copy plugin.  We would try to run multiple instances of Hudson and define the HUDSON_BUILD parm to point to different locations.  We no longer use this parm.
Line 227: Line 153:
 
I think it’s reasonable to assume, for the initial release of a backup solution, that this scenario / parameter is not addressed.
 
I think it’s reasonable to assume, for the initial release of a backup solution, that this scenario / parameter is not addressed.
  
Backing up promoted jobs
+
=== Full Restore ===
  
There are two backup situations that are of interest to us.  One is a full backup for a full system restore.
+
A full system restore implies a non-functional system.  Therefore, there is no concern about having the system available for job processing or configuration.
 
+
The only “incremental” backup that would interest us would be for promoted builds to be backed up to a separate location so they can be permanently saved.  There are backup plugins that allow post promotion steps (?).  However, the simple backup plugin only sets a flag and prevents the promoted build from being manually deleted.
+
 
+
An incremental backup that backs up promoted builds and does not concern itself with automated restoration or purging of old / removed promoted builds would be useful.  Although the builds wouldn’t be easily restorable to the Hudson system a particular build’s artifacts could be recovered manually.  This would support protection of “production” artifacts.
+
 
+
Promoted builds are production artifacts that we release to users.  It is critical that we have access to these artifacts for software delivery as well as product support.  We often need to reproduce a customer’s environment to try to reproduce a problem; therefore we need to have the exact installers / artifacts as the customer.  Having a slightly different build, in these situations, is useless for production support.
+
 
+
=== Restore ===
+
 
+
 
+
A full system restore implies a non-functional system.  Therefore there is no concern about having the system available for job processing or configuration.
+
  
 
I would suggest the following scenario for the restore.
 
I would suggest the following scenario for the restore.
 
  
 
#Install Hudson.  
 
#Install Hudson.  
 
#* I think an install is important (especially on a Linux system where startup scripts are laid down – i.e. /etc/init.d/Hudson).
 
#* I think an install is important (especially on a Linux system where startup scripts are laid down – i.e. /etc/init.d/Hudson).
#Start Hudson and install any required plugins for system startup. It would be nice to simply start Hudson without installing the few “required” plugins.  These could be restored from the backup or pulled down and installed automatically
+
#Start Hudson and install any required plugins for system startup.  
 
#Configure the backup parameters to point to the location of the backup.

#Configure the backup parameters to point to the location of the backup.
 
#Select the restore option.  
 
#Select the restore option.  
Line 253: Line 167:
 
#*a. Restore the version of Hudson used to do the last backup (basically just update the hudson.jar).
 
#*a. Restore the version of Hudson used to do the last backup (basically just update the hudson.jar).
 
#*b. Restart Hudson so it’s using the proper version of the hudson.jar.*
 
#*b. Restart Hudson so it’s using the proper version of the hudson.jar.*
#*c. Rerun step #2 if necessary.
+
#*c. Automatically place the system in a restricted state that only allows functionality required for the restore process.  This should include the ability to view the restore logs.
#*d. Automatically place the system in a more restricted state than the ‘safe restart’ mode.  I don’t think it should prevent login; however, no changes to the system configuration or job creation should be allowed.  The system should be “view-only”.
+
#*d. Ensure all ownership and permissions are properly set on the Hudson directories and artifacts (Linux).
#*e. Ensure all ownership and permissions are properly set on the Hudson directories and artifacts (Linux).  The backup should handle this but….
+
#Restart the system after the restore completes.
+
  
 +
#Restart the system after the restore completes.
  
 
----
 
----
 +
* A requirement should be that the user restores to the same version of Hudson they backed up with.  A higher version might not be a problem.  A lower version would be.  The Hudson .jar should be part of the backup.  It should be restored first and cause a restart of the system before the restore starts if it’s different from the currently installed version.  Allowing a different version to do the restore (earlier or later) adds an unneeded complication for both the user and the Hudson development team in the event of a restoration failure.  The restore option can lay down the proper version of hudson.jar.  The file is not locked when the system is running.  When the system is restarted the new version will be used.
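The version rule above could be checked along these lines.  The sketch assumes purely numeric, dot-separated version strings, and the function names and return labels are hypothetical.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '3.2.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def compare_installed_to_backup(installed: str, backup: str) -> str:
    """Describe how the installed Hudson version relates to the backed-up one."""
    installed_v, backup_v = parse_version(installed), parse_version(backup)
    if installed_v == backup_v:
        return "same"
    # A lower installed version is the problematic case; a higher one might be tolerable.
    return "installed_older" if installed_v < backup_v else "installed_newer"


def jar_must_be_replaced(installed: str, backup: str) -> bool:
    """The backed-up hudson.jar is laid down first whenever the versions differ."""
    return compare_installed_to_backup(installed, backup) != "same"
```

Comparing tuples rather than raw strings matters for versions such as 3.10.0 versus 3.2.1, where a plain string comparison would order them incorrectly.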
  
* Should a requirement be that the user restores to the same version of Hudson they backed up with?  A higher version might not be a problem.  A lower version would be.  Should the Hudson jar be part of the backup?  Should it be restored first and cause a restart of the system before the restore starts?  Allowing a different version to do the restore (earlier or later) adds an unneeded complication for both the user and the Hudson development team in the event of a restoration failure.  The restore option can lay down the proper version of hudson.jar.  The file is not locked when the system is running.  When the system is restarted the new version will be used.  Is there any issue on some of the Linux installs as to the location of the hudson.jar?
+
=== Restore Logging ===
  
What happens if a user backs up from Ubuntu and restores to CentOS?  The hudson.jar is generic but the location of hudson.jar might be different between the backup and restore.
+
Restore logging requires similar information / feedback to backup logging.  
  
Should the user be responsible for installing the correct version of Hudson?  Should they be responsible for doing an available installation and then replacing the jar before starting the restore process?  This would eliminate some of the steps listed above.  Is this error prone?  Would the user possibly lay down the wrong version?  For an initial implementation of a backup process I think it would be reasonable to back up the jar, update the backed-up copy if the installation is upgraded, and have the documentation tell the user to do an installation and then replace the installed jar (if different) with the jar that was previously backed up.
+
Logging can provide the user with what is being restored on an “object by object”, “build by build” or “job by job” level. This can be configurable or the user can have all levels of logs available and choose the log to view.
  
----
+
“Object by object” logging would provide too much information that the average user would not find useful.
  
 +
“Build by build” logging would provide sufficient information to show the progress of the restore and a clear logical view without providing too much information.
  
What are the implications of the system security scheme the user has in place?  A fresh install will not have security enabled.  Will the user initiating the restore have a problem?  I don’t think they will.  In the past when we’ve messed up Active Directory security we have had to delete the security.xml file, and security was not turned off until I restarted Hudson.  I think this is an N/A but I wanted to mention this as it’s similar to other possible scenarios.
+
“Job by job” logging would provide too little information.
 
+
It would be nice if the user doing the restore could refresh and see jobs / builds being restored.  At the very least they should be able to see a log showing some detail about the restoration; either a complete list of files being restored or a descriptive log entry.  The latter might be more reasonable.  Currently, if someone copies a job to Hudson it doesn’t appear until Hudson is restarted or the “reload configuration” option is selected.
Backups and restores should be done in a “logical” order.  For instance, rsync is pretty well behaved.  It backs up directories in some kind of logical order: it doesn’t restore file A from location A and file B from location B and then file C from location A.
The Hudson backup and restore solution should follow similar rules so that any log output is easy to review.  Alphabetic backup / restoration by team and then jobs within teams would be easy to understand.
=== Backup Configuration Options ===
Should there be an option to not purge builds that are no longer on the master system? This would essentially keep historical information that might be valuable.
=== Partial Restores ===
  
Should there be options for partial restores of jobs / builds?
In the case of restoring a job that belongs to a team the team should exist.  The user should be prompted that the team does not exist and that it needs to be created first.  This would negate issues with the teams.xml file getting out of sync with the file system.
 
  
 

Latest revision as of 07:29, 2 May 2014

Hudson Backup Design

(Proposed by: Stuart Lorber)

Our Current Backup Strategy

We currently use a cloud-based solution to back up most of the servers in our company. The issue of backups came up because our adding and purging of upwards of 20 GB of data per day (and this number is growing) does not allow us to use the cloud-based backup solution we use for our other company servers. We have too much churn and too large a space requirement to keep any useful history, and the price for the type of storage we need is cost prohibitive.

We debated what the backup solution should be for our Hudson server. We know that we need an offsite “hot backup” for our source control server but we’ve been debating the need for a hot backup for our Hudson server.

Assuming there is a suitable server available we would need to:

  1. Install the appropriate OS.
  2. Install Hudson.
  3. Restore our Hudson backup.

The last step would probably take the longest as we currently have approximately 1.5 million files and approximately 300,000 directories / symbolic links in our Hudson directory.

We’ve set up test Hudson servers so often that if a suitable machine is available, we can have a new server running within 24 to 36 hours (depending on artifact restoration time).

Because we cannot use our cloud-based backup solution and we are currently evaluating our acceptable backup requirements, our short term solution was to hook a USB 3 external hard drive up to our Hudson server and run a nightly CRON job that calls rsync with appropriate options (rsync -avzh --delete) to properly copy symbolic links and purge obsolete data. This backup may take 2 minutes or 2 hours depending on the amount of changed data.
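The nightly job described above could be wrapped in a small script such as the following. This is only a sketch: the function names and the example paths are hypothetical, and only the rsync options (-avzh --delete) come from the text.

```python
import subprocess


def build_rsync_command(source_dir, backup_dir):
    """Return the rsync argument list for a mirror-style backup."""
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    # -a preserves symbolic links, ownership and timestamps;
    # --delete removes files from the backup that no longer exist in the source.
    return ["rsync", "-avzh", "--delete", source, backup_dir]


def run_backup(source_dir, backup_dir):
    """Run rsync and return its exit code (0 means success)."""
    return subprocess.run(build_rsync_command(source_dir, backup_dir)).returncode
```

A CRON entry would then call a script that invokes run_backup with the Hudson home directory and the mount point of the USB drive (both site-specific).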

We have been checking the logs generated by the nightly rsync but have not yet tested a restore.

During the time the backup is being done, Hudson jobs are still running. Therefore, the final state of the backup does not mark a definitive moment in time; it marks a span of time. We are concerned this will cause problems after a restore; the restored artifacts might be in an inconsistent (and therefore unreliable) state or, worse, this inconsistent state might prevent Hudson from starting.

Examples of inconsistent states or states that prevent Hudson from starting:

  1. Inconsistency between teams.xml and existing team jobs.
  2. Job configuration “corruption”; occurs if a job configuration references a plugin that is not restored.
  3. Jobs that are referenced but do not exist on the system (similar to #1).
  4. Jobs that are restored but are not referenced (therefore never visible to the user).
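Items 1, 3 and 4 above are all mismatches between what a configuration file references and what exists on disk. A minimal sketch of such a check follows, assuming the referenced names and the on-disk names have already been collected by some other means; the function name and the example job names are hypothetical, and no particular teams.xml layout is assumed.

```python
def find_reference_problems(referenced, on_disk):
    """Compare the names a config file references with the names found on disk."""
    referenced, on_disk = set(referenced), set(on_disk)
    return {
        # Referenced but absent: may prevent Hudson from starting.
        "missing_on_disk": sorted(referenced - on_disk),
        # Present but never referenced: invisible to the user.
        "unreferenced": sorted(on_disk - referenced),
    }
```

Running a check like this against a backup before restoring it would surface these problems without any manual “debugging”.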

These problems require manual “debugging” to get Hudson to start. We are very comfortable with the Hudson file system, so we can clean up any problems and get the system started, but that does not guarantee the system is stable. For example, the system can start but have jobs that won’t build.

A Hudson Backup Solution

We like the idea of having the backup directly reflecting the file system. This would support a Hudson restore operation. It would also allow a manual restore if there’s a problem with the Hudson solution. It also allows manual interrogation of the backup.

The backup should include all configuration data as well as all jobs and job artifacts. We want to have our build system fully restored and have as little downtime as possible; there should be no manual intervention.

In a situation like ours it would not work to put the system in a “safe restart” type of state during the backup process. This would require all running jobs to finish and would not start any queued jobs.

Some of our test jobs, particularly our UI regression tests, can take upwards of 3 hours to complete. This would prevent the backup from starting and prevent new jobs from starting for up to those 3 hours plus the time to complete the backup. In our situation, since we process jobs for approximately 16 hours per day and those hours are not contiguous, this is not a viable solution. Therefore the backup would need to run while the system is available for normal job processing.

Backups should determine new artifacts to back up at the time of the backup – as opposed to keeping a record of what’s been previously backed up. As previously mentioned, our short-term backup solution is to use USB drives with rsync. We plan to switch these USB drives weekly so one drive can be stored off site. Therefore we would want Hudson to look at the backup media and use that to determine what to back up.
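One way to decide what to back up purely from the state of the backup media is to compare the two directory trees at backup time. The sketch below does this by file size and modification time, which is an assumption made for illustration (similar in spirit to rsync's default quick check); the function name is hypothetical.

```python
import os


def files_needing_backup(source_root, backup_root):
    """Return relative paths that are missing from, or differ from, the backup."""
    pending = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(backup_root, rel)
            src_stat = os.lstat(src)  # lstat: do not follow symbolic links
            try:
                dst_stat = os.lstat(dst)
            except FileNotFoundError:
                pending.append(rel)  # not on the backup media yet
                continue
            if (src_stat.st_size != dst_stat.st_size
                    or int(src_stat.st_mtime) > int(dst_stat.st_mtime)):
                pending.append(rel)  # changed since it was last copied
    return sorted(pending)
```

Because nothing is recorded between runs, swapping in a different USB drive simply produces a longer list on the next backup.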

Job / Build Artifact Backups

One solution for backing up job artifacts might be to back up only completed builds. This would eliminate the situation of backing up job artifacts that are in an inconsistent state. One problem with this would be if a build is being backed up and the running job, based on the number of builds to be kept, causes build(s) to be purged. This might cause the backup of a build in an inconsistent state or a job failure because objects are locked and a build cannot be purged.
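Selecting only completed builds might look like the sketch below. It rests on an assumption that should be verified against a real installation: that each build has its own directory containing a build.xml file, and that this file contains a result element only once the build has finished. The function names are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def is_build_complete(build_dir):
    """Treat a build as complete only if build.xml parses and records a result."""
    build_xml = os.path.join(build_dir, "build.xml")
    try:
        root = ET.parse(build_xml).getroot()
    except (OSError, ET.ParseError):
        return False  # missing or half-written file: assume still running
    return root.find("result") is not None


def completed_builds(builds_dir):
    """Return the names of build directories that are safe to back up."""
    names = []
    for name in os.listdir(builds_dir):
        path = os.path.join(builds_dir, name)
        # Skip symbolic links so each build is considered only once.
        if os.path.isdir(path) and not os.path.islink(path) and is_build_complete(path):
            names.append(name)
    return sorted(names)
```

This does not address the purge race described above; a build could still be removed between being selected and being copied.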

System Configuration Backups

If a user were in the process of changing a job configuration, would the backup process need to skip the file? Would the file get locked and the user would have a problem saving changes? This would apply to any configuration changes – Hudson configuration / security / nodes.

It’s reasonable for a user to lose the latest build (or parts of the latest build) if it’s still running. The stability and reliability of the restored system is the most important part of the backup.

Non-job artifacts are trickier; objects like log files for SCM polling, fingerprints, and node logs are continually updated.

Of course all plugins and configuration information need to be backed up.

Backup Configuration Options


Backup Configuration

Backup settings

 Backup directory _____________________________________ (?)  

 Backup schedule  _____________________________________ (?)  
 E-mail notification  

  [ ] Recipients ______________________________________ (?)

   [ ] Send e-mail for every backup                     (?)

 [ ] Continue on warning                                (?)

[ Backup Now ]         <-- BUTTON


  • Backup directory – Verify the location after user enters a value and exits the field (as is done, for instance, when a user enters a new job name and the job name already exists) or provide a button to verify the location and appropriate permissions. For Linux this location can be any type of read/writeable device that can be mounted on the server. The user can determine what type of device this can be and therefore set their safety and cost threshold. For Windows this location can be a network drive or a mounted device and, again, the user can decide their safety and cost threshold. In either case Hudson doesn’t need to be concerned with the details.
  • Backup schedule – Standard CRON configuration.
  • Email notification – Standard comma-separated list.
  • Send e-mail for every backup – This refers to successful backups. An email should be sent to the recipient list for any unsuccessful backup.
  • Continue on warning – TBD.
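The settings above could be held and validated along the following lines. This is a sketch only: the class and field names are hypothetical, and the schedule check assumes the common five-field CRON format.

```python
import os
from dataclasses import dataclass


@dataclass
class BackupSettings:
    backup_directory: str
    schedule: str                      # CRON expression, e.g. "0 2 * * *"
    recipients: str = ""               # comma-separated e-mail addresses
    email_every_backup: bool = False   # failures are always e-mailed
    continue_on_warning: bool = False  # behaviour still TBD in the design

    def recipient_list(self):
        """Split the comma-separated recipients field into clean addresses."""
        return [r.strip() for r in self.recipients.split(",") if r.strip()]

    def problems(self):
        """Return human-readable problems; an empty list means the settings look usable."""
        issues = []
        if not os.path.isdir(self.backup_directory):
            issues.append("backup directory does not exist")
        elif not os.access(self.backup_directory, os.R_OK | os.W_OK):
            issues.append("backup directory is not readable and writable")
        if len(self.schedule.split()) != 5:
            issues.append("schedule must have five cron fields")
        return issues
```

Calling problems() when the user leaves the backup directory field would give the kind of immediate verification described for that option.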

Email notifications should contain information about a backup’s “Success” or “Failure” and a link to the backup log – similar to an email sent out about a job’s status.

Backup Logging

We need effective, descriptive, detailed logging. This can take many forms.

Logging can provide the user with what is being backed up on an “object by object”, “build by build” or “job by job” level. This can be configurable or the user can have all levels of logs available and choose the log to view.

“Object by object” logging would provide too much information that the average user would not find useful.

“Build by build” logging would provide sufficient information to show the progress of the backup and a clear logical view without providing too much information.

“Job by job” logging would provide too little information.


12:30:00.01 System configs backed up.
12:30:00.02 System plugins backed up.
12:30:00.03 Job TEAM1.ABC being backed up.
12:34:00.07 Job TEAM1.ABC backed up.


or


12:30:00.01 System configs backed up.
12:30:00.02 System plugins backed up.
12:30:00.03 Job TEAM1.ABC being backed up.
12:31:00.04 Job TEAM1.ABC build 62 being backed up.
12:32:00.05 Job TEAM1.ABC build 63 being backed up.
12:33:00.06 Job TEAM1.ABC build 64 being backed up.
12:34:00.07 Job TEAM1.ABC backed up.
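The “build by build” format shown in the second example could be produced by something like the sketch below. The function names and the injectable clock parameter are hypothetical; the line format is copied from the examples.

```python
from datetime import datetime


def format_entry(when, message):
    """Format one log line as HH:MM:SS.hh followed by the message."""
    hundredths = when.microsecond // 10000
    return f"{when:%H:%M:%S}.{hundredths:02d} {message}"


def build_by_build_log(job_name, build_numbers, action, clock=datetime.now):
    """Return the log lines for one job, with one line per build."""
    lines = [format_entry(clock(), f"Job {job_name} being {action}.")]
    for number in build_numbers:
        lines.append(format_entry(clock(), f"Job {job_name} build {number} being {action}."))
    lines.append(format_entry(clock(), f"Job {job_name} {action}."))
    return lines
```

Passing "backed up" or "restored" as the action keeps backup and restore logs in the same style.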


A facility should be provided to “browse” the backup.

The user should be able to define the number of logs to retain.

Backup Warnings Versus Failures

The definition of a failure should include:

  1. The backup media being unavailable.
  2. The backup media having insufficient / incorrect permissions.
  3. Anything that might prevent Hudson from starting after a restore.
  4. Anything that might result in a job configuration being in an unusable state after a restore. This would include a failure to back up plugins.
  5. Anything that prevents the Hudson backup from finishing.
    • This might be more common in early implementations of this functionality. As more situations arise that might cause a backup to fail this list will become longer. These situations might cause the error to be a warning rather than a failure. TBD.

The definition of a warning might include, for instance, a particular build that can’t be backed up.

Allowing warnings for an initial implementation might not be a good idea as those warnings would still leave the system in a stable state after a restore.

‘Failures’ versus ‘warnings’ must be evaluated on a case-by-case basis. It might be better to provide good logging and the tools, instructions or an FAQ for addressing issues.

Clearing the backup location of all data might be one of the “solutions” to resolving backup issues as a failure might leave the backup in an inconsistent state.

HUDSON_BUILD configuration

There is an option to set the HUDSON_BUILD parm in the startup script. This allows the build artifacts to reside in a non-default location. We used this when looking for ways to do functionality that’s been implemented with the team copy plugin. We would try to run multiple instances of Hudson and define the HUDSON_BUILD parm to point to different locations. We no longer use this parm.

Has there been any interest / usage of this per the Hudson User Group?

I think it’s reasonable to assume, for the initial release of a backup solution, that this scenario / parameter is not addressed.

Full Restore

A full system restore implies a non-functional system. Therefore, there is no concern about having the system available for job processing or configuration.

I would suggest the following scenario for the restore.

  1. Install Hudson.
    • I think an install is important (especially on a Linux system where startup scripts are laid down – i.e. /etc/init.d/Hudson).
  2. Start Hudson and install any required plugins for system startup.
  3. Configure the backup parameters to point to the location of the backup.
  4. Select the restore option.
  5. The restore option would:
    • a. Restore the version of Hudson used to do the last backup (basically just update the hudson.jar).
    • b. Restart Hudson so it’s using the proper version of the hudson.jar.*
    • c. Automatically place the system in a restricted state that only allows functionality required for the restore process. This should include the ability to view the restore logs.
    • d. Ensure all ownership and permissions are properly set on the Hudson directories and artifacts (Linux).
  6. Restart the system after the restore completes.

  • A requirement should be that the user restores to the same version of Hudson they backed up with. A higher version might not be a problem. A lower version would be. The Hudson .jar should be part of the backup. If it is different from the currently installed version, it should be restored first and cause a restart of the system before the restore starts. Allowing a different version to do the restore (earlier or later) adds an unneeded complication for both the user and the Hudson development team in the event of a restoration failure. The restore option can lay down the proper version of hudson.jar. The file is not locked when the system is running. When the system is restarted the new version will be used.
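The version rule above reduces to a simple comparison before the restore begins. The sketch below assumes version strings are purely numeric and dot-separated (for example "3.2.1"); the function names and return values are hypothetical.

```python
def parse_version(text):
    """Turn a dotted numeric version string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def restore_version_action(installed, backed_up):
    """Decide what the restore must do before copying any data back."""
    if parse_version(installed) == parse_version(backed_up):
        return "proceed"
    # Any difference, higher or lower, is handled the same way: lay down the
    # backed-up hudson.jar and restart so the restore runs on that version.
    return "replace_jar_and_restart"
```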

Restore Logging

Restore logging requires similar information / feedback to backup logging.

Logging can provide the user with what is being restored on an “object by object”, “build by build” or “job by job” level. This can be configurable or the user can have all levels of logs available and choose the log to view.

“Object by object” logging would provide too much information that the average user would not find useful.

“Build by build” logging would provide sufficient information to show the progress of the restore and a clear logical view without providing too much information.

“Job by job” logging would provide too little information.


12:30:00.01 System configs restored.
12:30:00.02 System plugins restored.
12:30:00.03 Job TEAM1.ABC being restored.
12:34:00.07 Job TEAM1.ABC restored.


or


12:30:00.01 System configs restored.
12:30:00.02 System plugins restored.
12:30:00.03 Job TEAM1.ABC being restored.
12:31:00.04 Job TEAM1.ABC build 62 being restored.
12:32:00.05 Job TEAM1.ABC build 63 being restored.
12:33:00.06 Job TEAM1.ABC build 64 being restored.
12:34:00.07 Job TEAM1.ABC restored.


The second example provides more feedback to the user faster and a finer grained audit trail without being overwhelming.

The logging output should have the same “style” for backups and restores.

Backing Up Promoted Jobs

There are two backup situations that are of interest to us. One is a full backup for a full system restore.

The only “incremental” backup that would interest us would be for promoted builds to be backed up to a separate location so they can be permanently saved. There are backup plugins that allow post promotion steps. However, the simple backup plugin only sets a flag and prevents the promoted build from being manually deleted.

An incremental backup that backs up promoted builds and does not concern itself with automated restoration or purging of old / removed promoted builds would be useful. Although the builds wouldn’t be easily restorable to the Hudson system a particular build’s artifacts could be recovered manually. This would support protection of “production” artifacts.

Promoted builds are production artifacts that we release to users. It is critical that we have access to these artifacts for software delivery as well as product support. We often need to reproduce a customer’s environment to try to reproduce a problem; therefore we need to have the exact installers / artifacts as the customer. Having a slightly different build, in these situations, is useless for production support.

Partial Restores

A partial backup / restore may cause too many problems. For instance, when dealing with teams, the teams.xml file must match existing teams on the file system. If the teams.xml file references a team that does not exist on disk the system will not start. If the teams.xml file does not reference a team that does exist on the file system the user will not see the team or any of its jobs. If the user then tries to “create” this team they’ll get an error because it already “exists” and it cannot be “created”. In this case the teams.xml file would need to be modified to match what is on the file system.

In the case of restoring a build there should be checks to ensure that the build does not already exist on the system.

In the case of restoring a job there should be checks to ensure the job does not exist. There might be an option to allow the user to restore (or not restore) associated builds.

Before a job is restored any plugins referenced by the job should be required to be installed. This will prevent job “corruption”.

If partial restores are allowed there needs to be a clear “drill down” function to allow the user to select (or multi-select) the desired artifacts.
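The checks described for restoring a single job could be combined into one pre-restore test. This is a sketch under assumptions: the function name is hypothetical, and the lists of existing jobs, required plugins and installed plugins are assumed to have been gathered elsewhere.

```python
def job_restore_blockers(job_name, existing_jobs, required_plugins, installed_plugins):
    """Return the reasons a job must not be restored; an empty list means it is safe."""
    blockers = []
    if job_name in set(existing_jobs):
        blockers.append(f"job {job_name} already exists")
    missing = sorted(set(required_plugins) - set(installed_plugins))
    if missing:
        # Restoring without these plugins would leave the job configuration "corrupted".
        blockers.append("missing plugins: " + ", ".join(missing))
    return blockers
```

A “drill down” selection screen could run this for every selected job and refuse to start until the list is empty for each of them.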
