SimRel/Simultaneous Release Engineering
This page outlines the main steps that the "Simultaneous Release Engineer" must perform at various stages of the release and in some special cases.
It also includes "general interest" sections such as an overview of the repositories used and the concepts behind the "multi-step" Hudson jobs.
Please add to or modify it if you notice omissions or errors, so that over time it gets better and stays current and accurate.
- 1 Repositories
- 2 Hudson Jobs
- 3 Checklists
Repositories
The data (model)
Milestones and initial releases are built from 'master' of org.eclipse.simrel.build, and update releases are built from <Name>_maintenance. The <Name>_maintenance branch is created from master in late June or early July, as we transition from building the "main" release to building its corresponding "update" releases.
Note: When Neon_maintenance was created, I realized in hindsight that we should start naming those branches with "update" as the suffix, such as "Oxygen_update", since we now present them as "updates" rather than as strict, minimal-change "maintenance builds".
Note: There has also been a suggestion (I could not find the bug as of this writing) that we never use 'master'. The suggestion was that we always start N+1_update by branching N_update, so that committers never have to "change branches" from "master" to "X_update" when working on that stream as we go from the "initial release" phase to the "update" phase. (Which I think is a good idea.)
[Both of the above Notes could be implemented in January, instead of waiting until June/July, if desired, just to make things easier in June/July (though still hard in January :). I think they should both be made together, since many scripts and Hudson jobs assume "X_maintenance" or "master" and it would be overly complicated to have them handle both "X_maintenance" and "Y_update".]
Tools and Utility scripts
The "tools and utilities" used for the build are from "org.eclipse.simrel.tools", and there we always use "master", with variables for things like "trainName" which affect the URLs generated, etc.
Of all the "tools and utilities", the most important is "build.xml". It builds the "data" repository (it assumes that repository has already been checked out correctly, by Hudson or by the user if running locally). Technically it will run by simply invoking "ant" (Ant assumes "build.xml" by default). But by specifying some properties that are specific to the "Eclipse.org" infrastructure, via the "production.properties" file, the build can be more efficient and reliable. For example, the build.xml script converts all the "http://download.eclipse.org" URLs to their local file system equivalents: "file:///home/data/httpd/download.eclipse.org". I know in theory we'd like to think "it should not matter much", but it does seem to matter. I suspect that is because, with such a large number of repositories, such a large number of artifacts (and so many aggregation builds! :), p2 (via the CBI aggregator) hits the 'http' server very hard.
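As an illustration only, the URL conversion described above can be sketched in Python. This is not the code in build.xml (which is an Ant script); the function name and the example path are hypothetical, and only the two URL prefixes come from this page.

```python
# Sketch only: the real conversion is performed inside build.xml.
HTTP_PREFIX = "http://download.eclipse.org"
LOCAL_PREFIX = "file:///home/data/httpd/download.eclipse.org"


def to_local_url(url: str) -> str:
    """Return the local file-system equivalent of a download.eclipse.org URL.

    URLs on any other host are returned unchanged.
    """
    # Require an exact match or a following "/" so that look-alike hosts
    # (for example "download.eclipse.org.example") are not rewritten.
    if url == HTTP_PREFIX or url.startswith(HTTP_PREFIX + "/"):
        return LOCAL_PREFIX + url[len(HTTP_PREFIX):]
    return url


if __name__ == "__main__":
    # The path below is a made-up example, not a real repository location.
    print(to_local_url("http://download.eclipse.org/some/project/repo"))
```

Reading from the local file system in this way avoids a round trip through the 'http' server for every artifact, which is the effect the "production.properties" settings are after.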
The "tools and utilities" repository also includes many useful utility scripts that are not necessarily used often, but are needed by the entire process of "doing a release".
Hudson Jobs
There are three main steps to a complete "run" of related Hudson jobs: Validation, Cached Build, and Clean Build. The successful completion of one job triggers the next job in that sequence. It is done this way entirely to provide quicker feedback to those making contributions, and each job corresponds to the similarly named function in the CBI Aggregator. Validation is the quickest, as it checks only that the requirements and version constraints all fit together. A Cached Build is fairly fast: even though it "downloads the artifacts", it does so only if they do not already exist in its cache, so typically the download time is a LOT faster than for a Clean Build. A Clean Build, as the name implies, removes any previously cached information or artifacts and builds the repository "from scratch". And it takes a long time: roughly 2 hours, even when running on "Eclipse.org" infrastructure (longer if running remotely). The three-step approach is also helpful because different errors may show up at each step. Typically, the largest errors are spotted in the quickest job, 'validation'. The errors that show up in the 'cached build' and 'clean build' jobs are typically more subtle and occur less frequently.
These three jobs are meant to be related by the exact commit hash used for the initial "Validation" job. Hence, that "commit" is passed from one job to the next, by the magic of Hudson. The reason for this is simply to increase the odds that a contribution that validated successfully will create a new staging repository. If someone else comes along afterwards and contributes something that "breaks the build", we do not want that first contribution to get held up simply because someone after them broke the build. [This usually works, but not always, depending on how the build was broken. For example, "repository not found" can affect the whole build (all the jobs) at any point in time, since if someone deletes a repository that is mentioned in their aggrcon file then there is nothing the aggregator or Hudson can do. But people should really not be doing that, and a project usually requires some education on correct procedures if it happens frequently. Typically, if it happens at all, it was just an accident based on a typo or something, now that most projects have been educated :)]
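The idea of chaining the three jobs on one pinned commit can be sketched as follows. This is a simplified Python model, not the Hudson configuration itself: the function name, the stage callables, and the sample commit hash are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, function) pair; the function receives the commit hash
# to build and returns True on success.
Stage = Tuple[str, Callable[[str], bool]]


def run_pipeline(commit: str, stages: List[Stage]) -> Optional[str]:
    """Run each stage in order against the same commit.

    Returns the name of the first stage that fails, or None if all pass.
    Stages after a failure are never started.
    """
    for name, stage in stages:
        if not stage(commit):
            return name
    return None


if __name__ == "__main__":
    def always_ok(commit: str) -> bool:
        # Stand-in for a real job; it only reports which commit it was given.
        print("building", commit)
        return True

    stages = [
        ("Validation", always_ok),
        ("Cached Build", always_ok),
        ("Clean Build", always_ok),
    ]
    # "abc1234" is a made-up hash; every stage receives the same value even
    # if the branch tip has moved on in the meantime.
    print(run_pipeline("abc1234", stages))
```

Because every stage is handed the same hash rather than "the tip of the branch", a later commit that breaks the build cannot change what the earlier contribution's Cached Build and Clean Build see.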
One twist on this sequence of "three steps": the last step, "clean build", takes so long that if many people are contributing at nearly the same time (which is common right before the deadlines), the "clean build" jobs can get backed up, resulting in a very long queue that can take a day or so to run every project's commit. That is why there is a small Groovy script, called 'clearCache', that runs at the start of every "clean build". If, at the start of a clean build, that script finds there are other "clean builds" waiting in the queue, then it simply cancels the current clean build before it starts and allows the next clean build in the queue to run, which in turn checks whether any others are waiting in the queue. Once a clean build gets started, however, it runs to completion. That is, no job is "interrupted" when a new one comes into the queue. While this means we do not have a perfectly one-to-one mapping of "each commit" getting a "complete Hudson build", in most cases it is pretty close, and if several commits all passed the "cached build" step, then chances are they are all "good to go" for the "clean build" step (that is, running each separately would not find any "new" errors, and they do not often interfere with each other at that point).
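The queue behavior described above can be modeled with a short simulation. This is a sketch in Python of the decision rule only; the actual 'clearCache' script is written in Groovy and talks to Hudson, and the function names and commit labels here are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List


def should_cancel(waiting_clean_builds: int) -> bool:
    """Decision made at the start of a clean build: cancel if others are queued."""
    return waiting_clean_builds > 0


def drain_queue(queued_commits: Iterable[str]) -> List[str]:
    """Return the commits whose clean build actually runs.

    Assumes no new commits arrive while the queue is drained, so every
    queued build except the last one sees another build waiting behind it.
    """
    queue: Deque[str] = deque(queued_commits)
    built: List[str] = []
    while queue:
        commit = queue.popleft()
        if should_cancel(len(queue)):
            continue  # cancelled before it starts; the next one gets its turn
        built.append(commit)  # once started, a clean build runs to completion
    return built


if __name__ == "__main__":
    # Three commits queued close together: only the last gets a clean build.
    print(drain_queue(["commit-1", "commit-2", "commit-3"]))
```

The check happens only before a build starts, which is why a running clean build is never interrupted: a roughly 2 hour build is either skipped entirely or allowed to finish.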
Finally, of course, we have the Validation_Gerrit job. It is exactly the same as the Validation job, except it runs from the Gerrit refspec, instead of the tip of the branch. This is very useful since most errors with contributions will show up during the "validation job", so this prevents something being committed to the branch that would "break the build" and allows the committers to fix their contribution before that point.
In addition to the above 4 jobs, there is another pair of "abstract" jobs that are used as the parents for "cascading jobs". This is simply so that these "abstract" jobs can specify nearly all that is necessary for the jobs (so that there is one place to specify the "main stuff"), and then each of the 4 jobs specifies the few differences for that particular job.
In addition to that, there are also several "releng jobs". These are typically run manually (such as simrel.releng.promoteToReleases) or at a pre-specified day and time (e.g. simrel.releng.makeVisible).
Checklists
Routine Aggregation Builds
Most of the time, the release engineer simply needs to keep an eye on the builds and, if one fails, investigate to the point of knowing whether a project did something wrong or the Hudson job itself is failing for some other reason. The former cases (project issues) are usually documented in the Simultaneous Release FAQ in the Common errors and what to do about them section. The release engineer's role in that case is simply to communicate with the project and make sure they are "working on it". In some cases, such as when someone has "broken the build" and then already gone home for the night, a contribution might need to be disabled until the project fixes their issues. For Validation_Gerrit jobs, such proactive communication is not necessary. It is required for the others, since if someone "breaks the build" it could prevent others from contributing.
In the other main case, that is, Hudson job issues, the errors are usually something strange, such as "lost connection" or a corrupt clone of the repository. In most cases, if the problem is not obvious from reading the log, the procedure this author follows is: a) simply try again, and see if the same error occurs; b) if it does, try "manually" cleaning the workspace via the web interface and see if the error still occurs; c) if it does, try restarting the Hudson instance and see if the error still occurs; and d) if it does, actually start detailed debugging to see what the issue is.
There may be some cases where it is not clear whether a failure is a project issue or an infrastructure issue. In those cases, the first step is usually to communicate with the project to see if they know what the issue is or if they are working on it.
Note: the release engineer needs not only to be listed as "build master" in the simrel.aggr file (which will cause them to be CC'd for any build failure) but also to subscribe to the RSS feed from the Hudson jobs. This is because the aggregator itself will not send mail for all failures, even some originating from the aggregator (such as for "inconsistent model"), and certainly will not send mail for failures due to infrastructure problems. Both sources need to be "continuously" monitored.
The other thing to do "continuously" is to monitor the cross-project mailing list and the cross-project bugzilla component, to see if anyone has an issue with the build that the release engineer needs to help with. Sometimes it may be more of a "Planning Council chair person's" question, or even a "peer-to-peer" project question, but it is best to always check.
Routine Milestones and Release Candidates
These are some items done specifically for "milestones" and "release candidates". Note: as of this writing, for update releases we do not have any milestones, only "release candidates". Also note that it is only for the "main" release that we put milestones and release candidates in ".../releases/trainName". We do not do so for the update releases, since that URL is already in use for the official release. Another minor point: we do not promote (i.e. make visible) RC4 at the time RC4 is done, since that is really the "final build". We only promote it (i.e. "make visible") on the final release day.
- A week or so before the scheduled time, check the *.aggrcon files to make sure no projects or features are disabled (enabled="false"). If any are, send a reminder to the cross-project list asking whether the project is aware of it, and help resolve the issue, if any.
- As dictated by the schedule (see, for example, the Oxygen schedule), monitor the mailing lists to see if anyone has asked for an "extension" to the scheduled time.
- When staging is complete (i.e., no extensions requested, and no jobs running), announce on the cross-project list that "staging is complete" and disable the "Validation" job. (Disabling the Validation job usually suffices, since it triggers all the subsequent jobs, but you can disable the "promote to staging" one too, if you are paranoid about it :) since it is the "promoteToStaging" job that might mess up the EPP build, because the EPP builds are done against the staging repository.)
- [NOTE: this step is done only for the "main" build, not "update releases" -- well, it is done for "RC4" of the update release, since that is the "final release".] Shortly after the announcement that "staging is complete" use the job named simrel.releng.promoteToReleases to copy what is in staging to the appropriate releases directory. This allows mirroring of the artifacts to begin so that a number of mirrors (though usually not all) will be available at the time it is "made visible".
- Schedule the simrel.releng.makeVisible job to run at the scheduled time (usually 9:30 on Friday, for a 10:00 availability, with the extra 30 minutes used to sanity check things and make sure all is well). This requires setting not only the time but also the default "trainName" and "checkpoint": since it is not an "interactive" job, the "defaults" must be set as needed.
- During the day or hours before "making visible", run the "checkMirrors.sh" job (on a non-infrastructure machine and network) to make sure the mirrors are populating. If it appears that, by the time of "making visible" for general availability, there will be fewer than 3 or 4 mirrors, it is best to discuss with the webmaster whether the "make visible" step should be postponed or the mirror synchronization can be sped up. Note that the checkMirrors script typically requires a manual edit for each new repository that is being "made visible". It is also best to include some downloadable artifacts in that query (such as one or two EPP artifacts) in addition to the repository directory.
- Monitor the "makeVisible" job at the time it is scheduled to run, along with its EPP counterpart. Simultaneously chat with the EPP project lead (or release engineer) to make sure all is well from that end. Also run a short, manual "check for updates" action from the Eclipse IDE itself, as confirmation that all is as expected after the "makeVisible" job runs. Note: there is also a "simrel.releng.sanityCheckComposites" job that is intended to run automatically (or can be run manually), but the point of the short, manual "check for updates" step is to confirm that things work when not on the Eclipse.org infrastructure.
- Send a note to cross-project list that XYZ is available.
- Re-enable any jobs that were disabled (assuming not the "final" release).
- After each milestone or release candidate it is best to check the "repo reports" to see if there are any especially egregious errors or omissions. Some examples might be if a project is not signing any of their jar files, or if the "versions" of bundles or features decrease when compared with the reference repository. (See bug 500224 for info on the "reference repository". I *think* I have fixed the routine cases, but for every "major release" the reference repository will need to be manually edited in the scripts until that bug is fixed, and even then, a property will need to be updated.) Note that projects (Project Leads and PMCs) are technically responsible for the quality of the repository, not the release engineer, but it helps if the release engineer encourages them and reminds them to look! :) At least until someone improves the tests to cause "failures" for cases that should be failures.
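The first item in the checklist above (looking for disabled entries in the *.aggrcon files) can be partly automated. The following Python sketch assumes only that the files are well-formed XML and that disabled entries carry an enabled="false" attribute, as noted above. The function names are hypothetical, and the attributes used to label each hit ("name", "location", "label") are assumptions about the file layout; adjust them to match the real files.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator, Tuple


def find_disabled(aggrcon: Path) -> Iterator[Tuple[str, str]]:
    """Yield (tag, label) for every element marked enabled="false"."""
    root = ET.parse(aggrcon).getroot()
    for element in root.iter():
        if element.get("enabled") == "false":
            # Strip any "{namespace}" prefix ElementTree adds to the tag.
            tag = element.tag.rsplit("}", 1)[-1]
            # Assumed attribute names, used only to make the report readable.
            label = (element.get("name") or element.get("location")
                     or element.get("label") or "?")
            yield tag, label


def report(directory: Path) -> int:
    """Print disabled entries for each *.aggrcon file; return how many were found."""
    count = 0
    for path in sorted(directory.glob("*.aggrcon")):
        for tag, label in find_disabled(path):
            print(f"{path.name}: {tag} {label} is disabled")
            count += 1
    return count


if __name__ == "__main__":
    # Pass the checkout directory of the "data" repository as the argument.
    report(Path(sys.argv[1]) if len(sys.argv) > 1 else Path("."))
```

A non-zero count is only a prompt to ask on the cross-project list; a project may have disabled a contribution deliberately.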
Shortly before final build
A week or so before the final build, it is best to remind everyone (via cross-project list) what the schedule is, and to point them to (or create) a ["Final Daze"] document.
During quiet week before general availability
- Make sure the Info Center is created.
- Run the "promoteToRelease" script, if not already done. (It is best to wait until quiet week, since someone might ask for a rebuild prior to that, and there should still be enough time to mirror.)
Shortly after general availability
- It is best to tag the two repositories with a "human readable tag", such as "Neon.2", so future comparisons, if needed, will be easier. Note: being "tagged" does not guarantee that a build is reproducible, since that depends on the projects having the correct permanent URL in their aggrcon file. In the past, projects have been encouraged to update that URL, but not all do, and it is not typically double-checked.
- After an "initial release", the main branch must be forked as <trainName>_updates and the change announced on the cross-project list. [Note: details of this item may change slightly if the procedures are changed, as described under the 'Repositories' section above.]
And then ...
Do it all again! :)