SimRel/Simultaneous Release Engineering

Revision as of 17:20, 19 December 2016

This page outlines the main steps that the "Simultaneous Release Engineer" must perform at various stages of the release and for some special cases.

It also includes "general interest" sections such as an overview of the repositories used and the concepts behind the "multi-step" Hudson jobs.

This documentation is meant to be an overview or orientation. In the Hudson jobs themselves and in the scripts discussed here there is much more detail on specifics.

Please add to or modify this wiki page if omissions or errors are noticed, so that over time it will get better and stay current and accurate.

Repositories

The data (model)

Milestones and initial releases are built from 'master' of org.eclipse.simrel.build. Update releases are built from <Name>_maintenance. The <Name>_maintenance branch is created from master in late June or early July, as we transition from building the "main" release to building its corresponding "update" releases.

Note: When Neon_maintenance was created, I realized in hindsight we should start naming those branches with "update" as the suffix, such as "Oxygen_update" since we now present them as "updates" not a strict, minimal change "maintenance build".

Note: There has also been a suggestion (I could not find the bug as of this writing) that we never use 'master'. The suggestion was that we always start N+1_update by branching N_update, so that committers never have to "change branches" from "master" to "X_update" when working on that stream as we go from the "initial release" phase to the "update" phase. (Which I think is a good idea.)

[Both of the above Notes could be implemented in January, instead of waiting until June/July, if desired, just to make things easier in June/July (though, still hard in January :). I think they could both be made together, since many scripts and Hudson jobs assume "X_maintenance" or "master" and it would be overly complicated to have them handle both "X_maintenance" and "Y_update".]

Tools and Utility scripts

The "tools and utilities" used for the build are from "org.eclipse.simrel.tools" and there we always use "master" with variables for things like "trainName" which affect the URLs generated, etc.

Of all the "tools and utilities" the most important is the "build.xml". This will build the "data" repository (it assumes that repository is already checked out correctly by Hudson, or by the user if running locally). Technically it will run by simply invoking "ant". (Ant assumes "build.xml" by default.) But, by specifying some properties that are specific to "Eclipse.org" infrastructure, via the "production.properties" file, the build can be more efficient and reliable. For example, the build.xml script converts all the "http://download.eclipse.org" URLs to their local file system equivalents: "file:///home/data/httpd/download.eclipse.org". I know in theory we'd like to think "it should not matter much", but it does seem to matter -- I suspect because, with such a large number of repositories and such a large number of artifacts (and, so many aggregation builds! :), p2 (via the CBI aggregator) is hitting the 'http' server very hard.
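The URL conversion described above can be illustrated with a small Python sketch. This is not the logic from build.xml itself (which is an Ant script); the function name is hypothetical and only the two prefixes are taken from the text.

```python
# Prefixes quoted in the text above; everything else here is illustrative.
HTTP_PREFIX = "http://download.eclipse.org"
LOCAL_PREFIX = "file:///home/data/httpd/download.eclipse.org"


def to_local_url(url: str) -> str:
    """Return the local file-system equivalent of a download.eclipse.org URL.

    URLs on any other host are returned unchanged.
    """
    if url == HTTP_PREFIX or url.startswith(HTTP_PREFIX + "/"):
        return LOCAL_PREFIX + url[len(HTTP_PREFIX):]
    return url


# Example call (the repository path is the one used later on this page):
local = to_local_url("http://download.eclipse.org/releases/neon/201609281000/")
```

The check for a trailing "/" after the prefix is a design choice in this sketch so that a host such as "download.eclipse.org.example.com" would not be rewritten by accident.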

The "tools and utilities" repository also includes many useful utility scripts that are not necessarily used often, but are needed by the entire process of "doing a release".

Hudson Jobs

There are three main steps to a complete "run" of related Hudson jobs: Validation, Cached Build, and Clean Build. The successful completion of one job triggers the next job in that sequence. It is done this way entirely to provide quicker feedback to those making contributions, and each step corresponds to the similarly named function in the CBI Aggregator. Validation is the quickest, as it checks only that the requirements and version constraints all fit together. A Cached Build is fairly fast since, even though it "downloads the artifacts", it does so only if they do not already exist in its cache, so typically the download time is a LOT faster than for a Clean Build. A Clean Build, as the name implies, removes any previously cached information or artifacts and builds the repository "from scratch". And, it takes a long time: roughly 2 hours, even when running on "Eclipse.org" infrastructure (longer if running remotely). Also, it is helpful to use this three-step approach because different errors may show up at each step. Typically, the largest errors are spotted in the quickest, 'validation' job. Different errors can show up in the 'cached build' and 'clean build' jobs; these are typically more subtle and occur less frequently.

These three jobs are meant to be related by the exact commit hash used for the initial "Validation" job. Hence, that "commit" is passed from one job to the next, by the magic of Hudson. The reason for this is simply to increase the odds that a contribution that validated successfully will create a new staging repository. If someone else comes along after and contributes something that "breaks the build", we do not want that first contribution to get held up, simply because someone after them broke the build. [This usually works, but not always, depending on how the build was broken -- for example "repository not found" can affect the whole build (all the jobs) at any point in time, since if someone deletes their repository that is mentioned in their aggrcon file then there is nothing the aggregator or Hudson can do. But, people should really not be doing that, and usually require some education on correct procedures if that happens frequently from the same project. Typically, if it happens at all, it was just an accident based on a typo or something, now that most projects have been educated :)]

One twist on this sequence of "three steps". The last step, "clean build", takes so long that if many people are contributing at about the same time (which is common, right before the deadlines), the "clean build" can get backed up and result in a very long queue that can take a day, or so, to run every project's commit. That is why there is a small groovy script, called 'clearCache', that runs at the start of every "clean build". If, at the start of a clean build, that script finds there are other "clean builds" waiting in the queue, then it simply cancels the current clean build before it starts, and allows the next clean build in the queue to run, which also checks if any others are waiting in the queue. Once a clean build gets started, however, it runs to completion. That is, no job is "interrupted" when a new one comes into the queue. While this means we do not have a perfectly one-to-one mapping of "each commit" getting a "complete Hudson build", in most cases it is pretty close, and if several commits all passed the "cached build" step, then chances are they are all "good to go" for the "clean build" step (that is, running each separately would not find any "new" errors, and they do not often interfere with each other at that point.)
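The effect of that queue check can be shown with a short simulation. The real script is Groovy running inside Hudson; the Python below is only a sketch of the behavior described above, with hypothetical names, and it assumes all the listed builds are already waiting when the first one is about to start.

```python
from collections import deque


def run_clean_build_queue(pending):
    """Simulate clean builds that each check the queue before starting.

    A build that finds other clean builds waiting cancels itself before
    it starts; a build that finds the queue empty runs to completion.
    Returns (completed, cancelled) as two lists, in queue order.
    """
    queue = deque(pending)
    completed, cancelled = [], []
    while queue:
        current = queue.popleft()
        if queue:
            # Other clean builds are waiting: give way to the next one.
            cancelled.append(current)
        else:
            # Nothing else waiting: this build runs to completion.
            completed.append(current)
    return completed, cancelled


# Three commits queued at once: only the last one gets a full clean build.
completed, cancelled = run_clean_build_queue(["commit-a", "commit-b", "commit-c"])
```

Because the check happens only before a build starts, a build that is already running is never interrupted, which matches the "runs to completion" rule in the text.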

Finally, of course, we have the Validation_Gerrit job. It is exactly the same as the Validation job, except it runs from the Gerrit refspec, instead of the tip of the branch. This is very useful since most errors with contributions will show up during the "validation job", so this prevents something being committed to the branch that would "break the build" and allows the committers to fix their contribution before that point.

In addition to the above 4 jobs, there is another pair of "abstract" jobs that are used as the parent for "cascading jobs". This is simply so that these "abstract" jobs can specify nearly all that is necessary for the jobs (so that there is one place to specify the "main stuff") and then each of the 4 jobs specifies the few differences for that particular job.

In addition to that, there are also several "releng jobs". These are typically run manually (such as simrel.releng.promoteToReleases) or at a pre-specified day and time (e.g. simrel.releng.makeVisible).

Process steps

Routine Aggregation Builds

Most of the time, the release engineer simply needs to keep an eye on the builds and, if one fails, investigate to the point of knowing if a project did something wrong or if the Hudson job itself is failing for some other reason. The former cases (project issues) are usually documented in the Simultaneous Release FAQ in the Common errors and what to do about them section. The release engineer's role in that case is to simply communicate with the project and make sure they are "working on it". In some cases, such as when someone has "broken the build" and then already gone home for the night, a contribution might need to be disabled until the project fixes their issues. For Validation_Gerrit jobs, such proactive communication is not necessary. It is required for the others since if someone "breaks the build" it could prevent others from contributing.

In the other main case, that is, Hudson job issues, the errors are usually something strange, such as "lost connection", or a corrupt clone of the repository. In most cases, if the problem is not obvious from reading the log, the procedure this author follows is: a) simply try again, and see if the same error occurs; b) if it does, try "manually" cleaning the workspace via the web interface and see if the error still occurs; c) if it does, try restarting the Hudson instance and see if the error still occurs; and d) if it does, start detailed debugging to see what the issue is.

There may be some cases where it is not clear if a failure is a project issue or an infrastructure issue and in those cases, the first step is usually to discuss or communicate with the project to see if they know what the issue is or if they are working on it.

Note: the release engineer needs to not only be listed as "build master" in the simrel.aggr file (which will cause them to be CC'd for any build failure) but also subscribe to the RSS feed from the Hudson jobs. This is because the aggregator itself will not send mail for all failures, even some originating from the aggregator (such as for "inconsistent model"), and certainly will not send mail from failures due to infrastructure problems. Both sources of mail need to be "continuously" monitored.

The other thing to do "continuously" is to monitor the cross-project mailing list, and the cross-project bugzilla component, to see if anyone has an issue with the build that the release engineer needs to help with. Sometimes, it may be more of a "Planning Council chair person's" question, or even a "peer-to-peer" project question, but it is best to always ask if in any doubt.

Routine Milestones and Release Candidates

These are some items done specifically for "milestones" and "release candidates". Note: as of this writing, for update releases we do not have any milestones, only "release candidates". Also note, it is only for milestones of the "main" release, that we put milestone and release candidates in ".../releases/trainName". We do not do so for the update releases since that URL is already in use for the official release. Another minor point, we do not promote (i.e. make visible) RC4 at the time RC4 is done -- since that is really the "final build". We only promote it (i.e. "make visible") at the time of the final release day.

  • A week or so before the scheduled time, check the *.aggrcon files to make sure no projects or features are disabled (enabled="false") and if so, send a reminder to the cross-project list asking if the project is aware of that and help resolve the issue, if any.
  • As dictated by the schedule (for example, see the Oxygen schedule) monitor the mailing lists to see if anyone has asked for an "extension" to the scheduled time.
  • When staging is complete (i.e., no extensions requested, and no jobs running) announce on the cross-project list that "staging is complete" and disable the "Validation" job. (Disabling the Validation job usually suffices, since it triggers all the subsequent jobs, but you can disable the "promote to staging" one too, if you are paranoid about it :) since it is the "promoteToStaging" job that might mess up the EPP build, because the EPP builds are done against the staging repository.)
  • [NOTE: this step is done only for the "main" build, not "update releases" -- well, it is done for "RC4" of the update release, since that is the "final release".] Shortly after the announcement that "staging is complete" use the job named simrel.releng.promoteToReleases to copy what is in staging to the appropriate releases directory. This allows mirroring of the artifacts to begin so that a number of mirrors (though usually not all) will be available at the time it is "made visible".
  • Schedule the simrel.releng.makeVisible job to run at the scheduled time (usually 9:30 on Friday, for a 10:00 availability -- the extra 30 minutes being used to sanity check things, and make sure all is well). This requires not only that the time be set, but also the default "trainName" and "checkpoint": since it is not an "interactive" job, the "defaults" must be set as needed.
  • During the day or hours before "making visible", run the "checkMirrors.sh" job (on a non-infrastructure machine and network) to make sure the mirrors are populating. If it appears that, by the time of "making visible" for general availability, there will be fewer than 3 or 4 mirrors, it is best to discuss with the webmaster to see if the "make visible" step should be postponed, or if the mirror synchronization can be sped up. Note, the checkMirrors script typically requires a manual edit for each new repository that is being "made visible". And, it is best to include some downloadable artifacts in that query (such as one or two EPP artifacts) in addition to the repository directory.
  • Monitor that "makeVisible" job at the time it is scheduled to run, along with the EPP counterpart. Simultaneously chat with the EPP project lead (or release engineer) to make sure all is well from that end. Simultaneously run a short, manual "check for updates" action from the Eclipse IDE itself as confirmation that all is as expected after the "makeVisible" job runs. Note: there is also a "simrel.releng.sanityCheckComposites" job that is intended to run automatically (or can be run manually) but the point of the "short manual, check for updates" step is to confirm things work when not on the Eclipse.org infrastructure.
  • Send a note to cross-project list that XYZ is available.
  • Re-enable any jobs that were disabled (assuming not the "final" release).
  • After each milestone or release candidate it is best to check the "repo reports" to see if there are any especially egregious errors or omissions. Some examples might be if a project is not signing any of their jar files or if the "versions" of bundles or features decrease when compared with the reference repository. (See bug 500224 for info on the "reference repository". I *think* I have fixed the routine cases but every "major release" the reference repository will need to be manually edited in the scripts, until that bug is fixed, and even then, a property will need to be updated.) Note that projects (Project Leads and PMCs) are technically responsible for the quality of the repository, not the release engineer, but it helps if the release engineer encourages them and reminds them to look! :) At least until someone improves the tests to cause "failures" for cases that should be failures.
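The "versions decrease" check mentioned in the last item can be sketched in a few lines of Python. This is not the code behind the "repo reports"; it is a minimal illustration that assumes each repository has already been reduced to a mapping from unit id to a single OSGi-style version string (major.minor.micro with an optional qualifier). All ids and function names are hypothetical.

```python
def parse_version(text):
    """Split 'major.minor.micro.qualifier' into a comparable tuple.

    Missing numeric segments default to 0 and a missing qualifier
    defaults to the empty string.
    """
    parts = text.split(".", 3)
    numbers = [int(p) for p in parts[:3]]
    numbers += [0] * (3 - len(numbers))
    qualifier = parts[3] if len(parts) > 3 else ""
    return (numbers[0], numbers[1], numbers[2], qualifier)


def find_decreased_versions(reference, candidate):
    """Return {id: (reference version, candidate version)} for decreases.

    Units that are missing from the candidate are ignored here; they
    would need a separate "omissions" check.
    """
    decreased = {}
    for unit_id, ref_version in reference.items():
        new_version = candidate.get(unit_id)
        if new_version is None:
            continue
        if parse_version(new_version) < parse_version(ref_version):
            decreased[unit_id] = (ref_version, new_version)
    return decreased


# Hypothetical data: the first unit went backwards, the second moved forward.
reference = {"org.example.a": "1.2.0.v2016", "org.example.b": "2.0.0"}
candidate = {"org.example.a": "1.1.9.v2017", "org.example.b": "2.0.1"}
problems = find_decreased_versions(reference, candidate)
```

Comparing the numeric segments as integers (rather than as text) is what keeps "1.10.0" from sorting below "1.9.0".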

Shortly before final build

A week or so before the final build, it is best to remind everyone (via cross-project list) what the schedule is, and to point them to (or create) a ["Final Daze"] document.

During quiet week before general availability

  • Make sure the Info Center is created.
  • Run the "promoteToRelease" script, if not already done. (It is best to wait until quiet week, since someone might ask for a rebuild prior to that, and there should still be enough time to mirror.)

Shortly after general availability

  • Best to tag the two repositories with a "human readable tag", such as "Neon.2" so future comparisons, if needed, will be easier. Note: Just because it is "tagged" it may not be reproducible since it depends on the projects having the correct permanent URL in the aggrcon file. In the past, projects have been encouraged to update that URL, but not all do, and it is not typically double checked, or anything.
  • After an "initial release", the main branch must be forked to be <trainName>_updates and the change announced on the cross-project list. [Note: details of this item may change slightly if the procedures are changed, as described under the 'Repositories' section above.]

And then ...

Do it all again! :)

Keep Hudson and its web pages and descriptions up to date

Currently, our Hudson jobs contain the name of the release that they are for. That is just for clarity to users (committers) looking at the page. So, once per year, when one stream ends and another begins, the jobs need to be "copied" with new names, and then their configuration changed to point to the correct branches to build.

The "web page" part of this task needs to be done every release. At the top of the main SimRel Hudson instance (namely, https://hudson.eclipse.org/simrel/) there are some links in the description to the "repo reports" for the current last successful "clean builds" as well as a historical record of links for previous releases. The latter needs to be added to every release. The former needs updating only once a year or so, just to correct the names and the exact URLs that are pointed to.


Remove inactive projects

This activity is needed primarily after M4, which is the deadline for projects to declare if they plan to participate or not. But, it can come up at other times, if a project states that they have changed their mind and will not participate. The release engineer needs to be involved since, presumably, if a project is no longer interested in participating, there is no one particularly interested in making sure their contribution file is removed. The actual list of projects to remove is worked out by collaborating with the Planning Council (and, Wayne).

It is best to start with simply disabling the contribution, in the aggrcon file. This can be done by adding 'enabled="false"' to the '<contribution' element. Then run "validate aggregation" to see if the removal of those projects breaks anyone else. If it does, coordinate via bugzilla and the cross-project list on what the projects want to do about it.
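As an illustration of what that attribute change amounts to, here is a minimal Python sketch. The sample XML is a simplified, hypothetical stand-in for an aggrcon file: real files carry XML namespaces and more elements, and the element and attribute names other than 'enabled' are assumptions. For real files, editing by hand or with the CBI Aggregator Editor is likely the safer route.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the content of an aggrcon file.
SAMPLE = """<contribution label="Example Project">
  <repositories location="http://download.eclipse.org/example/updates/">
    <features name="org.example.feature.group" enabled="false"/>
  </repositories>
</contribution>"""


def disable_contribution(xml_text):
    """Return the XML with enabled="false" set on the root element."""
    root = ET.fromstring(xml_text)
    root.set("enabled", "false")
    return ET.tostring(root, encoding="unicode")


def disabled_elements(xml_text):
    """List (tag, identifying attribute) for every element that is disabled."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, el.get("label") or el.get("name") or el.get("location"))
        for el in root.iter()
        if el.get("enabled") == "false"
    ]


# Before: only the feature is disabled. After: the whole contribution is too.
before = disabled_elements(SAMPLE)
after = disabled_elements(disable_contribution(SAMPLE))
```

The second function is the same kind of scan that is useful before a milestone, when checking that nothing has been left disabled by accident.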

Once you are ready to physically remove the files related to the contribution, it is best to "re-enable them" (in your workspace) and use the CBI Aggregator Editor to remove them. Ideally this will also remove any stray features that are in custom categories, but, it may not always, in which case it needs to be cleaned up "manually" in the simrel.aggr file. Once the contribution has been removed from the simrel.aggr file, the actual aggrcon file can be deleted.

Remove inactive committers

Roughly once per year, roughly after the first "update release" in September, inactive committers should be removed from the "callisto-dev" group. This is primarily a "server hygiene" sort of task. Technically there is no problem letting the group membership grow larger and larger, but as always, best that people only have permission where they really need it.

The criterion for deciding whether a committer is inactive is that they have not committed anything since the previous major release (so, roughly, they have not committed anything in a year and 3 months).

There are some scripts in 'org.eclipse.simrel.tools' under 'reportUtilities' that can help with the git queries to determine who has been active and who has not. The hard part is that occasionally someone does "commits" with different email Ids. That is why we have the ".mailmap" file in the 'org.eclipse.simrel.build' project, so that over-all we can keep a record of "who is who" as "many to one" mappings are found.

Also, it is important that a bug be opened and the "*proposed* list of committers to remove" be posted there to give people a chance to say if our scripts (or .mailmap) are wrong, or if, even though they have been inactive, they still need write access. Once the dust settles from that bug (give it 2 to 4 weeks) and a firm list of removals is known, the actual list of committer ids to remove can be given to the webmaster for removal from the Linux group.
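The "many to one" mapping and the inactivity rule can be sketched as follows. This is not one of the 'reportUtilities' scripts; it is a minimal Python illustration that assumes the last commit date per e-mail address has already been collected (for example from git log), and that group members are identified by their canonical e-mail. Only .mailmap lines that carry two addresses (canonical first, commit address second) are handled. All names and addresses are hypothetical.

```python
import re
from datetime import date

EMAIL = re.compile(r"<([^>]+)>")


def parse_mailmap(text):
    """Map commit e-mail -> canonical e-mail for lines with two addresses."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        emails = EMAIL.findall(line)
        if len(emails) == 2:
            canonical, alias = emails
            mapping[alias.lower()] = canonical.lower()
    return mapping


def inactive_committers(last_commit_by_email, members, mailmap, cutoff):
    """Return members with no commit on or after the cutoff date."""
    latest = {}
    for email, when in last_commit_by_email.items():
        person = mailmap.get(email.lower(), email.lower())
        if person not in latest or when > latest[person]:
            latest[person] = when
    return sorted(
        m for m in members
        if latest.get(m.lower()) is None or latest[m.lower()] < cutoff
    )


mailmap = parse_mailmap("Jane Doe <jane@example.org> <jdoe@old.example.com>\n# note\n")
commits = {"jdoe@old.example.com": date(2016, 10, 1), "bob@example.org": date(2015, 1, 5)}
members = ["jane@example.org", "bob@example.org", "carol@example.org"]
# The cutoff is an arbitrary example date standing in for the previous major release.
proposed = inactive_committers(commits, members, mailmap, date(2016, 6, 22))
```

The result is only a *proposed* list; as described above, it still needs to be posted on a bug so people can object before anything is sent to the webmaster.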

How to do a re-spin

Overview

It is not done often, but usually at least one per year is needed. A "respin" means to redo a significant repository (typically a milestone or an actual release) at some point significantly after its initial deadline has passed, when more care and control is desired over the input and output. A common reason for needing one is that it might be discovered during "quiet week" that there is a serious bug in one component that has "cross project" implications (such as, functional issues might prevent PHP and XML from both being run in the same workspace). Another example is if a third-party bundle has been included that was later found to be "unacceptable" from a licensing point of view. Note: it is the Planning Council (not "release engineering") that decides if a respin is warranted but typically they would want the input of release engineering as well. Also note, if a respin is done near the end of quiet week, this usually implies an automatic delay of one week for the general availability -- i.e. no need to do an all-nighter. :)

The method by which care and control is achieved is that the previous candidate repository provides the "input" for all of the projects (via aggrcon files) except for the one or two projects that are contributing to a respin. This is done since some of the URLs or contents at the URLs may have changed since the candidate release either intentionally or accidentally. In a perfect world, each project would maintain their repositories and their aggrcon files such that the candidate build could be reproduced exactly, but there are always a few projects that do not, so it is easier to "force" the exact same build, by changing the input source, rather than trying to get everyone lined up to have the correct files and repositories to reproduce the previous candidate release.

Steps

  • create a branch of org.eclipse.simrel.build project from the commit hash (or tag) of the release for which we are doing a rebuild. Name it something obvious like "Neon.4_respin_branch". The important part is that all the feature ids and versions match exactly what was built before. (We will be re-doing the URLs). Note: the commit hash of every build is saved away in a file under "buildinfo" for each repository we create.
  • With that branch loaded in your workbench, run a utility from the org.eclipse.simrel.tools repository (master branch), in a directory named transformToOneRepo. In that directory is an XSL file named changeAllRepos.xsl and an Ant file named changeAllRepos.xml, which runs the XSL Transform.

  • Before running the utility, specify two parameters on "command line" of the ant job, so to speak: from the "external tools" configuration, under the JRE tab:
- newRepository: The first parameter changes each repo in each aggrcon file (by using the XSL file) to point to the specific, existing repository that we are rebuilding.
- Example: -DnewRepository=http://download.eclipse.org/releases/neon/201609281000/
- javax.xml.transform.TransformerFactory: The second parameter is for the precise XSL Transformer to use. This parameter may not always be required. It depends on the JRE you are using.
- Example: -Djavax.xml.transform.TransformerFactory=com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl

The above is all merely "preparation". After running the Ant file (which runs the XSL Transform) it is recommended to commit that change to the newly created branch so that the next commit will cleanly show "what is changing".
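To make the effect of the transform concrete, here is a rough Python equivalent of "change each repo in each aggrcon file to point to one repository". It is not the changeAllRepos.xsl stylesheet; the sample XML is simplified and hypothetical, and the 'repositories' element and 'location' attribute names are assumptions about the file layout.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the content of an aggrcon file.
SAMPLE = """<contribution label="Example Project">
  <repositories location="http://download.eclipse.org/example/a/"/>
  <repositories location="http://download.eclipse.org/example/b/"/>
</contribution>"""


def point_all_repositories_to(xml_text, new_repository):
    """Return the XML with every repository location set to new_repository."""
    root = ET.fromstring(xml_text)
    for repo in root.iter("repositories"):
        repo.set("location", new_repository)
    return ET.tostring(root, encoding="unicode")


# The target is the example repository URL from the parameter list above.
TARGET = "http://download.eclipse.org/releases/neon/201609281000/"
rewritten = point_all_repositories_to(SAMPLE, TARGET)
```

Feature ids and versions are deliberately left untouched, since the point of the exercise is that only the input location changes.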

  • As the next step, you want to change the one or two aggrcon files for the projects that are contributing to the respin. Those one or two files will point to the new URL that has their fix. Usually both "versions" and the repository location have to be changed in the aggrcon file. The project(s) at least provide the data to use if not actually make the change themselves. Commit that change, and make sure only the desired differences exist.
  • Once all that is done, it is good to "validate" and "validate aggregation" using the CBI aggregator editor in the IDE to make sure the basics are correct.
  • Create a new Hudson job that uses the newly created branch. Typically, a "copy of an existing" "BUILD__CLEAN" job is the only Hudson job that is required. After the copy is made, edit its configuration to modify the branch that is checked out by Hudson. Run that job manually, and let it trigger a "promoteToStaging" as usual.
  • Once that new "staging" repository has been created, a "p2Diff" should be run comparing the staging repository with the previous candidate to confirm the only things changed were what was expected to change. (In reality, occasionally one or two other things might change, simply because p2 is not completely deterministic and has some heuristics to avoid "near infinite optimization". But if in any doubt, ask the projects or cross-project list if anyone is concerned -- typically the unexpected changes are "good", such as the candidate repo may have two versions of a bundle, but the respin repository has only one version of that bundle).
  • After this new staging repository is confirmed accurate, then the previously described steps to "promote" and "make visible" would be followed, according to whatever schedule the Planning Council came up with for the respin. Typically, the EPP packages are re-created also, and typically at least some projects do some more functional testing (but there are no firm rules or "signoff" process that apply to all cases).
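The comparison step in the list above boils down to a set difference. The sketch below is not the p2Diff tool; it is a minimal Python illustration that assumes each repository's content has already been listed as (unit id, version) pairs. All ids and function names are hypothetical.

```python
def diff_repositories(previous, respin):
    """Return (removed, added) lists of (unit id, version) pairs."""
    removed = sorted(previous - respin)
    added = sorted(respin - previous)
    return removed, added


def unexpected_changes(previous, respin, expected_ids):
    """Return ids that changed but were not expected to change."""
    removed, added = diff_repositories(previous, respin)
    changed_ids = {unit_id for unit_id, _ in removed + added}
    return sorted(changed_ids - set(expected_ids))


previous = {
    ("org.example.a", "1.0.0"),
    ("org.example.b", "2.0.0"),
    ("org.example.c", "3.0.0"),
    ("org.example.c", "3.0.1"),
}
respin = {
    ("org.example.a", "1.0.1"),  # the intended fix
    ("org.example.b", "2.0.0"),
    ("org.example.c", "3.0.1"),  # candidate had two versions, respin has one
}
removed, added = diff_repositories(previous, respin)
surprises = unexpected_changes(previous, respin, expected_ids={"org.example.a"})
```

In this made-up data the only surprise is the bundle that went from two versions to one, which is the kind of "good" unexpected change mentioned above; it would still be worth raising with the project if in any doubt.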
