Platform-releng/Kepler 2013 PDE to CBI retrospective
NOTE: this document is still under development and being reviewed.
This document is intended to serve multiple purposes. It is partly a brief "experience report" of the Eclipse Project moving to CBI; partly a way to educate other projects considering CBI on what to expect (we were surprised by a few things, and our experience would have been better if we had known more accurately what to expect); and partly a starting point, or set of "reminders", for planning our next "release engineering" cycle.
Authored by David Williams. July 2013. I should emphasize I've written this as "Releng Lead" and while I've sought out the advice and opinions of others, any errors in fact (or opinions stated) are entirely my responsibility. As with all retrospectives, this is merely a "dump" of things while fresh in the mind; it is not intended to be any definitive statement or judgment. My hope is others might find at least a few points helpful. It is also to aid my memory ... which everyone knows is terrible ... and will help in creating our "releng to-do list" as we start the next development cycle.
Original criteria and "percent accomplished"
Below I am just repeating what we started off saying the "criteria" were, for moving to CBI (Common Build Infrastructure). We got "close enough" for a release (Kepler) and, the point is, for all the extensive work that took, there is still a lot of work to do.
That "extensive work", by the way, came from many people and the Eclipse Foundation itself, as part of the LTS (Long Term Support) effort, not just the platform team ... everyone's help and contributions were crucial to our success and much appreciated. The move to CBI itself was mostly motivated by wanting a build suitable for LTS ... our current build had grown fragile over the years, due to many incremental additions, and while much of that could have been fixed, many of the original committers for PDE batch builds had moved to other projects or companies, so there was no particular interest in improving that current PDE build, but there was a lot of community interest in using Maven/Tycho.
The "percent accomplished" (below) is intended to be a rough "grade" rather than an estimate of "time and effort remaining". In some cases, we partially "accomplished" the goal by using ant and bash scripts, so I tried to "balance" that with what we accomplished with pure Maven/Tycho. Those "grades" would roughly correlate with remaining effort of "low", "medium" and "high", but more planning work would be required to turn them into "person months to finish" or similar measures.
- Have same output deliverables as current build: zips, repository, etc. (For Equinox and Eclipse) (Be able to reproduce exact same build, given a tag (or tags) to start the build with.) (70%) This is low mostly because we still need to specify .target files for our pre-reqs (bug 400518), but we are also learning that there are some "hidden assumptions" that can change things even after a successful release (bug 412110).
- Same warnings (and compile errors, if any) as PDE based build. (60%) bug 402086
- Be able to run our JUnit tests, with same results, as PDE based build. (80%) We do get same results, and only had to change a few things to accommodate new build results (inner jars not signed, we switched to UTF-8 at same time) ... but ... we just use Tycho/Maven to build and package the tests bundles ... we run the tests completely independently just like we did before (unlike most Maven/Tycho projects.)
- Produces the same Javadoc and Help documentation. (80%) Again, we produce the same thing, without much effort in changing ... but we just invoke the same ant scripts we did before ... not sure how most Maven/Tycho projects do it.
- Run a binary comparator against the bundles to ensure that they are the same binary content as the regular PDE bundles. (60%) This is actually hard (or impossible) to do literally, since qualifiers changed, jar format changed, etc., but I gave it a low score since, in a few cases, we were "compiling to" a different version of Java than we thought we were. bug 411419 and bug 411397
- Qualifiers not change (except for branding bundles) if the content has not changed. (90%) This works quite well in the new build, though we still need to document some aspects, and in some cases it requires more frequent "manual touching" when we need a feature qualifier to change. Plus, for us to meaningfully parse "comparator" results, we're forced to create a 500 megabyte debug log file! bug 401145
- All bundles signed. (90%) Works well, though still working on signing "inner jars".
- Final I-build type repository has same content and metadata as PDE based build (nothing missing, nothing extra). (80%) I gave a less than perfect score since we still need to do some post-build "manual massaging", such as removing the "master category" which Tycho puts in, but we do not want.
- Be easy to fit in to current workflow of automated builds and tests of "nightlies", I-builds, milestones (i.e. committers have to know what to do to "release" something for a build, how to "freeze" changes at a certain point). (90%) This works well, given our "manual bash scripts". Honestly, I don't know how to accomplish some of what we do there with a "pure" Maven/Tycho build (and, say, Hudson), though I'm sure it's possible.
- Don't create maintenance burden by duplicating information bug 387802 bug 402086, bug 401776 (70%)
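The "compiling to a different version of Java than we thought" problem mentioned in the criteria above (bug 411419, bug 411397) is the kind of thing a small script can catch. The following Python sketch is only an illustration and not part of our releng scripts: the function names and the lookup table are hypothetical, the BREE is passed in by the caller rather than read from the manifest, and only the more common execution environments are listed. It reads the class file version from each .class entry in a jar and reports those that are newer than the named BREE allows. It does not flag classes compiled to an older level than intended.

```python
import struct
import zipfile

# Highest class file major version each execution environment can load.
# The numbers come from the class file format; the table itself is
# illustrative and incomplete (extend it for other BREEs as needed).
MAX_MAJOR_FOR_BREE = {
    "J2SE-1.4": 48,
    "J2SE-1.5": 49,
    "JavaSE-1.6": 50,
    "JavaSE-1.7": 51,
    "JavaSE-1.8": 52,
}


def class_major_version(data):
    """Return the major version stored in the header of a .class file."""
    # Header layout: 4-byte magic, 2-byte minor, 2-byte major (big-endian).
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


def classes_exceeding_bree(jar_path, bree):
    """List (entry name, major version) for classes too new for the BREE."""
    limit = MAX_MAJOR_FOR_BREE[bree]
    too_new = []
    with zipfile.ZipFile(jar_path) as jar:
        for name in sorted(jar.namelist()):
            if name.endswith(".class"):
                major = class_major_version(jar.read(name))
                if major > limit:
                    too_new.append((name, major))
    return too_new
```

An empty list from `classes_exceeding_bree(path, "J2SE-1.5")` would mean no class in that jar needs a newer VM than the BREE promises; anything else deserves a look at the bundle's POM (or its parent POM).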
These open bugs need scrubbing ... some may be fixed? And a few should be opened (or re-opened) still. But there are over 100 of them still open.
- Bug query for open CBI bugs in Eclipse (Platform, JDT, PDE) and Equinox.
- Bug query for open Eclipse Foundation CBI bugs (all in Eclipse Foundation, CBI).
There were roughly 250 "CBI bugs" fixed. (Some of these are "real bugs"; some of them are "adding the POMs" and other "conversion" activities.)
- Bug query for fixed CBI bugs in Eclipse (Platform, JDT, PDE) and Equinox.
- Bug query for fixed Eclipse Foundation CBI bugs (all in Eclipse Foundation, CBI).
- Easy to "get a build" ... but hard to "get a correct build". We in the platform have a lot of complications, such as many fragments, wide range of multiple BREE levels, many different packages produced, etc., that many projects may not have. To some extent, Maven encourages (and assumes) a certain amount of consistency and that's what makes it easy ... when you can be consistent ... but, as soon as you deviate from their built-in assumptions, it starts to get more complicated (true of about anything, I guess, but seems more so, for Maven, where "consistency" is assumed by design).
- Maven's automatic dependency resolution is a blessing and a curse.
- makes it easy to get started, but makes it harder to have cleanly reproducible builds. (We still need to add .target files to specify pre-reqs exactly).
- Uses lots of disk space
- makes it possible to have "stale data" in local repositories ... especially if using "snapshots", leading to different results if "building fresh" versus "building with existing local repo".
- Hard to know, for things like "LTS", if each project in a large chain of packages will all use .target files and similar things to have exactly reproducible builds.
- Each committer needs to "know more maven" than we anticipated.
- As the worst-case example, in Kepler, I think we have 5 bundles whose byte codes were not compiled according to the BREE named in their manifest.mf file ... due to "mistakes" in their POM files or their inherited POM files (bug 411419, bug 411397). Smaller examples include having/keeping "parents" correct, updating the version in the POM when the version in manifest.mf changes, converting customcallbacks to pure maven scripts or converting them to work with the "antrun" plugin, and setting custom compiler options in each POM, where needed.
- Tycho handles some PDE constructs completely transparently (good), handles some with moderate "Mavenization" (ok), but does not handle some at all and requires a lot of "custom maven work" to get the same build results (sometimes for simple things, such as what warnings the compiler should output ... one outcome for us in Platform is that "test bundles" produce too many warnings).
- Tycho's handling of "multiple BREEs" and "JRE specs in build.properties" is different than in PDE ... probably more powerful ... but has led to a few cases of confusion or mis-coding.
- The Tycho project, by its own mission, does not see itself as a drop-in replacement for PDE ... hence it is not a drop-in replacement for use in CBI. It takes work!
- For example, see response in bug 402086 comment 5. By the way, this makes perfect sense, from Tycho's point of view. Just surprised some of us. The point is, anyone adopting CBI must have all their committers learn a fair amount of Maven and Tycho -- no free lunch. Be prepared to open high quality bug reports, with simple test cases ... if not even provide patches!
- We still have a monolithic build. Apparently it would take a LOT of work to change that. This surprised some of us.
- PDE provides a way to "get the source", based on features. Tycho/Maven does not.
- We use the "aggregator/submodule pattern" to (easily) get source from Git, but that is part of the rigidity that leads to a monolithic build ... are there other patterns or techniques? Or is the only option to refactor repositories (and code)? We currently have 24 repositories (plus the aggregator repository) ... and about 50 features, over 500 bundles, and 551 pom.xml files.
- Surprised that the time to do (a clean) Tycho/Maven build is about the same as for (a clean) PDE Build. (2 to 3 hours for Platform builds, using either technology ... about an hour spent signing in either case). [Thanks go to the Eclipse Infrastructure team for providing a plugin-by-plugin signing service which is much faster than when we started!]
- So far, we have had to "live with" a funky, empty "Configuration Feature" in SDK "product". In PDE build, this was merely p2 instructions that were handy to re-use ... but in Tycho/Maven we have to include the feature, to apply the p2 data. In Maven world, "reuse" appears to come from copying, rather than abstraction. So ... can we, and do we want to "copy" that "product data" to each product, instead of using empty "Configuration Feature"? Is the "Configuration Feature" required for Delta Pack?
- "Inheritance" of build properties (and characteristics) is based more on "directory layout" instead of "features" (a hard conceptual leap for many of us).
- Reduced ability to "continue on error" (which, could be argued, is both good and bad).
- The jarring code in Tycho/Maven is "non standard" (i.e. it follows the "jar standards" but is not the same implementation as in the Java VM) (which could be seen as good or bad), but it results in a different "jar internal structure" depending on how installed or obtained. This may require "custom tools" (for release engineers) to know when two jars are "the same" or not ... standard tools like "diff" say they are different, even though, if unjarred, there is no difference. (Again, good and bad, 'diff' was handy ... but, not really made for that purpose).
- Doesn't use "straight p2" code so some things (like eclipse.inf files) are handled differently. (positives and negatives in doing so).
- Hard to know difference between "bug or limitation in maven or tycho" and "bug in our setup".
- Requires many hours of debug work and "trial and error" experimentation. PDE probably had fewer bugs in it, simply because it is an older technology, and its bugs have been found and fixed already ... But Tycho does have "real bugs" that are hard to diagnose ... at least, hard for non-experts ... sometimes even the experts (see bug 413116 for our latest case).
- Documentation problems (not saying that PDE Documentation was that great)
- life cycle of maven is documented fairly well, life cycle of Tycho is not at all (have to look at code, at best).
- Maven's documentation is not as good as Ant's. (Tycho's even worse).
- Such as, very few hyperlinks to related documents or elements.
- Very few "examples in the documentation" as there are in Ant. (There are a few example Tycho projects that are pointed to for everything ... but they don't contain examples of everything).
- Often had to look at Tycho's Java code to understand an XML POM element or its attributes: what they mean, what permissible values were available, even then hard to know (usually) what much of the terminology means (again, for us newbies).
- No documentation from Tycho on what is handled transparently from PDE build (which is a cool feature when it does handle transparently), versus what is "partially handled", versus what is not handled at all from PDE build properties and requires Maven knowledge on "how to duplicate build results using Maven".
- In several places we had "customcallbacks" and in some cases tried to convert them to maven instead of simply calling them with antrun-plugin. In most cases "converting to maven" turned out to be a big chore (or didn't work), while "calling ant" was much easier to adjust to, even though "calling ant" does add the extra overhead of "starting ant" in another process. That is, my overall advice to others doing the conversion is to stick with "antrun-plugin" as an initial step ... only worry about "mavenizing" when the build is correct and you are looking to make performance improvements.
- Does not allow "parallel compilation" ... which I mention as mostly a missing "real life test-case" for JDT. (practically speaking, would not save that much time relative to the things that are known to take a long time.)
- XML format is very verbose. (instead of a few lines in a .properties file, sometimes takes a dozen lines in a pom file). Sometimes seems very repetitive (lots of "copying and pasting"). In some ways, a small thing ... but bothers some people that are not used to it.
- One consequence of "changing builds" is much of our Platform's "releng wiki" is obsolete or needs major updates. (good and bad ... some of it was partially obsolete anyway!).
- Seems there were a few cases of committers "breaking production builds" because they changed one local bundle, got it to build locally, but then forgot there were implications for test bundles, etc., or perhaps did "the easy" local build, without using the bree-libs profile and having toolchains.xml defined ... so it then broke in the production build even though it built locally ok. (Presumably merely a matter of education? Or experience? ... but it is hard to know why incremental builds in the (PDE) workbench, with proper Execution Environments, do not suffice ... perhaps many of us do not have developer grade hardware :) )
- Still work to do to make truly "reproducible" (using PDE .target files, ironically).
- RCP SDK is not equivalent to what we had ... not sure why yet.
- Delta Pack contents are almost the same as one from PDE builds, but its metadata is a LOT different. Is it sufficient?
- Inner jars are not signed (in Kepler SR0 ... CBI fix on the way).
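The point above about "diff" reporting two jars as different, even though nothing differs once they are unjarred, suggests the kind of "custom tool" a release engineer might write. The Python sketch below is one hypothetical way to do it, not the comparator we actually use: the function names are made up for illustration. It compares two jars by entry name and uncompressed content only, so entry order, timestamps, compression, and directory entries are ignored. It does not look inside nested jars, and signature files under META-INF are compared like any other entry.

```python
import hashlib
import zipfile


def jar_fingerprint(jar_path):
    """Map each file entry name to the SHA-256 of its uncompressed bytes."""
    digests = {}
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            digests[info.filename] = hashlib.sha256(jar.read(info)).hexdigest()
    return digests


def same_jar_content(jar_a, jar_b):
    """True when both jars hold the same entries with the same bytes."""
    return jar_fingerprint(jar_a) == jar_fingerprint(jar_b)
```

Comparing the two dictionaries from `jar_fingerprint` directly (rather than just the boolean) also shows which entries were added, removed, or changed, which is usually what you want to know next.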
Good things about Maven/Tycho builds.
- Short feature suffixes.
- Though does require more "manual touches" than before, such as when third party bundles change.
- Rules of what code changes affect qualifiers are not well documented ... just in bugs and mailing lists.
- Does not pack200 non-Java bundles (easily and automatically ... PDE would require "manual" marking or a custom ant task).
- Once the numerous bugs/enhancements were fixed, creating source features and source bundles may turn out to be easier than in PDE builds (PDE builds required a fair amount of "coordination" between different parts of the build, which was error prone.) [Thanks to those such as Jan Sievers, who fixed the bugs and added the enhancements to Tycho!]
- Some committers have found it useful/easier to do their own local builds, when making large changes ... especially (seems to me) changes hard to do in a workbench build, such as low level SWT or Equinox changes.
- Has (reportedly) allowed compiling under early access versions of Java 8, whereas PDE builds would have to wait for Java 8 support to be in Eclipse.
- Some excitement in community about ability to "build themselves".
- Seems to be waning a little already? Hard to tell why, but perhaps because we do not go as far as "the community" would like us to? No tests? Production builds not on Hudson? But our focus has been on enabling LTS builds, and I think that's been successful (though we still do not know how to do a "feature patch build" ... though we are told it's possible) and we still don't "run integration tests" in LTS (as far as I know).
Meta comments about the process
- In hindsight, it was probably not a great time for us to "make the switch". Others (from Eclipse Foundation, and Sonatype) made the "initial prototype" build of the platform, and then, past the middle point of a development year, we got ownership to turn it into a production build "on our own" (after all, we were the ones with the intimate knowledge of our bundles and our needs). My point is, the prototype was developed while we (core committers) were still very busy trying to maintain two builds (3.8 and 4.2) and finish the "move" to "build.eclipse.org" ... so there was little time for "joint work" (except thanks go to Paul Webster for finding some time). In hindsight, it would have been easier or better to do more of a "joint effort" starting at the beginning of a cycle, say with Luna M1, than trying to do it in the final milestones ... even though the prototype was a great start and much appreciated ... there were aspects of it in error where more interaction with "core" committers was needed to "make it right". More importantly, many of us "missed out" on an educational opportunity about why things were done a certain way. As another example, early in the prototype phase, it was suggested we "move some things around" so the repository structure made more sense for Maven/Tycho builds. At that time, everyone was too busy to even consider it. In hindsight, it would have been better to spend at least a little time to at least think about it ... perhaps there could have been some "small changes" that would have made big improvements?
- An unexpected benefit of "redoing the build", one we are still in the process of realizing, is that it has made us re-examine the whole build ... things had been "added to" the PDE build over many years, and nothing ever removed, so there was a lot of "junk" in there; either literally dead code everyone was afraid to touch, or we "continued to build things" even if there was no longer a reason to. We still need to continue that "self examination".
- In general, I think Maven's "claim to fame" is that it is designed for the "whole development process ... build, tests, deployment". Since we use it "only for builds", it's not surprising some of its advantages are of no benefit to us. It is harder to offset the costs when only partially using its strong points.
- I want to emphasize we would not have been able to release Kepler without the expertise of many other people outside the core Platform committers (such as from SAP, Red Hat, Sonatype, the Eclipse Foundation itself, and others). Their help is and continues to be much appreciated. However, we have sort of ended up now in a funny situation ... I've heard it said "our PDE build was too complicated and only a few people in the world understood it". Now our current Maven/Tycho build is also pretty complicated, and, honestly, I believe there are only a few people in the world that really understand it. :) I could be wrong, but I am just emphasizing that our build still needs to be simplified and our own education needs to continue. But it has been great having the interest and help from so many others in the community. We've learned a lot from them. Very much appreciated.
- In general, Maven/Tycho is very simple and great for small projects (such as for one Git repository with a few features) that use it from the beginning and that are "set up" from the beginning with Maven/Tycho in mind ... but it is hard and expensive to adopt for large existing projects (without major re-org of repositories and/or refactoring of code). While it is likely quite suitable for "enterprise use" ... it requires a lot of work to "get right".
- Still lots to do! ... Both fixes and "finishing" some things for Kepler SR1 and improvements for Luna.
- At the moment, "releng planning" is done simply by marking "planned items" in bugzilla with "p2". See this query for current "p2" priorities, and if anything is missing, feel free to comment in a "p3" bug that you think it should be "p2".