Difference Reports Are Good
At the SCM Vocab meeting today (June 7, 2006) there was some discussion as to whether reports on the differences between two code sets are "core" to useful ALF-style use cases or whether they exist only at the borders. A key concern was that the results of these difference reports could become too large to be neatly managed within ALF events, particularly if detailed line-by-line change reports, in the style of the output of the Unix diff command, are included.
If the detailed diff is the major hang-up, I'm happy to get rid of that line-by-line information if that's what is needed to provide the higher-level information. I think a more prudent approach, however, would be to make the detailed information strictly optional and, if it's used, either make its use the problem of the end user or work around message size limits by storing the resulting information in a known location on the file system.
So, what the heck do I mean when I'm talking about a difference report? The kind of information (easily?) available from various SCM systems differs, so the vocabulary used in difference reports should have lots of optional fields. This strikes me as classic ALF: there is information we want from a type of tool, so we'll standardize the request, standardize the format of the response, and accept that what we get back will differ depending on tool capability.
I'll start by laying out some of the data that I would expect to get back from this request before listing a bunch of fun things we can do with this data. But at the end of the day, the basic theme is that we want to know what changes are happening to a code base, who is making them, and ideally why. Many ALF tools are going to be able to take advantage of that. So what is in a Difference Report? A set of one or more RevisionGroups that each contain one or more Revisions.
Breaking down the data
What is a Revision?
A Revision is information about an atomic change to a Versionable Object, either creating a new Version or moving that Version into a different SCM Configuration (via branches or streams). Most often a Revision is information about a check-in.
Great. So what data do we have in Revisions?
- The name of the Versionable Object which has changed.
- The SCM user name of the developer who changed it.
- The Date of the change (ideally in GMT, but that's often wishful thinking)
Everything else is optional:
- A check-in message specific to this file.
- A flag indicating the Versionable Object was deleted
- A flag indicating the Versionable Object was added
- The number of lines added
- The number of lines deleted
- The number of lines changed
- If requested: a line-by-line diff report. Ideally this is in a standard format, but a tool-specific one would be fine. The line-by-line report should be one the SCM tool could use to perform a patch operation, if the tool supports one.
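To make the required/optional split concrete, here is a minimal sketch of a Revision as a Python dataclass. The class and field names are my own illustration, not an agreed ALF vocabulary; the only assumption baked in is that an optional field left as None means "the SCM tool did not report this".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Revision:
    # Required fields.
    object_name: str   # name of the Versionable Object that changed
    user: str          # SCM user name of the developer who changed it
    date: datetime     # date of the change, ideally in GMT/UTC

    # Optional fields. None means the SCM tool did not supply the value.
    message: Optional[str] = None        # check-in message for this file
    deleted: Optional[bool] = None       # object was deleted
    added: Optional[bool] = None         # object was added
    lines_added: Optional[int] = None
    lines_deleted: Optional[int] = None
    lines_changed: Optional[int] = None
    diff: Optional[str] = None           # line-by-line diff, only if requested


# Hypothetical example: a tool that reports a message and added lines only.
rev = Revision(
    object_name="src/Build.java",
    user="amy",
    date=datetime(2006, 6, 7, 14, 30, tzinfo=timezone.utc),
    message="Fix null check",
    lines_added=3,
)
```

A consumer can then tell "zero lines deleted" (0) apart from "the tool can't say" (None), which matters when tool capability differs.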
So that's a Revision. Most SCM systems, and particularly most modern ones, tend to group revisions in one way or another, be it just for atomic check-in or for larger tasks. I'll call these RevisionGroups. Subversion does this primarily for atomic check-in. ClearCase does the same, but also has a notion of Activities, which is another useful grouping. Perforce does both. CVS has no grouping: all of its RevisionGroups would have exactly one Revision unless someone is clever.
What data does a RevisionGroup have?
- A collection of Revisions
- The date of the (last) commit of this RevisionGroup
Almost required, but it wouldn't make sense for SCM systems like CVS:
- An identifier (usually a number)
- The SCM user name(s) responsible for the RevisionGroup (not sure if this can really be plural).
There is also some summary information that would duplicate information carried in the Revisions but that, for some SCM systems, is only really known at the higher level. I believe this is an issue for StarTeam and some others, but I would need to research it further to be sure.
- The number of versionable objects added
- The number of versionable objects deleted
- The number of versionable objects modified.
Presumably, we could also have a command that only got us RevisionGroup information instead of the details provided by the Revisions themselves. In that case, knowing the list of files touched would be useful.
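As a rough sketch of how a RevisionGroup might hang together, the following Python shows the group-level fields, the list of files touched, and a CVS-style fallback that wraps each Revision in its own group. All names here (RevisionGroup, touched_objects, group_for_single_revision) are hypothetical, and the Revision class is cut down to its required fields to keep the example short.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Revision:
    object_name: str
    user: str
    date: datetime


@dataclass
class RevisionGroup:
    revisions: List[Revision]
    date: datetime                       # date of the (last) commit
    identifier: Optional[str] = None     # absent for systems like CVS
    users: List[str] = field(default_factory=list)
    # Summary counts, only filled in when the SCM reports them.
    objects_added: Optional[int] = None
    objects_deleted: Optional[int] = None
    objects_modified: Optional[int] = None

    def touched_objects(self) -> List[str]:
        """Names of the objects touched, in order, without duplicates."""
        return list(dict.fromkeys(r.object_name for r in self.revisions))


def group_for_single_revision(rev: Revision) -> RevisionGroup:
    """Fallback for systems with no grouping: one group per Revision."""
    return RevisionGroup(revisions=[rev], date=rev.date, users=[rev.user])


when = datetime(2006, 6, 7, 14, 30, tzinfo=timezone.utc)
group = RevisionGroup(
    revisions=[
        Revision("src/Build.java", "amy", when),
        Revision("docs/index.html", "amy", when),
        Revision("src/Build.java", "amy", when),
    ],
    date=when,
    identifier="20417",
    users=["amy"],
)
cvs_group = group_for_single_revision(Revision("README", "bob", when))
```

Here `group.touched_objects()` gives the file list that a RevisionGroup-only command would need to return.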
And why are we doing this?
That's great Eric, so what are we going to do with this data?
Glad you asked. Here are four use cases built into our build server today. All are done by more than one of our competitors as well, so there's probably some user demand. In my life as a developer, I appreciate all of them.
- Create a human readable report with this information.
- Notify only the people who checked in code of the result of a CI build through email, IM, etc.
- Determine if a build is required (no changes = no build).
- Implement a quiet period so that builds do not happen if a check in has happened in the last X minutes.
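The last three of these reduce to very small checks once the revision data is in hand. The sketch below is an illustration only, not our build server's code: the function names are made up, and each revision is simplified to a (user, date) tuple.

```python
from datetime import datetime, timedelta, timezone


def build_required(revisions):
    """No changes since the last build means no build."""
    return len(revisions) > 0


def in_quiet_period(revisions, now, quiet_minutes):
    """True if the most recent check-in is less than quiet_minutes old."""
    if not revisions:
        return False
    latest = max(date for _user, date in revisions)
    return now - latest < timedelta(minutes=quiet_minutes)


def users_to_notify(revisions):
    """Only the people who checked in code hear about the build result."""
    return sorted({user for user, _date in revisions})


now = datetime(2006, 6, 7, 15, 0, tzinfo=timezone.utc)
revisions = [
    ("amy", now - timedelta(minutes=42)),
    ("bob", now - timedelta(minutes=3)),
    ("amy", now - timedelta(minutes=20)),
]

if build_required(revisions) and not in_quiet_period(revisions, now, 5):
    print("start build")
else:
    print("skip or postpone build")
```

With bob's check-in only three minutes old and a five-minute quiet period, this run postpones the build.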
We currently expose this data fairly poorly through our API. Despite that obstacle, our customers have extended the build server to:
- Intercept build requests, interrogate the revisions, and cancel the build if only uninteresting things have changed (the contents of the docs directory typically).
- Cascade revision information to dependent projects so they know what has changed in dependencies.
- Version development builds using the RevisionGroup identifier.
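I don't have our customers' actual extensions to show, but the first and last of these could look something like the sketch below. The "docs" directory rule, the function names, and the dotted version format are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumption: changes under these top-level directories never need a build.
UNINTERESTING_DIRS = ("docs",)


def is_interesting(object_name, ignored_dirs=UNINTERESTING_DIRS):
    """A change is interesting unless it sits under an ignored top-level dir."""
    parts = PurePosixPath(object_name).parts
    return not (parts and parts[0] in ignored_dirs)


def should_cancel_build(changed_names, ignored_dirs=UNINTERESTING_DIRS):
    """Cancel when nothing interesting changed (including no changes at all)."""
    return not any(is_interesting(name, ignored_dirs) for name in changed_names)


def development_version(base_version, group_identifier):
    """Stamp a development build with the RevisionGroup identifier."""
    return f"{base_version}.{group_identifier}"


docs_only = ["docs/index.html", "docs/images/logo.png"]
mixed = ["docs/index.html", "src/Build.java"]

print(should_cancel_build(docs_only))
print(should_cancel_build(mixed))
print(development_version("1.4", "20417"))
```

Note that only the top-level directory is checked, so a path like src/docs/notes.txt still counts as interesting.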
Frankly, if we didn't spend so much time writing commands to fetch this information and parse the returned reports, we would be doing more fun things with Revisions. That's part of the promise of ALF to us.
Now you're saying, "That's very interesting Eric but it's just Build, there's more to ALF." Another good point. Here is how I would be using this information in Project Management:
- We have a resource named "Amy" who is assigned to Project A. Yet when we do a contribution history for Project A, "Amy" hasn't committed any changes in 3 days. There could be a good reason for that, but it's interesting information.
- When I correlate test and build failures to the Revision history of builds, I see that three of my developers contribute to failing builds much more than others. Are they working on riskier code? Is there a process breakdown? Are they bad developers? The failure reports don't answer those questions, but they help us ask them.
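The "Amy" case is a simple query over the contribution history. Here is a minimal sketch, assuming revisions arrive as (user, date) tuples and that "idle" means no commit within a configurable number of whole days; the function names are hypothetical.

```python
from datetime import datetime, timezone


def last_commit_by_user(revisions):
    """Map each SCM user name to the date of their most recent change."""
    latest = {}
    for user, date in revisions:
        if user not in latest or date > latest[user]:
            latest[user] = date
    return latest


def idle_users(assigned_users, revisions, now, max_idle_days):
    """Assigned users with no commit in the last max_idle_days days."""
    latest = last_commit_by_user(revisions)
    idle = []
    for user in assigned_users:
        last = latest.get(user)
        if last is None or (now - last).days >= max_idle_days:
            idle.append(user)
    return idle


now = datetime(2006, 6, 7, 12, 0, tzinfo=timezone.utc)
history = [
    ("amy", datetime(2006, 6, 1, 9, 0, tzinfo=timezone.utc)),
    ("amy", datetime(2006, 6, 3, 12, 0, tzinfo=timezone.utc)),
    ("bob", datetime(2006, 6, 7, 9, 0, tzinfo=timezone.utc)),
]

# carol is assigned to the project but has never committed.
print(idle_users(["amy", "bob", "carol"], history, now, max_idle_days=3))
```

As with the failure correlation, the output only tells us whom to ask, not what the answer is.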
I've gone on way too long. In my view, though, revisions are absolutely central to doing anything interesting in ALF. The first thing we need from SCM is for it to do the grunt work of getting code to where other tools can operate on it. After that, our main need is for SCM to provide those tools, and the users, with an explanation of what is going on in the project: what code is changing, by whom, when, and how significantly. That information is represented in other tools, but the only true representation is going to be changes to source. That's the kind of information tools, service flows, and users can act on. So is it core? Absolutely. Without it, tools like ours will still be doing point-to-point integration with SCM.
My data models and examples are largely selfish and based on what I work with day to day. I'm sure additional information will be useful to others. I think that's a start for difference reports, and the object model works fine (we've implemented it three times now). For an example of those objects in Java, grab our old open source project from CVS and see the Revision and ChangesetRevision classes. http://www.urbancode.com/projects/anthill/developer.jsp