Sunday 31 March 2013

Yes, It's Nonsense.

No doubt profitable nonsense. What follows is a reply to a comment by one Andrew Webb on an InfoQ.com sales-pitch-as-technical-"journalism" puff piece. The piece was "reporting" on a study by Dr Donnie Berkholz of InfoQ.com. This kind of hucksterism is one of the most formidable barriers blocking our craft of software development from ever becoming a professional engineering discipline but, well, read on if you will.

Why here rather than on InfoQ? Simple: even though their reply-entry form states that <a> is an accepted HTML element in comments, it refused to accept any of the links I had in this post. Shoddy "journalism", neet shoddy Web development.


Worse than nonsense; this was written for PHBs, likely as a tool to sell consulting hours training teams in the "more expressive" languages. Anybody who remembers Java's transition from a language and VM into a marketing platform, circa 1997-1998, has seen this done before and better.

Note that I am not talking about Dr Berkholz' original study, the link to which was buried in this puff piece. But there appears to be a real problem with the data that the study was based on from Ohloh. From what I've been able to see, the data covers a period of at least 15 years (late 1990s-present). Subversion(!) was used for 53 percent of the projects in the data set; when you add in now-Palaeolithic CVS, the share rises to nearly 2/3 of the projects covered by the data set.

Let me beat that timeframe to death one more time: In the last 20 years, we've gone through at least two major revolutions in the techniques mainstream developers use. Twenty years ago, if you mentioned "OOP" in half the corporate development centres in North America or Asia, the response would have been "what's wrong?" We as a craft, we were just beginning to get our minds wrapped around object-oriented software development, moving it out of its metaphorical Bronze Age, when the larger revolution which that enabled, BDD/TDD/whatever your flavour of Agile, hit like a silver tsunami. Twenty years ago, I was writing "C/C++", Fortran and Ada code, using various odd bits of version control (anybody else remember how good SourceSafe was before Microsoft bought it?), and checking in massive commits, because centralised SCM systems like SourceSafe, CVS and RCS are a pain; seen less as a design/team-support tool than an insurance policy against Raj's hard drive being wiped by some virus or other. Network connectivity was not as omnipresent, reliable or fast as it is today, by several orders of magnitude.

Nowadays, regardless of what language you're in, you're (hopefully) using some form of agile process. You're strongly encouraged to tunnel in, specifying the smallest change that could possibly fail, and then implement just enough code to make that spec pass. You're not going to check in thousands (or even hundreds) of lines of code at one go; you're going to build that feature you're working on over several commits, on a branch of your SCM tree, and then merge it back into the mainline when it's done. The last ten PHP projects I worked on, over a six-year period, averaged less than 50 lines per commit – in one of the most overly verbose languages since ALGOL 68 and COBOL.

Changes in development practices, and tools, will hopelessly skew that data set unless additional controls are applied, and I could find no description of any. That, to me, says that this is, at best, an enjoyable-in-the-moment bit of data-mining with all the relevance to day-to-day developers' lives of the Magic 8-Ball.

This article was even worse; no examination of assumptions, no discussion of changing trends in the craft and industry, just a vacuous puff piece. An insult to the intelligence of the readers and, despite the possible flaws in the original study, to Dr Berkholz' work as well. If my mind could go take a hot shower to wash the oily residue off, it would. I used to think ZDNet was the bottom of the barrel for this sort of thing. I was wrong.


NOTE: Earlier, I'd previously written "Twenty years ago, I was writing "C/C++", Fortran and Ada code, using Subversion", which is an anachronism, of course. Subversion came out in 2000, and I was not in a time warp in 1994. It just feels like I'd been using (used by?) it for that long.

Playing Mind Games With git, And Seemingly Winning

I had some bad experiences with git a few years back, and until a year or so ago, you couldn't get me to touch it with a ten-parsec pole. Let's just say that things got lost… changes, files, jobs, companies, little things like that. So, building out a Rails shop over the last year, with git being The Standard™(thank you, Github), you'd think I'd be super-extra-hyper-careful about it, right?

Well, the training wheels have to come off sometime, and if they take rather important bits of your skull along for the ride, consider it a Learning Experience. You were warned, weren't you? You did read that .2-point type on the side of the tin with the big label LIFE! that said "Death will be an inevitable finale to Life. Don't bother the lawyers", didn't you? Oh, well. Details.

Rule #16,472 in Dickey's Secrets of Software Survival states: "Never write important code while seriously ill with flu or whatnot. The results would be worse than if you had written the code while drunk. In fact, you may want to give serious consideration to getting drunk before attempting to code while sick." Because what violating 16,472 can get you is shenanigans like committing great laundry lists of files, several times, only to find yourself staring at a red bar and wondering how you got there. (Hot the Red Bar in Cable Street in London, but the red bar that is your test tool saying "your code and/or your specs/tests are busted, bub". That red bar.) This has actually happened to me not once, but twice in the last couple of months.)

At which point, you mutter dark threats under your breath at the imbecile who put you in this position (yourself) and start retracing your steps. Whereupon you find that your last genuine, not-because-things-were-wildly-cached proven-working version was five commits earlier. Five commits that have been pushed to the Master Repo on Github earlier. You then note that each of these touched a dozen files or so (because you don't like seeing "19,281 commits" in the project window when you know you're just getting started), and a couple hours of spelunking (as opposed to caving) leaves you none the wiser. What to do?

The first thing to remember is that there is nothing structurally or necessarily semantically significant about the master branch in git. As far as I can tell, it's merely the name assigned to the first branch created in a new repo, which is traditionally used as the gold-standard branch for the repo. (Create new branch, do your thing, merge back into master. Lather; rinse; repeat. The usual workflow.) On most levels, there's nothing that makes deleting the master branch any different than any other.

Of course, the devil is in the details. You've got to have at least one other surviving branch in the repo, obviously, or everything would go poof! when you killed master. And remote repos, like Github, have the additional detail that specifies HEAD as the default branch on that remote. (Thanks to Jefromi for this explanation on StackOverflow, to another guy's question.) There always has to be a default branch; it's set to master when the repo is created and very rarely touched. But it's nice to know that you can touch, or maul, it when the need arises.

Here's what I did:

  1. I went back to that last proven good commit, five commits back, and created a new repair branch from that commit;

  2. I scavenged the more mundane parts of what had been done in the commits on master after the branch point, and made (bloody well) sure specs fully covered the code and passed;

  3. I methodically added in the code pieces that weren't so mundane, found the bug that had led me astray earlier, and fixed it. This left me with a repair branch that was everything master would have been had it been working.

Now for the seatbelts-and-helmets-decidedly-off part.

  1. I verified that I was on the repair branch locally;

  2. I deleted the master branch in my local repo

  3. I ran git branch master to create a new master branch. (Note that we haven't touched the remote repo yet);

  4. I checked out the (new) master branch (which at this point is an exact duplicate of repair, remember);

  5. I (temporarily) pushed the repair branch to origin;

  6. I used the command git remote set-head origin repair to remove the only link that really mattered to the existing master branch;

  7. I deleted the master branch on the remote ("origin") repo as I would any other remote branch;

  8. I force-pushed the new master to the remote, using the command git push --force origin master. I needed the --force option to override git's complaint that the remote branch was ahead of the one I was pushing;

  9. I ran git remote set-head origin master to restore git's default remote branch to what it had recently been; and

  10. I deleted the repair branch from origin

Transmogrification complete, and apparently successful. Even my GUI git client, SourceTree, only paused a second or so before displaying the newly-revised order of things in the repo. It's useful to know that you can trade the safety scissors for a machete when you really do feel the need.

However… so can anyone else with access to the repo. And I can easily see scenarios in a corporate setting where that might be a Very Bad Thing indeed. In a repo used by a sizeable team, with thousands of commits, it wouldn't be all that difficult for a disgruntled or knowingly-soon-to-be-disgruntled team member to write a program that would take the repo back far enough into the mish-mashed past to deter detection, create a clandestine patch to a feature branch that (in the original history) was merged into master some time later, and then walk the commits back onto the repo, including branches, before pushing the (tainted) repo back up to origin. I'd think that your auditing and SCM controls would have to be pretty tight to catch something like that. I'd also think that other DVCSes such as Mercurial or Bazaar would have a much harder time successfully doing what I did, and therefore, would be less likely to such exploitation. That this hasn't been done on a scale wide enough to be well-publicised, I think, speaks quite loudly about the ethics, job satisfaction, and/or laziness of most development-team members.

What do you think?