Showing posts with label version control.

Monday, 24 February 2014

Let's Stop Pretending that Source Control is "Optional but Recommended"

Essentially all of us who make our living from the craft of software development have had, early in our careers, the experience of losing source code that we couldn't get back. Oh, we could type what we remembered into a file of the same name, but it would have important differences that we would discover as we worked with it. (There is no such thing as eidetic, or photographic, memory.)

This has been recognised as a sufficiently universal phenomenon that most developers and managers I have known in the last 10-15 years (at least) treat use of a versioning system as a filter against dangerous dilettantes; if someone claims to have developed "the next VisiCalc for Linux" without using one, the file containing that person's "CV" may be safely deleted (and do remember to Empty the Trash).

So why on $DEITY's green Earth do we still see well-meaning tutorials on blogs all over the Internet which include a step titled "Setup source control (Optional but recommended)"? The particular tutorial that drew my ire this morning even went out of its way to note that you will need a "text editor of your choice", the author adding that "I use Emacs". (Subject and direct object inverted in the original sentence, but that's another rant.)

Don't

DO

that!!

The person reading your tutorial is probably going to be someone with strong continuing economic motivation to improve his software skills; he's reading your tutorial on the off chance that you'll actually teach him something useful. Seeing those three words ("optional but recommended") tells him, almost literally, that you don't really take him seriously. Your half-dozen other readers are going to be (would-be) apprentices in the craft, just learning their way around; they can and, based on experience, will take those three words as permission to just blaze on ahead and, when something goes pear-shaped, to start all over. Maybe you think "nobody I know would take it that way", or even "how could anybody who can read this take it that way", but The Voice of Experience™ is here to tell you that this Internet thingie is a global phenomenon. People from all walks of life, with every language, technical and educational level variation imaginable, can wake up one fine morning and say "I'm going to learn something new, using the Internet", and set out to do just that. That's a feature, not a bug.

I can hear you sputtering "but I didn't want to have to cover how to set up a VCS or use it in the tutorial's workflow; I just wanted to teach people how to use batman.js". Fair enough, and a noble goal; I don't mean to dissuade you (it's actually a pretty good tutorial, by the way). Might I suggest adding to the bullet list under "Guide Assumptions" a new bullet with text approximating

  • A version control system of your choice (I use Foo)
with an appropriate value of Foo. We don't care whether the Genteel Reader uses git, Bazaar, sccs, or Burt Wonderstone's Incredible Magical VCS; we can merely assume that if he and we are worth each other's time, he's using something; we only need that little extra bit of reinforcement. (Though some choices on that list might be cause for grave concern.)

One of the ways in which the craft of software development needs to change if it's ever going to become a professional engineering discipline is that we're going to have to hammer out and apply a shared set of professional ethics; the closest we've got now is various grab-bag lists of "best practices". I'd argue that "optional but recommended" source control is on the wrong side of one of the lines we need to draw.

Sunday, 31 March 2013

Playing Mind Games With git, And Seemingly Winning

I had some bad experiences with git a few years back, and until a year or so ago, you couldn't get me to touch it with a ten-parsec pole. Let's just say that things got lost… changes, files, jobs, companies, little things like that. So, building out a Rails shop over the last year, with git being The Standard™ (thank you, Github), you'd think I'd be super-extra-hyper-careful about it, right?

Well, the training wheels have to come off sometime, and if they take rather important bits of your skull along for the ride, consider it a Learning Experience. You were warned, weren't you? You did read that .2-point type on the side of the tin with the big label LIFE! that said "Death will be an inevitable finale to Life. Don't bother the lawyers", didn't you? Oh, well. Details.

Rule #16,472 in Dickey's Secrets of Software Survival states: "Never write important code while seriously ill with flu or whatnot. The results would be worse than if you had written the code while drunk. In fact, you may want to give serious consideration to getting drunk before attempting to code while sick." Because what violating 16,472 can get you is shenanigans like committing great laundry lists of files, several times, only to find yourself staring at a red bar and wondering how you got there. (Not the Red Bar in Cable Street in London, but the red bar that is your test tool saying "your code and/or your specs/tests are busted, bub". That red bar.) This has actually happened to me not once, but twice in the last couple of months.

At which point, you mutter dark threats under your breath at the imbecile who put you in this position (yourself) and start retracing your steps. Whereupon you find that your last genuine, not-because-things-were-wildly-cached proven-working version was five commits earlier. Five commits that had already been pushed to the Master Repo on Github. You then note that each of these touched a dozen files or so (because you don't like seeing "19,281 commits" in the project window when you know you're just getting started), and a couple of hours of spelunking (as opposed to caving) leaves you none the wiser. What to do?

The first thing to remember is that there is nothing structurally or necessarily semantically significant about the master branch in git. As far as I can tell, it's merely the name assigned to the first branch created in a new repo, which is traditionally used as the gold-standard branch for the repo. (Create new branch, do your thing, merge back into master. Lather; rinse; repeat. The usual workflow.) On most levels, there's nothing that makes deleting the master branch any different from deleting any other.

Of course, the devil is in the details. You've got to have at least one other surviving branch in the repo, obviously, or everything would go poof! when you killed master. And remote repos, like Github, have the additional detail of a HEAD reference that specifies the default branch on that remote. (Thanks to Jefromi for this explanation on StackOverflow, in answer to someone else's question.) There always has to be a default branch; it's set to master when the repo is created and very rarely touched. But it's nice to know that you can touch, or maul, it when the need arises.
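For the record, here's roughly how you can inspect and change that default-branch pointer from the command line; a minimal sketch, with "repair" standing in for whatever branch you want to promote:

    # Ask the remote which branch it considers the default; look for the
    # "HEAD branch" line in the output.
    git remote show origin

    # Point your local notion of origin's default branch (origin/HEAD) elsewhere.
    git remote set-head origin repair

    # Or have git re-query the remote and set origin/HEAD to whatever it reports.
    git remote set-head origin --auto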

Here's what I did (a rough command sketch follows the list):

  1. I went back to that last proven good commit, five commits back, and created a new repair branch from that commit;

  2. I scavenged the more mundane parts of what had been done in the commits on master after the branch point, and made (bloody well) sure specs fully covered the code and passed;

  3. I methodically added in the code pieces that weren't so mundane, found the bug that had led me astray earlier, and fixed it. This left me with a repair branch that was everything master would have been had it been working.
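In command terms, step 1 came down to something like this; the commit hash is a placeholder for whatever your last proven-good commit happens to be, and the final line assumes RSpec simply because this was a Rails shop and we're talking about specs:

    # Find the last proven-good commit (five back, in my case).
    git log --oneline

    # Create and switch to a repair branch rooted at that commit.
    # "abc1234" is a stand-in; substitute your own known-good hash.
    git checkout -b repair abc1234

    # Re-apply the salvageable work, then make bloody well sure the specs pass.
    bundle exec rspec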

Now for the seatbelts-and-helmets-decidedly-off part (the corresponding commands are sketched after the list).

  1. I verified that I was on the repair branch locally;

  2. I deleted the master branch in my local repo;

  3. I ran git branch master to create a new master branch. (Note that we haven't touched the remote repo yet);

  4. I checked out the (new) master branch (which at this point is an exact duplicate of repair, remember);

  5. I (temporarily) pushed the repair branch to origin;

  6. I used the command git remote set-head origin repair to remove the only link that really mattered to the existing master branch;

  7. I deleted the master branch on the remote ("origin") repo as I would any other remote branch;

  8. I force-pushed the new master to the remote, using the command git push --force origin master. I needed the --force option to override git's complaint that the remote branch was ahead of the one I was pushing;

  9. I ran git remote set-head origin master to restore git's default remote branch to what it had recently been; and

  10. I deleted the repair branch from origin.
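Strung together, the whole sequence looks roughly like this; branch and remote names are as above, and this is a sketch of what I did rather than a recipe to follow blindly:

    git checkout repair                 # 1. make sure we're on the repair branch
    git branch -D master                # 2. delete the broken local master
    git branch master                   # 3. recreate master at the current (repair) commit
    git checkout master                 # 4. switch to the new master
    git push origin repair              # 5. temporarily publish repair
    git remote set-head origin repair   # 6. point origin/HEAD away from master
    git push origin :master             # 7. delete master on the remote
    git push --force origin master      # 8. publish the rebuilt master
    git remote set-head origin master   # 9. restore the default remote branch
    git push origin :repair             # 10. delete repair from the remote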

Transmogrification complete, and apparently successful. Even my GUI git client, SourceTree, only paused a second or so before displaying the newly-revised order of things in the repo. It's useful to know that you can trade the safety scissors for a machete when you really do feel the need.

However… so can anyone else with access to the repo. And I can easily see scenarios in a corporate setting where that might be a Very Bad Thing indeed. In a repo used by a sizeable team, with thousands of commits, it wouldn't be all that difficult for a disgruntled or knowingly-soon-to-be-disgruntled team member to write a program that would take the repo back far enough into the mish-mashed past to deter detection, create a clandestine patch to a feature branch that (in the original history) was merged into master some time later, and then walk the commits back onto the repo, including branches, before pushing the (tainted) repo back up to origin. I'd think that your auditing and SCM controls would have to be pretty tight to catch something like that. I'd also think that other DVCSes such as Mercurial or Bazaar would make it much harder to do what I did, and would therefore be less open to such exploitation. That this hasn't been done on a scale wide enough to be well-publicised, I think, speaks quite loudly about the ethics, job satisfaction, and/or laziness of most development-team members.

What do you think?

Tuesday, 18 October 2005

About me and my work at Cilix

I'm working on a lot of things for my work at Cilix, an engineering firm in Kuala Lumpur, Malaysia. First off, let me be clear on one thing: this blog is not officially sanctioned in any way by Cilix; this is 'just me'.

We call ourselves "A Knowledge Company". What that means, at least in my understanding, is that we apply professional knowledge and experience, augmented heavily by technology, to solve customers' knowledge-management and IT challenges. As such, we do a lot of writing — documents, Web pages, software, ad (nearly) infinitum.

We're a small shop as these things go, and our competition comes from much larger organisations with instant multinational name-brand recognition. Like any small firm, we have to win our first projects with a given client by promising — and delivering — a better value proposition than our competition. Where we get repeat business — again, like any similar firm — is by being agile, efficient, and above all, competent to the point of being unquestionably the least risky vendor for a particular solution.

Those attributes, in turn, lead us to consider issues like process, quality, and superlative knowledge of everything we are about. These issues, and how we as an organisation work through them, were what originally attracted me to the Company when I was approached and offered a position here. These issues are also the foci of what I expect to accomplish with this blog and the related collaboration tools (such as the Wiki).


I am also trying to evangelise and lead the implementation of open documentation and data-format standards at Cilix. This involves, among other things, migrating away from proprietary, binary formats like Microsoft Office documents to open, preferably text-based formats. As it happens, many of these open, text-based formats are based on XML vocabularies, such as Docbook and SVG.

Why are text-based formats preferable? Lots of reasons:

  • They are usually much more compact (and compressible) than comparable binary formats. Converting mostly-text Microsoft Word documents to Docbook equivalents often yields size reductions of 80% or more (think how much more convenient email attachments would be);
  • They are usable with a wider variety of tools. I can throw a text file on my PalmPilot and fiddle with it far more easily than I could a Microsoft Word document, for instance;
  • They are more amenable to most version-control systems, particularly cvs and subversion. Instead of storing a full copy of each version, as is effectively required for a binary file, the system only needs to record the difference between two versions of a text file — a much easier and more reliable operation (see the small sketch after this list). I have seen version control systems of all flavours — SourceSafe, cvs, Atria/Rational ClearCase — irretrievably corrupt binary files when the configuration manager took insufficient care in handling them;
  • They are more amenable to being stored in databases. Many databases (such as MySQL) can return result sets packaged as XML fragments; this, combined with an XSLT processor and stylesheet, opens the door to some truly compelling presentation capabilities.
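A trivial illustration of those last two points, with entirely made-up file names:

    # Two revisions of a mostly-text DocBook chapter: the delta is just the changed lines.
    diff -u chapter-r41.xml chapter-r42.xml

    # The same exercise with a binary Word document yields nothing usable:
    diff -u report-r41.doc report-r42.doc
    # => "Binary files report-r41.doc and report-r42.doc differ"

    # The database/XSLT point: feed an XML result set through a stylesheet.
    xsltproc results-to-html.xsl resultset.xml > report.html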

By taking advantage of these capabilities, we should be able to create better products with more predictable (and shorter) schedules without either greatly expanding the development team or pushing the present staff to the point of burnout. There is a saying in Silicon Valley in California, only partly tongue-in-cheek:

It isn't a startup until somebody dies
Here's hoping that's one "tradition" that's not exported anywhere outside the Valley.

The Vision, Forward through the Rear-View Mirror

The vision I'm trying to promote here, which has been used successfully many times before, is that of a very flexible, highly iterative, highly automated development process, where a small team (like ours) can produce high-quality code rapidly and reliably, without burning anybody out in the process. (Think Agile, as a pervasive, commoditized process.) Having just returned (17 October) from being in hospital due to a series of small strokes, I'm rather highly motivated to do this personally. It's also the best way I can see for our small team to honour our commitments.

To do this, we need to be able to:

  • have development artifacts (code/Web pages) integrated into their own documentation, using something like Javadoc;
  • have automated build tools like Ant regularly pull all development artifacts from our software configuration management tool (which for now is subversion);
  • run an automated testing system (like TestNG) on the newly built artifacts;
  • add issue reports documenting any failed tests to our issue tracking system, which then crunches various reports and
  • automatically emails relevant reports to the various stakeholders.

The whole point of this is to have everybody be able to come into work in the morning, and/or back from lunch in the afternoon, and know exactly what the status of development is, and to be able to track that over time. This can dramatically reduce the delays and bottlenecks that come from the traditional flailing about of a more ad hoc development style.
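A back-of-the-envelope sketch of what that loop might look like as a cron-driven shell script; every URL, path and target name below is illustrative rather than our actual setup, and it assumes a build.xml whose targets are wired to TestNG:

    #!/bin/sh
    # Nightly build-and-report loop (illustrative names throughout).

    # Pull the latest artifacts from configuration management.
    svn checkout https://svn.example.com/project/trunk project-nightly || exit 1
    cd project-nightly

    # Let Ant do the building and hand the results to the test runner.
    # Assumes build.xml defines "clean", "compile" and "test" targets.
    ant clean compile test > ../build.log 2>&1
    STATUS=$?

    # Mail the outcome to the stakeholders; a real setup would also file issue
    # reports for any failed tests in the tracking system.
    mail -s "Nightly build: $([ $STATUS -eq 0 ] && echo PASSED || echo FAILED)" \
         team@example.com < ../build.log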

Obviously, one interest is test-driven development, where, as in most so-called Extreme Programming methods, all development artifacts (such as code) are fully tested at least as often as the state of the system changes. What this means in practice is that a developer would include code for testing each artifact, integrated with that artifact. Then, an automated test tool would run those tests and report any results to the Quality Engineering team. This would not eliminate the need for a QE team; it would make the team more effective by helping to separate the things which need further exploration from the things that are, provably, working properly.

Why does this matter? For example, there was an article on test-driven development in the September 2005 issue of IEEE Computer (reported on here) that showed one of three development groups reducing defect density by 50% after adopting TDD, and another similar group enjoying a 40% improvement.

All this becomes interesting to us at Cilix when we start looking at tools like:

  • Cobertura for evaluating test coverage (the percentage of code accessed by tests);
  • TestNG, one successor to the venerable JUnit automated Java testing framework. TestNG is an improvement for a whole variety of reasons, including being less intrusive in the code under test and having multiple ways to group tests in a way that makes it much harder for you to forget to test something;
  • Ant, the Apache-developed, Java-based build tool. Being Java-based and not interactive per se, it is easy to automate;
  • and so on, as mentioned earlier.