Tuesday 4 January 2011

Windows? Check. Skype? Check. Quality? Oops…

Several cultures share variations of a proverb which reminds us that you cannot build a house on sand and have it endure the test of time — or even the next tide.

Skype learned that the hard way a couple of weeks ago. Like all peer-to-peer, or P2P, networks, Skype rely on (a varying subset of) their users' systems to handle routine network-housekeeping chores, including routing. Fine; that's the way many applications have done things for more than a decade. But Skype have been living unusually close to the edge in two respects: they have an unusually large, widely distributed population of simultaneous users relative to other P2P networks, and they implicitly trust the reliability of their Windows-based application, and of the underlying Windows operating system, when performing at any needed scale. This is what sent the waves crashing in upon the sand-borne house.

Microsoft Windows is, inarguably, a wildly successful software system; along with Microsoft's Office set of applications, it is the basis of Microsoft's billions of dollars a year in revenue and profit. But just because something is successful or because it's made by a major corporation doesn't necessarily mean it's well-designed or -built. (See, for example, the Ford Pinto.)

By relying on the (always-questionable) stability and reliability of each of the Windows PCs acting as a "supernode" in their network architecture, Skype were setting themselves on a path to an eventual, inevitable and entirely foreseeable large-scale failure.

Many of the well-known problems with keeping Windows stable and running properly have to do with the ridiculous ease of developing, distributing and using malware; in some countries (such as Singapore), estimated infection rates of Internet-connected Windows PCs routinely hit or exceed 90%. Besides the obvious threat to any "secure" information on those systems (credit card info, login IDs, and so on), those infected Windows systems have a relatively large share of their processing power diverted for the benefit of someone other than the legitimate user or of applications other than those that s/he has authorised.

Another cause of instability and other problems, as with the Skype outage, is that it is very difficult to develop, deploy and maintain even moderately complex applications under Windows with any real means of asserting that the software is reliable and robust. The only way to gain any such real assurance is by testing as many hardware and software combinations with your software as you can, and hoping that you've found all the "most likely" problems. (This is why Microsoft and others employ thousands of software testers, in addition to a massive automated-testing infrastructure.) And, though Microsoft have made some remarkable improvements in this area (compare, say, Windows 98 to Windows 7), even they can't catch everything, yet their development process and system architecture essentially require them to. Every Windows user in the last 25 years has seen the "Blue Screen of Death"; that is merely an acknowledgement by Windows that it's completely lost its mind, usually after innumerable random acts of lesser senility.

Skype were bitten by both ends of this issue. A new release of their client application for Windows had defects that were not caught during development or internal testing. By deploying that defective application on an operating system that provides little-to-no protection against maliciously or inadvertently defective applications, they set a large "time bomb" ticking away, unnoticed by all until its spectacular detonation on or around 22 December 2010.

For a Skype client to be a "supernode," it cannot be behind a "firewall" that uses network address translation to isolate its clients from the big, bad Net. (There are ways around that, too, of course; but they're generally very black-hat techniques that don't scale well on a commercial basis.) A client does not have to be running on Windows to be a supernode, but given the demographics of Skype's user base, and the larger PC user base in general, the majority are.

And so, when Skype's defective Windows client ran into mortal difficulties on Windows' defective platform, the ordinary user PCs that were acting as supernodes started dropping like flies. Other Skype clients, as with any proper P2P network, simply rerouted to other supernodes. However, since many (if not most) of the Windows clients were behind NAT firewalls and there was (and is) only a relatively small percentage of non-Windows clients available to serve as supernodes, the number of "leaf node" clients chasing the decreasing number of supernodes put the remaining systems under an unsustainable load, and so those Skype clients (with their supernode capabilities) dropped off the network in turn.
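For readers who like to see such a cascade in motion, here is a toy sketch in Python. It is not Skype's actual protocol, about which this post makes no detailed claims; the function name `simulate_cascade`, the capacity figures and the assumption that leaf nodes spread themselves evenly over the surviving supernodes are all invented for illustration.

```python
def simulate_cascade(capacities, leaf_count):
    """Return the number of live supernodes after each round of rerouting.

    capacities -- hypothetical maximum leaf load each live supernode can carry
    leaf_count -- total leaf-node clients looking for a supernode

    Assumption: leaves spread evenly over whichever supernodes are still up,
    and any supernode whose share exceeds its capacity drops off the network.
    """
    alive = list(capacities)
    history = [len(alive)]
    while alive:
        load = leaf_count / len(alive)            # each survivor's share
        survivors = [c for c in alive if c >= load]
        if len(survivors) == len(alive):          # nobody overloaded: stable
            break
        alive = survivors                         # overloaded nodes drop out
        history.append(len(alive))
    return history


# Ten healthy supernodes (made-up capacities 100..190) carry 900 leaves easily.
print(simulate_cascade(range(100, 200, 10), 900))   # [10]

# Suppose a defective client takes out three of them; the same 900 leaves
# now swamp the rest, round after round.
print(simulate_cascade(range(100, 170, 10), 900))   # [7, 4, 0]
```

The point of the toy is only that the collapse feeds on itself: losing three supernodes out of ten does not cost 30% of capacity, it costs all of it, because every drop-out raises the load on whoever is left.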

Had there been a more heterogeneous mix of Skype clients so that those available to perform supernode duties were not almost exclusively Windows PCs, the failure might well have been mitigated if not eliminated altogether. Or, were it practical to develop a (reasonably) guaranteed-stable application on Windows, the defect that initiated the failure cascade might well never have occurred. But by "putting all their eggs in one basket" which was known to have large, gaping holes in it, Skype were guaranteeing that their metaphorical sidewalk would eventually be covered hip-deep in broken eggs.

Monocultures in technology, as in farming, are generally Bad Things; a threat which can attack any member of a monoculture can attack all such members. (Think of the Windows malware-infection rates as an example of this.) Single points of commonality (or of control) are always eventual single points of failure.

The Skype failure could well have been a reminder of these long-known basic truths. The commercial interests involved, however, ensure that such a lesson will go unheeded. It will therefore be repeated; very possibly not with Skype, but with some other aspect of our grossly defective information-technology ecosystem. Think of it as "Unsafe at Any Speed" meeting Groundhog Day until something is done to prevent yet another China Syndrome.