Tuesday, 8 May 2007

XHTML Is (Nearly) Useless

EDITED 2010/11/02: If you saw this post as a single massive block "paragraph," my apologies. The definition of "what is a paragraph?" changed after this was originally posted, and it had scrolled far enough back that I didn't notice the carnage. Let me know if there are still any problems. Thanks.


If you've written any Web pages in the last five years (at least), you've at some point bumped into the difference (schism?) between "original" HTML and "new, improved - now based on XML!" XHTML. If you don't write Web "content" (thanks for reading my blog, but why are you here?), or deal professionally with those who do, you may not know the difference, or care that there is a difference. There is, and people should care about it if they care about the Web.

(Briefly, for those who care but don't know; the rest of you can skip this and the next paragraph.) HTML is often known to developers as "tag soup", because very, very many sites don't follow the strict interpretation of the standard, and are "broken" in all sorts of ways. This was initially justified as working around the myriad bugs in grossly defective browsers such as Microsoft Internet Explorer. XHTML was different and better because it was HTML reformulated as XML, which could then be "validated" (checked) by any validating XML parser. HTML-as-XML should also have driven the development and use of all sorts of nifty techniques and tools that are only practical when assumptions can safely be made about the structure and format of the document - which would be true in XML/XHTML but not necessarily in "classic" HTML.

The problem, of course, is Microsoft's Internet Explorer browser, affectionately known to Web professionals as "Internet Exploder". Among the many "quirks" (defects) afflicting unwitting usees of that browser is this one: all versions up to and including the current Version 7 fail to understand XHTML as XHTML. The "conversation" that takes place between a browser and a server when the browser requests a Web page is defined by the open standard known as HyperText Transfer Protocol, or HTTP. Part of that conversation involves the server informing the browser what type of data it will be sending. This is done using what is called a "content type" "header".

All together now? Good. When a server wants to send a browser a page of "tag soup" HTML, the correct content type is "text/html". A properly-formatted and -served XHTML page will instead use "application/xhtml+xml". This will inform the browser that, in fact, the page being transferred is a proper XHTML page (per the open standard defining it), so the browser will kick in the assumptions and processing that work for XHTML but not for "tag soup".
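For the curious, here is a minimal Python sketch of that decision as seen from the browser's side. It is an illustration only, not how any real browser is written: the function name parsing_mode and the two sample header strings are made up for this example, and the only thing it relies on is the standard library's e-mail header parser, which reads "Content-Type" lines the same way.

```python
from email import message_from_string


def parsing_mode(raw_headers: str) -> str:
    """Pick a processing mode from the Content-Type header (hypothetical helper)."""
    headers = message_from_string(raw_headers)
    # get_content_type() lowercases the value and drops parameters such as charset.
    content_type = headers.get_content_type()
    if content_type == "application/xhtml+xml":
        return "xml"        # strict XML processing
    if content_type == "text/html":
        return "tag-soup"   # forgiving HTML processing
    return "other"


# Sample header blocks, invented for this sketch.
xhtml_response = "Content-Type: application/xhtml+xml; charset=utf-8\n\n"
html_response = "Content-Type: text/html; charset=utf-8\n\n"

print(parsing_mode(xhtml_response))
print(parsing_mode(html_response))
```

The point is simply that the same bytes can be handled two very different ways, and the header is what tells the browser which way to go.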

Of course, Internet Explorer is now the only major browser that gets this wrong (as indicated by this vintage-2005 blog entry). As far as I am aware, every other major graphical browser in the world - Firefox, Opera, Konqueror, Galeon and the rest - supports The Right Thing. Unfortunately, IE is still the 300-pound gorilla in the china shop; the majority of Windows usees still browse the Web using IE, and though the trend is improving steadily, that will likely continue to be true for the next couple of years (say, 2009-2010 barring unforeseen circumstances).

What kinds of things would a properly XHTML 1.0-compliant browser let us do with our site? One trivial example: let's say you're writing a political-commentary site that is geared towards an upcoming election, and you want to consistently name your candidate as "The Honorable Senator Francis X. Snort (email senator@senatorsnort.org)". When your guy goes down to defeat (one too many campaign-finance scandals, mayhap) you want to change the blurb to "The Honorable former Senator F. X. Snort (email snort@somefreemail.com)". Trivial to do with whatever CMS or scripting system you're using, right? But by using an XML entity, you can simply say "&snort;" in your document, and an entity declaration at the top of your document (in its DOCTYPE) will tell the parser what you really mean. Change the declaration, and every instance of that entity expands to your new meaning. People who use other XML-based markup systems, such as DocBook, have been using this technique for years. XML entities in pages shown in correct (non-IE) browsers will do exactly what you tell them to. In IE, or, to be fair, several text-based browsers, the entity name will be displayed exactly as it is in the document - in our case, as &snort;. This is unlikely to have the desired effects on the folks "back home" for the Senator.
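To make the Senator Snort example concrete, here is a small Python sketch that feeds a toy XHTML page to the standard library's XML parser and shows the entity being expanded. The page itself, the helper name render_paragraphs and the sentence wording are all invented for illustration; the parser stands in for what a proper XML-aware browser would do with the declaration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def render_paragraphs(entity_value: str) -> list:
    """Parse a toy XHTML page that declares &snort; in its DOCTYPE (hypothetical helper)."""
    page = f"""<!DOCTYPE html [
  <!ENTITY snort "{entity_value}">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>Vote for &snort;.</p>
    <p>Write to &snort; today.</p>
  </body>
</html>"""
    root = ET.fromstring(page)
    # The parser has already replaced every &snort; with the declared text.
    return ["".join(p.itertext()) for p in root.iter(XHTML_NS + "p")]


before = render_paragraphs(
    "The Honorable Senator Francis X. Snort (email senator@senatorsnort.org)")
after = render_paragraphs(
    "The Honorable former Senator F. X. Snort (email snort@somefreemail.com)")

print(before[0])
print(after[0])
```

Only the one declaration changes between the two calls; both paragraphs pick up the new text without being touched.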

Web developers have, as I mentioned, several well-known workarounds for this type of thing, using their authoring tools rather than the document itself. It is, however, a reasonably easy example for people to understand. Given the increasing popularity of systems such as PHP Smarty that let you use large chunks of "raw" (X)HTML along with the scripting goodies, it would come in handy too.

So how does all of this make XHTML "nearly useless?" Because most developers developing pages for the general public (as opposed to corporate intranets), knowing that Microsoft IE doesn't support the correct content type, will either "not bother" developing "correct" XHTML or at best will serve it to all comers as "tag soup" HTML.
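The choice a server faces can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a recommended implementation: the function name choose_content_type is hypothetical, the two Accept header values are made-up samples rather than captures from real browsers, and the sketch ignores the "q" preference weights that a careful server would honor.

```python
def choose_content_type(accept_header: str) -> str:
    """Decide which content type to send, based on the browser's Accept header.

    Hypothetical helper: it only checks whether the XHTML type is listed
    explicitly and deliberately ignores q-values.
    """
    accepted = [part.split(";")[0].strip().lower()
                for part in accept_header.split(",")]
    if "application/xhtml+xml" in accepted:
        return "application/xhtml+xml"
    # Anything that does not ask for XHTML by name gets "tag soup".
    return "text/html"


# Illustrative Accept headers, invented for this sketch.
standards_browser = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
vague_browser = "image/gif, image/jpeg, */*"

print(choose_content_type(standards_browser))
print(choose_content_type(vague_browser))
```

A site that "serves it to all comers as tag soup" is effectively the same function with the first branch deleted, which is exactly why nothing XHTML-specific ever reaches the end user.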

This also has the "benefit" of completely stifling further innovation (as far as the end user is concerned) based on XHTML. All of the comments I've made so far are only germane to the initial version of XHTML, designated 1.0. The newer versions, XHTML 1.1 and XHTML 2.0, provide new features and support new technologies that greatly expand the usefulness of the Web - or would, if Microsoft weren't, as usual, dragging the Web down for competitive lock-in purposes. By doing everything in their considerable power to ensure that IE browsers and sites aren't fully, completely interoperable with other browsers, they discourage Windows usees from using "rival" browsers to browse sites labeled "Best viewed with Microsoft Internet Explorer". There's nothing preventing Web designers from writing standards-compliant sites that also work well with IE; in a well-designed site, it's not particularly onerous to support both standards and Microsoft. If you're using Microsoft tools, of course, it will take quite a bit more work and knowledge to create valid sites. It can be done - several sites and mailing lists describe the techniques and mind-set required - but Microsoft do not go out of their way to make it easy to do so.

Of course, this also applies only to the public Internet. If you're fortunate enough to be developing "real Web apps" for your company's intranet, and your company understands the value of open standards, then you're not going to be subjugating yourself to IE and none of this really applies to you. Go enjoy all the things that new tech lets you implement that can really stomp on your non-standards-using competition!

For the rest of us, until the Web gets out of this proprietary funk it's in now, and IE either falls into a long-deserved oblivion (improving Windows security dramatically, but that's another post) or actually complies with the same standards every other serious browser in the world does, we're going to have problems. One of the more annoying and frustrating ones, as we've discussed, is that XHTML is (nearly) useless. So much for innovation.

Monday, 7 May 2007

It's the End of the Net as we know it, and we feel fine....

John C. Dvorak has an interesting post on his pcmag.com column blog, entitled "Will the Internet Collapse?" He doesn't think it will, obviously, and he's got some pretty impressive trends to back up his contention. Example: 140,000 terabytes of backbone traffic in 2002 - at a "conservative" 60% annual growth through 2007, that's roughly 245 MB for each of the six billion or so people on the planet. Most of whom (still) wouldn't know what a byte was if it bit them; they've got more pressing concerns, like safe food, clean water, housing... But I digress.

I don't think the Net per se will "collapse", either. What's going to happen - what's already happening - is both more subtle and more dangerous. The "Internet craze" that gave rise to Bubbles 1.0 (1990s) and 2.0 (now), and that has driven the Net from a quirky research project into a cultural touchstone, has done two things that would make an every-Friday-from-4-to-10-PM crash seem benign in comparison. (And "4-to-10-PM" where? On the Internet, it's always "now".)

The first problem is the Baby's Spoon in the Waterfall. There's so much information (wrapped up in even more "content", which isn't the same thing) that no person, government, entity or corporation can ever comprehend it. People who spend large amounts of time surfing the Web and using various tools to pull information off the Net in other ways soon exhibit a behavior akin to being "punch drunk". Late in the 12th round, The Champ has connected so many times with Joe Palooka's jaw that we in the crowd can see Joe staggering around, unsure even of the direction from which the merciless pummeling is coming, let alone able to control the situation. The Champ, in this analogy, is the onslaught of data/information/"content" from the Net, primarily email and the Web; Joe is standing in for the typical, non-technical ("you mean Yahoo and the Web aren't synonyms?") user. As the user's eyes glaze over and the cognitive mind enters vapor-lock, he is essentially unable (and psychologically unwilling) to refine his usage patterns or seek out new experiences that he wouldn't find in "offline" life (what in an earlier age was called the "You Are There" effect). So, for instance, the stereotypical North American user goes back to the "safe, familiar" online equivalents of his offline television shows - the "news" sites owned by the same multinational corporations that own American media, and YouTube, which can be viewed as a worldwide online version of "America's 'Funniest' Home Videos": another vehicle for peddling the same tired corporate products in the commercials.

The other problem, of course, is that organizing all this "stuff" has become more difficult, and the rate at which it becomes more difficult is at least as rapid as the rate of growth itself. The Net, and the Web in particular, have enabled new ways to express individual personalities (e.g., MySpace) and allowed ordinary citizens of many countries to amass much more detailed information about what their government is doing, for them or to them (e.g., YouGov and Thomas). But if you don't know about YouGov or Thomas (or any similar site set up by your own country's government), then the old Bruce Springsteen song, "57 Channels (And Nothin' On)", seems quaint and manageable in comparison. People know there's all sorts of stuff out there - they can Google for it, "it must be real" - but, unable to come to grips with how things are organized (they aren't, on purpose) or how to use the available information to achieve a personally important goal, they fall back on the sites that organize and package and sanitize the content, accepting loss of control as the price of freedom from thinking too much. (E.g., AOL - a subsidiary of Time-Warner, and Fox "News".com, a wholly-owned subsidiary of AIPAC.) As people sink safely back into their easy chairs, content to absorb the anti-intellectual pablum that bombards them, they lose touch with the idea, let alone the possible reality, of an energized populace using the new, revolutionary technology at its disposal to improve their own lot in life and that of the world at large. Instead of a medium which challenges the status quo, the Net has devolved into a tool which reinforces it.

A collapse of the Internet? You're right, John; it will never happen. But a collapse of the promise and meaning of the Internet? It's already here, folks; we're just standing around watching streaming video of the rubble bouncing.