Tuesday, 8 May 2007

XHTML Is (Nearly) Useless

EDITED 2010/11/02: If you saw this post as a single massive block "paragraph," my apologies. The definition of "what is a paragraph?" changed after this was originally posted, and it had scrolled far enough back that I didn't notice the carnage. Let me know if there are still any problems. Thanks.


If you've written any Web pages in the last five years (at least), you've at some point bumped into the difference (schism?) between "original" HTML and "new, improved - now based on XML!" XHTML. If you don't write Web "content" (thanks for reading my blog, but why are you here?), or deal professionally with those who do, you may not know the difference, or care that there is a difference. There is, and people should care about it if they care about the Web.

(Briefly, for those who care but don't know; the rest of you can skip this and the next paragraph.) HTML is often known to developers as "tag soup", because very, very many sites don't follow the strict interpretation of the standard, and are "broken" in all sorts of ways. This was initially justified as working around the myriad bugs in grossly defective browsers such as Microsoft Internet Explorer. XHTML was different and better because it was HTML reformulated as XML, which could then be "validated" (checked) by any validating XML parser. HTML-as-XML also (should have) driven the development and use of all sorts of nifty techniques and tools that are only practical when assumptions can safely be made about the structure and format of the document - which would be true in XML/XHTML but not necessarily in "classic" HTML.

The problem, of course, is Microsoft's Internet Explorer browser, affectionately known to Web professionals as "Internet Exploder". Among the many "quirks" (defects) that has unknowingly afflicted usees of that browser, all versions up to and including the current Version 7 fail to understand XHTML as XHTML. The "conversation" that takes place between a browser and a server when the browser requests a Web page is defined by the open standard known as HyperText Transfer Protocol, or HTTP. Part of that conversation involves the server informing the browser what type of data it will be sending. This is done using what is called a "content type" "header".

All together now? Good. When a server wants to send a browser a page of "tag soup" HTML, the correct content type is "text/html". A properly-formatted and -served XHTML page will instead use "application/xhtml+xml". This will inform the browser that, in fact, the page being transferred is a proper XHTML page (per the open standard defining it), so the browser will kick in the assumptions and processing that works for XHTML but not for "tag soup".

Of course, Internet Explorer is now the only major browser that gets this wrong (as indicated by this vintage-2005 blog entry). As far as I am aware, every other major graphical browser in the world - Firefox, Opera, Konqueror, Galeon and the rest - all support The Right Thing. Unfortunately, IE is still the 300-pound gorilla in the china shop; the majority of Windows usees still browse the Web using IE, and though the trend is improving steadily, that will likely continue to be true for the next couple of years (say, 2009-2010 barring unforeseen circumstances).

What kinds of things would a properly XHTML 1.0-compliant browser let us do with our site? One trivial example: let's say you're writing a political-commentary site that is geared towards an upcoming election, and you want to consistently name your candidate as "The Honorable Senator Francis X. Snort (email senator@senatorsnort.org)". When your guy goes down to defeat (one too many campaign-finance scandals, mayhap) you want to change the blurb to "The Honorable former Senator F. X. Snort (email snort@somefreemail.com)". Trivial to do with whatever CMS or scripting system you're using, right? But by using an XML entity, you can simply say "&snort;" in your document, and an entity declaration in your document's header will tell the parser what you really mean. Change the declaration, and every instance of that entity expands to your new meaning. People who use other XML-based markup systems, such as DocBook, have been using this technique for years. Using XML entities in pages shown in correct (non-IE) browsers will do exactly what you tell it to. In IE, or, to be fair, several text-based browsers, the entity name will be displayed exactly as it is in the document - in our case, as &snort;. This is unlikely to have the desired effects on the folks "back home" for the Senator.

Web developers have, as I mentioned, several well-known workarounds for this type of thing, using their authoring tools rather than the document itself. It is, however, a reasonably easy example for people to understand. Given the increasing popularity of systems such as PHP Smarty that let you use large chunks of "raw" (X)HTML along with the scripting goodies, it would come in handy too.

So how does all of this make XHTML "nearly useless?" Because most developers developing pages for the general public (as opposed to corporate intranets), knowing that Microsoft IE doesn't support the correct content type, will either "not bother" developing "correct" XHTML or at best will serve it to all comers as "tag soup" HTML.

This also has the "benefit" of completely stifling further innovation (as far as the end user is concerned) based on XHTML. All of the comments I've made so far are only germane to the initial version of XHTML, designated 1.0. The newer versions, XHTML 1.1 and XHTML 2.0, provide new features and support new technologies that greatly expand the usefulness of the Web - or would, if Microsoft weren't, as usual, dragging the Web down for competitive lock-in purposes. By doing everything in their considerable power to ensure that IE browsers and sites aren't fully, completely interoperable with other browsers, they discourage Windows usees from using "rival" browsers to browse sites labeled "Best viewed with Microsoft Internet Explorer". There's nothing preventing Web designers from writing standards-compliant sites that also work well with IE; in a well-designed site, it's not particularly onerous to support both standards and Microsoft. If you're using Microsoft tools, of course, it will take quite a bit more work and knowledge to create valid sites. It can be done - several sites and mailing lists describe the techniques and mind-set required - but Microsoft do not go out of their way to make it easy to do so.

Of course, this also applies only to the public Internet. If you're fortunate enough to be developing "real Web apps" for your company's intranet, and your company understands the value of open standards, then you're not going to be subjugating yourself to IE and none of this really applies to you. Go enjoy all the things that new tech lets you implement that can really stomp on your non-standards-using competition!

For the rest of us, until the Web gets out of this proprietary funk it's in now, and IE either falls into a long-deserved oblivion (improving Windows security dramatically, but that's another post) or actually complying with the same standards every other serious browser in the world does, then we're going to have problems. One of the more annoying and frustrating ones, as we've discussed, is that XHTML is (nearly) useless." So much for innovation.

No comments: