Sunday 19 February 2006

If you can't extend it, is it really an eXtensible HTML?

Arrrrrrrrrrrrrggggggghhhhhhhhhhhhhhh!!!!! So much for consistency....

As anybody who has worked with me in the last 3-4 years well knows, I have been an enthusiastic advocate of DocBook as a documentation markup vocabulary for various purposes, and by extension, XML-based tools for all manner of things (Apache Ant and so on).

One feature I use regularly in DocBook XML source documents is the internal subset, which lets you define entities, and include files that define entities, beyond those in the original DTD. So, for instance, my standard software.ent file has an entry of

<!ENTITY mswindows '<ulink url="http://www.apple.com/switch">Microsoft Windows</ulink>'>
This way, anywhere in a DocBook file that includes that entity definition, I can type &mswindows; and, when the document is transformed (into XHTML, PDF, RTF or whatever), the desired link and text will appear in place of the entity. This is an obvious lifesaver when you want to include, for instance, links to glossary definitions for unfamiliar terms scattered through a document.
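For anyone who hasn't seen this in action, here is a rough sketch of the mechanism using Python's standard-library XML parser. To keep it self-contained I've made some assumptions: the entity is declared inline in the internal subset rather than pulled in from software.ent, and the one-paragraph document, its wording and the `expand` helper are made up purely for illustration. It is not my actual toolchain, just a demonstration that a plain XML parser does the substitution while reading.

```python
import xml.etree.ElementTree as ET

# A tiny DocBook-style document (hypothetical content). The bracketed part of
# the DOCTYPE is the internal subset; the entity is declared inline here
# instead of being included from a separate file such as software.ent.
DOC = """<!DOCTYPE para [
  <!ENTITY mswindows '<ulink url="http://www.apple.com/switch">Microsoft Windows</ulink>'>
]>
<para>This tool also runs on &mswindows;, for what it's worth.</para>
"""


def expand(source: str) -> ET.Element:
    """Parse the document; the parser replaces &mswindows; as it reads."""
    return ET.fromstring(source)


para = expand(DOC)
link = para.find("ulink")  # the element that came out of the entity

# Serialise the parsed tree: the entity reference is gone and the
# <ulink> markup stands in its place.
print(ET.tostring(para, encoding="unicode"))
```

The point is that the entity reference never reaches whatever consumes the parsed tree; it sees only the expanded link.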

Fine. But the current state of XHTML (the XML-based successor to HTML) simply doesn't support it. It does not appear to be possible to have an XHTML document with an internal subset parsed correctly by any current major browser on Windows, Linux or the Macintosh. Various Google searches such as this one produce links to pages that say, with varying levels of emphasis and literal wording, "you can't use internal subsets for XHTML that is to be rendered by a browser". It seems that in the force-fit of XML to HTML that produced XHTML, the concept of different "streams", or purposes, for documents was introduced. XHTML that is to be rendered in a Web browser has one set of limitations (including no usable internal subset), whereas XHTML that conforms to the same definition documents but is to be processed as "pure XML" has another.

To say that this sucks is to use that colloquialism as an extreme understatement, akin to saying that tsunamis are wet. This limitation closes off an entire range of applications that would use dynamically generated XHTML as browser-viewable data in the same spirit as XML generally (without writing an otherwise redundant app to parse and reformat the data). The benefits of an XML-based browser markup language, and they are significant, are (in my view) seriously degraded by foolishness like this.

Of course, several of you are already thinking, I could just use XSLT instead. I had previously wondered why PHP and other Web scripting languages included support for XSLT processing. Now I know, I guess.

If anybody has any corrections or other good ideas, please let me know.