Greytower Technologies


Content-typing XHTML

Introduction

The content of documents on the WWW is important - not only in the traditional sense, but also because a user agent (UA) needs to understand the content from a technical point of view.

It is a common belief that the file extension - typically the last three letters of a filename or URI - is used to identify the content of the file or resource. This practice was used by many older systems, but is not applicable on the web. There, the type of content is identified as laid down in the HTTP specification, specifically by the Content-Type header.

The single most used such content type is text/html. This value lets a web browser or other user-agent know that the content following is HTML of some sort, and that the browser should attempt to present it to the user in a way she can handle.
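To make the point concrete, here is a minimal sketch of how a client might read the media type from the header rather than from the URI. It uses Python's standard email.message module to do the parsing; the function name parse_content_type and the sample header value are my own illustrations, not taken from any browser.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() lowercases the type; get_content_charset()
    # lowercases the charset and returns None when it is absent.
    return msg.get_content_type(), msg.get_content_charset()


# The resource might live at /index.php or /page.html - the header decides.
media_type, charset = parse_content_type("application/xhtml+xml; charset=ISO-8859-1")
print(media_type, charset)
```

Note that the file extension never enters into it: the same function gives the same answer whatever the resource is called.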

XHTML is quite a different beast.

Please note that for the rest of this document the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119.

XHTML 1.0

XHTML 1.0 can be regarded as a transitional stage between HTML 4.01 and XHTML 1.1.

Since the 1.0 version is designed to be compatible with HTML 4.01, a content type of text/html MAY 1 be used to identify the content - but only if the XHTML document is written according to the HTML Compatibility Guidelines.

For all other documents claiming to be XHTML 1.0, a content-type of application/xhtml+xml SHOULD be used, and text/html SHOULD NOT.

XHTML 1.1

For the 1.1 version of XHTML the specifications are clear: text/html SHOULD NOT be used. The XHTML "native" content type of application/xhtml+xml SHOULD be used, whilst the generic XML content type of application/xml MAY be used.

This soup of content types leaves us with a fairly clear course of action for 1.1: if we use XHTML 1.1 2, we should serve application/xhtml+xml as the content type.
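The rules above can be boiled down to a small lookup. The following is only a sketch of my reading of them: the function name and the version strings are hypothetical, and only the media types actually named in this document are listed.

```python
XHTML_TYPE = "application/xhtml+xml"


def allowed_media_types(version, follows_compat_guidelines=False):
    """Return the media types discussed above, most preferred first.

    version is "1.0" or "1.1" (hypothetical labels for this sketch).
    follows_compat_guidelines only matters for XHTML 1.0.
    """
    if version == "1.0":
        types = [XHTML_TYPE]
        if follows_compat_guidelines:
            # Only HTML-compatible XHTML 1.0 may travel as text/html.
            types.append("text/html")
        return types
    if version == "1.1":
        # The native type is preferred; generic XML is permitted.
        return [XHTML_TYPE, "application/xml"]
    raise ValueError("unknown XHTML version: %r" % (version,))


print(allowed_media_types("1.0", follows_compat_guidelines=True))
print(allowed_media_types("1.1"))
```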

The horror! The Horror! 3

You've guessed it. Serving up a document containing what is to all intents and purposes XHTML as text/html simply means that browsers will jump into error-correcting mode and deal with it as best they can. Serving the same document as application/xhtml+xml will, in most cases, present the user with a download dialogue of some sort.

This is undesirable. Sadly there is no way out offered by the specifications, so we'll have to roll our own.

mod_rewrite trickery

Running modern versions of the Apache web server gives you the possibility of voodoo - the mod_rewrite type. What we want to do is make sure that only those browsers that can handle XHTML documents get the content-type identifying it as such.

HTTP gives us the ability to do this. Most user-agents will 4 send the Accept 5 request header, which

"... can be used to specify certain media types which are acceptable for the response. Accept headers can be used to indicate that the request is specifically limited to a small set of desired types, as in the case of a request for an in-line image."

This is called content negotiation and is exactly what we want. User-agents which believe, correctly or incorrectly, that they can handle XHTML should include the string application/xhtml+xml in their list of accepted content-types.
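As a rough illustration of what "including the string" means in practice, here is a sketch that splits an Accept header into media ranges and their q values. The function names are my own, the sample header is illustrative rather than captured from a real browser, and treating a malformed q as zero is an assumption made for this sketch.

```python
def parse_accept(header):
    """Return a dict mapping each media range in an Accept header to its q value."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # HTTP: a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # assumption: a malformed q counts as "not acceptable"
        prefs[media_range] = q
    return prefs


def accepts_xhtml(header):
    """True if application/xhtml+xml is listed explicitly with a q above zero."""
    return parse_accept(header).get("application/xhtml+xml", 0.0) > 0.0


sample = "application/xhtml+xml, text/html;q=0.9, */*;q=0.5"
print(parse_accept(sample))
print(accepts_xhtml(sample))
```

Wildcards such as */* are deliberately not counted as support for XHTML here, which matches the cautious approach taken in this document.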

If they do, we can use this fact both to comply with the specification and to avoid making a mess of other browsers' handling of our pages by invoking the following magic:

RewriteEngine On
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0
RewriteCond %{REQUEST_URI} \.html$
RewriteCond %{THE_REQUEST} HTTP/1\.1
RewriteRule .* - "[T=application/xhtml+xml; charset=ISO-8859-1]"
   

The above could be placed in the Apache main configuration file, in a separate file included by the main configuration, or even in a .htaccess file.

Written in more humane language, the rule works as follows:

Turn the rewrite engine on;
If HTTP_ACCEPT contains the string "application/xhtml+xml" AND
 HTTP_ACCEPT does not match the pattern "application/xhtml\+xml\s*;\s*q=0" AND
  REQUEST_URI ends in ".html" AND
   THE_REQUEST is an HTTP/1.1 request THEN
    change the content-type sent to "application/xhtml+xml; charset=ISO-8859-1"
   

Please note that, apart from the crude test for q=0, the above algorithm does not take the q parameter into consideration - and it really, really should.
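To see where this bites, the two HTTP_ACCEPT conditions can be mimicked with Python's re module. This is only a sketch: it assumes the patterns behave the same in Python as in Apache's regex engine, it leaves out the REQUEST_URI and THE_REQUEST conditions, and the header values are made up for illustration.

```python
import re

# Python copies of the two Accept-related RewriteCond patterns.
HAS_XHTML = re.compile(r"application/xhtml\+xml")
XHTML_Q_ZERO = re.compile(r"application/xhtml\+xml\s*;\s*q=0")


def rule_would_send_xhtml(accept_header):
    """Mimic the two HTTP_ACCEPT conditions: unanchored matches, ANDed together."""
    listed = HAS_XHTML.search(accept_header) is not None
    rejected = XHTML_Q_ZERO.search(accept_header) is not None
    return listed and not rejected


# Illustrative header values, not captured from real browsers.
for header in (
    "application/xhtml+xml, text/html;q=0.9",
    "text/html, application/xhtml+xml;q=0",
    "application/xhtml+xml;q=0.9, text/html;q=0.5",
):
    print(header, "->", rule_would_send_xhtml(header))
```

The third header is the interesting one: the pattern "q=0" also matches the start of "q=0.9", so a user-agent that prefers XHTML but gives it a q below 1 would still be handed text/html. That errs on the side of caution, but it is not real negotiation.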

I know, I know ...

This, as I am painfully aware, does not solve the problem that any XHTML 1.1 content is served up as text/html to browsers that don't happen to understand application/xhtml+xml. Technically speaking this is in grave violation of the specification.

Ignoring, for a moment, that most things these days violate one specification or another, methods exist to solve the problem by alternative means. Content could be stored in the XHTML format on the server and - since XHTML is just another XML-based language - converted to plain HTML on the fly. With heavy caching this might not even be painful.
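For the curious, such a conversion might look something like the sketch below. It is not what this site does; it simply uses Python's standard xml.etree.ElementTree module to parse the XHTML, strip the XHTML namespace and write the tree back out with HTML serialisation rules. The function name is hypothetical, and documents relying on a DOCTYPE or on named entities such as &nbsp; would need more care than this.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_html(xhtml_source):
    """Parse an XHTML string and serialise it as plain HTML."""
    root = ET.fromstring(xhtml_source)
    for element in root.iter():
        # Drop the namespace so tags come out as <p>, not as prefixed names.
        if isinstance(element.tag, str) and element.tag.startswith(XHTML_NS):
            element.tag = element.tag[len(XHTML_NS):]
    # method="html" writes empty elements such as br without a closing slash.
    return ET.tostring(root, encoding="unicode", method="html")


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Test</title></head>"
    "<body><p>Hello<br/>world</p></body></html>"
)
print(xhtml_to_html(sample))
```

The result would then go out as text/html to the user-agents that did not ask for anything better.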

The method outlined in this document, however, really does no harm. It will get the job done without too much pain, and saves the author from embedding each page in a server-side script that inspects the Accept header and decides what to send.

But first and foremost: it does no harm.

References

XHTML Media Types
http://www.w3.org/TR/xhtml-media-types/
The HyperText Transfer Protocol
http://www.w3.org/Protocols/rfc2616/rfc2616.html
The mod_rewrite reference documentation
http://httpd.apache.org/docs/mod/mod_rewrite.html
Key words for use in RFCs to Indicate Requirement Levels
http://www.rfc-editor.org/rfc/rfc2119.txt

1 For a definition of the word MAY in this context, please refer to RFC 2119.

2 ... and I did.

3 From Joseph Conrad's Heart of Darkness via Francis Ford Coppola's Apocalypse Now and finally to Genndy Tartakovsky's Dexter's Laboratory, illustrating how good quotes bubble up the tree of culture until they fall off and bash their heads in on a rock.

4 ... in a perfect world.

5 http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.1
