Web Magazine for Information Professionals

From the Trenches: HTML, Which Version?

In From the Trenches, a regular column which delves into the more technical aspects of networking and the World Wide Web, Jon Knight, programmer and a member of the ROADS team, takes a look at the causes of good and bad HTML and explains what tags we should be marking up Web pages with.

Most people concerned with Electronic Libraries have by now marked up a document in the HyperText Markup Language (HTML), even if it's only their home page. HTML provides an easy means of adding functionality such as distributed hyperlinking and the insertion of multimedia objects into documents. Done well, HTML provides access to information over a wide variety of platforms, using many different browsers, accessing servers via all manner of network connections. However, it is also possible to do HTML badly. Badly done HTML may tie a document down to a particular browser or hardware platform. It may make documents useless over slow network connections. As the Electronic Libraries programme is concerned with empowering people by giving them easy access to information via the Net, deciding what is and is not bad HTML, and then avoiding it, is something many librarians and library systems staff will currently be grappling with. This article aims to provide an informal overview of some of the issues surrounding good HTML markup and to highlight some resources that may help improve the markup used in Electronic Library services.

The versions available

Before looking at what may constitute good and bad HTML markup, let us first review the wide variety of HTML versions available. There are currently only two versions of HTML on the Internet standards track: HTML 2.0 and HTML 3.0. All other versions are bastardised, vendor specific extensions to one of these open, non-proprietary versions. There is a version of HTML prior to HTML 2.0 known, unsurprisingly, as HTML 1.0. It provides the basic hyperlinking and anchors that make HTML a hypertext markup language, plus some elements for highlighting text in a variety of ways. HTML 1.0 gives us the lowest common denominator of all the different versions: if you mark a document up to the HTML 1.0 specification, the chances are that more or less every browser will do something vaguely sensible with it and the information will reach the user intact. However, HTML 1.0 was an informal specification that never entered the Internet standards process and its use is somewhat deprecated today.
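
As a purely illustrative sketch (the URL and filenames are placeholders), a document relying only on this lowest common denominator need contain little more than headings, anchors and simple emphasis:

<H1>Library Services</H1>
<P>The <A HREF="http://www.your.site/opac.html">online catalogue</A> is
available to <EM>all</EM> registered readers.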

One problem with HTML 1.0 is that it only offers a way to present basic textual information to the user; the means of getting feedback from the user are very limited. HTML 2.0 helps to overcome this problem by providing the document author with the FORMs capability. The markup tags allow you to embed forms with text input boxes, check boxes, radio buttons and many of the other features that are common in user interfaces. These forms can be interspersed with tags from HTML 1.0 to provide additional functionality in a FORMs document, and also to give non-HTML 2.0 compliant browsers some form of access to the available data; however, such browsers are few and far between these days. HTML 2.0 is thus regarded by many as the base level of HTML to code to if you wish to reach the largest population of browsers and still have reasonable document presentation.
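
A sketch of what such a form looks like is given below; the ACTION URL and field names are invented for illustration, and a real form would point at a script on your own server:

<!-- ACTION URL and field names are illustrative only -->
<FORM METHOD="POST" ACTION="http://www.your.site/cgi-bin/comments">
<P>Your name: <INPUT TYPE="text" NAME="name" SIZE="40">
<P>Send me a reply: <INPUT TYPE="checkbox" NAME="reply" VALUE="yes">
<P>Preferred format:
<INPUT TYPE="radio" NAME="format" VALUE="html" CHECKED> HTML
<INPUT TYPE="radio" NAME="format" VALUE="text"> Plain text
<P><INPUT TYPE="submit" VALUE="Send comments">
</FORM>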

The latest version of HTML, HTML 3.0, is still under development. HTML 3.0 addresses the lack of detailed presentation control in the previous two versions with the introduction of style sheets and tables. The specification for HTML 3.0 also includes a mathematics markup that is very reminiscent of that provided by LaTeX. As HTML 3.0 is still under development, no browser can claim to be fully compliant with the standard, although many of the more recent browsers have added some of the core HTML 3.0 elements to their own HTML 2.0 base.
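
For instance, a simple table in the HTML 3.0 draft markup might look something like the following (the content is invented for illustration):

<TABLE BORDER>
<CAPTION>Opening hours</CAPTION>
<TR><TH>Day</TH><TH>Hours</TH></TR>
<TR><TD>Monday to Friday</TD><TD>9am to 9pm</TD></TR>
<TR><TD>Saturday</TD><TD>9am to 1pm</TD></TR>
</TABLE>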

Vendors also add their own proprietary tags to the core, standard HTML specifications. These tags are often presentation oriented or make use of a feature peculiar to that vendor's browser. The best known commercial browser is currently the Netscape Navigator, whose share of the total browser population has been estimated at anywhere between 50% and 90%. It adds many presentational tags that are widely used in documents purporting to be HTML. Reading one of these "Netscaped" documents on another browser can result in anything from a slight loss of visual attractiveness to a completely unreadable (and therefore unusable) document. Some document authors are so intent on using these Netscapisms that they even place a link to the Netscape distribution site on the Net so that those not blessed with Netscape can download it to view the author's documents. Things are only set to get worse with the entry of Microsoft and IBM into the fray.
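
A typical fragment of such markup, invented here purely for illustration, mixes proprietary presentational tags such as CENTER, FONT and BLINK in with the standard elements:

<CENTER>
<FONT SIZE=+2 COLOR="red"><BLINK>New this week!</BLINK></FONT>
</CENTER>

A browser that does not recognise these tags will usually still display the text, but any layout or emphasis that depends upon them is lost.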

Some of the problems experienced by users stem from the fact that browser authors add extra tags from one version of the HTML standard to a core from an earlier version, and make up their own proprietary elements besides. This is compounded by the fact that as the markup gets more complicated, the opportunity for bugs to creep into different browsers increases. The result is that we have browsers and documents that all claim to be HTML when in fact many of them are not. To make matters worse, many people don't specify which version of HTML a document is marked up in, or even validate their documents to check that they match one of the specifications (known as a Document Type Definition or DTD).

Many browsers are very tolerant of the markup that they receive, which in some ways is a good thing as it means that the end user is likely to see something even if the document's author has made a complete mess of the markup. This has probably helped contribute to the Web's rapid growth, as people perceive it to be relatively easy to add markup to documents and get working results. Unfortunately, the flip side is that we are left with a Web full of poorly marked up documents that conform to none of the standards, not even the vendor specific extensions.

HTML Markup In Electronic Libraries

Electronic Libraries are in the business of providing information to their patrons via the network. The version of HTML markup used will therefore depend upon what they are trying to achieve and who their patrons are. If a service is only to be made available on a single site and that site only uses a single vendor's browser, then of course the library is free to use whatever vendor specific HTML extensions it chooses. For example, if a service is only to be used within a site where all the users have Netscape Navigator v2.0, the library can make use of blinking text, multiple fonts and frames, knowing that all its users will see much the same thing that the author did.
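
Frames are a good example of just how browser specific such markup can be; the Netscape 2.0 syntax looks roughly like the sketch below (the filenames are invented), and a browser that does not understand FRAMESET will show only whatever is placed in the NOFRAMES section:

<!-- filenames are illustrative -->
<FRAMESET COLS="25%,75%">
<FRAME SRC="menu.html">
<FRAME SRC="welcome.html">
<NOFRAMES>
<P>This service is designed for use with a frames capable browser.
</NOFRAMES>
</FRAMESET>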

One point to note, however, is that if the documents are intended to be very long lived, the use of proprietary markup might make the eventual upgrade to the "next great browser" much harder than it would be if the documents were encoded using HTML 1.0 or 2.0. For documents with long life cycles (in computing terms, long is more than five years!) the library should really investigate the use of a more content oriented SGML markup such as TEI, and then generate documents conforming to a specific HTML version from that.
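
As a rough, simplified illustration of the idea, a content oriented markup records what a piece of text is rather than how it should appear; a TEI-style fragment might read something like:

<p>The library opens at <hi rend="italic">nine o'clock</hi> on weekdays;
see <ref target="hours">the opening hours section</ref> for details.</p>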

However, if the service being provided will be used by patrons with many different browsers, it may be worthwhile sacrificing browser specific bells and whistles in favour of a more generic markup using the standard HTML DTDs. Although the result may not look as "pretty" as one using a vendor's proprietary tags, the chances are that it also will not look a complete mess when viewed on another browser. HTML 2.0 contains enough functionality to cover most information provision situations. A library, after all, should be providing useful, high quality information resources to all comers, not trying to compete with ad agencies for "cool site of the week" awards.

There are some things that all sites should do, however. The first is to include a line at the top of every document they serve that specifies the DTD in use. This is rarely done, and even this author admits to having written a large number of documents with no indication of which version of HTML they conform to. To make this information easy for browsers to process there is a standard markup for it, which is actually part of the SGML mechanism upon which all the standard HTML versions are based. An example of this "DOCTYPE" line for an HTML 2.0 document is:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
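
The equivalent line for a document written against the HTML 3.0 draft DTD is commonly given as:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">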

The next thing that Electronic Library document authors can do to help raise the quality of HTML markup in use on the Web is to validate their documents against the appropriate DTDs. Originally this was a tedious and difficult thing to do, which might explain why it is rarely done. Today, however, there are a number of HTML editors available that will prevent the generation of invalid HTML, some browsers (such as Arena) indicate when they receive invalid markup, and there are also a number of online validation sites such as Halsoft's HTML Validation Service, with which this article has been checked.

The latter are particularly useful as they usually have a range of up to date DTDs for a variety of HTML versions and can be used without the need to buy or install any new software on your machines. The validation is done using HTML 2.0 FORMs into which either fragments of HTML can be entered for checking or URLs for entire documents can be specified. When you give such a service the URL of one of your documents, the program that processes the FORM will retrieve the document from your server, validate it against the requested DTD and return a list of any errors to you. One neat trick with the online validation services is that you can often insert a small piece of HTML markup at the end of all of your documents that mimics the action of the service's form, allowing you to validate a document quickly by just clicking on a "validate me" button at the end of the document. Having such a button present may also encourage your users to try validating your documents. This will help you spot accidental errors if you make a change that invalidates the HTML but forget to revalidate it, and it will also "spread the word" about the practice of validating HTML.
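
A sketch of such a "validate me" button is given below; the ACTION URL and field name are invented placeholders, and in practice you would copy the exact FORM markup published by whichever validation service you use:

<!-- placeholder URL and field name; copy the real form from your chosen service -->
<FORM METHOD="POST" ACTION="http://validator.some.site/check">
<INPUT TYPE="hidden" NAME="url" VALUE="http://www.your.site/thisdoc.html">
<INPUT TYPE="submit" VALUE="Validate this document">
</FORM>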

As well as generating valid HTML with an appropriate DTD, an Electronic Library service must also consider how its patrons will be accessing its documents. If they are all on campus, sitting at workstations and high end PCs with graphical browsers and high speed network links, then the inclusion of inlined images in documents will present little problem. However, if they are accessing your service over slow international or dial up links, inlined images can be a pain. Nothing is more annoying to a network user than finding that a potentially useful page is full of inlined images and little else. If a document is to be widely available on the Web, the number of inlined images should be kept to a minimum, and they should either be used purely for decoration or have their content replicated in textual links. This is because most graphical browsers allow the user to turn inlined images off, which many dialup users take advantage of, and it must also be remembered that there are still a large number of people using text based browsers such as lynx. If the majority of a document's information content is carried only in the inlined images, it will be lost to both these classes of user.
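
One simple courtesy, sketched below with illustrative filenames, is to give every inlined image an ALT attribute and to duplicate any important image based link as plain text:

<!-- filenames are illustrative -->
<A HREF="opac.html"><IMG SRC="opac-button.gif" ALT="[Search the catalogue]"></A>
<P>Text only users can <A HREF="opac.html">search the catalogue here</A>.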

Conclusions

HTML is a great way of providing useful functionality to end users and has helped push the lowest common denominator up a little from pure plain ASCII text in many situations. However, Electronic Library service providers must be aware that how they mark their documents up will affect their usability and thus usefulness to the end user. Proprietary vendor extensions are best avoided for widely used services, documents should include an indication of which HTML DTD they conform to and some form of validation should be performed. Public services should also avoid heavy use of inlined images to carry information content as it alienates users on slow links and non-graphical browsers.

If services take some of these simple approaches to marking up documents in HTML for delivery via the Web, we will have fewer users complaining about unreadable documents or slow links. Electronic Libraries have the opportunity to become showcases of good HTML markup and high quality information provision. Let's not miss that chance.


[Ed: this article has been validated at HTML 2, which is why it looks marginally different in style to other articles in Ariadne.]

Author details

Jon Knight
(J.P.Knight@lut.ac.uk),
Dept. of Computer Studies,
Loughborough University of Technology.