Web Magazine for Information Professionals

How the Oxford English Dictionary Went Online

Laura Elliott explains the use of SGML in the management of the OED text.

Ariadne has already described the long-term task of revising the Oxford English Dictionary and reviewed OED Online at its launch in March this year, but the editor judged, rightly, that there must be a hidden story on the making of the web site. This article sets out to tell that story, describing what was technically involved in turning a twenty-three volume print work into an online publication, and recording how this generation of publishers benefited from visionary groundwork undertaken fifteen years ago which meant that the hardest part of going online - preparing the content - was three-quarters done before they’d heard of the Web.

From hot metal to computer

OED Online is the latest fruit of longstanding and far-sighted efforts to make the Oxford English Dictionary (OED) "machine-readable", efforts that began in the mid-1980s for an entirely different motive. In the eighties, Oxford University Press (OUP) decided to publish a Second Edition of the OED which would integrate the original twelve-volume edition with its later supplement volumes. A consolidated text was necessary to allow future revision on an economic scale. The task was going to require computerization of the text and the whole editorial process. The first step was to get the text on to a computer: at that stage the original edition was still being printed with hot metal. Scanning was impossible given the quality of the print and the numerous unusual typefaces. So the text was keyed on to computer over eighteen months by a team of 150 typists in Florida. Those typists were simply asked to identify different typefaces as they entered the data, but back in Oxford, with the help of computer scientists from IBM and from the University of Waterloo in Ontario, Canada, the next imaginative move was made: the typefaces were converted automatically to codes identifying the various components of the Dictionary - including headwords, numbered sense definitions, etymologies, quotations, and cross-references. Using this information, it was then possible for computer programs to interpret the Dictionary well enough for the contents of the supplements to be integrated with the twelve original volumes. The results needed editorial correction, which for the first time in the history of the OED could be done using screen editing software, but the work involved was nothing in comparison with the alternative of re-editing the text on paper from scratch, and in this way the Second Edition was successfully published in 1989.

Generalized markup language and SGML

The computer scientists used "generalized markup language" for the codes identifying the various text components. The history of this way of dealing with text on computer will be familiar to some of you, but it may still be interesting to see how it’s illustrated by the history of publishing the Oxford English Dictionary, from the Second Edition in 1989 through to OED Online in 2000. To anyone familiar with HTML but perhaps not with its antecedents, the style of coding used since the mid-80s will still look familiar. The dictionary headword abbot would be coded <hw>abbot</hw> and a cross-reference to ABBEY, which OED prints in small capitals, would be coded <xr>abbey</xr>. OED also prints author names in small capitals, but these were coded to distinguish them from the typographically identical cross-references, so DICKENS would be coded <a>Dickens</a>. This work of the mid-80s illustrates rather well the idea behind "generalized markup language": the meaning of different text components, conveyed in print by different typefaces but also dependent on context understood by the reader, should be recorded unambiguously with codes easy for humans to interpret but also recognisable by a computer. Editors can read the resulting text on screen, but at the same time enough information is hidden in the text for a computer to read it accurately. There are other advantages to such a coding system. It removes the need for anything but standard ASCII characters, which makes the data much more transferable between different computer systems. In addition, it allows you to change very easily how you want to print or display text components in future, because typographical styles are not embedded in the text. OUP has seen all these advantages of "generalized markup" pay off handsomely over the last ten years of the publication history of the OED.
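To make the separation of meaning from typeface concrete, here is a minimal sketch in Python. Only the codes <hw>, <xr> and <a> come from the examples above; the style names, the function names and the sample sentence are illustrative assumptions, not a description of OUP's own software.

```python
import re

# Typeface for each structural code when printing. Only the codes <hw>, <xr>
# and <a> come from the article; the style names are assumptions.
PRINT_STYLES = {"hw": "bold", "xr": "small-caps", "a": "small-caps"}

# Matches one coded component such as <hw>abbot</hw>.
TAG_PATTERN = re.compile(r"<(hw|xr|a)>(.*?)</\1>")


def components(text):
    """Return (code, content) pairs for every marked-up component."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(text)]


def typefaces(text, styles):
    """Return (typeface, content) pairs, as a typesetter would need them."""
    return [(styles[code], content) for code, content in components(text)]


# Hypothetical sample text built from the three codes described above.
sample = "<hw>abbot</hw> see <xr>abbey</xr>; quoted by <a>Dickens</a>"

# A different presentation needs a new style table, not a new text.
WEB_STYLES = dict(PRINT_STYLES, xr="hyperlink")
```

In print the cross-reference and the author name share a typeface, yet components(sample) still tells them apart, and typefaces(sample, WEB_STYLES) restyles every cross-reference without touching the stored text.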

At the same time that the OED was first being put into "generalized markup language", an international standard was being developed for so-called Standard Generalized Markup Language (SGML). This standard laid down what was permissible in text coding and defined generalized rules (called the DTD or Document Type Definition) to express the structure of an SGML-coded text. This move allowed commercial software packages to be written for editing, validating and formatting text coded in generalized markup language, provided it followed the SGML standard. Projects such as the Text Encoding Initiative produced DTDs for various document types, and some of these DTDs became international standards in their own right.
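The kind of rule a DTD expresses can be suggested with a rough sketch. This is not a real DTD or an SGML validator: it is a simplified Python stand-in that checks only which elements may sit inside which, and the element names other than <hw> and <xr>, together with the rules themselves, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: the child elements each element may contain.
# A real DTD also governs order, repetition and attributes.
ALLOWED_CHILDREN = {
    "entry": {"hw", "etym", "def"},
    "def": {"xr"},
    "hw": set(),
    "etym": set(),
    "xr": set(),
}


def violations(element):
    """List every place where the markup breaks the content model."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(violations(child))
    return problems


# One well-structured entry and one with a cross-reference inside a headword.
good = ET.fromstring("<entry><hw>abbot</hw><def>head of an <xr>abbey</xr></def></entry>")
bad = ET.fromstring("<entry><hw>abbot<xr>abbey</xr></hw></entry>")
```

Here violations(good) is empty, while violations(bad) reports the misplaced cross-reference. Checks of this sort are what allow off-the-shelf software to validate a text it has never seen before.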

The OED did not follow the move towards standardization and SGML. The mood of the mid-80s was that a DTD should express the structure of a text with the accuracy and completeness of a chemical formula. This proved impossible for the OED given its unique content and long evolution of editorial style; even today OED is revised using its own variant of generalized markup language, which requires custom-built software tools. When OUP went out to tender in 1998 for the construction of OED Online, we realised that it would complicate the project considerably not to provide the text in a standard form, as this would force suppliers to grapple with the idiosyncratic legacy of OED’s composition. Therefore, we produced an alternative version of the OED in "proper" SGML, altering the markup for publication so that it concentrated on identifying the main features of the Dictionary which would be needed for online searching. The SGML markup clearly identifies Dictionary definitions, pronunciations, variant spellings, etymologies, quotations and their dates, titles and authors of cited works. The DTD rules are relatively simple, as, fortunately, views on the significance of the DTD have become less purist since the mid-80s. From the online OED’s DTD, a computer can interpret the text adequately for the needs of the web site. On the other hand, an archaeologist in the year 3000 couldn’t reconstruct every aspect of the electronic text of the OED from just the DTD and a set of tattered print volumes. We consider that to be a reasonable compromise, allowing us to publish OED online without having to resolve a long-standing debate in SGML philosophy.

Were there alternatives to SGML?

For those of you planning electronic text storage with a view to online publication, it might be useful for me to comment on options we chose not to take. First off, we never seriously considered supplying the text for online publication as HTML. We knew from ten years of editing the OED on screen how helpful it was to be able to search the text using the information on structure stored in the generalized markup language: for example, we had found it useful to search for a word just within etymologies or just within illustrative quotations. We wanted our online readers to have similar search facilities. All that useful information on text structure would be lost if we converted the text to HTML, because HTML is only good at storing information about format (bold, italic, etc) and about links to other web pages - OED’s authors and cross-references, quotations and definitions, would all have disappeared back into an undifferentiated blur of text distinguished only by typeface.
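A small sketch may show why the structural codes matter for searching. It uses Python's standard XML parser on a made-up, XML-style fragment; the element names <entry>, <etym> and <q>, the sample entries and the function names are all assumptions for illustration, not the OED's actual markup or search software.

```python
import re
import xml.etree.ElementTree as ET

# Made-up entries in an XML-style markup; only <hw> is taken from the article.
SAMPLE = """
<dictionary>
  <entry><hw>abbot</hw><etym>from Latin abbas, father</etym>
    <q>The abbot of the house rose early.</q></entry>
  <entry><hw>father</hw><etym>Old English faeder</etym>
    <q>His father was an abbot.</q></entry>
</dictionary>
""".strip()


def words(element):
    """Lower-cased words found anywhere inside an element."""
    return re.findall(r"\w+", "".join(element.itertext()).lower())


def search_within(root, component, word):
    """Headwords of entries where `word` occurs inside the named component."""
    hits = []
    for entry in root.iter("entry"):
        if any(word.lower() in words(part) for part in entry.iter(component)):
            hits.append(entry.findtext("hw"))
    return hits


root = ET.fromstring(SAMPLE)
```

Searching for "father" within etymologies finds only the entry for abbot, while the same word within quotations finds only the entry for father. Once the text is flattened to bold and italic, that distinction can no longer be asked for.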

Having dismissed HTML, we did have heated debate about databases. Some potential software developers assumed that the whole text should be held divided up within a database, to gain the advantages of re-sequencing and fast search retrieval that database software can provide. We were not convinced by the proposals, as our experience of OED text was that there was always a rump of data which would not fit easily into any given quasi-mathematical model, and that you could spend inordinate effort dealing with dictionary entries that obstinately refused to conform. We could see this tripping up developers and wasting our time. It may be that one day we will revisit the database question, but for the moment we have satisfactorily published online without restructuring the whole text as a database (though a database is used behind the scenes, as I’ll explain later).

Other potential software developers suggested that the OED text used in the web site should be coded in XML rather than SGML. For those unfamiliar with these distinctions, XML is in the same family as SGML and HTML, but a later development. It has the benefits of SGML in that it allows you to apply codes for text structure in ways appropriate for your particular data, but it allows for the demands of formatting web pages, which SGML was developed too early to accommodate. When we went out to tender for OED Online construction in 1998, the perfectly sensible proposals to use XML weren’t actually matched by track records of implementing web sites that used it. We went for a supplier who had a track record in using SGML. But for you in 2000, the situation might already be different, and XML should be seriously considered. XML remains a strong candidate for the next generation of OED editorial revision facilities and as the future basis for online publication.

Prototyping OED Online in libraries

Content preparation may be the overriding task in getting ready to publish online, but there are other major issues. To our surprise, user interface design ultimately took place largely within OUP, using the experience of lexicographical and technical staff. The consultants we’d expected to depend on for design were helpful, but the OED is exceptional in so many ways, and viable ideas for online presentation are so intimately tied to knowledge of OED’s content, that it proved more satisfactory for prototypes to be developed by insiders. The prototypes were used for pre-publication workshops with librarians, academic staff and students, to make sure that we were preparing a reference site that libraries would actually want.

Touring libraries with our prototypes certainly brought out some issues that we had got wrong and that we were able to put right in the published version. We were, on the whole, encouraged by librarians to keep the user interface simple and not to overload the site with features. Feedback from libraries where the web site is now in use will continue to guide us on further improvements. The workshops also answered some of our questions on the technical specification that should be met by OED Online. Libraries we visited had a mixture of hardware and web browsers, of varying ages. We had it emphasized to us that we should keep life simple for technical support services. We therefore set out to avoid the need to download any software or fonts on to the end-user’s computer, and succeeded in that aim. We also tried to ensure that we could support Netscape and Internet Explorer in old versions on Windows, Macintosh and UNIX environments: in the end we were able to support Netscape 3 and above, and Internet Explorer 4 and above, but we found that to support even older versions was impossible because it so severely compromised the search and display functionality we could provide.

Partnership with HighWire Press

 

Early in 1999, the supplier chosen by OUP to construct and host OED Online was Stanford University’s HighWire Press (www.highwire.org). HighWire is a leading not-for-profit aggregator of electronic-based academic journals, started in early 1995 with the online production of the Journal of Biological Chemistry. The online production company, which is now the leading aggregator of scholarly life science publications, is currently responsible for the production and upkeep of 190 sites online and over half a million articles. HighWire’s central technical approach to online publication is to convert all the content it receives to SGML, so an SGML version of OED was a good starting point for the technical partnership between HighWire and OUP.

HighWire’s systems architect Adam Elman confirmed at the end of the construction of OED Online just how helpful OUP’s work on the text structure had proved. "OED is very consistent in its SGML structure. That’s one of the aspects of working with them that really impressed us. We were working with good data to start with, and that’s really important. The generic tagging in the source files greatly simplified certain aspects of production."

OUP provided the Dictionary to HighWire as SGML source files. Over the course of 1999, HighWire built a new system in Java for OED Online. The system searches the Dictionary using Verity’s K2 search engine and produces pages via HighWire’s own SGML-to-HTML style-sheets and conversion routines. Each entry is preserved as a whole in SGML, but these whole entries are stored in a Sybase database that separately records each entry’s identity and its relationship to other entries. When the reader makes an enquiry, Verity searches its indexes on the SGML text, but the matching Dictionary entry is brought to the screen more quickly than it would be otherwise because the software can rapidly "check out" the entry from the Sybase database. Pages of background and marketing information are input by OUP using HighWire’s standard software for this purpose, first designed for journals publishers. It is helpful for OUP that as well as developing the web site, HighWire also hosts OED Online and oversees its maintenance.
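The division of labour described here, with a search index that finds entries and a database that hands back each entry whole, can be sketched roughly as follows. This is a simplified Python illustration and not HighWire's Java system: sqlite3 stands in for Sybase, a plain dictionary stands in for the Verity indexes, and the table layout, sample entries and function names are assumptions.

```python
import re
import sqlite3

# sqlite3 stands in for the Sybase database; the table layout is an assumption.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entry (id TEXT PRIMARY KEY, see_also TEXT, sgml TEXT)")
db.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [
        ("abbey", None, "<entry><hw>abbey</hw><def>a monastery</def></entry>"),
        ("abbot", "abbey",
         "<entry><hw>abbot</hw><def>head of an <xr>abbey</xr></def></entry>"),
    ],
)


def build_index(connection):
    """Map each word to the ids of entries containing it (stand-in for Verity)."""
    index = {}
    for entry_id, sgml in connection.execute("SELECT id, sgml FROM entry"):
        text = re.sub(r"<[^>]+>", " ", sgml)  # drop the markup, keep the words
        for word in re.findall(r"\w+", text.lower()):
            index.setdefault(word, set()).add(entry_id)
    return index


def check_out(connection, entry_id):
    """Fetch one whole entry, still in its original markup, by identity."""
    row = connection.execute(
        "SELECT sgml FROM entry WHERE id = ?", (entry_id,)
    ).fetchone()
    return row[0] if row else None


def lookup(connection, index, word):
    """Search the index, then check out every matching entry in id order."""
    return [check_out(connection, i) for i in sorted(index.get(word.lower(), ()))]


index = build_index(db)
```

The point of the design is that the entry text is never broken up to suit the database: the database records only identity and relationships, and the marked-up entry travels as a single unit.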

Technical challenges

The Verity search engine can interpret SGML, but a major technical hurdle was tuning this software to run at an acceptable speed given the volume of data in combination with the search behaviour that OUP required. Adam Elman summarizes the problem like this: "One of the issues we faced is that Verity is very well designed to find documents that match a particular search phrase. OUP had a very different idea of how searches should work. What OUP wanted was something that would count every instance of a word in the Dictionary, but show the results in the context of entries." What HighWire did was to chop the text up into small paragraph-level pieces, so that each quotation looks like a separate document to the search engine. Because of the size of the Verity search "collection" and the granularity of the pieces, performance became a cause for concern, and Adam Elman and his team had to enlist the aid of several Verity engineers to tweak the system until it ran fast enough to put online for a large volume of simultaneous users. A related challenge was highlighting the search terms in the results. HighWire had to write its own routines for highlighting search terms in the web pages for each Dictionary entry - this involved generating the new HTML on the fly from the SGML.
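The idea of chopping entries into small pieces, counting every instance and then reporting by entry can be illustrated with a short Python sketch. It is not HighWire's code and does not use Verity: the sample entries, the function names and the choice of <b> for highlighting are assumptions.

```python
import html
import re
from collections import Counter

# Made-up entries, each already divided into paragraph-level pieces.
ENTRIES = {
    "abbot": ["The head of an abbey.", "1387 The abbot and the abbot's men."],
    "monk": ["1450 A monk served the abbot."],
}


def chop(entries):
    """Turn each piece into its own (entry id, piece number, text) document."""
    return [
        (entry_id, number, text)
        for entry_id, pieces in entries.items()
        for number, text in enumerate(pieces)
    ]


def count_by_entry(pieces, word):
    """Count every instance of `word`, reported per entry rather than per piece."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counts = Counter()
    for entry_id, _, text in pieces:
        counts[entry_id] += len(pattern.findall(text))
    return {entry_id: n for entry_id, n in counts.items() if n}


def highlight(text, word):
    """Generate HTML for one piece with the search term marked in bold."""
    pattern = re.compile(rf"\b({re.escape(word)})\b", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", html.escape(text, quote=False))


pieces = chop(ENTRIES)
```

A document-oriented search would say only that two entries match "abbot"; counting over the pieces gives two instances in one entry and one in the other, while the entry id carried by each piece lets the results be shown in the context of entries.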

Special characters also posed a significant challenge. SGML and HTML both record special characters as "character entity reference" codes. But choosing the unique codes for the thousand special characters in OED only addresses a fraction of the technical problem. How are you going to display special characters like an Assyrian H or an astrological moon symbol that aren’t part of the standard HTML set for display on web pages? Some web sites offer downloadable fonts, but this was something we wanted to avoid in order to keep the site simple and unobtrusive for readers to use. HighWire were experienced at rendering these characters as inline GIF images. But the OED contains many more special characters than most journals, and HighWire found OUP very particular about presentation. "Most journals are concerned that it looks right and accurate. Oxford was concerned that it looked good as well," Adam Elman commented. For example, if an entry appears in a quotation, it appears in a smaller font in blue, but if it appears in a definition, it appears larger and in black. To handle all the variations, OUP commissioned 2,500 hand-drawn images to handle the display of a thousand or so special characters as they occurred in different contexts in the dictionary. Once the images were drawn, the quality of OED’s SGML smoothed the production process; HighWire could easily match the character entity reference codes to the corresponding GIFs using software they had developed for journals, but it took several weeks to display the images correctly inline with the rest of the text. It was worth the effort, as the results are excellent; to the untrained eye it appears as though the special characters are part of the text. For example, accented Greek is not available in HTML: in the entry for charism, you can see the inline GIFs provided for accented Greek blend seamlessly with the surrounding text.
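Matching entity codes to context-specific images is simple to sketch once the codes are consistent. The Python below is an illustration only, not HighWire's journal software: the entity name, the image file names, the two contexts and the function name are all assumptions.

```python
import re

# Hypothetical lookup table: (entity name, context) -> hand-drawn image file.
GLYPH_IMAGES = {
    ("moon", "definition"): "moon-def.gif",
    ("moon", "quotation"): "moon-quot.gif",
}

# Entities that browsers can display themselves are passed through untouched.
NATIVE_ENTITIES = {"amp", "lt", "gt", "eacute"}

# Matches a character entity reference such as &moon;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def render_entities(text, context):
    """Replace non-native entity references with inline images for a context."""

    def replace(match):
        name = match.group(1)
        if name in NATIVE_ENTITIES:
            return match.group(0)
        image = GLYPH_IMAGES.get((name, context))
        if image is None:
            return match.group(0)  # leave unknown codes visible for checking
        return f'<img src="{image}" alt="{name}">'

    return ENTITY.sub(replace, text)
```

The same code, &moon;, becomes a different image in a quotation than in a definition, which is why far more images than characters were needed.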

[Figure: screenshot of the 'charism' entry in OED]

 

The next challenge is to update the Dictionary reliably every three months. The next update will be on 15 June. Then we will start work on visible and behind-the-scenes improvements to the web site based on our experience of live publication since March and the feedback we’ve received from readers.

The technical costs of putting OED online

The software development work cost US$400,000. OUP spent roughly US$1 million more on market research and prototyping, graphic design work, consultancy and so on.

An issue which may be more relevant to you is that for all my advocacy of generalized markup language as an excellent basis for online publication, you may have heard that SGML and XML are expensive options which can only be afforded by well-funded or commercial enterprises. I can see where this view comes from. For example, SGML software suppliers and consultants have historically worked for government departments (largely defence departments), huge engineering contractors (famously Boeing, which put its technical documentation into SGML), large healthcare operations and global publishing corporations: these organizations could absorb charges out of the reach of many librarians and smaller (or more parsimonious) publishers. However, if you can see the value of marking up your content to a standard set of rules, with the advantages for web publication that I’ve described, then nowadays the PC-based software to allow you to do this (from suppliers such as SoftQuad and ArborText) is not ruinously expensive. In the ten years that I’ve been involved in publishing using SGML and similar coding systems, it has become understood that this need not be the preserve of highly technical experts. Freelance editors have gradually begun to branch into SGML and XML, coping without dozens of technical support staff. HTML has helped here: that’s something in which many of us dabble, whether "raw" or in the context of FrontPage or DreamWeaver. The skills needed to encode in SGML or XML are in the same league: the added ingredient necessary is real understanding of the structure of the text, but that is not usually all that complicated. You may have a great deal to gain by branching out into SGML or XML to record the underlying intellectual structure of your documents for online publication, something that HTML simply can’t do for you.

A free trial of OED Online?

If you want to organize a free trial of OED Online for your library, for the UK, and anywhere else outside North or South America, please contact Susanna Lob through worldinfo@oed.com. For the USA, Canada, and all countries in South America, contact Royalynn O’Connor through americasinfor@oed.com. Further information on the cost savings available through USA regional networks is available at: http://www.oup-usa.org/epub/oed/networks2.html. General information on subscriptions is available at http://www.oed.com/public/subscriptions

Author Details

 Laura Elliott
lelliott@oup.co.uk
Oxford English Dictionary Online