Every day, subscribers to the NewJour mailing list receive notification of new Internet-available electronic serials. The NewJour definition of a serial covers everything from journals to magazines and newsletters; from the British Accounting Review to Ariadne, to The (virtual) Baguette and I Love My Nanny. Some days, a dozen or more publications are announced. As of 13th February 1997, the NewJour archive contained 3,240 items.
Most of these electronic serials, or e-serials, along with most other electronic publications currently available on the World Wide Web, are stored and represented using one or more of a relatively limited number of document formats. In this article, I'll describe these formats and the formats that are likely to be prominent in the near future. I'll give examples from the e-serials field but most of my comments are also applicable to other online publications such as books, dictionaries and manuals.
I'll start by describing the simplest and until recently most common formats: ASCII and bitmaps. I'll then describe the format that has come to epitomise Web publishing: HTML. This will be followed by a description of its parent, SGML, and then LaTeX, PostScript, PDF and some of the formats used in the multimedia components of e-serials. I won't describe proprietary word processing formats, such as WordPerfect and Microsoft Word, as these are rarely used for widely distributed e-serials and are 'not encouraged' in the eLib Standards Guidelines.
Early e-journals, newsletters and the like tended to use the lowest common denominator formats of ASCII or bitmapped page images. More sophisticated formats couldn't be supported by the majority of users.
The term ASCII, which stands for American Standard Code for Information Interchange, is often used as a synonym for 'plain text file'. ASCII defines just 128 codes: the unaccented English letters, the digits, the more common keyboard characters and a set of control codes. Chemical and mathematical formulae and languages other than English can't be adequately represented.
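The 128-code limit is easy to see in practice. The following is a minimal Python sketch; the function name and the sample strings are my own, chosen only for illustration. It simply asks whether each string can be encoded as ASCII.

```python
def is_ascii_representable(text: str) -> bool:
    """Return True if every character in text has an ASCII code (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples: a journal title, a formula written without subscripts,
# a word with an accented letter and a mathematical expression.
samples = ["British Accounting Review", "H2O", "café", "π r²"]

for sample in samples:
    print(repr(sample), is_ascii_representable(sample))
```

The first two samples pass; the accented letter and the mathematical symbols do not, which is why formulae in ASCII-only serials had to be spelled out or approximated.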
Figure 1: ASCII-based title 'page' of EJournal
An example of an ASCII-based e-serial was the periodical EJournal, as shown in Figure 1. Issues were regularly emailed to subscribers until April 1996. In August 1996, 'EJournal [went] hypertext'. Unfortunately, this 'first issue explicitly designed to take full advantage of hypertext and the World Wide Web' appears to be its most recent.
Although many potential journal readers still don't have access to the Web, new ASCII-based e-journals available via email, FTP (File Transfer Protocol) or Gopher (a document retrieval system) have virtually dried up. Emailed newsletters in ASCII are still common; the Daily Brief, for example, is a U.S.-based 2-3 page news summary sent out by email every weekday morning. But for more sophisticated offerings (such as Ariadne), email is now often just used for notification, the publication being available at a Web site. The next step may be a mixed email-Web model; services such as Netscape's InBox Direct could enable the emailing of HTML-based tables of contents which could point to articles held on the serial's Web site.
Bitmaps, also referred to as raster formats, are graphical images stored, not as individual characters, as in ASCII, but as patterns of individual pixels or dots. The graphical components of e-serials are generally stored this way, using bitmap formats such as GIF, PNG and TIFF. These formats are described in Section 7.1. Text is also sometimes stored using bitmaps, as in ADONIS and Elsevier's TULIP project. In the latter, the articles from over eighty Materials Science journals were made available in bitmapped (TIFF) format. ASCII versions of the articles were also provided. The combination of ASCII and bitmaps used in TULIP enabled full-text searching using ASCII and allowed articles to be viewed in the same visual format as in the paper version using the bitmaps. Figure 2 shows a sample bitmapped page presented using the TULIP Journal Browser. Bitmaps are far from ideal; the bitmapped text is unsearchable, and the files can be large and are not always of high quality.
A successor program, EES (Elsevier Electronic Subscriptions) , announced in February 1995, aims to offer libraries electronic subscription to all Elsevier Science titles. As with TULIP, EES provides a TIFF bitmapped image and a corresponding ASCII text file for each page. It seems a strange decision to use these formats when, even in February 1995, more advanced formats were available.
Figure 2: Bit-mapped page viewed using the TULIP Journal Browser
But even where more sophisticated formats are used, bitmaps may still have a limited role in representing text in the short-term. HTML can't represent many mathematical and chemical symbols so bitmaps (for example TIFFs or the more compact GIFs) are used whenever such special characters appear in an HTML page. Advances in HTML and browser technology should mean that this solution won't be needed in the longer term.
The majority of new networked e-journals and many other electronic serials are delivered via the Web and are currently based on an HTML (HyperText Markup Language) backbone linking articles either in HTML or a different format. This second format may be PDF, PostScript, LaTeX, SGML or even bitmaps.
Figure 3: HTML-based table of contents from the IDEAL system, viewed using Netscape
Figure 4: HTML-based article abstract from the IDEAL system, viewed using Netscape
Figure 5: PDF-based article from the IDEAL system, viewed using Acrobat Reader
The obvious example of an HTML backbone with HTML articles is Ariadne . Academic Press's IDEAL system (International Digital Electronic Access Library) , on the other hand, includes journal articles in PDF. The tables of contents and article abstracts of 175 Academic Press journals are freely available in HTML. In addition, authorised users can view, download and print journal articles in PDF. Figure 3 shows an HTML table of contents. Clicking on the first article title displays the abstract shown in Figure 4 and clicking on the 'Full Article' link below the abstract displays the PDF article in Figure 5.
HTML documents can include hypertext links to other documents. 'Documents' accessed via Web browsers can be almost any form of information, from a text-file to the result of a database query. So documents don't have to exist as files; they can be 'virtual', generated in response to a user query. In the IDEAL system, for example, all HTML pages, such as those in Figures 3 and 4, are dynamically generated.
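To make the idea of a 'virtual' document concrete, here is a minimal Python sketch of a page built on request rather than read from a file. Everything in it is an assumption made for illustration: the article records, the URL pattern and the function name are hypothetical, and it doesn't describe how the IDEAL system is actually implemented.

```python
from html import escape

# Hypothetical article records; a real system would fetch these from a database.
ARTICLES = [
    {"id": "a1", "title": "Formats for e-serials", "author": "J. Smith"},
    {"id": "a2", "title": "SGML & the Web", "author": "A. Jones"},
]


def render_contents(articles, query=""):
    """Build an HTML table of contents for articles whose title matches query."""
    matches = [a for a in articles if query.lower() in a["title"].lower()]
    items = "\n".join(
        '<li><a href="/abstract/{0}">{1}</a> ({2})</li>'.format(
            escape(a["id"]), escape(a["title"]), escape(a["author"])
        )
        for a in matches
    )
    return "<html><body><h1>Contents</h1>\n<ul>\n{0}\n</ul></body></html>".format(items)


# The page exists only as the result of this call; no file is stored anywhere.
page = render_contents(ARTICLES, query="sgml")
print(page)
```

The same function serves every query, so the table of contents always reflects whatever records are held at the moment of the request.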
Web browsers, for example Netscape Navigator and Microsoft Internet Explorer (IE), have a built-in capability to interpret some file formats other than HTML. Netscape, for example, can read various graphics formats including GIF and JPEG. To read other formats, helper applications or plug-ins may be necessary. These software packages perform tasks such as displaying still images, playing sound and video and uncompressing files. They're often available free of charge or as shareware on the Internet. The terms helper application and plug-in are sometimes used interchangeably but can be distinguished by the fact that plug-ins are more closely integrated with the browser. Helper applications, on the other hand, generally run files that can't be directly displayed inside the browser window but need a separate pop-up window.
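The choice between built-in display, plug-in and helper application can be pictured as a lookup table. The Python sketch below is a simplification under stated assumptions: the table contents and the function name are hypothetical, and a real browser normally decides on the basis of the content type reported by the server rather than the file extension alone.

```python
import os

# Hypothetical configuration, loosely modelled on a browser's list of helpers:
# file extension -> (MIME type, how the browser deals with it).
HANDLERS = {
    ".html": ("text/html", "built-in"),
    ".gif": ("image/gif", "built-in"),
    ".jpeg": ("image/jpeg", "built-in"),
    ".pdf": ("application/pdf", "plug-in"),
    ".ps": ("application/postscript", "helper application"),
}


def choose_handler(filename):
    """Return (MIME type, handling) for a file; unknown types are saved to disk."""
    extension = os.path.splitext(filename)[1].lower()
    return HANDLERS.get(extension, ("application/octet-stream", "save to disk"))


for name in ["contents.html", "figure2.GIF", "article.pdf", "paper.ps", "data.xyz"]:
    print(name, choose_handler(name))
```

Adding support for a new format then amounts to adding one more entry to the table, which is essentially what installing a helper application or plug-in does.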
New features are being added to HTML all the time. Unfortunately, some of them aren't standard, in that they're not ratified by the World Wide Web Consortium (W3C) . So, for example, non-standard tags to describe multimedia components have been introduced by the IE browser  and frames, which are still under discussion by W3C, are already available for Netscape . The most notorious non-standard extension must be the Netscape blink tag which causes text to flash on and off to the annoyance of some Net users. The eLib Standards Guidelines  which provide recommendations for standards in eLib projects, state that such 'vendor-specific extensions' are 'deprecated'.
The latest version of HTML to have been ratified by the W3C is version 3.2. But it's not always wise to base Web sites on the latest standards as many user sites will lag behind in their ability to read them. As of February 1997, the Web design company WebMedia often create sites using HTML 2.0 plus tables to ensure wide accessibility .
Despite support for cute features such as animated GIFs and active maps, HTML's functionality is still fairly basic in areas of importance to more serious publishers. For example, as mentioned in Section 1, many special characters and symbols still need to be represented using bitmaps. Support for mathematics, originally to have been implemented in HTML 3.0, was obviously not at the top of the priority list for the ever-more commercially-oriented Web community. This is one reason why some journal publishers prefer PDF. Another reason is the lack of author control over HTML article layout.
HTML is a simple application of SGML, or Standard Generalised Markup Language, which, in turn, is a language for describing the logical structure and meaning of documents . SGML doesn't describe documents' visual appearance. For example, it could be used to identify the title, author and section headings of an article but it wouldn't say anything about where to display these components on the screen or what fonts to use.
Figure 6: SGML-based test article from JEP, viewed using SoftQuad Panorama
SGML needs an application to convert it to a suitable viewing format. This application could be an SGML viewer, for example, SoftQuad's Panorama for Netscape  as shown in Figure 6. (SoftQuad have recently announced a 'Panorama-like' plug-in for Internet Explorer .) Alternatively, SGML is often converted to another format, such as HTML, for display. For example, in the eLib CLIC project, involving the parallel paper and online production of the chemistry journal Chemical Communications, SGML files are converted on-the-fly to HTML for display .
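The separation between logical markup and its display can be sketched in a few lines of Python. This is only an illustration under several assumptions: the element names are invented and don't follow ISO 12083 or any other published DTD, the fragment is written as well-formed XML so that the standard library can parse it, and the mapping to HTML tags is my own rather than the one used by CLIC.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical logically marked-up article: it says what each part is,
# not how it should look.
SOURCE = """<article>
  <title>Document formats</title>
  <author>A. N. Author</author>
  <section><heading>Introduction</heading><para>Text of the first section.</para></section>
</article>"""

# Presentation decisions live here, outside the document itself.
HTML_TAG_FOR = {"title": "h1", "author": "address", "heading": "h2", "para": "p"}


def to_html(source):
    """Convert the logical markup to HTML, visiting elements in document order."""
    root = ET.fromstring(source)
    lines = []
    for element in root.iter():
        tag = HTML_TAG_FOR.get(element.tag)
        if tag is not None:
            text = escape((element.text or "").strip())
            lines.append("<{0}>{1}</{0}>".format(tag, text))
    return "\n".join(lines)


print(to_html(SOURCE))
```

Changing the look of every article means editing the mapping table once; the stored documents themselves are left untouched.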
SGML has been around since 1986. HTML Version 1 was SGML-like but didn't conform to the SGML standard; the latest standard, Version 3.2, is fully conformant. An SGML Document Type Definition, or DTD, defines the rules for marking up a type of document, be it a journal article, a newsletter, a manual and so on. Each SGML-marked-up document must contain, or refer to, the relevant DTD. HTML 3.2, for example, being an SGML application, is defined by a DTD. If you're reading this article online, check the source code. (From Netscape, choose View then Document Source Code. From IE, click the right mouse button and choose View Source.) You'll see that the first line is something like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
This is a reference to the HTML 3.2 DTD. It says to the system reading the file, in this case a browser, 'If you want to know how to interpret the markup in this file, look at the HTML 3.2 DTD.'
3.1: SGML for Journals
SGML development is costly; with some exceptions, its development tools are expensive. The editor of Ariadne was unlikely to have chosen SGML when HTML was adequate for the magazine's needs. But for bigger and more complex projects, such as the development and storage of a large number of journals, SGML is increasingly being seen as at least part of the answer. The move towards the use of SGML in the journals field has increased in pace over the last 2 years. Currently, its main use is in the representation of bibliographic information about articles. There are several 'standard' DTDs that describe this 'header' information:
An increasing number of publishers are considering using SGML for the full-text of articles. Although some use ISO 12083 as a starting point for their own DTDs, few use the standard as is. In the long term, journal articles may be stored in SGML and delivered either in a second format or in SGML itself using an SGML viewer.
The latter approach was tested by the University of Michigan Press which made available, on the Web, SGML test versions of articles from JEP, the Journal of Electronic Publishing, using the ISO 12083 DTD. An example is shown in Figure 6. The articles have since been withdrawn but another of the Press's publications, the Bryn Mawr Reviews, is now in the process of migrating to SGML. Test examples for this aren't yet publicly available.
Whether the option of using an SGML viewer becomes popular will depend on how SGML on the Web evolves, in particular, how viewing software develops. The Illinois Digital Libraries project, part of the US DLib initiative , has highlighted a number of problems with the current generation of SGML viewers, not least their inadequacies for presenting mathematical and chemical equations.
Despite the increasing prominence of SGML in the journals arena, only three of the sixty-plus projects funded by the eLib programme use SGML to any appreciable extent; they are CLIC, SuperJournal and Infobike. Both SuperJournal 2 and Infobike (JournalsOnline) use SGML as an intermediate format for the provision of bibliographic information by publishers to the system. The majority of eLib e-journal projects involve HTML or PDF. The Programme Director, Chris Rusbridge, has suggested that HTML should be regarded as a short-term solution 'with perhaps a medium term aim of migrating to SGML and a suitable DTD.' 
The W3C have recently been promoting 'an extremely simple dialect of SGML' called Extensible Markup Language, or XML . The goal of XML is 'to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.' It's very early days for XML but it looks like it will be influential.
LaTeX is a system for typesetting documents which is widely used in the scientific and engineering community, particularly by mathematicians and computer scientists. It has excellent facilities for representing mathematical formulae.
Packages such as Word have a WYSIWYG (What You See Is What You Get) interface, which allows users to view changes to a document's layout as they're made. Most user interfaces to the LaTeX system, on the other hand, require the user to mark up an ASCII file with LaTeX codes and then convert the resulting LaTeX file into the appropriate format, usually PostScript, for display or printing. This conversion is a two-stage process. First, the file is passed through a LaTeX formatter to produce a DeVice Independent (DVI) form. It's then converted to PostScript. Software to perform these conversions is widely available in the science and engineering community, as are programs for previewing the DVI and PostScript documents.
Figure 7 shows a PostScript page from the Chicago Journal of Theoretical Computer Science (CJTCS) viewed using GSPreview, a previewer for PostScript. This journal is available via the Internet in both LaTeX and PostScript form. The 'definitive version' of each article is a LaTeX source file. There are two versions of LaTeX currently in use; the newer version, which came out in 1994, is referred to as LaTeX2e. CJTCS aim to make their articles compatible with both versions. LaTeX's graphical facilities are fairly rudimentary; diagrams and figures that can't be conveniently represented in LaTeX are made available in a format called Encapsulated PostScript (EPS).
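For readers who haven't seen one, a marked-up LaTeX file looks something like the following. It's a minimal sketch of my own rather than an extract from any journal, and it uses the LaTeX2e form of the opening line.

```latex
\documentclass{article}
\begin{document}

\section{Result}
The roots of $ax^2 + bx + c = 0$ are
\begin{equation}
  x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
\end{equation}

\end{document}
```

The whole file is plain ASCII; the formula only takes on its typeset appearance after the two conversion stages described above. On many installations these correspond to running the latex formatter and then a DVI-to-PostScript driver such as dvips, though the exact tools vary from site to site.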
As with SGML, the LaTeX mark-up is primarily logical in that, for the most part, it describes the logical structure of a document rather than the visual appearance. Some publishers, for example Springer-Verlag  make 'style files' available to their authors and editors. These style files define how the various logical features, such as headings, paragraphs and so on, are to appear for a particular journal. As with every other format, the use of LaTeX isn't without its problems. There will always be the authors who ignore the prescribed journal style files, invent their own and then forget to send them to the editor with their submitted articles. In such cases, far from saving the editors time, electronic article submission can add to the editorial workload. The problem isn't confined to this format but it can be particularly acute with LaTeX because of the potential sophistication of the documents being produced. The editors of CJTCS aim for maximum portability of the LaTeX source files by using a 'disciplined subset of LaTeX'. This often involves significant copy editing of the files submitted by the authors to remove 'clever' but non-standard markup.
Figure 7: Postscript version of CJTCS article viewed using a Ghostscript Previewer
LaTeX is built on top of a lower level markup language called TeX. Authors are increasingly using LaTeX in preference to TeX and, similarly, journals are increasingly providing articles in LaTeX rather than in TeX.
Whereas LaTeX provides a logical representation of a document, the PostScript page description language  describes the visual appearance of the final page.
Some e-serials provide articles to the end-user directly in PostScript. The advantage of this for publishers is that the document files can't be easily altered by the user; the look and feel of journal articles can be retained and copyright can be protected. But the files can be large, especially if they include graphics, so most e-serials make articles available in a more compact format, for example LaTeX, PDF or HTML. The end-user is left to convert the files to the format recognised by their printer; for most laser printers, this is PostScript.
Another potential problem arises if the reader's machine doesn't have a font which is used in a document. In the case of specialist journals which may use unusual fonts, default fonts may not be an adequate substitute and the original font may be too expensive to licence to readers. One of the main advantages for publishers of the PDF format described in the next section is that the necessary fonts can be included in the document files.
PDF, or Portable Document Format , is an increasingly popular format for e-serials, especially journals. It's used as the underlying representation for the Acrobat suite of software.
PDF, first announced in late 1992, could be summarised as 'PostScript with hypertext'. Its features include
Online presentation and browsing of PDF documents is via the Acrobat Exchange or Reader software. Exchange allows links, annotations, and bookmarks to be created and used. The Reader is a limited version of the Exchange software, allowing use, but not creation, of hypertext features. The Reader software is available free of charge from the Adobe Web site .
Acrobat is being increasingly integrated with Web browsers such as Netscape, the latest version of which allows PDF documents to be viewed within the browser window. The Acrobat suite provides facilities for the creation of PDF files. In addition, an increasing number of desk-top publishing packages, for example Adobe PageMaker 6.0 and above, include facilities to convert their proprietary formats to PDF.
PDF is becoming widely used because it's easy for the typesetters to produce from PostScript and because publishers can retain control of page appearance; as with PostScript, PDF isn't easily changed by the end-user. And, unlike HTML, PDF looks the same no matter what viewer is used to display it. As already mentioned, another reason for enthusiasm is that certain fonts can be legally embedded in PDF files, so tackling the problem of local unavailability of fonts. As mentioned in Section 2, many e-journals, for example those in the IDEAL system, now comprise individual PDF articles linked by an HTML-based journal structure.
Adobe Acrobat is just one of a number of 'page description' systems currently available. Other examples include Hummingbird Communications Common Ground, Novell/Tumbleweed Software Envoy and Farallon Replica . A few e-journals, for example, those of the Royal Society of Chemistry , have been made available using CatchWord's 'Internet publishing environment', RealPage . But Adobe PDF already appears to have consolidated its position - in the e-journal field at least. PDF isn't perfect but it's the best that's commonly available at the moment and, as it becomes increasingly common, it becomes increasingly convenient.
In the long term, it's likely that multimedia features such as video clips, sound tracks and interactive images and data will be an accepted part of e-serials, particularly e-journals. Users may be able to run simulation programs, rotate 3D images and apply mathematical formulae to test data.
7.1: Still Images
GIF (Graphics Interchange Format) is currently the most commonly used, and universally supported, data format for cartoon-like images and special characters on the Web. JPEG (Joint Photographic Experts Group) is also used, mainly for photographic images . When used with Web browsers, GIF and JPEG images may be inline, that is, directly on the Web page in question, or external, that is, appearing in a separate window when requested by the user. GIFs can have more than one image per file; when these images are shown in quick succession, they can give the illusion of movement. Hence, the 'animated GIFs' that are becoming increasingly frequent on Web pages .
Portable Network Graphics, or PNG, pronounced 'ping', is being promoted by W3C as an alternative to GIF. PNG graphics have smaller storage sizes and can display more colours than GIFs but haven't yet taken over from GIFs in the popularity stakes.
There are many other still-image formats available, including:
The list of graphic and other multimedia formats supported by any particular browser can be viewed and added to by choosing the appropriate browser menu option, as shown in Figure 8 for Netscape.
Images can be large and slow to download but there are several ways that they can be speeded up. When they're accessed via the Web, images, along with the page they appear in, are cached by the browser; that is, they're stored for possible reuse. This can save time accessing images, such as navigation icons, that may appear on every page.
Another way of cutting load time is to compress the images . Some techniques, such as that employed by the most common versions of JPEG, are described as lossy because information is lost during compression and decompression. This is not usually important if the images are just for viewing; the losses may not be very significant to the human eye. But it could be significant in interactive e-journals in which images such as satellite data and gas chromatography charts might be analysed to extract data. GIF and PNG, on the other hand, are lossless but less suitable for photographic images.
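The difference between lossless and lossy techniques can be demonstrated with a few lines of Python. This is a toy sketch under stated assumptions: the pixel values are invented, the lossless half uses the deflate method from the standard zlib module (the method PNG also uses), and the lossy half simply discards the low four bits of each sample, which is far cruder than JPEG and is shown only to make the loss visible.

```python
import zlib

# Hypothetical 8-bit greyscale samples standing in for rows of image pixels.
pixels = bytes([10, 11, 13, 200, 201, 203, 90, 92] * 32)

# Lossless: after decompression, every byte is exactly as it was.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)

# Lossy (toy example, not JPEG): keep only the top four bits of each sample.
quantised = bytes(p & 0xF0 for p in pixels)
max_error = max(abs(a - b) for a, b in zip(pixels, quantised))

print("original bytes:  ", len(pixels))
print("compressed bytes:", len(packed))
print("lossless round trip exact:", restored == pixels)
print("largest error after quantising:", max_error)
```

A small error per sample may be invisible on screen, but it's exactly the kind of discrepancy that would matter to a reader trying to extract measurements from the image.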
Figure 8: Netscape Dialogue box for listing and editing helper applications
7.2: Audio and Moving images
The use of video and audio is still very rare in e-serials. The World Wide Web Journal of Biology aims to provide HTML-based articles with links to movies in MPEG, AVI or Quicktime format, as well as to sound files and interactive molecules. But, browsing the back issues, I didn't notice any articles that actually included these features. Authors aren't yet used to the new media and their possibilities.
MPEG, named after the Moving Picture Experts Group, is the closest approximation to an international standard video format. Microsoft's AVI (Audio Video Interleave) and Apple Quicktime are proprietary standards. Viewers for MPEG and Quicktime are freely available for X Windows, Microsoft Windows, and Macintosh platforms. An AVI video player for Windows is built into Windows 95 and NT and is available for Windows 3.1.
A small but increasing number of video and audio players allow 'streaming' ; VDOLive and RealAudio  are currently the most common examples. Streaming is the technology that enables video or audio to be viewed or listened to as it's received, rather than having to wait for the entire file to be downloaded before playing it. There are disadvantages to this approach; if your connection to the Net is slow, the sound or video may appear sluggish or arrive in bursts.
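The principle behind streaming, handling data piece by piece as it arrives, can be sketched in Python as follows. The names are hypothetical: an in-memory buffer stands in for the network connection, and the play function is a stand-in for a real audio or video decoder. It isn't a description of how VDOLive or RealAudio work internally.

```python
import io


def stream_chunks(source, chunk_size=4096):
    """Yield successive chunks from a file-like object as they are read."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk


def play(chunk):
    """Hypothetical stand-in for handing data to a decoder; returns bytes played."""
    return len(chunk)


# An in-memory buffer stands in for a clip arriving over a network connection.
clip = io.BytesIO(b"\x00" * 10000)

# Each chunk is played as soon as it is available, before the rest has arrived.
played = [play(chunk) for chunk in stream_chunks(clip)]
print(played)
```

If chunks arrive more slowly than they are played, the player has nothing to hand to the decoder, which is the source of the sluggish or bursty playback mentioned above.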
MacroMedia Director is the most common application used to generate multimedia titles. The Director player is available free as a Netscape or IE plug-in for Microsoft Windows called ShockWave. The latter may well play a part in future interactive journals; at the moment, its use is concentrated at the more commercial end of the e-serial market. Time Magazine, for example, has an online demo incorporating ShockWave interactivity.
The only widely used and fully platform-independent sound file format is Sun Microsystems' AU format. Elsevier's Speech Communication journal invites authors to 'illustrate' their articles with audio files which are then made available to users in AU format. Higher quality but platform-dependent formats include AIFF and WAV. The latter is commonly used for sound effects in Microsoft Windows. Netscape and IE have built-in facilities for playing AU, AIFF and WAV files. The audio section of the MPEG standard provides very high quality sound and players are available for a range of platforms. As already mentioned, the inclusion of sound files in e-journals is very rare to date and it's significant that eLib don't make any recommendations in this area.
7.3: Interactive Images
The latest browsers support a new wave of formats that will allow not only movie clips but also interactive 3-D content via formats such as VRML and Java.
VRML, or Virtual Reality Modelling Language , enables the creation and display of three dimensional environments and models. Version 2.0 has become the industry standard for 3D on the Web. Various VRML browsers can be freely down-loaded . A few work with Windows 3.1 or X Windows but most require the Windows 95/NT operating system or a Silicon Graphics machine; the more powerful the computer, the more impressive the results. The potential of VRML in e-journals has already been recognised in the field of chemistry, and demonstrations can be viewed at the Imperial College site .
Developments such as Java  make full multimedia journals increasingly feasible. Java is a relatively simple but powerful programming language, based on the more complex C++ language. It's likely to become increasingly common in interactive presentations such as user processing of data, rotation of 3-D images and other simulations. Small Java programs, called Applets, can be included in HTML files and run using Java-enabled browsers, such as Version 2.0 of Netscape or Version 3.0 of IE.
There may well be a delay before Java Applets find their way into the e-serials field on a large scale. The programming required to develop Java content is considerably more sophisticated than that required to develop HTML pages and would imply a large investment by publishers. But publishers may regard this as an investment worth making. Unlike HTML, the code of Java Applets isn't transparent to either the reader or to the system on which it's running; publishers may see Applets as a way of regaining control over the content of their publications.
7.4: Multimedia Problems
Until high-speed lines and relevant hardware and software are more widespread, a proportion of e-serial users will have to forgo the potential added value of multimedia. Some e-journals cater for users with slow lines and text-only interfaces, such as Lynx, by making all multimedia, including graphics, optional. Most browsers allow the image-loading functions to be delayed or turned off. But the main problem with current multimedia e-journals may not be a technical one but rather the difficulty of persuading authors to submit articles incorporating multimedia features.
As more commercially-produced e-serials take to the Web, the few remaining ASCII-based journals and newsletters will disappear. LaTeX will continue to be used until HTML or its successor (SGML or, more likely, XML) can deal adequately with maths and chemistry. It's possible that a subset of LaTeX could be incorporated into HTML to form its maths component. This, and a few other improvements, would make HTML a more viable format for serial article full-text. But it's unlikely to rival PDF for layout quality. PDF will continue to ride high for quite some time until something better appears. And authors will begin to write articles for the Web.
Dr Judith Wusteman
was a lecturer in Computer Science at UKC when she wrote this article. She has since moved to UCD and her present address is:
Dr Judith Wusteman
Department of Library and Information Studies
University College Dublin
Belfield, Dublin 4, Ireland
phone: +353 1 706 7612
fax: +353 1 706 1161