Internationalisation and the Web

Jon Knight

Jon Knight looks at how the Web is currently undergoing the sometimes painful internationalization process required if it is to live up to its name of the World Wide Web.

The World Wide Web is intended to be “an embodiment of human knowledge” [1] but is currently mainly an embodiment of only West European and North American knowledge resources. The reason for this is simple; despite the name, the development of the World Wide Web has until recently been very heavily oriented towards English and other Western European languages[2]. If you want to display a resource that uses one of the ideographic character sets from Asian languages, for example, then you have been forced to use either inlined images or localized, kludged versions of software. The former solution results in poor performance and the latter limits interoperability and the freedom to use the latest tools.

The real solution to the problem is obviously to move the development of the Web to the point where resources can contain information in multiple character sets and languages. However this is no simple matter; there is already a lot of software deployed that uses the existing standards. Internationalization (or I18N as it's sometimes known) of the Web must ensure that this existing base of tools can still make use of the information available; in other words it should provide for backwards compatibility where possible.

This article will briefly look at some of the options for I18N of Web documents. Firstly the issue of the various character sets that can be used is examined. Then the progress of providing support for these character sets and the languages that use them in the HyperText Markup Language (HTML)[3] and the HyperText Transfer Protocol (HTTP)[4] is reviewed. Finally an overview of the current, often heated, debate surrounding the internationalization/localization of Uniform Resource Locators (URLs)[5] is given.

Character Sets

Most of the information systems that have been developed in the past twenty years have relied on the 7 bit US-ASCII character set[6]. This provides upper and lower case Roman characters, Arabic numerals and a selection of punctuation characters and control codes. Whilst this has proved adequate for handling English texts, it forced people using languages that had other characters to use anglicised/romanised versions. Many information systems still exist with these limitations; indeed many library OPACs make heavy use of romanised transliterations of foreign titles.

A slight improvement over US-ASCII is the ISO-8859 series of character sets[7]. These character sets all use a full 8 bit byte to represent each character and between them handle most of the characters needed for West European, East European and Middle Eastern languages. The lower half of each character set is broadly compatible with the US-ASCII character set, giving them the ability to “grandfather” in resources that already exist in US-ASCII format. The upper half of each character set contains the characters required for each localized group of languages. The full list of the ISO-8859 series of character sets is:

Latin alphabet No. 1, ISO 8859-1,
Latin alphabet No. 2, ISO 8859-2,
Latin alphabet No. 3, ISO 8859-3,
Latin alphabet No. 4, ISO 8859-4,
Latin/Cyrillic alphabet, ISO 8859-5,
Latin/Arabic alphabet, ISO 8859-6,
Latin/Greek alphabet, ISO 8859-7,
Latin/Hebrew alphabet, ISO 8859-8,
Latin alphabet No. 5, ISO 8859-9,

There are two major problems with the ISO-8859 character set series. Firstly each character set is effectively independent of the others. This means that if, for example, you need to produce a document that contains both Greek and Hebrew text, you have the problem of having to specify which character set each part of the document adheres to. Secondly, the ISO-8859 series of character sets still does not cover the characters natively used by the languages of an appreciable fraction of the World’s population.
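The independence of the ISO-8859 parts can be seen by decoding the same bytes under several of them. The following is only a minimal sketch; the codec names are those used by Python's standard library, and the choice of the byte 0xE4 and the helper name decode_in_each_part are purely illustrative.

```python
# Python's names for four of the ISO-8859 parts listed above.
PARTS = ["iso8859_1", "iso8859_5", "iso8859_7", "iso8859_8"]


def decode_in_each_part(raw: bytes) -> dict:
    """Decode the same bytes under each ISO-8859 part in PARTS."""
    return {part: raw.decode(part) for part in PARTS}


# A byte from the upper half means something different in every part...
upper_half = decode_in_each_part(b"\xe4")
# ...whereas bytes from the lower (US-ASCII) half mean the same everywhere.
lower_half = decode_in_each_part(b"Web")

for part, text in upper_half.items():
    print(part, repr(text), f"U+{ord(text):04X}")
```

Without some out-of-band label saying which part applies, a reader of the raw bytes has no way of telling whether 0xE4 was meant as a Latin, Cyrillic, Greek or Hebrew letter.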

In the Far East another set of ISO character sets have been filling the void left by US-ASCII and the ISO-8859 series. The ISO-2022 standard[8] introduces a method of encoding large ideographic character sets such as Japanese and Chinese into 7 and 8 bit characters. The basic technique used is to assume that the characters in a document are initially simple ASCII style Roman characters. However if a specific set of escape sequences appears, processing of subsequent characters then assumes that they represent a specific set of ideograms.

The ISO-2022-JP[9] character set is widely used in the Japanese speaking portion of the Internet and in Japanese information systems. There are also variations for Chinese (ISO-2022-CN[10]) and Korean (ISO-2022-KR). The ISO-2022 style of extended ASCII based character sets also competes locally with a number of other de facto standard encodings of ideographic character sets, which complicates the matter still further. The ISO-2022 series of character sets suffers from the same problem as ISO-8859 if one needs to represent characters from more than one character set.
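The escape sequence technique can be illustrated with the ISO-2022-JP codec that ships with Python. This is a sketch rather than a description of any particular information system; the sample string is an arbitrary choice.

```python
ESC = b"\x1b"

# Four ASCII characters followed by three kanji.
text = "Web 日本語"
encoded = text.encode("iso2022_jp")

# The text starts out in ASCII mode. An escape sequence switches the
# decoder into a two-byte Japanese character set, and a second escape
# sequence switches it back to ASCII at the end.
print(encoded)
print("escape sequences used:", encoded.count(ESC))

# Every byte on the wire still fits in 7 bits.
seven_bit_clean = all(byte < 0x80 for byte in encoded)
print("7-bit clean:", seven_bit_clean)
```

Between the two escape sequences each ideogram is carried as a pair of bytes that would otherwise be read as ordinary ASCII punctuation, letters and digits, which is why the escapes must survive transmission intact.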

In an effort to allow true multilingual resources to be made and to attempt to bring together the characters handled by all the previous disparate character sets, the International Organization for Standardization (ISO) has been working on a different character set called ISO-10646[11]. Unlike the previous character sets, which are built around 7 and 8 bit codes, ISO-10646 natively uses a 32-bit code space (known as Universal Multiple-Octet Coded Character Set 4 (UCS-4)). This should provide enough room for every character used by every language on Earth.

Related to ISO-10646 is the Unicode standard[12]. Unicode basically replicates the lower 16 bits of the ISO-10646 canonical 32 bit format. This 16 bit form is known as UCS-2. Both ISO-10646 and Unicode have a number of UCS Transformation Formats such as UTF-16 and UTF-8[12] that allow their 32- and 16-bit canonical characters to be represented using a variable number of 8 bit characters. The major complaint that people raise against both standards is that the popular UTF-8 transformation rules favour English (US-ASCII characters appear as themselves for backwards compatibility) whereas some of the ideographic languages with large character repertoires require up to six bytes to represent a single character. Despite this, Unicode is currently our best hope for a single, universal character set.
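The difference between the fixed-width canonical forms and the variable-width transformation formats can be sketched by counting bytes. The example below uses Python's UTF-32, UTF-16 and UTF-8 codecs as stand-ins for UCS-4, UCS-2/UTF-16 and UTF-8; the helper name encoded_lengths and the sample characters are illustrative only. Note that Python's UTF-8 codec follows a later, more restricted definition that never produces more than four bytes per character, so the longest forms mentioned above cannot be generated with it.

```python
def encoded_lengths(ch: str) -> dict:
    """Number of bytes needed for one character in each encoding form."""
    return {
        "UCS-4 / UTF-32": len(ch.encode("utf-32-be")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-8": len(ch.encode("utf-8")),
    }


# A US-ASCII letter, an accented Latin letter and a kanji.
for ch in ["A", "é", "日"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# US-ASCII characters appear as themselves in UTF-8.
ascii_unchanged = "Web".encode("utf-8") == b"Web"
print("ASCII unchanged in UTF-8:", ascii_unchanged)
```

The English letter costs one byte in UTF-8 while the kanji costs three, which is the imbalance that critics of the transformation rules point to.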

I18N of HTML

The previous section has outlined a number of problems with the ISO-8859 series of character sets and how ISO-10646/Unicode attempts to overcome these. Even so, the standard character set adopted by Tim Berners-Lee and the other original developers of the WWW was ISO-8859-1. This allowed existing US-ASCII documents to be easily marked up using normal editing tools whilst also allowing the use of accented characters that are frequently required in Western Europe and North America where most of the developers and early users lived. Also at the time Unicode and the related ISO 10646 standard were not as stable and well known as they are now.

Both the HTML 2.0[13] and HTML 3.2[14] standards currently specify the use of ISO-8859-1. They both permit the use of character entities to allow US-ASCII based editors to include the full repertoire of ISO-8859-1 characters (either by name or by using the appropriate numeric code). However the Document Type Definition (DTD) from RFC 2070[15] corrects this by specifying ISO-10646 as the base character set. This permits an HTML document to contain multiple languages all represented with their native characters.
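As a rough illustration of character entities, the sketch below uses Python's standard html module to resolve a named entity, the equivalent numeric reference, and a numeric reference that lies outside ISO-8859-1 and so only makes sense once ISO-10646 is the base character set. The sample strings are invented for the example.

```python
import html
from html.entities import name2codepoint

# The same ISO-8859-1 character written by name and by numeric code.
named = html.unescape("caf&eacute;")
numeric = html.unescape("caf&#233;")
print(named, numeric, name2codepoint["eacute"])

# Once resolved, the character fits in a single ISO-8859-1 byte.
as_latin1 = named.encode("iso8859_1")
print(as_latin1)

# A numeric reference beyond ISO-8859-1: code 26085 is a kanji in
# ISO-10646, so it cannot be written out as an ISO-8859-1 byte.
kanji = html.unescape("&#26085;")
print(kanji, f"U+{ord(kanji):04X}")
```

In every case the source text stays pure US-ASCII, which is what lets simple editors produce such documents.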

RFC 2070 also includes a number of new elements and character entities in its DTD that allow users to explicitly state the language that a section of an HTML document is written in and the direction that the characters should be written. This is important because without this information browsers will have great difficulty in correctly rendering documents that contain multiple bidirectional language sections and some cursive writing styles.
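To give a flavour of this markup, the sketch below feeds a small fragment that uses the LANG and DIR attributes through Python's standard HTML parser and collects the language and direction stated on each element. The fragment and the LangDirCollector class are invented for illustration and are not taken from RFC 2070 itself.

```python
from html.parser import HTMLParser

# English left-to-right text containing a Hebrew right-to-left word.
FRAGMENT = (
    '<p lang="en" dir="ltr">The Hebrew word '
    '<span lang="he" dir="rtl">שלום</span> means peace.</p>'
)


class LangDirCollector(HTMLParser):
    """Record (element, language, direction) for each labelled element."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "lang" in attributes or "dir" in attributes:
            self.found.append(
                (tag, attributes.get("lang"), attributes.get("dir"))
            )


collector = LangDirCollector()
collector.feed(FRAGMENT)
print(collector.found)
```

A browser would use the same information to pick suitable fonts and to lay out the right-to-left section correctly within the surrounding left-to-right text.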

Indeed rendering of complex multilingual documents is still a major problem even with ISO-10646, as the wide range of fonts that we are used to using with ISO-8859-1 documents does not exist for the full 35,000+ Unicode/ISO-10646 characters. RFC 2070 suggests that the application may have to resort to displaying the hexadecimal form of the character, which is somewhat less than user friendly. Having said that, if the browser that a user is using cannot render a character it is highly likely that the user will not be able to read the document anyway!

Some of the proposals in RFC 2070 are being incorporated into the W3C’s next DTD recommendation codenamed “Cougar”[16]. One aspect of HTML that has not yet received much attention from the W3C or any of the browser vendors is the extension of the CSS style sheets mechanism[17] to handle non-Western writing styles.

I18N of HTTP

As well as internationalising HTML, some work has been expended on providing support for multiple languages and character sets in the HyperText Transfer Protocol (HTTP) often used to retrieve the documents. HTTP/1.0[18] (the most common version of HTTP currently deployed) uses the charset parameter from MIME[19] to indicate the base character set of the document. The definition in RFC 1945 includes a list of character set names that were registered for use with MIME at the time that it was written.

The HTTP protocol has been updated and improved to give HTTP/1.1[20]. As with HTTP/1.0, the document might be encoded in any IANA registered character set but the protocol also now allows clients and servers to negotiate language content as well as character sets. They do this using the Accept-Language header from the client and the Content-Language response header from the server. This is a distinct improvement over HTTP/1.0 where multiple versions of the same resource in different languages are often indicated by different URLs.
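A server's side of this language negotiation might be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any particular server: the function names are hypothetical, a language range is taken to match a tag if it equals the tag or is a prefix of it followed by a hyphen, and malformed quality values are not handled.

```python
def parse_accept_language(header: str) -> list:
    """Return (language-range, q) pairs, most preferred first."""
    ranges = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        params = params.strip()
        # A missing quality value means the default of 1.0.
        q = float(params[2:]) if params.startswith("q=") else 1.0
        ranges.append((name.strip().lower(), q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)


def choose_language(header: str, available: list):
    """Pick the available language tag the client prefers, or None."""
    for language_range, q in parse_accept_language(header):
        if q == 0:
            continue  # q=0 marks a language as not acceptable
        for tag in available:
            lowered = tag.lower()
            if (language_range == "*" or lowered == language_range
                    or lowered.startswith(language_range + "-")):
                return tag
    return None


header = "da, en-gb;q=0.8, en;q=0.7"
print(parse_accept_language(header))
print(choose_language(header, ["en", "fr"]))
```

The tag that is chosen would then be reported back to the client in the Content-Language response header, so a single URL can serve every language version of the resource.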

One thing to note is that HTTP/1.1’s default character set, if none is specifically requested, is ISO-8859-1. This is fine for HTML 2.0/3.2 but is of course incorrect for RFC 2070 and Cougar, which have a base character set of ISO-10646. However this is not really a major problem as browsers capable of handling the multilingual features of RFC 2070 or Cougar are likely to also be able to generate and process charset parameters correctly when requesting documents.
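The handling of the charset parameter and its default can be sketched with the MIME header parser in Python's standard email package. The function name document_charset and the sample header values are assumptions made for the example.

```python
from email.message import Message

# The default that applies when a response carries no charset parameter.
HTTP_DEFAULT_CHARSET = "iso-8859-1"


def document_charset(content_type: str) -> str:
    """Return the charset named in a Content-Type value, or the default."""
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() gives the parameter in lower case, or None
    # when the header has no charset parameter at all.
    return message.get_content_charset() or HTTP_DEFAULT_CHARSET


print(document_charset("text/html; charset=ISO-2022-JP"))
print(document_charset("text/html"))
```

A client that simply assumed the default for every response would misread the first document, which is why correct processing of the parameter matters more than the choice of default.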

I18N of URLs

The last area of the Web that is the subject of debate concerning I18N issues is the URLs[21] that are used to address and locate resources. URLs are the technology that makes the Web work as they provide the uniform interface to distributed resources accessible via a large number of protocols. Part of the success of the Web is that URLs are universally usable; even though people might not be able to “read” the URL, nearly everyone stands a pretty good chance of having a keyboard that allows them to enter ASCII characters and therefore enter URLs.

The URL standard is currently moving through the IETF standards process and as part of its review, a number of people have suggested that URLs should be extended to incorporate support for either multiple character sets or more practically Unicode rather than plain ASCII. They argue that this will allow non-English speaking users to use URLs that contain strings that mean something to them. To deal with the problem of the large amount of software that already handles ASCII URLs they have suggested that non-ASCII characters from the Unicode character set can be “percent-escaped” in the same way as “reserved” characters from the normal ASCII character set already are. Some example URLs and processing software (such as servers that generate Unicode URLs from localised filenames) have been produced and are currently under discussion. At this point it is still not clear that the advantage of being able to support localized character sets in URLs for specific groups will outweigh the clumsiness and possible difficulty of entering them on existing, non-internationalized systems.
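The percent-escaping idea can be sketched with Python's urllib.parse module. The example assumes that the Unicode characters are first encoded as UTF-8 before being escaped, which is what this library does by default but is only one of the possibilities under discussion; the path itself is made up.

```python
from urllib.parse import quote, unquote

# A hypothetical path containing three kanji.
path = "/docs/日本語.html"

# quote() encodes non-ASCII characters as UTF-8 and escapes each
# resulting byte as %XX, leaving "/" and unreserved ASCII alone.
escaped = quote(path)
print(escaped)

# Software that understands the convention can recover the original.
restored = unquote(escaped)
print(restored == path)
```

The escaped form travels safely through ASCII-only software, but it also shows the clumsiness at issue: three characters become twenty-seven, none of them meaningful to a reader.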

Conclusions

This document has outlined some of the I18N issues present in the World Wide Web today, and also some of the work that is being undertaken to address them. Without a doubt there is an increasing push towards producing both standards and products that permit multilingual usage; both the standards setters and the software producers have realised that there is a large market for Web technology outside of North America and Western Europe.

I18N is a complex subject, especially if one is attempting to maintain backward compatibility with an installed software base of the scale that is currently present in the Web. There is plenty of work still to be done and the debates about aspects of I18N are sometimes not for the faint-hearted (or easily offended - I18N is almost a religious topic on some mailing lists!). However we can rest assured that in the next few years we will definitely see the World Wide put back into the Web.

References

[1] W3C, About The World Wide Web, 1996,
http://www.w3.org/pub/WWW/WWW/
[2] W3C, Internationalization / Localization: Non-western Character sets, Languages, and Writing Systems, April 1997,
http://www.w3.org/pub/WWW/International/
[3] W3C, HyperText Markup Language (HTML),
http://www.w3.org/pub/WWW/MarkUp/
[4] W3C, HTTP - HyperText Transfer Protocol,
http://www.w3.org/pub/WWW/Protocols/
[5] W3C, Naming and Addressing,
http://www.w3.org/pub/WWW/Addressing/
[6] US-ASCII. Coded Character Set - 7-Bit American Standard Code for Information Interchange. Standard ANSI X3.4-1986, ANSI, 1986.
[7] ISO 8859. International Standard – Information Processing – 8-bit Single-Byte Coded Graphic Character Sets – Part 1: Latin Alphabet No. 1, ISO 8859-1:1987. Part 2: Latin alphabet No. 2, ISO 8859-2, 1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 1988. Part 4: Latin alphabet No. 4, ISO 8859-4, 1988. Part 5: Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6: Latin/Arabic alphabet, ISO 8859-6, 1987. Part 7: Latin/Greek alphabet, ISO 8859-7, 1987. Part 8: Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin alphabet No. 5, ISO 8859-9, 1990.
[8] International Organization for Standardization (ISO), Information processing – ISO 7-bit and 8-bit coded character sets – Code extension techniques, International Standard, Ref. No. ISO 2022-1986 (E).
[9] J. Murai, M. Crispin and E. van der Poel, Japanese Character Encoding for Internet Messages, RFC 1468, June 1993,
ftp://ftp.nordu.net/rfc/rfc1468.txt
[10] HF. Zhu, et al, Chinese Character Encoding for Internet Messages, RFC 1922, March 1996,
ftp://src.doc.ic.ac.uk/rfc/rfc1922.txt.gz
[11] ISO/IEC 10646-1:1993(E) Information Technology–Universal Multiple-octet Coded Character Set (UCS), 1993.
[12] D. Goldsmith and M. Davis, Using Unicode with MIME, RFC 1641, July 1994,
http://ds.internic.net/rfc/rfc1641.txt
[13] HyperText Markup Language 2.0,
http://src.doc.ic.ac.uk/rfc/rfc1866.txt
[14] HyperText Markup Language 3.2,
http://www.w3.org/pub/WWW/TR/REC-html32.html
[15] F. Yergeau, G. Nicol, G. Adams and M. Duerst, Internationalization of the Hypertext Markup Language, RFC 2070, January 1997,
http://src.doc.ic.ac.uk/rfc/rfc2070.txt
[16] W3C, Project: Cougar, W3C’s next version of HTML,
http://www.w3.org/pub/WWW/MarkUp/Cougar/
[17] W3C, Web Style Sheets,
http://www.w3.org/pub/WWW/Style/
[18] T. Berners-Lee, R. Fielding, H. Frystyk, Hypertext Transfer Protocol – HTTP/1.0, RFC 1945, May 1996,
http://src.doc.ic.ac.uk/rfc/rfc1945.txt
[19] N. Freed and N. Borenstein, Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, RFC 2045, November 1996,
http://src.doc.ic.ac.uk/rfc/rfc2045.txt
[20] R. Fielding, J. Gettys, J. Mogul, H. Frystyk and T. Berners-Lee, Hypertext Transfer Protocol – HTTP/1.1, RFC 2068, January 1997,
http://src.doc.ic.ac.uk/rfc/rfc2068.txt
[21] T. Berners-Lee, L. Masinter and M. McCahill, Uniform Resource Locators (URL), RFC 1738, December 1994,
http://src.doc.ic.ac.uk/rfc/rfc1738.txt

Author Details

Jon Knight
ROADS Technical Developer
Email: jon@net.lut.ac.uk
Own Web Site: http://www.roads.lut.ac.uk/People/jon.html
Tel: 01509 228237
Address: Computer Science Department, University of Loughborough, Loughborough, Leicestershire