Metadata: Cataloguing Theory and Internet Subject-based Information Gateways
Introduction: cataloguing and the Internet
Modern descriptive cataloguing theory and practice have developed over the past 150 years as a means of organising information for retrieval in libraries. Library catalogues typically consist of a collection of bibliographic records that describe published materials, usually - as the name implies - in the form of printed books but also including cartographic materials, music scores and manuscripts. The standards and cataloguing codes originally developed to support this activity have expanded to include a range of newer publishing media, typically sound recordings, microforms, video recordings, films and computer files. In response to the current widespread (and increasing) use of computer networks - primarily the Internet - for publishing, bibliographic standards and the formats associated with them have also been adapted to describe this type of material.
However, the increasing use of the Internet as a publishing medium has led to a reassessment of the usefulness or validity of the traditional cataloguing approach in this new environment. In 1995, Vianne T. Sha outlined three broad approaches to Internet resource discovery.
- The development of search services using robot-based search engines to index Web pages, an approach exemplified by services like Lycos and AltaVista.
- The "manual subject guides" approach where human intelligence is utilised to identify and evaluate Internet resources. These are typically subject-based guides that take the form of HTML lists. Examples include the various WWW Virtual Library sites and the subject-guides listed by the Argus Clearinghouse.
- The library cataloguing method, creating bibliographic records for Internet resources in library catalogues. The classic example of this approach is OCLC's InterCat project.
Sha favours the third, traditional, approach where bibliographic records for Internet resources are catalogued using existing library standard formats, primarily MARC.
"Researchers will then need to go to only one place - the local library - to access all formats of information resources. Cataloguing records of the resources also give detailed descriptive information to help the researchers identify whether the resources are what they really need, saving the researchers a tremendous amount of time wandering on the Internet to browse through the information."
This is a laudable aim, but one could sensibly wonder whether the relatively heavyweight codes developed for traditional library cataloguing are really suitable for the mobile and ephemeral resource that is the Internet. Perhaps better solutions would not involve the use of traditional cataloguing codes or formats at all but would combine information professionals' expertise in selecting and evaluating resources with a resource discovery system that makes use of a relatively simple metadata format. Internet resources can be adequately described, for example, using the fifteen elements in the Dublin Core metadata element set. Another relatively simple approach is embodied by the subject-based gateways that use the software developed by the eLib-funded ROADS project. These services use information specialists to evaluate and select Internet resources and to create a metadata record for each resource using a ROADS template, a simple metadata format that uses attribute-value pairs.
This paper will attempt to compare traditional library cataloguing practices with the approach taken by subject-based gateways, using SOSIG (the Social Science Information Gateway) as a particular example.
What is a catalogue?
A traditional library catalogue is typically a list of books or other items. Catalogues are often based on the physical location of items (and in particular the library in which the items reside) but can also be based on other criteria like the date of publication or language of the items being described. Library catalogues record, describe and index the resources of a collection or group of collections. Each entry within the catalogue has a reference identifier to enable the item to be found, and sufficient details to identify and describe the item itself. In the older forms of catalogue - guard books, printed catalogues and card catalogues - the list is arranged in some definite order (e.g. by author, title or subject). In catalogues held as computer files the catalogue records are not held in any order but can be output on request in selected forms of order.
A catalogue record comprises a number of elements. The description records information about an item - its title, who created it, who published it and when, the type of material it is (book, serial, etc.) and what physical characteristics it has. In modern catalogues, this description is often based on an International Standard Bibliographic Description (ISBD). The description entry is then indexed (given access points) so that a number of different approaches will lead to the item. The indexing terms are referred to as headings and take the form of subject terms, personal and corporate author names, classification scheme numbers and item titles. Rules for the creation of catalogue records, including bibliographic description (usually based on ISBD) and the allocation of headings (access points), are contained in published cataloguing rules like the 2nd edition of the Anglo-American Cataloguing Rules (AACR2). In order to be able to collocate all entries for specific headings but still allow items to be found via alternative terms or variant names, some form of authority control is required. Authority control establishes a form to be used, and references can then be made from alternative terms or variant names.
The catalogues of ROADS-based Internet subject gateways are superficially similar. The catalogue records contain both descriptive information and access points for author names, subject terms and classification numbers. Authority control is also used (to a limited extent) for things like language codes or keyword terms. However, the operation of these catalogues does differ in significant ways from that of the traditional library catalogue. The following sections will attempt to show these differences.
Selection
In libraries, the selection of items for stock is a separate exercise from cataloguing the items acquired and so, apart from in very small libraries, selection and cataloguing processes are carried out by different people. In academic libraries, library staff with subject knowledge and (sometimes) academic staff carry out the task of selection. For public libraries, branch librarians will often be the selectors of stock, while users may be able to suggest particular items. In special libraries selection may be a departmental and not a library task.
The selection procedure for Internet resources involves the additional tasks of discovery and quality evaluation. This location and filtering process would already have taken place in traditional libraries through the use of publishers' catalogues and other acquisition aids. The selection process for Internet subject gateways is typically carried out by subject specialists and information professionals who are often also responsible for cataloguing the resources. SOSIG employs a team of core staff based at Bristol and a distributed team of academic social science subject librarians based at institutions around the UK. These librarians are known as Section Editors and are each responsible for a specific subject area of the catalogue. The 'Add a New Resource' form in SOSIG allows users to suggest items which could be added to the catalogue. Selection policies are based on the subject knowledge of gateway staff coupled with quality guidelines and evaluation procedures to help ensure consistency across the service.
Description
Library cataloguing rules contain rules for the description of library materials. These descriptions typically include (these are taken from AACR2 1.0B1):
- Title and statement of responsibility
- Publication, distribution, etc.
- Physical description
- Standard number and terms of availability
Description, therefore, is concerned with both the intellectual content of an item (its title, its creator(s), the series to which it belongs, etc.) and its physical nature. In library catalogues, however, the materials described virtually always take the form of physical objects. AACR2 1988 rev., for example, gives rules for the description of books, cartographic materials, sound recordings, computer files and three-dimensional artefacts (among other things). An important part of the description in these cases relates to the physical characteristics of an item - for example: the size, pagination and existence of illustrations for books and pamphlets; the type of carrier and playing time for sound, film and video recordings; and what particular items comprise a mixed media unit. This type of information is less useful in relation to networked resources. Cataloguing codes are, however, being adapted to be able to describe networked objects. For example, in 1997 IFLA published an ISBD for electronic resources.
The templates used by ROADS-based gateways can contain most of the same descriptive information as traditional catalogues and are relatively rich in terms of the information they can record about a resource. The templates can (if required) include over 60 fields or attributes (although most of these are optional and others are automatically created). The ROADS DOCUMENT template-type, for example, has specific attributes for titles (Title, Short-Title and Alternative-Title) and USER clusters for authors and publishers. ROADS-based gateways, however, do not need to (and in general are not able to) describe Internet resources in a physical sense.
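The attribute-value structure of such a template can be illustrated with a minimal Python sketch. The record below is invented for illustration: only the attribute names Title, Short-Title, Alternative-Title and Description are taken from this paper, while the `Template-Type` attribute name and the parsing rules (one `Attribute: value` pair per line, with indented lines continuing the previous value) are assumptions, not a description of how the ROADS software itself reads templates.

```python
# A made-up record in attribute-value form. The resource and its values are
# hypothetical; "Template-Type" as an attribute name is an assumption.
SAMPLE_TEMPLATE = """\
Template-Type: DOCUMENT
Title: An Example Social Science Resource
Short-Title: Example Resource
Alternative-Title: Example SSR
Description: A fictional collection of working papers, used here
 only to show how a free-text description can span lines.
"""


def parse_template(text):
    """Parse 'Attribute: value' lines into a dictionary.

    Assumption: a line that starts with whitespace continues the value of
    the attribute on the preceding line.
    """
    record = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace() and last_key is not None:
            record[last_key] += " " + line.strip()
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        last_key = key.strip()
        record[last_key] = value.strip()
    return record


record = parse_template(SAMPLE_TEMPLATE)
print(record["Title"])
print(record["Description"])
```

Holding each record as a flat set of named attributes is what keeps the format simple: optional attributes are just absent keys, and no physical description is needed.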
Library cataloguing rules usually specify the chief sources of information for creating bibliographic descriptions. For example, the title page is the chief source of information for a printed book or pamphlet. In the case of Internet resources, however, identifying prescribed sources of information is more problematic. There is, in general, little bibliographic data available from the resource itself and the process of creating a descriptive record requires an element of detective work on the part of the cataloguer to find out who is responsible for the information, etc. In compensation, however, the Description attribute of ROADS templates allows for the provision of a much fuller free text description of a resource than can traditional catalogue records. This description can help the user decide whether or not a resource is of use to them before they connect to the resource itself. In SOSIG the description generally provides some (or all) of the following information:
- The nature of the resource e.g. an electronic journal, collection of reports, etc.
- Who is providing the information (author, organisation, etc.)
- The subject coverage or content of the resource
- Any geographical or temporal limits (e.g. that it covers only German language texts from 1994)
- Any form or process issues that might affect access or ease of use (charging, registration, need for any special software, etc.)
- If the resource is available in any other languages
Part of the function of the ISBDs and cataloguing rules based on them (like AACR2) is to help provide a standard framework for the description of bibliographic items. ROADS-based gateways like SOSIG have also developed cataloguing rules to help maintain consistency throughout their catalogue. Some consistency is also desirable for cross-searching with other ROADS-based gateways. With this in mind, the ROADS project has developed some generic cataloguing guidelines.
Headings (subject terms, classification, main entry)
Headings are the access points to a catalogue entry. For a book, the access points can be author(s), subject heading terms, classification scheme numbers, etc. For other forms of material other access points are needed. For music sound recordings, soloists and conductors may be noted as well as composers. Likewise for films, the leading actors and the director can be noted.
Access points into ROADS catalogues are generally by title, author, category (type of resource), subject keywords, or words from the description; however, these are flexible, depending on the type of resource that is being described and the requirements of the users of the gateway.
Authority control (personal and corporate authors, uniform titles, subject terms)
Authority control establishes preferred forms of names and terms. Alternative forms of names and terms are used to refer to the preferred forms and to provide alternative points of access. The need for this is twofold. The use of a single preferred term for entry allows searches to bring together all items by a specific author, or all editions of a specific work. The alternative points of access mean that searchers are not disadvantaged by only knowing the non-preferred form of name or subject term.
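The two functions of authority control described above, collocation under a preferred form and access via variant forms, can be sketched in a few lines of Python. The authority file, the catalogue entries and the function names are all hypothetical, and the choice of preferred heading is illustrative only; real authority files are far richer than a simple lookup table.

```python
# Hypothetical authority file: each preferred heading lists its variant forms.
AUTHORITY_FILE = {
    "Twain, Mark": ["Clemens, Samuel Langhorne", "Clemens, Samuel L."],
}

# Hypothetical catalogue entries, entered under whichever form appeared
# on the item.
CATALOGUE = [
    {"title": "Item A", "author": "Twain, Mark"},
    {"title": "Item B", "author": "Clemens, Samuel Langhorne"},
    {"title": "Item C", "author": "Austen, Jane"},
]


def normalise(name):
    """Lower-case a name and collapse runs of whitespace for matching."""
    return " ".join(name.lower().split())


def build_lookup(authority_file):
    """Map every preferred and variant form to its preferred heading."""
    lookup = {}
    for preferred, variants in authority_file.items():
        lookup[normalise(preferred)] = preferred
        for variant in variants:
            lookup[normalise(variant)] = preferred
    return lookup


def resolve(name, lookup):
    """Return the preferred heading, or the name unchanged if it is unknown."""
    return lookup.get(normalise(name), name)


def search_by_author(query, catalogue, lookup):
    """Find all items by an author, whichever form the searcher uses."""
    wanted = resolve(query, lookup)
    return [entry["title"] for entry in catalogue
            if resolve(entry["author"], lookup) == wanted]


LOOKUP = build_lookup(AUTHORITY_FILE)
print(search_by_author("clemens, samuel l.", CATALOGUE, LOOKUP))
```

A search on a non-preferred form is first resolved to the preferred heading, so both items by the same author are returned together.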
Bibliographic records currently exist in large numbers in catalogues and national bibliographies and are regularly exchanged and bought. A number of authority control systems are in place for use by those creating records. These include the British Library Name Authority List (BLNAL), the Anglo-American Authority Files (AAAF), the Library of Congress Subject Headings (LCSH) and the Medical Subject Headings (MeSH). Use of such standards means that there is consistency of practice within the community.
Authority files also have a place within Internet cataloguing, not only to help with consistency within a single service but also to facilitate the process of cross-searching other gateways and catalogues. Internet subject gateways, for example, can use subject headings like LCSH or MeSH. SOSIG uses authority files for assigning values to the country and language fields (these make use of ISO codes) and the Humanities And Social Science Electronic Thesaurus (HASSET) for assigning keywords.
Holdings and Access information
In library catalogues, questions about access are often related to holdings information. After users have found items in a catalogue, they will ask questions like: 'where do I find it?', 'can I borrow it?' and 'what equipment do I need to use it?'. Often a classification number or call number identifies its physical location within the library (unless it is on loan). An indicator code or phrase shows whether the item is available for loan - and for what time periods - or whether its use is restricted to the library building itself. The catalogue description should also indicate any required equipment by stating, for example, if a videocassette is VHS or Betamax, or if a sound recording is a vinyl disc, an audiocassette tape or a compact disc.
Internet gateways differ in that they are primarily concerned with access rather than holdings. As a general selection rule the ROADS-based subject gateways only catalogue information that is freely available over the Internet. In some cases restrictions may apply; this may be in terms of technology (e.g. the use of Java, PDF, etc.), cost (e.g. fee-based resources) or registration requirements. Where restrictions exist the catalogue description will include appropriate information for the user, e.g. that a PDF file will require the use of Adobe Acrobat Reader.
A subject gateway has a 'virtual' or 'linked' collection; only the catalogue records describing these resources are kept on the server (the resources themselves are held on computers around the world). The volatile nature of the Internet requires that the catalogue records have to be constantly checked to make sure that resource descriptions are kept current and that links to the resources are still working. An automated 'link checker' runs weekly over the SOSIG catalogue; this generates a list of dead links which then have to be updated or deleted from the database. SOSIG has a formal Collection Management Policy to help monitor the development of the collection and to provide guidelines about the selection and deselection of resources from the catalogue.
There are strong similarities between traditional library cataloguing practice and the creation by subject gateways of ROADS templates to describe Internet resources. Both are (in general) carried out by information professionals. The ROADS templates themselves have been designed to be compatible (where possible) with library cataloguing standards and formats, and mappings between ROADS templates and formats like MARC have been produced to allow records to be transferred to different formats or cross-searched.
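To give a sense of what such a mapping involves, the following Python sketch converts a template-style record into a list of MARC-like (tag, subfield, value) triples. It is an illustration only and does not reproduce the published mappings: the crosswalk table is a hypothetical selection, the `URI` attribute name is an assumption, and the output is a simplified structure, not a valid MARC record.

```python
# Illustrative crosswalk from template attributes to (MARC tag, subfield).
# The pairings are a hypothetical selection, not the published mapping.
CROSSWALK = {
    "Title": ("245", "a"),              # title statement
    "Alternative-Title": ("246", "a"),  # varying form of title
    "Description": ("520", "a"),        # summary note
    "URI": ("856", "u"),                # electronic location (assumed attribute name)
}


def to_marc_like(record, crosswalk=CROSSWALK):
    """Convert an attribute-value record to sorted (tag, subfield, value) triples.

    Attributes with no entry in the crosswalk are returned separately so
    that information lost in the conversion is visible.
    """
    fields = []
    unmapped = []
    for attribute, value in record.items():
        if attribute in crosswalk:
            tag, subfield = crosswalk[attribute]
            fields.append((tag, subfield, value))
        else:
            unmapped.append(attribute)
    fields.sort(key=lambda field: field[0])
    return fields, unmapped


# A made-up record; all values are hypothetical.
sample = {
    "URI": "http://example.org/papers/",
    "Title": "An Example Social Science Resource",
    "Description": "A fictional collection of working papers.",
    "Short-Title": "Example Resource",
}

marc_fields, not_mapped = to_marc_like(sample)
for tag, subfield, value in marc_fields:
    print(f"{tag} ${subfield} {value}")
print("Unmapped attributes:", not_mapped)
```

Reporting the unmapped attributes makes the cost of a conversion explicit, since simple and rich formats rarely correspond one to one.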
However, these comparisons can only be taken so far. Traditional libraries tend to have very demarcated responsibilities. Selection, acquisition, cataloguing and classification often are seen as four separate roles. Subject gateways tend to involve all of their staff in the discovery, evaluation, selection, cataloguing and classification of resources (although some of the SOSIG Section Editors have chosen to adopt the more traditional split of roles between selection and the cataloguing process).
Cataloguing rules and authority files are used in the creation of Internet catalogue records but these rules tend to be created and employed locally. There is as yet no set of widely adopted standards and guidelines in the area of Internet cataloguing to compete with the likes of AACR2.
The 'moving target' nature of the Internet requires that the metadata itself changes to reflect constantly changing resources. There is always an element of risk in choosing to spend time creating a catalogue record for a resource that may not exist in three months' time or may have changed beyond recognition. Whilst the Internet subject gateways have based much of their practice on library traditions, they have necessarily adapted these to deal with the very different nature of this new medium.
References
- Vianne T. Sha, "Cataloguing Internet resources: the library approach," The Electronic Library, 13 (5), October 1995, pp. 467-476; here p. 467.
- Argus Clearinghouse.
- Sha, "Cataloguing Internet resources," p. 468.
- ROADS (Resource Organisation And Discovery in Subject-based services).
- Dublin Core initiative.
- SOSIG: Social Science Information Gateway.
- e.g.: Paul Hofman, Emma Worsfold, Debra Hiom, Michael Day and Angela Oehler, Specification for resource description methods. Part 2: Selection criteria for quality controlled information gateways. Deliverable 3.2 for Workpackage 3 of DESIRE Project, May 1997.
- ISBD(ER): International Standard Bibliographic Description for Electronic Resources: revised from the ISBD(CF): International Standard Bibliographic Description for Computer Files. Recommended by the ISBD(CF) Review Group. (IFLA UBCIM Publications, New Series, Vol. 17). München: K. G. Saur, 1997.
- HASSET (Humanities and Social Sciences Electronic Thesaurus). Developed by The Data Archive at the University of Essex.
- Michael Day, Mapping between metadata formats.
UKOLN: the UK Office for Library and Information Networking
University of Bath
Bath BA2 7AY, UK
Institute for Learning and Research Technology (ILRT)
University of Bristol
8-10 Berkeley Square
Bristol BS8 1HH, UK