Web Magazine for Information Professionals

Making a MARC With Dublin Core

Jon Knight revisits his Perl module for processing MARC records that was introduced in the last issue and adds support for UNIMARC and USMARC, along with a script that converts Dublin Core metadata into USMARC records.

In the last issue of Ariadne the basic layout of the MAchine Readable Catalogue (MARC) records [1] used by most library systems worldwide was introduced. The article also described the first release of a Perl module that can be used for processing MARC records. Since that article was published, a number of people have been in touch saying that they were either developing similar in-house MARC processing software or planning to develop something similar for public use themselves.

As there is obviously some interest in this software, a second alpha release [2] has been produced that includes a first cut at support for UNIMARC [3] and USMARC [1] [4] style MARC records as well as the original BLCMPMARC [5]. The implementation of USMARC should make the Perl module more useful to a far wider range of people, as USMARC is rapidly becoming the de facto MARC format. The Marc.pm module attempts to determine the MARC format being read in from information held in the Leader, as described in the previous article. Whilst not perfect, this does work for some classes of MARC record (for example, it will distinguish all BLCMPMARC records from USMARC records, but can only tell a few USMARC and UNIMARC records apart). The new release also incorporates a number of bug fixes, so anyone who picked up the previous version is encouraged to pick up the new version.

This latest alpha release also includes a demonstration script that shows how the Perl module can be used. This script is intended to allow Dublin Core metadata that is embedded in HyperText Markup Language (HTML) documents to be extracted and converted into a skeleton MARC record. It is this script that this article will concentrate on.

Mapping Dublin Core to MARC

The Dublin Core Element Set (DCES) consists of fifteen basic elements [6]. These elements cover a broad range of basic metadata needs such as the title, authors, subjects and identifiers for resources. The basic definition of the DCES can be enhanced by the use of qualifiers, which provide a way of specifying a more precise interpretation of the metadata held in a particular Dublin Core element. A draft set of proposed qualifiers is currently under discussion [7] and, although the details are likely to change in the next few months, the basic mechanism will hopefully remain the same.

Now Dublin Core is intended to be a relatively simple and easy to create metadata format. Whilst keeping to this goal is not always easy, as there are often demands that it just must support some particular community's own needs, the Dublin Core format is capable of holding a far less richly defined set of metadata than the more complex MARC records. It is therefore necessary to have a mapping between the Dublin Core elements and their qualifiers and a basic subset of the tags in the MARC record.

This mapping definition needs to take into account the basic, unqualified form of a Dublin Core element as well as any special treatment that should be made for specific qualified instances of an element. Luckily such a mapping between Dublin Core and the USMARC format has already been developed by the Library of Congress [8]. This mapping was used as the basis for the demonstration script for the Perl module.

Dublin Core embedded in HTML 2.0/3.2

Another proposal being considered by the Dublin Core metadata community is the ability to embed Dublin Core elements directly into HTML documents. The exact format of this embedding is still the subject of ongoing discussions but a number of services now generate embedded Dublin Core metadata of the form:

<META NAME="DC.author" CONTENT="(TYPE=email) jon@net.lut.ac.uk">

The NAME attribute of the META element is used to hold the metadata schema (in this case “dc”, standing for Dublin Core), followed by a full stop and the name of the Dublin Core element. By explicitly stating the schema, this format allows other metadata besides Dublin Core to be embedded in an HTML document in a fairly standard way [9]. The CONTENT attribute holds the element’s value and any qualifiers.

Whilst this embedded format may change in the future, it follows the HTML 2.0 and 3.2 DTDs [10] [11] and has seen experimental operational deployment, so this is the format that the demonstration script makes use of.

Rather than complicate the dc2marc.pl demonstration script with a full SGML parser, the Metadata::MARC Perl module distribution uses much simpler pattern matching to extract the embedded Dublin Core metadata from the HTML documents. This makes use of Perl’s powerful regular expression features and, with careful construction of the regular expression, can handle most instances of the HTML META element, even if it is split over multiple lines. In fact, because the script does not use a proper Standard Generalised Markup Language (SGML) parser, it is able to extract META elements from HTML documents that don’t conform to the HTML DTDs (an example of the “be conservative in what you generate and tolerant in what you accept” policy that characterises many Internet Engineering Task Force (IETF) derived protocols).
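The pattern matching approach can be illustrated with a minimal sketch. It is written in Python rather than the Perl of dc2marc.pl, and the patterns and the extract_meta name are assumptions made for illustration; they are not taken from the script itself.

```python
import re

# Hypothetical pattern: a whole <META ...> tag, matched case-insensitively.
# Quoted strings are consumed as units, so the tag may span several lines
# and a ">" inside a quoted value does not end the match early.
META_TAG = re.compile(r"""<meta\b((?:"[^"]*"|'[^']*'|[^'">])*)>""", re.IGNORECASE)

# One attribute inside the tag: attr="value", attr='value' or attr=value.
ATTRIBUTE = re.compile(
    r"""([a-z][a-z0-9_.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def extract_meta(html):
    """Return one {attribute: value} dict per META element found in html."""
    elements = []
    for tag in META_TAG.finditer(html):
        attrs = {}
        for name, dq, sq, bare in ATTRIBUTE.findall(tag.group(1)):
            value = dq or sq or bare
            # Collapse line breaks and runs of whitespace inside the value.
            attrs[name.lower()] = " ".join(value.split())
        elements.append(attrs)
    return elements


if __name__ == "__main__":
    sample = '<META NAME="DC.title"\n  CONTENT="Making a MARC\n With Dublin Core">'
    for element in extract_meta(sample):
        print(element)
```

Because no DTD is consulted, unquoted attribute values and META elements outside the document HEAD are accepted as well, which is the tolerant behaviour described above.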

The script proceeds by extracting each META element one at a time. For each element a check is made to see if the element contains DC metadata. This is done by checking to see if the NAME attribute starts with “dc.” and if it does not, discarding that META element. If it does, the “dc.” is stripped off and the remains of the NAME attribute are tidied up to give the Dublin Core element name. The CONTENT attribute of the META element then has all qualifiers removed.
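A minimal Python sketch of this check follows. The function name is hypothetical, and two details are assumptions: the comparison is taken to be case-insensitive (the example above uses “DC.author”), and “tidying up” is taken to mean trimming whitespace and lower-casing.

```python
def dc_element_name(name_attr):
    """Return the Dublin Core element name, or None if NAME is not DC metadata."""
    name = name_attr.strip().lower()
    if not name.startswith("dc."):
        return None  # not Dublin Core, so the META element is discarded
    # Strip the "dc." prefix and tidy what remains.
    return name[len("dc."):].strip()


if __name__ == "__main__":
    for raw in ("DC.author", "keywords", " dc.Title "):
        print(repr(raw), "->", dc_element_name(raw))
```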

Qualifiers are recognised by being surrounded by brackets in front of the real Dublin Core element value in the CONTENT attribute. If the real value of the Dublin Core element also starts with a bracket, a space should be inserted in front of it, which the script also removes. It should be noted that this is only one of a number of proposed ways of embedding qualifiers in Dublin Core metadata inside HTML documents. Others involve using escaping mechanisms to distinguish between brackets used to surround qualifiers and those in the actual element value, or moving the qualifier into the META NAME attribute.
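The qualifier convention can be sketched as follows, again in Python rather than Perl. The split_content name and the exact pattern are assumptions; in particular, the sketch assumes every qualifier has the form (NAME=value) and that several may appear in a row.

```python
import re

# Hypothetical pattern for one leading qualifier such as "(TYPE=email)",
# followed by at most one separating space.
QUALIFIER = re.compile(r"\((\w+)\s*=\s*([^)]*)\)\s?")


def split_content(content):
    """Split a CONTENT attribute into (qualifiers dict, element value)."""
    qualifiers = {}
    pos = 0
    while True:
        match = QUALIFIER.match(content, pos)
        if match is None:
            break
        qualifiers[match.group(1).upper()] = match.group(2).strip()
        pos = match.end()
    # A value that itself starts with a bracket is protected by a leading
    # space, which stops the loop above; the space is removed here.
    return qualifiers, content[pos:].strip()


if __name__ == "__main__":
    print(split_content("(TYPE=email) jon@net.lut.ac.uk"))
    print(split_content(" (c) 1997 Ariadne"))
```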

Both the qualifiers and the element values are loaded into complex Perl data structures in the same way that the MARC data is processed in the Marc.pm module. There is one structure that holds the element value, keyed on the element name and the instance count for that element. Another structure holds the qualifier values, keyed on the element name, qualifier schema (i.e. TYPE, SCHEME, ROLE, etc.) and the count.
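The shape of these two structures can be sketched in Python, with tuple-keyed dictionaries standing in for the nested Perl hashes. The class and attribute names are hypothetical and do not come from dc2marc.pl or Marc.pm.

```python
from collections import defaultdict


class DublinCoreStore:
    """Holds element values and qualifier values keyed as described above."""

    def __init__(self):
        self.values = {}                # (element, instance) -> value
        self.qualifiers = {}            # (element, schema, instance) -> qualifier value
        self.counts = defaultdict(int)  # element -> number of instances seen

    def add(self, element, value, qualifiers=None):
        """Store one instance of an element and return its instance number."""
        instance = self.counts[element]
        self.counts[element] += 1
        self.values[(element, instance)] = value
        for schema, qualifier_value in (qualifiers or {}).items():
            self.qualifiers[(element, schema, instance)] = qualifier_value
        return instance


if __name__ == "__main__":
    store = DublinCoreStore()
    store.add("author", "jon@net.lut.ac.uk", {"TYPE": "email"})
    store.add("author", "Knight, Jon")
    print(store.values)
    print(store.qualifiers)
```

Keeping the instance count in the key allows repeated elements, such as several authors or subjects, to be stored without overwriting one another.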

Generating the MARC record

Once all of the META elements in the HTML document have been read in, the dc2marc.pl script then processes each of the fifteen DCES element types one at a time to fill in the MarcRecord data structure. For each element type, each instance of the element is examined. For those elements that include a mention of qualifiers in the mapping definition [8], each of these qualifiers is checked in turn and, if it is present, an entry is made in the MarcRecord data structure for the more precise interpretation of the metadata. Otherwise the script usually makes use of the more general mapping form, with the exception of the AUTHOR/CREATOR Dublin Core element, which is only inserted into the MARC record if the metadata is in the normal format of the name (rather than the author’s email address or affiliation, etc.).
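The decision logic might be sketched as below, in Python. The two tables contain only a few illustrative entries and are not a transcription of the mapping definition [8], which should be consulted for the real tags. The function name, the dictionary standing in for MarcRecord, and the assumption that a missing TYPE qualifier means a normal name are all hypothetical.

```python
# Illustrative entries only; the authoritative mapping is reference [8].
GENERAL_MAP = {
    "title": ("245", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "author": ("720", "a"),
    "creator": ("720", "a"),
}
QUALIFIED_MAP = {
    ("subject", "SCHEME", "LCSH"): ("650", "a"),
}


def add_to_record(record, element, value, qualifiers):
    """Add one element instance to record (tag -> list of (subfield, value)).

    Qualifier keys are expected in upper case. Returns True if an entry
    was made and False if the instance was skipped.
    """
    target = None
    for schema, qualifier_value in qualifiers.items():
        target = QUALIFIED_MAP.get((element, schema.upper(), qualifier_value.upper()))
        if target:
            break  # a more precise, qualified mapping applies
    if target is None:
        # Assumption: only names typed as "name", or untyped, are kept.
        is_name = qualifiers.get("TYPE", "name").lower() == "name"
        if element in ("author", "creator") and not is_name:
            return False
        target = GENERAL_MAP.get(element)
    if target is None:
        return False
    tag, subfield = target
    record.setdefault(tag, []).append((subfield, value))
    return True
```

With these tables, a subject qualified as (SCHEME=LCSH) is entered under the more precise tag, an unqualified subject falls back to the general one, and an author instance qualified as (TYPE=email) is skipped.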

Once all the Dublin Core META elements have been examined and entries have been made in the MarcRecord data structure, the script makes a call to the WriteMarcRecord subroutine in the Marc.pm module and the resulting MARC record is created. The WriteMarcRecord subroutine automatically generates the MARC record tags in the correct order (i.e. increasing numerical order).
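To give an idea of what writing a record involves, here is a minimal Python sketch that serialises variable data fields into the standard MARC exchange layout of a 24 character leader, a directory of 12 character entries, and the field data. It is not the WriteMarcRecord implementation: the write_marc name, the input structure and the fixed leader values are assumptions, only ASCII data is handled, and control fields (tags 001 to 009) are not treated specially.

```python
FIELD_END = "\x1e"      # field terminator
RECORD_END = "\x1d"     # record terminator
SUBFIELD_MARK = "\x1f"  # subfield delimiter


def write_marc(fields):
    """Serialise fields: tag -> list of (indicators, [(code, value), ...])."""
    directory = ""
    data = ""
    # Three-digit tags sort correctly as strings, giving increasing order.
    for tag in sorted(fields):
        for indicators, subfields in fields[tag]:
            body = indicators
            for code, value in subfields:
                body += SUBFIELD_MARK + code + value
            body += FIELD_END
            # Directory entry: tag (3), field length (4), start offset (5).
            directory += "%s%04d%05d" % (tag, len(body), len(data))
            data += body
    base_address = 24 + len(directory) + 1  # leader, directory, terminator
    record_length = base_address + len(data) + 1
    # Assumed leader values: new record, language material, monograph.
    leader = "%05dnam  22%05d   4500" % (record_length, base_address)
    return leader + directory + FIELD_END + data + RECORD_END


if __name__ == "__main__":
    record = write_marc({
        "653": [("  ", [("a", "perl")])],
        "245": [("00", [("a", "Making a MARC With Dublin Core")])],
    })
    print(repr(record))
```

Sorting the tags at write time means the caller can add fields in whatever order the Dublin Core elements happen to be processed.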

Conclusions

This article has detailed the working of a simple Dublin Core to USMARC metadata convertor written in Perl and based on the Library of Congress mapping definition [8]. It demonstrates the usage of the Marc.pm Perl module [2] and also provides a testbed for the mapping function. The MARC records it generates are nowhere near as full as a trained cataloguer would generate directly but it is hoped that they could be used as a skeleton for the cataloguer to work from. In this way it also demonstrates how embedded Dublin Core metadata can be used to help reduce the cost of cataloguing by allowing authors and publishers to provide machine readable versions of the metadata that cataloguers need.

References

  1. Library of Congress MARC Standards,
    http://lcweb.loc.gov/marc/marc.html
  2. Latest alpha release of MARC Perl module,
    ftp://ftp.roads.lut.ac.uk/pub/ROADS/contrib/Marc-Latest.tar.gz
  3. Brian P. Holt (ed.) with assistance from Sally J. McCallum and A.B. Long, UNIMARC Manual, 1987, IFLA UBCIM Programme, British Library Bibliographic Services, ISBN 0-903043-44-0.
  4. Walt Crawford, 1984, MARC for Library Use, Knowledge Industry Publications Inc, ISBN 0-96729-120-6.
  5. Talis MARC Manual Revision 4.0, 1994, BLCMP Library Services.
  6. Dublin Core elements,
    http://purl.org/metadata/dublin_core_elements
  7. Dublin Core Qualifiers,
    http://www.roads.lut.ac.uk/Metadata/DC-Qualifiers.html
  8. Dublin Core to USMARC Mapping,
    http://lcweb.loc.gov/marc/dccross.html
  9. Embedding Metadata in HTML 2.0,
    http://www.oclc.org:5046/~weibel/html-meta.html
  10. HyperText Markup Language 2.0,
    http://src.doc.ic.ac.uk/rfc/rfc1866.txt
  11. HyperText Markup Language 3.2,
    http://www.w3.org/pub/WWW/TR/REC-html32.html

Author Details

Jon Knight
works on the ROADS eLib project.
Email: jon@net.lut.ac.uk
ROADS Web Site: http://www.ukoln.ac.uk/metadata/roads/