Handling MARC With PERL

Jon Knight

Jon Knight investigates the inner workings of the MARC record's binary distribution format and presents the first cut at a Perl module to read and write MARC records.

The MAchine Readable Catalogue (MARC) format is probably one of the oldest and most widely used metadata formats today. It was developed in the United States during the 1960s as a data interchange format for monographs in the then newly computerised library automation systems. In the following years the MARC format became a standard for export and import of data to library systems in much of the world, and various national and vendor-enhanced variations on the original MARC format appeared. The original MARC became known as USMARC [1] and itself continued to develop to include the ability to hold metadata on works other than simple monographs. The other MARC spin-offs acquired their own names such as UKMARC, OZMARC, UNIMARC, CANMARC, PicaMARC, BLCMPMARC and so on. Recently considerable interest and effort have been expended in attempting to integrate many of these MARC variants back into USMARC so that we can return to the situation where we have one true MARC.

USMARC also underlies many of the Z39.50 based search and retrieval systems in use today. With Z39.50 slowly growing in popularity, there is now a need to be able to generate USMARC format records from other metadata sources such as Dublin Core [2] data, Apple MCF [3] files and ROADS [4] templates. It would also be useful if existing library catalogues could import and export metadata about some of the resources they have catalogued for use in systems built upon these other metadata formats.

Unfortunately, whilst the use of MARC is well understood by cataloguers and other librarians, the data format is often thought of by the techies as a bit of a mystery. The purpose of this article is to explain the basic MARC format, describe how the various national and vendor supplied MARC derivatives differ from USMARC and offer a description of a simple Perl module that can read and write MARC records.

Format of a MARC record

As the original MARC records were developed in the 1960s and the basic layout of the record has remained largely unchanged since then, the format of a MARC record owes a lot to the state of computer technology at that time. MARC is a binary format designed for writing to magnetic tape. It contains enough information to allow machines with what we would now consider to be very small memories to load just the parts of a MARC record that they are interested in whilst skipping over the rest. It was also designed with some of the popular computer languages of the time in mind, with a fixed length record header that would be easy to read into the data structures available. Structurally, therefore, MARC is quite simple; the complexity is derived from the fact that it can carry a wide variety of information in precise and detailed formats (such as those laid down by the AACR2 rules).

The basic layout [5] of a MARC record is a 24 byte leader, then a variable number of directory entries and then a variable number of variable sized fields. The fields are sometimes split into variable control fields and variable data fields. The layout of a whole MARC record is thus:

 +--------+-----------+-------------------------+----------------------+
 | Leader | Directory | Variable Control Fields | Variable Data Fields |
 +--------+-----------+-------------------------+----------------------+

The leader is designed to provide basic information about the size of the whole MARC record, the size of some of the data structures in the directory and data fields, and some very basic information about the type of work that the MARC record is describing. The layout of the leader in the original MARC format was:

    Byte        Name
    ----        ----
    0-4         Record Length
    5           Status (n=new, c=corrected and d=deleted)
    6           Type of Record (a=printed material)
    7           Bibliographic Level (m=monograph)
    8-9         Blanks
    10          Indicator count (2 for monographs)
    11          Subfield code count (2: the 0x1F delimiter plus the subfield code itself)
    12-16       Base address of data
    17-23       Blanks
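
As a rough illustration of how little work the fixed layout demands, the following Perl sketch splits a 24 byte leader into the elements listed above. The subroutine name parse_leader, the hash keys and the sample leader are all invented for this illustration; they are not taken from any existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a 24 byte leader into the elements of the
# original MARC layout. Returns nothing if fewer than 24 bytes are supplied.
sub parse_leader {
    my ($leader) = @_;
    return unless defined $leader && length($leader) >= 24;

    my %h;
    # 'a' keeps blanks intact, unlike 'A' which strips trailing spaces.
    @h{qw(record_length status type bib_level blanks
          indicator_count subfield_code_count base_address trailing)}
        = unpack 'a5 a1 a1 a1 a2 a1 a1 a5 a7', $leader;

    # The two numeric elements are stored as zero padded decimal text.
    $h{record_length} += 0;
    $h{base_address}  += 0;
    return \%h;
}

# Made-up leader: a new (n) printed (a) monograph (m), 714 bytes long.
my $leader = '00714' . 'nam' . (' ' x 2) . '22' . '00217' . (' ' x 7);
my $info   = parse_leader($leader);
printf "length=%d status=%s type=%s level=%s data starts at %d\n",
    @{$info}{qw(record_length status type bib_level base_address)};
```

Because every element sits at a fixed offset, a single unpack template is enough; no scanning or delimiter handling is needed until the fields themselves are read.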

The leaders in some of the newer versions of USMARC and in some of the other MARC variants differ subtly from this original version, and the number of "Type of Record" entries has typically increased [6]. For example the structure of the UNIMARC Leader (according to the 1987 UNIMARC Manual) is:

    Byte        Name
    ----        ----
    0-4         Record Length
    5           Status
    6           Type of record (a=language material, printed
                                b=language material, manuscript
                                c=music scores, printed
                                d=music scores, manuscript
                                e=cartographic materials, printed
                                f=cartographic materials, manuscript
                                g=projected and video material
                                i=sound recordings, non-musical
                                j=sound recordings, musical
                                k=2D graphics (pictures, etc)
                                l=computer media
                                m=multimedia
                                r=3D artifacts and realia)
    7           Bibliographic Level    (a=analytical
                                        m=monograph
                                        s=serial
                                        c=collection)
    8           Hierarchical Level Code (blank=undefined, 0=no hierarchical
                        relationships, 1=highest level record, 2= record
                        below highest level)
    9           Blank
    10          Indicator length (2 as in MARC II)
    11          Subfield code length (2 as in MARC II)
    12-16       Base address of data
    17          Encoding level (blank=full level, 1=sublevel 1, 2=sublevel 2,
                        3=sublevel 3)
    18          Descriptive Cataloguing Form (blank=record is full ISBD,
                        n=record is in non-ISBD format, i=record is in
                        an incomplete ISBD format)
    19          Blank
    20          Length of length field in directory (always 4 in UNIMARC)
    21          Length of Starting Character Position in directory (always
                        5 in UNIMARC)
    22          Length of implementation defined portion in directory (always
                        0 in UNIMARC)
    23          Blank

Some of these differences in the leader can be used to help software distinguish one MARC record from another. For example, in BLCMPMARC the last byte of the leader is always the character 'g'. In most other MARC variants, including USMARC, this byte is a blank. Therefore we can treat the presence of a 'g' at the end of the leader as a hint that the MARC record is in BLCMPMARC format. Such hints may not always be correct (for example there might be another MARC variant that in some circumstances has a 'g' in the last character position) and in some cases there is no way to tell different MARC variants apart from the leader. This is a pity as the different MARC formats sometimes have completely different interpretations for the data in a particular field and so it is handy to know what MARC variant you are using. In some cases it can even make processing the rest of the record more difficult as we shall see in a moment.
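
A hint of this kind can be expressed in a couple of lines of Perl. The sketch below is only an illustration: the subroutine name guess_marc_variant and the 'Unknown' return value are invented here, and a 'g' in the last byte is treated as nothing more than a hint, for the reasons just given.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the MARC variant from the leader alone.
# Only the BLCMPMARC hint described in the text is implemented.
sub guess_marc_variant {
    my ($leader) = @_;
    return 'Invalid' unless defined $leader && length($leader) == 24;
    return 'BLCMP'   if substr($leader, 23, 1) eq 'g';
    return 'Unknown';    # a blank here fits USMARC and most other variants
}

# Two made-up leaders that differ only in the final byte.
my $start      = '00714' . 'nam' . (' ' x 2) . '22' . '00217';
my $blcmp_like = $start . (' ' x 6) . 'g';
my $other      = $start . (' ' x 7);

print guess_marc_variant($blcmp_like), "\n";
print guess_marc_variant($other), "\n";
```

A fuller version would combine several such hints and would still need a fallback for records that no hint can classify.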

The directory

The directory is used to indicate where each field starts and how long it is. The directory can be of variable length (though there is an implicit maximum size imposed by the fact that the MARC record can be at most 99999 bytes long in total) and consists of a number of 12 byte directory entries. In most MARC variants the format of the directory entry is:

    Byte        Name
    ----        ----
    0-2         Tag
    3-6         Length
    7-11        Start Address

However there are some oddities around. For example in BLCMPMARC the directory structure is a little different:

    Byte        Name
    ----        ----
    0-2         Tag
    3           Level
    4-6         Length
    7-11        Start Address

As can be seen BLCMP have sacrificed the length of each field to allow them to include a "level" element in the directory. This is to allow them to record the level within an analytical work that the particular field applies to (rather like byte 8 of the Leader in UNIMARC except that it works on a tag by tag basis rather than applying to a whole record). If one processes a record that is in BLCMPMARC as though it were in, say, UNIMARC, there is potential for the software to misinterpret the directory entries for analytical works and assume that the length of the field is greater than it actually is. This could result in rubbish being read in from the MARC record and is an example of a situation where not knowing which version of MARC you are handling can lead to some unpleasant surprises.
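
The effect is easy to demonstrate. The Perl sketch below reads the same 12 byte directory entry using both layouts; the subroutine name parse_directory, its arguments and the sample entry are made up for the purpose, and the directory string is assumed to contain only whole entries.

```perl
use strict;
use warnings;

# Hypothetical helper: split a directory into its 12 byte entries.
# $variant selects the BLCMP layout; anything else uses the common layout.
sub parse_directory {
    my ($dir, $variant) = @_;
    return unless length($dir) % 12 == 0;

    my @entries;
    for (my $off = 0; $off < length $dir; $off += 12) {
        my $entry = substr($dir, $off, 12);
        my %e;
        if ($variant eq 'BLCMP') {
            # tag(3) level(1) length(3) start(5)
            @e{qw(tag level length start)} = unpack 'a3 a1 a3 a5', $entry;
        }
        else {
            # tag(3) length(4) start(5)
            @e{qw(tag length start)} = unpack 'a3 a4 a5', $entry;
        }
        $e{length} += 0;    # zero padded decimal text to number
        $e{start}  += 0;
        push @entries, \%e;
    }
    return \@entries;
}

# A made-up BLCMP entry: tag 245, level 1, length 45, start address 120.
my $entry = '245' . '1' . '045' . '00120';

for my $variant ('BLCMP', 'UNIMARC') {
    my $e = parse_directory($entry, $variant)->[0];
    print "$variant reading: tag $e->{tag}, length $e->{length}\n";
}
```

Read with the common layout, the level digit is absorbed into the length, so the 45 byte field appears to be 1045 bytes long.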

The Fields

The fields are the part of the MARC record that actually holds the bulk of the metadata about the work. There are two main types of field: control fields and data fields. The control fields are used in some systems to carry information such as the control number of the work. There are typically only a few control fields in a single MARC record. The data fields are usually more numerous and contain the bibliographic metadata about the work.

The structure of both control and data fields is identical, however: they both start with an indicator field and then have a number of subfields. The size of the indicator field is specified in the Leader of the MARC record (the indicator count at byte 10) and is typically two bytes. Each subfield is preceded by a subfield delimiter, which is the character 0x1F, and a subfield code, which is usually a single character (although there is provision for multi-character subfield codes by specifying a subfield code length greater than 2 in the Leader at byte 11). In many library systems the subfield delimiter is represented on screen as a "$", although it must be emphasized that this is just an on-screen rendering; in the binary file 0x1F is always used.
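
Taking a field apart along these lines might look like the following Perl sketch. The subroutine name split_field and the sample field are invented for illustration; single character subfield codes are assumed, as is a field whose terminating delimiter has already been removed.

```perl
use strict;
use warnings;

# Hypothetical helper: split a field into its indicators and a list of
# [code, value] pairs. Assumes single character subfield codes.
sub split_field {
    my ($field, $indicator_len) = @_;
    return unless length($field) >= $indicator_len;

    my $indicators = substr($field, 0, $indicator_len);
    my @subfields;
    for my $chunk (split /\x1F/, substr($field, $indicator_len)) {
        next if $chunk eq '';    # nothing precedes the first delimiter
        push @subfields, [ substr($chunk, 0, 1), substr($chunk, 1) ];
    }
    return ($indicators, \@subfields);
}

# Made-up field: indicators "10", then subfields a and b.
my ($ind, $subs) = split_field("10\x1FaHandling MARC\x1FbWith Perl", 2);
print "indicators: '$ind'\n";
printf "\$%s %s\n", @$_ for @$subs;    # "$" is only the on-screen rendering
```

The indicator length is passed in rather than hard coded because it comes from byte 10 of the Leader.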

At the end of each field within the record there is either an end of field delimiter or an end of record delimiter. The end of field delimiter is used at the end of all fields except for the very last one, in which case the end of record delimiter is used. The lengths of the fields specified in the directory entry for a tag include the field delimiters, so this is a handy way to "sanity check" the MARC record as it is being read in. If the directory entries are not structured as the software assumes, there is a good chance that at least one tag will have too much or too little data read in and the field will not be terminated by such a delimiter. The software can then either abort at this point or use the information to make a better informed guess as to the MARC variant that it is attempting to process and try again.
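
A minimal version of this sanity check is sketched below in Perl. The delimiter values used, 0x1E for end of field and 0x1D for end of record, are those defined by ISO 2709 rather than anything stated above, and the subroutine name field_is_terminated and the sample data are invented for the example.

```perl
use strict;
use warnings;

use constant END_OF_FIELD  => "\x1E";    # ISO 2709 field terminator
use constant END_OF_RECORD => "\x1D";    # ISO 2709 record terminator

# Hypothetical helper: true if the field described by a directory entry
# ($start, $length) ends in one of the two delimiters.
sub field_is_terminated {
    my ($data, $start, $length) = @_;
    return 0 if $length < 1 || $start + $length > length $data;
    my $last = substr($data, $start + $length - 1, 1);
    return ($last eq END_OF_FIELD || $last eq END_OF_RECORD) ? 1 : 0;
}

# Made-up field area: a 10 byte field followed by an 11 byte field.
my $data = "10\x1FaTitle" . END_OF_FIELD . "  \x1FaAuthor" . END_OF_RECORD;

print field_is_terminated($data, 0, 10),  "\n";    # directory read correctly
print field_is_terminated($data, 10, 11), "\n";    # last field
print field_is_terminated($data, 0, 8),   "\n";    # misread length
```

The check accepts either delimiter at the end of any field, so it does not depend on exactly how the final field of the record is closed.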

Processing with Perl

One thing that would be very useful for the metadata handling communities would be some freely available tools for processing and generating MARC records. Whilst any general purpose computer language could be used to handle MARC records, the approach this work has taken is to start working on a MARC processing module for Perl. The reason for this is partly because Perl offers some nice modularization features and data structures that make producing an Application Programming Interface (API) relatively painless. It is also widely used in the web, library systems and metadata communities for building Common Gateway Interface scripts, OPAC support tools and metadata handling systems. Lastly it's my programming language of choice and it fits in with the other tools I'm writing! :-)

An alpha release of the Perl module discussed here is available online [7] and interested parties are encouraged to download it and have a play. It's still in a pretty raw state; at the moment it only knows how to read and write BLCMPMARC (the MARC format that we use at Loughborough University) and exports just two subroutines into the program that uses it. These subroutines are ReadMarcRecord and WriteMarcRecord and, as their names suggest, they are used to read and write MARC records.

The ReadMarcRecord subroutine currently takes a single argument that is the file handle of an input stream to read the MARC record in from. The first thing that this routine does is read the Leader in by reading in the first 24 bytes from the current file pointer position. It then calls another routine that reads in the directory entries and then loads in the actual fields themselves. The information from both the Leader and the fields is then loaded into a Perl 5 style complex hashed data structure. The format of this data structure is:

    $MarcRecord = {
      marc_type => "BLCMP",
      status => substr($Leader,5,1),
      type => substr($Leader,6,1),
      class => substr($Leader,7,1),
      indicator_count => substr($Leader,10,1),
      subfield_mark_count => substr($Leader,11,1),
      encoding_level => substr($Leader,17,1),
      analytical_record_ind => substr($Leader,18,1),
      source_of_record => substr($Leader,19,1),
      on_union_flag => substr($Leader,20,1),
      scp_length => substr($Leader,21,1),
      general_record_des => substr($Leader,23,1),
      data => { %FIELD },
      level => { %DIR_LEVEL },
    };

In the above description, $Leader is a 24 byte buffer holding the Leader information. The marc_type element of the MarcRecord data structure is used by the module to tell the user's software what sort of MARC record it thinks it has read in. At the moment valid values for this element are Invalid, BadDirectory and BLCMP. The first two indicate that either a Leader could not be read in (because there were fewer than 24 bytes left in the input file, for example) or that the directory entries could not be read properly. The last indicates the variant of MARC that the routine thinks the record is in; at the moment only BLCMP MARC is recognised but other variants will be added based on the techniques described in the previous sections (initially USMARC and UNIMARC will be added, with other MARCs being added later if documentation about them can be secured or other individuals on the Internet contribute code patches).

Following the marc_type element in the MarcRecord data structure is a set of elements that hold the contents of the Leader. These will vary depending upon the MARC variant being read, but there is a core of these, such as the Record Length, Record Status, Record Type, Indicator Count and Subfield Mark Count, that is likely to be present in all MARC formats.

Next come the actual fields themselves. These are held in a Perl 5 hash of arrays (a feature not available in earlier versions of Perl and many other languages). The hash key is the MARC tag for that field in the MARC record and the array index is the instance of that particular tag (to allow for more than one instance of a particular MARC tag within the MARC record). In the initial alpha release the field contents are then just dumped straight into this hash of arrays. In a later release it may be found to be worthwhile to split off the indicator and each of the subfields and place these in their own data structures within this hash of arrays.
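
To make the shape of this hash of arrays concrete, here is a small self-contained Perl sketch. It builds a structure by hand rather than calling the module, so the subroutine name build_sample_record and the tags and field contents are made up; only the marc_type and data elements come from the structure described above.

```perl
use strict;
use warnings;

# Hypothetical helper: build a record by hand in the shape described
# above, with the raw field contents keyed by tag and then by instance.
sub build_sample_record {
    my %FIELD;
    push @{ $FIELD{'245'} }, "10\x1FaHandling MARC with Perl";
    push @{ $FIELD{'650'} }, " 0\x1FaMARC formats";
    push @{ $FIELD{'650'} }, " 0\x1FaPerl";    # second instance of tag 650
    return { marc_type => 'BLCMP', data => { %FIELD } };
}

my $MarcRecord = build_sample_record();

for my $tag (sort keys %{ $MarcRecord->{data} }) {
    my $instances = $MarcRecord->{data}{$tag};
    for my $i (0 .. $#{$instances}) {
        # Show the 0x1F delimiter as "$", as many library systems do.
        (my $shown = $instances->[$i]) =~ tr/\x1F/$/;
        printf "%s[%d]: %s\n", $tag, $i, $shown;
    }
}
```

A repeated tag simply becomes a longer array under the same key, so software that only expects one instance can read element 0 and ignore the rest.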

Lastly, in the initial release there is another hash of arrays that is keyed and indexed in the same way as the fields themselves but which contains the level code from the BLCMP Directory entries. This is an example of an element of the MarcRecord that will only be valid for a limited subset of the potential MARC record variants. Software that makes use of this Perl module should be aware that if it relies on these proprietary or localised extensions and alterations to the basic USMARC format, it must be prepared to deal with cases where it meets MARC records that do not contain this information.

The WriteMarcRecord subroutine is similar to the ReadMarcRecord routine and makes use of the same MarcRecord data structure. It currently takes two parameters: a file handle for the output stream and the MarcRecord data structure. The variant of MARC that it writes out is set by the marc_type element in the MarcRecord. It works by constructing the fields and directory sections of the MARC record, then generating a Leader and concatenating the three sections of the record.
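
The general idea of building the three sections and joining them can be sketched as follows. This is not the module's own code: build_record is a made-up name, the common 12 byte directory entry is used rather than the BLCMP one, the fixed leader bytes are sample values, and the terminators follow the ISO 2709 conventions (0x1E closing the directory and every field, and a single 0x1D closing the record), which should be checked against the MARC variant actually being written.

```perl
use strict;
use warnings;

use constant END_OF_FIELD  => "\x1E";    # ISO 2709 field terminator
use constant END_OF_RECORD => "\x1D";    # ISO 2709 record terminator

# Hypothetical helper: build a record from [tag, content] pairs.
# Contents are assumed to be byte strings, so length() counts bytes.
sub build_record {
    my (@fields) = @_;
    my ($directory, $data) = ('', '');

    for my $f (@fields) {
        my ($tag, $content) = @$f;
        die "tag must be 3 characters\n" unless length($tag) == 3;
        my $field = $content . END_OF_FIELD;
        die "field too long\n" if length($field) > 9999;
        # Directory entry: tag, 4 digit length, 5 digit start address.
        $directory .= $tag . sprintf('%04d%05d', length($field), length($data));
        $data .= $field;
    }
    $directory .= END_OF_FIELD;

    my $base  = 24 + length $directory;          # base address of data
    my $total = $base + length($data) + 1;       # +1 for the end of record byte
    die "record too long\n" if $total > 99999;

    my $leader = sprintf('%05d', $total) . 'nam' . (' ' x 2) . '22'
               . sprintf('%05d', $base) . (' ' x 7);
    return $leader . $directory . $data . END_OF_RECORD;
}

my $record = build_record([ '001', '12345' ], [ '245', "10\x1FaTitle" ]);
printf "built a record of %d bytes\n", length $record;
```

The Leader has to be generated last because it records both the total length and the base address of the data, neither of which is known until the directory and fields have been assembled.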

Conclusions

This document has tried to explain the structure of the MARC record and has pointed out some of the differences that can appear in the data structures of different MARC variants. It also describes the API of a prototype Perl module to handle MARC records. In the next issue, this API will be developed to handle more MARC variants and some examples will be presented that demonstrate how to use the module to provide conversions between MARC and other metadata formats. It is hoped that this module will prove to be of use to the library and metadata communities and that it will grow and develop over time into a flexible MARC handling package. Even if it merely demystifies MARC for some people it will have served a useful role. Any comments, suggestions and feedback on this article and the associated Perl module are most welcome.

References

[1] Library of Congress USMARC Web-based information,
< http://lcweb.loc.gov/marc/marc.html >

[2] Dublin Core elements,
< http://purl.org/metadata/dublin_core_elements/ >

[3] Apple MCF Research Information,
< http://mcf.research.apple.com/ >

[4] ROADS Web pages,
< http://www.ukoln.ac.uk/roads/ >

[5] The USMARC Formats: Background and Principles,
< http://lcweb.loc.gov/marc/96principl.html >

[6] Expanded definition of USMARC Leader/06 type of record,
< http://lcweb.loc.gov/marc/leader06.html >

[7] Alpha release of MARC Perl module,
< ftp://ftp.roads.lut.ac.uk/pub/ROADS/contrib/Marc-0.01.tar.gz >

Author Details

Jon Knight works on the ROADS eLib project at the University of Loughborough, UK
Email: jon@net.lut.ac.uk
Personal Web Page: < http://www.roads.lut.ac.uk/People/jon.html >