Web Magazine for Information Professionals

Metadata: BIBLINK.Checksum

Ian Peacock and Andy Powell describe a proposed algorithm for calculating a checksum for Web pages.

BIBLINK [1] is a project funded within the Telematics for Libraries programme of the European Commission. It is investigating the bi-directional flow of information between publishers and National Bibliographic Agencies (NBAs) and is specifically concerned with information about the publication of electronic resources. Such resources include both on-line publications (Web pages, electronic journals, etc.) and electronic publications on physical media such as CD-ROMs.

The project has recently finalised the Functional Specification for the ‘BIBLINK workspace’ - a shared, virtual workspace for the exchange of metadata between publishers, NBAs and other third parties such as the ISSN International Centre. The workspace will allow publishers to ‘upload’ metadata about electronic publications using email or the Web. NBAs and third parties will be able to ‘download’ this metadata, enhance it in various ways and then ‘upload’ the enhanced metadata back to the workspace. The intention is that NBAs will use the enhanced metadata as the basis of a record in the national bibliography if appropriate. Finally, publishers will be able to ‘download’ the enhanced metadata for use in their own systems. The metadata will be stored and exchanged in several syntaxes, including HTML, SGML, UNIMARC and the national MARC formats of the participating partners.

Development of the software for the BIBLINK workspace and a demonstrator based on it will begin in the near future. The software development for the workspace has been sub-contracted to Jouve, Paris.

BIBLINK metadata

As part of its background research, the project has identified the metadata requirements of publishers and NBAs in the scenario described above. The BIBLINK metadata set [2] comprises most of the Dublin Core [3] plus several additional elements. As with the Dublin Core, BIBLINK metadata can be embedded into Web pages using the HTML META element.

The project was especially concerned with ensuring the long-term authenticity of the bibliographic records that are created using the BIBLINK workspace [4]. A working definition of authentication was developed:

BIBLINK shall take ‘authentication’ to mean the guarantee that a piece of metadata actually describes a given electronic publication, and only that publication. In other words, there is a one-to-one relationship between an electronic publication and its metadata and this relationship can be authenticated.

To achieve this level of authentication, one of the BIBLINK metadata elements is used to hold a checksum (or message digest) of the resource being described. It is known as the BIBLINK.Checksum. By storing a checksum as part of the metadata, it is possible to determine if a resource has been modified since its metadata was created.

What is a checksum?

A checksum is a computed value which depends on the contents of a ‘block of data’. (The block of data is often referred to as a ‘message’). A common use for simple checksums is to validate data integrity after transmission or storage, by calculating a checksum before and after transmission and comparing them. One-way hash functions (such as MD5 [5]) are a type of checksum which have additional properties (such as being difficult to reverse) that are usually used within cryptography.
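The before-and-after comparison described above can be illustrated with a minimal sketch. It uses CRC-32 from Python's standard zlib module; the sample message and the simulated corruption are invented for illustration and are not part of BIBLINK.

```python
import zlib


def crc32_checksum(data: bytes) -> int:
    """Return the CRC-32 checksum of a block of data as an unsigned integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Hypothetical message; the checksum is calculated before "transmission".
sent = b"Metadata about an electronic publication"
checksum_before = crc32_checksum(sent)

# Simulate storage or transmission: one intact copy and one corrupted copy.
received_ok = bytes(sent)
received_bad = sent.replace(b"electronic", b"electronjc")

print(crc32_checksum(received_ok) == checksum_before)   # True: data intact
print(crc32_checksum(received_bad) == checksum_before)  # False: data changed
```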

Some common uses of checksums are:

A number of algorithms are in use for creating checksums including CRC-32 [6], DES-MAC [7], MD4 [8] and MD5.

The MD5 checksum

Created in 1992 by Ron Rivest for RSA Data Security Incorporated, the MD5 cryptographic hash function is widely used by applications requiring calculation of message digests.

MD5 is a hash function, i.e. a function that takes a variable-length input and returns a fixed-length output. The MD5 algorithm produces a 128-bit digest. The digest is computed deterministically from the input, but the algorithm is designed so that it is computationally infeasible to recover the input from the digest or to construct a message with a given digest. Because it condenses the message in this way, the output is known as a ‘message digest’ or ‘checksum’, and it serves as a compact fingerprint of the message from which it was generated.
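The fixed-length property can be seen in a short sketch using Python's standard hashlib module. The helper name md5_hex is ours; the first two inputs are test strings listed in RFC 1321.

```python
import hashlib


def md5_hex(message: bytes) -> str:
    """Return the MD5 digest of a message as 32 hexadecimal characters."""
    return hashlib.md5(message).hexdigest()


# Inputs of very different lengths all give a 128-bit (32 hex digit) digest.
for message in (b"", b"abc", b"a" * 10_000):
    digest = md5_hex(message)
    print(len(message), len(digest) * 4, digest)
```

Each line of output shows the input length in bytes, the digest length in bits (always 128) and the digest itself.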

BIBLINK.Checksum algorithm

The project chose to base the BIBLINK.Checksum on the MD5 hash function. This was primarily because MD5 is widely used in other applications and there is source code available for it in the public domain. For example, there is an MD5 module available for Perl [9].

The algorithm described here can be used to compute an MD5 hash for HTML pages on the Web.

Inline objects

Although some Web pages simply consist of a single HTML file, many are composed of a number of ‘inline’ objects. These objects are stored separately but are retrieved along with the HTML and displayed by the Web browser to form a complete document. Examples of inline objects are images, applets and ActiveX controls.

Inline objects form an inherent part of a resource. If a diagram in a document changes, one typically considers that the document itself has changed. Other ‘linked’ resources require some action on behalf of the user to be retrieved, such as a mouse click. Examples of linked resources include web pages that are hyperlinked to the original page or that require clicking a button for display.

The BIBLINK.Checksum algorithm defines a set of inline objects that are included in the checksum calculation. They are:

Other externally linked resources are not involved in the calculation.

The algorithm

The following algorithm is proposed:

  1. Retrieve the HTML page from the Web.
  2. Remove all the <META> elements from the HTML, including any surrounding white space (space, tab and end-of-line characters).
  3. Compute the MD5 hash of the resulting HTML.
  4. Retrieve any inline objects referenced by the page (see above).
  5. Compute the MD5 hash of each inline object.
  6. Combine all the hashes by concatenating them in the order in which the objects appear in the page (the page’s MD5 hash first).
  7. Compute the MD5 hash of the combination.
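The seven steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's Perl implementation: only IMG SRC values are treated as inline objects, the hashes are concatenated in their hexadecimal form, and simple regular expressions stand in for a real HTML parser (a META element whose attribute values contain ‘>’ would not be removed correctly). All function names, URLs and sample pages are hypothetical.

```python
import hashlib
import re
from typing import Callable
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: a META element plus any white space immediately around it.
META_RE = re.compile(rb"\s*<meta\b[^>]*>\s*", re.IGNORECASE)
# Assumption: only IMG SRC values are treated as inline objects in this sketch.
IMG_SRC_RE = re.compile(
    rb"<img\b[^>]*?\bsrc\s*=\s*[\"']?([^\"'\s>]+)", re.IGNORECASE
)


def http_fetch(url: str) -> bytes:
    """Retrieve a resource from the Web and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def biblink_checksum(url: str, fetch: Callable[[str], bytes] = http_fetch) -> str:
    """Compute a checksum for a page following steps 1 to 7."""
    html = fetch(url)                                  # step 1
    stripped = META_RE.sub(b"", html)                  # step 2
    hashes = [hashlib.md5(stripped).hexdigest()]       # step 3
    for match in IMG_SRC_RE.finditer(stripped):        # step 4, in page order
        inline_url = urljoin(url, match.group(1).decode("ascii", "replace"))
        hashes.append(hashlib.md5(fetch(inline_url)).hexdigest())  # step 5
    combined = "".join(hashes).encode("ascii")         # step 6 (hex form assumed)
    return hashlib.md5(combined).hexdigest()           # step 7


# Hypothetical in-memory "Web" so that the sketch runs without network access.
DEMO_PAGES = {
    "http://example.org/index.html": (
        b'<html><head>\n<META NAME="DC.Title" CONTENT="Example">\n</head>'
        b'<body><img src="logo.gif"></body></html>'
    ),
    "http://example.org/logo.gif": b"GIF89a-fake-image-bytes",
}

print(biblink_checksum("http://example.org/index.html", DEMO_PAGES.__getitem__))
```

Passing the retrieval function as a parameter keeps the calculation separate from the network access, so the same code can be exercised against stored copies of a page and its inline objects.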

By computing the BIBLINK.Checksum for a Web page and comparing it with the previously computed checksum stored in the metadata for the page, it is possible to check whether the page has been modified since its metadata was created. Although a simple check of the last modification date of the Web page might give the same information, this does not check whether any of its inline components have changed.
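The comparison can be sketched as below. The sketch assumes that the stored checksum is embedded in the page as a META element of the form NAME="BIBLINK.Checksum" CONTENT="...", with NAME appearing before CONTENT, and that it is held as 32 hexadecimal digits. Both are assumptions made for illustration, and the function names are hypothetical. The freshly computed checksum is passed in as a parameter.

```python
import re
from typing import Optional

# Assumption: <META NAME="BIBLINK.Checksum" CONTENT="32 hex digits">,
# with the NAME attribute appearing before CONTENT.
CHECKSUM_META_RE = re.compile(
    r"<meta\b[^>]*\bname\s*=\s*[\"']?BIBLINK\.Checksum[\"']?"
    r"[^>]*\bcontent\s*=\s*[\"']?([0-9a-fA-F]{32})",
    re.IGNORECASE,
)


def stored_checksum(html: str) -> Optional[str]:
    """Return the checksum embedded in the page's metadata, if there is one."""
    match = CHECKSUM_META_RE.search(html)
    return match.group(1).lower() if match else None


def is_unmodified(html: str, current_checksum: str) -> Optional[bool]:
    """Compare the stored checksum with a freshly computed one.

    Returns None when the page carries no stored checksum.
    """
    stored = stored_checksum(html)
    if stored is None:
        return None
    return stored == current_checksum.lower()


page = (
    '<html><head><META NAME="BIBLINK.Checksum" '
    'CONTENT="900150983cd24fb0d6963f7d28e17f72"></head><body></body></html>'
)
print(is_unmodified(page, "900150983CD24FB0D6963F7D28E17F72"))  # True
print(is_unmodified(page, "d41d8cd98f00b204e9800998ecf8427e"))  # False
```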

Issues

A number of issues arise in connection with calculating a checksum for documents on the Web.

Dynamic content generated within pages via CGI or Server Side Includes (SSIs) could mean that the document is different when accessed at different times. This would result in a different checksum even where the dynamic content is insignificant, such as a current date or retrieval time.
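A small sketch shows the effect. The page template and the two retrieval times are invented for illustration; they stand in for a page in which a Server Side Include inserts the time of retrieval.

```python
import hashlib

# Hypothetical page in which the server inserts the time of retrieval.
TEMPLATE = (
    "<html><body><p>Annual report</p>"
    "<p>Retrieved: {stamp}</p></body></html>"
)


def page_hash(stamp: str) -> str:
    """Return the MD5 hash of the page as served at the given time."""
    return hashlib.md5(TEMPLATE.format(stamp=stamp).encode("utf-8")).hexdigest()


first = page_hash("10:00")
second = page_hash("10:05")

# The substantive content is identical, yet the hashes differ.
print(first == second)  # False
```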

Currently HTML frames are not dealt with as might be expected. BIBLINK.Checksum calculates the checksum on the HTML page containing the FRAMESET element and its FRAME elements (with their SRC attributes). This approach means that the contents of the frames, as would be seen within a browser, are not used in the checksum calculation. This may be rectified in the future if we consider the individual frame sources as inline to the frameset page.
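If frame sources were to be treated as inline objects, the first step would be to collect their URLs from the frameset page. The following sketch shows one way this might be done; it is not part of the current algorithm, and the function name, sample markup and URLs are hypothetical.

```python
import re
from typing import List
from urllib.parse import urljoin

# Matches the SRC value of FRAME elements; the word boundary after "frame"
# prevents a match on the FRAMESET element itself.
FRAME_SRC_RE = re.compile(
    r"<frame\b[^>]*?\bsrc\s*=\s*[\"']?([^\"'\s>]+)", re.IGNORECASE
)


def frame_sources(frameset_html: str, base_url: str) -> List[str]:
    """Return absolute URLs of the frame sources, in document order."""
    return [
        urljoin(base_url, match.group(1))
        for match in FRAME_SRC_RE.finditer(frameset_html)
    ]


frameset = (
    '<FRAMESET COLS="20%,80%">'
    '<FRAME SRC="menu.html" NAME="menu">'
    '<FRAME NAME="main" SRC="/docs/main.html">'
    "</FRAMESET>"
)
print(frame_sources(frameset, "http://example.org/site/index.html"))
```

The resulting URLs could then be retrieved and hashed in the same way as other inline objects.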

Automatic refreshes triggered via HTTP-EQUIV="refresh" within an HTML META element are ignored by the algorithm. If desired, the URLs given in such refresh directives could be treated as inline objects.

Generating a BIBLINK.Checksum

A Web CGI based tool has been developed to implement this algorithm [10]. The code for the tool is available separately [11].


References

[1] BIBLINK Web pages
http://hosted.ukoln.ac.uk/biblink/

[2] BIBLINK Metadata elements
http://hosted.ukoln.ac.uk/biblink/wp8/fs/bc-semantics.html

[3] Dublin Core
http://purl.org/metadata/dublin_core

[4] Titia van der Werf, Authentication
Deliverable D6.1 of BIBLINK, LB 4034, 1997
http://hosted.ukoln.ac.uk/biblink/wp6/d6.1/

[5] MD5 [RFC1321]
http://sunsite.doc.ic.ac.uk/rfc/rfc1321.txt

[6] CRC-32 [ISO3309 and within RFC1510]
http://sunsite.doc.ic.ac.uk/rfc/rfc1510.txt

[7] DES-MAC [ANSI X9.9, ISO8731]
http://test.team2it.com/rsa/faqref.htm#ANS86a

[8] MD4 [RFC1320]
http://sunsite.doc.ic.ac.uk/rfc/rfc1320.txt

[9] MD5 Perl module
ftp://sunsite.doc.ic.ac.uk/packages/CPAN//modules/by-module/MD5/

[10] BIBLINK.Checksum CGI-based tool
http://biblink.ukoln.ac.uk/cgi-bin/bibcheck.cgi

[11] Perl source code for BIBLINK.Checksum CGI-based tool
http://www.ukoln.ac.uk/metadata/software-tools/#bibcheck.cgi

Author Details

Ian Peacock
Technical Development and Research, UKOLN
Email: i.peacock@ukoln.ac.uk
Andy Powell
Technical Development and Research, UKOLN
Email: a.powell@ukoln.ac.uk