So You Want to Build a Union Catalogue?
In the not so distant past, if a group of libraries wished to offer a single online catalogue of their collections, they had to adopt a physical union catalogue model, i.e. they had to place their catalogue records into a single searchable database. More recently there has been much work on virtual union catalogues, whereby the user interface offers integrated access to multiple catalogues as if they were a single catalogue. Neither of these approaches is a panacea, however – both have pros and cons, which makes the decision of which to adopt dependent on circumstances.
Before I begin to look at the different models (both physical and virtual), I would like to cover a few elements of information retrieval theory. There are two major concepts: recall and precision. Recall is the confidence that a search returns all the information you are interested in. For example, if you searched a bibliographic catalogue on the term “Tchaikovsky”, you would get low recall if the catalogue was inconsistent on the spelling of “Tchaikovsky” (using “Chaikovsky” and other variants) and did not use any cross-references between the different spelling forms. Precision, on the other hand, is the confidence that the results of your search are relevant. For example, a search on “Cyril Smith” would have low precision if it returned a large number of results for the politician when you were interested in the pianist. Typically, web users are now used to low precision and recall, as web search engines can be quite poor in this regard. However, this is not surprising, as achieving high precision and recall when searching unstructured or semi-structured data is a very difficult task. Some would argue it is an impossible task, since the issues in achieving this are very closely related to artificial intelligence (to achieve high precision and recall the algorithms must demonstrate some understanding of both the query and the data being searched). On the other side of the coin, librarians and users of library OPACs (Online Public Access Catalogues) are used to very high precision and recall, especially recall. This is due to the great care and consistent rules (such as AACR2) applied in creating what are sometimes very elaborate structured catalogue records.
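To make the two measures concrete, here is a minimal Python sketch. It assumes the usual set-based definitions (precision is the share of retrieved records that are relevant; recall is the share of relevant records that were retrieved). The record identifiers and the helper name `precision_recall` are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, using set-based definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical catalogue: four records by the composer, filed under two
# spellings, plus one unrelated record that also matches the search term.
relevant = {"rec1", "rec2", "rec3", "rec4"}
retrieved = {"rec1", "rec2", "rec9"}  # a search on "Tchaikovsky" misses "Chaikovsky"

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up figures the search finds two of the four relevant records (recall 0.50) and two of its three results are relevant (precision 0.67).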
The physical models preserve high precision and recall by offering a single database, with agreed cataloguing rules and a single indexing policy. I will briefly elaborate on that latter point. Even though two library systems may contain records created according to the same cataloguing rules (and interpretations of those rules), the indexing between them may differ: one system may group both ISSNs and ISBNs into the same index, whereas another may index them separately; one system may include authors in the subject index; one system may separate out primary and secondary responsibilities into separate indexes, etc. A typical user will get used to the quirks of a single system (such as knowing how to spell Tchaikovsky in order to get all results as opposed to no results!), and will consistently achieve high recall, often with high precision.
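The effect of differing indexing policies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the standard numbers, the index names (`stdnum`, `isbn`, `issn`) and both helper functions are hypothetical and do not describe any particular library system.

```python
from collections import defaultdict

# Two hypothetical records catalogued to the same rules (numbers are made up).
RECORDS = [
    {"id": "r1", "isbn": "0198501234", "issn": None},
    {"id": "r2", "isbn": None, "issn": "1361-3200"},
]


def build_indexes(records, combine_standard_numbers):
    """Index records under one policy: combined or separate ISBN/ISSN indexes."""
    indexes = defaultdict(lambda: defaultdict(set))
    for record in records:
        for field in ("isbn", "issn"):
            value = record[field]
            if value is None:
                continue
            index_name = "stdnum" if combine_standard_numbers else field
            indexes[index_name][value].add(record["id"])
    return indexes


def search(indexes, index_name, term):
    """Return matching record ids, or an empty set if the index does not exist."""
    if index_name not in indexes:
        return set()
    return set(indexes[index_name].get(term, set()))


system_a = build_indexes(RECORDS, combine_standard_numbers=True)
system_b = build_indexes(RECORDS, combine_standard_numbers=False)
print(search(system_a, "isbn", "0198501234"))  # no separate ISBN index here
print(search(system_b, "isbn", "0198501234"))
```

The same ISBN query finds the record in one system and nothing in the other, even though the underlying records are identical. A user of a single system learns which index to use; a cross-search has to cope with both.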
The problem with the physical models is their maintainability. There are at least four models of how to populate such a physical catalogue:
In the first model the union catalogue is the main catalogue for the participating libraries. This is the case with the Oxford University Union Catalogue (OLIS). The Bodleian Library, along with the libraries of the colleges, departments, and University dependents all (well, about two thirds of them) subscribe to a single library system and catalogue. This has a number of benefits, in that it centralises the cost of technical support rather than each library tackling the same problems independently, but clearly this approach would hit problems in merging libraries which already have their own library systems and now wish to adopt a new one. It would also require a greater degree of collaboration in terms of cataloguing and indexing policies than some consortia would be willing to accommodate (although I will argue that the virtual union approach may not offer an alternative). Even so, if the libraries building a new union catalogue do not have existing systems, have little local IT support, and “get on well,” this model is still an effective one worth considering. The increase in network speed and reliability makes this appropriate even for geographically dispersed unions. However, this model does not support situations where a library may be a member of more than one union catalogue.
In the second model records are exported from local catalogues to the union catalogue – this, as do the following two models, assumes that the union members have their own online catalogues or library systems. This is the model adopted by the COPAC service in the U.K. An increasing number of UK University libraries send their records to a centralised database. Adding a new library is not a trivial matter, since mechanisms for doing the export need to be established, and this may require record conversion. However, as the union grows this becomes less difficult, as the new libraries are likely to be using the same or similar library systems to those already in the union. Again it centralises the support needed, so can be suitable where there is little local IT support for the union members. It cannot fully address the consistency of the records being imported, although it can try to insist on minimal standards. It can, however, adopt a consistent indexing policy. A major issue with this model is the latency between an item being catalogued in the local catalogue and its being represented in the union catalogue, which depends on how often the exports are performed. It must be remembered that in the case of all but the smallest of libraries there is already a large discrepancy between the actual holdings and the online catalogue (collections still needing to be catalogued, new acquisitions taking a few days before they can be fully catalogued, etc.), so the extra few days' difference between the local catalogue and the central one may not be a major problem. Deletions or amendments of records in the local catalogue, and how to replicate them to the central catalogue, are also issues, but there are solutions for handling this. However, the local IT support requirement may increase if the library becomes a member of a large number of such union catalogues. One issue that this model cannot address is locally volatile information such as circulation information (e.g. whether the item is on loan) – but this can always be obtained by a separate search on the local catalogue (possibly automated and transparent to the user).
In the third model, records are catalogued on the central catalogue and imported to the local catalogue. This is the model adopted by OCLC, in particular in their CORC project. In this case Internet resources are catalogued centrally (using a web interface) by participating members, and then periodically imported into the local catalogue. This has the advantage that both cataloguing practice and indexing policy are determined centrally, but it still allows individual members of the union to use their own systems. It also provides an easy means for a member to decide which records should only be local (as would be the case if a library which covered a variety of subjects was a member of a subject-specific union catalogue). It does impose the need for a greater degree of local IT support, as importing into the local catalogue is typically a local problem; and it still requires agreement between the union members as to the cataloguing policies used. This model also does not directly address local volatile information, such as circulation status, although there are work-arounds.
A final model is to dynamically update both local and central catalogues simultaneously, i.e. distributed cataloguing. The Z39.50 protocol (which I will mention in more detail shortly) offers a catalogue update service. This could be used by a suitable cataloguing client (and there are a growing number which now support this) to send the catalogue update (either an addition, deletion or modification) to a proxy server which would either send the updates to both local and central catalogues, or queue them if the catalogues were unavailable at the time. Alternatively, updates to the local catalogue could trigger replication to the central catalogue in real time or be queued. In both cases there may be a need for on-the-fly modification of the records using the same solutions as for the import or export models above. Many major commercial distributed database systems work on this model. This would offer the consistent indexing of the COPAC model, but without the latency issue of updates, and could even include updates of local volatile information such as circulation status. The technology for doing most of this already exists, and so the local and central IT costs would be similar to those of the COPAC model. However, to date I have not seen this approach adopted in practice.
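The proxy behaviour described here (forward each update to both catalogues, queue it where a catalogue is unavailable) can be sketched as follows. This is a minimal in-memory illustration, not a Z39.50 implementation: the classes `InMemoryCatalogue` and `UpdateProxy`, the exception, and the update tuple format are all assumptions made for the example.

```python
from collections import deque


class CatalogueUnavailable(Exception):
    """Raised by a catalogue stub when it cannot accept an update."""


class InMemoryCatalogue:
    """Stand-in for a real catalogue, which would sit behind a network protocol."""

    def __init__(self, name):
        self.name = name
        self.records = {}
        self.available = True

    def apply(self, update):
        if not self.available:
            raise CatalogueUnavailable(self.name)
        action, record_id, record = update
        if action == "delete":
            self.records.pop(record_id, None)
        else:  # "add" and "modify" both store the latest version
            self.records[record_id] = record


class UpdateProxy:
    """Send each update to every catalogue, queueing it where delivery fails."""

    def __init__(self, catalogues):
        self.catalogues = catalogues
        self.pending = {c.name: deque() for c in catalogues}

    def submit(self, update):
        for catalogue in self.catalogues:
            self.pending[catalogue.name].append(update)
            self._drain(catalogue)

    def retry(self):
        for catalogue in self.catalogues:
            self._drain(catalogue)

    def _drain(self, catalogue):
        queue = self.pending[catalogue.name]
        while queue:
            try:
                catalogue.apply(queue[0])
            except CatalogueUnavailable:
                return  # keep remaining updates, in order, for a later retry
            queue.popleft()
```

Each catalogue has its own queue, so updates held back for an unavailable central catalogue are replayed in their original order when `retry()` is called, while the local catalogue is updated immediately. On-the-fly record conversion, if needed, would slot in just before `catalogue.apply`.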
More recently we have seen the emergence of virtual models, known within the UK eLib community as “clumps” – a term coined more by accident than design. The underlying model here is that the user is presented with a single user interface which “cross-searches” the participating catalogues in parallel and merges the results to present them to the user as if they came from a single catalogue. There is a general opinion that this approach is more cost-effective, more resilient and more easily scalable than the physical models. Whilst I would not dispute that the virtual approaches can be all of these, it is not automatically guaranteed that this approach is cost-effective. Indeed, a recent study of the COPAC service reported that it was more cost-effective for it to remain as is, rather than adopt a virtual model. Z39.50 is often used as the protocol for achieving this, due to its adoption within the library world. There are several good introductions to Z39.50, but in very brief terms it is a generic protocol for allowing a client piece of software (typically a web gateway for the clump project) to query a database (typically a library catalogue) and get results back. It was originally designed to search a single database at a time, but lends itself well to searching multiple databases in parallel.
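The cross-search pattern can be sketched in Python using the standard library's thread pool. This is a sketch under stated assumptions rather than a real gateway: each catalogue is represented by a plain search function (hypothetical; a real gateway would wrap a Z39.50 client here), and the function name `cross_search` and the record fields `author` and `title` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def cross_search(catalogues, query, timeout=5.0):
    """Search every catalogue in parallel and merge the hits into one list.

    `catalogues` maps a catalogue name to a search function that takes the
    query and returns a list of record dicts. Returns (merged_hits, failed).
    """
    merged, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(catalogues))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in catalogues.items()}
        for name, future in futures.items():
            try:
                # The timeout applies to each wait in turn, not to the whole search.
                hits = future.result(timeout=timeout)
            except Exception:  # unavailable, too slow, or otherwise failed
                failed.append(name)
                continue
            merged.extend({**hit, "source": name} for hit in hits)
    # Leaving the `with` block still waits for slow searches to finish;
    # a production gateway would need real cancellation.
    merged.sort(key=lambda hit: (hit["author"], hit["title"]))
    return merged, failed
```

The function returns the names of the catalogues that did not answer alongside the merged hits, so the interface can at least tell the user that the result list may be incomplete.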
However, the main issue in virtual union catalogues is that we are searching databases with different cataloguing policies and different indexing policies. In this respect they are often mistakenly viewed as easier to set up, since they do not require conformance or agreement on such matters. However, we immediately hit a problem known in computer science as “searching semantically heterogeneous databases” – this is especially prevalent when we move to searching across different domains (e.g. archives and museums as well as libraries). The problem is once again obtaining good recall and precision (especially recall), and the issues are very similar to achieving these for unstructured data. Technology can give us good results, but not as good as those achieved by being consistent in indexing and cataloguing practice. Many of the clump projects have discovered this. Moreover, whereas a user can become accustomed to quirks in individual OPACs, these are much harder to learn and accommodate when searching multiple OPACs in parallel, each with its own quirks.
Adding a new catalogue to a virtual union clump can be fairly easy – provided however that the new library catalogue supports Z39.50 and has good local IT support. However, if the union consists of small libraries with little or no local IT support, just establishing that the library system supports Z39.50 can be a major task. The virtual approach is therefore more cost-efficient in the former case than the latter. Also although getting the catalogue onto the virtual union clump can be fairly straightforward and quick, configuring it correctly so that it returns reasonable results, in particular with good recall and precision, is a difficult task. In fact this part of the task can be as difficult as adding a new library to a physical union catalogue. There is additional difficulty if a library belongs to two virtual union catalogues requiring different configurations, but there are international initiatives such as the Bath Profile attempting to prevent this.
Another advantage claimed by the virtual models is that they are more resilient, in that it is cheaper to run multiple gateways than to run multiple physical databases, and that the gateway will still be searchable even if not all the collections are available. This is true: however, whether this is an advantage depends on your view of the importance of recall. In the physical union catalogue, you have all or nothing (if the catalogue is unavailable) – whereas the virtual catalogue may not receive results from all the union members because a particular catalogue was unavailable, slow to respond, etc. It is a moot point, which is “better” for the user. A user trying to locate every copy of a particular item within a union catalogue may find the “resilience” of the virtual catalogue more irritating than a blessing.
Scalability is another major question in terms of virtual union catalogues. Physical union catalogues can scale almost ad infinitum provided the central resources are available. Virtual union catalogues are not restricted by central resources, but are restricted by more fundamental concerns such as network bandwidth. There is mixed opinion on how many databases can be effectively searched in parallel – the experts’ opinions vary between 10 and 100! This is not helped by the fact that there are two different ways of doing such a parallel search. The commonest approach is not only to perform the search in parallel but also to retrieve the results in parallel. In this case if you cross-search ten catalogues each returning one hundred records, you are in effect pulling a thousand records over the network. Clearly this does not scale very well, but it does allow the interface to sort the results (e.g. by author). Another approach is to perform the search in parallel but only pull the records back as needed. This is more scalable but makes it harder to display results sorted by anything other than catalogue. Z39.50 does support catalogue-side sorting (i.e. you can send a search request, ask the catalogue to sort results and then pull the results back as needed), which would solve this problem, but few library systems support this yet.
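The difference between the two retrieval strategies can be illustrated with a small simulation. Everything here is hypothetical: `RemoteResultSet` merely counts how many records would cross the network, and the two functions are simplified stand-ins for what a gateway might do.

```python
class RemoteResultSet:
    """Stand-in for a result set held by a remote catalogue after a search."""

    def __init__(self, name, size):
        self.name = name
        self.size = size
        self.transferred = 0  # records pulled over the (simulated) network

    def fetch(self, start, count):
        """Return up to `count` records starting at `start`, counting transfers."""
        stop = min(start + count, self.size)
        records = [f"{self.name}-{i}" for i in range(start, stop)]
        self.transferred += len(records)
        return records


def eager_retrieve(result_sets):
    """Pull every record from every catalogue so the gateway can sort them all."""
    records = []
    for rs in result_sets:
        records.extend(rs.fetch(0, rs.size))
    return sorted(records)


def lazy_page(result_sets, page_size):
    """Pull only the first page, taking records catalogue by catalogue."""
    page = []
    for rs in result_sets:
        if len(page) >= page_size:
            break
        page.extend(rs.fetch(0, page_size - len(page)))
    return page
```

With ten result sets of one hundred records each, `eager_retrieve` transfers all one thousand records before anything is shown, whereas `lazy_page(result_sets, 20)` transfers twenty. The price is visible in the output: the lazy page is ordered by catalogue, not by author or title.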
A solution to the scalability problem is to look at forward knowledge, i.e. pre-select the individual catalogues within the union according to what the user is looking for. There are a number of approaches to this. One is to present the user with information as to the collection strengths of the individual catalogues. Another is to encode this in a machine-readable form, possibly automatically generated from the indexes, and let the gateway select the catalogues based on the query and this information. These approaches need further investigation, but seem to be less effective for subject-based unions (which would have very similar collection strengths) than for regional or geographical unions, and in any case they clearly detract from obtaining good recall. The mechanisms for automating this forward knowledge are not far removed from centralising the indexes. This gives rise to another model of union catalogues not yet investigated. You still have multiple catalogues, indexed locally, but you also index these centrally. The user searches the central indexes, then retrieves the records from the individual catalogues directly. This would only be of benefit over the other models mentioned if the amount of data in a record was much larger than that indexed, and would be of particular benefit if that information was volatile (such as circulation information).
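A machine-readable form of forward knowledge could be as simple as a set of subject terms per catalogue, generated from its indexes. The sketch below assumes exactly that; the catalogue names, the records and both function names are hypothetical, and real collection descriptions would be far richer.

```python
def build_summary(records):
    """Derive a crude collection summary: the subject terms a catalogue holds."""
    return {term.lower() for record in records for term in record["subjects"]}


def select_catalogues(summaries, query_terms):
    """Return the names of catalogues whose summary mentions any query term."""
    wanted = {term.lower() for term in query_terms}
    return sorted(name for name, terms in summaries.items() if terms & wanted)


# Hypothetical union of two catalogues with one record each.
CATALOGUES = {
    "music_library": [{"subjects": ["Ballet", "Piano music"]}],
    "law_library": [{"subjects": ["Contract law"]}],
}

summaries = {name: build_summary(recs) for name, recs in CATALOGUES.items()}
print(select_catalogues(summaries, ["ballet"]))  # ['music_library']
```

The recall cost is easy to see: any catalogue whose summary happens not to contain the query term is never searched, even if it holds a relevant record catalogued under a different heading.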
There are still technical issues in the virtual union catalogues. The vendor support for Z39.50 is increasing but still has a long way to go. Some still do not see it as important, most do but have minimal support (for example do not support features such as sort). There are still issues within the standard – the method for the return of holdings information has only recently been decided by the agency behind the standard, and hence it is very hit and miss what holdings or circulation information systems return (if they return any at all). Issues of scalability and forward-knowledge still need to be addressed. However, not all the problems are technical and many of them such as cataloguing and indexing practice are important and need to be addressed whatever model you choose to adopt – technology alone cannot solve all problems.
That does leave the question of which model to choose if you are about to embark on building a union catalogue. This depends partly on who the members are – if they are fairly small libraries with little local technical support, I would recommend adopting a physical union model. This has a larger centralised IT cost, but overall the cost is not that much different from that of the virtual model. The virtual models distribute the cost (as well as the searching) onto the local libraries and are therefore more applicable to those using larger library systems with good local IT support. Another issue is the requirements and expectations of the users. If you want recall and precision comparable to OPACs, the physical models have a clear lead, and this is unlikely to change. However, most users accustomed to web searches may be perfectly happy with less than perfect searching.
- Details on the eLib Clump projects are at: http://www.ukoln.ac.uk/services/elib/projects/
- Details on the Oxford Union Catalogue are at: http://www.lib.ox.ac.uk
- Details on COPAC are at: http://www.copac.ac.uk
- The study looking at the feasibility of Z39.50 for the COPAC service is at: http://www.curl.ac.uk/projects/z3950.html
- Details on CORC are at: http://www.oclc.org/oclc/corc/index.htm
- A good introduction on Z39.50 is at: http://www.ariadne.ac.uk/issue21/z3950/intro.html
- An article on the Bath Profile is at: http://www.ariadne.ac.uk/issue21/at-the-event/bath-profile.html
- Some articles on searching distributed heterogeneous databases are:
“Federated database systems for managing distributed, heterogeneous, and autonomous databases”. Sheth and Larson (1990). ACM Computing Surveys, Vol 22, No 3.
“Impact of Semantic Heterogeneity on Federating Databases”. R. M. Colomb (1997). Computer Journal, British Computer Society, Vol 40, No 5. ISSN 00104620.
Matthew J. Dovey
Research and Development Manager
Libraries Automation Service
University of Oxford