Web Magazine for Information Professionals

Practical Clumping: Mick Ridley on the BOPAC System

Mick Ridley discusses the BOPAC system.

This article attempts to draw some practical lessons for those involved with clumps (or thinking about them) from our experiences on the BOPAC2 project [1]. BOPAC2 was a British Library funded project that investigated the problems of large and complex retrievals from Z39.50 [2] searches, especially from multiple targets.

Although the funded stage of BOPAC2 is over, the system is still under development and available on the web, and I would urge people to try the sort of examples I will mention below (and give us feedback). I would also like to make it clear that I am not in any way blaming or condemning any library for the way its Z39.50 server works, merely pointing out some of the problems in getting these sorts of services working. In fact I'd like to praise all those who have been pioneering Z39.50 access to their collections. This is particularly so since support from library system vendors has not always been what it might have been and 'the introduction of the Web redirected most vendor activity into developing web front ends to existing or planned systems', as Mark Hinnebusch noted in a 'State of Z39.50' report [3]. Consequently I've tried to keep the examples 'anonymous' and I hope people will take them as examples of the problems of getting consistent or equivalent results across a clump rather than an exposé of how some servers get it 'wrong'.

As something of an aside I should explain the acronym. BOPAC originally stood for the Bradford OPAC (OK, really Bradford Online Public Access Catalogue) since we were working at the University of Bradford, but this caused some confusion with the University Library's catalogue (which is itself reachable from BOPAC2). So we have tended to refer to the project simply as BOPAC. If that has to be an acronym then I hope it's for Better OPAC; that's our aim even if we aren't there yet. And it wasn't just an attempt to get ahead of COPAC in alphabetic listings!

When BOPAC2 started, the term clump wasn't even a gleam in Lorcan Dempsey's eye. However, I suspect that by most people's definition BOPAC may be thought of as a clump system, or a broker in MODELS [4] (specifically the MODELS Information Architecture) terminology. It allows you to create a clump of your own by selecting the targets you are interested in from a large list; it then searches those targets simultaneously and displays the results together. It can also easily be configured to have a more fixed set of targets, as was done in the project for Bradford and Leeds University libraries, producing a mini West Yorks clump (including the British Library Document Supply Centre at Boston Spa). BOPAC2 has two main parts: a Z39.50 client - Web interface based on the Europagate system, and a Java applet that provides the display of, and search within, the retrieved records from all the servers. It is the fact that BOPAC combines the retrievals from different servers into one set that can be manipulated in many ways, and consolidates records for the same item from different servers, that makes the differences between catalogues and their Z39.50 servers so crucial to us.

If most clumps are to be virtual, then the glue that will hold most of them together will be the Z39.50 protocol. And Z39.50 may be the start of a new set of problems rather than the end of your problems. There is a common view that Z39.50 will let you query another database with the interface of a system that you know, e.g. 'The greatest benefit of Z39.50 was that it offered the potential of directly searching any database through the local systems interface' [5]. This may be true. What it doesn't tell you is whether the system will behave the way you expect, and give you the results you might expect. In fact I would like almost to reverse the quote and say 'The greatest benefit of Z39.50 was that it offered the potential of directly searching any database regardless of any local systems interface'. That is, the point to me is that Z39.50 allows access; presentation is a separate issue. It may be heretical to say so, but although the 'local systems interface' may be known, it may also be disliked by users. I would hope that we are entering an era where users can choose or customize the interface they use. As we move to Web based interfaces for many different systems we need to recognise that the appearance of these is less controllable by system designers than that of other user interfaces. On the other hand, the appearance of an HTML form for entering a query may well be more consistent for an individual user with their favourite browser, regardless of the system being queried.

There are a number of layers that a query must pass through from the user interface to the underlying servers and each layer can present problems. We can group them broadly as below, although some layers may have sub layers and there may not be clear distinctions between them in some cases.

What do you call the search?:
This problem is of course not specific to clump systems but must be tackled by any query interface. There may, however, be a particular problem for clumps, related to the 'local systems interface' issue above. It may seem desirable to use the same terminology as used on a local system, but the members of a clump may use different terms, or the same terms with different meanings, on their local systems. One solution is to use the name of the bib-1 attribute. This seems promising initially but may not be such a straightforward solution, as we will see.

What Z39.50 bib-1 attribute, or attributes does this map to?:
Z39.50 systems use attribute sets to define the type of search to be undertaken. The most widely used set is 'bib-1', originally designed for bibliographic use but also used in other areas since it has achieved a 'default' status; other more specialised attribute sets exist. The development of attribute sets is ongoing in the Z39.50 community and a full discussion is beyond the scope of this article. More information can be found at the Library of Congress Z39.50 Maintenance Agency [2] or on the Z39.50IW maillist [6]. Put simply, we can define what sort of search we want by specifying the attribute value, e.g. 1003 for an Author or 7 for an ISBN. While an ISBN search may be unambiguous, the semantics behind an 'author' search may be more complex and there are a number of bib-1 attributes that might be used, such as:

Personal name             1
Name                      1002
Author-name personal      1004

and this assumes we are looking for a person and hence avoiding corporate names (2, 1005) and that we are happy to omit editor (1020).
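As a minimal sketch of how a search label might be turned into an attribute-based query, the Python below maps the bib-1 use attribute values mentioned above onto query strings. The query strings use Prefix Query Format (PQF), a notation accepted by some Z39.50 toolkits; that choice, and the function names pqf_query and pqf_any_of, are assumptions made for illustration and do not describe how BOPAC2 itself builds its queries.

```python
# Bib-1 "use" attribute values mentioned in the text (attribute type 1).
BIB1_USE = {
    "personal name": 1,
    "corporate name": 2,
    "isbn": 7,
    "name": 1002,
    "author": 1003,
    "author-name personal": 1004,
    "author-name corporate": 1005,
    "editor": 1020,
}


def pqf_query(use_name, term):
    """Build a PQF string for a single-term search (illustrative helper)."""
    use_value = BIB1_USE[use_name.lower()]
    # Quote the term so that a multi-word term is sent as one operand.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'@attr 1={use_value} "{escaped}"'


def pqf_any_of(use_names, term):
    """OR the same term together under several use attributes."""
    queries = [pqf_query(name, term) for name in use_names]
    combined = queries[0]
    for query in queries[1:]:
        # PQF is prefix notation: the operator precedes its two operands.
        combined = f"@or {combined} {query}"
    return combined


if __name__ == "__main__":
    print(pqf_query("author", "Dickens"))
    print(pqf_any_of(["personal name", "author-name personal"], "Frost"))
```

The second call shows one way of hedging an 'author' search across two candidate attributes, at the cost of assuming that the target supports both of them.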

Do all the servers support this?:
There is no guarantee that all the servers you want to query will support all the attributes (in fact they are most unlikely to) or even the same set of attributes. A number of solutions may be possible here. If different servers support different attributes it may be possible to map these to the same search for an (approximately) equivalent result, e.g. using a mixture of 1 and 1004 for 'Personal Name - Author'. If, however, there is nowhere that the searches available from the servers intersect, what is to be done? Imagine that all bar one support ISBN, and that one doesn't support 'anywhere' either, which might have allowed the search by a back door. Do you allow that search but make it clear that there can be no results from that server? Or do you disallow the search altogether, and get a more consistent interface? Or do you disallow the search only when it involves the server that can't support it?
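The choices described above can be sketched as a small planning step that picks, for each server, the first acceptable attribute it supports and flags the servers that cannot take part. The server names, their supported attribute sets and the preference lists are all hypothetical; a real clump would have to discover or record what each target supports.

```python
# Hypothetical targets and the bib-1 use attributes each is assumed to support.
SERVER_ATTRIBUTES = {
    "server_a": {1, 4, 7, 1003},
    "server_b": {4, 7, 1004, 1035},
    "server_c": {4, 1003, 1004},
}

# For each search offered to the user: acceptable attributes, most preferred first.
SEARCH_PREFERENCES = {
    "personal author": [1004, 1, 1003],
    "isbn": [7, 1035],  # fall back to Anywhere (1035) as a "back door"
}


def plan_search(search, servers=SERVER_ATTRIBUTES):
    """Return {server: chosen attribute, or None if the search is unsupported}."""
    plan = {}
    for server, supported in servers.items():
        plan[server] = next(
            (attr for attr in SEARCH_PREFERENCES[search] if attr in supported),
            None,
        )
    return plan


def unsupported(plan):
    """Servers that cannot run the search at all and should be flagged to the user."""
    return sorted(server for server, attr in plan.items() if attr is None)


if __name__ == "__main__":
    isbn_plan = plan_search("isbn")
    print(isbn_plan)
    print("No results possible from:", unsupported(isbn_plan))
```

Whether the flagged servers are then skipped with a warning, or the search is withdrawn altogether, remains the interface decision discussed above; the sketch only makes the gap visible.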

How does each of the servers interpret this?:
Here we move on to the problems of what each server does with the same query. Even with two servers that support the same attributes we cannot be certain that the search will be applied in the same way, that is, to the same tags of a MARC record. Or, perhaps more accurately, we cannot be certain that the index (or indexes) queried by a particular search were built from the same set of MARC tags. Decisions in this area have always had to be made, but in the past users could become aware of the 'foibles' of their own system. When querying a clump of catalogues, all of them may show foibles but not the same ones, and the search results may be very different as a consequence. Take author searches as an example. These are often mapped onto a search of a personal name index, and we can get very different results from two servers holding the same item, with the same catalogue record for it, if the indexes have been built differently. Examples in this area can be found depending on whether some or all notes fields have been used as a source of personal names. This is particularly noticeable when searching for non-book materials, where extensive cast lists for video or audio performances may be found in MARC 508. There is also scope for confusion where one server doesn't support an editor (bib-1 1020) search, although the contents of MARC 248 contribute to its name index, while another server does allow that search. How in this case should a user try to formulate their query? We can then add local variations in cataloguing practice as a further layer of complexity within this layer. Whether the variations reflect ambiguities within AACR2, structural problems of AACR2 or bad practice is beyond the scope of this article. What is clear is that clump systems need to be able to deal with this variation if they are to 'work'.
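A small sketch may make the indexing point concrete. It assumes two hypothetical servers that hold the same record but build their personal name index from different sets of MARC tags; the record content and both index definitions are invented for illustration.

```python
import re

# One catalogue record, reduced to (MARC tag, text) pairs. The content is invented.
RECORD = [
    ("100", "Example, Ann"),
    ("245", "An invented video recording"),
    ("508", "Cast: John Sample, Mary Placeholder"),
]

# Hypothetical index definitions: which tags feed each server's personal name index.
NAME_INDEX_TAGS = {
    "narrow_server": {"100", "700"},
    "broad_server": {"100", "700", "508", "600"},
}


def name_index_terms(record, tags):
    """Collect the lower-cased words a server would index as names."""
    terms = set()
    for tag, text in record:
        if tag in tags:
            terms.update(re.findall(r"[a-z0-9]+", text.lower()))
    return terms


def author_search(term, record=RECORD):
    """Report, per server, whether the same record is retrieved for the term."""
    return {
        server: term.lower() in name_index_terms(record, tags)
        for server, tags in NAME_INDEX_TAGS.items()
    }


if __name__ == "__main__":
    print(author_search("Example"))  # main entry: found by both servers
    print(author_search("Sample"))   # cast member: found only via the notes field
```

The same query against the same record succeeds on one server and fails on the other, which is the kind of inconsistency a user of a clump sees without being able to tell why.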

There are also problems that seem to touch all the layers mentioned above. These are noticeable in the subject and keyword search areas. Here there seems to be much more support for the use of Subject heading (bib-1 21) than Subject name personal (bib-1 1009), yet personal names from the subject heading of MARC 600 often contribute to names retrieved by author searches. Also, subject searching may be supported to some degree, but use of Anywhere (bib-1 1035) or keyword in title, suitably labelled, may give the sort of result the user wanted.

What can be done about these problems? Clearly there is no one silver bullet. There are a number of small steps that can be taken, and developers need to test extensively and get user feedback. Some of the things we did are outlined below.

In the course of developing BOPAC2 one significant feature we had to add was an explanatory line in the first display screen. This added the MARC fields that contained a match for the search terms. In many cases, particularly with author, title, or author and title searches, you would have expected an author/title display to be (relatively) self explanatory, but we found that many searches were returning unexpected items and that users could make more sense of their retrievals with longer displays. These made it easier to tell, at an early stage, whether items were likely to be of interest.

Examples of this could be seen in an author search for Dickens. This would often return, in addition to works by Charles Dickens, critical works on Dickens or works based on those of Dickens. The original brief author/title displays included:

Bleak House. Hawthorn, Jeremy
You must believe all this/ Mitchell, Adrian

The first is a critical work on Dickens' Bleak House and its appearance might be self explanatory, but the second is, hopefully, succinctly explained by the addition of the contents of the MARC tag that matches the search term:
-:Adrian Mitchell; from Holiday Romance by Charles Dickens;

Or, from an author search for Frost, which shows that 'author' may express a variety of relationships to a work, here are the expanded brief author/title displays:

Phantasmagoria and other poems/ Carroll,Lewis
-:By Lewis Carroll; with illustrations by Arthur B. Frost

Newton's Principia: Newton, Isaac
-:by Percival Frost

Magenkarzinom. Hot,J.Meyer, Hans-Joacim, Schmoll, Hans-Joachim
ORGANISATION: Frost Pharma

Poems ; Dryden, John,
-: John Dryden; editor; William Frost

This may be the behaviour a user would expect from their local system in which case the shorter display would not be confusing, but for other users these retrievals may well seem puzzling not to say erroneous.
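The explanatory line described above can be sketched roughly as follows. This is not the BOPAC2 applet code (which was written in Java); it is a Python illustration in which the record layout, the tag numbers assigned to each piece of text, and the function names are all assumptions.

```python
import re

# A retrieved record in a simplified, assumed layout. The tag numbers are
# guesses made for illustration, not taken from the actual catalogue record.
RECORD = {
    "title": "You must believe all this",
    "author": "Mitchell, Adrian",
    "fields": [
        ("100", "Mitchell, Adrian"),
        ("245", "Adrian Mitchell; from Holiday Romance by Charles Dickens"),
    ],
}


def matching_fields(record, term):
    """Return the (tag, text) pairs whose text contains the term as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(tag, text) for tag, text in record["fields"] if pattern.search(text)]


def brief_display(record, term):
    """Author/title line followed by one explanatory line per matching field."""
    lines = [f"{record['title']} / {record['author']}"]
    for tag, text in matching_fields(record, term):
        lines.append(f"  {tag}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(brief_display(RECORD, "Dickens"))
```

When matching_fields returns nothing, the display has no explanation to offer, which is a useful signal in itself that the delivered record does not contain whatever caused the server to return it.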

A related problem is that although some of the catalogues being queried may have treated an author search in this broad way, other catalogues may have had a narrower interpretation of the query. The only way to discover this is probably by observing that all the critical works come from certain catalogues and not from the full range of catalogues that were queried.

Sometimes you may retrieve records that show no match with the original search. An example of this was found with an author search for Dylan Thomas. Amongst the retrievals was a collection of Welsh short stories, but 'Thomas' was nowhere in the text of the record. Sometimes you may get complete mis-hits that seem to be errors, but in this case we assume that the record should have been retrieved. A possible reason is that the server did not deliver the full MARC record held in the database, so the fields that caused the match may not be present in what we received.

A problem with title searches was brought to our attention by a user. BOPAC2's default title search was an exact title match and the user had failed to find 'Time for Change' on all the servers they expected. On investigation we found that at least one server was being much stricter and only giving a title match on 'Time for Change?', while most servers ignored the question mark. In an attempt to give what we felt was the behaviour users would expect, we changed our default from an exact match to a title contains search. So 'Time for Change' now includes titles that would have been excluded before, such as 'Children in crisis: a time for caring, a time for change', but does get 'Time for Change?' too.
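The difference between these behaviours can be sketched with three small matching functions. They are illustrative interpretations only: how any particular server actually normalises and compares titles is not known to us, and the function names are invented.

```python
import re


def normalise(title):
    """Lower-case the title and keep only its words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", title.lower())


def strict_match(query, title):
    """Character-for-character comparison: punctuation in the title matters."""
    return query == title


def exact_match(query, title):
    """Whole-title match that ignores case and punctuation."""
    return normalise(query) == normalise(title)


def contains_match(query, title):
    """True if the query words occur, in order and adjacent, anywhere in the title."""
    q, t = normalise(query), normalise(title)
    if not q:
        return False
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


if __name__ == "__main__":
    titles = [
        "Time for Change?",
        "Children in crisis: a time for caring, a time for change",
    ]
    for title in titles:
        print(title)
        for match in (strict_match, exact_match, contains_match):
            print("   ", match.__name__, match("Time for Change", title))
```

Only the contains search finds both titles for the query 'Time for Change', which is the trade of some precision for recall that led us to change the default.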

In the examples above I've concentrated on 'known item' searches, which was where BOPAC2 was focussed. We have started to look in more detail at subject searching, where there are clearly a lot of problems because MARC records vary more widely, in both quality and standards, for subject information than they do for author and title information. We hope to pursue this further in a future BOPAC project.

What conclusions can we draw from this? I believe that currently the best strategy for querying clumps is to use the most general query. Attempting to get more precision in the query will result in less recall. A consequence of this is that you will generally get larger result sets and more irrelevant material, but it is the only way to ensure you don't miss relevant material. BOPAC has been designed to make working in this way possible and, hopefully, easy. It's easy with BOPAC to drop the Monica Dickens hits from a search for Dickens if you are interested in Charles Dickens. That search will also have given you the Charles Dickens hits catalogued as 'Dickens' or 'Dickens, C' that the more specific Charles Dickens search would have excluded. This may not be what should happen, but our experience suggests that it's the reality. The BOPAC approach here matches the strategy for network queries suggested in a recent article on XML [7].

... imagine going to an on-line travel agency and asking for all the flights from London to New York on July 4. You would probably receive a list several times longer than your screen could display. You could shorten the list by fine-tuning the departure time, price or airline, but to do that, you would have to send a request across the Internet to the travel agency and wait for its answer. If, however, the long list of flights had been sent in XML, then the travel agency could have sent a small Java program along with the flight records that you could use to sort and winnow them in microseconds, without ever involving the server. Multiply this by a few million Web users, and the global efficiency gains become dramatic.
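In the same spirit, the local winnowing of a broad result set might be sketched as below. BOPAC does this in a Java applet; the Python here is only an illustration, and the Hit structure, the server names and the sample hits are assumptions rather than real retrievals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    author: str
    title: str
    server: str


# Illustrative hits from a broad author search for "Dickens" across a clump.
HITS = [
    Hit("Dickens, Charles", "Bleak House", "server_a"),
    Hit("Dickens, Monica", "One pair of hands", "server_a"),
    Hit("Dickens, C", "Hard times", "server_b"),
    Hit("Dickens", "Great expectations", "server_c"),
]


def winnow(hits, exclude):
    """Drop hits whose author heading contains the unwanted text, without re-querying."""
    return [hit for hit in hits if exclude.lower() not in hit.author.lower()]


def narrow_query(hits, heading):
    """What a more specific search would keep: only the exact heading."""
    return [hit for hit in hits if hit.author == heading]


if __name__ == "__main__":
    kept = winnow(HITS, "Monica")
    print([hit.title for hit in kept])
    print([hit.title for hit in narrow_query(HITS, "Dickens, Charles")])
```

Winnowing the broad set keeps the records catalogued under the shorter headings, whereas the narrower query would never have retrieved them in the first place.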

Looking further forward, work on the Z39.50 Interoperability Profile may help towards solutions of some of the problems discussed here. This, and the sort of suggestions Mark Hinnebusch makes [3], will help clumps that have a working relationship between their members. There still remain considerable problems for those involved in what might be called DIY clumping, where you query a set of resources of interest to you that may have no relationship at an institutional level. BOPAC allows you to work in this way, as do products like BookWhere that let you set up the servers of your choice. I believe that we may see more of this approach. The eLib supported clumps have been both geographical and subject based. It seems to me that personalised clumps, which may be very specific in a geographic or subject sense, are likely to be very useful. Geographically I might be interested in the public library where I live and the academic library where I work. And researchers might be interested in a set of libraries that are geographically very diverse but have material relevant to their specialisation. This could be a quite specific subject like Soviet Science Fiction, which would be unlikely to get the same sort of institutional support as Music Online. A key factor in getting improved access to networked resources is information sharing, and listings of Z39.50 targets, such as UKOLN's [8], have a crucial role to play. I hope this article can contribute to that information sharing process too.

Acknowledgments

I would like to acknowledge the help of the other members of the BOPAC team: Lars Nielsen and Fred Ayres and the financial support of the British Library for the BOPAC2 project. And thanks to all the BOPAC users who gave us feedback.

References

[1] BOPAC Home Page. This includes links to BOPAC2 itself and an online version of the BLRIC Report. Available from: URL http://www.comp.brad.ac.uk/research/database/bopac2.html Last checked 21/05/99

[2] Library of Congress Z39.50 Maintenance Agency Available from: URL http://lcweb.loc.gov/z3950/agency/ Last checked 21/05/99

[3] Mark Hinnebusch. Report to the CIC on the State of Z39.50 Within the Consortium. Available from: URL http://www.cic.uiuc.edu/cli/z39-50report.htm Last checked 21/05/99

[4] MODELS: Moving to Distributed Environments for Library Services. Available from: URL http://www.ukoln.ac.uk/dlis/models/ Last checked 21/05/99

[5] Caswell, J.V. 1997. Building an Integrated User Interface to Electronic Resources. Information Technology and Libraries, 16 (2), 63-72.

[6] Z39.50 Implementors Workshop maillist. Archived at URL http://lists.ufl.edu/archives/z3950iw.html Last checked 21/05/99

[7] Jon Bosak and Tim Bray. XML and the Second-Generation Web. Scientific American, May 1999. Available from: URL http://www.sciam.com/1999/0599issue/0599bosak.html Last checked 21/05/99

[8] Directory of Z39.50 targets in the UK. Available from: URL http://www.ukoln.ac.uk/dlis/zdir/ Last checked 21/05/99

Author Details

Mick Ridley
Senior Computer Officer
Dept of Computing
University of Bradford
http://www.staff.comp.brad.ac.uk/~mick/
M.J.Ridley@comp.brad.ac.uk