Web Magazine for Information Professionals

ACM / IEEE Joint Conference on Digital Libraries

John MacColl reports on a selection of the papers given at this conference in Roanoke, Virginia, June 24-28 2001.

This report covers a selection of the papers given at the conference, drawn from those I chose and was able to attend across the three parallel strands of the three-day main programme (two additional days were given over to workshops, which I did not attend). It includes the three keynote papers, as well as the paper which won the Vannevar Bush award for best conference paper.

The conference was held in Roanoke, Virginia, in the Roanoke Hotel and Conference Center, which is owned by Virginia Tech (located in Blacksburg, some 40 miles away). Ed Fox of Virginia Tech was the Conference chair. It was the first ever joint ACM and IEEE digital libraries conference: previously, the two organisations had held separate conferences on the same theme.

The opening keynote was given by Brewster Kahle, President of Alexa Internet & Director of the Internet Archive: Public access to digital materials: roles, rights and responsibilities of libraries. He began inspirationally, telling us that, thanks to technological advances, ours is the best profession to be in at the present time. The universal control aspirations of libraries can finally be realised, in our time. However, we have battles to fight. A lot of publishers think libraries should not exist, and in particular Pat Schroeder, President of the Association of American Publishers, had recently called libraries ‘the enemy’ in an article in The Washington Post.

Alexa is a for-profit company, alexa.com, now wholly owned by amazon.com. The Internet Archive is a not-for-profit, and a contract with Alexa ensures that everything Alexa gathers is passed to the Internet Archive six months later, for preservation purposes. Brewster Kahle admitted to having made a fortune from the sale of alexa.com, and he is using this income to support the Internet Archive.

The Internet Archive has archived two snapshots of the web – over 40 TB of data, with over 4 billion pages. It is a ‘large-scale ephemera’ archive, and runs various services. It provides rudimentary cataloguing of the web, and also allows analysis of web sites’ ‘relatedness’. An important service is maintaining the history of the web by keeping copies of web sites as they used to look. It has also stored the web sites of all the political parties in the 2000 US election, so that they can now be ‘replayed’ for political science students. Kahle is keen to have the Internet Archive used for serious research.

The Archive is large, but has holes. Its developers are still catching up with indexing, and many graphics from the early history of the web have not been stored. The service has respected robot exclusions, and so has kept out of trouble. Anyone who complains about being captured by the Archive can have their pages purged; indeed, this ‘purge the complainer’ policy is now in common use in web archiving projects. As Kahle put it: ‘If someone wants out of the archive, you simply remove them – and they’re history. Or rather, they’re not history.’

The digital environment is ideal for archiving because storage is cheap (currently $4k per terabyte) and scanning is also cheap ($0.10 per page). At that price, we can have both archival and access copies. We should therefore be identifying resources to archive on the net now.

Can we replicate interlibrary loan (ILL) in the world of digital materials? This is being proposed for licensed materials. Kahle’s thesis is that ‘libraries are special’, and so should be able to develop digital ILL ‘without crushing the publishing system.’ He also looked at the loan of digital materials, citing NetLibrary, which offers such a system. It sounds absurd, but it conforms to publisher business models, and perhaps we should be working with publishers to develop it.

In general, Kahle accuses libraries of having been ‘too wimpy’ about retaining their traditional role in the face of rights worries. We need to ramp up our collection, cataloguing and lending of digital materials. People expect it of us. Libraries need to be more assertive. Can the Internet Archive be preserved? The key, says Kahle, is replication, and in time he would hope to see copies of the Internet Archive mirrored across the globe.

Jean Laleuf of Brown University gave a paper entitled A component repository for learning objects: a progress report. Laleuf described the work his group is doing as ‘exploratory’. They have created about 40 Java applets, made available from their web site. His contention was that there are very few collections of well-designed, reusable software components in education. In educational software, ‘components’ usually means coarse-grained applets (or ‘learning objects’). The Brown team use the term to cover components at a range of levels of granularity, right down to button sliders and the like.

They have developed a component repository using techniques which they call ‘full-grained decomposition’ and a ‘component classification strategy’. Full-grained decomposition, which involves breaking objects down into smaller and smaller parts, requires a very large effort, but is well worth it when establishing a component repository.

The ‘component classification strategy’ employs a matrix which classifies components into three categories: application components, support components and core components. Within these categories, each is ranked according to domain independence, reusability, importance of design, granularity and audience. The goal is to have a full library of components eventually. The team are looking at metadata schemes at present, to allow indexing and harvesting.
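To make the classification concrete, here is a minimal sketch in Python of how one entry in such a matrix might be represented and queried. The component names, rating scales and values are my own illustrative assumptions, not anything taken from the Brown repository.

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    APPLICATION = "application"   # complete applets / learning objects
    SUPPORT = "support"           # mid-level building blocks
    CORE = "core"                 # fine-grained widgets, e.g. sliders

@dataclass
class ComponentRecord:
    """One entry in a component classification matrix (illustrative only)."""
    name: str
    category: Category
    domain_independence: int   # 1 (domain-specific) .. 5 (fully generic)
    reusability: int
    design_importance: int
    granularity: int           # 1 (coarse) .. 5 (fine)
    audience: str

repository = [
    ComponentRecord("SpringMassSimulation", Category.APPLICATION, 1, 2, 5, 1, "students"),
    ComponentRecord("GraphPlotter", Category.SUPPORT, 4, 4, 4, 3, "developers"),
    ComponentRecord("SliderWidget", Category.CORE, 5, 5, 2, 5, "developers"),
]

# e.g. find highly reusable, reasonably fine-grained components
reusable = [c.name for c in repository if c.reusability >= 4 and c.granularity >= 3]
print(reusable)   # ['GraphPlotter', 'SliderWidget']
```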

Gene Golovchinsky of FX Palo Alto Laboratory presented an excellent paper with the title Designing e-books for legal research; it was the runner-up for the Vannevar Bush award. The project designed software for producing e-books for law students, based upon the characteristics of their study behaviour. Generally, students would use it to read scanned legal documents, navigating from link to link by means of a stylus. Students can annotate the sections they wish, as with a marker pen, and then see a composite view of just those annotated paragraphs. Re-annotation can then occur, narrowing down the set of relevant paragraphs. This is very useful software for students not only of law, but potentially of any subject in which a quantity of legacy printed material is required. A notebook function exists too, so that annotated sections can be added in alongside other comments, and annotations and comments can also be pasted into a Word document. In testing, students were very positive about the interface. This is fascinating technology, which is likely to result in a hybrid laptop/ebook device.

Ray Larson, of UC Berkeley, discussed Distributed resource discovery: using Z39.50 to build cross-domain information servers. This project has been supported by an NSF/JISC International Digital Library grant, and features, from the UK, the Universities of Liverpool and Manchester, De Montfort University, the Arts & Humanities Data Service (AHDS), the Natural History Museum and CURL. Databases include the Archives Hub, the AHDS suite and COPAC, together with the Online Archive of California, the Making of America II and the MASTER project for recording manuscripts. The Archives Hub is at the heart of the project, as is its Cheshire information retrieval software.

The problem addressed is that Z39.50 scales poorly once a distributed environment contains hundreds or thousands of servers. The project is looking at how to constrain, and how to discover, the servers to be searched. They are approaching it by using two Z39.50 services:

This is a very efficient approach, since full, usable collection descriptions of databases can be created in seconds.
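As an illustration of the general idea, and not the project’s actual implementation, the sketch below assumes each server has been summarised into a small collection description of indexed terms and their frequencies, and shows how such descriptions could be used to constrain a distributed search to the most promising servers. All server names and term counts are invented.

```python
# Hypothetical collection descriptions: each Z39.50 server is summarised
# by the terms it indexes and their frequencies.
collection_descriptions = {
    "archiveshub.example.org": {"manuscripts": 1200, "letters": 800, "estate": 150},
    "copac.example.org": {"monographs": 50000, "serials": 20000, "letters": 300},
    "ahds.example.org": {"performance": 900, "manuscripts": 250},
}

def rank_servers(query_terms, descriptions, top_n=2):
    """Score each server by the summed frequency of query terms it indexes,
    and return the best-matching servers to search in full."""
    scores = {
        server: sum(terms.get(t, 0) for t in query_terms)
        for server, terms in descriptions.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [server for server, score in ranked[:top_n] if score > 0]

print(rank_servers(["manuscripts", "letters"], collection_descriptions))
# ['archiveshub.example.org', 'copac.example.org']
```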

Carl Lagoze of Cornell University gave a presentation on The Open Archives Initiative. The decision to extend the protocol beyond eprints came from a criticism that the initial approach was conflating politics with technology. It uses ‘deploy now’ technology: Lagoze said they wanted to get something working fast, and so chose ‘well-baked’ tools – HTTP, XML Schema and unqualified Dublin Core. They adopted the ‘80/20 rule’ (only 20% of the work should be new development; 80% should use existing technologies). There are around 35 data providers now registered. The museum community, through CIMI, is doing some very interesting OAi work. A key to understanding OAi is that it is not trying to achieve mass coverage: it is about specialisation, not homogeneity. It is not trying to create a new Google. Lagoze is worried that the growing range of new schemas might swell the OAi cookbook to the point where it becomes too large for implementers. He said that the view which used to be held, that popular search engines don’t work, is no longer true: it is accepted that Google, for example, does a very good job. But OAi is able to include resources (such as the Los Alamos National Laboratory arXiv) which keep the main search engine robots out.
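The harvesting model is simple enough to sketch: an HTTP GET carrying a protocol verb, returning unqualified Dublin Core records wrapped in XML. The following Python fragment is a minimal, illustrative harvester; the repository base URL is hypothetical, and error handling and flow control (resumption tokens) are omitted.

```python
# A minimal sketch of OAi-style harvesting: request ListRecords over HTTP
# and pull out unqualified Dublin Core titles and creators.
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://repository.example.org/oai"   # hypothetical data provider
params = "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(BASE_URL + params) as response:
    tree = ET.parse(response)

DC = "{http://purl.org/dc/elements/1.1/}"
for element in tree.iter():
    if element.tag.endswith("}dc"):          # each DC record container
        title = element.find(DC + "title")
        creator = element.find(DC + "creator")
        print(title.text if title is not None else "(no title)",
              "--", creator.text if creator is not None else "(no creator)")
```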

My paper was entitled Project ANGEL: an open virtual learning environment with sophisticated access management. The slides have been posted on the ANGEL web site.

Peter Brophy of Manchester Metropolitan University gave a paper entitled Evaluating the Distributed National Electronic Resource. This work is at an early stage, but the approach is interesting. CERLIM are using a number of metaphors as frameworks for evaluation: these include seeing the DNER as a library, a museum, a publisher, a digital library, a hybrid library, a gateway, a portal, a managed learning environment, and a dot.com. They are also using a technique known as ‘quality attributes’ which attempts to break down quality into measurable determinants. The perspective taken by the project is primarily pedagogical.

Sayeed Choudhury of Johns Hopkins University presented the results to date of his project, Comprehensive Access to Printed Materials (CAPM). This is examining robotic retrieval and digitisation of store materials at Johns Hopkins, where there are over one million items in an offsite library shelving facility. In assessing economic feasibility, the project looked at the cost of a page-turning robot, which has not yet been built and is a major challenge for the project; the retrieval robot has already been built. The cost per use for digitising works from the shelving facility was estimated to range from around $2.00 to around $37.00. This compares with the cost per item of interlibrary loan supply, which was encouraging for the project. Interestingly, there was no mention of copyright, which will be a real issue for any actual implementation of this system.

I attended a panel session on SMETE, the ‘Science, Mathematics, Engineering & Technology Education’ digital library. This developed out of the Digital Libraries Initiative at NSF, under which it ran during 1998-99, and emerged as the National Science, Mathematics, Engineering and Technology Education Digital Library (NSDL) in 2000; to date it has had two competitive funding programmes. The idea is to have a SMETE digital library in place by the autumn of 2002. The management of the SMETE programme wanted to get away from the idea of a library, which they thought had some of the wrong connotations. This is a slightly controversial notion, and did not conform to the prevailing view, as I inferred, that digital libraries are learning environments. The vision is to meet the needs of learners in both individual and collaborative settings; the library should be constructed to enable dynamic use of a broad array of materials for learning, primarily in digital format; and it should be actively managed to promote reliable ‘anytime, anywhere’ access. The content is a mix of analogue and born-digital material, and much of it will be free.

For 2002 they have requested a budget of $24.6m, with a proposal deadline expected in mid-April. The announcement is at http://www.ehr.nsf.gov/HER/DUE/programs/nsdl/. The programme web site is at http://www.smete.org/nsdl.

The vision is:

One of the ‘shared values’ of the programme is that the library is also a human network.

Requirements for further work are in the following areas:

In our environment, we have very heterogeneous content. Is any of the content in our learning management systems going to be reusable? Resources must be usable with all pedagogies and deliverable via all technologies. Open standards provide ‘discovery stability.’

It was suggested that we need a ‘carefully architected anarchy’ – like Napster. The history of the technology has moved from gopher, via the web, to ‘peer-to-peer’ systems, like Napster. There will be massive online learning communities, which are self-sustaining through collaborations, with expertise distributed among members, and authentic learning contexts and motivations. We were given the mantra ‘Electronic Digital Libraries by the people, for the people’ and urged to ‘think Napster, Freenet and Gnutella.’

The second keynote of the conference was given by Pamela Samuelson of UC Berkeley: Digital rights management: what does it mean for libraries? Introducing her, Christine Borgman informed us that Professor Samuelson is a professor in both the Faculty of Information Science and the Faculty of Law at Berkeley, and that she was recently listed among the top 100 lawyers in the US.

Her paper looked at commercial digital rights management (DRM) systems, to examine the differences between them and the systems we are developing. These DRM systems have been designed to enable fine-grained control over the commercial distribution of digital content. Most are still in the design and development stage. What they are leading to, however, is the ‘disintermediation of libraries’: publishers are eager to ‘cut out the middle man.’ Publishers can define the range of authorised uses and build this into code, e.g. ‘this work can be looked at, annotated, printed, downloaded, copied or shared’ – but all for different prices. Is this the publishers’ nirvana, asked Samuelson. The relevant quotation in the legal world is ‘code as code’ (i.e. the publisher sets the rules and technology enforces them). Another quotation is ‘the answer to the machine is the machine’ – i.e. if machines enable free copying, other machines can stop it. This is ‘Star Wars’ technology for the world of IP. In the music industry there is now an organisation called SDMI (the Secure Digital Music Initiative), which has developed watermarks for digital music to encode rights information and conditions. DRM systems can build in user monitoring for marketing purposes, pricing strategies (different prices can be charged to different people) and the sale of user profiles. Authentication systems were mentioned, including biometric measures such as retinal scans. Illegal use can trigger the self-destruction of applications.
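To make the ‘code as code’ point concrete, here is a deliberately simplified sketch of a machine-readable rights expression and the check a reading application might perform. It does not reproduce any real DRM system’s format; the work identifier, permitted uses and prices are invented.

```python
# An illustrative rights expression: the publisher expresses permitted uses
# and their prices as data, and the software enforces them.
rights = {
    "work_id": "doc-1234",
    "permissions": {          # permitted use -> price in USD per act
        "view": 0.50,
        "annotate": 1.00,
        "print": 2.00,
        "download": 5.00,
        "copy": None,          # None = not permitted at any price
        "share": None,
    },
}

def authorise(action, paid):
    """Allow the action only if it is permitted and the price has been paid."""
    price = rights["permissions"].get(action)
    if price is None:
        return False
    return paid >= price

print(authorise("print", paid=2.00))   # True
print(authorise("share", paid=10.00))  # False: sharing is not licensed
```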

Publishers in the current environment, implied Samuelson, are completely paranoid. They worry that bits are ‘too copiable’ and prone to hackers. They also feel that authors are too greedy, and librarians are the enemy, since they want to give users access to content for free (this was another reference to Pat Schroeder’s Washington Post article). Customers are would-be thieves and computer manufacturers collude in this.

Before DRM systems, copyright was a fairly limited set of rights. There was no legal right to control uses or access to works. The ‘first sale’ doctrine made lending legal. Fair use and library and archival exceptions to copyright promoted learning and the preservation of information. Copyright expiry means that derivative uses are possible from out-of-copyright works (and, ironically, from the publishers’ view, this abuse of their ‘property’ arose largely from the tax on them which is legal deposit). Also, copyright law never reached private conduct. We sing in the shower and use other people’s intellectual property in all kinds of private ways, for free.

The recent DRM copyright law white paper gives owners the right to control temporary copies, even in computer memory, which thus gives them absolute rights over information. The first-sale doctrine does not apply, since digital works always require copies to be made. Fair use would disappear, because all use would be licensed. The notion of ‘public domain’, in the eyes of publishers, is an artefact of bad technology. It could be argued that ‘out of copyright’ is too, since publishers would really, if the technology permitted it, like to be able to eliminate all material once copyright ends. Deposit, archiving and ILL are outmoded and should disappear. Intermediaries, such as libraries, should function as copyright police (and would therefore be liable for infringement). All of this is, at least potentially, the way the world of digital rights is moving.

There are other things to fear. UCITA is a proposed law to enforce mass-market licences. The Digital Millennium Copyright Act (1998) created two new IP rights: anti-circumvention rights, and a prohibition on the removal of copyright management information. Worryingly for libraries, preservation is not a valid exception to anti-circumvention. The DMCA appears to allow no right to interoperate with data; no fair use; no linking to possibly illegal material; no exceptions for publishers (or digital libraries, which of course often are publishers); and potentially no interactive software. One implication is that digital libraries would have to purge all material which demonstrated weaknesses in systems.

Reminding us of our mission, after having scared us rigid, Samuelson suggested we think of digital libraries as an alternative model for distribution of digital content which is more user-centred and more service-oriented than DRM systems. Might we offer publishers useful lessons here? Indeed, might we provide competition to publishers? She also advised academic authors not to sign away their copyright to publishers. Almost all publishers will accept amended conditions. Copyright is a ‘much thicker right’ than it used to be, so must never be signed away lightly, particularly when fair use as a concept is in danger of disappearing.

She recommended that we should remember that law is a social construct, not a given. So are DRM systems. The US Constitution sets a goal for IP policy which is that it should ‘promote the progress of science and the useful arts.’ DRM copyright law diverges from this because publishers have united behind only one vision of the future. What alternative vision of IP and information policy can we devise for an information society we’d want to live in, and how can we make it happen?

The final conference keynote was by Clifford Lynch, Executive Director of the Coalition for Networked Information: Interoperability: the still unfulfilled promise of networked information. Lynch is a philosopher of information science, and this was a typically thoughtful treatise. There is great consensus behind the notion that interoperability ‘is a good thing.’ However, we have no means of measuring it, nor do we even know whether it is ‘a binary thing, or something more graduated.’ Interoperability should be about expectations. We should have the right to expect systems to interoperate – and interoperability is a lot more than simply engineering round a problem of incompatibility between two systems. Email has achieved the interoperability expectation. At the ‘upper levels’, however, things become ‘much shakier.’ Interoperability means much more than simply ‘federated searching.’ The next major challenge is to achieve ‘semantic interoperability’.

We need to think also about architecture when we discuss interoperability, and there has been a lot of work done on architecture, but it is mostly top-down. We need to address architecture ‘from the ground up’ stated Lynch. We have done well with the basic engineering – TCP/IP, HTTP etc. We have done well with navigation too, the reason being that, in addressing navigation, ‘humans are in the loop.’ The web is a navigation triumph, and delivers interoperability, but really only to ‘the human perception apparatus.’ It does not deliver semantic interoperability, which is the dream which will allow us to have machines properly work for us and make decisions based on our experiences and objectives.

Lynch encouraged us to study failures. We don’t do enough of this, and it is important that we do. Research into failure is unglamorous: ‘How we designed a protocol and screwed it up’ is not the sort of paper most of us would wish to write. But the Z39.50 community, to its credit, has done this. We need to look at case studies. In the case of Z39.50, the mechanism of retrieving results works fine. But Z39.50 bundles mechanism together with semantics: it attempts to incorporate the semantics of search. This contrasts with the Open Archives Initiative protocol, which tackles the simpler problem of harvesting metadata into a centralised database. Despite its problems, Lynch implied, Z39.50, with its bold (and hitherto unrealised) objective, might be the more future-ready protocol.

Dublin Core identified a lowest common denominator set of fields to permit interoperability, but many data providers have found it too insipid to be of use to them. This led to the development of specific qualifiers which can be grafted on to the basic elements in order to make the standard useful for certain groups of users. The DC community thus achieved the successful use of the ‘graceful degradation’ (or ‘dumb-down’) principle. This is a very important contribution made by the DC community to interoperability. The idea is that, even if one does not require or understand the qualified DC element, one will still receive some utility from the unqualified element.
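A minimal sketch of the dumb-down principle follows, assuming a record whose qualified fields use a simple ‘element.qualifier’ naming convention (the field names and values are invented for illustration): a consumer that does not recognise a qualifier strips it and keeps the base element, so some utility always survives.

```python
# Illustrative 'dumb-down' of qualified Dublin Core to the 15 base elements.
qualified_record = {
    "date.created": "1851-05-01",
    "date.issued": "1852-01-01",
    "coverage.spatial": "London",
    "title": "The Great Exhibition",
}

BASE_ELEMENTS = {"title", "creator", "subject", "description", "publisher",
                 "contributor", "date", "type", "format", "identifier",
                 "source", "language", "relation", "coverage", "rights"}

def dumb_down(record):
    """Strip unrecognised qualifiers, keeping the base DC element."""
    simple = {}
    for field, value in record.items():
        base = field.split(".", 1)[0]
        if base in BASE_ELEMENTS:
            simple.setdefault(base, []).append(value)
    return simple

print(dumb_down(qualified_record))
# {'date': ['1851-05-01', '1852-01-01'], 'coverage': ['London'],
#  'title': ['The Great Exhibition']}
```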

Lynch then turned to the Berners-Lee proposal for ‘the semantic web.’ This will be a very important case study for interoperability. There are a number of entities in the information universe which are ready to move, as email has, into true interoperability. Identity is one – unique identity is on the way. Another is reputation (on which recommender systems, for example, will be based). Vocabularies, authority files and gazetteers are other instances. One of the key challenges for the next few years is whether we can move these entities out of closed systems and into general infrastructure.

Commercial systems talk about ‘stickiness’, i.e. keeping the customer in the system. They want systems to be closed, and so are opposed to interoperability; many of them, for example, tried to keep email within a closed system. An interesting perspective on the new DNER architecture’s emphasis on middleware and infrastructure is that ‘stickiness’ is almost an unworthy objective for an academic system to have. Amazon.com, for example, delivers a significant quantity of personalised service to customers in order that they will not move to competitors. The challenge for our community is to move personalisation into infrastructure, at least in the academic world, and to do so for the world of learning and research.

One of the key ideas to emerge from the presentation, and one which the funding agencies must address, is the urgency of developing the unglamorous elements of the information landscape. An example of this, supported by both Lynch and Carl Lagoze, is the (too extensive) attention currently being given to OAi, in distinction, say, to universal naming standards like the Uniform Resource Name and the Digital Object Identifier. We all assume these latter will come along with the infrastructure in due course. But they need to be worked on.

Kathleen McKeown of Columbia University gave one of the more interesting presentations, entitled PERSIVAL: a system for personalised search and summarization over multimedia healthcare information. This system exploits the online patient record in building a clinical information system. The aim is to provide a system which allows a clinician – and ultimately a patient – to ask queries about information they find in the patient record. The system integrates the patient record with other information sources available at the Columbia Presbyterian Medical Center, e.g. journal articles and consumer health information. Users can pose queries, or the system can generate them automatically, based on previously asked queries. The system provides automatic answers to user queries (based on sophisticated semantics) and suggests specific queries based on individual patient record information. The project is developing a ‘metasearcher’ to enable queries to be run over heterogeneous resources.

The PERSIVAL team are also developing automated classification techniques for hierarchical topic generation, using a ‘trainable’ rule-based classifier. Terms are converted to semantic concepts, which are then matched against patient record concepts: the system matches the concept profile from the patient record against journal articles with similar concept profiles. The team are also developing content-based echo video search tools. In general, the system promises patients a rich resource of information about their own condition – something which they have not in the past normally enjoyed – as well as an important information source for clinicians.
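As an illustration of the concept-profile matching idea (not PERSIVAL’s actual algorithm), the sketch below represents the patient record and candidate articles as weighted concept vectors and ranks the articles by cosine similarity; all concepts, weights and article identifiers are invented.

```python
# Rank articles by how closely their concept profiles match the patient record.
import math

patient_profile = {"atrial fibrillation": 0.9, "warfarin": 0.7, "stroke risk": 0.4}

articles = {
    "article-101": {"atrial fibrillation": 0.8, "stroke risk": 0.6, "anticoagulation": 0.5},
    "article-102": {"asthma": 0.9, "inhaled steroids": 0.7},
}

def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

ranked = sorted(articles, key=lambda art: cosine(patient_profile, articles[art]),
                reverse=True)
print(ranked)   # ['article-101', 'article-102']
```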

The system assembles the retrieved information and summarises it for the user, selecting sources by ‘genre’ (e.g. for clinicians the summary would be based on the medical journal article, whereas for patients, the summary would be based upon consumer health information). They have also developed a medical dictionary tool using reliable sources. The echo video information can also be extracted and associated in the summary.

Gregory Crane’s paper, Building a hypertextual digital library in the humanities: a case study on London, won the Vannevar Bush award for best conference paper. It described a new humanities digital library collection: a large textual resource and 10,000 images representing books, images and maps on pre-20th century London. This collection is available in the Perseus Digital Library (http://www.perseus.tufts.edu). The inspiration came from a special collection of material on London and its environs at Tufts, which the university had acquired in 1922. The team set out to see how time and space could be used as axes along which to organise the materials. They began with the premise that reference works are a logical starting place for building digital library collections. An interesting point made by Crane is that digital libraries are intrinsically related to the way information has been organised since earliest times. The team was interested to see how the information organisation in the printed material supported the automatic generation of links and visualisation interfaces. Crane calls attention to the difference between ‘literary reading’ and ‘utilitarian reading’ – reading for a specific purpose – which is the type of reading most usually done by users of libraries. His contention is that those who ‘historicise’ documents – seeking to experience them as clues to past cultures – have to read in both modes at once.

In Crane’s view, generalising from the work he has done on the London collection and also on the more extensive classical digital library he has developed in Project Perseus, digital libraries exist to deepen the knowledge of their users on particular subjects, but also to improve their approaches to problems generally. Their utility cannot be measured either by financial gain or by volume of site traffic.

Crane touches also upon the granularity question, which was significantly to the fore throughout this conference. Document-to-document links are not enough. He argues for ‘span-to-span’ links connecting arbitrary subsections of documents. He also emphasizes the need for as many links to related materials as possible. The objective, he states, is recall rather than precision in humanities research.

One of the themes of the paper is the need for automatically-generated links. He cites the New Variorum Shakespeare as a work which is very well supplied with ‘hand-crafted’ links; however, it is so labour-intensive that each new edition is instantly out of date with current Shakespeare scholarship. Supporting scholarly reading therefore requires that each document can be connected to a hypertextual digital library. London tags Latin and Greek words, which were much more extensively used by the writers of the time than they are today, to linguistic support tools. It also tags name references (to people, places and topics) to a range of reference works.

What is particularly interesting about the London approach is the pragmatic compromises made by the development team. In creating support material for place-names, for example, Crane admits that ideally they would have used one unified authority list (instead of several), but they didn’t have the resources to do this, so made do with several indexes which contain lots of name variants, leaving it to the scholars to deal with the ambiguities. This perhaps illustrates the different approach taken by academics to that likely to be advocated by librarians. What is also clear is that the reference support material has to be customised to the collection. The team mined gazetteers and dictionaries, using out-of-copyright reference works and negotiating with Bartholomew’s for vector map data in order to create this specialised reference resource. In all, the collection contains 284,000 automatic links (roughly one word in every forty). These are generated at runtime. In terms of web design, Crane admits that this is not graceful – but as a scholarly resource it is invaluable.
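To illustrate what runtime link generation of this kind might look like (this is not the Perseus code), the sketch below matches words in a passage against a small, invented authority index and wraps recognised words in links; the index entries and URLs are hypothetical.

```python
# Generate links at runtime by matching words against an authority index.
import re

authority_index = {
    "cheapside": "/reference/gazetteer#cheapside",
    "pepys": "/reference/biography#pepys",
    "guildhall": "/reference/gazetteer#guildhall",
}

def link_text(text, index):
    """Wrap any word found in the authority index in an HTML link."""
    def repl(match):
        word = match.group(0)
        target = index.get(word.lower())
        return f'<a href="{target}">{word}</a>' if target else word
    return re.sub(r"[A-Za-z]+", repl, text)

print(link_text("Pepys walked from Cheapside to the Guildhall.", authority_index))
```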

The team were particularly interested in the integration of GIS (geographic information systems), and used a modern GIS to align their historical maps of the city. Crane admits that they need to do a lot more work on London. In time, it should have a unified authority index. There is also a lot more ‘mining’ of sources to be done to increase the value of the index through disambiguation, but this requires a lot of laborious work through many contemporary sources. Other areas requiring further work are the mining of tabular information (e.g. by visualising statistical information in a GIS) and aligning monetary information to the time axis to show how prices changed over time. Another feature which will be capable of being added in time is ‘temporal spatial query’ (e.g. one will be able to ask the system to provide all documents relating to St Paul’s from the 1630s).

Crane anticipates that, in time, the resource will link to third-party services via ‘open citation linking’, and that third-party sources could also filter their data via London’s visualisation tools. He is unworried about the reference tools developing over time, since they will be added more slowly than the general collection of non-reference materials, and will ‘catch up’. In conclusion, Crane urges anyone contemplating the creation of such collections to start with reference works, and also to use XML not only for its ability to map document structure, but also because it can help to resolve ambiguous referents.

This was a very stimulating conference, with some important contributions to the key themes in digital library research and development over the next few years. I was delighted to be able to attend it, and am very grateful to the DNER Programme Office for its support in allowing me to travel to Roanoke.

 John MacColl,
Director of SELLIC
University of Edinburgh
Email: j.maccoll@ed.ac.uk
Website: http://www.ed.ac.uk