Clifford Lynch in Interview
Z39.50 has been around for a long time now - why do you think it has not been assimilated into networked retrieval applications and technologies to the same extent as, for example, CD-ROMs or the Web?
I think that Z39.50 is now well established for large, if you will, mainframe or server-based systems. Certainly, those folks who wonder what real Z39.50 systems are in use could look at institutions like the University of California, where we have had it in production for a number of years, and we're using it every day to provide access to our community of close to a quarter of a million people, to resources mounted at places like OCLC and RLG.
We also are using it for circulation data out of local library management systems, which have a very high transaction rate and for which we will easily clock hundreds of thousands of queries a day, according to circulation statistics.
So there is no question in my mind that it is well established as a production underpinning technology. It's a very interesting question of why it isn't used more in CD-ROM systems; there is no particular technical reason why you can't provide a Z39.50 interface to a CD-ROM database - in fact, I believe it's been done. In the very early days, of course, there were constraints on the size of the PCs that were often the access portals to CD-ROM-based information systems, where running a full TCP/IP stack plus Z39.50 was a lot to ask of those machines. That's certainly no longer the case with current technology, and I would speculate, although it would be very hard to prove this, that a lot of the reason really has to do with marketing decisions and the view that they don't want a large, uncontrolled user community using CD-ROM, because of licensing terms and financial models.
That's, perhaps, a primary reason why we haven't seen more Z39.50 access to CD-ROM databases. Many of those still follow a pattern where they have one set of fees for a standalone CD-ROM, another set of fees for a networked CD-ROM, and often they charge by the number of concurrent users. Trying to enforce those kinds of rationing mechanisms for a system that is actually put out on a user's premises is quite tricky, so I think that in many cases it has really been driven more by marketing and licensing considerations.
I think we've also seen many content providers who are much more comfortable with the physical character of the CD-ROM and the notion that the performance characteristics of the CD-ROM are such that not that many people can use it concurrently, and they really don't want to turn these into wide area networking systems.
Metadata formats, especially Dublin Core - when are they going to happen, who is going to do them, what are the long-term prospects?
We have spent a lot of time on that this morning at this meeting. I guess my personal view on this is that one of the drivers will be the development of encoding standards that let us attach Dublin Core metadata or other metadata packages to documents, or allow servers to provide metadata that is associated with objects. The third model which was discussed this morning, external but linked metadata that is moved with the document, is, I think, more problematic because it places a requirement on the protocols and software systems that move objects around to keep the objects and metadata together and related. Certainly for most of the protocols in common use today -- FTP, HTTP, etc. -- this is not a natural process: these protocols aren't designed to move complex, inter-related constellations of objects around in a consistent way. So I think that encoding standards are a first step, authoring and site management tools the second. Personally, I believe that as soon as a substantial body of metadata gets out there, we will see indexing services that capitalize on the advantages offered by this body of metadata.
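As an illustration of the first model mentioned above (metadata attached to the document itself), here is a minimal Python sketch that renders a few descriptive elements as HTML meta tags. The "DC." name prefix follows a common convention for embedding Dublin Core in HTML; the function name and the sample record are hypothetical, and this is not a complete or authoritative encoding.

```python
from html import escape


def dublin_core_meta(fields):
    """Render a dict of Dublin Core elements as HTML <meta> tags.

    `fields` maps an element name (e.g. "title") to its value.
    Values are escaped so they are safe inside an attribute.
    """
    lines = []
    for element, value in fields.items():
        name = escape("DC." + element, quote=True)
        content = escape(value, quote=True)
        lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)


# Hypothetical record, for illustration only.
record = {
    "title": "An Example Document",
    "creator": "A. N. Author",
    "subject": "metadata; information retrieval",
}
print(dublin_core_meta(record))
```

Because the tags travel inside the document, any protocol that moves the file also moves the metadata, which is what makes this model easier than external but linked metadata.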
Can you be more specific, in other words quantify, what, or how much, you mean by a substantial body of metadata?
One of the difficulties here is defining your universe. It's certainly a property in general of information retrieval systems that if you have very sparse attributes it's very tricky to use for retrieval purposes. For example, imagine that you had a bibliographic database and only 5 per cent of the records had subject headings.
Subject searching is clearly a powerful tool for the user, but presenting it in such an environment is very difficult, because by doing a subject search you are de facto restricting yourself to a very small subset of the database in this scenario. The user probably doesn't understand which 5 per cent of the database had subject headings or why -- the policy for when subject headings are associated with records becomes a critical element in understanding what searching this hypothetical database really means. I think that we face a similar problem with the broad-based web indexing services like Alta Vista or Lycos. Their focus is sort of maximum reach and lowest common denominator. There's a tremendous amount of transient, relatively low value information that's indexed by these services. Those are not the places where I would expect to see metadata appear first; I would expect to see it appear on high value, persistent content. In fact, this may only serve to reinforce the value of metadata: by restricting to documents with associated metadata, one may in fact be restricting to the more relevant part of the web for scholarly purposes.
So, part of me at least thinks that we may see metadata first capitalised on as a way of enhancing retrieval by more disciplined, specific, subject-focused and selective web indexing services -- maybe those run by the academic/research community, or by communities of people interested in specific sorts of content. You know when you do a wedding invitation, are you going to attach metadata to it when you incorporate it in your personal web page? I don't think so.
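The hypothetical database with sparse subject headings can be made concrete with a small Python sketch. The record layout, the function names and the sample data are all assumptions made for illustration; the point is only that a subject search silently ignores every record that carries no headings.

```python
def subject_search(records, term):
    """Return the records whose subject headings include `term`.

    Records without a "subjects" field can never match, however
    relevant they are.
    """
    return [r for r in records if term in r.get("subjects", [])]


def coverage(records):
    """Fraction of records that carry at least one subject heading."""
    if not records:
        return 0.0
    with_subjects = sum(1 for r in records if r.get("subjects"))
    return with_subjects / len(records)


# 100 records on the same topic; only 5 have been given subject headings.
records = [{"id": i, "title": "Networked retrieval"} for i in range(100)]
for r in records[:5]:
    r["subjects"] = ["information retrieval"]

hits = subject_search(records, "information retrieval")
print(len(hits), "hits; subject coverage is", coverage(records))
```

All 100 records are about the same topic, yet the search can return at most the 5 that were catalogued, which is why the policy behind assigning headings matters to the user.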
I think that we may see effective use of metadata show up rather later in the very broad-based web indexes that provide access to the commodity web. I think too that we need to be careful about equating Dublin Core descriptive metadata and metadata in general. In fact, metadata is much broader than just the Dublin Core -- it includes parental control kinds of ratings that the current version of PICS is intended to support. It includes rights management, terms and conditions; it includes evaluative information. I think that the Dublin Core is an important first step in supporting descriptive information. The Warwick Framework gives us a broader setting into which we can also slot other kinds of metadata, and I think it's very important that we get some projects moving which give us some real experience with other classes of (non-descriptive) metadata. We need to understand how these classes of metadata work in the information discovery and retrieval process.
I'm particularly intrigued, for example, by some of the experiments that are going on with collaborative filtering systems, like some of the systems Firefly and their competitors have deployed on the web. There's a lot of interest in using PICS as a means of carrying this kind of metadata. I think that the metadata requirements for supporting evaluative information and community filtering are a very fruitful research area.
Mirroring and caching - your thoughts on these two approaches to reducing long-distance network traffic?
Mirroring and caching are both showing up as important issues, basically for getting around the very congested links that characterise parts of the internet. This is, as you know, especially acute when we deal with traffic and information across the Atlantic and across the Pacific where there just never seems to be enough bandwidth.
We in the UK don't get to experience the speed, or otherwise, of the connection between the US and Australia; how bad is it?
It's not good, last time I looked. I understand that in Australia they have set up tariffs and pricing models that essentially charge internet users on a traffic sensitive basis for information moving in and out of Australia. This has made them even more sensitive to bandwidth utilization. It's different than the US-UK situation; there's an economic issue as well as a (public good) performance issue.
I think that when we talk about caching and mirroring we are discussing two very different ideas. Caching is a low-level engineering trick that shows up at every level of a computer system or a network of computer systems. It appears in managing the transfer between memory and the CPU inside a processor, between memory and disk storage in a computer system, and throughout a network of computers. It's just good engineering to do caches. Now, caching at an institutional or even national level, at the boundary between an organization and the broader network, for example, is a relatively new idea. The performance claims I have seen for some of these caches are very impressive.
This kind of high level caching is raising new issues that go beyond engineering optimizations. There are some interesting problems in the interaction between caching and business models. For example, many web sites are supported at least in part by advertising, and part of setting your advertising rates is the ability to make accurate statements about how often your pages are being accessed. Having your pages cached messes up these access counts, and in fact can cost a site money in advertising revenue. There are protocol provisions that have been developed, though I'm not clear how widely they are implemented in practice, that ensure accurate reporting of page access for sites (or at least place bounds on the level of inaccuracy). What remains to be seen is how comfortable web sites that are operating in the tradition of circulation-audited periodicals are with these technical solutions.
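To see why caching distorts access counts, and how a cache could report the hits it absorbs, consider the following minimal Python sketch. It does not implement any actual protocol provision; the class, its methods and the example URL are hypothetical and serve only to illustrate the idea.

```python
class CountingCache:
    """A toy cache that counts the hits the origin site never sees."""

    def __init__(self, fetch):
        self._fetch = fetch      # function that retrieves a page from the origin
        self._store = {}         # url -> cached page body
        self._unreported = {}    # url -> hits served from cache since last report

    def get(self, url):
        if url not in self._store:
            # Cache miss: the origin serves (and can count) this request.
            self._store[url] = self._fetch(url)
        else:
            # Cache hit: invisible to the origin unless reported later.
            self._unreported[url] = self._unreported.get(url, 0) + 1
        return self._store[url]

    def report(self):
        """Return the hit counts owed to the origin and reset them."""
        counts, self._unreported = self._unreported, {}
        return counts


origin_log = []


def origin_fetch(url):
    origin_log.append(url)       # the origin's own access count
    return "page body for " + url


cache = CountingCache(origin_fetch)
for _ in range(3):
    cache.get("http://example.org/index.html")

print("origin saw", len(origin_log), "request(s)")
print("cache owes a report of", cache.report())
```

The origin logs one request although the page was read three times; the other two accesses exist only in the cache's report, so the accuracy of the site's figures depends on how often, and how reliably, such reports arrive.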
There are also copyright concerns that have been raised with caching. There's this odd tension. As you move information around the network through various routers and other intermediate systems that may include caches, copies are made (even in the trivial case of copying packets from one router interface to another). There have been legal theories offered that even very transient copies represent copies in the legal (copyright) sense. The longer copies persist, the more nervous people get about the copyright issues. This is an area of deep uncertainty. To me, however, caching is fundamentally an engineering issue; I tend to reject the legal arguments. Mirroring is something very different.
Mirroring is a deliberate replication strategy which you apply not only to counter bandwidth problems, but also for reasons of reliability, replicating data so that if one site fails, there is another copy. It seems to me that the notion of the mirroring of sites is one that is still poorly formulated, poorly supported and perhaps is sometimes solving the wrong problem. In speaking to people who run mirror sites, it seems that running a mirror site actually involves a substantial amount of work, getting all the pointers to be consistent, getting all the links to be consistent, and dealing with not just moving files but a linkage structure within files, where you may also have external links.
It seems to me that one of the things we need to develop is a higher level construct that incorporates mirroring but doesn't explicitly expose mirroring to the end user; a notion, for example, of a public file space into which you could place potentially high use files for distribution. This public file system would be a distributed systems construct that might appear in many places around the network, and would automatically do the right kind of replication and caching; it would automate much of the mechanics of mirroring which are currently visible to, and of concern to, system administrators. For example, imagine that Netscape releases the latest version of their web browser -- the public file space should just replicate this around the network as needed, as long as usage justified extensive replication, and should provide users with an easy way of finding the "nearest" version. I think that mirroring is just an early, immature step towards a much more sophisticated and user-oriented view of such a public file space. As networked information applications mature, and we understand the requirements better, mirroring will become a lower-level engineering method much like caching and end users won't see it. I view mirroring today as a catch-all term for what should really be more sophisticated and differentiated end-user oriented products and services.
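Two of the behaviours ascribed to such a public file space, finding the "nearest" copy and replicating only where usage justifies it, can be sketched in a few lines of Python. The region names, the cost table, the threshold and both function names are invented for illustration; a real system would measure network distance rather than look it up in a table.

```python
# Hypothetical link costs between regions (higher means "further away").
COST = {
    frozenset(["uk", "us"]): 1,
    frozenset(["uk", "au"]): 3,
    frozenset(["us", "au"]): 2,
}


def distance(a, b):
    """Cost of moving data between two regions (0 within a region)."""
    return 0 if a == b else COST[frozenset([a, b])]


def nearest_replica(replicas, client_region):
    """Pick the replica site that is cheapest for this client to reach."""
    return min(replicas, key=lambda site: distance(client_region, site))


def sites_to_add(request_counts, replicas, threshold):
    """Regions whose demand justifies a new local replica."""
    return sorted(
        region
        for region, count in request_counts.items()
        if count >= threshold and region not in replicas
    )


replicas = {"us"}                            # the file starts at one site
demand = {"us": 900, "uk": 500, "au": 20}    # requests per day, by region

print(nearest_replica(replicas, "uk"))       # only one choice so far
replicas.update(sites_to_add(demand, replicas, threshold=100))
print(sorted(replicas))
print(nearest_replica(replicas, "uk"))       # now served locally
```

The user only ever asks for the file; which site answers, and when a new copy is created, is decided by the file space rather than by a system administrator maintaining a mirror.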
Have you got any thoughts, now you are moving over from a more researchie type role to a more political role as the executive director of CNI?
I'm amused by how my role at the University of California is characterized as research-oriented. In fact, the primary function of that role has been an operational one, running the MELVYL system (a large information access service) for the whole UC faculty, student and staff community. I also oversee the intercampus network, which links the nine UC campuses together and provides internet access. Of course, in this role I am involved in various standards and advanced development activities on behalf of the University.
In some ways the biggest change I see in my new role at CNI is that I won't have such a big operational component. At CNI I will be able to focus more intensely on policy and planning issues, looking more at advanced technologies, standards, and how they can translate into operational infrastructure for the whole CNI community. I'll be concerned specifically with the interplay among organizations, technology, standards and infrastructure, and I'll be trying to serve a much broader community -- not just the University of California but the whole CNI task force membership, the constituencies of the sponsoring organizations, and beyond. I think it's too early for me to talk in detail about how I see the specific agenda for CNI shaping up over the next couple of years. As you know, I won't be taking up my position there till mid-July, and while I have lots of ideas, I need to consult with the sponsor organizations (ARL, CAUSE, and EDUCOM), my steering committee, and with the CNI task force broadly before I cast these ideas into specific program initiatives.
I can tell you that I see a number of areas involving linkages among organizations, infrastructure, networked information, and technologies as being very crucial. A good example is authentication, and coming up with authentication and authorization strategies that will facilitate sharing and commerce in networked information. I don't know precisely what role CNI is going to play in this, but one thing I'm hearing from the community is a need for some broad-based leadership, and I think that CNI can help there. Naming -- persistent naming of digital objects -- is another important issue.
CNI has been active in the metadata area. It has co-sponsored several of the Dublin Core metadata meetings and has conducted research into the role of metadata in networked information discovery and retrieval. I've spent time working with Craig Summerhill, Cecilia Preston, and Avra Michaelson on a white paper in this area which remains incomplete, and I'm eager to get back to work on this and finish it. And there are many other areas where I am hoping that the work of the Coalition can help to inform and move the community's discussion forward. I'm hopeful that in my new role I'll have more time for writing which can contribute to this sort of progress.
The picture was taken at the UKOLN-organised Beyond the Beginning conference, in London, June 1997. Thanks to the British Library for permission to reproduce it.
University of California,
Executive Director of CNI