Minotaur: Metadata - To Be, or Not to Be (Catalogued)

Gordon Dunsire thinks that all is not rosy in the garden that is metadata, and wonders how it can assist cataloguing in a real-world sense.

Metadata [1]: it's one of those words that rolls off the tongue. I expect Tony Hancock, of h-h-h-half hour fame, could have had some fun with it, as in (I think) 'The Blood Donor', where the 'drinka pinta milka day' slogan catches his eye - eatametadataday, anyone?

What I want to rant about in this column is something close, yet 'further away' - metametadata. If metadata is 'data about data', then metametadata is 'data about metadata'. While the Webblies have at last cottoned on to the need for some kind of structured approach to information retrieval, and there is much gnashing of teeth about Dublin cores and the like, the emphasis remains on structure, rather than content. In other words, it's all very nice to know that most, if not all, Web objects ought to indicate who the 'author' is, and what subjects are covered, etc., but where are the guidelines on how to formulate the content of this metadata? Is it OK to enter 'Gordon Dunsire' as the author of this object, or should it be 'G Dunsire', or come to that, 'Dunsire, Gordon', etc.? And what happens if I use a pseudonym, not to conceal but to categorise various types of output? Will the form I choose make a difference to the reader's ability to search for all the stuff on the Web that's written by me? Or about me? Do search-engines know that 'G Dunsire' and 'Dunsire, Gordon' are the same thing? And that 'George Dunsire' isn't? What happens if my name is 'John Smith'?
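To make the worry concrete, here is a minimal sketch in Python of the sort of naive matching a search tool might attempt. The function name naive_key and its 'surname plus first initial' rule are assumptions invented for this illustration; they do not describe how any real search-engine works.

```python
def naive_key(name: str) -> str:
    """Reduce a personal name to 'surname, first initial' in lower case.

    Hypothetical rule for illustration only: it handles 'Forename Surname'
    and 'Surname, Forename' and nothing more exotic.
    """
    name = name.strip()
    if not name:
        return ""
    if "," in name:
        # Inverted form: everything before the comma is the surname.
        surname, _, forenames = name.partition(",")
    else:
        # Direct form: assume the last word is the surname.
        parts = name.split()
        surname, forenames = parts[-1], " ".join(parts[:-1])
    forenames = forenames.strip()
    initial = forenames[0] if forenames else ""
    return f"{surname.strip().lower()}, {initial.lower()}"


for form in ["Gordon Dunsire", "G Dunsire", "Dunsire, Gordon", "George Dunsire"]:
    print(f"{form!r} -> {naive_key(form)!r}")
```

All four forms collapse to the same key, so the rule that unites 'G Dunsire' with 'Dunsire, Gordon' also sweeps in 'George Dunsire'.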

And that's the easy bit! Most people know what their name(s) are, and most people understand personal name inversion to allow filing under surname. But what about subject content? Some cataloguers indulge in fantasies whereby all 'authors' entitle their works with the appropriate Library of Congress Subject Heading (unfortunately, LCSH isn't even consistent within itself) or, better still, the classification number (but then, which scheme: DDC, UDC, LC, other?). Thus, instead of 'The World Wide Web unleashed' we get '004.6'; this has the advantages of brevity and accuracy. The book (a hypoWeb object) is not about dogs or spiders; it's about computer communications (or is it?). The trouble is, it's not very reader- or author-friendly, and authors do like to intrigue with their titles. Richard Dawkins is a shining example: 'The extended phenotype: the gene as the unit of selection' (nice one!); 'The selfish gene' (ok, probably about genetics); 'The blind watchmaker' (wha???); 'Climbing Mount Improbable' (give us a break!).

"When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean - neither more nor less."
"The question is", said Alice, "whether you can make words mean so many different things." "The question is", said Humpty Dumpty, "which is to be master - that's all." [2]

So who is to be master? Most cataloguers would say, if not the author, then the authority file. Standardised lists of metadata content terms have been around for some time. Names are taken care of by the combined British Library/Library of Congress file (in English transliteration, at least). Subject words and headings have some standardisation through LCSH, although it is not perfect. In general, there should be standardised answers to the interrogative primitives of 'who?' (personal and corporate names), 'when?' (standard event citations including date and time), 'where?' (standard geographical thesauri and gazetteers), 'which?' (publication and event titles as citations), and combinations of these.
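As a rough sketch of what an authority file does for the 'who?' question, consider the toy Python lookup below. Everything in it is hypothetical: AUTHORITY_FILE, build_index and authorised_heading are names made up for this example, and the single entry is not taken from the British Library/Library of Congress file or its format.

```python
# Hypothetical miniature authority file: one authorised heading per person,
# each with the variant forms a cataloguer has decided belong to it.
AUTHORITY_FILE = {
    "Dunsire, Gordon": ["Gordon Dunsire", "G Dunsire"],
}


def build_index(authority_file):
    """Map every known form (case-insensitively) to its authorised heading."""
    index = {}
    for heading, variants in authority_file.items():
        for form in [heading, *variants]:
            index[form.casefold()] = heading
    return index


INDEX = build_index(AUTHORITY_FILE)


def authorised_heading(form):
    """Return the authorised heading for a name form, or None if unknown."""
    return INDEX.get(form.strip().casefold())


print(authorised_heading("G Dunsire"))       # a listed variant
print(authorised_heading("George Dunsire"))  # not in the file
```

The point of the design is that the link between a variant and its heading is a recorded human decision, not a guess from the spelling: 'George Dunsire' matches nothing until somebody says otherwise.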

Such standardisation is achievable through a little international cooperation and, with some training and guidelines, it might even be possible to get authors to add such metadata at source, using readily available authority files. It is, after all, in the authors' best interests for their publications to be accurately retrieved; unless, of course, they want them to be retrieved for quantitative rather than qualitative purposes. If the former, then the best thing to do is classify their publication as 'Pornography - visual', and use the word 'sex' in the title and at least twenty times in the first paragraph. I guess Richard Dawkins prefers the latter, but I wonder if he's saving 'sex' for his magnum opus?

I doubt whether we will get authors to comply with authority schemes, even if we could agree amongst ourselves as to what the standards are. In any case, the problem of subject indexing using a universal, standard scheme is intractable; librarians have been unable to do it, so why should we expect authors to? And yet we can't just leave it up to the search-engines, for exactly the same reasons. If there were such a beast as a universal subject classification schema, then I bet research into strong Artificial Intelligence would have been far more fruitful than it has been. As it is, it is near-impossible to 'teach' a know-bot (yuk!) that 'a blind Venetian' is not the same thing as 'a Venetian blind' without providing an exhaustive, prescriptive look-up table, but such a table would constitute a universal scheme for categorising human knowledge. And if you think that some magical trigger-point will be reached, where a sufficient bulk of human knowledge in machine-readable form (for example the Web) can be used as the basis of a look-up table, then think of all the 'noise', remember Garbage-In, Garbage-Out, and try reading 'Gödel, Escher, Bach' [3] for an explanation of strange loops and self-referential systems.
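The Venetian problem can be shown in a few lines of Python. This is only a sketch of order-insensitive keyword matching, with a helper name (bag_of_words) invented for the purpose; it is not a claim about how any particular know-bot or search-engine is built.

```python
from collections import Counter


def bag_of_words(phrase: str) -> Counter:
    """Count the words in a phrase, ignoring case and word order."""
    return Counter(phrase.lower().split())


person = "a blind Venetian"
window = "a Venetian blind"

# The two phrases contain the same words the same number of times...
print(bag_of_words(person) == bag_of_words(window))

# ...and differ only in the order, which the word count throws away.
print(person.lower().split() == window.lower().split())
```

The first comparison comes out true and the second false: once word order is discarded the two phrases are indistinguishable, and telling them apart again takes knowledge of the world, not more counting.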

So we can't get standardised metadata assembled at source, automated 'post-coordinate' indexing has huge limitations, and we don't want to leave the reader drowning in a sea of false-drops. Enter the cavalry in the form of the cataloguer. Isn't it about time we pointed out the error of their WAIS (sorry, couldn't resist that one) and did something about it?

"What is the use of a book," thought Alice, "without pictures or conversations?" [4].


