Discussions from KIDMM Mash-up Day
- Information Retrieval Today: An Overview of Issues and Methods
- SNOMED Clinical Terms: The Language for Healthcare
- Geospatial information and its Applications
- Integrating Museum Systems: Accessing Collections Information at the Victoria and Albert Museum
- Preservation of Datasets
- Discussion: Taxonomies and Tagging
- Enabling Knowledge Communities
Information Retrieval Today: An Overview of Issues and Methods
David Pullinger (UK Cabinet Office), in charge of the pan-government search solution, commented that ordinary people searching for government documents use terms other than the government's argot. Ironically, Google finds these documents effectively, because it picks up words that are associated with links, often written in plainer English. Conrad drew attention to a 2003 paper on e-democracy by Danny Budzak, comparing terms used to describe services on local government Web sites to those chosen by users. Conrad also noted that on 21 January 2003, the BCS Developing Countries SG had held a discussion workshop on 'Information Literacy'. Part of a definition of IL is knowing how to use search engines effectively. But seeing user education as important shouldn't let engine-makers off the hook: better interfaces and better usability are vital.
SNOMED Clinical Terms: The Language for Healthcare
Tony Rose noted the wealth of experience gained in constructing SNOMED CT, and wondered what other projects could learn. Ian thought a lot of learning is being transferred. When you build a terminology like SNOMED, you construct many little theories, e.g. a theory around Action: it involves a performer, a recipient, a lifecycle etc. This helps to relate it to other terminologies, provided they share similar models of what an action is.
No concept in SNOMED CT ever disappears: once created, 'it's immortal'. If SNOMED CT had started in the 16th century, the concept of malaria as 'ague' would still be there, but deprecated for current use. Over time, new science will cause concepts to move from one hierarchy to another, or new hierarchies may be introduced. Genomics, for example, will cause many changes.
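The 'immortal' rule can be pictured in code. What follows is a minimal sketch in Python, assuming a toy in-memory store: the class names, the identifiers C1 and C2 and the hierarchy labels are all hypothetical, and none of this reflects the real SNOMED CT data model or release format. It only shows concepts being flagged as deprecated or moved between hierarchies, never deleted.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A hypothetical terminology concept (not the real SNOMED CT model)."""
    concept_id: str
    term: str
    parent: str                                   # current place in the hierarchy
    active: bool = True                           # False = deprecated for current use
    history: list = field(default_factory=list)   # earlier parents, oldest first


class Terminology:
    """Toy store with no delete operation: concepts can only be added,
    deprecated, or moved."""

    def __init__(self):
        self._concepts = {}

    def add(self, concept_id, term, parent):
        self._concepts[concept_id] = Concept(concept_id, term, parent)

    def deprecate(self, concept_id):
        # The concept is flagged, not removed, so old records stay readable.
        self._concepts[concept_id].active = False

    def move(self, concept_id, new_parent):
        # Remember the old parent before re-homing the concept.
        concept = self._concepts[concept_id]
        concept.history.append(concept.parent)
        concept.parent = new_parent

    def get(self, concept_id):
        return self._concepts[concept_id]

    def active_terms(self):
        return sorted(c.term for c in self._concepts.values() if c.active)


# Invented identifiers and hierarchy names, for illustration only.
terminology = Terminology()
terminology.add("C1", "ague", "fevers")
terminology.add("C2", "malaria", "fevers")
terminology.deprecate("C1")                        # still retrievable afterwards
terminology.move("C2", "parasitic diseases")       # new science, new hierarchy
```

After these calls, `terminology.get("C1")` still returns the 'ague' concept, marked inactive, while `terminology.active_terms()` lists only 'malaria'.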
Mark Phillips (Department of Children, Schools and Families) asked about strategies for improving take-up of SNOMED CT. Ian regretted that pilots of SNOMED CT use have been postponed. This could be disastrous: we have no idea how things will work in practice. All major systems procured under NHS Connecting for Health will use SNOMED CT: a huge experiment. But sooner or later you have to bite the bullet with standardised terminologies. Use in real-time patient care will be the acid test.
Robin Clark of the National Cancer Research Institute asked about governance of the process for expanding the terminology and guaranteeing its consistency. One of the most difficult aspects is deciding whether a new term represents a new concept, or is a synonym for something already there. If you understand what a new concept means (and most authors are clinicians) then you know to what hierarchy it belongs, what qualifiers are available, what defining characteristics are important. Better tools are required, but there will always be a big element of human skill and knowledge in authoring any such terminological resource.
Geospatial information and its Applications
Sabine McNeill asked Dan if he was aware of mash-ups between geospatial and climate data. Conrad noted the widespread use of data, geospatially referenced, which feeds into climate prediction models such as those used by the UK Meteorological Office. Ian Herbert added that in meteorology, the models are not simply two-dimensional but also take altitude into account.
David Pullinger asked if lack of standardised geography is holding back the use of GIS; if so, what standard would Dan opt for? He was thinking of e.g. postcodes or Office for National Statistics output areas. Dan agreed that the postcode is a powerful concept, but the area definition is basically a set of delivery points. Any geometry put around those points will be arbitrary. Also, the postcode system was devised for delivering the mail, but many organisations use it to split the country into areas. Indeed the phrase 'postcode lottery' reflects the way that the postcode gets treated as a de facto geographical unit.
Conrad talked about problems with these entities called countries. In classifying geographical locations, one imagines a monohierarchy, in which a town is situated within a country. But over time, boundaries change. This problem comes up repeatedly when you try to attach geographical and historical data together; you need a spatio-temporal gazetteer to track such changes.
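To make the idea of a spatio-temporal gazetteer concrete, here is a minimal sketch in Python. It assumes that each 'place is within parent' relation carries a validity period; the place names, dates and the `parent_on` helper are invented for illustration and do not describe any existing gazetteer.

```python
from datetime import date
from typing import NamedTuple, Optional


class Membership(NamedTuple):
    place: str
    parent: str
    start: date            # first day on which the relation holds
    end: Optional[date]    # last day on which it holds; None = still current


# Hypothetical entries: "Town A" changes country when a border moves.
GAZETTEER = [
    Membership("Town A", "Country X", date(1900, 1, 1), date(1969, 12, 31)),
    Membership("Town A", "Country Y", date(1970, 1, 1), None),
]


def parent_on(place: str, when: date) -> Optional[str]:
    """Return the parent of `place` on the date `when`, or None if unknown."""
    for entry in GAZETTEER:
        if entry.place != place:
            continue
        if entry.start <= when and (entry.end is None or when <= entry.end):
            return entry.parent
    return None
```

With these entries, `parent_on("Town A", date(1950, 1, 1))` gives 'Country X' and `parent_on("Town A", date(2000, 1, 1))` gives 'Country Y', so a historical record can be attached to the country that existed at the time rather than to the present-day one.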
Integrating Museum Systems: Accessing Collections Information at the Victoria and Albert Museum
Dan Rickman asked how the Common Data Model is used in practice in building the Virtual Repository. Mike Stapleton of SSL explained that the VR contains a brokerage module. It collects information from the databases, and runs queries against them as required; it does both these things in response to queries from applications. The EAD archival data is harvested and held locally to the VR in a structured database. As for the Common Data Model, its main benefit was to bring clarity to conversations between Museum colleagues and suppliers.
George Mallen of SSL explained that in such projects, an important starting point is the knowledge organisation system of each institution. Each has its own attitudes to its data and knowledge. Another influence is the various standards, such as CIDOC and SPECTRUM. Technology suppliers, such as SSL and its competitors, must interact with all this, and tailor solutions to the client's needs. To keep abreast of developments in the field, SSL gets involved in European-funded research initiatives - a good way to keep a company and its technology responsive.
Susan Payne of De Montfort University Library said their current experience of data modelling is that people are scared about missing something out. If something is discovered to be missing, how does it get built in? How is agreement reached? Mike Stapleton said there comes a point when you have to draw a line, and it's a problem. If later, while developing applications, something is found missing from the data model, it can be expensive to fix at that stage.
Preservation of Datasets
Conrad Taylor raised the issue of the digital preservation of documents, rather than datasets, and related a concern raised by Adam Farquhar at the British Library. A huge number of electronic documents are in proprietary Microsoft Office formats, such as Word and PowerPoint. For this reason, Adam is enthusiastic about Microsoft's Office Open XML format, which should render the formerly closed binary formats into a publicly documented, XML-rendered form.
Ecma International adopted Office Open XML as a standard (Ecma-376), but the fast-track process to make it an ISO/IEC standard (DIS 29500) has proved controversial. Office Open XML is seen by many as a spoiler launched by Microsoft against the ISO/IEC 26300:2006 OpenDocument standard, derived from the OpenOffice.org XML format. (John Alexander commented that in fact the proposal had been voted down in a ballot that ended on 2 September 2007 [15].)
Terry said that TNA has an agreement with Microsoft guaranteeing access to all previous Microsoft operating systems and applications. This is a partial solution, though only for TNA, and only for documents authored in Microsoft products. Conrad wondered about the long-term accessibility of Adobe's Portable Document Format, and reported that Adobe is developing an XML representation of the content and structure of PDF documents, under the title of the Mars Project.
Discussion: Taxonomies and Tagging
One document provided for study in the Mash-up delegate pack was the BCS Subject Taxonomy, commissioned at the request of the BCS Knowledge Services Board as a resource for classifying BCS information products such as books, articles and Web pages. Conrad asked: if you had to classify a Web page using this taxonomy, how would you start? Wouldn't you regard the task with some foreboding? It seems that BCS staff are often unsure about what keywords to use. This seems to have led to some resistance to doing the work of classification.
Someone asked if classification isn't always contextual. In the BCS, we classify subjects a certain way because we are all generally interested in things to do with computers. Compare a book by Umberto Eco, describing an early Chinese attempt to classify animals: those that walk, those that fly and those that swim. Also, those the Emperor likes, those the Emperor doesn't like, and those the Emperor hasn't made up his mind about yet. It seems crazy, but if your life depends upon not offending the Emperor, it is a very reasonable classification.
Conrad suggested we turn the discussion upside down. In systems like Flickr and Del.icio.us, people classify things with tags that they choose - 'folksonomies'. It's less effort to get people to tag things that way. John Alexander described the practice of displaying a 'tag cloud' - and two understandings of the function of a tag cloud emerged:
- The terms in the cloud can be the tags most frequently assigned; this can become a preferred, semi-normative pick list; though one can still introduce one's own.
- The cloud might instead show the terms on which most searches had been made against the site.
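Both readings of a tag cloud rest on the same calculation: count how often each term occurs, keep the most frequent, and scale the display size by frequency. The sketch below is one hedged way to do this in Python; the function name `tag_cloud`, its parameters, the linear scaling rule and the sample tags are all assumptions for illustration, not a description of how Flickr or Del.icio.us build their clouds.

```python
from collections import Counter


def tag_cloud(events, top_n=3, min_size=1, max_size=5):
    """Map the `top_n` most frequent terms to a display size.

    `events` may be the tags users assigned (first reading above)
    or the search terms they typed (second reading).
    """
    top = Counter(events).most_common(top_n)
    if not top:
        return {}
    highest = top[0][1]     # count of the most frequent term
    lowest = top[-1][1]     # count of the least frequent term kept
    span = highest - lowest
    cloud = {}
    for term, count in top:
        if span == 0:
            size = max_size  # all terms equally frequent
        else:
            # Linear scaling between min_size and max_size.
            size = min_size + round((count - lowest) * (max_size - min_size) / span)
        cloud[term] = size
    return cloud


# Invented sample data: five uses of 'python', three of 'xml', one of 'gis'.
assigned_tags = ["python"] * 5 + ["xml"] * 3 + ["gis"]
cloud = tag_cloud(assigned_tags)   # {'python': 5, 'xml': 3, 'gis': 1}
```

Feeding the same function a log of search queries instead of assigned tags gives the second kind of cloud; only the input changes.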
Richard Millwood suggested that classification will be guided by the tools on offer. 'If we gave infant children tools as good as some of those out there in the Web 2.0 context, I think children would learn to understand what taxonomies are, practically, day by day, in an enjoyable context, and they would think very hard about how they would make sense of them communally when they come to secondary school and beyond.'
Major Classification Schemes
We were talking about classifying stuff as it arrives online; but Aida Slavic reminded us of the huge amount of information already there, online or in libraries. The big classification schemes, Library of Congress, Dewey, Universal Decimal Classification, aim to cover the whole of knowledge, and scientifically. The structure of these classifications serves the purpose of co-locating books matching how people are going to use them: for example, all the books about heart diseases together, rather than indexing heart attacks together with poems about the heart in love, or Braveheart the film.
A huge amount of information is organised according to these schemes. Do we dare call them obsolete? They have the huge advantage of being a map of knowledge, to which we can map other things. These large classifications are not stupid - they have a good facet analysis system behind them. As Aida sees it (this is why she helps to maintain UDC), if such a classification scheme is encoded and exposed in a machine-understandable way, then perhaps you can correct and improve on it by noting how it touches upon other vocabularies you use.
In building the 'BCS Taxonomy', it would be useful to link it to those large classifications. Among 200 million books in the Library of Congress Catalogue, you will find computer books worth linking to. The point is to use mapping, and not just one classification but many; to link and switch them through some kind of registry.
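The idea of linking and switching classifications through a registry can be sketched as follows. This is a minimal illustration in Python, assuming that each mapping is a simple two-way link between (scheme, notation) pairs; the `MappingRegistry` class is hypothetical, the BCS label is invented, and the library notations are given only as examples that should be checked against the schemes themselves. Following links transitively treats every mapping as an exact equivalence, which real mappings often are not.

```python
from collections import defaultdict, deque


class MappingRegistry:
    """Hypothetical registry of links between classification notations."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b):
        # Record a two-way mapping between two (scheme, notation) pairs.
        self._links[a].add(b)
        self._links[b].add(a)

    def switch(self, source, target_scheme):
        """Return notations in `target_scheme` reachable from `source`."""
        seen = {source}
        queue = deque([source])
        found = set()
        while queue:                      # breadth-first walk over the links
            node = queue.popleft()
            for neighbour in self._links.get(node, ()):
                if neighbour in seen:
                    continue
                seen.add(neighbour)
                queue.append(neighbour)
                if neighbour[0] == target_scheme:
                    found.add(neighbour[1])
        return sorted(found)


registry = MappingRegistry()
# Illustrative mappings; the BCS label is invented for this sketch.
registry.link(("BCS", "Computing"), ("UDC", "004"))
registry.link(("UDC", "004"), ("DDC", "004"))
registry.link(("UDC", "004"), ("LCC", "QA76"))

lcc_classes = registry.switch(("BCS", "Computing"), "LCC")   # ['QA76']
```

Here the BCS term is mapped only to UDC, yet it can still be switched to a Library of Congress class because UDC acts as the hub, which is the benefit of mapping to a large general scheme rather than pairwise between every vocabulary.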
Conrad wound up the discussion saying that perhaps the BCS, to become an organisation that shares its knowledge, should consider how it can gather its knowledge together and make it accessible to people.
Enabling Knowledge Communities
Someone asked how much knowledge Richard expected users to have, not just to be able to interpret information, but also to be able to tell whether it looks vaguely correct. Conrad said this is another point that people have made about information literacy: the ability to critically evaluate information one has retrieved is as important a skill as the ability to go out and find stuff. And when it comes to school students, it's often that critical evaluation that is missing.
Richard said he would go further: the lack of critical evaluation is not new. It has always been in short supply, except in the best schools and with the best teachers. The problem is that children's model of schooling is: 'You are going to tell me what the truth is, and so I don't need to think. All I need is the ability to remember it and write it down when it comes to the exam.' That is the challenge we face in education.
It isn't just schoolchildren who fail critically to evaluate, someone remarked; it's in government as well. There is a huge problem of people being uncritical about statistical data.
Richard noted that some Ultraversity students dropped out after 6 months on the course, with the riposte, 'How could you possibly start a course like this without quality materials such as the Open University provides?' They had not accepted the idea that they might construct knowledge themselves, in conjunction with others filling in the gaps in their expertise. For some, taking responsibility for their own learning was particularly hard.
Jan Wylie asked if 'flaming' in online discussions was still a problem. Yes, said Richard; it is. One way to improve this is not always to do everything in one big forum. A large community sharing a single communication channel such as an email listserver is quite vulnerable to flaming. But where many subcommunities are operating, like the learning sets in Ultraversity which consisted of only a few people each, one can have the experience of community in smaller spaces where trouble is less likely to start.
Someone asked if the software on which the I&DeA Knowledge applications run is available to other communities. Richard said it has been developed with a company, providing a bespoke solution. They do permit any government body to use it as a foundation for their own online community; but you apply to use that company's service rather than taking the software away and running it yourself.