Content Architecture: Exploiting and Managing Diverse Resources
I recently attended the first biennial Conference of the British Chapter of the International Society for Knowledge Organization (ISKO UK), entitled ‘Content Architecture: Exploiting and Managing Diverse Resources’. It was organized in co-operation with the Department of Information Studies, University College London.
If the intention was to focus on the diversity of resources out there, I also felt that the audience was very diverse in terms of levels of expertise and perspectives. This can provide a useful opportunity for broadening horizons, and I felt that the conference did this reasonably successfully, although I’m not sure that I really understood quite what the ISKO is and what it stands for (except in a literal sense!). Maybe that is somewhat inevitable with a remit as broad as the organisation of knowledge.
This report focuses on some of the plenary sessions, and reflects my own interest in semantics and the Semantic Web, as well as the talks that might be described as thought-pieces, those which tend to raise all sorts of ideas that you mull over on the train journey back home!
It’s All Just Semantics!
The conference kicked off with a plenary session on linguistic semantics. Semantics began as a branch of linguistics, so this seemed like a good starting point. Professor David Crystal, a leading linguist and founder of Crystal Reference Systems (now an ad pepper media company), told us that the term was first used in its modern sense in the 1890s by Michel Bréal, a French philologist, who referred to ‘the science of meaning in language’. Semantics was seen as a level of linguistic organisation, alongside phonetics and grammar. However, the abstract nature of the concept of meaning meant that semantics remained a neglected branch of linguistics. Crystal took us through some of the history of semantics as a concept, indicating just why the term is so difficult to grasp. He brought us up to date with the concept of the Semantic Web, which gives us one of the widest meanings of semantics. Crystal said that there can be no broader definition of semantics than the one we encounter in the Semantic Web, and furthermore, it has taken us significantly away from the original idea of semantics - the linguistic definition of the term. He warned that if someone talks about providing a ‘semantic solution’ this should be viewed with caution, as it is open to widely differing interpretations, and is therefore somewhat meaningless without further qualification.
Crystal came back to his area of expertise, linguistic semantics, to give an entertaining talk about his recent work relating to the placing of online advertisements, an increasingly important area of advertising. We have over a million words in English, and over 70% of them are technical terms, which tend to have quite a clear meaning. If you take a typical abridged dictionary (around 100,000-150,000 words), the average number of senses per word is 2.4, and this number is rising (the word ‘take’, for example, has 25 meanings). He told us about woefully misplaced online advertisements, in particular an advertisement selling knives placed next to an article on the rise in the incidence of stabbings in New York, and even more disastrously, a page in a German publication where an article about Auschwitz was matched with an ad selling cheap gas.
The Importance of Context
Crystal has been working on solutions to problems of irrelevance and insensitivity in the placement of advertisements. Primitive algorithms are the source of the difficulty, where, for example, the keyword ‘knife’ is the linked concept, but the outcome is not as intended, because such algorithms cannot take account of ambiguities in language. One solution is to put words into context. ‘Knife’ in an article about crime will be accompanied by words like murder, blood and police; ‘knife’ in a cutlery advertisement will be accompanied by words like fork and spoon (if it is that kind of knife!). So, the theory is that, if these words are taken into account, it is possible to disambiguate successfully. This process, known as contextual semantics, is one solution that continues to be used, but Crystal argued that this will only capture a small part of content. A news item on stabbing may also be about street safety, policing methods, citizen protection, etc. Most Web pages are multi-thematic, and it is often misleading to think that the title and first few lines will essentially define the content, as other themes inevitably emerge when you read down the page. So, there is a need to analyse the entire content of a page, something Crystal referred to as ‘semantic targeting’.
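The disambiguation idea can be illustrated with a small Python sketch. It is not Crystal's system: the function name and the two cue lists are assumptions, seeded only with the example words quoted in the talk, and a real cue vocabulary would be far larger.

```python
import re

# Hypothetical cue vocabularies; the words themselves are the examples from the talk.
CONTEXT_CUES = {
    "crime": {"murder", "blood", "police"},
    "cutlery": {"fork", "spoon"},
}

def disambiguate(text, cues=CONTEXT_CUES):
    """Return the context whose cue words occur most often in the text, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {label: sum(word in vocab for word in words)
              for label, vocab in cues.items()}
    best = max(scores, key=scores.get)
    # No cue word at all means the sketch cannot decide.
    return best if scores[best] > 0 else None

print(disambiguate("Police found blood on the knife after the murder"))
print(disambiguate("A knife, fork and spoon set"))
```

Because the function returns a single label, it would reduce a multi-thematic news item to one context, which is the limitation of contextual semantics that Crystal pointed to.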
This process lexically analyses pages and categorises them by theme. If users are interested in weather and type the word ‘depression’ into a search engine, they will get millions of hits and the first page is likely to refer to the mental health meaning of depression. How can this be improved so that users get the type of depression they are seeking? One way is to increase the search terms, but this can bring about even more diverse results, especially if it is quite an abstract enquiry. A word like depression is a problem because of its polysemic character; it can relate to a mental state, bad weather, a poor economy, or a dip in the ground. If a semantic filter can be devised, the problem is solved: the user types the term depression and a menu prompts her to clarify which type. If she selects meteorology she only gets those hits. But how can we provide the semantic content for a filter? If enquirers only want weather pages, then we need to predict which words related to weather will turn up on the page. But how many items are there in a language available to users to talk about weather? Can we predict what all of them are? One place where all lexical items are gathered is a dictionary. By working through all content-specific items in a dictionary and assigning each sense of each item to a semantic category, all terms will have been covered. In addition, brand names and place names need to be included, so other sources need to be used to cover them. This was the work that Crystal’s team of lexicographers undertook, and it took them three years to complete it, although, of course, the work is never really finished, as new terms are constantly introduced and meanings and associations change all the time.
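A semantic filter of this kind might be sketched in Python as follows. The lexicon is a tiny, made-up stand-in for the dictionary-wide assignment of senses to categories described in the talk; the category names, word lists and function names are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical fragment of a sense lexicon: each content word is assigned to the
# semantic category of the sense in which it is expected to appear.
WORD_TO_CATEGORY = {
    "rainfall": "meteorology", "forecast": "meteorology", "isobar": "meteorology",
    "therapy": "mental health", "anxiety": "mental health",
    "unemployment": "economics", "recession": "economics",
    "hollow": "landform", "crater": "landform",
}

def page_themes(text):
    """Count category hits over the whole page, not just its title."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(WORD_TO_CATEGORY[w] for w in words if w in WORD_TO_CATEGORY)

def filter_hits(pages, chosen_category):
    """Keep only the pages whose dominant theme is the category the user picked."""
    kept = []
    for page in pages:
        ranked = page_themes(page).most_common(1)
        if ranked and ranked[0][0] == chosen_category:
            kept.append(page)
    return kept

pages = [
    "A deep depression brings heavy rainfall; the forecast shows a tight isobar pattern.",
    "Depression and anxiety often respond to therapy.",
]
print(filter_hits(pages, "meteorology"))
```

Note that the ambiguous query word ‘depression’ is deliberately absent from the lexicon: the filter decides from the words around it, which is why the coverage of the lexicon matters so much.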
Crystal went on to describe his work on an encyclopaedia database of knowledge categories, originally for Cambridge University Press, but now used commercially, to ensure advertisements are appropriately placed. He concluded that there are more challenges ahead, as advertisers increasingly want more targeted placement of advertisements, which requires identifying the sentiment of a Web page, so that a positive page on a subject can be distinguished from a negative one. This necessitates another lexical trawl to identify words that express positive and negative attitudes. In English there are about 3,000 negative words and about 1,500 positive words (negative always outweigh positive), but over and above that, there are problems such as reverse sense, not to mention the use of irony and sarcasm.
Challenges of Image Retrieval
The next plenary session was on image retrieval. Ian Davis from Dow Jones talked about how hard it is to classify images and find them on the Web. He talked about three approaches to image retrieval: free-text descriptions, controlled vocabulary, and content-based image retrieval (CBIR). Using free text works well to a degree, and controlled vocabulary allows you to focus on the image attributes, depicted concepts and abstract concepts (what is in them, what they are about). CBIR is dependent on pixels, and algorithms can be created to analyse textures, colours and simple shapes. Davis’ talk gave us plenty of visual stimuli; in particular, a whole series of images depicting goats and rocks seems to stick in my mind! His aim was to convey just how difficult it is to decide how to describe images and ensure you meet customers’ needs: a customer asks for a picture of a goat, or goats, or goats with rocks; would a scenic landscape with goats in the distance suffice? Is that really a picture of goats? What if they are barely visible? What if customers search for ‘rocks’ and receive a picture of a goat with rocks? Is that going to satisfy them? He gave plenty of other examples of difficulties with classification, such as the apparently simple notion of indoors and outdoors; not so simple when you are trying to classify images. Is the presence of a roof sufficient to signify indoors? Would a bus shelter be classified as indoors or outdoors? Also, how do you classify an abstract picture, such as swirls of light?
Davis talked about customers often wanting to be surprised and inspired. It is difficult to describe images to meet this kind of need. How about related content? If, to extend the previous example, goats and rocks are your thing, are you therefore interested in other animals in rocky landscapes? Controlled vocabulary can help with providing this type of service.
Davis concluded that people are often very critical when it comes to image retrieval if they don’t get what they want. Whilst free-text descriptions are extremely useful, and really the best way to provide a useful image retrieval service, they are nonetheless very time-consuming. Controlled vocabulary is always important, as it helps with precision and accuracy, but again it is time-consuming and invariably complex. CBIR is objective and not so labour-intensive, but at present it is still quite a basic method to identify images. Davis did feel that folksonomies can play a part in enriching classification, but did not elaborate, and this is something that it would have been interesting to hear more about. He felt that the semantic gap is really a semantic gulf. Text is by far the best way to retrieve complex concepts, but a combined approach with a degree of semi-automation is currently the best option.
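To give a feel for why CBIR is objective but basic, here is a minimal Python sketch of one simple pixel-based feature, a colour histogram compared by histogram intersection. It is an illustration only, not a method Davis presented: images are assumed to be plain lists of (R, G, B) tuples, and the pixel values below are invented.

```python
from collections import Counter

def colour_histogram(pixels, levels=4):
    """Normalised histogram of RGB pixels, quantised to `levels` steps per channel."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {colour_bin: n / total for colour_bin, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity between 0 and 1: the overlap summed over shared colour bins."""
    return sum(min(share, h2.get(colour_bin, 0.0)) for colour_bin, share in h1.items())

# Invented test images: grey rocks with a small white goat, rocks alone, blue sky.
goat_on_rocks = [(120, 120, 120)] * 90 + [(250, 250, 250)] * 10
rocks_only = [(125, 118, 122)] * 100
sky = [(30, 90, 220)] * 100

rocks_hist = colour_histogram(goat_on_rocks)
print(histogram_intersection(rocks_hist, colour_histogram(rocks_only)))
print(histogram_intersection(rocks_hist, colour_histogram(sky)))
```

On these made-up pixels the goat picture scores as a near match for bare rocks, because colour statistics know nothing about goats; that is the semantic gap in miniature.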
Ontology-based Image Retrieval
Dr Chris Town, from the University of Cambridge, following Ian Davis’ talk, concentrated on ontology-based image retrieval. Most of the multimedia information on the Web is not properly tagged, yet less than 25% of the content available is actually text. How should we represent the content of the image? How can we represent and match queries?
Town gave a summary of CBIR, from basic retrieval through to focusing on the composition and parts of the image via segmentation, and then he took us up to current methods – the application of ontologies, machine-learning techniques and statistical methods. He also went through developments that have enhanced the level of sophistication of how users can search. He referred to Blobworld at Berkeley, an initiative for finding coherent image regions which roughly correspond to objects, segmenting images into regions (‘blobs’) with associated colour and texture descriptors.
The query-by-example paradigm can work quite well, but when trying to find images of a certain type, you have to provide others of that type to initiate the search, and there is the issue of salience – what makes it relevant? A look at Google’s Similar Images search shows that the technology is not always effective in human terms. Town’s company, Imense, has been doing work in this area and is using Ontological Query Language, which takes the user query, enhances it in terms of syntax and context, and relates it on a conceptual level to whatever features the system has available to translate it into something useful. Imense combines a metadata search with content classification, using the ontology that it has developed. For example, it associates football with people, grass, outdoors, and other entities, such as shapes and colours.
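The general idea of relating a query to available image features through an ontology can be sketched in Python. This is not Imense's Ontological Query Language, whose details were not given; the data structure, function names and image labels are assumptions, and only the football associations come from the talk.

```python
# Hypothetical mini-ontology: a concept maps to things it tends to appear with.
ONTOLOGY = {
    "football": {"people", "grass", "outdoors"},
}

def expand(query_terms, ontology=ONTOLOGY):
    """Add the concepts the ontology associates with each query term."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= ontology.get(term, set())
    return expanded

def rank(images, query_terms):
    """Rank images by how many expanded concepts their labels share.

    `images` maps a file name to a set of labels, standing in for metadata
    keywords combined with the output of content classifiers.
    """
    wanted = expand(query_terms)
    scored = [(len(wanted & labels), name) for name, labels in images.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ordered if score > 0]

images = {
    "match.jpg": {"people", "grass", "outdoors"},
    "office.jpg": {"people", "indoors"},
    "teapot.jpg": {"ceramic"},
}
print(rank(images, ["football"]))
```

None of the images carries the label ‘football’, yet the expanded query still ranks the most plausible one first, which is the benefit of matching at a conceptual level.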
e-Research and New Challenges in Knowledge Structuring
On 23 June, Clifford Lynch, from the Coalition for Networked Information, opened with a stimulating keynote address on ‘e-Research and New Challenges in Knowledge Structuring’. He talked about the database becoming an integral research tool, and changing the way that research is carried out. He referred to ‘synthetic astronomy’, where astronomers make predictions and run them against a database. Consequently, the data that have been gathered for one purpose, and entered into a database, become a valuable, integral part of the scientific process for many different research projects. Scientists can now reproduce an experiment with the data or re-analyse them in different ways, whereas previously the data collected in the course of research might never really be used again. This idea of reuse is key here; reuse advancing scientific discovery, bringing the value of databases to the fore, enabling diverse evidence to be marshalled for the advancement of research. This principle also extends to reuse of sources outside the scientific domain, such as historical diaries on botanical discoveries or tide tables from the 18th century.
However, this does generate difficult decisions about what to keep, how long to keep it for and what is going to be most valuable for reuse. When is it better to keep data because they are expensive to reproduce? When are the data too inextricably tied to the experiment, rendering them of minimal value over time? Lynch posited that in 50 years’ time we may not be able to understand the data currently generated by the accelerator at CERN; an interesting thought.
Lynch moved on to suggest that we are seeing the rebirth of ‘citizen science’; a notion that goes back to the idea of the leisure pursuits of gentlemen in Renaissance Europe. The first line of biodiversity observation is a good example of citizen science, as is astronomy, where first observations are often by amateur astronomers with a humble domestic telescope and their own computer. The same thing seems to be happening in the humanities – maybe we are seeing the emergence of ‘citizen humanities’? If we think about large-scale digitisation of images, it is often left to the audience to describe them and share information relating to them. In fact, people will often be inspired by photographs they find on the Web to describe people, events and experiences that are well beyond the scope of the picture. People en masse may have a huge amount of knowledge that they will share (one has only to think of railway enthusiasts!).
Data and traditional authored works are starting to integrate and relate to each other in complex ways. There is an increasing sense of contributing to the collective knowledge base, to the corpus of knowledge. Individual voices may become more muted as a result.
Writing for Machines
We have now crossed the boundary where scholarly literature is not just read by humans but by machines which compute it. The literature has two different audiences, human and machine. Do we set about adapting literature and making it more suitable for computational work? This brings us back to the whole subject of meaning, and the complexities of semantics.
Making bbc.co.uk More Human-literate
Tom Scott, from the BBC, followed with a talk about making bbc.co.uk more human-literate. He started by referring to Stephen Fry’s recent comment that the drive to make people computer-literate should be reframed as a drive to make computers more human-literate. His talk was about working towards this goal.
The BBC produces huge volumes of content over a great breadth of subjects. It is too difficult to structure it all from the outset, so the way that the BBC has worked up to now is to build micro sites, which are internally perfectly coherent, but which do not have the breadth of BBC content. For example, you can’t find out everything about what the scientist and presenter Brian Cox has done because you can’t search across all the information. You can’t browse by meaning, even though this is often what people want. If a page is of interest, people often want similar content, but on the current site they cannot follow a semantic thread.
The BBC is now trying to tackle the problem differently, starting to think about the data and how to structure it more appropriately for people. In order to do this, it is thinking primarily about real-world entities (i.e. programmes and subjects) rather than pages or documents. The principle is to give things (concepts) Uniform Resource Identifiers (URIs) and expose the data associated with those URIs. One can then express this in RDF (Resource Description Framework), for example in its XML serialisation, and end up with a page for every concept, with each URI reflecting that one concept.
Linked Data has helped with this process. Scott explained that Linked Data is an integral component of the Semantic Web, and the concept reflects Tim Berners-Lee’s original vision for the World Wide Web. Linked Data presents a Web of things with unique identifiers (URIs), not a Web of documents. Scott made three key points about how this works:
- Use http URIs to give globally unique names to things – anyone can de-reference them in this way
- When de-referenced, you can get useful information back as RDF
- Include links to other URIs to let people discover related information – this is what the Web should be – a compendium of links
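The three points can be sketched with a toy, in-memory stand-in for the Web. Everything here is hypothetical: the example.org URIs, the property names and the data are invented for illustration, and plain Python tuples stand in for RDF triples.

```python
# Each HTTP URI de-references to a list of (subject, predicate, object) triples.
WEB = {
    "http://example.org/programmes/p1": [
        ("http://example.org/programmes/p1", "title", "A programme about lions"),
        ("http://example.org/programmes/p1", "subject", "http://example.org/things/lion"),
    ],
    "http://example.org/things/lion": [
        ("http://example.org/things/lion", "label", "Lion"),
        ("http://example.org/things/lion", "featuredIn", "http://example.org/programmes/p1"),
    ],
}

def dereference(uri):
    """Look up a URI and return the data published about that thing."""
    return WEB.get(uri, [])

def discover(start_uri, max_hops=2):
    """Follow links between things, breadth-first, and list every URI reached."""
    seen, frontier = [start_uri], [start_uri]
    for _ in range(max_hops):
        next_frontier = []
        for uri in frontier:
            for _subject, _predicate, obj in dereference(uri):
                # Objects that are URIs are links to other things; literals are not.
                if obj.startswith("http://") and obj not in seen:
                    seen.append(obj)
                    next_frontier.append(obj)
        frontier = next_frontier
    return seen

print(discover("http://example.org/programmes/p1"))
```

Starting from the programme, the links lead to the thing it is about, and from there back to other programmes, without either page needing to know how the other site is organised.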
Scott did emphasise that there are challenges here. Legacy content is huge and not easily ignored. But the Semantic Web enables us to start to think more in terms of things that matter to people – Paul Weller, lions, steam trains, symphony orchestras, rather than documents. Linked Data frees information from data silos. Whilst a proprietary application programming interface (API) is good, it is still in essence only a door into the silo.
The ‘Linking Open Data’ cloud (LOD) is all about connections between datasets. DBpedia is a structured version of Wikipedia that is increasingly becoming a central hub within the LOD cloud, because it constitutes a huge knowledge base of information, providing identifiers for millions of things. DBpedia has become core to the new BBC Web pages, so for those who have doubts about using Wikipedia, the use of DBpedia is, in fact, now further consolidating the place of this user-generated resource. The BBC is now going to Wikipedia pages to edit content for its own purposes.
The BBC has started with a page per programme, so one URI for each programme broadcast. They also have a Web page for every music artist, which draws on MusicBrainz, the database of music metadata, which provides structured data available in RDF.
Scott also argued that the Web is increasingly not about the browser, as there are more and more ways to access the Web now. We need to recognise that a page is in reality made up of multiple resources, with separate URIs for each resource.
If we can make everything addressable then we can start to mesh things up across the Web. We can take a resource and include it in another page. We can have multiple representations for things, as one URI can have many representations, and the appropriate document can then be returned for the device being used.
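One mechanism on the Web for serving several representations from a single URI is HTTP content negotiation, in which the client states the media types it accepts. The talk did not say how the BBC implements this, so the following Python sketch is only an assumption-laden illustration: the function name and the sample representations are invented, and the parsing handles only simple Accept headers.

```python
def negotiate(accept_header, available, default="text/html"):
    """Choose a media type for one URI from a simple HTTP Accept header.

    Honours q-values and falls back to `default` when nothing matches; a real
    server might instead answer 406 Not Acceptable.
    """
    preferences = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type, quality = fields[0], 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                quality = float(field[2:])
        # Sort by highest quality first, then by order of appearance.
        preferences.append((-quality, position, media_type))
    for _quality, _position, media_type in sorted(preferences):
        if media_type == "*/*":
            return default
        if media_type in available:
            return media_type
    return default

# Hypothetical representations of one thing, keyed by media type.
AVAILABLE = {
    "text/html": "<h1>Lion</h1>",
    "application/rdf+xml": "<rdf:RDF>...</rdf:RDF>",
    "application/json": '{"label": "Lion"}',
}
print(negotiate("application/rdf+xml", AVAILABLE))
print(negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8", AVAILABLE))
```

A browser sending the second header receives the HTML page, while a program asking for RDF gets the data behind it, both from the same URI.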
Data in the Cloud
The final plenary was on the potential of new technologies. Dr Paul Miller from The Cloud of Data talked to us about Data in the Cloud. He described the huge move to shift data centres to the Cloud; to a third-party provider. There has been some reluctance, especially in the corporate sector, to trust third-party providers, despite arguments for reduced cost and other advantages. Miller rejected arguments that data is not safe in the Cloud. Whilst there is always a risk, these centres are likely to have substantially more expertise in data security than individual corporations or institutions.
We can begin to see interesting things happening once the data are out there and capable of being joined together. Miller talked about Tim Berners-Lee’s admonishment to us all to ‘stop hugging your data’. We need to let go, to be less defensive, to allow others to use our data rather than recreate it.
Gordon Brown has said Tim Berners-Lee is going to ‘fix government data’! (The press tended to pick up on the other announcement, the one about Alan Sugar as the saviour of business). UK Government information is largely in silos, but the principles of Linked Data will bring it to the Web, so machines can come along and find it and use it to drive applications.
Miller split the idea of the Cloud into infrastructure, platform and application. Infrastructure as a Service (IaaS) provides big cost savings, as it is possible to draw on any number of machines on demand. This elasticity reflects the reality of demand going up and down, and it is far more cost-effective to pay for it as and when it is needed. Platform as a Service (PaaS) provides the opportunity for developers to focus on their own specialized area of development, without the need to worry about the underlying components that support this. Software as a Service (SaaS) refers to things like Google Apps, Zoho, WordPress, etc. These are all lightweight applications delivered over the Web. Most are fairly low-end disruptors, not in competition with Oracle or Microsoft, but becoming increasingly capable, and we may get to the interesting stage where Microsoft Office or similar software applications become largely redundant.
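The cost argument for elasticity comes down to simple arithmetic, sketched here in Python. The demand figures and the price are invented, and the sketch assumes the same price per machine-hour in both cases, which real providers do not guarantee.

```python
def fixed_cost(hourly_demand, price_per_machine_hour):
    """Cost of owning enough machines to cover the peak for every hour."""
    return max(hourly_demand) * len(hourly_demand) * price_per_machine_hour

def elastic_cost(hourly_demand, price_per_machine_hour):
    """Cost of paying only for the machines actually needed in each hour."""
    return sum(hourly_demand) * price_per_machine_hour

# Hypothetical demand: machines needed in four successive hours, with one spike.
demand = [2, 2, 10, 3]
print(fixed_cost(demand, 1.0), elastic_cost(demand, 1.0))
```

The spikier the demand, the wider the gap between the two figures; with perfectly flat demand they would be identical.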
Linked Data is really beginning to make the concept of The Cloud interesting. Datasets are beginning to link up and to take material from each other, and this is the Web done right according to Tim Berners-Lee.
I came away from the ISKO Conference thinking that the Semantic Web has now arrived. As an archivist working in an online environment, I have been involved in organising events for archivists around the digital environment. Five years ago I organised a session on the Semantic Web as part of the Society of Archivists’ Conference. We had an overview talk and a couple of projects were represented, the Vicodi Project and the Magpie Semantic Filter. It all sounded intriguing, but very ambitious, and I think we all felt that this was a vision that would never be realised. I remember puzzling over how to create the massive-scale ontologies that would be needed to make everything meaningful. We all went away and largely forgot about it all! Now it seems that the tools are available and being widely employed in a way that is set to make a real difference to the Web. Exciting times ahead!
References
- The International Society for Knowledge Organization UK http://www.iskouk.org/
- Crystal Reference Systems (part of ad pepper media) http://www.adpepper.com/
- Semantic Targeting http://en.wikipedia.org/wiki/Semantic_targeting
- Blobworld at Berkeley http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/5567.html
- Google Similar Images search http://similar-images.googlelabs.com/
- Imense http://www.imense.com/
- Linked Open Data diagram http://linkeddata.org/
- DBpedia http://www.dbpedia.org/
- MusicBrainz http://musicbrainz.org/
- The Cloud of Data http://cloudofdata.com/
- Tim Berners-Lee talk at the TED2009 conference http://www.ted.com/
- Vicodi Project http://www.vicodi.org/innovation.htm
- Magpie, the Semantic Filter http://projects.kmi.open.ac.uk/magpie/main.html