Leonard Will reviews a comprehensive survey of the literature on the use of thesauri in information search processes and interfaces.
Powering Search is a comprehensive review and synthesis of work that has been done over the past 50 years on the use of thesauri to make searching for information more effective. The book does not discuss the principles and practice of construction of information retrieval thesauri in any detail, but concentrates on the search process and on the user interface through which a searcher interacts with a body of information resources. It is written clearly: each chapter begins and ends with a summary of its content, and the first and last chapters summarise the whole book. There are copious references throughout and a full index.
As the author says in his conclusion:
'This book has taken a new approach to thesauri by critiquing the relevant literatures of a variety of communities who share an interest in thesauri and their functions but who are not, it should be noted, closely collaborating at this time – research communities such as library and information science, information retrieval, knowledge organization, human-computer interaction, information architecture, information search behavior, usability studies, search user interface, metadata-enabled information access, interactive information retrieval, and searcher education.'
One consequence of these disparate approaches is that terminology varies across communities: there are many interpretations of the meaning of facet, category, keyword or taxonomy, for example, which the author acknowledges, but he then uses these terms without saying precisely what definition he gives them.
Information Search Processes
Chapters 2 and 3 review studies on how people go about searching for information, leading to the perhaps self-evident conclusion that there are two types of approach. If a specific and well-defined piece of information is sought, people will amend and refine their queries in the light of initial results to get closer to what they seek. On the other hand, if the search requirement is less well defined, a browsing or 'berrypicking' approach is adopted to explore a subject area, picking up and assembling pieces of information and changing the destination as the exploration progresses. Both these approaches use an iterative procedure, within which a thesaurus can serve to make a search more precise, in the first case, or to show the broader context, in the second.
Chapter 4 deals with thesauri in Web-based search systems, and gives several examples of thesauri in digital libraries, subject gateways and portals, digital archives and linked data repositories. This is one way of grouping these examples, but it is not clear that there is any distinction in principle between the way thesauri can be used in each of them, or indeed in search interfaces to other types of document collections. The main distinction, which is not fully addressed, is whether the information resources being searched have been indexed with terms from the thesaurus being used, or whether the thesaurus is just a source of possible terms for searching the text, and possibly the metadata, of documents. More weight needs to be given to the statement in the introduction to ISO 25964-1:
'If both the indexer and the searcher are guided to choose the same term for the same concept, then relevant documents will be retrieved. This is the main principle underlying thesaurus design ...'
In fact the book generally talks about terms rather than the approach taken by the current standards of considering unambiguously defined concepts, with terms just serving as convenient labels for these. Each concept may have many labels by which it can be retrieved, including one chosen as preferred for each language covered by the thesaurus.
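The concept-based approach of the standards can be illustrated with a minimal sketch. It is not taken from the book or from ISO 25964: the class name `Concept`, the function `build_label_index`, the identifier `c001` and the sample labels are all hypothetical, chosen only to show one concept carrying several labels, with one preferred label per language.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A thesaurus concept identified independently of its labels."""
    concept_id: str
    preferred: dict[str, str]  # language code -> preferred term
    non_preferred: dict[str, list[str]] = field(default_factory=dict)

    def labels(self) -> list[str]:
        """Return every label, preferred or not, in any language."""
        result = list(self.preferred.values())
        for terms in self.non_preferred.values():
            result.extend(terms)
        return result


def build_label_index(concepts: list[Concept]) -> dict[str, str]:
    """Map each lower-cased label to the id of the concept it labels."""
    index: dict[str, str] = {}
    for concept in concepts:
        for label in concept.labels():
            index[label.lower()] = concept.concept_id
    return index


# Hypothetical sample data, invented for this illustration.
front_crawl = Concept(
    concept_id="c001",
    preferred={"en": "front crawl", "fr": "crawl"},
    non_preferred={"en": ["freestyle", "crawl stroke"]},
)
index = build_label_index([front_crawl])
```

Whichever label a searcher or indexer enters, the lookup leads to the same concept identifier, which is the point of treating terms as labels rather than as the units of the vocabulary.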
Chapter 5 deals with search and browsing functionalities in 'new thesaurus construction standards', discussing mainly BS 8723 and ANSI/NISO Z39.19, because unfortunately the newest standard, ISO 25964, was published too recently to be dealt with fully, though it is mentioned briefly. This is not a serious problem, because there are no major changes of principle between these standards, but the later one expands and clarifies many areas. The standards do contain recommendations on some aspects which this book does not discuss – it would be useful to know whether any research project has investigated how useful these are and how well users understand them. They include:
- arrays of sibling concepts within a facet, introduced by a node label showing a characteristic of division. For example, within the objects facet there may be arrays introduced by labels such as <vehicles by fuel> or <vehicles by colour>. These arrays are sometimes displayed separately, and the book refers to them as facets or subfacets, but that is not in accordance with the usage in the standards. The expression node label does not appear in the book, though such labels are important elements in structuring the hierarchical display of a thesaurus.
- compound equivalence, where a thesaurus does not contain a compound concept such as coal mining but contains the explicit or implicit direction that it should be expressed by a combination of the two component concepts represented by the terms coal and mining. This is also important in mapping between one controlled vocabulary and another, where a concept which occurs in one vocabulary may have to be mapped to a combination of concepts in another.
- the role of pre-coordinate indexing, where a knowledge organisation scheme provides for concepts to be combined in a prescribed manner at the time of indexing rather than applying them independently to a document for possible combination at the time of searching. This is particularly useful for browsing, where results can be ordered in a logical sequence, helping a searcher to 'navigate proximal to relevant documents', to use Shiri's words. The book does not show an awareness of the complications of mapping between the individual concepts of a thesaurus and the compound concepts of a pre-coordinate scheme such as Library of Congress Subject Headings or the Dewey Decimal Classification.
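The compound equivalence described in the list above can be sketched as a simple lookup. This is an illustration only, not a mechanism specified by the standards or the book: the table `COMPOUND_EQUIVALENCES` and the function `expand_compound` are hypothetical names, and the single entry is the coal mining example from the text.

```python
# Hypothetical compound-equivalence table: a compound term that is not in the
# thesaurus is expressed by a combination of concepts that are.
COMPOUND_EQUIVALENCES = {
    "coal mining": ("coal", "mining"),
}


def expand_compound(term: str) -> str:
    """Return a search expression for a term, splitting known compounds."""
    components = COMPOUND_EQUIVALENCES.get(term.lower())
    if components is None:
        return term  # not a known compound: use the term unchanged
    return " AND ".join(components)
```

The same kind of table could, under similar assumptions, record a mapping from a concept in one vocabulary to a combination of concepts in another.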
The meat of the book is in chapters 6 to 9, dealing with the design of thesaurus-enhanced search user interfaces. Many examples are given, from the earliest prototype interfaces to those in current commercial use, as well as some experimental graphical displays. The screen images illustrated are rather small, though just legible, but their being in black and white means that, when different colours are used to distinguish elements, their significance is lost. In some cases URLs to the live interfaces are given, though they are in the list of references and not adjacent to the illustrations. (The one URL that is given in the text, to a project in which the author participated, leads to an index page with broken links. The correct links can be found on the author's personal home page.) Even the live search screens sometimes breach accessibility guidelines, with one of them having small pale grey text written vertically on a green background with very little contrast. Chapter 9 includes a brief discussion of general principles of screen design, but some basic issues such as legibility can be overlooked in the attempt to fit many elements into a visually attractive and comprehensible layout.
In the conclusions, the author gives graphic visual interfaces the faint praise 'visualization of thesauri may appeal to some users and may assist them in their understanding of the links and relationships'; it is notable that none of the commercial interfaces illustrated uses them. They are interesting for browsing around, but perhaps not very practical for serious searching.
The discussion of user-centred evaluation in chapter 8 reviews many experiments, and recognises that it is difficult to separate the assessment of the value of thesaurus support per se from the assessment of the interface through which it is accessed. Though individual differences in users' experience, familiarity with searching and with the topics concerned all have an effect, predominant conclusions included the following:
- thesaurus navigation was found useful and informative ...
- users prefer interactive query formulation using thesaurus-enhanced search interfaces as opposed to automatic, behind-the-scenes query expansion.
Perhaps the most useful part of the book is the two-page summary at the end, entitled 'Categorization of guidelines and best practices in the design of thesaurus-enhanced search user interfaces'. This encapsulates the lessons that have been learned from all the preceding studies, and, though it would be challenging to design an interface that complied with all these recommendations, many interfaces would be greatly improved if they took account of this important checklist.
To pick out and comment on just a few items:
Provide clear instructions on Boolean search functions across facets and thesaurus terms.
We can interpret this, though the author does not, as deprecating the simplistic instructions so often found requiring a searcher to choose between 'any' or 'all' of the terms in a search statement. A search statement of any complexity normally contains a combination of two or more concepts, each of which may be labelled by one or more terms. The general form of the search is thus on the lines of (a OR b) AND (x OR y OR z) – for example:
(crawl OR freestyle) AND (style OR technique OR breathing).
Such a statement is fairly self-explanatory, and it is surely underestimating the intelligence of most searchers to say that it is too hard for them to understand. It need not be expressed in words, as above, and many interfaces express it by grouping concepts with a tick box beside each, using OR to combine concepts within a group and AND to combine groups.
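As a minimal sketch of how a tick-box interface of this kind might assemble its search statement, the function below joins the terms selected within each group with OR and joins the groups with AND. The function name `build_query` and the output syntax are assumptions made for illustration; a real search system would have its own query syntax.

```python
def build_query(groups: list[list[str]]) -> str:
    """Combine terms with OR inside each group and AND between groups."""
    clauses = []
    for terms in groups:
        if not terms:
            continue  # skip a group in which nothing was ticked
        # Quote multi-word terms so that they are treated as phrases.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        clause = " OR ".join(quoted)
        clauses.append(f"({clause})" if len(quoted) > 1 else clause)
    return " AND ".join(clauses)


query = build_query([["crawl", "freestyle"], ["style", "technique", "breathing"]])
print(query)
```

With the two groups shown, the result has the same form as the swimming example above.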
Provide the original query in an editable format on the results page.
As the studies of search processes showed that searches are generally iterative, a user should be able to see, modify and re-run a search in the light of results.
Provide a large query entry box for both the thesaurus and the collection search feature.
The small query boxes frequently provided do not allow the entry of a search of any complexity, and even if they allow scrolling it is inconvenient not to be able to see the full search statement.
Integrate thesaurus browsing and results examination to allow seamless access to the thesaurus and the retrieved results.
If it is necessary to switch to a different window with a different interface to choose or modify search terms, this is almost as cumbersome as having to refer to the printed version of the thesaurus. It should be possible to copy thesaurus terms, and subtrees, into search statements without retyping them, for example by drag-and-drop or double-click, with assistance in combining them with a valid search syntax.
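One way of copying a subtree into a search statement is sketched below, under stated assumptions: the hierarchy `NARROWER` is invented sample data, and the functions `subtree_terms` and `subtree_clause` are hypothetical names, not part of any interface described in the book. The sketch collects a term and all its narrower terms and turns them into a single OR group.

```python
# Hypothetical narrower-term relationships, invented for this illustration.
NARROWER = {
    "vehicles": ["cars", "bicycles"],
    "cars": ["electric cars"],
}


def subtree_terms(term: str) -> list[str]:
    """Return a term followed by all its narrower terms, depth first."""
    terms = [term]
    for child in NARROWER.get(term, []):
        terms.extend(subtree_terms(child))
    return terms


def subtree_clause(term: str) -> str:
    """Build an OR group from a term and everything below it."""
    quoted = [f'"{t}"' if " " in t else t for t in subtree_terms(term)]
    return "(" + " OR ".join(quoted) + ")"


print(subtree_clause("vehicles"))
```

The sketch assumes a strict tree; in a thesaurus that allows a concept to have more than one broader concept, duplicates would need to be removed before the group is built.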
This book is an important and useful review which will provide essential background and guidance for anyone designing search interfaces. As it deals only with the search aspects and not with the underlying principles of thesaurus construction and use, it should be complemented by other guidance such as ISO 25964 to provide a full understanding of the techniques and benefits of thesaurus-based information storage and retrieval systems. There is still much work to do in the design and evaluation of such systems, but this is an excellent starting point.
- ISO 25964. Information and documentation – thesauri and interoperability with other vocabularies
Part 1: Thesauri for information retrieval. Geneva : ISO, 2011. 152 pages. 238 CHF (Swiss francs).
Part 2: Interoperability with other vocabularies. Geneva : ISO, 2013. 99 pages. 196 CHF (Swiss francs).
Both parts are available on paper or in pdf format. They are also available through national standards bodies.
- ISO 25964 – the international standard for thesauri and interoperability with other vocabularies. The official Web site for information about ISO 25964, giving summaries, contents, and links to related materials: http://www.niso.org/schemas/iso25964/
Leonard Will has been working as an independent consultant in information management since 1994, having previously been Head of Library and Information Services at the Science Museum, London. He has special interests in faceted classification and thesaurus construction and use, and was a member of the working parties that drew up BS 8723 and ISO 25964. He is a member of ISKO (the International Society for Knowledge Organization), and a chartered member of CILIP and of the BCS.