UK Institutional Repository Search: Innovation and Discovery
Institutional repositories are a major element of the Open Access movement. More specifically, in research and education their main purpose is to make available as much of an institution's research output as possible.
However, a simple search box that returns a long list of keyword-matched artefacts is no longer sufficient, whether that list is derived from an individual institutional repository (IR) or from a federated search, which would generate an even longer list. In the latter case, each repository-level search engine returns lists whose rankings relate to its individual default or customised settings. How the aggregated results are re-organised presents major challenges to a federated higher-level simple search facility.
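To make the re-organisation problem concrete, the following is a minimal sketch of one widely used way to merge ranked lists whose scores are not comparable: reciprocal rank fusion, which relies only on each item's position in each list. It is offered as an illustration, not as the method used by IRS. The repository lists and document identifiers are hypothetical, and the constant `k = 60` is simply the conventional default for this technique.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of item IDs into a single ranking.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    so only rank positions are used and raw scores are never compared.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical result lists returned by three repository search engines.
repo_a = ["doc1", "doc2", "doc3"]
repo_b = ["doc2", "doc4"]
repo_c = ["doc2", "doc1"]

merged = reciprocal_rank_fusion([repo_a, repo_b, repo_c])
print(merged)
```

An item that several repositories rank highly rises to the top, but the merged output is still a flat list: it says nothing about how the items relate conceptually.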
Rather than presenting the user with a very long list of aggregated articles, what would be more useful is for the search / discovery facility to have the ability to re-index the sub-corpus in terms of conceptual relevance to what the user is looking for. This requires a technological capability whereby the machine understands the concepts implicit within a given item-level artefact and can discover, cluster, prioritise and establish relationships within the overall sub-corpus.
This idea has congruence with Tim Berners-Lee's vision of the Web. Technological changes and developments relating to machine-assisted search and discovery, such as the emergent concept known as Web 3.0, will undoubtedly have a major impact on next-generation search and discovery functionalities relating to research and scholarly enquiries. If such a search and discovery capability is overlaid on existing institutional and related repositories, it could in turn inspire ideas and projects about useful and efficient ways of searching for academic research output.
More importantly, the effect would be to provide improved search and discovery facilities for researchers. As a consequence, more rapid discovery of relevant knowledge and relationships between knowledge artefacts in repositories would facilitate discovery in the early part of the research cycle i.e., literature search. This in turn means that by supporting more rapid and relevant discovery of important information at this point, the overall research and innovation-to-discovery lifecycle would speed up.
This is the underlying premise of the UK Institutional Repository Search Project and has guided its initial aims and subsequent research, development and service transition modelling.
The need for a search service such as that delivered by Institutional Repository Search (IRS) arose from the realisation that research is confronted by problems in discovery. Information items stored in a repository or collection of repositories are only useful if they can be found and transformed into forms and constructs that meet a user's specific needs and associated enquiry context.
Of key importance is that there are acknowledged challenges in maintaining the manual overheads associated with current, future and, in particular, legacy data, for example the inclusion of non-bibliographical but associated data assets. As Green et al stated in 2007, 'the creation of quality metadata is essential for the proper management and use of the public-facing repository. It is, though, also clearly recognised that it is not viable for this metadata to be entirely human-generated'.
These problems are therefore met with a solution in the form of a free, targeted search tool that makes searching faster and more relevant to the researcher. Institutional Repository Search is a project that was set up to develop a UK repository search service to support academic activity. It is funded by the JISC and led by MIMAS in partnership with SHERPA, UKOLN and NaCTeM. Institutional Repository Search is designed to serve as a showcase for UK research and education. The technological developments in IRS are directed at meeting the project's main aims.
Successfully finding an item relevant to a search implies that the item has been sufficiently well characterised, indexed and classified, such that relevance to a search query can be ascertained. Moreover, the vast and constantly expanding range of information sources, and the growing area of uncharted supplementary data such as theses and dissertations, make searching quite problematic, because of what is often termed information overload.
Consequently, a literature search can take a very long time and is quite likely to lose focus. At the same time, relationships between information are not always clear or easy to find, thereby potentially hindering researchers' attempts at innovation. We offer an example of what a researcher might regard as a preferable scenario below:
Scenario: Discovery Phase (Susan the Researcher)
Susan has submitted a proposal to her Principal Investigator (PI) to research an aspect related to proteomes. Her proposal / dissertation (and scholarly curiosity) is basically related to the relationship of genetic markers and obesity.
As this is a new area, it has been accepted and she is conducting her initial literature review.
She initially goes to Google Scholar and focuses her preliminary search on the subject-specific terms – 'proteomes and obesity'. She is returned a result set of 4,920 artefacts. However, on preliminary evaluation, she becomes aware that the ranking in the list is based upon the number of contextually unrelated 'hits' and not necessarily the focus of inquiry she is initially seeking.
She lowers the granularity of her search and just types in "proteomes" as a keyword: Google Scholar now returns 3,500 results.
As this discovery exercise relates to grounding the theoretical framework of her research, she tries a recently available conceptual search facility (http://www.intute.ac.uk/irs/demonstrator).
This is different from Google insofar as it uses subject-specific metadata, where they exist at artefact level in UK Institutional Repositories, to identify key documents. It also uses a keyword to text-mine artefacts that its neural-network algorithms judge might be conceptually related.
The researcher re-types in the high-level subject keyword "proteomes". It comes up with 1,730 bibliographic articles returned from a targeted cross-search from the Open Access UK Institutional Repositories.
On the left-hand side of her screen she notices it has clustered the results in a dynamic browse list. This (dynamic taxonomy) has been previously refined by a subject-specific information engineer to filter the dynamic browse list in relation to her professional focus.
She notices that the subject clustering algorithm has organised returns into 5 conceptual categories, one of which is unusually labelled 'back fat'. On clicking on the latter, it refines the result set down to one. On reading the summary of this paper, she discovers that, whilst it has interesting parallels, it clearly relates to another science domain.
Meanwhile, in the background, the system has been cross-indexing available subject-specific metadata from different science domains related to the article she is reviewing and displays a contextually subject-specific relationship by way of 'recommended reading'.
She clicks on this and it displays a narrowed result set of which two articles capture her interest, both from different application science disciplines to her own. These are:
- "Comparative proteomic profiling of plasma from two divergent pig breeds for lean growth. PMF. LC/MS. Biomarker discovery. Label-free proteomics. Porcine. This thesis is submitted in partial fulfilment of the requirements for the degree of Master of Veterinary Science; This artefact is an unpublished PhD thesis";
- "Combining genetic mapping with text mining to identify disease-related genes. Bioinformatics. This thesis is submitted in partial fulfilment of the requirements for the degree of Master of Science. The thesis sets out the results of a structured survey and unpublished PhD thesis".
Both are contextually relevant but unpublished, so she downloads the full PDFs of both articles and asks the system to provide contact details for the submitters and to set an alert if anything else is deposited in this area.
Susan, within her initial literature review, has quickly:
- identified a relationship (within the subject area of proteomes to obesity) to allied similar research in Veterinary Science; and also
- identified text-mining works which provide computational tools to carry out further extrapolation.
This has been achieved without the need to enter in additional search terms as her machine-driven background profile is also able to learn and adapt to the way she seeks information. It is also able to generate and present to her contextually relevant clusters or taxonomies of concepts and changing relationships between them. She flags this for tracking due to the likelihood of it being helpful.
As the system has been deployed across the institution, it is able to relate these concepts and relationships to other researchers whose information-searching profiles indicate they are interested in the same subject areas. On the right of her personalised screen she notices the system has provided a list of researchers fitting this attribute. She notices there is a researcher working in a school outside her own faculty who is conducting work in this area and decides to drop him an email to see if they could meet up and 'join forces' in a prospective bid.
What the system layer has allowed her to do is to tap into the existing knowledge capital within the institution as well as having a heightened alerting capability to changes in the research intelligence environment. This is not available to competing institutions whose researchers may be working in relative isolation and who do not benefit from dynamic, adaptive and automatic tracking of the internal and external environments relative to the given research domain. Susan instructs the system to archive the aggregation of the knowledge derived from her information activity in the Institutional eResearch repository in preparation for a prospective bid submission process. This is reviewed and approved by her PI.
The IRS Project aims are:
- To explore and identify all current search and discovery technologies available to the Research & Teaching context that could support developments within the Institutional Repository landscape relating to the Higher Education Institutions (HEI) sector. The IRS project team did this by establishing a dialogue with:
• a number of open-source development communities (such as Lucene and IBM Unstructured Information Management Technology – UIMA); and
• commercial knowledge and search technology stakeholders (Google, Yahoo, Microsoft & Autonomy Systems).
- To encourage the embedding of a discovery and search facility for UK Institutional Repositories in familiar and day-to-day research desktop environments.
- To identify how improved services to individuals could be further developed by including the ability to personalise information based on user profile, directed browse and dynamic navigation.
- To provide richer, more meaningful conceptual and semantic search facilities using neural networking, semantic and text-mining technologies, including full-text document searching and dynamic concept linking.
- To establish how the IRS service can be openly embedded in Web 2.0 technologies aiming at richer personalisation and contribute to developments within the Semantic Web.
Example User-driven Scenarios
A number of user-driven scenarios were developed relating to the research and teaching and learning contexts. These have subsequently been further refined as part of the Automatic Metadata Generation Use Cases recently developed for JISC.
The project has identified and successfully carried out specific development paths, namely evolving from:
- simple metadata search; to
- conceptual search and clustering, with full conceptual indexing of documents;
- text mining of full-text documents;
- automatic subject classification and clustering of results; and
- browsing/visualisation of the search results.
Together with our work with NaCTeM, this extends to term-based document classification and query expansion.
Institutional Repository Search currently searches over 95 UK institutional repositories that are listed in the Directory of Open Access Repositories (OpenDOAR) and harvested using an aggregation system developed by UKOLN.
The project evolved from simple search at the outset, through leveraging our findings with commercial and open-source technology stakeholders, to developing scalable Proof-of-Concept (POC) demonstrators for more advanced conceptual search, clustering and text mining-based search facilities, which were integrated and fully operational by the end of the project.
Positive feedback data were obtained from formal evaluation with academic end-users and researchers. The project undertook formal user group exercises where it asked end-users to test the demonstrators. Within its advocacy work, the project also reported on its progress and presented the demonstrators at a number of national and international conferences where the feedback was always positive and in support of its innovative developments. Interest in the project's results continues. User group requirements have been integrated into the project's development iterations to ensure that the project adequately reflects what researchers want from a service such as Institutional Repository Search. Further extensions of the IRS capability have recently been reviewed by formal user panels and by senior scholars from a number of disciplines with consistent positive feedback. Much of IRS's success derives from the fact that it provides a way of discovering dynamic relationships from an otherwise static resource, something that is not available elsewhere.
The project has combined two complementary technologies. One is a Web 3.0-orientated technology offering conceptual search, with automated clustering and browsing, using the IDOL 7 engine provided by Autonomy. The second is based on text-mining technology provided by NaCTeM.
The rationale for the final choice of these two complementary technologies arose from extensive discussion and elaboration with a range of knowledge technology stakeholders as described in the Project aims:
- Autonomy IDOL has UK academic origins at the University of Cambridge, and Autonomy maintained a consistent R&D dialogue with the technical team that worked on the aims of the project over the funded period of IRS. It provides enterprise-class scalable core knowledge infrastructure capabilities which the IRS team has translated and successfully deployed into an open interface in the demonstrator;
- NaCTeM, based at the University of Manchester, has been at the forefront of developing subject-specific text-mining applications which complement this broader capability.
Conceptual Search (IDOL)
More specifically, conceptual search involves:
- Benefits in the use of unstructured information search algorithms supported by metadata and full text, thus supporting automated taxonomy generation, and concept matching across related documents or artefacts.
- The ability to search for a document, based on words that are related to a concept rather than a document that contains the actual search word or phrase.
- The use of unstructured retrieval algorithms (Bayesian Inference and Shannon's Information Theory), provided by the Autonomy IDOL 7 engine.
- The project evaluated a range of available technologies but had particular success with the use of Autonomy IDOL.
- This was due to available experience of its widespread use in key information retrieval areas in different, but contextually similar, scenarios to those of our researchers, such as information analysts in the wider Government and Media sectors.
Conceptual search then allows for a richer contextual search facility for users who want to view documents that are offered according to their relationship to their query.
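As a rough illustration of the information-theoretic idea mentioned above, the sketch below weights each query term by its Shannon self-information, so that terms occurring in few documents contribute more to a document's score than terms occurring almost everywhere. This is a toy model written for this article under stated assumptions; it does not describe how the Autonomy IDOL engine is implemented, and the corpus is made-up data.

```python
import math

def term_information(term, documents):
    """Self-information, in bits, of finding `term` in a randomly chosen document."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # A term absent from the corpus cannot help rank it.
    return -math.log2(containing / len(documents))

def score(query_terms, doc, documents):
    """Sum the information of the query terms that the document contains."""
    return sum(term_information(t, documents) for t in query_terms if t in doc)

# Toy corpus: each document is reduced to a set of words (made-up data).
corpus = [
    {"proteome", "obesity", "marker"},
    {"proteome", "pig", "plasma"},
    {"proteome", "text", "mining"},
    {"repository", "metadata"},
]

query = {"proteome", "obesity"}
for doc in corpus:
    print(sorted(doc), round(score(query, doc, corpus), 3))
```

Here 'obesity' appears in one document out of four and so carries two bits, whereas 'proteome' appears in three of the four and carries well under one bit. The first document therefore outranks the others even though three of them contain a query word.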
Text Mining-based Search (NaCTeM)
Metadata are critical to the discovery of new knowledge. The richer the metadata, and the more those metadata are linked into other resources of different types, the better placed a repository is to support new knowledge discovery and semantic search.
Text mining-based search tools that focus on what the document is about are both more convenient and more intuitive. There are several attractive properties of a semantic search at the level of concepts, rather than keyword matching. However, the prerequisite for such tools is the availability of semantic metadata, i.e. information about what the document is about.
A problem with full-text search algorithms is low precision. This is due to the fact that they match the query words indiscriminately against all words in a document, whether they reflect the topic of the document or not. Within the IRS Project we have proposed a practical way of alleviating this problem by constraining the search to those words or terms representative of the documents. The keywords approach addresses this problem, but it depends on manually provided keywords. To be more practically useful, any approach needs to recognise document-representative words and terms automatically.
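The precision problem can be seen in a small sketch. The two functions below contrast indiscriminate full-text matching with matching that is constrained to each document's representative terms. The documents, their identifiers and their term lists are invented for illustration; in practice the terms would come from an automatic extraction step rather than being typed in by hand.

```python
def full_text_search(query, documents):
    """Return IDs of documents whose body contains the query word anywhere."""
    return [doc_id for doc_id, doc in documents.items()
            if query in doc["text"].lower().split()]

def term_search(query, documents):
    """Return IDs of documents whose representative terms include the query word."""
    return [doc_id for doc_id, doc in documents.items()
            if any(query in term.split() for term in doc["terms"])]

# Hypothetical documents with hand-written representative terms.
documents = {
    "thesis-1": {
        "text": "We profile the plasma proteome of two pig breeds",
        "terms": ["plasma proteome", "pig breeds"],
    },
    "report-2": {
        "text": "Funding for library services unlike proteome research has grown",
        "terms": ["library services", "funding"],
    },
}

print(full_text_search("proteome", documents))
print(term_search("proteome", documents))
```

The full-text search returns both documents because the second one mentions the word in passing, while the term-constrained search returns only the document that is actually about proteomes.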
TerMine is an automatic term extraction service which identifies and ranks multiword terms (concepts) according to relevance. TerMine is based on the C-Value method.
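For readers unfamiliar with the C-Value method, the sketch below implements its core scoring formula as published by Frantzi et al.: a candidate term scores the base-2 logarithm of its length in words multiplied by its frequency, and a candidate that is nested inside longer candidates first has the average frequency of those longer candidates subtracted. This is a simplified illustration and not the TerMine implementation; it omits the linguistic filter and frequency thresholds, and the candidate terms and counts are invented.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )

def c_value(candidates):
    """Score multi-word candidate terms.

    `candidates` maps a tuple of words to its frequency in the corpus.
    """
    scores = {}
    for term, freq in candidates.items():
        # Frequencies of the longer candidates that contain this term.
        nesting = [f for other, f in candidates.items() if contains(other, term)]
        adjusted = freq
        if nesting:
            # Discount occurrences that are accounted for by longer terms.
            adjusted = freq - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * adjusted
    return scores

# Invented candidate terms and frequencies.
candidates = {
    ("genetic", "marker"): 8,
    ("genetic", "marker", "discovery"): 2,
    ("plasma", "proteome"): 4,
}

scores = c_value(candidates)
for term, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(value, 3))
```

Because the logarithm of one is zero, this basic form of the formula only ranks terms of two or more words, which matches the focus on multiword terms described above.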
For IRS, the results of TerMine have been integrated in the indexing process such that the document contents are represented by the extracted terms. In addition to improving the quality of full-text retrieval, the extracted terms have been used to guide users on how to navigate documents sharing the same domain concepts. Each retrieved document serves as a starting point to find implicitly related documents following the semantic links discovered by the concept terms, alleviating the problem of overlooking information. A real-time document clustering system, Carrot2, which employs the LINGO algorithm, has been customised for our purposes to cluster the retrieved documents on the fly. In each of the groups, the documents are related via a topic, denoted by a human-readable label. For visualisation purposes, the Aduna cluster visualisation library has also been integrated.
The starting point of the search site is a basic search page, which provides options for searching on four fields (abstract, full text, title and author) or on any combination of them. From the retrieved document list page, the document titles lead to a full document information page, on which various information about the document is displayed. In addition, each document is associated with a list of similar documents and a list of terms representative of this document. The users can search for other documents that share the terms by clicking on each term. By repeating the search with the terms, users can potentially find numerous documents linked to a variety of subjects and topics.
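The term-driven navigation just described can be pictured as a lookup in an index from terms to documents. The following is a minimal sketch under that assumption: it builds such an index and lists the documents that share terms with a given starting document. The document identifiers and terms are hypothetical, and the data structures of the real service are not documented here.

```python
from collections import defaultdict

def build_term_index(doc_terms):
    """Map each representative term to the set of documents that carry it."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def related_documents(doc_id, doc_terms, index):
    """Documents sharing at least one term with `doc_id`, most shared terms first."""
    shared = defaultdict(int)
    for term in doc_terms[doc_id]:
        for other in index[term]:
            if other != doc_id:
                shared[other] += 1
    return sorted(shared, key=lambda d: (-shared[d], d))

# Hypothetical documents and their extracted terms.
doc_terms = {
    "A": {"plasma proteome", "biomarker discovery"},
    "B": {"biomarker discovery", "text mining", "plasma proteome"},
    "C": {"text mining", "genetic mapping"},
    "D": {"biomarker discovery"},
}

index = build_term_index(doc_terms)
print(sorted(index["text mining"]))           # clicking a term lists its documents
print(related_documents("A", doc_terms, index))
```

Starting from document A, a user reaches B and D directly. From B, the term 'text mining' leads on to C, which shares no term with A. This is the sense in which repeated term searches can surface implicitly related material.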
Different information retrieval packages may employ different query syntaxes. In order to assist users who are not familiar with Lucene query syntax, the system provides an automatic query generator page, which can be used to generate complex queries by assigning values for a set of parameters via text fields on the page. For example, users can set the words that must occur, may occur or must not occur, select a sorting option, give authors' names, repositories' names, etc. To cater for users who may wish to adjust the queries manually, the automatically generated queries are displayed and can be edited. This text mining-based search facility provides a useful tool for academic researchers and demonstrates the benefit of text mining techniques for advanced full-text search.
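A query generator of this kind can be sketched in a few lines. The function below assembles a query string using three features of Lucene's classic query syntax: a leading `+` for clauses that must match, a leading `-` for clauses that must not match, and `field:value` to restrict a clause to a field. The field names `fulltext` and `author` and the function signature are assumptions made for illustration, not the names used by the IRS service, and escaping of Lucene's special characters is deliberately left out.

```python
def build_query(must=(), may=(), must_not=(), author=None, field="fulltext"):
    """Assemble a Lucene-style query string from form values.

    `field` and the `author` field name are hypothetical index fields.
    """
    def quote(value):
        # Multi-word values are wrapped in double quotes to form a phrase.
        return f'"{value}"' if " " in value else value

    clauses = [f"+{field}:{quote(w)}" for w in must]       # must occur
    clauses += [f"{field}:{quote(w)}" for w in may]        # may occur
    clauses += [f"-{field}:{quote(w)}" for w in must_not]  # must not occur
    if author:
        clauses.append(f"+author:{quote(author)}")
    return " ".join(clauses)

generated = build_query(
    must=["proteome"],
    may=["obesity"],
    must_not=["porcine"],
    author="Susan Smith",
)
print(generated)
```

The generated string here is `+fulltext:proteome fulltext:obesity -fulltext:porcine +author:"Susan Smith"`, which a user could then edit by hand before submitting.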
The IRS search service benefits different groups in the following manner:
- the research community by providing a more effective and personalised search and discovery facility, addressing the problem of information oversight.
- the institutions themselves in providing a useful tool for their research output to attract a global audience.
- society as a whole, in ensuring that publicly funded research is not only made easily accessible through Open Access but that it is also more clearly identifiable for the organisation or person searching for a particular study.
Institutional repositories play a very important role in scholarly communication and there are many sources of information within the research arena where the user can search for information. However, discovery and retrieval of information is not always easy, targeted and relevant. This in turn can restrict the researcher's chances of finding appropriate and significant material and so hinder innovation.
The UK Institutional Repository Search Project has created a free search tool that responds to these challenges. More importantly, it has broken new ground in visualising how a new paradigm to search and discovery within the academic sector can be achieved beyond the simple search box.
There are key issues in UK research and teaching in scaling up our supportive knowledge capability to allow our end-users, researchers and scholars to optimise and even shorten the research cycle in response to new global challenges.
A key attribute is the ability ultimately to shorten the discovery time points within the research cycle from discovery to innovation. A base scenario could be described thus:
"From a researcher's perspective, a starting point may not actually be a single entry into a Google search box or implicit skill in 'browsing' a subject-based classification taxonomy.
For example, a new researcher wishing to approach scholarly inquiry to determine the impact of global warming on penguin populations in South Antarctica doesn't walk up to a Librarian and shout 'Penguins'.
However, despite advances in search systems such as Google Scholar, entering this even as a subject-specific natural language query results in a list of 2,590 returns in a long sequential list".
Based upon recent feedback from researchers, what is needed is a means to organise this subject-orientated information in a way that fits in with current or even dynamic community or personal classification schemes. More problematic is devising the means of mapping a constantly increasing range of subject or contextually related searchable digital assets, from a number of multi-format repositories within a dynamic relationship taxonomy. Somehow, this has to relate directly to the construct framework the researcher is formatively attempting to build by way of new knowledge creation.
Using two proven and best-of-breed complementary technologies, IRS approaches this by searching over 95 UK repositories and 500,000 artefacts to create the opportunity for researchers to identify relevant information and relationships between information, in order to support them in their literature review and the production of their research output. This has subsequently been extended to include both learning repositories and scientific and humanities-orientated historical archive repositories.
We would like to thank our colleagues for their much-valued contributions: Phil Cross (University of Bristol); Monica Duke (UKOLN, University of Bath); Sophia Jones (SHERPA, University of Nottingham); Linda Kerr (Heriot-Watt University); Andy Priest (Intute, University of Manchester); Paul Walk (UKOLN, University of Bath).
- The Semantic Web, Wikipedia, 1 November 2009 http://en.wikipedia.org/wiki/Semantic_Web
- Institutional Repository Search http://www.intute.ac.uk/irs
- Green R, Dolphin I, Awre C and Sherratt R (2007) The RepoMMan Project: automating workflow and metadata for an institutional repository, OCLC Systems and Services, 23 (2), 212-215
- Joint Information Systems Committee http://www.jisc.ac.uk/
- National Data Centre, University of Manchester, UK http://www.mimas.ac.uk
- Securing a Hybrid Environment for Research Preservation and Access (SHERPA) http://www.sherpa.ac.uk
- UKOLN, University of Bath http://www.ukoln.ac.uk/
- National Centre for Text Mining http://www.nactem.ac.uk/
- Vic Lyte, Automatic Metadata Generation – Use Cases
- Open Directory of Open Access Repositories http://www.opendoar.org/
- Autonomy IDOL http://www.autonomy.com/content/Products/products-idol-server/index.en.html
- Autonomy: A Unique Combination of Technologies
- Nobata, C., Sasaki, Y., Okazaki, N., Rupp, C. J., Tsujii, J. and Ananiadou, S. (In Press). Semantic Search on Digital Document Repositories based on Text Mining Results. In: International Conferences on Digital Libraries and the Semantic Web 2009 (ICSD2009)
- Piao, S., Rea, B., McNaught, J. and Ananiadou, S. (2009). Improving Full Text Search With Text Mining Tools. In: Proceedings of the 14th International Conference on Applications of Natural Language to Information Systems
- Frantzi, K., Ananiadou, S. and Mima, H. (2000) Automatic recognition of multi-word terms, International Journal of Digital Libraries 3(2), 117-132.
- Osinski, S., Stefanowski, J., Weiss, D.: Lingo: Search results clustering algorithm based on Singular Value Decomposition. (2004). In: Proceedings of the International Conference on Intelligent Information Systems (IIPWM'04), Zakopane, Poland, pp. 359–368.
- Osinski, S., Weiss, D.: Conceptual clustering using lingo algorithm: evaluation on open directory project data. (2004). In: Proceedings of the International Conference Intelligent Information Systems (IIPWM'04), Zakopane, Poland, pp. 369–377.
- Apache Lucene - Query Parser Syntax http://lucene.apache.org/java/2_3_2/queryparsersyntax.html