
The National Centre for Text Mining: Aims and Objectives

Sophia Ananiadou, Julia Chruszcz, John Keane, John McNaught and Paul Watry describe NaCTeM's plans to provide text mining services for UK academics.

In this article we describe the role of the National Centre for Text Mining (NaCTeM). NaCTeM is operated by a consortium of three Universities: the University of Manchester which leads the consortium, the University of Liverpool and the University of Salford. The service activity is run by the National Centre for Dataset Services (MIMAS), based within Manchester Computing (MC). As part of previous and ongoing collaboration, NaCTeM involves, as self-funded partners, world-leading groups at San Diego Supercomputer Center (SDSC), the University of California at Berkeley (UCB), the University of Geneva and the University of Tokyo. NaCTeM's initial focus is on bioscience and biomedical texts as there is an increasing need for bio-text mining and automated methods to search, access, extract, integrate and manage textual information from large-scale bio-resources. NaCTeM was established in Summer 2004 with funding from the Joint Information Systems Committee (JISC), the Biotechnology and Biological Sciences Research Council (BBSRC) and the Engineering and Physical Sciences Research Council (EPSRC), with the consortium itself investing almost the same amount as it received in funding.

Need for Text Mining in Biology

Dynamic development and new discoveries in the domains of bioscience and biomedicine have resulted in a huge volume of domain literature, which is constantly expanding both in size and thematic coverage. With the overwhelming amount of textual information presented in scientific literature, there is a need for effective automated processing that can help scientists to locate, gather and make use of knowledge encoded in electronically available literature [1] [2]. Although a great deal of crucial biomedical information is stored in factual databases, the most relevant and useful information is still represented in the domain literature [3]. Medline [4] contains over 14 million records, extending its coverage with more than 40,000 abstracts each month. Open access publishers such as BioMed Central have growing collections of full-text scientific articles. There is increasing activity and interest in linking factual biodatabases to the literature, and in using the literature to check, complete or complement the contents of such databases. However, such curation is currently laborious, being done largely manually with few sophisticated aids, so the risks of introducing errors or leaving unsuspected lacunae are non-negligible. There is also great interest among biologists in exploiting the results of mining the literature in a tripartite discovery process involving factual biodatabases and their own experimental data.

Techniques for literature mining are therefore no longer an option, but a prerequisite for effective knowledge discovery, management, maintenance and update in the long term. To illustrate the growing scale of the task facing specialists trying to discover precise information of interest within the biobibliome, a query such as 'breast cancer treatment' submitted to Medline's search engine returned almost 70,000 references in 2004, compared with 20,000 abstracts in 2001. Effective management of biomedical information is therefore a critical issue, as researchers have to be able to process the information both rapidly and systematically. Traditionally, bioscientists search the biomedical literature using the PubMed interface to retrieve MEDLINE documents. PubMed is an indexing and retrieval repository that manages several million documents. These documents are manually indexed: index terms are selected and assigned to documents from a standard controlled vocabulary, the Medical Subject Headings (MeSH). Retrieval is implemented as a Boolean keyword search, so only documents that fully satisfy a query are retrieved. A further problem is the selection of the appropriate index terms to retrieve the most relevant documents: index terms do not necessarily characterise documents semantically, but are used to discriminate among documents. Classic Information Retrieval (IR) techniques do not use any linguistic processing to cope with language variability such as synonymy and polysemy, so relevant documents are missed and many false positives are returned. Even controlled indexing approaches are inconsistent and limited, since knowledge repositories are static and cannot cope with the dynamic nature of the literature.
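
To make the limitation concrete, here is a minimal sketch in Python contrasting strict Boolean keyword matching with matching expanded by a synonym table. The example abstracts and the synonym entries are invented for illustration; this is not a description of PubMed's actual retrieval machinery.

```python
# Minimal illustration of why strict Boolean keyword matching misses
# relevant documents: example abstracts and synonym table are invented.

abstracts = {
    1: "PKB phosphorylation promotes tumour cell survival",
    2: "Akt signalling is activated in breast carcinoma",
    3: "Dietary fibre intake and colorectal cancer risk",
}

# Hypothetical fragment of a curated synonym table (e.g. gene aliases).
synonyms = {"akt": {"akt", "pkb", "protein kinase b"}}

def boolean_search(query_terms, docs):
    """Return ids of documents containing every query term verbatim."""
    return [i for i, text in docs.items()
            if all(t in text.lower() for t in query_terms)]

def expanded_search(query_terms, docs):
    """Accept any known synonym of each query term."""
    hits = []
    for i, text in docs.items():
        text = text.lower()
        if all(any(s in text for s in synonyms.get(t, {t}))
               for t in query_terms):
            hits.append(i)
    return hits

print(boolean_search(["akt"], abstracts))   # [2]  -- misses the PKB abstract
print(expanded_search(["akt"], abstracts))  # [1, 2]
```

Even this toy example shows that, without knowledge of term variants, a strictly Boolean match silently drops relevant documents.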

Classic IR methods are not sufficient, because the number of documents returned in response to a query is huge. With new terms being introduced in the literature daily, text mining facilities such as automatic term management tools, which go beyond keyword indexing and retrieval, are indispensable for the systematic and efficient collection of biomedical data. Manually controlled vocabularies are error-prone, subjective and limited in coverage.

However, once a highly relevant set of documents is returned through exploitation of term-based indexing and searching, this will typically still be large and, more importantly, will still not yield precise facts at this stage.

Processing biomedical literature faces many challenges, both technical and linguistic. Technical challenges are posed, for example, by restricted availability and access, heterogeneous representation (storage) formats and extensive use of non-textual content such as tables, graphs and figures; linguistic challenges are posed by the particularities of the biomedical sub-language. One of the main challenges in bio-text mining is the identification of biological terminology, which is a key factor for accessing the information stored in the literature, as information across scientific articles is conveyed through terms and their relationships. Terms (which here are taken to include names of genes, proteins, gene products, organisms, drugs, chemical compounds, etc.) are the means of scientific communication, as they are used to refer to domain concepts: in order to understand the meaning of an article and to extract appropriate information, precise identification and association of terms is required [5]. New terms are introduced into the domain vocabulary on a daily basis and, given the number of names introduced around the world, it is practically impossible to keep manually produced and curated terminologies up to date. There are almost 300 biomedical databases containing terminological information. Many of these resources contain descriptors rather than terms as used in documents, which makes matching controlled sets of terms against the literature difficult. Terminological processing (i.e. identification, classification and association of terms) has been recognised as the main bottleneck in biomedical text mining [6], severely reducing the success rates of 'higher-level' text mining processes which crucially depend on accurate identification and labelling of terms. Various approaches have been suggested for automatic recognition of terms in running text [7][8][5]. Crucially, technical terms of the kind we consider here are to be distinguished from index terms used to characterise documents for retrieval: a good index term might not be a technical term; a technical term is of potential interest for text analysis even if it occurs infrequently in a collection; and all technical terms in a document are of potential interest for text analysis.
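
The fragility of exact term matching is easy to demonstrate. The following sketch (dictionary entry, mentions and normalisation rules all invented for illustration) shows how naive string normalisation recovers some surface variants of a term but not full forms or database identifiers, which is why dedicated term recognition techniques are needed.

```python
import re

# Toy dictionary entry and text mentions (all invented) illustrating why
# exact matching of controlled terms against running text is fragile.
dictionary_term = "NF-kappa B"
mentions = ["NF-kappaB", "NF-kB", "nuclear factor kappa B", "NFKB1"]

def normalise(term):
    """Very rough normalisation: lowercase, collapse hyphens and spaces,
    and fold one common Greek-letter spelling."""
    term = term.lower().replace("kappa", "k")
    return re.sub(r"[\s\-]+", "", term)

target = normalise(dictionary_term)           # 'nfkb'
for m in mentions:
    print(m, normalise(m) == target)
# NF-kappaB True, NF-kB True -- but the full form and the database
# identifier still fail, so simple normalisation is not enough.
```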

Recognition of terms in text is not the ultimate aim: terms should also be related to existing knowledge and/or to each other, and classes of terms and hierarchies of classes need to be established, as it is terms that provide the link between the world of text and the world of ontologies and other such classification schemes. The ontological elements to which terms map serve, in turn, to drive ontology-based information extraction, discussed further below.

Several approaches have been suggested for the extraction of term relations from the literature. The most common approach to discovering term associations is based on shallow parsing and Information Extraction (IE) techniques, relying either on pattern matching or on IE-based semantic templates. While pattern-matching approaches are typically effective, the cost of preparing domain-oriented patterns is high, and recall suffers if pattern coverage is not broad. Since the separate use of statistical, knowledge-intensive or machine learning approaches cannot capture all the semantic features needed by users, a combination of these approaches is more promising.
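
As a rough illustration of the pattern-matching style of relation extraction, the sketch below applies two invented surface patterns to invented sentences. It assumes entity recognition has already been done, and the two patterns stand in for the much larger hand-crafted pattern sets whose preparation cost is mentioned above; it is not a description of any NaCTeM component.

```python
import re

# Toy pattern-based relation extractor: patterns and sentences are invented
# examples of the general approach.
sentences = [
    "p53 activates p21 in response to DNA damage",
    "Rapamycin inhibits mTOR signalling",
]

ENTITY = r"(\w+)"   # crude stand-in for a recognised entity
patterns = [
    (re.compile(ENTITY + r"\s+activates\s+" + ENTITY), "activates"),
    (re.compile(ENTITY + r"\s+inhibits\s+" + ENTITY), "inhibits"),
]

for s in sentences:
    for regex, relation in patterns:
        for agent, target in regex.findall(s):
            print((agent, relation, target))
# ('p53', 'activates', 'p21')
# ('Rapamycin', 'inhibits', 'mTOR')
```

Each new way of expressing the same relation would need a further pattern, which is precisely why recall suffers when pattern coverage is narrow.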

Given the dynamic nature of biomedicine, any method should be tunable and application-independent. We believe that the use of available knowledge sources has to be combined with the dynamic management of concepts (terms) encountered in texts. Most current systems address known relationships, and aim at the extraction of semantic or conceptual entities, properties of entities, and factual information involving identified entities. We propose to support not only the extraction of entities, properties and facts but also, through data mining, the discovery of associations and relationships not explicitly mentioned, or indeed totally unsuspected (the discovery of new knowledge): this is the true power of text mining. Our view of text mining, thus, is that it involves advanced information retrieval yielding all precisely relevant texts, followed by information extraction processes that result in extraction of facts of interest to the user, followed by data mining to discover previously unsuspected associations.

Role of the National Centre for Text Mining

The paramount responsibility of NaCTeM is to establish high-quality service provision in text mining for the UK academic community, with particular focus on biological and biomedical science. Initial activity will establish the framework to enable a quality service, and to identify 'best of breed' tools. Evaluation and choice of appropriate tools is ongoing, and tools will be customised in cooperation with partners and customers, bearing in mind existing competition and advantages to be gained from cooperation with technology providers.

The overall aims of NaCTeM are:

  1. to provide a one-stop resource and focus primarily for the UK text mining community for user support and advice, service provision, dissemination, training, data access, expertise and knowledge in related emerging technologies, consultative services, tutorials, courses and materials, and demonstrator projects;
  2. to drive the international and national research agenda in text mining informed by the collected experiences of the user community, allied to existing and developing knowledge and evaluation of the state of the art;
  3. to consolidate, evolve, and promulgate best practice from bio-related text mining into other domains;
  4. to widen awareness of and participation in text mining to all science and engineering disciplines, and further to social sciences and humanities, including business and management;
  5. to maintain and develop links with industry and tool suppliers, to establish best practice and provision.

The vision informing NaCTeM is to harness the synergy from service provision and user needs within varied domains, allied to development and research within text mining. The establishment of a virtuous feedback cycle of service provision based both on commercial software and on innovative tools and techniques, themselves in turn derived from user feedback, is intended to enable a quality service whilst ensuring advances within each associated domain. This paradigm is how advances within bio-text mining have occurred, not least within the NaCTeM consortium's recent activity. NaCTeM is working to consolidate existing successes, activity, and working relationships and models, and transfer them to related science and engineering activities and humanities. Importantly, the expectation is that NaCTeM will shortly (Summer 2005) be housed in an interdisciplinary bio-centre co-locating life scientists, physicists, chemists, mathematicians, informaticians, computer scientists and language engineers with service providers and tool developers. Such co-location promises a step-change in awareness and use of text mining such that very definite advances can be both realised and sustained.

The services offered by NaCTeM are expected to be available via a Web-based portal. Three types of service are envisaged: those facilitating access to tools, resources and support; those offering online use of resources and tools, including tools to guide and instruct; and those offering a one-stop shop for complete, end-to-end processing by the centre with appropriate packaging of results.

Initially, users of NaCTeM will be members of academic and research institutions, and later companies throughout the supply chain in the biotechnological and pharmaceutical industries. In addition, potential users will be public sector information organisations; SMEs (Small and Medium-Sized Enterprises) in the life sciences sector and IT (knowledge management services) sector; regional development agencies; health service trusts and the NHS information authority; major corporates in the pharmaceutical, agro-pharmaceutical and life sciences industries including food and healthcare; government and the media.

NaCTeM integrates several areas; two of particular importance, ontologies and lexical resources, are outlined below.

Ontologies describe domain-specific knowledge and facilitate information exchange. They store information in a structured way and are crucial resources for the Semantic Web. In an expanding domain such as biomedicine it is also necessary for ontologies to be easily extensible. Since ontologies are needed for automatic knowledge acquisition in biomedicine, the challenge is their automatic update: manual expansion is almost impossible in such a dynamic domain, so text mining solutions such as term clustering and term classification are valuable for automating ontology maintenance. Term clustering suggests associations between related terms, which can be used to verify or update instances of semantic relations in an ontology. Term classification results can be used to verify or update the taxonomic aspects of an ontology.
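
The following sketch illustrates the term clustering idea in its simplest form: terms are represented by the words they co-occur with and compared by cosine similarity. The terms and contexts are invented and a production system would use far richer features, but the output of such clustering is the kind of evidence a curator could use to propose new relations for an ontology.

```python
from collections import Counter
from math import sqrt

# Toy context-based term clustering: each term is represented by a bag of
# words it co-occurs with (contexts invented for illustration); terms with
# similar context vectors are candidate relatives for an ontology curator.
contexts = {
    "interleukin-2": "cytokine secreted activated T cells immune response",
    "interferon-gamma": "cytokine produced T cells immune response antiviral",
    "aspirin": "drug inhibits cyclooxygenase pain inflammation",
}

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

terms = list(contexts)
vectors = {t: vec(c) for t, c in contexts.items()}
for i, t1 in enumerate(terms):
    for t2 in terms[i + 1:]:
        print(t1, t2, round(cosine(vectors[t1], vectors[t2]), 2))
# The two cytokines score far higher with each other than either does
# with aspirin, suggesting they belong together in the ontology.
```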

Lexical resources (dictionaries, glossaries, taxonomies) and annotated text corpora are equally important for text mining. Electronic dictionaries give formal linguistic information on word forms. Furthermore, as ontologies represent concepts and have no link with the surface world of words, a means is needed to link canonical text strings (words) with ontological concepts: dictionaries and taxonomies aid in establishing this mapping. Annotated text corpora, such as GENIA [9], are essential for rule development, for training in machine learning techniques and for evaluation.

Meeting the Needs of Users

We now elaborate on some of the above points where these concern the core text mining components provided by the consortium to underpin the national service.

Overall, the text mining process involves many steps, hence potentially many tools, and potentially large amounts of text and data to be analysed and stored at least temporarily, including all the intermediate results, and the need to access large resources, such as ontologies, terminologies, document collections, lexicons, training corpora and rule sets, potentially widely distributed. Much of the processing is compute-intensive and scalability of algorithms and processes is thus a challenge for the service to meet the requirements of users. Moreover, we expect that many users will want to process the same data again and again (e.g. Medline), with perhaps variation of need only at the higher levels of analysis (fact extraction) or during the data mining stage. It is thus inappropriate to, say, expensively analyse the entire contents of Medline for each text mining request. Essentially, parts of the analysis of some collections will remain unchanged, or will change only slowly with the advent of improved analysis techniques.

Thus, once a collection has been processed to annotate terms and named entities, and properties of these, there is no need to do so again in the general case: there is only a need to analyse new additions to a collection since the last analysis. We expect therefore that part of the service will be devoted to elaborating techniques to reuse previously analysed material, while also developing and exploiting caching techniques to cut down on the amount of processing and data transfer that may otherwise be required. For example, one need only think of the overhead involved in simple lookup of an ontology or dictionary for every concept or word in a collection of several million documents, when numerous requests are received to process that collection: lookup is one of the more straightforward mechanisms of text mining, but the scale of the task here is the point at issue within a national service. Users may in fact prefer or have to use distributed high-speed processing facilities for large-scale processing rather than their desktop PC. In this case, it is essential to consider portals, GRID capabilities, and access to hosts capable of handling high-dimensional data for the academic community.
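
A minimal sketch of the reuse idea, assuming annotations are cached against a hash of the document text and the analyser version, so that a document is only re-analysed when either changes; the function names and the storage choice here are illustrative only.

```python
import hashlib
import shelve

ANALYSER_VERSION = "term-tagger-0.1"   # bump to invalidate cached results

def annotate(text):
    """Stand-in for an expensive analysis step (e.g. term tagging)."""
    return [w for w in text.split() if w[0].isupper()]

def cached_annotate(doc_id, text, cache):
    """Re-analyse a document only if its text or the analyser has changed."""
    key = hashlib.sha1((ANALYSER_VERSION + text).encode("utf-8")).hexdigest()
    entry = cache.get(doc_id)
    if entry and entry["key"] == key:
        return entry["annotations"]          # reuse the earlier analysis
    annotations = annotate(text)
    cache[doc_id] = {"key": key, "annotations": annotations}
    return annotations

with shelve.open("annotation_cache") as cache:
    print(cached_annotate("pmid:1", "Interleukin-2 activates T cells", cache))
    print(cached_annotate("pmid:1", "Interleukin-2 activates T cells", cache))  # served from cache
```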

We thus, in the national service, face a challenge typically irrelevant to or unaddressed by many current text mining systems: the need to provide scalable, robust, efficient and rapidly responsive services for very large collections that will be the target of many simultaneous requests, in a processing workflow where each process may have to consult massively large-scale resources, manage massive amounts of intermediate results and take advantage wherever possible of sophisticated optimisation mechanisms, distributed and parallel computing techniques. Then again, we must do all this while further recognising the need of many users for security and confidentiality especially with respect to the high-level fact extraction and data mining results that they obtain, which leads us naturally into consideration of secure portals and management of levels of sharing of intermediate data and results.

Development work at NaCTeM will thus be emphasising scalability and efficiency issues, in an environment where different levels of access may need to be managed relative to certain types of intermediate data and results. Moreover, as we recognise that there are many different types of text mining, and that a user may not be interested at any one time in all stages, just in some sub-set (e.g. the user may want to stop after applying IR, term extraction and fact extraction processes, or may want simply to do sentence splitting, tokenisation and document zoning), we do not intend to offer a monolithic service consisting of a single workflow of mandatory tools. It is rather our intention to offer flexibility and the potential to reconfigure workflows as required. Hence, we shall be investing effort in elaborating a model that will allow flexible combination of components (tools and resources) to achieve some text mining task, in a GRID or otherwise distributed environment. This further implies a strong interest in standards: adopting appropriate standards or pushing the development of de facto standards where required. For the user, the advantage is that third party tools and resources, along with the user's own components, can be integrated in a distributed workflow, assuming interface standards are adhered to (e.g. Web services, although with linguistic processing we must remember that standards are required at the linguistic level, not just at the transport protocol level, to ensure that linguistic data tags and attributes, for example, are consistently labelled and interpreted).
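
A sketch of the reconfigurable-workflow idea, reduced to its essentials: components share a simple document interface and a user composes just the sub-set of stages needed. In the service itself such components would be exposed through Web or Grid service interfaces rather than as in-process functions; the component behaviour shown here is invented.

```python
# A workflow is just an ordered list of components, each taking and returning
# a document dictionary; users pick the sub-set of stages they need.
# The components below are deliberately trivial, invented stand-ins.

def sentence_split(doc):
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc

def tokenise(doc):
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

def term_tag(doc):
    doc["terms"] = [w for s in doc["tokens"] for w in s if w[0].isupper()]
    return doc

def run(workflow, doc):
    for component in workflow:
        doc = component(doc)
    return doc

# One user stops after tokenisation; another runs the fuller pipeline.
shallow = [sentence_split, tokenise]
fuller = [sentence_split, tokenise, term_tag]

doc = {"text": "Interleukin-2 activates T cells. Aspirin inhibits COX."}
print(run(shallow, dict(doc))["tokens"])
print(run(fuller, dict(doc))["terms"])
```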

We must also not forget that there are many types of user of a national service: we must therefore support the expert bioinformatician who is conversant with construction and deployment of components as much as the user who is a domain expert but has no knowledge of, or interest in, how things work, yet a keen interest in getting appropriate results with modest effort in reasonable time. We shall thus also be working to develop environments that will guide users in the identification of appropriate components and resources, or indeed overall off-the-shelf workflows, to accomplish their text mining task. This may well involve an initial interaction with an environment to establish what the scope of the text mining task is, what kind of facts are being sought, what kind of associations should be looked for, and so on. Our partners from the University of Geneva have long experience in designing and applying 'quality in use' evaluation techniques to guide users in making appropriate choices of natural language processing tools to suit their needs, and they will be working closely with us in this area.

As discussed above, term management is a crucial activity in text mining and one that is not at all well handled by the majority of text mining systems. We will be working to render scalable the highly successful ATRACT (Automatic Term Recognition and Clustering for Terms) terminology management system of the University of Salford [10], which is based on a proven, language-independent hybrid statistical and linguistic approach, and to integrate it in text mining workflows.
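
ATRACT's hybrid approach couples linguistic filtering of candidate phrases with the C-value statistical measure of termhood. The sketch below shows only a simplified form of C-value scoring (the linguistic filter is omitted and the candidate phrases and their corpus frequencies are invented): longer, non-nested candidates score highly, while a candidate that mostly occurs nested inside longer terms has its score discounted.

```python
from math import log2

# Simplified C-value termhood scoring, the statistical half of a hybrid
# approach such as ATRACT's. Candidate phrases and frequencies are invented,
# and the linguistic filter that would normally produce them is omitted.
candidates = {
    "T cell": 60,
    "activated T cell": 25,
    "basal cell carcinoma": 12,
}

def c_value(term, freq, all_candidates):
    length = len(term.split())
    # frequencies of longer candidate terms in which this term occurs nested
    nesting = [f for t, f in all_candidates.items()
               if t != term and f" {term} " in f" {t} "]
    if not nesting:
        return log2(length) * freq
    return log2(length) * (freq - sum(nesting) / len(nesting))

for term, freq in candidates.items():
    print(f"{term}: {c_value(term, freq, candidates):.2f}")
# 'activated T cell' outscores 'T cell', whose raw frequency is discounted
# because many of its occurrences are nested inside the longer term.
```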

Ontology-based information extraction is currently in its infancy, at least insofar as sophisticated use of ontologies of events is concerned. Our development work here will focus on developing the University of Manchester's CAFETIERE (Conceptual Annotation of Events, Terms, Individual Entities and RElations) information extraction system to take full advantage of distributed and parallel computing, to render it scalable, and to augment its caching and data reuse capabilities. CAFETIERE is a rule-based analyser whose rules can access user ontologies of entities and events: facts are extracted from texts by looking for instances of entities that participate in ontological events. The onus of rule writing is much reduced by this approach, as the rule writer can write fewer and more generally applicable rules thanks to the ontology-building efforts of others: there is thus a direct, beneficial relationship between the world of ontology construction and the world of information extraction. CAFETIERE can, moreover, perform temporal information extraction, which is important not only for text mining of ephemera (newswires for competitive intelligence purposes) but also for any domain where there is volatility of terminology and of knowledge, as we see in the biobibliome: there is a need to anchor extracted facts temporally with respect to the terminology used and the state of knowledge at the time. This has a further bearing on curation of data over time and on the relationship between a future user's state of knowledge and terminological vocabulary and those of the archived texts being analysed. CAFETIERE is in fact a complex package of individual components, including tokenisers, part-of-speech taggers, named entity recognisers, etc. Each of these will be made available for separate use.
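
The sketch below conveys the general idea of ontology-driven extraction in a few lines: the rule refers to ontological classes rather than to particular names, so a single rule covers every entity of those classes. The ontology fragment, rule and sentence are invented, and this is emphatically not CAFETIERE's actual rule language.

```python
# Minimal sketch of ontology-driven fact extraction. The rule mentions
# ontological classes, not specific names, so one rule covers any entities
# belonging to those classes. All data below is invented.
ontology = {
    "IL-2": "Cytokine",
    "interferon-gamma": "Cytokine",
    "T cell": "CellType",
}

# Rule sketch: <Cytokine> "activates" <CellType>  =>  activation event.
# Tokens are assumed to arrive with entities already grouped by a prior
# named entity recognition step.
def extract_activation(tokens):
    facts = []
    for i, tok in enumerate(tokens):
        if tok == "activates" and 0 < i < len(tokens) - 1:
            agent, target = tokens[i - 1], tokens[i + 1]
            if ontology.get(agent) == "Cytokine" and ontology.get(target) == "CellType":
                facts.append(("activation", agent, target))
    return facts

print(extract_activation(["IL-2", "activates", "T cell"]))
# [('activation', 'IL-2', 'T cell')]
```

Because the rule is stated over classes, adding a new cytokine to the ontology extends extraction coverage without any change to the rule itself, which is the beneficial relationship between ontology construction and information extraction referred to above.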

The University of Liverpool and UCB have jointly developed a third-generation online IR system, Cheshire, based on national and international standards and in use by a wide variety of national services and projects throughout the UK. The software addresses the need to develop and implement the advanced networking technologies required to support digital library services and online learning environments. We will use Cheshire to harvest and index data using an advanced clustering technique which will enable items to be interlinked automatically and retrieved quickly. This will include Cheshire support as a cross-protocol data harvester and as a transformation engine operating in a distributed, highly parallel environment. Development work on Cheshire will concentrate on meeting the IR needs of text mining, with particular work on advanced indexing and retrieval, focusing on metadata, on improved index term weighting, on search interfaces, and on ontology management. A key development will be the use of the SKIDL (SDSC Knowledge and Information Discovery Lab) toolkit with Cheshire to enhance index term weighting approaches in an automatic text retrieval context, by combining Latent Semantic Analysis with probabilistic retrieval methods to yield salient text fragments as input for downstream information extraction components. The SKIDL data mining toolkit will be integrated not only to allow data mining over classic information extraction results, but also to associate biological entities, such as parsed genome data, with bioscientific texts and bibliographic data. The primary advantage is that Cheshire will be able to support hybrid text mining (e.g. from a journal and from textual representations of DNA) in a transparent and efficient manner.
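
As an indication of how Latent Semantic Analysis can feed retrieval, the sketch below builds a term-document matrix, takes a truncated SVD, and ranks documents by similarity to a query in the reduced space. The documents are invented, and this simplification says nothing about how Cheshire and SKIDL actually combine LSA with probabilistic retrieval methods.

```python
import numpy as np

# Toy latent semantic analysis: truncated SVD of a term-document matrix,
# then documents ranked by cosine similarity to a query in the reduced
# space. Documents are invented; this only illustrates the idea.
docs = [
    "Akt kinase phosphorylation signalling",
    "PKB kinase signalling pathway",
    "colorectal cancer screening trial",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
A = np.array([[d.lower().split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of latent dimensions kept
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the reduced space

def rank(query):
    q = np.array([query.lower().split().count(w) for w in vocab], dtype=float)
    q_red = U[:, :k].T @ q              # project the query into the same space
    sims = doc_vecs @ q_red / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_red) + 1e-12)
    return np.argsort(-sims)

print(rank("kinase signalling"))
# The two signalling abstracts rank above the unrelated third document.
```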

Data mining techniques have traditionally been used in domains that have structured data, such as customer relationship management in banking and retail. The focus of these techniques is the discovery of unknown but useful knowledge that is hidden within such data. Text mining extends this role to the semi-structured and unstructured world of textual documents. Text mining has been defined as 'the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources' [11]. Mining techniques are thus used to link together in a variety of ways the entities extracted from the IE activity. A number of approaches are possible: for example, clustering is an unsupervised technique that produces groupings of related entities based on similarity criteria; classification is a supervised technique that learns from instances, for example, of user-classified documents of different types to auto-classify unseen documents; and association rules enumerate the frequency of occurrences of sets of entities, and in particular can derive the likelihood of a document containing specific entities given that the document is known to contain another entity.
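
A minimal sketch of the association-rule idea applied to text mining output: each document is reduced to the set of entities extracted from it (invented here), pairs are counted across documents, and the confidence of a rule is the proportion of documents containing the antecedent that also contain the consequent.

```python
from itertools import combinations
from collections import Counter

# Toy association mining over extracted entities: each document is reduced
# to the set of entities found in it (invented here), and co-occurrence
# counts yield rules such as "documents mentioning BRCA1 also mention
# breast cancer".
doc_entities = [
    {"BRCA1", "breast cancer"},
    {"BRCA1", "breast cancer", "ovarian cancer"},
    {"BRCA1", "DNA repair"},
    {"aspirin", "inflammation"},
]

item_counts = Counter()
pair_counts = Counter()
for entities in doc_entities:
    item_counts.update(entities)
    pair_counts.update(frozenset(p) for p in combinations(sorted(entities), 2))

def confidence(antecedent, consequent):
    """P(consequent in doc | antecedent in doc)."""
    return pair_counts[frozenset({antecedent, consequent})] / item_counts[antecedent]

print(confidence("BRCA1", "breast cancer"))
# 0.67 -- two thirds of the BRCA1 documents also mention breast cancer
```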

Conclusions

The services provided by NaCTeM will not all be available instantly. It will be appreciated that the configuration and deployment of a range of scalable, efficient text mining services cannot happen overnight. Work is planned over 3 years, with increasing evolution towards full service capability. Initially, while development work is under way, we will be acting partially as a clearing house, catalogue and repository for third party, open source or GNU-licensed text mining tools, as a means of easily finding useful text mining tools and resources on the Web; and as an advice, consultancy and training centre. As our infrastructural text mining tools are developed, these will be released when appropriate for test purposes in order to gain feedback, before being fully deployed. At present, we are in the setting-up and requirements gathering phase. Throughout, close contacts will be established and maintained with the target user community, to ensure that needs and requirements are met, and that the range of possibilities for text mining is communicated and discussed in sufficient measure to inform the requirements gathering process. We also actively invite contact and discussion with potential users of text mining services from all other domains, as it is part of our remit to reach out to users in other areas in expectation and preparation of future evolution to serve their needs. Our events calendar testifies to the range of contacts we have had, presentations given and workshops attended thus far and we fully expect this activity to grow, given the high degree of interest that has been generated in the community in the centre's aims and activities.

References

  1. Blaschke, C., Hirschman, L. & Valencia, A. 2002. Information Extraction in Molecular Biology. Briefings in Bioinformatics, 3(2): 154-165.
  2. Bretonnel Cohen, K. and Hunter, L. (in press) Natural Language Processing and Systems Biology. In Dubitzky and Pereira (eds) Artificial intelligence methods and tools for systems biology. Springer Verlag.
  3. Hirschman, L., Park, J., Tsujii, J., Wong, L. & Wu, C. 2002. Accomplishments and Challenges in Literature Data Mining for Biology. Bioinformatics, vol. 18, no. 12, pp. 1553-1561.
  4. MEDLINE. 2004. National Library of Medicine. Available at: http://www.ncbi.nlm.nih.gov/PubMed/
  5. Krauthammer, M. & Nenadic, G. (2004) Term Identification in the Biomedical Literature, in Ananiadou, S., Friedman, C. & Tsujii, J. (eds) Special Issue on Named Entity Recognition in Biomedicine, Journal of Biomedical Informatics.
  6. Ananiadou, S., 2004: Challenges of term extraction in biomedical texts, available at: http://www.pdg.cnb.uam.es/BioLink/workshop_BioCreative_04/handout/
  7. Jacquemin, C., 2001: Spotting and Discovering Terms through NLP, MIT Press, Cambridge MA
  8. Ananiadou, S., Friedman, C. & Tsujii, J (eds) (2004) Named Entity Recognition in Biomedicine, Special Issue, Journal of Biomedical Informatics, vol. 37 (6)
  9. GENIA, 2004: GENIA resources, available at: http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/
  10. Mima, H., Ananiadou, S. & Nenadic, G. (2001) The ATRACT Workbench: Automatic term recognition and clustering for terms. In Matousek, V., Mautner, P., Moucek, R. and Tauser, K. (eds.) Text, Speech and Dialogue. Lecture Notes in Artificial Intelligence 2166. Springer Verlag, Heidelberg, 126-133.
  11. Hearst, M., 2003: What is Text Mining? http://www.sims.berkeley.edu/~hearst/text-mining.html, October 2003.

Author Details

Sophia Ananiadou
Reader, School of Computing, Science and Engineering
University of Salford

Email: S.Ananiadou@salford.ac.uk
Web site: http://www.cse.salford.ac.uk/nlp/

Julia Chruszcz
MIMAS
University of Manchester

Email: julia.chruszcz@man.ac.uk

John Keane
Professor, School of Informatics
University of Manchester

Email: john.keane@manchester.ac.uk
Web site: http://www.co.umist.ac.uk/research/group_dde.php

John McNaught
School of Informatics
University of Manchester

Email: john.mcnaught@manchester.ac.uk

Paul Watry
Projects Manager
Special Collections and Archives
University of Liverpool

Email: p.b.watry@liverpool.ac.uk
