W. Brian Whalley
University of Sheffield
E-books and E-content 2010: Data As Content
This meeting on 11 May 2010, chaired by Anthony Watkinson, was organised by the University College London Department of Information Studies. Some 40 people attended the 'e-book' conference, which had the specific title 'Data as Content'. Eight papers were presented, followed by a final panel question and answer session that explored some of the issues that had arisen during the day.
Unfortunately, the first billed presentation, by Matthew Day (Nature) on 'The role of publishers in data management, now and next', had to be cancelled. His talk was to have considered the role of academic publishers as bodies to publish and archive data as well as papers themselves. In its place, Geoffrey Bilder, Director of Strategic Initiatives, CrossRef, gave a talk entitled 'Data and text mining: the search for unknown knowns'. This used the 'Rumsfeldian unknown' to approach data mining and its associated, but not cognate, term 'text mining'. However, this was rather more than using the mining metaphor; it is not the data we are searching for but information that we (perhaps uniquely) require within the data-dirt. So this was what data mining is not: neither data nor information retrieval, nor extraction, nor indeed analysis, on their own. Information is something researchers have problems with, both in data extraction and in keeping up with the literature. How can we get computers to help? After these clarifications, Geoffrey went on to look at the Semantic Web and its role as a means of efficiently searching Web-based data for required information, and at why computers have problems in getting sense out of sentences whereas humans find it (relatively) easy to do so. We then had a brief review of metadata embedding and the creation of query tools, and moved on to the Resource Description Framework (RDF). Semantic Web (Web 3.0) technology provides an answer, and information can be usefully extracted from these RDF encodings by queries (using tools such as SPARQL). But how to make the 'researcher as author' more efficient as 'researcher as reader'? Geoffrey ended by pointing out for how little time the Internet age had been developing compared with print literature from Gutenberg onwards. The point, however, was that we currently treat text, a PDF say, as incunabula and that we are afraid to let go of that with which we are familiar. So, the final question posed was, 'why do we publish text?'
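For readers unfamiliar with the idea behind RDF and SPARQL, the following is a minimal sketch in Python of how statements can be held as subject-predicate-object triples and queried by pattern. It is not taken from the talk and it does not use a real RDF library or real SPARQL syntax; the predicate names (`title`, `publishedIn`, `year`, `type`) and the function names are assumptions made purely for illustration.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# The predicate names are illustrative, not taken from a real vocabulary.
TRIPLES = [
    ("doi:10.1038/461145a", "title", "Data's shameful neglect"),
    ("doi:10.1038/461145a", "publishedIn", "Nature"),
    ("doi:10.1038/461145a", "year", "2009"),
    ("Nature", "type", "Journal"),
]


def match(pattern, triples=TRIPLES):
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


def titles_published_in(journal, triples=TRIPLES):
    """Join two patterns: find items published in `journal`, then their titles."""
    titles = []
    for subject, _, _ in match((None, "publishedIn", journal), triples):
        for _, _, title in match((subject, "title", None), triples):
            titles.append(title)
    return titles


if __name__ == "__main__":
    print(titles_published_in("Nature"))
```

The point of the sketch is that once text is reduced to explicit statements of this kind, a machine can answer a question such as 'what has been published in Nature?' by matching patterns rather than by trying to parse sentences.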
The next paper was presented by Simon Hodson, Programme Manager, JISC: 'Research data as content: Challenges and opportunities for UK HE'. The main gist of his presentation was the role of the universities, especially the research-oriented institutions, in the investment in research data and knowledge production – the value of data as content. This is part of a shift towards including data as a prominent part of scholarly output. These general notions mean that there must be specific responses from individual institutions. There are tools now available demonstrating the benefits of this approach. The importance of skills development was outlined and Simon illustrated JISC involvement with reference to, for example, the Peg-Board Project (Bristol) for palaeoclimatological data. The 'data deluge' is evident in this area, as in so many in collaborative science. The significant Nature paper 'Data's shameful neglect' was discussed, along with how data management should be woven into all science courses. There was also a mention of the various models for data publication and charging. This is a complex area and one that is hardly recognised as necessary in most institutions. The final suggestion was that there might, ultimately, be a general institutional statement about data publication.
Michael Jubb, Director, Research Information Network (RIN), gave a talk entitled, 'To share or not to share? Researchers' perspectives on managing and sharing data'. RIN is a research and policy unit funded by the four Higher Education funding councils, the three national libraries and the seven research councils to investigate ways in which UK researchers can make use of information resources and, for example, how they are (or are not) looking after data. His approach was bottom-up and he pointed out that many researchers were unaware of the need for this approach. Michael looked at a list of verbs associated with data in the research process (gather, … evaluate … manage … present … disseminate) but pointed out that data are acquired differently in diverse disciplines, and he produced a cognitive map of how one team of researchers set about resolving the process. But there are complex factors involved: esteem, competition and altruism are as important as size of dataset and ethics. Some researchers do not trust others' data, especially where teams are involved or the data are complex. Furthermore, there is a dearth of people who know how to handle datasets. These complexities require leadership and co-ordination, top-down as well as bottom-up. Not least, incentives must be provided for researchers. This was an interesting view of the intricate personal as well as institutional relationships involved in sharing data. There are clearly many questions to be asked, perhaps of each generation of researcher.
The contribution by Helle Lauridsen, Discovery Services, Serials Solutions, added to this complexity. 'Metadata management and webscale searching making (all kinds of) data discoverable' addressed further aspects of the user experience. Libraries are still physical and thought of as storage for books. Does the 'Net generation' realise how best to use data and make data discoverable? They expect all search systems to behave like Google and know only a small pool of quality resources. (Unfortunately, tutors' views of data and information are often little more advanced than their students'.) Helle then explored some of these problems from the user's standpoint by looking at difficulties with gateways. The paper by Tenopir et al. is a way to look at the 'value gap' between the perceived value of a library as a gateway and the amount spent on materials. Can the Web help? Yes, but Google Scholar is not delimited enough: you get everything. So what about better indexing of metadata? Fine, if the metadata are there. Webscale discovery is a development by Serials Solutions: the ability to search within articles etc and the merging of metadata, including data in the cloud around the article. However, the data alone may not be that significant since they may need to be filtered by peer review. There are different approaches to adopt, for example, via Link Resolver (National Library for Health/ Health Information Resources) and other discovery tools. Library use is changing, but how fast will this be for the majority of researchers? This was another thought-provoking paper reporting developing ideas of meaningful searching through the data sea.
Adam Farquhar, Head of Digital Technology, British Library (BL), presented a paper, 'DataCite', a review of the progress of this project, which is designed to make access to scientific data easier. This is a new international data citation initiative which aims to make it easier to gain access to scientific research data on the Internet, to increase the acceptance of research data as a contribution to the scientific record and to support data archiving. The talk reviewed Digital Object Identifiers (DOIs) and their effectiveness, including the costs of visibility (DOI registration and searching is cheap compared with data harvesting and production). Adam reported on the UK pilot initiative and the US pilot (Dryad), with an example from Science Direct, by following links through a small database of DOIs. This was an interesting look into a near-future practical application.
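As a small illustration of why DOIs are cheap to use, the sketch below (in Python, and not part of the presentation) turns a DOI into a resolvable link. It assumes the public resolver at https://doi.org/, which redirects to wherever the identified object currently lives; the function name `doi_to_url` and its basic sanity check are hypothetical. The DOI used in the example is that of the Nature editorial cited in this report.

```python
from urllib.parse import quote


def doi_to_url(doi, resolver="https://doi.org/"):
    """Build a resolver link for a DOI, accepting an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Every DOI starts with the directory indicator '10.' and has a
    # prefix/suffix pair separated by a slash; this is only a sanity check.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping '/'.
    return resolver + quote(doi, safe="/")


if __name__ == "__main__":
    print(doi_to_url("doi:10.1038/461145a"))
```

The identifier itself never changes; only the resolver's record of the target location does, which is what makes a DOI suitable for citing a dataset as well as a paper.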
After these papers came three presentations which illustrated JISC-funded initiatives and projects.
James Wilson, University of Oxford, Project Manager of SUDAMIH, gave a talk entitled 'The SUDAMIH Project: Developing Data Infrastructure for the Humanities – requirements and attitudes'. Research in the humanities tends to produce different problems of data production, storage and use from those in the sciences. This is not only due to its 'life's work' approach but also because humanities researchers conceive of data differently from scientists. The project is concerned with 'Supporting Data Management Infrastructure in the Humanities', exploring why there are these differences and what they mean for data management as well as for current and future practices in universities. This includes use, sharing and attitudes towards data, which are as valuable in the humanities as elsewhere. The project shows how current practices can be improved, especially through the development of institutional practices. The two main outputs are the 'Database as a Service' system and the development of training modules. The latter may prove hard to promote, especially as most people are inundated with opportunities to attend conferences and workshops. One important point made here was the value of encouraging graduate students to attend workshops a few months after starting on their research.
June Finch, Project Manager, MaDAM, University of Manchester, presented a talk entitled 'MaDAM – A Data Management Pilot for Biomedical Researchers at the University of Manchester'. This is a first step in the introduction of a university-wide data management infrastructure and service. June explained the local circumstances and the development of user-driven pilot groups. The involvement of users was considered essential and the embedding of needs and working practices was to the fore, as was the sustainability of the project's results, the latter being evaluated with a cost-benefit analysis and financial models. The key challenge was managing the expectations of users and the culture change. The University has a Fedora-based repository (eScholar) dealing with theses etc, but not data. Data management practice varies widely across the University and a main issue was to standardise a structure and to take into account the existing 'Manchester Informatics' system. The talk also showed ideas that were being developed for bringing data directly from instrumentation to a Fedora Commons storage and management system. When this project finishes (summer 2011), its findings should have consequences for how other institutions manage their data.
Finally, Kenji Takeda, Senior Lecturer in Aeronautics, University of Southampton, talked about the way in which Southampton was developing its existing facilities (such as the well known Eprints and National Crystallography Centre). A team of people across departments (Chemistry, Engineering and Archaeology as well as the Library and Information Services) was involved in this project and Kenji presented this as an 'Institutional Data Management Blueprint'. The aim of this project is to create a practical (and attainable) institutional framework for managing research data. As at Manchester and Oxford, surveys were undertaken to look at existing infrastructures and the need to bring together aspects of this ambitious project. They looked at research 'stuff', from PostIts to high-volume data output, and incorporated aspects of archiving and preservation. Results of these surveys were illuminating in themselves, as was the range of researchers' responses, from aspirations ('seamless integration from papers to source data', 'e-lab notebooks' etc) to frustrations ('metadata', 'responsibility and ownership', 'Freedom of Information'). The use of data systems was outlined on a graph of 'degree of structure' plotted against 'ease of sharing', which showed the diversity and lack of integration across the University. The projected structure was outlined and again a key conclusion was that a two-pronged approach was required: bottom-up to augment the researcher's world and top-down to provide support and guidance. The other conclusion, that 'good data management is vital for better research', summarised the thrust of the meeting as a whole.
The title 'Data as content' sounds a somewhat dry topic, perhaps best fitted for 'info-geeks'. Not so; this was an interesting day with excellent speakers presenting a range of current investigations and glimpses of the future. I hope the summaries above have provided something to whet your appetite for finding out more. Certainly Higher Education institutions in the UK are going to have to get to grips with the issues discussed at this meeting. I am pleased to report that multimedia recordings of the E-books and E-content 2010 Conference are now available. You will find good video streaming of the presentations as well as downloads (Flash, MOV, OGG and MP3 formats). There are exciting developments out there for academic as well as information researchers; watch out for these developing tools in the future. Perhaps, less explicitly, there are challenges for institutions and Research Councils and other grant-awarding bodies as well as universities. Some of these challenges and issues may well be continued in another UCL initiative, the Fourth Bloomsbury Conference on E-publishing and e-Publications (24-25 June 2010).
- PEG-BOARD Project Site
- Editorial. (2009) Data's shameful neglect. Nature 461. doi:10.1038/461145a.
- Tenopir, C., King, D. W., Edwards, S., & Wu, L. (2009) Electronic journals and changes in scholarly article seeking and reading patterns. Perspectives, 61: pp. 5-32.
- DataCite http://www.datacite.org/
- Multimedia recordings of 'E-books and E-content 2010' conference held 11 May 2010, River Valley TV
- Fourth Bloomsbury Conference on E-publishing and e-Publications (24-25 June 2010)