CODATA was formed by the International Council for Science (ICSU) in 1966 to co-ordinate and harmonise the use of data in science and technology. One of its very earliest decisions was to hold a conference every two years at which new developments could be reported. The first conference was held in Germany in 1968, and over the following years it would be held in 15 different countries across 4 continents. My colleague Monica Duke and I attended the most recent conference in Taipei both to represent the Digital Curation Centre – CODATA's national member for the UK – and to participate in a track of talks on data publication and citation.
The CODATA Conference is always busy with satellite meetings and this year’s was no exception. On the afternoon of Sunday 28 October, effectively day zero of the conference, I popped my head around the door of the World Data System Members Forum; I couldn't actually get in as the room was packed. I wanted to catch the demonstration of GeoMapApp. This is a Java-based data visualisation tool developed by the Marine Geoscience Data System, which allows users to overlay datasets on a selection of different base maps and grids. The tool has a good set of options for changing how the data are presented, and includes atmospheric and geophysical data as well as the oceanographic data one would expect.
A little later the conference hosts held a welcoming ceremony and reception. Among the speakers was San-cheng Chang, a minister without portfolio within the Taiwanese Executive Yuan. He was pleased to report that the government had launched an official open data activity the previous week, starting with real estate information. Unfortunately, it had only provided an end-user interface to the database, not direct access to the data themselves, so the site crashed within an hour of launch as estate agents harvested the data through screen scraping. San-cheng is working on getting these and other data opened up fully, so that Taiwan, despite its slow start, might catch up with the countries leading the open data movement.
Another notable feature of the welcoming ceremony was a half-hour performance by the Chai Found Music Workshop. We were treated to a sextet of traditional Sizhu instruments playing a selection of Taiwanese pieces and, bizarrely, Loch Lomond.
The conference proper began first thing Monday morning with addresses from Ray Harris, Yuan-Tseh Lee, Huadong Guo and Der-Tsai Lee, respectively the Chair of the conference’s Scientific Program Committee, the Presidents of ICSU and CODATA, and the Chair of the conference’s Local Organization Committee. These were followed by three keynote presentations.
There is no such thing as a natural disaster. This was the starting point of Sálvano Briceño’s explanation of how data can be used to mitigate the effects of earthquakes, hurricanes and other such hazards. The number of natural disasters recorded has risen steadily from about 400 in 1980 to about 1000 in 2010, but this is more to do with people building (poorly) in hazardous areas than any decrease in the hospitality of the natural world.
The underlying issues that need to be tackled are social vulnerability and unsustainable development. There are in fact three activities currently addressing these issues, all due to deliver policy decisions in 2015: the International Strategy for Disaster Reduction, the UN Millennium Development Goals and the ongoing international climate change negotiations. A common theme in all three is the need to base their decisions on sound scientific evidence, derived from data on natural hazards, disaster losses and vulnerability. There are copious data on the first of these, a fair amount on the second (though not always accurate) but very little on the third. The Integrated Research on Disaster Risk (IRDR) programme has therefore set up a series of Working Groups to improve both the quality and quantity of such data. The Disaster Loss Data Working Group has been working closely with CODATA, and a CODATA Task Group on Linked Open Data for Global Disaster Risk Research was to be proposed at this year's General Assembly.
Europeana and the Digital Public Library of America are well-known efforts to provide access to cultural heritage artefacts via digitisation but, as Der-Tsai Lee was pleased to point out, pre-dating them both was the Taiwan e-Learning and Digital Archives Program (TELDAP). Work began on TELDAP in 2002, and since then it has built a collection not only from libraries, museums, universities and theatres across Taiwan, but also from 116 institutions across 20 countries and through direct engagement with the public.
Around 6000 Moodle courses have been compiled from TELDAP holdings, as well as some manga and anime educational resources. The Program has developed extensive experience in digitisation and related techniques, such as optical character recognition for Chinese glyphs, repairing video frames using adjacent ones, colour palette extraction, and so on. TELDAP is starting to make its data available as linked open data, and has already placed much of its material into the public domain.
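TELDAP's talk didn't go into the detail of these algorithms, but to give a flavour of the simplest of them, colour palette extraction can be sketched in a few lines: coarsely quantise each pixel so near-identical shades fall into the same bucket, then keep the most frequent buckets. Everything below (the function name, the quantisation step, the toy image) is an illustrative assumption, not TELDAP's actual method.

```python
from collections import Counter

def extract_palette(pixels, n_colours=5, quantise=32):
    """Return the n most common colours in an image, coarsely quantised.

    pixels: an iterable of (r, g, b) tuples.
    quantise: bucket width per channel, so near-identical shades
    count as the same colour.
    """
    counts = Counter(
        tuple((channel // quantise) * quantise for channel in px)
        for px in pixels
    )
    return [colour for colour, _ in counts.most_common(n_colours)]

# A tiny hypothetical "image" dominated by reds, then greens.
pixels = [(250, 10, 10)] * 6 + [(10, 250, 10)] * 3 + [(10, 10, 250)]
print(extract_palette(pixels, n_colours=2))  # → [(224, 0, 0), (0, 224, 0)]
```

A production pipeline would more likely cluster in a perceptual colour space rather than bucket raw RGB values, but the counting idea is the same.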
Hesheng Chen drew on his experience in Particle Physics to illustrate a talk on the challenges of big data. Spiralling data volumes are outstripping advances in storage and computational power, to the extent that in some cases there is not enough storage capacity for the incoming data stream, let alone the legacy data. Data-intensive research has gone through a couple of changes already to cope with the increasing volume, complexity, rate of acquisition and variability in format of data: moving first to local clusters and then to the Grid. The Grid provides scientists with an easy medium for sharing data, but scheduling jobs efficiently remains a hard task, and it is hampered by a diversity of service models and legacy programs. In High Energy Physics, the community is looking at using Cloud technologies to overcome these issues.
In order to cope with the data from the Large Hadron Collider, CERN's AliEn Grid framework is using Cloud technologies, and using the Extensible Messaging and Presence Protocol (XMPP) as a coordination mechanism. Similarly, the Institute of High Energy Physics at the Chinese Academy of Sciences is using a Cloud system to handle the data from Beijing Spectrometer II (BESII), and in 2010 started using the Berkeley Open Infrastructure for Network Computing (BOINC) to distribute computational tasks among volunteers, under the project name CAS@Home.
After a break, the conference split into two tracks, one on data sharing and the other on environmental and disaster data; I went to the former.
Puneet Kishor gave an overview of the forthcoming version 4 of the suite of Creative Commons (CC) licences. The aim for the new version is to broaden the range of jurisdictions and applications for which the licences can be used, to make them more interoperable with other licences, and to make them fit for the next ten years, all without alienating existing users.
The licences will include provision for sui generis database rights, so they will be suitable for use with data in Europe, for example. There will be greater clarity on the point that rights rather than works are being licensed. The way the attribution requirement is worded in version 3 licences makes combining a large number of CC-licensed works unwieldy, so the new licences will explicitly permit a scalable form of attribution (the current draft wording specifies a hyperlinked credits page). Licensors will be able to add terms that assure users the work satisfies certain quality standards, something that will be very useful in the data context. Lastly, the version 4 licences will include a mechanism for users to remedy a situation in which they forfeit their rights due to a breach of licence terms. Outstanding issues include how to handle text and data mining, and whether or not the licences should be ported to particular jurisdictions.
The version 4 CC licences are expected to launch towards the end of 2012 or early in 2013.
Charlotte Lee reported on two sociological studies of scientific collaboration, performed in the context of Computer-Supported Cooperative Work, a branch of Human–Computer Interaction studies.
The first study looked at the sustainability of cyberinfrastructure middleware in the face of changing funding profiles. It found that the tools with greatest longevity were the ones that could be ported to many different domains and use cases, as this meant developers could find new funding streams when their existing ones dried up.
The second looked at how cancer epidemiologists understood how to use variables from trusted legacy datasets. There was broad acceptance that full documentation is too time-consuming to write, and that researchers would normally have to ask the original team to determine, for example, whether ‘1 month’ means ‘up to 1 month’ or ‘at least 1 month but less than 2’.
Both studies showed that there is no clear division between social and technical issues.
John Broome reported on the Canadian Research Data Summit held in September 2011 by the (Canadian government's) Research Data Strategy Working Group. The purpose of the summit was both to develop a national approach for managing Canada’s research data and to develop a business case for this activity that could be communicated to decision makers. The summit recommended the creation of a body (tentatively called ‘Research Data Canada’) to coordinate research data management nationally and liaise internationally, steered by an Advisory Council made up of government, academic and industry leaders. The summit also drafted a national strategy that tackled issues of building skill sets and capacity in the community, securing sustained funding and improving research data infrastructure. On the latter point, the Working Group is already piloting a set of tools using data from the International Polar Year as a test bed.
The Survey Research Data Archive (SRDA) at Academia Sinica is the largest survey archive in Taiwan and one of the largest in Asia. Ruoh-rong Yu took us through the lifecycle of a data holding, from acquisition, through cleaning and anonymisation, to access. One of three levels of access is used, depending on the sensitivity of the data: public, members only and restricted (i.e. on-site or VPN access only).
Various improvements are planned for the online interface: better search and browse facilities, built-in statistical analysis and data merge tools, better information security management, and an English-language version. The archive also provides training on how to reuse data.
Following lunch were three blocks of five parallel sessions. Seven of the sessions continued the theme of environmental and disaster data in various ways, but there were also sessions on health data, materials data, open knowledge environments, microbiological research, access to data in developing countries, and a miscellany of topics presented by early career scientists. With so much choice it was difficult to know which sessions to pick; in the end I plumped for materials data and open knowledge environments.
[Image: Artwork on the campus of Academia Sinica]
Yibin Xu gave an overview of AtomWork, an inorganic materials database developed by the (Japanese) National Institute for Materials Science. It contains phase diagrams, crystal structures, diffraction patterns and property data for over 80,000 inorganic materials, harvested from published literature. It is possible to view the data not only at the level of materials, but also of substances, compounds and chemical systems, and it has some good visualisation capabilities.
While the database contains a useful quantity of data it is by no means comprehensive. Yibin identified several things currently lacking that would make data collection easier: a standard data format for materials data, a scientifically-based identifier system and incentives for researchers to share data.
In June 2011, President Obama launched the Materials Genome Initiative (MGI) in the United States with the ambitious goal of halving the time and cost required to develop advanced materials. It plans to do this by developing a Materials Innovation Infrastructure (i.e. integrated computational, experimental and data informatics tools), by encouraging scientists to concentrate on a few key areas relevant to national goals (national security, health and welfare, clean energy systems), and by providing training.
Laura Bartolo gave an overview of progress so far and flagged up some important issues being faced by the MGI. NIST is building three repositories: one for files relating to first principles calculations, one for data used with the CALPHAD (Calculation of Phase Diagrams) method for predicting thermodynamic properties, and one hosting contextual information supporting the other two. Already it is clear that for data to be integrated reliably, universal identifiers and standard ontologies, data models and formats will be needed.
John Rumble explained that while people have been trying to develop exchange standards for materials data since the mid-1980s, none have gained traction because the task is so complicated. There are the microstructure, macrostructure and macroscopic behaviour to consider, the fact that slight changes to manufacturing methods can alter all of these things, and that complex materials can have a variety of structures and behaviours all at once.
There are, however, established standards for testing materials and presenting the results. In 2009–2010, the European Committee for Standardization (CEN) ran a workshop (i.e. project) to create a data format based on one of these standards, and is now running a follow-on workshop to extend this format to other testing standards. This latter workshop is also testing whether the format works as an exchange standard using existing systems and processes.
The Asia Pacific climate is not very kind to paintings. Jane Hunter described a collaboration between the University of Queensland and the Asia Pacific Twentieth Century Conservation Art Research Network to develop a decision support tool for art conservators. Semi-automated techniques are used to harvest information from paint conservation literature and convert it to linked open data. The resulting knowledge base uses the CIDOC CRM, the in-house Ontology of Paintings and PReservation of Art (OPPRA) and materials ontologies. A Web interface has been written that supports the capture, storage, and discovery of empirical and characterisation data, in a manner comparable with electronic laboratory notebooks or laboratory management systems.
In the mid-2000s, the Korean government adopted a policy to promote the Korean materials industry, and one result of this was the launch of the Materials Bank Project in 2007. As Young-Mok Rhyim explained, the idea was to create the world's leading materials databank, primarily in support of domestic industry. The Materials Bank is divided into three parts: Ceramics Bank, Chemicals Materials Bank and Metals Bank, the latter of which was the focus of this presentation. It contains information on various metal samples, their chemistry and manufacturing provenance, and their microstructure, macroscopic properties and so on. While a good deal of the information comes from scanning published literature, more and more of it is being collected specifically for inclusion in Metals Bank. Currently 326 companies are cooperating in the project.
Paul Uhlir began his introduction to Open Knowledge Environments (OKEs) by setting them in the context of scholarly communications on the Web, and arguing that openness rather than restriction should be the default culture. OKEs are open platforms for scholarly communications hosted by universities or other non-commercial bodies, typically hosting content such as open access journals, grey literature, data, open educational resources and open source software. They also typically provide interactive functions such as wikis, blogs, forums and post-publication reviews that enhance and build on the hosted content. Examples include microBEnet and CAMERA.
OKEs are not easy or cheap to set up. They represent a significant cultural shift away from print publication, and require long-term financial commitments on the part of the hosts. Among the challenges are how to protect privacy, confidentiality, and genuine commercial advantage, and how to accommodate alternative models of openness such as members-only content and embargoes.
Bill Anderson outlined the technical requirements for an OKE, which he saw as similar to those for an Open Archival Information System. Functions include input, process, output, system management, resources and strategic vision. Managing datasets requires more resources than managing documents, as one needs more contextual information to make sense of a dataset.
More specifically, there are several facilities that an OKE should provide: some form of open review system, both pre- and post-publication, or perhaps some less formal discussion mechanisms; visualisation tools for data; a recommendation service for guiding users to relevant content; day-to-day curation of content; and long-term preservation. Curation is particularly important if wikis are used, as they can become unmanageable and messy without it. Anderson also proposed some formal methods for measuring the openness of an OKE.
In conclusion, Anderson looked at some off-the-shelf software solutions that might be used to build an OKE, suggesting on the one hand the various offerings from the Public Knowledge Project (Open Journal Systems, Open Conference Systems, Open Monograph Press, etc.) and on the other the University of Prince Edward Island's Islandora system.
Robert Lancashire gave the first in a series of four presentations on contexts in which OKEs are having an impact. The Caribbean region was for a long time isolated from the global network, but now fibre optic cables have been installed and there should be widespread Internet access by 2013. Since 2007 the Caribbean Knowledge and Learning Network has been building C@ribNET, an academic network similar to the UK’s Joint Academic Network (JANET) linking research and educational institutions across 20 Caribbean nations. It is opening up all sorts of exciting possibilities: virtual classrooms allowing students to be taught by a wider variety of lecturers, eLearning facilities, more efficient university administration and, of course, OKEs.
Liping Di introduced us to GeoBrain, the first operational geospatial Web service system, which launched in 2005. It is a collection of portals for capturing, discovering, preserving and disseminating NASA’s earth observation satellite data and information. The thinking behind it is that most users do not need access to the raw data (e.g. daily precipitation levels), but rather the high-level knowledge derived from them (e.g. whether a region is suffering from drought); deriving this knowledge, however, requires extensive processing and expert attention. GeoBrain is an OKE that automates the derivation process using peer-reviewed and validated data processing models and workflows that are deployed on demand. One of the pleasing aspects of the system is that a certain element of expertise can be encoded once and reused many times with different data, meaning each contribution represents a significant improvement to the system.
The Digital Lin Chao Geomuseum is a joint venture of CODATA, the International Geographical Union and the Geographical Society of China. It was established in 2011 as an OKE for geography, particularly emphasising how geographical discoveries and events have impacted on culture and the arts. Chuang Liu demonstrated this by exhibiting some of the holdings collected since May 2012 on the exploration of Tibet and the Himalayas. These mainly consisted of stamps and postcards from around the world, all of which have been digitised and made available online.
Tyng-Ruey Chuang looked at three Taiwanese projects that might be considered OKEs. Ezgo is a computing environment aimed at schools, based on Kubuntu Linux and provided as a bootable DVD or USB stick. PeoPo (People Post) is a citizen journalism site hosted by Taiwan Public Television Service; all content is made available under a Creative Commons licence. OpenFoundry is a source code repository similar to SourceForge and GitHub, but it also organises workshops that promote and build capabilities for open source software development, publishes news, and provides guidance on licensing and other legal matters relevant to software development. All three have a philosophy of openness, and they each produce and disseminate knowledge and encourage public participation, so they seem to fit the OKE definition. None of them, however, has a particularly sound financial footing.
The first day concluded with a plenary session consisting of three substantive items.
The first was the presentation of the CODATA Prize for outstanding achievement in the world of scientific and technical data. This went to Michael F. Goodchild for his work in Geographic Information Science. He wasn’t able to be there in person, but he joined us on Skype and the hosts played back a presentation he'd recorded for us earlier. The thrust of the talk was that the central challenge of Geographic Information Science is reducing rather complicated data to a simple two-dimensional visualisation. He demonstrated this in various ways. He showed how Geospatial Information Systems frequently exhibit a mismatch of accuracy and precision, for example by reporting the distance between two capital cities to the nearest micron. He pointed out that putting a scale on a national or international map is highly misleading due to the distortions introduced by the flattening process. Lastly, he noted how geographical names and political boundaries are, of course, social constructs and vary according to the intended audience.
The second item was the presentation of the Sangster Prize for the best contribution to CODATA from a Canadian student or recent graduate. This went to Daniel Ricard of Dalhousie University. His acceptance speech was notable for the extraordinary claim that, if one works to become the local expert on metadata vocabularies, one might seem boring at first but will end up being considered the sexiest person in the office. He also argued, cogently enough, that CODATA should fund another, similar prize open to students outside Canada.
The final item of business was the official launch of the book CODATA @ 40 Years: The Story of the ICSU Committee on Data for Science and Technology (CODATA) from 1966 to 2010 by David R. Lide and Gordon H. Wood. One of the authors, I forgot to note which, told us about the motivation for the book, the process of researching it, and how the book was structured. When the day’s proceedings drew to a close after an exhausting 11¼ hours, we were all rewarded with a free copy of the book and a surprise second reception.
The following day kicked off with another set of five parallel tracks, continuing with themes of disaster data and microbiology and adding topics related to big data and national policy. I, however, attended the first of three sessions on data publication and citation.
Michael Diepenbroek gave the introductory overview for the track, taking as his theme the vital role of research data in scholarly publishing. He started by arguing that without the underlying data, journal articles no longer fulfil the roles they are supposed to play as the record and evidence base of empirical science. He went on to enumerate the various requirements for data publication (e.g. quality assurance procedures, standards for interoperability, citability, identification, persistent access), sets of principles that have been defined (e.g. Berlin Declaration on Open Access, Bermuda Principles), and some major players in this space (e.g. the ICSU World Data System, DataCite, ORCID, Thomson Reuters Data Citation Index).
Diepenbroek concluded by emphasising the importance of archiving, of archives being trustworthy, and of archives working closely with publishers and libraries so that data archiving becomes a natural component of the scholarly publication workflow.
David Carlson gave his perspective as editor of Earth System Science Data (ESSD), a data journal that has its roots in the International Polar Year 2007/8. ESSD publishes papers that describe datasets and the way in which they were collected, but do not draw scientific conclusions from them. The quality assurance process is as follows. Each submission is assigned to a topical editor, who decides if the paper is in scope and resolves any issues with how it is written. If the paper is suitable it is published in ESSD Discussions and undergoes simultaneous review by allocated referees and the scientific community at large. The authors respond to the comments and submit a revised paper, which may undergo a further round of peer review before being published in ESSD or rejected.
Papers are judged on the accessibility of the data (they should be lodged with a data centre and assigned their own DOI), the quality of both the data and the paper, and the significance (uniqueness, usefulness, completeness) of the data. ESSD occasionally publishes themed issues or joint issues. With the latter, ESSD teams up with a regular journal, so that every science paper in the regular journal has a parallel data paper in ESSD. Joint issues are particularly slow to produce, and there are sometimes conflicts where the science paper is good but the data paper is poor or vice versa.
Carlson is pleased with the popularity of the journal, and understands researchers are now including publication in ESSD as part of their data management plans.
IJsbrand Jan Aalbersberg talked about Elsevier’s Article of the Future project, which aims to provide a richer online publication medium for journal papers. In particular, it seeks to integrate summary and full data into the article text.
In his demonstration, Aalbersberg showed an article in which the text was presented in a fairly traditional manner in a central column. The left sidebar contained a table of contents which included previews of the tables and figures. The contents of the right sidebar could be varied between options such as the article metadata, a detailed view of a selected reference, a second copy of a figure or table from the main text, or interactive visualisations and information about a dataset, structure or protein mentioned in the text. Some of the figures were dynamic, with phylogenetic trees that could be zoomed, geospatial data presented using Google Maps, or graphs that revealed the underlying data when the points were clicked.
Such articles can be linked at the article level to holdings in data repositories. Elsevier also invites data repositories to provide widgets that display a visualisation of an archived dataset and, perhaps, show related datasets from their holdings.
Fiona Murphy gave a broader publisher perspective, recounting how scientific, technical and medical (STM) publishers have worked since 2007 on improving the handling of data in their publications. Examples included participation in projects such as PARSE.Insight, APARSEN, and ODE, a joint statement with DataCite, forming an STM Research Data Group and joining the board of the Dryad data repository.
Wiley-Blackwell worked with the Royal Meteorological Society in the UK to launch the Geoscience Data Journal (GDJ), following the involvement of the latter organisation in the JISC-funded OJIMS (Overlay Journal Infrastructure for Meteorological Sciences) project. As with ESSD mentioned above, it is a data journal that aims to give academics credit for publishing data, while ensuring data are archived and reviewed for quality. In GDJ papers, the associated dataset appears both in the reference list and in a prominent position at the start of the main text.
Finally, Murphy discussed her involvement with the PREPARDE (Peer REview for Publication and Accreditation of Research data in the Earth sciences) Project, which is using the GDJ as a test bed for developing workflows, policies and procedures for the whole data publication life cycle.
The final talk in this session was given by Wim Hugo, who gave the data centre perspective. To be effective, datasets have to be discoverable, understandable (even at scale), preserved and usable by both humans and automated systems. The main challenges for data centres are securing long-term funding, handling low quality data submissions, integrating seamlessly with the mainstream publishing industry and building skills and enthusiasm among potential data providers. Hugo offered some possible solutions.
Research grants could and should allocate funds specifically for preservation. Around US$600bn is spent annually on research; allocating 3% of this to preservation would raise around US$18bn, which is only slightly less than the size of the academic publishing industry. Furthermore, research proposals should be judged (partly) on the strength of their data management plans and the data publication record of the applicants. Published data should be recognised in scholarly rankings, and honours-level degrees should include training on data management and informatics.
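Hugo's back-of-envelope figures do check out. As a quick sanity check, using the numbers as quoted in the talk:

```python
# Sanity check of the figures quoted in Wim Hugo's talk.
annual_research_spend_bn = 600   # ~US$600bn spent on research annually
preservation_share = 0.03        # proposed 3% allocation to preservation
preservation_fund_bn = annual_research_spend_bn * preservation_share
print(f"US${preservation_fund_bn:.0f}bn")  # → US$18bn
```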
Mandating the publication of publicly funded data would certainly help to create a more complete scientific record, but would initially entail a flood of low-quality submissions, and would also introduce disruptive competition between data centres (competing for preservation funding) and publishers (competing for data to publish).
Some outstanding questions need to be answered. When should access to datasets be withdrawn or altered? How different do derived works need to be in order to be considered unique? How can licensed data be properly embargoed? How can potential users of a dataset determine if it will be useful without downloading it in full? How can traditional metadata records for datasets be enhanced with useful citation and quality metadata?
In his keynote, Geoffrey Boulton argued that open data matter and suggested how they might be made a reality. The purpose of the scientific journal, as envisioned by Henry Oldenburg, founding editor of the Philosophical Transactions of the Royal Society, was to publicise discoveries as they happened, in an easily understood fashion, so others could scrutinise and build on them. Until relatively recently, datasets were of a scale that print publications could cope with; Boulton himself remembers writing papers based on 12 data points. This is certainly no longer the case, and it is precipitating a loss of confidence in scientific research – scepticism over climate change evidence is a case in point.
The antidote is open data, as they discourage fraud, satisfy citizens’ demands for evidence and open up possibilities for new ways of working, such as citizen science. Routine sharing of data would help eliminate the suppression of negative (but nevertheless useful) results, and allow us to identify unexpected patterns. But simply putting raw data on the Web is not enough: the data have to be ‘intelligently open’: they have to be accessible, intelligible, assessable and reusable.
There are legitimate reasons for keeping data closed – commercial interests, personal privacy, public safety and security – but one should never assume these reasons automatically ‘win’: they always need to be balanced against the potential benefits of openness.
There is some way to go before this vision of openness can be achieved. There needs to be a cultural shift away from seeing publicly funded research as a private preserve. Data management needs to be embedded in the business of doing science, data must be intelligently published in parallel with papers, and so on. Publishers should open up their data holdings for data mining, governments should seek to improve the efficiency of science, and professional societies should be working to enact change in their communities.
Ovid Tzeng shared some insights from the world of psycholinguistics and cognitive neuroscience. Human language is far more advanced than the communication systems of other animals, and interconnects us in a way little else can. Tzeng has shown cave paintings from 20,000 years ago to children all over the world, and they all interpret them in the same way. Chinese writing, being based on concepts more than sounds, has changed very slowly over time, to the extent that Tzeng found he could read 60-70% of a Taiwanese document written 3,000 years ago.
The brain adapts to make common tasks easier, so reading and writing have a noticeable neurological effect. Scans of children who are learning to read show how their brain activity moves from the right hemisphere to the left. Furthermore, being exposed to vast amounts of information and therefore having to organise it and practise problem-solving techniques seems to affect intelligence. The Flynn effect, whereby the average IQ of a population increases over time, seems to have stalled in developed countries, while in countries that are just becoming exposed to substantial quantities of information thanks to the Internet, IQ is rising by about 3 points per year.
There is a surprisingly strong correlation between levels of literacy (and cognitive ability generally) in a population and life expectancy, perhaps because a literate society is better able to propagate good medical and hygiene practices. It can therefore be argued that the digital divide is not just a concern from the perspective of personal enrichment but has a real effect on people’s lifespans.
The spread of Internet coverage is not only bringing education to remote areas – Tzeng gave an example of remote mountain villages where the residents are now gaining valuable skills after schools and Internet access were installed – but also enabling real political change. The Internet means people can express themselves even when harsh winter weather would suppress physical rallies. But governments have been fighting back by threatening to revoke Internet access, performing deep packet inspection or even waging electronic wars. It is just as important now to protect freedom of expression as it was when the United States first amended its constitution.
The plenary ended with a poster madness session, where authors were given one minute each to convince delegates to come and read their poster. There were 23 posters in all, though I did notice one of them was a short article that had been printed out on A4 landscape pages, and stuck hopefully on the wall in a poster-sized rectangle.
After a short break, the conference split into two parallel debate sessions, one on data ethics and the other on open access.
The open access debate was chaired by Ray Harris, University College London, and led by a panel consisting of Geoffrey Boulton, University of Edinburgh, Robert Chen, CIESIN, Columbia University, and David Mao, Library of Congress.
The session began with the chair and each of the panellists giving a short summary of the main issues from their perspectives. Harris enumerated the various statements and declarations that have been made on open access to data, information and knowledge, but gave several examples where openness might be harmful: the locations of nesting sites should be hidden from poachers, for example, and locations of ships at sea from pirates.
Boulton reiterated the points he had made in his keynote speech. Chen talked about the GEOSS (Global Earth Observation System of Systems) Data Sharing Principles adopted in February 2005, which led to the creation of the GEOSS Data-CORE – an unrestricted, distributed pool of documented datasets – in 2010. He noted the difference between applying restrictions and attaching conditions, and advised communities to be clear about the rights and expectations associated with reusing data. Mao spoke from the perspective of a law librarian; governments have an obligation to ensure their citizens understand the law, which implies they need to provide permanent public access to legal information in open formats. This degree of openness can cause conflicts: in the US, legal reports routinely disclose sensitive personal information (e.g. social security numbers), while the EU is rather more careful about respecting personal privacy in such documents.
For the remainder of the session, the chair collected a handful of questions at a time from the audience and then asked the panel to comment on one or more of them. While I can see that this was done for the sake of efficiency, it did make the discussion a little disjointed.
Concerns were raised over the unevenness of openness around the world and between publicly and privately funded institutions. It was suggested that CODATA should work to achieve greater consistency through an advocacy programme (it did have a hand in the GEOSS principles).
There were also some discussions about the different meanings of ‘free’ (gratis, or ‘free from cost’, versus libre, or ‘free from restriction’) and parallel issues with openness (‘open for access’ versus ‘open to participation’). What is wrong with charging a nominal fee for data, in order to keep the machinery turning? And shouldn't the whole legal process be made open and transparent, and not just the final reports?
Chen provided a good example of economic drivers for collaboration. One of the motivations behind the Group on Earth Observation (GEO) came from government auditors asking, ‘Other countries are sending up satellites, so do we really need to send up our own?’ If the satellites were duplicating effort, that would be a searching question, but by co-ordinating them, GEO could give each satellite a unique role that justified its budget.
These and many other discussions brought us to lunchtime, after which it was time for two more blocks of five parallel tracks. Three of the sessions in the first block had a disciplinary focus, and one was on computational infrastructure. I, however, rejoined the data publication and citation track.
I kicked off the session by laying out some of the most pressing challenges facing researchers as they cite datasets, and suggested solutions drawn from the guide How to Cite Datasets and Link to Publications, published by the Digital Curation Centre. In particular, I looked at citation metadata, microattribution approaches, citing data at the right granularity, where to put data citations/references within a paper, and how to cite dynamic datasets.
I was followed by Daniel Cohen, who examined the functions that data citations need to perform, and compared and contrasted them with the functions performed by citations of textual works. Among the functions he discussed were giving the location of the dataset, properly identifying it, eliminating confusion over which data were used, establishing equivalence between datasets cited in different works, providing assurances about the provenance and integrity of data, specifying the relationships between data and other objects, making data discoverable, ensuring data are reliable and trustworthy, fairly attributing credit for data to the right actors, respecting intellectual property, and providing an audit trail within research.
IJsbrand Jan Aalbersberg's second presentation in this track tackled the thorny issue of supplemental data, that is, material additional to a journal article that does not form part of the version of record, but which is hosted online by the publisher to support the article. This facility is an aid to scientific integrity: freed from the constraints of the print medium, authors face no temptation to simplify tables or tweak figures so that they convey their meaning better on the page. On the other hand, the supplemental data area can be something of a dumping ground: the relative significance of the files is not explicit, the files are not easily discoverable and neither are they properly curated.
Editors are generally in favour of supplemental files but publishers are increasingly hostile. Cell put a limit on the number of files that could be submitted, while the Journal of Neuroscience banned them altogether. Various groups have looked at the issue over the years, with one of the most recent being the NISO/NFAIS Supplemental Journal Article Materials Project. This project is producing a Recommended Practices document in two parts: one on business practices, roles and responsibilities, and one on technical matters such as metadata provision, identifiers and preservation.
The EU-funded Open Access Infrastructure for Research in Europe (OpenAIRE) Project ran for 3 years from December 2009, with the aim of providing a helpdesk and national Open Access Liaison Offices to support researchers depositing publications, establishing a pan-European e-print repository network, and exploring how datasets might be archived in parallel with publication. Natalia Manola reported on its sister project, OpenAIREplus, which started in 2011 with the aim of providing a cross-linking service between scholarly publications and research datasets. Metadata about publications and datasets are being added to a Current Research Information System (CRIS) and surfaced through the OpenAIRE Hub. The project is active in the Research Data Alliance via the International Collaboration on Research Data Infrastructure (iCORDI).
Oak Ridge National Laboratory (ORNL) hosts four major data archives and several smaller ones, in cooperation with the US Department of Energy and NASA. The Carbon Dioxide Information Analysis Center (CDIAC) and Distributed Active Archive Center (DAAC) for Biogeochemical Dynamics both assign one DOI per collection and never delete data. Sample data citations are provided. The Atmospheric Radiation Measurement (ARM) Data Archive holds data from a mix of permanent monitoring stations, mobile sites and aircraft. Some of its holdings are on a terabyte scale; users tend not to download these, but rather get the archive to run the analysis so all the users have to download are the results. This, if nothing else, is a good argument for DOIs to point to landing pages rather than the datasets directly. ORNL is now turning its attention to impact metrics, and is hoping to work with publishers to get evidence of datasets being cited in the literature.
The final block for the day included sessions on topics such as biological science collections, data preservation and dissemination. I started off in a session discussing data journals in the age of data-intensive research, but then decided to switch to a session on mass collaboration.
OpenStreetMap (OSM) is an initiative to create a free and unrestricted global map through the efforts of volunteers. The data primarily come from GPS devices tracking the routes taken by volunteers, and manual traces of satellite images. Mikel Maron gave several examples where OSM has outdone official and commercial maps: Kibera is a Kenyan slum that had not been mapped before OSM, and when an earthquake hit Port-au-Prince in 2010, OSM rapidly created a highly detailed map of the affected area that was heavily used by the relief effort.
OSM uses a wiki methodology of iterative improvement, version control, and post-hoc quality control rather than moderation. It has a philosophy that low quality data is better than no data at all, because the low quality data will be improved if people care about it enough. One of its strengths is that it has kept things simple: a straightforward, RESTful API; a single database instead of many collections; a clear project scope; a simple XML format; and folksonomic tagging instead of restrictive schemas. As a community effort, most decisions are made through collective discussion.
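That combination of a simple XML format and folksonomic key/value tags makes the data easy to work with programmatically. As a minimal sketch, the following parses the tags attached to a node in an OSM-style XML fragment; the element and attribute names follow OSM's published format, but the node ID, coordinates and tag values here are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A small OSM-style XML fragment (hypothetical values), of the kind
# returned by the public API when requesting a node.
OSM_XML = """
<osm version="0.6">
  <node id="365432" lat="-1.3135" lon="36.7865" version="3">
    <tag k="amenity" v="school"/>
    <tag k="name" v="Kibera Primary"/>
  </node>
</osm>
"""

def node_tags(xml_text):
    """Collect each node's folksonomic k/v tags into a plain dict."""
    root = ET.fromstring(xml_text)
    return {
        node.get("id"): {t.get("k"): t.get("v") for t in node.findall("tag")}
        for node in root.findall("node")
    }

tags = node_tags(OSM_XML)
print(tags["365432"]["amenity"])  # → school
```

Because the tags are free-form key/value pairs rather than a fixed schema, the same parser copes with any vocabulary the community settles on.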
Kerstin Lehnert argued that, ideally, physical samples and specimens should be available on the Internet just like digital data, so they can be discovered, identified, accessed and cited in a reliable way. The obvious way to achieve this would be to agree a standard form of catalogue record that could act as the digital surrogate, but there is no universally accepted metadata profile for this. The System for Earth Sample Registration (SESAR) is seeking to change all this. It operates the registry for the International Geo Sample Number (IGSN), an identifier scheme for Earth Science samples. The metadata collected include descriptions of the sample, where and how it was collected, where it is held and if applicable how it relates to other samples. These records are available through the Handle system, so they can be referenced in publications, included in a dataset’s DataCite record, and so on. SESAR is now working on broadening participation and moving to a scalable, distributed architecture.
It may not sound like the most engaging of topics, but Te-En Lin defied expectations with a fascinating and entertaining talk on monitoring roadkill in Taiwan. Taiwan is relatively small but has many cars, and between 1995 and 2006 around 13,000 incidents of roadkill were reported in Yangmingshan National Park alone. The Taiwan Endemic Species Research Institute wanted to collect more data about the animals being killed and in 2009 decided to try a citizen science model. As digital cameras and smartphones are ubiquitous, the Institute trained and encouraged volunteers to send in photos of roadkill, but maintaining a connection with the volunteers was difficult and the initiative didn’t take off.
The next idea was to set up a Facebook page, as over half the population use it (though it was blocked from the Institute’s computers). People are asked to post a picture of the roadkill along with when and where the photo was taken; the location can be given as GPS coordinates, a postal address, a utility pole serial number, a road/mileage sign, or by pinning the location on Google Maps. Since the page was set up in August 2011, 984 people have contributed, and there have been 2078 photos of reptiles alone. While the species can be identified from just the photo in most cases, in some it can’t, so people are now asked to send in the specimens, which they can do for free through the post or at a convenience store. Most of these specimens have been preserved in alcohol, but some need to be kept dry and flat and have therefore been laminated.
While successful, this method is far from perfect. A lot of time is wasted on transferring data out of Facebook into a local database, and on chasing up incomplete submissions. Further, it is tough work supporting so many positioning methods. Therefore, as a more automated alternative there is now a smartphone application for submitting observations. The data may be viewed on the Taiwan Roadkill Observation Network Web site.
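To see why supporting so many positioning methods is laborious, consider a toy normaliser that turns volunteer-supplied location strings into coordinates. Everything here is hypothetical – the pole serial format, the lookup table and the coordinates are invented – but each extra format needs its own parser or gazetteer, which is exactly the maintenance burden described above.

```python
import re

# Hypothetical lookup table standing in for a real utility-pole gazetteer.
POLE_COORDS = {"A12-345": (23.97, 120.97)}

def parse_location(text):
    """Normalise a volunteer-supplied location string to (lat, lon).

    Handles two of the formats mentioned in the report: raw GPS
    coordinates and utility-pole serial numbers. Anything else (postal
    addresses, road signs) would need a geocoding service, signalled
    here by a ValueError.
    """
    m = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*", text)
    if m:
        return float(m.group(1)), float(m.group(2))
    if text.strip() in POLE_COORDS:
        return POLE_COORDS[text.strip()]
    raise ValueError(f"needs manual geocoding: {text!r}")

print(parse_location("23.5, 121.0"))  # → (23.5, 121.0)
print(parse_location("A12-345"))      # → (23.97, 120.97)
```

A dedicated smartphone application sidesteps all of this by capturing coordinates directly from the device.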
Mike Linksvayer took a look at the characteristics and policy options of mass collaboration projects. On the matter of permissions and restriction, such projects work best when they are as open as possible and use common licensing terms. Privacy, security, integrity, transparency, replicability and modifiability are all important factors. Linksvayer argued that public domain dedications and copyleft licences are good for different purposes, but methods relating to intellectual property are a poor substitute for less legalistic forms of regulation. He also noted the compatibility issues with copyleft licences and the rather restrictive nature of non-commercial and no-derivatives licences.
The session concluded with the speakers forming a panel and taking questions. Puneet Kishor of Creative Commons pointed out that scientists only seem to start caring about their rights over data when someone suggests they give them up. Tracey Lauriault of Carleton University, in reply to Linksvayer’s presentation, noted that intellectual property legislation is actually one of the tools being used to protect Inuit people and their lands from being exploited by oil and diamond prospectors.
With that, the second day concluded, and we had about an hour to freshen up before being bussed into the beating heart of Taipei for the conference dinner. The food was served in a long succession of modest courses, and was accompanied by a long succession of modest speeches, which I suppose was appropriate.
For reasons involving an air conditioning unit and some ear plugs I awoke later than planned the following morning. I therefore missed David Carlson’s keynote speech on the International Polar Year 2007/2008, and instead went to one of the five parallel sessions on offer. Topics included solar power, astronomy, earth and environment, and knowledge management, but I decided to go to the last session in the data citation track.
This session was dedicated to the forthcoming white paper being written by CODATA’s Data Citation Standards and Practices Task Group. The authors took it in turns to give a brief summary of the chapter they were writing and invite comments and questions from the delegates.
The report deals with the importance of data citation, current practice, emerging principles, standards, tools and infrastructure, social challenges and research questions. The topic that excited the most discussion was without doubt the emerging principles. A particularly vexing issue concerned the appropriate level at which to identify and reference datasets. While it seems perfectly reasonable to be guided by the nature of the data and pick logically complete datasets, there is an argument for keeping a strong link between referencing and attribution by identifying sets of data with uniform authorship. There was general disapproval of manufacturing different citations for different queries made on large, multidimensional datasets; we were reminded that the fine detail about which data have been used would more usefully be included in the in-text citation than the bibliographic reference.
The next block of sessions was dominated by environmental themes, but to get an idea of the broad range of CODATA activities I attended the session on Task Groups.
The CODATA Task Groups, it seems, are responsible for many of the tangible outcomes and achievements of the organisation. This year, nine Task Groups were applying for renewal and there were five proposals for new Task Groups. The existing groups were asked to give an update on what they had achieved in the past two years, and their plans for the next.
Data at Risk: This group is concerned with rescuing non-digital data (e.g. old ships’ log books) and digitising them so they may be added to longitudinal datasets, for example. People can report at-risk data online, and the group is forming an international collaboration with UNESCO and OECD. One of the big challenges is getting greater involvement from developing countries.
Data Citation Standards and Practices: This group held a major symposium in 2011 and, as mentioned above, is writing a white paper due for publication in March 2013.
Exchangeable Materials Data Representation to Support Scientific Research and Education: This group had been revived in 2010 partly in response to the Materials Genome Initiative in the US. A workshop had been held in China, and the group planned to set up an online registry of materials databases and agree an international standard metadata profile.
Fundamental Physical Constants: This group has maintained the official list of fundamental constants and conversions since 1969. The table is updated every four years, with the next due in 2013, and this time a significant change is on the cards. There is a proposal to fix the value of certain fundamental constants, meaning that the existing definitions of SI units would be discarded in favour of relationships to these constants. One delegate later joked that fixing the value of empirical constants for all time proved that CODATA was the most powerful organisation in the universe.
Global Information Commons for Science Initiative: This group was active in COMMUNIA, the European Thematic Network on the Digital Public Domain, and plans to get involved with the second Global Thematic Conference on the Knowledge Commons in January 2014.
Preservation of and Access to Scientific and Technical Data in Developing Countries: This group works with developing countries to promote a deeper understanding of data issues and advance the adoption of standards. It has held workshops with the Science Council of Asia, the Chinese Association for Science and Technology and the Asian Pacific Network, and helped with the foundation of China’s Geomuseum.
Earth and Space Science Data Interoperability: This group has been particularly active in the Commonwealth of Independent States in promoting the collection of multidisciplinary GIS data, setting up a geomagnetic data centre and translating the World Atlas of Earth’s Magnetic Field for 1500-2010 into Russian. More data products, Web resources and improvements are planned.
Following on from these presentations, the proposed groups had to demonstrate that they had useful and achievable goals and could attract enough members to be viable.
Advancing Informatics for Microbiology: This group would promote access to and use of microbial data through standards, a microbial resource information system, training courses, conferences, and a cloud-based collaboration platform.
Dealing with Data of Complexity, Uncertainty and Bias in Paradigm Shifting Areas: This group would look at ways of improving predictive models and algorithms using health and disaster data.
Knowledge Transformation in Biomedical Sciences (KnoTS): This group would identify key issues and stakeholder needs in Biomedical Sciences, and provide recommendations for technical solutions and policy with a focus on existing standards, knowledge sharing and protection.
Linked Open Data for Global Disaster Risk Research: This group would develop an integrated historical disaster data service for archivists and researchers. The data would come from earth observation, hydrology, meteorology, government reports and financial statistics.
Octopus: Mining Space and Terrestrial Data for Improved Weather, Climate and Agriculture Predictions: The subtitle of this group tells you all you need to know, really. The group expects to deliver data products, a framework for deriving these products, and of course peer-reviewed publications.
After lunch, the final block of parallel sessions dealt with access to data in developing countries, earth observation systems, disaster databases and a strategic vision for CODATA. I thought I'd catch the second half of the data mining track.
Sunspot activity follows an approximate 11-year cycle. The intensity of each cycle is different, but it is possible to partition the cycles into high and low intensity groups. Kassim Mwitondi described how data mining techniques were used to detect patterns in the intensity curves of the 23 cycles on record, and in particular determine whether one could predict the whole of the intensity curve from just the first few months. The team found that patterns from the first three years of a cycle could indeed be used to predict the peak intensity of the cycle. The general approach could, Mwitondi believes, be adapted to many other domains and domain partitioning methods.
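The talk did not spell out the algorithms, but the shape of the approach – predicting a cycle's peak from its first few years – can be sketched on synthetic data. Everything below (the toy intensity curve, the single summary feature, the least-squares fit) is an illustrative stand-in, not Mwitondi's method.

```python
import random

random.seed(1)

def synthetic_cycle(peak):
    """Toy 11-year intensity curve: linear rise to `peak`, then decay."""
    return [peak * min(year / 4, max(0.0, (11 - year) / 7)) for year in range(12)]

# Training "cycles" with known peaks; the mean intensity over the
# first three years serves as the early-cycle feature.
cycles = [synthetic_cycle(random.uniform(80, 200)) for _ in range(23)]
features = [sum(c[:3]) / 3 for c in cycles]
peaks = [max(c) for c in cycles]

# One-variable least-squares fit: peak ≈ a * early_mean + b
n = len(features)
mx, my = sum(features) / n, sum(peaks) / n
a = sum((x - mx) * (y - my) for x, y in zip(features, peaks)) / \
    sum((x - mx) ** 2 for x in features)
b = my - a * mx

# Predict the peak of a new cycle from its first three years only.
new_cycle = synthetic_cycle(150.0)
predicted = a * (sum(new_cycle[:3]) / 3) + b
print(round(predicted, 1))  # → 150.0
```

On real sunspot records the early-cycle behaviour is of course a noisy predictor of the peak, which is what makes the reported result – usable predictions from the first three years – noteworthy.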
Hsuan-Chih Wang told how a new knowledge management system was designed for and implemented at the Science & Technology Policy Research and Information Center in Taiwan. The final system used SharePoint as a collaboration platform, on top of which were built knowledge management tools and resources (e.g. custom file management tools, a set of frequently asked questions, personal and corporate portals), a project management tool supporting Gantt charts and checkpoints, and a document annotation tool.
There were supposed to be three other talks in this session, but neither of the two speakers involved could make it. I was just about to pack up and find another session when the chair, John Helliwell of the International Union of Crystallography, invited me up to participate in a panel. We had a free-ranging discussion on the challenges of data mining, which slowly morphed into a discussion on the poor but improving state of research data management worldwide.
The penultimate plenary session was a panel chaired by Robert Chen, CIESIN, Columbia University. Before handing over to the panellists, Chen reflected on how Hurricane Sandy, which was wreaking havoc along the USA’s east coast at the time, was also bringing home the importance of climate change issues.
Jane Hunter listed several global issues in which scientific data play an important role: climate change, clean energy, disaster risk reduction, global pandemics and food security. In each case a cycle is formed: the data are used to generate models, through which we come to an understanding of the phenomena involved. This understanding leads to decisions which influence policies and public funding; the outcomes of these changes generate further data. As an example, the eReefs Project is looking at the negative impacts of agriculture on coral reefs in Queensland. The effects of management actions on various indicators of environmental quality have been measured and modelled. The models have been turned into a Web application that lets members of the public run their own simulations and see the effect of certain actions on, for example, maps of water quality.
Hunter saw the following as major challenges: global open access to politically sensitive data, good visualisations of data quality, improved data correlation services, opening online models to non-experts, real-time analysis and monitoring, multidimensional visualisations, fully exploiting the power of linked open data, ensuring citations of large-scale, collaborative, dynamic databases provide fair attribution, and providing multilingual search and presentation interfaces.
David Carlson argued that free and open access to data is key to progress. The fundamental limiting factor on the usefulness of data will always be the amount of human effort that can be allocated to managing and curating it. He reiterated the point that researchers have to build in good data curation right from the start, and encouraged them to practise it in demonstrator projects before embarking on anything large-scale: they need to know how it will work at the point they write their proposal.
There were, as ever, interesting points raised in the subsequent discussion. A delegate from Germany suggested that ‘citizen sensor’ was a more accurate term to use than ‘citizen science’; Carlson came back with ‘community monitoring’ but others pointed out that citizen science was broader than this: Galaxy Zoo, for example, used volunteer effort to categorise (rather than collect) images. There was a reminder of the wealth of born-analogue data that might be useful: old holiday videos might contain interesting biodiversity data, but the challenge is discovering and extracting such information. Another delegate expressed frustration that if we have trouble convincing top politicians of scientifically proven conclusions, what hope is there of reaching everyone else? The answer came back that shouting louder doesn't work: scientists have to overcome people’s fears when communicating their results. This implies a conversation rather than a broadcast model. It would also help to get people while they’re young, and teach science in a practical and data-driven way in schools.
The conference ended with the usual rounds of thanks to those who worked so hard to make it a success. Presentations were made to outgoing Secretary-General Robert Chen, and to the winner of the best poster award, Anatoly Soloviev. Honourable mentions from the poster judges went to Li-Chin Lee, Punit Kaur and Akiyo Yatagai.
This was my first time at a CODATA conference, and I cannot but admire the stamina of the delegates who come back year after year. They pack a lot into three days, which makes the experience fascinating and stimulating, but also exhausting and frustrating – with five sessions running at once most of the time, you can’t help but miss most of it. The thing that really struck me was the palpable sense of community that had built up around this conference, and the obvious warmth of friendship, never mind professional respect, that exists between the regulars. Before I attended the conference, I must admit to being a bit hazy about what CODATA actually does, and now that I know, I am rather impressed both by its accomplishments and its ambitions for the future.
The conference programme has been given its own Web site, containing abstracts, some full papers and presentations, and more information about authors and speakers. The next CODATA conference will be held in 2014 in India.
Alex Ball is a Research Officer working in the field of digital curation and research data management, and an Institutional Support Officer for the Digital Curation Centre. His interests include engineering research data, Web technologies and preservation, scientific metadata, data citation and the intellectual property aspects of research data.
This article has been published under Creative Commons Attribution 3.0 Unported (CC BY 3.0) licence. Please note this CC BY licence applies to textual content of this article, and that some images or other non-textual elements may be covered by special copyright arrangements. For guidance on citing this article (giving attribution as required by the CC BY licence), please see below our recommendation of 'How to cite this article'.