On 15-17 December 2003, the ERPANET Project and the ICSU (International Council for Science) Committee on Data for Science and Technology (CODATA) held a joint workshop on the selection, appraisal and retention of digital scientific data at the National Library of Portugal (Biblioteca Nacional) in Lisbon. The workshop brought together around 80 participants, a mix of scientists, archivists and data specialists.
After the opening introductions, the first presentation was an overview of CODATA data archiving activities given by Bill Anderson, co-chair of the CODATA Task Group on Preservation and Archiving of Scientific and Technical Data in Developing Countries. He first highlighted a number of recent news stories that concerned the re-analysis of scientific data, and a report in Science on disappearing Web references in scientific papers. He then reiterated a point made by Bernard Smith of the European Commission at a workshop in 2002 that "digital resources will not survive or remain accessible by accident." He went on to introduce CODATA and its Task Group on Data Preservation. The presentation ended with a brief look at a range of scientific, management, policy and technical issues relating to the long-term preservation of scientific data. Anderson argued that one challenge from a scientific point of view was the tension between discipline-specific requirements and practices and the growing need for interdisciplinary approaches to scientific data.
Terry Eastwood of the University of British Columbia then introduced the archival concept of 'appraisal' and its application to digital records. Archivists had been dealing with the preservation challenges of electronic records for some time and had been involved in the development of record-keeping systems that include provision for determining the disposition of records. Terry argued that archivists' experiences are likely to have parallels in other spheres where digital objects need preservation. He then briefly outlined a model of the archival selection function developed by the Appraisal Task Force of the InterPARES Project. The model defines four appraisal activities: compiling information about digital objects and their contexts, assessing their continuing value, determining the feasibility of their preservation, and finally the appraisal decision itself.
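As an illustration only, the four activities could be recorded in a small data structure. The class and field names below are hypothetical and are not taken from the InterPARES model, and the rule that an object is preserved only when it has both continuing value and a feasible preservation route is a simplifying assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    PRESERVE = "preserve"
    DO_NOT_PRESERVE = "do not preserve"


@dataclass
class Appraisal:
    """Hypothetical record of one appraisal, one field per activity."""
    object_id: str
    # Activity 1: information compiled about the object and its context
    context_notes: List[str] = field(default_factory=list)
    # Activity 2: assessment of continuing value (None = not yet assessed)
    has_continuing_value: Optional[bool] = None
    # Activity 3: feasibility of preservation (None = not yet assessed)
    preservation_feasible: Optional[bool] = None

    def decide(self) -> Decision:
        """Activity 4: the appraisal decision (assumed rule: both must hold)."""
        if self.has_continuing_value is None or self.preservation_feasible is None:
            raise ValueError("value and feasibility must be assessed first")
        if self.has_continuing_value and self.preservation_feasible:
            return Decision.PRESERVE
        return Decision.DO_NOT_PRESERVE
```

Keeping the decision as a separate step that refuses to run until the earlier assessments exist mirrors the ordering of the four activities, and the stored fields document the basis on which the decision was reached.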
The focus of the workshop then changed slightly as Peter Weiss of the US National Oceanic and Atmospheric Administration (NOAA) spoke about the economics of data reuse. His presentation argued that giving all researchers open access to government-funded scientific data would have considerable economic and social benefits. He cited the example of how multiple meteorological datasets could be used together to predict major weather phenomena in developing countries, e.g. monsoons. While observational data from the US National Climatic Data Center (NCDC) was freely available, Peter noted that research into monsoon prediction at the Indian Institute of Technology had been hampered by the high prices charged for historic atmospheric model data from the European Centre for Medium-Range Weather Forecasts (ECMWF). He argued that the resulting lack of access to ECMWF data not only caused potential social and economic harm for the people living in monsoon areas, but also that the high prices generated no revenue for the ECMWF itself. Peter pointed out that the United States had long supported making government-funded information available at no more than the cost of dissemination, a policy "based on the premise that government information is a valuable national resource, and that the economic benefits to society are maximized when government information is available in a timely and equitable manner to all". In consequence, the importance of the private sector in meteorology was growing, e.g. for providing weather risk management services. A 2003 report produced by the US National Academy of Sciences noted the importance of private sector use of National Weather Service data, which 'greatly increases the value of the data and further justifies the high costs of the national observing system infrastructure'.
After a brief introduction to the economics of information, Peter used figures from a report produced in 2000 by PIRA International for the Information Society Directorate-General of the European Commission to argue that charging for public sector information (PSI) in Europe was detrimental to the European economy. The PIRA report concluded, 'a conservative projection of a doubling of market size resulting from eliminating licence fees would produce additional taxation revenues to more than offset the lost income from PSI charges'. In addition, Peter was very critical of particular European policies on data access and cost recovery. He singled out the Deutscher Wetterdienst, Germany's National Meteorological Service, which quoted a price of over US$1.5 million for access to historical data, and noted that 50% of the revenue of the UK Meteorological Office comes from a single government department (the Ministry of Defence), with another 30% from other government agencies. He ended his presentation with some examples of good practice and some recommendations.
After this, the focus of the workshop moved on to a series of disciplinary and interdisciplinary case studies. The first was a disciplinary case study in the physical sciences. Jürgen Knobloch of CERN (the European Organisation for Nuclear Research) briefly introduced CERN, and then described the nature of the data collected in particle physics. Particle colliders generate massive amounts of raw data (much of which is routinely thrown away), which is repeatedly analysed to produce results data. More data is generated by computer simulation of the same processes. CERN has an archiving policy, embodied in an operational circular published in 1997, but this does not specifically cover digital physics data. Despite this, CERN has undertaken to maintain the ability to analyse data generated by its Large Electron-Positron collider (LEP) for as long as it is practicable. The strategy is currently dependent on the running of existing software, which may in the future need to be part of a 'museum system.' Data preservation in particle physics raises many issues, not the least of which is the perception that data cannot be analysed in meaningful ways by people who were not involved in the original collaboration. Tools for making data available, however, are under development. Knobloch mentioned a method called QUAERO that has been used to make high-energy physics data publicly available. High-energy physics data is reviewed by other scientists through the Particle Data Group (PDG), who maintain a database of experimental results. CERN's current major development in particle physics is the Large Hadron Collider (LHC), currently under construction. (Knobloch noted that there were millions of patents and engineering drawings that would need to be preserved at least for the lifetime of the collider.) LHC was expected to generate around 12-14 petabytes of data a year, which would present severe challenges for data analysis.
Knobloch concluded by reminding attendees that experimental physics is extremely expensive and that experiments are not easy to repeat, and that data are useless without additional documentation, metadata, and software.
The next two presentations moved on to consider the data requirements of the space sciences. Firstly, Françoise Genova of the Centre de Données astronomiques de Strasbourg (CDS) looked at the observational data generated by astronomers. She noted that the reuse of observational data in astronomy was important for optimising the scientific return on large projects. The relatively small size of the profession and the lack of commercial constraints meant that astronomers had a strong tradition of networking and data sharing. Links between observational data and published results are part of the astronomical bibliographic network, which includes the NASA Astrophysics Data System (ADS) for bibliographic information and specialised online services like SIMBAD and the NASA/IPAC (Infrared Processing and Analysis Center) Extragalactic Database (NED). Genova noted that there had been some progress on the development of an interoperable standard definition for tabular data and stressed the importance of de facto standards and of co-operation between all actors, e.g. journals, the ADS, data centres and archives. The presentation ended with a brief look at data interchange formats, including an XML-based format for the exchange of tabular data called VOTable, being developed by the International Virtual Observatory Alliance (IVOA). This format incorporates an existing standard for describing astronomical images known as the Flexible Image Transport System (FITS).
The next presentation, by Alex Szalay of Johns Hopkins University, focused on the changing scale and nature of astronomical data. He noted that there was an exponential growth in the amount of astronomical data being generated, estimating that the amount of data doubled each year: it currently consisted of a few hundred terabytes but was expected soon to reach a petabyte. Some other trends identified were that data collections were increasingly likely to be distributed, that data itself was 'live' and subject to change, and that scientists were themselves becoming the publishers and curators of data and other content. One consequence of the exponential growth in data being generated was that current tools for downloading and analysing data were becoming less adequate. Szalay and his collaborator Jim Gray of Microsoft Research have commented that FTP or GREP tools do not scale to dealing with terabytes. They have written elsewhere that 'FTPing or GREPing a gigabyte takes a minute, but FTPing or GREPing a terabyte can take a day or more, and sequentially scanning a petabyte takes years.' Consequently, new techniques are needed for data analysis, including the concept of 'data exploration,' whereby the analysis is performed much closer to the data, e.g. inside the database itself. Szalay also noted that additional work is required on developing better algorithms for data analysis.
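The scaling argument quoted from Gray and Szalay can be checked with simple arithmetic. The sketch below assumes only the quoted figure of one minute per gigabyte, constant throughput, and decimal units (1 TB = 1,000 GB); the function name `scan_seconds` is illustrative and does not come from their papers.

```python
GB = 10**9   # bytes, decimal units assumed
TB = 10**12
PB = 10**15


def scan_seconds(size_bytes, seconds_per_gb=60):
    """Time to move or scan size_bytes sequentially at a fixed rate.

    The default of 60 seconds per gigabyte is the quoted
    "a gigabyte takes a minute" figure (about 16.7 MB/s).
    """
    return size_bytes / GB * seconds_per_gb


hours_per_tb = scan_seconds(TB) / 3600              # about 16.7 hours
years_per_pb = scan_seconds(PB) / 86400 / 365.25    # about 1.9 years

print(f"1 TB: {hours_per_tb:.1f} hours")
print(f"1 PB: {years_per_pb:.1f} years")
```

At this assumed rate a terabyte takes most of a day and a petabyte close to two years, which is consistent with the quoted claim and shows why performing the analysis next to the data becomes attractive as collections grow.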
The following day started with an interdisciplinary case study focused on the social sciences. Myron Gutmann from the Inter-university Consortium for Political and Social Research (ICPSR), based at the University of Michigan, gave the first presentation. He noted the importance of metadata standards and emphasised that the ICPSR had put a lot of work into the development of such standards, e.g. through the Data Documentation Initiative (DDI). The types of data that social scientists are concerned with are both quantitative, e.g. census and administrative data, surveys, etc., and qualitative. In the US, the largest data producers are the federal government, universities and private foundations. The main data holders are the US National Archives and Records Administration (NARA) and university-based archives. However, there was still a major challenge in getting data owners to archive data. Partly this was due to serious concerns about confidentiality and research priority, but in many cases researchers just lacked the time or motivation to prepare data for archiving. Kevin Schürer of the UK Data Archive then described the corresponding UK situation. The data types involved were similar, but there was now a growing amount of non-survey type data, e.g. videotaped interviews, mixed media, etc. In the UK, data repositories included the National Archives (Public Record Office), data centres directly funded by research-funding agencies like the Data Archive, the NERC (Natural Environment Research Council) Data Centres and the Arts and Humanities Data Service (AHDS), as well as university-based centres like EDINA. Turning to appraisal issues, Schürer reminded workshop participants that appraisal is not just about selection, but also about clear decisions on what it is not necessary to keep. Appraisal criteria might include whether a data set is appropriate to a particular collection, the existence (or not) of metadata, etc.
Resource implications mean that archives like the UK Data Archive are unable to accept all the data that they are offered.
The next disciplinary case study concerned the biological sciences. Firstly, Meredith Lane of the Global Biodiversity Information Facility (GBIF) Secretariat in Copenhagen talked about the different kinds of situation in biology. Meredith said that there were three main kinds of biological data: firstly genomic and proteomic sequence data (typically known as bioinformatics data), secondly data about how organisms interact with ecology and ecosystems (ecoinformatics data), and thirdly information about species (biodiversity informatics data). While these subdomains have much to offer each other, each has its own particular problems. So while bioinformatics data are almost all digital and are kept in universally accessible data stores, the vast majority of species and specimen data are not yet in digital form, often being held in physical data stores that are not easily accessible, e.g. museums of natural history. While, as with bioinformatics data, ecological and ecosystem data is mostly digital, it is not always freely accessible. Meredith went on to describe the GBIF, a 'megascience facility' focused on making primary species occurrence data freely available through the Internet. One motivation was to make biodiversity data - the majority of which is held in the developed world - freely available to the developing world, from where much of the raw data originated. GBIF has concentrated its activities on areas that are not being addressed by other initiatives, focusing on biological names, the generation of catalogues and information architectures, and the interoperability of biodiversity databases among themselves and with other biological data types. Achieving interoperability, however, will depend on good co-operation between the various biological data initiatives in existence.
Weber Amaral of the International Plant Genetic Resources Institute (IPGRI), based near Rome, then gave some examples of data use in the area of agro-biodiversity, a small subset of around 100,000 species, of which 100 species provide around 90 per cent of human nutrition. He explained the differences between the ex situ conservation of species, e.g. in genebanks or botanic gardens, and in situ conservation, where plants are kept in their natural habitat or (where cultivated) in the habitats in which they were domesticated. Amaral then described some of the functionality of the SINGER database, the System-wide Information Network for Genetic Resources of the Consultative Group on International Agricultural Research (CGIAR).
The final interdisciplinary case study concerned the earth and environmental sciences. The first presentation was given by John Faundeen of the US Geological Survey (USGS) Earth Resources Observation Systems (EROS) Data Center in South Dakota. The centre holds data from satellite missions and land-sensing data. Some of this data results from missions with which the USGS is involved, while other data sets are sought by or offered to the agency. In order to deal with these, the data centre set up two committees - one internal and one external - to advise on selection criteria and other issues. For appraisal, the USGS uses checklists, including one developed by the NARA for electronic records. The use of these may not lead to a definitive decision, but they do help to inform the process and also document the criteria that were used to reach that decision. The US Federal Geographic Data Committee (FGDC) has also developed a checklist for the retention of data, which has been used, together with the NARA list, by the USGS to re-appraise some data sets already held by the data centre. This has resulted in the 'purging' of at least one large satellite collection, but re-appraisal has helped to align the collections of the data centre with its original mission.
Luigi Fusco of the European Space Agency (ESA) European Space Research Institute (ESRIN), based in Frascati, Italy, gave a presentation on earth observation archives. Earth observation (EO) data is observational data that is used by multiple scientific disciplines and by commercial organisations. While long-term preservation has been identified as a requirement, there are no unified archiving policies at either European or national level. Responsibility for preservation currently mostly resides with the mission owner. Luigi then went on to discuss how data interoperability and preservation were being considered in the context of a European initiative called Global Monitoring for Environment and Security (GMES), which would integrate space and ground-based EO data. After a brief look at some other EO initiatives, Luigi concluded with an overview of emerging technologies, primarily with regard to GRID developments.
The afternoon was taken up with a discussion chaired by Gail Hodge. This was extremely wide ranging, and the following just highlights a selection of the issues that were raised:
- Differences in perspectives. It was noted that the case study presentations had highlighted major differences between the data practices of scientific disciplines and sub-disciplines. While some disciplines had already developed cultures and technical frameworks for maintaining and sharing research data, others had not. There were also potential conflicts between the viewpoints of scientific investigators and archivists, e.g. on the ownership of data.
- The role of funding agencies. The importance of funding agencies was mentioned several times in the discussion. Some funding agencies were becoming more aware of data issues, and some were already supporting the maintenance of data archives or encouraging grantees to make data publicly available. For example, the UK Economic and Social Research Council will withhold the final 10% of grants if the UK Data Archive cannot confirm that the data generated by the research has been offered to them. Penalising scientists for not depositing data, however, was not a cost-free exercise, so there needed to be an added emphasis on giving academic credit for making data available, with a resulting need for robust citation mechanisms for data - something that is currently sub-discipline dependent. In general, however, it was felt that agencies were interested more in funding primary research than in providing ongoing support to data archives.
- Appraisal. The subject of appraisal was returned to time and again in the discussion. There were questions as to when it should take place: at the beginning of the data lifecycle (e.g. as part of the project review process) or at the point when data is transferred to an archive. Some argued that an important aspect of appraisal related to the existence (or not) of the metadata required to interpret the data correctly. Others noted that appraisal criteria would normally depend upon the reasons why data was being retained, pointing to potential differences between scientific data that only needs to be kept for a relatively short period of time, e.g. for verification, and the longer-term views of archivists.
- Costs and benefits. There was some discussion of the costs of retaining data. While storage costs are falling, it was recognised that this would not scale to petabytes of data. The hardware required at this level - e.g., tape drives and robots - remains expensive, not to mention the less quantifiable costs of migration or retrieval. There was some recognition of the need for more cost-benefit models and for demonstrator projects that could highlight these.
- The need for a common vocabulary. The workshop presentations had highlighted many differences in the use of terminology. Several attendees mentioned the value of the Reference Model for an Open Archival Information System (OAIS) (ISO 14721:2003) for providing a common vocabulary. Donald Sawyer of the US National Space Science Data Centre said that appraisal fitted in with the 'Ingest' function of the OAIS functional model and noted work currently being led by the French Centre National d'Études Spatiales (CNES) within the Consultative Committee for Space Data Systems (CCSDS) on developing an abstract standard for a 'producer-archive interface methodology'. CODATA representatives promised to initiate some co-ordinating work on harmonising terminologies.
The final day began with panel presentations on appraisal by John Faundeen, Jürgen Knobloch, Kevin Schürer and Terry Eastwood. John first highlighted the need for scientific relevance but stressed that an appraisal policy should always align with the collecting organisation's mission or charter. He also noted the need for sufficient documentation to use records without the assistance of the creating agency. Another important criterion would be the continued availability of a sufficient level of funding for preservation and to fulfil any additional requirements for distribution, e.g. rights management. Jürgen argued that preservation needed to be seen as important but that appraisal could only be done in collaboration with scientists. Kevin looked at a survey of the use of data sets in the UK Data Archive, noting that a majority of use (around 60%) was concentrated on a fairly low proportion of the collection (around 10%) and he thought that a large percentage of data would most probably never be used. As acquisition was the most expensive part of the preservation process, Kevin concluded that a balanced appraisal policy was essential. The UK Data Archive had set up an Acquisitions Review Committee to support its accountability, e.g. for when data is rejected. He also emphasised the need for continued dialogue with data creators and the need to move assessment back to the beginning of the data life-cycle. Schürer noted the difficulty of assessing the long-term value of data, noting that future users were likely to be quite different from current ones. Terry Eastwood noted the dichotomy between the requirements of the scientists who generate data and those of the agencies that undertake to take responsibility for it, maintaining, preserving and making it available. He argued that creators and preservers needed to collaborate and that, above all, preservation activities needed to be funded adequately.
The final presentations were an overview of the OAIS model by Donald Sawyer and some reiteration of his earlier comments by Peter Weiss, which generated some more debate on the merits of making all publicly funded data available at cost.
By way of conclusion, this ERPANET/CODATA workshop was a useful forum for scientists, data managers, archivists, and the representatives of funding agencies, etc. to meet together to discuss an issue that is growing in importance. I was personally struck - if not entirely surprised - by the diversity of standards and practice that had been developed within different sub-disciplines. The increasing interdisciplinary nature of some scientific disciplines will mean that more attention will need to be given to building tools that create data links between them, e.g. as attempted by the GBIF. It also seems clear that the attention of scientists, their institutions and funding agencies needs to be directed towards the creation of a sustainable infrastructure that will result in data with continuing value being retained for as long as it is required. Selection and appraisal guidelines will be a key part of this infrastructure. While the workshop concentrated on discipline-specific approaches, it is likely that there will need to be some interaction with institutional or national initiatives. For those interested, the workshop briefing paper and presentation slides are available from the ERPANET Web site.
- ERPANET: Electronic Resource Preservation and Access Network http://www.erpanet.org/
- CODATA, the Committee on Data for Science and Technology http://www.codata.org/
- Revkin, A.C. (2003). "New view of data supports human link to global warming." New York Times, 18 November.
- Mason, B. (2003). "Lower atmosphere temperature may be rising." Nature Science Update, 12 September.
- Ault, A. (2003). "Climbing a medical Everest." Science, 300, 2024-25.
- Dellavalle, R. P., Hester, E. J., Heilig, L. F., Drake, A. L., Kuntzman, J. W., Graber, M., & Schilling, L.M. (2003). "Going, going, gone: lost Internet references." Science, 302, 787-788.
- InterPARES project: http://www.interpares.org/
- Weiss, P. (2002). Borders in cyberspace: conflicting public sector information policies and their economic impacts. US National Weather Service, February. http://www.weather.gov/sp/Bordersreport2.pdf
- US Office of Management and Budget. (1996). "Management of Federal information resources." OMB Circular No. A-130. http://www.whitehouse.gov/omb/circulars/a130/a130.html
- US National Research Council. (2003). Fair weather: effective partnerships in weather and climate services. Washington, D.C.: National Academies Press, p. 8. http://www.nap.edu/books/0309087465/html/
- PIRA International. (2000). Commercial exploitation of Europe's public sector information: final report. Leatherhead: PIRA International, 30 October. ftp://ftp.cordis.lu/pub/econtent/docs/commercial_final_report.pdf
- PIRA International, University of East Anglia, & KnowledgeView. (2000). Commercial exploitation of Europe's public sector information: executive summary. Luxembourg: Office for Official Publications of the European Communities, 20 September, p. 6.
- CERN Scientific Information Group. (1997). "Rules applicable to archival material and archiving at CERN." CERN Operational Circular No. 3. http://library.cern.ch/archives/archnet/documents.html
- Abazov, V.M., et al. (2001). "Search for new physics using QUAERO: a new interface to D0 event data." Physical Review Letters, 87(23), no. 231801-1. http://arxiv.org/abs/hep-ex/0106039
- Particle Data Group: http://pdg.lbl.gov/
- Large Hadron Collider (LHC): http://lhc-new-homepage.web.cern.ch/
- NASA Astrophysics Data System: http://adswww.harvard.edu/
- SIMBAD astronomical database: http://simbad.u-strasbg.fr/
- NASA/IPAC Extragalactic Database (NED): http://nedwww.ipac.caltech.edu/
- VOTable Documentation: http://www.us-vo.org/VOTable/
- Flexible Image Transport System (FITS): http://fits.gsfc.nasa.gov/
- Gray, J. & Szalay, A. (2002). "The world-wide telescope." Communications of the ACM, 45(11), 50-55.
- "We assume the Internet will evolve so that copying larger data sets will be feasible and economic (in contrast, today data sets in the terabyte range are typically moved by parcel post). Still, it will often be best to move the computation to petabyte-scale data sets in order to minimise data movement and speed the computation." Gray, J. & Szalay, A. (2001). "The world-wide telescope." Science, 293, 2037-2040.
- Data Documentation Initiative: http://www.icpsr.umich.edu/DDI/
- Global Biodiversity Information Facility (GBIF): http://www.gbif.org/
- For example, see: Wilson, E.O. (2003). "The encyclopedia of life." Trends in Ecology and Evolution, 19(2), 77-80.
- CGIAR (Consultative Group on International Agricultural Research) System-wide Information Network for Genetic Resources (SINGER): http://singer.cgiar.org/
- GMES (Global Monitoring for Environment and Security): http://www.gmes.info/
- Consultative Committee for Space Data Systems. (2002). Reference model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1, Blue Book, January. http://www.ccsds.org/documents/650x0b1.pdf
- Consultative Committee for Space Data Systems. (2003). Producer-archive interface methodology abstract standard. CCSDS 651.0-R-1, Red Book, April. http://www.ccsds.org/review/RPA305/651x0r1.pdf
- For more details, see: US National Research Council. (2001). Resolving conflicts arising from the privatization of environmental data. Washington, D.C.: National Academies Press. http://www.nap.edu/books/0309075831/html/
- ERPANET/CODATA Workshop on the Selection, Appraisal, and Retention of Digital Scientific Data, Biblioteca Nacional, Lisbon, Portugal, 15-17 December 2003: http://www.erpanet.org/www/products/lisbon/lisbon.htm