High-energy physics (HEP) experiments acquire huge datasets that may not be superseded by new and better measurements for decades or centuries. Nevertheless, the cost and difficulty of preserving both the data and the understanding of how to use them are daunting. The small number of cases in which data over ten years old have been reanalysed has only served to underline that such analyses are currently very close to impossible. The recent termination of data taking by the H1 and ZEUS experiments at DESY's HERA collider, and by the BaBar experiment at SLAC, plus the imminent termination of other major experiments, prompted the organisation of this workshop. Almost 50 physicists and IT experts participated.
The workshop heard from HEP experiments long past ('it's hopeless to try now') and recent or almost past ('we really must do something'), and included representatives from experiments just starting ('interesting issue, but we're really very busy right now'). We were told how luck and industry had succeeded in obtaining new results from 20-year-old data from the JADE experiment, and how the astronomy community apparently shames HEP by taking a formalised approach to preserving data in an intelligible format. Technical issues including preserving the bits and preserving the ability to run ancient software on long-dead operating systems were also addressed. The final input to the workshop was a somewhat asymmetric picture of the funding agency interests from the two sides of the Atlantic.
Parallel working sessions addressed the different needs and issues in e+e-, ep and pp experiments. The likelihood of future data rendering old datasets uninteresting can be very different in these three types of collision. The workshop then tried to lay out a path forward. Apart from the obvious 'hold another workshop' imperative, it seemed clear that experimental programmes that felt 'we really must do something' should be used as drivers. A first step must be to examine issues of cost, difficulty and benefit as quantitatively as possible so that the next workshop could have concrete discussions on the science case for various levels of data preservation. The next workshop is planned for early summer 2009 at the SLAC National Accelerator Laboratory, Menlo Park, USA.
The workshop programme and copies of presentations are available on the Web. It is planned that the workshop proceedings will appear as a DESY preprint.
Cristinel described how his immediate concern for preservation of the H1 data had led to the formation of an informal study group comprising experimental HEP scientists and IT experts. In September 2008, this group had agreed on the programme for this workshop.
Below I report in some detail on the presentation by H1, to give the flavour of the type of information presented. From the remaining experiment presentations I select some key issues that add to the overall picture.
H1 acquired ep collision data from 1992 to 2007. The original software was written in Fortran, some parts of which continue to be developed. The BOS (Bank Object System) tool is used to handle the event data structures, for which task Fortran 77 alone was quite inadequate. The Fortran code continues to be used for reconstruction and simulation down to the level of the DST (Data Summary Tape) dataset (18 kbytes/event for data, and twice that for simulation). From that point on the data are converted to an object-oriented representation allowing analysis to be performed in the modern ROOT framework.
H1 has formed a 'Data Preservation Task force' that is planning the continued availability of around 500 terabytes of data ranging from raw and simulated collisions and dedicated calibration runs, to a version of the DST and the derived object-oriented datasets. Preserving these data at DESY should not be a problem. But what about the software? Paradoxically the old Fortran/BOS software gives rise to fewer concerns than the new ROOT-based system. The old Fortran system is stable, and all under the control of H1. The new software draws heavily on the ROOT system that is under continued vigorous development with a consequent need to modify, recompile and verify the H1 software at regular intervals. Much documentation for the whole system is already in place but a 'concentrated effort [is] still needed in the coming years.'
ZEUS reconstruction and simulation code is also Fortran-based with a different set of utilities performing the BOS-like functions. Their jump into the modern ROOT world is at the ntuple rather than DST level, resulting in less ROOT dependence, but arguably a more restricted capability for those unwilling to brave the Fortran. They see no way to keep their full system alive beyond about the year 2013, but are actively exploring a simple ntuple format as a way to make the data available to a wider community in the longer term.
CDF and D0 at Fermilab's proton-antiproton collider are still taking data and hope to do so for several more years barring (regrettably not impossible) budget catastrophes. Perhaps understandably the talk described the data processing and analysis strategy in elegant detail, but did not venture into ideas for long-term preservation and analysis.
D0 did address data preservation stating that 'the only sensible solution would be a high-level format, in which most calibrations and corrections have already been applied to the data.' Offsetting this assertion, Qizhong described the Qaero Project that had supplied just such a data format to D0 physicists some five years ago, and had been received by physicists with five years of indifference.
Belle and BaBar are two 'B-factory' experiments that have acquired huge datasets at exactly the same e+e- collision energy. Belle is still running and hopes to 'keep data/software intact at KEK until Super KEKB takes overwhelmingly large amounts of data.' This will be a sound strategy if the funding gods smile on the Super KEKB proposal!
The upgraded Beijing Electron Synchrotron, BES-III, will run for a decade generating e+e- collisions at a few GeV. BES-III data will supersede decades-old data taken at US and European e+e- colliders. However, no successor to BES-III is on the horizon so data preservation is already taken seriously. They plan to preserve access to complete, as opposed to simplified, datasets for about 15 years, but already acknowledge that there are 'anticipated software headaches.' They cannot see how to make the data (usefully) available to the public.
BaBar took its last data in April 2008 and has formed a task force on data preservation. The planning addresses migration of code to current platforms, simplification to remove dependencies on third-party software and virtualisation to lessen or obviate the need for migration.
Like BaBar, CLEO took its final data in April 2008. Their examination of data preservation issues led project staff to conclude that: a) it would be very difficult, given the complexities of their software and its dependencies; b) it is probably not needed, since BaBar and Belle have superseded their earlier data and BES-III will supersede their newer data; and c) there is no funding model for data preservation!
'Currently, we don't really preserve data, we preserve [tape] cartridges(!)' There is 'no strategy for long-term data preservation,' and 'a collective and co-ordinated effort from experiments, funding agencies and data centres seems essential for dealing with this issue.'
Prior to 2002, Fermilab physics data were written to Exabyte cartridges that had reliability issues from the moment they were written. These tapes are physically preserved at Fermilab, but their readability is not assured. Since 2002, experimental data have been stored using more robust robotic mass storage systems managed by software that makes it possible to migrate data to new tape technologies in the future. Open issues include how long to preserve the tape data, and how to maintain or migrate the experiments' software.
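Any migration between storage technologies raises the question of whether the copied bits are still the original bits. The following is a minimal sketch of one common way to check this, by recording a checksum per file before the copy and comparing afterwards. It is an illustration only: the function names and the manifest layout are hypothetical, and nothing here describes the software actually used at Fermilab.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def build_manifest(directory):
    """Map each file's relative path to its digest (hypothetical manifest layout)."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_migration(manifest, new_directory):
    """Return the relative paths that are missing or altered after a copy."""
    root = Path(new_directory)
    damaged = []
    for rel_path, expected in manifest.items():
        candidate = root / rel_path
        if not candidate.is_file() or file_digest(candidate) != expected:
            damaged.append(rel_path)
    return damaged
```

In such a scheme the manifest would be built once on the old medium, stored alongside the data, and passed to `verify_migration` after every copy; an empty list means every file survived bit for bit.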
The talk acknowledged a commitment to keep 'data available at least until 10 years after LHC shutdown' but did not elaborate. The NGDF itself is a truly distributed facility that is truly walking the Grid walk.
Turning the focus to technologies, Martin set the scene by tracing the history of storage from stone tablets (very low density but huge lifetime) to modern technologies with stupendous densities and very uncertain lifetimes. There are modern technical solutions for dense, very stable, archives, but the market volume favours devices that require 'endless migration'.
Yves addressed how software systems could be kept alive using emulation, virtualisation or continuous test-driven migration (a.k.a. brute force). Emulation makes inefficient use of CPU, but Moore's Law can render the CPU issue moot. Emulation allows Commodore 64 software to be executed today, but (my note) the C64 presented a single stable environment unlike Linux-x86 today. Virtualisation is an attractive way to run a variety of current operating systems. In discussions it emerged that there is no clear evidence that it will be a good solution for a variety of ancient operating systems. To quote Yves: 'Virtualisation alone will not be sufficient,' and 'Everything will be different anyhow.'
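To make the 'continuous test-driven migration' option concrete, here is a minimal sketch of the kind of check it implies: after each rebuild on a new platform, the software's output is compared against frozen reference results. This is my own illustration, not something shown in the talk. The function name, the dictionary layout (a quantity name mapped to a list of numbers, such as histogram bin contents) and the tolerances are all assumptions.

```python
import math


def compare_to_reference(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return a list of human-readable differences between two result sets.

    Both arguments map a quantity name (hypothetical, e.g. a histogram name)
    to a list of floats such as bin contents.
    """
    problems = []
    for name, ref_values in reference.items():
        if name not in candidate:
            problems.append(f"{name}: missing from candidate output")
            continue
        new_values = candidate[name]
        if len(new_values) != len(ref_values):
            problems.append(f"{name}: length {len(new_values)} != {len(ref_values)}")
            continue
        # Compare element by element within the requested tolerance.
        for index, (ref, new) in enumerate(zip(ref_values, new_values)):
            if not math.isclose(ref, new, rel_tol=rel_tol, abs_tol=abs_tol):
                problems.append(f"{name}[{index}]: {new!r} != reference {ref!r}")
    return problems
```

An empty list means the migrated build reproduces the references within tolerance. Choosing that tolerance is the hard part in practice, since a new compiler or maths library can legitimately shift floating-point results in the last digits.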
For me, this was the most fascinating talk of the workshop. It described 'the only example of reviving and still using 25-30 year old data & software in HEP.' JADE was an e+e- experiment at DESY's PETRA collider. The PETRA (and SLAC's PEP) data are unlikely to be superseded, and improved theoretical understanding of QCD (Quantum ChromoDynamics) now allows valuable new physics results to be obtained if it is possible to analyse the old data. Only JADE has succeeded in this, and that by a combination of industry and luck. A sample luck and industry anecdote:
'The file containing the recorded luminosities of each run and fill, was stored on a private account and therefore lost when [the] DESY archive was cleaned up. Jan Olsson, when cleaning up his office in ~1997, found an old ASCII-printout of the luminosity file. Unfortunately, it was printed on green recycling paper - not suitable for scanning and OCR-ing. A secretary at Aachen re-typed it within 4 weeks. A checksum routine found (and recovered) only 4 typos.'
The key conclusion of the talk was: 'archiving & re-use of data & software must be planned while [an] experiment is still in running mode!' The fact that the talk documented how to succeed when no such planning had been done only served to strengthen the conclusion.
This talk laid bare the sociological challenges involved in making data available to outsiders, and, in this case, competitors. The experiments at CERN's LEP e+e- collider could not bring themselves to allow 'insight into each-other's "kitchen".' As a result, the potential for reuse of even this limited Higgs search data is 'strongly restrained.'
LEP (Large Electron-Positron Collider) was shut down at the end of 2000. Although most experiments are still producing final publications, the preservation of the analysis capability is in a sorry state. Andre gave most details about L3. His L3 summary seems widely true:
'Preservation effort started too late. We consider it failed. (However, [the] publication effort was a success!). Among possible reasons for the failure of the preservation effort:
- Effort started too late (after data taking was completed)
- Based on 1-2 persons, not even working 100% on it
- Everybody's analysis code was 'private' (stored in user's directory, not in central storage)
- Inheriting of analysis typically by person-to-person oral training instead of providing documentation
- Private corrections (e.g. additional smearing of MC) often did not go into central code
- People left to other experiments quickly after end of data taking.'
ALEPH is in somewhat better shape. Each institute has a laptop with a frozen analysis system and 2 TB of data, and there is a long-term effort to keep experts accessible. ALEPH has also addressed the 'who can publish' minefield with a policy that collaboration members can publish their own papers on most ALEPH data. All the LEP experiments seemed to feel let down by the long-scheduled dropping of support for the CERNLIB software library.
OPAL reported all the same issues as L3, but were overall in a somewhat better state, for example in having a 'Long-Term Editorial Board' to advise on the validity of analyses.
HEP studies (we hope) immutable physical laws, whereas astronomy studies a changing universe. Thus in astronomy 'every single observation must be kept.' Astronomy culture differs from HEP in other ways: data are made available to the public after some time, and since 1977, the FITS standard has been used as the format for analysis data.
My own reaction, in reporting the astronomy and virtual observatory story, is that I find it deeply disturbing. Can I put their preservation and sharing success down to the simplicity of photographic plates and CCD cameras? Will they fail as spectacularly as HEP on the next generation of space observatories that are nothing less than HEP detectors in space? We must be able to learn something from astronomy, but what is it?
The ROOT software package now has close to a monopoly of support for data persistency and latter-stage analysis in HEP. René presented a picture of continued vigorous development and responsiveness to community needs. For more than a decade, René has called for a strict avoidance of commercial software packages because of their uncertain lifetimes and development cycles. His repeated call resonated with the anecdotes from the experiments.
Salvatore reported on an HEP survey that had been funded by the European Union as a first step towards strategic funding decisions on data preservation. The survey had generated some 1200 responses (one of which was from this reporter) from a reasonably representative set of physicists – the Europe/US, experiment/theory, young/old balances looked much like those in the physics population.
A large majority of respondents thought data preservation was important, but only 16% thought that their experiment or organisation would preserve data, perhaps because 43% thought preservation would cost a significant fraction of the original effort to produce and analyse the data and only 6% thought the cost would be minimal. Apart from technical and cost issues, most respondents were concerned about invalid results being obtained from preserved data and about the extent to which the original owners of the data would get credit for its use.
The European Union has at least two programmes that could fund HEP data preservation efforts. The most promising is the huge FP7 programme that funded the study Salvatore just described, but the funding processes have a long lead time, making the programme more relevant to LHC (Large Hadron Collider) than to experiments that currently need help.
Amber was clear about the DoE/HEP policy on data preservation: 'there isn't one.' Her aim was to listen and learn about all aspects of HEP data preservation including the policies and programmes of other funding agencies.
The Science and Technology Facilities Council funds HEP, nuclear physics, astronomy and space, plus major research facilities in the UK and Europe. Included in its portfolio is the 80-person UK Digital Curation Centre. The picture painted was one of areas of enviable progress, but with very far to go. David noted that STFC was among several UK Research Councils with no reference to a Data Policy on its Web site. He concluded: 'Data Preservation is complex, expensive and unsolved. HEP needs to clarify what they are trying to achieve, understand the costs and potential benefits, and decide if they are worth it.'
The SPIRES database of HEP preprints and publications is used worldwide. Originating at SLAC in the 1970s it is now supported by a US-Europe collaboration and is undergoing vital re-engineering under the new name of INSPIRE.
The e+e-, ep and pp groups met in parallel and reported back the next day.
An implicit background to the e+e- report was the near uniqueness, and consequent long-term value, of e+e- collision data at each centre-of-mass energy. The experiments favoured the creation of a common data definition format, recognising it to be necessary while far from sufficient. Book-keeping tools, luminosity information and community wisdom, such as the BaBar Hypernews, would all be necessary to perform analysis in the long term. BaBar experience with 'amazingly sensitive' code already showed that virtualisation was no silver bullet!
The HERA ep data will remain unique for the foreseeable future. Some principal justifications for preserving the data, such as testing new or improved theories, required the ability to improve the reconstruction of raw data – demanding that all data and software be 'alive'. The parallel session served mainly to set out the issues and the ep experiments planned to meet again soon to make further progress.
The pp discussions focused on the Fermilab proton-antiproton data. The Fermilab experiments essentially lost their Run I data, perhaps justifiably given the greater luminosity of Run II. CDF and D0 plan to keep all Run II raw, reconstructed and analysis data, plus their infrastructure and environment, alive for about five years. By that point there is a good chance that LHC data will have substantially superseded the Run II data.
A good summary of the issues including:
Homer set out a plan for work leading up to the next workshop. In his words:
I proposed a two-and-a-half day workshop, if possible in the early summer of this year. My proposals for the programme structure were discussed vigorously along with ideas from Cristinel Diaconu. The experiments that had just ceased data taking had been the motive force behind this first workshop and they should continue to lead the drive towards common work on data preservation. The next workshop should aim to make real progress on:
The workshop raised far more questions than it delivered answers. Nevertheless, the way forward became reasonably clear. HEP has to acknowledge and act on its responsibility to make explicit decisions on the science and business case for various levels of data preservation. In those areas where data should be preserved, we must strive to organise a cost-effective inter-experiment and inter-funding-agency programme of work.