A Pragmatic Approach to Preferred File Formats for Acquisition

Dave Thompson sets out the pragmatic approach to preferred file formats for long-term preservation used at the Wellcome Library.

This article sets out the Wellcome Library's decision not explicitly to specify preferred file formats for long-term preservation. It discusses a pragmatic approach in which technical appraisal of the material is used to assess the Library's likelihood of preserving one format over another. The Library takes as its starting point work done by the Florida Digital Archive in setting a level of 'confidence' in its preferred formats. The Library's approach provides for nine principles to consider as part of appraisal. These principles balance economically sustainable preservation and intellectual 'value' with the practicalities of working with specific, and especially proprietary, file formats. Scenarios are used to show the application of principles (see <a href="#annex">Annex</a> below).</p> <p>This article will take a technical perspective when assessing material for acquisition by the Library. In reality technical factors are only part of the assessment of material for inclusion in the Library's collections. Other factors such as intellectual content, significance of the material, significance of the donor/creator and any relationship to material already in the Library also play a part. On this basis, the article considers 'original' formats accepted for long-term preservation, and does not consider formats appropriate for dissemination.</p> <p>This reflects the Library's overall approach to working with born digital archival material. Born digital material is treated similarly to other, analogue archival materials. 
The Library expects archivists to apply their professional skills regardless of the format of any material, to make choices and decisions about material based on a range of factors and not to see the technical issues surrounding born digital archival material as in any way limiting.</p> <h2 id="Why_Worry_about_Formats">Why Worry about Formats?</h2> <p>Institutions looking to preserve born digital material permanently, the Wellcome Library included, may have little control over the formats in which material is transferred or deposited. From a preservation perspective, the ideal point of intervention is when digital material is first created. However, this may be unrealistic. Many people working within organisations have no choice in the applications they use; the cost of applications may be an issue; or there may simply be a limited number of applications available with which to perform specialist tasks. Material donated after an individual retires or dies can prove especially problematic. It may be obsolete, in obscure formats, on obsolete media and without any metadata describing its context, creation or rendering environment.</p> <p>Computer applications 'save' their data in formats, each application typically having its own file format. The Web site filext [<a href="#1">1</a>] lists some 25,000 file extensions in its database.</p> <p>The long-term preservation of any format depends on the type of format, issues of obsolescence, and the availability of hardware and/or software, resources, experience and expertise. 
Any archive looking to preserve born digital archival material needs to have the means and confidence to move material across the 'gap' that exists between material 'in the wild' and holding it securely in an archive.</p> <p>This presents a number of problems: first, the proliferation of file formats; second, the use of proprietary file formats; and third, formats becoming obsolete, either by being incompatible with later versions of the applications that created them, or by those applications no longer existing. This assumes that proprietary formats are more problematic to preserve because their structure and composition are not known, which hinders preservation intervention by requiring specialist expertise. Moreover, as new software is created, new file formats proliferate and exacerbate the problem. This article reports its proceedings. The event took <em>Dealing with Sensitive Data: Managing Ethics, Security and Trust</em> as its theme [<a href="#3">3</a>].</p> <h2 id="Day_1:_10_March_2010">Day 1: 10 March 2010</h2> <p>DCC Associate Director <strong>Liz Lyon</strong> and RIN Head of Programmes <strong>Stéphane Goldstein</strong> welcomed the 45 delegates to the event, and began by introducing the keynote speaker, <strong>Iain Buchan</strong>, Professor of Public Health Informatics and Director of the Northwest Institute for Bio-Health Informatics (NIBHI), University of Manchester.</p> <p>Iain's talk was entitled <em>Opening Bio-Health Data and Models Securely and Effectively for Public Benefit</em>, and addressed three main questions:</p> <ol> <li>Where does the public's health need digital innovation?</li> <li>How can research curators promote this innovation (and what are the implications for Ethics, Security and Trust)?</li> <li>Is a framework required (covering the Social Contract and a digital and operational infrastructure)?</li> </ol> <p>A major theme in contemporary healthcare is that of <em>prevention</em>, and 
the need for proactive 'citizen buy-in' in order to avert NHS bankruptcy, a need supported by the use of 'persuasive technologies.' There is, however, a disconnect between the proactive public health model and the reactive clinical model, and between expectations and available resource. 'Digital bridges', composed of new information technologies, are used to close the gaps between primary and secondary care, and to link disease-specific pathways.</p> <p>Iain touched on the impact that the data deluge is having on healthcare, reflecting that knowledge can no longer be managed solely by reading scholarly papers: the datasets and structures now extend far beyond any single study's observations. It is now necessary to build data-centred models, and to interrogate them for clusters via dedicated algorithms.</p> <p>However, there are holes in the datasets – for example, clinical trials exclude women of childbearing age and subjects undergoing certain treatments – hence electronic health records must be mined in order to fill these gaps. This can be hampered by a lack of useful metadata, leading to 'healthcare data tombs': repositories of health records lacking the contextual information to make them useful. Such data resources may be worse than useless: they may be misinformation.</p> <p>Comprehensible social networks with user-friendly interfaces can be used to improve the quality of metadata, based on the principle that more frequent use leads to better quality information. These networks can also bridge the Balkanisation that can occur when different groups tackle the same issue from varying standpoints (e.g. examining obesity from dietary- and exercise-based perspectives, but not sharing data across these boundaries). The vision is for a joint, open, unifying and interdisciplinary framework and understanding wherein resources and expertise are shared. 
Of course, crossing these divides is accompanied by a raft of trust and security issues, and Iain described the various measures that are implemented to cope with them.</p> <p>Iain discussed the ethical issues surrounding wider use of health record information across the NHS, including consent (opt-in versus opt-out), the right (or lack thereof) of an investigator to go to a patient directly, and – perhaps most controversially – whether it was actually <em>unethical</em> to allow a health dataset to go under-exploited. If this is indeed the case, it follows that there is a real need to audit the demonstrable good that is derived from datasets.