Data Archive at the University of Essex
This paper provides a background to the development and ongoing activities of data archives in general and The Data Archive (formerly known as the ESRC Data Archive) at the University of Essex in particular. It describes the main activities involved in running the Archive and explores the benefits for both data producers and data users of a central repository of data. It touches on the growth of The Data Archive in recent years. The Data Archive's main information systems are described in sufficient detail to allow the reader to go away and explore its catalogues and indexes. The paper ends with a reference to some of the challenges which face those who have responsibility for the care of digital materials.
Background to the UK Data Archive
The UK Data Archive at the University of Essex is one of a worldwide network of data archives which had their origins in preserving and providing access to social science data for use by the academic community. Like many of these archives, The Data Archive has expanded its role beyond social science data and now serves users outside the academic community. It was established 30 years ago at the University of Essex, a university with especial strengths in social science and economics. The Data Archive exists to promote wider and more informed use of data in research and teaching and to preserve these data so that they continue to be accessible over time. Its holdings are acquired from a wide variety of sources including central and local government, academia, independent research agencies and commercial sources such as market research agencies. Many of the activities of The Data Archive will be familiar to librarians and keepers of more conventional paper archives. They include:
- establishing user needs to determine what data should be acquired
- negotiating to acquire the data and to determine the conditions of access
- clarifying any confidentiality restrictions
- assisting data providers to create documentation required by secondary users
- validation of data and documentation
- supplementing documentation by adding information on format, media, conditions of access, and on the outcome of quality checks
- preservation of the data and documentation on different media and the establishment of a programme to check for data corruption, to refresh existing media and to migrate onto new media
- cataloguing and indexing by professional staff
- reformatting the data, digitising documentation and delivering data and documentation to users
- promoting use and supporting users.
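The programme of checking for data corruption mentioned in the list above can be illustrated with a small fixity check. This is only a sketch under stated assumptions, not a description of the Archive's own procedures: the choice of SHA-256, the directory layout and the function names (sha256_of, build_manifest, find_corrupted) are all hypothetical. The idea is to record a digest for each file when it is accepted, and to recompute and compare the digests whenever media are checked, refreshed or migrated.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a digest for every file under root (hypothetical layout)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_corrupted(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or whose digest has changed."""
    damaged = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(name)
    return damaged
```

A manifest built at the time of deposit would be stored apart from the data. Running find_corrupted against a refreshed or migrated copy then lists any file that no longer matches its recorded digest.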
Benefits to users and producers of data
The benefits to users of having access to The Data Archive are fairly obvious. They obtain expensive resources cheaply - often these are data which they could not have collected themselves such as census material or data which are by-products of administrative processes. High quality research is promoted as a result of this access. The re-analysis of data from a different perspective is encouraged. The access to data in electronic form permits a level and depth of analysis which cannot be undertaken with published material.
It is critical that data producers should be aware of the benefits to them of sharing data if we are to continue to persuade them to make their data available. These are discussed below.
There are strong altruistic reasons for depositing data so that secondary analysts have access to them. In this way the data producers can contribute to the development of knowledge by ensuring their data are exploited to their full potential. Secondary research facilitates multiple perspectives upon data which have often been collected to address a narrower range of questions. Similarly comparative research can be encouraged by the preservation of multiple data sets for access.
A further altruistic reason for providing access to data is that it assists in the training of empirical social researchers. Often data archives can be actively involved in teaching or in setting up teaching opportunities. A large number of the social science data archives run summer schools on empirical social research. The building and sharing of teaching materials can be carried out by data archives taking advantage of their links into many academic networks. The recent establishment of EU funded large scale facilities in the social sciences at the Central Archive (ZA) in Germany and the UK Data Archive will also help to promote the use of data in teaching.
The Data Archive assists in the promotion of data through catalogues (often held electronically), through links with other archives and data suppliers, by submitting material to relevant newsletters and e-mail lists, by running data workshops and by giving presentations at conferences. Assistance by user services staff helps to ensure that informed use is made of the data whilst data providers are cushioned from the demands of users who have queries on the data and how to use them. In this way The Data Archive acts as a buffer between users and producers of data. This is an especially useful role because many queries and problems are unrelated to the data. Many potential users of data have little experience of computing or statistical analysis and often have limited assistance at their own institutions. Supporting users is time consuming and requires an understanding of their needs. The Data Archive can draw on resources for funding support which may not be available to producers.
Data producers are increasingly interested in forging links with users, in order to take advantage of users' expertise and to create a community of knowledgeable data users. The Data Archive assists with the establishment of this relationship, which can be very useful to data providers. They might consult this 'expert group', get feedback on use, especially relating to policy relevant research, and have access to a community of supporters who will fight alongside them when their resources and therefore their data are under threat.
The supply of data for secondary analysis reduces the need to collect data afresh and thus reduces respondent burden. Compliance costs are a concern particularly when data are required from small populations such as surveys of businesses or elites.
The Data Archive improves the accessibility of data by employing demand led distribution systems and by integrating different datasets. Value is added to data directly by The Data Archive staff or by requiring users of the data to redeposit data to which they have added value. This might be by adding contextual information, improving or advising on documentation, reformatting data for delivery, extracting subsets of data and documentation, providing systems to permit data to be visualised, browsed and extracts selected. An important attraction of giving access to data for secondary analysis is that credit will accrue to the depositor. We try to ensure that this happens by specifying that acknowledgement must take place and advising on the wording of citations. The Data Archive periodically writes to journal editors to alert them to the requirement to cite data sources.
In order to persuade data providers to deposit data it is vital that we ensure that their conditions of access are carried out. In some situations this can involve implementing controls over use and occasionally charges for data must be collected. It is also important that we are sensitive to confidentiality issues.
The Data Archive understands the importance of preservation of data. We and our sister archives have built our reputation on the fact that we can preserve the electronic information in a way which permits both data and documentation to be accessible over time. The data management and preservation system must ensure:
- physical reliability of digital information
- security of data and documentation from unauthorised use
- on-going usability of data & documentation
- integration of the data into information and delivery systems.
Management of data with very variable access regimes requires expertise, equipment and operational systems as well as trust and credibility. Since very few data providers have built the expertise and facilities needed to preserve data so that they can be read over time despite changes to hardware and software environments, a major advantage is achieved by giving depositors priority access to their own data.
Growth in the use of The Data Archive
After many years of reasonably steady and manageable growth in the use of The Data Archive, demands for its services as a secure place of deposit and as a source of data have risen sharply. Use has tripled over the last five years.
The increase in use can be explained by:
- increasing amount of electronic data being generated
- lack of expertise and relevant facilities in data preservation amongst data producers
- widespread computing access which has expanded the field of users
- academic requirements to maximise research outputs which, together with the expansion of universities, have led to a greater demand for data
- the increasing acceptance of secondary analysis as a legitimate methodology in a variety of disciplines
- lack of money for primary data collection which has resulted in a greater emphasis on using existing data
- the growing recognition worldwide that data should be exploited more effectively and the acknowledgement that not to use data or to use inadequate data has costs for society as a whole.
This last consideration must be emphasised as it has led to a climate where the idea of freedom of information is gaining ground over a culture of restricted data access.
How can the growth be managed?
The growth in use of The Data Archive has in part been managed by the use of automation for as many of the internal procedures as possible leading to a more efficient organisation. It has become necessary to prioritise the acquisition of data since it is simply not possible to take responsibility for more than a small proportion of the data generated in electronic form.
Potential usage is judged by consultations with relevant members of the data using community and with the data suppliers themselves and by keeping full records on the level and type of use of past datasets of a similar nature.
Resources have been expanded by diversifying the funding base, made feasible because of the wider role played by The Data Archive. At the same time efforts have been made to reduce the workload by ensuring that data depositors have guidelines on what is required of them in terms of documentation and quality checks on the data and by forming partnerships with other organisations with an interest in data supply. Thus some of the load of data dissemination is shared with the national academic computing service located at the University of Manchester. Similarly, resources are shared across the European data archives - this is discussed further below. New specialist facilities have been established within the framework of The Data Archive to meet the needs of particular users. These include:
- r-cade (the resource centre for access to data on Europe) established to meet the growing demands for comparative data for different European countries by providing access to data drawn from a variety of European and international agencies via an on-line system of access
- The History Data Service which exists in order to promote a culture of data sharing amongst historical researchers and to give them access to rich resources of machine readable material
- V-P Lab - the virtual psychology laboratory recently set up jointly by the University of Cardiff, the centre for teaching initiatives in psychology at the University of York and The Data Archive in order to preserve and make available, via the internet, psychological experiments together with their software environment.
Questions to be addressed by the VP-Lab Project include how such data should be selected and whether the cataloguing systems developed for other social science data are of relevance to psychology experiments, as well as the particularly difficult issue of how to preserve data which cannot be understood without their related software.
The Data Archive has a recent initiative to try to reduce the heavy load of supporting relatively naïve users by establishing a network of academic organisational representatives who will be given special training and information packs to enable them to supply local expertise and assistance.
A programme of digitisation of documentation is enabling us to plan expanded on-line catalogues which incorporate documentation. For those data with open access conditions such systems can incorporate the actual data too. In the not too distant future we envisage being able to extend this to other data once the security of internet access allows better controls to be operated. In anticipation of such developments we are exploring the use of on-line data browsing, visualisation, extraction and reformatting facilities to enable users to create their own customised datasets.
The use of the internet to provide information about Archive services and to deliver administrative forms to potential users has been of great assistance in reducing the telephone and mail queries we receive. This is being extended in order to streamline the system of access. A major component of the information system now available on the internet is the Archive's catalogue and subject index, provided in a retrieval system known as BIRON and described below.
Bibliographic information retrieval: BIRON
The catalogue and indexes with their associated thesaurus form part of the BIRON (Bibliographic Information Retrieval On-line) system. The user interface of BIRON 4 utilises the World Wide Web and is simple and intuitive to use. The thesaurus which forms part of BIRON can also be viewed independently via the Web. This thesaurus is known as HASSET (Humanities and Social Sciences Electronic Thesaurus).
The databases are accessed many thousands of times every year by users from across the world. BIRON consists of descriptive information about studies held, not the datasets themselves, which have to be ordered from the Archive in a separate process.
How does BIRON work?
Subject searches may be carried out by the use of keywords or by searches on subject categories. Subject categories are assigned on the basis of the broad subject coverage of the dataset as a whole while keywords are assigned to individual variables or questions within the dataset. There are several thousand potential keywords to choose from but only approximately 24 major subject categories.
Keyword searches may be carried out using either HASSET (the thesaural interface) or BIRON. In both cases the user is prompted to type in a word or phrase describing the topic for which data are sought. This term is matched against a list of several thousand descriptive terms arranged in associated groups within the thesaurus. These terms are derived from an examination of the questionnaires or data dictionaries associated with each dataset in the Archive and if the search is successful, will retrieve those records which have been indexed with the search term. Keywords are assigned on the literal meaning of the questions or variables they represent and no attempt is made to index any theoretical concepts which the questions may have been designed to measure.
If an exact match is found, the user is told how many studies have been indexed with the matching term. The user may then choose to view the descriptions of the studies retrieved, or may view other associated terms which might assist in focusing the search. If no exact match is found, lists of similarly spelled words are offered for selection and the process of matching begins there. A combined search using boolean operators may be carried out with retrieved searches, or a nested option may be used, allowing further searches to be made on a retrieved subset of records.
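The matching and combining steps described above can be sketched in a few lines of Python. This is an illustration only and makes no claim about how BIRON is implemented: the index contents, the study numbers, the function names (lookup, combine, nested) and the choice of AND, OR and NOT as the boolean operators are all assumptions, and difflib from the standard library stands in for whatever method BIRON uses to find similarly spelled words.

```python
from difflib import get_close_matches

# Hypothetical index: keyword -> set of study numbers indexed with it.
KEYWORD_INDEX = {
    "unemployment": {1001, 1002, 1005},
    "housing": {1002, 1003},
    "income": {1001, 1003, 1005},
}


def lookup(term, index=KEYWORD_INDEX):
    """Return (studies, suggestions) for a typed word or phrase."""
    key = term.strip().lower()
    if key in index:
        # Exact match: the caller can report how many studies were found.
        return set(index[key]), []
    # No exact match: offer similarly spelled terms for selection.
    return set(), get_close_matches(key, list(index), n=5, cutoff=0.6)


def combine(left, operator, right):
    """Combine two retrieved sets of studies with a boolean operator."""
    if operator == "AND":
        return left & right
    if operator == "OR":
        return left | right
    if operator == "NOT":
        return left - right
    raise ValueError(f"unknown operator: {operator}")


def nested(subset, term, index=KEYWORD_INDEX):
    """Search again, restricted to an already retrieved subset."""
    studies, _ = lookup(term, index)
    return subset & studies
```

In this sketch a combined search joins two completed searches, for example combine(a, "AND", b), whereas a nested search narrows a subset that has already been retrieved. With set-based retrieval the two amount to the same intersection; the difference lies in how the user arrives at it.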
Subject category searches involve choosing from a list of broad categories.
Catalogue searches
BIRON may be used to search for the names of persons or organisations associated with particular studies, titles or part-titles, dates and geographical areas of data collection. These may be combined in various ways as described above.
What information is retrieved?
If the search is successful a list of one or more study titles may be viewed at the end of the search. Users may then bring up on screen all the public information recorded about that study. The information includes a list of indexing terms showing all the topics covered by the data and a catalogue record giving the Archive number, the title, access conditions, data processing codes, the names of principal investigators, data collectors, sponsors and depositors, an abstract detailing the main purposes of the research and main variables. Dates, geographical areas, populations, data collection methodology are also displayed. Note that BIRON consists of information about studies held, not the actual data available for analysis.
Internal uses of BIRON databases
BIRON has associated databases used for internal administration: records of users, file locations and documentation are already included and more administrative databases are planned. The system is an important source of easily-extracted performance indicators and developments in this direction are continuing.
Printed outputs from BIRON
The Archive produces a variety of lists and catalogues. All lists of titles and catalogue descriptions are selected using the BIRON information retrieval procedures and then output using special purpose formatting software. All 'back of book' indexes for catalogues are produced automatically, indexing fields having been checked against authority lists at the time of initial cataloguing.
The Integrated Data Catalogue (IDC)
The IDC provides a quick and simple method of searching the data catalogues of several European data archives including a version of The Data Archive's catalogue. This version is output from BIRON and the information it contains is exactly the same as is found in BIRON, but searching methods are much simplified. Because the IDC uses entirely different searching methods, the results of a search may differ considerably from those obtained in BIRON. Where a high degree of accuracy is required it is preferable to use BIRON.
HASSET (Humanities and Social Science Electronic Thesaurus)
The subject retrieval within BIRON is driven by a thesaurus known as HASSET. HASSET may be viewed separately from BIRON and may be used externally with acknowledgements of the Archive and Unesco (whose thesaurus forms the basis of HASSET), by anyone wishing to undertake indexing or keywording tasks. It should be recognised, however, that HASSET contains only terms which have been used in indexing the Archive's collection and does not attempt to be a universal thesaurus.
Accessing BIRON, IDC and HASSET
The recommended way to access BIRON is via the Archive's Home Page.
Council of European Social Science Data Archives
The UK Data Archive which has been the focus of this paper is one of a number of national archives with their roots in the social sciences. The network of such archives has been vital to their development. Of especial importance is the Council of European Social Science Data Archives (CESSDA). Member archives share expertise and assist one another in staff training, a major activity being an annual workshop hosted by one of the archives on a specialist topic and attended by relevant archive staff.
We work together on joint undertakings such as the EU funded project to develop an integrated European catalogue of data - a one stop shop - underpinned by agreements to exchange data. In order to avoid the duplication of effort, CESSDA members specialise in different areas so that, for example, one archive might concentrate on demographic data whilst another concentrates on election studies. CESSDA has also been very active in getting new archives established to 'fill in the gaps' so that every nation has at least one facility and can participate in the European network.
This is an excellent model for the future development of archival and dissemination facilities in other areas.
The challenges of the future
The challenges which faced The Data Archive in its infancy thirty years ago were concerned with persuading data producers to provide access to their data and with extending the range of data users.
These challenges have been met with a large degree of success, although the packaging of data to make them more accessible to a wider range of non-expert users, through a wider range of delivery mechanisms, has only just begun and will involve exciting developments.
Bibliographic control of data within data archives has largely been achieved but the description of internet resources more generally is a subject of evolving standards which will require widely cooperative efforts to achieve results which will continue to be useful into the future. The rapidly evolving technical infrastructure requires great flexibility and forethought on the part of the setters of standards in order to achieve useful results.
The Archive has had thirty years of experience in the preservation of materials within its control, including both data and documentation. The issue of preservation of electronic materials is becoming central to a wider community as the digitisation of paper proceeds rapidly. The burgeoning numbers of records which are created electronically and which have only ever existed in digital form add to the sense of urgency. The Data Archive is well placed to share its expertise in preservation with others who have come to it more recently and to take part in the evolution of standards in this area. The rapid advance of technology and the speed of development and change in hardware and software systems make preservation an ongoing challenge, however.
 The Data Archive, Web site,