Denise Lievesley and Bridget Winstanley from The Data Archive, University of Essex, outline the philosophy behind, and need for, The Data Archive.
It is critical that data producers should be aware of the benefits to them of sharing data if we are to continue to persuade them to make their data available. These are discussed below.
There are strong altruistic reasons for depositing data so that secondary analysts have access to them. In this way, data producers can contribute to the development of knowledge by ensuring their data are exploited to their full potential. Secondary research brings multiple perspectives to data which have often been collected to address a narrower range of questions. Similarly, comparative research can be encouraged by the preservation of multiple data sets for access.
A further altruistic reason for providing access to data is that it assists in the training of empirical social researchers. Often data archives can be actively involved in teaching or in setting up teaching opportunities. A large number of the social science data archives run summer schools on empirical social research. The building and sharing of teaching materials can be carried out by data archives taking advantage of their links into many academic networks. The recent establishment of EU funded large scale facilities in the social sciences at the Central Archive (ZA) in Germany and the UK Data Archive will also help to promote the use of data in teaching.
The Data Archive promotes data through catalogues (often held electronically), through links with other archives and data suppliers, by submitting material to relevant newsletters and e-mail lists, by running data workshops, and by giving presentations at conferences. Assistance by user services staff helps to ensure that informed use is made of the data, whilst data providers are cushioned from the demands of users who have queries on the data and how to use them. In this way The Data Archive acts as a buffer between users and producers of data. This is an especially useful role because many queries and problems are unrelated to the data. Many potential users of data have little experience of computing or statistical analysis and often have limited assistance at their own institutions. Supporting users is time consuming and requires an understanding of their needs. The Data Archive can draw on resources for funding support which may not be available to producers.
Data producers are increasingly interested in forging links with users, in order to take advantage of users' expertise and to create a community of knowledgeable data users. The Data Archive assists with the establishment of this relationship, which can be very useful to data providers. They might consult this 'expert group', get feedback on use (especially use relating to policy-relevant research) and have access to a community of supporters who will fight with them when their resources, and therefore their data, are under threat.
The supply of data for secondary analysis reduces the need to collect data afresh and thus reduces respondent burden. Compliance costs are a particular concern when data are required from small populations, as in surveys of businesses or elites.
The Data Archive improves the accessibility of data by employing demand-led distribution systems and by integrating different datasets. Value is added to data directly by The Data Archive staff or by requiring users of the data to redeposit data to which they have added value. This might be by adding contextual information, improving or advising on documentation, reformatting data for delivery, extracting subsets of data and documentation, or providing systems that permit data to be visualised and browsed and extracts to be selected.
An important attraction of giving access to data for secondary analysis is that credit will accrue to the depositor. We try to ensure that this happens by specifying that acknowledgement must take place and advising on the wording of citations. The Data Archive periodically writes to journal editors to alert them to the requirement to cite data sources.
In order to persuade data providers to deposit data it is vital that we ensure that their conditions of access are carried out. In some situations this can involve implementing controls over use and occasionally charges for data must be collected. It is also important that we are sensitive to confidentiality issues.
The Data Archive understands the importance of preservation of data. We and our sister archives have built our reputation on the fact that we can preserve the electronic information in a way which permits both data and documentation to be accessible over time. The data management and preservation system must ensure:
The increase in use can be explained by:
Potential usage is judged by consultations with relevant members of the data using community and with the data suppliers themselves and by keeping full records on the level and type of use of past datasets of a similar nature.
Resources have been expanded by diversifying the funding base, made feasible because of the wider role played by The Data Archive. At the same time efforts have been made to reduce the workload by ensuring that data depositors have guidelines on what is required of them in terms of documentation and quality checks on the data and by forming partnerships with other organisations with an interest in data supply. Thus some of the load of data dissemination is shared with the national academic computing service located at the University of Manchester. Similarly resources are shared across the European data archives - this will be expanded below. New specialist facilities have been established within the framework of The Data Archive to meet the needs of particular users. These include:
The Data Archive has a recent initiative to try to reduce the heavy load of supporting relatively naïve users by establishing a network of academic organisational representatives who will be given special training and information packs to enable them to supply local expertise and assistance.
A programme of digitisation of documentation is enabling us to plan expanded on-line catalogues which incorporate documentation. For those data with open access conditions such systems can incorporate the actual data too. In the not too distant future we envisage being able to extend this to other data once the security of internet access allows better controls to be operated. In anticipation of such developments we are exploring the use of on-line data browsing, visualisation, extraction and reformatting facilities to enable users to create their own customised datasets.
The use of the internet to provide information about Archive services and to deliver administrative forms to potential users has been of great assistance in reducing the telephone and mail queries we receive. This is being extended in order to streamline the system of access. A major component of the information system now available on the internet is the Archive’s catalogue and subject index, provided in a retrieval system known as BIRON and described below.
There are many thousands of accesses to the databases every year and users come from across the world. BIRON consists of descriptive information about studies held, not the datasets themselves, which have to be ordered from the Archive in a separate process.
Keyword searches may be carried out using either HASSET (the thesaural interface) or BIRON. In both cases the user is prompted to type in a word or phrase describing the topic for which data are sought. This term is matched against a list of several thousand descriptive terms arranged in associated groups within the thesaurus. These terms are derived from an examination of the questionnaires or data dictionaries associated with each dataset in the Archive. If the search is successful, it retrieves those records which have been indexed with the search term. Keywords are assigned on the literal meaning of the questions or variables they represent, and no attempt is made to index any theoretical concepts which the questions may have been designed to measure.
If an exact match is found, the user is told how many studies have been indexed with the matching term. The user may then choose to view the descriptions of the studies retrieved, or may view other associated terms which might assist in focusing the search. If no exact match is found, lists of similarly spelled words are offered for selection and the process of matching begins there. A combined search using boolean operators may be carried out with retrieved searches, or a nested option may be used, allowing further searches to be made on a retrieved subset of records.
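The search behaviour described above can be illustrated with a minimal Python sketch. It is not BIRON's actual implementation: the index contents, the study numbers and the function names are all hypothetical, and the standard library's difflib is used here as a stand-in for whatever spelling-similarity method BIRON applies.

```python
from difflib import get_close_matches

# Hypothetical thesaurus index: descriptive term -> set of study numbers.
# Both the terms and the study numbers are invented for illustration.
INDEX = {
    "unemployment": {1001, 1002, 1005},
    "employment": {1002, 1003},
    "housing": {1004, 1005},
    "elections": {1006},
}


def keyword_search(term, index=INDEX):
    """Return (study numbers, suggestions) for a search term.

    An exact match returns the studies indexed with the term and no
    suggestions. Otherwise the study set is empty and a list of
    similarly spelled terms is offered for selection.
    """
    key = term.strip().lower()
    if key in index:
        return set(index[key]), []
    return set(), get_close_matches(key, list(index), n=5, cutoff=0.6)


def combine(op, left, right):
    """Combine two retrieved sets of studies with a boolean operator."""
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    if op == "NOT":
        return left - right
    raise ValueError(f"unknown operator: {op}")


def nested_search(term, subset, index=INDEX):
    """Run a further keyword search restricted to an already retrieved subset."""
    studies, suggestions = keyword_search(term, index)
    return studies & subset, suggestions
```

In this sketch, the size of the set returned by keyword_search corresponds to the number of studies reported to the user, and a misspelled term such as "housng" would yield an empty set together with "housing" among the suggestions.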
Subject category searches involve choosing from a list of broad categories.
Catalogue searches
BIRON may be used to search for the names of persons or organisations associated with particular studies, titles or part-titles, dates and geographical areas of data collection. These may be combined in various ways as described above.
What information is retrieved?
If the search is successful a list of one or more study titles may be viewed at the end of the search. Users may then bring up on screen all the public information recorded about that study. The information includes a list of indexing terms showing all the topics covered by the data and a catalogue record giving the Archive number, the title, access conditions, data processing codes, the names of principal investigators, data collectors, sponsors and depositors, and an abstract detailing the main purposes of the research and main variables. Dates, geographical areas, populations and data collection methodology are also displayed. Note that BIRON consists of information about studies held, not the actual data available for analysis.
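As a rough illustration of a catalogue record and a combined catalogue search, the following Python sketch models a small subset of the fields listed above. The class name, field names and matching rules are assumptions made for this example and do not reflect BIRON's internal record format.

```python
from dataclasses import dataclass, field


@dataclass
class StudyRecord:
    """Hypothetical, simplified catalogue record for one study."""
    archive_number: int
    title: str
    principal_investigators: list = field(default_factory=list)
    depositors: list = field(default_factory=list)
    collection_years: tuple = (0, 0)  # (first year, last year) of data collection
    geographical_areas: list = field(default_factory=list)
    index_terms: list = field(default_factory=list)


def search_catalogue(records, title_part=None, person=None, year=None, area=None):
    """Return the records that satisfy every criterion supplied (implicit AND)."""
    results = []
    for rec in records:
        # Part-title search: case-insensitive substring match on the title.
        if title_part and title_part.lower() not in rec.title.lower():
            continue
        # Person or organisation search across investigators and depositors.
        if person:
            names = rec.principal_investigators + rec.depositors
            if not any(person.lower() in name.lower() for name in names):
                continue
        # Date search: the year must fall within the collection period.
        if year is not None:
            first, last = rec.collection_years
            if not first <= year <= last:
                continue
        # Geographical search: case-insensitive match on an area name.
        if area and area.lower() not in [a.lower() for a in rec.geographical_areas]:
            continue
        results.append(rec)
    return results
```

Any criterion left as None is ignored, so the same function covers a single-field search and a search combining several fields.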
Internal uses of BIRON databases
BIRON has associated databases used for internal administration: records of users, file locations and documentation are already included and more administrative databases are planned. The system is an important source of easily-extracted performance indicators and developments in this direction are continuing.
Printed outputs from BIRON
The Archive produces a variety of lists and catalogues. All lists of titles and catalogue descriptions are selected using the BIRON information retrieval procedures and then output using special purpose formatting software. All 'back of book' indexes for catalogues are produced automatically, indexing fields having been checked against authority lists at the time of initial cataloguing.
The Integrated Data Catalogue (IDC)
The IDC provides a quick and simple method of searching the data catalogues of several European data archives, including a version of The Data Archive’s catalogue. This version is output from BIRON and the information it contains is exactly the same as is found in BIRON, but searching methods are much simplified. Because it uses entirely different searching methods, the results may differ considerably. Where a high degree of accuracy is required it is preferable to use BIRON.
HASSET (Humanities and Social Science Electronic Thesaurus)
The subject retrieval within BIRON is driven by a thesaurus known as HASSET. HASSET may be viewed separately from BIRON and may be used externally, with acknowledgements of the Archive and Unesco (whose thesaurus forms the basis of HASSET), by anyone wishing to undertake indexing or keywording tasks. It should be recognised, however, that HASSET contains only terms which have been used in indexing the Archive’s collection and does not attempt to be a universal thesaurus.
Accessing BIRON, IDC and HASSET
The recommended way to access BIRON is via the Archive's Home Page [1].
We work together on joint undertakings such as the EU funded project to develop an integrated European catalogue of data, a 'one-stop shop' underpinned by agreements to exchange data. In order to avoid duplication of effort, CESSDA members specialise in different areas, so that one archive might concentrate on demographic data whilst another concentrates on election studies, for example. CESSDA has also been very active in getting new archives established to 'fill in the gaps' so that every nation has at least one facility and can participate in the European network.
This is an excellent model for the future development of archival and dissemination facilities in other areas.
These challenges have been met with a large degree of success, although the packaging of data to become more accessible to a wider range of non-expert users and with a wider range of delivery mechanisms has only just begun and will involve exciting developments.
Bibliographic control of data within data archives has largely been achieved but the description of internet resources more generally is a subject of evolving standards which will require widely cooperative efforts to achieve results which will continue to be useful into the future. The rapidly evolving technical infrastructure requires great flexibility and forethought on the part of the setters of standards in order to achieve useful results.
The Archive has had thirty years of experience in the preservation of materials within its control, including both data and documentation. The issue of preservation of electronic materials is becoming central to a wider community as the digitisation of paper proceeds rapidly. The burgeoning numbers of records which are created electronically and which have only ever existed in digital form add to the sense of urgency. The Data Archive is well placed to share its expertise in preservation with others who have come to it more recently and to take part in the evolution of standards in this area. The rapid advance of technology and the speed of development and change in hardware and software systems make preservation an ongoing challenge, however.
[1] The Data Archive, Web site, <http://dawww.essex.ac.uk/>
Material on this page is copyright Ariadne/original authors. This article last updated/links checked on January 27th 1997