Efforts to archive a large amount of digital material are being developed by many cultural heritage institutions. We have evidence of this in the numerous initiatives aiming to harvest the Web [1-5] together with the impressive burgeoning of institutional repositories. However, getting the material inside the archive is just the beginning for any initiative concerned with the long-term preservation of digital materials.
Digital preservation can best be described as the activity or set of activities that enable digital information to be intelligible for long periods of time. In general, digital information kept in an archival environment is expected to be readable and interpretable for periods of time much longer than the expected lifetime of the individual hardware and software components that comprise the repository system, as well as the formats in which the items of information are encoded.
Over the past decade, a vast number of preservation strategies have emerged from the various preservation projects developed, literally, all over the world [9-12]. Nonetheless, the most cited and applied preservation strategy continues to be migration, especially in contexts where non-interactive digital objects, such as images, databases or text documents, are the focus of preservation.
Migration can be described as a '(...) set of organized tasks designed to achieve the periodic transfer of digital materials from one hardware/software configuration to another or from one generation of computer technology to a subsequent generation.'
The major drawback in this approach is that whenever an object is converted to a new format, some of its original properties may not be adequately transferred to the target format. This may occur due to incompatibilities between the source and target formats or because the application used to do the conversion is not capable of carrying out its tasks correctly.
Every preservation intervention involves choices. Resources are finite, often scarce. As a result, decisions have to be taken to ensure that the best possible preservation strategy is selected from the wide range of options available. These decisions depend upon a multiplicity of factors such as: technical expertise, users' expectations, institutional budget, existing equipment and available time. In this respect migration-based strategies are no different.
In order to better understand all the steps involved in a migration process, one should consider the following sequence of activities:
Two decisions have to be made prior to any object migration: which format should be used to accommodate the properties of the original object; and which application should be used to carry out that migration. This decision-making activity constitutes the first stage of any migration process. It is in the best interests of the preserving institution to aim for the optimal combination of target format and conversion software, i.e. one that preserves the maximum number of properties of the original object at the minimum cost.
Cost should be regarded as a multi-dimensional variable. Factors such as throughput, application charges, format openness or prevalence should be considered collectively during this decision-making activity. Objective tools or frameworks especially designed to help institutions in the selection of appropriate options would greatly simplify this exceptionally complicated task.
The conversion work consists of the reorganisation of the information elements that comprise the digital object into the logical structures as defined by a different format.
From the preserver's point-of-view, carrying out a conversion usually consists in setting up a conversion application and executing it against a collection of digital objects. Some scripts may have to be developed in order to automate the whole procedure.
After the conversion process, the resultant objects should be evaluated in order to determine the amount of data loss incurred during migration. This is accomplished by comparing the properties that comprise the source object (also known as significant properties) with the properties of its converted counterparts. If the evaluation results are below expectations, i.e. the object's properties have degraded to an unacceptable level, a different migration alternative should be selected and the whole process reinitiated.
In most cases, the evaluation process still requires a considerable amount of manual labour. Certain subjective properties such as the disposition of graphic elements in a text document or the presence of compression artifacts in an image file are generally inspected by human experts, rendering this activity both onerous and time-consuming.
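The comparison of significant properties described above can be illustrated with a minimal sketch. It assumes that the properties of each object have already been extracted into a dictionary; the property names, the scoring rule and the 0.95 threshold are hypothetical choices made for illustration, not part of the proposed architecture.

```python
from numbers import Real
from typing import Any, Dict


def property_preservation(source: Dict[str, Any], converted: Dict[str, Any]) -> Dict[str, float]:
    """Score how well each property of the source survived, from 0.0 to 1.0."""
    scores: Dict[str, float] = {}
    for name, original in source.items():
        if name not in converted:
            # The property disappeared altogether during conversion.
            scores[name] = 0.0
            continue
        migrated = converted[name]
        numeric = (
            isinstance(original, Real) and isinstance(migrated, Real)
            and not isinstance(original, bool) and not isinstance(migrated, bool)
        )
        if numeric:
            # Relative difference, clamped so the score never drops below 0.
            largest = max(abs(original), abs(migrated))
            loss = 0.0 if largest == 0 else abs(original - migrated) / largest
            scores[name] = max(0.0, 1.0 - loss)
        else:
            # Non-numeric properties are either preserved exactly or not at all.
            scores[name] = 1.0 if original == migrated else 0.0
    return scores


def acceptable(scores: Dict[str, float], threshold: float = 0.95) -> bool:
    """Hypothetical acceptance rule: every property must reach the threshold."""
    return all(score >= threshold for score in scores.values())


# Hypothetical property sets for an image before and after conversion.
source_image = {"width": 800, "height": 600, "colour_depth": 24, "colour_space": "sRGB"}
converted_image = {"width": 800, "height": 600, "colour_depth": 8, "colour_space": "sRGB"}

report = property_preservation(source_image, converted_image)
needs_another_alternative = not acceptable(report)
```

Subjective properties such as layout or compression artifacts would need dedicated comparators; a simple equality test of this kind only covers properties that can be measured directly.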
At the University of Minho research is being undertaken to devise new pathways to carry out the three outlined activities in an automated fashion (i.e. selection of migration options, conversion and evaluation). Current activities are focused on the development of a Service-Oriented Architecture (SOA) that, by combining input from different distributed applications, enables client institutions to preserve collections of digital material automatically.
It is assumed that client institutions already possess a digital repository system capable of storing, managing and providing access to the digital objects they hold. The repository system will act as the client application that benefits from the services provided by the SOA.
To better understand the functions provided by the SOA, one might consider the following scenario:
The National Archives of Portugal are currently engaged in the development of a digital repository system capable of preserving authentic digital objects produced by affiliated public administration institutions (Project RODA). Alongside the development of the repository software comes the creation of ingest and preservation policies that will aid producers in the preparation of their material before it is submitted to the repository. Nevertheless, the repository can expect to be confronted by objects in formats it has not previously encountered, which will need to undergo a process of normalisation before being deposited. In the presence of an unrecognised format, the repository system could invoke a format identification service provided by the SOA in order to obtain information about the object's format, in addition to checking its integrity. After this operation, the repository could interrogate the SOA to obtain a list of formats to which the object could be converted. Simultaneously, the repository would inform the SOA of its preservation preferences and requirements, i.e. a list of preservation-oriented requirements derived from the policies created by the senior management of the archive. A few examples of such requirements are as follows:
The SOA would then address all of these criteria with information previously acquired about the behaviour and quality of all accessible conversion applications and would then produce a ranked list of optimal migration options. The repository system could then select the most suitable one from this list and request the SOA to carry out the corresponding migration.
After the conversion process, the repository system would receive a new digital object (better yet, a new digital representation of the source digital object) and a migration report stating the amount of data lost in that migration. This report could then be merged with the preservation metadata already maintained by the repository in order to document the preservation intervention and sustain the object's authenticity. On a regular basis, the repository would consult with a notification service to determine if any of the formats it holds are at risk of becoming obsolete. When a format falls into that condition, a new migration process is triggered.
A close examination of the outlined scenario enables us to identify the following services:
The general architecture of the proposed SOA is depicted in Figure 1. This design does not intend to be prescriptive or limiting in any way. The goal is to provide a framework for discussion by pointing out the fundamental elements that should be present in such a system. Several interesting and competing research projects are presented as promising candidates to implement some of these elements. We recognise of course that many other initiatives and solutions might also exist outside the scope of our work or this article.
The figure is divided into two major sections: the client and the server-side. The client-side depicts a few examples of applications that may use the services provided by the SOA. Among these are: digital repository systems like DSpace, Fedora or Eprints, and custom applications developed by individual users.
It is important to point out that any application capable of invoking a Web service may make use of the proposed SOA.
On the server-side are depicted the chief components comprising this framework. Each of these components is actually an independent application with distinctive roles and responsibilities that co-operate with each other by exchanging messages. This approach makes it possible for each component to be governed by a different organisation and facilitates the distribution of workload.
The first of these components is the Obsolescence Notifier, a service responsible for raising awareness among client institutions of the file formats that are at risk of becoming obsolete. This service should be consulted regularly by client institutions in order to determine if the objects in their custody are close to becoming unreadable to their designated community.
Several resources are available that could be used to support such a service. A few examples are as follows:
The Format Detector, as the name suggests, is a service capable of identifying the underlying encoding of a digital object. The client institution should be able to monitor, migrate and validate the integrity of digital objects without human intervention and this service is indispensable in accomplishing that goal. Furthermore, it enables digital formats to be identified according to the naming scheme used by other components that comprise the proposed SOA (e.g. the Migration Broker).
The following applications are potential candidates for supporting such a service:
Some institutions and initiatives have been developing services capable of carrying out format migrations [35-40]. Such initiatives rely on a common set of communication protocols to support the discovery and invocation of conversion procedures. Any conventional application may also be used as a service if an appropriate application wrapper is developed, i.e. a small piece of software that acts as the intermediary between the application and communication protocol.
In this type of approach, a client application is used to send out a digital object to a remote procedure that, after unpackaging the received message, converts the embedded object and returns the result back to the client.
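The wrapper idea can be sketched as follows. For brevity the sketch uses a JSON message with a base64-encoded payload rather than an actual Web services protocol, and the message fields (`target_format`, `object`) are hypothetical names chosen for illustration. The converter is any function that maps bytes to bytes.

```python
import base64
import json
from typing import Callable


def wrap_converter(convert: Callable[[bytes], bytes]) -> Callable[[str], str]:
    """Turn a local bytes-to-bytes converter into a message handler."""

    def handle(message: str) -> str:
        # Unpackage the received message and extract the embedded object.
        request = json.loads(message)
        payload = base64.b64decode(request["object"])
        # Delegate the actual conversion to the wrapped application.
        converted = convert(payload)
        # Package the result so it can be returned to the client.
        return json.dumps({
            "format": request["target_format"],
            "object": base64.b64encode(converted).decode("ascii"),
        })

    return handle


# Stand-in "application": upper-casing text plays the role of a real converter.
handler = wrap_converter(bytes.upper)

request_message = json.dumps({
    "target_format": "text/plain",
    "object": base64.b64encode(b"digital object").decode("ascii"),
})
response = json.loads(handler(request_message))
returned_object = base64.b64decode(response["object"])
```

In a real deployment the handler would sit behind whatever transport the chosen protocol prescribes; the wrapper itself only translates between messages and application calls.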
Standard protocols, such as the ones that accompany Web services technology, may play an important role in this domain due to their open-standard and platform-independent characteristics.
A distributed approach to migration introduces some appealing properties:
However, requiring the presence of a computer network to carry out format migrations may seem unreasonable in a preservation context, where this type of reliance on technology is generally very undesirable. Nevertheless, digital preservation is a global problem. A distributed approach may very well prove to be an effective way to handle the intricacies of preservation as it allows institutions worldwide to share their solutions and co-operate in the network of services.
The Service Registry component is responsible for managing information about existing conversion services. It stores metadata about its producer/developer (e.g. name, description and contact), about the service itself (e.g. name, description, the source/target formats that it is capable of handling, cost of invocation, etc.) and information on how the service should be invoked by a client application (i.e. its access point).
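As a rough illustration, a registry entry could be modelled as shown below. The field names mirror the metadata listed above, but the class names, the one-format-pair-per-entry simplification, the MIME-type labels and the `example.org` access points are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Producer:
    name: str
    description: str
    contact: str


@dataclass(frozen=True)
class ConversionService:
    name: str
    description: str
    source_format: str      # value taken from a controlled vocabulary
    target_format: str      # value taken from a controlled vocabulary
    cost_per_invocation: float
    access_point: str       # where a client application invokes the service
    producer: Producer


class ServiceRegistry:
    """In-memory stand-in for the registry's data store."""

    def __init__(self) -> None:
        self._services: List[ConversionService] = []

    def register(self, service: ConversionService) -> None:
        self._services.append(service)

    def find(self, source_format: str, target_format: Optional[str] = None) -> List[ConversionService]:
        """Return services accepting source_format, optionally filtered by target."""
        return [
            s for s in self._services
            if s.source_format == source_format
            and (target_format is None or s.target_format == target_format)
        ]


# Hypothetical entries used only to exercise the sketch.
lab = Producer("Example Lab", "Fictional service provider", "admin@example.org")
registry = ServiceRegistry()
registry.register(ConversionService(
    "tiff2png", "TIFF to PNG converter", "image/tiff", "image/png",
    0.0, "http://example.org/services/tiff2png", lab))
registry.register(ConversionService(
    "tiff2jpeg", "TIFF to JPEG converter", "image/tiff", "image/jpeg",
    0.0, "http://example.org/services/tiff2jpeg", lab))

from_tiff = [s.name for s in registry.find("image/tiff")]
to_png = [s.name for s in registry.find("image/tiff", "image/png")]
```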
It is important that the Service Registry is populated with rich metadata. Much of the information delivered to end-users after a conversion will be obtained from this data source. This information can be used to document the preservation intervention as it outlines all the components that took part in the migration process and describes the outcome of the event in terms of data loss and object degradation (see Object Evaluator). This migration report constitutes what PREMIS refers to as an Event Entity.
One of the major advantages of using Web services in this context is the capacity to combine tens or hundreds of conversion services to create new migration operations. However, to accomplish this, each conversion service should respect a well-defined interface that establishes the arguments that it must be capable of handling. Although this interface is essential to produce service compositions, it is not sufficient on its own. Each conversion service must be described with source and target format metadata elements whose values are obtained from a controlled vocabulary. This is fundamental to enable the computation of the migration network (i.e. all possible migration paths between two given formats).
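Computing the migration network amounts to a path search over a directed graph whose nodes are formats and whose edges are conversion services. The sketch below enumerates every cycle-free path with a depth-first search; the service names, the MIME-type labels standing in for the controlled vocabulary, and the `max_hops` limit are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

# (service name, source format, target format)
Service = Tuple[str, str, str]
Network = Dict[str, List[Tuple[str, str]]]


def build_network(services: Iterable[Service]) -> Network:
    """Index services by the format they accept."""
    network: Network = defaultdict(list)
    for name, source, target in services:
        network[source].append((name, target))
    return network


def migration_paths(network: Network, source: str, target: str, max_hops: int = 3) -> Iterator[List[str]]:
    """Yield every chain of services leading from source to target.

    A format is never visited twice within one chain, so the search
    terminates even when the network contains cycles.
    """

    def walk(fmt: str, visited: Set[str], chain: List[str]) -> Iterator[List[str]]:
        if fmt == target and chain:
            yield list(chain)
            return
        if len(chain) == max_hops:
            return
        for name, next_fmt in network.get(fmt, []):
            if next_fmt not in visited:
                chain.append(name)
                visited.add(next_fmt)
                yield from walk(next_fmt, visited, chain)
                chain.pop()
                visited.remove(next_fmt)

    yield from walk(source, {source}, [])


# Hypothetical set of registered conversion services.
services = [
    ("tiff2png", "image/tiff", "image/png"),
    ("png2jpeg", "image/png", "image/jpeg"),
    ("tiff2jpeg", "image/tiff", "image/jpeg"),
    ("jpeg2png", "image/jpeg", "image/png"),
]
network = build_network(services)
paths_to_jpeg = list(migration_paths(network, "image/tiff", "image/jpeg"))
```

The search only works if every service labels its formats with exactly the same identifiers, which is why a shared controlled vocabulary is a prerequisite.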
Several initiatives are considered suitable candidates to provide that controlled vocabulary:
The Migration Broker is responsible for carrying out object migrations. In practice, this component is responsible for making sure that composite conversions are performed atomically from the point of view of the client application and the rest of the SOA components. Additionally, this component is responsible for recording the performance of each migration service. The results of these measurements are stored in the Evaluations Repository, a knowledge base that supports the recommendation system (see Migration Advisor).
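A minimal sketch of this behaviour is given below, assuming each conversion service can be represented as a function from bytes to bytes. The client either receives the final object or an error, never an intermediate result, while the timing of every completed step is appended to a list that stands in for the Evaluations Repository. The names `run_composite` and `MigrationFailed` are hypothetical.

```python
import time
from typing import Callable, List, Sequence, Tuple

Converter = Callable[[bytes], bytes]
# (service name, elapsed seconds, size of output in bytes)
Measurement = Tuple[str, float, int]


class MigrationFailed(Exception):
    """Raised when any step of a composite conversion fails."""


def run_composite(
    data: bytes,
    steps: Sequence[Tuple[str, Converter]],
    evaluations: List[Measurement],
) -> bytes:
    """Run the steps in order; return the final object or raise."""
    current = data
    for name, convert in steps:
        started = time.perf_counter()
        try:
            current = convert(current)
        except Exception as error:
            # The client sees a single failure, not a partial result.
            raise MigrationFailed(f"step {name!r} failed") from error
        elapsed = time.perf_counter() - started
        # Record the performance of each service that completed.
        evaluations.append((name, elapsed, len(current)))
    return current


# Stand-in converters used only to exercise the sketch.
evaluations: List[Measurement] = []
result = run_composite(
    b"abc",
    [("upper", bytes.upper), ("reverse", lambda payload: payload[::-1])],
    evaluations,
)
```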
A prototype of the proposed SOA is currently being devised at the University of Minho and is presently capable of measuring the following process-related criteria:
The Format Evaluator provides information about the current status of file formats. This information enables the Migration Advisor to determine which formats are better suited to accommodate the properties of source objects by looking at the characteristics of each pair of formats. This service is supported by a data store containing facts about formats (i.e. Format Knowledge Base), but could also exploit external sources of information such as the PRONOM registry or Google Trends, to determine automatically a format's prevalence and usage.
The current prototype is capable of determining the potential gain (in terms of preservation) that one might obtain in converting an object from its original format to a new one by considering the following set of criteria:
All of these criteria are being considered by the Migration Advisor to rank all the available migration options.
This component is capable of measuring the preservation gain of performing a certain transformation. For example, if the target format is royalty-free while the source format is not, there is a preservation gain associated with that transformation. On the other hand, if the target format only supports a lossy type of compression, while the source format is not compressed at all, there is a potential risk of losing important information in the process.
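The two cases in this example can be expressed as a per-criterion difference between the facts known about the target and source formats. In the sketch below each fact is a score between 0 and 1 where higher is better for preservation; the criterion names and the scores are hypothetical and do not reproduce the prototype's actual taxonomy.

```python
from typing import Dict

# criterion -> score between 0.0 and 1.0 (higher is better for preservation)
FormatFacts = Dict[str, float]


def preservation_gain(source: FormatFacts, target: FormatFacts) -> Dict[str, float]:
    """Positive values are gains, negative values are risks."""
    shared = source.keys() & target.keys()
    return {criterion: target[criterion] - source[criterion] for criterion in sorted(shared)}


# Hypothetical facts: the source is proprietary but uncompressed, while the
# target is royalty-free but only supports lossy compression.
source_format = {"royalty_free": 0.0, "lossless": 1.0}
target_format = {"royalty_free": 1.0, "lossless": 0.0}

gain = preservation_gain(source_format, target_format)
```

Here `gain` reports a gain of 1.0 for `royalty_free` and a risk of -1.0 for `lossless`; how these are traded off against each other is left to the weights supplied by the client institution.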
The criteria present in this evaluation taxonomy were assembled from various bibliographic sources. Groups of format experts and digital curators may also contribute with additional criteria to enrich the evaluation taxonomy.
The Object Evaluator is in charge of judging the quality of the migration outcome. It accomplishes this by comparing objects submitted for migration with their converted counterparts. Again, these evaluations will be performed according to a range of criteria. These criteria, known in this context as significant properties, constitute the set of attributes of an object that should be kept intact during a preservation intervention. They constitute the array of attributes that characterise an object as a unique intellectual entity, independently of the encoding used to represent it. The Bible, for example, may exist in many different formats and media, e.g. ASCII text, Portable Document Format, written on paper or carved on stone, and still be regarded as the Holy Bible. Considering text documents as an example, some significant properties could be: the number of characters, the order of those characters, the page size, the number of pages, the graphical layout, the font type and size.
The Migration Advisor is responsible for producing suggestions of migration alternatives. In reality this component acts as a decision support centre for client institutions and is capable of determining the best possible choice within a wide range of options. It accomplishes this by confronting the preservation requirements outlined by client institutions with the accumulated knowledge about the behaviour of each accessible migration path.
The behaviour of each migration path is determined by taking into consideration the sets of criteria previously described: conversion performance, status of the formats involved and data loss (handled respectively by the Migration Broker, Format Evaluator and Object Evaluator).
In order to generate an appropriate recommendation, the Migration Advisor resorts to the Evaluations Repository, a database containing all the measurements taken by the Object Evaluator, the Format Evaluator and the Migration Broker. Averaging these readings provides a general idea about the behaviour of each migration path.
Different institutions will have distinct preservation needs. They will be able to state their individual requirements by weighting the importance of each criterion handled by the system. This enables the system to rank all the alternatives according to their level of aptness to resolve the preservation problem of the client institution.
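One simple way to combine averaged measurements with institutional weights is a normalised weighted sum, sketched below. The article does not state which aggregation function the prototype uses, so the weighted sum, the criterion names and the sample scores are all assumptions; scores are taken to lie between 0 and 1, with higher values being better.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (migration path, criterion, score between 0.0 and 1.0)
Measurement = Tuple[str, str, float]


def average_behaviour(measurements: Iterable[Measurement]) -> Dict[str, Dict[str, float]]:
    """Average all readings per migration path and criterion."""
    grouped: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for path, criterion, score in measurements:
        grouped[path][criterion].append(score)
    return {
        path: {criterion: mean(scores) for criterion, scores in criteria.items()}
        for path, criteria in grouped.items()
    }


def rank(behaviour: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order migration paths by weighted fitness, best first."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one criterion needs a positive weight")

    def fitness(scores: Dict[str, float]) -> float:
        # Criteria without any measurement contribute nothing.
        return sum(w * scores.get(c, 0.0) for c, w in weights.items()) / total

    ranked = [(path, fitness(scores)) for path, scores in behaviour.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical contents of the Evaluations Repository.
readings = [
    ("tiff->png", "data_preserved", 1.0),
    ("tiff->png", "data_preserved", 0.9),
    ("tiff->png", "speed", 0.4),
    ("tiff->jpeg", "data_preserved", 0.6),
    ("tiff->jpeg", "speed", 0.9),
]
behaviour = average_behaviour(readings)

# Two institutions with different priorities obtain different rankings.
archive_ranking = rank(behaviour, {"data_preserved": 3, "speed": 1})
access_ranking = rank(behaviour, {"data_preserved": 1, "speed": 3})
```

With these sample readings, the institution that values fidelity is steered towards the PNG path, whereas the one that values throughput is steered towards JPEG.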
In order to rank all possible options, the Evaluations Repository must be populated with data. This is generally called training and basically consists in requesting the SOA to convert a large set of digital objects in different forms and sizes using all possible migration paths. This operation forces all the evaluators to produce reports that will be used by the Migration Advisor to compute an appropriate recommendation.
A prototype implementing the concepts described in this paper is currently being devised at the University of Minho and is expected to be fully operational by the end of 2006. The purpose of this prototype is to evaluate the suitability of the proposed architecture and the precision of the recommendation system. Precision will be assessed using cross-validation techniques.
Still images are generally represented by simple structures and, for that reason, are being used to guide the development of the prototype. More complex objects, like text documents produced by word-processing software, will be considered afterwards in order to assess the prototype's effectiveness in handling more subjective criteria, such as appearance or text layout.
The current prototype fully implements the following components:
Adding new evaluation criteria to the prototype is as easy as updating a configuration file. The real complexity lies in the creation of new criterion evaluators. Once developed, these evaluators can be placed in the server's file system (to be loaded during the system's bootstrap) or remotely invoked.
New conversion services can be attached to the system by simply adding them to the Service Registry. Training the Migration Advisor to recognise new conversion services is, of course, essential.
Throughout this article, Web services have been presented as a promising technology to support the proposed SOA. However, many other protocols exist which could be used to implement these ideas. Different technologies could even be utilised together by means of gateways or proxies. For example, a gateway is currently being developed to enable converters provided by the TOM Conversion Service, a technology that uses a non-standard communication protocol, to be used by our prototype.
This article describes the set of components that are necessary to build a Service-Oriented Architecture (SOA) to enable cultural heritage institutions to carry out digital preservation with minimum human intervention.
The proposed SOA enables institutions to co-operate in the establishment of a global advisory service that, among other things, will be capable of producing recommendations of optimal migration options, performing format migrations and thoroughly documenting preservation interventions by generating appropriate preservation metadata.
The proposed SOA could also be used as an objective tool for comparing file formats and conversion software. It could be used to provide an on-demand migration service, i.e. a service capable of converting objects from their archival configurations to formats more suitable for dissemination; as well as a normalisation procedure for ingest work.
Although a prototype for this SOA is still under development, some conclusions can already be drawn. The set of digital objects used to train the recommendation system should be as heterogeneous as possible in terms of shape and size, and should contain at least a couple of thousand objects. Small or homogeneous object sets generate very imprecise recommendations due to overfitting in the learning process.
Further research could be conducted to detect patterns in the user-assigned weights. Such patterns would represent user profiles and would enable the recommendation process to be automated one step further.
The proposed SOA could also contribute to fostering new lines of research such as the improvement, or the development, of comparison algorithms for different classes of objects, e.g. image, text, audio, video or datasets. Comparators such as these are necessary to develop a general purpose Object Evaluator. Further work should also be conducted to devise a general evaluation taxonomy for several classes of digital objects.
The work reported in this article has been funded by the FCT (Fundação para a Ciência e a Tecnologia, Portugal) under the grant SFRH/BD/17334/2004.
Graduated as a Systems and Informatics Engineer, has worked as a consultant at the Arquivo Distrital do Porto (Oporto's Archive) and as a researcher at the University of Minho. Since 2003 has been publishing in the field of digital archives/libraries and preservation. Currently, is developing work as a PhD student and coordinating several research projects at the Arquivo Distrital do Porto and the Portuguese National Archives (Instituto dos Arquivos Nacionais/Torre do Tombo).
Auxiliary Professor at the Department of Information Systems of University of Minho, Ana has been publishing in the areas of Knowledge Society, Scholarly Communication, Information Access & Retrieval and Semantic Web. She is also interested in the social aspects of the Internet, primarily on its impacts on scholarly communication.
Auxiliary Professor at the Computer Science Department of the University of Minho, has a Masters in 'Compiler Construction' and a PhD on the subject 'Document Semantics and Processing'. Has been managing projects and publishing in the field of Markup Languages since 1995.