This article describes the motivations behind the development of the ResourceSync Framework. The Framework addresses the need to synchronise resources between Web sites. Resources cover a wide spectrum of types, such as metadata, digital objects, Web pages, or data files. There are many scenarios in which the ability to perform some form of synchronisation is required. Examples include aggregators such as Europeana that want to harvest and aggregate collections of resources, or preservation services that wish to archive Web sites as they change. Some of these use cases are described in this article, along with concrete examples of them in practice, together with issues that they may pose to the Framework as it is developed.
Interoperability lies at the heart of the Internet. Without protocols, standards, and frameworks, the Internet would not and could not exist. These standards are used to determine how clients and servers communicate with each other, how resources are transferred, and how they are described in order to support their use. Each standard has been developed in response to a need for some form of resource handling or description.
Within the library, repository, and general Web resource world, there are a number of protocols and frameworks used to work with resources. They include well-known standards such as OAI-PMH, OAI-ORE, AtomPub, and SWORD.
One requirement that spans several of these areas, yet is a distinct problem in its own right, is that of resource synchronisation. We define resource synchronisation as the need to keep two or more collections of resources synchronised, or ‘in sync’: that is, additions, updates, and deletions from one collection are reflected in the other, so that, subject to any delay in synchronisation, they contain identical resources.
There are variations on this definition related to factors such as whether the whole or part of the collection needs to be synchronised, the number of sources or destinations that are kept synchronised, or the latency between changes occurring in the source collection and being reflected in the destination collection. The process of defining these differences allows the gathering of requirements for a generalised Web resource synchronisation framework.
The use cases in this article are presented using a number of conventions, which are set out in the sections that follow.
This article defines a series of use cases for Web resource synchronisation. They are being used by the NISO and OAI ‘ResourceSync’ project. The project, funded by the Alfred P. Sloan Foundation and JISC, aims to research, develop, prototype, test, and deploy mechanisms for the large-scale synchronisation of Web resources.
These use cases provide both the motivation for the development of a Web resource synchronisation framework and the requirements that it must fulfil. When developing anything new, be that a product, service, or framework, use cases provide an accessible way to think about how that new development will be used. When developing use cases, actors (the bodies involved) and actions are put together to describe the different behaviours or goals that must be supported. The use cases below show different purposes for a Web synchronisation framework, and therefore help to define the requirements it must fulfil. The success of the new framework can then be judged by whether it fulfils all of the identified use cases.
There are a number of dimensions that are useful in parameterising the use cases. These dimensions vary between each use case, and the combination of the dimensions makes each use case unique in the requirements-gathering exercise for ResourceSync:
Sixteen use cases have been defined to guide the development of the ResourceSync Framework. Each is briefly described and accompanied by a diagram showing the source of the resources, the destination of the synchronisation action, and a pictorial expression of the resource(s) being synchronised. The diagrams were drawn during the first project technical meeting in Washington, DC, on 14 June 2012.
In order to explain the use cases, each has a description of how that case is unique, how it is typically deployed, some concrete examples of systems that need this requirement which they are already providing via an alternative method, and, if relevant, a list of issues that need to be considered when defining functionality to fulfil this particular use case.
Two paired systems, a Source and a Destination, each holding a collection of resources. The Destination must be kept up to date with the Source.
Figure 1: One-to-one sync between two systems
This is the most basic synchronisation use case, and deals with all the essential features of ResourceSync: initial synchronisation, incremental update, and baseline audit to ensure the collections are indeed synchronised.
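The three features above can be illustrated with a minimal sketch. The inventory format here, a simple mapping of resource URIs to checksums, is purely hypothetical: the framework's actual inventory and change description formats were still under design when this article was written.

```python
def plan_sync(source_inventory, destination_inventory):
    """Compute the actions needed to bring a Destination in line with a Source.

    Both inventories are hypothetical {uri: checksum} mappings. Comparing a
    freshly fetched Source inventory against local state serves both initial
    synchronisation (everything is a create) and baseline audit (checksum
    mismatches reveal drift).
    """
    creates = [u for u in source_inventory if u not in destination_inventory]
    updates = [u for u in source_inventory
               if u in destination_inventory
               and source_inventory[u] != destination_inventory[u]]
    deletes = [u for u in destination_inventory if u not in source_inventory]
    return {"create": sorted(creates), "update": sorted(updates),
            "delete": sorted(deletes)}

source = {"http://example.org/a": "c1", "http://example.org/b": "c2"}
dest = {"http://example.org/b": "old", "http://example.org/c": "c3"}
print(plan_sync(source, dest))
```

Incremental update would follow the same shape, but driven by a list of recent changes rather than a full inventory comparison.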
One Source, one Destination. It is possible that the Source and Destination are formally aware of each other, but this is not always necessary.
arXiv.org  mirrors. The arXiv repository service hosted at Cornell University is mirrored to a number of geographically dispersed sites. This provides both data redundancy (for disaster-recovery purposes) and speed of access via local mirrors.
A single Destination system that is synchronising from multiple Sources for the purposes of building an aggregation of their resources.
Figure 2: Aggregator (many-to-one)
In this case, a single system is attempting to represent the content from multiple systems, possibly in some kind of union catalogue (eg for cross-searching).
Multiple Sources, one Destination. The Sources may not be formally aware that the Destination is synchronising their content if they are offering their content on the open Web for harvesting.
OAIster and Europeana are aggregators of metadata and content. They harvest resources from many sources using public interfaces, and then offer the aggregated resources via their own search service.
Resources that are duplicated across the Source systems (not necessarily within one Source) may result in duplicates in the Destination.
Many Destinations synchronise from a single Source. This is considered to be the most likely scenario for ResourceSync usage from the perspective of a Source.
Figure 3: Master copy (one-to-many)
There is a single Source that is providing resources to multiple Destinations, which will, therefore, be either mirrors or partial mirrors of that Source.
One Source, multiple Destinations. The Source is unlikely to have any agreement with or be formally aware of all the Destinations.
Many traditional institutional repositories are harvested in this way. They offer their open content and metadata for harvesting and reuse.
The Source wishes to supply and/or advertise sub-sets of its full set of resources, to allow Destinations to synchronise with one or more of those sub-sets. If the Destination wishes to synchronise selectively from the Source, the criteria for selection are those provided by the Source.
Figure 4: Selective synchronisation
It indicates that the Source is not just a large aggregation of resources, but that each of those resources may have properties or belong to sets or collections about which the Destination may be interested in knowing, prior to any synchronisation attempt.
One Source, advertising metadata to be used for selection, and any number of Destinations, each of which may wish to synchronise a different sub-set of the Source’s resources.
To provide functionality similar or equivalent to OAI-PMH Sets. For example, DSpace provides its content in 'Collections', which are logical divisions in the overall repository content.
There are two ways that this could be presented to the Destination:
An alternative approach would be to avoid supporting sub-sets within a single Source and instead present multiple Sources, one for each collection to be exposed. In this way a Destination would select the appropriate Sources to synchronise with (which could have overlapping content).
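The first of these approaches, a single Source advertising collection membership alongside its resources, could be sketched as follows. The record shape, with a "collections" list per resource, is an assumption for illustration only.

```python
# Hypothetical Source resource records, each tagged with the collections
# (in the OAI-PMH Set sense) to which the resource belongs.
records = [
    {"uri": "http://example.org/1", "collections": ["theses"]},
    {"uri": "http://example.org/2", "collections": ["theses", "articles"]},
    {"uri": "http://example.org/3", "collections": ["datasets"]},
]

def select(records, wanted):
    """Return only the records belonging to at least one wanted collection."""
    return [r for r in records if set(r["collections"]) & set(wanted)]

print([r["uri"] for r in select(records, ["theses"])])
```

A Destination interested only in theses would then run the normal synchronisation process over the filtered list rather than the full inventory.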
The set of resources in the Source may include some that are metadata records for other resources in that Source. The Destination may want to only synchronise the metadata records, but may also want the option in future to trace back to the other resource(s) that the metadata record describes.
Figure 5: XML Metadata Harvesting/Identification
It indicates that the Destination may care about what kind of resource a given resource is, and which other resources it is related to.
One Source with a mixture of metadata records and other resources, and any number of Destinations interested in synchronising metadata only.
OAI-PMH is used to synchronise collections of XML datastreams that are identified by an item identifier, a metadataPrefix, and a datestamp.
A Service wishes to listen to change notifications from a Source, and keep a record of the changes that have taken place (perhaps including types of change, frequency, etc), and to make available statistics regarding the Source. The Service does not, at any point, synchronise the resources itself.
Figure 6: Statistics collection
In this use case we are not so much interested in the change, as in the fact that a change has occurred. This is a very simple use case, as it does not require any content synchronisation.
One Source providing change communication with sufficient metadata for the one Service listening to create the desired statistics.
Sites such as the Registry of Open Access Repositories (ROAR)  and the Directory of Open Access Repositories (OpenDOAR)  are already using OAI-PMH for similar purposes (eg providing resource counts).
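Because this use case never transfers content, a Service could be little more than a tally over the notification stream. The notification vocabulary below ("created", "updated", "deleted") is hypothetical; the actual change types ResourceSync will communicate were still being defined.

```python
from collections import Counter

# A hypothetical stream of change notifications received from a Source.
events = [
    {"uri": "http://example.org/1", "change": "updated"},
    {"uri": "http://example.org/2", "change": "created"},
    {"uri": "http://example.org/1", "change": "updated"},
    {"uri": "http://example.org/3", "change": "deleted"},
]

def change_statistics(events):
    """Tally change events by type without fetching any resource content."""
    return Counter(e["change"] for e in events)

print(change_statistics(events))
```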
In some environments the resources to be synchronised may be very large (in the order of many gigabytes for research data systems). Due to the load this may place on the consumer, and limits imposed by transfer protocols or file systems, they may require some specific information about the size of the resource in advance of the synchronisation taking place, or the option to synchronise only part of the resource.
Figure 7: Large data files
It deals not so much with the need for synchronisation as for the need to make consumers aware of what the implications of the synchronisation action will be, and/or to offer them appropriate synchronisation options (such as partial synchronisation of changed content using a tool such as diff).
One Source and one Destination exchanging large data files. The Source may need to provide an indication of the size of the resource, any available retrieval/diff protocols, whether it is an interesting change (from a Destination perspective), when it was last modified, and fixity information. It is likely that to use more specialist retrieval/diff protocols will mean that the Source and the Destination will need to be formally aware of each other.
Research Data Management can require the movement of large files or packages of files over the network asynchronously from the usage or production of the data. The DataFlow  Project at the University of Oxford is transferring zipped research data between a client (DataStage) and server (DataBank) environment using SWORDv2.
We separate out the notion of providing hooks for the efficient update of large data files from the transfer methods themselves. There are various complexities around providing alternative synchronisation options which are Out of Scope.
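One such hook is the standard HTTP Range mechanism: if a Source advertises the size of a resource in advance, a Destination can plan a chunked or resumable retrieval. Whether ResourceSync will mandate Range support is an open question; this sketch only illustrates why advertising size up front is useful.

```python
def byte_ranges(total_size, chunk_size):
    """Split a large resource into HTTP Range header values so a Destination
    can retrieve (or resume) it in manageable pieces rather than in one
    transfer that may exceed protocol or file-system limits."""
    ranges = []
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges

# A 2.5 GB file fetched in 1 GB chunks:
print(byte_ranges(2_500_000_000, 1_000_000_000))
```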
Some information systems keep sensitive content hidden for a variety of reasons, yet may require, for example, a separate public user interface to publish those materials that can be made public. The key to this use case is that the synchronisation is carried out across an authorised trust boundary.
Figure 8: Protected resources/access control
The publisher of changes is likely to be a private system, requiring a trusted or protected relationship with the synchronising downstream system.
One Source and one Destination, where there is likely to be a trusted one-to-one relationship between them. The synchronisation will need to be able to expose resources which are not publicly accessible.
CRIS (Current Research Information Systems) often manage and store information about the research outputs of an institution. Some of these may be surfaced through a repository or research portal. Whilst they often employ PUSH technologies, a PULL technology allowing synchronisation could be an alternative.
Legal deposit of digital published content (to a national library repository) would require a framework where only trusted Destinations could harvest the content.
Successful interaction in the context of Web trust and security mechanisms is in scope. Development of additional ResourceSync mechanisms is Out of Scope.
A resource on the Source has changed its identifier or has moved from its original URI to a new one. The Destination does not need to re-sync, but it may need to update its references.
Figure 9: Link maintenance
No synchronisation of the physical resource needs to take place, but the Destination needs to be aware that a change has taken place, and to be able to update its references.
One Source which has moved some of its resources, and a Destination which has previously synchronised resources from the Source.
Any Web-based system holding resources that moves resource identifiers internally.
Is there a difference between a resource 'move' and a resource 'delete followed by a (re-)create'? Should we treat a move as a combination of a related delete-and-create?
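However a move is ultimately modelled, the Destination-side behaviour is simple reference rewriting, as in this sketch. The shape of the move notification (a from-URI and a to-URI) is an assumption.

```python
def apply_move(references, moved_from, moved_to):
    """Update a Destination's stored references after a 'move' notification.

    No resource content is re-fetched; only the recorded URIs change,
    which is exactly what distinguishes this use case from a full re-sync.
    """
    return [moved_to if ref == moved_from else ref for ref in references]

refs = ["http://example.org/old/1", "http://example.org/2"]
print(apply_move(refs, "http://example.org/old/1", "http://example.org/new/1"))
```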
Sometimes a system will want to migrate all of its data to a newer environment as a one-off operation, prior to shutdown of the legacy system. A protocol like ResourceSync would be used to alert the new system to the resources it needs to import, and then to provide those resources.
Figure 10: Migration / one-off sync
It is a one-off operation, and also may need to guarantee a prompt response from any service that is synchronising the legacy data, to ensure that migration takes place in a timely way prior to the shutdown of the legacy system.
One Source (the legacy system) and one Destination (the new system), which are formally aware of each other.
There are also likely specific rules that the new system would want to implement over the legacy system’s data to import them into a new structure, and it is unclear at this point whether there is a role for ResourceSync there or not.
Large-scale migration is usually done at least twice: the first time to test the new system, with a final full migration just before cutover. During the final migration the original system is either down or read-only. Parallel work on both the legacy system and the new one would allow a staged cutover.
Any system migration which needs to maintain legacy data will have this kind of requirement; examples are numerous and unbounded. A concrete example might be an institution moving from an EPrints repository to a DSpace repository.
Expecting systems outside the legacy/new system pairing (which might have synched with the now defunct legacy system) to understand that this operation has taken place is Out of Scope, unless a redirect at the protocol level is practical.
A Destination has discovered a ResourceSync endpoint, and wants to know what the capacities/features/supported components of that endpoint are, as well as other relevant administrative information regarding the service.
Figure 11: Service description
This focuses on how a Destination learns about the Source’s features prior to engaging in any synchronisation activities.
One Source providing information about its service, and one Destination determining which features exposed by the Source it can take advantage of.
A similar approach is used in AtomPub and SWORD, which provide a Service Document describing the capabilities of the server.
A user or user agent is at a Web site and wishes to discover any ResourceSync endpoints on behalf of the Destination which will then use them.
Figure 12: Auto-discovery
This is about how the ResourceSync service provided by a Source is discoverable from its normal Web site representation.
One Source, with a front-end or other interface that can direct the user or Destination server to the appropriate place to carry out synchronisation processes.
There are many examples of auto-discovery on the Web, including RSS and Atom feed auto-discovery via HTML link elements, favicons, and the advertisement of Sitemaps in robots.txt files.
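By analogy with feed auto-discovery, one plausible mechanism would be a link element in a page's head pointing at the Source's synchronisation endpoint. The rel value "resourcesync" and the capability document URL below are assumptions, not a published convention.

```python
from html.parser import HTMLParser

class SyncLinkFinder(HTMLParser):
    """Collect href values from hypothetical <link rel="resourcesync"> tags."""

    def __init__(self):
        super().__init__()
        self.endpoints = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "resourcesync" and "href" in a:
            self.endpoints.append(a["href"])

page = ('<html><head>'
        '<link rel="resourcesync" href="/sync/capabilities.xml">'
        '</head></html>')
finder = SyncLinkFinder()
finder.feed(page)
print(finder.endpoints)
```

A user agent or Destination could fetch any ordinary page from the Source's Web site and extract the endpoint this way, with no prior arrangement.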
It will be necessary for potential consumers of content to be able to find sources to synchronise from (assuming that the relationship between the client and server is not by prior arrangement). This use case addresses the need to provide directories of potential Sources that support the ResourceSync protocol.
Figure 13: Discovery layer / directory
It is concerned not so much with the synchronisation of resources as with the discovery of Sources with which to be synchronised.
Many Sources being discoverable by a Destination, possibly via some kind of aggregator or directory service. The Source must present enough information to allow the construction of such directories.
In the domain of Open Access Repositories there are registries of systems (such as OpenDOAR , ROAR  and re3data ) that support the discovery of repositories, the kinds of content they hold, the API endpoints they have, and other information about the collection.
Building a Directory itself is Out of Scope for the project.
A Destination synchronises all or a subset of resources from a Source in order to provide a cached copy.
Figure 14: Pre- / Smart-caching
Only a subset of the "operational" copies of the resources needs to be synchronised, and not permanently: the cached copies exist purely to speed up delivery.
One Source, providing the master copies of resources, and one Destination acting as a local cache of the resources in the Source.
Content Delivery Networks (CDNs) provide a global network of local caches or mirrors for the efficient and fast transfer of content.
Usage statistics of the resources at the Source need to be accumulated from all cache destinations; however this is an issue for all cache use cases, not just for this project.
An application that consumes data from one or more remote datasets uses a cache that stores local copies of the remote data. These caches need to be invalidated when the remote data change: that is, locally cached content is marked as invalid if the resource changes in the Source.
Figure 15: Cache invalidation
This uses the change communication as a trigger for local behaviour changes, rather than strictly for synchronisation (although synchronisation may ensue, it is not the primary consequence).
One Source, providing the master copies of resources, and one Destination acting as a local cache of the resources in the Source.
Notification (push) for low latency may be required.
Possible important notification types: Updated, Deleted, Expired
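Taking those three suggested notification types, a cache's reaction might look like the following sketch. The notification shape and the lazy re-fetch policy are assumptions for illustration.

```python
def invalidate(cache, notification):
    """Mark or drop a cached copy in response to a change notification.

    A deletion removes the entry outright; an update or expiry merely
    flags it as invalid so it is re-fetched lazily on next access. No
    synchronisation happens here - the notification only triggers local
    behaviour, as described above.
    """
    uri, change = notification["uri"], notification["change"]
    if uri not in cache:
        return cache
    if change == "deleted":
        del cache[uri]
    elif change in ("updated", "expired"):
        cache[uri]["valid"] = False
    return cache

cache = {"http://example.org/1": {"body": "...", "valid": True}}
invalidate(cache, {"uri": "http://example.org/1", "change": "updated"})
print(cache)
```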
The Source is or has a large triple-store, and the Destination does not want to synchronise the entire dataset whenever a triple in that store changes. A dataset consumer wants to mirror or replicate (parts of) a linked dataset. The periodically running synchronisation process needs to know which triples have changed at what time in order to perform efficient updates in the Destination dataset.
Figure 16: Linked data triple synchronisation
Effectively this means that the resolution of the identifiers available in the Source is more granular than the resolution that the Destination actually wants: only parts of a resource are being synchronised, not the whole resource.
One Source that is or contains a triple-store, and a Destination which wishes to keep up to date without transferring the whole dataset each time.
Any triple store that needs to be synchronised, for example DBpedia, the structured data form of Wikipedia.
This is a specific case of ‘diff’, at the level of the entire dataset unless portions of the triple-store are exposed as resources that can be separately synchronised.
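At its simplest, such a diff is set arithmetic over the triples, as sketched below; the prefixed triple notation is illustrative only. The hard part, left open here, is how a Source exposes this difference efficiently over the Web.

```python
def triple_diff(old, new):
    """Compute the changes between two versions of a dataset expressed as
    sets of (subject, predicate, object) triples, so that only the
    difference, not the whole dataset, need be transferred."""
    return {"added": new - old, "removed": old - new}

old = {("ex:doc1", "dc:title", "Old title"),
       ("ex:doc1", "dc:creator", "Smith")}
new = {("ex:doc1", "dc:title", "New title"),
       ("ex:doc1", "dc:creator", "Smith")}
print(triple_diff(old, new))
```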
The following use cases have been designated Out of Scope for an initial specification of ResourceSync, but they should be taken into account as the development goes on, to ensure that no avenues are closed off for future versions. Reasons for being out of scope are given in the ‘issues’ section for each use case.
The resource will only be available for a fixed amount of time known to the Source. There may be some systems that only hold content for a limited period before it is deleted, such as systems used for staging content in workflows. Content announced via ResourceSync might then carry a known time-to-live (TTL) before it becomes unavailable.
Figure 17: Temporary resources/TTL
It suggests that the resource in the change communication may only be available for a limited time, and so the Destination must synchronise in a timely manner.
One Source, which contains resources which will only be available for a fixed amount of time, and one Destination which is capable of responding sufficiently quickly.
Twitter search results are often only available for a fixed length of time because the full mass of tweets is too large to be completely indexed.
Any system which offers support to a workflow, and expects the content to move on in time (such as a staging repository), or other environments which only retain information for a short period of time.
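The Destination-side check is trivial once a TTL is announced, as in this sketch. The TTL field itself is hypothetical, and timestamps are plain epoch seconds for simplicity.

```python
def still_available(announced_at, ttl_seconds, now):
    """Decide whether a temporary resource, announced with a time-to-live,
    can still be fetched. A Destination would use this to prioritise
    soon-to-expire resources in its synchronisation queue."""
    return now < announced_at + ttl_seconds

# Announced at t=1000 with a one-hour TTL:
print(still_available(1000, 3600, now=2000))  # within the window
print(still_available(1000, 3600, now=5000))  # too late
```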
The Destination wants to synchronise with a sub-set of the total set of resources held by the Source, and wants to provide a set of query parameters to the Source in order to be given a set of change communications which meet those criteria.
Figure 18: Destination-defined Selective Sync
It places the onus on the Source to provide an API that has the ability to provide filtering on queries sent by the Destination.
One Source that supports Destination-defined queries, one Destination which wants to query for sub-sets of the Source’s resources.
OAI-PMH Sets are similar, except they are usually defined by the Source, not the Destination.
This has significant overlap with the notion of an interoperable search facility. It would rely on agreed information about resources being indexed.
Sometimes it will be necessary to synchronise not only atomic resources but larger complex resources (such as those represented by an ORE  Resource Map). While resource maps themselves could be synchronised like atomic resources, the synchronisation may require referenced resources to also be synchronised. Furthermore, if synchronising such resources results in their URIs being translated into the namespace of the target system, then the resource map being synchronised may need to be rewritten as it is synchronised. Some composite objects may also need to be transferred atomically.
Figure 19: Complex Web objects
It suggests that the synchronisation operation is both a) not a strict copy, as some parts of the synchronised resource may need to be localised in the target system, and b) not limited to synchronising just the primary resource but also resources which it references.
One or more Sources providing composite objects which may span across multiple sources, and one Destination wishing to synchronise those resources.
This scenario refers to any resource that references other resources in order to provide their full expression. Examples would include ORE Resource Maps, which describe an aggregated set of resources to be viewed as a whole; an HTML page which references images and other embedded content; or a SWORD Statement which references its various packaged resources.
How should synchronisation recursion depth be handled for resource references? How should cross-site resources be handled?
Some content will have reuse conditions which are required to ensure that the synchronised resource is not inappropriately passed on by the Destination to other Destinations.
Figure 20: Reuse conditions of content
Some resources have metadata associated with them that is specifically to do with the rules by which they should be synchronised.
One Source with licensed content, one Destination.
Licensed content. Embargoed content with release conditions.
Private-to-private-to-public synchronisation chains, where the first sync may be between two private systems, but future downstream synchronisations then make the resources openly available.
This is a complex topic for which only partial solutions in limited domains currently exist.
Software applications, for example traditional institutional repositories, are often made up of discrete components in a Service Oriented Architecture style, allowing the platform to be installed on a single server or to scale out across multiple servers. Applications such as this often have change event mechanisms to inform other components when resources have changed and need to be propagated. If such applications develop and deploy ResourceSync, it could replace some of the intra-application communication with a standardised protocol, allowing more interoperability between components from different platforms.
Figure 21: Intra-application event notification
ResourceSync is being used to replace internal change event notification systems as well as providing an outward-facing change event publisher.
One publisher of changes, and only a few internal consumers of those changes, even though there are likely to be external consumers of the same change notification system (although would the internal version contain different or additional information?). A low level of latency would be required, and should be possible given the natural inter-relatedness of the components.
The search indexers used in DSpace receive event notifications when resources have changed, so that they can re-index them (and create / delete events for adding / deleting resources from the index).
ResourceSync may be useful within applications but the focus of this project is the Web. Applications using ResourceSync internally may want to namespace or extend event types.
Some systems require synchronisation in both directions - from a source to a destination, and then back again. These may be chained together in several steps. For example:
A -> B -> C -> A
How do we prevent unstoppable cycles of synchronisation?
Figure 22: Cyclic synchronisation
The systems involved will need to track identifiers and versions of records as they move through the synchronisation chain to ensure that change events do not constantly cycle around the system. Furthermore, parallel updates need to be flagged and notified on.
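One common loop-prevention technique, sketched here, is for each node to remember which events it has already propagated. Identifying an event by (URI, version) is an assumption; real systems might use a dedicated event identifier or provenance chain instead.

```python
def should_propagate(event, seen):
    """Suppress change events that have already passed through this node,
    preventing endless cycles in chains such as A -> B -> C -> A.
    `seen` is this node's persistent record of handled events."""
    key = (event["uri"], event["version"])
    if key in seen:
        return False
    seen.add(key)
    return True

seen = set()
event = {"uri": "http://example.org/1", "version": 2}
print(should_propagate(event, seen))  # first arrival: propagate
print(should_propagate(event, seen))  # came back round the loop: stop
```

Note that this does nothing for the parallel-update problem also mentioned above, which needs conflict detection rather than deduplication.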
A tree of nodes, with changes being made at the bottom by a large number of nodes, propagating up the tree to fewer and fewer nodes, which then propagate back down to larger numbers of nodes.
Library catalogue records often sync up to union catalogues and beyond (perhaps local to regional to national to international). Changes are made at the local level, and can then propagate up to other systems. Changes also propagate down from higher union catalogues.
How do we know that two Web resources are the same? This could be a provenance issue, but equally it might simply be that the content has changed (otherwise the fixity information would be the same).
Some advice needs to be given to implementers, even though this is considered Out of Scope.
This article has described the purpose of the ResourceSync Framework that is currently under development. In particular, it has described both the use cases that will guide the development and the purpose of use cases in the development process.
The variety of use cases shows that the ResourceSync Framework will be able to fulfil many different uses, from transferring large datasets in a laboratory, to populating Web archives of frequently changing Web sites; from providing mirrors of Web sites, to performing wholesale migrations of resources from old to new sites.
When the Framework is fully developed, this list of use cases, together with their associated issues for consideration, can be used as a checklist to ensure that the Framework supports all of the functions and modes that it needs to.
The ResourceSync team gratefully acknowledges the Alfred P. Sloan Foundation for its support of the project. In addition, the team members acknowledge the generous support of JISC in funding the participation of several UK members in the technical committee of ResourceSync.
This set of use cases was initially formed by the authors of this article, but was subsequently developed and completed by the whole project team.
The core project team consists of:
The technical group consists of:
Stuart Lewis is Head of Digital Library Services at the University of Edinburgh where he is currently responsible for a service portfolio including acquisitions, metadata, e-resources, digital library development, information systems, repositories, and research publications. He has worked with open repositories in various roles over the past six years and has a particular interest in interoperability issues. He is the Community Manager of the SWORD v2 Project, which continues to develop the SWORD repository deposit standard. Prior to working at Edinburgh, Stuart held the position of Digital Development Manager at the University of Auckland, New Zealand, and before that led the Web Applications and Repository Projects Team at Aberystwyth University.
Richard Jones has been working in Open Source and in/around Higher Education for over a decade. He is a long-term contributor to open source software, and in particular the DSpace repository platform. He is also an advocate of Open Access, and has written numerous articles on the subject, as well as co-authoring a book on a related topic. He has worked for a number of large HE institutions over the years, including the University of Edinburgh, the University of Bergen and Imperial College London. Subsequently he moved out of HE and first into commercial research and development (at HP Labs and Symplectic), and then on ultimately to founding Cottage Labs.
Simeon Warner is Director of the Repositories Group at Cornell University Library. Current projects include development of an archival repository, the arXiv e-print archive, and Project Euclid. He was one of the developers of arXiv and his research interests include Web information systems, interoperability, plagiarism detection, and open-access scholarly publishing. He has been actively involved with the Open Archives Initiative (OAI) since its inception and was one of the authors of the OAI-PMH and OAI-ORE specifications.
This article has been published under Creative Commons Attribution 3.0 Unported (CC BY 3.0) licence. Please note this CC BY licence applies to textual content of this article, and that some images or other non-textual elements may be covered by special copyright arrangements. For guidance on citing this article (giving attribution as required by the CC BY licence), please see below our recommendation of 'How to cite this article'.