Efforts to create standard metadata records for resources in digital repositories have hitherto relied for the most part on the simple standard schema published by the Dublin Core Metadata Initiative (DCMI), the Dublin Core Metadata Element Set, more commonly known as 'simple Dublin Core'. While this schema, by and large, met the aim of making metadata interoperable between repositories for purposes such as OAI-PMH, the explicit means by which it achieved this, a drastic simplification of the metadata associated with digital objects to only 15 elements, had the side effect of making it difficult or impossible to describe specific types of resources in detail. A further problem with this 'flat' metadata model is that it does not allow relationships between different versions or copies of a document to be described.
The extension of these 15 elements in DCMI Terms, known as 'qualified Dublin Core', was an effective admission that richer metadata was required to describe many types of resources. Arguably it remains feasible to use 'simple Dublin Core' fields for certain resource types such as scholarly works in cases where especially complex metadata are not required. The problem has been that almost every repository uses these elements to some extent according to local needs and practice rather than adopting a standard. Inevitably this has had a negative impact on interoperability.
Consequently, the concept of application profiles (APs) was developed. In essence, these are "...[metadata] schemas which consist of data elements drawn from one or more namespaces… and [are] optimised for a particular local application." As they were originally formulated, they were intended to codify the practice whereby it was acknowledged that "...implementors use standard metadata schemas in a pragmatic way." Consequently, on the principle of the "...maxim 'there are no metadata police', implementors will bend and fit metadata schemas for their own purposes." On the other hand, the engagement of both standards makers and implementors led first to the development of the Scholarly Works Application Profile (SWAP) and subsequently a range of other JISC-funded Dublin Core Application Profiles (DCAPs).
Following the precedent of SWAP, the DCAPs were based on a conceptual model inherited from the Functional Requirements for Bibliographic Records (FRBR), in which the key structural entities of the digital object (Group 1) are Work, Expression, Manifestation and Item. The rationale behind the specific choice of the FRBR structure over any other entity-relationship model remains undefined in the documentation for SWAP and the other DCAPs. Such a choice presupposes that the ability to define relationships between conceptual entities both within and between digital objects represents a core aim. In other words it introduces the concept of complex and systematic versioning as part of a common standard.
Since the explicit purpose of the DCAPs is to recommend standard usage for certain identified resource types, the tacit return of the "metadata police" may be seen as something of an irony, although the original customised model based on local needs is still in use elsewhere in a variety of projects and services. The new definition also assumes that DCAPs are built on the DCMI Abstract Model (DCAM). GAP (the Geospatial Application Profile) is exceptional in that it does not include an entity-relationship model, because it is designed to be a modular add-on to other DCAPs.
Unfortunately, the differing requirements of heterogeneous resource types have led to a good deal of structural variation between the DCAPs, so it may be doubted whether the relationships in a complex digital object that contains multiple resource types could easily be described in a way that is discoverable by software tools. The example which perhaps illustrates this best is the Images Application Profile (IAP), in which the Expression level inherited from FRBR was rejected. The conclusion was reached by explicit comparison with SWAP, which has set a broad precedent for the ongoing development of the other JISC-funded DCAPs. It was felt that changes to an image directly modify its physical Manifestation and are not in any meaningful way the kind of intellectual changes that produce a new Expression in the terms expressed by FRBR. The Expression entity was correspondingly omitted.
The solution for the problem of complex digital objects containing multiple resource types and thus described by multiple application profiles may well be found in the development of a standard common vocabulary of metadata terms, possibly also requiring a common 'domain model'. Together these could comprise a 'core DCAP'. This could allow common query patterns to be applied to DCAPs that are sufficiently similar in structure to the proposed 'core DCAP'. However, the difficulty may be that resource types require specific and relevant metadata that are not common to all, limiting such queries to generic fields. The whole issue requires considerable further analysis and practical testing.
It is perhaps unfair to make the unqualified observation that the DCAPs have been implemented virtually nowhere, since all but SWAP are still under development to a greater or lesser degree. On the face of it, SWAP might appear to be simple to implement, both because the metadata requirements for scholarly works are relatively simple and because the bulk of content in institutional and subject repositories consists of published academic papers or textual works. Yet the only implementation so far has been in the WRAP repository at Warwick University, using the EPrints 3.0 software, where SWAP has so far proved problematic within the context of institutional needs. This case study needs further analysis, but there seems too little basis on which to conclude that SWAP cannot succeed more widely.
The issue of greatest concern is that despite the completion and availability of an application profile for scholarly works since mid-2006, there has been little interest in developing the major software platforms to support SWAP implementation. In the fast-moving world of Web technologies, this lag appears significant and it has become clear that the reasons for the lack of implementation deserve closer scrutiny. The most obvious area of concern is the complexity of the FRBR model. It needs to be made clear whether the model fits the needs of resource delivery on the Web and, if it does, which factors have prevented implementation and how they can be addressed.
The FRBR Entity-Relationship Model
The FRBR entity-relationship model (sometimes known as FRBRER) contains several more entities than the subset inherited by SWAP. The latter has only the crucial Group 1 entities and an amalgam of the Group 2 entities, whereas several of the other DCAPs include Group 3 entities:
|Group|FRBR|SWAP|IAP|TBMAP|LMAP|
|Group 1|Work|Scholarly Work|Image|Work|Work|
|Group 2|Corporate Body|Agent|Agent|Corporate Body|Agent|
Note that Person in IAP is a Group 3 entity, as its FRBR namesake in Group 2 is amalgamated with Corporate Body as Agent. In LMAP, the entity Context is an innovation which is not inherited from FRBR. As GAP is intended to be a modular extension to the other DCAPs, it does not have an entity-relationship model. There is currently no available model for SDAP, which is consequently omitted here.
On the basis of the above table, it seems that interoperability issues emerging purely from differences between the entities are not especially difficult. In addition to the missing Expression entity in IAP, there are minor issues in each of the groups:
There is a relatively insignificant difference in nomenclature in the top and bottom levels of Group 1, i.e. SWAP Work > Scholarly Work, IAP Work > Image, TBMAP Item like FRBR but others Item > Copy. This is easily resolvable through mappings to a future 'core DCAP' in order to ensure interoperability.
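The kind of nomenclature mapping described here could be sketched as a simple lookup table. The following is a minimal illustration only: no 'core DCAP' has been published, so the core labels (which simply reuse the FRBR names), the `CORE_ENTITY_MAP` table and the `to_core` function are all hypothetical, and only the name differences mentioned in the text are recorded.

```python
# Hypothetical mapping of profile-specific Group 1 entity names to core labels.
# Assumption: the core labels reuse the FRBR names (Work, Item).
CORE_ENTITY_MAP = {
    "SWAP": {"Scholarly Work": "Work", "Copy": "Item"},
    "IAP": {"Image": "Work", "Copy": "Item"},
    "LMAP": {"Copy": "Item"},
    "TBMAP": {},  # uses the FRBR names unchanged
}


def to_core(profile: str, entity: str) -> str:
    """Return the core label for an entity name used in a given profile.

    Names with no explicit mapping are assumed to match FRBR already.
    """
    if profile not in CORE_ENTITY_MAP:
        raise KeyError(f"unknown application profile: {profile}")
    return CORE_ENTITY_MAP[profile].get(entity, entity)


if __name__ == "__main__":
    print(to_core("SWAP", "Scholarly Work"))
    print(to_core("IAP", "Image"))
    print(to_core("TBMAP", "Item"))
```

A table of this kind covers only differences of naming. It does not address structural differences such as the absence of Expression in IAP.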
It should be simple enough for software to establish automatically how the entities in the various DCAPs correspond to each other. Other than in IAP, the vertical relationships between entities, i.e. Is Expressed As, Is Manifested As and Is Available As, make this entirely transparent. However, any such process would presume that machine-readable encodings of practical use in software development were available. The assumption has been that the DCAPs would need to be serialised using XML Schemas, since this has been done for other APs, e.g. eBank. There are other potential alternatives such as RELAX NG and RDF. No such schemas have been published to date and it remains unclear to what purpose they might be put in terms of practical development projects. Since the idiosyncratic DC-TEXT notation used for SWAP has not led to development, it seems that speculative provision of such a schema in a mainstream format might at least provide tools for some initial development and perhaps a better understanding of what may be required in future.
One potential problem in automatically establishing correspondences between the differently named entities is likely to arise in IAP, where Image and Manifestation are linked by the relationship Is Manifested As. From this relationship in FRBR one could normally deduce that Image corresponds to Expression, whereas here it corresponds to Work. Compounding the problem, the entities Agent and Manifestation are related directly by Is Created By in the IAP model, in contrast to Agent and Work (or the equivalent) in the other DCAPs. Presumably the conceptual model implies that images are directly manifested, although why this does not hold true just as well for a scholarly paper or a video clip is perhaps a moot point. However, it will be apparent in any schema that Image is the top-level entity. The main issue raised by these discrepancies is how any future 'core DCAP' would take account of variant models.
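The IAP discrepancy can be made concrete with a short sketch. It assumes, purely for illustration, that software infers an entity's FRBR level from the vertical relationship leading out of it. The relationship names come from the text, but the `infer_level` function and the `LEVEL_OVERRIDES` table are hypothetical and do not correspond to any published schema.

```python
# In FRBR, each 'vertical' relationship starts from one specific Group 1 level.
FRBR_SOURCE_LEVEL = {
    "Is Expressed As": "Work",
    "Is Manifested As": "Expression",
    "Is Available As": "Manifestation",
}

# Hypothetical per-profile overrides for models that depart from FRBR.
# IAP has no Expression level, so Image (its top-level entity) is the
# source of Is Manifested As.
LEVEL_OVERRIDES = {
    ("IAP", "Image"): "Work",
}


def infer_level(profile, entity, outgoing_relationship, use_overrides=True):
    """Guess which FRBR level an entity corresponds to."""
    if use_overrides and (profile, entity) in LEVEL_OVERRIDES:
        return LEVEL_OVERRIDES[(profile, entity)]
    return FRBR_SOURCE_LEVEL[outgoing_relationship]


if __name__ == "__main__":
    # Naive inference misreads IAP's Image as an Expression.
    print(infer_level("IAP", "Image", "Is Manifested As", use_overrides=False))
    # With the override, the intended correspondence to Work is recovered.
    print(infer_level("IAP", "Image", "Is Manifested As"))
    # SWAP follows FRBR here, so no override is needed.
    print(infer_level("SWAP", "Scholarly Work", "Is Expressed As"))
```

The need for an explicit override table is itself the point: a variant model cannot be reconciled by relationship names alone, so any 'core DCAP' would have to record such exceptions somewhere.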
The amalgam of Group 2 entities as Agent in all but TBMAP is not a matter of particular concern, since the attribute entityType makes it clear to which FRBR entity the metadata corresponds, and it is thus a matter of mapping. Consequently it may not be crucial whether or not the entities are distinguished in a future 'core DCAP'.
The Group 3 "subject" entities are not included in SWAP and LMAP as they are in the other DCAPs. The omission of these entities was evidently by design and reflects the relative lack of importance of making such fine distinctions in subject metadata to the framers of these DCAPs. It is worth noting that these entities differ from those in Groups 1 and 2 in the respect that they generally contain a very limited number of metadata fields, usually just a single field. The Group 3 entities are clearly intended for the purposes of adding semantic information to subject metadata to assist search tools. Since none of the DCAPs that employ Group 3 entities are fully developed, it has not yet been explored whether existing or future search tools could usefully exploit them, which would no doubt depend heavily on the quality of metadata.
If the 'core DCAP' is to be based on what the DCAPs have in common, i.e. a subset approach, it is presumably unnecessary to include specific Group 3 entities in a future 'core DCAP'. However, this would imply that the 'core DCAP' would be of more limited use in facilitating cross-searching of resources by subject. The alternative would be to include all possible entities, even though they would be unexploited by SWAP and LMAP. Semantic subject searches on this basis would then prove impossible for these two application profiles. It remains to be seen which of these two approaches will prove most useful. In any case, the Has As Subject relationship makes it clear to any potential software that any entities so described are for the purposes of search interfaces. The issue of whether to include the Group 3 entities in the 'core DCAP' for the purposes of search tools needs further analysis, but it should not greatly affect more general interoperability between the DCAPs.
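The difference between the subset and superset approaches can be illustrated with set operations. The entity sets below are a simplified reading of the text rather than the published models: names are normalised to FRBR labels, TBMAP's Person and Corporate Body are collapsed to Agent for comparison, and the Group 3 entities are represented by a single placeholder label because the text does not enumerate them per profile.

```python
# Simplified, illustrative entity sets (assumptions noted in the lead-in).
PROFILE_ENTITIES = {
    "SWAP": {"Work", "Expression", "Manifestation", "Item", "Agent"},
    "IAP": {"Work", "Manifestation", "Item", "Agent", "Group 3 subject"},
    "TBMAP": {"Work", "Expression", "Manifestation", "Item", "Agent",
              "Group 3 subject"},
    "LMAP": {"Work", "Expression", "Manifestation", "Item", "Agent",
             "Context"},
}


def core_subset(profiles):
    """Entities shared by every profile (the 'subset' approach)."""
    return set.intersection(*profiles.values())


def core_superset(profiles):
    """Every entity used by any profile (the 'superset' approach)."""
    return set.union(*profiles.values())


def unused_by(profile, profiles):
    """Superset entities that a given profile would leave unexploited."""
    return core_superset(profiles) - profiles[profile]


if __name__ == "__main__":
    print(sorted(core_subset(PROFILE_ENTITIES)))
    print(sorted(core_superset(PROFILE_ENTITIES)))
    print(sorted(unused_by("SWAP", PROFILE_ENTITIES)))
```

Under these assumptions the subset approach would drop not only the Group 3 entities but also Expression, since IAP omits it. This suggests that a strict intersection may be too coarse without first resolving the structural variations.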
It has already been noted that the structural relationships between entities are by and large consistent, and that it will be necessary to analyse whether irregularities such as those in the IAP entity-relationship model will create interoperability problems or not. Apart from the relationships that exist purely as a part of the entity-relationship model, there are 'sideways' relationships that are essentially metadata fields that contain the URL of an entity in another digital object, i.e. part of or a whole separate Work entity. For example, Has Translation in SWAP provides a method to indicate a translation, whatever the status of that resource in itself.
Unlike in the structural relationships, which consist of an attribute of this kind linked to an identifier in the target entity, it may not be possible to guarantee that the relationship will be bi-directional if (1) the other digital object is not held in the same repository; or (2) the metadata was updated at a different time or by another individual who did not necessarily make the same cataloguing decisions, perhaps in part because not all of the resources then existed or because there existed resources of which the individual was then unaware.
Ideally of course, there would be a software mechanism to inform the repository manager that the link had been created and that a reciprocal link needed to be created, although it would require interoperable software and metadata. Inevitably, there will always be occasions when this is impossible, such as for example when the resource is on the Web but not in a repository.
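A mechanism of this kind might be sketched as follows. This is a minimal illustration under stated assumptions: the inverse relationship name `Is Translation Of`, the record structure and the example identifiers are all hypothetical, and no repository platform is known to expose such a function.

```python
from typing import Dict, List, Tuple

# Hypothetical inverse names; only Has Translation is taken from the text.
INVERSE = {
    "Has Translation": "Is Translation Of",
    "Is Translation Of": "Has Translation",
}

# A record store maps an identifier to its (relationship, target) links.
Records = Dict[str, List[Tuple[str, str]]]


def missing_reciprocals(records: Records) -> List[Tuple[str, str, str, str]]:
    """List the reciprocal links that a repository manager should review.

    Each result is (target, inverse relationship, source, reason).
    """
    missing = []
    for source, links in records.items():
        for relationship, target in links:
            inverse = INVERSE.get(relationship)
            if inverse is None:
                continue  # no known inverse, nothing to check
            if target not in records:
                # Case (1) in the text: the other object is held elsewhere.
                missing.append((target, inverse, source, "not held locally"))
            elif (inverse, source) not in records[target]:
                # Case (2): the target was catalogued without the link.
                missing.append((target, inverse, source, "reciprocal link absent"))
    return missing


if __name__ == "__main__":
    sample = {
        "urn:example:paper-en": [
            ("Has Translation", "urn:example:paper-fr"),
            ("Has Translation", "http://example.org/paper-de"),
        ],
        "urn:example:paper-fr": [],
    }
    for row in missing_reciprocals(sample):
        print(row)
```

For targets that are not held locally the sketch can only report the gap. Creating the reciprocal link would still depend on interoperable software at the other end, or would be impossible where the resource is simply on the Web.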
In general, it is possible to consider 'sideways' relationships within DCAPs in the same way as any other attributes. However, their distribution within the entity models, which is largely inherited from FRBR, seems to raise special problems in describing the relationships between digital objects accurately. Using SWAP as an example, a number of areas of potential inflexibility arise as a result, for example:
- An adaptation might have been made from a particular version of a digital object, i.e. Expression, or else from several versions, or alternatively the version used may be unknown. It is only possible to use Has Adaptation within the Work entity, so there is no scope for stating that one particular Expression was used. (It might equally be desirable to describe several specific versions that were used.)
- Conversely, the more general attribute Has Version in the Expression entity allows a 'version, edition or adaptation' to be described. It does not allow any specific qualification about which of the above is intended, so the use of these two relationships for adaptations is ambiguous. For types other than adaptations, it is clearly impossible to state that, for example, an edition was produced from a Work but it is not known which Expression was used to produce it.
- Translations may be indicated by Has Translation in the Expression entity, but again it is only possible to indicate that the translation was carried out on the basis of one particular Expression rather than on the basis of the Work where it is not known which version was used, although it is possible for several Expressions to be related to one resource using Has Translation attributes. (From the point of view of any one of these Expressions, this might tend to give the misleading impression that only that specific Expression had been used.)
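The constraints in the list above can be expressed as a small validation table. The sketch below records only the three SWAP relationships discussed in the text and the entity at which each is permitted. The `can_state` function is hypothetical and is not part of SWAP.

```python
# Entity at which each 'sideways' relationship may be used in SWAP,
# as described in the text. Other SWAP relationships are not covered.
SWAP_RELATIONSHIP_LEVEL = {
    "Has Adaptation": "Work",
    "Has Version": "Expression",
    "Has Translation": "Expression",
}


def can_state(relationship: str, source_level: str) -> bool:
    """Return True if the relationship may be recorded at the given level."""
    return SWAP_RELATIONSHIP_LEVEL.get(relationship) == source_level


if __name__ == "__main__":
    # An adaptation made from one known Expression cannot be stated as such.
    print(can_state("Has Adaptation", "Expression"))
    # A translation made from an unknown version cannot be tied to the Work.
    print(can_state("Has Translation", "Work"))
    # The only placements the model permits.
    print(can_state("Has Adaptation", "Work"))
    print(can_state("Has Translation", "Expression"))
```

A more flexible model might allow each relationship at more than one level, at the cost of a less predictable structure for software to query.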
It appears that the framers of TBMAP have attempted to foresee the possible relationships that would be likely to occur between entities belonging to different digital objects. There is a plethora of different relationship types, especially within the Expression entity. Exactly how these various relationships should be tested against sample metadata records, especially within the context of any 'core DCAP', may require considerable further analysis and user testing.
Again, the question of which relationships to include in this 'core DCAP' depends upon whether the latter is intended as a subset of the commonalities between the DCAPs or as a potentially highly complex superset containing all of the relationships required. If any new DCAPs were later added, the latter approach could create problems for backwards compatibility. It remains difficult to decide between these options without a much clearer analysis of the intended purpose and practical applications of such a common metadata set and entity model.
Similar difficulties arise with assigning certain attributes to the FRBR entities in the DCAPs. It may not be the case that all resources fit the metadata requirements predicted by the FRBR model. For illustration, SWAP and IAP will be used here as the main examples because an entity model and a list of attributes are available for them both in a finalised form (although, for simplicity, the attributes mentioned do not in every case occur in both). The principle is likely to apply more widely, however.
Work and Expression
It is perfectly possible to have a grant that applies only to one Expression of a whole Work, e.g. to a particular report commissioned by a funding council but not to an academic paper that came out of the same conceptual idea or Work. The use of Grant Number and Funder is only allowed at the Work level of SWAP, however, which makes it impossible to describe the two satisfactorily as different Expressions of one Work where that would be appropriate. The same is also true of Grant Number in multiple Manifestations of one Image in IAP. In the nature of images, some derived images may have been produced for different purposes from the others and thus not fall under the same grant; yet they may best be represented as a single Image as a conceptual work. Similarly, the inclusion of Supervisor in the Work level of SWAP makes it impossible to describe a thesis and a derived monograph as two Expressions of a single Work, no matter that they are clearly related versions rather than separate endeavours, as the supervisor would be incorrectly assigned to the book as well. The addition of affiliated institution in SWAP is a further confusion: for example, one Expression might well have been produced at a different institution from the rest.
Abstracts, descriptions and subject metadata are only allowed at the Work level in SWAP, though these too could vary between different Expressions, e.g. revisions in editing, or certain Expressions only representing part of the endeavour and thus only covering certain subjects or aspects of the whole. An example of this could be a chapter of a book published as an article. While it seems unlikely (but perhaps not impossible) that the subject of an Image could change without producing a new work, the description might differ between Manifestations if the changes made to the formatting required additional comments. In the case of an edition of a text book, it is even possible that an additional chapter could have been written by a new author, and so one of the creators listed at the Work level would be incorrectly ascribed to the other Expressions. It is not necessarily clear that this should be a new Work in the context of online repositories, even though this would be the normal approach in library cataloguing. It makes obvious sense to list the various versions as forms of each other for the purposes of Web delivery.
Conversely, titles are allowed in both Work and Expression entities in SWAP (but only at the Image level in IAP, which is unaffected here), although this makes it difficult to decide which is canonical for the whole work if the titles of the various Expressions differ. Such changes might well occur between pre-prints and post-prints during the process of peer review. It should not be the role of a DCAP to require that the repository manager decide whether the original title of a pre-print or the published title of a post-print is more appropriate for the whole Work, especially if those titles are appropriate to a specific version. In this instance, the Work entity exists purely as a container for its various forms: a conceptual derivation produced retrospectively.
In the case of a digital object where the Work represents a known conceptual endeavour, such as Homer's Iliad, it makes far more sense to employ the FRBR model. Addressing the issues raised above in the context of this example, it may be noted that (1) issues of funding or supervision clearly do not arise; (2) all editions will have the same subject and canonical title. It is still possible that the description may vary if there has been an abridgement, or to describe the particular manuscripts used in producing a certain reading. It would be possible to consider translations either as Expressions if they introduced no novel work or, for example in the case of verse translations, as new Works where they added substantially new intellectual content created by a new author. A further small anomaly is that it is impossible to indicate that such a Work had an original language if the manuscripts themselves are not catalogued as separate Expressions, which would appear rather pointless in a digital repository unless they had been digitised. At this stage in the analysis it is important to recall that these are fringe areas of scholarly works in repositories, whose impetus for development has by and large been published journal articles. In most cases, it is readily apparent that the Work level is little more than a container.
The lesson to be drawn from the comparison of a few different models of potential resources in SWAP and IAP is that a one-size-fits-all approach is likely to be inappropriate for a significant quantity of resources. This is likely to depend heavily on the resource type, so FRBR may fit certain resource types better than others. The very fact that IAP departs from the FRBR model is evidence of this. The process of developing new DCAPs has generally followed the precedent of SWAP with respect to FRBR, without sufficient testing against metadata samples and user requirements being published in the documentation.
In the case of SWAP at least, it seems clear that the partition of Work and Expression is an artificial one, and that it might be better to employ a more flexible model that allows both for the container pattern (i.e. pre-print and post-print) and the recursive pattern (i.e. the Iliad containing various versions that may also be discrete works in their own right). The relationships between the 'structural' entities Work and Expression on the one hand, and on the other the 'sideways' relationships to related resources, are perhaps too rigidly differentiated since the cataloguing choices that will be made in practice are often a matter of considerable interpretation. It remains to be seen whether this is more widely true of other resource types. The approach taken in the development of GAP is perhaps illustrative here, since it was intended that geospatial information could be attached to any resource type described by another DCAP. The lack of an entity model effectively makes metadata attributes available for use with any entity. This flexibility avoids all of the issues described above, and it seems to be a useful demonstration of the value of avoiding an overly prescriptive approach.
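One way to picture a model that accommodates both patterns is a single recursive node type, in which any resource may carry its own metadata and may contain further versions. This is a speculative sketch, not a proposal drawn from SWAP or any other DCAP: the `Resource` class, its fields, the identifiers and the titles are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """A node that may carry its own metadata and contain further versions."""
    identifier: str
    metadata: Dict[str, str] = field(default_factory=dict)
    versions: List["Resource"] = field(default_factory=list)

    def add_version(self, version: "Resource") -> "Resource":
        """Attach a version to this resource and return it for chaining."""
        self.versions.append(version)
        return version

    def depth(self) -> int:
        """Number of levels in the tree rooted at this resource."""
        return 1 + max((v.depth() for v in self.versions), default=0)


# Container pattern: the parent carries no title of its own, so nobody has
# to decide which version's title is canonical.
paper = Resource("urn:example:paper")
paper.add_version(Resource("urn:example:paper/pre-print", {"title": "Working title"}))
paper.add_version(Resource("urn:example:paper/post-print", {"title": "Published title"}))

# Recursive pattern: a version is itself a work with versions of its own.
iliad = Resource("urn:example:iliad", {"title": "Iliad"})
translation = iliad.add_version(
    Resource("urn:example:iliad/verse-translation", {"title": "The Iliad"})
)
translation.add_version(Resource("urn:example:iliad/verse-translation/2nd-edition"))

if __name__ == "__main__":
    print(paper.depth(), iliad.depth())
```

The trade-off is that a free-form tree gives software fewer guarantees about where a given attribute will be found, which is the interoperability concern that motivated a fixed entity model in the first place.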
Manifestation and Copy/Item
The purpose of creating a separate Manifestation level in FRBR was to state that every Item (or Copy in SWAP, IAP and LMAP) contained therein is identical in format. Hence a PDF file and the DOC file from which it was produced, although they may appear identical to the eye, are different Manifestations of the same Expression. An author's formatted post-print is equally a different Manifestation from the publisher's typeset version, but remains the same Expression provided that the content is the same. There is a potential grey area in which minor corrections and typographical changes may alter the content to a small degree, which could perhaps be overlooked as long as it remains substantially the same. In the case of resources which have a great deal of metadata relating to their physical nature, such as images or time-based media, it is easy to see that differentiating between the entities is potentially useful.
By contrast, in SWAP, the only metadata items (excluding those required purely by the FRBR structure) are Format, Date Modified and Publisher. These all raise interesting problems and arguably fit better elsewhere in the model.
It could be seen as rather pointless to record the file format as metadata, since the bitstream itself records this, and all repositories separately record the MIME type, file hashes and file length. If two files are not of the same format, de facto they are not the same Manifestation either (although the fact that they are of the same format does not guarantee the reverse). Conversely, file hashes and file length may differ between two files even though for practical purposes there is no difference between them.
While the date that the Manifestation was last modified may be relevant to an image or a video clip, it is harder to see why it is useful to differentiate between the date that a particular Expression of a textual document was modified and the date that a Manifestation thereof was modified, since the former is conceptual rather than physical and must be deduced from the earliest existing physical copy. The bitstream carries this information itself, although the date of copies may not be reliable. In the same way, the date that a Manifestation was modified is ultimately deduced, although in practice the information might be obtained from a publisher's Web site in the case of a publisher's formatted post-print. It may be useful to provide, or at least allow, a date for each entity, but not necessarily as it currently exists based upon FRBR.
It is curious that some metadata related to publication are in the Expression entity in SWAP, such as version number, copyright holder and editor, whereas the name of the publisher is indicated in Manifestation. The rationale would appear to be that the author's formatted post-print should not be associated with the publisher, yet this conflicts with the fact that peer review was arranged by that publisher. It should not be the place of application profiles to make ideological distinctions. It seems sufficient to record to whom the copyright of the Expression and any copies deriving from it belong, where the formatting belongs to the publisher.
Overall, the Manifestation entity in SWAP is, unlike in the case of IAP and TBMAP, largely a container whose existence serves to state whether or not copies are identical, along with some metadata distinctions of questionable use in their context, and it remains untested whether a different entity model would perform better.
The process by which the DCAPs came to be modelled on FRBR was not based on published results from user testing or analysis of the suitability of the entity model based on sample metadata records. In fact, the rationale appears to have been more practical: it is a reasonable assumption that where multiple DCAPs follow the same structure, there will be interoperability benefits. The reasons behind the adoption of FRBR in SWAP remain unexplained in the published documentation. However, this precedent has been followed unless there was an overriding objection, as in the cases of IAP and GAP. The important point is that FRBR was designed for library catalogues, not repositories. The purposes and requirements of Web delivery of resources through repositories are very different to those of library systems. It seems unwise, therefore, to adopt a library-based standard such as FRBR for repository metadata without explicit justification. Any unnecessary complication may have an adverse effect upon the ability of a repository to collect good metadata and of developers to produce intuitive, user-friendly workflows for metadata entry.
The purpose of the slight modification in collapsing FRBR's Person and Corporate Body into a single Agent entity seems unclear, but there are no significant modifications to the entity models, except in IAP, that would affect mapping, other than trivial differences of nomenclature. Where the DCAPs differ most is within the entities. Since the majority are incomplete and good testing needs to be based on domain-specific knowledge, it is difficult to comment specifically on what the impact of the FRBR structure may be upon these DCAPs. The example of SWAP, however, tends to indicate that there are at least theoretical objections that require practical testing, since the FRBR model may remove the flexibility required to describe heterogeneous textual resources and may impose inappropriate decisions about canonical Work level metadata. Merging entities may risk the loss of valuable information, but it should not be simply assumed that all metadata are necessarily valuable within a given entity.
The question remains as to whether FRBR is suitable for Web delivery within repositories, and for which specific resource types and DCAPs. Until this is answered with practical testing, it will be difficult or even impossible to frame a 'core DCAP' and subsequently analyse whether the concept would be of practical use. Metadata need to be flexible and re-usable in the fast-changing world of repositories. In order to make best use of the specific improvements to repository metadata that the DCAPs have provided, it may be to their advantage to re-analyse their entity models.
- Dublin Core Metadata Initiative http://dublincore.org/
- Dublin Core Metadata Element Set (Version 1.1)
- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
- It may be noted that from time to time there have been variations in the precise implementation of these 15 elements, notably between the DCMI and OAI recommendations for implementing DC, as well as software specific issues, e.g. the use of dc.contributor.author instead of dc.creator in DSpace.
- DCMI Terms http://dublincore.org/documents/dcmi-terms/
- Rachel Heery and Manjula Patel, Application profiles: mixing and matching metadata schemas, Ariadne, September 2000
- Listed at http://www.ukoln.ac.uk/repositories/digirep/index/Application_Profiles. There are other APs based largely on Dublin Core, but 'DCAPs' here means those funded by JISC, unless noted otherwise.
- Functional Requirements for Bibliographic Records http://www.ifla.org/VII/s13/frbr/frbr.pdf
- Note that item was renamed as copy to avoid confusion with the wider use of that term in repositories to mean "digital object", e.g. a document, image etc. with a metadata record.
- Pete Johnston & Rosemary Russell, JISC Metadata Application Profiles, Data Models and Interoperability http://www.scribd.com/doc/3845648/JISC-Metadata-Application-Profiles-Data-Models-and-Interoperability 2, §4.
- Dublin Core Abstract Model http://dublincore.org/documents/abstract-model/
- Images Application Profile http://www.ukoln.ac.uk/repositories/digirep/index/Images_Application_Profile
- ibid., 3.
- Pete Johnston & Rosemary Russell, JISC Metadata Application Profiles, Data Models and Interoperability http://www.scribd.com/doc/3845648/JISC-Metadata-Application-Profiles-Data-Models-and-Interoperability 3.2.2, §5.
- WRAP, University of Warwick http://blogs.warwick.ac.uk/wrap/entry/swap_and_e-prints/
- http://www.ukoln.ac.uk/repositories/digirep/index/Scholarly_Works_Application_Profile, 1, §1.
- eBank http://www.ukoln.ac.uk/projects/ebank-uk/schemas/