Many of the issues faced by the e-Science Programme and the Digital Library community world-wide are generic in nature, in that both require complex metadata in order to create services for users. Both need to process large amounts of distributed data. Recognition of this common interest within both communities resulted in this invitation-only one-day workshop at the e-Science Institute in Edinburgh. It brought together interested parties from both the digital library and e-Science communities, and kicked off detailed discussion of the way forward for both.
About 80 people congregated in the e-Science Institute on Tuesday 30th of April. The day was grey and wet (no surprises there for veterans of the Edinburgh conference scene). The Institute turns out to be located in a former church in South College Street, just opposite the Old Quad of the University. The interior has been purpose designed for the Institute, but the framework of the building and its initial purpose have been respected: parts of the original structure are visible throughout. The main part of the Institute is based on decking running the full length of the body of the church, from the vestibule to the nave, at several levels. There is a lot of glass within the building, giving a sense of spaciousness to both offices and to the decks (also making it easy to see that it was still raining outside). The main lecture theatre, where the presentations were given, is right at the top of the building, and the original roof design arches over it.
Liz Lyon, Director of UKOLN, chaired the opening of the workshop. She gave a brief overview of what was expected from the day: the opening of a dialogue between the two communities about shared issues.
The Director of the e-Science Programme, Tony Hey, gave the opening keynote presentation. He began with some background to the programme. The initial idea for the programme came from John Taylor, Director General of the Research Councils (UK Office of Science and Technology). He gave the essence of the programme in a single sentence: e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.
The amount of money invested in the programme is significant: around 130 million pounds. The funding is for the people who will create it, since the hardware funding comes from a different pot of money. Matching funding is required from industry, and currently particle physics gets the lion's share.
As an illustration of the scale of the kind of data-processing which is proposed he mentioned a Geneva-based project which will generate petabytes of data, distributed around the planet. Hey then described four projects as a way of illustrating the nature of the Grid - Comb-e-Chem, Astro-Grid, MyGrid, and the Discovery Net project. The last of these involves the creation of a Discovery Process Mark-up Language, and may involve a new research methodology. Formerly, he argued, science was based on theory and experiment; in the late 20th century scientific simulation was developed as a procedure to add to theory and experiment. In the future, he suggested, there will be yet another layer of process that will add to scientific knowledge: this is 'collection-based research'. Knowledge will be derived from the systematic data-mining of very large datasets, often widely distributed around the world. This way of discovery can be characterised by the terms 'reduce', 'mine' and 'sift'.
Hey mentioned that there is an existing architecture for the Grid, and a toolkit (the Globus Toolkit, which has emerged as the international de facto standard); however, there is much missing functionality.
Turning to related developments in the Web arena, he discussed Web Services technology as a way of adding to and developing Grid functionality. Hey felt that this way of doing things has a service-oriented architecture similar to that of the Grid, where everything is seen as either a provider or consumer of services - exemplified by the 'publish', 'find', and 'bind' triangle. For the development of the functionality of the Grid as much as for the future of the development of the Web, "interoperability is what it is all about," and the Web Services model is a good one on which to base the technical development of the Grid. There is already an Open Grid Services Architecture (OGSA). Hey mentioned two independent studies which warmly endorse the idea. The architecture will exploit the synergy between the commercial Internet (IBM and Microsoft are both interested in the Web Services idea) and Grid Services, and the OGSA is a key middleware area for the UK e-Science programme.
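The 'publish', 'find', 'bind' triangle Hey described can be sketched as a toy service registry. The service name, capability label and endpoint URL below are hypothetical illustrations, not part of any real Web Services or OGSA interface:

```python
# Toy sketch of the 'publish', 'find', 'bind' triangle: a provider
# publishes a service description to a shared registry, a consumer finds
# a description matching its need, then binds to (invokes) the service.
# All names and the endpoint URL are hypothetical.

registry = []

def publish(name, capability, endpoint):
    """A provider advertises a service in the shared registry."""
    registry.append({"name": name, "capability": capability, "endpoint": endpoint})

def find(capability):
    """A consumer discovers services offering a given capability."""
    return [s for s in registry if s["capability"] == capability]

def bind(service, request):
    """Stand-in for a real network call to the service's endpoint."""
    return f"{service['name']} handled: {request}"

publish("metadata-search", "search", "http://example.org/search")
matches = find("search")
print(bind(matches[0], "title=grid"))
```

In a real deployment the registry would be a shared directory service and 'bind' a network invocation, but the three-role shape is the same.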
Metadata and ontologies are, he said, key to higher level Grid Services, since e-Science data and archives need to interoperate with conventional scientific literature in digital libraries. We need support for data federation as much as straight computational power. The resulting services will be very much like digital libraries. He concluded by quoting the originator of the Grid idea in the UK: "e-Science will change the dynamic of the way science is undertaken." There are 'big wins' available.
The JISC Information Environment and Architecture was next on the agenda, and the presentations were given by Alicia Wise and Andy Powell. Alicia Wise gave a general introduction to JISC Collections and Services. JISC funds research and development programmes (Information Environment development, eLib, the DNER, etc.). This work has involved the creation of many important partnerships, though she noted that JISC Collections do not yet amount to petabytes of resources! JISC has built a lot of access points, but so far 'they haven't come'. The resources are in practice difficult to find. JISC is trying to solve this problem with the development of the Information Environment. There are many visible synergies with the Grid idea. The challenge for both is to change the way in which researchers work and collaborate. We need new tools for this. Certain kinds of activity need to be managed, within a secure, high-quality network. However, currently users turn first to Google and Yahoo, rather than to the more sophisticated resource discovery tools.
Andy Powell talked about the technical infrastructure of the Information Environment, but deliberately avoided detailed discussion about the underlying protocols and standards. Much of what he said was based on the DNER Information Architecture study (of which he is the co-author), published in 2001 on the UKOLN Web site. Both Tony Hey and Alicia Wise had already used some of Powell's own slides, so where appropriate he merely recapped matters already discussed.
He looked at the 'problem space' from the perspective of the data consumer, and suggested that the driving characteristic of the architecture is that the user needs to interact with multiple data collections. Few cross-collection tools and services are available, and some material is part of the 'Invisible Web' [the 'Invisible Web' being those materials which are available on the Web in theory, but which in practice are not currently easy, or even possible, to find using conventional finding aids - since, for example, some resources live in databases and are published only in response to queries arriving from specific interfaces]. The user has to stitch resources together manually, since the material is human-readable and human-oriented and not amenable to automated processing.
The 'portal problem' being addressed by JISC is 'how to provide seamless discovery across multiple content providers'. The solution is the Information Environment. Portals are based on the four key concepts of 'discover, access, use, publish'. This is the principal target of the architecture, and subsequently the development of services which can bring material together. This can be achieved if services make their metadata available for searching and/or harvesting, and also if there are alerting tools indicating that services have resources available. In the Information Environment, access services are often referred to as 'portals'. (Alerting might be via RSS site summaries.) There are also fusion services ('brokers'). We need to join up discovery services. We also need localised views of available services (Powell suggested this was an area in which OpenURLs might be important).
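The discovery model sketched above - content providers exposing metadata for harvesting, and a portal searching across the merged result - can be illustrated minimally as follows. The provider names and records are invented for illustration, and a real harvester would of course speak a protocol such as OAI-PMH over the network rather than read an in-memory dictionary:

```python
# Minimal sketch of cross-provider discovery: each provider exposes its
# metadata records; a portal harvests them into one index and offers a
# single search across all collections. Providers and records are
# hypothetical.

providers = {
    "provider_a": [
        {"title": "Particle physics dataset", "subject": "physics"},
        {"title": "Survey of star catalogues", "subject": "astronomy"},
    ],
    "provider_b": [
        {"title": "Combinatorial chemistry results", "subject": "chemistry"},
    ],
}

def harvest(providers):
    """Pull every provider's metadata into one searchable index."""
    index = []
    for name, records in providers.items():
        for record in records:
            index.append({**record, "source": name})
    return index

def search(index, subject):
    """Seamless discovery: one query over all harvested records."""
    return [r["title"] for r in index if r["subject"] == subject]

index = harvest(providers)
print(search(index, "physics"))
```

The point of the sketch is the division of labour: providers only have to expose metadata; the cross-collection view is the portal's job.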
Infrastructural services for the Information Environment include the following: Service registry, authentication and authorization, resolver services, user preferences and institutional profiles, terminology services, metadata schema registries, and citation analysis. All of this is based on XML and DC, and all are based on the idea of metadata fusion - so we need a shared view of how this metadata is going to be used. Powell suggested that subject classification, audience level, resource type, and certification, are the four key areas of shared practice.
In conclusion Powell pointed out that Instructional Management Systems (IMS) digital repositories diagrams are similar to the slides illustrating the architecture of the Information Environment, since the problems the architectures are intended to solve are essentially generic.
The European Perspective was given by Keith Jeffery, of CLRC-RAL. Jeffery spoke on work going on or in prospect in the European Union, (EU), related to Grid ideas. He distinguished between GRID and GRIDS: the first of these is based on an American idea, floated in a book by Foster and Kesselman, the GRID Blueprint. In the European model of the information Grid there is (by contrast) a layered architecture.
'Data knowledge' in the Grid model connects major information sources, like the Information Environment interfaces. He suggested that this was very similar to the Information Environment definition of a portal. The 'Knowledge Grid' utilises knowledge discovery in databases (KDD). This provides interpretational semantics on the information, partly for the purposes of data mining. Suitable security controls are required, and these have to be appropriate to the source and to the accessor. It is also necessary to deal with IPR and rights access issues.
Jeffery has been discussing this Grid architecture within W3C. So the original US idea of the Grid has been linked with the Web Services concept, which, if implemented, should be able to handle the requirements. The key to all of this is metadata, which is, as he said, 'ridiculously important stuff'. Interestingly, not for the first time in this workshop a speaker subdivided metadata into different types with different functionality. He pointed out that metadata could be broken down into three different types: schema metadata, navigational metadata, and associative metadata. One of the advantages of DC metadata is that it is difficult to find another format which doesn't intersect with its fields. But it is insufficiently formal and unambiguous for machine understanding.
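Jeffery's point about Dublin Core intersecting with almost any other format can be sketched as a crosswalk from a richer, domain-specific record into DC fields. The source schema and field names below are hypothetical; only the DC element names (dc:title, dc:creator, dc:date) come from the real Dublin Core element set:

```python
# Sketch of a metadata crosswalk into Dublin Core. The richer source
# schema is hypothetical; the mapping shows both why DC intersects with
# most formats and why it is lossy - fields with no DC equivalent are
# simply dropped.

source_record = {
    "experiment_name": "Crystal growth run 17",
    "principal_investigator": "A. Researcher",
    "run_date": "2002-04-30",
    "instrument": "X-ray diffractometer",  # no clean DC equivalent
}

crosswalk = {
    "experiment_name": "dc:title",
    "principal_investigator": "dc:creator",
    "run_date": "dc:date",
}

dc_record = {crosswalk[k]: v for k, v in source_record.items() if k in crosswalk}
print(dc_record)
```

The dropped 'instrument' field illustrates Jeffery's caveat: DC is a useful lowest common denominator, but too coarse and ambiguous on its own for machine understanding.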
Jeffery then looked at how the European Union now frames the guidelines under which Grid-type project proposals are solicited. The Sixth Framework Programme no longer contains some work areas which became familiar to those who made proposals under the Fifth Framework Programme. The relevant area for proposals is now FP6 ERA (European Research Area). There are new instruments (EU jargon for how the programme will achieve its goals) for this framework, and GRIDS technology is prominent in the thinking of EU officials. The key phrases in documents which prospective applicants for funding might look out for are 'Information Landscape' (a Lorcan Dempsey coinage of several years ago, roughly coincident with the idea of the 'Information Environment') and the 'Knowledge Society'. Jeffery also talked about ERCIM, which is planning large-scale activities in the area of GRIDS (large-scale distributed systems) for the citizen, which is essentially the context of ERA. This has some relevance to JISC activities, in that FP6 plans to build on (and also build across) existing national initiatives. GRIDS (not GRID), he argued, provides an architectural framework to move forward.
The US view was given by Reagan Moore, Associate Director, San Diego Supercomputer Center (National Partnership for Advanced Computational Infrastructure). Moore's presentation was aimed more squarely at the interests of the computing community. It was about running applications in a distributed environment and interfaces between systems; about brokerage between networks; essentially about a particular vision of what is technically possible within Grid (or Grids) architecture - distributed computing across platforms and operating systems, rather than the business of searching for research data held in various formats across domains within a platform-independent environment. In both, metadata needs to be a key feature of the architecture, whether we are talking about finding and running software applications in a distributed computing environment, or the management of textual data.
Moore talked about Data Grids: these he defined as possessing collections of material, and providing services. So essentially a Data Grid provides services on data collections. This is, he said, very close to what is proposed for the architecture of digital libraries. The difficulty is that service providers must manage data in a distributed environment (i.e., the services and collections are not running on a single server). Data Grids offer a way of working over multiple networks, and are the basis for the distributed management of resources. The issues which need to be addressed he listed as: Data Management Systems; Distributed Data Management; and Persistent Data Management. What needs to be managed is: Distributed Data Collections; Digital Libraries; and Persistent Archives.
Digital entities in Moore's terminology are 'images of reality,' and are combinations of bitstreams and structural relationships. 'Every digital entity requires information and knowledge to be correctly interpreted and displayed'. He made some interesting differentiations between data, information and knowledge. Data he allocated to digital objects and streams of bits; knowledge to the relationships between the attributes of digital entities; and 'information' is 'any targeted data'.
The terminology used by Moore was different from that used by the UK speakers in a number of respects, though he was clearly speaking about very similar concepts (his information architecture slides made this clear). We'd been in the lecture theatre for two hours by the time he began his presentation, and probably it would have been fatal to an understanding of what he meant by 'abstraction' and 'transparency' to have missed the beginning of his talk by answering a call of nature. Ariadne was unlucky in this respect.
An interesting question came up in the Q&A session afterwards about the assignment of persistent identifiers to objects, and how a separate instance of a resource (within a Data Grid context) might be indicated, so that a researcher could choose which to access. If identifiers are assigned locally or institutionally, then the identifiers for two separate instances of a resource within the reach of the Data Grids (anywhere around the world) might be quite different, since each service provider may only have knowledge of the instance to which it added a persistent identifier, until a researcher links the second resource with the first. In other words, two instances of a resource (perhaps different editions) might have the same identifier, or else have totally different identifiers - which to some extent defeats the object of giving resources persistent identifiers.
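The problem raised in the Q&A can be made concrete with a small sketch. The identifier scheme (an institution prefix plus a local number), the institution names and the resource description below are all hypothetical, invented purely to show how independent local minting produces divergent identifiers for the same thing:

```python
# Sketch of the persistent-identifier problem from the Q&A: two services
# mint identifiers with only local knowledge, so one resource ends up
# with two unrelated identifiers, and nothing in the identifiers
# themselves reveals that they name the same thing.
# Scheme, institutions and resource are hypothetical.

registry = {}

def mint(institution, local_number, resource):
    """Assign a locally scoped 'persistent' identifier to a resource."""
    identifier = f"{institution}:{local_number}"
    registry[identifier] = resource
    return identifier

# Both institutions independently hold a copy of the same dataset.
id_a = mint("inst-a", "0001", "Survey dataset, 2nd edition")
id_b = mint("inst-b", "0001", "Survey dataset, 2nd edition")

# Different identifiers, identical resource - the link between the two
# copies is invisible until someone (or something) asserts it.
print(id_a, id_b, registry[id_a] == registry[id_b])
```

Global identifier schemes exist precisely to avoid this, but they require the kind of cross-institutional coordination the questioner was pointing at.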
1) Digital Preservation: do scientists give a damn about preservation? Perhaps society as a whole cares, and those who pay for the services. Many issues are management and procedural, as well as technical. Also, do we have the technical solutions for the implementation of policy decisions, and do we have the policy-making structures for the implementation of what is technically possible? Issues of scale were raised - there comes a time when the scale of the enterprise affects the nature of the solution. Is there a business model? (There appears not to be; maybe this question needs to be allocated to a couple of economics Ph.D. students for a study.) On the issue of repurposing, it was pointed out that the community will be collecting data for the Grid without knowing how the data will be repurposed. This means associated information is extremely important. It was suggested that annotations are a driving purpose for an archive. As for life-cycle issues: it was suggested that the community cannot trust the creators of data resources to make appropriate decisions on preservation. But it was also suggested that the self-archiving process might function as an enabler of serendipity, since the automation of the process of 'discovery' might be seen to be squeezing this out.
2) Metadata and Ontologies: this group discussed process capture for data analysis, and new methods of design and exploration which result in large quantities of statistics. We need tools for provenance and rollback, as well as automation of the discovery process. The example of combinatorial chemistry was used - making haystacks to find needles. The process involves data-mining the library of information created by the research. Some information stays in the lab (the associated metadata which makes it possible for the experiments to be repeated): this information needs to be preserved, and scientists need to understand its importance - younger scientists especially need to learn to record associated metadata while they are working in the lab.
Virtual data - a request for missing data may be met by simulation. That is, the characteristics of a particular molecule might be inferred from its place in an array of known molecules and their properties. This raises questions of the provenance of data, since the actual properties of the molecule are not measured but inferred.
There are various kinds of metadata: descriptive, annotative, etc. We have to understand what kinds of metadata are required to make the Grid viable. It was mentioned that the persistence of the data might be less important than the persistence of the system used to underpin the Grid (hardware, software, etc.). Also, we might need to build in a look-ahead time for system design because of the rapid development of the technology.
The question of the propagation of underlying data into derived data products was raised. A piece of derived data which turns out to be based on faulty primary data is naturally also false. If the derived information is arrived at as part of an automated process, then mechanisms for automatic correction of the data and even automatic publication of the new data might be desirable. In other words, changes in primary data need to be reflected upwards (again this raises the issue of provenance of data, and also the tracking of changes, or rollback).
The Semantic Grid has as its aim the bridging of the gap between current endeavour and the vision of e-Science. Ontologies are required for the automation of Grid processes. The conclusion is that scientific data and the associated information need to be closely defined within the context of the Grid and its processes. We also need better tools for creating metadata, good processes for working within collaborative workspaces, and the implementation of clear standards.
3) Data Movement: this group found themselves trying to catalogue the differences between the JISC Information Environment, the e-Science initiative, and Tim Berners-Lee's concept of the Semantic Web. There was discussion of resource discovery, and what the minimum requirements of a researcher are to make a resource discoverable. They also considered the question: what does the publication of data on the Grid actually mean? (Possibly a job for a working party to analyse). It was suggested that there would be more inclination to create complex metadata if it was easy to do this, and it was clearly understood how metadata should be created. If there was kudos associated with the production of good associated metadata, 'they'll do it'. We need carrots, sticks, and clearly stated requirements. If we want our information to be accessed via a JISC Information Environment portal, we also need the provision of tools to help the application of subject classification. And an important issue is the maintenance of quality.
On the issues of semantics and authenticity, it was argued that users want transparency. There are technical gaps to be filled. The Digital Library and Grid communities are starting from different places. Both communities should use the same authentication solution(s), which is better from the point of view of the user, as well as for creating a defined position for third parties and making the business of negotiation easier. Digital certificates were felt to be the 'right idea'. Authorisation was felt to be a more complex issue, and multiple solutions are required. Three action points resulted from the session. First, we should explore the digital certificate solution to the authentication issue and its sensitivities. Second, we need joint projects on authorisation issues - possibly JISC/e-Science collaborations. Third, we need collaboration with other programmes, among them the New Opportunities Fund (NOF).
The event was wound up by Malcolm Atkinson. He pointed out that collaboration is expensive. However, instant total collaboration is not what is wanted. So which points should we pick out? He hoped the community would 'crystallize out' some special interest groups. There is a buzz and some dialogue between both sides, and we shouldn't let this go, he said. Someone in the audience brought up the question of a notional timescale for collaboration, in terms of an impact date, and suggested that the community work to a six-month framework, otherwise there may be good intentions only. Atkinson responded by suggesting that if interested parties are going to do something, it will be started in the next month, but we need the results of working together in the next twelve months. He also indicated that the Programme was open to ideas from the community itself about what we might do.
University of Bath