Repository Fringe 2010

Martin Donnelly

Martin Donnelly (and friends) report on the Repository Fringe "unconference" held at the National e-Science Centre in Edinburgh, Scotland, over 2-3 September 2010.

2010 was the third year of Repository Fringe, and slightly more formally organised than its antecedents, with an increased number of discursive presentations and less in the way of organised chaos! The proceedings began on Wednesday 1 September with a one-day, pre-event SHERPA/RoMEO API Workshop [1] run by the Repositories Support Project team.

2 September 2010

Opening the event proper on Thursday morning, Sheila Cannell, Director of Library Services, University of Edinburgh, used the imminent Edinburgh festival fireworks as a metaphor for the repository development endeavour. They can be enjoyed for free at various vantage points across the city, or people can pay to gain entry to Princes Street Gardens to experience them in more comfort; but each group receives different versions of the same experience. Openness works along similar lines: someone has to pay, but all can benefit.

Keynote Address

Dr Tony Hirst (Open University)

Tony Hirst's keynote address was entitled 'Open etc' and began with an overview of the scholarly communication workflow, positioning the publications repository as the institution's memory. Time is often wasted when repository staff have to go and seek out papers published by members of the university, and there are other places and ways for storing content. Hirst gave the example of his own blog and other online mechanisms for dissemination. Rather than publish in traditional academic journals, he uses Slideshare as a place to expose his research, and also blogs unformed thoughts which are subsequently refined through user feedback/comments. The blog therefore becomes a kind of notebook and a structured 'repository' / database.

The remainder of Tony's talk dealt with ways of structuring and transforming textual content so that it can be processed as data, addressing key concepts such as discovery, disaggregation, and representation. Documents can be 'chunked' by assigning URIs to each paragraph, and services built atop WordPress's standard RSS feeds using tools like Yahoo Pipes. Tony showed a map of usable (and free) workflows between applications, from Wikipedia to Google Maps via HTML, CSV, KML and <embed>ding; a similar process can be used to link RSS feeds and Mendeley reading lists. Tony called this process 'content liberation': using free and simple-to-use means, he demonstrated how the interrelation of textual content can be made clearer by the application of visualisation tools.
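To make the idea of 'chunking' a little more concrete, here is a minimal Python sketch that splits a document into paragraphs and gives each one its own URI. The `#p1`, `#p2` fragment scheme, the function name and the example address are assumptions made purely for illustration; the talk did not prescribe a particular scheme.

```python
import re

def chunk_paragraphs(text, base_uri):
    """Split text on blank lines and pair each paragraph with a fragment URI."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # The '#pN' fragment naming is a hypothetical convention, not a standard.
    return [(f"{base_uri}#p{i}", p) for i, p in enumerate(paragraphs, start=1)]

post = "Unformed thoughts go here.\n\nReaders refine them in the comments."
for uri, paragraph in chunk_paragraphs(post, "http://example.org/blog/post"):
    print(uri, "->", paragraph)
```

Once each paragraph is addressable in this way, it can be linked to, commented on or fed into other services independently of the document that contains it.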

He ended his talk with a reminder of Ranganathan's five laws of library science [2], and a few thoughts on how they might be applied to data repositories.

Presentation: MEMENTO: Time Travel for the Web

Herbert Van de Sompel (Los Alamos National Laboratory / Old Dominion University)

The central topic of Herbert Van de Sompel's talk was the Memento Project, a Web archiving interface which aims to make it easy to navigate Web sites from the past. Herbert cited Tim Berners-Lee on generic versus specific resources, noting that resources vary over time. But automated Web archivers cannot harvest all content, especially pictorial/video content.

There are already Web archives which record the changes that have taken place to particular pages, but Memento is concerned with navigating archived resources. Memento's raison d'être is that recreating/revisiting the experience of the past is more compelling than just reading a summary of changes of, for example, a Wikipedia page. Memento codifies existing ad hoc methods which link the past to the present, and creates a linkage from the present to the past; Herbert noted that the former challenge is considerably simpler than the latter!

Memento's TimeGate front-end uses 'content negotiation in the date-time dimension', and utilises protocol settings that are built into the Hypertext Transfer Protocol (HTTP). Herbert ended by stressing the value of HTTP URIs, and expressing the view that Memento serves to extend their power.
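As an illustration of what 'content negotiation in the date-time dimension' might look like from a client's point of view, the following Python sketch builds the headers that could be sent to a TimeGate. It assumes the `Accept-Datetime` request header defined by the Memento protocol, with the date expressed in the usual HTTP date format. The TimeGate address and the function name are hypothetical, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate address, used only for illustration.
TIMEGATE = "http://timegate.example.org/timegate/"

def memento_request(original_uri, when):
    """Return (url, headers) asking a TimeGate for original_uri as it was at `when`."""
    # HTTP dates are expressed in GMT, so normalise the aware datetime to UTC first.
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE + original_uri, headers

url, headers = memento_request(
    "http://www.repositoryfringe.org/",
    datetime(2010, 9, 2, tzinfo=timezone.utc),
)
print(url)
print(headers["Accept-Datetime"])  # Thu, 02 Sep 2010 00:00:00 GMT
```

The point of the design is that the URI of the original resource stays the same; only the requested date changes, in much the same way that a client negotiates for a language or media type.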

Pecha Kucha: Session 1

Following lunch we staved off post-prandial fatigue with the first breakneck round of Pecha Kucha sessions. Pecha Kucha is a Japanese presentation format in which presenters speak to 20 slides for 20 seconds each, making each presentation last precisely 6 minutes and 40 seconds. The aim of such a rigid format is to focus presenters on getting their message across in a clear and concise way.

Open Access Repository Junction

Ian Stuart, University of Edinburgh / EDINA

Stuart spoke about research as a communal activity, which is often carried out by distributed teams and takes place largely over the Internet. However, educational institutions tend to be more insular in their view of research, operating in what Stuart calls a 'vertical market'. So there is a need for some middle ground, or a sort of 'broker' between the researchers and the institutional repository. Repository Junction is currently working with seven repositories to provide such a service, in liaison with the EPrints and DSpace teams. Their final report is due in May 2011.

JorumOpen

Hiten Vaghmaria, EDINA

Hiten Vaghmaria's short talk was about design. Great design tends to be universally loved, bad design widely disliked. It is important to consider form: people respond emotionally to good design. Jorum is the UK's national repository of learning and teaching resources, and it has recently improved its Web site design in response to user demand for more content to be accessible more quickly.

Hybrid Institutional Repository

Robbie Ireland and Toby Hanning, University of Glasgow Library

Robbie Ireland and Toby Hanning talked about the University of Glasgow's Enlighten as a hybrid, being both an institutional repository and a publications database. Glasgow has successfully mandated deposit in the repository, across a wide range of content types. The speakers suggested that there were four main factors that have contributed to this success: policy, collaboration, data, and organisation. There has been a certain amount of staff resistance to the mandate, and advocacy and collaboration are therefore important. There are also quality-related problems, especially with bulk data uploads, necessitating periodic checks. Enlighten mostly holds metadata records, but the Enlighten team aims to increase the number of full-text items held over the coming months and years.

Dataset Identity

Herbert Van de Sompel, Los Alamos National Laboratory

Following his earlier, very accessible presentation, Herbert Van de Sompel here gave a fairly technical presentation which covered the addressing, accessing, and citing of content, together with the different requirements of each of these stages. He spoke about the relative strengths and weaknesses (and requirements) of DOI and HTTP URI for each of these actions. Citing requires an identifier and credits, while fragment identifiers can be used to address segments. It is, he stressed, possible to combine approaches.

Providing Data Support in a Repository Context

Elin Stangeland, Cambridge University Library

Elin Stangeland spoke about the JISC-funded Incremental and DataTrain projects, with specific regard to lessons learned on issues of tools and training. The projects have carried out scoping studies and interviews, and among the emerging themes was a widespread ignorance on the part of researchers regarding matters such as back-up procedures and best practice generally. It is clear that the preservation message has not yet captured researchers' hearts and minds, and that people often find it dull. These problems are not discipline-specific, but tailored/discipline-specific guidance would be welcomed in overcoming them.

Addressing History

Nicola Osborne, EDINA

Addressing History is a 6-month JISC rapid innovation project. EDINA, in partnership with the National Library of Scotland, is developing a scalable geo-coding tool that will combine and enhance data from digitised historical Scottish Post Office Directories (PODs) with large-scale historical maps of the same era through crowdsourcing. In the first instance the tool will access three eras of Edinburgh mapping and PODs (1784-5; 1865; 1905-6). The PODs contain residents' names, professions, and street addresses, information which can be valuable to those who study the past, such as genealogists and local and social historians. Osborne ended by demonstrating various possibilities for combining this new information with existing datasets by way of mashups, and noted that the live preview would be launching soon.

Repository Building Blocks: EPrints

David Tarrant and Patrick McSweeny, University of Southampton

David Tarrant and Patrick McSweeny's double act gave an entertaining overview of new features and future developments in EPrints. EPrints 3.3 is due in the first quarter of 2011, at which point the EPrints Bazaar – a kind of 'App Store' for repository plug-ins – will also be officially launched. The rationale for the EPrints Bazaar is that sharing success is all about the community. Lots of plug-ins have been developed by the EPrints user base, but it can be hard to make them discoverable and achieve a decent profile. The EPrints Bazaar presents a single place for discovery, and supports installation via a single click.

The duo also covered the 'really powerful' Digital Preservation Suite for EPrints, which has been developed in collaboration with Planets, and its accompanying one-day training course. Tarrant demonstrated several of the plug-ins, including SNEEP (Social Network Extensions to EPrints), EdShare ToolBox, and MePrints. They took an out-of-the-box (or 'vanilla') EPrints installation and installed (and tweaked) several plug-ins to make it much more attractive and user-friendly.

The session ended with a video demo of an iPad app (Flipboard) which makes it more pleasant to browse repository content; indeed, ease-of-use and the importance of interface design and user-friendliness emerged as one of the event's overarching themes, representing the carrot to the stick of institutional mandate!

Round Table Discussion 1: The Value of Geo-referencing Data in Repositories

Ian Stuart, EDINA

The broad topics for discussion were 'Why geo-tag stuff?', 'What does geo-tagging mean and what would it be used for?' and 'What is tagged: the location of the camera or the contents of the picture?' They led to several subsequent questions: 'Where is a place?', 'What about non-terrestrial places?', 'What about the place as it was at <time>?'

Following much lively discussion, the group reached a consensus: what they wanted ideally was for applications to offer suggestions that users can either accept or modify. The thinking behind this was that people would be more inclined to de-select the places that are wrong rather than write down the places that are right. The real trick is to get positive feedback: show good uses for geo-located data and people will be encouraged to provide geo-locating information for their items, in the self-interested belief that this will raise their own profile(s).

Round Table Discussion 2: Collaborative Documentation - The EPrints Handbook

Stephanie Taylor, UKOLN

This round table centred on the documentation needs of EPrints users at all levels, from technical practitioner to busy repository manager, and raised the possibility of supplementing existing documentation with community-created content. There was a positive response to the latter suggestion, and useful discussion covering the approaches that currently work well and not so well for both EPrints documentation and peer-support communities. There was also discussion of potential ways in which these support fora could be improved and linked together. Stephanie Taylor will be taking forward the suggestions of this round table over the coming months.

Round Table Discussion 3: Where Does the IR Fit in the CRIS World?

Anna Clements, University of St Andrews, James Toon, University of Edinburgh

This session was designed for institutions facing issues related to the sometimes muddy boundaries between the two system types: Institutional Repositories (IRs) and Current Research Information Systems (CRISs). Anna Clements started the session by offering an overview of the 'Pure' implementation at St Andrews, including work done on the CRISPool Project in providing a CRIS-based solution for SUPA, the Scottish physics research pool, and how St Andrews are dealing with the CRIS/IR issue.

With more and more institutions reportedly turning to the CRIS-based approach, the ensuing discussion considered whether or not we ought to maintain the distinction between IRs and CRISs. In this, the group considered the value of regional or national alternatives to local repository services. The group also discussed whether or not more work should be done in integrating the functionality of CRIS and IR systems, and what impact the REF might have in the long term in bringing these needs together.

3 September 2010

Introducing day two of the event, Simon Bains, Head of the Digital Library at Edinburgh, announced that Toby Hanning and Robbie Ireland had won the public vote for yesterday's best Pecha Kucha session.

Presentation: Hydra

Chris Awre, University of Hull

In the first session of the morning, Chris Awre spoke about Hydra, a self-funded collaboration between Hull, Virginia and Stanford Universities and Fedora Commons (now DuraSpace), which is working towards a reusable framework for multi-purpose, multi-function, multi-institutional repository-enabled solutions.

Hydra is based on two fundamental assumptions:

that no single institution can resource the development of a full range of digital content management solutions on its own, yet each needs the flexibility to tailor solutions to local demands and workflows;
that no single system can provide the full range of repository-based solutions for a given institution's needs, and that sustainable solutions require a common repository infrastructure.

Awre gave a synopsis of Hydra's origins in the RepoMMan, REMAP and CLIF projects at Hull, followed by an overview of the consortium's working methods and organisation, and a visual depiction of the system's Open Source technical architecture – which encompasses Fedora, Blacklight, Solr, Solrizer, and the Hydra Rails plug-in – accompanied by a rationale for the selection of these technologies.

He ended with an assertion that testing is a core community principle; that we need to test our systems adequately, and furthermore we need to demonstrate that they have been tested.

Round Table Discussion 4: UK Metadata Forum

Stephanie Taylor, UKOLN

This was the second meeting of the Metadata Forum [3]. The Forum aims to help build a community of practice around the use of metadata. Anyone with an interest in metadata is encouraged to participate, but the emphasis is on practical solutions to practical problems.

The session started with a short talk by Sheila Fraser from EDINA. Sheila is working on a JISC Scoping Study entitled 'Aggregations of Metadata for Images and Time Based Media' [4]. Some basic models of aggregation were provided for consideration by the group, and this sparked off an interesting discussion on the benefits of aggregation and whether a pragmatic approach is perhaps more beneficial. The general consensus of participants was that being able to access content in the way that made sense to the user had to be the starting point. If a model is less than perfect, but has benefits in the way users can interact with a system, then this is much more beneficial than a 'perfect' model that makes little sense to the users. Of course, there are lots of grey areas in between, and this is where solutions are often found.

The mention of users led on naturally to a discussion about what users want, and how metadata can support their needs. A recurring theme among UK metadata users right now is the need to find robust, workable solutions for dealing with non-text-based objects. Still images, moving images and music were all mentioned as objects repository managers and others are starting to need to deal with on a regular basis. In addition, the specific needs of different subject areas were highlighted. For example, with still images, there are very different needs which depend on subject and use rather than file format. Geolocation is essential for architectural images, but for medical images geolocation information must not be available as this could lead to identification of the patient with the image – something which would compromise data protection requirements and mean the image could not be used as a source for research purposes.

The meeting concluded that some basic guidelines on dealing with different objects would be useful, but that the key to creating something workable would be to have a flexible approach. Such an approach would create a toolkit that took into consideration the needs of specific subjects too, as well as format. The issues raised in the discussion will be used as the focus of later meetings of the Forum, where specific topics will be discussed and people who have already found some working solutions in these areas will give short presentations on their work.

Round Table Discussion 5: Linking Articles into Research Data

Robin Rice and Philip Hunter, University of Edinburgh

This session covered a wide range of questions: Should we encourage academics to curate the dataset that goes with a particular publication, rather than the whole of the data they have produced? How do repository managers want to accept storage for OA or archiving? How could linking datasets improve scientific practice? For pointing to datasets, what is more important: bibliographic citation or permanent identifier? Are DOIs preferable to handles? What metadata is required for linking outputs to datasets, and can ontologies help?

As things stand, when related data are offered to journals, their publishers may say that they cannot support it yet. But the group's view was that if you worry too much about doing something perfectly, you get nothing done. People can still achieve useful results by putting in time and effort, but attempting to scale this approach is somewhat problematic.

The group also discussed a few related issues such as enhanced publications, citation alerts and other tools. (N.B. A much more detailed summary of this round table is being prepared by Philip Hunter.)

Round Table Discussion 6: Re-imagining Learning and Teaching Repositories

Yvonne Howard and Patrick McSweeny, University of Southampton

This wide-ranging discussion centred on the role of repositories in learning and teaching, opportunities for supporting the pedagogic process, and the potential for promoting the reuse – beyond simply encouraging deposit – of materials in repositories. This was a lively session, with participants expressing strong views on the nature and problematic historical legacies of self-defined learning objects (as well as their packaging and metadata), and the most productive ways to take advantage both of materials created in the JISC Open Educational Resources (OER) Programme and of other, previously funded, digitisation initiatives.

Pecha Kucha: Session 2

After lunch we had the second round of quick-fire Pecha Kucha sessions, again with the enticing prospect of a half-bottle of malt whisky for the most popular presentation.

DSpace and the REF

Robin Taylor, University of Edinburgh

Robin Taylor's talk covered Research Excellence Framework (REF) requirements as they relate to publications. Rather than asking researchers to produce and maintain a list of their publications, Edinburgh's approach is to allow each researcher to select the publications for submission from a pre-generated list. This helps counter the difficulty of distinguishing between researchers with similar names.

RePosit Project

Sarah Molloy, Queen Mary University London

The JISC-funded RePosit Project comprises a fairly large project consortium, and runs for one year. The project's goal is to make deposit easier, and hence get more users and submissions/deposits into the CRIS and IR. Project outputs include a survey, training materials, advocacy, shared community space (RePosit Google Group), Twitter account (@JISCRePosit), and a blog.

Developing Services to Support Research Data Management and Sharing

Robin Rice, University of Edinburgh/EDINA

Rice gave an overview of Edinburgh's Data Library service, and the DSpace-based Edinburgh DataShare. A data library should include support for finding, accessing, using, and teaching. Rice and her colleagues have developed guidance and training, and have worked at influencing University policy, in an attempt to overcome traditional barriers to deposit. Rice also gave very quick introductions to the MANTRA Project (MANagement TRAining), which will incorporate video stories from senior researchers as well as data handling exercises, and to the JISC Managing Research Data Project.

JISC CETIS

Phil Barker

Barker's presentation was entitled 'An open and closed case for educational resources', and made the case for increased Openness in the production of learning materials. Open educational resources (OERs) can be found and reused for free by anyone via search engines, and can be presented either modularly or as parts of courses. Phil's hope is that Open becomes the default approach within the HE community.

ShareGeo 12 Months on and Going Open

Anne Robertson, EDINA

In a pertinent follow-up to Phil Barker's presentation, Anne Robertson gave an introduction to the ShareGeo Project, which supports the ready discovery and sharing of geospatial data, and offered a rationale for 'going Open' in order to increase deposit numbers.

SONEX: Scholarly Output Notification and Exchange

Pablo de Castro, SONEX

Pablo de Castro gave an overview of the SONEX think-tank, which aims to identify and analyse deposit opportunities (use cases) for the ingest of research papers (and potentially other scholarly work) into repositories. SONEX is involved in analysis, not coding, so there are no technical outputs as such, although some development work will be undertaken by partners. The project is related to DepositMO, which was funded under the JISC Managing Research Data call.

Presentation: Topic Models

Michael Fourman, School of Informatics, University of Edinburgh

Michael Fourman's talk outlined some new tools to explore and browse ideas. By carrying out an automated analysis of topics featured in Science journal between 1980 and 2002, Fourman was able to create an automated index of ideas.

Central to this is the conception of a document as 'a bag of words', where syntax can be ignored and word frequencies are all that matter. What Fourman calls a topic is a frequency distribution over words. Documents are generated from a mixture of topics, and topics (and mixtures of topics) can be inferred from textual corpora such as the historic Science journals, and various powerful analyses (Bayesian, Monte Carlo et al.) may be carried out on the data.
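The 'bag of words' view and the notion of a document as a mixture of topics can be sketched in a few lines of Python. This is only a toy illustration of the forward (generative) direction, with made-up topics and probabilities. It does not attempt the inference step, and it is not Fourman's own code.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Count word occurrences, ignoring order and syntax."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def mix_topics(topics, weights):
    """Combine topic distributions into one word distribution for a document."""
    mixture = Counter()
    for topic, weight in zip(topics, weights):
        for word, prob in topic.items():
            mixture[word] += weight * prob
    return dict(mixture)

# Two toy topics (made-up numbers): each is a probability distribution over words.
biology = {"gene": 0.5, "cell": 0.5}
astronomy = {"star": 0.5, "cell": 0.5}

print(bag_of_words("The cell divides; the cell grows."))
print(mix_topics([biology, astronomy], [0.5, 0.5]))
```

Inference runs the other way: given only the word counts for many documents, the analysis estimates which topics and which mixtures were most likely to have produced them.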

Closing Presentation

Kevin Ashley, Digital Curation Centre

(After a quick vote during the coffee break, it emerged that Robin Taylor had won the prize for day 2's best Pecha Kucha presentation.) Closing the event, Kevin Ashley, Director of the DCC, offered a recap of the themes that had emerged over the preceding couple of days – notably fireworks, documentation (for EPrints, coming soon!), building blocks, the Bazaar, Open-ness, the need for collaboration if we are to travel far rather than simply fast, that everything is data (Fourman), and that the nature of data is changing (Van de Sompel).

Ashley also took the opportunity to reflect on the past: starting with the JISC Repository Preservation and Advisory Group in 2004, he showed how repository-related themes have changed over the past six years, and indulged in a little stargazing to frame 'repositories and/for/as data'. The prevailing question must be: which data, and when? More joined-up collaboration between stakeholders will be needed, and we need to think in terms of use cases when it comes to the larger cultural change of updating scholarly publication practice.

Conclusion

Slides, images and links to streamed content – including 20,000 words of live-blogged content – can be found on the Repository Fringe Web site [5]. Additionally, the event's Twitter hashtag was #rfringe10, and a quick search on this will return a large number of interesting opinions.

Acknowledgements

Many thanks to Ian Stuart, Nicola Osborne, James Toon, Philip Hunter and Stephanie Taylor for providing summaries of the parallel breakout groups.

References

[1] For more information on the workshop, see http://www.rsp.ac.uk/events/index.php?page=RoMEOAPI2010/index.php
[2] Ranganathan's Laws: Books are for use; Every reader his book; Every book its reader; Save the time of the reader; The library is a growing organism http://en.wikipedia.org/wiki/Five_laws_of_library_science
[3] The Metadata Forum blog http://blogs.ukoln.ac.uk/themetadataforum/
[4] Edina: Projects: Scoping Study: Aggregations of Metadata for Images and Time Based Media http://edina.ac.uk/projects/Aggregations_Scoping_summary.html
[5] Repository Fringe Web site http://www.repositoryfringe.org/

Author Details

Martin Donnelly
Curation Research Officer
Digital Curation Centre
University of Edinburgh

Email: martin.donnelly@ed.ac.uk
Web site: http://www.dcc.ac.uk/
