Survive or Thrive

Ed Fay

Ed Fay reports on a two-day conference organised by UKOLN on behalf of JISC to consider growth and use of digital content on the Web, which was held in Manchester in June 2010.

Survive or Thrive [1] is the punchy title given to an event intended to stimulate serious consideration amongst digital collections practitioners about future directions in our field - opportunities but also potential pitfalls. The event, which focused on content in HE, comes at a time of financial uncertainty when proving value is of increasing importance in the sector and at a point when significant investment has already been made in the UK into content creation, set against a backdrop of increasingly available content on the open Web from a multitude of sources.

The premise of the event - we must survive in this context and seek to thrive - is a timely reminder of these realities, but also a motivation to explore some of the currently available digital collection technologies and models for user engagement that are out there and working successfully in this environment in order to make the most of our digital collections for our users.

Digital Libraries in a Networked World: Dan Greenstein

Dan Greenstein is Vice Provost at the University of California, a role which includes oversight of the California Digital Library. He has been Director of the Digital Library Federation, the Arts and Humanities Data Service and the Resource Discovery Network.

Greenstein's opening point was that we would be hearing a lot about what can be done but it would be important to focus on what should be done - and this would be different for each one of us. Something that is likely to apply to all of us, however, is that we will be working in a time of reduced budgets. While he is of the opinion that 10-15% cuts are absorbable, cuts of 20-30% across the sector will require fundamental changes. Although student numbers have been rising steadily, funding per capita has been falling (as demonstrated by statistics from the UK's Department for Education and Skills, over the period 1980/1-1999/2000).

Although the period after World War II saw an explosion in printed publications, we are now seeing an explosion in new distribution models - digital, mobile and print-on-demand - supported by e-publishing and retrospective digitisation. Therefore, Dan maintained, 'redundant management of print collections is insane' as is seeking to make savings on special collections: they are what make a library unique, not copies of print publications that will be increasingly available through other means.

Next generation information practice is likely to require:

Retrospective digitisation
Institutional repositories
Curated born-digital material
E-learning and information literacy ('hybrid instruction' on the basis of closer engagement with information practice and research or teaching departments)

What does it take to fix this? Changes in collection management:

Secure management of digital facsimiles and editions (not everything is worth saving - everything that is saved, costs)
National institutional repository strategy with services implemented at the department or individual level
National print repositories (necessary and sufficient redundancy, no more)
Localised print-on-demand and mobile download

The questions are political, not technical. Currently, core funding is used for print acquisition and e-licensing - commercial acquisition - while 'funding dust' is used for information literacy and digital curation. There is no trade-off being made, and this is unsustainable. Everything should be funded from the same collection budget, with a realistic approach to prioritisation - what ought to be done.

Discovery and delivery are orienting to the individual. If library services do not follow suit, then users will go around us to get at content. This is a 'massively heavy lift' - can we do it quickly enough before the industry bypasses us? (Dan advised the audience to take a look at what Apple had done to the music business with the iPod and iTunes. He maintained it is going after the publishing industry next.)

Challenges include:

Communicating the benefits
First mover problem (it needs to be sector-wide)
Leadership problem (it needs to be institutional, not just library scale)
Scope creep (driven by the range of possibilities online - there must be clear direction)
Threat to local autonomy - ultimately to the local academic library and librarian!

Greenstein's final point - an orderly retreat is better than a disorderly one: services and access to information can be maintained.

If You Love Your Content, Set It Free: Mike Ellis, Eduserv

Following the economic theory of marginal utility, Mike Ellis described how, historically, value derived from scarcity and there was benefit from keeping content closed. In the networked age there are three phenomena which challenge this:

Distribution costs are declining
Piracy opportunities are increasing
Our relationship to content and our information behaviours are changing

Nowadays, Mike maintained, value becomes about usability rather than scarcity:

This is not a blip, things will not be returning to 'normal'
Value has not disappeared, it has shifted
What cannot be copied? What is unique to this institution? Think: trust, authenticity, immediacy
Content is like a teenager - you may try to protect it, but it will climb out the window and go clubbing anyway
If you can't reuse your digital content, the creation effort is wasted
This is and always will be about content and users, not technology
The future is uncertain - but open content and technologies help by lowering costs for reuse and interoperability
It does not matter how you do it
Open and Free = Eyeballs – making content open and freely available will drive its use

It is worth noting that the eighth point above ('It does not matter how you do it') proved highly contentious during discussion. It was subsequently modified to agree that 'how' should include notions of interoperability, openness and Web accessibility, but that beyond these principles the specific technologies matter less.

Mike has placed his presentation on Slideshare [2].

Web Scale (Content in a Web Environment): Jo Pugh, TNA

The National Archives (TNA) have started to publish content to Flickr Commons [3]. The primary purpose was to use Flickr as a 'shop window' to get across the scope of the collections.

Benefits and the business case for using third-party dissemination:

Large audience 'shop window' (potential to drive traffic to in-house image library)
User tagging and folksonomies (ultimate intention to import back into local catalogue)
Proliferation of images across the site (favourites, groups, etc.)
Annotation
Source of information about users
Users can embed images on their own site, and TNA can embed on theirs
Potential to repurpose TNA content via mashups
Users can post their own content

A key driver for adopting Flickr was to avoid rebuilding what already existed, rejecting the 'build it and they will come' approach in favour of going to where the users already are. This represents a balance of benefits - costs vs. control of content. A key lesson is that Flickr is working at Web scale - TNA content is not!

It is also worth noting that the ambition for mashups of TNA content was regarded as overly optimistic and has not been realised. However, this is not seen as a disappointment - rather, TNA now better understand their users and their expectations of TNA content as a result of this foray into Web-scale dissemination.

What is the aim of being on Flickr?

'We don't judge what constitutes a meaningful interaction with our content'
Enhancements to the TNA catalogue
Vibrant online community of activity
Flickr Apps - ways of using content not limited by the TNA platform

3 Sessions on Geo-Data, Linked Data, Text Mining

Geo-Spatial as an Organising Principle: James Reid, Edina

James Reid began with a widely held assertion: 80% of an organisation's information has a geographic aspect, directly or indirectly referenced. Direct referencing is an explicit assertion of the geographic information (e.g. open Ordnance Survey maps or geo-referenced digital collections) while indirect referencing is an implicit assertion of geographic information (e.g. textual references to place names). Edina Unlock [4] provides services for geo-referencing and geo-parsing (text-mining for geographic references).
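To illustrate what geo-parsing of indirect references involves, here is a minimal sketch in Python. It is not how Edina Unlock works: the gazetteer, its approximate coordinates and the `geoparse` function are all assumptions made for illustration, and a real service would need to handle ambiguous and variant place names.

```python
import re

# Hypothetical toy gazetteer: place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Manchester": (53.48, -2.24),
    "Edinburgh": (55.95, -3.19),
    "London": (51.51, -0.13),
}

def geoparse(text, gazetteer=GAZETTEER):
    """Return (place, offset, coordinates) for each gazetteer name found in text."""
    matches = []
    for place, coords in gazetteer.items():
        # Whole-word match so that a name inside a longer word is not picked up.
        for m in re.finditer(r"\b" + re.escape(place) + r"\b", text):
            matches.append((place, m.start(), coords))
    # Report matches in the order they appear in the text.
    return sorted(matches, key=lambda item: item[1])

sample = "The conference was held in Manchester; Edina is based in Edinburgh."
for place, offset, coords in geoparse(sample):
    print(place, offset, coords)
```

The output turns an implicit textual reference into an explicit coordinate pair, which is the step that lets place become an organising principle for a collection.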

James has made his presentation [5] available.

Linked Data: Tom Heath, Talis

Tom Heath explained linked data using the analogy of a transport network - different transport types interoperate; we don't need to understand how trains or buses work in order to use them to get places. The historical lesson is that building physical networks adds value to the places which are connected, and building virtual networks adds value to the things which are connected.

How is linked data best implemented? There will be no big bang; expansion and benefits will be incremental. The first step is to build an infrastructure of identifiers for the things we care about as an organisation. This can be as simple as id.uni.ac.uk/thing that captures the domain model (people, departments, things, etc). Cost? As for any infrastructure development, there is the bootstrapping cost vs. cost savings and the value of things that would not otherwise get done.
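A minimal sketch of that first step might look like the following. The host `id.uni.ac.uk` is the placeholder from the talk; the path layout, the `mint_uri` and `to_ntriples` helpers and the example person and department are assumptions for illustration. The two predicates come from the FOAF vocabulary and the W3C Organization ontology.

```python
from urllib.parse import quote

BASE = "http://id.uni.ac.uk"  # placeholder host taken from the example in the text

def mint_uri(kind, local_id, base=BASE):
    """Build an identifier of the form base/kind/local_id (an assumed layout)."""
    return "{}/{}/{}".format(base, quote(kind, safe=""), quote(str(local_id), safe=""))

# Hypothetical entities from an institution's domain model.
person = mint_uri("person", "jsmith")
dept = mint_uri("department", "library")

# Statements about those entities as (subject, predicate, object) tuples.
triples = [
    (person, "http://xmlns.com/foaf/0.1/name", "J. Smith"),
    (person, "http://www.w3.org/ns/org#memberOf", dept),
]

def to_ntriples(triples):
    """Serialise simple triples; literals are assumed to need no escaping."""
    lines = []
    for s, p, o in triples:
        obj = "<{}>".format(o) if o.startswith("http://") else '"{}"'.format(o)
        lines.append("<{}> <{}> {} .".format(s, p, obj))
    return "\n".join(lines)

print(to_ntriples(triples))
```

Because the department has its own identifier, other datasets can link to it without knowing anything about the system that holds the staff list, which is where the incremental benefit comes from.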

Tom has made his slides [6] available.

Text Mining: Sophia Ananiadou, NaCTeM

Sophia Ananiadou introduced applications of text mining that are being developed in various academic disciplines to support the analysis of large textual corpora. This is an emerging research discipline in its own right that is finding traction amongst the natural and social sciences as well as in the area of digital humanities. Text mining offers the ability to extract semantic information from unstructured textual documents through a sequence of techniques:

Unstructured text (implicit knowledge)
→ Information retrieval
→ Information extraction (named entities)
→ Semantic metadata annotations (enrich the document)
→ Structured content (explicit knowledge)

Existing tools can be adapted to a subject domain using annotated corpora. Applications include document classification, metadata extraction, summarisation and information extraction. Current NaCTeM domains are biological, medical and social sciences. [7]
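The sequence above can be sketched as a toy pipeline. This is not NaCTeM's tooling: the small term dictionary stands in for a model adapted from annotated corpora, and the function names and sample documents are assumptions for illustration.

```python
import re

# Hypothetical domain dictionary standing in for a model trained on annotated corpora.
ENTITY_TYPES = {
    "insulin": "protein",
    "diabetes": "disease",
    "aspirin": "drug",
}

def retrieve(docs, keyword):
    """Information retrieval: keep documents that mention the keyword."""
    return [d for d in docs if keyword.lower() in d.lower()]

def extract_entities(doc):
    """Information extraction: find dictionary terms and their types."""
    found = []
    for m in re.finditer(r"[A-Za-z]+", doc):
        term = m.group(0).lower()
        if term in ENTITY_TYPES:
            found.append({"text": m.group(0), "type": ENTITY_TYPES[term], "start": m.start()})
    return found

def annotate(doc):
    """Semantic annotation: attach the extracted entities to the document."""
    return {"text": doc, "entities": extract_entities(doc)}

docs = [
    "Insulin resistance is a feature of type 2 diabetes.",
    "The library opens at nine.",
]

# Structured content: one record per retrieved document, with typed entities.
structured = [annotate(d) for d in retrieve(docs, "diabetes")]
for record in structured:
    print(record)
```

Swapping the dictionary for a different subject vocabulary is a crude analogue of adapting existing tools to a new domain.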

JISC Resource Discovery Task Force

The JISC and RLUK Resource Discovery Vision - ultimate goal is for a national strategy for resource discovery supported by aggregation services for metadata.

More information is available at:

Resource Discovery Taskforce [8]
Resource Discovery Taskforce vision report [9]
JISC [10]

Digital New Zealand: Andy Neale, Digital New Zealand

Digital New Zealand is a national-scale platform to support digital content, including infrastructure, metadata, and front-end search and delivery. There are currently 99 contributing organisations within New Zealand.

Four initial projects (2008) launched with 20 contributors:

Search widget
Mashup
Remix experience
Tagging demo

The search widget is an aggregator focussed on digital objects (not Web pages - 'Google is for that') built on top of Solr with an HTML + JavaScript front-end that can be replicated and changed for different applications. This forms the core of their infrastructure, with a public API making DigiNZ content available to other applications. New applications can be deployed for an institution using pre-defined custom searches in the API, which are used as the basis for building new front-ends with different skins. Not including graphical design changes, a new instance can be launched in around 2 hours. Additional apps from DigiNZ include a timeline app which uses open-source software from the MIT SIMILE Project [11].
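One way to picture a 'pre-defined custom search' is as a stored filter that is combined with whatever the user types. The sketch below builds a request URL using standard Solr query parameters (`q`, `fq`, `rows`, `wt`); it does not reproduce the DigiNZ API, and the endpoint, field names and search names are all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; a real deployment would have its own host and core.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

# Hypothetical pre-defined custom searches, each a stored filter query.
CUSTOM_SEARCHES = {
    "maps": {"fq": "category:Maps"},
    "photos-1900s": {"fq": "category:Images AND decade:1900"},
}

def build_query_url(search_name, user_query, rows=10):
    """Combine the user's query with the stored filter for one front-end."""
    params = {"q": user_query, "rows": rows, "wt": "json"}
    params.update(CUSTOM_SEARCHES[search_name])
    return SOLR_SELECT + "?" + urlencode(params)

print(build_query_url("maps", "wellington"))
```

Under this assumption, a new front-end only needs a new entry in the table of stored searches and a new skin, which is consistent with the short launch time reported.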

Initial work focussed on building foundations for something bigger, designing an infrastructure to be extensible and building from the ground up. Using an Agile methodology (SCRUM) they were able to build the prototypes for initial launch in 16 weeks, although lots of background work had been done on the concepts and organisational buy-in.

Issues:

Unclear, scattered vision statement did not provide an inspiration for the work - they were left to pull out the relevant concepts and work them up into something more relevant
This was not the first national attempt in this space, so they had to be sensitive to the work of others and make their differences clear
They lacked in-house development capacity, so worked with 3 suppliers who could overlap each other's work when necessary to share the workload around
Time and budget constraints - agile development methodology (SCRUM) alleviated their effects
The reputational baggage associated with branding the central service was underestimated
The effort to participate for contributing organisations was perceived to be high, so they designed for a low barrier to entry by not specifying rigid metadata schemas - DigiNZ does most of the heavy lifting to normalise data for ingest. This results in lower quality metadata, but of workable standard.
Bureaucracy hindered institutional take-up

To maintain momentum and energy and avoid the 'slow death of project wind-down' DigiNZ took a conscious decision to maintain the project atmosphere (although not technically a project anymore) by rejecting the idea of 'business-as-usual' and keeping to the two-week development sprint cycle. A current challenge is finding the balance between maintenance and development.

DigiNZ are currently looking at digitisation support services, including an advice service similar to JISC Digital Media. They are also running a central service for digitisation nominations, on which members of the public can vote. To date, they have had 100 nominations, the most popular of which has received 600 votes. This is considered to be a moderate amount of activity.

DigiNZ will continue its mission to lower barriers to getting content online, which includes hosting services for institutions without the resources to maintain their own. Future applications will include metadata enhancement such as geo-tagging, based on the principle of a game to crowdsource the required effort.

Current issues are the balance of effort between central and distributed functionality and tools. They are more interested in distributed access where partner organisations take on much of the effort of publicity. The scope of inclusion of content and the balance between maintenance and development are also issues.

Things that worked for DigiNZ:

Clear and simple vision
Small scoped projects as the basis for scaling up
Clear about the differences from other projects
Lower barriers to participation
Brand and design - looking good inspires pride
Build relationships with collaborators - technology is not hard with the right people
Deadlines
Starting small and upselling
Building for extensibility from the ground up
Iterating development - 2 days user analysis, 8 days build in a two-week cycle
Planning to maintain energy after the launch
Lightweight governance and a team of experts given room to work
'Get on with it' and refactor in response to change

Working the Crowd: Chris Lintott, Galaxy Zoo

Crowdsourcing developed in response to the deluge of astronomical data, but is increasingly being applied in other domains. Some astronomical data can be processed by machine but not all - in particular, the pattern matching necessary to recognise galaxy shapes is not yet sufficiently sophisticated, and even neural nets have problems.

'Citizen science' is not new; it builds on a tradition of amateur ornithologists, astronomers and palaeontologists. However, crowdsourcing is a new kind of citizen science that inverts the normal model. Historically, scientists asked people to supply the data (e.g. observations of birds or fossils) while the scientists did the analysis. In the crowdsourcing model, the scientists supply the data and the people do the analysis.

Galaxy Zoo underwent an explosion in popularity after it was picked up by the BBC, which crashed the original servers. At its peak it was hitting 70,000 classifications per hour (more than 1 PhD-student-month per hour) and by the end had racked up over 200 years FTE (Full Time Equivalent) by 300,000 volunteers. But to put this in perspective, this is only the same attention as that given to 6 seconds of Oprah Winfrey's TV show.

Lessons learnt about volunteer recruiting:

There must be no barrier to entry
Tell them why
Treat them as collaborators
Do not waste their time

It became clear that crowdsourcing could tackle different kinds of questions:

Known knowns: basic classification
Known unknowns: specific, narrow questions not covered by the standard classification (not designed into the experiment, they require a small cadre of volunteers to be recruited for this specific task)
Unknown unknowns: something totally unexpected (they cannot be designed for in the experiment, require a free response, but also require forums and volunteers to self-moderate and filter the feedback to the professionals)

The platform is now being extended into a generalised platform for crowdsourcing [12] to lower the barriers to entry for early-career researchers. The platform will also act as an intermediary to provide guarantees to those on both sides of the relationship (researchers and volunteers) - a 'fulfilment contract'. It is not expected that there will be similar, widespread publicity in the future - people are not as amazed by what the Web can do anymore. The challenge of how to recruit and engage volunteers is seen as ongoing.

Getting Your Attention: David Kay, Sero Consulting

Attention or activity data harvested from OPACs can be used as a source of information about user behaviour and hence as a basis to inform collection management or to build user recommendation systems. There are three kinds of data:

Attention: interest, searches, navigation, etc.
Activity: transactional - requests, check-outs, downloads, etc.
Appeal: recommendations, e.g. reading lists (proxy for activity data).

The University of Huddersfield demonstrated an increased borrowing range by making suggestions based on current user behaviour (average number of books borrowed per person increased from 14 to 16 over the period 2005-9). A developer challenge based on Huddersfield data produced two applications to improve resource discovery, two to support recommended courses based on a user's behaviour, and two to support decision-making in collection management.
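As an illustration of how activity data can drive suggestions, the sketch below counts which items are borrowed by the same people and recommends the most frequent companions. It is not the Huddersfield implementation: the loan records, identifiers and function names are hypothetical, and a production system would also need thresholds to avoid suggestions based on very few loans.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical anonymised loan records: (borrower_id, item_id).
loans = [
    ("u1", "A"), ("u1", "B"), ("u1", "C"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"),
    ("u4", "B"), ("u4", "D"),
]

def build_cooccurrence(loans):
    """Count, for every pair of items, how many borrowers took out both."""
    items_by_user = defaultdict(set)
    for user, item in loans:
        items_by_user[user].add(item)
    co = defaultdict(Counter)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co

def recommend(item, co, top_n=3):
    """'People who borrowed this also borrowed': most frequent companions first."""
    ranked = sorted(co[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]

co = build_cooccurrence(loans)
print(recommend("A", co))  # ['B', 'C']
```

Only borrower identifiers are needed to group loans, so the records can be pseudonymised before analysis.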

Problems:

Over-personalisation of the information landscape runs the risk of removing personal agency, decision-making and independent thought
There are data protection issues regarding the use of the data, although anonymisation mitigates them to an extent

Findings of the CERLIM/MOSAIC Project are that 90% of students want to know what other people were using - to get the bigger picture of what is out there and to aid with retrieval. Recommender systems are used commercially, so the benefits and mechanisms are well understood. It was also observed that they are not the basis for people wasting money (at least this is not a reported outcome), and so they are unlikely to be the basis for people wasting attention in OPACs and non-commercial situations.

There are two approaches to user-augmented retrieval: data analysis (of attention and activity data) and social networks (user engagement - reviews, recommendations, etc.). It was suggested that activity data are likely to be of more use for undergraduate courses, due to the volume of use and associated data, while research use exists in the long tail of the collection and will likely benefit from social network-type effects. However, it was observed in the Question and Answer session that the long-tail phenomenon only exists at institutional level - subject networks of researchers at national level will have their own corpora of high-use material, and analysis at that level is likely to be more productive.

Question & Answer Sessions

The Q&A session focussed on addressing the elephant in the room - financing. Despite Greenstein's admonition at the start of the day, there was a striking absence of specific financial discussion through the course of the day, in favour of a somewhat isolated appreciation of technical possibilities and associated user benefits. The basis for a cost-benefit analysis was not always clear.

The question was simple - "how do we pay for this?" - the answer, less clear. Greenstein was adamant that core budgets can and should be realigned to give appropriate focus to digital collections, while admitting the possibility of commercial partnerships. Other respondents considered commercial partnerships to be optimistic, with Pugh warning that "there is no pot of gold for all" and Ellis pointing out the significant overheads of running an in-house commercial operation such as an image library. Neale advocated 'smuggling' technical costs into other activities that over time could build into a significant strand. The danger is that the 'other activities' (and the funding they bring in) could dry up, leaving significant, un-resourced technical commitments. Stuart Dempster (JISC) urged us to take a look at the Ithaka case studies [13] (which are due to be refreshed in 2010) and also to pay close attention to government policy to see which way the financial winds are blowing for the sector.

One thing was clear - we cannot avoid opening and preserving content. However, while possibilities were offered and successful financial models do exist, resourcing for specific digital collections remains a challenge for the curators - much depends on the nature of the collection and its target audience.

Conclusion

This two-day event offered a fast-paced look at many areas of digital collections technologies and the ways in which user behaviours are changing. While it was interesting to hear about emerging digital technologies and models for user engagement, following the event there was something of a return to normality as the realities of implementing these innovative technologies became apparent. As mentioned above, despite serious financial considerations in the opening session, there was a striking absence of financial discussion throughout the sessions. It was not clear in many cases what the cost of adopting such technologies or approaches would be, nor the potential savings.

Another topic that was conspicuous by its absence was digital preservation. The ongoing costs of maintaining digital collections are only starting to be understood, and in many institutions making the case for infrastructure to support preservation is the first step before one can consider ways to innovate. Flipping this on its head - might it be the case that innovating with content (and thus providing the associated user benefits) will make collections so valuable that preservation becomes a de facto result of maintaining the user community by providing continued, supported access? In which case, starting from where we are - is it better to make the case for innovation or for preservation? (And of course the answer may be different for different communities, collections or user engagement models.) A caveat, however - we are all familiar with the project silo which, while innovative in the years which produced it, is now languishing on the equivalent of a dusty shelf and looking its age. As institutional digital collections multiply so do the resources required to maintain them; and multiplying technology stacks on top of content only exacerbates this. Without a closer engagement with the source of use and, increasingly, of the data - researchers, teachers and students - we run the risk of multiplying our commitment to technologies without an associated increase in technical resources to maintain them. If we are not demonstrating the value in doing so, then this is going to be a hard case to make, and rightly so.

Nevertheless, despite these potential pitfalls, this event filled me with a sense of optimism, as so much is changing and so much is becoming possible. The digital revolution is changing the way information is used, and if we understand our collections, their users and the possibilities, then there exists the chance to improve information availability and use throughout the sector.

References

[1] Survive or Thrive Conference http://www.surviveorthrive.org.uk/agenda/
Readers may also be interested in the series of videos of presentations that was produced during the event
http://www.surviveorthrive.org.uk/videos/
[2] If you love your content, set it free (v3.0). Mike Ellis, June 2010. Presentation
http://www.slideshare.net/dmje/if-you-love-your-content-set-it-free-v3-4449122
[3] The National Archives UK's photostream: Flickr
http://www.flickr.com/photos/nationalarchives/
[4] Edina Unlock http://unlock.edina.ac.uk/
[5] Survive or Thrive? - JISC event June 2010. James Reid. Presentation
http://prezi.com/n8ui3umrjxfh/survive-or-thrive/
[6] Linked Data: Avoiding "Breaks of Gauge" in your Web Content. Tom Heath. June 2010. Presentation
http://tomheath.com/slides/2010-06-manchester-linked-data-avoiding-breaks-of-gauge-in-your-web-content.pdf
[7] The National Centre for Text Mining (NaCTeM) http://www.nactem.ac.uk/
[8] Resource Discovery Taskforce http://rdtf.jiscinvolve.org/
[9] McGregor, A (2010) One to Many; Many to One: The resource discovery taskforce vision. Technical Report.
http://ie-repository.jisc.ac.uk/475/
[10] New plans to open up UK resources revealed: JISC. 9 June 2010
http://www.jisc.ac.uk/Home/news/stories/2010/06/discovery.aspx
[11] SIMILE Widgets: Timeline http://www.simile-widgets.org/timeline/
[12] Zooniverse Home page http://www.zooniverse.org
[13] Ithaka Case Studies in Sustainability
http://www.ithaka.org/ithaka-s-r/research/ithaka-case-studies-in-sustainability/ithaka-case-studies-in-sustainability

Author Details

Ed Fay
Collection Digitisation Manager
Library
The London School of Economics and Political Science

Email: E.Fay@lse.ac.uk
Web site: http://www.library.lse.ac.uk/
Twitter: http://www.twitter.com/digitalfay
Blog: http://lselibrarydigidev.blogspot.com/
