The annual Eduserv Symposium was billed as a ‘must-attend event for IT professionals in Higher Education’; the choice of topical subject matter is one of its biggest crowd-drawers (the other being the amazing venue: the Royal College of Physicians). The past few years have seen coverage of highly topical areas such as virtualisation and the cloud, the mobile university and access management. This year’s theme of big data is certainly stimulating interest, but what exactly are the implications for those working in research, learning, and operations in Higher Education?
The day was opened by Stephen Butcher, Chief Executive of Eduserv, who after introducing us to the event, passed the floor to Andy Powell, Eduserv’s Research Programme Director. A straw poll indicated that the audience comprised many working in HEIs, a fair number working outside institutions, but very few researchers. Andy explained that he was keen for the day not to fall into the big data hype space and instead offered perspectives from people who have actually been ‘building stuff’. To define: big data are considered to be data sets that have grown so large and complex that they present challenges to work with using traditional database management tools. The key factors are seen to be the ‘volume, velocity and variety’ of the data (Edd Dumbill, O'Reilly Radar). While HEIs have experience in storing, curating and managing data there is still much to learn about how you analyse data. Andy expressed concerns that there is potential confusion with the open data agenda, particularly in the government space. He also stressed the importance of not focusing on the technology and bearing in mind the new evolving role of data scientist: someone who likes dealing with data and telling stories around data. Andy finished by saying that he had hoped to have a speaker who could say more about learning analytics, for example tracking the progress of users and learners, as this was potentially another interesting application of big data for HE.
Rob Anderson, CTO EMEA, Isilon Storage Division, EMC, delivered the opening keynote, a comprehensive overview of the big data landscape.
Rob Anderson presents at Eduserv 2012.
(Image by kind permission of Eduserv)
Rob began by explaining why big data has come to the fore now. The reasons are both technical (better access to data using the cloud and more sophisticated tools) and economic (it is now cheaper to store and analyse). We had the first indication of the day that ‘big’ wasn’t necessarily the chief issue: rather, it was the lack of structure (90% of the digital universe is unstructured) that caused many of the problems. His talk provided an overview of the types of big data we are seeing, from retail information and utilities data provided by smart meters in real-time streams, to health data with huge implications for the public sector. This data came from rich content stores (such as media, VOD, content creation, special effects, GIS), was generated from workflows, could be newly developed intellectual property based on data and derived from consumer data. However, to be of value, the data required analysis.
Infographic by Shanghai Web Designers
60 Seconds - Things That Happen on the Internet Every Sixty Seconds.
(This figure by kind permission of Go-Globe.com)
He explained that good data could be hard to obtain, and so often we ended up with no data at all. This prevented big-data-based decisions, which could benefit everyone by making more accurate products and tailoring products more precisely for their markets. Rob explained that at EMC they had invested in Hadoop because they were committed to open source, and they had also considered the benefits of scale-out versus scale-up for storage. Considering what was holding up the exploitation of big data at the moment, Rob suggested that it wasn’t the technology but the failure so far to demonstrate ROI (return on investment), along with a lack of organisational change and effective acquisition of suitable talent.
Guy Coates from the Wellcome Trust Sanger Institute gave interesting insights into the world of genome data. Guy explained that the cost of genome sequencing was halving every 12 months and that this trend was continuing. It was more than likely that the $1000 genome would be with us by the summer of 2012. People would soon be purchasing their own USB stick genome sequencers! However the principal cost remained in analysing the data. Guy pointed out some key projects working in this area: UK10K, Ensembl, Cancer Genome projects, pathogen genomics.
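Guy’s halving trend is easy to extrapolate. The sketch below is purely illustrative (the $16,000 starting cost and the exact halving period are assumptions, not figures from the talk):

```python
import math

def years_to_target(current_cost, target_cost, halving_period_years=1.0):
    """Years until a cost falls to target, assuming it halves every period."""
    if current_cost <= target_cost:
        return 0.0
    return halving_period_years * math.log2(current_cost / target_cost)

# From a hypothetical $16,000 genome, halving every 12 months,
# the $1,000 genome is four halvings (four years) away.
print(years_to_target(16_000, 1_000))  # 4.0
```

The same one-liner explains why such curves outpace even Moore’s law: every fixed interval removes a constant number of halvings from the distance to any price point.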
At the Sanger Institute staff were now using agile and interactive hardware systems; the key objective was to make stored data visible from everywhere. He stressed the importance of data triage: you cannot keep all the data created, so data that will not be reused must be discarded: ‘No one is going to go back and re-analyse sequencing data because there is so much new data arriving all the time’. In an effort to get away from the problem of groups inventing their own data management systems, the Sanger Institute has implemented the centralised integrated Rule-Oriented Data System (iRODS).
Guy Coates shows the scary data.
(Image by kind permission of Eduserv)
Anthony J Brookes, professor of Genomics and Informatics at the University of Leicester, began with an apology for being among an unfamiliar audience; he was one of the few researchers present, but in a position to offer the most memorable ‘fact’ of the day: the human brain has a capacity of 2.5 petabytes! He then asked the audience to consider whether there really was a big data problem. He wasn’t even sure that there was a ‘scale of data’, ‘complexity of data’ or ‘stability of data’ problem. However he did feel that there was a knowledge engineering issue, and that this was preventing us from getting from sequenced genomes to personalised health data. This disastrous divide between research and healthcare (ie the divide in the management of data, eg the use of different standards) required knowledge engineering (analysis) to bridge it.
Anthony explained he was involved in the I4Health Network (integration and interpretation), a network for linking research and practice in medicine. The network saw knowledge engineering as providing the chain between research and health science. Anthony went on to share his experiences of using ORCID (Open Researcher and Contributor ID) to facilitate open data discovery.
When asked to present at the symposium Graham Pryor, Associate Director of the Digital Curation Centre, wasn’t sure what he had to offer. He had seen big data as a technology issue and the ‘DCC doesn’t deal with “plumbing”’. Graham took a look at some big data examples and asked himself whether they had ‘got it all sorted’. His observation was that many weren’t managing to preserve or manage the data, which brought it back into DCC field. Further investigation had shown him that the big data problem was in part about how you persuaded researchers to add planning into the research data management process. These issues were central to effective data management, irrespective of size.
Devin Gaffney, Research Assistant & Master's Candidate, Oxford Internet Institute, started with a story about a friend who had discovered himself to be an overnight expert on vampires as a result of Klout using analytics to misinterpret his writing and draw incorrect conclusions. Devin explained that many analytics services (Webtrends, PeopleBrowsr, Google Analytics) were indicating a weakness in prescribed analytics. His offer was 140kit, a platform that supports the collection and analysis of Twitter posts.
In advance of the symposium Simon Hodson, Programme Manager, Digital Infrastructure, JISC, had taken a quick straw poll from two Russell Group universities. Both believed they held 2 petabytes of managed and unmanaged data and while one currently provided 800 terabytes of storage the other provided only 300 terabytes. One of the universities was concerned that its storage might be full within 12 months. But then storage costs were decreasing, and storage models were changing (often to cloud computing). Simon described a changing landscape, giving an overview of JISC Research Data Management (RDM) programmes and reiterating the need for policies, processes and best practice for research data. Simon flagged up the forthcoming DCC guide on how to develop a RDM service from data and emphasised that community review would prove important.
The Eduserv Symposium 2012.
(Image by kind permission of Eduserv)
The real-world example for the day was given by Simon Metson, Research Associate/Ecology Engineer, University of Bristol/Cloudant. Simon speculated that if the Large Hadron Collider had been turned on when it should have been, CERN would have found itself in trouble; nobody was ready to deal with that amount of data at that point in time. While the LHC now produced around 15 petabytes of data annually, Simon talked about how the University of Bristol dealt with 50-terabyte datasets and how staff have transferred 100 petabytes of data to date, all over Janet. Tools like Hadoop, NoSQL databases and CouchDB had improved the situation, but people were important too, and there was a need to let people do more and become better trained. Building a suitable team was difficult given the university system of short-term grants, but universities needed to build up teams to support these activities.
For Simon, ‘big data isn’t interesting anymore’ as there was no longer a need to create new systems. The upside of this was that we could spend time on understanding our problems better and asking the right questions. One striking statistic Simon gave to illustrate the potential of data usage was that the cost of cleaning up after El Niño was 10% of Ecuadorian GDP, which demonstrated the importance of landslide modelling.
Simon concluded that data-intensive research would become the norm and that universities were going to need access to big data resources. We should also expect to see significant use from non-traditional fields and expect new fields to emerge. As far as Simon was concerned, big data had hit the mainstream.
Max Wind-Cowie, Head of the Progressive Conservatism Project Demos, took a step away from the technical requirements of big data to look at how they were being used by the public sector. He touched on the confusion that existed between the open data and big data agendas, saying that in the public sector it was necessary to work through the existing confusion until people arrived at a better understanding of the differences between the two. While there wasn’t the expertise in the public sector to make use of vast resources of data, there was a strong case for open data. In the public sector ‘there are no graphs because there are no facts, it’s all opinion’. The commercial sector, he stated, had moved on, we expected Amazon and Google to personalise our data, yet the public sector was not a place where displacement happened naturally. However data mattered, something shown in the Demos paper The Data Dividend , which provided the political context as well as the technical issues. Max saw data as something that could drive public sector innovation; it was the duty of government to make such data accessible. He explained that many public sector brands had been rendered toxic in the last few years (eg Job Centre Plus) and big data could help us to understand better how that had come about. What was required was an inside-out approach: openness and accessibility. The end result might be better resource allocation and better segmentation of datasets. Ultimately big data could be used to identify interventions which could bring benefits and cost savings. In the Questions and Answers session, Max argued for the need to have it written into commissioning contracts with public sector contractors that the data were to be shared.
The closing keynote was given by Anthony Joseph, Professor at the University of California, Berkeley. Anthony started his talk with some interesting statistics: for example, Walmart handled data on one million customer transactions per hour. One new trend was the process of analysing user behaviour rather than user input. He gave the example of the U.S. Geological Survey Twitter Earthquake Detector (TED) looking for geolocated hashtags such as #earthquake and phrases like ‘OMG!’. Google Trends has brought into being the notion of ‘nowcasting’, when, for example, more people than usual searching for flu on Google suggested that a flu epidemic was about to break out.
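The ‘nowcasting’ idea, an unusual spike in keyword frequency signalling a real-world event, can be sketched very simply. The counts and threshold below are invented for illustration; the actual USGS TED pipeline is far more sophisticated:

```python
from statistics import mean, stdev

def spike_detected(history, current, z_threshold=3.0):
    """Flag when the current keyword count sits far above its recent baseline."""
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return current > baseline
    return (current - baseline) / spread > z_threshold

# Hypothetical hourly counts of geolocated '#earthquake' tweets:
usual_hours = [4, 6, 5, 7, 5, 6, 4, 5]
print(spike_detected(usual_hours, 120))  # True: something is happening
print(spike_detected(usual_hours, 6))    # False: within normal variation
```

The design choice here, comparing the latest observation against a rolling baseline rather than a fixed threshold, is what lets the same detector work for both busy and quiet regions.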
Anthony suggested that our need to hoard has resulted in the selection and deletion of data being the most intractable problem of big data. He pointed out that if you ‘delete the right data no one says thank you, but if you delete the wrong data you have to stand up and testify’, offering the US climate trial as an example. We often found it difficult to be selective when curating data because we did not yet know the question we would need to answer. Deletion of research data was based on assumptions of what would be of value; what if those assumptions turned out to be based on an incorrect model?
Big data were not cheap, and extracting value from them presented a challenge. Obtaining answers that were both of high quality and timely required algorithms, machines and people. The state-of-the-art Mechanical Turk project from Amazon was a ‘crowdsourcing internet marketplace’ which required less than an hour to label (manually) a large dataset using human workers. Anthony ended his talk by highlighting that the US government had announced more than $200 million investment in new data projects across a range of departments. He wondered what the UK figures were like and whether we, as HEIs, were going to offer a big data curriculum? And whether we had hired cross-disciplinary faculty and staff or had invested in pooled storage infrastructure/cloud/inter- or intra-campus networks?
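The crowd labelling Anthony described typically has several workers label each item and then aggregates their answers; a minimal majority-vote sketch, with invented item names and labels:

```python
from collections import Counter

def majority_label(labels):
    """Pick the most common crowd label for one item (ties break arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical worker answers for two images:
crowd_answers = {
    "img-001": ["cat", "cat", "dog"],
    "img-002": ["flood", "flood", "flood", "fire"],
}
consensus = {item: majority_label(votes) for item, votes in crowd_answers.items()}
print(consensus)  # {'img-001': 'cat', 'img-002': 'flood'}
```

Redundant labelling plus a simple vote is what makes fast human labelling usable: individual workers may err, but agreement across several of them is usually reliable.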
In his summary Andy Powell talked about the themes that had emerged during the day:
We don’t need to get hung up on the ‘big’ word. While data were increasing exponentially (something a number of scary graphs indicated) this didn’t have to be an issue; we were growing more accustomed to dealing with large-scale data.
The tools are now available. During the symposium speakers mentioned tools such as Hadoop, CouchDB and NoSQL databases, which all allowed people to work easily with datasets. There was a consensus that people no longer needed to create systems to deal with big data, but could now spend that time on understanding their data problem better.
It’s all about the analysis of data. While storage of data could be costly and management of data labour intensive, the analysis of data was often the most complex activity, and possibly the most worthwhile. Processing of big data could provide valuable insights into many areas relevant to the success of your organisation. Keynote speaker Rob Anderson from EMC explained that ‘If we’d been able to analyse big data we might have been able to avoid the last financial crash’. He saw the future as being about making big-data-based decisions. However, while tools had a role to play there, analysis still required human intervention. On his blog Adam Cooper from CETIS advocated human decisions supported by the use of good tools to provide us with data-derived insights rather than ‘data-driven decisions’.
We don’t yet know what data to get rid of. Anthony Joseph, professor at the University of California, Berkeley suggested the selection/deletion of data was the most intractable problem of big data. It represented an on-going problem: we failed to be selective when curating data because we were unsure of the question we would ultimately need to answer.
We need data scientists. Many of the talks highlighted the need to build capacity in this area and train data scientists. JISC was trying to consider changing the research data science role in its programmes and Anthony Joseph asked HEIs to consider offering a big data curriculum. In his summary Andy Powell, Eduserv’s Research Programme Director, asked us to think carefully about how we used the term ‘data scientist’ since the label could be confusing. He noted that there was a difference between managing data, something with which HEIs were familiar, and understanding and analysing data, a fairly new area for us all.
The day ended with a drinks reception in the impressive Royal College of Physicians Library; unfortunately the rain ruined the original plan of holding it in the medicinal garden. The day had been a really enjoyable one. I’m not quite sure whether my own personal question, ‘how are big data relevant to HEIs (research and infrastructure aside)?’, had been answered, but then maybe the answers and insights lie in looking a little closer at the data, or working out what the right questions are.
Videos of all the talks are available from the Eduserv Web site.
Marieke Guy is a research officer at UKOLN. She has worked in an outreach role in the IMPACT project creating Open Educational Resources (OER) on digitisation which are likely to be released later on this year. She has recently taken on a new role as an Institutional Support Officer for the Digital Curation Centre, working towards raising awareness and building capacity for institutional research data management.
Marieke is the remote worker champion at UKOLN. In this role she works towards ensuring that UKOLN remote workers are represented within the organisation.
This article has been published under a Creative Commons Attribution 3.0 Unported (CC BY 3.0) licence. Please note this CC BY licence applies to the textual content of this article, and that some images or other non-textual elements may be covered by special copyright arrangements. For guidance on citing this article (giving attribution as required by the CC BY licence), please see below our recommendation of 'How to cite this article'.