Lost Words, Lost Worlds.

Emma Tonkin discusses how the words we use, and where we use them, change over time, and how this can cause issues for digital preservation.

      'Now let's take this parsnip in.'
      'Parsnip, coffee. Perrin, Wellbourne. What does it matter what we call things?'
      – David Nobbs, The Fall And Rise of Reginald Perrin


What does it matter what we call things? David Nobbs' fictional character Reggie Perrin suggests in the quote above that it doesn't matter at all. Yet we should keep in mind that Reggie tells us this after almost three hundred pages of tragicomic confusion brought on by his habit of arbitrarily replacing nouns with others such as 'parsnip' or 'earwig' ('When I say earwig, I mean your wife') and his serial adoption of half-a-dozen new identities. Reggie's linguistic delinquencies are part of a struggle to escape, both literally and nominally, from a choked, stultifying suburban identity. 

Pragmatically, this sort of thing can get in the way of communication. If you want a coffee, after all, why ask for a parsnip? On the other hand, to have been invited for coffee in 1774 would have been an invitation to a light meal which, admittedly, involved drinking coffee (Crystal, 2014), so Perrin's lesson to us – that you cannot trust the term 'coffee', but you can usually rely on caffeine if you need to stay awake long enough to finish reading that paper – survives intact. 

Which, by a circuitous route, brings us to semantic change: the observed change over time in the way that terminology is used. It's always a fascinating topic, even if it is rarely a very productive direction, because it is also, frankly, rather a subtle phenomenon, with (in general) rather subtle effects. It is of particular interest to those who work with library, archive, museum and cultural heritage collections. This is partly because much of the information held in such collections has been around for a very long time and occasionally throws up pragmatic engineering problems as a result, and partly because the languages of the past are innately interesting, whether or not there is a practical purpose to studying them. 

Words certainly do change their meaning and their connotations, sometimes rather suddenly. The parents of the approximately 0.02% of American infant girls who were given the chart-topping entry in 2013's list of Top Ten mythological girls' names will be sharply aware of that: #3 is Persephone, #2 Thalia, and at the top of the pops in 2013, we find Isis (Khazan, 2014). The Egyptian goddess of health, marriage and wisdom has abruptly found herself unwillingly defending her namespace against a thoroughly unpleasant competitor. 

This type of current-event-driven change is exceptionally high-profile. Usually the factors that drive semantic change are not so widely broadcast, nor do they develop as rapidly. Semantic change often plods on for decades, sometimes even generations, before the compilers of the Oxford English Dictionary reach for the red pen.   

Semantic change in digital preservation

Semantic change is understood to be an issue of great relevance to digital preservation, in an understated sort of way.  Semantic change in formal systems, such as controlled vocabularies, taxonomies and ontologies, has received ongoing interest from theoreticians and practitioners alike (Tennis 2007; Wang et al, 2011; Baker et al, 2013). 

The concept of semantic change has been mathematically formalised on a number of occasions: Wang et al (2011) reviewed past work in the Semantic Web domain on the subject, then formalised the idea of concept drift (a change in the meaning that underlies a term) for a knowledge organisation context. The formal definitions involved are a little opaque: for example, Wang et al remarked that:

The meaning C_t of a concept C at some moment in time t is a triple (label_t(C), int_t(C), ext_t(C)), where label_t(C) is a String, int_t(C) a set of properties (the intension of C) and ext_t(C) a subset of the universe (the extension of C). 

In other words, a concept connects label (a word or words) with a set of properties and a set of things. For example, the concept cat connects the word cat with the 'intension' – the properties of being fluffy, four-legged, having a tail, making plaintive meowing noises and insisting on napping on the keyboard – and with the 'extension', i.e. your cat, your neighbour's cat, the feral cat down the road who breaks open trash bags and forages for meat, and so forth. 

Any of the elements of this set-up can change. Change in pronunciation of the English language might eventually cause us to come up with a more representative variant spelling of cat. We may develop a new word for feral cat and transfer all of the relevant cats from cat over to the new concept instead, splitting the cat namespace into cats-that-eat-garbage and cats-that-nap-on-keyboards. This sort of thing can happen, eventually, if there are enough of each to make it worth finding a shortcut way to reference the distinction. 
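
The triple of label, intension and extension can be sketched as a small data structure. The following Python is only an illustration and not Wang et al's formalism: the class name `ConceptSnapshot`, the helper `drift`, and the example properties and cats are all assumptions invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptSnapshot:
    """The meaning of a concept at one moment in time (hypothetical model)."""
    label: str            # the word or words used for the concept
    intension: frozenset  # the properties associated with the concept
    extension: frozenset  # the things the concept is applied to


def drift(earlier: ConceptSnapshot, later: ConceptSnapshot) -> dict:
    """Report which parts of the triple differ between two snapshots."""
    return {
        "label_changed": earlier.label != later.label,
        "properties_lost": earlier.intension - later.intension,
        "properties_gained": later.intension - earlier.intension,
        "members_lost": earlier.extension - later.extension,
        "members_gained": later.extension - earlier.extension,
    }


cat_then = ConceptSnapshot(
    label="cat",
    intension=frozenset({"four-legged", "has a tail", "meows"}),
    extension=frozenset({"your cat", "neighbour's cat", "feral cat down the road"}),
)

# Illustrative assumption: feral cats have moved to a separate, new concept.
cat_now = ConceptSnapshot(
    label="cat",
    intension=frozenset({"four-legged", "has a tail", "meows", "naps on keyboards"}),
    extension=frozenset({"your cat", "neighbour's cat"}),
)

changes = drift(cat_then, cat_now)
```

Here the label stays the same while both the intension and the extension shift, which is the kind of quiet change that a comparison of labels alone would miss.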

From a practical point of view, it seems clear that the effects of semantic change are far less dramatic in the short-to-medium term accessibility of a digital object than the sorts of technical change with which digital preservation usually concerns itself. The unavailability of any software capable of interpreting a complex binary file format is a very fundamental accessibility problem, readily and often identified as an issue (see discussion, Graf et al 2013). Such an issue is significant enough to warrant immediate and direct mitigation, perhaps through conversion of the digital object into an alternative and more accessible file format, or through software emulation or virtualisation of the environment required to make use of legacy software (Van der Hoeven et al, 2008; Giaretta, 2008).

What happens if incompatibilities develop between the original interpretation of a set of terms and the way that they are understood today? The answer must depend on the details. We cannot lightly speak of semantic change as though terminology switches suddenly, uniformly, from one meaning to another, as if by common consensus society simply edits its common dictionary. There is, according to mathematical studies (Niyogi, 2006), a sort of threshold effect beyond which change is generally adopted – the 's-shaped curve' of language change, where change pushes forward slowly, accelerates to a peak rate, and then tails off (Blythe & Croft, 2012). 
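
An s-shaped curve is often drawn using a logistic function, and a minimal sketch may help to fix the shape in mind. This is a generic illustration rather than the model used in the studies cited: the function name `adoption_share` and the `midpoint` and `rate` values are arbitrary assumptions.

```python
import math


def adoption_share(t: float, midpoint: float = 50.0, rate: float = 0.15) -> float:
    """Share of speakers using the new form at time t, on a logistic curve.

    `midpoint` is the time at which half of speakers have adopted the
    change, and `rate` controls how sharply the curve rises. Both values
    are placeholders chosen for illustration.
    """
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))


# Sample the curve at five evenly spaced (and arbitrary) points in time.
shares = [adoption_share(t) for t in (0, 25, 50, 75, 100)]
```

With these placeholder values the share barely moves at first, climbs fastest around the midpoint, and flattens again as the new form approaches universal use.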

The fact that a new convention dominates does not mean that use of unfamiliar conventions becomes entirely incomprehensible: if it did, Yoda's grammar would require helpful subtitles.  In general it takes a long time for terminology to drift so far from the original meaning that observers (human observers, at least: machines are not generally as flexible with language) cannot dredge the intended meaning from memory or infer it from contextual data. Foscarini et al (2010), commenting on Lee's (2010) contextual framework for digital collections, lay out the importance of situational context to the interpretation of digital objects. The authors identify information about original contexts of creation and use as important for users trying to 'make sense and meaningful use of artefacts at moments of “reactivation” in the future.' 

Olde Worlde Words

Consider the following text drawn from a set of 16th century art auction catalogues, made available by Julie Allinson of the University of York, who developed the dataset as part of the OpenArt project (Allinson et al, 2012).

A Cataloge of the Names off the Most Famous painters hath lived in Europe these
Many Years bygon - - - - -off whose hands tharr is A Great parcell off Most Curious rare picturs
To be sold.

The majority of our difficulty with this text does not result from semantic change, but from simple orthographic inconsistency: for example, we like to spell 'parcel' and 'of' with a more consistent and parsimonious number of consonants. We are also generally a little more cautious in our use of capitalisation, although we, too, often make exceptions to this rule, especially when laying out titles in academic texts.

Semantic change is represented in this text, however. The term 'curious' is most probably used here to mean 'unusually', rather than 'strangely', as it would be read today. We read that the parcel of pictures is 'great', which means not that it is excellent but simply that it is large. This is the sense found in 'Great Britain' which, according to Geoffrey of Monmouth (Hay, 1955), names the larger landmass, Britain, by comparison to the smaller, Brittany. 

For a human being, decoding this sort of thing generally demands an unwelcome effort: wading through variant spellings and obsolete terminology can take more time and cognitive resources than we would like to spend. That historical records contain spellings that would be more familiar today to the Dutch ('an original landskip by a good hand') is a pleasant curiosity in passing and an irritant in quantity. 

More problematically, without an appropriately trained indexing function capable of identifying variant spellings, our catalogue becomes almost opaque to searches based on keyword indexing. This is only exacerbated by functionality that is in ordinary terms a useful second-generation catalogue feature, such as automatic spelling corrections performed on search input: these are more helpful if they are normalised against the same conventions that govern the index itself. 
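
One way to picture the point about shared conventions is a toy keyword index in which documents and queries pass through the same normalisation step. The sketch below is an assumption-laden illustration, not a description of any real catalogue system: the `VARIANTS` table is hand-made, the function names are hypothetical, and a real collection would need a far larger, period-specific mapping.

```python
import re
from collections import defaultdict

# Hypothetical, hand-made mapping from historical spellings to modern forms.
# Note that mapping "off" to "of" would also rewrite the modern word "off",
# so a real system would need context or a period-specific dictionary.
VARIANTS = {
    "off": "of",
    "parcell": "parcel",
    "picturs": "pictures",
    "cataloge": "catalogue",
    "landskip": "landscape",
}


def normalise(text: str) -> list:
    """Lower-case, tokenise and map known variant spellings to modern forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [VARIANTS.get(token, token) for token in tokens]


def build_index(documents: dict) -> dict:
    """Build an inverted index from normalised term to document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in normalise(text):
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents containing every normalised query term."""
    terms = normalise(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


documents = {
    "cat1": "A Great parcell off Most Curious rare picturs To be sold",
    "cat2": "an original landskip by a good hand",
}
index = build_index(documents)
```

Because the query passes through `normalise` as well, a search for either 'landscape' or 'landskip' reaches the same record. A spelling corrector applied to the query alone, against an index of unnormalised text, would not.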

On the other hand, the challenge itself of reading historical texts in their original form can be a positive benefit to some. There is a persistent, if specialist, interest in reading unedited editions of Chaucer's Canterbury Tales, despite the fact that learning Chaucer's Middle English involves almost all of the steps involved in learning a foreign language. That is partially due perhaps to the enduring appeal and timeless themes of the Tales themselves, but as Josephine Livingstone (2013) put it, 'learning a medieval language […] connects you to a body of literature which is at once intensely familiar and delightfully strange. It is an uncannily lovely experience to read lines written many, many hundreds of years ago about bits of the world that you could have laid eyes on yourself'. 

Subtler problems can easily ambush us, too. Absolute problems with comprehension are not always the principal factor driving revision of a text. Sometimes terminology develops connotations that may be viewed as distracting or aggressive. The language of the relatively recent past, like the language of Chaucer, is a window into another world. Texts from the 1800s, however, are not as strange to our eyes as Chaucer or Shakespeare. Such texts are often familiar enough to resonate with contemporary politics, sometimes in unpleasant ways. The same could be said of the language spoken in unfamiliar contexts, whether physical or online. 

In his 1953 novel The Go-Between, L.P. Hartley wrote that 'the past is a foreign country: they do things differently there.' If that's the case, then online communities could be anywhere: if learning a medieval language is a form of time machine, then studying distant communities is a form of space travel. Groups of people on the Web speak as it pleases them, share as it pleases them and write according to the norms of their own communities. 

Hill et al (2003) famously described social interactions such as the sharing of gossip as a form of 'social grooming': one of the means by which the human ape establishes and retains social bonds. Donath (2007) considered the impact of social networking services – platforms that allow for rapid large-scale social grooming activities. Rather than spending hours picking lice from another ape's hair, we have not only dispensed with the effort of any physical grooming activity by developing language and finding another means to 'maintain ties and manage trust' (ibid), but we have also managed to figure out how to broadcast it to hundreds of other people at once with minimal effort to us. For the purpose we make use of what to anthropologists and linguists is known as phatic expression (Malinowski, 1923), and to everybody else as small talk.

      Ape 1: 'Morning'
      Ape 2: 'Hey, how's it going?'
      Ape 1: 'How're you doing?'
      Ape 2: 'Turned out nice again'

and so forth. Phatic expressions aren't designed to inform the listener about the weather. They're 'grooming talk'. Online networks are an excellent medium for the transmission of such signals. Within our groups, we develop social conventions and signals: differentiating our use of language from that of others is just one dimension of the many, many things that the social ape can do to mediate membership in a community. 

As much fun as this type of theorising is, pragmatically, it does not really matter all that much from a data management or preservation perspective why such specialisations occur. The 'why' of variant orthography is neither here nor there from a practical perspective: the important point is that it happens.

If social webs are islands, then they are volcanic

Back in 2007, Randall Munroe (of XKCD fame) published a map of the social web, drawing each online social web site as a landmass, representing the user member count by the surface area of each land on the map (Munroe, 2007). He published an updated version of the map in 2010 (Munroe, 2010). Many of the locations on the original map no longer exist. Orkut, for example, was closed in 2014 after ten years in operation. Other sites, though still in existence, have lost much of their popularity: MySpace peaked in 2007 and has seen a significant decline in usage since. Munroe's map necessarily displays only a tiny proportion of the many social websites online at the time of his design: most of them would not have been visible on the scale of the graphic, anyway.

Back in 2012, a colleague named Adam Chen and I completed a study into the social web, which identified and tracked the popularity of several hundred social websites over a period of five years. All sorts of sites were included, including German-language Facebook clones, Chinese-language social networking services, a Digg (social news) clone used for news relating to a single political party, and a plethora of social bookmarking sites designed for specific language communities. 

The database included some 75 sites in the social bookmarking category, twelve of which were already closed in 2012, and over a hundred social news sites, of which a quarter were already closed. Returning to that list of sites today, of the 308 that were still functional in 2012, 45 no longer exist even as repurposed domain names. A further 18 simply respond with HTTP errors in the 400 range: 403 Forbidden, 404 Not Found and the ever-plaintive 410:

      HTTP 410 Gone, The resource is not available and will not be available again. 
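
A tally of this kind can be sketched in a few lines. The code below is not the script used for the study: the `classify` function, its category names and the example sites are assumptions for illustration, and the check results are made up. Only the figures 45, 18 and 308 come from the text.

```python
from collections import Counter


def classify(status):
    """Place a site check result into a coarse category.

    `status` is an HTTP status code, or None when the domain no longer
    resolves at all (a convention assumed for this sketch).
    """
    if status is None:
        return "domain gone"
    if 400 <= status < 500:
        return "client error"
    if 200 <= status < 400:
        return "responding"
    return "other"


# Made-up check results for five hypothetical sites, not the study's data.
checks = {
    "site-a.example": 200,
    "site-b.example": 404,
    "site-c.example": 410,
    "site-d.example": None,
    "site-e.example": 403,
}
summary = Counter(classify(status) for status in checks.values())

# Proportions for the figures reported in the text.
gone_share = 45 / 308    # sites that no longer exist at all
error_share = 18 / 308   # sites answering with errors in the 400 range
```

On the reported figures, roughly 15% of the sites that were functional in 2012 have vanished entirely, and roughly a further 6% answer only with an error.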

A complementary metaphor to Munroe's island itself might characterise each social networking site as an exotic species of island fauna. If a site is granted a stable niche, it can survive comfortably, unless and until a larger, hungrier or altogether angrier species decides to muscle in on its territory. Of course, this metaphor is somewhat spoiled by the fact that people can and often do belong to several social networks simultaneously; we do not have to live full-time on Facebook Island. We can have breakfast on Facebook Island with our school friends and then pop out to researchgate.net to share our morning cuppa with the other five people in the world who share our interest in our personal choice of specialist research. We are unlikely to communicate with each group in an identical tone of voice (or, sometimes, the same language). 

This brings us to the 'lost worlds' of the title. Some of the lost sites that we identified were never in common use. Some contained more spam than coherent content. A few, however, were the 'eight-hundred pound gorilla' of their genre and in their time, and of those, a surprisingly large proportion are now no longer in use. Some have been archived in whole or in part by the site owners or by interested third parties; most are simply gone. Dependency on a specialist network has its advantages. As Jeffrey (2012) puts it, however, a dependency on external services increases the risk of a 'digital dark age'. Jeffrey referred to the use of social network services in the field of archaeology: the same could be said of librarianship. 

We do not know whether the people sitting under a notionally tranquil palm tree on a soon-to-be-sunken social media island were simply performing the online equivalent of picking lice from another ape's hair or sharing insights about their close reading of Geoffrey Chaucer in Middle English. It does not matter: the island is gone and that material is in all probability lost. 

It may be that issues that hide as deeply in the long grass as semantic change genuinely don't deserve a great deal of time, with rare exceptions. One such exception might involve the development of specialist user interfaces designed to handle material spanning particularly long periods of time. Another might be consideration of the issues related to long-term sustainability of knowledge structures such as vocabularies, taxonomies or ontologies. We may find that as the scale of collections continues to grow – 2015 was, after all, hailed by Oracle as 'the year of Big Data' (Preimesberger, 2014) – the subtle trickery of natural language may pave our way to new views of the data that we hold and the places and times it illustrates. 


Semantic inconsistencies are, depending on one's perspective, either a fascinating resource for productive study or an irritating rake in the face for the busy engineer. The phenomenon is slow, compared to other forms of change. It has received interest in a variety of knowledge-management contexts, and the modelling, characterisation and mitigation of its implications for various engineering problems have been studied. Given the many advances in the area of provision and aggregation of open data, perhaps the extent (and limit) of variance within very large datasets will become more evident as further uses of such large datasets are reported. Since datasets are presently often experienced within a subject- or organisation-specific silo, it may be that the extent of such changes is hidden from view unless organisations make a point of comparing search terms to track potential differences between the semantics used within the index and those selected by users. 
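
A comparison of search terms against index terms could start very simply, by counting the words that users type but the index never contains. The following is a rough sketch under stated assumptions: the function name `unmatched_terms` is hypothetical, the vocabulary and queries are made up, and whitespace tokenisation stands in for whatever analysis a real search system applies.

```python
from collections import Counter


def unmatched_terms(queries, index_vocabulary, top_n=10):
    """Count query terms that never occur in the index vocabulary."""
    vocabulary = {term.lower() for term in index_vocabulary}
    misses = Counter()
    for query in queries:
        for term in query.lower().split():
            if term not in vocabulary:
                misses[term] += 1
    return misses.most_common(top_n)


# Made-up example data: an index built from historical spellings, and
# queries typed by present-day users.
index_vocabulary = {"landskip", "picturs", "parcell", "painters"}
queries = ["landscape painters", "landscape pictures", "parcel"]
report = unmatched_terms(queries, index_vocabulary)
```

A term that users request often but that the index never matches, such as 'landscape' in this toy data, is a candidate for closer inspection. It may point to a variant spelling, a shift in meaning, or simply a gap in the collection.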


Date published: 
Saturday, 16 January 2016