Web Magazine for Information Professionals

Folksonomies: The Fall and Rise of Plain-text Tagging

Emma Tonkin suggests that rising new ideas are often on their second circuit - and none the worse for that.

Despite the stability of many key technologies underlying today's Internet, venerable workhorses such as TCP/IP and HTTP, the rise of new candidate specifications frequently leads to a sort of collaborative manic depression. Every now and then, a new idea comes along and sparks a wave of interest, the first stage in the Internet hype cycle. Transformed with the addition of a series of relatively content-free conceptual buzzwords, the fragile idea is transmitted between and within communities until disillusionment sets in, when the terminology becomes an out-of-date reminder of a slightly embarrassing era that tomorrow's computer industry professionals will laugh about over a pint of beer. Eventually, the idea is retrieved, repackaged in a less sensational envelope, and filed for later use. This phenomenon is graphically represented as the Gartner hype cycle [1].

This effect is significant for several reasons. Firstly, a solution may prove to be useful long after its buzzword sell-by date has been exceeded and it is lost from view. Secondly, the initial enthusiasm and the obscurity into which previous years' ideas are cast mean that they are relatively difficult to retrieve and analyse objectively. This is a result of the apparent fact that the surrounding semiotic structure and the mental model that we share of an idea are powerful elements in our understanding of, and our discourse about, that technology.

Computer science speaks in terms of algorithms and abstract mathematics. Programmers often talk in terms of design patterns, standard solutions to common problems. Discourse on the Internet is often conducted on a far less abstract level; an accepted or rejected buzzword may be seen as an indicator of political stance or of character, rather than being a decision framed in terms of the underlying technology or use case.

Today's 'hot topic' is collaborative tagging: the classification of items using free-text tags, which are unconstrained and arbitrary values. Tagging services are separated into two general classifications: 'broad', meaning that many different users can tag a single resource, or 'narrow', meaning that a resource is tagged by only one or a few users [2]. For a full introduction to folksonomic tagging, read Hammond et al [3].

There are now a large number of tagging services, many general-purpose, attracting a large and diverse audience. Some are intended for specialised purposes, targeted to a smaller, well-defined audience. Resources may be pointed to by any number of different databases, each of which is aimed at a different set of communities on the Web. The result is a large network of metadata records, each containing a tuple of free-text descriptions and a pointer to a resource. The sum of the records from the various tagging services creates a sort of 'tag ensemble' - the sum of taggers' contributions regarding a certain resource.

Figure 1: The sum of the records from various tagging services creates a 'tag ensemble'
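
As a minimal sketch of the 'tag ensemble' idea, the following Python fragment merges records from several tagging services into a single per-resource view. The service names, URIs and record layout are invented for illustration; real services expose their data in their own formats.

```python
from collections import Counter, defaultdict

# Hypothetical records: (service, resource URI, free-text tags).
records = [
    ("service-a", "http://example.org/paper", ["folksonomy", "Tagging"]),
    ("service-b", "http://example.org/paper", ["tagging", "metadata"]),
    ("service-b", "http://example.org/other", ["photos"]),
]

def tag_ensemble(records):
    """Sum the tags contributed for each resource across all services."""
    ensemble = defaultdict(Counter)
    for _service, uri, tags in records:
        # Fold case so that 'Tagging' and 'tagging' count as one tag.
        ensemble[uri].update(tag.lower() for tag in tags)
    return ensemble

ensemble = tag_ensemble(records)
```

Each resource ends up with a count per tag, so a tag applied on two services weighs more than one applied once. Case-folding is only one possible normalisation; a real system would have to decide how far to go in merging variant spellings.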

Folksonomic tagging is popular; as it turns out, many Web users are more than happy to describe resources in this manner. Possible reasons for this have been identified: it is easy, enjoyable and quick to complete; it is an activity with low cognitive cost, which provides immediate self and social feedback [4]. Formal categorisation is, classically, a stylised way of dividing elements into appropriate groupings, which themselves are recognised as common or standard terms. In formal categorisation, there is a correct answer, implying necessarily that there is also a much larger set of incorrect answers. In tagging, there is a very large group of correct answers - short of intentional abuse of the system, it is doubtful as to whether any incorrect answers exist at all.

But certain apparent disadvantages to the approach have been identified, such as the potential for abuse, the variation and lack of standardisation resulting from the unbounded nature of free-text tagging, and the consequent difficulty of reusing the information effectively [5].

For those of us more accustomed to dealing with formal classification methods, folksonomic tagging brings the potential of disruption, headaches and annoyance; how are such techniques supposed to fit into our tidy world? As unexpected as the success of tagging is, as novel and original as the technique seems, can we do anything but wait and see what the future will bring? Fortunately, there is applicable research - at least two separate strands of research - pre-existing on similar topics. In computing, the phrase popularly attributed to Marie Antoinette's milliner is particularly accurate: 'There is nothing new, except that which has been forgotten'.

We proceed by identifying a number of related research strands. We then review some examples of relevant literature from each strand. We may then be in a position to discuss the issues that arise from each, and how each approach relates to the others.

Talking about Metadata

Human-computer interaction (HCI) is, by definition, about the transfer of information from one place to another. Sometimes, that information is transient in nature, but often it is not. In the latter case, we find ourselves with some data on our hands that we need to store and at some point in the future retrieve.

From the perspective of the machine, the task simply involves copying that data into an empty space, or series of empty spaces, in an appropriate manner and leaving it there until it is required again. When this occurs, the machine will simply retrieve the file corresponding to the reference ID provided. From the human perspective, this is an entirely different problem; we do not in general have the ability faultlessly to recall long strings of alphanumeric characters, so we must use a method that is more accessible to our own understanding of the task. For this reason, the file system designer must consider not only the questions surrounding file storage, structure, access, use, protection and implementation, but also - critically - how the file system appears to the user. According to Tanenbaum [6], probably the most important characteristic of any abstraction mechanism is the way the objects being managed are named. The general solution is to provide each file with a human-readable attribute set. The user provides a certain quantity of metadata - information about the data in the file. On the most basic level, this might be a filename. On a more elaborate level, there are various metadata standards that the user could apply in order to create and store a relatively complete description of the content of their data file.

While we generally think of a filename as a characteristic of that file, this is not the case from the point of view of the file system designer, who thinks of a file as an array of blocks, a collection, containing binary data. As files are typically discovered by looking through the contents of a directory, it is the directory structure that is to be read, since it is this structure that typically contains the metadata required to retrieve the file. The 'Standard Model' for a directory entry in a very simple filesystem might look something like the following, adapted from 'the MS/DOS filesystem', Tanenbaum (p420) [6]:

Figure 2: The 'Standard Model' for a directory entry in a simple filesystem
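
To make the idea of a directory entry concrete, here is a rough Python sketch. It assumes the classic 32-byte MS-DOS layout (8-byte name, 3-byte extension, one attribute byte, ten reserved bytes, time, date, first block number and size); the exact fields shown in Figure 2 may differ, and the class and field names are illustrative only.

```python
import struct
from dataclasses import dataclass

# Assumed layout: 8s name, 3s extension, B attributes, 10s reserved,
# H time, H date, H first block number, I size = 32 bytes, little-endian.
ENTRY = struct.Struct("<8s3sB10sHHHI")

@dataclass
class DirEntry:
    name: str
    ext: str
    attributes: int
    time: int         # raw packed time value, not decoded here
    date: int         # raw packed date value, not decoded here
    first_block: int  # pointer to where the file's data begins
    size: int

    def pack(self) -> bytes:
        """Serialise to 32 bytes; names are upper-cased and space-padded."""
        return ENTRY.pack(
            self.name.upper().ljust(8).encode("ascii"),
            self.ext.upper().ljust(3).encode("ascii"),
            self.attributes,
            b"\x00" * 10,
            self.time, self.date, self.first_block, self.size,
        )

    @classmethod
    def unpack(cls, raw: bytes) -> "DirEntry":
        name, ext, attrs, _reserved, time, date, first_block, size = ENTRY.unpack(raw)
        return cls(name.decode("ascii").rstrip(), ext.decode("ascii").rstrip(),
                   attrs, time, date, first_block, size)

raw_entry = DirEntry("readme", "txt", 0, 0, 0, 2, 1234).pack()
```

The point of the sketch is that the name lives in the directory record, alongside a pointer to the data, rather than in the file itself.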

This design is fairly simple. Any number of possible extensions can be imagined; these raise a number of questions for the intrepid file system designer, including but not limited to:

Many of these questions are essentially speculative, though some have been extensively researched and discussed, with results that have greatly influenced the design of most modern operating systems. Some relevant research will be reviewed later in this article. There are reasons to expect the relevance of a piece of metadata to be influenced by a number of variables, including age; no representation of the world is either complete or permanent [7]. The choice of metadata is necessarily strongly influenced by user behaviour and habit. Context-independent metadata is relatively exempt from the ageing process - context-dependent metadata becomes less useful as the background context is lost.

These issues aside, the filesystem is not the only place that supports an essentially similar structure. Examples might include the metadata records stored in an institutional repository, which may be exposed as a simple Dublin Core description of a resource and a pointer towards the resource (perhaps via a related jump-off page), browser-based bookmarks (and the hyperlink itself), peer-to-peer technologies such as the .torrent records used in BitTorrent to store metadata about a file so that it can be located and retrieved from a network, the metadata stores underlying many citation databases and - most recently of all - in folksonomic tagging. That is not to say that there are not very clear differences between these various types of metadata; there are many, including:

The observation here is simply that none of these differences are critical; despite the differences, there may be sufficient underlying similarity between these examples to make it possible to consider one technology in the light of an understanding of another. We will begin by looking at a few pieces of research that relate to metadata within the filesystem - a field of research sometimes referred to in terms of Personal Information Management (PIM), defined by Lansdale [8] in Kljun & Carr [9] as, 'the acquisition, storage, organization, and retrieval of digital information collections by an individual in their digital environment'.

Strand 1: The Rich Filesystem

A Brief History of the Rich Filesystem

Filenames are possibly the simplest form of free-text tagging on a limited data set. The computer on which I am typing this will permit me to choose any filename for this document, provided it is no longer than 255 characters and contains only the permitted characters. Each operating system has its own set of limitations; for example, DOS filesystems used to permit only 11-character filenames, which were displayed as 8 characters followed by a '.', with a three-character file extension. This extension was - and still is, on Windows - used to indicate the type of file, such as '.txt' for text, '.mp3' for audio, and so forth. Although filenames are no longer limited to a few upper-case characters, this convention has survived and is in wide use. In practice, then, a file is generally named with a 2-tuple, a pair (N,E), where N is a mnemonic word or phrase and E is the filename extension. N is a free-text value, whilst E is a formally-defined term taken from the relevant controlled vocabulary.
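
The (N,E) pair can be illustrated with a few lines of Python. The tiny extension vocabulary and the function names below are assumptions made for the example, and the 8.3 check is deliberately rough: it looks only at lengths and ignores the rules on permitted characters.

```python
import os

# Assumed, deliberately tiny controlled vocabulary of extensions.
KNOWN_EXTENSIONS = {"txt": "text", "mp3": "audio", "pdf": "document"}

def split_filename(filename):
    """Return the pair (N, E): free-text name and extension without the dot."""
    name, ext = os.path.splitext(filename)
    return name, ext[1:].lower()

def file_type(filename):
    """Look E up in the controlled vocabulary; None if it is not a known term."""
    return KNOWN_EXTENSIONS.get(split_filename(filename)[1])

def fits_dos_8_3(filename):
    """Rough length-only check against the old 8-plus-3 character limit."""
    name, ext = split_filename(filename)
    return 1 <= len(name) <= 8 and "." not in name and len(ext) <= 3
```

For instance, `split_filename("Folksonomies.TXT")` yields the free-text part `Folksonomies` and the controlled term `txt`, and the name would not have fitted the DOS limit.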

This, though, is not the end of the story for filenames. Many people make use of naming conventions in order to simplify organisation and storage of files. For example, a common file-naming convention in storage of conference papers might be a 4-tuple, such as (Y,C,N,E) - year-conference-author's name.extension. Storing a PDF of a book chapter is sometimes done by using a simple convention such as (Y,B,C,A,E) - year-book title-chapter-author's surname.extension. When saving a paper on my own machine, though, I am likely to add an additional term to indicate the purpose for which I downloaded it, so that I can separate out resources accordingly - the convention that I use for naming is representative of the task with which the items are associated.
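
A convention such as (Y,C,N,E) is in effect a small structured record hidden in a free-text field. The sketch below parses it back out, assuming a hyphen as the separator; the type name, function name and sample filename are hypothetical.

```python
import os
from typing import NamedTuple, Optional

class PaperName(NamedTuple):
    year: int
    conference: str
    author: str
    extension: str

def parse_paper_filename(filename: str) -> Optional[PaperName]:
    """Parse 'year-conference-author.extension'; None if the name does not fit."""
    stem, ext = os.path.splitext(filename)
    # Split on the first two hyphens only, so hyphenated surnames stay intact.
    parts = stem.split("-", 2)
    if len(parts) != 3 or not parts[0].isdigit():
        return None
    year, conference, author = parts
    return PaperName(int(year), conference, author, ext.lstrip("."))

parsed = parse_paper_filename("2005-ExampleConf-Smith.pdf")
```

A name that does not follow the convention simply fails to parse, which is the weakness of conventions of this kind: nothing enforces them.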

Of course, the filename is not the end of the story; in a conventional file system, files are organised into a hierarchy. This provides neat alternative solutions to problems such as the one for which my naming convention was designed - arranging files into sets. As a solution, however, it is less than perfect. To see why, consider the question of how to store a downloaded file. Should it be stored according to the source from which it was downloaded, according to the subject matter or the content, or with reference to the task for which it was downloaded?

In most traditional filesystems, an artificial restriction is placed upon the system, such that a file can only inhabit one directory at a time. There is no absolute reason why this must be the case; an appropriately-written file system could store any number of filename records pointing to the same location - however, this would confer additional responsibilities onto the filesystem. A delete operation, for example, would imply not only deleting the original reference to the file, but also locating and deleting all other records referencing that file. The Windows-based '.lnk' shortcut or the UNIX concept of 'soft' symbolic linking are designed as solutions to this problem - a soft link is simply an OS-readable metadata record comprising a filename and a pointer referencing a filename (eg. "mytempfile.txt" might contain a pointer that forwards to C:\tmp\filename.txt), while a hard link is a 'standard' filename record, a metadata record comprising a filename and a pointer to the data.
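
The difference between the two kinds of link shows up most clearly on deletion. The following sketch, which assumes a POSIX-style system where unprivileged users may create both kinds of link, makes one hard link and one soft link to a file and then removes the original name. The filenames are illustrative.

```python
import os
import tempfile

def link_demo():
    """Create a hard and a soft link, delete the original name, report what survives."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "filename.txt")
        hard = os.path.join(workdir, "hardlink.txt")
        soft = os.path.join(workdir, "mytempfile.txt")
        with open(original, "w") as handle:
            handle.write("some data")

        os.link(original, hard)      # a second filename record for the same data
        os.symlink(original, soft)   # a record that points at the *name*
        links_before = os.stat(original).st_nlink

        os.remove(original)          # removes one name, not the data itself

        with open(hard) as handle:
            hard_content = handle.read()
        # The soft link still exists as a record, but its target name is gone.
        soft_dangling = os.path.islink(soft) and not os.path.exists(soft)
        links_after = os.stat(hard).st_nlink
    return links_before, links_after, hard_content, soft_dangling

result = link_demo()
```

On such a system the filesystem keeps a count of names per file rather than hunting down every record on deletion: the data remains reachable through the hard link, while the soft link is left dangling.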

Usability concerns relating to file links (or symlinks) do not appear to be a popular research topic, but it is reasonable to conclude that both soft and hard linking are sufficiently complex to be difficult to use. Similarly, it is often said of the hierarchical filesystem paradigm that it is not intuitive or usable; although this is rooted in the observation that classifying is a task involving a high cognitive load [10], the most frequently cited source for the hypothesis is Barreau and Nardi (1995) [11]. This paper is at the root of today's general perception that the usability of the hierarchical filesystem is low. While this assertion is widely, though neither universally nor unreservedly, accepted today, the paper has received a certain amount of criticism and discussion (e.g. Fertig et al (1996) [12]), and the generally accepted rationale behind the assertion has since changed.

Ravasio et al (2004) [10] give an overview of user experience with modern desktop systems. Some of their observations are illuminating:

Ravasio et al [10] also identify three separate viewpoints on data, which may all be used by a given user at different times. They note that specific tools tend to support only one of these, neglecting the others;

Whatever the ultimate truth on the usability of hierarchical filesystems in general, a large number of research efforts have since appeared that, largely agreeing with the concerns voiced by Barreau and Nardi [11], dedicated their effort to searching for alternatives. In fact, this process had already begun; for example, MIT's 1991-1992 Semantic Filesystem project [13] had outlined a method by which attributes could be automatically extracted from files, and could then be used as a basis for separating content into 'virtual directories' - in this view, files are collected into virtual directories according to automatically extracted features. Various further alternatives were proposed, such as the 'pile' metaphor for the sorting of documents [14].
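
The 'virtual directory' idea can be sketched in a few lines. This is not the Semantic Filesystem's own interface; the extractor, the attribute names and the sample files below are invented for illustration.

```python
from collections import defaultdict

def extract_attributes(filename, content):
    """Toy extractor: derive attribute/value pairs from a file automatically."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    size = "small" if len(content) < 1024 else "large"
    return {"ext": ext, "size": size}

def virtual_directories(files):
    """Map 'attribute:value' virtual directory names to the files they contain."""
    directories = defaultdict(set)
    for filename, content in files.items():
        for attribute, value in extract_attributes(filename, content).items():
            directories[f"{attribute}:{value}"].add(filename)
    return directories

files = {"notes.txt": "short note", "talk.pdf": "x" * 5000, "draft.txt": "y" * 2000}
dirs = virtual_directories(files)
```

Note that a file appears in as many virtual directories as it has attribute values, which is exactly what the single-parent hierarchy cannot offer.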

Of particular interest is the Placeless Documents project of Xerox PARC, beginning in 1999. This project was essentially designed to replace the hierarchical filing structure with an alternative - the Presto prototype, a document management system that made use of 'meaningful, user-level document attributes, such as "Word file", "published paper", "shared with Jim" or "Currently in progress"' [15]. The Placeless Documents project was designed to solve problems like the 'single-inheritance' structure in the filesystem - that is, the previously mentioned problem that a file can inhabit only one place in the filesystem, to provide the ability to reorganise large sets of files quickly and simply, and to provide a set of faceted views on a filesystem according to the task currently in hand. This latter capability would expose what Dourish et al referred to as the multivalent nature of documents - the fact that a given document may 'represent different things to different people'.

Realising that the high cost of metadata was an issue, since the existence of a large quantity of annotations was pivotal to the success of the prototype, the Presto developers envisaged that documents would be tagged/annotated both by users and with metadata extracted automatically by software services performing content analysis. Furthermore, a third category of annotations existed - active categories. Whether produced automatically or added by a user, these tags caused an action to take place. For example, a tag such as 'backup.frequency = nightly' would cause the filesystem to back up the document at the frequency given. These actions may also be user-defined - for example, a user who tags a file with the term 'readme' might set the system to maintain an up-to-date copy of similarly tagged files on his laptop.
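
An active property can be thought of as a tag with a handler attached. The sketch below is a toy model under that assumption, not Presto's actual design; the class and method names are hypothetical, and the 'backup' action merely records what it would do.

```python
class Document:
    def __init__(self, name):
        self.name = name
        self.properties = {}

class DocumentSpace:
    """Toy document space in which some properties trigger behaviour."""

    def __init__(self):
        self.handlers = {}   # property key -> callable(document, value)
        self.log = []        # record of triggered actions, for inspection

    def on_property(self, key, handler):
        self.handlers[key] = handler

    def tag(self, document, key, value=True):
        document.properties[key] = value
        handler = self.handlers.get(key)
        if handler is not None:      # an 'active' property: run its action
            handler(document, value)

space = DocumentSpace()
space.on_property(
    "backup.frequency",
    lambda doc, value: space.log.append(f"schedule {value} backup of {doc.name}"),
)
report = Document("report.doc")
space.tag(report, "backup.frequency", "nightly")   # active: triggers the handler
space.tag(report, "published paper")               # passive: stored, no action
```

Both kinds of tag are stored in the same way; the only difference is whether something is listening for them.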

Further refinements to the Presto system were made, in the form of the Macadam project [16]. This took advantage of the underlying design of Presto - the annotated documents exposed through Presto were not themselves files, but simply pointers to a file that itself was held in a more traditional underlying data repository. This enabled the creation of a further structure - abstract documents. These documents hold no content of their own, but simply hold properties, representing structured data. The Presto system itself holds auditing information and changes, while the Macadam system holds a variety of views on the Presto records - this is reminiscent of an eprints archive, in which metadata records are held pointing towards resources that, themselves, may well be out of the control of the archivist.

'Macadam properties' could be attached to abstract documents, which themselves could be used to create category types, each one of which represented a set of possible tag values. Folksonomic tagging aficionados will recognise this as bearing some similarities to the 'tag bundle' concept, which exists on sites such as del.icio.us, and permits a collection of related tags to be gathered together under a parent category (for example, one might wish to store 'Nikon', 'Canon', 'Pentax', 'Minolta' and 'Konica' in a tag bundle labelled 'photography'). However, categories in the Macadam project were designed to be used for a number of purposes, such as:

Documents could also inherit general attributes, in order to permit access control - some tags could be set as private and others as public.

Dourish et al (2000) [16], while introducing Macadam, discuss some of the limitations of the original Presto project that were solved in the later project. They highlight the fact that the key-value pairs were untyped - that is, that the properties that users set were arbitrary - and that there was no facility for organising document property values in a hierarchy (that is, it was not possible to define one property value as a 'child' of another, which would allow for a lot of useful functionality). Finally, MIT's Haystack project took over from Presto, and developed the idea of multiple categorisable documents further.
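
One example of the functionality that a parent-child relationship between values allows is query expansion, which is also roughly what a tag bundle provides. The following sketch assumes a single level of hierarchy; the bundle contents follow the photography example above, while the item names and function names are invented.

```python
# Hypothetical bundle: parent category -> member tags.
bundles = {"photography": {"nikon", "canon", "pentax", "minolta", "konica"}}

def expand(query_tag, bundles):
    """Return the query tag plus every tag bundled beneath it."""
    return {query_tag} | bundles.get(query_tag, set())

def search(items, query_tag, bundles):
    """Find items carrying the query tag or any of its child tags."""
    wanted = expand(query_tag, bundles)
    return sorted(name for name, tags in items.items() if wanted & tags)

items = {
    "lens-review": {"nikon"},
    "holiday": {"beach"},
    "darkroom": {"photography"},
}
photography_hits = search(items, "photography", bundles)
nikon_hits = search(items, "nikon", bundles)
```

A search on the parent finds items tagged with any child, while a search on a child does not widen to its siblings.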

Due to the high cost of metadata, especially formal information such as classification, in terms of user time and effort, an absolutely central issue to many of these efforts has been the question of the extent to which metadata can be retrieved automatically. The quantity of information available is central to the user experience. Each of the systems mentioned here was designed for use on the small-to-medium scale.

Today, the issue of the metadata-rich filesystem is resurfacing; the as-yet-unreleased WinFS is designed to retrieve files based on content criteria; many common Linux/Unix filesystems permit arbitrary text attributes to be added to files (using the 'extended attributes' system); the Reiser4 filesystem provides hooks for metadata-aware plugins, and Apple's 'Spotlight' feature allows searches based on all sorts of metadata attributes. They are not the first commercial operating systems to do this - BeOS had a sophisticated model with a base set of file attributes, extended according to the type of the file, and indexed access. Unfortunately, while filesystems are growing smarter, the question of interoperable transfer of metadata between filesystems has not yet been solved, and is very likely a joy yet awaiting us.
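
On Linux, extended attributes are reachable from Python through os.setxattr and os.getxattr. The sketch below stores a list of tags in a single attribute; the attribute name 'user.tags' and the comma-separated encoding are assumptions made for this example, not an established convention. Because support depends on the operating system and on how the filesystem is mounted, the functions fall back quietly when attributes cannot be written.

```python
import os
import tempfile

TAG_ATTRIBUTE = "user.tags"   # hypothetical attribute name

def tag_file(path, tags):
    """Store tags as an extended attribute; return False if unsupported."""
    if not hasattr(os, "setxattr"):        # call is only available on Linux
        return False
    try:
        os.setxattr(path, TAG_ATTRIBUTE, ",".join(tags).encode("utf-8"))
    except OSError:                        # e.g. filesystem without xattr support
        return False
    return True

def read_tags(path):
    """Return the stored tags, or an empty list if none can be read."""
    if not hasattr(os, "getxattr"):
        return []
    try:
        raw = os.getxattr(path, TAG_ATTRIBUTE)
    except OSError:
        return []
    return [tag for tag in raw.decode("utf-8").split(",") if tag]

with tempfile.NamedTemporaryFile() as handle:
    stored = tag_file(handle.name, ["folksonomy", "ariadne"])
    tags = read_tags(handle.name)
```

Attributes written this way are easily lost when a file is copied to a filesystem or transferred by a tool that does not preserve them, which is the interoperability problem in miniature.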

Strand 2: Approaches to Classification on the Internet

Classification on the Internet has necessarily approached the problem from a different angle. A principal difficulty for the metadata-rich filesystem designer is the limitation in terms of data - as mentioned previously, producing a system that searches accurately and in a satisfactory manner in the small scale is an extraordinarily difficult problem. The Web, on the other hand, is the largest data set with which anybody could ever hope to work. That is not to say that there have been no attempts to work with limited data sets on the Web - for example, formal metadata systems such as digital repositories often expose relatively small metadata collections. On the other hand, the Amazon Web services expose massive quantities of formal metadata; these by contrast can be searched using simple keyword matching. Other formal metadata systems use contributed formal metadata, such as Yahoo! Directories, or automatically extracted or machine-generated metadata resulting from search engine techniques or data mining - this may be proofread before publication. The semantic web is a particularly interesting approach, which rather than encapsulating a particular type of metadata simply provides the tools for any form of data to be exposed.

However, it is very likely fair to say that the most commonplace Web search services, such as Google, operate according to search methods that owe very little to formal metadata; rather, Google makes use of techniques such as statistical analysis, content analysis and data retrieved from analysis of links and link text. The emergence of folksonomy tagging services on the Web continues the trend of informal metadata use. It is interesting to note that, as the trend towards more formal metadata continues in the filesystem, the trend toward informal metadata continues on the Web.

Discussion

A very large number of systems exist, on a variety of scales, that share sympathetic aims at some level - to enable description, search and retrieval of data by providing access to metadata.

Planned projects, and several currently under development, are in the process of making the metadata-rich filesystem an everyday reality. However, as the work of Ravasio et al [10] indicates, the separation of local filesystem metadata from Internet-based services may appear unnecessary to the user. The question has been asked in the del.icio.us 'delicious-discuss' mailing list, for example - would it be possible to tag files on a user's local filesystem using the del.icio.us interface? If not, could the capability be added? At first glance, the suggestion appears unlikely. Del.icio.us is a collaborative tagging system, after all; what relevance is there in users' personal information management needs? How much sense does it make to tag local files on a distributed system? However, from the user's perspective the separation between local and Internet-based content may itself appear arbitrary and unnecessary; a unified view of the data collection is ideal.

Is it possible to perform all of one's information management online? Why is it necessary to replicate essentially the same structure locally and on a server? This is principally necessary because ubiquitous Internet access is unavailable. Few mobile devices are permanently online, and indeed, significantly fewer than a fifth of British households have broadband access according to OECD statistics [17], making it impractical to assume that the centralised system can subsume the other. On the other hand, attempting to link local and Internet-based metadata sources implies the need to solve several interoperability issues. How should the two interact?

Folksonomic tagging is already usefully mixed with other approaches, such as faceted classification, much as occurred in the Placeless Documents project. Allowing customised views in categorisation is now widely understood to be desirable, but a rich metadata environment would be a logical requirement in making this possible. Yet a rich, 'heterogeneous' metadata environment is some way away. As of today, each service on the Web stands in relative isolation. To achieve a world in which information from several sources could be harvested and reused would imply a change of culture, not merely from the traditional Web to Web 2.0, but from the client-server Web to a peer-to-peer structure.

Here, provenance is measured not in the source URI of a piece of information but according to the original source (perhaps according to the digital signature upon the metadata fragment, with trust and community issues to be handled accordingly). Digital repositories would not be the single source of metadata records, but one source of formal metadata in a sea of providers. It is an improbable idea, in a world in which digital repositories make use of conflicting metadata standards, and where the policy and copyright issues surrounding metadata harvesting have not yet been solved. But as a thought experiment, it raises some interesting questions; if a user downloads, stores and annotates a record off-line, and then uploads the result into a public content database, how are those annotations handled? How should record and data object versioning, updating and deletion be handled? Today, del.icio.us does not detect duplicate or near-duplicate objects (if one object is placed at two different URIs, del.icio.us does not connect the records) - this is a direct analogue of the classic eprints problem of how metadata from one object should be related to another manifestation of the same object. Many of these issues are research topics in the digital repositories world.
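
The simplest version of the duplicate problem, the same bytes published at two URIs, can be approached by comparing content digests. The sketch below assumes that the content behind each URI has already been retrieved; the URIs and data are invented, and the approach detects exact duplicates only, not near-duplicates or different manifestations of the same work.

```python
import hashlib
from collections import defaultdict

def group_by_content(records):
    """Group URIs whose retrieved content is byte-for-byte identical."""
    groups = defaultdict(list)
    for uri, content in records:
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(uri)
    # Keep only digests seen at more than one URI.
    return [sorted(uris) for uris in groups.values() if len(uris) > 1]

records = [
    ("http://example.org/a/paper.pdf", b"same bytes"),
    ("http://mirror.example.net/paper.pdf", b"same bytes"),
    ("http://example.org/other.pdf", b"different bytes"),
]
duplicates = group_by_content(records)
```

Once two URIs are known to refer to the same object, their tag records could in principle be connected; deciding whether they should be is a policy question rather than a technical one.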

Other options exist that do not involve handing around external metadata records, the design pattern discussed here; for example, embedded metadata schemas like EXIF or ID3 store metadata directly within files. Similarly, embedded metadata on the Web such as that offered by microformats or structured blogging provide a method for storing metadata directly within XHTML pages. However, these involve their own challenges, such as synchronisation of metadata between copies of the same file.

Conclusion

The collaborative phenomenon of tagging is an interesting large-scale adaptation of something previously attempted in various contexts, and it currently offers a (rapidly expanding) subset of the functionality potentially available from a system like Placeless Documents. Functionality such as that offered by active tagging is becoming available in certain situations, such as geotagging, in which tags carry geographical information that allows a resource to be represented accordingly (eg, placed on a map, sorted according to location, etc). Dourish et al recognised explicitly that the more tags available for a document, the better the Placeless Documents system functions (as a result of the 'long tail' effect [18]) and that there is no realistic likelihood that a single user or group of users will produce a sufficiently large number of tags, thus the need for automated tagging. So the technique has succeeded on the larger scale where it would not on the smaller.

On the other hand, the tags referring to a given resource may be useful as descriptive metadata on the small scale, so there is a reason to retain that metadata on transferring an object to a new filesystem. For example, a file downloaded from the Web might be described with a certain metadata set on one service, and tagged with a number of terms on another. Retaining this information on saving the file locally would lower the cost of collecting local metadata, without requiring the use of automatic metadata extraction tools as the sole solution.

In short, what has until recently been largely treated as a number of dissimilar problems is now undergoing a process of convergence, to the point where some attention will have to be paid to the issues - if only because required functionality is now appearing in commercial tools and operating systems. Effective strategies may combine formal and informal, objective and interpretive metadata from a variety of sources. However, local filesystems and Internet-based indexers are dissimilar in context and identical approaches will not necessarily work across both contexts.

References

  1. Guy, M., "Integration and Impact: The JISC Annual Conference", July 2005, Ariadne Issue 44
    http://www.ariadne.ac.uk/issue44/jisc-conf-rpt/
  2. Terdiman, D., 2005. Folksonomies Tap People Power. Retrieved 20/04/2006 from
    http://www.wired.com/news/technology/0,1282,66456,00.html
  3. Hammond, T., Hannay, T., Lund, B., Scott, J. (2005). Social Bookmarking Tools: A General Review, D-Lib Magazine, April 2005, Volume 11 Number 4. <doi:10.1045/april2005-hammond>.
  4. Rashmi Sinha (2005) A cognitive analysis of tagging, retrieved 20/04/2006 from
    http://www.rashmisinha.com/archives/05_09/tagging-cognitive.html
  5. Guy, M., and Tonkin, E., Folksonomies - Tidying up tags? D-lib Magazine, January 2006.
    http://www.dlib.org/dlib/january06/guy/01guy.html
  6. Tanenbaum, A. S. & Woodhull, A. S., 1997. Operating Systems: Design and Implementation (Second Edition), Prentice Hall, ISBN 0136386776
  7. Gerson, E. M., and Star, S. L. (1986). Analyzing Due Process in the Workplace. ACM Transactions on Office Information Systems, vol. 4, no. 3, July. Pages 257-270.
  8. Lansdale, M., The psychology of personal information management, Applied Ergonomics, 19(1), 55-66, 1988.
  9. Kljun, M., and Carr, D. (2004). Piles of Thumbnails - Visualizing Document Management. Proceedings of the 27th International Conference on Information Technology Interfaces (ITI2005), Cavtat, Croatia, 20-23 June 2004.
  10. Ravasio, P., Schär, S. G., Krueger, H. (2004): In pursuit of desktop evolution: User problems and practices with modern desktop systems. ACM Trans. Comput.-Hum. Interact. 11(2): 156-180
  11. Barreau, DK and Nardi, B. (1995). Finding and reminding: File organization from the desktop. ACM SIGCHI Bulletin, 27 (3), 39-43.
  12. Fertig, S., Freeman, E. and Gelernter, D. (1996). "Finding and reminding" reconsidered. ACM SIGCHI Bulletin, 28 (1), 66-69.
  13. David K. Gifford, Pierre Jouvelot, Mark A. Sheldon, James W. O'Toole, Jr., Semantic file systems, Proceedings of the thirteenth ACM symposium on Operating systems principles, p.16-25, October 13-16, 1991, Pacific Grove, California, United States
  14. Rose, D.E.; Mander, R.; Oren, T.; Ponceleon, D.B.; Salomon, G. & Wong, Y.Y. 1993. "Content Awareness in a File System Interface: Implementing the 'Pile' Metaphor for Organizing Information", 16th Ann. Intl SIGIR'93, ACM, pp. 260-269.
  15. Dourish, P.; Edwards, W. K.; LaMarca, A.; Salisbury, M. Presto: an experimental architecture for fluid interactive document spaces. ACM Transactions on Computer-Human Interaction. 1999 June; 6 (2):133-161.
  16. Dourish, P., Edwards, W. K., LaMarca, A., Lamping, J., Petersen, K., Salisbury, M., Terry, D. B., & Thornton, J. (2000). Extending Document Management Systems with User-Specific Active Properties. ACM Transactions on Information Systems, 18(2), 140-170.
  17. OECD, 2005. OECD Broadband Statistics, June 2005. Retrieved May 2006 from
    http://www.oecd.org/document/16/0,2340,en_2649_34225_35526608_1_1_1_1,00.html
  18. Wired 12:10: The Long Tail
    http://www.wired.com/wired/archive/12.10/tail.html

Author Details

Emma Tonkin
Interoperability Focus Officer UKOLN

Email: e.tonkin@ukoln.ac.uk
Web site: http://www.ukoln.ac.uk/ukoln/staff/
