Web Magazine for Information Professionals

Net Gains for Digital Researchers

Amy Friedlander, the editor of D-Lib, looks at, and towards, some of the benefits that the Web and digital technology bring to how we do and present research.

Predicting the future is a risky business. On the one hand, the current instantiation of the Internet and the World Wide Web interfaces will one day become obsolete -- perhaps sooner than we think. On the other hand, some configuration of networked digital information technologies is here to stay. Moreover, many of the tools and behaviors that arise to tap the web's potential will migrate as the underlying technologies evolve. Thus, the Internet is far more than a set of data transfer protocols operating over a series of leased lines, packet switches, and servers. It is also an opportunity to consider what happens when a lot of people of different skills and expectations begin to play with advanced information and communications capabilities. Given the inevitability of technological change coupled with the persistence of human behavior, the question resolves into one of aesthetics: what do the digital technologies let us do that is unique to this networked environment, that we cannot do -- or do well -- elsewhere?

Leaving aside the implications of e-mail for rapid and accurate interpersonal and group communications, I see at least four technology-independent characteristics of the web that affect the way we do and present research: hyperlinks within and across documents and collections; the malleability and de-contextualization of digital data, which enable users to store, subdivide, and combine information within the same digital medium; support for multimedia, including text, sound, and still and moving images; and interactivity, which has a number of dimensions, of which surfing, searching, and filtering are the most widely appreciated. Rarely do any of these features occur in isolation. In combination, they have the paradoxical potential of empowering the individual to identify data and describe results while enabling collective activities, such as building very large data sets, sharing and re-using data, and accessing information too large and complex to be maintained by any single organization.

Other observers of the impact of new information technologies on scholarly research have commented on the significance of hyperlinks, multimedia, and interactivity. For example, in American Archivist (Spring 1992), Avra Michelson and Jeff Rothenberg argued that connectivity and end-user computing were the two major trends in scholarly communication, and that the net effect was to enhance the "autonomy of the researcher" (p.244). Enabling the end-user is implicit in notions of interactivity, which together with hyperlinks and multimedia are among the "novel features" that Steve Hitchcock and his colleagues at the Open Journal project identified in on-line scholarly publishing in science, technology, and medicine. Biologists, the authors observe, found this environment appropriate to the visual depictions of their data and built databases that were conducive to sharing information and stimulating continued research. Indeed, Nature, as well as other well-respected publications, now requires authors of articles containing genetic sequence data to submit these data directly to one of the recognized genetic databanks (e.g., GenBank) as a prerequisite of publication, to ensure that the data are in the public domain.

Positive feedback relationships among display of information, storage technologies, and the ways in which researchers understand, collect, and work with data extend to other subject domains. Indeed, the central assumption of digital libraries -- the subject of the on-line magazine I edit -- is the existence of collections of digital information linked by communications networks that enable access by individual researchers anywhere and at any time. In addition to preprint archives in science and mathematics and such well-known biomedical projects as the Human Genome and the Human Brain, which integrate advanced computing and information technologies to support future investigations and applications, collections have been constructed around rare texts and artifacts, notably Beowulf, the Perseus project, and Thesaurus Florentinus.

In the Thesaurus Florentinus, digital imaging and storage technologies have enabled the creation of a series of images of restoration work at the Santa Maria del Fiore in Florence that themselves constitute a re-usable information resource. Thus, the "raw material" of research results can coalesce into collections that possess underlying digital consistency. Similar collections of digital information in history and the social sciences are maintained by the Economic and Social Research Council Data Archive at the University of Essex (ESRC) and the Inter-university Consortium for Political and Social Research [ICPSR]. These centers accept submissions, update the formats, and offer users search capabilities. Courtesy of the digital technologies and the network, then, the process of building the collection has at least partially devolved to the researchers themselves, who are invited to submit data or reports in a common format; the notion of a collection is enlarged, and access to uniform data is expanded. Although the autonomy of the user is enhanced, as Michelson and Rothenberg argue, so, too, are the opportunities to collaborate, not merely through the established process of papers and conferences but also in the nuts-and-bolts work that supports the research. In this sense, though, the network reinforces and offers coherence to, but does not necessarily create, the community of scholars itself, which coalesces around common interests. On the other hand, the very ease of discovering and sharing information on the net means that researchers may find common interests where they had earlier believed there were none.

Part of the ease of using the web stems from the visual, point-and-click technology, perhaps the best example of how visualization informs and alters the way we work -- an observation about end-user computing that Michelson and Rothenberg also made. At the U.S. National Institutes of Health (NIH), researchers employ digital imaging technologies combined with greater computing power to create and capture new information on brain function and pathologies. Particularly dramatic is the use of video, so that the representation of brain function can be displayed dynamically in real time. Potentially anyone with access to the data and the necessary equipment can re-use the verifiably identical information. Of course, continued observation remains necessary to control for distortions in the initial observations as well as to advance the knowledge base. But re-visiting the original sources -- whether in history, music, or biology -- reduces or eliminates a level of ambiguity. Moreover, research on visualization may offer new tools for examining conceptual relationships while traversing them seemingly effortlessly (see, for example, the summary of related work at Xerox PARC in the June issue of D-Lib Magazine).

In addition to capturing, re-using, and visualizing information, the digital technologies also enable integration of multiple types of information. For example, compilation and registration of information captured in three different technologies and hundreds of digital images enabled the creation of the Visible Human, an on-line biomedical resource for human anatomy. Roy Williams has made a similar case for user-defined hypermapping by extending the idea of Geographic Information Systems (GIS) with multiple data sets in a recent issue of Caltech's Engineering & Science. He envisions a distributed information environment that supports constructing self-defined documents on the fly from complex sources of information too large and too infrequently used to be housed by a single user or institution. And in the June issue of D-Lib Magazine, David Fenske and Jon Dunn argue that digital representations of music can enable musicologists to move from score- or text-based analyses toward explorations based on the digital recordings themselves.

In three of these examples -- genetics, medicine, and cartography -- there exists an agreed-upon structure that underlies the organization of the digital data: the notion of genetic code, the basic human anatomy, and earth mapping systems. Conventional text offers a harder problem, because its form is defined by the writer, witness e.e. cummings. Of course, most text is not poetry. In print on paper, we have used organization, layout, and typesetting as well as text to convey meaning. The Standard Generalized Mark-up Language (SGML), with its various applications to historic documents and text storage and retrieval, focuses our attention on structure and provides a standard vocabulary and representation for a field that was formerly dominated by paleographers and literary critics. We can begin, then, to talk about consistencies across material as diverse as rare manuscripts and professional engineering journals.
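The point about structure can be made concrete. In an SGML-style encoding, the roles that print conveys through layout and typesetting are spelled out as explicit tags that software can act upon uniformly. The sketch below is illustrative only: the element names (poem, title, stanza, line) are invented for the example rather than drawn from any real DTD, and Python's lenient HTML parser stands in for a true SGML parser.

```python
from html.parser import HTMLParser

# An SGML-style fragment: structural roles (title, stanza, line) are made
# explicit rather than implied by layout. Element names are illustrative.
document = """
<poem>
  <title>Sonnet 18</title>
  <stanza>
    <line>Shall I compare thee to a summer's day?</line>
    <line>Thou art more lovely and more temperate:</line>
  </stanza>
</poem>
"""

class LineExtractor(HTMLParser):
    """Collect the text content of every <line> element."""
    def __init__(self):
        super().__init__()
        self.in_line = False
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "line":
            self.in_line = True

    def handle_endtag(self, tag):
        if tag == "line":
            self.in_line = False

    def handle_data(self, data):
        # Ignore the whitespace between tags; keep text inside <line> only.
        if self.in_line and data.strip():
            self.lines.append(data.strip())

parser = LineExtractor()
parser.feed(document)
print(parser.lines)
```

Because the structure is explicit in the markup rather than implied by the page, the same extraction would work unchanged whether the encoded source were a rare manuscript or an engineering journal -- the kind of consistency across diverse material described above.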

The web feature that we most associate with making connections within and across documents and other objects is hyperlinking. Links have let us do some wonderful things. First, we can extend information in the way that footnotes, marginal commentary, appendices, and bibliographies have traditionally extended and annotated text and images. Consider, for example, Project Bartleby at Columbia University, which offers versions of English and American literary works in which the annotations can be invoked on demand. Second, we can assemble collections of related material from multiple locations either permanently or as needed, witness WebMuseum or any of the information clearinghouses. Finally, in crafted presentations, we can be emancipated from some of the linear constraints of text, hence the proliferation of home pages with imagemaps, tables of contents, keys, and outlines as devices for helping users navigate the information contained within a conceptual space.

In its current form, hyperlinking is, nonetheless, limited. As Howard Webber cautioned D-Lib Magazine, "for sustained knowledge work, casual hyperlinking is like having a conversation interrupted every minute or so by someone who wants to talk about something slightly different." Thus, hyperlinking contributes to the de-contextualization of information. This de-contextualizing occurs at many levels. In the physical sense, representations of rare items, like Beowulf, can be de-coupled from the physical artifact, affording perhaps a higher level of scrutiny as well as broader access. VRML technologies let users view complex structures from vantage points not otherwise feasible, just as the Visible Human can allow students to practice virtual surgery, and wavelets permit progressive resolution of extremely large images potentially offering users access to relevant portions or details. Moreover, we talk about searching by "keywords", or strings of characters, over swaths of text in the same way that segments of genetic code can be searched in GenBank. Still, writers, editors, and librarians worry a lot about these broad searches, particularly in domains in which the language is used evocatively or in which the underlying concepts have evolved.

As an informal experiment, I ran the word "cancer" over the Library of Congress' on-line historical collections. The first document retrieved was a transcript of an oral history in which the informant mentioned his brush with the disease. The second item was a newspaper's reporting of Booker T. Washington's 1898 Jubilee Thanksgiving address in which he "likened the effect of race discrimination, especially in the Southern States [of the U.S.A.], to a cancer gnawing at the heart of the republic." Experienced users automatically resolve the semantic ambiguity. Naive users, confronted with less obvious choices in usage, need help to discriminate useful from misleading information.
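The ambiguity in that informal experiment is easy to reproduce: a literal string match has no way to distinguish the medical sense of a word from its figurative use. A minimal sketch, in which the document snippets are invented stand-ins for the kinds of items the Library of Congress search returned:

```python
# Naive keyword search: retrieve every document containing the query string.
# String matching alone cannot tell the medical sense of "cancer" from the
# figurative one; both documents below match equally well.
documents = {
    "oral-history": "The informant recalled his brush with cancer in later life.",
    "jubilee-address": "Race discrimination is a cancer gnawing at the heart "
                       "of the republic.",
    "unrelated": "The harvest that year was unusually good.",
}

def keyword_search(query, docs):
    """Return the ids of all documents whose text contains the query string."""
    query = query.lower()
    return sorted(doc_id for doc_id, text in docs.items()
                  if query in text.lower())

print(keyword_search("cancer", documents))
# Both the literal and the figurative use are retrieved; discriminating
# between them requires knowledge that the raw character string does not carry.
```

An experienced reader resolves the two senses at a glance; the search itself, like any search over "strings of characters," cannot.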

We immigrants from print are hardly alone in our concern for the integrity of the information. Indeed, there is considerable research into issues of semantic interoperability and information retrieval, which addresses one aspect of disembodied text. There remains the question of how the artifact adds to the meaning of the text. Some of this relationship can be captured by SGML. But more is conveyed by the physical item itself. Consider the preceding example: holding the yellowing scrap of newsprint instantly tells the user that a reporter intervened between Washington and posterity; holding notes in Washington's hand tells the user what he intended to say.

Solutions to issues of data capture and re-use that preserve the context but do not constrain de-construction and analysis are likely to arise in many forms, from user interfaces and visualization tools to storage and retrieval systems. For example, JSTOR provides bitmapped images of pages as well as ASCII text, taking advantage of efficiencies of digital storage and searching while preserving the traditional benefits of layout. Other efforts such as Stanford's ComMentor and Berkeley's multivalent document model are exploring the notion of a document as well as the creation of web annotation and collaborative tools that promise to sustain interactions with the information among researchers and over time, in the tradition of Ted Nelson's original vision of transclusion, "or reuse with original context available, through embedded shared instancing" (Communications of the ACM, August 1995, p. 32). To borrow Webber's words, such tools will "allow multithreaded information to weave a highly personal fabric of specific meaning for individual users," which they may share or hold private as the situation demands.

Amy Friedlander
Editor, D-Lib Magazine

Copyright © 1996 Corporation for National Research Initiatives