The Resource Discovery Project
Although much valuable information is available in distributed
information systems like the World Wide Web, it is not always easy to
find. As Renato Iannella shows, the Resource Discovery Project
at the DSTC in Australia is
investigating issues to do with information discovery and access
across large heterogeneous networks. This article appears in the Web version only of Ariadne.
Resource Discovery at DSTC
The Resource Discovery Project is one of the major research units of
the Distributed Systems Technology
Centre (DSTC). The DSTC is one of over 60 co-operative research
centres in Australia and is a Federally and commercially funded
non-profit company. The DSTC has over 25 participating organisations
which provide resources to the research program, including direct
funding, seconded staff, hardware and software, and, importantly,
research problems. The Resource Discovery Project was established in
mid-1994 after the emerging problem of information discovery on large
networks was identified as a crucial research area for Australian data
networks.
The goal of the Resource Discovery Project is to investigate issues
related to locating, retrieving, and promulgating information in large
networked environments. The Internet and WWW provide a challenging
environment for deployment of these services. The needs of
information publishers - to maximise audience reach - and the user -
to minimise information overload - require advanced technical
solutions and investigative research.
The Resource Discovery Project framework assumes the three-layer
architecture shown in Figure 1. The framework
contains the following entities:
- Content Seekers - what technology do users need to
find the information they are seeking? What technologies do
users need for retrieving and managing resources? Users
require software clients with intuitive user interfaces to
facilitate and manage simple and complex information retrieval
tasks.
- Discovery Services - what mechanisms do you need to
store, propagate, and manage the resource descriptions used by
the users to select and retrieve resources? Discovery services
act as the intermediary between the users and the information
providers - matching the needs of the two - to provide a
unified view of information repositories.
- Content Providers - how do you describe your
resources? Various techniques can be used to provide
descriptions of resources (metadata) and to support the
management of the organisation's information publishing
strategies. These include tools which automate metadata
extraction and support access constraints.
Figure 1: Resource Discovery Project Framework
Key Ideas and Goals of the Research
The Resource Discovery Project is aiming to provide timely solutions
to some of today's problems as well as a fundamental and applied
research vision. The following outlines the project's goals, key ideas,
and research rationale:
- Goal: Allow organisations to effectively disseminate and
promote networked information both internally and externally.
- Key Idea: Accurate resource descriptions and robust
naming systems are essential for finding resources.
- Research plan: Investigate techniques for
describing and identifying resources. Investigate
technologies for producing and disseminating resource
descriptions and identifiers.
- Goal: Improve information discovery and access across
heterogeneous information sources.
- Key Idea: Information and services will be made
available in various formats and protocols and no single
standard will ever evolve.
- Research plan: Use middleware discovery services to
translate between the various formats and standards.
- Goal: Provide scalable and manageable solutions for networked
information promotion and discovery.
- Key Idea: Large information spaces are often
distributed, heterogeneous, dynamic, and expanding.
- Research plan: Create scalable discovery services
using distributed middleware techniques.
- Goal: Improve user access to resources through better
extraction of information needs and more effective information
filtering.
- Key Idea: Users have difficulty expressing their
information needs and coping with the large amounts of
information which may meet their needs.
- Research plan: Investigate techniques for extracting
and applying user information needs and for filtering
information.
Technologies and Prototypes
The Resource Discovery Project has investigated a number of the above issues and has
produced prototypes to demonstrate various solutions. These are described below.
Naming
Naming is an area receiving increasing attention as it is fundamental to
the capabilities of many systems in distributed networking. The rapid
growth of the World-Wide Web has seen the Uniform Resource Locator
(URL) scheme being used as the de facto naming system for the
Internet. The next generation of naming will however provide more
flexibility - with the development of Uniform Resource Names (URN).
The Resource Discovery Project has developed a URN resolver [1] that is based on the standard HTTP protocol and
supports several resolution media types (text, HTML, or SGML). The
major problem faced is not in the technologies, but in the deployment,
utility and ease of use of tools for creating and managing URN
systems. URLs are easy to create, but pose a legacy problem. URNs are
more complex, but will be more persistent and stable.
Effective and flexible schemes for URNs will be paramount for their
early deployment.
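The idea of resolving a persistent name to its current locations can be illustrated with a minimal sketch. It is not the resolver described in [1]: the `urn:x-example:` namespace, the in-memory `RESOLUTION_TABLE`, and the function `resolve_urn` are all hypothetical, and only the text and HTML result forms are shown.

```python
# Hypothetical in-memory resolution table: one persistent name, many locations.
RESOLUTION_TABLE = {
    "urn:x-example:reports:1996-01": [
        "http://mirror-a.example.org/reports/1996-01.html",
        "http://mirror-b.example.org/reports/1996-01.html",
    ],
}


def resolve_urn(urn, media_type="text"):
    """Return the known locations for a URN, rendered in the requested form."""
    locations = RESOLUTION_TABLE.get(urn)
    if locations is None:
        raise KeyError(f"no resolution data for {urn}")
    if media_type == "text":
        # One URL per line.
        return "\n".join(locations)
    if media_type == "html":
        # A simple list of links.
        items = "".join(f'<li><a href="{u}">{u}</a></li>' for u in locations)
        return f"<ul>{items}</ul>"
    raise ValueError(f"unsupported media type: {media_type}")


print(resolve_urn("urn:x-example:reports:1996-01"))
```

When a resource moves, only the table entry changes; documents that cite the URN are unaffected, which is the persistence property URNs are meant to provide.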
Metadata
Metadata is a significant area of research in the Resource Discovery
Project [2] and there are a number of aspects to
this research.
The first research issue which must be addressed is what set of
information is to be captured by the metadata. This depends on the
type of the resource and on the purpose of the metadata. A metadata
scheme must be sufficiently flexible to capture useful information
about a wide variety of resources for a range of purposes. Ideally, a
single metadata scheme should be used as this minimises the cost of
using metadata. Unfortunately, it is unlikely that there will ever be
agreement on a single metadata scheme and so a major aspect of
metadata research is the relationship between different metadata
schemes and the trade-off between the size and utility of the metadata
element set.
The second research issue is related to the production of
metadata. Metadata is essentially a summary of the data produced by
various levels of abstraction. Using humans to generate these
summaries is expensive and metadata systems attempt to reduce this
cost by automating as much of the process as possible.
The final research issue of metadata concerns how the metadata is
accessed and used. It must be retrieved in a form which can be
processed with its semantics preserved. An important use of metadata
is as a mechanism for resource location in distributed networks like
the Internet. Metadata can provide information for the user to
identify which resources they might be interested in. Once a resource
has been identified, metadata provides the information to allow the
resource to be accessed.
The Resource Discovery Project is working on the following metadata-related projects:
- Indexing Dublin Core (using Harvest) and other metadata embedded in HTML files and providing a
search interface to this database.
- Using Dublin Core for describing resources for URN resolution services.
- Using Dublin Core to map search results from Web servers (HTTP) and library catalogues (Z39.50)
into a meta-search engine.
- Using Dublin Core and GILS metadata in X.500 Directories to support the Warwick Framework
and the Trading services defined by the Open Distributed Processing group.
- Extending PICS to support Dublin Core and other text-based metadata.
- Extending the Persistent URL (PURL) system to support Dublin Core metadata.
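As an illustration of the first item, the sketch below pulls Dublin Core elements out of the META tags of an HTML page using only the Python standard library. It assumes the elements are embedded as `<meta name="DC.element" content="...">`, which is one common convention and not necessarily the one used by the project's Harvest-based indexer; the class, function, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect <meta name="DC.*" content="..."> elements from an HTML page."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            # Dublin Core elements are repeatable, so keep a list per element.
            self.elements.setdefault(name[3:].lower(), []).append(content)


def extract_dublin_core(html_text):
    """Return a dict mapping element name to the list of values found."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.elements


# Hypothetical sample page for demonstration.
SAMPLE_PAGE = """<html><head>
<title>Example report</title>
<meta name="DC.title" content="Example report">
<meta name="DC.creator" content="A. Author">
<meta name="DC.creator" content="B. Author">
<meta name="keywords" content="ignored">
</head><body></body></html>"""

record = extract_dublin_core(SAMPLE_PAGE)
print(record)
```

The resulting dictionary is the kind of record an indexer could store and expose through a search interface.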
The Resource Discovery Project is also involved in the fundamental
area of ontologies. An ontology is a set of standard concepts, terms
and definitions which are agreed upon within a particular
community. People often use different terms to refer to the same
concept and they use the same term to refer to different concepts. It
is very difficult to find resources if they have been described using
inconsistent terms. We have been involved in implementing a system
which suggests appropriate terms to users when classifying documents
and allows users to refine this classification and evolve the ontology.
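A toy version of such term suggestion is sketched below. It assumes the ontology is just a table mapping known wordings to a preferred term and that matching is plain substring search; the project's actual system is not described here, and the class `Ontology` and its methods are hypothetical.

```python
class Ontology:
    """A tiny controlled vocabulary: many wordings map to one preferred term."""

    def __init__(self):
        self._preferred = {}  # known wording (lower-case) -> preferred term

    def add_term(self, preferred, synonyms=()):
        """Register a preferred term and wordings that should map to it."""
        for term in (preferred, *synonyms):
            self._preferred[term.lower()] = preferred

    def suggest(self, text):
        """Return the preferred terms whose known wordings occur in the text."""
        lowered = text.lower()
        found = {pref for term, pref in self._preferred.items() if term in lowered}
        return sorted(found)


onto = Ontology()
onto.add_term("World Wide Web", ["WWW", "the Web"])
print(onto.suggest("Finding things on the WWW"))

# The vocabulary evolves as users add wordings they actually use.
onto.add_term("Metadata", ["resource description"])
print(onto.suggest("Resource descriptions for the Web"))
```

Substring matching is deliberately crude; it is enough to show how inconsistent wordings can be steered towards one agreed term.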
Z39.50
The Resource Discovery Project recognised that large information
providers needed a standard and flexible information retrieval
protocol, and that Z39.50 is the leading standard. The US Government
Information Locator Service (GILS) has also mandated the use of Z39.50
as its primary protocol for agencies.
The main focus of the research work has been on flexible
mechanisms for providing Z39.50 access to various database formats (for
example, the Harvest database). We also built a Z39.50 to X.500
Gateway system [3] to demonstrate the utility of
Z39.50 and its ability to access Directory Services within the GILS
environment.
The Web and Z39.50 have become a popular combination as libraries and
other information providers adopt the Z39.50 information retrieval
standard for accessing their on-line catalogues. We have compared and
reviewed many of the leading clients and Web gateways [4].
The future of Z39.50 is unclear. On the one hand, developers
need a standard and flexible information retrieval protocol; on the
other, they want an easy and lightweight solution to the
problem. The Z39.50 community is attempting to solve this by defining
a light version of Z39.50. The Resource Discovery Project is also
investigating the requirements of the Internet community for a
lightweight information retrieval protocol as an alternative to Z39.50
[6].
Browsing
The Web is large, too large in fact for normal users to cope with the
amount of information shown using normal information presentation
methods. The Resource Discovery Project has been experimenting with an
information presentation tool called the HyperIndex Browser (HIB)
which helps users define queries and navigate large information
spaces. Users may not know the exact query terms to use when
searching for information. The HIB lets users enter general,
encompassing terms as queries. It then extrapolates the "information
space" around that search term and presents the user with a number of
related topics to select from: refinements which narrow the focus of
the search, and enlargements which broaden the focus. For example, in
response to the initial search term "internet security", the HIB
might present you with the suggestions shown in Figure 2
below.
Figure 2: HyperIndex Browser
Currently the HyperIndex Browser assists the user in coping with
information overload and query construction. The next stages of the
research include using the path navigated through the HIB to build a
profile of that user's interests. This profile can then be used to
guide future resource discovery activities.
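The notions of refinement and enlargement can be made concrete with a small sketch. How the HIB actually derives its suggestions is not described in this article; the code below substitutes simple word-set containment over a hypothetical list of index phrases, and the function name and sample phrases are assumptions.

```python
def hyperindex_neighbours(query, index_phrases):
    """Split index phrases into refinements and enlargements of the query.

    A refinement contains every query word plus at least one more;
    an enlargement uses only a proper subset of the query words.
    """
    query_words = set(query.lower().split())
    refinements, enlargements = [], []
    for phrase in index_phrases:
        words = set(phrase.lower().split())
        if query_words < words:        # phrase adds words: narrower focus
            refinements.append(phrase)
        elif words < query_words:      # phrase drops words: broader focus
            enlargements.append(phrase)
    return sorted(refinements), sorted(enlargements)


# Hypothetical index phrases for demonstration.
PHRASES = [
    "internet",
    "security",
    "internet security",
    "internet security protocols",
    "internet security firewalls",
    "network management",
]

narrower, broader = hyperindex_neighbours("internet security", PHRASES)
print("Refinements:", narrower)
print("Enlargements:", broader)
```

Selecting one of the returned phrases as the next query moves the user one step through the information space, and the sequence of selections is the navigation path that a user profile could be built from.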
Heterogeneous Meta-Searching
Searching multiple information servers is obviously a significant area
of work for Resource Discovery systems. The Resource Discovery
Project has developed such a meta-searcher called HotOIL [5]. HotOIL assumes that resources are stored in many
different types of information repositories (such as an enterprise's
databases and a public Internet catalogue) and that these repositories
are distributed widely over computer networks. To find information
without HotOIL you would need to interact with each of these
information repositories individually. HotOIL performs these
interactions for you. Given your query:
1. HotOIL consults a directory of information repositories and decides which are most likely to
contain information relevant to your query.
2. For each repository chosen in step 1, HotOIL:
   - translates your request into a query for this repository
   - sends the query to the repository
   - retrieves the results
   - moulds the results into a common internal format
3. HotOIL merges the results returned from each repository and then displays a summary.
HotOIL can currently access both HTTP services and Z39.50 servers. The
HyperIndex Browser is used as the front end to HotOIL and effectively
gives the feeling of seamless access to a single database. Internally,
HotOIL uses URNs and metadata to describe the search engines that it
accesses. It also uses the Dublin Core metadata set to describe the
resources returned from each search engine.
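The select, translate, query, normalise, and merge sequence described above can be sketched as follows. This is an illustration under stated assumptions, not HotOIL's implementation: the `Repository` type, the topic-overlap selection rule, the two-field common record, and the in-memory stand-ins for an HTTP search engine and a Z39.50 catalogue are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Repository:
    name: str
    topics: set            # subjects this repository is known to cover
    translate: Callable    # list of query words -> repository-specific query
    search: Callable       # repository-specific query -> raw results
    normalise: Callable    # raw result -> common record (dict)


def meta_search(query, directory):
    words = set(query.lower().split())
    # Select repositories whose advertised topics overlap the query.
    chosen = [repo for repo in directory if repo.topics & words]
    merged = {}
    for repo in chosen:
        native_query = repo.translate(sorted(words))  # translate the request
        raw_results = repo.search(native_query)       # send it, collect results
        for raw in raw_results:
            record = repo.normalise(raw)              # common internal format
            # Merge: keep one record per resource identifier.
            merged.setdefault(record["identifier"], record)
    return list(merged.values())


# In-memory stand-ins for a Web search engine and a library catalogue.
web_engine = Repository(
    name="web",
    topics={"metadata", "urn"},
    translate=lambda words: "+".join(words),
    search=lambda q: [("http://a.example.org/dc", "Dublin Core notes")]
    if "metadata" in q else [],
    normalise=lambda raw: {"identifier": raw[0], "title": raw[1]},
)
catalogue = Repository(
    name="catalogue",
    topics={"metadata", "cataloguing"},
    translate=lambda words: " and ".join(words),
    search=lambda q: [
        {"url": "http://a.example.org/dc", "ti": "Dublin Core notes"},
        {"url": "http://b.example.org/marc", "ti": "MARC primer"},
    ],
    normalise=lambda raw: {"identifier": raw["url"], "title": raw["ti"]},
)

results = meta_search("metadata standards", [web_engine, catalogue])
for record in results:
    print(record["identifier"], "-", record["title"])
```

Keeping the per-repository behaviour behind three small functions means a new protocol can be added without touching the merging logic, which is the role the article assigns to middleware.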
The experience with HotOIL highlights the growing perception in the
Internet community of the need for a standard interface to queriable
networked information sources [6].
Proxy Search Engine
The fact that communities of common interest naturally occur within
organisations can be used to enhance resource discovery. The Resource
Discovery Project has developed the What'sHot system [7] which is based on the observation that people in
an organisation share common interests. If someone in an organisation
requests information about a particular subject then it is likely that
other people in the organisation have previously requested information
about that subject, or will in the future.
The What'sHot proxy search engine intercepts a user's request for
information on a subject and directly responds with URLs which are
both about that subject and are also popular within the
organisation. If the request cannot be matched by the local proxy
search engine then the request is passed to proxy search engines at
other sites, which respond with URLs if there is a match. The request is
eventually passed to conventional search engines if no proxy search
engine in the system can provide a match.
The innovation in What'sHot lies in:
- the way the proxy search engines recommend popular resources to each
other, and
- the way the popularity of an individual resource is measured.
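The lookup order described above (local proxy, then proxies at other sites, then a conventional search engine) can be sketched as below. The article does not say how popularity is measured or how proxies exchange recommendations, so this sketch simply assumes popularity is a count of past accesses per subject; the class, its methods, and the example URLs are hypothetical.

```python
from collections import Counter, defaultdict


class ProxySearchEngine:
    """Answers subject queries from locally observed, popular URLs."""

    def __init__(self, name, peers=()):
        self.name = name
        self.peers = list(peers)
        self._hits = defaultdict(Counter)  # subject -> Counter of URL accesses

    def record_access(self, subject, url):
        """Note that someone in the organisation used this URL for the subject."""
        self._hits[subject.lower()][url] += 1

    def local_lookup(self, subject, limit=3):
        """Return the most accessed URLs for the subject, most popular first."""
        counts = self._hits.get(subject.lower())
        return [url for url, _ in counts.most_common(limit)] if counts else []

    def search(self, subject, fallback):
        """Try the local proxy, then peer proxies, then a conventional engine."""
        urls = self.local_lookup(subject)
        if urls:
            return self.name, urls
        for peer in self.peers:
            urls = peer.local_lookup(subject)
            if urls:
                return peer.name, urls
        return "conventional", fallback(subject)


brisbane = ProxySearchEngine("brisbane")
brisbane.record_access("Z39.50", "http://x.example.org/z")
brisbane.record_access("Z39.50", "http://x.example.org/z")
brisbane.record_access("Z39.50", "http://y.example.org/z")
sydney = ProxySearchEngine("sydney", peers=[brisbane])

print(sydney.search("z39.50", lambda subject: []))
print(sydney.search("urn", lambda s: ["http://search.example.org/?q=" + s]))
```

The first query is answered by the peer proxy because the local one has no matching history; the second falls through to the conventional engine.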
Periodic Discovery
The information available on the Internet is highly dynamic. New
resources are continually being published, and existing resources
change often. An unsuccessful search conducted today might yield
useful results tomorrow. To address this problem, the Resource
Discovery Project has developed the Orion prototype [8], which is a tool for locating new resources as
they become available. Orion can be thought of as an agent which
periodically performs searches. The user is informed of new resources
which may interest them. See Figure 3 below for an example of the
Orion search interface.
Figure 3: Orion
The technologies underlying Orion are the more interesting aspects of
the prototype. Like HotOIL, Orion uses URNs and metadata to
internally describe the external Web sites that it accesses.
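The core of periodic discovery, re-running a saved query and reporting only what is new, can be sketched in a few lines. This is not Orion's implementation: the class `PeriodicSearch`, the pluggable `search` callable, and the example URLs are hypothetical, and the scheduling that would trigger each run is left out.

```python
class PeriodicSearch:
    """Re-runs a saved query and reports only resources not seen before."""

    def __init__(self, query, search):
        self.query = query
        self._search = search   # callable: query -> iterable of resource identifiers
        self._seen = set()

    def run_once(self):
        """Run the saved query and return the newly discovered resources."""
        # dict.fromkeys removes duplicates while keeping the result order.
        current = list(dict.fromkeys(self._search(self.query)))
        new = [resource for resource in current if resource not in self._seen]
        self._seen.update(current)
        return new


# A stand-in index that grows between runs.
index = ["http://a.example.org/"]
agent = PeriodicSearch("resource discovery", lambda query: index)

first = agent.run_once()
index.append("http://b.example.org/")
second = agent.run_once()
print(first)
print(second)
```

In a deployed agent each call to `run_once` would be made on a schedule, and the returned list would drive the notification sent to the user.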
Future Direction: Government Information Demonstrator
The prototypes and technologies developed as part of the Resource
Discovery Project have now reached a level of maturity where they can
be deployed in a real environment: Government information discovery
and information promulgation. The Government provides a rich source
of both information and scalability problems that will further
challenge the Project.
The permanence of the government requires an architectural framework
which is both scalable and durable [9]. Figure 4
outlines the broad architecture that the Resource Discovery Project
will be aiming to populate.
Figure 4: Whole-of-Government Architecture
The three layers in the architecture match those of Figure 1. The
Agency layer deals with metadata for both individual-level and
collection-level (AusGILS) resources. The Middleware layer looks at
both the distribution of indexes (gathering) and the intelligent
routing of queries (brokering). The User layer supports advanced
client functions such as the HyperIndex browsing metaphor.
The technical challenges faced in developing a Whole-of-Government
information access architecture are immense, including scalability,
security and authentication, distributed indexing techniques, and
future migration strategies. We believe that the Resource Discovery
Project will be able to demonstrate solutions to some of these
technical problems and are excited about the prospect of real
deployment of our technologies.
Acknowledgements
Many thanks to the Resource Discovery Project team for their research
work that is described in this report: Nigel Ward, Andrew Wood, Andrew
Waugh, Ying Ni, Arkadi Kosmynin, Maria Lee, Mark French, Peter Bruza,
Jane Hunter, Eric Proper, and Hoylen Sue.
The work reported in this paper has been funded in part by the
Cooperative Research Centres Program, through the Department of the
Prime Minister and Cabinet of Australia.
References
1. Iannella, R., Sue, H. & Leong, D. BURNS: Basic URN Service resolution
for the internet. Asia Pacific World Wide Web Conference, August 23-28 1996 (Beijing & Hong Kong)
http://www.dstc.edu.au/RDU/reports/APweb96/
2. Iannella, R. & Waugh, A. Metadata: Enabling the Internet. CAUSE97
Conference, Melbourne, April 1997
http://www.dstc.edu.au/RDU/reports/CAUSE97/
3. The Z39.50 to X.500 Gateway Prototype
http://www.dstc.edu.au/RDU/ZXG/
4. Z39.50 Client and Web Gateway Surveys
http://www.dstc.edu.au/RDU/reports/zreviews/
5. HotOIL Advanced Prototype Access Program
http://www.dstc.edu.au/BDU/APAP/HotOIL/HotOIL.html
6. Ward, N., Wood, A., Finnigan, S. & Iannella, R.
Discussion Paper: Networked Information Retrieval Standards, 1996
http://www.dstc.edu.au/RDU/reports/webir.html
7. What'sHot Advanced Prototype Access Program
http://www.dstc.edu.au/BDU/APAP/WhatsHot/WhatsHot.html
8. ORION - Advanced Prototype Access Program
http://www.dstc.edu.au/BDU/APAP/Orion/Orion.html
9. Iannella, R. Networked Government Information Access - An Australian
Perspective. CNI Fall Task Meeting, San Francisco, December 1996
http://www.dstc.edu.au/RDU/reports/CNI96/
Author Details
Renato Iannella
(Resource Discovery Project leader at the DSTC)
Home Page: http://www.dstc.edu.au/RDU/staff/ri/
Email: renato@dstc.edu.au
Phone: +61 7 3365 4310
Fax: +61 7 3365 4310
DSTC Home Page: http://www.dstc.edu.au/
Resource Discovery Project Home Page: http://www.dstc.edu.au/RDU/
DSTC Address: Gehrmann Labs, The University of Queensland, 4072, AUSTRALIA

Material on this page is copyright Ariadne/original authors.
This article last updated/links checked on 07-Mar-1997