Experiences of Harvesting Web Resources in Engineering Using Automatic Classification
The story behind Engine-e, a recently created robot-generated Web index, is best told by starting in 1994 with the development and maintenance of EELS (Engineering Electronic Library, Sweden), a manually indexed quality-controlled subject gateway in Engineering. EELS was accompanied by the experimental robot-generated index "All" Engineering, created within the DESIRE framework. The solution already used in "All" Engineering is similar to that of Engine-e, but with some distinct differences. Work on EELS was initiated by SUTL (Swedish University Technology Libraries) in 1994 with the purpose of giving the technology libraries and universities an opportunity to explore the Internet from selected links to valuable resources, with a special focus on resources from Sweden and other Nordic countries. As such it was a very early implementation of a subject-based information gateway.
A group of some ten subject editors from Swedish technology universities carried out the tasks of collecting, evaluating, indexing, cataloguing and updating resources in EELS.
The technical development of EELS was carried out by NetLab, a research and development department at Lund University Libraries, Sweden. Traugott Koch, senior librarian and digital library scientist at NetLab, suggested at an early stage the development of a robot-generated index within EELS, which was later realised in "All" Engineering. The intention was to integrate "All" Engineering as far as possible into the same structure as the one used in the original EELS, where the classification scheme of Engineering Information Inc. was used. The harvesting robot in "All" Engineering collected resources starting from seven reliable quality-controlled subject gateways and followed their links down to two or three sub-levels.
As matters progressed, it became apparent that the work required of the subject editors was the most problematic part of the EELS project. During the period 1994-2000 the Web expanded exponentially and search engines greatly increased their performance. Thus the problem of coverage in EELS became increasingly urgent. The editors discovered how labour-intensive it was to keep EELS up to date in this new environment. At the same time users' need for EELS seemed to diminish as new-generation search engines grew increasingly popular. Furthermore, the funding for technical development from BIBSAM, the Royal Library's Department for National Co-ordination and Development, had come to an end, unless EELS were to be integrated into a planned national service together with the quality-controlled subject gateways from other Swedish national resource libraries. Eventually the SUTL consortium could no longer guarantee the quality of resources in EELS, due to difficulties experienced by the subject editors. All work on EELS was frozen in 2001.
Manually Indexed Quality-Controlled Subject Gateways vs. Automatic Classification Indices
At the same time as EELS was closed, it was suggested that the work carried out in "All" Engineering was a useful base on which to build further development. What advantages did "All" Engineering have over the quality-controlled subject gateway EELS?
In the context of Engine-e as a successor to a quality-controlled subject gateway, the question of which is best inevitably arises. The question is highly relevant, as not many institutions have the capability to maintain both services. Both quality-controlled subject gateways and robot-generated indices are well-suited for their purposes. These purposes however should be regarded as different. We will discuss some of these considerations below.
Browse/search: One of the major benefits of a quality-controlled subject gateway is that, owing to its use of knowledge organisation systems (e.g. classification systems and thesauri), it normally allows users to browse resources by subject. This can be seen as one means of offering an alternative to what a more general search engine does. In a robot-generated index using automatic classification this functionality can be kept, though not with the same precision in the assigned classifications as can be obtained by intellectual analysis of the resource.
Cost/Manual effort: Maintenance of a subject gateway is often associated with the labour-intensive and costly manual effort devoted to indexing resources. In comparison, a robot-generated index is filled with content quickly and at a low cost. The robot-generated index also requires manual maintenance; the focus, however, is not on the resources but rather on tuning the terms in the vocabulary chosen to describe the content of the resources. Many classification schemes and thesauri are not created to be interpreted by a machine. Intellectual tasks may include specifying that some words should not appear in a document in order for that document to receive a certain classification. Ambiguous terms need to be accompanied by more specialised words in order to be assigned a classification, etc.
Quality/Quantity: The quality of resource descriptions in a quality-controlled subject gateway is related to the effort put into describing the resources, but they can be expected to maintain a very high standard. Keeping records up to date in the robot-generated index is a task for robot software. As regards the quantity of resources populating the two different services, the robot-generated index has a clear advantage, even though a larger amount of irrelevant material is included.
The Start of Engine-e
The further exploration of the ideas behind "All" Engineering received funding approval from BIBSAM for 2002-2003 based on the background presented above. This resulted in the creation of Engine-e as a subject index in Engineering, with subject-based hierarchical browsing and free-text search using Boolean operators. Engine-e has been developed in collaboration between the Royal Institute of Technology Library and Lund University Libraries, and is being built within the Swedish National Resource Library framework.
The Engine-e demonstrator was launched in January 2003. Engine-e's goals are both quality and quantity, as well as continuous renewal and minimised manual labour. From the beginning Engine-e was able to take advantage of work done in EELS and "All" Engineering, benefitting from recent experience and case studies and from having two software components ready for use: the Combine robot and a classification algorithm that is compatible with it. Engine-e also benefitted from being able to use the appropriate Elsevier Engineering Information Thesaurus and a suitable list of term matching patterns.
Given these advantages, continued work concentrated on:
- selecting start pages for robot harvesting
- trouble-shooting the software
- calibrating the classification algorithm
- developing a gateway for Engine-e
- establishing robust and automatic runtime routines for maintaining the service
- identifying what manual tasks would be needed to maintain the service in the long term and in what framework
- conducting a small user survey
Which Web pages should serve as starting points for the Engine-e harvesting robot? First of all, Web pages of the universities of the SUTL consortium were given careful consideration, since Swedish material is considered to be of special interest for this project. Among the 21 selected start pages, 15 were (non-Swedish) quality-controlled subject gateways. The hyperlinks from these start pages lead the robot on its journey of finding resources. Resources that match the criteria for inclusion are assigned one or more classifications and are included in Engine-e. Those that do not are dropped, and their outgoing links are not followed.
Classification in Engine-e
Here follows a brief description of how a document is handled by the classification algorithm.
The classification algorithm stipulates that each term:
- may be a word, a phrase or an AND-combination of words/phrases
- is assigned a score
- is assigned one or more Ei classification codes
The list of terms has almost 21,000 entries; below is a sample with four terms:
2: planets=657.2
8: plasma oscillations=932.3
8: plasmas @and beam-plasma interactions=932.3
4: plasmas @and pinch effect=701, 932.3
In each entry, the number before the colon is the score, and the codes after the equals sign are the Ei classification codes assigned to that term. The automatic classification algorithm attempts to match terms found in a particular document with the list of terms.
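The entry format described above can be read mechanically. The following is a minimal sketch in Python, assuming that each entry has the shape "score: term=code, code" and that "@and" separates the parts of an AND-combination. The names Term and parse_term_line are hypothetical and are not taken from the Combine software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    score: int      # weight contributed by each hit
    parts: tuple    # words/phrases that must all occur (AND-combination)
    codes: tuple    # Ei classification codes assigned to the term


def parse_term_line(line: str) -> Term:
    """Parse one entry of the assumed form 'score: part [@and part ...]=code[, code ...]'."""
    score_text, rest = line.split(":", 1)
    pattern, codes_text = rest.rsplit("=", 1)
    parts = tuple(p.strip().lower() for p in pattern.split("@and"))
    codes = tuple(c.strip() for c in codes_text.split(","))
    return Term(int(score_text), parts, codes)


# The four sample entries quoted in the article
SAMPLE = [
    "2: planets=657.2",
    "8: plasma oscillations=932.3",
    "8: plasmas @and beam-plasma interactions=932.3",
    "4: plasmas @and pinch effect=701, 932.3",
]
TERMS = [parse_term_line(line) for line in SAMPLE]
```

Parsing the last sample entry this way gives a score of 4, the two parts "plasmas" and "pinch effect", and the two codes 701 and 932.3.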
- An absolute document score is computed as sum_over_all_terms(number_of_hits[term]*score[term])
- A relative document score taking the document's size into account is also computed
- The decision whether to accept or reject a document is based on the above document scores in relation to the experimentally established cut-off values
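The scoring steps listed above can be sketched as follows. Only the absolute score formula comes from the description; everything else is an assumption made for illustration: the relative score is taken to be the absolute score divided by the word count, an AND-combination is counted as often as its rarest part occurs, a document must pass both cut-offs, and the cut-off values themselves are invented.

```python
import re

# (score, parts that must all occur, Ei codes): the four sample entries from the text
TERMS = [
    (2, ("planets",), ("657.2",)),
    (8, ("plasma oscillations",), ("932.3",)),
    (8, ("plasmas", "beam-plasma interactions"), ("932.3",)),
    (4, ("plasmas", "pinch effect"), ("701", "932.3")),
]


def count_hits(text, parts):
    """Assumed rule: an AND-combination hits as often as its rarest part occurs."""
    counts = [
        len(re.findall(r"(?<!\w)" + re.escape(part) + r"(?!\w)", text))
        for part in parts
    ]
    return min(counts)


def classify(text, abs_cutoff=10, rel_cutoff=0.05):
    """Return (absolute score, relative score, accepted, Ei codes).

    The cut-off defaults are hypothetical, not the experimentally established values.
    """
    text = text.lower()
    n_words = max(len(text.split()), 1)
    absolute = 0
    codes = set()
    for score, parts, term_codes in TERMS:
        hits = count_hits(text, parts)
        if hits:
            absolute += hits * score   # number_of_hits[term] * score[term]
            codes.update(term_codes)
    relative = absolute / n_words      # assumed size normalisation
    accepted = absolute >= abs_cutoff and relative >= rel_cutoff
    return absolute, relative, accepted, (sorted(codes) if accepted else [])


DOC = "Plasmas show the pinch effect. Plasma oscillations occur in plasmas."
RESULT = classify(DOC)
```

For the ten-word sample sentence, "plasma oscillations" contributes 8 and the "plasmas @and pinch effect" combination contributes 4, so the absolute score is 12 and the document is accepted with codes 701 and 932.3 under these illustrative cut-offs.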
In order to find out how the classification algorithm behaves and how its parameters influence it, a series of initial experiments was carried out. In the first series of experiments the cut-off values for accepting or rejecting candidate pages were varied. A sample of URLs representing material of varying quality was fed to the algorithm, and the outcome was compiled into lists of sample documents that were examined manually.
The questions asked:
- Was too much low-quality material accepted?
- Was some high-quality material lost?
- Were the classifications relevant?
- Did the algorithm tend to assign many or very few classifications per page?
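A calibration run of this kind can be pictured as a sweep over candidate cut-off pairs, producing one list of accepted documents per pair for manual inspection. The sketch below assumes that a document is accepted when both of its scores reach the cut-offs; the URLs, scores and cut-off values are invented purely for illustration.

```python
from itertools import product

# Hypothetical (url, absolute score, relative score) triples for a small sample
SAMPLE_SCORES = [
    ("http://example.org/plasma-lab", 120, 0.30),
    ("http://example.org/course-list", 15, 0.02),
    ("http://example.org/news", 4, 0.01),
]


def sweep(samples, abs_cutoffs, rel_cutoffs):
    """Return, for each cut-off pair, the URLs that would be accepted."""
    outcome = {}
    for abs_cut, rel_cut in product(abs_cutoffs, rel_cutoffs):
        outcome[(abs_cut, rel_cut)] = [
            url
            for url, s_abs, s_rel in samples
            if s_abs >= abs_cut and s_rel >= rel_cut
        ]
    return outcome


RESULT = sweep(SAMPLE_SCORES, abs_cutoffs=[5, 20], rel_cutoffs=[0.015, 0.05])
```

Each list in the result can then be examined by hand against the questions above: loose cut-offs let borderline pages through, while strict ones risk losing good material.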
When a reasonable set of cut-off values had been found, a second experiment was carried out. This time a fairly large number of pages (about 40,000) were collected, using a large set of selected start pages, and a searchable and browsable Web index was built and served via a preliminary version of the gateway.
This experiment resulted in:
- Bugfixes. Among other things, the classifications were lost for about 1.5% of the records
- Modifications of the record structure
Full-scale Data Collection
Once a safe and automatic runtime environment capable of handling a very large dataflow and computational effort had been developed, a full-scale data collection was attempted. Harvesting with automatic classification involved the following steps:
- Feeding the robot with start pages for harvesting
- Storing web pages (HTML and PDF originals) internally
- Parsing the stored pages into a uniform representation
- Storing successfully classified pages (20% or less of the harvested material) in the database and feeding their outgoing links back into the robot for continued harvesting
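The steps above amount to a focused crawl in which only accepted pages contribute new links. The following is a minimal sketch, not the Combine robot itself: fetch and classify are hypothetical callables supplied by the caller, and it is an assumption here that links on start pages are always followed, whether or not the start page itself is classified.

```python
from collections import deque


def harvest(start_pages, fetch, classify, max_pages=1000):
    """Focused crawl: store accepted pages and follow only their outgoing links.

    fetch(url) returns (text, outgoing_links) or None.
    classify(text) returns a list of codes; an empty list means rejected.
    """
    start_set = set(start_pages)
    queue = deque(start_pages)
    seen = set(start_pages)
    database = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        codes = classify(text)
        if codes:
            database[url] = codes          # accepted: stored with its classifications
        elif url not in start_set:
            continue                       # rejected: dropped, links not followed
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


# Toy in-memory "Web" standing in for real HTTP fetching
WEB = {
    "start": ("a list of links", ["a", "b"]),
    "a": ("plasma physics", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("plasma waves", []),
    "d": ("plasma page reachable only via a rejected page", []),
}
HARVESTED = harvest(
    ["start"],
    fetch=WEB.get,
    classify=lambda text: ["932.3"] if "plasma" in text else [],
)
```

In the toy run, page "d" is never visited because the only link to it sits on the rejected page "b", which is the behaviour that keeps the harvest on topic.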
Several variables and events are worth monitoring during harvesting. Some relate to quantity and size: disk space, records, outgoing links, the harvester URL queue and intermediate storage. Others relate to quality, e.g. the distribution of classification codes, classification codes per record, Web servers and document types.
We quickly learned that the distribution of classification codes, the intermediate storage (classification is computationally intensive) and the harvester URL queue can get out of hand. One solution might be to take a crude process control approach: sample some of these quantities, pause or resume harvesting when appropriate, and log them constantly in order to gain insight. The users of the service would not notice any of these events. The database is indexed once daily in order to become searchable via the Z39.50 protocol. The browsing structure, belonging to the gateway, is automatically derived from this index. The Z39.50 server status is checked every five minutes and the server is restarted automatically when necessary.
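The crude process control idea mentioned above could take the form of a simple two-threshold switch: harvesting pauses when a sampled quantity (for instance the backlog in intermediate storage) rises above a high-water mark and resumes once it falls below a low-water mark. The sketch below is an illustration under that assumption; the class name and threshold values are hypothetical.

```python
class HarvestThrottle:
    """Pause harvesting above a high-water mark; resume below a low-water mark."""

    def __init__(self, high, low):
        if low >= high:
            raise ValueError("low must be below high")
        self.high = high
        self.low = low
        self.paused = False
        self.log = []   # every sample is kept, to gain insight afterwards

    def sample(self, backlog):
        """Record one sample of the monitored quantity and update the state."""
        if not self.paused and backlog > self.high:
            self.paused = True
        elif self.paused and backlog < self.low:
            self.paused = False
        self.log.append((backlog, self.paused))
        return self.paused


throttle = HarvestThrottle(high=100, low=40)
STATES = [throttle.sample(backlog) for backlog in [10, 120, 80, 50, 30, 60]]
```

Using two thresholds rather than one avoids rapid switching between pausing and resuming when the monitored quantity hovers around a single limit.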
A full-scale data collection conducted over a period of seven weeks resulted in 344,886 records and a browsing structure with 740 out of 846 possible nodes populated with resources.
The Engine-e Gateway
The Engine-e gateway should be kept as simple as possible, i.e. it should:
- generally be familiar to users of quality-controlled subject gateways. Browsing results show title, classification and by selection also show excerpts from the document
- consist of as few pages as possible. Consequently, no distinction should be made e.g. between a simple and an advanced search form
- only have one box for entering a search string but offer a variety of interpretations (exact phrase or implicitly inserting one of the Boolean operators AND, OR and BUT NOT between the words in the search string, the default being AND). Engine-e remembers the previous query completely, which for example allows the user to change Boolean operator or correct a spelling mistake with minimal input. The price paid for this simplicity is that only one kind of Boolean operator can be used at a time
- always reconstruct exactly the browse or search operation whenever results are shown
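The single search box with several interpretations can be illustrated with a small query builder. This is a sketch under assumptions: the function name, the mode names and the textual form of the resulting query are hypothetical, and the actual gateway presumably translates the query into whatever syntax its Z39.50 server expects.

```python
OPERATORS = {"and": "AND", "or": "OR", "not": "BUT NOT"}


def build_query(search_string, mode="and"):
    """Interpret one search string as an exact phrase or as words joined by one operator."""
    words = search_string.split()
    if not words:
        return ""
    if mode == "phrase":
        return '"' + " ".join(words) + '"'
    if mode not in OPERATORS:
        raise ValueError(f"unknown mode: {mode}")
    # The same operator is inserted between every pair of words
    return f" {OPERATORS[mode]} ".join(words)


DEFAULT_QUERY = build_query("plasma pinch effect")
PHRASE_QUERY = build_query("plasma pinch effect", mode="phrase")
EXCLUDE_QUERY = build_query("plasma fusion", mode="not")
```

Because the search string and the mode are kept separately, switching operator only requires rebuilding the query from the remembered string, and the limitation to one kind of operator per query follows directly from this design.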
The presentation of free-text search results supports field-weighted, lexical relevance ranking. The following fields and weights were chosen, based on assumptions of where the relevant terms occur in a document.
Figure 1: Engine-e
During the autumn of 2003 a user survey was conducted. Some of the results from this survey are incorporated below, together with an introduction of how to use Engine-e.
Browsing by subject: The browsing interface of Engine-e consists of a table of subject headings, which follow the Ei classification system, with six broader categories as a starting point for the sub-divided categories under each heading. To browse deeper into the subject hierarchy, simply click on a subject. Each subject heading is followed by its Ei classification code and, in parentheses, the associated hit count.
Results from browsing are sorted alphabetically by title (whereas search results are sorted by relevance). Feedback from the user survey suggested that this should be replaced by a ranking procedure based on weighting and link-popularity counting like that used in Google. The number of resources shown at a classification level could also be limited to a practical size. It is also desirable to be able to restrict search queries to a specific level (and downwards) of the browsing structure.
Searching: Engine-e provides full-text searches, looking for matches in title, headings, text body, subject classification fields and the URL. Results are ranked according to where the hits occur in the document.
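Ranking by where the hits occur can be sketched as a weighted sum over fields. The field names below follow the list in the paragraph above, but the weight values are hypothetical and chosen only to illustrate the principle that a hit in the title counts for more than a hit in the body; they are not the weights used by Engine-e.

```python
import re

# Hypothetical weights, for illustration only
FIELD_WEIGHTS = {"title": 10, "headings": 5, "classification": 4, "url": 3, "body": 1}


def relevance(record, query_words):
    """Sum, over all fields, the field weight times the number of query-word hits."""
    score = 0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = re.findall(r"[a-z0-9]+", record.get(field, "").lower())
        score += weight * sum(tokens.count(word.lower()) for word in query_words)
    return score


def rank(records, query_words):
    """Return records ordered by descending relevance."""
    return sorted(records, key=lambda r: relevance(r, query_words), reverse=True)


RECORDS = [
    {"title": "Cooking with gas", "body": "plasma plasma", "url": "http://example.org/a"},
    {"title": "Plasma physics", "body": "an introduction", "url": "http://example.org/b"},
]
RANKED = rank(RECORDS, ["plasma"])
```

In this toy example the record with a single hit in the title outranks the record with two hits in the body.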
Presenting results: Results are presented with titles first. Clicking on the title brings you to the resource. Thereafter follows a list of subject headings assigned to that document. By clicking on one of these the user is presented with a list of all resources that have been assigned that particular subject heading. There are two formats for presentation of results. The default is a short format containing a linked title leading to the resource, a URL and linked allocated Ei class(es) leading to the class(es) in the browsing structure. There is also a format that includes full-text excerpts from the documents.
Engine-e is far from ready, but has proven to be valuable for the users in the survey. The remaining work falls roughly into the following categories:
- Improved presentation and ranking of results
- Improved usability and layout
- Intellectual work on the term list, and improved interpretations of these by the robot. An administrative tool to support the tasks also needs to be developed
Engine-e provides a low-cost searchable and browsable subject index in the field of technology. While recall is rather high, the drawback is that precision is sometimes low. Both types of subject gateway, those created by intensive manual labour and those created with the help of some smart computing, allow users to search and retrieve information in a useful way.
References
- Engine-e demonstrator http://engine-e.lub.lu.se/
- EELS, Engineering Electronic Library, Sweden http://eels.lub.lu.se/
- "All" Engineering resources on the Internet http://eels.lub.lu.se/ae/index.html
- The DESIRE Project http://www.lub.lu.se/desire
- Ardö, A., Lundberg, S. and Zettergren, A-Z (2002) "Another piece of cake......?" Ariadne, Issue 32 http://www.ariadne.ac.uk/issue32/netlab-history/
- Jansson, K. (1996) "Indexerade kvalitetsresurser på Internet EELS-projektet" Nordic Journal of Documentation, 51 (1-2), pp.14-20, ISSN 0040-6872.
- NetLab http://www.lub.lu.se/netlab/
- Elsevier Engineering Information http://www.ei.org/
- Ardö, A., Koch T. and Noodén, L. (1999) The construction of a robot-generated subject index. http://www.lub.lu.se/desire/DESIRE36a-WP1.html
- BIBSAM http://www.kb.se/bibsam/english/first.htm
- Royal Institute of Technology Library http://www.lib.kth.se/kthbeng/about.html
- Lund University Libraries, Head Office http://www.lub.lu.se/new_top/omlub/bd_eng.shtml
- Swedish National Resource Library framework http://www.kb.se/bibsam/english/nrl/first.htm
- The Combine robot http://www.lub.lu.se/combine/
- Indexdata, provider of the Zebra Z39.50 implementation, the PHP-YAZ library http://www.indexdata.dk/
- PHP language http://www.php.net/
Lund University Libraries, Sweden
Web site: http://www.lub.lu.se/~jessica/
Lund University Libraries, Sweden
Web site: http://www.lub.lu.se/~tomas/
Royal Institute of Technology Library (KTHB), Sweden
Article Title: "Experiences of Harvesting Web Resources in Engineering using Automatic Classification"
Author: Jessica Lindholm, Tomas Schönthal and Kjell Jansson
Publication Date: 30-October-2003
Publication: Ariadne Issue 37
Originating URL: http://www.ariadne.ac.uk/issue37/lindholm/