Web Magazine for Information Professionals

Search Engines Corner

Dave Beckett discusses the best of the search engine features.

Do you search for something difficult and end up with an enormous and completely useless set of results?

The query in the table below is rather contrived but indicates the problem. In this article I hope to help you with these kinds of situations by pointing you to the right place, showing you the best of the search engine (SE) features and giving tips on managing the output that you get when a query returns thousands of answers.

I run several search engines myself, including the Academic Directory[3] and UK Internet Sites[4], and do research in the area.

Search Engine   Est. Size (May 1998)   Results of searching for The Internet
-------------   --------------------   -------------------------------------
AltaVista[1]    140M                   About 47,565,550 matches were found
HotBot[2]       110M                   Web Results 7939803 matches.
                                       Breakdown: internet 4602458, the 47280842

Specialised Searching

The first choice to make is where to search, and this may not be at a web crawler SE. This table lists some good choices for specialised searches:

Searching ...?               Try these
--------------               ---------
USENET news                  DejaNews[5] (also available via other SEs)
Stuff printed on dead trees  Your local OPAC, NISS[6], BIDS[7], eLib[8] projects, Northern Light[9]
Pictures                     Lycos[10], Yahoo! Image Surfer[11]
Sounds                       Lycos
News                         BBC News[12], CNN[13], NewBot[14], NewsIndex[15], most big SEs
UK Web                       Yahoo! UK[16], Search UK[17], UK Index[18], InfoSeek UK[19], Excite[20], Lycos
.ac.uk Web                   Academic Directory (I run this.)
European Web                 Excite, Euroferret[21], HotBot
Particular country Web       HotBot
In a date range              HotBot
In a particular language     AltaVista (translations too)

Subject Specific Searches

If you are very lucky, there may be a subject-specific site that you can use where professional cataloguers have been paid to find web resources and provide high quality records. For example there are several eLib Subject Based Information Gateways (SBIGs). Some of those projects are experimenting with web crawls of the sites they have hand-picked, so searching there should return highly relevant results.

So how do you find one of these sites? Try using an appropriate authority in your subject. If you have a professional body or association, look at their web site for links. If that isn't possible, try a general SE or a directory service. Search.com[22] contains a list of over 100 specialised SEs and may be a good place to start.

Search a Web Directory

There are a few well-known large directories, of which Yahoo! is the largest and best, with over 500K web sites listed (it is in fact the #1 used site on the Web). It is always a good idea to search there, or at one of the smaller sites such as LookSmart[23]. There are also some non-commercial directories such as WWWVL[24] (a site started by a creator of the web, Tim Berners-Lee) and the new NewHoo![25].

Search a Web Crawler

If you have got this far, either you want something from the web in general or the specialised searches didn't find quite what you wanted.

Coverage and freshness of the web crawls are important, and the web crawlers all have different sizes and activity patterns. Search Engine Watch[26] keeps up-to-date estimates of the sizes[27] but this isn't the full story. A paper[28] at the WWW7 conference estimated the size of the web in November 1997 as 200M static pages, but the joint coverage of the 4 largest SEs was only 160M and the overlap between them was only 2.2M, or 1.4% of the pages they covered!
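
To make those figures easier to compare, here is a small Python calculation using only the numbers quoted above from the WWW7 paper[28]. The variable names are my own, and the point is simply to show which ratio each percentage corresponds to: the 1.4% figure is the overlap divided by the joint coverage, not by the estimated size of the whole web.

```python
# Figures quoted above from the WWW7 paper [28], in millions of pages.
web_size = 200.0        # estimated static pages on the web, November 1997
joint_coverage = 160.0  # pages indexed by at least one of the 4 largest SEs
overlap = 2.2           # pages indexed by all four of them


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# How much of the estimated web the four engines cover between them.
coverage_of_web = percent(joint_coverage, web_size)

# How much of that joint coverage all four engines have in common.
overlap_of_coverage = percent(overlap, joint_coverage)

# The same overlap measured against the estimated size of the whole web.
overlap_of_web = percent(overlap, web_size)

print(f"Joint coverage of the web:    {coverage_of_web:.1f}%")
print(f"Overlap within joint coverage: {overlap_of_coverage:.1f}%")
print(f"Overlap within the whole web:  {overlap_of_web:.1f}%")
```

In other words, the four engines together reached about four fifths of the estimated web, yet had very little of it in common, which is why trying more than one crawler can be worthwhile.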

At the current date, AltaVista is probably still the largest, with HotBot a close second and Northern Light growing rapidly. Most of the largest SEs crawl very actively, checking or adding millions of pages each day, so for freshness, choose the largest crawlers. Each of the SEs has a different set of indexing and search features that can be exploited, which may make the difference.

Getting Your Search Query Right

The worst thing you can do to an SE is to present a one-word query on a very common term. If you look at some of the internet search engine spy pages[29], many people really do this. You should therefore choose some extra words. Can't think of any? Try your one-word search on Excite and use its suggested words feature: 10 related words are suggested with the results of every search (JavaScript support is required for you to pick them by clicking, but you can always type them in too).

Assuming you have managed to get a couple of words, now you can do something with them. If you try a general query on an SE like the one above, you end up with millions of results, so it is a good idea to modify the query. Most SEs allow you to use these methods, although the syntax varies and sometimes they are found on an advanced search page:

Match Any          Results may contain any of the words
Match All          Results must contain all of the words
Exclude / Require  Exclude / Require certain mix of words
Phrase Searching   Require words in the exact order given
Proximity          Look for words near other words
Wildcards          Partial match on words; do you know how to spell it?

For the details of which engines support which features, see the Search Engine Watch Power Searching page at [30].
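
To show what each of these query types means in practice, here is a minimal Python sketch that applies them to a few made-up page titles. It does not reflect how any real SE is implemented or what its query syntax looks like; the function names, the sample texts and the default proximity window of 5 words are all my own assumptions for illustration.

```python
import fnmatch
import re


def words(text):
    """Lower-case a text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any(doc, terms):
    """Match Any: the document contains at least one of the terms."""
    found = set(words(doc))
    return any(t in found for t in terms)


def match_all(doc, terms):
    """Match All: the document contains every one of the terms."""
    found = set(words(doc))
    return all(t in found for t in terms)


def require_exclude(doc, required, excluded):
    """Exclude / Require: all required terms present, no excluded term present."""
    found = set(words(doc))
    return all(t in found for t in required) and not any(t in found for t in excluded)


def phrase(doc, phrase_text):
    """Phrase Searching: the words appear together, in the exact order given."""
    doc_words, target = words(doc), words(phrase_text)
    if not target:
        return False
    n = len(target)
    return any(doc_words[i:i + n] == target for i in range(len(doc_words) - n + 1))


def near(doc, a, b, distance=5):
    """Proximity: word a occurs within `distance` words of word b."""
    doc_words = words(doc)
    pos_a = [i for i, w in enumerate(doc_words) if w == a]
    pos_b = [i for i, w in enumerate(doc_words) if w == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def wildcard(doc, pattern):
    """Wildcards: some word matches a pattern such as 'catalog*'."""
    return any(fnmatch.fnmatchcase(w, pattern) for w in words(doc))


# Hypothetical sample "pages" (titles only) to try the query types on.
docs = {
    "a": "Searching the Internet with a web crawler",
    "b": "The history of the internet in the UK",
    "c": "Cataloguing web resources for libraries",
}

for name, text in docs.items():
    print(name,
          match_any(text, ["internet", "web"]),
          match_all(text, ["internet", "web"]),
          require_exclude(text, ["internet"], ["crawler"]),
          phrase(text, "web crawler"),
          near(text, "searching", "internet", distance=2),
          wildcard(text, "catalog*"))
```

Note how Match Any accepts all three sample titles while Match All keeps only the first: each step from "any" towards "phrase" narrows the result set, which is exactly what you want when a query returns millions of answers.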

Managing Millions

So you have chosen an SE and tweaked the query to match what you want but still end up with an unmanageable set of results. There are two types of page that you really want to identify[31]:

  1. Authorities -- A page that contains a lot of information about a topic.
  2. Hubs -- A page that contains a large number of links to pages containing information about the topic.
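
As a rough illustration of how hubs and authorities reinforce each other, here is a minimal Python sketch of the kind of iteration described in [31]: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The page names and the link graph are invented for the example, and the step of first building a query-specific set of pages, which the paper performs, is left out.

```python
def hubs_and_authorities(links, iterations=20):
    """Score pages as hubs and authorities.

    `links` maps each page to the list of pages it links to.
    Returns two dicts: hub scores and authority scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}

        # Normalise both score sets so the numbers stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5
            if norm > 0:
                for p in scores:
                    scores[p] /= norm

    return hub, auth


# A hypothetical link graph: two link collections pointing at content pages.
links = {
    "links-page": ["guide", "faq", "tutorial"],
    "bookmarks": ["guide", "faq"],
    "guide": ["faq"],
    "faq": [],
    "tutorial": [],
}

hub, auth = hubs_and_authorities(links)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
print("Best hub:", best_hub)
print("Best authority:", best_authority)
```

In this toy graph the page with the most outgoing links to well-linked pages comes out as the best hub, and the page that the hubs agree on comes out as the best authority.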

People who create such pages do tend to submit them to directories and search engines, or get links made from other related authorities or hubs. Check the directories to see if you can find those sites there, but here are some techniques I suggest when working with SEs:

Refine the query by adding extra keywords
    Excite is good at this; this should help you find authorities. AltaVista allows you to refine by adding or excluding words: select the Refine button and pick the word lists that are or are not related.
Search in the query results
    InfoSeek allows you to restrict searches to the results of the previous query, which makes it easy to refine when you are confident the answer is in the larger set of results.
Run a query based on a page
    Lycos and Excite give you More Like This links which do a search based on one of the result pages; this can help identify hubs.
Show more results on the page
    Some SEs like HotBot allow up to 100 results per page and most allow you to see just the page titles. This can give you a good overview.
Sort the results
    Excite allows you to view the results by Web site -- this can make it easy to identify potential authority sites with lots of pages related to your search.
Look at the matches per word
    PlanetSearch[32] by default shows how each word matched the results as coloured bars, and you can show up to 250 results per page.
Try a meta-crawler
    Metacrawler[33] allows you to search 5 big SEs in parallel and merge the results. This has the advantage of leveraging different scoring methods but can lose important results lower down in the replies, and there are only three query types you can use.
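
The meta-crawler idea can be sketched in a few lines of Python. This is not how Metacrawler itself works: the three "engines" below are hypothetical stand-ins that return fixed lists instead of querying the network, and summing reciprocal ranks is just one simple merging rule I have assumed for the example.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for real engines: each returns a ranked list of URLs.
def engine_a(query):
    return ["http://example.org/hub", "http://example.org/faq", "http://example.org/a-only"]


def engine_b(query):
    return ["http://example.org/faq", "http://example.org/hub", "http://example.org/b-only"]


def engine_c(query):
    return ["http://example.org/faq", "http://example.org/c-only"]


def merge(ranked_lists):
    """Merge ranked lists: each URL scores 1/rank in every list it appears in."""
    scores = {}
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))


def meta_search(query, engines):
    """Send the query to every engine in parallel and merge the answers."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        ranked_lists = list(pool.map(lambda engine: engine(query), engines))
    return merge(ranked_lists)


merged = meta_search("search engines", [engine_a, engine_b, engine_c])
for position, url in enumerate(merged, start=1):
    print(position, url)
```

Pages that several engines agree on rise to the top, while a page found by only one engine sinks, which is both the attraction of merging and the way a good but rarely indexed page can get lost lower down.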

FAQs

"So what search engine do you use?"
HotBot mostly, with all the options on; AltaVista for coverage and others as need be.

"Why?"
HotBot has loads of juicy features, is large enough and crawls often enough to keep the index fresh. One downside is that it is US-based and has no local partner.

References

Here, .com sites are in the USA and .uk sites are in the UK unless otherwise indicated.

[1] AltaVista
http://www.altavista.digital.com/
Note: Since the merger with Compaq, the European AltaVista service at altavista.telia.com does not seem to be updated. You will have to use the US one.

[2] HotBot
http://www.hotbot.com/
Inktomi provide the technology and data behind HotBot and also power searches at Yahoo!, CNET's Snap and Disney's Internet Guide (DIG). Searches there mostly have the same functionality as HotBot.

[3] The Academic Directory, HENSA Unix
http://acdc.hensa.ac.uk/
I made this.

[4] UK Internet Sites, HENSA Unix
http://www.hensa.ac.uk/uksites/
and this.

[5] DejaNews
http://www.dejanews.com/

[6] National Information Services and Systems (NISS)
http://www.niss.ac.uk/

[7] Bath Information & Data Services (BIDS)
http://www.bids.ac.uk/

[8] The Electronic Libraries Programme (eLib)
http://www.ukoln.ac.uk/services/elib/

[9] Northern Light
http://www.northernlight.com/
See last month's article for more information.

[10] Lycos UK (actually in Germany)
http://www.lycos.co.uk/

[11] Yahoo! (US) Image Surfer
http://isurf.yahoo.com/

[12] BBC News
http://news.bbc.co.uk/

[13] CNN (European mirror)
http://europe.cnn.com/

[14] NewBot
http://www.newbot.com/

[15] News Index
http://www.newsindex.com/

[16] Yahoo! UK & Ireland
http://www.yahoo.co.uk/

[17] Search UK
http://www.searchuk.com/

[18] UK Index
http://www.ukindex.co.uk/

[19] InfoSeek UK (actually in USA)
http://www.infoseek.co.uk/

[20] Excite UK (actually in USA)
http://www.excite.co.uk/

[21] EuroFerret (actually in UK)
http://www.euroferret.com/

[22] Search.com, CNET
http://www.search.com/

[23] LookSmart
http://www.looksmart.com/

[24] World Wide Web Virtual Library
http://www.vlib.org/
UK Mirror: http://www.mth.uea.ac.uk/VL/Overview.html

[25] NewHoo!
http://www.newhoo.com/

[26] Search Engine Watch, Mecklermedia
http://www.SearchEngineWatch.com/

[27] Search Engine Sizes, Search Engine Watch
http://www.searchenginewatch.com/reports/sizes.html

[28] A Technique for measuring the relative size and overlap of public Web search engines, Bharat and Broder, Digital Systems Research Center, USA in Proceedings of WWW7, April 1998.

[29] Yahoo! Search Engine Spying Page
http://www.yahoo.co.uk/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/Indices_to_Web_Documents/Random_Links/Search_Engine_Spying/

[30] Search Engine Watch - Power Searching
http://www.searchenginewatch.com/facts/powersearch.html

[31] Authoritative sources in a hyperlinked environment, Jon Kleinberg in: Proceedings of 9th ACM-SIAM Symposium on Discrete Algorithms, 1998; also appears as IBM Research Report RJ 10076(91892) May 1997
http://www.cs.cornell.edu/home/kleinber/auth.ps

[32] Planet Search
http://www.planetsearch.com/

[33] Metacrawler
http://www.metacrawler.com/

Author details

Dave Beckett
Research Fellow
Computing Laboratory, University of Kent at Canterbury, UK
Email: D.J.Beckett@ukc.ac.uk
Personal Web Page: http://www.cs.ukc.ac.uk/people/staff/djb1/