Search Engines Corner
Are you trying to search for something difficult and ending up with an enormous and completely useless set of results?
The query in the table below is rather contrived but indicates the problem. In this article I hope to help you with these kinds of situations by pointing you to the right place, showing you the best of the search engine (SE) features and giving tips on managing the output that you get when a query returns thousands of answers.
I run several search engines myself, including the Academic Directory and UK Internet Sites, and do research in the area.
|Results of searching for The Internet|
|AltaVista||140M||About 47,565,550 matches were found|
|HotBot||110M||Web Results 7939803 matches.|
Breakdown: internet 4602458, the 47280842
The first choice to make is where to search, and this may not be a web crawler SE. This table lists some good choices for specialised searches:
|Searching ...?||Try these|
|USENET news||DejaNews (also available via other SEs)|
|Stuff printed on dead trees||Your local OPAC, NISS, BIDS, eLib projects, Northern Light|
|Pictures||Lycos, Yahoo! Image Surfer|
|News||BBC News, CNN, NewBot, NewsIndex, most big SEs|
|UK Web||Yahoo! UK, Search UK, UK Index, InfoSeek UK, Excite, Lycos|
|.ac.uk Web||Academic Directory (I run this)|
|European Web||Excite, Euroferret, HotBot|
|Particular country Web||HotBot|
|In a date range||HotBot|
|In a particular language||AltaVista (translations too)|
Subject Specific Searches
If you are very lucky, there may be a subject-specific site that you can use where professional cataloguers have been paid to find web resources and provide high quality records. For example there are several eLib Subject Based Information Gateways (SBIGs). Some of those projects are experimenting with web crawls of the sites they have hand-picked, so searching there should return highly relevant results.
So how do you find one of these sites? Try using an appropriate authority in your subject. If you have a professional body or association, look at their web site for links. If that isn't possible, try a general SE or a directory service. Search.com contains a list of over 100 specialised SEs and may be a good place to start.
Search a Web Directory
There are a few well-known large directories, of which Yahoo! is the largest and best, with over 500K web sites listed (it is in fact the most used site on the Web). It is always a good idea to search there, or perhaps at one of the smaller directories such as LookSmart. There are also some non-commercial directories such as WWWVL (a site started by the creator of the Web, Tim Berners-Lee) and the new NewHoo!.
Search a Web Crawler
If you have got this far, either you want to find something on the general web or the specialised searches didn't find quite what you wanted.
Coverage and freshness of the web crawls are important, and all the web crawlers have different sizes and activity patterns. Search Engine Watch keeps up-to-date estimates of the sizes but this isn't the full story. A paper at the WWW7 conference estimated the size of the web in November 1997 as 200M static pages, but the joint coverage of the 4 largest SEs was only 160M and the overlap between them was only 2.2M, or about 1.4% of the pages they jointly covered!
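The arithmetic behind those figures is easy to check. The short sketch below uses only the numbers quoted above; reading "overlap" as the pages indexed by all four engines is an assumption, and the variable names are purely illustrative. It shows that 2.2M is roughly 1.4% of the 160M jointly covered pages and roughly 1.1% of the estimated 200M-page web.

```python
# Figures as quoted in the article (November 1997 estimates), in millions of pages.
WEB_SIZE_M = 200.0        # estimated static pages on the web
JOINT_COVERAGE_M = 160.0  # pages indexed by at least one of the 4 largest SEs
OVERLAP_M = 2.2           # assumed: pages indexed by all four SEs


def percent(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


coverage_of_web = percent(JOINT_COVERAGE_M, WEB_SIZE_M)
overlap_of_coverage = percent(OVERLAP_M, JOINT_COVERAGE_M)
overlap_of_web = percent(OVERLAP_M, WEB_SIZE_M)

print(f"Joint coverage as a share of the web:   {coverage_of_web:.1f}%")
print(f"Overlap as a share of joint coverage:   {overlap_of_coverage:.1f}%")
print(f"Overlap as a share of the whole web:    {overlap_of_web:.1f}%")
```

In other words, even the four largest crawlers together missed about a fifth of the estimated web, and very few pages were found by all of them, which is a good reason to try more than one SE.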
At the current date, AltaVista is probably still the largest, with HotBot a close second and Northern Light getting larger rapidly. The crawlers of most of the largest SEs are very active, checking or adding millions of pages each day, so for freshness, choose the largest crawlers. Each of the SEs has a different set of indexing and search features that can be exploited, and these may make the difference.
Getting Your Search Query Right
Assuming you have managed to get a couple of words, now you can do something with them. If you try a general query on a SE like the one above, you end up with millions of results, so it is a good idea to modify the query. Most SEs allow you to use these methods, although the syntax varies and sometimes they are only found on an advanced search page:
|Match Any||Results may contain any of the words|
|Match All||Results must contain all of the words|
|Exclude / Require||Exclude or require particular words|
|Phrase Searching||Require words in the exact order given|
|Proximity||Look for words near other words|
|Wildcards||Partial match on words; do you know how to spell it?|
For the details of which engines support which features, see the Search Engine Watch Power Searching page.
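To make the query types in the table concrete, here is a minimal Python sketch of each one applied to a few made-up document strings. It is an illustration only: the function names, the sample documents and the default proximity window of 3 words are all assumptions, and real SEs each have their own syntax and matching rules.

```python
import fnmatch
import re

# Hypothetical sample documents, for illustration only.
DOCS = {
    "a": "The Internet is a network of networks",
    "b": "Searching the web with a search engine",
    "c": "Library catalogues on the internet: searching OPACs",
}


def tokens(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any(doc, words):
    """Match Any: at least one of the words appears."""
    found = set(tokens(doc))
    return any(w.lower() in found for w in words)


def match_all(doc, words):
    """Match All: every word appears."""
    found = set(tokens(doc))
    return all(w.lower() in found for w in words)


def require_exclude(doc, required, excluded):
    """Exclude / Require: all required words present, no excluded word present."""
    return match_all(doc, required) and not match_any(doc, excluded)


def phrase(doc, phrase_text):
    """Phrase Searching: the words appear consecutively in the given order."""
    t, p = tokens(doc), tokens(phrase_text)
    if not p:
        return False
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


def near(doc, a, b, distance=3):
    """Proximity: word a occurs within `distance` words of word b."""
    t = tokens(doc)
    pos_a = [i for i, w in enumerate(t) if w == a.lower()]
    pos_b = [i for i, w in enumerate(t) if w == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def wildcard(doc, pattern):
    """Wildcards: some word matches a shell-style pattern such as 'catalog*'."""
    return any(fnmatch.fnmatchcase(w, pattern.lower()) for w in tokens(doc))


for name, text in DOCS.items():
    print(name, match_all(text, ["internet", "searching"]), phrase(text, "search engine"))
```

The wildcard case shows why that feature matters when you are unsure of a spelling: the pattern `catalog*` finds "catalogues" as well as "catalogs".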
So you have chosen a SE and tweaked the query to match what you want but still end up with an unmanageable set of results. There are two types of page that you really want to identify:
- Authorities -- A page that contains a lot of information about a topic.
- Hubs -- A page that contains a large number of links to pages containing information about the topic.
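The terms come from Kleinberg's paper listed in the references, which scores pages from the link structure alone: a good hub links to good authorities, and a good authority is linked to by good hubs. The sketch below is a simplified version of that iterative idea, not a reproduction of the paper's full method; the tiny link graph and the page names are made up for illustration.

```python
from math import sqrt


def hits(links, iterations=20):
    """Compute hub and authority scores.

    links maps each page to the list of pages it links to.
    Returns two dicts: (hub_scores, authority_scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph: two link collections pointing at content pages.
LINKS = {
    "links-page": ["guide", "faq", "tutorial"],
    "bookmarks": ["guide", "faq"],
    "faq": ["guide"],
    "guide": [],
    "tutorial": [],
}

hub, auth = hits(LINKS)
print("Best hub:      ", max(hub, key=hub.get))
print("Best authority:", max(auth, key=auth.get))
```

In this toy graph the page with the most outgoing links to well-cited pages comes out as the best hub, and the page that every collection points to comes out as the best authority.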
People who create such pages do tend to submit them to directories and search engines, or get links made from other related authorities or hubs. Check the directories first to see if you can find such sites there; failing that, here are some techniques I suggest when working with SEs:
- Refine the query by adding extra keywords -- Excite is good at this; this should help you find authorities. AltaVista allows you to refine by adding or excluding words: select the Refine button and pick the word lists that are or are not related.
- Search in the query results -- InfoSeek allows you to restrict searches to the results of the previous query, which makes it easy to refine when you are confident the answer is in the larger set of results.
- Run a query based on a page -- Lycos and Excite give you More Like This links which do a search based on one of the result pages; this can help identify hubs.
- Show more results on the page -- Some SEs like HotBot allow up to 100 results per page and most allow you to see just the page titles. This can give you a good overview.
- Sort the results -- Excite allows you to view the results by Web site; this can make it easy to identify potential authority sites with lots of pages related to your search.
- Look at the matches per word -- PlanetSearch by default shows how each word matched the results as coloured bars, and you can show up to 250 results per page.
- Try a meta-crawler -- Metacrawler allows you to search 5 big SEs in parallel and merge the results. This has the advantage of leveraging different scoring methods but can lose important results lower down in the replies, and there are only three query types you can use.
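The idea behind a meta-crawler can be sketched in a few lines of Python. This is not how Metacrawler itself combines results, which the article does not describe; it is a hypothetical scheme in which each engine's ranked list gives a URL a score of 1/rank and the scores are summed. The engine names and URLs are invented for the example.

```python
from collections import defaultdict


def merge_results(result_lists, limit=10):
    """Merge ranked URL lists from several engines into one ranked list.

    result_lists maps an engine name to its URLs, best first.
    Each URL scores 1/rank per engine (an assumed, simple scoring rule).
    """
    scores = defaultdict(float)
    for urls in result_lists.values():
        seen = set()
        for rank, url in enumerate(urls, start=1):
            if url in seen:  # ignore duplicates within one engine's list
                continue
            seen.add(url)
            scores[url] += 1.0 / rank
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return ranked[:limit]


# Hypothetical results from three engines for the same query.
RESULTS = {
    "engine_a": ["http://example.org/hub", "http://example.org/x", "http://example.org/guide"],
    "engine_b": ["http://example.org/guide", "http://example.org/hub"],
    "engine_c": ["http://example.org/hub", "http://example.org/y"],
}

for url in merge_results(RESULTS, limit=2):
    print(url)
```

With `limit=2` only the two pages that several engines agree on survive, and the pages found by a single engine are dropped. That is both the strength of merging and the way it can lose results that sit lower down in the replies.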
"So what search engine do you use?"
HotBot mostly, with all the options on; AltaVista for coverage and others as need be.
HotBot has loads of juicy features, is large enough and crawls often enough to keep the index fresh. One downside is that it is US-based and has no local partner.
Here, .com sites are in the USA and .uk sites are in the UK unless otherwise indicated.
Note: Since the merger with Compaq, the European AltaVista service at altavista.telia.com does not seem to be updated. You will have to use the US one.
Inktomi provide the technology and data behind HotBot and also power searches at Yahoo!, CNET's Snap and Disney's Internet Guide (DIG). Searches there mostly have the same functionality as HotBot.
 The Academic Directory, HENSA Unix
I made this.
 UK Internet Sites, HENSA Unix
 National Information Services and Systems (NISS)
 Bath Information & Data Services (BIDS)
 The Electronic Libraries Programme (eLib)
 Northern Light
See last month's article for more information.
 Lycos UK, (actually in Germany)
 Yahoo! (US) Image Surfer
 BBC News
 CNN (European mirror)
 News Index
 Yahoo! UK & Ireland
 Search UK
 UK Index
 InfoSeek UK (actually in USA)
 Excite UK (actually in USA)
 EuroFerret (actually in UK)
 Search.com, CNET
 World Wide Web Virtual Library
UK Mirror: http://www.mth.uea.ac.uk/VL/Overview.html
 Search Engine Watch, Mecklermedia
 Search Engine Sizes, Search Engine Watch
 A Technique for measuring the relative size and overlap of public Web search engines, Bharat and Broder, Digital Systems Research Center, USA in Proceedings of WWW7, April 1998.
 Yahoo! Search Engine Spying Page
http://www.yahoo.co.uk/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/Indices_to_Web_Documents/Random_Links/Search_Engine_Spying/
 Search Engine Watch - Power Searching
 Authoritative sources in a hyperlinked environment, Jon Kleinberg in: Proceedings of 9th ACM-SIAM Symposium on Discrete Algorithms, 1998; also appears as IBM Research Report RJ 10076(91892) May 1997
 Planet Search
Author details
Dave Beckett
Computing Laboratory, University of Kent at Canterbury, UK
Personal Web Page: http://www.cs.ukc.ac.uk/people/staff/djb1/