Search Engines: Robots, Spiders and Your Website
This issue I thought I'd take a look at a subject which is of absolute importance to those of us who use search engines, but which we know virtually nothing about: how do web pages end up in search engine indexes?
If you're short on time, the quick summary is that search engines of the free text variety (rather than the Index/Directory type) employ specialised utilities which visit a site, copy the information they find back to base, and then include this information the next time that they update their index for users. OK, that's it. You can move on to the next article now.
Oh, you're still here! Well, in that case, let's look at the entire issue in a little more detail. These utilities are often called robots, spiders or crawlers. As already described, they reach out and grab pages from the Internet, and if it's a new page, or a page that has been updated since the last time that they visited, they will take a copy of the data. They find these pages either because the web author has gone to a search engine and asked for their site to be indexed, or because the robot has found the site by following a link from another page. As a result, if the author doesn't tell the engines about a particular page, and no other page links to it, it's highly unlikely that the page will be found.
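To make the link-following idea concrete, here is a minimal sketch in Python. It is not how any particular search engine works; it simply walks a small, made-up site held in memory (the page names and the `crawl` function are invented for illustration) and shows that a page which nothing links to is never reached.

```python
from collections import deque

def crawl(links, seeds):
    """Breadth-first walk over a link graph, starting from the seed pages.

    links maps each page to the list of pages it links to.
    Returns the pages in the order a robot following links would find them.
    """
    seen = set(seeds)
    queue = deque(seeds)
    found = []
    while queue:
        page = queue.popleft()
        found.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return found

# Hypothetical site: /orphan.html links out, but nothing links to it.
SITE = {
    "/index.html": ["/articles.html", "/contact.html"],
    "/articles.html": ["/index.html", "/articles/spiders.html"],
    "/contact.html": [],
    "/articles/spiders.html": [],
    "/orphan.html": ["/index.html"],
}

# The author submitted only the home page to the search engine.
found = crawl(SITE, ["/index.html"])
print(found)
print("/orphan.html" in found)
```

In this sketch the orphan page stays out of the results unless it is added to the seed list, which corresponds to the author submitting it to the search engine directly.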
Robots are working all the time; the ones employed by AltaVista, for example, will spider about 10,000,000 pages a day. If your website has been indexed by a search engine, you can be assured that at some point a robot has visited your site and, by following all your links, will have copied all the pages that it can find. It might not do this in one go; if your site is particularly large, for example, fetching everything at once could put something of a strain on the server, which wouldn't please your technical people. Many robots therefore stagger their visits over a period of several days, indexing just a few pages at a time until they've taken copies of everything that they can. Also, if you have ever submitted to a search engine that says it will instantly register and index your site, in actual fact it won't; it will make a preliminary visit and make a note to come back and grab the rest of your data at a later date.
How can you tell if your site has been visited by one of these robots? The answer, as with most answers related to the Internet, is "it depends". Obviously one way is to go to a search engine and run a search for your site; if it's retrieved, then your site has been visited. An easy way of doing this at somewhere like AltaVista, for example, is to run a search for:
host: <URL of your site> such as host:philb
and if you get some results back, you know that your site has been indexed. (It might be worth checking at this point, however, just to make sure that the search engine has all of your pages, and that they are the current versions.)
However, this is a pretty laborious way of checking; what is much more sensible, and easier, is to look at the log files that are automatically kept for your site. As you may know, when you visit a site your browser requests some data, and details of these transactions are kept by the host server in a log file. This file can be viewed using appropriate software (usually called an Access Analyser) and will provide information such as the IP address of the machine that made the request, which pages were viewed, the browser being used, the operating system, the domain name, the country of origin and so on. In exactly the same way that browsers leave this little trace mark behind them, so do the robot programs.
Consequently, if you know the name of the robot or spider that a particular search engine employs, it should be a relatively easy task to identify it. Well (and here's that answer again), it depends. If your site is popular your log files will be enormous, and it could take a very long time to work through them to find names that make some sense to you. Added to this is the fact that there are over 250 different robots in operation, and some, many or none of them might have visited in any particular period of time. So it's not a perfect way of identifying them. Besides, new robots are being introduced all the time and old ones may change their names, all of which can make the whole process much more difficult.
There is, however, a simpler solution, which makes a refreshing change! Before I tell you what it is, a quick diversion. It's quite possible that, if you're a web author, you might not want the robots to visit certain pages on your site: they might not be finished, or you might have some personal information that you don't want everyone to see, or a page may only be published for a short period of time. Whatever the reason, you don't want the pages indexed. The people who produce search engines realise that as well, so a solution was created to allow an author to tell the robots not to index pages, or not to follow certain links, or to ignore certain subdirectories on servers: the Robot Exclusion Standard. This is done in two ways: either by using the meta tag facility (a meta tag being an HTML tag that is inserted into a page and which is read by search engines, but is not displayed to someone viewing the page in their browser), or by adding a small text file at the top level of the space allocated to you by your web hosting service.
It is this second method that is of interest to us. Since search engine robots know that authors might not want them to index everything, when they visit sites they will look for the existence of this file, called robots.txt, just to check whether they are allowed to index a site in its entirety. (For the purposes of this article it's not necessary to go into detail about what is or is not included in the robots.txt file, but if you're interested in this, a good page to visit is the description of it at AltaVista (http://doc.altavista.com/adv_search/ast_haw_avoiding.html). Alternatively, if you want to read the nitty gritty, you might want to look at A Standard for Robot Exclusion (http://info.webcrawler.com/mak/projects/robots/norobots.html).) Admittedly, not all robots look for this file, but the majority do, and will abide by what they find.
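The article does not go into what a robots.txt file contains, but as a rough illustration, the sketch below uses Python's standard urllib.robotparser module to show how a well-behaved robot might consult one before fetching a page. The file contents, the directory names and the www.example.com address are all invented for the example; only the robot name Scooter comes from the article, and its real behaviour is not being claimed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is asked to stay out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A polite robot would make this check before requesting each page.
home_ok = is_allowed(ROBOTS_TXT, "Scooter", "http://www.example.com/index.html")
draft_ok = is_allowed(ROBOTS_TXT, "Scooter", "http://www.example.com/drafts/new.html")
print(home_ok, draft_ok)
```

Nothing in the file enforces these rules; it only states the author's wishes, which is why a robot that never asks for the file can ignore it completely.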
Right, that's the quick diversion out of the way, so back to the article. When you look at your statistics you should pay particular attention to any requests for the robots.txt file, because requests for it will come, almost without exception, from the robot and spider programs; it's not the sort of thing that a browser will go looking for. It should then be a much simpler matter to identify which search engines have visited your site over the period in question. If you see that Scooter has requested the file you can then track that back to AltaVista, once you know enough to link Scooter to that search engine, of course. A very useful site which lists details on over 250 robots and spiders is The Web Robots Page at http://info.webcrawler.com/mak/projects/robots/robots.html and just reading the list of names can be quite fascinating: I noticed one called Ariadne, for example (part of a research project at the University of Munich), another called Dragon Bot (collects pages related to South East Asia) and Googlebot (no prizes for guessing where that one comes from!).
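If you have the raw log file rather than a ready-made report, picking out the robots.txt requests can be sketched in a few lines of Python. This assumes the server writes the widely used Combined Log Format (your server may be configured differently), and the sample entries, IP addresses and robot names below are invented purely for illustration.

```python
import re
from collections import Counter

# Assumed layout (Combined Log Format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

def robots_txt_requesters(log_lines):
    """Count requests for /robots.txt, grouped by the user-agent that made them."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("path") == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts

# Made-up log entries: two robots and one ordinary browser.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Mar/2000:10:15:32 +0000] "GET /robots.txt HTTP/1.0" 404 209 "-" "ExampleSpider/1.0"',
    '192.0.2.10 - - [01/Mar/2000:10:15:40 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "ExampleSpider/1.0"',
    '198.51.100.7 - - [01/Mar/2000:11:02:03 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"',
    '203.0.113.5 - - [02/Mar/2000:03:44:19 +0000] "GET /robots.txt HTTP/1.0" 404 209 "-" "AnotherBot/0.3"',
]

visitors = robots_txt_requesters(SAMPLE_LOG)
for agent, hits in visitors.items():
    print(agent, hits)
```

The names that come out are whatever each robot chose to call itself, so they still need to be matched against a list such as The Web Robots Page to find out which search engine they belong to.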
Are there any disadvantages to using this approach? Yes, you've guessed it, the answer is once again "it depends". If you only have very limited access to statistical data it is possible that you will get an artificially high count of users, when in actual fact the number of real users is much lower. Unless you extract more information from your statistics it's going to be very difficult to separate the real users from the spiders. Some people do claim that the spiders cause problems because of the bandwidth they use to collect their data, particularly if they use the rapid fire approach of attempting to get a large number of pages in a very short space of time. This in turn leads to poor retrieval times for real people who want to view pages on a site. Since anyone with sufficient knowledge can create spiders or robots, their number will only increase in the future, and although the robots.txt file is of some use in this case, there is no requirement or standard that obliges all robots to adhere to it; some may very well ignore it completely. These problems have been addressed in an interesting article written by Martijn Koster, which, although now several years old, still makes relevant and interesting reading: http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html
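One rough way of separating the two kinds of visitor, sketched below in Python, is to treat any client that has asked for robots.txt as a robot and count its other requests separately. This is a heuristic rather than a reliable test, since robots that never request the file will be counted as people. It again assumes Combined Log Format, and the log entries are invented for illustration.

```python
import re

# Assumed layout (Combined Log Format); only host, path and user-agent are kept.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

def split_hits(log_lines):
    """Return (robot_hits, other_hits).

    A client is identified by its host and user-agent, and is treated as a
    robot if it requested /robots.txt anywhere in the log.
    """
    parsed = []
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match:
            parsed.append((match.group("host"), match.group("agent"), match.group("path")))
    robots = {(host, agent) for host, agent, path in parsed if path == "/robots.txt"}
    robot_hits = sum(1 for host, agent, _ in parsed if (host, agent) in robots)
    return robot_hits, len(parsed) - robot_hits

# Made-up log entries: one robot making three requests, one browser making two.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Mar/2000:10:15:32 +0000] "GET /robots.txt HTTP/1.0" 404 209 "-" "ExampleSpider/1.0"',
    '192.0.2.10 - - [01/Mar/2000:10:15:40 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "ExampleSpider/1.0"',
    '192.0.2.10 - - [01/Mar/2000:10:15:48 +0000] "GET /articles.html HTTP/1.0" 200 7300 "-" "ExampleSpider/1.0"',
    '198.51.100.7 - - [01/Mar/2000:11:02:03 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"',
    '198.51.100.7 - - [01/Mar/2000:11:03:10 +0000] "GET /articles.html HTTP/1.0" 200 7300 "/index.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"',
]

robot_hits, other_hits = split_hits(SAMPLE_LOG)
print(robot_hits, other_hits)
```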
However, like them, loathe them or be completely indifferent to them, spiders are one of the most important ways that we have of being able to access the information that's out there.
Phil Bradley is an Independent Internet Consultant.