Book Review: The Invisible Web
- Chris Sherman and Gary Price
- The Invisible Web: Uncovering Information Sources Search Engines Can’t See
- CyberAge Books, 2001. ISBN 0-910965-51-X
- Price: $29.95
I first became interested in the Invisible Web after seeing Chris Sherman and Gary Price speak at the Internet Librarian International Conference in March this year. In their words, “The Invisible Web consists of material that general purpose search engines cannot or will not include in their collection of Web pages.” If the resources currently available from search engines are the tip of an iceberg, the Invisible Web is all that lies beneath the surface of the water. The idea of a huge hidden Web available for us to explore is enough to get any information professional excited.
Several months later The Invisible Web arrived, and I’ll admit that after reading the blurb on the back I began to doubt whether a guide to the Invisible Web would actually be that unique or relevant. The press release claims that it is “the first handbook and directory for information users who wish to utilize Invisible Web resources systematically in order to improve the quality and effectiveness of their online research”. It all sounded just too much like another list of useful resources and key sites.
In spite of the hype, The Invisible Web pleasantly surprised me by being both informative and useful. However, although both are of interest, it isn’t the directory or the explanation of the Invisible Web itself that makes this book worth buying. The essence of this text lies in its discussion of the art of searching. It starts off by leading us on a whistle-stop tour from the creation of the Internet in the 1960s, through the evolution of search engines, to the Web as it is today. Once search engines have been put into context, current searching problems are discussed and consideration is given to the ways in which we can improve our searching. It is this deeper understanding of ‘how the Web is’ that promises to help a searcher succeed, rather than the list of often out-of-date URLs at the back.
After a shaky start, with a number of repeated paragraphs in the foreword (written by Danny Sullivan, editor of SearchEngineWatch.com), the book is well laid out. Sherman and Price have split the main body of the book into three subject areas. The first, and definitely the most interesting for me, considers the Internet and searching; the second introduces the Invisible Web; and the third is a directory of resources. The resources are also available online, with further information about the book, from The Invisible Web site (1).
The authors of The Invisible Web are Chris Sherman, the associate editor of SearchEngineWatch.com, and Gary Price, the creator of the well-known Web research tool Price’s List. Both authors are experts on search engines and well qualified to write about this subject area, yet their writing throughout the book is informal and easy to follow. The end result is accessible to any level of Web user, from novice to experienced information professional. No prior knowledge is expected except maybe a few hours’ ‘surfing’ time. To help those less clued up on the Web, the authors spend some time in the early chapters defining key terms. These definitions are given in separate text boxes for the sake of clarity. Every chapter also contains one or two separate myth boxes where a general belief about the Web, such as ‘All search engines are alike’, is quashed.
Although I have read a few books on the evolution of the Web, this was the first time I’d read anything that considered it from a search engine perspective, and I found the different approach quite refreshing. That said, I am sure there are a number of other books dedicated to this subject area, and I’d be interested in knowing how much crossover there is between The Invisible Web and these texts.
In the first three chapters the authors fill us in on the history of the Internet and the eventual need for search technologies. In the early days of the Internet, searching usually took the form of sending a request for help to an email list (how much has actually changed, we ask ourselves?). With the introduction of anonymous FTP servers, a type of centralised file server enabling file sharing across networks, the situation greatly improved: a directory listing of all the files stored on the server could be shown in the form of an index. An interesting resource listed in the book is the first Web directory, available from the W3C site (2). As the Internet began to grow, researchers began work on various Internet search tools. The first was Archie, created at McGill University, Montreal, in 1990. Gopher was created at the University of Minnesota, and its interface is a precursor of popular Web directories like Yahoo. Another search tool, WAIS (the Wide Area Information Server), was the first natural language search engine. Sherman and Price are keen to point out that the fundamental problem with searching, in the past and today, is that no one is in charge of the Web, which means there is no central authority to maintain an index.
Throughout their introduction to the Internet and the Web, Sherman and Price offer us a selection of information and trivia. For example, have you ever considered where the major search engines get their names from? Yahoo originally stood for Yet Another Hierarchical Officious Oracle, and Lycos is actually short for Lycosidae Lycosa, a type of wolf spider that catches its prey by pursuit rather than relying on a web.
After their brief history lesson they consider the services currently available on the visible Web by looking at search engines and directories and discussing the difference between browsing and searching. They give a fair explanation of how directories and search engines work and consider the pros and cons of using either to find information. They end their look at current search engines with a glimpse at specialised and hybrid search tools. URLs are given for meta search engines (search engines that search across search engines), value-added search services, browser agents (handy add-on tools which must be downloaded), and client-based Web search engines or bots such as Copernic (3), which is definitely worth a look.
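The idea behind a meta search engine can be illustrated with a minimal Python sketch. It is not taken from the book: the function name `merge_results` and the example URLs are hypothetical, and real meta search engines use far more elaborate ranking. The sketch simply interleaves the ranked result lists returned by several engines and drops duplicates.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave several ranked URL lists round-robin, dropping duplicates.

    Hypothetical helper for illustration only; each argument stands in for
    the ordered results returned by one search engine.
    """
    merged, seen = [], set()
    # zip_longest pads the shorter lists with None so no result is lost.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Made-up result lists from two imaginary engines.
engine_a = ["http://example.org/a", "http://example.org/b"]
engine_b = ["http://example.org/b", "http://example.org/c", "http://example.org/d"]

merged = merge_results(engine_a, engine_b)
print(merged)
```

A page found by both engines appears only once, at the position of its first appearance in the interleaved order.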
From chapter 4 onwards the authors move on to look specifically at the Invisible Web, explaining that many people are unaware that much of the authoritative information accessible over the Internet is invisible to the key search engines.
They consider four levels of invisibility:
- The Opaque Web – files such as PDFs and Word documents that search engines could index but don’t.
- The Private Web – technically indexable Web pages that have been excluded from search engines by being password protected or through the use of a robots.txt file.
- The Proprietary Web – pages only available to people who have agreed to certain terms in exchange for viewing them. Normally some form of registration is needed, which may be free or cost money.
- The Truly Invisible Web – pages that cannot be seen for technical reasons, lack of metadata etc. The majority of pages generated dynamically from content-rich databases fall into this category.
Sherman and Price explain that much of this invaluable material comes from universities, libraries, associations, businesses, and government agencies around the world, and is held in database format.
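The robots.txt exclusion mentioned under the Private Web can be illustrated with a short Python sketch using the standard library’s `urllib.robotparser`. The robots.txt contents, the crawler name `ExampleBot` and the example.org URLs are all assumptions made up for illustration; they do not come from the book.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that asks all crawlers to stay out of /members/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def is_crawlable(robots_txt, url, user_agent="ExampleBot"):
    """Return True if a well-behaved crawler may fetch the URL.

    Hypothetical helper: parses the robots.txt text directly rather than
    downloading it, so the example runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public = is_crawlable(ROBOTS_TXT, "http://www.example.org/index.html")
private = is_crawlable(ROBOTS_TXT, "http://www.example.org/members/report.html")
print(public, private)
```

Pages under the disallowed path remain reachable in a browser but are skipped by crawlers that honour the file, which is how they end up invisible to search engines.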
Later in their discussion of the Invisible Web the authors make the point that while searching you need to assess the quality of information and watch for bias. Some of the key methods used when establishing the credibility of a resource are looking at its URL to see if it is an established site, considering the author information and making sure that the site is up to date. Credibility is one advantage Invisible Web resources have over many of the other sites a search engine would recommend. Invisible Web searching also makes sense when looking for specialised content, because the search results are more focused and there is usually a specialised search interface giving you more control over your search input, e.g. limiting a search to particular historical eras.
Prior to the directory listing there is a list of FAQs on the Invisible Web and a number of case studies of Invisible Web searching. The case studies don’t strike me as being very useful, unless you are looking for information on one of the subject areas covered (such as historical stock quotes) or are based in the USA. There is also a chapter on the future of the Invisible Web and searching. Sherman and Price’s main predictions are that more smart crawlers and targeted crawlers will spider the Web and that there will be more specialised search engines. The authors also take this opportunity to mention metadata standards, Resource Description Framework (RDF) and XML. Unfortunately there is a disappointing lack of depth in the discussion of the promise and pitfalls of metadata in the future. There is also a very brief mention of image handling processes; again there is no depth, and no discussion of Scalable Vector Graphics (SVG), Portable Network Graphics (PNG) or other types of image handling (4). They also touch briefly on wrapper induction techniques, where software probes a database and acts on the results. Use of such software may ultimately mean that search engines will be able to index databases.
Chris Sherman and Gary Price’s mission is apparently to “save you time and aggravation, and help you succeed in your information quest”. It’s an ambitious plan, and you might need a little more help than just an organized bookmark list, which is really what their directory is, though there are some exceptional sites listed. Criticism aside, this book will make you think about why, how and what you search, so it’s definitely a step in the right direction.
- (1) Invisible Web site http://www.invisible-web.net
- (2) The first Web directory - Information by Subject http://www.w3.org/History/19921103-hypertext/hypertext/DataSources/bySubject/Overview.html
- (3) Copernic http://www.copernic.com