Most Web sites of any size want to offer a facility to perform a free-text search of their content. While we all at least claim to believe in the possibilities of the semantic web, and take care over our navigation aids and sitemaps, we know that sooner or later our readers want to type 'hedgehog' into a search box. Yes, even http://www.microsoft.com returns plenty of hits if you try this. So how do we provide a search, in the cash-strapped Higher Education world of large, varied and unpredictable Web site setups, often containing a suite of separate servers? Brian Kelly's surveys (reported for example in Ariadne 36) have shown that there is no obvious solution. There are, perhaps, four directions we can take:
Each route has its attractions, but the latter two are particularly attractive because of the high degree of trust in, and familiarity with, Google. But the public search has problems:
So can we consider our own private Google? At Oxford, we had a Google Search Appliance (GSA) on free trial over the summer, and this article describes our experiences.
A Google Search Appliance is a self-contained sealed computer which is installed within your network; it both indexes Web sites and delivers search results in the familiar Google way. The indexing neither affects, nor is affected by, the public 'Big Brother' Google, and the box does not need to communicate with the outside world at all.
The smart bright yellow box was delivered ready to fit in a standard rack mount. After some swearing and pushing, it was fitted in next to some mail servers in about half an hour. We plugged in a monitor and watched it boot up successfully. A laptop connected directly was then used to do the initial configuration, set up an admin password and so on. Playing around with IP numbers and so on took another hour or two, but we fairly soon arrived at a running Google box, administered via a Web interface, and we did not go back to the hardware while we had the machine. Each day it sent an email report saying that it was alive and well, and a system software update was easily and successfully managed after a couple of months.
We contacted Google support by email 3 or 4 times with non-urgent queries about setup, and they responded within a day giving sensible answers. Their online forum for GSA owners is quite good. We did not have occasion to request hardware support.
All the configuration work is done via simple Web forms. Not surprisingly, however, it takes a fair amount of time and experimentation to get straight. After some initial experiments on the local site only, and making the mistake of putting the box on our internal departmental network, we switched it to being simply an ox.ac.uk machine, so that it had no departmental privileges.
The system works by having initial seed points, and then a list of sites to include when crawling. Our seed list was simply http://www.ox.ac.uk/; any site not findable from here would not be visited. The inclusion list was created by getting the registered Web server for each department or college, and then adding other machines as requested.
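The interplay between seed points and the inclusion list can be sketched in a few lines of Python. This is only an illustration of the general idea, not a description of how the GSA works internally: the link graph below is a hypothetical stand-in for real HTTP fetches, and every host name other than www.ox.ac.uk is invented for the example.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real page fetches.
LINKS = {
    "http://www.ox.ac.uk/": [
        "http://www.oucs.ox.ac.uk/",
        "http://www.example.com/",        # linked, but not on the inclusion list
    ],
    "http://www.oucs.ox.ac.uk/": [
        "http://www.oucs.ox.ac.uk/help/",
        "http://www.ox.ac.uk/",
    ],
    "http://www.oucs.ox.ac.uk/help/": [],
    "http://orphan.ox.ac.uk/": [],        # on the inclusion list, but nothing links to it
}

SEEDS = ["http://www.ox.ac.uk/"]
INCLUDED_HOSTS = {"www.ox.ac.uk", "www.oucs.ox.ac.uk", "orphan.ox.ac.uk"}


def in_scope(url):
    """A URL is crawlable only if its host is on the inclusion list."""
    return urlparse(url).hostname in INCLUDED_HOSTS


def crawl(seeds, links):
    """Breadth-first walk from the seeds, following in-scope links only."""
    seen = set()
    queue = deque(url for url in seeds if in_scope(url))
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        for target in links.get(url, []):
            if in_scope(target) and target not in seen:
                queue.append(target)
    return seen


visited = crawl(SEEDS, LINKS)
```

In this sketch the orphan host is never visited even though it is on the inclusion list, because no page reachable from the seed links to it; the external site is never visited even though it is linked, because it is not on the inclusion list.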
The second part of the configuration is the list of document types to exclude. We set it to index Word and PDF files, but exclude
Some examples, which used extended regular expressions, from our configuration were:
It is likely that this list would settle down over time as Webmasters learned to understand what the box was doing.
After the configuration described in the previous section, our system settled down to indexing about 547,000 documents. The admin screens say it had an index of 33.23 GBytes. How much this would rise to if all Webmasters adjusted their content seriously, and as more facilities came on line, is hard to say, but it seems reasonable to suggest that a capacity of between 500,000 and 750,000 documents would provide a useful service.
The GSA indexes continuously, hitting servers as hard as you let it, and has a scheme to determine how often to revisit pages. Generally speaking, it seldom lags more than a few hours behind page changes. It is possible to push pages into the indexing queue immediately by hand, and to remove them from the index by hand. Most Web pages at Oxford seem to be pretty static, judging by the rather low crawl rate after a few weeks of operation.
Changes to the setup take a while to work their way through the system (about 24 hours to get to a clean state after removing a site), but individual sites can be blocked at any time from the search, or individual pages refreshed instantly.
We did not make much adjustment for individual hosts. The default is to not make more than 5 concurrent connections to a given Web server, but for 5 of our servers this was reduced to 1 concurrent session after complaints from Webmasters.
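A per-host connection cap of this kind can be sketched with one semaphore per host. The GSA's own mechanism is not documented here, so the class below is an assumption-laden illustration: the class name, the host names and the simulated fetch are all hypothetical, and only the limits of 5 and 1 come from our configuration.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

DEFAULT_LIMIT = 5                                   # default concurrent connections per host
HOST_OVERRIDES = {"fragile.example.ox.ac.uk": 1}    # hypothetical host reduced to 1


class HostThrottle:
    """Caps the number of simultaneous requests made to each host."""

    def __init__(self, default_limit, overrides):
        self._default = default_limit
        self._overrides = overrides
        self._lock = threading.Lock()
        self._semaphores = {}
        self._active = defaultdict(int)
        self.peak = defaultdict(int)    # highest concurrency observed per host

    def _semaphore(self, host):
        with self._lock:
            if host not in self._semaphores:
                limit = self._overrides.get(host, self._default)
                self._semaphores[host] = threading.BoundedSemaphore(limit)
            return self._semaphores[host]

    def fetch(self, host, work):
        """Run work() while holding one of the host's connection slots."""
        with self._semaphore(host):
            with self._lock:
                self._active[host] += 1
                self.peak[host] = max(self.peak[host], self._active[host])
            try:
                return work()
            finally:
                with self._lock:
                    self._active[host] -= 1


def demo():
    throttle = HostThrottle(DEFAULT_LIMIT, HOST_OVERRIDES)
    hosts = ["www.example.ox.ac.uk", "fragile.example.ox.ac.uk"]
    # The sleep stands in for a real page fetch.
    with ThreadPoolExecutor(max_workers=16) as pool:
        for host in hosts:
            for _ in range(20):
                pool.submit(throttle.fetch, host, lambda: time.sleep(0.01))
    return dict(throttle.peak)


peaks = demo()
```

Even with 16 worker threads available, the recorded peak for the overridden host stays at 1, and the peak for the other host never exceeds the default of 5.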
At the heart of the Googlebox setup are collections. These are specifications of subsets of the index, allowing departments and colleges to have their own search. The configuration for a collection has URL patterns to include, and patterns to exclude. Setting up an individual Web site to have its own search form accessing the GSA is easy. Here is the one for our Computing Services Web servers:
<form method="GET" action="http://googlebox.ox.ac.uk/search">
  <span class="find"> <label for="input-search">Find </label> </span><br/>
  <input type="text" name="q" id="input-search" size="15" maxlength="255" value="Enter text"/> <br/>
  <span class="find"><label for="select-search">In </label></span> <br/>
  <input type="radio" name="site" value="default_collection" />Oxford <br/>
  <input type="radio" name="site" value="OUCS" checked="checked" />OUCS
  <input type="hidden" name="client" value="default_frontend"/>
  <input type="hidden" name="proxystylesheet" value="default_frontend"/>
  <input type="hidden" name="output" value="xml_no_dtd"/>
  <input type="submit" name="btnG" value="Go"/>
</form>
The key parameters here are site, which determines the collection to be searched, and proxystylesheet, which determines how the results will be rendered. The results from the Googlebox come in XML, which can be processed by your own application, but are normally filtered through an XSLT transform. This is fairly easy to understand and modify, either with direct editing or with Web forms.
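For an application that wants to process the XML itself, the same request can be built in Python. The sketch below reuses the parameter names from the form and leaves out proxystylesheet, on the assumption that the box then returns its raw XML rather than rendered HTML. The sample response is invented for illustration, and the element names in it (GSP, RES, R, U, T) are my assumption about the result format, so check them against real output before relying on them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_search_url(base, query, collection, frontend="default_frontend"):
    """Build a search URL using the same parameters as the HTML form."""
    params = {
        "q": query,
        "site": collection,        # which collection to search
        "client": frontend,
        "output": "xml_no_dtd",
    }
    return base + "?" + urlencode(params)


def parse_results(xml_text):
    """Return (url, title) pairs; element names are assumed, not verified."""
    root = ET.fromstring(xml_text)
    return [(r.findtext("U"), r.findtext("T")) for r in root.iter("R")]


# Invented sample response, for illustration only.
SAMPLE_RESPONSE = """
<GSP>
  <RES>
    <R N="1"><U>http://www.example.ox.ac.uk/hedgehogs.html</U><T>Hedgehogs</T></R>
    <R N="2"><U>http://www.example.ox.ac.uk/wildlife/</U><T>Wildlife</T></R>
  </RES>
</GSP>
"""

url = build_search_url("http://googlebox.ox.ac.uk/search", "hedgehog", "OUCS")
results = parse_results(SAMPLE_RESPONSE)
```

Fetching the URL, for instance with urllib.request.urlopen, is deliberately left out so that the sketch runs without access to the box.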
The different front ends can be customised in a variety of ways. One is to force a priority result for a given search term. For example, we set this for keywords 'open source' and 'webmail', so any use of those words forced chosen sites to the top.
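The effect of a forced priority result can be mimicked in a few lines of Python. This is a sketch of the behaviour as seen by the reader, not of the GSA's configuration format; the two keywords are the ones we used, but the target URLs and the function name are hypothetical.

```python
# Keywords from our trial, mapped to hypothetical promoted URLs.
PRIORITY = {
    "open source": "http://opensource.example.ox.ac.uk/",
    "webmail": "http://webmail.example.ox.ac.uk/",
}


def promote(query, results):
    """Move the promoted URL for any matching keyword to the top of the list."""
    lowered = query.lower()
    promoted = [url for keyword, url in PRIORITY.items() if keyword in lowered]
    rest = [url for url in results if url not in promoted]
    return promoted + rest


ranked = promote(
    "Webmail login",
    ["http://www.example.ox.ac.uk/email/", "http://webmail.example.ox.ac.uk/"],
)
```

Here the promoted site appears first whatever its natural rank, and it is not repeated further down the list.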
Collections can be administered on the GSA admin pages by non-privileged users who can also see search logs and analyses.
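The idea of a collection as a set of include and exclude patterns over the index can be illustrated as follows. The GSA has its own pattern syntax, so ordinary Python regular expressions are used here as a stand-in; the collection names come from our search form, but the patterns and URLs are invented for the example.

```python
import re

# Hypothetical collection definitions: a URL belongs to a collection if it
# matches at least one include pattern and no exclude pattern.
COLLECTIONS = {
    "default_collection": {
        "include": [r"^http://[^/]+\.ox\.ac\.uk/"],
        "exclude": [],
    },
    "OUCS": {
        "include": [r"^http://www\.oucs\.ox\.ac\.uk/"],
        "exclude": [r"/private/"],
    },
}


def in_collection(url, spec):
    included = any(re.search(pattern, url) for pattern in spec["include"])
    excluded = any(re.search(pattern, url) for pattern in spec["exclude"])
    return included and not excluded


def collections_for(url):
    """Names of every collection whose patterns admit this URL."""
    return sorted(name for name, spec in COLLECTIONS.items()
                  if in_collection(url, spec))


public_page = collections_for("http://www.oucs.ox.ac.uk/help/")
private_page = collections_for("http://www.oucs.ox.ac.uk/private/notes.html")
outside_page = collections_for("http://www.example.com/")
```

A page can belong to several collections at once, which is what lets a departmental search and the university-wide search share a single index.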
During our trial we had a meeting with representatives from Google and discussed the main difficulties which had arisen during the trial. These were:
Some of these problems would likely arise at other similar sites.
It is straightforward to describe the Google Search Appliance:
Whether the GSA is the right solution depends on a number of factors; plainly, one is money, but that depends on the relative importance we all attach to our searching. More important is the setup in which a GSA would be placed, and what it would be expected to do. To provide an intranet index of a single server, it will provide an instant working system. If there is no restricted content, or the restricted content is all on one server, it is easy to use the box to provide external and internal searching. If, however, you have private pages scattered everywhere, using IP ranges to restrict access, the GSA is not necessarily an ideal system.
We decided in the end not to go with the GSA at Oxford, but it would be quite surprising not to see some of the nice yellow boxes emerging in the UK educational community soon.
I would like to thank Google for their patience and helpfulness during our trial. They asked me not to put any screen dumps in this article, or quote prices, but have otherwise encouraged me to discuss our experiences.