Web Magazine for Information Professionals

Custom-built Search Engines

Phil Bradley reviews a means of enhancing the relevance of search results through the use of custom-built search engines.

I’ve mentioned custom-built search engines a couple of times in the past in my Ariadne columns, so it would seem to make sense actually to spend a little time looking at exactly what they are and how you might use them. This article will cover the major contenders and provide an overview of how to create and use them, as well as answering the basic question of why you should.

Why Create Your Own Search Engines?

Given that there are thousands of search engines already available, it might, at first glance, seem to be slightly puzzling why people are spending time and energy creating even more of them. The reason is quite simple – search engines can only do so much, and unfortunately that’s a good deal less than we imagine. Any search engine is trying to do the impossible: namely, to match up your queries with Web pages that will answer the query, yet without really knowing anything about either - other than what it is able to guess. Quite frankly, I find it remarkable that they’re able to work as well as they do, and it’s a tribute to their designers and engineers.

However, all search engines have limitations – lots of them. Let’s take Google as an example. Now, I’m not particularly savaging Google here – I could make the same case for many of the other engines that are out there; but let’s face it – Google is the one that most people are familiar with. For any search you care to do, Google will give you thousands, if not millions, of results. This is neither reassuring nor helpful, especially since very few of us go past the first page of results unless we’re really desperate. In fact, the sheer size of the database makes it harder to get a good result, given that the search engine has to weigh the pros and cons of each page against the others before making a decision as to which result is ranked where.

It’s still possible for people to manipulate the results that Google returns – so-called ‘Google bombs’ – the best example of which was ‘miserable failure’, which took searchers to President George W. Bush’s biography [1], although this no longer works. However, if you know how the search engine ranks results, and many do, it’s possible to get a top ten result even if the page is not particularly relevant to the search term entered. A useful, if depressing, example is a search for Martin Luther King: one of the top results returned by Google is a racist Web site that has a high ranking simply because so many sites link to it as an illustration of the fact that you cannot trust the content you find on the Internet. Some celebrities have been taking their own measures by employing firms to ensure that the first ten results on their name in Google are all positive; the material about drug-taking or broken relationships slides down onto the second results page.

More examples can of course be cited, but I think the point is made: just because a page gets a high ranking in a search engine doesn’t mean that it’s any good. That’s not a particular problem if you are already an expert in a subject area; but if you’re new to that subject, or indeed new to searching itself, these distortions can become a real problem. Not only are there too many results, but inexperienced searchers are also not in a position to decide which results to trust. Moreover, they may have an interest in a particular type of site, such as academic sites or government sites; unless they know the syntax to use in order to narrow a search down to just those, any search is going to be a good deal less than optimal.

How Customised Search Engines Make Things Easier

Putting it as simply as possible, a customised search engine automates a search such as (site a OR site b OR site c etc) AND search_term. That is to say, we start by creating our own sample of Web sites (or indeed Web pages) and simply run searches based on that sample or search universe. Now of course I can just go right ahead and do that by hand, but if I want to search a large number of sites (but still apply my own particular search criteria) this may take a considerable amount of time. Furthermore, it’s not particularly friendly, in that I can’t easily share such a search or provide access to it in any way. However, a customised search engine can be used to create and populate one’s own search universe with just those sites in which one is interested. Moreover, the service will generally host the search engine (providing searchers with a URL they can share with others), as well as supplying the code they need to embed the engine on a Web page, blog, start page and so on.

Consequently you can create your own portable search engines and the uses are almost endless. You could create a search engine to help answer a query for a user, and either provide them with the URL of said engine, or embed it onto a web page. If you run a subject-specific Web site or weblog it makes sense to embed one or more search engines to search for that subject – probably including your own Web site as well. Of course, you might still need to run searches for the user, but equally you can point out that a search run with that search engine is going to give a small number of high-quality results.

It doesn’t actually take very long to create a search engine – in fact what usually takes up most time is typing in the list of URLs you want to be included in the search! However, this process can be made a little more painless if you collect a number of them prior to starting. Ideally if you can find a Web page that lists them (such as the URLs of all the UK universities for example) you can use a free tool such as the Link Extractor [2], which will create a listing you can simply cut and paste. One point worth mentioning at this juncture is that your search engine can only find material that has already been indexed by the search engine; so this will exclude the hidden web, databases that are password-protected, and so on.
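To make the two ideas above a little more concrete, here is a minimal sketch in Python of what a link extractor and a hand-built ‘search universe’ query might look like. It is an illustration only, not the code behind the Link Extractor or any of the services discussed here: the function names, the sample page and its example.org addresses are all invented, and the query assumes an engine that supports a ‘site:’ operator and treats a space between terms as AND.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute http(s) links found in <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                absolute = urljoin(self.base_url, value)
                is_web = urlparse(absolute).scheme in ("http", "https")
                if is_web and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique Web links in a page, in the order they appear."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def build_query(urls, search_term):
    """Build '(site:a OR site:b) search_term' from a list of URLs."""
    hosts = []
    for url in urls:
        host = urlparse(url).hostname
        if host and host not in hosts:
            hosts.append(host)
    if not hosts:
        raise ValueError("no sites to search")
    return "(" + " OR ".join("site:" + h for h in hosts) + ") " + search_term


# Hypothetical listing page, invented for this example.
SAMPLE_PAGE = """
<ul>
  <li><a href="http://www.example.ac.uk/">Example University</a></li>
  <li><a href="http://library.example.org/catalogue">Example Library</a></li>
  <li><a href="mailto:someone@example.org">Contact</a></li>
</ul>
"""

links = extract_links(SAMPLE_PAGE, "http://www.example.org/list.html")
print("\n".join(links))                  # a list ready to cut and paste
print(build_query(links, "open access"))
```

With only two sites the resulting query is easy enough to type by hand; with a few hundred it is not, and it cannot be shared or embedded, which is precisely the gap that the hosted services fill.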

The Options

Rollyo

Rollyo (short for ‘Roll your own (search engine)’ [3]) was the first of these resources to come online. It makes use of the Yahoo database, so you might want to run some sample searches on that to begin with to make sure that you’re happy with the results that you’re getting. While you can actually make search engines (or as they prefer to put it, searchrolls) without registering, it makes sense to do so if you’re intending to use it more than a couple of times.

The process for creating your customised search engine is virtually identical across all the various options; so to save repetition I’ll explain it in detail once here and simply identify any differences when looking at the other options. You have to give the engine a name (Rollyo has a limit of 20 characters, which is annoying, but manageable), and a brief description. Next, you list all the Web addresses of the sites that you want to search on – your own search ‘universe’. Rollyo limits you to a maximum of 25 sites, which is disappointing, but if you only have a small universe it’s not a nuisance. You can then choose a category and keywords if you wish to include the searchroll in their public collection, or you can ignore this step if you want to keep it private.

That’s pretty much all there is to it! You can see a Rollyo Web 2.0 search engine that I created in about 5 minutes [4]. Alternatively, I could take the code as provided by Rollyo and put the same search engine anywhere else that I chose. Moreover, Rollyo has produced a bookmarklet [5] that allows users to add sites quickly to their searchrolls, to have immediate access to them, and to create a new one from anywhere.

Rollyo is a robust service and has been used by thousands of people. If you just want to explore quickly the possibilities afforded to you by resources like this, I’d certainly recommend it.

Google Custom Search Engines

Before you can create a search engine using the Google resource you need to have an account with them, which is free, quick and easy to set up. The process for this utility differs in a few small, but important ways from Rollyo. First, and unsurprisingly, this one uses the Google database, not Yahoo. It’s also rather more powerful, in that there is no limit on the number of sites you can include in the search universe. This is helpful if you want to create a search engine that would search all UK university sites for example. A third difference is that you can limit your search to just the sites that you have listed, or simply give them priority in a ‘normal’ Google search – essentially what you’d be doing here would be to get Google to re-rank results in a limited way.
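The difference between limiting a search to your listed sites and merely giving them priority can be sketched in a few lines of Python. This is an illustration of the idea only and makes no claim about how Google actually implements either option; the function names and the sample result list are invented for the purpose.

```python
from urllib.parse import urlparse


def host_matches(url, sites):
    """True if the URL's host is one of the listed sites or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == site or host.endswith("." + site) for site in sites)


def apply_universe(results, sites, mode="restrict"):
    """Apply a list of preferred sites to an ordered list of result URLs.

    mode="restrict": keep only results from the listed sites.
    mode="boost":    keep everything, but move listed sites to the top,
                     preserving the engine's original order within each group.
    """
    if mode == "restrict":
        return [url for url in results if host_matches(url, sites)]
    if mode == "boost":
        # sorted() is stable, so ties keep their original relative order.
        return sorted(results, key=lambda url: not host_matches(url, sites))
    raise ValueError("mode must be 'restrict' or 'boost'")


# Invented result list standing in for a 'normal' search.
results = [
    "http://a.example.com/1",
    "http://www.ox.ac.uk/x",
    "http://b.example.org/",
    "http://cam.ac.uk/y",
]
sites = ["ox.ac.uk", "cam.ac.uk"]

print(apply_universe(results, sites, "restrict"))
print(apply_universe(results, sites, "boost"))
```

In ‘restrict’ mode the two unlisted pages vanish altogether; in ‘boost’ mode they are still returned, but below the university pages.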

Once your search engine has been created, Google provides you with a URL that you can use; the one for my Web 2.0 search engine (which is rather more effective than the Rollyo one) is http://tinyurl.com/2dztw5 and the embedded version on my Web site is at http://www.philb.com/ just down on the right-hand side. Finally, there is another version on my Web 2.0 Pagecast [6]. As you’ll see, these things are very portable and you can put them almost anywhere!

Unsurprisingly Google custom search engines have been produced in their thousands; if you want to check to see if an engine has been created so that you don’t have to, a good site to use is the Guide to Custom Search Engines [7]. Alternatively, why not use Google to find them for yourself? The base URL is always the same, so you can start with ‘site:google.com inurl:cse inurl:coop’. This produces about 48,000 results, and then you can simply add on more terms as needed. Adding ‘library’ for example reduces the number to 448 engines. I did try librar* as my search term, but Google decided that my search was too similar to automated requests from a computer virus or spyware application and declined to complete it. I got around that by adding in (library OR librarian OR libraries), generating a total of 470 custom search engines.

Google also provides users with lots of statistical information about the way the engine is being used, ways of refining it, changing the look-and-feel and more besides. This really is an ‘industrial-strength’ custom search engine, and I use dozens of them – particularly on my country search engines pages, since I can create a custom search engine that will search those index- or directory-based search engines that present their data in a flat HTML format; in other words, a search engine of search engines. Finally, because Google makes its money from advertising, it’s also possible to make very small amounts of pocket money from the adverts that are displayed – some of the money goes to Google, the rest to the person who created the engine.

Yahoo! Search Builder

The Yahoo version [8] also requires creators of search engines to register with the site. Other than a slightly different ordering of information required to set up the engine, it’s exactly the same as previously described. I didn’t find anything that particularly attracted me to this resource, but as I don’t use Yahoo a great deal this says rather more about me than the utility in question. If Yahoo is your preferred engine you’ll certainly warm to this very quickly.

Microsoft Macros

Of course, Microsoft does not want to be left behind, so it has produced its own version - search macros. A listing of them is available [9]. You can of course create your own (although you’re limited to a total of 30 sites), but you do need to be registered with Microsoft in some way. To be honest, it does not appear to be a service that Microsoft is promoting very much, since at the time of writing only 31,159 search macros have been created, by 24,992 users.

Gigablast

Gigablast has also got involved, although in a very basic way. Its help page on the subject [10] provides us with some HTML code which can look quite daunting to someone not used to such things. The code is hand-edited to list up to 500 sites that can be searched from the form that is created. To be honest, with the easy approach shown by the other search engines, this is an insane idea; I can’t think of anyone who would want to take this approach, and while I respect Gigablast as a good search engine, this offering is virtually useless.

Quintura

Quintura [11] is a search engine that specialises in tag clouds as a means to assist with searching, and it has an invitation-only option for adding a custom search engine to a site. The engine requires (as always) a number of URLs which are then crawled and the search box and subsequent tag cloud are then created. However, since this is not publicly available, I’ll simply draw it to your attention. If you like the ‘tag cloud’ search approach, you may wish to explore this directly with Quintura.

Eurekster Swicki

The Eurekster Swicki [12] has a slightly different way of approaching the concept of personalised search engines. With the other resources that we’ve looked at so far, it’s only the author of the search engine who can change it (although Google does allow for the option of multiple authors). With a Swicki, the search engine results can be affected by the people who use it.

Users create a search engine as previously described, with the slight difference that they can add in some keywords which are appropriate to the subject content and these keywords appear under the search box. This ‘buzz cloud’ will grow and change automatically depending on the searches that are run on it.

The search engine can be embedded in your site (you can see one in action on my Web site [13] on the top of the right hand side column or on the hosted page [14]) and users can run their searches. They will then get taken to a results page at which point they can, if they wish, ‘vote’ for or against specific results. These results will then move up or down the results ranking depending on the number of votes received [15]. They can also comment on results and add their own as well.

Consequently this type of custom search engine will be of most use in situations where there is a group of users with a common interest in a particular subject area and which is prepared to put in a small amount of work in order to tailor a search to mirror its own interests.
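The voting behaviour described above can be illustrated with a short Python sketch. It assumes the simplest possible scheme, one in which each result is ordered by its net votes and ties keep the engine’s original order; Eurekster’s actual ranking method is not documented here and may well differ, and the names and sample data below are invented.

```python
from collections import Counter


def record_vote(votes, url, up=True):
    """Add one vote for (up=True) or against (up=False) a result."""
    votes[url] += 1 if up else -1


def rerank_by_votes(results, votes):
    """Order results by net votes, highest first.

    sorted() is stable, so results with equal votes stay in the
    order the underlying engine returned them.
    """
    return sorted(results, key=lambda url: -votes.get(url, 0))


# Invented results, in the order the underlying engine returned them.
results = ["page-a", "page-b", "page-c", "page-d"]
votes = Counter()

for _ in range(3):
    record_vote(votes, "page-c")          # three users vote for page-c
record_vote(votes, "page-a", up=False)    # one user votes against page-a

print(rerank_by_votes(results, votes))
```

Even a handful of votes is enough to reorder a short results list, which is why this approach suits a small group that searches the same subject area regularly.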

Topicle

Topicle [16] is also a community-based resource, based on the Google database. When first visiting the site users can search for appropriate search engines that other Topicle users have made and can simply reuse them if they so choose. They are also able to suggest other URLs appropriate to the subject of the custom search engine as well. These suggestions can then be voted on by other users, in terms of quality of the site and fitness for the subject area, or they can be marked as spam.

Custom search engine creation is, if anything, easier than with Rollyo. Simply create an engine by adding some Web sites and that is about it. You can then go in and edit as necessary to add more sites, or to import a collection of bookmarks. It does not however appear possible to delete sites from an existing engine – not even if you are the person who created it! It also does not appear possible to ‘own’ an engine as you can with the other examples we’ve looked at so far; it really does seem to be a case of letting your creation out into the wild to fend for itself. Consequently I’d have to doubt the value of Topicle as a serious tool, given that there really is little or no control over the engines created. I created one engine that simply searches my site and weblogs and which is available [17] in case you want to take a look and edit it yourself.

Conclusion

The custom search engine market has grown in a very short space of time, both in terms of the number of resources or utilities that can be used and the number of custom search engines that have been built. As search engines carry on trying to understand what we want, staggering under an increasingly heavy load of indexed Web pages, it seems to make perfect sense to take a little weight off them by producing our own. They are quick and simple to produce - and can be used, edited, re-edited and thrown away as appropriate. If you haven’t created one of these before, I would thoroughly recommend them.

References

  1. Gaming the Search Engine, in a Political Season, Tom Zeller Jr., The New York Times, 6 November, 2006 http://www.nytimes.com/2006/11/06/business/media/06link.html
  2. Webmaster Toolkit: Link extractor http://www.webmaster-toolkit.com/link-extractor.shtml
  3. Rollyo: Roll Your Own Search Engine http://www.rollyo.com/
  4. Phil Bradley’s Web 2.0 resources search engine http://www.rollyo.com/philbradley/web_2.0_resources/
  5. Rollyo: Roll Your Own Search Engine: Add a Rollyo RollBar Bookmarklet to your Bookmark Bar http://www.rollyo.com/bookmarklet.html
  6. Pageflakes http://www.pageflakes.com/philipbradley/19806400
  7. Guide To Custom Search Engines (CSEs) http://www.customsearchguide.com/
  8. Yahoo! Search Builder http://builder.search.yahoo.com/m/promo
  9. Windows Live Gallery http://gallery.live.com/default.aspx?pl=4&bt=13
  10. Gigablast Custom Topic Search http://www.gigablast.com/index.php?page=help#cts
  11. Quintura http://www.quintura.com/
  12. Eurekster Swicki http://www.eurekster.com/
  13. Phil Bradley Home Page http://www.philb.com/
  14. Website swicki by Phil Bradley http://website-swicki-swicki.eurekster.com/
  15. Readers may be interested to read more about this aspect in: “Human-powered Search Engines: An Overview and Roundup”, Phil Bradley, January 2008, Ariadne, Issue 54 http://www.ariadne.ac.uk/issue54/search-engines/
  16. Topicle http://www.topicle.com/
  17. Web 2.0 for libraries http://webforlibraries.topicle.com/

Author Details

Phil Bradley
Independent Internet Consultant

Email: philb@philb.com
Web site: http://www.philb.com/
