Shall we index?

15/06/03

Introduction

When your website grows large enough that you begin to wonder whether it needs a search engine (in my case, HAbeTT.org), you come up against the question of which technology to use to index the contents of your average-sized site, and many questions and solutions spring to mind.

If you want a search engine of your own, the first, slightly sneaky solution is to lean on a major large-scale search site like Google and restrict the search to your own domain. With Google this is easy: you add site:www.habett.org to your search request. On other engines, limiting the scope of a search can be awkward or unreliable, even though much has improved in recent years with the rise of more powerful search engines like Inktomi and Google. Once you get past the scope limitation, you end up with results that may be correct but are not integrated with the layout of your site, and you have no control over the indexing schedule (more on that later).
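As a small illustration of the site: restriction, here is a minimal sketch in Python that builds such a search link. The function name site_search_url is hypothetical, and the https://www.google.com/search?q=... address is an assumption about where the request is sent; only the site:www.habett.org operator comes from the text above.

```python
from urllib.parse import urlencode


def site_search_url(terms, domain="www.habett.org"):
    """Build a Google search link restricted to one domain (illustrative helper)."""
    # The site: operator is appended to whatever the visitor typed.
    query = f"{terms} site:{domain}"
    # urlencode escapes spaces and the colon so the link is a valid URL.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("search engine"))
```

A search form on the site could simply redirect to this link, which is why the results cannot follow the site's own layout.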

The second option is to rely on a dedicated hosted site-search service like atomz (free under certain conditions), the more powerful synomia, or one of the many others that exist. You give it the address of your site, it indexes its contents whenever you ask, you redirect your searches to it, and you get results in a customizable shape and layout, within the limits of the technology it employs. The contents of your pages are indexed on a remote server that processes the requests. This solution is better because you gain control over the appearance of the results and over when your site is indexed, either on demand or on a regular schedule. The fact that your searches are processed on a different server can be hidden with HTML frames, if the service lets you do so. The drawback is heavy bandwidth use, because the remote server has to fetch your whole website on every index update. This is a good solution when you only have a basic hosting plan that won't let you use dynamic technologies like CGI/PHP/ASP, or when you are comfortable relying on an outside technology and would rather not get too technical in managing it.

First questions

The real fun begins when you have the benefits of a full hosting plan. All over the Internet you can find Perl/PHP/... scripts that give you your own search engine. In my case, being a basic Perl programmer, I got the idea that maybe I could create my own engine. Then comes the time to think about the technology you are going to implement, and you are on your own.

My initial idea was that you have to work in two steps: indexing the contents, then processing the requests. However, to get started, I wrote a direct search engine that scans the files on the fly, with no prior indexing. With a site of about 800 HTML files, and wanting a custom display of results (link, contextual quote, meta information, special icon), I was expecting awful performance.
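To make the on-the-fly approach concrete, here is a minimal sketch of an index-free search. It is written in Python for illustration, not the Perl of the real engine, and it is not the author's code: the name search_site, the regular expressions, and the 40-character context window are all assumptions. It returns a link, a title, and a contextual quote for each matching file, and leaves out the meta information and icons.

```python
import os
import re

# Crude tag stripper and title extractor; good enough for a sketch,
# not a substitute for a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def search_site(root, term, context=40):
    """Scan every .html file under root at query time, without any index."""
    results = []
    needle = term.lower()
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(".html"):
                continue  # skip images, scripts and other non-HTML files
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                html = handle.read()
            title_match = TITLE_RE.search(html)
            title = title_match.group(1).strip() if title_match else name
            # Drop the tags and collapse whitespace to get searchable text.
            text = " ".join(TAG_RE.sub(" ", html).split())
            pos = text.lower().find(needle)
            if pos == -1:
                continue
            # Keep a few characters on each side as the contextual quote.
            start = max(0, pos - context)
            end = pos + len(term) + context
            results.append({
                "link": os.path.relpath(path, root),
                "title": title,
                "quote": text[start:end],
            })
    return results
```

Every request rereads every file, which is exactly why poor performance seemed likely at first.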

After tweaking the Perl code for some time, without major optimization concerns, on a heavily shared server, and without calling the system grep (so that I could easily get results in the shape I wanted), I got really surprising performance results. A search through 800 HTML files, lost in a batch of more than 3000 files and a few megabytes of data, takes far less than half a second of CPU time on the server!
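For readers who want to take a similar measurement, here is a minimal sketch of timing the CPU cost of a call. It is in Python rather than Perl, the helper name cpu_seconds is hypothetical, and it says nothing about how the figure above was actually obtained. time.process_time counts user plus system CPU time of the current process, so time spent waiting on a busy shared server is not included.

```python
import time


def cpu_seconds(func, *args, **kwargs):
    """Run func and return (its result, CPU seconds consumed by this process)."""
    start = time.process_time()
    result = func(*args, **kwargs)
    return result, time.process_time() - start


# Stand-in workload; a real test would call the search routine instead.
total, elapsed = cpu_seconds(sum, range(1_000_000))
print(f"result={total}, cpu={elapsed:.3f}s")
```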

Performance

Given this performance, I started to ask myself whether it was worth going further and implementing a classic full-fledged search engine that indexes the contents of the site beforehand. So, shall we index?

With the benchmark acceptable for an average-sized site, the advantages of not indexing quickly spring to mind. No index means great freedom, because there is nothing to maintain. Even though rebuilding an index can be automated, with a cron job for example, it remains an intermediate operation that takes CPU time and means that changes to the site are not taken into account in real time.
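The staleness cost can be seen in a minimal sketch of the indexed alternative. This is a toy inverted index in Python, with hypothetical names (build_index, lookup) and made-up page contents; it only illustrates why an index has to be rebuilt after the site changes.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Map each word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in WORD_RE.findall(text.lower()):
            index[word].add(name)
    return dict(index)


def lookup(index, word):
    """Answer a one-word query from the index alone, without reading pages."""
    return sorted(index.get(word.lower(), set()))


# Made-up page contents, for illustration only.
pages = {"a.html": "perl search engine", "b.html": "music and sounds"}
index = build_index(pages)

pages["b.html"] = "music and perl"   # the site changes after indexing
print(lookup(index, "perl"))          # ['a.html']: the index is stale

index = build_index(pages)            # the step a scheduled job would run
print(lookup(index, "perl"))          # ['a.html', 'b.html']
```

Lookups are cheap because they never touch the files, but between two rebuilds the answers describe the site as it was, not as it is.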

Perl is an interpreted language, slower than a C/C++ solution. A system grep would make the process much more efficient. A better programmer would obviously find more optimized tricks. The downside of this approach is that it relies on servers being powerful enough to handle such a process. It may seem perverse to choose a technology that is plainly inferior both in theory and in practice, and to decide on the strength of benchmarks alone, but that is how the computer world evolves, and that won't change.

Conclusions

Once again, I find myself facing the dilemma of the simple versus the best. Anyway, in my case, for this site, my decision is made: I will not index. My choice would have been different if I had wanted to implement textual tolerance strategies, such as approximate matching of misspelled words.