Small dynamic search engine

03/10/2004

Aim

The idea here is to create a search engine suitable for small-scale web sites with correctly formatted pages. To ease administration and maintenance, it searches on the fly and therefore requires neither prior indexing nor a database. For a discussion of the theoretical aspects of this solution, please refer to this other article. Let's see how we make it happen.

The strategy

The following program is simple and fairly efficient. First it crawls through the web site on the server (only in the parts that are explicitly designated), looks at the HTML files it encounters, checks whether they are relevant to the query and then displays them in presumed order of relevance. This assumes that the HTML pages are well formed and that they have a title and the required meta data. Results are displayed as a list, and we even try to show, with highlighting, the context in which the searched words were found.

Implementation

The prologue of this program contains the configuration of the directories to scan and loads the required Perl modules, all of which are well known and widely distributed.

#!/usr/bin/perl
# All by HAbeTT
# Raw search engine v0.22

# list of the directories to check
@reps = qw(papers perl misc);
foreach (@reps) {
  push (@repstodo,$ENV{'DOCUMENT_ROOT'}."/$_/");
}
$perco = length($ENV{'DOCUMENT_ROOT'});

use CGI::Carp qw(fatalsToBrowser);
use CGI;
use Benchmark;
use File::Find;
use locale;

We first start the benchmark of the script in order to evaluate its performance.

# init benchmark
$timea = Benchmark->new;
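
As a rough illustration of how the Benchmark module measures elapsed time, here is a minimal sketch that times an arbitrary piece of work. The busy_work sub is hypothetical and only stands in for the search itself.

```perl
use strict;
use warnings;
use Benchmark qw(timediff timestr);

# Hypothetical stand-in for the real work being measured.
sub busy_work {
    my $sum = 0;
    $sum += $_ for 1 .. 100_000;
    return $sum;
}

my $start  = Benchmark->new;    # snapshot before the work
my $result = busy_work();
my $end    = Benchmark->new;    # snapshot after the work

# timediff returns a Benchmark object; timestr formats it for display.
my $elapsed = timestr(timediff($end, $start));
print "Work took $elapsed\n";
```

The same pair of snapshots brackets the whole script here, so the reported time includes the crawl, the scoring and the sort.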

As the script is called through CGI, we grab the parameters of the query and switch them to lower case, with an additional translation of accented capitals in case use locale; is not applied properly, as can happen on certain servers. We then strip every character that is not a letter, a digit or a space, and split the query into the different words we're looking for.

# CGI handle
$query = CGI->new;
# preprocess
$kw = lc $query->param("kw");
$kw =~ tr/ÉÈÊÀÔÎÇ/éèêàôîç/;
$kw =~ s/[^a-zéèêàîôç0-9 ]//g;
$kw =~ s/ +/ /g;
# split on whitespace; ' ' also skips a leading space, which avoids an empty keyword
@kws = split (' ',$kw);
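
To make the preprocessing concrete, here is a small sketch of the same idea as a standalone sub. It is restricted to ASCII letters for simplicity (an assumption of this sketch, the script above also keeps a few accented letters), and the name normalize_query is hypothetical.

```perl
use strict;
use warnings;

# Turn a raw query string into a list of lowercase keywords.
# ASCII-only for simplicity (an assumption of this sketch).
sub normalize_query {
    my ($raw) = @_;
    my $kw = lc(defined $raw ? $raw : '');
    $kw =~ s/[^a-z0-9 ]//g;    # drop everything but letters, digits and spaces
    return split ' ', $kw;     # split on runs of spaces, skipping leading ones
}

my @words = normalize_query('  Small DYNAMIC  search!! ');
print join('|', @words), "\n";
```

Because everything outside the allowed character set is removed, the keywords can later be interpolated into regular expressions without any escaping.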

For the display, we start the HTML output with the required headers.

# headers http
print <<Fragment;
Content-type: text/html

<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">
<html>
<head>
<title>Textra search</title>
<link rel=\"stylesheet\" href=\"/style.css\" type=\"text/css\">
</head>
<body>
<p><a href=\"/index.html\" title=\"Main menu\"><img src=\"/grafz/logo.png\"
 width=500 height=150 alt=\"logo\" border=\"0\"></a>
</p>
<p>Search for <b>$kw</b>.</p>
<div align=\"left\"><ul>
Fragment

After a few initializations, we launch the main search process. Note that the spidering is done recursively in the designated directories; the File::Find module makes this pretty easy.

# main search process
$hits = 0;
$fetched = 0;
$files = 0;
find(\&processor, @repstodo);
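
For readers who have never used File::Find, the following sketch shows the callback mechanism on a throwaway directory tree. The file names and the collect sub are hypothetical, and the filter on the file extension is only one possible way of recognising HTML files.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Build a throwaway directory tree (hypothetical file names).
my $root = tempdir(CLEANUP => 1);
make_path("$root/papers/old");
for my $name ('papers/a.html', 'papers/old/b.htm', 'papers/notes.txt') {
    open(my $fh, '>', "$root/$name") or die "Cannot create $name: $!";
    print {$fh} "<html></html>\n";
    close($fh);
}

# The callback is invoked once for every file and directory found.
my @found;
sub collect {
    return if -d $File::Find::name;                    # skip directories
    return unless $File::Find::name =~ /\.html?$/i;    # keep .htm and .html only
    push @found, substr($File::Find::name, length $root);
}

find(\&collect, "$root/papers");
print "$_\n" for sort @found;
```

Stripping the root from each path, as done with substr here, yields locations that can be used directly as links on the site.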

We'll look at the search sub later, but right now you need to know that it fills two hash tables: one contains the text to be displayed and the other the score of each of the identified files. We sort the results so that the highest-scoring files are displayed first, on the assumption that points reflect relevance.

# sort and output
@order = reverse sort byval keys %sco;
foreach (@order) {
  print $rez{$_};
}
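
The ordering step can be tried in isolation. The sketch below sorts a hypothetical score table in descending order with an inline comparison block, which is equivalent to reversing an ascending sort.

```perl
use strict;
use warnings;

# Hypothetical scores keyed by page location.
my %score = (
    '/papers/a.html' => 3,
    '/perl/b.html'   => 18,
    '/misc/c.html'   => 8,
);

# Sort locations by descending score.
my @order = sort { $score{$b} <=> $score{$a} } keys %score;
print "$_ ($score{$_})\n" for @order;
```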

The main program comes to its end with the closing of the benchmark and the HTML code that contains a brief summary of the performance and a form for submitting another query.

# end list
print "</ul>\n</div>\n";

# benchmark end
$timez = Benchmark->new;
# display benchmark
$td = timediff ($timez,$timea);
$totime = timestr($td);
# escape the raw query before echoing it back into the form
$oripar = $query->escapeHTML(lc $query->param("kw"));
print <<Block;
<p>
It took $totime to find those $hits hits out of $fetched documents ($files files).
<form action="http://habett.com/cgi-bin/textracom.cgi">
<input type="text" name="kw" value="$oripar">
<input type="submit">
</form>
</p>
</body>
</html>

Block

exit();

Have a quick look at the sorting routine, which holds no surprises: it compares two locations by their score.

sub byval {
  $sco{$a} <=> $sco{$b}
}

Let's move on to the core of the script, the search sub itself. We grab the candidate's name through the File::Find handle, then we discard results that are in fact directories, as well as files that don't appear to be HTML.

# sub main search
sub processor {
  # get filename
  $fille = $File::Find::name;
  # account
  $files++;
  # eliminate directories
  return if (-d $fille);
  # eliminate non html files (keep only names ending in .htm or .html)
  return unless ($fille =~ /\.html?$/i);

We read the file block by block (hoping to have chosen a block size large enough for a typical file to fit in a single block) and we store the HTML code in a variable.

  # begin parse file (skip it silently if it cannot be opened)
  open (DAFILE, $fille) or return;
  $fetched++;
  # reading file
  $html = "";
  while ($p = read(DAFILE,$donnees,8192)) {
    # aggregation
    $html .= $donnees ;
  }
  # close file
  close(DAFILE);
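
The block-by-block reading can be packaged as a small reusable sub. This is only a sketch: slurp_blocks is a hypothetical name, and it uses a lexical file handle and the three-argument form of open rather than the bareword handle of the script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file in fixed-size blocks and return its content.
sub slurp_blocks {
    my ($path, $block_size) = @_;
    open(my $fh, '<', $path) or return;    # undef when the file cannot be opened
    my ($content, $chunk) = ('', '');
    while (read($fh, $chunk, $block_size)) {
        $content .= $chunk;                # append each block as it arrives
    }
    close($fh);
    return $content;
}

# Demonstration on a temporary file larger than one block.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} 'x' x 20_000;
close($tmp_fh);

my $html = slurp_blocks($tmp_name, 8192);
print length($html), " bytes read\n";
```

With an 8192-byte block, the 20,000-byte file above needs three reads, which shows that the loop copes with files that do not fit in one block.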

Before going any further, we isolate the meta data about the document: its title and its description tag.

  # parse meta data (empty string when a field is missing)
  ($title) = ($html =~ /<title>(.*?)<\/title>/is);
  $title = "" unless defined $title;
  ($description) = ($html =~ /meta.*?description.*?content.*?=.*?"(.*?)"/i);
  $description = "" unless defined $description;
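
Here is a self-contained sketch of the meta data extraction on a sample page. It assumes that the meta tag is written with name before content and with double quotes, which is stricter than the pattern used in the script; extract_meta is a hypothetical name.

```perl
use strict;
use warnings;

# Extract the title and the description meta tag from an HTML string.
# Missing fields come back as empty strings.
sub extract_meta {
    my ($html) = @_;
    my ($title) = $html =~ /<title>(.*?)<\/title>/is;
    my ($desc)  = $html =~ /<meta\s+name="description"\s+content="(.*?)"/is;
    return (defined $title ? $title : '', defined $desc ? $desc : '');
}

my $page = <<'HTML';
<html><head>
<title>Small dynamic search engine</title>
<meta name="description" content="Searching a site on the fly">
</head><body><p>Hello</p></body></html>
HTML

my ($title, $description) = extract_meta($page);
print "Title: $title\nDescription: $description\n";
```

Regular expressions are good enough for the well-formed pages this engine targets; pages with unusual markup would call for a real HTML parser.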

As we've lowercased the query, we do the same with the HTML code. Then, for each of the searched words, we award points to the file based on the number of times the word appears, on whether it appears as a standalone word and on whether it occurs in the meta data. It's just a simple calculation, but it improves the output by introducing a hierarchy in the results.

  # lowercase
  $html = lc $html;
  # score calculation
  $score = 0;
  foreach $target (@kws) {
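    # one point per occurrence
    $score += () = ($html =~ /$target/g);
    # two more points per occurrence as a standalone word
    $score += 2 * (() = ($html =~ /\W$target\W/g));
    # title and description still have their original case, hence /i
    $score += 5 if ($description =~ /$target/i);
    $score += 10 if ($title =~ /$target/i);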
  }
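
To see how the points add up, here is a sketch of the scoring as a standalone sub, assuming that every occurrence counts (one reading of the rules described above). The name score_page and the sample strings are hypothetical; the weights are the ones used in the script.

```perl
use strict;
use warnings;

# Score one page for a list of lowercase keywords.
# Weights: 1 per occurrence, 2 more per standalone occurrence,
# 5 for a hit in the description, 10 for a hit in the title.
sub score_page {
    my ($text, $title, $description, @keywords) = @_;
    my $score = 0;
    for my $word (@keywords) {
        my $any   = () = $text =~ /\Q$word\E/gi;
        my $whole = () = $text =~ /(?<!\w)\Q$word\E(?!\w)/gi;
        $score += $any + 2 * $whole;
        $score += 5  if $description =~ /\Q$word\E/i;
        $score += 10 if $title       =~ /\Q$word\E/i;
    }
    return $score;
}

my $score = score_page(
    'perl is fun. superperl too.',    # body text
    'About Perl',                     # title
    'A page about languages',         # description
    'perl',
);
print "Score: $score\n";
```

In this example the word appears twice in the body (2 points), once as a standalone word (2 more points) and once in the title (10 points), for a total of 14.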

If the score isn't zero, we prepare the HTML code by stripping the tags and anything else that may prevent a clean output.

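  # store scores in a hash
  if ($score != 0) {
    $hits++;
    # extract the file location relative to the document root
    $location = substr($fille,$perco);
    # remove tags, including those spanning several lines
    $html =~ s/<.*?>//gs;
    # replace line breaks (CR LF, LF or CR) with a space
    $html =~ s/(?:\015\012|\012|\015)/ /g;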
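
The clean-up can be checked on a short string. The sketch below (strip_html is a hypothetical name) removes the tags and then flattens the three usual line-break conventions to a single space each.

```perl
use strict;
use warnings;

# Reduce HTML to plain text on a single line (regex-based, so only
# suitable for simple, well-formed pages).
sub strip_html {
    my ($html) = @_;
    $html =~ s/<.*?>//gs;                     # remove tags, even multi-line ones
    $html =~ s/(?:\015\012|\012|\015)/ /g;    # CR LF, LF or CR become one space
    return $html;
}

my $plain = strip_html("<p>Hello\r\n<b>big</b>\nworld</p>");
print "$plain\n";
```

Listing the two-character CR LF sequence first in the alternation matters: otherwise each half would be replaced separately and produce two spaces.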

We highlight the queried terms in the meta data.

    # highlight in title and description (original case is preserved)
    foreach $target (@kws) {
      $title =~ s/($target)/<span class="high">$1<\/span>/gi;
      $description =~ s/($target)/<span class="high">$1<\/span>/gi;
    }

Finally, we generate the HTML output for later display as a list with links, meta data and the context of the occurrences.

    # data
    $data = "<li><a href=\"$location\" title=\"$location\"><b>$title</b></a><br><i>$description</i><br><pre>\n";
    foreach $target (@kws) {
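      # index returns -1 when the word is absent
      $poss = index ($html, $target);
      if ($poss >= 0) {
        # start 30 characters before the word, without going below 0
        $start = ($poss > 30) ? $poss - 30 : 0;
        $extra = substr($html,$start,60);
        $extra =~ s/$target/<span class="high">$target<\/span>/g;
        $data .= "$extra\n" if ($extra =~ /span/);
      }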
    }
    $data .= "</pre></li>\n";
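
The context extraction is easier to follow on a single sentence. Here is a sketch with a hypothetical context_snippet sub that takes the width of the context as a parameter instead of the fixed 30 characters used above.

```perl
use strict;
use warnings;

# Return up to $width characters on each side of the first occurrence
# of $word in $text, with the word wrapped in a highlighting span.
# Returns undef when the word is absent.
sub context_snippet {
    my ($text, $word, $width) = @_;
    my $pos = index($text, $word);
    return if $pos < 0;                           # index returns -1 when not found
    my $start   = $pos > $width ? $pos - $width : 0;
    my $extract = substr($text, $start, $pos - $start + length($word) + $width);
    $extract =~ s/\Q$word\E/<span class="high">$word<\/span>/g;
    return $extract;
}

my $snippet = context_snippet('the quick brown fox jumps over the lazy dog', 'fox', 6);
print "$snippet\n";
```

Clamping the start position to zero is the important detail: a negative offset passed to substr counts from the end of the string, which would return an unrelated extract for words found near the beginning of the page.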

We store this in the hash tables and we're done.

    $rez{"$location"} = $data;
    $sco{"$location"} = $score;
  }
  
}
