It's all in the title: we are going to see different methods for finding the geographic location of this website's visitors. To begin with, we only have an IP and sometimes a HostName, which we can fetch from the CGI environment variables REMOTE_ADDR and REMOTE_HOST.
Here comes the simplest method. It needs quite a lot of code, but it is very efficient in terms of CPU cycles on the server, wastes no bandwidth and requires neither external data nor Perl modules. We get the environment variable $ENV{'REMOTE_HOST'} from the CGI gateway, we check it, and then we extract the TLD used by the visitor. Then we look the TLD up in a fairly big hash table to convert it to something readable. In my example, I use a list of TLDs that may not be fully up to date but will be sufficient for us.
# initialization of the conversion hash table
# this is only a part of the whole thing,
# the complete code is in the archive below
%pays = ("ca"=>"Canada","fr"=>"France","nl"=>"Netherlands","no"=>"Norway");
$host = lc $ENV{'REMOTE_HOST'};
# check that the HostName can be used: not empty, and ending with
# a non-numeric TLD (which rules out bare IPs)
print geo() if ($host =~ /\.\D+$/);
# the geo sub
sub geo {
# regexp to extract the TLD; give up if there is none
return "Unknown" unless ($host =~ /.*\.(\D+)$/);
$wz = $1;
if ($pays{$wz}) {
# case when the TLD is in our hash table
return "Hello ".$pays{$wz};
} else {
# if it ain't, I send myself an email to remind me to update
# data from my script (this must happen before the return)
if (open(SENDMAIL, "|/usr/lib/sendmail -t")) {
print SENDMAIL <<EOF;
From: robot\@habett.org
To: habett\@habett.org
Subject: New TLD
X-Mailer: HKMail

Please update the script with the $wz TLD.
EOF
close(SENDMAIL);
}
return "Unknown";
}
}
As you can see, this is a pretty straightforward method, quite obvious but long to type because of the hash table, which is not shown in full here but is included in the archive below. One interesting thing is the way we keep the webmaster posted when we find an unknown TLD. Things change, new TLDs get created, and this gives us a tool to monitor the evolution. Using the script for that alone can be a goal of its own.
I fancy this method and I use it quite often, as does most server log analyzer software. It is reliable but somewhat limited. This is the Technodicy of the simple.
Its problem is easy to understand, and I'll use data from one of my servers' logs (habett.org): 24% of the visitors don't show their HostName and turn up with a bare IP. If you add to that all the TLDs that are not linked to a particular country, this leaves us with 60% of the visitors without a true geographical location. This figure looks large if you consider our goal, and small if you consider the tininess of the means involved in the code.
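To make that kind of tally concrete, here is a minimal sketch that sorts logged hosts into bare IPs, generic TLDs and country TLDs, then prints the share of each. It assumes that country TLDs are the two-letter ones, and the sub names classify and shares, as well as the sample hosts, are made up for the illustration. It is not the way the figures above were produced.

```perl
use strict;
use warnings;

# Classify one logged host as "ip", "country", "generic" or "other".
# Assumption: a two-letter TLD is a country TLD, anything longer is generic.
sub classify {
    my ($host) = @_;
    return "ip"    if $host =~ /^\d+\.\d+\.\d+\.\d+$/;
    return "other" unless $host =~ /\.([a-z]+)$/i;
    return length($1) == 2 ? "country" : "generic";
}

# Return a hash giving the share of each class, in percent.
sub shares {
    my @hosts = @_;
    my $n = scalar @hosts;
    my %count;
    $count{ classify($_) }++ for @hosts;
    return map { $_ => 100 * $count{$_} / $n } keys %count;
}

# made-up sample, not real log data
my %share = shares("1.2.3.4", "a.example.com", "b.example.fr", "c.example.no");
printf "%-8s %5.1f%%\n", $_, $share{$_} for sort keys %share;
```

Only the "country" class can be located with the TLD method; everything else needs another approach.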
If there's a technodicy involved, there's an opposition of styles. We've seen the classics, let's see the baroques.
First I download a 437 KB archive from this site: http://ip-to-country.webhosting.info. Once unzipped, we have a Comma Separated Values (CSV) file weighing 2.6 MB. Each of its lines goes something like this:
"33996344","33996351","GB","GBR","UNITED KINGDOM"
It is just a simple way to describe a range of IPs, with the first two numbers giving the start and the end of the range and the name of the country at the end. The two other values in the middle are country codes following the ISO 3166 standard. Not every IP is there, but it's really something valuable.
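A line of that shape can be split into named fields with a few lines of Perl. This is only a sketch: it assumes that every field is wrapped in double quotes, as in the sample line, and the sub name parse_line and the hash keys are my own choice. Capturing what sits between the quotes, instead of splitting on commas, keeps a country name intact even if it contains a comma.

```perl
use strict;
use warnings;

# Parse one line of the CSV file into its five fields.
# Assumption: every field is wrapped in double quotes.
sub parse_line {
    my ($line) = @_;
    # capture the content of each quoted field, commas included
    my @fields = $line =~ /"([^"]*)"/g;
    return unless @fields == 5;
    my ($beg, $end, $iso2, $iso3, $name) = @fields;
    return { beg => $beg, end => $end, iso2 => $iso2, iso3 => $iso3, name => $name };
}

my $rec = parse_line('"33996344","33996351","GB","GBR","UNITED KINGDOM"');
# both bounds belong to the range, hence the + 1
print "$rec->{name}: ", $rec->{end} - $rec->{beg} + 1, " addresses\n";
```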
Out of curiosity, we are going to calculate the number of IPs described in this file. If I'm not mistaken, we can write :
$total = 0;
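open (FILE,"ip-to-country.csv") or die "Cannot open ip-to-country.csv: $!";
while (<FILE>) {
($beg,$end) = split (/,/,$_);
$beg =~ s/"//g;
$end =~ s/"//g;
# note: both bounds are inclusive, so this counts one IP too few per line
$total += $end-$beg;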
}
close (FILE);
print $total;
The result is a stunning 2 634 190 229 IP entries with their geographical location! Let's do the complete math: there is a potential maximum of 256*256*256*256 = 4 294 967 296 possible IPs, at least until the new IPv6 protocol. We should remove from this last figure the non-routable IP ranges like 10.x.x.x, 192.168.x.x and the other ones. Anyway, the file we have covers at least 50% of the whole thing.
Before going any further, we shall ask ourselves whether it's really worth it, keeping in mind what we've achieved in the first part of this article. I've had a look at my logs, which may not be representative but will give us an idea of what to expect. Here's what I read from my Analog report:
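The coverage can be worked out in a few lines. The sketch below takes the total computed from the CSV file and compares it with the whole IPv4 space, first as is and then with the three private ranges of RFC 1918 removed. The other reserved ranges (loopback, multicast and so on) are left in, so take the second figure as a rough estimate; the sub name coverage is just a helper for this example.

```perl
use strict;
use warnings;

my $total_ipv4 = 256 ** 4;          # 4 294 967 296 addresses
my $in_file    = 2_634_190_229;     # figure computed from the CSV file

# private ranges defined by RFC 1918
my $private = 2 ** 24               # 10.0.0.0/8
            + 2 ** 20               # 172.16.0.0/12
            + 2 ** 16;              # 192.168.0.0/16

# share of $part in $whole, in percent
sub coverage {
    my ($part, $whole) = @_;
    return 100 * $part / $whole;
}

printf "%.1f%% of all IPv4 addresses\n", coverage($in_file, $total_ipv4);
printf "%.1f%% once the private ranges are removed\n",
    coverage($in_file, $total_ipv4 - $private);
```

Both figures come out a little above 60%, which agrees with the "at least 50%" estimate.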
That's more than 60%, so the rich method is apparently worth exploring further. We can even imagine that, by combining the two methods, we will cover the cases where only one of them works, on top of the cases where they overlap.
Let's take the example of the temporary IP of a good friend of mine, 217.13.4.77, and see what we can do with it. First we convert it from dotted IP to plain numerical format: ((((217 * 256) + 13) * 256) + 4) * 256 + 77 equals 3 641 508 941. I look inside the ip-to-country.csv file and I read that this IP is in Norway, which is the truth, even though this IP had a dot com address as HostName!
In the previous example I did it all by hand, calculating and searching through the large CSV file, but a tiny Perl program can do it all by itself:
$ip = shift;
($a,$b,$c,$d) = split(/\./,$ip);
$n = ((((($a * 256) + $b) * 256) + $c) * 256) +$d;
open (FILE,"< ip-to-country.csv") or die "Cannot open ip-to-country.csv: $!";
while (<FILE>) {
($beg,$end,$country) = (split (/,/,$_))[0,1,4];
$beg =~ s/"//g;
$end =~ s/"//g;
# both bounds belong to the range
if (($beg <= $n) and ($end >= $n)) {
$country =~ s/"//g;
$country = lc $country;
$country = "\u$country";
print $country;
# no need to read the rest of the file once we have a match
last;
}
}
close (FILE);
exit(0);
This works wonders but it is slow. This is the kind of workload that not every server can afford very often. There are many ways to optimize the code to make it faster, but we're not after an industrial-strength application. It's just a game of tricks between you and me.
We only have to do the final round-up. We begin with the first method because it is the fastest, after removing from the hash table all the TLDs that don't belong to a country. If there's no match, we fall back on the second method. I've coded the whole thing as a Perl script with a CGI interface to produce the following output:
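For the curious, here is one possible optimization, as a sketch and not as the code from the archive: load the ranges once into a sorted array, then find the right one with a binary search instead of reading the whole file for each visitor. It assumes that the ranges in the file do not overlap. The sub names ip_to_num, load_ranges and find_country are made up for this example, and pack/unpack is just another way of writing the multiplication by 256 seen above.

```perl
use strict;
use warnings;

# Convert a dotted IPv4 address to its numerical form.
sub ip_to_num {
    my ($ip) = @_;
    return unpack("N", pack("C4", split(/\./, $ip)));
}

# Load the ranges as [begin, end, country] triples, sorted by range start.
sub load_ranges {
    my ($path) = @_;
    open(my $fh, "<", $path) or die "Cannot open $path: $!";
    my @ranges;
    while (my $line = <$fh>) {
        my @f = $line =~ /"([^"]*)"/g;
        push @ranges, [ $f[0], $f[1], $f[4] ] if @f == 5;
    }
    close($fh);
    return [ sort { $a->[0] <=> $b->[0] } @ranges ];
}

# Binary search over the sorted ranges; returns the country name or undef.
# Assumption: the ranges do not overlap.
sub find_country {
    my ($ranges, $n) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($beg, $end, $country) = @{ $ranges->[$mid] };
        if    ($n < $beg) { $hi = $mid - 1 }
        elsif ($n > $end) { $lo = $mid + 1 }
        else              { return $country }
    }
    return;
}

# tiny in-memory demo built from the sample line of the CSV file
my $demo = [ [ 33996344, 33996351, "UNITED KINGDOM" ] ];
print find_country($demo, 33996350), "\n";
print ip_to_num("217.13.4.77"), "\n";
```

With the real file, one would call load_ranges("ip-to-country.csv") once and then find_country($ranges, ip_to_num($ip)) for each lookup. The search itself then takes a couple of dozen comparisons at most, but loading the file still has a cost, so this pays off mostly in a long-running process.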
If it's not the name of your country, please contact me to improve this script. Thanks.
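The round-up described above (cheap method first, IP lookup as a fallback) can be sketched as follows. This is not the final CGI script from the archive, only an illustration of the logic: the %pays table is the short excerpt shown earlier, the sub names country_from_host and locate are made up, and the IP lookup is passed in as a code reference so that any implementation of the second method can be plugged in.

```perl
use strict;
use warnings;

# Country-only TLD table (short excerpt; generic TLDs such as com are left out).
my %pays = ("ca"=>"Canada","fr"=>"France","nl"=>"Netherlands","no"=>"Norway");

# First method: country from the TLD of the HostName, or undef.
sub country_from_host {
    my ($host) = @_;
    return unless (defined $host && lc($host) =~ /\.([a-z]+)$/);
    return $pays{$1};
}

# Try the cheap method first, then fall back on the IP lookup passed in
# as a code reference (for instance a scan of the CSV file).
sub locate {
    my ($host, $ip, $ip_lookup) = @_;
    my $country = country_from_host($host);
    $country = $ip_lookup->($ip) if (!defined $country && defined $ip);
    return defined $country ? $country : "Unknown";
}

# stand-in for the CSV lookup, for demonstration only
my $fake_lookup = sub { return "Norway" };
print locate("www.example.fr", "192.0.2.1", $fake_lookup), "\n";     # the TLD is enough
print locate("host.example.com", "217.13.4.77", $fake_lookup), "\n"; # falls back on the IP
```

Keeping the slow lookup behind a code reference means it only runs when the TLD gives nothing, which is the whole point of starting with the fast method.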
The following archive contains the two programs and the final CGI script.