I like to be aware. In touch. I read many Usenet newsgroups but that's not enough. I subscribe to mailing lists but I don't like giving away my e-mail address because I don't want to expand my spam load. I wanted to get started in the complex XML universe. I got into RSS.
RSS stands for, depending on who you ask, Really Simple Syndication or Rich Site Summary. Whichever you choose, it's really interesting. The process is simple: the server has an RSS file, also called a feed, most of the time in XML format, that contains a list of recent content with descriptions and other details or, more generally, a list of pieces of information. If you have what is called an RSS feed aggregator, it grabs all those files on a regular basis and, if all goes smoothly, you receive some sort of notification whenever new content is available, without having to check for it yourself.
Many sites such as Drobe Launchpad, BBC News, Libération or Slashdot offer RSS feeds. Indeed, many things come in RSS format: news, blogs, information, updates, software announcements, etc.
Aggregation is the task of gathering information from RSS feeds. There are websites that aggregate feeds on a precise subject (Google News, Meerkat), in which case we talk about syndication; sites that aggregate according to your own choices (the former My Netscape or the personal registered mode of Meerkat); and specialized programs called aggregators. We are going to focus on the software side, because why let somebody else do what we might as well do on our own? It provides privacy and, if we generate our own feeds, it saves us from having to upload them for later aggregation, as we'll see.
There are many aggregators, so many that I could hardly give you advice, but you'll find here or here a list of the most popular ones. None really convinced me, maybe because I have strict or strange tastes when it comes to GUIs. As I was testing many IMAP clients for another project, I got jealous of the way Mozilla Thunderbird handles RSS feeds inside its usual messaging interface. It's funny, because many aggregators advertise their interface as a copy of an e-mail client's.
After endlessly hunting for a good RSS aggregator for my Iyonix, I told myself there was something to be done about it. I found NewsPipe by Ricardo M. Reyes, which provides exactly the service I was looking for. It's a small Python program that sends e-mails to a designated address when a site has been updated, that is, when its RSS feed shows something new. A sort of RSS to e-mail bridge, a new messaging transport. I tried to get it to run under RISC OS Python but it didn't work, so I installed it on the PC and it worked a charm straight out of the box. Python exists on many platforms so that shouldn't be a difficulty.
I first installed Python, then put NewsPipe on Piotr's hard disc, configured it to send messages in plain text directly to MessengerPro (my MUA) on the Iyonix and then created my own OPML file to poll my favorite RSS feeds at decent intervals. Folders created in MessengerPro, filtering rules set, and I had my system simulating Mozilla Thunderbird's behaviour. You too should be able to find a way, with your favorite e-mail client, to enjoy the power of RSS aggregation without having to install new software, while enjoying a unified interface.
When you visit one of your favorite sites quite often, look out for feed logos or any mention of an RSS feed, XML, syndication or something similar. Copy the link location and you have the URL you need. Add it to your OPML file and that's it.
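For illustration, a minimal OPML file could look like the sketch below. The outline attributes (text, htmlUrl, xmlUrl, delay) are the ones used in the outline example later in this article; the title and the htmlUrl value are made up, and the exact attributes your aggregator expects should be checked against its documentation.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<opml version="1.1">
  <head>
    <title>My feeds</title>
  </head>
  <body>
    <!-- one outline element per feed you want to follow -->
    <outline text="HAbeTT" htmlUrl="http://habett.org/" xmlUrl="http://habett.org/interface/indexen.rss" delay="60" />
  </body>
</opml>
```

Adding a feed then comes down to adding one more outline line inside the body element.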
Even though RSS technology has yet to find its Google, there are reference sites such as Syndic8.com.
If one of your favorite sites doesn't have an RSS feed (yet), then you can create it yourself! RSS feeds come in many flavours and formats because of a chaotic history and protocol wars. We won't describe or analyze the many formats here, but I'll let you know that I use 1.0 because it allows you to do many things while remaining simple enough to write by hand, like you would with HTML.
The scripts we are about to see are written in Perl and use the XML::RSS::SimpleGen module, which is the simplest but produces 2.0 feeds. There's also the good old XML::RSS, where you can choose the format of the feed to be generated, but its use is somewhat more complicated.
You have to run these scripts on a regular basis, daily or as often as you want, using a crontab or the task scheduler.
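As a sketch, a crontab entry running a feed script every morning at 7:00 could look like this. The directory and the script name are hypothetical; on Windows, the task scheduler plays the same role.

```
# min  hour  day-of-month  month  day-of-week  command
0      7     *             *      *            cd /home/me/feeds && perl myfeed.pl
```

The cd makes sure the script saves its .rss file in a known directory rather than wherever cron happens to start it.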
If you generate the feed on a web server instead of locally on your personal computer, don't forget to configure your server (.htaccess file for Apache) so that it sends the right MIME type for .rss files, that is application/rss+xml.
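With Apache, this could be a single line in the .htaccess file. This is a sketch that assumes your feeds are saved with a .rss extension and that the server lets you override file types from .htaccess (AllowOverride FileInfo).

```apache
# serve files ending in .rss with the RSS MIME type
AddType application/rss+xml .rss
```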
Let's move on and generate our first feed. This is going to be extremely simple, to the point that it's not really a feed per se. Imagine that one of your friends has a web site and that she updates it very rarely. It's boring to visit this site only to find out there's nothing new, and it's annoying to miss an update. There are many other ways to achieve such monitoring and the example isn't really interesting in itself, but it's just to get started and show you the concepts of the process.
#!perl -w
# Lemon watcher
use LWP::Simple;
use XML::RSS::SimpleGen;
$url = "http://www.lemonbugg.com/poo.html";
# get http headers
@headz = head($url);
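# if the page has been modified within the last 24 hours
# ($headz[2] is the last modification time; it is undefined if the request failed)
if (defined $headz[2] && (time - $headz[2]) < 24*60*60) {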
# writing the feed
rss_new ("http://www.lemonbugg.com","Lemon","Her updates");
rss_item ($url,"Something new about Lemon on ".time);
rss_save('lemon.rss');
}
exit(0);
As you can see, it's really basic. We grab the HTTP headers of the HTML page to check whether something has changed within the last 24 hours. If that's the case, we generate a fake RSS feed that contains only one entry, stamped with the Unix time at which the update was noticed. You just have to run this script automatically every 24 hours, add the feed to your OPML file, <outline text="Lemon" htmlUrl="http://www.lemonbugg.com/" xmlUrl="f:\tests\lemon.rss" delay="60" />, and you'll be notified every time the site is updated. No need for live/dynamic bookmarks, news comes straight to you. Note that in terms of bandwidth, the process is really inexpensive because we only fetch the headers, whereas a visit to the site would have required the full HTML file and maybe its stylesheet and its images. A nice transaction, a careful use of our resources.
Our next example will monitor FC Barcelona's website because I'm a big Barça fan and I can't wait. On this site you find links to stories about the club, and we are going to transform this HTML news page into an RSS feed.
#!perl -w
# Barca headlines RSS generator
use LWP;
use XML::RSS::SimpleGen;
$base = "http://www.fcbarcelona.com";
$browser = LWP::UserAgent->new;
$answer = $browser->get("$base/eng/home-page/home/home.shtml");
$html = $answer->content;
# each match yields a link and a title (the title stops at the next tag)
@content = ($html =~ /href=(\/eng\/noticias\/noticias\/n\d+\.\w+) class="b10b?">([^<]+)/g);
rss_new ($base,"Barca headlines","Newsfeed from fcbarcelona.com");
for ($i = 0; $i < @content; $i+=2) {
rss_item ($base.$content[$i],$content[$i+1]);
}
rss_save('barca.rss');
exit(0);
The process is more complicated than expected because the HTML code isn't really structured. The regular expression used to locate links and titles will continue to work as long as the site's template isn't modified, but that's good enough for now. This script suits sites that are not updated every day but where a few new items can show up on a special day. Each item arrives individually with its URL and title.
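To see how the regular expression turns a page into pairs of links and titles, here is a small offline sketch. The HTML sample is made up to imitate the pattern the script looks for; only the regular expression itself comes from the script.

```perl
#!perl -w
use strict;

# Made-up sample markup imitating the links the script looks for
my $html = join "\n",
    '<a href=/eng/noticias/noticias/n0412.shtml class="b10b">First headline</a>',
    '<a href=/eng/other/page.shtml class="b10b">Not a news link</a>',
    '<a href=/eng/noticias/noticias/n0413.shtml class="b10">Second headline</a>';

# In list context, /g returns the captures of every match as one flat list:
# (link1, title1, link2, title2, ...)
my @content = ($html =~ /href=(\/eng\/noticias\/noticias\/n\d+\.\w+) class="b10b?">([^<]+)/g);

# Walk the list two elements at a time: link first, then title
for (my $i = 0; $i < @content; $i += 2) {
    print "$content[$i+1] -> $content[$i]\n";
}
```

The second link is ignored because its path doesn't match, which is exactly why the script breaks silently the day the site changes its URLs or class names.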
So we have this script that grabs the URLs and titles of the new features on FCBarcelona.com, but we still have to click on each link to learn more. An improvement is to offer a richer feed by grabbing the content of the new items automatically. This takes an LWP query for each item, but it saves a lot of time and effort, providing additional comfort to the end user.
#!perl
# Barca headlines and content RSS generator
use XML::RSS::SimpleGen;
# URL of the site
$base = "http://www.fcbarcelona.com";
# grab the reference page
$html = get_url("$base/eng/home-page/home/home.shtml");
# isolate headlines
@content = ($html =~ /href=(\/eng\/noticias\/noticias\/n\d+\.\w+) class="b10b?">([^<]+)/g);
# create RSS feed
rss_new($base,"Barca headlines");
for ($i = 0; $i < @content; $i+=2) {
# grab the linked page
$cite = get_url("$base".$content[$i]);
$cite =~ s/\n/ /g;
# scrape the interesting section (empty if the markers are not found)
$notice = ($cite =~ /<!--Noticia completa-->(.*)<!--Fin noticia completa-->/) ? $1 : "";
# prepare data for visualization
$notice =~ s/<.*?>/ /g;
$notice =~ s/\t+/ /g;
# add the RSS item
rss_item ($base.$content[$i],$content[$i+1],$content[$i+1]."\n".$notice);
}
# save the RSS feed
rss_save('barca.rss');
exit(0);
Now, imagine you've found a feed with interesting items but where most of the content is of no interest to you. You could configure your aggregator to filter the feed's content, or you could write a script to do it. Let's take the example of HAbeTT.org's feed and imagine that you are only interested in announcements of new images. This time, we are going to use the XML::RSS module because we need to parse (read and understand) a feed before generating a new one.
#!perl -w
# HAbeTT.org image RSS selector
use XML::RSS;
use LWP::Simple 'get';
use strict;
# URL of the original feed
my $url = 'http://habett.org/interface/indexen.rss';
# new feed
my $out = new XML::RSS(version => '1.0');
$out->channel(title => 'Images from HAbeTT', link => 'http://habett.org/images/', description => 'Last images published on HAbeTT.org');
# get original feed
my $feed = get($url) or die "Can't download $url $!";
# parse original feed
my $in = new XML::RSS;
$in->parse($feed);
# loop through items
foreach my $item (@{$in->{'items'}}) {
# regexp test
if ($item->{'title'} =~ /image/i) {
# add new item to new feed
$out->add_item(%$item);
}
}
# save feed
$out->save('habetti.rss');
exit(0);
As you see, it's pretty simple: it only takes a regular expression and a bit of object handling. Note that the parser doesn't care about the format of the incoming feed. It's pretty good to restrict a feed to what you're interested in, and that can lead you to join/aggregate different feeds from different sources into one single feed. We're getting pretty close to syndication.
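Joining several feeds into one mostly comes down to merging their item lists and dropping duplicates. Here is a minimal sketch of that step; merge_items is a hypothetical helper, and the items are made-up hashes standing in for what XML::RSS keeps in its 'items' list.

```perl
#!perl -w
use strict;

# Hypothetical helper: merge several lists of items (hash references with
# at least a 'link' key) and keep only the first item seen for each link
sub merge_items {
    my @lists = @_;
    my (%seen, @merged);
    foreach my $list (@lists) {
        foreach my $item (@$list) {
            next unless defined $item->{'link'};
            next if $seen{ $item->{'link'} }++;
            push @merged, $item;
        }
    }
    return @merged;
}

# Made-up items standing in for two parsed feeds
my @feed_a = ({ title => 'New image', link => 'http://example.org/a' },
              { title => 'New text',  link => 'http://example.org/b' });
my @feed_b = ({ title => 'New image (again)', link => 'http://example.org/a' },
              { title => 'New sound', link => 'http://example.org/c' });

my @all = merge_items(\@feed_a, \@feed_b);
print scalar(@all), " items after merging\n";
```

With real feeds, you would pass the 'items' array reference of each parsed XML::RSS object to the helper and then call add_item on the outgoing feed for each merged item, as in the script above.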
Syndication is the publication of the same data in different places. We won't dig too deep into the subject, but you should know that there are many ways to use feeds from other sites on your own site, if copyright permits. The reverse is also true: you could ask your partners to put the contents of your feed on their sites.
There are client-side methods in JavaScript but they aren't really sound, because JavaScript could be deactivated and you can't let an information system rely on its hypothetical presence and reliability. Amongst the server-side methods, the most widely used is Server Side Includes (SSI) calling a script that converts RSS to HTML. This technique is widely used to promote one's content through third-party publication.
A simple way to convert an RSS feed to HTML could go something like this:
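use LWP::Simple 'get';
use XML::RSS;
# fetch and parse the feed
$feed = get("http://habett.com/indexen.rss") or die "Can't download the feed";
$rss = new XML::RSS;
$rss->parse($feed);
# print each item as an entry of an HTML list
print "<ul>\n";
foreach $item (@{$rss->{'items'}}) {
print "<li><a href=\"$item->{'link'}\">$item->{'title'}</a></li>\n";
}
print "</ul>\n";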
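On the page that republishes the feed, the SSI call could then be a single directive. This is a sketch: the script path is hypothetical, it assumes the conversion script is installed as a CGI that prints a Content-type header before its HTML, and it assumes the server has SSI enabled for that page.

```html
<!--#include virtual="/cgi-bin/rss2html.pl" -->
```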
As far as the RSS feed for your own site is concerned, there are many ways to create it automatically or to program its generation with Perl, but we'll see how to write it by hand because it's more interesting.
We'll see an RSS 1.0 feed using only two additional modules: the Dublin Core Metadata Initiative module, which allows us to add more practical information and make the feed richer, and the Syndication module, which lets us include information about how the feed is distributed. There are many other modules, standard or not, which allow us to add many things, and you can even create your own modules, but we won't go that far in this article.
It all starts with a basic XML prolog that includes the declaration of the modules we are about to use.
<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
>
Then come the statements about your feed.
<channel rdf:about="http://habett.com/">
  <title>haBEtt.com in english</title>
  <link>http://habett.com/</link>
  <description>An offstream to subtranet technicated memes and cases</description>
  <image rdf:resource="http://habett.com/grafz/icon.png" />
  <textinput rdf:resource="http://habett.com/cgi-bin/textracom.cgi" />
Let's move on to the Dublin Core declarations about your feed.
  <dc:language>en-us</dc:language>
  <dc:rights>Copyright 1997-2004, Stéphane Roux.</dc:rights>
  <dc:date>2004-10-09T06:13:43+00:00</dc:date>
  <dc:publisher>Stéphane Roux</dc:publisher>
  <dc:creator>habett@habett.org</dc:creator>
  <dc:subject>Subtranet memes and perl cases.</dc:subject>
Finally, the syndication module declaration about how often your feed is to be read.
  <syn:updatePeriod>daily</syn:updatePeriod>
  <syn:updateFrequency>3</syn:updateFrequency>
  <syn:updateBase>1970-01-01T00:00+00:00</syn:updateBase>
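In the syndication module, updateFrequency is the number of updates within one updatePeriod, so "daily" with a frequency of 3 means the feed is expected to change about every 8 hours. A small sketch of that arithmetic, where update_interval is a hypothetical helper and only the fixed-length periods are handled:

```perl
#!perl -w
use strict;

# Length of the fixed-size update periods, in seconds
# ('monthly' and 'yearly' are left out because their length varies)
my %period_seconds = (hourly => 3600, daily => 86400, weekly => 604800);

# Hypothetical helper: seconds between two expected updates
sub update_interval {
    my ($period, $frequency) = @_;
    die "unsupported period: $period" unless exists $period_seconds{$period};
    return $period_seconds{$period} / $frequency;
}

my $hours = update_interval('daily', 3) / 3600;
print "check roughly every $hours hours\n";
```

An aggregator that respects these values has no reason to poll the feed more often than that.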
We're done with the description of the feed and its metadata. We can now describe the items we want to show. This takes place in two parts: first a simple RDF (Resource Description Framework) list of the items, and then a description of each of those. Here's the list of items:
  <items>
    <rdf:Seq>
      <rdf:li rdf:resource="http://habett.com/papers/infopornen.html" />
      <rdf:li rdf:resource="http://habett.com/perl/danceen.html" />
      <rdf:li rdf:resource="http://habett.com/perl/moteuren.html" />
    </rdf:Seq>
  </items>
</channel>
Note that a feed usually has about 15 items. Next come the descriptions of the items we have declared. Each item goes something like this, Dublin Core metadata included.
<item rdf:about="http://habett.com/papers/infopornen.html">
  <title>Infoporn suite with RSS</title>
  <link>http://habett.com/papers/infopornen.html</link>
  <description>In the infoporn suite project, RSS feeds with description, analysis, aggregation, generation and edition.</description>
  <dc:creator>Stéphane Roux</dc:creator>
  <dc:subject>RSS, XML, aggregation, generation, edition</dc:subject>
  <dc:date>2004-10-09T09:38:41+00:00</dc:date>
</item>
Repeat this kind of statement for each element you have announced. In the end, we add an icon to our feed to make it prettier and a search engine to make it useful.
<image rdf:about="http://habett.com/grafz/icon.png">
  <title>haBEtt.com</title>
  <url>http://habett.com/grafz/icon.png</url>
  <link>http://habett.com/</link>
</image>
<textinput rdf:about="http://habett.com/cgi-bin/textracom.cgi">
  <title>Search haBEtt.com</title>
  <description>Search inside the contents of haBEtt.com</description>
  <name>kw</name>
  <link>http://habett.com/cgi-bin/textracom.cgi</link>
</textinput>
</rdf:RDF>
Now you know how to handwrite an RSS 1.0 feed. There is so much more to study, but that's enough to get started. Once the feed is edited, don't forget to check it through the W3C validator, link it from your site and maybe add a little <link rel="alternate" type="application/rss+xml" title="My feed" href="/index.rss"> to your HTML page's head section.
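When editing by hand, the dc:date values are the fiddly part, since they must follow the date format used in the feed above (year-month-day, a T, the time and a time zone offset). A small sketch to print one for the current time in UTC, with w3c_date as a hypothetical helper name:

```perl
#!perl -w
use strict;
use POSIX 'strftime';

# Hypothetical helper: format a Unix timestamp as a UTC date for dc:date
sub w3c_date {
    my ($epoch) = @_;
    return strftime('%Y-%m-%dT%H:%M:%S+00:00', gmtime($epoch));
}

# Ready to paste into the feed
print '<dc:date>', w3c_date(time), "</dc:date>\n";
```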
With such a system, news and updates come to you without a mouse click and you can read them with the same ease as when browsing through your e-mails. Today, RSS feeds are not to be found everywhere, but they're a solution for the future. Think of it as an automatic notification system. Think of it as a self-managing mailing list that you can subscribe to without having to give away your e-mail address, a spam-free solution. Think of it as the potential fulfilment of the push media prophecy of circa 1997, finally at your door. A modern solution, not too complex and not bandwidth intensive, to many problems.