RSS/Atom Auto-discovery
My last article on bookmark publishing was picked up on Dave Johnson's Roller blog today, along with some interesting ideas for enhancement. One idea that caught my attention was RSS auto-discovery. A quick search on Google showed that a few people have written about using HTML <link> elements to discover alternate content types for a particular page. Dave also suggested that I could output the bookmark hierarchy using Outline Processor Markup Language (OPML), and import the resulting document into Roller (once blogs.sun.com updates to Roller 0.9.9, that is).
As there is only so much I can do between conference calls, writing requirements documents, and planning out my yearly goals, I will focus on idea the first. OPML will have to wait for another article, but if you have extra time, feel free to read ahead.
You might be familiar with the <link> tag. It allows you to specify cascading stylesheets, "favorite" icons, and so on. It can also advertise alternate content types for your page. If you click View -> Page Source on this page, for instance, you will see that I define an alternate content type of application/rss+xml. The path given in the link for this document is relative to the blogs.sun.com server, but it can also be an absolute URL, which lets a page pull alternate content from a different server.
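To make the idea concrete, here is a minimal sketch of what such a page head might look like and how the alternate link stands out among the others. The HTML is a made-up sample (only the feed path is borrowed from this blog), link_elements is a hypothetical helper, and the regular expressions are a shortcut that assumes double-quoted attributes; a real parser would be more robust.

```perl
use strict;
use warnings;

# Hypothetical page head, for illustration only.
my $html = <<'HTML';
<head>
  <link rel="stylesheet" type="text/css" href="/theme/site.css">
  <link rel="shortcut icon" href="/favicon.ico">
  <link rel="alternate" type="application/rss+xml" title="RSS" href="/roller/rss/comand">
</head>
HTML

# Collect the attributes of every <link> element into a list of hash refs.
# Assumes attribute values are double-quoted.
sub link_elements {
    my ($doc) = @_;
    my @links;
    while ($doc =~ /<link\b([^>]*)>/gi) {
        my $attrs = $1;
        my %attr;
        while ($attrs =~ /([\w-]+)\s*=\s*"([^"]*)"/g) {
            $attr{lc $1} = $2;
        }
        push @links, \%attr;
    }
    return @links;
}

# Only the rel="alternate" entry describes another representation of the page.
for my $link (link_elements($html)) {
    next unless ($link->{rel} // '') eq 'alternate';
    print "$link->{type} => $link->{href}\n";
}
```

The stylesheet and icon entries are skipped because their rel values differ, which is the same filtering problem the rest of this article solves by MIME type.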
There are several content types that I am interested in for the bookmark publisher script. While RSS is quite common, other formats such as Atom and RDF are also popular. I want to ensure that among all of the favorite icons and style sheets, I get only the links to alternate content types. Each link tag has a type attribute, which contains the MIME type of the alternate link. As I search through the link entries, I'll check each one to see that it matches one of my desired content types. As I intend to roll this function back into the bookmark publisher script when I'm done, I'll write the functionality as a subroutine that takes a URI object and returns a hash reference. The hash reference will be keyed by MIME type, and will contain a URI for each verified content type. The function starts out by listing the acceptable content types:
    use strict;
    use LWP::UserAgent;
    use URI;

    my $ua = new LWP::UserAgent( env_proxy => 1 );

    sub autodiscover {
        my $uri = shift;
        my $map;
        my %ALT_TYPES = map { $_ => 1 } qw(
            application/rss+xml
            application/rdf+xml
            application/atom+xml
            text/xml
        );
In contrast to the accept_bookmark subroutine I described yesterday, this routine requires LWP::UserAgent, the full-featured class that underlies LWP::Simple. A UserAgent gives far greater control over the request and the response, which makes it ideal for this task. The above code creates a global user agent, which should be used by all sections of the final bookmark publisher script for grabbing files and checking URLs. The autodiscover function then takes the URI to check, declares the $map of media types to URIs that it will return, and builds a hash of the alternate media types that we want to auto-discover. The map function turns each of the listed content types into a hash key with 1 as the value, so checking whether a media type is acceptable becomes a simple hash lookup.
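As a quick illustration of why the hash is convenient, the following sketch builds the same %ALT_TYPES lookup on its own and tests a few types against it. The three types being tested are arbitrary examples, not values taken from any particular page.

```perl
use strict;
use warnings;

# Same construction as in autodiscover: every listed type becomes a key.
my %ALT_TYPES = map { $_ => 1 } qw(
    application/rss+xml
    application/rdf+xml
    application/atom+xml
    text/xml
);

# A membership test is now a plain hash lookup rather than a list scan.
for my $type (qw(application/atom+xml text/css image/x-icon)) {
    printf "%-22s %s\n", $type, $ALT_TYPES{$type} ? 'accept' : 'skip';
}
```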
Now that we have poured the foundation, it is time to check the passed URI to see if it's pointing at anything interesting:
        my $rsp = $ua->get( $uri );
        return {} unless $rsp->is_success;

        # Record the link content type
        my $headers = $rsp->headers;
        my $ctypes = ref($headers->{'content-type'}) eq 'ARRAY'
            ? $headers->{'content-type'}
            : [ $headers->{'content-type'} ];
        my ($ctype) = split /;/, $ctypes->[0];
        $map->{$ctype} = $uri;
        $map->{default} = $ctype;
In the accept_bookmark subroutine discussed yesterday, we used the head function of LWP::Simple to check whether a link is valid. To get a handle on the link tags embedded in the target HTML document, we need to fetch the document with get instead. Unfortunately, the get function provided by LWP::Simple returns only the document body, which is not enough information to process these links. The get() method of LWP::UserAgent returns an HTTP::Response object, whose headers() method returns the HTTP::Headers object that we are most interested in.
After requesting the target link, we check whether the HTTP response indicates success, and return an empty hash reference if it does not. This tells the caller that no valid media types were found for the specified URI. Next, we read the response headers and determine the content type of the returned document (there's no sense guessing). If the header occurs only once, the content-type field holds a scalar value, but if it occurs more than once (as can happen, for example, when the page repeats its type in a meta http-equiv element that LWP folds into the headers), the field holds an ARRAY reference. We deal with this by detecting the field type and wrapping a scalar in an array reference. We can then split the media type from any parameters that follow it, such as charset. Weightings of the form q=x, where 0 < x <= 1, indicate preference when multiple types are available, but they normally appear in request headers such as Accept rather than in Content-Type. For simplicity, this example discards everything after the first semicolon, but the parameters might be useful in the future. Once we've separated the MIME type out of the header string, we assign the original URI to that content type, and record this content type as the default.
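The header handling described above can be tried in isolation. This is a minimal sketch, not the code from the script: media_type is a hypothetical helper name, and the trimming and lowercasing are extra tidying that the original routine does not do.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a Content-Type value (a plain string, or an
# array reference when the header occurred more than once) to a bare type.
sub media_type {
    my ($value) = @_;
    my $values = ref($value) eq 'ARRAY' ? $value : [ $value ];
    return unless defined $values->[0];

    # Everything after the first semicolon is a parameter (e.g. charset).
    my ($type) = split /;/, $values->[0];
    return unless defined $type;

    $type =~ s/^\s+|\s+$//g;    # trim stray whitespace
    return lc $type;            # media types are case-insensitive
}

print media_type('text/html; charset=ISO-8859-1'), "\n";
print media_type([ 'Text/HTML', 'text/html; charset=utf-8' ]), "\n";
```

Both calls print text/html: the first strips the charset parameter, and the second takes the first of two values and normalizes its case.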
Now, I often save links to documents that are not HTML, like PDF documents and images. These documents can't contain any link elements, so we skip auto-discovery for any content type other than text/html. This should probably not be limited to just text/html, however, as other types like application/xhtml+xml or even text/xml might contain useful information. We can worry about this in the final application -- this script is just for discovery, so it's OK to drop the ball here. LWP::UserAgent parses the head section of an HTML response by default and exposes each link element it finds as a Link header, so the link tags, if the target document has any, arrive in the header field link. Here is our link extraction code:
        # Don't autodiscover for non-html links (e.g. pdf, images)
        if ($ctype eq 'text/html') {
            my $links = ref($headers->{link}) eq 'ARRAY'
                ? $headers->{link}
                : [ $headers->{link} ];
            foreach my $link (@$links) {
                my ($href, $type) = $link =~ /<(.+?)>;.+; type="(.+?)"$/;
                next unless $href and $type;
                if ($ALT_TYPES{$type}) {
                    my $nuri;
                    if ($href =~ m#\w+://#) {
                        $nuri = new URI($href);
                    } else {
                        $nuri = new URI($uri);
                        $nuri->path( $href );
                    }
                    # Check that the feed actually exists...
                    $rsp = $ua->head( $nuri );
                    $map->{$type} = $nuri if $rsp->is_success;
                }
            }
        }
        return $map;
    }
The first couple of lines should be familiar. The link field works the same way as the content-type field: it holds a scalar value if there is only one link, or an ARRAY reference if there is more than one. Again, we wrap a scalar in an array reference. Next, we iterate through each of the links in the document. For each link entry, we extract the href value and the MIME type. If we can't find both, this is not a usable link entry, so we skip to the next item. If we do extract both fields, we check the type against our predefined map of content types. If the type is in the alternate type map, we construct a new URI for it. Note that we first check whether the link specifies an absolute URL (i.e. one with a scheme). If it does not, the link is treated as relative to the original site, so we just replace the path segment in the original URI. This is only correct for root-relative paths such as /roller/rss/comand; URI->new_abs would resolve other relative forms against the page URL. To be pedantic, the last step is to check that the listed feed actually responds. If it does, the type and its URI are inserted into the map, which is returned to the caller once every link has been examined.
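The regular expression in the extraction code expects type to be the final attribute of the header value, with at least one other attribute before it. As a hedged alternative, here is a small sketch that accepts the attributes in any order. parse_link is a hypothetical helper, and it assumes the double-quoted name="value" form shown in the sample value, which is itself made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one Link header value of the form
#   <href>; name="value"; name="value"
# into the href and a hash of its parameters, in whatever order they appear.
sub parse_link {
    my ($value) = @_;
    return unless defined $value;

    # The target sits between angle brackets; the parameters follow it.
    my ($href, $rest) = $value =~ /^\s*<([^>]*)>(.*)$/s or return;

    my %param;
    while ($rest =~ /;\s*([\w-]+)\s*=\s*"([^"]*)"/g) {
        $param{lc $1} = $2;
    }
    return ($href, \%param);
}

my ($href, $param) = parse_link(
    '</roller/rss/comand>; rel="alternate"; title="RSS"; type="application/rss+xml"'
);
print "$param->{type} => $href\n";
```

Returning the whole parameter hash also makes it possible to check rel="alternate" before trusting the type, which the original pattern does not do.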
When I point my autodiscover function at this blog, I get the following structure back (printed out here with Data::Dumper):
    bash$ ./link.pl
    $VAR1 = {
              'default' => 'text/html',
              'application/rss+xml' => bless( do{\(my $o = 'http://blogs.sun.com/roller/rss/comand')}, 'URI::http' ),
              'text/html' => bless( do{\(my $o = 'http://blogs.sun.com/comand')}, 'URI::http' )
            };
This is what I was expecting -- a default type of text/html, because the main page existed, and an alternate type of application/rss+xml which points to the RSS feed for my site. If the main page did not exist, I'd get back an empty structure, and I'd know to skip the site entirely. This might not be the desired sequence of events, however. It might be desirable to attempt auto-discovery even in the case of a broken main link. We'll see how things turn out when I integrate this function back into the main bookmark publisher script.
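For a sense of how the publisher script might consume this structure, here is a minimal sketch that picks one feed out of the returned map. preferred_feed is a hypothetical helper, the preference order is my own assumption rather than anything decided above, and the URIs are plain strings here instead of URI objects to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first feed type present in the map,
# trying the types in an assumed order of preference.
sub preferred_feed {
    my ($map) = @_;
    for my $type (qw(application/atom+xml application/rss+xml
                     application/rdf+xml text/xml)) {
        return ($type, $map->{$type}) if $map->{$type};
    }
    return;    # nothing suitable: empty list
}

# Same shape as the structure printed above.
my $map = {
    'default'             => 'text/html',
    'application/rss+xml' => 'http://blogs.sun.com/roller/rss/comand',
    'text/html'           => 'http://blogs.sun.com/comand',
};

if (my ($type, $feed) = preferred_feed($map)) {
    print "feed: $feed ($type)\n";
} else {
    print "no feed found\n";
}
```

An empty map, which is what autodiscover returns for a broken main link, falls through to the else branch.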


