Publishing Netscape Bookmarks
For quite some time, I've wanted to publish my large list of bookmarks on the web. The primary reason is to give me a map of information to use when I am not near a browser with my bookmarks. Another reason is to let others benefit from the time I've spent gathering and organizing these links. I could just upload my Netscape bookmarks.html file, as it is just HTML, but there are issues.
Issue the first is that I have Sun internal links sprinkled liberally through my bookmark folder. I could copy my bookmarks.html aside, then manually remove the internal links, but the current version contains 325 bookmarks, and I am impatient. I also don't want to go through this exercise every time I update, add, or delete a bookmark. Issue the second is that some of my links are old and outdated -- documents have moved (or decomposed). I need to identify links that are broken, and either deal with them, or remove them from the bookmark file altogether. Issue the last is that there are some links in my folders which I do not want published to the world. Sure, I trust you to keep where I bank a secret, but not that guy in the office next to you -- he's kinda shady.
So what to do when faced with a big text processing task like this? Whip out Perl, of course! There are several approaches here, as with every task Perl is involved with. My tack starts with a utility called HTML Tidy. This utility takes the not-very-well-formed Netscape bookmark HTML and gives me well-formed XML. In Perl, I prefilter the bookmarks like this:
    use strict;
    use File::Temp qw/tempfile/;

    # Filter STDIN through tidy
    my $temp = new File::Temp( UNLINK => 1 );
    open TIDY, "| tidy -quiet -asxml 2>/dev/null 1>$temp"
        or die "Failed to open pipe to 'tidy': $!";
    print TIDY while (<>);
    close TIDY;
Now the file named by $temp contains well-formed XML. Note that the temporary file uses UNLINK => 1. This causes the tidy-formatted XML file to be cleaned up when the program exits or the $temp variable goes out of scope, whichever comes first. Now that I have well-formed XML, I can search through the bookmarks programmatically. My weapon of choice for tasks like this is the fine XML::XPath module set written by Matt Sergeant. To begin with, I need to identify the root of the personal toolbar folder:
    use XML::XPath;

    my $xp   = new XML::XPath( filename => $temp );
    my $root = $xp->find( '/html/body' );
Bookmarks in the file are all organized as HTML definition lists (DL/DD/DT). The very top of the document is inside of the body tag. All of the nodes within the body are now contained in the $root variable. The general pattern for folders and bookmarks within the file is as follows:
    DL
      DD
        H3 -> Folder Title
        DT
          A -> Bookmark
        DT
          A -> Bookmark
        DD
          H3 -> Sub-folder Title
This structure can be arbitrarily deep, so the script must handle nesting to any depth. The best way is to process the file using a recursive function:
    sub collect_bookmarks {
        my ($ctx, $href) = @_;

        # For each folder root...
        my $f_result = $ctx->find( './dl/dd' );
        foreach my $f_node ($f_result->get_nodelist) {

            # Grab the folder title, skip if no name (separators)
            (my $f_name = $f_node->find( './h3' )) =~ s/^\s*|\s*$//g;
            next unless $f_name;

            # Within this folder, search for bookmark entries
            my $a_result = $f_node->find( './dl/dt/a' );
            foreach my $a_node ($a_result->get_nodelist) {

                # Retrieve and normalize the URL and bookmark title
                my $link  = $a_node->getAttribute('href');
                my $title = $a_node->string_value();
                $link  =~ s/^\s*|\s*$//g;
                $link  =~ s/\n//g;
                $title =~ s/^\s*|\s*$//g;
                $title =~ s/\n//g;

                # Store the bookmark unless it's bad
                if (accept_bookmark($title,$link)) {
                    $href->{$f_name}{$title} = $link;
                } else {
                    print "Skipping bookmark: $link\n";
                }
            }

            # Recursive call to process subfolders of this node
            collect_bookmarks($f_node, \%{$href->{$f_name}});
        }
    }
We call this with the root folder as:
    collect_bookmarks($root, \%BOOKMARKS);
This will identify and recursively process any subfolders that are present in the bookmark file structure. Each recursive call receives a hash reference, which it populates with a folder name and the bookmarks found in that folder. Note the call to accept_bookmark(), which makes the final decision on whether a bookmark is good or bad. Bad bookmarks are defined by my own criteria, which filter out broken, invalid and internal links, as well as those that I don't want to publish. The function looks like this:
    use URI;
    use LWP::Simple qw/head/;

    sub accept_bookmark {
        my ($title, $link) = @_;

        # Parse the link URL
        my $luri = new URI($link);

        # These things are all bad.
        return 0 if $luri->scheme !~ /http|ftp/
                 or $title =~ $PRIVATE
                 or $link =~ $PRIVATE
                 or $luri->host =~ /\.(corp|ebay|sfbay|west|central|uk$)/
                 or index($luri->host,'.') == -1
                 or not head($luri);

        return 1;
    }
This uses URI to filter out any file:// links that might be hanging around, and the head() function from LWP::Simple to check links. The $PRIVATE variable is a compiled regular expression (using qr{}) which contains title and link patterns that I don't want to publish.
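To make the pattern checks a little more concrete, here is a minimal sketch using only core Perl. The contents of $PRIVATE below are made up for the example (my real list isn't shown in this post), and host_is_internal() is a hypothetical helper that simply wraps the same host pattern used in accept_bookmark().

```perl
use strict;
use warnings;

# Hypothetical private patterns; the real $PRIVATE list is not shown here.
my $PRIVATE = qr{bank|payroll|webmail}i;

# Same host pattern as in accept_bookmark(). Note that the '$' anchor
# applies only to the 'uk' alternative, not to the others.
sub host_is_internal {
    my ($host) = @_;
    return $host =~ /\.(corp|ebay|sfbay|west|central|uk$)/ ? 1 : 0;
}

# Check a couple of sample titles against the private pattern.
for my $title ('My Bank Login', 'Perl documentation') {
    printf "%-20s %s\n", $title, ($title =~ $PRIVATE ? 'private' : 'public');
}

# Check a few sample hosts against the internal-host pattern.
for my $host ('foo.sfbay.sun.com', 'www.bbc.co.uk', 'www.perl.org') {
    printf "%-20s %s\n", $host, (host_is_internal($host) ? 'skip' : 'keep');
}
```

Because qr{} compiles the pattern once, the same $PRIVATE object can be matched against both the title and the link without being rebuilt on each call. The host check treats any name ending in .uk as internal, which is the behaviour discussed in the comments below.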
Now that I can browse all of the information in the bookmarks, the next step is to write them out in some browsable format. I'm thinking DHTML collapsible lists, but I could just as easily print out simple HTML. I haven't decided yet, but I'll publish the rest of the script in a follow-up post once it is complete.
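For the simple-HTML option, a minimal sketch might look like the following. This is not the finished script, just an illustration of walking the nested hash that collect_bookmarks() builds, where each value is either a URL string (a bookmark) or another hash reference (a folder). The render_folder() and escape_html() names and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape collect_bookmarks() produces:
# folder name => { bookmark title => URL, subfolder name => { ... } }
my %BOOKMARKS = (
    'Perl' => {
        'CPAN'    => 'http://www.example.com/cpan',
        'Modules' => {
            'XML::XPath' => 'http://www.example.com/xpath?a=1&b=2',
        },
    },
);

# Escape the characters that are special in HTML text and attributes.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Return a nested <ul> for one folder hash, recursing into subfolders.
sub render_folder {
    my ($folder, $depth) = @_;
    my $pad  = '  ' x $depth;
    my $html = "$pad<ul>\n";
    foreach my $name (sort keys %$folder) {
        my $value = $folder->{$name};
        my $label = escape_html($name);
        if (ref $value eq 'HASH') {
            # A hash reference is a folder: print its title, then recurse.
            $html .= "$pad  <li>$label\n";
            $html .= render_folder($value, $depth + 2);
            $html .= "$pad  </li>\n";
        }
        else {
            # Anything else is treated as a bookmark URL.
            my $url = escape_html($value);
            $html .= qq{$pad  <li><a href="$url">$label</a></li>\n};
        }
    }
    return $html . "$pad</ul>\n";
}

print render_folder(\%BOOKMARKS, 0);
```

Since a Perl hash does not remember insertion order, this sketch sorts entries by name, so the original bookmark order is lost. A bookmark whose title matches a subfolder name in the same folder would also collide in this structure.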
PS: Syntax highlighting above was done using Vim 6.3 and the code2html.vim script by Soren Anderson.



Posted by PatrickG on September 28, 2004 at 08:10 PM PDT
Posted by alecm on September 28, 2004 at 10:48 PM PDT
I could have used something like grep -v to just drop URLs that had a pattern I didn't want. That wouldn't give me the flexibility to link-check or modify links, check for RSS feeds, etc. during extraction, and I would lose the folder hierarchy as well.
Alec -- the current pattern will drop out co.uk, but only because I didn't put too much effort into making it work. 100% of my personal links that end in .uk are in the Sun internal .uk domain.
Posted by comand on September 29, 2004 at 03:03 PM PDT