Find Broken Links In Your Roller Blog
Recently, on an internal mailing list, Calum described an approach (thanks) that I've since used. His suggestion was to use:
wget --spider --force-html -i bookmarks.html
Now as I've previously mentioned, you can use Dave Johnson's Grabber application to save a local copy of all of your Roller blog posts.
I modified Calum's one-liner to create a simple script (findlinks.sh) that iterates over each of those files:
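#!/bin/bash
# Check the links in every saved blog post (e.g. 20040614-0724.html).
for i in 2*.html
do
    echo "File: $i"
    # wget writes its log to stderr, so merge it into stdout for redirection.
    wget --spider --tries=2 --force-html -i "$i" 2>&1
done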
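For comparison, here is a rough Python sketch of the same loop. It is only an illustration, not part of the original script: the function names wget_command and check_posts are made up for this example, and it assumes wget is on your PATH. wget writes its log messages to standard error, so the sketch merges that stream into the captured output.

```python
import glob
import subprocess


def wget_command(path, tries=2):
    """Build the wget invocation used for one saved blog post."""
    return ["wget", "--spider", f"--tries={tries}", "--force-html", "-i", path]


def check_posts(pattern="2*.html"):
    """Yield (file name, wget log text) for every file matching pattern."""
    for path in sorted(glob.glob(pattern)):
        # wget logs to stderr, so fold stderr into the captured stdout.
        result = subprocess.run(
            wget_command(path),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        yield path, result.stdout
```

Each yielded pair holds one blog post's file name and the full wget log for its links, which could then be written to a file or processed directly.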
I added --tries=2 to make the script a little more responsive (by default, wget retries a failing link up to 20 times). It's definitely not the fastest thing in the world, but my philosophy for computer programming is:
1. Get it working.
2. Make it better (more features).
3. Make it faster and smaller.
We are still on stage 1 here.
Now you can run the script (in the directory where you saved your blog posts) with:
% ./findlinks.sh > blog-links.txt
I left this running overnight (yes, it's that slow, or rather, it had a lot of files to process), and in the morning I had a 3.7MB file of interesting information. This now needed to be processed.
Another small Python script to the rescue. This script takes the previously generated output from findlinks.sh as input and processes each line, extracting the name of each blog post and reporting links that didn't generate a "200 OK" result. It also writes out some simple statistics at the end of the run. Results are written to standard output.
It actually does a bit more than that. If a link generated a "301 Moved" response, then it's ignored. The blogs.sun.com team adjusted all the blog URLs a little while ago. Links of the form:
http://blogs.sun.com/roller/resources/richb/blog-richb.jpg
are now of the form:
http://blogs.sun.com/richb/resource/blog-richb.jpg
It's a pity there isn't an easy way to automatically update all such links in my old blog posts.
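For the locally saved copies at least, the rewrite can be sketched in a few lines of Python. This is only a sketch based on the single example pair above: it assumes every old link follows the /roller/resources/<user>/<file> pattern, the function name update_resource_links is made up for illustration, and it does nothing for the posts stored on the Roller server itself.

```python
import re

# Assumed old form: http://blogs.sun.com/roller/resources/<user>/<file>
OLD_RESOURCE = re.compile(
    r"http://blogs\.sun\.com/roller/resources/([^/\s\"'<>]+)/([^\s\"'<>]+)"
)


def update_resource_links(text):
    """Rewrite old-style resource URLs in text to the new form."""
    return OLD_RESOURCE.sub(r"http://blogs.sun.com/\1/resource/\2", text)
```

Anything that doesn't match the old pattern is left untouched, so it should be safe to run over a whole saved post.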
It also seems that various Amazon links I have don't like it when the wget command touches them. They all generate a "405 MethodNotAllowed" response, presumably because --spider probes links with HEAD requests rather than the GET requests a browser sends. I've ignored those too, as they seem to work just fine in a browser.
The new report generates output for each blog post file that looks something like:
File: 20040614-0724.html
Date: [June 14, 2004 07:24]
Url: http://www.sun.com/smrc/photos-sun/pphistory.html
Url: http://brand.sun.com/
Response: 302 Moved Temporarily
Url: https://brand.sun.com/
Url: http://au.sun.com/news/onsun/2002-04/sun_20.html
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4lqm?a=view
>>> Response: 404 Not found
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4lqp?a=view
>>> Response: 404 Not found
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4ltl?a=view
>>> Response: 404 Not found
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4ltg?a=view
>>> Response: 404 Not found
Url: http://docs.sun.com/db/doc/806-2901/6jc3a4lu3?a=view
>>> Response: 404 Not found
Url: http://www.objectfarm.org/Activities/Publications/TheMerger/UserInterfaces/OPENSTEP-Desktop.jpg
Url: http://jsdt.dev.java.net/
Url: https://jsdt.dev.java.net/
Url: http://java.sun.com/products/jms/index.jsp
Url: http://www.wired.com/news/technology/0,1282,35526,00.html
Url: http://www.sun.com/access
Response: 302 Moved Temporarily
Url: http://www.sun.com/access/
Url: http://wwws.sun.com/software/solaris/freeware/download.html
Url: http://www.sun.com/software/solaris/freeware/download.html
Url: http://www.sun.com/software/solaris/freeware/download.xml
Url: http://www.java.blogger.com.br/sd1.jpg
Url: http://www.solaris-x86.org/
Url: http://www.theregister.co.uk/2004/06/02/sun_shows_metropolis/
Url: http://calctool.sourceforge.net/Screenshots/gcalctool.png
Url: http://www.xwinman.org/screenshots/gnome-anakin.jpg
Url: http://web.comlab.ox.ac.uk/oucl/work/richard.brent/pub/pub043.html
Url: http://www.technorati.com/tag/Personal
Where a link generated a response that began with a "4", I've prefixed the report line with ">>>" to make it easier to find.
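The processing described above can be sketched roughly as follows. This is not the script I actually ran, just a minimal illustration of the rules: the function name summarize is made up, the two wget log line shapes shown in the comments are assumptions (they vary between wget versions), and counting every requested URL as a link is also an assumption. The "Date:" line of the report is left out.

```python
import re

# Assumed shapes of the wget log lines (they differ between wget versions):
#   --07:48:10--  http://example.com/page
#   HTTP request sent, awaiting response... 404 Not Found
URL_LINE = re.compile(r"^--.*?--\s+(\S+)")
RESPONSE_LINE = re.compile(r"awaiting response\.\.\. (\d{3})\s*(.*)")


def summarize(lines):
    """Return (report_lines, stats) for the combined findlinks.sh output."""
    report = []
    stats = {"files": 0, "links": 0, "moved": 0, "not_allowed": 0, "broken": 0}
    for raw in lines:
        line = raw.strip()
        if line.startswith("File: "):
            stats["files"] += 1
            report.append(line)
            continue
        url = URL_LINE.match(line)
        if url:
            stats["links"] += 1
            report.append("Url: " + url.group(1))
            continue
        response = RESPONSE_LINE.search(line)
        if response is None:
            continue  # connection chatter and other lines we don't need
        code, reason = response.groups()
        if code == "200":
            continue  # healthy link, nothing to report
        if code == "301":
            stats["moved"] += 1  # relocated URL, counted but not reported
            continue
        if code == "405":
            stats["not_allowed"] += 1  # counted but not reported
            continue
        text = f"Response: {code} {reason}".rstrip()
        if code.startswith("4"):
            stats["broken"] += 1
            text = ">>> " + text  # flag client errors so they stand out
        report.append(text)
    return report, stats
```

To use it, pass an open file such as open("blog-links.txt") to summarize, print each entry of the returned report, and then print the counts from the stats dictionary.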
Here are my link statistics:
917 files processed.
13462 links processed.
2372 links moved.
578 'method not allowed' links.
766 broken links found.
Now I just need to go back and edit all of those broken links and fix them up if possible.
There is also no doubt in my mind that this can be improved. Suggestions on how to go to stage 2 are most welcome.
( Jan 23 2007, 07:48:10 AM PST )