Tuesday February 03, 2009
Scripts are great, but don't reinvent the wheel

I'm in the process of moving my son's soccer club's website, and one of the steps is to back up the user-generated content on it. I don't have ftp, ssh, or even a backup program to access the remote server. Also, some of the content is not linked from any web page; e.g., there might be photo albums with no static links.

I started with wget, but even in recursive mode it needs a set of links to follow. The current provider suggested:

There are applications available that allow you to download all pages of
any website for viewing offline which could serve as sort of a backup
for you.  Try Google searching for something like
"website downloader spider tools"

I did that, but everything I found meant paying for something I knew I could script, and I still needed that list of files.

I managed to get a listing of the files from the directory management web page. I then used vi to get it down to a list of files like:

user_images/photos/1/563/221/th_photo_7632.jpg
user_images/photos/1/166/36/th_photo_531.jpg
user_images/photos/1/20/msgboard_1619.jpg
user_images/photos/1/20/msgboard_1312.jpg

First off, wget wasn't creating the directories for me when I tried one of these manually. So instead of adding the base URL in vi, I was going to write a Perl script to loop over the files, pull out the base directory of each, make sure to do a 'mkdir -p', retrieve the file, and store it in the correct place.
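For illustration, here is roughly what that loop might have looked like. This is a minimal sketch in shell rather than the Perl I had in mind; the function name fetch_list and the BASE_URL variable are made up for the example, and http://foo.org is a placeholder host.

```shell
#!/bin/sh
# Sketch only: fetch_list and BASE_URL are hypothetical names.
BASE_URL="${BASE_URL:-http://foo.org}"

# fetch_list LISTFILE
# For each relative path in LISTFILE: create its directory,
# then download the file into that same relative location.
fetch_list() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue        # skip blank lines
        dir=$(dirname "$path")            # directory part of the path
        mkdir -p "$dir"                   # create it, parents included
        wget -O "$path" "$BASE_URL/$path" ||
            echo "failed: $path" >&2      # report, then keep going
    done < "$1"
}

# Usage, assuming the list was saved as start.list:
#   fetch_list start.list
```

Nothing here is clever, which is the point: it is the sort of throwaway script that is easy to write and, as it turned out, unnecessary.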

When I code such things, I skip all over the place. At the start, I was interested in how I could force wget to put the file in the correct place. As I started to read the manpage, I realized that perhaps I didn't have to write a script: wget could read its URLs from a file (-i), it could prepend a base URL (-B), and it could force directory creation (-x).

From the start, I refused to buy a commercial tool because I knew wget should be able to do it for me. I felt vindicated.

In any event, here is the invocation I will end up using:

wget -x -B http://foo.org -i start.list
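For readability, the same invocation can be spelled with wget's long option names. The wrapper function mirror_list below is a made-up name for the example, not something wget provides; the three long options are the documented equivalents of -x, -B, and -i.

```shell
#!/bin/sh
# Sketch only: mirror_list is a hypothetical wrapper name.
# mirror_list BASE_URL LISTFILE
mirror_list() {
    base_url=$1
    list_file=$2
    # --force-directories (-x): recreate the remote directory tree locally
    # --base (-B):              resolve relative entries against this URL
    # --input-file (-i):        read the entries to fetch from this file
    wget --force-directories --base="$base_url" --input-file="$list_file"
}

# Usage:
#   mirror_list http://foo.org start.list
```

One thing to be aware of: with -x the files land under a directory named after the host (foo.org/user_images/...). Adding -nH (--no-host-directories) should drop that leading directory if it is not wanted.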

In retrospect, I believe I didn't even have to strip out all of the fluff from those HTML pages I had saved. I think wget would have been able to wade through it for me and get the files.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
