I'm in the process of moving my son's soccer club's website, and one of the steps is to back up the user-generated content on it. I don't have FTP, SSH, or even a backup program to access the remote server. Also, some of the content is not linked from any web page; for example, there might be photo albums with no static links.
I started with wget, but even recursively, it needs a set of links to follow. The current provider suggested:
There are applications available that allow you to download all pages of any website for viewing offline which could serve as sort of a backup for you. Try Google searching for something like "website downloader spider tools"
I did that, but I kept running into paid tools for something I knew I could script, and I still needed that list of files.
I managed to get a listing of the files from the directory management web page. I then used vi to get it down to a list of files like:
user_images/photos/1/563/221/th_photo_7632.jpg
user_images/photos/1/166/36/th_photo_531.jpg
user_images/photos/1/20/msgboard_1619.jpg
user_images/photos/1/20/msgboard_1312.jpg
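I did this cleanup by hand in vi, but the same reduction can probably be done from the shell. Here is a minimal sketch, assuming the saved listing page is called listing.html (a made-up name) and that every path starts with user_images/ and contains no spaces or quotes. The function name extract_paths is just for illustration, and grep's -o option is a GNU/BSD extension rather than strict POSIX.

```shell
#!/bin/sh
# Hypothetical helper: print every user_images/... path found in a saved
# HTML listing, one per line, sorted and with duplicates removed.
# Assumes paths contain no whitespace, quotes, or angle brackets.
extract_paths() {
    grep -oE 'user_images/[^"<>[:space:]]+' "$1" | sort -u
}

# Example use (listing.html is an assumed file name):
#   extract_paths listing.html > start.list
```

The pattern is deliberately loose, so it is worth eyeballing the resulting list before feeding it to anything else.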
First off, wget wasn't creating the directories for me when I tried one of these by hand. So instead of adding the base URL in vi, I was going to write a Perl script to loop over the files, pull out the directory part of each path, run 'mkdir -p' on it, retrieve the file, and store it in the correct place.
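I never wrote that Perl script, but the loop I had in mind would look roughly like this in plain shell. Treat it as a sketch: fetch_list, BASE_URL, and FETCH are names I'm inventing here, the base URL is a placeholder, and the list file is assumed to hold one relative path per line.

```shell
#!/bin/sh
# Sketch of the loop described above (not the script I ended up needing).
# BASE_URL is a placeholder; FETCH can be overridden, e.g. for a dry run.
BASE_URL=${BASE_URL:-http://foo.org}
FETCH=${FETCH:-wget -q -O}

fetch_list() {
    # $1: file containing one relative path per line
    while IFS= read -r path; do
        [ -n "$path" ] || continue       # skip blank lines
        mkdir -p "$(dirname "$path")"    # make sure the target directory exists
        # FETCH is left unquoted on purpose so it can carry options
        $FETCH "$path" "$BASE_URL/$path" || echo "failed: $path" >&2
    done < "$1"
}

# Example use:
#   fetch_list start.list
```

Each file lands at the same relative path it had on the server, which is exactly what I wanted wget to do on its own.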
When I code such things, I skip all over the place. At the start, I was interested in how I could force wget to put the file in the correct place. As I started to read the manpage, I realized that perhaps I didn't have to write a script at all: wget could loop over a file (-i), prepend a base URL (-B), and force directory creation (-x).
From the start, I refused to buy a commercial tool because I knew wget should be able to do it for me. I felt vindicated.
In any event, here is the invocation I will end up using:
wget -x -B http://foo.org -i start.list
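With -x, wget saves everything under a directory named after the host, so for this invocation the files should end up under foo.org/. Since this is meant to be a backup, a quick check that nothing was skipped seems worthwhile. The following is a minimal sketch, assuming the list holds one relative path per line; missing_files is a made-up name.

```shell
#!/bin/sh
# Print every path from the list that is missing (or empty) under the
# directory wget wrote into.
missing_files() {
    # $1: list file, $2: download root (the host-named directory from -x)
    while IFS= read -r path; do
        [ -n "$path" ] || continue                  # skip blank lines
        [ -s "$2/$path" ] || printf '%s\n' "$path"  # -s: exists and is non-empty
    done < "$1"
}

# Example use:
#   missing_files start.list foo.org
```

If it prints nothing, every file in the list was downloaded and has at least some content; anything it does print can go into a new list for another wget run.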
In retrospect, I believe I didn't even have to strip out all of the fluff from those HTML pages I had saved. I think wget would have been able to wade through it for me and pick out the links itself, since its -F (--force-html) option tells it to treat the input file as HTML.