Web Scraping for Preservation's Sake
By WretchedGhost
Running a modern website or blog securely is complex, so many owners make it easier on themselves by putting their site behind CloudFlare and the like. I have found that when one of those providers goes down, the sites that rely on it for DNS, redirection, and whatnot go down too. Many blogs that I follow are starting to jump on that bandwagon, which is annoying since it can take quite a while for a site to return to working order, whether the problem was on CloudFlare's side or the blog owner's. Either way, I have looked into ways to keep backup copies of their sites while that is still an option.
This is the command I found to give the most organized result, with the saved formatting matching what is on the original pages:
wget -r --convert-links --html-extension --no-parent some-site-online.com
To pull an entire website while masking the crawl by sending Mozilla as the user agent, waiting 10 seconds between page pulls, and limiting the download rate to 35 KB/s (wget reads 35K as kilobytes per second, not kilobits), run:
wget -r -p -U Mozilla --wait=10 --limit-rate=35K some-site-online.com
This one is the better choice if you want to avoid being flagged as a spam crawler.
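The two commands above can be folded into one small wrapper. What follows is only a rough sketch under a few assumptions: the function name `polite_mirror`, the `DRY_RUN` variable, and the defaults are made up for illustration, and `--random-wait` (a standard wget option that varies the pause between requests) is an addition that the commands above do not use.

```shell
#!/bin/sh
# polite_mirror: hypothetical wrapper; the name and defaults are assumptions.
# Usage: polite_mirror SITE [WAIT_SECONDS] [RATE]
polite_mirror() {
    site="$1"
    wait_s="${2:-10}"   # seconds to pause between requests
    rate="${3:-35K}"    # download cap; wget reads 35K as kilobytes per second

    # Build the argument list once so it can be printed or executed.
    set -- -r -p --convert-links --no-parent \
        --wait="$wait_s" --random-wait --limit-rate="$rate" \
        -U Mozilla "$site"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command without touching the network.
        echo "wget $*"
    else
        wget "$@"
    fi
}
```

Calling `DRY_RUN=1 polite_mirror some-site-online.com` prints the command instead of running it, which is a cheap way to check the flags before pointing it at someone's blog.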
Both of these create a folder in your current working directory containing what is needed to view the site locally; the only things missing are links to pages that were not downloaded.
In this example I ran the first command against my blog, which created a folder named blog.wretchednet.com. If I open the index.html file in that folder with Firefox, it presents my home page, with all the pages I have created traversable and viewable exactly as I intended.
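As a small sketch of that last step, the helper below checks that the saved home page exists before handing it to a browser. The name `mirror_index` is hypothetical, and it assumes the default layout in which wget stores pages under a folder named after the host.

```shell
#!/bin/sh
# mirror_index: hypothetical helper that prints the path of the saved home page.
# Assumes the default wget -r layout: a folder named after the host.
mirror_index() {
    index="$1/index.html"
    if [ -f "$index" ]; then
        echo "$index"
    else
        echo "no index.html under $1" >&2
        return 1
    fi
}

# Example (commented out so nothing launches on its own):
# firefox "$(mirror_index blog.wretchednet.com)"
```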
Caveats
I have found some sites that will not let you traverse them in this fashion. I don't know if this was done by design or is just the result of poor site structure, but one of the blogs I wanted to scrape would only let me access one page at a time. This is not ideal, but it allowed me to get the few pages I wanted.
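For a site like that, one workaround is to list the pages you want in a file and let wget fetch them one by one. This is a sketch, not a tested recipe for any particular blog: `save_pages`, the `DRY_RUN` variable, and the URL list file are hypothetical, while `-p` (page requisites), `--convert-links`, `--wait`, and `-i` (read URLs from a file) are standard wget options.

```shell
#!/bin/sh
# save_pages: hypothetical helper; expects a file with one URL per line.
# Usage: save_pages urls.txt
save_pages() {
    list="$1"
    [ -f "$list" ] || { echo "missing URL list: $list" >&2; return 1; }

    # -p also grabs the images and stylesheets each page needs;
    # --wait keeps a pause between the individual downloads.
    set -- -p --convert-links --wait=10 -U Mozilla -i "$list"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command without touching the network.
        echo "wget $*"
    else
        wget "$@"
    fi
}
```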
Also, I have found myself blocked from even viewing a site after scraping it. Luckily I was able to grab the one post I needed, but soon after I tried to pull their entire site my public IP address was blocked for about 30 days.