The Wayback Machine Scraper [GDoc]

I’ve always loved putting together quick, single-purpose tools in Google Docs. With a little bit of scraping/XPath knowledge, a Google Spreadsheet can become an inbound marketer’s best friend. There are already a ton of great tools out there (robust Twitter analytics docs, of which TAGS is my favourite, SERP analysis tools, content idea generators, and others), and most of them rely on the workhorse functions importXML and importfeed. One of the biggest issues with GDoc-based tools has always been the limit of 50 importXML functions per spreadsheet, something that generally makes large-scale analysis and scraping difficult. After looking around I came across Dave Sottimano’s awesome Bulk ImportXML tool, which has been re-purposed for the Wayback scraper.

The Wayback Machine Scraper is a Google Doc that allows you to quickly pull page titles from URLs that no longer exist (i.e. 404 pages). Having page title information for your 404 pages can really help when you want to redirect them to their closest equivalent on your new site. This is especially important when you can’t tease out any information from the URL itself.

How It Works

The scraper takes the URLs you enter in the left-hand column and returns the Wayback Machine title information in column B. Just copy and paste your list of 404s into column A and click the ‘Go’ button; enter the number of URLs you want to check and the results will immediately start populating the second column. There might be a large number of “N/A”s, which means the Wayback Machine didn’t have any record of the page. There’s not much you can do with the Wayback Machine in that case; your best bet is to look back in your GA data to get the page titles.
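
If you’d rather see the same idea outside of a spreadsheet, here is a rough Python sketch of the lookup. To be clear, this is not the code behind the Google Doc (that runs on the Bulk ImportXML script); it’s just an illustration of the flow: ask the Wayback Machine for its closest snapshot of a URL, fetch that snapshot, and read the page title, falling back to “N/A” when there’s no record. I’m assuming the Internet Archive’s public availability endpoint and its `archived_snapshots` / `closest` JSON layout here, and the function names (`extract_title`, `closest_snapshot_url`, `wayback_title`) are made up for this example.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser

# Assumption: the Internet Archive's public availability endpoint.
AVAILABILITY_API = "https://archive.org/wayback/available"


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or "N/A" if none."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or "N/A"


def closest_snapshot_url(payload):
    """Pull the closest snapshot URL out of an availability response dict.

    Assumes the {"archived_snapshots": {"closest": {...}}} layout;
    returns None when no snapshot is reported as available.
    """
    closest = payload.get("archived_snapshots", {}).get("closest") or {}
    if closest.get("available"):
        return closest.get("url")
    return None


def wayback_title(url, timeout=10):
    """Look up one dead URL and return its archived title, or "N/A"."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_API}?{query}", timeout=timeout) as resp:
        payload = json.load(resp)
    snapshot = closest_snapshot_url(payload)
    if snapshot is None:
        return "N/A"  # same meaning as the "N/A" rows in the spreadsheet
    with urllib.request.urlopen(snapshot, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_title(html)
```

Calling `wayback_title()` on each URL from your 404 list would give you roughly what column B of the spreadsheet shows. For a long list you’d probably want a short pause between requests and some error handling around the network calls, neither of which is included in this sketch.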

You can check out the tool here: http://jklt.co/waybackscraper
