For those who don’t know already, the Wayback Machine is a comprehensive digital archive of web pages on the Internet — having already collected more than 286 billion web pages since 2001.
Wayback uses the CDX Server API to look up captures — it allows for complex querying, filtering, and analysis of Wayback capture data. I recently used this API to visualize a website over time using d3's radial treemap. You can see the result here.
Since the most difficult part of my project was figuring out how to turn the response of the CDX API into an HTML page that I could crawl, this article is a guide to making a request to and handling the response from the API.
Making the request is simply a GET request to http://web.archive.org/cdx/search/cdx. The website you want captures for is specified via the url query string parameter, and it's the only required argument. (To see what other query string arguments can be specified, check the docs).
To get all archived versions, or captures, of reddit.com, you would make a GET request to http://web.archive.org/cdx/search/cdx?url=reddit.com.
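If you're following along in Python, a minimal version of that request might look like the sketch below. I'm using the requests library here; the params argument is just another way of building the ?url=reddit.com query string.
# A minimal sketch of the CDX request using the requests library.
import requests

res = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": "reddit.com"},  # the only required query string argument
)
print(res.status_code)  # 200 if the lookup succeeded
print(res.text[:200])   # the first few lines of the plain-text response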
Sweet! Let’s revisit the objective — get all the captures of a web page from the CDX API so I can send my crawler through the archived HTML. Okay, what to do from here?
Ummmmmm…
The Content-Type of the response is text/plain — so we'll do the parsing ourselves. As it says in the documentation, the columns, split by spaces, correspond to the following, in order: urlkey, timestamp, original url, mimetype, status code, digest, and length.
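For illustration, a single line of the response looks something like this (the values below are made up, but the shape is what you'll get back):
com,reddit)/ 20140101000000 http://www.reddit.com/ text/html 200 ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 12345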
I created a data model object to represent one capture.
# snapshot.py
class Snapshot(dict):
    def __init__(self, urlkey=None, timestamp=None, original=None, mimetype=None,
                 statuscode=None, digest=None, length=None):
        super(Snapshot, self).__init__()
        self['urlkey'] = urlkey
        self['timestamp'] = timestamp
        self['original'] = original
        self['mimetype'] = mimetype
        self['statuscode'] = statuscode
        self['digest'] = digest
        self['length'] = length
        self['snapshot_url'] = 'http://web.archive.org/web/%s/%s' % (timestamp, original)
All I'm doing is setting keys on a dictionary object. Notice that I added a key to my Snapshot model that wasn't provided in the CDX response — snapshot_url.
The HTML page for any given capture lives at a URL that is formatted from the timestamp and original URL of the capture. The format is http://web.archive.org/web/<TIMESTAMP>/<URL>.
If you would prefer to get the original page back (without the Wayback Machine rewriting URLs on the page to point into the Archive), you should use this format instead: http://web.archive.org/web/<TIMESTAMP>id_/<URL>.
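As a quick sketch, here's how you might build both flavors of URL from a capture's timestamp and original URL (the helper names are just for illustration):
# Build the two Wayback URL flavors for a capture.
def snapshot_url(timestamp, original):
    # Rewritten page, with links pointing back into the Archive.
    return 'http://web.archive.org/web/%s/%s' % (timestamp, original)

def original_snapshot_url(timestamp, original):
    # Raw capture, without Wayback's URL rewriting.
    return 'http://web.archive.org/web/%sid_/%s' % (timestamp, original)

print(snapshot_url('20140101000000', 'http://reddit.com/'))
# http://web.archive.org/web/20140101000000/http://reddit.com/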
Sweet! Now that we have the data model (namely, the Snapshot class) set up, we can make the request to the CDX API and create instances of the Snapshot class with the response.
Since we get back a text file, we'll first split on \n, or newline, to get a list of capture strings. For each capture, we'll split on a space (' ') to get the individual columns as a list. After this is done, we'll have a list of lists representing the captures returned from the API.
# reddit.py
from snapshot import Snapshot
import requests as req

res = req.get("http://web.archive.org/cdx/search/cdx?url=reddit.com")
snapshots = res.text.split('\n')
for snapshot in snapshots:
    snapshot_items = snapshot.split(' ')
Next, we’ll instantiate a new Snapshot object for each snapshot returned by the API.
# reddit.py
from snapshot import Snapshot
import requests as req

res = req.get("http://web.archive.org/cdx/search/cdx?url=reddit.com")
snapshots = res.text.split('\n')
snapshot_list = []
for snapshot in snapshots:
    snapshot_items = snapshot.split(' ')
    # A valid capture line has exactly 7 columns; this also skips the trailing blank line.
    if len(snapshot_items) == 7:
        snap = Snapshot(snapshot_items[0], snapshot_items[1], snapshot_items[2],
                        snapshot_items[3], snapshot_items[4], snapshot_items[5],
                        snapshot_items[6])
        snapshot_list.append(snap)
Now the snapshot_list variable contains Snapshot objects that you can use! Make sure you check that statuscode == 200 before making the request to that specific snapshot_url.
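For example, here's roughly how you might filter for successful captures and fetch the archived HTML for each one (a sketch that continues from the snapshot_list built above):
# Only fetch captures that were archived with an HTTP 200 status.
for snap in snapshot_list:
    if snap['statuscode'] == '200':  # the CDX columns are strings, so compare against '200'
        page = req.get(snap['snapshot_url'])
        html = page.text
        # ... hand the archived HTML off to your crawler here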
I hope this article helped! Let me know if you have any questions!