Accessing a digital archive of web pages

For those who don't know already, the Wayback Machine is a comprehensive digital archive of web pages on the Internet, having already collected more than 286 billion web pages since 2001.

Wayback uses the CDX Server API to look up captures. It allows for complex querying, filtering and analysis of Wayback capture data. I recently used this API to visualize a website over time using d3's radial treemap. You can see the result here.

Since the most difficult part of my project was figuring out how to turn the response of the CDX API into an HTML page that I could crawl, this article is a guide to making a request to and handling the response from the API.

Making the Request

Making the request is simply a GET request to http://web.archive.org/cdx/search/cdx. A website is specified via the url query string, and it's the only required query string argument. (To see what other query strings can be specified, check the docs.)

To get all archived versions, or captures, of reddit.com, you would make a request to GET http://web.archive.org/cdx/search/cdx?url=reddit.com.

Sweet! Let's revisit the objective: get all the captures of a web page from the CDX API so I can send my crawler through the archived HTML. Okay, what to do from here? Ummmmmm…

Handling the Response

The Content-Type of the response is text/plain, so we'll do the parsing ourselves. As it says in the documentation, the columns, split by spaces, correspond to the following, in order: urlkey, timestamp, original url, mimetype, status code, digest, and length.

I created a data model object to represent one capture.

    # snapshot.py
    class Snapshot(dict):
        def __init__(self, urlkey=None, timestamp=None, original=None,
                     mimetype=None, statuscode=None, digest=None, length=None):
            super(Snapshot, self).__init__()
            self['urlkey'] = urlkey
            self['timestamp'] = timestamp
            self['original'] = original
            self['mimetype'] = mimetype
            self['statuscode'] = statuscode
            self['digest'] = digest
            self['length'] = length
            # Not part of the CDX response: where the archived HTML lives.
            self['snapshot_url'] = 'http://web.archive.org/web/%s/%s/' % (timestamp, original)

All I'm doing is setting keys on a dictionary object. Notice that I added a key to my Snapshot model that wasn't provided in the CDX response: snapshot_url.

The HTML page for any given capture lives at a URL that is formatted from the timestamp and original URL of the capture. The format is http://web.archive.org/web/<TIMESTAMP>/<URL>.

If you would prefer to get the original page back (without the Wayback Machine rewriting URLs on the page to point into the Archive), you should use this formatting: http://web.archive.org/web/<TIMESTAMP>id_/<URL>.

Sweet! Now that we have the data model (namely, the Snapshot class) set up, we can make the request to the CDX API and create instances of the Snapshot class with the response.

Since we get back a text file, we'll first split by \n, or new line, to get a list of capture strings. For each capture, we'll split by a space (' ') to get the individual columns returned in a list format. After this is done, each capture returned from the API is represented as a list of its column values.

    # reddit.py
    from snapshot import Snapshot
    import requests as req

    res = req.get("http://web.archive.org/cdx/search/cdx?url=reddit.com")
    snapshots = res.text.split('\n')

    for snapshot in snapshots:
        snapshot_items = snapshot.split(' ')

Next, we'll instantiate a new Snapshot object for each snapshot returned by the API.

    # reddit.py
    from snapshot import Snapshot
    import requests as req

    res = req.get("http://web.archive.org/cdx/search/cdx?url=reddit.com")
    snapshots = res.text.split('\n')
    snapshot_list = []

    for snapshot in snapshots:
        snapshot_items = snapshot.split(' ')
        # Only keep complete rows; this also skips the empty string
        # left over after the final newline.
        if len(snapshot_items) == 7:
            snap = Snapshot(*snapshot_items)
            snapshot_list.append(snap)

Now the snapshot_list variable contains Snapshot objects that you can use! Make sure you check that the statuscode is 200 before making the request to that specific snapshot_url. Since every column comes out of split(' ') as a string, that check has to compare against '200', not the integer 200.

I hope this article helped! Let me know if you have any questions!
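
P.S. Here is a rough sketch of the status check mentioned above, in case it helps. It runs on two made-up sample lines instead of a live request, and it uses plain dicts rather than the Snapshot class so that it stands alone. The helper names (parse_captures, ok_snapshot_urls) and the raw flag are my own for illustration; they are not part of the CDX API.

```python
# Column order as listed in the CDX documentation.
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length"]

# Made-up sample lines in the CDX text format (not real captures).
SAMPLE = (
    "com,example)/ 20100101000000 http://example.com/ text/html 200 AAAA 1234\n"
    "com,example)/ 20110101000000 http://example.com/ text/html 301 BBBB 567\n"
)


def parse_captures(text):
    """Turn a CDX text response into a list of dicts, one per capture."""
    captures = []
    for line in text.splitlines():
        items = line.split(" ")
        if len(items) == len(FIELDS):  # skips blank or malformed lines
            captures.append(dict(zip(FIELDS, items)))
    return captures


def ok_snapshot_urls(captures, raw=False):
    """Build Wayback URLs for captures whose status code is '200'.

    raw=True uses the id_ form, which returns the page without
    the Wayback Machine's URL rewriting.
    """
    suffix = "id_" if raw else ""
    return [
        "http://web.archive.org/web/%s%s/%s" % (c["timestamp"], suffix, c["original"])
        for c in captures
        if c["statuscode"] == "200"  # a string, because it came from split()
    ]


captures = parse_captures(SAMPLE)
print(ok_snapshot_urls(captures))
print(ok_snapshot_urls(captures, raw=True))
```

Comparing against the string '200' is the part that is easy to miss: comparing the parsed value to the integer 200 would silently filter out every capture.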