Guide to Handling Internet Archive’s CDX Server API Response

by Julia Geist, April 23rd, 2017

Accessing a digital archive of web pages

For those who don’t know already, the Wayback Machine is a comprehensive digital archive of web pages on the Internet — having already collected more than 286 billion web pages since 2001.

Wayback uses the CDX Server API to look up captures; it allows for complex querying, filtering, and analysis of Wayback capture data. I recently used this API to visualize a website over time using d3's radial treemap. You can see the result here.

Since the most difficult part of my project was figuring out how to turn the CDX API's response into HTML pages that I could crawl, this article is a guide to making a request to the API and handling its response.

Making the Request

Making the request is just a GET request to http://web.archive.org/cdx/search/cdx. The target website is specified via the url query string parameter, and it's the only required argument. (To see what other parameters can be specified, check the docs.)

To get all archived versions, or captures, of reddit.com, you would make a GET request to http://web.archive.org/cdx/search/cdx?url=reddit.com.
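If you're following along in Python, here's a minimal sketch using the requests library. The limit parameter is an optional extra described in the CDX docs, included here only as an illustration; only url is required:

# cdx_request.py - a minimal sketch, not part of the project code
import requests as req

# The url parameter is the only required argument.
res = req.get('http://web.archive.org/cdx/search/cdx',
              params={'url': 'reddit.com', 'limit': 5})

print(res.text)  # plain-text response, one capture per line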

Sweet! Let’s revisit the objective — get all the captures of a web page from the CDX API so I can send my crawler through the archived HTML. Okay, what to do from here?

Ummmmmm…

Handling the Response

The Content-Type of the response is text/plain, so we'll do the parsing ourselves. As it says in the documentation, the columns, split by spaces, correspond to the following, in order: urlkey, timestamp, original URL, mimetype, status code, digest, and length.
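For example, a single line of the response looks roughly like this (hypothetical values, shown only to illustrate the column order):

com,reddit)/ 20170415000000 http://reddit.com/ text/html 200 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 9000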

I created a data model object to represent one capture.

# snapshot.py

class Snapshot(dict):
    def __init__(self, urlkey=None, timestamp=None, original=None,
                 mimetype=None, statuscode=None, digest=None, length=None):
        super(Snapshot, self).__init__()
        self['urlkey'] = urlkey
        self['timestamp'] = timestamp
        self['original'] = original
        self['mimetype'] = mimetype
        self['statuscode'] = statuscode
        self['digest'] = digest
        self['length'] = length
        # Derived field: the URL where this capture's HTML page lives.
        self['snapshot_url'] = 'http://web.archive.org/web/%s/%s/' % (timestamp, original)

All I’m doing is setting keys on a dictionary object. Notice that I added a key that wasn’t provided in the CDX response to my Snapshot model — snapshot_url.

The HTML page for any given capture lives at a URL that is formatted from the timestamp and original URL of the capture. The format is http://web.archive.org/web/<TIMESTAMP>/<URL>.

If you would prefer to get the original page back (without the Wayback Machine rewriting URLs on the page to point into the Archive), you should use this formatting: http://web.archive.org/web/<TIMESTAMP>id_/<URL>.
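As a quick sketch of both forms (using made-up timestamp and original values for illustration):

# Hypothetical capture fields, for illustration only.
timestamp = '20170415000000'
original = 'http://reddit.com/'

# Page as rewritten by the Wayback Machine:
rewritten_url = 'http://web.archive.org/web/%s/%s' % (timestamp, original)

# Original page, without Wayback's URL rewriting:
raw_url = 'http://web.archive.org/web/%sid_/%s' % (timestamp, original)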

Sweet! Now that we have the data model (namely, the Snapshot class) set up, we can make the request to the CDX API and create instances of the Snapshot class with the response.

Since we get back a text file, we'll first split by '\n', or newline, to get a list of capture strings. For each capture, we'll split by a space (' ') to get the individual columns as a list. After this is done, we'll have a list of lists representing the captures returned from the API.

# reddit.py

from snapshot import Snapshot
import requests as req

res = req.get("http://web.archive.org/cdx/search/cdx?url=reddit.com")

# One capture per line; each line holds space-separated columns.
snapshots = res.text.split('\n')

for snapshot in snapshots:
    snapshot_items = snapshot.split(' ')

Next, we’ll instantiate a new Snapshot object for each snapshot returned by the API.

# reddit.py

from snapshot import Snapshot
import requests as req

res = req.get("http://web.archive.org/cdx/search/cdx?url=reddit.com")

snapshots = res.text.split('\n')
snapshot_list = []

for snapshot in snapshots:
    snapshot_items = snapshot.split(' ')
    # A valid capture line has exactly 7 columns; this also skips blank lines.
    if len(snapshot_items) == 7:
        snap = Snapshot(snapshot_items[0], snapshot_items[1], snapshot_items[2],
                        snapshot_items[3], snapshot_items[4], snapshot_items[5],
                        snapshot_items[6])
        snapshot_list.append(snap)

Now the snapshot_list variable contains Snapshot objects that you can use! Make sure you check that statuscode == '200' (it's a string, since we parsed it from plain text) before making the request to that specific snapshot_url.
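For example, here's a minimal sketch of sending a crawler through the successful captures, building on the snapshot_list from above:

# reddit.py (continued)

for snap in snapshot_list:
    # Only captures that returned 200 have archived HTML worth crawling.
    if snap['statuscode'] == '200':
        page = req.get(snap['snapshot_url'])
        html = page.text
        # ...crawl the archived HTML here...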

I hope this article helped! Let me know if you have any questions!