Get Certified by CIA for Exploring Archives — my Golang Solution for Web Archive Data Extraction

by Rustem Kamalov, June 5th, 2023

You may be wondering, "How do I retrieve data from a resource that has long been closed, or that has had some of its data deleted?" I had the same question a long time ago. That's when I fully discovered the capabilities of web archives and created a web archive extraction tool based on Common Crawl.


Recently I had to solve a similar problem, but Common Crawl seems to be barely breathing due to overload... So I decided to reinvent the wheel and improve my tool by extending it with a Wayback Machine archives source. While developing it, I accidentally got certified by the CIA. Now my altruistic self wants to show everyone how to get such a cool thing on their virtual shelf!


For those who are interested in the "behind the scenes" details, I also describe what the two aforementioned services have in common and how we can use them through the API. At the end, we will get the certificate using the GoGetCrawl tool.

Web Archive services

First, it is worth mentioning what services such as Common Crawl and the Wayback Machine have in common. Their archives are stored in WARC/ARC files and indexed in CDX files (probably from Capture/Crawl inDeX). With a tool like pywb, we can serve these indexes and examine them with various filters.


For example, the screenshot below shows the result of the request http://<server_addr>/cdx?url=https://twitter.com/internetarchive/ to the CDX server. The result is a number of rows related to the queried URL (https://twitter.com/internetarchive/). Each row contains the response status at the time of capture (status code), the capture time (timestamp), the file type (mimetype) and other interesting parameters. A more detailed description of the CDX server and the parameters used in the queries can be found here.


In addition to the Wayback Machine and Common Crawl services described below, there are many others. Unfortunately, their archives are less extensive and are usually archives of individual country websites or dedicated to a topic (e.g. art). You can find some of them here.

Let's extract some files

I believe some of you have already used web archives through pretty interfaces. Now I will show you how to do it in a fun way, using the APIs of the two aforementioned services. As an example, we want to list the JPEG files on cia.gov and its subdomains and then download the one we are interested in.

Wayback Machine

To perform our task with this resource, we construct the following query:

https://web.archive.org/cdx/search/cdx?url=*.cia.gov/*&output=json&limit=10&filter=mimetype:image/jpeg&collapse=urlkey

Where:

  • /cdx/search/cdx - the endpoint of the CDX server,

  • url=*.cia.gov/* - our target domain,

  • filter=mimetype:image/jpeg - filtering by MIME type of JPEG file,

  • output=json - request the result in JSON format,

  • limit=10 - limit the output to 10 results,

  • collapse=urlkey - get unique URLs (without this, there are many duplicates).


As a result, we get the 10 images found in the archive. In addition to the URLs, the response contains the MIME type of the files (useful if you do not use filtering), as well as the status code when accessing the object at the time of archive creation:

[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg", "20150324125120", "https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg", "image/jpeg", "200", "OJRFXPWOPZQGPRIZZQZOTRSZKAVLQLKZ", "3845"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial_cropped.jpg", "20160804222651", "https://www.cia.gov/++theme++contextual.agencytheme/images/Aerial_Cropped.jpg", "image/jpeg", "200", "3WII7DZKLXM4KSQ5UTEKO5EL7H5VTB35", "196685"],
["gov,cia)/++theme++contextual.agencytheme/images/background-launch.jpg", "20140121032437", "https://www.cia.gov/++theme++contextual.agencytheme/images/background-launch.jpg", "image/jpeg", "200", "3C4G73473VYPOWDNA4VJUV4Q7EC3IXN4", "44501"],
["gov,cia)/++theme++contextual.agencytheme/images/background-video-panel.jpg", "20150629034524", "https://www.cia.gov/++theme++contextual.agencytheme/images/background-video-panel.jpg", "image/jpeg", "200", "CQCUYUN5VTVJVN4LGKUZ3BHWSIXPSCKC", "71813"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-1.jpg", "20130801151047", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-1.jpg", "image/jpeg", "200", "GPSEAEE23C53TRGHLMBXHWQYNB3EGBCZ", "14858"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-2.jpg", "20130801150245", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-2.jpg", "image/jpeg", "200", "L6P2MNAAMZUMHUEHJFGXWEUQCHHMK2HP", "15136"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-3.jpg", "20130801151656", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-3.jpg", "image/jpeg", "200", "ODNXI3HZETXVVSEJ5I2KTI7KXKNT5WSV", "19717"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-4.jpg", "20130801150219", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-4.jpg", "image/jpeg", "200", "X7N2EIYUDAYWMX7464LNTHBVMTEMZUVN", "20757"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/banner-benefits-background.jpg", "20150510022313", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/banner-benefits-background.jpg", "image/jpeg", "200", "VZJE5XSAQWBD6QF6742BH2N3HOTSCZ4A", "12534"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/chi-diversity.jpg", "20130801150532", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/CHI-diversity.jpg", "image/jpeg", "200", "WJQOQPYJTPL2Y2KZBVJ44MVDMI7TZ7VL", "6458"]]


Next, to access the archived file, we take one of the results above and use the following query:

https://web.archive.org/web/20150324125120id_/https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg

Where:

  • /web - file-server endpoint,

  • 20150324125120id_ - the timestamp obtained from the previous query + id_ (this suffix asks for the original file, without the Wayback Machine's page rewriting),

  • https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg - the file URL received from the previous query.


Common Crawl

In the case of this service, the task is a bit more challenging because it is overloaded, and now it is extremely difficult to use the Common Crawl CDX server. Nevertheless, I'll try to describe how it can be done :)


First, we choose an archive version (Common Crawl publishes a new archive every month or two). For example, if we choose the version for March/April 2023, our request will look like this:

https://index.commoncrawl.org/CC-MAIN-2023-14-index?url=*.cia.gov/*&output=json&limit=10&filter=mimetype:image/jpeg&collapse=urlkey


As expected, at the time of writing the servers are still overloaded and return a never-ending '504 Gateway Time-out' 😥. So I have no option but to show file extraction using a response that I have on hand:

{"urlkey": "com,tutorialspoint)/accounting_basics/accounting_basics_tutorial.pdf", "timestamp": "20230320100841", "url": "http://www.tutorialspoint.com/accounting_basics/accounting_basics_tutorial.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "2JQ2AQ3HQZIMXHB5CJGSADUGOHYBIRJJ", "length": "787172", "offset": "102849414", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/warc/CC-MAIN-20230320083513-20230320113513-00267.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_1.pdf", "timestamp": "20230326185123", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_1.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "T4OQARBGDQ2Z3ZMJ57MWZTUIBCFR65QG", "length": "120114", "offset": "1156945883", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296946445.46/warc/CC-MAIN-20230326173112-20230326203112-00412.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_2.pdf", "timestamp": "20230322123716", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_2.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "EJJMOG5QPWIV7YXADIFOPML45UTJKYWW", "length": "118702", "offset": "1159004265", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296943809.76/warc/CC-MAIN-20230322114226-20230322144226-00733.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_3.pdf", "timestamp": "20230324124641", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_3.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "AOTDOZIAULAYGY3AOMD7662BJBEPYKWJ", "length": "210009", "offset": "1172608792", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296945282.33/warc/CC-MAIN-20230324113500-20230324143500-00254.warc.gz"}
{"urlkey": "com,tutorialspoint)/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_1.pdf", "timestamp": "20230330141211", "url": "https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_1.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "MOODQKFMHRVSZK4UOZO3E6H2MGHTK2VW", "length": "226484", "offset": "1136155166", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296949331.26/warc/CC-MAIN-20230330132508-20230330162508-00514.warc.gz"}
{"urlkey": "com,tutorialspoint)/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "timestamp": "20230330112743", "url": "https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "ZYCDOJ2JTPPWFTCNYEIXCWKEJQXTA7UD", "length": "226957", "offset": "1167440233", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz"}


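If you look closely, you'll notice that the Common Crawl server returns slightly different JSON than the Wayback Machine: one object per line, with different field names (for example, mime and status instead of mimetype and statuscode), plus the offset and filename of each record.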


Now, to download the file, we append the filename of the selected object to the address of the storage server (data.commoncrawl.org):

https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz


However, if you follow this link, it will download a bulky compressed archive instead of the object we need. To get only the desired record, we add a Range header to the GET request, for example Range: bytes=1167440233-1167667189. The header specifies the first and the last byte of the record inside the archive. Both values are calculated from the offset and length fields of the chosen JSON object: the range starts at offset and ends at offset + length - 1.


Using curl, the final query will look like this:

curl -H "Range: bytes=1167440233-1167667189" "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz" --output test.warc.gz


As a result, we get a gzip-compressed file. After decompressing it, we will see the following contents:

    WARC/1.0
    WARC-Type: response
    WARC-Date: 2023-03-30T11:27:43Z
    WARC-Record-ID: <urn:uuid:23aaef68-3bb3-4849-a7b8-f81d3b6b603c>
    Content-Length: 276765
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID: <urn:uuid:09066de6-1a53-44c9-80ef-921880273b06>
    WARC-Concurrent-To: <urn:uuid:916a7a96-01a0-4826-acdc-77a44863736f>
    WARC-IP-Address: 192.229.210.176
    WARC-Target-URI: https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf
    WARC-Payload-Digest: sha1:ZYCDOJ2JTPPWFTCNYEIXCWKEJQXTA7UD
    WARC-Block-Digest: sha1:FDUUFO3APTHN55CYNDVRD3CNPYA5NOVT
    WARC-Identified-Payload-Type: application/pdf

    HTTP/1.1 200 OK
    Accept-Ranges: bytes
    Access-Control-Allow-Origin: *
    Access-Control-Allow-Origin: *;
    Cache-Control: max-age=2592000
    Content-Type: application/pdf
    Date: Thu, 30 Mar 2023 11:27:43 GMT
    Etag: "4373e-5c8329dce0d40"
    Expires: Sat, 29 Apr 2023 11:27:43 GMT
    Last-Modified: Wed, 28 Jul 2021 17:50:05 GMT
    Server: Apache/2.4.52 (Ubuntu)
    Vary: User-Agent
    X-Frame-Options: SAMEORIGIN
    X-Version: OCT-10 V1
    X-XSS-Protection: 1; mode=block
    Content-Length: 276286

    %PDF-1.5
    %µµµµ
    <-the rest of the file...->

After stripping the WARC and HTTP headers and saving the rest as a file, all that remains is to enjoy our downloaded content...


GoGetCrawl the CIA certificate

It is worth saying that there are already a few tools for interacting with archives. But at a cursory glance, I haven't found any that, apart from URL mining, can also download files and integrate easily into other Go projects! With this in mind, I updated my outdated solution by refactoring it and adding the Wayback Machine as a second source of archive data.

Let's get the paper

Finally, we will get the certificate using gogetcrawl. There are several ways to run it, described on the project page. The first two do not require compiling anything:

  • Download the latest release and use the binary as follows:
gogetcrawl file "*.cia.gov/*" --dir ./ --ext pdf

After waiting a little, we should get the cherished file. If you get bored or no file appears, add the -v flag to see what is happening.

  • You can also use Docker:
docker run uranusq/gogetcrawl url "*.cia.gov/*" --ext pdf

As a result of this command, we should see the URL with the PDF file. You can learn more about the commands and possible arguments by using the -h flag.

  • Installation option for Go-phers:
go install github.com/karust/gogetcrawl@latest

Congrats

We received our certificate 🫢. For those who don't have it yet, it looks like this:

Not fully understanding what it is and why it is there, I immediately rushed to share this opportunity with you.

Open-source

If you use Go and want to become an archivarius, there is an option to apply GoGetCrawl in your project:

go get github.com/karust/gogetcrawl

For example, a minimal program that prints up to 10 pages of example.com and its subdomains that were archived with status 200 looks like this:

package main

import (
	"fmt"
	"log"

	"github.com/karust/gogetcrawl/common"
	"github.com/karust/gogetcrawl/wayback"
)

func main() {
	// Request at most 10 pages that were archived with status code 200
	config := common.RequestConfig{
		URL:     "*.example.com/*",
		Filters: []string{"statuscode:200"},
		Limit:   10,
	}

	// Set the request timeout and the number of retries
	wb, err := wayback.New(15, 2)
	if err != nil {
		log.Fatal(err)
	}

	// Use the config to obtain all CDX server responses
	results, err := wb.GetPages(config)
	if err != nil {
		log.Fatal(err)
	}

	// Print the URL key, original URL and MIME type of every result
	for _, r := range results {
		fmt.Println(r.Urlkey, r.Original, r.MimeType)
	}
}

On the project page, you can find more examples, including file extraction and CommonCrawl usage.

PS

I hope no one feels clickbaited, and everyone enjoyed the new certificate in their collection :) Perhaps there are web archive services other than those described in the article that can be used similarly? Let me know.