You may be wondering, "How do I retrieve data from a resource that closed long ago, or from which some of the data has been deleted?" I had the same question a long time ago. That's when I fully discovered the capabilities of web archives and created a web archive extraction tool based on Common Crawl.
Recently I had to solve a similar problem, but Common Crawl seems to be barely breathing due to overload... So I decided to reinvent the wheel and improve my tool by extending it with a Wayback Machine archives source. While developing it, I accidentally got certified by the CIA. Now my altruistic self wants to show everyone how to get such a cool thing on their virtual shelf!
For those who are interested in the "behind the scenes" details, I also describe what the two aforementioned services have in common and how we can use them through their APIs. At the end, we will get the certificate using the GoGetCrawl tool.
First, it is worth mentioning what services such as Common Crawl and the Wayback Machine have in common. Their archives are stored in WARC/ARC files and indexed with CDX files (the name probably comes from Capture/Crawl inDeX). With a tool like pywb, we can serve these indexes and query them with various filters.
For example, in the screenshot below, you can see the result of the request http://<server_addr>/cdx?url=https://twitter.com/internetarchive/ to the CDX server:
As you can see, the result is a number of rows related to the queried URL (https://twitter.com/internetarchive/). This data contains the status at the time of the request (status code), the data collection time (timestamp), the file type (mimetype), and other interesting parameters. A more detailed description of the CDX server and the parameters used in the queries can be found here.
In addition to the Wayback Machine and Common Crawl services described below, there are many others. Unfortunately, their archives are less extensive and are usually archives of individual country websites or dedicated to a topic (e.g. art). You can find some of them here.
I believe some of you have already used web archives through pretty interfaces. Now I will show you how to do it the fun way, using the APIs of the two aforementioned services. For example, say we want to get all the JPEG files on cia.gov and its subdomains and then download the file we are interested in.
To perform our task with the Wayback Machine, we construct the following query:
https://web.archive.org/cdx/search/cdx?url=*.cia.gov/*&output=json&limit=10&filter=mimetype:image/jpeg&collapse=urlkey
Where:
/cdx/search/cdx - the endpoint of the CDX server,
url=*.cia.gov/* - our target domain,
filter=mimetype:image/jpeg - filtering by the JPEG MIME type,
output=json - request the result in JSON format,
limit=10 - limit to 10 results,
collapse=urlkey - get unique URLs (without this, there are many duplicates).
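If you would rather build this query in code than by hand, here is a minimal Go sketch. The function name buildCDXQuery and its signature are my own invention for illustration, not part of any library. Note that url.Values percent-encodes characters such as * and /, so the printed URL looks different from the hand-written one, but it decodes to the same parameters.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// buildCDXQuery assembles a Wayback Machine CDX query from the parameters
// listed above. The name and signature are illustrative only.
func buildCDXQuery(target, mimeType string, limit int) string {
	q := url.Values{}
	q.Set("url", target)                  // target domain pattern
	q.Set("filter", "mimetype:"+mimeType) // keep only this MIME type
	q.Set("output", "json")               // ask for JSON instead of plain text
	q.Set("limit", strconv.Itoa(limit))   // cap the number of rows
	q.Set("collapse", "urlkey")           // one row per unique URL
	// Encode sorts the keys and percent-encodes the values.
	return "https://web.archive.org/cdx/search/cdx?" + q.Encode()
}

func main() {
	fmt.Println(buildCDXQuery("*.cia.gov/*", "image/jpeg", 10))
}
```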
As a result, we get the 10 images found in the archive. In addition to the URLs, the response contains the MIME type of the files (useful if you do not use filtering), as well as the status code when accessing the object at the time of archive creation:
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg", "20150324125120", "https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg", "image/jpeg", "200", "OJRFXPWOPZQGPRIZZQZOTRSZKAVLQLKZ", "3845"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial_cropped.jpg", "20160804222651", "https://www.cia.gov/++theme++contextual.agencytheme/images/Aerial_Cropped.jpg", "image/jpeg", "200", "3WII7DZKLXM4KSQ5UTEKO5EL7H5VTB35", "196685"],
["gov,cia)/++theme++contextual.agencytheme/images/background-launch.jpg", "20140121032437", "https://www.cia.gov/++theme++contextual.agencytheme/images/background-launch.jpg", "image/jpeg", "200", "3C4G73473VYPOWDNA4VJUV4Q7EC3IXN4", "44501"],
["gov,cia)/++theme++contextual.agencytheme/images/background-video-panel.jpg", "20150629034524", "https://www.cia.gov/++theme++contextual.agencytheme/images/background-video-panel.jpg", "image/jpeg", "200", "CQCUYUN5VTVJVN4LGKUZ3BHWSIXPSCKC", "71813"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-1.jpg", "20130801151047", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-1.jpg", "image/jpeg", "200", "GPSEAEE23C53TRGHLMBXHWQYNB3EGBCZ", "14858"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-2.jpg", "20130801150245", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-2.jpg", "image/jpeg", "200", "L6P2MNAAMZUMHUEHJFGXWEUQCHHMK2HP", "15136"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-3.jpg", "20130801151656", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-3.jpg", "image/jpeg", "200", "ODNXI3HZETXVVSEJ5I2KTI7KXKNT5WSV", "19717"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-4.jpg", "20130801150219", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-4.jpg", "image/jpeg", "200", "X7N2EIYUDAYWMX7464LNTHBVMTEMZUVN", "20757"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/banner-benefits-background.jpg", "20150510022313", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/banner-benefits-background.jpg", "image/jpeg", "200", "VZJE5XSAQWBD6QF6742BH2N3HOTSCZ4A", "12534"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/chi-diversity.jpg", "20130801150532", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/CHI-diversity.jpg", "image/jpeg", "200", "WJQOQPYJTPL2Y2KZBVJ44MVDMI7TZ7VL", "6458"]]
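A response in this shape is a JSON array of string arrays, with the column names in the first row. As a rough sketch (the Capture type and parseCDX function are hypothetical names of mine, not taken from GoGetCrawl), it could be decoded in Go like this:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Capture holds the CDX columns used in this walkthrough.
type Capture struct {
	Timestamp  string
	Original   string
	MimeType   string
	StatusCode string
}

// parseCDX decodes an output=json CDX response: a JSON array of string
// arrays in which the first row names the columns.
func parseCDX(body []byte) ([]Capture, error) {
	var rows [][]string
	if err := json.Unmarshal(body, &rows); err != nil {
		return nil, fmt.Errorf("decoding CDX response: %w", err)
	}
	if len(rows) == 0 {
		return nil, nil // nothing to parse
	}
	// Look columns up by name rather than relying on their order.
	col := make(map[string]int, len(rows[0]))
	for i, name := range rows[0] {
		col[name] = i
	}
	for _, name := range []string{"timestamp", "original", "mimetype", "statuscode"} {
		if _, ok := col[name]; !ok {
			return nil, fmt.Errorf("CDX response lacks column %q", name)
		}
	}
	captures := make([]Capture, 0, len(rows)-1)
	for _, row := range rows[1:] {
		if len(row) != len(rows[0]) {
			continue // skip rows that do not match the header
		}
		captures = append(captures, Capture{
			Timestamp:  row[col["timestamp"]],
			Original:   row[col["original"]],
			MimeType:   row[col["mimetype"]],
			StatusCode: row[col["statuscode"]],
		})
	}
	return captures, nil
}

func main() {
	// Header plus the first data row of the response shown above.
	body := []byte(`[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg","20150324125120","https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg","image/jpeg","200","OJRFXPWOPZQGPRIZZQZOTRSZKAVLQLKZ","3845"]]`)
	captures, err := parseCDX(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range captures {
		fmt.Println(c.Timestamp, c.StatusCode, c.MimeType, c.Original)
	}
}
```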
Next, to access the archived file, we take one of the results above and use the following query:
https://web.archive.org/web/20150324125120id_/https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg
Where:
/web - the file-server endpoint,
20150324125120id_ - the timestamp obtained from the previous query plus id_,
https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg - the file URL received from the previous request.
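Putting the pieces together is plain string concatenation. A tiny sketch, with rawCaptureURL being a name I made up for illustration; the id_ suffix asks the Wayback Machine for the file as it was captured, without its usual page rewriting:

```go
package main

import "fmt"

// rawCaptureURL joins the file-server endpoint, the capture timestamp with
// the id_ suffix, and the original URL. The function name is illustrative.
func rawCaptureURL(timestamp, original string) string {
	return fmt.Sprintf("https://web.archive.org/web/%sid_/%s", timestamp, original)
}

func main() {
	fmt.Println(rawCaptureURL(
		"20150324125120",
		"https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg",
	))
}
```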
With Common Crawl, the task is a bit more challenging: the service is overloaded, and its CDX server is currently extremely difficult to use. Nevertheless, I'll try to describe how it can be done :)
First, we choose an archive version (Common Crawl updates its archives every month or two). For example, if we choose the version for March/April 2023, our request will look like this:
https://index.commoncrawl.org/CC-MAIN-2023-14-index?url=*.cia.gov/*&output=json&limit=10&filter=mimetype:image/jpeg&collapse=urlkey
As expected, at the time of writing the servers are still overloaded and respond with a never-ending '504 Gateway Time-out' 😥. So I have no option but to show you an example of file extraction with a response that I have on hand:
{"urlkey": "com,tutorialspoint)/accounting_basics/accounting_basics_tutorial.pdf", "timestamp": "20230320100841", "url": "http://www.tutorialspoint.com/accounting_basics/accounting_basics_tutorial.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "2JQ2AQ3HQZIMXHB5CJGSADUGOHYBIRJJ", "length": "787172", "offset": "102849414", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/warc/CC-MAIN-20230320083513-20230320113513-00267.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_1.pdf", "timestamp": "20230326185123", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_1.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "T4OQARBGDQ2Z3ZMJ57MWZTUIBCFR65QG", "length": "120114", "offset": "1156945883", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296946445.46/warc/CC-MAIN-20230326173112-20230326203112-00412.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_2.pdf", "timestamp": "20230322123716", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_2.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "EJJMOG5QPWIV7YXADIFOPML45UTJKYWW", "length": "118702", "offset": "1159004265", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296943809.76/warc/CC-MAIN-20230322114226-20230322144226-00733.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_3.pdf", "timestamp": "20230324124641", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_3.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "AOTDOZIAULAYGY3AOMD7662BJBEPYKWJ", "length": "210009", "offset": "1172608792", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296945282.33/warc/CC-MAIN-20230324113500-20230324143500-00254.warc.gz"}
{"urlkey": "com,tutorialspoint)/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_1.pdf", "timestamp": "20230330141211", "url": "https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_1.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "MOODQKFMHRVSZK4UOZO3E6H2MGHTK2VW", "length": "226484", "offset": "1136155166", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296949331.26/warc/CC-MAIN-20230330132508-20230330162508-00514.warc.gz"}
{"urlkey": "com,tutorialspoint)/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "timestamp": "20230330112743", "url": "https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "ZYCDOJ2JTPPWFTCNYEIXCWKEJQXTA7UD", "length": "226957", "offset": "1167440233", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz"}
If you look closely, you'll notice that the CommonCrawl server returns a slightly different JSON, with different parameters from what we saw with the Wayback Machine example.
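The other difference is the framing: here every line is a separate JSON object rather than one big array. A possible way to read such a response in Go is sketched below. The CCRecord type and parseCCIndex function are hypothetical names of mine, and I only map the fields needed later for the download.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"strings"
)

// CCRecord maps the index fields needed to locate a file in the archive.
// Numeric fields stay strings because the index returns them quoted.
type CCRecord struct {
	URL      string `json:"url"`
	Mime     string `json:"mime"`
	Status   string `json:"status"`
	Length   string `json:"length"`
	Offset   string `json:"offset"`
	Filename string `json:"filename"`
}

// parseCCIndex decodes a response made of one JSON object per line.
func parseCCIndex(body string) ([]CCRecord, error) {
	dec := json.NewDecoder(strings.NewReader(body))
	var records []CCRecord
	for {
		var rec CCRecord
		err := dec.Decode(&rec)
		if err == io.EOF {
			return records, nil // all objects consumed
		}
		if err != nil {
			return nil, fmt.Errorf("decoding index line: %w", err)
		}
		records = append(records, rec)
	}
}

func main() {
	// The last line of the response shown above.
	body := `{"urlkey": "com,tutorialspoint)/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "timestamp": "20230330112743", "url": "https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "ZYCDOJ2JTPPWFTCNYEIXCWKEJQXTA7UD", "length": "226957", "offset": "1167440233", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz"}`
	records, err := parseCCIndex(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range records {
		fmt.Println(r.Offset, r.Length, r.Filename)
	}
}
```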
Now, to download the file, we append the filename of the selected object to the address of the storage server (data.commoncrawl.org):
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz
However, if you follow this link, it will download a bulky compressed archive instead of the object we need. To get just the desired file, we add the following header to the GET request: Range: bytes=1167440233-1167667189. This header specifies the first and the last byte of our record within the archive. Both values are calculated from the offset and length parameters of the chosen JSON object: the range starts at offset and ends at offset + length - 1.
Using curl, the final query will look like this:
curl -H "Range: bytes=1167440233-1167667189" "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz" --output test.warc.gz
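The arithmetic behind the header is easy to get wrong by one, because HTTP byte ranges include both ends: a record of length bytes starting at offset ends at offset + length - 1. Here is a small Go sketch of that calculation; rangeHeader is a name I picked for illustration, and it takes strings because the index returns both numbers quoted.

```go
package main

import (
	"fmt"
	"log"
	"strconv"
)

// rangeHeader turns the offset and length fields of an index record into
// the value of an HTTP Range header. Both ends of the range are inclusive.
func rangeHeader(offset, length string) (string, error) {
	start, err := strconv.ParseInt(offset, 10, 64)
	if err != nil {
		return "", fmt.Errorf("parsing offset: %w", err)
	}
	size, err := strconv.ParseInt(length, 10, 64)
	if err != nil {
		return "", fmt.Errorf("parsing length: %w", err)
	}
	if start < 0 || size <= 0 {
		return "", fmt.Errorf("invalid offset %d or length %d", start, size)
	}
	return fmt.Sprintf("bytes=%d-%d", start, start+size-1), nil
}

func main() {
	// Offset and length of the last record in the response shown earlier.
	h, err := rangeHeader("1167440233", "226957")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("Range:", h)
}
```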
As a result, we get a gzip-compressed file. After decompressing it, we see the following contents:
WARC/1.0
WARC-Type: response
WARC-Date: 2023-03-30T11:27:43Z
WARC-Record-ID: <urn:uuid:23aaef68-3bb3-4849-a7b8-f81d3b6b603c>
Content-Length: 276765
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:09066de6-1a53-44c9-80ef-921880273b06>
WARC-Concurrent-To: <urn:uuid:916a7a96-01a0-4826-acdc-77a44863736f>
WARC-IP-Address: 192.229.210.176
WARC-Target-URI: https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf
WARC-Payload-Digest: sha1:ZYCDOJ2JTPPWFTCNYEIXCWKEJQXTA7UD
WARC-Block-Digest: sha1:FDUUFO3APTHN55CYNDVRD3CNPYA5NOVT
WARC-Identified-Payload-Type: application/pdf
HTTP/1.1 200 OK
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Access-Control-Allow-Origin: *;
Cache-Control: max-age=2592000
Content-Type: application/pdf
Date: Thu, 30 Mar 2023 11:27:43 GMT
Etag: "4373e-5c8329dce0d40"
Expires: Sat, 29 Apr 2023 11:27:43 GMT
Last-Modified: Wed, 28 Jul 2021 17:50:05 GMT
Server: Apache/2.4.52 (Ubuntu)
Vary: User-Agent
X-Frame-Options: SAMEORIGIN
X-Version: OCT-10 V1
X-XSS-Protection: 1; mode=block
Content-Length: 276286
%PDF-1.5
%µµµµ
<-the rest of the file...->
After stripping the WARC and HTTP headers and saving what is left, all that remains is to enjoy our downloaded content...
It is worth saying that there are already a few tools for interacting with archives. But at a cursory glance, I haven't found any that, apart from URL mining, could also download files and integrate easily into other Go projects!
With this in mind, I updated my outdated solution by doing some refactoring and adding the Wayback Machine as a second source of archive data.
Finally, we will get a certificate using gogetcrawl. You can use it in several ways described here. You won't have to compile or install anything, and so for convenience:
gogetcrawl file "*.cia.gov/*" --dir ./ --ext pdf
After a short wait, we should get the cherished file. If you get bored or no file appears, add the -v flag to see what is happening.
docker run uranusq/gogetcrawl url "*.cia.gov/*" --ext pdf
As a result of this command, we should see the URL of the PDF file. You can learn more about the commands and possible arguments by using the -h flag.
go install github.com/karust/gogetcrawl@latest
We received our certificate 🫢. For those who don't have it yet, it looks like this:
If you use Go and want to become an archivarius, there is an option to apply GoGetCrawl in your project:
go get github.com/karust/gogetcrawl
For example, a minimal program that shows all the pages of example.com and its subdomains with a status of 200 looks like this:
package main

import (
	"fmt"
	"log"

	"github.com/karust/gogetcrawl/common"
	"github.com/karust/gogetcrawl/wayback"
)

func main() {
	// Get only 10 status:200 pages
	config := common.RequestConfig{
		URL:     "*.example.com/*",
		Filters: []string{"statuscode:200"},
		Limit:   10,
	}

	// Set request timeout and retries
	wb, err := wayback.New(15, 2)
	if err != nil {
		log.Fatal(err)
	}

	// Use config to obtain all CDX server responses
	results, err := wb.GetPages(config)
	if err != nil {
		log.Fatal(err)
	}

	for _, r := range results {
		fmt.Println(r.Urlkey, r.Original, r.MimeType)
	}
}
On the project page, you can find more examples, including file extraction and CommonCrawl usage.
I hope no one felt clickbaited, and everyone enjoyed the new certificate in their collection :) Perhaps there are web archive services other than those described in this article that can be used similarly? Let me know.