You may be wondering, "How do I retrieve data from a resource that has long been closed, or from which some of the data has been deleted?" I had the same question a long time ago; that's when I fully discovered the capabilities of Web Archives and created a web archive extraction tool based on Common Crawl. Recently I had to solve a similar problem, but Common Crawl seems to be barely breathing due to overload... So I decided to reinvent the wheel and improve my tool by extending it with the Wayback Machine as a second archive source. While developing it, I accidentally got certified by the CIA. Now my altruistic self wants to show everyone how to get such a cool thing on their virtual shelf!

For those interested in the "behind the scenes" details, I have also described what the two aforementioned services have in common and how we can use them through their APIs. At the end, we will get the certificate using the GoGetCrawl tool.

## Web Archive services

First, it is worth mentioning what services such as Common Crawl and the Wayback Machine have in common. Their archives are stored in WARC/ARC files, which are indexed by CDX files (probably from Capture/Crawl inDeX). Then, with a tool like pywb, we can serve these indexes and examine them with various filters.

For example, in the screenshot below you can see the result of the request `http://<server_addr>/cdx?url=https://twitter.com/internetarchive/` to the CDX server:

As you can see, the result is a number of rows related to the queried URL (`https://twitter.com/internetarchive/`). This data contains the status at the time of the request (`status code`), the data collection time (`timestamp`), the file type (`mimetype`) and other interesting parameters. A more detailed description of the CDX server and the parameters used in the queries can be found here.

In addition to the Wayback Machine and Common Crawl services described below, there are many others. Unfortunately, their archives are less extensive and are usually limited to the websites of a single country or dedicated to a particular topic (e.g. art). You can find some of them here.

## Let's extract some files

I believe some of you have already used web archives through their pretty interfaces. Now I will show you how to do it the fun way, using the APIs of the two aforementioned services. For example, say we want to get all the JPEG files on cia.gov and its subdomains and then download the file we are interested in.

### Wayback Machine

To perform our task with this resource, we construct the following query:

```
https://web.archive.org/cdx/search/cdx?url=*.cia.gov/*&output=json&limit=10&filter=mimetype:image/jpeg&collapse=urlkey
```

Where:

* `/cdx/search/cdx` - the endpoint of the CDX server,
* `url=*.cia.gov/*` - our target domain,
* `filter=mimetype:image/jpeg` - filtering by the JPEG MIME type,
* `output=json` - demand the result in JSON format,
* `limit=10` - limit the output to 10 results,
* `collapse=urlkey` - get unique URLs (without this, there are many duplicates).

As a result, we get the 10 images found in the archive.
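If you'd rather script this than click around, a quick Go sketch that sends the same query and prints the matches could look something like this (it only uses the parameters listed above; nothing else is assumed):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// The same query as above: 10 unique JPEG captures from *.cia.gov.
	query := "https://web.archive.org/cdx/search/cdx" +
		"?url=*.cia.gov/*&output=json&limit=10" +
		"&filter=mimetype:image/jpeg&collapse=urlkey"

	resp, err := http.Get(query)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// With output=json the CDX server replies with an array of arrays,
	// the first row being the field names.
	var rows [][]string
	if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
		panic(err)
	}
	for i := 1; i < len(rows); i++ {
		// 0:urlkey 1:timestamp 2:original 3:mimetype 4:statuscode 5:digest 6:length
		fmt.Println(rows[i][1], rows[i][2], rows[i][3])
	}
}
```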
In addition to the URLs, the response contains the MIME type of the files (useful if you do not use filtering), as well as the status code returned when the object was accessed at the time of archiving:

```json
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg", "20150324125120", "https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg", "image/jpeg", "200", "OJRFXPWOPZQGPRIZZQZOTRSZKAVLQLKZ", "3845"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial_cropped.jpg", "20160804222651", "https://www.cia.gov/++theme++contextual.agencytheme/images/Aerial_Cropped.jpg", "image/jpeg", "200", "3WII7DZKLXM4KSQ5UTEKO5EL7H5VTB35", "196685"],
["gov,cia)/++theme++contextual.agencytheme/images/background-launch.jpg", "20140121032437", "https://www.cia.gov/++theme++contextual.agencytheme/images/background-launch.jpg", "image/jpeg", "200", "3C4G73473VYPOWDNA4VJUV4Q7EC3IXN4", "44501"],
["gov,cia)/++theme++contextual.agencytheme/images/background-video-panel.jpg", "20150629034524", "https://www.cia.gov/++theme++contextual.agencytheme/images/background-video-panel.jpg", "image/jpeg", "200", "CQCUYUN5VTVJVN4LGKUZ3BHWSIXPSCKC", "71813"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-1.jpg", "20130801151047", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-1.jpg", "image/jpeg", "200", "GPSEAEE23C53TRGHLMBXHWQYNB3EGBCZ", "14858"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-2.jpg", "20130801150245", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-2.jpg", "image/jpeg", "200", "L6P2MNAAMZUMHUEHJFGXWEUQCHHMK2HP", "15136"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-3.jpg", "20130801151656", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-3.jpg", "image/jpeg", "200", "ODNXI3HZETXVVSEJ5I2KTI7KXKNT5WSV", "19717"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-4.jpg", "20130801150219", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-4.jpg", "image/jpeg", "200", "X7N2EIYUDAYWMX7464LNTHBVMTEMZUVN", "20757"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/banner-benefits-background.jpg", "20150510022313", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/banner-benefits-background.jpg", "image/jpeg", "200", "VZJE5XSAQWBD6QF6742BH2N3HOTSCZ4A", "12534"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/chi-diversity.jpg", "20130801150532", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/CHI-diversity.jpg", "image/jpeg", "200", "WJQOQPYJTPL2Y2KZBVJ44MVDMI7TZ7VL", "6458"]]
```

Next, to access the archived file, we take one of the results above and use the following query:

```
https://web.archive.org/web/20150324125120id_/https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg
```

Where:

* `/web` - the file-server endpoint,
* `20150324125120id_` - the `timestamp` obtained during the previous query, with `id_` appended,
* `https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg` - the `URL` of the file obtained during the previous query.
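The download itself can also be scripted. Here is a minimal Go sketch that builds the `id_` URL from the first row above and saves the image to disk (the output filename is my own arbitrary choice):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// timestamp and original URL are taken from the first CDX row above.
	timestamp := "20150324125120"
	original := "https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg"

	// The id_ suffix asks the Wayback Machine for the raw archived object,
	// without any replay rewriting.
	fileURL := fmt.Sprintf("https://web.archive.org/web/%sid_/%s", timestamp, original)

	resp, err := http.Get(fileURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("aerial-analysis-btn.jpg")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	n, err := io.Copy(out, resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println("saved", n, "bytes")
}
```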
### Common Crawl

In the case of this service, the task is a bit more challenging, because it is overloaded and right now it is extremely difficult to use the Common Crawl CDX server. Nevertheless, I'll try to describe how it can be done :)

First, we choose an archive version (Common Crawl updates its archives every month or two). For example, if we chose the version for March/April 2023, our request will look like this:

```
https://index.commoncrawl.org/CC-MAIN-2023-14-index?url=*.cia.gov/*&output=json&limit=10&filter=mimetype:image/jpeg&collapse=urlkey
```

As expected, at the time of writing the servers are still overloaded, answering with a never-ending `504 Gateway Time-out` 😥. So I have no option but to show an example of file extraction with a response that I have on hand:

```json
{"urlkey": "com,tutorialspoint)/accounting_basics/accounting_basics_tutorial.pdf", "timestamp": "20230320100841", "url": "http://www.tutorialspoint.com/accounting_basics/accounting_basics_tutorial.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "2JQ2AQ3HQZIMXHB5CJGSADUGOHYBIRJJ", "length": "787172", "offset": "102849414", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/warc/CC-MAIN-20230320083513-20230320113513-00267.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_1.pdf", "timestamp": "20230326185123", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_1.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "T4OQARBGDQ2Z3ZMJ57MWZTUIBCFR65QG", "length": "120114", "offset": "1156945883", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296946445.46/warc/CC-MAIN-20230326173112-20230326203112-00412.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_2.pdf", "timestamp": "20230322123716", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_2.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "EJJMOG5QPWIV7YXADIFOPML45UTJKYWW", "length": "118702", "offset": "1159004265", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296943809.76/warc/CC-MAIN-20230322114226-20230322144226-00733.warc.gz"}
{"urlkey": "com,tutorialspoint)/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_3.pdf", "timestamp": "20230324124641", "url": "https://www.tutorialspoint.com/add_and_subtract_whole_numbers/pdf/subtracting_of_two_2digit_numbers_with_borrowing_worksheet10_3.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "AOTDOZIAULAYGY3AOMD7662BJBEPYKWJ", "length": "210009", "offset": "1172608792", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296945282.33/warc/CC-MAIN-20230324113500-20230324143500-00254.warc.gz"}
{"urlkey": "com,tutorialspoint)/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_1.pdf", "timestamp": "20230330141211", "url": "https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_1.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "MOODQKFMHRVSZK4UOZO3E6H2MGHTK2VW", "length": "226484", "offset": "1136155166", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296949331.26/warc/CC-MAIN-20230330132508-20230330162508-00514.warc.gz"}
{"urlkey": "com,tutorialspoint)/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "timestamp": "20230330112743", "url": "https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "ZYCDOJ2JTPPWFTCNYEIXCWKEJQXTA7UD", "length": "226957", "offset": "1167440233", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz"}
```
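If you still want to query the index yourself rather than rely on a saved response, a small Go sketch that retries on failure and reads the line-delimited output could look roughly like this (the retry count and delay are arbitrary; the crawl name and filters are the ones from the query above):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Only the fields we will need later to carve the file out of the WARC archive.
type record struct {
	URL      string `json:"url"`
	MIME     string `json:"mime"`
	Filename string `json:"filename"`
	Offset   string `json:"offset"`
	Length   string `json:"length"`
}

func main() {
	query := "https://index.commoncrawl.org/CC-MAIN-2023-14-index" +
		"?url=*.cia.gov/*&output=json&limit=10" +
		"&filter=mimetype:image/jpeg&collapse=urlkey"

	// The index frequently answers with 504, so retry a few times.
	var resp *http.Response
	for attempt := 0; attempt < 5; attempt++ {
		r, err := http.Get(query)
		if err == nil && r.StatusCode == http.StatusOK {
			resp = r
			break
		}
		if r != nil {
			r.Body.Close()
		}
		time.Sleep(10 * time.Second)
	}
	if resp == nil {
		panic("the index server never answered with 200")
	}
	defer resp.Body.Close()

	// The response is one JSON object per line.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		var rec record
		if err := json.Unmarshal(scanner.Bytes(), &rec); err != nil {
			continue
		}
		fmt.Println(rec.URL, rec.Filename, rec.Offset, rec.Length)
	}
}
```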
"https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf", "mime": "application/pdf", "mime-detected": "application/pdf", "status": "200", "digest": "ZYCDOJ2JTPPWFTCNYEIXCWKEJQXTA7UD", "length": "226957", "offset": "1167440233", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz"} If you look closely, you'll notice that the CommonCrawl server returns a slightly different JSON, with different parameters from what we saw with the Wayback Machine example. Now, to download the file, we use of the selected object with the storage server ( ): filename data.commoncrawl.org https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz However, if you follow this link, it will download a bulky compressed archive instead of the object we need. To gain access to the desired file, we add the following header to the GET request: . With this header, we specify the beginning and the end of the file offset in the archive. This header values can be calculated from the and parameters of the chosen JSON object. Range: bytes=1170209543-1170219812 offset length Using , the final query will look like this: curl curl -H "Range:bytes=1167440233-1167667190" "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296949181.44/warc/CC-MAIN-20230330101355-20230330131355-00035.warc.gz" --output test.warc.gz As a result we will get a gzip compressed file, decompressing it we will see the following contents: WARC/1.0 WARC-Type: response WARC-Date: 2023-03-30T11:27:43Z WARC-Record-ID: <urn:uuid:23aaef68-3bb3-4849-a7b8-f81d3b6b603c> Content-Length: 276765 Content-Type: application/http; msgtype=response WARC-Warcinfo-ID: <urn:uuid:09066de6-1a53-44c9-80ef-921880273b06> WARC-Concurrent-To: <urn:uuid:916a7a96-01a0-4826-acdc-77a44863736f> WARC-IP-Address: 192.229.210.176 WARC-Target-URI: https://www.tutorialspoint.com/adding_and_subtracting_decimals/pdf/addition_with_money_worksheet8_2.pdf WARC-Payload-Digest: sha1:ZYCDOJ2JTPPWFTCNYEIXCWKEJQXTA7UD WARC-Block-Digest: sha1:FDUUFO3APTHN55CYNDVRD3CNPYA5NOVT WARC-Identified-Payload-Type: application/pdf HTTP/1.1 200 OK Accept-Ranges: bytes Access-Control-Allow-Origin: * Access-Control-Allow-Origin: *; Cache-Control: max-age=2592000 Content-Type: application/pdf Date: Thu, 30 Mar 2023 11:27:43 GMT Etag: "4373e-5c8329dce0d40" Expires: Sat, 29 Apr 2023 11:27:43 GMT Last-Modified: Wed, 28 Jul 2021 17:50:05 GMT Server: Apache/2.4.52 (Ubuntu) Vary: User-Agent X-Frame-Options: SAMEORIGIN X-Version: OCT-10 V1 X-XSS-Protection: 1; mode=block Content-Length: 276286 %PDF-1.5 %µµµµ <-the rest of the file...-> Taking out the WARC headers and saving the file, all that remains is to enjoy our downloaded content... GoGetCrawl the CIA certificate It is worth saying that at the moment there are already a few tools for interacting with archives. But with a cursory glance I haven't found any that, apart from URL mining, could download the file and easily integrate into other Go projects! With this in mind, I updated my outdated solution by doing some refactoring and adding as a second source of archive data. Wayback Machine Let's get the paper Finally, we will get a certificate using . You can use it in several ways described . You won't have to compile or install anything, and so for convenience: gogetcrawl here You can download the latest . 
## GoGetCrawl and the CIA certificate

It is worth saying that there are already a few tools for interacting with web archives. But at a cursory glance, I haven't found any that, apart from URL mining, could also download files and be easily integrated into other Go projects. With this in mind, I updated my outdated solution: I did some refactoring and added the Wayback Machine as a second source of archive data.

### Let's get the paper

Finally, we will get the certificate using gogetcrawl. You can use it in several ways, described here. For convenience, so that you don't have to compile or install anything, you can download the latest release and use the binary as follows:

```bash
gogetcrawl file *.cia.gov/* --dir ./ --ext pdf
```

After waiting a little, we should get the cherished file. If you get bored, or no file appears, you can see what is happening by adding the `-v` flag.

You can also use Docker:

```bash
docker run uranusq/gogetcrawl url *.cia.gov/* --ext pdf
```

As a result of this command, we should see URLs pointing to PDF files. You can learn more about the commands and possible arguments by using the `-h` flag.

Installation option for Go-phers:

```bash
go install github.com/karust/gogetcrawl@latest
```

## Congrats

We received our certificate 🫢. For those who don't have it yet, it looks like this:

## Open-source

If you use Go and want to become an archivarius, there is an option to use GoGetCrawl in your project:

```bash
go get github.com/karust/gogetcrawl
```

For example, a minimal program that shows all the pages of example.com and its subdomains with a status of 200 looks like this:

```go
package main

import (
	"fmt"

	"github.com/karust/gogetcrawl/common"
	"github.com/karust/gogetcrawl/wayback"
)

func main() {
	// Get only 10 status:200 pages
	config := common.RequestConfig{
		URL:     "*.example.com/*",
		Filters: []string{"statuscode:200"},
		Limit:   10,
	}

	// Set request timeout and retries
	wb, _ := wayback.New(15, 2)

	// Use config to obtain all CDX server responses
	results, _ := wb.GetPages(config)

	for _, r := range results {
		fmt.Println(r.Urlkey, r.Original, r.MimeType)
	}
}
```

On the project page, you can find more examples, including file extraction and Common Crawl usage.

## PS

I hope no one feels click-baited, and that everyone has enjoyed the new certificate in their collection :)

Perhaps there are other web archive services, besides those described in the article, that can be used in a similar way? Let me know.