Scraping a Website for High-Quality ROMs Using Python

2019-09-16

Let’s say, hypothetically, you wanted to download all of the high-quality Super Nintendo ROMs from a website. The site simply has a list of links that take you to the files themselves. Since this is a flat tree, you could run a basic wget command with the URL, à la:

wget -m -np -c -w 3 -R "index.html*" "https://rom-site.blah/path/to/roms/"

However, this would give you every game, regardless of quality. Fortunately, ROM enthusiasts use suffixes to denote the status of ROMs:

[a] Alternate
[p] Pirate
[b] Bad Dump (avoid these, they may not work!)
[t] Trained
[f] Fixed
[T-] OldTranslation
[T+] NewerTranslation
[h] Hack
(-) Unknown Year
[o] Overdump
[!] Verified Good Dump
(M #) Multilanguage (# of Languages)
(###) Checksum
(??k) ROM Size
ZZZ_ Unclassified
(Unl) Unlicensed

So we just want the ones with the [!] suffix. You may also want to specify [U] for just the US releases as well.
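The suffix check can also be expressed in Python. The following is a minimal sketch, assuming the tags always appear in square brackets in the file name; the helper names `rom_tags` and `has_all_tags` and the sample file names are made up for illustration.

```python
import re

# Matches bracketed tags such as [!], [b1] or [U] in a ROM file name.
TAG_PATTERN = re.compile(r"\[([^\]]+)\]")


def rom_tags(filename):
    """Return the bracketed tags found in a file name, in order of appearance."""
    return TAG_PATTERN.findall(filename)


def has_all_tags(filename, wanted=("!",)):
    """Return True if every tag in `wanted` appears in the file name."""
    tags = rom_tags(filename)
    return all(tag in tags for tag in wanted)


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the helpers.
    for name in ["Some Game [U][!].smc", "Some Game [b1].smc"]:
        print(name, has_all_tags(name))
```

Passing `wanted=("!", "U")` would narrow the match to verified dumps that also carry the [U] tag.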

There is certainly some way of specifying this to wget with a regular expression, but I am definitely no wget or regex pro, so after a few minutes of unsuccessful attempts, I gave up and wrote a short Python script to get me what I wanted using Beautiful Soup.

Before writing any code I analyzed the source of the target URL, and sure enough, the page was pretty much just a list of anchor tags, each with a direct link to the ROM file. Perfect.

After peeking at the HTML, I know I just need to extract the links from all the anchors, but only collect the ones containing the [!] suffix. This can be done in less than 15 lines of Python:

First, install beautifulsoup4 (and requests, which the script also imports):

pip3 install beautifulsoup4 requests

And then create a file called good_roms.py with the following code:

# good_roms.py
import requests
from bs4 import BeautifulSoup

weburl = 'https://site.site/path/to/roms/'

# Fetch the index page and parse it
data = requests.get(weburl)
soup = BeautifulSoup(data.text, features='html.parser')

# Collect the links whose anchor contains the [!] suffix
links = []
for anch in soup.find_all('a'):
    if '[!]' in str(anch):
        links.append(weburl + anch.get('href'))

for link in links:
    print(link)
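One caveat with building URLs by string concatenation: it only works when every href is relative to the index page. As a hedged variation, the sketch below resolves hrefs with `urljoin` from the standard library, which also copes with root-relative and absolute links, and skips anchors that have no href. The function name `good_rom_urls` and the sample hrefs are assumptions for illustration. Unlike the script above, it looks only at the href, not the anchor text, so it decodes percent-escapes before checking for the suffix.

```python
from urllib.parse import unquote, urljoin


def good_rom_urls(base_url, hrefs, marker="[!]"):
    """Resolve hrefs against base_url, keeping only those containing marker."""
    urls = []
    for href in hrefs:
        if not href:
            # Anchors without an href give None from anch.get('href').
            continue
        # Decode percent-escapes so '%5B%21%5D' is recognised as '[!]'.
        if marker not in unquote(href):
            continue
        # urljoin handles relative, root-relative, and absolute hrefs.
        urls.append(urljoin(base_url, href))
    return urls


if __name__ == "__main__":
    # Hypothetical hrefs, as they might come back from anch.get('href').
    sample = ["Game%20A%20%5B%21%5D.smc", "Game%20B%20%5Bb1%5D.smc", None]
    for url in good_rom_urls("https://site.site/path/to/roms/", sample):
        print(url)
```

In the script it could be fed with the hrefs from `soup.find_all('a')`, for example `good_rom_urls(weburl, [a.get('href') for a in soup.find_all('a')])`.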

Now I can just run the program and redirect the output to a text file:

python3 good_roms.py > rom-list.txt

Now that I have a text file with the URLs of all the good ROMs, I can give that file directly to wget and it will download just the good ones using the -i input file switch:

wget -i rom-list.txt

That’s it! Make sure you have enough space for all the ROMs and watch them pile up one at a time:

--2019-01-25 21:27:02-- https://rom-site.blah/path/to/roms/YourFavoriteRom[!].bin
Reusing existing connection to [rom-site.blah]:443.
HTTP request sent, awaiting response... 200 OK
Length: 2097152 (2.0M) [application/octet-stream]
Saving to: ‘YourFavoriteRom[!].bin’

YourFavoriteRom[!].bin 100%[========================>] 2.00M 513KB/s in 3.9s

2019-01-25 21:27:09 (513 KB/s) - ‘YourFavoriteRom[!].bin’ saved [2097152/2097152]

FINISHED --2019-01-25 21:29:41--
Total wall clock time: 38m 47s
Downloaded: 693 files, 888M in 30m 38s (495 KB/s)

