Let's say you wanted to download all of the high-quality Super Nintendo ROMs from a website. The site simply has a list of links that take you to the files themselves. Since this is a flat tree, you could hypothetically run a basic wget command against the URL, like this:

wget -m -np -c -w 3 -R "index.html*" "https://rom-site.blah/path/to/roms/"

However, this would give you every game, regardless of quality. Fortunately, ROM enthusiasts use suffixes to denote the status of ROMs:

[a] Alternate
[p] Pirate
[b] Bad Dump     (avoid these, they may not work!)
[t] Trained
[f] Fixed
[T-] OldTranslation
[T+] NewerTranslation
[h] Hack
(-) Unknown Year
[o] Overdump
[!] Verified Good Dump
(M#) Multilanguage (# of Languages)
(###) Checksum
(??k) ROM Size
ZZZ_ Unclassified
(Unl) Unlicensed

So we just want the ones with the [!] suffix. You may also want to specify [U] for just the US releases as well.

There is certainly some way of specifying this to wget with a regular expression, but I am definitely no wget or regex pro, so after a few minutes of unsuccessful attempts, I gave up and wrote a short Python script to get me what I wanted using Beautiful Soup.

Before writing any code I analyzed the source of the target URL, and sure enough, the page was pretty much just a list of anchor tags, each with a direct link to a ROM file. Perfect. After peeking at the HTML, I knew I just needed to extract the links from all the anchors, but only collect the ones containing the [!] suffix. This can be done in fewer than 15 lines of Python.

First, install beautifulsoup4 (the script also imports requests, so install that too if it is not already present):

pip3 install beautifulsoup4

And then create a file called good_roms.py with the following code:

# good_roms.py
import requests
from bs4 import BeautifulSoup

# Directory listing that holds the ROM links
weburl = 'https://site.site/path/to/roms/'
data = requests.get(weburl)
soup = BeautifulSoup(data.text, features='html.parser')

# Keep only the anchors whose markup contains the verified-good-dump tag
links = []
for anch in soup.find_all('a'):
    if '[!]' in str(anch):
        links.append(weburl + anch.get('href'))

# One URL per line, ready to be fed to wget -i
for link in links:
    print(link)

Now I can just run the program and redirect the output to a text file:

python3 good_roms.py > rom-list.txt

Now that I have a text file with the URLs of all the good ROMs, I can give that file directly to wget, and it will download just the good ones using the input file switch, -i:

wget -i rom-list.txt

That's it! Make sure you have enough space for all the ROMs and watch them pile up one at a time:

--2019-01-25 21:27:02--  https://rom-site.blah/path/to/roms/YourFavoriteRom[!].bin
Reusing existing connection to [rom-site.blah]:443.
HTTP request sent, awaiting response... 200 OK
Length: 2097152 (2.0M) [application/octet-stream]
Saving to: ‘YourFavoriteRom[!].bin’

YourFavoriteRom[!].bin 100%[========================>]   2.00M   513KB/s 3.9s    

2019-01-25 21:27:09 (513 KB/s) - ‘YourFavoriteRom[!].bin’ saved [2097152/2097152]

FINISHED --2019-01-25 21:29:41--
Total wall clock time: 38m 47s
Downloaded: 693 files, 888M in 30m 38s (495 KB/s)
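
The idea of also restricting the list to US releases, mentioned earlier, could be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name select_roms and its required_tags parameter are hypothetical, the sample file names are made up, and the country tag is written here as (U), so check how your listing actually spells it. It also decodes percent-encoded hrefs before matching, since a listing may encode the brackets, and it uses only the standard library so it can be tried without a network connection.

```python
from urllib.parse import unquote, urljoin


def select_roms(base_url, hrefs, required_tags=("[!]",)):
    """Return absolute URLs for hrefs whose file name contains every tag.

    Hypothetical helper: the name and signature are illustrative only.
    """
    selected = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href attribute yield None
        # Listings may percent-encode brackets, e.g. %5B!%5D for [!]
        name = unquote(href)
        if all(tag in name for tag in required_tags):
            # urljoin copes with both relative and absolute hrefs
            selected.append(urljoin(base_url, href))
    return selected


if __name__ == "__main__":
    base = "https://site.site/path/to/roms/"
    # Made-up hrefs standing in for anch.get('href') values
    sample = [
        "Game%20A%20(U)%20%5B!%5D.zip",
        "Game A (J) [!].zip",
        "Game B (U) [b1].zip",
        None,
    ]
    for url in select_roms(base, sample, required_tags=("[!]", "(U)")):
        print(url)
```

In good_roms.py, the hrefs could come from something like [a.get('href') for a in soup.find_all('a')], and the printed URLs can be redirected to rom-list.txt exactly as before.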

Scraping a Website for High-Quality ROMs Using Python
