Date: 2019-09-16
Software Developer from Winnipeg, Manitoba, Canada andrewvcourt.com
wget -m -np -c -w 3 -R "index.html*" "https://rom-site.blah/path/to/roms/"
[a] Alternate
[p] Pirate
[b] Bad Dump (avoid these, they may not work!)
[t] Trained
[f] Fixed
[T-] OldTranslation
[T+] NewerTranslation
[h] Hack
(-) Unknown Year
[o] Overdump
[!] Verified Good Dump
(M#) Multilanguage (# of Languages)
(###) Checksum
(??k) ROM Size
ZZZ_ Unclassified
(Unl) Unlicensed
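The codes above follow a regular enough shape that they can be picked out of a filename programmatically. As a rough sketch only: the helper names `rom_tags` and `is_good_dump` and the sample filenames are made up for illustration, and the pattern covers just the bracketed and parenthesized codes, not the ZZZ_ prefix.

```python
import re

# Matches any bracketed or parenthesized code, e.g. [!], [b], (Unl), (M3)
TAG_PATTERN = re.compile(r'\[[^\]]*\]|\([^)]*\)')

def rom_tags(filename):
    """Return every [..] or (..) code found in a ROM filename, in order."""
    return TAG_PATTERN.findall(filename)

def is_good_dump(filename):
    """True when the filename carries the [!] Verified Good Dump code."""
    return '[!]' in rom_tags(filename)

# Hypothetical filenames, for illustration only
print(rom_tags('YourFavoriteRom (Unl) [!].bin'))   # ['(Unl)', '[!]']
print(is_good_dump('YourFavoriteRom [b].bin'))     # False
```

Matching on the whole `[!]` token, rather than on a bare `!`, avoids false positives from titles that happen to contain an exclamation mark.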
with a regular expression, but I am definitely no wget or regex pro, so after a few minutes of unsuccessful attempts, I gave up and wrote a short Python script to get me what I wanted using Beautiful Soup.
wget, I know I just need to extract the links from all the anchors, but only collect the ones containing the [!] suffix. This can be done in fewer than 15 lines of Python:
pip3 install beautifulsoup4 requests
with the following code:
good_roms.py
# good_roms.py
import requests
from bs4 import BeautifulSoup

weburl = 'https://site.site/path/to/roms/'
data = requests.get(weburl)
soup = BeautifulSoup(data.text, features='html.parser')

# Collect the link of every anchor whose markup contains the [!] tag
links = []
for anch in soup.find_all('a'):
    if '[!]' in str(anch):
        links.append(weburl + anch.get('href'))

# Print one URL per line so the output can be redirected to a file
for link in links:
    print(link)
python3 good_roms.py > rom-list.txt
and it will download just the good ones using the wget input file switch -i:
wget -i rom-list.txt
--2019-01-25 21:27:02-- https://rom-site.blah/path/to/roms/YourFavoriteRom[!].bin
Reusing existing connection to [rom-site.blah]:443.
HTTP request sent, awaiting response... 200 OK
Length: 2097152 (2.0M) [application/octet-stream]
Saving to: ‘YourFavoriteRom[!].bin’
YourFavoriteRom[!].bin 100%[========================>] 2.00M 513KB/s in 3.9s
2019-01-25 21:27:09 (513 KB/s) - ‘YourFavoriteRom[!].bin’ saved [2097152/2097152]
FINISHED --2019-01-25 21:29:41--
Total wall clock time: 38m 47s
Downloaded: 693 files, 888M in 30m 38s (495 KB/s)
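One caveat with good_roms.py is that it builds each URL by plain string concatenation, which only works when every href is relative to the listing page. A slightly more defensive variant is sketched below, assuming the listing is a plain directory index. It checks the visible anchor text instead of the whole tag, skips anchors that have no href, and resolves each link with `urljoin` from the standard library. The function name `good_rom_links` and the sample HTML are hypothetical, not part of the original script.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def good_rom_links(html, base_url):
    """Return absolute URLs for anchors whose text contains [!]."""
    soup = BeautifulSoup(html, features='html.parser')
    links = []
    for anch in soup.find_all('a', href=True):  # skip anchors without href
        if '[!]' in anch.get_text():
            links.append(urljoin(base_url, anch['href']))
    return links

# Made-up directory listing, for illustration only
sample = '''
<a href="../">Parent Directory</a>
<a href="GameA%5B%21%5D.bin">GameA[!].bin</a>
<a href="GameB%5Bb%5D.bin">GameB[b].bin</a>
'''
for link in good_rom_links(sample, 'https://site.site/path/to/roms/'):
    print(link)
```

Taking the HTML as an argument keeps the filtering logic separate from the network request, so it can be tried on a saved copy of the listing before pointing it at the real site.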