In this article, I walk through a few small programs that show how web page parsing works in Rust.
For this task, I chose the Tokio, Reqwest, and Scraper libraries.
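If you want to follow along, all three crates go into Cargo.toml. This is only a sketch; the version numbers are my assumption, so use whatever current releases fit your project (the "blocking" feature is only needed for the blocking example further down):

[dependencies]
tokio = { version = "1", features = ["full"] }
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.12"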
Tokio is a runtime for writing reliable, asynchronous, and slim applications with the Rust programming language.
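To show what the runtime actually gives you, here is a minimal sketch that is not part of the article's programs: #[tokio::main] starts the runtime, tokio::spawn runs a task concurrently, and tokio::time::sleep pauses a task without blocking a thread.

use std::time::Duration;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    // spawn a background task; it runs concurrently with the code below
    let task = tokio::spawn(async {
        sleep(Duration::from_millis(100)).await;
        println!("background task finished");
    });
    println!("main keeps running while the task sleeps");
    task.await.unwrap(); // wait for the spawned task to complete
}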
Why I chose Tokio:
List of projects that use Tokio, based on Tokio's GitHub repository:
Reqwest is an ergonomic, batteries-included HTTP client for Rust.
Features:
Reqwest uses Tokio for asynchronous requests and also offers a blocking mode. In this article, I use the async mode.
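For contrast, here is a minimal sketch of the blocking mode (it requires the "blocking" feature and must not be called from inside an async runtime); the rest of the article sticks to the async API:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // a plain synchronous GET request, no runtime involved
    let body = reqwest::blocking::get("https://www.wireshark.org/download/")?.text()?;
    println!("page length: {}", body.len());
    Ok(())
}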
Scraper provides an interface to Servo's html5ever and selectors crates, for browser-grade parsing and querying.
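Before touching real pages, here is a tiny sketch of Scraper on its own: parse an HTML fragment and query it with a CSS selector. The fragment and the selector are made-up examples.

use scraper::{Html, Selector};

fn main() {
    let fragment = Html::parse_fragment(r#"<ul><li class="item">first</li><li class="item">second</li></ul>"#);
    let selector = Selector::parse("li.item").unwrap();
    for element in fragment.select(&selector) {
        // print the text content of every matched element
        println!("{}", element.text().collect::<String>());
    }
}

Now the first real program: fetch the Wireshark download page and print every link found in its file table.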
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.wireshark.org/download/").await?;
    // print the final URL before `text()` consumes the response
    println!("{}", resp.url());
    let text = resp.text().await?;
    let document = Html::parse_document(&text);
    // select every link inside the download table
    let selector = Selector::parse(r#"table > tbody > tr > td > a"#).unwrap();
    for title in document.select(&selector) {
        println!("{}", title.value().attr("href").expect("href not found"));
    }
    Ok(())
}
Here is a small upgrade to this program. Wireshark's page contains duplicate links, and we can ignore them by collecting the URLs into a HashSet.
use scraper::{Html, Selector};
use std::collections::HashSet;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.wireshark.org/download/").await?;
    println!("{}", resp.url());
    let text = resp.text().await?;
    // a HashSet stores every URL only once, dropping duplicates
    let mut urls: HashSet<String> = HashSet::new();
    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"table > tbody > tr > td > a"#).unwrap();
    for title in document.select(&selector) {
        let url = title.value().attr("href").expect("href not found").to_string();
        // skip directory-navigation entries
        if url != "/" && url != "." && url != ".." {
            urls.insert(url);
        }
    }
    for url in urls {
        println!("{}", url);
    }
    Ok(())
}
For the next iteration, I want to show how to build a parser that is still simple but has more features.
Task: Collect new wallpapers from Wallhaven's website.
On this site, there is a “random” button that gives you 24 random pictures. For this task, we need to parse that page, then follow each picture's URL, extract the full-size image URL, and download the picture.
use scraper::{Html, Selector};
use std::io::Cursor;
use std::process;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = match reqwest::get("https://wallhaven.cc/random").await {
        Ok(x) => x,
        Err(_) => {
            println!("error on /random request");
            process::exit(1); // exit if the request to /random fails
        }
    };
    let text = resp.text().await?;
    let document = Html::parse_document(&text);
    // each thumbnail links to the picture's own page
    let selector = Selector::parse(r#"ul > li > figure > a"#).unwrap();
    for elem in document.select(&selector) {
        let href = elem.value().attr("href").expect("href not found!");
        download(href).await?;
    }
    Ok(())
}

async fn download(url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get(url).await?;
    let text = resp.text().await?;
    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"img"#).unwrap();
    for elem in document.select(&selector) {
        // the full-size image URL is in the "data-cfsrc" attribute
        if let Some(x) = elem.value().attr("data-cfsrc") {
            // ignore other pictures on the page, such as the logo and avatars
            if x.contains("avatar") || x.contains("logo") {
                continue;
            }
            // use the last path segment as the original file name
            let file_path = x.split('/').last().expect("can't find filename");
            match reqwest::get(x).await {
                Ok(resp) => {
                    let mut file = std::fs::File::create(file_path)?;
                    let mut content = Cursor::new(resp.bytes().await?);
                    std::io::copy(&mut content, &mut file)?;
                    println!("Created: {}", file_path);
                }
                Err(_) => continue,
            };
        }
    }
    Ok(())
}
In conclusion, I wanted to show and solve a few practical parsing tasks for educational purposes, so the code is not refactored as thoroughly as it could be. For production use, look for more polished examples and adapt the approach to your own tasks.