Parsing HTML with Rust: A Simple Tutorial Using Tokio, Reqwest, and Scraper by@exactor


Maksim Kuznetsov

Just a Senior Python Developer.

In this article, I walk through a few small programs that show how web page parsing works in Rust.

For this task, I chose the Tokio, Reqwest, and Scraper libraries.

Tokio

It's a library for writing reliable, asynchronous, and slim applications with the Rust programming language.

Why I chose Tokio:

  • It's fast: Tokio uses zero-cost abstractions for asynchronous code.
  • It's safe: Tokio is built on safe Rust and Rust's safe concurrency primitives.
  • It's scalable: it has a minimal footprint in your application.
  • Tokio is also a more popular asynchronous runtime than async-std.
  • It uses cooperative multitasking based on lightweight (green) threads.

List of projects that use Tokio, based on Tokio's GitHub repository:

  • hyper: A fast and correct HTTP implementation for Rust.
  • tonic: A Rust implementation of gRPC.
  • warp: A super-easy, composable web server framework for warp speeds.
  • tower: A library of modular and reusable components for building robust networking clients and servers.
  • tracing: is a framework for instrumenting Rust programs to collect structured, event-based diagnostic information.
  • rdbc: A Rust database connectivity (RDBC) library for MySQL, Postgres, and SQLite.
  • mio: Fast, low-level I/O library for Rust focusing on non-blocking APIs.
  • bytes: A utility library for working with bytes.
  • loom: The testing tool for concurrent Rust code.

Reqwest

An ergonomic, batteries-included HTTP Client for Rust.

Features:

  • Plain bodies, JSON, urlencoded, multipart
  • Customizable redirect policy
  • HTTP Proxies
  • HTTPS via system-native TLS (or optionally, rustls)
  • Cookie Store
  • WASM

Reqwest uses Tokio for async requests, and it also offers a blocking mode. In this article, I used async mode.

Scraper

Scraper provides an interface to Servo's html5ever and selectors crates, for browser-grade parsing and querying.

First Program: A simple scraper for “index of” pages.

use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.wireshark.org/download/").await?;
    // text() consumes the response, so keep a copy of the URL first
    let base_url = resp.url().clone();
    let text = resp.text().await?;

    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"table > tbody > tr > td > a"#).unwrap();
    for title in document.select(&selector) {
        println!("{}", base_url);
        println!("{}", title.value().attr("href").expect("href not found"));
    }

    Ok(())
}

Here is a little upgrade to this program. Wireshark's page contains duplicate links, and we can skip them using a HashSet.

use scraper::{Html, Selector};
use std::collections::HashSet;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.wireshark.org/download/").await?;
    println!("{}", resp.url().to_string());

    let text = resp.text().await?;

    let mut urls: HashSet<String> = HashSet::new();
    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"table > tbody > tr > td > a"#).unwrap();
    for title in document.select(&selector) {
        let url = title.value().attr("href").expect("href not found").to_string();
        // chaining `!=` with `||` is always true; `&&` correctly skips the navigation links
        if url != "/" && url != "." && url != ".." {
            urls.insert(url);
        }
    }

    for url in urls {
        println!("{}", url);
    }

    Ok(())
}

For the next iteration, I want to show you how to build a parser that is not much more complex, but has more features.

Task: Collect new wallpapers from the Wallhaven website.

On this site, there is a “random” button that gives you 24 random pictures. For this task, we need to parse that page, then follow each picture’s URL, get the full-size image URL, and download the picture.

use scraper::{Html, Selector};
use std::io::Cursor;
use std::process;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = match reqwest::get("https://wallhaven.cc/random").await {
        Ok(x) => x,
        Err(_) => {
            println!("error on /random request");
            process::exit(1); // exit if the request to /random fails
        }
    };
    let text = resp.text().await?;

    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"ul > li > figure > a"#).unwrap();
    for elem in document.select(&selector) {
        let href = elem.value().attr("href").expect("href not found!");
        download(href).await?;
    }
    Ok(())
}

async fn download(url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get(url).await?;
    let text = resp.text().await?;

    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"img"#).unwrap();
    for elem in document.select(&selector) {
        if let Some(picture_url) = elem.value().attr("data-cfsrc") {
            // ignore the other pictures on the page, such as the logo and avatars
            if picture_url.contains("avatar") || picture_url.contains("logo") {
                continue;
            }
            // try to keep the original file name of the picture
            let file_path = picture_url.split('/').last().expect("can't find filename");
            match reqwest::get(picture_url).await {
                Ok(resp) => {
                    let mut file = std::fs::File::create(file_path)?;
                    let mut content = Cursor::new(resp.bytes().await?);
                    std::io::copy(&mut content, &mut file)?;
                    println!("Created: {}", file_path);
                }
                Err(_) => continue,
            }
        }
    }
    Ok(())
}

In conclusion: my aim here was to solve a few practical parsing tasks for educational purposes, so the code is not refactored as thoroughly as it could be. For production use, study more polished examples and adapt them to your own tasks.
