Parsing HTML with Rust: A Simple Tutorial Using Tokio, Reqwest, and Scraper by@exactor


Maksim Kuznetsov

Just a Senior Python Developer.

In this article, I walk through a few small programs that show how web page parsing works in Rust.

For this task, I chose the Tokio, Reqwest, and Scraper libraries.

Tokio

It's a library for writing reliable, asynchronous, and slim applications with the Rust programming language.

Why I chose Tokio:

  • It's fast: Tokio uses zero-cost abstractions for asynchronous code.
  • It's safe: Tokio is built on safe Rust and Rust's safe concurrency primitives.
  • It's scalable: it has a minimal footprint in your application.
  • Tokio is also a more popular asynchronous runtime than async-std.
  • It uses cooperative multitasking based on lightweight (green) threads.

List of projects that use Tokio, based on Tokio's GitHub repository:

  • hyper: A fast and correct HTTP implementation for Rust.
  • tonic: A Rust implementation of gRPC.
  • warp: A super-easy, composable web server framework for warp speeds.
  • tower: A library of modular and reusable components for building robust networking clients and servers.
  • tracing: is a framework for instrumenting Rust programs to collect structured, event-based diagnostic information.
  • rdbc: A Rust database connectivity (RDBC) library for MySQL, Postgres, and SQLite.
  • mio: Fast, low-level I/O library for Rust focusing on non-blocking APIs.
  • bytes: A utility library for working with bytes.
  • loom: The testing tool for concurrent Rust code.

Reqwest

An ergonomic, batteries-included HTTP Client for Rust.

Features:

  • Plain bodies, JSON, urlencoded, multipart
  • Customizable redirect policy
  • HTTP Proxies
  • HTTPS via system-native TLS (or optionally, rustls)
  • Cookie Store
  • WASM

Reqwest uses Tokio for async requests, and it also offers a blocking mode. In this article, I used async mode.

Scraper

Scraper provides an interface to Servo's html5ever and selectors crates, for browser-grade parsing and querying.

First Program: A simple scraper for “index of” pages.

use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.wireshark.org/download/").await?;
    // text() consumes the response, so keep a copy of the URL first
    let base_url = resp.url().clone();
    let text = resp.text().await?;

    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"table > tbody > tr > td > a"#).unwrap();
    for title in document.select(&selector) {
        println!("{}", base_url);
        println!("{}", title.value().attr("href").expect("href not found"));
    }

    Ok(())
}

Here is a little upgrade to this program. Wireshark's page contains duplicate links, and we can skip them using a HashSet.

use scraper::{Html, Selector};
use std::collections::HashSet;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get("https://www.wireshark.org/download/").await?;
    println!("{}", resp.url().to_string());

    let text = resp.text().await?;

    let mut urls: HashSet<String> = HashSet::new();
    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"table > tbody > tr > td > a"#).unwrap();
    for title in document.select(&selector) {
        let url = title.value().attr("href").expect("href not found").to_string();
        // chaining `!=` with `||` is always true; `&&` correctly skips the navigation links
        if url != "/" && url != "." && url != ".." {
            urls.insert(url);
        }
    }

    for url in urls {
        println!("{}", url);
    }

    Ok(())
}

For the next iteration, I want to show you how to build a parser that is not much more complex, but has more features.

Task: Collect new wallpapers from the Wallhaven website.

On this site, there is a “random” button that gives you 24 random pictures. For this task, we need to parse that page, then follow each picture’s URL, get the full-size image URL, and download the picture.

use scraper::{Html, Selector};
use std::io::Cursor;
use std::process;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = match reqwest::get("https://wallhaven.cc/random").await {
        Ok(x) => x,
        Err(_) => {
            println!("error on /random request");
            process::exit(1); // exit if the request to /random fails
        }
    };
    let text = resp.text().await?;

    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"ul > li > figure > a"#).unwrap();
    for elem in document.select(&selector) {
        let href = elem.value().attr("href").expect("href not found!");
        download(href).await?;
    }
    Ok(())
}

async fn download(url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::get(url).await?;
    let text = resp.text().await?;

    let document = Html::parse_document(&text);
    let selector = Selector::parse(r#"img"#).unwrap();
    for elem in document.select(&selector) {
        if let Some(picture_url) = elem.value().attr("data-cfsrc") {
            // ignore the other pictures on the page, such as the logo and avatars
            if picture_url.contains("avatar") || picture_url.contains("logo") {
                continue;
            }
            // try to keep the original file name of the picture
            let file_path = picture_url.split('/').last().expect("can't find filename");
            match reqwest::get(picture_url).await {
                Ok(resp) => {
                    let mut file = std::fs::File::create(file_path)?;
                    let mut content = Cursor::new(resp.bytes().await?);
                    std::io::copy(&mut content, &mut file)?;
                    println!("Created: {}", file_path);
                }
                Err(_) => continue,
            }
        }
    }
    Ok(())
}

In conclusion: my aim here was to solve a few practical parsing tasks for educational purposes, so the code is not refactored as thoroughly as it could be. For production use, study more polished examples and adapt them to your own tasks.
