A scalable universal scraper. Scrape thousands of TechCrunch articles in seconds!
This post is part of a series of open-source projects that we’ll be releasing over the next while, as described previously here.
I’ve been doing a bit of Machine Learning recently, and a big part of learning how to create different architectures and models is dealing with real-life data. There are plenty of great datasets out there, but I wanted to build one from scratch and settled on TechCrunch article titles (to use for a tech-news article title generator).
There are plenty of ways to scrape simple data like that from public sites such as TechCrunch, but since we’ve had to do a few scraping jobs before, I set out to build a simple scraper that pulls in public metadata as well as text extracted via simple queries against the HTML content.
That’s where scrape came in. It’s built on top of stdlib and has a simple abstraction for pulling in data from the raw HTML of the sites it scrapes, as well as for pulling in structured schema.org and Open Graph metadata.
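To give a feel for the metadata half of that, here’s a minimal, hypothetical sketch in Python of extracting Open Graph tags from raw HTML — this is not scrape’s actual implementation or API, just an illustration of the kind of structured data a page exposes:

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects Open Graph <meta property="og:..."> tags from raw HTML."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:") and "content" in attrs:
            # e.g. "og:title" becomes the key "title"
            self.og[prop[3:]] = attrs["content"]

def parse_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og

# A made-up page head, shaped like what an article page typically exposes:
sample = """
<html><head>
  <meta property="og:title" content="Some Startup Raises $10M" />
  <meta property="og:type" content="article" />
</head><body>...</body></html>
"""
print(parse_open_graph(sample)["title"])  # Some Startup Raises $10M
```

The nice part is that Open Graph (and schema.org) metadata is machine-readable by design, so a scraper can lift titles, types, and images without caring about a site’s layout.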
It’s open source:
And you can also use a production-ready version on stdlib here.
Naturally, after building a distributed scraper, it’s rather easy to pull in a lot of article titles from a site like TechCrunch.
Here’s a quick snippet that I used to scrape about 100 pages of TechCrunch archives:
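(The original snippet was embedded; as a rough, hypothetical sketch of the same idea in Python — the archive URL pattern and the title selector below are assumptions about TechCrunch’s markup, not the scrape service’s actual API:)

```python
import re

def archive_url(page):
    # Assumed pagination scheme (/page/N/), common to WordPress-style sites.
    return "https://techcrunch.com/page/%d/" % page

def extract_titles(html):
    # Assumed markup: article titles inside <h2 class="post-title"><a ...> tags.
    # A regex is crude for HTML, but fine for a quick one-off scrape.
    return re.findall(r'<h2 class="post-title"><a[^>]*>(.*?)</a>', html)

# The ~100 archive pages mentioned above:
urls = [archive_url(n) for n in range(1, 101)]

# In practice each URL would be fetched (e.g. with urllib.request, or by
# calling the scrape service); here we parse a stub page instead.
stub_page = (
    '<h2 class="post-title"><a href="/a">Title One</a></h2>'
    '<h2 class="post-title"><a href="/b">Title Two</a></h2>'
)
print(len(urls), extract_titles(stub_page))
```

Fan the URLs out across parallel workers (which is what a distributed scraper buys you) and the whole archive comes back in seconds rather than minutes.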
And in a matter of seconds, the titles roll in.
Now to have fun with this data and come up with a few ML models. That’s for another day.
Next time you need to scrape a site in a structured way and don’t want to invest in creating the infrastructure for it, feel free to use (or fork) scrape!
If you’d like to keep up with the open-source microservice releases that we’re doing over the next while, follow my posts.