A scalable universal scraper. Scrape thousands of TechCrunch articles in seconds!
This post is part of a series of open-source projects that we’ll be releasing over the next while, as described previously here.
I’ve been doing a bit of Machine Learning recently, and a big part of learning how to create different architectures and models is dealing with real-life data. There are plenty of great datasets out there, but I wanted to build one from scratch and settled on TechCrunch article titles (to use for a tech-news article title generator).
There are plenty of ways to scrape simple data like that from public sites such as TechCrunch, but since we’ve had to do a few scraping jobs before, I set out to build a simple scraper that pulls in public metadata as well as text extracted via simple queries against the HTML content.
That’s where scrape came in. It’s built on top of stdlib and has a simple abstraction for pulling in data from the raw HTML of the sites it scrapes, as well as for pulling in structured schema.org and Open Graph metadata.
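To give a feel for the metadata half of that, here’s a minimal, hypothetical sketch in Python of extracting Open Graph tags from raw HTML — this is not scrape’s actual implementation or API, just an illustration of the kind of structured data a page exposes:

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects Open Graph <meta property="og:..."> tags from raw HTML."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:") and "content" in attrs:
            # e.g. "og:title" becomes the key "title"
            self.og[prop[3:]] = attrs["content"]

def parse_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og

# A made-up page head, shaped like what an article page typically exposes:
sample = """
<html><head>
  <meta property="og:title" content="Some Startup Raises $10M" />
  <meta property="og:type" content="article" />
</head><body>...</body></html>
"""
print(parse_open_graph(sample)["title"])  # Some Startup Raises $10M
```

The nice part is that Open Graph (and schema.org) metadata is machine-readable by design, so a scraper can lift titles, types, and images without caring about a site’s layout.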
It’s open source:
And you can also use a production-ready version on stdlib here.
Naturally, after building a distributed scraper, it’s rather easy to pull in a lot of article titles from a site like TechCrunch.
Here’s a quick snippet that I used to scrape about 100 pages of TechCrunch archives:
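(The original snippet was embedded; as a rough, hypothetical sketch of the same idea in Python — the archive URL pattern and the title selector below are assumptions about TechCrunch’s markup, not the scrape service’s actual API:)

```python
import re

def archive_url(page):
    # Assumed pagination scheme (/page/N/), common to WordPress-style sites.
    return "https://techcrunch.com/page/%d/" % page

def extract_titles(html):
    # Assumed markup: article titles inside <h2 class="post-title"><a ...> tags.
    # A regex is crude for HTML, but fine for a quick one-off scrape.
    return re.findall(r'<h2 class="post-title"><a[^>]*>(.*?)</a>', html)

# The ~100 archive pages mentioned above:
urls = [archive_url(n) for n in range(1, 101)]

# In practice each URL would be fetched (e.g. with urllib.request, or by
# calling the scrape service); here we parse a stub page instead.
stub_page = (
    '<h2 class="post-title"><a href="/a">Title One</a></h2>'
    '<h2 class="post-title"><a href="/b">Title Two</a></h2>'
)
print(len(urls), extract_titles(stub_page))
```

Fan the URLs out across parallel workers (which is what a distributed scraper buys you) and the whole archive comes back in seconds rather than minutes.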
And in a matter of seconds, the titles roll in.
Now to have fun with this data and come up with a few ML models. That’s for another day.
Next time you need to scrape a site in a structured way and don’t want to invest in creating the infrastructure for it, feel free to use (or fork) scrape!
If you’d like to keep up with the open-source microservice releases that we’re doing over the next while, follow my posts.