Microservice Series: Scraper

Written by ngardideh | Published 2016/12/06
Tech Story Tags: web-scraping | marketing | techcrunch | microservices


Universal Scraper Microservice

A scalable universal scraper. Scrape thousands of TechCrunch articles in seconds!

This post is part of a series of open source projects that we’ll be releasing in the next while, as described previously here.

Scraping

I’ve been doing a bit of Machine Learning recently, and a big part of learning how to create different architectures and models is dealing with real-life data. There are plenty of great datasets out there, but I wanted to build one from scratch and settled on TechCrunch article titles (to use for a tech-news article title generator).

There are plenty of ways to scrape simple data like that from public sites such as TC, but since we’ve had to do a few scraping jobs before, I set out to build a simple scraper that pulls in public metadata as well as text extracted with simple queries against the HTML content.

Scrape

That’s where scrape came in. It’s built on top of stdlib and has a simple abstraction around pulling in data from the raw HTML of the sites it scrapes, as well as structured schema.org and Open Graph metadata.
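To give a sense of what that abstraction covers, here’s a rough sketch of the same ideas in plain Node: read the Open Graph meta tags and any JSON-LD schema.org blobs out of the raw HTML, and run arbitrary CSS queries against it. I’m using cheerio and fetch here for illustration, and the function shape is mine rather than the service’s actual API.

```typescript
// Illustrative sketch only: the function and parameter names are made up,
// not the scrape service's API.
import * as cheerio from "cheerio";

interface PageData {
  openGraph: Record<string, string>; // e.g. { "og:title": "...", "og:image": "..." }
  schemaOrg: unknown[];              // parsed JSON-LD blobs
  queries: Record<string, string[]>; // text matched by each named CSS query
}

async function scrapePage(
  url: string,
  queries: Record<string, string>
): Promise<PageData> {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);

  // Open Graph metadata lives in <meta property="og:*" content="..."> tags.
  const openGraph: Record<string, string> = {};
  $('meta[property^="og:"]').each((_, el) => {
    const property = $(el).attr("property");
    const content = $(el).attr("content");
    if (property && content) openGraph[property] = content;
  });

  // schema.org data is commonly embedded as JSON-LD script blocks.
  const schemaOrg: unknown[] = [];
  $('script[type="application/ld+json"]').each((_, el) => {
    try {
      schemaOrg.push(JSON.parse($(el).text()));
    } catch {
      // ignore malformed blocks
    }
  });

  // Arbitrary CSS queries against the raw HTML, returning the matched text.
  const matched: Record<string, string[]> = {};
  for (const [name, selector] of Object.entries(queries)) {
    matched[name] = $(selector)
      .map((_, el) => $(el).text().trim())
      .get();
  }

  return { openGraph, schemaOrg, queries: matched };
}
```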

It’s open source:

nemo/scrape - Distributed Scraper (github.com)

And you can also use a production-ready version on stdlib here.

Scraping TechCrunch

Naturally, after building a distributed scraper, it’s rather easy to pull in a lot of article titles from a site like TechCrunch.

Here’s a quick snippet that I used to scrape about 100 pages of the TechCrunch archives:
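In spirit, it boils down to something like this standalone sketch: fetch each archive page and pull the article titles out with cheerio. The archive URL pattern and the h2 a selector are guesses at TechCrunch’s markup, and the real run fans the requests out through the scrape service rather than a single Node process.

```typescript
// Standalone sketch of the archive crawl. URL pattern and selector are
// assumptions about TechCrunch's markup at the time.
import * as cheerio from "cheerio";

async function scrapeArchivePage(page: number): Promise<string[]> {
  const html = await (await fetch(`https://techcrunch.com/page/${page}/`)).text();
  const $ = cheerio.load(html);
  return $("h2 a")
    .map((_, el) => $(el).text().trim())
    .get()
    .filter(Boolean);
}

async function main(): Promise<void> {
  // Fire off ~100 archive pages in parallel and collect the titles.
  const pages = Array.from({ length: 100 }, (_, i) => i + 1);
  const titles = (await Promise.all(pages.map(scrapeArchivePage))).flat();
  console.log(`Scraped ${titles.length} article titles`);
  titles.slice(0, 10).forEach((title) => console.log(title));
}

main().catch(console.error);
```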

And in a matter of seconds:

[Image: scraped TechCrunch article names]

Now to have fun with this data and come up with a few ML models. That’s for another day.

Next time you need to scrape a site in a structured way and don’t want to invest in creating the infrastructure for it, feel free to use (or fork) scrape!

If you’d like to keep up with the open-source microservice releases that we’re doing over the next while, follow my posts.

