(In Python there is Scrapy, but we won’t dwell on that in this blog post.) Google released headless Chrome, and soon after they released Puppeteer
— a Node.js API for headless Chrome. This was nothing short of a revolution in the web scraping world, because previously the only options to instrument a full-featured web browser were to use PhantomJS
, which is badly outdated, use Splash
, which is based on QtWebKit and only has a Python SDK, or use Selenium
, whose API is limited by the need to support all browsers, including Internet Explorer, Firefox, Safari, etc. While headless Chrome made it possible to run web browsers on servers without the need to use X Virtual Framebuffer (Xvfb), Puppeteer provided the simplest and most powerful high-level API for a web browser that ever existed.
Puppeteer is not enough
Even though writing data extraction code for a few web pages in Puppeteer is straightforward, things can quickly get more complicated: for example, when you try to perform a deep crawl of an entire website using a persistent queue of URLs, or crawl a list of 100k URLs from a CSV file. This is where the need for a library comes in.
The Apify SDK
is an open-source library that simplifies the development of web crawlers, scrapers, data extractors and web automation jobs. It provides tools to manage and automatically scale a pool of headless Chrome / Puppeteer instances, to maintain queues of URLs to crawl, store crawling results to a local filesystem or in the cloud, rotate proxies and much more. The library is available as the apify
package on NPM. It can be used either stand-alone in your own applications or in actors
running on the Apify platform.
But unlike other web scraping libraries such as the Headless Chrome Crawler
, the Apify SDK is not bound only to Puppeteer. For example, you can easily create web crawlers that use the cheerio
HTML parsing library or even Selenium
. The basic building blocks are the same for many types of crawlers. In short, the Apify SDK incorporates lessons we at Apify
learned from scraping thousands of websites over the last four years. We tried to design the SDK components to strike the right balance between simplicity, performance and customizability.
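To illustrate that flexibility, here is a minimal sketch of a browserless crawler built on the SDK’s CheerioCrawler, which downloads pages over plain HTTP and parses them with cheerio. The start URL and extraction logic are made up for the example, and exact class and option names may differ between SDK versions:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // A static list of URLs to crawl (just one here, for illustration).
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://www.iana.org/' }],
    });
    await requestList.initialize();

    // CheerioCrawler fetches pages with plain HTTP requests and parses
    // them with cheerio - no browser is launched at all.
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ request, $ }) => {
            console.log(`${request.url}: ${$('title').text()}`);
        },
    });

    await crawler.run();
});
```

Because the `handlePageFunction` receives the parsed document rather than a live browser page, this style of crawler is typically much cheaper to run than a Puppeteer-based one.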
Show me the code
All right, enough talking, let’s have a look at the code. To run it, you’ll need Node.js
8 or later installed on your computer. First, install the Apify SDK
to your project by running:
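Assuming an existing Node.js project, the SDK installs from NPM:

```shell
npm install apify --save
```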
Now you can run this example script:
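The embedded example looked roughly like the following sketch: a recursive crawl of iana.org using PuppeteerCrawler and a persistent request queue. The helper and option names follow the SDK’s early API and may differ in later versions:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Persistent queue of URLs to crawl.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.iana.org/' });

    // Only enqueue links that stay on the same site.
    const pseudoUrls = [new Apify.PseudoUrl('https://www.iana.org/[.*]')];

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}: ${title}`);

            // Find links on the page and add matching ones to the queue.
            await Apify.utils.puppeteer.enqueueLinks(
                page, 'a', pseudoUrls, requestQueue,
            );
        },
        maxRequestsPerCrawl: 100,
    });

    await crawler.run();
});
```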
Hello crawler example. Source code on GitHub
The script performs a recursive deep crawl of https://www.iana.org
using Puppeteer. The number of Puppeteer processes and tabs is automatically controlled depending on the available CPU and memory in your system. By the way, this functionality is exposed separately as the AutoscaledPool
class, so that you can use it to manage a pool of any other resource-intensive asynchronous tasks.
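A hedged sketch of using AutoscaledPool on its own, outside of any crawler — the task functions and the URL list are made up for the example, and option names may vary between SDK versions:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // A made-up work queue of resource-intensive async tasks.
    const urls = ['https://example.com/a', 'https://example.com/b'];

    // AutoscaledPool runs as many concurrent tasks as the available
    // CPU and memory allow, scaling concurrency up and down over time.
    const pool = new Apify.AutoscaledPool({
        runTaskFunction: async () => {
            const url = urls.shift();
            // ... do some resource-intensive async work with `url` ...
        },
        isTaskReadyFunction: async () => urls.length > 0,
        isFinishedFunction: async () => urls.length === 0,
    });

    await pool.run();
});
```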
This is how it looks in the terminal:
And when you run the crawl in headful mode, you will see how the Apify SDK automatically launches and manages Chrome browsers and tabs:
The Apify SDK provides a number of utility classes that are helpful for common web scraping and automation tasks, e.g. for management of URLs to crawl, data storage and various crawler skeletons. To get a better idea, have a look at the Overview of components.
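For instance, storing crawling results takes a single call on the default dataset. A short sketch, where the `openDataset`/`pushData` calls follow the SDK’s documented storage API but the record itself is made up:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Opens the default dataset, backed by the local filesystem
    // (./apify_storage) when run locally, or by cloud storage on Apify.
    const dataset = await Apify.openDataset();

    // Append one result record; each record becomes one item in the dataset.
    await dataset.pushData({ url: 'https://example.com', title: 'Example' });
});
```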
What’s the point of this?
We believe everybody has the right to access the publicly available web in the way they want it and not just the way the authors intended it. It was in exactly this way that the internet grew into what it is today — by enabling people to creatively use what is out there and build new layers on top of it, regardless of what it was designed for. That’s why web scraping is important, because it allows you to create previously unimaginable tools or services on top of existing ones and add a new layer on top of the internet. Why should websites only allow automated access to giants like Google and Bing, but not to your code?
We’re really looking forward to seeing what you can build with the Apify SDK
. And of course, we’d love to hear what you think about it and to answer any questions you might have. Let us know on Twitter.