Puppeteer is not enough
Even though writing data extraction code for a few web pages in Puppeteer is straightforward, things can get more complicated when you try to perform a deep crawl of an entire website using a persistent queue of URLs, or to crawl a list of 100k URLs from a CSV file. This is where the need for a library comes in.
The Apify SDK is an open-source library that simplifies the development of web crawlers, scrapers, data extractors and web automation jobs. It provides tools to manage and automatically scale a pool of headless Chrome / Puppeteer instances, maintain queues of URLs to crawl, store crawling results on the local filesystem or in the cloud, rotate proxies and much more. The library is available as the apify package on NPM. It can be used either stand-alone in your own applications or in actors running on the Apify cloud platform.
But unlike other web scraping libraries such as the Headless Chrome Crawler, the Apify SDK is not bound only to Puppeteer. For example, you can easily create web crawlers that use the cheerio HTML parsing library or even Selenium, because the basic building blocks are the same for many types of crawlers. In short, the Apify SDK incorporates lessons we at Apify have learned from scraping thousands of websites over the last four years. We tried to design the SDK components to strike the right balance between simplicity, performance and customizability.
Show me the code
All right, enough talking, let’s have a look at the code. To run it, you’ll need Node.js 8 or later installed on your computer. First, install the Apify SDK into your project by running:
npm install apify --save
Now you can run this example script:
Hello crawler example. Source code on GitHub
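The embedded gist is not reproduced here, but a minimal sketch of such a recursive crawler looks roughly like the following. It assumes the apify package’s PuppeteerCrawler API (class names and the enqueueLinks options may differ slightly between SDK versions):

```javascript
// Minimal recursive crawl of https://www.iana.org with the Apify SDK.
// Assumes the `apify` NPM package; option names follow the 0.x/1.x SDK.
const Apify = require('apify');

Apify.main(async () => {
    // Persistent queue of URLs to crawl; survives process restarts.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.iana.org/' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // Called once for each page; receives Puppeteer's `page` object.
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}: ${title}`);
            // Find links on the page and enqueue those matching the
            // pseudo-URL pattern back into the same queue.
            await Apify.utils.enqueueLinks({
                page,
                selector: 'a',
                requestQueue,
                pseudoUrls: ['https://www.iana.org/[.*]'],
            });
        },
        maxRequestsPerCrawl: 100, // safety limit for the example
    });

    await crawler.run();
});
```

Concurrency is not configured anywhere above; as described below, the SDK scales the number of browsers and tabs automatically.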
The script performs a recursive deep crawl of https://www.iana.org using Puppeteer. The number of Puppeteer processes and tabs is automatically controlled depending on the available CPU and memory in your system. By the way, this functionality is exposed separately as the AutoscaledPool class, so that you can use it to manage a pool of any other resource-intensive asynchronous tasks.
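To illustrate, a sketch of AutoscaledPool used on its own, driving an in-memory list of tasks rather than a crawl (the option names runTaskFunction, isTaskReadyFunction and isFinishedFunction follow the SDK’s documented API; processItem is a hypothetical stand-in for your own work):

```javascript
const Apify = require('apify');

// Hypothetical resource-intensive task.
const processItem = async (item) => {
    console.log(`Processing item ${item}`);
};

Apify.main(async () => {
    const items = Array.from({ length: 50 }, (_, i) => i);

    const pool = new Apify.AutoscaledPool({
        // Takes one item and performs the work for it. The pool decides
        // how many of these run in parallel based on free CPU and memory.
        runTaskFunction: async () => {
            const item = items.pop();
            await processItem(item);
        },
        // The pool calls these to decide whether to schedule another
        // task and when all the work is done.
        isTaskReadyFunction: async () => items.length > 0,
        isFinishedFunction: async () => items.length === 0,
    });

    await pool.run();
});
```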
This is how it looks in terminal:
Recursive crawl of https://www.iana.org using Apify SDK — Terminal view
And when you run the crawl in headful mode, you will see how the Apify SDK automatically launches and manages Chrome browsers and tabs:
Recursive crawl of https://www.iana.org using Apify SDK — Chrome browser view
The Apify SDK provides a number of utility classes that are helpful for common web scraping and automation tasks, e.g. for management of URLs to crawl, data storage and various crawler skeletons. To get a better idea, have a look at Overview of components.
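For instance, storing results: a sketch using the SDK’s Apify.pushData helper, which appends records to the default Dataset (kept on the local disk when you run the script locally, or in cloud storage on the Apify platform):

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Each call appends one JSON record to the default dataset.
    // Locally, the records end up as files under ./apify_storage.
    await Apify.pushData({
        url: 'https://www.iana.org/',
        title: 'Internet Assigned Numbers Authority',
    });
});
```

Inside a crawler, you would typically call Apify.pushData from the page-handling function, one record per scraped page.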
What’s the point of this?
We believe everybody has the right to access the publicly available web in the way they want, not just the way the authors intended. That is exactly how the internet grew into what it is today: by enabling people to creatively use what is out there and build new layers on top of it, regardless of what it was designed for. Web scraping matters because it allows you to create previously unimaginable tools and services on top of existing ones, adding a new layer to the internet. Why should websites allow automated access to giants like Google and Bing, but not to your code?
We’re really looking forward to seeing what you build with the Apify SDK. And of course, we’d love to hear what you think of it or answer any questions you might have. Let us know on Twitter.