Jan Curn

CEO of Apify

Why the world needs a universal web scraping library for JavaScript

TL;DR: We’ve released the Apify SDK — an open-source Node.js library for scraping and web crawling. There was one for Python, but until now, there was no such library for JavaScript, THE language of the web.
In Python there is Scrapy, the de facto standard toolkit for building web scrapers and crawlers. JavaScript had no similarly comprehensive and universal library, and that never made sense. An ever-increasing number of websites use JavaScript to fetch and render user content. To extract data from these websites, you often need an actual web browser to parse the HTML and run the page scripts, and you then inject data extraction code that runs in the browser context. In other words, you need JavaScript. So what’s the point of using another programming language to manage the “server-side” of the crawler?
Let’s just ignore the fact that JavaScript is the most widely used programming language in the world, according to the 2018 Stack Overflow Developer Survey. We won’t mention this fact in this blog post.
There’s another good reason to use JavaScript for web scraping. In April 2017, Google launched headless Chrome, and soon after released Puppeteer, a Node.js API for headless Chrome. This was nothing short of a revolution in the web scraping world. Previously, the only options for instrumenting a full-featured web browser were PhantomJS, which is badly outdated; Splash, which is based on QtWebKit and only has a Python SDK; or Selenium, whose API is limited by the need to support all browsers, including Internet Explorer, Firefox and Safari. While headless Chrome made it possible to run web browsers on servers without X Virtual Framebuffer (Xvfb), Puppeteer provided the simplest and most powerful high-level API for a web browser that ever existed.

Puppeteer is not enough

Even though writing data extraction code for a few web pages in Puppeteer is straightforward, things get more complicated when you try to perform a deep crawl of an entire website using a persistent queue of URLs, or crawl a list of 100k URLs from a CSV file. This is where the need for a library comes in.
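To get a feel for the bookkeeping a deep crawl involves, here is a minimal sketch of a de-duplicating URL queue and a breadth-first crawl loop in plain Node.js. It is not the Apify SDK's implementation: the UrlQueue class, the crawl function and the fetchLinks callback are hypothetical names made up for illustration, and the queue lives in memory rather than being persistent.

```javascript
const { URL } = require('url');

// Illustrative in-memory URL queue with de-duplication.
// (Hypothetical helper, not the Apify SDK's request queue.)
class UrlQueue {
    constructor() {
        this.pending = [];      // URLs waiting to be crawled, in FIFO order
        this.seen = new Set();  // every URL ever enqueued
    }

    // Drops the #fragment and enqueues the URL unless it was seen before.
    add(rawUrl) {
        const url = new URL(rawUrl);
        url.hash = '';
        if (this.seen.has(url.href)) return false;
        this.seen.add(url.href);
        this.pending.push(url.href);
        return true;
    }

    next() {
        return this.pending.shift();
    }

    get isEmpty() {
        return this.pending.length === 0;
    }
}

// Breadth-first crawl that stays on the origin of the start URL.
// `fetchLinks` is a hypothetical async callback: (url) => array of links found on that page.
async function crawl(startUrl, fetchLinks, maxPages = 100) {
    const queue = new UrlQueue();
    queue.add(startUrl);
    const origin = new URL(startUrl).origin;
    const visited = [];

    while (!queue.isEmpty && visited.length < maxPages) {
        const url = queue.next();
        visited.push(url);
        const links = await fetchLinks(url);
        for (const link of links) {
            // Resolve relative links against the current page; skip other sites.
            const absolute = new URL(link, url);
            if (absolute.origin === origin) queue.add(absolute.href);
        }
    }
    return visited;
}
```

In a real crawler, fetchLinks would load the page (for example in Puppeteer) and return its link targets. Persistence across restarts, retries, concurrency and error handling are all missing from this sketch, and that is the kind of work a library can take over.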
The Apify SDK is an open-source library that simplifies the development of web crawlers, scrapers, data extractors and web automation jobs. It provides tools to manage and automatically scale a pool of headless Chrome / Puppeteer instances, maintain queues of URLs to crawl, store crawling results on a local filesystem or in the cloud, rotate proxies and much more. The library is available as the apify package on npm. It can be used either stand-alone in your own applications or in actors running on the Apify cloud platform.
Unlike other web scraping libraries such as the Headless Chrome Crawler, however, the Apify SDK is not bound only to Puppeteer. For example, you can easily create web crawlers that use the cheerio HTML parsing library or even Selenium. The basic building blocks are the same for many types of crawlers. In short, the Apify SDK incorporates lessons we at Apify learned from scraping thousands of websites over the last four years. We tried to design the SDK components to strike the right balance between simplicity, performance and customizability.

Show me the code

All right, enough talking, let’s have a look at the code. To run it, you’ll need Node.js 8 or later installed on your computer. First, install the Apify SDK into your project by running:
npm install apify --save
Now you can run this example script:
[Embedded code: Hello crawler example. Source code on GitHub]
The script performs a recursive deep crawl of https://www.iana.org using Puppeteer. The number of Puppeteer processes and tabs is automatically controlled depending on the available CPU and memory in your system. By the way, this functionality is exposed separately as the AutoscaledPool class, so that you can use it to manage a pool of any other resource-intensive asynchronous tasks.
This is how the crawl looks in the terminal:
[Figure: Recursive crawl of https://www.iana.org using Apify SDK, terminal view]
And when you run the crawl in headful mode, you will see how the Apify SDK automatically launches and manages Chrome browsers and tabs:
[Figure: Recursive crawl of https://www.iana.org using Apify SDK, Chrome browser view]
The Apify SDK provides a number of utility classes that are helpful for common web scraping and automation tasks, e.g. for management of URLs to crawl, data storage and various crawler skeletons. To get a better idea, have a look at Overview of components.
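To illustrate the idea of a pool of resource-intensive asynchronous tasks mentioned above, here is a rough sketch of a task pool in plain Node.js. It is a deliberate simplification and not the AutoscaledPool class: runPool is a hypothetical function, and its concurrency limit is a fixed number supplied by the caller, whereas the SDK adjusts the limit to the available CPU and memory.

```javascript
// Runs async tasks with at most `maxConcurrency` of them in flight at once.
// Hypothetical helper for illustration; the limit is fixed, not autoscaled.
async function runPool(tasks, maxConcurrency) {
    const results = new Array(tasks.length);
    let nextIndex = 0;

    // Each worker keeps claiming the next unclaimed task until none are left.
    async function worker() {
        while (nextIndex < tasks.length) {
            const index = nextIndex++;
            results[index] = await tasks[index]();
        }
    }

    const workerCount = Math.min(maxConcurrency, tasks.length);
    const workers = [];
    for (let i = 0; i < workerCount; i++) workers.push(worker());
    await Promise.all(workers);

    // Results are returned in the order of the input tasks.
    return results;
}
```

Each task is a function that returns a promise, for example one that opens a page and extracts its title. An autoscaling version would replace the fixed workerCount with logic that periodically measures system load and starts or pauses workers accordingly.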

What’s the point of this?

We believe everybody has the right to access the publicly available web in the way they want it and not just the way the authors intended it. It was in exactly this way that the internet grew into what it is today — by enabling people to creatively use what is out there and build new layers on top of it, regardless of what it was designed for. That’s why web scraping is important, because it allows you to create previously unimaginable tools or services on top of existing ones and add a new layer on top of the internet. Why should websites only allow automated access to giants like Google and Bing, but not to your code?
We’re really looking forward to seeing what you can build with the Apify SDK. And of course, we’d love to hear what you think about it or answer any questions that you might have. Let us know on Twitter.
Happy crawling in JavaScript!
