TL;DR: We’ve released the Apify SDK, an open-source Node.js library for scraping and web crawling. There was one for Python, but until now, there was no such library for JavaScript, THE language of the web.

In Python there is Scrapy, the de facto standard toolkit for building web scrapers and crawlers. But in JavaScript, there was no similarly comprehensive and universal library.

That’s not right. An ever-increasing number of websites use JavaScript to fetch and render user content. To extract data from these websites, you’ll often need to use an actual web browser to parse the HTML and run page scripts, and then inject your data extraction code to run in the browser context. In other words, you need JavaScript. So what’s the point of using another programming language to manage the “server-side” of the crawler?

Let’s just ignore the fact that JavaScript is the most widely used programming language in the world, according to the 2018 Stack Overflow Survey. We won’t mention this fact in this blog post.

There’s another good reason to use JavaScript for web scraping. In April 2017, Google launched headless Chrome, and soon after they released Puppeteer, a Node.js API for headless Chrome. This was nothing short of a revolution in the web scraping world. Previously, the only options for instrumenting a full-featured web browser were PhantomJS, which is badly outdated; Splash, which is based on QtWebKit and only has a Python SDK; and Selenium, whose API is limited by the need to support all browsers, including Internet Explorer, Firefox, Safari, etc. While headless Chrome made it possible to run web browsers on servers without the need for X Virtual Framebuffer (Xvfb), Puppeteer provided the simplest and most powerful high-level API for a web browser that ever existed.

Puppeteer is not enough

Even though writing data extraction code for a few web pages in Puppeteer is straightforward, things can get more complicated.
For example, you might need to perform a deep crawl of an entire website using a persistent queue of URLs, or crawl a list of 100k URLs from a CSV file. This is where the need for a library comes in.

The Apify SDK is an open-source library that simplifies the development of web crawlers, scrapers, data extractors and web automation jobs. It provides tools to manage and automatically scale a pool of headless Chrome / Puppeteer instances, maintain queues of URLs to crawl, store crawling results on a local filesystem or in the cloud, rotate proxies and much more. The library is available as the apify package on NPM. It can be used either stand-alone in your own applications or in actors running on the Apify cloud platform.

But unlike other web scraping libraries such as the Headless Chrome Crawler, the Apify SDK is not bound only to Puppeteer. For example, you can easily create web crawlers that use the cheerio HTML parsing library or even Selenium. The basic building blocks are the same for many types of crawlers.

In short, the Apify SDK incorporates lessons we at Apify learned from scraping thousands of websites over the last four years. We tried to design the SDK components to strike the right balance between simplicity, performance and customizability.

Show me the code

All right, enough talking, let’s have a look at the code. To run it, you’ll need Node.js 8 or later installed on your computer. First, install the Apify SDK into your project by running:

    npm install apify --save

Now you can run this example script:

Hello crawler example. Source code on GitHub

The script performs a recursive deep crawl of https://www.iana.org using Puppeteer. The number of Puppeteer processes and tabs is automatically controlled depending on the available CPU and memory in your system. By the way, this functionality is exposed separately as the AutoscaledPool class, so that you can use it to manage a pool of any other resource-intensive asynchronous tasks.
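
To give a rough idea of what "a pool of asynchronous tasks" means, here is a minimal sketch in plain Node.js. It is not the AutoscaledPool API and does not use the Apify SDK at all: the runPool helper and its parameters are hypothetical names made up for this illustration, and it assumes a fixed concurrency limit, whereas the SDK adjusts concurrency based on available CPU and memory.

```javascript
// Hypothetical helper, NOT part of the Apify SDK: runs async tasks with a
// fixed concurrency limit. (The SDK scales the limit automatically instead.)
async function runPool(tasks, maxConcurrency) {
    const results = new Array(tasks.length);
    let next = 0;    // index of the next task to start
    let running = 0; // number of tasks currently in flight
    let peak = 0;    // highest concurrency observed during the run

    // Each worker keeps pulling the next unstarted task until none remain.
    async function worker() {
        while (next < tasks.length) {
            const index = next++;
            running++;
            peak = Math.max(peak, running);
            try {
                results[index] = await tasks[index]();
            } finally {
                running--;
            }
        }
    }

    // Start at most maxConcurrency workers and wait for all of them.
    const workerCount = Math.min(maxConcurrency, tasks.length);
    await Promise.all(Array.from({ length: workerCount }, () => worker()));
    return { results, peak };
}

// Usage: six simulated "page loads", with at most two running at any time.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const tasks = [1, 2, 3, 4, 5, 6].map((n) => async () => {
    await sleep(10); // stands in for opening a page and extracting data
    return n * n;
});

const poolRun = runPool(tasks, 2);
poolRun.then(({ results, peak }) => {
    console.log(results, `peak concurrency: ${peak}`);
});
```

Results are stored by task index, so they come back in the original order even when tasks finish out of order. An autoscaled pool would replace the fixed limit with one that grows or shrinks as system load changes.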

This is how it looks in the terminal:

Figure: Recursive crawl of https://www.iana.org using Apify SDK (terminal view)

And when you run the crawl in headful mode, you will see how the Apify SDK automatically launches and manages Chrome browsers and tabs:

Figure: Recursive crawl of https://www.iana.org using Apify SDK (Chrome browser view)

The Apify SDK provides a number of utility classes that are helpful for common web scraping and automation tasks, e.g. for management of URLs to crawl, data storage and various crawler skeletons. To get a better idea, have a look at the Overview of components.

What’s the point of this?

We believe everybody has the right to access the publicly available web in the way they want, and not just the way the authors intended. This is exactly how the internet grew into what it is today: by enabling people to creatively use what is out there and build new layers on top of it, regardless of what it was designed for.

That’s why web scraping is important. It allows you to create previously unimaginable tools or services on top of existing ones and add a new layer on top of the internet. Why should websites only allow automated access to giants like Google and Bing, but not to your code?

We’re really looking forward to seeing what you can build with the Apify SDK. And of course, we’d love to hear what you think about it or answer any questions that you might have. Let us know on Twitter.

Happy crawling in JavaScript!

Why the world needs a universal web scraping library for JavaScript
