
A Guide to Web Scraping With JavaScript and Node.js

by Andreas A, September 27th, 2020



With the massive increase in the volume of data on the Internet, web scraping is becoming increasingly useful for retrieving information from websites and applying it to various use cases. Typically, web data extraction involves making a request to a given web page, accessing its HTML code, and parsing that code to harvest some information. Since JavaScript is excellent at manipulating the DOM (Document Object Model) inside a web browser, creating data extraction scripts in Node.js can be extremely versatile. Hence, this tutorial focuses on JavaScript web scraping.

In this article, we’re going to illustrate how to perform web scraping with JavaScript and Node.js.

We’ll start by demonstrating how to use the Axios and Cheerio packages to extract data from a simple website.

Then, we’ll show how to use a headless browser, Puppeteer, to retrieve data from a dynamic website that loads content via JavaScript.

What you’ll need

  • Web browser
  • A web page to extract data from
  • Code editor such as Visual Studio Code
  • Node.js
  • Axios
  • Cheerio
  • Puppeteer

Ready?

Let’s get our hands dirty.

Getting Started

Installing Node.js

Node.js is a popular JavaScript runtime environment that comes with lots of features for automating the laborious task of gathering data from websites.

To install it on your system, follow the download instructions on the Node.js website. npm (the Node Package Manager) will be installed automatically alongside Node.js.

npm is the default package management tool for Node.js. Since we’ll be using packages to simplify web scraping, npm will make the process of consuming them fast and painless.

After installing Node.js, go to your project’s root directory and run the following command to create a package.json file, which will contain all the details relevant to the project:

npm init

Installing Axios

Axios is a robust promise-based HTTP client that can be deployed both in Node.js and the web browser. With this npm package, you can make HTTP requests from Node.js using promises, and download data from the Internet easily and fast.

Furthermore, Axios automatically parses JSON responses, lets you intercept requests and responses, and can handle multiple concurrent requests.

To install it, navigate to your project’s directory folder in the terminal, and run the following command:

npm install axios

By default, npm will install Axios in a folder named node_modules, which will be automatically created in your project’s directory.
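To make the point about concurrent requests concrete, here is a minimal sketch of fetching several pages at once with `Promise.all`. The `fetchAll` helper and the `fakeClient` stand-in are hypothetical names introduced for illustration. The sketch only assumes that the client has a `get(url)` method that resolves to an object with a `data` property, which is how `axios.get` behaves.

```javascript
// Sketch: fetch several pages concurrently with any Axios-style client.
// `client` is assumed to expose get(url) -> Promise<{ data }>, as axios does.
async function fetchAll(client, urls) {
  // Start every request at once, then wait for all of them to finish.
  const responses = await Promise.all(urls.map((url) => client.get(url)));
  // Keep only the payload of each response, in the same order as `urls`.
  return responses.map((response) => response.data);
}

// Stand-in client (hypothetical) so the sketch runs without network access.
// In a real script you would pass the axios module instead:
//   const axios = require('axios');
//   fetchAll(axios, ['https://example.com/a', 'https://example.com/b']);
const fakeClient = {
  get: (url) => Promise.resolve({ status: 200, data: `<html>${url}</html>` }),
};

fetchAll(fakeClient, ['https://example.com/a', 'https://example.com/b'])
  .then((pages) => console.log(pages.length)); // prints 2
```

Note that `Promise.all` rejects as soon as any single request fails, so a real scraper may prefer to handle errors per request.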

Installing Cheerio

Cheerio is an efficient and lean module that provides jQuery-like syntax for manipulating the content of web pages. It greatly simplifies the process of selecting, editing, and viewing DOM elements on a web page.

While Cheerio allows you to parse and manipulate the DOM easily, it does not work the same way as a web browser. It only parses the markup it is given: it doesn’t produce a visual rendering, execute JavaScript, load external resources, or apply CSS styling.

To install it, navigate to your project’s directory folder in the terminal, and run the following command:

npm install cheerio 

By default, just like Axios, npm will install Cheerio in a folder named node_modules, which will be automatically created in your project’s directory.

Installing Puppeteer

Puppeteer is a Node library that allows you to control a headless Chrome browser programmatically and extract data smoothly and fast.

Since some websites rely on JavaScript to load their content, using an HTTP-based tool like Axios may not yield the intended results. With Puppeteer, you can simulate the browser environment, execute JavaScript just like a browser does, and scrape dynamic content from websites.

To install it, just like the other packages, navigate to your project’s directory folder in the terminal, and run the following command:

npm install puppeteer

Scraping a simple website

Now let’s see how we can use Axios and Cheerio to extract data from a simple website.

For this tutorial, our target will be this web page: https://www.forextradingbig.com/instaforex-broker-review/. We’ll be seeking to extract the number of comments listed on the top section of the page.

To find the specific HTML elements that hold the data we are looking for, let’s use the inspector tool on our web browser:

As the inspector shows, the comment count is enclosed in an `<a>` tag, which is a child of the `<span>` tag with a class of `comment-bubble`. We’ll use this information when using Cheerio to select these elements on the page.

Here are the steps for creating the scraping logic:

1. Let’s start by creating a file called index.js that will contain the programming logic for retrieving data from the web page.

2. Then, let’s use the `require` function, which is built-in within Node.js, to include the modules we’ll use in the project.

const axios = require('axios');
const cheerio = require('cheerio');

3. Let’s use Axios to make a GET HTTP request to the target web page.

Here is the code:

axios
  .get('https://www.forextradingbig.com/instaforex-broker-review/')
  .then(response => {
    const html = response.data;
  })

Notice that when a request is sent to the web page, it returns a response. This Axios response object is made up of various components, including data that refers to the payload returned from the server.

So, when the GET request succeeds, we read the data from the response, which is in HTML format, and store it in the html variable.

4. Next, let’s load the response data into a Cheerio instance. This way, we can create a Cheerio object to help us in parsing through the HTML from the target web page and finding the DOM elements for the data we want—just like when using jQuery.

To follow the well-known jQuery convention, we’ll name the Cheerio object `$`.

const $ = cheerio.load(html);

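5. Let’s use Cheerio’s selector syntax to find the elements containing the data we want. The second argument is the context, so the call below selects every `a` tag inside an element with the `comment-bubble` class: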

const scrapedata = $('a', '.comment-bubble').text()
console.log(scrapedata);

Notice that we also used the `text()` method to output the data in a text format.

6. Finally, let’s log any errors experienced during the scraping process.

.catch( error => {
    console.log(error);
}); 
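Step 3 above mentions that the Axios response object is made up of several components. As a rough illustration, the sketch below reads the fields a scraper most often needs: `status`, `headers`, and `data`. The `describeResponse` helper and the `sampleResponse` object are hypothetical. The object is a hand-written stand-in shaped like the value `axios.get` resolves with, so the snippet runs without making a request.

```javascript
// Sketch: pull the commonly used fields out of an Axios-style response.
function describeResponse(response) {
  return {
    status: response.status, // HTTP status code, e.g. 200
    contentType: response.headers['content-type'], // header names are lower-cased
    length: response.data.length, // number of characters in the HTML payload
  };
}

// Hypothetical response object shaped like the one Axios resolves with.
const sampleResponse = {
  status: 200,
  statusText: 'OK',
  headers: { 'content-type': 'text/html; charset=UTF-8' },
  data: '<html><body>Hello</body></html>',
};

console.log(describeResponse(sampleResponse));
```

Checking `status` and the content type before parsing can help catch cases where the server returns an error page or a non-HTML payload.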

Here is the entire code for the scraping logic:

const axios = require("axios");
const cheerio = require("cheerio");
//performing a GET request
axios
  .get("https://www.forextradingbig.com/instaforex-broker-review/")
  .then((response) => {
    //handling the success
    const html = response.data;

    //loading response data into a Cheerio instance
    const $ = cheerio.load(html);

    //selecting the elements with the data
    const scrapedata = $("a", ".comment-bubble").text();

    //outputting the scraped data
    console.log(scrapedata);
  })
  //handling error
  .catch((error) => {
    console.log(error);
  });

If we run the above code with the `node index.js` command, it returns the information we wanted to scrape from the target web page.

Here is a screenshot of the results:

It worked!

Scraping a dynamic website

Now let’s see how you can use Puppeteer to extract data from a dynamic website.

For this example, we’ll use the ES2017 `async`/`await` syntax to work with promises comfortably.

The `async` keyword marks a function as one that returns a promise, and the `await` expression pauses that function until the promise is settled before executing the rest of its code. This syntax will ensure we extract the web page’s content only after it has been successfully loaded.
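Here is a small, self-contained sketch of that behavior. The `delay` and `loadPage` functions are hypothetical stand-ins for a slow operation such as loading a web page. No browser is involved.

```javascript
// Sketch: a promise that resolves after `ms` milliseconds, standing in for
// a slow operation such as loading a web page.
function delay(ms, value) {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// An async function always returns a promise.
async function loadPage() {
  const events = [];
  events.push('request sent');
  // Execution of loadPage pauses here until the promise settles.
  const content = await delay(50, 'page content');
  events.push(`received: ${content}`);
  return events;
}

loadPage().then((events) => console.log(events));
// logs [ 'request sent', 'received: page content' ]
```

Only the `async` function itself is paused. Code outside it keeps running while the promise is pending.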

Our target will be this Reddit page (https://www.reddit.com/r/scraping/), which uses JavaScript for rendering content. We’ll be seeking to extract the headlines and descriptions found on the page.

To find the specific HTML elements that hold the data we are looking for, let’s use the inspector tool on our web browser:

As the inspector shows, each post is enclosed in an element with the `Post` class, among others. Examining it closely, we find that each post title is in an `h3` tag and each description is in a `p` tag. We’ll use this information when selecting these elements on the page.

Here are the steps for creating the scraping logic:

1. Let’s start by creating a file called index.js that will contain the programming logic for retrieving data from the webpage.

2. Then, let’s use the `require` function, which is built-in within Node.js, to import Puppeteer into our project.

const puppeteer = require ('puppeteer');

3. Let’s launch Puppeteer. We’re actually launching an instance of the Chrome browser to use for accessing the target webpage.

puppeteer.launch()

4. Let’s create a new page in the headless browser. Since we’ve used the `await` expression, we’ll wait for the new page to be opened before saving it to the `page` variable.

After creating the page, we’ll use it for navigating to the Reddit page. Again, since we’ve used `await`, our code execution will pause until the page is loaded or an error is thrown.

We’ll also wait for the page’s body tag to be loaded before proceeding with the rest of the execution.

Here is the code:

.then (async browser => { 
const page = await browser.newPage (); 	
await page.goto ('https://www.reddit.com/r/scraping/'); 	
await page.waitForSelector ('body');

5. After pulling up the Reddit page in Puppeteer, we can use its `evaluate()` function to interact with the page.

With this function, we can execute arbitrary JavaScript in Chrome and use the browser’s built-in functions, such as `querySelector()`, to manipulate the page and retrieve its contents.

Here is the code:

let grabPosts = await page.evaluate (() => {
      let allPosts = document.body.querySelectorAll ('.Post');
      //declared with let so it does not leak as an implicit global
      let scrapeItems = [];
      allPosts.forEach (item => {
        let postTitle = item.querySelector ('h3').innerText;

        //some posts have no description, so a missing p tag is ignored
        let postDescription = '';
        try {
          postDescription = item.querySelector ('p').innerText;
        } catch (err) {}
        scrapeItems.push ({
          postTitle: postTitle,
          postDescription: postDescription,
        });
      });

      let items = {
        "redditPosts": scrapeItems,
      };
      return items;
    });

console.log (grabPosts);

6. Let’s close the browser.

await browser.close ();

7. Finally, let’s log any errors experienced during the scraping process.

.catch (function (err) {
    console.error (err);
});
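Step 5 uses an empty `try`/`catch` to cope with posts that have no description. One possible alternative, sketched below, is optional chaining (`?.`) with a fallback (`??`), which also covers a missing title. The `extractPost` and `fakeElement` names are hypothetical. `fakeElement` is only a stand-in for a DOM element so that the sketch runs in plain Node.js. The sketch assumes a runtime that supports optional chaining (Node.js 14 or later, and a recent Chrome).

```javascript
// Sketch: read one post's title and description without try/catch.
// `item` is assumed to be a DOM element (or anything with querySelector).
function extractPost(item) {
  return {
    // ?. yields undefined when the element is missing; ?? supplies the fallback.
    postTitle: item.querySelector('h3')?.innerText ?? '',
    postDescription: item.querySelector('p')?.innerText ?? '',
  };
}

// Minimal stand-in for a DOM element, used only to exercise the sketch in Node.js.
function fakeElement(children) {
  return { querySelector: (selector) => children[selector] ?? null };
}

const withBoth = fakeElement({ h3: { innerText: 'A title' }, p: { innerText: 'Some text' } });
const titleOnly = fakeElement({ h3: { innerText: 'Only a title' } });

console.log(extractPost(withBoth));
console.log(extractPost(titleOnly));
```

If you use a helper like this with Puppeteer, define it inside the callback passed to `evaluate()`. That callback is serialized and run in the browser, so it cannot see functions declared in your Node.js script.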

Here is the entire code for the scraping logic:

const puppeteer = require ('puppeteer');

//initiating Puppeteer
puppeteer
  .launch ()
  .then (async browser => {

    //opening a new page and navigating to Reddit
    const page = await browser.newPage ();
    await page.goto ('https://www.reddit.com/r/scraping/');
    await page.waitForSelector ('body');

    //manipulating the page's content
    let grabPosts = await page.evaluate (() => {
      let allPosts = document.body.querySelectorAll ('.Post');

      //collecting each post's title and description in an array
      let scrapeItems = [];
      allPosts.forEach (item => {
        let postTitle = item.querySelector ('h3').innerText;
        //some posts have no description, so a missing p tag is ignored
        let postDescription = '';
        try {
          postDescription = item.querySelector ('p').innerText;
        } catch (err) {}
        scrapeItems.push ({
          postTitle: postTitle,
          postDescription: postDescription,
        });
      });
      let items = {
        "redditPosts": scrapeItems,
      };
      return items;
    });
    //outputting the scraped data
    console.log (grabPosts);
    //closing the browser
    await browser.close ();
  })
  //handling any errors
  .catch (function (err) {
    console.error (err);
  });

If we run the above code with the `node index.js` command, it returns the information we wanted to scrape from the target web page.

Here is a screenshot of the results (for brevity, the results have been truncated):

It worked!
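One thing to keep in mind: in the script above, `browser.close()` only runs when every earlier step succeeds. If navigation or extraction throws, the browser may be left open. A possible way to guard against that is a `try`/`finally` wrapper, sketched below. The `withBrowser` and `makeFakeLauncher` names are hypothetical. The fake launcher stands in for the `puppeteer` module so the sketch runs without starting Chrome. The only assumption is that the launcher has a `launch()` method resolving to an object with a `close()` method, as Puppeteer does.

```javascript
// Sketch: run `task` with a browser and close the browser no matter what.
// `launcher` is assumed to expose launch() -> Promise<browser>, as puppeteer does.
async function withBrowser(launcher, task) {
  const browser = await launcher.launch();
  try {
    return await task(browser);
  } finally {
    // Runs on success and on failure, so no browser process is left behind.
    await browser.close();
  }
}

// Stand-in launcher (hypothetical) so the sketch runs without starting Chrome.
// In a real script: withBrowser(require('puppeteer'), async (browser) => { ... });
function makeFakeLauncher() {
  const state = { closed: false };
  return {
    state,
    launch: async () => ({ close: async () => { state.closed = true; } }),
  };
}

const launcher = makeFakeLauncher();
withBrowser(launcher, async () => { throw new Error('navigation failed'); })
  .catch((err) => console.error(err.message, '| browser closed:', launcher.state.closed));
// logs: navigation failed | browser closed: true
```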

If you intend to use the above in production and make thousands of requests to scrape data, you are likely to get blocked by the target website. In this scenario, rotating your IP addresses after every few requests can help you stay under the radar and extract content successfully.

Connecting to a proxy service is one way to do this. With residential proxies in particular, you can get around these scraping bottlenecks and harvest online data with fewer interruptions.

In Puppeteer, you can connect to a proxy by passing one extra argument when launching it:

puppeteer.launch({
    args: [ '--proxy-server=145.0.10.11:7866' ]
});
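To rotate IP addresses as suggested above, one option is to cycle through a list of proxies and build the launch options from it. The sketch below is only an illustration. `createProxyRotator` is a hypothetical helper, and the addresses are placeholders from an IP range reserved for documentation, not working proxies.

```javascript
// Sketch: build Puppeteer launch options that rotate through a proxy list.
// The addresses below are placeholders from a documentation-only IP range.
const PROXIES = ['203.0.113.10:8080', '203.0.113.11:8080', '203.0.113.12:8080'];

// Returns a function that yields launch options, switching to the next
// proxy after every `requestsPerProxy` calls and wrapping around at the end.
function createProxyRotator(proxies, requestsPerProxy) {
  let calls = 0;
  return function nextLaunchOptions() {
    const index = Math.floor(calls / requestsPerProxy) % proxies.length;
    calls += 1;
    return { args: [`--proxy-server=${proxies[index]}`] };
  };
}

const nextLaunchOptions = createProxyRotator(PROXIES, 2);
for (let i = 0; i < 4; i += 1) {
  // In a real script: const browser = await puppeteer.launch(nextLaunchOptions());
  console.log(nextLaunchOptions().args[0]);
}
```

Because `--proxy-server` is set when the browser starts, switching proxies this way means launching a new browser instance for each change.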

Conclusion

That’s how you can perform web scraping with JavaScript and Node.js. With such skills, you can harvest useful information from web pages and integrate it into your own projects.

Remember that if you want to build something more advanced, you can always check the Axios, Cheerio, and Puppeteer documentation to help you get off the ground quickly.

Happy scraping!

Also published on: https://zenscrape.com/web-scraping-with-javascript-and-node-js-tutorial/