With the massive increase in the volume of data on the Internet, web scraping is becoming increasingly useful for retrieving information from websites and applying it to various use cases. Typically, web data extraction involves making a request to a given web page, accessing its HTML code, and parsing that code to harvest some information. Since JavaScript is excellent at manipulating the DOM (Document Object Model) inside a web browser, creating data extraction scripts in Node.js can be extremely versatile. Hence, this tutorial focuses on JavaScript web scraping.

In this article, we’re going to illustrate how to perform web scraping with JavaScript and Node.js. We’ll start by demonstrating how to use the Axios and Cheerio packages to extract data from a simple website. Then, we’ll show how to use a headless browser, Puppeteer, to retrieve data from a dynamic website that loads content via JavaScript.

What you’ll need

- Web browser
- A web page to extract data from
- Code editor such as Visual Studio Code
- Node.js
- Axios
- Cheerio
- Puppeteer

Ready? Let’s get our hands dirty…

Getting Started

Installing Node.js

Node.js is a popular JavaScript runtime environment that comes with lots of features for automating the laborious task of gathering data from websites. To install it on your system, follow the download instructions available on its website. npm (the Node Package Manager) is installed automatically alongside Node.js. npm is the default package management tool for Node.js. Since we’ll be using packages to simplify web scraping, npm will make the process of consuming them fast and painless.

After installing Node.js, go to your project’s root directory and run the following command to create a package.json file, which will contain all the details relevant to the project:

    npm init

Installing Axios

Axios is a robust promise-based HTTP client that can be used both in Node.js and in the web browser.
With this npm package, you can make HTTP requests from Node.js using promises, and download data from the Internet quickly and easily. Furthermore, Axios automatically transforms JSON data, can intercept requests and responses, and can handle multiple concurrent requests.

To install it, navigate to your project’s directory in the terminal, and run the following command:

    npm install axios

By default, npm will install Axios in a folder named node_modules, which will be automatically created in your project’s directory.

Installing Cheerio

Cheerio is an efficient and lean module that provides jQuery-like syntax for manipulating the content of web pages. It greatly simplifies the process of selecting, editing, and viewing DOM elements on a web page. While Cheerio allows you to parse and manipulate the DOM easily, it does not work the same way as a web browser. This means that it doesn’t render the page, execute JavaScript, load external resources, or apply CSS styling.

To install it, navigate to your project’s directory in the terminal, and run the following command:

    npm install cheerio

Just like Axios, npm will install Cheerio in the node_modules folder in your project’s directory.

Installing Puppeteer

Puppeteer is a Node library that allows you to control a headless Chrome browser programmatically and extract data smoothly and quickly. Since some websites rely on JavaScript to load their content, using an HTTP-based tool like Axios may not yield the intended results. With Puppeteer, you can simulate the browser environment, execute JavaScript just like a browser does, and scrape dynamic content from websites.

To install it, just like the other packages, navigate to your project’s directory in the terminal, and run the following command:

    npm install puppeteer

Scraping a simple website

Now let’s see how we can use Axios and Cheerio to extract data from a simple website.
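Before working against a live page, it may be worth confirming that the three packages installed above can actually be loaded from the project directory. The following is only a minimal sketch: the helper name `isInstalled` is made up for illustration, and it relies on Node's built-in `require.resolve()`, which throws when a module cannot be resolved.

```javascript
// Sanity check: can Node.js resolve the packages this tutorial relies on?
// `isInstalled` is a hypothetical helper name, not part of any library.
function isInstalled(packageName) {
  try {
    // require.resolve() looks the module up without executing it
    // and throws if it cannot be resolved.
    require.resolve(packageName);
    return true;
  } catch (err) {
    return false;
  }
}

// Report the status of each package used in this tutorial.
for (const name of ['axios', 'cheerio', 'puppeteer']) {
  const status = isInstalled(name) ? 'found' : `missing (try: npm install ${name})`;
  console.log(`${name}: ${status}`);
}
```

Saving this as a separate file in the project root and running it with `node` should list each package as found once the install commands have completed.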
For this tutorial, our target will be this web page: https://www.forextradingbig.com/instaforex-broker-review/. We’ll be seeking to extract the number of comments listed on the top section of the page.

To find the specific HTML elements that hold the data we are looking for, let’s use the inspector tool in our web browser. The inspector shows that the number of comments is enclosed in an <a> tag, which is a child of the <span> tag with a class of comment-bubble. We’ll use this information when using Cheerio to select these elements on the page.

Here are the steps for creating the scraping logic:

1. Let’s start by creating a file called index.js that will contain the programming logic for retrieving data from the web page.

2. Then, let’s use the `require` function, which is built into Node.js, to include the modules we’ll use in the project.

    const axios = require('axios');
    const cheerio = require('cheerio');

3. Let’s use Axios to make a GET HTTP request to the target web page. Here is the code:

    axios.get('https://www.forextradingbig.com/instaforex-broker-review/')
        .then(response => {
            const html = response.data;
        })

Notice that when a request is sent to the web page, it returns a response. This Axios response object is made up of various components, including data, which refers to the payload returned from the server. So, when a GET request is made, we read the data from the response, which is in HTML format.

4. Next, let’s load the response data into a Cheerio instance. This way, we can create a Cheerio object to help us parse the HTML from the target web page and find the DOM elements for the data we want, just like when using jQuery. To uphold the famous jQuery convention, we’ll name the Cheerio object $.

    const $ = cheerio.load(html);

5. Let’s use Cheerio’s selector syntax to find the elements containing the data we want:

    const scrapedata = $('a', '.comment-bubble').text();
    console.log(scrapedata);

Notice that we also used the `text()` method to output the data in a text format.

6. Finally, let’s log any errors experienced during the scraping process.

    .catch(error => {
        console.log(error);
    });

Here is the entire code for the scraping logic:

    const axios = require("axios");
    const cheerio = require("cheerio");

    //performing a GET request
    axios
      .get("https://www.forextradingbig.com/instaforex-broker-review/")
      .then((response) => {
        //handling the success
        const html = response.data;

        //loading response data into a Cheerio instance
        const $ = cheerio.load(html);

        //selecting the elements with the data
        const scrapedata = $("a", ".comment-bubble").text();

        //outputting the scraped data
        console.log(scrapedata);
      })
      //handling error
      .catch((error) => {
        console.log(error);
      });

If we run the above code with the `node index.js` command, it returns the information we wanted to scrape from the target web page. It worked!

Scraping a dynamic website

Now let’s see how you can use Puppeteer to extract data from a dynamic website. For this example, we’ll use the ES2017 async/await syntax to work with promises comfortably.

The async keyword marks a function as one that returns a promise. The await keyword, used inside such a function, makes JavaScript wait until a promise is settled before executing the rest of the function. This syntax will ensure we extract the web page’s content only after it has been successfully loaded.

Our target will be this Reddit page: https://www.reddit.com/r/scraping/, which uses JavaScript for rendering content. We’ll be seeking to extract the headlines and descriptions found on the page.

To find the specific HTML elements that hold the data we are looking for, let’s use the inspector tool in our web browser. The inspector shows that each post is enclosed in an element with the Post class, amongst other things. By examining it closely, we find that each post title is in an h3 tag, and each description is in a p tag. We’ll use this information when selecting these elements on the page.

Here are the steps for creating the scraping logic:

1. Let’s start by creating a file called index.js that will contain the programming logic for retrieving data from the web page.

2. Then, let’s use the `require` function, which is built into Node.js, to import Puppeteer into our project.

    const puppeteer = require('puppeteer');

3. Let’s launch Puppeteer.
We’re actually launching an instance of the Chrome browser to use for accessing the target web page.

    puppeteer.launch()

4. Let’s create a new page in the headless browser. Since we’ve used the await keyword, we’ll wait for the new page to be opened before saving it to the page variable.

After creating the page, we’ll use it for navigating to the Reddit page. Again, since we’ve used await, our code execution will pause until the page is loaded or an error is thrown.

We’ll also wait for the page’s body tag to be loaded before proceeding with the rest of the execution. Here is the code:

    .then(async browser => {
        const page = await browser.newPage();
        await page.goto('https://www.reddit.com/r/scraping/');
        await page.waitForSelector('body');

5. After pulling up the Reddit page in Puppeteer, we can use its evaluate() function to interact with the page.

With this function, we can execute arbitrary JavaScript in Chrome and use its built-in functions, such as `querySelector()`, to manipulate the page and retrieve its contents. Here is the code:

        let grabPosts = await page.evaluate(() => {
            let allPosts = document.body.querySelectorAll('.Post');
            let scrapeItems = [];
            allPosts.forEach(item => {
                let postTitle = item.querySelector('h3').innerText;
                let postDescription = '';
                try {
                    postDescription = item.querySelector('p').innerText;
                } catch (err) {}
                scrapeItems.push({
                    postTitle: postTitle,
                    postDescription: postDescription,
                });
            });
            let items = {
                "redditPosts": scrapeItems,
            };
            return items;
        });
        console.log(grabPosts);

6. Let’s close the browser. This also ends the callback we opened in step 4.

        await browser.close();
    })

7. Finally, let’s log any errors experienced during the scraping process.

    .catch(function (err) {
        console.error(err);
    });

Here is the entire code for the scraping logic:

    const puppeteer = require('puppeteer');

    //initiating Puppeteer
    puppeteer
      .launch()
      .then(async browser => {
        //opening a new page and navigating to Reddit
        const page = await browser.newPage();
        await page.goto('https://www.reddit.com/r/scraping/');
        await page.waitForSelector('body');

        //manipulating the page's content
        let grabPosts = await page.evaluate(() => {
          let allPosts = document.body.querySelectorAll('.Post');

          //storing the post items in an array then selecting for retrieving content
          let scrapeItems = [];
          allPosts.forEach(item => {
            let postTitle = item.querySelector('h3').innerText;
            let postDescription = '';
            try {
              postDescription = item.querySelector('p').innerText;
            } catch (err) {}
            scrapeItems.push({
              postTitle: postTitle,
              postDescription: postDescription,
            });
          });
          let items = {
            "redditPosts": scrapeItems,
          };
          return items;
        });

        //outputting the scraped data
        console.log(grabPosts);

        //closing the browser
        await browser.close();
      })
      //handling any errors
      .catch(function (err) {
        console.error(err);
      });

If we run the above code with the `node index.js` command, it returns the information we wanted to scrape from the target web page. It worked!

If you intend to use the above in production and make thousands of requests to scrape data, you will very likely get banned. In this scenario, rotating your IP addresses after every few requests can help you stay below the target site’s radar and extract content successfully.

Therefore, connecting to a proxy service can help you make the most of your scraping efforts. Importantly, with residential proxies, you can get around scraping bottlenecks and harvest online data quickly and easily.

In Puppeteer, you can connect to a proxy by passing one extra line of arguments when launching it:

    puppeteer.launch({
        args: ['--proxy-server=145.0.10.11:7866']
    });

Conclusion

That’s how you can perform web scraping with JavaScript and Node.js. With such skills, you can harvest useful information from web pages and integrate it into your use case.

Remember that if you want to build something advanced, you can always check the Axios, Cheerio, and Puppeteer documentation to help you get off the ground quickly.

Happy scraping!

Also published on: https://zenscrape.com/web-scraping-with-javascript-and-node-js-tutorial/
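As a follow-up to the proxy advice above, rotating the proxy after every few requests could be sketched roughly as follows. This is an illustration under stated assumptions, not a production setup: `createProxyRotator` and `requestsPerProxy` are hypothetical names, and only the first address comes from this article, while the other two are placeholders from a documentation-only IP range. The function only builds the options object; each result could then be passed to `puppeteer.launch()`.

```javascript
// Hypothetical proxy pool: the first entry is the address used in the article,
// the other two are placeholder addresses, not real proxies.
const proxies = [
  '145.0.10.11:7866',
  '203.0.113.10:8080',
  '203.0.113.11:8080',
];

// Returns a function that hands out Puppeteer launch options and moves on
// to the next proxy in the pool after `requestsPerProxy` calls (round-robin).
function createProxyRotator(pool, requestsPerProxy) {
  let calls = 0;
  return function nextLaunchOptions() {
    const index = Math.floor(calls / requestsPerProxy) % pool.length;
    calls += 1;
    return { args: [`--proxy-server=${pool[index]}`] };
  };
}

// Example: switch proxy after every two requests.
const nextLaunchOptions = createProxyRotator(proxies, 2);
for (let i = 0; i < 5; i++) {
  // In a real scraper this object would be passed to puppeteer.launch(...)
  console.log(nextLaunchOptions().args[0]);
}
```

Keeping the rotation logic separate from the browser launch makes it easy to check on its own before any real requests are sent.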


A Guide to Web Scraping With JavaScript and Node.js
