Web Scraping Using Node.js

Written by codejedi | Published 2022/04/05
Tech Story Tags: nodejs | npm-and-nodejs | javascript | web-scraping | data | node | puppeteer | data-scraping


Web scraping

Web scraping is a way to collect all sorts of publicly available data, such as prices, text, images and contact information, from the World Wide Web. It is useful for gathering data that would take a person a long time to collect and organize manually.

Some of the most useful use cases of web scraping include:

  1. Scraping product prices from e-commerce websites such as Amazon, eBay or Alibaba.
  2. Scraping social media posts, likes, comments, followers or bios.
  3. Scraping contacts from websites like Yellow Pages or LinkedIn.

Puppeteer

While there are a few different libraries for scraping the web with Node.js, in this tutorial I'll be using the puppeteer library.

Puppeteer is a popular and easy-to-use npm package that controls a Chrome or Chromium browser from Node.js, which makes it well suited to web automation and web scraping.

Some of puppeteer's most useful features include:

  1. Being able to extract a scraped element's text content.
  2. Being able to interact with a webpage by filling out forms, clicking on buttons or running searches inside a search bar.
  3. Being able to scrape and download images from the web.
  4. Being able to watch the web scraping in progress by turning headless mode off, so that the browser window is visible.
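
As a rough sketch of that last point: by default the browser runs headless (with no visible window), so watching a scrape means launching it with "headless: false". The launchOptions and openVisibly helpers and the 100 ms delay are assumptions made for this example, while "headless" and "slowMo" are puppeteer launch options.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch()
function launchOptions(visible) {
  return visible
    ? { headless: false, slowMo: 100 } // show the window and slow each action by 100 ms
    : { headless: true }               // no window, the usual mode for scripts
}

// Hypothetical helper: opens a URL in a visible browser window
async function openVisibly(url) {
  const puppeteer = require('puppeteer') // required here so launchOptions() works on its own
  const browser = await puppeteer.launch(launchOptions(true))
  const page = await browser.newPage()
  await page.goto(url)
  return browser // the caller is responsible for calling browser.close()
}
```

Calling openVisibly('https://www.thesaurus.com/browse/smart') should open a browser window that you can watch while the page loads.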

You can read more about puppeteer in its official documentation.

Installation

For this tutorial, I will assume you already have Node.js and npm installed, and a project that contains a package.json file.

If you don't, here's a great guide on how to do so: Setup

To install puppeteer, run one of the following commands in your project's terminal:

npm i puppeteer

Or

yarn add puppeteer

Once puppeteer is installed, it will appear as a directory inside your node_modules.
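
As a quick sanity check, the sketch below asks Node.js whether a package can be resolved from the current project. The isInstalled helper is a hypothetical name introduced for this example and is not part of puppeteer or npm.

```javascript
// Hypothetical helper: returns true when Node.js can resolve the given package name
function isInstalled(packageName) {
  try {
    require.resolve(packageName) // throws if the package cannot be found
    return true
  } catch {
    return false
  }
}

console.log(isInstalled('puppeteer') ? 'puppeteer is installed' : 'puppeteer is missing, run "npm i puppeteer"')
```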

Let's make a simple web scraping script in Node.js

The web scraping script will get the first synonym of "smart" from the web thesaurus by:

  1. Getting the HTML contents of the web thesaurus' webpage.
  2. Finding the element that we want to scrape through its selector.
  3. Displaying the text contents of the scraped element.

Let's get started!

Before scraping the element and extracting its text through its selector in Node.js, we need to set up a few things:

Create or open an empty JavaScript file. You can name it whatever you want, but I'll name mine "index.js" for this tutorial. Then, require puppeteer on the first line and create the async function that will hold our web scraping code:

index.js

const puppeteer = require('puppeteer')

async function scrape() {
}
scrape()

Next, launch a new browser instance and define the "page" variable, which is used to navigate to webpages and to scrape elements from a webpage's HTML contents:

index.js

const puppeteer = require('puppeteer')

async function scrape() {
   const browser = await puppeteer.launch({})
   const page = await browser.newPage()
}
scrape()

Scraping the first synonym of "smart"

To locate and copy the selector of the first synonym of "smart", which is what we're going to use to find the synonym inside the web thesaurus' webpage, first go to the web thesaurus' synonyms of "smart", right-click on the first synonym and click on "Inspect". This will make the webpage's DOM pop up at the right of your screen.

Next, right-click on the highlighted HTML element containing the first synonym and click on "Copy selector".

Finally, to navigate to the web thesaurus, scrape and display the first synonym of "smart" through the selector we copied earlier:

  1. First, make the "page" variable navigate to https://www.thesaurus.com/browse/smart inside the newly created browser instance.
  2. Next, we define the "element" variable by making the page wait for our desired element's selector to appear in the webpage's DOM.
  3. The text content of the element is then extracted using the evaluate() function, stored in the "text" variable and printed to the console.
  4. Finally, we close the browser instance.

index.js

const puppeteer = require('puppeteer')

async function scrape() {
   const browser = await puppeteer.launch({})
   const page = await browser.newPage()
   
   await page.goto('https://www.thesaurus.com/browse/smart')
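   // waitForSelector resolves once the element exists in the DOM
   // (page.waitFor() was deprecated and has been removed from recent Puppeteer releases)
   const element = await page.waitForSelector("#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(1) > a")
   // Run the callback inside the page to read the element's text
   const text = await page.evaluate(element => element.textContent, element)
   console.log(text)
   await browser.close()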
}
scrape()
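
If page.goto() or the wait for the selector throws, for example on a timeout, the script above never reaches browser.close(). Below is a minimal sketch of one way to guard against that. The withPage and firstSynonym helpers are hypothetical names introduced for this example, not part of puppeteer.

```javascript
// Hypothetical helper: runs task(page) in a fresh browser and always closes the browser.
// `launch` is any function that resolves to a browser, e.g. () => require('puppeteer').launch()
async function withPage(launch, task) {
  const browser = await launch()
  try {
    const page = await browser.newPage()
    return await task(page)
  } finally {
    await browser.close() // runs even when task() throws
  }
}

// Same steps as the script above, written as a task for withPage
async function firstSynonym(page) {
  await page.goto('https://www.thesaurus.com/browse/smart')
  const element = await page.waitForSelector('#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(1) > a')
  return page.evaluate(el => el.textContent, element)
}
```

It could be called as withPage(() => require('puppeteer').launch(), firstSynonym).then(console.log). Passing the launcher in as a function is a design choice for this sketch, so the helper can also be tried with a stand-in browser object.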

Time to test

Now, if you run your index.js script using "node index.js", you will see that it prints the first synonym of the word "smart".

Scraping the top 5 synonyms of smart

We can adapt the same code to scrape the top 5 synonyms of "smart" instead of just one:

index.js

const puppeteer = require('puppeteer')

async function scrape() {
   const browser = await puppeteer.launch({})
   const page = await browser.newPage()
   
   await page.goto('https://www.thesaurus.com/browse/smart')
   for (let i = 1; i < 6; i++) {
    // Wait for the i-th synonym link, then read its text inside the page
    const element = await page.waitForSelector("#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(" + i + ") > a")
    const text = await page.evaluate(element => element.textContent, element)
    console.log(text)
   }
   await browser.close()
}
scrape()

The selector used to find the element will be "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(1) > a" on the first iteration, "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(2) > a" on the second, and so on until the last iteration, where it will be "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(5) > a".

As you can see, the only thing that changes in the selector throughout the iterations is the "li:nth-child()" value.

This is because, in our case, the elements that we are trying to scrape are all "li" elements inside a "ul" element, so we can easily scrape them in order by increasing the value inside "li:nth-child()":

  1. li:nth-child(1) for the first synonym.
  2. li:nth-child(2) for the second synonym.
  3. li:nth-child(3) for the third synonym.
  4. li:nth-child(4) for the fourth synonym.
  5. And li:nth-child(5) for the fifth synonym.
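
The pattern above can be sketched as a small function that builds the selector for the n-th synonym. As an alternative to looping, the same sketch also reads every link in the list with a single page.$$eval() call, which is a different technique from the one used in this tutorial. The LIST constant and both function names are assumptions made for this example.

```javascript
// The "ul" element that holds the synonyms (selector copied from the tutorial)
const LIST = '#meanings > div.css-ixatld.e15rdun50 > ul'

// Builds the selector for the n-th synonym, counting from 1
function synonymSelector(n) {
  return `${LIST} > li:nth-child(${n}) > a`
}

// Alternative sketch: collect the first `count` synonyms without a loop.
// `page` is a puppeteer Page that has already navigated to the thesaurus page.
async function topSynonyms(page, count) {
  const links = `${LIST} > li > a`
  await page.waitForSelector(links)
  // $$eval runs the callback in the page with every matching element
  const texts = await page.$$eval(links, anchors => anchors.map(a => a.textContent.trim()))
  return texts.slice(0, count)
}
```

One difference to keep in mind: $$eval only returns "li" items that contain a link, so its results could differ from the nth-child loop if some list items have no "a" element.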

Final notes

Web scraping has many advantages, such as:

  1. Saving time on manually collecting data.
  2. Being able to programmatically aggregate pieces of data scraped from the web.
  3. Creating a dataset of data that might be useful for machine learning, data visualization or data analytics purposes.

It also has two disadvantages:

  1. Some websites don't allow their data to be scraped; one popular example is Craigslist.
  2. Some people consider it a gray area, since some uses of web scraping involve collecting and storing data about users or other entities.

Hopefully, this article gave you some insight into web scraping in Node.js, its practical applications, its pros and cons, and how to extract specific elements and their text contents from webpages using the puppeteer library.




Written by codejedi | Python, Machine Learning, Web-Scraping, Web-Automation and more...
Published by HackerNoon on 2022/04/05