Web scraping is a way to collect publicly available data such as prices, text, images, and contact information from the World Wide Web. It is useful for gathering data that would take a person a long time to collect and organize manually.
Some of the most useful use cases of web scraping include:
While there are a few different libraries for scraping the web with Node.js, in this tutorial I'll be using the puppeteer library.
Puppeteer is a popular and easy-to-use npm package used for web automation and web scraping purposes.
Some of puppeteer's most useful features include:
You can read more about puppeteer here
For this tutorial, I'll assume you already have Node.js and npm installed, and that your project already has a package.json and package-lock.json file.
If you don't, here's a great guide on how to do so: Setup
To install puppeteer, run one of the following commands in your project's terminal:
npm i puppeteer
Or
yarn add puppeteer
Once puppeteer is installed, it will appear as a directory inside your node_modules.
The web scraping script will get the first synonym of "smart" from the web thesaurus by:
Getting the HTML contents of the web thesaurus' webpage.
Finding the element that we want to scrape through its selector.
Displaying the text contents of the scraped element.
Before scraping this element and extracting its text through its selector in Node.js, we need to set up a few things:
Create or open an empty JavaScript file. You can name it whatever you want, but I'll name mine "index.js" for this tutorial. Then require puppeteer on the first line and create the async function in which we will write our web scraping code:
index.js
const puppeteer = require('puppeteer')
async function scrape() {
}
scrape()
Next, launch a new browser instance and define the "page" variable, which will be used to navigate to webpages and scrape elements from their HTML contents:
index.js
const puppeteer = require('puppeteer')
async function scrape() {
const browser = await puppeteer.launch({})
const page = await browser.newPage()
}
scrape()
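If the scraping code throws before it reaches the end of the function, the launched browser can be left running in the background. The following is a minimal sketch of one way to guard against that, not part of the original script: `withPage` is a hypothetical helper name, and `launcher` is assumed to be the puppeteer module (or anything else with a compatible `launch()` method).

```javascript
// Hypothetical helper: opens a browser and a page, runs task(page),
// and closes the browser whether the task succeeds or throws.
// `launcher` is assumed to be the puppeteer module.
async function withPage(launcher, task) {
  const browser = await launcher.launch()
  try {
    const page = await browser.newPage()
    return await task(page)
  } finally {
    // The finally block runs on success and on error,
    // so no browser process is left behind.
    await browser.close()
  }
}
```

With the real library, this could be called as `withPage(require('puppeteer'), async page => { await page.goto('https://www.thesaurus.com/browse/smart') })`.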
To locate and copy the selector of the first synonym of "smart", which we will use to find the synonym inside the web thesaurus' webpage, first go to the web thesaurus' synonyms of "smart", right-click the first synonym, and click "Inspect". This will make the webpage's DOM pop up at the right of your screen:
Next, right-click on the highlighted HTML element containing the first synonym and click on "copy selector":
Finally, navigate to the web thesaurus, then scrape and display the first synonym of "smart" using the selector we copied earlier:
index.js
const puppeteer = require('puppeteer')
async function scrape() {
const browser = await puppeteer.launch({})
const page = await browser.newPage()
await page.goto('https://www.thesaurus.com/browse/smart')
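// waitForSelector replaces page.waitFor, which newer puppeteer versions no longer provide
const element = await page.waitForSelector("#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(1) > a")
// Read the element's text inside the page and return it to Node.js
const text = await page.evaluate(element => element.textContent, element)
console.log(text)
await browser.close()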
}
scrape()
Now if you run your index.js script using "node index.js", you will see that it has displayed the first synonym of the word "smart":
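If the page layout changes or the selector no longer matches, waiting for the element will eventually time out and throw. The sketch below shows one possible way to handle that case. `textOrNull` is a hypothetical helper name, `page` is assumed to be a puppeteer page, and the 5-second default is an arbitrary choice for illustration (puppeteer's own default timeout is 30 seconds).

```javascript
// Hypothetical helper: waits up to timeoutMs for selector to appear,
// then returns the element's text. Any failure (for example a timeout)
// is logged and reported as null instead of crashing the script.
async function textOrNull(page, selector, timeoutMs = 5000) {
  try {
    const element = await page.waitForSelector(selector, { timeout: timeoutMs })
    return await page.evaluate(el => el.textContent, element)
  } catch (err) {
    console.error(`Could not read "${selector}": ${err.message}`)
    return null
  }
}
```

Inside the scrape() function, this could replace the wait and evaluate steps with a single call such as `const text = await textOrNull(page, selector)`, followed by a check for null before printing.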
Scraping the top 5 synonyms of smart
We can adapt the same code to scrape the top 5 synonyms of "smart" instead of just the first:
index.js
const puppeteer = require('puppeteer')
async function scrape() {
const browser = await puppeteer.launch({})
const page = await browser.newPage()
await page.goto('https://www.thesaurus.com/browse/smart')
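// Declare i with let so it does not become an implicit global
for (let i = 1; i < 6; i++) {
// waitForSelector replaces page.waitFor, which newer puppeteer versions no longer provide
const element = await page.waitForSelector("#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(" + i + ") > a")
const text = await page.evaluate(element => element.textContent, element)
console.log(text)
}
await browser.close()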
}
scrape()
The selector used to find the element will be "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(1) > a" on the first iteration, "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(2) > a" on the second, and so on until the last iteration, where it will be "#meanings > div.css-ixatld.e15rdun50 > ul > li:nth-child(5) > a".
As you can see, the only thing that changes in the selector between iterations is the "li:nth-child()" value.
This is because in our case, the elements that we are trying to scrape are all "li" elements inside the same "ul" element,
so we can easily scrape them in order by increasing the value inside "li:nth-child()".
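Because all of the synonyms sit in the same list, a possible alternative to the loop is to select every matching link at once and keep only the first few. The sketch below uses puppeteer's `page.$$eval`, which runs a function in the page against all elements matching a selector. `topSynonyms` and `LIST_SELECTOR` are hypothetical names, and the selector is the one copied earlier with the ":nth-child()" part removed, so it assumes the page still uses the same markup and class names.

```javascript
// Assumption: the copied selector without ":nth-child(n)",
// so it matches every synonym link in the list.
const LIST_SELECTOR = '#meanings > div.css-ixatld.e15rdun50 > ul > li > a'

// Hypothetical helper: returns the text of the first `count` synonym links.
// `page` is assumed to be a puppeteer page already on the thesaurus URL.
async function topSynonyms(page, count = 5) {
  // Wait until at least one matching link exists
  await page.waitForSelector(LIST_SELECTOR)
  // $$eval passes an array of the matched elements to the callback;
  // extra arguments (here `count`) are forwarded to it as well.
  return page.$$eval(
    LIST_SELECTOR,
    (links, limit) => links.slice(0, limit).map(link => link.textContent.trim()),
    count
  )
}
```

Inside scrape(), the loop could then be replaced with something like `console.log(await topSynonyms(page))`, which makes one round trip to the page instead of one per synonym.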
While web scraping has many advantages like:
It also has 2 disadvantages:
Hopefully, this article gave you some insight into web scraping in Node.js, its practical applications, its pros and cons, and how to extract specific elements and their text contents from webpages using the puppeteer library.