How do you increase productivity when you want to multitask and achieve a lot in very little time? Most of us have been asked by a team lead to gather (scrape) information from the internet, especially on teams that rely on many manual processes. Doing this by hand is error-prone, and it is easy to miss specific information on a website.
This tutorial demonstrates how to automate scraping data from a website so that you can use it for whatever purpose you need.
You can find the source code of the completed project in the CodeSandbox below. Fork it, tweak the scripts, and run the code.
<CodeSandbox title="scrape the web" id="web-scraper-nxmv8" />
A basic understanding of JavaScript is necessary to complete this project. The project is built with Node.js and Express.
To follow along, we also need the following:
Node.js and npm installed on our computer. We use npm, a package manager, to install the dependencies for our program.
A code editor of our choice.
npm is installed along with Node.js when you download it from the official Node.js website.
Initialize a Node.js project with the following command.
npm init -y
The above command initializes our project by creating a package.json file in the root of the folder. The -y flag tells npm to accept the defaults. We will install the express package from the npm registry to help us write the scripts that run the server.
After the initialization, we need to install the dependencies express, cheerio, and axios.
npm install express cheerio axios
express, a fast and flexible Node.js web framework.
cheerio, a package that parses markup and provides an API for traversing and manipulating the resulting data structure. Cheerio implements a subset of core jQuery, so its API will feel familiar.
axios, a promise-based HTTP client for the browser and Node.js.
In our app.js file, we use the following code to import Express.js, create an instance of the Express application, and finally start the app as an Express server.
const express = require('express');
const app = express();
// Use the port provided by the environment, or fall back to 3000
const PORT = process.env.PORT || 3000;

app.listen(PORT, () => {
  console.log(`server is running on PORT:${PORT}`);
});
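The port fallback in the snippet above can be checked in isolation. The sketch below is an illustration rather than part of the project: resolvePort is a hypothetical helper (not an Express API) that takes an environment object and returns the port to listen on. It assumes the hosting environment supplies the port in a variable named PORT, which is a common convention. Environment variable names are case-sensitive on Linux and macOS, so PORT and port are different variables there.

```javascript
// Hypothetical helper for illustration; not part of Express or this project.
// Returns the port from an environment object, or a fallback when it is unset.
function resolvePort(env, fallback = 3000) {
  const parsed = Number.parseInt(env.PORT, 10);
  // parseInt yields NaN for undefined or non-numeric values
  return Number.isNaN(parsed) ? fallback : parsed;
}

console.log(resolvePort({ PORT: '8080' })); // uses the provided value
console.log(resolvePort({}));               // falls back to 3000
console.log(resolvePort({ port: '8080' })); // lowercase key is ignored here
```

In app.js, a helper like this could be called as resolvePort(process.env) before passing the result to app.listen.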
Before starting our application in the command line, we need to install nodemon as a development dependency.
npm install nodemon --save-dev
Nodemon is a tool that watches our files during the development of Node.js apps and restarts the app when they change. We will configure the package.json file to use nodemon, which allows us to run our app without manually restarting the server.
{
  "scripts": {
    "start": "nodemon app.js"
  },
  "devDependencies": {
    "nodemon": "^2.0.15"
  }
}
Now start the app in the command line with npm start, which should output the following.
server is running on PORT:3000
Express.js is suitable for routing, as we will see later in the tutorial.
With the server setup complete, we will implement the web scraper that helps boost your productivity and efficiency at work.
Now in the same file, app.js, we will import the axios package to send an HTTP GET request to the website and retrieve its HTML.
const express = require('express');
const axios = require('axios');

const app = express();
const PORT = process.env.PORT || 3000;
const website = 'https://news.sky.com';

try {
  // Request the page and log its raw HTML
  axios(website)
    .then((response) => {
      const html = response.data;
      console.log(html);
    })
    // A failed request rejects the promise, which try...catch cannot see
    .catch((error) => console.log(error.message));
} catch (error) {
  console.log(error, error.message);
}

app.listen(PORT, () => {
  console.log(`server is running on PORT:${PORT}`);
});
In the code snippet above, axios takes the URL of the website and returns a promise. Once the promise resolves, the chained then callback logs the HTML of the news website to the command line.
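Because axios returns a promise, a failed request surfaces as a rejected promise rather than a synchronously thrown exception. One way to handle both outcomes in one place is async/await inside try...catch. The following is a minimal sketch under stated assumptions: fetchHtml is a hypothetical helper, and request stands for any axios-like function that returns a promise resolving to an object with a data property. Stubs are used so the example runs without network access.

```javascript
// Hypothetical helper for illustration. `request` is assumed to behave like
// axios: it takes a URL and returns a promise resolving to { data: ... }.
async function fetchHtml(request, url) {
  try {
    // `await` turns a rejected promise into an exception the catch block sees
    const response = await request(url);
    return response.data;
  } catch (error) {
    console.log(`request to ${url} failed: ${error.message}`);
    return null;
  }
}

// Stubs standing in for axios so the sketch runs offline
const okRequest = async () => ({ data: '<h1>Hello</h1>' });
const failingRequest = async () => {
  throw new Error('network unreachable');
};

fetchHtml(okRequest, 'https://example.com').then((html) => console.log(html));
fetchHtml(failingRequest, 'https://example.com').then((html) => console.log(html));
```

In the project, axios itself could be passed as the request argument, for example fetchHtml(axios, website).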
To scrape data from the news website, update our app.js file with the following. The cheerio package makes this possible.
const express = require('express');
const axios = require('axios');
const cheerio = require('cheerio');

const app = express();
const PORT = process.env.PORT || 3000;
const website = 'https://news.sky.com';

// Holds the scraped headlines served by the route below
const content = [];

try {
  axios(website)
    .then((response) => {
      const data = response.data;
      const $ = cheerio.load(data);

      // Collect the title and link of every headline on the page
      $('.sdc-site-tile__headline').each(function () {
        const title = $(this).text();
        const url = $(this).find('a').attr('href');
        content.push({
          title,
          url,
        });
      });
    })
    // A failed request rejects the promise, which try...catch cannot see
    .catch((error) => console.log(error.message));
} catch (error) {
  console.log(error, error.message);
}

// Register the route once, outside the loop
app.get('/', (req, res) => {
  res.json(content);
});

app.listen(PORT, () => {
  console.log(`server is running on PORT:${PORT}`);
});
Let's go through the code above.
The cheerio package enables us to read and traverse the elements on the page. We will target only the specific elements we need to scrape.
To parse the HTML, we use cheerio.load(data), which parses all the HTML on the page, and save the result in a variable, const $.
To find the specific elements on the website that hold a title, we inspect the page and copy the class name of the h3 tag.
For each headline, we grab the text using text() and the link to the headline through the href attribute.
Now, to serve all our scraped data as JSON, we create an empty array in a variable, content. We push each saved title and url into this array as an object using the push method, and return the scraped data to the client with the GET method, app.get, at the endpoint /.
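Finally, we wrap the block of code in a try...catch statement. The catch block executes if an exception is thrown inside the try block. Note that a rejected promise from axios is not an exception thrown inside the try block, so it needs its own catch() handler on the promise chain.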
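The href values collected in the walkthrough above are stored exactly as they appear in the markup. If the site uses relative links (an assumption; inspect the page to confirm), they can be converted to absolute URLs with Node's built-in URL class. A minimal sketch, where toAbsoluteUrl is a hypothetical helper and the example paths are made up for illustration:

```javascript
// Base URL taken from the tutorial; the paths below are made-up examples.
const website = 'https://news.sky.com';

// Hypothetical helper: resolves an href against the site's base URL.
// Returns null when the element had no href attribute.
function toAbsoluteUrl(href, base = website) {
  if (!href) return null;
  return new URL(href, base).href;
}

console.log(toAbsoluteUrl('/story/example-headline-123'));
// → https://news.sky.com/story/example-headline-123
console.log(toAbsoluteUrl('https://example.com/other'));
// → https://example.com/other
console.log(toAbsoluteUrl(undefined));
// → null
```

Inside the each callback, the url value could be passed through such a helper before it is pushed into the content array.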
With the scraping process complete, we now have the scraped data in JSON format.
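Scraped text often carries extra whitespace, and the same link can appear more than once on a page. Before serving or reusing the JSON, the array can be tidied. The sketch below assumes each item has the { title, url } shape used in this tutorial; cleanHeadlines is a hypothetical helper and the sample data is invented for illustration.

```javascript
// Hypothetical helper: trims titles, drops entries without a URL,
// and keeps only the first entry for each URL.
function cleanHeadlines(items) {
  const seen = new Set();
  const cleaned = [];
  for (const { title, url } of items) {
    if (!url || seen.has(url)) continue;
    seen.add(url);
    // Collapse runs of whitespace (including newlines) into single spaces
    cleaned.push({ title: String(title).replace(/\s+/g, ' ').trim(), url });
  }
  return cleaned;
}

// Invented sample data in the same shape as the tutorial's `content` array
const sample = [
  { title: '\n  First headline \n', url: '/story/first' },
  { title: 'First headline', url: '/story/first' },
  { title: 'No link here', url: undefined },
  { title: 'Second   headline', url: '/story/second' },
];

console.log(JSON.stringify(cleanHeadlines(sample), null, 2));
```

Whether duplicates should be dropped by URL or by title is a design choice; URL is used here because two different stories could share similar wording.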
Now that you've seen how to create a web scraper with Node.js and the Express.js framework, you can try this with any website of your choice and save time while gathering accurate data.
This post explored how to scrape a website, using a method you can repeat with as many website URLs as you need.
Clone and fork the completed source code here.
To experiment with what we built, you can fetch the data from the server and use it in your frontend application.
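One way to consume the endpoint is the Fetch API, which is available in browsers and as a global in Node.js 18 and later. The sketch below is illustrative only: loadHeadlines is a hypothetical helper, http://localhost:3000 assumes the server is running locally on the default port from this tutorial, and fetchFn is injectable so the function can be tried with a stub.

```javascript
// Hypothetical helper: requests the JSON served at "/" and returns the array.
// `fetchFn` defaults to the global fetch (browsers, Node.js 18+).
async function loadHeadlines(baseUrl = 'http://localhost:3000', fetchFn = fetch) {
  const response = await fetchFn(`${baseUrl}/`);
  if (!response.ok) {
    throw new Error(`request failed with status ${response.status}`);
  }
  return response.json();
}

// Stub standing in for fetch so the sketch runs without the server
const stubFetch = async () => ({
  ok: true,
  status: 200,
  json: async () => [{ title: 'Example headline', url: '/story/example' }],
});

loadHeadlines('http://localhost:3000', stubFetch).then((headlines) => {
  headlines.forEach(({ title, url }) => console.log(`${title} -> ${url}`));
});
```

With the server running, calling loadHeadlines() without arguments would request the real endpoint instead of the stub.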