Today the NodeJS ecosystem offers libraries for almost any routine task. Web scraping has a low barrier to entry, which attracts both freelancers and development teams. Not surprisingly, the NodeJS library ecosystem already contains everything needed for scraping.
This article covers the core structure of a working NodeJS scraping application, along with an example that collects data from a couple of hundred pages of https://outlet.scotch-soda.com, a site selling branded clothing and accessories. The code sample is close to real scraping applications, one of which was used in the Yelp scraping article.
However, due to natural limitations, several production components were removed from the example: a database, containerization, proxy rotation, and parallel process management with a manager such as pm2. The article also does not dwell on obvious things like linting.
Still, the basic project structure is covered: the most popular libraries (Axios, Cheerio, Lodash) are used, authorization keys are extracted with Puppeteer, and the data is scraped and written to a file using NodeJS streams.
The following terms are used in the article: the NodeJS application is the server application, the website outlet.scotch-soda.com is the web resource, and the website's server is the web server. In other words, the first step is to explore the web resource and its pages in Chrome or Firefox; then a server application is written to send HTTP requests to the web server; finally, a response with the requested data is received.
The content of outlet.scotch-soda.com is available only to authorized users. In this example, authorization happens through a Chromium browser controlled by the server application, and the cookies are obtained from it. These cookies are then included in the HTTP headers of every request to the web server, giving the application access to the content. When scraping large resources with tens or hundreds of thousands of pages, the cookies have to be refreshed several times.
The application will have the following structure:
cookie-manager.js: file with the CookieManager class, whose methods do the main work of obtaining the cookie;
cookie-storage.js: file with the cookie storage variable;
index.js: the CookieManager call point;
.env: environment variables file.
/project_root
|__ /src
|   |__ /helpers
|       |__ **cookie-manager.js**
|       |__ **cookie-storage.js**
|__ **.env**
|__ **index.js**
Primary directory and file structure
Add the following code to the application:
// index.js
// including environment variables in .env
require('dotenv').config();

const cookieManager = require('./src/helpers/cookie-manager');
const { setLocalCookie } = require('./src/helpers/cookie-storage');

// IIFE - application entry point
(async () => {
  // CookieManager call point
  // login/password values are stored in the .env file
  const cookie = await cookieManager.fetchCookie(
    process.env.LOGIN,
    process.env.PASSWORD,
  );

  if (cookie) {
    // if the cookie was received, assign it as the value of a storage variable
    setLocalCookie(cookie);
  } else {
    console.log('Warning! Could not fetch the Cookie after 3 attempts. Aborting the process...');
    // close the application with an error if it is impossible to receive the cookie
    process.exit(1);
  }
})();
and in cookie-manager.js:
// cookie-manager.js
// 'user-agents' generates 'User-Agent' values for HTTP headers
// 'puppeteer-extra' - wrapper for the 'puppeteer' library
const _ = require('lodash');
const UserAgent = require('user-agents');
const puppeteerXtra = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// hide from the web server that this is a bot
puppeteerXtra.use(StealthPlugin());

class CookieManager {
  // this.browser & this.page - Chromium window and page instances
  constructor() {
    this.browser = null;
    this.page = null;
    this.cookie = null;
  }

  // getter
  getCookie() {
    return this.cookie;
  }

  // setter
  setCookie(cookie) {
    this.cookie = cookie;
  }

  async fetchCookie(username, password) {
    // give 3 attempts to authorize and receive cookies
    const attemptCount = 3;

    try {
      // instantiate a Chromium window
      this.browser = await puppeteerXtra.launch({
        args: ['--window-size=1920,1080'],
        headless: process.env.NODE_ENV === 'PROD',
      });

      // Chromium opens a blank page and sets the 'User-Agent' header
      this.page = await this.browser.newPage();
      await this.page.setUserAgent((new UserAgent()).toString());

      for (let i = 0; i < attemptCount; i += 1) {
        // Chromium asks the web server for the authorization page
        // and waits for the DOM
        await this.page.goto(process.env.LOGIN_PAGE, { waitUntil: ['domcontentloaded'] });

        // Chromium waits for and presses the country selection confirmation button,
        // then sleeps for 1 second: page.waitForTimeout(1000)
        await this.page.waitForSelector('#changeRegionAndLanguageBtn', { timeout: 5000 });
        await this.page.click('#changeRegionAndLanguageBtn');
        await this.page.waitForTimeout(1000);

        // Chromium waits for the username/password entry block
        await this.page.waitForSelector('div.login-box-content', { timeout: 5000 });
        await this.page.waitForTimeout(1000);

        // Chromium enters the username/password and clicks the 'Log in' button
        await this.page.type('input.email-input', username);
        await this.page.waitForTimeout(1000);
        await this.page.type('input.password-input', password);
        await this.page.waitForTimeout(1000);
        await this.page.click('button[value="Log in"]');
        await this.page.waitForTimeout(3000);

        // Chromium waits for the target content to load ('div.main' selector)
        await this.page.waitForSelector('div.main', { timeout: 5000 });

        // get the cookies and glue them into a string of the form <key>=<value> [; <key>=<value>]
        this.setCookie(
          _.join(
            _.map(
              await this.page.cookies(),
              ({ name, value }) => _.join([name, value], '='),
            ),
            '; ',
          ),
        );

        // when the cookie has been received, break the loop
        if (this.cookie) break;
      }

      // return the cookie to the call point (in index.js)
      return this.getCookie();
    } catch (err) {
      throw new Error(err);
    } finally {
      // close page and browser instances
      this.page && await this.page.close();
      this.browser && await this.browser.close();
    }
  }
}

// export a singleton
module.exports = new CookieManager();
The values of some variables refer to the .env file.
// .env
NODE_ENV=DEV
LOGIN_PAGE=https://outlet.scotch-soda.com/de/en/login
[email protected]
PASSWORD=i*m_on@kde
For example, the headless configuration attribute passed to the puppeteerXtra.launch method resolves to a boolean that depends on the process.env.NODE_ENV variable. In development the variable is set to DEV, so headless resolves to false, and Puppeteer renders Chromium's execution on the monitor.
The page.cookies method returns an array of objects, each of which describes one cookie through two properties: name and value. Using a few Lodash functions, the loop extracts the key-value pair of each cookie and glues them into a single string of the form <key>=<value>; <key>=<value>.
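The same transformation can be sketched with plain Array methods instead of Lodash; the cookie names and values below are hypothetical:

```javascript
// a hypothetical result of page.cookies(): an array of { name, value } objects
const cookies = [
  { name: 'sessionid', value: 'abc123' },
  { name: 'csrftoken', value: 'xyz789' },
];

// glue each pair into '<key>=<value>' and join the pairs with '; '
const cookieHeader = cookies
  .map(({ name, value }) => `${name}=${value}`)
  .join('; ');

console.log(cookieHeader); // sessionid=abc123; csrftoken=xyz789
```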
File cookie-storage.js:
// cookie-storage.js
// cookie storage variable
let localProcessedCookie = null;

// getter
const getLocalCookie = () => localProcessedCookie;

// setter
const setLocalCookie = (cookie) => {
  localProcessedCookie = cookie;

  // lock the getLocalCookie function;
  // its scope with the localProcessedCookie value will be preserved
  // after the setLocalCookie function completes
  return getLocalCookie;
};

module.exports = {
  setLocalCookie,
  getLocalCookie,
};
The idea of closures is to keep access to the value of a variable after the function in whose scope it was defined has finished. As a rule, when a function returns, it leaves the call stack and the garbage collector removes all the variables in its scope from memory.
In the example above, the value of the localProcessedCookie variable with the retrieved cookie remains in memory after the setLocalCookie setter completes. This makes it possible to get this value anywhere in the code for as long as the application is running.
To achieve this, setLocalCookie returns the getLocalCookie function. When the setLocalCookie scope is about to be destroyed, NodeJS sees that the getLocalCookie closure is still reachable, so the garbage collector leaves all the variables from the returned getter's scope untouched in memory. Since localProcessedCookie is in the scope of getLocalCookie, it continues to live, keeping its binding to the cookie.
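The same mechanics can be seen in a minimal, self-contained sketch (the counter is a made-up example, not part of the application):

```javascript
// the variable lives in the factory function's scope
const makeCounter = () => {
  let count = 0;

  // the returned function closes over 'count', so the garbage collector
  // keeps it alive after makeCounter returns
  return () => {
    count += 1;
    return count;
  };
};

const counter = makeCounter();
console.log(counter()); // 1
console.log(counter()); // 2 - 'count' survived between calls
```

Exactly as with getLocalCookie, the returned function keeps the enclosing variable alive for the lifetime of the application.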
The application needs a primary list of URLs to start crawling. In production, as a rule, crawling starts from the main page of a web resource and, over a certain number of iterations, a collection of links to landing pages is built. Often there are tens and hundreds of thousands of such links for one web resource.
In this example, the crawler will be passed only 8 crawl links as input. Links lead to pages with catalogs of the main product groups. Here they are:
https://outlet.scotch-soda.com/women/clothing
https://outlet.scotch-soda.com/women/footwear
https://outlet.scotch-soda.com/women/accessories/all-womens-accessories
https://outlet.scotch-soda.com/men/clothing
https://outlet.scotch-soda.com/men/footwear
https://outlet.scotch-soda.com/men/accessories/all-mens-accessories
https://outlet.scotch-soda.com/kids/girls/clothing/all-girls-clothing
https://outlet.scotch-soda.com/kids/boys/clothing/all-boys-clothing
To avoid cluttering the application code with such long link strings, let's create a compact URL builder from the following files:
categories.js: file with route parameters;
target-builder.js: file that will build a collection of URLs.
/project_root
|__ /src
|   |__ /constants
|   |   |__ **categories.js**
|   |__ /helpers
|       |__ cookie-manager.js
|       |__ cookie-storage.js
|       |__ **target-builder.js**
|__ **.env**
|__ index.js
Add the following code:
// .env
MAIN_PAGE=https://outlet.scotch-soda.com
// index.js
// import the builder function
const getTargetUrls = require('./src/helpers/target-builder');

(async () => {
  // here goes the process of getting the cookie

  // get the array of URL links and determine its length L
  const targetUrls = getTargetUrls();
  const { length: L } = targetUrls;
})();
// categories.js
module.exports = [
  'women/clothing',
  'women/footwear',
  'women/accessories/all-womens-accessories',
  'men/clothing',
  'men/footwear',
  'men/accessories/all-mens-accessories',
  'kids/girls/clothing/all-girls-clothing',
  'kids/boys/clothing/all-boys-clothing',
];
// target-builder.js
const path = require('path');
const categories = require('../constants/categories');

// static fragment of the route parameters
const staticPath = 'global/en';

// create a URL object from the main page address
const url = new URL(process.env.MAIN_PAGE);

// add the full string of route parameters to the URL object
// and return the full URL string
const addPath = (dynamicPath) => {
  url.pathname = path.join(staticPath, dynamicPath);
  return url.href;
};

// build a URL link from each element of the categories array
module.exports = () => categories.map((category) => addPath(category));
These three snippets create the 8 links given at the beginning of this section and demonstrate the built-in URL and Path modules. One may wonder whether this is cracking a nut with a sledgehammer: wouldn't plain string interpolation be easier?
The canonical NodeJS methods are used to work with route parameters and request URLs for a reason: the URL class validates the address and takes care of assembling and encoding its parts, while path.join normalizes separators and duplicate slashes.
Add two files to the logical center of the server application:
/project_root
|__ /src
|   |__ /constants
|   |   |__ categories.js
|   |__ /helpers
|   |   |__ cookie-manager.js
|   |   |__ cookie-storage.js
|   |   |__ target-builder.js
|   |__ **crawler.js**
|   |__ **parser.js**
|__ .env
|__ **index.js**
First of all, add a loop to index.js that passes the URL links to the Crawler in turn and receives the parsed data:
// index.js
const crawler = new Crawler();

(async () => {
  // the process of getting the cookie
  // and the url-links array...
  const { length: L } = targetUrls;

  // run a loop through the array of url links
  for (let i = 0; i < L; i += 1) {
    // call the crawler's run method for each link
    // and get back the parsed data
    const result = await crawler.run(targetUrls[i]);
    // do smth with parsed data...
  }
})();
Crawler code:
// crawler.js
require('dotenv').config();
const cheerio = require('cheerio');
const axios = require('axios').default;
const UserAgent = require('user-agents');
const Parser = require('./parser');
// getLocalCookie - closure function, returns localProcessedCookie
const { getLocalCookie } = require('./helpers/cookie-storage');

module.exports = class Crawler {
  constructor() {
    // create a class variable bound to a newly created Axios instance
    // with the necessary headers
    this.axios = axios.create({
      headers: {
        cookie: getLocalCookie(),
        'user-agent': (new UserAgent()).toString(),
      },
    });
  }

  async run(url) {
    console.log('IScraper: working on %s', url);

    try {
      // make an HTTP request to the web server
      const { data } = await this.axios.get(url);
      // create a cheerio object with nodes from the HTML markup
      const $ = cheerio.load(data);

      // if the cheerio object contains nodes, run the Parser
      // and return the parsing result to index.js
      if ($.length) {
        const p = new Parser($);
        return p.parse();
      }

      console.log('IScraper: could not fetch or handle the page content from %s', url);
      return null;
    } catch (e) {
      console.log('IScraper: could not fetch the page content from %s', url);
      return null;
    }
  }
};
The parser's task is to select the data from the received cheerio object and build the following structure for each URL link:
[
  {
    "Title": "Graphic relaxed-fit T-shirt | Women",
    "CurrentPrice": 25.96,
    "Currency": "€",
    "isNew": false
  },
  {
    // 36 such elements in total for every url-link
  }
]
Parser code:
// parser.js
require('dotenv').config();
const _ = require('lodash');

module.exports = class Parser {
  constructor(content) {
    // this.$ - the cheerio object parsed from the markup
    this.$ = content;
    this.$$ = null;
  }

  // The crawler calls the parse method, which
  // extracts all 'li' elements from the content block;
  // the target data is selected in the map loop
  parse() {
    return this.$('#js-search-result-items')
      .children('li')
      .map((i, el) => {
        this.$$ = this.$(el);
        const Title = this.getTitle();
        const CurrentPrice = this.getCurrentPrice();

        // if either of the two key values is missing, the object is rejected
        if (!Title || !CurrentPrice) return {};

        return {
          Title,
          CurrentPrice,
          Currency: this.getCurrency(),
          isNew: this.isNew(),
        };
      })
      .toArray();
  }

  // private methods used by the 'parse' method
  getTitle() {
    return _.replace(this.$$.find('.product__name').text().trim(), /\s{2,}/g, ' ');
  }

  getCurrentPrice() {
    return _.toNumber(
      _.replace(
        _.last(_.split(this.$$.find('.product__price').text().trim(), ' ')),
        ',',
        '.',
      ),
    );
  }

  getCurrency() {
    return _.head(_.split(this.$$.find('.product__price').text().trim(), ' '));
  }

  isNew() {
    return /new/.test(_.toLower(this.$$.find('.content-asset p').text().trim()));
  }
};
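The price helpers above assume the raw text of '.product__price' has the form '<currency> <amount>' with a comma as the decimal separator (e.g. '€ 25,96', a made-up value). The same extraction can be sketched in plain JS without Lodash:

```javascript
// raw text as it might come from '.product__price' (hypothetical value)
const raw = '€ 25,96';

const parts = raw.trim().split(' ');

// currency is the first token, the amount is the last one
const currency = parts[0];
const currentPrice = Number(parts[parts.length - 1].replace(',', '.'));

console.log(currency, currentPrice); // € 25.96
```

If the markup ever changes the separator or token order, these helpers are the single place to update.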
The result of the crawler's and the parser's work is 8 arrays of objects, passed back to the for loop of the index.js file.
A Writable Stream will be used for writing to the file. Streams are just JS objects with a number of methods for working with sequentially arriving chunks of data. All streams inherit from the EventEmitter class, which lets them react to events occurring in the runtime environment. Perhaps someone has seen something like this:
myServer.on('request', (request, response) => {
  // something puts into response
});

// or

myObject.on('data', (chunk) => {
  // do something with data
});
Despite their not-so-original names, myServer and myObject are great examples of NodeJS streams. They listen for certain events: the arrival of an HTTP request (the 'request' event) and the arrival of a chunk of data (the 'data' event), and then set about some useful work. They are "streaming" in that they work with sliced fragments of data and require a minimal amount of RAM.
In this case, the 8 arrays of data arrive sequentially inside the for loop and are written to the file one after another, without waiting for the full collection to accumulate and without any accumulator. Since the exact moment when the next portion of parsed data arrives in the for loop is known while the example code runs, there is no need to listen for events; the data can be written immediately with the stream's built-in write method.
The write destination:
/project_root
|__ /data
|   |__ **data.json**
...
// index.js
const fs = require('fs');
const path = require('path');
const { createWriteStream } = require('fs');

// use constants to simplify work with addresses
const resultDirPath = path.join('./', 'data');
const resultFilePath = path.join(resultDirPath, 'data.json');

// check if the data directory exists; create it if necessary;
// if the data.json file already exists - clear its contents,
// ...if not - create an empty one
!fs.existsSync(resultDirPath) && fs.mkdirSync(resultDirPath);
fs.writeFileSync(resultFilePath, '');

(async () => {
  // the process of getting the cookie
  // and the url-links array...

  // create a stream object for writing
  // and write an opening square bracket with a line break
  const writer = createWriteStream(resultFilePath);
  writer.write('[\n');

  // run a loop through the url-links array
  for (let i = 0; i < L; i += 1) {
    const result = await crawler.run(targetUrls[i]);

    // if an array with parsed data is received, determine its length l
    if (!_.isEmpty(result)) {
      const { length: l } = result;

      // using the write method, add the next portion
      // of the incoming data to data.json
      for (let j = 0; j < l; j += 1) {
        if (i + 1 === L && j + 1 === l) {
          writer.write(` ${JSON.stringify(result[j])}\n`);
        } else {
          writer.write(` ${JSON.stringify(result[j])},\n`);
        }
      }
    }
  }

  // close the JSON array and the stream
  writer.write(']\n');
  writer.end();
})();
The nested for loop solves only one problem: to get valid JSON in the output, there must be no comma after the last object in the resulting array. The nested loop determines which object will be the application's last one in order to suppress the comma for it.
If data/data.json is created in advance and opened while the code is running, one can watch in real time how the Writable Stream sequentially appends new pieces of data.
The output was a JSON array of the form:
[
{"Title":"Graphic relaxed-fit T-shirt | Women","CurrentPrice":25.96,"Currency":"€","isNew":false},
{"Title":"Printed mercerised T-shirt | Women","CurrentPrice":29.97,"Currency":"€","isNew":true},
{"Title":"Slim-fit camisole | Women","CurrentPrice":32.46,"Currency":"€","isNew":false},
{"Title":"Graphic relaxed-fit T-shirt | Women","CurrentPrice":25.96,"Currency":"€","isNew":true},
...
{"Title":"Piped-collar polo | Boys","CurrentPrice":23.36,"Currency":"€","isNew":false},
{"Title":"Denim chino shorts | Boys","CurrentPrice":45.46,"Currency":"€","isNew":false}
]
Application processing time with authorization was about 20 seconds.
The complete open-source project code is on GitHub, together with a package.json file listing the dependencies.