### Introduction

Today NodeJS has [a huge number of libraries](https://www.npmjs.com/) that can solve almost any routine task. Web scraping as a product has low entry requirements, which attracts freelancers and development teams to it. Not surprisingly, the library ecosystem for NodeJS already contains everything needed for parsing.

This article covers the core structure of a working NodeJS parsing application, along with an example of collecting data from a couple of hundred pages of https://outlet.scotch-soda.com, a site selling branded clothing and accessories. The code sample is similar to real scraping applications, one of which was used in the [Yelp scraping](https://scrape-it.cloud/blog/how-to-scrape-yelp-using-nodejs) article.

However, due to natural limitations, several production components were removed from the example, such as a database, containerization, a proxy connection, and parallel process management with managers like [pm2](https://pm2.keymetrics.io/docs/usage/quick-start/). There will also be no stops on such obvious things as, for example, linting.

Still, the basic structure of the project will be considered, and the most popular libraries ([Axios](https://axios-http.com/docs/api_intro), [Cheerio](https://cheerio.js.org/), [Lodash](https://lodash.com/docs)) will be used; authorization keys will be extracted using [Puppeteer](https://www.npmjs.com/package/puppeteer-extra), and the data will be scraped and written to a file using [NodeJS streams](https://nodejs.org/dist/latest-v18.x/docs/api/stream.html).

### Terms

The following terms will be used in the article: the NodeJS application is the ***server application***, the website outlet.scotch-soda.com is ***a web resource***, and the website's server is ***a web server***. In other words, the first step is to research the web resource and its pages in Chrome or Firefox.
Then a server application will be written to send [HTTP requests](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#HTTP/1.1_request_messages) to the web server, and in the end the response with the requested data will be received.

### Getting An Authorization Cookie

The content of outlet.scotch-soda.com is available only to authorized users. In this example, authorization will occur through a Chromium browser controlled by the server application, from which the cookies will be received. These cookies will be included in the [HTTP headers](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields) on every HTTP request to the web server, allowing the application to access the content. When scraping large resources with tens and hundreds of thousands of pages, the received cookies have to be updated several times.

The application will have the following structure:

* **cookie-manager.js:** file with the `CookieManager` class, whose methods do the main work of getting the cookie;
* **cookie-storage.js:** cookie variable file;
* **index.js:** `CookieManager` call point;
* **.env:** environment variable file.

```
/project_root
|__ /src
|   |__ /helpers
|       |__ cookie-manager.js
|       |__ cookie-storage.js
|__ .env
|__ index.js
```

Primary directory and file structure

Add the following code to the application:

```javascript
// index.js

// including environment variables in .env
require('dotenv').config();

const cookieManager = require('./src/helpers/cookie-manager');
const { setLocalCookie } = require('./src/helpers/cookie-storage');

// IIFE - application entry point
(async () => {
  // CookieManager call point
  // login/password values are stored in the .env file
  const cookie = await cookieManager.fetchCookie(
    process.env.LOGIN,
    process.env.PASSWORD,
  );

  if (cookie) {
    // if the cookie was received, assign it as the value of a storage variable
    setLocalCookie(cookie);
  } else {
    console.log('Warning! Could not fetch the Cookie after 3 attempts. Aborting the process...');
    // close the application with an error if it is impossible to receive the cookie
    process.exit(1);
  }
})();
```

and in `cookie-manager.js`:

```javascript
// cookie-manager.js

// 'user-agents' generates 'User-Agent' values for HTTP headers
// 'puppeteer-extra' - wrapper for the 'puppeteer' library
const _ = require('lodash');
const UserAgent = require('user-agents');
const puppeteerXtra = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// hide from the web server that it is a bot
puppeteerXtra.use(StealthPlugin());

class CookieManager {
  // this.browser & this.page - Chromium window and page instances
  constructor() {
    this.browser = null;
    this.page = null;
    this.cookie = null;
  }

  // getter
  getCookie() {
    return this.cookie;
  }

  // setter
  setCookie(cookie) {
    this.cookie = cookie;
  }

  async fetchCookie(username, password) {
    // give 3 attempts to authorize and receive cookies
    const attemptCount = 3;

    try {
      // instantiate Chromium window and blank page
      this.browser = await puppeteerXtra.launch({
        args: ['--window-size=1920,1080'],
        headless: process.env.NODE_ENV === 'PROD',
      });

      // Chromium instantiates a blank page and sets the 'User-Agent' header
      this.page = await this.browser.newPage();
      await this.page.setUserAgent((new UserAgent()).toString());

      for (let i = 0; i < attemptCount; i += 1) {
        // Chromium asks the web server for an authorization page
        // and waits for the DOM
        await this.page.goto(process.env.LOGIN_PAGE, { waitUntil: ['domcontentloaded'] });

        // Chromium waits for and presses the country selection confirmation button,
        // then falls asleep for 1 second: page.waitForTimeout(1000)
        await this.page.waitForSelector('#changeRegionAndLanguageBtn', { timeout: 5000 });
        await this.page.click('#changeRegionAndLanguageBtn');
        await this.page.waitForTimeout(1000);

        // Chromium waits for the block to enter a username and password
        await this.page.waitForSelector('div.login-box-content', { timeout: 5000 });
        await this.page.waitForTimeout(1000);
        // Chromium enters username/password and clicks on the 'Log in' button
        await this.page.type('input.email-input', username);
        await this.page.waitForTimeout(1000);

        await this.page.type('input.password-input', password);
        await this.page.waitForTimeout(1000);

        await this.page.click('button[value="Log in"]');
        await this.page.waitForTimeout(3000);

        // Chromium waits for the target content to load on the 'div.main' selector
        await this.page.waitForSelector('div.main', { timeout: 5000 });

        // get the cookies and glue them into a string of the form <key>=<value> [; <key>=<value>]
        this.setCookie(
          _.join(
            _.map(
              await this.page.cookies(),
              ({ name, value }) => _.join([name, value], '='),
            ),
            '; ',
          ),
        );

        // when the cookie has been received, break the loop
        if (this.cookie) break;
      }

      // return the cookie to the call point (in index.js)
      return this.getCookie();
    } catch (err) {
      throw new Error(err);
    } finally {
      // close page and browser instances
      this.page && await this.page.close();
      this.browser && await this.browser.close();
    }
  }
}

// export singleton
module.exports = new CookieManager();
```

The values of some variables are links to the `.env` file.

```
// .env
NODE_ENV=DEV
LOGIN_PAGE=https://outlet.scotch-soda.com/de/en/login
LOGIN=tyrell.wellick@ecorp.com
PASSWORD=i*m_on@kde
```

For example, the configuration attribute `headless`, sent to the `puppeteerXtra.launch` method, resolves to a boolean that depends on the state of the `process.env.NODE_ENV` variable. In development, the variable is set to `DEV`, so `headless` resolves to `false`, and Puppeteer understands that right now it should render Chromium execution on the monitor.

The `page.cookies` method returns an array of objects, each of which defines one cookie and contains two properties: `name` and `value`.
Using some Lodash functions, the loop extracts the key-value pair of each cookie and glues them into a single string of the form `<key>=<value>; <key>=<value>`.

File `cookie-storage.js`:

```javascript
// cookie-storage.js

// cookie storage variable
let localProcessedCookie = null;

// getter
const getLocalCookie = () => localProcessedCookie;

// setter
const setLocalCookie = (cookie) => {
  localProcessedCookie = cookie;

  // lock the getLocalCookie function;
  // its scope with the localProcessedCookie value will be saved
  // after the setLocalCookie function completes
  return getLocalCookie;
};

module.exports = {
  setLocalCookie,
  getLocalCookie,
};
```

The idea of closures is to keep access to the value of some variable after the function in whose scope this variable was defined has finished. As a rule, when a function finishes executing a `return`, it leaves the call stack and the garbage collector removes all variables of its scope from memory.

In the example above, the value of the `localProcessedCookie` variable with the retrieved cookie remains in the computer's memory after the `setLocalCookie` setter completes. This makes it possible to get this value anywhere in the code as long as the application is running.

To achieve this, when `setLocalCookie` is called, the `getLocalCookie` function is returned from it. Then, when the `setLocalCookie` function scope is destroyed, NodeJS sees that it has the `getLocalCookie` closure function. Therefore, the garbage collector leaves all variables from the scope of the returned getter untouched in memory. Since the variable `localProcessedCookie` was in the scope of `getLocalCookie`, it continues to live, retaining the binding to the cookie.

### URL Builder

The application needs a primary list of URLs to start crawling. In production, as a rule, crawling starts from the main page of a web resource and, over a certain number of iterations, a collection of links to landing pages is built. Often there are tens and hundreds of thousands of such links for one web resource.
In this example, the crawler will be passed only 8 crawl links as input. The links lead to pages with catalogs of the main product groups. Here they are:

```
https://outlet.scotch-soda.com/women/clothing
https://outlet.scotch-soda.com/women/footwear
https://outlet.scotch-soda.com/women/accessories/all-womens-accessories
https://outlet.scotch-soda.com/men/clothing
https://outlet.scotch-soda.com/men/footwear
https://outlet.scotch-soda.com/men/accessories/all-mens-accessories
https://outlet.scotch-soda.com/kids/girls/clothing/all-girls-clothing
https://outlet.scotch-soda.com/kids/boys/clothing/all-boys-clothing
```

To avoid cluttering the application code with such long link strings, let's create a compact URL builder from the following files:

* **categories.js:** file with route parameters;
* **target-builder.js:** file that will build a collection of URLs.

```
/project_root
|__ /src
|   |__ /constants
|   |   |__ categories.js
|   |__ /helpers
|       |__ cookie-manager.js
|       |__ cookie-storage.js
|       |__ target-builder.js
|__ .env
|__ index.js
```

Add the following code:

```
// .env
MAIN_PAGE=https://outlet.scotch-soda.com
```

```javascript
// index.js

// import the builder function
const getTargetUrls = require('./src/helpers/target-builder');

(async () => {
  // here the process of getting the cookie

  // get an array of URL links and determine its length L
  const targetUrls = getTargetUrls();
  const { length: L } = targetUrls;
})();
```

```javascript
// categories.js

module.exports = [
  'women/clothing',
  'women/footwear',
  'women/accessories/all-womens-accessories',
  'men/clothing',
  'men/footwear',
  'men/accessories/all-mens-accessories',
  'kids/girls/clothing/all-girls-clothing',
  'kids/boys/clothing/all-boys-clothing',
];
```

```javascript
// target-builder.js

const path = require('path');
const categories = require('../constants/categories');

// static fragment of route parameters
const staticPath = 'global/en';

// create a URL object from the main page address
const url = new URL(process.env.MAIN_PAGE);

// add the full string of route parameters to the URL object
// and return the full URL string
const addPath = (dynamicPath) => {
  url.pathname = path.join(staticPath, dynamicPath);

  return url.href;
};

// collect a URL link from each element of the categories array
module.exports = () => categories.map((category) => addPath(category));
```

These three snippets create the 8 links given at the beginning of this section. They demonstrate the use of the built-in [URL](https://nodejs.org/dist/latest-v18.x/docs/api/url.html) and [Path](https://nodejs.org/dist/latest-v18.x/docs/api/path.html) libraries. Someone may wonder whether this is like cracking a nut with a sledgehammer and whether it would not be easier to use interpolation.

The canonical NodeJS methods are used to work with route parameters and URL requests for two reasons:

1. Interpolation reads well only with a small number of components.
2. To instill good practices, they must be used every day.

### Crawling and Scraping

Add two files to the logical center of the server application:

* **crawler.js:** contains the `Crawler` class for sending requests to the web server and receiving web page markup;
* **parser.js:** contains the `Parser` class with methods for scraping the markup and getting the target data.

```
/project_root
|__ /src
|   |__ /constants
|   |   |__ categories.js
|   |__ /helpers
|   |   |__ cookie-manager.js
|   |   |__ cookie-storage.js
|   |   |__ target-builder.js
|   |__ crawler.js
|   |__ parser.js
|__ .env
|__ index.js
```

First of all, add a loop to `index.js` that will pass URL links to the crawler in turn and receive the parsed data:

```javascript
// index.js

const crawler = new Crawler();

(async () => {
  // getting Cookie process
  // and URL-links array...
  const { length: L } = targetUrls;

  // run a loop through the length of the array of URL links
  for (let i = 0; i < L; i += 1) {
    // call the run method of the crawler for each link
    // and return the parsed data
    const result = await crawler.run(targetUrls[i]);

    // do smth with parsed data...
  }
})();
```

Crawler code:

```javascript
// crawler.js

require('dotenv').config();
const cheerio = require('cheerio');
const axios = require('axios').default;
const UserAgent = require('user-agents');

const Parser = require('./parser');
// getLocalCookie - closure function, returns localProcessedCookie
const { getLocalCookie } = require('./helpers/cookie-storage');

module.exports = class Crawler {
  constructor() {
    // create a class variable and bind it to the newly created Axios object
    // with the necessary headers
    this.axios = axios.create({
      headers: {
        cookie: getLocalCookie(),
        'user-agent': (new UserAgent()).toString(),
      },
    });
  }

  async run(url) {
    console.log('IScraper: working on %s', url);

    try {
      // do an HTTP request to the web server
      const { data } = await this.axios.get(url);

      // create a cheerio object with nodes from the HTML markup
      const $ = cheerio.load(data);

      // if the cheerio object contains nodes, run the Parser
      // and return the result of parsing to index.js
      if ($.length) {
        const p = new Parser($);

        return p.parse();
      }

      console.log('IScraper: could not fetch or handle the page content from %s', url);

      return null;
    } catch (e) {
      console.log('IScraper: could not fetch the page content from %s', url);

      return null;
    }
  }
};
```

The parser's task is to select the data when the cheerio object is received, and then build the following structure for each URL link:

```json
[
  {
    "Title": "Graphic relaxed-fit T-shirt | Women",
    "CurrentPrice": 25.96,
    "Currency": "€",
    "isNew": false
  },
  {
    // 36 such elements in total for every URL link
  }
]
```

Parser code:

```javascript
// parser.js

require('dotenv').config();
const _ = require('lodash');

module.exports = class Parser {
  constructor(content) {
    // this.$ - the cheerio object parsed from the markup
    this.$ = content;
    this.$$ = null;
  }

  // The crawler calls the parse method, which
  // extracts all 'li' elements from the content block;
  // the target data is selected in the map loop
  parse() {
    return this.$('#js-search-result-items')
      .children('li')
      .map((i, el) => {
        this.$$ = this.$(el);

        const Title = this.getTitle();
        const CurrentPrice = this.getCurrentPrice();

        // if the two key values are missing, such an object is rejected
        if (!Title || !CurrentPrice) return {};

        return {
          Title,
          CurrentPrice,
          Currency: this.getCurrency(),
          isNew: this.isNew(),
        };
      })
      .toArray();
  }

  // next - private methods, which are used in the 'parse' method
  getTitle() {
    return _.replace(this.$$.find('.product__name').text().trim(), /\s{2,}/g, ' ');
  }

  getCurrentPrice() {
    return _.toNumber(
      _.replace(
        _.last(_.split(this.$$.find('.product__price').text().trim(), ' ')),
        ',',
        '.',
      ),
    );
  }

  getCurrency() {
    return _.head(_.split(this.$$.find('.product__price').text().trim(), ' '));
  }

  isNew() {
    return /new/.test(_.toLower(this.$$.find('.content-asset p').text().trim()));
  }
};
```

The result of the crawler's and the parser's work is 8 arrays with objects inside, passed back to the `for` loop of the `index.js` file.

### Stream Writing to File

A `Writable Stream` will be used to write to the file. Streams are just JS objects that contain a number of methods for working with sequentially appearing chunks of data. All streams inherit from the `EventEmitter` class, and due to this, they are able to react to events that occur in the runtime environment. Perhaps someone has seen something like this:

```javascript
myServer.on('request', (request, response) => {
  // something puts into response
});

// or

myObject.on('data', (chunk) => {
  // do something with data
});
```

These are great examples of NodeJS streams, despite their not-so-original names: `myServer` and `myObject`.
In the example above, `myServer` and `myObject` listen for certain events: the arrival of an HTTP request (the `'request'` event) and the arrival of a piece of data (the `'data'` event), after which they get to some useful work. They are called "streaming" because they work with sliced data fragments and require a minimum amount of RAM.

In this case, 8 arrays with data are sequentially received inside the `for` loop and will be sequentially written to the file, without waiting for the accumulation of the full collection and without using any accumulator. Since, when executing the example code, the moment when the next portion of parsed data arrives in the `for` loop is known exactly, there is no need to listen to events; instead, it is possible to write immediately using the `write` method built into the stream.

Place to write:

```
/project_root
|__ /data
|   |__ data.json
...
```

```javascript
// index.js

const fs = require('fs');
const path = require('path');
const { createWriteStream } = require('fs');

// use constants to simplify work with addresses
const resultDirPath = path.join('./', 'data');
const resultFilePath = path.join(resultDirPath, 'data.json');

// check if the data directory exists; create it if necessary;
// if the data.json file existed, delete all its data,
// and if it did not exist, create an empty one
!fs.existsSync(resultDirPath) && fs.mkdirSync(resultDirPath);
fs.writeFileSync(resultFilePath, '');

(async () => {
  // getting Cookie process
  // and URL-links array...
  // create a stream object for writing
  // and add an opening square bracket to the first line with a line break
  const writer = createWriteStream(resultFilePath);
  writer.write('[\n');

  // run a loop through the length of the URL-links array
  for (let i = 0; i < L; i += 1) {
    const result = await crawler.run(targetUrls[i]);

    // if an array with parsed data is received, determine its length l
    if (!_.isEmpty(result)) {
      const { length: l } = result;

      // using the write method, add the next portion
      // of the incoming data to data.json
      for (let j = 0; j < l; j += 1) {
        if (i + 1 === L && j + 1 === l) {
          writer.write(`  ${JSON.stringify(result[j])}\n`);
        } else {
          writer.write(`  ${JSON.stringify(result[j])},\n`);
        }
      }
    }
  }

  // close the JSON array and the stream
  writer.end(']');
})();
```

The nested `for` loop solves only one problem: to get a valid `json` file in the output, one needs to make sure that there is no comma after the last object in the resulting array. The nested `for` loop determines which object will be the last one in the application in order to omit the comma insertion.

If one creates `data/data.json` in advance and opens it while the code is running, one can see in real time how the `Writable Stream` sequentially adds new pieces of data.

### Conclusion

The output was a JSON object of the form:

```json
[
  {"Title":"Graphic relaxed-fit T-shirt | Women","CurrentPrice":25.96,"Currency":"€","isNew":false},
  {"Title":"Printed mercerised T-shirt | Women","CurrentPrice":29.97,"Currency":"€","isNew":true},
  {"Title":"Slim-fit camisole | Women","CurrentPrice":32.46,"Currency":"€","isNew":false},
  {"Title":"Graphic relaxed-fit T-shirt | Women","CurrentPrice":25.96,"Currency":"€","isNew":true},
  ...
  {"Title":"Piped-collar polo | Boys","CurrentPrice":23.36,"Currency":"€","isNew":false},
  {"Title":"Denim chino shorts | Boys","CurrentPrice":45.46,"Currency":"€","isNew":false}
]
```

The application processing time, with authorization, was about 20 seconds.

The complete open-source project code is on GitHub. There is also a `package.json` file with dependencies.