Web Scraping Sites With Session Cookie Authentication Using NodeJS Request

by Scrape-It.Cloud, August 2nd, 2022

Introduction

Today NodeJS has a huge number of libraries that can solve almost any routine task. Web scraping as a product has low entry requirements, which attracts freelancers and development teams to it. Not surprisingly, the library ecosystem for NodeJS already contains everything that is needed for parsing.


This article walks through the core structure of a working NodeJS scraping application. As an example, it collects data from a couple of hundred pages of https://outlet.scotch-soda.com, a site selling branded clothing and accessories. The code sample resembles real scraping applications, one of which was used in the Yelp scraping article.


To keep the example short, several production components were left out: the database, containerization, proxy connections, and parallel process management with a process manager such as pm2. Routine matters such as linting are not covered either.


The basic project structure is covered, though, and the most popular libraries (Axios, Cheerio, Lodash) are used. Authorization cookies are extracted with Puppeteer, and the data is scraped and written to a file using NodeJS streams.

Terms

The following terms are used in this article: the NodeJS application is the server application, the website outlet.scotch-soda.com is the web resource, and the server behind the website is the web server. The first step is to explore the web resource and its pages in Chrome or Firefox. Then a server application is written that sends HTTP requests to the web server and receives responses with the requested data.

Getting An Authorization Cookie

The content of outlet.scotch-soda.com is available only to authorized users. In this example, authorization is performed through a Chromium browser controlled by the server application, and the cookies are taken from that browser session. These cookies are then sent in the HTTP headers of every request to the web server, which gives the application access to the content. When scraping large resources with tens or hundreds of thousands of pages, the cookies have to be refreshed several times.


The application will have the following structure:

cookie-manager.js: file with the CookieManager class, whose methods do the main work of getting the cookie;

cookie-storage.js: file with the cookie storage variable;

index.js: the point where the CookieManager method is called;

.env: environment variable file.

/project_root
|__ /src
|   |__ /helpers
|      |__ cookie-manager.js
|      |__ cookie-storage.js
|__ .env
|__ index.js

Primary directory and file structure

Add the following code to the application:

// index.js

// including environment variables in .env
require('dotenv').config();

const cookieManager = require('./src/helpers/cookie-manager');
const { setLocalCookie } = require('./src/helpers/cookie-storage');

// IIFE - application entry point
(async () => {
  // CookieManager call point
  // login/password values are stored in the .env file
  const cookie = await cookieManager.fetchCookie(
    process.env.LOGIN,
    process.env.PASSWORD,
  );

  if (cookie) {
    // if the cookie was received, assign it as the value of a storage variable
    setLocalCookie(cookie);
  } else {
    console.log('Warning! Could not fetch the Cookie after 3 attempts. Aborting the process...');
    // close the application with an error if it is impossible to receive the cookie
    process.exit(1);
  }
})();

and in cookie-manager.js:

// cookie-manager.js

// 'user-agents' generates 'User-Agent' values for HTTP headers
// 'puppeteer-extra' - wrapper for 'puppeteer' library
const _ = require('lodash');
const UserAgent = require('user-agents');
const puppeteerXtra = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// hide from webserver that it is bot
puppeteerXtra.use(StealthPlugin());

class CookieManager {
	// this.browser & this.page - Chromium window and page instances 
  constructor() {
    this.browser = null;
    this.page = null;
    this.cookie = null;
  }

  // getter
  getCookie() {
    return this.cookie;
  }

  // setter
  setCookie(cookie) {
    this.cookie = cookie;
  }

  async fetchCookie(username, password) {
		// give 3 attempts to authorize and receive cookies
    const attemptCount = 3;

    try {
			// instantiate Chromium window and blank page 
      this.browser = await puppeteerXtra.launch({
        args: ['--window-size=1920,1080'],
        headless: process.env.NODE_ENV === 'PROD',
      });

			// Chromium instantiates blank page and sets 'User-Agent' header
      this.page = await this.browser.newPage();
      await this.page.setUserAgent((new UserAgent()).toString());

      for (let i = 0; i < attemptCount; i += 1) {
        // Chromium requests the authorization page from the web server
        // and waits for the DOM to load
        await this.page.goto(process.env.LOGIN_PAGE, { waitUntil: ['domcontentloaded'] });

        // Chromium waits for the country selection confirmation button, clicks it
        // and then pauses for 1 second: page.waitForTimeout(1000)
        await this.page.waitForSelector('#changeRegionAndLanguageBtn', { timeout: 5000 });
        await this.page.click('#changeRegionAndLanguageBtn');
        await this.page.waitForTimeout(1000);

				// Chromium waits for a block to enter a username and password
        await this.page.waitForSelector('div.login-box-content', { timeout: 5000 });
        await this.page.waitForTimeout(1000);
				
				// Chromium enters username/password and clicks on the 'Log in' button			
        await this.page.type('input.email-input', username);
        await this.page.waitForTimeout(1000);
        await this.page.type('input.password-input', password);
        await this.page.waitForTimeout(1000);
        await this.page.click('button[value="Log in"]');
        await this.page.waitForTimeout(3000);

				// Chromium waits for target content to load on 'div.main' selector
        await this.page.waitForSelector('div.main', { timeout: 5000 });

				// get the cookies and glue them into a string of the form <key>=<value> [; <key>=<value>]
        this.setCookie(
          _.join(
            _.map(
              await this.page.cookies(),
              ({ name, value }) => _.join([name, value], '='),
            ),
            '; ',
          ),
        );

				// when the cookie has been received, break the loop
        if (this.cookie) break;
      }

			// return cookie to call point (in index.js)
      return this.getCookie();
    } catch (err) {
      throw new Error(err);
    } finally {
			// close page and browser instances
      this.page && await this.page.close();
      this.browser && await this.browser.close();
    }
  }
}

// export singleton
module.exports = new CookieManager();

Some of the values are read from the .env file.

// .env

NODE_ENV=DEV

LOGIN_PAGE=https://outlet.scotch-soda.com/de/en/login

LOGIN=<your-login-email>
PASSWORD=i*m_on@kde

For example, the headless option passed to puppeteerXtra.launch resolves to a boolean that depends on the value of process.env.NODE_ENV. In development, the variable is set to DEV, so headless is false and Puppeteer launches Chromium with a visible window.


The page.cookies method returns an array of objects, each of which describes one cookie and contains, among other properties, name and value. Using a few Lodash functions, the code extracts the name-value pair of each cookie and joins them into a single string of the form <key>=<value>[; <key>=<value>].
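The same joining step can be sketched without Lodash. This is only an illustration: toCookieHeader is a hypothetical helper name, and the sample cookies are made up rather than taken from the site.

```javascript
// Hypothetical helper: turn an array of cookie objects (the shape returned by
// Puppeteer's page.cookies()) into a single Cookie header value.
const toCookieHeader = (cookies) => cookies
  .map(({ name, value }) => `${name}=${value}`)
  .join('; ');

// Made-up sample data; real cookie objects carry more fields (domain, path, ...)
const sampleCookies = [
  { name: 'sid', value: 'abc123', domain: 'example.com' },
  { name: 'lang', value: 'en', domain: 'example.com' },
];

console.log(toCookieHeader(sampleCookies)); // sid=abc123; lang=en
```

The resulting string is what goes into the cookie header of later HTTP requests.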

File cookie-storage.js:

// cookie-storage.js

//  cookie storage variable
let localProcessedCookie = null;

// getter
const getLocalCookie = () => localProcessedCookie;

// setter
const setLocalCookie = (cookie) => {
  localProcessedCookie = cookie;

	// lock the getLocalCookie function; 
  // its scope with the localProcessedCookie value will be saved
  // after the setLocalCookie function completes
  return getLocalCookie;
};

module.exports = {
  setLocalCookie,
  getLocalCookie,
};

A closure keeps a variable accessible after the function that works with it has finished. As a rule, when a function returns, it leaves the call stack, and the garbage collector frees the local variables that nothing references any more.

In the example above, the localProcessedCookie variable holds the retrieved cookie and stays in memory after the setLocalCookie setter completes. This makes the value available anywhere in the code for as long as the application is running.

This works because both getLocalCookie and setLocalCookie close over localProcessedCookie, which is declared in the module scope of cookie-storage.js. NodeJS caches the module after the first require, so the exported functions, and the variable they reference, remain reachable and the garbage collector leaves them untouched. Returning getLocalCookie from the setter is therefore a convenience rather than a requirement: the exported getter reads the same variable.
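A minimal sketch of the closure mechanism in isolation may help. The createStorage factory and the stored variable are hypothetical names used only for this illustration; they are not part of the project.

```javascript
// Hypothetical factory: each call creates its own private `stored` variable.
const createStorage = () => {
  let stored = null;

  return {
    get: () => stored,
    set: (value) => { stored = value; },
  };
};

const storage = createStorage();
storage.set('sid=abc123');

// createStorage() has already returned, yet `stored` is still reachable
// through the two closures, so it is not garbage-collected.
console.log(storage.get()); // sid=abc123
```

In cookie-storage.js the module scope plays the role of createStorage here: the variable lives as long as something can still reach the functions that reference it.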

URL Builder

The application needs a primary list of URLs to start crawling. In production, as a rule, crawling starts from the main page of a web resource and, over a certain number of iterations, a collection of links to landing pages is built. Often there are tens and hundreds of thousands of such links for one web resource.


In this example, the crawler will be passed only 8 crawl links as input. Links lead to pages with catalogs of the main product groups. Here they are:

https://outlet.scotch-soda.com/women/clothing
https://outlet.scotch-soda.com/women/footwear
https://outlet.scotch-soda.com/women/accessories/all-womens-accessories
https://outlet.scotch-soda.com/men/clothing
https://outlet.scotch-soda.com/men/footwear
https://outlet.scotch-soda.com/men/accessories/all-mens-accessories
https://outlet.scotch-soda.com/kids/girls/clothing/all-girls-clothing
https://outlet.scotch-soda.com/kids/boys/clothing/all-boys-clothing

To avoid cluttering the application code with such long link strings, let's create a compact URL builder from the following files:

categories.js: file with route parameters;

target-builder.js: file that will build a collection of URLs.

/project_root
|__ /src
|   |__ /constants
|   |  |__ categories.js
|   |__ /helpers
|      |__ cookie-manager.js
|      |__ cookie-storage.js
|      |__ target-builder.js
|__ .env
|__ index.js

Add the following code:

// .env

MAIN_PAGE=https://outlet.scotch-soda.com

// index.js

// import builder function
const getTargetUrls = require('./src/helpers/target-builder');

(async () => {
  // here the process of getting the cookie

  // get an array of url links and determine its length L
  const targetUrls = getTargetUrls();
  const { length: L } = targetUrls;

})();

// categories.js

module.exports = [
  'women/clothing',
  'women/footwear',
  'women/accessories/all-womens-accessories',
  'men/clothing',
  'men/footwear',
  'men/accessories/all-mens-accessories',
  'kids/girls/clothing/all-girls-clothing',
  'kids/boys/clothing/all-boys-clothing',
];

// target-builder.js

const path = require('path');
const categories = require('../constants/categories');

// static fragment of route parameters
const staticPath = 'global/en';

// create URL object from main page address
const url = new URL(process.env.MAIN_PAGE);

// add the full string of route parameters to the URL object
// and return full url string
const addPath = (dynamicPath) => {
  url.pathname = path.join(staticPath, dynamicPath);

  return url.href;
};

// collect URL link from each element of the array with categories
module.exports = () => categories.map((category) => addPath(category));

Together with the .env entry, these snippets build the 8 category links for the crawler, each with the static global/en prefix in its path. They demonstrate the built-in URL and path modules. One may wonder whether this is cracking a nut with a sledgehammer, and whether string interpolation would be easier.


NodeJS built-in methods are used to work with route parameters and URLs for two reasons:

  1. Interpolation reads well only with a small number of components.
  2. Good practices take hold only when they are used every day.
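As a small self-contained sketch of the same idea, the builder below composes one link from its parts. The base address and the global/en prefix are hard-coded here as assumptions (the article reads the base from MAIN_PAGE), buildUrl is a hypothetical name, and path.posix.join is swapped in for path.join so that the separator is a forward slash on any operating system.

```javascript
const path = require('path');

// Assumed values for illustration; the article reads the base from MAIN_PAGE
const base = 'https://outlet.scotch-soda.com';
const staticPath = 'global/en';

// Build a fresh URL object per call instead of mutating a shared one
const buildUrl = (dynamicPath) => {
  const url = new URL(base);
  // path.posix.join always uses forward slashes, regardless of the host OS
  url.pathname = path.posix.join(staticPath, dynamicPath);
  return url.href;
};

console.log(buildUrl('women/clothing'));
// https://outlet.scotch-soda.com/global/en/women/clothing
```

Unlike plain interpolation, the join also normalizes stray or duplicated slashes in the components.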

Crawling and Scraping

Add two files to the logical center of the server application:

  • crawler.js: contains the Crawler class for sending requests to the web server and receiving web page markup
  • parser.js: contains the Parser class with methods for scraping the markup and getting the target data

/project_root
|__ /src
|   |__ /constants
|   |  |__ categories.js
|   |__ /helpers
|   |  |__ cookie-manager.js
|   |  |__ cookie-storage.js
|   |  |__ target-builder.js
|   |__ crawler.js
|   |__ parser.js
|__ .env
|__ index.js

First of all, add a loop to index.js that passes the URL links to the Crawler one by one and receives the parsed data:

// index.js

const Crawler = require('./src/crawler');

(async () => {
  // getting Cookie process
  // and url-links array...
  const { length: L } = targetUrls;

  // create the crawler only after the cookie has been stored,
  // because its constructor reads the cookie via getLocalCookie()
  const crawler = new Crawler();

  // run a loop through the length of the array of url links
  for (let i = 0; i < L; i += 1) {
    // call the run method of the crawler for each link
    // and return parsed data
    const result = await crawler.run(targetUrls[i]);

    // do smth with parsed data...
  }
})();

Crawler code:

// crawler.js

require('dotenv').config();
const cheerio = require('cheerio');
const axios = require('axios').default;
const UserAgent = require('user-agents');

const Parser = require('./parser');
// getLocalCookie - closure function, returns localProcessedCookie
const { getLocalCookie } = require('./helpers/cookie-storage');

module.exports = class Crawler {
  constructor() {
		// create a class variable and bind it to the newly created Axios object
		// with the necessary headers
    this.axios = axios.create({
      headers: {
        cookie: getLocalCookie(),
        'user-agent': (new UserAgent()).toString(),
      },
    });
  }

  async run(url) {
    console.log('IScraper: working on %s', url);

    try {
      // do HTTP request to the web server
      const { data } = await this.axios.get(url);
			// create a cheerio object with nodes from html markup
      const $ = cheerio.load(data);

      // if the parsed document contains any elements, run Parser
      // and return to index.js the result of parsing
      // ($ itself is a function, so $.length does not reflect the content)
      if ($('body').children().length) {
        const p = new Parser($);

        return p.parse();
      }
      console.log('IScraper: could not fetch or handle the page content from %s', url);
      return null;
    } catch (e) {
      console.log('IScraper: could not fetch the page content from %s', url);

      return null;
    }
  }
};

The parser's task is to select the data from the cheerio object it receives and to build the following structure for each URL link:

[
  {
    "Title":"Graphic relaxed-fit T-shirt | Women",
    "CurrentPrice":25.96,
    "Currency":"€",
    "isNew":false
  },
  {
    // 36 such elements in total for every url-link
  }
]

Parser code:

// parser.js

require('dotenv').config();
const _ = require('lodash');

module.exports = class Parser {
  constructor(content) {
		// this.$ - this is a cheerio object parsed from the markup
    this.$ = content;
    this.$$ = null;
  }

	// The crawler calls the parse method
	// extracts all 'li' elements from the content block
	// and in the map loop the target data is selected
  parse() {
    return this.$('#js-search-result-items')
      .children('li')
      .map((i, el) => {
        this.$$ = this.$(el);

        const Title = this.getTitle();
        const CurrentPrice = this.getCurrentPrice();

        // if either of the two key values is missing, the object is rejected:
        // returning null makes cheerio's map skip the element
        if (!Title || !CurrentPrice) return null;

        return {
          Title,
          CurrentPrice,
          Currency: this.getCurrency(),
          isNew: this.isNew(),
        };
      })
      .toArray();
  }

	// next - private methods, which are used at 'parse' method
  getTitle() {
    return _.replace(this.$$.find('.product__name').text().trim(), /\s{2,}/g, ' ');
  }

  getCurrentPrice() {
    return _.toNumber(
      _.replace(
        _.last(_.split(this.$$.find('.product__price').text().trim(), ' ')),
        ',',
        '.',
      ),
    );
  }

  getCurrency() {
    return _.head(_.split(this.$$.find('.product__price').text().trim(), ' '));
  }

  isNew() {
    return /new/.test(_.toLower(this.$$.find('.content-asset p').text().trim()));
  }
};

Together, the crawler and the parser produce 8 arrays of objects, one per URL, which are passed back to the for loop in index.js.
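The price handling inside the parser can be tried in isolation. The sketch below uses plain string methods instead of Lodash and cheerio. It assumes a label such as "€ 25,96", with the currency sign first, the amount last and a comma as the decimal separator, which is what the parser code implies. parsePriceLabel is a hypothetical name.

```javascript
// Hypothetical helper: split a price label such as "€ 25,96" into parts.
// Assumes the currency sign comes first and the amount last, separated by
// spaces, with a comma as the decimal separator.
const parsePriceLabel = (label) => {
  const parts = label.trim().split(/\s+/);
  const amount = Number(parts[parts.length - 1].replace(',', '.'));

  return {
    Currency: parts[0],
    // a non-numeric amount becomes null instead of NaN
    CurrentPrice: Number.isNaN(amount) ? null : amount,
  };
};

console.log(parsePriceLabel('  € 25,96 '));
// { Currency: '€', CurrentPrice: 25.96 }
```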

Stream Writing to File

A Writable Stream is used to write to the file. Streams are JS objects with a number of methods for working with chunks of data that arrive sequentially. All streams inherit from the EventEmitter class, which lets them react to events that occur in the runtime environment. You may have seen code like this:

myServer.on('request', (request, response) => {
  // write something into the response
});

// or

myObject.on('data', (chunk) => {
	// do something with data
});

Both are typical examples of the event-driven interface that NodeJS streams share, despite their unoriginal names, myServer and myObject. Here they listen for certain events: the arrival of an HTTP request (the 'request' event) and the arrival of a chunk of data (the 'data' event), and then do some useful work with it. The "streaming" part is that they work with small slices of data and need a minimal amount of RAM.
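The event-listening pattern described above can be tried with a bare EventEmitter. This is an illustrative sketch only: the source object stands in for a stream, and the chunk strings are made up.

```javascript
const { EventEmitter } = require('events');

// A plain emitter standing in for a stream-like object (illustrative only)
const source = new EventEmitter();
const received = [];

// Subscribe to the 'data' event, as a consumer of a readable stream would
source.on('data', (chunk) => {
  received.push(chunk);
});

// emit() calls the listeners synchronously, one chunk at a time
source.emit('data', 'first chunk');
source.emit('data', 'second chunk');

console.log(received); // [ 'first chunk', 'second chunk' ]
```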


In this case, the 8 arrays of data arrive one after another inside the for loop and are written to the file in the same order, without waiting for the full collection and without any accumulator. Because the example code knows exactly when the next portion of parsed data arrives in the for loop, there is no need to listen for events: the data can be written immediately with the stream's built-in write method.

Place to write:

/project_root
|__ /data
|   |__ data.json
...

// index.js

const fs = require('fs');
const path = require('path');
const _ = require('lodash');

// use constants to simplify work with addresses
const resultDirPath = path.join('./', 'data');
const resultFilePath = path.join(resultDirPath, 'data.json');

// check if the data directory exists; create it if necessary
// if the data.json file already exists, clear its contents;
// if it does not exist, create an empty file
if (!fs.existsSync(resultDirPath)) fs.mkdirSync(resultDirPath);
fs.writeFileSync(resultFilePath, '');

(async () => {
  // getting the cookie, the url-links array (targetUrls, L)
  // and the crawler instance...

  // create a stream object for writing
  // and write the opening square bracket followed by a line break
  const writer = fs.createWriteStream(resultFilePath);
  writer.write('[\n');

  // run a loop through the length of the url-links array
  for (let i = 0; i < L; i += 1) {
    const result = await crawler.run(targetUrls[i]);

    // if an array with parsed data is received, determine its length l
    if (!_.isEmpty(result)) {
      const { length: l } = result;

      // using the write method, add the next portion
      // of the incoming data to data.json
      for (let j = 0; j < l; j += 1) {
        if (i + 1 === L && j + 1 === l) {
          writer.write(`  ${JSON.stringify(result[j])}\n`);
        } else {
          writer.write(`  ${JSON.stringify(result[j])},\n`);
        }
      }
    }
  }

  // write the closing square bracket and close the stream
  writer.end(']\n');
})();

The nested for loop solves only one problem: to produce valid JSON, there must be no comma after the last object in the resulting array. The nested loop detects which object is the last one and omits the comma after it.
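An alternative to detecting the last object is to write the comma before every item except the first. This also stays valid when the final batch turns out to be empty. The sketch below is an assumption-laden illustration, not the project's code: writeJsonArray and memoryWriter are hypothetical names, and an in-memory Writable replaces the file stream so the result can be checked immediately.

```javascript
const { Writable } = require('stream');

// Hypothetical helper: write batches of objects as one JSON array.
// A comma is written BEFORE every item except the first, so the last
// item never needs special treatment, even if the final batch is empty.
const writeJsonArray = (writer, batches) => {
  let isFirst = true;

  writer.write('[\n');
  for (const batch of batches) {
    for (const item of batch) {
      writer.write(`${isFirst ? '' : ',\n'}  ${JSON.stringify(item)}`);
      isFirst = false;
    }
  }
  writer.end('\n]\n');
};

// In-memory Writable used only for demonstration; fs.createWriteStream()
// returns a Writable that can be passed in the same way.
const chunks = [];
const memoryWriter = new Writable({
  write(chunk, encoding, callback) {
    chunks.push(chunk.toString());
    callback();
  },
});

writeJsonArray(memoryWriter, [[{ a: 1 }, { a: 2 }], [], [{ a: 3 }]]);
console.log(JSON.parse(chunks.join(''))); // [ { a: 1 }, { a: 2 }, { a: 3 } ]
```

In the article's loop, the batches arrive one at a time from crawler.run, so the same flag-based approach would sit inside that loop rather than in a helper that receives all batches at once.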


If one creates data/data.json in advance and opens it while the code is running, then one can see in real-time how the Writable Stream sequentially adds new pieces of data.

Conclusion

The output is a JSON array of the form:

[
  {"Title":"Graphic relaxed-fit T-shirt | Women","CurrentPrice":25.96,"Currency":"€","isNew":false},
  {"Title":"Printed mercerised T-shirt | Women","CurrentPrice":29.97,"Currency":"€","isNew":true},
  {"Title":"Slim-fit camisole | Women","CurrentPrice":32.46,"Currency":"€","isNew":false},
  {"Title":"Graphic relaxed-fit T-shirt | Women","CurrentPrice":25.96,"Currency":"€","isNew":true},
  ...
  {"Title":"Piped-collar polo | Boys","CurrentPrice":23.36,"Currency":"€","isNew":false},
  {"Title":"Denim chino shorts | Boys","CurrentPrice":45.46,"Currency":"€","isNew":false}
]

Application processing time with authorization was about 20 seconds.


The complete open-source project code is on GitHub, along with a package.json file listing the dependencies.