Scrape Google Images with Node JS

Written by darshan12 | Published 2022/07/25
Tech Story Tags: nodejs | webscraping | web-development | software-development | google-image-search | search-engine | search | puppeteer

TL;DR: This tutorial shows how to scrape Google Images with Node JS. The code is simple: we make a GET request to our target URL, passing a User-Agent header with the request. Then we load the returned HTML into Cheerio JS, held in the variable `$`, and use `$` to parse out the data we need. With the API covered at the end, you don’t have to worry about creating and maintaining the scraper, and you can scale the number of requests easily without getting blocked.

Introduction

This post will teach us to scrape Google Images results with Node JS using multiple methods.

Requirements

Install Libraries

Before we begin, install these libraries so we can move forward and prepare our scraper.

npm i unirest
npm i cheerio

To extract our HTML data, we will use Unirest JS, and for parsing the HTML data, we will use Cheerio JS.

Process

Method-1

Everything is now set up for our scraper. Let us discuss our first method to scrape Google Images.

First, we will make a GET request on our target URL with the help of Unirest to extract the raw HTML data.

const unirest = require("unirest");
const cheerio = require("cheerio");

const getImagesData = () => {
  let header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
  };
  return unirest
    .get("https://www.google.com/search?q=nike&oq=nike&hl=en&tbm=isch&asearch=ichunk&async=_id:rg_s,_pms:s,_fmt:pc&sourceid=chrome&ie=UTF-8")
    .headers(header)
    .then((response) => {
      // Load the raw HTML into Cheerio so we can query it with $
      let $ = cheerio.load(response.body);
    });
};

Step-by-step explanation:

  1. In the first step, we made a GET request to our target URL.
  2. In the second step, we passed the headers required with our target URL.
  3. Then we loaded the response body into Cheerio, which gives us the `$` function we use for parsing.

But a single User-Agent might not be enough, as Google can block your requests. So we will make an array of User-Agent strings and pick one at random for every request.


    const selectRandom = () => {
        const userAgents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
        ];
        const randomNumber = Math.floor(Math.random() * userAgents.length);
        return userAgents[randomNumber];
    };
    let user_agent = selectRandom();
    let header = {
        "User-Agent": user_agent
    };

Copy the target URL below and paste it into your browser, which will download a text file. Open that file in your code editor and save it with an .html extension so you can inspect the markup.

https://www.google.com/search?q=nike&oq=nike&hl=en&tbm=isch&asearch=ichunk&async=_id:rg_s,_pms:s,_fmt:pc&sourceid=chrome&ie=UTF-8

Additional parameters which can be used with this URL:

  1. tbs - Term By Search parameter. Read more about this parameter in this article.
  2. chips - Used to filter image results.
  3. ijn - Used for pagination. ijn = 0 will return the first page of results, ijn = 1 will return the second page of results and so on.
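
To make these parameters concrete, here is a minimal sketch that assembles the Method-1 URL with Node's built-in URLSearchParams. The helper name buildImagesUrl and its arguments are assumptions for illustration and are not part of the original scraper. Note that URLSearchParams percent-encodes the colons and commas in the async value; we assume Google treats the encoded form the same way as the literal one.

```javascript
// Hypothetical helper: builds the Method-1 URL for a query and a page index.
const buildImagesUrl = (query, page = 0, extra = {}) => {
  const params = new URLSearchParams({
    q: query,
    oq: query,
    hl: "en",
    tbm: "isch",
    asearch: "ichunk",
    async: "_id:rg_s,_pms:s,_fmt:pc",
    sourceid: "chrome",
    ie: "UTF-8",
    ijn: String(page), // 0 = first page, 1 = second page, and so on
    ...extra, // optional filters such as tbs or chips
  });
  return `https://www.google.com/search?${params.toString()}`;
};

// Second page of results for "nike"
console.log(buildImagesUrl("nike", 1));
```

The returned string can be passed to unirest.get() in place of the hard-coded URL.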

Scroll through the HTML file to the end of the style tag, and you will see the HTML tags of the individual image results.

Now we will parse the fields we want from the response, starting with the title. In the HTML, the title sits in an element with the class mVDMnf inside an anchor tag. Just below the title, the source has the class FnqxG.

    let images_results = [];
    $("div.rg_bx").each((i, el) => {
        images_results.push({
            title: $(el).find(".iKjWAf .mVDMnf").text(),
            source: $(el).find(".iKjWAf .FnqxG").text()
        });
    });

After the end of the anchor tag, you will find the div tag with the class name rg_meta which contains a JSON string.

{"bce":"rgb(249,252,249)","cb":21,"cl":21,"clt":"n","cr":21,"ct":21,"id":"qYZE1rcH_OCntM","isu":"www.nike.com","itg":0,"oh":1088,"os":"15KB","ou":"https://c.static-nike.com/a/images/w_1920,c_limit/bzl2wmsfh7kgdkufrrjq/image.jpg","ow":1920,"pt":"Nike. Just Do It. Nike.com","rh":"www.nike.com","rid":"mgtROrdDu1XGJM","rmt":0,"rt":0,"ru":"https://www.nike.com/","st":"www.nike.com","th":169,"tu":"https://encrypted-tbn0.gstatic.com/images?q\\u003dtbn:ANd9GcQQAtNCsBlvuD_5pu9bKrTr-Sv5mMwD1-hZE9MS4Px4GKk05naP\\u0026s","tw":298}

We will parse it and extract the link and the URL of the original image from it.

    let images_results = [];
    $("div.rg_bx").each((i, el) => {
        // Parse the rg_meta JSON once and reuse it
        const meta = JSON.parse($(el).find(".rg_meta").text());
        images_results.push({
            title: $(el).find(".iKjWAf .mVDMnf").text(),
            source: $(el).find(".iKjWAf .FnqxG").text(),
            link: meta.ru,
            original: meta.ou,
        });
    });
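
JSON.parse throws if the rg_meta text is empty or malformed, which would stop the whole loop on one bad result. As a minimal sketch, a defensive wrapper could look like the following. The helper name parseMeta is hypothetical, and the sample string is a shortened copy of the rg_meta JSON shown above that keeps only the ru, ou, and tu keys.

```javascript
// Hypothetical helper: returns null instead of throwing on a bad string.
const parseMeta = (jsonString) => {
  try {
    const meta = JSON.parse(jsonString);
    return { link: meta.ru, original: meta.ou, thumbnail: meta.tu };
  } catch (err) {
    return null;
  }
};

// Shortened sample modelled on the rg_meta string shown above.
const sampleMeta =
  '{"ou":"https://c.static-nike.com/a/images/w_1920,c_limit/bzl2wmsfh7kgdkufrrjq/image.jpg",' +
  '"ru":"https://www.nike.com/",' +
  '"tu":"https://encrypted-tbn0.gstatic.com/images?q\\u003dtbn:ANd9GcQQAtNCsBlvuD_5pu9bKrTr-Sv5mMwD1-hZE9MS4Px4GKk05naP\\u0026s"}';

console.log(parseMeta(sampleMeta)); // JSON.parse turns \u003d into "=" and \u0026 into "&"
console.log(parseMeta("")); // null, because an empty string is not valid JSON
```

Inside the .each() loop, a result whose parseMeta() call returns null can simply be skipped.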

Finally, we will find the thumbnail URL. If you look at the HTML, there is an img tag under the first anchor tag that holds the thumbnail URL in either its src or its data-src attribute.

Now, our parser looks like this:

    let images_results = [];
    $("div.rg_bx").each((i, el) => {
        // Parse the rg_meta JSON once and reuse it
        const meta = JSON.parse($(el).find(".rg_meta").text());
        const img = $(el).find(".rg_l img");
        images_results.push({
            title: $(el).find(".iKjWAf .mVDMnf").text(),
            source: $(el).find(".iKjWAf .FnqxG").text(),
            link: meta.ru,
            original: meta.ou,
            // Use src when present, otherwise fall back to data-src
            thumbnail: img.attr("src") || img.attr("data-src"),
        });
    });

Results:

[
{
title: 'Shoes, Clothing & Accessories. Nike ...',
source: 'www.nike.com',
link: 'https://www.nike.com/in/men',
original: 'https://c.static-nike.com/a/images/w_1920,c_limit/mdbgldn6yg1gg88jomci/image.jpg',
thumbnail: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTCsZWS0YPC1NFXd4g_Ucn4jkz8VYxL4VbLvWfKa5QI3PKRuHc&s'
},
{
title: 'Nike. Just Do It. Nike.com',
source: 'www.nike.com',
link: 'https://www.nike.com/',
original: 'https://static.nike.com/a/images/f_jpg,q_auto:eco/61b4738b-e1e1-4786-8f6c-26aa0008e80b/swoosh-logo-black.png',
thumbnail: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRbbeIzjUozRCMzN8gaujUFBJlIFHheriDFvKhSCMD84JL8KeuX&s'
},
....

Here is the full code:

    const unirest = require("unirest");
    const cheerio = require("cheerio");

    const getImagesData = () => {
        // Pick a random User-Agent for this request
        const selectRandom = () => {
            const userAgents = [
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
            ];
            const randomNumber = Math.floor(Math.random() * userAgents.length);
            return userAgents[randomNumber];
        };
        let user_agent = selectRandom();
        let header = {
            "User-Agent": user_agent,
        };
        return unirest
            .get(
                "https://www.google.com/search?q=nike&oq=nike&hl=en&tbm=isch&asearch=ichunk&async=_id:rg_s,_pms:s,_fmt:pc&sourceid=chrome&ie=UTF-8"
            )
            .headers(header)
            .then((response) => {
                let $ = cheerio.load(response.body);

                let images_results = [];
                $("div.rg_bx").each((i, el) => {
                    // rg_meta holds a JSON string with the page link (ru) and the original image URL (ou)
                    const meta = JSON.parse($(el).find(".rg_meta").text());
                    const img = $(el).find(".rg_l img");
                    images_results.push({
                        title: $(el).find(".iKjWAf .mVDMnf").text(),
                        source: $(el).find(".iKjWAf .FnqxG").text(),
                        link: meta.ru,
                        original: meta.ou,
                        // Use src when present, otherwise fall back to data-src
                        thumbnail: img.attr("src") || img.attr("data-src"),
                    });
                });

                console.log(images_results);
            });
    };

    getImagesData();

Method-2

In this method, we will use a simple GET request to fetch the first page results of Google Images. So, let us find the tags for the image results.

https://www.google.com/search?q=Badminton&gl=us&tbm=isch

First, we will find the tag for the title. In the page HTML, the title is an h3 under the div with the class name MSM1fd.

    const images_results = [];

    $(".MSM1fd").each((i, el) => {
        images_results.push({
            title: $(el).find("h3").text(),
        });
    });

Then we will find the tag for the source. The source of the image sits under the second anchor tag, which has the class name VFACy, inside the div with the class name MSM1fd. This anchor tag also contains our link. So, our parser would look like this:

    const images_results = [];

    $(".MSM1fd").each((i, el) => {
        images_results.push({
            image: $(el).find("img").attr("src") ? $(el).find("img").attr("src") : $(el).find("img").attr("data-src"),
            title: $(el).find("h3").text(),
            source: $(el).find("a.VFACy .fxgdke").text(),
            link: $(el).find("a.VFACy").attr("href")
        });
    });

The div contains only one img tag, so we can select it without specifying a class name.

Results

   {
    image: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSjxyuvqYQfybxq9F2XgME-ya6xb81WUyw3Dpa-YA40-Fy7fx0IlOhXIrK17kNON-r6vNs&usqp=CAU',
    title: 'Nike for Men - Shop New Arrivals - FARFETCH',
    source: 'farfetch.com',
    link: 'https://www.farfetch.com/in/shopping/men/nike/items.aspx'
    },
    {
    image: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSfCJOrZr0zFxogpQjNT_6kBQ3rmxSPqvCHPpTWLmpOltZinUpptGM-290ssFMCIzFnD1M&usqp=CAU',
    title: "Women's Clothing. Nike IN",
    source: 'nike.com',
    link: 'https://www.nike.com/in/w/womens-clothing-5e1x6z6ymx6'
    },
    ....

Note: You will also find some images with base64 URLs.
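
If you want to handle those base64 entries separately, here is a minimal sketch that tells a data URI apart from an ordinary URL and decodes its payload with Node's built-in Buffer. The helper name decodeDataUri is hypothetical, and the sample value is a tiny made-up payload rather than a real Google thumbnail.

```javascript
// Hypothetical helper: splits a base64 data URI into its MIME type and bytes.
// Returns null for ordinary http(s) URLs or missing values.
const decodeDataUri = (src) => {
  const match = /^data:([^;,]+);base64,(.+)$/s.exec(src || "");
  if (!match) return null;
  return { mime: match[1], bytes: Buffer.from(match[2], "base64") };
};

// Made-up sample: the payload is only the six-byte "GIF89a" file signature.
const sampleSrc = "data:image/gif;base64,R0lGODlh";
const decoded = decodeDataUri(sampleSrc);
console.log(decoded.mime, decoded.bytes.length);

// A normal thumbnail URL is left alone.
console.log(decodeDataUri("https://encrypted-tbn0.gstatic.com/images?q=tbn:abc"));
```

The decoded bytes could then be written to disk with fs.writeFileSync() if you need the image file itself.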

This method is fast, but it does not support pagination, while the first method does. Another option is the Puppeteer infinite-scrolling method, which solves the pagination problem but is very time-consuming.
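
Since Method-1 supports pagination through the ijn parameter, collecting several pages could be sketched as the loop below. This assumes a function that takes a page index and resolves to an array of results, for example a version of getImagesData changed to accept ijn and return images_results. The names collectPages and fetchPage are hypothetical, and a fake page source stands in for the real scraper so the sketch runs without network access.

```javascript
// Hypothetical helper: calls fetchPage(0), fetchPage(1), ... and stops at the
// first empty page or after maxPages pages, whichever comes first.
const collectPages = async (fetchPage, maxPages = 3) => {
  const all = [];
  for (let ijn = 0; ijn < maxPages; ijn++) {
    const results = await fetchPage(ijn);
    if (!results || results.length === 0) break;
    all.push(...results);
  }
  return all;
};

// Stand-in for a real scraper: two pages of results, then an empty page.
const fakePages = [[{ title: "a" }, { title: "b" }], [{ title: "c" }], []];
const fakeFetchPage = async (ijn) => fakePages[ijn] || [];

collectPages(fakeFetchPage, 5).then((results) => {
  console.log(results.length); // prints 3
});
```

Requesting pages one after another, rather than in parallel, also keeps the request rate low, which may reduce the chance of being blocked.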

Using Google Image API

Using Serpdog's Google Images API, you don’t have to worry about creating and maintaining the scraper, and you can scale the number of requests easily without getting blocked.

Example

    const axios = require('axios');

    axios.get('https://api.serpdog.io/images?api_key=APIKEY&q=football&gl=us')
        .then(response => {
            console.log(response.data);
        })
        .catch(error => {
            console.log(error);
        });

You can get your API key by registering with Serpdog.
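
Rather than pasting the key into the URL by hand, you could build the request URL from its parts. The sketch below uses only the three query parameters that appear in the example (api_key, q, gl). The helper name buildApiUrl and the environment variable SERPDOG_API_KEY are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the API request URL and encodes the query safely.
const buildApiUrl = (apiKey, query, country = "us") => {
  const params = new URLSearchParams({ api_key: apiKey, q: query, gl: country });
  return `https://api.serpdog.io/images?${params.toString()}`;
};

// SERPDOG_API_KEY is an assumed variable name; "APIKEY" is the placeholder from the example.
const apiKey = process.env.SERPDOG_API_KEY || "APIKEY";
console.log(buildApiUrl(apiKey, "american football"));
```

The resulting string can be passed to axios.get() exactly as in the example above. Keeping the key in an environment variable also keeps it out of version control.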

Results

      "image_results": [
      {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS_Tu78LWxIu_M_sN_kMfj2guqIbu2VcSLyI84CQGbuFRIyTCVR&s",
        "title": "Football - Wikipedia",
        "source": "en.wikipedia.org",
        "link": "https://en.wikipedia.org/wiki/Football",
        "original": "https://upload.wikimedia.org/wikipedia/commons/b/b9/Football_iu_1996.jpg",
        "rank": 1
      },
      {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTxvsz_pjLnFyCnYyCxxY5rSHQCHjNJyYGFZqhQUtTm0XOzOWw&s",
        "title": "Soft toy, American football/brown - IKEA",
        "source": "www.ikea.com · In stock",
        "link": "https://www.ikea.com/us/en/p/oenskad-soft-toy-american-football-brown-90506769/",
        "original": "https://www.ikea.com/us/en/images/products/oenskad-soft-toy-american-football-brown__0982285_pe815602_s5.jpg",
        "rank": 2
      },
      {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTNJYuYBLrUxLrXkbnP18Y6DEgKf_H4HYGCzecsGRAoFtkiGEM&s",
        "title": "NFL postpones three games due to Covid ...",
        "source": "www.cnbc.com",
        "link": "https://www.cnbc.com/2021/12/17/nfl-will-postpone-some-games-over-covid-surge-source-says.html",
        "original": "https://image.cnbcfm.com/api/v1/image/106991253-1639786378304-GettyImages-1185558312r.jpg?v=1639786403",
        "rank": 3
      },
      {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTg4WJ88A83JqXwGXsWtiB5qSoHrU_RukbrfXdkWggEKMsJ5Ro1&s",
        "title": "USFL schedule Week 2: What football ...",
        "source": "www.sportingnews.com",
        "link": "https://www.sportingnews.com/us/nfl/news/usfl-schedule-week-2-football-tv-channels-times-scores/oadvrtsc5vn9l4knu8hvnpo0",
        "original": "https://library.sportingnews.com/2022-04/usfl-football-042122-getty-ftr.jpg",
        "rank": 4
      },
      {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTsOPQ2-nzsTrdRK1HHXIqB-x96yuiPg7pwfJsm8mToVlw5-UaM&s",
        "title": "Why is football called 'football'?",
        "source": "www.newsnationnow.com",
        "link": "https://www.newsnationnow.com/us-news/hold-why-is-football-called-football/",
        "original": "https://www.newsnationnow.com/wp-content/uploads/sites/108/2022/02/FootballGettyImages-78457130.jpg?w=1280",
        "rank": 5
      },
      {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS0WimlY0Nykjy8k5An6k4wViDvyVZzo_0K1MSZNZcwDFGsugNW&s",
        "title": "The Duke NFL Football | Wilson Sporting ...",
        "source": "www.wilson.com · In stock",
        "link": "https://www.wilson.com/en-gb/product/the-duke-nfl-football-wf10011",
        "original": "https://www.wilson.com/en-gb/media/catalog/product/b/c/bc340309-c2a3-441d-ac36-a26187fd94f0_yceho2py9sgzklxk.png",
        "rank": 6
      },
      {
        "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ6ZuMtA_I4WZZKXPhiqMpIHsdKJX-SVkcN2yo1KAQ9SQuxdCI&s",
        "title": "National Football League",
        "source": "www.nfl.com",
        "link": "https://www.nfl.com/",
        "original": "https://static.www.nfl.com/image/private/t_editorial_landscape_mobile/f_auto/league/iaesayubxpbxmbxfwm3b.jpg",
        "rank": 7
      },
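
Once the response arrives, you will usually want only a few fields from it. As a small sketch, assuming response.data has the image_results shape shown above, the full-size image URLs can be pulled out in rank order. The helper name originalsByRank is hypothetical, and the sample reuses two entries from the results above.

```javascript
// Hypothetical helper: returns the full-size image URLs sorted by rank.
const originalsByRank = (data) =>
  [...(data.image_results || [])]
    .sort((a, b) => a.rank - b.rank)
    .map((item) => item.original);

// Two entries copied from the results above, deliberately out of order.
const sampleData = {
  image_results: [
    {
      title: "Soft toy, American football/brown - IKEA",
      original: "https://www.ikea.com/us/en/images/products/oenskad-soft-toy-american-football-brown__0982285_pe815602_s5.jpg",
      rank: 2,
    },
    {
      title: "Football - Wikipedia",
      original: "https://upload.wikimedia.org/wikipedia/commons/b/b9/Football_iu_1996.jpg",
      rank: 1,
    },
  ],
};

console.log(originalsByRank(sampleData));
```

In the axios example, the same helper would be called as originalsByRank(response.data) inside the .then() callback.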

Conclusion

In this tutorial, we learned to scrape Google Images results. If you have any questions, feel free to ask me in the comments. Follow me on Twitter. Thanks for reading!

Written by darshan12 | Serpdog is a Google Search API that allows you to access Google Search Results in real-time.
Published by HackerNoon on 2022/07/25