How to Scrape Amazon Product Reviews Behind a Login

Written by wise4rmgod | Published 2023/10/31
Tech Story Tags: webscraping | webscraping-by-ai | webscraping-amazon-products | amazon-webscraping-guide | how-to-scrape-amazon | amazon-webscraping-code | amazon-product-scraping-code | scraping-amazon-reviews

TL;DR: This tutorial will show you how to scrape an Amazon product review behind a login, parse the raw data, and export the reviews as CSV with a screenshot using Node.js and Puppeteer.

The Amazon product reviews page shows important information to customers because it provides insight into the quality and performance of products.

However, it is harder to scrape when login access is required. Tools like Puppeteer make it possible to scrape data behind a login.

This tutorial will show you how to scrape an Amazon product review behind login, parse the raw data, and export the reviews as CSV with a screenshot using Node.js and Puppeteer.

Prerequisites

Before you begin, make sure to install the following:

  • Node.js: Ensure you have Node.js installed on your computer.

  • Puppeteer: This Node library allows us to navigate the Amazon website, log in, and extract review information from the product page.

  • Papaparse: This library exports the data to a CSV file.

  • Amazon account: You need an Amazon account to access the product review page.

1: Get Access to the Public Page

This step involves retrieving the full HTML of the Amazon public page.

This HTML contains all the data displayed on the page. You can access it with Puppeteer, which automates Chrome browser interactions without requiring you to launch the browser manually.

Amazon public product page URL: https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/

Now that you have the URL, the next step is to write a Node.js script that retrieves the HTML content of the page above.

  • Create a new folder for your project and initialize a Node project.

Open your terminal and run the command below.

mkdir amazon-review
cd amazon-review
npm init -y
npm install puppeteer papaparse

This will create a package.json file for your project.

  • Import the Puppeteer library.
const puppeteer = require('puppeteer'); 

  • Set up a headless browser instance.
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

This starts an async immediately invoked function expression (IIFE) so the body can use await, launches a headless Chromium browser with Puppeteer, and assigns it to the browser variable.

The browser.newPage() call then creates a new browser page (that is, a single tab in the browser).
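Independent of Puppeteer, the async IIFE pattern used here can be sketched on its own; the Promise.resolve value below is just a stand-in for real async work such as launching a browser:

```javascript
// An async IIFE: define an async function and invoke it immediately,
// so the body can use await. The IIFE itself evaluates to a promise.
const done = (async () => {
  // Stand-in for real async work such as launching a browser.
  const value = await Promise.resolve(42);
  return value;
})();

done.then((value) => console.log("resolved with", value));
```

Because the IIFE returns a promise, any code that depends on its result can chain on it with .then() or await it from another async function.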

  • Navigate to the product page using the page.goto() function
await page.goto(
  "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/",
  { timeout: 60000 }
);

This navigates to the URL passed as an argument to the goto() method. Large pages with a lot of content can take longer to load.

For this tutorial, you will pass the option { timeout: 60000 } to the page.goto() method to increase the timeout limit and give the page more time to load.

This raises the timeout to 60 seconds (60,000 milliseconds).

  • Extract the full HTML content of the page
const pageContent = await page.content();

The page.content() method returns the full HTML source code of the page, which is assigned to the pageContent variable.

  • Output the HTML content in your console.
console.log(pageContent);

This logs the HTML content to the console, which allows you to view the raw HTML.

  • Close the browser

await browser.close();

This closes the browser instance to free up system resources.

Complete code

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the product page
  await page.goto(
    "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/",
    { timeout: 60000 }
  );

  // Retrieve the full HTML content of the page
  const pageContent = await page.content();

  // Output the HTML to the console
  console.log(pageContent);

  // Close the browser
  await browser.close();
})();

Example HTML output

2: Scrape Behind the Login

This web scraping method allows you to extract data from a website that requires login access.

In this section, you will use Puppeteer to automate filling in your Amazon login details (email and password) and click the login button. Follow the steps below:

  • Log in to your Amazon account

This instructs Puppeteer to navigate to the Amazon login URL:

await page.goto(
"https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.com%2Fcart%2Fadd-to-cart%2Fref%3Dnav_custrec_signin&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0"
);

  • Fill in the login details and click the login button

After navigating to the login page, the script calls await page.waitForSelector("#ap_email") to ensure it proceeds only when the #ap_email element is present in the DOM.

Once the email input field is present, the script uses the page.type() method to simulate typing your Amazon email address into the input field, then clicks the continue button with await page.click("#continue-announce").

Next, the script calls await page.waitForSelector("#ap_password") to wait for the #ap_password element to appear; once the password input field is present, page.type() simulates typing your Amazon password. Replace the empty strings in the code with your own email and password.

After entering the password, the script finds the element with the ID auth-signin-button and simulates a click, then calls await page.waitForNavigation() to wait for the login to complete.

await page.waitForSelector("#ap_email");
await page.type("#ap_email", ""); // Replace with your Amazon email
await page.click("#continue-announce");

await page.waitForSelector("#ap_password");
await page.type("#ap_password", ""); // Replace with your Amazon password
await page.click("#auth-signin-button");

// Wait for the login to complete
await page.waitForNavigation();

  • Navigate to the product review page.

After logging in, the script navigates to the product review page:

await page.goto(
  "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/",
  { timeout: 60000 }
);

3: Parse Review Data

This allows you to extract data from the Amazon product review page by doing the following.

  • Wait for reviews to load

The script calls await page.waitForSelector(".review"); to wait until at least one element with the review class is present in the DOM.

These elements represent individual product reviews on the Amazon product page; the next step limits extraction to the first 10 of them.

  • Extract information from review elements.

After waiting for the information to load, this code block extracts information from the review elements. It retrieves the author's name, review text, and review date.

const reviews = await page.$$eval(".review", (reviewElements) => {
  return reviewElements.slice(0, 10).map((review) => {
    const author = review.querySelector(".a-profile-name").textContent;
    const text = review.querySelector(".a-row.review-data").textContent;
    const date = review.querySelector(".review-date").textContent;
    return { author, text, date };
  });
});
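The slicing and mapping inside the $$eval callback runs in the browser, but the same shaping logic can be demonstrated on plain objects. The sample data below is illustrative, not real review content:

```javascript
// Stand-ins for DOM review elements (illustrative only).
const reviewElements = [
  { name: "Reviewer A", body: "Great sound", posted: "July 7, 2021" },
  { name: "Reviewer B", body: "Comfortable fit", posted: "March 4, 2019" },
];

// Take at most the first 10 entries and shape each into { author, text, date },
// mirroring the callback passed to page.$$eval.
const reviews = reviewElements.slice(0, 10).map((review) => ({
  author: review.name,
  text: review.body,
  date: review.posted,
}));

console.log(reviews);
```

slice(0, 10) simply returns a shorter copy when fewer than 10 elements exist, so the code is safe on pages with only a handful of reviews.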

4: Export Reviews to CSV

Saving the reviews in a structured format makes them easy to analyze; in this tutorial, you will export them to CSV.

  • Export the data to a CSV file.

This is where Papaparse comes in. The review data extracted in the previous step is passed to the Papa.unparse() method, which converts it to CSV format.

The script then writes the CSV data to amazon_reviews.csv using Node.js's built-in file system module (fs).

const Papa = require("papaparse");
const fs = require("fs");

const csvData = Papa.unparse(reviews);
fs.writeFileSync("amazon_reviews.csv", csvData);

Output of CSV

author,text,date
Whitney Brown,Verified Purchase,"Reviewed in the United States on July 7, 2021"
Eiskatze,Verified Purchase,"Reviewed in the United States on March 4, 2019"
Billy Johnson,Verified Purchase,"Reviewed in the United States on May 24, 2019"
Katelyn goering,Verified Purchase,"Reviewed in the United States on March 24, 2021"
Timothy A. Taylor,Verified Purchase,"Reviewed in the United States on March 19, 2022"
Sandra G.,Verified Purchase,"Reviewed in the United States on March 4, 2019"
Amazon Customer,Verified Purchase,"Reviewed in the United States on March 29, 2021"
Amra,Verified Purchase,"Reviewed in the United States on February 17, 2019"
pietro,Verified Purchase,"Reviewed in Italy on December 28, 2022"
OLIVIER DETRY,Verified Purchase,"Reviewed in France on December 26, 2020"
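Papa.unparse() handles quoting and edge cases for you. Purely for illustration, a minimal hand-rolled CSV writer in plain Node.js might look like this, assuming simple fields where wrapping each value in double quotes (and doubling any embedded quotes) is sufficient:

```javascript
// Illustrative review objects (not real scraped data).
const reviews = [
  { author: "Reviewer A", text: "Verified Purchase", date: "Reviewed on July 7, 2021" },
  { author: "Reviewer B", text: "Verified Purchase", date: "Reviewed on March 4, 2019" },
];

// Build a header row from the object keys, then one quoted row per review.
const header = Object.keys(reviews[0]).join(",");
const rows = reviews.map((r) =>
  Object.values(r)
    .map((v) => `"${String(v).replace(/"/g, '""')}"`)
    .join(",")
);
const csvData = [header, ...rows].join("\n");

console.log(csvData);
```

A library like Papaparse is still the safer choice for real data, since it also handles newlines inside fields and other CSV corner cases.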

Next, use the page.screenshot() method to take a screenshot of the review page and give it a file name. For example, you can use screenshot.png.

await page.screenshot({ path: "screenshot.png" });

Output of the screenshot

The complete codebase

You can check out the complete codebase on my Github Repo: https://github.com/wise4rmgod/Amazon-webscrapping

Conclusion

Web scraping a website that requires a login is straightforward with tools like Puppeteer, and it's useful for businesses that want to collect customer reviews, especially from a site like Amazon.

This tutorial covers scraping an Amazon product page behind a login using Puppeteer.


Written by wise4rmgod | A passionate, highly organized, and innovative open-source technical documentation engineer with 4+ years of experience.
Published by HackerNoon on 2023/10/31