The Amazon product reviews page is valuable to customers because it offers insight into the quality and performance of products. However, it is difficult to scrape when the data sits behind a login. Tools like Puppeteer are useful for scraping data behind a login.

This tutorial will show you how to scrape Amazon product reviews behind a login, parse the raw data, and export the reviews as CSV with a screenshot using Node.js and Puppeteer.

Prerequisites

Before you begin, make sure you have the following:

NodeJs: Ensure you have Node.js installed on your computer.
Puppeteer: This Node library allows us to navigate the Amazon website, log in, and extract review information from the product page.
Papaparse: This library exports the data to a CSV file.
Amazon account: You need an Amazon account to access the product review page.

1: Get Access to the Public Page

This step involves retrieving the full HTML of the Amazon public page. This HTML contains all the data displayed on the page, and you can access it by using Puppeteer to automate web interaction, which means you can drive a Chrome browser in headless mode without opening a visible browser window.

Amazon public product page URL: https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/

Now that you have the URL, the next thing to do is write a Node.js script that will retrieve the HTML content of the page above.

Create a new folder for your project and initialize a Node project. Open your terminal and run the commands below.

mkdir amazon-review
cd amazon-review
npm init -y
npm install puppeteer papaparse

The first command creates a package.json file for your project, and the second installs the two libraries this tutorial uses.

Import the Puppeteer library.

const puppeteer = require('puppeteer');

Set up a headless browser instance.

(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();

This opens an async function so that you can use await, launches a headless Chromium browser using Puppeteer, and assigns it to the browser variable. Calling browser.newPage() then creates a new browser page (a single tab in the browser).

Navigate to the product page using the page.goto() function.

await page.goto('https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/', { timeout: 60000 });

This navigates to the URL passed as an argument to the goto method. Sometimes, if the page is big and has a lot of information, it takes longer to load. For this tutorial, you will pass the { timeout: 60000 } option to page.goto() to give the page more time to load. This raises the navigation timeout from Puppeteer's default of 30 seconds to 60 seconds (60000 milliseconds).

Extract the full HTML content of the page.

const pageContent = await page.content();

The page.content() method returns the full HTML source of the page, which is assigned to the pageContent variable.

Output the HTML content in your console.

console.log(pageContent);

This logs the HTML content to the console, which allows you to view the raw HTML.

Close the browser.

await browser.close();

This closes the browser instance to free up memory and other resources.

Complete code

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the product page
  await page.goto(
    "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/",
    { timeout: 60000 }
  );

  // Retrieve the full HTML content of the page
  const pageContent = await page.content();

  // Output the HTML to the console
  console.log(pageContent);

  // Close the browser
  await browser.close();
})();

[Image: example HTML output]

2: Scrape Behind the Login

This web scraping method allows you to extract data from a website that requires login access. In this section, you will use Puppeteer to automate filling in your Amazon login details (email and password) and clicking the login button. Follow the steps below:

Log in to your Amazon account

This instructs Puppeteer to navigate to the Amazon login URL.

await page.goto(
"https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.com%2Fcart%2Fadd-to-cart%2Fref%3Dnav_custrec_signin&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0"
);

Fill in the login details and click the login button

After navigating to the login page, the script calls await page.waitForSelector("#ap_email"), which ensures that it proceeds only when the #ap_email element is present in the DOM.

Once the email input field is present, the script uses the page.type method to simulate typing the Amazon email address into the input field and then clicks the Continue button using the page.click() method: await page.click("#continue-announce").

Next, the script calls await page.waitForSelector("#ap_password") to wait for the #ap_password element to appear. Once the password input field is present, the script uses page.type to simulate typing the Amazon password. Replace the empty strings with your email and password.

After entering the password, the script finds the element with the id auth-signin-button and simulates a click. After clicking the sign-in button, the script calls await page.waitForNavigation(); to wait for the login to complete.

await page.waitForSelector("#ap_password");
await page.type("#ap_password", ""); // Replace with your Amazon password
await page.click("#auth-signin-button");
// Wait for the login to complete
await page.waitForNavigation();

Navigate to the product review page

After logging in, the script navigates to the product page using page.goto:

await page.goto(
"https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/",
{ timeout: 60000 }
);

3: Parse Review Data

This step extracts data from the Amazon product review page as follows.

Wait for reviews to load

The script calls page.waitForSelector(".review"); to wait until an element with the class name "review" is present. These elements represent individual product reviews on the Amazon product page, and the script reads at most the first 10 of them.

Extract information from review elements

After waiting for the reviews to load, this code block extracts information from the review elements. It retrieves the author's name, review text, and review date.

const reviews = await page.$$eval(".review", (reviewElements) => {
  return reviewElements.slice(0, 10).map((review) => {
    const author = review.querySelector(".a-profile-name").textContent;
    const text = review.querySelector(".a-row.review-data").textContent;
    const date = review.querySelector(".review-date").textContent;
    return { author, text, date };
  });
});

4: Export Reviews to CSV

This step saves the reviews in a format that is easy to analyze; in this tutorial, you will export them to CSV.

Export the data to a CSV file

This is where Papaparse comes in. The review data extracted in the previous step is passed to the Papa.unparse() method, which converts it to CSV format. The code then writes the CSV data to amazon_reviews.csv using Node.js's built-in file system module (fs). Both modules need to be imported at the top of the script.

const Papa = require("papaparse");
const fs = require("fs");

const csvData = Papa.unparse(reviews);
fs.writeFileSync("amazon_reviews.csv", csvData);

Output of CSV

author,text,date
Whitney Brown,Verified Purchase,"Reviewed in the United States on July 7, 2021"
Eiskatze,Verified Purchase,"Reviewed in the United States on March 4, 2019"
Billy Johnson,Verified Purchase,"Reviewed in the United States on May 24, 2019"
Katelyn goering,Verified Purchase,"Reviewed in the United States on March 24, 2021"
Timothy A. Taylor,Verified Purchase,"Reviewed in the United States on March 19, 2022"
Sandra G.,Verified Purchase,"Reviewed in the United States on March 4, 2019"
Amazon Customer,Verified Purchase,"Reviewed in the United States on March 29, 2021"
Amra,Verified Purchase,"Reviewed in the United States on February 17, 2019"
pietro,Verified Purchase,"Reviewed in Italy on December 28, 2022"
OLIVIER DETRY,Verified Purchase,"Reviewed in France on December 26, 2020"

Then use the page.screenshot method to take a screenshot of the review page and give the file a name. For example, you can use screenshot.png.

await page.screenshot({ path: "screenshot.png" });

[Image: output of the screenshot]

The complete codebase

You can check out the complete codebase on my GitHub repo: https://github.com/wise4rmgod/Amazon-webscrapping

Conclusion

Web scraping a website that requires a login is straightforward with tools like Puppeteer, and it's useful for businesses that want to scrape customer reviews, especially from a site like Amazon. This tutorial covered scraping an Amazon product page behind a login using Puppeteer.
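
Going further: cleaning the scraped rows

The date column in the CSV output holds a full sentence such as "Reviewed in the United States on July 7, 2021", which is awkward to sort or filter. The sketch below is one possible post-processing step you could run on the reviews array before calling Papa.unparse(). The helper names parseReviewDate and normalizeReview are hypothetical and are not part of the original script, and the regular expression assumes the English date format shown in the sample output.

```javascript
// Hypothetical helpers (not part of the original script).
// Assumption: dates look like "Reviewed in <country> on <Month> <day>, <year>".
const MONTHS = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

// Split the raw ".review-date" text into a country and an ISO date string.
function parseReviewDate(raw) {
  const match =
    /^Reviewed in (?:the )?(.+?) on ([A-Za-z]+) (\d{1,2}), (\d{4})$/.exec(
      raw.trim()
    );
  if (!match) {
    // Unknown format: keep the row but leave both fields empty.
    return { country: null, isoDate: null };
  }
  const [, country, monthName, day, year] = match;
  const monthIndex = MONTHS.indexOf(monthName);
  if (monthIndex === -1) {
    return { country, isoDate: null };
  }
  const month = String(monthIndex + 1).padStart(2, "0");
  return { country, isoDate: `${year}-${month}-${day.padStart(2, "0")}` };
}

// Trim whitespace and replace the raw date sentence with structured fields.
function normalizeReview(review) {
  const { country, isoDate } = parseReviewDate(review.date);
  return {
    author: review.author.trim(),
    text: review.text.trim(),
    country,
    date: isoDate,
  };
}

// Two rows copied from the sample CSV output in this tutorial.
const sample = [
  {
    author: "Whitney Brown",
    text: "Verified Purchase",
    date: "Reviewed in the United States on July 7, 2021",
  },
  {
    author: "pietro",
    text: "Verified Purchase",
    date: "Reviewed in Italy on December 28, 2022",
  },
];

const normalized = sample.map(normalizeReview);
console.log(normalized);
```

With the two sample rows, the first record ends up with country "United States" and date "2021-07-07", and the second with country "Italy" and date "2022-12-28". In the scraper itself you would pass reviews.map(normalizeReview) to Papa.unparse() instead of reviews.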

How to Scrape Amazon Product Reviews Behind a Login
