The Amazon product reviews page is an important source of information for customers because it provides insights into the quality and performance of products.
However, pages that require login access are difficult to scrape. Tools like Puppeteer make it possible to scrape data behind a login.
This tutorial will show you how to scrape Amazon product reviews behind a login, parse the raw data, and export the reviews to CSV along with a screenshot, using Node.js and Puppeteer.
Before you begin, make sure to install the following:
Node.js: Ensure you have Node.js installed on your computer.
Puppeteer: This Node library allows us to navigate the Amazon website, log in, and extract review information from the product page.
Papaparse: This library exports the data to a CSV file.
Amazon account: You need an Amazon account to access the product review page.
This step involves retrieving the full HTML of the Amazon public product page.
This HTML contains all the data displayed on the page. You can access it by using Puppeteer to automate web interactions, which means you can drive a Chrome browser programmatically without opening a visible window.
Amazon public product page URL: https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/
Now that you have the URL, the next step is to write a Node.js script that retrieves the HTML content of the page above.
Open your terminal and run the command below.
mkdir amazon-review
cd amazon-review
npm init -y
This will create a package.json file for your project.
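Next, install the two libraries from the prerequisites inside the project folder:
npm install puppeteer papaparse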
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
This defines an async function (so you can use await inside it), launches a headless Chromium browser using Puppeteer, and assigns it to the browser variable.
The browser variable then calls newPage(), which creates a new browser page (a single tab in the browser).
Next, call the page.goto() method:
await page.goto(
  'https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/',
  { timeout: 60000 }
);
This navigates to the URL added as an argument in the goto method. Sometimes, if the page is big and has a lot of information, it takes longer to load.
For this tutorial, you add the option { timeout: 60000 } to the page.goto() method to increase the timeout limit and give the page more time to load.
You increased the timeout to 60 seconds (60000 milliseconds).
const pageContent = await page.content();
The page.content() method returns the full HTML source code of the specified page, which is assigned to the pageContent variable.
console.log(pageContent);
This logs the HTML content to the console, which allows you to view the raw HTML.
await browser.close();
This closes the browser instance to free up memory and other resources.
Complete code
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the product page
  await page.goto(
    "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/",
    { timeout: 60000 }
  );

  // Retrieve the full HTML content of the page
  const pageContent = await page.content();

  // Output the HTML to the console
  console.log(pageContent);

  // Close the browser
  await browser.close();
})();
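Save the script to a file (index.js is an assumed name here) and run it with Node.js:
node index.js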
Example HTML output
This web scraping method allows you to extract data from a website that requires login access.
In this section, you will use Puppeteer to automate filling in your Amazon login details (email and password) and click the login button. Follow the steps below:
This instructs Puppeteer to navigate to the Amazon login URL:
await page.goto(
"https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.com%2Fcart%2Fadd-to-cart%2Fref%3Dnav_custrec_signin&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0"
);
After navigating to the login page, await page.waitForSelector("#ap_email") ensures the script proceeds only when the #ap_email element is present in the DOM.
Once the email input field is present, the script uses the page.type() method to simulate typing the Amazon email address into the input field, then clicks the continue button with await page.click("#continue-announce").
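Only the password half of this sequence appears in the code excerpt further down, so here is a minimal sketch of the email step, using the #ap_email and #continue-announce selectors described above:
await page.waitForSelector("#ap_email");
await page.type("#ap_email", ""); // Replace with your Amazon email
await page.click("#continue-announce");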
Next, the script uses await page.waitForSelector("#ap_password") to wait for the #ap_password element to appear; once the password input field is present, it uses page.type() to simulate typing the Amazon password.
Replace with your email and password.
After entering the password, the script finds the element with the id auth-signin-button
and simulates a click.
After clicking the sign-in button, the script uses await page.waitForNavigation() to wait for the login to complete:
await page.waitForSelector("#ap_password");
await page.type("#ap_password", ""); // Replace with your Amazon password
await page.click("#auth-signin-button");
// Wait for the login to complete
await page.waitForNavigation();
After logging in, the script navigates to the product page:
await page.goto(
  "https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/",
  { timeout: 60000 }
);
Now you can extract data from the Amazon product review page by doing the following.
The script uses await page.waitForSelector(".review") to wait until elements with the class review are present in the DOM; the extraction step then takes the first 10 of them.
These elements represent individual product reviews on the Amazon product page.
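For reference, that wait is a single call:
await page.waitForSelector(".review");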
Extract information from review elements.
After waiting for the information to load, this code block extracts information from the review elements. It retrieves the author's name, review text, and review date.
// Extract the author, text, and date from the first 10 review elements
const reviews = await page.$$eval(".review", (reviewElements) => {
  return reviewElements.slice(0, 10).map((review) => {
    const author = review.querySelector(".a-profile-name").textContent;
    const text = review.querySelector(".a-row.review-data").textContent;
    const date = review.querySelector(".review-date").textContent;
    return { author, text, date };
  });
});
Saving the extracted reviews in a structured format makes them easy to analyze; in this tutorial, you will export them to CSV.
Export the data to a CSV file.
This is where Papaparse comes in: the extracted review data is passed to the Papa.unparse() method, which converts it to CSV format. The code then writes the CSV data to amazon_reviews.csv using Node.js's built-in file system module (fs).
// At the top of your script, require the dependencies:
// const Papa = require("papaparse");
// const fs = require("fs");

const csvData = Papa.unparse(reviews);
fs.writeFileSync("amazon_reviews.csv", csvData);
Example CSV output
author,text,date
Whitney Brown,Verified Purchase,"Reviewed in the United States on July 7, 2021"
Eiskatze,Verified Purchase,"Reviewed in the United States on March 4, 2019"
Billy Johnson,Verified Purchase,"Reviewed in the United States on May 24, 2019"
Katelyn goering,Verified Purchase,"Reviewed in the United States on March 24, 2021"
Timothy A. Taylor,Verified Purchase,"Reviewed in the United States on March 19, 2022"
Sandra G.,Verified Purchase,"Reviewed in the United States on March 4, 2019"
Amazon Customer,Verified Purchase,"Reviewed in the United States on March 29, 2021"
Amra,Verified Purchase,"Reviewed in the United States on February 17, 2019"
pietro,Verified Purchase,"Reviewed in Italy on December 28, 2022"
OLIVIER DETRY,Verified Purchase,"Reviewed in France on December 26, 2020"
Then use the page.screenshot() method to take a screenshot of the review page and give the file a name, for example, screenshot.png.
await page.screenshot({ path: "screenshot.png" });
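If you want the entire scrollable page rather than just the viewport, Puppeteer's screenshot method also accepts a fullPage option (an optional tweak beyond the original script):
await page.screenshot({ path: "screenshot.png", fullPage: true });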
Output of the screenshot
You can check out the complete codebase in my GitHub repo: https://github.com/wise4rmgod/Amazon-webscrapping
Web scraping a website that requires a login is easy with tools like Puppeteer, and it's useful for businesses that want to scrape customer reviews, especially from a site like Amazon.
This tutorial covered scraping an Amazon product review page behind a login using Puppeteer, parsing the raw data, and exporting the reviews to CSV along with a screenshot.