
How to Scrape Amazon Reviews with and without Code

by Artur (@artursannikov) · June 20th, 2022

Too Long; Didn't Read

Web scraping is a great way to extract data from any webpage on the Internet and convert it into a computer-readable format such as a dataframe. Python libraries allow you to process the HTML behind any webpage and get the data you want. It took me around an hour to go from the idea of scraping Amazon reviews to the final dataset. Here I will show you how I did it. The dataset includes a title, rating, country, date, review text, and the number of users who found the review helpful.



You have probably mastered data analysis and visualization in Python; you have also learned some Natural Language Processing by investigating thousands of datasets available on Kaggle. But what about creating your own dataset, making your project unique and worth showing off in your portfolio? Doesn't that sound better than crunching overused datasets?


Fortunately, a technique called web scraping is at your service. I am sure you have heard about it before. But if you have not, it is a great way to extract data from any webpage on the Internet and convert it into a computer-readable format such as a pandas dataframe. Many Python libraries allow you to process the HTML behind any webpage and get the data you want; the most notable examples are BeautifulSoup and Scrapy. And do not forget to learn some requests, as well as what GET and POST mean in web data transfer.
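
To make this concrete, here is a minimal sketch of the idea: a single GET request followed by HTML parsing. It targets the practice site mentioned later in this article rather than a real product page, and the selector reflects that site's markup.

import requests
from bs4 import BeautifulSoup

# Fetch the page with an HTTP GET request
r = requests.get("https://books.toscrape.com")

# Parse the returned HTML
soup = BeautifulSoup(r.text, "html.parser")

# On this site, every book title sits in an <h3><a title="..."> tag
titles = [a["title"] for a in soup.select("h3 > a")]
print(titles[:5])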


On top of that, every website is different, and you will have to grasp at least the basics of HTML to write efficient and correct code. And what if you manage a business and need to scrape data from a dozen websites, integrate it, and use it to drive your decision process? It all quickly becomes very complicated and unsettling.


Fortunately, there are no-code solutions that automate the process of web scraping and make it pleasant and accessible to everyone. One of them is Octoparse. It took me around an hour to go from the idea of scraping Amazon reviews to the final dataset. Here I will show you how I did it.

Why Would You Need Amazon Reviews?

Let's imagine you want to buy a set of headphones. You searched Amazon and found a pair of well-reviewed, highly rated headphones. However, you are still not sure if this is the best choice, so you want to understand their most common issues and merits by reading reviews. But there are more than one thousand of them! You can, of course, filter by the number of stars, but it is still a very laborious task to go through them all. And what if you have ten headsets to choose from? It may take you days or even weeks to make a data-driven choice. You can overcome this difficulty by scraping the reviews and processing them through NLP.
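
As a rough sketch of what such an NLP pass could look like (assuming a reviews dataset shaped like the one built later in this article), you could simply count the most frequent words in low-rated reviews to surface common complaints:

from collections import Counter

import pandas as pd

# Assumes the amazon_reviews.csv file produced later in this article
df = pd.read_csv("amazon_reviews.csv")

# Very crude tokenization of the text of low-rated reviews
words = (
    df.loc[df["rating"] <= 2.0, "review"]
    .str.lower()
    .str.split()
    .explode()
    .dropna()
)

# Longer words tend to be more informative than stopwords
print(Counter(w for w in words if len(w) > 4).most_common(10))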

Code-based Solution

Fortunately, Amazon is a pretty popular website, so there are many tutorials on how to scrape it. I did not want to reinvent the wheel, so I used this great tutorial by Domenic Fayad as a basis for my own code. He himself adapted the code from John Watson Rooney, available on GitHub.


Before getting started, these are the columns I would like to have in my final dataset:

  1. Title
  2. Rating
  3. Country
  4. Date
  5. Review text
  6. Number of users who found the review helpful


Some ideas for a more fine-grained view of the data include filtering by date, to see whether opinions changed over time, and by country, for a spatial overview; a sketch of both follows.
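
Once the final CSV exists, both views take only a few lines of pandas. This is just a sketch, assuming the amazon_reviews.csv file produced by the code below:

import pandas as pd

df = pd.read_csv("amazon_reviews.csv", parse_dates=["date"])

# Spatial overview: average rating per country
print(df.groupby("country")["rating"].mean())

# Temporal view: monthly average rating, to spot shifts in opinion
print(df.set_index("date")["rating"].resample("M").mean())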


Below is a long and fairly complicated piece of Python code.

import re
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup
from dateutil import parser

def parse_html(url):
    """
    Parses HTML and returns soup.

    Parameter
    ---------
    url: str
        URL to parse.

    Returns
    --------
    bs4.BeautifulSoup
        Parsed HTML content.
    """
    # Headers to send to server
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0",
        "Accept-Language": "en-US, en;q=0.5",
    }

    # Get a response from server
    r = requests.get(url, headers=HEADERS)

    # Parse HTML
    soup = BeautifulSoup(r.text, "html.parser")

    return soup


# List to store review data
reviewlist = []

# Pattern to match a country
country_pat = r"(?<=Reviewed\sin\s)(.*)(?=\son)"


def extract_date(string):
    """
    Extracts date from a string.
    """
    # Parse string and extract datetime object
    dt = parser.parse(string, fuzzy=True)

    return datetime.strftime(dt, "%Y/%m/%d")


def scrape_reviews(soup):
    """
    Scrape information about each review.
    
    Parameter
    ----------
    soup: bs4.BeautifulSoup
        Parsed HTML content
    
    Returns
    -------
    None
        Appends each review as a dictionary to the pre-defined list
        reviewlist.
    """
    reviews = soup.find_all("div", {"data-hook": "review"})
    for review in reviews:
        try:
            reviewlist.append(
                {
                    "title": review.find(
                        "a", {"data-hook": "review-title"}
                    ).text.strip(),
                    "rating": float(
                        review.find("i", {"data-hook": "review-star-rating"})
                        .text.replace("out of 5 stars", "")
                        .strip()
                    ),
                    "country": re.search(
                        country_pat,
                        review.find("span", {"data-hook": "review-date"}).text.strip(),
                    ).group(1),
                    "date": extract_date(
                        review.find("span", {"data-hook": "review-date"}).text.strip()
                    ),
                    "review": review.find(
                        "span", {"data-hook": "review-body"}
                    ).text.strip(),
                    "num_helpful": int(
                        review.find("span", {"data-hook": "helpful-vote-statement"})
                        .text.replace(" people found this helpful", "")
                        .strip()
                    ),
                }
            )
        except (AttributeError, ValueError):
            # A missing field (e.g., a review with no helpful votes yet)
            # skips only that review instead of aborting the whole page,
            # as the original bare try/except around the loop did
            continue

# Scrape Amazon reviews from multiple pages
for i in range(1, 145):
    # URL
    url = f"https://www.amazon.com/Wireless-Bluetooth-Headphones-Foldable-Headsets/product-reviews/B08VNFD8FS/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber={i})"

    print(f"Scraping page {i}")

    # Parse HTML content
    soup = parse_html(url)

    # Get reviews
    scrape_reviews(soup)

# Save review data to a csv file
pd.DataFrame(reviewlist).to_csv("amazon_reviews.csv", index=False)


It took me around five hours to write and debug the above code, despite my previous knowledge of web scraping and the great tutorial. I also had to hard-code the number of pages I wanted to scrape because figuring out how to paginate on Amazon was taking me too much time. Now imagine you have to write code for a dozen websites. It may take days before you finally start analyzing data and building models.
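
For what it is worth, one common way around the hard-coded page count is to keep requesting pages until one comes back without reviews. This is only a sketch of that approach, reusing the parse_html() and scrape_reviews() functions defined above:

i = 1
while True:
    url = (
        "https://www.amazon.com/Wireless-Bluetooth-Headphones-Foldable-Headsets/"
        "product-reviews/B08VNFD8FS/ref=cm_cr_dp_d_show_all_btm"
        f"?ie=UTF8&reviewerType=all_reviews&pageNumber={i}"
    )
    soup = parse_html(url)

    # No review blocks on the page means we have run past the last page
    if not soup.find_all("div", {"data-hook": "review"}):
        break

    scrape_reviews(soup)
    i += 1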

No-Code Solution

Now let's have a look at the no-code solution offered by Octoparse. Their tool enables you to easily scrape any webpage on the Internet, from the simplest sample scraping sites like https://books.toscrape.com to ones powered by JavaScript.


First, sign up on their website, then download and install the software. I followed their comprehensive tutorial, which I strongly recommend reading. They provide a test page to try the tool out, but here I will focus on scraping Amazon reviews. The tool offers tons of templates that allow you to start scraping straight away, for example, product information on Amazon. But I will be using Advanced Mode, which offers more flexibility and is the usual way to scrape whatever website you like. OK, let's start!


In the sidebar, click "New" and then "Advanced Mode". This will open a new tab, in which you can insert URLs you'd like to scrape.


Insert the URL and click "Save". Octoparse will automatically load the page and suggest the data to scrape. It was pretty good at figuring out what data I wanted, although I needed to clean it up a bit.

Thus, before creating a workflow, let's remove the columns we do not need. Just hover the cursor over the column name, and click on the icon of a trash bin.

Now it is time to rename the fields. Double-click on the column name you want to rename and insert the new title.

Now it is time to create a workflow; every task in Octoparse is basically a workflow. In this case, the algorithm automatically figured out that the website has multiple pages and that we need to paginate and extract the data from each one. You can see the workflow on the right-hand side of the window.

Cleaning the Data

Great! However, we still need to clean the data a little bit, for example, to extract ratings as numbers. Hover the mouse cursor over the right side of this column and you will see three dots; click on them. A menu will show up, where you have to click on "Clean data". We can now add the different steps involved in data cleaning. In this case, we just have to match the pattern "digit.digit" (like 5.0) using a RegEx. Click on "Add step", and then "Replace with Regular Expression". A new window will pop up, where we insert the pattern \d\.\d (the dot is escaped so it matches a literal period). You can check that it works correctly by clicking the "Evaluate" button. As you can see below, it worked! Thus, click "Confirm". Once the window closes, click "Apply" on the right-hand side.
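
For reference, the same escaped pattern behaves identically in Python's re module, which is an easy way to sanity-check it outside Octoparse:

import re

# \d\.\d matches a digit, a literal period, and another digit
print(re.search(r"\d\.\d", "5.0 out of 5 stars").group())  # 5.0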

You now see that the column values changed to match the pattern. To practice, perform a similar task with the number of people who found this review helpful.


The only thing that is left is to extract the country name and the date. Before doing that, though, duplicate the column by clicking on the same three dots near the column name and selecting "Copy". We do this because we need one column for the country and one for the date.

To extract the country, you basically have to match a RegEx pattern as previously, but this time it is a bit more complicated: (?<=Reviewed\sin\s)(.*)(?=\son).
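
Again, you can sanity-check the pattern in Python before pasting it into Octoparse; the lookbehind and lookahead trim everything except the country name:

import re

# A sample review-date string of the kind Amazon displays
text = "Reviewed in the United States on June 20, 2022"

# (?<=Reviewed\sin\s) requires "Reviewed in " before the match and
# (?=\son) requires " on" after it, so only the country is captured
print(re.search(r"(?<=Reviewed\sin\s)(.*)(?=\son)", text).group(1))
# the United States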


Date extraction is even easier because Octoparse automatically determines its presence. In the data cleaning steps menu, select "Reformat extracted date/time". Then, in the popped-up window, choose the date format you want, and that's it!
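
This mirrors what the extract_date() helper did in the Python version; a quick check with dateutil's fuzzy parsing shows the same idea:

from dateutil import parser

# fuzzy=True skips the tokens that are not part of a date
print(parser.parse("Reviewed in Italy on June 20, 2022", fuzzy=True))
# 2022-06-20 00:00:00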

Checking and Running the Workflow

Before starting the scraping process, let's see if the pagination works as expected. On the workflow scheme, click on "Pagination". This action should highlight the "Next page" button at the bottom of the page. Next, click on "Click to...". If the page changed to the next one, everything works fine.


We are now ready to start a workflow run. Click on "Run" at the top-right corner. You have two options: "Run on your device" (slower) or "Run in the cloud". If you want to extract just a small piece of data, running on your device should suffice. However, for bigger datasets, running in the cloud is a more reasonable option.

On the dashboard, you will see the running task. You can now safely close the program and even switch off your computer. Once the task is completed, you can export the data into a format like CSV and start investigating it with your preferred programming language and tools.

Wrapping Up

Generally speaking, the Python option is slower to write and debug but faster to run. Octoparse, on the other hand, offers a very straightforward path to the scraped data, though its run times are slightly slower. Its main advantage shows when you have to scrape multiple websites and do not want to rewrite Python code from scratch every time.


So, make sure you have thoroughly thought about these options and understood your end goal before opting for one or the other.