How To Scrape Product Information From Amazon Listings With Python & BeautifulSoup [Tutorial]

Written by andreas-a | Published 2019/10/16
Tech Story Tags: web-scraping | software-development | tutorial | python | amazon | datasets | scrape-amazon-data | website-scraping-tools

TLDR: This tutorial will walk you through the basic steps of scraping Amazon product information with Python and BeautifulSoup. Python is well suited for this task, since its syntax is easy to read and it offers great libraries for networking (requests) and data extraction (BeautifulSoup). The extracted fields include the product title, categories, features, review count, price, and availability.

Intro

This tutorial will walk you through the basic steps of scraping Amazon product information, using Python and BeautifulSoup.
 
Scraping product information from Amazon can generate incredibly valuable insights for many use cases, whether you are monitoring prices, running a business intelligence project, or keeping an eye on your competition.
 
Python is well suited for this task, since its syntax is very easy to read and it offers great libraries for networking (requests) and data extraction (BeautifulSoup, full documentation here).

Information that is going to be scraped

We are going to focus on the following pieces of information:
  • Title
  • Categories
  • Features
  • Number of reviews
  • Price
  • Availability

Getting started

In this tutorial we are going to use two Python libraries: requests and BeautifulSoup. These can be installed with the following shell command:
pip3 install requests beautifulsoup4 lxml
(The beautifulsoup4 package provides the BeautifulSoup module; lxml is the parser we pass to it later.)
Once you have successfully installed all dependencies, we are all set to start with the actual work.
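All of the following snippets assume that the required modules are imported at the top of your script:
import json
import requests
from bs4 import BeautifulSoup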

Fetching the HTML markup

Amazon is quite sensitive when it comes to scraping and quickly responds with captchas and content walls instead of the actual product data.
To avoid that, we define a user agent that we are going to use for our HTTP request:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
In this case we chose one of the most common user agents on the web.
A list of other common user agents can be found here: http://www.networkinghowtos.com/howto/common-user-agent-list/
Now we send our request to the desired Amazon product page:
url = 'https://www.amazon.com/FEICE-Stainless-Leathers-Waterproof-Business/dp/B074MWWTVL'
response = requests.get(url, headers=headers)
The retrieved response can be viewed by running:
print(response.text)
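Keep in mind that Amazon may answer with an error page instead of the product markup. A minimal sanity check before parsing (this only catches non-200 responses, not captcha pages served with status 200):
if response.status_code != 200:
    raise Exception('Request failed with status code {}'.format(response.status_code))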

Parsing with BeautifulSoup

We have successfully fetched the Amazon product page and will now focus on parsing the desired product information. For this, we will use BeautifulSoup.
soup = BeautifulSoup(response.content, features="lxml")
The code above loads the scraped HTML markup into BeautifulSoup.

Product title

We are going to start with the product title:
title = soup.select("#productTitle")[0].get_text().strip()
Since the select() function returns a list, even if only one element was matched, we select the first element with [0].
We are just interested in the text inside the #productTitle element and do not care about the HTML tags that are wrapping it. Thus, we add get_text() to our command.
Furthermore, our text is wrapped in a lot of whitespace that we want to get rid of; strip() does the job for us.

Product categories

The number of related categories varies from product to product. This is where the findAll() function comes in handy:
categories = []
for li in soup.select("#wayfinding-breadcrumbs_container ul.a-unordered-list")[0].findAll("li"):
    categories.append(li.get_text().strip())
We loop through all list items that were found and append their text to an array. The rest of the code is reused from the product title.

Product features

Similar to the product categories, the number of product features varies from product to product. Hence, we will process them in exactly the same way:
features = []
for li in soup.select("#feature-bullets ul.a-unordered-list")[0].findAll('li'):
    features.append(li.get_text().strip())

Product price

The product price behaves similarly to the product title:
price = soup.select("#priceblock_saleprice")[0].get_text()
Note: The retrieved value is a string containing the currency sign and the price of the product (depending on the product, the price may sit in #priceblock_ourprice instead of #priceblock_saleprice). If this tutorial were not for demonstration purposes only, we would detect the contained currency and save the price in a separate float variable.
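For illustration, a minimal sketch of such a conversion, assuming a price string like '$129.99' (the variable names are ours, not part of the tutorial):
raw_price = soup.select("#priceblock_saleprice")[0].get_text().strip()
currency = raw_price[0]  # e.g. '$'
price_value = float(raw_price[1:].replace(',', ''))  # e.g. 129.99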

Product review count

The review count is part of a string (for example, "1,234 customer reviews"). Hence, we split off the leading number, remove the thousands separator, and convert the result to an integer variable:
review_count = int(soup.select("#acrCustomerReviewText")[0].get_text().split()[0].replace(',', ''))
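
Product availability

The JSON object in the next step also includes the product's availability, which we still need to extract. The selector below is an assumption based on Amazon's usual markup (an element with the id availability); verify it against the actual page:
availability = soup.select("#availability span")[0].get_text().strip()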

Format as JSON

We have successfully parsed our product information and stored it in variables. To finalise this project, we are going to combine them in a readable JSON format:
jsonObject = {
    'title': title,
    'categories': categories,
    'features': features,
    'price': price,
    'review_count': review_count,
    'availability': availability
}
To display our result, we print the JSON object using the following command:
print(json.dumps(jsonObject, indent=2))

Limitations, Further Improvements

Proxy Server
As stated above, Amazon is very sensitive when it comes to scraping. Even if you implement measures like slow scraping, sleep periods, and user-agent rotation, Amazon will stop your script at some point. A way to get around this is to route your requests through a proxy server, or to use a scraper API.
The following link provides a good overview about available products: 
https://zenscrape.com/best-web-scraping-tools-top-15-web-scrapers-2019/
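With requests, routing traffic through a proxy only requires the proxies argument; a minimal sketch (the proxy address is a placeholder, replace it with a real endpoint):
proxies = {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:3128'}
response = requests.get(url, headers=headers, proxies=proxies)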

Store retrieved information

Another improvement would be to store the retrieved information in a database, or at least in a log file.
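As a minimal sketch of the log-file variant (the filename products.jsonl is illustrative), each scraped product could be appended as one line of JSON:
with open('products.jsonl', 'a') as f:
    f.write(json.dumps(jsonObject) + '\n')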

Written by andreas-a | Founder @ saas.industries
Published by HackerNoon on 2019/10/16