## Intro

This tutorial will walk you through the basic steps of scraping Amazon product information using Python and BeautifulSoup.

Scraping product information from Amazon can generate incredibly valuable insights for many use cases, no matter if you are monitoring prices, running a business intelligence project, or keeping an eye on your competition. Python is well suited for this task, since its syntax is very easy to read and the language offers great libraries for networking (requests) and data extraction (BeautifulSoup, full documentation here).

## Information that is going to be scraped

We are going to focus on the following product information:

- Title
- Categories
- Features
- Number of reviews
- Price
- Availability

## Code

### Getting started

In this tutorial we are going to use two Python libraries, requests and BeautifulSoup; since we also pass the lxml parser to BeautifulSoup further below, we install it alongside them:

```
pip3 install requests beautifulsoup4 lxml
```

Once you have successfully installed all dependencies, we are all set to start with the actual work.

### Fetching the HTML markup

Amazon is quite sensitive when it comes to scraping and immediately displays captchas and content walls, steering you towards its own data API instead. To avoid that, we define a user agent that we are going to use for our HTTP request:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
```

In this case we chose the most popular user agent on the web. A list of other common user agents can be found here: http://www.networkinghowtos.com/howto/common-user-agent-list/

Now we send our request to the desired Amazon product page:

```python
import requests

url = 'https://www.amazon.com/FEICE-Stainless-Leathers-Waterproof-Business/dp/B074MWWTVL'
response = requests.get(url, headers=headers)
```

The retrieved answer can be viewed by running:

```python
print(response.text)
```

### Parsing with BeautifulSoup

We have successfully scraped the Amazon product page and will now focus on parsing the desired product information. For this, we will use BeautifulSoup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, features="lxml")
```

The code above loads the scraped HTML markup into BeautifulSoup.

### Product title

We are going to start with the product title:

```python
title = soup.select("#productTitle")[0].get_text().strip()
```

Since the select() function returns a list even if only one element was retrieved, we pick the first element with [0]. We are only interested in the text inside the element, not in the HTML tags wrapping it, which is why we append get_text() to our command. Furthermore, the text is surrounded by a lot of whitespace that we want to get rid of; strip() does that job for us.

### Product categories

The number of related categories varies from product to product. This is where the findAll() function comes in handy:

```python
categories = []
for li in soup.select("#wayfinding-breadcrumbs_container ul.a-unordered-list")[0].findAll("li"):
    categories.append(li.get_text().strip())
```

We loop through all list items that were found and append them to a list. The rest of the code is reused from the product title.
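A caveat that applies to all of the snippets above and below: indexing the result of select() with [0] raises an IndexError whenever Amazon serves a page variant (or a captcha page) that lacks the expected element. As a small defensive sketch, which is our own addition and not part of the original tutorial, the lookup can be wrapped in a helper that returns a default instead of crashing:

```python
def select_text(soup, selector, default=None):
    """Return the stripped text of the first element matching `selector`,
    or `default` if the page does not contain such an element."""
    matches = soup.select(selector)
    return matches[0].get_text().strip() if matches else default
```

With this helper, the title lookup becomes select_text(soup, "#productTitle"), and a missing element yields None instead of an exception.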
### Product features

Similar to the product categories, the number of product features also varies from product to product. Hence, we will process them in the exact same way as the product categories:

```python
features = []
for li in soup.select("#feature-bullets ul.a-unordered-list")[0].findAll('li'):
    features.append(li.get_text().strip())
```

### Product price

The product price behaves similarly to the product title:

```python
price = soup.select("#priceblock_saleprice")[0].get_text()
```

The retrieved value is a string containing the dollar sign and the price of the product. Note: if this tutorial were not for demonstration purposes only, we would detect the contained currency and save the price in a separate float variable.

### Product review count

The review count is part of a string. Hence, we parse the string, strip the text, and convert it to an integer variable:

```python
review_count = int(soup.select("#acrCustomerReviewText")[0].get_text().split()[0])
```

### Format as JSON

We have successfully parsed and stored our product information in variables. To finalise this project, we are going to store them in a well readable JSON format:

```python
# The extraction code for `availability` is missing from this excerpt; parsing
# Amazon's "#availability" element the same way as the title is a plausible
# reconstruction, not the original tutorial's code.
availability = soup.select("#availability")[0].get_text().strip()

jsonObject = {
    'title': title,
    'categories': categories,
    'features': features,
    'price': price,
    'review_count': review_count,
    'availability': availability
}
```

To display our result, we print the JSON object using the following command:

```python
import json

print(json.dumps(jsonObject, indent=2))
```

## Limitations, Further Improvements

### Proxy Server

As stated above, Amazon is very sensitive when it comes to scraping. Even if you implement measures like slow scraping, sleep periods, user-agent rotation, etc., Amazon will stop your script at some point. A way to get around this is to use a proxy server or a scraper API. The following link provides a good overview of available products: https://zenscrape.com/best-web-scraping-tools-top-15-web-scrapers-2019/

### Store retrieved information

Another improvement would be to store the retrieved information in a database, or at least in a log file.
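As a minimal sketch of that improvement, assuming a JSON Lines log file (the file name and the added timestamp field are our own choices, not part of the original tutorial), each scraped product could be appended like this:

```python
import json
from datetime import datetime, timezone

def log_product(product, path="scraped_products.jsonl"):
    """Append one product record per line (JSON Lines), tagged with a UTC timestamp."""
    record = dict(product, scraped_at=datetime.now(timezone.utc).isoformat())
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_product(jsonObject)
```

A JSON Lines file keeps the script dependency-free and is trivial to import into a database later, one json.loads() per line.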