Are you looking for a method of and do not know where to begin with? In that case, you may find this blog very useful in scraping Amazon reviews. In this blog, we will discuss . Web scraping is a simple means of collecting data from different websites, and Scrapy is a web crawling framework in python. scraping Amazon reviews scraping amazon reviews using Scrapy in python Web scraping allows the user to manage data for their requirements, for example, online merchandising , price monitoring and driving marketing decisions. In case you are wondering whether this process is even legal or not, you can find the answer to this query here . Before digging into scraping for product reviews, let us first have a look at a few use-cases of scraping Amazon reviews at the first place Amazon Why the need for scraping Amazon reviews? Sentiment Analysis over the product reviews Sentiment analysis can be performed over the reviews scraped from products on Amazon. Such study helps in . This can help in sellers or even other prospective buyers in understanding the public sentiment related to the product. identifying the user’s emotion towards a particular product Optimising dropshipping sales Drop shipping is a business type that allows a particular company to work without an inventory or a depository for the storage of its products. You can use . web scraping for getting product pricing, user opinions, understanding the needs of the customer and following up with the trend Web scraping for online reputation monitoring It is difficult for large-scale companies to monitor their reputation of products. Web scraping can help in extracting relevant review data which can act as input to different analysis tool to measure user’s sentiment towards the organisation. What is Scrapy? is a for a developer to write code to create, which define how a particular site (or a group of websites) will be scrapped. The most significant feature is that it is built on Twisted, an asynchronous networking library, which makes the spider performance is very significant. Scrapy web crawling framework Let us now have a look at a necessary pipeline for scraping amazon reviews Scraping Amazon reviews Pipeline I always feel that it is essential to have a holistic idea of the work before you start doing it which in our case is scraping Amazon reviews. Hence, before we begin with the coded implementation with Scrapy, let us have an . uber look at the complete pipeline for scraping Amazon reviews In this section, we will look at the different stages involved in scraping amazon reviews along with their short description. This will give you an overall idea of the task which we are going to do using python in the later section. 1. Analysing HTML structure of the webpage Scraping is about . Before starting to write a scraper, we need to understand the HTML structure of the target web page and identify patterns in it. The pattern can be related to usage of classes, ids and other HTML elements in a repetitive manner. finding a pattern in the web pages and extracting them out 2. Scrapy parser implementation in Python After analysing the structure of the target web page, we work on . Scrapy parser’s responsibility is to visit the targeted web page and extract out the information as per the mentioned rules. the coded implementation in python 3. Collection and Storage of Information The parser can dump out the results in any format you wish for be it CSV or JSON. This is the final output while in which your scraped data resides. Python code implementation for scraping Amazon reviews Installing Scrapy We will start by . There can be two cases here though. If you are using conda, then you can install scrapy from the conda-forge using the following command installing Scrapy in our system conda install -c conda-forge scrapy In case you are not using conda, you can use pip and directly install it in your system using the below command pip install scrapy We will start by creating a scrapy project. A scrapy project enables users to collate different components of the crawlers into a single folder. To create a scrapy project use following command scrapy startproject amazon_reviews_scraping Once you have created the project, you will find the following two contents in it. One is a folder which contains your scrapy code, and other is your spacy configuration file. Spacy configuration while helps in running and deploying the Scrapy project on a server. Once we have the project in place, we need to create a spider. A spider is a chunk of python code which determines how a web page will be scrapped. It is the main component which crawls different web pages and extracts content out of it. In our case, this will be the code chuck that will perform the task of visiting Amazon and scraping Amazon reviews. To create a spider, you can use the following command scrapy genspider amazon_review your-link-here Spider gets created within a spiders folder inside the project directory. Once you go into the scrapy project, you will see a directory structure like the one below Scrapy files description Let us understand the Scrapy project structure and supporting files inside in a bit more detail. Main files inside Scrapy project directory includes Items are containers that will be loaded with the scraped data. items.py The spider middleware is a framework of hooks into Scrapy’s spider processing mechanism where you can plug custom functionality to process the responses that are sent to for processing and to handle the requests and items that are generated from spiders. Middleware .py Spiders After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially. Each item pipeline component is a Python class Pipelines .py It allows one to customise the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves settings.py The Spiders is a directory which contains all spiders/crawlers as Python classes. Whenever one runs/crawls any spider, then scrapy looks into this directory and tries to find the spider with its name provided by the user. Spiders define how a certain site or a group of sites will be scraped, including how to perform the crawl and how to extract data from their pages. spiders folder For more detailed information on Scrapy components, you can refer to this link Analysing HTML structure of the webpage Now before we actually start writing spider implementation in python for scraping Amazon reviews, we need to . Below is the page we are trying to scrape which contains different reviews about the MacBook air on Amazon. identify patterns in the target web page We start by opening the web page using the inspect-element feature in the browser. There you can see the HTML code of the web page. After a little bit of exploration, I found the following HTML structure which renders the reviews on the web page On the reviews page, there is a division with id cm_cr-review_list. This division multiple sub-division within which the review content resides. We are planning to extract both rating stars and review comment from the web page. We need to one more level deep into one other sub-divisions to prepare a scheme on fetching both star rating and review comment. Upon further inspection, we can see that every review subdivision is further divided into multiple blocks. One of these blocks contain required star ratings, and other includes the text of review needed. By looking more closely, we can easily see that rating star division is represented by the class attribute “review-rating” and review texts are represented by the class “review-text”. All we need to do now is just to pick these patterns up using our Scrapy parser Defining Scrapy Parser in Python Now once we have our spider template ready and we have analysed the pattern in the target web page, we can start writing the . We begin by extending the Spider class and mentioning the URLs we plan on scraping. Variable start_urls contains the list of the URLs to be crawled by the spider. logic for the extraction of reviews from Amazon Then we need to which gets fired up whenever our spider visits a new page. In the parse function, we need to identify patterns in the targeted page structure. Spider then looks for these patterns and extracts them out from the web page. define a parse function Below is a code sample of Scrapy parser for scraping Amazon reviews # -*- coding: utf -*- # Importing Scrapy Library scrapy # Creating a = # Domain names to scrape allowed_domains = [ ] # Base URL the MacBook air reviews myBaseUrl = start_urls=[] # Creating list urls to be scraped by appending page number a the end base url i range( , ): start_urls.append(myBaseUrl+str(i)) # Defining a Scrapy parser def parse(self, response): data = response.css( ) # Collecting product star ratings star_rating = data.css( ) # Collecting user reviews comments = data.css( ) count = # Combining the results review star_rating: { : .join(review.xpath( ).extract()), : .join(comments[count].xpath( ).extract()) } count=count+ -8 import new ( . ): # class to implement Spide class AmazonReviewsSpider scrapy Spider Spider name name 'amazon_reviews' 'amazon.in' for "https://www.amazon.in/Apple-MacBook-Air-13-3-inch-MQD32HN/product- reviews/B073Q5R6VR/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews &pageNumber=" of of for in 1 121 '#cm_cr-review_list' '.review-rating' '.review-text' 0 for in yield 'stars' '' './/text()' 'comment' '' ".//text()" 1 Storing Scraped Results Finally, we have successfully built our spider. The only task now left is to run this spider. We can run this spider by using the runspider command. It takes to input the spider file to run and the output file to store the collected results. In the case below, spider file is and the output file is amazon_reviews.py reviews.csv scrapy runspider amazon_reviews_scraping/amazon_reviews_scraping/spiders/amazon_reviews.py -o reviews.csv EDA on Amazon reviews In this section, we will try to do some Amazon reviews. We will be counting the overall rating of the product along with the most common words used for the product. Using pandas, we can read the CSV containing the scraped data. exploratory data analysis on the data obtained after scraping pandas pd matplotlib plt pd.read_csv( ) summarised_results = dataset[ ].value_counts() plt.bar(summarised_results.keys(), summarised_results.values) plt.show() import as import as "reviews.csv" "stars" Above code summarises all the ratings and finds their total count. After that, it plots a bar chart to visualise the findings. We have used library here to visualise the results. matlplotlib Let us now try to visualise some of the keywords that are present in the scraped reviews. We can visualise these keywords using a word cloud. Word cloud works on the principle that most frequent words in the text should be much more prominent and bolder among the set of different words. The code snippet below can help you in making a word cloud in python def visualise_word_map(): words= msg dataset[ ]: msg = str(msg).lower() words = words+msg+ wordcloud = WordCloud(width= , height= , background_color= ).generate(words) fig_size = plt.rcParams[ ] fig_size[ ] = fig_size[ ] = plt.show(wordcloud) plt.axis( ) " " for in "comment" " " 3000 2500 'white' "figure.figsize" 0 14 1 7 "off" The image below is a word cloud generated by the above code snippet. Words like the laptop, apple, product and Amazon are represented by much more significant and bolder fonts representing that there are many frequent words used. Furthermore, this word cloud makes sense because we scraped MacBook air’s user reviews from Amazon. Also, you can see words like amazing, good, awesome and excellent indicating that indeed many of the users actually liked the product. Conclusion Using Scrapy, we were able to devise a method for . Additionally, there can be some Amazon reviews as if you try scraping Amazon frequently. This can be a hindrance to your work. In such cases, make sure you are shuffling your IP’s periodically and are making less frequent requests to Amazon server to prevent yourself from blocking out. You can read more about it . scraping amazon reviews using python roadblocks while scraping Amazon tends to block IP’s here Additionally, you can which serves as a protection to your home IP from blocking out while scraping Amazon reviews. use the proxy servers