In today's world, every business is looking for ways to take advantage of new technologies and stay competitive. Data extraction, also known as data scraping or web scraping, enables automated access to large amounts of web data.
The main applications of web scraping include price intelligence, lead generation, and market research. Web scraping is practiced mainly by businesses that base their marketing and development strategies on the vast amount of data available on the web.
Now that we understand what web scraping is and why it is useful, let's see how we can navigate through multiple pages of a site and scrape large amounts of data using Selenium in Python.
Before starting, go through the following instructions for installing the required packages:
1) Python: We are going to use Python 3
2) Selenium package: To install the Selenium package, run the following command
pip install selenium
3) ChromeDriver: Download the latest version of ChromeDriver for your version of Chrome. Note that you need Google Chrome installed on your PC for all of this to work.
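Once everything is installed, a quick sanity check is to launch Chrome from Python and load a page. This sketch assumes chromedriver is on your PATH; otherwise, pass its location to webdriver.Chrome, as we do later on.
from selenium import webdriver
#Sanity check: open Chrome, load a page, print its title, then close the browser
#Assumes chromedriver is on your PATH; otherwise pass its path to webdriver.Chrome()
driver = webdriver.Chrome()
driver.get('https://www.google.com')
print(driver.title)
driver.quit()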
The first thing we need to do before scraping is to understand the structure of the webpage. Our goal is to scrape Edmunds.com, a car forum where people post reviews of different cars and discuss car-buying decisions.
The topic we are going to scrape is entry-level luxury car brands. We'll scrape user ids and comments and export them into a CSV file for further analysis.
Let's start by importing the required packages
#Importing packages
from selenium import webdriver
import pandas as pd
Now we’ll create a new instance of Google Chrome to open the required URL.
driver = webdriver.Chrome('path to the chromedriver executable on your computer')
Then we’ll access Google Chrome and open the website.
driver.get('https://forums.edmunds.com/discussion/2864/general/x/entry-level-luxury-performance-sedans/p702')
It's time to inspect two items (user id and comment) and understand how to extract them.
User id: Inspecting the user id in the browser's developer tools reveals the HTML element that contains it. The XPath for the user id is shown below. Note that the XPath contains a comment id, which uniquely identifies each comment on the website.
//*[@id="Comment_5561090"]/div/div[2]/div[1]/span[1]/a[2]
Notice that this XPath contains the user id 'dino001'.
So, how do we extract the value at an XPath?
Selenium has a function called "find_elements_by_xpath". We'll pass our XPath to this function and get back a list of matching selenium elements.
Then we can extract the text inside an element using its 'text' attribute. In our case, the text is the user id ('dino001').
userid_element = driver.find_elements_by_xpath('//*[@id="Comment_5561090"]/div/div[2]/div[1]/span[1]/a[2]')[0]
userid = userid_element.text
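Printing the result for this particular comment should output the user id we saw while inspecting the page:
print(userid)
#Output: dino001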
Comments: Now, let’s understand how to extract the comments of each user.
Below you may find the XPath for the user comment
//*[@id="Comment_5561090"]/div/div[3]/div/div[1]
Since this XPath also contains the comment id, we can extract the comment just as we extracted the user id.
user_message = driver.find_elements_by_xpath('//*[@id="Comment_5561090"]/div/div[3]/div/div[1]')[0]
comment = user_message.text
At this point, we know how to scrape individual elements from a web page. Now let's explore how to extract these items for 5000 comments.
As mentioned above, we'll use the comment ids, which are unique for each comment. Looking at the XPath for the entire comment block, we can see that it has a comment id associated with it.
//*[@id="Comment_5561090"]
The following code snippet will help us extract all the comment ids on the web page. We'll again use 'find_elements_by_xpath', this time matching every element whose id contains 'Comment_', and then read each id from the 'id' attribute.
ids = driver.find_elements_by_xpath("//*[contains(@id,'Comment_')]")
comment_ids = []
for i in ids:
    comment_ids.append(i.get_attribute('id'))
The above code will provide us with a list of all the comment IDs from the web page.
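As a quick check, we can print the first few entries; the exact values depend on the page, but each entry should look like the 'Comment_5561090' id we inspected earlier.
#Peek at the first few comment ids (actual values depend on the page)
print(comment_ids[:5])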
Now it's time to bring all of this together into one script to extract 5000 comments. By looping through the comment ids we found in the previous step, we can extract the user id and comment for every comment on the web page.
Below is the code snippet to extract all comments:
driver = webdriver.Chrome('C:/Users/bandi/Desktop/Text Analytics/TA Session/chromedriver_win32/chromedriver')
driver.get('https://forums.edmunds.com/discussion/2864/general/x/entry-level-luxury-performance-sedans/p702')
comments = pd.DataFrame(columns = ['user_id','comments'])
ids = driver.find_elements_by_xpath("//*[contains(@id,'Comment_')]")
comment_ids = []
for i in ids:
comment_ids.append(i.get_attribute('id'))
for x in comment_ids:
    #Extract user ids from each user on a page
    userid_element = driver.find_elements_by_xpath('//*[@id="' + x + '"]/div/div[2]/div[1]/span[1]/a[2]')[0]
    userid = userid_element.text
    #Extract message for each user on a page
    user_message = driver.find_elements_by_xpath('//*[@id="' + x + '"]/div/div[3]/div/div[1]')[0]
    comment = user_message.text
    #Add the user id and comment for each user to the dataframe
    comments.loc[len(comments)] = [userid, comment]
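With the dataframe filled, we can export the scraped data to a CSV file for further analysis, as promised at the start; the filename here is just an example.
#Export the scraped user ids and comments to a CSV file
comments.to_csv('comments.csv', index=False)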
If you look at our URL, you'll notice that it ends with a page number (p702). We can move to earlier pages simply by changing the page number in the URL, extracting more comments until we reach the desired count.
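As a minimal sketch of that idea, the loop below walks backwards through the pages, reusing the driver and dataframe from the script above; the starting page, the cut-off, and the 5000-comment stopping condition are illustrative assumptions, and the XPaths are assumed to hold on every page.
#Pagination sketch: walk backwards through earlier pages of the thread
#The page range and the 5000-comment stop condition are illustrative assumptions
base_url = 'https://forums.edmunds.com/discussion/2864/general/x/entry-level-luxury-performance-sedans/p'
page = 702
while len(comments) < 5000 and page > 0:
    driver.get(base_url + str(page))
    for i in driver.find_elements_by_xpath("//*[contains(@id,'Comment_')]"):
        x = i.get_attribute('id')
        userid = driver.find_elements_by_xpath('//*[@id="' + x + '"]/div/div[2]/div[1]/span[1]/a[2]')[0].text
        comment = driver.find_elements_by_xpath('//*[@id="' + x + '"]/div/div[3]/div/div[1]')[0].text
        comments.loc[len(comments)] = [userid, comment]
    page -= 1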
That's how you scrape a website using Selenium in Python and obtain large amounts of data. From there, you can run all kinds of unstructured data analytics on it and surface interesting trends, sentiments, and more.