Web Scraping For Fun: With 'requests-html'

by Valentine Enedah, November 28th, 2022

Too Long; Didn't Read

Web scraping is the automated method of gathering data from a website for analysis. Data professionals have a unique chance to obtain data through web scraping that would be otherwise impossible to access. The Python library ‘requests-html’ makes it simple and clear to parse HTML. We will use the free cloud-based collaborative notebook that you can access with your Google Account, called ‘Google Colab’, for this project. The code for the whole article can be found on GitHub.


The primary responsibility of a data professional (Data Scientist, Data Engineer, Data Analyst, etc.) is to locate, clean, examine, and extract valuable information from data for business purposes.

This can be complicated, especially when it comes to gathering the data for a project.

Despite the massive amount of data available, it is frequently not simple to access.


I know, it’s pretty exhausting, right?😭

But not to worry, Web scraping has your back.

So,

What is Web scraping and why do we do it?

Web scraping is the automated method of gathering data from a website for analysis.

Data professionals have a unique chance to obtain data through web scraping that would be otherwise impossible to access.

A good example: the majority of businesses have extensive websites whose data is not accessible to third parties via an API, so we need to scrape that data in order to access it.

The most popular approach to asynchronous web scraping is to pair ‘Beautiful Soup’ with ‘httpx’.
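For comparison, here is a rough, hypothetical sketch of that approach (it assumes the ‘httpx’ and ‘beautifulsoup4’ packages are installed, and borrows the "h3.title" selector used later in this article):

import asyncio

import httpx
from bs4 import BeautifulSoup

async def fetch_titles(url):
    # Fetch the page asynchronously with httpx, then parse it with Beautiful Soup.
    async with httpx.AsyncClient(follow_redirects=True) as client:
        response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.find_all("h3", class_="title")]

# titles = asyncio.run(fetch_titles("https://www.bookdepository.com/bestsellers"))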

We will be trying something new today.🙂

*drum rolls!!!🥁 Introducing….

The ‘requests-html’ Package!

For organizational purposes, we will be looking at:


  1. What is the ‘requests-html’ package and how to use it.
  2. Async web scraping with ‘requests-html’ package
  3. Cleaning data with ‘pandas’


What is the ‘requests-html’ package and how to use it.

Let's understand a little about "requests-html" before we start utilizing it. The Python library "requests-html" makes it simple and intuitive to parse HTML. It was created by Kenneth Reitz, who also wrote the "requests" library. In addition to everything in the "requests" package, it includes the following features:


  • Full JavaScript support!
  • CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection–pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.
  • Async Support


Source
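Before we get to the async demo, here is a minimal sketch of the basic, synchronous usage (a hypothetical illustration, not code from the demo; it assumes the package is installed):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.bookdepository.com/bestsellers")

# 'find' takes a CSS selector and returns a list of matching elements
for element in r.html.find("h3.title")[:5]:
    print(element.text)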


One of the cool things about ‘requests-html’ is that you can scrape websites quickly (asynchronously).

Async web scraping with ‘requests-html’ package

For this project we will use Google Colab, a free cloud-based collaborative notebook that you can access with your Google Account.

All you have to do is sign up, create a notebook, and get started.

!pip install requests-html


For the demo, we will scrape bestsellers from the Book Depository website.

The code for the whole article can be found here

After installing it, we import the package.


from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
r = await asession.get("https://www.bookdepository.com/bestsellers")


After creating an instance of it, we used the get method of the session to access our website. Checking the status code now will show that it was successful (200).


# find out if successful

r.status_code 

We get the Titles

The information we are interested in can now be obtained, starting with the book titles. Using Chrome's developer tools, we can examine the HTML source to determine which XPath or CSS selector we should use to obtain our data.



(Screenshot: inspecting the page source in Chrome DevTools. Source: Webpage)
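As a side note, the same elements can usually be reached with either kind of selector; the XPath below is just an illustrative (hypothetical) equivalent of the CSS selector used in the next snippet:

# CSS selector and a roughly equivalent XPath expression for the book titles
titles_css = r.html.find("h3.title")
titles_xpath = r.html.xpath('//h3[contains(@class, "title")]')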


# We get the titles
page = 1
titles = []
while page != 35:
    for x in r.html.find("h3.title"):
        titles.append(x.text)
    page += 1


As you can see, we initially created two variables: "page" to keep track of the website's pages, and "titles" to hold our data. After that, we used a "while loop" to repeat the collection step while the page counter is below 35 (there are 34 pages on the website).
You can use ‘chrome devtools’ to inspect the `pagination` on the page.


Then we used the "find" method from "requests-html" on our HTML content, passing in our CSS selector (an "h3" tag with the class "title"), to extract our data. Each result was then appended to our "titles" list. Now, if you look at how many titles are on our list, there are roughly 1020.
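Note that the loop above parses the same response object on every pass; if you wanted to fetch each page separately instead, a sketch like the one below could work (it assumes, hypothetically, that the site exposes pages through a "?page=" query parameter, so check the pagination links in DevTools for the real URL scheme):

# Hypothetical sketch: fetch pages 1 to 34 one by one instead of re-parsing page 1
all_titles = []
for page in range(1, 35):
    page_response = await asession.get(f"https://www.bookdepository.com/bestsellers?page={page}")
    for element in page_response.html.find("h3.title"):
        all_titles.append(element.text)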


For the remaining variables we are interested in, the procedure is repeated. Don't forget to look through the page source to find the appropriate CSS selector for each element you need.
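Since only the CSS selector changes from one field to the next, one way to cut down on repetition is a small helper like this (a hypothetical convenience, not part of the article's code):

def scrape_text(response, selector):
    # Collect the text of every element matching the given CSS selector.
    return [element.text for element in response.html.find(selector)]

# Example usage:
# authors = scrape_text(r, "p.author")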


len(titles) # check the length -> 1020
titles[:10] # take a peek


Titles

We get the Authors

# We get the authors
page = 1
authors = []
while page != 35:
    for x in r.html.find("p.author"):
        authors.append(x.text)
    page += 1


We also check the length of the data we just scraped.


len(authors) # -> 1020

# check the first 10 rows
authors[:10] 


Authors

We get the Prices

# We get the prices
page = 1
prices = []
while page != 35:
    for x in r.html.find("p.price"):
        prices.append(x.text)
    page += 1
len(prices) # -> 1020
prices[:10]


Prices


Our prices clearly need a lot of cleaning up; we will deal with that later.


We then get the Ratings

# We get the ratings
page = 1
stars = []
while page != 35:
    for x in r.html.find("div.stars"):
        result = x.find("span.star.full-star")
        stars.append(len(result))
    page += 1


Ratings are represented by stars, so to determine each book's rating we first locate the element that contains the stars and count how many full stars it holds.




Sadly, not all of the books have ratings yet, so we will deal with any missing information later.


len(stars) # -> 748

# check the rating at row 34
stars[34]


We get the Book Formats

# We get the book formats
page = 1
formats = []
while page != 35:
    for x in r.html.find("p.format"):
        formats.append(x.text)
    page += 1


To make sure we have all the data, we can check the length.

len(formats) # -> 1020

# check 10 rows
formats[:10] 


Book Formats

Cleaning data with ‘pandas’

Creating a Dataframe

Our data has been successfully scraped; however, it is not yet clean. We need to clean it up and save it for further analysis, and we will use pandas to accomplish that. First, we'll put our data into a DataFrame.


Recall, to install pandas:

pip install pandas


# We put it into a DataFrame
import pandas as pd
stars = pd.Series(stars)
df = pd.DataFrame(list(zip(titles, authors, prices, formats)),
                  columns=["titles", "authors", "prices", "formats"])

# to add the stars
df["rating"] = stars 
df.shape # -> (1020, 5)

# To check our data
df.head()


Dataframe(Heads)


# To check the last values
df.tail()


Dataframe(Tails)


You'll see that our data contains a few missing values, which we need to address. Missing values can be dealt with in a variety of ways: you can either treat them or drop them. We will treat them, because dropping them is not an option for our small dataset. This, too, can be done in several ways: we can replace the values with new ones, or fill them with the variable's mean or median (for a numeric column).
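For instance, the two options look roughly like this in pandas (shown on our 'rating' column as an illustration; the copies are just to avoid touching the original DataFrame):

# Option 1: drop the rows that have a missing rating
df_dropped = df.dropna(subset=["rating"])

# Option 2: fill the missing ratings with the column mean (the approach used below)
df_filled = df.copy()
df_filled["rating"] = df_filled["rating"].fillna(df_filled["rating"].mean())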

We clean the data

# We start by cleaning the price values

# We remove the non-breaking space ("\xa0") characters from the price strings
df["prices"] = df.prices.str.replace("\xa0*", "", regex=True)


To eliminate the unwanted characters from the data, we are using the 'replace' method of the 'str' accessor. You'll also notice that our data contains two prices: the old price follows the new one in the same string. Only the new price, which appears first in the prices column, is of interest to us.


# To get the current price value
df["prices"] = df["prices"].apply(lambda x: x.split(" ")[0]) 


To obtain the new price, we used the pandas 'apply' method with a function that splits each value on whitespace and keeps the first piece, which is our new price. You'll also notice that in the previous step the wildcard * lets the pattern match any run of "\xa0" (non-breaking space) characters.
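As a concrete illustration, with a made-up raw value (the exact scraped strings vary):

# A hypothetical raw value: the current price followed by the old, crossed-out price
raw = "US$12.41 US$16.99"
current = raw.split(" ")[0]   # -> "US$12.41"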


Our prices column still needs some work, but if we check our data right now we'll see that it's practically spotless. The US$ characters are still present in our values, which is not what we want for a column that should hold a floating-point (decimal) value. Therefore, we will get rid of them before converting our prices to floating-point numbers. Fortunately, pandas makes this simple: we just replace these characters with nothing.


# To remove the `US` abbreviation
df["prices"] = df["prices"].str.replace("US", "")


# To remove the dollar sign (a literal "$", not a regex anchor)
df["prices"] = df["prices"].str.replace("$", "", regex=False)
df.head()


Cleaned_Price_Column


The prices column is now tidy; however, something is still missing. If we look at the data types in our DataFrame, we will see that prices is still stored as a string, which shouldn't be the case for a numeric column. We only need to convert this column from its string representation into floating-point numbers.


df.info()


Column_Datatypes



# To convert prices to float type

df["prices"] = df.prices.astype("float")
df.info()


prices(Converted to float)


Now we address the missing values in the rating column. Let's count how many values are missing: as you can see, roughly 272 rows have no rating. We can't drop them because our dataset is too small, so we will fill them with the rating column's average instead. If we check again afterwards, we can see that there are no missing values left in the rating column.


# We fill Na values in rating
df.rating.isna().sum() # 272

import numpy as np
df["rating"] = df["rating"].fillna(round(np.mean(df.rating), 1))


# We recheck for missing values -> 0
df.rating.isna().sum() 


If we check our final DataFrame, we can see that the data is clean and the columns are of the right types. We can now export our data to a CSV file for further analysis.


df.head()


Finally, we save the data for future examination.


# To save for further analysis

df.to_csv("book_depository_clean.csv")


The ‘requests-html’ package makes web scraping simple and beginner-friendly. You should try it!🔥
You can read more about the package here.