A Beginner's Guide to Web Scraping in Python

by Kalebu Jordan November 5th, 2020
Too Long; Didn't Read

In this article, you're going to learn the basics of web scraping in Python, and we'll build a demo project that scrapes quotes from a website. To follow along, you need two libraries installed on your machine: requests and Beautiful Soup. In the demo project, we use the requests library to pull the HTML from the website and then parse that HTML using Beautiful Soup. The project scrapes quotes from quotes.toscrape.com.

In this article, you're going to learn the basics of web scraping in Python, and we'll build a demo project that scrapes quotes from a website.

What is web scraping?

Web scraping is extracting data from a website programmatically. With web scraping you can extract the text inside HTML tags, download images and files, and do almost anything you would otherwise do by copying and pasting manually, only much faster.

Should you learn web scraping?

Yes, absolutely. As a programmer, you will often need content found on other people's websites, and many of those websites don't offer an API for it. Web scraping lets you get that content anyway.

Requirements

To follow along with this tutorial, you need the following libraries installed on your machine: requests and Beautiful Soup (the beautifulsoup4 package).

Installation

You can install the above two libraries with the pip command, as shown below:

$ pip install requests 
$ pip install beautifulsoup4
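
To confirm that both installs worked, one quick check is to import the packages and print their versions. This is only a sanity check; the version numbers you see will depend on your environment.

```python
# Import both packages; an ImportError here means the install did not succeed.
import bs4
import requests

# Each package exposes its version as a string.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```

Note that the package is installed as beautifulsoup4 but imported as bs4.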

Basics of requests

Requests is an elegant and simple HTTP library for Python, built for human beings. It lets you send HTTP requests (GET, POST, PUT, DELETE) to a website in an easy way.

In our demo project, we're going to use the requests library to send a GET request to the website and get back its HTML source code.
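
As a minimal sketch of that GET request, the helper below wraps requests.get. The function name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text.

    `fetch_html` and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Example usage (needs network access):
# html = fetch_html("http://quotes.toscrape.com/")
# print(html[:100])
```

Calling raise_for_status() is optional, but it makes a failed request fail loudly rather than handing an error page to the parser.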

Basics of BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with parsers that give us a way to navigate an HTML document and extract the content we need.

To pull data out of HTML or XML, we first convert its string representation into a BeautifulSoup object, which provides many methods for searching and manipulating it.
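
A minimal sketch of that conversion is shown here. The inline HTML string and the id value 'greeting' are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document held in a plain Python string.
html = (
    "<html><head><title>Document</title></head>"
    "<body><p id='greeting'>Hello</p></body></html>"
)

# Convert the string into a BeautifulSoup object using Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string      # text inside the <title> tag
first_paragraph = soup.find("p")    # first <p> tag in the document

print(title_text)
print(first_paragraph.text, first_paragraph["id"])
```

Once the string is wrapped in a BeautifulSoup object, tags can be reached as attributes (soup.title) or through search methods such as find().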

Let's get our hands dirty with some code

Let's use the BeautifulSoup library to extract data from the HTML file below, sample.html.

<!DOCTYPE html>
<html>
<head>
    <title>Document</title>
</head>
<body>
    <div id = 'quotes'>
        <p id = 'normal'>Time the time before the time times you</p>
        <p id = 'normal'>The Future is now </p>
        <p id = 'special'>Be who you wanted to be when you're younger</p>
        <p id = 'special'>The world is a reflection of who you're</p>
    </div>
    <div>
        <p id = 'Languages'>Programming Languages</p>
        <ul>
            <li>Python</li>
            <li>C+++</li>
            <li>Javascript</li>
            <li>Golang</li>
        </ul>
    </div>
</body>
</html>

Extracting all paragraphs in HTML

Let's extract all paragraphs from the sample.html shown above using BeautifulSoup:

from bs4 import BeautifulSoup
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
for paragraph in soup.find_all('p'):
    print(paragraph.text)

Output

Running the program above produces the following result:

$ python app.py 
Time the time before the time times you
The Future is now 
Be who you wanted to be when you're younger
The world is a reflection of who you're
Programming Languages

Code Explanation

  • Importing the BeautifulSoup library
from bs4 import BeautifulSoup
  • Creating a BeautifulSoup object from an HTML string
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

The above two lines of code read sample.html and create a BeautifulSoup object ready for extracting data.

  • Finding all paragraphs and printing them
for paragraph in soup.find_all('p'):
    print(paragraph.text)

We used the BeautifulSoup find_all() method to extract all the paragraphs in the HTML file. It takes the name of an HTML tag as a parameter, searches the parsed document for every tag with that name, and returns them all.
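
To see a little more of how find_all() behaves, here is a small sketch. It uses a trimmed-down inline copy of sample.html so that it runs on its own; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# A trimmed-down inline copy of sample.html, so the snippet is self-contained.
html = """
<div id='quotes'>
    <p id='normal'>The Future is now </p>
    <p id='special'>Be who you wanted to be when you're younger</p>
</div>
<ul>
    <li>Python</li>
    <li>Golang</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")           # every <p> tag, in document order
first_only = soup.find_all("p", limit=1)  # stop searching after one match
mixed = soup.find_all(["p", "li"])        # a list of names matches any of them
missing = soup.find_all("table")          # no match gives an empty result

print(len(paragraphs), len(first_only), len(mixed), len(missing))
```

The result behaves like a Python list, so it can be looped over, indexed, or passed to len(). When nothing matches, the result is simply empty and no exception is raised.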

Extracting all elements in the list from the HTML

To extract the list items instead of the paragraphs, we pass the tag name li instead of p to the find_all() method, as shown below:

app.py

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all('li'):
    print(item.text)

Output

$ python app.py
Python
C+++
Javascript
Golang
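
If you would rather collect the items into a Python list than print them, a sketch along the following lines may help. The inline HTML is a shortened stand-in for sample.html, and the extra spaces around Python were added here only to show what strip=True does.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for sample.html; the padding around "Python" is
# deliberate, to demonstrate whitespace stripping.
html = """
<p id='normal'>The Future is now </p>
<ul>
    <li> Python </li>
    <li>Golang</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed.
languages = [item.get_text(strip=True) for item in soup.find_all("li")]
first_quote = soup.find("p").get_text(strip=True)

print(languages)
print(first_quote)
```

Keeping the results in a list makes it easy to sort them, filter them, or write them to a file later.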

Extracting paragraphs with a specific id

Apart from returning every tag of a given name, we can also check the attributes of those tags and keep only the ones we want, as shown below:

  • Extract paragraphs with an id of normal
from bs4 import BeautifulSoup
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
for paragraph in soup.find_all('p'):
    # .get() returns None instead of raising KeyError when a tag has no id
    if paragraph.get('id') == 'normal':
        print(paragraph.text)

Output

$ python app.py 
Time the time before the time times you
The Future is now 
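
The same filtering can also be handed to find_all() directly, which accepts attribute filters as keyword arguments or through the attrs dictionary. The sketch below uses an inline excerpt of sample.html; the last paragraph, which has no id, is a hypothetical addition to show that tags lacking the attribute are simply skipped.

```python
from bs4 import BeautifulSoup

# Inline excerpt of sample.html. The final <p> without an id is a
# hypothetical addition for illustration.
html = """
<div id='quotes'>
    <p id='normal'>Time the time before the time times you</p>
    <p id='normal'>The Future is now </p>
    <p id='special'>Be who you wanted to be when you're younger</p>
    <p>A paragraph without any id</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: only <p> tags whose id attribute equals 'normal'.
normal = soup.find_all("p", id="normal")

# Dictionary form: equivalent filtering via the attrs argument.
special = soup.find_all("p", attrs={"id": "special"})

normal_texts = [p.text for p in normal]
special_texts = [p.text for p in special]
print(normal_texts)
print(special_texts)
```

Letting find_all() do the filtering removes the need for an explicit if inside the loop.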

Demo Project

So far we have seen how to extract data from an HTML file in our local directory. Now let's see how to extract data from a live website.

Quotes spider

In this project, we are going to implement a web scraper that scrapes quotations from a website at a given URL.

We are going to use the requests library to pull the HTML from the website and then parse that HTML using BeautifulSoup.

Website of Interest (WOI)

In our demo project, we are going to scrape the quotes from quotes.toscrape.com

Demo project source code

The source code of the demo project is almost the same as before. The only real difference is that this time we obtain the HTML source code from a website using the requests module instead of reading it from a file.

import requests
from bs4 import BeautifulSoup
html = requests.get('http://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')
for span in soup.find_all('span'):
    # .string is None when a span holds more than one child node, so skip those
    if span.string:
        print(span.string)

Output

$ python scraper.py 
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
"Try not to become a man of success. Rather become a man of value."
"It is better to be hated for what you are than to be loved for what you are not."
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
"A day without sunshine is like, you know, night."
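
As a possible next step, the scraper could return structured records instead of printing every span. The sketch below is not the article's code: the function name parse_quotes is hypothetical, and the class names quote, text and author are assumptions about the site's markup that you should verify in your browser's inspector. It runs against a hand-written HTML snippet with a placeholder author so that it works without network access.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one quote block. The class names are assumptions
# about the site's markup, and "Example Author" is a placeholder.
SAMPLE_HTML = """
<div class="quote">
    <span class="text">"A day without sunshine is like, you know, night."</span>
    <span>by <small class="author">Example Author</small></span>
</div>
"""


def parse_quotes(html):
    """Return a list of {'text', 'author'} dicts found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for block in soup.find_all("div", class_="quote"):
        text = block.find("span", class_="text")
        author = block.find("small", class_="author")
        if text is None or author is None:
            continue  # skip blocks that do not match the assumed layout
        quotes.append({
            "text": text.get_text(strip=True),
            "author": author.get_text(strip=True),
        })
    return quotes


quotes = parse_quotes(SAMPLE_HTML)
print(quotes)
```

To run it against the live page, pass requests.get('http://quotes.toscrape.com/').text to parse_quotes() in place of SAMPLE_HTML.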

This article was also published here.

Hope you found it interesting. Please share it with your fellow developers on Twitter and other dev communities!