In this article, you'll learn the basics of web scraping in Python, and we'll build a demo project that scrapes quotes from a website.
Web scraping is the practice of extracting data from a website programmatically. With web scraping you can extract the text inside HTML tags, download images and files, and do almost anything you would otherwise do by copying and pasting manually, only faster.
As a programmer, you will often need content found on other people's websites, but those websites don't always give you an API for it. That's why it is worth learning web scraping.
To follow along with this tutorial, you need the following two libraries installed on your machine: requests and beautifulsoup4.
You can install both libraries with the pip command, as shown below:
$ pip install requests
$ pip install beautifulsoup4
Requests is an elegant and simple HTTP library for Python, built for human beings. It lets you send HTTP requests (GET, POST, PUT, DELETE) to a website in an easy way.
We are going to use the requests library in our demo project to send a GET request to the website and get its HTML source code.
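As a rough sketch of what that GET request can look like in practice, the helper below wraps requests.get and raises an error on a failed response instead of quietly returning an error page. The name fetch_html and the 10-second timeout are assumptions made for this illustration, not part of the demo project.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML source of `url` as a string (hypothetical helper)."""
    # Send a GET request; the timeout keeps the call from hanging forever.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    # .text is the response body decoded to a str.
    return response.text
```

Calling fetch_html('http://quotes.toscrape.com/') should return the page's HTML as a string, provided the site is reachable from your machine.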
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works on top of a parser (such as Python's built-in html.parser) to give us a way to navigate the HTML source code and extract the content we need.
To pull data from HTML or XML, we first need to convert its string representation into a BeautifulSoup object, which provides many methods for searching and manipulating it.
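To make that conversion concrete, here is a minimal sketch that turns a string into a BeautifulSoup object and reads two values back out of it. The HTML snippet is made up for this illustration.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet used only for this illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello, soup!</p></body></html>"

# 'html.parser' is the parser that ships with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # text inside the <title> tag
print(soup.p.text)        # text inside the first <p> tag
```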
Let's use the BeautifulSoup library to extract data from the HTML file below, sample.html.
<!DOCTYPE html>
<html>
<head>
<title>Document</title>
</head>
<body>
<div id = 'quotes'>
<p id = 'normal'>Time the time before the time times you</p>
<p id = 'normal'>The Future is now </p>
<p id = 'special'>Be who you wanted to be when you're younger</p>
<p id = 'special'>The world is a reflection of who you're</p>
</div>
<div>
<p id = 'Languages'>Programming Languages</p>
<ul>
<li>Python</li>
<li>C+++</li>
<li>Javascript</li>
<li>Golang</li>
</ul>
</div>
</body>
</html>
Extracting all paragraphs in HTML
Let's extract all the paragraphs from the sample.html shown above using BeautifulSoup:
from bs4 import BeautifulSoup

# Read sample.html and parse it into a BeautifulSoup object
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

# Print the text of every <p> tag
for paragraph in soup.find_all('p'):
    print(paragraph.text)
Output
When you run the program above, it produces the following result:
$ python app.py
Time the time before the time times you
The Future is now
Be who you wanted to be when you're younger
The world is a reflection of who you're
Programming Languages
Code Explanation
from bs4 import BeautifulSoup
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
The three lines above import BeautifulSoup, read sample.html, and create a BeautifulSoup object that is ready for parsing.
for paragraph in soup.find_all('p'):
    print(paragraph.text)
We used BeautifulSoup's find_all() method to extract all the paragraphs in the HTML file. It accepts the name of an HTML tag as a parameter, searches the parsed document for every matching tag, and returns them.
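As a small illustration of what find_all() gives back, the sketch below compares it with find(), which returns only the first match (or None when nothing matches). The HTML string is a trimmed-down stand-in for sample.html so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for sample.html (an assumption for this sketch).
html = """
<div id="quotes">
  <p id="normal">The Future is now</p>
  <p id="special">Be who you wanted to be when you're younger</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")  # list-like collection of every <p> tag
first = soup.find("p")           # only the first <p> tag
missing = soup.find("table")     # None, because there is no <table>

print(len(paragraphs))
print(first.get_text(strip=True))
print(missing)
```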
To extract the list items instead of the paragraphs, we pass the tag name li instead of p to the find_all() method, as shown below:
app.py
from bs4 import BeautifulSoup
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all('li'):
    print(item.text)
Output
$ python app.py
Python
C+++
Javascript
Golang
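If you would rather collect the items in a Python list than print them, one possible approach is to narrow the search to the ul element first and then build the list with a comprehension. This is only a sketch; the inline HTML is a shortened stand-in for sample.html, and the variable name languages is chosen for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the list section of sample.html.
html = """
<p id="Languages">Programming Languages</p>
<ul>
  <li>Python</li>
  <li>Javascript</li>
  <li>Golang</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Search only inside the first <ul>, then keep the text of each <li>.
languages = [item.get_text(strip=True) for item in soup.find("ul").find_all("li")]
print(languages)
```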
Extracting paragraphs with a specific id
Apart from returning every tag of a given name in the HTML, we can also check the attributes of those tags to extract only specific ones, as shown below:
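from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

# Keep only the paragraphs whose id attribute is 'normal'
for paragraph in soup.find_all('p'):
    # .get() returns None instead of raising KeyError when a tag has no id
    if paragraph.get('id') == 'normal':
        print(paragraph.text)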
Output
$ python app.py
Time the time before the time times you
The Future is now
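The same filtering can also be handed to find_all() itself, which accepts attribute filters either as keyword arguments or as an attrs dictionary. The sketch below assumes a shortened copy of sample.html with one extra paragraph that has no id, added here only to show that it is skipped.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for sample.html; the last <p> is added for illustration.
html = """
<div id="quotes">
  <p id="normal">The Future is now</p>
  <p id="special">Be who you wanted to be when you're younger</p>
  <p>No id on this one</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments to find_all() are treated as attribute filters.
special = soup.find_all("p", id="special")
# The same kind of filter written as an explicit attribute dictionary.
normal = soup.find_all("p", attrs={"id": "normal"})

print([p.text for p in special])
print([p.text for p in normal])
```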
So far we have seen how to extract data from an HTML file in our local directory. Now let's see how to extract data from a live website.
Quotes spider
In this project, we are going to implement a web scraper that scrapes quotations from the website at a given URL.
We are going to use the requests library to pull the HTML from the website and then parse that HTML using BeautifulSoup.
Website of Interest (WOI)
In our demo project, we are going to scrape the quotes from quotes.toscrape.com
Demo project source code
The source code of our demo project is much the same as before. The only difference is that this time we obtain the HTML source code from a website using the requests module instead of reading it from a file.
import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
html = requests.get('http://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')

# Print every <span> whose content is a single piece of text
for span in soup.find_all('span'):
    if span.string:
        print(span.string)
Output
$ python scraper.py
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
"Try not to become a man of success. Rather become a man of value."
"It is better to be hated for what you are than to be loved for what you are not."
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
"A day without sunshine is like, you know, night."
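Matching every span works here, but it depends on nothing else on the page using a plain-text span. A more targeted variant could select elements by class instead. The sketch below assumes that each quote sits in a div with class quote, with the quote text in a span with class text and the author in a small tag with class author; check the page source before relying on that. The function name extract_quotes, the sample snippet, and the author "Some Author" are placeholders made up for this illustration.

```python
from bs4 import BeautifulSoup


def extract_quotes(html):
    """Return (text, author) pairs found in `html` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    # Assumed markup: each quote lives in <div class="quote">.
    for block in soup.find_all("div", class_="quote"):
        text = block.find("span", class_="text")
        author = block.find("small", class_="author")
        # Skip blocks that do not have both parts.
        if text is not None and author is not None:
            quotes.append((text.get_text(strip=True), author.get_text(strip=True)))
    return quotes


# Hand-written snippet imitating the assumed page structure.
sample = """
<div class="quote">
  <span class="text">"A day without sunshine is like, you know, night."</span>
  <span>by <small class="author">Some Author</small></span>
</div>
"""
print(extract_quotes(sample))
```

To run it against the live site, pass in the HTML returned by requests, for example extract_quotes(requests.get('http://quotes.toscrape.com/').text).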
This article was also published here.
Hope you found it interesting. Please share it with your fellow developers on Twitter and other dev communities!