In this article, we will build a program that allows you to scrape or grab data from a website with a Python script. This method of gathering data is called web scraping.
Web scraping means using Python, or any other programming language, to download, clean, and use the data from a web page. Not every website welcomes scraping, so many sites publish a robots.txt file that lists which endpoints automated programs may and may not access.
Append /robots.txt to the root URL of a website to see these rules. For example, let’s use https://news.ycombinator.com/robots.txt.
The result is a plain text file that should look like this:
The screenshot states which endpoints we are allowed and not allowed to scrape from the YCombinator website. A crawl delay is the pause a program should leave between requests, so that constant scraping does not overload the servers and slow down the website.
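The same check can be done in code. The following is a minimal sketch using Python's standard urllib.robotparser module. The rules in SAMPLE_ROBOTS_TXT and the example.com URLs are invented for illustration and are not the contents of any real robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only; not the contents of any real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""


def load_rules(robots_txt):
    """Parse robots.txt text and return a parser that can answer questions about it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# May any user agent ("*") fetch these URLs?
print(rules.can_fetch("*", "https://example.com/news"))          # True
print(rules.can_fetch("*", "https://example.com/private/data"))  # False

# How many seconds should a program wait between requests?
print(rules.crawl_delay("*"))                                    # 30
```

For a live website, the parser can download the file itself: call set_url() with the robots.txt link, then read(), before asking can_fetch().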
In this exercise, we scrape the home page of the news site, which the robots.txt rules allow for our user agent.
The Python web scraper requires two modules for scraping the data:
Beautiful Soup is a Python library for extracting data from HTML files. It uses a parser to turn the raw HTML into a structured document that can be searched, which saves programmers hours of manual and repetitive work.
The requests HTTP library downloads the HTML of a page from its link with the .get() function.
Now to the nitty-gritty of this project. Create a new directory, and in there, a file named app.py that will contain all the code for the web scraper program.
Copy and paste the following code:
# app.py
import requests
response = requests.get('https://news.ycombinator.com/news')
yc_web_page = response.text
print(yc_web_page)
The code above does the following: it imports the requests module, downloads the HTML of the page from the provided link with the .get() function, and reads the body of the response as a string through the .text attribute.
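The download step can be made a little more defensive. The following is a minimal sketch, not part of the original script: the fetch_html helper, its session parameter, and the USER_AGENT value are hypothetical names introduced here. It adds a timeout, identifies the script with a User-Agent header, and raises an error on a failed response instead of returning an error page.

```python
import requests

# Hypothetical identifier; replace it with something that describes your script.
USER_AGENT = "my-tutorial-scraper/0.1"


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML as a string.

    Raises requests.HTTPError when the server answers with a 4xx or 5xx status.
    """
    # Use the given session (for example a requests.Session) or the module-level get().
    getter = session.get if session is not None else requests.get
    response = getter(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.text
```

Calling fetch_html('https://news.ycombinator.com/news') would then take the place of the requests.get() and .text lines in app.py.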
If running python app.py fails with a ModuleNotFoundError instead of printing the page, the modules the scraper depends on need to be installed. Run the following commands to install them.
pip3 install requests
pip3 install beautifulsoup4
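To confirm that the installation worked, the standard library can report which modules are still missing. This is a minimal sketch; the missing_modules helper is a hypothetical name introduced here.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Import names can differ from pip package names: beautifulsoup4 is imported as bs4.
print(missing_modules(["requests", "bs4"]))
```

An empty list means both modules can be imported by the Python interpreter that ran the check.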
Running app.py should now print the HTML source of the page, which looks like this:
Next, let’s update the app.py file with the rest of the code using Beautiful Soup:
# app.py
import requests
from bs4 import BeautifulSoup # add this
response = requests.get('https://news.ycombinator.com/news')
yc_web_page = response.text
# add this
soup = BeautifulSoup(yc_web_page, 'html.parser')
article_tag = soup.find(name="a", class_='titlelink')
article_title = article_tag.get_text()
article_link = article_tag.get('href')
article_upvote = soup.find(name="span", class_="score").get_text()
result = {
    "title": article_title,
    "link": article_link,
    "point": article_upvote
}
print(result)
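One caveat about these lookups: find() returns None when no element matches, so calling .get_text() on the result raises an AttributeError if the page markup differs from what the script expects. The following is a minimal sketch of a guarded version. The extract_first_article helper and the SAMPLE_HTML markup are invented for illustration; the class names titlelink and score are taken from the script and may not match the current markup of the live page.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the structure the script expects.
SAMPLE_HTML = """
<table>
  <tr><td><a class="titlelink" href="https://example.com/post">Example post</a></td></tr>
  <tr><td><span class="score">42 points</span></td></tr>
</table>
"""


def extract_first_article(html):
    """Return the title, link, and points of the first article, or None if no title link is found."""
    soup = BeautifulSoup(html, "html.parser")
    article_tag = soup.find(name="a", class_="titlelink")
    if article_tag is None:
        return None  # the expected element is not on the page
    score_tag = soup.find(name="span", class_="score")
    return {
        "title": article_tag.get_text(),
        "link": article_tag.get("href"),
        "point": score_tag.get_text() if score_tag is not None else None,
    }


print(extract_first_article(SAMPLE_HTML))
# {'title': 'Example post', 'link': 'https://example.com/post', 'point': '42 points'}
print(extract_first_article("<p>no articles here</p>"))
# None
```

Returning None keeps the failure explicit, so the calling code can decide whether to retry, skip the page, or report that the class names need updating.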
The updated app.py does the following: it imports BeautifulSoup from the bs4 module, then parses the HTML stored in yc_web_page with the BeautifulSoup function and html.parser, and stores the parsed document in the soup variable.
Before going over the rest of the code, let’s open the link passed to .get() in our web browser.
Next, right-click on the page, and click Inspect to view the Elements tab of the YCombinator news page.
Our web page should look like this:
With Beautiful Soup, we can target specific elements on the page by their tag and class names:
The article_tag variable holds the first element returned by the find() function, which is given the element's name, the a tag, and its class through class_. The trailing underscore is needed because class is a reserved keyword in Python.
The article_title variable holds the text of article_tag, extracted with the .get_text() function.
The article_link variable holds the link of article_tag, read from the href attribute with the .get() function.
The article_upvote variable works the same way: the tag name, <span>, and the class name score are used to extract the points of the first article on the page.
With the whole script written, running it should scrape the data from the news home page of YCombinator and print a result that looks like this:
This article showed you how to use a Python web scraper to extract data from a web page.
A web scraper also saves time and effort, since it collects large data sets far faster than gathering them manually.