How to Build a Python Web Scraper: Scrape Data from any Website
by Teri Eyenike (@terieyenike)

Too Long; Didn't Read

Python allows you to scrape, or grab, data from a website with a Python script. This method of gathering data is called web scraping. Most websites don’t want their data scraped, and to find out what is legal and permissible, sites publish a dedicated robots.txt page that details the endpoints allowed. In this exercise, we scrape the YCombinator news home page, which the site's rules permit for our user agent. With the whole script written, our program scrapes the data from that page.

In this article, we will build a program that allows you to scrape or grab data from a website with a Python script. This method of gathering data is called web scraping.

Web scraping is all about programmatically using Python, or any other programming language, to download, clean, and use the data from a web page. Most websites don’t want their data scraped, and to find out what is legal and permissible, check the site's robots.txt file, a dedicated page that details which endpoints crawlers may access.

Append robots.txt to a site's root URL to find out about the allowed endpoints. For example, let’s use https://news.ycombinator.com/robots.txt.

The result is a plain text file that looks like this:

[screenshot: contents of the YCombinator robots.txt file]

The screenshot states which endpoints we are, and are not, allowed to scrape from the YCombinator website. The Crawl-delay directive asks scrapers to pause between requests so that constant scraping does not overload the servers and slow down the website.
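These rules can also be checked programmatically with the standard library's urllib.robotparser. The ruleset below is a minimal sketch for illustration, not necessarily YCombinator's current file:

```python
from urllib import robotparser

# An illustrative ruleset (not necessarily YCombinator's current robots.txt).
rules = """\
User-Agent: *
Disallow: /vote?
Disallow: /threads?
Crawl-delay: 30
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The news home page is allowed, the vote endpoint is not.
print(rp.can_fetch("*", "https://news.ycombinator.com/news"))      # True
print(rp.can_fetch("*", "https://news.ycombinator.com/vote?id=1")) # False
print(rp.crawl_delay("*"))                                         # 30
```

To check a live site instead of an inline string, call rp.set_url() with the robots.txt URL followed by rp.read().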

In this exercise, we scrape the news home page, which the rules permit for our user agent.

Getting Started

The Python web scraper requires two necessary modules for scraping the data:

  • Beautiful Soup
  • Requests

Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML files. It reads the file with a parser, turns the markup into a navigable document tree, and saves programmers hours of manual and repetitive work.

Requests

The requests HTTP library downloads the HTML of a web page from its URL using the .get() function.
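As a minimal sketch of that idea (using example.com as a stand-in URL; any reachable page works):

```python
import requests

# Download a page and inspect the response (example.com is a stand-in URL).
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()   # raise an exception for 4xx/5xx status codes
print(response.status_code)   # 200 on success
print(response.text[:60])     # the first 60 characters of the HTML
```

Passing a timeout and calling raise_for_status() are optional but make failures visible instead of silent.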

Creating a Web Scraper

Now to the nitty-gritty of this project. Create a new directory, and in there, a file that will contain all the scripts for the web scraper program.

Copy and paste the following code:

# app.py

import requests

response = requests.get('https://news.ycombinator.com/news')
yc_web_page = response.text

print(yc_web_page)

The code above does the following:

  • Imports the requests module
  • Uses the response variable to store the result of the .get() function, which downloads the HTML of the website at the link provided
  • Reads the content of the web page with .text

If running this code with the command python app.py raises a ModuleNotFoundError, the two imported modules need to be installed.

Run the following commands to install the modules:

pip3 install requests

pip3 install beautifulsoup4

The result of the source code should look like this:

[screenshot: the raw HTML of the page printed to the terminal]

Next, let’s update the app.py file with the rest of the code using Beautiful Soup:

# app.py

import requests
from bs4 import BeautifulSoup # add this

response = requests.get('https://news.ycombinator.com/news')

yc_web_page = response.text

# add this 
soup = BeautifulSoup(yc_web_page, 'html.parser')

article_tag = soup.find(name="a", class_='titlelink')
article_title = article_tag.get_text()

article_link = article_tag.get('href')
article_upvote = soup.find(name="span", class_="score").get_text()

result = {
  "title": article_title,
  "link": article_link,
  "point": article_upvote
}

print(result)

Follow the code snippet above by doing the following:

  • Import the BeautifulSoup function from the bs4 module
  • Next, use the soup variable to parse the document stored in yc_web_page, passing html.parser to BeautifulSoup so it reads the page as HTML
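The parsing step can be tried on a small inline snippet, without a network request. The markup below is an illustrative stand-in for the page's structure, using the same class names the scraper targets:

```python
from bs4 import BeautifulSoup

# A self-contained sketch of the same parsing steps on a tiny HTML snippet,
# so it runs without downloading the page (the markup is illustrative).
html = """
<tr><td><a class="titlelink" href="https://example.com/post">Example post</a></td></tr>
<tr><td><span class="score">42 points</span></td></tr>
"""

soup = BeautifulSoup(html, "html.parser")

tag = soup.find(name="a", class_="titlelink")
print(tag.get_text())   # Example post
print(tag.get("href"))  # https://example.com/post
print(soup.find(name="span", class_="score").get_text())  # 42 points
```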

Before going over the rest of the code, open the link passed to .get() in your web browser. Next, right-click on the page and click Inspect to view the Elements tab of the YCombinator news page.

Our web page should look like this:

[screenshot: the YCombinator news page with the browser dev tools open]

With Beautiful Soup, we can target specific elements on the page with their class names:

  • Assign the article_tag variable with the find() function, passing the element's tag name, a, and its class via class_ (the trailing underscore is needed because class is a reserved keyword in Python)

[screenshot: the a tag and its class name highlighted in the Elements tab]

  • Now, extract the title of the article_tag using the .get_text() function
  • Next, extract the link of the article_tag from the href attribute with the .get() function
  • The same applies to the article_upvote variable, where the tag name, <span>, and the class name are used to extract the points for the article link
  • Create a result variable that displays the extracted data as a dictionary of key-value pairs
  • Print out the final result
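The same steps extend to every article on the page with find_all(), which returns all matches instead of the first. As a sketch, shown on an inline snippet (the markup and class names are illustrative):

```python
from bs4 import BeautifulSoup

# Sketch: apply the extraction to every article using find_all(),
# demonstrated on a small inline snippet (the markup is illustrative).
html = """
<a class="titlelink" href="https://example.com/a">First</a>
<span class="score">10 points</span>
<a class="titlelink" href="https://example.com/b">Second</a>
<span class="score">25 points</span>
"""

soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("a", class_="titlelink")
scores = soup.find_all("span", class_="score")

# Pair each title link with its score and build one dictionary per article.
results = [
    {"title": a.get_text(), "link": a.get("href"), "point": s.get_text()}
    for a, s in zip(articles, scores)
]
print(results)
```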

With the whole script written, our program scrapes the data from the YCombinator news home page, and the output should look like this:

[screenshot: the printed result dictionary in the terminal]

Conclusion

This article taught you how to build a Python web scraper to extract data from a web page.

A web scraper also saves time and effort, producing large data sets far faster than gathering them manually.
