
Web Scraping with Python and BeautifulSoup

by Mohit Sharma, September 12th, 2018




I am back with another tutorial on how to do Web Scraping with Python and BeautifulSoup.

What you’ll learn

What is Web Scraping
Why we need Web Scraping
Finally, how to do Web Scraping using Python and BeautifulSoup

When performing data science tasks, it’s common to want to use data found on the internet. You’ll usually be able to access this data in CSV format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you’ll want to use a technique called web scraping to get the data from the web page into a format you can work with in your analysis.

Today, I’ll show you how to perform Web Scraping using Python 3 and the BeautifulSoup library.

Before moving forward, I would like to share some of the basic components of a web page.

Whenever you visit a website or web page, your web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

HTML: contains the main content of the page.
CSS: adds styling to make the page look nicer.
JS: JavaScript files add interactivity to web pages.
Images: image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.

HTML

HTML is the standard markup language for creating Web pages.

HTML stands for Hyper Text Markup Language
HTML describes the structure of Web pages using markup
HTML elements are the building blocks of HTML pages
HTML elements are represented by tags
HTML tags label pieces of content such as “heading”, “paragraph”, “table”, and so on
Browsers do not display the HTML tags, but use them to render the content of the page

A Simple HTML Document

Example

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>


Example Explained

The <!DOCTYPE html> declaration defines this document to be HTML5
The <html> element is the root element of an HTML page
The <head> element contains meta information about the document
The <title> element specifies a title for the document
The <body> element contains the visible page content
The <h1> element defines a large heading
The <p> element defines a paragraph

For more details, refer to this HTML tutorial.

What is Web Scraping?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

For more details, refer to Wikipedia.

Why do we need Web Scraping?

A large organization needs to keep itself updated with the information changes occurring across a multitude of websites. An intelligent web scraper will find new websites from which it needs to scrape data. Intelligent approaches identify the changed data, extract it without extracting the unnecessary links present within, and navigate between websites to monitor and extract information on a real-time basis efficiently and effectively. You can easily monitor several websites simultaneously while keeping up with the frequency of updates.

You will observe, as mentioned earlier, that data across websites constantly changes. How will you know if a key change has been made by an organization? Let’s say there has been a personnel change in the organization; how will you find out about that? That’s where the alerts feature in web scraping comes into play. Intelligent web scraping techniques will alert you to the data changes that have occurred on a particular website, thus helping you keep an eye on opportunities and issues.

Web Scraping using Python and BeautifulSoup

First, I will demonstrate with a very basic HTML web page. Later on, I will show you how to do web scraping on real-world web pages.

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.

Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll need to first download it using the requests.get method.

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully.

We can print out the HTML content of the page using the content property:
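The download step can be sketched roughly as follows. The helper names download_page and show_page and the 10-second timeout are assumptions added for illustration; only requests.get, status_code, and content come from the text above.

```python
import requests

# The sample page named in the text above.
SIMPLE_PAGE_URL = "http://dataquestio.github.io/web-scraping-pages/simple.html"


def download_page(url, timeout=10):
    """Send a GET request and return the Response object.

    The helper name and the timeout value are illustrative choices.
    """
    return requests.get(url, timeout=timeout)


def show_page(url=SIMPLE_PAGE_URL):
    """Download a page, then print its status code and raw HTML."""
    page = download_page(url)
    # A status code of 200 means the page was downloaded successfully.
    print(page.status_code)
    # content holds the body of the response as bytes.
    print(page.content)
    return page
```

Calling show_page() needs network access, so nothing is downloaded until you invoke it yourself.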

BeautifulSoup

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:
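Here is a minimal sketch of that step. To keep it runnable offline, it parses an inline string, SIMPLE_HTML, which is an assumed stand-in for the downloaded page.content; the markup inside it is illustrative.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for page.content so the example runs without a download.
SIMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

# "html.parser" is the parser that ships with Python's standard library.
soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

# The parsed document is now an object we can query.
print(type(soup).__name__)
print(soup.title.get_text())
```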

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:
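A short sketch of prettify, again using an assumed inline HTML string in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the downloaded page, deliberately written on one line.
SIMPLE_HTML = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)

soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

# prettify returns a string with one tag per line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```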

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator rather than a list, so we need to call the list function on it.
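The sketch below walks the first two levels of the tree. The inline SIMPLE_HTML string is an assumed stand-in for the downloaded page, and filtering on Tag (rather than picking items by index) is my own choice so the example does not depend on how whitespace is parsed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed stand-in for the downloaded page.
SIMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

# children is an iterator, so wrap it in list() to see every item at once.
top_level = list(soup.children)
# The top level mixes the doctype, whitespace strings, and the html tag.
print([type(item).__name__ for item in top_level])

# Keep only real tags; the first (and only) one at this level is html.
html = next(item for item in top_level if isinstance(item, Tag))

# One level down: the tags nested directly inside html.
html_tags = [child for child in html.children if isinstance(child, Tag)]
print([tag.name for tag in html_tags])  # ['head', 'body']
```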

As you can see above, there are two tags here, head and body. We want to extract the text inside the p tag, so we’ll dive into the body (refer to just above, under html.children).

Now, we can get the p tag by finding the children of the body tag.

We can then use the get_text method to extract all of the text inside the tag.
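Put together, the descent from the document down to the paragraph text might look like this. The child_tags helper is hypothetical (it only skips whitespace strings and the doctype), and SIMPLE_HTML is an assumed stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed stand-in for the downloaded page.
SIMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""


def child_tags(node):
    """Return the direct children of node that are tags (hypothetical helper)."""
    return [child for child in node.children if isinstance(child, Tag)]


soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

html = child_tags(soup)[0]   # the only tag at the top level
body = child_tags(html)[1]   # head comes first, body second
p = child_tags(body)[0]      # the single paragraph inside body

# get_text returns the text inside the tag, without the markup.
print(p.get_text())
```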

Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract a single tag, we can instead use the find_all method, which will find all the instances of a tag on a page.
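As a rough illustration, the whole navigation above collapses into a single find_all call. The inline SIMPLE_HTML string is an assumed stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the downloaded page.
SIMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

# find_all searches the whole document and returns a list-like result.
paragraphs = soup.find_all("p")
print(len(paragraphs))

# Because the result is list-like, we index into it to reach a single tag.
print(paragraphs[0].get_text())
```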

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single Tag object (or None if the tag is not found).
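A small sketch of find, using the same kind of assumed inline HTML as a stand-in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the downloaded page.
SIMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

# find returns the first matching tag directly, so no indexing is needed.
first_p = soup.find("p")
print(first_p.get_text())

# When nothing matches, find returns None instead of raising an error.
missing = soup.find("table")
print(missing)
```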

If you want to fork this notebook, go to Web Scraping Tutorial.

Now, I’ll show you how to perform web scraping using Python 3 and the BeautifulSoup library. We’ll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.

We now know enough to proceed with extracting information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this page.

Once you open this page, press CTRL+SHIFT+I to open the browser’s developer tools and inspect the elements. Here we are interested in one particular column (San Francisco CA).

So, by right-clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel.

We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast.

If you explore the div, you’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with the class tombstone-container.

We now know enough to download the page and start parsing it. In the code below, we:

Download the web page containing the forecast.
Create a BeautifulSoup object to parse the page.
Find the div with id seven-day-forecast, and assign it to seven_day.
Inside seven_day, find each individual forecast item.
Extract and print the first forecast item.
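The steps above can be sketched as follows. Since the forecast page changes constantly, this sketch parses FORECAST_HTML, a hypothetical, trimmed-down stand-in for the real page; in practice you would pass the content of the downloaded page to BeautifulSoup instead. Only the id seven-day-forecast and the classes tombstone-container and period-name come from the text; the remaining class names and all values are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded forecast page. With a live page,
# this would be page.content from a requests.get call.
FORECAST_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p><img class="forecast-icon" src="sunny.png" title="Today: Sunny, near 69 degrees." /></p>
    <p class="short-desc">Sunny</p>
    <p class="temp">69 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img class="forecast-icon" src="clear.png" title="Tonight: Clear, near 55 degrees." /></p>
    <p class="short-desc">Clear</p>
    <p class="temp">55 °F</p>
  </div>
</div>
"""

# Create a BeautifulSoup object to parse the page.
soup = BeautifulSoup(FORECAST_HTML, "html.parser")

# Find the div with id seven-day-forecast.
seven_day = soup.find(id="seven-day-forecast")

# Inside seven_day, find each individual forecast item.
forecast_items = seven_day.find_all(class_="tombstone-container")

# Extract and print the first forecast item.
tonight = forecast_items[0]
print(tonight.prettify())
```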


As you can see, inside the forecast item tonight is all the information we want. There are 4 pieces of information we can extract:

The name of the forecast item: in this case, Today.
The description of the conditions: this is stored in the title attribute of the img tag.
A short description of the conditions: in this case, Sunny.
The temperature low: in this case, 69 °F.
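Pulling out those four pieces might look like the sketch below. It works on ITEM_HTML, a hypothetical single forecast item; the class names short-desc and temp are assumptions about the page's markup, and the values simply echo the ones listed above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for one forecast item; class names other than
# tombstone-container and period-name are assumptions.
ITEM_HTML = """
<div class="tombstone-container">
  <p class="period-name">Today</p>
  <p><img class="forecast-icon" src="sunny.png" title="Today: Sunny, near 69 degrees." /></p>
  <p class="short-desc">Sunny</p>
  <p class="temp">69 °F</p>
</div>
"""

tonight = BeautifulSoup(ITEM_HTML, "html.parser").find(class_="tombstone-container")

# The name of the forecast item.
period = tonight.find(class_="period-name").get_text()
# The short description of the conditions.
short_desc = tonight.find(class_="short-desc").get_text()
# The temperature.
temp = tonight.find(class_="temp").get_text()
# The full description lives in an attribute, so we index the tag like a dict.
desc = tonight.find("img")["title"]

print(period)
print(short_desc)
print(temp)
print(desc)
```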

Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the code below, we:

Select all items with the class period-name inside an item with the class tombstone-container in seven_day.

Use a list comprehension to call the get_text method on each BeautifulSoup object.
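A minimal sketch of these two steps, run against FORECAST_HTML, a hypothetical stand-in for the forecast page (the period names in it are placeholders):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded forecast page.
FORECAST_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Today</p></div>
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
  <div class="tombstone-container"><p class="period-name">Thursday</p></div>
</div>
"""

soup = BeautifulSoup(FORECAST_HTML, "html.parser")
seven_day = soup.find(id="seven-day-forecast")

# CSS selector: any .period-name element nested inside a .tombstone-container.
period_tags = seven_day.select(".tombstone-container .period-name")

# List comprehension: call get_text on each matched tag.
periods = [tag.get_text() for tag in period_tags]
print(periods)  # ['Today', 'Tonight', 'Thursday']
```

The same pattern should work for the other fields by swapping the second class in the selector, assuming the page uses a distinct class for each one.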