When performing data science tasks, it’s common to want to use data found on the internet. You’ll usually be able to access this data in CSV format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you’ll want to use a technique called web scraping to get the data from the web page into a format you can work with in your analysis.
Today, I’ll show you how to perform web scraping using Python 3 and the BeautifulSoup library.
Before moving forward, I would like to share some of the basic components of a web page.
Whenever you visit a website or web page, your web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types: HTML, which holds the main content of the page; CSS, which adds styling; JavaScript, which adds interactivity; and images.
After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
For more details, refer to this HTML tutorial.
What is Web Scraping?
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
For more details, refer to Wikipedia.
Why do we need Web Scraping?
A large organization needs to keep up with information that changes across a multitude of websites. An intelligent web scraper will find new websites from which it needs to scrape data. Intelligent approaches identify the changed data, extract it without following the unnecessary links around it, and navigate between websites to monitor and extract information in near real time. This lets you monitor several websites simultaneously while keeping up with how often they update.
As mentioned earlier, data across websites changes constantly. How will you know if an organization has made a key change? Let’s say there has been a personnel change in the organization: how will you find out about it? That’s where the alerts feature in web scraping comes into play. Intelligent web scraping techniques can alert you to data changes on a particular website, helping you keep an eye on opportunities and issues.
First, I will demonstrate web scraping on a very basic HTML web page. Later on, I’ll show you how to scrape real-world web pages.
The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.
Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll need to first download it using the requests.get method.
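The download step might look roughly like the sketch below. The helper name download_page and the 10-second timeout are my own choices for illustration, not part of the requests library; the helper returns None instead of raising when the network is unavailable.

```python
import requests

URL = "http://dataquestio.github.io/web-scraping-pages/simple.html"


def download_page(url, timeout=10):
    """Send a GET request and return the Response, or None if the request fails.

    download_page is a hypothetical helper name used only in this sketch.
    """
    try:
        return requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        # DNS failures, timeouts and refused connections all end up here.
        print(f"Could not download {url}: {exc}")
        return None


page = download_page(URL)
print(page)
```

Passing a timeout is optional, but without one a request to an unresponsive server can wait indefinitely.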
After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully. A status code of 200 means success; codes starting with 4 or 5 indicate an error.
We can print out the HTML content of the page using the content property:
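As a rough sketch of inspecting the response, the snippet below prints the status code and, if the download succeeded, the raw content. The try/except wrapper and the timeout value are assumptions added so the example degrades gracefully without a network connection.

```python
import requests

URL = "http://dataquestio.github.io/web-scraping-pages/simple.html"

try:
    page = requests.get(URL, timeout=10)
except requests.RequestException as exc:
    # No network, or the server did not answer in time.
    page = None
    print(f"Request failed: {exc}")

if page is not None:
    # 200 means the page was downloaded successfully.
    print(page.status_code)
    if page.status_code == 200:
        # content holds the raw bytes of the HTML document.
        print(page.content)
```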
We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:
We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:
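Here is a minimal sketch of parsing and pretty-printing. To keep it runnable offline, it parses an inline string (html_doc) that is assumed to stand in for the downloaded sample page; with a real download you would pass page.content instead. The "html.parser" argument selects Python's built-in parser.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; assumed to resemble the sample document.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the parsed tree as an indented string.
print(soup.prettify())
```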
As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator, so we need to call the list function on it.
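A small sketch of walking the top level follows, again on an inline stand-in document (html_doc is an assumption, not the downloaded page). It lists the types of the top-level children, then the tag names nested directly inside the html tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in for the downloaded page; assumed to resemble the sample document.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# children is an iterator, so wrap it in list() to inspect it.
children = list(soup.children)
print([type(child).__name__ for child in children])

# Keep only real tags (skipping the doctype and whitespace strings).
html = [child for child in children if isinstance(child, Tag)][0]
html_child_names = [child.name for child in html.children if isinstance(child, Tag)]
print(html_child_names)
```

Filtering on Tag avoids relying on fixed list positions, which shift whenever the whitespace in the document changes.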
The html tag has two child tags, head and body. We want to extract the text inside the p tag, so we’ll dive into the body.
Now, we can get the p tag by finding the children of the body tag.
Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside it.
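The whole walk from the document root down to the paragraph text can be sketched as follows. The helper child_tags is a hypothetical name introduced here, and html_doc is an inline stand-in assumed to resemble the sample page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in for the downloaded page; assumed to resemble the sample document.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def child_tags(node):
    """Return only the direct children that are tags (hypothetical helper)."""
    return [child for child in node.children if isinstance(child, Tag)]


html = child_tags(soup)[0]        # the <html> tag
head, body = child_tags(html)     # its two child tags
p = child_tags(body)[0]           # the <p> tag inside <body>

text = p.get_text()
print(text)
```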
What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract a single tag, we can instead use the find_all method, which will find all the instances of a tag on a page.
If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object.
If you want to fork this notebook, go to Web Scraping Tutorial.
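To illustrate the difference between the two methods, here is a short sketch on a made-up document with two paragraphs (the second paragraph is my addition so that find_all has more than one match).

```python
from bs4 import BeautifulSoup

# Made-up document with two paragraphs, for illustration only.
html_doc = """<html>
    <body>
        <p>Here is some simple content for this page.</p>
        <p>Here is a second paragraph.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like collection of every matching tag.
paragraphs = soup.find_all("p")
print(len(paragraphs))
print(paragraphs[0].get_text())

# find returns only the first matching tag (or None if there is no match).
first_p = soup.find("p")
print(first_p.get_text())
```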
Now, I’ll show you how to perform web scraping using Python 3 and the BeautifulSoup library. We’ll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.
We now know enough to proceed with extracting information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this page.
Once you open the page, press Ctrl+Shift+I to open the developer tools and inspect its elements. Here we are interested in one particular column (San Francisco CA).
So, by right-clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel.
We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast.
If you explore the div, you’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with the class tombstone-container.
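We now know enough to download the page and start parsing it. In the code below, we:
Download the web page containing the forecast.
Create a BeautifulSoup object to parse the page.
Find the div with the id seven-day-forecast and assign it to seven_day.
Inside seven_day, find each individual forecast item.
Extract and print the first forecast item.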
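The steps above can be sketched as follows. To stay runnable offline, the sketch parses a small made-up snippet (FORECAST_HTML) instead of downloading the live page; only the id seven-day-forecast and the classes tombstone-container and period-name come from the text, and everything else in the snippet (values, extra classes, image names) is assumed for illustration.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the forecast page. For the live page you would
# instead download it with requests.get(...) and parse page.content.
FORECAST_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img src="night.png" title="Tonight: Mostly clear, with a low around 49."></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 49 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Thursday</p>
    <p><img src="day.png" title="Thursday: Sunny, with a high near 63."></p>
    <p class="short-desc">Sunny</p>
    <p class="temp">High: 63 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Thursday Night</p>
    <p><img src="night.png" title="Thursday Night: Mostly clear, with a low around 50."></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 50 °F</p>
  </div>
</div>
"""

soup = BeautifulSoup(FORECAST_HTML, "html.parser")

# Find the outermost forecast element by its id.
seven_day = soup.find(id="seven-day-forecast")

# Each forecast item lives in a div with the class tombstone-container.
forecast_items = seven_day.find_all(class_="tombstone-container")

# Print the first forecast item.
tonight = forecast_items[0]
print(tonight.prettify())
```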
As you can see, the forecast item “Tonight” contains all the information we want. There are 4 pieces of information we can extract:
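As a sketch of pulling individual fields out of a single forecast item, the example below works on a made-up snippet. The class names short-desc and temp and the img title attribute are assumptions about the markup; check them against the real page in the elements panel before relying on them.

```python
from bs4 import BeautifulSoup

# Made-up single forecast item; short-desc, temp and the img title
# are assumed names, not confirmed by the text above.
ITEM_HTML = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p><img src="night.png" title="Tonight: Mostly clear, with a low around 49."></p>
  <p class="short-desc">Mostly Clear</p>
  <p class="temp">Low: 49 °F</p>
</div>
"""

tonight = BeautifulSoup(ITEM_HTML, "html.parser").find(class_="tombstone-container")

# Text fields are read with get_text().
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

# Attributes are read like dictionary keys on the tag.
description = tonight.find("img")["title"]

print(period)
print(short_desc)
print(temp)
print(description)
```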
Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.
In the code below, we:
Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
Use a list comprehension to call the get_text method on each BeautifulSoup object.
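These two steps can be sketched as shown below, once more on a made-up snippet rather than the live page (the period names are sample values).

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the forecast markup.
FORECAST_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
  <div class="tombstone-container"><p class="period-name">Thursday</p></div>
  <div class="tombstone-container"><p class="period-name">Thursday Night</p></div>
</div>
"""

soup = BeautifulSoup(FORECAST_HTML, "html.parser")
seven_day = soup.find(id="seven-day-forecast")

# CSS selector: any .period-name element nested inside a .tombstone-container.
period_tags = seven_day.select(".tombstone-container .period-name")

# List comprehension: extract the text from every selected tag.
periods = [tag.get_text() for tag in period_tags]
print(periods)
```

The same pattern, with a different selector, applies to any other field that appears once per forecast item.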
Combining our data into a Pandas DataFrame
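One way to combine the extracted lists is to pass them to the DataFrame constructor as a dictionary of columns. In this sketch the lists are hard-coded sample values (not scraped results), and the column names are my own choice.

```python
import pandas as pd

# Hard-coded sample values standing in for the scraped lists.
periods = ["Tonight", "Thursday", "Thursday Night"]
short_descs = ["Mostly Clear", "Sunny", "Mostly Clear"]
temps = ["Low: 49 °F", "High: 63 °F", "Low: 50 °F"]

# Each dictionary key becomes a column; all lists must have the same length.
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
})
print(weather)
```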
We can use a regular expression and the Series.str.extract method to pull out the numeric temperature values.
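A minimal sketch of that extraction is shown below, assuming temperature strings shaped like "Low: 49 °F" (sample values, not scraped results). The pattern captures the first run of digits, with an optional minus sign, and the new column names are hypothetical.

```python
import pandas as pd

# Sample values; the string format is an assumption about the scraped text.
weather = pd.DataFrame({
    "period": ["Tonight", "Thursday", "Thursday Night"],
    "temp": ["Low: 49 °F", "High: 63 °F", "Low: 50 °F"],
})

# Capture the first (optionally negative) integer in each string.
# expand=False returns a Series instead of a one-column DataFrame.
temp_nums = weather["temp"].str.extract(r"(-?\d+)", expand=False)

# The extracted values are strings, so convert them to integers.
weather["temp_num"] = temp_nums.astype(int)

# Flag the rows that report a nightly low.
weather["is_night"] = weather["temp"].str.contains("Low")

print(weather)
print(weather["temp_num"].mean())
```

With the temperatures stored as numbers, ordinary aggregations such as the mean, or boolean filters such as weather[weather["is_night"]], work directly.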
I hope you now have a good understanding of how to scrape data from web pages. In the coming weeks, I’ll do web scraping on
Hope you like this article!! Don’t forget to like this blog and share with others.