
How to Track Pandemic Cases With Python

by Kalebu Jordan November 17th, 2020

Hello Pythonistas, in this tutorial you're going to learn how to track worldwide coronavirus cases using the requests and BeautifulSoup libraries in Python.

Note: If you're new to web scraping, I recommend reading A beginner guide to Webscraping first and then coming back to complete this tutorial.

Requirements

To follow along with this tutorial, you need the following libraries installed on your machine, with the exception of the csv module, which comes with the Python standard library.

  1. requests
  2. BeautifulSoup
  3. csv

Installation

Use pip to install the dependencies mentioned above, as shown below:

$ pip install requests

$ pip install beautifulsoup4
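
Before moving on, it can help to confirm that the packages are importable. Here is a minimal sketch that uses only the standard library. The helper name `missing_packages` is made up for this illustration, and note that beautifulsoup4 is imported under the module name bs4.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# beautifulsoup4 installs a module called bs4; csv ships with Python.
missing = missing_packages(["requests", "bs4", "csv"])
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All set")
```

If anything is reported as missing, rerun the matching pip command above.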

Let's get started

Where do we scrape the coronavirus cases from?

As stated above, we are going to track the number of cases worldwide by scraping the information from the web. There are many websites to scrape from, but in this tutorial we will go with worldometer.

Let's figure out the structure of the website

Let's first understand the structure of the website we are about to scrape. Once you open the worldometer website, scroll down and you will see a table of coronavirus statistics.

How is a table represented in HTML?

A table in HTML is usually represented with a table tag, where tr indicates a row and td indicates a specific cell (column) in that row, as shown in the example below:

      <table border = "1">
         <tr>
            <td>Row 1, Column 1</td>
            <td>Row 1, Column 2</td>
         </tr>

         <tr>
            <td>Row 2, Column 1</td>
            <td>Row 2, Column 2</td>
         </tr>
      </table>

That means every row in the worldometers table is represented by a tr tag. Therefore, we need to extract all the rows from the worldometers table and then store them in a CSV file.
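
To see how this maps to BeautifulSoup, the sample table above can be parsed into a list of rows. This is a small sketch on a hard-coded HTML string, not on the real worldometers page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

sample_html = """
<table border="1">
  <tr><td>Row 1, Column 1</td><td>Row 1, Column 2</td></tr>
  <tr><td>Row 2, Column 1</td><td>Row 2, Column 2</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Each tr is one row; each td inside it is one cell of that row.
table = [
    [cell.get_text() for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]
print(table)
```

The result is a list of lists, which is the shape the csv module expects when writing rows.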

Complete Coronavirus Spider

Below is the complete code of the spider you're going to build in this tutorial. It scrapes the live coronavirus numbers from the worldometer website and stores them in a CSV file.

app.py

import csv
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.worldometers.info/coronavirus/').text
html_soup = BeautifulSoup(html, 'html.parser')
rows = html_soup.find_all('tr')

def extract_text(row, tag):
    element = BeautifulSoup(row, 'html.parser').find_all(tag)
    text = [col.get_text() for col in element]
    return text

heading = rows.pop(0)
heading_row = extract_text(str(heading), 'th')[1:9]

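# newline='' stops the csv module from writing blank lines between rows on Windows
with open('corona.csv', 'w', newline='', encoding='utf-8') as store:
    writer = csv.writer(store, delimiter=',')
    writer.writerow(heading_row)
    for row in rows:
        row_data = extract_text(str(row), 'td')[1:9]
        writer.writerow(row_data)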

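Output

Running app.py creates a file named corona.csv in the current working directory, with the headings on the first line and one line per table row after it.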

Let's break the code into pieces so that we can understand the role of each part and how they all work together to get the live numbers.

Importing necessary libraries

The first three lines of code import the modules and libraries that we will use to scrape the live COVID-19 cases and store the data in a file.

import csv
import requests
from bs4 import BeautifulSoup

Getting the web-page

To extract and filter the coronavirus numbers, we need a way to programmatically access the source code of the webpage. To do this, we will use the requests library, as shown below.

html = requests.get('https://www.worldometers.info/coronavirus/').text
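
The one-liner works, but it will happily hand back an error page or hang on a slow connection. Here is a slightly more defensive sketch that uses the `timeout` argument of `requests.get` and `raise_for_status()`. The function name `fetch_html` and the ten-second timeout are choices made for this example, not part of the original script.

```python
import requests

URL = "https://www.worldometers.info/coronavirus/"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request):
# html = fetch_html(URL)
```

Some sites also reject requests that carry no browser-like headers. If that happens, `requests.get` accepts a `headers` dictionary as well.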

Extracting all rows in a table

Now that we have the HTML source code of the website, it's time to extract all the rows present in the table showing the coronavirus stats. To do this, we will use BeautifulSoup, as shown below.

html_soup = BeautifulSoup(html, 'html.parser')

rows = html_soup.find_all('tr')
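
One thing to keep in mind: `find_all('tr')` on the whole document returns rows from every table on the page, not only the one you care about. If the page holds more than one table, you may want to narrow the search first. Here is a sketch on a made-up two-table snippet (the HTML and names are invented for illustration, and the real page layout may differ):

```python
from bs4 import BeautifulSoup

two_tables = """
<table><tr><td>A1</td></tr><tr><td>A2</td></tr></table>
<table><tr><td>B1</td></tr></table>
"""

soup = BeautifulSoup(two_tables, "html.parser")

# Searching the whole document picks up rows from both tables.
all_rows = soup.find_all("tr")

# find() returns the first matching tag, so this keeps only the first table's rows.
first_table_rows = soup.find("table").find_all("tr")

print(len(all_rows), len(first_table_rows))
```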

Making a function to unpack the row columns

After extracting all the rows in the coronavirus table, we need a way to parse the details of each column in a row, and that's what the function below does.

def extract_text(row, tag):
    element = BeautifulSoup(row, 'html.parser').find_all(tag)
    text = [col.get_text() for col in element]
    return text
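
Since the rows returned by `find_all` are already BeautifulSoup tags, a variant can skip the `str()` round trip and search the tag directly. The sketch below is an alternative, not the version used in app.py. `extract_cells` is a name chosen here, the row contents are made up, and `get_text(strip=True)` trims the surrounding whitespace that table cells often carry.

```python
from bs4 import BeautifulSoup


def extract_cells(row, tag):
    """Return the stripped text of every `tag` element inside a parsed row."""
    return [cell.get_text(strip=True) for cell in row.find_all(tag)]


# A made-up row with untidy whitespace, standing in for a scraped one.
row_html = "<tr><td> 1 </td><td>\nCountry X\n</td><td>123</td></tr>"
row = BeautifulSoup(row_html, "html.parser").find("tr")
print(extract_cells(row, "td"))
```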

Parsing the header

Since we don't want to mix up the header names with the real stats, we need to pop the header out of the row list, as shown below.

heading = rows.pop(0)

heading_row = extract_text(str(heading), 'th')[1:9]
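
If the `pop(0)` call and the `[1:9]` slice look cryptic, here is the same idea on plain Python lists. The values below are placeholders, not the real worldometers headings.

```python
# Placeholder rows: the first one plays the role of the header row.
rows = [
    ["#", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9"],
    ["1", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9"],
]

# pop(0) removes the first item from the list and returns it.
heading = rows.pop(0)

# [1:9] keeps indexes 1 to 8: it skips the first column and stops before index 9.
heading_row = heading[1:9]

print(heading_row)
print(len(rows))
```

After the pop, `rows` holds only data rows, so the later loop never writes the header twice.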

Parsing row details and storing to CSV

Now our last job is to parse the individual details of every row in the table and store them in the CSV file using the csv module. This is the final with block in app.py above: it writes the heading row first and then one line per table row.
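
The writing step can be tried on its own with sample values in place of the scraped rows. In this sketch the values and the `corona_sample.csv` file name are invented for illustration. Passing `newline=''` to `open` is what the csv module documentation recommends, and it avoids blank lines between rows on Windows.

```python
import csv
import os
import tempfile

heading_row = ["Country", "Total Cases"]
data_rows = [["Country A", "100"], ["Country B", "2,500"]]

# Write into the system temp directory so the sketch leaves your project untouched.
path = os.path.join(tempfile.gettempdir(), "corona_sample.csv")

with open(path, "w", newline="", encoding="utf-8") as store:
    writer = csv.writer(store, delimiter=",")
    writer.writerow(heading_row)
    # writerows writes one CSV line per inner list.
    writer.writerows(data_rows)

# Read the file back to check what was stored.
with open(path, newline="", encoding="utf-8") as store:
    saved = list(csv.reader(store))
print(saved)
```

Note that a value containing a comma, such as "2,500", is quoted by the writer and comes back intact when read.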

Congratulations, you have just built your own coronavirus tracker in Python. Tweet it to share it with your fellow developers.


If you have any comments, suggestions, or difficulties, drop them in the comment box below and I will get back to you ASAP.

To get the full code for this article, check out my GitHub.

Previously published at https://kalebujordan.com/scrap-worldometers-live-update-with-beautifulsoup/