Writing a Scraping Bot with Python and Selenium

Written by otavioss | Published 2023/06/20

TL;DR: Selenium is a tool initially designed for automated testing of web applications. Although that’s not its primary purpose, Selenium is also used in Python for web scraping. As an example, we’ll scrape currencies’ historical data to show how to build a powerful data-collection tool with Selenium.

Selenium is a tool initially designed for automated testing of web applications, and it’s available in several different programming languages. Although that’s not its main purpose, Selenium is also used in Python for web scraping because of its ability to access JavaScript-rendered content, which regular scraping tools such as BeautifulSoup can’t do.
Another use case for scraping with Selenium is when it’s necessary to interact with the page before collecting the data, such as by clicking buttons or filling out fields. This is the use case that will be covered in this article. As an example, we’ll scrape investing.com to extract historical data on the dollar exchange rates against one or more currencies.
Searching the web, you can find APIs and Python packages that make the task of gathering financial data much easier than scraping it manually. However, the idea here is to explore how Selenium can be helpful for general data extraction; financial data is just an example.
Content Overview
  • The scraper
  • The code
  • Handling exceptions
  • Next steps and wrapping up

The Scraper

First, we need to understand the website. The following URL leads to the historical data for the exchange rate of the dollar against the euro.
https://investing.com/currencies/usd-eur-historical-data
On this page, you can see a table with the data and the option to set the date range we want. That’s what we’re going to use. To see the data for other currencies against the dollar, just replace “eur” with the other currency code in the URL.
Also, this assumes you only want the currency’s exchange rate against the dollar. If that’s not the case, just replace the “usd” in the URL as well.
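For instance, here’s a quick sketch of how URLs for other pairs can be built from that pattern (the pairs below are just illustrative examples):
# Building URLs for other currency pairs from the same pattern
base = 'https://investing.com/currencies/{}-{}-historical-data'
print(base.format('usd', 'jpy'))  # dollar against the yen
print(base.format('eur', 'gbp'))  # euro against the pound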

The Code

We’ll start with the imports, of course, and we don’t need much. Let’s import some useful items from Selenium, the sleep function to insert some pauses in the code, and Pandas to manipulate the data when necessary.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep
import pandas as pd
We’ll write a function to scrape the data. The function will receive:
  • A list of currency codes;
  • A start date;
  • An end date;
  • A boolean indicating whether we want to export the data as a .csv file. I’ll use False as the default.
As the idea here is to build a scraper capable of gathering data on multiple currencies, we’ll also initialize an empty list to store the data for each currency.
def get_currencies(currencies, start, end, export_csv=False):
    frames = []
As the function now has a list of currencies, you’ll probably imagine that we’ll iterate over this list and get the data currency by currency. That’s precisely the plan.
So, for each currency in the currencies list, we’ll create a URL, instantiate a driver object, and use it to get the page. Then we’ll maximize the window, but that’s only visible if you keep option.headless as False; otherwise, Selenium will do all the work without showing you anything.
for currency in currencies:
    my_url = f'https://investing.com/currencies/usd-{currency.lower()}-historical-data'
    option = Options()
    option.headless = False
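    # Note: recent Selenium releases removed the Options.headless attribute;
    # on those versions, enable headless mode with option.add_argument('--headless=new') instead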
    driver = webdriver.Chrome(options=option)
    driver.get(my_url)
    driver.maximize_window()
At this point, we’re already looking at the historical data, and we could just grab the table. However, by default, we only see the data for roughly the last 20 days, while we want this data for any time period we choose. For this, we’ll use some interesting Selenium functionality to interact with the website. This is where Selenium shines!
What we’ll do here is click on the date field, fill in the Start Date and End Date fields with the dates we want, and hit Apply. For this, we’ll use WebDriverWait, expected_conditions, and By to make sure the web driver waits for the elements we want to interact with to become clickable. That’s important because if the driver tries to interact with an element before it becomes clickable, an exception will be raised.
The waiting time will be twenty seconds, but it’s up to you to set it as you find appropriate. First, let’s select the date button by its XPath and then click on it.
date_button = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH,
            "/html/body/div[5]/section/div[8]/div[3]/div/div[2]/span")))
date_button.click()
Now, we need to fill in the Start Date field. Let’s first select it and then use clear to delete the default date and send_keys to fill it with the date we want.
start_bar = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH,
            "/html/body/div[7]/div[1]/input[1]")))
start_bar.clear()
start_bar.send_keys(start)
And now we repeat the process for the End Date field.
end_bar = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH, 
            "/html/body/div[7]/div[1]/input[2]")))
end_bar.clear()
end_bar.send_keys(end)
With this done, we’ll select the Apply button and click on it. Then we use sleep to pause the code for a few seconds and make sure the new page is fully loaded.
apply = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,
                                        "/html/body/div[7]/div[5]/a")))
apply.click()
sleep(5)
If you kept option.headless as False, you’ll see this entire process happening in front of you, as if somebody were actually clicking on the page. When Selenium clicks Apply, you’ll see the table reload to show the data for the time period you specified.
We now use the pandas.read_html function to select all the tables on the page. This function receives the page’s source code. Finally, we can quit the driver.
dataframes = pd.read_html(driver.page_source)
driver.quit()
print(f'{currency} scraped.')

Handling Exceptions

Although the process of collecting the data is done, we need to consider that Selenium can sometimes be a little unstable and could eventually fail to load the page at some point during the sequence of actions we’re performing here.
To prevent that, we’ll put the entire code inside a try clause that sits inside an infinite loop. Once Selenium manages to complete the whole data-collection process described above, the loop will be broken, but every time it finds a problem, an except clause will be activated. In this scenario, the code will:
  • Quit the driver. It’s always important to do this so we don’t end up with dozens of memory-consuming web drivers running;
  • Print a message indicating the error;
  • Sleep for thirty seconds;
  • Go to the start of the loop once more.
This process will be repeated until the data for each currency is properly collected. And this is the code for all this:
for currency in currencies:
    while True:
        try:
            # Opening the connection and grabbing the page
            my_url = f'https://investing.com/currencies/usd-{currency.lower()}-historical-data'
            option = Options()
            option.headless = False
            driver = webdriver.Chrome(options=option)
            driver.get(my_url)
            driver.maximize_window()
               
            # Clicking on the date button
            date_button = WebDriverWait(driver, 20).until(
                        EC.element_to_be_clickable((By.XPATH, "/html/body/div[5]/section/div[8]/div[3]/div/div[2]/span")))
            date_button.click()
            
            # Sending the start date
            start_bar = WebDriverWait(driver, 20).until(
                        EC.element_to_be_clickable((By.XPATH, "/html/body/div[7]/div[1]/input[1]")))
            start_bar.clear()
            start_bar.send_keys(start)

            # Sending the end date
            end_bar = WebDriverWait(driver, 20).until(
                        EC.element_to_be_clickable((By.XPATH, "/html/body/div[7]/div[1]/input[2]")))
            end_bar.clear()
            end_bar.send_keys(end)
           
            # Clicking on the apply button
            apply = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[7]/div[5]/a")))
            apply.click()
            sleep(5)
            
            # Getting the tables on the page and quitting
            dataframes = pd.read_html(driver.page_source)
            driver.quit()
            print(f'{currency} scraped.')
            break
        
        except Exception:
            driver.quit()
            print(f'Failed to scrape {currency}. Trying again in 30 seconds.')
            sleep(30)
            continue
One last step, though. If you recall, what we have so far is a list containing all the tables on the page stored as DataFrames. We need to select the one table that contains the historical data we want.
For each DataFrame in this dataframes list, we’ll check if its column names match what we expect. If they do, that’s our frame, and we break the loop. Now we’re finally ready to append this DataFrame to the list that was initialized at the beginning.
for dataframe in dataframes:
    if dataframe.columns.tolist() == ['Date', 'Price', 'Open', 'High', 'Low', 'Change%']:
        df = dataframe
        break
frames.append(df)
And yes, if the export_csv parameter was set to True, we need to export a .csv file, but that’s far from being an issue, as the DataFrame.to_csv method can easily get this done. Then we can just wrap this function up by returning the list of DataFrames. This last step is done after the loop through the currencies list is over, of course.
# Inside the loop
    if export_csv:
        df.to_csv(f'{currency}.csv', index=False)
        print(f'{currency}.csv exported.')

# Outside of the loop
    return frames
And that’s it! Here’s the complete code for everything we just did:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep
import pandas as pd

def get_currencies(currencies, start, end, export_csv=False):
    frames = []
    for currency in currencies:
        while True:
            try:
                # Opening the connection and grabbing the page
                my_url = f'https://investing.com/currencies/usd-{currency.lower()}-historical-data'
                option = Options()
                option.headless = False
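                # Note: recent Selenium releases removed the Options.headless attribute;
                # on those versions, enable headless mode with option.add_argument('--headless=new') instead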
                driver = webdriver.Chrome(options=option)
                driver.get(my_url)
                driver.maximize_window()
                # Clicking on the date button
                date_button = WebDriverWait(driver, 20).until(
                            EC.element_to_be_clickable((By.XPATH, 
                            "/html/body/div[5]/section/div[8]/div[3]/div/div[2]/span")))
                date_button.click()
                # Sending the start date
                start_bar = WebDriverWait(driver, 20).until(
                            EC.element_to_be_clickable((By.XPATH, 
                            "/html/body/div[7]/div[1]/input[1]")))
                start_bar.clear()
                start_bar.send_keys(start)
                # Sending the end date
                end_bar = WebDriverWait(driver, 20).until(
                            EC.element_to_be_clickable((By.XPATH, 
                            "/html/body/div[7]/div[1]/input[2]")))
                end_bar.clear()
                end_bar.send_keys(end)
                # Clicking on the apply button
                apply = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,
                                                        "/html/body/div[7]/div[5]/a")))
                apply.click()
                sleep(5)
                # Getting the tables on the page and quitting
                dataframes = pd.read_html(driver.page_source)
                driver.quit()
                print(f'{currency} scraped.')
                break
            except Exception:
                driver.quit()
                print(f'Failed to scrape {currency}. Trying again in 30 seconds.')
                sleep(30)
                continue
                
        # Selecting the correct table            
        for dataframe in dataframes:
            if dataframe.columns.tolist() == ['Date', 'Price', 'Open', 'High', 'Low', 'Change%']:
                df = dataframe
                break
        frames.append(df)
        # Exporting the .csv file
        if export_csv:
            df.to_csv(f'{currency}.csv', index=False)
            print(f'{currency}.csv exported.')
                  
    return frames
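As a quick usage example (the date strings below assume the MM/DD/YYYY format the site’s date picker expects; adjust them if the fields use a different pattern):
# Hypothetical call: scrape EUR and GBP rates for 2021 and export the .csv files
frames = get_currencies(['eur', 'gbp'], '01/01/2021', '12/31/2021', export_csv=True)
print(len(frames))  # one DataFrame per currency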

Next Steps and Wrapping Up

So far, what the code does is get the historical data for the exchange rates of a list of currencies against the dollar, returning a list of DataFrames and, if you request it, several .csv files.
But there’s always room for improvement. With a few more lines of code, it’s not hard to make the function return and export a single DataFrame containing the data for every currency in the list, as sketched below. Another suggestion is to write an update function, using the same Selenium functionality, that receives an existing DataFrame and updates its historical data to the present date.
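Here’s a minimal sketch of that first idea, assuming the frames list is ordered like the currencies list passed to the function:
import pandas as pd

def combine_frames(frames, currencies):
    # Tag each DataFrame with its currency code, then stack them vertically
    for frame, currency in zip(frames, currencies):
        frame['Currency'] = currency.upper()
    return pd.concat(frames, ignore_index=True)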
Besides, the exact same logic used to scrape the currencies can be used to scrape stocks, indices, commodities, futures, and much more. There are literally hundreds of pages to scrape.
However, if that’s what you’re going for, you must take some safety measures before running your code. That’s because the more requests you send to a website in an automated manner, the more you overload its server, and the greater the chances of getting blocked.
To avoid that, a good option is to take advantage of advanced scraping tools with built-in technology that keeps your code from being identified and blocked, without losing performance or the scalability you need.
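Even without such tools, a simple precaution is to space out your requests. A minimal sketch, to be dropped between iterations of the currencies loop:
from random import uniform
from time import sleep

# Pause for a random interval between requests to reduce the load on the server
sleep(uniform(5, 15))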
Finally, using Selenium as shown in this article can be helpful in several other situations, such as signing in to websites, filling out forms, selecting items in a dropdown list, and much more. Of course, it’s not the only solution to such problems, but it can definitely be useful, depending on the use case.
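For instance, selecting an item in a dropdown follows the same pattern as everything above. A sketch, assuming a <select> element whose id is "period" (both the id and the option text here are hypothetical):
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By

# Wrap a <select> element and pick an option by its visible text
dropdown = Select(driver.find_element(By.ID, 'period'))
dropdown.select_by_visible_text('Monthly')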
I hope you’ve enjoyed this article and that it proves useful to you somehow.
If you have a question or a suggestion, feel free to get in touch.


