Scraping with Selenium 101: The Big Hole on Data Scientists Toolset [Part 1]
Relatively rare element in the universe.
Usually left out of Data Science master's programs and courses, web scraping is, in my honest opinion, a basic tool in the Data Scientist's toolset: it is the tool for getting, and therefore using, data from outside your organization when public databases are not available.
And until APIs and accessible open-source databases become the norm (if that ever happens) instead of the exception, web scraping will be necessary to get that juicy data you are looking for; be it the prices of your competition on their website or the location of all the supermarkets around you.
But what is Selenium?
Selenium is a web browser automator. When used as a scraper, it gives you a browser like Chrome or Firefox that navigates to the webpage you want, interacts with it the way you want, and extracts along the way all the information you might need.
Although it's usually used for testing stuff, it's such a general tool that it's commonly and widely used for web scraping.
Remember, this is a basic tutorial (please note the word basic, as I'll be making simplifications) to get you up and running with Selenium as a web scraper; do not take it as an exhaustive tutorial.
Why is Selenium important for me as a Data Scientist?
Well... you might have noticed that (sadly) there aren't a lot of people providing easy or free access to data outside of your company (sometimes not even inside your company).
E.g.: you might need a list of all the hotel prices in your city, and literally no one is going to hand you an ordered database of hotel prices in an Excel file. If they do, it's your lucky day; if they don't, it's also your lucky day: YOU CAN START SCRAPING.
The fact is that there is a lot of information out there for you to get, and as a Data Scientist you can either sit at your desk and say "I don't have the information needed for that analysis" or figure out imaginative ways of getting it, and scraping is one of them.
Note: there are other quite efficient scraping frameworks out there, such as Scrapy (https://scrapinghub.com/scrapy-cloud), that might be more suitable and easier to implement for your project.
The thing is that almost all the webpages I've found worth scraping were already protected against Scrapy and similar tools (you'll probably need IP rotation, and that's money), while most of them were not protected against Selenium (I've hardly found one or two that resisted well-written Selenium code).
Selenium also works well on old-style pages and messily coded pages. All in all, for me Selenium is like a 4x4 car: not the fastest, not the most efficient, nor the easiest to drive, but it can take you almost anywhere. It doesn't need paved roads and can of course go off-road easily.
Now let's get our hands dirty with data gathering
0. Some personal advice
As selenium says in their webpage:
So if you are going to use Selenium to scrape a website, first read the site's terms and conditions or check its robots.txt file (a lot of sites allow you to scrape, others don't, and most have no clear policy, so use it at your own risk). And of course be gentle: do not fire off requests like a beast, try to behave like an automated human, run your scraper during low-demand times, and do not overload your host's servers.
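You can even check a robots.txt policy programmatically with Python's standard library. A minimal sketch; the robots.txt body and the example URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt body; in practice you would read it
# from https://<the-site>/robots.txt before scraping.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Paths not matched by a Disallow rule are allowed for "*"
print(parser.can_fetch("*", "https://example.com/articles"))      # True
# /private/ is explicitly disallowed
print(parser.can_fetch("*", "https://example.com/private/data"))  # False
```

If you prefer, `set_url()` plus `read()` will fetch and parse the live file for you; the offline `parse()` above just keeps the example self-contained.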
1. Installing Selenium
Well, this is an easy one. I'll assume you already have Python 3 installed and know how to create at least a simple notebook:
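Assuming pip is available, installing the Selenium Python bindings is one command (use pip3 if your default pip points at Python 2):

```shell
# Install the Selenium Python bindings for Python 3
pip install selenium
```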
2. Installing the driver
Selenium controls a browser (Chrome, Firefox, etc.) through a driver, and each browser has its own driver.
So the thing goes like this: Selenium tells the driver, through the webdriver interface, what to do in the selected browser.
Think of it like muppets: Selenium is the brain that tells the hand what to do, the hand is the driver, and finally, the muppet is the browser.
The easiest way to install geckodriver on a Mac is to go to your terminal and just type:
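The command in question is the Homebrew formula for Firefox's driver:

```shell
# Install geckodriver (Firefox's driver) via Homebrew
brew install geckodriver
```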
If you don't have Homebrew installed, you should go now and install it.
Another way is to download the file and add it to your PATH.
Other instructions to get geckodriver working on Mac:
and on Windows:
3. Getting Selenium working ("Hello Scrape!")
For the sake of simplicity, open a notebook such as Jupyter (of course you can also do it in a .py script).
First we import the much-needed webdriver package and state that we will be using Firefox (we should have already installed geckodriver):
from selenium import webdriver
driver = webdriver.Firefox()
Then we make this muppet go to the page we want; let's pay a visit to our friends at Hackernoon.com. As you can check in their robots.txt (www.hackernoon.com/robots.txt), they have a gentle allowance for user-agents.
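The navigation itself is a single call; a minimal sketch building on the two lines above (it assumes geckodriver is on your PATH and will open a real Firefox window):

```python
from selenium import webdriver

driver = webdriver.Firefox()          # launches Firefox via geckodriver
driver.get("https://hackernoon.com")  # make the muppet go to the page

print(driver.title)        # the page title, once it has loaded
print(driver.current_url)  # the URL the browser ended up on
```

When you are done experimenting, `driver.quit()` closes the browser and ends the geckodriver session.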
This is enough for now: you have installed Selenium and already opened a webpage with it. In the following post we will explain how to navigate the Hackernoon page (or any other) with Selenium, clicking here and there and getting some info.
But for now, if you just paste these lines of code and run them, you'll get your muppet moving around Hackernoon, clicking the search button and looking for one of your favorite posts ever. This one!
Add this to the import area:
from selenium.webdriver.common.keys import Keys
And then add this at the bottom:
search_button = driver.find_element_by_class_name("search-container")
search_button.click()  # open the search box
input_bar = driver.find_element_by_name("searchvalue")
input_bar.send_keys('The big hole on Data Scientists toolset')
input_bar.send_keys(Keys.RETURN)  # submit the search
Final behavior should look like this (without the desktop privacy clapping people)
And there you go! Use it, share it at your own will and spare some claps below.
Remember, attribution will bring you good karma.
See you in the next post!