Scraping with Selenium 101: The Big Hole on Data Scientists Toolset [Part 1]
Relatively rare element in the universe.
Usually left out of Data Science master's programs and courses, web scraping is, in my honest opinion, a basic tool in the Data Scientist's toolset: it is the tool for getting, and therefore using, data from outside your organization when public databases are not available.
And until APIs and accessible open-source databases become the norm (if that ever happens) instead of the exception, web scraping will be necessary to get that juicy data you are looking for; be it the prices of your competition on their website or the location of all the supermarkets around you.
But what is Selenium?
Selenium is a web browser automator. When used as a scraper, it gives you a browser like Chrome or Firefox that navigates to the webpage you want, interacts with it the way you want, and extracts along the way all the information you might need.
Although it's usually used for testing stuff, it's such a general tool that it's commonly and widely used for web scraping.
Remember, this is a basic tutorial (please note the word basic, as I'll be making simplifications) to get you up and running with Selenium as a web scraper; do not take it as an exhaustive tutorial.
Why is Selenium important for me as a Data Scientist?
Well... you might have noticed that (sadly) there aren't a lot of people providing easy or free access to data outside of your company (sometimes not even inside your company).
E.g.: you might need a list of all the hotel prices in your city, and literally no one is going to hand you an ordered database of hotel prices in an Excel file. If they do, it's your lucky day; if they don't, it's also your lucky day: YOU CAN START SCRAPING.
The fact is that there is a lot of information out there for you to get, and as a Data Scientist you can either sit at your desk and say "I don't have the information needed for that analysis" or figure out imaginative ways of getting it, and scraping is one of them.
Note: there are other quite efficient scraping frameworks out there, such as Scrapy (https://scrapinghub.com/scrapy-cloud), that might be more suitable and easier to implement for your project.
The thing is that almost all the webpages I've found worth scraping were already protected against Scrapy and similar tools (you'll probably need IP rotation, and that's money), while most of them were not protected against Selenium (I've hardly found one or two that resisted well-written Selenium code).
Selenium also works well on old-style pages and messily coded pages. All in all, for me Selenium is like a 4x4 car: not the fastest, not the most efficient, nor the easiest to drive, but it can take you almost anywhere. It doesn't need paved roads and can of course go off-road easily.
Now let's get our hands dirty with data gathering
0. Some personal advice
As selenium says in their webpage:
So if you are going to use Selenium to scrape a website, first read the site's terms and conditions or check its robots.txt file (a lot of sites allow you to scrape, others don't, and most have no clear policy, so use it at your own risk). And of course be gentle: do not fire off requests like a beast, try to behave like an automated human, run your scraper during low-demand times, and do not overload your host's servers.
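You can even check a robots.txt policy programmatically with Python's standard library. A minimal sketch; the robots.txt body and the example URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt body; in practice you would read it
# from https://<the-site>/robots.txt before scraping.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Paths not matched by a Disallow rule are allowed for "*"
print(parser.can_fetch("*", "https://example.com/articles"))      # True
# /private/ is explicitly disallowed
print(parser.can_fetch("*", "https://example.com/private/data"))  # False
```

If you prefer, `set_url()` plus `read()` will fetch and parse the live file for you; the offline `parse()` above just keeps the example self-contained.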
1. Installing Selenium
Well, this is an easy one. I'll assume you already have Python 3 installed and know how to create at least a simple notebook:
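Assuming pip is available, installing the Selenium Python bindings is one command (use pip3 if your default pip points at Python 2):

```shell
# Install the Selenium Python bindings for Python 3
pip install selenium
```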
2. Installing the driver
Selenium controls a browser (Chrome, Firefox, etc.) through a driver, and each browser has its own driver.
So the thing goes like this: Selenium tells the driver, through the webdriver interface, what to do in the selected browser.
Think of it like muppets: Selenium is the brain that tells the hand what to do, the hand is the driver, and finally, the muppet is the browser.
The easiest way to install geckodriver on a Mac is to go to your terminal and just type:
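The command in question is the Homebrew formula for Firefox's driver:

```shell
# Install geckodriver (Firefox's driver) via Homebrew
brew install geckodriver
```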
If you don't have Homebrew installed, you should go now and install it.
Another way is to download the file and add it to your PATH.
Other instructions to get geckodriver working on Mac:
and on Windows:
3. Getting Selenium working ("Hello Scrape!")
For the sake of simplicity, open a notebook such as Jupyter (of course you can also do it in a .py script).
First we import the much-needed webdriver package and state that we will be using Firefox (we should have already installed geckodriver):
from selenium import webdriver
driver = webdriver.Firefox()
Then we make this muppet go to the page we want; let's pay a visit to our friends at Hackernoon.com. As you can check in their robots.txt (www.hackernoon.com/robots.txt), they have a gentle allowance for user-agents.
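The navigation itself is a single call; a minimal sketch building on the two lines above (it assumes geckodriver is on your PATH and will open a real Firefox window):

```python
from selenium import webdriver

driver = webdriver.Firefox()          # launches Firefox via geckodriver
driver.get("https://hackernoon.com")  # make the muppet go to the page

print(driver.title)        # the page title, once it has loaded
print(driver.current_url)  # the URL the browser ended up on
```

When you are done experimenting, `driver.quit()` closes the browser and ends the geckodriver session.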
This is enough for now: you have installed Selenium and already opened a webpage with it. In the following post we will explain how to navigate the Hackernoon page (or any other) with Selenium, clicking here and there and getting some info.
But for now, if you just paste these lines of code and run them, you'll get your muppet moving around Hackernoon, clicking the search button and looking for one of your favorite posts ever. This one!
Add this to the import area:
from selenium.webdriver.common.keys import Keys
And then add this at the bottom:
search_button = driver.find_element_by_class_name("search-container")
search_button.click()  # open the search box
input_bar = driver.find_element_by_name("searchvalue")
input_bar.send_keys('The big hole on Data Scientists toolset')
input_bar.send_keys(Keys.RETURN)  # submit the search
Final behavior should look like this (without the desktop privacy clapping people)
And there you go! Use it, share it at your own will and spare some claps below.
Remember, attribution will bring you good karma.
See you in the next post!