Published on: 15.12.2018
Disclaimer: This is written primarily from the point of view of the Python ecosystem.
I have noticed that Selenium has become quite popular for scraping data from web pages.
Yes, you can use Selenium for web scraping, but it is not a good idea.
Personally, I also think that articles teaching how to use Selenium for web scraping set a bad example of which tool to use.
First, Selenium is not a web scraping tool.
It is “for automating web applications for testing purposes”, a statement taken from the Selenium homepage.
Second, in Python there is a better tool: Scrapy, an open-source web-crawling framework.
The intelligent reader will ask: “What is the benefit of using Scrapy over Selenium?”
You get speed, and a lot of it (not amphetamine :-)): speed in development and speed in scraping time.
There are articles with tips on how to make Selenium web scraping faster; if you use Scrapy, you do not have those kinds of problems and you are faster.
The mere fact that these articles exist is proof (at least for me) that people are using the wrong tool for the job, an example of “When your only tool is a hammer, everything looks like a nail”.
I personally only use Selenium for web page testing.
I would try to use it for automating web applications (if there were no other options), but I have not had that use case so far.
The only exception I can see for using Selenium as a web scraping tool is when the website you are scraping uses JavaScript to fetch or display the data you need.
Scrapy does have a solution for JavaScript in Splash, but I have never used it; so far I have always found a workaround.
As you can guess, my advice is to use Scrapy.
I choose Scrapy because I spend less time developing web scraping programs (web spiders) and execution time is fast.
I have found Scrapy to be faster in development time because of the Scrapy shell and its cache.
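As a rough illustration of the cache part, here is a minimal sketch of a settings fragment that turns on Scrapy's HTTP cache, so repeated runs during development replay stored responses instead of hitting the site again. The setting names are Scrapy's own; the chosen values (no expiry, the list of ignored status codes) are my assumptions for a development setup, not a recommendation for every project. The shell part is simply running `scrapy shell` with a URL, which opens an interactive session where you can try selectors against the downloaded `response`.

```python
# settings.py (fragment of a Scrapy project's settings module)

# Store every downloaded response on disk and replay it on later runs.
HTTPCACHE_ENABLED = True

# 0 means cached responses never expire; convenient while developing selectors.
HTTPCACHE_EXPIRATION_SECS = 0

# Cache directory; a relative path is resolved inside the project data dir.
HTTPCACHE_DIR = "httpcache"

# Assumed example choice: do not cache server errors, so they are fetched again.
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]
```

With something like this in place, a spider can be re-run many times while you tune the selectors without downloading the pages again.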
In execution, it is fast because multiple requests can be in flight at the same time. This means that responses will not necessarily arrive in the order they were requested; keep that in mind so you are not confused when debugging.
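To make the ordering point concrete, here is a small sketch that uses only the standard library. It is not how Scrapy is implemented (Scrapy is built on Twisted); it only simulates the effect with `asyncio`. The URLs, the delays, and the `fake_fetch` and `crawl` helpers are hypothetical and exist only for this illustration.

```python
import asyncio

# Hypothetical URLs and response times, chosen only for illustration;
# no real network traffic is generated.
SIMULATED_DELAYS = {
    "https://example.com/page/1": 0.15,
    "https://example.com/page/2": 0.05,
    "https://example.com/page/3": 0.10,
}


async def fake_fetch(url, delay):
    """Stand-in for an HTTP request: wait `delay` seconds, then return the URL."""
    await asyncio.sleep(delay)
    return url


async def crawl(delays):
    """Start all requests at once and record the order in which they finish."""
    tasks = [
        asyncio.create_task(fake_fetch(url, delay))
        for url, delay in delays.items()
    ]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


if __name__ == "__main__":
    print("requested:", list(SIMULATED_DELAYS))
    print("finished: ", asyncio.run(crawl(SIMULATED_DELAYS)))
```

The requests are issued as page 1, 2, 3, but they finish as page 2, 3, 1 because the slowest response arrives last. Scraped items show up in the same kind of shuffled order.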
I have used this combination in the past before I decided to invest time in learning Scrapy.
Do not make the same mistake I did: development and execution are both much faster with Scrapy than with any other tool I have found so far.
This is not a rant against using Selenium for web scraping; for non-production systems and for learning or hobby projects it is fine.
I get it: Selenium is easy to start with and you can see what is happening in real time on your screen. That is a huge benefit for people who are starting to learn web scraping, and early morale boosts like this are important when you are learning something new.
But I do think that all these articles and tutorials that use Selenium for web scraping should carry a disclaimer not to use Selenium in real life (if you need to scrape 100K pages in a day, it is not possible to do so with a single Selenium instance).
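The 100K figure is easier to judge as a time budget per page. The sketch below does that arithmetic; the 3 seconds per page used for a single browser instance is purely an assumed number for illustration, not a measurement, and the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def seconds_per_page(pages_per_day):
    """Time budget per page when pages are processed one after another."""
    return SECONDS_PER_DAY / pages_per_day


def pages_per_day(page_seconds, workers=1):
    """Pages that `workers` sequential scrapers can finish in one day."""
    return int(workers * SECONDS_PER_DAY / page_seconds)


if __name__ == "__main__":
    # Budget for 100K pages in a day with a single sequential scraper.
    print(seconds_per_page(100_000))

    # Assumed, not measured: one browser instance needs 3 seconds per page.
    print(pages_per_day(3.0))
```

To reach 100K pages in a day, a single sequential scraper has less than a second (0.864 s) per page, including page load and rendering. At the assumed 3 seconds per page, one instance would get through 28,800 pages.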
Scrapy is harder to start with: you have to write XPath selectors, and looking at the HTML source of a page to debug is not fun, but that is the price of fast web scraping.
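To give a feel for what writing XPath selectors means, here is a minimal sketch using only Python's standard library. `xml.etree.ElementTree` supports just a limited subset of XPath and needs well-formed markup, while Scrapy's own selectors (for example `response.xpath(...)`) are more capable, so treat this as an illustration of the idea, not of Scrapy's API. The page fragment and the `extract_items` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <ul>
      <li class="item"><a href="/p/1">Hammer</a><span class="price">9.99</span></li>
      <li class="item"><a href="/p/2">Nail</a><span class="price">0.10</span></li>
      <li class="ad"><a href="/promo">Promo</a></li>
    </ul>
  </body>
</html>
"""


def extract_items(page):
    """Return name, URL and price for every <li class="item"> in the page."""
    root = ET.fromstring(page.strip())
    items = []
    # The path selects only list items whose class attribute is exactly "item".
    for li in root.findall(".//li[@class='item']"):
        link = li.find("a")
        items.append({
            "name": link.text,
            "url": link.get("href"),
            "price": float(li.findtext("span[@class='price']")),
        })
    return items


if __name__ == "__main__":
    for item in extract_items(PAGE):
        print(item)
```

The selector skips the `ad` entry because its class does not match. Most of the work in a real spider is finding paths like this by reading the page source.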
After you learn Scrapy you will be faster than with Selenium (Selenium just has a gentler learning curve); I personally needed a few days to get the basics.
Originally published at buklijas.info on December 15, 2018.