Published on: 15.12.2018
Disclaimer: This is written primarily from the point of view of the Python programming language ecosystem.
Yes, you can use Selenium for web scraping, but it is not a good idea.
First, Selenium is not a web scraping tool.
It is “for automating web applications for testing purposes”, a statement taken from the Selenium homepage.
Second, in Python there is a better tool: Scrapy, an open-source web-crawling framework.
The intelligent reader will ask: “What is the benefit of using Scrapy over Selenium?”
You get speed, and a lot of it (not amphetamine :-)): speed in development and speed in scraping time.
The mere fact that these articles exist is proof (at least for me) that people are using the wrong tool for the job, an example of “When your only tool is a hammer, everything looks like a nail”.
I personally only use Selenium for web page testing.
I would try to use it for automating web applications (if there were no other options), but I have not had that use case so far.
As you can guess, my advice is to use Scrapy.
I choose Scrapy because I spend less time developing web scraping programs (web spiders) and execution time is fast.
It is fast in execution because multiple requests can be in flight simultaneously. This means that data will not be delivered in the same order as it was requested; keep that in mind so you are not confused when debugging.
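To make the out-of-order effect concrete, here is a minimal sketch. It is not Scrapy code: it uses only the standard library's asyncio to mimic several requests running at once, and the `fetch` and `crawl` functions and the response times are made up for illustration.

```python
import asyncio


async def fetch(index, delay):
    # Stand-in for a network request; `delay` simulates the server response time.
    await asyncio.sleep(delay)
    return index


async def crawl(delays):
    # Start all "requests" at once, then collect each result as soon as it finishes.
    tasks = [asyncio.create_task(fetch(i, d)) for i, d in enumerate(delays)]
    finished = []
    for done in asyncio.as_completed(tasks):
        finished.append(await done)
    return finished


# ASSUMPTION: hypothetical response times in seconds; the first request is the slowest.
delays = [0.3, 0.1, 0.2]

completion_order = asyncio.run(crawl(delays))
print(completion_order)          # request indexes in the order they finished
print(sorted(completion_order))  # sorting by the carried index restores request order
```

The request that was sent first comes back last here, which is the situation that can be confusing in a debugger. Carrying an index (or the URL) along with each result, as the sketch does, is one simple way to match responses back to requests.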
I have used this combination in the past before I decided to invest time in learning Scrapy.
Do not make the same mistake I did: development time and execution time are much shorter with Scrapy than with any other tool that I have found so far.
This is not a rant about using Selenium for web scraping; for non-production systems and for learning or hobby projects it is fine.
I get it: Selenium is easy to start with and you can see what is happening in real time on your screen. That is a huge benefit for people starting to learn web scraping, and it is important to have this kind of early morale boost when you are learning something new.
But I do think that all these articles and tutorials that use Selenium for web scraping should carry a disclaimer not to use Selenium in real life (if you need to scrape 100K pages in a day, it is not possible to do it with a single Selenium instance).
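A quick back-of-the-envelope calculation shows why 100K pages a day is a stretch for one browser. The 3 seconds per page below is purely an assumption for illustration (page load, rendering and extraction in a real browser); plug in your own measured number.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
TARGET_PAGES = 100_000          # the figure used in the text

# Time budget per page if pages are fetched strictly one after another.
budget_per_page = SECONDS_PER_DAY / TARGET_PAGES

# ASSUMPTION: a hypothetical 3 seconds per page in a real browser.
ASSUMED_SECONDS_PER_PAGE = 3.0

pages_per_instance = int(SECONDS_PER_DAY / ASSUMED_SECONDS_PER_PAGE)
instances_needed = math.ceil(TARGET_PAGES / pages_per_instance)

print(f"budget per page: {budget_per_page:.3f} s")
print(f"pages per day with one instance: {pages_per_instance}")
print(f"instances needed: {instances_needed}")
```

Under that assumption a single sequential browser has to finish each page in under 0.864 seconds, around the clock, with no retries or pauses. At 3 seconds per page it manages 28,800 pages a day, so four instances would be needed.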
Scrapy is harder to start with: you have to write XPath selectors, and looking at the source code of an HTML page to debug is not fun, but if you want fast web scraping, that is the price.
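If you have never seen an XPath selector, here is a small sketch of the idea. It deliberately does not use Scrapy: it relies on the limited XPath subset in the standard library's `xml.etree.ElementTree`, and the page snippet, class names and product data are invented for the example. In a Scrapy spider the same kind of expression would typically be passed to `response.xpath(...)`, which supports full XPath.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a made-up, well-formed page fragment. Real-world HTML is often
# not valid XML, so ElementTree is used here only to illustrate the selectors.
PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Blue mug</h2>
      <span class="price">7.50</span>
    </div>
    <div class="product">
      <h2>Red mug</h2>
      <span class="price">8.00</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

products = []
# Select every <div> with class="product", anywhere in the document.
for node in root.findall(".//div[@class='product']"):
    name = node.findtext("h2")                      # text of the child <h2>
    price = node.findtext("span[@class='price']")   # text of the price <span>
    products.append((name, float(price)))

print(products)
```

The selectors are written by reading the page source and describing where the data sits in the tree, which is exactly the part that takes some getting used to.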
After you learn Scrapy you will be faster than with Selenium (Selenium just has a gentler learning curve); I personally needed a few days to get the basics.
Originally published at buklijas.info on December 15, 2018.