Scrapy is a web crawling framework that divides the crawling process into small stages, so that the whole crawl stays well organized:
Crawl Data (spider.py) -> Rotate proxy or IP (middlewares.py) -> Clean Data (items.py) -> Store Data (pipelines.py)
All of it is configured in one place (settings.py).
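To make two of these stages concrete, here is a minimal sketch of a proxy-rotating downloader middleware and a clean-and-store item pipeline. Both are plain Python classes that follow Scrapy's long-standing `process_request(request, spider)` and `process_item(item, spider)` hooks. The class names, the proxy addresses and the `items.jsonl` output file are all made up for illustration. For brevity the cleaning happens inside the pipeline rather than in items.py, and items are assumed to be dict-like.

```python
import json
import random


class RotateProxyMiddleware:
    """Downloader middleware: pick a proxy for every outgoing request."""

    # Hypothetical proxy addresses; replace them with proxies you actually control.
    PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]

    def process_request(self, request, spider):
        # Scrapy uses request.meta["proxy"] as the proxy for this request.
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None  # None tells Scrapy to carry on processing the request


class CleanAndStorePipeline:
    """Item pipeline: tidy each scraped item, then append it to a JSON Lines file."""

    def __init__(self, path="items.jsonl"):  # hypothetical output file name
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Clean: strip surrounding whitespace from every string field.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        # Store: one JSON object per line.
        self.file.write(json.dumps(cleaned) + "\n")
        return cleaned
```

To switch them on, you would register each class in settings.py under `DOWNLOADER_MIDDLEWARES` and `ITEM_PIPELINES`, using its module path (for example a hypothetical `myproject.middlewares.RotateProxyMiddleware`) and a priority number.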
The biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy is implemented using a non-blocking (aka asynchronous) code for concurrency, which makes the spider performance is very great. — Michael Yin
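To get a feel for why non-blocking code helps a crawler, here is a small sketch. It uses Python's built-in asyncio rather than Twisted, and the "requests" are simulated with sleeps, so it only illustrates the idea and is not how Scrapy is written internally. The function names and page names are made up.

```python
import asyncio
import time


async def fake_fetch(url, delay):
    # asyncio.sleep stands in for network latency; no real request is made.
    await asyncio.sleep(delay)
    return f"fetched {url}"


async def crawl(urls, delay=0.3):
    # gather() runs all the coroutines concurrently and returns results
    # in the same order as the inputs, so the waits overlap instead of adding up.
    return await asyncio.gather(*(fake_fetch(url, delay) for url in urls))


if __name__ == "__main__":
    urls = ["page-1", "page-2", "page-3"]  # placeholder names, not real URLs
    start = time.perf_counter()
    results = asyncio.run(crawl(urls))
    elapsed = time.perf_counter() - start
    print(results)
    print(f"elapsed: {elapsed:.2f}s")  # roughly one delay, not three
```

While one request is waiting on the network, the others keep going, which is where the speed comes from.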
Selenium is a free automated testing suite for web applications that works across different browsers and platforms. Although it was created for automated testing of web apps, it is really easy to apply to scraping websites! You just need to
Issues I faced using Selenium:
Issues I faced using Scrapy:
Sharing about my experiences:
But one day, one particular website blocked me by implementing a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA). I was really stuck, but I was required to figure out a way to solve this problem. So, after trying every way I could find to solve the CAPTCHA, I thought: why not try another framework and see whether it can get past the CAPTCHA?
At last I found the Scrapy framework, which not only solved my CAPTCHA problem but also gave me a start in learning a really powerful crawling framework! The learning curve for Scrapy is much steeper than for Selenium, but it is definitely worth it, based on the five points below:
Here are some resources I find useful to learn Scrapy-Splash.
Here are some really useful resources to learn Scrapy.
Here are some really useful resources to learn Selenium.
All resources are based on Python. Happy Learning!
If you are interested in more tutorials on Scrapy-Splash, Scrapy or Selenium, feel free to comment below!
Feel free to reach out to me too:)