Scrapy is a web crawling framework that divides the crawling process into small steps so that the whole process is well organized:
Crawl Data (spider.py) -> Rotate proxy or IP (middlewares.py) ->
Clean Data (items.py) -> Store Data (pipelines.py)
All of these steps are configured in one place (settings.py).
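To make two of the stages above more concrete, here is a minimal sketch of what a proxy-rotating middleware and a storing pipeline might look like. Scrapy accepts plain Python classes for both, so the sketch does not import Scrapy itself. The class names, the proxy addresses and the `items.jl` output path are all made up for illustration, and the method signatures are the classic ones that take a `spider` argument.

```python
import json
import random


class RotateProxyMiddleware:
    """Downloader middleware (middlewares.py): pick a proxy for each request."""

    # Hypothetical proxy addresses, for illustration only.
    PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]

    def process_request(self, request, spider):
        # Scrapy routes the request through the proxy named in request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None  # None tells Scrapy to keep processing this request


class JsonLinesPipeline:
    """Item pipeline (pipelines.py): store every scraped item as one JSON line."""

    def __init__(self, path="items.jl"):  # hypothetical output file
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # hand the item on to the next pipeline, if any
```

In a real project these classes would be switched on in the settings file through the `DOWNLOADER_MIDDLEWARES` and `ITEM_PIPELINES` dictionaries.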
The biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy is implemented using a non-blocking (aka asynchronous) code for concurrency, which makes the spider performance is very great. — Michael Yin
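The idea behind non-blocking code can be sketched in a few lines. Note that this sketch uses `asyncio` from the Python standard library as a stand-in, not Twisted and not Scrapy's own internals, and `fake_fetch` only simulates a network request by sleeping. It simply shows why waiting on many requests at once is faster than waiting on them one by one.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    # Stand-in for a network request: sleeping hands control back to the event loop.
    await asyncio.sleep(delay)
    return f"response from {url}"


async def crawl(urls):
    # All simulated requests wait at the same time instead of one after another.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed_crawl(urls):
    # Run the crawl and measure how long it took in seconds.
    start = time.perf_counter()
    results = asyncio.run(crawl(urls))
    return results, time.perf_counter() - start
```

With five URLs and a 0.2 second delay each, a blocking loop would need about one second, while `timed_crawl` should finish in roughly the time of a single request.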
Selenium is a free automated testing suite for web applications that works across different browsers and platforms. Although it was created for automated testing of web apps, it is really easy to use for scraping websites! You just need to:
Download the driver for Chrome, Firefox or another browser -> Use the driver's API to scrape websites.
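As a rough sketch of the "use the API" step, the helper below loads a page and collects the text of every element that matches a CSS selector. The function name `scrape_texts` is made up for this example, and the driver is passed in as an argument so that the helper itself does not depend on which browser you downloaded a driver for.

```python
def scrape_texts(driver, url, css_selector):
    """Load a page in the given Selenium driver and return the matching element texts."""
    driver.get(url)
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", css_selector)
    return [element.text for element in elements]


# Usage with a real browser (assumes `pip install selenium` and a Chrome driver):
#
#     from selenium import webdriver
#
#     driver = webdriver.Chrome()
#     try:
#         print(scrape_texts(driver, "https://example.com", "h1"))
#     finally:
#         driver.quit()
```

Calling `driver.quit()` in a `finally` block makes sure the browser process is closed even if the scrape fails.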
Issues I faced using Selenium:
- It is quite slow.
- It needs quite a lot of memory if you want to build a multithreaded crawler to speed the process up.
Issues I faced using Scrapy:
- It is harder to debug.
- Harder to connect to Tor if you are implementing Scrapy-Splash.
- There are fewer references to consult when you want to use Scrapy-Splash.
Sharing my experience:
But one day, one particular website blocked me by implementing a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA). I was really stuck, but I had to figure out a way to solve this problem. So, after I had tried every way I could think of to solve the captcha, I thought: why not try another framework and see whether it can bypass the captcha?
At last I found the Scrapy framework, which not only solved my captcha problem but also got me started on learning a really powerful crawling framework! The learning curve for Scrapy is much steeper than for Selenium, but it is definitely worth it, based on the five points below:
- Write your crawler in a much shorter Python script compared to Selenium.
- Crawl a lot faster than Selenium.
- If you are using scrapy-splash, Splash renders a great console on localhost:8050 where you can try out your Lua script.
- Organize your crawler code in a really structured way so that you can attain maximum satisfaction :)
- Scrapy scales well if your project needs to crawl a lot of websites.
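For the Lua script point above, here is a small sketch of what such a script can look like and how it could be sent to a Splash instance on localhost:8050 over Splash's HTTP API. The script itself, the helper name `build_execute_request` and the 0.5 second wait are illustrative assumptions. The helper only builds the request and does not send it, so it can be inspected without Splash running.

```python
import json
import urllib.request

# Illustrative Lua script: load the page, wait for JavaScript, return the HTML.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {html = splash:html()}
end
"""


def build_execute_request(url, splash_base="http://localhost:8050"):
    """Build (but do not send) a POST request to Splash's /execute endpoint."""
    payload = json.dumps({"lua_source": LUA_SCRIPT, "url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{splash_base}/execute",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With Splash running, `urllib.request.urlopen(build_execute_request("https://example.com"))` should return the rendered result as JSON. Inside a Scrapy project, the same script would normally be passed to scrapy-splash's `SplashRequest` with `endpoint="execute"` and `args={"lua_source": LUA_SCRIPT}`.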
Here are some resources I found useful for learning Scrapy-Splash.
- https://www.cnblogs.com/shaosks/p/6950358.html (Chinese website)
- https://juejin.im/post/5afe47b3f265da0b767db40e (Chinese website)
Here are some really useful resources to learn Scrapy.
Here are some really useful resources to learn Selenium.
All resources are based on Python. Happy Learning!
If you are interested in more tutorials on Scrapy-Splash, Scrapy or Selenium, feel free to comment below!
Feel free to reach out to me too :)