Low Wei Hong

@lowweihong

Scrapy or Selenium?

Scrapy

Scrapy is a web crawling framework that divides the whole crawling process into small steps so that the crawl stays well organized:

Crawl data (spider.py) -> Rotate proxy or IP (middlewares.py) -> Clean data (items.py) -> Store data (pipelines.py)

All the settings live in settings.py.
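To make two of the stages above a little more concrete, here is a minimal sketch of a downloader middleware that picks a proxy per request and an item pipeline that cleans and stores items. Both are plain Python classes that follow Scrapy's hook names (`process_request`, `open_spider`, `process_item`, `close_spider`). The class names, proxy addresses, and output file name are made up for illustration, and in a real project the classes would still need to be enabled through `DOWNLOADER_MIDDLEWARES` and `ITEM_PIPELINES` in settings.py.

```python
import json
import random


class RotateProxyMiddleware:
    """Downloader middleware sketch: assign a random proxy to each request."""

    # Hypothetical proxy addresses; replace with proxies you actually control.
    PROXIES = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever is in request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None  # None tells Scrapy to keep processing the request


class CleanAndStorePipeline:
    """Item pipeline sketch: strip whitespace, then append items as JSON lines."""

    def __init__(self, path="items.jl"):  # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Clean: trim surrounding whitespace from every string field.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        # Store: one JSON object per line.
        self.file.write(json.dumps(cleaned) + "\n")
        return cleaned

    def close_spider(self, spider):
        self.file.close()
```

Keeping each concern in its own class is what lets you swap the proxy logic or the storage backend without touching the spider itself.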

The biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy is implemented using a non-blocking (aka asynchronous) code for concurrency, which makes the spider performance is very great. — Michael Yin
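To illustrate what non-blocking concurrency buys a crawler, here is a small sketch. It uses Python's standard `asyncio` rather than Twisted, and `fake_fetch` only sleeps to stand in for network latency, so treat the URLs and delays as assumptions rather than a benchmark of Scrapy.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    # Stand-in for a network request: sleeping yields control to other tasks.
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls):
    # All requests wait at the same time instead of one after another.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed_crawl(urls):
    # Returns the fetched pages and the wall-clock time the crawl took.
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start
```

Calling `timed_crawl` with five URLs should take roughly one delay rather than five, because the waits overlap.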

Selenium

Selenium is a free automated testing suite for web applications that works across different browsers and platforms. Although it was created for automated testing of web apps, it is really easy to apply to scraping websites! You just need to:

  • Download the driver for Chrome, Firefox, or another browser -> Use its API to scrape websites.
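As a rough sketch of those two steps with Selenium's Python bindings: the helper below takes any driver object and collects the text of matching elements. The function names, the URL, and the `h1` selector are placeholders, and `main` assumes Chrome and a matching driver are installed.

```python
def collect_text(driver, url, css_selector):
    """Load a page and return the text of every element matching the selector."""
    driver.get(url)
    # "css selector" is the value of Selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", css_selector)
    return [element.text for element in elements]


def main():
    from selenium import webdriver  # requires `pip install selenium`

    driver = webdriver.Chrome()
    try:
        for heading in collect_text(driver, "https://example.com", "h1"):
            print(heading)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Call `main()` to try it against a real browser. Passing the driver in as an argument keeps the scraping logic independent of which browser you launch.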

Issues I faced using Selenium:

  1. It is quite slow.
  2. It needs quite a lot of memory if you want to build a multi-threaded crawler to speed things up.
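The multi-threaded approach from point 2 can be sketched with the standard library's `ThreadPoolExecutor`. Here `fetch_page` is a hypothetical stand-in that only sleeps. With Selenium, each worker would own a full browser instance, which is where the memory cost comes from.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url):
    # Hypothetical stand-in for a slow browser-driven page load.
    time.sleep(0.2)
    return url, len(url)


def crawl_in_threads(urls, max_workers=4):
    # map() preserves input order; max_workers caps how many loads run at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, urls))
```

Raising `max_workers` shortens the crawl but multiplies the number of browsers held in memory, so it is worth tuning against the machine you run on.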

Issues I faced using Scrapy:

  1. It is harder to debug.
  2. It is harder to connect to Tor if you are using Scrapy-Splash.
  3. There are fewer references to consult when you want to use Scrapy-Splash.

Sharing my experience:

At first, I learned Selenium because I needed to render JavaScript websites and it is much easier to learn and debug. When I first used Selenium, it satisfied all my needs, crawling all the web pages within the required time frame. I then sped it up with multi-threading, and everything went really smoothly.

Yeah Really Smooth!

But one day, one particular website blocked me with a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). I was really stuck, but I had to figure out a way around the problem. So, after trying every way I could think of to solve the CAPTCHA, I thought: why not try another framework and see whether it can get past it?

Banging my head and hoping something magical comes to mind :(

At last I found the Scrapy framework. It not only solved my CAPTCHA problem but also started me on learning a really powerful crawling framework! The learning curve for Scrapy is much steeper than for Selenium, but it is definitely worth it, based on the five points below:

  1. You can write your crawler in a much shorter Python script compared to Selenium.
  2. It crawls a lot faster than Selenium.
  3. If you are using scrapy-splash, Splash serves a great interactive console on localhost:8050 where you can try out your Lua scripts.
  4. It organizes your crawler code in a really structured way so that you can attain maximum satisfaction :)
  5. Scrapy scales well if your project needs to crawl a lot of websites.
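For point 3, a locally running Splash instance also exposes an HTTP API next to the browser console. The sketch below builds a request to its `render.html` endpoint, assuming Splash is listening on localhost:8050. The helper names and the target URL are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # assumes a local Splash instance


def build_render_url(target_url, wait=0.5):
    # render.html returns the page HTML after JavaScript has run;
    # `wait` is how many seconds Splash waits after the page loads.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"


def fetch_rendered_html(target_url, wait=0.5):
    # Performs a real HTTP request, so Splash must be running.
    with urlopen(build_render_url(target_url, wait)) as response:
        return response.read().decode("utf-8")
```

Hitting the endpoint directly like this is a quick way to check what Splash actually renders before wiring it into a spider.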

Scrapy-Splash is definitely worth trying for rendering JavaScript-heavy websites, but it has far fewer learning resources than Scrapy.

Here are some resources I found useful for learning Scrapy-Splash.

  1. https://splash.readthedocs.io/en/stable/faq.html
  2. https://www.cnblogs.com/shaosks/p/6950358.html (Chinese website)
  3. https://juejin.im/post/5afe47b3f265da0b767db40e (Chinese website)

Here are some really useful resources for learning Scrapy.

  1. https://doc.scrapy.org/en/latest/intro/tutorial.html
  2. https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/
  3. https://blog.michaelyin.info/scrapy-tutorial-5-how-create-simple-scrapy-spider/
  4. https://python.gotrained.com/scrapy-tutorial-web-scraping-craigslist/

Here are some really useful resources for learning Selenium.

  1. https://selenium-python.readthedocs.io/
  2. https://www.softwaretestingmaterial.com/selenium-tutorial/
  3. https://www.tutorialspoint.com/selenium

All resources are based on Python. Happy Learning!

If you are interested in more tutorials on Scrapy-Splash, Scrapy, or Selenium, feel free to comment below!

Feel free to reach out to me too :)
