
Top Scraping Tools for Amazon

by rajmaster, December 20th, 2023

Too Long; Didn't Read

Scraping Amazon is challenging. Hence, having the right tools is crucial. I compared three tools based on their price, performance, and features.


Scraping Amazon is a complex task when you're doing it 24/7, as opposed to just making a few requests. If you need to extract information from over 45,000 products daily, you'll face a number of challenges:


  • Persistent blocking of proxy IP addresses

  • Multiple CAPTCHAs

  • 500 errors

  • Rotating your proxies and finding clean residential proxies

  • Continually modifying the script logic to remain undetected

  • Browser freezes and system crashes when running on my Linux servers


Amazon constantly changes its approaches to detecting web automation and scraping. You have to add random/fake actions, retries, timeouts, and rotate proxies to counter this. This increases the complexity of the task required to achieve high-quality results.
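
The retry, timeout, and proxy-rotation logic mentioned above can be sketched roughly as follows. This is a minimal illustration, not my production code: `fetch` is a hypothetical callable you supply (for example a wrapper around your browser automation), `FetchError` is an assumed exception it raises on a block, CAPTCHA, or 5xx response, and the proxy names are placeholders.

```python
import itertools
import random
import time
from typing import Callable, Sequence


class FetchError(Exception):
    """Assumed to be raised by the caller's fetch function on a block, CAPTCHA, or 5xx."""


def fetch_with_retries(
    fetch: Callable[[str, str], str],
    url: str,
    proxies: Sequence[str],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try `url` through successive proxies, backing off between failed attempts."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    proxy_cycle = itertools.cycle(proxies)
    last_error = None
    for attempt in range(max_attempts):
        proxy = next(proxy_cycle)  # rotate to the next proxy on every attempt
        try:
            return fetch(url, proxy)
        except FetchError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff plus random jitter, so retries are not periodic.
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                sleep(delay)
    raise FetchError(f"gave up on {url} after {max_attempts} attempts") from last_error
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting; in normal use the default `time.sleep` is fine.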


In my opinion, of all the stages of price monitoring, the most challenging is the collection of a high-quality dataset. I typically use the following tools for data collection:


  • headless Puppeteer with Puppeteer-extra and Puppeteer-extra-plugin-stealth

  • undetected-browser

  • headless-recorder

  • curl-impersonate


I’m trying to switch to Playwright, as it seems to have better performance and parallelism, and offers a broader range of features. To solve CAPTCHAs, I use 2captcha.com and anti-captcha.com. I mainly use residential and sometimes mobile proxies, as they provide the best results. My scraping scripts run in Docker containers in two data centers, at Hetzner and DigitalOcean.


I decided to analyze my stack to improve my performance. After a whole day of googling, I found several interesting services for automation and parsing, which I would like to share with you. The following services seemed particularly interesting:


  • ScrapingBee

  • Browserless

  • Surfsky

  • Bright Data Scraping Browser


Let’s evaluate these services based on performance, features, results, and of course, price.

TL;DR: Web Scraping Services Comparison Table


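Feature                                         | ScrapingBee            | Browserless          | Surfsky
------------------------------------------------|------------------------|----------------------|---------------------
Real price per 1,000 requests (lower is better) | $6                     | $3.2                 | Free now
Concurrent requests                             | 200+ on Business+ plan | up to 1000           | unlimited by default
Custom proxies                                  | limited                | limited              | full
Scaling                                         | Upon support request   | Premium subscription | Infinite by default
Benchmark results (more details on this below)  | 7 points               | 8 points             | 14 points

The table also compared the following features, but the per-service marks for these rows were not preserved: JavaScript rendering, Basic browser fingerprinting, Advanced browser fingerprinting, Proxy rotation, Real GPU canvas rendering, Profile management, Support, Chrome DevTools Protocol, Scraping API, Screenshot.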

ScrapingBee

I discovered this web service while reading an article about no-code Amazon scraping. ScrapingBee is a well-known solution, with positive reviews on ProductHunt and Capterra. The pricing plans differ mainly in the number of requests and the level of concurrency.


Prices range from $49 for 150k credits on the cheapest plan to $599+ for 8 million credits on the Business+ plan. Credits are the billing unit: every request consumes credits, and each additional option is billed separately. I wasn't exactly thrilled to find out that JavaScript rendering costs an additional 5 credits. Come on, guys, it's 2023, how can one get by without JS? A pleasant aspect, however, is that only successful requests are billed. In my case, about 25% of my requests are unsuccessful, and I have to retry these.


The documentation advises using the stealth proxy mode with JavaScript rendering for working with complex websites. The cost of one such request is 75 (!) credits. In my case, I need to make 45,000 * 30 = 1,350,000 requests per month. The most expensive plan shown in the pricing grid, Business+, allows for only 8,000,000 / 75 = 106,666 such requests. Pricing for my volume is not published, but extrapolating linearly from $599 (1,350,000 * 75 is about 101 million credits, or roughly 12.7 Business+ plans), it comes to around $7,600 per month, or about $0.006/request. This is quite expensive for fewer than 1,500,000 requests and would make my business unprofitable. Nevertheless, we'll check later whether ScrapingBee is worth its price.
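
For anyone who wants to plug in their own volume, the estimate above can be reproduced with a few lines of arithmetic. The linear extrapolation is an assumption on my part (real pricing above Business+ is negotiated), and the constant names are purely illustrative.

```python
# Figures taken from the text above.
PRODUCTS_PER_DAY = 45_000
DAYS_PER_MONTH = 30
CREDITS_PER_STEALTH_REQUEST = 75   # stealth proxy + JavaScript rendering
PLAN_CREDITS = 8_000_000           # Business+ plan
PLAN_PRICE_USD = 599

requests_per_month = PRODUCTS_PER_DAY * DAYS_PER_MONTH
requests_per_plan = PLAN_CREDITS // CREDITS_PER_STEALTH_REQUEST
credits_needed = requests_per_month * CREDITS_PER_STEALTH_REQUEST

# Assumption: credits beyond the plan cost the same per credit as on Business+.
monthly_cost = credits_needed / PLAN_CREDITS * PLAN_PRICE_USD
cost_per_request = monthly_cost / requests_per_month

print(f"requests per month: {requests_per_month:,}")
print(f"requests covered by one plan: {requests_per_plan:,}")
print(f"estimated monthly cost: ${monthly_cost:,.0f}")
print(f"estimated cost per request: ${cost_per_request:.4f}")
```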


The documentation also mentions a premium proxy, which costs 25 API credits/request. For this experiment, I purchased a Startup subscription for $99 with 1,000,000 credits. Let's see what we can get out of it.



Browserless

I found several reviews on ProductHunt and YouTube videos about Browserless. Unlike ScrapingBee, which operates through a REST API, Browserless works via the Chrome DevTools Protocol (CDP), allowing for more flexible and comprehensive scripting. This is exactly what I need. A notable advantage is its detailed documentation.


Cloud subscriptions are metered in units rather than in requests. The most expensive plan costs $200 for 180k units, where a unit represents a browser session of up to 30 seconds. On average, using a proxy, my page load indicators are DOMContentLoaded at 3.8s and Load at 4.5s, and the script takes about 4 seconds to execute on a page, depending on the automation logic. Budgeting 9.5 seconds per page, $200 should cover (30 / 9.5) * 180,000 = 568,420 page loads. Since about 25% of my requests fail, I expect at least 568,420 * 0.75 = 426,315 successful requests, so a request costs between $0.0004 and $0.0005. This looks much more advantageous than ScrapingBee. However, I am concerned about what happens after a 30-second session: if I send more requests, will I need to handle them on my side?


I was pleased with the price per request, but my enthusiasm waned when I saw that the built-in proxy is charged at 6 units per 1 MB. This is extremely costly. Recalculating with residential proxies gives (30 / 9.5) * (180,000 / 7) * 0.75 = 60,902 requests, 7 times fewer than before. Consequently, the cost rises to roughly $0.0032 per request. But we'll check later how it performs in action.
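
The same estimate as a small script, in case you want to substitute your own timings. The 9.5 seconds per page, the 75% success rate, and the factor of 7 for proxy usage are the working figures from this section; the function and constant names are just illustrative.

```python
PLAN_PRICE_USD = 200
PLAN_UNITS = 180_000
UNIT_SECONDS = 30           # one unit = a browser session of up to 30 seconds
SECONDS_PER_PAGE = 9.5      # page load plus script time, as budgeted in the text
SUCCESS_RATE = 0.75         # about 25% of requests fail and are retried
PROXY_UNIT_FACTOR = 7       # built-in proxy usage makes each request ~7x as costly


def successful_requests(units: float) -> float:
    """Successful page loads that a given number of units should cover."""
    return UNIT_SECONDS / SECONDS_PER_PAGE * units * SUCCESS_RATE


without_proxy = successful_requests(PLAN_UNITS)
with_proxy = successful_requests(PLAN_UNITS / PROXY_UNIT_FACTOR)

cost_without_proxy = PLAN_PRICE_USD / without_proxy
cost_with_proxy = PLAN_PRICE_USD / with_proxy

print(f"without built-in proxy: {without_proxy:,.0f} requests, ${cost_without_proxy:.5f} each")
print(f"with built-in proxy:    {with_proxy:,.0f} requests, ${cost_with_proxy:.5f} each")
```

The computed per-request costs land close to the rounded-down figures quoted in the text.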


You can also pay for a dedicated unit. It costs $250/month and is available only by contacting the sales department. Judging by the range of addresses, it seems they host their clusters at DigitalOcean. Running 1 unit, I couldn't manage more than 6-7 parallel browsers.

The Premium plan promises a browser that is 3 times faster and unlimited cloud instance usage; it is also available only through support.


On-prem is a portable Docker container for use on your own servers. It looks like a good option if you want to embed it in your cloud-native setup. No price is listed, so I couldn't test it.



Surfsky

It's a new project currently in open public testing. According to the website, it uses a unique fingerprint spoofing technology to prevent anti-bot systems from detecting bot signatures. It consistently emulates Windows and macOS operating systems, using real fingerprints and deep browser spoofing at the browser kernel level. Like Browserless, it operates via CDP: the API returns a WebSocket URL, which can be used to write scripts in Puppeteer, Playwright, or Selenium. There's clear documentation, including quick start guides and code snippets.


I noticed several cool features: built-in proxies with SSH support, real GPU canvas rendering and, unlike Browserless, support for SOCKS5 proxies and OpenVPN. The API allows creating one-time profiles, which by default have a 30-second inactivity timeout, but unlike Browserless, this can be extended in the session settings. I also found that you can create persistent profiles. In this case, the browser instance (profile) can be managed by starting and stopping it. After stopping, the browser's entire state is saved, as if you closed it on your own computer, and it resumes from where you left off on the next launch. This is something I've been missing all these years. Currently, my scripts implement cookie saving and setting in each cycle, which is cumbersome and requires managing the whole state.


Moreover, Surfsky saves not only cookies, but also extensions, local storage, history, bookmarks, service workers, and passwords, which means that I can start and stop browsers whenever I need to, without any hassle — Surfsky takes care of all this, keeping it in their storage.


I also found a scraping API that allows working through a REST API, much like ScrapingBee.


During testing, I didn't find any serious limitations, except some minor inconveniences, such as the lack of a web-based administrative panel for managing subscriptions and viewing statistics. For statistics, they provide charts from Grafana, where you can see the number of running and pending instances.


I couldn't find the cost, but that's because the project is in open testing and is currently free. Why not give it a try?



Bright Data Scraping Browser

I appreciate Bright Data for its high-level service and the quality of its proxies. However, the price of proxies has always been a concern. On the Growth plan, residential proxies cost $8.04/GB, so a $500 package buys about 62GB of traffic. An average Amazon page with the necessary resources weighs about 12MB, and shared local caches reduce this to around 7MB. A $500 Growth package therefore covers only about 9,000 pages (62GB / 7MB), which is admittedly very limited.
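
The page budget is simple arithmetic, shown here so the page size can be swapped for your own measurements. The sketch assumes 1 GB = 1,000 MB; the constant names are illustrative.

```python
PRICE_PER_GB_USD = 8.04     # Growth plan, residential proxies
BUDGET_USD = 500
PAGE_MB_UNCACHED = 12       # average Amazon page with necessary resources
PAGE_MB_CACHED = 7          # with shared local caches
MB_PER_GB = 1000            # assumption: decimal gigabytes

traffic_gb = BUDGET_USD / PRICE_PER_GB_USD
pages_uncached = traffic_gb * MB_PER_GB / PAGE_MB_UNCACHED
pages_cached = traffic_gb * MB_PER_GB / PAGE_MB_CACHED

print(f"traffic: {traffic_gb:.1f} GB")
print(f"pages without caching: {pages_uncached:,.0f}")
print(f"pages with caching:    {pages_cached:,.0f}")
```

With these inputs the cached figure comes out slightly under 9,000 pages, and without caching it is a little over 5,000.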


Bright Data recently introduced a scraping browser, which, like Browserless and Surfsky, operates through CDP. When I checked the pricing page, I was startled by the costs. The Growth plan is $17.00/GB for monthly payment and $15.30/GB for a yearly subscription.


After thoroughly examining the capabilities and documentation, I concluded that it's not worth the investment. An interesting feature I found was automatic CAPTCHA solving, but for such a price, I'd rather invest in capmonster.cloud, which offers broader support for solving various CAPTCHAs, and pay someone on Fiverr $50 to integrate it into my stack. So, unless you are the Warren Buffett of automation with hundreds of thousands of dollars to spare on traffic costs, I would advise looking elsewhere.



Benchmark

For the final round I've selected 3 services, excluding Bright Data (because of the high price), and now we are ready to run our benchmarks. Each test will be run 3 times at 1-minute intervals. Here are the criteria for our testing:


  • Page load time. Consider the operation completed when the ‘DOMContentLoaded’ event is triggered. This benchmark will utilize the built-in proxies provided by the service itself. We'll use Amazon's Today's Deals as an example: https://www.amazon.com/gp/goldbox. The fastest service received 3 points, the second-fastest received 2 points, and the slowest received 1 point.
  • Built-in residential proxies. If a participant supports these, they get 1 point.
  • Support for custom proxy and VPN types. HTTP, SOCKS5, SSH, OpenVPN. If a participant supports these, they get 1 point.
  • Easy-to-use admin panel. Subscription management, request builder, etc. If a participant supports these, they get 1 point.
  • Examples of code for a quick start and clear documentation. If there was enough information available for me to work with the service, it received 1 point.
  • Browser fingerprinting. Matching IP address, user agent, browser API, WebRTC leaks, etc. We'll check this using CreepJS and PixelScan. If you want to learn more about browser fingerprinting, start with BrowserLeaks. The higher the score, the more points a participant gets. For Browserless, the 'stealth' option was applied. ScrapingBee was used with the 'Stealth Proxy' option. Surfsky operates in stealth mode by default, but also offers the option of flexible fingerprint tuning.
  • Performance. Scraping Google SERP (search results for 100 random keywords), Amazon Today's Deals, and eBay search results (100 random keywords). In this setup, each participant starts 20 instances simultaneously, and each test is run 3 times. For each service, I then averaged the throughput over the three targets. The fastest service received 3 points, the second-fastest received 2 points, and the slowest received 1 point.
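
The page load criterion (three runs, one minute apart, stopping the clock at DOMContentLoaded) could be measured with something like the sketch below. It is not the exact harness I ran: it assumes Playwright for Python is installed and that the service gives you a CDP WebSocket endpoint, and `ws_endpoint` is a placeholder you must supply yourself.

```python
import statistics
import time
from typing import Callable


def time_dom_content_loaded(ws_endpoint: str, url: str) -> float:
    """Seconds until DOMContentLoaded for `url` in a remote browser reached over CDP."""
    # Imported lazily so average_over_runs() works even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(ws_endpoint)
        try:
            # Reuse the default context if the remote browser exposes one.
            context = browser.contexts[0] if browser.contexts else browser.new_context()
            page = context.new_page()
            start = time.perf_counter()
            page.goto(url, wait_until="domcontentloaded")
            return time.perf_counter() - start
        finally:
            browser.close()


def average_over_runs(
    measure: Callable[[], float],
    runs: int = 3,
    interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Call `measure` `runs` times, pausing `interval_s` between runs, and return the mean."""
    samples = []
    for i in range(runs):
        samples.append(measure())
        if i < runs - 1:
            sleep(interval_s)  # no pause needed after the final run
    return statistics.mean(samples)
```

A call would look like `average_over_runs(lambda: time_dom_content_loaded(ws_endpoint, "https://www.amazon.com/gp/goldbox"))`, with `ws_endpoint` set to whatever URL the service returns.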



Criterion                                     | ScrapingBee   | Browserless   | Surfsky
----------------------------------------------|---------------|---------------|--------------
Page load time (avg)                          | 6.1s          | 4.7s          | 3.4s
Google SERP (avg)                             | 1.79 page/sec | 0.95 page/sec | 2.71 page/sec
Amazon Today's Deals (avg)                    | 1.18 page/sec | 0.97 page/sec | 3.32 page/sec
eBay search results                           | 1.15 page/sec | 0.78 page/sec | 1.36 page/sec
Average for three (20 instances concurrently) | 1.8 page/sec  | 0.9 page/sec  | 2.5 page/sec
Benchmark results                             | 7 points      | 8 points      | 14 points

The table also scored the following criteria, but the per-service marks for these rows were not preserved: Built-in residential proxies, HTTP support, SOCKS5 support, SSH support, OpenVPN support, Easy-to-use admin panel, Easy-to-use documentation, PixelScan passing, CreepJS passing.
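
To show how the 3/2/1 ranking works, here is a small sketch that recomputes the per-service averages from the three throughput rows and assigns rank points. The helper name `rank_points` is mine, and ties are not handled. Note that the mean of the three ScrapingBee rows as listed comes to about 1.37 page/sec rather than the 1.8 in the table, so one of those figures may have been mistyped; the ranking is the same either way.

```python
# Throughput in pages per second: Google SERP, Amazon Today's Deals, eBay search.
throughput = {
    "ScrapingBee": [1.79, 1.18, 1.15],
    "Browserless": [0.95, 0.97, 0.78],
    "Surfsky": [2.71, 3.32, 1.36],
}
# Average page load time in seconds (lower is better).
page_load_s = {"ScrapingBee": 6.1, "Browserless": 4.7, "Surfsky": 3.4}


def rank_points(scores: dict, higher_is_better: bool = True) -> dict:
    """Give the best service len(scores) points, the next one fewer, the worst 1 point."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: len(ordered) - position for position, name in enumerate(ordered)}


averages = {name: sum(values) / len(values) for name, values in throughput.items()}
performance_points = rank_points(averages, higher_is_better=True)
page_load_points = rank_points(page_load_s, higher_is_better=False)

for name in throughput:
    print(f"{name}: avg {averages[name]:.2f} page/sec, "
          f"performance {performance_points[name]} pts, "
          f"page load {page_load_points[name]} pts")
```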


Conclusion

It's fair to say that this was not an easy journey. As a result, we have a detailed comparison of what I consider to be the highest quality services. We learned about their pros and cons and were able to estimate the costs, which is crucial when planning your budget. The services demonstrated commendable results. Personally, I would highlight 2 winners: undoubtedly Surfsky, for its stable and efficient performance, along with its innovative features not found in competitors, and Browserless for its convenience.


Thank you for reading until the end, and I hope you found this informative. See you!