Proxy Servers for Your Data Science Project: A Comprehensive Guide

Written by proxyway | Published 2023/06/02

TL;DR: Web scraping comes with its own challenges, like CAPTCHAs and geo-restrictions. Proxies won’t solve all your problems, but aside from basic hygiene, such as using realistic user-agents, they’re the most effective way to keep your web scraper running.

One excellent way to build datasets is by scraping the web. It becomes especially relevant when sources like Common Crawl or official APIs fail to meet your project’s requirements (for example, the data isn’t recent enough or Twitter prices you out). However, web scraping comes with its own challenges, like CAPTCHAs and geo-restrictions. This is where our hero, the proxy, enters the scene.

A proxy server is another computer (a server, phone, or even IoT device) that lets you access the web through itself. This way, you get a new IP address from an optionally different location. Proxies are much like VPNs — the difference is that you can use many of them at once instead of just one or two connections.

Proxy servers are becoming more and more necessary as websites implement rate limiting and more sophisticated restrictions to prevent automated access. Sometimes they do so for good reasons, though often it’s to protect their commercial interests or monopolize data. Proxies won’t solve all your problems, but aside from basic hygiene, such as using realistic user-agents, they’re the most effective way to keep your web scraper running.

This guide will give you a strong understanding of proxy servers in the context of web scraping. It comes with theoretical knowledge together with some practical data we collected for our annual proxy market research.

First things first: you most likely don’t want to use free proxies

When web scrapers (the people) encounter a need for proxies, their first impulse is to try out a free proxy list. These lists are publicly available online; most of the proxies in them come from vulnerable or mismanaged servers.

One big problem with free proxy lists is that you never know who runs them. The operator may have malicious intentions, which can manifest in ad injection, traffic logging, and other nasty things. This sounds scary, but it’s unlikely to affect you much as a data scientist: you probably won’t configure proxies on the operating system level or use them to access your bank account.

A more boring — though realistic — issue is that free proxies are simply unreliable.

They rarely stay online for long. The ones that do run are unbearably slow because they have hundreds of tenants, and for the same reason they quickly get themselves blocked by any relevant website. Not a great tool to work with.

For this reason, I strongly recommend paying for a commercial proxy service.

Types of proxy servers out there

Proxy servers are most often categorized by the IP source. For web scraping purposes, you’ll most likely choose between three types: datacenter, residential, or ISP (also called static residential) proxies.

Datacenter proxies

These proxies are hosted on servers in data centers. They can be instances of Amazon Web Services (AWS) or come from any other cloud hosting provider. IP databases see datacenter proxies as belonging to hosting or datacenter ranges.

Datacenter proxies typically run on powerful hardware with fast internet connections, so you can scrape a lot of data quickly. They can be a very efficient choice for targets that don’t monitor IP reputation, but they struggle with websites that do.

Residential proxies

These proxies come from the residential devices of real people (hence the name). For example, they can run on a Windows laptop, Android phone, or a smart fridge on Wi-Fi. IP databases classify residential proxies as landline or mobile connections.

Residential proxies always come in large pools of addresses from around the world. They’re less predictable than datacenter IPs because they rely on the connection quality and availability of the host device. However, this very feature makes residential proxies very effective with all kinds of websites, as they look like regular visitors.

ISP proxies

These proxies are hosted on servers in data centers but registered under consumer internet service providers like Verizon. This is achieved by signing a contract with an ISP and getting it to announce IPs on its network. Ideally, IP databases should identify such proxies as landline or mobile connections. Sometimes, though, smaller regional ISPs are still classified as data centers.

ISP proxies retain the qualities of datacenter proxies but are able to scrape more effectively due to a better IP reputation.

How proxies are sourced and whether they’re legal

At this point, you may be thinking, “Woah, wait a minute! Android phones? Sounds awfully like a botnet! Is that even legal?” This is a good question to ask, and it shows that you care. Many don’t.

The fact is that there’s a fine line between a commercial proxy server and a botnet. It applies less to datacenter proxies where the supply chain is pretty much clear: a cloud host buys IPs, puts them on a server, and rents the addresses to a proxy provider. But the question of ethics becomes very relevant with residential proxy networks.

Without delving too deep into the topic, residential proxies can be sourced in several ways. The most popular method is via SDKs in desktop and mobile applications. This video gives concrete examples of what that looks like.

https://www.youtube.com/watch?v=1-Lbgd4BYHI&embedable=true

Another way is to exchange traffic directly for a service (such as a free VPN, see BrightVPN) or money (using bandwidth-sharing apps like Honeygain).

In any case, the onus falls on the proxy provider to ensure that the IP source is aware of the arrangement and consents to it. You should be able to find information about proxy sourcing and usage approaches on the provider’s website. This significantly reduces the chance that you’ll be using a botnet.

Proxy server configurations

Proxy servers may have different configurations based on tenancy and rotation.

The first criterion describes how many people can use the same proxy server at once. All free proxy lists have potentially hundreds of tenants, while commercial proxy providers impose limits. In the market lingo, shared or semi-shared means you’ll use the same proxy server with several other people (often one to four). Private or dedicated means that you alone will use the proxies for all or particular domains.

The ability to choose tenancy is a feature of datacenter and ISP proxies, as providers have full ownership over them. It’s unavailable with residential proxies but also less relevant — residential users have more natural browsing patterns and are less likely to overload websites with requests.

The second criterion, rotation, shows whether proxy servers switch automatically. Proxies that don’t rotate usually come in lists formatted as IP:port (1). Proxies that do rotate take the format of endpoint:port (2):

  1. 192.168.0.1:10000

  2. en.proxyprovider.net:10000

The endpoint functions as a gateway to the provider’s proxy pool. It automatically routes your requests through different addresses in the backend. Even when an IP address changes, the endpoint remains the same.
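As a rough illustration of the two formats, the sketch below uses only Python’s standard library. With a static list, the client does the rotating itself (here, round-robin via `itertools.cycle`); with a rotating endpoint, every request goes to the same gateway address and the provider swaps the exit IP behind it. All addresses are placeholders (the first static entry and the endpoint come from the examples above, the other two are made up), `proxy_map` and `build_opener_for` are hypothetical helper names, and real services usually also require credentials, which are omitted here.

```python
import itertools
import urllib.request

# Placeholder addresses for illustration only; none of these are real proxies.
STATIC_PROXIES = ["192.168.0.1:10000", "192.168.0.2:10000", "192.168.0.3:10000"]
ROTATING_ENDPOINT = "en.proxyprovider.net:10000"


def proxy_map(address: str) -> dict:
    """Return the scheme -> proxy URL mapping that urllib's ProxyHandler expects."""
    return {"http": f"http://{address}", "https": f"http://{address}"}


def build_opener_for(address: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS requests through one proxy."""
    handler = urllib.request.ProxyHandler(proxy_map(address))
    return urllib.request.build_opener(handler)


# Static list: the client rotates by picking the next address for each request.
static_rotation = itertools.cycle(STATIC_PROXIES)


def next_static_opener() -> urllib.request.OpenerDirector:
    """Return an opener bound to the next proxy in the static list."""
    return build_opener_for(next(static_rotation))


# Rotating endpoint: one opener is enough, because the gateway address
# stays the same while the provider changes the exit IP behind it.
gateway_opener = build_opener_for(ROTATING_ENDPOINT)

# Example call (commented out because it needs a working proxy):
# html = gateway_opener.open("https://example.com", timeout=10).read()
```

In practice, the static variant also needs logic for dropping blocked addresses, which is the bookkeeping a rotating endpoint takes off your hands.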

Rotating proxies are very convenient for web scraping, as you get access to an effectively infinite number of IPs (thousands to millions). However, such services often charge for traffic expenditure, while static proxy lists tend to offer unlimited traffic.

Comparing the different proxy types

Let’s compare the discussed proxy types based on their various characteristics.

Affordability

In the case of rotating proxy networks, residential proxies cost around eight to ten times more than datacenter addresses and a little less than rotating ISP proxy networks.

Static proxy networks usually charge per IP address. Compared to ISP proxies, datacenter proxy servers cost two to three times less, depending on how much you buy.

Infrastructure performance

Despite routing traffic through unpredictable end-user devices, major residential proxy networks work very well. For web scraping purposes (when the IP rotates with each connection request), they transfer data almost as well as datacenter proxies.

We also failed to see a big difference in latency when measuring response times for requests made to the nearest server of a global CDN (responses of several kilobytes) and to Amazon (responses of around 1 MB).

One area where datacenter and ISP proxies dominate is throughput. They have wider pipes for transferring data. Some residential proxies can be very fast, but you can just as well encounter servers that fail to reach 1 Mbps. There’s much less predictability.

IP uptime

Datacenter and ISP proxies can stay up nearly 24/7, going offline only when an outage or maintenance occurs. The uptime of a residential proxy depends on various factors, such as whether it comes from a Windows or Android device. In any case, it’s much shorter and less dependable compared to server-based proxies.

We wrote a script that pinged an IP database every 20 seconds to see how often a residential IP address would change. Here are some results:

  • IP #1: 43 seconds
  • IP #2: Over an hour
  • IP #3: 3 minutes, 33 seconds
  • IP #4: 8 minutes, 31 seconds
  • IP #5: Over an hour
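The script itself isn’t reproduced here, so the following is only a guess at its general shape: poll an IP-echo service through the proxy at a fixed interval and record how long each address survives before it is replaced. `fetch_ip` is a hypothetical callable that you would implement to return the current exit IP as a string; `sleep` and `clock` are injectable so that the loop can be tried without actually waiting.

```python
import time
from typing import Callable, List, Tuple


def measure_ip_lifetimes(
    fetch_ip: Callable[[], str],
    polls: int,
    interval: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, float]]:
    """Poll fetch_ip `polls` times, `interval` seconds apart.

    Returns (ip, seconds_observed) for every address that was seen
    to be replaced by a different one during the run.
    """
    lifetimes: List[Tuple[str, float]] = []
    current_ip = fetch_ip()
    first_seen = clock()
    for _ in range(polls - 1):
        sleep(interval)
        ip = fetch_ip()
        now = clock()
        if ip != current_ip:
            # The previous address is gone; record how long we saw it.
            lifetimes.append((current_ip, now - first_seen))
            current_ip, first_seen = ip, now
    return lifetimes
```

Note that the resolution of such a measurement is limited by the polling interval: with 20-second polls, a change is only noticed up to 20 seconds after it happens.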

Scraping effectiveness

Datacenter proxies struggle against protected or highly popular websites like Google or social media. This can be somewhat mitigated by choosing dedicated IPs with a cleaner usage history.

Sometimes, connecting from a datacenter IP range is enough for websites not to let you in. In such cases, the only recourse is to use a different proxy type.

ISP proxies have a better IP reputation by default, so they face less initial scrutiny. However, they still have less realistic browsing patterns compared to a true residential address. Residential proxy servers form large pools, are very diverse, and share browsing history with real people. These qualities make them very hard to distinguish from the website’s broader audience.
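One way to act on this is to start with the cheapest proxy type and escalate only when the target refuses it. The sketch below is illustrative only: `fetch` is a hypothetical function you supply that performs a request through the given proxy type and returns the HTTP status code, and treating 403 and 429 as the only block signals is a simplifying assumption (real websites may also answer with CAPTCHAs or misleading 200 pages).

```python
from typing import Callable, Optional, Tuple

# Cheapest first, most resilient last (assumed ordering for this sketch).
ESCALATION_ORDER = ("datacenter", "isp", "residential")
# Status codes treated as "blocked" here; an assumption, not a standard.
BLOCK_STATUSES = {403, 429}


def fetch_with_escalation(
    url: str,
    fetch: Callable[[str, str], int],
) -> Tuple[Optional[str], int]:
    """Try each proxy type in order until one is not blocked.

    Returns (proxy_type_used, status). proxy_type_used is None
    if every proxy type was blocked.
    """
    status = 0
    for proxy_type in ESCALATION_ORDER:
        status = fetch(url, proxy_type)
        if status not in BLOCK_STATUSES:
            return proxy_type, status
    return None, status
```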

Location variety

Datacenter and ISP proxies are hosted in data centers, and the number of available datacenters is limited. The most frequent location is probably Ashburn, though major providers can get you datacenter IPs from dozens of countries. However, if you need addresses in smaller locales, and especially their non-capital cities, you’ll be out of luck.

Residential proxies have no such restrictions — as long as there’s a device and a willing participant, anyone from anywhere can join. Consequently, major providers can offer IPs in all countries with city, ASN, and sometimes even ZIP code targeting.

Popularity

Based on our survey of major proxy providers, most named residential proxies as their most popular product. Only two providers named datacenter proxy servers, and none named ISP proxies.

Datacenter proxies have been falling off in recent years, as major targets like Amazon, Google, LinkedIn, and others tightened their security systems. ISP proxies have the capacity to take their place, but their uptake is hindered by sourcing difficulties — it’s hard to get reputable ISPs onboard.

Recommendations for your projects

But enough theory. How can you apply this knowledge to your project? I’ve modeled multiple scenarios with proxy server suggestions.

To keep this article free from commercial interests, I won’t mention particular providers. If you’d like some recommendations, we compare major providers on our website. (Disclaimer: we’re in an affiliate relationship with most of those companies. However, this has no impact on our market reports.)

Scraping websites without advanced protection

Example: Scraping a minor news portal, e-commerce website, or even a non-Google search engine.

Recommendation: Rotating datacenter proxies. You can get access to a pool of 2,000–100,000 rotating IPs for $0.7/GB or less. You won’t have to worry about replacing blocked IPs, and the cost of traffic is low enough for some serious data extraction.
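To get a feel for what per-gigabyte pricing means in practice, a back-of-the-envelope estimate helps. The sketch below uses the $0.7/GB figure from above; the page count and average page size are made-up inputs, and 1 GB is taken as 1,000 MB for simplicity.

```python
def estimate_traffic_cost(
    pages: int,
    avg_page_mb: float,
    price_per_gb: float = 0.7,
) -> float:
    """Estimate the dollar cost of downloading `pages` pages via a metered proxy."""
    total_gb = pages * avg_page_mb / 1000  # assumes 1 GB = 1,000 MB
    return total_gb * price_per_gb


# Hypothetical job: one million pages averaging half a megabyte each.
cost = estimate_traffic_cost(pages=1_000_000, avg_page_mb=0.5)
print(f"Estimated proxy traffic cost: ${cost:.2f}")
```

Under these assumptions, the job moves about 500 GB, which comes to roughly $350 in proxy traffic. Failed requests and retries would add to that.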

Downloading multimedia content

Example: Downloading videos off a streaming service or pictures from an image aggregator for archiving purposes.

Recommendation: Static datacenter or ISP proxies. They are fast and don’t count traffic expenditure. Go for datacenter proxies when the website allows it and ISP proxies otherwise.

Crawling many different websites

Example: Searching for brand mentions throughout the web.

Recommendation: Static datacenter or ISP proxies. Crawling requires a lot of data, so it’s best to choose a proxy type that doesn’t use traffic as the main metric. Also, even a finite number of IPs will take you far when you switch targets.

Scraping large popular targets

Example: Extracting job ads from LinkedIn, company information from G2, or hashtagged posts from a social media network.

Recommendation: Residential proxies. With an effectively infinite number of IPs, you won’t risk getting all of your proxies banned. In addition, you’ll experience a higher success rate compared to other proxy types.

Collecting real-time data

Example: Tracking stock price movements.

Recommendation: Datacenter proxies if the website allows them; otherwise, ISP proxies. Their fast connection speed ensures that you’ll extract data as it appears.

Accessing hyper-localized pages

Example: Monitoring search engine page positions for localized Google queries.

Recommendation: Residential proxies, as they support city-level IP filtering.

Conclusion

This article gave you a brief introduction to proxy servers for web scraping purposes. After reading it, you should be able to distinguish between the main proxy types and configurations, and to judge which setup would benefit your data science project the most.


Written by proxyway | Your Trusted Guide to All Things Proxy.
Published by HackerNoon on 2023/06/02