According to Wikipedia, "In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server providing that resource".
In other words, it is a server application that stands between you and your target server: you can send a request to the target server through the proxy server, so the machine that sends the request to the target server is the proxy and not your client.
There are several ways proxies can be useful when scraping.
As we have seen in the examples before, proxies can be categorized using the feature we want to highlight. Usually, when comparing the different proxies vendors or when we find online a list of free proxies, the main features are anonymity, its location, and if it's static or rotating.
Usually, this information is provided in the free online proxies lists, where the proxies are divided in:
Every IP can be traced back to a country of provenience, so as e-commerce websites can geo-fence any IP, also proxy providers usually sell proxies located in the main countries of the world.
On top, the most important players in the field allow their customers to choose the final location of the proxies, like a data center, a mobile phone, or a residential IP.
Data center IPs are the most common and cheap but also they are likely to get banned from websites that use strong fingerprinting techniques to block bots or IPS that are in a well-known range of data center addresses. I made a test, setting up a Highly Anonymous Proxy on an Ec2 AWS instance and this is the result of the IP API.
As you can see, combining data from several sources and public registries, it can be easily detected that the request is coming from an EC2 instance.
The TCP-IP Fingerprint test also gives interesting results.
These are the results when browsing from my Mac without using proxies. It can be detected with a fair degree of certainty that I'm using a Mac Os operating system and this is because every operating system uses different values inside the TCP Handshake packet sent when establishing a connection with a server.
In fact, when turning on the proxy on my Unix server, the results of the test change, and, with such an uptime value, it's a red flag signaling that the request is coming from a server and not a personal client.
To mimic a more realistic real user sending requests to a website, the most important proxy providers give the option to use proxies with residentials IP, so installed outside data centers, or mobile proxies.
Of course, due to scarcity and the cost of the bandwidth, these options are more expensive than the data center's one.
Using a static proxy means that your requests will be made all from the same IP of the proxy. This doesn't help in case the website limits the number of requests per IP in a certain timeframe.
The solution is a rotating proxy: it means that you set up your scraper as if you're using only one proxy but, in the background, every request sent to its address is routed to a pool of proxies.
The target website will see these requests coming from different IPs and machines and will not trigger any block.
There ain't no such thing as a free lunch, even in this field.
For a list of free proxies, you can have a look here but tendentially they are unreliable and slow.
If you're willing to pay for proxies, there are many actors in this industry, the most famous are:
Also published here.