I was utterly frustrated with scraping HTTP data from a firewall-protected website. Despite using residential proxies from multiple providers, my requests kept getting blocked without any clear reason. Sometimes, the script worked on my local machine, but it would fail when running on a cloud server.
After extensive research, I stumbled upon the concept of TLS fingerprinting. Let me break it down for you:
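When we send an HTTPS request to a server, the connection begins with a TLS “Client Hello” message. This message advertises the supported TLS versions, the cipher suites (encryption algorithms) in the client’s preferred order, the TLS extensions, and several other parameters. The User-Agent is not part of it: that is an HTTP header, sent only after the handshake completes. Taken together, the Client Hello values form a TLS fingerprint that is characteristic of the client software (a browser, curl, an HTTP library), so a firewall can distinguish bots and scripts from legitimate browser clients even when the HTTP headers look identical.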
Check your TLS Fingerprint: https://tls.peet.ws/
Sources: https://www.zenrows.com/blog/what-is-tls-fingerprint
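To make the idea concrete, here is a minimal sketch of how a fingerprint can be derived from Client Hello fields. It follows the layout of JA3, one widely used scheme; I don't know which scheme any particular firewall uses, so treat that choice as an assumption. Five fields are joined with commas, list values are joined with dashes, and the result is hashed with MD5. The `ClientHello` type and the sample numbers are hypothetical and chosen only for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, simplified view of the Client Hello fields that JA3 looks at.
// All values are the numeric IDs that appear on the wire.
interface ClientHello {
  tlsVersion: number;
  cipherSuites: number[];
  extensions: number[];
  ellipticCurves: number[];
  pointFormats: number[];
}

// Build the JA3 string: fields separated by ",", list items by "-".
function ja3String(hello: ClientHello): string {
  return [
    String(hello.tlsVersion),
    hello.cipherSuites.join("-"),
    hello.extensions.join("-"),
    hello.ellipticCurves.join("-"),
    hello.pointFormats.join("-"),
  ].join(",");
}

// The fingerprint is the MD5 hex digest of that string.
function ja3Hash(hello: ClientHello): string {
  return createHash("md5").update(ja3String(hello)).digest("hex");
}

// Illustrative values only, not captured from a real client.
const sample: ClientHello = {
  tlsVersion: 771, // 0x0303, TLS 1.2
  cipherSuites: [49195, 49199, 49196, 49200],
  extensions: [0, 10, 11, 13],
  ellipticCurves: [29, 23, 24],
  pointFormats: [0],
};

// Same suites, different order: the fingerprint changes.
const reordered: ClientHello = {
  ...sample,
  cipherSuites: [49199, 49195, 49196, 49200],
};

console.log(ja3String(sample));
console.log(ja3Hash(sample), ja3Hash(reordered));
```

Because the cipher order is part of the hashed string, changing that order alone is enough to produce a different hash.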
Armed with this knowledge, I dug deeper into bypassing TLS fingerprinting. I couldn’t find a one-stop solution, so I’m sharing my findings.
We’ll be using JavaScript (with TypeScript type annotations) to code the solution, but the actual request will be made using the curl command. This approach can be easily implemented in other programming languages.
Here is the sample curl command I initially executed:
curl --location 'url' \
--header 'accept: application/json, text/javascript, */*; q=0.01' \
--header 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
--header 'cookie: authCookie' \
--header 'sec-ch-ua: "Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"' \
--header 'sec-ch-ua-bitness: "64"' \
--header 'sec-ch-ua-full-version: "120.0.6099.234"' \
--header 'sec-ch-ua-full-version-list: "Not_A Brand";v="8.0.0.0", "Chromium";v="120.0.6099.234", "Google Chrome";v="120.0.6099.234"' \
--header 'sec-fetch-dest: empty' \
--header 'sec-fetch-mode: cors' \
--header 'sec-fetch-site: same-origin' \
--header 'user-agent: userAgent' \
--header 'x-requested-with: XMLHttpRequest'
I was already routing requests through a proxy (it appears as --proxy in the final command further down). However, that wasn’t good enough, so I added a few TLS options to go with the User-Agent my script was generating.
Here is the code snippet that picks the TLS version flag for the request and, based on that version, selects and randomizes the cipher suites so the TLS fingerprint varies between requests:
const tls12CipherSuites = [
  'ECDHE-ECDSA-AES256-GCM-SHA384',
  'ECDHE-RSA-AES256-GCM-SHA384',
  'ECDHE-ECDSA-AES128-GCM-SHA256',
  'ECDHE-RSA-AES128-GCM-SHA256',
  'ECDHE-ECDSA-AES256-SHA384',
  'ECDHE-RSA-AES256-SHA384',
  'ECDHE-ECDSA-AES128-SHA256',
  'ECDHE-RSA-AES128-SHA256',
  'DHE-RSA-AES256-GCM-SHA384',
  'DHE-RSA-AES128-GCM-SHA256',
  'DHE-RSA-AES256-SHA256',
  'DHE-RSA-AES128-SHA256',
];
// CBC/SHA-1 suites usable with TLS 1.1, written with OpenSSL cipher names
// (the format curl's --cipher expects on OpenSSL builds).
const tls11CipherSuites = [
  'AES128-SHA',
  'AES256-SHA',
  'DES-CBC3-SHA',
  'ECDHE-RSA-AES128-SHA',
  'ECDHE-RSA-AES256-SHA',
  'ECDHE-ECDSA-AES128-SHA',
  'ECDHE-ECDSA-AES256-SHA',
  'DHE-RSA-AES128-SHA',
  'DHE-RSA-AES256-SHA',
  'DHE-RSA-DES-CBC3-SHA',
];
type SuiteMap = {
  [key: string]: string[];
};
// Keys double as the suffix of curl's TLS version flag (--tlsv1.1, --tlsv1.2).
const versionCipherMap: SuiteMap = {
  'v1.1': tls11CipherSuites,
  'v1.2': tls12CipherSuites,
};
// Keep the first three suites in place, Fisher-Yates shuffle the rest,
// and return the colon-separated list that curl's --cipher option accepts.
function shuffledSuites(inputArray: string[]): string {
  const firstThree = inputArray.slice(0, 3);
  const rest = inputArray.slice(3);
  for (let i = rest.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [rest[i], rest[j]] = [rest[j], rest[i]];
  }
  return firstThree.concat(rest).join(':');
}
// Pick a TLS version at random and return [curl version flag, cipher list].
function provideTLSAndSuites(): [string, string] {
  const tlsVersions = Object.keys(versionCipherMap);
  const selectedVersion = tlsVersions[Math.floor(Math.random() * tlsVersions.length)];
  const tlsVersionString = `--tls${selectedVersion}`;
  const shuffledSuitesString = shuffledSuites(versionCipherMap[selectedVersion]);
  return [tlsVersionString, shuffledSuitesString];
}
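In short, provideTLSAndSuites picks one of the two TLS versions at random and returns the matching curl flag (--tlsv1.1 or --tlsv1.2) along with a colon-separated cipher list in which the first three suites keep their position and the rest are shuffled. Be aware that curl treats --tlsv1.1 and --tlsv1.2 as minimum versions, so the server can still negotiate a newer one; curl’s --tls-max option caps the version if you need to pin it.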
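A quick way to check the shuffling behaviour is a self-contained variant that accepts the random source as a parameter. The `partialShuffle` name and its `keep` and `random` parameters are my own additions for illustration and are not part of the code above; the suite list is a shortened sample.

```typescript
// Keep the first `keep` items in place and Fisher-Yates shuffle the rest.
// `random` is injectable so the result can be made repeatable in tests.
function partialShuffle(
  items: string[],
  keep = 3,
  random: () => number = Math.random,
): string[] {
  const head = items.slice(0, keep);
  const tail = items.slice(keep);
  for (let i = tail.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [tail[i], tail[j]] = [tail[j], tail[i]];
  }
  return head.concat(tail);
}

// Shortened sample list, using names from the TLS 1.2 array.
const suites = [
  "ECDHE-ECDSA-AES256-GCM-SHA384",
  "ECDHE-RSA-AES256-GCM-SHA384",
  "ECDHE-ECDSA-AES128-GCM-SHA256",
  "ECDHE-RSA-AES128-GCM-SHA256",
  "DHE-RSA-AES256-GCM-SHA384",
  "DHE-RSA-AES128-GCM-SHA256",
];

const shuffled = partialShuffle(suites);
// The value curl expects after --cipher is the colon-joined list.
console.log(shuffled.join(":"));
```

Every suite is still offered on each run; only the order of the tail differs, while the head of the list stays fixed.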
We’ll fetch the TLS version and cipher suites for the request and embed them in the curl command:
const [tlsVersionString, cipherSuitesString] = provideTLSAndSuites();
Here is the resulting curl command, where tlsVersionString and cipherSuitesString stand for the two values returned above:
curl --connect-timeout 30 --max-time 50 'url' \
--proxy 'proxy_url' tlsVersionString --cipher cipherSuitesString \
-H 'accept: application/json, text/javascript, */*; q=0.01' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'cookie: authCookie' \
-H 'sec-ch-ua: "Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"' \
-H 'sec-ch-ua-bitness: "64"' \
-H 'sec-ch-ua-full-version: "120.0.6099.234"' \
-H 'sec-ch-ua-full-version-list: "Not_A Brand";v="8.0.0.0", "Chromium";v="120.0.6099.234", "Google Chrome";v="120.0.6099.234"' \
-H 'sec-fetch-dest: empty' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-site: same-origin' \
-H 'user-agent: userAgent' \
-H 'x-requested-with: XMLHttpRequest'
Hopefully, this will allow your request to make it through the firewall.
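If you drive curl from Node, one way to embed the two values is to build an argument array and pass it to `execFile`, which sidesteps shell quoting problems. This is a sketch under assumptions: `buildCurlArgs`, `runCurl`, and the `CurlRequest` shape are hypothetical names of mine, and the URL, proxy, and header values are placeholders, just like in the command above.

```typescript
import { execFile } from "node:child_process";

// Hypothetical request description; none of these names come from curl itself.
interface CurlRequest {
  url: string;
  proxyUrl: string;
  tlsVersionFlag: string; // e.g. "--tlsv1.2"
  cipherSuites: string; // colon-separated list for --cipher
  headers: Record<string, string>;
}

// Turn the request into curl's argument list, one array entry per token,
// so no shell escaping is needed.
function buildCurlArgs(req: CurlRequest): string[] {
  const args = [
    "--connect-timeout", "30",
    "--max-time", "50",
    "--proxy", req.proxyUrl,
    req.tlsVersionFlag,
    "--cipher", req.cipherSuites,
  ];
  for (const [name, value] of Object.entries(req.headers)) {
    args.push("-H", `${name}: ${value}`);
  }
  args.push(req.url);
  return args;
}

// Run curl and resolve with its stdout; rejects if curl fails to start
// or exits with a non-zero status.
function runCurl(req: CurlRequest): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile("curl", buildCurlArgs(req), (error, stdout) => {
      if (error) reject(error);
      else resolve(stdout);
    });
  });
}

// Placeholder values; nothing is executed until runCurl is called.
const exampleArgs = buildCurlArgs({
  url: "https://example.com/api",
  proxyUrl: "http://127.0.0.1:8080",
  tlsVersionFlag: "--tlsv1.2",
  cipherSuites: "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384",
  headers: { "x-requested-with": "XMLHttpRequest" },
});
console.log(exampleArgs.join(" "));
```

In real use you would pass the pair returned by provideTLSAndSuites as `tlsVersionFlag` and `cipherSuites`, generating a fresh pair for each request.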
If this helped you or you enjoyed the content, don’t forget to clap and follow for more such content. Happy scraping!