Outsmarting Akamai's Bot Detection with JA3Proxy

Akamai Bot Manager is one of the most common anti-bot solutions on the market. It is used by websites of many profiles, from e-commerce to travel, and depending on its configuration it can be hard to bypass. In my experience, the typical pattern when a website enables Akamai Bot Manager protection is that the scraper (usually Scrapy in my stack) hangs and times out from the very first request.

But what is Akamai Bot Manager, and how can you check whether a website is using it?

Akamai Bot Manager Overview

Akamai's bot detection, like every other modern bot protection software, works on multiple layers.

Network fingerprinting: Akamai analyzes the TLS handshake and connection details (JA3 TLS fingerprint, cipher suites, TLS version, and so on) to check whether they match a real browser or a known automation tool. Each browser (Chrome, Firefox, Safari, etc.) has a characteristic TLS fingerprint.

HTTP/2 usage and header order: Akamai also checks HTTP/2 usage and the headers and their order. Real browsers almost always use HTTP/2 or later and send HTTP headers in a specific order and format. If a client still uses HTTP/1.1 or sends headers in a non-browser order, that is a red flag, so mirroring a real browser matters.

IP reputation and analysis: Akamai checks whether the client IP belongs to a residential network, a mobile network, or a data center. Residential and mobile IPs (the kind real users have) get a higher trust score, while known data center IP ranges are automatically suspect. Residential proxies help here, because the traffic always appears to come from different real-user locations rather than from a cloud server farm.

Behavioral analysis: Finally, Akamai employs behavioral analysis using client-side scripts and AI models. A JavaScript sensor on the web page collects numerous data points about the client's environment and interactions (timing, mouse movements or their absence, unusual properties of browser objects, etc.). Akamai's AI models crunch this data and assign a bot probability score to each session. This layer is the hardest to bypass and often requires running a headless browser or replicating the sensor logic.

But how can we detect that a website is using Akamai Bot Manager? Apart from the usual Wappalyzer browser extension, if you notice the _abck and ak_bmsc cookies being set by a website, that is the clearest sign that it uses Akamai to protect itself.

Given these defenses, many Scrapy users turn to the scrapy-impersonate download handler to bypass Akamai. This plugin integrates curl_cffi, a library for spoofing the network signatures of real browsers. In practice, Scrapy Impersonate makes your Scrapy spider's requests "look" like Chrome or Firefox: it provides TLS fingerprints (JA3) that match those browsers, uses HTTP/2, and even adjusts low-level HTTP/2 frame headers to imitate browser patterns.

Limitations of Scrapy Impersonate

Scrapy Impersonate is a powerful tool, but it has some limitations:

Locked into Scrapy: Scrapy Impersonate is designed as a Scrapy download handler, which means it only works within Scrapy's asynchronous framework. If your project doesn't use Scrapy, or you want to switch to a different framework (like a simple script with requests/httpx or an asyncio pipeline), you can't directly carry over its capabilities. Migrating away from Scrapy often means a complete rewrite of your HTTP logic, and you'd lose the built-in TLS spoofing unless you implement a new solution from scratch.

Proxy Rotation Challenges: Using Scrapy Impersonate alongside proxy rotation can be tricky. Under the hood, it replaces Scrapy's default downloader with one based on curl_cffi, which doesn't integrate seamlessly with Scrapy's proxy middleware. Early adopters discovered that HTTPS proxy support was broken because the proxy handling code was bypassed. Although fixes and workarounds exist (like disabling Scrapy's built-in proxy handling and configuring curl_cffi directly), it's harder to rotate proxies or handle proxy authentication with this setup. Robust error handling for proxy failures (e.g., detecting a dead proxy and retrying) is not as straightforward as with Scrapy's standard downloader, because errors bubble up from the curl layer and may not trigger Scrapy's usual retry logic.

Maintenance and Flexibility: Scrapy Impersonate currently supports a finite list of browser fingerprints (Chrome, Edge, Safari, etc., up to certain versions). This list can lag behind the latest browser releases. You might be stuck impersonating an older browser version, which could be a problem if a target site specifically requires the nuances of a newer TLS handshake (some advanced WAFs check minor details that change between Chrome versions).

Not a Silver Bullet: Perhaps most importantly, even with proper TLS and HTTP/2 impersonation, Akamai Bot Manager can still detect and block you. For websites that have implemented a higher level of protection and also check the browser fingerprint, any browserless configuration, including Scrapy Impersonate, isn't sufficient against Akamai or similar top-tier bot defenses. You might get past the TLS handshake but fail on other signals (like the absence of the expected sensor data or subtle discrepancies in headers and cookies). In other words, it's a piece of the puzzle, not a complete solution.

The solution we'll see today helps with the first two points: we'll chain JA3Proxy, for an optimal TLS fingerprint, with a rotating residential proxy, to rotate our IP and get a higher reputation score.

Understanding TLS Fingerprints and JA3

Before diving into the solution, it's important to understand what we are spoofing. Every HTTPS client presents a unique TLS fingerprint during the handshake. This fingerprint is a combination of the TLS protocol version and a number of options the client says it supports: think of it as the client's "dialect" of TLS. Its main components are:

Supported TLS Version: e.g. TLS 1.2 vs TLS 1.3. Modern browsers will offer 1.3 (while still allowing 1.2 for compatibility). Older clients or some libraries might only do 1.2.

Cipher Suites: the list of cryptographic algorithms the client can use, in preferred order. Browsers tend to have long lists including ciphers like AES-GCM, ChaCha20, etc., plus some GREASE (randomized) values, which exist to keep servers tolerant of unknown values.

Extensions: extra features in TLS, like Server Name Indication (SNI), supported groups (elliptic curves), ALPN (which is used for HTTP/2 negotiation), etc. Both the presence of certain extensions and their order matter.
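To make the "dialect" idea concrete, the standard library can list what a stock Python TLS context is configured to offer. This is only a minimal sketch, assuming CPython's ssl module backed by OpenSSL; the helper name offered_cipher_suites is illustrative, and masking the cipher ID to its low 16 bits relies on how OpenSSL encodes the IANA suite number, which is worth verifying on your own build.

```python
import ssl


def offered_cipher_suites(context=None):
    """Return (iana_id, name) pairs for the enabled cipher suites, in priority order."""
    context = context or ssl.create_default_context()
    suites = []
    for cipher in context.get_ciphers():
        # OpenSSL reports a 32-bit id; the low 16 bits are assumed to hold the
        # IANA suite number (see the note in the text above).
        suites.append((cipher["id"] & 0xFFFF, cipher["name"]))
    return suites


if __name__ == "__main__":
    ctx = ssl.create_default_context()
    print("OpenSSL build:", ssl.OPENSSL_VERSION)
    print("TLS versions :", ctx.minimum_version.name, "to", ctx.maximum_version.name)
    for suite_id, name in offered_cipher_suites(ctx):
        print(f"{suite_id:>6}  {name}")
```

Comparing this list with what a browser offers is a quick way to see why a plain Python client and a browser end up with different fingerprints.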
The concept of JA3 fingerprinting is a standardized way to record these TLS Client Hello details. JA3, named after the initials of its creators, builds a fingerprint string by concatenating the fields above in a specific order:

    JA3_string = TLSVersion,CipherSuiteIDs,ExtensionIDs,EllipticCurveIDs,EllipticCurveFormatIDs

Each list (ciphers, extensions, etc.) is joined by "-" and each section by ",". For example, a Chrome browser might produce a JA3 string like:

    771,4866-4867-4865-....-47-255,0-11-10-16-23-...-21,29-23-30-25-24,0-1-2

This represents TLS 1.2 (771 is 0x0303) and a specific set of cipher suites, extensions, supported curves, and curve formats (the numbers are standardized IDs). The JA3 string is usually MD5 hashed: security tools often log or compare the MD5 hash, since it is easier to handle than a long sequence of numbers.

Why does this matter for bot detection? Because browser TLS stacks are fairly uniform. Chrome version X on Windows will always present the same JA3 fingerprint, down to the order of the lists. Firefox has its own distinct JA3. Python's requests library (which uses OpenSSL under the hood) has a JA3 that is totally different from any mainstream browser, so it is easily detected. Anti-bot services like Akamai maintain databases of JA3 hashes: if your JA3 isn't on the "known good" list (common browsers), or it is on a known automation list, you get flagged.

In short, to pass Akamai's TLS fingerprint checks, we need our client's JA3 to match a popular browser. This usually means mimicking the latest Chrome or Firefox fingerprint (since those are the most common legitimate users on the web). Simply changing the User-Agent string isn't enough: we have to modify the low-level TLS handshake. Scrapy Impersonate does this through curl_cffi (which itself relies on special builds of curl and TLS libraries to mimic browsers), but outside Scrapy we need another way to achieve the same effect.

TLS Impersonation Proxy + Residential Proxy Chain

Our solution is to chain two proxies to make our scraper nearly indistinguishable from a real browser user:

JA3Proxy for TLS Impersonation: JA3Proxy is an open-source tool that acts as an HTTP(S) proxy and replays traffic with a chosen TLS fingerprint. In other words, you run JA3Proxy locally, configure it to imitate a specific browser's TLS handshake, and then direct your scraper traffic through it. JA3Proxy will terminate your TLS connection and initiate a new TLS handshake to the target site using the impersonated fingerprint. From the target site's perspective, it looks like, say, a Chrome browser connecting. The beauty of this approach is that it's client-agnostic: you can use Python requests, httpx, cURL, or anything else, simply by pointing it at JA3Proxy. You are no longer locked into Scrapy or any particular library to get browser-like TLS; the proxy takes care of it.

Under the hood, JA3Proxy uses uTLS (an advanced TLS library in Go) to customize the Client Hello. It supports a variety of client profiles (Chrome, Firefox, Safari, etc., across different versions). You can, for example, configure it to mimic the latest browsers available in the library. For our needs, we'd choose the latest available Chrome fingerprint, Chrome 133. As with Scrapy Impersonate, support for the latest browsers can take some time to land in the library, but as long as it is updated regularly, this is not an issue.

One thing to note: JA3Proxy focuses on TLS fingerprints (the JA3 part). It doesn't inherently modify HTTP headers (other than what relates to TLS, like ALPN for HTTP/2) or handle higher-level browser behaviors. It gets us past the network fingerprinting, which is the hardest part to change, but we must still ensure our HTTP headers and usage patterns are correct. Luckily, we can manually set headers in our HTTP client to mimic a browser (User-Agent, etc.), and HTTP/2 can be achieved as long as the TLS negotiation allows it (Chrome's Client Hello advertises ALPN support for h2, so if the site supports it, JA3Proxy will negotiate HTTP/2).

Residential Proxy for IP Rotation: The second part of the chain is an upstream residential proxy. This takes care of IP reputation and distribution.

The combined effect is powerful: to Akamai, your scraper now looks like Chrome 133 running on a residential IP. The TLS handshake matches Chrome's JA3, the HTTP/2 settings and headers can be adjusted to match Chrome, and the source IP is a regular household. This addresses the major fingerprinting vectors at the network level. It doesn't solve Akamai's JavaScript-based challenges by itself, but it should be enough to bypass most of the websites you'll encounter.
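Because the proxy is client-agnostic, the client side only has to point at it and send browser-like headers. The sketch below is one possible way to do that, not the script from the article's repository. It assumes JA3Proxy listens on localhost:8080, that its certificate is saved as cert.pem, and that httpx 0.26 or later is installed with HTTP/2 support (pip install "httpx[http2]"). The helper names, the header values and the http2=True choice are illustrative assumptions.

```python
"""Sketch: route an HTTP client through a local TLS-impersonation proxy."""

# Assumptions (not from the article's repository): the proxy address, the
# certificate path, the header values and the helper names are illustrative.
PROXY_URL = "http://localhost:8080"
CA_CERT = "cert.pem"

# A few Chrome-like headers; the User-Agent should agree with the TLS profile
# selected in the proxy.
BROWSER_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.5",
    "upgrade-insecure-requests": "1",
    "user-agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    ),
}


def client_options(proxy_url=PROXY_URL, ca_cert=CA_CERT):
    """Return keyword arguments suitable for httpx.Client."""
    return {
        "proxy": proxy_url,   # send every request through the local proxy
        "verify": ca_cert,    # trust the proxy's self-signed certificate
        "http2": True,        # allow HTTP/2 (assumed, given the HTTP/2 issue with requests)
        "headers": dict(BROWSER_HEADERS),
        "timeout": 30.0,
    }


def fetch(url):
    """Fetch url through the proxy and return (status_code, body)."""
    import httpx  # imported lazily so the options can be inspected without httpx

    with httpx.Client(**client_options()) as client:
        response = client.get(url)
        return response.status_code, response.text


if __name__ == "__main__":
    # Printing the options needs neither httpx nor a running proxy.
    print(client_options())
```

With the proxy running, fetch("https://example.com") would return the status code and body as seen through the impersonated handshake. Keeping the options in a plain dictionary makes it easy to swap the proxy address or headers per target.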
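
JA3Proxy Installation Guide

Let's set up JA3Proxy and chain it with a residential proxy.

Install JA3Proxy

JA3Proxy is written in Go. You have two easy options: compile it from source or use a Docker container. To build from source, you need Go installed. Run:

    git clone https://github.com/LyleMi/ja3proxy.git
    cd ja3proxy
    make

This should produce a ja3proxy executable in the folder (alternatively, you can run go build manually, since the project is Go-based).

If you prefer Docker, there is a pre-built image on the GitHub Container Registry. For example:

    docker pull ghcr.io/lylemi/ja3proxy:latest

fetches the latest image. You can then run it with docker run (the run command is shown in a moment). Docker is convenient because it packages everything without requiring a local Go environment.

In my personal experience, the installation was a bit of a nightmare. I couldn't get the Docker image to work: I kept receiving errors when trying to connect to it, because no browser was recognized. I then decided to build it manually on my Mac and ran into the same errors. After a few hours of debugging, I found that I needed to update some dependencies, especially uTLS: there were conflicting library versions, and all of this was causing the problems.

Obtain or Create TLS Certificates

JA3Proxy can act as an HTTPS proxy, intercepting TLS and presenting its own certificate to your client. By default, it looks for cert.pem and key.pem. If you don't provide a certificate, you can run it in plaintext mode (as a regular HTTP proxy) and ignore certificate verification in your client (not recommended for production, but acceptable for testing).

The best practice is to generate a self-signed root certificate and key and configure your scraper to trust that certificate, so that you can intercept traffic without security warnings. You can generate one with OpenSSL:

    openssl req -x509 -newkey rsa:2048 -sha256 -days 365 -nodes -keyout key.pem -out cert.pem -subj "/CN=JA3Proxy"

This creates a cert.pem / key.pem pair valid for one year (for production use, you might use a legitimate internal CA if you have one, but for most scraping purposes a self-signed certificate is fine as long as your client trusts it).

Launch JA3Proxy with a Chrome Fingerprint

If you are using the binary, run a command like:

    ./ja3proxy -port 8080 -client Chrome -version 131 -cert cert.pem -key key.pem -upstream YOURPROXYIP:PORT

Let's break down this command:

-port 8080 tells it to listen on port 8080 (you can choose another port if needed).

-client Chrome -version 131 selects the fingerprint profile. In this example, it uses the built-in profile for Chrome 131. Replace these with the profile matching the browser and version you want. For example, if Chrome 130 is the latest version supported, you might use -client Chrome -version 130. (You can find the list of available fingerprints by checking JA3Proxy's documentation or the uTLS library it uses. Profiles include various Chrome and Firefox versions, Safari, Edge, etc.)

-cert and -key specify the TLS certificate files generated in step 2.

-upstream 123.45.67.89:1080 is the address of the upstream proxy.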
This should be replaced with your residential proxy endpoint. Important: JA3Proxy expects this to be a SOCKS5 proxy address. If your provider gave you something like proxy.provider.com:8000 with a username and password, you can try the format username:password@proxy.provider.com:8000. (JA3Proxy will parse the string and should handle SOCKS5 authentication if it is given in user:pass@host:port form. If that doesn't work, you might configure your residential proxy to be IP-allowed, use an IP whitelist feature to avoid authentication, or run a local Dante SOCKS proxy that forwards to an authenticated upstream proxy.)

If you are using Docker, the equivalent is:

    docker run -p 8080:8080 \
      -v $(pwd)/cert.pem:/app/cert.pem -v $(pwd)/key.pem:/app/key.pem \
      ghcr.io/lylemi/ja3proxy:latest \
      -client Chrome -version 133 -cert /app/cert.pem -key /app/key.pem \
      -upstream YOURPROXYIP:PORT

We mount the cert and key into the container and expose port 8080. Adjust the command to include your actual proxy credentials and host. Once it is running, JA3Proxy listens on localhost:8080 (or whichever host and port you specified).

A real-world use case - MrPorter.com

MrPorter.com is a fashion e-commerce website that, along with many others in the industry, protects itself with Akamai Bot Manager. Using a simple Python request, as specified in the file simple_request.py in the repository, I ran into a timeout error, as expected.

    import requests

    URL = "https://www.mrporter.com/en-gb/mens/product/loewe/clothing/casual-shorts/plus-paula-s-ibiza-wide-leg-printed-cotton-blend-terry-jacquard-shorts/46376663162864673"

    # Headers copied from a real browser session (Brave on macOS).
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.5",
        "priority": "u=0, i",
        "sec-ch-ua": "\"Brave\";v=\"135\", \"Not-A.Brand\";v=\"8\", \"Chromium\";v=\"135\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"macOS\"",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "sec-gpc": "1",
        "service-worker-navigation-preload": "true",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36"
    }


    def main():
        try:
            # Browser-like headers alone do not change the TLS fingerprint.
            response = requests.get(URL, headers=headers, timeout=10)
            response.raise_for_status()
            print(response.text)
        except requests.RequestException as e:
            print(f"Error fetching the page: {e}")


    if __name__ == "__main__":
        main()

Result:

    Error fetching the page: HTTPSConnectionPool(host='www.mrporter.com', port=443): Read timed out. (read timeout=10)

By using the Scrapfly TLS Fingerprint tool, we can see the results. However, I couldn't find a database showing that Python requests uses this fingerprint. For sure, it differs from the fingerprint I get using the Brave browser with the same headers and User-Agent: the order of the cipher suites is different, and therefore the fingerprint is different too.

Now let's start the JA3Proxy Docker container without attaching a residential proxy and see what happens.

    docker run -p 8080:8080 \
      -v $(pwd)/cert.pem:/app/cert.pem -v $(pwd)/key.pem:/app/key.pem \
      ghcr.io/lylemi/ja3proxy:latest \
      -client Chrome -version 131 -cert /app/cert.pem -key /app/key.pem

We got the message

    HTTP Proxy Server listen at :8080, with tls fingerprint 131 Chrome

so localhost:8080 can be used as a proxy in our Python script.

Another cause of errors in my setup was that I tried to connect to JA3Proxy with Python Requests. After digging for a while, I found that the issue was that the requests library doesn't support HTTP/2, while JA3Proxy uses it when impersonating a modern version of Chrome. For my tests, I had to use HTTPX instead, as shown in the file request_with_proxies.py.

In this case, if I call the Scrapfly TLS API again, the first part of the JA3 string (the cipher order) is identical to that of my browser.

As a final test, if we use this script to request the MrPorter page, we can download it without any issue.

Chaining a Residential Proxy

Now that we've solved the TLS fingerprint spoofing, we only need to rotate the IP the target website sees. JA3Proxy has an option that helps with this, called upstream. By launching JA3Proxy as follows,

    ./ja3proxy -addr 127.0.0.1 -client Chrome -version 131 -cert cert.pem -key key.pem -upstream socks5h://USER:PASS@PROVIDER:PORT -debug

we can tunnel our requests through our preferred proxy provider. Note that the connection has to go through SOCKS5, so make sure your provider supports it.

After doing this, by checking the IP, I can see that my rotating residential IPs are in place, and I can keep downloading the MrPorter pages without any issue.

    pierluigivinciguerra@Mac 85.AKAMAI-JA3PROXY % python3.10 request_with_proxies.py
    200
    {"ip":"5.49.222.37"}
    pierluigivinciguerra@Mac 85.AKAMAI-JA3PROXY % python3.10 request_with_proxies.py
    200
    {"ip":"197.244.237.29"}
    pierluigivinciguerra@Mac 85.AKAMAI-JA3PROXY % python3.10 request_with_proxies.py
    200
    {"ip":"41.193.144.67"}
    pierluigivinciguerra@Mac 85.AKAMAI-JA3PROXY % python3.10 request_with_proxies.py
    200
    {"ip":"102.217.240.216"}
    pierluigivinciguerra@Mac 85.AKAMAI-JA3PROXY % python3.10 request_with_proxies.py

Conclusion

In this post, we've seen how to bypass Akamai Bot Manager on the MrPorter website. The website's protection level is medium, so there is no complex browser fingerprint challenge to bypass, but in my experience this is the most common case when we encounter Akamai on our path.

I chose the JA3Proxy approach to bypass it so that the solution can be used across various frameworks. If you are using Scrapy, you can always rely on Scrapy Impersonate, despite its limitations, or you can try to set the ciphers in the right order manually.

This article is part of the "The Lab" series by Pierluigi Vinciguerra. Check out his Substack page for more knowledge on Web Scraping.