Use Kali Linux Containers, WebDriver, and Tor for Anonymous Web Scraping

Written by csfx | Published 2023/08/05

TL;DR: Simulate attack payloads using a Kali Linux container as an ephemeral host operating system for anonymous web scraping and penetration testing. This article includes example code for using Selenium to automate Tor Browser. Finally, I discuss countermeasures that can be applied both in a data center and on the client side to attempt to detect, rate limit, and block bot activity.

In my previous article, I demonstrated a Kali Linux container running Tor Browser and connected to its desktop environment with a VNC client, and I verified that Tor Browser was connecting to the Tor Network during a browsing session. This setup will allow me to simulate the kind of traffic that might come from an attacker targeting a website.

In this experiment I am going to use Selenium to automate Tor Browser, synthesizing keystrokes and navigation events through the browser's WebDriver interface. Each crawler will have a random IP address provided by the embedded Tor proxy to evade detection. After saving the results as JSON objects in the local filesystem, I will use Python to process them into a single CSV file. Finally, I will discuss what countermeasures can be applied in a data center and on the client side to attempt to detect, rate limit, and block bot activity.

All the files and applicable licenses are available in this open source repository: tor-driver-python

Screen Capture Running the Completed Crawler Script:

https://youtu.be/5YN1DI6wnnw

I have a background in test automation, and have spent many hours designing tests. I also spent a lot of time working with Selenium, and I have used it in many different programming languages and settings to automate web browsers for testing purposes. There are scenarios where it is only possible to test a web application with a real browser, and Selenium is a great tool for that.

The Scenario

In my work as a DevOps Engineer I have spent no small amount of time fretting about what to do with the web crawlers that are hitting, and sometimes outright attacking, the web applications I am responsible for. I thought it would be an interesting experiment to explore the other side of this issue for once.

I wanted to see how close I could get to simulating an attack from a botnet for educational purposes, and to discuss methods of countering things like suspicious Tor network traffic in a modern data center. Botnets are commonly used to perform credential stuffing attacks; I will be using a similar technique to perform search queries and gather information from the web.

Credential stuffing is the automated injection of stolen username and password pairs (“credentials”) into website login forms, in order to fraudulently gain access to user accounts. 1

Ethical Considerations

In order to avoid ethical issues, while also trying to stay true to the task, I am making the following changes to the scenario:

  1. I will not be collecting Personally Identifiable Information (PII), and no credentials will be used. I will instead be reading search terms from a file and gathering public information.
  2. I will be using the Tor Network to simulate the random IP address hopping behavior of a botnet.
  3. I am intentionally targeting sites with permissive robots.txt files, whose Terms and Conditions, checked at the time of writing, do not exclude crawling. By contrast, IMDB's Terms and Conditions explicitly forbid crawling without written consent, so it is not a target.
  4. The information gathered will be limited to publicly available information only. I will not be using any exploits or other methods to gain access to information that is not publicly available on the target site.
  5. Finally, I will limit the rate at which requests are made to the target site to avoid being blocked or degrading throughput for other users.

robots.txt

The Robots Exclusion Protocol is a way for webmasters to tell crawlers where they are and aren't allowed to gather information. More information and examples can be found on the robotstxt.org website. While trying to find a search engine that allowed web scraping of its results page, I found an article: List of alternative search engines. Below is a summary of that research.

Search Engine | robots.txt URL | Is crawling permitted?
Google | robots.txt | No, but has an API
Bing | robots.txt | No, but has an API
Yahoo | robots.txt | No
DuckDuckGo | robots.txt | No, but has an API
Open Verse | robots.txt | Yes, but not exactly what I was looking for
SearX | robots.txt | Yes

Additional Resources

Some other resources that I found useful while researching this topic:

Manipulating a Web Page with Selenium and WebDriver

I am going to avoid using libraries other than selenium for this example. There are some really basic patterns that I want to demonstrate and I don't want to get bogged down with a particular Domain Specific Language (DSL) that could make it harder to understand what is going on.

However, I think using a test-runner framework is a great way of organizing this kind of code. Adding a framework can solve a lot of problems around general code structure, retry logic, and even reporting.

The Basic Pattern

There is a basic pattern to how I go about manipulating a page in a WebDriver session, sketched in code after the list below. I also add a pause after every action performed. Browser automation can be flaky; the pauses and timeouts add a lot of stability to the crawl and greatly limit the chances of getting rate limited and blocked. Wherever necessary, I also augment the crawl with API calls to other search engines or sources of information.

  1. Navigate to a URL.
  2. Wait for the page to finish loading.
  3. Find the element I want to interact with, and develop a selector for it.
  4. Wait for the element to be present.
  5. Scroll the element into view.
  6. Interact with the element.
  7. Repeat
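
The following is a minimal sketch of that loop using plain Selenium calls. The URL, selector, and pause length are placeholders for illustration, not values from the repository.

import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def visit_and_click(driver, url, css_selector, pause=5):
    # 1-2. Navigate and wait for the document to finish loading.
    driver.get(url)
    wait = WebDriverWait(driver, 30)
    wait.until(lambda d: d.execute_script("return document.readyState") == "complete")
    # 3-4. Find the element with a selector and wait for it to be present.
    element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css_selector)))
    # 5-6. Scroll it into view, then interact with it.
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    element.click()
    # Pause after the action; the delays keep the crawl stable and polite.
    time.sleep(pause)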

Overview of the Code

I took a really simple approach to the selectors, using both the XPath and CSS selectors available in the browser, and focusing mainly on anchor tags and URL fragments to navigate between pages during a crawl.

I am using expected conditions to wait for the elements to be present before trying to click on them. The Selenium project has a lot of documentation, but I also found the discussion about wait conditions with example usages on Stack Overflow to be an invaluable resource.
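
As a small illustration of that wait-then-interact style, the snippet below waits for an anchor to become clickable before following it; the XPath is an invented placeholder rather than a selector from the crawler.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def click_next_page(driver, timeout=20):
    # Wait for a navigation link to be clickable before following it.
    # The XPath is a placeholder for illustration only.
    link = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, "//a[contains(@href, 'page=2')]"))
    )
    link.click()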

TorDriver Module

There is an existing PyPI project called tbselenium that has a similar function. For this experiment, I referenced its Firefox profile setup but didn't need any of the other features tbselenium includes. The additional complexity of the container not having root access contributed to making debugging more difficult. That added to the motivation for limiting dependencies and trying simple pre-existing solutions; for example, there are many places where I use Linux tools and sub-shells instead of implementing pure Python solutions directly.

The finished class is approximately 150 lines of Python, which I think will make it easier to analyze what is happening in depth. I learned a lot about how the Tor Browser Launcher works and how to configure Firefox profiles. The profile has been assembled from multiple sources online, and they are credited in the source code as well as in this document.

I have abstracted the startup, teardown, and one very common piece of navigation logic into a class called TorDriver. It is a very simple class that sets up a Firefox profile with the Tor Browser Launcher. It has a method for checking if an element is visible on the page, and another that verifies the proxy socket is up and running. The Firefox profile setup and debugging were largely informed by a Stack Overflow discussion: Open Tor Browser with Selenium.

The completed file can be found here: tor-driver-python/torDriver.py

Imports

Importing selenium, pprint, subprocess, and socket for the setup and WebDriver components.
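
A plausible import block matching that description might look like the following; it is an assumption based on the text, not a verbatim copy of torDriver.py.

import socket
import subprocess
from pprint import pprint

from selenium import webdriver
from selenium.webdriver import FirefoxProfile
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException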

Check if an Element is Visible

The following method abstracts checking for an element and returns True or False depending on whether it becomes visible within a timeout.
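
A minimal sketch of such a helper, assuming it takes a driver, a locator tuple, and a timeout; the signature is an assumption, not necessarily the one used in torDriver.py.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def is_visible(driver, locator, timeout=10):
    # Return True if the element becomes visible within the timeout, else False.
    try:
        WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located(locator)
        )
        return True
    except TimeoutException:
        return False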

Checking if Tor Proxy Started

The proxy port needs to be active before sending signals to it. Following some Stack Overflow examples about testing socket connections in Python, I came up with this:
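
Here is a sketch along those lines, assuming the SOCKS port Tor Browser usually listens on (9150); the exact host and port are configuration details, not confirmed from the repository.

import socket

def is_proxy_up(host="127.0.0.1", port=9150, timeout=3):
    # Attempt a TCP connection to the proxy port; True means something is listening.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return True
    except OSError:
        return False
    finally:
        sock.close()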

The Class

The bulk of the module is a class that controls the Firefox profile, downloads geckodriver, and initiates torbrowser-launcher.

TorDriver Definition and Initialization

Here I have a basic configuration and some ways of overriding things, but mostly keeping this as simple as possible:
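
A skeleton of how such a class might be declared; the attribute names, defaults, and paths here are assumptions for illustration rather than the values used in torDriver.py.

import os

class TorDriver:
    def __init__(self,
                 socks_port=9150,
                 tbb_path="~/.local/share/torbrowser/tbb/x86_64/tor-browser",
                 page_load_timeout=60):
        # Sensible defaults that can be overridden; configuration is kept minimal.
        self.socks_port = socks_port
        self.tbb_path = os.path.expanduser(tbb_path)
        self.page_load_timeout = page_load_timeout
        # The profile and driver setup shown in the following sections would be
        # wired in here, e.g. self.profile = ..., self.driver = ...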

Firefox Profile Setup

At a minimum, the Firefox profile needs to be configured to connect to the SOCKS proxy port; I also disabled JavaScript with it.
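
A minimal sketch of that profile configuration, using standard Firefox preference names; the port value is an assumption, and JavaScript is disabled as described above.

from selenium.webdriver import FirefoxProfile

def setup_profile(socks_port=9150):
    # Route all traffic through the local SOCKS proxy, resolve DNS over it,
    # and disable JavaScript for the session.
    profile = FirefoxProfile()
    profile.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    profile.set_preference("network.proxy.socks", "127.0.0.1")
    profile.set_preference("network.proxy.socks_port", socks_port)
    profile.set_preference("network.proxy.socks_remote_dns", True)
    profile.set_preference("javascript.enabled", False)
    return profile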

Setup WebDriver

This uses the profile and the binary from TorDriver to initialize a driver:
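
Roughly, the driver setup points Selenium at the Tor Browser's bundled Firefox binary and the downloaded geckodriver; the paths below are placeholders, not the repository's values.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

def setup_driver(profile, firefox_binary, geckodriver_path="./geckodriver"):
    # Attach the prepared profile and the Tor Browser Firefox binary,
    # then start Firefox through the local geckodriver.
    options = Options()
    options.profile = profile
    options.binary_location = firefox_binary
    service = Service(executable_path=geckodriver_path)
    return webdriver.Firefox(service=service, options=options)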

Download Geckodriver

This method downloads and extracts geckodriver in a subprocess. It is worth mentioning that, when run in the container, the tar.gz somehow arrives without gzip compression and simply needs to be unarchived. More information about the error is available here: stdin: not in gzip format error
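
A sketch of that download-and-extract step using shell tools in a subprocess; the release version and URL layout are assumptions, so adjust them for the architecture in use.

import subprocess

def download_geckodriver(version="v0.33.0", dest="."):
    url = ("https://github.com/mozilla/geckodriver/releases/download/"
           f"{version}/geckodriver-{version}-linux64.tar.gz")
    subprocess.run(["wget", "-q", url, "-O", f"{dest}/geckodriver.tar.gz"], check=True)
    # Inside the container the archive arrives without gzip compression, so plain
    # `tar -xf` (which auto-detects the format) is used instead of forcing `-z`.
    subprocess.run(["tar", "-xf", f"{dest}/geckodriver.tar.gz", "-C", dest], check=True)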

Setup and Wait for Tor Network

Until the socket responds, retry a connection to the proxy port:
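
A sketch of that retry loop, reusing the is_proxy_up() helper from the proxy-check section above; the retry count and delay are arbitrary values.

import time

def wait_for_tor(host="127.0.0.1", port=9150, retries=30, delay=2):
    # Keep retrying the socket check until the proxy answers or we give up.
    for _ in range(retries):
        if is_proxy_up(host, port):
            return True
        time.sleep(delay)
    return False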

crawler.py

In this example I took a two-phase approach. The first phase is information gathering, and the second is processing that information. This way I am not tied to network connectivity for the entire process, and I can retry parsing the results as many times as necessary without going back to the source material.

The complete file can be found here: tor-driver-python/crawler.py

Module Imports

Main Crawler Functionality

The crawler reads a text file and uses that information to populate queries in the WebDriver session. The state of the crawl is kept in a folder of JSON files, one per query. I do the minimum processing necessary to export the information once; any subsequent processing can happen on the existing data instead of going back to the site(s).
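
The sketch below captures that flow: read the search terms, run one query per term, and save one JSON file per query. The search URL, locator, result structure, and delay are placeholders rather than the repository's exact code.

import json
import time
from pathlib import Path
from urllib.parse import quote_plus

from selenium.webdriver.common.by import By

def crawl(driver, searches_file="searches.txt", results_dir="results"):
    Path(results_dir).mkdir(exist_ok=True)
    terms = [line.strip() for line in open(searches_file) if line.strip()]
    for term in terms:
        out_file = Path(results_dir) / f"{term.replace(' ', '_')}.json"
        if out_file.exists():
            continue  # already gathered; lets an interrupted crawl resume where it stopped
        driver.get("https://searx.be/search?q=" + quote_plus(term))  # placeholder SearX instance
        time.sleep(10)  # rate limiting between queries
        links = [a.get_attribute("href")
                 for a in driver.find_elements(By.CSS_SELECTOR, "a")]
        out_file.write_text(json.dumps({"query": term, "links": links}, indent=2))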

searches.txt

I am using a text file to store searches in. I chose a text file because it is very easy to restructure: editing text is a low barrier to starting a crawl with new information or resuming one that failed partway through. If this crawler had more complex data requirements, I would consider using a database instead, which would allow for an API for controlling scans, a custom user interface, and reporting.
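
For illustration, such a file is simply one query per line; the terms below are placeholders, not the ones used in the experiment.

open source intelligence
tor network performance
selenium browser automation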

JSON Results Format

Example files are already in the results folder in the repo: tor-driver-python/results

In a more robust crawler I would suggest using an actual database, but this format is enough to easily tell where data collection stopped and to facilitate restarting.
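
The exact fields are best checked in the repository's results folder; as a rough illustration consistent with the crawler sketch above, each file might hold the query and the links captured for it:

{
  "query": "tor network performance",
  "links": [
    "https://example.com/article-1",
    "https://example.com/article-2"
  ]
}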

Running the Crawler from a Container

The crawler can be run from the container with the following commands. The report generator requires the JSON result files to be present; an example CSV export is linked in the Example Report section below.

Start the container:

docker run -it --rm -p 5901:5901 -v "${HOME}/src":/src excitingtheory/kalilinux-xvfb:torbrowser

Start a VNC server in the container; it will prompt for session passwords:

/opt/start-vnc-server-once.sh

Begin the crawl from inside the VNC session:

python3 crawler.py

The crawler will wait for Tor Browser to initialize, which is unfortunately a manual step: simply click the checkbox, and then click Connect. See the video demo for an example.

report.py

The report script will generate a Comma Separated Value (CSV) file from the JavaScript Object Notation (JSON) result files the crawler saves throughout the crawl. I chose the CSV format because it is a more common format for sharing with colleagues, but still easy to import into other tools for further analysis.

The complete file can be found here: tor-driver-python/report.py

Module Imports

This uses built-in Python libraries to read JSON, write CSV, and parse URLs for formatting and data presentation. It then loops through the result files and loads them to begin data processing.
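
A plausible import block matching that description (an assumption, not copied from report.py):

import csv
import json
from pathlib import Path
from urllib.parse import urlparse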

Main Report Generator

This is the core functionality of the report generator: a final presentation and ordering of the data captured in the results objects. Typically, URLs are only useful for a crawler's functional movement through a site rather than as a final data capture, but they are a good starting point for customizing further data extraction.
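
Here is a sketch of that step, reusing the imports above: load each JSON result, break the captured URLs into host and path, and write one CSV row per link. The field names follow the crawler sketch earlier and are assumptions, not the repository's schema.

def generate_report(results_dir="results", output_file="output.csv"):
    rows = []
    for result_file in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(result_file.read_text())
        for link in data.get("links", []):
            parsed = urlparse(link or "")
            rows.append({"query": data.get("query"),
                         "host": parsed.netloc,
                         "path": parsed.path,
                         "url": link})
    # Write the flattened rows out as a CSV report.
    with open(output_file, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["query", "host", "path", "url"])
        writer.writeheader()
        writer.writerows(rows)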

Run the Report

The results of the crawl are saved in the ./results directory as JSON files. I will use the following script to generate a report from the data.

python3 report.py

Example Report

An example output CSV file can be found here: tor-driver-python/output.csv

Detecting and Mitigating Attacks

There are a few different ways to detect and mitigate bot activity. I will focus primarily on the data center side of things, but I will also discuss some client-side detection methods. The client can never really be trusted, though, as client-side signals can change at any time and can be spoofed. I think it is important to keep this in mind when designing a detection system. In the data center there are two forms of protection I will discuss: rate limiting and reputation blocking.

Client Side Detection

There are a few ways to detect an active WebDriver session on the client side with just JavaScript; a somewhat related GitHub issue goes into more detail. Essentially, because the WebDriver protocol changes the document and window objects, it can be detected in client-side code; for example, the navigator.webdriver property reads as true during an automated session.

Detecting and Blocking Bot Traffic in a Data Center

I am going to focus on the solutions that I have the most experience with: Fastly, AWS WAF, and Nginx. CloudFlare was a total surprise, so I am going to talk about their offering as well.

AWS Web Application Firewall (WAF) Rate Based Rules can also be used to block Denial of Service levels of activity, and there are default rules that can detect Tor network traffic as well; see the IP Reputation Rule Documentation for more information. Another common approach is to block all traffic from other data centers, which is safe if the target audience is consumers. However, businesses could be using cloud VPNs and other technology that would make this harmful to legitimate traffic.

Fastly's Signal Sciences, a very popular solution, can be used to detect Tor traffic specifically. First, it can protect from DDoS attacks; see their DDoS Mitigation page for more information. Second, it can detect Tor traffic and block it. Here is the Using System Signals documentation that covers this.

For Nginx there are some articles about doing this as well: How to block anonymous traffic with Nginx or inside your web application. Essentially, by calling out to APIs to get information about Tor exit nodes, IP blocking rules can be generated and applied to Nginx on a schedule.

In a surprising contrast to the above cloud providers, CloudFlare offers support for Tor clients. I came across their Tor Support Documentation, where they discuss the capability of serving content to Tor users from within the network. I think this is a really interesting approach, and I am eager to explore it further in the future.

Conclusion

WebDriver is a powerful tool for testing and can also be used for gathering information in places where access to an API isn’t feasible. For example: access is otherwise restricted, censored, too expensive, or generally locked behind anti-competitive practices. Better still is combining data gathered from web crawling with information gathered from APIs.

This is an important exercise because it is becoming harder and harder to prevent malicious traffic from Bots, and it isn’t a good security practice to wait until an attack occurs to consider how to mitigate it. I believe everyone who is responsible for putting information online should know how breached information is going to be used against the systems they are responsible for. In a simplified scenario, with ethical constraints, I demonstrated this by doing the following:

  1. Used Selenium to automate a Tor Browser session with a random IP address.
  2. Gathered information from a website using Python and the WebDriver protocol.
  3. Saved the information to a set of JSON files.
  4. Used a Python script to generate a CSV report from the results offline.
  5. Discussed several ways of detecting and mitigating Denial of Service and Tor network activity in a data center.


Written by csfx | DevOps Engineer with a background in Public Cloud, Containers, Security, and Automation. Serverless all the things!