In my previous article, I demonstrated a Kali Linux container running Tor Browser and connected to its desktop environment with a VNC client. I verified that Tor Browser was connecting to the Tor network during a browsing session. This setup will allow me to simulate the kind of traffic that might come from an attacker targeting a website.
In this experiment I am going to use Selenium to automate Tor Browser, synthesizing keystrokes and navigation events through the browser’s WebDriver interface. Each crawler will have a random IP address provided by the embedded Tor proxy to evade detection. After saving the results as JSON objects in the local filesystem, I will use Python to process them into a single CSV file. Finally, I will discuss what countermeasures can be applied in a data center and on the client side to attempt to detect, rate limit, and block bot activity.
All the files and applicable licenses are available in this open source repository: tor-driver-python
I have a background in test automation, and have spent many hours designing tests. I also spent a lot of time working with Selenium, and I have used it in many different programming languages and settings to automate web browsers for testing purposes. There are scenarios where it is only possible to test a web application with a real browser, and Selenium is a great tool for that.
In my work as a DevOps Engineer I have spent no small amount of time fretting about what to do with the web crawlers that are hitting, and sometimes outright attacking, the web applications I am responsible for. I thought it would be an interesting experiment to explore the other side of this issue for once.
I want to see how close I can get to simulating an attack from a botnet for educational purposes, and to discuss methods of countering things like suspicious Tor network traffic in a modern data center. Botnets are commonly used to perform credential stuffing attacks. I will be using a similar technique to run search queries and gather information from the web.
Credential stuffing is the automated injection of stolen username and password pairs (“credentials”) into website login forms, in order to fraudulently gain access to user accounts. 1
In order to avoid ethical issues, while also trying to stay true to the task, I am making the following changes to the scenario:
robots.txt files and Terms and Conditions were checked at the time of writing to confirm they were not exclusionary of crawling. For example, IMDB's Terms and Conditions explicitly forbid crawling without written consent.
The Robots Exclusion Protocol is a way for Webmasters to tell crawlers where they are and aren't allowed to gather information from. More information and examples can be found on the robotstxt.org website. I found an article: List of alternative search engines while trying to find one that allowed web scraping on the search results page. Below is a summary of that research.
| Search Engine | robots.txt URL | Is crawling permitted? |
| --- | --- | --- |
|  |  | No, but has an API |
|  |  | No, but has an API |
|  |  | No |
|  |  | No, but has an API |
|  |  | Yes, but not exactly what I was looking for |
|  |  | Yes |
Some other resources that I found useful while researching this topic:
I am going to avoid using libraries other than Selenium for this example. There are some really basic patterns that I want to demonstrate, and I don't want to get bogged down with a particular Domain Specific Language (DSL) that could make it harder to understand what is going on.
However, I think using a test running framework is a great way of organizing this kind of code. Adding a framework can solve a lot of problems around general code structure, retry logic, and even reporting.
There is a basic pattern to how I go about manipulating a page in a WebDriver session. I also add a pause after every action performed. Browser automation can be flaky. The timeouts add a lot of stability to the crawl, and greatly limit the chances of getting rate limited and blocked. Wherever necessary, I also augment the crawl with API calls to other Search Engines or sources of information.
I took a really simple approach to the selectors. I am using both the XPath and CSS selectors that are available in the browser, focusing mainly on anchor tags and URL fragments to navigate between pages during a crawl.
I am using expected conditions to wait for the elements to be present before trying to click on them. The Selenium project has a lot of documentation, but I also found the discussion about wait conditions with example usages on Stack Overflow to be an invaluable resource.
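Put together, a minimal sketch of that pattern might look like the following. The selector, timeout, and pause values are illustrative rather than taken from the repository:

```python
from time import sleep

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def click_when_visible(driver, css_selector, timeout=20, pause=2):
    """Wait for an element to be visible, click it, then pause to keep the crawl stable."""
    element = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, css_selector))
    )
    element.click()
    sleep(pause)  # deliberate pause after every action to limit the chance of rate limiting


# Example usage: follow a result link on a search page (the selector is hypothetical)
# click_when_visible(driver, "a.result__a")
```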
There is an existing PyPi project called tbselenium that has a similar function. For this experiment, I referenced the Firefox profile setup, but didn't need any of the other features tbselenium includes. The added complexity of the containers not having root access also made debugging more difficult. This added to the motivation for limiting dependencies and trying simple pre-existing solutions. For example, there are many places where I am using Linux tools and sub-shells instead of implementing pure Python solutions directly.
The finished class is approximately 150 lines of Python. I think it will be easier to analyze what is happening in depth with less to review. I learned a lot about how the Tor Browser Launcher works, and how to configure Firefox profiles. The profile has been assembled from multiple sources online, which are mentioned in the source code as well as in this document.
I have abstracted the startup, tear down, and one very common piece of navigation logic into a class called TorDriver. It is a very simple class that sets up a Firefox profile with the Tor Browser Launcher. It has a method for checking if an element is visible on the page, and another that verifies the proxy socket is up and running. The Firefox profile setup and debugging was largely informed by a Stack Overflow discussion: Open Tor Browser with Selenium.
The completed file can be found here: tor-driver-python/torDriver.py
Importing selenium, pprint, subprocess, and socket for the setup and WebDriver components:
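Roughly, something along these lines; the exact import list in torDriver.py may differ slightly:

```python
import socket
import subprocess
from pprint import pprint

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
```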
The following method abstracts checking for an element, and returns True or False depending on whether it is visible within a timeout.
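One way that check could look, assuming the locator is a (By, selector) tuple:

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def is_element_visible(driver, locator, timeout=10):
    """Return True if the element becomes visible within the timeout, otherwise False."""
    try:
        WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located(locator)
        )
        return True
    except TimeoutException:
        return False
```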
The proxy port needs to be active before sending signals to it. Following some examples on Stack Overflow about Testing Socket Connections in Python, I came up with this:
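A minimal sketch of that check; the default host and port are assumptions (Tor Browser's bundled proxy commonly listens on 9150, while a standalone tor daemon uses 9050):

```python
import socket


def is_port_open(host="127.0.0.1", port=9150, timeout=5):
    """Return True if a TCP connection to the proxy port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```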
The bulk of the module is a class that controls the Firefox profile, downloads geckodriver, and initiates torbrowser-launcher.
Here I have a basic configuration and some ways of overriding things, while mostly keeping this as simple as possible; a combined sketch of the setup is shown after the next two paragraphs.
At the very minimum, the Firefox profile needs to be configured to connect to the proxy port; I also disabled JavaScript with it.
This uses the profile and the binary from TorDriver to initialize a driver.
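Pieced together, a minimal sketch of that setup might look like the code below. This is not the exact contents of torDriver.py: the binary path, SOCKS port, method names, and the Selenium 3-style keyword arguments are all assumptions, and launching torbrowser-launcher itself is omitted.

```python
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.firefox.options import Options


class TorDriver:
    """Sketch of a class wrapping Tor Browser setup for Selenium (not the repo's exact code)."""

    def __init__(self,
                 socks_port=9150,                                     # assumed SOCKS proxy port
                 tor_binary="/path/to/tor-browser/Browser/firefox",   # hypothetical binary path
                 geckodriver="./geckodriver"):                        # hypothetical driver path
        self.socks_port = socks_port
        self.tor_binary = tor_binary
        self.geckodriver = geckodriver
        self.profile = self.setup_profile()

    def setup_profile(self):
        """Point Firefox at the local SOCKS proxy and disable JavaScript."""
        profile = FirefoxProfile()
        profile.set_preference("network.proxy.type", 1)
        profile.set_preference("network.proxy.socks", "127.0.0.1")
        profile.set_preference("network.proxy.socks_port", self.socks_port)
        profile.set_preference("network.proxy.socks_remote_dns", True)
        profile.set_preference("javascript.enabled", False)
        return profile

    def setup_driver(self):
        """Initialize a Firefox WebDriver session with the profile and Tor Browser binary."""
        options = Options()
        options.binary_location = self.tor_binary
        return webdriver.Firefox(firefox_profile=self.profile,
                                 options=options,
                                 executable_path=self.geckodriver)
```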
Adding a method for downloading and extracting geckodriver in a subprocess. It is worth mentioning that, somehow, when running it in the container the tar.gz is no longer compressed and simply needed to be unarchived. More information about the error is available here: stdin: not in gzip format error
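A sketch of that step using shell tools from a subprocess, assuming wget and tar are available in the container; the download URL and archive name are placeholders, and plain `tar -xf` is used because it handles both the compressed and already-decompressed cases:

```python
import subprocess


def setup_geckodriver(url, archive="geckodriver.tar.gz"):
    """Download geckodriver and unpack it next to the script."""
    subprocess.run(["wget", "-q", "-O", archive, url], check=True)
    subprocess.run(["tar", "-xf", archive], check=True)   # -xf auto-detects compression
    subprocess.run(["chmod", "+x", "geckodriver"], check=True)
```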
Until the socket responds, retry a connection to the proxy port:
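A simple retry loop, reusing the socket check sketched earlier; the retry count and delay are arbitrary:

```python
from time import sleep


def wait_for_proxy(host="127.0.0.1", port=9150, retries=30, delay=1):
    """Poll the SOCKS port until it accepts connections, or give up after the retries."""
    for _ in range(retries):
        if is_port_open(host, port):  # reuses the socket check shown above
            return True
        sleep(delay)
    return False
```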
In this example I took a two-phase approach: the first phase is information gathering, and the second phase is for processing the information. This way I am not tied to network connectivity for the entire process, and can retry parsing the results as many times as necessary without going back to the source material.
The complete file can be found here: tor-driver-python/crawler.py
The crawler reads in a text file and uses that information to populate queries in the WebDriver session. The state of the crawl is kept in a folder of JSON files, one per query. I attempt to do only the minimum processing necessary to export the information once, so any subsequent processing can happen on the existing data instead of going back to the site(s).
I am using a text file to store searches in. I chose a text file because it is very easy to restructure. Editing text is a low barrier to start a crawl with new information, or to resume one that failed part way through. If this crawler had more complex data requirements, I would consider using a database instead; that would allow for implementing an API for controlling scans, a custom user interface, and reporting.
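A simplified sketch of that flow, assuming one query per line in a hypothetical searches.txt and a hypothetical search() helper that drives the WebDriver session; the filenames and JSON shape are illustrative, not the repository's exact format:

```python
import json
from pathlib import Path


def run_crawl(driver, query_file="searches.txt", results_dir="results"):
    """Read queries from a text file and save one JSON result file per query,
    skipping queries that already have results so a failed crawl can resume."""
    Path(results_dir).mkdir(exist_ok=True)
    queries = Path(query_file).read_text().splitlines()
    for query in filter(None, queries):
        out_file = Path(results_dir) / f"{query.replace(' ', '_')}.json"
        if out_file.exists():
            continue  # already crawled; lets a partial run pick up where it stopped
        links = search(driver, query)  # hypothetical helper that performs the search
        out_file.write_text(json.dumps({"query": query, "links": links}, indent=2))
```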
Example files are already in the results folder in the repo: tor-driver-python/results
For a more robust crawler I would suggest using an actual database technology, but this is enough to easily tell where data collection stopped and to facilitate restarting.
The crawler can be run from the container with the following commands. The report generator requires the JSON result files to be present; an example exported CSV file is linked further below.
Start the container:
docker run -it --rm -p 5901:5901 -v "${HOME}/src":/src excitingtheory/kalilinux-xvfb:torbrowser
Start a VNC Server in the container, it will prompt for session passwords:
/opt/start-vnc-server-once.sh
Begin the crawl from inside the VNC session:
python3 crawler.py
The crawler will wait for the initialization of Tor Browser; unfortunately, connecting is a manual step. Simply click the checkbox, and click connect. See the video demo for an example.
The report script will generate a Comma Separated Value (CSV) file from the JavaScript Object Notation (JSON) result files the crawler saves throughout the crawl. I chose the CSV format because it is a more common format for sharing with colleagues, but still easy to import into other tools for further analysis.
The complete file can be found here: tor-driver-python/report.py
This uses built-in Python libraries to read JSON, write CSV, and parse URLs for formatting and data presentation. It then loops through the results and loads them to begin data processing.
This is the core functionality of the report generator. It does a final presentation and ordering of the data captured in the results objects. Typically URLs are only useful to a crawler's functional movement through a site, and not as a final data capture, but it is a good start for customizing further data extraction.
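A sketch of that step using only the standard library; the field names follow the hypothetical crawler output shown earlier, and the exact columns in report.py may differ:

```python
import csv
import json
from pathlib import Path
from urllib.parse import urlparse


def generate_report(results_dir="results", out_file="output.csv"):
    """Load every JSON result file and flatten it into a single CSV, sorted by query."""
    results = [json.loads(p.read_text()) for p in sorted(Path(results_dir).glob("*.json"))]
    with open(out_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["query", "hostname", "url"])
        for result in sorted(results, key=lambda r: r["query"]):
            for link in result.get("links", []):
                writer.writerow([result["query"], urlparse(link).hostname, link])
```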
The results of the crawl are saved in the ./results directory as JSON files. I will use the following script to generate a report from the data.
python3 report.py
An example output CSV file can be found here: tor-driver-python/output.csv
There are a few different ways to detect and mitigate bot activity. I will focus primarily on the data center side of things, but I will also discuss some client-side detection methods. The client can never really be trusted though, as client-side signals can change at any time and can be spoofed. I think it is important to keep this in mind when designing a detection system. In the data center there are two forms of protection I will discuss: rate limiting and reputation blocking.
There are a few ways to detect an active WebDriver session on the client side with just JavaScript; a somewhat related issue on GitHub goes into more detail. Essentially, because the WebDriver protocol changes the document and window objects, it can be detected in client-side code.
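The simplest signal is the navigator.webdriver flag, which the WebDriver specification requires automated browsers to expose. A tiny sketch, driven from Python for consistency with the rest of this article, shows that the flag is visible to page JavaScript:

```python
from selenium import webdriver

# In an automated session the WebDriver spec requires navigator.webdriver to be true,
# so client-side JavaScript on the target site can read the same flag.
driver = webdriver.Firefox()
print(driver.execute_script("return navigator.webdriver"))  # typically prints True
driver.quit()
```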
I am going to focus on the solutions that I have the most experience with: Fastly, AWS WAF, and Nginx. Cloudflare was a total surprise, so I am going to talk about their offering as well.
AWS Web Application Firewall (WAF) Rate Based Rules can also be used to block Denial of Service levels of activity, and there are default rules that can be used to detect Tor network traffic as well; see the IP Reputation Rule Documentation for more information. Another common approach is to block all traffic from other data centers, which is safe if the target audience is consumers. However, businesses could be using cloud VPNs and other technology that can make this harmful to legitimate traffic.
Fastly's Signal Sciences, a very popular solution, can be used to detect Tor traffic specifically. First, it can protect from DDoS attacks; see their DDoS Mitigation page for more information. Second, it can detect Tor traffic and block it. Here is the Using System Signals documentation that covers this.
For Nginx there are some articles about doing this as well: How to block anonymous traffic with Nginx or inside your web application. Essentially, by calling out to APIs to get information about Tor exit nodes, IP blocking rules can be generated and applied to Nginx on a schedule.
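As a rough illustration of that approach, the sketch below pulls a published list of exit node IPs and writes an Nginx deny list that can be included from a server block. The list URL, output path, and reload step are assumptions rather than anything taken from those articles:

```python
from urllib.request import urlopen

# Assumed public list of Tor exit node IPs published by the Tor Project.
EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"


def write_tor_denylist(out_path="/etc/nginx/conf.d/tor-deny.conf"):
    """Fetch current Tor exit node IPs and render them as Nginx deny directives.
    Intended to run on a schedule (cron or a systemd timer), followed by `nginx -s reload`."""
    ips = urlopen(EXIT_LIST_URL, timeout=30).read().decode().split()
    with open(out_path, "w") as handle:
        for ip in ips:
            handle.write(f"deny {ip};\n")


if __name__ == "__main__":
    write_tor_denylist()
```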
In a surprising contrast to the cloud providers above, Cloudflare offers support for Tor clients. I came across their Tor Support Documentation, where they discuss the capability of serving content to Tor users from the network. I think this is a really interesting approach, and I am eager to explore it further in the future.
WebDriver is a powerful tool for testing, and it can also be used for gathering information in places where access to an API isn’t feasible: for example, when access is otherwise restricted, censored, too expensive, or generally locked behind anti-competitive practices. Better still is combining data gathered from web crawling with information gathered from APIs.
This is an important exercise because it is becoming harder and harder to prevent malicious traffic from bots, and it isn’t a good security practice to wait until an attack occurs to consider how to mitigate it. I believe everyone who is responsible for putting information online should know how breached information is going to be used against the systems they are responsible for. In a simplified scenario, with ethical constraints, I demonstrated this by doing the following: