Use Kali Linux Docker Containers to Support Covert Web Scraping

Written by csfx | Published 2023/07/23
Tech Story Tags: cyber-security | devops | web-scraping | tor | kali-linux | github-actions | docker | hackernoon-top-story


Unlock the power of Kali Linux on Docker networks by adding a GUI back to the container and level up information gathering with anonymized web scraping powered by Python, Selenium, and Tor Browser.

As a DevOps Engineer, I have both shipped, and enabled other Engineers to ship, many hundreds of thousands of containers. I use Continuous Delivery and a variety of orchestration methods: Kubernetes, Lambda Containers on AWS, and Elastic Container Service (ECS) backed by both Elastic Compute Cloud (EC2) and Fargate. Containers are useful for custom build systems, automated deployments, vulnerability scanning, research purposes, data analytics, machine learning, legacy software archival, local development, and web scraping.

In this guide I will go over the steps I went through to add a GUI to the official Kali Linux container from Docker Hub using Xvfb and a VNC server. The code I discuss here is published in the kalilinux-xvfb open source repository, and is continuously delivered to Docker Hub for general use. Finally, in Part 2 of this series, everything in the containers will be put to the test. I will use a custom Python script, Pytest, and Selenium to automate the Tor Browser instance and gather data from the web.

Container Networks

Part of the motivation behind having Kali Linux in a container is to gain visibility into the private network space created by the host machine for containers. The default setup from Docker Desktop is to use a NAT network. Unless a container port is bound to the host machine with docker run -p, the service listening on that port will not be accessible from the host.
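One quick way to see this from the host is to attempt a TCP connection to the port in question. The following is a minimal sketch, not part of the published repository; the function name port_is_reachable is hypothetical, and port 5901 is simply the VNC port published later in this guide.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 5901 is assumed to be published with `-p 5901:5901`; without that flag
    # this is expected to report False when run from the host.
    print("VNC port reachable:", port_is_reachable("127.0.0.1", 5901))
```

Running this on the host before and after adding -p to docker run should show the difference between an unpublished and a published port.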

The Software

The sections below summarize each of the following software packages, with links:

  • Kali Linux

  • Docker Desktop

  • Tor Browser

  • Tor Browser Launcher

  • Xvfb

  • Selenium

Additional links for software I used:

Kali Linux

Kali Linux is a Debian-based Linux distribution aimed at advanced Penetration Testing and Security Auditing. 1

I chose Kali Linux for the sheer number of security-focused tools that are available. The Kali Linux team provides a Docker image that is updated regularly. Much of the required software is available as packages for the Kali Linux Docker image, and I will be using the package manager to install it.

Docker Desktop

Docker Desktop is a framework for running containers locally. Containers are a way of packaging operating system and application state in a reusable operating system image. Additional resources are available on the Docker website.

Containers and virtual machines have similar resource isolation and allocation benefits, but function differently because containers virtualize the operating system instead of hardware. 2

Since Docker containers are more lightweight than virtual machines, I favor them heavily in my development efforts. Long term, this state can be stored as a small text file, which makes changes much easier to track over time with git and similar tools. The ability to derive an image again from a text file is powerful. However, it is entirely possible for a Dockerfile to fall into disrepair over time and fail to build at a later date. As with any other Operating System, and with Infrastructure as Code in general, it is important to keep everything up to date.

An alternative to Docker would be to use a Virtual Machine, or a bare metal installation. However, I believe containers are the perfect testbed for running ephemeral web crawlers because they are easy to provision and clean up. I can run many containers on a single machine, in parallel, that can be started and stopped almost instantly. Then whenever I require additional isolation this whole setup can be run from inside a virtual machine as well.

Tor Browser

The Onion Router (Tor) is a global network of computers that route traffic through a series of encrypted hops. This allows a user to browse the internet somewhat anonymously, and to access websites that are only available on the Tor Network.

  • Tor Browser is a heavily configured version of Firefox with a focus on privacy and security. It can use the Tor Network out of the box.

  • torbrowser-launcher is included in many Linux operating systems and makes starting a web browser on the Tor Network extremely easy.

Selenium

Selenium is a tool for automating web browsers. I went with Python in this example, but there are many languages and tools available.

I will be using Selenium to automate the Tor Browser on the Tor Network. I am attempting to take advantage of the security-conscious browser's existing configuration to limit the information shared with targets. I can clean up a container's runtime state simply by stopping it. The resulting WebDriver session will have a random exit IP address provided by Tor, making it a very powerful tool for web scraping.

Xvfb

Xvfb is an X.Org display server that renders to a virtual framebuffer in memory instead of a physical screen. I use it to provide the X display that the desktop environment and browser in the container draw to. This software rendering is not particularly fast, and graphics-heavy applications will benefit from a GPU.

Overview of the Files

The Dockerfiles should build on any Linux, macOS, or Windows machine that can run Docker. I have tested this on Docker Desktop using an Intel processor. There is an ARM variant of the official Kali Linux base container. I would appreciate feedback from anyone who tries running this on ARM.

  1. Dockerfile
  2. TorBrowser.Dockerfile
  3. TorBrowser-root.Dockerfile
  4. startx-once.sh
  5. start-vnc-server-once.sh
  6. ssh-keys.sh
  7. torrc
  8. .github/workflows/main.yml

Dockerfile

This is the Dockerfile that I used to build the base image. I have added a few comments to explain what each line does.

# From the official Kali Linux Docker image rolling release
FROM kalilinux/kali-rolling

# Leaving this here for now, but it's not needed
ARG DISPLAY_NUMBER=1
ENV DISPLAY=:$DISPLAY_NUMBER

# Install the packages we need
# Got lucky with this one, there was a ticket about this:
# https://github.com/dnschneid/crouton/issues/4461 `xfce4 dbus-launch not found` in this version of the image so installing dbus-x11
RUN apt-get update -y -q \
  && apt-get install -y -q xvfb xfce4 xfce4-goodies tightvncserver net-tools curl openssh-server dbus-x11

# This line copies the shell scripts from the local directory to the container, and makes them executable.
COPY *.sh /opt/
RUN chmod 755 /opt/*.sh

# This line runs the shell scripts that start the X server, and generate the SSH keys.
RUN /opt/startx-once.sh
RUN /opt/ssh-keys.sh

# This line sets the entrypoint to bash, so that we can run commands in the container.
ENTRYPOINT /bin/bash

startx-once.sh

This script helps start the X server. I am using the Xvfb command to start the X server. If a PID file is present and something is wrong with the X server, I can kill the process and start it again. I experimented with blocking X sessions from the network with -nolisten tcp; while it may be unnecessary, I am adding it anyway (see the -nolisten tcp discussion on Super User).

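#!/bin/bash
# Start Xvfb only if no PID file has been written yet.
FILE=/opt/.x.pid
if [[ -f "$FILE" ]]; then
    echo "$FILE exists"
    echo "PID: $(cat "$FILE")"
else
    Xvfb -nolisten tcp "$DISPLAY" &
    # $! is the PID of the background job; $? would only be an exit status
    echo $! > "$FILE"
fi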
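The same start-once pattern can be sketched in Python, which may be convenient once the Python tooling from Part 2 is in the image. This is an illustration under assumptions rather than code from the repository: the names pid_is_running and start_once are hypothetical, the liveness check is POSIX-only, and unlike the shell script it also restarts the process when the PID file is stale.

```python
import os
import subprocess
from pathlib import Path


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without killing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def start_once(command: list[str], pid_file: Path) -> int:
    """Start command unless pid_file already points at a live process."""
    if pid_file.exists():
        text = pid_file.read_text().strip()
        if text.isdigit() and pid_is_running(int(text)):
            return int(text)  # already running, nothing to do
    process = subprocess.Popen(command)
    pid_file.write_text(f"{process.pid}\n")
    return process.pid
```

In the container this might be called as start_once(["Xvfb", "-nolisten", "tcp", ":1"], Path("/opt/.x.pid")), assuming display :1 as set in the Dockerfile.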

ssh-keys.sh

VNC Clients are less than ideal for copying and pasting text. This can be remedied by adding SSH keys to the container to help other hosts connect to it. This opens up the possibility of using SSH tunnels to connect to the container.

The following ssh-keygen command was referenced from this Stack Exchange Article

#!/bin/bash

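# Generate an RSA key pair with an empty passphrase
ssh-keygen -b 2048 -t rsa -f ~/.ssh/id_rsa -q -N ""
# Authorize the public key (not the private key) for logins to this container
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys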

If I am going to use these keys, I need the private key on my workstation. I am using the following commands to copy the private key to my host machine from docker, but there are a multitude of ways to do this. There was a good discussion around this on Stack Overflow.

Finding the container ID with the ps command:

docker ps

Using the container ID to copy the private key with the cp command:

# For root
docker cp <containerId>:/root/.ssh/id_rsa /target/path/on/host

# Otherwise for the user configured in the Dockerfile
docker cp <containerId>:/home/username/.ssh/id_rsa /target/path/on/host
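OpenSSH clients refuse to use a private key file that other users can read, so the copied key usually needs its permissions tightened on the host. A minimal sketch of that step follows; the function name restrict_private_key is hypothetical, and the path passed to it is whatever target path was used with docker cp.

```python
import os
import stat
from pathlib import Path


def restrict_private_key(path: Path) -> str:
    """Make a private key readable and writable by its owner only (0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so the caller can confirm them.
    return oct(stat.S_IMODE(path.stat().st_mode))
```

This is equivalent to running chmod 600 on the key file from a shell.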

torrc

This file needs to be copied into the container. It is just the boilerplate configuration that the Tor Project provides, with a few sections uncommented.

# /etc/tor/torrc

## Tor opens a SOCKS proxy on port 9050 by default -- even if you don't
## configure one below. Set "SOCKSPort 0" if you plan to run Tor only
## as a relay, and not make any local application connections yourself.
SOCKSPort 9050 # Default: Bind to localhost:9050 for local connections.
#SOCKSPort 192.168.0.1:9100 # Bind to this address:port too.

## Uncomment this to start the process in the background... or use
## --runasdaemon 1 on the command line. This is ignored on Windows;
## see the FAQ entry if you want Tor to run as an NT service.
RunAsDaemon 1

## The port on which Tor will listen for local connections from Tor
## controller applications, as documented in control-spec.txt.
ControlPort 9051

## If you enable the controlport, be sure to enable one of these
## authentication methods, to prevent attackers from accessing it.
#HashedControlPassword 16:872860B76453A77D60CA2BB8C1A7042072093276A3D701AD684053EC4C
CookieAuthentication 1
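To see at a glance which options are actually active in a file like this, the commented lines can be filtered out. The sketch below is a simplified reader, not Tor's own parser: the function name active_torrc_options is hypothetical, it ignores line continuations, and it keeps only the last value when an option is repeated.

```python
def active_torrc_options(text: str) -> dict[str, str]:
    """Return the uncommented directives in a torrc as {name: value}."""
    options: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split(None, 1)  # option name, then the rest as its value
        options[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return options


# A shortened copy of the directives shown above.
SAMPLE = """\
SOCKSPort 9050 # Default: Bind to localhost:9050 for local connections.
#SOCKSPort 192.168.0.1:9100 # Bind to this address:port too.
RunAsDaemon 1
ControlPort 9051
CookieAuthentication 1
"""

print(active_torrc_options(SAMPLE))
# {'SOCKSPort': '9050', 'RunAsDaemon': '1', 'ControlPort': '9051', 'CookieAuthentication': '1'}
```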

TorBrowser.Dockerfile

This Dockerfile builds on the previous Dockerfile, and adds Tor Browser and Python to the mix. I am using the torbrowser-launcher package to install Tor Browser.

# From the base Dockerfile:
FROM excitingtheory/kalilinux-xvfb

ARG DISPLAY_NUMBER=1
ENV DISPLAY=:$DISPLAY_NUMBER

# More info about torbrowser-launcher:
# https://github.com/micahflee/torbrowser-launcher
RUN apt-get update -y -q \
  && apt-get install -y -q torbrowser-launcher python3 python3-pip

# Create a user and add them to the sudo group, sudo still TBD but leaving for now, will not be able to switch users in this container.
RUN useradd -s /bin/bash -m username \
  && usermod -aG sudo username

# Install python packages
RUN pip3 install selenium pytest
COPY torrc /etc/tor/torrc

# Set the user and home directory for the container
USER username
ENV USER=username
ENV HOME=/home/username

# Run the shell scripts that start the X server, and generate the SSH keys.
RUN /opt/startx-once.sh
RUN /opt/ssh-keys.sh

# This line sets the entrypoint to bash, so that we can run commands in the container.
ENTRYPOINT /bin/bash

TorBrowser-root.Dockerfile

This is the same as the previous Dockerfile but with less configuration. It runs as the root user.

# From the base Dockerfile:
FROM excitingtheory/kalilinux-xvfb

ARG DISPLAY_NUMBER=1
ENV DISPLAY=:$DISPLAY_NUMBER

# More info about torbrowser-launcher:
# https://github.com/micahflee/torbrowser-launcher
RUN apt-get update -y -q \
  && apt-get install -y -q torbrowser-launcher python3 python3-pip


# Install python packages
RUN pip3 install selenium pytest

# Copy the tor config file into the container
COPY torrc /etc/tor/torrc

# Run the shell scripts that start the X server, and generate the SSH keys.
RUN /opt/startx-once.sh
RUN /opt/ssh-keys.sh

# This line sets the entrypoint to bash, so that we can run commands in the container.
ENTRYPOINT /bin/bash

.github/workflows/main.yml

The GitHub Actions workflow file builds the Docker images and pushes them to Docker Hub. This is the stock workflow from Docker's documentation, with additional tags being pushed. I added a schedule to trigger this Action once a day (the cron expression below runs at 16:20 UTC).

name: docker build and push

on:
  push:
    branches:
      - "main"
  schedule:
      - cron:  '20 16 * * *'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      -
        name: Checkout
        uses: actions/checkout@v3
      -
        name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      -
        name: Build and push base image
        uses: docker/build-push-action@v4
        with:
          context: .
          file: ./Dockerfile
          push: true
          tags: excitingtheory/kalilinux-xvfb:latest
      -
        name: Build and push torbrowser root image
        uses: docker/build-push-action@v4
        with:
          context: .
          file: ./TorBrowser-root.Dockerfile
          push: true
          tags: excitingtheory/kalilinux-xvfb:torbrowser-root
      -
        name: Build and push torbrowser image
        uses: docker/build-push-action@v4
        with:
          context: .
          file: ./TorBrowser.Dockerfile
          push: true
          tags: excitingtheory/kalilinux-xvfb:torbrowser

Using the Containers

Now I can test out the containers and see if they work. I am going to use the torbrowser tag for this example.

Building the Containers (Optional)

See the Docker CLI Build Documentation for more information about the docker build command. To build containers locally I use the following commands, otherwise they can be pulled from Docker Hub on demand during the next steps.

docker build -t excitingtheory/kalilinux-xvfb .
docker build -t excitingtheory/kalilinux-xvfb:torbrowser -f TorBrowser.Dockerfile .
docker build -t excitingtheory/kalilinux-xvfb:torbrowser-root -f TorBrowser-root.Dockerfile .

Running the Containers with a Volume

I need to mount a volume to the container so I can access the files that I am working on. I am using the ${HOME}/src directory as a volume.

Some brief notes on the flags that we are using are included below, otherwise see the Docker CLI Run Documentation for more information about the docker run command.

  • -it is used to run the container in interactive mode.

  • --rm is used to remove the container, including its writable layer, when it exits.

  • -p is used to map the port from the container to the host machine.

  • -v is used to mount a volume from the host machine to the container.

# Run the base container
docker run -it --rm -p 5901:5901 -v "${HOME}/src":/src excitingtheory/kalilinux-xvfb

# Run the container with the default user
docker run -it --rm -p 5901:5901 -v "${HOME}/src":/src excitingtheory/kalilinux-xvfb:torbrowser

# Run the container with the root user
docker run -it --rm -p 5901:5901  -v "${HOME}/src":/src excitingtheory/kalilinux-xvfb:torbrowser-root
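When these commands are launched from a script, it can help to assemble the argument list in one place so the flags stay consistent across the three images. The following is a sketch under assumptions: docker_run_command is a hypothetical helper, and it only builds and prints the command rather than executing it.

```python
import shlex
from pathlib import Path


def docker_run_command(image: str, host_dir: Path, vnc_port: int = 5901) -> list[str]:
    """Build the docker run argument list used in this guide."""
    return [
        "docker", "run",
        "-it",                      # interactive terminal
        "--rm",                     # remove the container when it exits
        "-p", f"{vnc_port}:5901",   # host port -> container VNC port
        "-v", f"{host_dir}:/src",   # host directory -> /src in the container
        image,
    ]


command = docker_run_command(
    "excitingtheory/kalilinux-xvfb:torbrowser", Path.home() / "src"
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues with paths that contain spaces.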

Using Tor Browser from the Container

Next I want to open the user container, log in through VNC, and run the Tor Browser Launcher to verify that everything is working so far. Tor Browser Launcher won't work with the root user.

Start the container with the default user:

docker run -it --rm -p 5901:5901 -v "${HOME}/src":/src excitingtheory/kalilinux-xvfb:torbrowser

Next, start the VNC server in the container. The server will prompt for the main password; I also set an additional view-only password.

/opt/start-vnc-server-once.sh

Below is an example of the output.

Connect on macOS

The Connect to Server dialog can be found from the Finder menu: Go -> Connect to Server.

Once the dialog is open, enter the address vnc://localhost:5901 and click Connect. I have saved this address as a favorite for easy access.

Connect on Windows

A VNC client like TightVNC can connect to this on localhost:5901.

Connected to Kali Linux Xfce4 Desktop

Once connected, I can see the xfce4 desktop.

Now I can start the Tor Browser Launcher from the terminal in the VNC session.

torbrowser-launcher

Once the browser is open, I can check whether it is working by going to the Tor Project Connection Checker.

An example of a successful test:

For comparison, I opened the same site on my host machine, which was not connected to Tor:
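The same comparison can also be scripted. As far as I know, the connection checker exposes a JSON endpoint at https://check.torproject.org/api/ip that returns an IsTor flag alongside the observed IP address; treat that endpoint and its field names as an assumption to verify. The helper name is_tor_exit is hypothetical, and the sample bodies below are illustrative values using a documentation-only IP address, not captured output.

```python
import json


def is_tor_exit(response_body: str) -> bool:
    """Interpret a JSON body from the connection checker (assumed format)."""
    data = json.loads(response_body)
    # Treat a missing field as "not on Tor" rather than raising.
    return bool(data.get("IsTor", False))


# Illustrative bodies only; real responses come from the checker itself.
print(is_tor_exit('{"IsTor": true, "IP": "203.0.113.7"}'))
print(is_tor_exit('{"IsTor": false, "IP": "203.0.113.7"}'))
```

Fetching the body is left out on purpose: the request has to go through the Tor-connected browser or proxy for the result to mean anything.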

Progress so Far

This is the Operating System setup I wanted to finish before I started working with Selenium and Python. To recap, I have done the following:

  1. Installed dependencies for Selenium, Python, and a simple xfce4 desktop environment.
  2. Built and pushed three interdependent Docker images with GitHub Actions.
  3. Successfully started a VNC Server in a Docker container.
  4. Connected to a running container's desktop environment with a VNC client.
  5. Tested out running Tor Browser Launcher from the container, and verified it is connecting to the Tor network.

Now that there is a working container environment, check out the next post in this series on covert Python web scraping.


Written by csfx | DevOps Engineer with a background in Public Cloud, Containers, Security, and Automation. Serverless all the things!
Published by HackerNoon on 2023/07/23