Namespaces are one of the primary components of linux containers. Namespaces provide isolation of shared resources: they give each application its own unique view of the system. Because of namespaces, each docker container appears to have its own filesystem and network. Linux added namespace support gradually over many releases. Due to this gradual change, each type of namespace offers its own unique challenges. Pid namespaces in particular require special handling, especially when multiple processes are involved.
Processes in linux live in a tree-like structure. Each process in the kernel has a unique process identifier, called a “pid” for short. The record for each process tracks the pid of its immediate parent. The child's pid is also passed to the parent when a process is created via the fork syscall. The kernel generates a new pid for the child and returns the identifier to the calling process, but it is up to the parent to keep track of this pid manually.
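The fork behavior described above can be sketched in a few lines of Python. This is only an illustration: the function name `fork_and_report` is made up for this example, and the pipe is just a convenient way for the child to report what it sees.

```python
import os

def fork_and_report():
    """Fork once; return (pid fork gave the parent, pid and ppid the child saw)."""
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()  # returns 0 in the child, the child's pid in the parent
    if child_pid == 0:
        # Child: report its own pid and the pid of its parent, then exit.
        os.close(read_fd)
        os.write(write_fd, f"{os.getpid()} {os.getppid()}".encode())
        os._exit(0)
    # Parent: read the report. The parent stays alive while the child runs,
    # so the child's getppid() refers to this process.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        reported_pid, reported_ppid = map(int, reader.read().split())
    # The kernel will not remind us of child_pid; we have to remember it to reap.
    os.waitpid(child_pid, 0)
    return child_pid, reported_pid, reported_ppid
```

The value returned by fork in the parent matches what the child sees from getpid, and the child's getppid matches the parent's own pid.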
The first process started by the kernel has pid 1. This process is referred to as the init process, or simply ‘init’. The parent pid of init is pid 0, signifying that its parent is the kernel. Pid 1 is the root of the user-space process tree: It is possible to reach pid 1 on a linux system from any process by recursively following each process’ parent. If pid 1 dies, the kernel will panic and you have to reboot the machine.
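As a rough sketch of following parents up the tree, the snippet below reads the `PPid` field from `/proc/<pid>/status`. It assumes a linux system with procfs mounted, and the helper names `parent_of` and `ancestry` are invented for this example. On a typical host the chain should end at pid 1; inside a container it may stop earlier if an ancestor lives outside the visible pid namespace.

```python
import os

def parent_of(pid):
    """Return the parent pid recorded in /proc/<pid>/status, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("PPid:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None

def ancestry(pid):
    """Return [pid, parent, grandparent, ...] up to the last visible ancestor."""
    chain = [pid]
    while True:
        parent = parent_of(chain[-1])
        if not parent:  # None (unreadable) or 0 (no visible parent)
            return chain
        chain.append(parent)
```

Calling `ancestry(os.getpid())` from a shell on the host typically yields something like the interpreter, the shell, a terminal or ssh daemon, and finally pid 1.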
Linux namespaces are created using the unshare syscall, passing a set of flags representing which namespaces to create. In most cases, unshare pops you right into the new namespace. For example, as soon as a process creates a network namespace it immediately sees an empty view of the network with no devices.
The pid namespace is a little different: when you unshare the pid namespace, the process doesn’t immediately enter the new namespace. Instead, it is required to fork. The child process enters the pid namespace and becomes pid 1. This imbues it with special properties.
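The unshare-then-fork sequence can be sketched as follows. This assumes Python 3.12 or newer (where `os.unshare` and `os.CLONE_NEWPID` exist) and a process with enough privilege to create a pid namespace, usually root. The function name is made up, and the unshare happens in a throwaway helper process so the caller's own setup is left alone. When the namespace cannot be created, the sketch returns None.

```python
import os

def pid_seen_in_new_namespace():
    """Return the pid a process sees for itself in a fresh pid namespace, or None."""
    if not hasattr(os, "unshare"):
        return None  # Python older than 3.12, or not linux
    read_fd, write_fd = os.pipe()
    helper = os.fork()
    if helper == 0:
        status = 1
        try:
            os.close(read_fd)
            os.unshare(os.CLONE_NEWPID)  # the helper itself stays in the old namespace
            child = os.fork()            # only this child enters the new namespace
            if child == 0:
                os.write(write_fd, str(os.getpid()).encode())
                os._exit(0)
            os.waitpid(child, 0)
            status = 0
        except OSError:
            pass  # typically EPERM when run without privileges
        finally:
            os._exit(status)
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(helper, 0)
    return int(data) if data else None
```

When run with sufficient privileges, the forked child should report pid 1, even though the helper that called unshare keeps its original pid.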
It is also important to note that a pid namespace creates a separate view of the process hierarchy. In other words the forked process will actually have two pids: it has pid 1 inside the namespace, and has a different pid when viewed from outside the namespace.
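One way to see both pids is the `NSpid` field of `/proc/<pid>/status`, which on linux 4.1 and later lists the pid in each nested namespace, outermost first. The sketch below assumes procfs is mounted; the function name `namespace_pids` is invented for this example.

```python
import os

def namespace_pids(pid="self"):
    """Return the pids from the NSpid field of /proc/<pid>/status, or None."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("NSpid:"):
                    return [int(field) for field in line.split()[1:]]
    except OSError:
        return None
    return None  # kernel too old to report NSpid
```

For a process in the root namespace the list has a single entry. Looking at a container's init from the host, one would expect two entries, with the last one being 1.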
Inside a namespace, init (pid 1) has three unique features when compared to other processes:
It does not automatically get default signal handlers, so a signal sent to it is ignored unless it registers a signal handler for that signal. (This is why many dockerized processes fail to respond to ctrl-c and you are forced to kill them with something like `docker kill`).
If another process in the namespace dies before its children, its children will be reparented to pid 1. This allows init to collect the exit status from the process so that the kernel can remove it from the process table.
If it dies, every other process in the pid namespace will be forcibly terminated and the namespace will be cleaned up.
It is clear that the init process is tightly coupled to the lifetime of the container.
Docker (and runc) run the process specified as the container’s entrypoint (or cmd) as pid 1 in a new pid namespace. This can lead to some unexpected behavior for an application process because it usually isn’t designed to run as pid 1. If it doesn’t set up its own signal handlers, signaling the process will not work. If it forks a child that dies before any grandchildren exit, zombie processes can accumulate in the container, potentially filling up the process table.
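To make the zombie problem concrete, here is a small sketch that creates a zombie on purpose and then reaps it. It assumes linux with procfs mounted. The helper names are made up, and `os.waitid` with `WNOWAIT` is used only to wait for the child to exit without collecting it.

```python
import os

def process_state(pid):
    """Return the State field of /proc/<pid>/status, e.g. 'Z (zombie)', or None."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("State:"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        return None
    return None

def observe_zombie():
    """Return the child's state after it exits, before and after it is reaped."""
    child = os.fork()
    if child == 0:
        os._exit(0)
    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, child, os.WEXITED | os.WNOWAIT)
    before = process_state(child)  # still in the process table as a zombie
    os.waitpid(child, 0)           # reaping removes the entry
    after = process_state(child)   # /proc/<pid> is gone, so this is None
    return before, after
```

An application running as pid 1 that never calls wait leaves every adopted orphan stuck in the first state.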
Docker has been pretty hands-off about this. It is possible to run a special init process in your container and have it fork-exec into the application process, and many containers do this to avoid these problems. One unfortunate side effect of this decision is that the container gains more complexity. Once the container has a real init system, people are apt to embed multiple processes, which sacrifices some of the benefit of dependency isolation. Docker’s lack of native support for pods only exacerbates this problem.
Rkt takes a somewhat saner approach to this problem. It assumes that the process you are starting is not an init process, so it creates an init process for you (systemd) and then has systemd create the filesystem namespace for the container process and start it. Systemd becomes pid 1 in the namespace and the container process runs as pid 2. This does mean that if the container supplies an init process it will run as pid 2, but this rarely causes issues in practice.
For a single process, an advanced init system like systemd is overkill, but expecting container builders to understand the nuances of pid namespaces and init processes is a mistake. There is a simpler solution, but it requires the container spawner to act as init on behalf of the user.
After forking into the pid namespace, instead of execing the container process immediately, the spawner can fork again. The second fork allows the container spawner to become pid 1. It can set up signal handlers to pass all signals to the child. It can then reap zombies until its child dies, at which point it can collect the exit status of the container process and pass it on to the containerization system. This means signals work as expected (I can ctrl-c my process again!) and zombies are properly reaped.
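The second fork and the init duties described above might look roughly like this. It is a minimal sketch rather than any runtime's actual implementation: the name `run_as_init` and the list of forwarded signals are choices made for this example, the pid namespace setup is left out, and `os.waitstatus_to_exitcode` needs Python 3.9 or newer. It must be called from the main thread, since Python only allows signal handlers to be installed there.

```python
import os
import signal

FORWARDED = (signal.SIGINT, signal.SIGTERM, signal.SIGHUP, signal.SIGQUIT)

def run_as_init(argv):
    """Fork-exec argv, forward signals to it, reap children, return its exit code."""
    child = 0

    def forward(signum, _frame):
        # Pass the signal on once the child exists; ignore it otherwise.
        if child > 0:
            try:
                os.kill(child, signum)
            except ProcessLookupError:
                pass

    # Install handlers before forking so no signal hits the default action.
    previous = {signum: signal.signal(signum, forward) for signum in FORWARDED}
    try:
        child = os.fork()
        if child == 0:
            try:
                os.execvp(argv[0], argv)  # exec resets caught signals to defaults
            finally:
                os._exit(127)  # only reached if exec failed
        while True:
            try:
                pid, status = os.wait()  # reaps any child, adopted orphans included
            except ChildProcessError:
                return 1  # no children left and the main child was never seen
            if pid == child:
                # A negative value means the child was killed by that signal.
                return os.waitstatus_to_exitcode(status)
    finally:
        for signum, handler in previous.items():
            if handler is not None:
                signal.signal(signum, handler)
```

For example, `run_as_init(["sh", "-c", "exit 7"])` returns 7. Inside a real pid namespace the same loop would also collect orphans reparented to pid 1, since `os.wait` reaps whichever child exits next.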
Note that a similar alternative has been available since docker 1.13. It is possible to pass the --init flag when starting your container, which will cause docker to start a simple init process for you. It doesn’t appear that this option is widely used, however, and in my experimentation it seems to have some bugs. I have found scenarios where I ctrl-c the process and the init process doesn’t stop until it is manually killed.
It is often beneficial for multiple related processes to run together, but it is preferable to bundle these processes separately so that their dependencies can be isolated. To achieve this, rkt and kubernetes introduced the idea of pods. A pod is a set of related containers that share some namespaces. In the rkt implementation, every namespace but the filesystem namespace is shared.
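To check which namespaces two processes actually share, one can compare the links under `/proc/<pid>/ns`. The sketch below assumes linux with procfs mounted and permission to inspect both processes. The function names and the default list of namespace kinds are choices made for this example.

```python
import os

def namespace_id(pid, kind):
    """Return the identifier of one of pid's namespaces, e.g. 'pid:[4026531836]'."""
    return os.readlink(f"/proc/{pid}/ns/{kind}")

def shared_namespaces(pid_a, pid_b, kinds=("pid", "net", "ipc", "uts", "mnt")):
    """Return the namespace kinds for which both processes have the same identifier."""
    shared = []
    for kind in kinds:
        try:
            if namespace_id(pid_a, kind) == namespace_id(pid_b, kind):
                shared.append(kind)
        except OSError:
            continue  # namespace kind unsupported, or process not inspectable
    return shared
```

For two apps in the same rkt pod, one would expect every kind except `mnt` to appear in the result, matching the description above.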
Because kubernetes also supports pods, it illustrates a similar approach using docker. Due to some of the aforementioned issues with pid namespaces, kubernetes doesn’t yet share pid namespaces between containers in the same pod. This is unfortunate, because that means processes in the same pod cannot signal each other. In addition, each container in the pod has the aforementioned init problem: every container process will run as pid 1.
The rkt approach is superior for pods. You are not required to run an init process inside your containers, but it is easy to create multiple processes that can communicate and even signal each other. Unfortunately the situation isn’t as straightforward when we start talking about adding containers to an existing pod.
With the container runtime interface, kubernetes has introduced the concept of a pod sandbox. This allows the container runtime to allocate resources in advance of starting the containers. While especially useful for networking, the concept also enables adding containers to existing pods. If you are creating the pod sandbox first and then starting the containers one by one, why not allow for an additional container to be added later? This would be especially useful for periodic tasks like database backups or log collection.
Rkt has introduced experimental support for this very feature, allowing for the creation of a pod independently of any containers. Containers (or “apps” in the rkt terminology) can be added to or removed from the pod at a later time. Rkt accomplishes this by starting systemd with no running units. It then communicates with the pod’s systemd to start new apps on demand. This solution is quite elegant, although in this model the init process has additional privileges and introduces a new attack vector. The systemd process in rkt’s sandbox model:
In the non-sandbox model, the init process could start the child processes and then drop these privileges to minimize the effects of compromise.
There are a few different ways to deal with init, sandboxes, and pid namespaces. Each method has some drawbacks. The following options are available:
So which of these options is best? A case could be made for each, but I prefer options 4 and 5. In fact, one could choose between them based on the expected lifetime of the process. Option 5 is a good fit for long-running processes, especially for docker, where the process spawner ends up daemonizing the process anyway. If the process is a shorter task, using option 4 and keeping the process separate from the pid process tree keeps things extremely simple.
It looks like some work is being done in kubernetes to create a pause container that could act as init. Once kubernetes has support for sharing pid namespaces, option 5 could soon follow.
There is quite a bit of hidden complexity in pid namespaces. The choices made by containerization systems today have significant drawbacks that could be avoided by adopting alternative approaches. While the drawbacks for a single container in docker are fairly well understood and have a reasonable workaround, allowing the container spawner to act as init would simplify things for container builders.
When it comes to groups of containers, the rkt approach of a separate init is superior to the docker approach. It allows the processes to communicate via signals, which is not currently possible using the kubernetes pod model. Once delayed start containers are included, however, even rkt’s approach starts to show some drawbacks.
The most compelling approach for delayed start containers is to start a simple init process along with the pid namespace, but to spawn new container processes via the container spawner. This allows the init process to drop privileges, shutting down attack vectors. The spawner can choose to daemonize the new process, keeping the process tree consistent, or it can remain as the parent of the new process, simplifying process management.