Or let's write our own init process
When your process runs as PID 1 in a Docker container, signal handling behaves differently to what you might expect.
First, let's sanity check what happens when a process is not PID 1 on a "normal" system.
A simple Python process that just sleeps
Aarons-iMac:bin aaronkalair$ cat mypy.py
import subprocess
subprocess.call(["sleep", "100"])
And if we run it and send SIGTERM
Aarons-iMac:init-proc aaronkalair$ ps -ef | grep python
501 14013 6588 0 2:08pm ttys004 0:00.02 python mypy.py
Aarons-iMac:bin aaronkalair$ kill 14013
Terminated: 15
It gets terminated, nothing surprising here
And now let’s run it as PID 1 in a Docker container
Aarons-iMac:bin aaronkalair$ cat Dockerfile
FROM ubuntu:16.04
RUN apt-get update
RUN apt-get install -y python
COPY mypy.py /srv/
CMD ["python", "/srv/mypy.py"]
Run this container, exec in and then send the same signal
Aarons-iMac:init-proc aaronkalair$ docker exec -it 0229aa205b48 bash
root@0229aa205b48:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 14:15 ?        00:00:00 python /srv/mypy.py
root         7     1  0 14:15 ?        00:00:00 sleep 100
root@0229aa205b48:/# kill 1
root@0229aa205b48:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 14:15 ?        00:00:00 python /srv/mypy.py
root         7     1  0 14:15 ?        00:00:00 sleep 100
And now nothing happens!
Let's try this with a Go process that does something similar:
package main

import (
	"time"
)

func main() {
	time.Sleep(time.Duration(100000) * time.Millisecond)
}
Pop this into a Docker container, run it, exec in and send it SIGTERM
Aarons-iMac:init-proc aaronkalair$ docker exec -it e6ccf11be060 bash
root@e6ccf11be060:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 14:28 ?        00:00:00 ./srv/sleep-spawner
root@e6ccf11be060:/# kill 1
root@e6ccf11be060:/# Aarons-iMac:init-proc aaronkalair$
And it's killed, and the container exits with it, just as it would if it weren't running as PID 1.
So what’s going on here then?
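Well, PID 1 is special in Linux. Amongst other things, it ignores any signal whose action is still the default, so a signal only has an effect if the process has explicitly installed a handler for it. Python leaves SIGTERM at its default action, so the signal is dropped, whereas the Go runtime installs its own signal handlers, which is why the Go program still exits. From the Docker docs: https://docs.docker.com/engine/reference/run/#foreground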
Note: A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. So, the process will not terminate on SIGINT or SIGTERM unless it is coded to do so.
We could just define handlers for those signals in every process we want to run in a Docker container but this is a lot of work and we may not have the source code to do so. Furthermore there are other responsibilities for PID 1 that we’ll explore later.
So instead we could run a different process as PID 1, have it proxy signals to the actual process we want to run, and have it perform the other duties of a standard init process.
There are numerous solutions that do this, for example:
Yelp's dumb-init: https://github.com/Yelp/dumb-init
Tini, which is shipped with Docker: https://docs.docker.com/engine/reference/run/#specify-an-init-process
And many more, which you can find by searching around.
But I’m going to write my own…
So let's start with the basics. I need a program that takes the name of another program and its arguments, and executes it:
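package main

import (
	"os"
	"os/exec"
)

func main() {
	// os.Args[1] is the program to run, the rest are its arguments
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	err := cmd.Start()
	if err != nil {
		panic(err)
	}
	// Block until the child exits and reap it
	err = cmd.Wait()
	if err != nil {
		panic(err)
	}
}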
There are some things to note about how we do this, because they will be important later.
After we Start() the new process we call Wait(). This blocks until the command exits and, once it does, cleans up any resources associated with it.
Failure to wait on a process you spawn leads to zombie processes that hang around after they've finished executing, still consuming some resources.
From the man page — http://man7.org/linux/man-pages/man2/waitpid.2.html#NOTES
A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.
So let's try out the first version of our proxy. If we run that in a container…
CMD ["./srv/init-proc", "/srv/sleep-spawner", "1"]
We can see that our proxy process is now PID 1 and has spawned off sleep-spawner
root@36c4892039db:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 17:45 ?        00:00:00 ./srv/init-proc /srv/sleep-spawner 1
root        11     1  0 17:45 ?        00:00:00 /srv/sleep-spawner 1
Alright, the next step is to register our interest in all possible signals:
func main() {
	signalChannel := make(chan os.Signal, 2)
	// With no signals listed, Notify relays every incoming signal
	signal.Notify(signalChannel)
	pid := -1

	go sigHandler(&pid, signalChannel)

	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	err := cmd.Start()
	// Check the error before touching cmd.Process, which is nil if Start failed
	if err != nil {
		panic(err)
	}
	pid = cmd.Process.Pid

	err = cmd.Wait()
	if err != nil {
		panic(err)
	}
}
With sigHandler defined as:
func sigHandler(pid *int, signalChannel chan os.Signal) {
	var sigToSend syscall.Signal = syscall.SIGHUP
	for {
		sig := <-signalChannel
		switch sig {
		// #1 - Sent when the controlling terminal is closed, typically used by daemonised processes to reload config
		case syscall.SIGHUP:
			sigToSend = syscall.SIGHUP
		// #2 - Like pressing CTRL+C
		case syscall.SIGINT:
			sigToSend = syscall.SIGINT
		// ..... repeat for all signals
		}
		syscall.Kill(*pid, sigToSend)
	}
}
It simply switches on all the signals Go supports (https://golang.org/pkg/syscall/#pkg-constants) and then uses the kill system call to send the signal through to the process that's being run.
Now let's use it to run our Python program and see if it handles SIGTERM correctly.
Aarons-iMac:init-proc aaronkalair$ docker exec -it 579ef1d3ce77 bash
root@579ef1d3ce77:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:33 ?        00:00:00 ./srv/init-proc python /srv/mypy.py
root        13     1  0 18:33 ?        00:00:00 python /srv/mypy.py
root        14    13  0 18:33 ?        00:00:00 sleep 100
root@579ef1d3ce77:/# kill 1
root@579ef1d3ce77:/# Aarons-iMac:init-proc aaronkalair$
And it works!
Now let's take care of another thing PID 1 is responsible for: cleaning up zombie processes.
Imagine this scenario
A — spawns -> B — spawns-> C
Now if B dies or exits before C, C becomes an orphan process. Who is C's parent now?
Well the operating system is responsible for reparenting orphan processes to PID 1, so it now looks like
A — parent of -> C
Now when C exits, A will receive the SIGCHLD signal and is responsible for calling wait on C to clean up this zombie process.
So let's add this logic to the SIGCHLD case:
case syscall.SIGCHLD:
	var status syscall.WaitStatus
	var rusage syscall.Rusage
	syscall.Wait4(-1, &status, syscall.WNOHANG, &rusage)
	sigToSend = syscall.SIGCHLD
-1 means wait for any child process to change state rather than a specific one, as we don't know the ID of the process that has exited when we get the signal.
WNOHANG means that if there are no child processes that have changed state, don't block waiting for one, return immediately.
Performing wait on a terminated child cleans up its resources, preventing it from remaining a zombie process.
From the wait manpage: http://man7.org/linux/man-pages/man2/waitpid.2.html
In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a wait is not performed, then the terminated child remains in a "zombie" state
Now there's just one more case to handle. Imagine:
A — spawns -> B — spawns -> C
Now C exits but B doesn’t call wait on it
A — parent of-> B — parent of-> C (defunct zombie process)
wait only works on a process's own children, so no matter how many times our init process A calls wait, it won't clean up the resources C is using. (And note that SIGCHLD would only be sent to B, so A wouldn't even be aware of C exiting.)
Now B exits, A receives SIGCHLD, calls wait, and B is cleaned up nicely.
C is now an orphan that gets reparented to A so we have
A — parent of -> C (defunct zombie process)
We can see the above in action with some modifications to our sleeping program, so that it produces processes whose parents exit before their children and don't call wait:
package main

import (
	"os"
	"os/exec"
	"strconv"
	"time"
)

func main() {
	MAX_LEVEL := 4

	level, err := strconv.Atoi(os.Args[1])
	if err != nil {
		panic(err)
	}

	// We'll have a bunch of processes that immediately exit at the max level
	if level == MAX_LEVEL {
		return
	}

	// Need the top level to outlive the others, otherwise the container would
	// exit and you wouldn't be able to inspect the process tree
	sleepTime := 0
	if level == 1 {
		sleepTime = 20000000
	} else {
		// Generate processes where children sleep for longer than their parents,
		// so parents exit first without waiting on the children, showing what
		// happens to orphan / zombie processes
		sleepTime = level * 1000
	}

	level += 1
	for i := 0; i < 2; i++ {
		// Spawn a command and intentionally don't wait on it
		err := exec.Command("/srv/sleep-spawner", strconv.Itoa(level)).Start()
		if err != nil {
			panic(err)
		}
	}
	time.Sleep(time.Duration(sleepTime) * time.Millisecond)
}
It’s available on Github here — https://github.com/AaronKalair/sleep-spawner
And if we run this we can see what the process tree looks like:
Aarons-iMac:init-proc aaronkalair$ docker exec -it 854a232d4b89 bash
root@854a232d4b89:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 22:13 ?        00:00:00 ./srv/init-proc /srv/sleep-spawner 1
root        12     1  0 22:13 ?        00:00:00 /srv/sleep-spawner 1
root        17    12  0 22:13 ?        00:00:00 [sleep-spawner] <defunct>
root        22    12  0 22:13 ?        00:00:00 [sleep-spawner] <defunct>
root        32     1  0 22:13 ?        00:00:00 [sleep-spawner] <defunct>
With our current implementation this will remain the situation forever, so we need to modify it slightly to handle cases like this:
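case syscall.SIGCHLD:
	var status syscall.WaitStatus
	var rusage syscall.Rusage
	for {
		retValue, err := syscall.Wait4(-1, &status, syscall.WNOHANG, &rusage)
		// ECHILD only means there are no children left to wait for,
		// so it should not bring down the init process
		if err != nil && err != syscall.ECHILD {
			panic(err)
		}
		// 0: children exist but none has exited yet, -1: nothing left to reap
		if retValue <= 0 {
			break
		}
	}
	sigToSend = syscall.SIGCHLD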
We take advantage of the return value of wait4 when used in combination with WNOHANG to call it in a loop every time we get a SIGCHLD signal. Several children can exit before we get round to handling a single SIGCHLD, so one signal may stand for more than one zombie.
Again from the man page (wait4's return value conforms to waitpid: http://man7.org/linux/man-pages/man2/waitpid.2.html):
on success, returns the process ID of the child whose state has changed; if WNOHANG was specified and one or more child(ren) specified by pid exist, but have not yet changed state, then 0 is returned. On error, -1 is returned.
So we can sit calling Wait4 until we get a return value less than or equal to 0, knowing that each successful call is cleaning up an exited process.
Now if we run this, exec inside the container and check with ps:
Aarons-iMac:init-proc aaronkalair$ docker exec -it 30f13d4e53bd bash
root@30f13d4e53bd:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 22:05 ?        00:00:00 ./srv/init-proc /srv/sleep-spawner 1
root        12     1  0 22:05 ?        00:00:00 /srv/sleep-spawner 1
root        17    12  0 22:05 ?        00:00:00 [sleep-spawner] <defunct>
root        18    12  0 22:05 ?        00:00:00 [sleep-spawner] <defunct>
We can see that the zombies parented to PID 1 have now been cleaned up!
And there we have it: we've made a basic init process that lets us send signals to processes running in Docker containers and have them behave the same way they would outside of a container, and that cleans up zombie processes!
See the full source code here — https://github.com/AaronKalair/init-proc