Have you ever yanked the power cord out of your computer in frustration? While this might seem like a quick solution, it can lead to data loss and system instability. In the world of software, a similar concept exists: the hard shutdown. This abrupt termination can cause problems just like its physical counterpart. Thankfully, there's a better way: the graceful shutdown.
By integrating graceful shutdown, we provide advance notification to the service. This enables it to complete ongoing requests, potentially save state information to disk, and ultimately avoid data corruption during shutdown.
This guide will delve into the world of graceful shutdowns, specifically focusing on their implementation in Go applications running on Kubernetes.
One of the key tools for achieving graceful shutdowns in Unix-based systems is the concept of signals, which are, in simple terms, a simple way to communicate one specific thing to a process, from another process. By understanding how signals work, we can leverage them to implement controlled termination procedures within our applications, ensuring a smooth and data-safe shutdown process.
There are many signals (run `man 7 signal` or `kill -l` for the full list), but our concern here is only the shutdown signals, primarily SIGTERM and SIGINT:
These signals can be sent by the user (Ctrl+C sends SIGINT, Ctrl+\ sends SIGQUIT), by another program/process, or by the system itself (kernel / OS). For example, a SIGSEGV, aka segmentation fault, is sent by the kernel when a process accesses invalid memory.
To explore the world of graceful shutdowns in a practical setting, let's create a simple service we can experiment with. This "guinea pig" service will have a single endpoint that simulates some real-world work (we’ll add a slight delay) by calling Redis's INCR command. We'll also provide a basic Kubernetes configuration to test how the platform handles termination signals.
The ultimate goal: ensure our service gracefully handles shutdowns without losing any requests/data. By comparing the number of requests sent in parallel with the final counter value in Redis, we'll be able to verify if our graceful shutdown implementation is successful.
We won’t go into the details of setting up the Kubernetes cluster and Redis, but you can find the full setup in our GitHub repository.
The verification process is the following: send 1,000 requests to the /incr endpoint in parallel, trigger a rolling update of the deployment while those requests are in flight, and then compare the final value of the Redis counter with the number of requests sent.
Let’s start with our base Go HTTP Server.
hard-shutdown/main.go
package main

import (
	"net/http"
	"os"
	"time"

	"github.com/go-redis/redis"
)

func main() {
	redisdb := redis.NewClient(&redis.Options{
		Addr: os.Getenv("REDIS_ADDR"),
	})

	server := http.Server{
		Addr: ":8080",
	}

	http.HandleFunc("/incr", func(w http.ResponseWriter, r *http.Request) {
		go processRequest(redisdb)
		w.WriteHeader(http.StatusOK)
	})

	server.ListenAndServe()
}

func processRequest(redisdb *redis.Client) {
	// simulate some business logic here
	time.Sleep(time.Second * 5)
	redisdb.Incr("counter")
}
When we run our verification procedure against this code, we’ll see that some requests fail and the counter ends up below 1000 (the exact number varies from run to run), which means we lost data during the rolling update. 😢
Go provides the os/signal package for handling Unix signals. It’s important to note that, by default, SIGINT and SIGTERM cause a Go program to exit. To keep our application from exiting so abruptly, we need to handle those incoming signals ourselves.
There are two options to do so.
Using a channel:
c := make(chan os.Signal, 1)
signal.Notify(c, syscall.SIGTERM)
Using context (preferred approach nowadays):
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
defer stop()
NotifyContext returns a copy of the parent context that is marked done (its Done channel is closed) when one of the listed signals arrives, when the returned stop() function is called, or when the parent context's Done channel is closed, whichever happens first.
There are a few problems with our current implementation of the HTTP server:

- The handler spawns a fire-and-forget goroutine for each request, so nothing waits for in-flight work to finish.
- The program exits as soon as it receives SIGTERM, without draining active connections.
- The Redis connection is never closed.
Let’s rewrite it.
graceful-shutdown/main.go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"

	"github.com/go-redis/redis"
)

var wg sync.WaitGroup

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	redisdb := redis.NewClient(&redis.Options{
		Addr: os.Getenv("REDIS_ADDR"),
	})

	server := http.Server{
		Addr: ":8080",
	}

	http.HandleFunc("/incr", func(w http.ResponseWriter, r *http.Request) {
		wg.Add(1)
		go processRequest(redisdb)
		w.WriteHeader(http.StatusOK)
	})

	// make it a goroutine so main can block on the signal
	go server.ListenAndServe()

	// listen for the interrupt signal
	<-ctx.Done()

	// stop the server: refuse new connections, drain active ones
	if err := server.Shutdown(context.Background()); err != nil {
		log.Fatalf("could not shutdown: %v\n", err)
	}

	// wait for all in-flight goroutines to finish
	wg.Wait()

	// close redis connection
	redisdb.Close()

	os.Exit(0)
}

func processRequest(redisdb *redis.Client) {
	defer wg.Done()
	// simulate some business logic here
	time.Sleep(time.Second * 5)
	redisdb.Incr("counter")
}
Here’s the summary of updates:

- signal.NotifyContext gives us a context that is cancelled when SIGTERM arrives, instead of letting the runtime kill the process.
- ListenAndServe runs in a goroutine so main can block on <-ctx.Done().
- server.Shutdown stops the server from accepting new connections and drains the active ones.
- A sync.WaitGroup tracks the per-request goroutines, so we wait for all in-flight work before exiting.
- The Redis connection is closed only after all work is done.
Now, if we repeat our verification process we will see that all 1000 requests are processed correctly. 🎉
Frameworks like Echo, Gin, Fiber, and others (like net/http itself) spawn a goroutine for each incoming request, give it a context, and then call your function/handler based on the routing you defined. In our case that’s the anonymous function passed to HandleFunc for the “/incr” path.
When you intercept a SIGTERM signal and ask your framework to shut down gracefully, 2 important things happen (to oversimplify): the server stops accepting new connections, and it waits for all active requests to complete before closing.
Note: Kubernetes also stops routing incoming traffic from the load balancer to your pod once it has been marked as Terminating.
Terminating a process can be complex, especially if there are many steps involved, like closing connections. To ensure things run smoothly, you can set a timeout. This timeout acts as a safety net, forcing the process to exit if the graceful shutdown takes longer than expected.
shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

done := make(chan struct{})
go func() {
	if err := server.Shutdown(shutdownCtx); err != nil {
		log.Fatalf("could not shutdown: %v\n", err)
	}
	close(done)
}()

select {
case <-done:
	// graceful shutdown completed in time
case <-shutdownCtx.Done():
	log.Fatalln("timeout exceeded, forcing shutdown")
}

os.Exit(0)
Since we used Kubernetes to deploy our service, let’s dive deeper into how it terminates pods. Once Kubernetes decides to terminate a pod, the following events take place:

1. The pod is marked as Terminating and removed from the Service’s endpoints, so it stops receiving new traffic.
2. The preStop hook, if defined, is executed.
3. The kubelet sends SIGTERM to the main process of each container.
4. Kubernetes waits for terminationGracePeriodSeconds (30 seconds by default).
5. If the containers are still running after the grace period, they are killed with SIGKILL.
As you can see, if you have a long-running termination process, it may be necessary to increase the terminationGracePeriodSeconds setting, allowing your application enough time to shut down gracefully.
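For illustration, a minimal sketch of where the field lives in a Deployment manifest (the name and image are placeholders, and 60 seconds is an arbitrary value for the example):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # default is 30
      containers:
        - name: app
          image: example/app:latest
```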
Graceful shutdowns safeguard data integrity, maintain a seamless user experience, and optimize resource management. With its rich standard library and emphasis on concurrency, Go empowers developers to effortlessly integrate graceful shutdown practices – a necessity for applications deployed in containerized or orchestrated environments like Kubernetes.
You can find the Go code and Kubernetes manifests in our GitHub repository.