Go is a programming language often used for applications in which performance matters. Optimizing your code based on assumptions is not a best practice of course. You need to have insights about your code performance and bottlenecks to be able to optimize it efficiently. What is a profiler? Profiler is a dynamic performance analysis tool that provides critical execution insights in various dimensions which enable resolving performance issues, locating memory leaks, thread contention and more. Go comes with built-in support for profiling! What kinds of profiles can I get? Go has several built in profiles: stack traces of all current Goroutines Goroutine: stack traces of CPU returned by runtime CPU: a sampling of memory allocations of live objects Heap: a sampling of all past memory allocations Allocation: stack traces that led to the creation of new OS threads Thread: stack traces that led to blocking on synchronization primitives Block: stack traces of holders of contended mutexes Mutex: How to get a profile? There are a few ways to create a profile. 1. Using “go test” to generate profile Support for profiling the standard testing package. As an example, the following command runs all benchmarks and writes CPU and memory profile to and : built into cpu.prof mem.prof go -cpuprofile cpu.prof -memprofile mem.prof -bench . test 2. Download live profile using HTTP If you need to have a live profile in a long-running service, package is your ultimate solution. Solely import the package in your code and you are good to go: net/http/pprof _ import "net/http/pprof" The initialization of this package is the reason you need to import it solely. { http.HandleFunc( , Index) http.HandleFunc( , Cmdline) http.HandleFunc( , Profile) http.HandleFunc( , Symbol) http.HandleFunc( , Trace) } func init () "/debug/pprof/" "/debug/pprof/cmdline" "/debug/pprof/profile" "/debug/pprof/symbol" "/debug/pprof/trace" As you can see the package initialization registers handlers for the given URL in . These handlers write the profile into a which you can download. Now the only thing you need to do is running a HTTP server with a handler: DefaultServeMux http.ResponseWriter nil { { http.ListenAndServe( , ) }() } func init () go func () ":1234" nil Note that if you already run your http server with , you don’t need to run another server and profiles are available on your server. DefaultServeMux Now, you can download live profile data from the URL. /debug/pprof/ 3. Profiling in code Using , You can also profile directly within the code. For example you can start a CPU profile using and then stop it by . Now you have the CPU profile written to the given . runtime/pprof pprof.StartCPUProfile(io.Writer) pprof.StopCPUProfile() io.Writer But wait, there is an easier option. has developed a package named which can make profiling easy as pie. First you have to get the package using following command (or using Go modules): Dave Cheney profile go get github.com/pkg/profile Then you can use it like this: { profile.Start(profile.ProfilePath( )).Stop() } func main () defer "." // do something Above code will write CPU profile into in working directory. The default has been set to CPU profiling but you can pass options to to be able to get other profiles as well. cpu.pprof profile.Start() profile.Start(profile.CPUProfile).Stop() profile.Start(profile.GoroutineProfile).Stop() profile.Start(profile.BlockProfile).Stop() profile.Start(profile.ThreadcreationProfile).Stop() profile.Start(profile.MemProfileHeap).Stop() profile.Start(profile.MemProfileAllocs).Stop() profile.Start(profile.MutexProfile).Stop() // CPUProfile enables cpu profiling. Note: Default is CPU defer // GoroutineProfile enables goroutine profiling. // It returns all Goroutines alive when defer occurs. defer // BlockProfile enables block (contention) profiling. defer // ThreadcreationProfile enables thread creation profiling. defer // MemProfileHeap changes which type of memory profiling to // profile the heap. defer // MemProfileAllocs changes which type of memory to profile // allocations. defer // MutexProfile enables mutex profiling. defer How to use a profile? So we have collected the profile data, what now? is a tool for visualization and analysis of profiling data. It reads a collection of profiling samples in format and generates reports to visualize and help analyze the data. It can generate both text and graphical reports. pprof profile.proto All of the ways that mentioned above are using under the hood and this package writes runtime profiling data in the format expected by the pprof visualization tool. runtime/pprof Write A Simple Code To Profile For a better understanding, I’ve written the below code to profile. It is a simple HTTP server which handles requests on URL. Every request comes in, the function will declare a and create a child Goroutine at line 16. The child will decode the given JSON log within the and if it was successful it does a time consuming (~400 millisecond) write and then sends to . Meanwhile the function blocks on select statement (line 27) until either when the child sends the result to (line 24) or when a timeout happens (line 30). /log logHandler chan int Request.Body http.StatusOK ch logHandler ch ( _ ) { http.HandleFunc( , logHandler) http.ListenAndServe( , ) } { ch := ( ) { obj := ( [ ] ) err := json.NewDecoder(r.Body).Decode(&obj); err != { ch <- http.StatusBadRequest } time.Sleep(time.Duration(rand.Intn( )) * time.Millisecond) ch <- http.StatusOK }() { status := <-ch: w.WriteHeader(status) <-time.After( * time.Millisecond): w.WriteHeader(http.StatusRequestTimeout) } } import "encoding/json" "math/rand" "net/http" "net/http/pprof" "time" func main () "/log" ":8080" nil func logHandler (w http.ResponseWriter, r *http.Request) make chan int go func () make map string float64 if nil return // simulation of a time consuming process like writing logs into db 400 select case case 200 I’ve compiled and run the code and have sent some loads to it using . Here is the output: hey Summary: Total: 0.8292 secs Slowest: 0.2248 secs Fastest: 0.0039 secs Average: 0.1574 secs Requests/sec: 241.1882Latency distribution: 10% 0.0409 secs 25% 0.1143 secs 50% 0.2007 secs 75% 0.2013 secs 90% 0.2093 secs 95% 0.2187 secs 99% 0.2245 secsStatus code distribution: [200] 87 responses [408] 113 responses in in in in in in in pprof Basics After writing, running, and sending load to our service, it is time to run the pprof tool on the profile of the service to see which functions were the most consuming ones during the load. Install pprof Tool You can use command or install to use it directly. go tool pprof pprof go get -u github.com/google/pprof Run pprof On Interactive Mode Now let’s run pprof on allocated objects using the following command: pprof -alloc_objects http://:8080/debug/pprof/allocs This will open a simple interactive shell that takes pprof commands to generate reports. Now enter the command , the output will be the top memory consumers. top Dropped Nodes As you can see in the output, 11 nodes has been dropped (line 3). A node is an object entry and dropping a node can help to reduce the noise, but sometimes it may hide the root cause of the issue. Adding the flag or assigning it directly inside the attractive shell will include all data of the profile in the result. -nodefraction=0 Difference Between Flat And Cumulative In the above output, you can see two columns, and . Flat means , while cum (cumulative) means For better understanding, let’s assume the following function, the flat time of function is 4s and the cum is 11s. flat cum only by this function by this function and functions called down the stack. A { B() DO STH DIRECTLY C() } func A () // takes 1s // takes 4s // takes 6s Difference Between Allocations And Heap Profile An important note to consider is that the difference between the allocations and heap profile is the way the pprof tool reads them at start time. Allocations profile will start pprof in a mode which displays the total number of allocated objects since the program began which as well, but heap profile will start in a mode that displays number of live allocated objects which . includes garbage collected ones does not include garbage collected bytes That means in the pprof tool you can change the mode regardless of the profile type (heap or allocations). It can be done using the following values. You can either set them as a pprof flag or assign them to inside the pprof interactive shell: sample_index amount of memory allocated and not released yet (heap) inuse_space : amount of objects allocated and not released yet (heap) inuse_objects : total amount of memory allocated (regardless of released) alloc_space : total amount of objects allocated (regardless of released) alloc_objects : Get An SVG Graph From The Profile You can generate a graph of your profile in an SVG format, and open it with a web browser. The following command will request for a 5s CPU profile and will launch a browser with an SVG file. pprof -web http://:8080/debug/pprof/profile?seconds=5 Run pprof Via A Web Interface One of the useful pprof features is the web interface. If the flag is specified, pprof starts a web server at the specified host:por that provides an interactive web-based interface to pprof. -http pprof -http :8080 http://:8080/debug/pprof/goroutine The preceding command should automatically open your web browser at the right page; if not, you can manually visit the specified port in your web browser. Monitoring Goroutines From Command Line As you can see in the Goroutine profile, 115 Goroutines are alive which is not normal for idle service. It should be a Goroutine leak. So let’s check Goroutines status using the tool. grmon is a command line monitoring for Goroutines. Let’s install it and run it on host that is serving the profile data: grmon go get -u github.com/bcicen/grmon grmon -host :8080 The output is: It seems the function blocks on a action. After reviewing the code I found the bug. It has occurred due to sending into an unguaranteed channel. I’ve fixed that by turn the to a buffered one channel (line 15). It was the very same mistake that I have mentioned . logHandler chan send ch this post How does a profiler work? Although stack traces generated by profilers are often intelligible, it is sometimes necessary to understand how profilers work under the hood to get even more details from the generated profiles. CPU Profiler The Go CPU profiler uses a signal to record code execution statistics. Once the signal got registered, it will deliver every specified time interval. This timer unlike typical timers will increment when the CPU is executing the process. The signal will interrupt code execution and make it possible to see which code was interrupted. SIGPROF When the function is called, the signal handler will be registered to call every 10 milliseconds interval by default (100 Hz). On Unix, it uses a syscall to set the signal timer. pprof.StartCPUProfile SIGPROF setitimer(2) On every invocation signal, the handler will trace back the execution by unwinding it from the current value. This will generate a stack trace and increment its hit count. The whole process will result in a list of stack traces grouped by the number of times each stack trace is seen. PC After the profiler is stopped, the stack traces will be converted to pprof-supported stack frames with source file names, line numbers and function names. Memory Profiler The memory profiler samples allocations. It will show function calls allocations. Recording all allocation and unwinding thestack trace would be expensive, therefore a sampling technique is used. heap The sampling process relies on a pseudo random number generator based on to sample only a fraction of allocations. The generated numbers define the distance between samples in terms of allocated memory size. This means that only allocations that cross the next random sampling point will be sampled. exponential distribution The sampling rate defines the mean of exponential distribution . The default value is 512 KB which is specified by the . That means to tune the size of the fraction to be sampled, you have to change value. Setting the value to 1, will include every allocated block in the profile. Obviously this will slow down your application. runtime.MemProfileRate runtime.MemProfileRate Stack traces with corresponding sampled allocations information is available any time, since the memory allocation profiler is always active. Setting the to 0, will turn off the memory sampling entirely. runtime.MemProfileRate Once the stack traces got ready, those will be converted to pprof-supported stack frames. Also published here .