Disclaimer: This story is not sponsored by Async Profiler.

Overview

Most Java sampling profilers rely on the "Java Virtual Machine Tool Interface (JVM TI)" to profile Java applications. However, JVM TI has an inherent limitation: it only allows collecting stack traces at a safepoint. Therefore, any sampling profiler using JVM TI suffers from the safepoint bias problem.

"Async Profiler" does not use JVM TI to get stack trace samples and therefore avoids the safepoint bias problem. Let's understand this problem first.

So, what is a safepoint?

"A safepoint is a moment in time when a thread's data, its internal state and representation in the JVM are, well, safe for observation by other threads in the JVM."

Following are some examples of safepoints in a Java application.

- Between every 2 bytecodes (interpreter mode)
- Backedge of non-counted loops
- Method exit
- JNI call exit

Safepoint bias problem

Nitsan Wakart has talked about this safepoint bias problem in many places. His blog post, "Why (Most) Sampling Java Profilers Are F***ing Terrible", is perhaps the most comprehensive description of the problem.

In summary, when a sampling profiler takes a stack trace sample, the application threads are stopped, the data is collected, and the threads are resumed. The problem is that application threads can be stopped only at safepoints. So when a profiler requests stack traces at predefined intervals, each stack trace is actually captured at the next available safepoint poll location. Hence, the samples are biased towards safepoints and the profiler may report inaccurate data.

How can we avoid this safepoint bias problem?

There is an AsyncGetCallTrace method, which is an OpenJDK internal API call to facilitate non-safepoint collection of stack traces. Some profilers use this method to avoid the safepoint bias problem.
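To make the "backedge of non-counted loops" entry in the safepoint list more concrete, here is a minimal Java sketch. The class and method names are hypothetical. The behaviour described in the comments is an assumption about typical HotSpot JIT behaviour that varies between JDK versions and compiler settings, so treat it as an illustration of the code shape and not as a guarantee.

```java
import java.util.Arrays;

class SafepointLoops {

    // Counted loop: an int induction variable with a loop-invariant bound.
    // HotSpot's JIT may leave out the safepoint poll on the back edge of a
    // loop like this, so a safepoint-based sampler can only observe the
    // thread once it reaches some later poll, not inside the loop itself.
    static long sumCounted(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Same work with a long induction variable. On many HotSpot versions
    // this is not treated as a counted loop, so the back edge keeps its
    // safepoint poll and samples can land inside the loop.
    static long sumNonCounted(int[] data) {
        long sum = 0;
        for (long i = 0; i < data.length; i++) {
            sum += data[(int) i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);
        // Both methods compute the same result; only the loop shape differs.
        System.out.println(sumCounted(data));
        System.out.println(sumNonCounted(data));
    }
}
```

With a safepoint-biased profiler, time spent in a loop like the first one may be attributed to whatever code contains the next poll. A profiler based on AsyncGetCallTrace does not need to wait for that poll.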
Nitsan has also written a great blog post on the AsyncGetCallTrace method: "The Pros and Cons of AsyncGetCallTrace Profilers".

The "Honest Profiler" is one of the first Java sampling profilers without the safepoint bias problem. Honest Profiler has its own sampling agent that uses UNIX Operating System signals and the AsyncGetCallTrace API in order to efficiently and accurately profile a Java application.

"Java Flight Recorder" also does not require threads to be at safepoints in order for stacks to be sampled. However, it cannot get metadata for non-safepoint parts of the code without the following flags:

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

Introduction to Async Profiler

By now, you must have figured out that "Async Profiler" also avoids the safepoint bias problem by using the AsyncGetCallTrace API.

Async Profiler is a low overhead sampling profiler for Java. It uses AsyncGetCallTrace and perf_events together to give a more holistic view of the application, including both system code paths and Java code paths. Async Profiler works with any JDK based on the HotSpot JVM (which has the AsyncGetCallTrace method), such as OpenJDK and Oracle JDK.

With perf_events, the system code paths of an application are profiled. Async Profiler nicely combines the system code paths from perf_events with the Java code paths from the AsyncGetCallTrace API.

It's also recommended to use the following flags when profiling with Async Profiler:

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

Async Profiler supports several types of profiling. For example,

- CPU profiling
- Allocation profiling
- Wall-clock profiling
- Lock profiling

Another exciting feature in Async Profiler is the out-of-the-box Flame Graph support, which allows visualizing stack traces.

Flame Graph is just one of the output formats supported by Async Profiler. Interestingly, Async Profiler can also produce a Java Flight Recording with "Method Profiling Sample" events.
Following are the output types supported by Async Profiler.

- summary: Summary of the execution profile.
- traces: List all unique stack trace samples in descending order of sample count.
- flat: List all top methods from stack trace samples.
- collapsed: Folded output of stack traces compatible with Brendan Gregg's flamegraph.pl script.
- svg: Visualize all stack traces as a Flame Graph.
- tree: Web (HTML) page showing all stack traces as a call tree.
- jfr: Java Flight Recording with Method Profiling Sample events.

The last three outputs (svg, tree, and jfr) should be used with the -f option to dump output to a file.

If the output file has a *.svg, *.html or *.jfr extension, the output format will automatically be svg, tree or jfr respectively.

The summary, traces, and flat outputs can be combined. That is, in fact, the default output of Async Profiler.

Getting Started with Async Profiler

As I mentioned earlier, Async Profiler depends on perf_events. The following configuration is needed to capture kernel call stacks using perf_events from a non-root process.

- Set /proc/sys/kernel/perf_event_paranoid to 1.
- Set /proc/sys/kernel/kptr_restrict to 0.

The first setting allows general users to profile the kernel. The second setting disables the restrictions placed on exposing kernel addresses.

You can write the values directly to the mentioned files or use the sysctl command to update them.

Writing to files:

echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict

Using the sysctl command:

sudo sysctl -w kernel.perf_event_paranoid=1
sudo sysctl -w kernel.kptr_restrict=0

You may also add the following lines to /etc/sysctl.conf in order to make the settings permanent. (This is not usually recommended.)

kernel.perf_event_paranoid=1
kernel.kptr_restrict=0

The latest Async Profiler can be downloaded from the GitHub releases page, or you can easily build it yourself.
Async Profiler is also included in IntelliJ IDEA.

Since I like to work with the latest changes, I built Async Profiler from source. Building the profiler was easy. I just exported JAVA_HOME and executed the make command.

Let's profile some sample applications and see what each profile looks like. I used sample applications I developed to demonstrate various performance issues. Source code is at https://github.com/chrishantha/sample-java-programs/

I will be using the svg output since it's easier to visualize stack traces with Flame Graphs.

CPU Profiling

To profile the CPU, I used the "highcpu" sample application. This application has several threads doing CPU consuming tasks, such as calculating hashes for some random UUIDs and doing some math operations.

I executed the application using the following command.

java -Xms128m -Xmx128m -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -jar target/highcpu.jar --exit-timeout 600

While the application was running, I took a CPU profile for 30 seconds using the following command.

./profiler.sh -d 30 -f cpu-flame-graph.svg --title "CPU profile" --width 1600 $(pgrep -f highcpu)

Following is the Flame Graph showing which methods were on CPU.

Here, we can see that the Flame Graph shows both system code paths and Java code paths.

Without Async Profiler, getting a similar Flame Graph was not straightforward. In order to generate "Java Mixed-Mode Flame Graphs", the following steps had to be done.

1. Run the application with -XX:+PreserveFramePointer.
2. Profile the application using the perf record command and generate a perf.data file.
3. Generate a Java symbol table using perf-map-agent to map Java code addresses to method names.
4. Use the perf script command, get the folded output, and generate the Flame Graph.

This approach had various issues. For example, the -XX:+PreserveFramePointer flag, which is only available in JDK 8u60 and later, has a performance overhead. Sometimes, frames will also be missing due to method inlining.
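For readers who do not have the sample repository checked out, a workload of the kind described for highcpu can be approximated with a few lines of Java. This is only a sketch: the class name, the thread count, the run time, and the choice of SHA-256 are assumptions made for illustration and are not taken from the highcpu source.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CpuBurner {

    // Hashes `count` random UUID strings and returns the last digest.
    // Returns an empty array when count is 0.
    static byte[] hashRandomUuids(int count) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] last = new byte[0];
            for (int i = 0; i < count; i++) {
                byte[] input = UUID.randomUUID().toString()
                        .getBytes(StandardCharsets.UTF_8);
                last = digest.digest(input);
            }
            return last;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 4;        // hypothetical value
        long runSeconds = 60;   // hypothetical value, long enough to attach a profiler
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(runSeconds);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            // Each worker keeps hashing until the deadline passes.
            pool.execute(() -> {
                while (System.nanoTime() < deadline) {
                    hashRandomUuids(10_000);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(runSeconds + 5, TimeUnit.SECONDS);
    }
}
```

Started with the same -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints flags, a program like this can be profiled with the same profiler.sh command by pointing pgrep at CpuBurner instead of highcpu.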
Allocation Profiling

Async Profiler can also be used to profile the code allocating heap memory. It uses TLAB (Thread Local Allocation Buffer) callbacks, which can be used in production without much overhead. Java Flight Recorder also uses a similar approach.

Let's look at an allocation profile for the same highcpu sample application. Note that the profiling event type is alloc.

./profiler.sh -e alloc -d 30 -f alloc-flame-graph.svg --title "Allocation profile" --width 1600 $(pgrep -f highcpu)

Now, the top edges show the methods allocating memory. Here, we can quickly see which code paths are allocating memory in the application.

Let's look at another sample application named "allocations". This application checks whether a number is prime or not in a finite loop.

Unlike the highcpu application, this application terminates in a short time. Therefore, I profiled the allocations from the beginning by launching Async Profiler as an agent.

java -Xms64m -Xmx64m -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -agentpath:"/home/isuru/projects/git-projects/async-profiler/build/libasyncProfiler.so=start,event=alloc,file=allocation-flame-graph.svg,svg,title=Allocation profile,width=1600" -jar target/allocations.jar

Following is the Flame Graph output.

From the allocations Flame Graph, it is clear that almost all of the allocations are due to Java Autoboxing, which is not easily perceived when looking at the code.

Wall-clock Profiling

Async Profiler can also sample all threads equally, irrespective of each thread's status (Running, Sleeping or Blocked), by using the wall event type.

A wall-clock profile is really useful for figuring out what is happening with threads over time. For example, wall-clock profiling can be used to troubleshoot issues in application start-up time.

Since all threads are profiled regardless of thread status, the wall-clock profiler is most useful in per-thread mode. Let's see an example.
Following is the wall-clock profile of the highcpu application.

./profiler.sh -e wall -t -d 30 -f wall-flame-graph.svg --title "Wall clock profile" --width 1600 $(pgrep -f highcpu)

The above Flame Graph shows that there is an almost equal number of samples for each thread.

Let's look at another wall-clock profile, this time for the "latencies" application. This application mainly has two threads (Even Thread and Odd Thread) to print even and odd numbers. The method to check whether a number is even is synchronized.

java -Xms64m -Xmx64m -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+UseSerialGC -agentpath:"/home/isuru/projects/git-projects/async-profiler/build/libasyncProfiler.so=start,event=wall,file=wall-flame-graph.svg,threads,svg,simple,title=Wall clock profile,width=1600" -jar target/latencies.jar --count 10

The above Flame Graph clearly shows the thread states of all threads. The Even and Odd threads were in the Sleeping or Blocked states most of the time.

Lock Profiling

Locks in a Java application can be profiled by using the lock event type in Async Profiler.

Let's look at the lock profile for the "latencies" application.

java -Xms64m -Xmx64m -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+UseSerialGC -agentpath:"/home/isuru/projects/git-projects/async-profiler/build/libasyncProfiler.so=start,event=lock,file=lock-flame-graph.svg,svg,simple,title=Lock profile,width=1600" -jar target/latencies.jar --count 10

The above Flame Graph shows the code paths blocked due to locks.

Other Types of Profiling

Async Profiler also has many other event types. For example, it supports some hardware and software performance counters from perf_events.

This means that, with Async Profiler, software performance counters like context-switches, page-faults, etc. and hardware performance counters like cache-misses, branch-misses, etc. can be profiled easily.

To view all event types supported by the target JVM, the list action can be used.
./profiler.sh list jps

Basic events:
  cpu
  alloc
  lock
  wall
  itimer
Perf events:
  page-faults
  context-switches
  cycles
  instructions
  cache-references
  cache-misses
  branches
  branch-misses
  bus-cycles
  L1-dcache-load-misses
  LLC-load-misses
  dTLB-load-misses
  mem:breakpoint
  trace:tracepoint

Summary

Async Profiler is possibly the best profiling tool I have used so far. It avoids the safepoint bias problem and brings in the power of Linux kernel profiling through perf_events.

Async Profiler supports many kinds of events, such as CPU cycles, Linux performance counters, allocations, lock attempts, etc.

This story demonstrated a few examples of Async Profiler using some sample Java applications and producing Flame Graphs.

Async Profiler is really easy to use, especially because its outputs are simple and do not require special applications to process them.

I recommend that you try Async Profiler and make sure that you always profile your applications while developing (to avoid costly production issues in the future).
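If you want something small to try this on, the autoboxing finding from the allocation profile is easy to reproduce. The following is a minimal sketch and not the code of the allocations sample application: the class and method names are hypothetical, and how many of the boxed values actually reach the heap depends on the JIT (escape analysis may remove some of them).

```java
class BoxingDemo {

    // Trial-division primality check on a primitive long.
    static boolean isPrime(long n) {
        if (n < 2) {
            return false;
        }
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Boxed version: the loop variable and the counter are java.lang.Long,
    // so every increment unboxes, adds one, and boxes the result again.
    // Values outside the small Long cache may each become a new object.
    static long countPrimesBoxed(long limit) {
        Long count = 0L;
        for (Long n = 2L; n <= limit; n++) {
            if (isPrime(n)) {
                count++;
            }
        }
        return count;
    }

    // Primitive version: same logic, no boxing in the loop.
    static long countPrimesPrimitive(long limit) {
        long count = 0;
        for (long n = 2; n <= limit; n++) {
            if (isPrime(n)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long limit = 2_000_000L; // hypothetical value, large enough to sample
        System.out.println(countPrimesBoxed(limit));
        System.out.println(countPrimesPrimitive(limit));
    }
}
```

Running this with the alloc event type should make the difference between the two methods visible in the Flame Graph, even though the two loops look almost identical in the source.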