Understanding Thread Dumps

Your application slows down suddenly or maybe it just stops doing anything. What do you do? What can you do? Well, you can take a thread dump. In the JVM, all execution is performed via threads. The JVM executes commands on the host OS by running a JVM thread corresponding to a native thread of the OS. So whenever the JVM executes an instruction, it's always done via a thread. JVM threads have a stack of what method invocations have been called to reach the one they're currently executing. You typically see these printed as stack traces in exceptions, showing which line of code caused your exception to be thrown. But, a stack trace can be taken with jstack at any point during execution and will show you that thread's call stack. A thread dump is a list of all the threads in the JVM with a stack trace and some metadata for each.

Copy of What does it mean to be a _good_ engineer_

Anatomy of a Thread Dump Entry

Let’s start with a very simple example.

"Keep-Alive-Timer" #12 daemon prio=8 os_prio=0 tid=0x00007f89a4008d90 nid=0xabf9 waiting on condition [0x00007f85a319d000]

   java.lang.Thread.State: TIMED_WAITING (sleeping)

     at java.lang.Thread.sleep(Native Method)

     at sun.net.www.http.KeepAliveCache.run(KeepAliveCache.java:172)

     at java.lang.Thread.run(Thread.java:748)

The first thing to note is the thread name “Keep-Alive-Timer.”  This thread is created by the KeepAliveCache in order to keep http threads alive. The #12 refers to this being the 12th thread created by the JVM since it started.  After the thread number, the daemon keyword indicates that this is a daemon thread, which means that it won’t prevent the JVM from shutting down if it’s the last running thread. After that, there are several less important pieces of metadata about the thread, such as the priority, the os priority, the thread identifier, the native identifier, the state (this one is important) and its address in the JVM.

Finally, we have the thread state. Java threads can be in seven states. The important states are

  • Runnable: Means a thread is currently executing.
  • Blocked: The thread is waiting for a lock to be released. This happens when entering a synchronized block, for instance.
  • Waiting / Timed_waiting: The thread is waiting for something to happen. These are joins/gets on CompletableFutures, calling Thread.sleep, or Thread.park. Threads in a thread pool waiting for work will be in this state.

Going back to our example we can see that it’s in a `TIMED_WAITING` state, so the thread is currently waiting. From the KeepAliveCache source code, we can deduce the thread is currently waiting before it keeps alive some more http connections.

A Few More Examples

"NioHttpClient-Callback-670" #68821 daemon prio=5 os_prio=0 tid=0x00007f15781d8170 nid=0x814d waiting on condition [0x00007f1594761000]

   java.lang.Thread.State: TIMED_WAITING (parking)

     at sun.misc.Unsafe.park(Native Method)

     - parking to wait for  <0x0000000083d10a78> (a java.util.concurrent.SynchronousQueue$TransferStack)

     at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

     at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)

     at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)

     at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)

     at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

     at java.lang.Thread.run(Thread.java:748)

Here is a thread “NioHttpClient-Callback-670” which, based on the name, is in the NioHttpClient-Callback pool. This thread’s state is `TIMED_WAITING`. Reading the stack trace, the thread is calling getTask() which means it’s polling a synchronous queue and waiting for an element from the queue. So, this thread is currently waiting for work from its Executor and is not executing any tasks. The majority of the threads in a thread dump are typically waiting for work from a threadpool and can usually be ignored (the signal to noise ratio in thread dumps is very low). 

"update-pool-12" #1157 daemon prio=5 os_prio=0 tid=0x00007f155800de00 nid=0x11484 waiting on condition [0x00007f133dc7e000]

   java.lang.Thread.State: WAITING (parking)

     at sun.misc.Unsafe.park(Native Method)

     - parking to wait for  <0x0000000785af47a0> (a java.util.concurrent.CompletableFuture$Signaller)

     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)

     at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)

     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)

     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)

     at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934)

     at com.hubspot.wal.WalWriter.writeToStore(WalWriter.java:79)

     at com.hubspot.wal.WalWriter.write(WalWriter.java:66)

     at com.hubspot.contacts.hbase.data.updaters.ContactUpdateWriter.publishWalMessage(ContactUpdateWriter.java:97)

This thread (stack trace truncated) is taken from a HubSpot Kafka worker that updates contacts (CRM stuff). The thread name tells us it’s from the update pool. Currently the thread is in a waiting state. Reading backwards in the stack trace it has called `CompletableFuture#join()`, which will wait for the future to complete. Continuing down the stacktrace it's called `publishWalMessage`, which refers to a Kafka write ahead log (WAL), so this worker is currently waiting for a Kafka produce to its WAL to finish. 

Effectively Reading Thread Dumps

Thread dumps are full of noise: threads that aren’t doing anything, hundreds of stack frames that aren’t meaningful and metadata that you’ll never need (I’m looking at you, native thread identifier). If you try to scroll through a thread dump looking for the information you need, you almost certainly won’t find it. But there are ways to read through them. When you’re looking through a thread dump, you’re typically interested in what your threads are doing. Instead of looking through the whole thread dump, you search for your package name. For instance in this stack trace:

"update-pool-12" #1157 daemon prio=5 os_prio=0 tid=0x00007f155800de00 nid=0x11484 waiting on condition [0x00007f133dc7e000]

   java.lang.Thread.State: WAITING (parking)

     at sun.misc.Unsafe.park(Native Method)

- parking to wait for  <0x0000000785af47a0> (a java.util.concurrent.CompletableFuture$Signaller)

     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)

     at  java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)

     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)

     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)

     at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934)

     at com.hubspot.wal.WalWriter.writeToStore(WalWriter.java:79)

     at com.hubspot.wal.WalWriter.write(WalWriter.java:66)

     at com.hubspot.contacts.hbase.data.updaters.ContactUpdateWriter.publishWalMessage(ContactUpdateWriter.java:97)

We know it’s a Contacts service, so we search for the package of that service which is `com.hubspot.contacts`. This gets you to a stack trace you’re interested in and the part of the stack trace you care about. The first occurrence of your package name in the stack trace is the point in your code where the thread is currently executing. So you know where to look in your code to find context to what it's executing. It’s worth noting that sometimes in a thread dump, your package won’t be there, which means that none of your code is currently executing. This is frequently the case for low throughput web services and workers (for example, a web service doing < 10 req/s at 10ms mean latency means 90% or more of the time no thread is doing work).

Taking a Thread Dump 

Thread dumps can be taken using the jstack utility. If you're running jstack it's important that you run it as the same user running the JVM you're taking a thread dump of, otherwise thread names won’t be populated. For example if you have a JVM running under user java-service with PID 435, you could take its thread dump via

sudo su java-service jstack 435

Use Cases for Thread Dumps

Slow downs in async workers and web services

Thread dumps can be very handy for analyzing when a service slows down. An example is a web service that’s rejecting http requests because it’s http thread pool and queue are full. Then you can take a thread dump to see what all of the http threads are stuck doing. Go through around 10-15 stack traces and see if they’re all doing the same thing. Look for things like lots of threads making database queries, waiting to acquire database connections, waiting on or calling out to an http api, or stuck in some surprisingly expensive computation. If you can identify what’s slow then you can investigate that service/database/computation to see why it’s slow.

Performance tuning via sampling

Performance profiling with thread dumps can be done, but is not the most efficient way to profile. Flame graphs or a profiling tool like yourkit are better suited to the task, but thread dumps can be used if you don’t have either of those available. Take several thread dumps over the course of a minute when a service is at peak traffic. Then analyze where in the code each of the threads are currently working. This is very similar to looking at slow downs, except there’s going to be more noise and you’ll have to be more careful about drawing conclusions. If all of your threads are calling out to a datastore, it’s not necessarily because the datastore is slow, but maybe that they spend the majority of an http request talking to that datastore. You should go in with an expectation of the work you think your threads should be doing. Then verify that there aren't any areas where the threads are spending more time than you'd expect. A useful tactic when profiling/tuning async workers is to scale them down so they can’t keep up and then look at the performance characteristics.

Finding deadlocks

Looking for deadlocks is usually straightforward with a thread dump. Find a thread that should be doing work and then look at what it’s waiting for. If it’s on a lock, then you know the lock is the issue. So you need to find what’s holding the lock or why the lock was not properly released. Sometimes the thread holding the lock will be stuck and you can find that in the thread dump as well. If you can't, then you'll need to figure out how the lock got into this state. Taking a heap dump and analyzing the values of the classes that interact with the lock can be a good first step. If the thread is waiting on a future, then you need to find the thread executing the future and figure out why the future isn’t complete. In some cases the future will not correspond to a thread (such as async http) and it can be helpful to take a heap dump to examine the classes involved.

Hunting down thread leaks

For hunting down thread leaks, you can group threads together based on their thread names with a command like

awk '{ print $1 }' threaddump.log | grep '"' | sort | awk '{print substr($0,0,13)}' | uniq -c | sort -nr > sort-threadnames.log 

You can adjust the `13` characters as needed. If all of your thread pools have been properly named, it will be easy to find the culprit. If you find that you have thread-1 to thread-200, it’s going to take a little longer. By taking several thread dumps, you may be able to find one of these threads executing application code, which should point you to where it’s being defined. Otherwise you can take a heap dump and look for thread pools/executor services. When you find the one in question, look for incoming references to that thread pool and it should tell you what’s creating it. If the thread pool has been orphaned (non-empty thread pools are never garbage collected because they are always reachable from the threads inside the pool) then you will need to do an audit of all of your thread pools. Running the application locally with a breakpoint on thread pool constructors is also an option (though a very painful one). Another option for analyzing the thread dumps, rather than bash one liners, would be Spotify's thread dump analyzer, which will also group them together.

 

Jacob Schlather

Written by Jacob Schlather

Jacob Schlather is a Site Reliability Engineer at HubSpot.

Subscribe for updates

New Call-to-action