Building a Robust System Using the Circuit Breaker Pattern

Written by Andrew Herbst | May 25, 2011

HubSpot is a big system. We're composed of lots of databases, web servers, third-party integrations and myriad other components. When one of those subsystems goes down and the larger system isn't adequately insulated against the failure, it can be a headache.

Let's consider a concrete example. Say you have a web server that populates data on a page from sharded databases, call them A and B. Half of your users are in A and the other half are in B. Let's also assume that we're being naive and have little or no caching of data—every hit to the page results in a database roundtrip. Database connections have timeouts of 60 seconds, which is to say that the web server will attempt to connect to the correct shard for 60 seconds before giving up and throwing an exception. All good?

Now let's take one of your shards away, shard B. What happens? Shard A users will be happy: the server will grab a connection to the database, pull the data and be on its way in short order. Shard B users will have a rougher go of it: the web server will try to connect to the database for 60 seconds before giving up and, hopefully, showing an error message saying their data is unavailable (or, failing that, an ugly stack trace). How will this situation scale? If enough shard B users try to access the site at the same time, the web server will quickly run out of worker threads, because each one is tied up for up to 60 seconds trying to connect to the unresponsive database. At that point both A and B users will get browser timeouts or proxy errors, since the entire web server is unable to serve requests.

"Do you really expect my load balancer to timeout with a proxy error when I can't connect to my database?"
"No Mr. Bond, I expect you to die."

In this case, Goldfinger is actually on to something. What we need is a way to detect the failure of this database (or really any subsystem) and avoid even attempting to connect to it until the error condition has passed. In other words, we should just die. This pattern is called a Circuit Breaker, and Michael Nygard describes it well in his book Release It! Here's how it works.

Let's define three states:

CLOSED: Connections to the monitored subsystem are passed through as normal.
OPEN: Connections to the monitored subsystem are intercepted and immediately fail; an error or exception is passed back to the calling client.
HALF_OPEN: A limited number of connections to the monitored subsystem are allowed to pass through, but further failures put us right back into OPEN (this is our retry mechanism).

And we have a state machine:

Failure threshold exceeded while CLOSED? Move to OPEN.
Retry timeout passes while OPEN? Move to HALF_OPEN.
Connections succeed while HALF_OPEN? Move to CLOSED; otherwise move back to OPEN.
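
To make the transitions concrete, here is a minimal sketch of that state machine in Java. It is not HubSpot's implementation: the class name SimpleCircuitBreaker, its methods and the injectable clock are all hypothetical. It also makes two simplifying assumptions the description above does not fix: failures are counted consecutively, and HALF_OPEN does not cap the number of trial calls.

```java
import java.util.function.LongSupplier;

/** Hypothetical, minimal circuit breaker state machine (illustration only). */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;    // consecutive failures tolerated while CLOSED (assumption)
    private final long retryTimeoutMillis; // how long to stay OPEN before allowing a trial call
    private final LongSupplier clock;      // time source in milliseconds, e.g. System::currentTimeMillis

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long retryTimeoutMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.retryTimeoutMillis = retryTimeoutMillis;
        this.clock = clock;
    }

    /** Returns true if the caller may attempt the call; false means fail fast. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= retryTimeoutMillis) {
            state = State.HALF_OPEN; // retry timeout passed: let trial calls through
        }
        return state != State.OPEN;
    }

    /** Report a successful call to the monitored subsystem. */
    synchronized void recordSuccess() {
        if (state == State.OPEN) {
            return; // late result from a call that started before we tripped; ignore it
        }
        failures = 0;
        state = State.CLOSED; // a success while HALF_OPEN closes the breaker again
    }

    /** Report a failed call to the monitored subsystem. */
    synchronized void recordFailure() {
        if (state == State.OPEN) {
            return; // already tripped; do not extend the retry window
        }
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN; // trial call failed, or threshold reached while CLOSED
            openedAt = clock.getAsLong();
        }
    }

    synchronized State getState() {
        return state;
    }
}
```

A caller would check allowRequest() before touching the database, throw immediately if it returns false, and otherwise report the outcome with recordSuccess() or recordFailure(). Passing the clock in as a parameter is just a convenience for exercising the retry timeout without sleeping.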

We recently implemented this here at HubSpot and we're seeing some nice early returns. Here's an example of how it works:

public interface MonitoredResourceInterface {
    @CircuitBreakerExceptionBlocklist(blocklist={SQLException.class})
    public void someMethodToMonitor() throws SQLException, BlahException;

    public void someMethodWeDontCareAbout();
}

public MonitoredResourceInterface getCircuitBreakerMonitoredResource() {
    // The underlying resource we want to protect
    MonitoredResourceInterface ds = getMonitoredResourceImplementation();
    CircuitBreakerWrapper wrapper = CircuitBreakerWrapper.getInstance();

    // The policy determines the failure threshold at which the breaker trips
    CircuitBreakerPolicy policy = getPolicyImplementation();
    ds = wrapper.wrap(ds, policy);
    return ds;
}

There are two important keys here. One is the notion of an exception blocklist: that is, a list of exceptions that our circuit breaker will watch for and that will eventually cause us to trip when we reach a threshold determined by the passed-in CircuitBreakerPolicy instance. In this example we know that we don't care about BlahExceptions; perhaps they happen all the time and we know they don't hurt overall system performance. But we do care about SQLExceptions: those can be caused by database connection problems, and too many of them lead to serious issues. We want too many SQLExceptions to trip us to the OPEN state so our app can live to fight another day when our database is gone.
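
As an illustration of how a blocklist like this could be consulted at runtime, here is a rough sketch using standard Java reflection. The annotation ExceptionBlocklist, the helper BlocklistChecker and the interface SampleResource are hypothetical stand-ins, not the definitions used at HubSpot. The sketch also assumes that unannotated methods are simply not monitored.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.sql.SQLException;

/** Hypothetical stand-in for the blocklist annotation shown above. */
@Retention(RetentionPolicy.RUNTIME) // must be retained so reflection can see it at runtime
@Target(ElementType.METHOD)
@interface ExceptionBlocklist {
    Class<? extends Throwable>[] blocklist() default {};
}

/** Hypothetical example interface with one monitored and one unmonitored method. */
interface SampleResource {
    @ExceptionBlocklist(blocklist = {SQLException.class})
    void monitored() throws SQLException;

    void unmonitored();
}

/** Hypothetical helper that decides whether a thrown exception counts toward tripping. */
class BlocklistChecker {
    static boolean countsAsFailure(Method method, Throwable thrown) {
        ExceptionBlocklist annotation = method.getAnnotation(ExceptionBlocklist.class);
        if (annotation == null) {
            return false; // assumption: methods without the annotation are not monitored
        }
        for (Class<? extends Throwable> type : annotation.blocklist()) {
            if (type.isInstance(thrown)) { // also matches subclasses of a listed type
                return true;
            }
        }
        return false;
    }
}
```

With this in place, an SQLException thrown from monitored() would count toward the threshold, while any exception from unmonitored(), or an unlisted exception from monitored(), would pass through without affecting the breaker.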

The other key is our transparent wrapping of an object in a dynamic proxy that monitors thrown exceptions and intercepts failing methods when appropriate. We've used this approach for third-party service monitoring to great success at HubSpot, but that's another article.
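
For readers unfamiliar with dynamic proxies, the following is a rough sketch of the interception idea using java.lang.reflect.Proxy from the standard library. FailFastProxy and Fetcher are hypothetical names, and this is not how CircuitBreakerWrapper is actually implemented. To stay short, the sketch watches a single exception type, counts consecutive failures, and omits the retry timeout and HALF_OPEN state, so once tripped it stays tripped.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.sql.SQLException;

/** Hypothetical resource interface used only for this illustration. */
interface Fetcher {
    String fetch() throws SQLException;
}

/** Hypothetical fail-fast wrapper built on a JDK dynamic proxy. */
class FailFastProxy implements InvocationHandler {
    private final Object target;                      // the real object being protected
    private final Class<? extends Throwable> watched; // exception type that counts as a failure
    private final int threshold;                      // consecutive failures before failing fast
    private int failures = 0;

    private FailFastProxy(Object target, Class<? extends Throwable> watched, int threshold) {
        this.target = target;
        this.watched = watched;
        this.threshold = threshold;
    }

    /** Wraps target in a proxy that implements the same interface. */
    static <T> T wrap(T target, Class<T> iface, Class<? extends Throwable> watched, int threshold) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] {iface},
                new FailFastProxy(target, watched, threshold));
        return iface.cast(proxy);
    }

    // Synchronized for simplicity; this serializes calls, which a real implementation would avoid.
    @Override
    public synchronized Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (failures >= threshold) {
            // Tripped: refuse immediately instead of touching the failing subsystem.
            throw new IllegalStateException("circuit open for " + method.getName());
        }
        try {
            Object result = method.invoke(target, args);
            failures = 0; // any success resets the consecutive-failure count
            return result;
        } catch (InvocationTargetException e) {
            Throwable cause = e.getCause();
            if (watched.isInstance(cause)) {
                failures++;
            }
            throw cause; // rethrow the original exception to the caller
        }
    }
}
```

Calling code keeps working against the Fetcher interface, for example FailFastProxy.wrap(realFetcher, Fetcher.class, SQLException.class, 5), and never needs to know whether it holds the real object or the proxy. That transparency is what makes the wrapping approach attractive.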

This basic pattern should be part of any software engineering toolkit. As your software grows in complexity and scale, it's critical to effectively insulate subsystems from each other so that one failing component doesn't bring down the whole system. The circuit breaker is one pattern that, when used judiciously, can increase overall system robustness and improve end user experience.
