HubSpot is a big system. We're composed of lots of databases, web servers, third-party integrations and myriad other components. So when one of those subsystems goes down and the larger system isn't adequately insulated against such failures, it can be a headache.
Let's consider a concrete example. Say you have a web server that populates data on a page from sharded databases, call them A and B. Half of your users are in A and the other half are in B. Let's also assume that we're being naive and have little or no caching of data—every hit to the page results in a database roundtrip. Database connections have timeouts of 60 seconds, which is to say that the web server will attempt to connect to the correct shard for 60 seconds before giving up and throwing an exception. All good?
Now let's take one of your shards away, shard B. What happens? Shard A users will be happy: the server will grab a connection to the database, pull the data and be on its way in short order. Shard B users will have a rougher go of it: the web server will try to connect to the database for 60 seconds before giving up and, hopefully, showing an error message saying their data is unavailable (or, failing that, an ugly stack trace). How will this situation scale? If enough shard B users try to access the site at the same time, the web server will quickly run out of worker threads, because each one sits blocked trying to connect to the unresponsive database. At that point both A and B users get browser timeouts or proxy errors, since the entire web server is unable to serve requests.
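To put rough numbers on how quickly this goes wrong, here is a back-of-the-envelope sketch using Little's law (average busy workers = arrival rate times the time each request holds a worker). The class and method names are made up for illustration, and the pool size used below is an assumption, not a figure from our servers.

```java
/** Hypothetical helper: estimates worker-thread usage when requests block on a dead shard. */
final class PoolSaturation {
    private PoolSaturation() {}

    /** Little's law: average number of workers tied up at a steady request rate. */
    static double busyWorkers(double requestsPerSecond, double holdSeconds) {
        return requestsPerSecond * holdSeconds;
    }

    /** Steady request rate at which, on average, every worker in the pool is blocked. */
    static double saturatingRate(int poolSize, double holdSeconds) {
        return poolSize / holdSeconds;
    }
}
```

With an assumed pool of 200 workers and the 60-second timeout, saturatingRate(200, 60.0) comes out at about 3.3. In other words, a little over three shard B requests per second would be enough to keep the whole pool blocked.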
"Do you really expect my load balancer to timeout with a proxy error when I can't connect to my database?"
"No Mr. Bond, I expect you to die."
In this case, Goldfinger is actually on to something. What we need is a way to detect the failure of this database (or really any subsystem) and avoid even the attempt to connect to it until the error condition has passed—we should just die. This is called a Circuit Breaker, and it's well described by Michael Nygard in Release It!. Here's how it works.
Let's define three states:
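In Nygard's description of the pattern the three states are conventionally called closed (calls pass through), open (calls fail immediately) and half-open (trial calls are let through to see whether the subsystem has recovered). A minimal sketch of that state machine follows. The class name, the failure threshold, the retry interval and the injected clock are all illustrative assumptions, not our actual implementation.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of a three-state circuit breaker. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before tripping
    private final long retryAfterMillis;  // how long to stay OPEN before trying again
    private final LongSupplier clock;     // injected time source, in milliseconds

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
        this.clock = clock;
    }

    /** Returns true if the caller may attempt the protected operation. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= retryAfterMillis) {
            state = State.HALF_OPEN; // the wait is over, allow trial calls
        }
        return state != State.OPEN;
    }

    /** A successful call closes the circuit and clears the failure count. */
    synchronized void recordSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    /** A failed trial call, or too many failures in a row, opens the circuit. */
    synchronized void recordFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check allowRequest() before touching the database, fail fast when it returns false, and report the outcome with recordSuccess() or recordFailure() otherwise.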
public interface MonitoredResourceInterface {
    // SQLExceptions thrown by this method count toward tripping the breaker
    @CircuitBreakerExceptionBlocklist(blocklist={SQLException.class})
    public void someMethodToMonitor() throws SQLException, BlahException;
    public void someMethodWeDontCareAbout();
}
public MonitoredResourceInterface getCircuitBreakerMonitoredResource() {
    MonitoredResourceInterface ds = getMonitoredResourceImplementation();
    CircuitBreakerWrapper wrapper = CircuitBreakerWrapper.getInstance();
    CircuitBreakerPolicy policy = getPolicyImplementation();
    // Wrap the resource so that calls go through the circuit breaker
    ds = wrapper.wrap(ds, policy);
    return ds;
}
There are two important keys here. One is the notion of an exception blocklist: that is, a list of exceptions that our circuit breaker will watch for and that will eventually cause us to trip when we reach a threshold determined by the passed-in CircuitBreakerPolicy instance. In this example we know that we don't care about BlahExceptions; perhaps they happen all the time and we know they don't hurt overall system performance. But we do care about SQLExceptions: those can be caused by database connection problems, and too many of them lead to serious issues. We want too many SQLExceptions to trip us to the OPEN state so our app can live to fight another day when our database is gone.
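One way a wrapper like this could be built is with a JDK dynamic proxy that reads the annotation at call time. The sketch below is a guess at the general shape, not the code behind CircuitBreakerWrapper: the names ExceptionBlocklist, BreakerProxy and DemoResource are hypothetical, a plain integer threshold stands in for the policy object, and recovery back out of the open state is left out for brevity.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical annotation listing the exception types that count toward tripping. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ExceptionBlocklist {
    Class<? extends Throwable>[] blocklist();
}

/** Hypothetical resource with one monitored and one unmonitored method. */
interface DemoResource {
    @ExceptionBlocklist(blocklist = {SQLException.class})
    void monitored() throws SQLException;

    void unmonitored();
}

/** Hypothetical proxy that fails fast once enough blocklisted exceptions are seen. */
final class BreakerProxy implements InvocationHandler {
    private final Object target;
    private final int threshold; // assumed stand-in for a policy object
    private final AtomicInteger failures = new AtomicInteger(); // blocklisted failures since last success

    private BreakerProxy(Object target, int threshold) {
        this.target = target;
        this.threshold = threshold;
    }

    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface, int threshold) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, new BreakerProxy(target, threshold));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        ExceptionBlocklist watched = method.getAnnotation(ExceptionBlocklist.class);
        if (watched == null) {
            return call(method, args); // unannotated methods bypass the breaker
        }
        if (failures.get() >= threshold) {
            // Open: do not even attempt the call
            throw new IllegalStateException("circuit open for " + method.getName());
        }
        try {
            Object result = call(method, args);
            failures.set(0); // a success clears the count
            return result;
        } catch (Throwable t) {
            for (Class<? extends Throwable> type : watched.blocklist()) {
                if (type.isInstance(t)) {
                    failures.incrementAndGet();
                    break;
                }
            }
            throw t; // the caller still sees the original exception
        }
    }

    /** Invokes the real method and unwraps the reflection wrapper exception. */
    private Object call(Method method, Object[] args) throws Throwable {
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause();
        }
    }
}
```

With a threshold of 2, the first two SQLExceptions reach the caller as usual. A third call to monitored() then throws IllegalStateException without touching the underlying resource, while unmonitored() keeps working.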