Like many companies, we were affected by last week's S3 outage. We were surprised, however, by the extent of the impact on our systems. It was a bit of a wake-up call, and we realized how much of a failure point S3 had become for us and how little we were doing to protect ourselves against S3 downtime.
We know that network calls will inevitably fail, so we use patterns like circuit breaking and bulkheading for our service-to-service requests, MySQL queries, HBase RPCs, Kafka writes, and so on. Our calls to S3, however, didn't have these protections. This post will cover some of the lessons we're going to apply to our codebase in order to insulate ourselves against any future S3 outages, and also introduce a new library we wrote to help us get there.
Lesson 1: Centralize creation of S3 clients
When we looked around our codebase, we realized access to S3 wasn't centralized or standardized at all; each service was doing its own thing. We had a mix of JetS3t and the official AWS Java SDK; some apps configured timeouts and retries, while most used the defaults. One of the things we've discovered over time is that the easiest way to make sure best practices are followed is to bake them in so that you get them automatically. We've found that an effective way to achieve this is to centralize client creation into a shared internal library. This makes life easier for users, while allowing us to transparently add behavior that we think is beneficial. And because everyone is going through the same code path, if we want to make changes in the future we just need to update the code in one place and it will take effect for everyone. Next, we'll cover some of these best practices that are going to be baked into our S3 client creation.
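As an illustration, a shared factory along these lines gives every service the same code path. The class name and settings here are hypothetical, not our actual internal library; the values are illustrative:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Hypothetical shared factory: services call S3Clients.create() instead of
// constructing their own AmazonS3 client, so best practices come for free.
public final class S3Clients {

  private S3Clients() {}

  public static AmazonS3 create() {
    // Baked-in settings (illustrative values); changing them here
    // changes them for every service at once.
    ClientConfiguration config = new ClientConfiguration()
        .withConnectionTimeout(1_000)
        .withSocketTimeout(5_000)
        .withMaxErrorRetry(1);

    return AmazonS3ClientBuilder.standard()
        .withRegion("us-east-1") // example region; resolve per-service in practice
        .withClientConfiguration(config)
        .build();
  }
}
```

Because everything funnels through `create()`, adding metrics, logging, or a circuit breaker later is a change in one place rather than in every service.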
Lesson 2: You probably don't want the default timeouts
During the outage, we saw some APIs fail that we didn't expect to. We expected the endpoints that hit S3 to fail while the other endpoints continued serving traffic. In reality, some of our APIs stopped serving traffic entirely. When we investigated, we found that the endpoints hitting S3 were taking a very long time to fail, causing requests to pile up and exhaust our API's HTTP thread pool. The end result was that the few endpoints that hit S3 took down the entire service. After noticing this, we checked the client timeouts being used. The AWS Java SDK has a default socket timeout of 50 seconds and JetS3t's is 60 seconds. In addition, the AWS Java SDK retries errors up to 3 times by default and JetS3t does 5 retries. This means that when S3 is having issues, calls take much longer to fail than we want. To fix this, we dropped the timeouts and retry counts way down so that our worst-case latency is much lower.
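With the AWS Java SDK, the tightening looks roughly like this. The specific numbers are illustrative, not necessarily what we shipped; tune them to your own latency budget:

```java
import com.amazonaws.ClientConfiguration;

public class TimeoutExample {
  public static void main(String[] args) {
    // SDK defaults: a 50-second socket timeout plus up to 3 retries means
    // a hung S3 call can tie up a request thread for minutes.
    ClientConfiguration defaults = new ClientConfiguration();
    System.out.println(defaults.getSocketTimeout()); // 50000 (ms)

    // Tightened configuration (illustrative values):
    ClientConfiguration tightened = new ClientConfiguration()
        .withConnectionTimeout(1_000)  // fail fast when S3 is unreachable
        .withSocketTimeout(5_000)      // vs. the 50,000 ms default
        .withMaxErrorRetry(1);         // vs. the default of 3 retries
    System.out.println(tightened.getSocketTimeout()); // 5000 (ms)
  }
}
```

Pass the tightened `ClientConfiguration` when building the client and the worst case drops from minutes of hanging to a few seconds of fast failure.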
Lesson 3: Use circuit-breakers and bulkheading to fail better
Beyond tighter timeouts, we also need a way to short-circuit S3 when it's down so that we fail fast and reduce our chances of cascading failure. We could tell our engineers to wrap all of their S3 calls, but this is tedious and doesn't take advantage of the single code path we discussed before. Instead, we want to bake in this behavior so that all calls to S3 automatically get these protections. To achieve this, we wrote a helper library called S3Decorators (on GitHub here). This library provides an injection point for intercepting all calls to S3, which can be used to add logging, track metrics, inject failures or latency for testing, or to wrap each call with Hystrix or Failsafe (to our knowledge, the most popular Java libraries implementing the circuit breaker pattern). We've provided ready-to-use implementations for Hystrix (here) and Failsafe (here) to make it easy to get started. It's as simple as:
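A minimal sketch of that wiring, assuming the static `decorate(...)` factory exposed by the library's Hystrix module (check the S3Decorators README for the exact import path and signature):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.hubspot.s3.hystrix.HystrixS3Decorator;

// Wrap the underlying client once, at creation time; callers keep using
// the plain AmazonS3 interface and get circuit breaking transparently.
AmazonS3 s3 = HystrixS3Decorator.decorate(
    AmazonS3ClientBuilder.standard().build());
```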
And now every S3 call is wrapped in a Hystrix command.
Despite last week's outage, S3 is still incredibly reliable. For the majority of our services, the remediations we're putting in place to fail quickly and safely in the event of an S3 outage will be sufficient. There are, however, some critical services within our infrastructure that need to withstand a single-region S3 outage. For these use cases, we're investigating strategies like cross-region replication or syncing data to another provider such as Google Cloud Storage. We need to be careful, however, as it is easy to unintentionally add failure points that end up decreasing, rather than increasing, our availability. If you have any tips, tricks, or feedback, feel free to leave a comment below.