Post-Mortems at HubSpot: What I Learned From 250 Whys

At HubSpot, for the past two years, we've modeled our post-mortems on Eric Ries's 5 Whys.  When we started, I served as the facilitator for basically all of them -- over time, we've added other folks into that role.

It's been a deeply interesting learning experience.

I've learned about how HubSpot's systems work, why they sometimes break, and what we can do to make them more resilient.  Beyond that, I've learned a lot about complex systems and failure in general.  Which, in case you're wondering, is a fascinating topic.  I highly, highly recommend Richard Cook's essay "How Complex Systems Fail" in O'Reilly's Web Operations.  Or Atul Gawande's Complications and Better.  Or basically anything John Allspaw writes.  

If you'd like to build resilient systems, here's some of what I've learned from the fifty-plus 5 Whys I've been a part of. (And by "systems," I mean systems of people + machines, and by "resilient," I mean I'm stealing from Allspaw.)

 

Let's Plan for a Future Where We're All As Stupid as We Are Today

This is a somewhat specific detail, but it comes up a lot, so I wanted to pull it out.  If you run a bunch of 5 Whys, you'll find that, a lot of the time, the developer who made the first-order mistake (forgot to copy configs from QA to Prod, or deployed two apps out of order, or whatever) will say "Look, this was totally my fault, I screwed up, that's the whole story.  I'll be more careful next time."

The very short summary of which is: We're going to fix this problem by being less stupid in the future.

Which, well, you can guess how that's going to turn out.

I now start all the 5 Whys by saying "We're trying to prepare for a future where we're all just as stupid as we are today."  (Actually, my more nuanced view is that we're all really smart on occasion, and really stupid on occasion, to varying degrees, but none of us is never stupid.)

Or, I sometimes say:

"Someone made a mistake, but we all somehow built a system that turned that mistake into a disaster.  This is an opportunity to improve that system."

 

Less Root Cause, More Broadest Fix

One lesson from the study of complex systems is that there is almost never a single root cause.  Really bad failures only happen because several things conspire.  A developer made a mistake, and our automated testing was weak, so we didn't catch it before it went live... but the pain was magnified because our monitoring was weak, so we didn't notice until it had done significant damage to a large number of customers.

In practical terms, just about every 5 Whys has a branch point, where there are two roads you could go down (testing vs. monitoring is a classic one at HubSpot).

When I lead them, I say something like "We're not really looking for 'root causes' so much as identifying the areas where an improvement would prevent the broadest classes of problems going forward."

I find that perspective useful -- instead of people debating "Which of these two causes is the real one?" you're asking them "If we made A slightly better or B slightly better, which would prevent the broadest class of future problems?"  I've had better luck with that conversation.
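One way to make that question concrete is to tag each past post-mortem with the areas that contributed to it, then count how many incidents each area touches. Here's a minimal sketch of that tally. The incident names and tags are made up for illustration; this is not HubSpot's actual data or tooling.

```python
from collections import Counter

# Hypothetical post-mortems, each tagged with the areas that contributed.
# These names and tags are invented purely for illustration.
postmortems = {
    "incident-a": {"testing", "monitoring"},
    "incident-b": {"config-mismatch", "monitoring"},
    "incident-c": {"deploy-order", "monitoring"},
    "incident-d": {"testing", "config-mismatch"},
}


def rank_areas(postmortems):
    """Count how many incidents each contributing area shows up in."""
    counts = Counter()
    for areas in postmortems.values():
        counts.update(areas)
    # Most frequently implicated areas first.
    return counts.most_common()


ranking = rank_areas(postmortems)
for area, n in ranking:
    print(f"{area}: contributed to {n} of {len(postmortems)} incidents")
```

The area at the top of the list is a candidate for the "broadest fix," though the count is only a conversation starter: it says nothing about how bad each incident was or how expensive the improvement would be.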

Oh, and speaking of testing vs. monitoring...

 

Err on the Side of MTTR, Not MTBF

Asking "What would prevent the broadest class of future problems?" has led us pretty steadily towards improving Mean-Time-To-Repair, and focusing a lot less on Mean-Time-Between-Failures.

Incidentally, this is also more John Allspaw.  Last winter, Jeremy Katz went around the office grabbing everyone by the lapels and forcing them to read this blog post:

MTTR is more important than MTBF (for most types of F)

If you can swiftly identify failing pieces of your system, and swiftly recover from those failures, you're in good shape for a wide variety of different failure types (I can think of specific 5 Whys triggered by all of the below):

  • Bugs in code
  • Mismatches between QA and Prod configs
  • Out-of-control thread pools eating all of memory
  • DNS problems
  • Network partitions
  • Spontaneous machine/instance failures
  • Full data center outages 

We host a lot of stuff at EC2, and had an all-things-considered pretty good day during the Amazonpocalypse -- see more below.

If you're reading this, and thinking, well, duh, I'll just say: the idea of preventing failures is very, very seductive to people.  In fact, it may be very seductive to the people who sign your paycheck.  Because, come on, what's the alternative -- accepting failure?  Why would we do that?  Don't we care about the customer?

Consider the 5 Whys as a way to build up evidence that recovering from failures is a much better economic bet than trying to prevent them.
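For a rough feel for the arithmetic, here's a small sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The hour figures are hypothetical, picked only to show the shape of the tradeoff, and the function name is mine.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, chosen only for illustration.
baseline = availability(mtbf_hours=200, mttr_hours=4)
double_mtbf = availability(mtbf_hours=400, mttr_hours=4)  # fail half as often
halve_mttr = availability(mtbf_hours=200, mttr_hours=2)   # recover twice as fast

print(f"baseline:    {baseline:.4f}")
print(f"double MTBF: {double_mtbf:.4f}")
print(f"halve MTTR:  {halve_mttr:.4f}")
```

In this simple model, halving MTTR buys exactly the same availability as doubling MTBF. The case for MTTR is that fast detection and recovery helps with every failure type in the list above, while prevention work tends to target one type at a time.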

 

Never Let "Slow Down" Be The Answer

In addition to forbidding a future in which we're less stupid, it's also useful to try to take "slow down" off the table at the outset.  In fact, I like to sometimes think of 5 Whys as "How can we address this problem so we can go even faster than we currently do, while causing less pain to customers?"

That's not strictly a part of 5 Whys, but people tend to underestimate the economic costs of slowing down, so it's good to make sure that you don't retrospective your way into waterfall.  HubSpot has a pretty deep cultural commitment to velocity, but even so, I've found it useful to say this out loud on occasion.

If you want to learn more about the hidden costs of slowing down, give Donald Reinertsen's Principles of Product Development Flow a careful read.  It's a deep, insightful analysis of all sorts of marvelous product development geekery: the cost of queues, the power of small batches, techniques for decentralized control, etc, etc.  Really just amazingly good.

 

Our Biggest Win: The Product Quality Crisis

In the fall of last year, after a particularly nasty spate of customer-facing bugs, Brian Halligan (our CEO) said "Guys, what the hell is going on, we've got to stop inflicting so much pain on our customers."  And he sent a bunch of us off to come up with ideas for improvements.

This was, in my estimation, where our history of 5 Whys really paid off.  Because, if you ask a bunch of people "What should we do so we stop having so many bugs in production?" most of them will come up with some variation on "Slow down and add more human review." It just feels intuitively right to a lot of people -- surely we're having bugs because our developers, who are speed cowboys, are being careless, so we should slow them down and double-check their work, right?

Yoav Shapira, Jeremy Katz, and I took a careful look back over the 5 Whys we'd run, and the message was clear: It was our systems that were unreliable, not our developers.  We needed to improve our monitoring, we needed to make our deploys more deterministic, we needed to match configs from dev to QA to Production.  We were able to stand up in front of leadership and say "If we'd released less often, and had someone QA every release... we'd still have had 80-90% of our worst problems hit us."

Not long after this, Jeremy (one of our strongest developers) was asked to set up a new team to focus on Tools & Infrastructure for the developers.  That's a serious investment on the part of our leadership. I'm not sure they would have made it without such a clear vision of where we'd been getting tripped up.

 

Hard Choices are Still Very Hard

Overall, we've been very happy with the wins we've gotten from our post-mortems.  I recommend 5 Whys highly.  But they don't magically make your problems go away.  For example, I've sat in on, or led, probably a half-dozen different 5 Whys, where we ended up with a root problem involving this one specific, painfully unreliable, legacy system.

So you ask: Why can't you incrementally improve that system?

We have.  Somewhat.  However, not only is the system legacy and seeing almost no active development, but it runs on an OS that we've deprecated for new server-side code.  Almost all the improvements (even modest, incremental ones) would require expertise we don't really have on our team.  Do we hire that expertise?  Train someone up in it?  Hope that our ongoing, gradual rewrites will get enough critical code off that system + OS eventually?  Buckle down and just port everything off that OS in one (incredibly expensive) fell swoop?

We haven't found a great answer.

And so we keep on having occasional, sometimes pretty nasty, issues, where we're facing that system again.

I think the lesson here is: No post-mortem process will excuse you from making hard economic choices.  It helps you frame them well, but -- not infrequently -- you're still facing a difficult tradeoff.

 

Where We Are Today

The Amazonpocalypse was, for us, as for many startups, an intense test of our systems' resilience.  How'd we do?

Overall, pretty well. Customers saw little downtime -- pieces of our app were unavailable at various times, but the overall system was up and running with pretty much no interruption. We got a bit lucky, but we've definitely built, between our code and our people, a pretty resilient system. The 5 Whys discipline helped give us:

 - Solid, high-quality monitoring

We found out about the issues at 4 am, and had people failing over key databases to slaves within an hour or two. Two years ago, our monitors failed mysteriously pretty much every night, and everyone had learned to ignore them.  Many post-mortems had pushed us to clean those up and make sure someone was listening when they went off.

 - Clear team ownership over systems

We've embraced the DevOps model, and it served us incredibly well.  The developers who own and live with our key systems were able to put them into a degraded, but functioning state, and then work to restore full functionality.  Clear ownership has been another post-mortem theme.

 - Fast, reliable deploys

Over the day, people pushed out new code to dozens of servers -- putting in place messaging for customers, switching off non-critical functionality, pointing at new database masters as they came up.  Our deploy process has been subject to a long series of 5 Whys-driven improvements--to the point that we all depend on it unreservedly in the midst of a nasty crisis.

We've got a lot of work to do, but I think we all came out of that day pretty proud of where we are.

 

Some Practical Tips for the Moderator

The follow-up Eric Ries post on how to conduct a 5 Whys is very good.  In the spirit of that post, I've included some additional notes on how we run ours:

Start by framing the "Bad Thing" in terms of customer/economic pain

People will say "We pushed a bug for the leads app," and I try to turn that into "From such-and-such a time to such-and-such a time, customers were unable to view the public leads details screen. There were N reports of customer complaints from support (or we've looked into the logs, and Y customers were impacted)."

It really helps to understand just how Bad the Bad Thing was (so you can decide how much effort you're going to spend on corrective actions).

Write it all out on a whiteboard as you go

My opinion: your job as facilitator is not to identify the causes/fixes, it's to build consensus in the room about what those causes/fixes are.  Your expertise is valuable in pushing the room to think about different ideas, or in helping identify something that they're all circling around, but you should often be asking "Why was the system set up that way?" or "If we fixed this, would the problem have been prevented?"  They own the answers.  Write it all out, see if people agree.  If no one's offering any ideas, write something down to spark conversation.  But make sure you're listening more than you are doing your own analysis.

Don't be afraid to call on people

Often, there's a developer or support person sitting in a corner, not saying much.  It's really good for the facilitator to draw them in, e.g. "Does that sound right to you?" or "Was there anything else going on while you were trying to fix this?"  Definitely do that if you sense that anyone is uncomfortable with the path you're going down.  If half the room is nodding, but one person is looking uncertain, you want to find out why.

Try really, really hard to have everyone in the same physical room

Because of the importance of reading people's emotions, as above, I personally hate having people call in via conference call for 5 Whys analysis.  We do it on occasion, but I fight it pretty hard.  Skype has been marginally better, but it still ain't great.  I like to have no more than about eight people, max, in the room.  It's hard to have a really good conversation with more.

Want to work with us to make our system world-class?  If so, we want you, frankly, more badly than we should likely admit.  

 


Written by Dan Milstein
