Last week, I attended the Velocity Conference on web performance and operations out in California. The conference itself was excellent: a ton of great talks on topics like post-mortems, building systems for failure, failures people have actually hit, and how to successfully embrace developer ownership of product operations. Basically, a lot of the things we embrace and encounter every day at HubSpot.
On Tuesday night, there was a set of Ignite talks. If you've never been to one, the idea is that you have 5 minutes to give a talk. Pressure, right? Well, to make it more fun, you also have 20 slides that auto-advance every 15 seconds. Okay, lots of pressure. And somehow I talked myself into submitting a proposal, which was accepted, so I then had to write a talk.
As with many of the Velocity talks, mine was a story. About six months ago, I shifted to running a team that is responsible for automating, standardizing, and improving the infrastructure that powers HubSpot. Part of this has involved a bit of a mindset shift, from just being a developer to caring a bit more about the operations side of things. In doing so, I've noticed some differences that have made me feel like the work I'm doing is reactive, always arriving just a little too late. Part of this is because the goal of operations work is for nothing to go wrong... which means that when you build something well, nothing happens! So you never get to some of the preventative work you'd like to do.
But then, inevitably, the bad thing occurs. And then you do some sort of post-mortem. One of the things we've found in a lot of ours is that we could have detected the problem faster ourselves, instead of having our customers report it. So a lot of our corrective actions end up being things that would let us do exactly that: alerts in our monitoring system, automated testing, and similar ways to find out sooner.
From here, I talked a bit about how we fared with the Amazon EBS problems a couple of months ago. Overall, we fared pretty well. But there were some areas where we had problems. One of these was a database that was not being backed up according to our standards. We were lucky, and it wasn't a lasting problem. But as part of the post-mortem, I went ahead and backed up the server. I then put a monitor in place to ensure the server stayed backed up. Then I generalized the monitor so that it applied to all of our servers. And that's when we discovered there were a few other servers on the brink of disaster. So I backed those up too. Coincidentally, my team was working on the automated build process for our database servers of the future, so we built backups into that process so that new servers would get them automatically.
But I had a little bit of an inspiration here. As a developer, I would do test-driven development: write a unit test, then get my code to make the test pass. As an infrastructure developer, I was doing monitor-driven infrastructure development: write a monitor, then make the infrastructure match the state it checks for. Pretty powerful; I could do it with our old stuff or with new things we were building. And adding a monitor to start with is cheap; from there you can make easier value judgments about the cost of not making preventative improvements.
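The monitor-driven approach above can be sketched as a small backup-freshness check. This is a minimal illustration, not HubSpot's actual tooling: the directory layout, the 26-hour threshold, and the Nagios-style exit codes (0 = OK, 2 = CRITICAL) are all assumptions. The idea is that you write this check first, watch it fail, and then fix the infrastructure until it passes.

```python
import os
import time


def newest_file_age_hours(backup_dir):
    """Return the age, in hours, of the most recently modified file
    in backup_dir; infinity if the directory is empty."""
    mtimes = [
        os.path.getmtime(os.path.join(backup_dir, name))
        for name in os.listdir(backup_dir)
    ]
    if not mtimes:
        return float("inf")  # no backups at all
    return (time.time() - max(mtimes)) / 3600.0


def check_backup(backup_dir, max_age_hours=26):
    """Nagios-style check: 0 (OK) if a recent backup exists,
    2 (CRITICAL) if the newest backup is older than the threshold.
    The 26-hour default is a hypothetical value for a daily backup."""
    age = newest_file_age_hours(backup_dir)
    if age > max_age_hours:
        print(f"CRITICAL: newest backup in {backup_dir} is {age:.1f}h old")
        return 2
    print(f"OK: newest backup in {backup_dir} is {age:.1f}h old")
    return 0
```

Run against every server's backup directory from your monitoring system, this check fails for any server that isn't being backed up, whether it's an old machine you forgot about or a new one your build process missed.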
You do have to be careful that your monitors aren't noisy in the process, but it's working. I finally don't feel like I'm always behind, getting to things just too late. Instead, I'm getting to things before they explode, which in the end means a more stable system, happier developers, and happier customers using our product.