Slacking hard, or hardly Slacking: Automating infrastructure at scale

Here at HubSpot, automation is king. If you’ve got a bug, all you have to do is fix it once and it’s gone. But if you try to ignore a human error? Get ready — you’ll definitely see it again.

This preference drives our team’s worldview, and you can see it in our organizational structure (we don’t have any teams that we’d consider to be DevOps). Instead, we have a Platform as a Service model, meaning that a substantial fraction of our engineers are devoted to providing services to enable the whole engineering team. Our mission as platform engineers is simple: we want to provide HubSpot’s engineers with superpowers.

On the infrastructure team, we’ve bought or built a variety of systems that maintain a large fleet of cloud instances running a heterogeneous mix of hardware and offer a variety of datastores. We attempt to optimize every step of the development pipeline, from a seamless deployment experience to the initial setup of a new hire’s laptop, from lightning-fast build times to intuitive monitoring (and more). We aim to reduce the complex operational functions of running a business at scale to easy-to-handle tasks that are safe and offer minimal operational overhead.

As our team has expanded, and as our automation has expanded in lockstep, we’ve started to experience some growing pains. Our team’s strength — our willingness to buy or build systems to fix problems — is also its fundamental weakness. As we scale, we encounter more and more problems, and as we encounter problems, we buy or build more and more systems. Eventually it becomes difficult for any single engineer to know everything our platform can offer, or to know where to look for a specific service.

This situation can be even worse for new hires, since the context they need to absorb (much of which can feel like tribal knowledge that everyone else automatically knows) is staggeringly large. For us, this reached a breaking point as we hit 40 product teams. We started hearing feedback loud and clear from our bewildered coworkers: knowing where to go had become a real issue.

Luckily, we use Slack for communication. It’s one of the few constants at HubSpot; every employee is on Slack and easily reachable. In an effort to help support our team, we started operating a lightweight support channel through Slack where anyone on the team could ask questions and get answers. This worked quite well for a while, allowing our team to present a responsive and human front to our fairly autonomous systems. But that, too, has started to scale poorly, as the load on our platform engineers grows more or less linearly with the size of the organization (and our team is growing quickly).

Unsurprisingly, we tackled this issue as we often do — with automation. Fortunately, Slack isn’t just an instant messaging service; it’s also a platform. As we started digging in, we found that their API was well-crafted, and it was surprisingly easy to build sophisticated workflows.

We started by addressing a simple issue: which channels are for which kinds of questions? We set up bots that post a clear message when users enter a channel, to cut down on irrelevant posts. This turned the tribal knowledge of which channels should be used for which subjects into knowledge that could be easily found or verified. The messages were a simple change, but they were extremely effective. We quickly realized that there was vast, untapped potential in building our automation and tooling directly into Slack, since this single system powered essentially all of our communication and collaboration.
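
To make the greeting idea concrete, here is a minimal sketch rather than the actual bot. It assumes the bot receives Slack's `member_joined_channel` event and answers through the `chat.postEphemeral` Web API method, which shows a message to one user only. The channel IDs, guide text, and the `build_welcome` helper are all hypothetical.

```python
from typing import Optional

# Hypothetical channel IDs and guidance text, for illustration only.
CHANNEL_GUIDES = {
    "C0PLATFORM": "Ask platform questions here. Please thread your replies.",
    "C0DEPLOYS": "This channel is for deploy announcements only.",
}


def build_welcome(event: dict) -> Optional[dict]:
    """Turn a member_joined_channel event into chat.postEphemeral arguments.

    Returns None for other event types or for channels without a guide.
    """
    if event.get("type") != "member_joined_channel":
        return None
    guide = CHANNEL_GUIDES.get(event.get("channel"))
    if guide is None:
        return None
    return {
        "channel": event["channel"],
        "user": event["user"],  # only the joining user sees the message
        "text": f"Welcome, <@{event['user']}>! {guide}",
    }
```

Keeping the function free of network calls means the wording and channel mapping can be tested without a Slack workspace; the returned dictionary would be handed to whichever client sends the request.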

As we started to invest more time and effort building on top of Slack, we identified three specific problems that most needed solving. First, engineers had trouble knowing where to go and whom to ask for answers about a specific system. Second, too much time was wasted polling systems while waiting for asynchronous tasks to complete. And third, context switching was hard: engineers often have to switch contexts when, for example, a job finishes while they are working on something else, and we wanted to minimize the time it takes them to switch between tasks and gather the information they need. While these look like three quite different problems, Slack’s platform made it easy for us to build effective solutions for each.

To tackle the first issue, we focused on automating the routing of questions or requests in our internal support channel. Now, when engineers post anything in that channel, the text of their post is analyzed and a bot posts a reply offering routing options. The logic is deeply integrated with our internal team management software, so we can route questions to teams based on what they own. This eliminates the need to know who to ask (or even where to go) beyond one Slack channel.
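
The routing step can be pictured with a small keyword matcher. This is only a sketch under stated assumptions: the post does not describe how the text analysis works, so simple keyword overlap stands in for it, and the team names, keywords, and `suggest_teams` function are invented for illustration. In practice the ownership data would come from the team management software mentioned above.

```python
import re
from collections import Counter
from typing import List

# Hypothetical ownership data: team -> keywords for the systems it owns.
TEAM_KEYWORDS = {
    "build-and-deploy": {"build", "deploy", "rollback"},
    "data-infra": {"database", "queue", "replication"},
    "infra-security": {"permissions", "audit", "credentials"},
}


def suggest_teams(text: str, max_suggestions: int = 2) -> List[str]:
    """Rank teams by how many of their keywords appear in the post."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    scores = Counter()
    for team, keywords in TEAM_KEYWORDS.items():
        hits = len(words & keywords)
        if hits:
            scores[team] = hits
    # most_common sorts by score, highest first.
    return [team for team, _ in scores.most_common(max_suggestions)]
```

A bot built this way would offer the returned teams as routing options in its reply and fall back to a generic prompt when the list is empty.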

To tackle the second issue, we began generating unique IDs for long-running tasks. Now, users can choose to receive a Slack notification when their tasks are complete, letting them avoid having to sit and poll for updates.
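
A rough sketch of the notify-on-completion pattern follows. It assumes an in-memory registry and a `notify(user, message)` callback standing in for whatever actually sends the Slack message; the `TaskTracker` class and its method names are hypothetical and not taken from our systems.

```python
import uuid
from typing import Callable, Dict, List, Tuple


class TaskTracker:
    """Tracks long-running tasks by unique ID and notifies subscribers on completion."""

    def __init__(self, notify: Callable[[str, str], None]) -> None:
        # notify(user, message) is a stand-in for sending a Slack message.
        self._notify = notify
        self._tasks: Dict[str, Tuple[str, List[str]]] = {}

    def start(self, description: str) -> str:
        """Register a task and return its unique ID."""
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = (description, [])
        return task_id

    def subscribe(self, task_id: str, user: str) -> None:
        """Opt a user in to a completion notification (at most once)."""
        _, users = self._tasks[task_id]
        if user not in users:
            users.append(user)

    def complete(self, task_id: str, status: str) -> None:
        """Notify every subscriber, then forget the task."""
        description, users = self._tasks.pop(task_id)
        for user in users:
            self._notify(user, f"{description} ({task_id}) finished: {status}")
```

The unique ID is what lets a user subscribe to one specific run of a task instead of polling it.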

And finally, to tackle the issue of context switching, we began experimenting with custom-built Slack integrations that pack meaningful information and actions into our alerting platform. This minimizes the work that engineers need to do to gather context when interrupted.
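
One way to picture such an alert is a single message that carries both the context and the follow-up actions. The sketch below assumes Slack's Block Kit message format (a `section` block plus an `actions` block with `button` elements). The alert fields, action IDs, and `build_alert_message` function are hypothetical, and the post does not say which message format the real integrations use.

```python
def build_alert_message(alert: dict) -> dict:
    """Pack an alert's context and follow-up actions into one Slack message."""
    summary = (
        f"*{alert['service']}*: {alert['summary']}\n"
        f"Owner: {alert['owner']} | Last deploy: {alert['last_deploy']}"
    )
    return {
        # Plain-text fallback shown in notifications.
        "text": f"{alert['service']}: {alert['summary']}",
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Acknowledge"},
                        "action_id": "ack_alert",  # hypothetical action ID
                        "value": alert["id"],
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Open dashboard"},
                        "action_id": "open_dashboard",  # hypothetical action ID
                        "url": alert["dashboard_url"],
                    },
                ],
            },
        ],
    }
```

Putting the owner, the last deploy, and the next actions in the alert itself means an interrupted engineer does not have to visit several systems to reconstruct what happened.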

There is much more we could build on top of Slack, but we already feel we’ve made real progress on some of the biggest obstacles to developer productivity. In our most recent developer survey, these were some of the responses we got from members of the team:

  • Love the addition of the slack bot that suggests on call aliases based on question keywords. Really nice feature.
  • PI Bot routing questions automatically in #platform-support and encouraging threading of questions is absolutely brilliant.
  • The new slack bot in platform support to direct to the correct on-call person is awesome.
  • I really like the platform support question/inquiry router.

These efforts have gotten other internal teams excited about building on top of Slack: our Build and Deploy team is working on allowing teams to deploy code directly from Slack, and our Infrastructure Security team is exploring how they could allow engineers to manage permissions and do audits directly from Slack. And these have all been relatively easy to accomplish — Slack is a helpful, clear, and capable partner.

We’re excited to announce that we’re open sourcing our Java client for Slack. Check out our endorsement on Slack’s community page here: https://api.slack.com/community, and the code on our GitHub here: https://github.com/HubSpot/slack-client.

We’ve invested heavily in offering a straightforward API for an asynchronous Netty-based HTTP client, with robust utilities for rate limiting, debugging, and auditing built in, drawing on our deep experience building clients internally. It’s extensible and battle-tested, and we’d love to hear what you think and see what you build with it.

Happy Slacking!

Written by Elias Szabo-Wexler

Elias Szabo-Wexler is a tech lead on HubSpot's platform infrastructure team.
