HubSpot has been an Amazon Web Services (AWS) customer for ten years now. Our footprint includes almost 2,500 EC2 instances, many petabytes of data on EBS and S3, and more than a petabyte of web traffic flowing through over a hundred different ELBs each month. AWS's offerings have been a huge driver of our growth because they have allowed us to easily scale our infrastructure up or down as our needs have changed. Furthermore, running our infrastructure in the cloud allows our engineers to focus on building HubSpot instead of building a data center.
However, the cloud provider landscape has changed dramatically since 2008; where there was once just Amazon, there are now a handful of choices. So when we started planning the international expansion of our computing footprint, we decided we should check out our options.
HubSpot's engineering team has always been a great admirer of Google. We use a lot of their open source libraries -- Guice, Guava, gRPC, and Protobuf, to name a few -- and had been curious about Google Compute Engine ever since we caught wind of it at Google I/O back in 2012. In 2017 we started adopting Kubernetes, Google's open-source platform for managing containerized workloads, for our data infrastructure. Our first big project has been migrating all 400+ of our MySQL databases from standalone instances into Kubernetes with the help of the Vitess project. All signs pointed to Google Cloud Platform as an option worth exploring.
Google's networking offering was the first thing that impressed us. We liked that Google's global anycast network terminates traffic at the Google point of presence nearest the client and routes it over Google's fast internal network rather than the public internet. We also liked that Google allowed for a flat IP space between regions before Amazon did (Amazon released inter-region VPC peering in November 2017). Support for multi-path routing, tag-based routes and firewall rules, as well as fast cross-zone networking within a region, only sweetened the deal.
There were also a number of compelling hosted services we could take advantage of if we decided to use Google Cloud. HBase sits at the heart of HubSpot's data platform -- it handles 530+ terabytes of data per day and up to 80 million requests per second. We've also long admired Google's Bigtable, the NoSQL big data service that powers many core Google services, including Search, Analytics, Maps, and Gmail, and we were excited about being able to explore leveraging that power in the HubSpot stack.
But talk is cheap. Launching an entirely new computing environment on a cloud provider we had no experience with is no easy task, and we weren't ready to take that leap without knowing it would be an improvement for our customers. The only way to truly measure the impact would be to have Google Cloud Platform serve production traffic. So we settled on a multi-provider approach: we'd keep our pre-existing infrastructure on AWS and use Google Cloud for our new international expansion. Because we treat our infrastructure as code, we were well equipped to take the chance, especially if it could drive improvements in reliability and customer experience.
Infrastructure as code
Many companies treat their infrastructure as a bespoke process. You need a cluster? File a ticket and someone will set it up for you eventually. You need to launch a new microservice? You’ll need the expertise of a release engineering team (or more often, some lady named Jane who can cobble a script together for you). You need a high-throughput, reliable queuing system? Good luck. You may have to figure out what technology to use and how to set it up yourself.
There’s an infrastructure concept called pets and cattle. Pets are things you treat with care; they’re the most critical parts of your infrastructure. They’re the hosts that must never go down. You can usually tell something is a pet because it has a very specific name. Cattle are the opposite -- they’re the things you don’t care about. The pieces of infrastructure that can fail and it won’t matter because you’ve got five more of them, or some automated way to fix it.
At HubSpot, we strive to treat all our infrastructure like cattle through automation. The less important any one piece is and the more automation you have in place, the less energy and focus it takes to operate. We use a number of open-source tools to help accomplish this, including Mesos and Kubernetes for running applications, and Puppet for ensuring that all of our instances are configured the same way.
We've also built some of our own tools to help manage our infrastructure more effectively. Rainmaker, a web app we started building in 2012, provides a simple wrapper around the AWS console and implements features like access control and auditing that AWS itself didn't always offer. We also abstract many AWS-specific concepts in Rainmaker to reduce cognitive load on our engineers. Exposing just the information and actions that HubSpot engineers need in order to do their jobs helps things get done faster and with fewer mistakes.
Integrating Google Cloud Platform into Rainmaker turned out to be pretty easy. Being able to leverage most of our pre-existing code and automation allowed us to get a proof-of-concept working in record time.
Google Cloud Platform in 30 days
We put Google Cloud Platform through a variety of tests to assess its performance and reliability and help us make our decision. We needed to make sure that Google could run any workload or datastore we were currently running in AWS, and that communication between AWS and Google was fast and reliable enough for services spread between the providers to function smoothly.
But the first step was to set up HubSpot's Google Cloud environment itself. This meant creating machine images to match our EC2 AMIs, getting Rainmaker to properly provision Google instances, and setting up a VPN between AWS and Google. With all this in place, we were able to spin up identical instances in either provider and have them communicate with each other with no issues.
With all the basic plumbing in place, it was time to deploy some actual applications. The first step was setting up Singularity, our open-source Mesos scheduler. This included a Mesos cluster, a ZooKeeper cluster, and a MySQL database (for storing “cold” data to minimize load on ZooKeeper). Then we updated Orion, our deployment orchestration tool, to be aware of this new Singularity cluster as a location engineers could deploy to. The last step was provisioning Google Cloud Load Balancers and wiring them up to our NGINX load balancers running in the new Singularity cluster so that the outside world could access our services.
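To make the "deploy location" idea concrete, here's a minimal sketch of what making a deploy tool aware of a new cluster can look like: a registry of named locations that hides the provider behind a lookup. All names, fields, and URLs below are hypothetical illustrations, not Orion's or Singularity's actual API.

```python
from dataclasses import dataclass

# Hypothetical model of a deploy location: engineers pick a name,
# the tooling resolves it to a concrete cluster. Field names and
# endpoints are invented for illustration.

@dataclass(frozen=True)
class DeployLocation:
    name: str             # location identifier engineers choose
    provider: str         # "aws" or "gcp"
    region: str           # provider-specific region
    singularity_url: str  # scheduler API endpoint for this cluster

LOCATIONS = {
    "us-east-1": DeployLocation(
        "us-east-1", "aws", "us-east-1",
        "https://singularity.useast1.example.com"),
    "europe-west3": DeployLocation(
        "europe-west3", "gcp", "europe-west3",
        "https://singularity.euwest3.example.com"),
}

def resolve_location(name: str) -> DeployLocation:
    """Look up the cluster to deploy to; callers never touch provider details."""
    try:
        return LOCATIONS[name]
    except KeyError:
        raise ValueError(f"unknown deploy location: {name}") from None
```

With an abstraction like this, adding a Google Cloud region is a new registry entry rather than a change to every service's deploy logic.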
We decided to test our collection services, because we collect a lot of customer data through forms on websites, and track thousands of individual customer metrics through our analytics pipeline. We were hoping that running our EU customers' data through Google Cloud in Frankfurt would be a bit faster than it had been using AWS. So we set up some tests using a handful of dummy accounts from Stockholm, Berlin, and even Australia, and the results were, to put it lightly, stunning. It wasn't a little bit faster. It was a full 5 times faster. We were excited, because that would be a Very Big Win for our customers.
But before we moved full steam ahead with Google Cloud for our international infrastructure, there were a few things we needed to figure out. We were concerned about the live migrations Google uses to roll out updates and move workloads away from bad hardware. They seemed magical: exciting, but also scary.
These migrations can occur at any time on any host, and for write-heavy workloads like HBase, live migrating seamlessly from one VM to another while handling millions of requests per second seemed impossible to pull off. We did not want our customers to experience any lag or delay while updating contact information or tweaking their campaigns. Replicating that firehose of data across the globe in real-time would be like trying to change the tires on a bus while it’s moving at 1,000 mph.
We’d had years to learn all the ways that things can break in AWS, and had built up a ton of tooling to handle these cases. We put in robust notifications for when servers go down, and built automation to replace servers that are failing so that we’d have no latency due to maintenance or server issues.
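The replace-on-failure automation described above boils down to a reconciliation loop: check each host's health, count consecutive failures, and provision a substitute once a host crosses a threshold. The sketch below is a hedged illustration of that pattern under assumed names and thresholds; it is not HubSpot's actual tooling.

```python
# Illustrative repair loop: callers supply the health check and the
# replacement action, so the loop itself stays provider-agnostic.
# The threshold of 3 is an assumption for the sketch.

FAILURE_THRESHOLD = 3  # consecutive failed checks before replacing a host

def reconcile(instances, check_health, replace, failures):
    """Run one pass of the repair loop.

    instances:    iterable of instance ids
    check_health: callable, instance id -> bool
    replace:      callable invoked with an instance id to provision a substitute
    failures:     mutable dict tracking consecutive failures per instance
    Returns the list of instances replaced on this pass.
    """
    replaced = []
    for instance in instances:
        if check_health(instance):
            failures[instance] = 0  # healthy again: reset the counter
            continue
        failures[instance] = failures.get(instance, 0) + 1
        if failures[instance] >= FAILURE_THRESHOLD:
            replace(instance)
            replaced.append(instance)
    return replaced
```

Requiring several consecutive failures before acting keeps a single flaky health check from churning otherwise healthy hosts.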
In order to verify that response times remained consistent during those migrations, we replayed 7 hours of production traffic while working with Google's engineering team to force live migrations. As it turns out, we had no reason to worry — we saw no impact to our response times whatsoever.
That was the sign we needed to start moving forward with Google Cloud Platform.
We took what we’d learned from our testing and set up official QA and Production environments in Google Cloud Platform and a higher-throughput VPN between Amazon and Google. We then provisioned all our core pieces of platform infrastructure including LDAP slaves, DNS servers, and a Vault cluster. We deployed everything we needed, and spent the next couple of weeks chasing down loose ends and making sure we were ready to rumble.
Our platform team did all of this heavy lifting in about 30 days. Now, any HubSpot engineer can decide they want to run their service in Frankfurt on Google Cloud Platform by making a one-line change in a configuration file, and boom, it's there.
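As a purely illustrative example (the file name and keys are invented, not HubSpot's real schema), that one-line change might look like appending a location to a service's deploy configuration:

```yaml
# deploy.yaml -- hypothetical per-service deploy configuration
service: forms-collector
locations:
  - aws-us-east-1     # existing AWS deployment
  - gcp-europe-west3  # the one added line: run in Frankfurt on Google Cloud
```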
If you're using a cloud service provider, or are thinking about using one, we encourage you to put some thought into how your team interacts with your underlying infrastructure. Creating tooling that abstracts away the implementation details of your infrastructure will make it easier to expand and to take advantage of all the awesome, up-and-coming technology in the fast-growing world of cloud infrastructure. We've heard that companies can take up to two years to migrate from an on-premises solution to the cloud because they're so tied to their existing infrastructure. But by creating a layer of abstraction between your infrastructure and your services, by treating your infrastructure less like pets and more like cattle, and by using technologies like Mesos and Kubernetes, you give yourself the optionality to move fast and try new things.
We're happy to announce that we're partnering with Google on our international cloud infrastructure. We're excited to see how investing in Google Cloud Platform can help grow our company and our platform, and we're even more excited to see where it takes us in 2018.