Upgrading your data infrastructure may feel like an arduous task, but doing so through a thoughtful process can lead to increased performance, stronger reliability, bug fixes, and the ability to contribute to the open source community. In this post, Olga Shestopalova talks you through why updates are important and how to get better at doing them.
There are numerous benefits to continuously patching and upgrading the software that powers your products. Upgrading to new versions of critical systems is a challenging and high-risk endeavor. However, the benefits are huge, and when done right the costs and risks can be minimized. In this blog post we will share the top reasons to upgrade your software, what we’ve learned while executing our upgrades, and how you and your teams can get better at regular upgrades.
An easy reason to upgrade your software is for performance benefits. As software gets worked on, it should ideally get faster and more efficient. While that’s not necessarily true for all upgrades, it was certainly true for our datastores.
We ran an experiment comparing Elasticsearch search times between the version we were running and the target version we were upgrading to, and found that the new version was almost 30% faster and could sustain much higher request loads with lower CPU. Similarly, in our target version HBase saw lower CPU utilization, much better RegionServer GC metrics, generally improved latency and throughput, and lower exception volume.
Another no-brainer reason to upgrade is improved reliability and outage recovery. As HubSpot scales, so does our data. We need to expect outages and have playbooks ready, so our time to recover matters greatly.
For example, Elasticsearch 7 has better shard recovery and can handle network disconnections much better than previous versions; in older versions, a network disconnect would cause minutes to hours of recovery, but in Elasticsearch 7, shard recovery takes 6-10 seconds. This takes us from a major outage to a barely noticeable blip.
Security Patches and Bug Fixes
Not the most glamorous reason to upgrade, but nonetheless a very important one. As we aim to stay on top of security breaches, this proves to be a necessary reason to upgrade.
We recently upgraded our MySQL version to patch a security hole that our team found and mandated that we fix (we take security very seriously here). Less urgently, we are upgrading our Vitess version to pick up fixes for a number of bugs that impair our ability to operate efficiently, and once that is complete, we can remove a number of band-aid fixes and workarounds that we’ve had to implement in the meantime.
New features can make your life easier, operations simpler, and unlock more potential for functionality that you can pass along to your customers. At HubSpot, the question of when product teams will be able to use a new feature comes up frequently in our support requests.
For recent examples, upgrading HBase would provide us with an asynchronous client, upgrading Elasticsearch would give us more advanced shard routing, and upgrading Vitess would give us better, easier, and faster tooling for resharding. All of these new features have product teams clamoring for upgrades to happen as soon as possible.
The Ability to Contribute to the Open Source Community
Last but certainly not least, when your software is up to date, this gives you the opportunity to participate in and contribute to the open source community. Most open source projects do not accept patches for versions that are out of date. This means that any patches you develop against an out of date datastore need to be maintained by your teams. When you’re running up to date software you can push those patches into the open source version. This has the double effect of improving the system for the entire community while reducing the ongoing support burden of those patches for your engineers.
In rolling out these upgrades, we need the process to be safe, well-tested, and as hands-off as possible so we’re not sinking massive amounts of engineering time into the rollout and we’re not compromising our customer experience. For context, the scale at which we operate at HubSpot is immense: we have over 800 Vitess clusters, 1,000 Elasticsearch shards, 6,000 Kafka topics, and 800,000 HBase regions in each environment. Automation is the only way we can manage our datastores at scale. Therefore, we’re creating a runbook that anyone in Data Infra can follow, and that is standardized across datastores.
The focus of our upgrade process is to de-risk the upgrade itself by front-loading the work, so that the rollout is very smooth and carried out with a high degree of confidence. With certain upgrades, there is no easy downgrade path, so it’s important to conduct sufficient testing.
The first step of an upgrade is to read release notes, making note of any changes that could impact the particular setup that your team has. We also start a line of communication with product teams that use our datastores, providing a link to documentation on the upgrade process and any changes, along with guidance on what to watch out for.
Data Infra teams maintain a fork of the open source version of a datastore. This fork often has patches that add metrics, connect with our authentication layer, fix bugs, and make various enhancements. The next step in our upgrade process is to update our fork to pull in a more recent version of the datastore and then re-evaluate our patches, re-applying those we need and discarding outdated ones. This is particularly difficult for datastores that haven’t been upgraded in a while because a lot of the context on the patches has been lost to time. This problem is one reason we push as many patches upstream as possible, so that there are simply fewer patches to maintain and port across upgrades.
Once we have a working upgraded fork, we can start testing. HubSpot has several environments: production (the actual environment that customers use and see), QA (similar to production, but not visible to customers and used for product team testing), and testing (only visible to Data Infra teams). We start our testing process by rolling out the new upgraded datastore version to our testing environments and running tests against it to ensure that all the necessary functionality still works, as well as all of our automation and processes.
For the datastores that can, we run a job against all codebases that creates a branch of the code that depends on the new version of our datastore software, and runs all unit tests that exist to catch any incompatibilities statically. If any test failures are found, we work with the owning teams to resolve these. We then roll out the new version to all unit tests, ensuring that no new code is written that isn’t compatible with the upgrade.
If the software upgrade involves a client upgrade, we use a shimmed client so that we can include both versions and switch between them using a live config flag. This allows us to roll out the client change to select testing services first, with a slow and controlled rollout. Some client changes require product teams to perform code changes on their end as well. In this case, being able to import both versions of the client is important so they can perform testing and implement any changes that are needed on their own time and without breaking changes. We also lean heavily into automating code changes where possible, running a job that automatically creates pull requests for changes that match certain patterns, to reduce the toil for both Data Infra and product teams.
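To make the shim idea concrete, here is a minimal sketch in Python. It is not our actual client code: the class names, the `search` method, and the `use_new_client` flag are all hypothetical, and the "live config" is modeled as a plain callable that is re-read on every request.

```python
from typing import Callable, Protocol


class SearchClient(Protocol):
    """The small surface both client versions are assumed to share."""

    def search(self, query: str) -> list[str]: ...


class OldClient:
    # Stand-in for the currently deployed client version.
    def search(self, query: str) -> list[str]:
        return [f"old:{query}"]


class NewClient:
    # Stand-in for the upgraded client version.
    def search(self, query: str) -> list[str]:
        return [f"new:{query}"]


class ShimmedClient:
    """Holds both client versions and picks one per call from a live flag."""

    def __init__(
        self,
        old: SearchClient,
        new: SearchClient,
        use_new: Callable[[], bool],
    ) -> None:
        self._old = old
        self._new = new
        # The flag is read on every call, so flipping it takes effect
        # without redeploying or restarting the service.
        self._use_new = use_new

    def search(self, query: str) -> list[str]:
        client = self._new if self._use_new() else self._old
        return client.search(query)


# Hypothetical live config: a dict standing in for a real config service.
flags = {"use_new_client": False}
client = ShimmedClient(OldClient(), NewClient(), lambda: flags["use_new_client"])
```

Because callers only ever see `ShimmedClient`, flipping the flag back is all it takes to revert a service to the old client.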
All the testing in the world cannot account for 100% of all access patterns and query types running in production. It would be ideal to have a way to run actual production traffic against the upgraded datastore to see results, but have it not affect customers. So, we have come up with query replaying: the idea behind this is to be able to mirror traffic to an identical “shadow” datastore and compare results between the current version and the upgraded version without having any customer impact. We compare how a query performs (or fails) between versions, as well as the results of the query where possible, to catch both performance regressions and behavioral changes. This happens either online in real time as queries come in, or offline as an analysis job. This also helps us scale up new clusters correctly to be able to withstand real traffic.
Query Replay Example
Our HBase team implemented query replay in real time to a “shadow” upgraded cluster. They forked traffic inside of the client, using a discarding threadpool to not affect the actual query latency. Then, results and query latency were compared, and outliers were flagged for investigation. This helped them correctly scale new clusters and ensure that queries work as expected between the two versions.
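The real-time approach can be sketched roughly as follows. This is an illustration in Python rather than the HBase team's implementation: `ShadowReplayer`, its callbacks, and the `max_in_flight` limit are hypothetical names, and the "discarding" behavior is approximated with a semaphore that drops shadow work whenever the pool is already full.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


class ShadowReplayer:
    """Serves queries from the primary and mirrors them to a shadow cluster."""

    def __init__(
        self,
        primary: Callable[[str], Any],
        shadow: Callable[[str], Any],
        on_result: Callable[[dict], None],
        max_in_flight: int = 8,
    ) -> None:
        self._primary = primary
        self._shadow = shadow
        self._on_result = on_result
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        # One slot per worker: if none is free, the shadow query is dropped
        # instead of queued, so replay can never back up the caller.
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self.dropped = 0  # best-effort counter, not synchronized

    def query(self, q: str) -> Any:
        start = time.perf_counter()
        result = self._primary(q)
        primary_latency = time.perf_counter() - start
        if self._slots.acquire(blocking=False):
            self._pool.submit(self._replay, q, result, primary_latency)
        else:
            self.dropped += 1
        # The caller always gets the primary result, whatever the shadow does.
        return result

    def _replay(self, q: str, primary_result: Any, primary_latency: float) -> None:
        try:
            start = time.perf_counter()
            try:
                shadow_result, error = self._shadow(q), None
            except Exception as exc:  # shadow failures are data, not outages
                shadow_result, error = None, repr(exc)
            self._on_result({
                "query": q,
                "match": error is None and shadow_result == primary_result,
                "error": error,
                "primary_latency": primary_latency,
                "shadow_latency": time.perf_counter() - start,
            })
        finally:
            self._slots.release()

    def close(self) -> None:
        # Waits for in-flight shadow queries to finish.
        self._pool.shutdown(wait=True)
```

The records passed to `on_result` can then be filtered for mismatches, errors, or latency outliers to decide what needs investigation.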
Our Elasticsearch team opted to use offline query comparison: in their implementation, a sample of queries was produced to a Kafka topic, the consumers of which tested each query for compatibility against the new version and stored the results in a MySQL table. There was then a service that individual product teams could query to get a list of example queries they made and their respective incompatibilities, along with documentation on how to resolve them.
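The shape of that offline pipeline can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: a list of `(team, query)` pairs replaces the Kafka topic, an in-memory SQLite database replaces the MySQL table, and the keys in `INCOMPATIBLE_KEYS` are made-up rules, not real Elasticsearch deprecations.

```python
import json
import sqlite3
from typing import Iterable

# Hypothetical rules: query key -> remediation hint.
INCOMPATIBLE_KEYS = {
    "old_sort": "rename to new_sort",
    "old_range": "rename to new_range",
}


def find_incompatibilities(query: dict) -> list[str]:
    """Walks the query tree and returns a hint for every flagged key."""
    hints = []
    stack = [query]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            for key, value in node.items():
                if key in INCOMPATIBLE_KEYS:
                    hints.append(f"{key}: {INCOMPATIBLE_KEYS[key]}")
                stack.append(value)
        elif isinstance(node, list):
            stack.extend(node)
    return sorted(hints)


def record_samples(db: sqlite3.Connection, samples: Iterable[tuple[str, dict]]) -> None:
    """Checks each sampled query and stores one row per incompatibility."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS incompatibilities "
        "(team TEXT, query TEXT, hint TEXT)"
    )
    for team, query in samples:
        for hint in find_incompatibilities(query):
            db.execute(
                "INSERT INTO incompatibilities VALUES (?, ?, ?)",
                (team, json.dumps(query, sort_keys=True), hint),
            )
    db.commit()


def report_for_team(db: sqlite3.Connection, team: str) -> list[tuple[str, str]]:
    """Returns (example query, hint) pairs for one product team."""
    rows = db.execute(
        "SELECT query, hint FROM incompatibilities "
        "WHERE team = ? ORDER BY query, hint",
        (team,),
    )
    return rows.fetchall()
```

Keeping the example query next to each hint is what lets a product team see exactly which of their calls needs to change.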
Both approaches helped the respective teams catch real issues and resolve them without any impact to the customer, and in a safe and timely manner.
Once we’ve made it past query replay, we’re ready to start the rollout.
The first step in the rollout is to communicate our intent to product teams. We aim to have both an engineering-wide announcement as well as per-team announcements as their datastores are upgraded. The goal here is to not catch anyone unaware, so if there are unexpected issues, teams know what to look out for and where to go for help.
As for the actual upgrade/downgrade process, we need to automate as much as possible here, both to avoid human error and to keep up with our scale (for example, we have over 800 Vitess clusters per environment; upgrading those manually would take an enormous amount of time!). Depending on the datastore and the changes between versions, downgrades are not always easy or possible to automate, but we keep a tested playbook on hand in case of emergency.
For upgrade mechanics, each datastore is different: some datastores spin up entirely new clusters and move traffic over, and some have to upgrade on existing clusters in place, rolling restarting the servers to complete the upgrade. However, since we’re heavily invested in query replaying, both types of upgrades (new-cluster and in-place) have a “shadow” upgraded component, be that an extra cluster or an extra replica in an existing cluster. This has some unexpected benefits: for example, for clusters that did not accept the upgrade very well, we were able to iterate quickly by making code changes (either on the datastore end or on the product team end) and switching traffic to the “shadow” upgraded component and swiftly reverting traffic back if the change did not improve performance as expected. Once fully upgraded, we can keep around extra “shadow” downgraded components for as long as we like to have something to fall back on in case a problem arises down the line. In practice, we did not keep these around for very long because they add to the maintenance and operational burden.
Our rollout process allows QA and production rollouts to happen in parallel. This gives us a faster feedback loop, as some problems or queries only crop up in production and are never seen in QA. A cluster will always be upgraded in QA before production, but this gated rollout strategy allows us to give certain clusters more time to resolve issues or complete any required product team work without blocking the rest of the clusters.
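The gating rule is simple enough to express as a small scheduling function. The sketch below is illustrative only: `ClusterState`, its fields, and `next_actions` are hypothetical names, and real rollout tooling would track far more state than three booleans.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    name: str
    qa_upgraded: bool = False
    prod_upgraded: bool = False
    blocked: bool = False  # e.g. waiting on a fix or on product team work


def next_actions(clusters: list[ClusterState]) -> list[tuple[str, str]]:
    """Returns the (cluster, environment) upgrades that may run right now.

    Each cluster is gated independently: production only becomes eligible
    once that same cluster is upgraded in QA, and a blocked cluster never
    holds up the others.
    """
    actions = []
    for cluster in clusters:
        if cluster.blocked:
            continue
        if not cluster.qa_upgraded:
            actions.append((cluster.name, "qa"))
        elif not cluster.prod_upgraded:
            actions.append((cluster.name, "production"))
    return actions
```

With this rule, one cluster can be going to production while another is still in QA and a third is paused, which is what lets QA and production rollouts overlap.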
To summarize, we do a lot of testing up-front, in the form of product team unit tests, query replaying, and acceptance tests and then have a slow and confident rollout to QA and production.
To achieve our goal of staying up to date with our software and upgrading on a regular cadence, we will proactively seek out deprecated features and changes in future versions and notify the affected product teams ahead of time, making the upgrade process smoother and faster for everyone involved. Some datastore clients are also moving towards version-agnostic clients, so code migration work is being done today to prevent more work in the future.
The biggest pain point for all the Data Infra teams has been nonstandard usage of our datastores and libraries. Corralling product teams to keep to the recommended patterns and having decent unit test coverage have helped us test our upgrades and make no-touch changes where necessary.
This is an incredibly exciting time for Data Infra, and a time for a lot of growth of both people and code; we have so many improvements to look forward to with these upgrades.
The biggest learning here is that when you don’t upgrade for a long time, the next upgrade takes a lot more work and investigation than you expect. This upgrade initiative has helped us bolster our testing frameworks and automation, as well as our communication channels with product teams. It has been a big investment today that will make both upgrades and day-to-day operations much easier in the future.
These are the types of challenges we solve for on a daily basis at HubSpot. If projects like this sound exciting to you, we’re hiring! Check out our open positions and apply.