Performing migrations can be a thankless task. The old system, flaws and all, works. And everyone expects the new system to do everything the old one does, in addition to whatever new functionality is required. Even if users can be moved to the new system incrementally, people don’t have much patience for products behaving differently and functionality being broken or missing. This is something we learned the hard way this past year.
My team is one of the teams that maintains the infrastructure behind HubSpot’s Sidekick product, a freemium product that provides email productivity features such as email open and click tracking. Over the past year, we made two big migrations. First, in order to achieve a better balance of performance, cost, and control, we moved the primary storage for our email open and click feature from one data storage platform to another. This required copying and transforming hundreds of gigabytes of data and building an entirely new pipeline to reliably process clicks and opens into the new storage system.
Second, after several months of manually modifying settings in our billing backend to support our new Sidekick for Business product and pricing, we rewrote much of our billing system to standardize and automate the handling of multiple pricing levels. This required significant changes to our team membership flows and internal accounting.
Neither of these migrations was seamless. We followed the general principles behind our known tactics for safely rewriting a big API (such as gating and routing traffic incrementally), but we neglected a few practices that could have helped us avoid some potholes. So, here are four lessons we learned from these Sidekick migrations that will hopefully make your next migration a smoother process.
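For the gating part, one common approach is to bucket users deterministically so that a given user always lands on the same side of the split, then raise the percentage routed to the new system as confidence grows. Below is a minimal sketch of that idea; the function name, rollout parameter, and routing calls are illustrative assumptions, not the gating system we actually use.

```python
import hashlib

def use_new_pipeline(user_id: str, rollout_percent: int) -> bool:
    """Deterministically decide whether this user should be served by the new system."""
    # Hash the user id so the same user always falls into the same bucket (0-99).
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < rollout_percent

# Example: start by routing 10% of users to the new storage pipeline,
# then raise the percentage as the new system proves itself.
if use_new_pipeline("user-42", rollout_percent=10):
    pass  # route the request through the new system
else:
    pass  # fall back to the old system
```

Deterministic bucketing also makes problems easier to debug, because any given user is consistently served by one system or the other rather than bouncing between them.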
One big challenge in technology migrations is knowing everything the old system does and why. Over time, a code base accretes little tweaks and hacks that help it deal with edge cases that come up in a production system. Like scar tissue, this code is critical in keeping the product functioning smoothly, but as people join and leave teams, and as products evolve, it’s hard to keep track of the cuts that led to scarring in the first place. Without a team that knows the old system inside and out, important things can get lost in the migration.
Something important did get lost when we were migrating our open and click tracking system: the suppression of notifications when you open your own emails. Although the job of our tracking system is to tell our users when someone has opened an email that they’ve sent, users don’t want to be notified when they open their own emails. So, an earlier team had built special checks to detect when users open their own email and discard that notification. When my team built the updated pipeline, we didn’t know about these checks, so they were missing when we initially rolled out the new system. Users immediately began complaining, and we had to scramble to fix the problem. This was only one of several instances of “scar tissue” that ended up getting lost in translation during the migration and causing us problems.
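To make that concrete, the lost check amounted to something like the filter below. The event fields and function name are hypothetical, but the idea is simply to drop a notification when the person opening the email is also its sender.

```python
def should_notify(open_event: dict) -> bool:
    """Decide whether an email-open event should produce a notification."""
    sender = open_event.get("sender_email", "").strip().lower()
    opener = open_event.get("opener_email", "").strip().lower()
    # Suppress the notification when users open their own tracked emails.
    if opener and opener == sender:
        return False
    return True
```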
When we started our billing migration, we knew we needed to be more careful. We dug into the code and wrote a basic specification of the system. Writing the spec helped us document what needed to be done and why, and forced us to think through what we may have otherwise overlooked during the migration. While we certainly didn’t get everything right, having the document as a reference made a huge difference in glitch-proofing our process as much as we could.
As you start exploring the system’s behavior, you might discover that different people depend on it in different ways. It’s natural to involve other technical teams and management in the migration process, but we realized that non-technical parties, from your social media team to your support staff, from your salespeople to your finance department, should be clued in from the get-go. The goal is to understand what’s important to them so you can keep an eye on it from the technical side.
These stakeholders interact with your systems every day and have their own special tricks or patterns for getting their jobs done. In fact, over time, you’ve probably built a variety of little tools that they’ve become dependent on. Migrations are designed to improve the system for your customers (e.g. better performance, more features) but if the new system breaks or eliminates those tools, you’ve done the opposite for your internal stakeholders. Changing their workflows without warning means your technical team will be bombarded with questions that stakeholders used to handle themselves.
For example, when we re-engineered the billing system, we changed the semantics of some details on the internal billing administration page, and broke the ability for support to make certain account adjustments. As a result, our support team was often confused about how to interpret what they saw on the page for a given account, and was also unable to rectify common problems that they had previously been able to handle. Needless to say, this led to a lot of stress for everyone. By being more explicit about changes and keeping tooling changes in line with product changes, we could have made this much easier for everyone. Do your team and stakeholders a favor by communicating the migration early on and keeping them in the loop throughout the process.
An added benefit of working with other stakeholders is that they may help you spot problems that you didn’t even think to check for in testing. In our case, it’s often a race between our social media team and our support reps to see who gets the first word from customers that something is off after we deploy some new code. And one time, our finance department quickly pinged us when we “forgot” to enforce our freemium product limits for a week. Phew.
On that note, a powerful strategy for finding problems during a migration is to identify key business metrics and set up alerting on those metrics. Often, alerting focuses on technical problems: is your 99th percentile response time spiking? Are there too many server errors? However, a vast spectrum of product failures does not trigger these alarms.
One example of a business metric we use concerns our user onboarding process. During onboarding, we show a new user how to send a tracked email and what it is like to receive a notification. Based on historical data, we know how many people should experience this interaction each hour. If it doesn’t happen at the right rate, we know we’ve broken something. Because anomaly detection can be difficult, just setting thresholds for extremely abnormal behaviors can be helpful (e.g., if zero new users receive a notification, that’s a bad sign). This means thinking about your business metrics before you start coding. Business metrics tend to be more robust to technology change than technical metrics because you still need to provide the same business value even if your technical implementation is changing.
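As a rough sketch of how simple such a check can be, the snippet below compares the last hour’s count of onboarding notifications against a historical baseline and alerts when it falls far below the norm (or hits zero). The baseline, threshold, and `page_on_call` helper are illustrative assumptions, not our actual values or tooling.

```python
EXPECTED_PER_HOUR = 120      # illustrative historical baseline, not a real figure
MIN_ACCEPTABLE_RATIO = 0.25  # alert if we see less than 25% of the norm

def page_on_call(message: str) -> None:
    # Stand-in for whatever alerting integration you use (PagerDuty, Slack, etc.).
    print("ALERT:", message)

def check_onboarding_notifications(observed_last_hour: int) -> None:
    """Alert when the onboarding notification rate drops to an abnormal level."""
    if observed_last_hour == 0:
        page_on_call("No new users received an onboarding notification in the last hour")
    elif observed_last_hour < EXPECTED_PER_HOUR * MIN_ACCEPTABLE_RATIO:
        page_on_call(
            f"Onboarding notifications dropped to {observed_last_hour}/hour "
            f"(expected roughly {EXPECTED_PER_HOUR})"
        )
```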
Another technique is to compare end-to-end output, if possible. If you are changing your data source, make sure the output that users see remains the same. Here, “same” can mean anything you want: you can literally compare the rendered output pixel by pixel, or you can just make sure that certain div elements contain the right text. For developers, it can be very helpful to have a mechanism that forces the application to use either the old component or the new component at runtime. We used a secret URL that allowed us to view user activity data from both the new and the old data stores as a way to detect issues and verify fixes.
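A comparison harness along those lines can be quite small. The sketch below assumes a hypothetical query parameter (`datasource=old|new`) standing in for the secret-URL mechanism described above, and compares only the text of selected div elements rather than pixels; the URL path and CSS selector are made up for illustration.

```python
import requests
from bs4 import BeautifulSoup

def activity_texts(base_url, user_id, datasource):
    """Fetch a user's activity page rendered from one data store and extract the item text."""
    resp = requests.get(f"{base_url}/activity/{user_id}", params={"datasource": datasource})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Compare only the text of the activity items, not the pixel-level rendering.
    return [div.get_text(strip=True) for div in soup.select("div.activity-item")]

def outputs_match(base_url, user_id):
    """True when the old and new data stores produce the same user-visible activity."""
    return activity_texts(base_url, user_id, "old") == activity_texts(base_url, user_id, "new")
```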
Having a plan, working with stakeholders, and monitoring key metrics let you catch problems before or as they happen. But there will still be things that slip through the cracks. That’s why it’s important that both your infrastructure and your application’s architecture support rapid iteration.
Our billing system and email tracking pipeline were initially part of a more monolithic system, so despite HubSpot’s microservices architecture and deployment infrastructure, we could not deploy pieces of it independently, and big changes were risky to deploy. This was frustrating because, as we migrated users and data incrementally (in both cases!), we found a lot of edge-case and anomalous data that would not work properly in the new system. These aren’t even cases that you can really plan for: you often won’t know the kinds of weird data you and your users have put into the system over the years until you try to migrate it.
To address this, we improved our architecture by extracting these components from our monolithic system, enabling us to iterate more quickly. Our email tracking pipeline now has many incremental processing stages that can be independently and reliably deployed. We extracted our billing system’s UI into a separately deployable front-end component, so we could make front-end improvements without worrying about changes being coupled to our larger monolithic system. In both cases, being able to improve, tweak, and test our systems more quickly and safely benefited both us and our users.
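To give a feel for the shape of those stages, here is a minimal sketch of one decoupled processing step. In-memory queues stand in for whatever messaging system connects the real stages, and the enrichment logic is a placeholder; none of this reflects our actual pipeline code.

```python
import json
import queue

# In-memory queues stand in for the real hand-off between stages (e.g. Kafka topics).
raw_events = queue.Queue()
enriched_events = queue.Queue()

def enrichment_stage():
    """One pipeline stage: consume raw open/click events, enrich them, pass them on."""
    while not raw_events.empty():
        event = json.loads(raw_events.get())
        event["source"] = "tracking-pipeline"  # placeholder for real enrichment logic
        enriched_events.put(json.dumps(event))
```

Because each stage reads from one queue and writes to the next, a stage can be changed, deployed, or rolled back without touching its neighbors.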
These four lessons have helped us manage the constant technology change involved in operating a SaaS product. What do you do to help ensure successful technology migrations?