Written by Kartik Vishwanath, VP of Engineering at HubSpot.
_______
Between 14:20 UTC and 15:16 UTC on August 27, 2024, some HubSpot customers were unable to log in to their accounts, access their content, or use our tools. We understand how critical our product is for your business and sincerely apologize for the inconvenience this issue caused.
At HubSpot, maintaining reliability is our top priority. We are continuously improving our systems to prevent incidents like this from occurring again in the future. Here’s a detailed explanation of what occurred, what we learned, and the steps we’re taking to improve our platform’s resilience.
Architecture Overview
HubSpot's platform processes billions of customer requests daily, and traffic is distributed across multiple layers of security and routing systems to ensure redundancy and efficiency. One critical part of this system is our internal load balancers, which are deployed on Kubernetes and manage traffic distribution across backend services.
In this case, an internal issue caused several instances of these load balancers to misreport their health status, leading to simultaneous shutdowns. This triggered a cascading failure, as the remaining load balancer instances became overwhelmed by a surge of traffic.
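To make this concrete, here is a minimal sketch of the general pattern of health-check-driven membership that load balancer fleets like ours rely on. It is illustrative only, not our actual code: the endpoint path, timeout, and failure threshold are hypothetical placeholders.

```python
# Minimal sketch of health-check-driven membership (illustrative only; the
# endpoint, timeout, and threshold are hypothetical, not HubSpot's code).
import urllib.request

HEALTH_TIMEOUT = 2.0    # seconds an instance has to answer its health endpoint
FAILURES_TO_REMOVE = 3  # consecutive failures before removal from rotation

def is_healthy(instance_url: str) -> bool:
    """Probe the instance's health endpoint within the allowed timeout."""
    try:
        with urllib.request.urlopen(f"{instance_url}/healthz", timeout=HEALTH_TIMEOUT) as resp:
            return resp.status == 200
    except Exception:
        return False  # timeouts and connection errors both count as failures

def reconcile(pool: dict) -> None:
    """Remove instances that fail too many consecutive health checks."""
    for url, state in list(pool.items()):
        state["failures"] = 0 if is_healthy(url) else state["failures"] + 1
        if state["failures"] >= FAILURES_TO_REMOVE:
            del pool[url]  # the instance stops receiving traffic
```

When every instance answers its health checks promptly, this loop is invisible; the incident began when those answers stopped arriving on time.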
What Happened
On August 27, 2024, during routine operation of a critical component of our traffic management system, we hit an issue in our load balancer stack that was previously unknown to us but has been documented externally. The issue caused the code path that handles updates to temporarily block health check responses. These health checks determine whether a load balancer is functioning properly and is ready to handle traffic. Because of this blockage, the health checks were unable to complete in time, and the system misinterpreted these load balancers as unhealthy.
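The sketch below illustrates the shape of the problem, under the assumption that health-check handling shares an execution path with update processing; the durations and deadline are made-up numbers, not measurements from our systems.

```python
# Illustrative sketch: when health-check responses share a single execution
# path with update processing, a slow update delays the response past its
# deadline. All durations here are hypothetical.
import time

HEALTH_DEADLINE = 0.2  # hypothetical: probe must be answered within 200 ms

def handle_serially(tasks):
    """Process queued tasks one at a time, as a single worker thread would."""
    finished_at = {}
    elapsed = 0.0
    for name, duration in tasks:
        time.sleep(duration)  # stand-in for real work
        elapsed += duration
        finished_at[name] = elapsed
    return finished_at

# A configuration update is queued just ahead of a health probe.
timings = handle_serially([("config_update", 0.5), ("health_probe", 0.01)])
if timings["health_probe"] > HEALTH_DEADLINE:
    print(f"health probe answered after {timings['health_probe']:.2f}s -> marked unhealthy")
```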
As a result, some load balancer instances were falsely marked as failed and began shutting down simultaneously. With those instances offline, the remaining load balancers experienced a sudden surge in traffic and became overloaded. This surge led to a congestive collapse, where the system was too overwhelmed to function properly. Built-in protections were triggered to reject excess connections, but overall performance was heavily impacted for affected customers.
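The arithmetic behind the collapse is simple to illustrate. The figures below are placeholders rather than our real fleet size or connection counts; the point is that every instance that drops out raises the load on each surviving instance.

```python
# Back-of-the-envelope illustration of the redistribution effect (all numbers
# are hypothetical, not our actual fleet size or traffic figures).
TOTAL_CONNECTIONS = 3_000_000    # assumed fleet-wide connections during the surge
FLEET_SIZE = 20                  # assumed number of load balancer instances
PER_INSTANCE_CAPACITY = 250_000  # assumed safe connection limit per instance

for removed in (0, 5, 10, 15):
    remaining = FLEET_SIZE - removed
    per_instance = TOTAL_CONNECTIONS / remaining
    status = "OK" if per_instance <= PER_INSTANCE_CAPACITY else "overloaded"
    print(f"{removed} instances removed -> {per_instance:,.0f} connections each ({status})")
```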
Timeline of Events
Let’s walk through the critical events of August 27, 2024:
- 14:08 UTC: During routine operation of our load balancer, a previously unknown issue caused our automated health checks to slow down.
- 14:18 UTC: Automatic detection of impaired load balancer instances erroneously marked a subset of the instances down due to failing health checks. This initiated a process to gradually remove them from handling traffic.
- 14:19 UTC: The remaining load balancer instances experienced a massive surge in connections (11x our normal peak traffic), with traffic spiking from hundreds of thousands to several million connections. As a result, they struggled to handle the increased load, leading to degraded performance and delays for users.
- 14:23 UTC: Our automated systems triggered connection limits to reduce memory pressure (see the sketch after this timeline), but this resulted in increased response times and failed requests. Monitoring systems detected widespread performance degradation across the platform, and the incident response team was mobilized.
- 14:44 UTC: The status page was updated. By this point we had already identified the internal issue and were working on a fix; the delay in updating the status page caused some confusion for our customers.
- 14:59 UTC: The team attempted to increase the capacity of the remaining load balancer instances.
- 15:04 UTC: New load balancer instances came online, relieving pressure on the system. This began to stabilize traffic and restore normal functionality.
- 15:16 UTC: The system fully recovered and traffic returned to normal levels. We continued monitoring to ensure stability.
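As referenced at the 14:23 UTC entry, the built-in protection that kicked in was a per-instance connection limit. Below is a simplified sketch of that kind of load shedding; the limit and the accept/release interface are illustrative assumptions, not our implementation.

```python
# Simplified sketch of connection-limit load shedding (the MAX_CONNECTIONS
# value and the accept/release interface are illustrative assumptions).
MAX_CONNECTIONS = 250_000  # hypothetical per-instance ceiling

class ConnectionLimiter:
    """Reject new connections once the instance reaches its configured ceiling."""

    def __init__(self, limit: int = MAX_CONNECTIONS):
        self.limit = limit
        self.active = 0

    def try_accept(self) -> bool:
        if self.active >= self.limit:
            return False  # shed load: protects memory at the cost of failed requests
        self.active += 1
        return True

    def release(self) -> None:
        self.active = max(0, self.active - 1)
```

Shedding keeps an instance from exhausting its memory, but every rejected connection is a failed request for someone, which is why the longer-term fixes below focus on avoiding the overload in the first place.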
What We Learned
We conduct detailed reviews following all incidents to ensure that we have responded appropriately and proportionally to the situation. We use these reviews to identify opportunities to reduce the likelihood of similar events happening in the future and apply learnings to our future product reliability efforts. This incident has led us to focus on both immediate fixes and longer-term improvements. Below are key themes from our review and actions we’re taking:
1. Health Check Tuning
The core issue stemmed from our health check system being temporarily blocked during the update process, which caused several load balancers to falsely report as unhealthy. This led to a cascading failure as multiple instances shut down simultaneously, overloading the remaining ones. To address this, we have adjusted the configuration to ensure that health checks do not incorrectly time out. Moving forward, we are also redesigning the health check system to be more resilient during operations like configuration updates, prioritizing health checks so they continue to run uninterrupted. This will prevent similar false failures and ensure critical systems remain fully operational during future updates.
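As a rough sketch of the shape of these changes (the keys, values, and threading model below are illustrative, not our real configuration), the tuning amounts to more forgiving probe parameters plus keeping health-check responses off the path that does heavy work:

```python
# Illustrative health-check tuning sketch; keys, values, and the dedicated
# thread are assumptions for demonstration, not HubSpot's configuration.
import threading
import time

HEALTH_CHECK = {
    "interval_seconds": 5,     # how often the probe runs
    "timeout_seconds": 10,     # generous enough not to trip during updates
    "unhealthy_threshold": 5,  # consecutive failures before removal
}

def run_health_checks(check_fn):
    """Answer health checks on a dedicated thread so long-running work
    (such as configuration updates) cannot delay the response."""
    def loop():
        while True:
            check_fn()
            time.sleep(HEALTH_CHECK["interval_seconds"])
    t = threading.Thread(target=loop, daemon=True, name="health-checks")
    t.start()
    return t
```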
2. Capacity and Load Balancing Improvements
During the incident, the remaining load balancers were overwhelmed by a massive surge in traffic after some instances were falsely removed from service. Although built-in protections helped mitigate some of the load by rejecting excess connections, the surge was too fast and too large to handle efficiently. In response, we have increased the capacity of our load balancers to ensure they can handle greater volumes of traffic during sudden spikes. Additionally, we are enhancing the overall architecture of our load balancing system to detect traffic surges earlier and distribute load more dynamically in real time, preventing any single set of instances from being overwhelmed. This will provide more robust protection against future traffic spikes and improve system stability.
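The capacity side of this is back-of-the-envelope arithmetic: size the fleet so that the survivors can absorb a surge even after losing several instances at once. All of the numbers below are placeholders, not our actual sizing.

```python
# Rough capacity-planning arithmetic (all figures are placeholders): size the
# fleet so the surviving instances can absorb both a surge and instance loss.
import math

NORMAL_PEAK = 300_000            # assumed normal peak connections, fleet-wide
SURGE_MULTIPLIER = 11            # surge magnitude observed during the incident
PER_INSTANCE_CAPACITY = 250_000  # assumed safe connection limit per instance
TOLERATED_INSTANCE_LOSS = 5      # instances we want to be able to lose at once

surge_load = NORMAL_PEAK * SURGE_MULTIPLIER
needed_survivors = math.ceil(surge_load / PER_INSTANCE_CAPACITY)
fleet_size = needed_survivors + TOLERATED_INSTANCE_LOSS
print(f"Provision at least {fleet_size} instances "
      f"({needed_survivors} to carry the surge + {TOLERATED_INSTANCE_LOSS} spare)")
```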
3. Status Page Automation and Customer Communication
One challenge during this incident was the delay in updating our status page, which caused confusion for some customers. Although we identified the issue internally by 14:20 UTC, the status page was not updated until 14:44 UTC. This resulted in outdated or incomplete information being shared with customers. To improve the timeliness of our communications, we have automated status page updates for large-scale outages, ensuring real-time information is communicated to customers as soon as an incident occurs. We are also integrating our monitoring systems more tightly with our communication tools, so future status updates are more timely and accurate.
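Conceptually, the automation is a small piece of glue between our alerting pipeline and the status page. The sketch below conveys the idea only; the endpoint, token, payload shape, and severity field are hypothetical placeholders rather than a real status page API.

```python
# Illustrative glue between monitoring alerts and a status page. The endpoint,
# token, and payload are hypothetical placeholders, not a real API.
import json
import urllib.request

STATUS_PAGE_ENDPOINT = "https://status.example.com/api/incidents"  # placeholder
API_TOKEN = "REDACTED"                                              # placeholder

def on_monitoring_alert(alert: dict) -> None:
    """Open a status page incident automatically when a large-scale alert fires."""
    if alert.get("severity") != "critical":
        return  # only automate customer-facing updates for large-scale outages
    payload = json.dumps({
        "title": alert.get("summary", "Investigating degraded performance"),
        "status": "investigating",
    }).encode()
    req = urllib.request.Request(
        STATUS_PAGE_ENDPOINT,
        data=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)
```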
4. Load Testing and Failure Simulations
To further reduce the risk of future incidents, we are expanding our load testing capabilities to simulate high-traffic scenarios. This will help us identify weak points in the system before they become a problem. Additionally, we are introducing chaos engineering practices to regularly test platform resilience. By intentionally simulating failure scenarios, we can uncover unknown weaknesses and ensure our systems can withstand unexpected issues.
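A toy version of such an experiment is shown below; the fleet size, capacity, and failure probability are illustrative assumptions, and real chaos experiments target actual systems rather than a simulation.

```python
# Toy failure-injection experiment (illustrative assumptions only): fail each
# instance independently with some probability and check whether the
# surviving fleet can still carry the expected load.
import random

FLEET_SIZE = 20
PER_INSTANCE_CAPACITY = 250_000
EXPECTED_LOAD = 3_000_000

def run_experiment(failure_probability: float, trials: int = 10_000) -> float:
    """Return the fraction of trials in which the survivors stay within capacity."""
    healthy_trials = 0
    for _ in range(trials):
        failed = sum(1 for _ in range(FLEET_SIZE) if random.random() < failure_probability)
        survivors = FLEET_SIZE - failed
        if survivors * PER_INSTANCE_CAPACITY >= EXPECTED_LOAD:
            healthy_trials += 1
    return healthy_trials / trials

print(f"Fleet survives a 25% failure rate in {run_experiment(0.25):.0%} of trials")
```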
5. Enhanced Traffic Monitoring and Auto-Scaling
Our monitoring systems were able to detect the incident quickly, but we realized that earlier detection of abnormal traffic surges could have prevented the cascading failure. To improve this, we have introduced new metrics that monitor traffic patterns more closely, including tracking the ratio of current connections to the maximum allowed. This will help us identify surges earlier and take action before they reach critical levels. Additionally, we are implementing auto-scaling capabilities that will automatically adjust resource capacity based on traffic patterns. This will prevent overloads and maintain system stability, even during unexpected traffic spikes.
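In code, the new signal and the scaling decision look roughly like the sketch below; the threshold and the growth step are illustrative assumptions rather than our production policy.

```python
# Sketch of the connection-ratio signal and a scale-out decision (the
# threshold and growth step are illustrative assumptions).
SCALE_OUT_THRESHOLD = 0.7  # scale out when an instance reaches 70% of its limit

def connection_ratio(current_connections: int, max_connections: int) -> float:
    """Ratio of current connections to the configured maximum for an instance."""
    return current_connections / max_connections

def desired_replicas(ratios: list[float], current_replicas: int) -> int:
    """Add capacity before any instance approaches its connection ceiling."""
    if max(ratios) >= SCALE_OUT_THRESHOLD:
        return current_replicas + max(1, current_replicas // 4)  # grow by ~25%
    return current_replicas

# Example: a 12-instance fleet where the busiest instance is at 80% of its limit.
print(desired_replicas([0.55, 0.80, 0.62], current_replicas=12))
```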
Moving Forward
We take this incident extremely seriously and are dedicating significant engineering resources to prevent it from happening again. Our priority is to provide a reliable and robust platform for our customers, and we will continue to improve our systems to prevent similar disruptions in the future.
Thank you for your trust in HubSpot. We will share relevant updates as we make progress on our long-term efforts to enhance the reliability of our platform.