Written by Kartik Vishwanath, VP of Engineering at HubSpot.
_______
Between 14:20 UTC and 15:16 UTC on August 27, 2024, some HubSpot customers were unable to log in to their accounts, access their content, or use our tools. We understand how critical our product is for your business and sincerely apologize for the inconvenience this issue caused.
At HubSpot, maintaining reliability is our top priority, and we are continuously improving our systems to prevent incidents like this from recurring. Here’s a detailed explanation of what occurred, what we learned, and the steps we’re taking to improve our platform’s resilience.
HubSpot's platform processes billions of customer requests daily, and traffic is distributed across multiple layers of security and routing systems to ensure redundancy and efficiency. One critical part of this system is our internal load balancers, which are deployed on Kubernetes and manage traffic distribution across backend services.
In this case, an internal issue caused several instances of these load balancers to misreport their health status, leading to simultaneous shutdowns. This triggered a cascading failure, as the remaining load balancer instances became overwhelmed by a surge of traffic.
On August 27, 2024, routine operation of a critical component of our traffic management system exposed an issue in our load balancer stack that we were not aware of but that has since been documented externally: the code path that handles updates temporarily blocked health check responses. These health checks determine whether a load balancer is functioning properly and ready to handle traffic. Because of this blockage, the health checks could not complete in time, and the system misinterpreted these load balancers as unhealthy.
As a result, several of these load balancer instances were incorrectly marked as failed and shut down simultaneously. With those instances offline, the remaining load balancers absorbed a sudden surge of traffic and became overloaded, leading to congestive collapse: the system was too overwhelmed to function properly. Built-in protections were triggered to reject excess connections, but overall performance was heavily degraded for affected customers.
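To make this failure mode easier to picture, here is a minimal, hypothetical sketch in Go. The shared lock, timings, and handler below are illustrative assumptions, not our actual load balancer code:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// Hypothetical illustration only: a single mutex shared by the configuration
// update path and the health check handler. If an update holds the lock for
// longer than the health checker's timeout, the instance is reported
// unhealthy even though it is still forwarding traffic.
var configMu sync.Mutex

func applyConfigUpdate() {
	configMu.Lock()
	defer configMu.Unlock()
	time.Sleep(10 * time.Second) // simulate slow update work, e.g. rebuilding routing tables
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Blocks here while an update is in progress, so a health checker with a
	// shorter timeout gives up and marks this instance as failed.
	configMu.Lock()
	defer configMu.Unlock()
	fmt.Fprintln(w, "ok")
}

func main() {
	go func() {
		for {
			applyConfigUpdate()
			time.Sleep(30 * time.Second)
		}
	}()
	http.HandleFunc("/healthz", healthHandler)
	http.ListenAndServe(":8080", nil)
}
```

In a setup like this, a health checker probing /healthz with a timeout shorter than the update window would fail on every update, which is analogous to how the affected instances came to be marked as unhealthy.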
Timeline of Events
Let’s walk through the critical events of August 27, 2024 (all times UTC):
- 14:20: Several internal load balancer instances were falsely marked as unhealthy and shut down simultaneously; customer impact began and the issue was identified internally.
- 14:44: The status page was updated to notify customers of the incident.
- 15:16: Service was restored and customer impact ended.

What We Learned and How We’re Improving
We conduct detailed reviews following all incidents to ensure that we have responded appropriately and proportionally to the situation. We use these reviews to identify opportunities to reduce the likelihood of similar events happening in the future and apply learnings to our future product reliability efforts. This incident has led us to focus on both immediate fixes and longer-term improvements. Below are key themes from our review and actions we’re taking:
1. Health Check Tuning
The core issue stemmed from our health check system being temporarily blocked during the update process, which caused several load balancers to falsely report as unhealthy. This led to a cascading failure as multiple instances shut down simultaneously, overloading the remaining ones. To address this, we have adjusted the configuration to ensure that health checks do not incorrectly time out. We are also redesigning the health check system to be more resilient during operations like configuration updates, prioritizing health checks so they continue to run uninterrupted. This will prevent similar false failures and ensure critical systems remain fully operational during future updates.
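As a sketch of the kind of prioritization described above, and assuming the same hypothetical shared-lock scenario from earlier, the health endpoint below reads an atomically published readiness flag and never contends with the update path:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// Hypothetical sketch: readiness is published to an atomic flag, so the
// health endpoint never waits on the configuration update path.
var ready atomic.Bool

func applyConfigUpdate() {
	// Long-running update work happens here; readiness stays true because
	// the data plane keeps serving the previous configuration meanwhile.
	time.Sleep(10 * time.Second)
	ready.Store(true)
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	if ready.Load() {
		fmt.Fprintln(w, "ok")
		return
	}
	http.Error(w, "not ready", http.StatusServiceUnavailable)
}

func main() {
	ready.Store(true)
	go func() {
		for {
			applyConfigUpdate()
			time.Sleep(30 * time.Second)
		}
	}()
	http.HandleFunc("/healthz", healthHandler)
	http.ListenAndServe(":8080", nil)
}
```

Because readiness is read from a flag that is updated atomically, a slow configuration update can no longer delay the health response past its timeout.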
2. Capacity and Load Balancing Improvements
During the incident, the remaining load balancers were overwhelmed by a massive surge in traffic after some instances were falsely removed from service. Built-in protections helped mitigate some of the load by rejecting excess connections, but the surge was too fast and too large to absorb. In response, we have increased the capacity of our load balancers so they can handle greater volumes of traffic during sudden spikes. We are also enhancing the architecture of our load balancing system to detect traffic surges earlier and distribute load more dynamically in real time, preventing any single set of instances from being overwhelmed. This will provide more robust protection against future traffic spikes and improve system stability.
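The sketch below illustrates the general idea of rejecting excess connections; the limit, handler names, and port are hypothetical, not our production configuration:

```go
package main

import (
	"fmt"
	"net/http"
)

// Hypothetical limit; in practice this would be derived from load testing.
const maxInFlight = 1000

// loadShed caps concurrent in-flight requests. When the cap is reached,
// excess requests are rejected immediately so the instance stays healthy
// instead of sliding into congestive collapse.
func loadShed(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "overloaded, try again", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	http.ListenAndServe(":8080", loadShed(maxInFlight, backend))
}
```

Rejecting the excess immediately keeps latency bounded for the requests that are admitted, which is exactly the property that a congestive collapse destroys.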
3. Status Page Automation and Customer Communication
One challenge during this incident was the delay in updating our status page, which caused confusion for some customers. Although we identified the issue internally by 14:20 UTC, the status page was not updated until 14:44 UTC, so customers received outdated or incomplete information in the interim. To improve the timeliness of our communications, we have automated status page updates for large-scale outages, ensuring real-time information reaches customers as soon as an incident occurs. We are also integrating our monitoring systems more tightly with our communication tools so that future status updates are more timely and accurate.
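As a rough illustration of what this kind of automation can look like, the following sketch wires a hypothetical monitoring alert webhook to a placeholder status page API; the endpoint, payload shapes, and severity field are assumptions, not our actual tooling:

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// Hypothetical endpoint for a status page provider's incident API.
const statusPageAPI = "https://status.example.com/api/incidents"

type alert struct {
	Name     string `json:"name"`
	Severity string `json:"severity"`
}

// handleAlert receives a monitoring alert webhook and, for critical alerts,
// opens a status page incident automatically so customers see real-time
// information without waiting for a manual update.
func handleAlert(w http.ResponseWriter, r *http.Request) {
	var a alert
	if err := json.NewDecoder(r.Body).Decode(&a); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	if a.Severity != "critical" {
		w.WriteHeader(http.StatusNoContent)
		return
	}
	body, _ := json.Marshal(map[string]string{
		"title":  "Investigating: " + a.Name,
		"status": "investigating",
	})
	resp, err := http.Post(statusPageAPI, "application/json", bytes.NewReader(body))
	if err != nil {
		log.Printf("status page update failed: %v", err)
	} else {
		resp.Body.Close()
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/alerts", handleAlert)
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```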
4. Load Testing and Failure Simulations
To further reduce the risk of future incidents, we are expanding our load testing capabilities to simulate high-traffic scenarios. This will help us identify weak points in the system before they become a problem. Additionally, we are introducing chaos engineering practices to regularly test platform resilience. By intentionally simulating failure scenarios, we can uncover unknown weaknesses and ensure our systems can withstand unexpected issues.
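For illustration, here is a minimal sketch of the kind of load generator we mean; the target URL, worker counts, and request volumes are placeholder values far below what a real test would use:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

// Hypothetical target and volumes; real tests run against non-production
// environments at much higher request rates.
const (
	target    = "http://localhost:8080/healthz"
	workers   = 50
	perWorker = 200
)

func main() {
	var errors atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				resp, err := http.Get(target)
				if err != nil || resp.StatusCode >= 500 {
					errors.Add(1)
				}
				if err == nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()
	total := workers * perWorker
	fmt.Printf("requests=%d errors=%d error_rate=%.2f%%\n",
		total, errors.Load(), 100*float64(errors.Load())/float64(total))
}
```

Tracking the error rate as concurrency rises is what surfaces the weak points before real traffic does.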
5. Earlier Detection and Auto-Scaling
Our monitoring systems detected the incident quickly, but earlier detection of abnormal traffic surges could have prevented the cascading failure. To improve this, we have introduced new metrics that monitor traffic patterns more closely, including the ratio of current connections to the maximum allowed. This will help us identify surges earlier and take action before they reach critical levels. Additionally, we are implementing auto-scaling capabilities that automatically adjust resource capacity based on traffic patterns, preventing overloads and maintaining system stability even during unexpected traffic spikes.
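As an illustration of the connection-ratio metric mentioned above, the sketch below computes saturation against a configured maximum and flags a surge at hypothetical thresholds well before the limit is reached:

```go
package main

import "fmt"

// Hypothetical thresholds: alert when connection saturation crosses 70%,
// request additional capacity when it crosses 85%.
const (
	alertThreshold = 0.70
	scaleThreshold = 0.85
)

// saturation returns the ratio of current connections to the maximum allowed.
func saturation(current, limit int) float64 {
	if limit == 0 {
		return 0
	}
	return float64(current) / float64(limit)
}

func evaluate(current, limit int) string {
	s := saturation(current, limit)
	switch {
	case s >= scaleThreshold:
		return fmt.Sprintf("saturation %.0f%%: scale out now", s*100)
	case s >= alertThreshold:
		return fmt.Sprintf("saturation %.0f%%: traffic surge detected, alerting", s*100)
	default:
		return fmt.Sprintf("saturation %.0f%%: normal", s*100)
	}
}

func main() {
	// Example readings as a surge develops against a limit of 10,000 connections.
	for _, current := range []int{4000, 7500, 9200} {
		fmt.Println(evaluate(current, 10000))
	}
}
```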
Moving Forward
We take this incident extremely seriously and are dedicating significant engineering resources to prevent it from happening again. Our priority is to provide a reliable and robust platform for our customers, and we will continue to improve our systems to prevent similar disruptions in the future.
Thank you for your trust in HubSpot. We will share relevant updates as we make progress on our long-term efforts to enhance the reliability of our platform.