Written by Kartik Vishwanath, VP of Engineering at HubSpot.
_______
Between 5:47 AM EDT and 9:44 AM EDT on September 25, 2024, some HubSpot customers were unable to log in to their accounts, access their content, or use our tools. The issue occurred during two time periods. The first began at 5:47 AM EDT and was resolved by 6:09 AM EDT. The second started at 9:14 AM EDT and was resolved by 9:44 AM EDT. This incident followed a similar event on August 27, 2024, which also affected the same part of our routing stack. We sincerely apologize for the inconvenience and disruption these incidents may have caused.
While the two incidents had different root causes, we conducted thorough reviews of each and addressed the specific issues identified. However, out of an abundance of caution, we decided to roll back the routing layer to a previously proven and reliable configuration that had been in operation for a longer period. The next-generation routing technology had been in operation for over a year without prior issues, and this rollback provides us with additional time to evaluate it comprehensively.
For a deeper technical explanation of the root causes and corrective actions for the September 25 and August 27 incidents, please see the appendix at the end of this post.
Architecture Overview
HubSpot’s infrastructure processes billions of customer requests daily through a routing system composed of load balancers that manage the flow of requests to various backend services. These load balancers are deployed as Kubernetes pods and provide the traffic distribution and redundancy needed to support large-scale operations.
In this incident, an issue with how the traffic-handling pods were distributed across Kubernetes nodes caused multiple pods to fail their readiness checks simultaneously, which led to automatic shutdowns of the affected pods. The remaining pods became overwhelmed, resulting in a surge of 5xx errors on our primary load balancer. This behavior resembled a thundering herd problem: the simultaneous failures pushed additional load onto the remaining healthy pods, compounding the strain on the system.
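To make the thundering herd dynamic concrete, the small sketch below walks through the arithmetic with made-up numbers (the pod count, request rate, and per-pod capacity are illustrative assumptions, not HubSpot’s actual figures): as pods are removed from service, the load on each survivor grows until it too exceeds its capacity.

```python
# Back-of-the-envelope illustration of the thundering herd effect described
# above. All numbers are hypothetical and chosen only to show the shape of the
# problem: as pods are marked unready and removed from rotation, per-pod load
# on the survivors climbs past their capacity, triggering further failures.

TOTAL_PODS = 40              # hypothetical fleet size behind the load balancer
TOTAL_RPS = 200_000          # hypothetical steady-state requests per second
CAPACITY_PER_POD = 8_000     # hypothetical per-pod limit before 5xx errors begin

for failed in range(0, TOTAL_PODS, 5):
    healthy = TOTAL_PODS - failed
    per_pod = TOTAL_RPS / healthy
    status = "OK" if per_pod <= CAPACITY_PER_POD else "OVERLOADED -> more readiness failures"
    print(f"{failed:2d} pods unready -> {per_pod:8.0f} rps per healthy pod [{status}]")
```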
What Happened
On September 23, we identified an imbalance in how traffic-handling pods were distributed across Kubernetes nodes, which risked overloading certain nodes and made the system vulnerable. A fix was implemented on September 24, but it did not take effect due to a deployment problem, leaving the system exposed to failure under specific traffic patterns.
When a surge in traffic occurred on the morning of September 25, the previously known pod distribution issue triggered a cascade of failures. Readiness probes for multiple pods began to fail simultaneously, causing Kubernetes to remove these pods from service. This overwhelmed the remaining pods with an influx of traffic, resulting in two waves of degraded performance and 5xx errors.
Although we had been actively addressing this issue, the fix could not be applied in time to prevent the system from entering a vulnerable state during the traffic spike.
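The post does not describe the exact mechanism behind the fix, but one common way to enforce even pod spread in Kubernetes is a topology spread constraint. The sketch below shows the general idea using the official Kubernetes Python client; the deployment name, namespace, and labels are placeholders, not HubSpot’s actual configuration.

```python
# Hypothetical sketch: enforce even spread of traffic-handling pods across
# nodes with a topologySpreadConstraint, applied as a strategic-merge patch
# via the official Kubernetes Python client. Names and labels are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "topologySpreadConstraints": [
                    {
                        "maxSkew": 1,                             # at most 1 pod difference between nodes
                        "topologyKey": "kubernetes.io/hostname",  # spread across individual nodes
                        "whenUnsatisfiable": "DoNotSchedule",     # refuse to pile pods onto one node
                        "labelSelector": {"matchLabels": {"app": "edge-proxy"}},
                    }
                ]
            }
        }
    }
}

# "edge-proxy" and "routing" are assumed names for the traffic-handling deployment.
apps.patch_namespaced_deployment(name="edge-proxy", namespace="routing", body=patch)
```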
Timeline of Events (all times in EDT)
- September 23: An issue was identified with pod distribution across Kubernetes nodes, and a fix was initiated.
- September 24, 5:55 PM: A deployment intended to correct the issue was completed. However, an issue prevented the fix from taking effect, leaving the system vulnerable.
- September 25, 4:20 AM: A small wave of pods on a Kubernetes node failed their readiness checks, but it did not cause significant customer impact.
- September 25, 5:35 AM: A significant number of pods on the same node failed their readiness checks, triggering the first major impact.
- September 25, 5:47 AM: The first wave of failures began, causing a surge in 5xx errors and customer impact.
- September 25, 6:06 AM: The team scaled up the number of pods, reducing the immediate impact (a sketch of this kind of scaling operation follows the timeline).
- September 25, 6:09 AM: The first period of failures was largely resolved, and traffic returned to normal levels.
- September 25, 9:14 AM: A second period of failures occurred, with a larger number of pods on another node failing their readiness checks.
- September 25, 9:17 AM: The team scaled the number of pods further and relaxed readiness probe settings to stabilize the system.
- September 25, 9:44 AM: The second wave was resolved, and traffic returned to normal.
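As a rough illustration of the mitigation applied at 6:06 AM and 9:17 AM, the sketch below scales up a traffic-handling deployment with the Kubernetes Python client. The deployment name, namespace, and scaling factor are assumptions; the actual pod counts were not disclosed.

```python
# Hypothetical sketch of scaling up the traffic-handling pods during the
# incident, using the official Kubernetes Python client. The deployment name,
# namespace, and the "double it" scaling factor are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Read the current replica count, then scale up to absorb the traffic surge.
scale = apps.read_namespaced_deployment_scale(name="edge-proxy", namespace="routing")
target = scale.spec.replicas * 2  # illustrative only; real numbers were not disclosed

apps.patch_namespaced_deployment_scale(
    name="edge-proxy",
    namespace="routing",
    body={"spec": {"replicas": target}},
)
print(f"Scaled edge-proxy from {scale.spec.replicas} to {target} replicas")
```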
What We Learned
In both this incident and the earlier event on August 27, 2024, we performed comprehensive reviews and addressed the specific root causes that led to service disruptions. However, given that these two incidents affected the same part of our routing stack, we took an additional step: rolling back the routing layer to a previously stable technology that had been in operation much longer.
The newer routing technology had been in use for over a year without incident before these events, while the previous technology, to which we have now reverted, had been operational for much longer. This rollback will allow us additional time to conduct a more comprehensive evaluation of the newer-generation routing technology before considering its redeployment.
Key Learnings and Actions Taken:
- Routing System Rollback
Out of an abundance of caution, we rolled back the affected routing layer to the previously stable version, which had been operating successfully for several years before the newer technology was deployed. This rollback ensures system stability while we continue to evaluate the next-generation routing technology in more depth.
- Improved Pod Distribution
We deployed a corrected configuration to ensure traffic-handling pods are evenly distributed across Kubernetes nodes. This prevents individual nodes from being overloaded and mitigates the risk of widespread failures.
- Load Balancer Optimization
Our load balancers have been reconfigured to better manage traffic surges and to avoid removing too many pods from service simultaneously during high-load situations.
- Enhanced Monitoring and Alerting
We have improved our monitoring to detect early signs of pod readiness failures, even at small scale, and refined the metrics we use to track pod distribution across nodes.
- Traffic Optimization and Sharding
We are analyzing internal traffic patterns and will begin shifting certain workloads to alternate traffic paths. We also plan to shard traffic distribution to further reduce the impact of potential future issues.
Long-Term Improvements
We are using this rollback period to thoroughly evaluate the affected routing technology, ensuring it aligns with our long-term reliability goals. This comprehensive evaluation will allow us to address any architectural improvements required for the next-generation system before considering redeployment.
We are also mandating validation of pod distribution across nodes before future deployments, so that no rollout proceeds without confirming that traffic-handling pods are properly balanced. A sketch of what such a check could look like follows.
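The minimal sketch below uses the Kubernetes Python client to gate a rollout on pod spread. The namespace, label selector, and skew threshold are assumptions for illustration; this is not HubSpot’s actual tooling.

```python
# Hypothetical pre-deployment gate: block a rollout if traffic-handling pods
# are unevenly spread across nodes. Namespace, label selector, and the allowed
# skew are placeholder assumptions.
from collections import Counter
from kubernetes import client, config

MAX_SKEW = 1  # assumed tolerance: max difference in pod count between any two nodes

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="routing", label_selector="app=edge-proxy")
per_node = Counter(p.spec.node_name for p in pods.items if p.spec.node_name)

if not per_node:
    raise SystemExit("Deployment blocked: no scheduled traffic-handling pods found")

skew = max(per_node.values()) - min(per_node.values())
print(f"Pods per node: {dict(per_node)} (skew={skew})")

if skew > MAX_SKEW:
    raise SystemExit(f"Deployment blocked: pod skew {skew} exceeds allowed {MAX_SKEW}")
```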
Moving Forward
We remain committed to providing a reliable and robust platform for our customers. The combination of these immediate fixes and long-term improvements will decrease the likelihood of similar incidents occurring in the future.
Thank you for your trust in HubSpot, and we appreciate your patience as we work to continuously enhance our systems.
Appendix: A Technical Deep Dive into the September 25 and August 27 Incidents
In both the September 25 and August 27 incidents, our traffic routing layer—primarily powered by Envoy and deployed on Kubernetes—experienced critical failures that impacted service availability. While the outcomes were similar, the root causes and triggering factors behind each incident were distinct.
Envoy’s Role in Our Architecture
Envoy served as a critical component in our routing infrastructure, managing incoming traffic and distributing it to the appropriate backend services. Deployed on Kubernetes, Envoy pods operated as load balancers, handling both internal and external traffic. Each pod was monitored by liveness and readiness probes to ensure its health and capacity to handle requests. These probes, along with health checks sent by the AWS ALB (Application Load Balancer), formed the foundation for maintaining high availability in our traffic routing layer.
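For readers unfamiliar with this setup, the sketch below shows roughly what such probes look like when defined with the Kubernetes Python client against Envoy’s admin /ready endpoint. The admin port, image tag, and all thresholds are assumptions for illustration, not HubSpot’s actual manifest.

```python
# Illustrative (not HubSpot's actual manifest) liveness and readiness probes
# for an Envoy pod, built with the official Kubernetes Python client. Envoy's
# admin interface exposes /ready; the port and thresholds here are assumptions.
from kubernetes import client

readiness = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/ready", port=9901),  # assumed admin port
    period_seconds=5,
    timeout_seconds=2,
    failure_threshold=3,   # a few consecutive failures pull the pod out of rotation
)

liveness = client.V1Probe(  # deliberately more tolerant than the readiness probe
    http_get=client.V1HTTPGetAction(path="/ready", port=9901),
    period_seconds=10,
    timeout_seconds=5,
    failure_threshold=6,   # sustained failure restarts the pod
)

envoy_container = client.V1Container(
    name="envoy",
    image="envoyproxy/envoy:v1.31.0",  # hypothetical version tag
    readiness_probe=readiness,
    liveness_probe=liveness,
)
```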
August 27 Incident
The August 27 incident was triggered by correlated false negatives in Envoy’s liveness probes. Config updates pushed over Envoy’s xDS protocol caused Envoy’s /ready endpoint to time out, leading Kubernetes to mark several healthy Envoy pods as unhealthy. As these pods began to shut down, the remaining pods were overwhelmed by a sudden influx of traffic.
This caused a congestive collapse in the load balancer, as the remaining pods struggled to handle the surge in connections. The overload manager in Envoy, which manages connection limits, was engaged and rejected excess connections, but the collapse occurred before the system could stabilize.
Key contributing factors:
- A noisy neighbor problem on the /ready endpoint: health checks, stats collection, and config updates are all handled by a single admin thread, so the config updates blocked the health checks and produced the false negatives (a toy reproduction of this effect follows this list).
- Traffic surges overwhelmed the remaining healthy pods, exacerbating the failure.
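The sketch below is a toy reproduction of that noisy neighbor effect: a single-threaded HTTP server handles both a slow "admin" request and the /ready check, so the health check times out even though the process is healthy. It models the behavior only and is not Envoy’s actual admin implementation; the port and timings are arbitrary.

```python
# Toy reproduction of the noisy neighbor failure mode: one single-threaded
# endpoint serves both health checks and slow admin work, so a slow admin
# request makes the health check time out even though the process is fine.
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AdminHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/stats":
            time.sleep(5)  # simulate an expensive stats/config-update task
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"LIVE\n")

    def log_message(self, *args):  # keep the demo output quiet
        pass


# HTTPServer is single-threaded, mirroring a shared admin thread.
server = HTTPServer(("127.0.0.1", 9901), AdminHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Kick off a slow "admin" request, then probe /ready the way kubelet would.
threading.Thread(
    target=urllib.request.urlopen, args=("http://127.0.0.1:9901/stats",), daemon=True
).start()
time.sleep(0.2)

try:
    urllib.request.urlopen("http://127.0.0.1:9901/ready", timeout=1)
    print("readiness probe passed")
except Exception as exc:
    print(f"readiness probe failed (false negative): {exc}")
```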
September 25 Incident
The September 25 incident, while similar in outcome, had a different root cause. Prior to the incident, we had identified an imbalance in the distribution of Envoy pods across Kubernetes nodes, where certain nodes were overloaded with too many traffic-handling pods. A fix was created and deployed on September 24 to address this issue. However, due to a deployment issue, the fix did not take effect, leaving the system in a vulnerable state.
When a spike in traffic occurred on September 25, readiness probes failed for several Envoy pods on the overloaded nodes, leading to their shutdown. The remaining pods were then overwhelmed by the traffic surge, causing system-wide degradation similar to the August 27 incident, though triggered by different conditions.
Key contributing factors:
- Uneven distribution of Envoy pods across Kubernetes nodes left some nodes more vulnerable to traffic surges, which caused readiness probes to fail simultaneously.
- The fix to address the pod distribution imbalance was not applied in time due to a deployment issue, leaving the system exposed to failure under high traffic.
- The spike in traffic, coupled with the existing vulnerability, led to overloaded pods and cascading shutdowns, similar to the congestive collapse seen in the August 27 incident.
Key differences between the events:
- The August 27 failure was due to misconfigured health checks, while the September 25 failure stemmed from an unresolved issue in pod distribution.
- Readiness probe failures on September 25 were a result of traffic surges overloading the already imbalanced pods.
Key Action: Rollback to Nginx
Given that both incidents affected the same layer of our routing infrastructure, and to ensure greater stability, we made the decision to roll back the routing layer from Envoy to the Nginx system that had been in use previously. Nginx, which has been operational for much longer, provides a stable fallback while we conduct a more comprehensive evaluation of Envoy. This rollback allows us to address the architectural improvements needed in Envoy before considering redeployment.
Other Actions Taken
Following the August 27 incident:
- We added configurations to Envoy’s overload manager to better handle connection limits and avoid future congestive collapse.
- We revised the health check mechanism to isolate the /ready endpoint from other admin tasks, mitigating the noisy neighbor issue.
After the September 25 incident:
- We rolled back the affected routing layer from Envoy to Nginx, ensuring immediate stability while we review and improve Envoy's configuration and architecture.
- We implemented changes to ensure even pod distribution across Kubernetes nodes, minimizing the risk of future traffic surges leading to similar failures.
- We upgraded monitoring and alerting systems to catch early signs of traffic imbalances or readiness probe failures.
In both cases, we have increased capacity, enhanced traffic management, and improved resilience by making critical adjustments to our routing infrastructure, reducing the risk of cascading failures.