Written by Kartik Vishwanath, VP of Engineering at HubSpot.
_______
Between 5:47 AM EDT and 9:44 AM EDT on September 25, 2024, some HubSpot customers were unable to log in to their accounts, access their content, or use our tools. The issue occurred during two time periods. The first began at 5:47 AM EDT and was resolved by 6:09 AM EDT. The second started at 9:14 AM EDT and was resolved by 9:44 AM EDT. This incident followed a similar event on August 27, 2024, which also affected the same part of our routing stack. We sincerely apologize for the inconvenience and disruption these incidents may have caused.
Although the two incidents had different root causes, we conducted a thorough review of each and addressed the specific issues identified. Even so, out of an abundance of caution, we decided to roll back the routing layer to a previously proven and reliable configuration that had been in operation for a longer period. This rollback gives us additional time to comprehensively evaluate the next-generation routing technology, which had been in operation for over a year without prior issues.
For a deeper technical explanation of the root causes and corrective actions for the September 25 and August 27 incidents, please see the appendix below.
HubSpot’s infrastructure processes billions of customer requests daily through a routing system composed of load balancers that manage the flow of requests across our backend services. These load balancers are deployed as Kubernetes pods and provide the traffic distribution and redundancy needed to support large-scale operations.
In this incident, an issue with how the traffic-handling pods were distributed across Kubernetes nodes caused multiple pods to fail their readiness checks simultaneously, which led to the automatic shutdown of the affected pods. The remaining pods were then overwhelmed, producing a surge of 5xx errors at our primary load balancer. The behavior resembled a thundering herd problem: the simultaneous failures concentrated load on the remaining healthy pods, compounding the overall strain on the system.
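To make the effect concrete, the toy calculation below shows how sharply per-pod load rises when a large fraction of pods is pulled from service at once. The pod counts and request rate are invented for illustration only; they are not our actual numbers.

```python
# Toy illustration of the thundering herd effect: the numbers below are invented,
# not HubSpot's actual pod counts or traffic volumes.
TOTAL_PODS = 40                  # hypothetical traffic-handling pods in the fleet
FAILED_PODS = 25                 # pods removed after failing readiness checks simultaneously
REQUESTS_PER_SECOND = 200_000    # hypothetical aggregate request rate

before = REQUESTS_PER_SECOND / TOTAL_PODS
after = REQUESTS_PER_SECOND / (TOTAL_PODS - FAILED_PODS)

print(f"per-pod load before the failures: {before:,.0f} req/s")
print(f"per-pod load after the failures:  {after:,.0f} req/s ({after / before:.1f}x)")
```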
On September 23, we identified an issue with the distribution of traffic-handling pods across Kubernetes nodes. The imbalance risked overloading certain nodes, leaving the system vulnerable under heavy traffic. A fix was implemented on September 24, but it did not take effect due to a deployment issue, which left the system exposed to failure under specific traffic patterns.
When a surge in traffic occurred on the morning of September 25, the previously known pod distribution issue triggered a cascade of failures. Readiness probes for multiple pods began to fail simultaneously, causing Kubernetes to remove these pods from service. This overwhelmed the remaining pods with an influx of traffic, resulting in two waves of degraded performance and 5xx errors.
Although we had been actively addressing this issue, the fix could not be applied in time to prevent the system from entering a vulnerable state during the traffic spike.
In both this incident and the earlier event on August 27, 2024, we performed comprehensive reviews and addressed the specific root causes that led to service disruptions. However, given that these two incidents affected the same part of our routing stack, we took an additional step: rolling back the routing layer to a previously stable technology that had been in operation much longer.
The newer routing technology had been in use for over a year without incident prior to these events; the previous technology, to which we have now reverted, has been operational for much longer. We are using this rollback period to thoroughly evaluate the newer-generation system, confirm that it aligns with our long-term reliability goals, and make any architectural improvements required before considering its redeployment.
We are also mandating validation of pod distribution across nodes before future deployments, so that no deployment proceeds without confirming that traffic-handling pods are evenly placed.
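As an illustration of what such a pre-deployment check could look like, here is a minimal sketch using the Kubernetes Python client. The namespace, label selector, and per-node limit are hypothetical placeholders, not our actual configuration.

```python
# Minimal sketch of a pre-deployment pod-spread check using the Kubernetes Python client.
# The namespace, label selector, and per-node limit are hypothetical placeholders.
from collections import Counter

from kubernetes import client, config


def pods_per_node(namespace: str, label_selector: str) -> Counter:
    """Count scheduled pods matching the selector, grouped by node."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    pods = client.CoreV1Api().list_namespaced_pod(namespace, label_selector=label_selector)
    return Counter(p.spec.node_name for p in pods.items if p.spec.node_name)


def spread_is_acceptable(counts: Counter, max_per_node: int) -> bool:
    """Fail the check if any single node carries more than max_per_node traffic-handling pods."""
    return bool(counts) and max(counts.values()) <= max_per_node


if __name__ == "__main__":
    counts = pods_per_node("routing", "app=traffic-proxy")
    for node, n in sorted(counts.items()):
        print(f"{node}: {n} pods")
    if not spread_is_acceptable(counts, max_per_node=4):
        raise SystemExit("pod distribution check failed: refusing to proceed with deployment")
```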
We remain committed to providing a reliable and robust platform for our customers. The combination of these immediate fixes and long-term improvements will decrease the likelihood of similar incidents occurring in the future.
Thank you for your trust in HubSpot, and we appreciate your patience as we work to continuously enhance our systems.
_______

Appendix: Technical Details

In both the September 25 and August 27 incidents, our traffic routing layer—primarily powered by Envoy and deployed on Kubernetes—experienced critical failures that impacted service availability. While the outcomes were similar, the root causes and triggering factors behind each incident were distinct.
Envoy served as a critical component in our routing infrastructure, managing incoming traffic and distributing it to the appropriate backend services. Deployed on Kubernetes, Envoy pods operated as load balancers, handling both internal and external traffic. Each pod was monitored by liveness and readiness probes to ensure its health and capacity to handle requests. These probes, along with health checks sent by the AWS ALB (Application Load Balancer), formed the foundation for maintaining high availability in our traffic routing layer.
The August 27 incident was triggered by correlated false negatives in Envoy’s liveness probes. Configuration updates pushed over Envoy’s xDS protocol caused the Envoy /ready endpoint to time out, which led Kubernetes to mark several healthy Envoy pods as unhealthy. As these pods began to shut down, the remaining pods were overwhelmed by a sudden influx of traffic.
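For context, a readiness-style check against Envoy’s admin /ready endpoint amounts to an HTTP request with a short timeout; if the endpoint is slow to respond, for example while the proxy is busy processing configuration updates, the probe fails even though the proxy may still be serving traffic. The sketch below is illustrative only; the admin port and timeout are assumptions, not our production settings.

```python
# Minimal sketch of a readiness-style check against Envoy's admin /ready endpoint.
# The admin port and timeout are assumptions, not our production settings.
import sys
import urllib.error
import urllib.request

READY_URL = "http://127.0.0.1:9901/ready"  # 9901 is a commonly used Envoy admin port
TIMEOUT_SECONDS = 1.0                      # a slow response counts the same as an unhealthy one


def probe() -> int:
    try:
        with urllib.request.urlopen(READY_URL, timeout=TIMEOUT_SECONDS) as resp:
            ok = resp.status == 200        # Envoy answers 200 on /ready once the server is live
    except (urllib.error.URLError, OSError) as exc:
        print(f"probe failed: {exc}", file=sys.stderr)
        return 1
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(probe())
```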
This caused a congestive collapse in the load balancer, as the remaining pods struggled to handle the surge in connections. The overload manager in Envoy, which manages connection limits, was engaged and rejected excess connections, but the collapse occurred before the system could stabilize.
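Envoy’s overload manager is configured within Envoy itself; purely as a conceptual sketch of the idea of rejecting excess connections at a cap rather than queueing them (this is not Envoy’s implementation), consider the following:

```python
# Conceptual sketch of load shedding at a fixed connection cap; this is NOT how
# Envoy's overload manager is implemented or configured.
import asyncio

MAX_ACTIVE_CONNECTIONS = 100   # hypothetical cap standing in for an overload threshold
active_connections = 0


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    global active_connections
    if active_connections >= MAX_ACTIVE_CONNECTIONS:
        # Shed load immediately instead of queueing work the server cannot absorb.
        writer.write(b"HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n")
        await writer.drain()
        writer.close()
        return
    active_connections += 1
    try:
        await reader.readline()  # read (and ignore) the request line
        writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        await writer.drain()
    finally:
        active_connections -= 1
        writer.close()


async def main() -> None:
    server = await asyncio.start_server(handle, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())
```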
Key contributing factors:
- Configuration updates pushed over the xDS protocol caused Envoy’s /ready endpoint to time out.
- The timeouts produced false negatives in the liveness probes across many pods at once, so Kubernetes marked healthy pods as unhealthy and began shutting them down.
- Traffic was redistributed onto the remaining pods faster than they could absorb it, and the resulting congestive collapse set in before Envoy’s overload manager could stabilize the system.
The September 25 incident, while similar in outcome, had a different root cause. Prior to the incident, we had identified an imbalance in the distribution of Envoy pods across Kubernetes nodes, where certain nodes were overloaded with too many traffic-handling pods. A fix was created and deployed on September 24 to address this issue. However, due to a deployment issue, the fix did not take effect, leaving the system in a vulnerable state.
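One common Kubernetes mechanism for keeping a deployment’s pods spread evenly across nodes is a topology spread constraint. The sketch below, using the Kubernetes Python client, is illustrative only: the deployment name, namespace, and labels are hypothetical, and we are not asserting this is the exact fix that was deployed.

```python
# Illustrative only: adding a topology spread constraint to an Envoy-style deployment
# so the scheduler keeps pods evenly distributed across nodes.
# The deployment name, namespace, and labels are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

spread = client.V1TopologySpreadConstraint(
    max_skew=1,                              # allow at most a one-pod difference between nodes
    topology_key="kubernetes.io/hostname",   # spread across individual nodes
    when_unsatisfiable="DoNotSchedule",      # refuse to schedule rather than pile onto a hot node
    label_selector=client.V1LabelSelector(match_labels={"app": "envoy-proxy"}),
)

dep = apps.read_namespaced_deployment("envoy-proxy", "routing")
dep.spec.template.spec.topology_spread_constraints = [spread]
apps.patch_namespaced_deployment("envoy-proxy", "routing", dep)
```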
When a spike in traffic occurred on September 25, the overloaded nodes caused readiness probes to fail for several Envoy pods, which led to their shutdown. The remaining pods were then overwhelmed by the redistributed traffic, causing system-wide degradation similar to the August 27 incident, though triggered by different conditions.
Key contributing factors:
- Envoy pods were unevenly distributed across Kubernetes nodes, leaving certain nodes overloaded with traffic-handling pods.
- The fix deployed on September 24 did not take effect because of a deployment issue, leaving the system in a vulnerable state.
- A traffic spike on September 25 caused readiness probes on the overloaded nodes to fail, shutting those pods down and overwhelming the pods that remained.
Given that both incidents affected the same layer of our routing infrastructure, and to ensure greater stability, we decided to roll back the routing layer from Envoy to the Nginx-based system that had been in use previously. Nginx, which has been operational for much longer, provides a stable fallback while we conduct a more comprehensive evaluation of Envoy. This rollback gives us the room to make the architectural improvements needed in Envoy before considering redeployment.
Following the August 27 incident, and again after the September 25 incident, we increased capacity, enhanced traffic management, and made critical adjustments to our routing infrastructure to improve its resilience and reduce the risk of cascading failures.