On October 20, 2025, HubSpot experienced a significant service disruption affecting multiple product features due to a severe AWS outage in the us-east-1 region. While our infrastructure remained intact, the widespread nature of the cloud provider failure impacted both our services and critical third-party vendors we rely on. We've completed a thorough analysis of this incident and are implementing comprehensive improvements to strengthen our resilience against future cloud provider disruptions.
At 2:48 AM ET on October 20, AWS experienced one of its most severe service disruptions in recent history, affecting numerous services in the us-east-1 region where HubSpot's primary infrastructure operates. The outage began with failures in DynamoDB (a database service) that cascaded to affect IAM (Identity and Access Management), SQS (Simple Queue Service), and EC2 (compute instances).
When services need work performed in the background or at a future time, they enqueue tasks to TQ2, our internal task queue system, which reliably processes them using Amazon SQS as its underlying message queue. This architecture keeps HubSpot's user-facing features responsive while handling time-intensive operations behind the scenes.
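For illustration, here is a minimal sketch of this enqueue pattern using the AWS SDK for Python (boto3). The queue name and task fields are hypothetical examples, not the actual TQ2 interface.

```python
import json
import boto3

# Minimal sketch of the general pattern: a service serializes a background
# task and enqueues it to SQS for asynchronous processing. The queue name
# and payload fields below are illustrative, not TQ2's real interface.
sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.get_queue_url(QueueName="tq2-example-tasks")["QueueUrl"]

task = {
    "type": "send_marketing_email",   # hypothetical task type
    "payload": {"contact_id": 12345},
    "enqueued_at": "2025-10-20T02:48:00Z",
}

sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(task))
```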
During the AWS outage, TQ2 experienced significant processing degradation that extended well beyond the initial cloud provider disruption.
During this incident, customers experienced degraded performance across multiple product areas.
While AWS services were impaired, our engineering teams took several defensive actions to protect service stability and minimize customer impact:
Infrastructure Stabilization
Our engineers immediately disabled automated scaling and deployment systems to prevent the AWS API failures from causing additional service disruption. This manual intervention maintained the stability of our existing infrastructure while AWS services were degraded.
Workload Management
We performed manual interventions to allow critical applications to continue running without requiring calls to failing AWS APIs, enabling some services to maintain partial functionality during the outage.
Task Queue Failover
Our TQ2 system's automatic failover mechanism successfully redirected millions of background tasks to our Kafka backup system, preserving task data during the SQS outage and enabling recovery once AWS services were restored.
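The following is a simplified sketch of this kind of failover path, assuming boto3 for SQS and kafka-python for the backup producer; the queue, topic, and broker names are illustrative rather than our production configuration.

```python
import json
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError
from kafka import KafkaProducer

# Sketch of the failover pattern: try the primary SQS queue, and if the call
# fails, preserve the task on a Kafka backup topic so it can be replayed once
# SQS recovers. All names here are illustrative.
sqs = boto3.client("sqs", region_name="us-east-1")
producer = KafkaProducer(
    bootstrap_servers=["kafka-backup:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enqueue_with_failover(queue_url: str, task: dict) -> None:
    try:
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(task))
    except (ClientError, EndpointConnectionError):
        # SQS is unavailable; write the task to the backup topic instead.
        producer.send("tq2-failover-tasks", value=task)
        producer.flush()
```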
Enhanced Incident Response Procedures
We're documenting and automating our defensive procedures that were executed manually during this incident. This includes automated deployment freezing, infrastructure scaling controls, and recovery procedures. These runbooks will ensure any engineer can execute critical defensive measures within minutes rather than requiring specific expertise.
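As a rough illustration of what an automated deployment freeze can look like, the sketch below toggles a shared flag that deployment tooling would consult before rolling out changes. The flag location and schema are hypothetical, not our actual tooling.

```python
import argparse
import json
import time
from pathlib import Path

# Hypothetical shared flag that deployment tooling checks before any rollout.
FREEZE_FLAG = Path("/etc/deploys/freeze.json")

def set_freeze(enabled: bool, reason: str) -> None:
    # Record the freeze state, the reason, and when it was last changed.
    FREEZE_FLAG.write_text(json.dumps({
        "frozen": enabled,
        "reason": reason,
        "updated_at": time.time(),
    }))

def deploys_allowed() -> bool:
    # Deployment tooling would call this before starting a rollout.
    if not FREEZE_FLAG.exists():
        return True
    return not json.loads(FREEZE_FLAG.read_text()).get("frozen", False)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Toggle the deployment freeze flag")
    parser.add_argument("action", choices=["freeze", "unfreeze"])
    parser.add_argument("--reason", default="incident response")
    args = parser.parse_args()
    set_freeze(args.action == "freeze", args.reason)
```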
Vendor Diversification
We're exploring vendor diversification strategies across our stack to reduce dependency on any single provider. This includes evaluating alternative providers and building abstraction layers that enable automatic failover when vendors experience regional failures.
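A simplified sketch of such an abstraction layer is shown below: callers depend on a narrow queue interface, and a failover wrapper routes around a failing provider. The class and method names are illustrative assumptions, not an existing library.

```python
from abc import ABC, abstractmethod

class QueueBackend(ABC):
    """Narrow interface that callers depend on, independent of any vendor."""

    @abstractmethod
    def publish(self, task: dict) -> None: ...

class FailoverQueue(QueueBackend):
    """Tries the primary backend and falls back to the secondary on error."""

    def __init__(self, primary: QueueBackend, secondary: QueueBackend):
        self.primary = primary
        self.secondary = secondary

    def publish(self, task: dict) -> None:
        try:
            self.primary.publish(task)
        except Exception:
            # Any provider failure routes the task to the backup backend.
            self.secondary.publish(task)
```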
Advanced Monitoring and Early Warning Systems
We're improving monitoring around critical cloud provider APIs and dependencies to detect failures before they impact production workloads and identify degradation patterns early.
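As an example of the approach, the sketch below runs a lightweight canary probe against a cheap SQS API call and raises an alert when latency or errors cross a threshold. The thresholds and alert hook are illustrative.

```python
import time
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Probe client with tight timeouts and no retries so degradation shows up fast.
sqs = boto3.client(
    "sqs",
    config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 0}),
)

LATENCY_THRESHOLD_SECONDS = 1.0  # illustrative threshold

def probe_sqs() -> None:
    start = time.monotonic()
    try:
        sqs.list_queues(MaxResults=1)  # inexpensive call used as a canary
    except (BotoCoreError, ClientError) as exc:
        alert(f"SQS probe failed: {exc}")
        return
    elapsed = time.monotonic() - start
    if elapsed > LATENCY_THRESHOLD_SECONDS:
        alert(f"SQS probe slow: {elapsed:.2f}s")

def alert(message: str) -> None:
    # Placeholder: in practice this would emit a metric or page on-call.
    print("ALERT:", message)
```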
Improved Task Queue Recovery
The TQ2 failover consumer is being converted from synchronous to asynchronous processing to significantly increase throughput. We're also migrating the failover topic to dedicated infrastructure with substantially increased capacity to enable massive parallel processing during recovery scenarios.
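The sketch below illustrates the shape of this change, assuming kafka-python: rather than replaying failover tasks one at a time, each polled batch is handed to a worker pool so many tasks are re-enqueued in parallel. Topic, group, and worker-count values are illustrative.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from kafka import KafkaConsumer

# Consumer for the failover topic; offsets are committed manually so a batch
# is only acknowledged after every task in it has been replayed.
consumer = KafkaConsumer(
    "tq2-failover-tasks",
    bootstrap_servers=["kafka-backup:9092"],
    group_id="tq2-failover-recovery",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)

def replay_task(task: dict) -> None:
    # Re-submit the task to the primary queue (details omitted in this sketch).
    ...

with ThreadPoolExecutor(max_workers=32) as pool:
    while True:
        batch = consumer.poll(timeout_ms=1000)
        futures = [
            pool.submit(replay_task, record.value)
            for records in batch.values()
            for record in records
        ]
        for future in futures:
            future.result()   # surface failures before committing offsets
        consumer.commit()     # commit only after the whole batch succeeds
```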
Expanded Chaos Engineering
We're expanding our existing chaos engineering program to include more comprehensive AWS service failure scenarios. These exercises will validate our runbooks, train our teams, and identify weaknesses before they impact customers.
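For example, a game-day exercise can inject faults at the client level, as in the hypothetical wrapper below that makes a configurable fraction of SQS calls fail so failover paths are exercised without a real outage.

```python
import random
import boto3
from botocore.exceptions import EndpointConnectionError

class FlakySqsClient:
    """Illustrative fault-injection wrapper: fails a fraction of SQS calls."""

    def __init__(self, client, failure_rate: float = 0.5):
        self._client = client
        self._failure_rate = failure_rate

    def send_message(self, **kwargs):
        if random.random() < self._failure_rate:
            # Simulate a regional connectivity failure to exercise failover.
            raise EndpointConnectionError(
                endpoint_url="https://sqs.us-east-1.amazonaws.com"
            )
        return self._client.send_message(**kwargs)

# During a game day, services under test would receive the flaky client.
sqs = FlakySqsClient(boto3.client("sqs", region_name="us-east-1"))
```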
Customer Communication Improvements
We're improving error messages to be more timely and informative when service issues occur. Additionally, we're implementing automatic in-app banners that activate when critical dependencies fail, ensuring customers have clear visibility into major service disruptions.
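As a rough sketch of the banner mechanism, the example below combines dependency health signals into a payload a frontend could poll to decide whether to display an incident banner; the health source and payload shape are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class DependencyHealth:
    name: str
    healthy: bool

def banner_payload(dependencies: list[DependencyHealth]) -> dict:
    """Return a flag and message the UI can use to show or hide a banner."""
    degraded = [d.name for d in dependencies if not d.healthy]
    if not degraded:
        return {"show_banner": False}
    return {
        "show_banner": True,
        "message": (
            f"Some features may be degraded ({', '.join(degraded)}). "
            "We're actively working on a fix."
        ),
    }

# Example: health signals for two dependencies, one of which is failing.
print(banner_payload([DependencyHealth("sqs", False), DependencyHealth("dynamodb", True)]))
```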
While cloud provider outages of this magnitude are rare, we recognize that our customers depend on HubSpot for mission-critical business operations. This incident has catalyzed a comprehensive reliability initiative spanning infrastructure, architecture, and operational improvements.
We're investing in building antifragile systems: systems that not only withstand failures, but also improve from stress testing. Through vendor diversification, architectural evolution, and rigorous failure testing, we're working to ensure that future cloud provider incidents have minimal impact on your business operations.
Thank you for your patience and continued trust in HubSpot. We're committed to maintaining that trust through continuous improvement of our platform's reliability and resilience.