On Thursday, March 27, 2025, between approximately 15:00 UTC and 17:50 UTC, some HubSpot customers experienced delays in sending emails, including marketing emails, test sends, signup verifications, and 2FA (two-factor authentication) emails. This issue affected only a segment of our North American infrastructure.

We understand how critical timely email delivery is for your business operations, and we sincerely apologize for any disruption and inconvenience this incident caused. At HubSpot, reliability is a top priority, and we are committed to learning from this event and improving our systems to prevent similar issues in the future.

Here’s a detailed explanation of what occurred and the steps we’re taking to enhance our platform’s resilience.

What Happened

On March 27, starting around 15:00 UTC, an internal component responsible for processing email sends began issuing an unusual pattern of large atomic writes. These writes caused hotspotting, meaning load concentrated on a small part of the cluster, in specific areas of Apache HBase, the distributed data store we use for critical email operations.

This concentrated, large-scale write activity put significant pressure on the HBase cluster. The performance impact was amplified because certain critical meta and data tables were colocated within the affected regions. The resulting strain on the underlying data infrastructure quickly caused an email processing backlog.
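To illustrate hotspotting in general terms, here is a minimal Python sketch. It is not HubSpot's code and does not use the HBase client: the region count, split points, key format, and the `salted_key` helper are all hypothetical. It simulates range partitioning, where rows with similar keys land on the same region, and shows how prefixing keys with a hash-derived salt (one commonly used mitigation, not necessarily the one applied here) spreads the same writes across regions.

```python
import bisect
import hashlib
from collections import Counter

NUM_BUCKETS = 8  # hypothetical number of salt buckets / regions


def salted_key(row_key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a stable, hash-derived bucket so adjacent keys scatter."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}|{row_key}"


def region_for(key: str, split_points: list) -> int:
    """Range partitioning: the region index is the number of split points <= key."""
    return bisect.bisect_right(split_points, key)


def simulate(num_writes: int = 1000):
    """Count writes per region for unsalted and salted versions of the same keys."""
    # Hypothetical sequential keys that all share the prefix "send-".
    keys = [f"send-{i:08d}" for i in range(num_writes)]

    # Assumed split points for a table keyed by raw row key.
    raw_splits = ["d", "h", "l", "p", "t", "x", "z"]
    unsalted = Counter(region_for(k, raw_splits) for k in keys)

    # Assumed split points aligned with the two-digit salt prefixes.
    salt_splits = [f"{b:02d}" for b in range(1, NUM_BUCKETS)]
    salted = Counter(region_for(salted_key(k), salt_splits) for k in keys)
    return unsalted, salted


if __name__ == "__main__":
    unsalted, salted = simulate()
    print("unsalted writes per region:", dict(unsalted))
    print("salted writes per region:  ", dict(sorted(salted.items())))
```

In the unsalted case every write lands on a single region, which is the hotspot. Salting trades that concentration for more expensive range scans, since a scan must then visit every bucket.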

While the initiating component was paused quickly, the initial database instability triggered excessive retry attempts from other parts of our email sending system. This meant that recovering required processing not only the original backlog, but also a large volume of these retries, further stressing the affected data cluster and prolonging the incident.
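The retry amplification described above is a common failure pattern: clients that retry immediately add load to a store that is already struggling. As a hedged illustration, and not a description of HubSpot's email pipeline, the Python sketch below shows capped exponential backoff with full jitter and a bounded number of attempts. The function names and default parameters (`base`, `cap`, `max_attempts`) are assumptions chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    op: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op, retrying on any exception with jittered, capped backoff.

    The final failure is re-raised so callers can route the work elsewhere
    (for example, to a slower retry queue) instead of retrying forever.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The jitter spreads retries from many callers over time so they do not arrive in synchronized waves, and the attempt limit bounds how much extra load a single failed operation can generate.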

Our engineering teams were alerted to the issue shortly after it began. Here’s a timeline of the events:

  • 15:02 UTC: Our monitoring systems detected anomalies in the email sending pipeline. As the underlying data infrastructure began experiencing high demand, it led to a backlog. Emails started experiencing delays.
  • 15:00 UTC - 16:30 UTC: A large portion of emails sent during this time were queued for retry due to the system overload.
  • 15:39 UTC: The potential root cause was identified and our teams took immediate steps to stabilize the system. This involved carefully managing the load on the affected infrastructure and adjusting resources to alleviate the pressure.
  • 15:58 UTC: HubSpot posted a status page update and an in-app banner informing our customers of the incident.
  • 17:20 UTC: The system stabilized, and newly sent emails began processing normally without significant delays.
  • 17:52 UTC: The backlog of emails that were delayed during the incident was fully processed and sent.

The maximum email send delay experienced during this incident was approximately 1 hour and 53 minutes, while the median delay was around 1 hour and 25 minutes.

Most emails sent after 17:20 UTC were delivered without delay, although a small portion of emails that required retries may have experienced delays until the backlog was cleared at 17:52 UTC.

What We're Doing to Improve

Following a thorough review of this incident, we have identified several key areas for improvement to make our email sending infrastructure more resilient:

  1. Enhanced Visibility and Alerting: We are improving visibility into our systems to provide earlier warnings of potential strain on the email infrastructure. This will allow our teams to react more quickly and proactively manage resources before customer impact occurs.
  2. Increased System Resilience: We are implementing changes to make the email sending pipeline better equipped to handle unexpected surges in load. This includes improvements to how the system manages queues and retries, ensuring smoother performance under stress.
  3. Improved Load Management: We are refining our internal processes for load balancing and resource allocation within the email system. This includes exploring ways to better isolate and prioritize different types of email traffic to minimize the impact of any single component experiencing issues.
  4. Infrastructure Optimization: We are conducting a deeper review of the underlying storage and infrastructure components supporting our email services to identify opportunities for long-term performance and stability enhancements.
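As one way to picture the traffic prioritization mentioned in item 3, here is a minimal Python sketch of a priority-aware send queue. It is an illustration under stated assumptions, not HubSpot's design: the `PrioritySendQueue` class, the traffic class names, and the priority values in `PRIORITY` are hypothetical, including the choice to rank 2FA and signup verification emails ahead of marketing sends.

```python
import heapq
import itertools

# Hypothetical priorities: a lower number is served first.
PRIORITY = {
    "2fa": 0,
    "signup_verification": 0,
    "test_send": 1,
    "marketing": 2,
}


class PrioritySendQueue:
    """Serve time-sensitive email classes first, FIFO within a class."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def enqueue(self, kind: str, message_id: str) -> None:
        if kind not in PRIORITY:
            raise ValueError(f"unknown email kind: {kind}")
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), kind, message_id))

    def dequeue(self):
        """Return (kind, message_id) for the highest-priority waiting email."""
        _, _, kind, message_id = heapq.heappop(self._heap)
        return kind, message_id

    def __len__(self) -> int:
        return len(self._heap)
```

With a queue like this, a backlog of bulk sends would delay other bulk sends before it delays a login code. A production system would also need to guard against starving the low-priority classes, which this sketch does not attempt.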

Our commitment to improvement extends beyond our own systems. When we identify opportunities to enhance the open-source technologies we rely on, we contribute our findings and fixes back to the community. 

As part of our response, we are actively working on contributions to Apache HBase, the database technology involved in this incident. These contributions aim to improve the database's resilience to demanding workloads and enhance visibility into system operations, helping us and the broader community identify and address potential issues more quickly in the future. You can view some of these efforts publicly here: HBASE-29231, HBASE-29229, HBASE-29090.

Moving Forward

We recognize that incidents like this impact your ability to connect with your customers, and we take our responsibility to provide a reliable platform very seriously. We are dedicating engineering resources to implement the improvements outlined above and will continue to invest in the stability and performance of the HubSpot platform.

Thank you for your patience and your continued trust in HubSpot. We are committed to providing you with the reliable tools you need to grow better.
