Written by Charles Connell, Senior Software Engineer II at HubSpot, and Swetha Narayanaswamy, Director, Engineering at HubSpot.
_______
Between 13:57 UTC on May 28, 2024, and 13:00 UTC on May 29, 2024, a subset of HubSpot customers experienced delays in events, analytics, forms processing, and chat messaging. We apologize for the impact this issue may have had on you and your business. Reliability is a core value at HubSpot that we care about deeply. As such, we would like to take this opportunity to share what happened and what we're doing to prevent similar issues in the future.
During our engineering investigation, we isolated the issue to a Linux kernel bug that impaired the instances of our HBase fleet running on the ARM64 CPU architecture.
Architecture Overview
Much of HubSpot’s customers’ data is stored in an open-source key-value store called HBase. These software instances run in AWS on the GNU/Linux operating system. The operating system kernel manages system resources, such as the CPU, memory, and devices, and mediates between software applications and computer hardware so that everything works together smoothly and efficiently. To keep your data as secure as possible, we frequently upgrade our kernel software to incorporate the latest fixes and vulnerability updates.
The kernel manages network communication with other servers and the bookkeeping associated with those connections. The performance and correctness of this code are crucial in our connected world.
The Incident
- On April 11, 2024, we began to upgrade our HBase servers from Linux kernel version 5.15.86 to 6.1.66. The new version of the kernel had a bug that made it believe it was using more memory for TCP connections than it actually was. The miscalculation grew between April 11 and May 27.
- By May 27 at 18:19 UTC, after weeks of uptime, some servers in one of our HBase databases incorrectly believed they had hit their TCP memory limits. This caused some TCP connections to drop, not connect, or slow down, which degraded performance enough to affect our HubSpot user experience.
- On May 28 at 20:00 UTC, we identified TCP memory limit violations as the likely cause for poor performance of the HBase database. As a result, we quickly raised memory limits. We also deployed a metric collector to all HubSpot servers to collect TCP memory usage statistics from every server. We verified that no other servers at HubSpot were at risk of the same issue.
- By May 29 at 08:00 UTC, HubSpot services were back to normal operation.
- On June 2, we isolated the Linux kernel bug (discussed below) and the resulting bookkeeping errors behind the incident.
Technical Details
On a multiprocessor system, each processor has its own cache of the computer’s memory. It is much faster for a processor to read from and write to its own cache than the computer’s main memory. These caches (including store buffers) can hold different values for the same memory address, leading each processor to have a different view of the values of your code’s variables. A single variable that is frequently accessed by different processors is not conducive to caching. To be useful, the variable’s value must be kept synchronized across all processors.
To accomplish this synchronization, any time the value is changed by one processor, the new value must be written all the way to memory, and reads from other processors must be read all the way from memory. Programs can deploy so-called memory barriers or fences to accomplish this. Unfortunately, when used many times per second, this can be a drag on application performance. The Linux kernel has a concept of per-CPU variables, which despite appearing as one variable, are actually independent variables for each CPU. For variables that are accessed very frequently, using a per-CPU variable instead of a normal variable can give a big performance boost.
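As a rough userspace illustration of the per-CPU idea (this is not kernel code; `MODEL_NR_CPUS`, `percpu_counter_model`, and both function names are hypothetical), a counter can be modeled as one slot per CPU. Each CPU updates only its own slot, and a reader that wants the overall value sums the slots:

```c
#include <stddef.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count, chosen for the model */

/* One logical counter, stored as an independent slot per CPU. */
struct percpu_counter_model {
    long slot[MODEL_NR_CPUS];
};

/* An update touches only the calling CPU's slot, so no other CPU's
 * cached data is affected by it. */
static void model_percpu_add(struct percpu_counter_model *c, int cpu, long amt)
{
    c->slot[cpu] += amt;
}

/* Reading the logical value means visiting every slot. */
static long model_percpu_sum(const struct percpu_counter_model *c)
{
    long total = 0;
    for (size_t i = 0; i < MODEL_NR_CPUS; i++)
        total += c->slot[i];
    return total;
}
```

The trade-off in this model is that updates are cheap and local, while reading the combined value is more expensive.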
Linux maintains memory buffers that are necessary to support TCP sockets. Linux has a configurable limit on the overall amount of memory used by the kernel for all TCP sockets. In order for the kernel to know whether it has hit its limit, it must track how much memory is used by TCP sockets. Every time memory is allocated in support of a socket, a function inside the kernel named sk_memory_allocated_add is called. Every time memory is deallocated, a function named sk_memory_allocated_sub is called. This can happen thousands of times per second, on all CPUs in a system. Prior to Linux 6.0.0, a normal integer variable named memory_allocated tracked the memory usage. Linux 6.0.0 introduced a per-CPU variable named per_cpu_fw_alloc alongside memory_allocated.
To illustrate, in Linux 5.15.86, sk_memory_allocated_add looked like this:
A single variable tracks the memory usage, and it’s incremented when necessary. sk_memory_allocated_sub works similarly. The addition is explicitly done with atomic semantics because we expect multiple threads to contend over this variable. Addition can be thought of as three steps: read, add, and write. Using atomic_long_add_return ensures a thread will complete these steps before another thread starts them, that the value is flushed from the processor’s store buffer into memory, and that other processors’ cached copies of the value are invalidated.
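The same pattern can be approximated in userspace with C11 atomics. This is a minimal sketch, not the kernel source: `model_memory_allocated_add` is a hypothetical name, and `atomic_fetch_add` stands in for the kernel's `atomic_long_add_return`.

```c
#include <stdatomic.h>

/* Userspace stand-in for a single shared counter.
 * The read, add, and write happen as one indivisible operation. */
static long model_memory_allocated_add(atomic_long *memory_allocated, long amt)
{
    /* atomic_fetch_add returns the value held before the addition,
     * so adding amt again yields the new total ("add and return"). */
    return atomic_fetch_add(memory_allocated, amt) + amt;
}
```

Because every caller goes through the same atomic operation, no update can be lost, at the cost of all CPUs contending on one memory location.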
In Linux 6.1.66, the same function looked like this:
Again, sk_memory_allocated_sub is similar, and not shown. While the goal here is still just the simple task of adding and subtracting from a counter, the code got more complicated. There is now a variable per_cpu_fw_alloc that gets incremented first. If that exceeds 256, its value is added into memory_allocated and per_cpu_fw_alloc is set back to 0. As long as sk_memory_allocated_add touches only per_cpu_fw_alloc and does not enter the conditional block, it avoids accessing the shared variable, and spares other processors from having to re-read that variable later on.
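The batching described above can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel function: all names are hypothetical, the CPU index is passed in explicitly, preemption is not modeled, and treating "reaching the threshold" as the trigger for a flush is an assumption of the sketch.

```c
#include <stdatomic.h>

#define MODEL_NR_CPUS      4   /* hypothetical CPU count */
#define MODEL_PCPU_RESERVE 256 /* flush threshold taken from the text */

struct tcp_mem_model {
    atomic_long memory_allocated;         /* shared total */
    long per_cpu_fw_alloc[MODEL_NR_CPUS]; /* local, not yet flushed */
};

static void model_init(struct tcp_mem_model *m)
{
    atomic_init(&m->memory_allocated, 0);
    for (int i = 0; i < MODEL_NR_CPUS; i++)
        m->per_cpu_fw_alloc[i] = 0;
}

/* Add amt on behalf of `cpu`; touch the shared total only when the
 * local reserve has grown large enough to be worth flushing. */
static long model_add(struct tcp_mem_model *m, int cpu, long amt)
{
    long local_reserve = (m->per_cpu_fw_alloc[cpu] += amt);

    if (local_reserve >= MODEL_PCPU_RESERVE) {
        m->per_cpu_fw_alloc[cpu] -= local_reserve; /* back to 0 */
        atomic_fetch_add(&m->memory_allocated, local_reserve);
    }
    return atomic_load(&m->memory_allocated);
}
```

In this model, three additions of 100 on the same CPU leave the shared total untouched twice and then flush 300 at once on the third call.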
None of this complexity is inherently a problem. However, more complex code means more opportunities to make mistakes. Kernel code authors must assume their code can be preempted by the kernel scheduler or by a hardware interrupt. Scheduler preemptions can move a thread from one processor to another at any time. Hardware interrupts will pause a thread to run the interrupt handler, which could include a different invocation of the very same function that just got paused, and then allow the thread to continue. Moving from one CPU to another is not acceptable while dealing with per-CPU state. sk_memory_allocated_add takes care of this concern by calling preempt_disable(), which prevents the scheduler from preempting the running thread, before starting its work, and then calling preempt_enable() afterwards.
The mistake was in not considering how interrupts would affect sk_memory_allocated_add/sk_memory_allocated_sub. As their documentation warns, __this_cpu_add_return and __this_cpu_sub are neither guaranteed to be atomic nor “interrupt-safe.” On rare occasions in our HBase servers, while __this_cpu_add_return was running, an interrupt handler re-entered the same code path, leading __this_cpu_add_return to return or store an incorrect value, which then eventually got added to memory_allocated.
Our HBase servers run on a mix of x86-64 and ARM64 processors, and the bug only manifested on ARM64. The implementation of __this_cpu_add_return is architecture-specific. On x86, it uses an xadd machine instruction. This loads a value from memory, adds another value to it, and then stores the result back into memory. Because this is all done in a single instruction, it is atomic, and therefore is incidentally interrupt-safe. On ARM64, the same operation is written using a simple += operator. In HubSpot’s build of the kernel, that operator compiles to three concurrency-naive instructions:
It's possible for execution on a single CPU to follow this flow:
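To make the lost update concrete, here is a deterministic, single-threaded model of that flow. It is not the ARM64 instruction sequence and not kernel code: `naive_add_with_interrupt` is a hypothetical function that spells out the load, add, and store steps and lets a simulated interrupt handler perform its own complete add in between.

```c
/* Model of a non-atomic "+=" split into load, add, and store.
 * interrupt_amt simulates an interrupt handler that runs the same kind
 * of add after the load but before the store (0 means no interrupt). */
static long naive_add_with_interrupt(long *counter, long amt, long interrupt_amt)
{
    long tmp = *counter;           /* step 1: load the current value */

    if (interrupt_amt != 0)
        *counter += interrupt_amt; /* handler runs to completion here */

    tmp += amt;                    /* step 2: add to the stale copy */
    *counter = tmp;                /* step 3: store, overwriting the
                                    * handler's update */
    return tmp;
}
```

Starting from 10, adding 5 while a simulated handler adds 7 leaves the counter at 15 rather than 22: the handler's contribution is silently discarded. Repeated over weeks of uptime, small errors of this kind can accumulate into a large bookkeeping drift.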
On the other hand, the interrupt-safe version, this_cpu_add_return, compiles to:
This version uses the load-store-exclusive system to ensure that the operation fails and retries until it succeeds without interference. stxr will succeed only if the value loaded in the earlier ldxr has not changed in memory since the load. If it fails, cbnz jumps back to the first instruction, and the sequence is retried as many times as necessary. Later versions of the ARM64 specification also offer a single-instruction equivalent (ldadd), similar to xadd on x86. The kernel is capable of patching its own executable code at runtime if it detects that the processor supports newer instructions. Our kernels indicate that they are doing this, so this load-store-exclusive code is not used at runtime, but it is still illustrative of a good solution.
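The retry-until-undisturbed idea has a close userspace analogue in a C11 compare-and-exchange loop. This is an analogy rather than the kernel's implementation, and `retry_add_return` is a hypothetical name:

```c
#include <stdatomic.h>

/* Add amt to *counter and return the new value. The store only takes
 * effect if the counter still holds the value we loaded; otherwise the
 * loop reloads and tries again. */
static long retry_add_return(atomic_long *counter, long amt)
{
    long old = atomic_load(counter);

    /* On failure, atomic_compare_exchange_weak writes the counter's
     * current value into `old`, so the next attempt uses fresh data. */
    while (!atomic_compare_exchange_weak(counter, &old, old + amt))
        ;

    return old + amt;
}
```

If anything else modifies the counter between the load and the store, the exchange fails and the addition is recomputed, so no update is overwritten.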
While narrowing in on the bug, we deployed a kernel build that tracked TCP memory usage in both the old way (a single atomic variable) and the new way (with per-CPU variables). We then reported the values from each tracking system to verify they diverged.
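A comparison of the two tracking systems could look roughly like the sketch below. The function name and parameters are hypothetical, and the tolerance is an assumption of this sketch rather than something stated above: because each CPU may legitimately hold a small unflushed reserve, the batched total is allowed to differ from the exact total by up to one reserve per CPU before the two are considered to have diverged.

```c
/* Returns 1 if the two counters differ by more than the slack that
 * unflushed per-CPU reserves could explain, 0 otherwise. */
static int counters_diverged(long exact_total, long batched_total,
                             int nr_cpus, long per_cpu_reserve)
{
    long slack = (long)nr_cpus * per_cpu_reserve;
    long diff = batched_total - exact_total;

    if (diff < 0)
        diff = -diff;
    return diff > slack;
}
```

With 4 CPUs and a reserve of 256, a gap of 200 would be treated as normal, while a gap of 4000 would be reported as divergence.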
It’s typical for HBase servers to have uptimes of several months. The chart below shows the divergence we saw within a single kernel over just 2 days:
This shows how complex it is to write code that is fast, multi-threaded, and correct, all at the same time, on all types of computer hardware. Simple addition and subtraction are not always so simple. The author of the code fixed the bug in kernel 6.1.90 by refactoring to ensure interrupt safety.
What We Learned
While our rollout process includes canarying and proceeds gradually, we didn’t catch this bug early on because the issue takes weeks of uptime to manifest.
Following our detailed investigation, we have identified and implemented the following improvements:
- We upgraded select HubSpot servers to a new version of the Linux kernel that does not contain the bug.
- We are proactively monitoring all servers at HubSpot to ensure their TCP memory usage (or the kernel’s perception of it) doesn’t exceed limits.
- We have improved our HBase server testing setup to detect similar issues in a testing environment prior to production deployment.
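The details of our metric collector are beyond the scope of this post, but as one plausible approach, the kernel's own view of TCP memory can be read from /proc/net/sockstat (the mem field of the TCP: line, in pages) and compared against the limits in /proc/sys/net/ipv4/tcp_mem (three page counts: low, pressure, high). The helper names and the alert threshold below are hypothetical, and the sketch only parses text that the caller has already read from those files:

```c
#include <stdio.h>
#include <string.h>

/* Parse the "mem" field from the TCP line of /proc/net/sockstat.
 * Returns 0 on success, -1 if this is not the TCP line or it has
 * no mem field. */
static int parse_sockstat_tcp_mem(const char *line, long *mem_pages)
{
    const char *field;

    if (strncmp(line, "TCP:", 4) != 0)
        return -1;
    field = strstr(line, " mem ");
    if (field == NULL || sscanf(field, " mem %ld", mem_pages) != 1)
        return -1;
    return 0;
}

/* Parse the three page counts (low, pressure, high) from tcp_mem. */
static int parse_tcp_mem_limits(const char *text, long limits[3])
{
    int n = sscanf(text, "%ld %ld %ld", &limits[0], &limits[1], &limits[2]);
    return n == 3 ? 0 : -1;
}

/* Returns 1 once usage reaches alert_percent of the high limit. */
static int tcp_mem_near_limit(long mem_pages, long high_limit_pages,
                              int alert_percent)
{
    return mem_pages * 100 >= high_limit_pages * (long)alert_percent;
}
```

Since the reported value is the kernel's accounting rather than a direct measurement, a check like this tracks exactly the number that matters for limit enforcement, even when that number has drifted from reality.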
Finally, we want to reiterate that reliability is a core tenet at HubSpot and our goal is to ensure our customers have the tools they need to grow every day.