Here’s something I think everyone can relate to: you fix a problem, only to discover later that it’s still happening. Why is that? Sometimes we’re in a rush or a little careless, but I believe it’s more often because we did not understand the problem well enough to actually address it. By being a little more deliberate we can avoid this and reliably fix problems once and for all. It’s not a particularly technical skill, either. We need to start by asking “why” more often, and then look for the answers until we understand the problem well enough to solve it with confidence.
The case of the OutOfMemoryErrors
At HubSpot, one of the things I’m responsible for is a web crawler. This system has had a multitude of issues, but one class of them, OutOfMemoryErrors, has been perennial. At times it seemed there was another one every other week. They no longer plague me, and I attribute that to continually questioning why the error was occurring and searching for the answers before fixing it.
An example from earlier this year started like most others: with a heap dump.
- Why did it fill the heap? It turned out that, while the service was parsing the HTML, a base64-encoded PNG was taking up a considerable amount of memory.
- Why did it have an image at all? The service is supposed to collect only the HTML. The code had set a flag telling the headless browser client not to download images, which naturally made me ask:
- Why then did the image appear in the heap? Looking further into the client I could see that the flag was being passed down, but I was able to repeatedly capture images with that client, with the flag set.
- So why were the images present? A quick search revealed that the name of that flag was wrong. I submitted a pull request to the client, with screenshots of an example webpage before and after my change to show that images had not in fact been omitted.
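To make the first finding a little more concrete, here is a minimal Java sketch of how one might check whether a page's HTML carries inline base64 images and how large they are. It is not the crawler's code: the class name `InlineImageScan`, the method `payloadLengths`, and the regular expression are all assumptions made for illustration, and it only looks for `data:image/...;base64,` URIs embedded in the markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the crawler or the browser client.
final class InlineImageScan {
    // Matches data URIs such as data:image/png;base64,iVBORw0KGgo=
    // and captures only the base64 payload.
    private static final Pattern DATA_URI =
        Pattern.compile("data:image/[a-zA-Z0-9.+-]+;base64,([A-Za-z0-9+/]+={0,2})");

    // Returns the length, in characters, of each inline image payload found in the HTML.
    static List<Integer> payloadLengths(String html) {
        List<Integer> lengths = new ArrayList<>();
        Matcher matcher = DATA_URI.matcher(html);
        while (matcher.find()) {
            lengths.add(matcher.group(1).length());
        }
        return lengths;
    }
}
```

Running something like `InlineImageScan.payloadLengths(html)` against a captured page before and after a fix is one cheap way to gather the kind of evidence described above: an empty list suggests no inline images came through, while a few very large entries point at the culprit.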
Had I just added more memory to the service, this issue would have stayed buried for longer. The service would have consumed more resources, which costs money, and other users of the headless browser client would still be inadvertently downloading images. Best of all, I don’t get OutOfMemoryErrors any longer.
A habit of asking “why”
At each step of the way I was able to find some concrete evidence that led me to the next step. Often the code alone is not enough to diagnose an issue, so these artifacts are incredibly important. Things like heap dumps, thread dumps, and log files can give us a picture of the system as it runs. They also typically include timestamps, so it is possible to collate the evidence and construct a more complete picture of what happened.
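Collating evidence by timestamp can be as simple as merging lines from several sources into one ordered timeline. The sketch below is one way to do that in Java, under some loud assumptions: every useful line starts with an ISO-8601 instant (for example `2024-01-01T10:00:05Z`) followed by a space, lines that don't are skipped, and the class name `EvidenceTimeline` and the source labels are made up for the example. Real logs rarely agree on a format, so the parsing step is the part you would adapt.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; the line format is an assumption.
final class EvidenceTimeline {
    private static final class Event {
        final Instant at;
        final String text;

        Event(Instant at, String text) {
            this.at = at;
            this.text = text;
        }
    }

    // Merges timestamped lines from several named sources into one list, oldest first.
    static List<String> merge(Map<String, List<String>> linesBySource) {
        List<Event> events = new ArrayList<>();
        for (Map.Entry<String, List<String>> source : linesBySource.entrySet()) {
            for (String line : source.getValue()) {
                int space = line.indexOf(' ');
                if (space < 0) {
                    continue; // no timestamp/message split, so nothing to place on the timeline
                }
                try {
                    Instant at = Instant.parse(line.substring(0, space));
                    String message = line.substring(space + 1);
                    events.add(new Event(at, at + " [" + source.getKey() + "] " + message));
                } catch (DateTimeParseException e) {
                    // Line does not start with an ISO-8601 instant; skip it.
                }
            }
        }
        events.sort(Comparator.comparing((Event e) -> e.at));
        List<String> timeline = new ArrayList<>();
        for (Event e : events) {
            timeline.add(e.text);
        }
        return timeline;
    }
}
```

Tagging each line with its source keeps the merged view readable, so you can see at a glance that, say, a memory warning in one file came just before a slow request in another.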
By cultivating a habit of asking “why” and searching for evidence, we can solve problems rather than treating symptoms. I say “habit” because it’s a practice that we add to our work rather than a specific skill. You may find that it takes a while to become successful at it, but like many things, the more repetition, the easier it will become, until you find it’s happening automatically for you.