In tech companies, Cost of Goods Sold is a key business metric driven in large part by the efficiency of software architectures. Saving money always sounds like a great idea, but it is not always a priority over features and growth, nor is it straightforward. At HubSpot, our relatively new Backend Performance team is tasked with improving the runtime and cost performance of our backend software. In this two-part blog series, we will look at a structured method we use for approaching cost savings work and demonstrate how we applied it at HubSpot to save millions on the storage costs of our application logs.
The first phase of working on cost savings is discovery. We need to know how much each of our software systems costs. The foundations for cost data often start with cloud providers like Amazon Web Services (AWS). They generally provide detailed cost data for the cloud resources you use. In simpler systems, this may be enough to start piecing together cost categorizations.
At HubSpot, our backend microservices are deployed using a custom Mesos layer called Singularity on top of AWS EC2 hosts. Any given EC2 host may be running multiple different deployable applications at any time. We also run our own database servers via Kubernetes instead of using cloud-hosted databases. All of this virtualization makes it hard to correlate the cost of a single EC2 instance to the cost of a specific application.
To address this challenge, we have built an internal library that correlates applications to AWS resources by intercepting samples of application network calls to track usage of resources like S3, AWS Lambda, our internal hosted databases, and more. Tying all this data together, we are able to aggregate the costs of applications and databases, as well as attribute utilization of database costs to applications.
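As a rough illustration of the attribution idea, the sketch below splits the cost of each shared resource across applications in proportion to sampled usage. It is not HubSpot's internal library: the function name, the input shapes, and the "unattributed" bucket are all assumptions made for this example.

```python
from collections import defaultdict


def attribute_costs(resource_costs, usage_samples):
    """Split each resource's cost across applications by sampled usage.

    resource_costs: {resource_name: cost_in_dollars}
    usage_samples:  iterable of (application, resource_name, sampled_units)
    Returns {application: attributed_cost}. Costs of resources with no
    samples are collected under the hypothetical key "unattributed".
    """
    # usage[resource][application] -> total sampled units
    usage = defaultdict(lambda: defaultdict(float))
    for app, resource, units in usage_samples:
        usage[resource][app] += units

    attributed = defaultdict(float)
    for resource, cost in resource_costs.items():
        per_app = usage.get(resource, {})
        total_units = sum(per_app.values())
        if total_units == 0:
            # Nothing sampled for this resource, so we cannot split it.
            attributed["unattributed"] += cost
            continue
        for app, units in per_app.items():
            attributed[app] += cost * units / total_units
    return dict(attributed)
```

Proportional splitting is only one possible policy; it assumes the sampled calls are representative of real usage.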
We can now build explorations into our software costs. We store our cost data in S3, accessible by AWS Athena as well as the third-party Redash product. It is important to have the cost data available to analytic query engines to understand the costs of complex systems.
Using this tooling, we can now look at the highest cost areas of our ecosystem. The larger an area's share of total cost, the more leverage an efficiency gain of, say, 10% provides. The chart above captures daily cost breakdowns for the month of July. What stands out is that S3 costs account for between 45% and 50% of our daily costs.
So we know S3 might be a potential target for cost savings, but how? Next we drill down to the monthly cost of individual S3 buckets.
% of S3 Costs
It is starting to become clear there are specific high cost buckets, particularly hubspot-live-logs-prod and hubspot-hbase-backups. Great! Since buckets are generally owned by a team at HubSpot, we now have two different teams to follow up with on their usage of S3.
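The drill-down itself is simple arithmetic once the cost data is queryable. The sketch below ranks cost categories by their share of the total and shows why share matters for leverage. The function names are hypothetical, and any numbers passed in would be placeholders, not HubSpot's actual costs.

```python
def cost_shares(costs):
    """Return (name, share_of_total) pairs, largest share first.

    costs: {category_or_bucket_name: cost_in_dollars}
    """
    total = sum(costs.values())
    if total == 0:
        return []
    shares = [(name, cost / total) for name, cost in costs.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


def overall_saving(share_of_total, efficiency_gain):
    """Fraction of total spend saved by an efficiency gain in one category.

    For example, a 10% gain in a category that is 50% of total spend
    reduces total spend by 0.5 * 0.10 = 5%.
    """
    return share_of_total * efficiency_gain
```

The same function works at either level: pass service-level costs to find S3, then bucket-level costs to find the heavy buckets.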
Attribute Costs to Functionality
We followed up on this cost data with the teams involved, our HBase Infrastructure team and our Logging team. Discussion with the HBase team revealed they were actively working on a version migration and consolidation of backups, so future cost reduction seemed to be taken care of. For logging, we learned that logs are first stored as raw JSON in S3, and then an asynchronous Spark compaction job converts the files to compressed ORC format. However, a key revelation was that only about 30% of the files end up getting compacted: the compaction job was not keeping up with the volume of logs.
We have many different log streams at HubSpot: application logs, request logs, load balancer logs, database logs, etc. Our final measurement involved writing a job to size each log type in the raw JSON logs bucket to see if there were any specific heavy log types. The results showed our request logs came in at about 31 petabytes of data, with our application logs in second at about 10 petabytes of data.
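A sizing job of this kind boils down to grouping object sizes by log type. The sketch below works on a plain list of (key, size) pairs, which could come from a bucket listing or inventory report. It assumes, purely for illustration, that the log type is the first path segment of the object key; the real key layout is not described here.

```python
from collections import defaultdict


def size_by_log_type(objects):
    """Sum object sizes per log type, largest total first.

    objects: iterable of (key, size_in_bytes)
    Assumption: keys look like "<log_type>/..." so the log type is the
    first path segment. Adjust the parsing for a different layout.
    """
    totals = defaultdict(int)
    for key, size in objects:
        log_type = key.split("/", 1)[0]
        totals[log_type] += size
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
```

At petabyte scale the listing would be processed in a distributed job rather than in memory, but the aggregation is the same.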
Once we have sufficient cost data attributing our highest cost areas to specific parts of our software architecture, we can start forming hypotheses on potential design changes to reduce cost while preserving sufficient functionality.
Since storage was by far our biggest cost for log data, reducing the size and number of log files we store seemed like a viable vector for cost savings. We already had a process for compacting raw JSON to compressed ORC. These facts naturally led us to our hypothesis:
We can store all log files as compressed ORC
We frame our hypothesis around ORC for a few reasons. We already have tooling built around supporting ORC. ORC is a columnar storage format, giving it great compression and size characteristics. We use AWS Athena to query our log data, and Athena supports ORC and Parquet. ORC compresses a bit better than Parquet, meaning smaller storage size and cost.
Optimized Row Columnar (ORC) data structure layout
With our new hypothesis formed, we next want to measure what the expected outcome of implementing our hypothesis would be. It’s important to balance potential cost savings against the engineering investment of implementation and preservation of performance.
We wrote jobs to verify compression rates and the performance of conversion. We wanted to make sure the compression ratio is large enough and that converting raw logs to ORC is not a performance bottleneck compared to converting them to JSON.
The jobs revealed that the same request log data compressed as ORC is about 5% of the size of the raw JSON data, or 20x smaller. Meanwhile, the CPU and IO time to convert raw logs to Snappy-compressed ORC is the same as for raw JSON, both coming in at a little over one second to convert 122 megabytes of raw log data.
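A measurement like this can be framed as a small harness that encodes the same records and reports the size ratio and elapsed time. The sketch below is not the job we ran: it uses zlib from the Python standard library as a stand-in encoder so that it stays self-contained, whereas a real comparison would plug in a Snappy-compressed ORC writer. The function name and result keys are assumptions.

```python
import json
import time
import zlib


def measure(encode, records):
    """Encode JSON-lines data with `encode` and report size and timing.

    encode:  callable taking bytes and returning bytes
             (zlib.compress here, as a stand-in for an ORC writer)
    records: list of JSON-serializable dicts
    """
    raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    start = time.perf_counter()
    encoded = encode(raw)
    elapsed = time.perf_counter() - start
    ratio = len(encoded) / len(raw) if raw else 0.0
    return {
        "raw_bytes": len(raw),
        "encoded_bytes": len(encoded),
        "ratio": ratio,  # 0.05 would mean 5% of the raw size, i.e. 20x smaller
        "seconds": elapsed,
    }


# Example with made-up, highly repetitive records:
sample = [{"path": "/api/v1/items", "status": 200} for _ in range(1000)]
stats = measure(zlib.compress, sample)
```

The ratio from a stand-in encoder says nothing about ORC itself; the point is only the shape of the measurement.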
Given the current size of our logs, we estimated that the remaining lifetime cost of the current raw JSON logs was in the seven-figure range, and that the cost of the same logs all stored as ORC would instead be a low six-figure number, leading to a total estimated cost savings of seven figures.
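The estimate follows from a simple model. The sketch below assumes storage cost scales linearly with stored size, price, and remaining retention, and that converting to ORC shrinks the data by a fixed ratio. The function names and the flat per-GB price are assumptions for illustration, and no real figures are included.

```python
def remaining_storage_cost(size_gb, price_per_gb_month, months_remaining):
    """Cost of keeping `size_gb` stored for the rest of its retention period.

    Assumes a flat price per GB-month and no storage-class transitions.
    """
    return size_gb * price_per_gb_month * months_remaining


def estimated_savings(raw_cost, compressed_ratio):
    """Savings from storing the same data at `compressed_ratio` of its size.

    With a ratio of 0.05 (5% of raw size), 95% of the storage cost is saved.
    """
    return raw_cost * (1.0 - compressed_ratio)
```

This ignores the one-time compute cost of converting existing files, which has to be weighed against the savings.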
After initial discovery and taking measurements to estimate the impact of potential work, the next steps are the design and execution of the cost savings measures. Stay tuned for our second and final post in this series, where we walk through the design and implementation of the cost savings, the cost and performance results of undertaking this project, and the broader guide for exploring cost savings of your own.
Check out Part 2 of this series to learn more about the design details, the implementation of the cost savings, and its impact for our customers!