At HubSpot, all of our backend code is written in Java and built with Maven. This code is spread over ~3,500 Maven modules, with lots of dependencies between them. We use snapshots for all of our internal dependencies, and we almost never do releases or version bumps. We pair this with a snapshot update policy of always, so that every build picks up the latest version of every library. We like that it forces people to be conscious of backwards-compatibility, avoids version conflicts when using internal libraries, and gets us closer to the monorepo mindset while still being able to use mostly off-the-shelf tooling.
Houston, we have a problem
There are some performance drawbacks to this approach. Because of the update policy of always, each build needs to check whether there's a new version of every snapshot. To do this, Maven needs to fetch the snapshot metadata and its checksum from the remote repository (Nexus, in our case) so that it can compare them to the cached data in the local repository. This requires at least two round-trips to Nexus for each snapshot dependency. If an app has hundreds of snapshot dependencies, this latency starts to add up. And for our Dublin engineers, for whom each of these round-trips is transatlantic, the added latency makes builds downright unusable.
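To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The dependency count and round-trip times are illustrative assumptions, not measurements from our builds, and the model assumes the requests are issued strictly one after another, which makes it an upper bound rather than a prediction.

```java
public class SnapshotCheckLatency {
    // Estimated time spent on snapshot update checks when requests are
    // issued sequentially (a simplifying assumption, so this is an upper bound).
    static long sequentialCheckMillis(int snapshotDependencies,
                                      int roundTripsPerDependency,
                                      long roundTripMillis) {
        return (long) snapshotDependencies * roundTripsPerDependency * roundTripMillis;
    }

    public static void main(String[] args) {
        int deps = 300;      // assumed: "hundreds" of snapshot dependencies
        int roundTrips = 2;  // metadata + checksum, as described above

        // Round-trip times below are hypothetical values for illustration only.
        System.out.println("same region:   "
                + sequentialCheckMillis(deps, roundTrips, 5) + " ms");
        System.out.println("transatlantic: "
                + sequentialCheckMillis(deps, roundTrips, 90) + " ms");
    }
}
```

Even if the real numbers differ, the shape of the problem is the same: the cost scales with the number of snapshot dependencies multiplied by the round-trip time, so a slow link hurts far more than a slow server.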
We were able to partially mitigate this issue by writing a local proxy that would cache responses for snapshot metadata and return the cached data if a snapshot hadn't changed. Each developer ran this proxy locally and configured Maven to use it instead of hitting Nexus directly. This sort of worked. But it was never perfect, because the cache invalidation wasn’t as reliable as we would have liked and our HTTP proxy just wasn’t very good. And even with the local proxy, snapshot resolution still added a few seconds of overhead compared to running the build in offline mode. But it was better than nothing, so local development relied on the proxy and our CI environment relied on low latency to the remote repository (which meant running all of our builds in the same region as our Nexus server).
This was the state of things for a while, until we hit another performance issue. As the engineering team grew, the volume of builds grew with it. In less than 18 months we went from 1,000-2,000 builds per day to over 10,000 builds per day. This led to an increase in traffic to Nexus that it just couldn't handle. Periodically, the latency on requests to Nexus would go through the roof and all of our Java builds would grind to a halt until we could get it back up and running. When this happened a few days in a row, we knew we needed to find a better solution.
We considered changing the snapshot update policy to something other than always. This would indeed reduce traffic to Nexus, but would cause problems for correctness. For example, let's say someone adds a method to module A and uses that method in module B. In order for module B to compile, it needs to pick up the latest version of module A. This requirement precludes all of the built-in update policies besides always. But if we could implement a custom update policy, then we could plug in our own logic to tell Maven whether it needs to check for a new snapshot. Maven doesn’t support custom update policies, but we found that we could achieve this sort of functionality by writing a Maven extension.
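To illustrate why the other built-in policies fall short, here is a simplified model of how an update policy decides whether to check for a new snapshot. This is a sketch, not Maven's actual implementation: the class and method names are hypothetical, and it assumes UTC day boundaries for the daily policy.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class UpdatePolicySketch {
    // Simplified model of the built-in policies: always, daily, interval:X (minutes), never.
    static boolean shouldCheck(String policy, Instant lastChecked, Instant now) {
        if (policy.equals("always")) {
            return true;
        }
        if (policy.equals("never")) {
            return false;
        }
        if (policy.equals("daily")) {
            // Assumption: a "day" is a UTC calendar day.
            LocalDate last = lastChecked.atZone(ZoneOffset.UTC).toLocalDate();
            LocalDate today = now.atZone(ZoneOffset.UTC).toLocalDate();
            return last.isBefore(today);
        }
        if (policy.startsWith("interval:")) {
            long minutes = Long.parseLong(policy.substring("interval:".length()));
            return Duration.between(lastChecked, now).toMinutes() >= minutes;
        }
        throw new IllegalArgumentException("unknown policy: " + policy);
    }

    public static void main(String[] args) {
        // Hypothetical timeline: module A was last checked at 10:00, a new
        // snapshot of A is published at 10:02, and module B builds at 10:05.
        Instant lastChecked = Instant.parse("2018-01-01T10:00:00Z");
        Instant now = Instant.parse("2018-01-01T10:05:00Z");

        for (String policy : new String[] {"always", "daily", "interval:60", "never"}) {
            System.out.println(policy + " -> check for new snapshot: "
                    + shouldCheck(policy, lastChecked, now));
        }
    }
}
```

In this timeline only always triggers a check, so under any other policy module B would compile against the stale copy of module A and fail. None of these policies can answer the question we actually care about, which is whether the snapshot has changed.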
The snapshot accelerator
So we started designing a system that allows us to hook into Maven and skip the metadata fetch for a dependency if we know the snapshot hasn't changed. This is a similar idea to the proxy, except that it allows us to bypass the metadata HTTP requests entirely (which is noticeably faster, possibly because Maven isn't great at parallelizing this sort of work). This design also means we don't have to proxy requests to Nexus, which eliminates some complexity and operational issues. We also redesigned the way snapshot versions are tracked so that the system is reliable enough to use in our CI environment.
The system consists of three parts:
- A simple HTTP API backed by a relational database which keeps track of the latest resolved snapshot version for each group/artifact/base version
- A Maven plugin which reports to the API after a new snapshot has been published
- A Maven extension which hits the API at the start of a build to find all new snapshots and then short-circuits metadata requests for dependencies that haven't changed
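The interaction between these three parts can be sketched in a few lines. This is an in-memory toy model, not the code from the maven-snapshot-accelerator: the class names, method names, and coordinates are all hypothetical, and the real system talks to an HTTP API and hooks into Maven's resolution rather than using plain maps.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AcceleratorSketch {
    /** Stand-in for the API: latest resolved snapshot version per group:artifact:baseVersion. */
    static class SnapshotRegistry {
        private final Map<String, String> latest = new HashMap<>();

        // Stand-in for the plugin's report step after a snapshot is published.
        void report(String coordinates, String resolvedVersion) {
            latest.put(coordinates, resolvedVersion);
        }

        // Returns the coordinates whose latest version differs from the caller's copy.
        Set<String> changedSince(Map<String, String> locallyResolved) {
            Set<String> changed = new HashSet<>();
            for (Map.Entry<String, String> entry : latest.entrySet()) {
                if (!entry.getValue().equals(locallyResolved.get(entry.getKey()))) {
                    changed.add(entry.getKey());
                }
            }
            return changed;
        }
    }

    /** Stand-in for the extension's decision: fetch metadata only for new or changed snapshots. */
    static boolean needsMetadataFetch(String coordinates,
                                      Map<String, String> locallyResolved,
                                      Set<String> changed) {
        return !locallyResolved.containsKey(coordinates) || changed.contains(coordinates);
    }

    public static void main(String[] args) {
        SnapshotRegistry registry = new SnapshotRegistry();
        registry.report("com.example:module-a:1.0-SNAPSHOT", "1.0-20180102.090000-2");
        registry.report("com.example:module-b:1.0-SNAPSHOT", "1.0-20180101.120000-1");

        // What this machine resolved during earlier builds.
        Map<String, String> local = new HashMap<>();
        local.put("com.example:module-a:1.0-SNAPSHOT", "1.0-20180101.080000-1");
        local.put("com.example:module-b:1.0-SNAPSHOT", "1.0-20180101.120000-1");

        // One lookup at the start of the build replaces a metadata check per dependency.
        Set<String> changed = registry.changedSince(local);
        for (String coordinates : local.keySet()) {
            String action = needsMetadataFetch(coordinates, local, changed)
                    ? "fetch metadata" : "skip metadata request";
            System.out.println(coordinates + " -> " + action);
        }
    }
}
```

The key design point is that a single request at the start of the build replaces a pair of requests per snapshot dependency, and anything the registry has never heard of falls back to the normal metadata fetch.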
We have open-sourced this system as the maven-snapshot-accelerator.
Putting it all together
Now, when a build runs in our CI environment the Maven command looks like:
mvn -B deploy com.hubspot.snapshots:accelerator-maven-plugin:0.3:report
This will invoke the accelerator plugin after the deploy phase in order to report the new snapshot version to the accelerator API.
Next, the version of Maven we use locally and in our CI environment has the accelerator Maven extension installed. The extension keeps its state in the Maven local repository so that it can do incremental updates by fetching a delta from the API at the start of each build. As part of this, we went back to sharing a Maven local repository between all builds on the same server. We also add a few Aether flags when we run Maven which make this concurrent access safer:
- -Daether.connector.resumeDownloads=false - To prevent one build from trying to resume the download of another concurrent build and potentially corrupting the local repository
- -Daether.artifactResolver.snapshotNormalization=false - To make Maven use the fully resolved snapshot JARs for everything, which should be immutable and not change out from under us
This still isn't technically safe, and given enough concurrency we would probably see issues, but we have a few things working in our favor. The first is that builds are spread across the 150 servers in our Mesos cluster, so concurrent builds on the same server aren't very common. The other thing working in our favor is that dependency resolution is now much faster, so that the window for things to go wrong is much smaller. In fact, since rolling this out to our CI environment a few weeks ago, we haven't seen a single build failure due to local repository corruption.
Since rolling this out, we haven't had a single Nexus outage. We've also seen faster and more consistent build times, both locally and in our CI environment. Here is a chart showing the volume of traffic to Nexus before (orange) and after (green):
(As a side note, you can tell from that graph that our engineers work from roughly 10am to 6pm. So next time you're interviewing somewhere, don't ask about work-life balance. Just ask to see their Nexus graphs.)
And here is a chart showing the Nexus request latency during the same time period:
You can see that each spike in traffic had a corresponding spike in latency, and now both graphs are much more stable.
This setup is working well for us at the moment, but it seems like we're nearing the limit of what can be achieved in terms of speed and scale with Maven. Gradle is attractive, but seems to represent more of an incremental improvement, whereas tools like Bazel, Pants, or Buck are fundamentally different. We plan to evaluate some of these tools to see how they compare in terms of speed, configurability, extensibility, IDE integration, and so on.
If you have any thoughts, feel free to leave a comment below and don't forget to check out the maven-snapshot-accelerator on GitHub.