At HubSpot, all of our backend code is written in Java and built with Maven. This code is spread over ~3,500 Maven modules, with lots of dependencies between them. We use snapshots for all of our internal dependencies, and we almost never do releases or version bumps. We pair this with a snapshot update policy of always, so that every build picks up the latest version of every library. We like that it forces people to be conscious of backwards-compatibility, avoids version conflicts when using internal libraries, and gets us closer to the monorepo mindset while still being able to use mostly off-the-shelf tooling.
Houston, we have a problem
The snapshot accelerator
- A simple HTTP API backed by a relational database which keeps track of the latest resolved snapshot version for each group/artifact/base version
- A Maven plugin which reports to the API after a new snapshot has been published
- A Maven extension which hits the API at the start of a build to find all new snapshots and then short-circuits metadata requests for dependencies that haven't changed
We have open-sourced this system as the maven-snapshot-accelerator.
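To make the division of labor between the three pieces concrete, here is a minimal in-memory sketch. It is not the actual maven-snapshot-accelerator code: the class names (`SnapshotIndex`, `BuildState`), the use of a monotonically increasing id as the delta cursor, and the rule that unknown dependencies always trigger a metadata request are all assumptions made for illustration. A plain list stands in for the relational database, and direct method calls stand in for HTTP.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the API: an append-only log of published snapshots. */
final class SnapshotIndex {

    /** One published snapshot and the timestamped version it resolved to. */
    static final class Entry {
        final long id;
        final String key;              // "group:artifact:baseVersion"
        final String resolvedVersion;  // e.g. "1.0-20240101.000000-1"

        Entry(long id, String key, String resolvedVersion) {
            this.id = id;
            this.key = key;
            this.resolvedVersion = resolvedVersion;
        }
    }

    private final List<Entry> log = new ArrayList<>();
    private long nextId = 1;

    /** What the reporting plugin would call after a snapshot is published. */
    synchronized long report(String groupId, String artifactId,
                             String baseVersion, String resolvedVersion) {
        String key = groupId + ":" + artifactId + ":" + baseVersion;
        Entry entry = new Entry(nextId++, key, resolvedVersion);
        log.add(entry);
        return entry.id;
    }

    /** Every entry published after the given id, oldest first. */
    synchronized List<Entry> deltaSince(long sinceId) {
        List<Entry> delta = new ArrayList<>();
        for (Entry entry : log) {
            if (entry.id > sinceId) {
                delta.add(entry);
            }
        }
        return delta;
    }
}

/** Hypothetical build-side state, playing the role of the Maven extension. */
final class BuildState {
    private long lastSeenId = 0;
    private final Map<String, String> knownVersions = new HashMap<>();
    private final Set<String> changedThisBuild = new HashSet<>();

    /** Fetch the delta once at the start of a build and record what changed. */
    void refresh(SnapshotIndex index) {
        changedThisBuild.clear();
        for (SnapshotIndex.Entry entry : index.deltaSince(lastSeenId)) {
            knownVersions.put(entry.key, entry.resolvedVersion); // later entries win
            changedThisBuild.add(entry.key);
            lastSeenId = Math.max(lastSeenId, entry.id);
        }
    }

    /** True if the metadata request cannot be short-circuited for this dependency. */
    boolean needsMetadataRequest(String key) {
        return changedThisBuild.contains(key) || !knownVersions.containsKey(key);
    }

    /** Latest resolved version seen so far, or null if never reported. */
    String resolvedVersion(String key) {
        return knownVersions.get(key);
    }
}
```

In this sketch, a build that starts when nothing has been published since its last run gets an empty delta and can skip every metadata request for dependencies it already knows about, which is the effect described above.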
Putting it all together
Now, when a build runs in our CI environment the Maven command looks like:
Next, the version of Maven we use locally and in our CI environment has the accelerator Maven extension installed. The extension keeps its state in the Maven local repository so that it can do incremental updates by fetching a delta from the API at the start of each build. As part of this, we went back to sharing a Maven local repository between all builds on the same server. We also add a few Aether flags when we run Maven which make this concurrent access safer:
- -Daether.connector.resumeDownloads=false - To prevent one build from trying to resume a download started by another concurrent build, which could corrupt the local repository
- -Daether.artifactResolver.snapshotNormalization=false - To make Maven use the fully resolved (timestamped) snapshot JARs for everything; these should be immutable, so they won't change out from under a running build
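As a small illustration of passing these flags consistently, the sketch below assembles a Maven invocation in Java. The class name `MavenCommandSketch`, the `mvn` executable name, and the `clean install` goals are assumptions for the example, not our actual CI command; only the two `-Daether` flags come from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that prepends the two Aether flags to any Maven invocation. */
final class MavenCommandSketch {

    static List<String> buildCommand(List<String> goals) {
        List<String> command = new ArrayList<>();
        command.add("mvn"); // assumed executable name
        // Don't resume partial downloads left behind by another build.
        command.add("-Daether.connector.resumeDownloads=false");
        // Keep timestamped snapshot files instead of normalizing them to -SNAPSHOT.
        command.add("-Daether.artifactResolver.snapshotNormalization=false");
        command.addAll(goals);
        return command;
    }

    public static void main(String[] args) {
        // The goals here are placeholders for whatever the build actually runs.
        List<String> command = buildCommand(Arrays.asList("clean", "install"));
        System.out.println(String.join(" ", command));
    }
}
```

The resulting list could be handed to `new ProcessBuilder(command)` by a wrapper script, so that every build sharing the local repository runs with the same two flags.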
This still isn't technically safe, and given enough concurrency we would probably see issues, but we have a few things working in our favor. The first is that builds are spread across the 150 servers in our Mesos cluster, so concurrent builds on the same server aren't very common. The other thing working in our favor is that dependency resolution is now much faster, so that the window for things to go wrong is much smaller. In fact, since rolling this out to our CI environment a few weeks ago, we haven't seen a single build failure due to local repository corruption.
Results
Since rolling this out, we haven't had a single Nexus outage. We've also seen faster and more consistent build times, both locally and in our CI environment. Here is a chart showing the volume of traffic to Nexus before (orange) and after (green):
(As a side note, you can tell from that graph that our engineers work from roughly 10am to 6pm. So next time you're interviewing somewhere, don't ask about work-life balance. Just ask to see their Nexus graphs.)
And here is a chart showing the Nexus request latency during the same time period:
You can see that each spike in traffic had a corresponding spike in latency, and now both graphs are much more stable.
Future work
This setup is working well for us at the moment, but it seems like we're nearing the limit of what can be achieved in terms of speed and scale with Maven. Gradle is attractive, but seems to represent more of an incremental improvement, whereas tools like Bazel, Pants, or Buck are fundamentally different. We plan to evaluate some of these tools to see how they compare in terms of speed, configurability, extensibility, IDE integration, and so on.
If you have any thoughts, feel free to leave a comment below, and don't forget to check out the maven-snapshot-accelerator on GitHub.