So, here at HubSpot, and at my previous gig at Lookery, I've been writing a pile of Hadoop code to take log files, pull out key info, sum it up in various ways, and store the results.
Overall, this has been a lot of fun, but I've developed a sort of healthy fear, because of one thing: it's terrifyingly easy to make invisible mistakes.
E.g. say you're counting up unique visitors from paid search ads on Google. You run it on a month of data. Your program churns along, and then spits out: 12335.
And now we come to the problem: is 12335 the Right Answer?
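To make the scenario concrete, here's a minimal sketch of the kind of job that produces a number like that. Everything here is hypothetical: the tab-separated log format, the field names, and the use of a `gclid` parameter to mark a Google paid-search click are all invented for illustration.

```python
# Hypothetical sketch: count unique visitors from Google paid-search hits.
# The log format and field names are invented for illustration.

def count_paid_search_uniques(log_lines):
    """Each line: tab-separated visitor_id, referrer, query_string."""
    visitors = set()
    for line in log_lines:
        visitor_id, referrer, query = line.rstrip("\n").split("\t")
        # Assume a "gclid" query parameter marks a Google paid-search click.
        if "google." in referrer and "gclid=" in query:
            visitors.add(visitor_id)
    return len(visitors)

logs = [
    "v1\thttp://www.google.com/search\tq=widgets&gclid=abc",
    "v2\thttp://www.google.com/search\tq=widgets&gclid=def",
    "v1\thttp://www.google.com/search\tq=gadgets&gclid=ghi",  # repeat visitor
    "v3\thttp://example.com/\tref=direct",                    # not paid search
]
print(count_paid_search_uniques(logs))  # → 2
```

The point is how much is silently baked into those two `if` conditions; get either slightly wrong and you still get a plausible-looking number out the other end.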
If you've been writing programs for more than about ten minutes, you've discovered that your code usually has errors. If you're writing something which generates a web page (possibly by talking to a db), you find those errors relatively quickly. Even in the worst case, you release it with a bug, and then some customer says "Hey, when I click on thing X, it doesn't do what it should." This is Not That Bad.
But with data processing, errors can linger for a while, and, worse yet, can easily infect all the numbers you're collecting. Invisibly making everything wrong. Then, later, your customer says "Hey, why didn't these ads show up on my search marketing screen?" And it's totally non-obvious at which stage of your multipart data pipeline things went awry. It's also not clear how to fix all the existing data you've already collected, which is now suspect. This is Very, Very Bad Indeed.
Here's how I'm currently dealing with the Fear:
- Test-Driven Development is Your Friend
Without tests, why on earth would you trust the 12335 above? Also, data-processing programs tend to be very simple to test, because they have defined inputs and outputs. You're going to be a whole heck of a lot happier if you start with tests.
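Here's a sketch of how cheap these tests are. `sum_clicks_by_day` is a made-up stand-in for one step of a pipeline; because it's just defined input in, defined output out, the test is a few lines:

```python
# Sketch: a data-processing step has defined inputs and outputs,
# so unit-testing it is cheap. sum_clicks_by_day is hypothetical.
import unittest
from collections import defaultdict

def sum_clicks_by_day(records):
    """records: (date_string, click_count) pairs -> dict of daily totals."""
    totals = defaultdict(int)
    for day, clicks in records:
        totals[day] += clicks
    return dict(totals)

class TestSumClicksByDay(unittest.TestCase):
    def test_sums_within_a_day(self):
        records = [("2009-06-01", 3), ("2009-06-01", 4), ("2009-06-02", 1)]
        self.assertEqual(sum_clicks_by_day(records),
                         {"2009-06-01": 7, "2009-06-02": 1})

    def test_empty_input(self):
        self.assertEqual(sum_clicks_by_day([]), {})

unittest.main(argv=["ignored"], exit=False)
```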
- Kill, Kill, Kill the Whole Pipeline
This is a Toyota-inspired idea, and, again, differs from other kinds of coding. Basically, on any error or unexpected situation, it's really good to just kill the whole pipeline immediately. This forces the developers to deal with issues right away, and to work towards an entirely defect-free pipeline. The alternative makes it very easy for downstream data to become corrupted, again, in ways you can't easily remedy.
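In code, fail-fast just means raising instead of logging-and-skipping. A minimal sketch, with a made-up three-field record format:

```python
# Sketch: fail fast on anything unexpected instead of limping along.
# The tab-separated record format here is hypothetical.

class PipelineError(Exception):
    """Raised to kill the whole pipeline on the first bad record."""

def parse_record(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        # Don't skip, don't fill in a default: stop everything
        # and make a human look at it.
        raise PipelineError("malformed record, expected 3 fields: %r" % line)
    visitor_id, timestamp, url = fields
    if not visitor_id:
        raise PipelineError("empty visitor_id: %r" % line)
    return visitor_id, timestamp, url

print(parse_record("v1\t2009-06-01T12:00:00\t/pricing"))
```

The tempting alternative (`except: continue`) is exactly how a bad day of logs silently poisons a month of summaries.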
- Reentrancy Will Save Your A**
Even with your careful testing, and your aggressive pipeline stopping, you're still going to get into situations where you need to drop partial data and re-run. If you can make sure that every step can be run multiple times without causing trouble, you're going to be so, so much happier. E.g. if you're writing to a directory in HDFS, blow the entire directory away and recreate the whole thing. If you're writing summary rows into the db, record enough info to be able to either rewrite entire rows or skip ones which have already been entered.
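The db-side tactic can be sketched in a few lines. This is a toy: the dict stands in for a real table whose primary key is `(day, metric)`, so a re-run overwrites the row instead of inserting a duplicate or double-counting:

```python
# Sketch: make a summary-writing step idempotent by rewriting whole
# rows keyed on (day, metric). The dict stands in for a real table
# with that primary key; names here are hypothetical.

db = {}

def write_summary_row(day, metric, value):
    # Keyed overwrite: running the step twice replaces the row
    # rather than duplicating it.
    db[(day, metric)] = value

# First run of the step...
write_summary_row("2009-06-01", "paid_search_uniques", 12335)
# ...and a re-run after fixing a bug upstream. Safe: same key, row rewritten.
write_summary_row("2009-06-01", "paid_search_uniques", 11902)

print(db)  # one row, holding the latest value
```

With a real database you'd get the same property from an upsert, or from deleting all rows for the day before rewriting them, which is the row-level analogue of blowing away the HDFS directory.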
That's my current set of takeaways -- anyone else have experiences on these fronts they'd like to talk about?
Update: the nice folks over at Hacker News point out that when I say "reentrant," I really mean "idempotent". They are, in fact, totally right.