So, here at HubSpot, and at my previous gig at Lookery, I've been writing a pile of Hadoop code to take log files, pull out key info, sum it up in various ways, and store the results.
Overall, this has been a lot of fun, but I've developed a sort of healthy fear. Because of one thing: it's terrifyingly easy to make invisible mistakes.
For example, say you're counting up unique visitors who came in from paid search ads on Google. You write the job, run it over a month of data, your program churns along, and then spits out: 12335.
And now we come to the problem: is 12335 the Right Answer?
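To make that concrete, here's roughly what such a job might look like as a Hadoop Streaming mapper and reducer, sketched in Python. The log format, the field positions, and the referrer/`gclid` test are all assumptions made up for this sketch, not anything from the pipeline described here; the point is that the output is just a bare number, with nothing in it to tell you whether the logic that produced it was right.

```python
import sys

def mapper(lines):
    # Emit a (visitor_id, 1) pair for every hit that looks like a Google
    # paid-search click. The tab-separated layout and the "gclid" check
    # are invented for this illustration.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # silently skip malformed lines -- itself a judgment call
        visitor_id, referrer, url = fields[0], fields[1], fields[2]
        if "google." in referrer and "gclid=" in url:
            print("%s\t1" % visitor_id)

def reducer(lines):
    # Hadoop sorts reducer input by key, so distinct visitors can be
    # counted by watching for the key to change.
    uniques = 0
    last_id = None
    for line in lines:
        visitor_id = line.rstrip("\n").split("\t", 1)[0]
        if visitor_id != last_id:
            uniques += 1
            last_id = visitor_id
    print(uniques)

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper(sys.stdin) if mode == "map" else reducer(sys.stdin)
```

If that referrer test is subtly wrong -- say it also matches organic Google results, or quietly drops some paid clicks -- the job still runs cleanly and still prints a perfectly plausible number.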
If you've been writing programs for more than about ten minutes, you've discovered that your code usually has errors. If you're writing something that generates a web page (possibly by talking to a db), you find those errors relatively quickly. Even in the worst case, where you release it with a bug, some customer soon says "Hey, when I click on thing X, it doesn't do what it should." This is Not That Bad.
But with data processing, errors can linger for a while, and, worse yet, can easily infect all the numbers you're collecting. Invisibly making everything wrong. Then, later, your customer says "Hey, why didn't these ads show up on my search marketing screen?" And it's totally non-obvious at which stage of your multipart data pipeline things went awry. It's also not clear how to fix all the existing data you've already collected, which is now suspect. This is Very, Very Bad Indeed.
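To see how that plays out, here's a made-up three-stage pipeline (the stage names and logic are purely illustrative, not the actual pipeline): a subtle bug in the first stage doesn't crash anything, it just quietly poisons every number computed downstream of it.

```python
from collections import defaultdict

def parse(raw_lines):
    # Stage 1: raw "visitor_id<TAB>referrer" log lines -> (visitor_id, channel).
    # A subtle mistake here (this test counts organic Google visits as paid,
    # and misses google.co.uk entirely) still yields well-formed output.
    for line in raw_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        visitor_id, referrer = fields[0], fields[1]
        channel = "paid_search" if "google.com" in referrer else "other"
        yield visitor_id, channel

def rollup(events):
    # Stage 2: unique visitors per channel -- the thing that gets stored.
    seen = defaultdict(set)
    for visitor_id, channel in events:
        seen[channel].add(visitor_id)
    return {channel: len(ids) for channel, ids in seen.items()}

def report(rollups):
    # Stage 3: the number a customer eventually sees on their screen.
    return rollups.get("paid_search", 0)

# Every stored rollup, and every report built on top of it, inherits
# whatever stage 1 got wrong -- and the fix means reprocessing from
# stage 1 forward over everything collected since the bug went in.
```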
Here's how I'm currently dealing with the Fear:
That's my current set of takeaways -- anyone else have experiences on these fronts they'd like to talk about?
===
Update: the nice folks over at Hacker News point out that when I say "reentrant," I really mean "idempotent". They are, in fact, totally right.
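Since the whole update turns on that word: an idempotent job is one you can re-run over the same input without changing the result beyond the first run. A tiny sketch of the difference, with a plain dict standing in for whatever store the results actually land in:

```python
def store_count_append(db, day, count):
    # Not idempotent: re-running the job for the same day double-counts.
    db[day] = db.get(day, 0) + count

def store_count_overwrite(db, day, count):
    # Idempotent: re-running for the same day overwrites the same slot,
    # so a failed, repeated, or replayed run can't corrupt the totals.
    db[day] = count

db = {}
store_count_overwrite(db, "2009-06-01", 12335)
store_count_overwrite(db, "2009-06-01", 12335)  # safe to run twice
assert db["2009-06-01"] == 12335
```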