I'm on the train back from New York, where I just caught the very first Hadoop World NYC conference, along with a few HubSpot friends (Steve Laniel, Owen Raccuglia, Andy Novikov).
We're in the process of moving the backend of our Marketing Analytics to Hadoop (that's what I've been working on since I joined HubSpot earlier this year).
Overall impression: Hadoop usage is just exploding. I mean, to some degree, sure, no duh -- but still, it was pretty impressive to see how many people, in how many different ways, are building cool stuff on top of Hadoop. It had the feel of being early on the curve of exponential growth -- something like what I imagine Linux users felt in 1995. There were, I think, 500 registered participants at the conference -- I'm guessing that next year there will be 1000.
Or, as I saw someone tweet: if you're looking for a job, learn Hadoop, because everyone was hiring.
- Ashish Thusoo's talk about Hadoop and Hive at Facebook was impressive as heck. I've had Hive on my "explore in more detail" list for a while, and this moved it up to near the top. As someone with a long-standing crush on the relational model, SQL-like expressions getting translated into MapReduce steps is just plenty sexy.
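To make that concrete for myself: a Hive-style query like "SELECT domain, COUNT(*) FROM page_views GROUP BY domain" (table and column names made up here, not from the talk) boils down to a map step that emits the group-by key and a reduce step that aggregates per key. A toy, pure-Python sketch of that translation -- no Hadoop required:

```python
# Toy sketch of how a SQL GROUP BY / COUNT(*) maps onto MapReduce.
# Table name, column names, and sample rows are all hypothetical.
from itertools import groupby
from operator import itemgetter

def map_phase(rows):
    # Map: emit (group-by key, 1) for every input row.
    for row in rows:
        yield (row["domain"], 1)

def reduce_phase(pairs):
    # The framework's shuffle/sort groups pairs by key; here we sort ourselves.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        # Reduce: sum the counts for each key -- i.e. COUNT(*) per domain.
        yield (key, sum(count for _, count in group))

page_views = [
    {"domain": "hubspot.com", "url": "/"},
    {"domain": "example.com", "url": "/about"},
    {"domain": "hubspot.com", "url": "/blog"},
]
print(dict(reduce_phase(map_phase(page_views))))
# → {'example.com': 1, 'hubspot.com': 2}
```

Hive obviously does vastly more than this (joins, partitions, a real optimizer), but this is the basic shape of the compilation.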
- Another tool which I'm going to have to take a closer look at is Sqoop, from Cloudera's Aaron Kimball. We have exactly the challenge he described early on in his talk -- huge amounts of semi-structured data for which Hadoop is perfect/necessary, but then some fully structured data in an RDBMS which we need to join in. Sqoop basically gives you a clean and efficient means to get that structured data into your Hadoop job flows. As he briefly alluded to near the end of the talk, there are still some pretty tricky bits necessary to handle data that updates in the RDBMS, but, overall, Sqoop seems like it could be a very useful tool.
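The pattern Sqoop enables -- export a small structured table out of the RDBMS so you can join it against the big semi-structured data inside your MapReduce job -- is essentially a map-side join. A toy sketch of the idea (every table, field, and value here is hypothetical, not anything from the talk):

```python
# Toy sketch of a map-side ("replicated") join: the small table exported
# from the RDBMS fits in memory, so each mapper loads it into a dict and
# joins as it streams over the big log data. All names are hypothetical.

# Imagine Sqoop pulled this customers table out of MySQL:
customers = {
    101: {"name": "Acme Corp", "plan": "pro"},
    102: {"name": "Globex", "plan": "free"},
}

# ...and this stands in for the huge semi-structured log data in HDFS:
log_lines = [
    "101 /pricing 2009-10-02",
    "102 /index 2009-10-02",
    "101 /signup 2009-10-03",
]

def map_join(line, lookup):
    # Parse the log line, then join in the structured fields by customer id.
    cust_id, path, day = line.split()
    cust = lookup.get(int(cust_id))
    if cust is not None:
        yield {"plan": cust["plan"], "path": path, "day": day}

joined = [rec for line in log_lines for rec in map_join(line, customers)]
print(joined[0])
# → {'plan': 'pro', 'path': '/pricing', 'day': '2009-10-02'}
```

When the structured side is too big for memory you'd do a reduce-side join instead, which is one of the places those tricky update-handling bits come in.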
- Rumor has it that I missed a sensational talk from Yahoo! folks about Social Graph Analysis (multiple people had that blown-away look in their eyes when they talked about it).
- Several presentations (Cloudera, Karmasphere, maybe someone else, too) described new desktop apps that are supposed to ease the process of developing, deploying, and monitoring Hadoop jobs. These left me a little underwhelmed -- they were, in part, pretty versions of existing, ugly web pages, or checkbox/drop-down replacements for long command strings.
Having now been writing Hadoop jobs for coming up on a year, I will say -- they are tricky to write and debug, and when they run into trouble in the real world, it can be quite hard to figure out what happened. But I don't really see these GUI tools as making much of a difference -- they may shorten the time to learn how to launch a job in the first place, but I don't see them helping you much in diagnosing what happened when you poured 10 gigs of log files through a job flow running on 100 nodes, and ended up with a result with some errors in it.
It's possible that I'm somewhat biased, because our usage of Hadoop has (so far) been all about setting up recurring jobs, rather than one-time or ad-hoc queries.
But, in general, the "build higher-level languages on top of MapReduce" theme seemed much more promising to me than the "make submitting jobs prettier" one.
(okay, I sound grumpy to myself -- if you end up looking over my shoulder 2 months from now, and I'm clicking around in Cloudera Desktop, I encourage you to mock me wholeheartedly).
Also sat in on a few presentations that focused more on the admin side -- e.g. a barrage of extremely useful tips from Ed Capriolo, of About.com, on how to monitor your cluster's health. He promised to post his slides online, which I would absolutely recommend checking out -- I think he's packaged up some Nagios + Cacti scripts and configs so that others can download them and get going right away. I just googled around and found his Join the Grid project -- that may be the source.
Overall, it was an excellent day -- a lot of excitement in the room, a real buzz of energy.