Hadoop World impressions

I know that Dan has already written up his impressions of the conference and, being Dan's words, I'm sure they're incisive and witty. Still, I'm not going to read what he wrote before I write what I write.

The quantity of talks at Hadoop World, and their intellectual content, were frankly staggering. By 4pm my brain was completely full, and I had a frenetic kind of energy that I normally associate with caffeine toxicity. It was an exhilarating conference.

The big boys at the conference were Yahoo! and Facebook, with relatively little involvement from Amazon. Yahoo! shoves a ridiculous quantity of data into Hadoop; I believe the number they threw out was 4 terabytes a day. Indeed, the words "terabyte" and "petabyte" were thrown around so casually that when someone -- the HadoopDB team, I believe, which is an academic group at Yale -- mentioned mere gigabytes, it hardly registered.

Most of the talks dealt with abstractions above Hadoop; of these, the most prominent -- I believe everyone mentioned them at one time or another -- were Hive and Pig. Both are SQLish languages that compile their queries into map and reduce jobs. eBay had its own variant on this, whose name I didn't write down; it was a special language meant to speed experiments on their recommendation engine. Someone loses an auction for, say, a 1998 Volkswagen; what's the best thing to suggest that they buy instead? They conduct thousands of these experiments per day, and they need a language to efficiently encode consumer-behavior patterns. Hadoop appears in the backend, but eBay and most of the other speakers quickly leave it behind. It's a testament to the technology's maturity that it has become something like electrical wiring: largely unnoticed, and there to serve the real action a couple of layers up.
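To make "queries into map and reduce jobs" a little more concrete, here's a rough sketch in plain Python (not Hive, Pig, or Hadoop itself) of how something like SELECT item, COUNT(*) FROM bids GROUP BY item might break down into map, shuffle, and reduce steps. The table, the function names, and the data are all made up for illustration; a real query planner does a great deal more than this.

```python
from collections import defaultdict

# Hypothetical (bidder, item) rows, invented for this example.
bids = [
    ("alice", "vw-1998"),
    ("bob", "vw-1998"),
    ("alice", "civic-2001"),
    ("carol", "vw-1998"),
]

def map_phase(rows):
    """Map: emit (item, 1) for every row, as a GROUP BY item mapper might."""
    for _bidder, item in rows:
        yield item, 1

def shuffle(pairs):
    """Shuffle: collect all values that share a key (the framework does this on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the 1s for each key, which gives COUNT(*) per item."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_item(rows):
    return reduce_phase(shuffle(map_phase(rows)))

print(count_by_item(bids))
```

The point is only that the person writing the query never has to think in terms of mappers and reducers; the translation happens a layer down.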

For my money, the coolest use of Hadoop was Jake Hofman's analysis of the social graph. He notes that 30 or 40 years ago, one could do detailed analysis of very small (10-odd-node) networks, that today we can do high-level analysis of massive networks, and that the ideal would be that kind of intimate network knowledge at massive scale. To that end, he's developing a library to handle the sort of operators one wants to think about in any network: average number of in-links and out-links, the number of connected components in the network (e.g., are there really "six degrees of separation" between any two nodes on the graph?), some measure of "pagerank" for each node in the graph, and so on. Thinking at a productively high level about this stuff wasn't possible until we had the computational resources to walk massive graphs; walking the graph is something that can be done very efficiently in MapReduce/Hadoop.
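For a feel of what those operators look like when phrased as map and reduce steps, here's a toy sketch in plain Python. To be clear, this is not Hofman's library, whose API I don't know; the edge list and function names are invented, and the connected-components routine uses simple label propagation (one map/reduce pass per round, treating edges as undirected), which is just one of several ways to do it.

```python
from collections import defaultdict

# Hypothetical directed edges (from, to), invented for this example.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]

def group_by_key(pairs):
    """Stand-in for the shuffle step: gather values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def degrees(edges):
    """One pass: map each edge to an 'out' tag for its source and an
    'in' tag for its target, then count the tags per node.
    Returns {node: (in_links, out_links)}."""
    pairs = []
    for src, dst in edges:
        pairs.append((src, "out"))
        pairs.append((dst, "in"))
    return {node: (tags.count("in"), tags.count("out"))
            for node, tags in group_by_key(pairs).items()}

def connected_components(edges):
    """Count components by label propagation. Every node starts with
    its own name as a label; each round, nodes adopt the smallest
    label among themselves and their neighbours. Stops when a round
    changes nothing."""
    nodes = {n for edge in edges for n in edge}
    label = {n: n for n in nodes}
    while True:
        # Map: each node keeps its own label and sends it across every edge.
        pairs = [(n, label[n]) for n in nodes]
        for src, dst in edges:
            pairs.append((dst, label[src]))
            pairs.append((src, label[dst]))
        # Reduce: keep the smallest label each node has seen.
        new_label = {n: min(seen) for n, seen in group_by_key(pairs).items()}
        if new_label == label:
            return len(set(label.values()))
        label = new_label

print(degrees(edges))
print(connected_components(edges))
```

On a real cluster each round of the loop would be a separate job over the whole edge list, which is exactly the kind of graph-walking that only becomes practical with this much computing power behind it.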

(Indeed, it seems to me that the "social graph" as such has been all talk and little delivery. Those of us who don't work within Facebook or Google don't have access to the full graph; at most, we have access to the few hundred or thousand nodes immediately around us. Hofman suggested in his talk that Twitter's APIs are opening up the graph in a really useful way. This is exciting to me. Maybe we can finally start reasoning about the graph as a whole.)

I could go on about each of the other talks, but I won't; in the next few days, I'll try to compile all the Hadoop World slides for this blog's readers. Suffice it to say that the range of uses for Hadoop is fairly jaw-dropping, and the amount of architectural work being done behind the scenes by Yahoo! et al. is humbling.

The great power of computer science is that it allows us to express previously inaccessible concepts using powerful abstractions, while tremendous work goes on behind the scenes that we never even need to think about. Alfred North Whitehead put it best in An Introduction to Mathematics:


It is a profoundly erroneous truism, repeated by all copy-books and by eminent people when they are making speeches, that we should cultivate the habit of thinking of what we are doing. The precise opposite is the case. Civilisation advances by extending the number of important operations which we can perform without thinking about them.


Hadoop is one important step in this evolution: a tool for making formerly daunting work possible.


Steve Laniel
