Who Loves the Magic Undocumented Hive Mapjoin? This Guy.

on Nov 6, 2009

So, I've got this nice Hive join statement, joining a tiny little partition from one table against a sizable set of partitions from another. And I'm running it, and it's taking a while. And I can tell, from looking at the job, that it's doing the join reduce-side, meaning the mappers tag every row from both tables with its join key and ship the whole lot over to the reducers, which do the actual matching.

But, clearly, this is a perfect fit for a map-side hash join (meaning, hold the entire tiny partition in memory in each map task and run no reducers at all). If I were coding it myself, I could make this happen via a bunch of coding plus some configuration trickery. But, surely, Hive will make it easier, no?

The docs had little to tell me, but I saw Jira tickets about adding this ability, and finally found a mailing list message which had the magic incantation. It's a hint within the statement. Just convert this:

SELECT t1.portal_id, t2.lead_id, t1.visit_time,

to this:

SELECT /*+ MAPJOIN(t2) */ t1.portal_id, t2.lead_id, t1.visit_time,

Done, and now my entire job is running in the mapper and is taking about 30% of the time it used to. Woo. Big points for Hive, for damn sure.
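
For anyone who wants to see the hint in the context of a whole statement, here's a rough sketch. The table names (visits, leads), the join column (visitor_id), the partition column (dt), and the dates are all made up for illustration; they are not my real schema. The one part that matters is that the alias named in the hint is the small side, the one that gets held in memory in each map task.

```sql
-- Hypothetical tables, for illustration only:
--   visits (portal_id, visitor_id, visit_time), partitioned by dt  -> the big side, aliased t1
--   leads  (lead_id, visitor_id),               partitioned by dt  -> the tiny side, aliased t2
SELECT /*+ MAPJOIN(t2) */              -- name the alias that should be loaded into memory
       t1.portal_id, t2.lead_id, t1.visit_time
FROM visits t1
JOIN leads t2
  ON (t1.visitor_id = t2.visitor_id)   -- equality join on the shared key
WHERE t1.dt >= '2009-10-01'            -- the sizable set of partitions
  AND t2.dt = '2009-11-05';            -- the single tiny partition
```

If you want to check what Hive will do before committing to a long run, putting EXPLAIN in front of the statement prints the plan, and it should show whether the join happens in the map stage.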
