So, I've got this nice Hive join statement, joining a tiny little partition from one table against a sizable set of partitions from another.  And I'm running it, and it's taking a while.  And I can tell,from looking at the job, that it's doing the join reduce-side --meaning, it's generating the cross-product in the mapper, and then sending it over to the reducer to filter it down. 

But, clearly, this is a perfect fit for a map-side hash join (meaning, hold the entire tiny partition in memory in each map task + run no reducers at all).  If I was coding it myself, I could make this happen via a bunch of coding +some configuration trickery.  But, surely, Hive will make it easier, no?

The docs had little to tell me, but I saw Jira tickets about adding this ability, and finally found a mailing list message which had the magic incantation.  It's a hint within the statement, just convert this:

  SELECT t1.portal_id, t2.lead_id, t1.visit_time,

to this:

  SELECT /*+ MAPJOIN(t2)*/ t1.portal_id, t2.lead_id, t1.visit_time,

Done, and now my entire job is running in the mapper and is taking about 30% of the time it used to.  Woo.  Big points for Hive, for damn sure.


Recommended Articles

Join our subscribers

Sign up here and we'll keep you updated on the latest in product, UX, and engineering from HubSpot.

Subscribe to the newsletter