So, I've got this nice Hive join statement, joining a tiny little partition from one table against a sizable set of partitions from another. And I'm running it, and it's taking a while. And I can tell,from looking at the job, that it's doing the join reduce-side --meaning, it's generating the cross-product in the mapper, and then sending it over to the reducer to filter it down.
But, clearly, this is a perfect fit for a map-side hash join (meaning, hold the entire tiny partition in memory in each map task + run no reducers at all). If I was coding it myself, I could make this happen via a bunch of coding +some configuration trickery. But, surely, Hive will make it easier, no?
The docs had little to tell me, but I saw Jira tickets about adding this ability, and finally found a mailing list message which had the magic incantation. It's a hint within the statement, just convert this:
SELECT t1.portal_id, t2.lead_id, t1.visit_time,
to this:
SELECT /*+ MAPJOIN(t2)*/ t1.portal_id, t2.lead_id, t1.visit_time,
Done, and now my entire job is running in the mapper and is taking about 30% of the time it used to. Woo. Big points for Hive, for damn sure.