Easing the Pain of Hadoop with Guice

If you've used Hadoop for non-trivial jobs, you've probably run into the "remote construction" problem. Since Hadoop ships your code and spins it up in Java VMs on remote machines, you can't pass pre-constructed Java objects into your jobs--you have to bootstrap everything from scratch in the spun-up VM, and the only form of configuration Hadoop makes it easy to pass over is key/value string pairs.

One way to use these key/value string pairs is to use them to specify the names of classes to instantiate during setup in the job--this is what Hadoop does itself with its mapper class and reducer class options. Often this is enough to get the job done. But recently I have been working on a project to do some web crawling in Hadoop. The crawler classes and related hierarchies are fairly loosely coupled, and a lot of options (where the boundaries of the crawl are, for example, or how to process the html to discover new links) are expressed as different objects being used during the rather complex construction process of the top-level crawler object. This design was flexible and successful for local crawls, but, because you can't pass objects into a Hadoop job, it was a real pain to construct a crawler in Hadoop. Basically for every new combination of options I wanted, I had to create a class that was hard-coded to construct the crawler appropriately, and pass that class name as an option to Hadoop. In the job's setup code, it would instantiate an object of that class and use it to get the constructed crawler. That meant every time I wanted to add a new combination of options, I had to add code to my project, rebuild it, and ship the new jars. A real pain.

Enter Guice, Google's framework for dependency injection. Guice (simplifying slightly) lets you throw a bunch of objects into a mix, indicating which should be used to implement which interfaces, and then constructs your objects for you, making sure that all dependencies are met along the way. You enable this by creating your own classes that implement Guice's Module interface, in which you "bind" your custom implementations to the appropriate interfaces. Finally, using the @Inject annotation, you tell Guice which constructor dependencies it should attempt to satisfy. (Guice can do more than this, but this is what it does in a nutshell.) To make this more concrete, here's a much simplified example:

public class DbCrawlModule extends AbstractModule {
  @Override 
  protected void configure() {
    //implement the PageSource interface with the
    //DatabasePageSource class...and so on
    bind(PageSource.class).to(DatabasePageSource.class);
    bind(LinkDiscoverer.class).to(DomLinkDiscoverer.class);
    bind(Crawler.class).to(SimpleCrawler.class);
  }
}
public class SimpleCrawler implements Crawler {
  private PageSource pageSource;
  private LinkDiscoverer linkDiscoverer;
  //the all-import Inject annotation...tells Guice
  //to try to meet these constructor dependencies
  @Inject
  public RealBillingService(PageSource pageSource,
      LinkDiscoverer linkDiscoverer) {
    this.pageSource = pageSource;
    this.linkDiscoverer = linkDiscoverer;
  }
  //...real crawler code
}
//wherever your main lives...
public static void main(String[] args) {
    Injector injector = Guice.createInjector(new DbCrawlModule());
    //here is where Guice creates the dependency graph
    //and constructs your object for you...this will
    //create a SimpleCrawler from a DatabasePageSource and
    //a DomLinkDiscoverer
    Crawler crawler = injector.getInstance(Crawler.class);
    //...use your crawler...
}

You see how nicely Guice helps you keep your objects loosely coupled, and hides your implementation details in your Module class.  So how would this work in Hadoop?  At first glance, it looks just as bad as our "one configuration == one class" idea, since we'd have to create a custom module for each configuration combination we wanted to use.  But with a little hackery, we can make this more dynamic.  We'll pass a Hadoop configuration string of the form "<interface1.class>:<implementer1.class>;<interface2.class>:<implementer2.class>", and use a custom Module to parse that out and perform the binding.  Here is a simplified listing (in the real thing we want to provide for more advanced Guice features like scopes, and provide some support for formatting the string).

public class WiringModule implements Module {
    private Map<Class<?>, Class<?>> classes;
    /**
     * Take the class mappings provided in string form and later replicate them in Guice.
     * 
     * The format of the string should be:
     * 
     * <implemented class1>:<implementing class1>[;<implemented class2>:<implementing class2>...]
     * 
     */
    public WiringModule(String wiring) {
        classes = new HashMap<Class<?>, Class<?>>();
        if (wiring != null) {
            for (String wire : wiring.split(";")) {
                String[] classNames = wire.split(":", 2);
                try {
                    Class<?> bindClass = Class.forName(classNames[0]);
                    Class<?> toClass = Class.forName(classNames[1]);
                    classes.put(bindClass, toClass);
                } catch (ClassNotFoundException e) {
                    //abort! abort!
                    //...
                }
            }
        }
    }
    /**
     * For Guice to call
     */
    @Override
    public void configure(Binder binder) {
        for (Entry<Class<?>, Class<?>> e : classes.entrySet()) {
            Class c = e.getKey();
            Class b = e.getValue();
            binder.bind(c).to(b);
        }
    }
}
Now, when setting up a Hadoop crawl job, I can specify from the command line a series of interface to class mappings (and there's a set of defaults as well).  I bundle those up into a wiring string of the format described above, and set a custom Hadoop config key to that value.  In the setup of the Hadoop job, we pull out that wiring string and then do something like

Injector injector = Guice.createInjector(new WiringModule(wiringString));

Crawler crawler = injector.getInstance(Crawler.class);

//and we're off...


If I forget to specify a dependency, Guice lets me know really quickly by throwing an error.  And Guice's extra flexibility comes in really handy when the construction graphs get big, and incorporate constructors with different numbers of arguments, etc.  I just have to make sure all the elements are thrown into the mix, and Guice will "do the right thing."

Is this solution--encoding the classes into strings--a little dirty?  Probably.  Has it reduced my pain with Hadoop, and made it easier to take advantage of map/reduce?  Definitely.

Easing the Pain of Hadoop with Guice

Recommended Articles

Infrastructure as Code: Getting the best of both worlds with AWS and Google Cloud Platform

Git After It: Why GitHub Won When So Many Other Big Companies Failed

Quick Guide: How to Put Invisible reCaptcha on Your Website

Join our subscribers

Easing the Pain of Hadoop with Guice

Recommended Articles

Infrastructure as Code: Getting the best of both worlds with AWS and Google Cloud Platform

Git After It: Why GitHub Won When So Many Other Big Companies Failed

Quick Guide: How to Put Invisible reCaptcha on Your Website

Join our subscribers

Get Connected