How To Make a Hadoop Web Interface
How to make a Hadoop Web Interface? This is a question that comes up a lot, and it’s an important question. As Hadoop moves more and more into the mainstream, the user base is going to expand from the current group of techno-nerds into a broader population of less technical people. This is where providing a Hadoop Web Interface will become important.
However, there are some challenges to making a Hadoop Web Interface. Let’s face it: Hadoop, in its vanilla out-of-the-box form, is not very easy for a non-technical person to use. Most jobs are pretty static pieces of code, with a mapper, a reducer, maybe some other classes, and then a static run() method from an implementation of the Tool interface. The usual way to submit your jobs is via the command line. Now, I personally like that approach, because it gives me a lot of control over everything. The less technical users, however, absolutely hate it! It would require them to have detailed knowledge of Hadoop, when all they want is data to play with.
The first step in making a Hadoop Web Interface is to approach it differently. We are going to need to ask different questions and come up with different answers. We are going to need a generic job submitter, one that can then be used in a service call from the web application. The UI would present some nice, clean, easy-to-use interface; the user would make some sequence of selections and then click a button to start their job. On the back-end, the request would be passed to the service call, the set of parameters would be processed and turned into a Hadoop job, and the job would be submitted to the cluster. This is how to make a Hadoop Web Interface.
The conceptual design of a Hadoop Web Interface is actually pretty simple. It’s the details that will be the hard part. Basically, you will need three processing components:
- Something to gather up the set of parameters for each job,
- Something to convert string class names into actual classes, and
- Something to step through the parameters, perform any formatting/processing, and submit the job.
In its simplest form, that is all we need to do! Like I said, the hard part is in the details. Fortunately for you, I have already started the process: I have a project checked into GitHub that provides a prototype of this fundamental approach and can be very easily translated into a Hadoop Web Interface.
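To make the division of labor concrete, here is a minimal sketch of the first component. The class name, method name, and the exact key list are my own assumptions for illustration, not part of the prototype or any Hadoop API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: gather the job parameters and verify that the
// keys the back-end will need are present before anything is submitted.
class JobParams {

    // Key names assumed from the style of the prototype's parameter map.
    static final List<String> REQUIRED = Arrays.asList(
            "job.name", "source.jar", "mapper.class",
            "input.dir", "output.dir");

    // Returns the first missing or empty key, or null if the map is complete.
    static String firstMissingKey(Map<String, String> map) {
        for (String key : REQUIRED) {
            String value = map.get(key);
            if (value == null || value.isEmpty()) {
                return key;
            }
        }
        return null;
    }
}
```

The point of validating up front is that a missing key can be reported back to the UI as a friendly message, instead of surfacing later as a failed Hadoop job.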
Let's step through how this will translate into a Hadoop Web Interface. The component that reads the parameters will become the UI layer of the Hadoop Web Interface. There would be a set of selections: pulldown lists to select the kind of job that you want to execute, and where to get the inputs from. Naturally, these would need to be defined up front, in terms of your users' needs, so you will need to speak to them. The output from the UI layer would be some kind of data structure that can then be submitted with a request. In the prototype project, it looks something like:
Map<String, String> map = new HashMap<String, String>();
map.put("job.name", "Dynamic WordCount");
map.put("mapper.class", "org.apache.hadoop.examples.WordCount$TokenizerMapper");
// ...plus entries such as "source.jar", "input.dir", and "output.dir"
HadoopRunner tester = new HadoopRunner();
In this case, I am using a HashMap with an arbitrary string key and value for each of the respective parameters. I kept this example simple to demonstrate the Hadoop Web Interface approach. With only a little bit of imagination, you can see how this could be sent across the wire as a request from the UI.
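As one hedged illustration of what "across the wire" could mean (this helper is my own, not part of the prototype), the map can be flattened into a JSON body for an HTTP POST to the submission service. Note that it assumes the keys and values contain no characters that need JSON escaping:

```java
import java.util.Map;

// Hypothetical helper: render the parameter map as a JSON object so the
// UI can POST it to the job-submission endpoint. Assumes no characters
// in the keys or values need JSON escaping.
class ParamsToJson {

    static String toJson(Map<String, String> map) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            first = false;
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
        }
        return sb.append("}").toString();
    }
}
```

In a real application you would of course reach for whatever serialization your web framework already provides rather than hand-rolling it like this.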
Then, moving to the back-end of our Hadoop Web Interface: the data structure with all the job parameters will need to be sent to the server and eventually make its way to a service endpoint, and this is where the processing begins. The first thing we will need to do is convert the string representations of class names into actual classes. For example, the "mapper.class" element of the map has a value of "org.apache.hadoop.examples.WordCount$TokenizerMapper". This will need to be converted into the actual Java Class object. For this I have used reflection, which is all pretty much straightforward, so I won't discuss it at length.
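For completeness, the core of that conversion is a single reflection call. The wrapper below is my own sketch, demonstrated with stdlib classes, but Class.forName handles a name like "org.apache.hadoop.examples.WordCount$TokenizerMapper" the same way once the relevant JAR is on the classpath:

```java
// Hypothetical sketch of the string-to-class step. Class.forName
// resolves a fully qualified binary name (note the '$' separator for
// nested classes, as in WordCount$TokenizerMapper) into a Class object.
class ClassResolver {

    static Class<?> resolve(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            // In a real service this would surface as a user-facing error.
            throw new IllegalArgumentException("Unknown class: " + className, e);
        }
    }
}
```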
After retrieving the class objects, you will be in a position to set up the job and submit it. This will look very similar to the typical static kinds of jobs that we are all so familiar with. The only difference will be the use of variables. In my simple example the flow is very simple, but in reality it will need to be more complex: it will be necessary to check for particular keys and/or values and apply some logic there. The code from the prototype project is intended to be a barebones example and looks something like:
String sourceJar = map.get("source.jar");
File jarFile = new File(sourceJar);
sourceJar = jarFile.toURI().toURL().toExternalForm();
Job job = new Job(conf, map.get("job.name"));
FileInputFormat.addInputPath(job, new Path(map.get("input.dir")));
FileOutputFormat.setOutputPath(job, new Path(map.get("output.dir")));
// (Key names beyond "mapper.class" are assumed to follow the same pattern.)
Class<Mapper> mapper = RunUtilities.tryToLoadClass(sourceJar, map.get("mapper.class"));
Class<Reducer> reducer = RunUtilities.tryToLoadClass(sourceJar, map.get("reducer.class"));
Class<Class> outputKey = RunUtilities.tryToLoadClass(sourceJar, map.get("output.key.class"));
Class<Class> outputValue = RunUtilities.tryToLoadClass(sourceJar, map.get("output.value.class"));
// Wire the loaded classes into the job, then submit it to the cluster.
job.setMapperClass(mapper);
job.setReducerClass(reducer);
job.setOutputKeyClass(outputKey);
job.setOutputValueClass(outputValue);
job.submit();
An important thing to note: to get the job submitted and executed on the cluster, you will need the Hadoop config files, which describe your cluster, and you will need all the required library JARs. In the prototype project I have included the libraries I needed to submit a job to a CDH 4.3 cluster. You may need to adjust these, depending upon your cluster.
Next, for the Hadoop Web Interface to be successful, it will be necessary to define what kinds of jobs can be submitted. This is very important, because the mappers and reducers will need to be created ahead of time and placed in a JAR that can be submitted with the job. In the case of the prototype, I am using the standard Hadoop examples JAR, but in a production situation you would use your own JAR. It is very important to limit the options here: you do not want to give your end users too many choices, which will only confuse them. You will need to speak to the users and find out what they really need.
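One way to enforce that limit (again, a sketch with names of my own choosing) is a server-side catalog mapping each job type offered in the UI's pulldown to the class it is allowed to run, so nothing outside the predefined set can ever reach the cluster:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical whitelist: the UI pulldown offers only these job types,
// and the back-end refuses any selection that is not in the catalog.
class JobCatalog {

    static final Map<String, String> ALLOWED = new HashMap<>();
    static {
        // A job type backed by the standard Hadoop examples JAR.
        ALLOWED.put("wordcount",
                "org.apache.hadoop.examples.WordCount$TokenizerMapper");
    }

    // Returns the mapper class name for a permitted job type,
    // or null if the selection is not in the predefined set.
    static String mapperFor(String jobType) {
        return ALLOWED.get(jobType);
    }
}
```

The nice side effect is that the catalog doubles as the data source for the UI pulldown itself, so the front-end and back-end can never drift out of sync about what is allowed.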
Naturally, in all this, the question will arise of how to submit a job that is not in the set of available options. The answer is that you should not even try to offer that with vanilla Hadoop; it just will not work out. Instead, your best option would be to ask your end users if they would be willing to learn Pig Latin, and then use a PigServer instead. This will allow you to dynamically add a Pig script and submit it as a Pig job, which under the covers translates into Hadoop jobs. This approach is technically much more viable than trying to allow the addition of arbitrary classes and JARs at runtime.
Finally, what I have provided here is an approach and a working prototype for creating a Hadoop Web Interface. These are the fundamental pieces you will need to make it work. Naturally, there are details, as there are with all web applications, but this is definitely a place to start.