What is Hadoop MiniCluster?
Hadoop is a great framework for solving problems that only a few years ago went unsolved. I can't even count the number of technical challenges that were filed under To Be Determined, or worse, bolted together with some half-baked solution that never quite worked. The advent of Hadoop changed all of that. Testing, however, continues to plague the development effort. The question remains: how do you test a Hadoop job? The answer is Hadoop testing with MiniCluster.
Here is a classic scenario. You are given a business problem to solve, and after pondering it for a while you realize that a Hadoop solution is the right way to go. You pitch your idea to the rest of the development team and get a green light to do a Proof of Concept. Everything goes great: you solve the problem, and it performs in record time with all the usual scalability benefits. Now you need to go to production. You find yourself in a meeting with the QA folks, and their first question is: how do we test it? What kind of automated testing can we use? They open up a fire hose of questions you just can't answer. Hey, I have been there!
Fear not, however, because there is a solution, and interestingly it comes from the Hadoop folks themselves. Most Hadoop-related projects use an internal testing framework called MiniCluster. The Hadoop MiniCluster fires up a small, memory-resident Hadoop cluster that tests can then be executed against. It comes standard with Hadoop, but for some reason it does not get a lot of press, and thus does not have a great deal of documentation around it. If the Hadoop team does its own Hadoop testing with MiniCluster, then why shouldn't we?
I have found Hadoop testing with MiniCluster to be incredibly useful for complete end-to-end testing, because it allows the whole job to be tested. There are other popular testing frameworks, which I have sworn by in the past, but their big problem is that they tend to rely on mocking everything below your map/reduce methods. Let's face it: the map/reduce methods just don't tend to be all that complex. The real action happens in your custom InputFormat or RecordReader or any of those more interesting customizations. Hadoop testing with MiniCluster lets you test all of those classes.
The biggest problem I had with Hadoop testing with MiniCluster is that there is just not much information available on it. There are a few articles out there, but they don't give complete examples, and none were something I could just drop into an existing Hadoop project that needed testing added. I had done some previous MiniCluster work with the old version of Hadoop, but then YARN and MR2 came out, and of course none of that works anymore; the libraries and classes are different now. Fortunately, the Hadoop development teams have improved things and made the new MiniCluster much easier to use. I have a project on GitHub that provides a good starting point.
Starting the Test Cluster
The beauty of the new MiniCluster that comes with YARN and MR2 is that it is much easier to use. One thing that should strike you about the example project is that everything lives under the test path; the Hadoop libraries provide everything else. All you really need to do is set up your test case correctly, and that setup is a simple matter of executing a few Java statements. The basic actions are:
1. Define a name for your cluster (e.g. cluster1)
2. Instantiate a new Configuration object for your cluster
3. Create a data directory for your cluster to use
4. Instantiate a new MiniDFSCluster object
The Java statements would look something like:
testDataPath = new File(PathUtils.getTestDir(getClass()), "miniclusters");
conf = new HdfsConfiguration();
File testDataCluster1 = new File(testDataPath, CLUSTER_1);
String c1Path = testDataCluster1.getAbsolutePath();
conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, c1Path);
cluster = new MiniDFSCluster.Builder(conf).build();
Of course, these statements would more than likely live in a @Before method of your test case. Refer to the example test case class, BasicMRTest, in the sample project.
Running the Test Cases
Once you have the cluster up and running, you will need to use it. There are five basic steps to each test case:
1. start the MiniCluster
2. stage the data needed for the test case
3. set up the MapReduce job and run it
4. verify the results
5. stop the MiniCluster
Your test cases will follow the same general pattern: they are plain old JUnit test cases with @Before and @After methods. Intuitively, the @Before method instantiates the test cluster and the @After method shuts it down. Each individual test case can then access HDFS to stage any required directories and/or data, set up the MR job (create a Job, define the input/output formats, etc.), run the job, verify the output with your favorite asserts to ensure it matches the expected values, and finally stop the cluster.
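Putting that pattern together, a skeleton test class looks something like the following. This is a sketch, not the sample project's BasicMRTest: the class name, test method, and HDFS path are illustrative, and the MapReduce job setup is elided.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

public class ExampleMiniClusterTest {

    private Configuration conf;
    private MiniDFSCluster cluster;
    private FileSystem fs;

    @Before
    public void setUp() throws Exception {
        // Fire up the in-memory HDFS cluster before each test.
        conf = new HdfsConfiguration();
        cluster = new MiniDFSCluster.Builder(conf).build();
        cluster.waitClusterUp();
        fs = cluster.getFileSystem();
    }

    @Test
    public void testStageData() throws Exception {
        // Stage data, set up and run the MR job, and verify results here.
        Path input = new Path("/testing/input");
        Assert.assertTrue(fs.mkdirs(input));
        Assert.assertTrue(fs.exists(input));
    }

    @After
    public void tearDown() throws Exception {
        // Shut the in-memory cluster down after each test.
        if (cluster != null) {
            cluster.shutdown();
        }
    }
}
```

The setUp/tearDown pair keeps each test isolated: every test method gets a fresh cluster and filesystem.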
Some test cases require data to be staged ahead of time so that they produce a known result. In these cases, it is necessary to place data in HDFS and then process that data with your MapReduce program. Fortunately, the MiniCluster exposes the FileSystem in the usual manner. An example of copying staged data into HDFS would be something like:
FileSystem fs = FileSystem.get(conf);
Path homeDir = fs.getHomeDirectory();
String rawHdfsDirPath = homeDir + "/testing/input";
Path rawHdfsData = new Path(rawHdfsDirPath + "/data.txt");
File inputRawData = new File("src/test/resources/files/data.txt");
String inputRawDataAbsFilePath = inputRawData.getAbsolutePath();
Path inputData = new Path(inputRawDataAbsFilePath);
fs.copyFromLocalFile(inputData, rawHdfsData);
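Verifying the results, step 4 above, usually means reading the job's output back out of HDFS and asserting on it. A minimal helper for that might look like the sketch below; it is not from the sample project, and the part-file naming it mentions is just the usual MapReduce convention. Because it resolves the FileSystem from the Path, the same code works against the MiniCluster's HDFS or the local filesystem.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputVerifier {

    // Reads the first line of an output file (typically something like
    // part-r-00000 under the job's output directory) via the Hadoop
    // FileSystem API, so a test can assert against an expected value.
    public static String readFirstLine(Configuration conf, Path outputFile) throws Exception {
        FileSystem fs = outputFile.getFileSystem(conf);
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(outputFile)))) {
            return reader.readLine();
        }
    }
}
```

In a test case you would then write something like `Assert.assertEquals("expected\tvalue", OutputVerifier.readFirstLine(conf, outputFile));`.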
Stopping the Test Cluster
Once the test cases have completed, it is necessary to gracefully shut down the cluster. This is actually a single Java statement:
cluster.shutdown();
There are also some temporary files, both in HDFS and on the local filesystem, that should be cleaned up; the details of that cleanup are included in the sample project. Of course, these statements would more than likely live in an @After method of your test case. Refer to the example test case class, BasicMRTest, in the sample project for more information.
Finally, the required dependencies. Yes, this one is important and generally so much fun to figure out. I used Hadoop 2.6.0, but it should also work with any later version. The following standard libraries were needed:
It was also necessary to include some of the test libraries. This was done using the Maven classifier of tests; refer to the pom.xml in the sample project. The test libraries that needed to be included were:
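For the full artifact list, see the sample project's pom.xml. As an illustration of the tests classifier pattern only, hadoop-hdfs is a typical case for MiniCluster work: you depend on the normal jar for the compile classpath and on the matching test jar, pulled in via the classifier, for MiniDFSCluster itself. The snippet below is illustrative, not a verbatim copy of the sample project's dependencies.

```xml
<!-- normal compile-scope Hadoop dependency -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.6.0</version>
</dependency>
<!-- the matching test jar, pulled in via the "tests" classifier -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.6.0</version>
  <classifier>tests</classifier>
  <scope>test</scope>
</dependency>
```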