Hive Testing with MiniHiveServer2


Hive is a terrific technology that allows you to solve Big Data problems very quickly.  It allows a user or developer to write SQL-like queries, called Hive Query Language (HQL), which accomplish the same thing as hand-written MapReduce programs.  However, a big question is: how do we test the HQL scripts that we write?

Previously, there were some solutions for Hive unit testing.  Unfortunately, they only work with older versions of Hadoop.  With the introduction of YARN and MR2, all of the libraries and classes have changed.  Fortunately, the Hive development team has already solved this problem for us with MiniHiveServer2.  However, the tricky part is extracting what we need from the very large bundle of source code that is the Hive distribution.

What is MiniHiveServer2?

MiniHiveServer2 is a fully functioning HiveServer2 that can be instantiated for testing purposes.  It uses an embedded instance of Derby for its metastore database, and it uses the MiniCluster to provide HDFS and the capability to run MapReduce jobs.  This is something that already exists in the Hive distribution as part of its development testing.

This whole journey started when I needed to create a unit test for a UDAF that I developed, and none of the existing Hive testing tools worked.  Having a modicum of experience with MiniCluster, I knew where to begin, but I had no idea how much effort it would take to dig through the details of the Hive source, so I am sharing my findings to save those who come after me some time.  I have created a GitHub repository with some sample code:

https://github.com/bobfreitas/hiveunit-mr2

The biggest problem I had with MiniHiveServer2 is that there is just not much information available on it.  The only source of information I could find was the Hive source code, and of course there are lots of projects, packages, and classes in there.  The really nice thing is that once I was able to get the basic MiniHiveServer2, with an internal Derby database and a backing MiniCluster, to fire up, everything just fell into place.

Starting the Test Cluster

The real magic happens with the firing up of the MiniHiveServer2 and MiniCluster.  At the end of the day it is a bit underwhelming, but getting the right sequence of statements took a little bit of work:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.conf.HiveConf.ConfVars;
import org.apache.hadoop.hive.ql.session.SessionState;
import org.apache.hadoop.mapreduce.MRConfig;
import org.apache.hive.jdbc.miniHS2.MiniHS2;  // comes from Hive's own test utilities

// Build a HiveConf on top of a plain Hadoop Configuration
Configuration conf = new Configuration();
HiveConf hiveConf = new HiveConf(conf, org.apache.hadoop.hive.ql.exec.CopyTask.class);

// Create the mini HiveServer2 backed by a MiniCluster
MiniHS2 miniHS2 = new MiniHS2(hiveConf, true);

// Overlay settings: disable concurrency and run MapReduce in local mode
Map<String, String> confOverlay = new HashMap<String, String>();
confOverlay.put(ConfVars.HIVE_SUPPORT_CONCURRENCY.varname, "false");
confOverlay.put(MRConfig.FRAMEWORK_NAME, MRConfig.LOCAL_FRAMEWORK_NAME);
miniHS2.start(confOverlay);

// Grab the mini-HDFS FileSystem and start a Hive session
FileSystem fs = miniHS2.getDfs().getFileSystem();
SessionState ss = new SessionState(hiveConf);
SessionState.start(ss);

I know you are probably thinking, REALLY, what is so hard about that little sequence of statements?  That was my reaction too after I got it working, but it did take some time, because I ended up needing to pull little bits and pieces together from different integration tests to get here.  The brevity of the startup sequence is a testament to the hard work that has gone into the more recent releases of Hive.

Using the Test Cluster

So once you have the cluster up and running, you will need to use it.  The testing library provides three classes that you should know about:

1) HiveTestCluster – used to manage the MiniHiveServer2 and MiniCluster instances; provides the ability to start, stop, and execute scripts, and gives access to the mini-HDFS

2) HiveScript – used to model a Hive script; performs any pre-processing and converts an HQL script file into something that can be executed by the HiveTestCluster

3) HiveTestSuite – the primary interface for interacting with the test cluster and submitting an HQL script

Your test cases will follow the same general pattern.  They will be plain old JUnit test cases that have @Before and @After methods.  Intuitively, the @Before method will instantiate the test cluster and the @After method will shut it down.  Each of the individual test cases will then be able to access HDFS to stage any required directories and/or data, execute the HQL script, and then verify the results.  If the script being tested does not actually generate any output, then create a simple test script to do the final results validation.

The basic sequence of statements a test case would need to follow is:

In the @Before method

HiveTestSuite testSuite = new HiveTestSuite();

testSuite.createTestCluster();

In the test case

List<String> results = testSuite.executeScript(<some_script>);

assertEquals(2, results.size());

In the @After method

testSuite.shutdownTestCluster();
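
Putting those pieces together, a minimal JUnit class might look like the following sketch (the class name and script path are hypothetical; the HiveTestSuite API is the one from the sample repository):

import static org.junit.Assert.assertEquals;

import java.util.List;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import com.inmobi.hive.test.HiveTestSuite;

public class WeatherScriptTest {

    private HiveTestSuite testSuite;

    @Before
    public void setUp() throws Exception {
        // Fire up the MiniHiveServer2 and backing MiniCluster for each test
        testSuite = new HiveTestSuite();
        testSuite.createTestCluster();
    }

    @After
    public void tearDown() throws Exception {
        // Bring the cluster back down so temporary resources are released
        testSuite.shutdownTestCluster();
    }

    @Test
    public void testScriptProducesTwoRows() throws Exception {
        // Hypothetical script path; substitute your own HQL file
        List<String> results = testSuite.executeScript("src/test/resources/scripts/weather.hql");
        assertEquals(2, results.size());
    }
}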

Refer to the test case, com.inmobi.hive.test.HiveSuiteTest, for a more complete example.

Parameters and Excludes

The executeScript() method can take two additional parameters: params and excludes.  The params argument is a HashMap that allows parameters to be substituted using the Hive convention of ${VAR_NAME}: you use ${VAR_NAME} in your script, and then create a hashmap with VAR_NAME as the key and the substitution value as the data.  The excludes argument is a List of Strings that allows you to exclude entire lines from your script.  This is primarily intended for ADD JAR commands; it is just easier to exclude those lines and ensure the needed jars are on the classpath at the project level.
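
The exact signature and the matching rules for excludes live in the sample project; as a rough sketch of how a test method might drive them (the script path, variable name, and argument order are assumptions here):

// Values substituted for ${MY_TABLE} in the script
Map<String, String> params = new HashMap<String, String>();
params.put("MY_TABLE", "weather_data");

// Lines to drop entirely, e.g. ADD JAR commands handled by the project classpath
List<String> excludes = new ArrayList<String>();
excludes.add("ADD JAR");

List<String> results = testSuite.executeScript("src/test/resources/scripts/weather.hql", params, excludes);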

Accessing HDFS

There will be test cases that require data to be staged ahead of time with the intention of producing known results.  In these cases, it will be necessary to place data in HDFS and then process that data with your Hive script.  Fortunately, the HiveTestSuite exposes a reference to the FileSystem with the getFS() method.  An example of copying staged data into HDFS would be something like:

FileSystem fs = testSuite.getFS();
Path homeDir = fs.getHomeDirectory();

// Destination path inside the mini-HDFS
String rawHdfsDirPath = homeDir + "/testing/input";
Path rawHdfsData = new Path(rawHdfsDirPath + "/weather.txt");

// Local test resource to be staged
File inputRawData = new File("src/test/resources/files/weather.txt");
String inputRawDataAbsFilePath = inputRawData.getAbsolutePath();
Path inputData = new Path(inputRawDataAbsFilePath);

fs.copyFromLocalFile(inputData, rawHdfsData);

Stopping the Test Cluster

One detail I found was that there were some directories that needed to be cleaned up at the end of the test case.  There are a lot of temporary files being created, so it is understandable that not all of them get cleaned up correctly.  The biggest offender was in the target directory: a subdirectory named MiniMRCluster_<clusterId> gets created but does not get deleted, and these will accumulate if they are not removed.  In addition, there were also some files in the /tmp directory, but I chose not to explicitly clean those up.  If they become a problem for you, it may be necessary to do something about that.
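
If the leftover directories become a nuisance, one way to handle it is to sweep them after shutting down the cluster.  This is only a sketch; it assumes the directories land under the Maven target directory of the project running the tests, and that commons-io is on the classpath (it usually comes in transitively with Hadoop):

import java.io.File;
import org.apache.commons.io.FileUtils;

// In the @After method, after testSuite.shutdownTestCluster()
File target = new File("target");
File[] children = target.listFiles();
if (children != null) {
    for (File child : children) {
        // Only remove the leftover MiniMRCluster_<clusterId> directories, nothing else
        if (child.isDirectory() && child.getName().startsWith("MiniMRCluster_")) {
            FileUtils.deleteQuietly(child);
        }
    }
}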

The Dependencies

Finally, the required dependencies.  Ah yes, this was a lot of fun to figure out, since they are spread out across different POM files in the multi-module project.  It took quite a few attempts to get all the needed libraries.

I used Hive 1.1.0 but it should also work with any version after that.  The following standard libraries were needed:

org.apache.hive:hive-cli

org.apache.hive:hive-common

org.apache.hive:hive-contrib

org.apache.hive:hive-exec

org.apache.hive:hive-metastore

org.apache.hive:hive-serde

org.apache.hive:hive-service

org.apache.hive:hive-shims


The use of Hive 1.1.0 required that I use Hadoop 2.6.0, and the following libraries were needed:

org.apache.hadoop:hadoop-annotations

org.apache.hadoop:hadoop-auth

org.apache.hadoop:hadoop-common

org.apache.hadoop:hadoop-hdfs

org.apache.hadoop:hadoop-mapreduce-client-common

org.apache.hadoop:hadoop-mapreduce-client-hs


Okay, so now here comes one of the many wrinkles: it was also necessary to include some of the test libraries.  This was done by using the Maven classifier of tests; refer to the pom.xml in the sample project.  The test libraries that needed to be included were:

org.apache.hadoop:hadoop-mapreduce-client-jobclient

org.apache.hadoop:hadoop-yarn-server-tests

org.apache.hadoop:hadoop-hdfs

org.apache.hadoop:hadoop-common


Finally, it was necessary to include some libraries for webapps that are part of the system.  These were:

org.mortbay.jetty:jetty:6.1.26

org.mortbay.jetty:jetty-util:6.1.26

com.sun.jersey:jersey-core:1.9

com.sun.jersey:jersey-server:1.9

javax.servlet.jsp:jsp-api:2.1

tomcat:jasper-runtime:5.5.23