The more you learn about Hadoop, the more you find how much development is going on to make Hadoop accessible. In this post, I will share my experience with Excel access to Hadoop data.
In my last post, I explained the Hadoop ecosystem in context and mentioned that I have built a home lab for first-hand experience. I built a cluster using a freely available appliance and wrote a few Linux scripts to dump machine data into the HDFS on a regular basis.
The machine data I chose was the available bandwidth on my home router on a five minute interval. Sometimes when I am working at home and my son is trying to play an online game, he will come into my office and say "Dad, are you doing something with the network?". Chances are, I am downloading something huge like an ISO or uploading a training video to YouTube and his game response time suffers. Keeping a historical log of available network bandwidth was a good use case for my Hadoop lab.
So the machine data is automatically filling my Hadoop "data lake" and I can query that data with Hive. I am comfortable with SQL and the command line, so Hive is good for me. What about folks that use Excel as their primary number crunching/graphing application? At my company, and most of my customers, Excel is still the primary "What If?" tool. The answer is adding a Hadoop ODBC driver for those Excel users which is very easy to do. Both Cloudera and Hortonworks have nice tutorials on ODBC access to Hadoop. I followed the Hortonworks tutorial as it is built into their Hadoop appliance.
All you need to do is follow the instructions, download the ODBC driver to your PC, configure the driver for the IP address of your Hadoop cluster and choose Data from Other Sources -> Microsoft Query from within Excel. Choose your HCatalog table and fields from the list, define your query and go. Here are those steps in pictures:
Your data will be inserted automagically into Excel where you can filter, pivot, graph or whatever else makes you happy. Once you have your spreadsheet/graph the way you want it, all you need to do is click "Refresh" to get the latest data. The query in my spreadsheet gets the latest 36 rows of bandwidth data and updates the graph. The Excel created Hive query looks like this "SELECT iperf2.i_date, iperf2.i_bandw FROM default.iperf2 iperf2 ORDER BY iperf2.i_date limit 36". You can edit the query manually from within Excel by choosing Connection Properties from Data -> Refresh All.
Have fun accessing your Hadoop cluster with Excel.