Big Data: Hadoop in Context

In Presales, our job is to remain at least 30 seconds ahead of our customers in order to help lead our customers in the right direction for business success.  In my opinion, Big Data is to 2013 what Cloud was to 2011-2012.  The majority of businesses are currently in either the knowledge gathering or strategy phase of the Big Data journey.  

I recently spoke at a Big Data seminar and held an impromptu, live "American Idol"-type voting experiment.  All attendees texted a response to the question "What Stage of the Big Data Journey are You In?"  79% of respondents were in the first two phases.

Survey Results

As a member of the core Big Data team at my company, I have been assisting in the development of educational and strategic offerings for our customers.  I have also been spending some time with Hadoop in my home Lab.  There is nothing like hands-on to put technologies in context.  After spending some time with both the Cloudera and Hortonworks appliances, I understand the power of Hadoop and the purpose of each of the components.  I will use this Cloudera diagram of the Apache open source components to illustrate.  As this is my first month with Hadoop, please feel free to comment if my definitions could be better.
Hadoop Components

Let's start from the bottom up.  This section is a bit long, so settle in.  The first sentence in each definition is right from Cloudera's CDH data sheet.
  • HDFSHadoop Distributed File System — scalable, distributed, fault tolerant data storage.  Context: A distributed file system similar to what you would see under a parallel database like Oracle RAC.  As data is loaded, the data is broken up into chunks, spread across nodes and duplicated for HA.  HDFS commands can be used to perform file operations on the HDFS file system such as copy, move, list, remove.
  • Mapreduce: Distributed computing framework for Apache Hadoop.  Context: Think about any parallel database that you understand.  Queries are broken up into pieces, the pieces are run in parallel (Map) and the results are merged for the final answer (Reduce). 
  • HCatalog: A table and storage management service for data stored in Hadoop. Context: Databases, tables, columns and rows.  In my opinion, HCatalog makes Hadoop accessible to most users and allowed me to really start using Hadoop.  Hadoop's native data access method is through java.  Your average data analyst is not a java programmer.  HCatalog allows SQL-like query access of unstructured data through tools like Hive and Pig.
  • HBaseScalable record and table storage with real-time read/write access sitting on HDFS. Context: In-memory columnar databases like HP Vertica or MS PDW.
  • HiveMetadata repository with SQL-like interface and ODBC/JDBC drivers for connecting BI applications to Hadoop.  Context: PL-SQL.  select first,last,phone from employees where last='faucher';  I am a big fan of Hive and Impala for making Hadoop easy to access.  Hive is also used to define the data and load the data (ETL).
  • Pig: High-level data flow language for processing data stored in Hadoop.  Context: Batch processing or pipelines.  If you want the output of one SQL statement to be the input for another SQL statement (an so on), use Pig.  Even supports user-defined functions.
  • ImpalaReal-time, SQL-based query engine for data stored in HDFS or HBase.  Context: A faster Hive. Faster mostly because Impala circumvents MapReduce.  Here are some results from Cloudera which match my personal experience.  Hive feels like batch, Impala feels ad-hoc. My cluster and data set are small, so Impala returned answers immediately where Hive took 20-40 seconds.
  • Impala vs. Hive
  • Flume: Distributed framework for collecting and aggregating log and event data and streaming it into HDFS or HBase in real time.  Context: Any real time transaction loader.  Oracle GoldenGate comes to mind.  I have not played around with Flume yet, but am thinking of using it to push real time network bandwidth metrics into HDFS for analysis.
  • Hue: Browser-based desktop interface for Hadoop.  Context: Any web-based menu system for database definition, load and query.  All the above tools are represented by icons and can be run from within the web interface.  Very nice interface.
  • Sqoop: Data transport engine for integrating Hadoop with relational databases.  ContextAny bulk DB table loader.  Oracle SQL*Loader is one example.
  • MahootLibrary of machine learning algorithms for Hadoop.  Machine learning allows computers to base their outputs on previous experiences.  Context: Netflix or Amazon recommendations based on past purchases.  I have not played with Mahoot yet.  Maybe I will analyze the world's blog posts to learn that I should really be writing about grumpy cats or The Bachelorette. 
  • Whirr: Library for deploying and running Hadoop in the cloud.  Context:  Think of it as ssh or rsh for Hadoop.  Sign up for a cloud cluster somewhere like Amazon or Rackspace, define your credentials locally and execute remote commands to control your cluster.  Configure, launch, use, and destroy your cluster remotely.  I have not tried this out yet, but I did create a Rackspace account and a small Linux VM to create some machine data for Hadoop.
  • Oozie: Workflow engine to coordinate Hadoop activities.  Context: Existing workflow or job scheduling tools either open source our commercial like BMC Control-M or CA Autosys
  • Zookeeper: Highly reliable distributed coordination service. Centralized management of the entire cluster in terms of name services, group services, synchronization services and configuration management. Context: Oracle Enterprise Manager (OEM) provides these functions for Oracle clusters.
Oh and a few final acronyms for you that I like.  Hadoop works on CRAP data (Create Replicate Append Process) while traditional databases work on CRUD data (Create Read Update Delete).


Dennis Faucher said…
Thank you Santhosh. That is very nice of you.