Hadoop Follow Up – Hortonworks HDP Sandbox

The Hortonworks Hadoop Sandbox download got corrupted the first time.  It worked fine the second time.
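A corrupted download like this can be caught before importing by comparing the file's hash against a known-good value. Here is a minimal Python sketch of that check. It assumes a SHA-256 checksum for the .ova file is available from somewhere trustworthy, which I have not confirmed for this download; the function names are just illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so a multi-gigabyte .ova does not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with an expected checksum string.
    The expected value is an assumption: it must come from the publisher."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling matches_expected("HDP_2.4_virtualbox_v3.ova", "<published checksum>") would return False for a damaged file, which is a lot quicker than finding out during the import.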

Installation

I installed Oracle VirtualBox first.  Then, in the Oracle VM VirtualBox Manager, I selected the File | Import Appliance… option, chose the HDP_2.4_virtualbox_v3.ova file and clicked Next and Import.
Importing the HDP Appliance

A few seconds later, the box was installed, so I started it up.  After loading and starting a ton of stuff, it seemed to stop doing things and the screen looked like this:
HDP Appliance Screen

Connecting to the VM

I dismissed the two messages at the top and tried a zillion things to figure out what to do next.  Nothing.  Then I read something in the Hello World section of the Hortonworks tutorial site about the box’s address and how to connect to the Welcome Screen.  No wonder I couldn’t do anything inside the VM itself: the interface is web-based and uses the URL http://127.0.0.1:8888/.  Entering that URL into my browser, I connected and saw this:
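Since the sandbox is driven through that URL, a quick way to tell whether the VM has finished booting is to poll the address from the host. This is a small Python sketch of that idea using only the standard library. The function name and the timeout are my own choices, not anything from the Hortonworks tutorial.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP GET to the URL succeeds, False if the
    connection is refused, times out, or the server returns an error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # URLError covers refused connections and HTTP error statuses;
        # OSError covers lower-level socket failures such as timeouts.
        return False
```

While the VM is still starting, is_reachable("http://127.0.0.1:8888/") should return False. Once the welcome page is being served, it should return True.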
HDP Welcome Screen

Then I ran into difficulty because the firewall at work won’t let me download the tutorial files.  Ack!

My First Foray into Hadoop

So I have a big dataset (1.7 billion rows) that I want to analyze.  I figured, “Hey, Hadoop is all over this Big Data thing, I wonder if I can do a Proof of Concept?”

Compiling Hadoop on Windows (Ugh!)

So, first, I tried to follow some instructions on how to get the Hadoop source into Windows and compile it.  It turns out that Hadoop is Java-based and most Hadoop programmers are Java programmers.  So a lot of the instructions assume Java.  And, good for me, the build engine is Maven, which I happen to know quite a bit about thanks to the weeks at CompanionCabinet where I automated the build using Maven.

However, it turned out that Ant was having a problem running the SH command, and after several tries I went googling for an already compiled version of the Hadoop project.  Lo and behold, I found one on GitHub:  https://github.com/karthikj1/Hadoop-2.7.1-Windows-64-binaries.  In the middle of the top area of the page is a “1 Release” link.  Click there to download the binary:

Hadoop Binary

Installing all the bits

Starting from the wiki article here:  http://wiki.apache.org/hadoop/Hadoop2OnWindows, I found the link to this:  Building.txt

Near the bottom of that file are some incomplete instructions on what to download, install and do to compile your own version of Hadoop in Windows.

So I downloaded all these:

  1. Java Development Kit (JDK) 1.7.0_80, includes Java Runtime Environment (JRE) 7.
    JDK Download
  2. Maven 3.3.9.
  3. Cygwin 64.
  4. CMake 3.5.2.
  5. zlib 1.2.8 (the zlib128-dll package).
  6. protobuf 2.5.0.
  7. Windows 7 SDK.

Then I installed or unzipped the files.

  1. JDK 1.7 is an install.  I let it install to Program Files\Java.
  2. I copied the Maven file to the Java folder and unzipped it to a new folder (apache-maven-3.3.9).
  3. I installed Cygwin to the Program Files\Java\Cygwin folder.
  4. I installed CMake and accepted the defaults.
  5. I unzipped the zlib 128 files to Program Files\Java\zlib128-dll.
  6. I unzipped the protobuf files to Program Files\Java\protobuf-2.5.0.
  7. I tried to install the Windows 7 SDK but it had issues, which I ignored since I wasn’t going to compile my own Hadoop after all.
  8. I unzipped the Hadoop files to \hadoop-2.7.1.

Then I did the following steps:

  1. JAVA_HOME must be set, and the path must not contain spaces. If the full path would contain spaces, then use the Windows short path instead.  In my case, this was:
    set JAVA_HOME=C:\Progra~1\Java\jdk1.7.0_80\
  2. I created a C:\tmp folder because I didn’t have one and, by convention, Hadoop uses it.
  3. I added the ZLIB_HOME environment variable and pointed it to C:\Program Files\Java\zlib128-dll\include.
  4. I added several items to the PATH variable:  C:\Program Files\Java\apache-maven-3.3.9\bin;C:\Program Files (x86)\CMake\bin;C:\Program Files\Java\zlib128-dll;C:\Program Files\Java\Cygwin\bin
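The environment steps above are easy to get subtly wrong, so here is a minimal Python sketch that checks two of them: that JAVA_HOME has no spaces, and that the expected folders are on the PATH. The function names are hypothetical helpers of mine. The paths in the usage note are the ones from my machine and will differ elsewhere.

```python
def check_java_home(value):
    """Return a list of problems with a JAVA_HOME value (empty if fine)."""
    problems = []
    if not value:
        problems.append("JAVA_HOME is not set")
    elif " " in value:
        problems.append("JAVA_HOME contains spaces; use the Windows short path")
    return problems


def missing_path_entries(path_value, required, sep=";"):
    """Return the required folders that do not appear in a PATH string.
    Comparison ignores case and trailing backslashes, as Windows does."""
    entries = {
        entry.strip().rstrip("\\").lower()
        for entry in path_value.split(sep)
        if entry.strip()
    }
    return [r for r in required if r.rstrip("\\").lower() not in entries]
```

In practice I would pass in os.environ.get("JAVA_HOME", "") and os.environ.get("PATH", ""), with a required list such as the Maven bin, CMake bin, zlib128-dll and Cygwin bin folders listed above.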

With all that in place, I was ready to start Hadoop.

Starting Hadoop

Apparently I have to configure several files in the Hadoop etc\hadoop folder first.

Section 3 on the wiki page describes in detail how to change the configuration files.
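To give a feel for what those configuration files look like, here is a Python sketch that generates a minimal Hadoop site file (a configuration element containing property entries, each with a name and a value). The property fs.defaultFS is a standard Hadoop 2 setting, but the value hdfs://localhost:9000 is only an assumption for illustration; the wiki page may recommend a different host, port or set of properties.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Build the XML text of a Hadoop *-site.xml file from a dict
    mapping property names to values."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed value for illustration only; adjust to match the wiki instructions.
core_site = build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
```

The resulting text would go into core-site.xml. The same helper could produce hdfs-site.xml from a different dict, for example one setting dfs.replication to 1 on a single machine.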

I combined that information with the steps found in this article to create the file system, create a directory and put my txt file there.
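Those three steps map onto three hdfs commands. The Python sketch below just builds the command lines so they can be reviewed before running anything. The file and directory names in the usage note are made up, and the HDFS daemons need to be started between the format step and the other two (the Windows build ships a start-dfs.cmd script for this, as far as I can tell).

```python
def hdfs_setup_commands(local_file, hdfs_dir):
    """Return the commands to format the file system, create a directory
    in HDFS, and copy a local file into it, in that order."""
    return [
        ["hdfs", "namenode", "-format"],               # create the file system
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],     # create the directory
        ["hdfs", "dfs", "-put", local_file, hdfs_dir], # upload the file
    ]


def as_command_lines(commands):
    """Join each command into a single printable line."""
    return [" ".join(command) for command in commands]
```

For example, as_command_lines(hdfs_setup_commands("mydata.txt", "/user/me/input")) gives three lines that can be pasted into a command prompt one at a time.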

What’s Next?

I am not sure what’s next.  Looks like I have some learning to do.

This article gives a nice technical overview of Hadoop.

And then I discovered Hortonworks.  Hortonworks Sandbox is an open-source VM with Hadoop and a bunch of tools already fully configured.  So I downloaded this onto a different machine and am trying it out right now.  I’m going to try the VirtualBox VM.  I used VMware Player and VirtualBox some time ago and found VirtualBox a lot easier to work with.  It looks like the Hortonworks HDP Sandbox is going to take a while to download.  See you again on Monday.

In the meantime, I’m going to check out this tutorial on edureka.