So I have a big dataset (1.7 billion rows) that I want to analyze. I figured, “Hey, Hadoop is all over this Big Data thing, I wonder if I can do a Proof of Concept?”
Compiling Hadoop on Windows (Ugh!)
So, first, I tried to follow some instructions on how to get the Hadoop source onto Windows and compile it. It turns out that Hadoop is Java-based and most Hadoop programmers are Java programmers, so a lot of the instructions are written for Java developers. And, good for me, the build engine is Maven, which I happen to know quite a bit about thanks to the weeks at CompanionCabinet where I automated the build using Maven.
However, it turned out that Ant was having a problem running the SH command, and after several tries, I went googling for an already compiled version of the Hadoop project. Lo and behold, I found one on GitHub: https://github.com/karthikj1/Hadoop-2.7.1-Windows-64-binaries. In the middle of the top area of the page is a “1 Release” link. Click there to download the binary:
Installing all the bits
Based on the wiki article here: http://wiki.apache.org/hadoop/Hadoop2OnWindows.
I found the link to this: Building.txt
Near the bottom of that file are some incomplete instructions on what to download, install, and do to compile your own version of Hadoop on Windows.
So I downloaded all these:
- Java Development Kit (JDK) 1.7.0_80, includes Java Runtime Environment (JRE) 7.
- Maven 3.3.9.
- Cygwin 64.
- CMake 3.5.2.
- zlib 128.
- protobuf 2.5.0.
- Windows 7 SDK.
Then I installed or unzipped the files.
- JDK 1.7 is an install. I let it install to Program Files\Java.
- I copied the Maven file to the Java folder and unzipped it to a new folder (apache-maven-3.3.9).
- I installed Cygwin to the Program Files\Java\Cygwin folder.
- I installed CMake and accepted the defaults.
- I unzipped the zlib 128 files to Program Files\Java\zlib128-dll.
- I unzipped the protobuf files to Program Files\Java\protobuf-2.5.0.
- I tried to install the Windows 7 SDK, but it had issues, which I ignored since I wasn’t going to compile my own Hadoop after all.
- I unzipped the Hadoop files to \hadoop-2.7.1.
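With everything unpacked, a quick sanity check (a sketch, assuming the folder names above — adjust the paths if yours differ) is to ask each tool for its version from a Command Prompt:

```shell
:: Sanity check from a Windows Command Prompt.
:: The JDK folder name (jdk1.7.0_80) is the installer's default; verify yours.
"C:\Program Files\Java\jdk1.7.0_80\bin\java" -version
"C:\Program Files\Java\apache-maven-3.3.9\bin\mvn" -version

:: hadoop.cmd needs JAVA_HOME set first (see the steps below).
C:\hadoop-2.7.1\bin\hadoop version
```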
Then I did the following steps:
- JAVA_HOME must be set, and the path must not contain spaces. If the full path would contain spaces, then use the Windows short path instead. In my case, this was:
- I created a C:\tmp folder because I didn’t have one and, by convention, Hadoop uses it.
- I added the ZLIB_HOME environment variable and pointed it to C:\Program Files\Java\zlib128-dll\include.
- I added several items to the PATH variable: C:\Program Files\Java\apache-maven-3.3.9\bin;C:\Program Files (x86)\CMake\bin;C:\Program Files\Java\zlib128-dll;C:\Program Files\Java\Cygwin\bin
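Taken together, the steps above can be sketched as one batch snippet. The short-path form of the JDK folder shown here (`PROGRA~1`, `jdk1.7.0_80`) is a typical default, not something from my machine, so check the real short names with `dir /x`:

```shell
:: Sketch of the environment setup above, run in a Command Prompt.
:: "set" only affects the current session; use "setx" to persist.
:: Verify the short path with: dir /x C:\
set JAVA_HOME=C:\PROGRA~1\Java\jdk1.7.0_80
set ZLIB_HOME=C:\Program Files\Java\zlib128-dll\include
set PATH=%PATH%;C:\Program Files\Java\apache-maven-3.3.9\bin;C:\Program Files (x86)\CMake\bin;C:\Program Files\Java\zlib128-dll;C:\Program Files\Java\Cygwin\bin

:: Hadoop uses C:\tmp by convention.
mkdir C:\tmp
```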
With all that in place, I was ready to start Hadoop.
Apparently I have to configure several files in the Hadoop etc\hadoop folder first.
Section 3 on the wiki page describes in detail how to change the configuration files.
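For flavor, here is the kind of minimal core-site.xml change that wiki section describes, pointing HDFS at a local NameNode (the host and port are the wiki's example values, not something I've tuned):

```xml
<!-- etc\hadoop\core-site.xml: minimal single-node HDFS setting -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:19000</value>
  </property>
</configuration>
```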
I combined that information with the steps found in this article to create the file system, create a directory, and put my txt file there.
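The “create the file system, create a directory, put my txt file there” sequence boils down to a few commands. A sketch, with assumptions: %HADOOP_PREFIX% is the wiki's name for the unzipped hadoop-2.7.1 folder, and mydata.txt stands in for my real file:

```shell
:: One-time format of HDFS, then start the daemons (assumes %HADOOP_PREFIX%
:: points at the hadoop-2.7.1 folder, as the wiki article does).
%HADOOP_PREFIX%\bin\hdfs namenode -format
%HADOOP_PREFIX%\sbin\start-dfs.cmd

:: Create a directory in HDFS, copy a local text file in, and list it.
:: mydata.txt is a placeholder for your own data file.
%HADOOP_PREFIX%\bin\hdfs dfs -mkdir /myData
%HADOOP_PREFIX%\bin\hdfs dfs -put mydata.txt /myData
%HADOOP_PREFIX%\bin\hdfs dfs -ls /myData
```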
I am not sure what’s next. Looks like I have some learning to do.
This article gives a nice technical overview of Hadoop.
And then I discovered Hortonworks. Hortonworks Sandbox is an open-source VM with Hadoop and a bunch of tools already fully configured. So I downloaded this onto a different machine and am trying it out right now. I’m going to try the VirtualBox VM. I used VMware Player and VirtualBox some time ago and found VirtualBox a lot easier to work with. It looks like the Hortonworks HDP Sandbox is going to take a while to download. See you again on Monday.
In the meantime, I’m going to check out this tutorial on edureka.