How to install Apache Hadoop natively on 64-bit OS

Mar 29, 2016 | BlogPosts, Tech Tips

Introduction

The following article (which comes with a very 'techy'-heavy warning!) provides a step-by-step guide to installing Hadoop 2.6 on CentOS 7, building the native libraries on a 64-bit version of the OS.

Prerequisites
  • Time (a few hours of server time)
  • An internet connection (in some capacity)
  • VirtualBox or VMware (or access to a cloud machine)
  • Root access (you can install Hadoop without root access to the system, but it is a bit more complicated; remember, root access is only required during the installation phase, not for running the application/services)!
How to:

1. Download VMware Player* or Oracle VirtualBox.

2. Download CentOS 7 ISO image** or any other distro based on RHEL.

3. Install VM software.

4. Install the OS from the ISO image.

5. Launch Installed VM.

6. Open Terminal.

7. Switch to root user.

8. Execute the following:

          # sudo su -

          # sudo yum update

9. Install all updates and remove the existing Java:

          # sudo yum remove java

10. Download Oracle JAVA***

a. Download the 64-bit .rpm package

b. Execute # yum localinstall <java_package_name>.rpm
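For example, if the file you downloaded happened to be called jdk-8u66-linux-x64.rpm (a hypothetical name; substitute whatever package you actually downloaded), the command would look like this:

          # yum localinstall jdk-8u66-linux-x64.rpm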

11. Set JAVA_HOME****

          # vi /etc/profile.d/java.sh

12. Add the following lines, then make the script executable and source it with the last two commands:

          #!/bin/bash

          JAVA_HOME=/usr/java/default

          PATH=$JAVA_HOME/bin:$PATH

          export PATH JAVA_HOME

          # chmod +x /etc/profile.d/java.sh

          # source /etc/profile.d/java.sh

13. Check java:

          # java -version

Which should return the java version:
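For illustration, with a 64-bit Oracle JDK 8 the output looks roughly like the following (the version and build numbers here are only placeholders and will differ on your system):

          java version "1.8.0_66"

          Java(TM) SE Runtime Environment (build 1.8.0_66-b17)

          Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)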

          # echo $JAVA_HOME

Which in turn should return the java home dir path.

14. Download Maven
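For example, assuming Maven 3.3.9 (any recent Maven 3.x release should work; adjust the version to taste), the binary tarball can be fetched from the Apache archive like this:

          # wget https://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz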

          # tar -zxvf <maven_package_name>.tar.gz -C /opt/

15. Set M3_HOME

          # vi /etc/profile.d/maven.sh

16. Add the following lines, then make the script executable and source it with the last two commands:

          #!/bin/bash

          M3_HOME=/opt/<maven_dir_name>

          PATH=$M3_HOME/bin:$PATH

          export PATH M3_HOME

          # chmod +x /etc/profile.d/maven.sh

          # source /etc/profile.d/maven.sh

17. Check Maven

          # mvn -version

Which should return the Maven version:

          # echo $M3_HOME

Which in turn should return the Maven home dir path.

18. Install the following tools, which are needed to compile the Hadoop native code.

          # yum group install "Development Tools"

          # yum install openssl-devel zlib-devel

19. Download and install Protocol Buffers 2.5.0*****

          # wget http://cbs.centos.org/kojifiles/packages/protobuf/2.5.0/10.el7.centos/x86_64/protobuf-2.5.0-10.el7.centos.x86_64.rpm

          # wget http://cbs.centos.org/kojifiles/packages/protobuf/2.5.0/10.el7.centos/x86_64/protobuf-devel-2.5.0-10.el7.centos.x86_64.rpm

          # wget http://cbs.centos.org/kojifiles/packages/protobuf/2.5.0/10.el7.centos/x86_64/protobuf-compiler-2.5.0-10.el7.centos.x86_64.rpm

          # yum -y install protobuf-*
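To confirm that the right Protocol Buffers version was picked up (the Hadoop 2.6 native build expects exactly 2.5.0), you can run:

          # protoc --version

which should print libprotoc 2.5.0.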

20. Prep for Hadoop: execute the following commands

          # groupadd hadoop

          # useradd -g hadoop yarn (Note: the yarn user will be used for the node manager)

          # useradd -g hadoop hdfs (Note: the hdfs user is for everything related to the HDFS file system)

          # useradd -g hadoop mapred (Note: the mapred user is for MapReduce jobs)

(Note: You can set passwords for these users if you like.)
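As an optional sanity check, you can confirm that the users were created with hadoop as their primary group:

          # id hdfs

          # id yarn

          # id mapred

Each command should report a gid belonging to the hadoop group.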

21. Log in as the hdfs user (Note: This step is required because Hadoop needs an SSH connection without a passphrase.)

          # su - hdfs

          # ssh-keygen -t rsa -P ""

          # cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

          # chmod 0600 ~/.ssh/authorized_keys

22. Test ssh

          # ssh localhost date

(Answer "yes" when asked to accept the host key; the date should then be printed without a password prompt.)

23. Exit hdfs user

          # exit

24. Download Apache Hadoop (source)
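For example, the 2.6.2 source tarball used later in this guide can be fetched from the Apache archive (adjust the version if you prefer a different 2.6.x release):

          # wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2-src.tar.gz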

25. Extract the tar file into the /opt dir

          # tar -zxvf <hadoop_package_name>.tar.gz -C /opt/

26. Navigate to the new Hadoop dir

          # cd /opt/<hadoop_dir_name>/

27. Edit the pom.xml file and add <additionalparam>-Xdoclint:none</additionalparam> to the properties section. For example:

          …

          <!-- platform encoding override -->

          <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

          <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>

          <additionalparam>-Xdoclint:none</additionalparam>

          </properties>

          …

(Note: This step is only required if you decided to use Java 8.)

28. Execute the following commands:

          # cd ..

          # chown hdfs:hadoop <hadoop_dir_name> -R

(Note: Make sure that no permission problems remain that could block the build.)
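A quick optional check that the ownership change took effect:

          # ls -ld /opt/<hadoop_dir_name>

The directory should now be listed as owned by hdfs:hadoop.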

29. Build the native Hadoop library

          # su - hdfs

          # cd /opt/<hadoop_dir_name>

          # mvn package -Pdist,native -DskipTests -Dtar

Go grab some coffee/tea… This step is not mandatory, but recommended! Here’s what you should see by the end of the process.

 

[INFO] ————————————————————————

[INFO] Reactor Summary:

[INFO]

[INFO] Apache Hadoop Main …………………………… SUCCESS [ 16.389 s]

[INFO] Apache Hadoop Project POM …………………….. SUCCESS [  6.905 s]

[INFO] Apache Hadoop Annotations …………………….. SUCCESS [  8.923 s]

[INFO] Apache Hadoop Assemblies ……………………… SUCCESS [  0.340 s]

[INFO] Apache Hadoop Project Dist POM ………………… SUCCESS [  5.277 s]

[INFO] Apache Hadoop Maven Plugins …………………… SUCCESS [  8.378 s]

[INFO] Apache Hadoop MiniKDC ………………………… SUCCESS [02:25 min]

[INFO] Apache Hadoop Auth …………………………… SUCCESS [01:47 min]

[INFO] Apache Hadoop Auth Examples …………………… SUCCESS [  4.060 s]

[INFO] Apache Hadoop Common …………………………. SUCCESS [03:10 min]

[INFO] Apache Hadoop NFS ……………………………. SUCCESS [  7.413 s]

[INFO] Apache Hadoop KMS ……………………………. SUCCESS [ 45.635 s]

[INFO] Apache Hadoop Common Project ………………….. SUCCESS [  0.046 s]

[INFO] Apache Hadoop HDFS …………………………… SUCCESS [02:32 min]

[INFO] Apache Hadoop HttpFS …………………………. SUCCESS [ 21.490 s]

[INFO] Apache Hadoop HDFS BookKeeper Journal ………….. SUCCESS [ 17.206 s]

[INFO] Apache Hadoop HDFS-NFS ……………………….. SUCCESS [  4.122 s]

[INFO] Apache Hadoop HDFS Project ……………………. SUCCESS [  0.044 s]

[INFO] hadoop-yarn …………………………………. SUCCESS [  0.054 s]

[INFO] hadoop-yarn-api ……………………………… SUCCESS [ 37.593 s]

[INFO] hadoop-yarn-common …………………………… SUCCESS [01:36 min]

[INFO] hadoop-yarn-server …………………………… SUCCESS [  0.036 s]

[INFO] hadoop-yarn-server-common …………………….. SUCCESS [ 15.557 s]

[INFO] hadoop-yarn-server-nodemanager ………………… SUCCESS [ 42.800 s]

[INFO] hadoop-yarn-server-web-proxy ………………….. SUCCESS [  2.961 s]

[INFO] hadoop-yarn-server-applicationhistoryservice ……. SUCCESS [  6.280 s]

[INFO] hadoop-yarn-server-resourcemanager …………….. SUCCESS [ 20.282 s]

[INFO] hadoop-yarn-server-tests ……………………… SUCCESS [  5.231 s]

[INFO] hadoop-yarn-client …………………………… SUCCESS [  7.769 s]

[INFO] hadoop-yarn-applications ……………………… SUCCESS [  0.031 s]

[INFO] hadoop-yarn-applications-distributedshell ………. SUCCESS [  3.625 s]

[INFO] hadoop-yarn-applications-unmanaged-am-launcher ….. SUCCESS [  2.082 s]

[INFO] hadoop-yarn-site …………………………….. SUCCESS [  0.038 s]

[INFO] hadoop-yarn-registry …………………………. SUCCESS [  5.406 s]

[INFO] hadoop-yarn-project ………………………….. SUCCESS [  6.252 s]

[INFO] hadoop-mapreduce-client ………………………. SUCCESS [  0.080 s]

[INFO] hadoop-mapreduce-client-core ………………….. SUCCESS [ 22.981 s]

[INFO] hadoop-mapreduce-client-common ………………… SUCCESS [ 17.918 s]

[INFO] hadoop-mapreduce-client-shuffle ……………….. SUCCESS [  4.349 s]

[INFO] hadoop-mapreduce-client-app …………………… SUCCESS [ 10.538 s]

[INFO] hadoop-mapreduce-client-hs ……………………. SUCCESS [  8.806 s]

[INFO] hadoop-mapreduce-client-jobclient ……………… SUCCESS [  9.771 s]

[INFO] hadoop-mapreduce-client-hs-plugins …………….. SUCCESS [  1.889 s]

[INFO] Apache Hadoop MapReduce Examples ………………. SUCCESS [  5.765 s]

[INFO] hadoop-mapreduce …………………………….. SUCCESS [  4.789 s]

[INFO] Apache Hadoop MapReduce Streaming ……………… SUCCESS [  8.040 s]

[INFO] Apache Hadoop Distributed Copy ………………… SUCCESS [  9.787 s]

[INFO] Apache Hadoop Archives ……………………….. SUCCESS [  2.165 s]

[INFO] Apache Hadoop Rumen ………………………….. SUCCESS [  6.321 s]

[INFO] Apache Hadoop Gridmix ………………………… SUCCESS [  4.502 s]

[INFO] Apache Hadoop Data Join ………………………. SUCCESS [  2.613 s]

[INFO] Apache Hadoop Ant Tasks ………………………. SUCCESS [  2.081 s]

[INFO] Apache Hadoop Extras …………………………. SUCCESS [  3.048 s]

[INFO] Apache Hadoop Pipes ………………………….. SUCCESS [  7.640 s]

[INFO] Apache Hadoop OpenStack support ……………….. SUCCESS [  4.934 s]

[INFO] Apache Hadoop Amazon Web Services support ………. SUCCESS [ 24.968 s]

[INFO] Apache Hadoop Client …………………………. SUCCESS [  8.046 s]

[INFO] Apache Hadoop Mini-Cluster ……………………. SUCCESS [  0.084 s]

[INFO] Apache Hadoop Scheduler Load Simulator …………. SUCCESS [  5.169 s]

[INFO] Apache Hadoop Tools Dist ……………………… SUCCESS [  9.050 s]

[INFO] Apache Hadoop Tools ………………………….. SUCCESS [  0.025 s]

[INFO] Apache Hadoop Distribution ……………………. SUCCESS [ 36.246 s]

[INFO] ————————————————————————

[INFO] BUILD SUCCESS

[INFO] ————————————————————————

[INFO] Total time: 20:28 min

[INFO] Finished at: 2015-11-23T07:50:32-08:00

[INFO] Final Memory: 215M/847M

[INFO] ————————————————————————

 

Configuration

1. Switch back to root

          # exit

2. Move the native Hadoop build to /opt

          # mv /opt/<hadoop_dir_name>/hadoop-dist/target/<hadoop_version> /opt/

3. Create data dir

          # mkdir -p /var/data/hadoop/hdfs/nn

          # mkdir -p /var/data/hadoop/hdfs/snn

          # mkdir -p /var/data/hadoop/hdfs/dn

          # chown hdfs:hadoop /var/data/hadoop/hdfs -R

4. Create log dir

          # cd /opt/<hadoop_version>

(Note: This is the new dir we moved into /opt a few steps before.)

          # mkdir logs

          # chmod g+w logs

          # chown -R yarn:hadoop .

5. Set HADOOP_HOME

          # vi /etc/profile.d/hadoop.sh

6. Add the following lines, then make the script executable and source it with the last two commands:

          #!/bin/bash

          HADOOP_HOME=/opt/<hadoop_version>

          PATH=$HADOOP_HOME/bin:$PATH

          export PATH HADOOP_HOME

          # chmod +x /etc/profile.d/hadoop.sh

          # source /etc/profile.d/hadoop.sh

7. Check Hadoop

          # echo $HADOOP_HOME

This should return the Hadoop home dir path.
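Since $HADOOP_HOME/bin is now on the PATH, two optional checks are also worth running: hadoop version prints the build you just produced, and hadoop checknative -a reports whether the native libraries were built in, which is the whole point of this exercise. You should see hadoop: true pointing at a libhadoop.so under your install dir (some optional codecs such as snappy may show false, which is fine):

          # hadoop version

          # hadoop checknative -a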

8. Configure Hadoop

          # cd /opt/hadoop-2.6.2/etc/hadoop/

          # vim core-site.xml

9. Add the following code inside the <configuration> section:

          <property>

          <name>fs.default.name</name>

          <value>hdfs://localhost:9000</value>

          </property>

          <property>

          <name>hadoop.http.staticuser.user</name>

          <value>hdfs</value>

          </property>

          # vim hdfs-site.xml

10. Add the following code inside the <configuration> section:

          <property>

          <name>dfs.replication</name>

          <value>1</value>

          </property>

          <property>

          <name>dfs.namenode.name.dir</name>

          <value>file:/var/data/hadoop/hdfs/nn</value>

          </property>

          <property>

          <name>fs.checkpoint.dir</name>

          <value>file:/var/data/hadoop/hdfs/snn</value>

          </property>

          <property>

          <name>fs.checkpoint.edits.dir</name>

          <value>file:/var/data/hadoop/hdfs/snn</value>

          </property>

          <property>

          <name>dfs.datanode.data.dir</name>

          <value>file:/var/data/hadoop/hdfs/dn</value>

          </property>

          # vim mapred-site.xml

11. Add the following code inside the <configuration> section:

          <property>

          <name>mapreduce.framework.name</name>

          <value>yarn</value>

          </property>

          <property>

          <name>mapreduce.jobhistory.intermediate-done-dir</name>

          <value>/mr-history/tmp</value>

          </property>

          <property>

          <name>mapreduce.jobhistory.done-dir</name>

          <value>/mr-history/done</value>

          </property>

          # vim yarn-site.xml

12. Add the following code inside the <configuration> section:

          <property>

          <name>yarn.nodemanager.aux-services</name>

          <value>mapreduce_shuffle</value>

          </property>

          <property>

          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

          <value>org.apache.hadoop.mapred.ShuffleHandler</value>

          </property>

13. Switch to hdfs user

          # su - hdfs

          # cd /opt/<hadoop_dir>/bin

          # ./hdfs namenode -format

          # cd /opt/<hadoop_dir>/sbin

          # ./hadoop-daemon.sh start namenode

          # ./hadoop-daemon.sh start secondarynamenode

          # ./hadoop-daemon.sh start datanode
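At this point you can verify that the HDFS daemons are actually running with jps (shipped with the JDK); it should list NameNode, SecondaryNameNode and DataNode processes:

          # jps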

14. Create /mr-history in the HDFS file system for the job history

          # hdfs dfs -mkdir -p /mr-history/tmp

          # hdfs dfs -mkdir -p /mr-history/done

          # hdfs dfs -chown -R yarn:hadoop /mr-history

15. Start YARN services

          # su - yarn

          # cd /opt/<hadoop_dir>/sbin

          # ./yarn-daemon.sh start resourcemanager

          # ./mr-jobhistory-daemon.sh start historyserver
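(Note: this guide does not explicitly start a node manager; for the sample job below to run on this single node you will most likely also need one:)

          # ./yarn-daemon.sh start nodemanager

You can again use jps to confirm that the ResourceManager, JobHistoryServer and, if started, the NodeManager are up.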

Check the following:
  • Check that the services are up and running.
    • Open a web browser (Firefox recommended) and open two tabs with the following URLs:
      • http://localhost:50070
      • http://localhost:8088
  • Run a sample job to test that Hadoop is working

                   # su - hdfs

                    # export YARN_EXAMPLES=/opt/<hadoop_dir>/share/hadoop/mapreduce

                    # yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-<hadoop_version>.jar pi 8 100000

  • You should start to see the execution in the terminal.
  • You can also follow the job's progress in the browser, in the ResourceManager web UI (the second tab opened above).

You are all set: your test environment is ready. Of course, you can change the configuration I have provided here to suit your own needs.

 

*For the purpose of this demonstration, VMware Player 12.0.1 was used.

**The CentOS 7 Full DVD ISO image was used. Ubuntu and other Debian-based Linux distros will also work, but some installation steps may differ.

*** You can install the latest version of Java or use the recommended version. A list of recommended versions can be found online.

**** There are several ways to set JAVA_HOME; I find this the easiest, and it guarantees that the Java path stays the same across reboots.

***** You can download the latest version of Protocol Buffers from https://developers.google.com/protocol-buffers/, but you will need to run a couple of extra commands. The above method is faster and it works just fine.
