
Hadoop Tutorial - How to install a pseudo-distributed Hadoop cluster on a MacBook Pro

In this tutorial I am going to explain the steps required to install and run Hadoop on a MacBook Pro. In this demo I am using Mac OS X version 10.6.8.

STEP 1 : Preparing Environment

The first step in the Hadoop installation is to prepare the environment: Java, SSH connectivity, and so on. We will go through each of the required setup steps in detail.

Hadoop is written in Java and requires Java 1.6 or higher. Mac OS X comes with Java, but you need to make sure that the required version is installed.

Open the terminal and type the command below.


Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ java -version
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-10M4609)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)

If Java is not installed or the version is below 1.6, then you need to download and install Java. Please see the note below.

For Java versions 6 and below, Apple supplies its own version of Java. For Mac OS X 10.6 and below, use the Software Update feature (available in the Apple menu) to check that you have the most up-to-date version of Java 6 for your Mac. For issues related to Apple Java 6 on Mac, contact Apple Support. Oracle and Java.com only support Java 7 and later, and only on 64-bit systems.
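As a side note, if you are ever unsure where Java lives on your Mac, OS X ships a small helper that prints the active Java home; this comes in handy later when we export JAVA_HOME. The command below exists on Mac OS X 10.5 and later (the path it prints will vary with your installation), and you can pass -v 1.6 to ask for a specific version.

Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ /usr/libexec/java_home
Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ /usr/libexec/java_home -v 1.6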

Make sure that SSH is installed on your machine. By default, Mac has SSH installed; verify this by running the commands below.


Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ which ssh
/usr/bin/ssh
Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ which ssh-keygen
/usr/bin/ssh-keygen

Now make sure that SSH actually works by running the command below.

Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ ssh localhost
Last login: Tue Nov 18 10:29:01 2014

There is a good chance that it will not work on the first attempt. If so, try one or more of the following to solve the issue.

Go to System Preferences -> Sharing, enable Remote Login, and try again.

If the above doesn't work, then you need to generate SSH keys as described below.

Run the command below to create the public and private keys. Make sure that you DON'T enter a passphrase when prompted.
       
Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ ssh-keygen -t rsa

The public key will now be available at ~/.ssh/id_rsa.pub and the private key at ~/.ssh/id_rsa.

Copy the public key to the authorized_keys file by running the command below.
  
Jinesh-Mathews-MacBook-Pro:.ssh jineshmathew$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
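If ssh localhost still prompts for a password after this, the usual culprit is file permissions: sshd ignores authorized_keys when the file or the ~/.ssh directory is writable by others. The standard fix (generic OpenSSH behavior, nothing Hadoop-specific) is shown below.

Jinesh-Mathews-MacBook-Pro:.ssh jineshmathew$ chmod 700 ~/.ssh
Jinesh-Mathews-MacBook-Pro:.ssh jineshmathew$ chmod 600 ~/.ssh/authorized_keys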

Now try SSH to localhost again.

Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ ssh localhost
Last login: Tue Nov 18 10:29:01 2014

STEP 2 : Installing Hadoop

Get the latest version of Hadoop from the Apache Hadoop download page.

At the time of writing this tutorial, I downloaded hadoop-2.5.1.tar.gz.
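If you prefer to download from the terminal, curl works too. This is just a sketch: the URL below points at the Apache archive, which keeps older releases such as 2.5.1; the download page may offer you a closer mirror.

Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz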

Run the following command to extract Hadoop, assuming hadoop-2.5.1.tar.gz is in your home directory.
Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ tar -xzvf hadoop-2.5.1.tar.gz

A directory hadoop-2.5.1 has been created. Run the commands below to verify.
Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ cd hadoop-2.5.1
Jinesh-Mathews-MacBook-Pro:hadoop-2.5.1 jineshmathew$ ls -ltr
total 48
drwxr-xr-x@  4 jineshmathew  staff    136 Sep  5 18:30 share
drwxr-xr-x@  3 jineshmathew  staff    102 Sep  5 18:30 lib
drwxr-xr-x@  3 jineshmathew  staff    102 Sep  5 18:30 etc
-rw-r--r--@  1 jineshmathew  staff   1366 Sep  5 18:30 README.txt
-rw-r--r--@  1 jineshmathew  staff    101 Sep  5 18:30 NOTICE.txt
-rw-r--r--@  1 jineshmathew  staff  15458 Sep  5 18:30 LICENSE.txt
drwxr-xr-x@ 29 jineshmathew  staff    986 Sep  5 18:30 sbin
drwxr-xr-x@ 11 jineshmathew  staff    374 Sep  5 18:30 libexec
drwxr-xr-x@  7 jineshmathew  staff    238 Sep  5 18:30 include
drwxr-xr-x@ 13 jineshmathew  staff    442 Sep  5 18:30 bin
drwxr-xr-x   9 jineshmathew  staff    306 Nov 18 10:37 logs

For ease of use I am going to export three environment variables and add them to ~/.bash_profile so that they are set every time a shell starts.

Jinesh-Mathews-MacBook-Pro:hadoop-2.5.1 jineshmathew$ vi ~/.bash_profile

export JAVA_HOME=/Library/Java/Home
export HADOOP_HOME=/Users/jineshmathew/hadoop-2.5.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
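The new variables take effect only in new shells. To pick them up in the current session and confirm they are set correctly, run:

Jinesh-Mathews-MacBook-Pro:hadoop-2.5.1 jineshmathew$ source ~/.bash_profile
Jinesh-Mathews-MacBook-Pro:hadoop-2.5.1 jineshmathew$ echo $HADOOP_HOME
/Users/jineshmathew/hadoop-2.5.1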


Now try running Hadoop:
Jinesh-Mathews-MacBook-Pro:~ jineshmathew$ cd $HADOOP_HOME
Jinesh-Mathews-MacBook-Pro:hadoop-2.5.1 jineshmathew$ bin/hadoop version
Hadoop 2.5.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 2e18d179e4a8065b6a9f29cf2de9451891265cce
Compiled by jenkins on 2014-09-05T23:11Z
Compiled with protoc 2.5.0
From source with checksum 6424fcab95bfff8337780a181ad7c78
This command was run using /Users/jineshmathew/hadoop-2.5.1/share/hadoop/common/hadoop-common-2.5.1.jar

STEP 3 : Hadoop Configuration

Now we need to configure Hadoop to run in one of the three supported modes:
    • Standalone mode
    • Pseudo-distributed mode
    • Fully distributed mode
For learning, I prefer at least pseudo-distributed mode, which simulates a Hadoop cluster with just one NameNode and one DataNode. It also runs a SecondaryNameNode, all on one machine, which is our Mac.
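Before editing the XML files below, it is worth making sure the Hadoop scripts themselves can find Java. The daemons started by start-dfs.sh run over SSH and do not always inherit the JAVA_HOME you exported in ~/.bash_profile; a common fix, if they later complain that JAVA_HOME is not set, is to hardcode it in etc/hadoop/hadoop-env.sh, reusing the same path we exported earlier:

# In etc/hadoop/hadoop-env.sh, replace the default JAVA_HOME line with:
export JAVA_HOME=/Library/Java/Home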

Edit etc/hadoop/core-site.xml to have the following.
      <configuration>
        <property>
              <name>fs.defaultFS</name>
              <value>hdfs://localhost:9000</value>
        </property>
     </configuration>
Edit etc/hadoop/hdfs-site.xml to have the following.
      <configuration>
        <property>
              <name>dfs.replication</name>
              <value>1</value>
        </property>
     </configuration>
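One optional addition while we are editing config files: by default, HDFS keeps its data under /tmp (the hadoop.tmp.dir property defaults to /tmp/hadoop-${user.name}), which the OS may clean out. Below is a sketch of an extra property for etc/hadoop/core-site.xml that moves it somewhere durable; the path shown is just an example, so use any directory you own.

        <property>
              <name>hadoop.tmp.dir</name>
              <value>/Users/jineshmathew/hadoop-data</value>
        </property>

Note that if you change this after formatting HDFS (Step 4), you will have to format again, which erases any existing HDFS data.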

STEP 4 : Running Hadoop
  • Now we need to format HDFS, the file system for Hadoop, by running the following command.
          $ bin/hdfs namenode -format
  • Start the NameNode and DataNode daemons.
           $ sbin/start-dfs.sh          
  • Now create the following directories inside HDFS.
           $ bin/hdfs dfs -mkdir /user
           $ bin/hdfs dfs -mkdir /user/<username>
           $ bin/hdfs dfs -mkdir /user/<username>/input
  • Check the directory by running the following command 
          $ bin/hdfs dfs -ls /user/jineshmathew/input
  • Now copy some files from the local file system to HDFS.
          $ bin/hdfs dfs -put etc/hadoop /user/jineshmathew/input
  • Now it is time to run one of the examples provided by Hadoop. (If you re-run it later, see the cleanup note at the end of this step.)
          $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input/hadoop output 'dfs[a-z.]+'

This Hadoop job greps all files in the input/hadoop directory for the pattern 'dfs[a-z.]+' and counts the matches.

Once the job is finished, you can check the output by running the following command.

         $ bin/hdfs dfs -cat  /user/jineshmathew/output/*

           14/11/18 10:54:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
          6 dfs.audit.logger
          4 dfs.class
          3 dfs.server.namenode.
          2 dfs.period
          2 dfs.audit.log.maxfilesize
          2 dfs.audit.log.maxbackupindex
          1 dfsmetrics.log
          1 dfsadmin
          1 dfs.servers
          1 dfs.replication
          1 dfs.file
  • You can also browse the NameNode web interface at http://localhost:50070/
  • At any time, check all running Hadoop daemons with the jps command.
         $ jps
         592 SecondaryNameNode
         424 NameNode
         495 DataNode
         1798 Jps
  • Once you are done, stop Hadoop by running the following command.
          $ sbin/stop-dfs.sh
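A note on re-running the example job: MapReduce will refuse to start if the output directory already exists, so a second run of the grep job above fails until you remove it:

          $ bin/hdfs dfs -rm -r /user/jineshmathew/output

You can also copy the results out of HDFS to the local file system with bin/hdfs dfs -get /user/jineshmathew/output output if you prefer to inspect them with regular tools.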

I hope you had a smooth Hadoop installation and picked up some concepts along the way. More tutorials will be uploaded soon. Please don't forget to post your comments and questions.
