Setting up your MacBook for Apache Hadoop

This is the second post in this series. I will walk you through the steps to run Hadoop on your puny little MacBook. If you have ever hesitated to learn about Hadoop services because you lacked a "place to run code", you will need to find another excuse after this post.

I will be using Homebrew to install most things in this series; it makes your life a lot simpler.

Prerequisites

You need to have Java installed, and you need to know how to set up your bash_profile. If you don't know how, see the previous post on bash_profile setup.

Installation

Homebrew normally picks the latest available stable build and installs it. In our case, though, the Hadoop installation needs to support other services like Hive and Presto, so we will pin an older stable build: 2.7.6.

To install a previous version, you need the checksum of that version and must add it to the Hadoop formula in Homebrew. You can get the checksum from the Apache repo.

Edit your Hadoop formula using brew edit hadoop as shown below:


UPDATE - Jan 2020: hadoop-2.7.6 has since moved to the Apache archive, so use this URL:

http://archive.apache.org/dist/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz
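For reference, the relevant lines of the formula should look roughly like this after editing (the sha256 value below is a placeholder; paste the lowercase checksum you copied from the Apache repo):

url "http://archive.apache.org/dist/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz"
sha256 "<lowercase-checksum-of-hadoop-2.7.6.tar.gz>"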

Modify the url to point to 2.7.6 and change the sha256 checksum accordingly. Do note that the checksum on the Apache website uses uppercase characters; you need to convert them to lowercase. If the checksum doesn't match, the installation will fail.

Once you are done, you can install Hadoop by running brew install hadoop.

Hadoop Configurations

All installations via Homebrew go to /usr/local/Cellar. In our case, the home directory for the Hadoop installation is /usr/local/Cellar/hadoop/2.7.6/libexec/ and the configuration files live at /usr/local/Cellar/hadoop/2.7.6/libexec/etc/hadoop. We need to update the bash_profile to set up environment variables for Hadoop.

Update bash_profile

Go to your home directory, open .bash_profile, and add the environment variables for Hadoop.
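A minimal sketch of the entries, assuming the Homebrew install location mentioned above (adjust the version number if yours differs):

export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.6/libexec
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Put bin and sbin on the PATH so commands like hdfs and start-dfs.sh
# are available from any shell.
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin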


After editing, source the file with source ~/.bash_profile

Modify hadoop-env.sh

Go to the Hadoop configuration directory (cd $HADOOP_CONF_DIR) and edit hadoop-env.sh. Make sure to export the correct JAVA_HOME.
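On macOS, the built-in /usr/libexec/java_home helper resolves the JDK path for you. A sketch, pinned to Java 8 since Hadoop 2.7 runs on Java 7/8:

# Resolve the Java 8 JDK installed on this machine.
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)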



Modify core-site.xml

Open core-site.xml and add a value for the property fs.defaultFS.
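A minimal sketch of the file; hdfs://localhost:9000 is a typical value for a single-node setup, so treat the port as an assumption and keep it consistent wherever you reference HDFS:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <!-- The default filesystem URI; all HDFS paths resolve against this. -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>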


Modify hdfs-site.xml

You need to create a directory where your HDFS data will reside. I have created /Users/myuser/hadoop-data/hadoop for this purpose. We need to edit hdfs-site.xml and add this location. Make sure that you give the path to the location you created on your own machine.
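For example, to create the directory (substitute your own username):

$ mkdir -p /Users/myuser/hadoop-data/hadoop

Then add the following properties between the configuration tags: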


<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/Users/myuser/hadoop-data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/Users/myuser/hadoop-data/hadoop/dfs/data</value>
</property>

Modify mapred-site.xml

This file tells Hadoop to submit your MapReduce jobs to YARN and sets memory limits for the map and reduce tasks. Create this file in the config directory.
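If the file doesn't exist yet, one way to create it is to copy the template that ships with Hadoop 2.x (assuming the standard layout):

$ cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml

Then add the following configurations: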

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>

Modify yarn-site.xml

This file controls the YARN settings for your Hadoop installation. Add the following within the configuration tags of this file:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>12288</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>

Setting up HDFS

Now that the configurations are done, you need to format the namenode. This can be done by running hdfs namenode -format

Starting Hadoop

You can now start the Hadoop services with the following commands:
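$ start-dfs.sh
$ start-yarn.sh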


Note: If your services fail to start with the message "can't connect to localhost", you need to enable Remote Login under System Preferences > Sharing, since the start scripts ssh into localhost.

After starting the NameNode and YARN, you can verify that they are running using jps or by visiting the web UIs.
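On a healthy single-node setup, jps should list processes along these lines (the PIDs here are illustrative):

$ jps
11520 NameNode
11619 DataNode
11743 SecondaryNameNode
11902 ResourceManager
12001 NodeManager
12150 Jps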

NameNode (default UI port in Hadoop 2.x): http://localhost:50070


YARN ResourceManager: http://localhost:8088

Stopping Hadoop

You can stop Hadoop services by using the following commands:

$ stop-yarn.sh
$ stop-dfs.sh

That's it. It wasn't that complex, was it? 

In the next post, I will configure Apache Hive and run HQL on this setup. Do share your feedback.

Previous post: bash_profile setup | Next post: Install Hive
