Bhaskar Karambelkar's Blog

Installing RHadoop on Cloudera CDH 4.3.0

 

Tags: hadoop cloudera rhadoop rstats


These are my notes for installing RHadoop on a Cloudera CDH 4.3.0 Hadoop Cluster. Although the notes are geared towards installing on CDH, they can be used to install RHadoop on any other Hadoop distro.

The default installation instructions, as per the RHadoop wiki, tell you to install the ‘R’ package from the EPEL repo. The problem with that is that ‘R’ pulls in the ‘R-core-devel’ package, which in turn pulls in all sorts of build tools, including the gcc compiler and a host of other ‘-devel’ library packages.
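If you want to see that dependency chain for yourself, yum can list it; this is just a quick check, not part of the installation:

# List what the 'R' meta-package would drag in (requires the EPEL repo to be enabled)
yum --enablerepo=epel deplist R | grep -i devel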

This is a big issue for us, as we are not allowed to have compilers installed on our production boxes, which, if you think about it, is a sensible restriction given the security requirements of a production environment.

With some effort I was able to install RHadoop without the ‘R-core-devel’ package on our Hadoop cluster, and these are the notes from that installation. The trick is to set up a so-called ‘build box’, where you install and compile everything, and then push the results out to the Hadoop cluster.

Set up a “build box” with the exact same OS version, JDK, and Hadoop version as your Hadoop cluster. Hadoop needs to be installed on this box in order to compile the RHadoop packages, but the box itself will not be made part of the Hadoop cluster.
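A quick way to confirm the build box really matches the cluster is to compare versions on both; run these on the build box and on any cluster node and check that the output is identical:

cat /etc/redhat-release   # OS release
java -version             # JDK version
hadoop version            # Hadoop / CDH version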

On all boxes, i.e. each Hadoop node as well as the ‘build box’, first set up the EPEL yum repository as per the instructions on the EPEL wiki.
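On a RHEL/CentOS 6 system, enabling EPEL typically amounts to installing the epel-release RPM; the URL below is only illustrative, so check the EPEL wiki for the one that matches your OS release.

# As root (or via sudo); URL is illustrative, see the EPEL wiki for the current release RPM
rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm

With the repository enabled, install ‘R-core’ on every box: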

# As root user  
yum -y --enablerepo=epel install R-core 
#If your system doesn't allow 'root' logins, use 'sudo yum ...' instead 

Then, on the build box only:

# As root user 
# If your system doesn't allow 'root' logins, use 'sudo bash -l' instead 
# to start a root shell and do the next 4 steps 
yum -y --enablerepo=epel install R 
export JAVA_HOME=<JDK Base Dir> 
export PATH=$JAVA_HOME/bin:$PATH 
R CMD javareconf 

# everything below can be done as a non-root user.
cat > ~/.Rprofile <<THE_END 
options(repos=structure(c(CRAN="<CRAN Mirror Closest to you>")))
THE_END 

cat > ~/.Renviron << THE_END 
R_LIBS_USER="/opt/R/library" 
HADOOP_HOME="<BASE DIRECTORY OF HADOOP>" 
HADOOP_CMD="<Full PATH of the 'hadoop' Command>"
HADOOP_STREAMING="<Full Path to Hadoop Streaming Jar File>"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:<Path to Hadoop Native Lib dir>" 
JAVA_HOME="<Path to JDK>" 
R_JAVA_LD_LIBRARY_PATH="${JAVA_HOME}/lib/amd64/server" 
THE_END 
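For reference, here is roughly what the filled-in .Renviron might look like on a package-based CDH 4.3.0 install. These paths are assumptions based on CDH’s default package layout (the streaming jar in particular lives elsewhere on MRv1-only installs), so verify each one on your own nodes:

R_LIBS_USER="/opt/R/library"
HADOOP_HOME="/usr/lib/hadoop"
HADOOP_CMD="/usr/bin/hadoop"
# MRv1-only installs keep the streaming jar under /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/
HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/hadoop/lib/native"
# Point this at wherever your JDK actually lives
JAVA_HOME="/usr/java/default"
R_JAVA_LD_LIBRARY_PATH="${JAVA_HOME}/lib/amd64/server"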

Rscript -e "install.packages(c('Rcpp', 'RJSONIO', 'itertools', 'digest'));" 
Rscript -e "install.packages(c('functional', 'stringr', 'plyr'));" 
Rscript -e 'install.packages("rJava");' 
Rscript -e 'install.packages("reshape2");' 
Rscript -e 'install.packages("bitops");' 
# Download rhdfs and rmr2 from the RHadoop GitHub page. 
R CMD INSTALL rhdfs_1.0.5.tar.gz 
R CMD INSTALL rmr2_2.0.0.tar.gz 

This should install all the required ‘R’ packages, along with ‘rhdfs’ and ‘rmr2’, under the /opt/R/library directory on the ‘build’ box.
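Before copying anything to the cluster, it is worth a quick sanity check that the freshly built packages actually load from that library path (with the .Renviron above, R_LIBS_USER already points at /opt/R/library):

# Both packages should load without errors
Rscript -e 'library(rhdfs); library(rmr2)'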

After this, simply scp/rsync the $HOME/.Renviron file and the /opt/R directory to each Hadoop node in the cluster. The .Renviron file needs to be in the home directory of every user who is going to run an ‘R’ map-reduce job.
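A minimal sketch of that copy step, assuming passwordless SSH and hypothetical node hostnames (substitute your own, and note that writing under /opt on the nodes may require root):

# node01..node03 are placeholders for your actual Hadoop node hostnames
for node in node01 node02 node03; do
  rsync -az /opt/R/ "${node}:/opt/R/"
  scp ~/.Renviron "${node}:~/"
done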

This should set up RHadoop on your Hadoop cluster without having to install a compiler toolchain on each Hadoop node.
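To verify the setup end to end, a small job along the lines of the first example in the rmr2 tutorial works well; run it from an R session on any cluster node, as a user whose home directory contains the .Renviron above:

# Tiny map-reduce job: square the numbers 1..10
library(rmr2)
small.ints <- to.dfs(1:10)                       # write a small input to HDFS
job <- mapreduce(input = small.ints,
                 map = function(k, v) keyval(v, v^2))
from.dfs(job)                                    # should return the squares of 1..10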