Environment:
- Java: Oracle Java JDK 1.6.0_24
- Hadoop environment and installation: see the earlier post "Hadoop Series Part 5: Installing and Configuring Hadoop 2.2.0"
- Zookeeper environment: see "Hadoop Series Part 11: Introduction to Zookeeper and Its Installation"
- HBase environment: see "Hadoop Series Part 12: Building an HBase Cluster"
1. Installing the R environment:
Yum repo: install the EPEL repo as follows:
# rpm -Uvh http://archive.linux.duke.edu/pub/epel//6/x86_64/epel-release-6-8.noarch.rpm ## the URL may change
Since the cluster has quite a few machines, I set up one server as a yum repo server and had all the other machines install from it, which speeds things up considerably. Building such a repo server is very simple: put the RPMs into a directory served over HTTP or FTP, then run:
# createrepo -v /RPMpath/
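On each client machine you then point yum at this repo server with a small repo file. The sketch below is an example only; the host name and path are placeholders for your own repo server:
# cat /etc/yum.repos.d/local.repo
[local]
name=Local RPM repository
baseurl=http://repo-server/RPMpath/
enabled=1
gpgcheck=0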
Once the yum source is configured, install R with yum:
[root@hdnode01 ~]# yum install R-core R-core-devel
After a successful installation, check the version:
[root@hdnode01 ~]# R --version
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
2. Installing RHadoop:
2.1 Preparation:
RHadoop is a Revolution Analytics project; the open-source code is available on GitHub. RHadoop consists of three R packages (rmr, rhdfs, rhbase), corresponding to the MapReduce, HDFS, and HBase parts of the Hadoop stack. Since these three packages are not on CRAN, download them yourself from:
https://github.com/RevolutionAnalytics/RHadoop/wiki
The downloaded files:
[root@hdnode01 rhadoop]# ll
total 152
-rw-r--r-- 1 root root 62731 Apr 27 20:41 rhbase_1.2.0.tar.gz
-rw-r--r-- 1 root root 25105 Apr 27 20:41 rhdfs_1.0.8.tar.gz
-rw-r--r-- 1 root root 57860 Apr 27 20:41 rmr2_3.1.0.tar.gz
Pass the system's Java configuration to R:
[root@hdnode01 rhadoop]# R CMD javareconf
Java interpreter : /usr/java/jdk1.6.0_24/jre/bin/java
Java version : 1.6.0_24
Java home path : /usr/java/jdk1.6.0_24
Java compiler : /usr/java/jdk1.6.0_24/bin/javac
Java headers gen.: /usr/java/jdk1.6.0_24/bin/javah
Java archive tool: /usr/java/jdk1.6.0_24/bin/jar
trying to compile and link a JNI program
detected JNI cpp flags : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
detected JNI linker flags : -L$(JAVA_HOME)/jre/lib/amd64/server -ljvm
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -I/usr/java/jdk1.6.0_24/include -I/usr/java/jdk1.6.0_24/include/linux -I/usr/local/include -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c conftest.c -o conftest.o
gcc -m64 -std=gnu99 -shared -L/usr/local/lib64 -o conftest.so conftest.o -L/usr/java/jdk1.6.0_24/jre/lib/amd64/server -ljvm -L/usr/lib64/R/lib -lR
JAVA_HOME : /usr/java/jdk1.6.0_24
Java library path: $(JAVA_HOME)/jre/lib/amd64/server
JNI cpp flags : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
JNI linker flags : -L$(JAVA_HOME)/jre/lib/amd64/server -ljvm
Updating Java configuration in /usr/lib64/R
Done.
2.2 Installing RHadoop
Install the dependencies:
Run the R command to enter the R console, then run:
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "caTools"))
When prompted, choose a mirror by region; I chose 20 (Beijing).
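To skip the interactive mirror prompt, install.packages() also accepts a repos URL directly; the mirror below is only an example, substitute one close to you:
> install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional",
                     "stringr", "plyr", "reshape2", "caTools"),
                   repos = "http://cran.r-project.org")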
Install rmr on all nodes:
[root@hdnode01 rhadoop]# R CMD INSTALL rmr2_3.1.0.tar.gz
Install rhdfs on the master node (note that HADOOP_CMD must be set first; see Problem 2 below):
[root@hdnode01 rhadoop]# R CMD INSTALL rhdfs_1.0.8.tar.gz
Check the installed libraries:
[root@hdnode01 ~]# ls /usr/lib64/R/library
base class datasets graphics itertools Matrix nnet reshape2 rmr2 stats tcltk
bitops cluster digest grDevices KernSmooth methods parallel rhdfs rpart stats4 tools
boot codetools foreign grid lattice mgcv plyr rJava spatial stringr translations
caTools compiler functional iterators MASS nlme Rcpp RJSONIO splines survival utils
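You can also confirm from within R that the new packages are visible; a quick check:
> "rmr2" %in% rownames(installed.packages())
> "rhdfs" %in% rownames(installed.packages())
Both should return TRUE.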
Possible problems during installation:
- Problem 1:
[root@hdnode01 rhadoop]# R CMD INSTALL rmr2_3.1.0.tar.gz
* installing to library ‘/usr/lib64/R/library’
ERROR: dependencies ‘Rcpp’, ‘RJSONIO’, ‘bitops’, ‘digest’, ‘functional’, ‘reshape2’, ‘stringr’, ‘plyr’, ‘caTools’ are not available for package ‘rmr2’
* removing ‘/usr/lib64/R/library/rmr2’
Cause:
The dependencies listed above are not installed. Install them as described in section 2.2 and retry.
- Problem 2:
[root@hdnode01 rhadoop]# R CMD INSTALL rhdfs_1.0.8.tar.gz
* installing to library ‘/usr/lib64/R/library’
* installing *source* package ‘rhdfs’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
converting help for package ‘rhdfs’
finding HTML links ... done
hdfs-file-access html
hdfs-file-manip html
hdfs.defaults html
hdfs.file-level html
initialization html
rhdfs html
text.files html
** building package indices
** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/usr/lib64/R/library/rhdfs’
Cause: the hadoop command cannot be found. Fix it by setting HADOOP_CMD:
[root@hdnode01 rhadoop]# export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
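Note that export only affects the current shell session. To make the setting persistent you could, for example, append that line to /etc/profile; you can also set the variable from inside R before loading the package. A minimal sketch, assuming the same Hadoop path as above:
> Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop") # must be set before library(rhdfs)
> library(rhdfs)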
3. Using RHadoop:
3.1 Basic HDFS file operations
Listing an HDFS directory
The hadoop command:
[root@hdnode01 rhadoop]# hadoop fs -ls /user
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2014-04-22 00:00 /user/DataTest
drwxr-xr-x - hadoop supergroup 0 2014-04-27 20:59 /user/hadoop
The R function:
[root@hdnode01 rhadoop]# R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(rhdfs) # load the rhdfs package
Loading required package: rJava
HADOOP_CMD=/usr/local/hadoop/bin/hadoop
Be sure to run hdfs.init()
> hdfs.init() # initialize
14/04/28 13:28:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> hdfs.ls("/user/") # list HDFS files
permission owner group size modtime file
1 drwxr-xr-x hadoop supergroup 0 2014-04-22 00:00 /user/DataTest
2 drwxr-xr-x hadoop supergroup 0 2014-04-27 20:59 /user/hadoop
Other operations follow the same pattern, e.g. viewing an HDFS data file:
The hadoop command: hadoop fs -cat /user/hadoop/o_same_school/part-m-00000
The R function: hdfs.cat("/user/hadoop/o_same_school/part-m-00000")
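rhdfs covers the other common file operations as well; the paths below are placeholders for illustration:
> hdfs.put("/tmp/local.txt", "/user/hadoop/local.txt") # upload a local file to HDFS
> hdfs.get("/user/hadoop/local.txt", "/tmp/copy.txt")  # download an HDFS file locally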
3.2 Running an rmr job
- First, an ordinary R program:
> small.ints = 1:10
> sapply(small.ints, function(x) x^2)
The same computation as a MapReduce program in R:
> small.ints = to.dfs(1:10)
> mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")
Because MapReduce can only access the HDFS file system, the data must first be written to HDFS with to.dfs; the result of the MapReduce computation is then retrieved from HDFS with from.dfs. The /tmp/Rtmp... path above is the temporary output path that mapreduce() printed in my session; yours will differ.
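Rather than copying that temporary path by hand, you can capture the return value of mapreduce() and hand it straight to from.dfs(). A minimal sketch; the HADOOP_STREAMING jar path below is an assumption for Hadoop 2.2.0 and must match your installation:
> Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING =
    "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar") # assumed path
> library(rmr2)
> small.ints <- to.dfs(1:10)
> out <- mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> from.dfs(out) # no hard-coded temp path needed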
- Second, the classic rmr example: wordcount, which counts the words in a file.
> input <- '/user/hadoop/o_same_school/part-m-00000'
> wordcount = function(input, output = NULL, pattern = " ") {
    # map: split each line into words, emit (word, 1)
    wc.map = function(., lines) {
      keyval(unlist(strsplit(x = lines, split = pattern)), 1)
    }
    # reduce: sum the counts for each word
    wc.reduce = function(word, counts) {
      keyval(word, sum(counts))
    }
    mapreduce(input = input, output = output, input.format = "text",
              map = wc.map, reduce = wc.reduce, combine = T)
  }
> wordcount(input)
> from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")
I placed the data file /user/hadoop/o_same_school/part-m-00000 on HDFS ahead of time. Define the wordcount MapReduce function, run wordcount, and finally fetch the result from HDFS with from.dfs.
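When developing a job like this, it can help to test it on rmr2's local backend first, which runs the map and reduce functions in the local R process without touching the cluster. A sketch, assuming a small local text file /tmp/test.txt as a stand-in input:
> rmr.options(backend = "local")    # run map/reduce locally, no Hadoop involved
> out <- wordcount("/tmp/test.txt") # in local mode, input is a local path (placeholder)
> from.dfs(out)
> rmr.options(backend = "hadoop")   # switch back to the cluster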