RHadoop, Part 2: Installing RHadoop
April 28, 2014

Environment: RHEL/CentOS 6 on x86_64, JDK 1.6.0_24, Hadoop installed under /usr/local/hadoop (as the output below shows).

1. Installing the R environment

yum repo: install the EPEL repo as follows:

# rpm -Uvh http://archive.linux.duke.edu/pub/epel//6/x86_64/epel-release-6-8.noarch.rpm  ## this URL may change

Because the cluster has many machines, I use one server as a local yum repo server and let all the other machines install from it, which speeds things up considerably. Building such a repo server is very simple: put the RPMs into a directory reachable over http or ftp, then run:

# createrepo -v /RPMpath/
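
Each client machine then points at this server through a .repo file; a minimal sketch, where the server hostname and the RPM path are placeholders for your own setup:

# /etc/yum.repos.d/local.repo -- hostname and path are examples only
[local-rpms]
name=Local RPM repository
baseurl=http://repo-server/RPMpath/
enabled=1
gpgcheck=0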

With the yum source configured, install R with:

[root@hdnode01 ~]# yum install R-core R-core-devel

After a successful install, check the version:

[root@hdnode01 ~]# R --version
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

2. Installing RHadoop

2.1 Preparation

RHadoop is a project from Revolution Analytics; its open-source code is hosted on GitHub. RHadoop consists of three R packages (rmr, rhdfs, rhbase), corresponding to the MapReduce, HDFS, and HBase components of the Hadoop stack. These packages are not available on CRAN, so download them yourself from:
https://github.com/RevolutionAnalytics/RHadoop/wiki

The downloaded files:

[root@hdnode01 rhadoop]# ll
total 152
-rw-r--r-- 1 root root 62731 Apr 27 20:41 rhbase_1.2.0.tar.gz
-rw-r--r-- 1 root root 25105 Apr 27 20:41 rhdfs_1.0.8.tar.gz
-rw-r--r-- 1 root root 57860 Apr 27 20:41 rmr2_3.1.0.tar.gz

Pass the system's Java configuration on to R:

[root@hdnode01 rhadoop]# R CMD javareconf
Java interpreter : /usr/java/jdk1.6.0_24/jre/bin/java
Java version     : 1.6.0_24
Java home path   : /usr/java/jdk1.6.0_24
Java compiler    : /usr/java/jdk1.6.0_24/bin/javac
Java headers gen.: /usr/java/jdk1.6.0_24/bin/javah
Java archive tool: /usr/java/jdk1.6.0_24/bin/jar

trying to compile and link a JNI progam
detected JNI cpp flags    : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
detected JNI linker flags : -L$(JAVA_HOME)/jre/lib/amd64/server -ljvm
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -I/usr/java/jdk1.6.0_24/include -I/usr/java/jdk1.6.0_24/include/linux -I/usr/local/include    -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic  -c conftest.c -o conftest.o
gcc -m64 -std=gnu99 -shared -L/usr/local/lib64 -o conftest.so conftest.o -L/usr/java/jdk1.6.0_24/jre/lib/amd64/server -ljvm -L/usr/lib64/R/lib -lR

JAVA_HOME        : /usr/java/jdk1.6.0_24
Java library path: $(JAVA_HOME)/jre/lib/amd64/server
JNI cpp flags    : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
JNI linker flags : -L$(JAVA_HOME)/jre/lib/amd64/server -ljvm
Updating Java configuration in /usr/lib64/R
Done.

2.2 Installing RHadoop

Install the dependencies:

Start R to get an R prompt, then run:

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "caTools"))

When prompted, pick the mirror closest to you; I chose 20 (Beijing).
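
Once the dependencies are in place, a quick sanity check that rJava can actually start a JVM is worthwhile; a minimal sketch, assuming the R CMD javareconf step above succeeded:

> library(rJava)
> .jinit()   # returns 0 when the JVM starts cleanly
> .jcall("java/lang/System", "S", "getProperty", "java.version")  # should report the JDK found by javareconf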

Install rmr2 on every node:

[root@hdnode01 rhadoop]# R CMD INSTALL rmr2_3.1.0.tar.gz

Install rhdfs on the master node:

[root@hdnode01 rhadoop]# R CMD INSTALL rhdfs_1.0.8.tar.gz

Check the installed library directory:

[root@hdnode01 ~]# ls /usr/lib64/R/library
base     class      datasets    graphics   itertools   Matrix   nnet      reshape2  rmr2     stats     tcltk
bitops   cluster    digest      grDevices  KernSmooth  methods  parallel  rhdfs     rpart    stats4    tools
boot     codetools  foreign     grid       lattice     mgcv     plyr      rJava     spatial  stringr   translations
caTools  compiler   functional  iterators  MASS        nlme     Rcpp      RJSONIO   splines  survival  utils

Problems you may run into during installation:

  • Problem 1:

[root@hdnode01 rhadoop]# R CMD INSTALL rmr2_3.1.0.tar.gz
* installing to library ‘/usr/lib64/R/library’
ERROR: dependencies ‘Rcpp’, ‘RJSONIO’, ‘bitops’, ‘digest’, ‘functional’, ‘reshape2’, ‘stringr’, ‘plyr’, ‘caTools’ are not available for package ‘rmr2’
* removing ‘/usr/lib64/R/library/rmr2’

Cause:

The dependency packages listed in the error message are not installed. Install them (see the install.packages() call above) and rerun the install.
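
If some nodes have no internet access, the dependency tarballs can be downloaded from CRAN on a connected machine and installed locally, the same way as the RHadoop packages themselves; the filename below is illustrative:

[root@hdnode01 rhadoop]# R CMD INSTALL Rcpp_0.11.1.tar.gz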

  • Problem 2:

[root@hdnode01 rhadoop]# R CMD INSTALL rhdfs_1.0.8.tar.gz
* installing to library ‘/usr/lib64/R/library’
* installing *source* package ‘rhdfs’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
converting help for package ‘rhdfs’
finding HTML links ... done
hdfs-file-access                        html
hdfs-file-manip                         html
hdfs.defaults                           html
hdfs.file-level                         html
initialization                          html
rhdfs                                   html
text.files                              html
** building package indices
** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/usr/lib64/R/library/rhdfs’

Cause: the hadoop command cannot be found. Fix it by setting HADOOP_CMD:

[root@hdnode01 rhadoop]# export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
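
Note that export only affects the current shell, so it is worth making the setting persistent. rmr2 additionally expects HADOOP_STREAMING to point at the Hadoop streaming jar. A sketch, assuming the Hadoop layout used in this article; the jar name depends on your Hadoop version:

# append to /etc/profile or ~/.bashrc so future sessions pick these up
export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar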

3. Using RHadoop

3.1 Basic HDFS file operations

List an HDFS directory.
Hadoop command:

[root@hdnode01 rhadoop]# hadoop fs -ls /user
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2014-04-22 00:00 /user/DataTest
drwxr-xr-x   - hadoop supergroup          0 2014-04-27 20:59 /user/hadoop

The R function:

[root@hdnode01 rhadoop]# R

R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(rhdfs)   # load the rhdfs package
Loading required package: rJava

HADOOP_CMD=/usr/local/hadoop/bin/hadoop

Be sure to run hdfs.init()
> hdfs.init()  # initialize the HDFS connection
14/04/28 13:28:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> hdfs.ls("/user/") # list HDFS files
permission  owner      group size          modtime           file
1 drwxr-xr-x hadoop supergroup    0 2014-04-22 00:00 /user/DataTest
2 drwxr-xr-x hadoop supergroup    0 2014-04-27 20:59   /user/hadoop

Other operations follow the same pattern. For example, to view a data file on HDFS:
Hadoop command: hadoop fs -cat /user/hadoop/o_same_school/part-m-00000
R function: hdfs.cat("/user/hadoop/o_same_school/part-m-00000")
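
Other rhdfs helpers mirror the hadoop fs subcommands in the same way. A quick sketch based on the rhdfs documentation; the paths here are made up for illustration:

> hdfs.mkdir("/user/hadoop/test")                    # create a directory
> hdfs.put("/tmp/local.txt", "/user/hadoop/test")    # upload a local file
> hdfs.get("/user/hadoop/test/local.txt", "/tmp")    # download it again
> hdfs.rm("/user/hadoop/test")                       # remove the directory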

3.2 Running an rmr MapReduce job

  • First, an ordinary R program:

> small.ints = 1:10
> sapply(small.ints, function(x) x^2)

The same computation as a MapReduce program in R:

> small.ints = to.dfs(1:10)
> mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")

Because MapReduce can only read data from HDFS, the input is first written there with to.dfs(); the job's result is then pulled back out of HDFS with from.dfs() (the path above is the job's temporary output location).
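
In practice there is no need to copy the temporary path by hand: mapreduce() returns a handle to its output that can be passed straight to from.dfs(). A minimal sketch of the same job:

> out <- mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> from.dfs(out)   # same result, no hard-coded temp path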

  • Second, the classic rmr example: wordcount, which counts the words in a file

> input <- '/user/hadoop/o_same_school/part-m-00000'
> wordcount = function(input, output = NULL, pattern = " ") {

      # map: split each line into words and emit a (word, 1) pair per word
      wc.map = function(., lines) {
          keyval(unlist(strsplit(x = lines, split = pattern)), 1)
      }

      # reduce: sum the counts collected for each word
      wc.reduce = function(word, counts) {
          keyval(word, sum(counts))
      }

      mapreduce(input = input, output = output, input.format = "text",
                map = wc.map, reduce = wc.reduce, combine = TRUE)
  }

> wordcount(input)
> from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")

I had placed the data file /user/hadoop/o_same_school/part-m-00000 on HDFS beforehand. Define the wordcount MapReduce function, run it, and finally fetch the result from HDFS with from.dfs().
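
As with the first example, the output handle can be captured instead of copying the temporary path, and rmr2's keys()/values() accessors make the result easy to inspect. A sketch:

> out <- wordcount(input)
> res <- from.dfs(out)
> df  <- data.frame(word = keys(res), count = values(res))
> head(df[order(-df$count), ])   # most frequent words first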
