- Hadoop series, part 7: Basic HDFS operations
- Hadoop series, part 6: Compiling and installing Hadoop on CentOS x64
- Hadoop series, part 5: Installing and configuring Hadoop 2.2.0
- Hadoop series, part 4: Choosing a Hadoop version
- Hadoop series, part 3: HDFS (Hadoop Distributed File System) fundamentals
- Hadoop series, part 2: MapReduce fundamentals
- Hadoop series, part 1: An introduction to Hadoop
The previous posts in this series covered Hadoop's basic architecture, its theoretical foundations, installation, and basic HDFS usage. Now we can run a MapReduce job as a simple experiment.

Under /usr/local/hadoop/share/hadoop/mapreduce there is already a jar of example programs:
[hadoop@hadoop01 mapreduce]$ pwd
/usr/local/hadoop/share/hadoop/mapreduce
[hadoop@hadoop01 mapreduce]$ ll
total 4400
-rw-r--r--. 1 hadoop hadoop 482042 Oct 6 23:38 hadoop-mapreduce-client-app-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 656365 Oct 6 23:38 hadoop-mapreduce-client-common-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 1455001 Oct 6 23:38 hadoop-mapreduce-client-core-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 117184 Oct 6 23:38 hadoop-mapreduce-client-hs-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 4063 Oct 6 23:38 hadoop-mapreduce-client-hs-plugins-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 35216 Oct 6 23:38 hadoop-mapreduce-client-jobclient-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 1434852 Oct 6 23:38 hadoop-mapreduce-client-jobclient-2.2.0-tests.jar
-rw-r--r--. 1 hadoop hadoop 21537 Oct 6 23:38 hadoop-mapreduce-client-shuffle-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 270227 Oct 6 23:38 hadoop-mapreduce-examples-2.2.0.jar
drwxr-xr-x. 2 hadoop hadoop 4096 Oct 6 23:38 lib
drwxr-xr-x. 2 hadoop hadoop 4096 Oct 6 23:38 lib-examples
drwxr-xr-x. 2 hadoop hadoop 4096 Oct 6 23:38 sources
This jar contains many examples provided by the framework. Let's learn how to run one of them. Invoking the jar with no arguments lists the available programs:
[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.2.0.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
The output above shows that the examples jar provides 22 MapReduce example programs. Let's take wordcount as an example. How do we run it? Perhaps like this?
[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount
Usage: wordcount <in> <out>
According to the usage message, we need to supply an input path and an output path for wordcount. First, upload a file to HDFS:
[hadoop@hadoop01 mapreduce]$ hdfs dfs -put /etc/inittab
[hadoop@hadoop01 mapreduce]$ hdfs dfs -lsr
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x - hadoop supergroup 0 2014-04-06 05:01 abc
-rw-r--r-- 3 hadoop supergroup 157069653 2014-04-06 05:35 access_2013_05_31.log
-rw-r--r-- 3 hadoop supergroup 884 2014-04-06 05:40 inittab
We will write the results to the out directory. Run the command again, this time with both paths:
[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount inittab out
14/04/06 05:41:49 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/04/06 05:41:49 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/04/06 05:41:50 INFO input.FileInputFormat: Total input paths to process : 1
14/04/06 05:41:50 INFO mapreduce.JobSubmitter: number of splits:1
Hadoop finishes a job like this in just a few seconds. In my own test, running the same wordcount job against a roughly 200 MB httpd log also took only a few tens of seconds.

Check the results:
[hadoop@hadoop01 mapreduce]$ hdfs dfs -lsr
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x - hadoop supergroup 0 2014-04-06 05:01 abc
-rw-r--r-- 3 hadoop supergroup 157069653 2014-04-06 05:35 access_2013_05_31.log
-rw-r--r-- 3 hadoop supergroup 884 2014-04-06 05:40 inittab
drwxr-xr-x - hadoop supergroup 0 2014-04-06 05:41 out
-rw-r--r-- 3 hadoop supergroup 0 2014-04-06 05:41 out/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 872 2014-04-06 05:41 out/part-r-00000
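There is a single part-r-00000 file because this job ran with one reduce task; with more reducers, each key would be routed to one of several part-r-NNNNN files by Hadoop's default hash partitioner, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. A minimal Python sketch of that routing rule (the `java_string_hashcode` helper is my own approximation of Java's `String.hashCode()`, written for illustration):

```python
def java_string_hashcode(s):
    """Approximation of Java's String.hashCode(), which Hadoop's
    default HashPartitioner applies to Text keys."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret as a signed 32-bit int, as Java would.
    if h >= 0x80000000:
        h -= 0x100000000
    return h

def partition(key, num_reducers):
    # Hadoop's rule: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_reducers

# With one reducer, every key lands in part-r-00000:
assert all(partition(w, 1) == 0 for w in ["#", "-", "/etc/init/rc.conf"])

# With, say, 4 reducers, keys spread across part-r-00000..part-r-00003:
for w in ["#", "-", "/etc/init/rc.conf"]:
    print(w, "-> part-r-%05d" % partition(w, 4))
```

The empty _SUCCESS marker file is what Hadoop writes to signal that the job completed without errors.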
The file out/part-r-00000 holds the output:
[hadoop@hadoop01 mapreduce]$ hdfs dfs -cat out/part-r-00000
# 25
(Do 2
(The 1
- 7
/etc/init/control-alt-delete.conf 1
/etc/init/rc.conf 1
/etc/init/rcS.conf 1
/etc/init/serial.conf, 1
/etc/init/tty.conf 1
/etc/sysconfig/init. 1
…(remaining output omitted)…
Only part of the output is shown above. The results are sorted lexicographically by word, with each line showing a word and the number of times it occurs.

This confirms that the Hadoop cluster is fully operational; all that remains is to run your own real-world MapReduce jobs on it.
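The semantics behind that output can be reproduced locally with a short Python sketch: the map phase emits a (word, 1) pair per whitespace-separated token, the shuffle groups pairs by word and delivers keys to the reducer in sorted order (which is why the output above is sorted), and the reduce phase sums the counts. This is a local simulation of the idea, not Hadoop's actual Java implementation:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: split each line on whitespace and emit (word, 1) per token.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle/sort: group values by key; keys reach the reducer sorted.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    for word, counts in groups:
        yield word, sum(counts)

lines = ["# comment", "# another line", "line two"]
for word, count in reduce_phase(shuffle(map_phase(lines))):
    print(f"{word}\t{count}")
```

Running this prints each distinct word with its count, one per line and sorted by word, mirroring the tab-separated format of part-r-00000 above.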