Hadoop Series, Part 8: Running a Simple MapReduce Example
April 6, 2014

The previous posts in this series covered Hadoop's basic architecture, the underlying theory, installation, and basic HDFS usage. Now we can run a MapReduce job as a simple experiment.

Under /usr/local/hadoop/share/hadoop/mapreduce there is already an examples jar:

[hadoop@hadoop01 mapreduce]$ pwd
/usr/local/hadoop/share/hadoop/mapreduce
[hadoop@hadoop01 mapreduce]$ ll
total 4400
-rw-r--r--. 1 hadoop hadoop  482042 Oct  6 23:38 hadoop-mapreduce-client-app-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop  656365 Oct  6 23:38 hadoop-mapreduce-client-common-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 1455001 Oct  6 23:38 hadoop-mapreduce-client-core-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop  117184 Oct  6 23:38 hadoop-mapreduce-client-hs-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop    4063 Oct  6 23:38 hadoop-mapreduce-client-hs-plugins-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop   35216 Oct  6 23:38 hadoop-mapreduce-client-jobclient-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop 1434852 Oct  6 23:38 hadoop-mapreduce-client-jobclient-2.2.0-tests.jar
-rw-r--r--. 1 hadoop hadoop   21537 Oct  6 23:38 hadoop-mapreduce-client-shuffle-2.2.0.jar
-rw-r--r--. 1 hadoop hadoop  270227 Oct  6 23:38 hadoop-mapreduce-examples-2.2.0.jar
drwxr-xr-x. 2 hadoop hadoop    4096 Oct  6 23:38 lib
drwxr-xr-x. 2 hadoop hadoop    4096 Oct  6 23:38 lib-examples
drwxr-xr-x. 2 hadoop hadoop    4096 Oct  6 23:38 sources

This jar contains many examples provided by the framework. Let's learn how to run one of them.

[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.2.0.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

The output above shows that the examples jar provides 22 MapReduce programs. Let's take wordcount as an example.

How do we run it? Just like this?

[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount
Usage: wordcount <in> <out>
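Before filling in those paths, it helps to see what wordcount actually computes. Below is a minimal pure-Python sketch of the map and reduce phases (this is an illustration of the logic only, not the Hadoop Java API that the example jar really uses):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reducer: after the shuffle sorts pairs by key, sum the counts per word."""
    shuffled = sorted(pairs, key=itemgetter(0))  # stands in for the framework's sort/shuffle
    for word, group in groupby(shuffled, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# Two sample lines in the spirit of /etc/inittab comments
sample = ["# inittab is only used by upstart", "# upstart works"]
print(dict(reduce_phase(map_phase(sample))))
```

In the real job, the mapper runs once per input split and the framework performs the sort/shuffle between the two phases across the cluster; the single in-memory `sorted` call here is only a stand-in for that step.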

According to the usage message, we need to supply an input path and an output path for wordcount. First, upload a file to HDFS:

[hadoop@hadoop01 mapreduce]$ hdfs dfs -put  /etc/inittab
[hadoop@hadoop01 mapreduce]$ hdfs dfs -lsr
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - hadoop supergroup          0 2014-04-06 05:01 abc
-rw-r--r--   3 hadoop supergroup  157069653 2014-04-06 05:35 access_2013_05_31.log
-rw-r--r--   3 hadoop supergroup        884 2014-04-06 05:40 inittab

We will write the results to the out directory, so run the command again with both paths supplied:

[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount inittab out
14/04/06 05:41:49 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/04/06 05:41:49 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/04/06 05:41:50 INFO input.FileInputFormat: Total input paths to process : 1
14/04/06 05:41:50 INFO mapreduce.JobSubmitter: number of splits:1
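The log line `number of splits:1` reflects the input size: inittab is only 884 bytes, far below one HDFS block, so the whole file becomes a single split handled by a single mapper. A rough sketch of the split count, assuming the split size equals the Hadoop 2.x default block size of 128 MB (the real FileInputFormat logic also applies a slop factor, so this is a simplification):

```python
import math

# Assumption: split size = default dfs.blocksize in Hadoop 2.x (128 MB)
SPLIT_SIZE = 128 * 1024 * 1024

def num_splits(file_size, split_size=SPLIT_SIZE):
    """Approximate number of input splits FileInputFormat creates for one file."""
    return max(1, math.ceil(file_size / split_size))

print(num_splits(884))        # the 884-byte inittab -> 1 split
print(num_splits(157069653))  # the ~150 MB access log above -> 2 splits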

Hadoop handles a job like this very quickly, finishing in a few seconds. In my own test, running the same wordcount over a 200 MB httpd access log to count word occurrences took only a few tens of seconds.

View the results:

[hadoop@hadoop01 mapreduce]$ hdfs dfs -lsr
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - hadoop supergroup          0 2014-04-06 05:01 abc
-rw-r--r--   3 hadoop supergroup  157069653 2014-04-06 05:35 access_2013_05_31.log
-rw-r--r--   3 hadoop supergroup        884 2014-04-06 05:40 inittab
drwxr-xr-x   - hadoop supergroup          0 2014-04-06 05:41 out
-rw-r--r--   3 hadoop supergroup          0 2014-04-06 05:41 out/_SUCCESS
-rw-r--r--   3 hadoop supergroup        872 2014-04-06 05:41 out/part-r-00000

Here out/part-r-00000 is the output file:

[hadoop@hadoop01 mapreduce]$ hdfs dfs -cat out/part-r-00000
#    25
(Do    2
(The    1
-    7
/etc/init/control-alt-delete.conf    1
/etc/init/rc.conf    1
/etc/init/rcS.conf    1
/etc/init/serial.conf,    1
/etc/init/tty.conf    1
/etc/sysconfig/init.    1

…… (remaining output omitted) ……

The listing above shows only part of the output. The results are sorted lexicographically by word, with each line showing a word and the number of times it occurs.
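The format of each line comes from TextOutputFormat, the default output format: the key, a tab character, then the value. A small sketch that parses such lines back into a Python dict (the sample lines are taken from the output above):

```python
def parse_counts(lines):
    """Parse TextOutputFormat lines: key and value separated by a tab."""
    counts = {}
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        counts[word] = int(count)
    return counts

sample = ["#\t25", "(Do\t2", "/etc/init/rc.conf\t1"]
print(parse_counts(sample))
```

Parsing the part files this way is a common first step when feeding wordcount-style output into a downstream script.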

This confirms that a complete Hadoop cluster is up and running correctly; from here, all that remains is to run your own real-world MapReduce jobs on it.
