I've recently started digging into big data, beginning with the most fundamental component, Hadoop; from there I plan to work through the other pieces of the Hadoop ecosystem one by one.
Hadoop can be installed in four modes: local (standalone), pseudo-distributed, fully distributed, and highly available distributed. Since this is for personal study (the truth is I don't own that many machines, and running enough virtual machines would probably exhaust my RAM, T_T), this post covers only the local and pseudo-distributed installations.
Environment Preparation
# OS information
$ cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
# Kernel information
$ uname -r
3.10.0-693.11.6.el7.x86_64
# Hostname (set it, then check the status)
$ hostnamectl set-hostname v108.zlikun.com
$ hostnamectl status
   Static hostname: v108.zlikun.com
         Icon name: computer-vm
           Chassis: vm
        Machine ID: da1dac0e4969496a8906d711f95f2a7f
           Boot ID: 8ffc47fb1b7148ab992d8bf6f3f32ac1
    Virtualization: vmware
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-693.11.6.el7.x86_64
      Architecture: x86-64
# Map the hostname in `/etc/hosts` (don't read anything into the name v108; it's simply the 8th VM on my machine, ^_^)
192.168.1.108 v108.zlikun.com v108
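# Optional sanity check (my own addition, not a required step): the name should now resolve to the address configured above
$ getent hosts v108
192.168.1.108   v108.zlikun.com v108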
Installing Java
# Extract the `jdk-8u151-linux-x64.tar.gz` archive and move the result to `/usr/local`
/usr/local/jdk1.8.0_151
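# A minimal sketch of those two steps (assuming the archive sits in the current directory):
$ tar zxvf jdk-8u151-linux-x64.tar.gz
$ mv jdk1.8.0_151 /usr/local/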
# Configure the environment variables in `/etc/profile`
export JAVA_HOME=/usr/local/jdk1.8.0_151
export PATH=$PATH:$JAVA_HOME/bin
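# Reload the profile so the new variables take effect in the current shell
$ source /etc/profile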
# Check the JDK version
$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
Installing Hadoop Locally
# Hadoop 2.7.5 is used here; documentation: http://hadoop.apache.org/docs/r2.7.5/ . The installation guide referenced below is:
# http://hadoop.apache.org/docs/r2.7.5/hadoop-project-dist/hadoop-common/SingleCluster.html
# Installation: for a local install, simply extract the `hadoop-2.7.5.tar.gz` archive to the target directory
$ tar zxvf hadoop-2.7.5.tar.gz
$ mv hadoop-2.7.5 /opt/hadoop
# Remove all *.cmd files (they are Windows-only; deleting them is optional, a matter of personal habit).
# `find` is used here because a single-level glob like /opt/hadoop/*/*.cmd would miss etc/hadoop/*.cmd
$ find /opt/hadoop -name '*.cmd' -delete
# Configure the HADOOP_HOME environment variable
$ echo 'export HADOOP_HOME=/opt/hadoop' >> /etc/profile
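# Reload the profile and verify (optional sanity check)
$ source /etc/profile
$ echo $HADOOP_HOME
/opt/hadoop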
# The Hadoop directory tree looks like this
/opt/hadoop/
├── bin
│   ├── container-executor
│   ├── hadoop
│   ├── hdfs
│   ├── mapred
│   ├── rcc
│   ├── test-container-executor
│   └── yarn
├── etc
│   └── hadoop
├── include
│   ├── hdfs.h
│   ├── Pipes.hh
│   ├── SerialUtils.hh
│   ├── StringUtils.hh
│   └── TemplateFactory.hh
├── lib
│   └── native
├── libexec
│   ├── hadoop-config.sh
│   ├── hdfs-config.sh
│   ├── httpfs-config.sh
│   ├── kms-config.sh
│   ├── mapred-config.sh
│   └── yarn-config.sh
├── LICENSE.txt
├── NOTICE.txt
├── README.txt
├── sbin
│   ├── distribute-exclude.sh
│   ├── hadoop-daemon.sh
│   ├── hadoop-daemons.sh
│   ├── hdfs-config.sh
│   ├── httpfs.sh
│   ├── kms.sh
│   ├── mr-jobhistory-daemon.sh
│   ├── refresh-namenodes.sh
│   ├── slaves.sh
│   ├── start-all.sh
│   ├── start-balancer.sh
│   ├── start-dfs.sh
│   ├── start-secure-dns.sh
│   ├── start-yarn.sh
│   ├── stop-all.sh
│   ├── stop-balancer.sh
│   ├── stop-dfs.sh
│   ├── stop-secure-dns.sh
│   ├── stop-yarn.sh
│   ├── yarn-daemon.sh
│   └── yarn-daemons.sh
└── share
    ├── doc
    └── hadoop
# Run the `bin/hadoop` command with no arguments to see its usage
$ cd /opt/hadoop
$ bin/hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
  credential           interact with credential providers
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
# The first thing to do after installing Hadoop is usually to set its JAVA_HOME
$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_151
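# Verify the setup; with JAVA_HOME configured, this should print the version banner, starting with "Hadoop 2.7.5"
$ bin/hadoop version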
Word Count Example
# Prepare a local file; we will count its word frequencies next
$ mkdir input
$ echo 'java golang ruby rust erlang java javascript lua rust java' > input/lang.txt
# Run the bundled example `MapReduce` program to perform the word count
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount input output
18/01/30 08:42:34 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/01/30 08:42:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/01/30 08:42:34 INFO input.FileInputFormat: Total input paths to process : 1
18/01/30 08:42:34 INFO mapreduce.JobSubmitter: number of splits:1
18/01/30 08:42:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local935371141_0001
18/01/30 08:42:34 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/01/30 08:42:34 INFO mapreduce.Job: Running job: job_local935371141_0001
18/01/30 08:42:34 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/01/30 08:42:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/01/30 08:42:34 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
18/01/30 08:42:34 INFO mapred.LocalJobRunner: Waiting for map tasks
18/01/30 08:42:34 INFO mapred.LocalJobRunner: Starting task: attempt_local935371141_0001_m_000000_0
18/01/30 08:42:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/01/30 08:42:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
18/01/30 08:42:34 INFO mapred.MapTask: Processing split: file:/opt/hadoop/input/lang.txt:0+59
18/01/30 08:42:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/01/30 08:42:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/01/30 08:42:34 INFO mapred.MapTask: soft limit at 83886080
18/01/30 08:42:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/01/30 08:42:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/01/30 08:42:34 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
18/01/30 08:42:34 INFO mapred.LocalJobRunner:
18/01/30 08:42:34 INFO mapred.MapTask: Starting flush of map output
18/01/30 08:42:34 INFO mapred.MapTask: Spilling map output
18/01/30 08:42:34 INFO mapred.MapTask: bufstart = 0; bufend = 99; bufvoid = 104857600
18/01/30 08:42:34 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214360(104857440); length = 37/6553600
18/01/30 08:42:34 INFO mapred.MapTask: Finished spill 0
18/01/30 08:42:34 INFO mapred.Task: Task:attempt_local935371141_0001_m_000000_0 is done. And is in the process of committing
18/01/30 08:42:35 INFO mapred.LocalJobRunner: map
18/01/30 08:42:35 INFO mapred.Task: Task 'attempt_local935371141_0001_m_000000_0' done.
18/01/30 08:42:35 INFO mapred.Task: Final Counters for attempt_local935371141_0001_m_000000_0: Counters: 18
    File System Counters
        FILE: Number of bytes read=296042
        FILE: Number of bytes written=585271
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=1
        Map output records=10
        Map output bytes=99
        Map output materialized bytes=92
        Input split bytes=96
        Combine input records=10
        Combine output records=7
        Spilled Records=7
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=16
        Total committed heap usage (bytes)=165744640
    File Input Format Counters
        Bytes Read=59
18/01/30 08:42:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local935371141_0001_m_000000_0
18/01/30 08:42:35 INFO mapred.LocalJobRunner: map task executor complete.
18/01/30 08:42:35 INFO mapred.LocalJobRunner: Waiting for reduce tasks
18/01/30 08:42:35 INFO mapred.LocalJobRunner: Starting task: attempt_local935371141_0001_r_000000_0
18/01/30 08:42:35 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/01/30 08:42:35 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
18/01/30 08:42:35 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@475ebd65
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10
18/01/30 08:42:35 INFO reduce.EventFetcher: attempt_local935371141_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
18/01/30 08:42:35 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local935371141_0001_m_000000_0 decomp: 88 len: 92 to MEMORY
18/01/30 08:42:35 INFO reduce.InMemoryMapOutput: Read 88 bytes from map-output for attempt_local935371141_0001_m_000000_0
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 88, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->88
18/01/30 08:42:35 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
18/01/30 08:42:35 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
18/01/30 08:42:35 WARN io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
    at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:206)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/01/30 08:42:35 INFO mapred.Merger: Merging 1 sorted segments
18/01/30 08:42:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 79 bytes
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: Merged 1 segments, 88 bytes to disk to satisfy reduce memory limit
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: Merging 1 files, 92 bytes from disk
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
18/01/30 08:42:35 INFO mapred.Merger: Merging 1 sorted segments
18/01/30 08:42:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 79 bytes
18/01/30 08:42:35 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/01/30 08:42:35 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
18/01/30 08:42:35 INFO mapred.Task: Task:attempt_local935371141_0001_r_000000_0 is done. And is in the process of committing
18/01/30 08:42:35 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/01/30 08:42:35 INFO mapred.Task: Task attempt_local935371141_0001_r_000000_0 is allowed to commit now
18/01/30 08:42:35 INFO output.FileOutputCommitter: Saved output of task 'attempt_local935371141_0001_r_000000_0' to file:/opt/hadoop/output/_temporary/0/task_local935371141_0001_r_000000
18/01/30 08:42:35 INFO mapred.LocalJobRunner: reduce > reduce
18/01/30 08:42:35 INFO mapred.Task: Task 'attempt_local935371141_0001_r_000000_0' done.
18/01/30 08:42:35 INFO mapred.Task: Final Counters for attempt_local935371141_0001_r_000000_0: Counters: 24
    File System Counters
        FILE: Number of bytes read=296258
        FILE: Number of bytes written=585433
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Combine input records=0
        Combine output records=0
        Reduce input groups=7
        Reduce shuffle bytes=92
        Reduce input records=7
        Reduce output records=7
        Spilled Records=7
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=2
        Total committed heap usage (bytes)=165744640
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Output Format Counters
        Bytes Written=70
18/01/30 08:42:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local935371141_0001_r_000000_0
18/01/30 08:42:35 INFO mapred.LocalJobRunner: reduce task executor complete.
18/01/30 08:42:35 INFO mapreduce.Job: Job job_local935371141_0001 running in uber mode : false
18/01/30 08:42:35 INFO mapreduce.Job: map 100% reduce 100%
18/01/30 08:42:35 INFO mapreduce.Job: Job job_local935371141_0001 completed successfully
18/01/30 08:42:35 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=592300
        FILE: Number of bytes written=1170704
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=1
        Map output records=10
        Map output bytes=99
        Map output materialized bytes=92
        Input split bytes=96
        Combine input records=10
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=92
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=18
        Total committed heap usage (bytes)=331489280
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=59
    File Output Format Counters
        Bytes Written=70
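# Note: the `EBADF: Bad file descriptor` readahead warning in the middle of the log above
# appears to be a harmless quirk of running under the local job runner and can be ignored.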
# View the results of the word count
$ cat output/*
erlang 1
golang 1
java 3
javascript 1
lua 1
ruby 1
rust 2
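# Note: MapReduce will not overwrite an existing output directory, so remove `output` before re-running the job
$ rm -rf output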
That completes the local Hadoop installation, and you can now run some simple tests of your own. For reference, see the official standalone demo: http://hadoop.apache.org/docs/r2.7.5/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation
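As a quick follow-up, the Standalone_Operation page linked above demonstrates a `grep` job over Hadoop's own configuration files. A minimal sketch of that example (the `input2`/`output2` directory names are my own choice, to avoid colliding with the word-count directories used earlier):
$ mkdir input2 && cp etc/hadoop/*.xml input2
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input2 output2 'dfs[a-z.]+'
$ cat output2/*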