I've recently started digging into big data, beginning with the most fundamental component, Hadoop; from there I plan to work through the other pieces of the Hadoop ecosystem one by one.
Hadoop can be installed in four modes: local (standalone), pseudo-distributed, fully distributed, and highly available distributed. Since this is for personal study (the truth is I don't own that many machines, and running enough virtual machines would probably exhaust my RAM, T_T), this post covers only the local and pseudo-distributed installations.
Environment Preparation
# OS information
$ cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
# Kernel information
$ uname -r
3.10.0-693.11.6.el7.x86_64
# Hostname (set it, then check the status)
$ hostnamectl set-hostname v108.zlikun.com
$ hostnamectl status
   Static hostname: v108.zlikun.com
         Icon name: computer-vm
           Chassis: vm
        Machine ID: da1dac0e4969496a8906d711f95f2a7f
           Boot ID: 8ffc47fb1b7148ab992d8bf6f3f32ac1
    Virtualization: vmware
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-693.11.6.el7.x86_64
      Architecture: x86-64
# Map the hostname in `/etc/hosts` (don't read anything into the name v108; it's simply the 8th VM on my machine, ^_^)
192.168.1.108 v108.zlikun.com v108
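# Optional sanity check (my own addition, not a required step): the name should now resolve to the address configured above
$ getent hosts v108
192.168.1.108   v108.zlikun.com v108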
Installing Java
# Extract the `jdk-8u151-linux-x64.tar.gz` archive and move the result to `/usr/local`
/usr/local/jdk1.8.0_151
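# A minimal sketch of those two steps (assuming the archive sits in the current directory):
$ tar zxvf jdk-8u151-linux-x64.tar.gz
$ mv jdk1.8.0_151 /usr/local/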
# Configure the environment variables in `/etc/profile`
export JAVA_HOME=/usr/local/jdk1.8.0_151
export PATH=$PATH:$JAVA_HOME/bin
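# Reload the profile so the new variables take effect in the current shell
$ source /etc/profile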
# Check the JDK version
$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
Installing Hadoop Locally
# Hadoop 2.7.5 is used here; documentation: http://hadoop.apache.org/docs/r2.7.5/ . The installation guide referenced below is:
# http://hadoop.apache.org/docs/r2.7.5/hadoop-project-dist/hadoop-common/SingleCluster.html
# Installation: for a local install, simply extract the `hadoop-2.7.5.tar.gz` archive to the target directory
$ tar zxvf hadoop-2.7.5.tar.gz
$ mv hadoop-2.7.5 /opt/hadoop
# Remove all *.cmd files (they are Windows-only; deleting them is optional, a matter of personal habit).
# `find` is used here because a single-level glob like /opt/hadoop/*/*.cmd would miss etc/hadoop/*.cmd
$ find /opt/hadoop -name '*.cmd' -delete
# Configure the HADOOP_HOME environment variable
$ echo 'export HADOOP_HOME=/opt/hadoop' >> /etc/profile
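# Reload the profile and verify (optional sanity check)
$ source /etc/profile
$ echo $HADOOP_HOME
/opt/hadoop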
# The Hadoop directory tree looks like this
/opt/hadoop/
├── bin
│   ├── container-executor
│   ├── hadoop
│   ├── hdfs
│   ├── mapred
│   ├── rcc
│   ├── test-container-executor
│   └── yarn
├── etc
│   └── hadoop
├── include
│   ├── hdfs.h
│   ├── Pipes.hh
│   ├── SerialUtils.hh
│   ├── StringUtils.hh
│   └── TemplateFactory.hh
├── lib
│   └── native
├── libexec
│   ├── hadoop-config.sh
│   ├── hdfs-config.sh
│   ├── httpfs-config.sh
│   ├── kms-config.sh
│   ├── mapred-config.sh
│   └── yarn-config.sh
├── LICENSE.txt
├── NOTICE.txt
├── README.txt
├── sbin
│   ├── distribute-exclude.sh
│   ├── hadoop-daemon.sh
│   ├── hadoop-daemons.sh
│   ├── hdfs-config.sh
│   ├── httpfs.sh
│   ├── kms.sh
│   ├── mr-jobhistory-daemon.sh
│   ├── refresh-namenodes.sh
│   ├── slaves.sh
│   ├── start-all.sh
│   ├── start-balancer.sh
│   ├── start-dfs.sh
│   ├── start-secure-dns.sh
│   ├── start-yarn.sh
│   ├── stop-all.sh
│   ├── stop-balancer.sh
│   ├── stop-dfs.sh
│   ├── stop-secure-dns.sh
│   ├── stop-yarn.sh
│   ├── yarn-daemon.sh
│   └── yarn-daemons.sh
└── share
    ├── doc
    └── hadoop
# Run the `bin/hadoop` command with no arguments to see its usage
$ cd /opt/hadoop
$ bin/hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
  credential           interact with credential providers
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
# The first thing to do after installing Hadoop is usually to set its JAVA_HOME
$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_151
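# Verify the setup; with JAVA_HOME configured, this should print the version banner, starting with "Hadoop 2.7.5"
$ bin/hadoop version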
Word Count Example
# Prepare a local file; we will count its word frequencies next
$ mkdir input
$ echo 'java golang ruby rust erlang java javascript lua rust java' > input/lang.txt
# Run the bundled example `MapReduce` program to perform the word count
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount input output
18/01/30 08:42:34 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/01/30 08:42:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/01/30 08:42:34 INFO input.FileInputFormat: Total input paths to process : 1
18/01/30 08:42:34 INFO mapreduce.JobSubmitter: number of splits:1
18/01/30 08:42:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local935371141_0001
18/01/30 08:42:34 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/01/30 08:42:34 INFO mapreduce.Job: Running job: job_local935371141_0001
18/01/30 08:42:34 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/01/30 08:42:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/01/30 08:42:34 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
18/01/30 08:42:34 INFO mapred.LocalJobRunner: Waiting for map tasks
18/01/30 08:42:34 INFO mapred.LocalJobRunner: Starting task: attempt_local935371141_0001_m_000000_0
18/01/30 08:42:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/01/30 08:42:34 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
18/01/30 08:42:34 INFO mapred.MapTask: Processing split: file:/opt/hadoop/input/lang.txt:0+59
18/01/30 08:42:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/01/30 08:42:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/01/30 08:42:34 INFO mapred.MapTask: soft limit at 83886080
18/01/30 08:42:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/01/30 08:42:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/01/30 08:42:34 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
18/01/30 08:42:34 INFO mapred.LocalJobRunner:
18/01/30 08:42:34 INFO mapred.MapTask: Starting flush of map output
18/01/30 08:42:34 INFO mapred.MapTask: Spilling map output
18/01/30 08:42:34 INFO mapred.MapTask: bufstart = 0; bufend = 99; bufvoid = 104857600
18/01/30 08:42:34 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214360(104857440); length = 37/6553600
18/01/30 08:42:34 INFO mapred.MapTask: Finished spill 0
18/01/30 08:42:34 INFO mapred.Task: Task:attempt_local935371141_0001_m_000000_0 is done. And is in the process of committing
18/01/30 08:42:35 INFO mapred.LocalJobRunner: map
18/01/30 08:42:35 INFO mapred.Task: Task 'attempt_local935371141_0001_m_000000_0' done.
18/01/30 08:42:35 INFO mapred.Task: Final Counters for attempt_local935371141_0001_m_000000_0: Counters: 18
    File System Counters
        FILE: Number of bytes read=296042
        FILE: Number of bytes written=585271
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=1
        Map output records=10
        Map output bytes=99
        Map output materialized bytes=92
        Input split bytes=96
        Combine input records=10
        Combine output records=7
        Spilled Records=7
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=16
        Total committed heap usage (bytes)=165744640
    File Input Format Counters
        Bytes Read=59
18/01/30 08:42:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local935371141_0001_m_000000_0
18/01/30 08:42:35 INFO mapred.LocalJobRunner: map task executor complete.
18/01/30 08:42:35 INFO mapred.LocalJobRunner: Waiting for reduce tasks
18/01/30 08:42:35 INFO mapred.LocalJobRunner: Starting task: attempt_local935371141_0001_r_000000_0
18/01/30 08:42:35 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/01/30 08:42:35 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
18/01/30 08:42:35 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@475ebd65
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10
18/01/30 08:42:35 INFO reduce.EventFetcher: attempt_local935371141_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
18/01/30 08:42:35 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local935371141_0001_m_000000_0 decomp: 88 len: 92 to MEMORY
18/01/30 08:42:35 INFO reduce.InMemoryMapOutput: Read 88 bytes from map-output for attempt_local935371141_0001_m_000000_0
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 88, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->88
18/01/30 08:42:35 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
18/01/30 08:42:35 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
18/01/30 08:42:35 WARN io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
    at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:206)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/01/30 08:42:35 INFO mapred.Merger: Merging 1 sorted segments
18/01/30 08:42:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 79 bytes
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: Merged 1 segments, 88 bytes to disk to satisfy reduce memory limit
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: Merging 1 files, 92 bytes from disk
18/01/30 08:42:35 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
18/01/30 08:42:35 INFO mapred.Merger: Merging 1 sorted segments
18/01/30 08:42:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 79 bytes
18/01/30 08:42:35 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/01/30 08:42:35 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
18/01/30 08:42:35 INFO mapred.Task: Task:attempt_local935371141_0001_r_000000_0 is done. And is in the process of committing
18/01/30 08:42:35 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/01/30 08:42:35 INFO mapred.Task: Task attempt_local935371141_0001_r_000000_0 is allowed to commit now
18/01/30 08:42:35 INFO output.FileOutputCommitter: Saved output of task 'attempt_local935371141_0001_r_000000_0' to file:/opt/hadoop/output/_temporary/0/task_local935371141_0001_r_000000
18/01/30 08:42:35 INFO mapred.LocalJobRunner: reduce > reduce
18/01/30 08:42:35 INFO mapred.Task: Task 'attempt_local935371141_0001_r_000000_0' done.
18/01/30 08:42:35 INFO mapred.Task: Final Counters for attempt_local935371141_0001_r_000000_0: Counters: 24
    File System Counters
        FILE: Number of bytes read=296258
        FILE: Number of bytes written=585433
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Combine input records=0
        Combine output records=0
        Reduce input groups=7
        Reduce shuffle bytes=92
        Reduce input records=7
        Reduce output records=7
        Spilled Records=7
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=2
        Total committed heap usage (bytes)=165744640
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Output Format Counters
        Bytes Written=70
18/01/30 08:42:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local935371141_0001_r_000000_0
18/01/30 08:42:35 INFO mapred.LocalJobRunner: reduce task executor complete.
18/01/30 08:42:35 INFO mapreduce.Job: Job job_local935371141_0001 running in uber mode : false
18/01/30 08:42:35 INFO mapreduce.Job: map 100% reduce 100%
18/01/30 08:42:35 INFO mapreduce.Job: Job job_local935371141_0001 completed successfully
18/01/30 08:42:35 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=592300
        FILE: Number of bytes written=1170704
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=1
        Map output records=10
        Map output bytes=99
        Map output materialized bytes=92
        Input split bytes=96
        Combine input records=10
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=92
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=18
        Total committed heap usage (bytes)=331489280
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=59
    File Output Format Counters
        Bytes Written=70
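# Note: the `EBADF: Bad file descriptor` readahead warning in the middle of the log above
# appears to be a harmless quirk of running under the local job runner and can be ignored.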
# View the results of the word count
$ cat output/*
erlang 1
golang 1
java 3
javascript 1
lua 1
ruby 1
rust 2
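# Note: MapReduce will not overwrite an existing output directory, so remove `output` before re-running the job
$ rm -rf output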
That completes the local Hadoop installation, and you can now run some simple tests of your own. For reference, see the official standalone demo: http://hadoop.apache.org/docs/r2.7.5/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation
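As a quick follow-up, the Standalone_Operation page linked above demonstrates a `grep` job over Hadoop's own configuration files. A minimal sketch of that example (the `input2`/`output2` directory names are my own choice, to avoid colliding with the word-count directories used earlier):
$ mkdir input2 && cp etc/hadoop/*.xml input2
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input2 output2 'dfs[a-z.]+'
$ cat output2/*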