A Close Reading of the Spark 1.6.0 Source Code - HelloWorld开发者社区

Spark 1.6.0

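Most programs start the same way: build a configuration, then build a context from it (Config->Context).

Spark is no exception, so we begin at the entry point.

SparkConf: the configuration of a Spark application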

/**
  * SparkConf.scala 
  *
  *
  * Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
  *
  * Most of the time, you would create a SparkConf object with `new SparkConf()`, which will load
  * values from any `spark.*` Java system properties set in your application as well. In this case,
  * parameters you set directly on the `SparkConf` object take priority over system properties.
  *
  * For unit tests, you can also call `new SparkConf(false)` to skip loading external settings and
  * get the same configuration no matter what the system properties are.
  *
  * All setter methods in this class support chaining. For example, you can write
  * `new SparkConf().setMaster("local").setAppName("My app")`.
  *
  * Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
  * by the user. Spark does not support modifying the configuration at runtime.
  *
  * @param loadDefaults whether to also load values from Java system properties
  */
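
To make the comment above concrete, here is a minimal sketch of setter chaining, the priority of explicit settings over system properties, and `new SparkConf(false)`. It assumes spark-core 1.6 is on the classpath; the object name `SparkConfSketch`, the application name, and the chosen property values are illustrative only.

```scala
import org.apache.spark.SparkConf

object SparkConfSketch {
  // Builds a conf the usual way: system properties are loaded first,
  // then values set directly on the object take priority over them.
  def explicitConf(): SparkConf = {
    System.setProperty("spark.executor.memory", "1g")
    new SparkConf()                        // loadDefaults = true
      .setMaster("local[2]")               // every setter returns the conf, so calls chain
      .setAppName("conf-sketch")
      .set("spark.executor.memory", "2g")  // overrides the system property
  }

  // Builds a conf that ignores system properties, as suggested for unit tests.
  def isolatedConf(): SparkConf = new SparkConf(false)

  def main(args: Array[String]): Unit = {
    println(explicitConf().get("spark.executor.memory"))
    println(isolatedConf().contains("spark.executor.memory"))
  }
}
```

With the system property set to `1g`, the first conf should still report `2g`, and the isolated conf should not contain the key at all.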

SparkContext:

/**
 * SparkContext.scala
 *
 * 
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM.  You must `stop()` the active SparkContext before
 * creating a new one.  This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */

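From the comment above, SparkContext:

Is the main entry point for Spark functionality.
Represents the connection to a Spark cluster.
Creates RDDs, accumulators, and broadcast variables on that cluster.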
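
The three kinds of objects named in the comment can be sketched as follows. This assumes spark-core 1.6 on the classpath and a local master; `SparkContextSketch` and the sample data are made up for illustration, and `sc.accumulator` is the 1.6-era API (later versions replace it).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  // Returns (sum of looked-up weights, number of records the tasks processed).
  def run(): (Int, Int) = {
    val conf = new SparkConf(false).setMaster("local[2]").setAppName("context-sketch")
    val sc = new SparkContext(conf)
    try {
      val weights = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only value shared with tasks
      val seen = sc.accumulator(0)                        // counter that tasks add to
      val keys = sc.parallelize(Seq("a", "b", "a"))       // RDD built from a local collection
      val total = keys.map { k => seen += 1; weights.value(k) }.reduce(_ + _)
      (total, seen.value)
    } finally {
      sc.stop() // only one SparkContext may be active per JVM, so always stop it
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```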

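In addition:

A Spark application is registered through objects created while SparkContext is being instantiated; specifically, the SchedulerBackend registers it with the cluster. In standalone mode this is SparkDeploySchedulerBackend.
At run time a Spark application obtains its compute resources through the Cluster Manager, and those requests are also made by objects that SparkContext creates.
Scheduling optimization is also built on SparkContext: an RDD is not executed as soon as it is created; the TaskScheduler and DAGScheduler inside SparkContext schedule and optimize the work.
When SparkContext crashes or stops, the whole Spark application ends with it.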
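
The point that an RDD does not run when it is created can be checked with a small experiment. This is a sketch assuming spark-core 1.6 and a local master; `LazyRddSketch` is a hypothetical name, and the accumulator is used only as a probe for how often the map function ran.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  // Returns the probe value before and after the first action.
  def run(): (Int, Int) = {
    val conf = new SparkConf(false).setMaster("local[2]").setAppName("lazy-sketch")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.accumulator(0)
      // A transformation: only the lineage is recorded, nothing is scheduled yet.
      val doubled = sc.parallelize(1 to 4).map { x => calls += 1; x * 2 }
      val before = calls.value
      // An action: the job is now handed to the schedulers and actually runs.
      doubled.count()
      val after = calls.value
      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```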

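In summary, SparkContext is mainly responsible for creating RDDs, accumulators, and broadcast variables, registering the application, acquiring compute resources, and scheduling optimization.

The "Scheduler" in SchedulerBackend should be read as TaskScheduler, not DAGScheduler. In other words, SchedulerBackend is the backend of TaskScheduler.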

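From the scheduling point of view there are two schedulers, DAGScheduler and TaskScheduler, and both exist to keep the work progressing to completion.

Take building a house as an example. It can be divided into stages such as laying the foundation, building the walls, putting on the roof, and painting. Within a single stage, say building the walls, the work can be divided further into the east, south, west, and north walls.

DAGScheduler is the high-level scheduler, mainly responsible for scheduling and failure retry at the Stage level. In the analogy these are the foundation, wall, roof, and painting stages, which depend on one another [shuffle]. If one of them fails, DAGScheduler is also responsible for triggering the retry.

TaskScheduler is the low-level scheduler, responsible for scheduling and failure retry at the Task level. In the analogy, building the south wall is one task within the wall stage; if the south wall fails, TaskScheduler arranges for it to be rebuilt.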
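
The house analogy can be turned into a toy model with no Spark dependency. Everything here (`HouseSchedulerSketch`, `Stage`, `Task`, `stageOrder`, `runTasks`, `build`) is a hypothetical stand-in and not Spark's actual scheduler code; it only illustrates the split between ordering stages by dependency and retrying individual tasks, and it assumes the stage graph has no cycles.

```scala
object HouseSchedulerSketch {
  // A task is the smallest unit of work; run() returns true on success.
  final case class Task(name: String, run: () => Boolean)
  // A stage groups tasks and names the stages that must finish first.
  final case class Stage(name: String, parents: Seq[String], tasks: Seq[Task])

  // Task-level concern: retry each failed task up to maxAttempts times.
  def runTasks(stage: Stage, maxAttempts: Int): Boolean =
    stage.tasks.forall(task => (1 to maxAttempts).exists(_ => task.run()))

  // Stage-level concern: order stages so that parents come before children.
  def stageOrder(stages: Seq[Stage]): List[Stage] = {
    val byName = stages.map(s => s.name -> s).toMap
    val visited = scala.collection.mutable.LinkedHashSet.empty[String]
    def visit(s: Stage): Unit =
      if (!visited(s.name)) {
        s.parents.foreach(p => visit(byName(p)))
        visited += s.name
      }
    stages.foreach(visit)
    visited.toList.map(byName)
  }

  // Runs stages in dependency order and returns the names of those that completed.
  // Stops at the first stage whose tasks still fail after all attempts.
  def build(stages: Seq[Stage], maxAttempts: Int = 3): List[String] =
    stageOrder(stages).takeWhile(runTasks(_, maxAttempts)).map(_.name)

  def main(args: Array[String]): Unit = {
    var southAttempts = 0
    val ok = () => true
    val flakySouth = () => { southAttempts += 1; southAttempts > 1 } // fails once
    val stages = Seq(
      Stage("roof", Seq("walls"), Seq(Task("roof", ok))),
      Stage("walls", Seq("foundation"), Seq(Task("east", ok), Task("south", flakySouth))),
      Stage("foundation", Nil, Seq(Task("foundation", ok)))
    )
    println(build(stages))
  }
}
```

Even though the stages are listed out of order and the south wall fails on its first attempt, the sketch finishes foundation, walls, and roof in that order, because only the failed task is retried.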

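The three major objects that SparkContext constructs:

DAGScheduler: the high-level, Stage-oriented scheduler for Jobs. It is a class.

TaskScheduler: an interface (a Scala trait). Currently TaskSchedulerImpl is its only implementation.

SchedulerBackend: also an interface. Its implementation depends on the Cluster Manager in use; in standalone mode it is SparkDeploySchedulerBackend.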
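
The idea that one interface has a different implementation per Cluster Manager can be sketched without Spark. The types below (`BackendSelectionSketch`, `LocalBackend`, `StandaloneBackend`, `backendFor`) are simplified stand-ins, and selecting the backend from the master URL string is an assumption of this sketch rather than a copy of Spark's code.

```scala
object BackendSelectionSketch {
  // Stand-in for the SchedulerBackend interface.
  trait SchedulerBackend { def describe: String }

  case object LocalBackend extends SchedulerBackend {
    val describe = "runs tasks inside the driver JVM"
  }

  // Stand-in for the standalone-mode backend; it keeps the Master URLs to register with.
  final case class StandaloneBackend(masterUrls: List[String]) extends SchedulerBackend {
    val describe = "registers the application with a standalone Master"
  }

  // Picks a backend from the master URL (hypothetical, heavily simplified).
  def backendFor(master: String): SchedulerBackend = master match {
    case "local" => LocalBackend
    case m if m.startsWith("spark://") =>
      // A comma-separated list of hosts after the prefix becomes one URL per Master.
      StandaloneBackend(m.stripPrefix("spark://").split(",").toList.map("spark://" + _))
    case other =>
      throw new IllegalArgumentException(s"Master URL not handled in this sketch: $other")
  }

  def main(args: Array[String]): Unit = {
    println(backendFor("local").describe)
    println(backendFor("spark://host1:7077,host2:7077"))
  }
}
```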

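From the viewpoint of the whole running application there are four core objects: DAGScheduler, TaskScheduler, SchedulerBackend, and MapOutputTrackerMaster.

The first three are covered above.

MapOutputTrackerMaster manages the output and reading of shuffle data: it tracks where map outputs were written so that downstream tasks can locate and read them.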

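SparkDeploySchedulerBackend has three core functions:

Connecting to the Master and registering the current application. In standalone mode the registration is sent to the Master by the ClientEndpoint inside the AppClient that SparkDeploySchedulerBackend creates in its start method.

Accepting the registration of, and managing, the Executors that the cluster allocates to the application. The Executors register mainly with the Driver.

Sending Tasks to the Executors.

One further note: SchedulerBackend is managed by TaskSchedulerImpl.

When the application is registered with the Master through SparkDeploySchedulerBackend, the Master sends a Command to the Workers instructing them to launch Executors. The process that a Worker starts to host an Executor is named CoarseGrainedExecutorBackend, which is an entry class with a main method. That process first registers with the Driver, and only after the registration succeeds is the actual Executor created.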
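
The "register with the Driver first, create the Executor afterwards" ordering can be illustrated with a dependency-free toy. The classes `DriverBackend` and `ExecutorBackend` and the three messages are simplified stand-ins named for this sketch; real Spark exchanges these messages over RPC, which the direct method call below replaces.

```scala
object ExecutorRegistrationSketch {
  sealed trait Message
  final case class RegisterExecutor(executorId: String, cores: Int) extends Message
  case object RegisteredExecutor extends Message
  final case class RegisterExecutorFailed(reason: String) extends Message

  // Stand-in for the driver side: remembers which executors have registered.
  final class DriverBackend {
    private val executors = scala.collection.mutable.LinkedHashMap.empty[String, Int]

    def receive(msg: RegisterExecutor): Message =
      if (executors.contains(msg.executorId)) {
        RegisterExecutorFailed(s"Duplicate executor ID: ${msg.executorId}")
      } else {
        executors(msg.executorId) = msg.cores
        RegisteredExecutor
      }

    def registered: List[String] = executors.keys.toList
  }

  // Stand-in for the process a Worker launches to host an executor.
  final class ExecutorBackend(val executorId: String, cores: Int) {
    // Stays empty until the driver acknowledges the registration.
    var executor: Option[String] = None

    def start(driver: DriverBackend): Unit =
      driver.receive(RegisterExecutor(executorId, cores)) match {
        case RegisteredExecutor => executor = Some(s"Executor($executorId)")
        case _                  => executor = None // registration rejected: nothing is created
      }
  }

  def main(args: Array[String]): Unit = {
    val driver = new DriverBackend
    val backend = new ExecutorBackend("0", cores = 2)
    println(backend.executor) // no executor exists before registration
    backend.start(driver)
    println(backend.executor)
  }
}
```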
