Kubernetes Custom Schedulers: A First Look at the Scheduling Framework

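Kubernetes has become the de facto standard platform for container orchestration. It gives containerized applications simple and efficient deployment, large-scale elasticity, resource scheduling, and other lifecycle management features. kube-scheduler is one of the core Kubernetes components. It is responsible for scheduling resources across the whole cluster: following specific scheduling algorithms and policies, it places each Pod on the most suitable Node so that cluster resources are used sensibly and fully.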

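In most cases the default algorithms and policies in kube-scheduler cover the vast majority of needs. Real projects raise more concerns, though, and we know our own applications better than Kubernetes does. For example, the default scheduler sees GPUs only as countable extended resources and knows nothing about how a particular workload should use them. Cases like this call for extending the scheduler.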

Custom schedulers

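Kubernetes has developed rapidly over the past few years, with great progress in stability, extensibility, and scale, and its core components have matured. The community only began planning the overall extensibility of kube-scheduler in the last two years, however. Early on there were two main ways to extend the scheduler: the Scheduler Extender and multiple schedulers.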

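Both the Scheduler Extender and multiple schedulers have significant performance and flexibility problems. To get past them, the community added a new mechanism, the Kubernetes Scheduling Framework, as an alpha feature in Kubernetes 1.15. Its pluggable architecture makes kube-scheduler easier to extend and its code simpler.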

Before getting into the framework, let's review the general flow of Kubernetes scheduling:

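kube-controller-manager (or a user) creates Pods through kube-apiserver; a Pod whose spec.nodeName is empty is pending, and kube-scheduler adds it to its scheduling queue
kube-scheduler, which watches kube-apiserver, pops one Pod from the scheduling queue and starts a standard scheduling cycle
In the filtering phase it uses the Pod's attributes (for example CPU/memory requests and nodeSelector/nodeAffinity) to compute the list of candidate nodes that meet the requirements
In the scoring phase it gives each candidate node a score and picks the node with the highest score
Finally kube-scheduler sends a bind call to kube-apiserver, which sets the Pod's spec.nodeName to the node the Pod was scheduled to
kubelet, watching kube-apiserver, sees the Pod that was bound to its node, pulls the image, and starts the containers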
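
The filter-then-score core of this flow can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions: the Node and Pod types, the filter, score, and schedule functions, and the "more free CPU is better" scoring rule are all hypothetical stand-ins, not kube-scheduler code.

```go
package main

import "fmt"

// Node and Pod are simplified, hypothetical stand-ins for the real API objects.
type Node struct {
	Name    string
	FreeCPU int // millicores still available
	Labels  map[string]string
}

type Pod struct {
	Name         string
	CPURequest   int
	NodeSelector map[string]string
}

// filter keeps only the nodes that satisfy the Pod's hard requirements.
func filter(pod Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.FreeCPU < pod.CPURequest {
			continue
		}
		ok := true
		for k, v := range pod.NodeSelector {
			if n.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score ranks a feasible node; here, more CPU left after placement is better.
func score(pod Pod, n Node) int {
	return n.FreeCPU - pod.CPURequest
}

// schedule returns the chosen node name, or false if no node is feasible.
func schedule(pod Pod, nodes []Node) (string, bool) {
	feasible := filter(pod, nodes)
	if len(feasible) == 0 {
		return "", false
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if score(pod, n) > score(pod, best) {
			best = n
		}
	}
	return best.Name, true
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 500, Labels: map[string]string{"disk": "ssd"}},
		{Name: "node-b", FreeCPU: 2000, Labels: map[string]string{"disk": "ssd"}},
		{Name: "node-c", FreeCPU: 4000, Labels: map[string]string{"disk": "hdd"}},
	}
	pod := Pod{Name: "web", CPURequest: 1000, NodeSelector: map[string]string{"disk": "ssd"}}
	name, ok := schedule(pod, nodes)
	fmt.Println(name, ok) // node-b true
}
```

Here node-a fails the CPU check and node-c fails the selector, so node-b is the only candidate left to score.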

The Scheduling Framework

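The Scheduling Framework defines a rich set of extension point interfaces on top of the existing scheduling flow. Users implement those interfaces to express their own scheduling logic and register the implementations (plugins) at the extension points. When the framework reaches an extension point while running the scheduling workflow, it calls the plugins registered there.

Plugins at some extension points can change the scheduler's decision, while plugins at others are purely informational. Each attempt to schedule one Pod is split into two phases: the scheduling cycle and the binding cycle.

The scheduling cycle selects a node for the Pod, and the binding cycle applies that decision to the cluster. Together, the two are called the "scheduling context". If the Pod turns out to be unschedulable or an internal error occurs during either cycle, that cycle is aborted and the Pod returns to the queue to be retried.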

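The figure below shows the scheduling context in the scheduling framework and the extension points in it. A plugin can register at several extension points in order to carry out more complex, stateful tasks.

[Figure: the scheduling context and its extension points]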

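Scheduling cycles run serially: only one runs at any moment, so code executed there does not have to worry about other scheduling cycles running alongside it. Binding cycles run asynchronously, and several may be in progress at the same time, so code executed there must be safe for concurrent use.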
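
A rough way to picture the difference: one loop makes placement decisions one Pod at a time, and each decision is handed to its own goroutine for binding. The sketch below is an analogy only; runScheduler and pick are hypothetical names, and the mutex stands in for whatever synchronization a real binding-cycle plugin would need.

```go
package main

import (
	"fmt"
	"sync"
)

// runScheduler handles pods one at a time (scheduling cycle) and hands each
// decision to its own goroutine (binding cycle).
func runScheduler(pods []string, pick func(pod string) string) map[string]string {
	var (
		mu    sync.Mutex // guards bound, which the binding goroutines share
		wg    sync.WaitGroup
		bound = make(map[string]string)
	)
	for _, pod := range pods {
		node := pick(pod) // scheduling cycle: serial, one pod at a time
		wg.Add(1)
		go func(pod, node string) { // binding cycle: may overlap with others
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			bound[pod] = node
		}(pod, node)
	}
	wg.Wait()
	return bound
}

func main() {
	next := 0 // touched only from the serial loop, so no lock is needed
	pick := func(string) string {
		next++
		return fmt.Sprintf("node-%d", next)
	}
	fmt.Println(runScheduler([]string{"a", "b", "c"}, pick))
}
```

The counter used by pick needs no lock because only the serial loop touches it, while the shared map written from the binding goroutines does.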

Scheduling cycle

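QueueSort sorts the Pods in the scheduling queue to decide which Pod is scheduled first. In essence it provides a Less(Pod1, Pod2) function that compares two Pods and reports which should be scheduled earlier. Only one QueueSort plugin can be enabled at a time.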
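
As an illustration of what such a Less function can look like, the sketch below orders Pods by priority and breaks ties by arrival time. The queuedPod type and its fields are hypothetical; a real plugin would receive the framework's own queued-Pod type instead.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a hypothetical, simplified view of a Pod waiting in the queue.
type queuedPod struct {
	Name     string
	Priority int32
	Added    time.Time
}

// less reports whether a should be scheduled before b:
// higher priority first, then earlier arrival.
func less(a, b queuedPod) bool {
	if a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.Added.Before(b.Added)
}

func main() {
	t0 := time.Date(2020, 10, 11, 15, 0, 0, 0, time.UTC)
	queue := []queuedPod{
		{"batch", 0, t0},
		{"web", 100, t0.Add(2 * time.Second)},
		{"db", 100, t0.Add(1 * time.Second)},
	}
	sort.SliceStable(queue, func(i, j int) bool { return less(queue[i], queue[j]) })
	for _, p := range queue {
		fmt.Println(p.Name) // db, web, batch
	}
}
```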

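PreFilter pre-processes information about the Pod, or checks preconditions that the cluster or the Pod must meet. Scheduling moves on to the next phase only when every PreFilter plugin returns success; if one returns an error, the scheduling cycle is aborted.

Filter rules out nodes that cannot run the Pod. For each node, the scheduler calls the Filter plugins in order. To improve efficiency the order can be configured so that fewer Filter calls are needed; for example, putting the NodeSelector filter first eliminates a large number of nodes early. Nodes are evaluated concurrently, so a Filter plugin is called many times within one scheduling cycle.

PostFilter, added in 1.19, is called only when no feasible node was found for the Pod. A typical use is preemption, that is, trying to make the Pod schedulable by evicting other Pods; triggering autoscaling is another.

PreScore, called PostFilter in earlier releases, does preparatory work before Score and generates shareable state for the Score plugins to use. If a PreScore plugin returns an error, the scheduling cycle is aborted.

Score ranks the nodes that passed the filtering phase. The scheduler calls each Score plugin for each node, and the result is an integer within a defined range. After the NormalizeScore phase, the scheduler combines each plugin's score for a node with that plugin's weight to produce the node's final score.

NormalizeScore adjusts node scores before the scheduler computes the final ranking. A plugin registered at this extension point is called with the Score results of the same plugin as its argument, and it is called once per plugin per scheduling cycle.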
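
To make the Score, NormalizeScore, and weight steps concrete, here is a minimal sketch. It assumes a normalized range of 0 to 100 and a simple "scale so the highest raw score becomes the maximum" rule; the scorer type and the normalize and combine functions are hypothetical, and real plugins are free to normalize differently.

```go
package main

import "fmt"

// maxNodeScore is the upper bound of the normalized range assumed in this sketch.
const maxNodeScore = 100

// normalize rescales raw scores in place so the highest becomes maxNodeScore.
func normalize(scores map[string]int64) {
	var highest int64
	for _, s := range scores {
		if s > highest {
			highest = s
		}
	}
	if highest == 0 {
		return
	}
	for node, s := range scores {
		scores[node] = s * maxNodeScore / highest
	}
}

// scorer is a hypothetical plugin result: raw per-node scores plus a weight.
type scorer struct {
	weight int64
	raw    map[string]int64
}

// combine normalizes each plugin's scores, then adds them up weighted.
func combine(plugins []scorer) map[string]int64 {
	total := make(map[string]int64)
	for _, p := range plugins {
		normalize(p.raw)
		for node, s := range p.raw {
			total[node] += s * p.weight
		}
	}
	return total
}

func main() {
	total := combine([]scorer{
		{weight: 1, raw: map[string]int64{"node-a": 2, "node-b": 4}},
		{weight: 2, raw: map[string]int64{"node-a": 30, "node-b": 10}},
	})
	fmt.Println(total["node-a"], total["node-b"]) // 250 166
}
```

The first plugin normalizes to 50 and 100, the second to 100 and 33 (integer division), so with weights 1 and 2 node-a ends at 50 + 200 and node-b at 100 + 66.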

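Reserve prevents the scheduler from over-committing a node. Binding is asynchronous, so while one Pod is still waiting to be bound the scheduler may place new Pods on the same node, and actual usage could exceed what is available. A plugin at this extension point implements two methods, Reserve and Unreserve. Each plugin's Reserve call may succeed or fail; if one fails, later plugins are not run, the Reserve phase is considered failed, and Unreserve is triggered.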
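
The Reserve/Unreserve pairing resembles a small transaction with rollback. The sketch below is one way to picture it, assuming each plugin tracks free capacity for a single resource; the reserver type, its fields, and reserveAll are hypothetical. Unreserve is written to be harmless when nothing was reserved, because in this sketch it is called on every plugin after a failure.

```go
package main

import "fmt"

// reserver is a hypothetical plugin that tracks free capacity for one resource.
type reserver struct {
	name string
	free int
	held map[string]int // pod -> amount currently reserved
}

func newReserver(name string, free int) *reserver {
	return &reserver{name: name, free: free, held: make(map[string]int)}
}

// Reserve sets capacity aside for the pod, or fails if too little is left.
func (r *reserver) Reserve(pod string, n int) error {
	if n > r.free {
		return fmt.Errorf("%s: not enough capacity", r.name)
	}
	r.free -= n
	r.held[pod] = n
	return nil
}

// Unreserve gives back whatever is held for the pod; it is a no-op otherwise.
func (r *reserver) Unreserve(pod string) {
	r.free += r.held[pod]
	delete(r.held, pod)
}

// reserveAll runs Reserve in order and rolls everything back on the first failure.
func reserveAll(plugins []*reserver, pod string, n int) error {
	for _, p := range plugins {
		if err := p.Reserve(pod, n); err != nil {
			for i := len(plugins) - 1; i >= 0; i-- {
				plugins[i].Unreserve(pod)
			}
			return err
		}
	}
	return nil
}

func main() {
	cpu, gpu := newReserver("cpu", 4), newReserver("gpu", 1)
	err := reserveAll([]*reserver{cpu, gpu}, "pod-1", 2)
	fmt.Println(err, cpu.free) // gpu: not enough capacity 4
}
```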

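Permit prevents or delays the binding of a Pod to a node. A Permit plugin can do one of three things:

approve: once all Permit plugins approve the Pod, it is sent on for binding.

deny: if any Permit plugin denies the Pod, it is returned to the scheduling queue, which triggers Unreserve in the Reserve plugins.

wait: if a Permit plugin returns wait, the Pod is kept in the Permit phase. If the wait times out, wait becomes deny: the Pod is put back into the scheduling queue, which triggers Unreserve in the Reserve plugins.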
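
The three outcomes can be modeled with a small decision function. This is a simplified sketch, not the framework's implementation: the decision and permitResult types and runPermit are hypothetical, a single allow channel stands in for whatever later releases the waiting Pod, and the shortest timeout among waiting plugins is assumed to decide when wait turns into deny.

```go
package main

import (
	"fmt"
	"time"
)

type decision int

const (
	approve decision = iota
	deny
	wait
)

// permitResult is what a hypothetical permit plugin returns.
type permitResult struct {
	d       decision
	timeout time.Duration // only meaningful for wait
}

// runPermit reports whether the pod may proceed to binding.
// allow is signaled by some other component to release a waiting pod.
func runPermit(results []permitResult, allow <-chan struct{}) bool {
	waiting := false
	var shortest time.Duration
	for _, r := range results {
		switch r.d {
		case deny:
			return false // any deny rejects the pod immediately
		case wait:
			if !waiting || r.timeout < shortest {
				shortest = r.timeout
			}
			waiting = true
		}
	}
	if !waiting {
		return true // every plugin approved
	}
	select {
	case <-allow:
		return true
	case <-time.After(shortest):
		return false // timed out: wait becomes deny
	}
}

func main() {
	allow := make(chan struct{}, 1)
	fmt.Println(runPermit([]permitResult{{d: approve}}, allow))                               // true
	fmt.Println(runPermit([]permitResult{{d: approve}, {d: deny}}, allow))                    // false
	fmt.Println(runPermit([]permitResult{{d: wait, timeout: 10 * time.Millisecond}}, allow)) // false
	allow <- struct{}{}
	fmt.Println(runPermit([]permitResult{{d: wait, timeout: time.Second}}, allow)) // true
}
```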

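Binding cycle

PreBind runs logic before the Pod is bound, for example mounting a data volume on the node before the Pod is allowed to run there. If a plugin returns an error, the Pod is rejected and returned to the scheduling queue.

Bind binds the Pod to the node. Bind plugins are not called until all PreBind plugins have completed. The framework calls the Bind plugins one by one in the order they were registered. Each one can choose whether or not to handle the Pod; once one of them handles the binding, the remaining Bind plugins are skipped.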
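
The "first plugin that handles the Pod wins" rule is a chain-of-responsibility pattern. A minimal sketch, assuming a sentinel error marks "not handled"; the binder type, errSkip, and runBinders are hypothetical names, not the framework's API:

```go
package main

import (
	"errors"
	"fmt"
)

// errSkip signals that a binder chose not to handle the pod.
var errSkip = errors.New("skip")

type binder struct {
	name string
	bind func(pod, node string) error
}

// runBinders tries binders in registration order and stops at the first one
// that handles the pod, that is, returns anything other than errSkip.
func runBinders(binders []binder, pod, node string) (string, error) {
	for _, b := range binders {
		err := b.bind(pod, node)
		if errors.Is(err, errSkip) {
			continue
		}
		return b.name, err
	}
	return "", errSkip
}

func main() {
	binders := []binder{
		{name: "special", bind: func(pod, node string) error {
			if pod != "special-pod" {
				return errSkip // not ours; let the next binder try
			}
			return nil
		}},
		{name: "default", bind: func(pod, node string) error { return nil }},
	}
	who, err := runBinders(binders, "web", "node-a")
	fmt.Println(who, err) // default <nil>
}
```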

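PostBind is an informational extension point. PostBind plugins are called after a Pod has been bound successfully. This ends the binding cycle, and the hook can be used to clean up associated resources.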

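Implementing a custom scheduling plugin

To manage scheduling plugins better, the sig-scheduling group created the scheduler-plugins project, and users can build their own plugins directly on top of it: https://github.com/kubernetes-sigs/scheduler-plugins. Next, using the community's QoS example plugin as a template, we develop a simple plugin that implements Filter and PreScore.

Define the plugin object and its constructor: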

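package plugins

import (
  "context"
  "log"

  v1 "k8s.io/api/core/v1"
  "k8s.io/apimachinery/pkg/runtime"
  // Framework package path as of Kubernetes v1.19; it differs in other releases.
  framework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"
)

// Name is the name of the plugin used in the plugin registry and configurations.
const Name = "sample"

// sample is a plugin that filters nodes by label and logs the PreScore call.
type sample struct{}

var _ framework.FilterPlugin = &sample{}
var _ framework.PreScorePlugin = &sample{}

// New initializes a new plugin and returns it.
func New(_ runtime.Object, _ framework.FrameworkHandle) (framework.Plugin, error) {
  return &sample{}, nil
}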

Implement the plugin interfaces:


// Name returns name of the plugin.
func (pl *sample) Name() string {
  return Name
}

// Filter rejects every node that does not carry the label cpu=true.
func (pl *sample) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
  node := nodeInfo.Node()
  if node == nil {
    return framework.NewStatus(framework.Error, "node not found")
  }
  log.Printf("filter pod: %v, node: %v", pod.Name, node.Name)
  log.Println(state)
  // Exclude nodes without the cpu=true label.
  if node.Labels["cpu"] != "true" {
    return framework.NewStatus(framework.Unschedulable, "Node: "+node.Name)
  }
  return framework.NewStatus(framework.Success, "Node: "+node.Name)
}

// PreScore only logs the nodes that passed filtering; it prepares no shared state.
func (pl *sample) PreScore(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodes []*v1.Node) *framework.Status {
  log.Println(nodes)
  return framework.NewStatus(framework.Success, "Pod: "+pod.Name)
}

Register the plugin:

func main() {
  rand.Seed(time.Now().UnixNano())
  // Register custom plugins to the scheduler framework.
  // Later they can consist of scheduler profile(s) and hence
  // used by various kinds of workloads.
  command := app.NewSchedulerCommand(
    app.WithPlugin(plugins.Name, plugins.New),
  )
  if err := command.Execute(); err != nil {
    os.Exit(1)
  }
}

Compile the source and package the binary into a container image:

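FROM debian:stretch-slim
WORKDIR /
COPY app /usr/local/bin/
# ENTRYPOINT rather than CMD, so that the "args" set in the Deployment are passed to the binary.
ENTRYPOINT ["app"]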

Deploy the scheduler to Kubernetes:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sample-scheduler-clusterrole
rules:
- apiGroups:
    - ""
  resources:
    - endpoints
    - events
  verbs:
    - create
    - get
    - update
- apiGroups:
    - ""
  resources:
    - nodes
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - ""
  resources:
    - pods
  verbs:
    - delete
    - get
    - list
    - watch
    - update
- apiGroups:
    - ""
  resources:
    - bindings
    - pods/binding
  verbs:
    - create
- apiGroups:
    - ""
  resources:
    - pods/status
  verbs:
    - patch
    - update
- apiGroups:
    - ""
  resources:
    - replicationcontrollers
    - services
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - apps
    - extensions
  resources:
    - replicasets
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - apps
  resources:
    - statefulsets
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - policy
  resources:
    - poddisruptionbudgets
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - ""
  resources:
    - persistentvolumeclaims
    - persistentvolumes
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - ""
  resources:
    - configmaps
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - "storage.k8s.io"
  resources:
    - storageclasses
    - csinodes
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - "coordination.k8s.io"
  resources:
    - leases
  verbs:
    - create
    - get
    - list
    - update
- apiGroups:
    - "events.k8s.io"
  resources:
    - events
  verbs:
    - create
    - patch
    - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sample-scheduler-sa
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sample-scheduler-clusterrolebinding
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sample-scheduler-clusterrole
subjects:
- kind: ServiceAccount
  name: sample-scheduler-sa
  namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    clientConnection:
      kubeconfig: /kube-scheduler.kubeconfig
    profiles:
    - schedulerName: sample-scheduler
      plugins:
        filter:
          enabled:
          - name: sample
        preScore:
          enabled:
            - name: sample
          disabled:
            - name: "*"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-scheduler
  namespace: kube-system
  labels:
    component: sample-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      component: sample-scheduler
  template:
    metadata:
      labels:
        component: sample-scheduler
    spec:
      serviceAccount: sample-scheduler-sa
      priorityClassName: system-cluster-critical
      volumes:
        - name: scheduler-config
          configMap:
            name: scheduler-config
      containers:
        - name: scheduler-ctrl
          image: sample-scheduler:stretch-slim
          imagePullPolicy: IfNotPresent
          args:
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=3
          resources:
            requests:
              cpu: "50m"
          volumeMounts:
            - name: scheduler-config
              mountPath: /etc/kubernetes

Run logs: [screenshot]

Verifying the plugin

The test Pod uses the schedulerName field to select the custom scheduler:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-scheduler
  template:
    metadata:
      labels:
        app: test-scheduler
    spec:
      schedulerName: sample-scheduler
      containers:
      - image: docker.io/library/nginx:1.19.2-alpine
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80

Create the test Pod:


root@ubuntu:~# kubectl create -f  test-deploy.yaml
deployment.apps/test-scheduler created
root@ubuntu:~# kubectl get po
NAME                             READY   STATUS    RESTARTS   AGE
test-scheduler-6645fd9f6-pfzkm   0/1     Pending   0          4s
root@ubuntu:~# kubectl describe pod test-scheduler-6645fd9f6-pfzkm 
Name:           test-scheduler-6645fd9f6-pfzkm
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=test-scheduler
                pod-template-hash=6645fd9f6
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/test-scheduler-6645fd9f6
Containers:
  nginx:
    Image:        docker.io/library/nginx:1.19.2-alpine
    Port:         80/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jbdqf (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-jbdqf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jbdqf
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 360s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 360s
Events:
  Type     Reason            Age   From  Message
  ----     ------            ----  ----  -------
  Warning  FailedScheduling  36s         0/1 nodes are available: 1 Node: kube-master.
  Warning  FailedScheduling  36s         0/1 nodes are available: 1 Node: kube-master.

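Because the plugin filters out nodes without the cpu=true label, the Pod stays in the Pending state, and the scheduler log reports the same reason: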

I1011 15:14:35.015568     466 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="default/test-scheduler-6645fd9f6-pfzkm" err="0/1 nodes are available: 1 Node: kube-master."

Add the label to the node:


root@ubuntu:~# kubectl label nodes kube-master cpu=true
node/kube-master labeled
root@ubuntu:~# kubectl get nodes --show-labels 
NAME          STATUS   ROLES    AGE   VERSION   LABELS
kube-master   Ready    <none>   30d   v1.19.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cpu=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=kube-master,kubernetes.io/os=linux
root@ubuntu:~# kubectl get po
NAME                             READY   STATUS    RESTARTS   AGE
test-scheduler-6645fd9f6-pfzkm   1/1     Running   0          3m16s

The Pod is now scheduled and in the Running state, and the custom scheduler logs the corresponding output:

I1011 15:17:45.635502     466 default_binder.go:51] Attempting to bind default/test-scheduler-6645fd9f6-pfzkm to kube-master
I1011 15:17:45.643123     466 eventhandlers.go:205] delete event for unscheduled pod default/test-scheduler-6645fd9f6-pfzkm
I1011 15:17:45.643204     466 eventhandlers.go:225] add event for scheduled pod default/test-scheduler-6645fd9f6-pfzkm 
I1011 15:17:45.645886     466 scheduler.go:597] "Successfully bound pod to node" pod="default/test-scheduler-6645fd9f6-pfzkm" node="kube-master" evaluatedNodes=1 feasibleNodes=1