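Monitoring a Kubernetes Cluster: Deploying and Configuring Prometheus + Grafana

How do you monitor the Nodes and Pods in a K8S cluster? The approach most commonly used in the industry is Prometheus + Grafana.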


I. Prometheus Deployment

# Create the ConfigMap. Port 9090 is the default throughout; because 9090 was already taken by another service in my environment, I changed the port that is exposed (see the Service further down).

cat prometheus.configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: kube-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
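The relabel rule in the 'kubernetes-node' job rewrites the kubelet address discovered for each node (port 10250) to the node-exporter port (9100). As a rough illustration only, the same substitution can be reproduced with sed. The node address 192.168.1.20 is a made-up example, and sed merely imitates the rule here; Prometheus applies it with its own fully anchored regex engine.

```shell
#!/bin/sh
# Imitates the relabel rule: regex '(.*):10250' -> replacement '${1}:9100'.
# Prometheus anchors relabel regexes at both ends, hence the ^ and $ here.
rewrite_address() {
  printf '%s\n' "$1" | sed -E 's/^(.*):10250$/\1:9100/'
}

rewrite_address "192.168.1.20:10250"   # prints 192.168.1.20:9100
rewrite_address "192.168.1.20:9090"    # regex does not match, address is left unchanged
```

When the regex does not match, the replace action leaves the target label as it was, which is what the second call mimics.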

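# Every Node must mount the NFS share

1. First, install nfs-utils on every Node.

2. On the NFS server, create the directory /data/k8s_data/prometheus/k8s-vloume and adjust the export permissions in /etc/exports.

3. Mount the NFS share:

mount -t nfs 192.168.1.115:/data/k8s_data/prometheus/k8s-vloume /data/k8s_data/prometheus/

These steps can be rolled out to all nodes in one go with Ansible.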
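The client-side steps above could be scripted roughly as follows. This is a sketch under assumptions: it presumes a yum-based distribution, it does not cover the server-side /etc/exports entry, and the helper names run and mount_prometheus_share are invented for illustration. By default it only prints the commands; set DRY_RUN=0 and run it as root to execute them.

```shell
#!/bin/sh
# Values taken from the mount command above.
NFS_SERVER="192.168.1.115"
NFS_EXPORT="/data/k8s_data/prometheus/k8s-vloume"
MOUNT_POINT="/data/k8s_data/prometheus/"

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

mount_prometheus_share() {
  run yum install -y nfs-utils      # assumption: yum-based distribution
  run mkdir -p "$MOUNT_POINT"
  run mount -t nfs "$NFS_SERVER:$NFS_EXPORT" "$MOUNT_POINT"
}

mount_prometheus_share
```

The same three commands map directly onto Ansible tasks if you prefer to keep the rollout there.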

# Declare the PV and create the PVC. Prometheus data is persisted to the NFS share on the internal network.

cat prometheus-volume.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: 192.168.1.115
    path: /data/k8s_data/prometheus/k8s-vloume
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

# Create the Deployment and Service

cat prometheus.deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: kube-system
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: harbor.xxxxx.com/prom/prometheus:v2.4.3  # private registry address; change it to your own registry or to an image you can pull from Docker Hub
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=30d"
        - "--web.enable-admin-api"  # enables the admin HTTP API, which includes operations such as deleting time series
        - "--web.enable-lifecycle"  # enables hot reload: an HTTP POST to localhost:9090/-/reload takes effect immediately
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: "/prometheus"
          subPath: prometheus
          name: data
        - mountPath: "/etc/prometheus"
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 100m
            memory: 512Mi
      securityContext:
        runAsUser: 0
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus
      - configMap:
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  labels:
    app: prometheus
spec:
  type: NodePort
  selector:
    app: prometheus
  ports:
  - port: 9091
    protocol: TCP
    targetPort: 9090
    nodePort: 39091  # outside the default NodePort range (30000-32767); the API server's --service-node-port-range must allow it
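Because the Deployment passes --web.enable-lifecycle, a changed prometheus.yml can be picked up without restarting the Pod. A minimal sketch of that reload call, assuming Prometheus is reached through the NodePort 39091 defined above; the helper names and the node address in the usage comment are hypothetical.

```shell
#!/bin/sh
PROM_NODE_PORT=39091   # nodePort of the Prometheus Service

# Build a URL for a Prometheus path on a given node.
prometheus_url() {     # $1 = node address, $2 = path
  printf 'http://%s:%s%s\n' "$1" "$PROM_NODE_PORT" "$2"
}

# Ask Prometheus to re-read its configuration (the endpoint requires POST).
reload_prometheus() {  # $1 = node address
  curl -fsS -X POST "$(prometheus_url "$1" /-/reload)"
}

# Usage (example address):
#   kubectl apply -f prometheus.configmap.yaml
#   reload_prometheus 192.168.1.20
```

After kubectl apply, the kubelet can take a little while to refresh the mounted ConfigMap file, so a reload sent immediately may still see the old configuration.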

# Create the RBAC rules

cat prometheus-rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
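Once the RBAC objects are applied, one way to confirm that the binding took effect is to impersonate the ServiceAccount with kubectl auth can-i. The sketch below assumes the ServiceAccount is named prometheus in kube-system (the name the Deployment refers to via serviceAccountName) and that your own user is allowed to impersonate it; the function names are invented for illustration.

```shell
#!/bin/sh
# Build the user name Kubernetes assigns to a ServiceAccount.
sa_user() {   # $1 = namespace, $2 = ServiceAccount name
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Ask the API server whether the ServiceAccount may list each resource
# that the ClusterRole is meant to cover. Prints "yes" or "no" per resource.
check_prometheus_rbac() {
  user="$(sa_user kube-system prometheus)"
  for resource in nodes services endpoints pods; do
    printf '%s: ' "$resource"
    kubectl auth can-i list "$resource" --as="$user"
  done
}

# Usage: check_prometheus_rbac
```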

# Apply

kubectl apply -f .

# Check the deployment status

# Check the Service status

# Verify access

# I bound the service to an internal DNS domain name through nginx. You can also reach it directly via Node IP + NodePort, or configure an Ingress.

In my case I open:

http://prometheus.xxxx.com:9090/targets
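The information shown on the /targets page is also available as JSON from the /api/v1/targets endpoint, which makes a quick command-line check possible. A rough sketch, assuming the compact JSON that Prometheus returns and the same address as above; count_up_targets is a made-up helper that simply counts targets reported as healthy.

```shell
#!/bin/sh
# Count targets whose health is "up" in /api/v1/targets JSON read from stdin.
count_up_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Fetch the target list from Prometheus.
fetch_targets() {   # $1 = Prometheus base URL
  curl -fsS "$1/api/v1/targets"
}

# Usage:
#   fetch_targets http://prometheus.xxxx.com:9090 | count_up_targets
```

A plain grep is enough for a smoke test; for anything more detailed a real JSON parser is the safer choice.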

# To monitor the K8S Nodes, deploy node-exporter. A DaemonSet is used here so that it runs on every Node.

cat prometheus-node-exporter.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: kube-system
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: harbor.xxx.com/prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '^/(sys|proc|dev|host|etc)($|/)'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /

# Apply

kubectl apply -f prometheus-node-exporter.yaml

# Check the DaemonSet

# Check the collected data
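Since the DaemonSet uses hostNetwork, each node should answer on port 9100 directly, so the exporter can be spot-checked with curl before looking at Prometheus. A small sketch; metric_value is a hypothetical helper that pulls one label-less metric out of the text exposition format, and the node address in the usage line is a placeholder.

```shell
#!/bin/sh
# Print the value of a metric that has no labels, reading exposition text on stdin.
metric_value() {   # $1 = metric name, e.g. node_load1
  awk -v m="$1" '$1 == m { print $2 }'
}

# Usage:
#   kubectl get daemonset,pods -n kube-system -o wide | grep node-exporter
#   curl -fsS http://<node-ip>:9100/metrics | metric_value node_load1
```

Metrics that carry labels (for example per-CPU counters) include the label set in the first field, so this exact-match helper deliberately skips them.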

II. Grafana Deployment

# Declare the PV and create the PVC

cat grafana_volume.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: 192.168.1.115
    path: /data/k8s_data/grafana
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

# Create the Deployment. It contains the Grafana admin user name and password.

cat grafana_deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: kube-system
  labels:
    app: grafana
    k8s-app: grafana
spec:
  selector:
    matchLabels:
      k8s-app: grafana
      app: grafana
  revisionHistoryLimit: 10
  template:
    metadata:
      labels:
        app: grafana
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:5.3.4
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: admin
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 300m
            memory: 1024Mi
          requests:
            cpu: 300m
            memory: 1024Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          subPath: grafana
          name: storage
      securityContext:
        fsGroup: 472
        runAsUser: 472
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana

# Create a one-off Job that grants ownership of the data directory to the Grafana user

cat grafana_job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  namespace: kube-system
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: grafana-chown
        command: ["chown", "-R", "472:472", "/var/lib/grafana"]
        image: harbor.xxxxx.com/busybox/busybox:1.28
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: storage
          subPath: grafana
          mountPath: /var/lib/grafana
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana

# Create the Service

cat grafana_svc.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  labels:
    app: grafana
spec:
  type: NodePort
  selector:
    app: grafana
  ports:
  - port: 3000
    protocol: TCP
    targetPort: 3000
    nodePort: 30000

# Apply

kubectl apply -f .

# Access Grafana. An internal DNS domain name and Nginx are used here as well.

Log in with admin/admin, then set a new password.

# Add the Prometheus data source

Select Prometheus as the Type and fill in the URL (in my case the connection test never passed with the URL I entered first, so in the end I used the IP address).

Finally, click Save & Test to save it.
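The data source can also be registered through Grafana's HTTP API instead of the UI, which is handy when the setup has to be repeated. The sketch below is an illustration under assumptions: the data source name "Prometheus", the helper names, the GRAFANA_PASSWORD variable, and the addresses in the usage comment are all placeholders, and the admin password is whatever you set after the first login.

```shell
#!/bin/sh
# Build the JSON body for POST /api/datasources.
datasource_payload() {   # $1 = Prometheus URL as seen from the Grafana server
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$1"
}

# Register the data source using basic auth as the admin user.
add_datasource() {       # $1 = Grafana base URL, $2 = Prometheus URL
  curl -fsS -u "admin:${GRAFANA_PASSWORD:-admin}" \
    -H 'Content-Type: application/json' \
    -X POST "$1/api/datasources" \
    -d "$(datasource_payload "$2")"
}

# Usage (example addresses):
#   GRAFANA_PASSWORD='new-password' add_datasource http://192.168.1.20:30000 http://192.168.1.20:39091
```

With access set to proxy, the Grafana server itself calls Prometheus, so the URL must be reachable from inside the Grafana Pod rather than from your browser.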

III. Adding a Kubernetes Dashboard Template

Search for the "Kubernetes Deployment Statefulset Daemonset metrics" template and import it.

You can also download the template and import the file, or simply enter the template ID: 8858.

# Configure the K8S cluster memory usage panel

(sum(node_memory_MemTotal_bytes) - sum(node_memory_MemAvailable_bytes)) / sum(node_memory_MemTotal_bytes) * 100

# Configure the K8S cluster file system usage panel

(sum(node_filesystem_size_bytes{device="tmpfs"}) - sum(node_filesystem_free_bytes{device="tmpfs"})) / sum(node_filesystem_size_bytes{device="tmpfs"}) * 100
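Before pasting an expression into a Grafana panel, it can be useful to run it against the Prometheus HTTP API and look at the raw result. A minimal sketch, assuming Prometheus is reachable at the address used earlier; fs_usage_query and instant_query are hypothetical helpers, and the first one just rebuilds the file system usage expression above for a chosen device.

```shell
#!/bin/sh
# Build the file system usage expression for one device label value.
fs_usage_query() {   # $1 = device, e.g. tmpfs
  printf '(sum(node_filesystem_size_bytes{device="%s"}) - sum(node_filesystem_free_bytes{device="%s"})) / sum(node_filesystem_size_bytes{device="%s"}) * 100\n' "$1" "$1" "$1"
}

# Run an instant query; curl URL-encodes the expression.
instant_query() {    # $1 = Prometheus base URL, $2 = PromQL expression
  curl -fsS -G "$1/api/v1/query" --data-urlencode "query=$2"
}

# Usage:
#   instant_query http://prometheus.xxxx.com:9090 "$(fs_usage_query tmpfs)"
```

An empty result vector usually means the label filter matches no series, which is quicker to spot here than in an empty Grafana panel.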