Kubernetes Cluster Monitoring: Deploying and Configuring Prometheus + Grafana


How do you monitor the Nodes and Pods in a Kubernetes cluster? The standard industry answer is Prometheus + Grafana.

Here is the overall result first:

(screenshot: Grafana dashboard overview)

1. Deploying Prometheus

#Create the ConfigMap. Everything here defaults to port 9090; because 9090 was already occupied by another service in my environment, I changed the externally exposed port (the Service below maps 9091/39091 onto the container's 9090).

cat prometheus.configmap.yaml

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```
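Before loading this into the cluster, you can sanity-check the scrape configuration with promtool, which ships inside the Prometheus image. A minimal sketch, assuming you have saved the embedded prometheus.yml content above to a local file:

```sh
# Validate the scrape config with promtool from the same image version used in the Deployment
# (assumes the prometheus.yml block above was saved to ./prometheus.yml).
docker run --rm --entrypoint /bin/promtool \
  -v "$PWD/prometheus.yml:/prometheus.yml:ro" \
  prom/prometheus:v2.4.3 check config /prometheus.yml
```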

#Each Node must mount the NFS share

1. Install nfs-utils on every Node.
2. On the NFS server, create the directory /data/k8s_data/prometheus/k8s-vloume and export it in /etc/exports (an example export entry is sketched below).
3. Mount the share on each Node:

```sh
mount -t nfs 192.168.1.115:/data/k8s_data/prometheus/k8s-vloume /data/k8s_data/prometheus/
```

These steps can be rolled out to all Nodes at once with Ansible.
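For reference, a minimal /etc/exports entry on the NFS server might look like the following. The client subnet 192.168.1.0/24 and the rw/sync/no_root_squash options are assumptions; adjust them to your network and security policy:

```sh
# /etc/exports on 192.168.1.115 (subnet and options are assumptions)
# no_root_squash lets containers running as root (Prometheus runs as uid 0 below) write to the share.
/data/k8s_data/prometheus/k8s-vloume 192.168.1.0/24(rw,sync,no_root_squash)
/data/k8s_data/grafana               192.168.1.0/24(rw,sync,no_root_squash)

# Reload the export table without restarting the NFS service:
exportfs -ra
```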

#Declare the PV and create the PVC; Prometheus data is persisted to the internal NFS share

cat prometheus-volume.yaml

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: 192.168.1.115
    path: /data/k8s_data/prometheus/k8s-vloume
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
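Once applied, confirm that the claim actually binds to the volume (both should show STATUS Bound):

```sh
kubectl get pv prometheus
kubectl get pvc prometheus -n kube-system
```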

#Create the Deployment and Service

cat prometheus.deploy.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: harbor.xxxxx.com/prom/prometheus:v2.4.3  # private registry address; replace with your own, or pull from Docker Hub
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=30d"
        - "--web.enable-admin-api"  # enables the admin HTTP API, including features such as deleting time series
        - "--web.enable-lifecycle"  # enables hot reload: a POST to localhost:9090/-/reload takes effect immediately
        ports:
        - containerPort: 9090
          protocol: TCP
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          subPath: prometheus
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 100m
            memory: 512Mi
      securityContext:
        runAsUser: 0
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus
      - configMap:
          name: prometheus-config
        name: config-volume
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: prometheus
  labels:
    app: prometheus
spec:
  type: NodePort
  selector:
    app: prometheus
  ports:
  - port: 9091
    protocol: TCP
    targetPort: 9090
    nodePort: 39091
```
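One caveat: 39091 lies outside the default NodePort range (30000-32767), so the apiserver will reject this Service unless the range has been extended with the --service-node-port-range flag. A sketch, assuming a kubeadm-style static pod manifest for the apiserver (the file path is an assumption):

```sh
# In /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout is an assumption),
# add or edit this flag so 39091 is allowed; the apiserver restarts automatically:
#   --service-node-port-range=30000-40000
grep service-node-port-range /etc/kubernetes/manifests/kube-apiserver.yaml
```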

#Create the RBAC rules

cat prometheus-rbac.yaml

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
```

#Apply everything

```sh
kubectl apply -f .
```

#Check the deployment status

(screenshot: deployment status)

#Check the Service

(screenshot: Service details)
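The screenshots correspond roughly to the following checks:

```sh
kubectl get pods -n kube-system -l app=prometheus
kubectl get svc  -n kube-system prometheus
```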

#Verify access

#I bound an internal DNS name through nginx here; you can also hit NodeIP:NodePort directly, or set up an Ingress.

In my case the URL is:

http://prometheus.xxxx.com:9090/targets

(screenshot: Prometheus /targets page)
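Since the Deployment enables --web.enable-lifecycle, a configuration change only requires updating the ConfigMap and asking Prometheus to reload, rather than restarting the Pod. A sketch, assuming you go through the NodePort (replace <node-ip> with one of your Node addresses):

```sh
# Re-apply the ConfigMap, wait for the kubelet to sync it into the Pod (can take up to a minute),
# then trigger a live reload of the configuration.
kubectl apply -f prometheus.configmap.yaml
curl -X POST http://<node-ip>:39091/-/reload
```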

#To monitor the Kubernetes Nodes themselves, deploy node-exporter; a DaemonSet ensures one instance runs on every Node

cat prometheus-node-exporter.yaml

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    name: node-exporter
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
        app: node-exporter
    spec:
      hostPID: true      # share the host's PID/IPC/network namespaces so host metrics are visible
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: harbor.xxx.com/prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '^/(sys|proc|dev|host|etc)($|/)'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:       # also schedule onto master nodes
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
```

#Apply it

```sh
kubectl apply -f prometheus-node-exporter.yaml
```

#Check the DaemonSet

(screenshot: node-exporter Pods, one per Node)

#Check the scraped data

(screenshot: node metrics in Prometheus)
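Equivalent checks from the command line; because node-exporter runs with hostNetwork, its metrics endpoint is reachable on each Node's own IP:

```sh
kubectl get pods -n kube-system -l name=node-exporter -o wide
# Spot-check one Node's metrics endpoint (replace <node-ip>):
curl -s http://<node-ip>:9100/metrics | head
```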

2. Deploying Grafana

#Declare the PV and create the PVC

cat grafana_volume.yaml

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: 192.168.1.115
    path: /data/k8s_data/grafana
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

#Create the Deployment; the Grafana admin username and password are set via environment variables

cat grafana_deployment.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
    k8s-app: grafana
spec:
  selector:
    matchLabels:
      k8s-app: grafana
      app: grafana
  revisionHistoryLimit: 10
  template:
    metadata:
      labels:
        app: grafana
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:5.3.4
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: grafana
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: admin
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 300m
            memory: 1024Mi
          requests:
            cpu: 300m
            memory: 1024Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          subPath: grafana
          name: storage
      securityContext:
        fsGroup: 472     # 472 is the uid/gid of the grafana user in the official image
        runAsUser: 472
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana
```

#Create a one-off Job that chowns the data directory so Grafana (uid 472) can write to the NFS-backed volume

cat grafana_job.yaml

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: grafana-chown
  namespace: kube-system
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: grafana-chown
        command: ["chown", "-R", "472:472", "/var/lib/grafana"]
        image: harbor.xxxxx.com/busybox/busybox:1.28
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: storage
          subPath: grafana
          mountPath: /var/lib/grafana
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana
```
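Run this Job before (or right after) the Grafana Deployment and make sure it completes; otherwise Grafana may crash-loop with permission errors on /var/lib/grafana:

```sh
kubectl apply -f grafana_job.yaml
kubectl get jobs -n kube-system grafana-chown   # COMPLETIONS should read 1/1
```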

#Create the Service

cat grafana_svc.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
spec:
  type: NodePort
  selector:
    app: grafana
  ports:
  - port: 3000
    protocol: TCP
    targetPort: 3000
    nodePort: 30000
```

#Apply everything

```sh
kubectl apply -f .
```
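As before, confirm the Pod is Running and note the NodePort:

```sh
kubectl get pods -n kube-system -l app=grafana
kubectl get svc  -n kube-system grafana   # NodePort 30000
```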

#Access Grafana; here too I used an internal DNS name behind nginx

(screenshot: Grafana login page)

Log in with admin/admin, then set a new password when prompted.

#Add the Prometheus data source

(screenshot: Add data source page)

Choose Prometheus as the Type and fill in the URL. (The DNS name would not pass the connection test for me; in the end I entered the IP address instead.)

(screenshot: data source settings)

Finally, click Save & Test.
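If the test fails with a hostname, the usual in-cluster fix is to point Grafana at the Prometheus Service via cluster DNS rather than an external name. A sketch, assuming the Service and port defined above:

```sh
# Candidate data source URL using the in-cluster Service DNS name
# (<service>.<namespace>.svc.cluster.local, Service port 9091 from prometheus.deploy.yaml):
#   http://prometheus.kube-system.svc.cluster.local:9091
# Quick connectivity test from a throwaway Pod:
kubectl run -n kube-system dns-test --rm -it --image=busybox:1.28 --restart=Never -- \
  wget -qO- http://prometheus.kube-system.svc.cluster.local:9091/-/healthy
```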

3. Adding a Kubernetes Dashboard Template

(screenshot: Grafana import dialog)

Search for the "Kubernetes Deployment Statefulset Daemonset metrics" template on grafana.com and import it. You can also download the template JSON and import that, or simply enter the dashboard ID: 8858.

(screenshot: the imported dashboard)

#Panel: Kubernetes cluster memory usage; a standard expression using node-exporter v0.16+ metric names:

```
(sum(node_memory_MemTotal_bytes) - sum(node_memory_MemAvailable_bytes)) / sum(node_memory_MemTotal_bytes) * 100
```

(screenshot: memory usage panel)

#Panel: Kubernetes cluster filesystem usage, restricted to real on-disk filesystems rather than tmpfs (the ext4/xfs fstype filter is an assumption; adjust it to your disks):

```
(sum(node_filesystem_size_bytes{fstype=~"ext4|xfs"}) - sum(node_filesystem_free_bytes{fstype=~"ext4|xfs"})) / sum(node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100
```

(screenshot: filesystem usage panel)

#Panel: Pod CPU usage (note the empty-string label matchers, and that cAdvisor in this era labels the Pod as pod_name):

```
sum by (pod_name)(rate(container_cpu_usage_seconds_total{image!="", pod_name!=""}[1m]))
```

(screenshot: Pod CPU usage panel)

Beyond this point it is mostly a matter of writing PromQL: every one of these metrics can be queried straight out of Prometheus, so you can build panels to suit your own needs.
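Any expression can be tried outside Grafana first, against Prometheus's HTTP query API. A sketch against the NodePort exposed above (replace <node-ip> with one of your Nodes):

```sh
# Instant query via the HTTP API; --data-urlencode handles the special characters in PromQL.
curl -sG 'http://<node-ip>:39091/api/v1/query' \
  --data-urlencode 'query=sum by (pod_name)(rate(container_cpu_usage_seconds_total{image!=""}[1m]))'
```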

Further thoughts:

1. Combine this with Alertmanager for alerting.
2. Move long-term storage to a time-series database such as OpenTSDB.

References

https://grafana.com/docs/grafana/v5.3/features/datasources/prometheus/

https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels

http://opentsdb.net/
