Source: DevOpSec (WeChat public account). Author: DevOpSec
Background
Kubernetes version: 1.25.6. After the business was containerized on Kubernetes and processes were migrated from VMs into containers, operations engineers were puzzled when running free -m, top, and similar commands to troubleshoot: the output showed plenty of free memory while the pod's container was being OOM-killed, or it showed many CPU cores with plenty of idle capacity while the process ran slowly. The resource view we saw was the physical machine's, not that of the limits we had set on the containers in the pod, and this caused real interference for developers and operators during troubleshooting.
So why is the resource view that operators see still the physical machine's?
We know that containers limit CPU, memory, swap, and other resources through cgroups, but a container is not fully isolated: it shares the kernel with the host and can therefore see some host-level information. On Linux, the /proc directory holds many virtual files that expose kernel and runtime information. /proc/meminfo contains memory usage and status, such as total, available, and used memory. When you run free -m inside a container, you are actually reading the host's /proc/meminfo, so what you see is the physical machine's memory.
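A quick illustration (a sketch, assuming a Docker host and the stock ubuntu image): even with a hard memory limit, the container still reports the host's totals.
# Hard-limit the container to 256 MiB, then ask it how much memory it has
docker run --rm -m 256m --memory-swap 256m ubuntu:22.04 free -m
# "total" here is the host's memory, not 256, because /proc/meminfo inside
# the container is still the host's procfs file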
Now that we know why the container's resource view is not isolated, note that in practice this causes real pain points beyond mere confusion, for example (see the sketch after this list):
- nginx sets its worker count automatically from the number of CPU cores it sees.
- JVM programs size the heap automatically from system memory, so the process may fail to start or keep getting OOM-killed at runtime.
- Over-exposing host information can also endanger the security of the physical machine.
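Here is a quick way to see the mis-sizing happen (a sketch, assuming Docker and the stock ubuntu image):
# Even with --cpus=1 the container still "sees" every host CPU, so
# `worker_processes auto;` in nginx or the JVM's default heap sizing
# will size themselves against the host rather than the limit
docker run --rm --cpus=1 ubuntu:22.04 nproc
docker run --rm -m 256m ubuntu:22.04 grep MemTotal /proc/meminfo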
So how do we solve the container resource-view isolation problem?
The Linux Containers (LXC) community recognized this problem long ago and developed LXCFS (Linux Containers File System) to solve container resource-view isolation.
Let's look at how LXCFS works.
How LXCFS works
LXCFS is a small virtual filesystem implemented with FUSE (Filesystem in Userspace) whose goal is to make Linux containers feel more like virtual machines. It started as a side project of LXC, but it can be used by any runtime.
LXCFS ensures that the information provided by key procfs files is specific to the container, for example:
/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
/proc/slabinfo
/sys/devices/system/cpu/online
LXCFS adapts this information to the container, so that the values shown (for example /proc/uptime) truly reflect the container's uptime rather than the host's.
In other words, LXCFS maintains a virtual filesystem and bind-mounts its files over key paths inside the container (such as those under /proc and /sys). Through its own logic and calculations it serves a virtualized view of these resources, so that what processes inside the container see reflects the container's own limits and usage rather than the host's.
When free -m is executed inside the container, it reads /proc/meminfo. Because /proc/meminfo is bind-mounted (see below), the read actually goes to /var/lib/lxcfs/proc/meminfo, which triggers LXCFS's working mechanism:
1. The read goes through a glibc system call into the kernel's VFS layer and is routed to the FUSE kernel module.
2. FUSE calls back into the user-space LXCFS filesystem implementation to obtain the container's cgroup information.
3. LXCFS locates the container's cgroup by container id and computes the container's actual mem, cpu, and other figures from the cgroup limits.
4. The result finally returned to the user is the resource view as constrained by the cgroup.
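Roughly, this is the kind of calculation LXCFS performs for MemTotal/MemFree; a sketch against cgroup v1 paths on the host (cgroup v2 uses memory.max / memory.current instead, and the container-id path below is just a placeholder):
# hypothetical cgroup path for one container
CG=/sys/fs/cgroup/memory/docker/<container-id>
limit=$(cat $CG/memory.limit_in_bytes)       # what the container is allowed
usage=$(cat $CG/memory.usage_in_bytes)       # what it currently uses
echo "MemTotal: $((limit / 1024)) kB"
echo "MemFree:  $(( (limit - usage) / 1024 )) kB"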
Deploying LXCFS on a single machine
a. Install lxcfs
yum install meson fuse-devel fuse cmake help2man fuse3 fuse3-devel -y
git clone https://github.com/lxc/lxcfs
cd lxcfs
meson setup -Dinit-script=systemd --prefix=/usr build/
meson compile -C build/
meson install -C build/
b. Start lxcfs
mkdir -p /var/lib/lxcfs
lxcfs /var/lib/lxcfs
c. Run a test container
docker run -it -m 256m --memory-swap 256m --cpus=1 \
-v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
-v /var/lib/lxcfs/proc/diskstats:/proc/diskstats:rw \
-v /var/lib/lxcfs/proc/meminfo:/proc/meminfo:rw \
-v /var/lib/lxcfs/proc/stat:/proc/stat:rw \
-v /var/lib/lxcfs/proc/swaps:/proc/swaps:rw \
-v /var/lib/lxcfs/proc/uptime:/proc/uptime:rw \
-v /var/lib/lxcfs/proc/slabinfo:/proc/slabinfo:rw \
-v /var/lib/lxcfs/sys/devices/system/cpu:/sys/devices/system/cpu:rw \
ubuntu:18.04 /bin/bash
After the container starts, run the following commands to confirm it took effect:
1. uptime    # container uptime
2. free -m   # memory
3. lscpu     # online CPU count, or cat /proc/cpuinfo
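With the 256 MiB / 1-CPU limits used above, the output inside the container should now look roughly like this (illustrative values, not captured from a real run):
# free -m reports the cgroup limit rather than the host's memory
#              total        used        free
# Mem:           256           2         253
# lscpu and /proc/cpuinfo show a single online CPU, and uptime counts from
# the moment the container was created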
How do we add resource-view isolation for pods in a Kubernetes environment? Let's take a look.
Running LXCFS in Kubernetes
Steps:
First, make the lxcfs process run on every node; we solve this with a DaemonSet.
Next, mount the node's /sys/fs/cgroup, /usr/lib64, and /usr/local into the lxcfs container, and expose the virtual filesystem /var/lib/lxcfs/ from the lxcfs container back onto the host via hostPath.
Finally, in the pod YAML, mount the node's /var/lib/lxcfs/ into the pod's containers via hostPath. With that, lxcfs solves the container resource-view isolation problem on Kubernetes.
a. Build the lxcfs image
a.1 Directory layout
tree .
.
├── Dockerfile
├── build.sh
└── lxcfs-lxcfs-5.0.4.tar.gz
a.2 Dockerfile
# or specify your own base image
FROM centos:7.9
# install build dependencies
RUN yum install meson fuse-devel fuse cmake help2man fuse3 fuse3-devel git -y
# clone and build in a single RUN so the `cd lxcfs` takes effect for the build steps
RUN git clone https://github.com/lxc/lxcfs && cd lxcfs && \
    meson setup -Dinit-script=systemd --prefix=/usr build/ && \
    meson compile -C build/ && \
    meson install -C build/
# runtime
RUN mkdir -p /var/lib/lxcfs
CMD ["sh", "-c", "lxcfs /var/lib/lxcfs"]
a.3 build.sh: build and push the image
#!/bin/bash
source /etc/profile
docker build -t yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs .
docker push yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs
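Before relying on it in the cluster, you can optionally smoke-test the image locally; lxcfs needs /dev/fuse and elevated privileges, so something along these lines (a sketch, flags depend on your environment):
docker run --rm --privileged \
  -v /var/lib/lxcfs:/var/lib/lxcfs:shared \
  yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs
# then, on the host, ls /var/lib/lxcfs/proc should list the virtualized files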
That completes the lxcfs image. Next, let's see how to use it.
b. Run lxcfs as a DaemonSet
Use the lxcfs image built above, mount the node paths into the pod, and at the same time expose /var/lib/lxcfs/ back onto the node, as in the following YAML:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
  labels:
    app: lxcfs
  name: lxcfs
  namespace: default
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: lxcfs
  template:
    metadata:
      labels:
        app: lxcfs
    spec:
      containers:
      - image: yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs
        imagePullPolicy: Always
        name: lxcfs
        resources: {}
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /var/lib/lxcfs
          mountPropagation: Bidirectional
          name: lxcfs
        - mountPath: /usr/local
          name: usr-local
        - mountPath: /usr/lib64
          name: usr-lib64
      hostPID: true
      imagePullSecrets:
      - name: your-docker-token
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: your-taint-key
        operator: Exists
      volumes:
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup
      - hostPath:
          path: /usr/local
          type: ""
        name: usr-local
      - hostPath:
          path: /usr/lib64
          type: ""
        name: usr-lib64
      - hostPath:
          path: /var/lib/lxcfs
          type: DirectoryOrCreate
        name: lxcfs
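After applying it, a quick sanity check that the DaemonSet is running everywhere and that the FUSE mount shows up on the hosts (commands as a sketch):
kubectl get ds lxcfs -n default
kubectl get pod -l app=lxcfs -o wide
# on any node, the virtualized files should now exist:
ls /var/lib/lxcfs/proc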
After applying the YAML above, the lxcfs DaemonSet pod on individual nodes may fail to start with the following error:
Error: failed to generate container "974c6c0465adae1a244e3416b3e053ba2dccb0cbd123c2d02317c9301e3f83d0" spec: failed to apply OCI options: failed to stat "/var/lib/lxcfs": stat /var/lib/lxcfs: transport endpoint is not connected
Fix:
umount /var/lib/lxcfs
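This error usually means a stale FUSE mount left behind by a previous lxcfs instance; you can confirm it before unmounting (a sketch):
mount | grep lxcfs     # shows the leftover fuse.lxcfs mount
ls /var/lib/lxcfs      # fails with "transport endpoint is not connected" until it is unmounted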
c. Verification: Deployment pod YAML definition
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      volumes:
      - hostPath:
          path: /var/lib/lxcfs/proc/cpuinfo
          type: ""
        name: lxcfs-proc-cpuinfo
      - hostPath:
          path: /var/lib/lxcfs/proc/diskstats
          type: ""
        name: lxcfs-proc-diskstats
      - hostPath:
          path: /var/lib/lxcfs/proc/meminfo
          type: ""
        name: lxcfs-proc-meminfo
      - hostPath:
          path: /var/lib/lxcfs/proc/stat
          type: ""
        name: lxcfs-proc-stat
      - hostPath:
          path: /var/lib/lxcfs/proc/swaps
          type: ""
        name: lxcfs-proc-swaps
      - hostPath:
          path: /var/lib/lxcfs/proc/uptime
          type: ""
        name: lxcfs-proc-uptime
      - hostPath:
          path: /var/lib/lxcfs/proc/loadavg
          type: ""
        name: lxcfs-proc-loadavg
      - hostPath:
          path: /var/lib/lxcfs/sys/devices/system/cpu/online
          type: ""
        name: lxcfs-sys-devices-system-cpu-online
      containers:
      - name: web
        image: httpd:2.4.32
        imagePullPolicy: Always
        resources:
          requests:
            memory: "256Mi"
            cpu: "500m"
          limits:
            memory: "256Mi"
            cpu: "500m"
        volumeMounts:
        - mountPath: /proc/cpuinfo
          name: lxcfs-proc-cpuinfo
          readOnly: true
        - mountPath: /proc/meminfo
          name: lxcfs-proc-meminfo
          readOnly: true
        - mountPath: /proc/diskstats
          name: lxcfs-proc-diskstats
          readOnly: true
        - mountPath: /proc/stat
          name: lxcfs-proc-stat
          readOnly: true
        - mountPath: /proc/swaps
          name: lxcfs-proc-swaps
          readOnly: true
        - mountPath: /proc/uptime
          name: lxcfs-proc-uptime
          readOnly: true
        - mountPath: /proc/loadavg
          name: lxcfs-proc-loadavg
          readOnly: true
        - mountPath: /sys/devices/system/cpu/online
          name: lxcfs-sys-devices-system-cpu-online
          readOnly: true
With this, the pod gets container resource-view isolation through lxcfs.
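You can confirm the mounts took effect from outside the pod as well (illustrative commands against the web Deployment defined above):
kubectl exec deploy/web -- free -m                             # total should be ~256
kubectl exec deploy/web -- cat /proc/uptime                    # container uptime, not the host's
kubectl exec deploy/web -- cat /sys/devices/system/cpu/online  # reflects the CPU limit, not all host cores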
But there is a catch: copy-pasting this block of mounts for one or two containers is tolerable; repeating it for thousands of containers is something you, as a follower of the KISS principle, cannot accept.
Is there a way out? We can use an admission webhook (Admission Control), which, after authorization, further validates requests or injects default parameters. What we have in mind has already been built by those who came before us, so there is no need to reinvent the wheel; see lxcfs-admission-webhook.
lxcfs-admission-webhook: injecting the /proc and /sys mounts automatically
lxcfs-admission-webhook implements a dynamic admission webhook, or more precisely a mutating webhook: it watches pod creation and patches the pod, injecting the lxcfs-to-container directory mappings into the pod's spec so the mounts are added automatically.
Using it is also pretty KISS: a single label on the namespace is all it takes to enable it.
Let's see how to use it.
1. Prepare the lxcfs-admission-webhook image
Build the binary with go build
git clone git@github.com:denverdino/lxcfs-admission-webhook.git
cd lxcfs-admission-webhook
# build lxcfs-admission-webhook; this is an old Go project, so convert it to Go modules first
export GOPROXY=https://goproxy.cn,direct
go mod init v1
go mod tidy
CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o lxcfs-admission-webhook
chmod +x lxcfs-admission-webhook
Dockerfile
FROM alpine:latest
ADD lxcfs-admission-webhook /lxcfs-admission-webhook
ENTRYPOINT ["./lxcfs-admission-webhook"]
Build and push the image
docker build -t yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1 .
docker push yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1
2. Run the lxcfs-admission-webhook pod
Each cluster has its own CA certificate, so when deploying lxcfs-admission-webhook to a different cluster, do the following first and then apply the YAML.
2.1 Directory layout
tree .
.
├── dp.yaml                         # lxcfs-admission-webhook deployment
├── mutatingwebhook.yaml            # MutatingWebhookConfiguration
├── svc.yaml                        # webhook svc
└── webhook-create-signed-cert.sh   # creates the certificates required by lxcfs-admission-webhook
2.2 Modify webhook-create-signed-cert.sh
Note: our Kubernetes version is fairly new and lxcfs-admission-webhook has not been updated for a few years, so the certificate generation script webhook-create-signed-cert.sh from GitHub was modified to work with newer Kubernetes versions.
#!/bin/bash
set -e
usage() {
cat <<EOF
Generate a certificate suitable for use with a sidecar-injector webhook service.
This script uses k8s' CertificateSigningRequest API to generate a
certificate signed by the k8s CA, suitable for use with sidecar-injector webhook
services. This requires permissions to create and approve CSRs. See
https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster for
a detailed explanation and additional instructions.
The server key/cert and the k8s CA cert are stored in a k8s secret.
usage: ${0} [OPTIONS]
The following flags are required.
--service Service name of webhook.
--namespace Namespace where webhook service and secret reside.
--secret Secret name for CA certificate and server certificate/key pair.
EOF
exit 1
}
while [[ $# -gt 0 ]]; do
case ${1} in
--service)
service="$2"
shift
;;
--secret)
secret="$2"
shift
;;
--namespace)
namespace="$2"
shift
;;
*)
usage
;;
esac
shift
done
[ -z ${service} ] && service=lxcfs-admission-webhook-svc
[ -z ${secret} ] && secret=lxcfs-admission-webhook-certs
[ -z ${namespace} ] && namespace=default
if [ ! -x "$(command -v openssl)" ]; then
echo "openssl not found"
exit 1
fi
csrName=${service}.${namespace}
tmpdir=$(mktemp -d)
echo "creating certs in tmpdir ${tmpdir} "
cat <<EOF >> ${tmpdir}/csr.conf
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = ${service}
DNS.2 = ${service}.${namespace}
DNS.3 = ${service}.${namespace}.svc
EOF
openssl genrsa -out ${tmpdir}/server-key.pem 2048
#openssl req -new -key ${tmpdir}/server-key.pem -subj "/CN=${service}.${namespace}.svc" -out ${tmpdir}/server.csr -config ${tmpdir}/csr.conf
openssl req -new -key ${tmpdir}/server-key.pem -subj "/CN=system:node:${service}.${namespace}.svc/O=system:nodes" -out ${tmpdir}/server.csr -config ${tmpdir}/csr.conf
# clean-up any previously created CSR for our service. Ignore errors if not present.
kubectl delete csr ${csrName} -n ${namespace} 2>/dev/null || true
# create server cert/key CSR and send to k8s API
cat <<EOF | kubectl -n ${namespace} create -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${csrName}
spec:
  groups:
  - system:authenticated
  signerName: kubernetes.io/kubelet-serving
  request: $(cat ${tmpdir}/server.csr | base64 | tr -d '\n')
  usages:
  - digital signature
  - key encipherment
  - server auth
EOF
# verify CSR has been created
while true; do
kubectl get csr ${csrName}
if [ "$?" -eq 0 ]; then
break
fi
done
# approve and fetch the signed certificate
kubectl certificate approve ${csrName}
# verify certificate has been signed
for x in $(seq 10); do
serverCert=$(kubectl get csr ${csrName} -o jsonpath='{.status.certificate}')
if [[ ${serverCert} != '' ]]; then
break
fi
sleep 1
done
if [[ ${serverCert} == '' ]]; then
echo "ERROR: After approving csr ${csrName}, the signed certificate did not appear on the resource. Giving up after 10 attempts." >&2
exit 1
fi
echo ${serverCert} | openssl base64 -d -A -out ${tmpdir}/server-cert.pem
# create the secret with CA cert and server cert/key
kubectl create secret generic ${secret} \
--from-file=key.pem=${tmpdir}/server-key.pem \
--from-file=cert.pem=${tmpdir}/server-cert.pem \
--dry-run=client -o yaml |
kubectl -n ${namespace} apply -f -
Compared with the upstream script, the certificate request subject was changed to /CN=system:node:${service}.${namespace}.svc/O=system:nodes and a bug in --namespace handling was fixed.
Then, on a k8s master node, run: kubectl create ns lxcfs ; sh webhook-create-signed-cert.sh --namespace lxcfs
2.3 Get the cluster CA certificate
kubectl config view --raw --flatten --minify -o jsonpath='{.clusters[].cluster.certificate-authority-data}'
2.4 Put the CA certificate into the caBundle field of mutatingwebhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: mutating-lxcfs-admission-webhook-cfg
  labels:
    app: lxcfs-admission-webhook
webhooks:
- name: mutating.lxcfs-admission-webhook.aliyun.com
  # admissionregistration.k8s.io/v1beta1 was removed in k8s 1.22+, so the v1 API is used here;
  # v1beta1 is listed first so the older webhook binary still receives an AdmissionReview it understands
  admissionReviewVersions: ["v1beta1", "v1"]
  sideEffects: None
  clientConfig:
    service:
      name: lxcfs-admission-webhook-svc
      namespace: lxcfs   # must match the namespace the webhook service is deployed in
      path: "/mutate"
    caBundle: LS0tLS1CRUdJxiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJek1EY3hOekEwTXpNek5Gb1hEVE16TURjeE5EQTBNek16TkZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTlVZCjd4SThpcXZtbEtNN0FDTUFDY0huRWxxTXgyakR1b3JkWk81cUNGYTBNalROOXNqZHhUbHNNTlMrUHpuOUxPSkMKZ2d5TW90MGNPaW0zQTd2bllRYzFCY2I3UHFLOGpjS0U2a0E5MWVyNlpNSHU0c3ZXRXEybjVyMlIvcnY5NUR2eQpIRzlzTUJnenQrWUFJNlR6OGJNazhnMzJZR1BJejEvTTJmalBCa292bVJ3U0c1UkVIYWVFNW1TdDBRMnJheGJQCmtEU0pDSEErVlV3QThuekpFRVpwdkIxbUZ6MytXKzhrOUpIYlFtSW40TzhNaCtYYXlGc2Vab2g5SC9kVERkSXUKN0JXVG5pcmg5YkNWZzJhSDJidG03ZVpSY2s1V3IrM0QxcmUrc1FxWnpVdlhFSzBQYTk4MENGd3BYTVhsenlFdQpqNkhQRjZzOUhmV0gxOVdJMUdrQ0F3RUFBYU5aTUZjd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZBQVVicWVyaklyUDRmOFV0ZjErUzRERzVSWStNQlVHQTFVZEVRUU8KTUF5Q0NtdDFZbVZ5Ym1WMFpYTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBTGx0OHBELzVtMnhVclJSdUJIdQpaODFKbnpDSzB6Y2ZhbHRROXFiWkFQb2syT1R6eTQrclh6SHQ4VzVHN01YVmN6TXVoZnh0OXFSeWVLekM3bmtICnpJSnIxcmxPbkkwaXdNcHJFeDlNQkpBTnBNdWNwN3ljaE82RGlOQ01ocFAwMXdDbWVENTBsVUladlIrMHhUbHEKaGVZdTFZS3Eza3Q0dzNuWVUxUGszUGU1Q3NweFNqd0NKNVF0RHpyUFY4bE5JaHNMZjRHV2U2bDN0N2J5ck9wWApsUWJiMXovazNRTDRTU3pqcEdkQVRmUnVmRmsrbk1RVkFCSmJwVWp5aHNFMlg1TjRvLzlKWFVpZVhLNlYxOHNiCnVtVUlLYlkySGIyTHNISXEveTBHeHpITnpGTndEeEdGNnNSWFF5SkFYVS9tekNWRWczbEhaWUlpUU9wdkc2VdfsZXFVPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0twx==
  rules:
  - operations: [ "CREATE" ]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  namespaceSelector:
    matchLabels:
      lxcfs-admission-webhook: enabled
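To avoid pasting the blob by hand, you can substitute it into the manifest, for example (a sketch that assumes mutatingwebhook.yaml contains a ${CA_BUNDLE} placeholder in the caBundle field):
CA_BUNDLE=$(kubectl config view --raw --flatten --minify \
  -o jsonpath='{.clusters[].cluster.certificate-authority-data}')
sed -i "s|\${CA_BUNDLE}|${CA_BUNDLE}|g" mutatingwebhook.yaml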
2.5 dp.yaml for lxcfs-admission-webhook
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lxcfs-admission-webhook-deployment
  labels:
    app: lxcfs-admission-webhook
  namespace: lxcfs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lxcfs-admission-webhook
  template:
    metadata:
      labels:
        app: lxcfs-admission-webhook
    spec:
      imagePullSecrets:
      - name: your-docker-token
      containers:
      - name: lxcfs-admission-webhook
        image: yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1
        imagePullPolicy: IfNotPresent
        args:
        - -tlsCertFile=/etc/webhook/certs/cert.pem
        - -tlsKeyFile=/etc/webhook/certs/key.pem
        - -alsologtostderr
        - -v=4
        - 2>&1
        volumeMounts:
        - name: webhook-certs
          mountPath: /etc/webhook/certs
          readOnly: true
      volumes:
      - name: webhook-certs
        secret:
          secretName: lxcfs-admission-webhook-certs
2.6 svc.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: lxcfs
  name: lxcfs-admission-webhook-svc
  labels:
    app: lxcfs-admission-webhook
spec:
  ports:
  - port: 443
    targetPort: 443
  selector:
    app: lxcfs-admission-webhook
3. Verification: enable the injection capability
Enable lxcfs injection for the default namespace:
kubectl label namespace default lxcfs-admission-webhook=enabled
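Confirm the label is in place, since the webhook's namespaceSelector only matches labeled namespaces:
kubectl get ns default --show-labels | grep lxcfs-admission-webhook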
Deploy the test Deployment:
cd lxcfs-admission-webhook
kubectl apply -f deployment/web.yaml
Exec into a container and run free:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
lxcfs-admission-webhook-deployment-f4bdd6f66-5wrlg 1/1 Running 0 8m29s
lxcfs-pqs2d 1/1 Running 0 55m
lxcfs-zfh99 1/1 Running 0 55m
web-7c5464f6b9-6zxdf 1/1 Running 0 8m10s
web-7c5464f6b9-nktff 1/1 Running 0 8m10s
$ kubectl exec -ti web-7c5464f6b9-6zxdf sh
# free
total used free shared buffers cached
Mem: 262144 2744 259400 0 0 312
-/+ buffers/cache: 2432 259712
Swap: 0 0 0
#
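You can also confirm that the webhook actually injected the lxcfs volumes into the pod spec (the pod name is just the one from the listing above):
kubectl get pod web-7c5464f6b9-6zxdf \
  -o jsonpath='{range .spec.volumes[*]}{.name}{"\n"}{end}' | grep lxcfs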
Summary
To be clear, what we have isolated is the container's resource view from the physical machine's resource view, not the pod's.
With container resource views isolated, things look much saner, and it helps a great deal with troubleshooting, service startup, and security. Give it a try. Follow DevOpSec for weekly hands-on content, and let's improve together.