By default, node_exporter only reports the total size of mounted filesystems; getting the size of a specific directory is not something you can do with simple configuration.
After digging through the docs, the only option is to have a script write directory sizes to a .prom file and configure node_exporter to pick up metrics from that file. Since everything already runs on k8s, I decided to run a pod via a DaemonSet to do this work.

Preparing the image

Here alpine is used as the base image, with a few common tools installed. Only moreutils is strictly required, since it provides the sponge command used in the cron job later; if you would rather not install it, replacing sponge with mv in the cron job also works. The timezone is adjusted and crond is started, both in preparation for the cron job.

```dockerfile
FROM alpine:3.18.4
RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.tuna.tsinghua.edu.cn/g' /etc/apk/repositories
RUN apk add --no-cache bash openssh curl mysql-client moreutils
RUN apk add tzdata \
 && cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
 && echo "Asia/Shanghai" > /etc/timezone
CMD crond -l 2 -f
```

The image is pushed to my company's private registry, which is not public, so please don't try to pull it directly. I'm just noting the docker build commands here because I keep forgetting them...

```shell
docker build -t docker.shuyilink.com/operations:1.0 .
docker push docker.shuyilink.com/operations:1.0
```

Preparing the metrics-generation script

The script is adapted from the official one in the reference link, with the following changes:

  • Added a timeout to the du command, so that a very large directory cannot tie up too much I/O; on timeout the directory's size is set to 999999999, which also makes it easy for monitoring alerts to spot
  • Added a check that skips a directory if it does not exist
  • Removed the du --block-size=1 --summarize flags, because the du in the alpine image does not support them... (note that without --block-size=1, du reports sizes in 1 KiB blocks, so the values are effectively kibibytes rather than bytes, despite the metric name)
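Since timeouts are encoded as the sentinel value 999999999, a Prometheus alerting rule can flag them directly. A minimal sketch (the group name, alert name, and annotation text are made up for illustration):

```yaml
groups:
  - name: directory-size          # hypothetical group name
    rules:
      - alert: DirectorySizeScanTimeout   # hypothetical alert name
        expr: node_directory_size_bytes == 999999999
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'du timed out while measuring {{ $labels.directory }}'
```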
```bash
#!/bin/bash

echo "# HELP node_directory_size_bytes Disk space used by some directories"
echo "# TYPE node_directory_size_bytes gauge"

# du is given a 20-second timeout
for dir in "$@"
do
  if [ -d "$dir" ]; then
    output=$(timeout 20 du -s "$dir" 2> /dev/null)

    # If du timed out or returned nothing, emit the sentinel value
    if [ -z "$output" ]; then
      echo "node_directory_size_bytes{directory=\"$dir\"} 999999999"
    else
      echo "$output" | awk -v dir="$dir" '{ printf "node_directory_size_bytes{directory=\"%s\"} %s\n", dir, $1 }'
    fi
  fi
done
```
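To sanity-check the metric format locally before deploying, you can run the same du-plus-awk pipeline against a throwaway directory (a quick sketch, not part of the deployed script):

```shell
# Create a throwaway directory containing a small file
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/file" bs=1024 count=10 2>/dev/null

# Same pipeline as the script: du size, reformatted as a Prometheus sample
metric_line=$(timeout 20 du -s "$tmpdir" 2>/dev/null |
  awk -v dir="$tmpdir" '{ printf "node_directory_size_bytes{directory=\"%s\"} %s", dir, $1 }')
echo "$metric_line"

rm -rf "$tmpdir"
```

The printed line should look just like one of the samples in the .prom file shown later.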

Preparing the k8s YAML

The two directories to be measured, /data/log and /data/volume, are mounted into the container; the script and the cron job configuration are mounted as ConfigMaps; and the cron job writes its output to /data/promfile/directory_size.prom.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sygloperator
  namespace: ops
spec:
  selector:
    matchLabels:
      app: sygloperator
  template:
    metadata:
      labels:
        app: sygloperator
    spec:
      containers:
        - name: sygloperator
          image: docker.shuyilink.com/operations:1.0
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 10m
              memory: 128Mi
          volumeMounts:
            - mountPath: /data/log
              name: datalog
            - mountPath: /etc/crontabs/root
              name: cronjob
              subPath: root
            - mountPath: /root/scripts
              name: scripts
            - mountPath: /data/volume
              name: datavolume
            - mountPath: /data/promfile
              name: promfile
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      volumes:
        - hostPath:
            path: /data/log
            type: DirectoryOrCreate
          name: datalog
        - configMap:
            defaultMode: 420
            name: cronjob-config
          name: cronjob
        - configMap:
            defaultMode: 420
            name: opscripts
          name: scripts
        - hostPath:
            path: /data/volume
            type: DirectoryOrCreate
          name: datavolume
        - hostPath:
            path: /data/promfile
            type: DirectoryOrCreate
          name: promfile
---
apiVersion: v1
data:
  root: |
    5 */1 * * * /bin/bash /root/scripts/directory-size.sh /data/log /data/volume | sponge /data/promfile/directory_size.prom
kind: ConfigMap
metadata:
  name: cronjob-config
  namespace: ops
---
apiVersion: v1
data:
  directory-size.sh: |-
    #!/bin/bash

    echo "# HELP node_directory_size_bytes Disk space used by some directories"
    echo "# TYPE node_directory_size_bytes gauge"

    # du is given a 20-second timeout
    for dir in "$@"
    do
      if [ -d "$dir" ]; then
        output=$(timeout 20 du -s "$dir" 2> /dev/null)

        # If du timed out or returned nothing, emit the sentinel value
        if [ -z "$output" ]; then
          echo "node_directory_size_bytes{directory=\"$dir\"} 999999999"
        else
          echo "$output" | awk -v dir="$dir" '{ printf "node_directory_size_bytes{directory=\"%s\"} %s\n", dir, $1 }'
        fi
      fi
    done
kind: ConfigMap
metadata:
  name: opscripts
  namespace: ops
```
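If you skipped moreutils, the sponge step in the crontab can be replaced by writing to a temporary file and renaming it, which keeps the update close to atomic. A sketch of the alternative crontab line (the .tmp suffix is my own choice):

```shell
5 */1 * * * /bin/bash /root/scripts/directory-size.sh /data/log /data/volume > /data/promfile/directory_size.prom.tmp && mv /data/promfile/directory_size.prom.tmp /data/promfile/directory_size.prom
```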

If everything runs correctly, a directory_size.prom file will appear under /data/promfile/ on the host, with contents like the following:

```
# HELP node_directory_size_bytes Disk space used by some directories
# TYPE node_directory_size_bytes gauge
node_directory_size_bytes{directory="/data/log"} 52609040
node_directory_size_bytes{directory="/data/volume"} 999999999
```

Updating the node_exporter configuration

Remember to add the flag --collector.textfile.directory=/rootfs/data/promfile to node_exporter.
Adjust this path according to your own node_exporter mount layout; my complete node_exporter YAML is below.

```yaml
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: '8'
  labels:
    k8s-app: node-exporter
    k8s.kuboard.cn/name: node-exporter
  name: node-exporter
  namespace: ops
  resourceVersion: '612487219'
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: node-exporter
      version: v1.0.1
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: node-exporter
        version: v1.0.1
    spec:
      containers:
        - args:
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--path.rootfs=/rootfs'
            - >-
              --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|boot|run|mnt)($|/)
            - '--no-collector.zfs'
            - '--no-collector.netclass'
            - '--no-collector.nfs'
            - '--no-collector.filesystem'
            - '--log.level=debug'
            - '--collector.textfile.directory=/rootfs/data/promfile'
          image: 'prom/node-exporter:v1.0.1'
          imagePullPolicy: IfNotPresent
          name: prometheus-node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: metrics
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 10m
              memory: 128Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /host/proc
              name: proc
              readOnly: true
            - mountPath: /host/sys
              name: sys
              readOnly: true
            - mountPath: /rootfs
              name: rootfs
              readOnly: true
      dnsPolicy: ClusterFirst
      hostIPC: true
      hostNetwork: true
      hostPID: true
      nodeSelector:
        kubernetes.io/arch: amd64
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
      volumes:
        - hostPath:
            path: /proc
            type: ''
          name: proc
        - hostPath:
            path: /sys
            type: ''
          name: sys
        - hostPath:
            path: /
            type: ''
          name: rootfs
        - hostPath:
            path: /dev
            type: ''
          name: dev
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: node-exporter
  namespace: ops
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9100
      protocol: TCP
      targetPort: 9100
  selector:
    k8s-app: node-exporter
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

Verification

curl the node_exporter's 9100/metrics endpoint and you should see content like the following; alternatively, query the
node_directory_size_bytes metric in Prometheus and it should return matching results. If not, check the alpine pod's logs and the node_exporter logs.

```
# HELP node_directory_size_bytes Disk space used by some directories
# TYPE node_directory_size_bytes gauge
node_directory_size_bytes{directory="/data/log"} 4.231016e+06
node_directory_size_bytes{directory="/data/volume"} 174324
```
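Once the metric is in Prometheus, a simple threshold query is enough to spot oversized directories. A sketch (the 50 GiB cutoff is an arbitrary illustration; the values are in KiB blocks since --block-size=1 was removed from du):

```
node_directory_size_bytes > 50 * 1024 * 1024
```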

References

node-exporter-textfile-collector-scripts/directory-size.sh at master · prometheus-community/node-exporter-textfile-collector-scripts (github.com)