kubectl top pod showed that, within a single namespace, some pods reported resource usage while others returned nothing. A closer look revealed that all of the pods without metrics were scheduled on the same node. We had been working around this by restarting kubelet on that node, and finally decided to track down the root cause.
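
To confirm the pattern, comparing the pods that report metrics against the full pod list makes the shared node obvious (the namespace name below is illustrative):

kubectl top pod -n demo           # only pods with metrics appear here
kubectl get pod -n demo -o wide   # all pods; -o wide adds the NODE column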

kubelet produces a lot of logs and the relevant error was not immediately obvious. Since kubelet collects its monitoring metrics through cAdvisor, we deployed a standalone cAdvisor in the cluster, which immediately surfaced the key error logs below.

E1209 02:38:37.265120       1 watcher.go:152] Failed to watch directory "/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod4e1cd4a0-4af9-4e7c-afe5-c0ab3fc3764a": inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod4e1cd4a0-4af9-4e7c-afe5-c0ab3fc3764a/0795e696035df98ba98e267005e1c23a91f0ad0210c0750b8e948fd744b3f12a: no space left on device
E1209 02:38:37.265133       1 watcher.go:152] Failed to watch directory "/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable": inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod4e1cd4a0-4af9-4e7c-afe5-c0ab3fc3764a/0795e696035df98ba98e267005e1c23a91f0ad0210c0750b8e948fd744b3f12a: no space left on device
E1209 02:38:37.265146       1 watcher.go:152] Failed to watch directory "/sys/fs/cgroup/cpu,cpuacct/kubepods": inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod4e1cd4a0-4af9-4e7c-afe5-c0ab3fc3764a/0795e696035df98ba98e267005e1c23a91f0ad0210c0750b8e948fd744b3f12a: no space left on device
F1209 02:38:37.265170       1 cadvisor.go:167] Failed to start manager: inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod4e1cd4a0-4af9-4e7c-afe5-c0ab3fc3764a/0795e696035df98ba98e267005e1c23a91f0ad0210c0750b8e948fd744b3f12a: no space left on device
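
We deployed cAdvisor inside Kubernetes; for a one-off check on a single node, a plain docker run along the lines of the upstream README also works as a sketch (image tag and port here are illustrative):

# Minimal standalone cAdvisor on the suspect node (tag/port illustrative):
docker run -d --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  gcr.io/cadvisor/cadvisor:v0.47.0
docker logs cadvisor    # the fatal inotify_add_watch error surfaces here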

The cause is simple: the system's inotify limit had been reached. Despite the "no space left on device" message, this has nothing to do with disk space; inotify_add_watch returns ENOSPC when the calling user's watch quota is exhausted. The investigation went as follows:

Check the system limits

[root@node193 ~]# sysctl -a | grep fs.inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192
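
For context, what each knob limits (semantics from the kernel's inotify documentation; the values can equally be read from /proc):

# fs.inotify.max_queued_events  - events queued per inotify instance
#                                 before the kernel drops them
# fs.inotify.max_user_instances - inotify instances (fds) per real UID
# fs.inotify.max_user_watches   - total watches per real UID; this is
#                                 the limit behind the ENOSPC above
cat /proc/sys/fs/inotify/max_user_watches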

Count the inotify watches used by each process

#!/usr/bin/env bash
# Count inotify watches per process: every watch appears as an
# "inotify wd:" line in /proc/<pid>/fdinfo/<fd>, so summing those
# lines over all open inotify file descriptors gives per-fd usage.

total=0
result="CMD PID FD-INFO INOTIFY-WATCHES\n"
# lsof lists root's open inotify fds; awk keeps "<pid> <fd>" pairs,
# stripping the access-mode suffix (r/w/u) from the FD column.
while read -r pid fd; do
    exe="$(readlink -f "/proc/$pid/exe" || echo n/a)"
    fdinfo="/proc/$pid/fdinfo/$fd"
    count="$(grep -c inotify "$fdinfo" || true)"
    if [ "${count:-0}" != 0 ]; then
        total=$((total + count))
        result+="$exe $pid $fdinfo $count\n"
    fi
done <<< "$(lsof +c 0 -n -P -u root | awk '/inotify$/ { gsub(/[urw]$/,"",$4); print $2" "$4 }')"

echo "total $total inotify watches"
result="$(echo -e "$result" | column -t)\n"
echo -e "$result" | head -1                     # header row
echo -e "$result" | sed "1d" | sort -k 4rn      # sort by watch count, desc

The output:

[root@node193 ~]# bash test.sh 
total 5430 inotify watches
CMD                             PID    FD-INFO                 INOTIFY-WATCHES
/usr/bin/kubelet                21664  /proc/21664/fdinfo/35   5132
/usr/bin/promtail               24262  /proc/24262/fdinfo/8    241
/usr/lib/systemd/systemd-udevd  904    /proc/904/fdinfo/7      9
/usr/sbin/NetworkManager        1242   /proc/1242/fdinfo/10    5
/usr/lib/systemd/systemd        1      /proc/1/fdinfo/14       4
/usr/lib/systemd/systemd        1      /proc/1/fdinfo/35       4
/usr/sbin/NetworkManager        1242   /proc/1242/fdinfo/11    4
/usr/bin/cilium-agent           32834  /proc/32834/fdinfo/127  3
/usr/lib/systemd/systemd        1      /proc/1/fdinfo/13       3
/usr/lib/systemd/systemd        1      /proc/1/fdinfo/15       3
/usr/sbin/crond                 1591   /proc/1591/fdinfo/5     3
/usr/bin/cilium-agent           32834  /proc/32834/fdinfo/27   2
/usr/bin/promtail               24262  /proc/24262/fdinfo/81   2
/usr/sbin/rsyslogd              1240   /proc/1240/fdinfo/5     2
n/a                             10150  /proc/10150/fdinfo/15   1
n/a                             10205  /proc/10205/fdinfo/15   1
n/a                             1157   /proc/1157/fdinfo/15    1
n/a                             13045  /proc/13045/fdinfo/15   1
n/a                             21684  /proc/21684/fdinfo/339  1
n/a                             22471  /proc/22471/fdinfo/15   1
n/a                             37295  /proc/37295/fdinfo/3    1
n/a                             9584   /proc/9584/fdinfo/3     1
/usr/bin/kubelet                21664  /proc/21664/fdinfo/20   1
/usr/bin/kubelet                21664  /proc/21664/fdinfo/8    1
/usr/bin/kubelet                21664  /proc/21664/fdinfo/9    1
/usr/lib/systemd/systemd        1      /proc/1/fdinfo/10       1
/usr/local/bin/kube-proxy       4056   /proc/4056/fdinfo/6     1

kubelet is clearly the biggest consumer here. Testing confirmed that a standalone cAdvisor deployment registers roughly the same number of watches as kubelet itself, so adding one on top of the existing usage (5430 + ~5132 > 8192) pushes the node past the original system limit of 8192. Raising the parameter is enough to fix it.
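
As a complementary check, counting inotify instances (file descriptors) rather than watches helps when fs.inotify.max_user_instances is the limit being exhausted instead; a common one-liner:

# Each fd symlink to anon_inode:inotify is one instance counted
# against fs.inotify.max_user_instances.
find /proc/*/fd -lname anon_inode:inotify 2>/dev/null |
  cut -d/ -f3 |
  xargs -I{} readlink "/proc/{}/exe" 2>/dev/null |
  sort | uniq -c | sort -rn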

The final fix:

sysctl -w fs.inotify.max_user_watches=1048576
systemctl restart kubelet
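
Note that sysctl -w only changes the running kernel; to make the setting survive a reboot, persist it as well (the file name under /etc/sysctl.d is a common convention, not required):

echo 'fs.inotify.max_user_watches=1048576' > /etc/sysctl.d/99-inotify.conf
sysctl -p /etc/sysctl.d/99-inotify.conf    # reload from that file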