CrashLoopBackOff is one of the most common abnormal Pod states in Kubernetes. Put simply, a Pod in the cluster keeps crashing and restarting in a loop: the Pod often runs for only a few seconds before the program dies on an exception, leaving no long-running process, but the kubelet, following the Pod's restart policy (default: `Always`), keeps restarting it. Between restarts the kubelet waits an exponentially increasing back-off delay, which is where the name CrashLoopBackOff comes from.
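As a minimal illustration (this is not the demo pod used below; the name is made up), a container whose command exits immediately will cycle through exactly this state under the default `restartPolicy: Always`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crash-demo         # hypothetical name, for illustration only
spec:
  restartPolicy: Always    # the default; the kubelet keeps restarting the container
  containers:
  - name: crash
    image: busybox
    # No long-running process: the shell prints a line and exits non-zero,
    # so the container terminates seconds after it starts.
    command: ["sh", "-c", "echo 'about to exit'; exit 1"]
```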
```
┌──[root@vms100.liruilongs.github.io]-[~/ansible/crashlookbackoff_demo]
└─$kubectl describe pods crashlookbackoff-pod | grep -A 20 -i event
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  17m                   default-scheduler  Successfully assigned default/crashlookbackoff-pod to vms106.liruilongs.github.io
  Normal   Pulled     17m                   kubelet            Container image "docker.io/istio/proxyv2:1.16.2" already present on machine
  Normal   Created    17m                   kubelet            Created container istio-init
  Normal   Started    17m                   kubelet            Started container istio-init
  Normal   Pulled     17m                   kubelet            Successfully pulled image "busybox" in 15.691344116s
  Normal   Pulled     17m                   kubelet            Container image "docker.io/istio/proxyv2:1.16.2" already present on machine
  Normal   Created    17m                   kubelet            Created container istio-proxy
  Normal   Started    17m                   kubelet            Started container istio-proxy
  Warning  Unhealthy  17m (x2 over 17m)     kubelet            Readiness probe failed: Get "http://10.244.31.80:15021/healthz/ready": dial tcp 10.244.31.80:15021: connect: connection refused
  Normal   Pulled     16m                   kubelet            Successfully pulled image "busybox" in 15.599021058s
  Normal   Created    16m (x2 over 17m)     kubelet            Created container busybox
  Normal   Started    16m (x2 over 17m)     kubelet            Started container busybox
  Warning  Unhealthy  16m (x6 over 17m)     kubelet            Liveness probe failed: cat: can't open '/tmp/liruilong': No such file or directory
  Normal   Killing    16m (x2 over 17m)     kubelet            Container busybox failed liveness probe, will be restarted
  Normal   Pulling    12m (x6 over 17m)     kubelet            Pulling image "busybox"
  Warning  BackOff    2m39s (x36 over 11m)  kubelet            Back-off restarting failed container
┌──[root@vms100.liruilongs.github.io]-[~/ansible/crashlookbackoff_demo]
└─$kubectl get pods crashlookbackoff-pod -w
NAME                   READY   STATUS             RESTARTS       AGE
crashlookbackoff-pod   1/2     CrashLoopBackOff   7 (3m4s ago)   15m
```
```
┌──[root@vms100.liruilongs.github.io]-[~/ansible/crashlookbackoff_demo]
└─$kubectl describe pods crashlookbackoff-pod | grep -A 20 Events:
Events:
  Type     Reason   Age                     From               Message
  ----     ------   ----                    ----               -------
  Normal   Scheduled  9m52s                   default-scheduler  Successfully assigned default/crashlookbackoff-pod to vms106.liruilongs.github.io
  Normal   Pulled     9m50s                   kubelet            Container image "docker.io/istio/proxyv2:1.16.2" already present on machine
  Normal   Created    9m50s                   kubelet            Created container istio-init
  Normal   Started    9m50s                   kubelet            Started container istio-init
  Normal   Pulled     9m49s                   kubelet            Successfully pulled image "busybox" in 1.007350697s
  Normal   Created    9m48s                   kubelet            Created container istio-proxy
  Normal   Pulled     9m48s                   kubelet            Container image "docker.io/istio/proxyv2:1.16.2" already present on machine
  Normal   Started    9m47s                   kubelet            Started container istio-proxy
  Normal   Pulled     9m26s                   kubelet            Successfully pulled image "busybox" in 15.685743099s
  Normal   Pulled     8m53s                   kubelet            Successfully pulled image "busybox" in 16.040759951s
  Normal   Pulling    8m22s (x4 over 9m50s)   kubelet            Pulling image "busybox"
  Normal   Created    8m6s (x4 over 9m48s)    kubelet            Created container busybox
  Normal   Started    8m6s (x4 over 9m48s)    kubelet            Started container busybox
  Normal   Pulled     8m6s                    kubelet            Successfully pulled image "busybox" in 15.878975739s
  Warning  BackOff    4m50s (x15 over 9m20s)  kubelet            Back-off restarting failed container
┌──[root@vms100.liruilongs.github.io]-[~/ansible/crashlookbackoff_demo]
└─$kubectl get pods crashlookbackoff-pod
NAME                   READY   STATUS             RESTARTS        AGE
crashlookbackoff-pod   1/2     CrashLoopBackOff   6 (2m27s ago)   10m
┌──[root@vms100.liruilongs.github.io]-[~/ansible/crashlookbackoff_demo]
└─$
```
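Judging from the events above, the demo pod presumably looks something like the following (a reconstruction, not the original manifest, and ignoring the injected istio-proxy sidecar visible in the events): a busybox container whose exec liveness probe `cat`s a file that is never created, so the kubelet keeps killing and restarting it:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crashlookbackoff-pod
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 3600"]     # assumed long-running command
    livenessProbe:
      exec:
        command: ["cat", "/tmp/liruilong"]  # the file never exists, so the probe always fails
      initialDelaySeconds: 5
      periodSeconds: 5
```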
```
W0228 07:29:31.671667       1 mutation_detector.go:53] Mutation detector is enabled, this will result in memory leakage.
E0228 07:29:31.671746       1 factory.go:224] /hostvarrun/docker.sock exists, but not found /hostvarrun/dockershim.sock
W0228 07:29:31.767342       1 factory.go:113] Failed to new image service for containerd (, unix:///hostvarrun/containerd/containerd.sock): failed to fetch cri-containerd status: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService
W0228 07:29:31.767721       1 mutation_detector.go:53] Mutation detector is enabled, this will result in memory leakage.
panic: runtime error: invalid memory address or nil pointer dereference
```
The key messages are `/hostvarrun/docker.sock exists, but not found /hostvarrun/dockershim.sock` and `Failed to new image service for containerd...`: docker.sock exists, but dockershim.sock was not found, so creating the image service failed.
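The paths in the log are host paths mounted into the pod under `/hostvarrun`; on the node itself you can check which runtime sockets actually exist (assuming the usual `/var/run` locations):

```shell
# On the node: see which container-runtime sockets are present.
# After the dockershim removal, dockershim.sock is typically gone,
# while containerd.sock (and possibly docker.sock) remain.
ls -l /var/run/docker.sock \
      /var/run/dockershim.sock \
      /var/run/containerd/containerd.sock
```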
```
┌──[root@vms100.liruilongs.github.io]-[~/ansible/crashlookbackoff_demo]
└─$kubectl logs -f deployments/release-name-grafana
Found 2 pods, using pod/release-name-grafana-76f4b7b77d-bbvws
[2023-03-26 03:14:28] Starting collector
[2023-03-26 03:14:28] No folder annotation was provided, defaulting to k8s-sidecar-target-directory
[2023-03-26 03:14:28] Selected resource type: ('secret', 'configmap')
[2023-03-26 03:14:28] Loading incluster config ...
[2023-03-26 03:14:28] Config for cluster api at 'https://10.96.0.1:443' loaded...
[2023-03-26 03:14:28] Unique filenames will not be enforced.
[2023-03-26 03:14:28] 5xx response content will not be enabled.
[2023-03-26 03:14:34] Working on ADDED configmap default/release-name-kube-promethe-controller-manager
[2023-03-26 03:14:34] Working on ADDED configmap default/release-name-kube-promethe-namespace-by-pod
................
```
The logs can also provide clues about application-level problems. For example, you might see a message that `./data` can't be mounted, possibly because it is already in use and locked by another container.
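One caveat when reading logs of a crash-looping pod: the current container instance may not have produced any useful output yet. The `--previous` flag of `kubectl logs` shows the output of the last terminated instance, which is usually where the crash reason is (the pod and container names below are the ones from the demo above):

```shell
# Logs of the currently running (or most recent) container instance
kubectl logs crashlookbackoff-pod -c busybox

# Logs of the previous, crashed instance of the same container
kubectl logs crashlookbackoff-pod -c busybox --previous
```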
```
[root@master student]# oc new-app --name=nginx --docker-image=registry.lab.example.com/nginx
--> Found Docker image c825216 (4 years old) from registry.lab.example.com for "registry.lab.example.com/nginx"

    * An image stream will be created as "nginx:latest" that will track this image
    * This image will be deployed in deployment config "nginx"
    * Port 80/tcp will be load balanced by service "nginx"
    * Other containers can access this service through the hostname "nginx"
    * WARNING: Image "registry.lab.example.com/nginx" runs as the 'root' user which may not be permitted by your cluster administrator

--> Creating resources ...
    imagestream "nginx" created
    deploymentconfig "nginx" created
    service "nginx" created
--> Success
    Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
     'oc expose svc/nginx'
    Run 'oc status' to view your app.
```
```
[root@master student]# kubectl get pods
NAME                       READY   STATUS             RESTARTS   AGE
docker-registry-1-drmbk    1/1     Running            2          1d
nginx-1-deploy             1/1     Running            0          45s
nginx-1-h5zx8              0/1     CrashLoopBackOff   2          42s
registry-console-1-dg4h9   1/1     Running            2          1d
router-1-27wtd             1/1     Running            2          1d
router-1-lvmvk             1/1     Running            2          1d
```
You can see that the pod keeps failing to start and is stuck in CrashLoopBackOff:
```
[root@master student]# kubectl get pods --selector=app=nginx
NAME            READY   STATUS             RESTARTS   AGE
nginx-1-h5zx8   0/1     CrashLoopBackOff   6          6m
[root@master student]#
```
When the pod was first created, we already saw a warning telling us that running as the root user may not be permitted by the cluster administrator. That is why the pod stays in CrashLoopBackOff:
```
* WARNING: Image "registry.lab.example.com/nginx" runs as the 'root' user which may not be permitted by your cluster administrator
```
Looking at the events, they only say BackOff:
```
[root@master student]# oc describe pods nginx-1-h5zx8 | grep -i -A 20 event
Events:
  Type     Reason                 Age                From                            Message
  ----     ------                 ----               ----                            -------
  Normal   Scheduled              13m                default-scheduler               Successfully assigned nginx-1-h5zx8 to node2.lab.example.com
  Normal   SuccessfulMountVolume  13m                kubelet, node2.lab.example.com  MountVolume.SetUp succeeded for volume "default-token-bmctn"
  Normal   Pulled                 12m (x4 over 13m)  kubelet, node2.lab.example.com  Successfully pulled image "registry.lab.example.com/nginx@sha256:4ffd9758ea9ea360fd87d0cee7a2d1cf9dba630bb57ca36b3108dcd3708dc189"
  Normal   Created                12m (x4 over 13m)  kubelet, node2.lab.example.com  Created container
  Normal   Started                12m (x4 over 13m)  kubelet, node2.lab.example.com  Started container
  Normal   Pulling                11m (x5 over 13m)  kubelet, node2.lab.example.com  pulling image "registry.lab.example.com/nginx@sha256:4ffd9758ea9ea360fd87d0cee7a2d1cf9dba630bb57ca36b3108dcd3708dc189"
  Warning  BackOff                3m (x46 over 13m)  kubelet, node2.lab.example.com  Back-off restarting failed container
[root@master student]#
```
```
[root@master student]# oc logs nginx-1-h5zx8
2023/04/14 11:42:31 [warn] 1#1: the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /etc/nginx/nginx.conf:2
nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /etc/nginx/nginx.conf:2
2023/04/14 11:42:31 [emerg] 1#1: mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
nginx: [emerg] mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
[root@master student]#
```
At this point we can either relax the default SCC permissions so that container processes are allowed to start as root, or take the service-account route: create a new service account (SA), bind a specific SCC (such as anyuid) to it, and modify the dc to run as the newly created SA (along with any other SAs involved).
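The service-account route from the steps above can be sketched as follows (a sketch, not the lab's official solution; the SA name `nginx-sa` is made up for illustration):

```shell
# Create a dedicated service account for the workload
# (the name nginx-sa is hypothetical)
oc create serviceaccount nginx-sa

# Bind the anyuid SCC to the SA, so pods using it may run as root
oc adm policy add-scc-to-user anyuid -z nginx-sa

# Point the deployment config at the new SA; this triggers a new rollout,
# after which the nginx container can create /var/cache/nginx/client_temp
oc set serviceaccount dc/nginx nginx-sa
```

Granting anyuid to a dedicated SA is usually preferable to loosening the `restricted` SCC itself, since it limits the relaxed policy to this one workload.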