Handling a K8s cluster whose etcd pod crashed with lost snapshots after a forced VM power-off (no backup)

All I wanted was to try to live the life that was struggling to emerge from within me. Why was that so very difficult? (Hermann Hesse, Demian)

Before we start


  • I pulled the wrong power cable by accident; the VM was forcibly powered off, and the cluster was dead after it came back up
  • This post records the workaround
  • The power loss corrupted the etcd snapshot data and there was no backup, so there is basically no clean way to recover
  • You could ask a professional DBA to look at the data and see whether it can be salvaged
  • The workaround in this post deletes some of the files in the etcd data directory
  • The cluster can then start, but all deployed workload data is lost, including the CNI; the cluster's built-in DNS component is gone too
  • If my understanding is lacking anywhere, corrections are welcome
  • Whether in production or in testing, if you have no UPS, you must back up your K8s etcd. Back up etcd. Back up etcd. Important things are worth saying three times.
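As a concrete takeaway, a minimal scheduled backup could be as simple as the cron fragment below. This is a sketch: it assumes etcdctl is installed on the master, kubeadm's default certificate paths (the same ones that appear in etcd.yaml later in this post), and a hypothetical /var/backups target directory.

```
# /etc/cron.d/etcd-backup -- sketch: snapshot etcd every 6 hours
# (assumes etcdctl on the master and kubeadm default cert paths)
0 */6 * * * root ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-$(date +\%F-\%H\%M).db
```

Rotate or ship these snapshot files somewhere off the VM; a backup that lives on the same disk dies with the same power cut.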



Current cluster state

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
The connection to the server 192.168.26.81:6443 was refused - did you specify the right host or port?

Restart docker and kubelet and try to bring things up

┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart docker
┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart kubelet.service

Still no luck. Check the kubelet logs on the master node

┌──[root@vms81.liruilongs.github.io]-[~]
└─$journalctl -u kubelet.service -f
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.703418 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.804201 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.905156 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.005487 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.105648 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.186066 11344 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://192.168.26.81:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vms81.liruilongs.github.io?timeout=10s": dial tcp 192.168.26.81:6443: connect: connection refused
1月 19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.205785 11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"

Use docker to see which pods currently exist

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d9d6471ce936 b51ddc1014b0 "kube-scheduler --au…" 17 minutes ago Up 17 minutes k8s_kube-scheduler_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_14
010c1b8c30c6 5425bcbd23c5 "kube-controller-man…" 17 minutes ago Up 17 minutes k8s_kube-controller-manager_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_15
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up About a minute k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
f557435d150e registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_7
5deaffbc555a registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_7
a418c2ce33f2 registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-apiserver-vms81.liruilongs.github.io_kube-system_a35cb37b6c90c72f607936b33161eefe_6

etcd is not running, and neither is the apiserver.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b 004811815584 "etcd --advertise-cl…" 5 minutes ago Exited (2) About a minute ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 21 minutes ago Up 4 minutes k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7

Try restarting etcd

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker restart b5e18722315b
b5e18722315b

Check whether it started

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b 004811815584 "etcd --advertise-cl…" 5 minutes ago Exited (2) About a minute ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 21 minutes ago Up 4 minutes k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs b5e18722315b

Take a look at the etcd logs

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs 8a53cbc545e4
..................................................
{"level":"info","ts":"2023-01-19T01:34:24.332Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"5.557212ms"}
{"level":"warn","ts":"2023-01-19T01:34:24.332Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000014-0000000000185aba.wal.broken"}
{"level":"info","ts":"2023-01-19T01:34:24.770Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":26912747,"snapshot-size":"42 kB"}
{"level":"warn","ts":"2023-01-19T01:34:24.771Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":26912747,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000019aa7eb.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2023-01-19T01:43:31.738Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000114600, 0xc000588240, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000080960, 0x122e2fc, 0x2a, 0xc000588240, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe54af1e25, 0x1a, 0x0, 0x0, 0x0, 0x0, 0xc0004cf830, 0x1, 0x1, 0xc0004cfa70, ...)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0000ee000, 0xc0000ee600, 0x0, 0x0)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0000ee000, 0x1202a6f, 0x6, 0xc000428401, 0x2)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a120, 0x12, 0x12)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a120, 0x12, 0x12)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32 +0x45

The key line: "msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)"

The power loss corrupted the data files; etcd wants to recover from a snapshot, but the snapshot file it needs is gone.

Since there is no backup here, there is basically no way to repair this properly; the only clean option is to reset the cluster with kubeadm.
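For contrast, had a snapshot backup existed, recovery would have been straightforward. Roughly the sketch below, where /var/backups/etcd-snap.db is a hypothetical backup file; the member name and peer URL match the flags in this cluster's etcd.yaml shown later.

```
# Sketch: restoring etcd from a snapshot backup, IF one existed
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snap.db \
  --name=vms81.liruilongs.github.io \
  --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380 \
  --initial-advertise-peer-urls=https://192.168.26.81:2380 \
  --data-dir=/var/lib/etcd-restored
# Then point the static pod's --data-dir at the restored directory
# (or move it into place) and let kubelet recreate the etcd pod.
```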

Some remedial measures

If you want to try to start the cluster some other way, to salvage some of its current configuration, you can try the approach below. Be warned: I used it on my cluster and all pod data was lost, so in the end I had to reset the cluster anyway.

If you do use the approach below, be sure to back up the etcd data files before deleting anything.

etcd on the master runs as a static pod, so look at its YAML manifest to see where the data directory is configured.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$cd /etc/kubernetes/manifests/
┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$ls
etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml

- --data-dir=/var/lib/etcd

┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$cat etcd.yaml | grep -e "--"
- --advertise-client-urls=https://192.168.26.81:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --initial-advertise-peer-urls=https://192.168.26.81:2380
- --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://192.168.26.81:2380
- --name=vms81.liruilongs.github.io
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
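
With the certificate paths from this manifest, etcd's health can be queried directly. A sketch, assuming etcdctl is installed on the master; in the broken state described here the command would of course fail until etcd is back up.

```
# Sketch: check etcd health using the certs from etcd.yaml above
ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.26.81:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
```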

These are the data files in question. You can try to repair them; if you just want the cluster to come back up quickly, you can do the following.

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$tree
.
├── snap
│   ├── 0000000000000058-00000000019a0ba7.snap
│   ├── 0000000000000058-00000000019a32b8.snap
│   ├── 0000000000000058-00000000019a59c9.snap
│   ├── 0000000000000058-00000000019a80da.snap
│   ├── 0000000000000058-00000000019aa7eb.snap
│   └── db
└── wal
├── 0000000000000014-0000000000185aba.wal.broken
├── 0000000000000142-0000000001963c0e.wal
├── 0000000000000143-0000000001977bbe.wal
├── 0000000000000144-0000000001986aa6.wal
├── 0000000000000145-0000000001995ef6.wal
├── 0000000000000146-00000000019a544d.wal
└── 1.tmp

2 directories, 13 files
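As an aside, the snapshot-index in the panic log and these file names are the same number: snap files are named `<term>-<index>.snap` with both parts as 16-digit hex, which is how you can tell that the missing 00000000019aa7eb.snap.db corresponds to snapshot-index 26912747 from the log.

```shell
# The snapshot index from the panic log, printed as the 16-hex-digit
# suffix used in the snap/ file names above
printf '%016x\n' 26912747
# → 00000000019aa7eb
```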

Back up the data files first

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$tar -cvf member.tar member/
member/
member/snap/
member/snap/db
member/snap/0000000000000058-00000000019a0ba7.snap
member/snap/0000000000000058-00000000019a32b8.snap
member/snap/0000000000000058-00000000019a59c9.snap
member/snap/0000000000000058-00000000019a80da.snap
member/snap/0000000000000058-00000000019aa7eb.snap
member/wal/
member/wal/0000000000000142-0000000001963c0e.wal
member/wal/0000000000000144-0000000001986aa6.wal
member/wal/0000000000000014-0000000000185aba.wal.broken
member/wal/0000000000000145-0000000001995ef6.wal
member/wal/0000000000000146-00000000019a544d.wal
member/wal/1.tmp
member/wal/0000000000000143-0000000001977bbe.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member member.tar
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$mv member.tar /tmp/
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf member/snap/*.snap
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf member/wal/*.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$

Restart the corresponding etcd container in docker, or restart the kubelet.

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
a3b97cb34d9b 004811815584 "etcd --advertise-cl…" 2 minutes ago Exited (2) 2 minutes ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 3 hours ago Up 2 hours k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker start a3b97cb34d9b
a3b97cb34d9b
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
e1fc068247af 004811815584 "etcd --advertise-cl…" 3 seconds ago Up 2 seconds k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_46
a3b97cb34d9b 004811815584 "etcd --advertise-cl…" 3 minutes ago Exited (2) 3 seconds ago k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd registry.aliyuncs.com/google_containers/pause:3.5 "/pause" 3 hours ago Up 2 hours k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$

Check node status

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$kubectl get nodes
NAME STATUS ROLES AGE VERSION
vms155.liruilongs.github.io Ready <none> 76s v1.22.2
vms81.liruilongs.github.io Ready <none> 76s v1.22.2
vms82.liruilongs.github.io Ready <none> 76s v1.22.2
vms83.liruilongs.github.io Ready <none> 76s v1.22.2
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$

List all pods currently in the cluster.

┌──[root@vms81.liruilongs.github.io]-[~/ansible/kubevirt]
└─$kubectl get pods -A
NAME READY STATUS RESTARTS AGE
etcd-vms81.liruilongs.github.io 1/1 Running 48 (3h35m ago) 3h53m
kube-apiserver-vms81.liruilongs.github.io 1/1 Running 48 (3h35m ago) 3h51m
kube-controller-manager-vms81.liruilongs.github.io 1/1 Running 17 (3h35m ago) 3h51m
kube-scheduler-vms81.liruilongs.github.io 1/1 Running 16 (3h35m ago) 3h52m

The networking pods are all gone, and the k8s DNS component didn't come up either, so the network has to be reconfigured, which is a hassle. Normally, if the networking components are down, all nodes should show NotReady, yet here they show Ready, which feels a bit off. Since I was short on time and needed the cluster for experiments, I ended up resetting it with kubeadm.

┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$kubectl apply -f calico.yaml
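
Besides reapplying the CNI as above, the lost cluster DNS and kube-proxy add-ons can in principle be recreated through kubeadm's addon phase. A sketch only; I did not go down this road and reset the cluster instead.

```
# Sketch: recreate the kubeadm add-ons that disappeared with the etcd data
kubeadm init phase addon coredns
kubeadm init phase addon kube-proxy
```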

References


https://github.com/etcd-io/etcd/issues/11949

Original post: https://liruilongs.github.io/2023/01/19/K8s/etcd/虚机强制断电K8s-集群-etcd-pod挂掉-问题解决/

Published: 2023-01-19

Updated: 2023-06-21
