How to recover when etcd has gone down on every master node of a highly available K8s cluster?

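There is no need to dwell too much on the present or worry too much about the future. Once you have lived through certain things, the scenery before you is no longer what it used to be. (Haruki Murakami)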

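Preface


  • This post is a demo of restoring a cluster from a backup file after every ETCD member has gone down
  • If I have misunderstood anything, corrections are welcome :) Best of luck to everyone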



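The prerequisite is an etcd backup file. Without an etcd snapshot, or a backup taken by some other means, the cluster is most likely unrecoverable.

Backing up etcd

Here is a backup script: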

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat /usr/lib/systemd/system/etcd_back.sh
#!/bin/bash

#@File : etcd_back.sh
#@Time : 2023/01/27 23:00:27
#@Author : Li Ruilong
#@Version : 1.0
#@Desc : ETCD backup
#@Contact : 1224965096@qq.com

# Create the backup directory if it does not exist yet
if [ ! -d /root/back/ ];then
    mkdir -p /root/back/
fi
# Timestamp (to the minute) used in the snapshot file name
STR_DATE=$(date +%Y%m%d%H%M)

# Save a snapshot from the local etcd member
ETCDCTL_API=3 etcdctl \
--endpoints="https://127.0.0.1:2379" \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
snapshot save /root/back/snap-${STR_DATE}.db

# Print the hash, revision, key count and size of the new snapshot
ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-${STR_DATE}.db

# Make the snapshot read-only
sudo chmod o-w,u-w,g-w /root/back/snap-${STR_DATE}.db

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
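A daily snapshot of roughly 88 MB adds up over time, and the script above never removes old files. The following is a minimal sketch of a retention step, assuming that keeping only the newest few snapshots is acceptable. The function name `prune_snapshots` and the retention count are hypothetical and not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper, not part of the original script: keep only the newest
# KEEP snapshots in a backup directory and delete the rest.
prune_snapshots() {
    local dir="$1" keep="$2"
    # File names embed a fixed-width timestamp (snap-YYYYmmddHHMM.db), so a
    # reverse lexical sort lists the newest first; lines after KEEP are old.
    find "$dir" -maxdepth 1 -type f -name 'snap-*.db' | sort -r |
    tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
        rm -f -- "$old"
    done
}
```

Called as `prune_snapshots /root/back 7` at the end of the backup script, this would keep one week of daily snapshots.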

Running it:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh
Snapshot saved at /root/back/snap-202406051145.db
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 7b00ddcf | 22243784 | 5999 | 88 MB |
+----------+----------+------------+------------+

This produces the corresponding snapshot files:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202*
.....
-r--r--r-- 1 root root 87515168 6月 5 11:45 /root/back/snap-202406051144.db
-r--r--r-- 1 root root 87515168 6月 5 11:45 /root/back/snap-202406051145.db

The script can be wrapped in a systemd service unit:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
[Unit]
Description= "ETCD 备份"
After=network-online.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh


[Install]
WantedBy=multi-user.target

The main benefit is that the logs are easy to inspect and the job is easy to manage:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- No entries --
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl start etcd-backup.service
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- Logs begin at 三 2024-06-05 03:49:25 CST, end at 三 2024-06-05 11:49:08 CST. --
6月 05 11:49:04 vms100.liruilongs.github.io systemd[1]: Starting "ETCD 备份"...
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: Snapshot saved at /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | 1ce12bf7 | 22244346 | 3753 | 88 MB |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io sudo[4344]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/chmod o-w,u-w,g-w /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io systemd[1]: Started "ETCD 备份".

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202406051*
........................
-r--r--r-- 1 root root 87515168 6月 5 11:49 /root/back/snap-202406051149.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Then a timer unit schedules the service to run periodically:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
[Unit]
Description="每天备份一次 ETCD"

[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service

[Install]
WantedBy=multi-user.target

The timer's logs can be viewed the same way:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.timer
-- No entries --
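A timer only fires once it has been enabled and started, and the empty journal above may simply mean that this has not happened yet. The following is a minimal sketch of that step, using the unit names from the listings. Whether the timer was already enabled on the author's machine is not known, and the function name is hypothetical.

```bash
#!/bin/bash
# Sketch: activate the timer defined above. Unit names come from the listings;
# that the timer still needs enabling is an assumption.
enable_backup_timer() {
    systemctl daemon-reload                     # pick up new or edited unit files
    systemctl enable --now etcd-backup.timer    # start it now and on every boot
    systemctl list-timers etcd-backup.timer     # show the next scheduled run
}
```

Once the timer is active, `systemctl list-timers` should show a NEXT time at midnight, matching the `OnCalendar` setting.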

Failure handling and recovery

Symptom: the whole cluster is down, and etcd and the apiserver have died on every master.

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?

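Move the static Pod YAML files for etcd and the apiserver out of the manifests directory so that the kubelet stops both Pods. How static Pods work is not covered here; interested readers can consult the official documentation.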

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "mv /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml} /tmp/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
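After the manifests are moved, the kubelet needs a moment to stop the containers, and touching the data directory before etcd has exited is risky. The following is a minimal sketch of a wait loop, assuming `crictl` is installed and configured on each master (the post does not say which container runtime is in use). The function name and the polling interval are hypothetical.

```bash
#!/bin/bash
# Sketch (assumes crictl is available on the node).
# Poll until no running container matches NAME, or give up after TRIES polls.
wait_container_gone() {
    local name="$1" tries="${2:-30}" i=0
    while [ "$i" -lt "$tries" ]; do
        # crictl ps lists running containers only; -q prints just their IDs
        if [ -z "$(crictl ps --name "$name" -q)" ]; then
            return 0
        fi
        sleep 2
        i=$((i + 1))
    done
    return 1
}
```

On a master node this could be used as `wait_container_gone etcd && wait_container_gone kube-apiserver` before moving on to the next step.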

Clear the etcd data files and directories of the current cluster:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "rm -rf /var/lib/etcd/*" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'. If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /var/lib/etcd/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Copy the backup file to every etcd node of the current cluster:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m copy -a "src=/root/back/snap-202403270000.db dest=/root/" -i host.yaml
192.168.26.100 | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
"dest": "/root/snap-202403270000.db",
"gid": 0,
"group": "root",
"md5sum": "6489d7243f636086816ac13aa69ceb44",
"mode": "0644",
"owner": "root",
"size": 87515168,
"src": "/root/.ansible/tmp/ansible-tmp-1717557132.87-95740-233443993764822/source",
"state": "file",
"uid": 0
}
192.168.26.101 | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
"dest": "/root/snap-202403270000.db",
"gid": 0,
"group": "root",
"md5sum": "6489d7243f636086816ac13aa69ceb44",
"mode": "0644",
"owner": "root",
"size": 87515168,
"src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95742-263013169057776/source",
"state": "file",
"uid": 0
}
192.168.26.102 | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
"dest": "/root/snap-202403270000.db",
"gid": 0,
"group": "root",
"md5sum": "6489d7243f636086816ac13aa69ceb44",
"mode": "0644",
"owner": "root",
"size": 87515168,
"src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95744-205050494494041/source",
"state": "file",
"uid": 0
}
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Confirm that the backup file is present on each node:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /root/snap-202403270000.db" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.101 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.100 | CHANGED | rc=0 >>
/root/snap-202403270000.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Run the restore command on one of the nodes:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$vim etcd_break.sh
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
Error: data-dir "/var/lib/etcd" exists

The error says the data directory already exists, so the directory itself has to be removed as well:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "rm -rf /var/lib/etcd" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'. If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
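Deleting `/var/lib/etcd` is irreversible. If disk space allows, renaming the directory also satisfies the requirement that the data directory must not exist, and it keeps the old files around until the restore is confirmed. The following is a minimal sketch of that alternative. The function name and the `.bak-<timestamp>` suffix are hypothetical choices for this example.

```bash
#!/bin/bash
# Sketch: rename the data directory instead of deleting it, so the old files
# stay available until the restore is confirmed. The ".bak-<timestamp>" suffix
# is an arbitrary choice for this example.
move_aside() {
    local dir="${1%/}"              # strip a trailing slash, if any
    [ -d "$dir" ] || return 0       # nothing to do when it is already gone
    local target="${dir}.bak-$(date +%Y%m%d%H%M%S)"
    # Print the new location so the caller can log or clean it up later
    mv -- "$dir" "$target" && echo "$target"
}
```

Running `move_aside /var/lib/etcd` on each master leaves no `/var/lib/etcd` behind, so the restore can create it fresh.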

The restore command:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat etcd_break.sh
ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db \
--name vms100.liruilongs.github.io \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
--endpoints="https://127.0.0.1:2379" \
--initial-advertise-peer-urls="https://192.168.26.100:2380" \
--initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
--data-dir=/var/lib/etcd

Run it again, and the restore succeeds:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
2024-06-05 11:19:12.114058 I | mvcc: restore compact to 22239463
2024-06-05 11:19:12.137939 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138023 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138055 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7

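Restoring on the other etcd nodes requires two changes to the script: the --name value and the --initial-advertise-peer-urls value must match the node being restored.

Run on node 192.168.26.101: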

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms101.liruilongs.github.io --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380" --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.101 | CHANGED | rc=0 >>
2024-06-05 11:25:25.557851 I | mvcc: restore compact to 22239463
2024-06-05 11:25:25.614487 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614549 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614574 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Run on node 192.168.26.102:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.102 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms102.liruilongs.github.io --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380" --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.102 | CHANGED | rc=0 >>
2024-06-05 11:30:06.918159 I | mvcc: restore compact to 22239463
2024-06-05 11:30:06.935413 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935460 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935471 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
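The three restore commands above differ only in the member name and the advertised peer URL, so they can be generated from a single member list instead of being edited by hand. The following is a minimal sketch using the node names, IPs and snapshot path from the listings; the function and variable names are hypothetical. The certificate and endpoint flags are left out here on the understanding that `snapshot restore` works on the local snapshot file and does not contact a running server.

```bash
#!/bin/bash
# Sketch: print the restore command for each member. Node names, IPs and the
# snapshot path come from the listings above; the function names are made up.
SNAPSHOT=/root/snap-202403270000.db
MEMBERS="vms100.liruilongs.github.io=192.168.26.100
vms101.liruilongs.github.io=192.168.26.101
vms102.liruilongs.github.io=192.168.26.102"

# Build the --initial-cluster value: name=https://ip:2380, comma-separated
initial_cluster() {
    local out="" name ip
    while IFS='=' read -r name ip; do
        out="${out:+$out,}${name}=https://${ip}:2380"
    done <<EOF
$MEMBERS
EOF
    echo "$out"
}

# Print one restore command per member; only --name and the peer URL vary
restore_commands() {
    local cluster name ip
    cluster=$(initial_cluster)
    while IFS='=' read -r name ip; do
        echo "ETCDCTL_API=3 etcdctl snapshot restore ${SNAPSHOT}" \
             "--name ${name}" \
             "--initial-advertise-peer-urls=https://${ip}:2380" \
             "--initial-cluster=${cluster}" \
             "--data-dir=/var/lib/etcd"
    done <<EOF
$MEMBERS
EOF
}
```

`restore_commands` only prints the commands; each line still has to be run on the matching node, for example through the ansible shell module as shown above.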

Move the static Pod YAML files back to restore the etcd and apiserver Pods:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

Confirm that the static Pod manifests are back in place:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Check the status of the etcd cluster members:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
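`member list` shows membership but not whether each member is actually serving requests. The following is a minimal sketch of a follow-up check, assuming an etcdctl version that supports the `--cluster` flag on the `endpoint` subcommands. The wrapper function names are hypothetical, and the certificate paths are the ones used throughout this post.

```bash
#!/bin/bash
# Sketch: ask every member for its health and status.
# Certificate paths are the ones used throughout the post.
etcd() {
    ETCDCTL_API=3 etcdctl \
        --endpoints="https://127.0.0.1:2379" \
        --cert="/etc/kubernetes/pki/etcd/server.crt" \
        --key="/etc/kubernetes/pki/etcd/server.key" \
        --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
        "$@"
}

check_etcd_cluster() {
    # --cluster queries all endpoints from the member list, not just 127.0.0.1
    etcd endpoint health --cluster &&
    etcd endpoint status --cluster -w table
}
```

After a successful restore, the status table would be expected to show exactly one leader across the three members.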

Confirm that the cluster has recovered:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$kubectl get nodes
NAME STATUS ROLES AGE VERSION
vms100.liruilongs.github.io Ready control-plane 495d v1.25.1
vms101.liruilongs.github.io Ready control-plane 495d v1.25.1
vms102.liruilongs.github.io Ready control-plane 495d v1.25.1
vms103.liruilongs.github.io Ready <none> 495d v1.25.1
vms105.liruilongs.github.io Ready <none> 495d v1.25.1
vms106.liruilongs.github.io Ready <none> 495d v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
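In a larger cluster, reading the node table by eye gets tedious. The following is a minimal sketch that prints only the nodes whose STATUS is not exactly `Ready`. The function name is hypothetical, and a cordoned node (`Ready,SchedulingDisabled`) would also be listed by this strict comparison.

```bash
#!/bin/bash
# Sketch: list nodes whose STATUS column is anything other than "Ready".
not_ready_nodes() {
    # --no-headers drops the header row; column 1 is NAME, column 2 is STATUS
    kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
}
```

Empty output from `not_ready_nodes` corresponds to the all-Ready listing above.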

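References for parts of this post

© Copyright of the linked reference material belongs to its original authors. Please let me know of any infringement :)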


https://etcd.io/docs/v3.5/


© 2018-present liruilonger@gmail.com, licensed under Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)

Published: 2024-06-04

Updated: 2024-11-22
