What I longed for was only to live out what was welling up inside me of its own accord. Why was that so very difficult? (Hermann Hesse, Demian)
Before we begin

If your cluster's power supply is unstable, or nodes keep crashing, make sure you have backups: etcd snapshot files are easily damaged in a crash. I reset my cluster many times before I learned how important backups are.
This post covers:
- etcd operations and maintenance basics
- A disaster backup and recovery demo for an etcd cluster running as static Pods
- Writing a scheduled backup task
- A disaster backup and recovery demo for a binary etcd cluster
If my understanding falls short anywhere, corrections are welcome.
etcd overview

etcd is an open-source project started by the CoreOS team in June 2013. Its goal is to build a highly available distributed key-value database. Internally, etcd uses the Raft protocol as its consensus algorithm, and it is implemented in Go.
- Fully replicated: every node in the cluster holds the complete data set
- Highly available: etcd can be used to avoid single points of hardware failure and network problems
- Consistent: every read returns the most recent write across hosts
- Simple: a well-defined, user-facing API (gRPC)
- Secure: automatic TLS with optional client-certificate authentication
- Fast: benchmarked at 10,000 writes per second
- Reliable: the Raft algorithm provides a strongly consistent, highly available store
Basics for operating an etcd cluster:

- Client read/write port: 2379; peer data-sync port: 2380
- An etcd cluster is a distributed system that uses the Raft protocol to keep the state of its member nodes consistent.
- Node roles: Leader, Follower, Candidate
- When the cluster starts up, every node begins as a Follower and synchronizes data with the other nodes via heartbeats.
- Reads can be served by Followers; writes go through the Leader.
- If a Follower receives no heartbeat from the Leader within the election timeout, it switches to the Candidate role and starts a leader election.
- Configure an odd rather than an even number of nodes; 3, 5, or 7 nodes per cluster is recommended.
- Use etcd's built-in backup/restore tooling to back up data from the source deployment and restore it into a new one. Clean out the data directory before restoring.
- snap/ in the data directory holds snapshot data: snapshots etcd takes to keep the WAL from growing without bound; they store the state of the etcd data.
- wal/ in the data directory holds the write-ahead log, whose main job is to record the full history of data changes. In etcd, every modification must be written to the WAL before it is committed.
An etcd cluster should generally not exceed seven nodes, since write performance degrades; running five is recommended. A 5-member cluster tolerates two member failures, and a 3-member cluster tolerates one.
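These tolerance figures follow from Raft's majority-quorum rule: a cluster of n members needs n/2 + 1 votes to commit a write. A tiny shell sketch of the arithmetic (the helper name is made up here):

```shell
# Raft quorum math: a cluster of n members needs a majority
# (n/2 + 1) to commit writes, so it tolerates n - quorum failures.
fault_tolerance() {
  local n=$1
  local quorum=$(( n / 2 + 1 ))
  echo $(( n - quorum ))
}

for n in 1 3 5 7; do
  echo "members=$n tolerates $(fault_tolerance "$n") failure(s)"
done
```

Note that a 4-member cluster tolerates only one failure, the same as 3 members, which is why even sizes buy nothing.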
Common configuration parameters:

- ETCD_NAME: node name; defaults to "default"
- ETCD_DATA_DIR: path where the service's data is stored
- ETCD_LISTEN_PEER_URLS: addresses to listen on for peer traffic, e.g. http://ip:2380; comma-separated if there are several. Every node must be able to reach them, so do not use localhost.
- ETCD_LISTEN_CLIENT_URLS: addresses to listen on for client traffic
- ETCD_ADVERTISE_CLIENT_URLS: the client URLs this node advertises; this value is announced to the other nodes in the cluster
- ETCD_INITIAL_ADVERTISE_PEER_URLS: the peer URLs this node advertises; this value is announced to the other nodes in the cluster
- ETCD_INITIAL_CLUSTER: information about all nodes in the cluster
- ETCD_INITIAL_CLUSTER_STATE: "new" when bootstrapping a fresh cluster; "existing" when joining a cluster that already exists
- ETCD_INITIAL_CLUSTER_TOKEN: the cluster token; when running several clusters, each cluster's token must be unique
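Putting these parameters together, an environment file for a hypothetical member named etcd-100 could look like the sketch below. Every name, address, and token is illustrative, not taken from a real deployment:

```shell
# Write an example etcd environment file for a made-up member
# "etcd-100" of a three-node cluster (all values illustrative).
cat > etcd.conf.example <<'EOF'
ETCD_NAME="etcd-100"
ETCD_DATA_DIR="/var/lib/etcd/cluster.etcd"
ETCD_LISTEN_PEER_URLS="http://192.168.26.100:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.26.100:2379,http://localhost:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.26.100:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.26.100:2380"
ETCD_INITIAL_CLUSTER="etcd-100=http://192.168.26.100:2380,etcd-101=http://192.168.26.101:2380,etcd-102=http://192.168.26.102:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-demo"
EOF
```

The other two members would differ only in ETCD_NAME and the three URLs that carry this node's own address.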
Static-Pod cluster backup and recovery

Single-node etcd backup and recovery

If etcd is deployed as a single node, you can take a physical backup: simply copy the data directory. To restore, copy the backed-up data directory into the location etcd is configured to use. After the restore completes, also restore the etcd.yaml file under /etc/kubernetes/manifests to its original state.
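As a sketch of the physical-backup idea, the helper below archives a data directory into a timestamped tarball. The function name and paths are mine, and it assumes etcd is stopped (or a consistent copy is otherwise guaranteed) while the directory is read:

```shell
# Archive an etcd data directory into a timestamped tarball.
# Usage: backup_etcd_dir <data-dir> <dest-dir>
backup_etcd_dir() {
  local data_dir=$1 dest_dir=$2
  local stamp
  stamp=$(date +%Y%m%d%H%M)
  mkdir -p "$dest_dir"
  # -C so the archive stores relative paths, not /var/lib/etcd/...
  tar -czf "${dest_dir}/etcd-data-${stamp}.tar.gz" \
      -C "$(dirname "$data_dir")" "$(basename "$data_dir")"
  echo "${dest_dir}/etcd-data-${stamp}.tar.gz"
}

# Typical use on a node (illustrative paths):
# backup_etcd_dir /var/lib/etcd /root/back
```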
You can also back up from a snapshot.
Backup command:

┌──[root@vms81.liruilongs.github.io]-[/backup_20230127]
└─$ETCDCTL_API=3 etcdctl --endpoints="https://127.0.0.1:2379" \
  --cert="/etc/kubernetes/pki/etcd/server.crt" \
  --key="/etc/kubernetes/pki/etcd/server.key" \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
  snapshot save snap-$(date +%Y%m%d%H%M).db
Snapshot saved at snap-202301272133.db
Restore command:

┌──[root@vms81.liruilongs.github.io]-[/backup_20230127]
└─$ETCDCTL_API=3 etcdctl snapshot restore ./snap-202301272133.db \
  --name vms81.liruilongs.github.io \
  --cert="/etc/kubernetes/pki/etcd/server.crt" \
  --key="/etc/kubernetes/pki/etcd/server.key" \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
  --initial-advertise-peer-urls=https://192.168.26.81:2380 \
  --initial-cluster="vms81.liruilongs.github.io=https://192.168.26.81:2380" \
  --data-dir=/var/lib/etcd
2023-01-27 21:40:01.193420 I | mvcc: restore compact to 484325
2023-01-27 21:40:01.199682 I | etcdserver/membership: added member cbf506fa2d16c7 [https://192.168.26.81:2380] to cluster 46c9df5da345274b
┌──[root@vms81.liruilongs.github.io]-[/backup_20230127]
└─$
The exact values for these parameters can be read from the etcd static Pod's YAML:
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$kubectl describe pods etcd-vms81.liruilongs.github.io | grep -e "--"
      --advertise-client-urls=https://192.168.26.81:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://192.168.26.81:2380
      --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.168.26.81:2380
      --name=vms81.liruilongs.github.io
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$
Cluster etcd backup and recovery

Cluster member status:
┌──[root@vms100.liruilongs.github.io]-[~/ansible/helm]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 11486647d7f3a17b | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
| e00e3877df8f76f4 | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
┌──[root@vms100.liruilongs.github.io]-[~/ansible/helm]
Version and leader information:
┌──[root@vms100.liruilongs.github.io]-[~/ansible/kubescape]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 |  ee392e5273e89e2 |  3.5.4  |  37 MB  |   false   |    100    |  3152364   |
| https://192.168.26.102:2379 | 11486647d7f3a17b |  3.5.4  |  36 MB  |   false   |    100    |  3152364   |
| https://192.168.26.101:2379 | e00e3877df8f76f4 |  3.5.4  |  36 MB  |   true    |    100    |  3152364   |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~/ansible/kubescape]
└─$
In a cluster, backing up on a single node is enough: as noted above, an etcd cluster is fully replicated, so a single-node backup covers the whole data set.
┌──[root@vms100.liruilongs.github.io]-[~]
└─$yum -y install etcd
If the etcdctl tool is missing, install etcd, or copy the binary over from somewhere else. Here we install it, then copy etcdctl to the other cluster nodes.
Backup:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ENDPOINT=https://127.0.0.1:2379
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" snapshot save snapshot.db
Snapshot saved at snapshot.db
Verify the snapshot hash:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshot.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 46aa26ed |  217504  |    2711    |   27 MB    |
+----------+----------+------------+------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
Recovery

Here the etcd cluster is deployed in stacked mode: etcd runs as a static Pod on each control-plane node.

Always take a backup first. Before restoring, back up and then clear out the old data files, and make sure etcd and kube-apiserver are both stopped. Get the necessary parameters:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl describe pod etcd-vms100.liruilongs.github.io -n kube-system | grep -e '--'
      --advertise-client-urls=https://192.168.26.100:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --experimental-initial-corrupt-check=true
      --experimental-watch-progress-notify-interval=5s
      --initial-advertise-peer-urls=https://192.168.26.100:2380
      --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.168.26.100:2380
      --name=vms100.liruilongs.github.io
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
When restoring, stop the kube-apiserver and etcd static Pods on every master node. kubelet rescans the manifests directory every 20 s to detect static Pod changes, so moving the YAML files out of it is enough to stop them.

Here this is done with Ansible, run across all the cluster's master nodes:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "mv /etc/kubernetes/manifests/etcd.yaml /tmp/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/" -i host.yaml
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
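The manifest-move trick above can be wrapped in a pair of small helpers for a single node. The function names are made up, and the directories default to the kubeadm conventions used in this post:

```shell
# Stop/start a static Pod by moving its manifest out of (or back into)
# the directory kubelet watches (default: /etc/kubernetes/manifests).
stop_static_pod() {
  local name=$1 manifests=${2:-/etc/kubernetes/manifests} hold=${3:-/tmp}
  mv "${manifests}/${name}.yaml" "${hold}/${name}.yaml"
}

start_static_pod() {
  local name=$1 manifests=${2:-/etc/kubernetes/manifests} hold=${3:-/tmp}
  mv "${hold}/${name}.yaml" "${manifests}/${name}.yaml"
}

# e.g. on a control-plane node:
#   stop_static_pod etcd
#   stop_static_pod kube-apiserver
```

Remember that kubelet reacts on its next scan of the directory, so the Pod takes up to the scan interval to actually stop.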
Confirm that the static Pod YAML files were in fact moved:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
haproxy.yaml
keepalived.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.100 | CHANGED | rc=0 >>
haproxy.yaml
keepalived.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
haproxy.yaml
keepalived.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Clear the etcd data directory on every cluster node:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "rm -rf /var/lib/etcd/" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'. If you need to use command because file is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
Copy the snapshot backup file to every cluster node:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m copy -a "src=snap-202302070000.db dest=/root/" -i host.yaml
Restore on vms100.liruilongs.github.io:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl snapshot restore snap-202302070000.db \
  --name vms100.liruilongs.github.io \
  --cert="/etc/kubernetes/pki/etcd/server.crt" \
  --key="/etc/kubernetes/pki/etcd/server.key" \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
  --endpoints="https://127.0.0.1:2379" \
  --initial-advertise-peer-urls="https://192.168.26.100:2380" \
  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
  --data-dir=/var/lib/etcd
2023-02-08 12:50:27.598250 I | mvcc: restore compact to 2837993
2023-02-08 12:50:27.609440 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2023-02-08 12:50:27.609480 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2023-02-08 12:50:27.609487 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
Restore on vms101.liruilongs.github.io:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ssh 192.168.26.101
Last login: Wed Feb  8 12:48:31 2023 from 192.168.26.100
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl snapshot restore snap-202302070000.db --name vms101.liruilongs.github.io --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380" --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd
2023-02-08 12:52:21.976748 I | mvcc: restore compact to 2837993
2023-02-08 12:52:21.991588 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2023-02-08 12:52:21.991622 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2023-02-08 12:52:21.991629 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
Restore on vms102.liruilongs.github.io:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ssh 192.168.26.102
Last login: Wed Feb  8 12:48:31 2023 from 192.168.26.100
┌──[root@vms102.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl snapshot restore snap-202302070000.db --name vms102.liruilongs.github.io --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380" --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd
2023-02-08 12:53:32.338663 I | mvcc: restore compact to 2837993
2023-02-08 12:53:32.354619 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2023-02-08 12:53:32.354782 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2023-02-08 12:53:32.354790 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms102.liruilongs.github.io]-[~]
└─$
Once the restore is done, move the etcd and kube-apiserver static Pod manifests back:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/" -i host.yaml
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml" -i host.yaml
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Confirm the move succeeded:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
Check the etcd cluster from any node; the recovery succeeded:
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$kubectl get pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 |  ee392e5273e89e2 |  3.5.4  |  37 MB  |   false   |     2     |    146     |
| https://192.168.26.101:2379 | 70059e836d19883d |  3.5.4  |  37 MB  |   true    |     2     |    146     |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 |  3.5.4  |  37 MB  |   false   |     2     |    146     |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
Problems encountered: if a node logs the error below, or fails to join the cluster (only two members come up), repeat the steps above for that node.

panic: tocommit(258) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
While handling that problem, the cluster looked like this, with only two members up:

┌──[root@vms100.liruilongs.github.io]-[~/back]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 |  ee392e5273e89e2 |  3.5.4  |  37 MB  |   true    |     2     |   85951    |
| https://192.168.26.101:2379 | 70059e836d19883d |  3.5.4  |  37 MB  |   false   |     2     |   85951    |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
Writing a scheduled backup task

The scheduled backup here uses a systemd.service unit plus a systemd.timer unit to run the etcd_back.sh backup script periodically, enabled at boot. It is simple enough:
┌──[root@vms81.liruilongs.github.io]-[~/back]
└─$systemctl cat etcd-backup
[Unit]
Description="ETCD 备份"
After=network-online.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh

[Install]
WantedBy=multi-user.target
Run once a day, at midnight:
┌──[root@vms81.liruilongs.github.io]-[~/back]
└─$systemctl cat etcd-backup.timer
[Unit]
Description="每天备份一次 ETCD"

[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service

[Install]
WantedBy=multi-user.target
The backup script:
┌──[root@vms100.liruilongs.github.io]-[~/ansible/backup]
└─$cat etcd_back.sh
if [ ! -d /root/back/ ];then
  mkdir -p /root/back/
fi
STR_DATE=$(date +%Y%m%d%H%M)

ETCDCTL_API=3 etcdctl \
  --endpoints="https://127.0.0.1:2379" \
  --cert="/etc/kubernetes/pki/etcd/server.crt" \
  --key="/etc/kubernetes/pki/etcd/server.key" \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
  snapshot save /root/back/snap-${STR_DATE}.db

ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-${STR_DATE}.db
sudo chmod o-w,u-w,g-w /root/back/snap-${STR_DATE}.db
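A script like this keeps every snapshot forever. If disk use matters, a retention step could be appended to it; this is only a sketch, with an arbitrary seven-day window that is not part of the original script:

```shell
# Delete snapshot files older than a retention window.
# Usage: prune_old_snapshots [backup-dir] [retention-days]
prune_old_snapshots() {
  local backup_dir=${1:-/root/back} retention_days=${2:-7}
  # -maxdepth 1 keeps the scan to the backup dir itself;
  # -print logs what gets deleted.
  find "$backup_dir" -maxdepth 1 -name 'snap-*.db' \
       -mtime +"$retention_days" -print -delete
}

# e.g. as the last line of etcd_back.sh:
# prune_old_snapshots /root/back 7
```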
Deploying the service and timer:
┌──[root@vms100.liruilongs.github.io]-[~/ansible/backup]
└─$cat deply.sh
cp ./* /usr/lib/systemd/system/
systemctl enable etcd-backup.timer --now
systemctl enable etcd-backup.service --now
ls /root/back/
Checking the logs:
┌──[root@vms100.liruilongs.github.io]-[~/ansible/backup]
└─$journalctl -u etcd-backup.service -o cat
...................
Starting "ETCD 备份"...
Snapshot saved at /root/back/snap-202301290120.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 74323316 |  640319  |    2250    |   27 MB    |
+----------+----------+------------+------------+
Started "ETCD 备份".
Starting "ETCD 备份"...
Snapshot saved at /root/back/snap-202301290120.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| e75a16bf |  640325  |    2255    |   27 MB    |
+----------+----------+------------+------------+
Started "ETCD 备份".
Starting "ETCD 备份"...
Snapshot saved at /root/back/snap-202301290121.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| eb5e9e86 |  640388  |    2318    |   27 MB    |
+----------+----------+------------+------------+
Started "ETCD 备份".
Starting "ETCD 备份"...
Snapshot saved at /root/back/snap-202301290121.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 30a91bb6 |  640402  |    2333    |   27 MB    |
+----------+----------+------------+------------+
Started "ETCD 备份".
Binary cluster backup and recovery

Backup and recovery for a binary etcd cluster is mostly the same as for the static-Pod deployment. The difference is the recovery flow used below: first restore two nodes to form a cluster, then join the third node to it. Current cluster state:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -m shell -a "etcdctl member list"
192.168.26.101 | CHANGED | rc=0 >>
2fd4f9ba70a04579: name=etcd-102 peerURLs=http://192.168.26.102:2380 clientURLs=http://192.168.26.102:2379,http://localhost:2379 isLeader=false
6f2038a018db1103: name=etcd-100 peerURLs=http://192.168.26.100:2380 clientURLs=http://192.168.26.100:2379,http://localhost:2379 isLeader=false
bd330576bb637f25: name=etcd-101 peerURLs=http://192.168.26.101:2380 clientURLs=http://192.168.26.101:2379,http://localhost:2379 isLeader=true
192.168.26.102 | CHANGED | rc=0 >>
2fd4f9ba70a04579: name=etcd-102 peerURLs=http://192.168.26.102:2380 clientURLs=http://192.168.26.102:2379,http://localhost:2379 isLeader=false
6f2038a018db1103: name=etcd-100 peerURLs=http://192.168.26.100:2380 clientURLs=http://192.168.26.100:2379,http://localhost:2379 isLeader=false
bd330576bb637f25: name=etcd-101 peerURLs=http://192.168.26.101:2380 clientURLs=http://192.168.26.101:2379,http://localhost:2379 isLeader=true
192.168.26.100 | CHANGED | rc=0 >>
2fd4f9ba70a04579: name=etcd-102 peerURLs=http://192.168.26.102:2380 clientURLs=http://192.168.26.102:2379,http://localhost:2379 isLeader=false
6f2038a018db1103: name=etcd-100 peerURLs=http://192.168.26.100:2380 clientURLs=http://192.168.26.100:2379,http://localhost:2379 isLeader=false
bd330576bb637f25: name=etcd-101 peerURLs=http://192.168.26.101:2380 clientURLs=http://192.168.26.101:2379,http://localhost:2379 isLeader=true
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$
Write some test data:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.100 -a "etcdctl put name liruilong"
192.168.26.100 | CHANGED | rc=0 >>
OK
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -a "etcdctl get name"
192.168.26.102 | CHANGED | rc=0 >>
name
liruilong
192.168.26.100 | CHANGED | rc=0 >>
name
liruilong
192.168.26.101 | CHANGED | rc=0 >>
name
liruilong
Take an etcd snapshot on any one host:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -a "etcdctl snapshot save snap20211010.db"
192.168.26.101 | CHANGED | rc=0 >>
Snapshot saved at snap20211010.db
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$
This snapshot contains the name=liruilong key just written. Copy the snapshot file to all nodes:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -a "scp /root/snap20211010.db root@192.168.26.100:/root/"
192.168.26.101 | CHANGED | rc=0 >>
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -a "scp /root/snap20211010.db root@192.168.26.102:/root/"
192.168.26.101 | CHANGED | rc=0 >>
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$
Delete the data on all nodes:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -a "etcdctl del name"
192.168.26.101 | CHANGED | rc=0 >>
1
192.168.26.102 | CHANGED | rc=0 >>
0
192.168.26.100 | CHANGED | rc=0 >>
0
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$
Stop etcd on all nodes and delete everything under /var/lib/etcd/:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -a "systemctl stop etcd"
192.168.26.100 | CHANGED | rc=0 >>
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -m shell -a "rm -rf /var/lib/etcd/*"
[WARNING]: Consider using the file module with state=absent rather than running 'rm'. If you need to use command because file is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
On all nodes, set the snapshot file's owner and group to etcd:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -a "chown etcd.etcd /root/snap20211010.db"
[WARNING]: Consider using the file module with owner rather than running 'chown'. If you need to use command because file is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.100 | CHANGED | rc=0 >>
192.168.26.102 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
Restore the data on nodes 100 and 101:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.100 -m script -a "./snapshot_restore.sh"
192.168.26.100 | CHANGED => {
    "changed": true,
    "rc": 0,
    "stderr": "Shared connection to 192.168.26.100 closed.\r\n",
    "stderr_lines": [
        "Shared connection to 192.168.26.100 closed."
    ],
    "stdout": "2021-10-10 12:14:30.726021 I | etcdserver/membership: added member 6f2038a018db1103 [http://192.168.26.100:2380] to cluster af623437f584d792\r\n2021-10-10 12:14:30.726234 I | etcdserver/membership: added member bd330576bb637f25 [http://192.168.26.101:2380] to cluster af623437f584d792\r\n",
    "stdout_lines": [
        "2021-10-10 12:14:30.726021 I | etcdserver/membership: added member 6f2038a018db1103 [http://192.168.26.100:2380] to cluster af623437f584d792",
        "2021-10-10 12:14:30.726234 I | etcdserver/membership: added member bd330576bb637f25 [http://192.168.26.101:2380] to cluster af623437f584d792"
    ]
}
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$cat -n ./snapshot_restore.sh
     1
     2
     3
     4
     5  etcdctl snapshot restore /root/snap20211010.db \
     6  --name etcd-100 \
     7  --initial-advertise-peer-urls="http://192.168.26.100:2380" \
     8  --initial-cluster="etcd-100=http://192.168.26.100:2380,etcd-101=http://192.168.26.101:2380" \
     9  --data-dir="/var/lib/etcd/cluster.etcd"
    10
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$sed '6,7s/100/101/g' ./snapshot_restore.sh
etcdctl snapshot restore /root/snap20211010.db \
--name etcd-101 \
--initial-advertise-peer-urls="http://192.168.26.101:2380" \
--initial-cluster="etcd-100=http://192.168.26.100:2380,etcd-101=http://192.168.26.101:2380" \
--data-dir="/var/lib/etcd/cluster.etcd"
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$sed -i '6,7s/100/101/g' ./snapshot_restore.sh
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$cat ./snapshot_restore.sh
etcdctl snapshot restore /root/snap20211010.db \
--name etcd-101 \
--initial-advertise-peer-urls="http://192.168.26.101:2380" \
--initial-cluster="etcd-100=http://192.168.26.100:2380,etcd-101=http://192.168.26.101:2380" \
--data-dir="/var/lib/etcd/cluster.etcd"
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -m script -a "./snapshot_restore.sh"
192.168.26.101 | CHANGED => {
    "changed": true,
    "rc": 0,
    "stderr": "Shared connection to 192.168.26.101 closed.\r\n",
    "stderr_lines": [
        "Shared connection to 192.168.26.101 closed."
    ],
    "stdout": "2021-10-10 12:20:26.032754 I | etcdserver/membership: added member 6f2038a018db1103 [http://192.168.26.100:2380] to cluster af623437f584d792\r\n2021-10-10 12:20:26.032930 I | etcdserver/membership: added member bd330576bb637f25 [http://192.168.26.101:2380] to cluster af623437f584d792\r\n",
    "stdout_lines": [
        "2021-10-10 12:20:26.032754 I | etcdserver/membership: added member 6f2038a018db1103 [http://192.168.26.100:2380] to cluster af623437f584d792",
        "2021-10-10 12:20:26.032930 I | etcdserver/membership: added member bd330576bb637f25 [http://192.168.26.101:2380] to cluster af623437f584d792"
    ]
}
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$
On all nodes, change the owner and group of /var/lib/etcd and its contents to etcd:etcd, then start etcd on each node. Node 102 fails to start, which is expected: it is not yet a member of the restored two-node cluster.
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -a "chown -R etcd.etcd /var/lib/etcd/"
[WARNING]: Consider using the file module with owner rather than running 'chown'. If you need to use command because file is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.100 | CHANGED | rc=0 >>
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.102 | CHANGED | rc=0 >>
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -a "systemctl start etcd"
192.168.26.102 | FAILED | rc=1 >>
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
non-zero return code
192.168.26.101 | CHANGED | rc=0 >>
192.168.26.100 | CHANGED | rc=0 >>
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$
Add the remaining node, 102, to the cluster:
[root@vms100 cluster.etcd]
Member fbd8a96cbf1c004d added to cluster af623437f584d792

ETCD_NAME="etcd-102"
ETCD_INITIAL_CLUSTER="etcd-100=http://192.168.26.100:2380,etcd-101=http://192.168.26.101:2380,etcd-102=http://192.168.26.102:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.26.102:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
[root@vms100 cluster.etcd]
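The prompt lines above lost the actual command during extraction. With the v2-style etcdctl used in this demo it was presumably a member add along these lines; the sketch only rebuilds the command string from the output shown, so treat it as a reconstruction, not a transcript:

```shell
# Reconstructed (not captured) command that likely produced the
# "Member ... added" output above; name and peer URL come from it.
NEW_NAME="etcd-102"
NEW_PEER_URL="http://192.168.26.102:2380"
MEMBER_ADD_CMD="etcdctl member add ${NEW_NAME} ${NEW_PEER_URL}"
echo "$MEMBER_ADD_CMD"   # run on an existing member, e.g. vms100
```

The ETCD_INITIAL_CLUSTER_STATE="existing" lines it prints are exactly the settings the new node needs in its own config before starting.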
Test the recovery result:
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.102 -m copy -a "src=./etcd.conf dest=/etc/etcd/etcd.conf force=yes"
192.168.26.102 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": false,
    "checksum": "2d8fa163150e32da563f5e591134b38cc356d237",
    "dest": "/etc/etcd/etcd.conf",
    "gid": 0,
    "group": "root",
    "mode": "0644",
    "owner": "root",
    "path": "/etc/etcd/etcd.conf",
    "size": 574,
    "state": "file",
    "uid": 0
}
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.102 -m shell -a "systemctl enable etcd --now"
192.168.26.102 | CHANGED | rc=0 >>
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -m shell -a "etcdctl member list"
192.168.26.101 | CHANGED | rc=0 >>
6f2038a018db1103, started, etcd-100, http://192.168.26.100:2380, http://192.168.26.100:2379,http://localhost:2379
bd330576bb637f25, started, etcd-101, http://192.168.26.101:2380, http://192.168.26.101:2379,http://localhost:2379
fbd8a96cbf1c004d, started, etcd-102, http://192.168.26.102:2380, http://192.168.26.102:2379,http://localhost:2379
192.168.26.100 | CHANGED | rc=0 >>
6f2038a018db1103, started, etcd-100, http://192.168.26.100:2380, http://192.168.26.100:2379,http://localhost:2379
bd330576bb637f25, started, etcd-101, http://192.168.26.101:2380, http://192.168.26.101:2379,http://localhost:2379
fbd8a96cbf1c004d, started, etcd-102, http://192.168.26.102:2380, http://192.168.26.102:2379,http://localhost:2379
192.168.26.102 | CHANGED | rc=0 >>
6f2038a018db1103, started, etcd-100, http://192.168.26.100:2380, http://192.168.26.100:2379,http://localhost:2379
bd330576bb637f25, started, etcd-101, http://192.168.26.101:2380, http://192.168.26.101:2379,http://localhost:2379
fbd8a96cbf1c004d, started, etcd-102, http://192.168.26.102:2380, http://192.168.26.102:2379,http://localhost:2379
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$ansible etcd -a "etcdctl get name"
192.168.26.102 | CHANGED | rc=0 >>
name
liruilong
192.168.26.101 | CHANGED | rc=0 >>
name
liruilong
192.168.26.100 | CHANGED | rc=0 >>
name
liruilong
┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$
Parts of this post draw on the references below. Content at the links is copyright of its original authors; please let me know if anything infringes.
https://etcd.io/docs/v3.5/faq/
https://etcd.io/docs/v3.6/op-guide/recovery/#restoring-a-cluster
https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/configure-upgrade-etcd/
https://docs.vmware.com/en/VMware-Application-Catalog/services/tutorials/GUID-backup-restore-data-etcd-kubernetes-index.html
https://github.com/etcd-io/etcd/issues/13509
© 2018-present liruilonger@gmail.com, all rights reserved. Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)