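kubernetes | Cloud Native | How to gracefully restart and update pods: pod lifecycle management in practice
Preface:
Kubernetes is complex to manage and maintain in every respect: pods, services, users (RBAC), networking, and so on. Installing and deploying a cluster is therefore only the first step of a long march; the operations and maintenance work that follows matters far more.
So what does the lifecycle of a pod mean, and how does it relate to operations such as restarts and updates? Going further: what does "graceful" mean, what is the benefit of a graceful restart or update, and how do you achieve one?
These are the questions this article sets out to answer, in as much detail as it can. If anything here is wrong, please go easy on me (with water or with fire).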
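I.
The pod lifecycle
A pod is the most basic unit of work in Kubernetes and represents one running instance of an application. Over its lifetime a pod moves through a series of phases: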
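- Pending: the pod has been accepted by the cluster, but one or more of its containers is not yet ready to run. This covers the time spent waiting to be scheduled and pulling images.
- Running: the pod has been bound to a node and all of its containers have been created; at least one is running, starting, or restarting.
- Succeeded: all containers in the pod have terminated successfully and will not be restarted.
- Failed: all containers in the pod have terminated, and at least one of them terminated in failure.
- Unknown: the state of the pod cannot be determined, usually because the control plane cannot communicate with the node the pod should be running on.
- Terminating: shown by kubectl while a pod is being deleted, typically an old pod on its way out. It is not a phase of its own, and the pod should be treated as unavailable.
Beyond these, the STATUS column of kubectl can show a dozen or so other values, such as OutOfcpu, but the ones above are the ones you will meet most often.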
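To see the difference between a phase and what kubectl prints in the STATUS column, the phase can be read straight from the pod object. A minimal sketch: the helper name pod_phase is my own invention, and it assumes kubectl is already configured for the cluster.

```shell
# Print the value stored in .status.phase of one pod.
# Usage: pod_phase <pod-name> [namespace]   (namespace defaults to "default")
pod_phase() {
  kubectl get pod "$1" --namespace "${2:-default}" -o jsonpath='{.status.phase}'
}
```

A pod that kubectl lists as Terminating will usually still report Running here, because Terminating is derived from the deletion timestamp and not from the phase.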
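Pod creation:
1. On receiving a request to create a pod, the API Server builds a runtime pod object from the parameters the user submitted.
2. It checks that the namespace in the request context matches the namespace in the pod's metadata; if they differ, creation fails.
3. Once the namespaces match, it injects system data into the pod object. If the pod has no name but sets metadata.generateName, the API Server generates a name from that prefix (on current versions it does not fall back to the pod's uid).
4. The API Server then checks the required fields of the pod object; if any is empty, creation fails.
5. With that done, it persists the object in etcd, wraps the result of the asynchronous call in a restful.response, and returns it to the caller.
6. The API Server's part is now complete and the rest is up to the scheduler and the kubelet. At this point the pod is Pending.
7. The scheduler picks the best node.
8. The kubelet on that node starts the pod.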
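Pod deletion:
1. The user issues a command to delete the pod.
2. The pod is marked "Terminating". The following then happen at roughly the same time:
   - The kubelet sees the "Terminating" pod and starts shutting it down.
   - The endpoints controller sees the pod shutting down and removes it from the endpoints list of every service that matches it.
   - The pod runs whatever its preStop hook defines, after which its containers are sent SIGTERM.
3. When the grace period (30 seconds by default) ends, any process that is still running receives SIGKILL.
4. The kubelet asks the API Server to set the pod's grace period to 0, which completes the deletion.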
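The preStop hook and the grace period from the steps above are both set in the pod spec. The following is only a sketch: the file name, the pod name and the 5-second sleep are assumptions for illustration, not values taken from this cluster.

```shell
# Write a hypothetical manifest to the current directory.
cat > graceful-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: graceful-nginx
spec:
  terminationGracePeriodSeconds: 30   # the default, written out for clarity
  containers:
  - name: nginx
    image: nginx:1.18
    lifecycle:
      preStop:
        exec:
          # Keep serving for a moment so the endpoint removal can propagate
          command: ["sh", "-c", "sleep 5"]
EOF

# On a cluster it would then be applied with:
#   kubectl apply -f graceful-pod.yaml
```

The time spent in preStop counts against the grace period, so the sleep has to stay well below terminationGracePeriodSeconds.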
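The mechanics of the pod lifecycle are fairly involved and the description above is only a rough sketch; the lower-level details are outside the scope of this article. For our purposes, everything from the creation of a pod to its final deletion or reclamation is one lifecycle. A pod can pass through many states along the way, so it is wrong to think of a pod as something that is created and then simply waits to be deleted.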
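II.
Managing the pod lifecycle
Kubernetes is an automated container management platform, so once pods are deployed we want them to stay in a stable state. In practice that means Running: apart from Job-type or init-type pods, which are meant to run to completion, any other state is essentially unacceptable. The goal is therefore simple: keep the pods Running.
The following real example illustrates the point.
Kubernetes version: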
[root@node1 ~]# kubectl get no -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node1 Ready control-plane,master 117d v1.23.16 192.168.123.11 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://20.10.8
node2 Ready control-plane,master 117d v1.23.16 192.168.123.12 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://20.10.8
node3 Ready control-plane,master 117d v1.23.16 192.168.123.13 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://20.10.8
node4 Ready worker 117d v1.23.16 192.168.123.14 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://20.10.8
A deployment named nginx was created, then scaled to two replicas and then to three:
kubectl create deployment nginx --image=nginx:1.18
kubectl scale deployment nginx --replicas=2
kubectl scale deployment nginx --replicas=3
Create a NodePort service to expose the backend. The query shows that 30353 is the external port:
kubectl expose deployment nginx --type=NodePort --port=80 --target-port=80
[root@node1 ~]# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 117d
nginx NodePort 10.96.24.248 <none> 80:30353/TCP 36s
The service can now be watched with the watch command:
watch curl -I http://192.168.123.11:30353/
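watch only shows the latest response, so a brief failure is easy to miss. A small loop that prints one status code per second leaves a record that can be read afterwards. This is a sketch; the function name probe_service is hypothetical, and the URL in the usage comment is the NodePort address used above.

```shell
# Print one line per probe: wall-clock time and HTTP status code.
# curl reports 000 when no HTTP response was received at all.
# Usage: probe_service <url> <number-of-probes>
probe_service() {
  url=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    code=$(curl -s -o /dev/null -m 2 -w '%{http_code}' "$url")
    echo "$(date +%T) $code"
    i=$((i + 1))
    sleep 1
  done
}

# For example, probe for one minute while an update is running:
#   probe_service http://192.168.123.11:30353/ 60
```

Any line that does not end in 200 marks a second in which the service did not answer normally.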
So far everything is normal. Now let's update the image of this deployment to 1.20.1. What happens?
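The update itself can be issued with kubectl set image and then followed with kubectl rollout status. A minimal sketch: the helper name update_image and the 120-second timeout are my assumptions, and the container is taken to be named nginx, as the "Created container nginx" events suggest.

```shell
# Change one container image of a deployment, then block until the rollout finishes.
# Usage: update_image <deployment> <container> <image>
update_image() {
  kubectl set image "deployment/$1" "$2=$3" &&
    kubectl rollout status "deployment/$1" --timeout=120s
}

# On the cluster above this would be called as:
#   update_image nginx nginx nginx:1.20.1
```

kubectl rollout status exits with a non-zero code if the rollout does not complete within the timeout, which makes the helper usable in scripts.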
The output of kubectl get events shows that changing the image with kubectl set image did not affect the service; there was no interruption:
<invalid> Normal SuccessfulCreate replicaset/nginx-648458674d Created pod: nginx-648458674d-gldmc
<invalid> Normal Scheduled pod/nginx-648458674d-gldmc Successfully assigned default/nginx-648458674d-gldmc to node4
<invalid> Normal Pulling pod/nginx-648458674d-gldmc Pulling image "nginx:1.20.1"
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Pulled pod/nginx-648458674d-gldmc Successfully pulled image "nginx:1.20.1" in 55.174012251s (55.174015279s including waiting)
<invalid> Normal Created pod/nginx-648458674d-gldmc Created container nginx
<invalid> Normal Started pod/nginx-648458674d-gldmc Started container nginx
<invalid> Normal ScalingReplicaSet deployment/nginx Scaled down replica set nginx-6888c79454 to 2
<invalid> Normal SuccessfulDelete replicaset/nginx-6888c79454 Deleted pod: nginx-6888c79454-g24tx
<invalid> Normal Killing pod/nginx-6888c79454-g24tx Stopping container nginx
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled up replica set nginx-648458674d to 2
<invalid> Normal SuccessfulCreate replicaset/nginx-648458674d Created pod: nginx-648458674d-kfgb4
<invalid> Normal Scheduled pod/nginx-648458674d-kfgb4 Successfully assigned default/nginx-648458674d-kfgb4 to node4
<invalid> Normal Pulled pod/nginx-648458674d-kfgb4 Container image "nginx:1.20.1" already present on machine
<invalid> Normal Created pod/nginx-648458674d-kfgb4 Created container nginx
<invalid> Normal Started pod/nginx-648458674d-kfgb4 Started container nginx
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled down replica set nginx-6888c79454 to 1
<invalid> Normal SuccessfulDelete replicaset/nginx-6888c79454 Deleted pod: nginx-6888c79454-dhhts
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled up replica set nginx-648458674d to 3
<invalid> Normal Killing pod/nginx-6888c79454-dhhts Stopping container nginx
<invalid> Normal SuccessfulCreate replicaset/nginx-648458674d Created pod: nginx-648458674d-v4lwp
<invalid> Normal Scheduled pod/nginx-648458674d-v4lwp Successfully assigned default/nginx-648458674d-v4lwp to node4
<invalid> Normal Pulled pod/nginx-648458674d-v4lwp Container image "nginx:1.20.1" already present on machine
<invalid> Normal Created pod/nginx-648458674d-v4lwp Created container nginx
<invalid> Normal Started pod/nginx-648458674d-v4lwp Started container nginx
<invalid> Normal Killing pod/nginx-6888c79454-dhhts Stopping container nginx
<invalid> Warning FailedKillPod pod/nginx-6888c79454-dhhts error killing pod: failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: No such container: 0c27aa115f96cbc5d713a2d508310d20035021046b59878ffc50bb2bd6ee9271"
<invalid> Normal SuccessfulDelete replicaset/nginx-6888c79454 Deleted pod: nginx-6888c79454-gcw24
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled down replica set nginx-6888c79454 to 0
<invalid> Normal Killing pod/nginx-6888c79454-gcw24 Stopping container nginx
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Scheduled pod/nginx-6888c79454-tlczp Successfully assigned default/nginx-6888c79454-tlczp to node4
<invalid> Normal SuccessfulCreate replicaset/nginx-6888c79454 Created pod: nginx-6888c79454-tlczp
<invalid> Normal ScalingReplicaSet deployment/nginx Scaled up replica set nginx-6888c79454 to 2
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Pulled pod/nginx-6888c79454-tlczp Container image "nginx:1.18" already present on machine
<invalid> Normal Created pod/nginx-6888c79454-tlczp Created container nginx
<invalid> Normal Started pod/nginx-6888c79454-tlczp Started container nginx
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Scheduled pod/nginx-6888c79454-6tfk2 Successfully assigned default/nginx-6888c79454-6tfk2 to node4
<invalid> Normal SuccessfulCreate replicaset/nginx-6888c79454 Created pod: nginx-6888c79454-6tfk2
<invalid> Normal ScalingReplicaSet deployment/nginx Scaled up replica set nginx-6888c79454 to 3
<invalid> Normal Starting node/node4 Starting kubelet.
<invalid> Normal Pulled pod/nginx-6888c79454-6tfk2 Container image "nginx:1.18" already present on machine
<invalid> Normal Created pod/nginx-6888c79454-6tfk2 Created container nginx
<invalid> Normal Started pod/nginx-6888c79454-6tfk2 Started container nginx
<invalid> Normal Starting node/node4 Starting kubelet.
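Note: the above is the process as Kubernetes carried it out. The key lines are the following, which show the ReplicaSet for the new image being scaled up step by step while the ReplicaSet for the old image is scaled down: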
(combined from similar events): Scaled up replica set nginx-648458674d to 2
(combined from similar events): Scaled down replica set nginx-6888c79454 to 1
(combined from similar events): Scaled up replica set nginx-648458674d to 3
(combined from similar events): Scaled down replica set nginx-6888c79454 to 0
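The one-pod-at-a-time pattern in these events follows from the rolling update limits. Assuming the deployment uses the default RollingUpdate strategy (maxSurge 25%, maxUnavailable 25%), which kubectl create deployment does not change, the percentages resolve to pod counts as sketched below. The function names are hypothetical.

```shell
# maxSurge percentages are rounded up to a whole number of pods.
surge_pods() { echo $(( ($1 * $2 + 99) / 100 )); }

# maxUnavailable percentages are rounded down to a whole number of pods.
unavailable_pods() { echo $(( $1 * $2 / 100 )); }

replicas=3
echo "maxSurge=$(surge_pods "$replicas" 25) maxUnavailable=$(unavailable_pods "$replicas" 25)"
# prints: maxSurge=1 maxUnavailable=0
```

With three replicas, at most one extra pod may exist and none may be missing, so an old pod is only removed after a new one is available. That is why the service keeps answering during the update.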
The final state of the pods:
[root@node1 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-648458674d-gldmc 1/1 Running 0 <invalid> 10.244.41.18 node4 <none> <none>
nginx-648458674d-kfgb4 1/1 Running 0 <invalid> 10.244.41.19 node4 <none> <none>
nginx-648458674d-v4lwp 1/1 Running 0 <invalid> 10.244.41.20 node4 <none> <none>
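We can call an update like this smooth and graceful. The alternative is to delete the deployment through its YAML manifest, edit the file, and create the deployment again. That is certainly simple, but it is crude and interrupts the service, so we do not consider it a smooth update.
A restart done the same way merely skips the step of editing the manifest, and is just as crude.
The exact steps are not demonstrated here; in outline it is kubectl delete -f <file> followed by kubectl apply -f <file>.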
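III.
Finer-grained pod version control: kubectl rollout
The two approaches above are not especially precise: both are ad hoc command-line operations, so the changes are not recorded anywhere concrete. Day-to-day work therefore still calls for YAML manifests.
The kubectl rollout command supports graceful restarts and updates. Continuing the example above: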
[root@node1 ~]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-648458674d-gldmc 1/1 Running 0 <invalid>
nginx-648458674d-kfgb4 1/1 Running 0 <invalid>
nginx-648458674d-v4lwp 1/1 Running 0 <invalid>
Restart the deployment directly:
[root@node1 ~]# kubectl rollout restart deployment nginx
deployment.apps/nginx restarted
Check the events.
Command:
kubectl get events -w
Partial output:
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled up replica set nginx-5fc8f974d9 to 1
<invalid> Normal SuccessfulCreate replicaset/nginx-5fc8f974d9 Created pod: nginx-5fc8f974d9-9gn8z
<invalid> Normal Scheduled pod/nginx-5fc8f974d9-9gn8z Successfully assigned default/nginx-5fc8f974d9-9gn8z to node4
<invalid> Normal Pulled pod/nginx-5fc8f974d9-9gn8z Container image "nginx:1.18" already present on machine
<invalid> Normal Created pod/nginx-5fc8f974d9-9gn8z Created container nginx
<invalid> Normal Started pod/nginx-5fc8f974d9-9gn8z Started container nginx
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled down replica set nginx-bf95bf86b to 2
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled up replica set nginx-5fc8f974d9 to 2
<invalid> Normal SuccessfulDelete replicaset/nginx-bf95bf86b Deleted pod: nginx-bf95bf86b-jsssl
<invalid> Normal Killing pod/nginx-bf95bf86b-jsssl Stopping container nginx
<invalid> Normal SuccessfulCreate replicaset/nginx-5fc8f974d9 Created pod: nginx-5fc8f974d9-nkcbd
<invalid> Normal Scheduled pod/nginx-5fc8f974d9-nkcbd Successfully assigned default/nginx-5fc8f974d9-nkcbd to node4
<invalid> Normal Pulled pod/nginx-5fc8f974d9-nkcbd Container image "nginx:1.18" already present on machine
<invalid> Normal Created pod/nginx-5fc8f974d9-nkcbd Created container nginx
<invalid> Normal Started pod/nginx-5fc8f974d9-nkcbd Started container nginx
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled down replica set nginx-bf95bf86b to 1
<invalid> Normal SuccessfulDelete replicaset/nginx-bf95bf86b Deleted pod: nginx-bf95bf86b-98lpj
<invalid> Normal Killing pod/nginx-bf95bf86b-98lpj Stopping container nginx
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled up replica set nginx-5fc8f974d9 to 3
<invalid> Normal SuccessfulCreate replicaset/nginx-5fc8f974d9 Created pod: nginx-5fc8f974d9-xw64m
<invalid> Normal Scheduled pod/nginx-5fc8f974d9-xw64m Successfully assigned default/nginx-5fc8f974d9-xw64m to node4
<invalid> Normal Pulled pod/nginx-5fc8f974d9-xw64m Container image "nginx:1.18" already present on machine
<invalid> Normal Created pod/nginx-5fc8f974d9-xw64m Created container nginx
<invalid> Normal Started pod/nginx-5fc8f974d9-xw64m Started container nginx
<invalid> Normal ScalingReplicaSet deployment/nginx (combined from similar events): Scaled down replica set nginx-bf95bf86b to 0
<invalid> Normal SuccessfulDelete replicaset/nginx-bf95bf86b Deleted pod: nginx-bf95bf86b-nfh5r
Pod status after the rollout:
[root@node1 ~]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-bf95bf86b-98lpj 1/1 Running 0 <invalid>
nginx-bf95bf86b-jsssl 1/1 Running 0 <invalid>
nginx-bf95bf86b-nfh5r 1/1 Running 0 <invalid>
[root@node1 ~]# kubectl get replicasets.apps
NAME DESIRED CURRENT READY AGE
nginx-5fc8f974d9 3 3 3 <invalid>
nginx-648458674d 0 0 0 <invalid>
nginx-6888c79454 0 0 0 <invalid>
nginx-bf95bf86b 0 0 0 <invalid>
The rollout history should now contain four revisions, and the output shows that they correspond one-to-one to the ReplicaSets (rs) above:
[root@node1 ~]# kubectl rollout history deployment
deployment.apps/nginx
REVISION CHANGE-CAUSE
1 <none>
2 <none>
3 <none>
4 <none>
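The CHANGE-CAUSE column is empty because nothing recorded a reason. It is filled from the kubernetes.io/change-cause annotation on the deployment, which can be set after each change. A sketch, with a hypothetical helper name:

```shell
# Record why the current revision of a deployment exists.
# Usage: record_change_cause <deployment> <free-text reason>
record_change_cause() {
  kubectl annotate "deployment/$1" "kubernetes.io/change-cause=$2" --overwrite
}

# For example, right after an image update:
#   record_change_cause nginx "update image to nginx:1.20.1"
```

The text then appears next to the current revision in kubectl rollout history, which makes it much easier to pick the right revision for a rollback.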
View the details of each revision of the deployment:
kubectl rollout history deployment nginx --revision=1
deployment.apps/nginx with revision #1
Pod Template:
Labels: app=nginx
pod-template-hash=6888c79454
Containers:
nginx:
Image: nginx:1.18
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
[root@node1 ~]# kubectl rollout history deployment nginx --revision=2
deployment.apps/nginx with revision #2
Pod Template:
Labels: app=nginx
pod-template-hash=6888c79454
Containers:
nginx:
Image: nginx:1.18
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
[root@node1 ~]# kubectl rollout history deployment nginx --revision=3
deployment.apps/nginx with revision #3
Pod Template:
Labels: app=nginx
pod-template-hash=bf95bf86b
Annotations: kubectl.kubernetes.io/restartedAt: 2023-11-18T17:06:24+08:00
Containers:
nginx:
Image: nginx:1.18
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
[root@node1 ~]# kubectl rollout history deployment nginx --revision=4
deployment.apps/nginx with revision #4
Pod Template:
Labels: app=nginx
pod-template-hash=5fc8f974d9
Annotations: kubectl.kubernetes.io/restartedAt: 2023-11-18T17:10:02+08:00
Containers:
nginx:
Image: nginx:1.18
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
Upgrade the image to 1.20.1, which creates a new revision, 5:
[root@node1 ~]# kubectl apply -f nginx.yaml
deployment.apps/nginx configured
[root@node1 ~]# kubectl rollout history deployment nginx --revision=5
deployment.apps/nginx with revision #5
Pod Template:
Labels: app=nginx
pod-template-hash=6469d4d479
Annotations: kubectl.kubernetes.io/restartedAt: 2023-11-18T17:10:02+08:00
Containers:
nginx:
Image: nginx:1.20.1
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
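The nginx.yaml applied above is not shown in this article. A minimal manifest that matches the deployment created earlier might look like the following. Treat every field as an assumption reconstructed from the kubectl output, not as the original file.

```shell
# Write an assumed equivalent of the nginx.yaml used above.
cat > nginx.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.20.1   # the only line edited for this upgrade
EOF

# Applied on the cluster with:
#   kubectl apply -f nginx.yaml
```

Because only the image in the pod template changes, kubectl apply triggers the same rolling update as kubectl set image, while the file keeps a record of the desired state.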
Roll back to revision 2:
First look up the revision history:
[root@node1 ~]# kubectl get rs -o wide
NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
nginx-5fc8f974d9 0 0 0 <invalid> nginx nginx:1.18 app=nginx,pod-template-hash=5fc8f974d9
nginx-6469d4d479 3 3 3 <invalid> nginx nginx:1.20.1 app=nginx,pod-template-hash=6469d4d479
nginx-648458674d 0 0 0 <invalid> nginx nginx:1.20.1 app=nginx,pod-template-hash=648458674d
nginx-6888c79454 0 0 0 <invalid> nginx nginx:1.18 app=nginx,pod-template-hash=6888c79454
nginx-bf95bf86b 0 0 0 <invalid> nginx nginx:1.18 app=nginx,pod-template-hash=bf95bf86b
[root@node1 ~]# kubectl rollout history deployment nginx
deployment.apps/nginx
REVISION CHANGE-CAUSE
1 <none>
2 <none>
3 <none>
4 <none>
5 <none>
[root@node1 ~]# kubectl rollout history deployment nginx --revision=2
deployment.apps/nginx with revision #2
Pod Template:
Labels: app=nginx
pod-template-hash=6888c79454
Containers:
nginx:
Image: nginx:1.18
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
Roll back:
[root@node1 ~]# kubectl rollout undo deployment nginx --to-revision=2
deployment.apps/nginx rolled back
Check that the rollback worked:
[root@node1 ~]# kubectl get deployments.apps -owide
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
nginx 3/3 3 3 177m nginx nginx:1.18 app=nginx
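The rollback and its verification can be combined so that the image is only read once the rollout has finished. A sketch under the same assumptions as before: the helper name rollback_to and the 120-second timeout are mine.

```shell
# Roll a deployment back to a revision, wait for the rollout, then print the
# image of the first container in the pod template.
# Usage: rollback_to <deployment> <revision>
rollback_to() {
  kubectl rollout undo "deployment/$1" --to-revision="$2" &&
    kubectl rollout status "deployment/$1" --timeout=120s &&
    kubectl get "deployment/$1" -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# For the rollback shown above:
#   rollback_to nginx 2
```

Note that a rollback does not reuse the old revision number: the restored template gets a new, higher revision in kubectl rollout history.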
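The usual way to restart pods is to delete them and let them be recreated, but even with multiple replicas this carries a risk of service interruption. Updates are often done just as bluntly: delete the pods, make the change, and start them again, or scale the replicas to 0 and then back to the original count.
If you want to update or restart without interrupting the service, kubectl rollout is still the first choice.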