
Services Fail to Start After a K8s Host Power-Off Restart


The troubleshooting process was unpleasant: the problem turned up in a place I assumed could never break.

Conventional wisdom says that a kubeadm-installed cluster tolerates restarts and even power loss of its component containers without the cluster itself breaking, so I did not suspect the components at all. Everything in the kube-system namespace showed Running, which reinforced my wrong assumption that the components were fine, and I went straight to restarting our middleware and business services.

While restarting Nacos I found it could not connect to MySQL no matter what I tried. I read logs and changed configuration every way I could think of, and after about an hour with no progress I finally started suspecting the components:

- First, a busybox pod could not resolve Service names at all, so CoreDNS was broken. Restarting CoreDNS fixed resolution.
- Then a new problem appeared: names resolved, but the Services were still unreachable. Using nsenter I confirmed the backing services themselves were healthy, which pointed at the network layer.
- From the host, telnet to a Service's ClusterIP failed, so I restarted kube-proxy. That did not resolve it either, so I checked the kube-proxy logs:

```
E1031 05:39:31.739663       1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized
```

The `Unauthorized` error means kube-proxy's requests to the API server were being rejected: its RBAC permissions were missing.
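This class of failure is easy to spot by grepping the kube-proxy logs for authorization errors. A minimal sketch (the `has_auth_errors` helper is hypothetical, not part of any standard tooling):

```bash
# Hypothetical helper: read kube-proxy log text on stdin and report
# whether any authorization failures (Unauthorized/forbidden) appear.
has_auth_errors() {
  if grep -qE 'Unauthorized|forbidden'; then
    echo "auth errors found"
  else
    echo "logs clean"
  fi
}

# Typical usage against a live cluster (default kubeadm pod label assumed):
#   kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=200 | has_auth_errors
```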

After manually recreating the kube-proxy RBAC objects, the cluster recovered, and restarting the middleware and business services succeeded.

Detailed steps

Step 1: Check the kube-proxy ServiceAccount

```bash
kubectl get sa -n kube-system
```

Make sure the kube-proxy ServiceAccount exists.

Step 2: Check the kube-proxy ClusterRole and ClusterRoleBinding

```bash
kubectl get clusterrolebinding kube-proxy -o yaml
```

The output should look similar to:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-proxy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-proxy
subjects:
- kind: ServiceAccount
  name: kube-proxy
  namespace: kube-system
```

If it does not exist, create it:

```bash
kubectl create clusterrolebinding kube-proxy --clusterrole=kube-proxy --serviceaccount=kube-system:kube-proxy
```
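After creating the binding, kubectl's built-in access-review helper can confirm it grants exactly what the error complained about (a sketch; the wrapper function name is mine, the underlying `kubectl auth can-i` command is standard):

```bash
# Hypothetical wrapper: ask the API server whether the kube-proxy
# ServiceAccount may watch EndpointSlices (the resource from the error).
can_watch_endpointslices() {
  kubectl auth can-i watch endpointslices.discovery.k8s.io \
    --as=system:serviceaccount:kube-system:kube-proxy
}

# On a healthy cluster this should answer: yes
```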

Step 3: Check the kube-proxy ClusterRole

```bash
kubectl get clusterrole kube-proxy -o yaml
```

The output should look similar to:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-proxy
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
- apiGroups: ["discovery.k8s.io"]
  resources: ["endpointslices"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["networkpolicies"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  verbs: ["use"]
```

If it does not exist, create it:

```bash
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-proxy
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
- apiGroups: ["discovery.k8s.io"]
  resources: ["endpointslices"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["networkpolicies"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  verbs: ["use"]
EOF
```
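The existence checks from steps 1–3 can be rolled into one small script. A sketch, assuming the object names used above (default kubeadm-style naming):

```bash
# Sketch: verify that all three RBAC objects kube-proxy needs are present,
# using the names from the steps above. Prints the first missing object.
check_kube_proxy_rbac() {
  kubectl get sa kube-proxy -n kube-system >/dev/null 2>&1 \
    || { echo "ServiceAccount kube-proxy missing"; return 1; }
  kubectl get clusterrolebinding kube-proxy >/dev/null 2>&1 \
    || { echo "ClusterRoleBinding kube-proxy missing"; return 1; }
  kubectl get clusterrole kube-proxy >/dev/null 2>&1 \
    || { echo "ClusterRole kube-proxy missing"; return 1; }
  echo "kube-proxy RBAC objects present"
}
```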

Step 4: Restart kube-proxy

If the steps above do not resolve the issue, try restarting the kube-proxy Pods:

```bash
kubectl delete pod -l k8s-app=kube-proxy -n kube-system
```
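Once kube-proxy is healthy again, the original failure points (Service name resolution, then ClusterIP reachability) can be re-checked in order. A sketch with hypothetical names — `my-svc`, `my-ns`, and port 3306 stand in for your own Service:

```bash
# Hypothetical helpers for re-checking the incident's failure points.
# resolve_service: DNS check via a throwaway busybox pod.
resolve_service() {
  kubectl run dns-test --rm -i --restart=Never --image=busybox -- \
    nslookup "$1"
}
# service_cluster_ip: print a Service's ClusterIP for a reachability test.
service_cluster_ip() {
  kubectl get svc "$1" -n "$2" -o jsonpath='{.spec.clusterIP}'
}

# Usage:
#   resolve_service my-svc.my-ns.svc.cluster.local
#   telnet "$(service_cluster_ip my-svc my-ns)" 3306
```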