The troubleshooting was painful: the failure showed up in a place I had assumed could never break.
Conventional wisdom says that a kubeadm-installed Kubernetes cluster survives component-container restarts, and even power loss, without the cluster itself breaking. So I did not suspect the components at all and went straight to restarting our middleware and business services. On top of that, every Pod in the kube-system namespace showed Running, which led me to the wrong conclusion that the components were fine.

While restarting Nacos, I found that no matter what I tried it could not connect to MySQL. I read logs and tweaked configuration for about an hour with no progress, and only then started to suspect the components. First, testing from a busybox Pod showed that Service names could not be resolved at all: CoreDNS was broken. Restarting CoreDNS fixed resolution, but exposed a new problem: names resolved, yet the Services were unreachable. Using nsenter I confirmed the backend processes themselves were healthy, so the problem had to be the network. A telnet from the host directly to the Service IP failed, so I restarted kube-proxy. That did not fix it either, so I checked the kube-proxy logs:
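The diagnostic sequence above can be sketched as a few commands; this is a minimal reconstruction, and the Service IP/port are placeholders, not values from the incident:

```shell
# DNS check: run nslookup from a throwaway busybox Pod.
# If Service names fail to resolve, suspect CoreDNS.
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local

# Restart CoreDNS if resolution is broken.
kubectl -n kube-system rollout restart deployment coredns

# Connectivity check: from the host, telnet the Service's ClusterIP.
# A timeout here, with backends confirmed healthy via nsenter,
# points at the kube-proxy rules on the node.
telnet <service-cluster-ip> <port>

# Inspect kube-proxy logs for errors such as "Unauthorized".
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50
```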
```
E1031 05:39:31.739663 1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized
```
The error indicated that kube-proxy was missing RBAC permissions. I then manually recreated the kube-proxy RBAC objects, the cluster recovered, and restarting the middleware and business services succeeded.
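A quick way to confirm (and later re-verify) whether the kube-proxy ServiceAccount actually holds the permissions the log complains about is `kubectl auth can-i` with impersonation; these checks are an addition to the original steps:

```shell
# Should print "yes" once RBAC is in place; "no" matches the failure mode.
kubectl auth can-i list endpointslices.discovery.k8s.io \
  --as=system:serviceaccount:kube-system:kube-proxy
kubectl auth can-i watch services \
  --as=system:serviceaccount:kube-system:kube-proxy
```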
Detailed steps:

1. Check the kube-proxy ServiceAccount
```bash
kubectl get sa -n kube-system
```
Make sure the kube-proxy ServiceAccount exists.
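If the ServiceAccount is missing (kubeadm normally creates it during init), it can be recreated; this command is an addition to the original steps, and the kube-proxy DaemonSet must reference it via `serviceAccountName`:

```shell
# Recreate the ServiceAccount kube-proxy runs as.
kubectl -n kube-system create serviceaccount kube-proxy
```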
2. Check the kube-proxy ClusterRoleBinding
```bash
kubectl get clusterrolebinding kube-proxy -o yaml
```

The output should look similar to:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-proxy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-proxy
subjects:
- kind: ServiceAccount
  name: kube-proxy
  namespace: kube-system
```
If it does not exist, create it:
```bash
kubectl create clusterrolebinding kube-proxy --clusterrole=kube-proxy --serviceaccount=kube-system:kube-proxy
```
3. Check the kube-proxy ClusterRole
```bash
kubectl get clusterrole kube-proxy -o yaml
```

The output should look similar to:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-proxy
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
- apiGroups: ["discovery.k8s.io"]
  resources: ["endpointslices"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["networkpolicies"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  verbs: ["use"]
```
If it does not exist, create it:
```bash
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-proxy
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
- apiGroups: ["discovery.k8s.io"]
  resources: ["endpointslices"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["networkpolicies"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  verbs: ["use"]
EOF
```
4. Restart kube-proxy

If the steps above do not resolve the problem, try restarting the kube-proxy Pods:

```bash
kubectl delete pod -l k8s-app=kube-proxy -n kube-system
```
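After the restart, it is worth confirming end to end that the data path is back before touching the middleware again; this verification sketch is an addition, not part of the original steps:

```shell
# kube-proxy Pods should come back Running, with clean logs.
kubectl -n kube-system get pods -l k8s-app=kube-proxy
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=20

# Re-run the originally failing check: Service-name resolution from a Pod.
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local
```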