Symptoms
While creating a new workload cluster, the control plane remains stuck at a single control plane (CP) machine:
kubectl get machines
NAME                                CLUSTER       NODENAME   PROVIDERID                       PHASE     AGE   VERSION
preprov-4-3-control-plane-zqhzx     preprov-4-3   cp0        preprovisioned:////10.129.3.8    Running   25m   v1.24.6
preprov-4-3-md-0-6dc6d7676f-2hzvx   preprov-4-3   w2         preprovisioned:////10.129.3.9    Running   25m   v1.24.6
preprov-4-3-md-0-6dc6d7676f-6lppf   preprov-4-3   w0         preprovisioned:////10.129.3.6    Running   25m   v1.24.6
preprov-4-3-md-0-6dc6d7676f-6xbkc   preprov-4-3   w3         preprovisioned:////10.129.3.3    Running   25m   v1.24.6
preprov-4-3-md-0-6dc6d7676f-gq5kw   preprov-4-3   w1         preprovisioned:////10.129.3.26   Running   25m   v1.24.6
The KubeadmControlPlane replica count is stuck at 1, even after attempting to scale it up:
kubectl get kubeadmcontrolplane
NAME                        CLUSTER       INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
preprov-4-3-control-plane   preprov-4-3   true          true                   1          1       1         0             31m   v1.24.6
Describing the KubeadmControlPlane shows the following condition:
- lastTransitionTime: "2023-04-04T02:59:09Z"
  message: 'Following machines are reporting unknown etcd member status:
    preprov-4-3-control-plane-zqhzx'
  reason: EtcdClusterUnknown
  status: Unknown
  type: EtcdClusterHealthy
This might indicate unhealthy CP components.
To confirm the status of the CP components, SSH into the first CP node and query the API server with the admin.conf kubeconfig:
kubectl get po -n kube-system --kubeconfig /etc/kubernetes/admin.conf
NAME                          READY   STATUS    RESTARTS   AGE
coredns-6d4b75cb6d-lgk26      1/1     Running   0          133m
coredns-6d4b75cb6d-pnxvw      1/1     Running   0          133m
etcd-cp0                      1/1     Running   219        133m
kube-apiserver-cp0            1/1     Running   243        133m
kube-controller-manager-cp0   1/1     Running   219        133m
Note the high restart counts on etcd, kube-apiserver, and kube-controller-manager. The etcd logs show repeated TLS certificate verification failures:
embed: rejected connection from "127.0.0.1:34824" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "etcd-alexchrisramos-cp0")
embed: rejected connection from "127.0.0.1:34838" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "etcd-alexchrisramos-cp0")
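Since the log complains that a client certificate "has expired or is not yet valid", it can help to compare the certificates' validity windows against the node clock. A minimal sketch, assuming kubeadm's default PKI layout under /etc/kubernetes/pki/etcd; `cert_window` is a hypothetical helper name, not a standard tool:

```shell
# cert_window prints a certificate's notBefore/notAfter dates so they can
# be compared against the node clock (hypothetical helper for illustration).
cert_window() {
  openssl x509 -in "$1" -noout -dates
}

# On the CP node (adjust paths if your PKI layout differs):
#   cert_window /etc/kubernetes/pki/etcd/server.crt
#   cert_window /etc/kubernetes/pki/etcd/peer.crt
#   date -u   # current node time in UTC, to compare against the window
```

If the node clock falls outside the notBefore/notAfter window, TLS verification fails even though the certificates themselves are fine.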
Solution
Together, these symptoms are a strong indication of time drift between the cluster nodes and/or the machine where the bootstrap runs (bastion host): the "certificate has expired or is not yet valid" error typically appears when a node's clock falls outside a certificate's validity window.
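To confirm the drift, compare each node's clock against the bastion's. A minimal sketch; `drift_seconds` is a hypothetical helper, and the SSH invocation in the comments assumes SSH access to the node IPs shown in the `kubectl get machines` output above:

```shell
# drift_seconds prints the absolute difference between two epoch timestamps,
# e.g. the bastion clock vs. a node clock gathered over SSH.
drift_seconds() {
  a=$1
  b=$2
  d=$((a - b))
  if [ "$d" -lt 0 ]; then d=$((-d)); fi
  echo "$d"
}

# Hypothetical usage from the bastion (node IP is an example from above):
#   remote=$(ssh 10.129.3.8 date -u +%s)
#   local_ts=$(date -u +%s)
#   drift_seconds "$local_ts" "$remote"
```

Anything beyond a few seconds of drift is suspect; kubeadm-issued certificates and etcd peer TLS both assume closely synchronized clocks.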
Please see our existing articles on handling time sync issues as a starting point for resolving these errors:
https://support.d2iq.com/hc/en-us/articles/4411574092436-Does-anyone-really-know-what-time-it-is-
https://support.d2iq.com/hc/en-us/articles/5719303786132-Configuring-custom-NTP-servers-for-DKP-2-X