Symptoms
While creating a new workload cluster, the control plane remains stuck at a single control plane (CP) machine:
kubectl get machines
NAME                                CLUSTER       NODENAME   PROVIDERID                       PHASE     AGE   VERSION
preprov-4-3-control-plane-zqhzx     preprov-4-3   cp0        preprovisioned:////10.129.3.8    Running   25m   v1.24.6
preprov-4-3-md-0-6dc6d7676f-2hzvx   preprov-4-3   w2         preprovisioned:////10.129.3.9    Running   25m   v1.24.6
preprov-4-3-md-0-6dc6d7676f-6lppf   preprov-4-3   w0         preprovisioned:////10.129.3.6    Running   25m   v1.24.6
preprov-4-3-md-0-6dc6d7676f-6xbkc   preprov-4-3   w3         preprovisioned:////10.129.3.3    Running   25m   v1.24.6
preprov-4-3-md-0-6dc6d7676f-gq5kw   preprov-4-3   w1         preprovisioned:////10.129.3.26   Running   25m   v1.24.6
The KubeadmControlPlane replica count is stuck at 1, even after attempting to scale it up:
kubectl get kubeadmcontrolplane
NAME                        CLUSTER       INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
preprov-4-3-control-plane   preprov-4-3   true          true                   1          1       1         0             31m   v1.24.6
Describing the KubeadmControlPlane shows the following condition:
- lastTransitionTime: "2023-04-04T02:59:09Z"
  message: 'Following machines are reporting unknown etcd member status:
    preprov-4-3-control-plane-zqhzx'
  reason: EtcdClusterUnknown
  status: Unknown
  type: EtcdClusterHealthy
This might indicate unhealthy CP components.
To confirm the status of the CP components, SSH into the first CP node and query the API server with the admin.conf kubeconfig:
kubectl get po -n kube-system --kubeconfig /etc/kubernetes/admin.conf
NAME                          READY   STATUS    RESTARTS   AGE
coredns-6d4b75cb6d-lgk26      1/1     Running   0          133m
coredns-6d4b75cb6d-pnxvw      1/1     Running   0          133m
etcd-cp0                      1/1     Running   219        133m
kube-apiserver-cp0            1/1     Running   243        133m
kube-controller-manager-cp0   1/1     Running   219        133m
Note the high restart counts on etcd, kube-apiserver, and kube-controller-manager. The etcd logs show repeated TLS certificate verification failures:
embed: rejected connection from "127.0.0.1:34824" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "etcd-alexchrisramos-cp0")
embed: rejected connection from "127.0.0.1:34838" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "etcd-alexchrisramos-cp0")
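Since the log complains that a client certificate "has expired or is not yet valid", it can help to compare the certificates' validity windows against the node clock. A minimal sketch, assuming kubeadm's default PKI layout under /etc/kubernetes/pki/etcd; `cert_window` is a hypothetical helper name, not a standard tool:

```shell
# cert_window prints a certificate's notBefore/notAfter dates so they can
# be compared against the node clock (hypothetical helper for illustration).
cert_window() {
  openssl x509 -in "$1" -noout -dates
}

# On the CP node (adjust paths if your PKI layout differs):
#   cert_window /etc/kubernetes/pki/etcd/server.crt
#   cert_window /etc/kubernetes/pki/etcd/peer.crt
#   date -u   # current node time in UTC, to compare against the window
```

If the node clock falls outside the notBefore/notAfter window, TLS verification fails even though the certificates themselves are fine.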
Solution
Together, these symptoms are a strong indication of time drift between the cluster nodes and/or the machine where the bootstrap runs (bastion host): the "certificate has expired or is not yet valid" error typically appears when a node's clock falls outside a certificate's validity window.
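To confirm the drift, compare each node's clock against the bastion's. A minimal sketch; `drift_seconds` is a hypothetical helper, and the SSH invocation in the comments assumes SSH access to the node IPs shown in the `kubectl get machines` output above:

```shell
# drift_seconds prints the absolute difference between two epoch timestamps,
# e.g. the bastion clock vs. a node clock gathered over SSH.
drift_seconds() {
  a=$1
  b=$2
  d=$((a - b))
  if [ "$d" -lt 0 ]; then d=$((-d)); fi
  echo "$d"
}

# Hypothetical usage from the bastion (node IP is an example from above):
#   remote=$(ssh 10.129.3.8 date -u +%s)
#   local_ts=$(date -u +%s)
#   drift_seconds "$local_ts" "$remote"
```

Anything beyond a few seconds of drift is suspect; kubeadm-issued certificates and etcd peer TLS both assume closely synchronized clocks.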
Please see our existing articles on handling time sync issues as a starting point for resolving these errors:
https://support.d2iq.com/hc/en-us/articles/4411574092436-Does-anyone-really-know-what-time-it-is-
https://support.d2iq.com/hc/en-us/articles/5719303786132-Configuring-custom-NTP-servers-for-DKP-2-X