When creating a new cluster on pre-provisioned hardware, you may encounter a failure in which DKP does not install Kubernetes on certain nodes.
Output resembling the following might appear in the cappp-system logs in the bootstrap cluster:
EXXXX XX:XX:XX 1 leaderelection.go:325] error retrieving resource lock cappp-system/xxxxxx.cluster.konvoy.d2iq.io: etcdserver: request timed out
EXXXX 13:32:50 1 leaderelection.go:325] error retrieving resource lock cappp-system/xxxxxx.cluster.konvoy.d2iq.io: Get "https://XX.XX.XX.XX:443/api/v1/namespaces/cappp-system/configmaps/xxxxxx.cluster.konvoy.d2iq.io": context deadline exceeded
IXXXX 13:32:50 1 leaderelection.go:278] failed to renew lease cappp-system/xxxxxx.cluster.konvoy.d2iq.io: timed out waiting for the condition
20XX-XX-XXTXX:XX:XX ERROR setup problem running manager {"error": "leader election lost"}
...
2022-01-27T13:43:43.513Z ERROR controller.preprovisionedmachine could not delete directory {"reconciler group": "infrastructure.cluster.konvoy.d2iq.io", "reconciler kind": "PreprovisionedMachine", "name": "cluster-ibs-ng-control-plane-xxxxx", "namespace": "default", "directory": "/var/lib/kubelet", "error": "error describing directory \"/var/lib/kubelet\" with stat: Process exited with status 1"}
github.com/mesosphere/cluster-api-provider-preprovisioned/controllers.(*PreprovisionedMachineReconciler).Reconcile
The error itself is not very helpful in narrowing down the cause of the problem.
However, we have found that a common cause is a clock that is out of sync with network time on the machine you use to run "dkp" commands and create the cluster (for example, your local laptop or a bastion host).
If you encounter errors such as these and are unsure of the cause, a good first troubleshooting step is to make sure that every machine in your pre-provisioned cluster and your bastion host or laptop are properly in sync with a network time server.
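As a rough first check, the comparison described above can be sketched in Python. This is a minimal sketch, not a DKP tool: it assumes passwordless SSH from the machine where you run "dkp" to each node, and that the nodes provide a `date` command that accepts `+%s`. The function names and the host names in the usage note are hypothetical. Because it reads whole seconds over SSH, it only catches gross drift (seconds or more), so confirm the result with the NTP client on each host.

```python
import subprocess
import time


def remote_epoch(host: str) -> float:
    """Read a remote host's clock as Unix seconds over SSH.

    Assumption: passwordless SSH to `host` and a `date` that supports `+%s`.
    """
    result = subprocess.run(
        ["ssh", host, "date", "+%s"],
        capture_output=True,
        text=True,
        check=True,
        timeout=15,
    )
    return float(result.stdout.strip())


def clock_offsets(hosts, read_remote=remote_epoch, now=time.time):
    """Map each host to (remote clock - local clock) in seconds."""
    offsets = {}
    for host in hosts:
        before = now()
        remote = read_remote(host)
        after = now()
        # Compare against the midpoint of the round trip so that SSH
        # latency biases the estimate as little as possible.
        offsets[host] = remote - (before + after) / 2
    return offsets


def skewed_hosts(offsets, max_skew_seconds=2.0):
    """Return only the hosts whose offset exceeds the allowed skew."""
    return {
        host: offset
        for host, offset in offsets.items()
        if abs(offset) > max_skew_seconds
    }
```

Run from the laptop or bastion host, a call such as `skewed_hosts(clock_offsets(["user@node-1.example.com", "user@node-2.example.com"]))` returns the nodes whose clocks differ from the local clock by more than the threshold. The local machine is the reference here, so an empty result means the clocks agree with each other, not necessarily that they agree with a network time server.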
Please see this article for more details on the importance of time sync and how you can verify it: