When performing a DKP upgrade, you may run into a generic timeout error during the step that upgrades Kubernetes on the worker nodes:
unable to update node pool: error while waiting for node pool to be ready: timed out waiting for the condition
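To see which node pool or Machine the upgrade is stuck on, you can list the Cluster API resources on the management cluster. The commands below are a sketch and assume your kubeconfig points at the management cluster; namespaces may differ in your environment.

# List node pools (MachineDeployments) and their replica/ready counts
kubectl get machinedeployments -A
# List individual Machines and the phase each one is in
kubectl get machines -A -o wide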
Solution
On further inspection of the capi-controller-manager pod logs, you may notice a pod disruption error:
XXXXX XX:XX:XX 1 machine_controller.go:619] "error when evicting pods/\"rook-ceph-osd-0-xxxxxxxxx-xxxxx\" -n \"kommander\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/konvoynode-md-0-xxxxxxxxxx-xxxxx" namespace="default" name="konvoynode-md-0-xxxxxxxxxx-xxxxx" reconcileID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx MachineSet="default/konvoynode-md-0-xxxxxxxxxx" MachineDeployment="default/konvoynode-md-0" Cluster="default/konvoynode" Node="nodename.my.domain"
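To view these controller logs yourself, you can read the capi-controller-manager Deployment logs on the management cluster. This is a sketch: the capi-system namespace is an assumption based on a default Cluster API installation and may differ in your environment.

# Search the CAPI machine controller logs for eviction failures (namespace is assumed)
kubectl logs -n capi-system deployment/capi-controller-manager | grep -i "error when evicting"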
If you inspect the Describe output for the rook-ceph-mon pods, you can then see the actual reason why the OSD pods cannot be evicted:
"message": "0/6 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) were unschedulable, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 5 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling."
This happens when there are not enough worker nodes for rook-ceph to keep running in a healthy state while one node is drained for the upgrade. It is usually seen on DKP clusters running with fewer than four worker nodes; at least four worker nodes are required to run DKP.
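Before retrying the upgrade, you can verify how many schedulable worker nodes the cluster actually has. This is a sketch that filters out control-plane nodes by label and highlights cordoned nodes; the label name can vary between Kubernetes versions.

# Count worker nodes (nodes without the control-plane role label)
kubectl get nodes -l '!node-role.kubernetes.io/control-plane'
# Identify nodes that are cordoned and cannot accept rescheduled rook-ceph pods
kubectl get nodes | grep SchedulingDisabled

If fewer than four worker nodes are available, add worker nodes (or scale up the node pool) before rerunning the upgrade so that rook-ceph can reschedule its pods while each node is drained.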