When a cluster is deployed on-prem, all sorts of unexpected problems with the custom infrastructure can occur.
Thus, at some point, you may find that the cluster's deployment is stuck, with some of the machines sitting in the "Provisioning" state.
Here are some ideas on how to cope with that.
First, we need to determine which errors occurred and on which node.
Let's find the problematic machine:
kubectl get machine
can show, for example, this output:
NAME                                CLUSTER        AGE   PROVIDERID                          PHASE          VERSION
vavypp-jmpqh-control-plane-6fjth    vavypp-jmpqh   24m   preprovisioned:////192.168.122.11   Running        v1.21.6
vavypp-jmpqh-control-plane-ql667    vavypp-jmpqh   19m   preprovisioned:////192.168.122.12   Running        v1.21.6
vavypp-jmpqh-control-plane-s4hl7    vavypp-jmpqh   12m   preprovisioned:////192.168.122.13   Running        v1.21.6
vavypp-jmpqh-md-0-d7f46cfd5-9z5b8   vavypp-jmpqh   24m   preprovisioned:////192.168.122.24   Running        v1.21.6
vavypp-jmpqh-md-0-d7f46cfd5-clg6k   vavypp-jmpqh   24m   preprovisioned:////192.168.122.22   Running        v1.21.6
vavypp-jmpqh-md-0-d7f46cfd5-g6dft   vavypp-jmpqh   24m                                       Provisioning   v1.21.6
vavypp-jmpqh-md-0-d7f46cfd5-kkkn7   vavypp-jmpqh   24m                                       Provisioning   v1.21.6
Here we see that the machine vavypp-jmpqh-md-0-d7f46cfd5-g6dft is stuck in the "Provisioning" state.
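On a larger cluster it may be more convenient to list only the stuck machines instead of eyeballing the table; a minimal sketch using kubectl's jsonpath output:
kubectl get machines -o jsonpath='{range .items[?(@.status.phase=="Provisioning")]}{.metadata.name}{"\n"}{end}'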
Let's look at the details of this machine:
kubectl get machine vavypp-jmpqh-md-0-d7f46cfd5-g6dft -o yaml
Here, in the field "spec.infrastructureRef.name", we see the name used by the provisioning process:
vavypp-jmpqh-md-0-pcrqq
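The same name can be extracted directly, without reading through the whole YAML:
kubectl get machine vavypp-jmpqh-md-0-d7f46cfd5-g6dft -o jsonpath='{.spec.infrastructureRef.name}'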
Let's find the problematic Jobs and pods related to this provisioning:
kubectl get all -A | grep vavypp-jmpqh-md-0-pcrqq
default   pod/vavypp-jmpqh-md-0-pcrqq-provision-jrtqw   0/1   Error   0     22m
default   job.batch/vavypp-jmpqh-md-0-pcrqq-provision   0/1   22m     22m
We see that the provision Job hasn't finished successfully, because its pod ended up in the "Error" state.
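Since pods shown as "Error" have the phase "Failed", they can also be found across all namespaces with a field selector instead of grep:
kubectl get pods -A --field-selector status.phase=Failed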
Let's take a look at the logs of the provisioning pod:
kubectl logs vavypp-jmpqh-md-0-pcrqq-provision-jrtqw
Here we can see the exact problem that happened during the machine's deployment.
For example:
fatal: [192.168.122.21]: FAILED! => {"attempts": 3, "changed": false, "msg": "Failure downloading http://mirror.centos.org/centos/7/extras/x86_64/Packages/container-selinux-2.107-3.el7.noarch.rpm, Request failed: <urlopen error [Errno 101] Network is unreachable>"}
So, we have a network connectivity problem on server 192.168.122.21.
Let's say we've spent some time and fixed the problem on the server.
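Before re-triggering anything, it may be worth confirming the fix from the server itself, assuming we have SSH access to it, for example:
ssh root@192.168.122.21 'curl -fsSI http://mirror.centos.org/centos/7/extras/x86_64/Packages/container-selinux-2.107-3.el7.noarch.rpm'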
Now we need to re-trigger the machine's deployment.
Depending on the server's problems, it may be enough simply to delete the provisioning Job:
kubectl delete job vavypp-jmpqh-md-0-pcrqq-provision
After this, the cappp-controller-manager will start a new provisioning Job with a new pod.
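We can watch the replacement pod appear and start running:
kubectl get pods -w | grep vavypp-jmpqh-md-0-pcrqq-provision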
In more complicated cases, the provisioning pod should also be deleted, and the machine object should be re-annotated with a command like this:
kubectl annotate machine <machine_name> touch=now
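Note that kubectl refuses to change an annotation that already has a value unless --overwrite is given, so for repeated re-triggering it may be more convenient to use a value that changes on every run, for example a timestamp (the touch key here is just the example annotation from above):
kubectl annotate machine <machine_name> touch="$(date +%s)" --overwrite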
After this, we need to restart the cappp-controller-manager:
kubectl -n cappp-system rollout restart deploy/cappp-controller-manager
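To make sure the controller has actually come back up, we can wait for the rollout to complete:
kubectl -n cappp-system rollout status deploy/cappp-controller-manager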
In some cases, a machine can be in the "Running" state while the corresponding Kubernetes node never reaches the "Ready" state; at that point we may realize that the server is completely broken and needs to be decommissioned and re-imaged.
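The node's condition is easy to check:
kubectl get nodes
A broken server's node will remain in the "NotReady" state, or may never appear in the list at all.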
Thus, we need to delete the machine object to trigger its re-deployment:
kubectl delete machine <machine name>
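Since worker machines like md-0 are owned by a MachineDeployment, the controller will create a replacement machine automatically, and we can watch it being provisioned:
kubectl get machines -w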