Problem
When deploying an air-gapped DKP cluster, you may notice that the first control-plane node has been provisioned, but the cluster is not scaling up as expected. When this happens, it can be helpful to view the logs from the capi-kubeadm-control-plane-system controller. In those logs, you may see failures indicating that the control-plane pods are unhealthy:
"msg"="Waiting for control plane to pass preflight checks" "cluster-name"="<CLUSTERNAME>" "name"="<CLUSTERNAME>" "namespace"="default" "failures"="[<MACHINE> does not have APIServerPodHealthy condition, machine <MACHINE> does not have ControllerManagerPodHealthy condition, machine <MACHINE> does not have SchedulerPodHealthy condition, machine <MACHINE> does not have EtcdPodHealthy condition, machine <MACHINE> does not have EtcdMemberHealthy condition]"
controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/<CLUSTERNAME>\": context deadline exceeded" "cluster"="<CLUSTERNAME>" "name"="<CLUSTERNAME>" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
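These messages come from the KubeadmControlPlane controller. A minimal sketch of how to retrieve them, assuming the default CAPI component names (adjust the deployment name if your installation differs):

kubectl logs -n capi-kubeadm-control-plane-system \
  deployment/capi-kubeadm-control-plane-controller-manager --tail=100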
The failure here indicates an underlying issue on the host that is causing the API server pod to be unhealthy. When this happens, CAPI pauses deploying the remaining nodes until the problem on the first control-plane node is resolved.
Solution
The best next step in these situations is to SSH into the problem host and investigate the underlying issue. In this case, we know that the API server pod is unhealthy. When a pod fails to start or crashes, it is advisable to check the kubelet service log, the containerd service log, and the logs from the container itself (see the commands after the error output below). After collecting this logging, we can see that the API server pod is failing because of an issue pulling the image:
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.22.8: output: time="" level=fatal msg="pulling image: rpc error: code = Unknown desc = failed to pull and unpack image \"k8s.gcr.io/kube-apiserver:v1.22.8\": failed to resolve reference \"k8s.gcr.io/kube-apiserver:v1.22.8\": get registry endpoints: parse endpoint url: parse \":///v2/<REGISTRY_URL>/dkp\": missing protocol scheme"
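For reference, the host-level logs mentioned above can be gathered with standard tooling. A minimal sketch, assuming a systemd-managed kubelet and containerd:

journalctl -u kubelet --since "1 hour ago"       # kubelet service log
journalctl -u containerd --since "1 hour ago"    # containerd service log
crictl ps -a | grep kube-apiserver               # find the API server container ID, if one was created
crictl logs <CONTAINER_ID>                       # logs from that container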
We can see that the 'missing protocol scheme' error results from our private Docker registry URL not including the http(s):// scheme. When deploying an air-gapped DKP cluster, there are multiple points where you will need to define the DOCKER_REGISTRY_ADDRESS environment variable. The first is when you seed your registry; if you include the protocol scheme here, you will run into a separate error:
level=fatal msg="Invalid destination name docker://https://<REGISTRY_URL>: invalid reference format"
In this case, all that needs to be done is to remove the http(s):// prefix from your environment variable and re-run the seed command. The second place this variable is required is just before creating your cluster; here, you will need to redefine the environment variable in the format <http(s)>://<registry-address>:<registry-port>, as shown in the sketch below.
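A minimal sketch of the two forms, assuming a hypothetical registry at registry.example.com:5000 (substitute your own address, and use the seed and cluster-creation commands from your DKP version's documentation):

# When seeding the registry: no protocol scheme
export DOCKER_REGISTRY_ADDRESS=registry.example.com:5000
# <re-run your registry seed command with this value>

# Just before creating the cluster: include the protocol scheme
export DOCKER_REGISTRY_ADDRESS=https://registry.example.com:5000
# <re-run your dkp cluster creation command with your usual flags>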
After you have updated your environment variable, you will need to redeploy the cluster. Once redeployed, you can check the CAPI control-plane controller logs to validate that the errors have resolved.
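A quick way to confirm the control plane is progressing, assuming the cluster objects live in the default namespace and the default CAPI component names:

kubectl get kubeadmcontrolplane,machines -n default
kubectl logs -n capi-kubeadm-control-plane-system \
  deployment/capi-kubeadm-control-plane-controller-manager --tail=50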