Troubleshooting DKP 2.x Azure/AKS Deployments – D2iQ

DKP 2.x is built on Cluster API, so Azure deployments use the Cluster API Provider for Azure (capz) to interface with Azure and AKS.

A cluster can fail to deploy for a number of reasons. A typical symptom is a timeout after running the following command:

kubectl wait --for=condition=ControlPlaneReady "clusters/${CLUSTER_NAME}" --timeout=20m

To investigate further, examine the capz logs, which contain messages explaining the failure. For example:

E0307 15:54:54.704442 1 controller.go:304] controller/azuremanagedmachinepool "msg"="Reconciler error" "error"="error creating AzureManagedMachinePool default/cptgw4c: failed to reconcile machine pool cptgw4c: failed to get existing agent pool: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/bb3ab037-69d3-4c5d-97f1-046c6f0c03c1/resourceGroups//providers/Microsoft.ContainerService/managedClusters//agentPools/cptgw4c?api-version=2021-05-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post \"https://login.microsoftonline.com/5ea61d64-79eb-42ea-a33b-f7b9e17616b1/oauth2/token?api-version=1.0\": dial tcp: lookup login.microsoftonline.com on 10.96.0.10:53: server misbehaving'" "name"="cptgw4c" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="AzureManagedMachinePool"
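Log lines like the one above can be pulled from the capz controller with kubectl. The following is a minimal sketch, not an official DKP procedure. It assumes the controller runs as the deployment capz-controller-manager in the namespace capz-system (the upstream capz defaults, not stated in this article) and that kubectl is pointed at the cluster where the capz controller is running. The function name capz_reconciler_errors and the CAPZ_NAMESPACE and CAPZ_DEPLOYMENT variables are hypothetical helpers introduced for illustration.

```shell
# Print recent "Reconciler error" lines from the capz controller logs.
# Assumed defaults (not confirmed by this article): namespace "capz-system",
# deployment "capz-controller-manager". Override them with the hypothetical
# CAPZ_NAMESPACE and CAPZ_DEPLOYMENT environment variables if yours differ.
capz_reconciler_errors() {
  # $1 is the number of trailing log lines to scan (default 500).
  kubectl logs \
    --namespace "${CAPZ_NAMESPACE:-capz-system}" \
    "deployment/${CAPZ_DEPLOYMENT:-capz-controller-manager}" \
    --all-containers \
    --tail="${1:-500}" \
    | grep 'Reconciler error'
}
```

Calling capz_reconciler_errors 1000 scans the last 1000 log lines. If the function prints nothing, check that the namespace and deployment names match your environment.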

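The fragment "dial tcp: lookup login.microsoftonline.com on 10.96.0.10:53: server misbehaving" shows that the DNS lookup for login.microsoftonline.com failed, so capz could not refresh its Azure token. This points to a DNS resolution problem in the Azure environment.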
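When the logs are long, it can help to extract only the failed lookups. The sketch below reads log text on standard input and prints each failing hostname together with the resolver address that was queried. The function name dns_lookup_failures is hypothetical, and the pattern only matches the "server misbehaving" wording seen in the example above. Other DNS error messages would need their own pattern.

```shell
# Read controller log text on stdin and print each failed DNS lookup once,
# in the form "<hostname> <resolver ip:port>".
# Only matches the "server misbehaving" message shown in this article.
dns_lookup_failures() {
  grep -oE 'lookup [^ ]+ on [0-9.]+:[0-9]+: server misbehaving' \
    | awk '{ print $2, $4 }' \
    | sed 's/:$//' \
    | sort -u
}
```

For the example log line, this prints "login.microsoftonline.com 10.96.0.10:53". The address 10.96.0.10 is commonly the in-cluster DNS service, although that is an assumption about this environment. If that is the case here, the next step is to check the cluster DNS pods and their upstream resolvers.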

Further information about the capz implementation is available at https://capz.sigs.k8s.io/