After a network outage, you may discover that your DKP 2.2 cluster is stuck: no new pods can be started, and the control plane component logs (kube-scheduler and kube-controller-manager) are full of lines like this:
Failed to update lock: Put "https://10.208.3.13:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
or
E0805 10:10:42.636964 1 leaderelection.go:361] Failed to update lock: Put "https://10.208.3.14:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
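If the API server is still answering read requests, the same errors can usually be confirmed directly from the leader-electing components. The label selectors below are an assumption based on the standard kubeadm-style static pod labels used on a DKP control plane:

# Check the leader-electing components for lease update failures
kubectl -n kube-system logs -l component=kube-controller-manager --tail=50 | grep "Failed to update lock"
kubectl -n kube-system logs -l component=kube-scheduler --tail=50 | grep "Failed to update lock"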
This is a familiar pattern when an overly restrictive validating webhook is in place. The webhook rejects resource updates, in this case the lease writes the control plane components use for leader election. Because leader election cannot complete, the cluster cannot recover until its components can re-establish their locks.

To get out of this state, first back up the existing ValidatingWebhookConfigurations so they can be deleted now and redeployed later. Once the cluster has recovered, redeploy the webhooks. Kommander requires certain webhooks to exist, both so that other tools receive the fields they need to function (such as proxy settings) and to prevent users or external entities from performing actions they are not allowed to. We therefore recommend always restoring the webhooks once the cluster is healthy again.
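Before deleting anything, it can help to confirm which webhook is likely at fault. A quick check (a sketch, assuming kubectl can still serve read requests) is to list each ValidatingWebhookConfiguration together with its failure policy; a webhook whose backing service is unreachable after the outage and whose failurePolicy is Fail will reject writes such as the lease updates above:

# Show each validating webhook and its failure policy; "Fail" means requests
# are rejected whenever the webhook's backing service cannot be reached
kubectl get ValidatingWebhookConfiguration \
  -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy'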
The following script backs up and then deletes all ValidatingWebhookConfigurations:
# List the existing configurations first
kubectl get ValidatingWebhookConfiguration

# Back up each ValidatingWebhookConfiguration to a YAML file
for VALWEBHOOKCONF in $(kubectl get ValidatingWebhookConfiguration --no-headers | awk '{ print $1 }'); do
  kubectl get ValidatingWebhookConfiguration $VALWEBHOOKCONF -o yaml > $VALWEBHOOKCONF.yaml
done

# Delete the configurations so the lease updates can go through again
for VALWEBHOOKCONF in $(kubectl get ValidatingWebhookConfiguration --no-headers | awk '{ print $1 }'); do
  kubectl delete ValidatingWebhookConfiguration $VALWEBHOOKCONF
done
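Once the components can update their leases again and the cluster has recovered, the saved configurations can be reapplied. A minimal sketch, assuming the backups written by the loop above are the only .yaml files in the current directory:

# Re-create the webhooks from the YAML backups taken earlier
for BACKUP in *.yaml; do
  kubectl apply -f $BACKUP
done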