Konvoy auto-provisioning-webhook pod fails to start due to aggressive liveness check
Overview/Background
When deploying a version <=v1.5.2 Konvoy cluster, auto-provisioning is installed automatically during the cluster deployment by default. In certain environments, we have observed that the liveness check grace period for the auto-provisioning-webhook pod is too aggressive and doesn't allow the pod to start successfully. In these scenarios, you will observe that liveness/readiness probes have failed in the pod's describe output:
$ kubectl describe pod -l control-plane=auto-provisioning-webhook -n konvoy ... Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning Unhealthy 42h (x1776 over 2d10h) kubelet, Readiness probe failed: Get https://:8443/readyz: dial tcp :8443: connect: connection refused Warning Unhealthy 41h (x933 over 2d10h) kubelet, Liveness probe failed: Get https://:8443/healthz: dial tcp :8443: connect: connection refused Warning BackOff 41h (x3394 over 2d10h) kubelet, Back-off restarting failed container
Additionally, the auto-provisioning-webhook logs will have very little output, usually only consisting of the following line:
$ kubectl logs -l control-plane=auto-provisioning-webhook -n konvoy 2020-10-10T10:10:10.100Z INFO konvoyimagefetcher.direct failed to parse image metadata - skipping {"tag": "dev", "digest": "sha256:47ddbe9d30299d8beafa970c496523604f64551165a2292738bcba39936e2398", "error": "invalid semantic version dev for Konvoy: could not parse \"dev\" as version"}
Solution
If you are not using auto-provisioning, you can remove it from the cluster with the following command, ensuring that you substitute for your Konvoy version (e.g., v1.5.2):
docker run -v $(pwd):/opt/konvoy -e KUBECONFIG=admin.conf -w /opt/konvoy --entrypoint /usr/local/bin/helmv3 mesosphere/konvoy: uninstall auto-provisioning -n konvoy
However, note that any subsequent runs of `konvoy up` or `konvoy deploy` must also include the `--without-auto-provisioning` flag.
To workaround this issue with auto-provisioning enabled, you can perform the following process:
1. Execute `./konvoy up --yes` 2. Once auto provisioning starts to try to deploy and hang, exit the process with Ctrl+c 3. Edit the auto-provisioning-webook deployment with `kubectl edit deploy -n konvoy auto-provisioning-webhook` 4. Modify the liveness probe by changing initialDelaySeconds from 30 to 180, then save and quit so that the deployment gets updated 5. Ensure the pod restarts with `kubectl get po -n konvoy -l control-plane=auto-provisioning-webhook` (AGE should be around the time when you edited the deployment) 6. Follow the logs and check if the pod starts properly with `kubectl logs -n konvoy -l control-plane=auto-provisioning-webhook -f` 7. Confirm that the pod becomes READY with `kubectl get pod -l control-plane=auto-provisioning-webhook -n konvoy`
Once complete, auto-provisioning should deploy successfully and be fully functional. Should you encounter any additional issues, please feel free to open a case with D2iQ support.