Nodes in NotReady state due to 'use of closed network connection'
Overview/Background
While operating a Konvoy cluster, you may encounter a scenario where one or more nodes go into a NotReady state and do not recover. To troubleshoot, start with the kubelet logs on the node that is NotReady:
journalctl -u kubelet
One cause of nodes going into a NotReady state is an upstream Kubernetes issue in which the Go HTTP/2 library keeps using dead connections. For more information, see:
https://github.com/kubernetes/kubernetes/issues/87615
https://github.com/kubernetes/kubernetes/issues/91963
https://go-review.googlesource.com/c/net/+/198040
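Before inspecting kubelet logs, it can help to list which nodes are currently NotReady. The following is a minimal sketch, assuming kubectl is configured against the cluster and that the default `kubectl get nodes` column layout (NAME, STATUS, ...) is in use. The function name list_notready_nodes is hypothetical and not part of Konvoy.

```shell
#!/bin/sh
# Illustrative helper; the function name is an assumption, not a Konvoy tool.

# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# name of every node whose STATUS column starts with "NotReady"
# (this also matches "NotReady,SchedulingDisabled").
list_notready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}
```

A typical invocation would be `kubectl get nodes --no-headers | list_notready_nodes`, run from a machine with access to the cluster.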
If you are encountering this issue, in addition to the node being in a NotReady state, you will observe messages similar to the following in the kubelet logs:
Error updating node status, will retry: error getting node "": Get https://:6443/api/v1/nodes/?timeout=10s: read tcp :42388->:6443: use of closed network connection
To confirm, run the following command on the node in question:
journalctl -u kubelet | grep 'use of closed network connection'
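To tell an ongoing problem from an old one, it may be useful to count how often the message appears in a recent time window rather than across the whole journal. The sketch below assumes journalctl supports `--since` with a relative time. The function names and the default 30-minute window are illustrative choices, not values from the upstream issue.

```shell
#!/bin/sh
# Illustrative helpers; names and the default window are assumptions.

# Reads log lines on stdin and prints how many contain the error.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean while still printing "0".
count_closed_conn_errors() {
  grep -c 'use of closed network connection' || true
}

# Counts matching kubelet log lines since the given time (default: the
# last 30 minutes). Must be run on the affected node.
recent_kubelet_errors() {
  window="${1:-30 minutes ago}"
  journalctl -u kubelet --since "$window" --no-pager | count_closed_conn_errors
}
```

Running `recent_kubelet_errors "10 minutes ago"` on the node prints a single number. A count that keeps growing between runs suggests the kubelet is still stuck on the dead connection.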
Solution
To resolve the issue immediately, restart the kubelet with the following command:
sudo systemctl restart kubelet
Once the kubelet has restarted, it should reconnect to the API server and the node should return to a Ready state.
As a short-term mitigation until the upstream issue is resolved, we have introduced a watchdog to monitor the kubelet for this issue and automatically remediate it. For instructions on adding this watchdog to your cluster, please reach out to D2iQ support.
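To illustrate the general idea of such a watchdog, here is a minimal sketch. It is not the watchdog provided by D2iQ, and every function name in it is hypothetical. It assumes it runs as root on the node, that the kubelet is managed by systemd under the unit name kubelet, and that a 10-minute look-back window is acceptable. It inspects only log entries from the currently running kubelet process, so messages logged before a previous restart do not trigger another one.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; NOT the D2iQ-provided watchdog.
# All function names and the 10-minute window are assumptions.

# Succeeds (exit 0) if the log text on stdin contains the error.
kubelet_log_has_error() {
  grep -q 'use of closed network connection'
}

# Extracts the PID from a "MainPID=1234" line, as printed by
# `systemctl show kubelet --property=MainPID`.
parse_main_pid() {
  echo "${1#MainPID=}"
}

# One watchdog pass: check recent logs of the running kubelet process
# and restart the kubelet if the error is present.
check_once() {
  pid=$(parse_main_pid "$(systemctl show kubelet --property=MainPID)")
  # MainPID is 0 when the kubelet is not running; nothing to inspect.
  [ "$pid" -gt 0 ] 2>/dev/null || return 0
  if journalctl _PID="$pid" --since "10 minutes ago" --no-pager | kubelet_log_has_error; then
    echo "kubelet reported 'use of closed network connection'; restarting kubelet"
    systemctl restart kubelet
  fi
}
```

A script like this would call check_once periodically, for example from a cron job or a systemd timer. Filtering on the kubelet's current PID is a design choice that keeps the check from restarting the kubelet repeatedly on stale log entries.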