Overview
In some cases, you may observe a Konvoy cluster's control plane stack (etcd, containerd, the apiserver, and related components) suddenly becoming unstable. We have found that this behavior can be caused by Helm retaining too many configmaps for old chart revisions. This guide will help you determine whether that is the problem you are experiencing and, if so, how to resolve it.

Identifying and resolving
Because etcd is the component most sensitive to changes in response time, its logs are the best place to check. You may find messages that resemble the following:

kube-system_etcd-ip-hostname-x.log:2020-01-01 00:00:00 | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/\" range_end:\"/registry/configmaps/kube-system0\" " with result "range_response_count:2766 size:60299670" took too long (416.406633ms) to execute
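Slow-request lines like the one above can be pulled out of an etcd log with a simple grep. A minimal, self-contained sketch using a one-line sample of that message (the file path is hypothetical; on a real cluster the logs typically come from `kubectl logs -n kube-system etcd-<node-name>`):

```shell
# Write a one-line sample resembling the message above (hypothetical path).
cat > /tmp/etcd-sample.log <<'EOF'
2020-01-01 00:00:00 | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/\" range_end:\"/registry/configmaps/kube-system0\" " with result "range_response_count:2766 size:60299670" took too long (416.406633ms) to execute
EOF

# Count slow configmap range requests; a steadily growing count in the real
# logs points at the problem described in this article.
grep -c 'configmaps.*took too long' /tmp/etcd-sample.log
# → 1
```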
This message indicates that requests to list configmaps are taking too long and tying up etcd resources. Next, run the following to count how many configmaps Helm is holding in "SUPERSEDED" status:

kubectl get cm -n kube-system -l STATUS=SUPERSEDED,OWNER=TILLER -oname | grep kubeaddons | wc -l
If this number is high (in the hundreds or thousands), you can run the following to remove the superseded configmaps:
kubectl get cm -n kube-system -l STATUS=SUPERSEDED,OWNER=TILLER -oname | grep kubeaddons | xargs -I {} kubectl delete -n kube-system {}
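Before deleting anything, you can preview exactly what the pipeline will remove by substituting `echo` for `kubectl` in the final step. A self-contained sketch with simulated input (the configmap names are hypothetical; on a real cluster the list comes from the `kubectl get cm` command above):

```shell
# Simulated output of `kubectl get cm ... -oname`; only the kubeaddons
# entries should survive the grep and reach the delete step.
printf '%s\n' \
  configmap/kubeaddons-dashboard.v1 \
  configmap/kubeaddons-dashboard.v2 \
  configmap/coredns \
  | grep kubeaddons \
  | xargs -I {} echo kubectl delete -n kube-system {}
# → kubectl delete -n kube-system configmap/kubeaddons-dashboard.v1
# → kubectl delete -n kube-system configmap/kubeaddons-dashboard.v2
```

Once the preview looks right, drop the `echo` to perform the actual deletion.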
This cleanup should improve etcd response times and resolve the stability issues.
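To reduce the chance of recurrence, Helm 2's Tiller supports a cap on how many release revisions (and therefore configmaps) it retains. A hedged sketch, assuming a standard Helm 2 client and an existing Tiller install in your cluster; the limit of 10 is an arbitrary example, so choose a value that fits your rollback needs:

```shell
# Re-run helm init against the existing Tiller to apply a revision cap.
# Without --history-max, Tiller keeps every superseded revision indefinitely.
helm init --upgrade --history-max 10
```

Note that this only limits future growth; existing superseded configmaps still need to be cleaned up as shown above.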
If you believe you are impacted by this problem but this resolution did not help, please submit a support ticket:
https://support.d2iq.com/s/article/Opening-a-New-Support-Case