Overview
You may find that your Konvoy/DKP cluster behaves unpredictably and that certain components and services are crashing. In cases like this, check the health of etcd first: the Kubernetes API server stores all cluster state in etcd, so every other Kubernetes component relies on a functioning etcd cluster.
One problem etcd can encounter is clock skew, where the system clocks on the nodes running etcd are out of sync with each other. Please refer to this article for more details on why time sync is important in distributed computing.
Identifying and Resolving
Check the etcd pod logs for entries that resemble the following. They may appear on only one or some of your etcd pods:
20XX-XX-XX XX:XX:XX.XXXXXX W | rafthttp: the clock difference against peer bb00b46425cdaa06 is too high [1m31.204410306s 1s]
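As a rough sketch, the log check above could be scripted so that every etcd pod is scanned in one pass. The function names etcd_clock_warnings and check_etcd_pods are hypothetical, and the kube-system namespace and component=etcd label are assumptions based on kubeadm-style static pods; adjust them to match your cluster. The pattern matches only the log format shown above, and other etcd versions may word the warning differently.

```shell
# Hypothetical helper: reads etcd log lines on stdin and prints
# "<peer id> <reported clock difference>" for each clock warning.
# Assumes the log format shown in the example line above.
etcd_clock_warnings() {
  sed -n 's/.*the clock difference against peer \([0-9a-f]*\) is too high \[\([^ ]*\) .*/\1 \2/p'
}

# Hypothetical helper: scans every etcd pod for clock warnings.
# Assumes etcd runs in kube-system with the label component=etcd.
check_etcd_pods() {
  for pod in $(kubectl -n kube-system get pods -l component=etcd -o name); do
    echo "== $pod"
    kubectl -n kube-system logs "$pod" | etcd_clock_warnings
  done
}
```

Running check_etcd_pods from a machine with kubectl access prints one section per etcd pod. A section with no lines under it means that pod logged no clock warnings in its current log.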
Solution
Messages like this indicate that the etcd members disagree about the current time. In the example above, the clock difference against the peer is about 1 minute 31 seconds, compared with a limit of 1 second. This can cause etcd to stop trusting the integrity of its own data, which can result in outages or other unpredictable behavior.
To resolve this, SSH into each node running etcd and force chrony to step the system clock immediately:
sudo chronyc -a makestep
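To confirm on each node that the clock is back in sync, the output of chronyc tracking can be inspected. The sketch below is one possible check, not an official procedure: chrony_in_sync is a hypothetical helper name, and the default limit of 1 second is an assumption chosen to mirror the 1s value in the etcd warning.

```shell
# Hypothetical helper: reads `chronyc tracking` output on stdin.
# Succeeds (exit 0) only if chrony reports it is synchronised and the
# "System time" offset is below the limit in seconds (default: 1).
chrony_in_sync() {
  max_offset="${1:-1}"
  awk -v max="$max_offset" '
    /^System time/ { offset = $4; seen = 1 }
    /^Leap status/ { if ($0 ~ /Not synchronised/) unsynced = 1 }
    END {
      if (seen && !unsynced && offset + 0 < max + 0) exit 0
      exit 1
    }
  '
}
```

For example, running chronyc tracking | chrony_in_sync 1 && echo "clock in sync" on a node prints the message only when the reported offset is under one second.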
If makestep does not bring the system clocks back in sync, chrony itself needs further troubleshooting. The chrony FAQ lists additional diagnostic steps: https://chrony.tuxfamily.org/faq.html#_computer_is_not_synchronising