You may notice that a node in your Kubernetes cluster has become unhealthy. If you describe the node in question, you may see the following status message:
message: 'PLEG is not healthy: pleg was last seen active ...
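To pull out just that message rather than scanning the full describe output, something like the following may help. This is a minimal sketch: the function name node_ready_message is made up for illustration, and it assumes kubectl is installed and configured for the affected cluster.

```shell
# Hypothetical helper: print the message attached to a node's Ready condition.
# Assumes kubectl is on the PATH and points at the cluster in question.
node_ready_message() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}

# Example (replace my-node with a real node name):
#   node_ready_message my-node
```

If the PLEG is the culprit, the printed message should begin with the "PLEG is not healthy" text shown above.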
The Pod Lifecycle Event Generator (PLEG) is usually unhealthy because the underlying container runtime is unhealthy. Konvoy uses containerd, so the first thing to try is checking the health of containerd on the affected node. A good way to check is to run:
time sudo crictl ps -a
On a healthy node, this lists all containers on the host. You know you have a problem if it instead returns immediately with an error message such as the following:
FATA[0000] listing containers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16844660 vs. 16777216)
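The two numbers in parentheses are the size of the response and the limit it exceeded. As a rough sketch, a small helper can extract them and report the difference. The function name grpc_overshoot is hypothetical, and it only processes the text of the error line; it does not contact containerd.

```shell
# Hypothetical helper: read a crictl error line and report how far the
# response exceeded the gRPC message size limit. Pure text processing.
grpc_overshoot() {
  local pair actual max
  # Pull the "(<actual> vs. <max>)" pair out of the message.
  pair="$(printf '%s\n' "$1" | sed -n 's/.*larger than max (\([0-9][0-9]*\) vs\. \([0-9][0-9]*\)).*/\1 \2/p')"
  # Return non-zero if the line does not contain the expected pattern.
  [ -n "$pair" ] || return 1
  actual="${pair% *}"
  max="${pair#* }"
  echo "response=${actual} limit=${max} over_by=$((actual - max))"
}

grpc_overshoot 'FATA[0000] listing containers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16844660 vs. 16777216)'
# prints: response=16844660 limit=16777216 over_by=67444
```

In this example the container listing is only about 66 KiB over the limit, which suggests that removing a modest number of containers could bring it back under.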
The gRPC max message size is controlled by /etc/containerd/config.toml. The maximum value you can set this to is 16777216 bytes (16 MiB). The error message above shows that we are already at that maximum, so increasing the value will not help. From here, you can try restarting containerd via:
sudo systemctl restart containerd
Restarting containerd will likely not solve the issue, though. If you still experience problems, the next step is to delete all exited containers on the node via:
sudo crictl rm --all
At first glance this command looks like it will delete every container on the host, which would not be ideal. However, if you run it against a healthy test node, you'll find that it cannot remove running containers unless you add the --force flag. In practice, then, the command removes all exited containers. This should fix the problem with running sudo crictl ps -a, which in turn should resolve the unhealthy PLEG.
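To confirm that only exited containers were removed, you could count containers by state before and after the cleanup. This is a sketch under assumptions: the helper name count_containers is made up, and it relies on crictl being able to reach containerd, which may not be the case before the cleanup.

```shell
# Hypothetical helper: count the containers crictl reports in a given state
# (for example: running, exited). Note that a failing crictl call also
# yields 0, so treat a zero count with care on an unhealthy node.
count_containers() {
  sudo crictl ps -a -q --state "$1" | wc -l | tr -d ' '
}

# Example, run on the affected node:
#   before="$(count_containers running)"
#   sudo crictl rm --all
#   after="$(count_containers running)"
#   echo "running before=${before} after=${after} exited left=$(count_containers exited)"
```

If the running count is unchanged and the exited count has dropped to zero, the cleanup behaved as described above.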