In a DKP cluster, you may notice that the filesystem where /var/lib/containerd is mounted has reached 85% usage and that image garbage collection is failing with log messages like the following:
MMM DD HH:MM:SS node.example.com kubelet[30947]: EMMDD HH:MM:SS.msmsms 30947 kubelet.go:1287] Image garbage collection failed multiple times in a row: failed to garbage collect required amount of images. Wanted to free xxxx bytes, but freed 0 bytes.
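To confirm the symptom, you can check the filesystem usage and search the kubelet logs on the affected node. This is a minimal sketch; it assumes a systemd-managed kubelet and that /var/lib/containerd is on the mount that is filling up (adjust paths to your layout):

# Check how full the filesystem holding /var/lib/containerd is:
df -h /var/lib/containerd

# Look for the garbage collection failure in the kubelet logs:
journalctl -u kubelet --no-pager | grep -i "garbage collect"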
By default, the kubelet taints a node with node.kubernetes.io/disk-pressure and starts evicting pods once disk usage crosses its eviction threshold. Before it gets to that point, it runs an image garbage collection pass once usage exceeds the high threshold (85% by default), deleting container images that are not used by any running container.
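You can check whether the node is already reporting disk pressure and what thresholds the kubelet is running with. The config file path below is the common kubeadm default, and the field names come from the KubeletConfiguration API (imageGCHighThresholdPercent / imageGCLowThresholdPercent, defaulting to 85/80); treat both as assumptions to adjust for your environment:

# Is the node already reporting disk pressure?
kubectl get node node.example.com -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'

# Which image GC thresholds is the kubelet configured with?
grep -i imagegc /var/lib/kubelet/config.yaml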
A common cause of this error is that the kubelet cannot find any unused images to delete, so it has no way to bring disk usage back below the threshold.
To rule this out, you can trigger an image cleanup manually. On an affected node, run:
crictl rmi -q
This will attempt to identify unused images and remove them.
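To see what the cleanup actually accomplished, you can list the images before pruning and re-check disk usage afterwards; a minimal sketch:

# List the images currently present on the node:
crictl images

# Prune unused images, then confirm how much space was reclaimed:
crictl rmi -q
df -h /var/lib/containerd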
If this frees little or no space, your workloads are likely consuming more disk than the node has available.
In that case, your options are to reduce disk usage or to increase the total disk capacity of your nodes.
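To decide where to reclaim space, it helps to see what is actually consuming the disk. This sketch assumes GNU du/sort and the default containerd and pod log locations:

# Find the largest directories on the containerd filesystem
# (-x keeps du on the same filesystem):
du -xh --max-depth=2 /var/lib/containerd | sort -rh | head -20

# Pod logs can also fill a node's disk:
du -xh /var/log/pods | sort -rh | head -10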
Here are some additional resources from others who have encountered this error in Kubernetes: