Pods being evicted, "Attempting to reclaim ephemeral-storage" – D2iQ

You may encounter a case where pods are being evicted from a node, and the "kubectl describe" output for the affected pods reports that the kubelet is doing this to reclaim ephemeral storage.

For more details on this issue, the first place to check is the log output of the kubelet service on the node in question. SSH into the node and run:

journalctl -u kubelet

Look for entries like the following:

Jan 28 16:09:51 node03 kubelet[25570]: I0128 16:09:51.448281   25570 image_gc_manager.go:305] [imageGCManager]: Disk usage on image filesystem is at 85% which is over the high threshold (85%). Trying to free 2000000000 bytes down to the low threshold (80%).

Messages like this indicate that the filesystem backing the kubelet or the kubelet's image store is over its maximum usage threshold. The kubelet must free space, first by removing unused images and then, if necessary, by evicting pods, to keep the node healthy.
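When scanning a long kubelet log, it can help to pull the numbers out of these entries. The following is a minimal Python sketch, assuming the message wording matches the sample entry above; other kubelet versions may phrase it differently. The function name parse_image_gc_line and the regular expression are illustrative and are not part of any kubelet tooling.

```python
import re

# Pattern based on the sample log entry in this article (an assumption:
# other kubelet versions may word this message differently).
IMAGE_GC_PATTERN = re.compile(
    r"Disk usage on image filesystem is at (?P<usage>\d+)% "
    r"which is over the high threshold \((?P<high>\d+)%\)\. "
    r"Trying to free (?P<bytes>\d+) bytes down to the low threshold \((?P<low>\d+)%\)"
)


def parse_image_gc_line(line):
    """Return the figures from an imageGCManager log line, or None if it does not match."""
    match = IMAGE_GC_PATTERN.search(line)
    if match is None:
        return None
    return {
        "usage_percent": int(match.group("usage")),
        "high_threshold_percent": int(match.group("high")),
        "bytes_to_free": int(match.group("bytes")),
        "low_threshold_percent": int(match.group("low")),
    }
```

One way to use this is to save the output of "journalctl -u kubelet" to a file and call parse_image_gc_line on each line, keeping only the results that are not None.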

These values correspond to the "nodefs" and "imagefs" thresholds documented here:
https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#hard-eviction-thresholds

Unless you have configured your nodes differently, these will usually refer to the same mount: the one that holds /var/lib/kubelet, /var/log, and /var/lib/containerd. You will therefore encounter this issue when that mount reaches 85% of its total capacity.

It is also important to note that "allocation" is a different value than "usage". This can be confusing, as the "Allocated Resources" output for this node may not report very high disk usage.

However, if you run a command such as "df" on the impacted node, you can see actual usage. In most cases, the ephemeral mount will be located at "/". It will look like the following:

Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/root          100G   84G   16G  84%    /
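The same check can be scripted. The sketch below uses Python's standard shutil.disk_usage to report how full a mount is and whether it has reached a threshold. The 85% default is taken from the figure discussed in this article and is an assumption about your node's configuration; the function names are illustrative. The percentage may differ slightly from the Use% column of "df", which accounts for reserved blocks differently.

```python
import shutil

# Assumed threshold: 85% usage, the figure discussed in this article.
# Adjust it if your kubelet is configured differently.
USAGE_THRESHOLD_PERCENT = 85.0


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def check_mount(path="/", threshold=USAGE_THRESHOLD_PERCENT):
    """Return (percent used, True if usage is at or above the threshold) for the mount holding path."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct, pct >= threshold


if __name__ == "__main__":
    pct, over = check_mount("/")
    print(f"/ is {pct:.1f}% used; at or over threshold: {over}")
```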

The reason for this discrepancy is that the node's "Allocated Resources" output only shows the Requests and Limits that you've configured for the pods that are scheduled on this node. It does NOT show actual disk usage. Pods can be configured to use emptyDir mounts without Requests or Limits configured, or a large volume of logs may be written to this mount. Neither of these cases counts towards "Allocated Resources", but both still occupy disk space.

To resolve this issue, first be sure that the mount containing /var/lib/kubelet and /var/lib/containerd is configured to have at least the recommended minimum of 80GB:
https://docs.d2iq.com/dkp/konvoy/latest/install/install-onprem/#worker-nodes

This may not be enough for certain workloads, so be sure to size this mount according to your needs.

In addition, it can be helpful to establish specific ephemeral-storage Requests and Limits for the containers in your pods, and size limits for the emptyDir volumes they use, rather than leaving them unlimited. This has the added benefit of making it easier to identify specific pods that are taking up disk space:
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits
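As an illustration, the following Python sketch builds a pod manifest that sets ephemeral-storage Requests and Limits on a container and a sizeLimit on an emptyDir volume, then prints it as JSON. The field names are standard Kubernetes pod fields; the pod name, image, mount path, and sizes are placeholders chosen for this example and should be replaced with values that suit your workload.

```python
import json


def pod_with_storage_limits(name, image, request="1Gi", limit="2Gi", scratch_limit="1Gi"):
    """Build a pod manifest with ephemeral-storage requests/limits and a capped emptyDir."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # Counted in the node's "Allocated Resources" output.
                        "requests": {"ephemeral-storage": request},
                        # Exceeding the limit makes the pod a candidate for eviction.
                        "limits": {"ephemeral-storage": limit},
                    },
                    "volumeMounts": [{"name": "scratch", "mountPath": "/scratch"}],
                }
            ],
            # sizeLimit caps how much the emptyDir volume may hold.
            "volumes": [{"name": "scratch", "emptyDir": {"sizeLimit": scratch_limit}}],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = pod_with_storage_limits("example-app", "registry.example.com/example-app:1.0")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output can be saved to a file and applied with "kubectl apply -f".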

Finally, be aware of the amount of logging that your pods are printing. Particularly noisy processes can eat up a surprising amount of disk space in a short amount of time.
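To find out which pods are producing the most log data, one option is to total the size of each subdirectory under the node's pod log directory. The Python sketch below does this for any parent directory; using /var/log/pods as that directory is an assumption about where your container runtime writes pod logs, and the function names are illustrative.

```python
import os


def directory_size(path):
    """Total size in bytes of regular files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # Files can disappear while logs rotate; skip them.
                continue
    return total


def largest_subdirectories(parent, limit=10):
    """Return up to `limit` (size_in_bytes, name) pairs for subdirectories of parent, largest first."""
    sizes = []
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((directory_size(entry.path), entry.name))
    sizes.sort(reverse=True)
    return sizes[:limit]


if __name__ == "__main__":
    # Assumed log location; adjust for your node.
    for size, name in largest_subdirectories("/var/log/pods"):
        print(f"{size / 1024 ** 2:10.1f} MiB  {name}")
```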

If you are encountering this issue in Konvoy and none of these situations seem to apply, please submit a ticket with D2iQ support for additional triage:
https://support.d2iq.com/s/article/Opening-a-New-Support-Case