Overview
There is a known issue in certain versions of Konvoy where the dcgm-exporter pod fails to start when you deploy GPU nodes with host-based drivers.
This appears to be caused by nv-hostengine failing to start properly inside the nvidia-dcgm-exporter container.
Messages like the following in your dcgm-exporter pod logs indicate this problem:
I1222 06:53:33.214405 1 main.go:18] Starting OS watcher.
I1222 06:53:33.215811 1 main.go:23] Starting FS watcher.
E1222 06:53:38.630994 1 http.go:48] error responding to 10.0.128.146:9400/gpu/metrics: open /run/dcgm/dcgm-pod.prom: no such file or directory
F1222 06:53:53.216156 1 watchers.go:58] No events received. Make sure "dcgm-exporter" is running
You can also check the output of `dmesg -T` on the relevant node and look for a line resembling:
[Tue Dec 22 07:57:33 2020] nvidia-nvswitch: Version mismatch, kernel version 460.27.04 user version 450.51.06
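If you need to check several nodes, the search for this line can be scripted. The following is a minimal sketch in Python, assuming the mismatch message always follows the wording of the sample line above; the pattern and the function name `find_driver_mismatch` are illustrative and not part of Konvoy or the NVIDIA tooling.

```python
import re

# Pattern modelled on the sample dmesg line above. This is an assumption:
# other driver builds may word the message differently.
MISMATCH_RE = re.compile(
    r"nvidia-nvswitch: Version mismatch, "
    r"kernel version (?P<kernel>[\d.]+) user version (?P<user>[\d.]+)"
)


def find_driver_mismatch(dmesg_text):
    """Return (kernel_version, user_version) from the first mismatch line, or None."""
    for line in dmesg_text.splitlines():
        match = MISMATCH_RE.search(line)
        if match:
            return match.group("kernel"), match.group("user")
    return None


# Example using the line quoted in this article.
sample = (
    "[Tue Dec 22 07:57:33 2020] nvidia-nvswitch: Version mismatch, "
    "kernel version 460.27.04 user version 450.51.06"
)
print(find_driver_mismatch(sample))
```

To use it on a real node, save the output of `dmesg -T` to a file and pass the file contents to `find_driver_mismatch`. A result of `None` only means this exact message was not found.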
Workaround
At the moment, the only known workaround that does not involve upgrading Konvoy is to downgrade your host drivers to nvidia-driver-latest-dkms-450.51.06-1.el7.x86_64.
Upgrading Konvoy to version 1.7.5 or 1.8.3 (or later) also resolves the issue.
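To decide whether a given cluster needs the upgrade, the version comparison can be sketched as follows. This is a hedged example in Python that assumes "1.7.5 or 1.8.3 (or later)" means 1.7.5 and above on the 1.7 line, and 1.8.3 and above otherwise. The helper names `parse_version` and `has_fix` are hypothetical, and only plain numeric versions such as `1.8.3` or `v1.8.3` are handled.

```python
def parse_version(text):
    """Turn a plain numeric version such as '1.8.3' or 'v1.8.3' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def has_fix(konvoy_version):
    """Return True if the version is 1.7.5+ on the 1.7 line, or 1.8.3+ otherwise.

    Assumption: this reading of the fixed versions is inferred from the
    article text, not taken from Konvoy release notes.
    """
    version = parse_version(konvoy_version)
    if version[:2] == (1, 7):
        return version >= (1, 7, 5)
    return version >= (1, 8, 3)


for candidate in ["1.7.4", "1.7.5", "1.8.2", "1.8.3"]:
    print(candidate, has_fix(candidate))
```

Versions for which `has_fix` returns False would need either the Konvoy upgrade or the driver downgrade described above.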