How to resolve "Failed to initialize NVML: Driver/library version mismatch" error
Overview/Background
You may encounter this error when trying to run a GPU workload or the nvidia-smi
command. We have typically seen this happen when the NVIDIA drivers on a node have been upgraded, but the kernel modules still loaded in memory belong to the older driver version.
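For example, running nvidia-smi directly on the affected node will typically fail with
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch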
A pod impacted by this issue will fail with status RunContainerError
and will report the following error under Events
Warning Failed 91s (x4 over 2m12s) kubelet, ip-10-0-129-17.us-west-2.compute.internal Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\\\n\\\"\"": unknown
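These events can be viewed by describing the failed pod; the pod name and namespace below are placeholders:
kubectl describe pod <pod-name> -n <namespace>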
Verification
Before we try to resolve the issue, let's confirm that it is actually caused by the loaded kernel drivers being out of date.
Check the run-time driver information
The command
cat /proc/driver/nvidia/version
will show the run-time driver information, for example:
NVRM version: NVIDIA UNIX x86_64 Kernel Module  440.64.00  Wed Feb 26 16:26:08 UTC 2020
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)
Compare against the installed driver version
Compare the NVIDIA driver version obtained above (e.g. 440.64.00) against the drivers you have currently installed.
If you are using host-based drivers, you can check the driver version using
rpm -qa | grep nvidia-driver
on CentOS/RHEL, or
dpkg -l | grep nvidia-driver
on Ubuntu.
Example output from a CentOS node:
...
nvidia-driver-latest-libs-455.32.00-1.el7.x86_64
nvidia-driver-latest-455.32.00-1.el7.x86_64
...
If you are using container-based drivers from the Konvoy NVIDIA addon, then check the version tag of the driver container image in the nvidia-kubeaddons-nvidia-driver- pod.
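One way to read that tag (the pod name and namespace below are placeholders for the actual addon pod and the namespace it runs in):
kubectl get pods --all-namespaces | grep nvidia-kubeaddons-nvidia-driver
kubectl get pod <nvidia-driver-pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'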
In the example above from a CentOS host, we see that the run-time driver version 440.64.00 differs from the installed version 455.32.00. A discrepancy like this confirms that the issue was caused by a driver upgrade, and the solutions documented below should resolve it.
Solution 1: Drain and reboot the worker
Draining and rebooting the node is the easiest way to fix the issue. A reboot ensures that the upgraded drivers are properly loaded and initialized.
If you need to upgrade drivers on a GPU worker node, we recommend draining the node, performing the driver upgrade, and then rebooting the node before deploying fresh workloads. If you are using container-based drivers, the recommended upgrade procedure is documented here.
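A minimal sketch of this flow, assuming <node-name> is the impacted worker (additional kubectl drain flags may be needed depending on your workloads):
kubectl drain <node-name> --ignore-daemonsets
# reboot the worker, for example over SSH
sudo reboot
# once the node is back online, allow scheduling again
kubectl uncordon <node-name>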
Solution 2: Reload NVIDIA kernel modules
This method is more involved and should only be used if draining and rebooting the GPU worker is not an option. It still requires draining any GPU workloads running on the node, so if your goal is to avoid disturbing currently running GPU workloads, this method offers no advantage. It is useful only when the worker node also hosts non-GPU workloads that cannot be drained, or when the node cannot be rebooted for some other reason.
Drain GPU workloads
This method requires that the GPUs are not in use, so we first need to stop any GPU workloads on the impacted node.
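One way to find them, with <node-name> as a placeholder, is to list the pods scheduled on the node and delete or scale down any that request GPUs:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>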
Stop the NVIDIA device plugin
Once we have stopped all the GPU workloads, we need to stop the NVIDIA device plugin, which is deployed as a daemonset on GPU workers. To stop it only on the impacted node, we can remove the label konvoy.mesosphere.com/gpu-provider:
kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider-
Once the label is removed, all pods associated with the NVIDIA Konvoy addon on that worker node will be removed.
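To confirm that the addon pods are gone from the node (node name is a placeholder):
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name> | grep nvidia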
Restart kubelet
After stopping the GPU workloads and the device plugin, the last process still using the nvidia kernel module should be the kubelet service. Since the device plugin is no longer running, simply restarting the service should stop it from using the kernel module:
sudo systemctl restart kubelet
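Before proceeding, you can verify that the service came back up cleanly:
sudo systemctl status kubelet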
Check if there are any processes still using NVIDIA drivers
Before attempting to unload the kernel modules, let's check whether any processes are still using the NVIDIA drivers:
sudo lsof /dev/nvidia*
Kill any processes still using the drivers.
Check which NVIDIA kernel modules are loaded
lsmod | grep ^nvidia
Example output:
nvidia_uvm            939731  0
nvidia_drm             39594  0
nvidia_modeset       1109637  1 nvidia_drm
nvidia              20390418  18 nvidia_modeset,nvidia_uvm
Unload NVIDIA kernel modules
In the example above, the third column (Used by) shows which modules are using the module listed in the first column. A module that is in use cannot be removed until the dependent modules are removed, hence the order of the following commands is important. Make sure that any running GPU workloads are terminated before removing the modules.
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
Verify that the modules are unloaded
lsmod | grep ^nvidia
should return no output.
Relaunch the NVIDIA addon pods
Now we are ready to relaunch the pods for the NVIDIA addon on the impacted node. Adding the label back should relaunch the pods and prepare the node to accept GPU workloads again:
kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider=NVIDIA
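Once the addon pods are running again, the run-time driver version should match the installed version; you can re-check with the same command from the Verification section, and nvidia-smi should now work:
cat /proc/driver/nvidia/version
nvidia-smi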