How to resolve "Failed to initialize NVML: Driver/library version mismatch" error
Overview/Background
You may encounter this error when trying to run a GPU workload or the nvidia-smi
command. We have typically seen this happen when the NVIDIA drivers on a node have been upgraded, but the kernel modules still loaded in memory belong to the older driver version.
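For example, running nvidia-smi directly on the affected node will typically fail with
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch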
A pod impacted by this issue will fail with status RunContainerError
and will report the following error under Events
Warning Failed 91s (x4 over 2m12s) kubelet, ip-10-0-129-17.us-west-2.compute.internal Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\\\n\\\"\"": unknown
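These events can be viewed by describing the failed pod; the pod name and namespace below are placeholders:
kubectl describe pod <pod-name> -n <namespace>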
Verification
Before we try to resolve the issue, let's confirm that it is actually caused by the loaded kernel drivers being out of date.
Check the run-time driver information
The command
cat /proc/driver/nvidia/version
will show the run-time driver information, for example:
NVRM version: NVIDIA UNIX x86_64 Kernel Module  440.64.00  Wed Feb 26 16:26:08 UTC 2020
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)
Compare against the installed driver version
Compare the NVIDIA driver version obtained above (e.g. 440.64.00) against the drivers you have currently installed.
If you are using host-based drivers, you can check the driver version using
rpm -qa | grep nvidia-driver
on CentOS/RHEL, or
dpkg -l | grep nvidia-driver
on Ubuntu.
Example output from a CentOS node:
...
nvidia-driver-latest-libs-455.32.00-1.el7.x86_64
nvidia-driver-latest-455.32.00-1.el7.x86_64
...
If you are using container-based drivers from the Konvoy NVIDIA addon, then check the version tag of the driver container image in the nvidia-kubeaddons-nvidia-driver- pod.
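One way to read that tag (the pod name and namespace below are placeholders for the actual addon pod and the namespace it runs in):
kubectl get pods --all-namespaces | grep nvidia-kubeaddons-nvidia-driver
kubectl get pod <nvidia-driver-pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'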
In the example above from a CentOS host, we see that the run-time driver version 440.64.00 differs from the installed version 455.32.00. A discrepancy like this confirms that the issue was caused by a driver upgrade, and the solutions documented below should resolve it.
Solution 1: Drain and reboot the worker
Draining and rebooting the node is the easiest way to fix the issue. A reboot ensures that the upgraded drivers are properly loaded and initialized.
If you need to upgrade drivers on a GPU worker node, we recommend draining the node, performing the driver upgrade, and then rebooting the node before deploying fresh workloads. If you are using container-based drivers, the recommended upgrade procedure is documented here.
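A minimal sketch of this flow, assuming <node-name> is the impacted worker (additional kubectl drain flags may be needed depending on your workloads):
kubectl drain <node-name> --ignore-daemonsets
# reboot the worker, for example over SSH
sudo reboot
# once the node is back online, allow scheduling again
kubectl uncordon <node-name>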
Solution 2: Reload NVIDIA kernel modules
This method is more involved and should only be used if draining and rebooting the GPU worker is not an option. It still requires draining any GPU workloads running on the node, so if your goal is to avoid disturbing currently running GPU workloads, this method offers no advantage. It is useful only when the worker node also hosts non-GPU workloads that cannot be drained, or when the node cannot be rebooted for some other reason.
Drain GPU workloads
This method requires that the GPUs are not in use, so we first need to stop any GPU workloads on the impacted node.
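One way to find them, with <node-name> as a placeholder, is to list the pods scheduled on the node and delete or scale down any that request GPUs:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>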
Stop the NVIDIA device plugin
Once we have stopped all the GPU workloads, we need to stop the NVIDIA device plugin, which is deployed as a daemonset on GPU workers. To stop it only on the impacted node, we can remove the label konvoy.mesosphere.com/gpu-provider:
kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider-
Once the label is removed, all pods associated with the NVIDIA Konvoy addon on that worker node will be removed.
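To confirm that the addon pods are gone from the node (node name is a placeholder):
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name> | grep nvidia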
Restart kubelet
After stopping the GPU workloads and the device plugin, the last process still using the nvidia kernel module should be the kubelet service. Since the device plugin is no longer running, simply restarting the service should stop it from using the kernel module:
sudo systemctl restart kubelet
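Before proceeding, you can verify that the service came back up cleanly:
sudo systemctl status kubelet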
Check if there are any processes still using NVIDIA drivers
Before attempting to unload the kernel modules, let's check whether any processes are still using the NVIDIA drivers:
sudo lsof /dev/nvidia*
Kill any processes still using the drivers.
Check which NVIDIA kernel modules are loaded
lsmod | grep ^nvidia
Example output:
nvidia_uvm            939731  0
nvidia_drm             39594  0
nvidia_modeset       1109637  1 nvidia_drm
nvidia              20390418  18 nvidia_modeset,nvidia_uvm
Unload NVIDIA kernel modules
In the example above, the third column (Used by) shows which modules are using the module listed in the first column. A module that is in use cannot be removed until the dependent modules are removed, hence the order of the following commands is important. Make sure that any running GPU workloads are terminated before removing the modules.
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
Verify that the modules are unloaded
lsmod | grep ^nvidia
should return no output.
Relaunch the NVIDIA addon pods
Now we are ready to relaunch the pods for the NVIDIA addon on the impacted node. Adding the label back should relaunch the pods and prepare the node to accept GPU workloads again:
kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider=NVIDIA
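Once the addon pods are running again, the run-time driver version should match the installed version; you can re-check with the same command from the Verification section, and nvidia-smi should now work:
cat /proc/driver/nvidia/version
nvidia-smi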