Starting with a new CentOS 7.9 Minimal Install host with a single Nvidia GPU, we observe the following when probing the OS for information:
1. The Nvidia GPU shows up via lspci:
lspci | grep NVIDIA
0b:00.0 3D controller: NVIDIA Corporation GM200GL [Tesla M40] (rev a1)
2. The open source Nouveau drivers are currently loaded:
lsmod | grep nouveau
nouveau 1898794 0
video 24538 1 nouveau
3. The kernel version before and after a yum update:
uname -r
3.10.0-1127.el7.x86_64
After the update and a reboot:
uname -r
3.10.0-1160.21.1.el7.x86_64
Konvoy Versions 1.6.X and Lower
If you are preparing this host for Konvoy 1.6.X or lower, first ensure that the kernel version is supported. To find out which kernel versions you can choose from, browse the nvidia/driver page on Docker Hub:
https://hub.docker.com/r/nvidia/driver/tags?page=1&ordering=last_updated&name=centos
Nvidia does not publish an image for every CentOS kernel release, so you must choose one of the kernel versions listed on that page. For example, the tag
nvidia/driver:450.80.02-1.0.0-3.10.0-1160.15.2.el7.x86_64-centos7
supports kernel 3.10.0-1160.15.2, but we are currently on 3.10.0-1160.21.1. Since Nvidia has not published an image for our latest kernel, we must downgrade the host to 3.10.0-1160.15.2 in order to use this nvidia/driver image. Also note that Nvidia driver version 450 indicates support for CUDA 11.0, and 460 indicates CUDA 11.2. Currently, Konvoy only supports CUDA 11.0, so be sure to pick a tag that includes version 450.
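To make the kernel comparison less error-prone, the check can be scripted. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the tag layout (driver version, a second version field, kernel release, OS label, separated by dashes) is an assumption inferred from the single example tag shown here.

```shell
#!/bin/sh
# Sketch: extract the kernel release embedded in an nvidia/driver image tag.
# Assumed tag layout (inferred from one example, not from Nvidia documentation):
#   <driver version>-<second version field>-<kernel release>-<os label>

kernel_from_tag() {
    kft_rest=${1#*-}                 # drop the driver version, e.g. "450.80.02"
    kft_rest=${kft_rest#*-}          # drop the second field, e.g. "1.0.0"
    printf '%s\n' "${kft_rest%-*}"   # drop the trailing OS label, e.g. "centos7"
}

# Succeeds only when the tag targets the given kernel release.
tag_matches_kernel() {
    [ "$(kernel_from_tag "$1")" = "$2" ]
}

# Literal values from this walkthrough; on a real host pass "$(uname -r)".
tag="450.80.02-1.0.0-3.10.0-1160.15.2.el7.x86_64-centos7"
if tag_matches_kernel "$tag" "3.10.0-1160.21.1.el7.x86_64"; then
    echo "tag matches kernel"
else
    echo "mismatch: tag targets $(kernel_from_tag "$tag")"
fi
```

With the values above, the script reports a mismatch and names the kernel the tag was built for, which is the kernel the host has to be downgraded to.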
Let's install the correct kernel version:
sudo yum install kernel-3.10.0-1160.15.2.el7
We must then reboot to load the new kernel:
sudo reboot now
After the host comes back up, confirm the running kernel:
uname -r
3.10.0-1160.15.2.el7.x86_64
For all versions of Konvoy, you must disable the Nouveau drivers:
Edit /etc/default/grub and append the following to the value of GRUB_CMDLINE_LINUX:
rd.driver.blacklist=nouveau nouveau.modeset=0
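If you prefer to script this edit, the sketch below shows one possible approach. The function name is hypothetical, and it assumes the GRUB_CMDLINE_LINUX value is enclosed in double quotes on a single line. It reads the file on stdin and writes the result to stdout, so nothing is changed in place.

```shell
#!/bin/sh
# Sketch: append the Nouveau kernel arguments to GRUB_CMDLINE_LINUX.
# Reads a grub defaults file on stdin and writes the edited text to stdout.
# Assumes the value is double-quoted on one line. Not idempotent: running it
# twice appends the arguments twice.
add_nouveau_args() {
    sed 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 rd.driver.blacklist=nouveau nouveau.modeset=0"/'
}

# Demonstration on a literal line instead of the real file.
printf '%s\n' 'GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"' | add_nouveau_args
```

On a real host you could run add_nouveau_args < /etc/default/grub > /tmp/grub.new and review the difference before copying the result back.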
Generate a new grub configuration to include the above changes.
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
Edit/create /etc/modprobe.d/blacklist.conf and append:
blacklist nouveau
Back up your old initramfs and then build a new one. If you change kernel versions in the future and the Nouveau drivers become re-enabled, repeat this process:
sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
sudo dracut --omit-drivers nouveau -f /boot/initramfs-$(uname -r).img $(uname -r)
sudo reboot now
After completing the steps above, Nouveau should no longer be loaded, and the following command should return no output:
lsmod | grep nouveau
If you still see Nouveau, find and fix the remaining configuration errors before moving on.
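When troubleshooting, one thing worth checking is whether the booted kernel actually received the two arguments added to GRUB_CMDLINE_LINUX. The sketch below is one way to test a command line string. The function name is hypothetical, and it only matches the exact tokens used in this article.

```shell
#!/bin/sh
# Sketch: check that a kernel command line carries both Nouveau arguments.
# Exact-token match only: a combined list such as
# rd.driver.blacklist=nouveau,foo is deliberately not recognised.
cmdline_disables_nouveau() {
    case " $1 " in
        *" rd.driver.blacklist=nouveau "*) ;;
        *) return 1 ;;
    esac
    case " $1 " in
        *" nouveau.modeset=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with a literal string; on a real host pass "$(cat /proc/cmdline)".
sample="ro crashkernel=auto rd.driver.blacklist=nouveau nouveau.modeset=0 rhgb quiet"
if cmdline_disables_nouveau "$sample"; then
    echo "nouveau arguments present"
else
    echo "nouveau arguments missing"
fi
```

If the arguments are missing from /proc/cmdline after a reboot, the grub configuration was most likely not regenerated or the wrong entry was booted.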
For Konvoy 1.6.X and lower you can now configure the Nvidia Addon in cluster.yaml:
- name: nvidia
  enabled: true
  values: |
    nvidia-driver:
      enabled: true
      image:
        tag: "450.80.02-1.0.0-3.10.0-1160.15.2.el7.x86_64-centos7"
Konvoy Versions 1.7.X and Higher
For Konvoy 1.7 or higher, install the Nvidia drivers directly on the host instead of configuring them via cluster.yaml. Download the driver from:
https://www.nvidia.com/Download/index.aspx?lang=en-us
For our example we will use a Tesla M40 card, and will set the options as follows. Note that we MUST select CUDA 11.0, not 11.2:
Product Type: Data Center / Tesla
Product Series: M-Class
Product: M40
Operating System: Linux 64-bit
CUDA Toolkit: 11.0
Language: English (US)
We can then download the driver version for our hardware, then push it to the worker and install it:
scp NVIDIA-Linux-x86_64-450.102.04.run user@gpu-worker-1:/tmp/NVIDIA-Linux-x86_64-450.102.04.run
SSH to the host, make the installer executable, and then run it. GCC is a dependency of the installer, so install that first:
sudo yum install gcc
cd /tmp
sudo chmod +x NVIDIA-Linux-x86_64-450.102.04.run
sudo ./NVIDIA-Linux-x86_64-450.102.04.run
If you encounter errors during installation, check the following:
You must also have the kernel headers that match your running kernel. When you install GCC, it pulls in the latest version of these headers. In our example, that means kernel-headers-3.10.0-1160.21.1.el7.x86_64.rpm is installed as a dependency, but we need the headers for the kernel that we have switched to:
sudo yum install kernel-devel-3.10.0-1160.15.2.el7 kernel-headers-3.10.0-1160.15.2.el7
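To avoid typing the version by hand, the package names can be derived from the running kernel release. This is a small sketch with a hypothetical function name. It assumes an x86_64 host, as in this walkthrough, and follows the package naming used in the command above.

```shell
#!/bin/sh
# Sketch: derive the kernel-devel and kernel-headers package names
# for a kernel release string (the output format of `uname -r`).
# Assumes an x86_64 host, as in this walkthrough.
kernel_build_packages() {
    kbp_release=${1%.x86_64}    # the yum command above omits the arch suffix
    printf 'kernel-devel-%s kernel-headers-%s\n' "$kbp_release" "$kbp_release"
}

# Literal example; on a real host pass "$(uname -r)".
kernel_build_packages "3.10.0-1160.15.2.el7.x86_64"
```

On the host itself this could be combined as sudo yum install $(kernel_build_packages "$(uname -r)"), provided the host is already running the kernel you intend to keep.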
If you get the following message you can safely ignore it:
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
If the installer asks about the 32-bit compatibility libraries and you are not sure whether you need them, install them as well.
You should see the following message when installation is complete:
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 450.102.04) is now complete.
Reboot your host, then let's check that the proper drivers are loaded:
lsmod | grep nvidia
nvidia_drm 48653 0
nvidia_modeset 1177160 1 nvidia_drm
nvidia 19704900 1 nvidia_modeset
drm_kms_helper 186531 2 vmwgfx,nvidia_drm
drm 456166 5 ttm,drm_kms_helper,vmwgfx,nvidia_drm
Finally, let's check that nvidia-smi responds properly:
nvidia-smi
Mon Mar 29 15:51:56 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   36C    P0    69W / 250W |      0MiB / 11448MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
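Since this article requires a 450-series driver, the version reported by nvidia-smi can also be checked in a script. The sketch below uses a hypothetical helper name and a literal version string. On a GPU host the version can be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```shell
#!/bin/sh
# Sketch: confirm that a driver version string belongs to the 450 branch.
driver_is_450() {
    case "$1" in
        450.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Literal example; on a GPU host the version can be read with
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
if driver_is_450 "450.102.04"; then
    echo "driver branch OK"
else
    echo "unexpected driver branch"
fi
```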
You can then enable the Nvidia addon and deploy Konvoy. It does not need special configuration for 1.7:
- name: nvidia
  enabled: true
If you are upgrading from Konvoy 1.6 to Konvoy 1.7, special precautions must be taken to reconfigure your GPU nodes for host-based drivers. Please reach out to D2iQ support for guidance.