When creating a cluster with NVIDIA GPU capable nodes, you might experience an issue where the
nvidia-feature-discovery-gpu-feature-discovery pod is in a crash loop, with logs similar to the following:
gpu-feature-discovery: 2022/10/24 05:38:43 Start running
gpu-feature-discovery: 2022/10/24 05:38:43 Warning: Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/10/24 05:38:43 Exiting
gpu-feature-discovery: 2022/10/24 05:38:43 Error: error creating NVML labeler: failed to initialize NVML: unexpected failure calling nvml.Init: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
This issue occurs when the combination of the --registry flags and the nvidia override is set during cluster creation. The --registry flags are implemented in the containerd configuration through "imports": a drop-in file is written and merged into the containerd config.toml, and that merge removes the nvidia container runtime configuration.
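As an illustration of the mechanism, the main containerd config contains an imports glob, and any drop-in file matching it is merged in, with the drop-in's sections taking precedence over the same sections in the main file. The fragment below is a sketch only; the exact file that dkp generates may differ, and the nvidia runtime section shown follows the usual nvidia-container-runtime conventions rather than being copied from a real node.

```toml
# /etc/containerd/config.toml (sketch, not the exact file dkp writes)
version = 2

# Files matching this glob are merged into this config; sections in an
# imported file override the same sections defined below. A registry
# drop-in that redefines the CRI plugin tree can therefore wipe out the
# nvidia runtime settings.
imports = ["/etc/containerd/conf.d/*.toml"]

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```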
When creating a new cluster, the issue can be avoided by adding the registry details to the override file together with the nvidia settings, instead of using the --registry flags in the dkp CLI.
For existing clusters where the --registry flags were used during cluster creation, replace the override secret with a new override that contains both the registry and nvidia details, then delete the machines to force reprovisioning of the nodes:
kubectl delete machine <machinename>
Then, on the affected nodes, rename the files with the .toml extension located in the directory referenced by the containerd imports setting in /etc/containerd/config.toml:
imports = ["/etc/containerd/conf.d/*.toml"]
Renamed files no longer match the glob, so they are not merged into the config when containerd is restarted.
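The renaming step can be sketched as follows. A scratch directory is used here for illustration; on a real node the directory would be /etc/containerd/conf.d, the commands would need root privileges, and containerd must be restarted afterwards for the change to take effect.

```shell
# Sketch: disable containerd drop-in files by renaming them so they no
# longer match the imports glob (*.toml).
dir=$(mktemp -d)                               # stand-in for /etc/containerd/conf.d
touch "$dir/registry.toml" "$dir/mirror.toml"  # example drop-ins

# Append .disabled so the *.toml glob no longer matches them.
for f in "$dir"/*.toml; do
  mv "$f" "$f.disabled"
done

ls "$dir"
```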
In both scenarios above, an example of the override would be:
- host: "harbor-registry.daclusta"
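A fuller override along these lines might combine the nvidia and registry sections as sketched below. The key names follow the konvoy-image-builder override conventions, and the credentials are placeholders; treat the exact schema as an assumption and check it against your DKP version's documentation.

```yaml
# Sketch of a combined override: nvidia settings plus registry details.
# Key names and credential values are illustrative assumptions.
gpu:
  types:
    - nvidia
image_registries_with_auth:
  - host: "harbor-registry.daclusta"
    username: "admin"            # hypothetical credentials
    password: "examplepassword"  # hypothetical credentials
```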