When creating a cluster with NVIDIA GPU capable nodes, you might experience an issue where the
nvidia-feature-discovery-gpu-feature-discovery pod is in a crash loop, with logs similar to the following:
gpu-feature-discovery: 2022/10/24 05:38:43 Start running
gpu-feature-discovery: 2022/10/24 05:38:43 Warning: Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/10/24 05:38:43 Exiting
gpu-feature-discovery: 2022/10/24 05:38:43 Error: error creating NVML labeler: failed to initialize NVML: unexpected failure calling nvml.Init: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
This issue occurs when the combination of the --registry flags and the nvidia override is set during cluster creation. The --registry flags are implemented in the containerd configuration through "imports": a drop-in file is written and merged into the containerd config.toml, and that merge removes the nvidia container runtime configuration.
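As an illustration of the mechanism, the main containerd config contains an imports glob, and any drop-in file matching it is merged in, with the drop-in's sections taking precedence over the same sections in the main file. The fragment below is a sketch only; the exact file that dkp generates may differ, and the nvidia runtime section shown follows the usual nvidia-container-runtime conventions rather than being copied from a real node.

```toml
# /etc/containerd/config.toml (sketch, not the exact file dkp writes)
version = 2

# Files matching this glob are merged into this config; sections in an
# imported file override the same sections defined below. A registry
# drop-in that redefines the CRI plugin tree can therefore wipe out the
# nvidia runtime settings.
imports = ["/etc/containerd/conf.d/*.toml"]

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```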
When creating a new cluster, the issue can be avoided by adding the registry details to the override file together with the nvidia settings, instead of using the --registry flags in the dkp CLI.
For existing clusters where the --registry flags were used during cluster creation, replace the override secret with a new override that contains both the registry and nvidia details, then delete the machines to force reprovisioning of the nodes:
kubectl delete machine <machinename>
Then, on the affected nodes, rename the files with the .toml extension located in the directory referenced by the containerd imports setting in /etc/containerd/config.toml:
imports = ["/etc/containerd/conf.d/*.toml"]
Renamed files no longer match the glob, so they are not merged into the config when containerd is restarted.
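The renaming step can be sketched as follows. A scratch directory is used here for illustration; on a real node the directory would be /etc/containerd/conf.d, the commands would need root privileges, and containerd must be restarted afterwards for the change to take effect.

```shell
# Sketch: disable containerd drop-in files by renaming them so they no
# longer match the imports glob (*.toml).
dir=$(mktemp -d)                               # stand-in for /etc/containerd/conf.d
touch "$dir/registry.toml" "$dir/mirror.toml"  # example drop-ins

# Append .disabled so the *.toml glob no longer matches them.
for f in "$dir"/*.toml; do
  mv "$f" "$f.disabled"
done

ls "$dir"
```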
In both scenarios above, an example of the override would be:
- host: "harbor-registry.daclusta"
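A fuller override along these lines might combine the nvidia and registry sections as sketched below. The key names follow the konvoy-image-builder override conventions, and the credentials are placeholders; treat the exact schema as an assumption and check it against your DKP version's documentation.

```yaml
# Sketch of a combined override: nvidia settings plus registry details.
# Key names and credential values are illustrative assumptions.
gpu:
  types:
    - nvidia
image_registries_with_auth:
  - host: "harbor-registry.daclusta"
    username: "admin"            # hypothetical credentials
    password: "examplepassword"  # hypothetical credentials
```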