When deploying Kubernetes clusters with DKP on vSphere, the operator should specify the vCenter SSL certificate fingerprint (thumbprint) if vCenter uses a self-signed certificate. The fingerprint is passed with the following flag:
"--tls-thumb-print=${VCENTERTLSTHUMBPRINT}"
When the fingerprint is incorrect or is not declared in the correct format in the cluster.yaml, the vsphere-cloud-controller-manager cannot establish a connection to the vCenter API and the deployment gets stuck. An example of the correct way to declare the fingerprint in the cluster.yaml is shown below:
thumbprint: C4:93:CC:E5:AA:80:57:38:54:05:3F:1C:7C:CE:44:46:58:E3:56:1F
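The expected format can be checked before deploying. The following is a minimal sketch, not part of DKP: the helper name is_valid_thumbprint is hypothetical, and it assumes the value must be exactly 20 colon-separated, uppercase hexadecimal pairs (a SHA-1 digest), as in the example above.

```shell
# Hypothetical helper: succeeds (exit status 0) only if the argument looks
# like a SHA-1 thumbprint, i.e. 20 uppercase hex pairs separated by colons.
# Assumption: uppercase is required, matching the example in this article.
is_valid_thumbprint() {
  printf '%s\n' "$1" | grep -Eq '^([0-9A-F]{2}:){19}[0-9A-F]{2}$'
}

# Example usage:
if is_valid_thumbprint "C4:93:CC:E5:AA:80:57:38:54:05:3F:1C:7C:CE:44:46:58:E3:56:1F"; then
  echo "thumbprint format looks valid"
fi
```

A value that still carries a prefix such as "SHA1 Fingerprint=" fails this check, as does a value with missing pairs.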
Evidence of this issue can be found in the logs of the vsphere-cloud-controller-manager pod. To review the logs, the operator should SSH into the control-plane node, use crictl to identify the vsphere-cloud-controller-manager container, and then run crictl logs with that container ID. The containers can be listed as shown below:
crictl ps -a
CONTAINER       IMAGE           CREATED               STATE     NAME                              ATTEMPT   POD ID
76081398c6a77   cb03930a2bd42   About a minute ago    Exited    node-driver-registrar             6         17bfb30851a32
8e967bc513e56   7a71aca7b60fc   7 minutes ago         Running   calico-node                       0         a37bfd83643b1
1f8ca37681b60   2a8ef6985a3e5   7 minutes ago         Exited    install-cni                       0         a37bfd83643b1
d1e7900943387   8b6940b4f6952   8 minutes ago         Running   liveness-probe                    0         17bfb30851a32
5bc478064638d   17300d20daf93   8 minutes ago         Exited    flexvol-driver                    0         a37bfd83643b1
e5bda27f5ace8   f822f80398b9a   8 minutes ago         Running   calico-typha                      0         55ecd06458bb2
0ab0bb57d2738   861f94c040a54   8 minutes ago         Running   vsphere-csi-node                  0         17bfb30851a32
ac158f49cad2a   4d4fa43f0ff03   8 minutes ago         Running   vsphere-cloud-controller-manager  0         93a86fead04d3
f8e0e0c406322   648350e58702c   8 minutes ago         Running   tigera-operator                   0         887a823817e5f
cce84208c41b3   c1cfbd59f7747   9 minutes ago         Running   kube-proxy                        0         93b52f831885c
21b9d5a10bad5   4b9683cda6d3b   9 minutes ago         Running   kube-vip                          0         941fd4c2ed8b7
13285a3b567d4   41ff053508988   9 minutes ago         Running   kube-controller-manager           0         6196a235ec2af
a8dff40cb34a2   0369cf4303ffd   9 minutes ago         Running   etcd                              0         7096e6d136bce
8d51122b53be5   c0d565df2c900   9 minutes ago         Running   kube-apiserver                    0         abeb854bf633e
ac7eca93d01f4   398b2c18375df   9 minutes ago         Running   kube-scheduler                    0         f543102d651a9
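Once the container is listed, its logs can be read with crictl logs. The sketch below wraps both steps in one function. The function name show_ccm_logs is hypothetical, and the sketch assumes that crictl lists the most recently created container first, so that the first ID returned is the current one.

```shell
# Hypothetical helper: print the logs of the vsphere-cloud-controller-manager
# container on this node. Run it on the control-plane node, typically as root.
show_ccm_logs() {
  # --name filters containers by name; -q prints only the container IDs.
  cid=$(crictl ps -a --name vsphere-cloud-controller-manager -q | head -n 1)
  if [ -z "$cid" ]; then
    echo "vsphere-cloud-controller-manager container not found" >&2
    return 1
  fi
  crictl logs "$cid"
}

# Example usage:
#   show_ccm_logs 2>&1 | less
```

The equivalent manual step for the listing above would be crictl logs ac158f49cad2a.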
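When the wrong SSL fingerprint, or a fingerprint in the wrong format, has been declared in the cluster.yaml, the vsphere-cloud-controller-manager logs events like the following. In this example, the value quoted in the error includes the "SHA1 Fingerprint=" prefix from the openssl output, whereas the correct declaration contains only the colon-separated hexadecimal value: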
I0502 20:26:01.423514 1 cloud.go:126] Starting the API Server
W0502 20:26:01.424066 1 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {:43001 localhost:43001 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp :43001: connect: connection refused". Reconnecting...
W0502 20:26:01.424192 1 server.go:138] could not getversion: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp :43001: connect: connection refused"
I0502 20:26:01.431644 1 search.go:76] WhichVCandDCByNodeID nodeID: 42051993-5dd1-c181-258f-0b9c51c7a385
E0502 20:26:01.516586 1 connection.go:177] Failed to create new client. err: Post "https://192.168.2.111:443/sdk": host "192.168.2.111:443" thumbprint does not match "SHA1 Fingerprint=C4:93:CC:E5:AA:80:57:38:54:05:3F:1C:7C:CE:44:46:58:E3:56:1F"
E0502 20:26:01.516627 1 connection.go:63] Failed to create govmomi client. err: Post "https://192.168.2.111:443/sdk": host "192.168.2.111:443" thumbprint does not match "SHA1 Fingerprint=C4:93:CC:E5:AA:80:57:38:54:05:3F:1C:7C:CE:44:46:58:E3:56:1F"
E0502 20:26:01.516642 1 connectionmanager.go:148] Cannot connect to vCenter with err: Post "https://192.168.2.111:443/sdk": host "192.168.2.111:443" thumbprint does not match "SHA1 Fingerprint=C4:93:CC:E5:AA:80:57:38:54:05:3F:1C:7C:CE:44:46:58:E3:56:1F"
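To spot this error quickly in a long log, the quoted value can be extracted from the "thumbprint does not match" lines. This is a minimal sketch: the function name mismatched_thumbprints is hypothetical, and it assumes that the quoted string is the value the controller was configured with, which is how the messages above appear to read.

```shell
# Hypothetical helper: read controller logs on stdin and print each distinct
# value quoted after "thumbprint does not match".
mismatched_thumbprints() {
  sed -n 's/.*thumbprint does not match "\(.*\)"$/\1/p' | sort -u
}

# Example usage (replace the placeholder with the real container ID):
#   crictl logs <container ID> 2>&1 | mismatched_thumbprints
```

If the printed value contains anything other than the colon-separated hexadecimal pairs, the format in the cluster.yaml is the likely cause; if the format is right, the value itself probably belongs to a different certificate.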
To obtain the certificate fingerprint, the following command can be executed:
openssl s_client -connect <vCenter IP address or FQDN>:443 < /dev/null 2>/dev/null | openssl x509 -fingerprint -noout -in /dev/stdin
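The openssl command prints the fingerprint with a label in front of it, for example "SHA1 Fingerprint=C4:93:...". Only the part after the equals sign belongs in the cluster.yaml. The sketch below strips the label. The function name strip_fingerprint_prefix and the variable VCENTER_HOST are hypothetical placeholders.

```shell
# Hypothetical helper: read "openssl x509 -fingerprint" output on stdin and
# print only the hexadecimal value, dropping everything up to the first "=".
strip_fingerprint_prefix() {
  sed 's/^[^=]*=//'
}

# Example usage (VCENTER_HOST is a placeholder for the vCenter IP or FQDN):
#   openssl s_client -connect "${VCENTER_HOST}:443" < /dev/null 2>/dev/null \
#     | openssl x509 -noout -fingerprint -sha1 \
#     | strip_fingerprint_prefix
```

The -sha1 option is given explicitly here so that the digest does not depend on the openssl default.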
To fix the issue, obtain the correct fingerprint, declare it in the cluster.yaml in the correct format (only the colon-separated hexadecimal value, as in the thumbprint example above), and redeploy the cluster.