Overview/Background
Before performing any operations involving etcd, it is critical to ensure that your etcd cluster is healthy and has quorum.
Per the etcd documentation:
"etcd is designed to withstand machine failures. An etcd cluster automatically recovers from temporary failures (e.g., machine reboots) and tolerates up to (N-1)/2 permanent failures for a cluster of N members. When a member permanently fails, whether due to hardware failure or disk corruption, it loses access to the cluster. If the cluster permanently loses more than (N-1)/2 members then it disastrously fails, irrevocably losing quorum. Once quorum is lost, the cluster cannot reach consensus and therefore cannot continue accepting updates."
Solution
To check the status of the etcd cluster, we can use etcdctl commands. We recommend performing the following steps and checking their outputs, in line with the official etcd recommendations:
1) Find the names of the etcd pods associated with the Kubernetes cluster in question:
kubectl get pods -A | grep etcd
2) Use kubectl exec to open a shell inside each etcd pod. Substitute in one of the pod names you found earlier:
kubectl exec -it -n kube-system etcd-xxxx-yyy -- sh
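To avoid repeating the TLS flags on every etcdctl invocation in the steps below, you can optionally export the standard ETCDCTL_* environment variables once inside the pod. The certificate paths match the ones used in the commands below; adjust them if your cluster stores the etcd PKI elsewhere:

```shell
# etcdctl (v3) reads these environment variables in place of the
# corresponding --cacert/--cert/--key/--endpoints flags.
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

# The later commands then shorten to, e.g.:
#   etcdctl member list
```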
3) Check the etcd member list:
etcdctl \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=https://127.0.0.1:2379 \
  member list
The output should list every etcd member, with each member reporting a status of `started`.
4) Check the etcd endpoint status:
etcdctl \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=https://127.0.0.1:2379 \
  endpoint status --cluster -w table
The RAFT TERM value should be identical on every member, and the RAFT INDEX values should not be too far apart from each other.
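As a hypothetical illustration of what "not too far apart" means (the index values below are made up):

```shell
# RAFT INDEX is each member's position in the replicated log. A follower a
# few entries behind the leader is normal under write load; a large and
# growing gap suggests the member is unhealthy or partitioned.
leader_index=120350
follower_index=120347
echo "lag=$(( leader_index - follower_index ))"   # prints: lag=3
```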
5) Check the etcd endpoint health:
etcdctl \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=https://127.0.0.1:2379 \
  endpoint health --cluster -w table
Each endpoint should report as healthy.
6) Check the etcd alarm list:
etcdctl \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=https://127.0.0.1:2379 \
  alarm list
No alarms should be active.
If any of these outputs indicates an issue, we recommend investigating further to remediate it, and/or submitting a support case.