Upgrading a cluster can be a tense process, and reducing the number of things that can go wrong goes a long way toward a smooth upgrade. Below are some helpful commands and workflows that you can use to confirm that your cluster is in a suitable state before upgrading DKP.
Overall cluster health
Running kubectl get nodes to check for any overarching issues with your nodes is a great first step in determining the overall health of your cluster:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
worker1 Ready <none> 3d v1.19.9
worker2 Ready <none> 3d v1.19.9
worker3 Ready <none> 3d v1.19.9
master1 Ready master 3d v1.19.9
Follow that with kubectl get pods -n kube-system:
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-node 2/2 Running 0 3d
coredns 1/1 Running 0 3d
etcd 1/1 Running 0 3d
keepalived 3/3 Running 0 3d
kube-apiserver 1/1 Running 0 3d
kube-scheduler 1/1 Running 0 3d
In this case, all of our nodes are Ready, and there are no immediate issues with our kube-system pods. If a node is not Ready or a kube-system pod is failing, resolve those errors before proceeding with the upgrade.
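If you want to surface only the problem cases, one-liners like these work as a rough sketch; they assume the default kubectl column layout (STATUS is the second column for nodes) and treat Succeeded pods as healthy:
# List nodes whose STATUS column is anything other than Ready (this also flags cordoned nodes)
kubectl get nodes --no-headers | awk '$2 != "Ready"'
# List kube-system pods that are not currently in the Running or Succeeded phase
kubectl get pods -n kube-system --field-selector=status.phase!=Running,status.phase!=Succeeded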
Something else that may be helpful is running through any cluster-wide Grafana dashboards you have available to check for issues such as an uneven distribution of resources or pods; checking resource usage is a straightforward way to see whether anything is amiss. To get a glance at the general health of your cluster, you can also run the following:
kubectl cluster-info
Kubernetes control plane is running at <Control plane>
KubeDNS is running at <DNS>
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
If you want more detail, adding the 'dump' argument will output just about everything related to your cluster, but it is generally too verbose to be helpful in most situations.
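If you do want to capture that output for later reference rather than scroll through it, you can write it to a directory instead of stdout; the path here is only an example:
# Dump cluster state into per-namespace files instead of printing to stdout
kubectl cluster-info dump --all-namespaces --output-directory=/tmp/cluster-state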
Node-specific health
If any nodes are not Ready, investigate why by running kubectl describe node <node>. In this case, it appears all nodes are healthy on the surface, but for reference, you would want to check the node's events and conditions. I have omitted some output for clarity:
kubectl describe node worker1
Conditions:
Type Status LastHeartbeat LastTransition Reason Message
---- ------ ------------- -------------- ------ -------
NetworkUnavailable False <time> <time> CalicoIsUp Calico is running on this node
MemoryPressure False <time> <time> KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False <time> <time> KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False <time> <time> KubeletHasSufficientPID kubelet has sufficient PID available
Ready True <time> <time> KubeletReady kubelet is posting ready status
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu <Value> <Value>
memory <Value> <Value>
ephemeral-storage <Value> <Value>
hugepages-1Gi <Value> <Value>
hugepages-2Mi <Value> <Value>
Events: <none>
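If you would rather scan the conditions across every node at once instead of describing them one at a time, a jsonpath query along these lines is a reasonable shortcut; format it however suits you:
# Print each node name followed by its condition types and statuses
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[*]}  {.type}={.status}{"\n"}{end}{end}'
On a healthy node, everything except Ready should report False.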
If you are looking for a slightly deeper dive, checking the status of both the kubelet and containerd on your nodes will help ensure that the backbone of your cluster is functioning properly. Typically, any issues with these components will surface when describing nodes or show up as container failures, but it can be helpful to check them directly when being extra cautious.
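On the node itself, that check is a pair of systemctl status commands, run over SSH on each node (this assumes kubelet and containerd are managed as systemd units, which is the standard layout):
sudo systemctl status kubelet
sudo systemctl status containerd
The output should look something like the following: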
* kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
...
Active: active (running) since 3 days ago
Docs: https://kubernetes.io/docs/
Main PID: 90 (kubelet)
Tasks: 150
Memory: 397.0M
CGroup: <output>
W kubelet[90]: dns.go:125] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: <>
* containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
...
Active: active (running) since 3 days ago
Docs: https://containerd.io
Main PID: 457 (containerd)
Tasks: 5626
Memory: 3.2G
CGroup: <output>
containerd[457]: time="" level=info msg="Finish piping \"stderr\" of container exec \"<containerID>\""
containerd[457]: time="" level=info msg="Exec process \"<containerID>\" exits with exit code 0 and error <nil>"
containerd[457]: time="" level=info msg="ExecSync for \"<containerID>\" returns with exit code 0"
This will provide a general feel for how healthy the kubelet and containerd are; in this case, the warnings in the kubelet logs can be safely ignored. If there are error messages that concern you, you can take a deeper look at the logs by running journalctl -u kubelet or journalctl -u containerd (a filtered example is sketched below). You can also check the overall status of the pods on a node by running kubectl get pods -A --sort-by=.status.phase --field-selector spec.nodeName=<node name>. This will print out all of the pods on that node sorted by their status. If many pods are in a not-ready state, it may be worthwhile to investigate why.
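The journalctl sketch mentioned above could look like this; the time window and search pattern are only examples, so adjust them to taste:
# Pull the last hour of kubelet entries and pick out likely problems
journalctl -u kubelet --since "1 hour ago" --no-pager | grep -iE "error|fail|warn"
# Do the same for containerd
journalctl -u containerd --since "1 hour ago" --no-pager | grep -iE "error|fail|warn"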
Pod health
If there are pods in a failed state, the best first step is to find the pods that are not running and investigate the reason for the failure before continuing the upgrade. If there are a fair number of evicted pods (or pods in another unhealthy state) on a particular node, running kubectl describe against a subset of those pods to determine why is the best next step:
kubectl get pods -A --sort-by=.status.phase --field-selector spec.nodeName=<node>
NAMESPACE NAME READY STATUS RESTARTS AGE
namespace1 pod1 0/1 Evicted 0 3d
namespace1 pod2 1/1 Running 0 3d
namespace1 pod3 1/1 Running 0 3d
In this case, I checked pod1 and noted that there were some resource issues on this node at some point:
kubectl describe pod -n namespace1 pod1
Name: pod1
Namespace: namespace1
Priority: 0
Node: <node>
...
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Container container1 was using 1625488805, which exceeds its request of 0.
Pods not backed by a volume will consume ephemeral storage from the node; if you have many pods running without ephemeral-storage limits, I would expect high usage and, eventually, some pods being evicted. In general, this is not a major red flag unless many pods are being evicted. If you want to investigate further, you can check the current resource consumption in Grafana or your deployed monitoring tool. If you have a large number of pods in an Error state or a similar status, such as failing to start, take a look at the individual pods and their logs and determine whether the failure was due to the cluster's health or something pod-specific before moving on.
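To gauge how widespread the problem is across the whole cluster, a rough count like the one below can help; it assumes the default kubectl column layout (STATUS is the fourth column with -A) and treats Running and Completed as healthy:
# Count pods cluster-wide grouped by STATUS, excluding healthy ones
kubectl get pods -A --no-headers | awk '$4 != "Running" && $4 != "Completed" {print $4}' | sort | uniq -c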
DKP Application health
Ensure that all Kustomizations and HelmReleases are successfully reconciled:
kubectl get kustomization -A
kubectl get helmrelease -A
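Both resource types expose a Ready condition (assuming the Flux CRDs that DKP uses), so if you prefer a pass/fail check over scanning the list, kubectl wait can do it; the timeout here is arbitrary:
# Fail if any Kustomization or HelmRelease does not become Ready within the timeout
kubectl wait kustomization --all -A --for=condition=Ready --timeout=5m
kubectl wait helmrelease --all -A --for=condition=Ready --timeout=5m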
Extra considerations
Before upgrading, validating that no nodes are near 100% usage of any resource is an excellent proactive step toward a smooth upgrade. During the upgrade, pods will be rescheduled from each node as it is drained, so the remaining nodes must have room to accommodate them. Along the same lines, if there are pods with a particular affinity, making sure those pods have another node to fail over to will help ensure there are no outages.
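A quick way to spot-check that headroom, assuming a metrics API (metrics-server or an equivalent) is available in the cluster:
# Show current CPU and memory usage per node
kubectl top nodes
# Show the requests and limits already allocated on a specific node
kubectl describe node <node> | grep -A 10 "Allocated resources"
If any node is consistently close to its capacity, consider rebalancing or adding capacity before starting the upgrade.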
Ultimately, determining the acceptable state of the cluster is something that your organization will need to do, but these will serve as good guidelines and workflows for determining the overall health of a cluster.