Problem
Throttling of kubectl requests can happen for many reasons; when it does, you will see responses similar to the below when executing commands:
kubectl get pods
I0920 13:00:24 request.go:645] Throttling request took 3.04s, request: GET:<URL>
<response>
Solution
The first step in diagnosing the underlying issue is to get more information from our kubectl command; we can do this by adding the -v=<number> flag. Increasing the verbosity will show us each call that kubectl makes when we attempt to query the API server.
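For example, a verbosity of 6 or higher logs every HTTP request kubectl sends (the command below reuses the kubectl get pods query from above as an illustration; any kubectl command accepts the flag):
kubectl get pods -v=6
In some cases, you may see logging similar to the below: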
I0501 02:02:1.431116 260204 cached_discovery.go:87] failed to write cache to /home/$USER/.kube/cache/discovery/CLUSTER_NAME/../..
due to mkdir /home/$USER/.kube/cache/discovery: permission denied
In this case, we do not have the proper permissions set on our cache folder. This can be remedied by running 'chmod -R 755 ~/.kube/cache'. After doing this, re-run your kubectl command with the verbose flag to validate whether the throttling has been resolved; if it has not, deleting the cache directory (~/.kube/cache) and letting kubectl rebuild it is the next best step.
If you are not seeing any issues with the cache, but requests are reaching the API server and then being throttled, your API server may be overloaded in some capacity. A helpful troubleshooting step is to generate two kubeconfigs: one admin config taken from a node and another that routes through Traefik. You can then compare the two by checking the URL that your kubectl commands run through (see the example command after the URLs below). If kubectl is hitting the API server directly, the URL will be something similar to:
https://<Controlplane_endpoint>:6443/apis/
If it routes through Traefik, it will contain a URI such as the following:
https://<IP>/dkp/api-server/version?timeout=32s
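One quick way to see which endpoint a given kubeconfig targets is to print the cluster server URL from it (a sketch; the --kubeconfig path is an assumption, point it at whichever config you are testing):
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}' --kubeconfig ~/.kube/config
You can also run any kubectl command with -v=6, as above, and read the full request URL from the log lines.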
If you are seeing slowness when hitting the API server directly, the best way to troubleshoot this issue further is to view the pre-created dashboard in Grafana and review the API server logging. Within this dashboard there are many useful charts, such as Work Queue Depth, Work Queue Latency, read/write availability, and many more.
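To review the API server logging from the command line, a sketch like the following can help (this assumes a kubeadm-style control plane where the API server pods carry the component=kube-apiserver label in the kube-system namespace; adjust the selector if your cluster labels them differently):
kubectl logs -n kube-system -l component=kube-apiserver --tail=200
Look for repeated slow-request or timeout messages that line up with the throttled kubectl calls.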
If you see that any of these are higher than expected, start your investigation there to work towards the root cause. If you see slow reads/writes, another direction to look into is etcd; the major things to look out for are disk I/O and network latency. By default, metrics for etcd are not gathered by Prometheus in DKP 2.X. If you would like to have this enabled, please follow the steps outlined in this Knowledge Article. If you are only seeing throttling when hitting the Traefik endpoint, you may need to scale up either the Traefik or kube-oidc deployments to resolve the issue.
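A minimal sketch for scaling those deployments (the deployment names and namespaces vary between releases, so the values below are placeholders; confirm the real names on your cluster first):
kubectl get deployments -A | grep -i -e traefik -e oidc
kubectl scale deployment <traefik-or-oidc-deployment> -n <namespace> --replicas=<desired-count>
Watch the throttling behavior through the Traefik endpoint again once the new replicas are Ready.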
If you are seeing high latency and are running DKP 2.2.0 or lower, you may be affected by a service account token issue that can degrade API server performance due to the number of service account tokens present on the cluster.
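A quick way to gauge whether this applies to you is to count the service account token secrets on the cluster (a sketch; an unusually large count is the signal to investigate further):
kubectl get secrets -A --field-selector type=kubernetes.io/service-account-token --no-headers | wc -l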