Overview
In DKP 2.x, the ServiceMonitor responsible for scraping metrics from ETCD is disabled by default. If you want ETCD metrics available in Prometheus, follow the steps below. Please note that Prometheus requires certificates to query ETCD. Because the secret that contains these certificates is created manually, you must update it by hand every time your ETCD instance rotates its certificates.
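As a sketch of that manual refresh, after a certificate rotation you can regenerate the secret in place with something like the following (this assumes the secret name etcd-certs and namespace kommander used in the Setup steps below, and that the new certificate files are in your current directory):

```shell
# Rebuild the secret from the rotated cert files and apply it in place.
# --dry-run=client renders the secret locally; piping to `kubectl apply`
# updates the existing secret instead of failing on "already exists".
kubectl create secret generic etcd-certs -n kommander \
  --from-file=ca.crt --from-file=server.key --from-file=server.crt \
  --dry-run=client -o yaml | kubectl apply -f -
```

After the secret is updated, Prometheus will pick up the new files on its mounted secret volume.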
Setup
1. First, get the override ConfigMap for the kube-prometheus-stack HelmRelease and store it in a file. Editing the YAML while it is live on the cluster can get clunky, and working from a local copy keeps everything clean:
kubectl get cm -n kommander kube-prometheus-stack-overrides -o yaml > overrides.yaml
2. The ServiceMonitor for ETCD requires the ca.crt, server.crt, and server.key files from your control plane. These are located in /etc/kubernetes/pki/etcd on each control-plane node; you only need the certs from one node, since they are identical across nodes. Once you have the files, create a secret containing them with the command below:
kubectl create secret generic etcd-certs -n kommander --from-file=ca.crt --from-file=server.key --from-file=server.crt
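If you are working from a machine outside the cluster, one way to pull the files down is scp (the user and host below are placeholders; adjust them for your environment, and note that /etc/kubernetes/pki/etcd is typically readable only by root):

```shell
# Copy the three etcd cert files from one control-plane node into the
# current directory. "user@control-plane-1" is a placeholder address.
scp user@control-plane-1:/etc/kubernetes/pki/etcd/{ca.crt,server.crt,server.key} .
```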
3. Add the YAML below to the bottom of your overrides.yaml. The kubeEtcd block sits at the same indentation level as grafana and the other top-level keys:
kubeEtcd:
  enabled: true
  serviceMonitor:
    scheme: "https"
    caFile: "/etc/prometheus/secrets/etcd-certs/ca.crt"
    certFile: "/etc/prometheus/secrets/etcd-certs/server.crt"
    keyFile: "/etc/prometheus/secrets/etcd-certs/server.key"
Along with that, you will also need to add the secret containing the ETCD certs to the 'secrets' section of overrides.yaml. The dex secret is already configured there, so you can use it as a guide:
prometheusSpec:
  secrets:
    - dex
    - etcd-certs
After both additions, your file will look something like this:
...
prometheusSpec:
  secrets:
    - dex
    - etcd-certs
  storageSpec:
    volumeClaimTemplate:
      spec:
        # 100Gi is the default size for the chart
        resources:
          requests:
            storage: 100Gi
  resources:
    limits:
      cpu: 2000m
      memory: 10922Mi
    requests:
      cpu: 1000m
      memory: 4000Mi
grafana:
  resources:
    # keep request = limit to keep this container in guaranteed class
    limits:
      cpu: 300m
      memory: 100Mi
    requests:
      cpu: 200m
      memory: 100Mi
alertmanager:
  alertmanagerSpec:
    resources:
      limits:
        cpu: 200m
        memory: 250Mi
      requests:
        cpu: 100m
        memory: 200Mi
kubeEtcd:
  enabled: true
  serviceMonitor:
    scheme: "https"
    caFile: "/etc/prometheus/secrets/etcd-certs/ca.crt"
    certFile: "/etc/prometheus/secrets/etcd-certs/server.crt"
    keyFile: "/etc/prometheus/secrets/etcd-certs/server.key"
4. After saving the changes to your local overrides.yaml, delete the override ConfigMap on the cluster, then apply the new YAML:
kubectl delete cm -n kommander kube-prometheus-stack-overrides
kubectl apply -f overrides.yaml
5. Give the cluster a few minutes to reconcile, then validate that your ServiceMonitor has been created:
kubectl get servicemonitors.monitoring.coreos.com -n kommander kube-prometheus-stack-kube-etcd
NAME                              AGE
kube-prometheus-stack-kube-etcd   55m
Once the service monitor has been created, you can check for the ETCD metrics in the Prometheus UI.
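You can also query the Prometheus API directly for an ETCD metric. The sketch below assumes the Prometheus service is named kube-prometheus-stack-prometheus; check `kubectl get svc -n kommander` for the actual name in your cluster:

```shell
# Forward the Prometheus UI port locally (service name is an assumption).
kubectl port-forward -n kommander svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 2

# etcd_server_has_leader is a standard etcd metric; each member should
# report 1 once scraping is working.
curl -s 'http://localhost:9090/api/v1/query?query=etcd_server_has_leader'
```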
Troubleshooting
If you are not seeing your metrics in the UI, the first place to start your investigation is the Prometheus UI itself. In the 'Service Discovery' tab, you should see a reference to ETCD. If you do not, the best next step is to check the Prometheus logs for errors related to setting up the ServiceMonitor. If ETCD is present in service discovery but you still do not see metrics, check the 'Targets' tab for each ETCD instance; if the ServiceMonitor has issues querying the targets, it will show there. If you are seeing connection errors, spinning up a network connectivity pod and curling the metrics endpoint can help validate whether pod-to-pod communication is working between the node that Prometheus is deployed on and ETCD.
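That last check can be sketched as follows. The image choice is arbitrary, the control-plane IP is a placeholder, and ETCD serves its metrics over TLS on port 2379 by default, so the client certs from step 2 are needed inside the pod:

```shell
# Launch a throwaway pod with curl available (image is an assumption).
kubectl run net-debug -n kommander --rm -it --image=curlimages/curl -- sh

# From inside the pod, after copying the etcd cert files in
# (e.g. via `kubectl cp`), hit one etcd member's metrics endpoint directly:
curl --cacert ca.crt --cert server.crt --key server.key \
  https://<control-plane-ip>:2379/metrics
```

If the curl succeeds from the pod but Prometheus still reports the target as down, the problem is more likely certificate configuration in the ServiceMonitor than network connectivity.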