Affected Konvoy Versions:
All Versions of Konvoy 1.x
Issue:
In specific versions of Prometheus, a crash or sudden interruption between Prometheus and its Persistent Volume can leave behind an empty chunk file. When Prometheus attempts to read this chunk file at startup it will crash, preventing the Pod from launching. You will see an error in the Prometheus logs that indicates this:
err="opening storage failed: mmap files, file: /prometheus/data/chunks_head/000008: mmap: invalid argument
More information about this issue can be found here:
https://github.com/prometheus/prometheus/issues/7469
Solution:
The solution is to identify the PV that the Prometheus server is using, and then delete the chunk file named in the Prometheus Pod logs. This issue can affect any Prometheus deployment, but for this example resolution we use the default Prometheus Addon that ships with Konvoy 1.6.0.
First identify the Persistent Volume associated with your crashing Prometheus server:
[k8user@workstation working-dir]$ kubectl get pv -A | grep prometheus
local-pv-36b138ad   99Gi    RWO   Delete   Bound   kommander/kommander-kubecost-prometheus-alertmanager                                                                localvolumeprovisioner   14h
local-pv-4f6e0947   99Gi    RWO   Delete   Bound   kubeaddons/prometheus-prometheus-kubeaddons-prom-prometheus-db-prometheus-prometheus-kubeaddons-prom-prometheus-0   localvolumeprovisioner   14h
local-pv-c1ba2c00   199Gi   RWO   Delete   Bound   kommander/kommander-kubecost-prometheus-server                                                                      localvolumeprovisioner   14h
The PV for kubeaddons/prometheus-prometheus-kubeaddons-prom-prometheus-db-prometheus-prometheus-kubeaddons-prom-prometheus-0 is the one we want in this instance. Its ID is local-pv-4f6e0947, so next let's describe that PV and identify which Kubernetes node it resides on:
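Rather than eyeballing the table, the PV name can be picked out of saved `kubectl get pv` output with a small helper. This is a sketch, not part of the official workflow: the `pv_for_claim` function name is ours, and it assumes the default `kubectl get pv` column layout, where the PV name is column 1 and the CLAIM (namespace/pvc-name) is column 6.

```shell
# Sketch: find the PV bound to a given claim in saved `kubectl get pv` output.
# Assumes default column layout: NAME is field 1, CLAIM is field 6.
pv_for_claim() {
  # $1 = file containing `kubectl get pv` output
  # $2 = claim in the form namespace/pvc-name
  awk -v claim="$2" '$6 == claim { print $1 }' "$1"
}
```

Usage would look like `kubectl get pv > pv-list.txt` followed by `pv_for_claim pv-list.txt kubeaddons/<pvc-name>`, which prints the matching PV name (here, local-pv-4f6e0947).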
[k8user@workstation working-dir]$ kubectl describe pv local-pv-4f6e0947
Name:              local-pv-4f6e0947
Labels:
Annotations:       pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: local-volume-provisioner-node4.k8s-cluster-c793eae1-d7fd-40f3-8280-848b898beed8
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      localvolumeprovisioner
Status:            Bound
Claim:             kubeaddons/prometheus-prometheus-kubeaddons-prom-prometheus-db-prometheus-prometheus-kubeaddons-prom-prometheus-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          99Gi
Node Affinity:
  Required Terms:
    Term 0:        kubernetes.io/hostname in [node4.k8s-cluster]
Message:
Source:
    Type:  LocalVolume (a persistent volume backed by local storage on a node)
    Path:  /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd
Events:
The line that tells us which node it resides on is:
Term 0: kubernetes.io/hostname in [node4.k8s-cluster]
Below that we can see the actual path to the PV on disk:
Path: /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd
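Those two values can also be extracted mechanically from saved `kubectl describe pv` output. The following is a convenience sketch (the `pv_node_and_path` helper name is ours), matching the two line formats shown above:

```shell
# Sketch: pull the node hostname and on-disk path out of saved
# `kubectl describe pv <name>` output.
pv_node_and_path() {
  # $1 = file containing the describe output
  awk '/kubernetes.io\/hostname in/ { host = $NF; gsub(/[][]/, "", host); print "node=" host }
       /Path:/                      { print "path=" $2 }' "$1"
}
```

Run against the describe output above, this prints `node=node4.k8s-cluster` and `path=/mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd`.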
With this information we can SSH to the node node4.k8s-cluster, navigate to that path, and delete the file Prometheus is complaining about:
[k8user@workstation working-dir]$ ssh node4.k8s-cluster -l k8user
Last login: Thu Nov  5 01:39:07 2020 from workstation.daclusta
-sh-4.2$ ls -ls /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd
total 0
0 drwxrwsr-x 7 root 2000 160 Nov  5 16:00 prometheus-db
We see a prometheus-db folder, so we're looking in the right place. Next, locate the chunks_head directory:
-sh-4.2$ ls -ls /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd/prometheus-db/
total 20
 0 drwxr-sr-x 3 k8user 2000    68 Nov  5 12:00 01EPCNMA7W7F0AGN6YG991CABH
 0 drwxr-sr-x 3 k8user 2000    68 Nov  5 16:00 01EPD3B16A4VJ7D8880FDJ7WDD
 0 drwxr-sr-x 3 k8user 2000    68 Nov  5 16:00 01EPD3BT9QG6X7TPR6S9FDE003
 0 drwxrwsr-x 2 k8user 2000    34 Nov  5 16:00 chunks_head
20 -rw-rw-r-- 1 k8user 2000 20001 Nov  5 16:33 queries.active
 0 drwxrwsr-x 3 k8user 2000    81 Nov  5 16:00 wal
Inside the chunks_head directory we can see the chunk files. Locate the offending one (the chunk file named in the Pod logs) and delete it:
-sh-4.2$ ls -ls /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd/prometheus-db/chunks_head/
total 108004
87524 -rw-r--r-- 1 k8user 2000 89620918 Nov  5 16:00 000007
20480 -rw-r--r-- 1 k8user 2000        0 Nov  5 16:00 000008
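Here 000008 is the zero-byte chunk matching the error in the Pod logs. The cleanup can be done with a plain `rm` of that file, or more cautiously with a sketch like the one below, which removes only zero-byte regular files so healthy chunks are left untouched (the `delete_empty_chunks` helper name is ours; pass it your own chunks_head path):

```shell
# Sketch: delete only zero-byte chunk files from a chunks_head directory.
# Non-empty chunks are never touched; each removed file is printed first.
delete_empty_chunks() {
  chunks_dir="$1"
  find "$chunks_dir" -maxdepth 1 -type f -size 0 -print -delete
}

# Example, using the path found from the PV description above:
# delete_empty_chunks /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd/prometheus-db/chunks_head
```

Once the empty chunk file is gone, the Prometheus Pod should come up cleanly on its next restart.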