Affected Konvoy Versions:
All Versions of Konvoy 1.x
Issue:
In specific versions of Prometheus, a crash or sudden interruption between Prometheus and its Persistent Volume can leave behind an empty chunk file. When Prometheus attempts to read this chunk file at startup it will crash, preventing the Pod from launching. You will see an error in the Prometheus logs that indicates this:
err="opening storage failed: mmap files, file: /prometheus/data/chunks_head/000008: mmap: invalid argument
More information about this issue can be found here:
https://github.com/prometheus/prometheus/issues/7469
Solution:
The solution is to identify the PV that the Prometheus server is using, and then delete the chunk file named in the Prometheus Pod logs. This issue can affect any Prometheus deployment, but for this example resolution we use the default Prometheus Addon that ships with Konvoy 1.6.0.
First identify the Persistent Volume associated with your crashing Prometheus server:
[k8user@workstation working-dir]$ kubectl get pv -A | grep prometheus
local-pv-36b138ad   99Gi    RWO   Delete   Bound   kommander/kommander-kubecost-prometheus-alertmanager                                                                localvolumeprovisioner   14h
local-pv-4f6e0947   99Gi    RWO   Delete   Bound   kubeaddons/prometheus-prometheus-kubeaddons-prom-prometheus-db-prometheus-prometheus-kubeaddons-prom-prometheus-0   localvolumeprovisioner   14h
local-pv-c1ba2c00   199Gi   RWO   Delete   Bound   kommander/kommander-kubecost-prometheus-server                                                                      localvolumeprovisioner   14h
The PV for kubeaddons/prometheus-prometheus-kubeaddons-prom-prometheus-db-prometheus-prometheus-kubeaddons-prom-prometheus-0 is the one we want in this instance. Its ID is local-pv-4f6e0947, so next let's describe that PV and identify which Kubernetes node it resides on:
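Rather than eyeballing the table, the PV name can be picked out of saved `kubectl get pv` output with a small helper. This is a sketch, not part of the official workflow: the `pv_for_claim` function name is ours, and it assumes the default `kubectl get pv` column layout, where the PV name is column 1 and the CLAIM (namespace/pvc-name) is column 6.

```shell
# Sketch: find the PV bound to a given claim in saved `kubectl get pv` output.
# Assumes default column layout: NAME is field 1, CLAIM is field 6.
pv_for_claim() {
  # $1 = file containing `kubectl get pv` output
  # $2 = claim in the form namespace/pvc-name
  awk -v claim="$2" '$6 == claim { print $1 }' "$1"
}
```

Usage would look like `kubectl get pv > pv-list.txt` followed by `pv_for_claim pv-list.txt kubeaddons/<pvc-name>`, which prints the matching PV name (here, local-pv-4f6e0947).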
[k8user@workstation working-dir]$ kubectl describe pv local-pv-4f6e0947
Name:              local-pv-4f6e0947
Labels:
Annotations:       pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: local-volume-provisioner-node4.k8s-cluster-c793eae1-d7fd-40f3-8280-848b898beed8
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      localvolumeprovisioner
Status:            Bound
Claim:             kubeaddons/prometheus-prometheus-kubeaddons-prom-prometheus-db-prometheus-prometheus-kubeaddons-prom-prometheus-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          99Gi
Node Affinity:
  Required Terms:
    Term 0:        kubernetes.io/hostname in [node4.k8s-cluster]
Message:
Source:
    Type:  LocalVolume (a persistent volume backed by local storage on a node)
    Path:  /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd
Events:
The line that tells us which node it resides on is:
Term 0: kubernetes.io/hostname in [node4.k8s-cluster]
Below that we can see the actual path to the PV on disk:
Path: /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd
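Those two values can also be extracted mechanically from saved `kubectl describe pv` output. The following is a convenience sketch (the `pv_node_and_path` helper name is ours), matching the two line formats shown above:

```shell
# Sketch: pull the node hostname and on-disk path out of saved
# `kubectl describe pv <name>` output.
pv_node_and_path() {
  # $1 = file containing the describe output
  awk '/kubernetes.io\/hostname in/ { host = $NF; gsub(/[][]/, "", host); print "node=" host }
       /Path:/                      { print "path=" $2 }' "$1"
}
```

Run against the describe output above, this prints `node=node4.k8s-cluster` and `path=/mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd`.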
With this information we can SSH to the node node4.k8s-cluster, navigate to that path, and delete the file Prometheus is complaining about:
[k8user@workstation working-dir]$ ssh node4.k8s-cluster -l k8user
Last login: Thu Nov  5 01:39:07 2020 from workstation.daclusta
-sh-4.2$ ls -ls /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd
total 0
0 drwxrwsr-x 7 root 2000 160 Nov  5 16:00 prometheus-db
We see a prometheus-db folder, so we're looking in the right place. Next, locate the chunks_head directory:
-sh-4.2$ ls -ls /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd/prometheus-db/
total 20
 0 drwxr-sr-x 3 k8user 2000    68 Nov  5 12:00 01EPCNMA7W7F0AGN6YG991CABH
 0 drwxr-sr-x 3 k8user 2000    68 Nov  5 16:00 01EPD3B16A4VJ7D8880FDJ7WDD
 0 drwxr-sr-x 3 k8user 2000    68 Nov  5 16:00 01EPD3BT9QG6X7TPR6S9FDE003
 0 drwxrwsr-x 2 k8user 2000    34 Nov  5 16:00 chunks_head
20 -rw-rw-r-- 1 k8user 2000 20001 Nov  5 16:33 queries.active
 0 drwxrwsr-x 3 k8user 2000    81 Nov  5 16:00 wal
Inside the chunks_head directory we can see the chunk files. Locate the offending one (the chunk file named in the Pod logs) and delete it:
-sh-4.2$ ls -ls /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd/prometheus-db/chunks_head/
total 108004
87524 -rw-r--r-- 1 k8user 2000 89620918 Nov  5 16:00 000007
20480 -rw-r--r-- 1 k8user 2000        0 Nov  5 16:00 000008
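Here 000008 is the zero-byte chunk matching the error in the Pod logs. The cleanup can be done with a plain `rm` of that file, or more cautiously with a sketch like the one below, which removes only zero-byte regular files so healthy chunks are left untouched (the `delete_empty_chunks` helper name is ours; pass it your own chunks_head path):

```shell
# Sketch: delete only zero-byte chunk files from a chunks_head directory.
# Non-empty chunks are never touched; each removed file is printed first.
delete_empty_chunks() {
  chunks_dir="$1"
  find "$chunks_dir" -maxdepth 1 -type f -size 0 -print -delete
}

# Example, using the path found from the PV description above:
# delete_empty_chunks /mnt/disks/b2dd935b-189d-4f51-9f6b-a34e84ecc9dd/prometheus-db/chunks_head
```

Once the empty chunk file is gone, the Prometheus Pod should come up cleanly on its next restart.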