Overview/Background
If the Kubernetes backend etcd database exceeds its space quota for any member, etcd will raise a cluster-wide alarm that puts the cluster into a maintenance mode which only accepts key reads and deletes. This space quota ensures that the cluster operates in a reliable fashion and that etcd does not run out of storage space. When this issue occurs, you may observe the following:
- An etcd alarm stating alarm:NOSPACE
- Log messages stating etcdserver: mvcc: database space exceeded
- Log messages stating applying raft message exceeded backend quota

To restore write functionality to the cluster and resume normal Kubernetes and etcd operation, you must free enough space in the keyspace, defragment the backend database, and then clear the space quota alarm.
Solution
Note: For all steps below, ETCD_CONTAINERID is assumed to be set as an environment variable containing the local etcd container ID:

ETCD_CONTAINERID=$(crictl ps -q --pod $(crictl pods -q --namespace=kube-system --name=etcd))
Removing excessive keyspace data and defragmenting the backend database can bring the cluster back within its quota limits. To do so, use the memberID reported in the alarm to find the node where the alarm was triggered (the following can be executed on any node running a kube-system etcd container):
crictl exec $ETCD_CONTAINERID /bin/sh -c '\
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key \
ETCDCTL_API=3 \
etcdctl member list -w json' | python -m json.tool
{
    "header": {
        "cluster_id": 13536073241193627255,
        "member_id": 622044271603492567,
        "raft_term": 3
    },
    "members": [
        {
            "ID": 622044271603492567,
            "clientURLs": [
                "https://10.0.193.211:2379"
            ],
            "name": "ip-10-0-193-211.us-west-2.compute.internal",
            "peerURLs": [
                "https://10.0.193.211:2380"
            ]
        },
        {
            "ID": 7020364781308380685,
            "clientURLs": [
                "https://10.0.195.94:2379"
            ],
            "name": "ip-10-0-195-94.us-west-2.compute.internal",
            "peerURLs": [
                "https://10.0.195.94:2380"
            ]
        },
        {
            "ID": 13036013581852492582,
            "clientURLs": [
                "https://10.0.195.5:2379"
            ],
            "name": "ip-10-0-195-5.us-west-2.compute.internal",
            "peerURLs": [
                "https://10.0.195.5:2380"
            ]
        }
    ]
}
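The compact and defrag commands below derive their --endpoints list from the plain (non-JSON) member list output using a cut/sed/paste pipeline. As a sanity check, that pipeline can be exercised offline against sample member list lines (the member data here mirrors two of the members above; the plain output format is ID, status, name, peerURL, clientURL):

```shell
# Sample plain-text `etcdctl member list` output, one line per member.
members='8a1f1f8237e26d7, started, ip-10-0-193-211.us-west-2.compute.internal, https://10.0.193.211:2380, https://10.0.193.211:2379
616d5892b78c860d, started, ip-10-0-195-94.us-west-2.compute.internal, https://10.0.195.94:2380, https://10.0.195.94:2379'

# Field 5 is the client URL; strip spaces and join the lines into one
# comma-separated list suitable for etcdctl's --endpoints flag.
endpoints=$(echo "$members" | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',')
echo "$endpoints"
# -> https://10.0.193.211:2379,https://10.0.195.94:2379
```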
Next, perform the following steps on the node where the alarm was triggered:
1) Get the current revision:
rev=$(crictl exec $ETCD_CONTAINERID /bin/sh -c '\
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key \
ETCDCTL_API=3 \
etcdctl endpoint status --write-out="json"' | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'); \
echo "Set \$rev to $rev."
Set $rev to 27343.

2) Compact away all old revisions:
crictl exec $ETCD_CONTAINERID /bin/sh -c "\
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key \
ETCDCTL_API=3 \
etcdctl --endpoints=$(crictl exec $ETCD_CONTAINERID /bin/sh -c "ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key ETCDCTL_API=3 etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") compact $rev"
compacted revision 27343

3) Defragment away excessive space:
crictl exec $ETCD_CONTAINERID /bin/sh -c "\
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key \
ETCDCTL_API=3 \
etcdctl --endpoints=$(crictl exec $ETCD_CONTAINERID /bin/sh -c "ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key ETCDCTL_API=3 etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") defrag"
Finished defragmenting etcd member[https://10.0.193.211:2379]
Finished defragmenting etcd member[https://10.0.195.94:2379]
Finished defragmenting etcd member[https://10.0.195.5:2379]

4) Verify the endpoint status and database size:
crictl exec $ETCD_CONTAINERID /bin/sh -c '\
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key \
ETCDCTL_API=3 \
etcdctl -w table endpoint status --cluster'
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.193.211:2379 |  8a1f1f8237e26d7 | 3.3.10  | 3.4 MB  |   false   |         3 |      34504 |
| https://10.0.195.94:2379  | 616d5892b78c860d | 3.3.10  | 3.4 MB  |   false   |         3 |      34504 |
| https://10.0.195.5:2379   | b4e93948f17c4326 | 3.3.10  | 3.4 MB  |   true    |         3 |      34504 |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
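etcd's default backend quota is 2 GiB unless overridden with etcd's --quota-backend-bytes flag, so the DB SIZE column gives a direct read on remaining headroom. A minimal sketch of that arithmetic, assuming the default quota and using the 3.4 MB size reported above:

```shell
# Assumed default etcd backend quota: 2 GiB (override via --quota-backend-bytes).
quota_bytes=$((2 * 1024 * 1024 * 1024))

# DB SIZE from the status table above, expressed in bytes (3.4 MB).
db_bytes=3400000

# Integer percentage of the quota currently consumed.
pct_used=$((db_bytes * 100 / quota_bytes))
echo "${pct_used}% of backend quota used"
# -> 0% of backend quota used
```

A database this far under quota confirms that compaction and defragmentation reclaimed the space; if the percentage were still near 100, the alarm would re-fire shortly after being disarmed.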
5) Disarm the alarm:NOSPACE alarm:
crictl exec $ETCD_CONTAINERID /bin/sh -c '\
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key \
ETCDCTL_API=3 \
etcdctl alarm disarm'
memberID:622044271603492567 alarm:NOSPACE
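Note that the alarm output above reports the memberID in decimal, while the endpoint status table prints member IDs in hex. To cross-reference the two, convert with printf (shown here for the memberID from this alarm):

```shell
# memberID as reported by the alarm (decimal).
alarm_member_id=622044271603492567

# Convert to hex to match the ID column of `etcdctl endpoint status -w table`.
hex_id=$(printf '%x' "$alarm_member_id")
echo "$hex_id"
# -> 8a1f1f8237e26d7  (the https://10.0.193.211:2379 member in the table above)
```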
6) Verify that there are no longer any etcd alarms active:
crictl exec $ETCD_CONTAINERID /bin/sh -c '\
ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key \
ETCDCTL_API=3 \
etcdctl alarm list'

If the command produces no output, no alarms remain active and the cluster has returned to normal operation.