Overview/Background
Several etcd log messages are useful to check for when troubleshooting etcd issues. The most common messages, and their likely causes, are listed below.
Common errors and causes
connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 }
The host firewall is preventing network communication.
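A quick way to distinguish a firewall drop (timeout) from a missing listener (refused) is a plain TCP probe. A minimal sketch; the 127.0.0.1 address is an assumption, so substitute the member address from the log message:

```python
# TCP reachability probe for the etcd client (2379) and peer (2380) ports.
# A timeout here suggests a firewall silently dropping packets; an immediate
# "refused" suggests nothing is listening at that address.
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for port in (2379, 2380):  # etcd client and peer ports
        status = "reachable" if can_reach("127.0.0.1", port) else "unreachable"
        print(port, status)
```

If the probe times out rather than being refused, inspect the host firewall rules before investigating etcd itself.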
mvcc: database space exceeded or applying raft message exceeded backend quota
If the etcd database backing Kubernetes exceeds its space quota on any member, etcd raises a cluster-wide alarm that puts the cluster into a maintenance mode in which it accepts only key reads and deletes. The space quota ensures that the cluster operates reliably and that etcd does not run out of storage space. To restore write access and resume normal Kubernetes and etcd operation, free enough space in the keyspace and defragment the backend database, then clear the space quota alarm. See etcd database and disk space errors for more information.
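You can spot this condition before the alarm fires by watching how close the database size is to the quota in etcd's Prometheus metrics. A sketch, assuming the metric name `etcd_mvcc_db_total_size_in_bytes` exposed by etcd 3.x and a 2 GiB quota (use your cluster's configured --quota-backend-bytes value):

```python
# Parse the database size out of Prometheus text exposition scraped from an
# etcd member's /metrics endpoint and report it as a fraction of the quota.
# The sample text below is synthetic illustration data, not real output.

def db_usage_ratio(metrics_text: str, quota_bytes: int) -> float:
    """Return etcd_mvcc_db_total_size_in_bytes as a fraction of quota_bytes."""
    for line in metrics_text.splitlines():
        if line.startswith("etcd_mvcc_db_total_size_in_bytes"):
            return float(line.split()[-1]) / quota_bytes
    raise ValueError("metric not found in scrape")

sample = "etcd_mvcc_db_total_size_in_bytes 1.6106127e+09"
print(f"{db_usage_ratio(sample, 2 * 1024**3):.0%}")  # 75% of a 2 GiB quota
```

Alerting well before the ratio reaches 100% leaves time to compact and defragment before the cluster drops into maintenance mode.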
dial tcp :2379: getsockopt: connection refused or dial tcp :2380: getsockopt: connection refused
A connection to the etcd endpoint could not be established. Ensure that the etcd container is running on the host with the address shown.
is starting a new election at term
The etcd cluster lost quorum and is electing a new leader. For an etcd cluster with n members, quorum is (n/2)+1. Note that a five-member etcd cluster is recommended for Kubernetes in production environments.
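The quorum formula above can be made concrete with a few lines of arithmetic:

```python
# Quorum for an n-member etcd cluster is (n // 2) + 1, so the cluster can
# tolerate the loss of n - quorum members before it stops accepting writes.

def quorum(n: int) -> int:
    return n // 2 + 1

def failure_tolerance(n: int) -> int:
    return n - quorum(n)

for n in (1, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")
# 1 members: quorum=1, tolerates 0 failure(s)
# 3 members: quorum=2, tolerates 1 failure(s)
# 5 members: quorum=3, tolerates 2 failure(s)
```

This is why five members are recommended for production: the cluster keeps quorum through two simultaneous member failures, for example one planned maintenance plus one unexpected outage.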
failed to send out heartbeat on time
The leader has skipped two heartbeat intervals. The etcd leader periodically sends heartbeats to its followers to maintain its leadership; if a follower does not receive a heartbeat within the election timeout, it triggers a new election. When the leader skips two heartbeat intervals, etcd warns that it failed to send a heartbeat on time.
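If this warning appears regularly on a high-latency network, the timings may need tuning. As a hedged sketch of the rule of thumb from the etcd tuning guide (heartbeat interval roughly the round-trip time between members, election timeout about 10x the heartbeat interval; the RTT value below is an assumption you should replace with a measured value):

```python
# Suggest --heartbeat-interval and --election-timeout values (in ms) from a
# measured peer round-trip time. etcd's defaults are 100 ms and 1000 ms; for
# low-latency links the defaults are kept, for slow links both are raised.

def suggest_timings(rtt_ms: float) -> dict:
    heartbeat = max(rtt_ms, 100.0)  # never go below etcd's 100 ms default
    return {
        "heartbeat-interval": heartbeat,
        "election-timeout": heartbeat * 10,
    }

print(suggest_timings(rtt_ms=50.0))   # low-latency link: defaults suffice
print(suggest_timings(rtt_ms=200.0))  # cross-region link: raise both values
```

Raising the timeouts trades slower failure detection for fewer spurious elections, so increase them only as far as the measured latency requires.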
apply entries took too long
If the average apply duration exceeds 100 milliseconds, etcd will warn that entries are taking too long to apply. This issue can have a few causes:
- Slow disk - To rule out a slow disk as the cause, monitor backend_commit_duration_seconds (the p99 duration should be less than 25ms), or check the Disk Sync Duration panel in the etcd Grafana dashboard that is deployed with Konvoy. If the disk is too slow, addressing hardware problems, removing sources of disk contention, or moving to a faster disk can help resolve the issue.
- CPU starvation - To verify, you can check the CPU usage of the container or host where etcd is running. Increasing the resources allocated to etcd or moving etcd to a dedicated node can help resolve this issue.
- Slow network - Check for high latency or packets being dropped between etcd members.
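The p99 check mentioned for slow disks can be estimated directly from the Prometheus histogram buckets that etcd exports. A sketch with synthetic bucket data (not real etcd output):

```python
# Estimate a p99 upper bound from Prometheus histogram buckets, as scraped
# for backend_commit_duration_seconds. Each bucket is (le_seconds,
# cumulative_count) sorted by le; p99 lies in the first bucket whose
# cumulative count covers 99% of all observations.

def p99_upper_bound(buckets):
    total = buckets[-1][1]
    for le, cumulative in buckets:
        if cumulative >= 0.99 * total:
            return le
    return float("inf")

# Synthetic healthy member: 99% of commits finish within 8 ms.
sample = [(0.001, 0), (0.002, 500), (0.004, 920), (0.008, 990),
          (0.016, 999), (0.032, 1000)]
print(p99_upper_bound(sample))  # 0.008 -> well under the 25 ms threshold
```

A result consistently above 0.025 (25 ms) points at the disk; a low p99 shifts suspicion to CPU starvation or the network.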
snapshotting is taking more than seconds to finish
Sending a snapshot took more than 30 seconds, exceeding the expected transfer time for a 1Gbps connection. Check for a slow or congested network between the members involved.
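The arithmetic behind that expectation is simple bandwidth math. A sketch; the 2 GiB snapshot size is an assumption for illustration:

```python
# Expected time to move a snapshot of size_bytes over a link of link_bps
# bits per second: bytes * 8 bits / link rate.

def expected_transfer_seconds(size_bytes: int, link_bps: float = 1e9) -> float:
    return size_bytes * 8 / link_bps

snapshot = 2 * 1024**3  # assumed 2 GiB snapshot
print(f"{expected_transfer_seconds(snapshot):.1f}s expected at 1 Gbps")  # 17.2s
```

If a transfer that should take seconds is running past the 30-second warning threshold, the effective bandwidth between the members is far below 1Gbps and the network path deserves investigation.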
cluster ID mismatch
The etcd node is trying to join a cluster that has already been formed. Please contact D2IQ support if you run into this issue.
rafthttp: failed to find member
The cluster state is invalid and the etcd member cannot join the cluster. Please contact D2IQ support if you run into this issue.