Konvoy Elasticsearch/Kibana read-only troubleshooting guide
Overview
In some cases, you might notice that the Elastic stack included in your Konvoy deployment (Elasticsearch, Kibana, Fluent Bit) has stopped accepting new log entries.
Often this is because the disk space for these pods has filled up to the point of crossing Elasticsearch's disk watermark thresholds. When that happens, Elasticsearch automatically sets itself to "read-only" mode as a precaution against more catastrophic system failures and permanent data loss.
This article will help you identify whether this is happening in your cluster and walk you through the steps to resolve it.
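To see the watermark thresholds currently in effect, you can query Elasticsearch's cluster settings API. A quick check, assuming the API listens on the default port 9200 and that curl is available inside the container (as it is in the standard Elasticsearch image):
kubectl -n kubeaddons exec elasticsearch-kubeaddons-master-0 -- curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true" | grep watermark
The cluster.routing.allocation.disk.watermark.high value (90% by default) is the threshold referenced in the log messages below.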
Log entries
If you see these symptoms, look through the logs of your cluster's Elasticsearch master nodes:
kubectl logs -n kubeaddons elasticsearch-kubeaddons-master-0
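Master logs can be lengthy, so it may be faster to filter all of them for watermark warnings at once. A sketch, assuming the common default of three master pods (adjust the loop to match your cluster):
for i in 0 1 2; do
  kubectl logs -n kubeaddons elasticsearch-kubeaddons-master-$i | grep "disk watermark"
done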
Look for messages about "high disk watermark" being exceeded, such as the following:
[20XX-XX-XXTXX:XX:XX,XXX][WARN ][o.e.c.r.a.DiskThresholdMonitor] [elasticsearch-kubeaddons-master-0] high disk watermark [90%] exceeded on [xxx][elasticsearch-kubeaddons-data-0][/usr/share/elasticsearch/data/nodes/0] free: 2.4gb[8.2%], shards will be relocated away from this node
The above message means that 90% of the elasticsearch-kubeaddons-data-0 pod's storage space is used up, so Elasticsearch is relocating shards away from that node. When this occurs on all of the data node pods, there is nowhere left to move the shards, and you can see the result in other parts of the stack that try to write to Elasticsearch, such as Fluent Bit:
kubectl get pods -n kubeaddons | grep fluentbit   # Get a pod ID from the resulting list
kubectl logs -n kubeaddons fluentbit-kubeaddons-fluent-bit-xxxxx   # Use the ID from the previous step
You should see log entries like the following, indicating that the Elasticsearch cluster is refusing writes because it is in read-only mode:
[20XX/XX/XX XX:XX:XX] [error] [out_es] could not pack/validate JSON response {"took":0,"errors":true,"items":[{"index":{"_index":"kubernetes_cluster","_type":"flb_type","_id":"xxx","status":403,"error":{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}}},{"index":{"_index":"kubernetes_cluster","_type":"flb_type","_id":"xxx","status":403,"error":{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}}},{"index":{"_index":"kubernetes_cluster","_type":"flb_type","_id":"xxx","status":403,"error":{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}}},{"index":{"_index":"kubernetes_cluster-2020.06.02","_type":"flb_type","_id":"xxx","status":403,"error":{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-onl
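To confirm that the data nodes really are running out of space, you can ask Elasticsearch for its per-node disk usage (again assuming the API is reachable on port 9200 from inside a pod):
kubectl -n kubeaddons exec elasticsearch-kubeaddons-master-0 -- curl -s "localhost:9200/_cat/allocation?v"
The disk.percent column shows how full each data node's volume is; values at or above 90% correspond to the watermark warnings above.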
Resolution
The simplest way to resolve this is to add more Elasticsearch data nodes, which increases the total disk space and spreads the load throughout the cluster. You can do that by editing your Konvoy cluster.yaml file to increase the replica count for the Elasticsearch data nodes:
- name: elasticsearch
  enabled: true
  values: |
    data:
      replicas: 5
Once the file is edited, run
konvoy deploy addons
to push the change to your cluster.
After the change has been made, you can use
kubectl get pods -n kubeaddons
to verify that the number of running elasticsearch-kubeaddons-data-* pods matches what you specified. (Remember that the numbering starts at 0, so five replicas appear as elasticsearch-kubeaddons-data-0 through elasticsearch-kubeaddons-data-4!) Once this increase has been made (or you've otherwise ensured that more storage space is available to the Elasticsearch cluster), you can clear read-only mode.
To do this, you'll need to attach to one of the running containers in the Elasticsearch cluster:
kubectl -n kubeaddons exec -it elasticsearch-kubeaddons-master-0 -- /bin/bash
Once you reach the shell prompt inside the container, make the API call to clear read-only mode:
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{ "index.blocks.read_only_allow_delete" : null }'
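To confirm the block was removed, you can query the index settings again from the same shell; if the read-only flag has been cleared, this check should produce no matches:
curl -s "localhost:9200/_all/_settings?flat_settings=true" | grep read_only_allow_delete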
Press CTRL+D to exit the container shell when you're done.
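Alternatively, if you'd rather not attach a shell at all, the same API call can be issued in one step (a one-liner equivalent of the procedure above):
kubectl -n kubeaddons exec elasticsearch-kubeaddons-master-0 -- curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{ "index.blocks.read_only_allow_delete" : null }'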
Now the Elasticsearch cluster should accept new entries!
Please note that this process may need to be repeated later if the log volume in your Konvoy cluster continues to outpace the available disk space.
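One lightweight way to keep an eye on this is to check the free space on a data pod's volume directly; the mount path below matches the one shown in the watermark warning earlier:
kubectl -n kubeaddons exec elasticsearch-kubeaddons-data-0 -- df -h /usr/share/elasticsearch/data
If usage creeps back toward the 90% watermark, consider adding more data replicas before the cluster switches itself to read-only again.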