Some users have observed that the Grafana Loki Ingester pod fails its readiness check after its storage backend fills up and the ingester flush queue grows too large.
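The pod status can be checked with kubectl (assuming the pod runs in the kommander namespace, as in the steps later in this article):
> kubectl -n kommander get pods grafana-loki-loki-distributed-ingester-0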
grafana-loki-loki-distributed-ingester-0 0/1 Running 0 47h
When this happens, events like the following appear in the grafana-loki-loki-distributed-ingester-0 logs:
level=error ts=2022-08-05T18:54:08.654718016Z caller=flush.go:220 org_id=fake msg="failed to flush user" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
level=debug ts=2022-08-05T18:54:08.654767857Z caller=flush.go:216 msg="flushing stream" userid=fake fp=be95f3ed18e7aeb8 immediate=true
level=debug ts=2022-08-05T18:54:11.791960743Z caller=logging.go:66 traceID=2ddb168a137ca7c6 msg="GET /ready (503) 34.27us"
Notice that queries to the Ingester /ready endpoint return a 503 instead of a 200, which means that the Loki Ingester is not ready, and that a "context deadline exceeded" error is logged, which means that flush operations were not completed within the expected time frame.
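The readiness endpoint can also be queried directly to reproduce the 503; a minimal sketch, assuming the ingester listens on Loki's default HTTP port 3100:
> kubectl -n kommander port-forward grafana-loki-loki-distributed-ingester-0 3100:3100
> curl -i http://localhost:3100/ready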
This usually happens when the Loki Ingester flush operations queue grows too large, so the Ingester needs more time than allowed to flush all the data it holds in memory. The Loki Ingester has a parameter, flush_op_timeout, that controls how long flush operations may take. This is a known issue.
In Kommander 2.2.0, this parameter is set to 10 seconds, which is too low in scenarios where the flush queue grows very large. The value will be set to 10 minutes in Kommander 2.3.
If this issue is encountered on a Kommander 2.2.0 cluster, the workaround is to increase the value of flush_op_timeout from 10s to 10m.
The Loki Ingester configuration is stored in the grafana-loki-loki-distributed ConfigMap in the kommander namespace of the management cluster. To increase the value, edit the ConfigMap:
> kubectl -n kommander edit cm grafana-loki-loki-distributed
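Inside the ConfigMap, flush_op_timeout belongs to the ingester block of the Loki configuration; a minimal sketch of the relevant section is below (the generated config contains many other settings, which should be left untouched):
ingester:
  flush_op_timeout: 10m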
To confirm that the new value is in place, exec into the Loki Ingester pod and check the rendered configuration file:
> kubectl -n kommander exec -it grafana-loki-loki-distributed-ingester-0 -- /bin/sh
> grep flush_op_timeout /etc/loki/config/config.yaml
flush_op_timeout: 10m
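Note that Loki reads its configuration at startup, so the running ingester may still be using the old 10s value even after the mounted file shows 10m. To make the new value take effect, the pod can be restarted; one way to do this, assuming the ingester is managed by a StatefulSet (as the -0 pod suffix suggests) so the controller recreates the pod automatically:
> kubectl -n kommander delete pod grafana-loki-loki-distributed-ingester-0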