Note: This issue is now fixed in DKP 2.2.0 and above.
When a cluster is unexpectedly terminated, FluentBit pods may be left with corrupted journal entries, which cause a "CrashLoopBackOff" error when the cluster is restored. The cluster log for those pods will contain the following error along with a stack trace:
Assertion 'lastindex <= right' failed at ../src/journal/journal-file.c:2209, function genericarray_bisect(). Aborting.
This is a known FluentBit issue: when one or more of the journal log lines does not contain an "=" symbol, FluentBit hits a SIGSEGV error and stops responding to the health check.
The GitHub issue https://github.com/fluent/fluent-bit/issues/4407 describes the problem and links to the PR that fixes it in V1.8.3. DKP 2.2.0 and later no longer suffer from this issue, as they use later releases of FluentBit.
If you are running DKP 2.1.1 or below and you have this issue, you can clear the journal of the failing pod:
kubectl exec <pod> -- sudo journalctl --rotate
kubectl exec <pod> -- sudo journalctl --vacuum-time=1s
The rotate command archives all open journal files and creates new ones, and the vacuum-time command then trims the archive so that only the last 1 second is kept. After these commands, the pod effectively has a clear journal and will start without issue.
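If several pods are affected, the two commands above can be wrapped in a small helper. This is a minimal sketch, not part of DKP: the function names crashlooping_pods and clear_pod_journal are hypothetical, the namespace is a placeholder you must supply, and it assumes the default column layout of "kubectl get pods --no-headers" (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Sketch only: function names are hypothetical and the namespace is an assumption.

# Read `kubectl get pods --no-headers` output on stdin and print the names
# of pods whose STATUS column (third field) is CrashLoopBackOff.
crashlooping_pods() {
    awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Run the two journalctl commands from this article against one pod.
# $1 = namespace (placeholder), $2 = pod name
clear_pod_journal() {
    kubectl exec -n "$1" "$2" -- sudo journalctl --rotate &&
    kubectl exec -n "$1" "$2" -- sudo journalctl --vacuum-time=1s
}

# Example usage (replace NAMESPACE with the namespace of the FluentBit pods):
#   kubectl get pods -n NAMESPACE --no-headers | crashlooping_pods |
#   while read -r pod; do clear_pod_journal NAMESPACE "$pod"; done
```

The filter matches on the STATUS column only, so check the resulting pod list before clearing any journals if the namespace also contains pods other than FluentBit.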