Problem

In some scenarios, you may see Fluent Bit crashlooping with the following errors:

[error] [net] TCP connection failed: logging-operator-logging-fluentd.kommander.svc:24240 (Connection refused)
[error] [net] socket #106 could not connect to logging-operator-logging-fluentd.kommander.svc:24240
[error] [output:forward:forward.0] no upstream connections available
[ warn] [engine] failed to flush chunk '1-xxxx.xxxxx.flb', retry in 908 seconds: task_id=12, input=tail.0 > output=forward.0 (out_id=0)

The errors above typically indicate an upstream problem connecting to Fluentd. In this case, the Fluentd logs show a failure in an SSL call:

#<Thread:0x00007fd64d6c19e0@event_loop /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.12.4/lib/fluent/plugin_helper/thread.rb:70 run> terminated with exception (report_on_exception is true):
/usr/lib/ruby/2.7.0/openssl/ssl.rb:239:in `peeraddr': Socket not connected - getpeername(2) (Errno::ENOTCONN)
from /usr/lib/ruby/2.7.0/openssl/ssl.rb:239:in `peeraddr'

If you do not see any meaningful errors in the Fluentd logs, you can open a shell in the Fluentd pod and view the fluentd.out file for more information on the failures.
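As a rough sketch of that step, the commands below list the Fluentd pods, open a shell in one, and print the end of the log file. The pod name logging-operator-logging-fluentd-0, the container name fluentd, and the log path /fluentd/log/out are assumptions that may not match your deployment; confirm the real names from the output of the first command and adjust the path if the file is located elsewhere.

```shell
# List pods in the kommander namespace and pick out the Fluentd pod.
kubectl get pods -n kommander | grep fluentd

# Open a shell in the Fluentd pod.
# Assumption: pod name and container name are illustrative; substitute your own.
kubectl exec -it -n kommander logging-operator-logging-fluentd-0 -c fluentd -- sh

# Inside the pod, show the last 100 lines of the Fluentd output log.
# Assumption: the log path may differ between versions.
tail -n 100 /fluentd/log/out
```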

Solution

This is a known issue with certain versions of Fluentd; details can be found at https://github.com/fluent/fluentd/issues/3635. A simple workaround is to disable TLS for Fluent Bit by adding the following to the logging-operator-logging-overrides ConfigMap:

apiVersion: v1
data:
  values.yaml: |
    ---
    tls:
      enabled: false
kind: ConfigMap
metadata:
  name: logging-operator-logging-overrides
  namespace: kommander
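
One possible way to make this change is with kubectl, sketched below. It assumes the ConfigMap already exists in the kommander namespace and that you have permission to edit it. The final command only watches the pods so you can see whether Fluent Bit stops crashlooping.

```shell
# Inspect the current contents of the overrides ConfigMap.
kubectl get configmap logging-operator-logging-overrides -n kommander -o yaml

# Open the ConfigMap in your editor and add the tls block under values.yaml.
kubectl edit configmap logging-operator-logging-overrides -n kommander

# Watch the pods in the namespace to confirm that Fluent Bit recovers.
kubectl get pods -n kommander -w
```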

If you do not wish to disable TLS, you will need to upgrade DKP to version 2.2 or later, which deploys Fluentd v1.14.5 and contains the fix for this issue.