Starting with DKP 2.3.0, Ceph is used to provide object storage for the logging stack. A Ceph storage cluster requires three types of daemons: monitors, managers, and object storage daemons (OSDs).
Monitors maintain the maps of the cluster state, including the monitor map, the manager map, and the OSD map. These maps represent critical cluster state that other Ceph daemons need in order to coordinate with each other.
Keeping time in sync across nodes is critical because the Ceph monitor consensus mechanism relies on tight clock alignment. By default, monitors tolerate a clock drift of up to 0.05 seconds; this threshold is configurable through the `mon_clock_drift_allowed` option.
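If your environment cannot guarantee sub-50 ms alignment, the threshold can be raised at runtime from the Ceph CLI. A minimal sketch, assuming the Rook toolbox is deployed as the `rook-ceph-tools` deployment in the `rook-ceph` namespace (adjust both to match your installation):

```bash
# Open a shell in the Rook toolbox pod (assumes the toolbox deployment
# is named rook-ceph-tools in the rook-ceph namespace).
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# Inside the toolbox: raise the tolerated monitor clock drift to 0.1s.
# Widening this threshold only masks the symptom; fixing time
# synchronization on the nodes is the real remedy.
ceph config set mon mon_clock_drift_allowed 0.1

# Verify the new value.
ceph config get mon mon_clock_drift_allowed
```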
When clock skew exists across the monitor nodes, the `rook-ceph-mon-a-xxx-yyy` pod logs include entries such as the following:
```
debug 2023-03-09T12:15:07.930+0000 7fe56170d700 1 mon.a@0(leader).osd e3 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc: 322961408
debug 2023-03-09T12:15:10.675+0000 7fe55ef08700 0 log_channel(cluster) log [WRN] : 2 clock skew 21.1215s > max 0.05s
debug 2023-03-09T12:15:10.675+0000 7fe55ef08700 0 log_channel(cluster) log [WRN] : 1 clock skew 15.2147s > max 0.05s
cluster 2023-03-09T12:15:10.676922+0000 mon.a (mon.0) 13978 : cluster [WRN] 2 clock skew 21.1215s > max 0.05s
cluster 2023-03-09T12:15:10.676967+0000 mon.a (mon.0) 13979 : cluster [WRN] 1 clock skew 15.2147s > max 0.05s
```
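The same condition is visible from the Ceph CLI. As a sketch, run the following from the toolbox pod described above: `ceph time-sync-status` reports the measured skew per monitor, and `ceph health detail` surfaces the corresponding `MON_CLOCK_SKEW` health warning.

```bash
# Report each monitor's measured clock skew and round-trip latency.
ceph time-sync-status

# Show the MON_CLOCK_SKEW warning along with the affected monitors.
ceph health detail
```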
When this happens, the operator should focus on bringing the node clocks back in sync. The Ceph documentation recommends synchronizing clocks against NTP servers running on bare metal rather than relying on virtualized clocks inside VMs, which are prone to drift.
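To confirm that time synchronization is actually working on each node, check the NTP client directly. A sketch for nodes running chrony (substitute the equivalent `ntpq -p` check if your distribution uses ntpd):

```bash
# On each node: confirm the clock is synchronized and inspect the
# current offset from the selected NTP source.
chronyc tracking

# List the configured NTP sources and their reachability.
chronyc sources -v

# Quick summary from systemd; look for "System clock synchronized: yes".
timedatectl status
```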