Issue
We have encountered cases where pod-to-pod communication between nodes either does not work at all or is very slow, even though node-to-node communication is fine.
Symptoms range from pods failing to reach the CoreDNS pod for name resolution to poor application performance, such as a web UI loading slowly.
To confirm the symptoms, run tcpdump on two pods running on separate nodes while each pod runs curl against the other's IP address.
For the network speed issue, a simple scp of a large file between the pods shows the average transfer speed. iperf can also be used.
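The checks above could be sketched with a few helper functions like the ones below. This is only a sketch under assumptions: the namespace and the pod names pod-a and pod-b are hypothetical placeholders, the pod images are assumed to contain tcpdump, curl, and iperf, and the curl check assumes something is listening on port 80 in the target pod.

```shell
# Hypothetical names: replace with two pods scheduled on different nodes.
NS="${NS:-default}"
POD_A="${POD_A:-pod-a}"
POD_B="${POD_B:-pod-b}"

# Print the IP address of a pod.
pod_ip() {
    kubectl -n "$NS" get pod "$1" -o jsonpath='{.status.podIP}'
}

# Print the node a pod runs on, to confirm the two pods are on separate nodes.
pod_node() {
    kubectl -n "$NS" get pod "$1" -o jsonpath='{.spec.nodeName}'
}

# Capture packets exchanged with a peer IP inside a pod (stop with Ctrl-C).
# usage: capture_in_pod POD PEER_IP
capture_in_pod() {
    kubectl -n "$NS" exec "$1" -- tcpdump -nn -i any host "$2"
}

# Curl a peer IP from inside a pod, giving up after 5 seconds.
# usage: curl_from_pod POD PEER_IP
curl_from_pod() {
    kubectl -n "$NS" exec "$1" -- curl -v -m 5 "http://$2"
}

# Throughput test: start the iperf server in one pod, then run the client in the other.
# usage: iperf_server POD ; iperf_client POD SERVER_IP
iperf_server() {
    kubectl -n "$NS" exec "$1" -- iperf -s
}
iperf_client() {
    kubectl -n "$NS" exec "$1" -- iperf -c "$2" -t 10
}

# Example session (run the capture and the curl in separate terminals):
#   capture_in_pod "$POD_A" "$(pod_ip "$POD_B")"
#   curl_from_pod  "$POD_B" "$(pod_ip "$POD_A")"
```

If the capture in the receiving pod shows no packets from the sender while the curl times out, traffic is being lost between the nodes rather than inside the pod.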
A customer observed no pod-to-pod connectivity on a host with a ConnectX-6 Lx network interface card and the following firmware version:
ethtool -i ens8 | grep firm ; uname -r ; ethtool -k ens8 | grep ipx
firmware-version: 12.28.2006 (MT_2180110032)
Poor pod-to-pod network speed was observed on hosts with a ConnectX-4 network interface and the following firmware:
ethtool -i ens8 | grep firm ; uname -r ; ethtool -k ens8 | grep ipx
firmware-version: 20.28.4000 (MT_0000000222) 34.18.0-193.6.3.el8_2.x86_64
Workaround
This is an issue with the offloading feature of the network interface and its compatibility with the encapsulation mode used by the overlay network.
To confirm this, and as a workaround, disable the offloading feature of the network interface:
ethtool --offload ens8 rx off tx off
This change is temporary and does not survive a reboot; it is meant to test whether disabling offloading fixes the issue. Check your operating system's documentation for how to persist the change.
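To check that the setting took effect, the checksum offload state can be read back with ethtool -k. The helpers below are a minimal sketch, not part of the original procedure; the function names are made up for illustration, and ens8 is simply the interface name used in this article.

```shell
# Print the rx/tx checksum offload lines for an interface.
# usage: show_checksum_offload IFACE
show_checksum_offload() {
    ethtool -k "$1" | grep -E '^(rx|tx)-checksumming:'
}

# Read `ethtool -k` output on stdin and succeed only if both
# rx-checksumming and tx-checksumming are reported as "off".
checksum_offload_disabled() {
    awk '
        /^(rx|tx)-checksumming:/ { seen++; if ($2 != "off") bad++ }
        END { exit !(seen == 2 && bad == 0) }
    '
}

# Example:
#   ethtool --offload ens8 rx off tx off
#   ethtool -k ens8 | checksum_offload_disabled && echo "checksum offload is off"
```

Running the check again after a reboot is a quick way to see whether the change has been persisted.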
Another recommendation is to change the Calico encapsulation mode from IPIP to VXLAN, although further testing is needed to confirm this. This article will be updated when that testing is complete.
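Before changing anything, it can help to see which encapsulation mode each Calico IP pool currently uses. The sketch below only inspects the pools and does not modify them. It assumes Calico keeps its IPPool resources as Kubernetes custom resources under ippools.crd.projectcalico.org; the function names are hypothetical.

```shell
# List every Calico IP pool with its ipipMode and vxlanMode, tab-separated.
# Assumption: IPPool resources are stored as Kubernetes CRDs.
list_pool_encapsulation() {
    kubectl get ippools.crd.projectcalico.org \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ipipMode}{"\t"}{.spec.vxlanMode}{"\n"}{end}'
}

# Classify a pool from its two mode values (Always, CrossSubnet, Never, or empty).
# An empty value is treated as Never.
# usage: encapsulation_of IPIP_MODE VXLAN_MODE
encapsulation_of() {
    case "${1:-Never}/${2:-Never}" in
        Never/Never) echo "none" ;;
        */Never)     echo "IPIP" ;;
        Never/*)     echo "VXLAN" ;;
        *)           echo "unexpected: both IPIP and VXLAN enabled" ;;
    esac
}

# Example:
#   list_pool_encapsulation
#   encapsulation_of Always Never
```

A pool reported as IPIP is one that the recommendation above would apply to.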