Overview
In deployments where multiple on-premise Konvoy clusters run on the same L2 network, two or more Keepalived deployments may try to claim the same virtual router ID (VRID), which can lead to network instability.
Problem
If a large number of gratuitous ARP requests are coming from Konvoy nodes, there may be a Keepalived configuration conflict. When two or more clusters are configured to use the same VRID, each one tries to claim it and sends out a series of gratuitous ARP requests. When such a conflict occurs, you will see log messages like the following:
Fri Jan 21 17:16:10 2022: (lb-vips) Entering BACKUP STATE
Fri Jan 21 17:16:10 2022: (lb-vips) removing VIPs.
Fri Jan 21 17:16:10 2022: (lb-vips) ip address associated with VRID 51 not present in MASTER advert : <Control plane endpoint>
Fri Jan 21 17:16:10 2022: (lb-vips) ip address associated with VRID 51 not present in MASTER advert : <Control plane endpoint>
Fri Jan 21 17:16:10 2022: (lb-vips) ip address associated with VRID 51 not present in MASTER advert : <Control plane endpoint>
Fri Jan 21 17:16:30 2022: (lb-vips) Receive advertisement timeout
Fri Jan 21 17:16:30 2022: (lb-vips) Entering MASTER STATE
When Keepalived enters the MASTER state, it sends a series of five gratuitous ARP requests for every virtual IP (VIP) that is configured:
Fri Jan 21 17:16:30 2022: (lb-vips) Entering MASTER STATE
Fri Jan 21 17:16:30 2022: (lb-vips) setting VIPs.
Fri Jan 21 17:16:30 2022: (lb-vips) Sending/queueing gratuitous ARPs on ens192 for <Control plane endpoint>
Fri Jan 21 17:16:30 2022: Sending gratuitous ARP on ens192 for <Control plane endpoint>
Fri Jan 21 17:16:30 2022: Sending gratuitous ARP on ens192 for <Control plane endpoint>
Fri Jan 21 17:16:30 2022: Sending gratuitous ARP on ens192 for <Control plane endpoint>
Fri Jan 21 17:16:30 2022: Sending gratuitous ARP on ens192 for <Control plane endpoint>
Fri Jan 21 17:16:30 2022: Sending gratuitous ARP on ens192 for <Control plane endpoint>
These gratuitous ARP requests are sent by every Keepalived pod with a conflicting VRID, leading to significant overhead and general instability within the network.
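To confirm the conflict, you can check the Keepalived container logs on a control-plane node. A minimal sketch, assuming the static pods run in the kube-system namespace and follow the keepalived-<node name> naming convention (adjust both to match your deployment):

kubectl get pods -n kube-system | grep keepalived
kubectl logs -n kube-system keepalived-<node name> | grep -i "vrid\|gratuitous"

The repeated "not present in MASTER advert" and "Sending gratuitous ARP" entries shown above indicate that another deployment is claiming the same VRID.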
Solution
You can validate the configured value by inspecting the vrrp_instance block of the Keepalived configuration and checking its virtual_router_id:
vrrp_instance lb-vips {
    state BACKUP
    interface ens192
    virtual_router_id 51
    priority 100
    advert_int 1
    nopreempt # Prevent fail-back
    track_script {
        chk_script
    }
}
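As a quicker check, the VRID is also exposed as an environment variable on the keepalived container, so you can read it from the pod spec. A sketch, again assuming the keepalived-<node name> pod naming in kube-system:

kubectl describe pod -n kube-system keepalived-<node name> | grep VRID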
If multiple clusters are configured to use the same VRID, you need to edit the static pod manifest on each control-plane node and update your cluster.yaml to reflect the new value. The keepalived.yaml file is located in your static pod manifest directory, which by default is /etc/kubernetes/manifests. Within keepalived.yaml, edit the env section of the keepalived container under spec.containers so that VRID is set to a value between 1 and 255 that is unique on your network:
containers:
- command:
  - /usr/sbin/keepalived
  - --dont-fork
  - --dump-conf
  - --log-console
  - --log-detail
  - --vrrp
  - --snmp
  - --snmp-agent-socket
  - /var/agentx/master.sock
  env:
  - name: VRID
    value: <Update this value>
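The new VRID should also be reflected in cluster.yaml so that a later provisioning run does not revert the change. A sketch of what that stanza might look like; the exact field path (shown here as spec.kubernetes.controlPlane.keepalived.vrid) is an assumption, so confirm it against the Konvoy documentation for your version:

kind: ClusterConfiguration
spec:
  kubernetes:
    controlPlane:
      keepalived:
        vrid: <Update this value>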
When this value is updated, Kubernetes automatically restarts the Keepalived pod on that control-plane node. Once all pods are updated to the new value, the rate of gratuitous ARP requests will settle to roughly one per minute.
Note that all control-plane nodes in a cluster must use the same VRID value.
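After the pods restart, you can re-check the logs to verify the conflict is resolved; a sketch, again assuming the keepalived-<node name> pod naming:

kubectl logs -n kube-system keepalived-<node name> | tail -n 20

The "ip address associated with VRID ... not present in MASTER advert" messages should no longer appear, and the repeated gratuitous ARP bursts should stop.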