Overview
One of the first ideas that comes to mind when you hear "Kubernetes" is high availability. Spreading multiple replicas across your cluster is typically how you achieve it, but is that always what happens? In a large cluster, the general expectation is that your pods land on many different nodes with minimal repeats. In practice, depending on the current load of the cluster and per-node utilization, you may end up with many pods from a single deployment scheduled onto the same node.
Under the hood
The main loop in the scheduler is fairly straightforward: it pulls a pod from the queue of pods awaiting scheduling, checks that the pod is not being deleted, then attempts to find a node it can be scheduled on:
// Synchronously attempt to find a fit for the pod.
start := time.Now()
suggestedHost, err := sched.schedule(pod)
metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))
if err != nil {
	return
}
Inside schedule, the scheduler hands the pod to the loaded scheduling algorithm, which tries to find a suitable place for our new workload:
// schedule implements the scheduling algorithm and returns the suggested host.
func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) {
	host, err := sched.config.Algorithm.Schedule(pod, sched.config.NodeLister)
	if err != nil {
		glog.V(1).Infof("Failed to schedule pod: %v/%v", pod.Namespace, pod.Name)
		...
How does this work? It comes down to two concepts: Predicates and Priorities. Each plays a critical role in scheduling: Predicates act as hard filters, keeping only the nodes that meet our pod's requirements, while Priorities assign weighted scores to the remaining nodes based on their current state.
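To make the flow concrete before digging into each phase, here is a minimal sketch of how filtering and ranking fit together. This is illustrative Go, not the actual kube-scheduler implementation; the Pod and Node types and the findHost helper are invented for the example:

package sketch

// Pod and Node are simplified stand-ins for the real API objects;
// every name in this sketch is invented for illustration.
type Pod struct{ Name string }
type Node struct{ Name string }

// A predicate is a hard filter; a priority scores a node from 0-10.
type predicate func(Pod, Node) bool
type priority func(Pod, Node) int64

// findHost mirrors the two phases: filter the node list with every
// predicate, then rank the survivors by their summed priority scores.
func findHost(pod Pod, nodes []Node, preds []predicate, prios []priority) (Node, bool) {
	var feasible []Node
	for _, n := range nodes {
		fits := true
		for _, p := range preds {
			if !p(pod, n) {
				fits = false
				break
			}
		}
		if fits {
			feasible = append(feasible, n)
		}
	}
	best, bestScore := Node{}, int64(-1)
	for _, n := range feasible {
		var total int64
		for _, score := range prios {
			total += score(pod, n)
		}
		if total > bestScore {
			best, bestScore = n, total
		}
	}
	return best, bestScore >= 0
}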
Predicates
When scheduling, predicates act as filters for selecting nodes. The scheduler considers all the available metadata and filters down to the subset of nodes that meet every criterion. In smaller clusters this may mean evaluating all available nodes, whereas in large clusters the kube-scheduler stops looking once it has found enough feasible nodes, to save time during scheduling. A fair portion of these criteria come from the pod's own metadata: volume claims, anti-affinity terms, resource requests, labels, available ports, and so on:
predicateMetadata := &predicateMetadata{
	pod:                       pod,
	podBestEffort:             isPodBestEffort(pod),
	podRequest:                GetResourceRequest(pod),
	podPorts:                  schedutil.GetUsedPorts(pod),
	matchingAntiAffinityTerms: matchingTerms,
}
In short, the scheduler weighs many different values when filtering, and any one of them can disqualify a node for our workload. It keeps evaluating nodes only until it has compiled enough feasible ones, so if your cluster is on the larger side, not every node necessarily gets checked before the filtering phase completes. A simplified predicate is sketched below.
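As a concrete illustration, a resource-fit style predicate might look something like the following. The types and names here are invented; the real PodFitsResources predicate in the kube-scheduler handles far more (extended resources, init containers, and so on):

package sketch

// nodeInfo and podRequest are invented stand-ins for the scheduler's
// per-node bookkeeping and the pod's aggregated resource requests.
type nodeInfo struct {
	allocatableMilliCPU, allocatableMemory int64
	requestedMilliCPU, requestedMemory     int64
}

type podRequest struct {
	milliCPU, memory int64
}

// podFitsResources disqualifies a node when the pod's requests do not
// fit into what the node has left after already-scheduled pods.
func podFitsResources(req podRequest, node nodeInfo) bool {
	if req.milliCPU > node.allocatableMilliCPU-node.requestedMilliCPU {
		return false
	}
	return req.memory <= node.allocatableMemory-node.requestedMemory
}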
Priorities
After we have filtered our nodes down to a subset that meets our criteria, it is time to rank each node and attempt to place the pod. Priorities consider node-level metrics such as CPU/memory consumption, the amount of requested resources, taints/tolerations, and even image locality; there is a long list of default priorities in the kube-scheduler source. For each node that passed the predicates, Kubernetes evaluates every priority and ranks the node by the cumulative sum of their scores. For example, in the most_requested priority, we can see the function used for calculating the used capacity on a node:
// The used capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more resources are used the higher the score is. This function
// is almost a reversed version of least_requested_priority.calculateUnusedScore
// (10 - calculateUnusedScore). The main difference is in rounding. It was added to
// keep the final formula clean and not to modify the widely used (by users
// in their default scheduling policies) calculateUnusedScore.
func calculateUsedScore(requested int64, capacity int64, node string) int64 {
	if capacity == 0 {
		return 0
	}
	if requested > capacity {
		glog.V(10).Infof("Combined requested resources %d from existing pods exceeds capacity %d on node %s",
			requested, capacity, node)
		return 0
	}
	return (requested * schedulerapi.MaxPriority) / capacity
}
which is called from the main scoring function, calculateUsedPriority. For example, on a node with 8,000 allocatable millicores where pods request 4,000 millicores in total, the CPU score works out to (4000 * 10) / 8000 = 5:
...
cpuScore := calculateUsedScore(totalResources.MilliCPU, allocatableResources.MilliCPU, node.Name)
memoryScore := calculateUsedScore(totalResources.Memory, allocatableResources.Memory, node.Name)
if glog.V(10) {
	// We explicitly don't do glog.V(10).Infof() to avoid computing all the parameters if this is
	// not logged. There is visible performance gain from it.
	glog.V(10).Infof(
		"%v -> %v: Most Requested Priority, capacity %d millicores %d memory bytes, total request %d millicores %d memory bytes, score %d CPU %d memory",
		pod.Name, node.Name,
		allocatableResources.MilliCPU, allocatableResources.Memory,
		totalResources.MilliCPU, totalResources.Memory,
		cpuScore, memoryScore,
	)
}
return schedulerapi.HostPriority{
	Host:  node.Name,
	Score: int((cpuScore + memoryScore) / 2),
}, nil
This is done for each priority function, and each returns a score from 0 to 10, which the scheduler multiplies by that priority's configured weight. Once all priorities have been evaluated, each node is left with a single total, that total is used to rank the nodes, and the node with the highest score is the one our workload gets placed on. The ranking step is sketched below.
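A minimal sketch of that final ranking step, assuming invented types and the per-priority weight described above:

package sketch

// weightedPriority pairs a priority's per-node scores (0-10) with the
// weight configured for it; all names here are invented for illustration.
type weightedPriority struct {
	weight int64
	scores map[string]int64 // node name -> 0-10 score
}

// bestNode applies the final ranking: sum weight*score per node and
// keep the node with the highest total.
func bestNode(nodes []string, prios []weightedPriority) string {
	best, bestTotal := "", int64(-1)
	for _, n := range nodes {
		var total int64
		for _, p := range prios {
			total += p.weight * p.scores[n]
		}
		if total > bestTotal {
			best, bestTotal = n, total
		}
	}
	return best
}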
Considerations
In general, pods in your environment will be spread out across nodes. Still, because the scheduler weighs many priorities other than spreading, it is worth taking extra steps to ensure high availability when a workload requires it. Given the default predicates and priorities, we have a few options for getting the availability we want. One of the more straightforward routes is to add anti-affinity to our deployment definition:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
With this anti-affinity applied to our deployment, Kubernetes checks each candidate node for pods that match the anti-affinity term, and any node already running a matching pod is disqualified as a placement for our workload. Note that because this rule uses requiredDuringSchedulingIgnoredDuringExecution, it is a hard requirement: if every node already hosts a matching pod, additional replicas will stay Pending rather than doubling up. If best-effort spreading is enough, the softer preferredDuringSchedulingIgnoredDuringExecution variant expresses the same intent as a preference.
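Under the hood, that disqualification boils down to a check along these lines. This is a simplified sketch with invented types; the real predicate evaluates full label selectors and namespaces across arbitrary topology keys, not only the hostname:

package sketch

// existingPod is an invented stand-in: just the labels of a running pod
// and the name of the node hosting it.
type existingPod struct {
	labels map[string]string
	node   string
}

// violatesAntiAffinity reports whether placing a pod on the given node
// would break an anti-affinity term requiring that no co-located pod
// carries the given label key/value.
func violatesAntiAffinity(key, value, node string, running []existingPod) bool {
	for _, p := range running {
		if p.node == node && p.labels[key] == value {
			return true
		}
	}
	return false
}

With the deployment above, key would be app and value would be store, so any node already hosting a store pod is filtered out during the predicate phase, before ranking even begins.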