Kubernetes - Advanced Scheduling Techniques



As we build and scale our applications in Kubernetes, we begin to run into more complex scheduling needs. The default scheduler in Kubernetes does a great job out of the box, but there are many scenarios where we need to fine-tune how and where our Pods get placed in the cluster.

In this chapter, we'll explore advanced scheduling techniques in Kubernetes. These tools and strategies help us manage workloads more effectively, optimize resource usage, and ensure that applications meet performance, compliance, and high availability requirements.

What is Pod Scheduling?

Pod scheduling is the process of assigning a Pod to a suitable node. When we create a Pod, the Kubernetes Scheduler looks at the available nodes and decides where to place that Pod, based on various factors like available CPU, memory, node taints, affinity rules, and more.

In simple cases, Kubernetes just needs to find a node that meets the Pod's resource requests. But in production clusters, we often have extra rules or policies to consider — like spreading workloads across zones, running certain workloads only on specific machines, or isolating Pods from others.
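We can see these decisions as they happen: kubectl get pods -o wide shows which node each Pod landed on, and the cluster's events record every placement made by the scheduler.

$ kubectl get pods -o wide
$ kubectl get events --field-selector reason=Scheduled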

Key Concepts in Advanced Scheduling

Let's go over the main tools we can use for advanced scheduling in Kubernetes:

  • Node Affinity and Anti-Affinity
  • Pod Affinity and Anti-Affinity
  • Taints and Tolerations
  • Scheduling Constraints and Topology Spread
  • Resource Limits and Requests
  • Priority Classes and Preemption
  • Dynamic Scheduling (with the Descheduler)

We'll go through each of these with explanations and examples.

Node Affinity and Anti-Affinity

Node Affinity

Node Affinity is used when we want a Pod to run only on nodes that meet specific criteria, like having a certain label.

For example, let's say we have some GPU nodes labeled hardware=gpu. We want our AI workloads to run only on these nodes.

First, make sure the node has the right label:

$ kubectl label nodes node01 hardware=gpu

Output

node/node01 labeled

Next, create the following file called gpu-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: hardware
            operator: In
            values:
            - gpu
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9

Apply it using:

$ kubectl apply -f gpu-pod.yaml

Output

pod/gpu-pod created

This rule ensures that the Pod will only run on nodes labeled with hardware=gpu.
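We can verify the placement; the NODE column should show node01, the node we labeled:

$ kubectl get pod gpu-pod -o wide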

Node Anti-Affinity

Kubernetes doesn't have a separate node anti-affinity field; we get the opposite effect by using nodeAffinity with the NotIn (or DoesNotExist) operator, telling the scheduler to avoid certain nodes.

For example, if we don't want our workload running on nodes labeled env=testing, save the following as node-anti-affinity-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: no-testing-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: NotIn
            values:
            - testing
  containers:
  - name: nginx
    image: nginx
Apply the file:

$ kubectl apply -f node-anti-affinity-pod.yaml

Output

pod/no-testing-pod created

This helps us avoid unwanted environments or isolate certain workloads.
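Note that requiredDuringSchedulingIgnoredDuringExecution is a hard rule: if no node matches, the Pod stays Pending. When the rule should be a preference instead, we can use preferredDuringSchedulingIgnoredDuringExecution. Here's a minimal sketch (the Pod name is just for illustration); the weight, from 1 to 100, ranks this preference against any others:

apiVersion: v1
kind: Pod
metadata:
  name: prefer-gpu-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: hardware
            operator: In
            values:
            - gpu
  containers:
  - name: nginx
    image: nginx

With this spec, the scheduler favors hardware=gpu nodes but still places the Pod elsewhere if none are available.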

Pod Affinity and Anti-Affinity

While Node Affinity targets the nodes, Pod Affinity and Anti-Affinity work at the Pod level.

Pod Affinity

With Pod Affinity, we can place Pods close to each other — for example, co-locating a frontend and backend for low-latency communication.

First, deploy a backend Pod with the label app=backend. Save this as backend-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: backend
  labels:
    app: backend
spec:
  containers:
  - name: nginx
    image: nginx

Apply it:

$ kubectl apply -f backend-pod.yaml

Output

pod/backend created

Now create the Pod with Pod Affinity. Save it as pod-affinity.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - backend
        topologyKey: "kubernetes.io/hostname"
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]

Here, the Pod will be scheduled on a node where a Pod labeled app=backend is already running. The topologyKey defines what "together" means: kubernetes.io/hostname means the same node, while a key like topology.kubernetes.io/zone would only require the same zone.

Apply it:

$ kubectl apply -f pod-affinity.yaml

Output

pod/affinity-pod created
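To confirm the co-location, list both Pods; they should report the same NODE:

$ kubectl get pods backend affinity-pod -o wide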

Pod Anti-Affinity

Now let's make sure Pods don't land on the same node — useful for high availability.

First, deploy a frontend Pod with the label app=frontend. Save this as frontend-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  containers:
  - name: nginx
    image: nginx

Apply it:

$ kubectl apply -f frontend-pod.yaml

Output

pod/frontend created

Create a second Pod that uses Pod Anti-Affinity. Save it as pod-anti-affinity.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: isolated-frontend
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - frontend
        topologyKey: "kubernetes.io/hostname"
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]

Now this Pod won't be placed on the same node as any Pod labeled app=frontend. Keep in mind that required rules are hard constraints: on a single-node cluster, this Pod would stay Pending forever.

Apply it:

$ kubectl apply -f pod-anti-affinity.yaml

Output

pod/isolated-frontend created

Taints and Tolerations

Taints let nodes repel Pods. This is useful when we want to dedicate nodes to special workloads.

Let's taint a node and run a Pod that tolerates it.

$ kubectl taint nodes node01 special=workload:NoSchedule

Output

node/node01 tainted

This means no new Pod will be scheduled on node01 unless it has a matching toleration. (Pods already running there are unaffected; the NoExecute effect would additionally evict them.)
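We can check a node's current taints at any time, and remove one later by appending a minus sign to the same taint spec:

$ kubectl describe node node01 | grep Taints
$ kubectl taint nodes node01 special=workload:NoSchedule-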

To tolerate this taint, create the following file called toleration-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: toleration-pod
spec:
  tolerations:
  - key: "special"
    operator: "Equal"
    value: "workload"
    effect: "NoSchedule"
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]

This way, we can set up dedicated GPU nodes, batch nodes, or even isolate workloads for security or compliance reasons.

Apply the Pod:

$ kubectl apply -f toleration-pod.yaml

Output

pod/toleration-pod created
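If we want a Pod to tolerate any taint with a given key, regardless of its value, we can use operator: Exists and omit the value. A small variant of the spec above (the Pod name is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: tolerate-any-special
spec:
  tolerations:
  - key: "special"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]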

Scheduling Constraints and Topology Spread

To ensure Pods are spread evenly across zones (or other topology domains), we can use Topology Spread Constraints. This is especially useful for high availability, so that a failure in one zone doesn't bring down all replicas.

To get started, create the following file named topology-spread-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: myapp
      containers:
      - name: nginx
        image: nginx

This tells Kubernetes:

  • Spread the Pods across zones, based on the topology.kubernetes.io/zone node label.
  • maxSkew: 1 means the number of matching Pods in any two zones may differ by at most one.
  • whenUnsatisfiable: DoNotSchedule means a Pod that would violate the constraint stays Pending instead of being placed unevenly.

Apply it with:

$ kubectl apply -f topology-spread-deployment.yaml

Output

deployment.apps/myapp created
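To see how the replicas were spread, compare each node's zone label with where the Pods landed:

$ kubectl get nodes -L topology.kubernetes.io/zone
$ kubectl get pods -l app=myapp -o wide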

Resource Limits and Requests

Requests and limits aren't scheduling strategies on their own, but they shape placement directly: the scheduler only considers nodes with enough unreserved capacity to satisfy a Pod's requests, while limits cap what its containers may consume at runtime.

Create the following file named resource-requests-limits.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"

Explanation:

  • Requests (CPU & memory) are what the scheduler uses to place the Pod; the chosen node must have at least this much unallocated capacity.
  • Limits cap runtime consumption: a container that exceeds its CPU limit is throttled, and one that exceeds its memory limit is OOM-killed.

Apply it with:

$ kubectl apply -f resource-requests-limits.yaml

Output

pod/resource-demo created
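Because the requests are lower than the limits, Kubernetes assigns this Pod the Burstable QoS class (Guaranteed would require requests equal to limits). We can confirm it with:

$ kubectl describe pod resource-demo | grep "QoS Class"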

Priority Classes and Preemption

If the cluster is full, high-priority workloads can preempt (evict) low-priority ones. This is useful when running mission-critical apps.

First, create the PriorityClass. Save this as high-priority-class.yaml:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "This class is for critical workloads"

Apply it:

$ kubectl apply -f high-priority-class.yaml

Output

priorityclass.scheduling.k8s.io/high-priority created

Now, use it in a Pod. Create this file as high-priority-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority
  containers:
  - name: nginx
    image: nginx

Apply it:

$ kubectl apply -f high-priority-pod.yaml

Output

pod/critical-app created

If the cluster is under pressure, Kubernetes may evict lower-priority Pods to make space for this one.
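At admission time, Kubernetes resolves priorityClassName into a numeric spec.priority on the Pod, which we can read back; for this Pod it should print 100000:

$ kubectl get pod critical-app -o jsonpath='{.spec.priority}'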

The Descheduler (for Dynamic Scheduling)

Sometimes, the initial Pod placement isn't ideal. Maybe nodes are underused or overloaded.

Kubernetes doesn't automatically move Pods after they've been scheduled. But the Descheduler can help.

The Descheduler is a separate component that:

  • Analyzes the current state of the cluster
  • Identifies Pods that should be moved
  • Evicts them safely, allowing the Scheduler to reschedule them

It supports strategies like:

  • Removing duplicates from the same node
  • Balancing resource usage
  • Enforcing affinity/anti-affinity

You can install it as a Job or CronJob and run it periodically. It's great for dynamic environments where workloads change over time.
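As a rough sketch only (the image tag, schedule, ServiceAccount, and ConfigMap name below are placeholders; the kubernetes-sigs/descheduler project publishes the maintained manifests, RBAC rules, and policy file format), a CronJob running the Descheduler every 30 minutes might look like this:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler-sa   # needs RBAC permission to evict Pods
          restartPolicy: Never
          containers:
          - name: descheduler
            image: registry.k8s.io/descheduler/descheduler:v0.29.0
            command: ["/bin/descheduler"]
            args: ["--policy-config-file=/policy-dir/policy.yaml"]
            volumeMounts:
            - name: policy-volume
              mountPath: /policy-dir
          volumes:
          - name: policy-volume
            configMap:
              name: descheduler-policy   # holds the DeschedulerPolicy file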

Conclusion

Kubernetes scheduling isn't just about placing Pods on nodes — it's about making smart, context-aware decisions that align with how our applications actually work in the real world.

Whether we're keeping critical services apart for high availability, co-locating tightly coupled workloads for performance, or letting the Descheduler rebalance placements as conditions change, the tools are all there. We just need to know how and when to use them.

Advanced techniques like affinity rules, taints and tolerations, topology spread constraints, priority classes, and the Descheduler give us precision control over how our clusters behave — not just at startup, but as they evolve and scale.

By combining these strategies, we can build clusters that are more resilient, more efficient, and more tailored to our real operational goals.