Troubleshooting Kubernetes with Commands

Working with Kubernetes can be both exciting and challenging. While it offers powerful tools for managing containerized applications, issues are inevitable. Whether it's a pod stuck in CrashLoopBackOff or a service not responding, knowing how to troubleshoot effectively is crucial.

In this chapter, we'll walk through some common scenarios and the kubectl commands that can help us diagnose and resolve them.

Checking Cluster Health

Before diving into specific issues, it's essential to ensure our cluster is healthy.

View Cluster Nodes

Use the following command to view cluster nodes -

$ kubectl get nodes

Output

NAME           STATUS   ROLES           AGE   VERSION
controlplane   Ready    control-plane   71m   v1.31.6
node01         Ready    <none>          71m   v1.31.6

This command lists all nodes in the cluster. We should see all nodes in the Ready state. If any node is NotReady, it might indicate issues with the node's health or connectivity.

Inspect Node Details

Use the following command to inspect node details -

$ kubectl describe node <node-name>

It provides detailed information about a specific node, including resource usage, conditions, and events. It's useful for identifying issues like disk pressure or memory shortages.

Investigating Pods

Pods are the smallest deployable units in Kubernetes. When things go wrong, pods are often the first place to look.

List All Pods

You can use the following command to list all the pods -

$ kubectl get pods -A

Output

NAMESPACE      NAME                                    READY   STATUS    RESTARTS   AGE
kube-flannel   kube-flannel-ds-fbqdb                   1/1     Running   0          75m
kube-flannel   kube-flannel-ds-jln5p                   1/1     Running   0          75m
kube-system    coredns-7c65d6cfc9-4t6gh                1/1     Running   0          75m
kube-system    coredns-7c65d6cfc9-gprrn                1/1     Running   0          75m
kube-system    etcd-controlplane                       1/1     Running   0          75m
kube-system    hostpath-provisioner-5558658586-md76l   1/1     Running   0          75m

It lists all pods across all namespaces. Look for pods not in the Running or Completed state.
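To surface only the problem pods, we can filter by phase with a field selector. This is a quick sketch; note that a crash-looping pod can still report the Running phase, so it complements rather than replaces a scan of the STATUS column -

$ kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded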

Describe a Pod

Use the following command to describe a pod -

$ kubectl describe pod <pod-name> -n <namespace>

For instance:

$ kubectl describe pod kube-flannel-ds-fbqdb -n kube-flannel

Output

Name:                 kube-flannel-ds-fbqdb
Namespace:            kube-flannel
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      flannel
Node:                 controlplane/172.16.8.5
Start Time:           Tue, 29 Apr 2025 11:15:04 +0000

It provides detailed information about the pod, including events, which can indicate issues like failed scheduling or image pull errors.

View Pod Logs

To view pod logs, use the following command -

$ kubectl logs kube-flannel-ds-fbqdb -n kube-flannel

Output

I0429 11:15:25.762540       1 kube.go:139] Waiting 10m0s for node controller to sync
I0429 11:15:25.762636       1 kube.go:469] Starting kube subnet manager
I0429 11:15:25.769460       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.244.1.0/24]

It shows the logs from the pod's main container. If the pod has multiple containers, specify the container name:

$ kubectl logs <pod-name> -c <container-name> -n <namespace>

Common Pod Issues

Common Pod issues include application errors, misconfigured environment variables, or insufficient resources. Let's explore some frequent pod-related problems and how to troubleshoot them.

CrashLoopBackOff

$ kubectl get pods

Output

NAME            READY   STATUS             RESTARTS   AGE
crashloop-pod   0/1     CrashLoopBackOff   5          30s

This status indicates that a pod is repeatedly crashing.

Describe the pod to view recent events:

$ kubectl describe pod crashloop-pod

Output

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  30s   default-scheduler  Successfully assigned default/crashloop-pod to node1
  Normal   Pulled     30s   kubelet, node1     Container image "busybox" already present on machine
  Warning  BackOff    5s    kubelet, node1     Back-off restarting failed container

Check the pod's logs for error messages:

$ kubectl logs crashloop-pod

Output

sh: nonexistent-command: not found

This error occurs when the container tries to run a command that doesn't exist in the image. We can troubleshoot this issue by:

  • Setting a valid command in the pod's YAML file, or removing the invalid one (see the sketch after this list).
  • Deleting and recreating the pod after making the changes.
  • Confirming the pod reaches the Running state and checking the logs to verify the fix.
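As a sketch of the first fix, here is a hypothetical manifest for crashloop-pod with a valid command; the image and command are placeholders for whatever the application actually needs -

apiVersion: v1
kind: Pod
metadata:
  name: crashloop-pod
spec:
  containers:
  - name: crashloop-container
    image: busybox
    # Replace the invalid entrypoint with a command that exists in the image
    command: ["sh", "-c", "echo 'starting up' && sleep 3600"]

After saving the change (assuming the manifest lives in crashloop-pod.yaml), delete and re-apply the pod:

$ kubectl delete pod crashloop-pod
$ kubectl apply -f crashloop-pod.yaml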

ImagePullBackOff / ErrImagePull

This error occurs when Kubernetes can't pull the container image, usually due to authentication issues or incorrect image names.

Describe the pod to get error details:

$ kubectl describe pod <pod-name> -n <namespace>

Verify the image name and tag in your deployment configuration. Ensure the image exists in the specified registry and that Kubernetes has access to it. Authentication issues with private registries are a common culprit.
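If the image lives in a private registry, one common fix is to create a docker-registry secret and reference it from the pod spec. The registry address and credentials below are placeholders for your own values -

$ kubectl create secret docker-registry regcred \
    --docker-server=registry.example.com \
    --docker-username=<username> \
    --docker-password=<password> \
    -n <namespace>

Then reference the secret under spec.imagePullSecrets (with name: regcred) in the pod or deployment manifest and recreate the pod.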

Pending Pods

If a pod is stuck in the Pending state, it usually means it hasn't been scheduled to a node, possibly due to resource constraints or node selectors.

Describe the pod to see scheduling events:

$ kubectl describe pod <pod-name> -n <namespace>

Check for resource constraints or node selectors that might prevent scheduling. Insufficient cluster resources or misconfigured affinity rules can cause this issue.
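To see whether the cluster simply has no room left, compare each node's allocated resources with what the pod requests. A minimal sketch using the Allocated resources section of kubectl describe node and a jsonpath query for the pod's requests -

$ kubectl describe nodes | grep -A 8 "Allocated resources"
$ kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

If the requests exceed what any single node can allocate, either lower the requests or add capacity.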

Service and Networking Issues

Services expose applications running on pods. Networking problems can prevent communication between services and pods.

List the Services

Use the following command to list the services -

$ kubectl get svc -A

Output

NAMESPACE     NAME                   TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes             ClusterIP   10.96.0.1     <none>        443/TCP                  50m
kube-system   kube-dns               ClusterIP   10.96.0.10    <none>        53/UDP,53/TCP,9153/TCP   50m
kube-system   kubelet-csr-approver   ClusterIP   10.107.94.7   <none>        8080/TCP                 50m

It lists all the services across namespaces.

Describe a Service

Use the following command to describe a service -

$ kubectl describe svc kubelet-csr-approver -n kube-system

Output

Type:                     ClusterIP
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.107.94.7
IPs:                      10.107.94.7
Port:                     metrics  8080/TCP
TargetPort:               metrics/TCP
Endpoints:                10.244.1.3:8080,10.244.1.6:8080
Session Affinity:         None
Internal Traffic Policy:  Cluster
Events:                   <none>

It provides details about the service's configuration and endpoints.

Test Service Connectivity

Use port forwarding to test service access:

$ kubectl port-forward svc/kubelet-csr-approver 8080:8080 -n kube-system

Output

Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Handling connection for 8080
Handling connection for 8080

Then, access the service at http://localhost:8080.
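While the port-forward is running, a quick request from another terminal confirms that something answers on the forwarded port; the root path here is only an example, so substitute whichever endpoint your service actually serves -

$ curl -i http://localhost:8080/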

DNS Resolution

If a pod can't resolve service names, check the DNS configuration:

$ kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>

Ensure the CoreDNS pods are running and healthy.
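To confirm CoreDNS itself is healthy, list its pods and check their recent logs. The k8s-app=kube-dns label is the one used by standard CoreDNS installs; adjust it if your cluster labels the DNS pods differently -

$ kubectl get pods -n kube-system -l k8s-app=kube-dns
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20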

Node-Level Troubleshooting

Sometimes, issues stem from the nodes themselves.

Check Node Status

Check the node status using the following command -

$ kubectl get nodes

Output

NAME           STATUS   ROLES           AGE     VERSION
controlplane   Ready    control-plane   6m37s   v1.31.6
node01         Ready    <none>          6m23s   v1.31.6

Nodes should be in the Ready state.

Describe a Node

If you need to get the description of a node, then use the following command -

$ kubectl describe node <node-name>

Look for conditions like MemoryPressure or DiskPressure.
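To scan every node for these conditions at once, a jsonpath query works well; this is just one way of slicing the status.conditions array -

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.status=="True")].type}{"\n"}{end}'

On a healthy node only Ready should be True; if MemoryPressure, DiskPressure, or PIDPressure shows up in the output, the node is under resource pressure.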

Configuration and Secrets

Misconfigured ConfigMaps or Secrets can cause application failures.

List ConfigMaps

List ConfigMaps using the following command -

$ kubectl get configmaps -n dev

Output

NAME               DATA   AGE
app-config         3      25m
env-settings       2      25m

Describe a ConfigMap

Use the following command to describe a ConfigMap -

$ kubectl describe configmap app-config -n dev

Output

Name:         app-config
Namespace:    dev

Data
====
DB_HOST:
----
db-service
DB_PORT:
----
not-a-number
PORT:
----
80

In this example, the DB_PORT value is incorrectly set to not-a-number, but it should be an integer (a valid port number). Here's how we can resolve it:

$ kubectl patch configmap app-config -n dev -p '{"data":{"DB_PORT":"5432"}}'

Output

configmap/app-config patched

This patches the DB_PORT value to 5432 without opening an editor.
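Keep in mind that pods typically read ConfigMap values injected as environment variables only at startup, so the running application won't see the new value until its pods are recreated. Assuming the ConfigMap is consumed by a Deployment named my-app (a hypothetical name), one way to roll the pods is -

$ kubectl rollout restart deployment my-app -n dev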

Events and Audit Trails

Events provide a timeline of significant occurrences in the cluster.

View Events

Use the following commands to view events -

$ kubectl get events -A --sort-by='.metadata.creationTimestamp'

Output

NAMESPACE     LAST SEEN   TYPE     REASON    OBJECT                               MESSAGE
kube-system   15m         Normal   Pulled    pod/metrics-server-54bf7cdd6-7khch   Successfully pulled image "registry.k8s.io/metrics-server/metrics-server:v0.7.2" in 2.458s (2.458s including waiting). Image size: 19494617 bytes.
kube-system   15m         Normal   Created   pod/metrics-server-54bf7cdd6-7khch   Created container: metrics-server
kube-system   15m         Normal   Started   pod/metrics-server-54bf7cdd6-7khch   Started container metrics-server

It lists all the events across namespaces, sorted by time. Events can reveal issues like failed mounts, scheduling problems, or image pull errors.

Advanced Debugging

For more complex issues, additional tools and commands can help.

Execute Commands in a Pod

If you need to execute a command in a Pod, then use the following command -

$ kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

Example: Exec into a Pod

$ kubectl exec -it my-app-5f4d8c6f6c-rm87x -n dev -- /bin/bash

It opens a shell inside the pod. Use this shell to inspect configuration files, test DNS (nslookup), or run service-level commands (curl, ping, etc.).

Run a Debug Pod

Deploy a temporary pod with debugging tools:

$ kubectl run debug-pod --rm -i --tty --image=busybox -- /bin/sh

This is useful for testing network connectivity or DNS resolution.
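Once inside the shell, a few quick checks cover most connectivity questions. The service below reuses the kubelet-csr-approver example from earlier, and wget stands in for curl, which the busybox image doesn't ship -

/ # nslookup kubernetes.default.svc.cluster.local
/ # nslookup kubelet-csr-approver.kube-system.svc.cluster.local
/ # wget -qO- http://kubelet-csr-approver.kube-system:8080/

If DNS resolution fails here but the service IP is reachable directly, the problem is likely with CoreDNS rather than the service itself.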

Debug Stateful Applications

For StatefulSets or persistent volumes, check the Persistent Volume Claims (PVCs):

$ kubectl get pvc -n dev

Output

NAME           STATUS    VOLUME   CAPACITY   ACCESS MODES
data-my-db-0   Pending

Describe the PVC

Use the following command to describe the PVC -

$ kubectl describe pvc data-my-db-0 -n dev

Output

Events:
  Type     Reason              Message
  ----     ------              -------
  Warning  ProvisioningFailed  no persistent volumes available for this claim

For this issue, check whether a compatible StorageClass and PersistentVolume exist. Create one if missing, or update the PVC to request an available class, and look out for access mode or volume binding mismatches.
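To see what storage the cluster can actually offer, list the StorageClasses and PersistentVolumes and compare them with what the claim requests -

$ kubectl get storageclass
$ kubectl get pv
$ kubectl get pvc data-my-db-0 -n dev -o jsonpath='{.spec.storageClassName}{"\n"}'

A claim stays Pending until a PV (or a dynamic provisioner behind the requested StorageClass) can satisfy its storage class, capacity, and access mode.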

Inspect DaemonSets and Jobs

DaemonSets and Jobs often cause cluster-level issues if improperly configured.

Check DaemonSet Status

Use the following command to check DaemonSet status -

$ kubectl get ds -n kube-system

Output

NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-proxy   2         2         2       2            2           kubernetes.io/os=linux   39m

Check Job Status

To check the status of a Job, use the following command -

$ kubectl get jobs -n dev

Output

NAME               COMPLETIONS   DURATION   AGE
my-job             1/1           2s         10m
failed-job         0/1           1m         15m

It shows how many of the required completions a Job has achieved. 1/1 means the Job completed successfully; 0/1 indicates it either hasn't completed yet or has failed.

For failed Jobs, inspect pod logs or events:

$ kubectl logs <pod-name>
$ kubectl describe job <job-name>

For DaemonSets:

  • Ensure all nodes that should be running the DaemonSet have pods.
  • If pods aren't created on some nodes, inspect taints and node selectors (see the commands after this list).
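A sketch of those checks: print each node's taints and compare them with the DaemonSet's tolerations and node selector. kube-proxy is used here only because it appears in the earlier output; substitute your own DaemonSet name -

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.taints}{"\n"}{end}'
$ kubectl get ds kube-proxy -n kube-system -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
$ kubectl get ds kube-proxy -n kube-system -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'

A node whose taints aren't tolerated, or that doesn't match the node selector, simply won't get a DaemonSet pod.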

Check RBAC Permissions

When an application or user can't access resources, it might be due to Role-Based Access Control (RBAC).

Verify the permissions:

$ kubectl auth can-i create deployments --as=dev-user

Output

no

It means that the dev-user does not have the necessary permissions to create deployments in the cluster.

To fix this, update the RBAC configuration so that dev-user is granted permission to create deployments.
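One way to grant that access (a sketch, assuming a namespace-scoped role in dev is sufficient) is to create a Role that allows creating Deployments and bind it to dev-user -

$ kubectl create role deployment-creator --verb=create --resource=deployments -n dev
$ kubectl create rolebinding dev-user-deployments --role=deployment-creator --user=dev-user -n dev

Re-running kubectl auth can-i create deployments --as=dev-user -n dev should now return yes.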

Debugging a Container with a Sidecar

We can attach a temporary debugging container (an ephemeral container) to an existing pod to inspect shared volumes, the network, or the file system without interrupting the main application.

$ kubectl debug -it pod/my-app -n dev --image=busybox --target=main-container

Output

Targeting container "main-container". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-w7xkz.
If you don't see a command prompt, try pressing enter.
/ #

This command attaches a temporary debug container, built from the busybox image, to the existing my-app pod in the dev namespace. It runs alongside main-container (the primary container in the pod) and, where the container runtime supports it, shares that container's process namespace, so we can inspect the application without restarting or modifying it.

The -it flags drop us straight into a shell inside the debug container, where we can start investigating issues without affecting the main container. The debug container's name (debugger-w7xkz above) is generated by kubectl, so it will differ on each run.

Leverage External Tools

Third-party tools can enhance your troubleshooting capabilities:

  • k9s: A terminal UI for managing Kubernetes clusters.
  • Lens: A Kubernetes dashboard with real-time metrics.
  • Prometheus/Grafana: For advanced monitoring and alerting.

Conclusion

Troubleshooting Kubernetes involves a combination of inspecting resources, analyzing logs, and testing connectivity. By mastering these command-line techniques, we can efficiently diagnose and resolve issues in our clusters.

Remember, the key steps include:

  • Inspecting pods, deployments, and services.
  • Analyzing logs for error messages.
  • Testing network connectivity between components.
  • Monitoring resource usage to identify bottlenecks.

With these tools and strategies, we're well-equipped to maintain healthy and resilient Kubernetes environments.