Troubleshoot GKE


This page lists troubleshooting pages for common issues you might encounter when using Google Kubernetes Engine (GKE). This page is for Admins and architects, Security specialists, Networking specialists, or Storage specialists who troubleshoot GKE configurations. To learn more about GKE roles, see Common GKE Enterprise user roles and tasks.

Use this page to diagnose and resolve issues you encounter across the stages of working with your GKE infrastructure: cluster setup, storage, cluster security, workloads, cluster management, and monitoring.

This page also provides access to more general troubleshooting topics: 4xx errors and known issues.

To troubleshoot GKE networking, see Troubleshoot GKE networking in the GKE networking documentation.

Cluster setup

| Topic | Description |
| --- | --- |
| Cluster creation | Resolve issues with creating clusters. |
| Autopilot clusters | Diagnose and troubleshoot GKE Autopilot clusters, including cluster creation, namespace deletion, scaling, and workload issues. |
| Kubectl command-line tool | Troubleshoot the kubectl command-line tool in GKE, including issues with authentication and authorization. This page also includes advice on how to troubleshoot the Konnectivity proxy to check if it's causing the kubectl logs, attach, exec, or port-forward commands to stop responding. |
| Standard node pools | Troubleshoot GKE Standard node pools, including issues with node pool creation, best-effort provisioning, corrupted instance metadata, and migrating workloads to new node pools. |
| Node registration | Troubleshoot issues that occur when adding nodes to your GKE Standard cluster, such as node registration failures and missing prerequisites for successful node registration. |
| Container runtime | Troubleshoot container runtimes in GKE, including issues with containerd, dockershim, and private registries. |

Storage

| Topic | Description |
| --- | --- |
| Storage | Troubleshoot storage, including issues with regional persistent disks, disk performance, and volume expansion. |

Cluster security

| Topic | Description |
| --- | --- |
| Authentication | Troubleshoot authentication in GKE, including issues with RBAC, Workload Identity Federation for GKE, and the GKE metadata server. |
| Service accounts | Troubleshoot service accounts, including restoring the default service account and enabling the Compute Engine default service account. |
| Application-layer secrets | Troubleshoot issues that can occur when configuring application-layer secrets encryption, including failed updates and errors where you're unable to use a Cloud KMS key or where the Cloud KMS key version was destroyed. |

Cluster's root Certificate Authority expiring soon

| Topic | Description |
| --- | --- |
| Root Certificate Authority (CA) expiring | If your cluster's root Certificate Authority (CA) is expiring soon, learn how to perform a credential rotation to prevent normal cluster operations from being interrupted. |

Workloads

| Topic | Description |
| --- | --- |
| Deployed workloads | Troubleshoot errors for workloads running in a GKE cluster, including CrashLoopBackOff and PodUnschedulable. Read the PodUnschedulable section for advice on errors like MatchNodeSelector and Does not have minimum availability. |
| Image pulls | Troubleshoot image pulls. Learn what causes statuses like ImagePullBackOff and ErrImagePull and how to resolve them by fixing common issues like authentication and network connectivity. |
| Arm workloads | Troubleshoot issues with Arm workloads, including Pods on Arm nodes crashing. |
| TPUs | Troubleshoot TPUs, including issues with quota, node auto-provisioning, workload configuration, and scheduling. |
| GPUs | Troubleshoot GPUs, including issues with GPU driver installation, device plugin errors, and container images. |

Cluster management

| Topic | Description |
| --- | --- |
| Upgrades | Troubleshoot issues with GKE cluster upgrades, such as a kube-apiserver that's unhealthy after a control plane upgrade or workloads that are evicted after an upgrade. |
| Webhooks | Understand how to troubleshoot and ensure the stability of your cluster control plane when using admission webhooks. |
| Namespace stuck in the Terminating state | Troubleshoot issues with namespaces stuck in the Terminating state by identifying and removing the unhealthy components that are blocking deletion. |

Monitoring

| Topic | Description |
| --- | --- |
| System metrics | Troubleshoot system metrics not appearing in Cloud Monitoring. |
| Monitoring dashboards | Troubleshoot monitoring dashboards, including issues with enabling monitoring, missing Kubernetes resources, and permissions. |
| Logging | Troubleshoot logging, including issues with enabling logging, missing logs, and quotas. |

4xx errors

| Topic | Description |
| --- | --- |
| 4xx errors | Troubleshoot some of the 400, 401, 403, and 404 errors that you might encounter when using GKE. This page also includes information on how to troubleshoot missing edit permissions on account errors. |

Known issues

| Topic | Description |
| --- | --- |
| Known issues | Identify and resolve known issues that might affect your use of GKE. |

What's next