AI/ML orchestration on GKE documentation

Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. With GKE, you can implement a robust, production-ready AI/ML platform with all the benefits of managed Kubernetes and these capabilities:

  • Infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale.
  • Flexible integration with distributed computing and data processing frameworks.
  • Support for multiple teams on the same infrastructure to maximize utilization of resources.

This page provides an overview of the AI/ML capabilities of GKE and how to get started running optimized AI/ML workloads on GKE with GPUs, TPUs, and frameworks like Hugging Face TGI, vLLM, and JetStream.
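
For a sense of what this looks like in practice, here is a minimal sketch of a Pod that requests a TPU slice through standard Kubernetes resource requests and node selectors. The accelerator type, topology, and image are illustrative placeholders; match them to your cluster's node pools.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-workload          # illustrative name
spec:
  nodeSelector:
    # Illustrative values: match the TPU type and topology of your node pool.
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x4
  containers:
  - name: trainer
    image: us-docker.pkg.dev/my-project/my-repo/trainer:latest  # placeholder image
    resources:
      limits:
        google.com/tpu: "8"   # TPU chips visible to this container
```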

Training and tutorials

Learn how to deploy LLMs using Tensor Processing Units (TPUs) on GKE with the Optimum TPU serving framework from Hugging Face.

Tutorial AI/ML Inference TPU

Learn how to create storage backed by fully managed Parallelstore instances and access it as volumes. The CSI driver is optimized for AI/ML training workloads that involve smaller file sizes and random reads.

Tutorial AI/ML Data Loading
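
As a rough sketch of the pattern the tutorial follows, a workload consumes the instance through an ordinary PersistentVolumeClaim. The claim name, StorageClass name, and size below are hypothetical; the StorageClass must map to the Parallelstore CSI driver in your cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: parallelstore-pvc              # hypothetical name
spec:
  accessModes:
  - ReadWriteMany                      # Parallelstore volumes can be shared across Pods
  storageClassName: parallelstore-rwx  # assumed StorageClass backed by the Parallelstore CSI driver
  resources:
    requests:
      storage: 12Ti
```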

Learn how to simplify and accelerate the loading of AI/ML model weights on GKE using Hyperdisk ML.

Tutorial AI/ML Data Loading
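
The core idea is to place model weights on a Hyperdisk ML volume that many serving Pods can attach read-only. A minimal sketch, assuming a `hyperdisk-ml` StorageClass exists in the cluster (the claim name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights            # hypothetical name
spec:
  accessModes:
  - ReadOnlyMany                 # many serving Pods mount the same weights read-only
  storageClassName: hyperdisk-ml # assumed StorageClass for the Hyperdisk ML disk type
  resources:
    requests:
      storage: 300Gi
```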

Learn how to serve an LLM using Tensor Processing Units (TPUs) on GKE with JetStream and PyTorch.

Tutorial AI/ML Inference TPUs

Learn best practices for optimizing LLM inference performance with GPUs on GKE using the vLLM and Text Generation Inference (TGI) serving frameworks.

Tutorial AI/ML Inference GPUs
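
For instance, a vLLM server exposes the tensor-parallel degree as a flag, so it can shard a model across however many GPUs the Pod requests. A hedged sketch, where the image tag, model, and GPU count are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server              # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest    # placeholder tag
        args:
        - --model=google/gemma-2b         # placeholder model
        - --tensor-parallel-size=2        # shard across both requested GPUs
        resources:
          limits:
            nvidia.com/gpu: "2"
```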

Learn when to use the NVIDIA GPU Operator and how to enable it on GKE.

Tutorial GPUs

Learn how to set up your autoscaling infrastructure with the GKE Horizontal Pod Autoscaler (HPA) to deploy the Gemma LLM using single-host JetStream.

Tutorial TPUs
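
The shape of that setup is a standard autoscaling/v2 HorizontalPodAutoscaler keyed on a server metric rather than CPU. The Deployment and metric names below are assumptions for illustration; the tutorial walks through exporting and choosing real JetStream metrics.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jetstream-hpa                     # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: jetstream-server                # hypothetical serving Deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: jetstream_prefill_backlog_size  # assumed metric, exported via Managed Prometheus
      target:
        type: AverageValue
        averageValue: "10"
```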

Learn how to fine-tune the Gemma LLM using GPUs on GKE with the Hugging Face Transformers library.

Tutorial AI/ML Training GPUs

Learn how to deploy and serve a Stable Diffusion model on GKE using TPUs, Ray Serve, and the Ray Operator add-on.

Tutorial AI/ML Inference Ray TPUs

Learn how to set up your autoscaling infrastructure by using the GKE Horizontal Pod Autoscaler (HPA) to deploy the Gemma LLM with the Hugging Face Text Generation Inference (TGI) serving framework.

Tutorial GPUs

Learn how to run a containerized Megatron-LM PyTorch workload on A3 Mega.

Tutorial AI/ML Training GPUs

Learn how to request hardware accelerators (GPUs) in your GKE Autopilot workloads.

Tutorial GPUs
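
On Autopilot, the Pod spec itself selects the accelerator and GKE provisions matching nodes; there are no node pools to manage. A minimal sketch, where the L4 accelerator and the image are just illustrative choices:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                      # illustrative name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4   # Autopilot provisions a matching node
  containers:
  - name: cuda-app
    image: nvidia/cuda:12.4.1-runtime-ubuntu22.04  # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```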

Learn how to serve Llama 2 70B or Falcon 40B using multiple NVIDIA L4 GPUs with GKE.

Tutorial AI/ML Inference GPUs

Learn how to easily start using Ray on GKE by running a workload on a Ray cluster.

Tutorial Ray
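
With the Ray Operator add-on enabled, a cluster is declared as a RayCluster custom resource and work is submitted to its head node. A minimal sketch, assuming the operator's ray.io/v1 API; names and image versions are placeholders:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-raycluster          # hypothetical name
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0   # placeholder version
  workerGroupSpecs:
  - groupName: workers
    replicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
```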

Learn how to serve Falcon 7b, Llama2 7b, Falcon 40b, or Llama2 70b using the Ray framework in GKE.

Tutorial AI/ML Inference Ray GPUs

Learn how to orchestrate a JAX workload on multiple TPU slices on GKE by using JobSet and Kueue.

Tutorial TPUs
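
The pattern pairs JobSet (one replicated Job per TPU slice) with Kueue for admission. A heavily hedged sketch of the JobSet side; the API version, queue name, accelerator values, and image are assumptions that depend on the JobSet release and cluster setup:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-job                   # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: ml-queue  # hands admission to a Kueue LocalQueue
spec:
  replicatedJobs:
  - name: slice                          # one Job per TPU slice
    replicas: 2
    template:
      spec:
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  # illustrative
              cloud.google.com/gke-tpu-topology: 2x4
            restartPolicy: Never
            containers:
            - name: jax-worker
              image: python:3.11         # placeholder; use an image with JAX installed
              command: ["python", "-c", "print('slice worker')"]
              resources:
                limits:
                  google.com/tpu: "8"
```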

Learn how to observe GPU workloads on GKE with NVIDIA Data Center GPU Manager (DCGM).

Tutorial AI/ML Observability GPUs
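
Once the DCGM exporter is running, scraping it with Google Cloud Managed Service for Prometheus comes down to a small PodMonitoring resource. The labels and port name below are assumptions that depend on how the exporter was deployed:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: dcgm-exporter               # hypothetical name
spec:
  selector:
    matchLabels:
      app: dcgm-exporter            # assumed label on the exporter Pods
  endpoints:
  - port: metrics                   # assumed port name; the DCGM exporter defaults to 9400
    interval: 30s
```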

This quickstart shows you how to deploy a training model with GPUs in GKE and store the predictions in Cloud Storage.

Tutorial AI/ML Training GPUs

This video shows how GKE helps solve common challenges of training large AI models at scale, and the best practices for training and serving large-scale machine learning models on GKE.

Video AI/ML Training AI/ML Inference

This blog post is a step-by-step guide to the creation, execution, and teardown of a TensorFlow-enabled Jupyter notebook.

Blog AI/ML Training AI/ML Inference GPUs

This tutorial uses Kueue to show you how to implement a Job queueing system and configure workload resource and quota sharing between different namespaces on GKE.

Tutorial AI/ML Batch
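
At its core, that setup is a ClusterQueue that owns the shared quota plus a LocalQueue per team namespace. A minimal sketch with hypothetical names and quota values:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: shared-queue                 # hypothetical name
spec:
  namespaceSelector: {}              # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: "40"
      - name: memory
        nominalQuota: 160Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue                 # one per team namespace
  namespace: team-a
spec:
  clusterQueue: shared-queue
```

A Job then opts in by setting the kueue.x-k8s.io/queue-name label to its team's LocalQueue, and Kueue holds the Job until quota is available.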

This tutorial shows you how to integrate a large language model (LLM) application based on retrieval-augmented generation with PDF files that you upload to a Cloud Storage bucket.

Tutorial AI/ML Data Loading

This tutorial shows you how to analyze big datasets on GKE by leveraging BigQuery for data storage and processing, Cloud Run for request handling, and a Gemma LLM for data analysis and predictions.

Tutorial AI/ML Data Loading

Use cases

Learn how to leverage GKE and Ray to efficiently preprocess large datasets for machine learning.

MLOps Training Ray

Learn how to speed up data loading times for your machine learning applications on Google Kubernetes Engine.

Inference Hyperdisk ML Cloud Storage FUSE

Learn how to optimize your GPU inference costs by fine-tuning GKE's Horizontal Pod Autoscaler for maximum efficiency.

Inference GPU HPA

Learn how to deploy cutting-edge NVIDIA NIM microservices on GKE with ease and accelerate your AI workloads.

AI NVIDIA NIM

Learn how Ray Operator on GKE simplifies your AI/ML production deployments, boosting performance and scalability.

AI TPU Ray

Learn how to maximize large language model (LLM) serving throughput for GPUs on GKE, including infrastructure decisions and model server optimizations.

LLM GPU NVIDIA

How to build a search engine with Google Cloud, using Vertex AI Agent Builder, Vertex AI Search, and GKE.

Search Agent Vertex AI

How LiveX AI uses GKE to build AI agents that enhance customer satisfaction and reduce costs.

GenAI NVIDIA GPU

Reference architecture for running a generative AI application with retrieval-augmented generation (RAG) using GKE, Cloud SQL, Ray, Hugging Face, and LangChain.

GenAI RAG Ray

How IPRally uses GKE and Ray to build a scalable, efficient ML platform for faster patent searches with better accuracy.

AI Ray GPU

Leverage Gemma on Cloud GPUs and Cloud TPUs for inference and training efficiency on GKE.

AI Gemma Performance

Use best-in-class Gemma open models to build portable, customizable AI applications and deploy them on GKE.

AI Gemma Performance

Orchestrate Ray applications in GKE with KubeRay and Kueue.

Kueue Ray KubeRay

Apply security insights and hardening techniques for training AI/ML workloads using Ray on GKE.

AI Ray Security

Select the best combination of storage options for AI and ML workloads on Google Cloud.

AI ML Storage

Automatically install NVIDIA GPU drivers in GKE.

GPU NVIDIA Installation

Train generative AI models using GKE and the NVIDIA NeMo framework.

GenAI NVIDIA NeMo

Improve scalability, cost-efficiency, fault tolerance, isolation, and portability by using GKE for Ray workloads.

AI Ray Scale

Gain improved GPU support and performance, plus lower pricing, for AI/ML workloads with GKE Autopilot.

GPU Autopilot Performance

Startup scales personalized video output with GKE.

GPU Scale Containers

How Ray is transforming ML development at .

ML Ray Containers

Ordaōs Bio, one of the leading AI accelerators for biomedical research and discovery, is finding solutions to novel immunotherapies in oncology and chronic inflammatory disease.

Performance TPU Cost optimization

How Moloco, a Silicon Valley startup, harnessed the power of GKE and TensorFlow Enterprise to supercharge its machine learning (ML) infrastructure.

ML Scale Cost optimization

Code Samples

View sample applications used in official GKE product tutorials.

View experimental samples for leveraging GKE to accelerate your AI/ML initiatives.
