AI/ML orchestration on GKE documentation
Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. With GKE, you can implement a robust, production-ready AI/ML platform with all the benefits of managed Kubernetes and these capabilities:
- Infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale.
- Flexible integration with distributed computing and data processing frameworks.
- Support for multiple teams on the same infrastructure to maximize resource utilization.
The resources below cover three areas: serving open models using GKE Gen AI capabilities, orchestrating TPUs and GPUs at large scale, and cost optimization and job orchestration.
Serve open source models using TPUs on GKE with Optimum TPU
Learn how to deploy LLMs using Tensor Processing Units (TPUs) on GKE with the Optimum TPU serving framework from Hugging Face.
Create and use a volume backed by a Parallelstore instance in GKE
Learn how to create storage backed by fully managed Parallelstore instances, and access them as volumes. The CSI driver is optimized for AI/ML training workloads involving smaller file sizes and random reads.
Accelerate AI/ML data loading with Hyperdisk ML
Learn how to simplify and accelerate the loading of AI/ML model weights on GKE using Hyperdisk ML.
Serve an LLM using TPUs on GKE with JetStream and PyTorch
Learn how to serve an LLM using Tensor Processing Units (TPUs) on GKE with JetStream through PyTorch.
Best practices for optimizing LLM inference with GPUs on GKE
Learn best practices for optimizing LLM inference performance with GPUs on GKE using the vLLM and Text Generation Inference (TGI) serving frameworks.
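The linked guide covers framework-level tuning in depth; as a minimal, illustrative sketch of the vLLM side (the model name, parallelism degree, and sampling settings are assumptions, not values from the guide):

```python
# Minimal sketch of offline batch inference with vLLM on a GPU node.
# The model ID and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

prompts = ["Explain Kubernetes in one sentence."]
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM loads the model onto the available GPU(s); tensor_parallel_size controls
# how many GPUs a single model replica is sharded across.
llm = LLM(model="google/gemma-2b", tensor_parallel_size=1)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```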
Manage the GPU Stack with the NVIDIA GPU Operator on GKE
Learn when to use the NVIDIA GPU Operator and how to enable it on GKE.
Configure autoscaling for LLM workloads on TPUs
Learn how to set up your autoscaling infrastructure by using the GKE Horizontal Pod Autoscaler (HPA) to deploy the Gemma LLM using single-host JetStream.
Fine-tune Gemma open models using multiple GPUs on GKE
Learn how to fine-tune Gemma LLM using GPUs on GKE with the Hugging Face Transformers library.
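As a rough sketch of what a Transformers-based fine-tuning job looks like (the dataset, hyperparameters, and output path here are illustrative assumptions, not the tutorial's values, and the Gemma checkpoint requires accepting its license on Hugging Face):

```python
# Minimal sketch of fine-tuning a causal LM with the Hugging Face Transformers Trainer.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "google/gemma-2b"  # assumed model ID for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Assumed instruction dataset; only a small slice, to keep the sketch cheap to run.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:1%]")
tokenized = dataset.map(
    lambda x: tokenizer(x["instruction"], truncation=True, max_length=512),
    remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="/tmp/gemma-finetune",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```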
Deploy a Ray Serve application with a Stable Diffusion model on GKE with TPUs
Learn how to deploy and serve a Stable Diffusion model on GKE using TPUs, Ray Serve, and the Ray Operator add-on.
Configure autoscaling for LLM workloads on GPUs with GKE
Learn how to set up your autoscaling infrastructure by using the GKE Horizontal Pod Autoscaler (HPA) to deploy the Gemma LLM with the Hugging Face Text Generation Inference (TGI) serving framework.
Train Llama2 with Megatron-LM on A3 Mega virtual machines
Learn how to run a container-based Megatron-LM PyTorch workload on A3 Mega virtual machines.
Deploy GPU workloads in Autopilot
Learn how to request hardware accelerators (GPUs) in your GKE Autopilot workloads.
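A minimal sketch of the same idea using the official Kubernetes Python client, assuming your kubectl context points at the Autopilot cluster; the accelerator type, image, and Pod name are illustrative assumptions:

```python
# Minimal sketch: request one NVIDIA GPU for a Pod on a GKE Autopilot cluster.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl credentials for the Autopilot cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Autopilot provisions a GPU node that matches this selector (assumed L4 here).
        node_selector={"cloud.google.com/gke-accelerator": "nvidia-l4"},
        containers=[client.V1Container(
            name="cuda",
            image="nvidia/cuda:12.2.0-base-ubuntu22.04",
            command=["nvidia-smi"],
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```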
Serve an LLM with multiple GPUs in GKE
Learn how to serve Llama 2 70B or Falcon 40B using multiple NVIDIA L4 GPUs with GKE.
Getting started with Ray on GKE
Learn how to easily start using Ray on GKE by running a workload on a Ray cluster.
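The smallest possible Ray workload looks like the sketch below; on GKE you would typically point `ray.init()` at a RayCluster created by the Ray Operator add-on (the remote address shown is an assumption):

```python
# Minimal sketch of a Ray workload: fan tasks out across the cluster and gather results.
import ray

ray.init()  # or ray.init(address="ray://<head-service>:10001") for a remote RayCluster

@ray.remote
def square(x: int) -> int:
    return x * x

print(ray.get([square.remote(i) for i in range(8)]))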
Serve an LLM on L4 GPUs with Ray
Learn how to serve Falcon 7b, Llama2 7b, Falcon 40b, or Llama2 70b using the Ray framework in GKE.
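A hedged sketch of the Ray Serve pattern the tutorial builds on, with one GPU reserved per replica; the model ID and generation settings are illustrative assumptions rather than the tutorial's exact configuration:

```python
# Minimal sketch of a Ray Serve deployment that serves text generation on a GPU.
from ray import serve
from transformers import pipeline

@serve.deployment(ray_actor_options={"num_gpus": 1})
class TextGenerator:
    def __init__(self):
        # Assumed model for illustration; larger models need more GPUs/sharding.
        self.pipe = pipeline("text-generation", model="tiiuae/falcon-7b", device=0)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.pipe(prompt, max_new_tokens=64)[0]["generated_text"]

# Deploys the application on the running Ray cluster and exposes an HTTP endpoint.
serve.run(TextGenerator.bind())
```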
Orchestrate TPU Multislice workloads using JobSet and Kueue
Learn how to orchestrate a JAX workload on multiple TPU slices on GKE by using JobSet and Kueue.
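A common first step in such a workload is verifying that every process sees its TPU devices; a minimal sketch, assuming the TPU runtime and JobSet-provided environment supply the cluster topology that `jax.distributed.initialize()` relies on:

```python
# Minimal sketch: initialize the JAX distributed runtime and report visible devices.
import jax

jax.distributed.initialize()  # topology discovery is assumed to come from the environment
print(f"process {jax.process_index()}/{jax.process_count()}: "
      f"{jax.local_device_count()} local devices, {jax.device_count()} total")
```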
Monitoring GPU workloads on GKE with NVIDIA Data Center GPU Manager (DCGM)
Learn how to observe GPU workloads on GKE with NVIDIA Data Center GPU Manager (DCGM).
Quickstart: Train a model with GPUs on GKE Standard clusters
This quickstart shows you how to deploy a training model with GPUs in GKE and store the predictions in Cloud Storage.
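As a loose stand-in for that pattern (train, predict, write results to Cloud Storage), the sketch below uses scikit-learn and an assumed bucket name purely for illustration; the quickstart's own model, image, and bucket differ:

```python
# Minimal sketch: train a small model, then upload its predictions to Cloud Storage.
from google.cloud import storage
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X[:-100], y[:-100])
predictions = model.predict(X[-100:])

# The Pod needs Cloud Storage access, for example via Workload Identity Federation for GKE.
bucket = storage.Client().bucket("my-predictions-bucket")  # assumed bucket name
bucket.blob("predictions.csv").upload_from_string(
    "\n".join(str(p) for p in predictions))
```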
Running large-scale machine learning on GKE
This video shows how GKE helps solve common challenges of training large AI models at scale, and the best practices for training and serving large-scale machine learning models on GKE.
TensorFlow on GKE Autopilot with GPU acceleration
This blog post is a step-by-step guide to the creation, execution, and teardown of a TensorFlow-enabled Jupyter notebook.
Implement a Job queuing system with quota sharing between namespaces on GKE
This tutorial uses Kueue to show you how to implement a Job queueing system, and configure workload resource and quota sharing between different namespaces on GKE.
Build a RAG chatbot with GKE and Cloud Storage
This tutorial shows you how to integrate a Large Language Model application based on retrieval-augmented generation with PDF files that you upload to a Cloud Storage bucket.
Analyze data on GKE using BigQuery, Cloud Run, and Gemma
This tutorial shows you how to analyze big datasets on GKE by leveraging BigQuery for data storage and processing, Cloud Run for request handling, and a Gemma LLM for data analysis and predictions.
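The BigQuery piece of that pattern is a standard client query whose results a downstream Gemma-based service could summarize; a minimal sketch using an assumed public dataset and query, not the tutorial's:

```python
# Minimal sketch: run a BigQuery query and iterate over the result rows.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row["name"], row["total"])
```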
Distributed data preprocessing with GKE and Ray: Scaling for the enterprise
Learn how to leverage GKE and Ray to efficiently preprocess large datasets for machine learning.
Data loading best practices for AI/ML inference on GKE
Learn how to speed up data loading times for your machine learning applications on Google Kubernetes Engine.
Save on GPUs: Smarter autoscaling for your GKE inferencing workloads
Learn how to optimize your GPU inference costs by fine-tuning GKE's Horizontal Pod Autoscaler for maximum efficiency.
Efficiently serve optimized AI models with NVIDIA NIM microservices on GKE
Learn how to deploy cutting-edge NVIDIA NIM microservices on GKE with ease and accelerate your AI workloads.
Accelerate Ray in production with new Ray Operator on GKE
Learn how Ray Operator on GKE simplifies your AI/ML production deployments, boosting performance and scalability.
Maximize your LLM serving throughput for GPUs on GKE — a practical guide
Learn how to maximize large language model (LLM) serving throughput for GPUs on GKE, including infrastructure decisions and model server optimizations.
Search engines made simple: A low-code approach with GKE and Vertex AI Agent Builder
How to build a search engine with Google Cloud, using Vertex AI Agent Builder, Vertex AI Search, and GKE.
LiveX AI reduces customer support costs with AI agents trained and served on GKE and NVIDIA AI
How LiveX AI uses GKE to build AI agents that enhance customer satisfaction and reduce costs.
Infrastructure for a RAG-capable generative AI application using GKE
Reference architecture for running a generative AI application with retrieval-augmented generation (RAG) using GKE, Cloud SQL, Ray, Hugging Face, and LangChain.
Innovating in patent search: How IPRally leverages AI with GKE and Ray
How IPRally uses GKE and Ray to build a scalable, efficient ML platform for faster patent searches with better accuracy.
Performance deep dive of Gemma on Google Cloud
Leverage Gemma on Cloud GPUs and Cloud TPUs for inference and training efficiency on GKE.
Gemma on GKE deep dive: New innovations to serve open generative AI models
Use best-in-class Gemma open models to build portable, customizable AI applications and deploy them on GKE.
Advanced scheduling for AI/ML with Ray and Kueue
Orchestrate Ray applications in GKE with KubeRay and Kueue.
How to secure Ray on Google Kubernetes Engine
Apply security insights and hardening techniques for training AI/ML workloads using Ray on GKE.
Design storage for AI and ML workloads in Google Cloud
Select the best combination of storage options for AI and ML workloads on Google Cloud.
Automatic driver installation simplifies using NVIDIA GPUs in GKE
Automatically install NVIDIA GPU drivers in GKE.
Accelerate your generative AI journey with NVIDIA NeMo framework on GKE
Train generative AI models using GKE and the NVIDIA NeMo framework.
Why GKE for your Ray AI workloads?
Improve scalability, cost-efficiency, fault tolerance, isolation, and portability by using GKE for Ray workloads.
Running AI on fully managed GKE, now with new compute options, pricing and resource reservations
Gain improved GPU support, performance, and lower pricing for AI/ML workloads with GKE Autopilot.
How SEEN scaled output 89x and reduced GPU costs by 66% using GKE
Startup scales personalized video output with GKE.
How is unleashing ML Innovation with Ray and GKE
How Ray is transforming ML development at .
How Ordaōs Bio takes advantage of generative AI on GKE
Ordaōs Bio, one of the leading AI accelerators for biomedical research and discovery, is finding solutions to novel immunotherapies in oncology and chronic inflammatory disease.
GKE from a growing startup powered by ML
How Moloco, a Silicon Valley startup, harnessed the power of GKE and TensorFlow Enterprise to supercharge its machine learning (ML) infrastructure.
Google Kubernetes Engine (GKE) Samples
View sample applications used in official GKE product tutorials.
GKE AI Labs Samples
View experimental samples for leveraging GKE to accelerate your AI/ML initiatives.