# AI/ML Workloads on GPU Clusters
> **NOTE**
> This guide focuses on NVIDIA accelerators, as Gardener provides first-class support for them, enabling a robust and conformant environment for GPU-powered applications.
## Introduction
Gardener has joined the CNCF Kubernetes AI Conformance program. This certification validates that Gardener provides a standardized, reliable platform capable of handling the demanding requirements of modern AI and machine learning workloads, ensuring that your GPU-accelerated applications run on a proven and conformant infrastructure.
This guide walks you through provisioning GPU-enabled Kubernetes clusters with Gardener and configuring them for production AI/ML workloads.
## Prerequisites
Before you begin, ensure you have:
- Access to a Gardener project with appropriate permissions
- Requested GPU quota from your cloud provider (e.g., `g4dn` instances on AWS, `Standard_NC` series on Azure, or `n1-standard` with GPUs on GCP)
- `kubectl` and `helm` installed locally
> **IMPORTANT**
> Due to high demand for GPU resources, availability may be limited in certain regions. Plan your cluster location accordingly and request quota in advance.
## Step 1: Provision a GPU-Enabled Cluster
To create a Gardener-managed Kubernetes cluster with GPU-enabled worker nodes, configure the `provider.workers` section in your shoot manifest with a GPU-enabled machine type and a compatible operating system.
### Shoot Configuration Example
Here's an example shoot manifest that defines a worker pool with AWS `g4dn.xlarge` instances running Garden Linux:
```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-cluster
  namespace: garden-my-project
spec:
  cloudProfileName: aws
  region: eu-central-1
  kubernetes:
    version: "1.33.0"
  provider:
    type: aws
    workers:
    - name: worker-gpu
      minimum: 1
      maximum: 3
      machine:
        type: g4dn.xlarge # GPU-enabled machine type
        image:
          name: gardenlinux
          version: "1877.5.0" # Use a supported Garden Linux version
        architecture: amd64
      zones:
      - eu-central-1a
  # ...
```

Key configuration points:
- **Machine Type**: Select a GPU-enabled machine type from your cloud provider (check our `cloudprofiles` for available options)
- **Operating System**: Garden Linux is fully supported by the NVIDIA GPU Operator
- **Zones**: Ensure GPU instances are available in your selected zones
After applying this manifest, Gardener will provision your cluster with GPU-enabled worker nodes.
## Step 2: Install the NVIDIA GPU Operator
Once your cluster is running, the GPU hardware is not yet usable by Kubernetes workloads. You must install the NVIDIA GPU Operator, which automates the management of all necessary NVIDIA software components:
- NVIDIA drivers
- NVIDIA Container Toolkit
- Kubernetes device plugin for GPUs
- DCGM (Data Center GPU Manager) for monitoring
- GPU Feature Discovery
### Installation with Helm
The recommended installation method uses the NVIDIA Helm chart with Garden Linux-specific configuration:
Add the NVIDIA Helm repository:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```

Then install the operator with Garden Linux configuration:

```bash
helm upgrade --install --create-namespace -n gpu-operator gpu-operator nvidia/gpu-operator \
  --values https://raw.githubusercontent.com/gardenlinux/gardenlinux-nvidia-installer/refs/heads/main/helm/gpu-operator-values.yaml
```
### Verification
After installation completes, verify that GPU resources are available to Kubernetes:
```bash
# Check that GPU resources are visible
kubectl describe nodes | grep "nvidia.com/gpu"

# Verify GPU operator pods are running
kubectl get pods -n gpu-operator

# Check GPU capacity on nodes
kubectl get nodes -o jsonpath='{range .items[?(@.status.allocatable.nvidia\.com/gpu)]}{.metadata.name}: {.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```

You should see `nvidia.com/gpu` listed in the Capacity and Allocatable resources of your GPU nodes.
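The same allocatable check can be scripted against `kubectl get nodes -o json`. A minimal Python sketch; the inlined node list is illustrative sample data, not real cluster output:

```python
import json

# Illustrative stand-in for `kubectl get nodes -o json` output (not real cluster data).
nodes_json = """
{
  "items": [
    {"metadata": {"name": "worker-gpu-a"},
     "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}},
    {"metadata": {"name": "worker-cpu-a"},
     "status": {"allocatable": {"cpu": "8"}}}
  ]
}
"""

def gpu_allocatable(nodes: dict) -> dict:
    """Map node name -> allocatable GPU count for nodes exposing nvidia.com/gpu."""
    return {
        item["metadata"]["name"]: int(item["status"]["allocatable"]["nvidia.com/gpu"])
        for item in nodes["items"]
        if "nvidia.com/gpu" in item["status"].get("allocatable", {})
    }

print(gpu_allocatable(json.loads(nodes_json)))  # {'worker-gpu-a': 1}
```

In practice you would pipe real `kubectl get nodes -o json` output into this instead of the inline sample.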
## Step 3: Run a Test Workload
To confirm everything is working, deploy a simple pod that requests a GPU and runs `nvidia-smi`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Apply and check the logs:
```bash
kubectl apply -f gpu-test-pod.yaml
kubectl logs gpu-test
```

You should see the `nvidia-smi` output showing your GPU details.
## Advanced Topics
### Secure Accelerator Access
Gardener clusters ensure that GPU access is properly isolated. This provides critical security guarantees:
- **No Unauthorized Access**: Pods that do not explicitly request `nvidia.com/gpu` resources cannot see or access any GPU devices, even if scheduled on a GPU node
- **Workload Isolation**: A container that requests GPUs will only see and use the specific devices allocated to it, preventing interference with other workloads
This isolation is enforced by the NVIDIA device plugin and container runtime, ensuring secure multi-tenant GPU usage.
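You can observe the first guarantee directly: scheduling the CUDA test image onto a GPU node without a GPU request should leave it unable to see any devices. The manifest below is a hypothetical validation aid (it assumes the `nvidia.com/gpu.present` node label set by GPU Feature Discovery):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-isolation-check
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.present: "true" # label applied by GPU Feature Discovery
  containers:
  - name: no-gpu-container
    image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
    # Expected to fail: without an nvidia.com/gpu request,
    # no devices (and no driver mounts) are injected into the container.
    command: ["nvidia-smi"]
```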
For detailed validation of these security properties, see the Secure Accelerator Access example in our AI Conformance repository.
### Exposing GPU Metrics
The NVIDIA GPU Operator automatically deploys the DCGM Exporter, which exposes comprehensive GPU metrics in Prometheus format, including:
- GPU utilization
- Memory usage (used/free)
- Temperature
- Power consumption
- Clock speeds
#### Integrating with Your Monitoring Stack
The DCGM Exporter service is available at `nvidia-dcgm-exporter.gpu-operator.svc` on port `9400` and can be scraped by any Prometheus-compatible monitoring solution.
If you're using Prometheus Operator, you can create a ServiceMonitor to automatically configure scraping:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
```

For other monitoring solutions, configure them to scrape the endpoint directly:
- Service: `nvidia-dcgm-exporter.gpu-operator.svc:9400`
- Metrics path: `/metrics`
- Format: Prometheus exposition format
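To sanity-check a scrape by hand, the exposition format is simple to parse. A minimal Python sketch; the payload below is an illustrative example of DCGM Exporter output, and the parser deliberately ignores escaping edge cases (commas or braces inside label values):

```python
# Illustrative DCGM Exporter payload in Prometheus exposition format (not real scrape data).
sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc",pod="gpu-test"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-def",pod=""} 0
"""

def parse_gauge(text: str, metric: str) -> list[tuple[dict, float]]:
    """Return (labels, value) pairs for one gauge metric from exposition-format text."""
    samples = []
    for line in text.splitlines():
        if not line.startswith(metric + "{"):
            continue  # skip comments and other metrics
        labels_part, value = line[len(metric) + 1:].rsplit("} ", 1)
        labels = {k: v.strip('"')
                  for k, v in (kv.split("=", 1) for kv in labels_part.split(","))}
        samples.append((labels, float(value)))
    return samples

for labels, value in parse_gauge(sample, "DCGM_FI_DEV_GPU_UTIL"):
    print(labels["gpu"], value)
```

A production setup would use a Prometheus client library instead; this only illustrates what a scrape target returns.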
For a complete example, see the Accelerator Metrics guide.
### Autoscaling GPU Workloads with HPA
You can use GPU metrics to autoscale your AI/ML workloads with a HorizontalPodAutoscaler (HPA), allowing you to scale applications based on actual GPU utilization rather than just CPU or memory.
HPA autoscaling with custom GPU metrics requires a complete metrics pipeline. Understanding this architecture is crucial:
```
DCGM Exporter → Prometheus → prometheus-adapter → custom.metrics.k8s.io API → HPA
```

The HPA controller (part of kube-controller-manager) can only query the Kubernetes API server for metrics. The API server itself doesn't collect or store metrics; it delegates to registered metrics APIs:
- `metrics.k8s.io` (resource metrics like CPU/memory, provided by metrics-server)
- `custom.metrics.k8s.io` (custom metrics, requires an adapter)
- `external.metrics.k8s.io` (external metrics, requires an adapter)
For GPU metrics to be available to HPA, you must deploy a solution that implements the `custom.metrics.k8s.io` API - it's a fundamental part of Kubernetes' metrics architecture.
Common alternatives to prometheus-adapter:
- KEDA - Event-driven autoscaling with support for many metric sources
- Custom metrics adapter - Build your own adapter that reads directly from DCGM
- Cloud provider-specific adapters (AWS CloudWatch, Azure Monitor, GCP Monitoring)
However, prometheus-adapter is the most common and well-supported solution for custom metrics in Kubernetes.
The complete setup involves four main components:

1. **ServiceMonitor**: Configure Prometheus to scrape DCGM Exporter metrics (see "Integrating with Your Monitoring Stack" above)

2. **PrometheusRule**: Create recording rules to aggregate raw DCGM metrics into stable, pod-level custom metrics:

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: gpu-custom-metrics
     namespace: monitoring
   spec:
     groups:
     - name: gpu-metrics
       rules:
       - record: pod_gpu_utilization
         expr: label_replace(label_replace(DCGM_FI_DEV_GPU_UTIL, "pod", "$1", "exported_pod", "(.*)"), "namespace", "$1", "exported_namespace", "(.*)")
   ```

3. **Prometheus Adapter**: Deploy prometheus-adapter to expose Prometheus metrics via the `custom.metrics.k8s.io` API:

   ```bash
   helm install prometheus-adapter prometheus-community/prometheus-adapter \
     --namespace monitoring \
     --set prometheus.url=http://prometheus-server.monitoring.svc \
     --set rules.custom[0].seriesQuery='pod_gpu_utilization{pod!=""}' \
     --set rules.custom[0].name.as='pod_gpu_utilization' \
     --set rules.custom[0].metricsQuery='avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
   ```

4. **HPA**: Configure an HPA to use the custom GPU metric for scaling decisions:

   ```yaml
   apiVersion: autoscaling/v2
   kind: HorizontalPodAutoscaler
   metadata:
     name: gpu-workload-hpa
     namespace: your-namespace
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: Deployment
       name: gpu-workload
     minReplicas: 1
     maxReplicas: 2
     metrics:
     - type: Pods
       pods:
         metric:
           name: pod_gpu_utilization
         target:
           type: AverageValue
           averageValue: "10" # Target 10% GPU utilization
   ```
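For intuition, the HPA controller's scaling decision for an `AverageValue` target follows the standard formula `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)`, clamped to the replica bounds. A quick sketch with illustrative numbers matching the manifest above:

```python
import math

def desired_replicas(current_replicas: int, current_avg: float, target_avg: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Standard HPA formula: ceil(currentReplicas * currentMetricValue / targetValue),
    clamped to the configured minReplicas/maxReplicas bounds."""
    desired = math.ceil(current_replicas * current_avg / target_avg)
    return max(min_replicas, min(max_replicas, desired))

# With a target of 10% average GPU utilization and bounds 1..2 (as in the manifest above):
print(desired_replicas(1, 35.0, 10.0, 1, 2))  # scale up, capped at maxReplicas -> 2
print(desired_replicas(2, 4.0, 10.0, 1, 2))   # scale down -> 1
```

The real controller additionally applies tolerances and stabilization windows, which this sketch omits.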
For a complete, step-by-step guide with all necessary manifests, see the Pod Autoscaling example.
### Dynamic Resource Allocation (DRA)
For more flexible and fine-grained GPU management beyond simple device counts, you can enable Dynamic Resource Allocation (DRA). DRA enables advanced resource claims, flexible device filtering using the Common Expression Language (CEL), and sharing GPUs across workloads.
To enable DRA, activate the corresponding feature gates in your shoot manifest:
```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    version: "1.33.0"
    kubeAPIServer:
      featureGates:
        DynamicResourceAllocation: true
      runtimeConfig:
        resource.k8s.io/v1beta1: true
    kubeControllerManager:
      featureGates:
        DynamicResourceAllocation: true
    kubeScheduler:
      featureGates:
        DynamicResourceAllocation: true
    kubelet:
      featureGates:
        DynamicResourceAllocation: true
```

> **NOTE**
> In Kubernetes v1.33, DRA is available via the `v1beta1` API. The `v1` API will be available in Kubernetes v1.34 and later.
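Once DRA is enabled and a DRA-capable driver is installed, workloads request GPUs through resource claims rather than counted device-plugin resources. A hedged sketch against the `v1beta1` API; the device class name `gpu.nvidia.com` is an assumption that depends on the installed DRA driver:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com # assumed; provided by your DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu # reference the claim below instead of nvidia.com/gpu limits
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```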
For more details, see the DRA Support example.
### AI/ML Workload Controllers
For running distributed AI/ML applications, using a specialized workload controller is highly recommended. KubeRay, for example, simplifies the deployment and management of your AI/ML workload on Kubernetes, enabling distributed training and inference workloads.
Gardener provides a conformant environment to run such operators. The operator handles:
- Cluster lifecycle management
- Autoscaling of worker nodes
- Job scheduling and execution
- Service exposure
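As an illustration, a minimal `RayCluster` with one GPU worker group might look as follows. This is a sketch: field shapes follow the KubeRay CRD, and the image tags and replica counts are assumptions to adapt to your setup:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0 # assumed tag; pick a supported Ray version
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 1
    minReplicas: 1
    maxReplicas: 3
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu # assumed GPU-enabled image tag
          resources:
            limits:
              nvidia.com/gpu: 1 # each Ray worker gets one GPU
```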
For a detailed example of installing and using KubeRay on a Gardener cluster, see the GPU Workload Controller example.
### Gang Scheduling
Distributed training jobs often require all of their pods (the "gang") to be scheduled simultaneously. This "all-or-nothing" behavior prevents deadlocks and wasted resources when some pods are scheduled but others cannot be due to resource constraints.
Kubernetes does not provide gang scheduling out of the box, but it can be added with schedulers like Kueue, Grove, or similar solutions.
Gang scheduling is critical for:
- Multi-GPU training frameworks (PyTorch DDP, Horovod, DeepSpeed)
- Distributed training systems (TensorFlow, JAX, MPI)
- Efficient GPU utilization in multi-tenant environments
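With Kueue installed and a ResourceFlavor/ClusterQueue/LocalQueue configured (not shown here), pointing a suspended Job at a queue gives the all-or-nothing admission described above. A minimal sketch; the queue name `team-queue` is hypothetical:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  labels:
    kueue.x-k8s.io/queue-name: team-queue # hypothetical LocalQueue name
spec:
  suspend: true # Kueue unsuspends the Job only once the whole gang can be admitted
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
        command: ["nvidia-smi"] # placeholder for your training command
        resources:
          limits:
            nvidia.com/gpu: 1
```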
For a complete example of setting up Kueue for gang scheduling on a Gardener cluster, refer to the Gang Scheduling example.
### Advanced Inference with Gateway API
For production AI inference services, the Kubernetes Gateway API provides advanced traffic management capabilities beyond standard Ingress resources:
- Weighted traffic splitting for A/B testing and canary deployments
- Header-based routing (e.g., `X-Model-Version` for model version selection)
- Gradual rollouts between model versions
- Cross-namespace routing with ReferenceGrant for secure multi-tenant deployments
You need to install a Gateway controller (e.g., Traefik, Envoy Gateway, Istio) separately.
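For example, with a Gateway controller installed, a weighted `HTTPRoute` can shift traffic gradually between two model versions. The service names, gateway name, port, and weights below are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-inference
spec:
  parentRefs:
  - name: inference-gateway # the Gateway managed by your controller (assumed name)
  rules:
  - backendRefs:
    - name: model-v1
      port: 8080
      weight: 90 # 90% of traffic stays on the stable version
    - name: model-v2
      port: 8080
      weight: 10 # 10% canary traffic to the new version
```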
For a complete example, see the AI Inference example.
## Troubleshooting
### GPU Operator Installation Issues
**Driver pod fails to start:**

```bash
# Check driver pod logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# Verify the Garden Linux values file was used
helm get values gpu-operator -n gpu-operator
```

**GPU resources not visible:**
```bash
# Check device plugin status
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Verify the driver completed successfully first
kubectl get pods -n gpu-operator | grep nvidia-driver
```

**Components stuck in Init state:**
This is normal behavior during driver installation. Components wait for the driver to be ready before starting. Driver installation can take 5-15 minutes depending on network speed and node resources.
```bash
# Monitor driver installation progress
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset -f
```

### Workload Issues
**Pod stuck in Pending state:**

```bash
# Check if GPU resources are available
kubectl describe node <gpu-node-name> | grep -A 5 "Allocated resources"

# Check pod events
kubectl describe pod <pod-name>
```

**GPU not accessible in container:**
Ensure your pod spec includes a GPU resource request:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```

## Related Links
### Gardener AI Conformance
### Examples and Guides
- Secure Accelerator Access
- Exposing GPU Metrics
- Cluster Autoscaling
- Pod Autoscaling with GPU Metrics
- Dynamic Resource Allocation (DRA)
- GPU Workload Controller (KubeRay)
- Gang Scheduling (Kueue)
- AI Inference with Gateway API