AI/ML Workloads on GPU Clusters

NOTE

This guide focuses on NVIDIA accelerators, as Gardener provides first-class support for them, enabling a robust and conformant environment for GPU-powered applications.

Introduction

Gardener has joined the CNCF Kubernetes AI Conformance program. This certification validates that Gardener provides a standardized, reliable platform capable of handling the demanding requirements of modern AI and machine learning workloads, ensuring that your GPU-accelerated applications run on a proven and conformant infrastructure.

This guide walks you through provisioning GPU-enabled Kubernetes clusters with Gardener and configuring them for production AI/ML workloads.

Prerequisites

Before you begin, ensure you have:

  • Access to a Gardener project with appropriate permissions
  • Requested GPU quota from your cloud provider (e.g., g4dn instances on AWS, Standard_NC series on Azure, or n1-standard with GPUs on GCP)
  • kubectl and helm installed locally

IMPORTANT

Due to high demand for GPU resources, availability may be limited in certain regions. Plan your cluster location accordingly and request quota in advance.

Step 1: Provision a GPU-Enabled Cluster

To create a Gardener-managed Kubernetes cluster with GPU-enabled worker nodes, you need to configure the provider.workers section in your shoot manifest with a GPU-enabled machine type and a compatible operating system.

Shoot Configuration Example

Here's an example shoot manifest that defines a worker pool with AWS g4dn.xlarge instances running Garden Linux:

yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-cluster
  namespace: garden-my-project
spec:
  cloudProfileName: aws
  region: eu-central-1

  kubernetes:
    version: "1.33.0"

  provider:
    type: aws
    workers:
      - name: worker-gpu
        minimum: 1
        maximum: 3
        machine:
          type: g4dn.xlarge  # GPU-enabled machine type
          image:
            name: gardenlinux
            version: "1877.5.0"  # Use a supported Garden Linux version
          architecture: amd64
        zones:
          - eu-central-1a
  ...

Key configuration points:

  • Machine Type: Select a GPU-enabled machine type from your cloud provider (check the available CloudProfiles for supported options)
  • Operating System: Garden Linux is fully supported by the NVIDIA GPU Operator
  • Zones: Ensure GPU instances are available in your selected zones

After applying this manifest, Gardener will provision your cluster with GPU-enabled worker nodes.
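
Applying the manifest and checking the resulting nodes might look like this (a sketch; the file name is an assumption, and the node label value matches the worker pool name from the example above):

bash
# Create or update the shoot in your project namespace on the garden cluster
kubectl apply -f shoot.yaml

# Once the shoot is reconciled, verify the GPU worker nodes via the shoot's kubeconfig
kubectl get nodes -l worker.gardener.cloud/pool=worker-gpu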

Step 2: Install the NVIDIA GPU Operator

Once your cluster is running, the GPU hardware is present but not yet usable by Kubernetes workloads. You must install the NVIDIA GPU Operator, which automates the management of all necessary NVIDIA software components:

  • NVIDIA drivers
  • NVIDIA Container Toolkit
  • Kubernetes device plugin for GPUs
  • DCGM (Data Center GPU Manager) for monitoring
  • GPU Feature Discovery

Installation with Helm

The recommended installation method uses the NVIDIA Helm chart with Garden Linux-specific configuration:

  1. Add the NVIDIA Helm repository:

    bash
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
  2. Install the operator with Garden Linux configuration:

    bash
    helm upgrade --install --create-namespace -n gpu-operator gpu-operator nvidia/gpu-operator \
      --values https://raw.githubusercontent.com/gardenlinux/gardenlinux-nvidia-installer/refs/heads/main/helm/gpu-operator-values.yaml

Verification

After installation completes, verify that GPU resources are available to Kubernetes:

bash
# Check that GPU resources are visible
kubectl describe nodes | grep "nvidia.com/gpu"

# Verify GPU operator pods are running
kubectl get pods -n gpu-operator

# Check GPU capacity on nodes
kubectl get nodes -o jsonpath='{range .items[?(@.status.allocatable.nvidia\.com/gpu)]}{.metadata.name}: {.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

You should see nvidia.com/gpu listed in the Capacity and Allocatable resources of your GPU nodes.

Step 3: Run a Test Workload

To confirm everything is working, deploy a simple pod that requests a GPU and runs nvidia-smi:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Apply and check the logs:

bash
kubectl apply -f gpu-test-pod.yaml
kubectl logs gpu-test

You should see the nvidia-smi output showing your GPU details.

Advanced Topics

Secure Accelerator Access

Gardener clusters ensure that GPU access is properly isolated. This provides critical security guarantees:

  • No Unauthorized Access: Pods that do not explicitly request nvidia.com/gpu resources cannot see or access any GPU devices, even if scheduled on a GPU node
  • Workload Isolation: A container that requests GPUs will only see and use the specific devices allocated to it, preventing interference with other workloads

This isolation is enforced by the NVIDIA device plugin and container runtime, ensuring secure multi-tenant GPU usage.
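
You can observe this isolation with a pod that omits the GPU request (a minimal sketch reusing the CUDA image from Step 3; the pod name is illustrative). Even if it lands on a GPU node, the NVIDIA stack is not injected into the container, so nvidia-smi fails and no /dev/nvidia* devices are visible:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
    # No nvidia.com/gpu request: the runtime injects no GPU devices or driver libraries
    command: ["sh", "-c", "nvidia-smi || ls /dev/nvidia* || echo 'no GPU access'"]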

For detailed validation of these security properties, see the Secure Accelerator Access example in our AI Conformance repository.

Exposing GPU Metrics

The NVIDIA GPU Operator automatically deploys the DCGM Exporter, which exposes comprehensive GPU metrics in Prometheus format, including:

  • GPU utilization
  • Memory usage (used/free)
  • Temperature
  • Power consumption
  • Clock speeds

Integrating with Your Monitoring Stack

The DCGM Exporter service is available at nvidia-dcgm-exporter.gpu-operator.svc on port 9400 and can be scraped by any Prometheus-compatible monitoring solution.

If you're using Prometheus Operator, you can create a ServiceMonitor to automatically configure scraping:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s

For other monitoring solutions, configure them to scrape the endpoint directly:

  • Service: nvidia-dcgm-exporter.gpu-operator.svc:9400
  • Metrics path: /metrics
  • Format: Prometheus exposition format
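
To inspect the raw metrics quickly (a sketch; the metric name matches the DCGM utilization gauge used in the autoscaling example below):

bash
# Port-forward the exporter and look at one of the utilization gauges
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL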

For a complete example, see the Accelerator Metrics guide.

Autoscaling GPU Workloads with HPA

You can use GPU metrics to autoscale your AI/ML workloads with a HorizontalPodAutoscaler (HPA), allowing you to scale applications based on actual GPU utilization rather than just CPU or memory.

HPA autoscaling with custom GPU metrics requires a complete metrics pipeline. Understanding this architecture is crucial:

DCGM Exporter → Prometheus → prometheus-adapter → custom.metrics.k8s.io API → HPA

The HPA controller (part of kube-controller-manager) can only query the Kubernetes API Server for metrics. The API Server itself doesn't collect or store metrics; it delegates to registered metrics APIs:

  • metrics.k8s.io (resource metrics like CPU/memory, provided by metrics-server)
  • custom.metrics.k8s.io (custom metrics, requires an adapter)
  • external.metrics.k8s.io (external metrics, requires an adapter)

For GPU metrics to be available to HPA, you must deploy a solution that implements the custom.metrics.k8s.io API; this is a fundamental part of Kubernetes' metrics architecture.
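
You can check which of these API groups are registered in your cluster; custom.metrics.k8s.io only appears once an adapter is installed:

bash
# List the aggregated metrics APIs currently served by the API server
kubectl get apiservices | grep metrics.k8s.io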

Common alternatives to prometheus-adapter:

  • KEDA - Event-driven autoscaling with support for many metric sources
  • Custom metrics adapter - Build your own adapter that reads directly from DCGM
  • Cloud provider-specific adapters (AWS CloudWatch, Azure Monitor, GCP Monitoring)

However, prometheus-adapter is the most common and well-supported solution for custom metrics in Kubernetes.

The complete setup involves four main components:

  1. ServiceMonitor: Configure Prometheus to scrape DCGM Exporter metrics (see "Integrating with Your Monitoring Stack" above)

  2. PrometheusRule: Create recording rules to aggregate raw DCGM metrics into stable, pod-level custom metrics:

    yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: gpu-custom-metrics
      namespace: monitoring
    spec:
      groups:
      - name: gpu-metrics
        rules:
        - record: pod_gpu_utilization
          expr: label_replace(label_replace(DCGM_FI_DEV_GPU_UTIL, "pod", "$1", "exported_pod", "(.*)"), "namespace", "$1", "exported_namespace", "(.*)")
  3. Prometheus Adapter: Deploy prometheus-adapter to expose Prometheus metrics via the custom.metrics.k8s.io API:

    bash
    helm install prometheus-adapter prometheus-community/prometheus-adapter \
      --namespace monitoring \
      --set prometheus.url=http://prometheus-server.monitoring.svc \
      --set rules.custom[0].seriesQuery='pod_gpu_utilization{pod!=""}' \
      --set rules.custom[0].name.as='pod_gpu_utilization' \
      --set rules.custom[0].metricsQuery='avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
  4. HPA: Configure an HPA to use the custom GPU metric for scaling decisions:

    yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-workload-hpa
      namespace: your-namespace
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: gpu-workload
      minReplicas: 1
      maxReplicas: 2
      metrics:
      - type: Pods
        pods:
          metric:
            name: pod_gpu_utilization
          target:
            type: AverageValue
            averageValue: "10"  # Target 10% GPU utilization
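
Before relying on the HPA, it is worth verifying that the adapter actually serves the metric (a sketch; namespace and metric name follow the examples above):

bash
# Query the custom metrics API directly through the API server
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/your-namespace/pods/*/pod_gpu_utilization"

# Watch the HPA pick up the metric and adjust replicas
kubectl get hpa gpu-workload-hpa -n your-namespace --watch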

For a complete, step-by-step guide with all necessary manifests, see the Pod Autoscaling example.

Dynamic Resource Allocation (DRA)

For more flexible and fine-grained GPU management beyond simple device counts, you can enable Dynamic Resource Allocation (DRA). DRA supports advanced resource claims, flexible device filtering using the Common Expression Language (CEL), and sharing GPUs across workloads.

To enable DRA, activate the corresponding feature gates in your shoot manifest:

yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    version: "1.33.0"
    kubeAPIServer:
      featureGates:
        DynamicResourceAllocation: true
      runtimeConfig:
        resource.k8s.io/v1beta1: true
    kubeControllerManager:
      featureGates:
        DynamicResourceAllocation: true
    kubeScheduler:
      featureGates:
        DynamicResourceAllocation: true
    kubelet:
      featureGates:
        DynamicResourceAllocation: true

NOTE

In Kubernetes v1.33, DRA is available via the v1beta1 API. The v1 API will be available in Kubernetes v1.34 and later.
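
With the feature gates enabled and a DRA driver installed, workloads request GPUs through resource claims instead of counted devices. A minimal sketch, assuming the NVIDIA DRA driver is deployed and provides a gpu.nvidia.com device class:

yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com  # device class registered by the DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu  # references the pod-level claim instead of nvidia.com/gpu limits
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu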

For more details, see the DRA Support example.

AI/ML Workload Controllers

For running distributed AI/ML applications, a specialized workload controller is highly recommended. KubeRay, for example, simplifies the deployment and management of Ray applications on Kubernetes, enabling distributed training and inference.

Gardener provides a conformant environment to run such operators. The operator handles:

  • Cluster lifecycle management
  • Autoscaling of worker nodes
  • Job scheduling and execution
  • Service exposure
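
As a sketch of what such a deployment looks like (the RayCluster below is illustrative; image versions and group names are assumptions, see the linked example for a complete manifest):

yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-gpu-cluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 1
    minReplicas: 1
    maxReplicas: 3
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu
          resources:
            limits:
              nvidia.com/gpu: 1  # each Ray worker gets one GPU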

For a detailed example of installing and using KubeRay on a Gardener cluster, see the GPU Workload Controller example.

Gang Scheduling

Distributed training jobs often require all of their pods (the "gang") to be scheduled simultaneously. This "all-or-nothing" behavior prevents deadlocks and wasted resources when some pods are scheduled but others cannot be due to resource constraints.

Kubernetes does not provide gang scheduling out-of-the-box, but it can be added with schedulers like Kueue, Grove, or similar solutions.

Gang scheduling is critical for:

  • Multi-GPU training frameworks (PyTorch DDP, Horovod, DeepSpeed)
  • Distributed training systems (TensorFlow, JAX, MPI)
  • Efficient GPU utilization in multi-tenant environments
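
With Kueue, for example, gang semantics come from quota-based admission: a suspended Job is only unsuspended once capacity for all of its pods is available. A minimal sketch (assumes Kueue is installed and a LocalQueue named gpu-queue exists):

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue  # submit to the Kueue LocalQueue
spec:
  suspend: true        # Kueue unsuspends the Job only when all pods fit
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1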

For a complete example of setting up Kueue for gang scheduling on a Gardener cluster, refer to the Gang Scheduling example.

Advanced Inference with Gateway API

For production AI inference services, the Kubernetes Gateway API provides advanced traffic management capabilities beyond standard Ingress resources:

  • Weighted traffic splitting for A/B testing and canary deployments
  • Header-based routing (e.g., X-Model-Version for model version selection)
  • Gradual rollouts between model versions
  • Cross-namespace routing with ReferenceGrant for secure multi-tenant deployments

You need to install a Gateway controller (e.g., Traefik, Envoy Gateway, Istio) separately.
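
As an illustration of weighted traffic splitting (a sketch; the Gateway and Service names are assumptions):

yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway  # an existing Gateway managed by your controller
  rules:
  - backendRefs:
    - name: model-v1   # 90% of traffic stays on the current model
      port: 8080
      weight: 90
    - name: model-v2   # 10% canary for the new model version
      port: 8080
      weight: 10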

For a complete example, see the AI Inference example.

Troubleshooting

GPU Operator Installation Issues

Driver pod fails to start:

bash
# Check driver pod logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# Verify Garden Linux values file was used
helm get values gpu-operator -n gpu-operator

GPU resources not visible:

bash
# Check device plugin status
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Verify driver completed successfully first
kubectl get pods -n gpu-operator | grep nvidia-driver

Components stuck in Init state:

This is normal behavior during driver installation. Components wait for the driver to be ready before starting. Driver installation can take 5-15 minutes depending on network speed and node resources.

bash
# Monitor driver installation progress
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset -f

Workload Issues

Pod stuck in Pending state:

bash
# Check if GPU resources are available
kubectl describe node <gpu-node-name> | grep -A 5 "Allocated resources"

# Check pod events
kubectl describe pod <pod-name>

GPU not accessible in container:

Ensure your pod spec includes a GPU resource request:

yaml
resources:
  limits:
    nvidia.com/gpu: 1
