Tech Blog by vCluster

Introducing and a Deep Dive Into AICR with vCluster

Jun 12, 2026

GPU clusters are powerful, but getting the software stack right is rarely simple. Drivers, CUDA, NCCL, kernels, Kubernetes versions, GPU operators, and container runtimes all need to line up. When they do not, training jobs fail, debugging gets messy, and teams lose time chasing compatibility issues.

AI Cluster Runtime, NVIDIA’s open source runtime for AI clusters, takes a more practical approach. Instead of relying on a static compatibility matrix, AICR helps generate and validate the right configuration from the cluster you actually have.

In this guide, we introduce AI Cluster Runtime, explain the core pieces it is built from, and then walk through a hands-on deep dive using AICR with vCluster on a private A100 node. You will see how to resolve recipes, validate conformance against live hardware, and understand how AICR can simplify GPU cluster setup.

The matrix problem

A GPU platform is a stack of versioned layers that all have opinions about each other: the serving engine, the GPU driver and CUDA toolkit, NCCL, the GPU Operator and Network Operator, the OS kernel, the Kubernetes release, and the container runtime. Multiply the valid versions of each by the hardware you support and you land on hundreds of combinations. Each one is a row that has to be correct.

Eight versioned layers multiplied by three hardware targets produce hundreds of combinations
Figure 1. Every layer carries versions, and every combination is a row you would otherwise verify by hand.
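
To make the multiplication concrete, here is a minimal Python sketch. The layer names follow the list above, but the per-layer version counts and the three hardware targets are invented for illustration. They are not taken from any real support matrix.

```python
from math import prod

# Hypothetical number of supported versions per layer (illustrative only).
SUPPORTED_VERSIONS = {
    "serving engine": 2,
    "GPU driver": 3,
    "CUDA toolkit": 2,
    "NCCL": 2,
    "GPU Operator / Network Operator": 2,
    "OS kernel": 1,
    "Kubernetes release": 2,
    "container runtime": 1,
}
HARDWARE_TARGETS = 3  # hypothetical, e.g. three GPU node types


def matrix_rows(versions_per_layer, hardware_targets):
    """Rows in a full matrix: product of all version counts, times hardware."""
    return prod(versions_per_layer.values()) * hardware_targets


print(matrix_rows(SUPPORTED_VERSIONS, HARDWARE_TARGETS))  # 288
```

Even with only one to three versions per layer, the count lands in the hundreds, and adding a single extra driver version adds another 96 rows.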

The matrix is not wrong because people are careless. It is wrong because it is static and the cluster is not. Drivers get patched, a new node type arrives, an operator bumps a minor version, and the document quietly drifts from reality.

What AICR is

AICR is a runtime and a CLI (aicr, talking to an aicrd API server, written in Go) that treats cluster configuration the way a package manager treats dependencies. You describe what you want, it resolves the versions that actually fit, and it can prove the result against the live cluster.

What it is not:

  • Not a Kubernetes distribution. It runs on the cluster you already have.
  • Not a static matrix. It reads state and resolves per cluster, every time.
  • Not a black box. Every step emits a file you can read, diff, and replay.

A useful mental model is a package manager. The snapshot is your lockfile, a record of exactly what is present. The recipe is the resolved dependency graph. The bundle is the install script that applies it.

AICR reads live cluster state, resolves a recipe from overlays, deploys validator Jobs, and emits deployers
Figure 2. How AICR sees the cluster, resolves a recipe from a catalog of overlays, and proves the result in cluster.

The four parts

AICR is built from four moves that chain into one loop.

Snapshot to recipe to validate to bundle, each stage emitting an artifact
Figure 3. Four stages, each producing an artifact you can read and replay.

Snapshot

aicr snapshot reads the live cluster (nodes, GPU models, driver and toolkit versions, installed operators) and writes what it finds to snapshot.yaml. This is the ground truth the rest of the loop resolves against, not an assumption about what the cluster should contain.

Recipe

aicr recipe resolves a recipe from layered overlays: a base, then cloud, accelerator, OS, and workload layers stacked on top. Later layers win, so an H100 overlay can override a default that the base set. Matching is asymmetric: a specific query for H100 on AWS matches the H100 and AWS overlays plus any wildcard layers, while a query for any never pulls hardware-specific recipes.

Five overlays compose into one resolved recipe; matching is asymmetric
Figure 4. Overlays compose into a single resolved recipe. Specific queries match wildcards, but wildcards never pull specific recipes.
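
The two rules above, "later layers win" and asymmetric matching, can be sketched in a few lines of Python. This is only an illustration of the idea, not AICR's implementation (AICR itself is written in Go), and the overlay catalog, field names, and values below are hypothetical.

```python
from copy import deepcopy

WILDCARD = "any"


def matches(overlay_criteria, query):
    """An overlay applies if each of its fields is a wildcard or equals the query.

    This is what makes matching asymmetric: a query for "h100" picks up
    wildcard overlays, but a query for "any" never picks up an overlay
    that requires "h100".
    """
    return all(
        value == WILDCARD or query.get(field, WILDCARD) == value
        for field, value in overlay_criteria.items()
    )


def deep_merge(base, override):
    """Recursively merge two dicts; values from `override` win on conflict."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(overlays, query):
    """Merge every matching overlay in order, so later layers override earlier ones."""
    recipe = {}
    for overlay in overlays:
        if matches(overlay["criteria"], query):
            recipe = deep_merge(recipe, overlay["values"])
    return recipe


# Hypothetical overlay catalog; names and values are illustrative only.
OVERLAYS = [
    {"criteria": {"accelerator": "any", "service": "any"},
     "values": {"gpu-operator": {"driver": {"enabled": True}, "mig": "off"}}},
    {"criteria": {"service": "aws"},
     "values": {"network": {"efa": True}}},
    {"criteria": {"accelerator": "h100"},
     "values": {"gpu-operator": {"mig": "mixed"}}},
]

specific = resolve(OVERLAYS, {"accelerator": "h100", "service": "aws"})
generic = resolve(OVERLAYS, {"accelerator": "any", "service": "any"})
print(specific)
print(generic)
```

The specific query ends up with the base values, the AWS layer, and the H100 override of mig. The generic query gets the base layer only.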

NVIDIA ships tuned recipes for hardware like H100 and GB200. On an A100, as in the demo below, the resolver falls back to the any overlay, which is the honest behavior: you get a working baseline rather than a tuned profile written for different silicon.

Validate

aicr validate is where AICR earns trust. It does not lint YAML. It deploys real workloads into the cluster as Kubernetes Jobs and checks them in phases.

Readiness gates the run, then deployment, performance, and conformance phases
Figure 5. A readiness pre-flight gates everything. Performance runs a real NCCL all-reduce.

The readiness pre-flight runs first and gates everything: if it fails, no validator Jobs are deployed at all. Then deployment checks confirm that components land and reconcile, performance can run an NCCL all-reduce as a Kubeflow TrainJob and measure aggregate bus bandwidth, and conformance checks that the cluster behaves to the API spec. The output is a pass-or-fail report per check, so you can see exactly which checks ran and which parameters each one covered.
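
The gating behavior is the important part, so here is a small Python sketch of it. The function and check names are hypothetical and the validators are plain callables. In AICR they are Kubernetes Jobs, which this sketch does not model.

```python
def run_validation(readiness_checks, phases):
    """Run readiness first; only if every check passes, run the phase validators.

    `readiness_checks` maps a check name to a zero-argument callable returning bool.
    `phases` maps a phase name to a dict of validator name -> callable returning bool.
    """
    report = {"readiness": {name: check() for name, check in readiness_checks.items()}}
    if not all(report["readiness"].values()):
        # Gate closed: no validator is ever invoked.
        report["skipped"] = list(phases)
        return report
    for phase, validators in phases.items():
        report[phase] = {name: validator() for name, validator in validators.items()}
    return report


# Hypothetical checks, standing in for real cluster probes.
readiness = {"kubernetes-version": lambda: True}
phases = {
    "deployment": {"components-reconciled": lambda: True},
    "conformance": {"dra-support": lambda: True},
}
print(run_validation(readiness, phases))
```

If any readiness check returns False, the report contains only the readiness results and the list of skipped phases.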

Bundle

aicr bundle turns the resolved recipe into something you can apply. It emits deployers for your tool of choice: Helm by default, plus Argo CD, Argo CD with Helm, Flux, and Helmfile. Bundlers run in parallel, so wall-clock time is bounded by the slowest one, not the sum.
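
To see why parallel bundlers are bounded by the slowest one, here is a toy Python sketch. The bundler names echo the deployers above, but the timings are made up and each "bundler" only sleeps. It says nothing about how AICR schedules its own work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-deployer generation times in seconds (illustrative only).
BUNDLER_SECONDS = {"helm": 0.1, "argocd": 0.2, "flux": 0.3}


def run_bundler(name):
    """Stand-in for generating one deployer's output; it only sleeps."""
    time.sleep(BUNDLER_SECONDS[name])
    return name


def run_all_in_parallel(names):
    """Run every bundler concurrently and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        results = list(pool.map(run_bundler, names))
    return results, time.perf_counter() - start


results, elapsed = run_all_in_parallel(list(BUNDLER_SECONDS))
print(f"slowest={max(BUNDLER_SECONDS.values()):.1f}s "
      f"sum={sum(BUNDLER_SECONDS.values()):.1f}s elapsed={elapsed:.2f}s")
```

The elapsed time should sit close to the slowest bundler (0.3s here) rather than the 0.6s sum.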

The goal is to reach a state where you get the most out of your GPUs.

The Demo: A100, Recipe and Conformance with vind

In this part of the article, we will put AICR to work on a real GPU. We will stand up a small cluster on a laptop, join a remote A100 to it, resolve a recipe for that hardware, deploy it, and then prove that it works. It may sound like a lot, but each step is short, and AICR does most of the heavy lifting for you.

Installing AICR

The first step is to install AICR. You can do this with Homebrew, or with the install script. Either one works, so pick whichever you prefer:

# install with Homebrew
$ brew tap NVIDIA/aicr && brew install aicr
 
# or with the install script
$ curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash

Setting Up the Control Plane with vind

Once AICR is installed, the next step is to create a cluster to work in. For this we use vind, which stands for vCluster in Docker. vind runs a real Kubernetes control plane inside a container on your laptop, the same way that kind does. The difference is that vind can also join remote nodes over a VPN, and that is what lets us attach the A100. Because the GPU node joins over an encrypted tunnel, the GPU box never needs a public address. You can reference the vCluster.yaml used here.

The multi-node-cluster.yaml file is in the repo and reproduced here:

experimental:
  docker:
    nodes:
      - name: worker-1
      - name: worker-2
      - name: worker-3

controlPlane:
  distro:
    k8s:
      version: "v1.34.0"

privateNodes:
  enabled: true
  vpn:
    enabled: true
    nodeToNode:
      enabled: true

Create the control plane in Docker with the following command: 

# stand up a tenant control plane in Docker
$ sudo vcluster create aicr -f multi-node-cluster.yaml

Figure 6. The demo topology: a control plane in Docker on your laptop, a remote A100 joined over an encrypted tunnel.

Next, mint a join token with `vcluster token create`. The generated command connects the external VM to the local cluster through the vCluster VPN, so the remote A100 becomes a Kubernetes worker node even though it is running outside your laptop’s Docker network. For a full walkthrough of joining a GCP instance as an external node, see External Nodes: Joining a GCP Instance to Your Local vind Cluster.

# mint a join token, then run the printed script on the A100 host
$ vcluster token create
  join script written · run it on the GPU node to register

In this demo, we’re using a GCP VM with an A100.

Checking the Cluster

Now that the control plane is running and the A100 has joined, you can check what you have. The setup is three local workers on the laptop and one remote GCP node that carries the GPU.

First, confirm the control plane is running:

# the vind control plane (a tenant cluster), running in Docker
$ vcluster list
  NAME | STATUS  | CONNECTED | AGE
  aicr | running | True      | 42m

Then list the nodes:

# five nodes: local control plane and workers, plus a remote A100
$ kubectl get nodes -o wide
NAME                       STATUS  ROLES                VERSION  KERNEL
aicr                       Ready   control-plane,master v1.34.0  6.12.54-linuxkit
worker-1                   Ready   <none>               v1.34.0  6.12.54-linuxkit
worker-2                   Ready   <none>               v1.34.0  6.12.54-linuxkit
worker-3                   Ready   <none>               v1.34.0  6.12.54-linuxkit
instance-20260602-102458   Ready   <none>               v1.34.0  6.17.0-1016-gcp

The four nodes with the linuxkit kernel (the control plane and the three workers) are the local Docker nodes. The node with the -gcp kernel is the remote A100.

Confirming the GPU

On that GCP host, the GPU is real. If you run nvidia-smi on the node itself, you can see the A100 sitting idle and ready:

$ nvidia-smi
NVIDIA-SMI 580.159.03    Driver 580.159.03    CUDA 13.0
GPU  Name                    Pwr: Usage/Cap   Memory-Usage     Util
0    NVIDIA A100-SXM4-40GB    47W / 400W       0MiB / 40960MiB  0%
No running processes found


Resolving a Recipe for the A100

With the cluster ready, the next step is to resolve a recipe. A recipe is the set of components and versions that fit your hardware. Instead of looking these up in a matrix by hand, you ask AICR to work them out for you.

We run aicr recipe with two flags. AICR reads the cluster and detects both the service and the accelerator as any, and our flags override those values with kind and a100. We use the kind service profile because vind is a drop-in replacement for kind. AICR then resolves twelve components across three overlays:

# resolve a recipe for the A100, kind service profile
$ aicr recipe --service kind --accelerator a100 -o recipe.yaml
[cli] flag override: service      detected=any -> kind
[cli] flag override: accelerator  detected=any -> a100
[cli] building recipe from criteria(service=kind, accelerator=a100)
[cli] recipe generation completed: components=12 overlays=3

The output is a RecipeResult file. It records the criteria you asked for, applies three overlays (base, monitoring-hpa, and kind), pins an exact version for every component, and lists the conformance checks to run later:

$ cat recipe.yaml
apiVersion: aicr.nvidia.com/v1alpha1
componentRefs:
  - chart: cert-manager
    name: cert-manager
    namespace: cert-manager
    source: https://charts.jetstack.io
    type: Helm
    valuesFile: components/cert-manager/values.yaml
    version: v1.20.2
  - chart: gpu-operator
    dependencyRefs:
      - nfd
      - cert-manager
      - kube-prometheus-stack
    manifestFiles:
      - components/gpu-operator/manifests/dcgm-exporter.yaml
    name: gpu-operator
    namespace: gpu-operator
    overrides:
      dcgm:
        enabled: false
      dcgmExporter:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node-role.kubernetes.io/control-plane
                      operator: DoesNotExist
        config:
          create: false
          name: ""
      devicePlugin:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node-role.kubernetes.io/control-plane
                      operator: DoesNotExist
        env: []
      driver:
        enabled: false
        rdma:
          enabled: false
      gdrcopy:
        enabled: false
      gfd:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node-role.kubernetes.io/control-plane
                      operator: DoesNotExist
      migManager:
        enabled: false
      operator:
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 200Mi
      toolkit:
        enabled: false
      validator:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node-role.kubernetes.io/control-plane
                      operator: DoesNotExist
    source: https://helm.ngc.nvidia.com/nvidia
    type: Helm
    valuesFile: components/gpu-operator/values.yaml
    version: v25.10.1
  - chart: k8s-ephemeral-storage-metrics
    dependencyRefs:
      - kube-prometheus-stack
    name: k8s-ephemeral-storage-metrics
    namespace: monitoring
    source: https://jmcgrath207.github.io/k8s-ephemeral-storage-metrics/chart
    type: Helm
    valuesFile: components/k8s-ephemeral-storage-metrics/values.yaml
    version: 1.19.2
  - chart: kai-scheduler
    dependencyRefs:
      - gpu-operator
    name: kai-scheduler
    namespace: kai-scheduler
    source: oci://ghcr.io/kai-scheduler/kai-scheduler
    type: Helm
    valuesFile: components/kai-scheduler/values.yaml
    version: v0.14.1
  - chart: kube-prometheus-stack
    dependencyRefs:
      - prometheus-operator-crds
    name: kube-prometheus-stack
    namespace: monitoring
    overrides:
      alertmanager:
        alertmanagerSpec:
          resources:
            limits:
              cpu: 250m
              memory: 256Mi
            requests:
              cpu: 50m
              memory: 64Mi
      defaultRules:
        create: false
      grafana:
        enabled: false
      prometheus:
        prometheusSpec:
          resources:
            limits:
              cpu: 1
              memory: 1Gi
            requests:
              cpu: 250m
              memory: 512Mi
          retention: 7d
          storageSpec:
            emptyDir:
              medium: ""
              sizeLimit: 5Gi
      prometheusOperator:
        alertmanagerConfigNamespaces:
          - monitoring
        alertmanagerInstanceNamespaces:
          - monitoring
        livenessProbe:
          failureThreshold: 10
          timeoutSeconds: 10
        prometheusInstanceNamespaces:
          - monitoring
        readinessProbe:
          failureThreshold: 6
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
        thanosRulerInstanceNamespaces:
          - monitoring
    source: https://prometheus-community.github.io/helm-charts
    type: Helm
    valuesFile: components/kube-prometheus-stack/values.yaml
    version: 84.4.0
  - chart: network-operator
    dependencyRefs:
      - nfd
      - cert-manager
    name: network-operator
    namespace: nvidia-network-operator
    source: https://helm.ngc.nvidia.com/nvidia
    type: Helm
    valuesFile: components/network-operator/values.yaml
    version: 26.1.1
  - chart: node-feature-discovery
    name: nfd
    namespace: node-feature-discovery
    source: https://kubernetes-sigs.github.io/node-feature-discovery/charts
    type: Helm
    valuesFile: components/nfd/values.yaml
    version: 0.18.3
  - chart: skyhook-operator
    name: nodewright-operator
    namespace: skyhook
    overrides:
      controllerManager:
        manager:
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 250m
              memory: 512Mi
    source: https://helm.ngc.nvidia.com/nvidia/skyhook
    type: Helm
    valuesFile: components/nodewright-operator/values.yaml
    version: v0.15.1
  - chart: nvidia-dra-driver-gpu
    dependencyRefs:
      - gpu-operator
    name: nvidia-dra-driver-gpu
    namespace: nvidia-dra-driver
    overrides:
      nvidiaDriverRoot: /
    source: https://helm.ngc.nvidia.com/nvidia
    type: Helm
    valuesFile: components/nvidia-dra-driver-gpu/values.yaml
    version: 25.12.0
  - chart: nvsentinel
    dependencyRefs:
      - cert-manager
      - gpu-operator
    name: nvsentinel
    namespace: nvsentinel
    overrides:
      platformConnector:
        resources:
          limits:
            cpu: 200m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
    source: oci://ghcr.io/nvidia
    type: Helm
    valuesFile: components/nvsentinel/values.yaml
    version: v1.3.0
  - chart: prometheus-adapter
    dependencyRefs:
      - kube-prometheus-stack
    name: prometheus-adapter
    namespace: monitoring
    source: https://prometheus-community.github.io/helm-charts
    type: Helm
    valuesFile: components/prometheus-adapter/values.yaml
    version: 5.3.0
  - chart: prometheus-operator-crds
    name: prometheus-operator-crds
    namespace: monitoring
    source: https://prometheus-community.github.io/helm-charts
    type: Helm
    valuesFile: components/prometheus-operator-crds/values.yaml
    version: 28.0.1
constraints:
  - name: K8s.server.version
    value: '>= 1.25'
criteria:
  accelerator: a100
  intent: any
  os: any
  platform: any
  service: kind
deploymentOrder:
  - cert-manager
  - nfd
  - network-operator
  - nodewright-operator
  - prometheus-operator-crds
  - kube-prometheus-stack
  - gpu-operator
  - k8s-ephemeral-storage-metrics
  - kai-scheduler
  - nvidia-dra-driver-gpu
  - nvsentinel
  - prometheus-adapter
kind: RecipeResult
metadata:
  appliedOverlays:
    - base
    - monitoring-hpa
    - kind
  version: 0.13.0
validation:
  conformance:
    checks:
      - platform-health
      - gpu-operator-health
      - dra-support
      - accelerator-metrics
      - ai-service-metrics
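
The dependencyRefs in this file are enough to derive a safe install order. The Python sketch below copies the dependencies from the recipe above and runs a standard topological sort (Kahn's algorithm) that breaks ties alphabetically. Whether AICR computes deploymentOrder this way is an assumption on our part, but this particular tie-breaking rule does reproduce the order listed in the file.

```python
import heapq

# dependencyRefs copied from recipe.yaml above (component -> its dependencies).
DEPENDENCIES = {
    "cert-manager": [],
    "gpu-operator": ["nfd", "cert-manager", "kube-prometheus-stack"],
    "k8s-ephemeral-storage-metrics": ["kube-prometheus-stack"],
    "kai-scheduler": ["gpu-operator"],
    "kube-prometheus-stack": ["prometheus-operator-crds"],
    "network-operator": ["nfd", "cert-manager"],
    "nfd": [],
    "nodewright-operator": [],
    "nvidia-dra-driver-gpu": ["gpu-operator"],
    "nvsentinel": ["cert-manager", "gpu-operator"],
    "prometheus-adapter": ["kube-prometheus-stack"],
    "prometheus-operator-crds": [],
}


def deployment_order(deps):
    """Kahn's algorithm with a min-heap, so ready components come out alphabetically."""
    remaining = {name: set(needed) for name, needed in deps.items()}
    dependents = {name: [] for name in deps}
    for name, needed in deps.items():
        for dep in needed:
            dependents[dep].append(name)

    ready = [name for name, needed in remaining.items() if not needed]
    heapq.heapify(ready)
    order = []
    while ready:
        current = heapq.heappop(ready)
        order.append(current)
        for child in dependents[current]:
            remaining[child].discard(current)
            if not remaining[child]:
                heapq.heappush(ready, child)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order


for position, name in enumerate(deployment_order(DEPENDENCIES), start=1):
    print(f"{position:02d} {name}")
```

The useful property is that every component appears after everything it depends on, which is why gpu-operator waits for kube-prometheus-stack even though it sorts earlier alphabetically.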

Bundling and Deploying

Now that you have a recipe, the next step is to turn it into something you can install. aicr bundle converts the recipe into per-component Helm charts, and the generated deploy.sh script installs them in the right order. The resulting bundles directory looks like this:

❯ ls bundles 
001-cert-manager			008-gpu-operator-post			deploy.sh
002-nfd					009-k8s-ephemeral-storage-metrics	README.md
003-network-operator			010-kai-scheduler			recipe.yaml
004-nodewright-operator			011-nvidia-dra-driver-gpu		results.json
005-prometheus-operator-crds		012-nvsentinel				undeploy.sh
006-kube-prometheus-stack		013-prometheus-adapter
007-gpu-operator			checksums.txt

The run is honest about what happens along the way. The GPU operator needed one retry while its ClusterPolicy settled, and the script reminds you at the end that some of the work keeps going in the background after it exits:

# turn the recipe into Helm bundles, then deploy
$ aicr bundle --recipe recipe.yaml --deployer helm -o ./bundles
[cli] bundle generated: type=Helm files=57 size=135665 dur=0.016s
 
$ cd bundles && chmod +x deploy.sh && ./deploy.sh
Pre-flight checks passed.
Installing cert-manager          ... STATUS: deployed
Installing nfd                   ... STATUS: deployed
Installing network-operator      ... STATUS: deployed
Installing gpu-operator          ...
  Error: ClusterPolicy not ready (InProgress); retrying in 5s
  gpu-operator                   ... STATUS: deployed (revision 2)
Installing kai-scheduler         ... STATUS: deployed
Installing nvidia-dra-driver-gpu ... STATUS: deployed
Installing nvsentinel            ... STATUS: deployed
  ... 13 charts total
Deployment complete.
 
NOTE: Helm install results, not full GPU-workload readiness.
Convergence continues async: node tuning (~10-20 min),
gpu-operator operands, and DRA kubelet plugin registration.

Validating Conformance

The last step is to prove that the cluster actually works. aicr validate with the conformance phase deploys an agent into the aicr-validation namespace to capture a fresh snapshot, runs a readiness check that gates everything else, and then runs five conformance validators chosen from a catalog of eleven. There are other phases as well, such as inference and training, but here we focus only on conformance:

# prove it: deploy validators into the cluster and check
$ aicr validate --recipe recipe.yaml --phase conformance
[cli] deploying agent: namespace=aicr-validation
[agent] collecting node / GPU / OS / Kubernetes info
[agent] node topology complete: nodes=5 labels=144
[cli] job completed successfully
[cli] readiness pre-flight: constraints=1
[cli] readiness passed: K8s.server.version >= 1.25 (got v1.34.0)
[cli] running phase=conformance catalog=11 selected=5
[cli] validator passed: dra-support
[cli] validator passed: accelerator-metrics
[cli] validator passed: ai-service-metrics
[cli] validator passed: gpu-operator-health
[cli] validator passed: platform-health
[cli] conformance passed: validators=5 passed=5 failed=0 (1m0s)
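
The readiness line above compares a constraint (>= 1.25) with the detected server version (v1.34.0). A naive string or float comparison gets this wrong for versions like 1.9 versus 1.25, so a check like this needs to compare numeric components. The Python below is a minimal sketch of that comparison under our own assumptions. It is not AICR's constraint parser, and it handles only the simple "operator version" form shown in the recipe.

```python
import operator
import re

COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def parse_version(text):
    """Extract numeric components: 'v1.34.0' -> (1, 34, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def satisfies(constraint, actual):
    """Check a constraint such as '>= 1.25' against a version such as 'v1.34.0'."""
    symbol, wanted = constraint.split()
    want = parse_version(wanted)
    # Compare at the precision the constraint specifies (major.minor here).
    got = parse_version(actual)[: len(want)]
    return COMPARATORS[symbol](got, want)


print(satisfies(">= 1.25", "v1.34.0"))  # True
print(satisfies(">= 1.25", "v1.9.0"))   # False
```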

The report comes out as CTRF JSON, generated by aicr 0.13.0:

{ "reportFormat": "CTRF", "generatedBy": "aicr",
  "tool": { "name": "aicr", "version": "0.13.0" },
  "summary": { "tests": 5, "passed": 5, "failed": 0 } }
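
Because the report is JSON, it is easy to gate a pipeline on it. Here is a short Python sketch that reads the summary and decides pass or fail. It parses the fields exactly as they appear in the excerpt above. A full report may nest them differently, so treat the key paths as an assumption.

```python
import json

# The report excerpt shown above, as a string.
REPORT = """
{ "reportFormat": "CTRF", "generatedBy": "aicr",
  "tool": { "name": "aicr", "version": "0.13.0" },
  "summary": { "tests": 5, "passed": 5, "failed": 0 } }
"""


def conformance_ok(report_text):
    """Return True only if at least one test ran and every test passed."""
    summary = json.loads(report_text)["summary"]
    return (
        summary["tests"] > 0
        and summary["failed"] == 0
        and summary["passed"] == summary["tests"]
    )


print(conformance_ok(REPORT))  # True
```

Requiring tests > 0 guards against an empty report being read as a success.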

None of these checks are a simple lint. The accelerator-metrics check pulled live DCGM metrics for the NVIDIA A100-SXM4-40GB, and the dra-support check allocated a GPU to a test pod, saw the message DRA GPU allocation successful, and then cleaned up after itself. In other words, conformance schedules real GPU work and checks the result.

So the 5/5 is real. It covers a live GPU allocation and real DCGM metrics from the A100. You generated a recipe from the cluster you actually have, deployed it, and proved it before anything real depended on it. That is the difference between reading a matrix and generating one.

Understanding Snapshots and When They Help

You might have noticed that we never ran aicr snapshot anywhere in this walkthrough. That was on purpose, and it is worth explaining why.

First, what is a snapshot? A snapshot is a file that records what your cluster currently has. This includes the nodes, the GPU models on them, the driver and toolkit versions, and the operators that are installed. AICR uses this file as the starting point for everything else. The recipe is resolved against it, and the validator checks against it.

The reason we did not run it as its own step is simple. The two commands we did run, aicr recipe and aicr validate, each take a snapshot on their own before they do their work. So the snapshot still happened. We just did not have to run it by hand.

You can see this in the recipe step. When we ran aicr recipe, the CLI read the cluster first and logged what it found. Written out in full, those log lines read:

[cli] CLI flag overriding snapshot-detected value: field=service detected=any override=kind
[cli] CLI flag overriding snapshot-detected value: field=accelerator detected=any override=a100

Notice the words "snapshot-detected value". AICR looked at the cluster, detected both the service and the accelerator as any, and then our two flags replaced those values with kind and a100. The snapshot was already there. We only changed what it found.
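
Conceptually, this is a plain override: start from what was detected, then let explicit flags replace individual fields. A minimal Python sketch, using the criteria fields from recipe.yaml and a hypothetical apply_overrides helper of our own:

```python
def apply_overrides(detected, flags):
    """Return the final criteria plus one log line per field that a flag changed."""
    final = dict(detected)
    log = []
    for field, value in flags.items():
        if value is not None and final.get(field) != value:
            log.append(f"field={field} detected={final.get(field)} override={value}")
            final[field] = value
    return final, log


# What the snapshot detected in this walkthrough: everything is "any".
detected = {
    "accelerator": "any",
    "intent": "any",
    "os": "any",
    "platform": "any",
    "service": "any",
}
# The two flags we passed on the command line.
flags = {"service": "kind", "accelerator": "a100"}

criteria, log = apply_overrides(detected, flags)
print(criteria)
for line in log:
    print(line)
```

The resulting criteria match the criteria block in recipe.yaml: accelerator a100, service kind, and any for the fields we did not override.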

The validate step does the same thing, but it captures a brand new snapshot each time so it is never working from stale information. It deploys a small agent into the aicr-validation namespace, and that agent collects the cluster state right before the checks run:

[cli] deploying agent to capture snapshot
[agent] collecting node topology information
[agent] collecting Kubernetes cluster information
[agent] node topology collection complete: nodes=5 taints=0 labels=144

There is one detail here worth pointing out. The agent runs inside the cluster as a pod, and from inside that pod it could not find nvidia-smi:

[agent] nvidia-smi not found - no GPU data will be collected

So the snapshot itself did not contain any GPU details, and yet conformance still passed all five checks. That is fine, and it is by design. The snapshot records the nodes and the Kubernetes state. It does not need to see the GPU directly. The proof that the GPU works comes from the validators, which pull live DCGM metrics from the A100 and run a real DRA allocation on it. The snapshot describes the cluster, and the validators test it.

So when would you run aicr snapshot by yourself? You run it when you want the file itself, not just the steps that use it. A few common reasons:

  1. To look at what is on the cluster before you resolve a recipe.
  2. To compare two snapshots taken at different times, so you can see what changed when a cluster drifts.
  3. To save the cluster state and resolve a recipe against that saved file later, which is useful for repeatable or air-gapped runs.
  4. To keep a record of the cluster for an audit.
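
For the second reason, comparing snapshots, a few lines of Python are enough to show the idea. The snapshot schema is not shown in this article, so the field names below are hypothetical, and the "after" driver version is invented to simulate drift.

```python
def flatten(data, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} so snapshots compare key by key."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_snapshots(old, new):
    """Return {path: (before, after)} for every value that differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    changes = {}
    for path in sorted(old_flat.keys() | new_flat.keys()):
        before, after = old_flat.get(path), new_flat.get(path)
        if before != after:
            changes[path] = (before, after)
    return changes


# Hypothetical snapshot contents; only the structure of the comparison matters.
before = {"kubernetes": {"version": "v1.34.0"}, "gpu": {"driver": "580.159.03"}}
after = {"kubernetes": {"version": "v1.34.0"}, "gpu": {"driver": "590.00.00"}}

for path, (old_value, new_value) in diff_snapshots(before, after).items():
    print(f"{path}: {old_value} -> {new_value}")
```

In practice you would load the two snapshot.yaml files with a YAML parser first. The comparison itself stays the same.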

For this walkthrough we did not need any of those, so we let recipe and validate handle the snapshot for us.

Final Thoughts

The real shift is who does the resolving. A compatibility matrix makes a person hold every constraint by hand. AICR moves that work to a tool that reads the actual cluster, so you review a result instead of working it out yourself. That makes everyday work easier. Adding a new GPU, an A100 now or an H100 next quarter, becomes a flag rather than a research project, and rebuilding a cluster after a driver bump is just generate, deploy, validate. Because the loop ends in a conformance run on real GPU work, you catch a broken stack before a training job does.

You do not even need your own fleet to try it. vind, as you saw, is a drop-in for kind that can also join a remote node over a secure tunnel. Point it at any GPU you can borrow for an hour, run the four stages, and you have your own conformant GPU stack by this afternoon, resolved from the hardware you actually have and proven before anything depends on it.

With vCluster you can easily run training, inference, and other jobs with dynamic capacity in a conformant manner. If you have any questions, don’t forget to join our Slack.
