
vMetal Deep Dive: How AI Clouds Turn Bare Metal GPUs Into a Programmable Platform

May 28, 2026 | 13 min read

vMetal's official pitch in one line: "Run your GPU data center like a hyperscaler."

I made a long-form video walking through what's actually behind that claim. The architecture, the YAML, the network model, the demo. This is the written companion.

📺 Watch the full video walkthrough: vMetal Deep Dive on YouTube

I went into the actual working repos while writing this. The loft-sh/vcluster-bare-metal-with-kubevirt repo gives you a fully self-contained local demo using KubeVirt VMs as fake bare metal, and loft-sh/vcluster-docs has the source of truth for the docs. Everything in this post is grounded in real YAML you can apply, not slideware.

Plain-English Glossary (skip if you live in this stuff)

Some terms come up a lot in this post. If any of them feel unfamiliar, here they are in everyday language.

Term | What it actually means
BMC (Baseboard Management Controller) | A tiny separate computer inside every server with its own network port. It can power the server on/off and install an OS even when the main machine is off. Think of it as the server's remote control.
Redfish / IPMI | The two common languages BMCs speak. Redfish is the modern one, IPMI is the legacy one.
PXE boot | A way for a server to get its operating system over the network instead of from local disk. The server says "I'm new here, give me an OS," and a server on the network hands it one.
Cloud-init | A tool that runs the first time a freshly installed server boots and applies the configuration it was handed. Sets the hostname, configures the network, joins clusters, etc.
Metal3 / Ironic | Open-source projects vMetal is built on. Metal3 represents physical servers as Kubernetes objects; Ironic is the engine that talks to BMCs and PXE-boots the servers.
CR / CRD | "Custom Resource" and "Custom Resource Definition." A CR is a YAML object Kubernetes manages alongside Pods, Deployments, etc.; a CRD is the schema that defines a kind of CR. A BareMetalHost is a CR that represents one physical server.
Control Plane Cluster | The Kubernetes cluster where vMetal itself runs. It's the brain.
Tenant Cluster | What vMetal hands to each customer or team. Their own isolated Kubernetes cluster with their own GPU nodes.
VLAN / VXLAN | Two ways to slice one physical network into many isolated virtual ones. Think floor plans for the same building.
Multus | A Kubernetes plugin that lets a pod be on more than one network at the same time. vMetal uses it so its DHCP pod can sit on the bare metal network.
NVLink / InfiniBand | Super-fast links between GPUs (NVLink, inside one server) and between servers (InfiniBand). What makes large training runs go fast.

The Platform Problem

Buying GPUs is the easy part. Then you need provisioning, OS lifecycle, tenant isolation, networking, DNS, and scheduling. And you need it to be self-service so customers or internal teams don't file tickets for every notebook.

Building this internally typically takes 6 to 12 months and a serious team. Meanwhile the GPUs depreciate. vMetal's pitch: turn the racks into a compute platform without writing the platform.

What vMetal Actually Is

In the simplest possible terms: vMetal lets you treat a rack of physical GPU servers like a cloud. You point it at your hardware once, and from then on you can hand a fresh server to a customer or team in seconds and reclaim it when they're done. All from Kubernetes-native YAML or a UI.

The docs put it more formally:

"vMetal is the bare metal layer of the vCluster Platform. It builds on Metal3 and Ironic to handle BMC communication, PXE boot, OS installation, and server cleaning."

There's no hypervisor in the way. Workloads get direct access to GPUs, NVLink fabric, and InfiniBand. The hardware behaves the way the manufacturer intended. vMetal manages the physical machines themselves: registering them, installing an OS over the network, joining them to a Tenant Cluster, and cleaning them up when the tenant is done.

Built on:

  • Metal3. Exposes each physical server as a BareMetalHost custom resource.
  • Ironic. The engine that talks to BMCs (Redfish or IPMI), drives power, PXE, and image writing.

How vMetal "Detects" Servers

A common question: does vMetal scan the network and auto-discover servers? Today, the model is declarative. You declare each server once, and from then on the platform handles the rest of the lifecycle (registering, inspecting, claiming, provisioning, deprovisioning, reuse). More automated discovery is on the roadmap, so future versions should not require declaring hosts one by one.

The trigger for a server to enter the system is a BareMetalHost CR pointing at the BMC. You create:

  1. A Secret with BMC username/password
  2. A BareMetalHost CR with the BMC URL plus the boot MAC address

Once those exist, Metal3 and Ironic do the rest automatically. The host passes through Registering (BMC credentials are verified), then Inspecting (hardware inventory is collected automatically: CPU, RAM, NICs, disks, firmware, GPUs, PCIe), and ends up Available.

Real example from the working kubevirt demo repo:

apiVersion: v1
kind: Secret
metadata:
  name: server-01-bmc
  namespace: metal3-system
type: Opaque
stringData:
  username: admin
  password: <BMC-PASSWORD>
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-01
  namespace: metal3-system
  labels:
    role: compute
spec:
  bmc:
    address: redfish://192.168.1.100
    credentialsName: server-01-bmc
  bootMACAddress: "aa:bb:cc:dd:ee:01"

Adding lots of servers at once

If you're racking a fleet, not just one machine, there's a bulk registration path. You concatenate the Secret and BareMetalHost resources into a single YAML file (one Secret and one BareMetalHost document per server, separated by ---) and apply it. The docs show this exact pattern:

---
apiVersion: v1
kind: Secret
metadata:
  name: server-01-bmc
  namespace: metal3-system
stringData:
  username: admin
  password: <BMC-PASSWORD>
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-01
  namespace: metal3-system
  labels:
    role: compute
    rack: rack-a
spec:
  bmc:
    address: redfish://192.168.1.100
    credentialsName: server-01-bmc
  bootMACAddress: "aa:bb:cc:dd:ee:01"
---
# server-02, server-03, ... in the same file

kubectl apply -f servers.yaml

You get parallel registration plus inspection across the whole batch. Combine this with rack/role labels and your hardware inventory becomes a single GitOps-managed manifest.
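For a larger fleet, hand-writing that file gets tedious. Here is a minimal sketch that renders it from an inventory list. It uses plain string templating, not any vMetal tooling, and the inventory entries (names, racks, BMC addresses, MACs) are made up for illustration. The password stays as the same <BMC-PASSWORD> placeholder used above.

```python
# Template for one server: a Secret plus a BareMetalHost, mirroring the manifest above.
TEMPLATE = """\
apiVersion: v1
kind: Secret
metadata:
  name: {name}-bmc
  namespace: metal3-system
stringData:
  username: admin
  password: <BMC-PASSWORD>
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: {name}
  namespace: metal3-system
  labels:
    role: compute
    rack: {rack}
spec:
  bmc:
    address: redfish://{bmc_ip}
    credentialsName: {name}-bmc
  bootMACAddress: "{mac}"
"""

# Hypothetical inventory; in practice this might come from a CSV or CMDB export.
INVENTORY = [
    {"name": "server-01", "rack": "rack-a", "bmc_ip": "192.168.1.100", "mac": "aa:bb:cc:dd:ee:01"},
    {"name": "server-02", "rack": "rack-a", "bmc_ip": "192.168.1.101", "mac": "aa:bb:cc:dd:ee:02"},
]


def render_servers(inventory):
    """Return one YAML string with a Secret and a BareMetalHost document per server."""
    return "---\n".join(TEMPLATE.format(**server) for server in inventory)


if __name__ == "__main__":
    print(render_servers(INVENTORY))
```

Redirect the output to servers.yaml and apply it with the same kubectl apply -f servers.yaml command.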

What if my team doesn't know YAML?

The Platform UI has a form under Bare Metal Servers. You click Add, fill in the BMC address, credentials, and boot MAC. The platform writes the CR for you. In practice this is operator-side work anyway. The data scientists who consume the GPUs never see this layer.

The Network Model: the Other Question Everyone Asks

"Do bare metal servers need to be on the same network as the Control Plane Cluster?"

No. Bare metal servers can sit on completely different networks than the Control Plane Cluster. What you do need is two things wired up. First, the Control Plane Cluster has to be able to reach the BMCs (so it can power servers on and trigger installs). Second, the cluster needs one attachment into the bare metal provisioning network, either a bridge or a macvlan interface (so the install actually happens). The next two subsections explain each in plain terms.

Path 1: Ironic to BMC (control plane traffic)

Ironic runs inside the Control Plane Cluster. It must have IP reachability to each BMC (the Redfish/IPMI endpoint). A shared L2 segment is not required, and neither is a shared IP range. The docs are explicit:

"Ironic must have network access to the BMC addresses of the bare metal servers."

If your BMCs are on 10.10.0.0/24 and your Control Plane Cluster pods are on 10.244.0.0/16, that's fine, as long as routing exists.
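To make the distinction concrete, here is a small sketch that separates "same subnet" from "reachable." The BMC address is a made-up host in the example range, and port 443 is an assumption (Redfish is usually served over HTTPS, but your BMCs may differ).

```python
import ipaddress
import socket


def shares_subnet(addr: str, cidr: str) -> bool:
    """True if addr falls inside cidr. Not a requirement for Ironic-to-BMC traffic."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(cidr)


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and missing routes.
        return False


# Hypothetical BMC address inside the 10.10.0.0/24 example range.
BMC_ADDRESS = "10.10.0.15"
POD_CIDR = "10.244.0.0/16"
```

shares_subnet(BMC_ADDRESS, POD_CIDR) is False, and that is fine. What matters is that tcp_reachable(BMC_ADDRESS) returns True when run from somewhere with the same routing as the Ironic pod.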

Path 2: DHCP/PXE to bare metal NICs (provisioning traffic)

This is the part that needs explicit wiring. The DHCP/PXE proxy pod runs in the Control Plane Cluster and is attached via Multus to the provisioning network. The docs describe two modes:

Bridge mode. Control Plane Cluster nodes have a bridge (e.g. br0) attached to the provisioning network. The DHCP pod attaches through that bridge.

deploy:
  dhcp:
    enabled: true
    helmValues: |
      networkAttachmentDefinition:
        vip: 192.168.100.2/24
        config: |
          {
            "cniVersion": "0.3.1",
            "type": "bridge",
            "bridge": "br0",
            "isDefaultGateway": false
          }

Macvlan mode. Used when "the bare metal servers are on the same network as the Control Plane Cluster nodes." The DHCP pod gets a macvlan interface on eth0.

deploy:
  dhcp:
    enabled: true
    helmValues: |
      networkAttachmentDefinition:
        vip: 10.0.0.2/24
        config: |
          {
            "cniVersion": "0.3.1",
            "type": "macvlan",
            "master": "eth0",
            "mode": "bridge"
          }

In the kubevirt demo, the entire provisioning network is 192.168.100.0/24 on a bridge br0 set up by a DaemonSet. The KubeVirt "fake" bare metal VMs live in 192.168.100.10–20, the bridge IP is 192.168.100.1, and the DHCP pod gets 192.168.100.4. Bridge mode, end to end.

So: bare metal servers don't need to share IP range with the Control Plane Cluster. What they need is a wire from the cluster nodes into their provisioning network (bridge or shared L2 via macvlan), plus IP routability from Ironic to the BMCs.
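An address plan like the demo's is easy to get subtly wrong (a VIP that collides with a VM, a range that spills out of the subnet). As a sketch, the numbers quoted above for the kubevirt demo can be sanity-checked with Python's standard ipaddress module. The check itself is my own illustration, not part of the repo.

```python
import ipaddress

PROVISIONING_NET = ipaddress.ip_network("192.168.100.0/24")
BRIDGE_IP = ipaddress.ip_address("192.168.100.1")
DHCP_VIP = ipaddress.ip_address("192.168.100.4")
# VM range 192.168.100.10 through 192.168.100.20, inclusive (11 addresses).
VM_IPS = [ipaddress.ip_address("192.168.100.10") + i for i in range(11)]


def check_plan():
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    everything = [BRIDGE_IP, DHCP_VIP, *VM_IPS]
    for ip in everything:
        if ip not in PROVISIONING_NET:
            problems.append(f"{ip} is outside {PROVISIONING_NET}")
    if len(set(everything)) != len(everything):
        problems.append("duplicate address in plan")
    return problems


if __name__ == "__main__":
    print(check_plan() or "address plan is consistent")
```

Swapping in your own subnet, bridge address, and VIP gives a quick pre-flight check before touching the Helm values.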

What vMetal Deploys for You

A NodeProvider of type Metal3 can deploy three components into the Control Plane Cluster, each individually toggleable. From the actual node-provider.yaml in the kubevirt demo:

apiVersion: storage.loft.sh/v1
kind: NodeProvider
metadata:
  name: metal3
spec:
  displayName: "Metal3 Bare Metal Hosts"
  metal3:
    clusterRef:
      cluster: loft-cluster
      namespace: default
    deploy:
      multus:
        enabled: true
      metal3:
        enabled: true
      dhcp:
        enabled: true
        helmValues: |
          networkAttachmentDefinition:
            vip: 192.168.100.4/24
    nodeTypes:
      - name: vm

The three components:

Component | Role
Metal3 + Ironic | Reconciles BareMetalHost CRs. Talks to BMCs. Drives power, PXE, OS image writes.
DHCP Proxy | Handles PXE boot. Acts as a proxy between bare metal servers and Ironic when they're on different networks.
Multus CNI | Lets the DHCP pod attach to the provisioning network (separate from the cluster pod network).

If you already run any of these (you have a Metal3 install, your own DHCP, your own Multus) you disable the corresponding deploy.*.enabled field and bring your own.

The configuration surface itself comes down to three Kubernetes resources working together:

The NodeProvider points at the Control Plane Cluster and toggles what gets deployed. Each BareMetalHost plus its Secret represents one physical server. NodeType resources define hardware profiles (CPU, memory, GPU count) and a label selector that matches BareMetalHost resources. When a workload needs a GPU server, vMetal finds an available host with matching labels. There's also a built-in cost calculation that picks the cheapest matching node type when multiple could fulfill a request.
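The matching logic in that paragraph can be sketched in a few lines. This is an illustration of the idea only: the class names, the equality-based selector, and the single cost number are all assumptions, since the post does not show how vMetal computes cost or evaluates selectors.

```python
from dataclasses import dataclass


@dataclass
class NodeType:
    name: str
    selector: dict  # labels a host must carry to match this node type
    gpus: int
    cost: float     # hypothetical relative cost; the real cost model is not shown here


@dataclass
class Host:
    name: str
    labels: dict
    state: str


def matches(selector, labels):
    """Equality-based label match: every selector key/value must be on the host."""
    return all(labels.get(key) == value for key, value in selector.items())


def pick(node_types, hosts, gpus_needed):
    """Return (node_type, host) for the cheapest node type with an Available host, else None."""
    candidates = sorted(
        (nt for nt in node_types if nt.gpus >= gpus_needed),
        key=lambda nt: nt.cost,
    )
    for node_type in candidates:
        for host in hosts:
            if host.state == "Available" and matches(node_type.selector, host.labels):
                return node_type, host
    return None
```

Note the fallback behavior: if the cheapest node type has no Available host, the sketch moves on to the next cheapest instead of failing. Whether vMetal does the same is not stated in the docs quoted here.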

The Lifecycle

Every BareMetalHost moves through this state machine:

State | What's happening
Registering | Verify BMC creds. Can the system actually talk to this server?
Inspecting | Auto-collect hardware inventory. CPU, RAM, NICs, disks, firmware, GPUs, PCIe.
Available | In the pool. Waiting to be claimed.
Provisioning | OS image writing via PXE; cloud-init staged.
Provisioned | OS running. If targeting a Tenant Cluster, already joined.
Deprovisioning | Cleanup. Returned to the pool.
Error | Anything can transition here. Debug like any Kubernetes operator.

When a Machine (the platform's claim) is deleted, vMetal restores the BareMetalHost to its original state and it becomes Available again. Same server, next tenant.
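The table reads naturally as a small state machine. The sketch below models only the happy path described here plus the "anything can go to Error" rule. It is a simplification for illustration: the real Metal3 state machine has more states and recovery paths than this post covers.

```python
from enum import Enum


class State(Enum):
    REGISTERING = "Registering"
    INSPECTING = "Inspecting"
    AVAILABLE = "Available"
    PROVISIONING = "Provisioning"
    PROVISIONED = "Provisioned"
    DEPROVISIONING = "Deprovisioning"
    ERROR = "Error"


# Happy-path edges as described in the table; Error is handled separately.
NEXT = {
    State.REGISTERING: State.INSPECTING,
    State.INSPECTING: State.AVAILABLE,
    State.AVAILABLE: State.PROVISIONING,
    State.PROVISIONING: State.PROVISIONED,
    State.PROVISIONED: State.DEPROVISIONING,
    State.DEPROVISIONING: State.AVAILABLE,
}


def can_transition(src, dst):
    """True if dst is the next happy-path state after src, or dst is Error."""
    if dst is State.ERROR:
        return src is not State.ERROR
    return NEXT.get(src) is dst


def full_cycle(start=State.REGISTERING):
    """Walk the happy path until the host is Available for the second time."""
    path, state, times_available = [start], start, 0
    while True:
        state = NEXT[state]
        path.append(state)
        if state is State.AVAILABLE:
            times_available += 1
            if times_available == 2:
                return path


if __name__ == "__main__":
    print(" -> ".join(s.value for s in full_cycle()))
```

The loop from Deprovisioning back to Available is the "same server, next tenant" part.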

End-to-End Path: From Tenant Request to Running Pod

When a Tenant Cluster requests a bare metal node:

  1. Selection. The provider picks an Available BareMetalHost matching the node type's label selector and resources.
  2. Configuration. Cloud-init user data is generated, stored as a Secret on the Control Plane Cluster.
  3. Setup. The BMH is patched with image reference plus userData Secret reference. This is the declarative trigger.
  4. Installation. Ironic sets the boot device to PXE and powers the server on via the BMC. The server network-boots IPA (Ironic Python Agent), which writes the OS image to disk.
  5. Boot. The server reboots from disk into the new OS, cloud-init runs.
  6. Integration. For vCluster private nodes, cloud-init includes the join command. The node automatically registers with the Tenant Cluster.

No manual kubeadm join. No manual switch port flipping. No manual DNS update.
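Step 3 is the part worth seeing as data. The sketch below builds what an equivalent manual patch could look like. spec.online, spec.image, and spec.userData are BareMetalHost fields in the Metal3 API, but the exact patch vMetal sends is not shown in the docs quoted here, and the image URL, checksum URL, and Secret name are placeholders.

```python
import json


def build_provision_patch(image_url, checksum, user_data_secret, namespace):
    """Build a JSON merge patch that asks Metal3 to provision a host.

    All argument values are supplied by the caller; nothing here is
    taken from vMetal's own implementation.
    """
    return {
        "spec": {
            "online": True,  # keep the host powered on
            "image": {"url": image_url, "checksum": checksum},
            "userData": {"name": user_data_secret, "namespace": namespace},
        }
    }


if __name__ == "__main__":
    patch = build_provision_patch(
        image_url="http://images.example.internal/ubuntu-24.04.qcow2",  # placeholder
        checksum="http://images.example.internal/ubuntu-24.04.qcow2.sha256sum",  # placeholder
        user_data_secret="server-01-userdata",  # placeholder Secret name
        namespace="metal3-system",
    )
    print(json.dumps(patch, indent=2))
```

Because the trigger is just a change to the resource, the same reconcile loop handles a host claimed by the platform and one patched by hand.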

You Can Run This Locally Today

The thing that surprised me most: there's a fully working local replica using KubeVirt VMs as fake bare metal servers. The repo loft-sh/vcluster-bare-metal-with-kubevirt has a Makefile that walks you through the whole thing without owning any DGX hardware.

Prerequisites: Docker, vcluster CLI, kubectl, helm, a host with KVM (~16GB RAM, 4+ CPU).

# Create a vcluster-in-docker host cluster
make vind-up

# Install everything (cert-manager, KubeVirt, br0 bridge,
# vCluster Platform, Metal3 NodeProvider, DHCP, Multus)
make install

# Boot KubeVirt VMs that pretend to be bare metal servers
# Each VM has a Redfish BMC shim (virtbmc) the platform can talk to
make create-vms

# Now create a vCluster that auto-claims those "BMHs" as private nodes
make create-vcluster

Behind the scenes:

  • A Linux bridge (br0, 192.168.100.0/24) on the host acts as the shared provisioning network
  • A NodeProvider deploys Multus, Metal3 plus Ironic, and the DHCP proxy into the host cluster
  • A NodeEnvironment provides the IP range (192.168.100.10–20), gateway, and DNS for the network
  • An Ubuntu 24.04 OSImage and a static SSHKey are referenced
  • BareMetalHost resources point at each VM's virtbmc Redfish endpoint

make create-vcluster then creates a VirtualClusterInstance that requests a node from the metal3 provider. A NodeClaim is created, a BMH is selected and provisioned (Ironic writes the image), the VM boots, cloud-init joins the Tenant Cluster, and kubectl get nodes against the Tenant Cluster shows the new node.
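While the demo runs, it helps to see how many hosts are in each state at a glance. A minimal sketch, assuming you feed it the output of kubectl get baremetalhosts -A -o json: it reads status.provisioning.state, which is where Metal3 reports the host's state. The helper name is my own, not something from the repo.

```python
import json
from collections import Counter


def summarize_states(bmh_list_json: str) -> Counter:
    """Count hosts per status.provisioning.state in a kubectl JSON list payload."""
    items = json.loads(bmh_list_json).get("items", [])
    return Counter(
        item.get("status", {}).get("provisioning", {}).get("state") or "unknown"
        for item in items
    )
```

Hosts that have not reported a state yet are counted as "unknown" rather than dropped, so the total always matches the number of BareMetalHost resources.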

This is the cheapest way I've seen to learn how this stack actually behaves end-to-end.

Tenant Isolation: Three Layers, Real Boundaries

The isolation model has three distinct layers and they all matter:

  1. Network isolation. Each tenant gets its own VLAN/VXLAN. When a node is claimed, vMetal coordinates with the network controller (Netris in the GTC demo, but the integration is pluggable) and the server manager (BCM, NVIDIA Base Command Manager) to physically move the node into the tenant's network. Switch ports get reconfigured. NVLink fabric and InfiniBand get reconfigured. DNS gets updated.
  2. Cluster isolation. Each tenant runs in their own Tenant Cluster (via vCluster) with its own Virtual Control Plane (API server, scheduler, controller manager, resource view).
  3. Runtime isolation. vNode adds boundary enforcement when workloads share physical nodes (less relevant for dedicated bare metal, more relevant for shared-host scenarios).

The hot standby trick: PXE boots are slow. So vMetal keeps DGX nodes pre-provisioned in a "management" pool. Claim time is then a network move plus cluster join, not a full reinstall. Seconds, not minutes.

The Demo, Compressed

In the video I walk through the GTC demo. Here's the punch line.

A data scientist opens Run:ai, targets Tenant Cluster #1, and clicks Create Jupyter Notebook. That's the user-side action.

Under the hood:

  1. Run:ai schedules the workload. Pending, no GPU node available in the tenant's cluster.
  2. The dynamic node pool kicks in; vMetal sees the demand.
  3. vMetal picks an Available DGX from the management pool.
  4. vMetal coordinates with Netris and moves the node's switch ports into Tenant 1's VLAN.
  5. vMetal coordinates with BCM and updates the node's network assignment.
  6. The node joins Tenant Cluster #1.
  7. Run:ai sees the new node, schedules the Jupyter pod, containers start.

You can watch this happen live in the Netris UI. Three DGX nodes start in management. After the click, DGX-01 visibly moves into Tenant 1's network. BCM confirms the same. When the tenant releases the node, vMetal reverses everything.
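"Reverses everything" implies the claim steps are undone in the opposite order. Here is a toy sketch of that pattern. It is purely illustrative: the lambdas only mutate a dictionary and stand in for the Netris, BCM, and cluster-join integrations, whose real APIs are not shown in this post.

```python
class Claim:
    """Run claim steps in order and remember how to undo each one."""

    def __init__(self):
        self._undo = []
        self.log = []

    def step(self, name, do, undo):
        do()
        self.log.append(f"do: {name}")
        self._undo.append((name, undo))

    def release(self):
        # Undo in reverse order so the last step applied is rolled back first.
        while self._undo:
            name, undo = self._undo.pop()
            undo()
            self.log.append(f"undo: {name}")


def demo():
    """Claim a node for tenant-1, then release it; returns (claimed, released, log)."""
    node = {"vlan": "management", "bcm_network": "management", "cluster": None}
    claim = Claim()
    claim.step("switch ports", lambda: node.update(vlan="tenant-1"),
               lambda: node.update(vlan="management"))
    claim.step("bcm assignment", lambda: node.update(bcm_network="tenant-1"),
               lambda: node.update(bcm_network="management"))
    claim.step("cluster join", lambda: node.update(cluster="tenant-cluster-1"),
               lambda: node.update(cluster=None))
    claimed = dict(node)
    claim.release()
    return claimed, node, claim.log


if __name__ == "__main__":
    for line in demo()[2]:
        print(line)
```

Undoing in reverse means the node leaves the Tenant Cluster before its network is moved back to management.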

Things to Know Going In

A few practical notes so you can plan your rollout:

  • Servers need a BMC. Redfish or IPMI. Pretty much standard on any modern server-class hardware (Dell iDRAC, HPE iLO, Supermicro IPMI, NVIDIA DGX, etc.).
  • The first PXE boot of a fresh server takes minutes, not seconds. That's just how PXE works. vMetal's hot standby model handles this elegantly. Keep a warm pool of pre-provisioned nodes and tenant claims become near-instant network moves.
  • Network plumbing is upfront work. A bridge or macvlan into the provisioning network needs to be set up once. Your network team likely already does this for any bare metal automation; vMetal just needs a leg into it.
  • OS images must be HTTP-accessible. Local or authenticated image sources aren't supported directly, so plan to host your images on a reachable HTTP endpoint.
  • You need a vCluster Platform license that includes vMetal. The Control Plane Cluster must be connected to the platform.
  • Most managed Kubernetes services work fine as Control Plane Clusters. GKE Standard, EKS, and AKS managed node groups are all supported. But GKE Autopilot, EKS Auto Mode, and EKS Fargate are not supported because of restrictions on privileged workloads. If you go managed, also make sure you can route from the cluster's VPC to the BMC network (VPC peering or VPN, typically).

Who This Is Actually For

  • AI Clouds. The math is brutal. Every month spent building this internally is a month of GPU depreciation with no revenue. vMetal compresses time-to-launch.
  • Enterprise AI factories. Same self-service experience as cloud, on hardware you fully own.
  • Sovereign cloud providers. Full data residency, no public cloud dependency.

If you're a single team with one rack and one workload, you don't need this. If you have multiple tenants, multiple workload types, or multiple teams sharing hardware, this is the platform layer that solves it.

Wrap-Up

vMetal turns physical GPU servers into a programmable, tenant-isolated, cloud-like platform without a hypervisor. Built on Metal3 plus Ironic. Tenant Clusters via vCluster. Hot standby keeps claim times in seconds. Everything is Kubernetes-native YAML, GitOps-friendly, and there's a UI for the YAML-averse.

The thing that pushed me from "interesting" to "actually convinced" was the kubevirt repo. You can run the entire stack locally on a beefy laptop, watch a BareMetalHost go from Registering to Provisioned, and see a Tenant Cluster auto-claim it as a private node. If you're evaluating vMetal, start there before scheduling a vendor call.

📺 Watch the full video walkthrough: vMetal Deep Dive on YouTube

Sources & Working Code
