Tech Blog by vCluster

5 Kubernetes GPU Sharing Tools That Actually Work at Multi-Tenant Scale


Jun 27, 2026


Summary

Sharing GPUs among untrusted tenants at scale is a critical challenge, as standard time-slicing is insecure and typical cluster utilization remains below 30%.
Hardware-based solutions like NVIDIA MIG offer strong isolation but are operationally rigid, while software-based tools can add significant operational complexity without guaranteeing security.
Achieving robust tenant isolation requires a solution that isolates both the Kubernetes control plane and the individual workloads without sacrificing bare-metal performance.
vCluster Platform provides this full-stack isolation, allowing cloud providers and enterprises to run hundreds of secure tenant clusters on shared GPU hardware.

After months of dealing with GPU resource contention, most platform teams eventually reach the same conclusion: sharing a GPU between two pods is a solved problem. Sharing GPUs safely between untrusted tenants at scale is an entirely different beast.

The standard Kubernetes GPU allocation model, nvidia.com/gpu: 1, is brutally inefficient. GPU utilization in most clusters hovers below 30%, yet each workload holds an entire device hostage. For AI cloud providers, inference platforms, and enterprises building internal AI factories, the real objective is robust tenant isolation: getting multiple isolated customers onto shared hardware without sacrificing security, performance, or operational sanity.
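For reference, this is roughly what the whole-device model looks like in a pod spec. It is a minimal sketch: the pod name and the image tag are placeholders chosen for illustration, not something taken from a specific deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-inference        # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag; substitute your own
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1        # reserves one entire physical GPU, however little of it is used
```

The scheduler treats nvidia.com/gpu as an integer-only extended resource, so there is no way to ask for half a device here. Every sharing approach below works around that constraint in a different way.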

Generic guides gloss over this. They show you how to enable time-slicing with a ConfigMap and call it a day. What they don't tell you is that time-slicing on a multi-tenant cluster is, as one practitioner put it bluntly, "really not secure." The hard problems (fault isolation, noisy-neighbor prevention, control plane sprawl, and per-tenant observability) require a fundamentally different toolkit.

This article evaluates five practical tools and approaches for Kubernetes GPU sharing in multi-tenant production environments. For each, we cover the four criteria that actually matter in production:

Setup complexity — How hard is it to get running and keep running?
Isolation guarantees — Can one tenant crash or spy on another?
Observability support — Can you see what each tenant is actually consuming?
Production readiness — Is this proven at scale, or still a science experiment?

1. vCluster Platform + vNode: Full-Stack Tenant and Workload Isolation

Best for: AI cloud providers, neoclouds, and enterprises that need both control-plane isolation and workload isolation on shared GPU hardware.

vCluster Platform is the only solution in this list that addresses two distinct layers of the tenant isolation problem simultaneously: the orchestration layer (who gets which clusters) and the workload isolation layer (what happens inside the container runtime).

At the orchestration layer, vCluster Platform virtualizes the Kubernetes control plane itself. Each tenant gets a fully isolated, CNCF-certified Kubernetes cluster — complete with its own API server, etcd, RBAC, and CRDs — running as a lightweight pod on the host cluster. There's no need to provision separate physical clusters per tenant. Tenant clusters spin up in seconds, and the platform ships with fleet management UI, SSO, quotas, templates, GitOps/IaC integration, and full Day 2 operations out of the box.
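As a rough illustration of what a per-tenant control plane definition can look like, here is a sketch of a vcluster.yaml values file. The field names are an assumption based on the open-source vCluster configuration schema (v0.20 and later), so check them against the version you run. Which options a GPU tenant actually needs is also an assumption on our part.

```yaml
# vcluster.yaml for one tenant (sketch; verify field names against your vCluster version)
controlPlane:
  backingStore:
    etcd:
      deploy:
        enabled: true      # give this tenant its own etcd instead of a shared datastore
sync:
  fromHost:
    nodes:
      enabled: true        # expose real host nodes, so GPU capacity is visible inside the tenant cluster
    runtimeClasses:
      enabled: true        # assumed useful when GPU pods select a runtime class defined on the host
```

Each tenant would get its own copy of a file like this, typically in a dedicated host namespace, so that API server, etcd, RBAC, and CRDs never overlap between tenants.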

At the workload isolation layer, vNode provides kernel-native container breakout protection using seccomp, cgroups, Linux namespaces, and AppArmor — without any hypervisor. This directly answers the frustration practitioners have with VM-based isolation: you don't get bare-metal GPU performance when you're routing traffic through a hypervisor. vNode eliminates that tax entirely.

For operators starting from raw hardware, vMetal handles zero-touch bare metal provisioning — PXE boot, OS installation, network automation — and vCluster Standalone can run as a single binary directly on a Linux server, with no k3s, kubeadm, or other base Kubernetes distribution required. The result is a complete, integrated path from GPU racks to isolated tenant clusters.

Pros:

Defense-in-depth isolation: control plane virtualization (vCluster) + kernel-native workload isolation (vNode)
No hypervisor tax — bare metal GPU performance preserved
Tenant clusters spin up in seconds; no provisioning queue
Full Day 2 operations included: observability, updates, backups, compliance
Integrates with Run:AI, Ray, Jupyter, and Slurm via Certified Stacks

Cons:

Full performance and cost advantages are most pronounced on bare metal; public cloud VM deployments reduce some of the overhead delta
vNode is currently in private beta

Setup Complexity: Low. The commercial platform is designed for rapid onboarding, and the vCluster Standalone binary eliminates the need for a base K8s layer. See the documentation.

Isolation Guarantees: High. The combination of full control plane virtualization and kernel-native workload isolation represents one of the strongest tenant isolation postures available without switching to full VMs.

Observability Support: High. Centralized fleet management covers per-tenant cluster observability, usage metrics, and compliance reporting across all tenant clusters.

Production Readiness: High. Proven at 100K+ GPU nodes across 50+ GPU cloud and Fortune 500 customers including CoreWeave and JPMorganChase. Named in the NVIDIA DGX SuperPOD reference architecture.

2. NVIDIA GPU Operator (Time-Slicing)

Best for: Internal dev environments where all tenants are fully trusted.

Time-slicing via the NVIDIA GPU Operator is the most common entry point for Kubernetes GPU sharing. It works by creating logical device replicas — configure a replicas: 4 in a ConfigMap and your A100 advertises four nvidia.com/gpu resources to the scheduler. Pods share the physical GPU through context-switching.

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100-80gb: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4

It's simple. It's also dangerous in any real scenario requiring tenant isolation.

Pros:

Trivial to enable; no hardware requirements beyond any NVIDIA GPU
Enables high oversubscription ratios

Cons:

Zero memory or fault isolation. A memory leak or crash in one pod can bring down every other workload on the same GPU
Insecure for untrusted tenants: workloads draw on the same device memory with no enforced per-pod limit and no hardware partitioning
Performance is unpredictable due to context-switching overhead, which is especially problematic for latency-sensitive inference

Setup Complexity: Low. A ConfigMap change and a node label are all that's required to get started with the GPU Operator.
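The two pieces of wiring mentioned above can be sketched as follows. These are fragments meant to be merged into existing objects (for example with a patch), not complete manifests to apply as-is. The node name is hypothetical, and the ConfigMap name and key are the ones from the ConfigMap shown earlier.

```yaml
# Fragment of the GPU Operator's ClusterPolicy: point the device plugin at the ConfigMap.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: time-slicing-config                  # ConfigMap holding the sharing configuration
---
# Fragment of a Node object: select which ConfigMap key applies to this node.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01                              # hypothetical node name
  labels:
    nvidia.com/device-plugin.config: a100-80gb   # must match a key under the ConfigMap's data
```

Once the device plugin restarts with this configuration, the node advertises four nvidia.com/gpu resources per physical GPU. None of this adds any isolation between the pods that land on those replicas.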

Isolation Guarantees: None. This is the critical and often understated weakness. "If your GPU handles public info or multiple tenants, time slicing a GPU is really not secure."

Observability Support: Low. Standard DCGM metrics are available for the physical device, but attributing utilization and performance to individual pods sharing the same device is unreliable.

Production Readiness: Not suitable for untrusted tenants. Acceptable for homogeneous internal dev clusters where blast radius is contained.

3. NVIDIA MIG (Multi-Instance GPU)

Best for: Workloads requiring strong, hardware-enforced isolation on A100/H100 hardware.

NVIDIA MIG is a hardware feature available on data-center-class GPUs that partitions a single physical GPU into up to seven independent GPU Instances (GIs). Each instance gets its own dedicated memory, cache, and streaming multiprocessors — it appears to Kubernetes as a fully separate device. One H100 can run up to seven completely isolated GPU workloads simultaneously, which is a significant step forward compared to time-slicing.

The isolation is real and hardware-enforced. A fault in one MIG instance cannot affect another. Per-instance DCGM monitoring provides clean, accurate utilization data. For workloads where regulatory compliance or strict SLA guarantees are required, MIG is the gold standard at the hardware level.
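From the workload's side, a MIG slice is requested like any other extended resource. The sketch below assumes the GPU Operator is running with the mixed MIG strategy on an 80 GB A100 or H100, where the smallest profile is exposed as nvidia.com/mig-1g.10gb. The pod name and image tag are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tenant-a-inference         # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag; substitute your own
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one MIG instance: 1 compute slice, 10 GB of dedicated memory
```

Under the single strategy the same slice would instead be advertised as plain nvidia.com/gpu, so the resource name in the pod spec depends on how the operator is configured.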

The problems are well-documented in the community: MIG is rigid and operationally painful.

Pros:

Hardware-enforced memory and fault isolation — strongest GPU-level guarantee available
Each instance is independently monitorable via DCGM-Exporter
Well-supported by the NVIDIA GPU Operator's mig strategy

Cons:

Partition sizes are fixed and prescribed by NVIDIA — workloads must fit the available profiles, or you waste capacity
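Reconfiguring MIG profiles requires draining every workload from the affected GPU, and enabling or disabling MIG mode can require a GPU reset or, in some environments, a node reboot. The resulting downtime makes dynamic workload management difficult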
Only available on expensive A100/H100-class hardware — not applicable to older or consumer GPU fleets

Setup Complexity: Moderate. Requires MIG-capable hardware, enabling the mig strategy in the GPU Operator, and planning partition profiles in advance.
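The setup steps above can be sketched as two fragments: the strategy setting in the operator's ClusterPolicy, and a node label that tells the operator's MIG manager which partition layout to apply. The node name is hypothetical. The all-1g.10gb layout name assumes the operator's default MIG configuration and an 80 GB GPU, so verify it against the profiles your hardware supports.

```yaml
# Fragment of the ClusterPolicy: advertise each MIG profile as its own resource type.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  mig:
    strategy: mixed
---
# Fragment of a Node object: request a partition layout for every GPU on this node.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01                       # hypothetical node name
  labels:
    nvidia.com/mig.config: all-1g.10gb    # assumed layout name: seven 1g.10gb instances per GPU
```

Changing the label later triggers a repartition, which is where the operational cost shows up: the GPUs on that node have to be free of workloads before the new layout can be applied.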

Isolation Guarantees: High — the best hardware-level guarantee in this list. But isolation is locked to fixed partition sizes, limiting scheduling flexibility.

Observability Support: High. Each MIG instance is independently addressable and monitorable, making per-tenant GPU reporting straightforward.

Production Readiness: High, with caveats. Excellent for stable, predictable workloads. Problematic for dynamic environments where workload sizes change frequently.

4. HAMi Scheduler Extender

Best for: Teams that need hardware-agnostic fractional GPU allocation and have the platform engineering bandwidth to maintain custom scheduler components.

HAMi (Heterogeneous AI Computing Virtualization Middleware) is an open-source Kubernetes scheduler extender that virtualizes GPU resources at the software level. Rather than requesting whole GPU devices, pods can request fractional GPU memory or compute units. It works across a broader range of GPU hardware than MIG, making it attractive for mixed-hardware environments.
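A fractional request under HAMi might look like the sketch below. The nvidia.com/gpumem and nvidia.com/gpucores resource names follow the examples in the HAMi project README. The pod name, image tag, and the specific numbers are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-job         # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag; substitute your own
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1         # number of virtual GPUs requested
          nvidia.com/gpumem: 3000   # device memory for this vGPU, in MiB
          nvidia.com/gpucores: 30   # share of the GPU's compute, as a percentage
```

These limits are applied in software, so they behave differently from a MIG partition: the boundary is only as strong as the enforcement layer sitting between the container and the driver.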

Pros:

Hardware agnostic — works with GPU models that don't support MIG
Fine-grained fractional resource requests (memory, compute percentage)
Active open-source community and CNCF sandbox project

Cons:

High operational overhead. Running a custom scheduler extender introduces a new critical component into your control plane. If it breaks, scheduling breaks
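Isolation relies on software-level enforcement, not hardware guarantees. Memory limits are applied by intercepting CUDA calls inside the container rather than by partitioning the hardware, so they are not an absolute boundary between tenants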
Latency and performance behavior can be unpredictable, especially under contention

Setup Complexity: High. Deploying HAMi requires installing the scheduler extender, device plugin, and webhook components, then correctly integrating them with the existing Kubernetes scheduler. This is not a simple Helm install — it demands a thorough understanding of Kubernetes scheduling internals and careful ongoing maintenance.

Isolation Guarantees: Moderate. The scheduler enforces resource allocation at admission time, but runtime enforcement depends on driver-level behavior. There are no hardware-enforced memory boundaries between tenants sharing the same physical GPU.

Observability Support: Moderate. HAMi exposes metrics on GPU allocation and scheduling decisions, but deep per-workload performance visibility requires additional tooling layered on top.

Production Readiness: Moderate. Well-suited for organizations with dedicated platform engineering teams who can own and operate a custom scheduler component. The operational complexity is a significant adoption barrier for anyone without that specialization.

5. Run:AI

Best for: Enterprises and research institutions that need advanced workload scheduling, job queuing, and GPU utilization analytics on top of Kubernetes.

Run:AI is a commercial AI workload management platform that sits above Kubernetes and provides a sophisticated scheduling, queuing, and orchestration layer. It enables fractional GPU allocation, preemptible jobs, fair-share policies across teams, and workload gang scheduling — all the features that make large-scale AI infrastructure manageable.

The important distinction: Run:AI is a scheduling and orchestration tool, not an isolation technology. It governs how GPU resources are allocated and queued, but the actual tenant isolation is only as strong as the underlying Kubernetes setup. Pair it with namespaces and you get namespace-level isolation. Pair it with MIG and you get hardware-enforced GPU instance isolation. Pair it with vCluster Platform and you get full control plane isolation.
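The weakest of those pairings, namespace-level isolation, is plain Kubernetes. A minimal sketch with a hypothetical tenant name and an arbitrary cap of four GPUs:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a                     # hypothetical tenant
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: tenant-a
spec:
  hard:
    requests.nvidia.com/gpu: "4"     # extended resources are capped through the requests. prefix
```

A quota like this limits how many GPUs a tenant can claim, but it says nothing about what happens on the device or in the shared API server. That is why the isolation layer underneath the scheduler still matters.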

Pros:

Best-in-class workload scheduling: fractional GPUs, preemption, fair-share, gang scheduling
Rich observability dashboards for GPU utilization, job queues, and cluster-wide efficiency
Enterprise support, mature product, and a strong ecosystem

Cons:

Does not provide tenant isolation on its own — requires an underlying isolation mechanism
Additional licensing cost and vendor dependency
Adds another software layer to operate and upgrade alongside Kubernetes

Setup Complexity: Moderate. Run:AI ships as a Kubernetes-native install, but onboarding and integrating it with existing policies and tenancy models requires dedicated configuration work.

Isolation Guarantees: Depends entirely on configuration. Run:AI enforces resource quotas and scheduling policies, but memory-level and control-plane-level isolation requires a separate mechanism beneath it.

Observability Support: High. This is Run:AI's strongest dimension — detailed dashboards covering job-level GPU utilization, queue depth, and cluster efficiency make it a go-to for organizations focused on maximizing hardware ROI.

Production Readiness: High. A mature, enterprise-grade platform used across large AI research and production environments.

Integration Note: vCluster Platform offers Certified Stacks that include a pre-validated Run:AI environment. This lets you combine Run:AI's scheduling power with vCluster's tenant isolation — without custom integration work.

Choosing the Right Approach

No single tool solves every dimension of Kubernetes GPU sharing with tenant isolation at scale. The right answer depends on where your primary constraint sits:

Tool	Isolation Level	Setup Complexity	Observability	Production Ready
vCluster Platform + vNode	High (control plane + kernel)	Low	High	✓ 100K+ GPU nodes
NVIDIA MIG	High (hardware-enforced)	Moderate	High	✓ With caveats
NVIDIA Time-Slicing	None	Low	Low	⚠ Trusted only
HAMi	Moderate (software)	High	Moderate	⚠ Ops-heavy
Run:AI	Depends on config	Moderate	High	✓ Enterprise

If you're an AI cloud provider or building an internal GPU platform for untrusted workloads, the time-slicing path runs out quickly. MIG gives you hardware guarantees but locks you into rigid partitioning with painful operational overhead. HAMi offers flexibility at the cost of custom scheduler complexity. Run:AI excels at scheduling but delegates isolation to whatever sits beneath it.

Frequently Asked Questions

What is multi-tenant GPU sharing in Kubernetes?

GPU sharing for multiple tenants is the practice of allowing multiple users or customers (tenants) to run workloads on the same physical GPU hardware within a Kubernetes cluster while maintaining strict isolation. This approach aims to maximize the utilization of expensive GPU resources by securely and efficiently partitioning them among different tenants.

Why is NVIDIA's time-slicing not secure for multiple tenants?

NVIDIA's time-slicing is not secure for multiple tenants because it lacks memory and fault isolation. All workloads sharing the GPU draw on the same device memory with no enforced per-pod limit, so a crash, memory leak, or malicious actor in one pod can affect every other pod on that GPU. That makes it unsuitable for untrusted environments.

How does NVIDIA MIG differ from time-slicing?

NVIDIA MIG (Multi-Instance GPU) provides hardware-enforced isolation by partitioning a physical GPU into several smaller, independent GPU instances, each with its own dedicated memory and compute resources. Time-slicing, in contrast, is a software-based method where multiple processes take turns using the full GPU without any hardware separation, offering no real isolation.

What makes vCluster Platform a strong solution for tenant isolation?

vCluster Platform provides a unique, two-layer isolation model that secures both the Kubernetes control plane and the underlying workloads. It gives each tenant a virtual Kubernetes cluster for control plane isolation and uses kernel-native container protection to secure workloads, offering defense-in-depth without the performance overhead of traditional VMs.

Can different GPU sharing tools be used together?

Yes, combining GPU sharing tools is a common and powerful strategy. For example, you can layer a sophisticated scheduler like Run:AI on top of an isolation platform like vCluster Platform to get both advanced job management and strong tenant security, creating a best-of-both-worlds solution.

What is the biggest challenge in scaling multi-tenant GPU clusters?

The biggest challenge is achieving robust tenant isolation without sacrificing performance or operational simplicity. A scalable solution must prevent "noisy neighbor" problems, secure the control plane, provide per-tenant observability, and ensure fault isolation, all while keeping the platform manageable and cost-effective.

How do I choose the right GPU sharing tool?

The right tool depends on your trust model and operational needs. For trusted internal teams, time-slicing may suffice. For predictable workloads needing hardware isolation, NVIDIA MIG is excellent. For building a scalable, secure AI platform for untrusted users, a comprehensive solution like vCluster Platform is the most robust choice.

The only stack that solves both the control-plane isolation problem and the workload isolation problem — without a hypervisor tax — is vCluster Platform combined with vNode. With 100K+ GPU nodes in production and customer deployments across 50+ GPU clouds and Fortune 500 companies, it's also the most proven path from bare metal to isolated, managed Kubernetes at scale. Request a personalized demo to see how vCluster can solve your GPU sharing challenges at scale.

