AI Cloud Providers · ClusterMAX™ Criteria Guide

The Platform Layer That Upgrades Your ClusterMAX™ Score

ClusterMAX™ evaluates AI cloud providers across 10 dimensions. This guide maps each criterion to the specific capabilities in vCluster, vNode, and vMetal that help you improve in that area, including the Security criterion that now explicitly names vCluster as a requirement.

Trusted by sovereign AI cloud providers worldwide

Context

What is ClusterMAX™ and Why Does Your Rating Matter?

ClusterMAX, published by SemiAnalysis, has become the de facto standard for evaluating GPU cloud infrastructure. Enterprise AI teams consult it before selecting a provider. Improving your rating directly impacts deal velocity and customer trust.

10 Evaluation Dimensions

ClusterMAX scores providers across Security, Lifecycle, Orchestration, Storage, Networking, Reliability, Monitoring, Pricing, Partnerships, and Availability, covering the full infrastructure stack enterprise customers care about.

vCluster Explicitly Named

The updated Security criterion now explicitly lists “vCluster or similar isolation beyond container-based only” as a requirement, a direct signal to enterprise buyers that cluster-level isolation is non-negotiable.

Platform vs. Hardware

Many providers plateau in their ratings not because of hardware, but because they lack a managed platform layer. vCluster, vNode, and vMetal directly address the software and operational gaps ClusterMAX evaluates.

This guide walks through each ClusterMAX dimension and shows where vCluster, vNode, and vMetal can meaningfully move the needle.

Use it as a roadmap: identify where your current offering falls short, and understand exactly which product to deploy to close the gap.

Criteria Walkthrough

How the vCluster Platform Addresses Each ClusterMAX™ Dimension

ClusterMAX evaluates dozens of specific requirements across 10 dimensions. The sections below list the criteria in each subcategory, followed by how vCluster covers them.

ClusterMAX™ Criteria (1)

vCluster or similar isolation beyond container-based only

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (5)

Protection against container escalation vulnerabilities
Updated NVIDIA Container Toolkit preventing CVE-2024-0132 and related vulnerabilities
Protection against CVE-2025-23359, CVE-2025-23266
Automated rollout of new NVIDIA Container Toolkit versions upon CVE discovery
Part of NVIDIA security program for embargoed access to latest security patches

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (5)

VLAN isolation between tenants for RoCE
PKeys set for InfiniBand tenants
InfiniBand Security Keys Management (SMKey, SAKey, CKey, VSKey)
AM Key configuration (if SHARP is available)
SR-IOV with QP0 & MAD disabled when passing Virtual Function pointer into VM

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (8)

SOC 1 or SOC 2 Type II compliance
ISO 27001 certification
GDPR, PCI, HIPAA, FedRAMP compliance
Ability to sell to secure government customers globally
Penetration testing conducted (provider, scope, and coverage documented)
SOC 2 and penetration testing specifically cover InfiniBand/RoCEv2 fabric
Security firms with expertise in high-speed IB/ETH networking
Process in place for future security improvements

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (8)

Onboarding costs and delivery timeline adherence
Ease of onboarding (UI-based vs. manual Terraform)
Out-of-the-box GPU Direct RDMA (between NIC and GPU) setup
Fast onboarding process (UI-based provisioning vs. Terraform complexity)
Delivery date accuracy and meeting expectations
Out-of-the-box IB/RoCEv2 & NVIDIA drivers configuration
Performance optimization libraries (e.g., TogetherAI kernel collection)
Provisioning of CPU head node for SLURM without explicit request

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (2)

Knowledge of industry experts
Understanding of standard ML user expectations

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (5)

Audit logs capturing resource actions (create/start/stop/delete), administrative actions, and billing events
Audit log entries include actor identity (user ID, email, IP address), action details, target resource, timestamp, and success/failure status
Audit logs queryable via API with filtering by resource type, project, user, and date range
Minimum 90-day audit log retention with export capability
Audit log access restricted to administrators with no additional usage charges

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (6)

Automated setup process for Kubernetes
Automated managed Kubernetes service
kubectl access or KUBECONFIG provided
Easy access to kube-dashboard, Lens, etc.
Storage accessible via PVC + hostpath + S3
CUDA_VISIBLE_DEVICES properly configured

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (4)

Easy process for adding new cluster users
RBAC and SSO implementation
No SSH key copying required
Storage RBAC enforcement

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (7)

Self-service capabilities (e.g., Lambda 1CC, Nebius managed operator)
Automated setup process for SLURM
Automated managed SLURM service
Head node provisioning
Out-of-the-box SLURM topology configuration
SLURM modules availability
Pyxis container plugin support

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (4)

Storage integration with Kubernetes for PVCs/storage class
Proper mounting configuration out-of-the-box
Out-of-the-box parallel filesystem (e.g., Weka, DDN, VAST)
Out-of-the-box managed S3-compatible object storage

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (4)

Automated backups with configurable retention policy for file systems, object storage, and databases
Cross-region replication or backup for disaster recovery
Snapshot support for persistent volumes (CSI snapshots for Kubernetes PVs)
Centralized backup monitoring, alerting, and restore validation

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (10)

Mount reliability (no random flaking on and off)
Read performance testing
Write performance testing
Throughput and latency measurements
Scalability testing for performance and capacity
Point-in-time recovery for managed databases
Immutable or WORM-capable storage for ransomware protection and compliance
Published durability and availability SLAs for all storage tiers
Backup encryption at rest and in transit with customer-managed key (CMK/BYOK) support
Checkpoint storage durability for training workloads (replication factor, cross-AZ guarantees)

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (5)

InfiniBand or RoCEv2 support
MPI distribution using higher performance hpc-x mpirun
Proper NCCL configuration (/etc/nccl.conf)
NCCL_IB_GID_INDEX=3 set for RoCEv2
NCCL_MIN_NCHANNELS, NCCL_PROTO, NCCL_ALGO NOT set (auto-configuration)

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (5)

NCCL monitoring plugin availability
SHARP support for enhanced performance
Network bandwidth and latency testing
4-node NCCL test within specification
PyTorch layer network performance within spec

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (7)

NVLINK connectivity and error tracking (critical for NVL72)
Network stability assessment
Link flap monitoring and prevention
Ethernet and InfiniBand event monitoring (Link Flaps)
InfiniBand health monitoring (link status, error counters, PKey consistency)
InfiniBand link status validation (ibstat)
Partition Key (PKey) consistency across nodes

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (18)

Kubernetes node health checks
Automated node draining and replacement
GPU falling off the bus detection
PCIe error monitoring
Thermal monitoring (GPU temperature)
GPU and CPU memory stats (ECC error rate)
NVIDIA XID and SXID error code detection
NCCL and SLURM topology health
Driver and core library version consistency across nodes
ECC error detection
Temperature monitoring and throttling alerts
Power monitoring and utilization tracking
NVIDIA XID/SXID error detection (through DCGM)
PCIe bus and power state health
Error counter monitoring (retries, dropped packets)
NCCL operation health tracking
MSA SLA evaluation (99%, 99.9%, etc.)
IPMI exporter and fan speed monitoring

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (6)

ncu profiling available for all users
Out-of-the-box detailed managed Grafana
Real-time system monitoring
Performance tracking
Resource utilization monitoring
TFLOPs estimation tracking

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (1)

Alerting capabilities

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (5)

Automated Active and Passive Health Checks
Comprehensive passive health check implementation
Diagnostic tools
Burn-in test documentation
Automated active health check implementation

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (2)

Automatic node draining for detected issues
AI model system for failure prediction

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (7)

Tenant cluster sharing and metering with chargeback/showback
Individual charges for storage, compute nodes, network vs. bundled pricing
Low $/GPU/hr pricing
Consumption model options (3-year, 1-year, 6-month, 3-month, 1-month)
Expansion and extension of existing contracts
Latest GPU availability and timeline
Kernel library availability for MFU boosting (e.g., TogetherAI kernel collection)

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (5)

Ecosystem support and integration
NVIDIA NCP or Lepton certification
AMD Cloud Alliance status
AMD or NVIDIA investment
SchedMD partnership (makers of SLURM)

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (1)

Participation in industry events

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (4)

Latest GPU models available (H200, B200, GB200, NVL72)
Current GPU models (H100, H200, A100, L40S, MI300X)
B200, B300, GB200 NVL72 availability timeline
MI355X availability planning

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (2)

Total quantity of GPUs and cluster scale experience
Geographic reach and service accessibility

How vCluster Helps

Additional Requirements

ClusterMAX™ Criteria (3)

Capacity planning capabilities
Availability/utilization rates
Roadmap for future GPU acquisitions/upgrades

How vCluster Helps

Additional Requirements

Dedicated Control Plane

vCluster provides each tenant with a dedicated Kubernetes control plane — isolated API server, RBAC, CRDs, and controllers. This eliminates cross-tenant API access entirely. Pair with vNode for kernel-level workload isolation and Private Nodes for GPU workloads requiring exclusive physical access.

Dedicated control plane per tenant
Private Nodes
vNode (optional)

Security

Control Plane & Workload Isolation

Active

Kernel-Level Isolation

vNode enforces kernel-native isolation using Linux user namespaces, seccomp profiles, and cgroup separation. Container escape vulnerabilities are contained to the affected workload — they cannot reach other tenants or the host.

vNode

Security

Prevent Container Breakouts

Active

Zero-Day CVE Protection

vNode protects against these vulnerabilities even without patching. The goal is not only to patch known container breakouts but also to prevent attacks that exploit future zero-days that are not yet known. vCluster Platform automates installation of the GPU Operator and NVIDIA Container Toolkit and makes it easy to keep them up to date.

vNode

Security

Prevent Container Breakouts

Active

Automated Toolkit Updates

Fully automated rollout of new NVIDIA Container Toolkit versions is possible with vCluster Templates.

Templates

Security

Prevent Container Breakouts

Active

Not Applicable

Not applicable to vCluster Labs as a software vendor.

Security

Prevent Container Breakouts

RoCE Tenant Isolation

Per-tenant network isolation for RoCE is automated at provisioning time via the network fabric partner. vCluster coordinates and instructs the underlying networking tool, including Netris, OpenStack Neutron, and Multus / SR-IOV for RDMA. It also provides simple webhooks for homegrown network automation tools.

Security

Network Isolation

Active

InfiniBand PKey Isolation

Partition Key assignment per InfiniBand tenant is automated at provisioning time via the network fabric partner. vCluster integrates with Netris, which orchestrates PKey assignment through NVIDIA UFM.

Security

Network Isolation

Active

IB Security Key Management

InfiniBand fabric key management is handled at the network infrastructure layer via NVIDIA UFM and OpenSM. vCluster integrates with Netris, which provides configuration guidance for M_Key and VS_Key within UFM.

Security

Network Isolation

Active

SHARP AM Key Config

SHARP Aggregation Manager key configuration is handled at the fabric layer by the network infrastructure partner when SHARP is deployed. Only applicable if SHARP-capable Mellanox switches are present.

Security

Network Isolation

Active

Bare Metal Architecture

vCluster Private Nodes is focused on bare metal rather than VMs for AI clusters, eliminating the need for SR-IOV VF partitioning. This is a host-layer concern for hypervisor-based environments.

Security

Network Isolation

Self-Hosted Software Model

vCluster is self-hosted software: vCluster Labs does not process or store customer data. Its software features, such as automated backups via vCluster Snapshots, can help operators meet SOC 2 requirements.

Security

Certifications

ISO 27001 Scope

Security

Certifications

Compliance-Ready Controls

Compliance obligations reside within the operator's deployment environment. vCluster provides the technical controls required to operate within compliant environments: RBAC, audit logging, network isolation, and FIPS-compliant images (Enterprise). FIPS version of images is particularly relevant for federal-related business.

RBAC
Audit Logging
Network Isolation
FIPS Images

Security

Certifications

Sovereign Deployment Support

vCluster's self-hosted model supports air-gapped, on-prem, and sovereign deployments — customer data never leaves the operator's environment. FedRAMP and ITAR eligibility depends on the operator's own certifications.

Security

Certifications

Pen Testing Reports

Third-party penetration testing has been conducted on vCluster Platform and vNode. Reports are available under NDA on request. This does not replace the operator's own audit, but it helps ensure that our software does not raise findings during it.

Security

Certifications

IB/RoCEv2 Pen Testing

Third-party penetration testing has been conducted on vCluster Platform and vNode. Reports available under NDA on request.

Security

Certifications

Expert Fabric Partners

Fabric-layer security assessments require a partner with InfiniBand/Ethernet expertise. vCluster works with operators to validate tenant isolation at the control plane and networking layers.

Security

Certifications

Continuous Security Improvement

vCluster supports AI cloud providers here by enabling an automation-first approach (Templates and similar features). vCluster also maintains a published vulnerability disclosure process.

Security

Certifications

Fast Tenant Onboarding

vCluster Platform reduces time-to-first-cluster to minutes, eliminating the weeks typically required for manual Kubernetes infrastructure setup. Tenant onboarding can be fully automated with vCluster Platform.

Templates
Automations

Lifecycle

Tenant Provisioning

Active

Self-Service Provisioning

As with hyperscalers, everything can be fully automated, all standard IaC/GitOps provisioning tools and flows are supported, and the Platform UI offers a great UX.

vCluster provisions a fully managed Kubernetes environment in seconds via UI, API, or Kubernetes CRDs — no new physical infrastructure, no manual Terraform. vMetal automates the full bare metal lifecycle: PXE boot, OS provisioning, and node registration.

UI provisioning
API provisioning
K8s CRD-based provisioning
GitOps-compatible

Lifecycle

Tenant Provisioning

Active

GPU Direct RDMA Setup

vCluster Templates delivers production-ready GPU tenant clusters with GPU Direct RDMA configured out of the box. GPU Operator, Network Operator, and nvidia-peermem are deployed via standard Helm or GitOps workflows. No new physical infrastructure is required; this works on any vCluster with bare metal GPU nodes.

GPU Operator RDMA configuration
NVIDIA Network Operator (MOFED)
nvidia-peermem

Lifecycle

Tenant Provisioning

Active

Instant Cluster Provisioning

vCluster provisions a fully managed Kubernetes environment in seconds via UI, API, or K8s CRDs — customers receive a kubeconfig immediately with no manual handoff. vMetal automates bare metal GPU node provisioning end-to-end: PXE boot, OS install, and cluster registration — replacing Terraform complexity with a declarative self-service workflow that is fully GitOps-compatible.

UI provisioning
API provisioning
K8s CRD-based GitOps provisioning

Lifecycle

Tenant Provisioning

Active

Predictable Delivery

Our automations make delivery faster and more predictable. In addition, our Customer Engineering Team is available to support complex new data center setups.

Lifecycle

Tenant Provisioning

Active

MOFED Driver Lifecycle

vCluster Templates manages MOFED driver and NVIDIA Container Toolkit lifecycle fleet-wide via the Network Operator. No per-node manual intervention required. Driver versions stay consistent across nodes and updates roll out through standard Kubernetes operator reconciliation.

GPU Operator driver management
NVIDIA Network Operator MOFED
IB/RoCEv2 driver lifecycle

Lifecycle

Tenant Provisioning

Active

Performance Testing (Roadmap)

Coming soon

Automated performance testing for new tenant environments is on the roadmap. Currently, vCluster helps with manual performance testing but an in-product solution is coming soon.

Lifecycle

Tenant Provisioning

Active

SLURM Platform (Roadmap)

Coming soon

vCluster will launch a SLURM solution in H2 2026 that provides fully automated SLURM cluster provisioning and day 2 operations.

Lifecycle

Tenant Provisioning

Active

AI Cloud Industry Experience

vCluster has worked with some of the biggest AI cloud providers such as CoreWeave since 2021 and gained valuable experience since the earliest days of the AI cloud industry.

Lifecycle

Industry Experience

Active

Resource Lifecycle Logging

vCluster Platform exposes all metrics and Kubernetes API audit logs for all resource lifecycle events (create, update, delete) and administrative actions. Integration with billing systems requires custom work today; automated in-product integration is coming soon. Each log entry includes actor identity, resource type, action, and timestamp.

K8s API audit logging
Resource lifecycle events
Admin action logging

Lifecycle

Audit Logs

Active

Actor Attribution

vCluster Platform captures full actor attribution in audit logs via standard Kubernetes audit policy: user ID, source IP, action verb, target resource, and request status (success/failure). This meets ClusterMAX's actor-attribution requirement for each logged event.

K8s audit policy: user
sourceIP
verb
resource
responseStatus
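
As an illustration of the fields listed above, the following minimal sketch pulls actor attribution out of a single Kubernetes audit event. It assumes the standard audit.k8s.io/v1 Event JSON shape, where the source address is carried in a list named sourceIPs. The function name, the sample event, and the rule that treats 2xx/3xx response codes as success are illustrative assumptions, not part of vCluster Platform.

```python
import json


def actor_attribution(event: dict) -> dict:
    """Extract actor-attribution fields from one audit.k8s.io/v1 Event."""
    obj = event.get("objectRef") or {}
    code = (event.get("responseStatus") or {}).get("code")
    return {
        "user": (event.get("user") or {}).get("username"),
        "source_ips": event.get("sourceIPs", []),
        "verb": event.get("verb"),
        "resource": obj.get("resource"),
        "namespace": obj.get("namespace"),
        "name": obj.get("name"),
        "timestamp": event.get("requestReceivedTimestamp"),
        # Assumption: any 2xx/3xx HTTP status counts as a successful request.
        "succeeded": code is not None and 200 <= code < 400,
    }


# Hypothetical sample event, shaped like a line from an audit log file.
SAMPLE_EVENT = json.loads("""
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "stage": "ResponseComplete",
  "verb": "delete",
  "user": {"username": "alice@example.com"},
  "sourceIPs": ["203.0.113.7"],
  "objectRef": {"resource": "pods", "namespace": "team-a", "name": "trainer-0"},
  "responseStatus": {"code": 200},
  "requestReceivedTimestamp": "2025-01-15T10:30:00.000000Z"
}
""")

if __name__ == "__main__":
    print(json.dumps(actor_attribution(SAMPLE_EVENT), indent=2))
```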

Lifecycle

Audit Logs

Active

Log Query & Filtering

Audit logs are accessible via the Kubernetes API. Full filtering by resource type, user, and date range depends on the log aggregation backend the operator connects (e.g., Loki, OpenSearch, Elasticsearch). vCluster Platform surfaces the audit log stream — additional queryability at scale can be achieved with a connected aggregation layer.

K8s audit log stream
log aggregation integration
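
To make the filtering requirement concrete, here is a small sketch of the kind of query an aggregation layer would run over the audit stream: filter by resource type, user, and date range. It operates on already-parsed audit.k8s.io/v1 events in memory; the function names and sample events are hypothetical, and a real deployment would push these filters down into the connected backend (Loki, OpenSearch, Elasticsearch) rather than scan in Python.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def filter_events(events, resource=None, user=None, start=None, end=None):
    """Return events matching every filter that is not None."""
    matched = []
    for event in events:
        ts = parse_ts(event["requestReceivedTimestamp"])
        if resource is not None and (event.get("objectRef") or {}).get("resource") != resource:
            continue
        if user is not None and (event.get("user") or {}).get("username") != user:
            continue
        if start is not None and ts < start:
            continue
        if end is not None and ts >= end:
            continue
        matched.append(event)
    return matched


# Hypothetical events for illustration only.
EVENTS = [
    {"verb": "create", "user": {"username": "alice"},
     "objectRef": {"resource": "pods"},
     "requestReceivedTimestamp": "2025-01-10T08:00:00.000000Z"},
    {"verb": "delete", "user": {"username": "bob"},
     "objectRef": {"resource": "pods"},
     "requestReceivedTimestamp": "2025-02-01T09:30:00.000000Z"},
    {"verb": "create", "user": {"username": "alice"},
     "objectRef": {"resource": "secrets"},
     "requestReceivedTimestamp": "2025-02-03T12:00:00.000000Z"},
]

if __name__ == "__main__":
    february = datetime(2025, 2, 1, tzinfo=timezone.utc)
    for e in filter_events(EVENTS, resource="pods", start=february):
        print(e["user"]["username"], e["verb"])
```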

Lifecycle

Audit Logs

Active

Flexible Log Retention

Audit logs can be persisted to industry-standard databases and log systems. Retention periods can be set individually, with no platform-imposed limit.

Lifecycle

Audit Logs

Active

Admin-Only Access

vCluster Platform's RBAC layer allows configuration of who can see the audit log. Default permissions are restricted to platform administrators only — tenant users cannot access audit logs by default. No additional usage charges apply for audit log access.

RBAC-restricted audit log access
no per-access charges

Lifecycle

Audit Logs

Active

Declarative K8s Provisioning

vCluster provisions a fully managed Kubernetes environment via CRDs, API, or UI in seconds — no manual cluster setup, no Terraform, no infrastructure provisioning required. The entire setup is declarative and GitOps-compatible.

CRD-based provisioning
API provisioning
UI provisioning
GitOps-compatible

Orchestration

Self-Service

Active

Managed K8s Per Tenant

vCluster is a fully managed Kubernetes environment provisioned on demand via API or UI. Each tenant gets their own API server, RBAC, and namespaces — isolated from all other tenants — without requiring separate physical clusters or manual cluster management by the provider.

Dedicated API server per tenant
isolated RBAC
no shared control plane

Orchestration

Self-Service

Active

Instant kubeconfig Delivery

On provisioning, vCluster automatically generates and delivers a kubeconfig to the customer. They have immediate kubectl access and can use Helm, Lens, k9s, or any standard Kubernetes tooling — no SSH setup, no firewall rules, no manual steps from the operator.

Auto-generated kubeconfig at provisioning
immediate kubectl access

Orchestration

Self-Service

Active

Universal Tooling Compatibility

Standard kubeconfig works with any Kubernetes tooling out of the box — Lens, k9s, kube-dashboard, Headlamp all connect normally. No additional configuration required from the operator or the tenant.

Standard kubeconfig compatibility with all K8s tooling

Orchestration

Self-Service

Active

GPU Device Assignment

CUDA_VISIBLE_DEVICES is automatically configured by the NVIDIA device plugin (part of GPU Operator) when GPUs are requested via Kubernetes resource limits. The device plugin injects GPU device assignments as environment variables into containers at scheduling time — no manual configuration required.

NVIDIA device plugin envvar injection
GPU resource limits
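
A minimal sketch of both sides of this flow: the pod spec requests GPUs through the nvidia.com/gpu resource limit, and the workload reads back the devices it was assigned. The helper names are hypothetical. Some container runtime setups expose the assignment as NVIDIA_VISIBLE_DEVICES rather than CUDA_VISIBLE_DEVICES, so the fallback below is an assumption about the environment, not documented vCluster behavior.

```python
import json
import os


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest that requests `gpus` GPUs via resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


def visible_gpus(env=None) -> list:
    """Return the device identifiers injected into the container, if any."""
    env = os.environ if env is None else env
    # Assumption: fall back to NVIDIA_VISIBLE_DEVICES when the CUDA variable is absent.
    raw = env.get("CUDA_VISIBLE_DEVICES") or env.get("NVIDIA_VISIBLE_DEVICES") or ""
    return [device for device in raw.split(",") if device]


if __name__ == "__main__":
    # The JSON output can be piped to `kubectl apply -f -`.
    print(json.dumps(gpu_pod_manifest("trainer", "example.com/trainer:latest", 2), indent=2))
    print("visible GPUs:", visible_gpus())
```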

Orchestration

Self-Service

Active

PVC + hostPath + S3

PVC and hostPath storage work natively in vCluster. PVCs are synced to the host cluster by default, and vCluster includes a local path provisioner for dynamic PVC provisioning via hostPath with no additional setup. S3-compatible object storage is also accessible via an S3 CSI driver deployed on the host cluster.

PVC via CSI passthrough
hostPath
S3 via operator storage setup

Orchestration

Self-Service

Active

User Management & SSO

vCluster Platform provides fully featured user management. Users can be manually onboarded via email/password but SSO is also supported. OIDC/SAML and other SSO integrations enable secure and automated user onboarding and offboarding. Automated permission and key management ensures secure access for new users.

SSO/OIDC integration
RBAC role assignment

Orchestration

User Management & Access Control

Active

Per-Tenant RBAC & SSO

Each tenant cluster has its own isolated RBAC layer and OIDC/SSO integration. Providers connect their existing identity provider (Okta, GitHub, Azure AD) so customers authenticate via SSO and receive role-scoped kubeconfigs — no shared identity namespace between tenants.

Per-vCluster isolated RBAC
OIDC/SSO integration
customer brings their IdP

Orchestration

User Management & Access Control

Active

No SSH Required

vCluster eliminates SSH-based cluster access entirely. Customers receive a kubeconfig and authenticate via OIDC/SSO — no SSH key distribution, no per-node access management, no operator involvement for adding or removing cluster access.

kubeconfig-based access
no key distribution

Orchestration

User Management & Access Control

Active

Storage Access Controls

Kubernetes RBAC controls storage access at the namespace and PVC level. Per-tenant isolation means tenants cannot access PVCs outside their own cluster — cross-tenant storage access is structurally prevented.

K8s RBAC for storage
per-tenant PVC isolation

Orchestration

User Management & Access Control

Active

Pyxis Not Supported

Pyxis is a container runtime plugin for SLURM, built by NVIDIA. Not currently supported.

Orchestration

SLURM

Parallel Filesystem Support

vCluster supports any storage system, including Weka, DDN, VAST, or anything else exposed through a CSI driver. Procuring and installing the filesystem is the operator's responsibility and a prerequisite for using it in vCluster.

CSI passthrough
host storage class inheritance

Storage

Storage Provisioning

S3 Object Storage

S3-compatible object storage is the operator's responsibility to provision and expose. vCluster does not manage object storage directly. Tenants can access operator-provisioned S3 endpoints from within their tenant cluster using standard Kubernetes secrets and environment variables.

Storage

Storage Provisioning

PVC & Storage Class Support

vCluster supports any storage system, including Weka, DDN, VAST, or anything else exposed through a CSI driver. PVC provisioning is identical to a native cluster, with no extra configuration required. Procuring and installing the filesystem is the operator's responsibility and a prerequisite.

CSI passthrough
host storage class inheritance
PVC provisioning identical to native cluster

Storage

Storage Provisioning

Active

CSI Mount Passthrough

CSI passthrough means mounts are handled by the tenant cluster's CSI driver — no extra configuration required.

CSI passthrough mount handling

Storage

Storage Provisioning

Active

Automated Backup & Restore

vCluster Platform provides backup and restore for tenant cluster state and PV data via volume snapshots, with configurable scheduling and retention. We recommend configuring application-specific (database) backups in addition.

Storage

Storage Backups

Active

Multi-Region DR

vCluster Platform multi-region mode provides control plane DR across regions. Storage-layer cross-region replication is determined by the operator storage infrastructure and requires appropriate configuration of the storage system.

vCluster Platform multi-region mode

Storage

Storage Backups

Active

PV Snapshot Support

vCluster supports PV snapshots — tenants trigger VolumeSnapshots via standard Kubernetes APIs and vCluster handles the rest. Automated backups can be configured on a time interval or run via CRON schedule.

VolumeSnapshot support
CSI snapshot passthrough
Platform backup/restore
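
For reference, a tenant-triggered snapshot through the standard Kubernetes API looks roughly like the manifest built below. This is a sketch that assumes the CSI snapshot CRDs (snapshot.storage.k8s.io/v1) are installed; the helper name and the resource, namespace, PVC, and snapshot class names are placeholders.

```python
import json


def volume_snapshot(name: str, namespace: str, pvc: str, snapshot_class: str) -> dict:
    """Build a VolumeSnapshot manifest that snapshots an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc},
        },
    }


if __name__ == "__main__":
    # Placeholder names; the JSON output can be piped to `kubectl apply -f -`.
    manifest = volume_snapshot("checkpoints-snap", "team-a", "checkpoints", "csi-snapclass")
    print(json.dumps(manifest, indent=2))
```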

Storage

Storage Backups

Active

Backup Monitoring & Validation

vCluster Platform provides automated backups and exposes their status and metadata, including the ability to verify that a backup is valid.

Storage

Storage Backups

Active

Storage Integration Support

vCluster works with any high-performance storage solution, and our team provides hands-on support in configuring the storage integrations and automations for a particular data center setup.

Storage

Storage Performance & Security

IB & RoCEv2 Fabric Automation

InfiniBand and RoCEv2 fabric automation is provided by the network infrastructure partner. vCluster integrates with Netris, which automates east-west fabric configuration, tenant isolation, PKey assignment via UFM, and Spectrum-X host networking via the NHN plugin.

Netris fabric automation
NVIDIA UFM integration
Spectrum-X NHN plugin

Networking

Network Setup

Active

Bare Metal MPI Performance

vCluster's Private Nodes model runs MPI workloads directly on bare metal, with no virtualization layer between the MPI processes and the network fabric. MPI, TorchElastic, Ray, and JAX perform as they would on a native cluster with zero overhead added. HPC-X installation and configuration are the operator's responsibility.

Private Nodes bare-metal data path
zero MPI scheduling overhead

Networking

Network Setup

NCCL Config Responsibility

NCCL configuration (/etc/nccl.conf) is set by the operator in their GPU node OS image. vMetal provisions nodes using the operator's own ISO — no OS-level modifications are made. Operators include NCCL config in their base image or deploy it via init scripts.

Networking

Network Setup

GID Index Auto-Select

NCCL_IB_GID_INDEX=3 is a pre-NCCL 2.21 requirement. Since NCCL 2.21, GID index is auto-selected based on active link layer — manual configuration is no longer needed. For operators running NCCL 2.21+, this is handled automatically.
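
The version rule above can be expressed as a small helper that an operator might use when building a node image. This is a sketch only: the function names are hypothetical, and the 2.21 threshold is taken from the statement above rather than independently verified here.

```python
def parse_nccl_version(version: str) -> tuple:
    """Turn a string such as '2.20.5' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def roce_gid_env(nccl_version: str) -> dict:
    """Return the GID-related environment to set for RoCEv2 on this NCCL version."""
    if parse_nccl_version(nccl_version) < (2, 21):
        # Older NCCL: pin the GID index explicitly, as the criterion describes.
        return {"NCCL_IB_GID_INDEX": "3"}
    # NCCL 2.21+: leave the variable unset so the index is auto-selected.
    return {}


if __name__ == "__main__":
    for version in ("2.18.3", "2.21.5", "2.23.4"):
        print(version, roce_gid_env(version))
```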

Networking

Network Setup

NCCL Auto-Tuning

vCluster Platform and vMetal do not set NCCL_MIN_NCHANNELS, NCCL_PROTO, or NCCL_ALGO. vMetal provisions nodes using the operator's own OS image without modification — no NCCL environment variables are injected at any layer. NCCL auto-tuning runs unobstructed.
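
An operator who wants to confirm that auto-tuning is unobstructed on a node could run a check along these lines. It is a sketch that looks for the three variables in the process environment and in the text of an NCCL config file, assuming the usual KEY=VALUE line format with # comments; the function name is hypothetical.

```python
import os

TUNING_VARS = ("NCCL_MIN_NCHANNELS", "NCCL_PROTO", "NCCL_ALGO")


def pinned_tuning_vars(env, conf_text: str = "") -> list:
    """Return the tuning variables that are set in `env` or in the config text."""
    found = {name for name in TUNING_VARS if name in env}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key = line.split("=", 1)[0].strip()
        if key in TUNING_VARS:
            found.add(key)
    return sorted(found)


if __name__ == "__main__":
    conf_text = ""
    if os.path.exists("/etc/nccl.conf"):
        with open("/etc/nccl.conf") as handle:
            conf_text = handle.read()
    pinned = pinned_tuning_vars(os.environ, conf_text)
    print("auto-tuning unobstructed" if not pinned else "pinned: " + ", ".join(pinned))
```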

Networking

Network Setup

SHARP Collective Operations

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) accelerates NCCL collective operations when NVIDIA Quantum InfiniBand switches are present. Configuration is managed via NVIDIA UFM through the network fabric partner.

Networking

Network Performance

Active

NCCL Test Validation

The 4-node NCCL bandwidth test is a provider validation benchmark: the operator runs and certifies this against their hardware configuration. vCluster's Private Nodes model ensures no platform overhead is added to the NCCL data path, but benchmark execution and pass/fail certification are the operator's responsibility.

Networking

Network Performance

PyTorch Native Performance

PyTorch distributed performance benchmarking is a provider validation responsibility. vCluster adds no overhead to the GPU compute or network data path — PyTorch workloads on Private Nodes run at native bare-metal performance.

Networking

Network Performance

NCCL Observability

NCCL communication observability is available via NVIDIA DCGM Exporter (deployed as part of GPU Operator) for GPU-level metrics, and via NVIDIA's NCCL Inspector Profiler Plugin (released December 2025) for per-communicator, per-collective performance monitoring. Both integrate with Prometheus and Grafana.

DCGM Exporter
NCCL Inspector Profiler Plugin
Prometheus
Grafana

Networking

Network Performance

Active

Automated Network Testing

Coming soon

Automated performance testing for new tenant environments is on the roadmap. Currently, vCluster helps with manual performance testing but an in-product solution is coming soon.

NCCL communication observability is available via NVIDIA DCGM Exporter, which deploys automatically as part of GPU Operator.

Networking

Network Performance

Active

Fabric Stability Monitoring

Network stability monitoring across the IB and Ethernet fabric is managed by the network infrastructure partner. Netris provides continuous switch-level health monitoring including interface status, BGP state, topology/wiring errors, and hardware health across all managed nodes.

Netris fabric monitoring
K8s node conditions

Reliability

Network Resilience

Active

Link Flap Monitoring

InfiniBand and Ethernet link flap monitoring is handled at the network infrastructure layer by the fabric partner. Netris provides continuous interface up/down monitoring across all managed switches and surfaces events to the operator.

Netris interface monitoring

Reliability

Network Resilience

Active

Switch Event Monitoring

InfiniBand and Ethernet link flap events are monitored at the fabric layer by the network infrastructure partner. Netris provides continuous interface status monitoring across managed switches, alerting on link state changes.

Netris interface monitoring
NVIDIA UFM

Reliability

Network Resilience

Active

IB Health Monitoring

InfiniBand health monitoring — link status, error counters, and PKey consistency — is managed at the fabric layer via NVIDIA UFM, orchestrated by the network infrastructure partner. Netris integrates with UFM for IB fabric management and monitoring.

Netris-UFM integration
NVIDIA UFM IB monitoring

Reliability

Network Resilience

Active

NVLink Error Tracking

DCGM Exporter tracks NVLink bandwidth and error metrics per GPU, which is critical for NVL72 configurations. It is deployed via GPU Operator on vCluster, and operators configure Prometheus alerts for NVLink error thresholds.

DCGM NVLink bandwidth and error metrics
Prometheus alerting

Reliability

Network Resilience

Active

IB Link Status Validation

InfiniBand link status monitoring is managed at the fabric layer via NVIDIA UFM, orchestrated by the network infrastructure partner. Netris automates fabric configuration. Operators requiring ibstat-level granularity must run IB diagnostic tooling directly.

NVIDIA UFM IB monitoring

Reliability

Network Resilience

Active

PKey Consistency Enforcement

PKey consistency across InfiniBand nodes is maintained by Netris via its NVIDIA UFM integration. The Netris-UFM reconciliation loop (10-second interval) continuously verifies and enforces PKey assignments across all fabric nodes.
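A reconciliation pass of this kind can be pictured as a diff between desired and observed state. The sketch below is a generic illustration only, not Netris or UFM code: the function name, node names, PKey values, and data shapes are all assumptions made for the example.

```python
from typing import Dict, Set


def pkey_drift(desired: Dict[str, Set[int]],
               actual: Dict[str, Set[int]]) -> Dict[str, Dict[str, Set[int]]]:
    """Compare desired and observed PKey membership per node.

    Returns, for every node that is out of sync, the PKeys to add and remove.
    """
    drift = {}
    for node in sorted(desired.keys() | actual.keys()):
        want = desired.get(node, set())
        have = actual.get(node, set())
        if want != have:
            drift[node] = {"add": want - have, "remove": have - want}
    return drift


# Hypothetical node names and PKey values, for illustration only.
desired = {"node-a": {0x8001}, "node-b": {0x8002}}
actual = {"node-a": {0x8001}, "node-b": {0x8003}}
print(pkey_drift(desired, actual))
```

A real loop would run a comparison like this on every interval and push the resulting corrections through the fabric manager; an empty result means the fabric already matches the desired assignments.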

Netris-UFM PKey reconciliation

Reliability

Network Resilience

Active

SLA Architecture Support

SLA commitments are the operator's responsibility to define and publish. vCluster Platform's architecture enables higher SLAs: control plane HA (multi-replica with embedded etcd or external DB), automated node lifecycle via vMetal, and tenant isolation ensuring one tenant's failure does not affect others.

Control plane HA
multi-replica deployment
external DB support

Reliability

GPU & System Health

GPU Bus Fault Detection

NVIDIA DCGM Exporter (deployed via GPU Operator on vCluster) tracks XID error codes including GPU bus fault events. Operators connect Prometheus to receive alerts. vCluster tenant isolation scopes blast radius to the affected node only.

DCGM XID error tracking
Prometheus alerting

Reliability

GPU & System Health

Active

PCIe Error Monitoring

DCGM Exporter tracks PCIe replay counters and bus error events. Deployed automatically via GPU Operator on vCluster — operators connect Prometheus scraping.

DCGM PCIe replay counter
Prometheus

Reliability

GPU & System Health

Active

GPU Thermal Monitoring

DCGM Exporter tracks GPU and memory temperature on all GPU nodes. Operators connect Prometheus and configure thermal threshold alerts. NVIDIA's default Grafana dashboard includes thermal panels.

DCGM GPU/memory temperature metrics
Prometheus alerting
Grafana dashboard

Reliability

GPU & System Health

Active

ECC Memory Monitoring

DCGM Exporter tracks GPU ECC error rates per GPU. Deployed via GPU Operator on vCluster. CPU memory ECC monitoring requires a separate IPMI exporter deployed by the operator.

DCGM ECC error metrics
IPMI exporter for CPU memory

Reliability

GPU & System Health

Active

XID & SXID Detection

DCGM Exporter tracks XID and SXID error codes per GPU. Deployed via GPU Operator on vCluster — XID events surface through Prometheus alerting. vCluster tenant isolation scopes impact to the affected tenant only.

DCGM XID/SXID tracking
Prometheus alerting

Reliability

GPU & System Health

Active

NCCL Topology Health

NCCL operation observability is available via NVIDIA's NCCL Inspector Profiler Plugin. Operators include the inspector .so library in their GPU workload container image and set NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE=1 in their pod specs. The plugin runs entirely in-process — no cluster-level changes, no DaemonSet, no privileged containers required. Works natively inside vCluster pods.
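As a rough illustration of the pod-spec change described above, the sketch below builds a pod manifest as a Python dict and sets the two environment variables on every container. The pod name, image, and plugin path are hypothetical placeholders, and the exact value NCCL_PROFILER_PLUGIN expects should be checked against NVIDIA's documentation.

```python
import copy

# Hypothetical placeholders: pod name, image, and plugin path are illustrative.
PLUGIN_PATH = "/opt/nccl-inspector/libnccl-profiler-inspector.so"

base_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "training-worker"},
    "spec": {
        "containers": [
            {
                "name": "trainer",
                "image": "registry.example.com/team/training:latest",
                "env": [{"name": "NCCL_DEBUG", "value": "WARN"}],
            }
        ]
    },
}


def enable_nccl_inspector(pod: dict, plugin_path: str) -> dict:
    """Return a copy of the pod manifest with the inspector variables set."""
    inspector_env = {
        "NCCL_PROFILER_PLUGIN": plugin_path,
        "NCCL_INSPECTOR_ENABLE": "1",
    }
    patched = copy.deepcopy(pod)
    for container in patched["spec"]["containers"]:
        # Keep existing variables, dropping stale copies of the two we set.
        kept = [e for e in container.get("env", [])
                if e["name"] not in inspector_env]
        added = [{"name": k, "value": v} for k, v in inspector_env.items()]
        container["env"] = kept + added
    return patched


patched_pod = enable_nccl_inspector(base_pod, PLUGIN_PATH)
```

Because the change is limited to container environment variables, the same manifest can be applied inside a tenant cluster with ordinary kubectl access.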

NCCL Inspector Profiler Plugin
per-communicator performance logging

Reliability

GPU & System Health

Active

K8s Node Health Checks

vCluster surfaces Kubernetes node health conditions through the virtual control plane API server — including node Ready, MemoryPressure, DiskPressure, and PIDPressure conditions. Platform-level node health is further enriched by vMetal's bare metal lifecycle monitoring. Operators access node health via standard kubectl or any K8s observability tooling.
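The conditions named above live under status.conditions on each Node object. The following minimal sketch shows one way to read them from the JSON that `kubectl get nodes -o json` returns; the function name and sample node are assumptions, and treating any non-Ready condition with status "True" as a problem is a simplification.

```python
def unhealthy_conditions(node: dict) -> list:
    """List the conditions on a Kubernetes Node object that indicate a problem."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready":
            # Ready is healthy only when its status is "True".
            if cond["status"] != "True":
                problems.append("NotReady")
        elif cond["status"] == "True":
            # Pressure-style conditions are healthy when "False".
            problems.append(cond["type"])
    return problems


# Hypothetical node object, trimmed to the fields the function reads.
sample_node = {
    "metadata": {"name": "gpu-node-01"},
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True"},
            {"type": "PIDPressure", "status": "False"},
            {"type": "Ready", "status": "True"},
        ]
    },
}

print(unhealthy_conditions(sample_node))
```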

K8s node conditions via vCluster API
vMetal node lifecycle monitoring

Reliability

GPU & System Health

Active

Driver Version Consistency

GPU Operator runs the NVIDIA driver and toolkit as DaemonSets, enforcing consistent versions across all GPU nodes in the cluster. GPU Operator's reconciliation loop detects and corrects version drift automatically.

GPU Operator DaemonSet version enforcement

Reliability

GPU & System Health

Active

ECC Error Detection

DCGM Exporter tracks single-bit and double-bit ECC errors per GPU. Deployed via GPU Operator on vCluster — operators configure Prometheus alerts for ECC thresholds.

DCGM ECC error tracking
Prometheus alerting

Reliability

GPU & System Health

Active

Thermal Throttling Alerts

DCGM Exporter tracks GPU temperature and thermal throttling events. Operators configure Prometheus alerting rules for threshold detection. NVIDIA's default Grafana dashboard includes thermal monitoring panels.

DCGM thermal metrics
Prometheus alerting
Grafana thermal dashboards

Reliability

GPU & System Health

Active

Power Usage Tracking

DCGM Exporter tracks GPU power draw and total energy consumption per GPU. Operators connect Prometheus for per-tenant power utilization dashboards and chargeback via PromQL aggregation by namespace.
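The per-namespace aggregation can be sketched as follows. This assumes DCGM Exporter is configured to attach a `namespace` label to its samples and that power draw is exposed as DCGM_FI_DEV_POWER_USAGE in watts; the sample data and the "unassigned" bucket are illustrative assumptions. The Python function mirrors what the PromQL expression computes.

```python
from collections import defaultdict

# PromQL equivalent of the aggregation below (assumes a `namespace` label).
POWER_BY_NAMESPACE_QUERY = "sum by (namespace) (DCGM_FI_DEV_POWER_USAGE)"


def power_by_namespace(samples: list) -> dict:
    """Sum per-GPU power draw (watts) for each namespace."""
    totals = defaultdict(float)
    for sample in samples:
        # GPUs with no pod attached carry no namespace label.
        namespace = sample["labels"].get("namespace") or "unassigned"
        totals[namespace] += sample["value"]
    return dict(totals)


# Hypothetical samples shaped like scraped DCGM Exporter series.
samples = [
    {"labels": {"gpu": "0", "namespace": "tenant-a"}, "value": 300.0},
    {"labels": {"gpu": "1", "namespace": "tenant-a"}, "value": 250.5},
    {"labels": {"gpu": "2", "namespace": "tenant-b"}, "value": 410.0},
    {"labels": {"gpu": "3"}, "value": 60.0},
]

print(power_by_namespace(samples))
```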

DCGM power usage and energy metrics
Prometheus
per-tenant power dashboards

Reliability

GPU & System Health

Active

XID/SXID Error Tracking

DCGM Exporter tracks XID and SXID error codes per GPU via GPU Operator. XID events surface through Prometheus. vCluster tenant isolation scopes blast radius to the affected tenant only.

DCGM XID/SXID error tracking

Reliability

GPU & System Health

Active

PCIe & Power State Metrics

DCGM Exporter tracks PCIe bus health and power state metrics. Deployed via GPU Operator on vCluster — operators configure Prometheus alerting for anomalies.

DCGM PCIe and power state metrics

Reliability

GPU & System Health

Active

IPMI & Fan Monitoring

IPMI/BMC telemetry for fan speed and hardware health is the operator's responsibility to configure. vMetal (Metal3) uses BMC for node provisioning and power management but does not expose ongoing IPMI telemetry. Operators deploy a standalone IPMI exporter on their nodes to feed hardware metrics into Prometheus.

Reliability

GPU & System Health

Error Counter Monitoring

GPU-layer error counters (retries, ECC errors) are available via DCGM Exporter deployed through GPU Operator. Network-layer packet drop and retry counters are the operator's responsibility — Netris provides network automation but not deep flow-state telemetry.

DCGM GPU error counters

Reliability

GPU & System Health

Active

NCCL Operation Profiling

NCCL operation health tracking is available via NVIDIA's NCCL Inspector Profiler Plugin. Operators include the inspector .so library in their GPU workload container images and enable it via environment variables (NCCL_PROFILER_PLUGIN, NCCL_INSPECTOR_ENABLE=1). The plugin runs entirely in-process with no Kubernetes footprint — works natively inside vCluster GPU pods.

NCCL Inspector Profiler Plugin

Reliability

GPU & System Health

Active

Automated Node Recovery

vMetal handles the full bare metal node lifecycle today — including deprovisioning, reimaging, and returning nodes to the available pool for reassignment. What is on the roadmap is automated health-based triggering: detecting a GPU fault via DCGM, automatically draining the node, deprovisioning, and reprovisioning without operator intervention. Today that remediation chain requires operator action.

Bare metal deprovisioning
reimaging
node pool return
automated remediation pipeline (roadmap)

Reliability

GPU & System Health

Active

ncu Profiling

ncu (the NVIDIA Nsight Compute CLI) is available through the operator's container image or the NVIDIA toolkit.

Monitoring

GPU Monitoring

Active

Managed Grafana Dashboards

kube-prometheus-stack (including Grafana) deploys normally inside each tenant cluster. DCGM Exporter integrates with Prometheus to surface GPU utilization, memory, power, and error metrics per tenant — all scoped to the individual tenant with no cross-tenant visibility. Operators can offer managed Grafana dashboards as a platform feature.

Per-tenant kube-prometheus-stack
DCGM Exporter integration
per-tenant Grafana

Monitoring

GPU Monitoring

Active

TFLOPs Estimation

TFLOPs estimation is the operator's responsibility to benchmark and publish. DCGM Exporter surfaces GPU utilization and SM clock data that operators can use to derive effective TFLOPs, but vCluster does not compute or track TFLOPs natively.
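One rough way an operator might derive an effective TFLOPs figure from these raw inputs is to scale a GPU's rated peak by the observed clock and by the fraction of time the compute units were active. The sketch below is an assumption-laden approximation, not a vCluster or NVIDIA formula: the linear scaling, the function name, and every input value are hypothetical, and a utilization metric measures busy time rather than floating-point operations issued.

```python
def estimate_effective_tflops(peak_tflops: float,
                              rated_clock_mhz: float,
                              sm_clock_mhz: float,
                              active_fraction: float) -> float:
    """Scale a rated peak by observed SM clock and activity (crude estimate)."""
    if rated_clock_mhz <= 0:
        raise ValueError("rated_clock_mhz must be positive")
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    clock_ratio = sm_clock_mhz / rated_clock_mhz
    return peak_tflops * clock_ratio * active_fraction


# Hypothetical inputs: a 1000 TFLOPs peak rated at 2000 MHz, observed running
# at 1000 MHz with compute units active half of the time.
print(estimate_effective_tflops(1000.0, 2000.0, 1000.0, 0.5))
```

Published numbers are better grounded in an actual benchmark run; an estimate like this is mainly useful for spotting tenants whose utilization is far from what the hardware could deliver.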

DCGM SM utilization and clock metrics (raw input for TFLOPs estimation)

Monitoring

GPU Monitoring

Real-Time System Monitoring

Real-time GPU and system monitoring is available via DCGM Exporter (GPU metrics at ~1s granularity) and kube-prometheus-stack — both deployable inside each tenant cluster. Operators connect their Prometheus instance to scrape DCGM metrics and visualize in Grafana in real time.

DCGM real-time GPU metrics
Prometheus scraping

Monitoring

GPU Monitoring

Active

Per-Tenant GPU Performance

Per-tenant GPU performance tracking is available via DCGM Exporter — surfacing GPU utilization, SM occupancy, memory bandwidth, and power draw per namespace. Operators aggregate metrics by tenant cluster namespace for per-tenant performance visibility.

DCGM GPU performance metrics
per-tenant namespace scoping

Monitoring

GPU Monitoring

Active

Tenant Resource Utilization

Per-tenant resource utilization monitoring is available by deploying GPU Operator, DCGM Exporter, and kube-prometheus-stack inside each tenant cluster, with ServiceMonitors configured to scope metrics by namespace. vCluster's tenant isolation ensures each tenant only sees their own metrics.

Per-tenant kube-prometheus-stack
DCGM GPU utilization metrics
cross-tenant aggregation for operators

Monitoring

GPU Monitoring

Active

Prometheus Alerting

Prometheus Alertmanager is the alerting layer — deployable alongside kube-prometheus-stack inside each tenant cluster. Operators configure alert rules against DCGM metrics (XID errors, temperature thresholds, ECC errors, utilization). Per-tenant isolation ensures alert rules and notification channels are scoped to individual tenants.
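Alert rules for kube-prometheus-stack are typically supplied as PrometheusRule objects. The sketch below assembles one as a Python dict and prints it as JSON, which kubectl accepts as a manifest. The rule names, namespace, threshold, and severities are hypothetical, and the metric names assume the default DCGM Exporter field set.

```python
import json


def gpu_alert_rules(namespace: str, temp_threshold_c: int) -> dict:
    """Build a PrometheusRule manifest with two example GPU alerts."""
    rules = [
        {
            "alert": "GpuTemperatureHigh",
            "expr": f"DCGM_FI_DEV_GPU_TEMP > {temp_threshold_c}",
            "for": "5m",
            "labels": {"severity": "warning"},
        },
        {
            # Fires whenever the exporter reports a non-zero XID code.
            "alert": "GpuXidErrorReported",
            "expr": "DCGM_FI_DEV_XID_ERRORS > 0",
            "labels": {"severity": "critical"},
        },
    ]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "gpu-health-alerts", "namespace": namespace},
        "spec": {"groups": [{"name": "gpu-health", "rules": rules}]},
    }


# Hypothetical namespace and threshold, for illustration only.
manifest = gpu_alert_rules("monitoring", 85)
print(json.dumps(manifest, indent=2))
```

Depending on how the stack is installed, the Prometheus instance may only pick up rules that carry particular labels, so operators would add whatever selector labels their deployment expects.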

Prometheus Alertmanager
per-tenant alert rules

Monitoring

Alerting

Active

Active & Passive Health Checks

Passive health checks are available via DCGM Exporter (GPU metrics) and Kubernetes node conditions (Ready, MemoryPressure, DiskPressure) — both accessible within each tenant cluster. Active health checks require operator-configured tooling such as scheduled NCCL tests or GPU burn-in jobs.

DCGM passive GPU monitoring
K8s node conditions

Monitoring

Health Checks

Active

Burn-In Test Responsibility

GPU burn-in testing and its documentation are the operator's responsibility. vMetal provisions bare metal nodes but does not run or document burn-in tests: it is Metal3 under the hood (PXE, OS provisioning, and lifecycle management only).

Monitoring

Health Checks

Full Passive Health Suite

Comprehensive passive GPU health monitoring is available via DCGM Exporter deployed through GPU Operator on vCluster — covering temperature, power, ECC errors, XID codes, PCIe health, NVLink status, and utilization. Kubernetes node conditions provide host-level passive health. Together they cover the full passive health check surface.

DCGM full metric suite
K8s node conditions

Monitoring

Health Checks

Active

Active Health Check Jobs

Automated active health checks (GPU burn-in, DGEMM benchmarks, NCCL tests) are the operator's responsibility to configure and schedule. vCluster supports running these as Kubernetes Jobs or CronJobs within tenant clusters — the isolation model ensures health check jobs do not interfere with other tenants.
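A scheduled NCCL test could be expressed as a CronJob along the following lines. This is a sketch under stated assumptions: the job name, image, and schedule are hypothetical, and the image is assumed to ship the `all_reduce_perf` binary from NVIDIA's nccl-tests on its PATH.

```python
import json


def nccl_check_cronjob(name: str, image: str, schedule: str, gpus: int) -> dict:
    """Build a CronJob manifest that runs an all-reduce test on one node."""
    container = {
        "name": "nccl-test",
        "image": image,
        # Sweep message sizes from 8 bytes to 1 GiB, doubling each step.
        "command": ["all_reduce_perf", "-b", "8", "-e", "1G",
                    "-f", "2", "-g", str(gpus)],
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }
    pod_spec = {"restartPolicy": "Never", "containers": [container]}
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",  # never overlap two test runs
            "jobTemplate": {
                "spec": {"backoffLimit": 0, "template": {"spec": pod_spec}}
            },
        },
    }


# Hypothetical values: nightly run at 03:00 using 8 GPUs.
cronjob = nccl_check_cronjob(
    "nightly-nccl-check", "registry.example.com/ops/nccl-tests:latest",
    "0 3 * * *", 8)
print(json.dumps(cronjob, indent=2))
```

Setting backoffLimit to 0 keeps a failed run visible as a failed Job instead of retrying silently, which suits a health check whose failure is itself the signal.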

K8s Job/CronJob-based health check support

Monitoring

Health Checks

Diagnostic Tooling

Operators have full kubectl access to each tenant cluster for standard Kubernetes diagnostics. GPU diagnostics are available via DCGM (XID codes, ECC errors, health validation). NVIDIA GPU Operator includes a validator component that runs diagnostic checks at node startup.

kubectl diagnostics
DCGM GPU health data
GPU Operator validator

Monitoring

Health Checks

Active

Automated Node Remediation

Coming soon

vMetal supports node deprovisioning and reimaging today — nodes can be drained, wiped, and returned to the available pool. The automated fault-detection-to-action pipeline (detect GPU fault via DCGM, automatically drain, deprovision, reprovision) is on the roadmap via Auto Nodes + vMetal + vCluster. Today operator action is required to trigger remediation.

vMetal node deprovisioning/reimaging (live)
Auto Nodes + vMetal automated remediation (roadmap)

Monitoring

Auto Remediation

Active

Failure Prediction System

AI-based failure prediction is the operator's responsibility to configure. vCluster provides the observability data layer (DCGM metrics, K8s events) that an ML-based failure prediction system can consume.

Monitoring

Auto Remediation

Usage Metering & Showback

vCluster Platform provides per-cluster resource quota enforcement and usage metering via vBilling. Operators can expose per-tenant utilization as showback dashboards in Grafana. Chargeback and invoicing to end customers require connecting an external billing platform; vBilling provides the metering data layer.

Resource quota enforcement
vBilling metering
DCGM + Prometheus
Grafana showback

Pricing

Metering & Billing

Active

GPU Pricing Economics

GPU pricing is the operator's hardware and business decision. vCluster reduces operational overhead — fewer ops engineers needed to manage multi-tenant K8s at scale — which can lower total platform cost and improve margins. vMetal reduces time-to-provisioned-node, reducing idle GPU costs.

Reduced operational overhead
faster provisioning reducing idle time

Pricing

Metering & Billing

Flexible Contract Models

Contract and consumption models are the operator's commercial decision. vCluster Platform's metering layer provides the usage data needed to support any billing model — on-demand, reserved, or tiered — but the contracts themselves are the operator's responsibility.

Per-tenant usage metering

Pricing

Metering & Billing

Granular Resource Metering

vCluster Platform provides per-tenant resource quota enforcement and usage metering via vBilling, giving operators the granular consumption data needed to support unbundled billing for compute, storage, and GPU resources. Integration with an external billing system is required for actual invoicing and chargeback.

vBilling metering
DCGM GPU usage data
namespace-scoped Prometheus metrics

Pricing

Metering & Billing

Active

Tenant Scale-Out

Contract expansion and renewal are commercial decisions for the operator. vCluster Platform makes it operationally easy to scale existing tenants: adding nodes, increasing resource quotas, or spinning up new tenant clusters requires no new physical clusters.

On-demand tenant cluster scaling
resource quota adjustment

Pricing

Metering & Billing

Hardware-Agnostic Provisioning

GPU hardware acquisition and availability timelines are the operator's business decision. vMetal (Metal3) accelerates time from physical node to registered Kubernetes worker — any new GPU SKU is supported without platform changes. The operator controls the hardware roadmap.

Hardware-agnostic bare metal provisioning

Pricing

Metering & Billing

Kernel Library Management

Performance kernel libraries are the operator's responsibility to bundle in their base OS image or make available to tenants. vCluster does not ship or manage ML kernel collections. Tenants can install libraries inside their tenant cluster environments independently.

Pricing

Metering & Billing

NVIDIA Investment Status

NVIDIA is a close technology partner of vCluster, not a current investor.

Partnerships

NVIDIA Ecosystem

NCP Technical Compatibility

NCP (NVIDIA Cloud Partner) is a certification program exclusively for GPU cloud infrastructure providers — vCluster Labs as a software vendor is not eligible and does not claim NCP status. vCluster helps AI cloud operators satisfy the technical requirements of NCP certification: GPU Operator compatibility, DCGM integration, IB/RoCEv2 networking stack, and multi-tenant isolation. Operators using vCluster can reference this compatibility when pursuing NCP certification.

GPU Operator compatibility
DCGM integration
multi-tenant isolation enabling NCP technical requirements

Partnerships

NVIDIA Ecosystem

Active

AMD Cloud Alliance

AMD Cloud Alliance membership is a provider-level certification. vCluster Platform supports AMD GPU workloads via the AMD GPU Operator on Kubernetes — hardware-agnostic at the platform layer. AMD Cloud Alliance status is the operator's credential to pursue.

Partnerships

NVIDIA Ecosystem

SchedMD Partnership

No current SchedMD partnership. SLURM support is on the roadmap for H2 2026.

Partnerships

NVIDIA Ecosystem

Full CNCF Ecosystem Support

The full Kubernetes and NVIDIA GPU ecosystem works inside a tenant cluster without modification — GPU Operator, Network Operator, DCGM, KAI Scheduler, Prometheus, Grafana, ArgoCD, and all standard CNCF tooling. vCluster is CNCF-compatible and does not require custom integrations or ecosystem modifications.

Full CNCF ecosystem compatibility
NVIDIA GPU stack compatibility
GitOps-compatible

Partnerships

NVIDIA Ecosystem

Active

Industry Event Participation

vCluster Labs actively participates in major GPU infrastructure and Kubernetes events — including KubeCon North America, KubeCon Europe, NVIDIA GTC, and SC (Supercomputing). The team presents on GPU tenant isolation, vNode security, and AI infrastructure architecture.

Partnerships

Industry Events

Active

Next-Gen GPU Support

vMetal supports any bare metal GPU node — H200, B200, GB200, and NVL72 hardware register into the platform without software changes. New GPU SKUs require no platform redesign.

Hardware-agnostic bare metal provisioning
no platform changes for new GPU SKUs

Availability

Hardware Provisioning

Current GPU Compatibility

Current GPU model availability is the operator's hardware decision. vMetal supports any bare metal GPU node — H100, H200, A100, L40S, and MI300X hardware all register into the platform identically. Hardware-agnostic provisioning means no per-SKU configuration.

Hardware-agnostic bare metal provisioning

Availability

Hardware Provisioning

Hardware Timeline Planning

Hardware acquisition timelines are the operator's business decision. vMetal supports any new GPU SKU without platform changes.

Availability

Hardware Provisioning

Multi-Tenant Scale

Fleet size is the operator's hardware investment decision. vCluster Platform enables scale without cluster sprawl — hundreds of tenant clusters run on a single control plane cluster, keeping per-cluster operational overhead flat regardless of GPU count.

Multi-tenant density
single control plane cluster for hundreds of tenants

Availability

Scale & Density

Geographic Reach

Geographic reach is the operator's infrastructure and business decision. vCluster Platform's multi-region mode supports distributed control planes across regions.

Availability

Scale & Density

Utilization Optimization

Uptime and utilization SLAs are the operator's commitment. vCluster Platform improves utilization economics — shared infrastructure with strong tenant isolation reduces idle GPU waste versus cluster-per-tenant approaches. Control plane HA ensures platform availability independent of individual node health.

Multi-tenant GPU utilization
control plane HA

Availability

Capacity Planning

Fleet-Wide Capacity Planning

vCluster Platform enables capacity planning without linear operational scaling — adding new tenants, expanding resource quotas, or onboarding new GPU nodes requires no new physical clusters. Operators manage all tenants from a single control plane with full visibility into resource allocation and utilization across the fleet.

Single control plane for fleet-wide capacity visibility
on-demand tenant scaling

Availability

Capacity Planning

Active

GPU Expansion Simplicity

GPU acquisition roadmaps are the operator's hardware and financial planning responsibility. vCluster Platform makes adding new nodes to existing tenant clusters operationally simple — no new clusters or configuration required.

Availability

Capacity Planning

Business Outcomes

What Moving Up the ClusterMAX™ Ranking Means for Your Business

Enterprise AI teams use ClusterMAX to make six- and seven-figure infrastructure decisions. Each rating improvement translates directly to pipeline and contract value.

Rating Tier = Larger Deal Size

Enterprise buyers filter ClusterMAX tiers before shortlisting providers. A higher rating gets you into more RFPs and reduces time spent on procurement due diligence.

<1 Day From Hardware to Managed Kubernetes

vCluster turns raw GPU nodes into a fully managed Kubernetes offering in under a day. Faster time-to-market means more customers onboarded before competitors catch up.

ClusterMAX Names You by Name

vCluster is now explicitly cited in the ClusterMAX Security criterion. Customers reading the criteria see your implementation listed as the requirement, not an alternative.

No Cluster Sprawl

vCluster lets you serve 100 enterprise tenants on a shared GPU fleet without managing 100 separate physical clusters, keeping OpEx flat as you scale to meet Availability scoring.

Isolation Without Compromise

Private Nodes and dedicated control planes give each customer the isolation of a dedicated environment at a fraction of the cost, directly improving Security and Orchestration scores.

Full GPU Stack Compatibility

GPU Operator, NCCL, MIG, DCGM, and distributed training frameworks all work natively inside vClusters, satisfying the Orchestration, Storage, Networking, and Monitoring criteria simultaneously.

Dive deeper