AI Cloud Providers · ClusterMAX™ Criteria Guide

The Platform Layer That Upgrades Your ClusterMAX™ Score

ClusterMAX™ evaluates AI cloud providers across 10 dimensions. This guide maps each criterion to the specific capabilities in vCluster, vNode, and vMetal that help you improve in that area, including the Security criterion that now explicitly names vCluster as a requirement.

Trusted by sovereign AI cloud providers worldwide
Context

What is ClusterMAX™ and Why Does Your Rating Matter?

ClusterMAX, published by SemiAnalysis, has become the de facto standard for evaluating GPU cloud infrastructure. Enterprise AI teams consult it before selecting a provider. Improving your rating directly impacts deal velocity and customer trust.

10 Evaluation Dimensions

ClusterMAX scores providers across Security, Lifecycle, Orchestration, Storage, Networking, Reliability, Monitoring, Pricing, Partnerships, and Availability, covering the full infrastructure stack enterprise customers care about.

vCluster Explicitly Named

The updated Security criterion now explicitly lists “vCluster or similar isolation beyond container-based only” as a requirement, a direct signal to enterprise buyers that cluster-level isolation is non-negotiable.

Platform vs. Hardware

Many providers plateau in their ratings not because of hardware, but because they lack a managed platform layer. vCluster, vNode, and vMetal directly address the software and operational gaps ClusterMAX evaluates.

This guide walks through each ClusterMAX dimension and shows where vCluster, vNode, and vMetal can meaningfully move the needle.

Use it as a roadmap: identify where your current offering falls short, and understand exactly which product to deploy to close the gap.

Criteria Walkthrough

How the vCluster Platform Addresses Each ClusterMAX™ Dimension

ClusterMAX evaluates dozens of specific requirements across 10 dimensions.

ClusterMAX™ Criteria (1)
  • vCluster or similar isolation beyond container-based only
ClusterMAX™ Criteria (5)
  • Protection against container escalation vulnerabilities
  • Updated NVIDIA Container Toolkit preventing CVE-2024-0132 and related vulnerabilities
  • Protection against CVE-2025-23359, CVE-2025-23266
  • Automated rollout of new NVIDIA Container Toolkit versions upon CVE discovery
  • Part of NVIDIA security program for embargoed access to latest security patches
ClusterMAX™ Criteria (5)
  • VLAN isolation between tenants for RoCE
  • PKeys set for InfiniBand tenants
  • InfiniBand Security Keys Management (SMKey, SAKey, CKey, VSKey)
  • AM Key configuration (if SHARP is available)
  • SR-IOV with QP0 & MAD disabled when passing Virtual Function pointer into VM
ClusterMAX™ Criteria (8)
  • SOC 1 or SOC 2 Type II compliance
  • ISO 27001 certification
  • GDPR, PCI, HIPAA, FedRAMP compliance
  • Ability to sell to secure government customers globally
  • Penetration testing conducted (provider, scope, and coverage documented)
  • SOC 2 and penetration testing specifically cover InfiniBand/RoCEv2 fabric
  • Security firms with expertise in high-speed IB/ETH networking
  • Process in place for future security improvements
ClusterMAX™ Criteria (8)
  • Onboarding costs and delivery timeline adherence
  • Ease of onboarding (UI-based vs. manual Terraform)
  • Out-of-the-box GPU Direct RDMA (between NIC and GPU) setup
  • Fast onboarding process (UI-based provisioning vs. Terraform complexity)
  • Delivery date accuracy and meeting expectations
  • Out-of-the-box IB/RoCEv2 & NVIDIA drivers configuration
  • Performance optimization libraries (e.g., TogetherAI kernel collection)
  • Provisioning of CPU head node for SLURM without explicit request
ClusterMAX™ Criteria (2)
  • Knowledge of industry experts
  • Understanding of standard ML user expectations
ClusterMAX™ Criteria (5)
  • Audit logs capturing resource actions (create/start/stop/delete), administrative actions, and billing events
  • Audit log entries include actor identity (user ID, email, IP address), action details, target resource, timestamp, and success/failure status
  • Audit logs queryable via API with filtering by resource type, project, user, and date range
  • Minimum 90-day audit log retention with export capability
  • Audit log access restricted to administrators with no additional usage charges
ClusterMAX™ Criteria (6)
  • Automated setup process for Kubernetes
  • Automated managed Kubernetes service
  • kubectl access or KUBECONFIG provided
  • Easy access to kube-dashboard, Lens, etc.
  • Storage accessible via PVC + hostpath + S3
  • CUDA_VISIBLE_DEVICES properly configured
ClusterMAX™ Criteria (4)
  • Easy process for adding new cluster users
  • RBAC and SSO implementation
  • No SSH key copying required
  • Storage RBAC enforcement
ClusterMAX™ Criteria (7)
  • Self-service capabilities (e.g., Lambda 1CC, Nebius managed operator)
  • Automated setup process for SLURM
  • Automated managed SLURM service
  • Head node provisioning
  • Out-of-the-box SLURM topology configuration
  • SLURM modules availability
  • Pyxis container plugin support
ClusterMAX™ Criteria (4)
  • Storage integration with Kubernetes for PVCs/storage class
  • Proper mounting configuration out-of-the-box
  • Out-of-the-box parallel filesystem (e.g., Weka, DDN, VAST)
  • Out-of-the-box managed S3-compatible object storage
ClusterMAX™ Criteria (4)
  • Automated backups with configurable retention policy for file systems, object storage, and databases
  • Cross-region replication or backup for disaster recovery
  • Snapshot support for persistent volumes (CSI snapshots for Kubernetes PVs)
  • Centralized backup monitoring, alerting, and restore validation
ClusterMAX™ Criteria (10)
  • Mount reliability (no random flaking on and off)
  • Read performance testing
  • Write performance testing
  • Throughput and latency measurements
  • Scalability testing for performance and capacity
  • Point-in-time recovery for managed databases
  • Immutable or WORM-capable storage for ransomware protection and compliance
  • Published durability and availability SLAs for all storage tiers
  • Backup encryption at rest and in transit with customer-managed key (CMK/BYOK) support
  • Checkpoint storage durability for training workloads (replication factor, cross-AZ guarantees)
ClusterMAX™ Criteria (5)
  • InfiniBand or RoCEv2 support
  • MPI distribution using higher performance hpc-x mpirun
  • Proper NCCL configuration (/etc/nccl.conf)
  • NCCL_IB_GID_INDEX=3 set for RoCEv2
  • NCCL_MIN_NCHANNELS, NCCL_PROTO, NCCL_ALGO NOT set (auto-configuration)
ClusterMAX™ Criteria (5)
  • NCCL monitoring plugin availability
  • SHARP support for enhanced performance
  • Network bandwidth and latency testing
  • 4-node NCCL test within specification
  • PyTorch layer network performance within spec
ClusterMAX™ Criteria (7)
  • NVLINK connectivity and error tracking (critical for NVL72)
  • Network stability assessment
  • Link flap monitoring and prevention
  • Ethernet and InfiniBand event monitoring (Link Flaps)
  • InfiniBand health monitoring (link status, error counters, PKey consistency)
  • InfiniBand link status validation (ibstat)
  • Partition Key (PKey) consistency across nodes
ClusterMAX™ Criteria (18)
  • Kubernetes node health checks
  • Automated node draining and replacement
  • GPU falling off the bus detection
  • PCIe error monitoring
  • Thermal monitoring (GPU temperature)
  • GPU and CPU memory stats (ECC error rate)
  • NVIDIA XID and SXID error code detection
  • NCCL and SLURM topology health
  • Driver and core library version consistency across nodes
  • ECC error detection
  • Temperature monitoring and throttling alerts
  • Power monitoring and utilization tracking
  • NVIDIA XID/SXID error detection (through DCGM)
  • PCIe bus and power state health
  • Error counter monitoring (retries, dropped packets)
  • NCCL operation health tracking
  • MSA SLA evaluation (99%, 99.9%, etc.)
  • IPMI exporter and fan speed monitoring
ClusterMAX™ Criteria (6)
  • ncu profiling available for all users
  • Out-of-the-box detailed managed Grafana
  • Real-time system monitoring
  • Performance tracking
  • Resource utilization monitoring
  • TFLOPs estimation tracking
ClusterMAX™ Criteria (1)
  • Alerting capabilities
ClusterMAX™ Criteria (5)
  • Automated Active and Passive Health Checks
  • Comprehensive passive health check implementation
  • Diagnostic tools
  • Burn-in test documentation
  • Automated active health check implementation
ClusterMAX™ Criteria (2)
  • Automatic node draining for detected issues
  • AI model system for failure prediction
ClusterMAX™ Criteria (7)
  • Tenant cluster sharing and metering with chargeback/showback
  • Individual charges for storage, compute nodes, network vs. bundled pricing
  • Low $/GPU/hr pricing
  • Consumption model options (3-year, 1-year, 6-month, 3-month, 1-month)
  • Expansion and extension of existing contracts
  • Latest GPU availability and timeline
  • Kernel library availability for MFU boosting (e.g., TogetherAI kernel collection)
ClusterMAX™ Criteria (5)
  • Ecosystem support and integration
  • NVIDIA NCP or Lepton certification
  • AMD Cloud Alliance status
  • AMD or NVIDIA investment
  • SchedMD partnership (makers of SLURM)
ClusterMAX™ Criteria (1)
  • Participation in industry events
ClusterMAX™ Criteria (4)
  • Latest GPU models available (H200, B200, GB200, NVL72)
  • Current GPU models (H100, H200, A100, L40S, MI300X)
  • B200, B300, GB200 NVL72 availability timeline
  • MI355X availability planning
ClusterMAX™ Criteria (2)
  • Total quantity of GPUs and cluster scale experience
  • Geographic reach and service accessibility
ClusterMAX™ Criteria (3)
  • Capacity planning capabilities
  • Availability/utilization rates
  • Roadmap for future GPU acquisitions/upgrades

Dedicated Control Plane

vCluster provides each tenant with a dedicated Kubernetes control plane — isolated API server, RBAC, CRDs, and controllers. This eliminates cross-tenant API access entirely. Pair with vNode for kernel-level workload isolation and Private Nodes for GPU workloads requiring exclusive physical access.

  • Dedicated control plane per tenant
  • Private Nodes
  • vNode (optional)
Security
Control Plane & Workload Isolation
Active
Kernel-Level Isolation

vNode enforces kernel-native isolation using Linux user namespaces, seccomp profiles, and cgroup separation. Container escape vulnerabilities are contained to the affected workload — they cannot reach other tenants or the host.

  • vNode
Security
Prevent Container Breakouts
Active
Zero-Day CVE Protection

vNode protected against these vulnerabilities even without patching. The goal is not only to patch known container breakouts but also to prevent attacks that exploit future zero-days that are not yet known. vCluster Platform automates installation of the GPU Operator and NVIDIA Container Toolkit and makes it easy to keep them up to date.

  • vNode
Security
Prevent Container Breakouts
Active
Automated Toolkit Updates

Fully automated rollout of new NVIDIA Container Toolkit versions is possible with vCluster Templates.

  • Templates
Security
Prevent Container Breakouts
Active
Not Applicable

Not applicable to vCluster Labs as a software vendor.

Security
Prevent Container Breakouts
RoCE Tenant Isolation

Per-tenant network isolation for RoCE is automated at provisioning time via the network fabric partner. vCluster coordinates with and instructs the underlying networking tool, such as Netris, OpenStack Neutron, or Multus with SR-IOV for RDMA. It also provides simple webhooks for homegrown network automation tools.

Security
Network Isolation
Active
InfiniBand PKey Isolation

Partition Key assignment per InfiniBand tenant is automated at provisioning time via the network fabric partner. vCluster integrates with Netris, which orchestrates PKey assignment through NVIDIA UFM.

Security
Network Isolation
Active
IB Security Key Management

InfiniBand fabric key management is handled at the network infrastructure layer via NVIDIA UFM and OpenSM. vCluster integrates with Netris, which provides configuration guidance for M_Key and VS_Key within UFM.

Security
Network Isolation
Active
SHARP AM Key Config

SHARP Aggregation Manager key configuration is handled at the fabric layer by the network infrastructure partner when SHARP is deployed. Only applicable if SHARP-capable Mellanox switches are present.

Security
Network Isolation
Active
Bare Metal Architecture

vCluster Private Nodes targets bare metal rather than VMs for AI clusters, which eliminates the need for SR-IOV VF partitioning. This criterion is a host-layer concern for hypervisor-based environments.

Security
Network Isolation
Self-Hosted Software Model

vCluster is self-hosted software: vCluster Labs does not process or store customer data. It does help operators with SOC 2 requirements through software features such as automated backups via vCluster Snapshots.

Security
Certifications
ISO 27001 Scope

vCluster is self-hosted software: vCluster Labs does not process or store customer data. It does help operators with SOC 2 requirements through software features such as automated backups via vCluster Snapshots. ISO 27001 obligations reside within the operator's own deployment environment.

Security
Certifications
Compliance-Ready Controls

Compliance obligations reside within the operator's deployment environment. vCluster provides the technical controls required to operate within compliant environments: RBAC, audit logging, network isolation, and FIPS-compliant images (Enterprise). The FIPS image versions are particularly relevant for federal business.

  • RBAC
  • Audit Logging
  • Network Isolation
  • FIPS Images
Security
Certifications
Sovereign Deployment Support

vCluster's self-hosted model supports air-gapped, on-prem, and sovereign deployments — customer data never leaves the operator's environment. FedRAMP and ITAR eligibility depends on the operator's own certifications.

Security
Certifications
Pen Testing Reports

Third-party penetration testing has been conducted on vCluster Platform and vNode. Reports are available under NDA on request. These reports do not replace the operator's own audit, but they help ensure that the vCluster software passes it cleanly.

Security
Certifications
IB/RoCEv2 Pen Testing

Third-party penetration testing has been conducted on vCluster Platform and vNode. Reports available under NDA on request.

Security
Certifications
Expert Fabric Partners

Fabric-layer security assessments require a partner with InfiniBand/Ethernet expertise. vCluster works with operators to validate tenant isolation at the control plane and networking layers.

Security
Certifications
Continuous Security Improvement

vCluster supports AI cloud providers here by enabling an automation-first approach (Templates, etc.), and vCluster maintains a published vulnerability disclosure process.

Security
Certifications
Fast Tenant Onboarding

FAST — vCluster Platform reduces time-to-first-cluster to minutes, eliminating the weeks typically required for manual Kubernetes infrastructure setup. Tenant onboarding can be fully automated with vCluster Platform.

  • Templates
  • Automations
Lifecycle
Tenant Provisioning
Active
Self-Service Provisioning

LIKE HYPERSCALERS: everything can be fully automated, all standard IaC/GitOps provisioning tools and flows are supported, and the Platform UI provides a great UX.

vCluster provisions a fully managed Kubernetes environment in seconds via UI, API, or Kubernetes CRDs — no new physical infrastructure, no manual Terraform. vMetal automates the full bare metal lifecycle: PXE boot, OS provisioning, and node registration.

  • UI provisioning
  • API provisioning
  • K8s CRD-based provisioning
  • GitOps-compatible
Lifecycle
Tenant Provisioning
Active
GPU Direct RDMA Setup

vCluster Templates delivers production-ready GPU tenant clusters with GPU Direct RDMA configured out of the box. GPU Operator, Network Operator, and nvidia-peermem deployed via standard Helm or GitOps workflows. No new physical infrastructure required; works on any vCluster with bare metal GPU nodes.

  • GPU Operator RDMA configuration
  • NVIDIA Network Operator (MOFED)
  • nvidia-peermem
Lifecycle
Tenant Provisioning
Active
Instant Cluster Provisioning

vCluster provisions a fully managed Kubernetes environment in seconds via UI, API, or K8s CRDs — customers receive a kubeconfig immediately with no manual handoff. vMetal automates bare metal GPU node provisioning end-to-end: PXE boot, OS install, and cluster registration — replacing Terraform complexity with a declarative self-service workflow that is fully GitOps-compatible.

  • UI provisioning
  • API provisioning
  • K8s CRD-based GitOps provisioning
Lifecycle
Tenant Provisioning
Active
Predictable Delivery

Our automations make delivery faster and more predictable. In addition, our Customer Engineering Team supports complex new data center setups.

Lifecycle
Tenant Provisioning
Active
MOFED Driver Lifecycle

vCluster Templates manages the MOFED driver and NVIDIA Container Toolkit lifecycle fleet-wide via the Network Operator and GPU Operator, respectively. No per-node manual intervention is required. Driver versions stay consistent across nodes, and updates roll out through standard Kubernetes operator reconciliation.

  • GPU Operator driver management
  • NVIDIA Network Operator MOFED
  • IB/RoCEv2 driver lifecycle
Lifecycle
Tenant Provisioning
Active
Performance Testing (Roadmap)
Coming soon

Automated performance testing for new tenant environments is on the roadmap. Currently, vCluster helps with manual performance testing but an in-product solution is coming soon.

Lifecycle
Tenant Provisioning
Active
SLURM Platform (Roadmap)
Coming soon

vCluster will launch a SLURM solution in H2 2026, providing fully automated SLURM cluster provisioning and day-2 operations.

Lifecycle
Tenant Provisioning
Active
AI Cloud Industry Experience

vCluster has worked with some of the biggest AI cloud providers, such as CoreWeave, since 2021 and has gained valuable experience from the earliest days of the AI cloud industry.

Lifecycle
Industry Experience
Active
Resource Lifecycle Logging

vCluster Platform exposes metrics and Kubernetes API audit logs for all resource lifecycle events (create, update, delete) and administrative actions. Integration with billing systems currently requires custom work; automated in-product integration is coming soon. Each log entry includes actor identity, resource type, action, and timestamp.

  • K8s API audit logging
  • Resource lifecycle events
  • Admin action logging
Lifecycle
Audit Logs
Active
Actor Attribution

vCluster Platform captures full actor attribution in audit logs via standard Kubernetes audit policy: user ID, source IP, action verb, target resource, and request status (success/failure). This meets ClusterMAX's actor-attribution requirement for each logged event.

  • K8s audit policy: user
  • sourceIP
  • verb
  • resource
  • responseStatus
Lifecycle
Audit Logs
Active
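The actor-attribution fields listed above map onto fields of a standard Kubernetes audit event (`audit.k8s.io/v1`). As a minimal sketch, the Python below extracts them from one event. The sample event, the user name, and the IP address are made up for illustration, and the function name `actor_attribution` is hypothetical rather than part of any vCluster API.

```python
# Sample event shaped like a Kubernetes audit.k8s.io/v1 Event (values are illustrative).
sample_event = {
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "stage": "ResponseComplete",
    "verb": "delete",
    "user": {"username": "alice@example.com", "groups": ["system:authenticated"]},
    "sourceIPs": ["203.0.113.7"],
    "objectRef": {"resource": "pods", "namespace": "training", "name": "trainer-0"},
    "responseStatus": {"code": 200},
    "requestReceivedTimestamp": "2025-01-15T10:00:00.000000Z",
}


def actor_attribution(event: dict) -> dict:
    """Flatten the fields needed for actor attribution into one record."""
    ref = event.get("objectRef", {})
    code = event.get("responseStatus", {}).get("code", 0)
    return {
        "user": event.get("user", {}).get("username"),
        "source_ips": event.get("sourceIPs", []),
        "verb": event.get("verb"),
        "resource": ref.get("resource"),
        "target": f'{ref.get("namespace")}/{ref.get("name")}',
        "timestamp": event.get("requestReceivedTimestamp"),
        # Treat any 2xx HTTP status as success.
        "succeeded": 200 <= code < 300,
    }


record = actor_attribution(sample_event)
```

A record of this shape can be written to whichever log backend the operator uses for retention and export.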
Log Query & Filtering

Audit logs are accessible via the Kubernetes API. Full filtering by resource type, user, and date range depends on the log aggregation backend the operator connects (e.g., Loki, OpenSearch, Elasticsearch). vCluster Platform surfaces the audit log stream — additional queryability at scale can be achieved with a connected aggregation layer.

  • K8s audit log stream
  • log aggregation integration
Lifecycle
Audit Logs
Active
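In production, the filtering described above would normally run inside the aggregation backend. To make the filter semantics concrete, here is a small in-memory sketch in Python that filters audit events by resource type, user, and date range. The sample events and the function names `parse_ts` and `filter_events` are illustrative assumptions, not a vCluster or backend API.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an audit timestamp such as 2025-01-15T10:00:00.000000Z as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def filter_events(events, resource=None, user=None, start=None, end=None):
    """Return events matching every filter that was supplied (None means no filter)."""
    matched = []
    for ev in events:
        ts = parse_ts(ev["requestReceivedTimestamp"])
        if resource and ev.get("objectRef", {}).get("resource") != resource:
            continue
        if user and ev.get("user", {}).get("username") != user:
            continue
        if start and ts < start:
            continue
        if end and ts >= end:  # the end bound is exclusive
            continue
        matched.append(ev)
    return matched


def _event(resource, username, ts):
    # Helper that builds a minimal illustrative audit event.
    return {
        "objectRef": {"resource": resource},
        "user": {"username": username},
        "requestReceivedTimestamp": ts,
    }


events = [
    _event("pods", "alice@example.com", "2025-01-10T08:00:00.000000Z"),
    _event("persistentvolumeclaims", "alice@example.com", "2025-01-15T10:00:00.000000Z"),
    _event("pods", "bob@example.com", "2025-01-20T12:30:00.000000Z"),
]

pod_events = filter_events(events, resource="pods")
alice_late_january = filter_events(
    events,
    user="alice@example.com",
    start=datetime(2025, 1, 12, tzinfo=timezone.utc),
    end=datetime(2025, 2, 1, tzinfo=timezone.utc),
)
```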
Flexible Log Retention

Audit logs can be persisted to industry-standard databases and logging systems. Retention periods are configurable per deployment, with no limit imposed by the platform.

Lifecycle
Audit Logs
Active
Admin-Only Access

vCluster Platform's RBAC layer allows configuration of who can see the audit log. Default permissions are restricted to platform administrators only — tenant users cannot access audit logs by default. No additional usage charges apply for audit log access.

  • RBAC-restricted audit log access
  • no per-access charges
Lifecycle
Audit Logs
Active
Declarative K8s Provisioning

vCluster provisions a fully managed Kubernetes environment via CRDs, API, or UI in seconds — no manual cluster setup, no Terraform, no infrastructure provisioning required. The entire setup is declarative and GitOps-compatible.

  • CRD-based provisioning
  • API provisioning
  • UI provisioning
  • GitOps-compatible
Orchestration
Self-Service
Active
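To illustrate what CRD-based, GitOps-compatible provisioning can look like, the Python sketch below builds a tenant-cluster manifest and serializes it to JSON, which `kubectl apply` accepts. The `apiVersion`, `kind`, and `templateRef` field, along with the tenant, template, and namespace names, are assumptions made for illustration; the vCluster Platform CRD reference is the authority on the actual schema.

```python
import json


def tenant_cluster_manifest(tenant: str, template: str, project_ns: str) -> dict:
    """Build a hypothetical tenant-cluster custom resource as a plain dict."""
    return {
        "apiVersion": "management.loft.sh/v1",  # assumed API group and version
        "kind": "VirtualClusterInstance",  # assumed kind
        "metadata": {"name": tenant, "namespace": project_ns},
        "spec": {"templateRef": {"name": template}},  # assumed field name
    }


# Hypothetical tenant, template, and project namespace.
manifest = tenant_cluster_manifest("tenant-a", "gpu-tenant", "p-default")
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

The resulting JSON could be committed to a Git repository and reconciled by a GitOps controller, or piped to `kubectl apply -f -`.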
Managed K8s Per Tenant

vCluster is a fully managed Kubernetes environment provisioned on demand via API or UI. Each tenant gets their own API server, RBAC, and namespaces — isolated from all other tenants — without requiring separate physical clusters or manual cluster management by the provider.

  • Dedicated API server per tenant
  • isolated RBAC
  • no shared control plane
Orchestration
Self-Service
Active
Instant kubeconfig Delivery

On provisioning, vCluster automatically generates and delivers a kubeconfig to the customer. They have immediate kubectl access and can use Helm, Lens, k9s, or any standard Kubernetes tooling — no SSH setup, no firewall rules, no manual steps from the operator.

  • Auto-generated kubeconfig at provisioning
  • immediate kubectl access
Orchestration
Self-Service
Active
Universal Tooling Compatibility

Standard kubeconfig works with any Kubernetes tooling out of the box — Lens, k9s, kube-dashboard, Headlamp all connect normally. No additional configuration required from the operator or the tenant.

  • Standard kubeconfig compatibility with all K8s tooling
Orchestration
Self-Service
Active
GPU Device Assignment

GPU visibility is configured automatically by the NVIDIA device plugin (part of GPU Operator) when GPUs are requested via Kubernetes resource limits. The device plugin passes the GPU assignment to the container when it starts (as an environment variable under its envvar strategy), so only the assigned devices are visible and no manual CUDA_VISIBLE_DEVICES configuration is required.

  • NVIDIA device plugin envvar injection
  • GPU resource limits
Orchestration
Self-Service
Active
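The request side of this mechanism is the standard `nvidia.com/gpu` extended resource. As a minimal sketch, the Python below builds a Pod manifest that asks for GPUs through resource limits and counts what was requested. The pod name and image are hypothetical placeholders. Which environment variables appear inside the container depends on the device plugin configuration, so none are asserted here.

```python
def gpu_pod(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest that requests GPUs via Kubernetes resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # List the GPUs that the container can actually see.
                    "command": ["nvidia-smi", "-L"],
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def requested_gpus(pod: dict) -> int:
    """Sum the nvidia.com/gpu limits across all containers in the pod."""
    return sum(
        int(c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu", 0))
        for c in pod["spec"]["containers"]
    )


# The image reference is a placeholder, not a real registry path.
pod = gpu_pod("gpu-check", "registry.example.com/cuda-base:latest", 8)
```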
PVC + hostPath + S3

PVC and hostPath storage work natively in vCluster. PVCs are synced to the host cluster by default, and vCluster includes a local path provisioner for dynamic PVC provisioning via hostPath with no additional setup. S3-compatible object storage is also accessible via an S3 CSI driver deployed on the host cluster.

  • PVC via CSI passthrough
  • hostPath
  • S3 via operator storage setup
Orchestration
Self-Service
Active
User Management & SSO

vCluster Platform provides fully featured user management. Users can be manually onboarded via email/password but SSO is also supported. OIDC/SAML and other SSO integrations enable secure and automated user onboarding and offboarding. Automated permission and key management ensures secure access for new users.

  • SSO/OIDC integration
  • RBAC role assignment
Orchestration
User Management & Access Control
Active
Per-Tenant RBAC & SSO

Each tenant cluster has its own isolated RBAC layer and OIDC/SSO integration. Providers connect their existing identity provider (Okta, GitHub, Azure AD) so customers authenticate via SSO and receive role-scoped kubeconfigs — no shared identity namespace between tenants.

  • Per-vCluster isolated RBAC
  • OIDC/SSO integration
  • customer brings their IdP
Orchestration
User Management & Access Control
Active
No SSH Required

vCluster eliminates SSH-based cluster access entirely. Customers receive a kubeconfig and authenticate via OIDC/SSO — no SSH key distribution, no per-node access management, no operator involvement for adding or removing cluster access.

  • kubeconfig-based access
  • no key distribution
Orchestration
User Management & Access Control
Active
Storage Access Controls

Kubernetes RBAC controls storage access at the namespace and PVC level. Per-tenant isolation means tenants cannot access PVCs outside their own cluster — cross-tenant storage access is structurally prevented.

  • K8s RBAC for storage
  • per-tenant PVC isolation
Orchestration
User Management & Access Control
Active
Pyxis Not Supported

Pyxis is a container runtime plugin for SLURM, built by NVIDIA. Not currently supported.

Orchestration
SLURM
Parallel Filesystem Support

vCluster supports any storage system with a CSI driver, including Weka, DDN, and VAST. Procuring and installing the filesystem is the operator's responsibility and a prerequisite for using it in vCluster.

  • CSI passthrough
  • host storage class inheritance
Storage
Storage Provisioning
S3 Object Storage

S3-compatible object storage is the operator's responsibility to provision and expose. vCluster does not manage object storage directly. Tenants can access operator-provisioned S3 endpoints from within their tenant cluster using standard Kubernetes secrets and environment variables.

Storage
Storage Provisioning
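As a sketch of the tenant-side pattern described above, the Python below reads S3 connection settings from environment variables, which a Kubernetes Secret would typically populate, and fails early if any are missing. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional credential names. `S3_ENDPOINT_URL`, `S3_BUCKET`, and the sample values are hypothetical names chosen for this example.

```python
import os

# Credential variables follow the common AWS convention; the endpoint and
# bucket variable names are assumptions made for this example.
REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_ENDPOINT_URL", "S3_BUCKET")


def s3_settings(env=os.environ) -> dict:
    """Collect S3 settings from the environment, raising if any are missing."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise KeyError(f"missing S3 settings: {', '.join(missing)}")
    return {
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "endpoint_url": env["S3_ENDPOINT_URL"],
        "bucket": env["S3_BUCKET"],
    }


# Illustrative values standing in for what a mounted Secret would provide.
example_env = {
    "AWS_ACCESS_KEY_ID": "EXAMPLEKEY",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
    "S3_ENDPOINT_URL": "https://s3.example.internal",
    "S3_BUCKET": "checkpoints",
}
settings = s3_settings(example_env)
```

The returned dictionary can then be passed to whichever S3 client library the workload uses.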
PVC & Storage Class Support

vCluster supports any storage system with a CSI driver, including Weka, DDN, and VAST. PVC provisioning is identical to a native cluster, with no extra configuration required. Procuring and installing the filesystem is the operator's responsibility and a prerequisite.

  • CSI passthrough
  • host storage class inheritance
  • PVC provisioning identical to native cluster
Storage
Storage Provisioning
Active
CSI Mount Passthrough

CSI passthrough means mounts are handled by the tenant cluster's CSI driver — no extra configuration required.

  • CSI passthrough mount handling
Storage
Storage Provisioning
Active
Automated Backup & Restore

vCluster Platform provides backup and restore for tenant cluster state and PV data via volume snapshots, with configurable scheduling and retention. We recommend configuring application-specific (database) backups in addition.

Storage
Storage Backups
Active
Multi-Region DR

vCluster Platform multi-region mode provides control plane DR across regions. Storage-layer cross-region replication is determined by the operator storage infrastructure and requires appropriate configuration of the storage system.

  • vCluster Platform multi-region mode
Storage
Storage Backups
Active
PV Snapshot Support

vCluster supports PV snapshots: tenants trigger VolumeSnapshots via standard Kubernetes APIs and vCluster handles the rest. Automated backups can be configured to run at a fixed time interval or on a cron schedule.

  • VolumeSnapshot support
  • CSI snapshot passthrough
  • Platform backup/restore
Storage
Storage Backups
Active
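For reference, a tenant-triggered snapshot uses the standard `snapshot.storage.k8s.io/v1` API. The Python sketch below builds such a VolumeSnapshot manifest. The snapshot, namespace, PVC, and snapshot class names are hypothetical and would need to match what exists in the tenant cluster.

```python
def volume_snapshot(name: str, namespace: str, pvc: str, snapshot_class: str) -> dict:
    """Build a VolumeSnapshot manifest for an existing PersistentVolumeClaim."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc},
        },
    }


# Hypothetical names for illustration only.
snapshot = volume_snapshot(
    name="checkpoints-2025-01-15",
    namespace="training",
    pvc="checkpoints",
    snapshot_class="csi-snapclass",
)
```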
Backup Monitoring & Validation

vCluster Platform provides automated backups, exposes status and metadata for each backup, and provides the ability to verify backup validity.

Storage
Storage Backups
Active
Storage Integration Support

vCluster works with any high-performance storage solution, and our team provides hands-on support in configuring storage-related integrations and automations for a particular data center setup.

Storage
Storage Performance & Security
IB & RoCEv2 Fabric Automation

InfiniBand and RoCEv2 fabric automation is provided by the network infrastructure partner. vCluster integrates with Netris, which automates east-west fabric configuration, tenant isolation, PKey assignment via UFM, and Spectrum-X host networking via the NHN plugin.

  • Netris fabric automation
  • NVIDIA UFM integration
  • Spectrum-X NHN plugin
Networking
Network Setup
Active
Bare Metal MPI Performance

vCluster's Private Nodes model runs MPI workloads directly on bare metal — no virtualization layer between the MPI processes and the network fabric. MPI, TorchElastic, Ray, and JAX perform as they would on a native cluster with zero overhead added. HPC-X installation and configuration is the operator's responsibility.

  • Private Nodes bare-metal data path
  • zero MPI scheduling overhead
Networking
Network Setup
NCCL Config Responsibility

NCCL configuration (/etc/nccl.conf) is set by the operator in their GPU node OS image. vMetal provisions nodes using the operator's own ISO — no OS-level modifications are made. Operators include NCCL config in their base image or deploy it via init scripts.

Networking
Network Setup
GID Index Auto-Select

NCCL_IB_GID_INDEX=3 is a pre-NCCL 2.21 requirement. Since NCCL 2.21, GID index is auto-selected based on active link layer — manual configuration is no longer needed. For operators running NCCL 2.21+, this is handled automatically.

Networking
Network Setup
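The version rule above can be expressed as a small check that an operator might run when building node images. This is a sketch only: the function names are hypothetical, and the value 3 is taken from the criterion, whereas the correct GID index on a given host depends on that host's GID table.

```python
def parse_major_minor(version: str) -> tuple:
    """Turn a version string such as '2.21.5' into (2, 21)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def roce_gid_env(nccl_version: str) -> dict:
    """Return the GID-related environment to set for RoCEv2, if any.

    From NCCL 2.21 on, the GID index is auto-selected, so nothing is set.
    Older versions get the explicit index named in the criterion.
    """
    if parse_major_minor(nccl_version) >= (2, 21):
        return {}
    return {"NCCL_IB_GID_INDEX": "3"}


legacy_env = roce_gid_env("2.20.5")
current_env = roce_gid_env("2.21.5")
```

Comparing integer tuples rather than strings keeps versions such as 2.9 ordered correctly against 2.21.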
NCCL Auto-Tuning

vCluster Platform and vMetal do not set NCCL_MIN_NCHANNELS, NCCL_PROTO, or NCCL_ALGO. vMetal provisions nodes using the operator's own OS image without modification — no NCCL environment variables are injected at any layer. NCCL auto-tuning runs unobstructed.

Networking
Network Setup
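Operators who want to confirm that their own images leave auto-tuning alone could run a check along these lines. The Python sketch parses `KEY=VALUE` lines in the style of `/etc/nccl.conf` and reports any of the three tuning variables found there or in the process environment. The sample file contents and function names are assumptions for illustration.

```python
# Variables that ClusterMAX expects to be left unset so NCCL can auto-tune.
TUNING_OVERRIDES = {"NCCL_MIN_NCHANNELS", "NCCL_PROTO", "NCCL_ALGO"}


def parse_nccl_conf(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def tuning_overrides(conf_text: str, env: dict) -> list:
    """Return the sorted tuning variables set in the config file or environment."""
    found = set(parse_nccl_conf(conf_text)) | set(env)
    return sorted(found & TUNING_OVERRIDES)


# Illustrative file contents: one allowed setting and one override to flag.
sample_conf = """
# site defaults
NCCL_DEBUG=WARN
NCCL_ALGO=Ring
"""

violations = tuning_overrides(sample_conf, {"NCCL_PROTO": "Simple", "PATH": "/usr/bin"})
```

On a real node, the text would come from reading `/etc/nccl.conf` and the environment from `os.environ`.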
SHARP Collective Operations

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) accelerates NCCL collective operations when NVIDIA Quantum InfiniBand switches are present. Configuration is managed via NVIDIA UFM through the network fabric partner.

Networking
Network Performance
Active
NCCL Test Validation

The 4-node NCCL bandwidth test is a provider validation benchmark — the operator runs and certifies this against their hardware configuration. vCluster's Private Nodes model ensures no platform overhead is added to the NCCL data path, but benchmark execution and pass/fail certification is the operator's responsibility.

Networking
Network Performance
PyTorch Native Performance

PyTorch distributed performance benchmarking is a provider validation responsibility. vCluster adds no overhead to the GPU compute or network data path — PyTorch workloads on Private Nodes run at native bare-metal performance.

Networking
Network Performance
NCCL Observability

NCCL communication observability is available via NVIDIA DCGM Exporter (deployed as part of GPU Operator) for GPU-level metrics, and via NVIDIA's NCCL Inspector Profiler Plugin (released December 2025) for per-communicator, per-collective performance monitoring. Both integrate with Prometheus and Grafana.

  • DCGM Exporter
  • NCCL Inspector Profiler Plugin
  • Prometheus
  • Grafana
Networking
Network Performance
Active
Automated Network Testing
Coming soon

Automated performance testing for new tenant environments is on the roadmap. Currently, vCluster helps with manual performance testing but an in-product solution is coming soon.

NCCL communication observability is available via NVIDIA DCGM Exporter, which deploys automatically as part of GPU Operator.

Networking
Network Performance
Active
Fabric Stability Monitoring

Network stability monitoring across the IB and Ethernet fabric is managed by the network infrastructure partner. Netris provides continuous switch-level health monitoring including interface status, BGP state, topology/wiring errors, and hardware health across all managed nodes.

  • Netris fabric monitoring
  • K8s node conditions
Reliability
Network Resilience
Active
Link Flap Monitoring

InfiniBand and Ethernet link flap monitoring is handled at the network infrastructure layer by the fabric partner. Netris provides continuous interface up/down monitoring across all managed switches and surfaces events to the operator.

  • Netris interface monitoring
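
For operators who want to post-process interface up/down events themselves, a minimal flap detector could look like the sketch below. The event tuple format, the interface names, and the thresholds are assumptions made for illustration; this is not the Netris event schema or API.

```python
from collections import deque


def flapping_interfaces(events, max_transitions=3, window_s=60):
    """Return interfaces with more than max_transitions state changes
    inside any window_s-second window.

    events: iterable of (timestamp_s, interface, state), ordered by time.
    """
    last_state = {}
    recent = {}          # interface -> deque of recent transition timestamps
    flapping = set()
    for ts, iface, state in events:
        if iface in last_state and last_state[iface] != state:
            q = recent.setdefault(iface, deque())
            q.append(ts)
            while q and ts - q[0] > window_s:   # drop transitions outside the window
                q.popleft()
            if len(q) > max_transitions:
                flapping.add(iface)
        last_state[iface] = state
    return flapping


# Hypothetical event stream: swp1 toggles four times in 20 s, swp2 drops once.
events = [
    (0, "swp1", "up"), (5, "swp1", "down"), (9, "swp1", "up"),
    (14, "swp1", "down"), (20, "swp1", "up"),
    (0, "swp2", "up"), (300, "swp2", "down"),
]
print(flapping_interfaces(sorted(events)))
```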
Reliability
Network Resilience
Active
Switch Event Monitoring

InfiniBand and Ethernet link flap events are monitored at the fabric layer by the network infrastructure partner. Netris provides continuous interface status monitoring across managed switches, alerting on link state changes.

  • Netris interface monitoring
  • NVIDIA UFM
Reliability
Network Resilience
Active
IB Health Monitoring

InfiniBand health monitoring — link status, error counters, and PKey consistency — is managed at the fabric layer via NVIDIA UFM, orchestrated by the network infrastructure partner. Netris integrates with UFM for IB fabric management and monitoring.

  • Netris-UFM integration
  • NVIDIA UFM IB monitoring
Reliability
Network Resilience
Active
NVLink Error Tracking

DCGM Exporter tracks NVLink bandwidth and error metrics per GPU. Critical for NVL72 configurations. Deployed via GPU Operator on vCluster — operators configure Prometheus alerts for NVLink error thresholds.

  • DCGM NVLink bandwidth and error metrics
  • Prometheus alerting
Reliability
Network Resilience
Active
IB Link Status Validation

InfiniBand link status monitoring is managed at the fabric layer via NVIDIA UFM, orchestrated by the network infrastructure partner. Netris automates fabric configuration. Operators requiring ibstat-level granularity must run IB diagnostic tooling directly.

  • NVIDIA UFM IB monitoring
Reliability
Network Resilience
Active
PKey Consistency Enforcement

PKey consistency across InfiniBand nodes is maintained by Netris via its NVIDIA UFM integration. The Netris-UFM reconciliation loop (10-second interval) continuously verifies and enforces PKey assignments across all fabric nodes.

  • Netris-UFM PKey reconciliation
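
The idea of a reconciliation pass can be sketched as a diff between desired and observed PKey membership. This is a conceptual illustration only: the data shapes, GUID strings, and action names are hypothetical and do not represent the Netris or UFM APIs.

```python
def reconcile_pkeys(desired, observed):
    """Diff desired vs observed PKey membership.

    Both arguments map a port GUID (str) to a set of PKey values (int).
    Returns a sorted list of (action, guid, pkey) corrections.
    """
    actions = []
    for guid in desired.keys() | observed.keys():
        want = desired.get(guid, set())
        have = observed.get(guid, set())
        actions += [("add", guid, pkey) for pkey in want - have]
        actions += [("remove", guid, pkey) for pkey in have - want]
    return sorted(actions)


# Hypothetical state: guid-a carries a stray PKey, guid-b is missing its PKey.
desired = {"guid-a": {0x8001}, "guid-b": {0x8002}}
observed = {"guid-a": {0x8001, 0x8002}, "guid-b": set()}
actions = reconcile_pkeys(desired, observed)
print(actions)
# A real loop would apply the actions and then repeat on a fixed interval
# (the text above cites 10 seconds).
```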
Reliability
Network Resilience
Active
SLA Architecture Support

SLA commitments are the operator's responsibility to define and publish. vCluster Platform's architecture enables higher SLAs: control plane HA (multi-replica with embedded etcd or external DB), automated node lifecycle via vMetal, and tenant isolation ensuring one tenant's failure does not affect others.

  • Control plane HA
  • multi-replica deployment
  • external DB support
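
To see why a multi-replica control plane supports a higher SLA, the following back-of-the-envelope sketch estimates the availability of a quorum-based control plane (such as one backed by embedded etcd). It assumes replicas fail independently and that service requires a majority of replicas; the 99% per-replica figure is hypothetical, and real failure modes are often correlated.

```python
from math import comb


def quorum_availability(replica_availability, replicas):
    """Probability that a majority of independent replicas is up."""
    if not 0.0 <= replica_availability <= 1.0 or replicas < 1:
        raise ValueError("availability must be in [0, 1] and replicas >= 1")
    a = replica_availability
    quorum = replicas // 2 + 1
    return sum(
        comb(replicas, k) * a**k * (1 - a) ** (replicas - k)
        for k in range(quorum, replicas + 1)
    )


def downtime_minutes(availability, days=30):
    """Expected downtime over a period of the given length."""
    return (1.0 - availability) * days * 24 * 60


single = quorum_availability(0.99, 1)
three = quorum_availability(0.99, 3)
print(f"1 replica:  {single:.6f} ({downtime_minutes(single):.1f} min / 30 days)")
print(f"3 replicas: {three:.6f} ({downtime_minutes(three):.1f} min / 30 days)")
```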
Reliability
GPU & System Health
GPU Bus Fault Detection

NVIDIA DCGM Exporter (deployed via GPU Operator on vCluster) tracks XID error codes including GPU bus fault events. Operators connect Prometheus to receive alerts. vCluster tenant isolation scopes blast radius to the affected node only.

  • DCGM XID error tracking
  • Prometheus alerting
Reliability
GPU & System Health
Active
PCIe Error Monitoring

DCGM Exporter tracks PCIe replay counters and bus error events. Deployed automatically via GPU Operator on vCluster — operators connect Prometheus scraping.

  • DCGM PCIe replay counter
  • Prometheus
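
When reasoning about a replay counter outside Prometheus, the key detail is handling counter restarts. The sketch below computes the increase over a series of samples and treats any drop in value as a restart from zero, similar in spirit to PromQL's increase() but without its extrapolation. The sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase of a monotonically increasing counter.

    samples: list of (timestamp_s, value). A drop in value is treated as a
    counter restart from zero, so the new value counts as the increment.
    """
    total = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def per_minute_rate(samples):
    """Average increase per minute across the sampled interval."""
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("need at least two samples spanning a positive interval")
    return counter_increase(samples) * 60 / elapsed


# Hypothetical PCIe replay counter samples; the counter restarts before t=90.
samples = [(0, 10), (30, 12), (60, 12), (90, 3), (120, 5)]
print(counter_increase(samples), per_minute_rate(samples))
```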
Reliability
GPU & System Health
Active
GPU Thermal Monitoring

DCGM Exporter tracks GPU and memory temperature on all GPU nodes. Operators connect Prometheus and configure thermal threshold alerts. NVIDIA's default Grafana dashboard includes thermal panels.

  • DCGM GPU/memory temperature metrics
  • Prometheus alerting
  • Grafana dashboard
Reliability
GPU & System Health
Active
ECC Memory Monitoring

DCGM Exporter tracks GPU ECC error rates per GPU. Deployed via GPU Operator on vCluster. CPU memory ECC monitoring requires a separate IPMI exporter deployed by the operator.

  • DCGM ECC error metrics
  • IPMI exporter for CPU memory
Reliability
GPU & System Health
Active
XID & SXID Detection

DCGM Exporter tracks XID and SXID error codes per GPU. Deployed via GPU Operator on vCluster — XID events surface through Prometheus alerting. vCluster tenant isolation scopes impact to the affected tenant only.

  • DCGM XID/SXID tracking
  • Prometheus alerting
Reliability
GPU & System Health
Active
NCCL Topology Health

NCCL operation observability is available via NVIDIA's NCCL Inspector Profiler Plugin. Operators include the inspector .so library in their GPU workload container image and set NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE=1 in their pod specs. The plugin runs entirely in-process — no cluster-level changes, no DaemonSet, no privileged containers required. Works natively inside vCluster pods.

  • NCCL Inspector Profiler Plugin
  • per-communicator performance logging
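
A minimal sketch of the pod-spec change described above: a helper that sets the two environment variables on every container of a pod manifest held as a Python dict. The image name and the plugin library path are hypothetical placeholders, and passing a file path in NCCL_PROFILER_PLUGIN is an assumption to check against the NCCL documentation for the version in use.

```python
import copy


def with_nccl_inspector(pod, plugin_path):
    """Return a copy of a pod manifest with NCCL Inspector env vars set."""
    patched = copy.deepcopy(pod)
    wanted = {"NCCL_PROFILER_PLUGIN": plugin_path, "NCCL_INSPECTOR_ENABLE": "1"}
    for container in patched["spec"]["containers"]:
        # Keep unrelated variables, replace any stale copies of ours.
        env = [e for e in container.get("env", []) if e["name"] not in wanted]
        env += [{"name": name, "value": value} for name, value in wanted.items()]
        container["env"] = env
    return patched


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer"},
    "spec": {"containers": [{
        "name": "train",
        "image": "registry.example.com/train:latest",   # hypothetical image
        "env": [{"name": "NCCL_DEBUG", "value": "INFO"},
                {"name": "NCCL_INSPECTOR_ENABLE", "value": "0"}],
    }]},
}
# Hypothetical location of the inspector .so inside the image.
patched = with_nccl_inspector(pod, "/opt/nccl/libnccl-profiler-inspector.so")
env = {e["name"]: e["value"] for e in patched["spec"]["containers"][0]["env"]}
print(env)
```

Because the change lives entirely in the pod spec, it can be applied per workload without touching cluster-level components.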
Reliability
GPU & System Health
Active
K8s Node Health Checks

vCluster surfaces Kubernetes node health conditions through the virtual control plane API server — including node Ready, MemoryPressure, DiskPressure, and PIDPressure conditions. Platform-level node health is further enriched by vMetal's bare metal lifecycle monitoring. Operators access node health via standard kubectl or any K8s observability tooling.

  • K8s node conditions via vCluster API
  • vMetal node lifecycle monitoring
Reliability
GPU & System Health
Active
Driver Version Consistency

GPU Operator deploys the NVIDIA driver and container toolkit as DaemonSets, so every GPU node in the cluster runs the same versions. The operator's reconciliation loop detects and corrects version drift automatically.

  • GPU Operator DaemonSet version enforcement
Reliability
GPU & System Health
Active
ECC Error Detection

DCGM Exporter tracks single-bit and double-bit ECC errors per GPU. Deployed via GPU Operator on vCluster — operators configure Prometheus alerts for ECC thresholds.

  • DCGM ECC error tracking
  • Prometheus alerting
Reliability
GPU & System Health
Active
Thermal Throttling Alerts

DCGM Exporter tracks GPU temperature and thermal throttling events. Operators configure Prometheus alerting rules for threshold detection. NVIDIA's default Grafana dashboard includes thermal monitoring panels.

  • DCGM thermal metrics
  • Prometheus alerting
  • Grafana thermal dashboards
Reliability
GPU & System Health
Active
Power Usage Tracking

DCGM Exporter tracks GPU power draw and total energy consumption per GPU. Operators connect Prometheus for per-tenant power utilization dashboards and chargeback via PromQL aggregation by namespace.

  • DCGM power usage and energy metrics
  • Prometheus
  • per-tenant power dashboards
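
For chargeback, the energy arithmetic can be sketched as follows. It assumes the energy counter is reported in millijoules (as DCGM documents for its total energy consumption field) and that each reading has already been attributed to a tenant namespace. The reading format and values are hypothetical.

```python
from collections import defaultdict

MJ_PER_KWH = 3.6e9  # 1 kWh = 3.6e6 J = 3.6e9 mJ


def energy_kwh_by_namespace(readings):
    """Sum per-GPU energy deltas into kWh per namespace.

    readings: iterable of (namespace, start_mj, end_mj), one entry per GPU.
    Negative deltas (counter restarts) are ignored rather than guessed at.
    """
    totals = defaultdict(float)
    for namespace, start_mj, end_mj in readings:
        totals[namespace] += max(end_mj - start_mj, 0) / MJ_PER_KWH
    return dict(totals)


# Hypothetical billing-period readings for three GPUs.
readings = [
    ("tenant-a", 0.0, 3.6e9),
    ("tenant-a", 1.0e9, 2.8e9),
    ("tenant-b", 5.0e9, 5.9e9),
]
usage = energy_kwh_by_namespace(readings)
print(usage)
```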
Reliability
GPU & System Health
Active
XID/SXID Error Tracking

DCGM Exporter tracks XID and SXID error codes per GPU via GPU Operator. XID events surface through Prometheus. vCluster tenant isolation scopes blast radius to the affected tenant only.

  • DCGM XID/SXID error tracking
Reliability
GPU & System Health
Active
PCIe & Power State Metrics

DCGM Exporter tracks PCIe bus health and power state metrics. Deployed via GPU Operator on vCluster — operators configure Prometheus alerting for anomalies.

  • DCGM PCIe and power state metrics
Reliability
GPU & System Health
Active
IPMI & Fan Monitoring

IPMI/BMC telemetry for fan speed and hardware health is the operator's responsibility to configure. vMetal (Metal3) uses BMC for node provisioning and power management but does not expose ongoing IPMI telemetry. Operators deploy a standalone IPMI exporter on their nodes to feed hardware metrics into Prometheus.

Reliability
GPU & System Health
Error Counter Monitoring

GPU-layer error counters (retries, ECC errors) are available via DCGM Exporter deployed through GPU Operator. Network-layer packet drop and retry counters are the operator's responsibility — Netris provides network automation but not deep flow-state telemetry.

  • DCGM GPU error counters
Reliability
GPU & System Health
Active
NCCL Operation Profiling

NCCL operation health tracking is available via NVIDIA's NCCL Inspector Profiler Plugin. Operators include the inspector .so library in their GPU workload container images and enable it via environment variables (NCCL_PROFILER_PLUGIN, NCCL_INSPECTOR_ENABLE=1). The plugin runs entirely in-process with no Kubernetes footprint — works natively inside vCluster GPU pods.

  • NCCL Inspector Profiler Plugin
Reliability
GPU & System Health
Active
Automated Node Recovery

vMetal handles the full bare metal node lifecycle today — including deprovisioning, reimaging, and returning nodes to the available pool for reassignment. What is on the roadmap is automated health-based triggering: detecting a GPU fault via DCGM, automatically draining the node, deprovisioning, and reprovisioning without operator intervention. Today that remediation chain requires operator action.

  • Bare metal deprovisioning
  • reimaging
  • node pool return
  • automated remediation pipeline (roadmap)
Reliability
GPU & System Health
Active
ncu Profiling

ncu (the NVIDIA Nsight Compute command-line profiler) is available through the operator's container image or the NVIDIA CUDA Toolkit.

Monitoring
GPU Monitoring
Active
Managed Grafana Dashboards

kube-prometheus-stack (including Grafana) deploys normally inside each tenant cluster. DCGM Exporter integrates with Prometheus to surface GPU utilization, memory, power, and error metrics per tenant — all scoped to the individual tenant with no cross-tenant visibility. Operators can offer managed Grafana dashboards as a platform feature.

  • Per-tenant kube-prometheus-stack
  • DCGM Exporter integration
  • per-tenant Grafana
Monitoring
GPU Monitoring
Active
TFLOPs Estimation

TFLOPs estimation is the operator's responsibility to benchmark and publish. DCGM Exporter surfaces GPU utilization and SM clock data that operators can use to derive effective TFLOPs, but vCluster does not compute or track TFLOPs natively.

  • DCGM SM utilization and clock metrics (raw input for TFLOPs estimation)
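
One way an operator might derive a rough effective-TFLOPs figure from these raw inputs is sketched below: theoretical peak from SM count, per-SM throughput, and SM clock, scaled by an activity fraction. All hardware numbers here are hypothetical placeholders rather than the specification of any real GPU, and a coarse utilization value will overstate real throughput, so treat the result as an upper-bound estimate.

```python
def peak_tflops(num_sms, flops_per_sm_per_clock, sm_clock_mhz):
    """Theoretical peak in TFLOPs at the observed SM clock."""
    return num_sms * flops_per_sm_per_clock * sm_clock_mhz * 1e6 / 1e12


def effective_tflops(peak, activity_fraction):
    """Scale peak by the fraction of time the relevant units were active."""
    if not 0.0 <= activity_fraction <= 1.0:
        raise ValueError("activity_fraction must be between 0 and 1")
    return peak * activity_fraction


# Hypothetical GPU: 100 SMs, 1024 FLOPs per SM per clock, 1500 MHz, 50% active.
peak = peak_tflops(num_sms=100, flops_per_sm_per_clock=1024, sm_clock_mhz=1500)
estimate = effective_tflops(peak, 0.5)
print(f"peak={peak:.1f} TFLOPs estimate={estimate:.1f} TFLOPs")
```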
Monitoring
GPU Monitoring
Real-Time System Monitoring

Real-time GPU and system monitoring is available via DCGM Exporter (GPU metrics at ~1s granularity) and kube-prometheus-stack — both deployable inside each tenant cluster. Operators connect their Prometheus instance to scrape DCGM metrics and visualize in Grafana in real time.

  • DCGM real-time GPU metrics
  • Prometheus scraping
Monitoring
GPU Monitoring
Active
Per-Tenant GPU Performance

Per-tenant GPU performance tracking is available via DCGM Exporter — surfacing GPU utilization, SM occupancy, memory bandwidth, and power draw per namespace. Operators aggregate metrics by tenant cluster namespace for per-tenant performance visibility.

  • DCGM GPU performance metrics
  • per-tenant namespace scoping
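
A small sketch of the aggregation step, assuming the metrics come back from the Prometheus HTTP API as an instant vector and that each series carries a `namespace` label. The label name and the sample response are assumptions; check the labels your exporter configuration actually emits.

```python
from collections import defaultdict
from statistics import mean


def mean_by_namespace(response):
    """Average an instant-vector query result per namespace label."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    grouped = defaultdict(list)
    for series in response["data"]["result"]:
        namespace = series["metric"].get("namespace", "<unassigned>")
        _timestamp, value = series["value"]     # value arrives as a string
        grouped[namespace].append(float(value))
    return {ns: mean(values) for ns, values in grouped.items()}


# Hypothetical response for a GPU utilization query.
response = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"namespace": "tenant-a", "gpu": "0"}, "value": [1700000000, "80"]},
        {"metric": {"namespace": "tenant-a", "gpu": "1"}, "value": [1700000000, "90"]},
        {"metric": {"namespace": "tenant-b", "gpu": "0"}, "value": [1700000000, "40"]},
    ]},
}
utilization = mean_by_namespace(response)
print(utilization)
```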
Monitoring
GPU Monitoring
Active
Tenant Resource Utilization

Per-tenant resource utilization monitoring is available by deploying GPU Operator, DCGM Exporter, and kube-prometheus-stack inside each tenant cluster, with ServiceMonitors configured to scope metrics by namespace. vCluster's tenant isolation ensures each tenant only sees their own metrics.

  • Per-tenant kube-prometheus-stack
  • DCGM GPU utilization metrics
  • cross-tenant aggregation for operators
Monitoring
GPU Monitoring
Active
Prometheus Alerting

Prometheus Alertmanager is the alerting layer — deployable alongside kube-prometheus-stack inside each tenant cluster. Operators configure alert rules against DCGM metrics (XID errors, temperature thresholds, ECC errors, utilization). Per-tenant isolation ensures alert rules and notification channels are scoped to individual tenants.

  • Prometheus Alertmanager
  • per-tenant alert rules
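
The sketch below generates a small Prometheus rule group for the alert types mentioned above. The metric names are DCGM field identifiers that are assumed to be enabled in the exporter's metric list; the thresholds, alert names, and the `namespace` label selector are illustrative choices. The output is JSON, which is expected to load as a rule file because JSON is valid YAML, but verify that against your Prometheus version.

```python
import json


def dcgm_alert_group(temp_c=85, namespace=None):
    """Build a Prometheus rule group (as a dict) for basic GPU health alerts."""
    selector = f'{{namespace="{namespace}"}}' if namespace else ""

    def rule(name, expr, severity, duration="5m"):
        return {"alert": name, "expr": expr, "for": duration,
                "labels": {"severity": severity}}

    return {"groups": [{
        "name": "dcgm-gpu-health",
        "rules": [
            rule("GpuXidError", f"DCGM_FI_DEV_XID_ERRORS{selector} > 0", "critical", "0m"),
            rule("GpuHighTemperature", f"DCGM_FI_DEV_GPU_TEMP{selector} > {temp_c}", "warning"),
            rule("GpuDoubleBitEcc",
                 f"increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{selector}[1h]) > 0",
                 "critical", "0m"),
        ],
    }]}


group = dcgm_alert_group(namespace="tenant-a")
rules = group["groups"][0]["rules"]
print(json.dumps(group, indent=2))
```

Scoping every expression with the tenant's namespace keeps the rule set reusable across tenants while preserving per-tenant isolation of alerts.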
Monitoring
Alerting
Active
Active & Passive Health Checks

Passive health checks are available via DCGM Exporter (GPU metrics) and Kubernetes node conditions (Ready, MemoryPressure, DiskPressure) — both accessible within each tenant cluster. Active health checks require operator-configured tooling such as scheduled NCCL tests or GPU burn-in jobs.

  • DCGM passive GPU monitoring
  • K8s node conditions
Monitoring
Health Checks
Active
Burn-In Test Responsibility

GPU burn-in testing and documentation are the operator's responsibility. vMetal provisions bare metal nodes but does not run or document burn-in tests; under the hood it is Metal3, which covers PXE, OS provisioning, and lifecycle management only.

Monitoring
Health Checks
Full Passive Health Suite

Comprehensive passive GPU health monitoring is available via DCGM Exporter deployed through GPU Operator on vCluster — covering temperature, power, ECC errors, XID codes, PCIe health, NVLink status, and utilization. Kubernetes node conditions provide host-level passive health. Together they cover the full passive health check surface.

  • DCGM full metric suite
  • K8s node conditions
Monitoring
Health Checks
Active
Active Health Check Jobs

Automated active health checks (GPU burn-in, DGEMM benchmarks, NCCL tests) are the operator's responsibility to configure and schedule. vCluster supports running these as Kubernetes Jobs or CronJobs within tenant clusters — the isolation model ensures health check jobs do not interfere with other tenants.

  • K8s Job/CronJob-based health check support
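
As one possible shape for a scheduled active check, the sketch below builds a `batch/v1` CronJob manifest that runs a single-node NCCL all-reduce test. The image name, schedule, and job name are hypothetical, and the default command assumes the image ships the `all_reduce_perf` binary from nccl-tests. Multi-node tests would need an MPI or similar launcher, which is out of scope here.

```python
import json


def nccl_check_cronjob(name, image, schedule, gpus=8, command=None):
    """Return a batch/v1 CronJob manifest (dict) for a periodic GPU health check."""
    command = command or ["all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", str(gpus)]
    container = {
        "name": "health-check",
        "image": image,
        "command": command,
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",          # never overlap two checks
            "jobTemplate": {"spec": {
                "backoffLimit": 0,                   # a failed check should stay failed
                "template": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                }},
            }},
        },
    }


# Hypothetical nightly check at 03:00.
cronjob = nccl_check_cronjob("nightly-nccl-check",
                             "registry.example.com/nccl-tests:latest", "0 3 * * *")
print(json.dumps(cronjob, indent=2))
```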
Monitoring
Health Checks
Diagnostic Tooling

Operators have full kubectl access to each tenant cluster for standard Kubernetes diagnostics. GPU diagnostics are available via DCGM (XID codes, ECC errors, health validation). NVIDIA GPU Operator includes a validator component that runs diagnostic checks at node startup.

  • kubectl diagnostics
  • DCGM GPU health data
  • GPU Operator validator
Monitoring
Health Checks
Active
Automated Node Remediation
Coming soon

vMetal supports node deprovisioning and reimaging today — nodes can be drained, wiped, and returned to the available pool. The automated fault-detection-to-action pipeline (detect GPU fault via DCGM, automatically drain, deprovision, reprovision) is on the roadmap via Auto Nodes + vMetal + vCluster. Today operator action is required to trigger remediation.

  • vMetal node deprovisioning/reimaging (live)
  • Auto Nodes + vMetal automated remediation (roadmap)
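
Until the automated pipeline ships, an operator-side script could at least decide which nodes to queue for remediation. The sketch below is conceptual: the XID set is a small illustrative subset (48 is a double-bit ECC error, 79 is a GPU that has fallen off the bus), and the step names are plain labels for the chain described above, not vMetal or Auto Nodes API calls.

```python
CRITICAL_XIDS = frozenset({48, 79})   # illustrative subset, not a complete policy

REMEDIATION_STEPS = ("cordon", "drain", "deprovision", "reimage", "return_to_pool")


def plan_remediation(xid_events, critical=CRITICAL_XIDS):
    """Map each node with a critical XID to the ordered remediation steps.

    xid_events: iterable of (node_name, xid_code) pairs, for example derived
    from DCGM XID metrics.
    """
    flagged = sorted({node for node, xid in xid_events if xid in critical})
    return {node: list(REMEDIATION_STEPS) for node in flagged}


# Hypothetical events: gpu-node-2 reports an XID outside the critical subset.
xid_events = [("gpu-node-1", 79), ("gpu-node-2", 31),
              ("gpu-node-1", 48), ("gpu-node-3", 48)]
plan = plan_remediation(xid_events)
for node, steps in plan.items():
    print(node, "->", ", ".join(steps))
```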
Monitoring
Auto Remediation
Active
Failure Prediction System

AI-based failure prediction is the operator's responsibility to configure. vCluster provides the observability data layer (DCGM metrics, K8s events) that an ML-based failure prediction system can consume.

Monitoring
Auto Remediation
Usage Metering & Showback

vCluster Platform provides per-cluster resource quota enforcement and usage metering via vBilling. Operators can expose per-tenant utilization as showback dashboards in Grafana. vBilling provides the metering data layer; chargeback and invoicing to end customers require connecting an external billing platform.

  • Resource quota enforcement
  • vBilling metering
  • DCGM + Prometheus
  • Grafana showback
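
To illustrate the kind of figure a showback dashboard might display, the sketch below totals GPU-hours per tenant from allocation records. The record format, tenant names, and dates are hypothetical and do not reflect vBilling's data model or API.

```python
from collections import defaultdict
from datetime import datetime


def gpu_hours_by_tenant(allocations):
    """Sum GPU-hours per tenant.

    allocations: iterable of (tenant, gpu_count, start, end) with datetime bounds.
    """
    totals = defaultdict(float)
    for tenant, gpus, start, end in allocations:
        if end < start:
            raise ValueError(f"allocation for {tenant} ends before it starts")
        totals[tenant] += gpus * (end - start).total_seconds() / 3600
    return dict(totals)


# Hypothetical allocation records for one day.
day = datetime(2025, 1, 1)
allocations = [
    ("tenant-a", 8, day.replace(hour=0), day.replace(hour=12)),
    ("tenant-a", 4, day.replace(hour=12), day.replace(hour=18)),
    ("tenant-b", 16, day.replace(hour=9), day.replace(hour=10, minute=30)),
]
showback = gpu_hours_by_tenant(allocations)
print(showback)
```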
Pricing
Metering & Billing
Active
GPU Pricing Economics

GPU pricing is the operator's hardware and business decision. vCluster reduces operational overhead — fewer ops engineers needed to manage multi-tenant K8s at scale — which can lower total platform cost and improve margins. vMetal reduces time-to-provisioned-node, reducing idle GPU costs.

  • Reduced operational overhead
  • faster provisioning reducing idle time
Pricing
Metering & Billing
Flexible Contract Models

Contract and consumption models are the operator's commercial decision. vCluster Platform's metering layer provides the usage data needed to support any billing model — on-demand, reserved, or tiered — but the contracts themselves are the operator's responsibility.

  • Per-tenant usage metering
Pricing
Metering & Billing
Granular Resource Metering

vCluster Platform provides per-tenant resource quota enforcement and usage metering via vBilling, giving operators the granular consumption data needed to support unbundled billing for compute, storage, and GPU resources. Integration with an external billing system is required for actual invoicing and chargeback.

  • vBilling metering
  • DCGM GPU usage data
  • namespace-scoped Prometheus metrics
Pricing
Metering & Billing
Active
Tenant Scale-Out

Contract expansion and renewal are commercial decisions for the operator. vCluster Platform makes it operationally easy to scale existing tenants: adding nodes, increasing resource quotas, or spinning up new tenant clusters requires no new physical clusters.

  • On-demand tenant cluster scaling
  • resource quota adjustment
Pricing
Metering & Billing
Hardware-Agnostic Provisioning

GPU hardware acquisition and availability timelines are the operator's business decision. vMetal (Metal3) accelerates time from physical node to registered Kubernetes worker — any new GPU SKU is supported without platform changes. The operator controls the hardware roadmap.

  • Hardware-agnostic bare metal provisioning
Pricing
Metering & Billing
Kernel Library Management

Performance kernel libraries are the operator's responsibility to bundle in their base OS image or make available to tenants. vCluster does not ship or manage ML kernel collections. Tenants can install libraries inside their tenant cluster environments independently.

Pricing
Metering & Billing
NVIDIA Investment Status

NVIDIA is a close technology partner of vCluster, not a current investor.

Partnerships
NVIDIA Ecosystem
NCP Technical Compatibility

NCP (NVIDIA Cloud Partner) is a certification program exclusively for GPU cloud infrastructure providers — vCluster Labs as a software vendor is not eligible and does not claim NCP status. vCluster helps AI cloud operators satisfy the technical requirements of NCP certification: GPU Operator compatibility, DCGM integration, IB/RoCEv2 networking stack, and multi-tenant isolation. Operators using vCluster can reference this compatibility when pursuing NCP certification.

  • GPU Operator compatibility
  • DCGM integration
  • multi-tenant isolation enabling NCP technical requirements
Partnerships
NVIDIA Ecosystem
Active
AMD Cloud Alliance

AMD Cloud Alliance membership is a provider-level certification. vCluster Platform supports AMD GPU workloads via the AMD GPU Operator on Kubernetes — hardware-agnostic at the platform layer. AMD Cloud Alliance status is the operator's credential to pursue.

Partnerships
NVIDIA Ecosystem
SchedMD Partnership

There is no current SchedMD partnership. Slurm support is on the roadmap for H2 2026.

Partnerships
NVIDIA Ecosystem
Full CNCF Ecosystem Support

The full Kubernetes and NVIDIA GPU ecosystem works inside a tenant cluster without modification — GPU Operator, Network Operator, DCGM, KAI Scheduler, Prometheus, Grafana, ArgoCD, and all standard CNCF tooling. vCluster is CNCF-compatible and does not require custom integrations or ecosystem modifications.

  • Full CNCF ecosystem compatibility
  • NVIDIA GPU stack compatibility
  • GitOps-compatible
Partnerships
NVIDIA Ecosystem
Active
Industry Event Participation

vCluster Labs actively participates in major GPU infrastructure and Kubernetes events — including KubeCon North America, KubeCon Europe, NVIDIA GTC, and SC (Supercomputing). The team presents on GPU tenant isolation, vNode security, and AI infrastructure architecture.

Partnerships
Industry Events
Active
Next-Gen GPU Support

vMetal supports any bare metal GPU node — H200, B200, GB200, and NVL72 hardware register into the platform without software changes. New GPU SKUs require no platform redesign.

  • Hardware-agnostic bare metal provisioning
  • no platform changes for new GPU SKUs
Availability
Hardware Provisioning
Current GPU Compatibility

Current GPU model availability is the operator's hardware decision. vMetal supports any bare metal GPU node — H100, H200, A100, L40S, and MI300X hardware all register into the platform identically. Hardware-agnostic provisioning means no per-SKU configuration.

  • Hardware-agnostic bare metal provisioning
Availability
Hardware Provisioning
Hardware Timeline Planning

Hardware acquisition timelines are the operator's business decision. vMetal supports any new GPU SKU without platform changes.

Availability
Hardware Provisioning
Multi-Tenant Scale

Fleet size is the operator's hardware investment decision. vCluster Platform enables scale without cluster sprawl — hundreds of tenant clusters run on a single control plane cluster, keeping per-cluster operational overhead flat regardless of GPU count.

  • Multi-tenant density
  • single control plane cluster for hundreds of tenants
Availability
Scale & Density
Geographic Reach

Geographic reach is the operator's infrastructure and business decision. vCluster Platform's multi-region mode supports distributed control planes across regions.

Availability
Scale & Density
Utilization Optimization

Uptime and utilization SLAs are the operator's commitment. vCluster Platform improves utilization economics — shared infrastructure with strong tenant isolation reduces idle GPU waste versus cluster-per-tenant approaches. Control plane HA ensures platform availability independent of individual node health.

  • Multi-tenant GPU utilization
  • control plane HA
Availability
Capacity Planning
Fleet-Wide Capacity Planning

vCluster Platform enables capacity planning without linear operational scaling — adding new tenants, expanding resource quotas, or onboarding new GPU nodes requires no new physical clusters. Operators manage all tenants from a single control plane with full visibility into resource allocation and utilization across the fleet.

  • Single control plane for fleet-wide capacity visibility
  • on-demand tenant scaling
Availability
Capacity Planning
Active
GPU Expansion Simplicity

GPU acquisition roadmaps are the operator's hardware and financial planning responsibility. vCluster Platform makes adding new nodes to existing tenant clusters operationally simple — no new clusters or configuration required.

Availability
Capacity Planning
Business Outcomes

What Moving Up the ClusterMAX™ Ranking Means for Your Business

Enterprise AI teams use ClusterMAX to make six- and seven-figure infrastructure decisions. Each rating improvement translates directly to pipeline and contract value.

Rating Tier = Larger Deal Size

Enterprise buyers filter ClusterMAX tiers before shortlisting providers. A higher rating gets you into more RFPs and reduces time spent on procurement due diligence.

<1 Day From Hardware to Managed Kubernetes

vCluster turns raw GPU nodes into a fully managed Kubernetes offering in under a day. Faster time-to-market means more customers onboarded before competitors catch up.

ClusterMAX Cites vCluster by Name

vCluster is now explicitly cited in the ClusterMAX Security criterion. Customers reading the criteria see your implementation listed as the requirement, not an alternative.

No Cluster Sprawl

vCluster lets you serve 100 enterprise tenants on a shared GPU fleet without managing 100 separate physical clusters, keeping OpEx flat as you scale to meet Availability scoring.

Isolation Without Compromise

Private Nodes and dedicated control planes give each customer the isolation of a dedicated environment at a fraction of the cost, directly improving Security and Orchestration scores.

Full GPU Stack Compatibility

GPU Operator, NCCL, MIG, DCGM, and distributed training frameworks all work natively inside vClusters, satisfying the Orchestration, Storage, Networking, and Monitoring criteria simultaneously.

Dive deeper

Architecture, Networking & Industry Certifications

vCluster on NVIDIA DGX Systems Reference Architecture
Ebook

A blueprint for bringing cloud-grade elasticity and automation to NVIDIA DGX systems.

Automate Network Isolation for Hard Multi-Tenant Kubernetes
SOLUTION

vCluster and Netris integrate Kubernetes and network automation.

vCluster Guide to Achieve ClusterMAX™ Platinum Rating
GUIDE

Learn how to deliver enterprise-grade Kubernetes for AI workloads and improve ClusterMAX™ rating.

Ready to Improve Your ClusterMAX™ Rating?

Talk to a solution architect about which vCluster platform products to deploy first based on your current rating gaps.