
Kubernetes 1.36: User Namespaces Are Finally GA – And Why It Matters for Tenant Isolation

May 27, 2026 | 19 min read

After ten years of KEPs, six years of active development, and four releases of beta polishing, User Namespaces in Kubernetes graduated to General Availability in v1.36 (released April 23, 2026).

If you only read one thing from the release blog, read this:

A process running as root inside a container is also seen from the kernel as root on the host. If an attacker manages to break out of the container, whether through a kernel vulnerability or a misconfigured mount, they are root on the host.

That single sentence has been the defining limitation of "containers as a security boundary" since the day Docker shipped. v1.36 is the day Kubernetes finally has a clean, in-tree answer to it: available by default, and switched on per pod with a single field.

This post is the deep dive: what user namespaces actually do, the kernel breakthrough that made them practical for stateful pods, the exact YAML to enable them, the CVEs they neutralize, the limitations the release blog politely glosses over, and where they fit alongside node-level isolation tools like vNode for production tenant isolation.

The UID 0 Problem (and Why "Run As Non-Root" Was Never Enough)

For years the recommended container security mantra has been: "don't run as root." Pod Security Standards enforce it, OPA Gatekeeper rules enforce it, every CIS benchmark enforces it.

There is a real problem with that advice: a huge fraction of real workloads need root inside the container. Database servers binding privileged ports, build tools doing chroots, network functions calling iptables, sidecars mounting filesystems, anything doing Docker-in-Docker, anything that does apt-get install. Telling those workloads "just run as 1000" means rebuilding the image, hacking the entrypoint, or giving up the feature.

So operators routinely cave and grant root. And the second they do, the kernel sees the same thing it has always seen: UID 0 in the container is UID 0 on the host. A container escape – runc CVE, kernel LPE, badly-mounted hostPath, leaked socket – and the attacker is running as the host's actual root, with full power over every other pod on the node, the kubelet, the container runtime, and any mounted secrets.

Capabilities, seccomp, AppArmor, and SELinux all reduce the blast radius. None of them change the underlying truth: the process identity is still root.

User namespaces fix that at the kernel level.

[Figure: The UID 0 problem: with vs without user namespaces]

What a User Namespace Actually Does

A Linux user namespace is a kernel feature (stable since 3.8) that gives a process its own private mapping of UIDs and GIDs. Inside the namespace, the process can see itself as UID 0 (root). Outside the namespace – to the host kernel, to ps aux, to file ownership on disk – it is some unprivileged UID like 100000.

The mapping is configured per-namespace and looks like this:

Inside container UID → Host UID
0 → 100000
1 → 100001
...
65535 → 165535

The process inside truly believes it is root. It can chown files (within its namespace), chmod 4755 binaries, install packages, run iptables against its own network namespace. None of that grants it any privilege over the host. If a kernel bug lets it escape to the host process tree, it lands as UID 100000 – a user that owns nothing, can read nothing sensitive, and can kill no other process.
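To make the mapping concrete, here is a minimal Python sketch that translates a container UID into a host UID. It uses the three-column layout the kernel exposes in /proc/<pid>/uid_map (first ID inside, first ID outside, length). The function name and the sample mapping are illustrative and not taken from any Kubernetes component.

```python
from typing import Optional

# Each entry mirrors one line of /proc/<pid>/uid_map:
# (first ID inside the namespace, first ID outside it, number of IDs mapped).
UidMap = list[tuple[int, int, int]]


def container_to_host_uid(uid: int, uid_map: UidMap) -> Optional[int]:
    """Return the host UID for a container UID, or None if it is unmapped."""
    for inside_start, outside_start, length in uid_map:
        if inside_start <= uid < inside_start + length:
            return outside_start + (uid - inside_start)
    return None


# Illustrative mapping matching the table above: 65,536 IDs from host UID 100000.
example_map: UidMap = [(0, 100000, 65536)]

print(container_to_host_uid(0, example_map))      # 100000
print(container_to_host_uid(65535, example_map))  # 165535
print(container_to_host_uid(65536, example_map))  # None (outside the mapped range)
```

The last call is the interesting one: an ID outside the mapped range has no host identity at all, which is why a namespaced root cannot name or touch host UID 0.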

Capabilities behave the same way: with hostUsers: false, CAP_SYS_ADMIN is administrative only over the namespace's own resources. CAP_SYS_MODULE is literally void – the kernel will not let a namespaced root load a kernel module no matter what the pod spec requests.

This is the property the Kubernetes blog refers to as "namespaced capabilities," and it is the reason user namespaces are different in kind from runAsUser: 1000. You do not have to rewrite the workload. The workload runs as root. The host just refuses to believe it.

[Figure: How user namespace UID/GID mapping works]

The Hard Part Wasn't Namespaces – It Was Volumes

User namespaces have existed in Linux for over a decade. Container runtimes have used them for years (rootless Podman, rootless Docker, LXC). So why did Kubernetes take six years?

Volumes.

Imagine pod A is mapped to host UID range 100000–165535 and pod B is mapped to 200000–265535. Both containers want to read a file from a PersistentVolume. On disk, that file is owned by some real UID. For pod A's "root" to read it, the file must be owned by 100000. For pod B's "root" to read the same volume, it must be owned by 200000. You cannot have both.

The early implementations solved this by chown'ing every file in the volume on pod startup to match the pod's host UID range. For an emptyDir this is fine. For a 50TB persistent volume with a few million inodes, it is a multi-minute startup penalty – and worse, it permanently rewrites file ownership, breaking any other pod that wants to mount the same volume with a different UID range.

The fix is a kernel feature called ID-mapped mounts, introduced in Linux 5.12 and refined through 6.3. The mount itself carries a UID translation table. The kernel applies it transparently at every read, write, stat, and chown:

  • On disk, the file is owned by some fixed UID (say, 1000).
  • Pod A mounts the volume with an idmap tied to its host range (100000–165535). Its container sees the file as owned by 1000, exactly as it is on disk.
  • Pod B mounts the same volume with an idmap tied to its own range (200000–265535). Its container also sees the file as owned by 1000.
  • No chown. No file rewrite. O(1) mount time.
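The two-step translation can be modeled in a few lines of Python. This is a simplified sketch, not kernel code. It assumes the mount's ID map mirrors the pod's user-namespace mapping, so on-disk IDs keep their value inside the container. It also assumes the default overflow UID of 65534 for anything unmapped. The function and constant names are made up for illustration.

```python
OVERFLOW_UID = 65534  # default ID shown for owners that have no mapping ("nobody")
RANGE = 65536         # number of IDs mapped into each pod


def seen_in_container(on_disk_uid: int, pod_host_base: int, idmapped: bool) -> int:
    """Model the file owner a pod sees, given the pod's first host UID."""
    # Step 1: the mount layer. An ID-mapped mount shifts on-disk IDs into the
    # pod's host range; a plain mount passes them through unchanged.
    vfs_uid = on_disk_uid
    if idmapped and on_disk_uid < RANGE:
        vfs_uid = on_disk_uid + pod_host_base
    # Step 2: the user namespace translates host IDs back into container IDs.
    if pod_host_base <= vfs_uid < pod_host_base + RANGE:
        return vfs_uid - pod_host_base
    return OVERFLOW_UID


# Same on-disk owner, two pods with different host ranges.
print(seen_in_container(1000, 100000, idmapped=True))   # 1000
print(seen_in_container(1000, 200000, idmapped=True))   # 1000
print(seen_in_container(1000, 100000, idmapped=False))  # 65534
```

The third call shows the problem the recursive chown used to paper over: without an ID-mapped mount, the on-disk owner falls outside the pod's range and shows up as the overflow user.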

This is the breakthrough. Without idmap mounts, Kubernetes user namespaces would have shipped years ago and been useless for any pod with persistent state. With them, you can flip hostUsers: false on a StatefulSet running Postgres tomorrow and not pay any startup penalty at all.

[Figure: ID-mapped mounts: how the kernel translates UIDs at mount time]

The YAML: It Really Is One Line

Here is the entire opt-in:

apiVersion: v1
kind: Pod
metadata:
  name: privileged-but-contained
spec:
  hostUsers: false              # the whole feature
  containers:
  - name: app
    image: fedora:42
    securityContext:
      runAsUser: 0              # root inside the container
      capabilities:
        add: ["NET_ADMIN"]      # namespaced, harmless on the host

A few things worth knowing about that spec:

  • runAsUser: 0, runAsGroup, fsGroup: all of these refer to the user inside the container. The kubelet handles the mapping to a unique non-overlapping host range for you. You do not write subuid/subgid manually.
  • File ownership inside volumeMounts is identical whether hostUsers is true or false. You can flip the flag on an existing pod template without touching any volume permissions, init container chowns, or fsGroup config.
  • Capabilities you add are scoped to the pod's namespace. CAP_SYS_ADMIN works for things like mounting in the pod's mount namespace. CAP_SYS_MODULE does nothing. CAP_NET_ADMIN works on the pod's network namespace, not the host's.
  • Pod Security Standards integrate with this: the Restricted profile becomes meaningfully more permissive when hostUsers: false, because the dangerous capabilities are no longer dangerous.

By default the kubelet allocates each pod 65,536 contiguous UIDs/GIDs from a range above 0–65535, guaranteeing no two pods on a node share a mapping. If you want a custom range, you create a system user named exactly kubelet on the node, install getsubids (from shadow-utils), and add entries to /etc/subuid and /etc/subgid.
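As a rough illustration of the arithmetic, the Python sketch below parses one /etc/subuid-style line (name:start:count) and carves non-overlapping 65,536-ID ranges out of it. This is not the kubelet's actual allocation algorithm. The sample entry and the helper names are hypothetical.

```python
POD_ID_COUNT = 65536  # IDs handed to each pod, per the paragraph above


def parse_subid_line(line: str) -> tuple[str, int, int]:
    """Split one /etc/subuid or /etc/subgid line of the form name:start:count."""
    name, start, count = line.strip().split(":")
    return name, int(start), int(count)


def pod_ranges(line: str, pods: int) -> list[tuple[int, int]]:
    """Carve the first `pods` non-overlapping 65,536-ID ranges out of one entry."""
    _, start, count = parse_subid_line(line)
    if pods * POD_ID_COUNT > count:
        raise ValueError("entry is too small for that many pods")
    return [
        (start + i * POD_ID_COUNT, start + (i + 1) * POD_ID_COUNT - 1)
        for i in range(pods)
    ]


# Hypothetical entry: room for 110 pods, starting right above 0-65535.
entry = "kubelet:65536:7208960"
print(pod_ranges(entry, 2))  # [(65536, 131071), (131072, 196607)]
```

Sizing the count as a whole multiple of 65,536 keeps every pod's range complete. Whatever entry you choose, check it against the Kubernetes documentation for your version.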

[Figure: Pod with hostUsers: false – end-to-end flow]

What This Actually Stops – The CVE Map

User namespaces neutralize a specific class of attack: anything that depends on the container's UID being equal to a host UID with privilege. Some real examples:

CVE / Issue | Class | Without hostUsers: false | With hostUsers: false
CVE-2019-5736 (runc host binary overwrite) | Container escape | Attacker overwrites /usr/bin/runc on host as root | Attacker is unprivileged on host, write fails
CVE-2021-25741 (subPath symlink race) | Host file read | Pod reads arbitrary host files matching pod's UID | Pod's UID does not match any sensitive host UID
CVE-2022-0492 (cgroup release_agent) | Privilege escalation via cgroups | CAP_SYS_ADMIN in pod can write release_agent | CAP_SYS_ADMIN is namespace-scoped, write fails
CVE-2024-21626 (runc fd leak) | Container escape | Process inherits host fd as root | Process inherits fd as unprivileged user
Generic kernel LPE during escape | Defense in depth | Lands as host root | Lands as unprivileged host user

This is not a theoretical list. These are HIGH-severity CVEs that have shipped in the last few years. User namespaces would have neutered or significantly defanged every one of them.

The Kubernetes release blog is honest about this:

This feature also enables a critical pattern: running workloads with privileges and still being confined in the user namespace. When hostUsers: false is set, capabilities like CAP_NET_ADMIN become namespaced ... This effectively enables new use cases that were not possible before without running a fully privileged container.

Translation: workloads that previously required a privileged pod – and therefore required you to trust them at host-root level – can now run with their privileges scoped to the pod. That is a category change.

The Production-Readiness Reality Check

GA does not mean "every cluster gets this for free." Here is the requirements list, and it is not trivial:

Component | Minimum version | Notes
Linux kernel | 6.3+ | Earlier 5.12+ kernels work for some volume types but tmpfs (used by service account tokens, Secrets) needs 6.3
Filesystem at /var/lib/kubelet/pods/ and all pod volumes | idmap-capable | btrfs, ext4, xfs, fat, tmpfs, overlayfs are supported
containerd | 2.0+ | -
CRI-O | 1.25+ | -
OCI runtime | runc 1.2+ or crun 1.13+ | -
cri-dockerd | not yet supported | Mirantis/cri-dockerd#74

A few realities that flow from this list:

  • Most managed Kubernetes platforms are not on kernel 6.3+ yet. GKE, EKS, and AKS rolled out 5.x kernels through 2024 and have been gradually moving to 6.x. Check uname -r on your nodes before assuming hostUsers: false works.
  • NFS-backed PersistentVolumes have variable idmap support. If you run stateful workloads on NFS, test before committing.
  • Pods cannot access host /proc, host /sys, or other pods' namespaces. This is the point of the feature, but it does break a small number of legitimate workloads – node-exporter-style daemons, certain debugging tooling, anything that reads /proc/1/... on the host.
  • CAP_SYS_MODULE no longer works. Workloads that load kernel modules from inside a pod (some CNI installers, some GPU drivers, eBPF tooling that sideloads modules) will need a different home – typically a dedicated privileged DaemonSet that runs without hostUsers: false.
  • No migration path for already-running pods. hostUsers is immutable on a Pod; you change it via the controller (Deployment, StatefulSet) and recreate.
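The kernel requirement is the easiest one to check mechanically. Here is a small Python sketch that compares the major.minor prefix of a `uname -r` string against the 6.3 floor from the table. The sample release strings in the usage lines are illustrative. Distribution kernels sometimes backport features, so treat the result as a first filter rather than proof.

```python
import re

MIN_KERNEL = (6, 3)  # floor taken from the requirements table above


def kernel_at_least(release: str, minimum: tuple[int, int] = MIN_KERNEL) -> bool:
    """Compare the major.minor prefix of a `uname -r` string against a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


# Illustrative release strings, not taken from any specific provider.
print(kernel_at_least("6.8.0-45-generic"))  # True
print(kernel_at_least("5.15.0-1051-aws"))   # False
```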

This is still a giant step forward. It is not a magic flag.

Where Pod-Level User Namespaces Stop

User namespaces solve the process identity problem for a running container. They do not solve everything.

Two specific gaps matter for tenant isolation:

  1. The container creation flow itself runs on the host. OCI hooks like the NVIDIA Container Toolkit's createContainer hook fire before pivot_root – which means they run as a privileged host process, with the container's working directory, with environment variables inherited from the container image. The user namespace applies to the eventual container process, not to the runtime that builds it. We will come back to this with a real CVE.
  2. The shared kernel is still shared. Every pod on a node uses one Linux kernel. A kernel bug – a use-after-free in a syscall, a buggy filesystem driver, a netfilter race – can still affect every workload on the node, regardless of UID mapping.

The first gap is the one most people miss. The second gap is the well-known one.

For the second gap, hypervisor-grade options exist (Kata, Firecracker, Edera) and userspace syscall sandboxes exist (gVisor). They each pay a real cost: VM boot time and memory tax for the hypervisor route, syscall-translation latency and compatibility loss for gVisor. They also each restrict what tenants can do – gVisor blocks privileged operations entirely; Kata loses things like kubectl port-forward because the kubelet's network namespace assumptions break across the VM boundary.

vNode lives in this exact landscape, and the comparison vCluster ships on its site puts it head-to-head with Kata, gVisor, and Sysbox – not as a different category, but as a peer alternative with a different mechanism. After digging into how it actually works, that positioning is defensible. Here is why.

What vNode Actually Does – The Architecture That Matters

The shorthand "vNode is user namespaces + seccomp" is incomplete. The official one-line framing from the vCluster team is more accurate: VM-grade isolation, container-grade performance, zero workload changes.

How that actually works is what makes the comparison defensible.

vNode is a nested container runtime. Not a sidecar, not a policy engine. Each tenant's workloads run inside an isolated vNode container that acts as a lightweight "mini-host" – the workload sees a normal Linux environment, the host kernel sees an unprivileged process. No hypervisor. No guest kernel.

The actual on-node components are:

  • vnode-manager – per-physical-node Pod that orchestrates virtual nodes
  • vnode-containerd-shim-runc-v2 – per-vNode shim that replaces the regular containerd shim
  • vnode-runc – forked runc that runs inside the vNode container
  • vnode-init – init process inside each vNode
  • vnode-cni – CNI plugin that wires networking through the vNode

In a standard Kubernetes setup, every step from the containerd shim through runc through OCI hooks executes on the host as host root, before pivot_root. With vNode, that whole chain (vnode-containerd-shim-runc-v2 → vnode-runc → OCI hooks → pivot_root → workload) runs inside the vNode container. Containers-in-containers, safely.

On top of that nesting, vNode applies a defined three-layer security model.

Layer 1: User Namespaces + UID Mapping

Every vNode is an unprivileged user on the host. Each gets a unique UID/GID range; container UID 0 maps to an unprivileged host UID at 65536+ ('nobody').

ID-mapped mounts (Linux 6.1+) handle the translation at zero cost: no recursive chown, no startup penalty. vNode's documentation lists support for ~50 filesystems including ext4, btrfs, xfs, and overlayfs.

The bottom line: even a full container escape lands the attacker on the host as nobody – unprivileged, with no access to other vNodes' files or processes.

Layer 2: File System Virtualization (FUSE)

Sensitive kernel interface subpaths in /proc and /sys are virtualized via a FUSE filesystem (vnodefs). The workload reads /proc/uptime and gets its own uptime. Writes to /proc/sys/... are scoped per-container. Host hardware identifiers in /sys/devices/virtual/dmi are completely hidden.

Path | Mechanism | What the workload sees
/proc/uptime | FUSE (vnodefs) | Container's own uptime, not host's
/proc/sys/* | FUSE (vnodefs) | Per-container sysctls (hostname, pid_max, ip_forward, etc.)
/proc/[pid]/* | PID namespace | Only own processes visible
/sys/kernel | FUSE (vnodefs) | Kernel parameters hidden from workload
/sys/devices/virtual/dmi | FUSE (vnodefs) | Hardware identifiers completely hidden

This is what lets Prometheus node-exporter, K3s, Docker-in-Docker, and GPU operators actually run inside the vNode – they expect a real node and vNode gives them one without exposing the underlying host.

Layer 3: Targeted Syscall Filtering

vNode classifies syscalls into three buckets, surgically rather than wholesale:

  • Blocked (hard deny): Raw packet sockets (AF_PACKET), promiscuous mode (SIOCSIFFLAGS), packet multicast operations, setting trusted.* xattrs.
  • Intercepted (handled in userspace): mount/umount (validated paths only), chown/fchown (UID-mapped correctly), xattr ops (only overlay.opaque is allowed, which is what makes container-in-container work), reboot/swap (returns success as a no-op).
  • Allowed (pass-through): Everything else hits the real kernel. The process runs as an unprivileged UID; standard kernel permission checks apply normally.
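The three buckets amount to a lookup with a permissive default. The Python toy model below shows that shape only. The operation labels are invented for illustration and are not vNode's real identifiers, and the real filtering happens at the seccomp layer, not in Python.

```python
from enum import Enum


class Verdict(Enum):
    BLOCKED = "hard deny"
    INTERCEPTED = "handled in userspace"
    ALLOWED = "pass-through to the kernel"


# Illustrative operation labels only; these are not vNode's real identifiers.
BLOCKED_OPS = frozenset({"af_packet_socket", "set_promiscuous_mode", "set_trusted_xattr"})
INTERCEPTED_OPS = frozenset({"mount", "umount", "chown", "fchown", "reboot"})


def classify(operation: str) -> Verdict:
    """Return the bucket for an operation; anything unlisted passes through."""
    if operation in BLOCKED_OPS:
        return Verdict.BLOCKED
    if operation in INTERCEPTED_OPS:
        return Verdict.INTERCEPTED
    return Verdict.ALLOWED


print(classify("af_packet_socket").value)  # hard deny
print(classify("mount").value)             # handled in userspace
print(classify("read").value)              # pass-through to the kernel
```

The default branch is the design choice that matters: unlisted operations reach the real kernel, and safety there rests on the process being an unprivileged UID.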

This is the explicit philosophical contrast with gVisor:

Aspect | gVisor | vNode
Syscall handling | Reimplements ~237 syscalls in userspace Sentry | Pass-through as unprivileged
Kernel attack surface | ~68 syscalls actually exposed | Full kernel, unprivileged user
/proc, /sys isolation | Reimplemented in Sentry | FUSE virtualization
Performance | 10–30% I/O overhead | Near-native
Compatibility | ~70% syscall coverage | Full Linux
Philosophy | Don't trust the kernel at all | Trust parts of the kernel, harden it, then isolate tenants

gVisor's bet is "rewrite the entire kernel in userspace so the workload never reaches it." vNode's bet is "trust parts of the kernel, but make the process unprivileged and virtualize the things that leak host info." Both are defensible. They have very different cost profiles.

Defense in depth – additive, not replacement

The vCluster team is explicit that vNode is additive to Kubernetes security, not a substitute. Kubernetes-imposed seccomp profiles, capability drops, and resource limits are preserved and passed through. vNode adds its layer on top:

  1. Kubernetes policy – RBAC, NetworkPolicy, PodSecurity, seccomp profiles, capability drops.
  2. vNode isolation – user namespaces, FUSE filesystem virtualization, targeted seccomp, UID mapping.
  3. Kernel enforcement – standard permission checks on unprivileged UID, cgroups resource limits.

[Figure: How vNode wraps the container creation flow]

The Concrete Reason This Matters: NVIDIAScape (CVE-2025-23266)

In July 2025, Wiz disclosed CVE-2025-23266 – "NVIDIAScape" – a CVSS 9.0 container escape in the NVIDIA Container Toolkit. Wiz estimated 40% of GPU-using environments were affected.

The exploit is, almost insultingly, three lines in a Dockerfile:

FROM ubuntu:22.04
ENV LD_PRELOAD=/proc/self/cwd/poc.so
COPY poc.so /poc.so

That's it. Once the malicious image is run on a node with a vulnerable NVIDIA Container Toolkit:

  1. Kubelet → containerd → runc starts setting up the container.
  2. runc runs the createContainer OCI hook, which invokes nvidia-ctk to attach GPUs.
  3. nvidia-ctk runs as the host's runc process – root, on the host, before pivot_root.
  4. It inherits LD_PRELOAD from the container image's environment. It inherits the container's working directory (/proc/self/cwd resolves to the container's mounted root).
  5. The dynamic loader honors LD_PRELOAD, loads poc.so from the container image, executes it – as host root, before any container isolation has been applied.

Game over. The attacker now has root on the host and can dial containerd.sock directly to start a fully-privileged hostPath container, mount the host filesystem, exfiltrate every secret on the node.
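Because the attack rides in on the image's environment, one cheap defensive check is to look at that environment before admitting an image. The Python sketch below flags dynamic-loader variables in an OCI image config, where `config.Env` is a list of "NAME=value" strings. The list of variables is my own choice. Legitimate images do set some of them, so this is a review signal. It is not the vendor's fix and not a complete detection.

```python
# Dynamic-loader variables that change which code a process loads at startup.
# This list is an illustrative choice, not an exhaustive or official one.
LOADER_VARS = ("LD_PRELOAD", "LD_LIBRARY_PATH", "LD_AUDIT")


def risky_env_entries(image_config: dict) -> list[str]:
    """Return Env entries in an OCI image config that set loader variables."""
    env = image_config.get("config", {}).get("Env") or []
    return [entry for entry in env if entry.split("=", 1)[0] in LOADER_VARS]


# Image config shaped like the three-line Dockerfile above would produce.
config = {"config": {"Env": ["PATH=/usr/bin", "LD_PRELOAD=/proc/self/cwd/poc.so"]}}
print(risky_env_entries(config))  # ['LD_PRELOAD=/proc/self/cwd/poc.so']
```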

Now let's apply Kubernetes 1.36 user namespaces to this exact attack. Set hostUsers: false on the malicious pod. What changes?

Almost nothing. The user namespace is created for the eventual container process. The createContainer hook runs before the user namespace and pivot_root are applied – it runs as the host's runc, as host root, exactly as before. The malicious poc.so loads, executes as host root, and the breakout succeeds.

vCluster's own engineering blog states this explicitly:

"Unlike the seccomp filters gVisor applies or the recently introduced user namespaces feature in Kubernetes, we don't apply any security measures against a user-defined container [… instead vNode wraps the workload in a hardened sandbox container]. Even if container escape occurs, the attacker would just land in the virtual node which is our vNode sandbox container rather than the actual host."

This is the point that is easy to miss. K8s 1.36 user namespaces do not protect against the OCI-hook class of vulnerabilities. vNode does, by construction, because the hook executes inside the vNode sandbox. The attacker's LD_PRELOAD payload still loads, but it runs inside the vNode container, not on the host. It lands in the sandbox: it cannot reach /run/containerd/containerd.sock, cannot start a privileged hostPath pod, and cannot mount the host filesystem.

This is not theoretical. NVIDIAScape was a real, exploited CVE that, by Wiz's estimate, affected 40% of GPU-using environments. The pattern – malicious OCI hook context inheritance – recurs whenever a runtime ships hooks. vNode mitigates an entire class of these by relocating the creation flow into a sandbox.

[Figure: NVIDIAScape: how the hook bypasses K8s user namespaces but not vNode]

The Honest Comparison Table

This mirrors vCluster's own published How vNode Compares chart, with one extra row I think readers need.

Capability | vNode | K8s 1.36 hostUsers: false | Sysbox | gVisor | Kata Containers
Isolation approach | User namespaces + FUSE + targeted seccomp | User namespaces | User namespaces | Userspace syscall interception | Micro-VMs
Low overhead | Yes | Yes | Partial | No | No
Fast startup time | Yes | Yes | Yes | Yes | No
Low performance impact | Yes | Yes | Yes | Partial (10–30% I/O tax) | No
High tenant autonomy | Yes | No | Partial | No | Yes
High security strength | Yes | No | Partial | Yes | Yes
High networking & storage isolation | Yes | No | No | No | Yes
Protects against OCI-hook vulnerabilities (NVIDIAScape class) | Yes | No | No | No | Yes
Low failure blast radius | Yes | Partial | Yes | Partial | No
Compatibility with cloud providers | Yes | Yes | No | Partial | No
Kubernetes native | Yes | Yes | No | Yes | Yes
Ease of use | Yes | Partial | Partial | No | No
Commercial support & maintenance | Yes | No | No | No | No
Kernel-CVE cross-tenant protection | No (shared kernel) | No (shared kernel) | No | Yes (most syscalls never reach kernel) | Yes (separate kernel per pod)
Threat model fit | Cooperative multi-tenant platforms | Single-pod hardening | Docker-in-Docker focus | Untrusted code execution | Adversarial multi-tenant

Two rows worth being literal about, because they often get muddied:

OCI-hook protection – vNode's nested-runtime architecture means OCI hooks (like nvidia-ctk's createContainer hook) execute inside the vNode container. The K8s 1.36 user-ns feature applies to the eventual container process, not the runtime that builds it. NVIDIAScape exploits this gap. vNode and Kata close it; raw user namespaces, gVisor, and Sysbox do not.

Kernel-CVE cross-tenant protection – vNode shares the host kernel. So does upstream Kubernetes user namespaces (hostUsers: false) and Sysbox. A kernel CVE that escapes a user namespace affects every tenant on the node for all three. Kata gives a separate kernel per pod; gVisor implements most syscalls in userspace so they never reach the host kernel. If your threat model is "regulated, adversarial multi-tenant environments where a kernel CVE between tenants is a compliance non-starter," you want a hypervisor option. For cooperative-tenant platforms – internal teams, CI runners, AI Cloud GPU customers, training pipelines – the OCI-hook protection and workload compatibility matter more, and vNode wins on the dimensions that actually constrain the design.

My Brutal Opinion: Is vNode Still Needed After 1.36?

Short version: yes – and stronger than I initially thought.

I'll separate the answer into the two real questions.

Question 1: Is vNode still needed if I only run a few privileged pods on an otherwise single-tenant cluster?

Probably not. Set hostUsers: false and call it done. v1.36 is genuinely sufficient for hardening a privileged sidecar, a build pod, or a network-admin DaemonSet on a cluster where you trust everyone running pods.

Question 2: Is vNode still needed for multi-tenant platforms?

Yes, for three reasons that v1.36 does not address:

  1. OCI-hook vulnerabilities. NVIDIAScape is the cleanest example, but the pattern is structural, not specific to NVIDIA. Any runtime that ships OCI hooks – GPU operators, CSI drivers, custom CDI specs, sidecar injectors – has the same exposure. v1.36 user namespaces apply to the running container, not the creation flow. vNode wraps the creation flow. That is usually a defense-in-depth difference, and it becomes the only line of defense when the vulnerability sits in the runtime's hook chain.
  2. Workload compatibility for privileged tenants. v1.36 lets a single pod run "privileged inside its own namespace," which is great. But "I want my tenants to run their own GPU operator, their own DinD-based build runners, their own nested K3s, their own Prometheus node-exporter, and have all of it just work" is a different ask. K3s does not run under raw hostUsers: false (kernel module loading, sysctl writes, /proc expectations). DinD does not run cleanly. node-exporter does not produce useful output. vNode's procfs/sysfs emulation is what makes those workloads functional inside a sandboxed boundary.
  3. Tenant-shaped abstraction. Operators, DaemonSets, GPU drivers, monitoring agents – the entire ecosystem assumes "a node" is a real unit. With vNode each tenant gets a node abstraction they can operate, complete with their own DaemonSets and their own privileged operations bounded by the tenant. Building this on top of raw 1.36 user namespaces requires substantial platform engineering.

Where vNode is genuinely the right answer:

  • Internal platform teams running tens to hundreds of cooperative tenants.
  • AI Clouds renting GPU capacity to teams who need to run their own GPU operator, DinD runners, fine-tuning frameworks – privileged inside their tenant boundary, not on the host. Especially after NVIDIAScape, which targeted exactly this market.
  • CI/CD platforms running per-team build clusters.
  • Any environment where the tenant ask is "give me a node I can operate" and the platform answer needs to be "yes, but contained."

The pattern most platform teams actually land on:

vCluster for control plane Tenant Isolation. vNode for node-level Tenant Isolation – including OCI-hook protection and workload compatibility. v1.36 hostUsers: false as the per-pod kernel-level UID floor underneath both.

The three layers are independent and they compound. v1.36 strengthens the floor. vNode is what most multi-tenant platforms still need to build on top of that floor.

[Figure: The three layers of tenant isolation – where each one stops]

Try It

The three things worth doing this week:

  1. Check your node kernels. kubectl get nodes -o wide shows the kernel version. Anything below 6.3 is not getting the full feature.
  2. Pick one privileged workload and flip hostUsers: false. The lowest-risk candidate is anything that needs CAP_NET_ADMIN for its own network namespace – a CNI sidecar, a service mesh proxy, a network function. The change is one line. The blast-radius reduction is enormous.
  3. Audit which pods still need real host privileges. With v1.36, the set of workloads that genuinely require host-level root shrinks dramatically. Those that remain (kernel-module loaders, some node-exporters) are now a much smaller, much more auditable surface.
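For the audit step, a short script over the cluster's pod list is usually enough to get a first inventory. The Python sketch below takes the parsed JSON from `kubectl get pods -A -o json` and lists pods that have not set `spec.hostUsers` to false. The function name and the sample data are illustrative. It reports candidates only and does not judge whether a pod really needs host privileges.

```python
def pods_sharing_host_users(pod_list: dict) -> list[str]:
    """List namespace/name for pods that do not set spec.hostUsers to false."""
    result = []
    for pod in pod_list.get("items", []):
        # An absent field means the pod shares the host user namespace.
        if pod.get("spec", {}).get("hostUsers") is not False:
            meta = pod.get("metadata", {})
            result.append(f"{meta.get('namespace', 'default')}/{meta.get('name')}")
    return result


# Hand-written sample shaped like `kubectl get pods -A -o json` output.
sample = {
    "items": [
        {"metadata": {"namespace": "ci", "name": "builder"}, "spec": {"hostUsers": False}},
        {"metadata": {"namespace": "kube-system", "name": "cni-installer"}, "spec": {}},
    ]
}
print(pods_sharing_host_users(sample))  # ['kube-system/cni-installer']
```

In practice you would load the JSON with `json.load` from the command's output and walk the resulting list workload by workload.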
