diff --git a/README.md b/README.md index 850df3a..bf01c80 100644 --- a/README.md +++ b/README.md @@ -103,6 +103,10 @@ Features: - [lesson 36-userspace-ebpf](src/36-userspace-ebpf/README.md) Userspace eBPF Runtimes: Overview and Applications - [lesson 38-btf-uprobe](src/38-btf-uprobe/README.md) Expanding eBPF Compile Once, Run Everywhere(CO-RE) to Userspace Compatibility - [lesson 43-kfuncs](src/43-kfuncs/README.md) Extending eBPF Beyond Its Limits: Custom kfuncs in Kernel Modules +- [features struct_ops](src/features/struct_ops/README.md) Extending Kernel Subsystems with BPF struct_ops +- [features bpf_iters](src/features/bpf_iters/README.md) BPF Iterators for Kernel Data Export +- [features dynptr](src/features/dynptr/README.md) BPF Dynamic Pointers for Variable-Length Data +- [features bpf_arena](src/features/bpf_arena/README.md) BPF Arena for Zero-Copy Shared Memory - [features bpf_wq](src/features/bpf_wq/README.md) BPF Workqueues for Asynchronous Sleepable Tasks - [features bpf_iters](src/features/bpf_iters/README.md) BPF Iterators for Kernel Data Export - [features struct_ops](src/features/struct_ops/README.md) Extending Kernel Subsystems with BPF struct_ops diff --git a/README.zh.md b/README.zh.md index c9a4802..b34c8b9 100644 --- a/README.zh.md +++ b/README.zh.md @@ -81,6 +81,10 @@ GPU: - [lesson 36-userspace-ebpf](src/36-userspace-ebpf/README.zh.md) 用户空间 eBPF 运行时:深度解析与应用实践 - [lesson 38-btf-uprobe](src/38-btf-uprobe/README.zh.md) 借助 eBPF 和 BTF,让用户态也能一次编译、到处运行 - [lesson 43-kfuncs](src/43-kfuncs/README.zh.md) 超越 eBPF 的极限:在内核模块中定义自定义 kfunc +- [features struct_ops](src/features/struct_ops/README.zh.md) eBPF 教程:使用 BPF struct_ops 扩展内核子系统 +- [features bpf_iters](src/features/bpf_iters/README.zh.md) eBPF 教程:BPF 迭代器用于内核数据导出 +- [features dynptr](src/features/dynptr/README.zh.md) BPF Dynamic Pointers for Variable-Length Data +- [features bpf_arena](src/features/bpf_arena/README.zh.md) eBPF 实例教程:BPF Arena 零拷贝共享内存 - [features bpf_wq](src/features/bpf_wq/README.zh.md) 
eBPF 教程:BPF 工作队列用于异步可睡眠任务 - [features bpf_iters](src/features/bpf_iters/README.zh.md) eBPF 教程:BPF 迭代器用于内核数据导出 - [features struct_ops](src/features/struct_ops/README.zh.md) eBPF 教程:使用 BPF struct_ops 扩展内核子系统 diff --git a/src/SUMMARY.md b/src/SUMMARY.md index dd9a7d6..3e1075e 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -94,6 +94,10 @@ Features: - [lesson 36-userspace-ebpf](36-userspace-ebpf/README.md) Userspace eBPF Runtimes: Overview and Applications - [lesson 38-btf-uprobe](38-btf-uprobe/README.md) Expanding eBPF Compile Once, Run Everywhere(CO-RE) to Userspace Compatibility - [lesson 43-kfuncs](43-kfuncs/README.md) Extending eBPF Beyond Its Limits: Custom kfuncs in Kernel Modules +- [features struct_ops](features/struct_ops/README.md) Extending Kernel Subsystems with BPF struct_ops +- [features bpf_iters](features/bpf_iters/README.md) BPF Iterators for Kernel Data Export +- [features dynptr](features/dynptr/README.md) BPF Dynamic Pointers for Variable-Length Data +- [features bpf_arena](features/bpf_arena/README.md) BPF Arena for Zero-Copy Shared Memory - [features bpf_wq](features/bpf_wq/README.md) BPF Workqueues for Asynchronous Sleepable Tasks - [features bpf_iters](features/bpf_iters/README.md) BPF Iterators for Kernel Data Export - [features struct_ops](features/struct_ops/README.md) Extending Kernel Subsystems with BPF struct_ops diff --git a/src/SUMMARY.zh.md b/src/SUMMARY.zh.md index 08416ac..a5c9e9f 100644 --- a/src/SUMMARY.zh.md +++ b/src/SUMMARY.zh.md @@ -73,6 +73,10 @@ GPU: - [lesson 36-userspace-ebpf](36-userspace-ebpf/README.zh.md) 用户空间 eBPF 运行时:深度解析与应用实践 - [lesson 38-btf-uprobe](38-btf-uprobe/README.zh.md) 借助 eBPF 和 BTF,让用户态也能一次编译、到处运行 - [lesson 43-kfuncs](43-kfuncs/README.zh.md) 超越 eBPF 的极限:在内核模块中定义自定义 kfunc +- [features struct_ops](features/struct_ops/README.zh.md) eBPF 教程:使用 BPF struct_ops 扩展内核子系统 +- [features bpf_iters](features/bpf_iters/README.zh.md) eBPF 教程:BPF 迭代器用于内核数据导出 +- [features dynptr](features/dynptr/README.zh.md) BPF Dynamic 
Pointers for Variable-Length Data +- [features bpf_arena](features/bpf_arena/README.zh.md) eBPF 实例教程:BPF Arena 零拷贝共享内存 - [features bpf_wq](features/bpf_wq/README.zh.md) eBPF 教程:BPF 工作队列用于异步可睡眠任务 - [features bpf_iters](features/bpf_iters/README.zh.md) eBPF 教程:BPF 迭代器用于内核数据导出 - [features struct_ops](features/struct_ops/README.zh.md) eBPF 教程:使用 BPF struct_ops 扩展内核子系统 diff --git a/src/cgroup/README.md b/src/cgroup/README.md index b109378..129dcf0 100644 --- a/src/cgroup/README.md +++ b/src/cgroup/README.md @@ -1,16 +1,14 @@ # eBPF Tutorial: cgroup-based Policy Control -This tutorial demonstrates how to use cgroup eBPF programs to implement per-cgroup policy controls for networking, device access, and sysctl operations. +Do you need to enforce network access control on containers or specific process groups without affecting the entire system? Or do you need to restrict certain processes from accessing specific devices while allowing others to use them normally? Traditional iptables and device permissions are global, making fine-grained per-process-group control impossible. + +This is the problem **cgroup eBPF** solves. By attaching eBPF programs to cgroups (control groups), you can implement policy control based on process membership—only processes belonging to a specific cgroup are affected. This enables container isolation, multi-tenant security, and sandbox environments. In this tutorial, we'll build a complete "policy guard" program that demonstrates TCP connection filtering, device access control, and sysctl read restrictions—three types of cgroup eBPF usage. ## What is cgroup eBPF? -**cgroup eBPF** allows you to attach eBPF programs to cgroups (control groups) to enforce policies based on process/container membership. Unlike XDP/tc which work on network interfaces, cgroup eBPF works at the process level: +The core idea of cgroup eBPF is simple: attach an eBPF program to a cgroup, and all processes in that cgroup will be controlled by this program. 
Unlike XDP/tc, which filter traffic by network interface, cgroup eBPF filters by process membership: put a container in a cgroup, attach a policy program, and that container's network access, device access, and sysctl reads/writes are all under your control. Processes outside that cgroup (and its descendants) are unaffected. -- Policies only affect processes in the target cgroup -- Perfect for container/multi-tenant/sandbox isolation -- Covers: network access control, socket options, sysctl access, device access - -When a cgroup eBPF program denies an operation, userspace typically sees `EPERM` (Operation not permitted). +This model is a good fit for container and multi-tenant scenarios. Kubernetes networking projects such as Cilium use cgroup eBPF hooks, for example for socket-level load balancing. You can also use it for device isolation (e.g., restricting which containers can access GPUs), security sandboxes (preventing reads of sensitive sysctls), and more. When a cgroup eBPF program denies an operation, the failing syscall typically returns `EPERM` (Operation not permitted). ## cgroup eBPF Hook Points @@ -69,10 +67,428 @@ We implement a single eBPF object with three programs: Events are sent to userspace via ringbuf for observability. +## Implementation + +### Shared Header: cgroup_guard.h + +This header defines data structures shared between kernel and userspace: + +```c +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +#ifndef __CGROUP_GUARD_H +#define __CGROUP_GUARD_H + +#ifndef TASK_COMM_LEN +#define TASK_COMM_LEN 16 +#endif + +#define SYSCTL_NAME_LEN 64 + +enum event_type { + EVENT_CONNECT4 = 1, + EVENT_DEVICE = 2, + EVENT_SYSCTL = 3, +}; + +struct event { + __u64 ts_ns; + __u32 pid; + __u32 type; + char comm[TASK_COMM_LEN]; + + union { + struct { + __u32 daddr; /* IPv4, network order */ + __u16 dport; /* host order */ + __u16 proto; /* e.g. 
6 for TCP */ + } connect4; + + struct { + __u32 major; + __u32 minor; + __u32 access_type; + } device; + + struct { + __u32 write; + char name[SYSCTL_NAME_LEN]; + } sysctl; + }; +}; + +#endif /* __CGROUP_GUARD_H */ +``` + +The `event` structure uses a union to store type-specific data for different events, saving space while maintaining a unified event format. + +### eBPF Program: cgroup_guard.bpf.c + +```c +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +/* cgroup_guard.bpf.c - cgroup eBPF policy guard + * + * This program demonstrates three types of cgroup eBPF hooks: + * 1. cgroup/connect4 - TCP connection filtering + * 2. cgroup/dev - Device access control + * 3. cgroup/sysctl - Sysctl read/write control + */ +#include "vmlinux.h" +#include +#include + +#include "cgroup_guard.h" + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; + +/* ===== Configurable options: set by userspace before load ===== */ +#define IPPROTO_TCP 6 + +const volatile __u16 blocked_tcp_dport = 0; /* host order */ +const volatile __u32 blocked_dev_major = 0; +const volatile __u32 blocked_dev_minor = 0; +const volatile char denied_sysctl_name[SYSCTL_NAME_LEN] = {}; /* NUL-terminated */ + +/* ===== ringbuf: send denied events to userspace ===== */ +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 1 << 24); /* 16MB */ +} events SEC(".maps"); + +static __always_inline void fill_common(struct event *e, __u32 type) +{ + e->ts_ns = bpf_ktime_get_ns(); + e->type = type; + e->pid = (__u32)(bpf_get_current_pid_tgid() >> 32); + bpf_get_current_comm(&e->comm, sizeof(e->comm)); +} + +/* Compare two strings, return 1 if equal, 0 if not + * Note: b is volatile to handle const volatile rodata arrays correctly */ +static __always_inline int str_eq(const char *a, const volatile char *b, int max_len) +{ +#pragma unroll + for (int i = 0; i < SYSCTL_NAME_LEN; i++) { + char ca = a[i]; + char cb = b[i]; + if (ca != cb) + return 0; + if (ca == '\0') + return 1; + } + return 1; +} + +/* 
===== 1) Network: block TCP connect4 to specified port ===== + * ctx: struct bpf_sock_addr + * user_ip4/user_port: network byte order (need conversion) + * + * Return semantics: + * - return 1: allow + * - return 0: deny (userspace gets EPERM) + */ +SEC("cgroup/connect4") +int cg_connect4(struct bpf_sock_addr *ctx) +{ + if (blocked_tcp_dport == 0) + return 1; + + if (ctx->protocol != IPPROTO_TCP) + return 1; + + __u16 dport = bpf_ntohs((__u16)ctx->user_port); + if (dport != blocked_tcp_dport) + return 1; + + struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0); + if (e) { + fill_common(e, EVENT_CONNECT4); + e->connect4.daddr = ctx->user_ip4; /* network order */ + e->connect4.dport = dport; /* host order */ + e->connect4.proto = ctx->protocol; + bpf_ringbuf_submit(e, 0); + } + + return 0; /* deny -> userspace gets EPERM on connect */ +} + +/* ===== 2) Device: block access to specified major:minor ===== + * ctx: struct bpf_cgroup_dev_ctx { access_type, major, minor } + * + * Return semantics: + * - return 0: deny (userspace gets EPERM) + * - return non-zero: allow + */ +SEC("cgroup/dev") +int cg_dev(struct bpf_cgroup_dev_ctx *ctx) +{ + if (blocked_dev_major == 0 && blocked_dev_minor == 0) + return 1; + + if (ctx->major != blocked_dev_major || ctx->minor != blocked_dev_minor) + return 1; + + struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0); + if (e) { + fill_common(e, EVENT_DEVICE); + e->device.major = ctx->major; + e->device.minor = ctx->minor; + e->device.access_type = ctx->access_type; + bpf_ringbuf_submit(e, 0); + } + + return 0; /* deny -> -EPERM */ +} + +/* ===== 3) Sysctl: block reading specified sysctl ===== + * ctx: struct bpf_sysctl + * Use bpf_sysctl_get_name() to get name + * + * Return semantics: + * - return 0: reject + * - return 1: proceed + * If return 0, userspace read/write returns -1 with errno=EPERM + */ +SEC("cgroup/sysctl") +int cg_sysctl(struct bpf_sysctl *ctx) +{ + char name[SYSCTL_NAME_LEN]; + int ret = 
bpf_sysctl_get_name(ctx, name, sizeof(name), 0); + if (ret < 0) + return 1; + + if (denied_sysctl_name[0] == '\0') + return 1; + + /* Only deny reads, allow writes (safer for testing) */ + if (ctx->write) + return 1; + + if (!str_eq(name, denied_sysctl_name, SYSCTL_NAME_LEN)) + return 1; + + struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0); + if (e) { + fill_common(e, EVENT_SYSCTL); + e->sysctl.write = ctx->write; +#pragma unroll + for (int i = 0; i < SYSCTL_NAME_LEN; i++) { + e->sysctl.name[i] = name[i]; + if (name[i] == '\0') + break; + } + bpf_ringbuf_submit(e, 0); + } + + return 0; /* deny -> -EPERM */ +} +``` + +#### Understanding the BPF Code + +The overall logic of this program is clear: three cgroup hooks handle network connections, device access, and sysctl reads/writes respectively. Each hook follows the same workflow—check if the current operation matches the configured blocking rule, report an event via ringbuf and return 0 (deny) if it matches, otherwise return 1 (allow). + +The `cg_connect4` function uses `SEC("cgroup/connect4")` to attach at IPv4 connection time. There's an important detail here: `ctx->user_port` is in network byte order (big-endian), while our configured port is in host byte order, so we must convert with `bpf_ntohs()` before comparing. If the destination port matches our configured `blocked_tcp_dport`, the program returns 0, and the userspace `connect()` call fails with `EPERM`. + +The `cg_dev` function handles device access. Its context `struct bpf_cgroup_dev_ctx` contains three key fields: `major` and `minor` identify the device (e.g., `/dev/null` is 1:3), and `access_type` indicates the access type (read/write/mknod). We simply compare whether major:minor matches the configured values. + +The `cg_sysctl` function intercepts sysctl reads/writes under `/proc/sys/`. It uses `bpf_sysctl_get_name()` to get the sysctl name, in path format like `kernel/hostname` (slash-separated, not dots). 
We only deny reads and let writes pass, which is safer for testing: the demo can be exercised with a plain read and never interferes with legitimate configuration changes. + +The configuration options at the top of the program are declared as `const volatile`. This is the standard libbpf pattern for read-only global configuration (it is often used alongside CO-RE, but it is a separate mechanism): the values hold defaults (0 or an empty string) at compile time, `volatile` keeps the compiler from constant-folding those defaults, and userspace sets the actual values through `skel->rodata` before `load()`. This allows a single compiled BPF program to run with different configurations. + +### Userspace Loader: cgroup_guard.c + +```c +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +/* cgroup_guard.c - Userspace loader for cgroup eBPF policy guard */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "cgroup_guard.skel.h" +#include "cgroup_guard.h" + +static volatile sig_atomic_t exiting = 0; + +static void sig_handler(int sig) +{ + (void)sig; + exiting = 1; +} + +static int libbpf_print_fn(enum libbpf_print_level level, + const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG) + return 0; + return vfprintf(stderr, format, args); +} + +static void usage(const char *prog) +{ + fprintf(stderr, + "Usage: %s [OPTIONS]\n" + "\n" + "Options:\n" + " -c, --cgroup PATH cgroup v2 path (default: /sys/fs/cgroup/ebpf_demo)\n" + " -p, --block-port PORT block TCP connect() to this dst port (IPv4)\n" + " -d, --deny-device MAJ:MIN deny device access for (major:minor)\n" + " -s, --deny-sysctl NAME deny sysctl READ of this name\n" + " -h, --help show this help\n", + prog); +} + +static int handle_event(void *ctx, void *data, size_t data_sz) +{ + (void)ctx; + (void)data_sz; + + const struct event *e = (const struct event *)data; + + if (e->type == EVENT_CONNECT4) { + char ip[INET_ADDRSTRLEN] = {0}; + struct in_addr addr = { .s_addr = e->connect4.daddr }; + inet_ntop(AF_INET, &addr, ip, sizeof(ip)); + + printf("[DENY connect4] pid=%u comm=%s daddr=%s dport=%u proto=%u\n", + e->pid, e->comm, 
ip, e->connect4.dport, e->connect4.proto); + } else if (e->type == EVENT_DEVICE) { + printf("[DENY device] pid=%u comm=%s major=%u minor=%u access_type=0x%x\n", + e->pid, e->comm, e->device.major, e->device.minor, e->device.access_type); + } else if (e->type == EVENT_SYSCTL) { + printf("[DENY sysctl] pid=%u comm=%s write=%u name=%s\n", + e->pid, e->comm, e->sysctl.write, e->sysctl.name); + } + + fflush(stdout); + return 0; +} + +int main(int argc, char **argv) +{ + const char *cgroup_path = "/sys/fs/cgroup/ebpf_demo"; + int block_port = 0; + int dev_major = 0, dev_minor = 0; + const char *deny_sysctl = NULL; + + /* Parse command line arguments */ + static const struct option long_opts[] = { + { "cgroup", required_argument, NULL, 'c' }, + { "block-port", required_argument, NULL, 'p' }, + { "deny-device", required_argument, NULL, 'd' }, + { "deny-sysctl", required_argument, NULL, 's' }, + { "help", no_argument, NULL, 'h' }, + {} + }; + + int opt; + while ((opt = getopt_long(argc, argv, "c:p:d:s:h", long_opts, NULL)) != -1) { + switch (opt) { + case 'c': cgroup_path = optarg; break; + case 'p': block_port = atoi(optarg); break; + case 'd': + /* Expect MAJ:MIN, e.g. 1:3 for /dev/null */ + if (sscanf(optarg, "%d:%d", &dev_major, &dev_minor) != 2) { + fprintf(stderr, "invalid MAJ:MIN: %s\n", optarg); + return 1; + } + break; + case 's': deny_sysctl = optarg; break; + default: usage(argv[0]); return 1; + } + } + + libbpf_set_print(libbpf_print_fn); + signal(SIGINT, sig_handler); + signal(SIGTERM, sig_handler); + + /* Create cgroup directory if needed */ + mkdir(cgroup_path, 0755); + + int cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY); + if (cg_fd < 0) { + fprintf(stderr, "open(%s) failed: %s\n", cgroup_path, strerror(errno)); + return 1; + } + + /* Open and configure BPF skeleton */ + struct cgroup_guard_bpf *skel = cgroup_guard_bpf__open(); + if (!skel) { + fprintf(stderr, "cgroup_guard_bpf__open() failed\n"); + close(cg_fd); + return 1; + } + + /* Write .rodata configuration (must be before load) */ + if (block_port > 0 && block_port <= 65535) + skel->rodata->blocked_tcp_dport = (__u16)block_port; + if (dev_major > 0 
|| dev_minor > 0) { + skel->rodata->blocked_dev_major = (__u32)dev_major; + skel->rodata->blocked_dev_minor = (__u32)dev_minor; + } + if (deny_sysctl) { + snprintf((char *)skel->rodata->denied_sysctl_name, + SYSCTL_NAME_LEN, "%s", deny_sysctl); + } + + /* Declared before the first goto so cleanup never sees uninitialized pointers */ + struct bpf_link *link_connect = NULL, *link_dev = NULL, *link_sysctl = NULL; + struct ring_buffer *rb = NULL; + + /* Load BPF programs into kernel */ + int err = cgroup_guard_bpf__load(skel); + if (err) { + fprintf(stderr, "cgroup_guard_bpf__load() failed: %d\n", err); + goto cleanup; + } + + /* Attach programs to cgroup */ + link_connect = bpf_program__attach_cgroup(skel->progs.cg_connect4, cg_fd); + link_dev = bpf_program__attach_cgroup(skel->progs.cg_dev, cg_fd); + link_sysctl = bpf_program__attach_cgroup(skel->progs.cg_sysctl, cg_fd); + if (!link_connect || !link_dev || !link_sysctl) { + err = -1; + fprintf(stderr, "attach to cgroup failed: %s\n", strerror(errno)); + goto cleanup; + } + + /* Setup ring buffer for events */ + rb = ring_buffer__new(bpf_map__fd(skel->maps.events), + handle_event, NULL, NULL); + if (!rb) { + err = -1; + fprintf(stderr, "ring_buffer__new() failed: %s\n", strerror(errno)); + goto cleanup; + } + + printf("Attached to cgroup: %s\n", cgroup_path); + printf("Config: block_port=%d, deny_device=%d:%d, deny_sysctl_read=%s\n", + block_port, dev_major, dev_minor, deny_sysctl ? deny_sysctl : "(none)"); + + /* Main event loop */ + while (!exiting) { + err = ring_buffer__poll(rb, 200 /* ms */); + if (err == -EINTR) { + err = 0; /* interrupted by a signal: normal shutdown */ + break; + } + if (err < 0) { + fprintf(stderr, "ring_buffer__poll() failed: %d\n", err); + break; + } + err = 0; /* a positive return is the number of records consumed */ + } + +cleanup: + ring_buffer__free(rb); + bpf_link__destroy(link_sysctl); + bpf_link__destroy(link_dev); + bpf_link__destroy(link_connect); + cgroup_guard_bpf__destroy(skel); + close(cg_fd); + return err ? 1 : 0; +} +``` + +#### Understanding the Userspace Code + +The userspace loader's core job is to attach BPF programs to the specified cgroup, then continuously poll the ringbuf to print denied events. + +The program first uses `getopt_long` to parse command-line arguments, getting the cgroup path and three policy configurations. Then it uses `open()` with `O_RDONLY | O_DIRECTORY` to open the cgroup directory and get a file descriptor. This fd is the attach target: cgroup eBPF programs are attached to cgroup directories. 
+ +Next comes the standard skeleton workflow: `open()` opens the BPF object, set `.rodata` configuration, then `load()` loads it into the kernel. Note that configuration must be set before load—after load, `.rodata` becomes read-only. + +Attaching uses `bpf_program__attach_cgroup(prog, cg_fd)` to attach each BPF program to the cgroup. Here we attach three programs: connect4, dev, and sysctl. After successful attachment, all processes in this cgroup will have their relevant operations go through these BPF programs. + +Finally, the event loop. `ring_buffer__poll()` polls the ringbuf, calling the `handle_event` callback whenever events arrive to print them. This lets you see which operations are being denied in real-time. + ## Building ```bash -cd src/49-cgroup +cd src/cgroup make ``` @@ -138,56 +554,52 @@ Expected output: [DENY sysctl] pid=12347 comm=cat write=0 name=kernel/hostname ``` +## One-click Test + +We provide a test script that automatically compiles, starts servers, runs tests, and cleans up: + +```bash +sudo ./test.sh +``` + ## Verifying with bpftool ```bash sudo bpftool cgroup tree /sys/fs/cgroup/ebpf_demo ``` -## Key Implementation Details +## When to Use cgroup eBPF -### 1. Network byte order for sock_addr +Choosing the right technology depends on your control granularity requirements. -```c -// user_port is network byte order, must convert -__u16 dport = bpf_ntohs((__u16)ctx->user_port); -``` +cgroup eBPF's control granularity is **process groups**—put processes in a cgroup, attach a BPF program, and the policy applies to that group. This is perfect for container scenarios: each container is a cgroup, and you can set different network policies, device permissions, and sysctl access rules for different containers. When a process leaves the cgroup, the policy automatically stops applying—no manual cleanup needed. -### 2. Return value semantics +XDP and tc's control granularity is **network interfaces**. 
They handle all traffic passing through a specific NIC, regardless of which process it comes from. If you need high-performance packet processing, DDoS protection, or load balancing, XDP/tc are better choices. But if you want "only allow container A to access port 80, while container B can access any port," XDP/tc become inconvenient. -```c -// For sock_addr (connect4/bind4/etc): -return 1; // allow -return 0; // deny -> EPERM +seccomp-BPF's control granularity is **individual processes**. It filters system calls, for example preventing a process from calling `fork`, `execve`, or `socket`. seccomp is lower-level and well suited to process sandboxing, but its filters only see raw syscall numbers and argument values and cannot dereference pointers, so they cannot match on a `connect()` destination address or a device major:minor. -// For device: -return 0; // deny -> EPERM -return 1; // allow +Traditional iptables/nftables rules are effectively **global**: by default they apply to all traffic in a network namespace rather than to a chosen process group, and they cannot express device or sysctl policy at all. -// For sysctl: -return 0; // reject -> EPERM -return 1; // proceed -``` +In summary: if you need per-container/process-group policies, want to control network, devices, and sysctls together, and want policies to automatically follow process lifecycles, cgroup eBPF is the right choice. -### 3. Configuration via .rodata +## Summary -```c -// BPF side - const volatile for CO-RE -const volatile __u16 blocked_tcp_dport = 0; +By binding policies to process groups, cgroup eBPF provides the fine-grained control that traditional global policies can't achieve. 
This tutorial demonstrated three commonly used cgroup hooks: -// Userspace - set before load -skel->rodata->blocked_tcp_dport = (__u16)port; -``` +- **`cgroup/connect4`**: Filter destination ports at TCP connection time, blocking disallowed outbound connections +- **`cgroup/dev`**: Check major:minor at device access time, restricting reads/writes to specific devices +- **`cgroup/sysctl`**: Check names at sysctl read/write time, preventing sensitive configuration leaks or tampering -## Files +This "policy guard" pattern can be extended to production use cases: container network policies (similar to Kubernetes NetworkPolicy), device isolation (GPU/TPU exclusive access), security sandboxes (restricting system information access). With ringbuf event reporting, you can also implement policy auditing and alerting. -- `cgroup_guard.h` - Shared data structures -- `cgroup_guard.bpf.c` - eBPF programs (connect4, device, sysctl hooks) -- `cgroup_guard.c` - Userspace loader -- `Makefile` - Build system +> If you want to learn more about eBPF, check out our tutorial repository at or visit our website at . 
## References -- [Kernel docs: libbpf program types](https://docs.kernel.org/bpf/libbpf/program_types.html) -- [eBPF docs: CGROUP_SOCK_ADDR](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_SOCK_ADDR/) -- [eBPF docs: CGROUP_DEVICE](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_DEVICE/) +- **Kernel docs:** [libbpf program types](https://docs.kernel.org/bpf/libbpf/program_types.html) - all cgroup-related section names +- **eBPF docs:** [CGROUP_SOCK_ADDR](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_SOCK_ADDR/) - socket address hooks explained +- **eBPF docs:** [CGROUP_DEVICE](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_DEVICE/) - device access control explained +- **eBPF docs:** [CGROUP_SYSCTL](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_SYSCTL/) - sysctl access control explained +- **Tutorial repository:** + +Full source code is available in the tutorial repository. Requires cgroup v2, libbpf, and Linux kernel 5.8+: cgroup BPF attachment dates back to 4.10, but the `cgroup/connect4` hook needs 4.17, the `cgroup/sysctl` hook needs 5.2, and the BPF ring buffer needs 5.8. 
diff --git a/src/features/dynptr/.config b/src/features/dynptr/.config new file mode 100644 index 0000000..205ae3e --- /dev/null +++ b/src/features/dynptr/.config @@ -0,0 +1,3 @@ +level=Depth +type=Features +desc=BPF Dynamic Pointers for Variable-Length Data diff --git a/src/features/dynptr/.gitignore b/src/features/dynptr/.gitignore new file mode 100644 index 0000000..8649a57 --- /dev/null +++ b/src/features/dynptr/.gitignore @@ -0,0 +1,2 @@ +.output +dynptr_tc diff --git a/src/features/dynptr/Makefile b/src/features/dynptr/Makefile new file mode 100644 index 0000000..43fa66a --- /dev/null +++ b/src/features/dynptr/Makefile @@ -0,0 +1,112 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../third_party/libbpf/src) +BPFTOOL_SRC := $(abspath ../../third_party/bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../third_party/vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../third_party/libbpf/include/uapi -I$(dir $(VMLINUX)) -I. +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = dynptr_tc + +# Get Clang's default includes on this system. We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. 
+# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. +CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + 
$(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/src/features/dynptr/README.md b/src/features/dynptr/README.md new file mode 100644 index 0000000..bc4f843 --- /dev/null +++ b/src/features/dynptr/README.md @@ -0,0 +1,436 @@ +# eBPF Tutorial by Example: BPF Dynamic Pointers for Variable-Length Data + +Ever written an eBPF packet parser and struggled with those verbose `data_end` bounds checks that the verifier still rejects? Or tried to send variable-length events through ring buffers only to find yourself locked into fixed-size structures? Traditional eBPF development forces you to prove memory safety statically at compile time, which becomes painful when dealing with runtime-determined sizes like packet lengths or user-configurable snapshot lengths. + +This is what **BPF dynptrs** (dynamic pointers) solve. Introduced gradually from Linux v5.19, dynptrs provide a verifier-friendly way to work with variable-length data by shifting some bounds checking from compile-time static analysis to runtime validation. In this tutorial, we'll build a TC ingress program that uses **skb dynptrs** to parse TCP packets safely and **ringbuf dynptrs** to output variable-length events containing configurable payload snapshots. + +> The complete source code: + +## Introduction to BPF Dynamic Pointers + +### The Problem: When Static Verification Isn't Enough + +The eBPF verifier's core mission is proving memory safety at load time. Every pointer dereference must be bounded, every array access must be within limits. This works beautifully for simple cases, but becomes a struggle when sizes are determined at runtime. 
+ +Consider parsing a packet where the IP header length comes from a 4-bit field, or reading user-configurable amounts of TCP payload. The classic approach requires extensive bounds checking with `data_end` comparisons, and even correctly written code sometimes fails verification because the verifier cannot trace all possible paths. When working with non-linear skb data (paged buffers), the situation gets worse since that data isn't directly accessible through `ctx->data` at all. + +Variable-length output presents similar challenges. The traditional `bpf_ringbuf_reserve()` returns a raw pointer, but writing runtime-determined amounts of data to it makes the verifier uncomfortable because it cannot statically prove your writes stay within bounds. + +### The Solution: Runtime-Checked Dynamic Pointers + +Dynptrs introduce an opaque handle type that carries metadata about the underlying memory region including its bounds and type. You cannot dereference a dynptr directly since the verifier will reject such attempts. Instead, you must use helper functions or kfuncs that perform the appropriate safety checks. + +The key insight is that **some of these checks happen at runtime rather than compile time**. Functions like `bpf_dynptr_read()` and `bpf_dynptr_write()` validate bounds when they execute and return errors on failure. Functions like `bpf_dynptr_slice()` return NULL when the requested region cannot be accessed safely. This lets you express logic that would be unprovable statically while maintaining safety guarantees. + +For the verifier, dynptrs are tracked specially. They have lifecycle rules (some must be released), type constraints (skb dynptrs behave differently than local dynptrs), and the verifier ensures you follow these rules. The runtime checks are the verifier's way of delegating what it cannot prove statically. + +## Dynptr API Overview + +### Helpers vs Kfuncs + +The dynptr ecosystem spans two categories of functions. 
**Helper functions** are part of the stable UAPI and generally maintain backward compatibility. **Kfuncs** (kernel functions) are internal kernel exports to BPF with no ABI stability guarantees, meaning they may change between kernel versions. + +For dynptrs, the foundational read/write operations are helpers, while newer features like skb dynptrs and slicing are kfuncs. This means some dynptr functionality requires newer kernels and you should verify availability before relying on specific features. + +### Creating Dynptrs + +There are several ways to create dynptrs depending on your data source. The `bpf_dynptr_from_mem()` helper creates a dynptr from map values or global variables, useful for working with configuration data or scratch buffers. The `bpf_dynptr_from_skb()` kfunc creates a dynptr from a socket buffer, enabling safe access to packet data including non-linear (paged) regions. For XDP programs, `bpf_dynptr_from_xdp()` provides similar functionality. + +Ring buffer operations use `bpf_ringbuf_reserve_dynptr()` to allocate variable-length records. Unlike regular `bpf_ringbuf_reserve()` which returns a pointer to a fixed-size region, the dynptr variant lets you specify the size at runtime. This is crucial for variable-length event structures. + +### Reading and Writing + +The `bpf_dynptr_read()` helper copies data from a dynptr into a destination buffer. It takes an offset and length, performing runtime bounds checking and returning an error if the read would exceed the dynptr's bounds. This is the safe way to extract data when you need it in a local buffer. + +The `bpf_dynptr_write()` helper does the reverse, copying data into a dynptr. For skb dynptrs, writing may have additional semantics similar to `bpf_skb_store_bytes()`, and note that writes can invalidate previously obtained slices. + +The `bpf_dynptr_data()` helper returns a direct pointer to data within the dynptr, with the verifier tracking the bounds statically. 
However, this does NOT work for skb or xdp dynptrs since their data may not be in a single contiguous region. + +### Slicing for Packet Parsing + +For skb and xdp dynptrs, `bpf_dynptr_slice()` is the primary way to access data. You provide an offset, a length, and optionally a local buffer. The function returns a pointer to the requested data, which may be either a direct pointer into the packet or your provided buffer (if the data needed to be copied from non-linear regions). + +The critical rule is that **you must NULL-check the return value**. A NULL return means the requested region cannot be accessed, either because it exceeds packet bounds or for other internal reasons. Once you have a valid slice pointer, you can dereference it safely within the requested bounds. + +There's also `bpf_dynptr_slice_rdwr()` for obtaining writable slices, with availability depending on the program type and whether the underlying data supports writes. + +### Ring Buffer Lifecycle + +The `bpf_ringbuf_reserve_dynptr()` function has special lifecycle rules enforced by the verifier. Once you call it, you **must** call either `bpf_ringbuf_submit_dynptr()` or `bpf_ringbuf_discard_dynptr()` on the dynptr, regardless of whether the reservation succeeded. This is not optional since the verifier tracks dynptr state and will reject programs that leak reserved dynptrs. + +This differs from regular ringbuf usage where a NULL return from `bpf_ringbuf_reserve()` means nothing was allocated. With dynptrs, the reserve failure still requires explicit cleanup through discard. The verifier needs this guarantee to ensure proper resource management. + +## Implementation: TC Ingress with Dynptr Parsing and Variable-Length Events + +Our demonstration program attaches to TC ingress and accomplishes three things. First, it creates an skb dynptr from incoming packets using `bpf_dynptr_from_skb()`. Second, it parses Ethernet, IPv4, and TCP headers using `bpf_dynptr_slice()` for safe bounds-checked access. 
Third, it outputs variable-length events through a ringbuf dynptr, including a configurable snapshot of TCP payload. + +### Complete BPF Program: dynptr_tc.bpf.c + +```c +// SPDX-License-Identifier: GPL-2.0 +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> + +#include "dynptr_tc.h" + +/* Constants not in vmlinux.h */ +#define TC_ACT_OK 0 +#define TC_ACT_SHOT 2 +#define ETH_P_IP 0x0800 + +/* kfunc declarations for dynptr operations (v6.4+) */ +extern int bpf_dynptr_from_skb(struct __sk_buff *s, __u64 flags, + struct bpf_dynptr *ptr__uninit) __ksym; +extern void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, __u32 offset, + void *buffer__opt, __u32 buffer__sz) __ksym; + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 1 << 24); /* 16MB */ +} events SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, struct config); +} cfg_map SEC(".maps"); + +SEC("tc") +int dynptr_tc_ingress(struct __sk_buff *ctx) +{ + const struct config *cfg; + struct bpf_dynptr skb_ptr; + + /* Temporary buffers for slice (data may be copied here) */ + struct ethhdr eth_buf; + struct iphdr ip_buf; + struct tcphdr tcp_buf; + + const struct ethhdr *eth; + const struct iphdr *iph; + const struct tcphdr *tcp; + + cfg = bpf_map_lookup_elem(&cfg_map, &(__u32){0}); + if (!cfg) + return TC_ACT_OK; + + /* Create dynptr from skb */ + if (bpf_dynptr_from_skb(ctx, 0, &skb_ptr)) + return TC_ACT_OK; + + /* Parse Ethernet header using slice */ + eth = bpf_dynptr_slice(&skb_ptr, 0, &eth_buf, sizeof(eth_buf)); + if (!eth) + return TC_ACT_OK; + + if (eth->h_proto != bpf_htons(ETH_P_IP)) + return TC_ACT_OK; + + /* Parse IPv4 header */ + __u32 ip_off = sizeof(*eth); + iph = bpf_dynptr_slice(&skb_ptr, ip_off, &ip_buf, sizeof(ip_buf)); + if (!iph || iph->version != 4 || iph->protocol != IPPROTO_TCP) + return TC_ACT_OK; + + /* Parse TCP header */ + __u32 tcp_off = ip_off + ((__u32)iph->ihl * 4); + tcp = bpf_dynptr_slice(&skb_ptr, tcp_off, &tcp_buf, sizeof(tcp_buf)); + if (!tcp) + return TC_ACT_OK; + + __u16 dport = bpf_ntohs(tcp->dest); + __u8 drop =
(cfg->blocked_port && + (dport == cfg->blocked_port || + bpf_ntohs(tcp->source) == cfg->blocked_port)); /* match either port so replies from the blocked port are dropped too */ + + /* Output variable-length event using ringbuf dynptr */ + if (cfg->enable_ringbuf) { + __u32 snap_len = cfg->snap_len; + __u8 payload[MAX_SNAPLEN] = {}; + + __u32 payload_off = tcp_off + ((__u32)tcp->doff * 4); + if (payload_off < ctx->len) { + __u32 avail = ctx->len - payload_off; + if (snap_len > avail) snap_len = avail; + if (snap_len > MAX_SNAPLEN) snap_len = MAX_SNAPLEN; + + if (bpf_dynptr_read(payload, snap_len, &skb_ptr, payload_off, 0)) + snap_len = 0; + } else { + snap_len = 0; + } + + struct event_hdr hdr = { + .ts_ns = bpf_ktime_get_ns(), + .ifindex = ctx->ifindex, + .pkt_len = ctx->len, + .saddr = iph->saddr, + .daddr = iph->daddr, + .sport = bpf_ntohs(tcp->source), + .dport = dport, + .drop = drop, + .snap_len = snap_len, + }; + + /* Reserve variable-length ringbuf record */ + struct bpf_dynptr rb; + __u32 total_sz = sizeof(hdr) + snap_len; + + long err = bpf_ringbuf_reserve_dynptr(&events, total_sz, 0, &rb); + if (err) { + /* Must discard even on failure */ + bpf_ringbuf_discard_dynptr(&rb, 0); + return drop ? TC_ACT_SHOT : TC_ACT_OK; + } + + bpf_dynptr_write(&rb, 0, &hdr, sizeof(hdr), 0); + if (snap_len) + bpf_dynptr_write(&rb, sizeof(hdr), payload, snap_len, 0); + + bpf_ringbuf_submit_dynptr(&rb, 0); + } + + return drop ? TC_ACT_SHOT : TC_ACT_OK; +} + +char _license[] SEC("license") = "GPL"; +``` + +### Understanding the BPF Code + +The program begins by declaring the kfuncs it needs. The `bpf_dynptr_from_skb()` function creates a dynptr from the socket buffer, and `bpf_dynptr_slice()` returns pointers to specific regions within it. The `__ksym` attribute tells the loader these are kernel symbols to be resolved at load time. + +When parsing headers, notice how we provide local buffers (`eth_buf`, `ip_buf`, `tcp_buf`) to each slice call.
The slice function may return a pointer directly into packet data if it's linearly accessible, or it may copy data into our buffer and return a pointer to the buffer. Either way, we get a valid pointer we can dereference, or NULL on failure. + +The NULL check pattern is crucial. Each slice call can fail if the requested offset plus length exceeds packet bounds or if the data cannot be accessed for other reasons. Checking for NULL before using the returned pointer is mandatory. + +For ringbuf output, we use `bpf_dynptr_read()` to copy TCP payload from the skb into a local buffer first. This demonstrates reading from an skb dynptr with runtime-determined length (bounded by configuration and available data). The read may fail if bounds are exceeded, in which case we set `snap_len` to zero. + +The ringbuf dynptr reserve shows the variable-length allocation pattern. We compute the total size (header plus snapshot) and reserve that exact amount. After writing both the header and payload using `bpf_dynptr_write()`, we submit the record. Note the discard call on reserve failure to satisfy the verifier's lifecycle requirements. 
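The runtime-checked read and the snapshot clamping described above can be modelled in ordinary user-space C. This is only a sketch under stated assumptions: `struct byte_view`, `view_read()`, and `clamp_snap_len()` are hypothetical names invented for illustration, and the model does not reproduce the kernel's dynptr implementation or its exact error codes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space stand-in for a dynptr: a base pointer plus its size. */
struct byte_view {
	const uint8_t *data;
	size_t size;
};

/*
 * Copy len bytes starting at off into dst. The bounds check runs when
 * the call executes, and a failed check is reported to the caller
 * rather than being ruled out ahead of time.
 */
static int view_read(void *dst, size_t len, const struct byte_view *v, size_t off)
{
	if (off > v->size || len > v->size - off)
		return -1;
	memcpy(dst, v->data + off, len);
	return 0;
}

/* Clamp the requested snapshot to the bytes available and to the buffer cap. */
static size_t clamp_snap_len(size_t want, size_t pkt_len, size_t payload_off, size_t cap)
{
	size_t avail;

	if (payload_off >= pkt_len)
		return 0;
	avail = pkt_len - payload_off;
	if (want > avail)
		want = avail;
	if (want > cap)
		want = cap;
	return want;
}
```

The clamped length is used twice in the BPF program: once as the read length and once when computing `sizeof(hdr) + snap_len` for the reservation, so the record is never larger than the data actually copied.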
+ +### Complete User-Space Program: dynptr_tc.c + +```c +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +#include +#include +#include +#include +#include +#include +#include +#include + +#include "dynptr_tc.skel.h" +#include "dynptr_tc.h" + +static volatile sig_atomic_t exiting = 0; + +static void sig_handler(int signo) { exiting = 1; } + +static int handle_event(void *ctx, void *data, size_t data_sz) +{ + const struct event_hdr *e = data; + char saddr[INET_ADDRSTRLEN], daddr[INET_ADDRSTRLEN]; + + inet_ntop(AF_INET, &e->saddr, saddr, sizeof(saddr)); + inet_ntop(AF_INET, &e->daddr, daddr, sizeof(daddr)); + + printf("if=%u %s:%u -> %s:%u len=%u drop=%u snap=%u", + e->ifindex, saddr, e->sport, daddr, e->dport, + e->pkt_len, e->drop, e->snap_len); + + if (e->snap_len && data_sz >= sizeof(*e) + e->snap_len) { + printf(" payload=\""); + for (int i = 0; i < e->snap_len; i++) { + unsigned char c = e->payload[i]; + putchar((c >= 32 && c <= 126) ? c : '.'); + } + printf("\""); + } + printf("\n"); + return 0; +} + +int main(int argc, char **argv) +{ + const char *ifname = NULL; + struct config cfg = { .blocked_port = 0, .snap_len = 64, .enable_ringbuf = 1 }; + + /* Parse arguments */ + for (int i = 1; i < argc; i++) { + if (!strcmp(argv[i], "-i") && i+1 < argc) ifname = argv[++i]; + else if (!strcmp(argv[i], "-p") && i+1 < argc) cfg.blocked_port = atoi(argv[++i]); + else if (!strcmp(argv[i], "-s") && i+1 < argc) cfg.snap_len = atoi(argv[++i]); + else if (!strcmp(argv[i], "-n")) cfg.enable_ringbuf = 0; + } + + if (!ifname) { + fprintf(stderr, "Usage: %s -i [-p port] [-s len] [-n]\n", argv[0]); + return 1; + } + + int ifindex = if_nametoindex(ifname); + if (!ifindex) { perror("if_nametoindex"); return 1; } + + signal(SIGINT, sig_handler); + signal(SIGTERM, sig_handler); + + struct dynptr_tc_bpf *skel = dynptr_tc_bpf__open_and_load(); + if (!skel) { fprintf(stderr, "Failed to load BPF\n"); return 1; } + + /* Configure */ + 
bpf_map_update_elem(bpf_map__fd(skel->maps.cfg_map), &(__u32){0}, &cfg, BPF_ANY); + + /* Attach to TC ingress */ + struct bpf_tc_hook hook = { .sz = sizeof(hook), .ifindex = ifindex, .attach_point = BPF_TC_INGRESS }; + struct bpf_tc_opts opts = { .sz = sizeof(opts), .handle = 1, .priority = 1, + .prog_fd = bpf_program__fd(skel->progs.dynptr_tc_ingress) }; + + bpf_tc_hook_create(&hook); + if (bpf_tc_attach(&hook, &opts)) { fprintf(stderr, "TC attach failed\n"); goto cleanup; } + + struct ring_buffer *rb = cfg.enable_ringbuf ? + ring_buffer__new(bpf_map__fd(skel->maps.events), handle_event, NULL, NULL) : NULL; + + printf("Attached to %s. blocked_port=%u snap_len=%u\n", ifname, cfg.blocked_port, cfg.snap_len); + + while (!exiting) { + if (rb) ring_buffer__poll(rb, 100); + else usleep(100000); + } + + ring_buffer__free(rb); + bpf_tc_detach(&hook, &opts); + bpf_tc_hook_destroy(&hook); +cleanup: + dynptr_tc_bpf__destroy(skel); + return 0; +} +``` + +### Understanding the User-Space Code + +The userspace program loads the BPF skeleton, configures it through the array map, and attaches to TC ingress. The ring buffer callback `handle_event()` receives each variable-length event and prints it. + +Notice how we access the variable-length payload. The `struct event_hdr` has a flexible array member `payload[]` at the end. When an event arrives, `data_sz` tells us the total size, and `e->snap_len` tells us specifically how much payload was included. We validate both before accessing the payload bytes. + +The configuration map allows runtime control over blocking behavior and snapshot length without reloading the BPF program. This demonstrates the common pattern of using maps for user-to-kernel communication. 
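Returning to the variable-length payload, the validation step can be sketched in isolation as follows. The `struct sketch_event` layout and the `event_payload()` helper are hypothetical and exist only for this illustration; the real record layout is the `struct event_hdr` defined in `dynptr_tc.h`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout, reduced to the fields needed here. */
struct sketch_event {
	uint16_t snap_len;  /* number of payload bytes that follow */
	uint8_t payload[];  /* flexible array member */
};

/*
 * Return a pointer to the payload and store its length in *len,
 * or return NULL if the record is too short to be trusted.
 */
static const uint8_t *event_payload(const void *data, size_t data_sz, size_t *len)
{
	const struct sketch_event *e = data;

	/* The fixed header must be present before any field is read. */
	if (data_sz < sizeof(*e))
		return NULL;
	/* The claimed payload must fit in what was actually delivered. */
	if (e->snap_len > data_sz - sizeof(*e))
		return NULL;

	*len = e->snap_len;
	return e->payload;
}
```

Checking the header size first keeps the subtraction `data_sz - sizeof(*e)` from wrapping around when a truncated record arrives.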
+ +## Compilation and Execution + +Navigate to the dynptr directory and build: + +```bash +cd bpf-developer-tutorial/src/features/dynptr +make +``` + +This compiles the BPF program with the repository's standard toolchain, generating the skeleton header and linking against libbpf. + +### Creating a Test Environment + +To test properly, we need a network namespace so traffic actually traverses the veth pair rather than going through loopback. The included `test.sh` script handles this automatically, but here's the manual setup: + +```bash +# Create network namespace +sudo ip netns add test_ns + +# Create veth pair with one end in the namespace +sudo ip link add veth_host type veth peer name veth_ns +sudo ip link set veth_ns netns test_ns + +# Configure host side +sudo ip addr add 10.200.0.1/24 dev veth_host +sudo ip link set veth_host up + +# Configure namespace side +sudo ip netns exec test_ns ip addr add 10.200.0.2/24 dev veth_ns +sudo ip netns exec test_ns ip link set veth_ns up + +# Start HTTP server inside the namespace +sudo ip netns exec test_ns python3 -m http.server 8080 --bind 10.200.0.2 & +``` + +### Running the Demo + +Start the dynptr TC program attached to the host side of the veth: + +```bash +sudo ./dynptr_tc -i veth_host -p 0 -s 32 +``` + +In another terminal, make a request: + +```bash +curl http://10.200.0.2:8080/ +``` + +You should see output showing captured packets: + +``` +Attached to TC ingress of veth_host (ifindex=X). Ctrl-C to exit. +blocked_port=0 snap_len=32 ringbuf=1 +if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=221 drop=0 snap=32 payload="HTTP/1.0 200 OK..Server: SimpleH" +if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=742 drop=0 snap=32 payload="." +``` + +The output shows HTTP response packets from the server, with the payload field containing the beginning of the response data. 
+ +### Testing the Drop Policy + +Test blocking by specifying port 8080: + +```bash +sudo ./dynptr_tc -i veth_host -p 8080 -s 32 +``` + +In another terminal: + +```bash +curl --max-time 3 http://10.200.0.2:8080/ +``` + +The curl should timeout since response packets are blocked. The dynptr_tc output shows `drop=1`: + +``` +if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=74 drop=1 snap=0 +``` + +### Using the Test Script + +For convenience, run the included test script which handles all setup automatically: + +```bash +sudo ./test.sh +``` + +This creates the namespace, runs both capture and blocking tests, and cleans up afterward. + +## When to Use Dynptrs + +Dynptrs shine in several scenarios. **Variable-length events** are the classic use case since ringbuf dynptrs let you allocate exactly the size you need at runtime, avoiding wasted space from oversized fixed structures or complex multi-record schemes. + +**Packet parsing** benefits from dynptrs when dealing with non-linear skbs or complex protocol stacks where traditional bounds checking becomes unwieldy. The slice API provides a cleaner abstraction that handles both linear and paged data uniformly. + +**Crypto and verification** operations like `bpf_crypto_encrypt()`, `bpf_verify_pkcs7_signature()`, and `bpf_get_file_xattr()` all use dynptrs as buffer arguments, making dynptr familiarity essential for these advanced use cases. + +**User ringbuf consumption** through `bpf_user_ringbuf_drain()` delivers samples as dynptrs, enabling safe handling of userspace-provided data in BPF programs. + +For simple fixed-size operations where you know bounds at compile time, traditional approaches may be simpler. But as your BPF programs grow more sophisticated, dynptrs become increasingly valuable. + +## Summary + +BPF dynptrs provide a verifier-friendly mechanism for working with variable-length and runtime-bounded data. 
Rather than proving memory safety entirely through static analysis, dynptrs shift some verification to runtime checks, enabling patterns that would otherwise be impossible or extremely awkward to express. + +Our example demonstrated the two primary dynptr patterns: using skb dynptrs with slices for clean packet parsing, and using ringbuf dynptrs for variable-length event output. The key takeaways are to always NULL-check slice returns, always submit or discard ringbuf dynptrs, and remember that skb dynptrs require kfuncs available from Linux v6.4. + +As eBPF capabilities continue to expand, dynptrs form an increasingly important part of the toolkit. Whether you're building packet processors, security monitors, or performance tools, understanding dynptrs will help you write cleaner, more capable BPF programs. + +> If you'd like to dive deeper into eBPF, check out our tutorial repository at or visit our website at . + +## References + +- **Dynptr Concept Documentation:** +- **bpf_ringbuf_reserve_dynptr Helper:** +- **bpf_dynptr_from_skb Kfunc:** +- **bpf_dynptr_slice Kfunc:** +- **Kernel Kfuncs Documentation:** +- **Tutorial Repository:** + +This example requires Linux kernel 6.4 or newer for the skb dynptr kfuncs. The ringbuf dynptr helpers are available from Linux 5.19. Complete source code is available in the tutorial repository. 
diff --git a/src/features/dynptr/README.zh.md b/src/features/dynptr/README.zh.md new file mode 100644 index 0000000..d068a96 --- /dev/null +++ b/src/features/dynptr/README.zh.md @@ -0,0 +1,436 @@ +# eBPF 实例教程:BPF 动态指针处理可变长度数据 + +你是否曾经在编写 eBPF 包解析器时,被那些冗长的 `data_end` 边界检查搞得焦头烂额,而验证器仍然拒绝通过?是否尝试过用 ring buffer 发送可变长度事件,却发现自己只能用固定大小的结构体?传统的 eBPF 开发要求你在编译时静态证明内存安全性,当处理运行时才能确定的大小(比如包长度或用户配置的快照长度)时就会变得非常痛苦。 + +这正是 **BPF dynptrs**(动态指针)要解决的问题。从 Linux v5.19 开始逐步引入,dynptr 提供了一种验证器友好的方式来处理可变长度数据,它将部分边界检查从编译时静态分析转移到运行时验证。在本教程中,我们将构建一个 TC ingress 程序,使用 **skb dynptr** 安全解析 TCP 数据包,并使用 **ringbuf dynptr** 输出包含可配置 payload 快照的可变长度事件。 + +> 完整源代码: + +## BPF 动态指针简介 + +### 问题:静态验证的局限性 + +eBPF 验证器的核心使命是在加载时证明内存安全性。每个指针解引用都必须有边界限制,每个数组访问都必须在范围内。这对简单场景效果很好,但当大小在运行时才能确定时就会遇到困难。 + +考虑这样的场景:解析一个数据包时 IP 头长度来自一个 4 位字段,或者要读取用户配置数量的 TCP payload。传统方法需要大量的 `data_end` 比较进行边界检查,即使代码写得完全正确,有时验证器仍然无法追踪所有可能的路径从而拒绝通过。当处理非线性 skb 数据(分页缓冲区)时情况更糟,因为这些数据根本无法通过 `ctx->data` 直接访问。 + +可变长度输出也面临类似的挑战。传统的 `bpf_ringbuf_reserve()` 返回原始指针,但向其中写入运行时确定数量的数据会让验证器感到不安,因为它无法静态证明你的写入会保持在边界内。 + +### 解决方案:运行时检查的动态指针 + +Dynptr 引入了一种不透明的句柄类型,它携带关于底层内存区域的元数据,包括边界和类型信息。你不能直接解引用 dynptr,验证器会拒绝这样的尝试。相反,你必须使用执行适当安全检查的 helper 函数或 kfunc。 + +关键在于**其中一些检查发生在运行时而非编译时**。像 `bpf_dynptr_read()` 和 `bpf_dynptr_write()` 这样的函数在执行时验证边界,失败时返回错误。像 `bpf_dynptr_slice()` 这样的函数在无法安全访问请求区域时返回 NULL。这让你能够表达静态无法证明的逻辑,同时保持安全保证。 + +对于验证器来说,dynptr 被特殊追踪。它们有生命周期规则(某些必须被释放),有类型约束(skb dynptr 与本地 dynptr 行为不同),验证器确保你遵循这些规则。运行时检查是验证器将它无法静态证明的部分委托出去的方式。 + +## Dynptr API 概览 + +### Helper 函数与 Kfunc + +dynptr 生态系统跨越两类函数。**Helper 函数** 是稳定 UAPI 的一部分,通常保持向后兼容。**Kfunc**(内核函数)是内核向 BPF 暴露的内部导出,没有 ABI 稳定性保证,意味着它们可能在内核版本间发生变化。 + +对于 dynptr,基础的读写操作是 helper,而较新的特性如 skb dynptr 和切片操作是 kfunc。这意味着某些 dynptr 功能需要较新的内核,你应该在依赖特定特性前验证其可用性。 + +### 创建 Dynptr + +根据数据来源有几种创建 dynptr 的方式。`bpf_dynptr_from_mem()` helper 从 map value 或全局变量创建 dynptr,用于处理配置数据或临时缓冲区。`bpf_dynptr_from_skb()` kfunc 从 socket buffer 创建 dynptr,允许安全访问包数据,包括非线性(分页)区域。对于 XDP 程序,`bpf_dynptr_from_xdp()` 提供类似功能。 + +Ring buffer 
操作使用 `bpf_ringbuf_reserve_dynptr()` 来分配可变长度记录。与返回固定大小区域指针的普通 `bpf_ringbuf_reserve()` 不同,dynptr 变体允许你在运行时指定大小。这对可变长度事件结构至关重要。 + +### 读取和写入 + +`bpf_dynptr_read()` helper 将数据从 dynptr 复制到目标缓冲区。它接受偏移量和长度,执行运行时边界检查,如果读取超出 dynptr 边界则返回错误。当你需要将数据放入本地缓冲区时,这是安全的提取方式。 + +`bpf_dynptr_write()` helper 做相反的事情,将数据复制到 dynptr 中。对于 skb dynptr,写入可能有类似于 `bpf_skb_store_bytes()` 的额外语义,注意写入可能使之前获取的切片失效。 + +`bpf_dynptr_data()` helper 返回 dynptr 内数据的直接指针,验证器静态追踪边界。然而,这对 skb 或 xdp dynptr **不起作用**,因为它们的数据可能不在单个连续区域中。 + +### 用于包解析的切片操作 + +对于 skb 和 xdp dynptr,`bpf_dynptr_slice()` 是访问数据的主要方式。你提供偏移量、长度和可选的本地缓冲区。该函数返回指向请求数据的指针,它可能是直接指向包数据的指针,也可能是你提供的缓冲区(如果数据需要从非线性区域复制)。 + +关键规则是**必须对返回值进行 NULL 检查**。NULL 返回意味着无法访问请求的区域,要么因为超出包边界,要么因为其他内部原因。一旦你有了有效的切片指针,就可以在请求的边界内安全解引用它。 + +还有 `bpf_dynptr_slice_rdwr()` 用于获取可写切片,其可用性取决于程序类型和底层数据是否支持写入。 + +### Ring Buffer 生命周期 + +`bpf_ringbuf_reserve_dynptr()` 函数有验证器强制的特殊生命周期规则。一旦调用它,**必须**对 dynptr 调用 `bpf_ringbuf_submit_dynptr()` 或 `bpf_ringbuf_discard_dynptr()`,无论预留是否成功。这不是可选的,因为验证器追踪 dynptr 状态,会拒绝泄漏已预留 dynptr 的程序。 + +这与普通 ringbuf 用法不同,在那里 `bpf_ringbuf_reserve()` 返回 NULL 意味着没有分配任何东西。对于 dynptr,预留失败仍然需要通过 discard 进行显式清理。验证器需要这个保证来确保正确的资源管理。 + +## 实现:使用 Dynptr 解析和可变长度事件的 TC Ingress + +我们的演示程序附加到 TC ingress 并完成三件事。首先,它使用 `bpf_dynptr_from_skb()` 从传入的数据包创建 skb dynptr。其次,它使用 `bpf_dynptr_slice()` 解析以太网、IPv4 和 TCP 头,实现安全的边界检查访问。第三,它通过 ringbuf dynptr 输出可变长度事件,包括可配置的 TCP payload 快照。 + +### 完整的 BPF 程序:dynptr_tc.bpf.c + +```c +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include + +#include "dynptr_tc.h" + +/* dynptr 操作的 kfunc 声明(v6.4+) */ +extern int bpf_dynptr_from_skb(struct __sk_buff *s, __u64 flags, + struct bpf_dynptr *ptr__uninit) __ksym; +extern void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, __u32 offset, + void *buffer__opt, __u32 buffer__sz) __ksym; + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 1 << 24); /* 16MB */ +} events SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, 
__u32); + __type(value, struct config); +} cfg_map SEC(".maps"); + +SEC("tc") +int dynptr_tc_ingress(struct __sk_buff *ctx) +{ + const struct config *cfg; + struct bpf_dynptr skb_ptr; + + /* 用于切片的临时缓冲区(数据可能被复制到这里) */ + struct ethhdr eth_buf; + struct iphdr ip_buf; + struct tcphdr tcp_buf; + + const struct ethhdr *eth; + const struct iphdr *iph; + const struct tcphdr *tcp; + + cfg = bpf_map_lookup_elem(&cfg_map, &(__u32){0}); + if (!cfg) + return TC_ACT_OK; + + /* 从 skb 创建 dynptr */ + if (bpf_dynptr_from_skb(ctx, 0, &skb_ptr)) + return TC_ACT_OK; + + /* 使用切片解析以太网头 */ + eth = bpf_dynptr_slice(&skb_ptr, 0, &eth_buf, sizeof(eth_buf)); + if (!eth) + return TC_ACT_OK; + + if (eth->h_proto != bpf_htons(ETH_P_IP)) + return TC_ACT_OK; + + /* 解析 IPv4 头 */ + __u32 ip_off = sizeof(*eth); + iph = bpf_dynptr_slice(&skb_ptr, ip_off, &ip_buf, sizeof(ip_buf)); + if (!iph || iph->version != 4 || iph->protocol != IPPROTO_TCP) + return TC_ACT_OK; + + /* 解析 TCP 头 */ + __u32 tcp_off = ip_off + ((__u32)iph->ihl * 4); + tcp = bpf_dynptr_slice(&skb_ptr, tcp_off, &tcp_buf, sizeof(tcp_buf)); + if (!tcp) + return TC_ACT_OK; + + __u16 dport = bpf_ntohs(tcp->dest); + __u8 drop = (cfg->blocked_port && + (dport == cfg->blocked_port || + bpf_ntohs(tcp->source) == cfg->blocked_port)); + + /* 使用 ringbuf dynptr 输出可变长度事件 */ + if (cfg->enable_ringbuf) { + __u32 snap_len = cfg->snap_len; + __u8 payload[MAX_SNAPLEN] = {}; + + __u32 payload_off = tcp_off + ((__u32)tcp->doff * 4); + if (payload_off < ctx->len) { + __u32 avail = ctx->len - payload_off; + if (snap_len > avail) snap_len = avail; + if (snap_len > MAX_SNAPLEN) snap_len = MAX_SNAPLEN; + + if (bpf_dynptr_read(payload, snap_len, &skb_ptr, payload_off, 0)) + snap_len = 0; + } else { + snap_len = 0; + } + + struct event_hdr hdr = { + .ts_ns = bpf_ktime_get_ns(), + .ifindex = ctx->ifindex, + .pkt_len = ctx->len, + .saddr = iph->saddr, + .daddr = iph->daddr, + .sport = bpf_ntohs(tcp->source), + .dport = dport, + .drop = drop, + .snap_len = snap_len, + }; + + /* 预留可变长度的 ringbuf 记录 */ + struct bpf_dynptr rb;
+ __u32 total_sz = sizeof(hdr) + snap_len; + + long err = bpf_ringbuf_reserve_dynptr(&events, total_sz, 0, &rb); + if (err) { + /* 即使失败也必须 discard */ + bpf_ringbuf_discard_dynptr(&rb, 0); + return drop ? TC_ACT_SHOT : TC_ACT_OK; + } + + bpf_dynptr_write(&rb, 0, &hdr, sizeof(hdr), 0); + if (snap_len) + bpf_dynptr_write(&rb, sizeof(hdr), payload, snap_len, 0); + + bpf_ringbuf_submit_dynptr(&rb, 0); + } + + return drop ? TC_ACT_SHOT : TC_ACT_OK; +} + +char _license[] SEC("license") = "GPL"; +``` + +### BPF 代码解析 + +程序首先声明它需要的 kfunc。`bpf_dynptr_from_skb()` 函数从 socket buffer 创建 dynptr,`bpf_dynptr_slice()` 返回指向其中特定区域的指针。`__ksym` 属性告诉加载器这些是需要在加载时解析的内核符号。 + +在解析头部时,注意我们如何为每个切片调用提供本地缓冲区(`eth_buf`、`ip_buf`、`tcp_buf`)。如果数据可以线性访问,切片函数可能返回直接指向包数据的指针;或者它可能将数据复制到我们的缓冲区并返回指向缓冲区的指针。无论哪种方式,我们都能得到一个可以解引用的有效指针,或者在失败时得到 NULL。 + +NULL 检查模式至关重要。如果请求的偏移量加长度超过包边界,或者由于其他原因无法访问数据,每个切片调用都可能失败。在使用返回的指针之前检查 NULL 是必须的。 + +对于 ringbuf 输出,我们使用 `bpf_dynptr_read()` 先将 TCP payload 从 skb 复制到本地缓冲区。这展示了用运行时确定长度(受配置和可用数据限制)从 skb dynptr 读取的方式。如果超出边界,读取可能失败,在这种情况下我们将 `snap_len` 设为零。 + +ringbuf dynptr 预留展示了可变长度分配模式。我们计算总大小(头部加快照)并精确预留该数量。使用 `bpf_dynptr_write()` 写入头部和 payload 后,我们提交记录。注意预留失败时的 discard 调用,以满足验证器的生命周期要求。 + +### 完整的用户态程序:dynptr_tc.c + +```c +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +#include +#include +#include +#include +#include +#include +#include +#include + +#include "dynptr_tc.skel.h" +#include "dynptr_tc.h" + +static volatile sig_atomic_t exiting = 0; + +static void sig_handler(int signo) { exiting = 1; } + +static int handle_event(void *ctx, void *data, size_t data_sz) +{ + const struct event_hdr *e = data; + char saddr[INET_ADDRSTRLEN], daddr[INET_ADDRSTRLEN]; + + inet_ntop(AF_INET, &e->saddr, saddr, sizeof(saddr)); + inet_ntop(AF_INET, &e->daddr, daddr, sizeof(daddr)); + + printf("if=%u %s:%u -> %s:%u len=%u drop=%u snap=%u", + e->ifindex, saddr, e->sport, daddr, e->dport, + e->pkt_len, e->drop, e->snap_len); + + if (e->snap_len && data_sz >= sizeof(*e) + e->snap_len) { + 
printf(" payload=\""); + for (int i = 0; i < e->snap_len; i++) { + unsigned char c = e->payload[i]; + putchar((c >= 32 && c <= 126) ? c : '.'); + } + printf("\""); + } + printf("\n"); + return 0; +} + +int main(int argc, char **argv) +{ + const char *ifname = NULL; + struct config cfg = { .blocked_port = 0, .snap_len = 64, .enable_ringbuf = 1 }; + + /* 解析参数 */ + for (int i = 1; i < argc; i++) { + if (!strcmp(argv[i], "-i") && i+1 < argc) ifname = argv[++i]; + else if (!strcmp(argv[i], "-p") && i+1 < argc) cfg.blocked_port = atoi(argv[++i]); + else if (!strcmp(argv[i], "-s") && i+1 < argc) cfg.snap_len = atoi(argv[++i]); + else if (!strcmp(argv[i], "-n")) cfg.enable_ringbuf = 0; + } + + if (!ifname) { + fprintf(stderr, "Usage: %s -i [-p port] [-s len] [-n]\n", argv[0]); + return 1; + } + + int ifindex = if_nametoindex(ifname); + if (!ifindex) { perror("if_nametoindex"); return 1; } + + signal(SIGINT, sig_handler); + signal(SIGTERM, sig_handler); + + struct dynptr_tc_bpf *skel = dynptr_tc_bpf__open_and_load(); + if (!skel) { fprintf(stderr, "Failed to load BPF\n"); return 1; } + + /* 配置 */ + bpf_map_update_elem(bpf_map__fd(skel->maps.cfg_map), &(__u32){0}, &cfg, BPF_ANY); + + /* 附加到 TC ingress */ + struct bpf_tc_hook hook = { .sz = sizeof(hook), .ifindex = ifindex, .attach_point = BPF_TC_INGRESS }; + struct bpf_tc_opts opts = { .sz = sizeof(opts), .handle = 1, .priority = 1, + .prog_fd = bpf_program__fd(skel->progs.dynptr_tc_ingress) }; + + bpf_tc_hook_create(&hook); + if (bpf_tc_attach(&hook, &opts)) { fprintf(stderr, "TC attach failed\n"); goto cleanup; } + + struct ring_buffer *rb = cfg.enable_ringbuf ? + ring_buffer__new(bpf_map__fd(skel->maps.events), handle_event, NULL, NULL) : NULL; + + printf("Attached to %s. 
blocked_port=%u snap_len=%u\n", ifname, cfg.blocked_port, cfg.snap_len); + + while (!exiting) { + if (rb) ring_buffer__poll(rb, 100); + else usleep(100000); + } + + ring_buffer__free(rb); + bpf_tc_detach(&hook, &opts); + bpf_tc_hook_destroy(&hook); +cleanup: + dynptr_tc_bpf__destroy(skel); + return 0; +} +``` + +### 用户态代码解析 + +用户态程序加载 BPF skeleton,通过 array map 配置它,并附加到 TC ingress。ring buffer 回调 `handle_event()` 接收每个可变长度事件并打印它。 + +注意我们如何访问可变长度的 payload。`struct event_hdr` 末尾有一个柔性数组成员 `payload[]`。当事件到达时,`data_sz` 告诉我们总大小,`e->snap_len` 具体告诉我们包含了多少 payload。我们在访问 payload 字节前验证这两者。 + +配置 map 允许在不重新加载 BPF 程序的情况下运行时控制阻断行为和快照长度。这展示了使用 map 进行用户态到内核通信的常见模式。 + +## 编译和执行 + +进入 dynptr 目录并构建: + +```bash +cd bpf-developer-tutorial/src/features/dynptr +make +``` + +这使用仓库的标准工具链编译 BPF 程序,生成 skeleton 头文件并链接 libbpf。 + +### 创建测试环境 + +为了正确测试,我们需要使用网络命名空间,这样流量才会真正经过 veth 对而不是走 loopback。附带的 `test.sh` 脚本会自动处理这些,但以下是手动设置方法: + +```bash +# 创建网络命名空间 +sudo ip netns add test_ns + +# 创建 veth 对,一端放入命名空间 +sudo ip link add veth_host type veth peer name veth_ns +sudo ip link set veth_ns netns test_ns + +# 配置主机端 +sudo ip addr add 10.200.0.1/24 dev veth_host +sudo ip link set veth_host up + +# 配置命名空间端 +sudo ip netns exec test_ns ip addr add 10.200.0.2/24 dev veth_ns +sudo ip netns exec test_ns ip link set veth_ns up + +# 在命名空间内启动 HTTP 服务器 +sudo ip netns exec test_ns python3 -m http.server 8080 --bind 10.200.0.2 & +``` + +### 运行演示 + +启动附加到 veth 主机端的 dynptr TC 程序: + +```bash +sudo ./dynptr_tc -i veth_host -p 0 -s 32 +``` + +在另一个终端发起请求: + +```bash +curl http://10.200.0.2:8080/ +``` + +你应该看到捕获的数据包输出: + +``` +Attached to TC ingress of veth_host (ifindex=X). Ctrl-C to exit. +blocked_port=0 snap_len=32 ringbuf=1 +if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=221 drop=0 snap=32 payload="HTTP/1.0 200 OK..Server: SimpleH" +if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=742 drop=0 snap=32 payload="." 
+``` + +输出显示来自服务器的 HTTP 响应包,payload 字段包含响应数据的开头部分。 + +### 测试丢包策略 + +通过指定端口 8080 测试阻断: + +```bash +sudo ./dynptr_tc -i veth_host -p 8080 -s 32 +``` + +在另一个终端: + +```bash +curl --max-time 3 http://10.200.0.2:8080/ +``` + +由于响应包被阻断,curl 应该会超时。dynptr_tc 输出显示 `drop=1`: + +``` +if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=74 drop=1 snap=0 +``` + +### 使用测试脚本 + +为方便起见,运行附带的测试脚本,它会自动处理所有设置: + +```bash +sudo ./test.sh +``` + +脚本会创建命名空间,运行捕获和阻断两个测试,最后自动清理。 + +## 何时使用 Dynptr + +Dynptr 在几种场景中表现出色。**可变长度事件**是经典用例,因为 ringbuf dynptr 让你能够在运行时精确分配所需的大小,避免因过大的固定结构体而浪费空间,或需要使用复杂的多记录方案。 + +**包解析**在处理非线性 skb 或复杂协议栈时受益于 dynptr,在这些情况下传统的边界检查变得难以处理。切片 API 提供了更清晰的抽象,统一处理线性和分页数据。 + +**加密和验证**操作如 `bpf_crypto_encrypt()`、`bpf_verify_pkcs7_signature()` 和 `bpf_get_file_xattr()` 都使用 dynptr 作为缓冲区参数,使得熟悉 dynptr 对这些高级用例至关重要。 + +**用户 ringbuf 消费**通过 `bpf_user_ringbuf_drain()` 以 dynptr 形式传递样本,实现在 BPF 程序中安全处理用户空间提供的数据。 + +对于在编译时就知道边界的简单固定大小操作,传统方法可能更简单。但随着你的 BPF 程序变得更复杂,dynptr 会变得越来越有价值。 + +## 总结 + +BPF dynptr 提供了一种验证器友好的机制来处理可变长度和运行时边界的数据。它不是完全通过静态分析来证明内存安全性,而是将一些验证转移到运行时检查,实现了否则不可能或极其难以表达的模式。 + +我们的例子展示了两种主要的 dynptr 模式:使用带切片的 skb dynptr 进行清晰的包解析,以及使用 ringbuf dynptr 进行可变长度事件输出。关键要点是始终对切片返回值进行 NULL 检查,始终提交或丢弃 ringbuf dynptr,并记住 skb dynptr 需要从 Linux v6.4 开始可用的 kfunc。 + +随着 eBPF 能力的持续扩展,dynptr 成为工具集中越来越重要的部分。无论你是构建包处理器、安全监控器还是性能工具,理解 dynptr 都将帮助你编写更清晰、更强大的 BPF 程序。 + +> 如果你想深入学习 eBPF,请查看我们的教程仓库 或访问我们的网站 。 + +## 参考资料 + +- **Dynptr 概念文档:** +- **bpf_ringbuf_reserve_dynptr Helper:** +- **bpf_dynptr_from_skb Kfunc:** +- **bpf_dynptr_slice Kfunc:** +- **内核 Kfuncs 文档:** +- **教程仓库:** + +本示例需要 Linux 内核 6.4 或更新版本以支持 skb dynptr kfunc。ringbuf dynptr helper 从 Linux 5.19 开始可用。完整源代码可在教程仓库中找到。 diff --git a/src/features/dynptr/dynptr_tc.bpf.c b/src/features/dynptr/dynptr_tc.bpf.c new file mode 100644 index 0000000..f9c4c2d --- /dev/null +++ b/src/features/dynptr/dynptr_tc.bpf.c @@ -0,0 +1,194 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2024 eunomia-bpf +// +// Demonstrates BPF dynptrs with TC ingress: +// - 
bpf_dynptr_from_skb() to create skb dynptr (kfunc, v6.4+) +// - bpf_dynptr_slice() to parse packet headers safely +// - bpf_ringbuf_reserve_dynptr() for variable-length ringbuf records +// - bpf_dynptr_read/write() for data copying + +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> + +/* __BPF__ is auto-defined by clang when targeting BPF */ +#include "dynptr_tc.h" + +/* Constants not in vmlinux.h */ +#define TC_ACT_OK 0 +#define TC_ACT_SHOT 2 +#define ETH_P_IP 0x0800 + +/* kfunc: bpf_dynptr_from_skb (v6.4+) */ +extern int bpf_dynptr_from_skb(struct __sk_buff *s, __u64 flags, + struct bpf_dynptr *ptr__uninit) __ksym; + +/* kfunc: bpf_dynptr_slice */ +extern void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, __u32 offset, + void *buffer__opt, __u32 buffer__sz) __ksym; + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 1 << 24); /* 16MB ringbuf */ +} events SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, struct dynptr_cfg); +} cfg_map SEC(".maps"); + +static __always_inline const struct dynptr_cfg *get_cfg(void) +{ + __u32 key = 0; + return bpf_map_lookup_elem(&cfg_map, &key); +} + +SEC("tc") +int dynptr_tc_ingress(struct __sk_buff *ctx) +{ + const struct dynptr_cfg *cfg = get_cfg(); + struct bpf_dynptr skb_ptr; + __u32 pkt_len = ctx->len; + + /* Temporary buffers for slice (data may be copied here) */ + struct ethhdr eth_buf; + struct iphdr ip_buf; + struct tcphdr tcp_buf; + + const struct ethhdr *eth; + const struct iphdr *iph; + const struct tcphdr *tcp; + + __u32 ip_off, tcp_off, payload_off; + __u32 ip_hdr_len, tcp_hdr_len; + + __u16 sport, dport; + __u8 drop = 0; + int act = TC_ACT_OK; + + if (!cfg) + return TC_ACT_OK; + + /* Create dynptr from skb using kfunc */ + if (bpf_dynptr_from_skb(ctx, 0, &skb_ptr)) + return TC_ACT_OK; + + /* Parse Ethernet header using dynptr_slice */ + eth = bpf_dynptr_slice(&skb_ptr, 0, &eth_buf, sizeof(eth_buf)); + if (!eth) + return TC_ACT_OK; + + if
(eth->h_proto != bpf_htons(ETH_P_IP)) + return TC_ACT_OK; + + /* Parse IPv4 header */ + ip_off = sizeof(*eth); + iph = bpf_dynptr_slice(&skb_ptr, ip_off, &ip_buf, sizeof(ip_buf)); + if (!iph) + return TC_ACT_OK; + + if (iph->version != 4) + return TC_ACT_OK; + + ip_hdr_len = (__u32)iph->ihl * 4; + if (ip_hdr_len < sizeof(struct iphdr) || ip_hdr_len > 60) + return TC_ACT_OK; + + if (iph->protocol != IPPROTO_TCP) + return TC_ACT_OK; + + /* Parse TCP header */ + tcp_off = ip_off + ip_hdr_len; + tcp = bpf_dynptr_slice(&skb_ptr, tcp_off, &tcp_buf, sizeof(tcp_buf)); + if (!tcp) + return TC_ACT_OK; + + tcp_hdr_len = (__u32)tcp->doff * 4; + if (tcp_hdr_len < sizeof(struct tcphdr) || tcp_hdr_len > 60) + return TC_ACT_OK; + + sport = bpf_ntohs(tcp->source); + dport = bpf_ntohs(tcp->dest); + + /* Simple policy: drop packets to/from specified port + * Check both sport and dport to work on both ingress and egress */ + if (cfg->blocked_port && (sport == cfg->blocked_port || dport == cfg->blocked_port)) { + drop = 1; + act = TC_ACT_SHOT; + } + + /* Output event to ringbuf using dynptr record (variable length) */ + if (cfg->enable_ringbuf) { + __u32 snap_len = cfg->snap_len; + __u8 payload[MAX_SNAPLEN] = {}; + long err; + + if (snap_len > MAX_SNAPLEN) + snap_len = MAX_SNAPLEN; + + payload_off = tcp_off + tcp_hdr_len; + + /* Calculate available payload length */ + if (payload_off >= pkt_len) { + snap_len = 0; + } else { + __u32 avail = pkt_len - payload_off; + if (avail < snap_len) + snap_len = avail; + } + + /* Read payload from skb dynptr */ + if (snap_len) { + err = bpf_dynptr_read(payload, snap_len, &skb_ptr, payload_off, 0); + if (err) + snap_len = 0; + } + + /* Build event header */ + struct event_hdr hdr = {}; + hdr.ts_ns = bpf_ktime_get_ns(); + hdr.ifindex = ctx->ifindex; + hdr.pkt_len = pkt_len; + hdr.saddr = iph->saddr; + hdr.daddr = iph->daddr; + hdr.sport = sport; + hdr.dport = dport; + hdr.drop = drop; + hdr.snap_len = (__u16)snap_len; + + /* Reserve ringbuf dynptr 
record (runtime-determined size) */ + struct bpf_dynptr rb; + __u32 total_sz = sizeof(hdr) + snap_len; + + err = bpf_ringbuf_reserve_dynptr(&events, total_sz, 0, &rb); + if (err) { + /* Critical: must discard/submit even on reserve failure */ + bpf_ringbuf_discard_dynptr(&rb, 0); + return act; + } + + /* Write header to ringbuf dynptr */ + err = bpf_dynptr_write(&rb, 0, &hdr, sizeof(hdr), 0); + if (err) { + bpf_ringbuf_discard_dynptr(&rb, 0); + return act; + } + + /* Write payload (if any) */ + if (snap_len) { + err = bpf_dynptr_write(&rb, sizeof(hdr), payload, snap_len, 0); + if (err) { + bpf_ringbuf_discard_dynptr(&rb, 0); + return act; + } + } + + bpf_ringbuf_submit_dynptr(&rb, 0); + } + + return act; +} + +char _license[] SEC("license") = "GPL"; diff --git a/src/features/dynptr/dynptr_tc.c b/src/features/dynptr/dynptr_tc.c new file mode 100644 index 0000000..475b8c5 --- /dev/null +++ b/src/features/dynptr/dynptr_tc.c @@ -0,0 +1,244 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +// Copyright (c) 2024 eunomia-bpf +// +// User-space loader for dynptr TC demo + +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include + +#include "dynptr_tc.skel.h" +#include "dynptr_tc.h" + +static volatile sig_atomic_t exiting = 0; + +static void sig_handler(int signo) +{ + (void)signo; + exiting = 1; +} + +static int libbpf_print_fn(enum libbpf_print_level level, + const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG) + return 0; + return vfprintf(stderr, format, args); +} + +static int bump_memlock_rlimit(void) +{ + struct rlimit rlim_new = { + .rlim_cur = RLIM_INFINITY, + .rlim_max = RLIM_INFINITY, + }; + return setrlimit(RLIMIT_MEMLOCK, &rlim_new); +} + +static void print_ascii_sanitized(const unsigned char *p, size_t len) +{ + for (size_t i = 0; i < len; i++) { + unsigned char c = p[i]; + if (c >= 32 && c <= 126) + putchar(c); + else + putchar('.'); + } +} + +static int handle_event(void *ctx, 
void *data, size_t data_sz) +{ + (void)ctx; + const struct event_hdr *e = data; + + char saddr[INET_ADDRSTRLEN]; + char daddr[INET_ADDRSTRLEN]; + + inet_ntop(AF_INET, &e->saddr, saddr, sizeof(saddr)); + inet_ntop(AF_INET, &e->daddr, daddr, sizeof(daddr)); + + printf("if=%u %s:%u -> %s:%u len=%u drop=%u snap=%u ", + e->ifindex, + saddr, e->sport, + daddr, e->dport, + e->pkt_len, + e->drop, + e->snap_len); + + if (e->snap_len && data_sz >= sizeof(*e) + e->snap_len) { + printf("payload=\""); + print_ascii_sanitized(e->payload, e->snap_len); + printf("\""); + } + printf("\n"); + return 0; +} + +static void usage(const char *prog) +{ + fprintf(stderr, + "Usage: %s -i [-p blocked_port] [-s snap_len] [-n]\n" + "\n" + " -i attach to TC ingress of this netdev\n" + " -p drop TCP packets whose dport == port (0 = disable)\n" + " -s snapshot first bytes of TCP payload (max %d)\n" + " -n disable ringbuf output\n" + "\n" + "Example:\n" + " sudo %s -i veth1 -p 8080 -s 64\n", + prog, MAX_SNAPLEN, prog); +} + +int main(int argc, char **argv) +{ + const char *ifname = NULL; + int opt, err; + int ifindex; + + struct dynptr_cfg cfg = { + .blocked_port = 0, + .snap_len = 64, + .enable_ringbuf = 1, + }; + + while ((opt = getopt(argc, argv, "i:p:s:nh")) != -1) { + switch (opt) { + case 'i': + ifname = optarg; + break; + case 'p': + cfg.blocked_port = (__u16)atoi(optarg); + break; + case 's': + cfg.snap_len = (__u32)atoi(optarg); + break; + case 'n': + cfg.enable_ringbuf = 0; + break; + case 'h': + default: + usage(argv[0]); + return opt == 'h' ? 
0 : 1; + } + } + + if (!ifname) { + usage(argv[0]); + return 1; + } + + if (cfg.snap_len > MAX_SNAPLEN) + cfg.snap_len = MAX_SNAPLEN; + + if (bump_memlock_rlimit()) { + perror("setrlimit(RLIMIT_MEMLOCK)"); + return 1; + } + + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); + libbpf_set_print(libbpf_print_fn); + + signal(SIGINT, sig_handler); + signal(SIGTERM, sig_handler); + + ifindex = if_nametoindex(ifname); + if (!ifindex) { + fprintf(stderr, "if_nametoindex(%s) failed: %s\n", ifname, strerror(errno)); + return 1; + } + + struct dynptr_tc_bpf *skel = dynptr_tc_bpf__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF skeleton\n"); + return 1; + } + + err = dynptr_tc_bpf__load(skel); + if (err) { + fprintf(stderr, "Failed to load BPF skeleton: %d\n", err); + goto cleanup; + } + + /* Write configuration to map */ + { + __u32 key = 0; + int cfg_fd = bpf_map__fd(skel->maps.cfg_map); + err = bpf_map_update_elem(cfg_fd, &key, &cfg, BPF_ANY); + if (err) { + fprintf(stderr, "bpf_map_update_elem(cfg_map) failed: %s\n", strerror(errno)); + goto cleanup; + } + } + + /* Attach to TC ingress */ + struct bpf_tc_hook hook = { + .sz = sizeof(hook), + .ifindex = ifindex, + .attach_point = BPF_TC_INGRESS, + }; + struct bpf_tc_opts opts = { + .sz = sizeof(opts), + .handle = 1, + .priority = 1, + .prog_fd = bpf_program__fd(skel->progs.dynptr_tc_ingress), + }; + + err = bpf_tc_hook_create(&hook); + if (err && err != -EEXIST) { + fprintf(stderr, "bpf_tc_hook_create failed: %d\n", err); + goto cleanup; + } + + err = bpf_tc_attach(&hook, &opts); + if (err) { + fprintf(stderr, "bpf_tc_attach failed: %d\n", err); + goto cleanup; + } + + struct ring_buffer *rb = NULL; + if (cfg.enable_ringbuf) { + rb = ring_buffer__new(bpf_map__fd(skel->maps.events), handle_event, NULL, NULL); + if (!rb) { + fprintf(stderr, "ring_buffer__new failed\n"); + goto cleanup_detach; + } + } + + printf("Attached to TC ingress of %s (ifindex=%d). 
Ctrl-C to exit.\n", + ifname, ifindex); + printf("blocked_port=%u snap_len=%u ringbuf=%u\n", + cfg.blocked_port, cfg.snap_len, cfg.enable_ringbuf); + + while (!exiting) { + if (rb) { + err = ring_buffer__poll(rb, 100 /* ms */); + if (err == -EINTR) break; + if (err < 0) { + fprintf(stderr, "ring_buffer__poll error: %d\n", err); + break; + } + } else { + usleep(100000); + } + } + + ring_buffer__free(rb); + +cleanup_detach: + bpf_tc_detach(&hook, &opts); + bpf_tc_hook_destroy(&hook); + +cleanup: + dynptr_tc_bpf__destroy(skel); + return err < 0 ? -err : 0; +} diff --git a/src/features/dynptr/dynptr_tc.h b/src/features/dynptr/dynptr_tc.h new file mode 100644 index 0000000..cc08063 --- /dev/null +++ b/src/features/dynptr/dynptr_tc.h @@ -0,0 +1,45 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +// Copyright (c) 2024 eunomia-bpf +#ifndef __DYNPTR_TC_H +#define __DYNPTR_TC_H + +/* Use types that work in both BPF and userspace contexts */ +#ifdef __BPF__ +/* BPF side uses vmlinux types */ +#else +/* Userspace side uses standard types */ +#include +#endif + +#define MAX_SNAPLEN 64 + +struct dynptr_cfg { + __u16 blocked_port; /* 0 = disable blocking */ + __u16 _pad1; + + __u32 snap_len; /* TCP payload snapshot length */ + __u8 enable_ringbuf; /* 1: output events to ringbuf */ + __u8 _pad2[3]; +}; + +/* Fixed header + variable payload (flex array) */ +struct event_hdr { + __u64 ts_ns; + + __u32 ifindex; + __u32 pkt_len; + + __be32 saddr; + __be32 daddr; + + __u16 sport; + __u16 dport; + + __u8 drop; /* 1: packet was dropped */ + __u8 _pad1; + __u16 snap_len; /* actual snapshot length (<= MAX_SNAPLEN) */ + + __u8 payload[]; /* follows immediately after the struct */ +}; + +#endif /* __DYNPTR_TC_H */ diff --git a/src/features/dynptr/test.sh b/src/features/dynptr/test.sh new file mode 100755 index 0000000..7f66d4f --- /dev/null +++ b/src/features/dynptr/test.sh @@ -0,0 +1,214 @@ +#!/bin/bash +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +# Test script for 
+# dynptr TC demo
+# Requires: root privileges, Linux kernel >= 6.4
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR"
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
+log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
+log_error() { echo -e "${RED}[ERROR]${NC} $1"; }
+
+cleanup() {
+    log_info "Cleaning up..."
+
+    # Kill processes
+    sudo pkill -9 dynptr_tc 2>/dev/null || true
+
+    # Kill HTTP server in namespace
+    if [ -n "$HTTP_PID" ]; then
+        sudo kill -9 $HTTP_PID 2>/dev/null || true
+    fi
+    sudo ip netns pids test_dynptr 2>/dev/null | xargs -r sudo kill -9 2>/dev/null || true
+
+    # Remove TC hooks
+    sudo tc qdisc del dev veth_host clsact 2>/dev/null || true
+
+    # Remove network namespace and veth
+    sudo ip link del veth_host 2>/dev/null || true
+    sudo ip netns del test_dynptr 2>/dev/null || true
+
+    log_info "Cleanup complete"
+}
+
+trap cleanup EXIT
+
+check_prereqs() {
+    log_info "Checking prerequisites..."
+
+    if [ "$EUID" -ne 0 ]; then
+        log_error "This script must be run as root"
+        exit 1
+    fi
+
+    # Check kernel version (need >= 6.4 for bpf_dynptr_from_skb)
+    KERNEL_VERSION=$(uname -r | cut -d. -f1-2)
+    KERNEL_MAJOR=$(echo "$KERNEL_VERSION" | cut -d. -f1)
+    KERNEL_MINOR=$(echo "$KERNEL_VERSION" | cut -d. -f2)
+
+    if [ "$KERNEL_MAJOR" -lt 6 ] || ([ "$KERNEL_MAJOR" -eq 6 ] && [ "$KERNEL_MINOR" -lt 4 ]); then
+        log_warn "Kernel version $KERNEL_VERSION detected. This demo requires >= 6.4 for skb dynptr kfuncs."
+        log_warn "The test may fail to load the BPF program."
+    else
+        log_info "Kernel version $KERNEL_VERSION OK (>= 6.4)"
+    fi
+
+    if [ ! -f "./dynptr_tc" ]; then
+        log_info "Building dynptr_tc..."
+        make
+    fi
+}
+
+setup_network() {
+    log_info "Setting up network namespace and veth pair..."
+
+    # Clean up any existing setup
+    sudo ip link del veth_host 2>/dev/null || true
+    sudo ip netns del test_dynptr 2>/dev/null || true
+
+    # Create network namespace
+    sudo ip netns add test_dynptr
+
+    # Create veth pair
+    sudo ip link add veth_host type veth peer name veth_ns
+
+    # Move one end to namespace
+    sudo ip link set veth_ns netns test_dynptr
+
+    # Configure host side
+    sudo ip addr add 10.200.0.1/24 dev veth_host
+    sudo ip link set veth_host up
+
+    # Configure namespace side
+    sudo ip netns exec test_dynptr ip addr add 10.200.0.2/24 dev veth_ns
+    sudo ip netns exec test_dynptr ip link set veth_ns up
+    sudo ip netns exec test_dynptr ip link set lo up
+
+    # Verify connectivity
+    if ping -c 1 -W 1 10.200.0.2 > /dev/null 2>&1; then
+        log_info "Network namespace setup complete, connectivity verified"
+    else
+        log_error "Failed to establish connectivity to namespace"
+        exit 1
+    fi
+}
+
+start_http_server() {
+    log_info "Starting HTTP server in namespace on 10.200.0.2:8080..."
+    sudo ip netns exec test_dynptr python3 -m http.server 8080 --bind 10.200.0.2 &>/dev/null &
+    HTTP_PID=$!
+    sleep 2
+
+    # Verify HTTP server
+    if curl -s --max-time 2 http://10.200.0.2:8080/ > /dev/null 2>&1; then
+        log_info "HTTP server is running"
+    else
+        log_error "Failed to start HTTP server"
+        exit 1
+    fi
+}
+
+test_basic_capture() {
+    log_info "=== Test 1: Basic packet capture ==="
+
+    # Remove any existing TC hooks
+    sudo tc qdisc del dev veth_host clsact 2>/dev/null || true
+
+    # Start dynptr_tc (no blocking)
+    log_info "Starting dynptr_tc on veth_host..."
+    sudo timeout 5 ./dynptr_tc -i veth_host -p 0 -s 32 > /tmp/dynptr_output.txt 2>&1 &
+    DYNPTR_PID=$!
+    sleep 2
+
+    # Send HTTP request
+    log_info "Sending HTTP request..."
+    curl -s --max-time 2 http://10.200.0.2:8080/ > /dev/null 2>&1 || true
+
+    sleep 2
+
+    # Wait for dynptr_tc to finish
+    wait $DYNPTR_PID 2>/dev/null || true
+
+    # Check output
+    if grep -q "10.200.0.2:8080" /tmp/dynptr_output.txt; then
+        if grep -q "payload=" /tmp/dynptr_output.txt; then
+            log_info "Test 1 PASSED: Captured TCP packet with payload"
+            echo "Sample output:"
+            grep "payload=" /tmp/dynptr_output.txt | head -2
+        else
+            log_warn "Test 1 PARTIAL: Captured packets but no payload (might be ACKs only)"
+            cat /tmp/dynptr_output.txt | head -5
+        fi
+    else
+        log_error "Test 1 FAILED: Did not capture expected packets"
+        cat /tmp/dynptr_output.txt
+        return 1
+    fi
+
+    # Clean up TC hook for next test
+    sudo tc qdisc del dev veth_host clsact 2>/dev/null || true
+}
+
+test_blocking() {
+    log_info "=== Test 2: Port blocking ==="
+
+    # Start dynptr_tc with blocking on port 8080
+    log_info "Starting dynptr_tc with port 8080 blocked..."
+    sudo ./dynptr_tc -i veth_host -p 8080 -s 32 > /tmp/dynptr_block.txt 2>&1 &
+    DYNPTR_PID=$!
+    sleep 2
+
+    # Send HTTP request (should timeout/fail)
+    log_info "Sending HTTP request (should be blocked)..."
+    if curl -s --max-time 3 http://10.200.0.2:8080/ > /dev/null 2>&1; then
+        log_warn "Test 2 WARNING: Request succeeded but should have been blocked"
+        log_warn "Note: Blocking depends on TC direction - responses from server are blocked on ingress"
+    else
+        log_info "Test 2: Request blocked/timed out as expected"
+    fi
+
+    sleep 1
+    sudo kill -INT $DYNPTR_PID 2>/dev/null || true
+    wait $DYNPTR_PID 2>/dev/null || true
+
+    # Check output for drop events
+    if grep -q "drop=1" /tmp/dynptr_block.txt; then
+        log_info "Test 2 PASSED: Packets were dropped (drop=1 in output)"
+        grep "drop=1" /tmp/dynptr_block.txt | head -2
+    else
+        log_info "Test 2 output:"
+        cat /tmp/dynptr_block.txt | head -5
+    fi
+
+    # Clean up TC hook
+    sudo tc qdisc del dev veth_host clsact 2>/dev/null || true
+}
+
+main() {
+    log_info "Starting dynptr TC demo tests..."
+    log_info "Kernel: $(uname -r)"
+
+    check_prereqs
+    setup_network
+    start_http_server
+
+    echo ""
+    test_basic_capture
+
+    echo ""
+    test_blocking
+
+    echo ""
+    log_info "All tests completed!"
+}
+
+main "$@"