feat: add tcx and bpf token tutorials (#203)

* feat: add tcx and bpf token tutorials

* docs: auto-generate documentation

* docs: rewrite tcx and bpf_token tutorials with richer content and consistent style

Rewrote all 4 README files (EN/ZH for both tutorials) to match the
existing tutorial style with detailed background, full code listings,
step-by-step explanations, comparison tables, and proper references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: replace em dashes with colons, commas, and parentheses

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: auto-generate documentation

* ci: add tcx and bpf_token builds to CI; simplify execl in token_userns_demo

- Add make targets for src/50-tcx and src/features/bpf_token to the
  test-libbpf CI workflow.
- Replace four separate execl() calls with a single execv() using a
  dynamically built argv array, reducing complexity and eliminating
  CodeFactor command-injection false positives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: fix mkdocs path in trigger-sync workflow

Use .venv/bin/mkdocs instead of bare mkdocs, since make install
puts it inside a virtualenv.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: LinuxDev9002 <linuxdev8883@example.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
云微
2026-03-10 21:23:23 -07:00
committed by GitHub
parent ee6d522e40
commit 3a722c03d5
22 changed files with 2212 additions and 1 deletion


@@ -111,3 +111,11 @@ jobs:
- name: test features bpf_wq
run: |
make -C src/features/bpf_wq
- name: test 50 tcx
run: |
make -C src/50-tcx
- name: test features bpf_token
run: |
make -C src/features/bpf_token


@@ -28,7 +28,7 @@ jobs:
- name: Test page build
run: |
mkdocs build -v
.venv/bin/mkdocs build -v
- name: Trigger sync workflow
if: github.event_name == 'push' && github.ref == 'refs/heads/main'


@@ -74,6 +74,7 @@ Networking:
- [lesson 41-xdp-tcpdump](src/41-xdp-tcpdump/README.md) Capturing TCP Information with XDP
- [lesson 42-xdp-loadbalancer](src/42-xdp-loadbalancer/README.md) XDP Load Balancer
- [lesson 46-xdp-test](src/46-xdp-test/README.md) Building a High-Performance XDP Packet Generator
- [lesson 50-tcx](src/50-tcx/README.md) Composable Traffic Control with TCX Links
Tracing:
@@ -103,6 +104,7 @@ Features:
- [lesson 36-userspace-ebpf](src/36-userspace-ebpf/README.md) Userspace eBPF Runtimes: Overview and Applications
- [lesson 38-btf-uprobe](src/38-btf-uprobe/README.md) Expanding eBPF Compile Once, Run Everywhere (CO-RE) to Userspace Compatibility
- [lesson 43-kfuncs](src/43-kfuncs/README.md) Extending eBPF Beyond Its Limits: Custom kfuncs in Kernel Modules
- [features bpf_token](src/features/bpf_token/README.md) BPF Token for Delegated Privilege and Secure Program Loading
- [features bpf_wq](src/features/bpf_wq/README.md) BPF Workqueues for Asynchronous Sleepable Tasks
- [features struct_ops](src/features/struct_ops/README.md) Extending Kernel Subsystems with BPF struct_ops
- [features dynptr](src/features/dynptr/README.md) BPF Dynamic Pointers for Variable-Length Data


@@ -67,6 +67,7 @@ GPU:
- [lesson 41-xdp-tcpdump](src/41-xdp-tcpdump/README.zh.md) eBPF Tutorial by Example: Capturing TCP Information with XDP
- [lesson 42-xdp-loadbalancer](src/42-xdp-loadbalancer/README.zh.md) eBPF Developer Tutorial: A Simple XDP Load Balancer
- [lesson 46-xdp-test](src/46-xdp-test/README.zh.md) eBPF Tutorial by Example: Building a High-Performance XDP Packet Generator
- [lesson 50-tcx](src/50-tcx/README.zh.md) eBPF Tutorial by Example 50: Composable Traffic Control with TCX Links
Security:
- [lesson 24-hide](src/24-hide/README.zh.md) eBPF Development Practice: Hiding Process or File Information with eBPF
@@ -81,6 +82,7 @@ GPU:
- [lesson 36-userspace-ebpf](src/36-userspace-ebpf/README.zh.md) Userspace eBPF Runtimes: Deep Dive and Applications
- [lesson 38-btf-uprobe](src/38-btf-uprobe/README.zh.md) Compile Once, Run Everywhere for Userspace with eBPF and BTF
- [lesson 43-kfuncs](src/43-kfuncs/README.zh.md) Extending eBPF Beyond Its Limits: Custom kfuncs in Kernel Modules
- [features bpf_token](src/features/bpf_token/README.zh.md) eBPF Tutorial: BPF Token for Delegated Privilege and Secure Program Loading
- [features bpf_wq](src/features/bpf_wq/README.zh.md) eBPF Tutorial: BPF Workqueues for Asynchronous Sleepable Tasks
- [features struct_ops](src/features/struct_ops/README.zh.md) eBPF Tutorial: Extending Kernel Subsystems with BPF struct_ops
- [features dynptr](src/features/dynptr/README.zh.md) BPF Dynamic Pointers for Variable-Length Data

src/50-tcx/.config Normal file

@@ -0,0 +1,2 @@
level=Depth
type=Networking

src/50-tcx/.gitignore vendored Normal file

@@ -0,0 +1,12 @@
# Build artifacts
.output/
*.o
*.skel.h
# Generated binaries
tcx_demo
# Editor files
*.swp
*~
.vscode/

src/50-tcx/Makefile Normal file

@@ -0,0 +1,92 @@
# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
OUTPUT := .output
CLANG ?= clang
LIBBPF_SRC := $(abspath ../third_party/libbpf/src)
BPFTOOL_SRC := $(abspath ../third_party/bpftool/src)
LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a)
BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool)
BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool
ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \
| sed 's/arm.*/arm/' \
| sed 's/aarch64/arm64/' \
| sed 's/ppc64le/powerpc/' \
| sed 's/mips.*/mips/' \
| sed 's/riscv64/riscv/' \
| sed 's/loongarch64/loongarch/')
VMLINUX := ../third_party/vmlinux/$(ARCH)/vmlinux.h
INCLUDES := -I$(OUTPUT) -I../third_party/libbpf/include/uapi -I$(dir $(VMLINUX)) -I.
CFLAGS := -g -Wall
ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS)
APPS = tcx_demo
CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - </dev/null 2>&1 \
| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }')
ifeq ($(V),1)
Q =
msg =
else
Q = @
msg = @printf ' %-8s %s%s\n' \
"$(1)" \
"$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \
"$(if $(3), $(3))";
MAKEFLAGS += --no-print-directory
endif
define allow-override
$(if $(or $(findstring environment,$(origin $(1))),\
$(findstring command line,$(origin $(1)))),,\
$(eval $(1) = $(2)))
endef
$(call allow-override,CC,$(CROSS_COMPILE)cc)
$(call allow-override,LD,$(CROSS_COMPILE)ld)
.PHONY: all
all: $(APPS)
.PHONY: clean
clean:
$(call msg,CLEAN)
$(Q)rm -rf $(OUTPUT) $(APPS)
$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT):
$(call msg,MKDIR,$@)
$(Q)mkdir -p $@
$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf
$(call msg,LIB,$@)
$(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \
OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \
INCLUDEDIR= LIBDIR= UAPIDIR= \
install
$(BPFTOOL): | $(BPFTOOL_OUTPUT)
$(call msg,BPFTOOL,$@)
$(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap
$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL)
$(call msg,BPF,$@)
$(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \
$(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \
-c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@)
$(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@)
$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL)
$(call msg,GEN-SKEL,$@)
$(Q)$(BPFTOOL) gen skeleton $< > $@
$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h
$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT)
$(call msg,CC,$@)
$(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@
$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT)
$(call msg,BINARY,$@)
$(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@
.DELETE_ON_ERROR:
.SECONDARY:

src/50-tcx/README.md Normal file

@@ -0,0 +1,240 @@
# eBPF Tutorial by Example 50: Composable Traffic Control with TCX Links
Ever tried attaching multiple BPF programs to the TC ingress path and got frustrated managing qdisc handles, filter priorities, and the `tc` CLI? Or needed one application's TC program to coexist safely with another's without accidentally overwriting it? Traditional `cls_bpf` attachment through `tc` works, but it inherits decades of queueing discipline plumbing that was never designed for the BPF-centric world. What if you could attach, order, and manage TC programs using the same link-based API that XDP and cgroup programs already enjoy?
This is what **TCX** (Traffic Control eXtension) solves. Introduced by Daniel Borkmann and merged in Linux 6.6, TCX provides a lightweight, fd-based multi-program attach infrastructure for the TC ingress and egress data path. Programs get BPF link semantics (safe ownership, auto-detachment on close, and explicit ordering through `BPF_F_BEFORE` / `BPF_F_AFTER` flags) without touching a single qdisc or filter priority.
In this tutorial, we'll attach two TCX ingress programs to the loopback interface, place one before the other, query the kernel's live chain state, and generate traffic to verify execution order.
> The complete source code: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/50-tcx>
## Introduction to TCX: Why Classic TC Attachment Needed a Rethink
### The Problem: Qdisc Plumbing and Unsafe Ownership
Classic `tc` BPF attachment (`cls_bpf`) was bolted onto the existing Traffic Control framework. To attach a BPF program, you first needed a `clsact` qdisc on the interface, then added a filter with a handle and priority. This worked fine for a single operator, but created real problems in cloud-native environments where multiple applications need to attach TC programs to the same interface:
1. **No ownership model**: A `tc filter del` from one application can accidentally remove another application's program. There's no protection against this because classic tc filters are identified by handle/priority, not by the process that created them.
2. **Priority conflicts**: Two applications might pick the same priority number. The second attachment silently replaces the first.
3. **Permanent attachment by default**: Classic tc filters persist until explicitly removed. If the application that attached a filter crashes without cleanup, the filter remains, potentially with stale program logic.
4. **CLI dependency**: Even with libbpf, the attachment model was tied to netlink, the same mechanism the `tc` CLI uses. This meant your BPF application was sharing a control plane with every other tc user on the system.
These issues became acute in projects like Cilium, where the BPF dataplane needs to coexist with third-party CNI plugins, observability agents, and security tools that all want to hook into TC.
### The Solution: Link-Based Multi-Program Management
TCX takes a fundamentally different approach. Instead of piggybacking on qdisc infrastructure, it provides a dedicated, qdisc-less extension point for BPF programs at the TC ingress and egress hooks. The key design principles:
**BPF Link Semantics**: `bpf_program__attach_tcx()` creates a `BPF_LINK_TYPE_TCX` link. Like XDP links and cgroup links, TCX links give you safe ownership: the link is pinned to the file descriptor, auto-detaches when the fd is closed, and cannot be accidentally overridden by another application.
**Explicit Ordering**: Instead of implicit priority numbers, you place programs relative to each other using `BPF_F_BEFORE` and `BPF_F_AFTER`. You can also use `BPF_F_REPLACE` to atomically swap a specific program. All operations support an `expected_revision` field that prevents race conditions during concurrent modifications.
**Chain Return Codes**: TCX defines simplified return codes that make multi-program composition explicit:
| Return Code | Value | Meaning |
|-------------|-------|---------|
| `TCX_NEXT` | -1 | Non-terminating; pass the packet to the next program in the chain |
| `TCX_PASS` | 0 | Accept the packet and terminate the chain |
| `TCX_DROP` | 2 | Drop the packet and terminate the chain |
| `TCX_REDIRECT` | 7 | Redirect the packet and terminate the chain |
Unknown return codes are mapped to `TCX_NEXT` for forward compatibility.
**Coexistence with Classic TC**: TCX links can coexist with traditional `cls_bpf` filters on the same interface. The kernel runs TCX programs first, then falls through to classic `tcf_classify()` if present. This allows gradual migration from classic tc to TCX without a disruptive cutover.
## Writing the eBPF Program
Our BPF object contains two programs that demonstrate chain composition. Here is the complete source:
```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#ifndef TCX_NEXT
#define TCX_NEXT -1
#endif
#ifndef TCX_PASS
#define TCX_PASS 0
#endif
char LICENSE[] SEC("license") = "GPL";
__u64 stats_hits;
__u64 classifier_hits;
__u32 last_len;
__u16 last_protocol;
__u32 last_ifindex;
SEC("tcx/ingress")
int tcx_stats(struct __sk_buff *skb)
{
stats_hits++;
last_len = skb->len;
last_protocol = bpf_ntohs(skb->protocol);
last_ifindex = skb->ifindex;
return TCX_NEXT;
}
SEC("tcx/ingress")
int tcx_classifier(struct __sk_buff *skb)
{
classifier_hits++;
return TCX_PASS;
}
```
Let's walk through this step by step.
### Section Names: `SEC("tcx/ingress")`
The `SEC("tcx/ingress")` annotation tells libbpf that this program should be attached to the TCX ingress hook rather than the classic TC classifier. This is not just a naming convention; libbpf maps this section name to `BPF_PROG_TYPE_SCHED_CLS` with the appropriate attach type for TCX. The corresponding egress variant is `SEC("tcx/egress")`.
Note that `SEC("tc")`, `SEC("classifier")`, and `SEC("action")` are now considered deprecated by libbpf in favor of the `tcx/*` section names.
### Global Variables as Counters
Instead of using a BPF map for counters, we use global variables (`stats_hits`, `classifier_hits`, `last_len`, etc.). The libbpf skeleton exposes these through `skel->bss->stats_hits`, which makes the user-space code simpler. This is fine for a single-CPU demo; for production use, you would want per-CPU maps to avoid data races.
### Return Codes: `TCX_NEXT` vs `TCX_PASS`
This is the heart of TCX composition:
- `tcx_stats` returns `TCX_NEXT`, which means "I've done my work, now pass the packet to the next program in the chain." The chain continues executing.
- `tcx_classifier` returns `TCX_PASS`, which is a terminal verdict: the packet is accepted and no further programs in the chain run.
If we had placed `tcx_classifier` *before* `tcx_stats` in the chain, `tcx_stats` would never execute because `TCX_PASS` terminates the chain. Ordering matters, and TCX makes it explicit.
## User-Space Loader: Attaching and Querying the Chain
The user-space code demonstrates three key TCX operations: attaching programs, ordering them relative to each other, and querying the live chain.
### Step 1: Attach the First Program
```c
classifier_link = bpf_program__attach_tcx(skel->progs.tcx_classifier,
ifindex, NULL);
```
This attaches `tcx_classifier` to the TCX ingress hook on the specified interface. Passing `NULL` for options means "use defaults", so the program gets appended to the chain. At this point, the chain has one program.
### Step 2: Insert the Second Program *Before* the First
```c
LIBBPF_OPTS(bpf_tcx_opts, before_opts,
.flags = BPF_F_BEFORE,
.relative_fd = bpf_program__fd(skel->progs.tcx_classifier));
stats_link = bpf_program__attach_tcx(skel->progs.tcx_stats,
ifindex, &before_opts);
```
The `bpf_tcx_opts` structure tells the kernel to insert `tcx_stats` *before* `tcx_classifier` in the chain. The `.relative_fd` field identifies the reference point, which is the fd of the already-attached classifier program. After this, the chain is: `tcx_stats` → `tcx_classifier`.
You could equivalently use `BPF_F_AFTER` with a different reference to achieve the same ordering. The important point is that you express the desired order directly, rather than hoping that two numeric priorities sort correctly.
### Step 3: Query the Chain
```c
LIBBPF_OPTS(bpf_prog_query_opts, query);
query.count = 8;
query.prog_ids = prog_ids;
query.link_ids = link_ids;
err = bpf_prog_query_opts(ifindex, BPF_TCX_INGRESS, &query);
```
After attachment, the loader queries the kernel for the live chain state. The returned data includes:
- **`revision`**: A monotonically increasing counter that changes on every chain modification. This is the value you would pass as `expected_revision` if you wanted to perform atomic updates.
- **`prog_ids[]`**: The BPF program IDs in chain order.
- **`link_ids[]`**: The corresponding BPF link IDs.
This allows any observer to determine exactly which programs are attached and in what order, which is invaluable for debugging multi-program pipelines.
### Step 4: Generate Traffic and Read Counters
The loader sends a UDP packet to `127.0.0.1` (port 9, discard) to trigger the chain, waits briefly, then reads the global variables to verify both programs executed:
```c
printf(" tcx_stats hits : %llu\n",
(unsigned long long)skel->bss->stats_hits);
printf(" tcx_classifier hits : %llu\n",
(unsigned long long)skel->bss->classifier_hits);
```
If both counters are 1, the chain worked as expected: `tcx_stats` ran first (recording metadata and returning `TCX_NEXT`), then `tcx_classifier` ran second (counting the packet and returning `TCX_PASS`).
## Compilation and Execution
This example requires Linux 6.6+ with TCX support and a recent libbpf.
```bash
cd bpf-developer-tutorial/src/50-tcx
make
sudo ./tcx_demo -i lo
```
Expected output:
```text
Attached TCX programs to lo (ifindex=1)
TCX ingress chain revision: 3
slot 0: prog_id=812 link_id=901
slot 1: prog_id=811 link_id=900
Counters:
tcx_stats hits : 1
tcx_classifier hits : 1
last ifindex : 1
last protocol : 0x0800
last length : 46
```
The revision is 3 because the kernel initializes a TCX entry's revision to 1 when the entry is created and bumps it on every chain modification: attaching `tcx_classifier` took it to 2, and inserting `tcx_stats` before it took it to 3. Queries read the revision without modifying it.
If you want to inspect the attach behavior without traffic, add `-n`:
```bash
sudo ./tcx_demo -i lo -n
```
Use `-v` to enable libbpf debug output, which is helpful for seeing the low-level BPF syscall sequence.
## How This Differs from Lesson 20 (Classic TC)
[Lesson 20-tc](../20-tc/README.md) teaches the classic TC data path: creating a `clsact` qdisc, attaching a `SEC("tc")` program as a filter, and using `__sk_buff` for packet inspection. That lesson is still valuable because the **packet processing model** is identical: TCX programs receive the same `__sk_buff` context and use the same helpers for packet parsing.
What TCX replaces is the **control plane**:
| Aspect | Classic TC (Lesson 20) | TCX (Lesson 50) |
|--------|----------------------|-----------------|
| Attach mechanism | Netlink / `tc` CLI | `bpf_program__attach_tcx()` |
| Ownership | None; anyone can `tc filter del` | BPF link; auto-detaches on fd close |
| Ordering | Implicit priority numbers | Explicit `BPF_F_BEFORE` / `BPF_F_AFTER` |
| Multi-program | Manual priority management | Built-in chain with revision tracking |
| Section name | `SEC("tc")` | `SEC("tcx/ingress")` / `SEC("tcx/egress")` |
| Kernel requirement | Any modern kernel | Linux 6.6+ |
If you are building new libbpf-based networking tools, TCX is the recommended interface. Cilium has already migrated from classic tc to TCX for its dataplane.
## Summary
In this tutorial, we learned how TCX modernizes TC program attachment by replacing qdisc-based plumbing with BPF link semantics. We attached two ingress programs, controlled their execution order with `BPF_F_BEFORE`, queried the live chain with `bpf_prog_query_opts()`, and verified that both programs executed in the correct order. TCX provides safe ownership, explicit ordering, revision-aware updates, and coexistence with classic TC, making it the foundation for composable, multi-program traffic control in modern eBPF applications.
If you'd like to learn more about eBPF, visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> or website <https://eunomia.dev/tutorials/> for more examples and complete tutorials.
## References
- [TCX kernel commit: fd-based tcx multi-prog infra with link support](https://lore.kernel.org/bpf/20230707172455.7634-3-daniel@iogearbox.net/)
- [BPF_PROG_TYPE_SCHED_CLS documentation](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_SCHED_CLS/)
- [bpf_program__attach_tcx libbpf API](https://docs.ebpf.io/ebpf-library/libbpf/userspace/bpf_program__attach_tcx/)
- [Cilium TCX & Netkit update (BPFConf 2024)](https://bpfconf.ebpf.io/bpfconf2024/bpfconf2024_material/tcx_netkit_update_and_global_sk_iter.pdf)
- [Generic multi-prog API, tcx links and meta device (BPFConf 2023)](http://oldvger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf)
- <https://docs.kernel.org/bpf/>

src/50-tcx/README.zh.md Normal file

@@ -0,0 +1,240 @@
# eBPF Tutorial by Example 50: Composable Traffic Control with TCX Links
Ever tried attaching multiple BPF programs to the TC ingress path, only to get tangled up in qdisc handles, filter priorities, and the `tc` CLI? Or had one application's TC program accidentally overwritten by another's? Traditional `cls_bpf` attachment works, but it inherits decades of queueing discipline plumbing that was never designed for a BPF-first world. What if you could manage TC programs with the same link model that XDP and cgroup programs already use?
This is the problem **TCX** (Traffic Control eXtension) solves. Developed by Daniel Borkmann and merged in Linux 6.6, TCX provides a lightweight, fd-based multi-program attach infrastructure for the TC ingress and egress data path. Programs get BPF link semantics (safe ownership, automatic detachment when the fd closes, explicit ordering via `BPF_F_BEFORE` / `BPF_F_AFTER`) without touching any qdisc or filter priority.
In this tutorial we attach two TCX ingress programs to the loopback interface, insert one before the other, query the kernel's live chain state, and send traffic to verify execution order.
> Complete source code: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/50-tcx>
## Background: Why Classic TC Attachment Needed a Rethink
### The Problem: Qdisc Plumbing and Unsafe Ownership
Classic `tc` BPF attachment (`cls_bpf`) was grafted onto the existing Traffic Control framework. To attach a BPF program you first create a `clsact` qdisc on the interface, then add a filter with a handle and priority. That is fine with a single operator, but in cloud-native environments where multiple applications attach TC programs to the same interface, real problems appear:
1. **No ownership model**: One application's `tc filter del` can accidentally remove another application's program, because classic tc filters are identified by handle/priority, not by the process that created them.
2. **Priority conflicts**: Two applications may pick the same priority value; the second attachment silently replaces the first.
3. **Permanent attachment by default**: Classic tc filters persist until explicitly removed. If the attaching application crashes without cleanup, the filter remains, possibly with stale program logic.
4. **CLI dependency**: Even with libbpf, the attachment model was tied to netlink, the same mechanism the `tc` CLI uses, so your BPF application shares a control plane with every other tc user on the system.
These problems became especially acute in projects like Cilium, where the BPF dataplane must coexist with third-party CNI plugins, observability agents, and security tools that all want to hook into TC.
### The Solution: Link-Based Multi-Program Management
TCX takes a completely different approach. Instead of patching the qdisc infrastructure, it provides a dedicated, qdisc-less extension point at the TC ingress and egress hooks. The core design principles:
**BPF Link Semantics**: `bpf_program__attach_tcx()` creates a `BPF_LINK_TYPE_TCX` link. Like XDP links and cgroup links, a TCX link gives you safe ownership: the link is bound to the fd, detaches automatically when the fd closes, and cannot be accidentally overridden by another application.
**Explicit Ordering**: Instead of relying on implicit priority numbers, you place programs relative to each other with `BPF_F_BEFORE` and `BPF_F_AFTER`. You can also use `BPF_F_REPLACE` to atomically swap a specific program. All operations support an `expected_revision` field that guards against races during concurrent modification.
**Chain Return Codes**: TCX defines simplified return codes that make multi-program composition explicit:
| Return Code | Value | Meaning |
|-------------|-------|---------|
| `TCX_NEXT` | -1 | Non-terminating; pass the packet to the next program in the chain |
| `TCX_PASS` | 0 | Accept the packet and terminate the chain |
| `TCX_DROP` | 2 | Drop the packet and terminate the chain |
| `TCX_REDIRECT` | 7 | Redirect the packet and terminate the chain |
Unknown return codes are mapped to `TCX_NEXT` for forward compatibility.
**Coexistence with Classic TC**: TCX links can coexist with traditional `cls_bpf` filters on the same interface. The kernel runs TCX programs first, then falls through to `tcf_classify()` if classic filters are present, allowing a gradual migration from classic tc to TCX without a disruptive cutover.
## Writing the eBPF Program
Our BPF object contains two programs that demonstrate chain composition. Here is the complete source:
```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#ifndef TCX_NEXT
#define TCX_NEXT -1
#endif
#ifndef TCX_PASS
#define TCX_PASS 0
#endif
char LICENSE[] SEC("license") = "GPL";
__u64 stats_hits;
__u64 classifier_hits;
__u32 last_len;
__u16 last_protocol;
__u32 last_ifindex;
SEC("tcx/ingress")
int tcx_stats(struct __sk_buff *skb)
{
stats_hits++;
last_len = skb->len;
last_protocol = bpf_ntohs(skb->protocol);
last_ifindex = skb->ifindex;
return TCX_NEXT;
}
SEC("tcx/ingress")
int tcx_classifier(struct __sk_buff *skb)
{
classifier_hits++;
return TCX_PASS;
}
```
Let's walk through it step by step.
### Section Names: `SEC("tcx/ingress")`
The `SEC("tcx/ingress")` annotation tells libbpf to attach this program to the TCX ingress hook rather than the classic TC classifier. This is more than a naming convention: libbpf maps this section name to `BPF_PROG_TYPE_SCHED_CLS` with the TCX attach type. The egress counterpart is `SEC("tcx/egress")`.
Note that `SEC("tc")`, `SEC("classifier")`, and `SEC("action")` are considered deprecated by libbpf; the `tcx/*` section names are recommended instead.
### Global Variables as Counters
We use global variables (`stats_hits`, `classifier_hits`, `last_len`, etc.) instead of a BPF map for counters. The libbpf skeleton exposes them through `skel->bss->stats_hits`, keeping the user-space code simple. This is fine for a single-CPU demo; in production you would use per-CPU maps to avoid data races.
### Return Codes: `TCX_NEXT` vs `TCX_PASS`
This is the heart of TCX composition:
- `tcx_stats` returns `TCX_NEXT`, meaning "my work is done, pass the packet to the next program in the chain." The chain keeps executing.
- `tcx_classifier` returns `TCX_PASS`, a terminal verdict: the packet is accepted and no further programs in the chain run.
If we had placed `tcx_classifier` *before* `tcx_stats`, `tcx_stats` would never execute, because `TCX_PASS` terminates the chain. Ordering matters, and TCX makes it explicit.
## User-Space Loader: Attaching and Querying the Chain
The user-space code demonstrates three key TCX operations: attaching programs, ordering them relative to each other, and querying the live chain.
### Step 1: Attach the First Program
```c
classifier_link = bpf_program__attach_tcx(skel->progs.tcx_classifier,
ifindex, NULL);
```
`tcx_classifier` 挂到指定接口的 TCX ingress 挂载点上。`NULL` 选项表示"使用默认值",程序被追加到链的末尾。此时链中有一个程序。
### 第二步:把第二个程序插到第一个*前面*
```c
LIBBPF_OPTS(bpf_tcx_opts, before_opts,
.flags = BPF_F_BEFORE,
.relative_fd = bpf_program__fd(skel->progs.tcx_classifier));
stats_link = bpf_program__attach_tcx(skel->progs.tcx_stats,
ifindex, &before_opts);
```
The `bpf_tcx_opts` structure tells the kernel to insert `tcx_stats` *before* `tcx_classifier`. The `.relative_fd` field identifies the reference point: the fd of the already-attached classifier program. Afterwards the chain order is `tcx_stats` → `tcx_classifier`.
You could equivalently use `BPF_F_AFTER` with a different reference point to achieve the same ordering. The point is that you express the desired order directly instead of hoping two numeric priorities happen to sort correctly.
### Step 3: Query the Chain
```c
LIBBPF_OPTS(bpf_prog_query_opts, query);
query.count = 8;
query.prog_ids = prog_ids;
query.link_ids = link_ids;
err = bpf_prog_query_opts(ifindex, BPF_TCX_INGRESS, &query);
```
After attachment, the loader queries the kernel for the live chain state. The returned data includes:
- **`revision`**: A monotonically increasing counter that changes whenever the chain is modified. To perform an atomic update, pass this value as `expected_revision`.
- **`prog_ids[]`**: The BPF program IDs in chain order.
- **`link_ids[]`**: The corresponding BPF link IDs.
This lets any observer determine exactly which programs are attached and in what order, which is invaluable when debugging multi-program pipelines.
### Step 4: Generate Traffic and Read Counters
The loader sends a UDP packet to `127.0.0.1` (port 9, the discard service) to trigger the chain, waits briefly, then reads the global variables to verify that both programs executed:
```c
printf(" tcx_stats hits : %llu\n",
(unsigned long long)skel->bss->stats_hits);
printf(" tcx_classifier hits : %llu\n",
(unsigned long long)skel->bss->classifier_hits);
```
If both counters are 1, the chain worked as expected: `tcx_stats` ran first (recording metadata and returning `TCX_NEXT`), then `tcx_classifier` ran second (counting the packet and returning `TCX_PASS`).
## Compilation and Execution
This example requires Linux 6.6+ with TCX support and a recent libbpf.
```bash
cd bpf-developer-tutorial/src/50-tcx
make
sudo ./tcx_demo -i lo
```
Expected output:
```text
Attached TCX programs to lo (ifindex=1)
TCX ingress chain revision: 3
slot 0: prog_id=812 link_id=901
slot 1: prog_id=811 link_id=900
Counters:
tcx_stats hits : 1
tcx_classifier hits : 1
last ifindex : 1
last protocol : 0x0800
last length : 46
```
The revision is 3 because the kernel initializes a TCX entry's revision to 1 when the entry is created and bumps it on every chain modification: attaching `tcx_classifier` took it to 2, and inserting `tcx_stats` before it took it to 3. Queries read the revision without modifying it.
To inspect the attach behavior without generating traffic, add `-n`:
```bash
sudo ./tcx_demo -i lo -n
```
`-v` 开启 libbpf 调试输出,可以看到底层 BPF syscall 的执行序列。
## 它和第 20 课(经典 TC的区别
[第 20 课-tc](../20-tc/README.zh.md) 讲的是经典 TC 数据路径:创建 `clsact` qdisc挂载 `SEC("tc")` 程序作为 filter使用 `__sk_buff` 进行包检查。那一课仍然有价值,因为**数据包处理模型**是完全相同的TCX 程序收到的是相同的 `__sk_buff` context使用相同的 helper 来解析数据包。
TCX 替换的是**控制面**
| 方面 | 经典 TC第 20 课) | TCX第 50 课) |
|------|---------------------|-----------------|
| 挂载方式 | Netlink / `tc` CLI | `bpf_program__attach_tcx()` |
| 所有权 | 无;任何人可以 `tc filter del` | BPF linkfd 关闭时自动卸载 |
| 排序 | 隐式 priority 数字 | 显式 `BPF_F_BEFORE` / `BPF_F_AFTER` |
| 多程序 | 手动 priority 管理 | 内建链 + revision 追踪 |
| Section 名 | `SEC("tc")` | `SEC("tcx/ingress")` / `SEC("tcx/egress")` |
| 内核要求 | 任意现代内核 | Linux 6.6+ |
如果你正在构建新的 libbpf 网络工具TCX 是推荐的接口。Cilium 已经将其数据面从经典 tc 迁移到了 TCX。
## 总结
本教程介绍了 TCX 如何用 BPF link 语义取代基于 qdisc 的 TC 程序管理。我们挂载了两个 ingress 程序,用 `BPF_F_BEFORE` 控制了它们的执行顺序,用 `bpf_prog_query_opts()` 查询了实时链状态并验证了两个程序按正确顺序执行。TCX 提供了安全的所有权、显式排序、revision 感知的更新以及和经典 TC 的共存能力,使其成为现代 eBPF 应用中可组合、多程序流量控制的基石。
如果你想了解更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 <https://github.com/eunomia-bpf/bpf-developer-tutorial> 或网站 <https://eunomia.dev/tutorials/> 获取更多示例和完整教程。
## 参考
- [TCX kernel commit: fd-based tcx multi-prog infra with link support](https://lore.kernel.org/bpf/20230707172455.7634-3-daniel@iogearbox.net/)
- [BPF_PROG_TYPE_SCHED_CLS documentation](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_SCHED_CLS/)
- [bpf_program__attach_tcx libbpf API](https://docs.ebpf.io/ebpf-library/libbpf/userspace/bpf_program__attach_tcx/)
- [Cilium TCX & Netkit update (BPFConf 2024)](https://bpfconf.ebpf.io/bpfconf2024/bpfconf2024_material/tcx_netkit_update_and_global_sk_iter.pdf)
- [Generic multi-prog API, tcx links and meta device (BPFConf 2023)](http://oldvger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf)
- <https://docs.kernel.org/bpf/>

src/50-tcx/tcx_demo.bpf.c Normal file

@@ -0,0 +1,37 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#ifndef TCX_NEXT
#define TCX_NEXT -1
#endif
#ifndef TCX_PASS
#define TCX_PASS 0
#endif
char LICENSE[] SEC("license") = "GPL";
__u64 stats_hits;
__u64 classifier_hits;
__u32 last_len;
__u16 last_protocol;
__u32 last_ifindex;
SEC("tcx/ingress")
int tcx_stats(struct __sk_buff *skb)
{
stats_hits++;
last_len = skb->len;
last_protocol = bpf_ntohs(skb->protocol);
last_ifindex = skb->ifindex;
return TCX_NEXT;
}
SEC("tcx/ingress")
int tcx_classifier(struct __sk_buff *skb)
{
classifier_hits++;
return TCX_PASS;
}

src/50-tcx/tcx_demo.c Normal file

@@ -0,0 +1,192 @@
// SPDX-License-Identifier: GPL-2.0
#include <arpa/inet.h>
#include <errno.h>
#include <net/if.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "tcx_demo.skel.h"
static struct env {
const char *ifname;
bool verbose;
bool no_trigger;
} env = {
.ifname = "lo",
};
static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
if (level == LIBBPF_DEBUG && !env.verbose)
return 0;
return vfprintf(stderr, format, args);
}
static void usage(const char *prog)
{
fprintf(stderr,
"Usage: %s [-i IFACE] [-v] [-n]\n"
" -i IFACE attach TCX programs to interface (default: lo)\n"
" -v enable libbpf debug logs\n"
" -n do not generate loopback traffic automatically\n",
prog);
}
static int parse_args(int argc, char **argv)
{
int opt;
while ((opt = getopt(argc, argv, "i:vn")) != -1) {
switch (opt) {
case 'i':
env.ifname = optarg;
break;
case 'v':
env.verbose = true;
break;
case 'n':
env.no_trigger = true;
break;
default:
return -EINVAL;
}
}
return 0;
}
static int generate_loopback_traffic(void)
{
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(9),
};
const char payload[] = "tcx tutorial packet";
int fd, err = 0;
if (inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr) != 1)
return -EINVAL;
fd = socket(AF_INET, SOCK_DGRAM, 0);
if (fd < 0)
return -errno;
if (sendto(fd, payload, sizeof(payload), 0, (struct sockaddr *)&addr, sizeof(addr)) < 0)
err = -errno;
close(fd);
return err;
}
static void print_tcx_chain(int ifindex)
{
LIBBPF_OPTS(bpf_prog_query_opts, query);
__u32 prog_ids[8] = {};
__u32 link_ids[8] = {};
int err;
__u32 i;
query.count = 8;
query.prog_ids = prog_ids;
query.link_ids = link_ids;
err = bpf_prog_query_opts(ifindex, BPF_TCX_INGRESS, &query);
if (err) {
fprintf(stderr, "bpf_prog_query_opts failed: %s\n", strerror(errno));
return;
}
printf("TCX ingress chain revision: %llu\n",
(unsigned long long)query.revision);
for (i = 0; i < query.count; i++) {
printf(" slot %u: prog_id=%u link_id=%u\n",
i, prog_ids[i], link_ids[i]);
}
}
int main(int argc, char **argv)
{
struct tcx_demo_bpf *skel = NULL;
struct bpf_link *classifier_link = NULL, *stats_link = NULL;
int ifindex, err;
err = parse_args(argc, argv);
if (err) {
usage(argv[0]);
return 1;
}
ifindex = if_nametoindex(env.ifname);
if (!ifindex) {
fprintf(stderr, "unknown interface '%s'\n", env.ifname);
return 1;
}
libbpf_set_print(libbpf_print_fn);
skel = tcx_demo_bpf__open_and_load();
if (!skel) {
fprintf(stderr, "failed to open and load tcx skeleton\n");
return 1;
}
classifier_link = bpf_program__attach_tcx(skel->progs.tcx_classifier,
ifindex, NULL);
err = libbpf_get_error(classifier_link);
if (err) {
fprintf(stderr, "failed to attach tcx_classifier: %s\n",
strerror(-err));
classifier_link = NULL;
goto cleanup;
}
{
LIBBPF_OPTS(bpf_tcx_opts, before_opts,
.flags = BPF_F_BEFORE,
.relative_fd = bpf_program__fd(skel->progs.tcx_classifier));
stats_link = bpf_program__attach_tcx(skel->progs.tcx_stats,
ifindex, &before_opts);
err = libbpf_get_error(stats_link);
if (err) {
fprintf(stderr, "failed to attach tcx_stats: %s\n",
strerror(-err));
stats_link = NULL;
goto cleanup;
}
}
printf("Attached TCX programs to %s (ifindex=%d)\n", env.ifname, ifindex);
print_tcx_chain(ifindex);
if (!env.no_trigger && strcmp(env.ifname, "lo") == 0) {
err = generate_loopback_traffic();
if (err)
fprintf(stderr, "failed to generate loopback traffic: %s\n",
strerror(-err));
usleep(200000);
} else if (!env.no_trigger) {
printf("Generate traffic on %s and re-run with -n if you only want attach/query.\n",
env.ifname);
}
printf("\nCounters:\n");
printf(" tcx_stats hits : %llu\n",
(unsigned long long)skel->bss->stats_hits);
printf(" tcx_classifier hits : %llu\n",
(unsigned long long)skel->bss->classifier_hits);
printf(" last ifindex : %u\n", skel->bss->last_ifindex);
printf(" last protocol : 0x%04x\n", skel->bss->last_protocol);
printf(" last length : %u\n", skel->bss->last_len);
cleanup:
bpf_link__destroy(stats_link);
bpf_link__destroy(classifier_link);
tcx_demo_bpf__destroy(skel);
return err != 0;
}


@@ -65,6 +65,7 @@ Networking:
- [lesson 41-xdp-tcpdump](41-xdp-tcpdump/README.md) Capturing TCP Information with XDP
- [lesson 42-xdp-loadbalancer](42-xdp-loadbalancer/README.md) XDP Load Balancer
- [lesson 46-xdp-test](46-xdp-test/README.md) Building a High-Performance XDP Packet Generator
- [lesson 50-tcx](50-tcx/README.md) Composable Traffic Control with TCX Links
Tracing:
@@ -94,6 +95,7 @@ Features:
- [lesson 36-userspace-ebpf](36-userspace-ebpf/README.md) Userspace eBPF Runtimes: Overview and Applications
- [lesson 38-btf-uprobe](38-btf-uprobe/README.md) Expanding eBPF Compile Once, Run Everywhere (CO-RE) to Userspace Compatibility
- [lesson 43-kfuncs](43-kfuncs/README.md) Extending eBPF Beyond Its Limits: Custom kfuncs in Kernel Modules
- [features bpf_token](features/bpf_token/README.md) BPF Token for Delegated Privilege and Secure Program Loading
- [features bpf_wq](features/bpf_wq/README.md) BPF Workqueues for Asynchronous Sleepable Tasks
- [features struct_ops](features/struct_ops/README.md) Extending Kernel Subsystems with BPF struct_ops
- [features dynptr](features/dynptr/README.md) BPF Dynamic Pointers for Variable-Length Data


@@ -59,6 +59,7 @@ GPU:
- [lesson 41-xdp-tcpdump](41-xdp-tcpdump/README.zh.md) eBPF 示例教程:使用 XDP 捕获 TCP 信息
- [lesson 42-xdp-loadbalancer](42-xdp-loadbalancer/README.zh.md) eBPF 开发者教程: 简单的 XDP 负载均衡器
- [lesson 46-xdp-test](46-xdp-test/README.zh.md) eBPF 实例教程:构建高性能 XDP 数据包生成器
- [lesson 50-tcx](50-tcx/README.zh.md) eBPF 入门实践教程第五十篇:使用 TCX Link 实现可组合的流量控制
安全:
- [lesson 24-hide](24-hide/README.zh.md) eBPF 开发实践:使用 eBPF 隐藏进程或文件信息
@@ -73,6 +74,7 @@ GPU:
- [lesson 36-userspace-ebpf](36-userspace-ebpf/README.zh.md) 用户空间 eBPF 运行时:深度解析与应用实践
- [lesson 38-btf-uprobe](38-btf-uprobe/README.zh.md) 借助 eBPF 和 BTF让用户态也能一次编译、到处运行
- [lesson 43-kfuncs](43-kfuncs/README.zh.md) 超越 eBPF 的极限:在内核模块中定义自定义 kfunc
- [features bpf_token](features/bpf_token/README.zh.md) eBPF 入门实践教程BPF Token安全的委托式权限与程序加载
- [features bpf_wq](features/bpf_wq/README.zh.md) eBPF 教程BPF 工作队列用于异步可睡眠任务
- [features struct_ops](features/struct_ops/README.zh.md) eBPF 教程:使用 BPF struct_ops 扩展内核子系统
- [features dynptr](features/dynptr/README.zh.md) BPF Dynamic Pointers for Variable-Length Data


@@ -0,0 +1,2 @@
level=Depth
type=Features

src/features/bpf_token/.gitignore vendored Normal file

@@ -0,0 +1,13 @@
# Build artifacts
.output/
*.o
*.skel.h
# Generated binaries
token_trace
token_userns_demo
# Editor files
*.swp
*~
.vscode/


@@ -0,0 +1,93 @@
# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
OUTPUT := .output
CLANG ?= clang
LIBBPF_SRC := $(abspath ../../third_party/libbpf/src)
BPFTOOL_SRC := $(abspath ../../third_party/bpftool/src)
LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a)
BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool)
BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool
ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \
| sed 's/arm.*/arm/' \
| sed 's/aarch64/arm64/' \
| sed 's/ppc64le/powerpc/' \
| sed 's/mips.*/mips/' \
| sed 's/riscv64/riscv/' \
| sed 's/loongarch64/loongarch/')
VMLINUX := ../../third_party/vmlinux/$(ARCH)/vmlinux.h
INCLUDES := -I$(OUTPUT) -I../../third_party/libbpf/include/uapi -I$(dir $(VMLINUX)) -I.
CFLAGS := -g -Wall
ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS)
BPF_APPS = token_trace
APPS = $(BPF_APPS) token_userns_demo
CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - </dev/null 2>&1 \
| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }')
ifeq ($(V),1)
Q =
msg =
else
Q = @
msg = @printf ' %-8s %s%s\n' \
"$(1)" \
"$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \
"$(if $(3), $(3))";
MAKEFLAGS += --no-print-directory
endif
define allow-override
$(if $(or $(findstring environment,$(origin $(1))),\
$(findstring command line,$(origin $(1)))),,\
$(eval $(1) = $(2)))
endef
$(call allow-override,CC,$(CROSS_COMPILE)cc)
$(call allow-override,LD,$(CROSS_COMPILE)ld)
.PHONY: all
all: $(APPS)
.PHONY: clean
clean:
$(call msg,CLEAN)
$(Q)rm -rf $(OUTPUT) $(APPS)
$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT):
$(call msg,MKDIR,$@)
$(Q)mkdir -p $@
$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf
$(call msg,LIB,$@)
$(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \
OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \
INCLUDEDIR= LIBDIR= UAPIDIR= \
install
$(BPFTOOL): | $(BPFTOOL_OUTPUT)
$(call msg,BPFTOOL,$@)
$(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap
$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL)
$(call msg,BPF,$@)
$(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \
$(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \
-c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@)
$(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@)
$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL)
$(call msg,GEN-SKEL,$@)
$(Q)$(BPFTOOL) gen skeleton $< > $@
$(patsubst %,$(OUTPUT)/%.o,$(BPF_APPS)): %.o: %.skel.h
$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT)
$(call msg,CC,$@)
$(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@
$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT)
$(call msg,BINARY,$@)
$(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@
.DELETE_ON_ERROR:
.SECONDARY:


@@ -0,0 +1,289 @@
# eBPF Tutorial by Example: BPF Token for Delegated Privilege and Secure Program Loading
Ever needed to let a container or CI job load an eBPF program without giving it full `CAP_BPF` or `CAP_SYS_ADMIN`? Or wanted to expose XDP packet processing to a tenant workload while ensuring it can only create the specific map types and program types you've approved? Before BPF token, the answer was binary: either you had the capabilities to do *everything* in BPF, or you could do *nothing*. There was no middle ground.
This is what **BPF Token** solves. Introduced by Andrii Nakryiko and merged in Linux 6.9, BPF token is a delegation mechanism that lets a privileged process (like a container runtime or systemd) create a precisely scoped permission set for BPF operations, then hand it to an unprivileged process through a bpffs mount. The unprivileged process can load programs, create maps, and attach hooks, but only the types that were explicitly allowed. No broad capabilities required.
In this tutorial, we'll set up a delegated bpffs mount in a user namespace, derive a BPF token from it, and use libbpf to load and attach a minimal XDP program, all from a process that has zero BPF capabilities of its own.
> The complete source code: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_token>
## Introduction to BPF Token: Solving the Privilege Problem
### The Problem: All-or-Nothing BPF Capabilities
Traditional eBPF requires `CAP_BPF` for program loading and map creation, plus additional capabilities like `CAP_PERFMON` for tracing, `CAP_NET_ADMIN` for networking hooks, and `CAP_SYS_ADMIN` for certain advanced operations. These capabilities are inherently **system-wide**: you cannot namespace or sandbox `CAP_BPF`. As the kernel documentation explains, this is by design: BPF tracing helpers like `bpf_probe_read_kernel()` can access arbitrary kernel memory, which fundamentally cannot be scoped to a single namespace.
This creates a real problem in multi-tenant environments:
1. **Container isolation**: A Kubernetes pod that needs to run a simple XDP program must be given `CAP_BPF` + `CAP_NET_ADMIN`, which also grants it the ability to load *any* BPF program type and create *any* map type. There's no way to say "you can load XDP programs but not kprobes."
2. **CI/CD pipelines**: A build job that tests an eBPF-based observability tool needs root-equivalent capabilities to load programs, even though the test only exercises a specific, well-known program type.
3. **Third-party integrations**: A service mesh sidecar that attaches sockops programs needs capabilities that also grant it the ability to trace every process on the host.
The result is that organizations either give broad BPF capabilities (weakening their security posture) or prohibit BPF entirely in unprivileged contexts (limiting the technology's adoption).
### The Solution: Scoped Delegation Through bpffs
BPF token takes a different approach. Instead of trying to namespace capabilities (which is fundamentally unsafe for BPF), it introduces an explicit delegation model:
1. A **privileged process** (container runtime, init system, platform daemon) creates a bpffs instance with specific delegation options that define exactly which BPF operations are allowed.
2. The privileged process passes this bpffs mount to an **unprivileged process** (container, CI job, tenant workload).
3. The unprivileged process derives a **BPF token** from the bpffs mount. The token is a file descriptor that carries the delegated permission set.
4. When the unprivileged process makes `bpf()` syscalls (through libbpf or directly), it passes the token fd. The kernel checks permissions against the token instead of against the process's capabilities.
The token is scoped along four independent axes:
| Delegation Option | What It Controls | Example |
|-------------------|-----------------|---------|
| `delegate_cmds` | Which `bpf()` commands are allowed | `prog_load:map_create:btf_load:link_create` |
| `delegate_maps` | Which map types can be created | `array:hash:ringbuf` |
| `delegate_progs` | Which program types can be loaded | `xdp:socket_filter` |
| `delegate_attachs` | Which attach types are allowed | `xdp:cgroup_inet_ingress` or `any` |
Each axis is a bitmask. If a bit isn't set, the corresponding operation is denied even if the token is present. This gives platform engineers fine-grained control: you can allow a container to load XDP programs with array maps but deny it access to kprobes, perf events, or hash-of-maps.
### The User Namespace Constraint
One critical design decision: **a BPF token must be created inside the same user namespace as the bpffs instance, and that user namespace must not be `init_user_ns`**. This is intentional. It means:
- A host-namespace bpffs (the one at `/sys/fs/bpf`) does **not** produce usable tokens. Tokens only work when the bpffs is associated with a non-init user namespace.
- The privileged parent configures the bpffs before passing it to the child, but the child (in its own user namespace) is the one that creates and uses the token.
- This design prevents a process with an existing token from using it to escalate privileges outside its namespace boundary.
### How libbpf Makes It Transparent
For applications built with libbpf (which is most of them), token usage is nearly transparent. You have three options:
1. **Explicit path**: Set `bpf_object_open_opts.bpf_token_path` when opening the BPF object. libbpf will derive the token from the specified bpffs mount.
2. **Environment variable**: Set `LIBBPF_BPF_TOKEN_PATH` to point to the bpffs mount. libbpf picks it up automatically.
3. **Default path**: If the default `/sys/fs/bpf` is a delegated bpffs in the current user namespace, libbpf uses it implicitly.
Once the token is derived, libbpf passes it to every relevant syscall (`BPF_MAP_CREATE`, `BPF_BTF_LOAD`, `BPF_PROG_LOAD`, and `BPF_LINK_CREATE`) without any source-code changes in the BPF application.
## Writing the eBPF Program
The BPF side of this demo is intentionally minimal: a tiny XDP program on loopback. This keeps the focus on the token workflow. Here's the complete source:
```c
// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
char LICENSE[] SEC("license") = "GPL";
struct token_stats {
__u64 packets;
__u32 last_ifindex;
};
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, struct token_stats);
} stats_map SEC(".maps");
SEC("xdp")
int handle_packet(struct xdp_md *ctx)
{
struct token_stats *stats;
__u32 key = 0;
stats = bpf_map_lookup_elem(&stats_map, &key);
if (!stats)
return 0;
stats->packets++;
stats->last_ifindex = ctx->ingress_ifindex;
return XDP_PASS;
}
```
A few design choices to note:
**`BPF_MAP_TYPE_ARRAY`** was chosen because the delegation policy explicitly allows `array` maps. If we had used a hash map instead, loading would fail because the token doesn't grant `hash` map creation permission. This is the token model in action; even trivial program changes can be caught by the delegation policy.
**`SEC("xdp")`** matches the `delegate_progs=xdp` policy. If you changed this to `SEC("kprobe/...")`, the kernel would reject it at load time with an `EPERM` because kprobe isn't in the allowed program types.
**`XDP_PASS`** simply lets every packet through. The program's only purpose is to prove that a token-backed load and attach succeeded. In production, you'd replace this with real packet-processing logic.
## User-Space Loader: Token-Backed Loading
The `token_trace.c` loader is a standard libbpf skeleton program with one key addition: it passes a `bpf_token_path`:
```c
struct bpf_object_open_opts open_opts = {};
open_opts.sz = sizeof(open_opts);
open_opts.bpf_token_path = env.token_path;
skel = token_trace_bpf__open_opts(&open_opts);
```
From this point on, libbpf takes over. When it calls `bpf(BPF_MAP_CREATE)` to create `stats_map`, it includes the token fd. When it calls `bpf(BPF_PROG_LOAD)` for the XDP program, it includes the token fd. When it calls `bpf(BPF_LINK_CREATE)` to attach to the interface, it includes the token fd.
The rest of the loader is straightforward:
```c
err = token_trace_bpf__load(skel); // token used for map_create + prog_load
link = bpf_program__attach_xdp(skel->progs.handle_packet, ifindex); // token used for link_create
```
After attaching, the loader reads the map before and after generating a test packet to verify the program executed:
```c
err = bpf_map_lookup_elem(map_fd, &key, &before);
// ... generate UDP packet to 127.0.0.1 ...
err = bpf_map_lookup_elem(map_fd, &key, &after);
printf("delta : %llu\n", after.packets - before.packets);
```
If the delta is 1, the XDP program was successfully loaded and attached using only delegated capabilities.
## The Namespace Orchestrator: `token_userns_demo`
Because BPF token requires a non-init user namespace, running a bare `token_trace -t /sys/fs/bpf` on the host won't work. The `token_userns_demo.c` wrapper automates the complex namespace choreography. Here's the full sequence:
### Step 1: Fork and Create Namespaces
```
parent (root, init_user_ns) child (unprivileged, new userns)
│ │
│ fork() │
├────────────────────────────────────────>│
│ │
│ unshare(CLONE_NEWUSER)
│ unshare(CLONE_NEWNS | CLONE_NEWNET)
```
The child creates a new user namespace (where it maps itself to uid/gid 0), a new mount namespace (so bpffs mounts are private), and a new network namespace (so `lo` is a fresh interface it can attach to).
### Step 2: Create bpffs and Configure Delegation
```
parent (root, init_user_ns) child (new userns)
│ │
│ fs_fd = fsopen("bpf", 0)
│ <───── send fs_fd via SCM_RIGHTS ────│
│ │
fsconfig(fs_fd, "delegate_cmds", ...) │ (waiting for ack)
fsconfig(fs_fd, "delegate_maps", "array") │
fsconfig(fs_fd, "delegate_progs", "xdp:...") │
fsconfig(fs_fd, "delegate_attachs", "any") │
fsconfig(fs_fd, FSCONFIG_CMD_CREATE) │
│ │
│ ───────── send ack ─────────────────>│
```
The child calls `fsopen("bpf", 0)` to create a bpffs filesystem context in its user namespace, then sends the file descriptor to the parent via a Unix socket (`SCM_RIGHTS`). The parent, running as root in the init namespace, configures the delegation policy with `fsconfig()`, then materializes the filesystem with `FSCONFIG_CMD_CREATE`.
This two-step dance is necessary because: (a) the bpffs must be created in the child's user namespace (for the token to be valid there), but (b) only the privileged parent can set delegation options (because those options grant BPF capabilities).
### Step 3: Mount and Load
```
child (new userns)
mnt_fd = fsmount(fs_fd, 0, 0)
token_path = "/proc/self/fd/<mnt_fd>"
set_loopback_up()
exec("./token_trace", "-t", token_path, "-i", "lo")
```
The child materializes the bpffs as a detached mount (no mount point needed, since `/proc/self/fd/<mnt_fd>` gives a path), brings the loopback interface up in its network namespace, and `exec`s `token_trace` with the bpffs path. From `token_trace`'s perspective, it's just opening a BPF object with a token path. It doesn't know or care about the namespace setup.
## Preparing a bpffs Mount Manually
If you want to experiment with the mount syntax outside the demo wrapper, the repository includes a helper script:
```bash
cd bpf-developer-tutorial/src/features/bpf_token
bash setup_token_bpffs.sh /tmp/bpf-token
```
This mounts bpffs at `/tmp/bpf-token` with:
```text
delegate_cmds=prog_load:map_create:btf_load:link_create
delegate_maps=array
delegate_progs=xdp:socket_filter
delegate_attachs=any
```
**Why `socket_filter`?** libbpf performs a trivial program-load probe before loading the real BPF object. This probe uses a generic `BPF_PROG_TYPE_SOCKET_FILTER` program to detect kernel feature support. Without `socket_filter` in the delegation policy, the probe fails and libbpf refuses to proceed.
**Why `delegate_attachs=any`?** The same libbpf probe path also triggers attach-type validation in the kernel's token checking code. Using `any` avoids having to enumerate every possible attach type for probe compatibility.
Note that a host-namespace mount like this is useful for inspecting the delegation policy (e.g., with `bpftool token list`), but won't produce working tokens unless the `bpf(BPF_TOKEN_CREATE)` syscall comes from a matching non-init user namespace.
## Compilation and Execution
Build all binaries:
```bash
cd bpf-developer-tutorial/src/features/bpf_token
make
```
Run the end-to-end demo:
```bash
sudo ./token_userns_demo
```
Expected output:
```text
token path : /proc/self/fd/5
interface : lo (ifindex=1)
packets before : 0
packets after : 1
delta : 1
last ifindex : 1
```
The `delta: 1` confirms that the XDP program was successfully loaded and attached using a BPF token, with no `CAP_BPF` or `CAP_SYS_ADMIN` in the child process.
Add `-v` for verbose libbpf output to see the token being created and used:
```bash
sudo ./token_userns_demo -v
```
If you already manage your own delegated bpffs in a user namespace, you can run the loader directly:
```bash
./token_trace -t /proc/self/fd/<mnt-fd> -i lo
```
## Real-World Applications
While this tutorial uses a minimal XDP program, the BPF token pattern scales to production scenarios:
- **Container runtimes** (LXD, Docker, Kubernetes): Mount a delegated bpffs into a container with only the program and map types the workload needs. LXD already supports this through its `security.delegate_bpf` option.
- **CI/CD testing**: Give build jobs the ability to load and test specific eBPF programs without granting them host-level capabilities. The delegation policy acts as an allowlist for BPF operations.
- **Multi-tenant BPF platforms**: A platform daemon creates per-tenant bpffs mounts with different delegation policies. One tenant might be allowed XDP + array maps, while another might get tracepoint + ringbuf access.
- **LSM integration**: Because BPF tokens integrate with Linux Security Modules, you can combine token delegation with SELinux or AppArmor policies for defense-in-depth. Each token gets its own security context that LSM hooks can inspect.
## Summary
In this tutorial, we learned how BPF token provides a delegation model for eBPF privilege that goes beyond the binary "all or nothing" of Linux capabilities. We walked through the complete flow: a privileged parent configures a bpffs instance with specific delegation options, an unprivileged child in a user namespace derives a token from that bpffs, and libbpf transparently uses the token for map creation, program loading, and attachment. The result is a minimal XDP program running in an unprivileged context, something that was impossible before Linux 6.9.
BPF token is not a niche feature. It represents the kernel's answer to a fundamental question in the eBPF ecosystem: how do you safely share BPF capabilities in a multi-tenant world without granting unconstrained access to the BPF subsystem?
If you'd like to learn more about eBPF, visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> or website <https://eunomia.dev/tutorials/> for more examples and complete tutorials.
## References
- [BPF Token concept documentation](https://docs.ebpf.io/linux/concepts/token/)
- [BPF token kernel patch series (Andrii Nakryiko)](https://lore.kernel.org/bpf/20240103222034.2582628-1-andrii@kernel.org/T/)
- [BPF token LWN article](https://lwn.net/Articles/959350/)
- [Finer-grained BPF tokens LWN discussion](https://lwn.net/Articles/947173/)
- [Privilege delegation using BPF Token (LXD documentation)](https://documentation.ubuntu.com/lxd/latest/explanation/bpf/)
- [bpf_token_create() libbpf API](https://docs.ebpf.io/ebpf-library/libbpf/userspace/bpf_token_create/)
- <https://docs.kernel.org/bpf/>


@@ -0,0 +1,289 @@
# eBPF 入门实践教程BPF Token安全的委托式权限与程序加载
你是否需要让容器或 CI 任务加载一个 eBPF 程序,但又不想给它完整的 `CAP_BPF``CAP_SYS_ADMIN`?或者你想把 XDP 数据包处理能力开放给租户工作负载,同时确保它只能创建你批准过的 map 类型和 program 类型?在 BPF token 出现之前,答案是二元的:要么你有能力在 BPF 中做*一切*,要么你*什么都做不了*。没有中间地带。
这就是 **BPF Token** 要解决的问题。BPF token 由 Andrii Nakryiko 开发,于 Linux 6.9 合入内核,它是一种委托机制,让特权进程(如容器运行时或 systemd创建一组精确限定范围的 BPF 操作许可集合,然后通过 bpffs 挂载传递给非特权进程。非特权进程可以加载程序、创建 map、挂载 hook但只能使用被显式允许的类型。不需要任何宽泛的 capability。
本教程将在 user namespace 中设置一个带委托策略的 bpffs 挂载,从中派生 BPF token然后用 libbpf 加载并挂载一个最小的 XDP 程序。所有操作来自一个本身没有任何 BPF capability 的进程。
> 完整源代码: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_token>
## 背景:解决 BPF 权限问题
### 问题:全有或全无的 BPF Capability
传统 eBPF 需要 `CAP_BPF` 来加载程序和创建 map还需要 `CAP_PERFMON`(用于 tracing`CAP_NET_ADMIN`(用于网络 hook`CAP_SYS_ADMIN`(用于某些高级操作)等额外的 capability。这些 capability 本质上是**系统级**的,你无法对 `CAP_BPF` 做 namespace 隔离或沙箱化。内核文档解释了原因BPF tracing helper`bpf_probe_read_kernel()`)可以访问任意内核内存,这在根本上无法被限定到单个 namespace 中。
这在多租户环境中造成了实际问题:
1. **容器隔离**:一个只需要运行简单 XDP 程序的 Kubernetes Pod 必须被赋予 `CAP_BPF` + `CAP_NET_ADMIN`,但这也同时赋予了它加载*任意* BPF 程序类型和创建*任意* map 类型的能力。你没办法说"你可以加载 XDP 程序但不能加载 kprobe"。
2. **CI/CD 流水线**:一个测试 eBPF 可观测工具的构建任务需要 root 级别的 capability 来加载程序,即使测试只涉及一个特定的、已知的程序类型。
3. **第三方集成**:一个 service mesh sidecar 需要挂载 sockops 程序的 capability但这些 capability 同时也赋予了它 trace 主机上每个进程的能力。
结果就是:组织要么给出宽泛的 BPF capability削弱安全态势要么在非特权环境中完全禁止 BPF限制了该技术的采用
### 解决方案:通过 bpffs 进行精确委托
BPF token 采取了不同的思路。它没有尝试对 capability 做 namespace 化(对 BPF 来说这根本不安全),而是引入了显式的委托模型:
1. **特权进程**容器运行时、init 系统、平台守护进程)创建一个带有特定委托选项的 bpffs 实例,精确定义允许哪些 BPF 操作。
2. 特权进程将这个 bpffs 挂载传递给**非特权进程**容器、CI 任务、租户工作负载)。
3. 非特权进程从 bpffs 挂载中派生**BPF token**。token 是一个文件描述符,承载着委托的权限集合。
4. 当非特权进程发起 `bpf()` 系统调用时(通过 libbpf 或直接调用),传入 token fd。内核根据 token 而不是进程的 capability 来检查权限。
token 沿四个独立轴进行限定:
| 委托选项 | 控制内容 | 示例 |
|----------|---------|------|
| `delegate_cmds` | 允许哪些 `bpf()` 命令 | `prog_load:map_create:btf_load:link_create` |
| `delegate_maps` | 允许创建哪些 map 类型 | `array:hash:ringbuf` |
| `delegate_progs` | 允许加载哪些程序类型 | `xdp:socket_filter` |
| `delegate_attachs` | 允许哪些 attach 类型 | `xdp:cgroup_inet_ingress``any` |
每个轴是一个位掩码。如果某个位未设置,对应的操作即使有 token 也会被拒绝。这给了平台工程师细粒度的控制:你可以允许容器加载带 array map 的 XDP 程序,但拒绝它访问 kprobe、perf event 或 hash-of-maps。
### User Namespace 约束
一个关键的设计决定:**BPF token 必须在和 bpffs 实例相同的 user namespace 中创建,且该 user namespace 不能是 `init_user_ns`**。这是有意为之。这意味着:
- 主机 namespace 下的 bpffs`/sys/fs/bpf`**不能**产生可用的 token。token 只在 bpffs 关联到非 init 的 user namespace 时才能工作。
- 特权父进程在将 bpffs 传给子进程之前配置好委托策略,但子进程(在自己的 user namespace 中)才是创建和使用 token 的一方。
- 这个设计防止持有 token 的进程利用它在 namespace 边界之外提升权限。
### libbpf 如何让它变得透明
对于基于 libbpf 构建的应用(大多数 eBPF 应用都是token 的使用几乎是透明的。你有三种选择:
1. **显式路径**:在打开 BPF 对象时设置 `bpf_object_open_opts.bpf_token_path`。libbpf 会从指定的 bpffs 挂载中派生 token。
2. **环境变量**:设置 `LIBBPF_BPF_TOKEN_PATH` 指向 bpffs 挂载。libbpf 自动识别。
3. **默认路径**:如果默认的 `/sys/fs/bpf` 是当前 user namespace 中的委托 bpffslibbpf 隐式使用它。
一旦 token 被派生libbpf 会在每个相关的 syscall`BPF_MAP_CREATE``BPF_BTF_LOAD``BPF_PROG_LOAD``BPF_LINK_CREATE`)中传递它,不需要修改 BPF 应用的任何源代码。
## 编写 eBPF 程序
本教程的 BPF 侧故意保持最小,只有 loopback 上的一个 XDP 小程序。这样可以把注意力集中在 token 工作流上。以下是完整源码:
```c
// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
char LICENSE[] SEC("license") = "GPL";
struct token_stats {
__u64 packets;
__u32 last_ifindex;
};
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, struct token_stats);
} stats_map SEC(".maps");
SEC("xdp")
int handle_packet(struct xdp_md *ctx)
{
struct token_stats *stats;
__u32 key = 0;
stats = bpf_map_lookup_elem(&stats_map, &key);
if (!stats)
return 0;
stats->packets++;
stats->last_ifindex = ctx->ingress_ifindex;
return XDP_PASS;
}
```
有几个设计选择值得注意:
**`BPF_MAP_TYPE_ARRAY`** 被选中是因为委托策略显式允许了 `array` map。如果我们改用 hash map加载会失败因为 token 不授予 `hash` map 的创建权限。这正是 token 模型在起作用:即使是微小的程序改动也会被委托策略捕获。
**`SEC("xdp")`** 匹配 `delegate_progs=xdp` 策略。如果你把它改成 `SEC("kprobe/...")`,内核会在加载时返回 `EPERM` 拒绝,因为 kprobe 不在允许的程序类型中。
**`XDP_PASS`** 简单地放行每个包。这个程序的唯一目的是证明基于 token 的加载和挂载成功了。在生产环境中,你会用真正的包处理逻辑来替换它。
## 用户态加载器:基于 Token 的加载
`token_trace.c` 加载器是一个标准的 libbpf skeleton 程序,唯一的关键区别是它传递了 `bpf_token_path`
```c
struct bpf_object_open_opts open_opts = {};
open_opts.sz = sizeof(open_opts);
open_opts.bpf_token_path = env.token_path;
skel = token_trace_bpf__open_opts(&open_opts);
```
从这一刻开始libbpf 接管了一切。当它调用 `bpf(BPF_MAP_CREATE)` 创建 `stats_map` 时,会附带 token fd。当它调用 `bpf(BPF_PROG_LOAD)` 加载 XDP 程序时,附带 token fd。当它调用 `bpf(BPF_LINK_CREATE)` 挂载到接口时,同样附带 token fd。
加载器的其余部分是标准流程:
```c
err = token_trace_bpf__load(skel); // token 用于 map_create + prog_load
link = bpf_program__attach_xdp(skel->progs.handle_packet, ifindex); // token 用于 link_create
```
挂载完成后,加载器在发送测试数据包前后分别读取 map 值来验证程序执行了:
```c
err = bpf_map_lookup_elem(map_fd, &key, &before);
// ... 向 127.0.0.1 发送 UDP 包 ...
err = bpf_map_lookup_elem(map_fd, &key, &after);
printf("delta : %llu\n", after.packets - before.packets);
```
如果 delta 是 1说明 XDP 程序已经用委托的 capability 成功加载和挂载了。
## Namespace 编排器:`token_userns_demo`
由于 BPF token 要求非 init 的 user namespace在主机上直接运行 `token_trace -t /sys/fs/bpf` 是行不通的。`token_userns_demo.c` 封装器自动处理了复杂的 namespace 编排。以下是完整流程:
### 第一步Fork 并创建 Namespace
```
父进程 (root, init_user_ns) 子进程 (非特权, 新 userns)
│ │
│ fork() │
├────────────────────────────────────────>│
│ │
│ unshare(CLONE_NEWUSER)
│ unshare(CLONE_NEWNS | CLONE_NEWNET)
```
子进程创建新的 user namespace在其中把自己映射为 uid/gid 0、新的 mount namespace使 bpffs 挂载是私有的)和新的 network namespace使 `lo` 是一个全新的接口)。
### 第二步:创建 bpffs 并配置委托策略
```
父进程 (root, init_user_ns) 子进程 (新 userns)
│ │
│ fs_fd = fsopen("bpf", 0)
│ <───── 通过 SCM_RIGHTS 发送 fs_fd ──│
│ │
fsconfig(fs_fd, "delegate_cmds", ...) │ (等待确认)
fsconfig(fs_fd, "delegate_maps", "array") │
fsconfig(fs_fd, "delegate_progs", "xdp:...") │
fsconfig(fs_fd, "delegate_attachs", "any") │
fsconfig(fs_fd, FSCONFIG_CMD_CREATE) │
│ │
│ ───────── 发送确认 ─────────────────>│
```
子进程调用 `fsopen("bpf", 0)` 在自己的 user namespace 中创建一个 bpffs 文件系统上下文,然后通过 Unix socket`SCM_RIGHTS`)把文件描述符发给父进程。父进程以 root 身份运行在 init namespace 中,用 `fsconfig()` 配置委托策略,然后用 `FSCONFIG_CMD_CREATE` 实例化文件系统。
这个两步配合是必要的,因为:(a) bpffs 必须在子进程的 user namespace 中创建token 才能在那里有效),但 (b) 只有特权父进程才能设置委托选项(因为这些选项授予 BPF capability
### 第三步:挂载并加载
```
子进程 (新 userns)
mnt_fd = fsmount(fs_fd, 0, 0)
token_path = "/proc/self/fd/<mnt_fd>"
set_loopback_up()
exec("./token_trace", "-t", token_path, "-i", "lo")
```
子进程将 bpffs 实例化为一个分离的挂载(不需要挂载点,因为 `/proc/self/fd/<mnt_fd>` 提供了路径),在自己的 network namespace 中拉起 loopback 接口,然后 `exec` 执行 `token_trace` 并传入 bpffs 路径。从 `token_trace` 的角度看,它只是在用一个 token path 打开 BPF 对象,完全不知道也不关心 namespace 的设置过程。
## 手动准备 bpffs 挂载
如果你想在 demo 封装器之外试验 mount 语法,仓库里包含一个辅助脚本:
```bash
cd bpf-developer-tutorial/src/features/bpf_token
bash setup_token_bpffs.sh /tmp/bpf-token
```
它会在 `/tmp/bpf-token` 上用以下策略挂载 bpffs
```text
delegate_cmds=prog_load:map_create:btf_load:link_create
delegate_maps=array
delegate_progs=xdp:socket_filter
delegate_attachs=any
```
**为什么要 `socket_filter`** libbpf 在加载真正的 BPF 对象之前会做一次微小的 program-load probe 来检测内核特性支持。这个 probe 使用的是通用的 `BPF_PROG_TYPE_SOCKET_FILTER` 程序类型。如果委托策略中没有 `socket_filter`probe 会失败libbpf 拒绝继续。
**为什么要 `delegate_attachs=any`** 同样的 libbpf probe 路径还会触发内核 token 检查代码中的 attach-type 验证。使用 `any` 避免了为 probe 兼容性而逐一列举每个可能的 attach type。
注意:这样的主机 namespace 挂载对于检查委托策略很有用(例如配合 `bpftool token list`),但除非 `bpf(BPF_TOKEN_CREATE)` syscall 来自匹配的非 init user namespace否则不会产生可用的 token。
## 编译和运行
编译所有二进制文件:
```bash
cd bpf-developer-tutorial/src/features/bpf_token
make
```
运行端到端 demo
```bash
sudo ./token_userns_demo
```
预期输出:
```text
token path : /proc/self/fd/5
interface : lo (ifindex=1)
packets before : 0
packets after : 1
delta : 1
last ifindex : 1
```
`delta: 1` 确认 XDP 程序已使用 BPF token 成功加载和挂载,子进程中没有 `CAP_BPF``CAP_SYS_ADMIN`
`-v` 可以看到 libbpf 的详细输出,显示 token 的创建和使用过程:
```bash
sudo ./token_userns_demo -v
```
如果你自己已经管理好了在 user namespace 中的委托 bpffs可以直接运行加载器
```bash
./token_trace -t /proc/self/fd/<mnt-fd> -i lo
```
## 实际应用场景
虽然本教程使用了一个最小的 XDP 程序,但 BPF token 模式可以扩展到生产场景:
- **容器运行时**LXD、Docker、Kubernetes把带有特定 program 和 map 类型限制的委托 bpffs 挂载到容器中。LXD 已经通过 `security.delegate_bpf` 选项支持了这一点。
- **CI/CD 测试**:赋予构建任务加载和测试特定 eBPF 程序的能力,无需授予主机级 capability。委托策略充当 BPF 操作的白名单。
- **多租户 BPF 平台**:平台守护进程为每个租户创建不同委托策略的 bpffs 挂载。一个租户可能被允许使用 XDP + array map另一个可能获得 tracepoint + ringbuf 访问权限。
- **LSM 集成**:由于 BPF token 和 Linux Security Module 集成,你可以将 token 委托和 SELinux 或 AppArmor 策略结合实现纵深防御。每个 token 获得自己的安全上下文LSM hook 可以对其进行检查。
## 总结
本教程介绍了 BPF token 如何为 eBPF 权限提供一种超越 Linux capability "全有或全无"二元模型的委托机制。我们完整走过了整个流程:特权父进程用特定委托选项配置 bpffs 实例user namespace 中的非特权子进程从该 bpffs 派生 tokenlibbpf 透明地使用 token 进行 map 创建、程序加载和挂载。最终结果是一个最小的 XDP 程序在非特权上下文中运行,这在 Linux 6.9 之前是不可能的。
BPF token 不是一个冷门功能。它代表了内核对 eBPF 生态系统中一个基本问题的回答:**在多租户环境中,如何安全地共享 BPF 能力,而不授予对 BPF 子系统的无约束访问?**
如果你想了解更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 <https://github.com/eunomia-bpf/bpf-developer-tutorial> 或网站 <https://eunomia.dev/tutorials/> 获取更多示例和完整教程。
## 参考
- [BPF Token 概念文档](https://docs.ebpf.io/linux/concepts/token/)
- [BPF token 内核补丁系列Andrii Nakryiko](https://lore.kernel.org/bpf/20240103222034.2582628-1-andrii@kernel.org/T/)
- [BPF token LWN 文章](https://lwn.net/Articles/959350/)
- [更细粒度的 BPF token LWN 讨论](https://lwn.net/Articles/947173/)
- [使用 BPF Token 进行权限委托LXD 文档)](https://documentation.ubuntu.com/lxd/latest/explanation/bpf/)
- [bpf_token_create() libbpf API](https://docs.ebpf.io/ebpf-library/libbpf/userspace/bpf_token_create/)
- <https://docs.kernel.org/bpf/>


@@ -0,0 +1,17 @@
#!/usr/bin/env bash
set -euo pipefail
MOUNTPOINT="${1:-/tmp/bpf-token}"
OPTIONS="delegate_cmds=prog_load:map_create:btf_load:link_create,delegate_maps=array,delegate_progs=xdp:socket_filter,delegate_attachs=any"
mkdir -p "${MOUNTPOINT}"
if mountpoint -q "${MOUNTPOINT}"; then
echo "bpffs is already mounted at ${MOUNTPOINT}"
exit 0
fi
mount -t bpf bpf "${MOUNTPOINT}" -o "${OPTIONS}"
echo "Mounted delegated bpffs at ${MOUNTPOINT}"
echo "Note: a bpffs mount in init_user_ns is useful for inspection, but token creation itself must happen from the same non-init user namespace as the bpffs instance."
grep " ${MOUNTPOINT} " /proc/mounts || true


@@ -0,0 +1,32 @@
// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
char LICENSE[] SEC("license") = "GPL";
struct token_stats {
__u64 packets;
__u32 last_ifindex;
};
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, struct token_stats);
} stats_map SEC(".maps");
SEC("xdp")
int handle_packet(struct xdp_md *ctx)
{
struct token_stats *stats;
__u32 key = 0;
stats = bpf_map_lookup_elem(&stats_map, &key);
	if (!stats)
		return XDP_PASS; /* array lookup cannot fail, but never abort the packet */
stats->packets++;
stats->last_ifindex = ctx->ingress_ifindex;
return XDP_PASS;
}


@@ -0,0 +1,193 @@
// SPDX-License-Identifier: GPL-2.0
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <net/if.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "token_trace.skel.h"
struct token_stats {
__u64 packets;
__u32 last_ifindex;
};
static struct env {
const char *token_path;
const char *ifname;
bool verbose;
bool no_trigger;
} env = {
.ifname = "lo",
};
static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
if (level == LIBBPF_DEBUG && !env.verbose)
return 0;
return vfprintf(stderr, format, args);
}
static void usage(const char *prog)
{
fprintf(stderr,
"Usage: %s [-t TOKEN_BPFFS] [-i IFACE] [-v] [-n]\n"
" -t TOKEN_BPFFS delegated bpffs mount used to derive a BPF token\n"
" -i IFACE interface to attach XDP program to (default: lo)\n"
" -v enable libbpf debug logs\n"
" -n do not generate loopback traffic automatically\n",
prog);
}
static int parse_args(int argc, char **argv)
{
int opt;
while ((opt = getopt(argc, argv, "t:i:vn")) != -1) {
switch (opt) {
case 't':
env.token_path = optarg;
break;
case 'i':
env.ifname = optarg;
break;
case 'v':
env.verbose = true;
break;
case 'n':
env.no_trigger = true;
break;
default:
return -EINVAL;
}
}
return 0;
}
static int generate_loopback_traffic(void)
{
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(9),
};
const char payload[] = "bpf token xdp demo";
int fd, err = 0;
if (inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr) != 1)
return -EINVAL;
fd = socket(AF_INET, SOCK_DGRAM, 0);
if (fd < 0)
return -errno;
if (sendto(fd, payload, sizeof(payload), 0,
(struct sockaddr *)&addr, sizeof(addr)) < 0)
err = -errno;
close(fd);
return err;
}
int main(int argc, char **argv)
{
struct token_trace_bpf *skel = NULL;
struct bpf_object_open_opts open_opts = {};
struct token_stats before = {}, after = {};
struct bpf_link *link = NULL;
__u32 key = 0;
int ifindex, map_fd;
int err = 0;
err = parse_args(argc, argv);
if (err) {
usage(argv[0]);
return 1;
}
libbpf_set_print(libbpf_print_fn);
libbpf_set_memlock_rlim(0);
ifindex = if_nametoindex(env.ifname);
if (!ifindex) {
fprintf(stderr, "unknown interface '%s'\n", env.ifname);
return 1;
}
open_opts.sz = sizeof(open_opts);
open_opts.bpf_token_path = env.token_path;
skel = token_trace_bpf__open_opts(&open_opts);
if (!skel) {
fprintf(stderr, "failed to open token_trace skeleton\n");
return 1;
}
err = token_trace_bpf__load(skel);
if (err) {
fprintf(stderr,
"failed to load BPF program: %s\n"
"hint: if you intended to use a delegated token, pass -t <bpffs-path>\n",
strerror(-err));
goto cleanup;
}
link = bpf_program__attach_xdp(skel->progs.handle_packet, ifindex);
err = libbpf_get_error(link);
if (err) {
link = NULL;
fprintf(stderr, "failed to attach XDP program: %s\n", strerror(-err));
goto cleanup;
}
map_fd = bpf_map__fd(skel->maps.stats_map);
err = bpf_map_lookup_elem(map_fd, &key, &before);
if (err) {
err = -errno;
fprintf(stderr, "failed to read stats before traffic: %s\n",
strerror(errno));
goto cleanup;
}
if (!env.no_trigger && strcmp(env.ifname, "lo") == 0) {
err = generate_loopback_traffic();
if (err) {
fprintf(stderr, "failed to generate loopback traffic: %s\n",
strerror(-err));
goto cleanup;
}
usleep(100000);
} else if (!env.no_trigger) {
printf("Generate traffic on %s and re-run with -n if you only want attach/query.\n",
env.ifname);
}
err = bpf_map_lookup_elem(map_fd, &key, &after);
if (err) {
err = -errno;
fprintf(stderr, "failed to read stats after traffic: %s\n",
strerror(errno));
goto cleanup;
}
printf("token path : %s\n",
env.token_path ? env.token_path :
"(none, libbpf may use LIBBPF_BPF_TOKEN_PATH or /sys/fs/bpf)");
printf("interface : %s (ifindex=%d)\n", env.ifname, ifindex);
printf("packets before : %llu\n", (unsigned long long)before.packets);
printf("packets after : %llu\n", (unsigned long long)after.packets);
printf("delta : %llu\n",
(unsigned long long)(after.packets - before.packets));
printf("last ifindex : %u\n", after.last_ifindex);
cleanup:
bpf_link__destroy(link);
token_trace_bpf__destroy(skel);
return err != 0;
}


@@ -0,0 +1,452 @@
// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/mount.h>
#include <net/if.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
static struct env {
bool verbose;
bool no_trigger;
} env;
static void usage(const char *prog)
{
fprintf(stderr,
"Usage: %s [-v] [-n]\n"
" -v enable verbose token_trace logs\n"
" -n do not generate loopback traffic automatically\n",
prog);
}
static int parse_args(int argc, char **argv)
{
int opt;
while ((opt = getopt(argc, argv, "vn")) != -1) {
switch (opt) {
case 'v':
env.verbose = true;
break;
case 'n':
env.no_trigger = true;
break;
default:
return -EINVAL;
}
}
return 0;
}
static inline int sys_fsopen(const char *fsname, unsigned flags)
{
return syscall(__NR_fsopen, fsname, flags);
}
static inline int sys_fsconfig(int fs_fd, unsigned cmd, const char *key,
const void *val, int aux)
{
return syscall(__NR_fsconfig, fs_fd, cmd, key, val, aux);
}
static inline int sys_fsmount(int fs_fd, unsigned flags, unsigned ms_flags)
{
return syscall(__NR_fsmount, fs_fd, flags, ms_flags);
}
static ssize_t write_nointr(int fd, const void *buf, size_t count)
{
ssize_t ret;
do {
ret = write(fd, buf, count);
} while (ret < 0 && errno == EINTR);
return ret;
}
static int write_file(const char *path, const void *buf, size_t count)
{
int fd;
ssize_t ret;
fd = open(path, O_WRONLY | O_CLOEXEC | O_NOCTTY);
if (fd < 0)
return -errno;
ret = write_nointr(fd, buf, count);
close(fd);
if (ret < 0)
return -errno;
if ((size_t)ret != count)
return -EIO;
return 0;
}
static int sendfd(int sockfd, int fd)
{
struct msghdr msg = {};
struct cmsghdr *cmsg;
int fds[1] = { fd };
char iobuf[1] = { 0 };
struct iovec io = {
.iov_base = iobuf,
.iov_len = sizeof(iobuf),
};
union {
char buf[CMSG_SPACE(sizeof(fds))];
struct cmsghdr align;
} u = {};
ssize_t ret;
msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_control = u.buf;
msg.msg_controllen = sizeof(u.buf);
cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(fds));
memcpy(CMSG_DATA(cmsg), fds, sizeof(fds));
ret = sendmsg(sockfd, &msg, 0);
if (ret < 0)
return -errno;
if (ret != 1)
return -EIO;
return 0;
}
static int recvfd(int sockfd, int *fd)
{
struct msghdr msg = {};
struct cmsghdr *cmsg;
int fds[1];
char iobuf[1];
struct iovec io = {
.iov_base = iobuf,
.iov_len = sizeof(iobuf),
};
union {
char buf[CMSG_SPACE(sizeof(fds))];
struct cmsghdr align;
} u = {};
ssize_t ret;
msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_control = u.buf;
msg.msg_controllen = sizeof(u.buf);
ret = recvmsg(sockfd, &msg, 0);
if (ret < 0)
return -errno;
if (ret != 1)
return -EIO;
cmsg = CMSG_FIRSTHDR(&msg);
if (!cmsg)
return -EINVAL;
if (cmsg->cmsg_len != CMSG_LEN(sizeof(fds)))
return -EINVAL;
if (cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
return -EINVAL;
memcpy(fds, CMSG_DATA(cmsg), sizeof(fds));
*fd = fds[0];
return 0;
}
static int create_and_enter_userns(void)
{
uid_t uid = getuid();
gid_t gid = getgid();
char map[64];
int err;
if (unshare(CLONE_NEWUSER))
return -errno;
err = write_file("/proc/self/setgroups", "deny", sizeof("deny") - 1);
if (err && err != -ENOENT)
return err;
snprintf(map, sizeof(map), "0 %d 1", uid);
err = write_file("/proc/self/uid_map", map, strlen(map));
if (err)
return err;
snprintf(map, sizeof(map), "0 %d 1", gid);
err = write_file("/proc/self/gid_map", map, strlen(map));
if (err)
return err;
if (setgid(0))
return -errno;
if (setuid(0))
return -errno;
return 0;
}
static int set_delegate_mask(int fs_fd, const char *key, const char *mask_str)
{
int err;
err = sys_fsconfig(fs_fd, FSCONFIG_SET_STRING, key, mask_str, 0);
if (err < 0)
return -errno;
return 0;
}
static int set_loopback_up(void)
{
struct ifreq ifr = {};
int fd;
fd = socket(AF_INET, SOCK_DGRAM, 0);
if (fd < 0)
return -errno;
snprintf(ifr.ifr_name, sizeof(ifr.ifr_name), "lo");
if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) {
close(fd);
return -errno;
}
ifr.ifr_flags |= IFF_UP | IFF_RUNNING;
if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) {
close(fd);
return -errno;
}
close(fd);
return 0;
}
static void raise_memlock_limit(void)
{
struct rlimit rlim = {
.rlim_cur = RLIM_INFINITY,
.rlim_max = RLIM_INFINITY,
};
if (setrlimit(RLIMIT_MEMLOCK, &rlim))
fprintf(stderr, "warning: failed to raise RLIMIT_MEMLOCK: %s\n",
strerror(errno));
}
static int child_main(int sockfd)
{
char ack;
char token_path[64];
int err, fs_fd = -1, mnt_fd = -1;
err = create_and_enter_userns();
if (err) {
fprintf(stderr, "failed to create user namespace: %s\n",
strerror(-err));
return 1;
}
if (unshare(CLONE_NEWNS | CLONE_NEWNET)) {
err = -errno;
fprintf(stderr, "failed to create mount/net namespace: %s\n",
strerror(errno));
return 1;
}
if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) {
err = -errno;
fprintf(stderr, "failed to remount / private: %s\n", strerror(errno));
return 1;
}
err = set_loopback_up();
if (err) {
fprintf(stderr, "failed to bring loopback up: %s\n",
strerror(-err));
return 1;
}
fs_fd = sys_fsopen("bpf", 0);
if (fs_fd < 0) {
err = -errno;
fprintf(stderr, "fsopen(\"bpf\") failed: %s\n", strerror(errno));
return 1;
}
err = sendfd(sockfd, fs_fd);
if (err) {
fprintf(stderr, "failed to send bpffs fs_fd: %s\n", strerror(-err));
goto out;
}
if (read(sockfd, &ack, 1) != 1) {
fprintf(stderr, "failed to receive parent ack\n");
err = -EIO;
goto out;
}
mnt_fd = sys_fsmount(fs_fd, 0, 0);
if (mnt_fd < 0) {
err = -errno;
fprintf(stderr, "fsmount() failed: %s\n", strerror(errno));
goto out;
}
snprintf(token_path, sizeof(token_path), "/proc/self/fd/%d", mnt_fd);
{
const char *argv[10];
int argc = 0;
argv[argc++] = "./token_trace";
if (env.verbose)
argv[argc++] = "-v";
if (env.no_trigger)
argv[argc++] = "-n";
argv[argc++] = "-t";
argv[argc++] = token_path;
argv[argc++] = "-i";
argv[argc++] = "lo";
argv[argc] = NULL;
execv("./token_trace", (char *const *)argv);
}
err = -errno;
fprintf(stderr, "failed to exec ./token_trace: %s\n", strerror(errno));
out:
if (mnt_fd >= 0)
close(mnt_fd);
if (fs_fd >= 0)
close(fs_fd);
return 1;
}
int main(int argc, char **argv)
{
static const char *delegate_cmds =
"prog_load:map_create:btf_load:link_create";
int err, socks[2] = { -1, -1 }, fs_fd = -1, status;
pid_t pid;
char ack = 1;
err = parse_args(argc, argv);
if (err) {
usage(argv[0]);
return 1;
}
if (geteuid() != 0) {
fprintf(stderr, "run this demo with sudo/root so the parent can configure delegated bpffs\n");
return 1;
}
if (access("./token_trace", X_OK) != 0) {
fprintf(stderr, "missing ./token_trace, run 'make' in this directory first\n");
return 1;
}
raise_memlock_limit();
if (socketpair(AF_UNIX, SOCK_SEQPACKET | SOCK_CLOEXEC, 0, socks)) {
fprintf(stderr, "socketpair failed: %s\n", strerror(errno));
return 1;
}
pid = fork();
if (pid < 0) {
fprintf(stderr, "fork failed: %s\n", strerror(errno));
return 1;
}
if (pid == 0) {
close(socks[0]);
return child_main(socks[1]);
}
close(socks[1]);
err = recvfd(socks[0], &fs_fd);
if (err) {
fprintf(stderr, "failed to receive bpffs fs_fd: %s\n", strerror(-err));
goto out;
}
err = set_delegate_mask(fs_fd, "delegate_cmds", delegate_cmds);
if (err) {
fprintf(stderr, "failed to set delegate_cmds: %s\n", strerror(-err));
goto out;
}
err = set_delegate_mask(fs_fd, "delegate_maps", "array");
if (err) {
fprintf(stderr, "failed to set delegate_maps: %s\n", strerror(-err));
goto out;
}
err = set_delegate_mask(fs_fd, "delegate_progs", "xdp:socket_filter");
if (err) {
fprintf(stderr, "failed to set delegate_progs: %s\n", strerror(-err));
goto out;
}
err = set_delegate_mask(fs_fd, "delegate_attachs", "any");
if (err) {
fprintf(stderr, "failed to set delegate_attachs: %s\n", strerror(-err));
goto out;
}
if (sys_fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
err = -errno;
fprintf(stderr, "failed to materialize bpffs: %s\n", strerror(errno));
goto out;
}
if (write(socks[0], &ack, 1) != 1) {
err = -errno;
fprintf(stderr, "failed to send parent ack: %s\n", strerror(errno));
goto out;
}
err = 0;
out:
if (fs_fd >= 0)
close(fs_fd);
close(socks[0]);
if (waitpid(pid, &status, 0) < 0) {
fprintf(stderr, "waitpid failed: %s\n", strerror(errno));
return 1;
}
if (err)
return 1;
if (WIFEXITED(status))
return WEXITSTATUS(status);
if (WIFSIGNALED(status)) {
fprintf(stderr, "child terminated by signal %d\n", WTERMSIG(status));
return 1;
}
return 1;
}