From 34f690385ea474e5b2ef93e891b897bcce81966e Mon Sep 17 00:00:00 2001
From: Littlefisher
Date: Sun, 11 Jan 2026 23:27:20 -0800
Subject: [PATCH] docs: Update README files to include information on extending GPU driver behavior with eBPF and the gpu_ext project

---
 src/47-cuda-events/README.md           | 2 ++
 src/47-cuda-events/README.zh.md        | 2 ++
 src/features/struct_ops/README.md      | 4 +++-
 src/features/struct_ops/README.zh.md   | 4 +++-
 src/xpu/gpu-kernel-driver/README.md    | 2 +-
 src/xpu/gpu-kernel-driver/README.zh.md | 2 ++
 6 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/src/47-cuda-events/README.md b/src/47-cuda-events/README.md
index e259f30..846e125 100644
--- a/src/47-cuda-events/README.md
+++ b/src/47-cuda-events/README.md
@@ -477,6 +477,8 @@ The `cuda_events` tool supports these options:
 
 ## Next Steps
 
+Beyond tracing, eBPF can also extend GPU driver behavior—see our [gpu_ext project](https://github.com/eunomia-bpf/gpu_ext) for GPU scheduling and memory offloading via BPF struct_ops ([LPC 2024 talk](https://lpc.events/event/19/contributions/2168/)).
+
 Once you're comfortable with this basic CUDA tracing tool, you could extend it to:
 
 1. Add support for more CUDA API functions
diff --git a/src/47-cuda-events/README.zh.md b/src/47-cuda-events/README.zh.md
index 0a6fe80..6274806 100644
--- a/src/47-cuda-events/README.zh.md
+++ b/src/47-cuda-events/README.zh.md
@@ -483,6 +483,8 @@ cudaFree: 0.00 µs
 
 ## 下一步
 
+除了追踪,eBPF 还可以扩展 GPU 驱动行为——参见我们的 [gpu_ext 项目](https://github.com/eunomia-bpf/gpu_ext),通过 BPF struct_ops 实现 GPU 调度和内存卸载([LPC 2024 演讲](https://lpc.events/event/19/contributions/2168/))。
+
 一旦你熟悉了这个基本的CUDA追踪工具,你可以扩展它来:
 
 1. 添加对更多CUDA API函数的支持
diff --git a/src/features/struct_ops/README.md b/src/features/struct_ops/README.md
index e249eb5..c834b6f 100644
--- a/src/features/struct_ops/README.md
+++ b/src/features/struct_ops/README.md
@@ -4,7 +4,7 @@ Have you ever wanted to implement a kernel feature, like a new network protocol
 
 This is the power of **BPF struct_ops**. This advanced eBPF feature allows BPF programs to implement the callbacks for a kernel structure of operations, effectively allowing you to "plug in" BPF code to act as a kernel subsystem. It's a step beyond simple tracing or filtering; it's about implementing core kernel logic in BPF. For instance, we also use it to implement GPU scheduling and memory offloading extensions with eBPF in GPU drivers (see [LPC 2024 talk](https://lpc.events/event/19/contributions/2168/) and the [gpu_ext project](https://github.com/eunomia-bpf/gpu_ext)).
 
-In this tutorial, we will explore how to use `struct_ops` to dynamically implement a kernel subsystem's functionality. We won't be using the common TCP congestion control example. Instead, we'll take a more fundamental approach that mirrors the extensibility seen with kfuncs. We will create a custom kernel module that defines a new, simple subsystem with a set of operations. This module will act as a placeholder, creating new attachment points for our BPF programs. Then, we will write a BPF program to implement the logic for these operations. This demonstrates a powerful pattern: using a minimal kernel module to expose a `struct_ops` interface, and then using BPF to provide the full, complex implementation.
+In this tutorial, we will explore how to use `struct_ops` to dynamically extend kernel subsystem behavior. We won't be using the common TCP congestion control example. Instead, we'll take a more fundamental approach that mirrors the extensibility seen with kfuncs. We will create a custom kernel module that defines a new, simple subsystem with a set of operations. This module will act as a placeholder, creating new attachment points for our BPF programs. Then, we will write a BPF program to implement the logic for these operations. This demonstrates a powerful pattern: using a minimal kernel module to expose a `struct_ops` interface, and then using BPF to provide the full, complex implementation.
 
 > The complete source code for this tutorial can be found here:
 
@@ -13,6 +13,7 @@ In this tutorial, we will explore how to use `struct_ops` to dynamically impleme
 ### The Challenge: Extending Kernel Behavior Safely and Dynamically
 
 Traditionally, adding new functionality to the Linux kernel, such as a new file system, a network protocol, or a scheduler algorithm, requires writing a kernel module. While powerful, kernel modules come with significant challenges:
+
 - **Complexity:** Kernel development has a steep learning curve and requires a deep understanding of kernel internals.
 - **Safety:** A bug in a kernel module can easily crash the entire system. There are no sandboxing guarantees.
 - **Maintenance:** Kernel modules must be maintained and recompiled for different kernel versions, creating a tight coupling with the kernel's internal APIs.
@@ -28,6 +29,7 @@ This is a paradigm shift. It's no longer just about observing or filtering; it's
 This approach is similar in spirit to how **kfuncs** allow developers to extend the capabilities of BPF. With kfuncs, we can add custom helper functions to the BPF runtime by defining them in a kernel module. With `struct_ops`, we take this a step further: we define a whole new *set of attach points* for BPF programs, effectively creating a custom, BPF-programmable subsystem within the kernel.
 
 The benefits are immense:
+
 - **Dynamic Implementation**: You can load, update, and unload the BPF programs implementing the subsystem logic on the fly, without restarting the kernel or the application.
 - **Safety**: The BPF verifier ensures that the BPF programs are safe to run, preventing common pitfalls like infinite loops, out-of-bounds memory access, and system crashes.
 - **Flexibility**: The logic is in the BPF program, which can be developed and updated independently of the kernel module that defines the `struct_ops` interface.
diff --git a/src/features/struct_ops/README.zh.md b/src/features/struct_ops/README.zh.md
index 9587696..51be5f8 100644
--- a/src/features/struct_ops/README.zh.md
+++ b/src/features/struct_ops/README.zh.md
@@ -4,7 +4,7 @@
 
 这就是 **BPF struct_ops** 的强大之处。这个先进的 eBPF 功能允许 BPF 程序实现内核操作结构的回调函数,实际上是让你能够“插入”BPF 代码来充当一个内核子系统。这已经超越了简单的跟踪或过滤;这是关于在 BPF 中实现核心的内核逻辑。例如,我们还使用它在 GPU 驱动中通过 eBPF 实现 GPU 调度和内存卸载扩展(请参阅 [LPC 2024 演讲](https://lpc.events/event/19/contributions/2168/) 和 [gpu_ext 项目](https://github.com/eunomia-bpf/gpu_ext))。
 
-在本教程中,我们将探讨如何使用 `struct_ops` 来动态地实现一个内核子系统的功能。我们不会使用常见的 TCP 拥塞控制示例。相反,我们将采用一种更基础的方法,这种方法反映了与 kfuncs 相似的可扩展性。我们将创建一个自定义的内核模块,该模块定义了一组新的、简单的操作。这个模块将充当一个占位符,为我们的 BPF 程序创建新的附加点。然后,我们将编写一个 BPF 程序来实现这些操作的逻辑。这演示了一种强大的模式:使用一个最小化的内核模块来暴露一个 `struct_ops` 接口,然后使用 BPF 来提供完整、复杂的实现。
+在本教程中,我们将探讨如何使用 `struct_ops` 来动态地扩展内核子系统的行为。我们不会使用常见的 TCP 拥塞控制示例。相反,我们将采用一种更基础的方法,这种方法反映了与 kfuncs 相似的可扩展性。我们将创建一个自定义的内核模块,该模块定义了一组新的、简单的操作。这个模块将充当一个占位符,为我们的 BPF 程序创建新的附加点。然后,我们将编写一个 BPF 程序来实现这些操作的逻辑。这演示了一种强大的模式:使用一个最小化的内核模块来暴露一个 `struct_ops` 接口,然后使用 BPF 来提供完整、复杂的实现。
 
 > 本教程的完整源代码可以在这里找到:
 
@@ -13,6 +13,7 @@
 ### 挑战:安全、动态地扩展内核行为
 
 传统上,向 Linux 内核添加新功能,例如新的文件系统、网络协议或调度器算法,都需要编写内核模块。虽然功能强大,但内核模块也带来了重大的挑战:
+
 - **复杂性:** 内核开发具有陡峭的学习曲线,需要对内核内部有深入的了解。
 - **安全性:** 内核模块中的一个错误很容易导致整个系统崩溃。没有沙箱保障。
 - **维护性:** 内核模块必须针对不同的内核版本进行维护和重新编译,这与内核的内部 API 产生了紧密的耦合。
@@ -28,6 +29,7 @@ BPF `struct_ops` 填补了这一空白。它允许 BPF 程序实现 `struct_ops`
 这种方法的精神与 **kfuncs** 允许开发者扩展 BPF 功能的方式相似。通过 kfuncs,我们可以在内核模块中定义它们,从而向 BPF 运行时添加自定义的辅助函数。通过 `struct_ops`,我们更进一步:我们为 BPF 程序定义了一整套全新的*附加点*,有效地在内核中创建了一个自定义的、可通过 BPF 编程的子系统。
 
 其好处是巨大的:
+
 - **动态实现**:你可以在不重启内核或应用程序的情况下,动态加载、更新和卸载实现子系统逻辑的 BPF 程序。
 - **安全性**:BPF 验证器确保 BPF 程序的运行是安全的,防止了诸如无限循环、越界内存访问和系统崩溃等常见陷阱。
 - **灵活性**:逻辑位于 BPF 程序中,可以独立于定义 `struct_ops` 接口的内核模块进行开发和更新。
diff --git a/src/xpu/gpu-kernel-driver/README.md b/src/xpu/gpu-kernel-driver/README.md
index b2bcf0d..32c44ca 100644
--- a/src/xpu/gpu-kernel-driver/README.md
+++ b/src/xpu/gpu-kernel-driver/README.md
@@ -394,7 +394,7 @@ sudo cat /sys/kernel/debug/tracing/available_events | grep -E '(gpu_scheduler|i9
 
 This tutorial focuses on kernel-side GPU driver tracing, which provides visibility into job scheduling, memory management, and driver-firmware communication. However, kernel tracepoints have fundamental limitations. When `drm_run_job` fires, we know a job started executing on GPU hardware, but we cannot observe what happens inside the GPU itself. The execution of thousands of parallel threads, their memory access patterns, branch divergence, warp occupancy, and instruction-level behavior remain invisible. These details are critical for understanding performance bottlenecks - whether memory coalescing is failing, whether thread divergence is killing efficiency, or whether shared memory bank conflicts are stalling execution.
 
-To achieve fine-grained GPU observability, eBPF programs must run directly on the GPU. This is the direction explored by the eGPU paper and [bpftime GPU examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu). bpftime converts eBPF bytecode to PTX instructions that GPUs can execute, then dynamically patches CUDA binaries at runtime to inject these eBPF programs at kernel entry/exit points. This enables observing GPU-specific information like block indices, thread indices, global timers, and warp-level metrics. Developers can instrument critical paths inside GPU kernels to measure execution behavior and diagnose complex performance issues that kernel-side tracing cannot reach. This GPU-internal observability complements kernel tracepoints - together they provide end-to-end visibility from API calls through kernel drivers to GPU execution.
+To achieve fine-grained GPU observability, eBPF programs must run directly on the GPU. This is the direction explored by the eGPU paper and [bpftime GPU examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu). bpftime converts eBPF bytecode to PTX instructions that GPUs can execute, then dynamically patches CUDA binaries at runtime to inject these eBPF programs at kernel entry/exit points. This enables observing GPU-specific information like block indices, thread indices, global timers, and warp-level metrics. Developers can instrument critical paths inside GPU kernels to measure execution behavior and diagnose complex performance issues that kernel-side tracing cannot reach. This GPU-internal observability complements kernel tracepoints - together they provide end-to-end visibility from API calls through kernel drivers to GPU execution. Beyond tracing, eBPF can also extend GPU driver behavior—see our [gpu_ext project](https://github.com/eunomia-bpf/gpu_ext) for GPU scheduling and memory offloading via BPF struct_ops ([LPC 2024 talk](https://lpc.events/event/19/contributions/2168/)).
 
 ## Summary
 
diff --git a/src/xpu/gpu-kernel-driver/README.zh.md b/src/xpu/gpu-kernel-driver/README.zh.md
index cb4c60b..5cb86b5 100644
--- a/src/xpu/gpu-kernel-driver/README.zh.md
+++ b/src/xpu/gpu-kernel-driver/README.zh.md
@@ -295,6 +295,8 @@ sudo cat /sys/kernel/debug/tracing/available_events | grep -E '(gpu_scheduler|i9
 
 要实现细粒度的 GPU 可观测性,eBPF 程序必须直接在 GPU 上运行。这正是 eGPU 论文和 [bpftime GPU 示例](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu)所探索的方向。bpftime 将 eBPF 字节码转换为 GPU 可以执行的 PTX 指令,然后在运行时动态修补 CUDA 二进制文件,将这些 eBPF 程序注入到内核入口/出口点。这使得开发者可以观察 GPU 特有的信息,如块索引、线程索引、全局计时器和 warp 级指标。开发者可以在 GPU 内核的关键路径上进行插桩,测量执行行为并诊断内核侧追踪无法触及的复杂性能问题。这种 GPU 内部的可观测性与内核跟踪点互补 - 它们一起提供了从 API 调用通过内核驱动到 GPU 执行的端到端可见性。
 
+除了追踪,eBPF 还可以扩展 GPU 驱动行为——参见我们的 [gpu_ext 项目](https://github.com/eunomia-bpf/gpu_ext),通过 BPF struct_ops 实现 GPU 调度和内存卸载([LPC 2024 演讲](https://lpc.events/event/19/contributions/2168/))。
+
 ## 总结
 
 GPU 内核跟踪点提供零开销的驱动内部可见性。DRM 调度器的稳定 uAPI 跟踪点跨所有供应商工作,适合生产监控。供应商特定跟踪点暴露详细的内存管理和命令提交管道。bpftrace 脚本演示了跟踪作业调度、测量延迟和识别依赖停顿 - 所有这些对于诊断游戏、ML 训练和云 GPU 工作负载中的性能问题都至关重要。对于超越内核追踪的 GPU 内部可观测性,请探索 bpftime 的 GPU eBPF 能力。
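
For readers following the struct_ops pattern cross-referenced above (a minimal kernel module exposes an ops table, BPF supplies the callbacks), a rough BPF-side sketch using standard libbpf conventions is shown below. It is not part of the diff: `struct hello_ops`, its `hello` member, and the `bpf_hello`/`demo_ops` names are hypothetical stand-ins for whatever the placeholder module in the tutorial actually exports through its BTF; only `SEC("struct_ops/...")`, `SEC(".struct_ops.link")`, and `bpf_map__attach_struct_ops()` are real libbpf usage.

```c
// Illustrative sketch only: "hello_ops" and its "hello" callback are
// hypothetical names standing in for the operations a placeholder kernel
// module would define and expose via BTF (picked up through vmlinux.h
// generated on a system with that module loaded).
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* BPF implementation of the module's ->hello() operation. */
SEC("struct_ops/hello")
int BPF_PROG(bpf_hello, int value)
{
	bpf_printk("hello_ops: invoked with value %d", value);
	return value + 1; /* arbitrary demo logic */
}

/* libbpf turns this into a BPF_MAP_TYPE_STRUCT_OPS map; the .link
 * variant lets user space attach it with bpf_map__attach_struct_ops(). */
SEC(".struct_ops.link")
struct hello_ops demo_ops = {
	.hello = (void *)bpf_hello,
};
```

On the user-space side, a skeleton-based loader would open and load this object, then call `bpf_map__attach_struct_ops()` on the `demo_ops` map to plug the BPF callback into the module-defined interface.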