Add eBPF tutorial and monitoring scripts for GPU activity

- Introduced a comprehensive README.md detailing the use of eBPF for monitoring GPU activity through kernel tracepoints.
- Added bpftrace scripts for monitoring AMD GPU operations, including buffer object creation, command submission, and interrupts.
- Created a bpftrace script for tracking DRM GPU scheduler activity across all modern GPU drivers.
- Developed a bpftrace script to monitor display vertical blanking events for frame timing analysis.
- Implemented a bpftrace script for Intel i915 GPU activity, focusing on GEM object management, memory operations, and page faults.
yunwei37
2025-10-05 00:26:31 -07:00
parent 0e19d48331
commit 6042594b8c
10 changed files with 1489 additions and 42 deletions


@@ -95,3 +95,19 @@ jobs:
      - name: test 45
        run: |
          make -C src/45-scx-nest
      - name: test 46 xdp-pktgen
        run: |
          make -C src/46-xdp-test
      - name: test features bpf_arena
        run: |
          make -C src/features/bpf_arena
      - name: test features bpf_iters
        run: |
          make -C src/features/bpf_iters
      - name: test features bpf_wq
        run: |
          make -C src/features/bpf_wq


@@ -1,44 +1,281 @@
# eBPF Tutorial by Example: Building a High-Performance XDP Packet Generator
Need to stress-test your network stack or measure XDP program performance? Traditional packet generators like `pktgen` require kernel modules or run in userspace with high overhead. There's a better way - XDP's BPF_PROG_RUN feature lets you inject packets directly into the kernel's fast path at millions of packets per second, all from userspace without loading network drivers.
In this tutorial, we'll build an XDP-based packet generator that leverages the kernel's BPF_PROG_RUN test infrastructure. We'll explore how XDP's `XDP_TX` action creates a packet reflection loop, understand the live frames mode that enables real packet injection, and measure the performance characteristics of XDP programs under load. By the end, you'll have a production-ready tool for network testing and XDP benchmarking.
## Understanding XDP Packet Generation
XDP (eXpress Data Path) provides the fastest programmable packet processing in Linux by hooking into network drivers before the kernel's networking stack allocates socket buffers. Normally, XDP programs process packets arriving from network interfaces. But what if you want to test an XDP program's performance without real network traffic? Or inject synthetic packets to stress-test your network infrastructure?
### The BPF_PROG_RUN Testing Interface
The kernel exposes `bpf_prog_test_run()` (BPF_PROG_RUN) as a testing mechanism for BPF programs. Originally designed for unit testing, this syscall lets userspace invoke a BPF program with synthetic input and capture its output. For XDP programs, you provide a packet buffer and an `xdp_md` context describing the packet metadata (interface index, RX queue). The kernel runs your XDP program and returns the action code (XDP_DROP, XDP_PASS, XDP_TX, etc.) along with any packet modifications.
Traditional BPF_PROG_RUN operates in "dry run" mode - packets are processed but never actually transmitted. The XDP program runs, modifies packet data, returns an action, but nothing hits the wire. This is perfect for testing packet parsing logic or measuring program execution time in isolation.
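To make the dry-run mode concrete, here's a minimal sketch of such a test invocation using libbpf (the `dry_run_xdp` helper and its zeroed 64-byte buffer are illustrative, not part of the tutorial's source):

```c
#include <bpf/bpf.h>
#include <stdio.h>

/* Dry-run sketch: invoke an already-loaded XDP program once over a
 * synthetic buffer. Nothing is transmitted - the kernel just reports
 * the returned action code and average execution time. */
static int dry_run_xdp(int prog_fd)
{
    char pkt[64] = {0};  /* zeroed, Ethernet-frame-sized buffer */
    char out[64];
    DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
        .data_in = pkt,
        .data_size_in = sizeof(pkt),
        .data_out = out,
        .data_size_out = sizeof(out),
        .repeat = 1,     /* single invocation */
    );

    int err = bpf_prog_test_run_opts(prog_fd, &opts);
    if (err)
        return err;
    printf("action=%u duration=%uns\n", opts.retval, opts.duration);
    return 0;
}
```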
### Live Frames Mode: Real Packet Injection
In Linux 5.18+, the kernel introduced **live frames mode** via the `BPF_F_TEST_XDP_LIVE_FRAMES` flag. This fundamentally changes BPF_PROG_RUN behavior. When enabled, XDP_TX actions don't just return - they actually transmit packets on the wire through the specified network interface. This turns BPF_PROG_RUN into a powerful packet generator.
Here's how it works: Your userspace program constructs a packet (Ethernet frame with IP header, UDP payload, etc.) and passes it to `bpf_prog_test_run()` with live frames enabled. The XDP program receives this packet in its `xdp_md` context. If the program returns `XDP_TX`, the kernel transmits the packet through the network driver as if it arrived on the interface and was reflected back. The packet appears on the wire with full hardware offload support (checksumming, segmentation, etc.).
This enables several powerful use cases. **Network stack stress testing**: Flood your system with millions of packets per second to find breaking points in the network stack, driver, or application layer. **XDP program benchmarking**: Measure how many packets per second your XDP program can process under realistic load without external packet generators. **Protocol fuzzing**: Generate malformed packets or unusual protocol sequences to test robustness. **Synthetic traffic generation**: Create realistic traffic patterns for testing load balancers, firewalls, or intrusion detection systems.
### The XDP_TX Reflection Loop
The simplest XDP packet generator uses the `XDP_TX` action. This tells the kernel "transmit this packet back out the interface it arrived on." Our minimal XDP program is essentially a single statement:
```c
SEC("xdp")
int xdp_redirect_notouch(struct xdp_md *ctx)
{
    return XDP_TX;
}
```
That's it. No packet parsing, no header modification - just reflect everything. Combined with BPF_PROG_RUN in live frames mode, this creates a packet generation loop: userspace injects a packet, XDP reflects it to the wire, repeat at millions of packets per second.
Why is this so fast? The XDP program runs in the driver's receive path with direct access to DMA buffers. There's no socket buffer allocation, no protocol stack traversal, no context switching to userspace between packets. The kernel can batch packet processing across multiple frames, amortizing syscall overhead. On modern hardware, a single CPU core can generate 5-10 million packets per second.
## Building the Packet Generator
Let's examine how the complete packet generator works, from userspace control to kernel packet injection.
### Complete XDP Program: xdp-pktgen.bpf.c
```c
/* SPDX-License-Identifier: MIT */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

char _license[] SEC("license") = "GPL";

SEC("xdp")
int xdp_redirect_notouch(struct xdp_md *ctx)
{
    return XDP_TX;
}
```
This is the entire XDP program. The `SEC("xdp")` attribute marks it as an XDP program for libbpf's program loader. The function receives an `xdp_md` context containing packet metadata - `data` and `data_end` pointers frame the packet buffer, `ingress_ifindex` identifies the receiving interface, and RX queue information is available for multi-queue NICs.
We immediately return `XDP_TX` without touching the packet. In live frames mode, this causes the kernel to transmit the packet. The packet data itself comes from userspace - we'll construct UDP or custom protocol packets and inject them via BPF_PROG_RUN.
The beauty of this minimal approach is that all packet construction happens in userspace where you have full control. Want to fuzz protocols? Generate packets in C with arbitrary header fields. Need realistic traffic patterns? Read pcap files and replay them through the XDP program. Testing specific edge cases? Craft packets byte-by-byte. The XDP program is just a vehicle for getting packets onto the wire at line rate.
### Userspace Control Program: xdp-pktgen.c
The userspace program handles packet construction, BPF program loading, and injection control. Let's walk through the key components.
#### Packet Construction and Configuration
```c
struct config {
    int ifindex;     // Which interface to inject packets on
    int xdp_flags;   // XDP attachment flags
    int repeat;      // How many times to inject each packet
    int batch_size;  // Batch size for BPF_PROG_RUN (0 = auto)
};

struct config cfg = {
    .ifindex = 6,       // Network interface (e.g., eth0)
    .repeat = 1 << 20,  // ~1 million repeats per batch
    .batch_size = 0,    // Let kernel choose optimal batch
};
```
The configuration controls packet injection parameters. Interface index identifies which NIC to use - find it with `ip link show`. Repeat count determines how many times to inject each packet in a single BPF_PROG_RUN call. Higher counts amortize syscall overhead but increase latency before the next packet template. Batch size lets you inject multiple different packets in one syscall (advanced feature, 0 means single packet mode).
Packet construction supports two modes. By default, it generates a synthetic UDP/IPv4 packet:
```c
struct test_udp_packet_v4 pkt_udp = create_test_udp_packet_v4();
size = sizeof(pkt_udp);
memcpy(pkt_file_buffer, &pkt_udp, size);
```
This creates a minimal valid UDP packet - Ethernet frame with source/dest MACs, IPv4 header with addresses and checksums, UDP header with ports, and a small payload. The `create_test_udp_packet_v4()` helper (from test_udp_pkt.h) constructs a wire-format packet that network stacks will accept.
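The exact layout lives in test_udp_pkt.h, but a packed struct along these lines is the usual way to build such a wire-format packet (the field breakdown here is an illustrative sketch, not a copy of the header):

```c
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>

/* Illustrative sketch of a wire-format UDP/IPv4 test packet.
 * __attribute__((packed)) prevents compiler padding, so memcpy()ing
 * the struct into the injection buffer yields a valid Ethernet frame. */
struct test_udp_packet_v4 {
    struct ethhdr eth;  /* dst/src MACs, h_proto = htons(ETH_P_IP) */
    struct iphdr iph;   /* ihl=5, ttl, protocol = IPPROTO_UDP, checksum */
    struct udphdr udp;  /* src/dst ports, length, checksum */
} __attribute__((packed));  /* 14 + 20 + 8 = 42 bytes - consistent with
                               the "pkt size: 42" in the sample run below */
```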
For custom packets, set the `PKTGEN_FILE` environment variable to a file containing raw packet bytes:
```c
if ((pkt_file = getenv("PKTGEN_FILE")) != NULL) {
    FILE *file = fopen(pkt_file, "r");
    if (!file)
        return -errno;  /* bail out if the packet file can't be opened */
    size = fread(pkt_file_buffer, 1, 1024, file);
    fclose(file);
}
```
This lets you inject arbitrary packets - pcap extracts, fuzzing payloads, or protocol test vectors. Any binary data works as long as it forms a valid Ethernet frame.
#### BPF_PROG_RUN Invocation and Live Frames
The packet injection loop uses `bpf_prog_test_run_opts()` to repeatedly invoke the XDP program:
```c
struct xdp_md ctx_in = {
    .data_end = size,                // Packet length
    .ingress_ifindex = cfg.ifindex   // Which interface
};

DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
    .data_in = pkt_file_buffer,      // Packet data
    .data_size_in = size,            // Packet length
    .ctx_in = &ctx_in,               // XDP metadata
    .ctx_size_in = sizeof(ctx_in),
    .repeat = cfg.repeat,            // Repeat count
    .flags = BPF_F_TEST_XDP_LIVE_FRAMES,  // Enable live TX
    .batch_size = cfg.batch_size,
    .cpu = 0,                        // Pin to CPU 0
);
```
The critical flag is `BPF_F_TEST_XDP_LIVE_FRAMES`. Without it, the XDP program runs but packets stay in memory. With it, XDP_TX actions actually transmit packets through the driver. The kernel validates that the interface index is valid and the interface is up, ensuring packets hit the wire.
CPU pinning (`cpu = 0`) is important for performance measurement. By pinning the injection thread to CPU 0, you get consistent performance numbers and avoid cache bouncing across cores. For maximum throughput, you'd spawn multiple threads pinned to different CPUs, each injecting packets on separate interfaces or queues.
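A sketch of that multi-threaded variant, assuming the tutorial's `run_prog()` wrapper around `bpf_prog_test_run_opts()` (the `worker_arg` struct and thread function are hypothetical):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical per-thread argument: program fd plus the core to pin to. */
struct worker_arg {
    int prog_fd;
    int cpu;
};

static void *injector_thread(void *p)
{
    struct worker_arg *arg = p;
    cpu_set_t set;

    /* Pin this thread to its own core to avoid cache bouncing. */
    CPU_ZERO(&set);
    CPU_SET(arg->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Each call injects one ~1M-packet batch; loop until an error occurs. */
    while (run_prog(arg->prog_fd, 1 << 20) == 0)
        ;
    return NULL;
}
```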
The injection loop continues until interrupted:
```c
do {
    err = bpf_prog_test_run_opts(run_prog_fd, &opts);
    if (err)
        return -errno;
    iterations += opts.repeat;
} while ((count == 0 || iterations < count) && !exiting);
```
Each `bpf_prog_test_run_opts()` call injects `repeat` packets (about 1 million by default). At 5-10 Mpps that's roughly 100-200ms per call, so the syscall overhead is amortized across an entire batch. The kernel batches packet processing, minimizing per-packet overhead. Total throughput depends on packet size, NIC capability, and CPU performance, but 5-10 Mpps per core is achievable.
#### Kernel Support Detection
Not all kernels support live frames mode. The program probes for support before starting injection:
```c
static int probe_kernel_support(int run_prog_fd)
{
    int err = run_prog(run_prog_fd, 1);  // Try injecting 1 packet

    if (err == -EOPNOTSUPP) {
        printf("BPF_PROG_RUN with batch size support is missing from libbpf.\n");
    } else if (err == -EINVAL) {
        err = -EOPNOTSUPP;
        printf("Kernel doesn't support live packet mode for XDP BPF_PROG_RUN.\n");
    } else if (err) {
        printf("Error probing kernel support: %s\n", strerror(-err));
    } else {
        printf("Kernel supports live packet mode for XDP BPF_PROG_RUN.\n");
    }
    return err;
}
```
This attempts a single packet injection. If the kernel lacks support (Linux <5.18 or CONFIG_XDP_SOCKETS not enabled), it returns `-EINVAL`. Older libbpf versions without batch support return `-EOPNOTSUPP`. Success means you can proceed with full packet generation.
## Running the Packet Generator
Navigate to the tutorial directory and build the project:
```bash
cd bpf-developer-tutorial/src/46-xdp-test
make build
```
This compiles both the XDP program (`xdp-pktgen.bpf.o`) and userspace control program (`xdp-pktgen`). The build requires Clang for BPF compilation and libbpf for skeleton generation.
Before running, identify your network interface index. Use `ip link show` to list interfaces:
```bash
ip link show
```
You'll see output like:
```
1: lo: <LOOPBACK,UP,LOWER_UP> ...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
6: veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
```
Note the interface number (e.g., 6 for veth0). Update the config in xdp-pktgen.c if needed:
```c
struct config cfg = {
    .ifindex = 6,  // Change to your interface index
    ...
};
```
Run the packet generator with root privileges (required for BPF_PROG_RUN):
```bash
sudo ./xdp-pktgen
```
You'll see output like:
```
Kernel supports live packet mode for XDP BPF_PROG_RUN.
pkt size: 42
[Generating packets...]
```
The program runs until interrupted with Ctrl-C. Monitor packet transmission with:
```bash
# In another terminal, watch interface statistics
watch -n 1 'ip -s link show veth0'
```
You'll see TX packet counters increasing rapidly. On a modern CPU, expect 5-10 million packets per second per core for minimal-size packets.
### Custom Packet Injection
To inject custom packets, create a binary packet file and set the environment variable:
```bash
# Create a custom packet (e.g., using scapy or hping3 to generate the binary)
echo -n -e '\x00\x01\x02\x03\x04\x05...' > custom_packet.bin
# Inject it
sudo PKTGEN_FILE=custom_packet.bin ./xdp-pktgen
```
The generator reads up to 1024 bytes from the file and injects that packet repeatedly. This works for any protocol - IPv6, ICMP, custom L2 protocols, even malformed packets for fuzzing.
## Performance Characteristics and Tuning
XDP packet generation performance depends on several factors. Let's understand what limits throughput and how to maximize it.
**Packet size impact**: Smaller packets achieve higher packet rates but lower throughput in bytes per second. A 64-byte packet at 10 Mpps delivers 5 Gbps. A 1500-byte packet at 2 Mpps delivers 24 Gbps. The CPU processes packets at roughly constant packet-per-second rates, so larger packets achieve higher bandwidth.
**CPU frequency and microarchitecture**: Newer CPUs with higher frequencies and better IPC (instructions per cycle) achieve higher rates. Intel Xeon or AMD EPYC server CPUs can hit 10+ Mpps per core. Older or lower-power CPUs may only reach 2-5 Mpps.
**NIC capabilities**: The network driver must keep up with injection rates. High-end NICs (Intel X710, Mellanox ConnectX) support millions of packets per second. Consumer gigabit NICs often saturate at 1-2 Mpps due to driver limitations or hardware buffering.
**Memory bandwidth**: At high rates, packet data transfer to/from NIC DMA buffers can become a bottleneck. Ensure the system has sufficient memory bandwidth (use `perf stat` to monitor memory controller utilization).
**Interrupt and polling overhead**: Network drivers use interrupts or polling (NAPI) to process packets. Under extreme load, interrupt overhead can slow processing. Consider tuning interrupt coalescing or using busy-polling.
For maximum performance, pin the injection thread to a dedicated CPU core, disable CPU frequency scaling (set governor to performance), use huge pages for packet buffers to reduce TLB misses, and consider multi-queue NICs with RSS (Receive Side Scaling) - spawn threads per queue for parallel injection.
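For example, pinning the governor is one line on most distros (`cpupower` ships in the linux-tools package):

```bash
# Set the performance governor on all cores
sudo cpupower frequency-set -g performance
# Confirm it took effect
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```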
## Summary and Next Steps
XDP packet generators leverage the kernel's BPF_PROG_RUN infrastructure to inject packets at line rate from userspace. By combining a minimal XDP program that returns XDP_TX with live frames mode, you can transmit millions of packets per second without external hardware or kernel modules. This enables network stack stress testing, XDP program benchmarking, protocol fuzzing, and synthetic traffic generation.
Our implementation demonstrates the core concepts: a simple XDP reflection program, userspace packet construction with custom or default UDP packets, BPF_PROG_RUN invocation with live frames flag, and kernel support detection. The result is a flexible, high-performance packet generator suitable for testing network infrastructure, measuring XDP program performance, or generating realistic traffic patterns.
Beyond basic generation, you can extend this approach to create sophisticated testing tools. Add packet templates for different protocols (TCP SYN floods, ICMP echo, DNS queries). Implement traffic shaping (vary inter-packet delays). Support multiple interfaces simultaneously for throughput aggregation. Integrate with network monitoring to measure drop rates or latency. The XDP packet generator framework provides a foundation for advanced network testing capabilities.
> If you'd like to dive deeper into eBPF and XDP, check out our tutorial repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> or visit our website at <https://eunomia.dev/tutorials/>.
## References
- **Tutorial Repository**: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/46-xdp-test>
- **Linux Kernel XDP Documentation**: `Documentation/networking/xdp.rst`
- **BPF_PROG_RUN Documentation**: `tools/testing/selftests/bpf/README.rst`
- **XDP Tutorial**: <https://github.com/xdp-project/xdp-tutorial>
- **libbpf Documentation**: <https://libbpf.readthedocs.io/>
Complete source code with build instructions and example packet templates is available in the tutorial repository. Contributions welcome!


@@ -1,44 +1,281 @@
# eBPF Tutorial by Example: Building a High-Performance XDP Packet Generator

Need to stress-test your network stack or measure XDP program performance? Traditional packet generators like `pktgen` require kernel modules or run in userspace with high overhead. There's a better way - XDP's BPF_PROG_RUN feature lets you inject packets into the kernel's fast path directly from userspace at millions of packets per second, without loading network drivers.

In this tutorial, we'll build an XDP-based packet generator that leverages the kernel's BPF_PROG_RUN test infrastructure. We'll explore how XDP's `XDP_TX` action creates a packet reflection loop, understand the live frames mode that enables real packet injection, and measure the performance characteristics of XDP programs under load. By the end, you'll have a production-grade tool for network testing and XDP benchmarking.

## Understanding XDP Packet Generation

XDP (eXpress Data Path) provides the fastest programmable packet processing in Linux by hooking into network drivers before the kernel's networking stack allocates socket buffers. Normally, XDP programs process packets arriving from network interfaces. But what if you want to test an XDP program's performance without real network traffic? Or inject synthetic packets to stress-test your network infrastructure?

### The BPF_PROG_RUN Testing Interface

The kernel exposes `bpf_prog_test_run()` (BPF_PROG_RUN) as a mechanism for testing BPF programs. Originally designed for unit testing, this syscall lets userspace invoke a BPF program with synthetic input and capture its output. For an XDP program, you provide a packet buffer and an `xdp_md` context describing the packet metadata (interface index, RX queue). The kernel runs your XDP program and returns the action code (XDP_DROP, XDP_PASS, XDP_TX, etc.) along with any packet modifications.

Traditional BPF_PROG_RUN operates in "dry run" mode - packets are processed but never actually transmitted. The XDP program runs, modifies packet data, and returns an action, but nothing hits the wire. This is perfect for testing packet parsing logic or measuring program execution time in isolation.

### Live Frames Mode: Real Packet Injection

In Linux 5.18+, the kernel introduced **live frames mode** via the `BPF_F_TEST_XDP_LIVE_FRAMES` flag. This fundamentally changes BPF_PROG_RUN behavior. When enabled, XDP_TX actions don't just return - they actually transmit packets on the wire through the specified network interface. This turns BPF_PROG_RUN into a powerful packet generator.

Here's how it works: your userspace program constructs a packet (an Ethernet frame with IP header, UDP payload, etc.) and passes it to `bpf_prog_test_run()` with live frames enabled. The XDP program receives this packet in its `xdp_md` context. If the program returns `XDP_TX`, the kernel transmits the packet through the network driver as if it had arrived on the interface and been reflected back. The packet appears on the wire with full hardware offload support (checksumming, segmentation, etc.).

This enables several powerful use cases. **Network stack stress testing**: flood your system with millions of packets per second to find breaking points in the network stack, driver, or application layer. **XDP program benchmarking**: measure how many packets per second your XDP program can process under realistic load without external packet generators. **Protocol fuzzing**: generate malformed packets or unusual protocol sequences to test robustness. **Synthetic traffic generation**: create realistic traffic patterns for testing load balancers, firewalls, or intrusion detection systems.

### The XDP_TX Reflection Loop

The simplest XDP packet generator uses the `XDP_TX` action. This tells the kernel "transmit this packet back out the interface it arrived on." Our minimal XDP program is essentially a single statement:

```c
SEC("xdp")
int xdp_redirect_notouch(struct xdp_md *ctx)
{
    return XDP_TX;
}
```

That's it. No packet parsing, no header modification - just reflect everything. Combined with BPF_PROG_RUN in live frames mode, this creates a packet generation loop: userspace injects a packet, XDP reflects it to the wire, and the cycle repeats at millions of packets per second.

Why is this so fast? The XDP program runs in the driver's receive path with direct access to DMA buffers. There's no socket buffer allocation, no protocol stack traversal, no context switching to userspace between packets. The kernel can batch packet processing across multiple frames, amortizing per-packet overhead. On modern hardware, a single CPU core can generate 5-10 million packets per second.

## Building the Packet Generator

Let's examine how the complete packet generator works, from userspace control to kernel packet injection.

### Complete XDP Program: xdp-pktgen.bpf.c

```c
/* SPDX-License-Identifier: MIT */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

char _license[] SEC("license") = "GPL";

SEC("xdp")
int xdp_redirect_notouch(struct xdp_md *ctx)
{
    return XDP_TX;
}
```

This is the entire XDP program. The `SEC("xdp")` attribute marks it as an XDP program for libbpf's program loader. The function receives an `xdp_md` context containing packet metadata - the `data` and `data_end` pointers frame the packet buffer, `ingress_ifindex` identifies the receiving interface, and RX queue information is available for multi-queue NICs.

We immediately return `XDP_TX` without touching the packet. In live frames mode, this causes the kernel to transmit the packet. The packet data itself comes from userspace - we'll construct UDP or custom protocol packets and inject them via BPF_PROG_RUN.

The beauty of this minimal approach is that all packet construction happens in userspace, where you have full control. Want to fuzz protocols? Generate packets in C with arbitrary header fields. Need realistic traffic patterns? Read pcap files and replay them through the XDP program. Testing specific edge cases? Craft packets byte by byte. The XDP program is just a vehicle for getting packets onto the wire at line rate.

### Userspace Control Program: xdp-pktgen.c

The userspace program handles packet construction, BPF program loading, and injection control. Let's walk through the key components.

#### Packet Construction and Configuration

```c
struct config {
    int ifindex;     // Which interface to inject packets on
    int xdp_flags;   // XDP attachment flags
    int repeat;      // How many times to inject each packet
    int batch_size;  // Batch size for BPF_PROG_RUN (0 = auto)
};

struct config cfg = {
    .ifindex = 6,       // Network interface (e.g., eth0)
    .repeat = 1 << 20,  // ~1 million repeats per batch
    .batch_size = 0,    // Let kernel choose optimal batch
};
```

The configuration controls packet injection parameters. The interface index identifies which NIC to use - find it with `ip link show`. The repeat count determines how many times each packet is injected in a single BPF_PROG_RUN call. Higher counts amortize syscall overhead but increase latency before the next packet template. Batch size lets you inject multiple different packets in one syscall (an advanced feature; 0 means single-packet mode).

Packet construction supports two modes. By default, it generates a synthetic UDP/IPv4 packet:

```c
struct test_udp_packet_v4 pkt_udp = create_test_udp_packet_v4();
size = sizeof(pkt_udp);
memcpy(pkt_file_buffer, &pkt_udp, size);
```

This creates a minimal valid UDP packet - an Ethernet frame with source/destination MACs, an IPv4 header with addresses and checksums, a UDP header with ports, and a small payload. The `create_test_udp_packet_v4()` helper (from test_udp_pkt.h) constructs a wire-format packet that network stacks will accept.

For custom packets, set the `PKTGEN_FILE` environment variable to a file containing raw packet bytes:

```c
if ((pkt_file = getenv("PKTGEN_FILE")) != NULL) {
    FILE *file = fopen(pkt_file, "r");
    if (!file)
        return -errno;  /* bail out if the packet file can't be opened */
    size = fread(pkt_file_buffer, 1, 1024, file);
    fclose(file);
}
```

This lets you inject arbitrary packets - pcap extracts, fuzzing payloads, or protocol test vectors. Any binary data works as long as it forms a valid Ethernet frame.

#### BPF_PROG_RUN Invocation and Live Frames

The packet injection loop uses `bpf_prog_test_run_opts()` to repeatedly invoke the XDP program:

```c
struct xdp_md ctx_in = {
    .data_end = size,                // Packet length
    .ingress_ifindex = cfg.ifindex   // Which interface
};

DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
    .data_in = pkt_file_buffer,      // Packet data
    .data_size_in = size,            // Packet length
    .ctx_in = &ctx_in,               // XDP metadata
    .ctx_size_in = sizeof(ctx_in),
    .repeat = cfg.repeat,            // Repeat count
    .flags = BPF_F_TEST_XDP_LIVE_FRAMES,  // Enable live TX
    .batch_size = cfg.batch_size,
    .cpu = 0,                        // Pin to CPU 0
);
```

The critical flag is `BPF_F_TEST_XDP_LIVE_FRAMES`. Without it, the XDP program runs but packets stay in memory. With it, XDP_TX actions actually transmit packets through the driver. The kernel validates that the interface index is valid and the interface is up, ensuring packets hit the wire.

CPU pinning (`cpu = 0`) matters for performance measurement. By pinning the injection thread to CPU 0, you get consistent performance numbers and avoid cache bouncing across cores. For maximum throughput, you'd spawn multiple threads pinned to different CPUs, each injecting packets on separate interfaces or queues.

The injection loop continues until interrupted:

```c
do {
    err = bpf_prog_test_run_opts(run_prog_fd, &opts);
    if (err)
        return -errno;
    iterations += opts.repeat;
} while ((count == 0 || iterations < count) && !exiting);
```

Each `bpf_prog_test_run_opts()` call injects `repeat` packets (about 1 million by default). At 5-10 Mpps that's roughly 100-200ms per call, so the syscall overhead is amortized across an entire batch. The kernel batches packet processing, minimizing per-packet overhead. Total throughput depends on packet size, NIC capability, and CPU performance, but 5-10 Mpps per core is achievable.

#### Kernel Support Detection

Not all kernels support live frames mode. The program probes for support before starting injection:

```c
static int probe_kernel_support(int run_prog_fd)
{
    int err = run_prog(run_prog_fd, 1);  // Try injecting 1 packet

    if (err == -EOPNOTSUPP) {
        printf("BPF_PROG_RUN with batch size support is missing from libbpf.\n");
    } else if (err == -EINVAL) {
        err = -EOPNOTSUPP;
        printf("Kernel doesn't support live packet mode for XDP BPF_PROG_RUN.\n");
    } else if (err) {
        printf("Error probing kernel support: %s\n", strerror(-err));
    } else {
        printf("Kernel supports live packet mode for XDP BPF_PROG_RUN.\n");
    }
    return err;
}
```

This attempts a single packet injection. If the kernel lacks support (Linux <5.18, or CONFIG_XDP_SOCKETS not enabled), it returns `-EINVAL`. Older libbpf versions without batch support return `-EOPNOTSUPP`. Success means you can proceed with full packet generation.

## Running the Packet Generator

Navigate to the tutorial directory and build the project:

```bash
cd bpf-developer-tutorial/src/46-xdp-test
make build
```

This compiles both the XDP program (`xdp-pktgen.bpf.o`) and the userspace control program (`xdp-pktgen`). The build requires Clang for BPF compilation and libbpf for skeleton generation.

Before running, identify your network interface index. Use `ip link show` to list interfaces:

```bash
ip link show
```

You'll see output like:

```
1: lo: <LOOPBACK,UP,LOWER_UP> ...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
6: veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
```

Note the interface number (e.g., 6 for veth0). Update the config in xdp-pktgen.c if needed:

```c
struct config cfg = {
    .ifindex = 6,  // Change to your interface index
    ...
};
```

Run the packet generator with root privileges (required for BPF_PROG_RUN):

```bash
sudo ./xdp-pktgen
```

You'll see output like:

```
Kernel supports live packet mode for XDP BPF_PROG_RUN.
pkt size: 42
[Generating packets...]
```

The program runs until interrupted with Ctrl-C. Monitor packet transmission with:

```bash
# In another terminal, watch interface statistics
watch -n 1 'ip -s link show veth0'
```

You'll see TX packet counters increasing rapidly. On a modern CPU, expect 5-10 million packets per second per core for minimum-size packets.

### Custom Packet Injection

To inject custom packets, create a binary packet file and set the environment variable:

```bash
# Create a custom packet (e.g., using scapy or hping3 to generate the binary)
echo -n -e '\x00\x01\x02\x03\x04\x05...' > custom_packet.bin
# Inject it
sudo PKTGEN_FILE=custom_packet.bin ./xdp-pktgen
```

The generator reads up to 1024 bytes from the file and injects that packet repeatedly. This works for any protocol - IPv6, ICMP, custom L2 protocols, even malformed packets for fuzzing.

## Performance Characteristics and Tuning

XDP packet generation performance depends on several factors. Let's understand what limits throughput and how to maximize it.

**Packet size impact**: smaller packets achieve higher packet rates but lower throughput in bytes per second. A 64-byte packet at 10 Mpps delivers about 5 Gbps; a 1500-byte packet at 2 Mpps delivers 24 Gbps. The CPU processes packets at a roughly constant packets-per-second rate, so larger packets achieve higher bandwidth.

**CPU frequency and microarchitecture**: newer CPUs with higher frequencies and better IPC (instructions per cycle) achieve higher rates. Intel Xeon or AMD EPYC server CPUs can hit 10+ Mpps per core. Older or lower-power CPUs may only reach 2-5 Mpps.

**NIC capabilities**: the network driver must keep up with injection rates. High-end NICs (Intel X710, Mellanox ConnectX) support millions of packets per second. Consumer gigabit NICs often saturate at 1-2 Mpps due to driver limitations or hardware buffering.

**Memory bandwidth**: at high rates, packet data transfer to and from NIC DMA buffers can become a bottleneck. Ensure the system has sufficient memory bandwidth (use `perf stat` to monitor memory controller utilization).

**Interrupt and polling overhead**: network drivers use interrupts or polling (NAPI) to process packets. Under extreme load, interrupt overhead can slow processing. Consider tuning interrupt coalescing or using busy-polling.

For maximum performance, pin the injection thread to a dedicated CPU core, disable CPU frequency scaling (set the governor to performance), use huge pages for packet buffers to reduce TLB misses, and consider multi-queue NICs with RSS (Receive Side Scaling) - spawn one thread per queue for parallel injection.

## Summary and Next Steps

XDP packet generators leverage the kernel's BPF_PROG_RUN infrastructure to inject packets at line rate from userspace. By combining a minimal XDP program that returns XDP_TX with live frames mode, you can transmit millions of packets per second without external hardware or kernel modules. This enables network stack stress testing, XDP program benchmarking, protocol fuzzing, and synthetic traffic generation.

Our implementation demonstrates the core concepts: a simple XDP reflection program, userspace packet construction with custom or default UDP packets, BPF_PROG_RUN invocation with the live frames flag, and kernel support detection. The result is a flexible, high-performance packet generator suitable for testing network infrastructure, measuring XDP program performance, or generating realistic traffic patterns.

Beyond basic generation, you can extend this approach to build sophisticated testing tools. Add packet templates for different protocols (TCP SYN floods, ICMP echo, DNS queries). Implement traffic shaping (vary inter-packet delays). Support multiple interfaces simultaneously for throughput aggregation. Integrate with network monitoring to measure drop rates or latency. The XDP packet generator framework provides a foundation for advanced network testing capabilities.

> If you'd like to dive deeper into eBPF and XDP, check out our tutorial repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> or visit our website at <https://eunomia.dev/tutorials/>.

## References

- **Tutorial Repository**: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/46-xdp-test>
- **Linux Kernel XDP Documentation**: `Documentation/networking/xdp.rst`
- **BPF_PROG_RUN Documentation**: `tools/testing/selftests/bpf/README.rst`
- **XDP Tutorial**: <https://github.com/xdp-project/xdp-tutorial>
- **libbpf Documentation**: <https://libbpf.readthedocs.io/>

Complete source code with build instructions and example packet templates is available in the tutorial repository. Contributions welcome!


@@ -0,0 +1,292 @@
# eBPF Tutorial by Example: Monitoring GPU Activity with Kernel Tracepoints
Ever wondered what your GPU is really doing under the hood? When games stutter, ML training slows down, or video encoding freezes, the answers lie deep inside the kernel's GPU driver. Traditional debugging relies on guesswork and vendor-specific tools, but there's a better way. Linux kernel GPU tracepoints expose real-time insights into job scheduling, memory allocation, and command submission - and eBPF lets you analyze this data with minimal overhead.
In this tutorial, we'll explore GPU kernel tracepoints across DRM scheduler, Intel i915, and AMD AMDGPU drivers. We'll write bpftrace scripts to monitor live GPU activity, track memory pressure, measure job latency, and diagnose performance bottlenecks. By the end, you'll have production-ready monitoring tools and deep knowledge of how GPUs interact with the kernel.
## Understanding GPU Kernel Tracepoints
GPU tracepoints are instrumentation points built directly into the kernel's Direct Rendering Manager (DRM) subsystem. When your GPU schedules a job, allocates memory, or signals a fence, these tracepoints fire - capturing precise timing, resource identifiers, and driver state. Unlike userspace profiling tools that sample periodically and miss events, kernel tracepoints catch every single operation with nanosecond timestamps.
### Why Kernel Tracepoints Matter for GPU Monitoring
Think about what happens when you launch a GPU workload. Your application submits commands through the graphics API (Vulkan, OpenGL, CUDA). The userspace driver translates these into hardware-specific command buffers. The kernel driver receives an ioctl, validates the work, allocates GPU memory, binds resources to GPU address space, schedules the job on a hardware ring, and waits for completion. Traditional profiling sees the start and end - kernel tracepoints see every step in between.
The performance implications are significant. Polling-based monitoring checks GPU state every 100ms and consumes CPU cycles on every check. Tracepoints activate only when events occur, adding mere nanoseconds of overhead per event, and capture 100% of activity including microsecond-duration jobs. For production monitoring of Kubernetes GPU workloads or debugging ML training performance, this difference is critical.
### The DRM Tracepoint Ecosystem
GPU tracepoints span three layers of the graphics stack. **DRM scheduler tracepoints** (gpu_scheduler event group) are marked as stable uAPI - their format will never change. These work identically across Intel, AMD, and Nouveau drivers, making them perfect for vendor-neutral monitoring. They track job submission (`drm_run_job`), completion (`drm_sched_process_job`), and dependency waits (`drm_sched_job_wait_dep`).
**Vendor-specific tracepoints** expose driver internals. Intel i915 tracepoints track GEM object creation (`i915_gem_object_create`), VMA binding to GPU address space (`i915_vma_bind`), memory pressure events (`i915_gem_shrink`), and page faults (`i915_gem_object_fault`). AMD AMDGPU tracepoints monitor buffer object lifecycle (`amdgpu_bo_create`), command submission from userspace (`amdgpu_cs_ioctl`), scheduler execution (`amdgpu_sched_run_job`), and GPU interrupts (`amdgpu_iv`). Note that Intel low-level tracepoints require `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` in your kernel config.
**Generic DRM tracepoints** handle display synchronization through vblank events - critical for diagnosing frame drops and compositor latency. Events include vblank occurrence (`drm_vblank_event`), userspace queueing (`drm_vblank_event_queued`), and delivery (`drm_vblank_event_delivered`).
### Real-World Use Cases
GPU tracepoints solve problems that traditional tools can't touch. **Diagnosing stuttering in games**: You notice frame drops every few seconds. Vblank tracepoints reveal missed vertical blanks. Job scheduling traces show CPU-side delays in command submission. Memory tracepoints expose allocations triggering evictions during critical frames. Within minutes you identify that texture uploads are blocking the rendering pipeline.
**Optimizing ML training performance**: Your PyTorch training is 40% slower than expected. AMDGPU command submission tracing reveals excessive synchronization - the CPU waits for GPU completion too often. Job dependency tracepoints show unnecessary fences between independent operations. Memory traces expose thrashing between VRAM and system RAM. You reorganize batching to eliminate stalls.
**Cloud GPU billing accuracy**: Multi-tenant systems need fair energy and resource accounting. DRM scheduler tracepoints attribute exact GPU time to each container. Memory tracepoints track allocation per workload. This data feeds into accurate billing systems that charge based on actual resource consumption rather than time-based estimates.
**Thermal throttling investigation**: GPU performance degrades under load. Interrupt tracing shows thermal events from the GPU. Job scheduling traces reveal frequency scaling impacting execution time. Memory migration traces show the driver moving workloads to cooler GPU dies. You adjust power limits and improve airflow.
## Tracepoint Reference Guide
Let's examine each tracepoint category in detail, understanding the data they expose and how to interpret it.
### DRM Scheduler Tracepoints: The Universal GPU Monitor
The DRM scheduler provides a vendor-neutral view of GPU job management. These tracepoints work identically whether you're running Intel integrated graphics, AMD discrete GPUs, or Nouveau on NVIDIA hardware.
#### drm_run_job: When GPU Work Starts Executing
When the scheduler assigns a job to GPU hardware, `drm_run_job` fires. This marks the transition from "queued in software" to "actively running on silicon." The tracepoint captures the job ID (unique identifier for correlation), ring name (which execution engine: graphics, compute, video decode), queue depth (how many jobs are waiting), and hardware job count (jobs currently executing on GPU).
The format looks like: `entity=0xffff888... id=12345 fence=0xffff888... ring=gfx job count:5 hw job count:2`. This tells you job 12345 on the graphics ring started executing. Five jobs are queued behind it, and two jobs are currently running on hardware (multi-engine GPUs can run jobs in parallel).
Use this to measure job scheduling latency. Record the timestamp when userspace submits work (using command submission tracepoints), then measure time until `drm_run_job` fires. Latencies over 1ms indicate CPU-side scheduling delays. Per-ring statistics reveal if specific engines (video encode, compute) are bottlenecked.
#### drm_sched_process_job: Job Completion Signal
When GPU hardware completes a job and signals its fence, this tracepoint fires. The fence pointer identifies the completed job - correlate it with `drm_run_job` to calculate GPU execution time. Format: `fence=0xffff888... signaled`.
Combine with `drm_run_job` timestamps to compute job execution time: `completion_time - run_time = GPU_execution_duration`. If jobs that should take 5ms are taking 50ms, you've found a GPU performance problem. Throughput metrics (jobs completed per second) indicate overall GPU utilization.
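As a minimal sketch of that correlation (field names match the formats above, but verify them on your kernel with `sudo bpftrace -lv 'tracepoint:gpu_scheduler:*'`), a few lines of bpftrace suffice to histogram GPU execution time:

```
tracepoint:gpu_scheduler:drm_run_job
{
    /* job starts executing on hardware; key by fence pointer */
    @run_ts[args->fence] = nsecs;
}

tracepoint:gpu_scheduler:drm_sched_process_job
/@run_ts[args->fence]/
{
    /* completion minus start = GPU execution time, in microseconds */
    @exec_us = hist((nsecs - @run_ts[args->fence]) / 1000);
    delete(@run_ts[args->fence]);
}
```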
#### drm_sched_job_wait_dep: Dependency Stalls
Before a job can execute, its dependencies (previous jobs it waits for) must complete. This tracepoint fires when a job blocks waiting for a fence. Format: `job ring=gfx id=12345 depends fence=0xffff888... context=1234 seq=567`.
This reveals pipeline stalls. If compute jobs constantly wait for graphics jobs, you're not exploiting parallelism. If wait times are long, dependency chains are too deep - consider batching independent work. Excessive dependencies indicate a CPU-side scheduling inefficiency.
### Intel i915 Tracepoints: Memory and I/O Deep Dive
Intel's i915 driver exposes detailed tracepoints for memory management and data transfer. These require `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` - check with `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)`.
#### i915_gem_object_create: GPU Memory Allocation
When the driver allocates a GEM (Graphics Execution Manager) object - the fundamental unit of GPU-accessible memory - this fires. Format: `obj=0xffff888... size=0x100000` indicates allocating a 1MB object.
Track total allocated memory over time to detect leaks. Sudden allocation spikes before performance drops suggest memory pressure. Correlate object pointers with subsequent bind/fault events to understand object lifecycle. High-frequency small allocations indicate inefficient batching.
#### i915_vma_bind: Mapping Memory to GPU Address Space
Allocating memory isn't enough - it must be mapped (bound) into GPU address space. This tracepoint fires on VMA (Virtual Memory Area) binding. Format: `obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` shows 64KB bound at GPU virtual address 0x100000.
Binding overhead impacts performance. Frequent rebinding indicates memory thrashing - the driver evicting and rebinding objects under pressure. GPU page faults often correlate with bind operations - the CPU bound memory just before GPU accessed it. Flags like `PIN_MAPPABLE` indicate memory accessible by both CPU and GPU.
#### i915_gem_shrink: Memory Pressure Response
Under memory pressure, the driver reclaims GPU memory. Format: `dev=0 target=0x1000000 flags=0x3` means the driver tries to reclaim 16MB. High shrink activity indicates undersized GPU memory for the workload.
Correlate with performance drops - if shrinking happens during frame rendering, it causes stutters. Flags indicate shrink aggressiveness. Repeated shrinks with small targets suggest memory fragmentation. Compare target with actual freed amount (track object destructions) to measure reclaim efficiency.
#### i915_gem_object_fault: GPU Page Faults
When CPU or GPU accesses unmapped memory, a fault occurs. Format: `obj=0xffff888... GTT index=128 writable` indicates a write fault on Graphics Translation Table page 128. Faults are expensive - they stall execution while the kernel resolves the missing mapping.
Excessive faults kill performance. Write faults are more expensive than reads (require invalidating caches). GTT faults (GPU accessing unmapped memory) indicate incomplete resource binding before job submission. CPU faults suggest inefficient CPU/GPU synchronization - CPU accessing objects while GPU is using them.
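As a quick first check (a sketch using the same tracepoint fields as the fuller `intel_i915.bt` script later in this tutorial), you can count faults by type:

```
tracepoint:i915:i915_gem_object_fault
{
    /* bucket faults by GTT-vs-CPU and read-vs-write */
    $kind = args->gtt ? "GTT" : "CPU";
    $rw = args->write ? "WRITE" : "READ";
    @faults[$kind, $rw] = count();
}
```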
### AMD AMDGPU Tracepoints: Command Flow and Interrupts
AMD's AMDGPU driver provides comprehensive tracing of command submission and hardware interrupts.
#### amdgpu_cs_ioctl: Userspace Command Submission
When an application submits GPU work via ioctl, this captures the request. Format: `sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` shows job 12345 submitted to graphics ring with 2 indirect buffers.
This marks when userspace hands off work to kernel. Record timestamp to measure submission-to-execution latency when combined with `amdgpu_sched_run_job`. High frequency indicates small batches - potential for better batching. Per-ring distribution shows workload balance across engines.
#### amdgpu_sched_run_job: Kernel Schedules Job
The kernel scheduler starts executing a previously submitted job. Comparing timestamps with `amdgpu_cs_ioctl` reveals submission latency. Format includes job ID and ring for correlation.
Submission latencies over 100μs indicate kernel scheduling delays. Per-ring latencies show if specific engines are scheduling-bound. Correlate with CPU scheduler traces to identify if kernel threads are being preempted.
#### amdgpu_bo_create: Buffer Object Allocation
AMD's equivalent to i915 GEM objects. Format: `bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1` allocates 1MB (256 pages). Type indicates VRAM vs GTT (system memory accessible by GPU). Preferred/allowed domains show placement policy.
Track VRAM allocations to monitor memory usage. Type mismatches (requesting VRAM but falling back to GTT) indicate VRAM exhaustion. Visible flag indicates CPU-accessible memory - expensive, use sparingly.
#### amdgpu_bo_move: Memory Migration
When buffer objects migrate between VRAM and GTT, this fires. Migrations are expensive (require copying data over PCIe). Excessive moves indicate memory thrashing - working set exceeds VRAM capacity.
Measure move frequency and size to quantify PCIe bandwidth consumption. Correlate with performance drops - migrations stall GPU execution. Optimize by reducing working set or using smarter placement policies (keep frequently accessed data in VRAM).
#### amdgpu_iv: GPU Interrupts
The GPU signals interrupts for completed work, errors, and events. Format: `ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` captures interrupt details.
Source ID indicates interrupt type (completion, fault, thermal). High interrupt rates impact CPU performance. Unexpected interrupts suggest hardware errors. VMID and PASID identify which process/VM triggered the interrupt - critical for multi-tenant debugging.
### DRM Vblank Tracepoints: Display Synchronization
Vblank (vertical blanking) events synchronize rendering with display refresh. Missing vblanks causes dropped frames and stutter.
#### drm_vblank_event: Vertical Blank Occurs
When the display enters vertical blanking period, this fires. Format: `crtc=0 seq=12345 time=1234567890 high-prec=true` indicates vblank on display controller 0, sequence number 12345.
Track vblank frequency to verify refresh rate (60Hz = 60 vblanks/second). Missed sequences indicate frame drops. High-precision timestamps enable sub-millisecond frame timing analysis. Per-CRTC tracking for multi-monitor setups.
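For instance, a short bpftrace sketch (independent of the fuller `drm_vblank.bt` script below) can histogram the interval between consecutive vblanks per CRTC; a healthy 60Hz panel should cluster tightly around 16.7ms (16667us):

```
tracepoint:drm:drm_vblank_event
{
    $crtc = args->crtc;
    if (@last[$crtc]) {
        /* time since the previous vblank on this CRTC, in microseconds */
        @interval_us[$crtc] = hist((nsecs - @last[$crtc]) / 1000);
    }
    @last[$crtc] = nsecs;
}
```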
#### drm_vblank_event_queued and drm_vblank_event_delivered
These track vblank event delivery to userspace. Queuing latency (queue to delivery) measures kernel scheduling delay. Total latency (vblank to delivery) includes both kernel and driver processing.
Latencies over 1ms indicate compositor problems. Compare across CRTCs to identify problematic displays. Correlate with frame drops visible to users - events delivered late mean missed frames.
## Monitoring with Bpftrace Scripts
We've created vendor-specific bpftrace scripts for production monitoring. Each script focuses on its GPU vendor's specific tracepoints while sharing a common output format.
### DRM Scheduler Monitor: Universal GPU Tracking
The `drm_scheduler.bt` script works on **all GPU drivers** because it uses stable uAPI tracepoints. It tracks jobs across all rings, measures completion rates, and identifies dependency stalls.
The script attaches to `gpu_scheduler:drm_run_job`, `gpu_scheduler:drm_sched_process_job`, and `gpu_scheduler:drm_sched_job_wait_dep`. On job start, it records timestamps in a map keyed by job ID for later latency calculation. It increments per-ring counters to show workload distribution. On completion, it prints fence information. On dependency wait, it shows which job blocks which fence.
Output shows timestamp, event type (RUN/COMPLETE/WAIT_DEP), job ID, ring name, and queue depth. At program end, statistics summarize jobs per ring and dependency wait counts. This reveals if specific rings are saturated, whether jobs are blocked by dependencies, and overall GPU utilization patterns.
### Intel i915 Monitor: Memory and I/O Profiling
The `intel_i915.bt` script tracks Intel GPU memory operations, I/O transfers, and page faults. It requires `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y`.
On `i915_gem_object_create`, it accumulates total allocated memory and stores per-object sizes. VMA bind/unbind events track GPU address space changes. Shrink events measure memory pressure. Pwrite/pread track CPU-GPU data transfers. Faults categorize by type (GTT vs CPU, read vs write).
Output reports allocation size and running total in MB. Bind operations show GPU virtual address and flags. I/O operations track offset and length. Faults indicate type and whether they're reads or writes. End statistics summarize total allocations, VMA operations, memory pressure (shrink operations and bytes reclaimed), I/O volume (read/write counts and sizes), and fault analysis (total faults, write vs read).
This reveals memory leaks (allocations without corresponding frees), binding overhead (frequent rebinds indicate thrashing), memory pressure timing (correlate shrinks with performance drops), I/O patterns (large transfers vs many small ones), and fault hotspots (expensive operations to optimize).
### AMD AMDGPU Monitor: Command Submission Analysis
The `amd_amdgpu.bt` script focuses on AMD's command submission pipeline, measuring latency from ioctl to execution.
On `amdgpu_cs_ioctl`, it records submission timestamp keyed by job ID. When `amdgpu_sched_run_job` fires, it calculates latency: `(current_time - submit_time)`. Buffer object create/move events track memory. Interrupt events count by source ID. Virtual memory operations (flush, map, unmap) measure TLB activity.
Output shows timestamp, event type, job ID, ring name, and calculated latency in microseconds. End statistics include memory allocation totals, command submission counts per ring, average and distribution of submission latency (histogram showing how many jobs experienced different latency buckets), interrupt counts by source, and virtual memory operation counts.
Latency histograms are critical - most jobs should have <50μs latency. A tail of high-latency jobs indicates scheduling problems. Per-ring statistics show if compute workloads have different latency than graphics. Memory migration tracking helps diagnose VRAM pressure.
### Display Vblank Monitor: Frame Timing Analysis
The `drm_vblank.bt` script tracks display synchronization for diagnosing frame drops.
On `drm_vblank_event`, it records timestamp keyed by CRTC and sequence. When `drm_vblank_event_queued` fires, it timestamps queue time. On `drm_vblank_event_delivered`, it calculates queue-to-delivery latency and total vblank-to-delivery latency.
Output shows vblank events, queued events, and delivered events with timestamps. End statistics include total vblank counts per CRTC, event delivery counts, average delivery latency, latency distribution histogram, and total event latency (vblank occurrence to userspace delivery).
Delivery latencies over 1ms indicate compositor scheduling issues. Total latencies reveal end-to-end delay visible to applications. Per-CRTC statistics show if specific monitors have problems. Latency histograms expose outliers causing visible stutter.
## Running the Monitors
Let's trace live GPU activity. Navigate to the scripts directory and run any monitor with bpftrace. The DRM scheduler monitor works on all GPUs:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver/scripts
sudo bpftrace drm_scheduler.bt
```
You'll see output like:
```
Tracing DRM GPU scheduler... Hit Ctrl-C to end.
TIME(ms) EVENT JOB_ID RING QUEUED DETAILS
296119090 RUN 12345 gfx 5 hw=2
296120190 COMPLETE 0xffff888...
=== DRM Scheduler Statistics ===
Jobs per ring:
@jobs_per_ring[gfx]: 1523
@jobs_per_ring[compute]: 89
Waits per ring:
@waits_per_ring[gfx]: 12
```
This shows graphics jobs dominating workload (1523 vs 89 compute jobs). Few dependency waits (12) indicate good pipeline parallelism.
For Intel GPUs, run the i915 monitor:
```bash
sudo bpftrace intel_i915.bt
```
For AMD GPUs:
```bash
sudo bpftrace amd_amdgpu.bt
```
For display timing:
```bash
sudo bpftrace drm_vblank.bt
```
Each script outputs real-time events and end-of-run statistics. Run them during GPU workloads (gaming, ML training, video encoding) to capture characteristic patterns.
## Verifying Tracepoint Availability
Before running scripts, verify tracepoints exist on your system. We've included a test script:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver/tests
sudo ./test_basic_tracing.sh
```
This checks for gpu_scheduler, drm, i915, and amdgpu event groups. It reports which tracepoints are available and recommends appropriate monitoring scripts for your hardware. For Intel systems, it verifies if low-level tracepoints are enabled in kernel config.
You can also manually inspect available tracepoints:
```bash
# All GPU tracepoints
sudo cat /sys/kernel/debug/tracing/available_events | grep -E '(gpu_scheduler|i915|amdgpu|^drm:)'
# DRM scheduler (stable, all vendors)
sudo cat /sys/kernel/debug/tracing/available_events | grep gpu_scheduler
# Intel i915
sudo cat /sys/kernel/debug/tracing/available_events | grep i915
# AMD AMDGPU
sudo cat /sys/kernel/debug/tracing/available_events | grep amdgpu
```
To manually enable a tracepoint and view raw output:
```bash
# Enable drm_run_job
echo 1 | sudo tee /sys/kernel/debug/tracing/events/gpu_scheduler/drm_run_job/enable
# View trace output
sudo cat /sys/kernel/debug/tracing/trace
# Disable when done
echo 0 | sudo tee /sys/kernel/debug/tracing/events/gpu_scheduler/drm_run_job/enable
```
## Summary and Next Steps
GPU kernel tracepoints provide unprecedented visibility into graphics driver behavior. The DRM scheduler's stable uAPI tracepoints work across all vendors, making them perfect for production monitoring. Vendor-specific tracepoints from Intel i915 and AMD AMDGPU expose detailed memory management, command submission pipelines, and hardware interrupt patterns.
Our bpftrace scripts demonstrate practical monitoring: measuring job scheduling latency, tracking memory pressure, analyzing command submission bottlenecks, and diagnosing frame drops. These techniques apply directly to real-world problems - optimizing ML training performance, debugging game stutters, implementing fair GPU resource accounting in cloud environments, and investigating thermal throttling.
The key advantage over traditional tools is completeness and overhead. Kernel tracepoints capture every event with nanosecond precision at negligible cost. No polling, no sampling gaps, no missed short-lived jobs. This data feeds production monitoring systems (Prometheus exporters reading bpftrace output), ad-hoc performance debugging (run a script when users report issues), and automated optimization (trigger workload rebalancing based on latency thresholds).
> If you'd like to dive deeper into eBPF, check out our tutorial repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> or visit our website at <https://eunomia.dev/tutorials/>.
## References
- **Linux Kernel Source**: `/drivers/gpu/drm/`
- **DRM Scheduler**: `/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h`
- **Intel i915**: `/drivers/gpu/drm/i915/i915_trace.h`
- **AMD AMDGPU**: `/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h`
- **Generic DRM**: `/drivers/gpu/drm/drm_trace.h`
- **Kernel Tracepoint Documentation**: `Documentation/trace/tracepoints.rst`
- **Tutorial Repository**: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/xpu/gpu-kernel-driver>
Complete source code including all bpftrace scripts and test cases is available in the tutorial repository. Contributions and issue reports welcome!


@@ -0,0 +1,263 @@
#!/usr/bin/env bpftrace
/*
 * amd_amdgpu.bt - Monitor AMD GPU activity
 *
 * Tracks AMD GPU operations:
 * - Buffer object creation and movement
 * - Command submission (ioctl → scheduler)
 * - GPU interrupts
 * - Virtual memory operations
 * - Register access (optional, very verbose)
 *
 * Usage: sudo bpftrace amd_amdgpu.bt
 * Usage (with register tracing): set @trace_regs = 1 in the BEGIN block below
 */
BEGIN
{
    printf("Tracing AMD GPU... Hit Ctrl-C to end.\n");
    printf("%-18s %-14s %-16s %-12s %s\n",
           "TIME(ms)", "EVENT", "ID/OBJECT", "RING/SIZE", "DETAILS");
    @total_alloc = 0;
    @trace_regs = 0; /* Set to 1 to enable register tracing */
}

/* Buffer object creation */
tracepoint:amdgpu:amdgpu_bo_create
{
    $bo = args->bo;
    $pages = args->pages;
    $type = args->type;
    $size = $pages * 4096;

    @total_alloc += $size;
    @allocs[$bo] = $size;

    printf("%-18llu %-14s 0x%-14llx %-12llu pages=%u type=%u\n",
           nsecs / 1000000, "BO_CREATE", $bo, $size, $pages, $type);
    @bo_creates = count();
}

/* Buffer object move (VRAM ↔ GTT migration) */
tracepoint:amdgpu:amdgpu_bo_move
{
    $bo = args->bo;
    $size = @allocs[$bo];

    printf("%-18llu %-14s 0x%-14llx %-12llu (migration)\n",
           nsecs / 1000000, "BO_MOVE", $bo, $size);
    @bo_moves = count();
}

/* Command submission ioctl from userspace */
tracepoint:amdgpu:amdgpu_cs_ioctl
{
    $job_id = args->sched_job_id;
    $ring = str(args->ring_name);
    $seqno = args->seqno;
    $num_ibs = args->num_ibs;

    /* Record submission time for latency calculation */
    @submit_time[$job_id] = nsecs;

    printf("%-18llu %-14s %-16llu %-12s seq=%u ibs=%u\n",
           nsecs / 1000000, "CS_IOCTL", $job_id, $ring, $seqno, $num_ibs);
    @cs_ioctls = count();
    @cs_per_ring[$ring] = count();
}

/* Scheduler starts job execution */
tracepoint:amdgpu:amdgpu_sched_run_job
{
    $job_id = args->sched_job_id;
    $ring = str(args->ring_name);
    $seqno = args->seqno;

    /* Calculate submission-to-execution latency */
    if (@submit_time[$job_id]) {
        $latency_us = (nsecs - @submit_time[$job_id]) / 1000;
        delete(@submit_time[$job_id]);
        printf("%-18llu %-14s %-16llu %-12s seq=%u latency=%lluus\n",
               nsecs / 1000000, "SCHED_RUN", $job_id, $ring, $seqno, $latency_us);
        /* Track latency statistics */
        @latency_hist = hist($latency_us);
        @latency_sum += $latency_us;
        @latency_count += 1;
    } else {
        printf("%-18llu %-14s %-16llu %-12s seq=%u\n",
               nsecs / 1000000, "SCHED_RUN", $job_id, $ring, $seqno);
    }
    @sched_runs = count();
}

/* Command submission processing */
tracepoint:amdgpu:amdgpu_cs
{
    $ring = args->ring;
    $dw = args->dw;
    $fences = args->fences;

    printf("%-18llu %-14s ring=%-11u %-12s dw=%u fences=%u\n",
           nsecs / 1000000, "CS_PROCESS", $ring, "-", $dw, $fences);
    @cs_process = count();
}

/* GPU interrupt */
tracepoint:amdgpu:amdgpu_iv
{
    $ih = args->ih;
    $client = args->client_id;
    $src = args->src_id;
    $ring = args->ring_id;
    $vmid = args->vmid;
    $pasid = args->pasid;

    printf("%-18llu %-14s ih=%u %-12s client=%u src=%u vmid=%u pasid=%u\n",
           nsecs / 1000000, "INTERRUPT", $ih, "-", $client, $src, $vmid, $pasid);
    @interrupts = count();
    @interrupts_by_src[$src] = count();
}

/* Virtual memory TLB flush */
tracepoint:amdgpu:amdgpu_vm_flush
{
    printf("%-18llu %-14s %-16s %-12s\n", nsecs / 1000000, "VM_FLUSH", "-", "-");
    @vm_flushes = count();
}

/* Virtual memory BO map */
tracepoint:amdgpu:amdgpu_vm_bo_map
{
    printf("%-18llu %-14s %-16s %-12s\n", nsecs / 1000000, "VM_BO_MAP", "-", "-");
    @vm_maps = count();
}

/* Virtual memory BO unmap */
tracepoint:amdgpu:amdgpu_vm_bo_unmap
{
    printf("%-18llu %-14s %-16s %-12s\n", nsecs / 1000000, "VM_BO_UNMAP", "-", "-");
    @vm_unmaps = count();
}

/* Register read (optional - very verbose!) */
tracepoint:amdgpu:amdgpu_device_rreg
/@trace_regs/
{
    printf("%-18llu %-14s dev=0x%-11x %-12s reg=0x%x val=0x%x\n",
           nsecs / 1000000, "REG_READ", args->did, "-", args->reg, args->value);
}

/* Register write (optional - very verbose!) */
tracepoint:amdgpu:amdgpu_device_wreg
/@trace_regs/
{
    printf("%-18llu %-14s dev=0x%-11x %-12s reg=0x%x val=0x%x\n",
           nsecs / 1000000, "REG_WRITE", args->did, "-", args->reg, args->value);
}

END
{
    printf("\n=== AMD GPU Statistics ===\n");
    printf("\nMemory:\n");
    printf("  Total allocated: %llu MB\n", @total_alloc / 1048576);

    if (@latency_count > 0) {
        printf("\nSubmission Latency:\n");
        printf("  Average: %llu us\n", @latency_sum / @latency_count);
        printf("\n  Distribution (microseconds):\n");
        print(@latency_hist);
    }

    printf("\nEvent counts:\n");
    print(@bo_creates);
    print(@bo_moves);
    print(@cs_ioctls);
    print(@sched_runs);
    print(@cs_process);
    print(@interrupts);
    print(@vm_flushes);
    print(@vm_maps);
    print(@vm_unmaps);

    printf("\nCommands per ring:\n");
    print(@cs_per_ring);
    printf("\nInterrupts by source:\n");
    print(@interrupts_by_src);
}


@@ -0,0 +1,84 @@
#!/usr/bin/env bpftrace
/*
* drm_scheduler.bt - Monitor DRM GPU scheduler activity
*
* This script tracks GPU job scheduling using stable DRM scheduler tracepoints.
* Works across ALL modern GPU drivers (Intel i915, AMD AMDGPU, Nouveau, etc.)
*
* The gpu_scheduler tracepoints are stable uAPI - guaranteed not to change.
*
* Usage: sudo bpftrace drm_scheduler.bt
*/
BEGIN
{
    printf("Tracing DRM GPU scheduler... Hit Ctrl-C to end.\n");
    printf("%-18s %-12s %-16s %-12s %-8s %s\n",
           "TIME(ms)", "EVENT", "JOB_ID", "RING", "QUEUED", "DETAILS");
}

/* GPU job starts executing */
tracepoint:gpu_scheduler:drm_run_job
{
    $job_id = args->id;
    $ring = str(args->name);
    $queue = args->job_count;
    $hw_queue = args->hw_job_count;

    /* Record start time for latency calculation */
    @start[$job_id] = nsecs;

    printf("%-18llu %-12s %-16llu %-12s %-8u hw=%d\n",
           nsecs / 1000000, "RUN", $job_id, $ring, $queue, $hw_queue);

    /* Track per-ring statistics */
    @jobs_per_ring[$ring] = count();
}

/* GPU job completes (fence signaled) */
tracepoint:gpu_scheduler:drm_sched_process_job
{
    printf("%-18llu %-12s %-16p\n", nsecs / 1000000, "COMPLETE", args->fence);
    @completion_count = count();
}

/* Job waiting for dependencies */
tracepoint:gpu_scheduler:drm_sched_job_wait_dep
{
    $job_id = args->id;
    $ring = str(args->name);
    $dep_ctx = args->ctx;
    $dep_seq = args->seqno;

    printf("%-18llu %-12s %-16llu %-12s %-8s ctx=%llu seq=%u\n",
           nsecs / 1000000, "WAIT_DEP", $job_id, $ring, "-", $dep_ctx, $dep_seq);

    @wait_count = count();
    @waits_per_ring[$ring] = count();
}

END
{
    printf("\n=== DRM Scheduler Statistics ===\n");
    printf("\nJobs per ring:\n");
    print(@jobs_per_ring);
    printf("\nWaits per ring:\n");
    print(@waits_per_ring);
}


@@ -0,0 +1,123 @@
#!/usr/bin/env bpftrace
/*
* drm_vblank.bt - Monitor display vertical blanking events
*
* Tracks display synchronization using generic DRM vblank tracepoints.
* Works across all DRM drivers.
*
* Use cases:
* - Frame timing analysis
* - V-sync debugging
* - Compositor performance monitoring
* - Event delivery latency measurement
*
* Usage: sudo bpftrace drm_vblank.bt
*/
BEGIN
{
    printf("Tracing DRM vblank events... Hit Ctrl-C to end.\n");
    printf("%-18s %-14s %-6s %-10s %s\n",
           "TIME(ms)", "EVENT", "CRTC", "SEQUENCE", "DETAILS");
}

/* Vblank event occurs */
tracepoint:drm:drm_vblank_event
{
    $crtc = args->crtc;
    $seq = args->seq;
    $time = args->time;
    $high_prec = args->high_prec;

    printf("%-18llu %-14s %-6d %-10u %s\n",
           nsecs / 1000000, "VBLANK", $crtc, $seq,
           $high_prec ? "high-prec" : "");

    /* Track vblanks per CRTC */
    @vblanks = count();
    @vblanks_per_crtc[$crtc] = count();

    /* Record sequence for latency tracking */
    @vblank_time[$crtc, $seq] = nsecs;
}

/* Vblank event queued for delivery */
tracepoint:drm:drm_vblank_event_queued
{
    $crtc = args->crtc;
    $seq = args->seq;

    printf("%-18llu %-14s %-6d %-10u\n", nsecs / 1000000, "QUEUED", $crtc, $seq);
    @queued = count();
    @queue_time[$crtc, $seq] = nsecs;
}

/* Vblank event delivered to userspace */
tracepoint:drm:drm_vblank_event_delivered
{
    $crtc = args->crtc;
    $seq = args->seq;

    /* Calculate delivery latency */
    if (@queue_time[$crtc, $seq]) {
        $latency_us = (nsecs - @queue_time[$crtc, $seq]) / 1000;
        delete(@queue_time[$crtc, $seq]);
        printf("%-18llu %-14s %-6d %-10u latency=%lluus\n",
               nsecs / 1000000, "DELIVERED", $crtc, $seq, $latency_us);
        @delivery_latency = hist($latency_us);
        @latency_sum += $latency_us;
        @latency_count += 1;
    } else {
        printf("%-18llu %-14s %-6d %-10u\n",
               nsecs / 1000000, "DELIVERED", $crtc, $seq);
    }
    @delivered = count();

    /* Calculate total event latency (vblank to delivery) */
    if (@vblank_time[$crtc, $seq]) {
        $total_latency_us = (nsecs - @vblank_time[$crtc, $seq]) / 1000;
        delete(@vblank_time[$crtc, $seq]);
        @total_latency = hist($total_latency_us);
    }
}

END
{
    printf("\n=== DRM Vblank Statistics ===\n");

    if (@latency_count > 0) {
        printf("\nEvent Delivery Latency:\n");
        printf("  Average: %llu us\n", @latency_sum / @latency_count);
        printf("\n  Distribution (queue → delivery, microseconds):\n");
        print(@delivery_latency);
        printf("\nTotal Event Latency (vblank → delivery, microseconds):\n");
        print(@total_latency);
    }

    printf("\nEvent counts:\n");
    print(@vblanks);
    print(@queued);
    print(@delivered);

    printf("\nVblanks per CRTC:\n");
    print(@vblanks_per_crtc);
}


@@ -0,0 +1,195 @@
#!/usr/bin/env bpftrace
/*
* intel_i915.bt - Monitor Intel i915 GPU activity
*
* Tracks Intel GPU operations:
* - GEM object creation and memory allocations
* - VMA binding/unbinding (GPU address space)
* - I/O operations (pread/pwrite)
* - Page faults
* - Memory pressure (shrink/evict)
*
* Requires: CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y
*
* Usage: sudo bpftrace intel_i915.bt
*/
BEGIN
{
    printf("Tracing Intel i915 GPU... Hit Ctrl-C to end.\n");
    printf("%-18s %-12s %-18s %-12s %s\n",
           "TIME(ms)", "EVENT", "OBJECT", "SIZE/OFFSET", "DETAILS");
    @total_alloc = 0;
}

/* GEM object creation */
tracepoint:i915:i915_gem_object_create
{
    $obj = args->obj;
    $size = args->size;

    @total_alloc += $size;
    @allocs[$obj] = $size;

    printf("%-18llu %-12s 0x%-16llx %-12llu total=%llu MB\n",
           nsecs / 1000000, "GEM_CREATE", $obj, $size, @total_alloc / 1048576);
    @gem_creates = count();
}

/* VMA bind to GPU address space */
tracepoint:i915:i915_vma_bind
{
    $obj = args->obj;
    $offset = args->offset;
    $size = args->size;
    $flags = args->flags;

    printf("%-18llu %-12s 0x%-16llx 0x%-10llx size=%llu flags=0x%x\n",
           nsecs / 1000000, "VMA_BIND", $obj, $offset, $size, $flags);
    @vma_binds = count();
}

/* VMA unbind from GPU address space */
tracepoint:i915:i915_vma_unbind
{
    $obj = args->obj;
    $offset = args->offset;
    $size = args->size;

    printf("%-18llu %-12s 0x%-16llx 0x%-10llx size=%llu\n",
           nsecs / 1000000, "VMA_UNBIND", $obj, $offset, $size);
    @vma_unbinds = count();
}

/* Memory shrink (reclaim under pressure) */
tracepoint:i915:i915_gem_shrink
{
    $target = args->target;
    $flags = args->flags;

    printf("%-18llu %-12s %-18s %-12llu flags=0x%x\n",
           nsecs / 1000000, "SHRINK", "-", $target, $flags);
    @shrinks = count();
    @shrink_bytes += $target;
}

/* GPU object eviction */
tracepoint:i915:i915_gem_evict
{
    $size = args->size;
    $align = args->align;
    $flags = args->flags;

    printf("%-18llu %-12s %-18s %-12llu align=%llu flags=0x%x\n",
           nsecs / 1000000, "EVICT", "-", $size, $align, $flags);
    @evictions = count();
}

/* Userspace writes to GEM object */
tracepoint:i915:i915_gem_object_pwrite
{
    $obj = args->obj;
    $offset = args->offset;
    $len = args->len;

    printf("%-18llu %-12s 0x%-16llx 0x%-10llx len=%llu\n",
           nsecs / 1000000, "PWRITE", $obj, $offset, $len);
    @pwrites = count();
    @pwrite_bytes += $len;
}

/* Userspace reads from GEM object */
tracepoint:i915:i915_gem_object_pread
{
    $obj = args->obj;
    $offset = args->offset;
    $len = args->len;

    printf("%-18llu %-12s 0x%-16llx 0x%-10llx len=%llu\n",
           nsecs / 1000000, "PREAD", $obj, $offset, $len);
    @preads = count();
    @pread_bytes += $len;
}

/* GPU page fault */
tracepoint:i915:i915_gem_object_fault
{
    $obj = args->obj;
    $index = args->index;
    $gtt = args->gtt;
    $write = args->write;

    printf("%-18llu %-12s 0x%-16llx %-12llu %s %s\n",
           nsecs / 1000000, "FAULT", $obj, $index,
           $gtt ? "GTT" : "CPU", $write ? "WRITE" : "READ");
    @faults = count();
    if ($write) {
        @write_faults = count();
    } else {
        @read_faults = count();
    }
}

END
{
    printf("\n=== Intel i915 GPU Statistics ===\n");
    printf("\nMemory:\n");
    printf("  Total allocated: %llu MB\n", @total_alloc / 1048576);
    printf("  Bytes shrunk:    %llu MB\n", @shrink_bytes / 1048576);
    printf("  Bytes written:   %llu MB\n", @pwrite_bytes / 1048576);
    printf("  Bytes read:      %llu MB\n", @pread_bytes / 1048576);

    printf("\nEvent counts:\n");
    print(@gem_creates);
    print(@vma_binds);
    print(@vma_unbinds);
    print(@shrinks);
    print(@evictions);
    print(@pwrites);
    print(@preads);
    print(@faults);
    print(@write_faults);
    print(@read_faults);
}