docs: enhance README with detailed explanation of CUDA API tracing and eBPF integration

yunwei37
2025-09-30 22:28:48 -07:00
parent 70451702f0
commit c9d3d65c15
2 changed files with 17 additions and 33 deletions


@@ -4,25 +4,15 @@ Have you ever wondered what's happening under the hood when your CUDA applicatio
## Introduction to CUDA and GPU Tracing
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose processing. When you run a CUDA application, several things happen behind the scenes:
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose processing. When you run a CUDA application, a typical workflow begins with the host (CPU) allocating memory on the device (GPU), followed by data transfer from host memory to device memory; GPU kernels (functions) are then launched to process the data, after which results are transferred back from device to host, and finally device memory is freed.
1. The host (CPU) allocates memory on the device (GPU)
2. Data is transferred from host to device memory
3. GPU kernels (functions) are launched to process the data
4. Results are transferred back from device to host
5. Device memory is freed
Each operation in this process involves CUDA API calls, such as `cudaMalloc` for memory allocation, `cudaMemcpy` for data transfer, and `cudaLaunchKernel` for kernel execution. Tracing these calls can provide valuable insights for debugging and performance optimization, but this isn't straightforward. GPU operations are asynchronous, meaning the CPU can continue executing after submitting work to the GPU without waiting, and traditional debugging tools often can't penetrate this asynchronous boundary to access GPU internal state.
Each of these operations involves CUDA API calls like `cudaMalloc`, `cudaMemcpy`, and `cudaLaunchKernel`. Tracing these calls can provide valuable insights for debugging and performance optimization, but this isn't straightforward. GPU operations happen asynchronously, and traditional debugging tools often can't access GPU internals.
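To make that call sequence concrete, here is a minimal host-side sketch (illustrative, not part of the tutorial's own code) of the runtime calls such a workflow produces. The kernel itself is assumed to live in a separate `.cu` file; a `<<<grid, block>>>` launch there compiles down to a `cudaLaunchKernel` call into `libcudart.so`, which is exactly the kind of symbol a uprobe-based tracer hooks.

```c
// Minimal, illustrative host-side CUDA workflow: every runtime call below is a
// symbol in libcudart.so that a uprobe-based tracer can intercept.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 1 << 20, bytes = n * sizeof(float);
    float *h_buf = (float *)malloc(bytes);   // host memory
    float *d_buf = NULL;                     // device memory

    cudaMalloc((void **)&d_buf, bytes);                        // allocate on the GPU
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // host -> device copy

    // A kernel launch would go here; it returns immediately after queuing work
    // on the GPU (asynchronous with respect to the CPU).

    cudaDeviceSynchronize();                                   // wait for queued GPU work
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // device -> host copy
    cudaFree(d_buf);                                           // release device memory
    free(h_buf);
    return 0;
}
```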
This is where eBPF comes to the rescue! By using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (`libcudart.so`) before they reach the GPU driver, capturing critical information. This approach allows us to gain deep insights into memory allocation sizes and patterns, data transfer directions and volumes, kernel launch parameters, error codes and failure reasons returned by the API, and precise timing information for each operation. By intercepting these calls on the CPU side, we can build a complete view of an application's GPU usage behavior without modifying application code or relying on proprietary profiling tools.
This is where eBPF comes to the rescue! By using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (`libcudart.so`) before they reach the GPU. This gives us visibility into:
This tutorial primarily focuses on CPU-side CUDA API tracing, which provides a macro view of how applications interact with the GPU. However, CPU-side tracing alone has clear limitations. When a CUDA API function like `cudaLaunchKernel` is called, it merely submits a work request to the GPU. We can see when the kernel was launched, but we cannot observe what actually happens inside the GPU. Critical details such as how thousands of threads access memory, their execution patterns, branching behavior, and synchronization operations remain invisible. These details are crucial for understanding performance bottlenecks, such as whether memory access patterns cause coalesced access failures or whether severe thread divergence reduces execution efficiency.
- Memory allocation sizes and patterns
- Data transfer directions and sizes
- Kernel launch parameters
- Error codes and failures
- Timing of operations
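As a concrete illustration of the uprobe idea described above, here is a minimal BPF-side sketch, assuming libbpf conventions; the `libcudart.so` path is a placeholder you would adjust for your system, and the program names are illustrative rather than the tutorial's actual ones. It hooks the entry and return of `cudaMalloc` to capture the requested size and the returned status code.

```c
// cuda_malloc_trace.bpf.c -- illustrative sketch, not the tutorial's full tool.
// Build roughly as: clang -O2 -g -target bpf -D__TARGET_ARCH_x86 -c ...
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

// Entry: cudaMalloc(void **devPtr, size_t size) -- the second argument is the
// requested allocation size. Adjust the library path for your installation.
SEC("uprobe//usr/local/cuda/lib64/libcudart.so:cudaMalloc")
int BPF_KPROBE(trace_cuda_malloc, void **dev_ptr, __u64 size)
{
    bpf_printk("cudaMalloc entry: %llu bytes", size);
    return 0;
}

// Return: the return value is the cudaError_t status code (0 == cudaSuccess).
SEC("uretprobe//usr/local/cuda/lib64/libcudart.so:cudaMalloc")
int BPF_KRETPROBE(trace_cuda_malloc_ret, long ret)
{
    bpf_printk("cudaMalloc exit: err=%ld", ret);
    return 0;
}
```

With the probes attached, the `bpf_printk` output appears in `/sys/kernel/debug/tracing/trace_pipe`, which is enough to confirm the hooks fire before building any more elaborate event reporting.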
This blog mainly focuses on the CPU side of the CUDA API calls; for fine-grained tracing of GPU operations, see the [eGPU](https://dl.acm.org/doi/10.1145/3723851.3726984) paper and the [bpftime](https://github.com/eunomia-bpf/bpftime) project.
To achieve fine-grained tracing of GPU operations, eBPF programs need to run directly on the GPU. This is exactly what the eGPU paper and [bpftime GPU examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu) explore. bpftime converts eBPF programs into PTX instructions that GPUs can execute, then dynamically modifies CUDA binaries at runtime to inject these eBPF programs at kernel entry and exit points, enabling observation of GPU internal behavior. This approach allows developers to access GPU-specific information such as block indices, thread indices, global timers, and perform measurements and tracing on critical paths during kernel execution. This GPU-internal observability is essential for diagnosing complex performance issues, understanding kernel execution behavior, and optimizing GPU computation—capabilities that CPU-side tracing simply cannot provide.
## Key CUDA Functions We Trace
@@ -504,5 +494,7 @@ The code of this tutorial is in [https://github.com/eunomia-bpf/bpf-developer-tu
- NVIDIA CUDA Runtime API: [https://docs.nvidia.com/cuda/cuda-runtime-api/](https://docs.nvidia.com/cuda/cuda-runtime-api/)
- libbpf Documentation: [https://libbpf.readthedocs.io/](https://libbpf.readthedocs.io/)
- Linux uprobes Documentation: [https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt](https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt)
- eGPU: eBPF on GPUs: <https://dl.acm.org/doi/10.1145/3723851.3726984>
- bpftime GPU Examples: <https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu>
If you'd like to dive deeper into eBPF, check out our tutorial repository at [https://github.com/eunomia-bpf/bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial) or visit our website at [https://eunomia.dev/tutorials/](https://eunomia.dev/tutorials/).


@@ -1,28 +1,14 @@
# eBPF and Machine Learning Observability: Tracing CUDA GPU Operations
# eBPF Tutorial: Tracing CUDA GPU Operations
Have you ever wondered what's happening under the hood when your CUDA application runs? Because GPU operations happen on a device with its own separate memory space, debugging and performance analysis become extremely difficult. In this tutorial, we'll build a powerful eBPF-based tracing tool that lets you watch CUDA API calls in real time.
## Introduction to CUDA and GPU Tracing
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose computing. When you run a CUDA application, the following steps happen behind the scenes:
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose computing. When you run a CUDA application, a typical workflow begins with the host (CPU) allocating memory on the device (GPU), followed by data transfer from host memory to device memory; GPU kernels (functions) are then launched to process the data, after which results are transferred back from the device to the host, and finally device memory is freed.
1. The host (CPU) allocates memory on the device (GPU)
2. Data is transferred from host memory to device memory
3. GPU kernels (functions) are launched to process the data
4. Results are transferred back from the device to the host
5. Device memory is freed
Each operation in this process involves CUDA API calls, such as `cudaMalloc` for memory allocation, `cudaMemcpy` for data transfer, and `cudaLaunchKernel` for launching kernels. Tracing these calls can provide valuable information for debugging and performance optimization, but this isn't straightforward. GPU operations are asynchronous, meaning the CPU can continue executing after submitting work to the GPU without waiting, and traditional debugging tools usually cannot penetrate this asynchronous boundary to access the GPU's internal state.
Each of these operations involves CUDA API calls: `cudaMalloc`, `cudaMemcpy`, `cudaLaunchKernel`. Tracing these calls can provide valuable information for debugging and performance optimization, but this isn't straightforward. GPU operations are asynchronous, and traditional debugging tools usually cannot access GPU internals.
This is where eBPF comes to the rescue! Using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (`libcudart.so`) before they reach the GPU. This gives us visibility into:
- Memory allocation sizes and patterns
- Data transfer directions and sizes
- Kernel launch parameters
- Error codes and failure reasons
- Timing of operations
This tutorial mainly focuses on CPU-side CUDA API calls; for fine-grained tracing of GPU operations, see the [eGPU](https://dl.acm.org/doi/10.1145/3723851.3726984) paper and the [bpftime](https://github.com/eunomia-bpf/bpftime) project.
This is where eBPF comes to the rescue! Using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (`libcudart.so`) before they reach the GPU driver, capturing critical information. This approach gives us deep insight into memory allocation sizes and patterns, data transfer directions and sizes, kernel launch parameters, error codes and failure reasons returned by the API, and precise timing information for each operation. By intercepting these calls on the CPU side, we can build a complete view of an application's GPU usage behavior without modifying application code or relying on proprietary profiling tools.
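For reference, here is a minimal user-space loader sketch showing how such a uprobe can be attached by symbol name. It assumes a libbpf skeleton; the skeleton header `cuda_trace.skel.h`, the program name `trace_cuda_malloc`, and the library path are all illustrative placeholders, not the tutorial's actual names.

```c
// cuda_trace_loader.c -- illustrative sketch of attaching a uprobe by symbol name.
#include <stdio.h>
#include <unistd.h>
#include <bpf/libbpf.h>
#include "cuda_trace.skel.h"   // hypothetical skeleton generated by `bpftool gen skeleton`

int main(void)
{
    struct cuda_trace_bpf *skel = cuda_trace_bpf__open_and_load();
    if (!skel)
        return 1;

    // Resolve "cudaMalloc" inside libcudart.so and attach the entry probe.
    // pid -1 means: trace every process that maps this library.
    LIBBPF_OPTS(bpf_uprobe_opts, uopts, .func_name = "cudaMalloc");
    struct bpf_link *link = bpf_program__attach_uprobe_opts(
        skel->progs.trace_cuda_malloc, -1,
        "/usr/local/cuda/lib64/libcudart.so", 0, &uopts);
    if (!link) {
        fprintf(stderr, "failed to attach uprobe to cudaMalloc\n");
        cuda_trace_bpf__destroy(skel);
        return 1;
    }

    printf("Tracing cudaMalloc; see /sys/kernel/debug/tracing/trace_pipe\n");
    while (1)
        pause();   // keep the probes attached until interrupted
}
```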
## eBPF Technical Background and the Challenges of GPU Tracing
@@ -32,6 +18,10 @@ eBPF (Extended Berkeley Packet Filter) was originally designed for network packet filtering
GPU tracing poses unique challenges. Modern GPUs are highly parallel processors containing thousands of small compute cores that can execute tens of thousands of threads at the same time. GPUs also have their own memory hierarchy, including global, shared, constant, and texture memory, and the access patterns across these memories have a huge impact on performance. To complicate matters further, GPU operations are usually asynchronous: once the CPU launches a GPU operation, it can move on to other work without waiting for the operation to complete. The asynchronous nature of the CUDA programming model also makes debugging particularly difficult. While a kernel is executing on the GPU, the CPU cannot directly observe the GPU's internal state. Errors may occur on the GPU yet go undetected until a later synchronization operation (such as cudaDeviceSynchronize or cudaStreamSynchronize), which makes it hard to pinpoint their source. Moreover, GPU memory errors such as out-of-bounds array accesses can cause silent data corruption rather than an immediate crash, adding further complexity to debugging.
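To see why errors surface so late, consider this minimal, illustrative check (not part of the tutorial's tool): the launch status and the execution status are reported by two different calls, and the second one only returns once the host synchronizes with the device.

```c
// Illustrative error handling around a kernel launch: launch-time problems and
// execution-time faults are reported at different points.
#include <cuda_runtime.h>
#include <stdio.h>

void report_kernel_errors(void)
{
    // Immediately after a launch, cudaGetLastError() reports launch failures
    // (bad configuration, missing kernel image, ...), but not faults that
    // happen while the kernel is running.
    cudaError_t launch_err = cudaGetLastError();
    if (launch_err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launch_err));

    // Execution faults (e.g. an out-of-bounds access) only become visible when
    // the host blocks on the device, here via cudaDeviceSynchronize().
    cudaError_t sync_err = cudaDeviceSynchronize();
    if (sync_err != cudaSuccess)
        fprintf(stderr, "kernel execution failed: %s\n",
                cudaGetErrorString(sync_err));
}
```

This gap between submitting work and detecting its failure is part of what makes the per-call error codes captured by the tracer so useful.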
This tutorial mainly focuses on CPU-side CUDA API calls, which gives us a macro-level view of how an application interacts with the GPU. However, tracing only on the CPU side has clear limitations. When a CUDA API function such as `cudaLaunchKernel` is called, it merely submits a work request to the GPU: we can see when a kernel was launched, but we cannot observe what actually happens inside the GPU. Key details, such as how the thousands of threads running on the GPU access memory, their execution patterns, branching behavior, and synchronization operations, are all invisible. These details are crucial for understanding performance bottlenecks, for example whether memory access patterns break coalesced access or whether severe thread divergence is degrading execution efficiency.
To achieve fine-grained tracing of GPU operations, eBPF programs need to run directly on the GPU. This is exactly the direction explored by the eGPU paper and the [bpftime GPU examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu). bpftime converts eBPF programs into PTX instructions that the GPU can execute, then dynamically modifies CUDA binaries at runtime to inject these eBPF programs at the entry and exit points of GPU kernels, enabling observation of the GPU's internal behavior. This lets developers access GPU-specific information such as block indices, thread indices, and global timers, and take measurements and traces on the critical paths of kernel execution. Such GPU-internal observability is essential for diagnosing complex performance problems, understanding kernel execution behavior, and optimizing GPU computation, and it is something CPU-side tracing simply cannot provide.
## Key CUDA Functions We Trace
Our tracing tool monitors several key CUDA functions that represent the main operations in GPU computing. Understanding these functions helps interpret the trace results and diagnose problems in CUDA applications:
@@ -511,5 +501,7 @@ cudaFree: 0.00 µs
- NVIDIA CUDA Runtime API: [https://docs.nvidia.com/cuda/cuda-runtime-api/](https://docs.nvidia.com/cuda/cuda-runtime-api/)
- libbpf Documentation: [https://libbpf.readthedocs.io/](https://libbpf.readthedocs.io/)
- Linux uprobes Documentation: [https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt](https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt)
- eGPU: eBPF on GPUs: <https://dl.acm.org/doi/10.1145/3723851.3726984>
- bpftime GPU Examples: <https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu>
If you'd like to dive deeper into eBPF, check out our tutorial repository at [https://github.com/eunomia-bpf/bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial) or visit our website at [https://eunomia.dev/tutorials/](https://eunomia.dev/tutorials/).