add chinese doc for cuda

This commit is contained in:
officeyutong
2025-05-24 18:21:26 +08:00
parent f933bd4759
commit 80afe2aa34
6 changed files with 439 additions and 634 deletions


@@ -78,6 +78,11 @@ Scheduler:
- [lesson 44-scx-simple](src/44-scx-simple/README.md) Introduction to the BPF Scheduler
- [lesson 45-scx-nest](src/45-scx-nest/README.md) Implementing the `scx_nest` Scheduler
GPU:
- [lesson 47-cuda-events](src/47-cuda-events/README.md) Using eBPF to trace CUDA operations for GPU
Other:
- [lesson 35-user-ringbuf](src/35-user-ringbuf/README.md) Asynchronously Send to Kernel with User Ring Buffer


@@ -72,6 +72,11 @@ Android:
Scheduler:
- [lesson 44-scx-simple](src/44-scx-simple/README.zh.md) Introduction to the BPF Scheduler
GPU:
- [lesson 47-cuda-events](src/47-cuda-events/README.zh.md) Tracing CUDA Operations with eBPF
Other:
- [lesson 35-user-ringbuf](src/35-user-ringbuf/README.zh.md) eBPF Development Practice: Asynchronously Sending Data to the Kernel with a User Ring Buffer


@@ -481,16 +481,6 @@ The `cuda_events` tool supports these options:
- `-p PATH`: Specify the path to the CUDA runtime library or application
- `-d PID`: Trace only the specified process ID
## Learning Objectives
Through this tutorial, you'll learn:
1. How CUDA applications interact with GPUs through the CUDA runtime API
2. How to use eBPF uprobes to trace user-space libraries
3. How to design efficient data structures for kernel-to-user communication
4. How to process and display traced events in a user-friendly format
5. How to filter events by process ID for focused debugging
## Next Steps
Once you're comfortable with this basic CUDA tracing tool, you could extend it to:
@@ -501,6 +491,9 @@ Once you're comfortable with this basic CUDA tracing tool, you could extend it t
4. Create visualizations of CUDA operations for easier analysis
5. Add support for other GPU frameworks like OpenCL or ROCm
For more details about the CUDA example and tutorial, you can check out our repo and the code at <https://github.com/eunomia-bpf/basic-cuda-tutorial>
## References
- CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
@@ -508,83 +501,4 @@ Once you're comfortable with this basic CUDA tracing tool, you could extend it t
- libbpf Documentation: https://libbpf.readthedocs.io/
- Linux uprobes Documentation: https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt
## Benchmarking Tracing Overhead
While tracing is an invaluable tool for debugging and understanding CUDA applications, it does introduce some overhead. We've included a benchmarking tool to help you measure this overhead.
### The Benchmark Tool
The `bench.cu` program performs several CUDA operations repeatedly and measures their execution time:
1. Memory allocation (`cudaMalloc`)
2. Memory transfers (host to device and device to host)
3. Kernel launches
4. Memory deallocation (`cudaFree`)
5. Full operations (the complete sequence)
Each operation is executed many times to get statistically significant results, and the average time per operation is reported in microseconds.
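The repeat-and-average approach used by `bench.cu` can be sketched in Python; the timed operation below is a placeholder standing in for a CUDA runtime call, so the absolute numbers are meaningless, but the measurement structure is the same:

```python
import time

def bench(op, iterations=1000):
    """Run op() repeatedly and return the average time per call in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        op()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6  # seconds -> microseconds per operation

# Placeholder workload standing in for a call such as cudaMalloc/cudaFree.
def dummy_op():
    buf = bytearray(1 << 20)  # "allocate" 1 MiB
    del buf

avg_us = bench(dummy_op)
print(f"dummy_op : {avg_us:.2f} µs per operation")
```

The real benchmark applies this same loop to each CUDA operation in turn, which is why averaging over many iterations is needed: a single `cudaMalloc` call is too fast and too noisy to time reliably on its own.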
### Running the Benchmark
To build the benchmark tool:
```bash
make bench
```
To run a complete benchmark that compares performance with and without tracing:
```bash
make benchmark
```
This will run the benchmark twice:
1. First without any tracing
2. Then with the CUDA events tracer attached
You can also run individual benchmarks:
```bash
# Without tracing
make benchmark-no-trace
# With tracing
make benchmark-with-trace
```
### Interpreting the Results
The benchmark output shows the average time for each CUDA operation in microseconds. Compare the times with and without tracing to understand the overhead.
For example:
```
# Without tracing
cudaMalloc : 23.45 µs per operation
cudaMemcpyH2D : 42.67 µs per operation
cudaLaunchKernel : 15.89 µs per operation
cudaMemcpyD2H : 38.12 µs per operation
cudaFree : 10.34 µs per operation
Full Operation : 130.47 µs per operation
# With tracing
cudaMalloc : 25.12 µs per operation
cudaMemcpyH2D : 45.89 µs per operation
cudaLaunchKernel : 17.23 µs per operation
cudaMemcpyD2H : 41.56 µs per operation
cudaFree : 11.78 µs per operation
Full Operation : 141.58 µs per operation
```
In this example, tracing adds roughly 7-14% overhead per CUDA operation (about 8.5% for the full operation sequence). This is typically acceptable for debugging and profiling purposes, but it's important to be aware of this impact when interpreting the results.
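One way to quantify the slowdown is to compute the relative overhead for each operation. This short Python sketch does that arithmetic using the example figures above:

```python
# Average times in microseconds, taken from the example benchmark output above.
baseline = {"cudaMalloc": 23.45, "cudaMemcpyH2D": 42.67,
            "cudaLaunchKernel": 15.89, "cudaMemcpyD2H": 38.12,
            "cudaFree": 10.34, "Full Operation": 130.47}
traced   = {"cudaMalloc": 25.12, "cudaMemcpyH2D": 45.89,
            "cudaLaunchKernel": 17.23, "cudaMemcpyD2H": 41.56,
            "cudaFree": 11.78, "Full Operation": 141.58}

# Relative overhead = (traced - baseline) / baseline, as a percentage.
for op in baseline:
    overhead = (traced[op] - baseline[op]) / baseline[op] * 100
    print(f"{op:<16} : +{overhead:.1f}% overhead")
```

Per-operation percentages are more informative than raw microsecond deltas, since the uprobe cost is roughly fixed per call and therefore weighs more heavily on short operations like `cudaFree` than on the full sequence.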
### Optimization Opportunities
If you find the tracing overhead too high for your use case, there are several ways to reduce it:
1. Trace only specific CUDA functions that are relevant to your investigation
2. Filter by specific process IDs to minimize the number of events captured
3. Disable return probes using the `-r` flag if you don't need return values
4. Consider running eBPF in user-space with tools like [bpftime](https://github.com/eunomia-bpf/bpftime) to reduce context-switching overhead
If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

File diff suppressed because it is too large


@@ -67,6 +67,10 @@ Scheduler:
- [lesson 44-scx-simple](44-scx-simple/README.md) Introduction to the BPF Scheduler
- [lesson 45-scx-nest](45-scx-nest/README.md) Implementing the `scx_nest` Scheduler
GPU:
- [lesson 47-cuda-events](47-cuda-events/README.md) Using eBPF to trace CUDA operations for GPU
Other:
- [lesson 35-user-ringbuf](35-user-ringbuf/README.md) Asynchronously Send to Kernel with User Ring Buffer


@@ -64,6 +64,10 @@ Android:
- [lesson 44-scx-simple](44-scx-simple/README.zh.md) eBPF Tutorial: Introduction to the BPF Scheduler
- [lesson 45-scx-nest](45-scx-nest/README.zh.md) eBPF Example Tutorial: Implementing the `scx_nest` Scheduler
GPU:
- [lesson 47-cuda-events](47-cuda-events/README.zh.md) Tracing CUDA Operations with eBPF
Other:
- [lesson 35-user-ringbuf](35-user-ringbuf/README.zh.md) eBPF Development Practice: Asynchronously Sending Data to the Kernel with a User Ring Buffer