mirror of
https://github.com/eunomia-bpf/bpf-developer-tutorial.git
synced 2026-05-11 18:36:56 +08:00
add chinese doc for cuda
@@ -78,6 +78,11 @@ Scheduler:
- [lesson 44-scx-simple](src/44-scx-simple/README.md) Introduction to the BPF Scheduler
- [lesson 45-scx-nest](src/45-scx-nest/README.md) Implementing the `scx_nest` Scheduler

GPU:

- [lesson 47-cuda-events](src/47-cuda-events/README.md) Using eBPF to Trace CUDA Operations on the GPU

Other:

- [lesson 35-user-ringbuf](src/35-user-ringbuf/README.md) Asynchronously Send to Kernel with User Ring Buffer
@@ -72,6 +72,11 @@ Android:
Scheduler:

- [lesson 44-scx-simple](src/44-scx-simple/README.zh.md) eBPF Tutorial: Introduction to the BPF Scheduler

GPU:

- [lesson 47-cuda-events](src/47-cuda-events/README.zh.md) Tracing CUDA Operations with eBPF

Other:

- [lesson 35-user-ringbuf](src/35-user-ringbuf/README.zh.md) eBPF Development Practice: Asynchronously Sending Data to the Kernel with a User Ring Buffer
@@ -481,16 +481,6 @@ The `cuda_events` tool supports these options:
- `-p PATH`: Specify the path to the CUDA runtime library or application
- `-d PID`: Trace only the specified process ID

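For instance, the two flags can be combined to watch a single process while resolving symbols from a specific runtime library. The library path and PID below are placeholders, not values from the tutorial; attaching uprobes typically requires root privileges:

```shell
# Trace CUDA API calls made by PID 1234, attaching uprobes to the
# CUDA runtime library at the given path (both values are examples).
sudo ./cuda_events -p /usr/local/cuda/lib64/libcudart.so -d 1234
```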
## Learning Objectives
Through this tutorial, you'll learn:

1. How CUDA applications interact with GPUs through the CUDA runtime API
2. How to use eBPF uprobes to trace user-space libraries
3. How to design efficient data structures for kernel-to-user communication
4. How to process and display traced events in a user-friendly format
5. How to filter events by process ID for focused debugging

## Next Steps
Once you're comfortable with this basic CUDA tracing tool, you could extend it to:
@@ -501,6 +491,9 @@ Once you're comfortable with this basic CUDA tracing tool, you could extend it t
4. Create visualizations of CUDA operations for easier analysis
5. Add support for other GPU frameworks like OpenCL or ROCm

For more detail about the CUDA example and tutorial, you can check out our repo and the code at <https://github.com/eunomia-bpf/basic-cuda-tutorial>.

## References

- CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
@@ -508,83 +501,4 @@ Once you're comfortable with this basic CUDA tracing tool, you could extend it t
- libbpf Documentation: https://libbpf.readthedocs.io/
- Linux uprobes Documentation: https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt
## Benchmarking Tracing Overhead
While tracing is an invaluable tool for debugging and understanding CUDA applications, it does introduce some overhead. We've included a benchmarking tool to help you measure this overhead.
### The Benchmark Tool
The `bench.cu` program performs several CUDA operations repeatedly and measures their execution time:

1. Memory allocation (`cudaMalloc`)
2. Memory transfers (host to device and device to host)
3. Kernel launches
4. Memory deallocation (`cudaFree`)
5. Full operations (the complete sequence)

Each operation is executed many times to get statistically significant results, and the average time per operation is reported in microseconds.
### Running the Benchmark

To build the benchmark tool:

```bash
make bench
```

To run a complete benchmark that compares performance with and without tracing:

```bash
make benchmark
```

This will run the benchmark twice:

1. First without any tracing
2. Then with the CUDA events tracer attached

You can also run individual benchmarks:

```bash
# Without tracing
make benchmark-no-trace

# With tracing
make benchmark-with-trace
```

### Interpreting the Results
The benchmark output shows the average time for each CUDA operation in microseconds. Compare the times with and without tracing to understand the overhead.
For example:

```
# Without tracing
cudaMalloc       : 23.45 µs per operation
cudaMemcpyH2D    : 42.67 µs per operation
cudaLaunchKernel : 15.89 µs per operation
cudaMemcpyD2H    : 38.12 µs per operation
cudaFree         : 10.34 µs per operation
Full Operation   : 130.47 µs per operation

# With tracing
cudaMalloc       : 25.12 µs per operation
cudaMemcpyH2D    : 45.89 µs per operation
cudaLaunchKernel : 17.23 µs per operation
cudaMemcpyD2H    : 41.56 µs per operation
cudaFree         : 11.78 µs per operation
Full Operation   : 141.58 µs per operation
```

In this example, tracing adds roughly 7-14% overhead per operation, and about 8.5% for the full sequence. This is typically acceptable for debugging and profiling purposes, but it's important to keep this impact in mind when interpreting the results.
### Optimization Opportunities
If you find the tracing overhead too high for your use case, there are several ways to reduce it:

1. Trace only specific CUDA functions that are relevant to your investigation
2. Filter by specific process IDs to minimize the number of events captured
3. Disable return probes using the `-r` flag if you don't need return values
4. Consider running eBPF in user-space with tools like [bpftime](https://github.com/eunomia-bpf/bpftime) to reduce context-switching overhead

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.
@@ -67,6 +67,10 @@ Scheduler:
- [lesson 44-scx-simple](44-scx-simple/README.md) Introduction to the BPF Scheduler
- [lesson 45-scx-nest](45-scx-nest/README.md) Implementing the `scx_nest` Scheduler

GPU:

- [lesson 47-cuda-events](47-cuda-events/README.md) Using eBPF to Trace CUDA Operations on the GPU

Other:

- [lesson 35-user-ringbuf](35-user-ringbuf/README.md) Asynchronously Send to Kernel with User Ring Buffer
@@ -64,6 +64,10 @@ Android:
- [lesson 44-scx-simple](44-scx-simple/README.zh.md) eBPF Tutorial: Introduction to the BPF Scheduler
- [lesson 45-scx-nest](45-scx-nest/README.zh.md) eBPF Example Tutorial: Implementing the `scx_nest` Scheduler

GPU:

- [lesson 47-cuda-events](47-cuda-events/README.zh.md) Tracing CUDA Operations with eBPF

Other:

- [lesson 35-user-ringbuf](35-user-ringbuf/README.zh.md) eBPF Development Practice: Asynchronously Sending Data to the Kernel with a User Ring Buffer