Enhance GPU and NPU kernel driver documentation and monitoring scripts

- Updated README.zh.md for GPU kernel driver to improve clarity and formatting.
- Added nvidia_driver.bt script for monitoring NVIDIA proprietary GPU driver activity using kernel probes.
- Revised README.md for NPU kernel driver to enhance explanations and correct minor grammatical issues.
yunwei37
2025-10-13 07:18:48 -07:00
parent d4ec997ab2
commit b8cc834d7f
4 changed files with 668 additions and 42 deletions

View File

@@ -12,7 +12,7 @@ GPU tracepoints are instrumentation points built into the kernel's Direct Render
The key insight: kernel tracepoints activate only when events occur, adding nanoseconds of overhead per event. They capture 100% of activity including microsecond-duration jobs. Polling-based monitoring checks GPU state every 100ms and misses short-lived operations entirely.
GPU tracepoints span three layers. **DRM scheduler tracepoints** (`gpu_scheduler` event group) are stable uAPI - their format never changes. They work identically across Intel, AMD, and Nouveau drivers for vendor-neutral monitoring. **Vendor-specific tracepoints** expose driver internals - Intel i915 tracks GEM object creation and VMA binding, AMD AMDGPU monitors buffer objects and command submission. **Generic DRM tracepoints** handle display synchronization through vblank events for diagnosing frame drops.
GPU tracepoints span three layers. DRM scheduler tracepoints (`gpu_scheduler` event group) are stable uAPI; their format never changes. They work identically across Intel, AMD, and Nouveau drivers for vendor-neutral monitoring. Vendor-specific tracepoints expose driver internals. Intel i915 tracks GEM object creation and VMA binding, while AMD AMDGPU monitors buffer objects and command submission. Generic DRM tracepoints handle display synchronization through vblank events for diagnosing frame drops.
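A quick way to see which of these events actually fire on a given machine is to count everything in the stable scheduler group. The following is a minimal sketch, assuming the `gpu_scheduler` tracepoint group described above is present (it requires a DRM driver that uses the common scheduler); per-probe totals print automatically when you stop the trace:

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: count every event in the stable gpu_scheduler group.
 * Totals print automatically on Ctrl-C. */
tracepoint:gpu_scheduler:*
{
    @events[probe] = count();
}
```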
## DRM Scheduler Monitor: Universal GPU Tracking
@@ -109,7 +109,7 @@ END
### Understanding the Script
The script attaches to three stable DRM scheduler tracepoints. When `drm_run_job` fires, a job transitions from "queued in software" to "running on silicon." The tracepoint captures `args->id` (job ID for correlation), `args->name` (ring name - which execution engine like graphics, compute, or video decode), `args->job_count` (queue depth - how many jobs are waiting), and `args->hw_job_count` (jobs currently executing on GPU hardware).
The script attaches to three stable DRM scheduler tracepoints. When `drm_run_job` fires, a job transitions from "queued in software" to "running on silicon." The tracepoint captures `args->id` (job ID for correlation), `args->name` (ring name indicating which execution engine like graphics, compute, or video decode), `args->job_count` (queue depth indicating how many jobs are waiting), and `args->hw_job_count` (jobs currently executing on GPU hardware).
The format `entity=0xffff888... id=12345 fence=0xffff888... ring=gfx job count:5 hw job count:2` tells you job 12345 on the graphics ring started executing with 5 jobs queued behind it and 2 jobs currently running on hardware. Multi-engine GPUs can run jobs in parallel across different rings.
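As a minimal illustration of these fields, the sketch below counts jobs per ring and records the hardware queue depth observed at each `drm_run_job`. It assumes the field names quoted above (`name`, `hw_job_count`) and is a stripped-down version of what the tutorial's `drm_scheduler.bt` does:

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: per-ring job counts and hardware queue depth from drm_run_job.
 * Maps print automatically on Ctrl-C. */
tracepoint:gpu_scheduler:drm_run_job
{
    @jobs_per_ring[str(args->name)] = count();                   /* which engine ran the job */
    @hw_queue_depth[str(args->name)] = hist(args->hw_job_count); /* jobs already on hardware */
}
```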
@@ -125,45 +125,242 @@ At program end, the `END` block prints statistics. `@jobs_per_ring` shows job co
## Intel i915 Tracepoints: Memory Management Deep Dive
Intel's i915 driver exposes detailed tracepoints for memory operations. These require `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` in your kernel config - check with `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)`.
Intel's i915 driver exposes detailed tracepoints for memory operations. These require `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` in your kernel config; check with `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)`.
**i915_gem_object_create** fires when the driver allocates a GEM (Graphics Execution Manager) object - the fundamental unit of GPU-accessible memory. Format: `obj=0xffff888... size=0x100000` indicates allocating a 1MB object. Track total allocated memory over time to detect leaks. Sudden allocation spikes before performance drops suggest memory pressure. Correlate object pointers with subsequent bind/fault events to understand object lifecycle.
i915_gem_object_create fires when the driver allocates a GEM (Graphics Execution Manager) object, the fundamental unit of GPU-accessible memory. Format: `obj=0xffff888... size=0x100000` indicates allocating a 1MB object. Track total allocated memory over time to detect leaks. Sudden allocation spikes before performance drops suggest memory pressure. Correlate object pointers with subsequent bind/fault events to understand object lifecycle.
**i915_vma_bind** tracks mapping memory into GPU address space. Allocating memory isn't enough - it must be bound into GPU virtual address space. Format: `obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` shows 64KB bound at GPU virtual address 0x100000. Frequent rebinding indicates memory thrashing - the driver evicting and rebinding objects under pressure. GPU page faults often correlate with bind operations.
i915_vma_bind tracks mapping memory into GPU address space. Allocating memory isn't enough; it must be bound into GPU virtual address space. Format: `obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` shows 64KB bound at GPU virtual address 0x100000. Frequent rebinding indicates memory thrashing, where the driver evicts and rebinds objects under pressure. GPU page faults often correlate with bind operations.
**i915_gem_shrink** captures memory pressure response. Under memory pressure, the driver reclaims GPU memory. Format: `dev=0 target=0x1000000 flags=0x3` means the driver tries to reclaim 16MB. High shrink activity indicates undersized GPU memory for the workload. Correlate with performance drops - if shrinking happens during frame rendering, it causes stutters.
i915_gem_shrink captures memory pressure response. Under memory pressure, the driver reclaims GPU memory. Format: `dev=0 target=0x1000000 flags=0x3` means the driver tries to reclaim 16MB. High shrink activity indicates undersized GPU memory for the workload. Correlate with performance drops; if shrinking happens during frame rendering, it causes stutters.
**i915_gem_object_fault** tracks page faults when CPU or GPU accesses unmapped memory. Format: `obj=0xffff888... GTT index=128 writable` indicates a write fault on Graphics Translation Table page 128. Faults are expensive - they stall execution while the kernel resolves the missing mapping. Write faults are more expensive than reads (require invalidating caches). GTT faults indicate incomplete resource binding before job submission.
i915_gem_object_fault tracks page faults when CPU or GPU accesses unmapped memory. Format: `obj=0xffff888... GTT index=128 writable` indicates a write fault on Graphics Translation Table page 128. Faults are expensive because they stall execution while the kernel resolves the missing mapping. Write faults are more expensive than reads since they require invalidating caches. GTT faults indicate incomplete resource binding before job submission.
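To turn these i915 events into numbers you can act on, a minimal sketch along the following lines tracks allocation volume per process and memory-pressure shrinks. The field names (`size`, `target`) are taken from the formats above; verify them under `/sys/kernel/tracing/events/i915/` on your kernel before relying on this:

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: GEM allocation volume per process plus shrink activity.
 * Some i915 events are gated behind CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS (see above). */
tracepoint:i915:i915_gem_object_create
{
    @gem_bytes_allocated[comm] = sum(args->size);   /* steadily growing totals may indicate a leak */
}
tracepoint:i915:i915_gem_shrink
{
    @shrink_events = count();
    @shrink_target_bytes = sum(args->target);       /* how much the driver tried to reclaim */
}
```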
## AMD AMDGPU Tracepoints: Command Submission Pipeline
AMD's AMDGPU driver provides comprehensive tracing of command submission and hardware interrupts.
**amdgpu_cs_ioctl** captures userspace command submission. When an application submits GPU work via ioctl, this tracepoint fires. Format: `sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` shows job 12345 submitted to graphics ring with 2 indirect buffers. This marks when userspace hands off work to kernel. Record timestamp to measure submission-to-execution latency when combined with `amdgpu_sched_run_job`. High frequency indicates small batches - potential for better batching.
amdgpu_cs_ioctl captures userspace command submission. When an application submits GPU work via ioctl, this tracepoint fires. Format: `sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` shows job 12345 submitted to the graphics ring with 2 indirect buffers. This marks when userspace hands off work to the kernel. Record the timestamp to measure submission-to-execution latency when combined with `amdgpu_sched_run_job`. High frequency indicates small batches and potential for better batching.
**amdgpu_sched_run_job** fires when the kernel scheduler starts executing a previously submitted job. Comparing timestamps with `amdgpu_cs_ioctl` reveals submission latency. Submission latencies over 100μs indicate kernel scheduling delays. Per-ring latencies show if specific engines are scheduling-bound.
amdgpu_sched_run_job fires when the kernel scheduler starts executing a previously submitted job. Comparing timestamps with `amdgpu_cs_ioctl` reveals submission latency. Submission latencies over 100μs indicate kernel scheduling delays. Per-ring latencies show if specific engines are scheduling-bound.
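A minimal sketch of that latency measurement could look like the following. It keys the two events by job ID; the field name `sched_job_id` is an assumption based on the `sched_job=` value in the printed format, so check `/sys/kernel/tracing/events/amdgpu/*/format` for the exact names before use:

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: submission-to-execution latency per job (values over ~100us
 * suggest kernel scheduling delays, as discussed above). */
tracepoint:amdgpu:amdgpu_cs_ioctl
{
    @submit_ts[args->sched_job_id] = nsecs;           /* userspace handed work to the kernel */
}
tracepoint:amdgpu:amdgpu_sched_run_job
{
    $ts = @submit_ts[args->sched_job_id];
    if ($ts) {
        @submit_latency_us = hist((nsecs - $ts) / 1000);
        delete(@submit_ts[args->sched_job_id]);
    }
}
```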
**amdgpu_bo_create** tracks buffer object allocation - AMD's equivalent to i915 GEM objects. Format: `bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1` allocates 1MB (256 pages). Type indicates VRAM vs GTT (system memory accessible by GPU). Preferred/allowed domains show placement policy. Type mismatches (requesting VRAM but falling back to GTT) indicate VRAM exhaustion. Visible flag indicates CPU-accessible memory - expensive, use sparingly.
amdgpu_bo_create tracks buffer object allocation, AMD's equivalent to i915 GEM objects. Format: `bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1` allocates 1MB (256 pages). Type indicates VRAM vs GTT (system memory accessible by GPU). Preferred/allowed domains show placement policy. Type mismatches where VRAM is requested but GTT is used indicate VRAM exhaustion. Visible flag indicates CPU-accessible memory, which is expensive and should be used sparingly.
**amdgpu_bo_move** fires when buffer objects migrate between VRAM and GTT. Migrations are expensive (require copying data over PCIe). Excessive moves indicate memory thrashing - working set exceeds VRAM capacity. Measure move frequency and size to quantify PCIe bandwidth consumption. Correlate with performance drops - migrations stall GPU execution.
amdgpu_bo_move fires when buffer objects migrate between VRAM and GTT. Migrations are expensive because they require copying data over PCIe. Excessive moves indicate memory thrashing where the working set exceeds VRAM capacity. Measure move frequency and size to quantify PCIe bandwidth consumption. Correlate with performance drops since migrations stall GPU execution.
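To quantify that thrashing, a small sketch like the one below counts migrations per process in five-second windows; only the tracepoint name is assumed, and no field access is needed:

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: print a per-process VRAM<->GTT migration rate every 5 seconds. */
tracepoint:amdgpu:amdgpu_bo_move
{
    @bo_moves[comm] = count();
}
interval:s:5
{
    print(@bo_moves);
    clear(@bo_moves);
}
```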
**amdgpu_iv** captures GPU interrupts. The GPU signals interrupts for completed work, errors, and events. Format: `ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` captures interrupt details. Source ID indicates interrupt type (completion, fault, thermal). High interrupt rates impact CPU performance. VMID and PASID identify which process/VM triggered the interrupt - critical for multi-tenant debugging.
amdgpu_iv captures GPU interrupts. The GPU signals interrupts for completed work, errors, and events. Format: `ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` captures interrupt details. Source ID indicates interrupt type (completion, fault, thermal). High interrupt rates impact CPU performance. VMID and PASID identify which process/VM triggered the interrupt, which is critical for multi-tenant debugging.
## DRM Vblank Tracepoints: Display Synchronization
Vblank (vertical blanking) events synchronize rendering with display refresh. Missing vblanks causes dropped frames and stutter.
**drm_vblank_event** fires when the display enters vertical blanking period. Format: `crtc=0 seq=12345 time=1234567890 high-prec=true` indicates vblank on display controller 0, sequence number 12345. Track vblank frequency to verify refresh rate (60Hz = 60 vblanks/second). Missed sequences indicate frame drops. High-precision timestamps enable sub-millisecond frame timing analysis.
drm_vblank_event fires when the display enters vertical blanking period. Format: `crtc=0 seq=12345 time=1234567890 high-prec=true` indicates vblank on display controller 0, sequence number 12345. Track vblank frequency to verify refresh rate (60Hz = 60 vblanks/second). Missed sequences indicate frame drops. High-precision timestamps enable sub-millisecond frame timing analysis.
**drm_vblank_event_queued** and **drm_vblank_event_delivered** track vblank event delivery to userspace. Queuing latency (queue to delivery) measures kernel scheduling delay. Total latency (vblank to delivery) includes both kernel and driver processing. Latencies over 1ms indicate compositor problems. Correlate with frame drops visible to users - events delivered late mean missed frames.
drm_vblank_event_queued and drm_vblank_event_delivered track vblank event delivery to userspace. Queuing latency (queue to delivery) measures kernel scheduling delay. Total latency (vblank to delivery) includes both kernel and driver processing. Latencies over 1ms indicate compositor problems. Correlate with frame drops visible to users since events delivered late mean missed frames.
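A minimal sketch for frame-timing analysis measures the gap between consecutive vblanks on each CRTC; at 60Hz you expect roughly 16.7ms, and larger buckets point to missed frames. The `crtc` field name follows the format shown above:

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: histogram of vblank-to-vblank intervals per display controller. */
tracepoint:drm:drm_vblank_event
{
    $crtc = args->crtc;
    if (@last_vblank[$crtc]) {
        @vblank_interval_ms[$crtc] = hist((nsecs - @last_vblank[$crtc]) / 1000000);
    }
    @last_vblank[$crtc] = nsecs;
}
```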
## NVIDIA Proprietary Driver: Different Architecture
Unlike Intel, AMD, and Nouveau which use the kernel's Direct Rendering Manager (DRM) subsystem, NVIDIA's proprietary driver (nvidia.ko) operates outside DRM. It implements its own kernel module interface with vendor-specific functions and a single tracepoint. This architectural difference means NVIDIA GPUs require different monitoring approaches; we attach to kernel probes on nvidia.ko functions instead of DRM tracepoints.
The key distinction: DRM drivers expose standardized `gpu_scheduler` tracepoints that work identically across vendors. NVIDIA's closed-source driver provides only one tracepoint (`nvidia:nvidia_dev_xid` for hardware errors) and requires monitoring internal kernel functions like `nvidia_open`, `nvidia_unlocked_ioctl`, and `nvidia_isr`. This makes NVIDIA monitoring more fragile since function names can change between driver versions, but it still provides valuable insights into GPU activity.
### NVIDIA Driver Monitoring: nvidia_driver.bt
The `nvidia_driver.bt` script tracks NVIDIA GPU operations through kernel probes on the proprietary driver. Unlike DRM scheduler monitoring which is vendor-neutral, this script is NVIDIA-specific and requires the proprietary nvidia.ko module loaded.
The script monitors six key areas:
- **Device operations**: Tracks when processes open/close GPU devices and issue ioctl commands
- **Memory management**: Records mmap operations, page faults, and VMA lifecycle
- **Interrupt handling**: Measures ISR latency from hardware interrupt to processing
- **P2P communication**: Captures GPU-to-GPU page requests and DMA mapping
- **Power management**: Times suspend/resume cycles
- **Error reporting**: Reports Xid hardware/driver errors immediately
### Complete Bpftrace Script: scripts/nvidia_driver.bt
```c
#!/usr/bin/env bpftrace
/* nvidia_driver.bt - Monitor NVIDIA proprietary GPU driver activity */
BEGIN
{
printf("Tracing NVIDIA GPU driver activity... Hit Ctrl-C to end.\n");
printf("%-12s %-18s %-16s %-8s %-8s %-20s\n",
"TIME(ms)", "EVENT", "COMM", "PID", "GPU_ID", "DETAILS");
}
kprobe:nvidia_open
{
printf("%-12llu %-18s %-16s %-8d %-8s %s\n",
elapsed / 1000000, "OPEN", comm, pid, "-", "GPU device opened");
@opens[comm] = count();
@open_pids[pid] = 1;
}
kprobe:nvidia_unlocked_ioctl
{
@ioctl_count = count();
@ioctls_per_process[comm] = count();
if (rand % 100 == 0) { /* Sample 1% */
printf("%-12llu %-18s %-16s %-8d %-8s cmd=0x%lx\n",
elapsed / 1000000, "IOCTL", comm, pid, "-", arg1);
}
}
kprobe:nvidia_mmap
{
@mmap_count = count();
@total_mmap_bytes = sum(arg2);
printf("%-12llu %-18s %-16s %-8d %-8s offset=0x%lx size=%lu\n",
elapsed / 1000000, "MMAP", comm, pid, "-", arg1, arg2);
}
kprobe:nvidia_isr
{
@isr_count = count();
@last_isr_time = nsecs;
}
kprobe:nvidia_isr_kthread_bh
{
@isr_bh_count = count();
if (@last_isr_time > 0) {
@isr_latency_us = hist((nsecs - @last_isr_time) / 1000);
}
}
tracepoint:nvidia:nvidia_dev_xid
{
printf("\n!!! GPU ERROR !!!\n");
printf(" └─ Xid: %u - %s\n\n", args->error_code, str(args->msg));
@xid_errors = count();
@xid_codes[args->error_code] = count();
}
END
{
printf("\n=== NVIDIA GPU Driver Statistics ===\n");
printf("Opens by process:\n"); print(@opens);
printf("Total ioctls:\n"); print(@ioctl_count);
printf("Top ioctl callers:\n"); print(@ioctls_per_process);
printf("Total mmaps:\n"); print(@mmap_count);
printf("Poll calls:\n"); print(@poll_count);
}
```
### Understanding NVIDIA Driver Operations
Device Operations: `nvidia_open` fires when a process opens `/dev/nvidia0` (or other GPU device nodes). This is the entry point for GPU access. CUDA applications, OpenGL contexts, and compute workloads all start here. Track `@opens[comm]` to see which applications use the GPU. Each open usually corresponds to a CUDA context or graphics context creation.
IOCTL Commands: `nvidia_unlocked_ioctl` is the highest-frequency operation. Every GPU command submission, memory allocation, synchronization, and query goes through ioctls. A single frame of graphics rendering may issue hundreds of ioctls. The script samples 1% of ioctls to reduce overhead while maintaining visibility. High ioctl rates (>100k/sec) indicate fine-grained GPU interactions and potential for better batching. The `arg1` parameter contains the ioctl command code identifying the operation type.
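If you want the absolute ioctl rate rather than the 1% sample, a standalone sketch like the following (using the same `nvidia_unlocked_ioctl` kprobe the main script attaches to) prints a per-second count you can compare against the rough 100k/sec threshold:

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: per-second NVIDIA ioctl rate. */
kprobe:nvidia_unlocked_ioctl
{
    @ioctls = count();
}
interval:s:1
{
    print(@ioctls);
    clear(@ioctls);
}
```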
Memory Mapping: `nvidia_mmap` maps GPU memory into process virtual address space, enabling CPU access to GPU buffers. Format `offset=0x100000 size=1048576` maps 1MB of GPU memory. Track `@total_mmap_bytes` to understand GPU memory usage. Frequent large mmaps may indicate CPU-GPU data transfer patterns. Unified memory (CUDA managed memory) triggers extensive mmap activity as the driver migrates pages between CPU and GPU.
Page Faults: `nvidia_fault` captures expensive events when CPU or GPU accesses unmapped memory. Page faults stall execution while the driver resolves the mapping. High fault counts indicate unified memory page migration under memory pressure, incomplete memory binding before kernel launch, or CPU accessing GPU memory without proper mapping. Correlate faults with performance drops. Faults during critical sections (kernel execution) directly impact throughput.
Interrupt Handling: `nvidia_isr` fires when the GPU signals an interrupt, typically for completed work, errors, or synchronization events. Modern GPUs use MSI-X interrupts for lower latency. The bottom-half handler (`nvidia_isr_kthread_bh`) performs the actual work processing. ISR latency (time from hardware interrupt to bottom-half processing) indicates kernel scheduling efficiency. High ISR rates (>10k/sec) may impact CPU performance since each interrupt costs CPU cycles.
P2P Transfers: `nvidia_p2p_get_pages` and `nvidia_p2p_dma_map_pages` enable direct GPU-to-GPU transfers over NVLink or PCIe without CPU involvement. Multi-GPU workloads (distributed training, GPU clusters) rely on P2P for high bandwidth. Track P2P operations to verify GPU-GPU communication is working. Missing P2P support (older PCIe configurations) forces slower CPU-mediated transfers.
Xid Errors: The `nvidia:nvidia_dev_xid` tracepoint is NVIDIA's only exposed tracepoint. Xid errors indicate hardware problems (GPU faults, memory errors, thermal issues) or driver bugs. Common Xids include Xid 31 (GPU memory page fault), Xid 43 (GPU stopped responding/hang), Xid 48 (double-bit ECC memory error), and Xid 79 (GPU fell off the bus/PCIe error). Any Xid error requires investigation, since these events often precede crashes or data corruption.
### Running NVIDIA Driver Monitor
Verify NVIDIA driver is loaded and check available probes:
```bash
# Check NVIDIA driver module
lsmod | grep nvidia
# List available NVIDIA probes
sudo bpftrace -l 'kprobe:nvidia_*' | head -20
sudo bpftrace -l 'tracepoint:nvidia:*'
```
Run the monitor during GPU workloads:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/nvidia_driver.bt
```
**Real execution output** capturing llama-server (LLM inference), nvtop (GPU monitoring), and CUDA application cleanup:
```
Attaching 18 probes...
Tracing NVIDIA GPU driver activity... Hit Ctrl-C to end.
TIME(ms) EVENT COMM PID GPU_ID DETAILS
2627 IOCTL nvtop 759434 - cmd=0xc020462a
38984 CLOSE python 783815 - GPU device closed
70693 CLOSE cuda00001400006 781802 - GPU device closed
72427 OPEN llama-server 800150 - GPU device opened
72427 CLOSE llama-server 800150 - GPU device closed
72427 OPEN llama-server 800150 - GPU device opened
72428 OPEN llama-server 800150 - GPU device opened
72431 MMAP llama-server 800150 - offset=0xffff968357d37140 size=...
72448 OPEN llama-server 800150 - GPU device opened
72458 OPEN llama-server 800150 - GPU device opened
... (39 opens, 26 mmaps from llama-server during initialization)
========================================
NVIDIA GPU Driver Statistics
========================================
--- Device Operations ---
Opens by process:
@opens[llama-server]: 39
Closes by process:
@closes[llama-server]: 1
@closes[python]: 8
@closes[cuda00001400006]: 38
Total ioctls:
@ioctl_count: 2779
Top ioctl callers:
@ioctls_per_process[llama-server]: 422
@ioctls_per_process[nvtop]: 2357
Total mmaps:
@mmap_count: 26
Total mmap bytes:
@total_mmap_bytes: 18446744046197555104
--- Memory Management ---
Total page faults:
@fault_count: 0
VMA releases:
@vma_release_count: 29
--- Interrupt Handling ---
Total ISR calls:
@isr_count: 0
--- Async Operations ---
Poll calls:
@poll_count: 24254
Currently open PIDs:
@open_pids[800150]: 1
```
**Analysis**: This real-world trace reveals several patterns. The llama-server process opened the GPU device 39 times during initialization, which is typical for LLM inference engines that initialize multiple CUDA contexts for different model layers or batching strategies. The 422 ioctls from llama-server indicate active inference work. The nvtop monitoring tool issued 2,357 ioctls polling GPU state. The script captured 38 device closes from a terminating CUDA application (cuda00001400006) and 8 from a Python process, showing cleanup patterns. The 24,254 poll calls indicate high async I/O activity from monitoring tools. Zero page faults suggest all memory was properly pre-allocated. Zero ISR events during this capture window indicate the GPU was between computation batches; ISRs fire when GPU work completes. No Xid errors means healthy hardware operation. The currently open PID 800150 (llama-server) remained active after the trace ended.
## Running the Monitor Scripts
Navigate to the scripts directory and run the DRM scheduler monitor. It works on all GPUs:
Navigate to the tutorial directory and run the appropriate monitor for your GPU.
**For DRM-based GPUs (Intel, AMD, Nouveau)** - Universal monitoring:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver/scripts
sudo bpftrace drm_scheduler.bt
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/drm_scheduler.bt
```
**For NVIDIA Proprietary Driver**:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/nvidia_driver.bt
```
Expected output:

View File

@@ -10,7 +10,7 @@ GPU 跟踪点是内核直接渲染管理器DRM子系统中内置的仪器
关键洞察:内核跟踪点仅在事件发生时激活,每个事件添加纳秒级开销。它们捕获 100% 的活动,包括微秒级持续时间的作业。基于轮询的监控每 100ms 检查一次 GPU 状态,完全错过短期操作。
GPU 跟踪点跨越三层。**DRM 调度器跟踪点**`gpu_scheduler` 事件组是稳定的 uAPI - 格式永不改变。它们在 Intel、AMD 和 Nouveau 驱动上工作完全相同,适合供应商中立的监控。**供应商特定跟踪点**暴露驱动内部 - Intel i915 跟踪 GEM 对象创建和 VMA 绑定AMD AMDGPU 监控缓冲对象和命令提交。**通用 DRM 跟踪点**通过 vblank 事件处理显示同步,用于诊断丢帧。
GPU 跟踪点跨越三层。DRM 调度器跟踪点(`gpu_scheduler` 事件组)是稳定的 uAPI,格式永不改变。它们在 Intel、AMD 和 Nouveau 驱动上工作完全相同,适合供应商中立的监控。供应商特定跟踪点暴露驱动内部Intel i915 跟踪 GEM 对象创建和 VMA 绑定,AMD AMDGPU 监控缓冲对象和命令提交。通用 DRM 跟踪点通过 vblank 事件处理显示同步,用于诊断丢帧。
## DRM 调度器监视器:通用 GPU 跟踪
@@ -123,29 +123,29 @@ END
## Intel i915 跟踪点:内存管理深入分析
Intel 的 i915 驱动暴露了内存操作的详细跟踪点。这些需要内核配置中的 `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` - 使用 `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)` 检查。
Intel 的 i915 驱动暴露了内存操作的详细跟踪点。这些需要内核配置中的 `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y`,使用 `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)` 检查。
**i915_gem_object_create** 在驱动分配 GEM图形执行管理器对象时触发 - GPU 可访问内存的基本单位。格式:`obj=0xffff888... size=0x100000` 表示分配 1MB 对象。随时间跟踪总分配内存以检测泄漏。性能下降前的突然分配峰值表示内存压力。将对象指针与后续绑定/故障事件关联以了解对象生命周期。
i915_gem_object_create 在驱动分配 GEM(图形执行管理器)对象时触发,这是 GPU 可访问内存的基本单位。格式:`obj=0xffff888... size=0x100000` 表示分配 1MB 对象。随时间跟踪总分配内存以检测泄漏。性能下降前的突然分配峰值表示内存压力。将对象指针与后续绑定/故障事件关联以了解对象生命周期。
**i915_vma_bind** 跟踪将内存映射到 GPU 地址空间。分配内存还不够 - 它必须绑定到 GPU 虚拟地址空间。格式:`obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` 显示在 GPU 虚拟地址 0x100000 处绑定的 64KB。频繁的重新绑定表示内存抖动 - 驱动在压力下驱逐和重新绑定对象。GPU 页面故障通常与绑定操作相关。
i915_vma_bind 跟踪将内存映射到 GPU 地址空间。分配内存还不够,它必须绑定到 GPU 虚拟地址空间。格式:`obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` 显示在 GPU 虚拟地址 0x100000 处绑定的 64KB。频繁的重新绑定表示内存抖动,即驱动在压力下驱逐和重新绑定对象。GPU 页面故障通常与绑定操作相关。
**i915_gem_shrink** 捕获内存压力响应。在内存压力下,驱动回收 GPU 内存。格式:`dev=0 target=0x1000000 flags=0x3` 意味着驱动尝试回收 16MB。高收缩活动表示工作负载的 GPU 内存过小。与性能下降关联 - 如果在帧渲染期间发生收缩,会导致卡顿。
i915_gem_shrink 捕获内存压力响应。在内存压力下,驱动回收 GPU 内存。格式:`dev=0 target=0x1000000 flags=0x3` 意味着驱动尝试回收 16MB。高收缩活动表示工作负载的 GPU 内存过小。与性能下降关联,如果在帧渲染期间发生收缩,会导致卡顿。
**i915_gem_object_fault** 跟踪 CPU 或 GPU 访问未映射内存时的页面故障。格式:`obj=0xffff888... GTT index=128 writable` 表示图形转换表页 128 上的写故障。故障代价昂贵 - 它们在内核解决缺失映射时停止执行。写故障比读故障更昂贵需要使缓存失效。GTT 故障表示作业提交前资源绑定不完整。
i915_gem_object_fault 跟踪 CPU 或 GPU 访问未映射内存时的页面故障。格式:`obj=0xffff888... GTT index=128 writable` 表示图形转换表页 128 上的写故障。故障代价昂贵,因为它们在内核解决缺失映射时停止执行。写故障比读故障更昂贵,因为需要使缓存失效。GTT 故障表示作业提交前资源绑定不完整。
## AMD AMDGPU 跟踪点:命令提交管道
AMD 的 AMDGPU 驱动提供命令提交和硬件中断的全面跟踪。
**amdgpu_cs_ioctl** 捕获用户空间命令提交。当应用通过 ioctl 提交 GPU 工作时,此跟踪点触发。格式:`sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` 显示提交到图形环的作业 12345 有 2 个间接缓冲区。这标志着用户空间将工作交给内核的时间。记录时间戳以在与 `amdgpu_sched_run_job` 结合时测量提交到执行的延迟。高频率表示小批次 - 更好批处理的潜力。
amdgpu_cs_ioctl 捕获用户空间命令提交。当应用通过 ioctl 提交 GPU 工作时,此跟踪点触发。格式:`sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` 显示提交到图形环的作业 12345 有 2 个间接缓冲区。这标志着用户空间将工作交给内核的时间。记录时间戳以在与 `amdgpu_sched_run_job` 结合时测量提交到执行的延迟。高频率表示小批次更好批处理的潜力。
**amdgpu_sched_run_job** 在内核调度器开始执行先前提交的作业时触发。将时间戳与 `amdgpu_cs_ioctl` 比较可揭示提交延迟。超过 100μs 的提交延迟表示内核调度延迟。每个环的延迟显示特定引擎是否受调度限制。
amdgpu_sched_run_job 在内核调度器开始执行先前提交的作业时触发。将时间戳与 `amdgpu_cs_ioctl` 比较可揭示提交延迟。超过 100μs 的提交延迟表示内核调度延迟。每个环的延迟显示特定引擎是否受调度限制。
**amdgpu_bo_create** 跟踪缓冲对象分配 - AMD 的 i915 GEM 对象等价物。格式:`bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1` 分配 1MB256 页。类型表示 VRAM 与 GTTGPU 可访问的系统内存。首选/允许域显示放置策略。类型不匹配(请求 VRAM 但回退到 GTT表示 VRAM 耗尽。可见标志表示 CPU 可访问的内存 - 昂贵,谨慎使用。
amdgpu_bo_create 跟踪缓冲对象分配,这是 AMD 的 i915 GEM 对象等价物。格式:`bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1` 分配 1MB(256 页)。类型表示 VRAM 与 GTT(GPU 可访问的系统内存)。首选/允许域显示放置策略。请求 VRAM 但使用 GTT 的类型不匹配表示 VRAM 耗尽。可见标志表示 CPU 可访问的内存,这很昂贵,谨慎使用。
**amdgpu_bo_move** 在缓冲对象在 VRAM 和 GTT 之间迁移时触发。迁移代价昂贵需要通过 PCIe 复制数据。过度的移动表示内存抖动 - 工作集超过 VRAM 容量。测量移动频率和大小以量化 PCIe 带宽消耗。与性能下降关联 - 迁移停止 GPU 执行。
amdgpu_bo_move 在缓冲对象在 VRAM 和 GTT 之间迁移时触发。迁移代价昂贵,因为需要通过 PCIe 复制数据。过度的移动表示内存抖动,即工作集超过 VRAM 容量。测量移动频率和大小以量化 PCIe 带宽消耗。与性能下降关联,因为迁移停止 GPU 执行。
**amdgpu_iv** 捕获 GPU 中断。GPU 为完成的工作、错误和事件发出中断信号。格式:`ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` 捕获中断详细信息。源 ID 表示中断类型完成、故障、热。高中断率影响 CPU 性能。VMID 和 PASID 识别哪个进程/VM 触发了中断 - 对于多租户调试至关重要。
amdgpu_iv 捕获 GPU 中断。GPU 为完成的工作、错误和事件发出中断信号。格式:`ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` 捕获中断详细信息。源 ID 表示中断类型(完成、故障、热)。高中断率影响 CPU 性能。VMID 和 PASID 识别哪个进程/VM 触发了中断,这对于多租户调试至关重要。
## DRM Vblank 跟踪点:显示同步
@@ -155,13 +155,111 @@ Vblank垂直消隐事件将渲染与显示刷新同步。错过 vblank 会
**drm_vblank_event_queued** 和 **drm_vblank_event_delivered** 跟踪 vblank 事件传递到用户空间。排队延迟(队列到传递)测量内核调度延迟。总延迟(vblank 到传递)包括内核和驱动处理。超过 1ms 的延迟表示合成器问题。与用户可见的丢帧关联 - 延迟传递的事件意味着错过的帧。
## 运行监控脚本
## NVIDIA 专有驱动:不同的架构
导航到脚本目录并运行 DRM 调度器监视器。它在所有 GPU 上工作:
与使用内核直接渲染管理器(DRM)子系统的 Intel、AMD 和 Nouveau 不同,**NVIDIA 的专有驱动(nvidia.ko)在 DRM 之外运行**。它实现了自己的内核模块接口,带有供应商特定的函数和单个跟踪点。这种架构差异意味着 NVIDIA GPU 需要不同的监控方法 - 我们附加到 nvidia.ko 函数的内核探针,而不是 DRM 跟踪点。
关键区别:DRM 驱动暴露标准化的 `gpu_scheduler` 跟踪点,在供应商之间工作完全相同。NVIDIA 的闭源驱动只提供一个跟踪点(`nvidia:nvidia_dev_xid` 用于硬件错误),需要监控内部内核函数如 `nvidia_open`、`nvidia_unlocked_ioctl` 和 `nvidia_isr`。这使得 NVIDIA 监控更脆弱 - 函数名称可能在驱动版本之间改变 - 但仍然提供有价值的 GPU 活动洞察。
### NVIDIA 驱动监控:nvidia_driver.bt
`nvidia_driver.bt` 脚本通过对专有驱动的内核探针跟踪 NVIDIA GPU 操作。与供应商中立的 DRM 调度器监控不同,此脚本是 NVIDIA 特定的,需要加载专有 nvidia.ko 模块。完整源代码可在 `scripts/nvidia_driver.bt` 中找到。
**关键脚本特性:**
脚本附加 18 个内核探针以监控:
- **设备操作**:open、close、ioctl(采样 1% 以降低开销)
- **内存管理**:mmap、页故障、VMA 操作
- **中断处理**:ISR、MSI-X、下半部处理程序及延迟直方图
- **P2P 通信**:GPU 到 GPU 的页面请求和 DMA 映射
- **电源管理**:挂起/恢复周期及持续时间跟踪
- **错误报告**:通过 `nvidia:nvidia_dev_xid` 跟踪点报告 Xid 硬件/驱动错误
**运行 NVIDIA 驱动监视器**
验证 NVIDIA 驱动已加载并检查可用探针:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver/scripts
sudo bpftrace drm_scheduler.bt
# 检查 NVIDIA 驱动模块
lsmod | grep nvidia
# 列出可用的 NVIDIA 探针
sudo bpftrace -l 'kprobe:nvidia_*' | head -20
sudo bpftrace -l 'tracepoint:nvidia:*'
```
在 GPU 工作负载期间运行监视器:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/nvidia_driver.bt
```
**真实执行输出**,捕获 llama-server(LLM 推理)、nvtop(GPU 监控)和 CUDA 应用清理:
```
Attaching 18 probes...
Tracing NVIDIA GPU driver activity... Hit Ctrl-C to end.
TIME(ms) EVENT COMM PID GPU_ID DETAILS
2627 IOCTL nvtop 759434 - cmd=0xc020462a
38984 CLOSE python 783815 - GPU device closed
70693 CLOSE cuda00001400006 781802 - GPU device closed
72427 OPEN llama-server 800150 - GPU device opened
72427 CLOSE llama-server 800150 - GPU device closed
72427 OPEN llama-server 800150 - GPU device opened
72428 OPEN llama-server 800150 - GPU device opened
72431 MMAP llama-server 800150 - offset=0xffff968357d37140 size=...
72448 OPEN llama-server 800150 - GPU device opened
... (在初始化期间 llama-server 的 39 次 open、26 次 mmap)
========================================
NVIDIA GPU Driver Statistics
========================================
--- Device Operations ---
Opens by process:
@opens[llama-server]: 39
Closes by process:
@closes[llama-server]: 1
@closes[python]: 8
@closes[cuda00001400006]: 38
Total ioctls:
@ioctl_count: 2779
Top ioctl callers:
@ioctls_per_process[llama-server]: 422
@ioctls_per_process[nvtop]: 2357
Total mmaps:
@mmap_count: 26
--- Async Operations ---
Poll calls:
@poll_count: 24254
Currently open PIDs:
@open_pids[800150]: 1
```
**分析**:这个真实世界的跟踪揭示了几个模式。llama-server 进程在初始化期间打开了 GPU 设备 39 次 - 对于为不同模型层或批处理策略初始化多个 CUDA 上下文的 LLM 推理引擎来说很典型。来自 llama-server 的 422 次 ioctl 表示活跃的推理工作。nvtop 监控工具发出了 2,357 次 ioctl 轮询 GPU 状态。脚本捕获了来自终止 CUDA 应用(cuda00001400006)的 38 次设备关闭和来自 Python 进程的 8 次 - 显示清理模式。24,254 次轮询调用表示来自监控工具的高异步 I/O 活动。零页故障表明所有内存都已正确预分配。零 Xid 错误意味着硬件运行正常。当前打开的 PID 800150(llama-server)在跟踪结束后仍保持活动状态。
## 运行监控脚本
导航到教程目录并根据你的 GPU 运行适当的监视器。
**对于基于 DRM 的 GPU(Intel、AMD、Nouveau)** - 通用监控:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/drm_scheduler.bt
```
**对于 NVIDIA 专有驱动**:
```bash
cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver
sudo bpftrace scripts/nvidia_driver.bt
```
预期输出:

View File

@@ -0,0 +1,331 @@
#!/usr/bin/env bpftrace
/*
* nvidia_driver.bt - Monitor NVIDIA proprietary GPU driver activity
*
* This script tracks NVIDIA GPU operations using kernel probes on the
* proprietary nvidia.ko driver. Unlike DRM drivers, NVIDIA uses its own
* kernel interface with vendor-specific tracepoints and functions.
*
* Key monitoring areas:
* - GPU operations (open/close/ioctl/mmap)
* - Interrupt handling (ISR activity)
* - Memory management (page faults)
* - P2P transfers (GPU-to-GPU communication)
* - Error reporting (Xid errors)
* - Power management (suspend/resume)
*
* Usage: sudo bpftrace nvidia_driver.bt
*/
BEGIN
{
printf("Tracing NVIDIA GPU driver activity... Hit Ctrl-C to end.\n");
printf("%-12s %-18s %-16s %-8s %-8s %-20s\n",
"TIME(ms)", "EVENT", "COMM", "PID", "GPU_ID", "DETAILS");
}
/* ========== GPU Device Operations ========== */
/* GPU device opened by application */
kprobe:nvidia_open
{
printf("%-12llu %-18s %-16s %-8d %-8s %s\n",
elapsed / 1000000,
"OPEN",
comm,
pid,
"-",
"GPU device opened");
@opens[comm] = count();
@open_pids[pid] = 1;
}
/* GPU device closed */
kprobe:nvidia_close
{
printf("%-12llu %-18s %-16s %-8d %-8s %s\n",
elapsed / 1000000,
"CLOSE",
comm,
pid,
"-",
"GPU device closed");
@closes[comm] = count();
delete(@open_pids[pid]);
}
/* ioctl commands to GPU (most frequent operation) */
kprobe:nvidia_unlocked_ioctl
{
@ioctl_count = count();
@ioctls_per_process[comm] = count();
/* Sample only 1% to reduce overhead */
if (rand % 100 == 0) {
printf("%-12llu %-18s %-16s %-8d %-8s cmd=0x%lx\n",
elapsed / 1000000,
"IOCTL",
comm,
pid,
"-",
arg1);
}
}
/* Memory mapping operations */
kprobe:nvidia_mmap
{
$offset = arg1;
$size = arg2;
printf("%-12llu %-18s %-16s %-8d %-8s offset=0x%lx size=%lu\n",
elapsed / 1000000,
"MMAP",
comm,
pid,
"-",
$offset,
$size);
@mmap_count = count();
@total_mmap_bytes = sum($size);
}
/* GPU page faults (when GPU or CPU accesses unmapped memory) */
kprobe:nvidia_fault
{
$address = arg1;
printf("%-12llu %-18s %-16s %-8d %-8s addr=0x%lx\n",
elapsed / 1000000,
"PAGE_FAULT",
comm,
pid,
"-",
$address);
@fault_count = count();
@faults_per_process[comm] = count();
}
/* ========== Interrupt Handling ========== */
/* GPU interrupt handler (high frequency event) */
kprobe:nvidia_isr
{
@isr_count = count();
@last_isr_time = nsecs;
}
/* MSI-X interrupt handler (modern GPUs) */
kprobe:nvidia_isr_msix
{
@isr_msix_count = count();
}
/* Bottom-half interrupt handler (actual work processing) */
kprobe:nvidia_isr_kthread_bh
{
@isr_bh_count = count();
/* Calculate ISR latency if we have last ISR time */
if (@last_isr_time > 0) {
$latency_us = (nsecs - @last_isr_time) / 1000;
@isr_latency_us = hist($latency_us);
}
}
/* ========== P2P GPU-to-GPU Communication ========== */
/* P2P memory pages requested (for GPU-GPU transfers) */
kprobe:nvidia_p2p_get_pages
{
$offset = arg1;
$size = arg2;
printf("%-12llu %-18s %-16s %-8d %-8s offset=0x%lx entries=%lu\n",
elapsed / 1000000,
"P2P_GET_PAGES",
comm,
pid,
"-",
$offset,
$size);
@p2p_get_count = count();
}
/* P2P DMA mapping for direct GPU-GPU transfers */
kprobe:nvidia_p2p_dma_map_pages
{
printf("%-12llu %-18s %-16s %-8d %-8s %s\n",
elapsed / 1000000,
"P2P_DMA_MAP",
comm,
pid,
"-",
"DMA mapping for P2P");
@p2p_dma_map_count = count();
}
/* ========== Power Management ========== */
/* GPU entering suspend */
kprobe:nvidia_suspend
{
printf("%-12llu %-18s %-16s %-8d %-8s %s\n",
elapsed / 1000000,
"SUSPEND",
comm,
pid,
"-",
"GPU suspending");
@suspend_count = count();
@suspend_start = nsecs;
}
/* GPU resuming from suspend */
kprobe:nvidia_resume
{
printf("%-12llu %-18s %-16s %-8d %-8s %s\n",
elapsed / 1000000,
"RESUME",
comm,
pid,
"-",
"GPU resuming");
@resume_count = count();
/* Calculate suspend duration if we tracked suspend start */
if (@suspend_start > 0) {
$suspend_duration_ms = (nsecs - @suspend_start) / 1000000;
printf(" └─ Suspend duration: %llu ms\n", $suspend_duration_ms);
@suspend_duration_ms = hist($suspend_duration_ms);
@suspend_start = 0;
}
}
/* ========== Error Reporting ========== */
/* NVIDIA Xid errors (hardware/driver errors) */
tracepoint:nvidia:nvidia_dev_xid
{
$dev = str(args->dev);
$xid = args->error_code;
$msg = str(args->msg);
printf("\n!!! GPU ERROR !!!\n");
printf("%-12llu %-18s %-16s %-8d %-8s dev=%s\n",
elapsed / 1000000,
"XID_ERROR",
comm,
pid,
"-",
$dev);
printf(" └─ Xid: %u - %s\n\n", $xid, $msg);
@xid_errors = count();
@xid_codes[$xid] = count();
}
/* ========== Statistics and Histograms ========== */
/* Track VMA (Virtual Memory Area) operations */
kprobe:nvidia_vma_open
{
@vma_open_count = count();
}
kprobe:nvidia_vma_release
{
@vma_release_count = count();
}
/* Poll operations (async I/O) */
kprobe:nvidia_poll
{
@poll_count = count();
}
END
{
printf("\n");
printf("========================================\n");
printf(" NVIDIA GPU Driver Statistics\n");
printf("========================================\n");
/* Device Operations */
printf("\n--- Device Operations ---\n");
printf("Opens by process:\n");
print(@opens);
printf("\nCloses by process:\n");
print(@closes);
printf("\nTotal ioctls:\n");
print(@ioctl_count);
printf("Top ioctl callers:\n");
print(@ioctls_per_process);
printf("\nTotal mmaps:\n");
print(@mmap_count);
printf("Total mmap bytes:\n");
print(@total_mmap_bytes);
/* Memory Management */
printf("\n--- Memory Management ---\n");
printf("Total page faults:\n");
print(@fault_count);
printf("Faults by process:\n");
print(@faults_per_process);
printf("VMA opens:\n");
print(@vma_open_count);
printf("VMA releases:\n");
print(@vma_release_count);
/* Interrupt Statistics */
printf("\n--- Interrupt Handling ---\n");
printf("Total ISR calls:\n");
print(@isr_count);
printf("MSI-X ISR calls:\n");
print(@isr_msix_count);
printf("Bottom-half handlers:\n");
print(@isr_bh_count);
printf("\nISR latency distribution (μs):\n");
print(@isr_latency_us);
/* P2P Operations */
printf("\n--- P2P GPU-GPU Communication ---\n");
printf("P2P get_pages:\n");
print(@p2p_get_count);
printf("P2P DMA mappings:\n");
print(@p2p_dma_map_count);
/* Power Management */
printf("\n--- Power Management ---\n");
printf("Suspends:\n");
print(@suspend_count);
printf("Resumes:\n");
print(@resume_count);
printf("Suspend duration distribution (ms):\n");
print(@suspend_duration_ms);
/* Errors */
printf("\n--- Errors ---\n");
printf("Total Xid errors:\n");
print(@xid_errors);
printf("Xid error codes:\n");
print(@xid_codes);
/* Async I/O */
printf("\n--- Async Operations ---\n");
printf("Poll calls:\n");
print(@poll_count);
printf("\nCurrently open PIDs:\n");
print(@open_pids);
printf("\n========================================\n");
}

View File

@@ -10,11 +10,11 @@ This tutorial shows you how to trace Intel NPU kernel driver operations using eB
Intel's NPU driver follows a two-layer architecture similar to GPU drivers. The kernel module (`intel_vpu`) lives in mainline Linux at `drivers/accel/ivpu/` and exposes `/dev/accel/accel0` as the device interface. This handles hardware communication, memory management through an MMU, and IPC (Inter-Processor Communication) with NPU firmware running on the accelerator itself.
The userspace driver (`libze_intel_vpu.so`) implements the Level Zero API - Intel's unified programming interface for accelerators. When you call Level Zero functions like `zeMemAllocHost()` or `zeCommandQueueExecuteCommandLists()`, the library translates these into DRM ioctls that hit the kernel module. The kernel validates requests, sets up memory mappings, submits work to the NPU firmware, and polls for completion.
The userspace driver (`libze_intel_vpu.so`) implements the Level Zero API, Intel's unified programming interface for accelerators. When you call Level Zero functions like `zeMemAllocHost()` or `zeCommandQueueExecuteCommandLists()`, the library translates these into DRM ioctls that hit the kernel module. The kernel validates requests, sets up memory mappings, submits work to the NPU firmware, and polls for completion.
The NPU firmware itself runs autonomously on the accelerator hardware. It receives command buffers from the kernel, schedules compute kernels, manages on-chip memory, and signals completion through interrupts. All communication happens via IPC channels - shared memory regions where kernel and firmware exchange messages. This architecture means three layers must coordinate correctly: your application, the kernel driver, and NPU firmware.
The NPU firmware itself runs autonomously on the accelerator hardware. It receives command buffers from the kernel, schedules compute kernels, manages on-chip memory, and signals completion through interrupts. All communication happens via IPC channels, which are shared memory regions where kernel and firmware exchange messages. This architecture means three layers must coordinate correctly: your application, the kernel driver, and NPU firmware.
Understanding this flow is critical for debugging. When an AI inference stalls, is it the kernel waiting for firmware? Is memory allocation thrashing? Are IPC messages backing up? eBPF tracing reveals the kernel side of this story - every ioctl, every memory mapping, every IPC interrupt.
Understanding this flow is critical for debugging. When an AI inference stalls, is it the kernel waiting for firmware? Is memory allocation thrashing? Are IPC messages backing up? eBPF tracing reveals the kernel side of this story including every ioctl, every memory mapping, and every IPC interrupt.
## Level Zero API to Kernel Driver Mapping
@@ -24,19 +24,19 @@ The Level Zero workflow breaks down into five phases. Initialization opens the N
Here's how each API call translates to kernel operations:
**zeMemAllocHost** allocates host-visible memory accessible by both CPU and NPU. This triggers `DRM_IOCTL_IVPU_BO_CREATE` ioctl, hitting `ivpu_bo_create_ioctl()` in the kernel. The driver calls `ivpu_gem_create_object()` to allocate a GEM (Graphics Execution Manager) buffer object, then `ivpu_mmu_context_map_page()` maps pages into NPU's address space via the MMU. Finally `ivpu_bo_pin()` pins the buffer in memory so it can't be swapped out during compute.
zeMemAllocHost allocates host-visible memory accessible by both CPU and NPU. This triggers `DRM_IOCTL_IVPU_BO_CREATE` ioctl, hitting `ivpu_bo_create_ioctl()` in the kernel. The driver calls `ivpu_gem_create_object()` to allocate a GEM (Graphics Execution Manager) buffer object, then `ivpu_mmu_context_map_page()` maps pages into NPU's address space via the MMU. Finally `ivpu_bo_pin()` pins the buffer in memory so it can't be swapped out during compute.
For our matrix multiplication example with three buffers (input matrix A, input matrix B, output matrix C), we see three `zeMemAllocHost()` calls. Each triggers approximately 1,377 `ivpu_mmu_context_map_page()` calls - that's 4,131 total page mappings for setting up compute memory.
For our matrix multiplication example with three buffers (input matrix A, input matrix B, output matrix C), we see three `zeMemAllocHost()` calls. Each triggers approximately 1,377 `ivpu_mmu_context_map_page()` calls, totaling 4,131 page mappings for setting up compute memory.
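To watch this allocation path yourself, a minimal sketch can count buffer-object creations and the MMU page mappings they trigger per process. The kprobe targets are the kernel functions named above; confirm they exist on your kernel with `bpftrace -l 'kprobe:ivpu_*'` first:

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: relate ivpu buffer-object creation to MMU page-mapping volume.
 * Maps print automatically on Ctrl-C. */
kprobe:ivpu_bo_create_ioctl
{
    @bo_creates[comm] = count();
}
kprobe:ivpu_mmu_context_map_page
{
    @page_maps[comm] = count();
}
```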
**zeCommandQueueCreate** establishes a queue for submitting work. This maps to `DRM_IOCTL_IVPU_GET_PARAM` ioctl calling `ivpu_get_param_ioctl()` to query queue capabilities. The actual queue object lives in userspace - the kernel just provides device parameters.
zeCommandQueueCreate establishes a queue for submitting work. This maps to `DRM_IOCTL_IVPU_GET_PARAM` ioctl calling `ivpu_get_param_ioctl()` to query queue capabilities. The actual queue object lives in userspace; the kernel just provides device parameters.
**zeCommandListCreate** builds a command list in userspace. No kernel call happens here - the library constructs command buffers in memory that will later be submitted to the NPU.
zeCommandListCreate builds a command list in userspace. No kernel call happens here. The library constructs command buffers in memory that will later be submitted to the NPU.
**zeCommandQueueExecuteCommandLists** is where work actually reaches the NPU. This triggers `DRM_IOCTL_IVPU_SUBMIT` ioctl, calling `ivpu_submit_ioctl()` in the kernel. The driver validates the command buffer, sets up DMA transfers, and sends an IPC message to NPU firmware requesting execution. The firmware wakes up, processes the request, schedules compute kernels on NPU hardware, and starts sending IPC interrupts back to signal progress.
zeCommandQueueExecuteCommandLists is where work actually reaches the NPU. This triggers `DRM_IOCTL_IVPU_SUBMIT` ioctl, calling `ivpu_submit_ioctl()` in the kernel. The driver validates the command buffer, sets up DMA transfers, and sends an IPC message to NPU firmware requesting execution. The firmware wakes up, processes the request, schedules compute kernels on NPU hardware, and starts sending IPC interrupts back to signal progress.
During execution, we observe massive IPC traffic: 946 `ivpu_ipc_irq_handler()` calls (interrupt handler for IPC messages from firmware), 945 `ivpu_ipc_receive()` calls (reading messages from shared memory), and 951 `ivpu_hw_ip_ipc_rx_count_get()` calls (polling IPC queue depth). This intense communication is normal - the firmware sends status updates, memory fence signals, and completion notifications throughout the compute operation.
During execution, we observe massive IPC traffic: 946 `ivpu_ipc_irq_handler()` calls (interrupt handler for IPC messages from firmware), 945 `ivpu_ipc_receive()` calls (reading messages from shared memory), and 951 `ivpu_hw_ip_ipc_rx_count_get()` calls (polling IPC queue depth). This intense communication is normal since the firmware sends status updates, memory fence signals, and completion notifications throughout the compute operation.
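A minimal standalone sketch for this phase counts submissions alongside the IPC interrupts they generate, using the kernel functions named above (the full tracing script in the next section goes much further):

```c
#!/usr/bin/env bpftrace
/* Minimal sketch: work submissions vs. IPC interrupt traffic from NPU firmware.
 * Maps print automatically on Ctrl-C. */
kprobe:ivpu_submit_ioctl
{
    @submits[comm] = count();
}
kprobe:ivpu_ipc_irq_handler
{
    @ipc_irqs = count();
}
```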
**zeFenceHostSynchronize** blocks until the NPU completes work. This doesn't trigger a dedicated ioctl - instead the library continuously calls `ivpu_get_param_ioctl()` to poll fence status. The kernel checks if the firmware signaled completion via IPC. More `ivpu_ipc_irq_handler()` calls fire as the firmware sends the final completion message.
zeFenceHostSynchronize blocks until the NPU completes work. This doesn't trigger a dedicated ioctl. Instead the library continuously calls `ivpu_get_param_ioctl()` to poll fence status. The kernel checks if the firmware signaled completion via IPC. More `ivpu_ipc_irq_handler()` calls fire as the firmware sends the final completion message.
## Tracing NPU Operations with Bpftrace