diff --git a/src/xpu/gpu-kernel-driver/README.md b/src/xpu/gpu-kernel-driver/README.md index 437b425..7f62aa6 100644 --- a/src/xpu/gpu-kernel-driver/README.md +++ b/src/xpu/gpu-kernel-driver/README.md @@ -1,195 +1,170 @@ -# eBPF Tutorial: Monitoring GPU Driver Activity with Kernel Tracepoints +# eBPF Tutorial by Example: Monitoring GPU Driver Activity with Kernel Tracepoints -Ever wondered what your GPU is really doing under the hood? When games stutter, ML training slows down, or video encoding freezes, the answers lie deep inside the kernel's GPU driver. Traditional debugging relies on guesswork and vendor-specific tools, but there's a better way. Linux kernel GPU tracepoints expose real-time insights into job scheduling, memory allocation, and command submission - and eBPF lets you analyze this data with minimal overhead. +When games stutter or ML training slows down, the answers lie inside the GPU kernel driver. Linux kernel tracepoints expose real-time job scheduling, memory allocation, and command submission data. Unlike userspace profiling tools that sample periodically and miss events, kernel tracepoints catch every operation with nanosecond timestamps and minimal overhead. -In this tutorial, we'll explore GPU kernel tracepoints across DRM scheduler, Intel i915, and AMD AMDGPU drivers. We'll write bpftrace scripts to monitor live GPU activity, track memory pressure, measure job latency, and diagnose performance bottlenecks. By the end, you'll have production-ready monitoring tools and deep knowledge of how GPUs interact with the kernel. +This tutorial shows how to monitor GPU activity using eBPF and bpftrace. We'll track DRM scheduler jobs, measure latency, and diagnose bottlenecks using stable kernel tracepoints that work across Intel, AMD, and Nouveau drivers. -## Understanding GPU Kernel Tracepoints +## GPU Kernel Tracepoints: Zero-Overhead Observability -GPU tracepoints are instrumentation points built directly into the kernel's Direct Rendering Manager (DRM) subsystem. When your GPU schedules a job, allocates memory, or signals a fence, these tracepoints fire - capturing precise timing, resource identifiers, and driver state. Unlike userspace profiling tools that sample periodically and miss events, kernel tracepoints catch every single operation with nanosecond timestamps. +GPU tracepoints are instrumentation points built into the kernel's Direct Rendering Manager (DRM) subsystem. When your GPU schedules a job, allocates memory, or signals a fence, these tracepoints fire with precise timing and driver state. -### Why Kernel Tracepoints Matter for GPU Monitoring +The key insight: kernel tracepoints activate only when events occur, adding nanoseconds of overhead per event. They capture 100% of activity including microsecond-duration jobs. Polling-based monitoring checks GPU state every 100ms and misses short-lived operations entirely. -Think about what happens when you launch a GPU workload. Your application submits commands through the graphics API (Vulkan, OpenGL, CUDA). The userspace driver translates these into hardware-specific command buffers. The kernel driver receives an ioctl, validates the work, allocates GPU memory, binds resources to GPU address space, schedules the job on a hardware ring, and waits for completion. Traditional profiling sees the start and end - kernel tracepoints see every step in between. +GPU tracepoints span three layers. **DRM scheduler tracepoints** (`gpu_scheduler` event group) are stable uAPI - their format never changes. 
They work identically across Intel, AMD, and Nouveau drivers for vendor-neutral monitoring. **Vendor-specific tracepoints** expose driver internals - Intel i915 tracks GEM object creation and VMA binding, AMD AMDGPU monitors buffer objects and command submission. **Generic DRM tracepoints** handle display synchronization through vblank events for diagnosing frame drops. -The performance implications are significant. Polling-based monitoring checks GPU state every 100ms and consumes CPU cycles on every check. Tracepoints activate only when events occur, adding mere nanoseconds of overhead per event, and capture 100% of activity including microsecond-duration jobs. For production monitoring of Kubernetes GPU workloads or debugging ML training performance, this difference is critical. +## DRM Scheduler Monitor: Universal GPU Tracking -### The DRM Tracepoint Ecosystem +The `drm_scheduler.bt` script works on **all GPU drivers** because it uses stable uAPI tracepoints. It tracks job submission (`drm_run_job`), completion (`drm_sched_process_job`), and dependency waits (`drm_sched_job_wait_dep`) across all rings. -GPU tracepoints span three layers of the graphics stack. **DRM scheduler tracepoints** (gpu_scheduler event group) are marked as stable uAPI - their format will never change. These work identically across Intel, AMD, and Nouveau drivers, making them perfect for vendor-neutral monitoring. They track job submission (`drm_run_job`), completion (`drm_sched_process_job`), and dependency waits (`drm_sched_job_wait_dep`). +### Complete Bpftrace Script: drm_scheduler.bt -**Vendor-specific tracepoints** expose driver internals. Intel i915 tracepoints track GEM object creation (`i915_gem_object_create`), VMA binding to GPU address space (`i915_vma_bind`), memory pressure events (`i915_gem_shrink`), and page faults (`i915_gem_object_fault`). AMD AMDGPU tracepoints monitor buffer object lifecycle (`amdgpu_bo_create`), command submission from userspace (`amdgpu_cs_ioctl`), scheduler execution (`amdgpu_sched_run_job`), and GPU interrupts (`amdgpu_iv`). Note that Intel low-level tracepoints require `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` in your kernel config. +```c +#!/usr/bin/env bpftrace +/* + * drm_scheduler.bt - Monitor DRM GPU scheduler activity + * + * This script tracks GPU job scheduling using stable DRM scheduler tracepoints. + * Works across ALL modern GPU drivers (Intel i915, AMD AMDGPU, Nouveau, etc.) + * + * The gpu_scheduler tracepoints are stable uAPI - guaranteed not to change. + * + * Usage: sudo bpftrace drm_scheduler.bt + */ -**Generic DRM tracepoints** handle display synchronization through vblank events - critical for diagnosing frame drops and compositor latency. Events include vblank occurrence (`drm_vblank_event`), userspace queueing (`drm_vblank_event_queued`), and delivery (`drm_vblank_event_delivered`). +BEGIN +{ + printf("Tracing DRM GPU scheduler... Hit Ctrl-C to end.\n"); + printf("%-18s %-12s %-16s %-12s %-8s %s\n", + "TIME(ms)", "EVENT", "JOB_ID", "RING", "QUEUED", "DETAILS"); +} -### Real-World Use Cases +/* GPU job starts executing */ +tracepoint:gpu_scheduler:drm_run_job +{ + $job_id = args->id; + $ring = str(args->name); + $queue = args->job_count; + $hw_queue = args->hw_job_count; -GPU tracepoints solve problems that traditional tools can't touch. **Diagnosing stuttering in games**: You notice frame drops every few seconds. Vblank tracepoints reveal missed vertical blanks. Job scheduling traces show CPU-side delays in command submission. 
Memory tracepoints expose allocations triggering evictions during critical frames. Within minutes you identify that texture uploads are blocking the rendering pipeline. + /* Record start time for latency calculation */ + @start[$job_id] = nsecs; -**Optimizing ML training performance**: Your PyTorch training is 40% slower than expected. AMDGPU command submission tracing reveals excessive synchronization - the CPU waits for GPU completion too often. Job dependency tracepoints show unnecessary fences between independent operations. Memory traces expose thrashing between VRAM and system RAM. You reorganize batching to eliminate stalls. + printf("%-18llu %-12s %-16llu %-12s %-8u hw=%d\n", + nsecs / 1000000, + "RUN", + $job_id, + $ring, + $queue, + $hw_queue); -**Cloud GPU billing accuracy**: Multi-tenant systems need fair energy and resource accounting. DRM scheduler tracepoints attribute exact GPU time to each container. Memory tracepoints track allocation per workload. This data feeds into accurate billing systems that charge based on actual resource consumption rather than time-based estimates. + /* Track per-ring statistics */ + @jobs_per_ring[$ring] = count(); +} -**Thermal throttling investigation**: GPU performance degrades under load. Interrupt tracing shows thermal events from the GPU. Job scheduling traces reveal frequency scaling impacting execution time. Memory migration traces show the driver moving workloads to cooler GPU dies. You adjust power limits and improve airflow. +/* GPU job completes (fence signaled) */ +tracepoint:gpu_scheduler:drm_sched_process_job +{ + $fence = args->fence; -## Tracepoint Reference Guide + printf("%-18llu %-12s %-16p\n", + nsecs / 1000000, + "COMPLETE", + $fence); -Let's examine each tracepoint category in detail, understanding the data they expose and how to interpret it. + @completion_count = count(); +} -### DRM Scheduler Tracepoints: The Universal GPU Monitor +/* Job waiting for dependencies */ +tracepoint:gpu_scheduler:drm_sched_job_wait_dep +{ + $job_id = args->id; + $ring = str(args->name); + $dep_ctx = args->ctx; + $dep_seq = args->seqno; -The DRM scheduler provides a vendor-neutral view of GPU job management. These tracepoints work identically whether you're running Intel integrated graphics, AMD discrete GPUs, or Nouveau on NVIDIA hardware. + printf("%-18llu %-12s %-16llu %-12s %-8s ctx=%llu seq=%u\n", + nsecs / 1000000, + "WAIT_DEP", + $job_id, + $ring, + "-", + $dep_ctx, + $dep_seq); -#### drm_run_job: When GPU Work Starts Executing + @wait_count = count(); + @waits_per_ring[$ring] = count(); +} -When the scheduler assigns a job to GPU hardware, `drm_run_job` fires. This marks the transition from "queued in software" to "actively running on silicon." The tracepoint captures the job ID (unique identifier for correlation), ring name (which execution engine: graphics, compute, video decode), queue depth (how many jobs are waiting), and hardware job count (jobs currently executing on GPU). +END +{ + printf("\n=== DRM Scheduler Statistics ===\n"); + printf("\nJobs per ring:\n"); + print(@jobs_per_ring); + printf("\nWaits per ring:\n"); + print(@waits_per_ring); +} +``` -The format looks like: `entity=0xffff888... id=12345 fence=0xffff888... ring=gfx job count:5 hw job count:2`. This tells you job 12345 on the graphics ring started executing. Five jobs are queued behind it, and two jobs are currently running on hardware (multi-engine GPUs can run jobs in parallel). +### Understanding the Script -Use this to measure job scheduling latency. 
Record the timestamp when userspace submits work (using command submission tracepoints), then measure time until `drm_run_job` fires. Latencies over 1ms indicate CPU-side scheduling delays. Per-ring statistics reveal if specific engines (video encode, compute) are bottlenecked. +The script attaches to three stable DRM scheduler tracepoints. When `drm_run_job` fires, a job transitions from "queued in software" to "running on silicon." The tracepoint captures `args->id` (job ID for correlation), `args->name` (ring name - which execution engine like graphics, compute, or video decode), `args->job_count` (queue depth - how many jobs are waiting), and `args->hw_job_count` (jobs currently executing on GPU hardware). -#### drm_sched_process_job: Job Completion Signal +The format `entity=0xffff888... id=12345 fence=0xffff888... ring=gfx job count:5 hw job count:2` tells you job 12345 on the graphics ring started executing with 5 jobs queued behind it and 2 jobs currently running on hardware. Multi-engine GPUs can run jobs in parallel across different rings. -When GPU hardware completes a job and signals its fence, this tracepoint fires. The fence pointer identifies the completed job - correlate it with `drm_run_job` to calculate GPU execution time. Format: `fence=0xffff888... signaled`. +We record `@start[$job_id] = nsecs` to enable latency calculation. The script stores the timestamp keyed by job ID. Later, when tracking completion or measuring end-to-end latency, you can compute `nsecs - @start[$job_id]` to get execution time. The `@jobs_per_ring[$ring] = count()` line increments per-ring counters, showing workload distribution across engines. -Combine with `drm_run_job` timestamps to compute job execution time: `completion_time - run_time = GPU_execution_duration`. If jobs that should take 5ms are taking 50ms, you've found a GPU performance problem. Throughput metrics (jobs completed per second) indicate overall GPU utilization. +When `drm_sched_process_job` fires, GPU hardware completed a job and signaled its fence. The fence pointer `args->fence` identifies the completed job. Correlating fence pointers between `drm_run_job` and this tracepoint lets you calculate GPU execution time: `completion_time - run_time = GPU_execution_duration`. If jobs that should take 5ms are taking 50ms, you've found a GPU performance problem. -#### drm_sched_job_wait_dep: Dependency Stalls +The `drm_sched_job_wait_dep` tracepoint fires when a job blocks waiting for a fence. Before a job executes, its dependencies (previous jobs it waits for) must complete. The format shows `args->ctx` (dependency context) and `args->seqno` (sequence number) identifying which fence blocks this job. -Before a job can execute, its dependencies (previous jobs it waits for) must complete. This tracepoint fires when a job blocks waiting for a fence. Format: `job ring=gfx id=12345 depends fence=0xffff888... context=1234 seq=567`. +This reveals pipeline stalls. If compute jobs constantly wait for graphics jobs, you're not exploiting parallelism. Long wait times suggest dependency chains are too deep - consider batching independent work. Excessive dependencies indicate CPU-side scheduling inefficiency. The `@waits_per_ring[$ring] = count()` metric tracks which rings experience the most dependency stalls. -This reveals pipeline stalls. If compute jobs constantly wait for graphics jobs, you're not exploiting parallelism. If wait times are long, dependency chains are too deep - consider batching independent work. 
Excessive dependencies indicate a CPU-side scheduling inefficiency. +At program end, the `END` block prints statistics. `@jobs_per_ring` shows job counts per execution engine - revealing if specific rings (video encode, compute) are saturated. `@waits_per_ring` exposes dependency bottlenecks. This data reveals overall GPU utilization patterns and whether jobs are blocked by dependencies. -### Intel i915 Tracepoints: Memory and I/O Deep Dive +## Intel i915 Tracepoints: Memory Management Deep Dive -Intel's i915 driver exposes detailed tracepoints for memory management and data transfer. These require `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` - check with `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)`. +Intel's i915 driver exposes detailed tracepoints for memory operations. These require `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` in your kernel config - check with `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)`. -#### i915_gem_object_create: GPU Memory Allocation +**i915_gem_object_create** fires when the driver allocates a GEM (Graphics Execution Manager) object - the fundamental unit of GPU-accessible memory. Format: `obj=0xffff888... size=0x100000` indicates allocating a 1MB object. Track total allocated memory over time to detect leaks. Sudden allocation spikes before performance drops suggest memory pressure. Correlate object pointers with subsequent bind/fault events to understand object lifecycle. -When the driver allocates a GEM (Graphics Execution Manager) object - the fundamental unit of GPU-accessible memory - this fires. Format: `obj=0xffff888... size=0x100000` indicates allocating a 1MB object. +**i915_vma_bind** tracks mapping memory into GPU address space. Allocating memory isn't enough - it must be bound into GPU virtual address space. Format: `obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` shows 64KB bound at GPU virtual address 0x100000. Frequent rebinding indicates memory thrashing - the driver evicting and rebinding objects under pressure. GPU page faults often correlate with bind operations. -Track total allocated memory over time to detect leaks. Sudden allocation spikes before performance drops suggest memory pressure. Correlate object pointers with subsequent bind/fault events to understand object lifecycle. High-frequency small allocations indicate inefficient batching. +**i915_gem_shrink** captures memory pressure response. Under memory pressure, the driver reclaims GPU memory. Format: `dev=0 target=0x1000000 flags=0x3` means the driver tries to reclaim 16MB. High shrink activity indicates undersized GPU memory for the workload. Correlate with performance drops - if shrinking happens during frame rendering, it causes stutters. -#### i915_vma_bind: Mapping Memory to GPU Address Space +**i915_gem_object_fault** tracks page faults when CPU or GPU accesses unmapped memory. Format: `obj=0xffff888... GTT index=128 writable` indicates a write fault on Graphics Translation Table page 128. Faults are expensive - they stall execution while the kernel resolves the missing mapping. Write faults are more expensive than reads (require invalidating caches). GTT faults indicate incomplete resource binding before job submission. -Allocating memory isn't enough - it must be mapped (bound) into GPU address space. This tracepoint fires on VMA (Virtual Memory Area) binding. Format: `obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` shows 64KB bound at GPU virtual address 0x100000. 
- -Binding overhead impacts performance. Frequent rebinding indicates memory thrashing - the driver evicting and rebinding objects under pressure. GPU page faults often correlate with bind operations - the CPU bound memory just before GPU accessed it. Flags like `PIN_MAPPABLE` indicate memory accessible by both CPU and GPU. - -#### i915_gem_shrink: Memory Pressure Response - -Under memory pressure, the driver reclaims GPU memory. Format: `dev=0 target=0x1000000 flags=0x3` means the driver tries to reclaim 16MB. High shrink activity indicates undersized GPU memory for the workload. - -Correlate with performance drops - if shrinking happens during frame rendering, it causes stutters. Flags indicate shrink aggressiveness. Repeated shrinks with small targets suggest memory fragmentation. Compare target with actual freed amount (track object destructions) to measure reclaim efficiency. - -#### i915_gem_object_fault: GPU Page Faults - -When CPU or GPU accesses unmapped memory, a fault occurs. Format: `obj=0xffff888... GTT index=128 writable` indicates a write fault on Graphics Translation Table page 128. Faults are expensive - they stall execution while the kernel resolves the missing mapping. - -Excessive faults kill performance. Write faults are more expensive than reads (require invalidating caches). GTT faults (GPU accessing unmapped memory) indicate incomplete resource binding before job submission. CPU faults suggest inefficient CPU/GPU synchronization - CPU accessing objects while GPU is using them. - -### AMD AMDGPU Tracepoints: Command Flow and Interrupts +## AMD AMDGPU Tracepoints: Command Submission Pipeline AMD's AMDGPU driver provides comprehensive tracing of command submission and hardware interrupts. -#### amdgpu_cs_ioctl: Userspace Command Submission +**amdgpu_cs_ioctl** captures userspace command submission. When an application submits GPU work via ioctl, this tracepoint fires. Format: `sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` shows job 12345 submitted to graphics ring with 2 indirect buffers. This marks when userspace hands off work to kernel. Record timestamp to measure submission-to-execution latency when combined with `amdgpu_sched_run_job`. High frequency indicates small batches - potential for better batching. -When an application submits GPU work via ioctl, this captures the request. Format: `sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` shows job 12345 submitted to graphics ring with 2 indirect buffers. +**amdgpu_sched_run_job** fires when the kernel scheduler starts executing a previously submitted job. Comparing timestamps with `amdgpu_cs_ioctl` reveals submission latency. Submission latencies over 100μs indicate kernel scheduling delays. Per-ring latencies show if specific engines are scheduling-bound. -This marks when userspace hands off work to kernel. Record timestamp to measure submission-to-execution latency when combined with `amdgpu_sched_run_job`. High frequency indicates small batches - potential for better batching. Per-ring distribution shows workload balance across engines. +**amdgpu_bo_create** tracks buffer object allocation - AMD's equivalent to i915 GEM objects. Format: `bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1` allocates 1MB (256 pages). Type indicates VRAM vs GTT (system memory accessible by GPU). Preferred/allowed domains show placement policy. Type mismatches (requesting VRAM but falling back to GTT) indicate VRAM exhaustion. 
Visible flag indicates CPU-accessible memory - expensive, use sparingly. -#### amdgpu_sched_run_job: Kernel Schedules Job +**amdgpu_bo_move** fires when buffer objects migrate between VRAM and GTT. Migrations are expensive (require copying data over PCIe). Excessive moves indicate memory thrashing - working set exceeds VRAM capacity. Measure move frequency and size to quantify PCIe bandwidth consumption. Correlate with performance drops - migrations stall GPU execution. -The kernel scheduler starts executing a previously submitted job. Comparing timestamps with `amdgpu_cs_ioctl` reveals submission latency. Format includes job ID and ring for correlation. +**amdgpu_iv** captures GPU interrupts. The GPU signals interrupts for completed work, errors, and events. Format: `ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` captures interrupt details. Source ID indicates interrupt type (completion, fault, thermal). High interrupt rates impact CPU performance. VMID and PASID identify which process/VM triggered the interrupt - critical for multi-tenant debugging. -Submission latencies over 100μs indicate kernel scheduling delays. Per-ring latencies show if specific engines are scheduling-bound. Correlate with CPU scheduler traces to identify if kernel threads are being preempted. - -#### amdgpu_bo_create: Buffer Object Allocation - -AMD's equivalent to i915 GEM objects. Format: `bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1` allocates 1MB (256 pages). Type indicates VRAM vs GTT (system memory accessible by GPU). Preferred/allowed domains show placement policy. - -Track VRAM allocations to monitor memory usage. Type mismatches (requesting VRAM but falling back to GTT) indicate VRAM exhaustion. Visible flag indicates CPU-accessible memory - expensive, use sparingly. - -#### amdgpu_bo_move: Memory Migration - -When buffer objects migrate between VRAM and GTT, this fires. Migrations are expensive (require copying data over PCIe). Excessive moves indicate memory thrashing - working set exceeds VRAM capacity. - -Measure move frequency and size to quantify PCIe bandwidth consumption. Correlate with performance drops - migrations stall GPU execution. Optimize by reducing working set or using smarter placement policies (keep frequently accessed data in VRAM). - -#### amdgpu_iv: GPU Interrupts - -The GPU signals interrupts for completed work, errors, and events. Format: `ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` captures interrupt details. - -Source ID indicates interrupt type (completion, fault, thermal). High interrupt rates impact CPU performance. Unexpected interrupts suggest hardware errors. VMID and PASID identify which process/VM triggered the interrupt - critical for multi-tenant debugging. - -### DRM Vblank Tracepoints: Display Synchronization +## DRM Vblank Tracepoints: Display Synchronization Vblank (vertical blanking) events synchronize rendering with display refresh. Missing vblanks causes dropped frames and stutter. -#### drm_vblank_event: Vertical Blank Occurs +**drm_vblank_event** fires when the display enters vertical blanking period. Format: `crtc=0 seq=12345 time=1234567890 high-prec=true` indicates vblank on display controller 0, sequence number 12345. Track vblank frequency to verify refresh rate (60Hz = 60 vblanks/second). Missed sequences indicate frame drops. High-precision timestamps enable sub-millisecond frame timing analysis. 
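+To check whether a display is actually hitting its refresh rate, a per-CRTC interval histogram is often enough. The sketch below is an illustrative extra, not one of the tutorial's shipped scripts; it assumes the `drm:drm_vblank_event` tracepoint exposes a `crtc` field - verify the field names in `/sys/kernel/debug/tracing/events/drm/drm_vblank_event/format` before relying on it.
+
+```c
+#!/usr/bin/env bpftrace
+/*
+ * vblank_interval.bt - sketch: histogram of time between consecutive vblanks
+ * per display controller. Assumes drm:drm_vblank_event exposes a crtc field.
+ */
+tracepoint:drm:drm_vblank_event
+{
+    $crtc = args->crtc;
+
+    if (@last[$crtc]) {
+        /* Interval between consecutive vblanks in microseconds.
+         * ~16667us is expected at 60Hz; larger buckets suggest missed frames. */
+        @interval_us[$crtc] = hist((nsecs - @last[$crtc]) / 1000);
+    }
+    @last[$crtc] = nsecs;
+}
+
+END
+{
+    clear(@last);
+}
+```
+
+At 60Hz most samples should land in the 16K-32K microsecond bucket; entries above that point at dropped vblanks on that CRTC.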
-When the display enters vertical blanking period, this fires. Format: `crtc=0 seq=12345 time=1234567890 high-prec=true` indicates vblank on display controller 0, sequence number 12345. +**drm_vblank_event_queued** and **drm_vblank_event_delivered** track vblank event delivery to userspace. Queuing latency (queue to delivery) measures kernel scheduling delay. Total latency (vblank to delivery) includes both kernel and driver processing. Latencies over 1ms indicate compositor problems. Correlate with frame drops visible to users - events delivered late mean missed frames. -Track vblank frequency to verify refresh rate (60Hz = 60 vblanks/second). Missed sequences indicate frame drops. High-precision timestamps enable sub-millisecond frame timing analysis. Per-CRTC tracking for multi-monitor setups. +## Running the Monitor Scripts -#### drm_vblank_event_queued and drm_vblank_event_delivered - -These track vblank event delivery to userspace. Queuing latency (queue to delivery) measures kernel scheduling delay. Total latency (vblank to delivery) includes both kernel and driver processing. - -Latencies over 1ms indicate compositor problems. Compare across CRTCs to identify problematic displays. Correlate with frame drops visible to users - events delivered late mean missed frames. - -## Monitoring with Bpftrace Scripts - -We've created vendor-specific bpftrace scripts for production monitoring. Each script focuses on its GPU vendor's specific tracepoints while sharing a common output format. - -### DRM Scheduler Monitor: Universal GPU Tracking - -The `drm_scheduler.bt` script works on **all GPU drivers** because it uses stable uAPI tracepoints. It tracks jobs across all rings, measures completion rates, and identifies dependency stalls. - -The script attaches to `gpu_scheduler:drm_run_job`, `gpu_scheduler:drm_sched_process_job`, and `gpu_scheduler:drm_sched_job_wait_dep`. On job start, it records timestamps in a map keyed by job ID for later latency calculation. It increments per-ring counters to show workload distribution. On completion, it prints fence information. On dependency wait, it shows which job blocks which fence. - -Output shows timestamp, event type (RUN/COMPLETE/WAIT_DEP), job ID, ring name, and queue depth. At program end, statistics summarize jobs per ring and dependency wait counts. This reveals if specific rings are saturated, whether jobs are blocked by dependencies, and overall GPU utilization patterns. - -### Intel i915 Monitor: Memory and I/O Profiling - -The `intel_i915.bt` script tracks Intel GPU memory operations, I/O transfers, and page faults. It requires `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y`. - -On `i915_gem_object_create`, it accumulates total allocated memory and stores per-object sizes. VMA bind/unbind events track GPU address space changes. Shrink events measure memory pressure. Pwrite/pread track CPU-GPU data transfers. Faults categorize by type (GTT vs CPU, read vs write). - -Output reports allocation size and running total in MB. Bind operations show GPU virtual address and flags. I/O operations track offset and length. Faults indicate type and whether they're reads or writes. End statistics summarize total allocations, VMA operations, memory pressure (shrink operations and bytes reclaimed), I/O volume (read/write counts and sizes), and fault analysis (total faults, write vs read). 
- -This reveals memory leaks (allocations without corresponding frees), binding overhead (frequent rebinds indicate thrashing), memory pressure timing (correlate shrinks with performance drops), I/O patterns (large transfers vs many small ones), and fault hotspots (expensive operations to optimize). - -### AMD AMDGPU Monitor: Command Submission Analysis - -The `amd_amdgpu.bt` script focuses on AMD's command submission pipeline, measuring latency from ioctl to execution. - -On `amdgpu_cs_ioctl`, it records submission timestamp keyed by job ID. When `amdgpu_sched_run_job` fires, it calculates latency: `(current_time - submit_time)`. Buffer object create/move events track memory. Interrupt events count by source ID. Virtual memory operations (flush, map, unmap) measure TLB activity. - -Output shows timestamp, event type, job ID, ring name, and calculated latency in microseconds. End statistics include memory allocation totals, command submission counts per ring, average and distribution of submission latency (histogram showing how many jobs experienced different latency buckets), interrupt counts by source, and virtual memory operation counts. - -Latency histograms are critical - most jobs should have <50μs latency. A tail of high-latency jobs indicates scheduling problems. Per-ring statistics show if compute workloads have different latency than graphics. Memory migration tracking helps diagnose VRAM pressure. - -### Display Vblank Monitor: Frame Timing Analysis - -The `drm_vblank.bt` script tracks display synchronization for diagnosing frame drops. - -On `drm_vblank_event`, it records timestamp keyed by CRTC and sequence. When `drm_vblank_event_queued` fires, it timestamps queue time. On `drm_vblank_event_delivered`, it calculates queue-to-delivery latency and total vblank-to-delivery latency. - -Output shows vblank events, queued events, and delivered events with timestamps. End statistics include total vblank counts per CRTC, event delivery counts, average delivery latency, latency distribution histogram, and total event latency (vblank occurrence to userspace delivery). - -Delivery latencies over 1ms indicate compositor scheduling issues. Total latencies reveal end-to-end delay visible to applications. Per-CRTC statistics show if specific monitors have problems. Latency histograms expose outliers causing visible stutter. - -## Running the Monitors - -Let's trace live GPU activity. Navigate to the scripts directory and run any monitor with bpftrace. The DRM scheduler monitor works on all GPUs: +Navigate to the scripts directory and run the DRM scheduler monitor. It works on all GPUs: ```bash -cd bpf-developer-tutorial/srcsrc/xpu/gpu-kernel-driver/scripts +cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver/scripts sudo bpftrace drm_scheduler.bt ``` -You'll see output like: +Expected output: ``` Tracing DRM GPU scheduler... Hit Ctrl-C to end. @@ -207,75 +182,18 @@ Waits per ring: @waits_per_ring[gfx]: 12 ``` -This shows graphics jobs dominating workload (1523 vs 89 compute jobs). Few dependency waits (12) indicate good pipeline parallelism. +Graphics jobs dominate (1523 vs 89 compute jobs). Few dependency waits (12) indicate good pipeline parallelism. For Intel GPUs, use `intel_i915.bt`. For AMD GPUs, use `amd_amdgpu.bt`. For display timing, use `drm_vblank.bt`. Run these during GPU workloads (gaming, ML training, video encoding) to capture activity patterns. 
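+Once the monitor is running, a natural follow-up is to quantify GPU execution time per job. The sketch below is a hypothetical companion script rather than one of the shipped monitors: it correlates `drm_run_job` with `drm_sched_process_job` by fence pointer, assuming both tracepoints expose a raw `fence` field as in the formats shown earlier. Kernels that report fence context/seqno instead would need a different key.
+
+```c
+#!/usr/bin/env bpftrace
+/*
+ * gpu_job_exec_time.bt - sketch: GPU execution time per job, keyed by fence.
+ * Assumes both gpu_scheduler tracepoints expose a fence pointer field;
+ * check the tracepoint format files before trusting the numbers.
+ */
+tracepoint:gpu_scheduler:drm_run_job
+{
+    /* Job hits the hardware: remember the start time by fence pointer. */
+    @run_ts[args->fence] = nsecs;
+}
+
+tracepoint:gpu_scheduler:drm_sched_process_job
+/@run_ts[args->fence] != 0/
+{
+    /* Fence signaled: elapsed time is the job's GPU execution duration. */
+    @exec_time_us = hist((nsecs - @run_ts[args->fence]) / 1000);
+    delete(@run_ts[args->fence]);
+}
+```
+
+A long tail in the resulting histogram is the same signal described earlier - jobs that should take a few milliseconds taking far longer indicate a GPU-side performance problem.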
-For Intel GPUs, run the i915 monitor: - -```bash -sudo bpftrace intel_i915.bt -``` - -For AMD GPUs: - -```bash -sudo bpftrace amd_amdgpu.bt -``` - -For display timing: - -```bash -sudo bpftrace drm_vblank.bt -``` - -Each script outputs real-time events and end-of-run statistics. Run them during GPU workloads (gaming, ML training, video encoding) to capture characteristic patterns. - -## Verifying Tracepoint Availability - -Before running scripts, verify tracepoints exist on your system. We've included a test script: - -```bash -cd bpf-developer-tutorial/srcsrc/xpu/gpu-kernel-driver/tests -sudo ./test_basic_tracing.sh -``` - -This checks for gpu_scheduler, drm, i915, and amdgpu event groups. It reports which tracepoints are available and recommends appropriate monitoring scripts for your hardware. For Intel systems, it verifies if low-level tracepoints are enabled in kernel config. - -You can also manually inspect available tracepoints: +Verify tracepoints exist on your system before running scripts: ```bash # All GPU tracepoints sudo cat /sys/kernel/debug/tracing/available_events | grep -E '(gpu_scheduler|i915|amdgpu|^drm:)' - -# DRM scheduler (stable, all vendors) -sudo cat /sys/kernel/debug/tracing/available_events | grep gpu_scheduler - -# Intel i915 -sudo cat /sys/kernel/debug/tracing/available_events | grep i915 - -# AMD AMDGPU -sudo cat /sys/kernel/debug/tracing/available_events | grep amdgpu ``` -To manually enable a tracepoint and view raw output: +## Summary -```bash -# Enable drm_run_job -echo 1 | sudo tee /sys/kernel/debug/tracing/events/gpu_scheduler/drm_run_job/enable - -# View trace output -sudo cat /sys/kernel/debug/tracing/trace - -# Disable when done -echo 0 | sudo tee /sys/kernel/debug/tracing/events/gpu_scheduler/drm_run_job/enable -``` - -## Summary and Next Steps - -GPU kernel tracepoints provide unprecedented visibility into graphics driver behavior. The DRM scheduler's stable uAPI tracepoints work across all vendors, making them perfect for production monitoring. Vendor-specific tracepoints from Intel i915 and AMD AMDGPU expose detailed memory management, command submission pipelines, and hardware interrupt patterns. - -Our bpftrace scripts demonstrate practical monitoring: measuring job scheduling latency, tracking memory pressure, analyzing command submission bottlenecks, and diagnosing frame drops. These techniques apply directly to real-world problems - optimizing ML training performance, debugging game stutters, implementing fair GPU resource accounting in cloud environments, and investigating thermal throttling. - -The key advantage over traditional tools is completeness and overhead. Kernel tracepoints capture every event with nanosecond precision at negligible cost. No polling, no sampling gaps, no missed short-lived jobs. This data feeds production monitoring systems (Prometheus exporters reading bpftrace output), ad-hoc performance debugging (run a script when users report issues), and automated optimization (trigger workload rebalancing based on latency thresholds). +GPU kernel tracepoints provide zero-overhead visibility into driver internals. DRM scheduler's stable uAPI tracepoints work across all vendors for production monitoring. Vendor-specific tracepoints expose detailed memory management and command submission pipelines. The bpftrace script demonstrates tracking job scheduling, measuring latency, and identifying dependency stalls - all critical for diagnosing performance issues in games, ML training, and cloud GPU workloads. 
> If you'd like to dive deeper into eBPF, check out our tutorial repository at or visit our website at . diff --git a/src/xpu/gpu-kernel-driver/README.zh.md b/src/xpu/gpu-kernel-driver/README.zh.md index 7271440..134d9a4 100644 --- a/src/xpu/gpu-kernel-driver/README.zh.md +++ b/src/xpu/gpu-kernel-driver/README.zh.md @@ -1,195 +1,170 @@ -# eBPF 教程:使用内核跟踪点监控 GPU 驱动活动 +# eBPF 实例教程:使用内核跟踪点监控 GPU 驱动活动 -你是否曾经想知道你的 GPU 在底层到底在做什么?当游戏卡顿、机器学习训练变慢或视频编码冻结时,答案就隐藏在内核 GPU 驱动的深处。传统调试依赖于猜测和供应商特定的工具,但有更好的方法。Linux 内核 GPU 跟踪点暴露了作业调度、内存分配和命令提交的实时洞察 - 而 eBPF 让你可以以最小的开销分析这些数据。 +当游戏卡顿或机器学习训练变慢时,答案就隐藏在 GPU 内核驱动内部。Linux 内核跟踪点暴露了实时的作业调度、内存分配和命令提交数据。与周期性采样并错过事件的用户空间分析工具不同,内核跟踪点以纳秒级时间戳和最小开销捕获每个操作。 -在本教程中,我们将探索跨 DRM 调度器、Intel i915 和 AMD AMDGPU 驱动的 GPU 内核跟踪点。我们将编写 bpftrace 脚本来监控实时 GPU 活动、跟踪内存压力、测量作业延迟并诊断性能瓶颈。最后,你将拥有生产就绪的监控工具以及对 GPU 如何与内核交互的深入了解。 +本教程展示如何使用 eBPF 和 bpftrace 监控 GPU 活动。我们将跟踪 DRM 调度器作业、测量延迟,并使用跨 Intel、AMD 和 Nouveau 驱动工作的稳定内核跟踪点诊断瓶颈。 -## 理解 GPU 内核跟踪点 +## GPU 内核跟踪点:零开销可观测性 -GPU 跟踪点是直接内置在内核的直接渲染管理器(DRM)子系统中的仪器点。当你的 GPU 调度作业、分配内存或发出栅栏信号时,这些跟踪点会触发 - 捕获精确的时序、资源标识符和驱动状态。与周期性采样并错过事件的用户空间分析工具不同,内核跟踪点以纳秒级时间戳捕获每一个操作。 +GPU 跟踪点是内核直接渲染管理器(DRM)子系统中内置的仪器点。当 GPU 调度作业、分配内存或发出栅栏信号时,这些跟踪点会以精确的时序和驱动状态触发。 -### 为什么内核跟踪点对 GPU 监控很重要 +关键洞察:内核跟踪点仅在事件发生时激活,每个事件添加纳秒级开销。它们捕获 100% 的活动,包括微秒级持续时间的作业。基于轮询的监控每 100ms 检查一次 GPU 状态,完全错过短期操作。 -想想当你启动 GPU 工作负载时会发生什么。你的应用通过图形 API(Vulkan、OpenGL、CUDA)提交命令。用户空间驱动将这些转换为硬件特定的命令缓冲区。内核驱动接收 ioctl,验证工作,分配 GPU 内存,将资源绑定到 GPU 地址空间,在硬件环上调度作业,并等待完成。传统分析看到开始和结束 - 内核跟踪点看到每一步。 +GPU 跟踪点跨越三层。**DRM 调度器跟踪点**(`gpu_scheduler` 事件组)是稳定的 uAPI - 格式永不改变。它们在 Intel、AMD 和 Nouveau 驱动上工作完全相同,适合供应商中立的监控。**供应商特定跟踪点**暴露驱动内部 - Intel i915 跟踪 GEM 对象创建和 VMA 绑定,AMD AMDGPU 监控缓冲对象和命令提交。**通用 DRM 跟踪点**通过 vblank 事件处理显示同步,用于诊断丢帧。 -性能影响是显著的。基于轮询的监控每 100ms 检查一次 GPU 状态,每次检查都会消耗 CPU 周期。跟踪点仅在事件发生时激活,每个事件仅添加纳秒级的开销,并捕获 100% 的活动,包括微秒级持续时间的作业。对于 Kubernetes GPU 工作负载的生产监控或调试 ML 训练性能,这种差异至关重要。 +## DRM 调度器监视器:通用 GPU 跟踪 -### DRM 跟踪点生态系统 +`drm_scheduler.bt` 脚本在**所有 GPU 驱动**上工作,因为它使用稳定的 uAPI 跟踪点。它跟踪作业提交(`drm_run_job`)、完成(`drm_sched_process_job`)和依赖等待(`drm_sched_job_wait_dep`)跨所有环。 -GPU 跟踪点跨越图形堆栈的三层。**DRM 调度器跟踪点**(gpu_scheduler 事件组)被标记为稳定的 uAPI - 它们的格式永远不会改变。这些在 Intel、AMD 和 Nouveau 驱动上工作完全相同,使它们成为供应商中立监控的完美选择。它们跟踪作业提交(`drm_run_job`)、完成(`drm_sched_process_job`)和依赖等待(`drm_sched_job_wait_dep`)。 +### 完整的 Bpftrace 脚本:drm_scheduler.bt -**供应商特定跟踪点**暴露驱动内部。Intel i915 跟踪点跟踪 GEM 对象创建(`i915_gem_object_create`)、VMA 绑定到 GPU 地址空间(`i915_vma_bind`)、内存压力事件(`i915_gem_shrink`)和页面故障(`i915_gem_object_fault`)。AMD AMDGPU 跟踪点监控缓冲对象生命周期(`amdgpu_bo_create`)、从用户空间提交命令(`amdgpu_cs_ioctl`)、调度器执行(`amdgpu_sched_run_job`)和 GPU 中断(`amdgpu_iv`)。注意 Intel 低级跟踪点需要在内核配置中启用 `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y`。 +```c +#!/usr/bin/env bpftrace +/* + * drm_scheduler.bt - 监控 DRM GPU 调度器活动 + * + * 此脚本使用稳定的 DRM 调度器跟踪点跟踪 GPU 作业调度。 + * 适用于所有现代 GPU 驱动(Intel i915、AMD AMDGPU、Nouveau 等) + * + * gpu_scheduler 跟踪点是稳定的 uAPI - 保证不会改变。 + * + * 使用方法:sudo bpftrace drm_scheduler.bt + */ -**通用 DRM 跟踪点**通过 vblank 事件处理显示同步 - 对于诊断丢帧和合成器延迟至关重要。事件包括 vblank 发生(`drm_vblank_event`)、用户空间排队(`drm_vblank_event_queued`)和传递(`drm_vblank_event_delivered`)。 +BEGIN +{ + printf("正在跟踪 DRM GPU 调度器... 
按 Ctrl-C 结束。\n"); + printf("%-18s %-12s %-16s %-12s %-8s %s\n", + "时间(ms)", "事件", "作业ID", "环", "排队", "详情"); +} -### 实际应用场景 +/* GPU 作业开始执行 */ +tracepoint:gpu_scheduler:drm_run_job +{ + $job_id = args->id; + $ring = str(args->name); + $queue = args->job_count; + $hw_queue = args->hw_job_count; -GPU 跟踪点解决了传统工具无法触及的问题。**诊断游戏卡顿**:你注意到每隔几秒就会丢帧。Vblank 跟踪点揭示了错过的垂直消隐。作业调度跟踪显示命令提交中的 CPU 端延迟。内存跟踪点暴露在关键帧期间触发驱逐的分配。几分钟内你就能识别出纹理上传正在阻塞渲染管道。 + /* 记录开始时间用于延迟计算 */ + @start[$job_id] = nsecs; -**优化 ML 训练性能**:你的 PyTorch 训练比预期慢 40%。AMDGPU 命令提交跟踪揭示了过度同步 - CPU 过于频繁地等待 GPU 完成。作业依赖跟踪点显示独立操作之间不必要的栅栏。内存跟踪暴露了 VRAM 和系统 RAM 之间的抖动。你重新组织批处理以消除停顿。 + printf("%-18llu %-12s %-16llu %-12s %-8u hw=%d\n", + nsecs / 1000000, + "RUN", + $job_id, + $ring, + $queue, + $hw_queue); -**云 GPU 计费准确性**:多租户系统需要公平的能源和资源核算。DRM 调度器跟踪点将确切的 GPU 时间归因于每个容器。内存跟踪点跟踪每个工作负载的分配。这些数据馈送到基于实际资源消耗而非基于时间估计收费的准确计费系统。 + /* 跟踪每个环的统计 */ + @jobs_per_ring[$ring] = count(); +} -**热节流调查**:GPU 性能在负载下降级。中断跟踪显示来自 GPU 的热事件。作业调度跟踪揭示影响执行时间的频率缩放。内存迁移跟踪显示驱动将工作负载移动到更冷的 GPU 芯片。你调整功率限制并改善气流。 +/* GPU 作业完成(栅栏已发出信号)*/ +tracepoint:gpu_scheduler:drm_sched_process_job +{ + $fence = args->fence; -## 跟踪点参考指南 + printf("%-18llu %-12s %-16p\n", + nsecs / 1000000, + "COMPLETE", + $fence); -让我们详细检查每个跟踪点类别,了解它们暴露的数据以及如何解释它。 + @completion_count = count(); +} -### DRM 调度器跟踪点:通用 GPU 监视器 +/* 作业等待依赖 */ +tracepoint:gpu_scheduler:drm_sched_job_wait_dep +{ + $job_id = args->id; + $ring = str(args->name); + $dep_ctx = args->ctx; + $dep_seq = args->seqno; -DRM 调度器提供 GPU 作业管理的供应商中立视图。无论你运行的是 Intel 集成显卡、AMD 独立 GPU 还是 NVIDIA 硬件上的 Nouveau,这些跟踪点的工作方式都完全相同。 + printf("%-18llu %-12s %-16llu %-12s %-8s ctx=%llu seq=%u\n", + nsecs / 1000000, + "WAIT_DEP", + $job_id, + $ring, + "-", + $dep_ctx, + $dep_seq); -#### drm_run_job:GPU 工作开始执行时 + @wait_count = count(); + @waits_per_ring[$ring] = count(); +} -当调度器将作业分配给 GPU 硬件时,`drm_run_job` 触发。这标志着从"在软件中排队"到"在硅上主动运行"的转换。跟踪点捕获作业 ID(关联的唯一标识符)、环名称(哪个执行引擎:图形、计算、视频解码)、队列深度(有多少作业在等待)和硬件作业计数(当前在 GPU 上执行的作业)。 +END +{ + printf("\n=== DRM 调度器统计 ===\n"); + printf("\n每个环的作业数:\n"); + print(@jobs_per_ring); + printf("\n每个环的等待数:\n"); + print(@waits_per_ring); +} +``` -格式看起来像:`entity=0xffff888... id=12345 fence=0xffff888... ring=gfx job count:5 hw job count:2`。这告诉你图形环上的作业 12345 开始执行。五个作业在它后面排队,两个作业当前在硬件上运行(多引擎 GPU 可以并行运行作业)。 +### 理解脚本 -使用此来测量作业调度延迟。记录用户空间提交工作时的时间戳(使用命令提交跟踪点),然后测量到 `drm_run_job` 触发的时间。超过 1ms 的延迟表示 CPU 端调度延迟。每个环的统计数据揭示特定引擎(视频编码、计算)是否存在瓶颈。 +脚本附加到三个稳定的 DRM 调度器跟踪点。当 `drm_run_job` 触发时,作业从"在软件中排队"转换为"在硅上运行"。跟踪点捕获 `args->id`(用于关联的作业 ID)、`args->name`(环名称 - 哪个执行引擎如图形、计算或视频解码)、`args->job_count`(队列深度 - 有多少作业在等待)和 `args->hw_job_count`(当前在 GPU 硬件上执行的作业)。 -#### drm_sched_process_job:作业完成信号 +格式 `entity=0xffff888... id=12345 fence=0xffff888... ring=gfx job count:5 hw job count:2` 告诉你图形环上的作业 12345 开始执行,后面有 5 个作业排队,硬件上当前运行 2 个作业。多引擎 GPU 可以跨不同环并行运行作业。 -当 GPU 硬件完成作业并发出其栅栏信号时,此跟踪点触发。栅栏指针标识已完成的作业 - 将其与 `drm_run_job` 关联以计算 GPU 执行时间。格式:`fence=0xffff888... 
signaled`。 +我们记录 `@start[$job_id] = nsecs` 以启用延迟计算。脚本存储按作业 ID 键控的时间戳。稍后,在跟踪完成或测量端到端延迟时,你可以计算 `nsecs - @start[$job_id]` 以获得执行时间。`@jobs_per_ring[$ring] = count()` 行递增每个环的计数器,显示跨引擎的工作负载分布。 -与 `drm_run_job` 时间戳结合以计算作业执行时间:`completion_time - run_time = GPU_execution_duration`。如果应该需要 5ms 的作业需要 50ms,你就发现了 GPU 性能问题。吞吐量指标(每秒完成的作业)表示总体 GPU 利用率。 +当 `drm_sched_process_job` 触发时,GPU 硬件完成了作业并发出其栅栏信号。栅栏指针 `args->fence` 标识已完成的作业。在 `drm_run_job` 和此跟踪点之间关联栅栏指针,让你可以计算 GPU 执行时间:`completion_time - run_time = GPU_execution_duration`。如果应该需要 5ms 的作业需要 50ms,你就发现了 GPU 性能问题。 -#### drm_sched_job_wait_dep:依赖停顿 +`drm_sched_job_wait_dep` 跟踪点在作业阻塞等待栅栏时触发。在作业执行之前,其依赖项(它等待的先前作业)必须完成。格式显示 `args->ctx`(依赖上下文)和 `args->seqno`(序列号)标识哪个栅栏阻塞此作业。 -在作业可以执行之前,其依赖项(它等待的先前作业)必须完成。此跟踪点在作业阻塞等待栅栏时触发。格式:`job ring=gfx id=12345 depends fence=0xffff888... context=1234 seq=567`。 +这揭示了管道停顿。如果计算作业不断等待图形作业,你就没有利用并行性。长等待时间表明依赖链太深 - 考虑批处理独立工作。过度的依赖表示 CPU 端调度效率低下。`@waits_per_ring[$ring] = count()` 指标跟踪哪些环经历最多的依赖停顿。 -这揭示了管道停顿。如果计算作业不断等待图形作业,你就没有利用并行性。如果等待时间很长,依赖链太深 - 考虑批处理独立工作。过度的依赖表示 CPU 端调度效率低下。 +程序结束时,`END` 块打印统计信息。`@jobs_per_ring` 显示每个执行引擎的作业计数 - 揭示特定环(视频编码、计算)是否饱和。`@waits_per_ring` 暴露依赖瓶颈。这些数据揭示了总体 GPU 利用率模式以及作业是否被依赖阻塞。 -### Intel i915 跟踪点:内存和 I/O 深入分析 +## Intel i915 跟踪点:内存管理深入分析 -Intel 的 i915 驱动暴露了内存管理和数据传输的详细跟踪点。这些需要 `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` - 使用 `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)` 检查。 +Intel 的 i915 驱动暴露了内存操作的详细跟踪点。这些需要内核配置中的 `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y` - 使用 `grep CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS /boot/config-$(uname -r)` 检查。 -#### i915_gem_object_create:GPU 内存分配 +**i915_gem_object_create** 在驱动分配 GEM(图形执行管理器)对象时触发 - GPU 可访问内存的基本单位。格式:`obj=0xffff888... size=0x100000` 表示分配 1MB 对象。随时间跟踪总分配内存以检测泄漏。性能下降前的突然分配峰值表示内存压力。将对象指针与后续绑定/故障事件关联以了解对象生命周期。 -当驱动分配 GEM(图形执行管理器)对象 - GPU 可访问内存的基本单位时,此触发。格式:`obj=0xffff888... size=0x100000` 表示分配 1MB 对象。 +**i915_vma_bind** 跟踪将内存映射到 GPU 地址空间。分配内存还不够 - 它必须绑定到 GPU 虚拟地址空间。格式:`obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` 显示在 GPU 虚拟地址 0x100000 处绑定的 64KB。频繁的重新绑定表示内存抖动 - 驱动在压力下驱逐和重新绑定对象。GPU 页面故障通常与绑定操作相关。 -随时间跟踪总分配内存以检测泄漏。性能下降前的突然分配峰值表示内存压力。将对象指针与后续绑定/故障事件关联以了解对象生命周期。高频率小分配表示低效批处理。 +**i915_gem_shrink** 捕获内存压力响应。在内存压力下,驱动回收 GPU 内存。格式:`dev=0 target=0x1000000 flags=0x3` 意味着驱动尝试回收 16MB。高收缩活动表示工作负载的 GPU 内存过小。与性能下降关联 - 如果在帧渲染期间发生收缩,会导致卡顿。 -#### i915_vma_bind:将内存映射到 GPU 地址空间 +**i915_gem_object_fault** 跟踪 CPU 或 GPU 访问未映射内存时的页面故障。格式:`obj=0xffff888... GTT index=128 writable` 表示图形转换表页 128 上的写故障。故障代价昂贵 - 它们在内核解决缺失映射时停止执行。写故障比读故障更昂贵(需要使缓存失效)。GTT 故障表示作业提交前资源绑定不完整。 -分配内存还不够 - 它必须映射(绑定)到 GPU 地址空间。此跟踪点在 VMA(虚拟内存区域)绑定时触发。格式:`obj=0xffff888... offset=0x0000100000 size=0x10000 mappable vm=0xffff888...` 显示在 GPU 虚拟地址 0x100000 处绑定的 64KB。 - -绑定开销影响性能。频繁的重新绑定表示内存抖动 - 驱动在压力下驱逐和重新绑定对象。GPU 页面故障通常与绑定操作相关 - CPU 在 GPU 访问之前绑定内存。像 `PIN_MAPPABLE` 这样的标志表示 CPU 和 GPU 都可以访问的内存。 - -#### i915_gem_shrink:内存压力响应 - -在内存压力下,驱动回收 GPU 内存。格式:`dev=0 target=0x1000000 flags=0x3` 意味着驱动尝试回收 16MB。高收缩活动表示工作负载的 GPU 内存过小。 - -与性能下降关联 - 如果在帧渲染期间发生收缩,会导致卡顿。标志表示收缩的激进程度。反复收缩小目标表示内存碎片。将目标与实际释放量(跟踪对象销毁)进行比较以测量回收效率。 - -#### i915_gem_object_fault:GPU 页面故障 - -当 CPU 或 GPU 访问未映射的内存时,会发生故障。格式:`obj=0xffff888... 
GTT index=128 writable` 表示图形转换表页 128 上的写故障。故障代价昂贵 - 它们在内核解决缺失映射时停止执行。 - -过度的故障会降低性能。写故障比读故障更昂贵(需要使缓存失效)。GTT 故障(GPU 访问未映射的内存)表示作业提交前资源绑定不完整。CPU 故障表示低效的 CPU/GPU 同步 - CPU 在 GPU 使用对象时访问它们。 - -### AMD AMDGPU 跟踪点:命令流和中断 +## AMD AMDGPU 跟踪点:命令提交管道 AMD 的 AMDGPU 驱动提供命令提交和硬件中断的全面跟踪。 -#### amdgpu_cs_ioctl:用户空间命令提交 +**amdgpu_cs_ioctl** 捕获用户空间命令提交。当应用通过 ioctl 提交 GPU 工作时,此跟踪点触发。格式:`sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` 显示提交到图形环的作业 12345 有 2 个间接缓冲区。这标志着用户空间将工作交给内核的时间。记录时间戳以在与 `amdgpu_sched_run_job` 结合时测量提交到执行的延迟。高频率表示小批次 - 更好批处理的潜力。 -当应用通过 ioctl 提交 GPU 工作时,此捕获请求。格式:`sched_job=12345 timeline=gfx context=1000 seqno=567 ring_name=gfx_0.0.0 num_ibs=2` 显示提交到图形环的作业 12345 有 2 个间接缓冲区。 +**amdgpu_sched_run_job** 在内核调度器开始执行先前提交的作业时触发。将时间戳与 `amdgpu_cs_ioctl` 比较可揭示提交延迟。超过 100μs 的提交延迟表示内核调度延迟。每个环的延迟显示特定引擎是否受调度限制。 -这标志着用户空间将工作交给内核的时间。记录时间戳以在与 `amdgpu_sched_run_job` 结合时测量提交到执行的延迟。高频率表示小批次 - 更好批处理的潜力。每个环的分布显示跨引擎的工作负载平衡。 +**amdgpu_bo_create** 跟踪缓冲对象分配 - AMD 的 i915 GEM 对象等价物。格式:`bo=0xffff888... pages=256 type=2 preferred=4 allowed=7 visible=1` 分配 1MB(256 页)。类型表示 VRAM 与 GTT(GPU 可访问的系统内存)。首选/允许域显示放置策略。类型不匹配(请求 VRAM 但回退到 GTT)表示 VRAM 耗尽。可见标志表示 CPU 可访问的内存 - 昂贵,谨慎使用。 -#### amdgpu_sched_run_job:内核调度作业 +**amdgpu_bo_move** 在缓冲对象在 VRAM 和 GTT 之间迁移时触发。迁移代价昂贵(需要通过 PCIe 复制数据)。过度的移动表示内存抖动 - 工作集超过 VRAM 容量。测量移动频率和大小以量化 PCIe 带宽消耗。与性能下降关联 - 迁移停止 GPU 执行。 -内核调度器开始执行先前提交的作业。将时间戳与 `amdgpu_cs_ioctl` 比较可揭示提交延迟。格式包括作业 ID 和用于关联的环。 +**amdgpu_iv** 捕获 GPU 中断。GPU 为完成的工作、错误和事件发出中断信号。格式:`ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` 捕获中断详细信息。源 ID 表示中断类型(完成、故障、热)。高中断率影响 CPU 性能。VMID 和 PASID 识别哪个进程/VM 触发了中断 - 对于多租户调试至关重要。 -超过 100μs 的提交延迟表示内核调度延迟。每个环的延迟显示特定引擎是否受调度限制。与 CPU 调度器跟踪关联以识别内核线程是否被抢占。 - -#### amdgpu_bo_create:缓冲对象分配 - -AMD 的 i915 GEM 对象等价物。格式:`bo=0xffff888... 
pages=256 type=2 preferred=4 allowed=7 visible=1` 分配 1MB(256 页)。类型表示 VRAM 与 GTT(GPU 可访问的系统内存)。首选/允许域显示放置策略。 - -跟踪 VRAM 分配以监控内存使用。类型不匹配(请求 VRAM 但回退到 GTT)表示 VRAM 耗尽。可见标志表示 CPU 可访问的内存 - 昂贵,谨慎使用。 - -#### amdgpu_bo_move:内存迁移 - -当缓冲对象在 VRAM 和 GTT 之间迁移时,此触发。迁移代价昂贵(需要通过 PCIe 复制数据)。过度的移动表示内存抖动 - 工作集超过 VRAM 容量。 - -测量移动频率和大小以量化 PCIe 带宽消耗。与性能下降关联 - 迁移停止 GPU 执行。通过减少工作集或使用更智能的放置策略(将频繁访问的数据保留在 VRAM 中)进行优化。 - -#### amdgpu_iv:GPU 中断 - -GPU 为完成的工作、错误和事件发出中断信号。格式:`ih:0 client_id:1 src_id:42 ring:0 vmid:5 timestamp:1234567890 pasid:100 src_data: 00000001...` 捕获中断详细信息。 - -源 ID 表示中断类型(完成、故障、热)。高中断率影响 CPU 性能。意外中断表示硬件错误。VMID 和 PASID 识别哪个进程/VM 触发了中断 - 对于多租户调试至关重要。 - -### DRM Vblank 跟踪点:显示同步 +## DRM Vblank 跟踪点:显示同步 Vblank(垂直消隐)事件将渲染与显示刷新同步。错过 vblank 会导致丢帧和卡顿。 -#### drm_vblank_event:垂直消隐发生 +**drm_vblank_event** 在显示进入垂直消隐期时触发。格式:`crtc=0 seq=12345 time=1234567890 high-prec=true` 表示显示控制器 0 上的 vblank,序列号 12345。跟踪 vblank 频率以验证刷新率(60Hz = 60 vblanks/秒)。错过的序列表示丢帧。高精度时间戳启用亚毫秒帧时序分析。 -当显示进入垂直消隐期时,此触发。格式:`crtc=0 seq=12345 time=1234567890 high-prec=true` 表示显示控制器 0 上的 vblank,序列号 12345。 +**drm_vblank_event_queued** 和 **drm_vblank_event_delivered** 跟踪 vblank 事件传递到用户空间。排队延迟(队列到传递)测量内核调度延迟。总延迟(vblank 到传递)包括内核和驱动处理。超过 1ms 的延迟表示合成器问题。与用户可见的丢帧关联 - 延迟传递的事件意味着错过的帧。 -跟踪 vblank 频率以验证刷新率(60Hz = 60 vblanks/秒)。错过的序列表示丢帧。高精度时间戳启用亚毫秒帧时序分析。多显示器设置的每 CRTC 跟踪。 +## 运行监控脚本 -#### drm_vblank_event_queued 和 drm_vblank_event_delivered - -这些跟踪 vblank 事件传递到用户空间。排队延迟(队列到传递)测量内核调度延迟。总延迟(vblank 到传递)包括内核和驱动处理。 - -超过 1ms 的延迟表示合成器问题。跨 CRTC 比较以识别有问题的显示。与用户可见的丢帧关联 - 延迟传递的事件意味着错过的帧。 - -## 使用 Bpftrace 脚本监控 - -我们为生产监控创建了供应商特定的 bpftrace 脚本。每个脚本专注于其 GPU 供应商的特定跟踪点,同时共享通用输出格式。 - -### DRM 调度器监视器:通用 GPU 跟踪 - -`drm_scheduler.bt` 脚本在**所有 GPU 驱动**上工作,因为它使用稳定的 uAPI 跟踪点。它跟踪所有环上的作业,测量完成率,并识别依赖停顿。 - -脚本附加到 `gpu_scheduler:drm_run_job`、`gpu_scheduler:drm_sched_process_job` 和 `gpu_scheduler:drm_sched_job_wait_dep`。在作业开始时,它在按作业 ID 键控的 map 中记录时间戳以供以后计算延迟。它递增每个环的计数器以显示工作负载分布。在完成时,它打印栅栏信息。在依赖等待时,它显示哪个作业阻塞哪个栅栏。 - -输出显示时间戳、事件类型(RUN/COMPLETE/WAIT_DEP)、作业 ID、环名称和队列深度。程序结束时,统计数据总结每个环的作业和依赖等待计数。这揭示了特定环是否饱和、作业是否被依赖阻塞以及总体 GPU 利用率模式。 - -### Intel i915 监视器:内存和 I/O 分析 - -`intel_i915.bt` 脚本跟踪 Intel GPU 内存操作、I/O 传输和页面故障。它需要 `CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y`。 - -在 `i915_gem_object_create` 上,它累积总分配内存并存储每个对象的大小。VMA 绑定/解绑事件跟踪 GPU 地址空间更改。收缩事件测量内存压力。Pwrite/pread 跟踪 CPU-GPU 数据传输。故障按类型分类(GTT 与 CPU,读与写)。 - -输出报告分配大小和以 MB 为单位的运行总计。绑定操作显示 GPU 虚拟地址和标志。I/O 操作跟踪偏移量和长度。故障指示类型以及它们是读还是写。结束统计汇总总分配、VMA 操作、内存压力(收缩操作和回收字节)、I/O 量(读/写计数和大小)以及故障分析(总故障,写与读)。 - -这揭示了内存泄漏(没有相应释放的分配)、绑定开销(频繁的重新绑定表示抖动)、内存压力时序(将收缩与性能下降关联)、I/O 模式(大传输与许多小传输)和故障热点(要优化的昂贵操作)。 - -### AMD AMDGPU 监视器:命令提交分析 - -`amd_amdgpu.bt` 脚本专注于 AMD 的命令提交管道,测量从 ioctl 到执行的延迟。 - -在 `amdgpu_cs_ioctl` 上,它记录按作业 ID 键控的提交时间戳。当 `amdgpu_sched_run_job` 触发时,它计算延迟:`(current_time - submit_time)`。缓冲对象创建/移动事件跟踪内存。中断事件按源 ID 计数。虚拟内存操作(刷新、映射、取消映射)测量 TLB 活动。 - -输出显示时间戳、事件类型、作业 ID、环名称和以微秒为单位的计算延迟。结束统计包括内存分配总计、每个环的命令提交计数、提交延迟的平均值和分布(直方图显示有多少作业经历了不同的延迟桶)、按源的中断计数以及虚拟内存操作计数。 - -延迟直方图至关重要 - 大多数作业应该有 <50μs 的延迟。高延迟作业的尾部表示调度问题。每个环的统计显示计算工作负载是否具有与图形不同的延迟。内存迁移跟踪有助于诊断 VRAM 压力。 - -### 显示 Vblank 监视器:帧时序分析 - -`drm_vblank.bt` 脚本跟踪显示同步以诊断丢帧。 - -在 `drm_vblank_event` 上,它记录按 CRTC 和序列键控的时间戳。当 `drm_vblank_event_queued` 触发时,它时间戳队列时间。在 `drm_vblank_event_delivered` 上,它计算队列到传递延迟和总 vblank 到传递延迟。 - -输出显示 vblank 事件、排队事件和带时间戳的传递事件。结束统计包括每个 CRTC 的总 vblank 计数、事件传递计数、平均传递延迟、延迟分布直方图以及总事件延迟(vblank 发生到用户空间传递)。 - -超过 1ms 的传递延迟表示合成器调度问题。总延迟揭示应用可见的端到端延迟。每 CRTC 统计显示特定显示器是否有问题。延迟直方图暴露导致可见卡顿的异常值。 - -## 运行监视器 - -让我们跟踪实时 GPU 活动。导航到脚本目录并使用 bpftrace 运行任何监视器。DRM 调度器监视器在所有 GPU 上工作: +导航到脚本目录并运行 DRM 调度器监视器。它在所有 GPU 上工作: 
```bash cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver/scripts sudo bpftrace drm_scheduler.bt ``` -你将看到如下输出: +预期输出: ``` Tracing DRM GPU scheduler... Hit Ctrl-C to end. @@ -207,75 +182,18 @@ Waits per ring: @waits_per_ring[gfx]: 12 ``` -这显示图形作业主导工作负载(1523 对 89 个计算作业)。很少的依赖等待(12)表示良好的管道并行性。 +图形作业占主导地位(1523 对 89 个计算作业)。很少的依赖等待(12)表示良好的管道并行性。对于 Intel GPU,使用 `intel_i915.bt`。对于 AMD GPU,使用 `amd_amdgpu.bt`。对于显示时序,使用 `drm_vblank.bt`。在 GPU 工作负载(游戏、ML 训练、视频编码)期间运行这些脚本以捕获活动模式。 -对于 Intel GPU,运行 i915 监视器: - -```bash -sudo bpftrace intel_i915.bt -``` - -对于 AMD GPU: - -```bash -sudo bpftrace amd_amdgpu.bt -``` - -对于显示时序: - -```bash -sudo bpftrace drm_vblank.bt -``` - -每个脚本都输出实时事件和运行结束统计。在 GPU 工作负载(游戏、ML 训练、视频编码)期间运行它们以捕获特征模式。 - -## 验证跟踪点可用性 - -在运行脚本之前,验证你的系统上存在跟踪点。我们包含了一个测试脚本: - -```bash -cd bpf-developer-tutorial/src/xpu/gpu-kernel-driver/tests -sudo ./test_basic_tracing.sh -``` - -这检查 gpu_scheduler、drm、i915 和 amdgpu 事件组。它报告哪些跟踪点可用并为你的硬件推荐适当的监控脚本。对于 Intel 系统,它验证内核配置中是否启用了低级跟踪点。 - -你还可以手动检查可用的跟踪点: +在运行脚本之前验证跟踪点存在于你的系统上: ```bash # 所有 GPU 跟踪点 sudo cat /sys/kernel/debug/tracing/available_events | grep -E '(gpu_scheduler|i915|amdgpu|^drm:)' - -# DRM 调度器(稳定,所有供应商) -sudo cat /sys/kernel/debug/tracing/available_events | grep gpu_scheduler - -# Intel i915 -sudo cat /sys/kernel/debug/tracing/available_events | grep i915 - -# AMD AMDGPU -sudo cat /sys/kernel/debug/tracing/available_events | grep amdgpu ``` -要手动启用跟踪点并查看原始输出: +## 总结 -```bash -# 启用 drm_run_job -echo 1 | sudo tee /sys/kernel/debug/tracing/events/gpu_scheduler/drm_run_job/enable - -# 查看跟踪输出 -sudo cat /sys/kernel/debug/tracing/trace - -# 完成后禁用 -echo 0 | sudo tee /sys/kernel/debug/tracing/events/gpu_scheduler/drm_run_job/enable -``` - -## 总结和下一步 - -GPU 内核跟踪点提供对图形驱动行为的前所未有的可见性。DRM 调度器的稳定 uAPI 跟踪点在所有供应商上工作,使它们成为生产监控的完美选择。来自 Intel i915 和 AMD AMDGPU 的供应商特定跟踪点暴露详细的内存管理、命令提交管道和硬件中断模式。 - -我们的 bpftrace 脚本演示了实际监控:测量作业调度延迟、跟踪内存压力、分析命令提交瓶颈以及诊断丢帧。这些技术直接应用于实际问题 - 优化 ML 训练性能、调试游戏卡顿、在云环境中实现公平的 GPU 资源核算以及调查热节流。 - -与传统工具相比,关键优势是完整性和开销。内核跟踪点以纳秒级精度捕获每个事件,成本可忽略不计。没有轮询,没有采样间隙,没有错过的短期作业。这些数据馈送生产监控系统(Prometheus 导出器读取 bpftrace 输出)、临时性能调试(用户报告问题时运行脚本)和自动化优化(基于延迟阈值触发工作负载重新平衡)。 +GPU 内核跟踪点提供零开销的驱动内部可见性。DRM 调度器的稳定 uAPI 跟踪点跨所有供应商工作,适合生产监控。供应商特定跟踪点暴露详细的内存管理和命令提交管道。bpftrace 脚本演示了跟踪作业调度、测量延迟和识别依赖停顿 - 所有这些对于诊断游戏、ML 训练和云 GPU 工作负载中的性能问题都至关重要。 > 如果你想深入了解 eBPF,请查看我们的教程仓库 或访问我们的网站