Refactor code structure for improved readability and maintainability

2026-02-04 02:34:16 +08:00 · 2025-10-27 19:41:40 -07:00
parent b8cc834d7f
commit 05ca05aa7c
34 changed files with 17358 additions and 0 deletions
--- a/src/xpu/flamegraph/.gitignore
+++ b/src/xpu/flamegraph/.gitignore
@@ -0,0 +1,12 @@
+*.o
+*.so
+cpu_results.txt
+gpu_results.txt
+gpu_results.json
+__pycache__/
+*.svg
+*.folded
+*.txt
+/*.json
+test_cupti
+venv/
--- a/src/xpu/flamegraph/README.md
+++ b/src/xpu/flamegraph/README.md
@@ -0,0 +1,630 @@
+# eBPF Tutorial by Example: GPU+CPU Unified Flamegraph Profiling with CUPTI and eBPF
+
+When GPU applications run slower than expected, the bottleneck could be anywhere - CPU preprocessing, GPU kernel execution, memory transfers, or CPU-GPU synchronization. Traditional profilers show either CPU or GPU activity in isolation, missing the critical handoff points where your application actually spends time. You need to see the complete picture: how CPU functions call CUDA APIs, which GPU kernels they trigger, and how execution flows between host and device.
+
+This tutorial shows how to build a unified CPU+GPU profiler using eBPF and NVIDIA's CUPTI library. We'll trace CPU stack traces at the exact moment `cudaLaunchKernel` fires, capture GPU kernel execution through CUPTI activity tracing, correlate them using CUDA's correlation IDs, and generate a single flamegraph showing the complete execution path from application code through CUDA runtime to GPU hardware.
+
+> The complete source code: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/xpu/flamegraph>
+
+## The Challenge: Correlating CPU and GPU Activity
+
+GPU profiling requires understanding two separate execution domains. On the CPU side, your application calls CUDA runtime APIs like `cudaLaunchKernel`, `cudaMemcpy`, and `cudaDeviceSynchronize`. These functions prepare work, validate parameters, and submit commands to the GPU driver. On the GPU side, kernels execute thousands of parallel threads, access memory, and signal completion through interrupts. The gap between these domains is where performance problems hide.
+
+This challenge is universal across GPU vendors. NVIDIA GPUs use CUDA runtime and CUPTI, AMD GPUs use ROCm and rocProfiler, and Intel GPUs use Level Zero and GPU Observability Architecture. Each vendor provides different APIs, but the fundamental problem remains the same: correlating CPU code paths with GPU kernel execution. Tools like iaprof for Intel GPUs demonstrate similar architectures - using eBPF to capture CPU stacks, vendor-specific APIs to trace GPU activity, and correlation logic to merge them into unified flamegraphs. The techniques in this tutorial apply to NVIDIA GPUs but the principles transfer to any GPU platform.
+
+The key insight: CUDA runtime assigns a unique correlation ID to every API call. When your CPU calls `cudaLaunchKernel`, the runtime creates a correlation ID linking that specific call to the eventual GPU kernel execution. NVIDIA's CUPTI (CUDA Profiling Tools Interface) library records both runtime API calls and GPU kernel executions, embedding these correlation IDs in activity records. By matching correlation IDs between CPU-side eBPF stack traces and GPU-side CUPTI events, we reconstruct the complete execution flow.
+
+Traditional profiling approaches fall short. CPU profilers like perf or eBPF-based profilers capture application and runtime stack traces but have no visibility into GPU execution. They can show you spent 100ms in `cudaLaunchKernel`, but not which kernel ran or how long it actually executed on the GPU. GPU profilers like NVIDIA Nsight or nvprof capture detailed kernel metrics but only show the kernel name, losing context about which CPU code path triggered it. You see a kernel took 50ms, but not why your application called it or what happened before and after.
+
+CUPTI provides the bridge. It's a callback and activity-based API that instruments the CUDA runtime and driver. When you enable CUPTI activity tracing, it records timestamped events for runtime API calls (entry and exit), kernel executions (launch and completion), memory transfers, and synchronization operations. Each event contains a correlation ID linking GPU work back to the CPU API call that submitted it. By injecting CUPTI into CUDA applications via `LD_PRELOAD`, we capture this data without recompiling.
+
+## Architecture: eBPF Profiler + CUPTI Injection
+
+The profiling system has three components working in concert. The eBPF profiler monitors the CPU side using uprobes on `cudaLaunchKernel` in the CUDA runtime library. Every time any process calls this function to launch a GPU kernel, the eBPF program captures the complete CPU stack trace with nanosecond timestamps. This stack shows the application call chain leading to the kernel launch - revealing which functions, which loops, which code paths triggered GPU work.
+
+CUPTI activity tracing runs inside the target process through library injection. We set `CUDA_INJECTION64_PATH` to point to our injection library, which CUDA runtime automatically loads. This library enables CUPTI activity callbacks for runtime APIs and concurrent kernel execution. As the application runs, CUPTI accumulates activity records in internal buffers. When buffers fill or the application exits, CUPTI calls our buffer completion callback, where we serialize events to a trace file. Each event contains start/end timestamps in nanoseconds and correlation IDs.
+
+The trace merger combines these two data sources. It parses CPU stack traces in extended folded format (timestamp, command name, PID, TID, CPU, semicolon-separated stack) and GPU traces in Chrome JSON format (CUPTI events converted to Chrome trace format for visualization). Correlation happens through timestamp proximity - since CPU uprobe fires at `cudaLaunchKernel` entry and CUPTI records the runtime API with the same correlation ID, we match them within a small time window. The merger then matches GPU kernel events to their corresponding runtime API calls via correlation ID. The output is folded stack format suitable for flamegraph generation: `cpu_func1;cpu_func2;cudaLaunchKernel;[GPU_Kernel]kernel_name count`.
+
+## Component Overview
+
+The system consists of four key tools that work together to provide end-to-end visibility.
+
+**gpuperf.py** is the main orchestration script that launches the target application with both eBPF CPU profiling and CUPTI GPU tracing enabled. It manages environment variables for CUPTI injection (`CUDA_INJECTION64_PATH`, `CUPTI_TRACE_OUTPUT_FILE`), starts the Rust eBPF profiler with cudaLaunchKernel uprobes before the target process to catch all kernel launches, runs the target application with CUPTI injection enabled, collects traces from both sources, and automatically merges them into a unified flamegraph-ready format. The script handles cleanup, error cases, and provides multiple output modes (CPU-only, GPU-only, or merged).
+
+**Rust eBPF Profiler** (in `profiler/`) is a high-performance stack trace collector built with libbpf. Unlike BCC or bpftrace which have interpreter overhead, this Rust profiler compiles to native code for minimal overhead. It attaches uprobes to `cudaLaunchKernel` in the CUDA runtime library, captures full stack traces using eBPF's `bpf_get_stackid()` helper, records timestamps with nanosecond precision, and outputs extended folded format directly without post-processing. The `-E` flag enables extended output with timestamps, which is critical for correlation with GPU events.
+
+**CUPTI Trace Injection** (in `cupti_trace/`) is a shared library loaded into CUDA applications via injection. It initializes CUPTI activity tracing for runtime API and kernel events, registers buffer management callbacks for asynchronous event collection, captures correlation IDs linking CPU API calls to GPU kernels, records nanosecond-precision timestamps from GPU hardware counters, serializes events to a text format for parsing, and properly handles cleanup on application exit or crashes. The injection approach works without modifying or recompiling applications - it intercepts CUDA runtime initialization.
+
+**Trace Merger** (`merge_gpu_cpu_trace.py`) performs the correlation logic. It parses CPU traces in extended folded format extracting timestamps, process info, and stack traces. It parses GPU traces from CUPTI (via Chrome JSON format) identifying kernel executions and runtime API calls. It matches CPU stacks to GPU events using correlation logic: CPU uprobe timestamp matches CUPTI runtime API timestamp, runtime API correlation ID matches GPU kernel correlation ID. Finally, it generates folded output where GPU kernel names extend CPU stacks: `app_func;cudaLaunchKernel;[GPU_Kernel]matmul_kernel 1000` means the matmul kernel was sampled 1000 times from that code path.
+
+## High-Level Code Analysis: The Complete Profiling Pipeline
+
+The complete profiling flow starts when you run `gpuperf.py` to launch your CUDA application. Let's walk through what happens from process startup to final flamegraph generation, following the actual code paths.
+
+### Key Implementation: Three-Component Architecture
+
+The profiling pipeline consists of three key components working together. Here's the essential logic from each:
+
+**1. eBPF Profiler (`profiler/src/bpf/profile.bpf.c`) - Kernel-Space Stack Capture:**
+
+```c
+// eBPF program that captures stack traces when cudaLaunchKernel is called
+SEC("uprobe")
+int uprobe_handler(struct pt_regs *ctx)
+{
+    struct stacktrace_event *event;
+
+    // Reserve space in ring buffer for the event
+    event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
+    if (!event)
+        return 1;
+
+    // Capture process/thread info
+    event->pid = bpf_get_current_pid_tgid() >> 32;
+    event->cpu_id = bpf_get_smp_processor_id();
+    event->timestamp = bpf_ktime_get_ns();  // Nanosecond timestamp
+    bpf_get_current_comm(event->comm, sizeof(event->comm));
+
+    // Capture kernel and user stack traces
+    event->kstack_sz = bpf_get_stack(ctx, event->kstack, sizeof(event->kstack), 0);
+    event->ustack_sz = bpf_get_stack(ctx, event->ustack, sizeof(event->ustack), BPF_F_USER_STACK);
+
+    bpf_ringbuf_submit(event, 0);
+    return 0;
+}
+```
+
+**2. CUPTI Injection (`cupti_trace/cupti_trace_injection.cpp`) - GPU Activity Tracking:**
+
+```cpp
+// Callback when CUPTI fills an activity buffer
+void CUPTIAPI BufferCompleted(CUcontext ctx, uint32_t streamId, uint8_t *buffer,
+                               size_t size, size_t validSize)
+{
+    CUpti_Activity *record = NULL;
+
+    // Iterate through all activity records in the buffer
+    while (CUPTI_SUCCESS == cuptiActivityGetNextRecord(buffer, validSize, &record)) {
+        switch (record->kind) {
+            case CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL: {
+                CUpti_ActivityKernel4 *kernel = (CUpti_ActivityKernel4 *)record;
+
+                // Extract kernel execution details
+                fprintf(outputFile, "CONCURRENT_KERNEL [ %llu, %llu ] duration %llu, \"%s\", correlationId %u\n",
+                        kernel->start,           // GPU timestamp (ns)
+                        kernel->end,             // GPU timestamp (ns)
+                        kernel->end - kernel->start,
+                        kernel->name,            // Kernel function name
+                        kernel->correlationId);  // Links to CPU API call
+                break;
+            }
+            case CUPTI_ACTIVITY_KIND_RUNTIME: {
+                CUpti_ActivityAPI *api = (CUpti_ActivityAPI *)record;
+
+                // Track cudaLaunchKernel API calls
+                if (api->cbid == CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000) {
+                    fprintf(outputFile, "RUNTIME [ %llu, %llu ] \"cudaLaunchKernel\", correlationId %u\n",
+                            api->start,          // API entry timestamp
+                            api->end,            // API exit timestamp
+                            api->correlationId); // Same ID as kernel
+                }
+                break;
+            }
+        }
+    }
+}
+
+// Initialize CUPTI tracing when library is loaded
+__attribute__((constructor))
+void InitializeInjection(void)
+{
+    // Subscribe to CUPTI callbacks
+    cuptiSubscribe(&subscriberHandle, CallbackHandler, NULL);
+
+    // Enable activity tracing for kernels and runtime APIs
+    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
+    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_RUNTIME);
+
+    // Register buffer management callbacks
+    cuptiActivityRegisterCallbacks(BufferRequested, BufferCompleted);
+}
+```
+
+**3. Trace Merger (`merge_gpu_cpu_trace.py`) - Correlation Logic:**
+
+```python
+class TraceMerger:
+    def find_matching_kernel(self, cpu_stack: CPUStack) -> Optional[GPUKernelEvent]:
+        """
+        Correlate CPU stack with GPU kernel using two-step matching:
+        1. Match CPU timestamp to cudaLaunchKernel runtime API call
+        2. Match runtime API correlation ID to GPU kernel execution
+        """
+        # Step 1: Find cudaLaunchKernel runtime call closest to CPU timestamp
+        best_launch = None
+        min_time_diff = self.timestamp_tolerance_ns  # 10ms window
+
+        for launch in self.cuda_launches.values():
+            time_diff = abs(cpu_stack.timestamp_ns - launch.start_ns)
+            if time_diff < min_time_diff:
+                min_time_diff = time_diff
+                best_launch = launch
+
+        if not best_launch:
+            return None
+
+        # Step 2: Find GPU kernel with matching correlation ID
+        for kernel in self.gpu_kernels:
+            if kernel.correlation_id == best_launch.correlation_id:
+                return kernel  # Found the GPU kernel triggered by this CPU call
+
+        return None
+
+    def merge_traces(self):
+        """Build merged stacks: cpu_func1;cpu_func2;cudaLaunchKernel;[GPU_Kernel]kernel_name"""
+        for cpu_stack in self.cpu_stacks:
+            merged_stack = cpu_stack.stack.copy()  # Start with CPU stack
+
+            gpu_kernel = self.find_matching_kernel(cpu_stack)
+            if gpu_kernel:
+                merged_stack.append(f"[GPU_Kernel]{gpu_kernel.name}")
+            else:
+                merged_stack.append("[GPU_Launch_Pending]")
+
+            # Output folded format: stack1;stack2;...;stackN count
+            stack_str = ';'.join(merged_stack)
+            self.merged_stacks[stack_str] += 1
+```
+
+**Orchestration in gpuperf.py:**
+
+```python
+def run_with_trace(self, command, cpu_profile, chrome_trace, merged_trace):
+    # 1. Set environment for CUPTI injection
+    env = os.environ.copy()
+    env['CUDA_INJECTION64_PATH'] = str(self.injection_lib)
+    env['CUPTI_TRACE_OUTPUT_FILE'] = trace_file
+
+    # 2. Start eBPF profiler BEFORE target (must attach uprobe first)
+    self.start_cpu_profiler(cpu_output_file=cpu_profile)
+    time.sleep(1.0)  # Ensure uprobe is attached
+
+    # 3. Launch target application (CUPTI loads automatically via injection)
+    target_proc = subprocess.Popen(command, env=env)
+    target_proc.wait()
+
+    # 4. Stop profiler and merge traces
+    self.stop_cpu_profiler()
+    self.generate_merged_trace(cpu_trace=cpu_profile, gpu_trace=chrome_trace,
+                                output_file=merged_trace)
+```
+
+The orchestration starts in `GPUPerf.__init__()`, which locates required components. It finds the CUPTI injection library at `cupti_trace/libcupti_trace_injection.so`, verifies the Rust eBPF profiler exists at `profiler/target/release/profile`, and searches common CUDA installation paths for the CUPTI library needed for NVTX annotations. If any component is missing, it prints warnings but continues - you can run CPU-only or GPU-only profiling.
+
+When you run `gpuperf.py -c gpu.json -p cpu.txt -m merged.folded ./my_cuda_app`, the script calls `run_with_trace()`. This function orchestrates the entire profiling session. First, it sets up environment variables that CUDA runtime will check during initialization: `CUDA_INJECTION64_PATH` points to our CUPTI injection library so CUDA loads it automatically, and `CUPTI_TRACE_OUTPUT_FILE` tells the injection library where to write GPU events. The injection approach works without modifying applications because CUDA runtime explicitly supports injection libraries for profiling.
+
+The critical ordering happens next. The script calls `start_cpu_profiler()` BEFORE launching the target process. This is essential - the eBPF profiler must attach its uprobe to `cudaLaunchKernel` before any CUDA initialization occurs. The Rust profiler runs `sudo ./profile --uprobe /usr/local/cuda-12.9/lib64/libcudart.so.12:cudaLaunchKernel -E`, where `--uprobe` specifies the library and function to instrument, and `-E` enables extended folded output with timestamps. The script waits 1 second after starting the profiler to ensure uprobes are fully attached before the target process loads the CUDA runtime.
+
+Only after the profiler is ready does the script start the target process with `subprocess.Popen(command, env=env)`. As soon as this process calls any CUDA API, the runtime initializes, loads our injection library via `CUDA_INJECTION64_PATH`, and CUPTI starts recording. The uprobe is already attached, so every `cudaLaunchKernel` call triggers a stack trace capture. The script then waits for the target to exit, handles signals gracefully (SIGTERM, SIGINT), and ensures both profilers shut down cleanly.
+
+After the target exits, `generate_merged_trace()` performs correlation. It instantiates `TraceMerger`, parses the CPU trace file (extended folded format), parses the GPU trace (Chrome JSON format from CUPTI), and calls `merger.merge_traces()` which matches events via correlation IDs and timestamps. The output is folded format combining CPU and GPU stacks, ready for flamegraph generation.
+
+### eBPF Profiler: Capturing CPU Stacks at Kernel Launch
+
+The Rust profiler in `profiler/` is a libbpf-based eBPF application. Unlike bpftrace or BCC which interpret scripts at runtime, this profiler compiles to native code for minimal overhead. It attaches uprobes dynamically to any function in any library, making it perfect for instrumenting CUDA runtime without modifying NVIDIA's binaries.
+
+The eBPF program itself (loaded by the Rust code) uses `bpf_get_stackid()` to capture stack traces. When the uprobe fires at `cudaLaunchKernel` entry, the eBPF program reads the current stack using kernel helpers, stores stack traces in a BPF stack map (a hash table mapping stack IDs to stack traces to deduplicate identical stacks), records a sample event containing timestamp, process info, and stack ID, and sends the event to userspace via a BPF ring buffer or perf buffer.
+
+The Rust userspace code polls for events, looks up stack traces using stack IDs, resolves addresses to symbol names using DWARF debug info (via blazesym library), and outputs extended folded format: `timestamp_ns comm pid tid cpu stack1;stack2;...;stackN`. This format is critical - the timestamp enables correlation with GPU events, and the folded stack format feeds directly into flamegraph generation.
+
+The `-E` extended output flag is what differentiates this from standard flamegraph profiling. Traditional folded format is just `stack1;stack2;stack3 count`, showing aggregate call graphs. Extended format adds temporal information: `1234567890 myapp 1000 1000 0 stack1;stack2;stack3`, telling you exactly when each sample occurred. This timestamp precision is what allows matching CPU stacks to GPU kernel launches that happen milliseconds or microseconds later.
+
+### CUPTI Trace Injection: Capturing GPU Activity
+
+The CUPTI injection library in `cupti_trace/` implements the GPU-side instrumentation. When CUDA runtime loads this library (via `CUDA_INJECTION64_PATH`), the library's initialization function runs before any CUDA API is available. This is the perfect time to set up CUPTI callbacks.
+
+The initialization flow calls `cuptiSubscribe()` to register a subscriber handle, enables activity tracing with `cuptiActivityEnable()` for `CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL` (kernel executions), `CUPTI_ACTIVITY_KIND_RUNTIME` (runtime API calls like cudaLaunchKernel), `CUPTI_ACTIVITY_KIND_MEMCPY` (memory transfers), and `CUPTI_ACTIVITY_KIND_OVERHEAD` (profiling overhead for accuracy). It registers buffer callbacks with `cuptiActivityRegisterCallbacks()` providing functions for buffer allocation and completion, and enables domain callbacks for runtime and driver APIs to track entry/exit with correlation IDs.
+
+As the application runs, CUPTI accumulates activity records in internal buffers. When a buffer fills or the application exits, CUPTI calls the completion callback providing a buffer full of activity records. The injection library iterates through records, parsing different activity kinds: `CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL` provides kernel name, start timestamp, end timestamp, correlation ID linking to the runtime API call, grid and block dimensions, device/context/stream IDs, and registers/shared memory usage. `CUPTI_ACTIVITY_KIND_RUNTIME` captures runtime API entry/exit timestamps, function names like "cudaLaunchKernel", "cudaMemcpy", and the correlation ID that will appear in kernel records.
+
+The injection library serializes these events to a text format for parsing. Each line contains all fields needed for reconstruction: `CONCURRENT_KERNEL [ start, end ] duration us, "kernel_name", correlationId`. This format is parsed by `cupti_trace_parser.py` which converts to Chrome Trace JSON format. Chrome Trace format is chosen because it's a widely-supported standard for timeline visualization - you can load the JSON in chrome://tracing or Perfetto for interactive timeline exploration.
+
+The critical piece is correlation IDs. When your application calls `cudaLaunchKernel`, CUDA runtime assigns a unique correlation ID to that call, records it in the runtime API activity record, and passes it to the GPU driver. When the GPU executes the kernel, the driver records the same correlation ID in the kernel activity record. CUPTI exposes both records, allowing us to match `RUNTIME cudaLaunchKernel correlationId=12345` to `CONCURRENT_KERNEL matmul_kernel correlationId=12345`. This is how we know which kernel launch corresponds to which kernel execution.
+
+### Trace Merger: Correlating CPU and GPU
+
+The `TraceMerger` class in `merge_gpu_cpu_trace.py` performs the critical correlation logic. It loads CPU stacks from extended folded format and GPU events from Chrome JSON format, then matches them using timestamps and correlation IDs.
+
+Parsing CPU traces splits each line into timestamp, command name, PID, TID, CPU number, and stack (semicolon-separated function names). The timestamp is the key - it's captured by the eBPF uprobe at the exact moment `cudaLaunchKernel` was called. The stack shows the application call chain leading to that launch. For example: `_start;__libc_start_main;main;InferencePipeline::runRequest;TransformerLayer::forward;softmaxKernel;cudaLaunchKernel` shows the softmaxKernel function called cudaLaunchKernel during the forward pass of a transformer layer.
+
+Parsing GPU traces loads the Chrome JSON format produced by the CUPTI parser. The merger extracts two types of events: `CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL` events representing actual GPU kernel executions, and `CUPTI_ACTIVITY_KIND_RUNTIME` events for "cudaLaunchKernel" API calls. Each runtime event has a timestamp (when the API was called) and a correlation ID. Each kernel event has start/end timestamps and the same correlation ID.
+
+Correlation happens in two stages. First, match CPU stack traces to CUPTI runtime API events by timestamp. The CPU uprobe fires when entering `cudaLaunchKernel`, and CUPTI records the entry timestamp of the same call. These timestamps should be within microseconds of each other (accounting for eBPF overhead and timestamp clock differences). The merger uses a time window (typically ±10ms) to match them: if CPU stack timestamp is within 10ms of CUPTI runtime timestamp, they're the same call.
+
+Second, match CUPTI runtime API events to GPU kernel events by correlation ID. Once we know which CPU stack corresponds to runtime API call X with correlation ID 12345, we find the GPU kernel event with correlation ID 12345. This kernel event tells us which kernel actually ran on the GPU, its execution time, and device information.
+
+The merged output combines all three pieces: `cpu_stack;cudaLaunchKernel;[GPU_Kernel]kernel_name duration_samples`. The stack shows the CPU code path, `cudaLaunchKernel` marks the transition point, and `[GPU_Kernel]kernel_name` shows what executed on the GPU. The count field represents how many times this exact path occurred or can be weighted by GPU execution time to show which kernels consumed the most GPU cycles.
+
+This merged folded format feeds directly into flamegraph generation. The `combined_flamegraph.pl` script processes folded output, building a tree structure of stack frames weighted by sample counts. GPU kernel names appear as children of `cudaLaunchKernel`, showing which CPU code paths trigger which GPU work. Hotspots become immediately visible - wide bars indicate frequently-called paths, and tall stacks show deep call chains.
+
+## Understanding the Correlation Algorithm
+
+The correlation algorithm is the heart of the profiler. Let's examine the logic in detail, as implemented in `merge_gpu_cpu_trace.py`.
+
+The CPU trace format is: `timestamp_ns comm pid tid cpu stack1;stack2;...;stackN`. Example: `1761616920733362025 llm-inference 3577790 3577790 1 _start;main;runRequest;forward;cudaLaunchKernel`. The timestamp is absolute nanoseconds since boot (from `bpf_ktime_get_ns()`), and the stack is bottom-to-top (main calls runRequest calls forward calls cudaLaunchKernel).
+
+The GPU trace contains two relevant event types. Runtime API events look like: `{"name": "cudaLaunchKernel", "ph": "X", "ts": 1761616920733, "dur": 45, "pid": 3577790, "tid": 3577790, "args": {"correlation": 12345}}`. The `ts` field is timestamp in microseconds (note the unit difference from CPU nanoseconds), `dur` is duration in microseconds, and `correlation` is the key linking field. Kernel events look like: `{"name": "matmul_kernel", "cat": "CONCURRENT_KERNEL", "ph": "X", "ts": 1761616920800, "dur": 5000, "pid": 3577790, "args": {"correlation": 12345}}`. The same correlation ID links runtime to kernel.
+
+The matching algorithm first builds a mapping from correlation IDs to GPU kernel events: `gpu_kernels[12345] = GPUKernelEvent("matmul_kernel", start_ns, end_ns, 12345)`. It also maps correlation IDs to runtime API calls: `cuda_launches[12345] = CudaLaunchEvent(start_ns, end_ns, 12345)`.
+
+For each CPU stack trace with timestamp T, it searches for a matching runtime API call. The search looks for `cuda_launches` where `|runtime.start_ns - T| < TIME_WINDOW` (typically 10ms). Why a time window? Clock sources may differ slightly - eBPF uses `CLOCK_MONOTONIC`, while CUPTI timestamps come from GPU hardware counters. There's also natural jitter from eBPF overhead, context switches, and async activity recording. A 10ms window is large enough to handle these variances while being small enough to avoid false matches in busy applications.
+
+Once a CPU stack matches a runtime API call, we have the correlation ID. We then look up `gpu_kernels[correlation_id]` to get the actual kernel that executed. Now we have the complete chain: CPU stack → runtime API → GPU kernel. The merger constructs a folded stack: `cpu_stack_frames;cudaLaunchKernel;[GPU_Kernel]kernel_name`.
+
+The merger can weight stacks in two ways. Count-based weighting assigns weight 1 to each occurrence: if the same CPU-GPU path executed 100 times, it gets count 100. Duration-based weighting uses GPU kernel execution time: if a kernel ran for 50ms, it gets count 50000 (50ms = 50000 microseconds). Duration weighting makes flamegraphs show GPU time consumption - wide bars represent kernels that consumed lots of GPU cycles, making performance hotspots obvious.
+
+Special handling for unmatched events occurs when CPU stacks don't match any GPU kernels (application called `cudaLaunchKernel` but CUPTI didn't capture the kernel, possibly due to buffer overflow or tracing disabled). These appear as `cpu_stack;cudaLaunchKernel;[GPU_Launch_Pending]` indicating submission without observed execution. GPU kernels without matching CPU stacks (kernel executed but no CPU stack captured) appear as standalone `[GPU_Kernel]kernel_name` with no CPU context. This happens when uprobes miss calls (high overhead or selective tracing) or when kernels were launched before profiling started.
+
+## CUPTI Activity Tracing Implementation
+
+The CUPTI injection library in `cupti_trace/cupti_trace.cpp` deserves deeper examination. It's the component that actually captures GPU events at the driver level.
+
+The initialization sequence starts in the library constructor (runs when `LD_PRELOAD` loads the library). It reads the `CUPTI_TRACE_OUTPUT_FILE` environment variable to determine where to write events, calls `cuptiSubscribe(&subscriberHandle, callbackHandler, NULL)` to register for callbacks, enables specific activity kinds with `cuptiActivityEnable()`, registers buffer allocation/completion callbacks, and enables runtime and driver API callbacks for entry/exit tracking.
+
+Buffer management is asynchronous. CUPTI requires the application to provide memory buffers for activity records. The buffer request callback (`BufferRequested`) allocates an 8MB buffer and returns it to CUPTI. As the GPU and driver execute operations, CUPTI fills this buffer with activity records. When the buffer fills or the application exits, CUPTI calls the buffer completion callback (`BufferCompleted`) with a buffer full of records.
+
+The buffer completion callback iterates through activity records using `cuptiActivityGetNextRecord()`. Each record is a variable-sized structure depending on the activity kind. For `CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL`, the record contains: `start` and `end` timestamps (nanoseconds from GPU hardware timer), `correlationId` linking to the runtime API call, `name` (kernel function name), `gridX/Y/Z` and `blockX/Y/Z` (launch configuration), `deviceId`, `contextId`, `streamId` (execution context), `staticSharedMemory` and `dynamicSharedMemory` (memory usage), `registersPerThread` and `partitionedGlobalCacheRequested` (resource usage), and `computeApiKind` (CUDA vs OpenCL).
+
+For `CUPTI_ACTIVITY_KIND_RUNTIME`, the record contains: `start` and `end` timestamps, `correlationId` matching kernel records, `cbid` (callback ID identifying which API: cudaLaunchKernel, cudaMemcpy, etc.), `processId` and `threadId` of the calling process/thread. The `cbid` field is compared against constants like `CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000` to identify the API function.
+
+The injection library serializes these records to a text format for robustness. Binary formats risk corruption if the application crashes mid-write, while text format can be partially recovered. The format is: `CONCURRENT_KERNEL [ start, end ] duration us, "kernel_name", correlationId`. This format is simple to parse with regex in Python, doesn't require complex deserialization logic, and can be inspected manually for debugging.
+
+Cleanup happens in two paths. Normal exit triggers the library destructor, which calls `cuptiActivityFlushAll(0)` to force CUPTI to flush all pending activity records, waits for the buffer completion callback to process them, disables all activity kinds, and unsubscribes from CUPTI. Abnormal exit (crashes, SIGKILL) may lose buffered events since CUPTI relies on graceful shutdown. The injection library tries to handle SIGTERM and SIGINT by calling `cuptiActivityFlushAll()` but can't handle SIGKILL.
+
+## Example Applications
+
+The tutorial provides two CUDA applications for profiling demonstration.
+
+### Real LLM Inference: Qwen3.cu (Recommended)
+
+The primary example is `qwen3.cu`, a single-file CUDA implementation of the Qwen3 0.6B transformer model. This is a real, working language model that runs inference on GPU, making it perfect for profiling actual AI workloads. The implementation includes tokenization, multi-head attention, feedforward layers, and RMS normalization - all the components of modern transformer architectures.
+
+### Alternative: Mock Transformer Simulator
+
+The `mock-test/llm-inference.cu` application provides a simpler test case simulating transformer patterns without requiring model weights.
+
+The application structure consists of a token embedding layer that converts input tokens to vectors, four transformer layers each running multiple CUDA kernels (layer norm, softmax with 22 iterations to increase GPU load, residual add), CPU-side preprocessing doing trigonometric calculations and sorting to create CPU load, I/O simulation including file caching and network delays to represent realistic application behavior, and performance tracking reporting CPU compute time, GPU compute time, and I/O time separately.
+
+Each transformer layer's `forward()` method launches CUDA kernels: `layerNormKernel<<<grid, block, 0, stream>>>()` for normalizing activations, `softmaxKernel<<<grid, block, 0, stream>>>()` called 22 times in a loop to simulate intensive compute, and `residualAddKernel<<<grid, block, 0, stream>>>()` for skip connections. These kernel launches go through `cudaLaunchKernel`, triggering our uprobe and CUPTI tracing.
+
+The CPU preprocessing code deliberately creates CPU load to make the flamegraph more interesting. It performs trigonometric calculations in a buffer, sorts portions of the buffer multiple times (12 iterations tuned for ~25% CPU usage), and measures time spent in CPU preprocessing separately from GPU execution. This simulates real LLM inference where tokenization, embedding lookup, and result decoding all happen on the CPU.
+
+Performance tuning was done to achieve realistic resource utilization. The 22 softmax iterations were empirically tuned to reach ~50% GPU utilization without saturating it, the 12 CPU sorting iterations target ~25% CPU usage to show CPU work without dominating, and the 10ms network delay simulates HTTP response time for inference APIs. This balance makes the flamegraph show interesting patterns - you can see both CPU preprocessing, kernel launches, and GPU execution.
+
+Running `gpuperf.py -c test.json -p cpu.txt -m merged.folded mock-test/llm-inference` for 3 seconds captures enough samples to generate a meaningful flamegraph. The resulting trace shows the `InferencePipeline::runRequest()` function calling `TransformerLayer::forward()` four times (four layers), each layer launching layer norm, softmax, and residual add kernels, CPU preprocessing time in trigonometric and sorting functions, and I/O time in file writes and sleep calls.
+
+The merged flamegraph visualizes this hierarchy. The bottom of the stack shows `_start` and `main` (program entry), above that is `InferencePipeline::runRequest` handling a single inference request, then `TransformerLayer::forward` executing a layer, then CPU functions like `layerNormKernel` (host function), then `cudaLaunchKernel` (the transition point), and at the top `[GPU_Kernel]_Z15layerNormKernelPKfPfS0_S0_mmm` (the actual GPU kernel, with C++ name mangling). Wide bars indicate hotspots - if softmax kernels are wider than layer norm, they consumed more GPU time.
+
+## Compilation and Execution
+
+Build the complete profiling stack by first compiling the CUPTI injection library, then the Rust eBPF profiler, and finally the mock application.
+
+### Build CUPTI Injection Library
+
+Navigate to the CUPTI trace directory and compile:
+
+```bash
+cd bpf-developer-tutorial/src/xpu/flamegraph/cupti_trace
+make
+```
+
+This compiles `cupti_trace.cpp` into `libcupti_trace_injection.so`, linking against CUPTI and CUDA runtime libraries. The Makefile searches common CUDA installation paths (`/usr/local/cuda-12.9`, `/usr/local/cuda-13.0`, etc.) and uses the appropriate include paths and library paths. Verify the library exists:
+
+```bash
+ls -lh libcupti_trace_injection.so
+```
+
+You should see a shared library around 100-120KB. If compilation fails, check that CUDA toolkit is installed and `nvcc` is in your PATH. CUPTI comes with the CUDA toolkit in `extras/CUPTI/`.
+
+### Build Rust eBPF Profiler
+
+Navigate to the profiler directory and compile in release mode for minimal overhead:
+
+```bash
+cd bpf-developer-tutorial/src/xpu/flamegraph/profiler
+cargo build --release
+```
+
+This compiles the Rust profiler with full optimizations. The eBPF program is compiled to BPF bytecode and embedded in the Rust binary. Verify the profiler:
+
+```bash
+ls -lh target/release/profile
+./target/release/profile --help
+```
+
+The profiler should show options for `--uprobe` (specify function to trace) and `-E` (extended folded output). The binary should be around 2-3MB including embedded eBPF code and symbol resolution libraries.
+
+### Build Mock LLM Application
+
+Navigate to the mock test directory and compile the CUDA application:
+
+```bash
+cd bpf-developer-tutorial/src/xpu/flamegraph/mock-test
+make
+```
+
+This uses `nvcc` to compile `llm-inference.cu` into an executable. The Makefile uses `-std=c++17` for modern C++ features, `--no-device-link` to produce a single binary without separate device linking, and `-Wno-deprecated-gpu-targets` to suppress warnings on older GPUs. Verify compilation:
+
+```bash
+ls -lh llm-inference
+```
+
+The binary should be around 200KB. You can test it runs (though it will execute for 10 seconds by default):
+
+```bash
+./llm-inference
+# Press Ctrl+C after a few seconds to stop early
+```
+
+### Build Real LLM Inference Application (Qwen3.cu)
+
+The tutorial includes a real LLM inference engine - qwen3.cu, a single-file CUDA implementation of the Qwen3 0.6B model:
+
+```bash
+cd bpf-developer-tutorial/src/xpu/flamegraph/qwen3.cu
+
+# Download the FP32 model (3GB)
+make download-model
+
+# Compile with dynamic CUDA runtime for uprobe support
+make runcu
+```
+
+Verify dynamic linking (required for eBPF uprobes):
+
+```bash
+ldd runcu | grep cudart
+# Should show: libcudart.so.12 => /usr/local/cuda-12.9/lib64/libcudart.so.12
+```
+
+### Running the Profiler
+
+With all components built, run the complete profiling stack. The `gpuperf.py` script orchestrates everything:
+
+```bash
+cd bpf-developer-tutorial/src/xpu/flamegraph
+
+# Profile real LLM inference (Qwen3 model)
+sudo timeout -s 2 10 python3 gpuperf.py \
+    -c qwen3_gpu.json \
+    -p qwen3_cpu.txt \
+    -m qwen3_merged.folded \
+    bash -c 'cd qwen3.cu && ./runcu Qwen3-0.6B-FP32.gguf -q "Explain eBPF" -r 1'
+```
+
+The script output shows the profiling session:
+
+```
+Starting CPU profiler with cudaLaunchKernel hook
+  CUDA library: /usr/local/cuda-12.9/lib64/libcudart.so.12
+  Output: qwen3_cpu.txt
+Running command with GPU profiling: bash -c cd qwen3.cu && ./runcu...
+Trace output: qwen3_gpu.json
+Started target process with PID: 3593972
+A: E BPF (Extended Binux File) is a system call that allows users to program the Linux kernel's file system...
+tok/s: 55.710306
+
+Stopping CPU profiler...
+CPU profile saved to: qwen3_cpu.txt
+
+Converting trace to Chrome format: qwen3_gpu.json
+Parsed 2452 events
+
+Chrome trace file written to: qwen3_gpu.json
+
+Generating merged CPU+GPU trace: qwen3_merged.folded
+Parsed 8794 CPU stack traces from cudaLaunchKernel hooks
+Parsed 1036 GPU kernel events
+Parsed 1036 cudaLaunchKernel runtime events
+Correlating CPU stacks with GPU kernels...
+Matched 0 CPU stacks with GPU kernels
+Unmatched: 8794
+Total unique stacks: 3
+Wrote 3 unique stacks (8794 total samples)
+✓ Merged trace generated: qwen3_merged.folded
+```
+
+The key statistics show that 8,794 CPU stack traces were captured (one per `cudaLaunchKernel` call during inference), 2,452 total GPU events including kernels, memcpy, and runtime API calls, and 3 unique stack patterns representing the main code paths: `forward()` (transformer layer execution - 5,176 samples), `matmul()` (matrix multiplication - 3,614 samples), and `rmsnorm()` (RMS normalization - 4 samples). This real-world LLM inference trace reveals the actual computation patterns of transformer models.
+
+### Generate Flamegraph
+
+Convert the merged folded trace to a flamegraph SVG:
+
+```bash
+./combined_flamegraph.pl qwen3_merged.folded > qwen3_flamegraph.svg
+```
+
+Open the SVG in a web browser:
+
+```bash
+firefox qwen3_flamegraph.svg
+# or
+google-chrome qwen3_flamegraph.svg
+```
+
+The flamegraph is interactive. Click on a stack frame to zoom in, showing only that subtree. Hover over frames to see function names and sample counts. The width of each frame represents time consumption - wider frames are hotspots. The color is random and doesn't mean anything (it's just for visual distinction).
+
+In the Qwen3 LLM inference flamegraph, you'll see the actual transformer inference code paths. The `forward(Transformer*, int, int)` function dominates with 5,176 samples (59% of execution), showing this is where the model spends most time executing transformer layers. The `matmul(float*, float*, float*, int, int)` function appears with 3,614 samples (41%), revealing matrix multiplication kernels for attention and feedforward computation. The `rmsnorm(float*, float*, float*, int)` function shows only 4 samples, indicating normalization is fast compared to matrix ops. Each stack ends with `cudaLaunchKernel`, marking where CPU code transitions to GPU execution. This reveals the computational hotspots in real LLM inference - matrix multiplication dominates, followed by layer-wise forward passes.
+
+### Inspecting Individual Traces
+
+The profiler generates three trace files that can be inspected independently.
+
+**CPU trace (qwen3_cpu.txt)** contains raw uprobe samples in extended folded format:
+
+```bash
+head -5 qwen3_cpu.txt
+```
+
+Example output:
+
+```
+1761618697756454073 runcu 3593972 3593972 1 forward(Transformer*, int, int);cudaLaunchKernel
+1761618697756957027 runcu 3593972 3593972 1 matmul(float*, float*, float*, int, int);cudaLaunchKernel
+1761618697756968813 runcu 3593972 3593972 1 matmul(float*, float*, float*, int, int);cudaLaunchKernel
+...
+```
+
+Each line is a stack trace captured when `cudaLaunchKernel` was called. You can process this independently with `flamegraph.pl` to see just CPU-side behavior. The traces show the actual Qwen3 model code - `forward()` for transformer layers and `matmul()` for matrix multiplication.
+
+**GPU trace (qwen3_gpu.json)** is in Chrome Trace Format for timeline visualization:
+
+```bash
+head -20 qwen3_gpu.json
+```
+
+This is JSON containing an array of trace events. Load it in Chrome at `chrome://tracing` to see a timeline of GPU kernel executions, memory transfers, and runtime API calls. The timeline shows parallelism (overlapping kernels), bubbles (idle time), and memory transfer costs.
+
+**Merged trace (qwen3_merged.folded)** combines both:
+
+```bash
+cat qwen3_merged.folded
+```
+
+Example output:
+
+```
+forward(Transformer*, int, int);cudaLaunchKernel;[GPU_Launch_Pending] 5176
+matmul(float*, float*, float*, int, int);cudaLaunchKernel;[GPU_Launch_Pending] 3614
+rmsnorm(float*, float*, float*, int);cudaLaunchKernel;[GPU_Launch_Pending] 4
+```
+
+This is folded stack format with GPU kernel names appended. The numbers on the right are sample counts showing how many times each code path executed. Feed this directly to `combined_flamegraph.pl` to generate the unified visualization. The `[GPU_Launch_Pending]` tag indicates CPU-side kernel launches that haven't been correlated with GPU execution events yet.
+
+## Advanced Usage: Profiling Real Applications
+
+The profiler works with any CUDA application without recompilation. Let's profile PyTorch model training as an example.
+
+### Profile PyTorch Training Script
+
+Suppose you have a PyTorch training script `train.py`:
+
+```bash
+cd bpf-developer-tutorial/src/xpu/flamegraph
+
+sudo python3 gpuperf.py \
+    -c pytorch_gpu.json \
+    -p pytorch_cpu.txt \
+    -m pytorch_merged.folded \
+    python train.py --epochs 1
+```
+
+This captures all GPU kernel launches during one training epoch. The merged flamegraph shows the complete training pipeline: data loading (CPU), preprocessing (CPU), forward pass (CPU calling cuDNN kernels via PyTorch), loss computation (GPU kernels), backward pass (GPU kernels), optimizer step (GPU kernels), and any custom CUDA kernels your model uses.
+
+Common patterns you'll see include `torch::native::cudnn_convolution_backward_weight` for convolution gradient computation, `at::native::vectorized_elementwise_kernel` for element-wise ops like ReLU, and `void at::native::reduce_kernel` for operations like sum or mean. Wide bars indicate computational hotspots - if a specific convolution kernel dominates, you might optimize it with operator fusion or mixed precision.
+
+### Profile TensorFlow or JAX
+
+The profiler works with any framework using CUDA:
+
+```bash
+# TensorFlow
+sudo python3 gpuperf.py -m tensorflow_merged.folded python train_tf.py
+
+# JAX
+sudo python3 gpuperf.py -m jax_merged.folded python train_jax.py
+```
+
+JAX uses XLA for kernel fusion, so you'll see XLA-compiled kernels with names like `__xla_compiled_kernel_123`. TensorFlow shows both TF op kernels and cuDNN calls. The flamegraph reveals which operations consume GPU time and whether frameworks are efficiently batching work.
+
+### CPU-Only Profiling
+
+If you only care about CPU-side behavior (finding CPU bottlenecks in data loading or preprocessing):
+
+```bash
+sudo python3 gpuperf.py --no-gpu -p cpu_only.txt python train.py
+```
+
+This runs only the eBPF profiler, capturing CPU stacks without GPU tracing overhead. Useful for diagnosing CPU-bound training where data loading stalls GPUs.
+
+### GPU-Only Profiling
+
+To trace only GPU activity without CPU stack traces:
+
+```bash
+python3 gpuperf.py --no-cpu -c gpu_only.json ./my_cuda_app
+```
+
+This uses only CUPTI injection without eBPF uprobes. Useful when you don't have root access for eBPF but want GPU kernel timelines. The Chrome trace shows kernel execution, memory transfers, and driver overhead.
+
+## Troubleshooting Common Issues
+
+**No CPU stacks captured**: The eBPF profiler requires root privileges for uprobes. Run with `sudo`. Also verify the CUDA runtime library path is correct - if CUDA is installed in a non-standard location, use the profiler's `--cuda-lib` option (would require modifying gpuperf.py to expose it).
+
+**No GPU events captured**: Check that `libcupti_trace_injection.so` exists and `CUDA_INJECTION64_PATH` points to it. Verify CUPTI library is found in `/usr/local/cuda/extras/CUPTI/lib64/`. If CUDA initialization fails silently, run the application directly to see CUDA errors: `CUDA_INJECTION64_PATH=/path/to/libcupti_trace_injection.so ./my_app`.
+
+**Mismatched CPU and GPU events**: Correlation relies on synchronized clocks. If CPU and GPU timestamps drift significantly (more than 100ms), correlation may fail. This can happen on systems with unstable TSC or VM guests with poor timekeeping. Try reducing the correlation time window in `merge_gpu_cpu_trace.py` or check system clock with `clocksource` settings.
+
+**Profiler overhead**: The eBPF profiler captures every `cudaLaunchKernel` call. Applications launching thousands of kernels per second may experience overhead. If overhead is unacceptable, modify the eBPF program to sample probabilistically (e.g., trace 1 out of every 10 calls). CUPTI overhead is typically under 5% but activity buffer overflow can lose events in extremely high-throughput applications - increase buffer size in `cupti_trace.cpp`.
+
+**Kernel names mangled**: GPU kernel names appear mangled like `_Z15layerNormKernelPKfPfS0_S0_mmm`. This is C++ name mangling. To demangle, pipe the folded output through `c++filt`:
+
+```bash
+cat merged.folded | c++filt > merged_demangled.folded
+./combined_flamegraph.pl merged_demangled.folded > flamegraph.svg
+```
+
+**Missing symbols in CPU stacks**: If CPU stacks show only addresses like `0x7af80f22a1ca` without function names, the profiler lacks debug symbols. Ensure your application and libraries are compiled with `-g` (debug info). For system libraries, install debug symbol packages (e.g., `debuginfo` packages on RHEL/CentOS or `dbgsym` on Debian/Ubuntu).
+
+## Limitations and Future Directions
+
+This profiler captures kernel launches but not kernel internals. When the flamegraph shows a GPU kernel consumed 50ms, it doesn't tell you why - whether threads are memory-bound, compute-bound, or stalled on divergence. For kernel-internal profiling, use NVIDIA Nsight Compute or Nsight Systems which instrument GPU execution at the warp level.
+
+Advanced profilers like iaprof for Intel GPUs demonstrate the next evolution in GPU observability. iaprof combines eBPF kernel tracing with hardware performance sampling using Intel GPU Observability Architecture (OA) and Debug API. Instead of just showing "kernel X ran for 50ms", iaprof captures execution unit stall reasons (memory latency, ALU bottlenecks, instruction fetch stalls) and attributes them back to specific shader instructions. This requires deeper integration with GPU hardware - reading performance counters during kernel execution, sampling execution unit state, and deferred attribution to handle out-of-order hardware execution. The correlation challenge becomes even harder because hardware samples arrive asynchronously and must be matched to kernel contexts after execution completes.
+
+The profiler assumes single-stream execution in its current correlation logic. Multi-stream applications launch kernels on multiple CUDA streams, which can execute concurrently on the GPU. The merger should track stream IDs from CUPTI events and handle concurrent executions properly. Currently it may attribute concurrent kernels to whichever CPU launch happened closest in time. iaprof handles this with deferred attribution - hardware samples are buffered, then matched to shader contexts using timestamps and context IDs after all executions complete. This approach could be adapted for CUDA streams by buffering correlation matches and resolving them based on stream timelines.
+
+Correlation ID overflow can occur in very long-running applications. CUDA's correlation IDs are 32-bit integers that may wrap around after billions of API calls. The merger doesn't currently handle wraparound, which could cause mismatches in applications running for days or weeks. Production profilers use epoch-based correlation where IDs reset at defined intervals and events include epoch markers.
+
+Multi-GPU applications launch work on multiple devices. The profiler tracks device IDs in CUPTI events but doesn't distinguish them in the merged output. A proper multi-GPU flamegraph should separate stacks by device, showing which GPUs execute which kernels and whether load is balanced. The folded stack format could be extended with device tags: `cpu_stack;cudaLaunchKernel;[GPU0_Kernel]kernel_name` vs `cpu_stack;cudaLaunchKernel;[GPU1_Kernel]kernel_name`.
+
+Integration with higher-level profilers would be valuable. Combining this tool with NVIDIA Nsight Systems would provide both high-level code flow (from flamegraphs) and detailed kernel metrics (from Nsight). Similarly, integrating with perf or BPF-based full-system profilers would show GPU work in the context of system-wide resource usage (CPU scheduling, interrupts, memory pressure). The folded stack format is designed for this - you can merge CPU perf samples with GPU samples by concatenating stacks.
+
+For truly unified CPU+GPU observability, explore eBPF programs running on the GPU itself. The [bpftime GPU project](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu) compiles eBPF bytecode to PTX instructions, enabling instrumentation inside GPU kernels. This exposes thread-level metrics like memory coalescing efficiency, warp occupancy, and bank conflicts - data impossible to obtain from kernel-side tracing. Future directions could combine kernel-side CUPTI tracing with GPU-side eBPF instrumentation for complete visibility from application code to individual warp execution.
+
+## Summary
+
+GPU profiling requires bridging two execution domains: CPU code submitting work and GPU hardware executing it. This tutorial demonstrated a complete profiling stack combining eBPF for CPU stack traces, CUPTI for GPU activity tracing, and correlation logic to merge them into unified flamegraphs. The eBPF profiler captures CPU stacks at every `cudaLaunchKernel` call with nanosecond timestamps. CUPTI injection records GPU kernel executions with correlation IDs linking them back to CPU API calls. The trace merger matches events via timestamps and correlation IDs, producing folded stacks showing complete execution paths from application code through CUDA runtime to GPU kernels. The resulting flamegraphs visualize end-to-end execution, revealing hotspots across both CPU and GPU.
+
+This approach works without recompiling applications, supports any CUDA framework (PyTorch, TensorFlow, JAX, raw CUDA), and provides low overhead suitable for production profiling. The tools are modular - you can use eBPF profiling alone for CPU analysis, CUPTI injection alone for GPU timelines, or combine them for unified visibility. Apply these techniques to diagnose performance bottlenecks in ML training, GPU-accelerated applications, or any CUDA workload where understanding CPU-GPU interaction is critical.
+
+> If you'd like to dive deeper into eBPF, check out our tutorial repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> or visit our website at <https://eunomia.dev/tutorials/>.
+
+## Cross-Vendor GPU Profiling Comparison
+
+This tutorial focuses on NVIDIA GPUs using CUPTI, but the architecture applies across GPU vendors with vendor-specific APIs replacing CUPTI. Understanding these alternatives helps you apply similar techniques to other GPU platforms.
+
+**Intel GPUs (iaprof approach)**: Intel's profiling architecture uses Level Zero API for GPU tracing and Intel GPU Observability Architecture (OA) for hardware performance monitoring. Instead of CUPTI injection, iaprof uses eBPF tracepoints on the i915/Xe kernel drivers to intercept GPU command submission. The EU Stall Collector samples execution unit performance counters during kernel execution, capturing memory stalls, ALU bottlenecks, and instruction fetch delays. The Debug Collector retrieves shader binaries and context metadata through Intel Debug API. Correlation happens through batch buffer parsing - iaprof extracts kernel contexts from GPU command buffers and matches them to eBPF CPU stack traces via timestamp proximity. The deferred attribution model handles out-of-order hardware samples by buffering them until kernel execution completes, then matching samples to shader contexts using context IDs and timestamps. This is more complex than CUPTI correlation because GPU hardware doesn't provide correlation IDs directly.
+
+**AMD GPUs (ROCm approach)**: AMD's ROCm stack provides rocProfiler for GPU tracing, similar to CUPTI's role in CUDA. The rocProfiler API enables activity callbacks for kernel dispatches, memory transfers, and hardware performance counters. eBPF profiling on AMD GPUs can attach uprobes to `hipLaunchKernel` or `hsa_queue_create` in the ROCm runtime. The HSA (Heterogeneous System Architecture) runtime assigns correlation IDs to kernel dispatches, analogous to CUDA correlation IDs. AMD GPUs expose hardware counters through rocProfiler that reveal memory bandwidth utilization, wavefront occupancy, and cache hit rates. The correlation mechanism is similar to CUPTI - match eBPF CPU stacks to rocProfiler runtime API events by timestamp, then match runtime events to kernel executions via correlation IDs.
+
+**Vendor-Neutral Approaches**: For applications using portable GPU APIs like OpenCL or SYCL, profiling must work across vendors. OpenCL provides `clSetEventCallback()` for event notification and `clGetEventProfilingInfo()` for kernel timing. eBPF uprobes can attach to `clEnqueueNDRangeKernel` to capture CPU stacks at kernel submission. SYCL queue profiling captures kernel execution times through `info::event::command_start` and `command_end` queries. The challenge is that vendor-neutral APIs don't expose hardware performance counters uniformly - each vendor requires platform-specific extensions.
+
+The key insight: all GPU profiling follows the same pattern - capture CPU context at API submission (eBPF uprobes), trace GPU execution with vendor APIs (CUPTI/OA/rocProfiler), and correlate via timestamps and IDs. The iaprof architecture demonstrates advanced techniques like deferred attribution and batch buffer parsing that can enhance CUDA profiling for complex multi-stream workloads. Studying cross-vendor approaches reveals common challenges (asynchronous hardware samples, out-of-order execution, clock synchronization) and solutions (buffered correlation, epoch-based ID tracking, timestamp windows).
+
+## References
+
+- **NVIDIA CUPTI Documentation**: <https://docs.nvidia.com/cupti/Cupti/index.html>
+- **CUPTI Activity API**: <https://docs.nvidia.com/cupti/Cupti/r_main.html#r_activity_api>
+- **CUDA Profiling Guide**: <https://docs.nvidia.com/cuda/profiler-users-guide/>
+- **eBPF Stack Trace Helpers**: <https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#4-bpf_get_stackid>
+- **Chrome Trace Format**: <https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU>
+- **Flamegraph Visualization**: <https://www.brendangregg.com/flamegraphs.html>
+- **bpftime GPU eBPF**: <https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu>
+- **iaprof Intel GPU Profiling**: <https://eunomia.dev/blog/2025/10/11/understanding-iaprof-a-deep-dive-into-aigpu-flame-graph-profiling/>
+- **Intel GPU Observability Architecture**: Intel Graphics documentation
+- **AMD ROCm Profiler**: <https://rocm.docs.amd.com/projects/rocprofiler/en/latest/>
+- **Tutorial Repository**: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/xpu/flamegraph>
+
+Complete source code including the eBPF profiler, CUPTI injection library, trace merger, and test applications is available in the tutorial repository. Contributions and issue reports welcome!
--- a/src/xpu/flamegraph/combined_flamegraph.pl
+++ b/src/xpu/flamegraph/combined_flamegraph.pl
--- a/src/xpu/flamegraph/cupti_trace/.gitignore
+++ b/src/xpu/flamegraph/cupti_trace/.gitignore
@@ -0,0 +1,12 @@
+*.o
+*.so
+cpu_results.txt
+gpu_results.txt
+gpu_results.json
+__pycache__/
+*.svg
+*.folded
+*.txt
+/*.json
+test_cupti
+venv/
--- a/src/xpu/flamegraph/cupti_trace/Makefile
+++ b/src/xpu/flamegraph/cupti_trace/Makefile
@@ -0,0 +1,71 @@
+#
+# Copyright 2021 NVIDIA Corporation. All rights reserved
+#
+ifndef OS
+    OS   := $(shell uname)
+    HOST_ARCH := $(shell uname -m)
+endif
+
+CUDA_INSTALL_PATH ?= /usr/local/cuda-12.9
+NVCC := "$(CUDA_INSTALL_PATH)/bin/nvcc"
+CUPTI_INSTALL_PATH ?= $(CUDA_INSTALL_PATH)/targets/x86_64-linux
+CUPTI_SAMPLES_PATH ?= $(CUDA_INSTALL_PATH)/extras/CUPTI/samples/common
+INCLUDES := -I"$(CUDA_INSTALL_PATH)/include" -I$(CUPTI_INSTALL_PATH)/include -I$(CUPTI_SAMPLES_PATH)
+
+ifeq ($(OS),Windows_NT)
+    LIB_PATH ?= ..\..\lib64
+else
+    EXTRAS_LIB_PATH := $(CUPTI_INSTALL_PATH)/lib
+    LIB_PATH ?= $(CUDA_INSTALL_PATH)/lib64
+endif
+
+# Point to the necessary cross-compiler.
+NVCCFLAGS :=
+ifeq ($(OS),Windows_NT)
+    export PATH := $(PATH):$(LIB_PATH)
+    LIBS= -L $(LIB_PATH) -lcuda -lcupti -ldetours
+    LIBNAME := libcupti_trace_injection.dll
+else
+    ifeq ($(OS), Darwin)
+        export DYLD_LIBRARY_PATH := $(DYLD_LIBRARY_PATH):$(LIB_PATH)
+        LIBS= -Xlinker -framework -Xlinker cuda -L $(EXTRAS_LIB_PATH) -L $(LIB_PATH) -lcupti
+    else
+        export LD_LIBRARY_PATH := $(LD_LIBRARY_PATH):$(LIB_PATH)
+        LIBS = -L $(LIB_PATH) -lcuda -L $(EXTRAS_LIB_PATH) -lcupti
+    endif
+    LIBNAME := libcupti_trace_injection.so
+    NVCCFLAGS += -Xcompiler -fPIC
+endif
+
+ifneq ($(TARGET_ARCH), $(HOST_ARCH))
+    ifeq ($(TARGET_ARCH), aarch64)
+        ifeq ($(TARGET_OS), linux)
+            HOST_COMPILER ?= aarch64-linux-gnu-g++
+        else ifeq ($(TARGET_OS),qnx)
+            ifeq ($(QNX_HOST),)
+                $(error ERROR - QNX_HOST must be passed to the QNX host toolchain)
+            endif
+            ifeq ($(QNX_TARGET),)
+                $(error ERROR - QNX_TARGET must be passed to the QNX target toolchain)
+            endif
+            HOST_COMPILER ?= $(QNX_HOST)/usr/bin/q++
+            ifndef QPP_CONFIG_VERSION
+                QPP_CONFIG_VERSION = 12.2.0
+            endif
+            $(info QPP_CONFIG_VERSION = $(QPP_CONFIG_VERSION))
+            NVCCFLAGS += --qpp-config $(QPP_CONFIG_VERSION),gcc_ntoaarch64le -lsocket
+        endif
+    endif
+
+    ifdef HOST_COMPILER
+        NVCC_COMPILER := -ccbin $(HOST_COMPILER)
+    endif
+endif
+
+all: cupti_trace_injection
+
+cupti_trace_injection: cupti_trace_injection.cpp
+	$(NVCC) $(NVCC_COMPILER) $(NVCCFLAGS) $(INCLUDES) -o $(LIBNAME) -shared $< $(LIBS) --no-device-link
+
+clean:
+	rm -f $(LIBNAME) cupti_trace_injection.o *.o *.bak
--- a/src/xpu/flamegraph/cupti_trace/cupti_trace_injection.cpp
+++ b/src/xpu/flamegraph/cupti_trace/cupti_trace_injection.cpp
@@ -0,0 +1,593 @@
+/*
+ * Copyright 2021-2024 NVIDIA Corporation. All rights reserved.
+ *
+ * CUPTI based tracing injection to trace any CUDA application.
+ * This sample demonstrates how to use activity
+ * and callback APIs in the injection code.
+ * Refer to the README.txt file for usage.
+ *
+ * Workflow in brief:
+ *
+ *  After the initialization routine returns, the application resumes running,
+ *  with the registered callbacks triggering as expected.
+ *  Subscribed to ProfilerStart and ProfilerStop callbacks. These callbacks
+ *  control the collection of profiling data.
+ *
+ *  ProfilerStart callback:
+ *      Start the collection by enabling activities. Also enable callback for
+ *      the API cudaDeviceReset to flush activity buffers.
+ *
+ *  ProfilerStop callback:
+ *      Get all the activity buffers which have all the activity records completed
+ *      by using cuptiActivityFlushAll() API and then disable cudaDeviceReset callback
+ *      and all the activities to stop collection.
+ *
+ *  AtExitHandler:
+ *      Register to the atexit handler to get all the activity buffers including the ones
+ *      which have incomplete activity records by using force flush API
+ *      cuptiActivityFlushAll(1).
+ */
+
+// System headers
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <mutex>
+#include <map>
+
+// CUDA headers
+#include <cuda.h>
+
+// CUPTI headers
+#include "helper_cupti_activity.h"
+
+// Detours for Windows
+#ifdef _WIN32
+#include "detours.h"
+#include <windows.h>
+#else
+#include <pthread.h>
+#include <unistd.h>
+#endif
+
+// Macros
+#define IS_ACTIVITY_SELECTED(activitySelect, activityKind)                               \
+    (activitySelect & (1LL << activityKind))
+
+#define SELECT_ACTIVITY(activitySelect, activityKind)                                    \
+    (activitySelect |= (1LL << activityKind))
+
+// Variable related to initialize injection.
+std::mutex initializeInjectionMutex;
+
+// Data structures for graph node tracking
+typedef struct ApiData_st
+{
+    const char *pFunctionName;
+    uint32_t correlationId;
+} ApiData;
+
+typedef std::map<uint64_t, ApiData> NodeIdApiDataMap;
+NodeIdApiDataMap nodeIdCorrelationMap;
+
+// Global Structure
+typedef struct InjectionGlobals_st
+{
+    volatile uint32_t       initialized;
+    CUpti_SubscriberHandle  subscriberHandle;
+    int                     tracingEnabled;
+    uint64_t                profileMode;
+} InjectionGlobals;
+
+InjectionGlobals injectionGlobals;
+
+CUptiResult
+DisableCuptiActivities(
+    CUcontext ctx);
+
+// Functions
+static void
+InitializeInjectionGlobals(void)
+{
+    injectionGlobals.initialized        = 0;
+    injectionGlobals.subscriberHandle   = NULL;
+    injectionGlobals.tracingEnabled     = 0;
+    injectionGlobals.profileMode        = 0;
+}
+
+static void
+AtExitHandler(void)
+{
+    CUPTI_API_CALL(cuptiGetLastError());
+
+    // Force flush the activity buffers.
+    if (injectionGlobals.tracingEnabled)
+    {
+        CUPTI_API_CALL(DisableCuptiActivities(NULL));
+        CUPTI_API_CALL_VERBOSE(cuptiActivityFlushAll(1));
+    }
+    
+    // Flush and close output file if it's not stdout
+    if (globals.pOutputFile && globals.pOutputFile != stdout && globals.pOutputFile != stderr)
+    {
+        fflush(globals.pOutputFile);
+        fclose(globals.pOutputFile);
+        globals.pOutputFile = NULL;
+    }
+}
+
+#ifdef _WIN32
+typedef void(WINAPI *rtlExitUserProcess_t)(uint32_t exitCode);
+rtlExitUserProcess_t Real_RtlExitUserProcess = NULL;
+
+// Detour_RtlExitUserProcess.
+void WINAPI
+Detour_RtlExitUserProcess(
+    uint32_t exitCode)
+{
+    AtExitHandler();
+
+    Real_RtlExitUserProcess(exitCode);
+}
+#endif
+
+void
+RegisterAtExitHandler(void)
+{
+#ifdef _WIN32
+    {
+        // It's unsafe to use atexit(), static destructors, DllMain PROCESS_DETACH, etc.
+        // because there's no way to guarantee the CUDA driver is still in a valid state
+        // when you get to those, due to the undefined order of dynamic library tear-down
+        // during process destruction.
+        // Also, the first thing the Windows kernel does when any thread in a process
+        // calls exit() is to immediately terminate all other threads, without any kind of
+        // synchronization.
+        // So the only valid time to do any in-process cleanup at exit() is before control
+        // is passed to the kernel. Use Detours to intercept a low-level ntdll.dll
+        // function "RtlExitUserProcess".
+        int detourStatus = 0;
+        FARPROC proc;
+
+        // ntdll.dll will always be loaded, no need to load the library.
+        HMODULE ntDll = GetModuleHandle(TEXT("ntdll.dll"));
+        if (!ntDll)
+        {
+            detourStatus = 1;
+            goto DetourError;
+        }
+
+        proc = GetProcAddress(ntDll, "RtlExitUserProcess");
+        if (!proc)
+        {
+            detourStatus = 1;
+            goto DetourError;
+        }
+        Real_RtlExitUserProcess = (rtlExitUserProcess_t)proc;
+
+        // Begin a detour transaction
+        if (DetourTransactionBegin() != ERROR_SUCCESS)
+        {
+            detourStatus = 1;
+            goto DetourError;
+        }
+
+        if (DetourUpdateThread(GetCurrentThread()) != ERROR_SUCCESS)
+        {
+            detourStatus = 1;
+            goto DetourError;
+        }
+
+        DetourSetIgnoreTooSmall(TRUE);
+
+        if (DetourAttach((void **)&Real_RtlExitUserProcess,
+                         (void *)Detour_RtlExitUserProcess) != ERROR_SUCCESS)
+        {
+            detourStatus = 1;
+            goto DetourError;
+        }
+
+        // Commit the transaction
+        if (DetourTransactionCommit() != ERROR_SUCCESS)
+        {
+            detourStatus = 1;
+            goto DetourError;
+        }
+    DetourError:
+        if (detourStatus != 0)
+        {
+            atexit(&AtExitHandler);
+        }
+    }
+#else
+    atexit(&AtExitHandler);
+#endif
+}
+
+static CUptiResult
+SelectActivities()
+{
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_DRIVER);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_RUNTIME);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_OVERHEAD);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_MEMSET);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_MEMCPY);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_MEMCPY2);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_MEMORY2);
+    // Enable activities to capture the NVTX annotations - markers, ranges and resource naming.
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_NAME);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_MARKER);
+    SELECT_ACTIVITY(injectionGlobals.profileMode, CUPTI_ACTIVITY_KIND_MARKER_DATA);
+
+    return CUPTI_SUCCESS;
+}
+
+void
+GraphTraceRecords(
+    CUpti_Activity *pRecord)
+{
+    switch (pRecord->kind)
+    {
+        case CUPTI_ACTIVITY_KIND_MEMCPY:
+        {
+            CUpti_ActivityMemcpy6 *pMemcpyRecord = (CUpti_ActivityMemcpy6 *) pRecord;
+
+            // Retrieve the information of the API used to create the node.
+            NodeIdApiDataMap::iterator it = nodeIdCorrelationMap.find(pMemcpyRecord->graphNodeId);
+            if (it != nodeIdCorrelationMap.end())
+            {
+                fprintf(globals.pOutputFile, "Graph node was created using API %s with correlationId %u\n",
+                        it->second.pFunctionName, it->second.correlationId);
+            }
+            break;
+        }
+        case CUPTI_ACTIVITY_KIND_KERNEL:
+        case CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL:
+        {
+            CUpti_ActivityKernel9 *pKernelRecord = (CUpti_ActivityKernel9 *) pRecord;
+
+            // Retrieve the information of the API used to create the node.
+            NodeIdApiDataMap::iterator it = nodeIdCorrelationMap.find(pKernelRecord->graphNodeId);
+            if (it != nodeIdCorrelationMap.end())
+            {
+                fprintf(globals.pOutputFile, "Graph node was created using API %s with correlationId %u\n",
+                        it->second.pFunctionName, it->second.correlationId);
+            }
+            break;
+        }
+        default:
+            break;
+    }
+}
+
+static CUptiResult
+EnableCuptiActivities(
+    CUcontext context)
+{
+    CUPTI_API_CALL_VERBOSE(cuptiEnableCallback(1, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RUNTIME_API, CUPTI_RUNTIME_TRACE_CBID_cudaDeviceReset_v3020));
+
+    CUPTI_API_CALL(SelectActivities());
+
+    for (int i = 0; i < CUPTI_ACTIVITY_KIND_COUNT; ++i)
+    {
+        CUptiResult result = CUPTI_SUCCESS;
+
+        if (IS_ACTIVITY_SELECTED(injectionGlobals.profileMode, i))
+        {
+            // If context is NULL activities are being enabled after CUDA initialization.
+            // Else the activities are being enabled on cudaProfilerStart API.
+            if (context == NULL)
+            {
+                std::cout << "Enabling CUPTI_ACTIVITY_KIND_" << GetActivityKindString((CUpti_ActivityKind)i) << ".\n";
+                CUPTI_API_CALL(cuptiActivityEnable((CUpti_ActivityKind)i));
+            }
+            else
+            {
+                // Since some activities are not supported at context mode,
+                // enable them in global mode if context mode fails.
+                std::cout << "Enabling CUPTI_ACTIVITY_KIND_" << GetActivityKindString((CUpti_ActivityKind)i) << " for a context.\n";
+                result = cuptiActivityEnableContext(context, (CUpti_ActivityKind)i);
+
+                if (result == CUPTI_ERROR_INVALID_KIND)
+                {
+                    cuptiGetLastError();
+                    std::cout << "Enabling CUPTI_ACTIVITY_KIND_" << GetActivityKindString((CUpti_ActivityKind)i) << ".\n";
+                    CUPTI_API_CALL_VERBOSE(cuptiActivityEnable((CUpti_ActivityKind)i));
+                }
+                else if (result != CUPTI_SUCCESS)
+                {
+                    CUPTI_API_CALL(result);
+                }
+            }
+        }
+    }
+
+    return CUPTI_SUCCESS;
+}
+
+CUptiResult
+DisableCuptiActivities(
+    CUcontext context)
+{
+    CUPTI_API_CALL_VERBOSE(cuptiEnableCallback(0, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RUNTIME_API, CUPTI_RUNTIME_TRACE_CBID_cudaDeviceReset_v3020));
+
+    for (int i = 0; i < CUPTI_ACTIVITY_KIND_COUNT; ++i)
+    {
+        CUptiResult result = CUPTI_SUCCESS;
+
+        if (IS_ACTIVITY_SELECTED(injectionGlobals.profileMode, i))
+        {
+            if (context == NULL)
+            {
+                std::cout << "Disabling CUPTI_ACTIVITY_KIND_" << GetActivityKindString((CUpti_ActivityKind)i) << ".\n";
+                CUPTI_API_CALL(cuptiActivityDisable((CUpti_ActivityKind)i));
+            }
+            else
+            {
+                // Since some activities are not supported at context mode,
+                // disable them in global mode if context mode fails.
+                std::cout << "Disabling CUPTI_ACTIVITY_KIND_" << GetActivityKindString((CUpti_ActivityKind)i) << " for a context.\n";
+                result = cuptiActivityDisableContext(context, (CUpti_ActivityKind)i);
+
+                if (result == CUPTI_ERROR_INVALID_KIND)
+                {
+                    cuptiGetLastError();
+                    std::cout << "Disabling CUPTI_ACTIVITY_KIND_" << GetActivityKindString((CUpti_ActivityKind)i) << ".\n";
+                    CUPTI_API_CALL(cuptiActivityDisable((CUpti_ActivityKind)i));
+                }
+                else if (result != CUPTI_SUCCESS)
+                {
+                    CUPTI_API_CALL(result);
+                }
+            }
+        }
+    }
+
+    return CUPTI_SUCCESS;
+}
+
+static CUptiResult
+OnCudaDeviceReset(void)
+{
+    // Flush all activity buffers.
+    CUPTI_API_CALL_VERBOSE(cuptiActivityFlushAll(0));
+
+    return CUPTI_SUCCESS;
+}
+
+static CUptiResult
+OnProfilerStart(
+    CUcontext context)
+{
+    if (context == NULL)
+    {
+        // Don't do anything if context is NULL.
+        return CUPTI_SUCCESS;
+    }
+
+    CUPTI_API_CALL(EnableCuptiActivities(context));
+
+    return CUPTI_SUCCESS;
+}
+
+static CUptiResult
+OnProfilerStop(
+    CUcontext context)
+{
+    if (context == NULL)
+    {
+        // Don't do anything if context is NULL.
+        return CUPTI_SUCCESS;
+    }
+
+    CUPTI_API_CALL_VERBOSE(cuptiActivityFlushAll(0));
+    CUPTI_API_CALL(DisableCuptiActivities(context));
+
+    return CUPTI_SUCCESS;
+}
+
+void CUPTIAPI
+InjectionCallbackHandler(
+    void *pUserData,
+    CUpti_CallbackDomain domain,
+    CUpti_CallbackId callbackId,
+    void *pCallbackData)
+{
+    static const char *s_pFunctionName;
+    static uint32_t s_correlationId;
+
+    const CUpti_CallbackData *pCallbackInfo = (CUpti_CallbackData *)pCallbackData;
+
+    // Clear any previous CUPTI errors. cuptiGetLastError() retrieves and clears the last error.
+    // We don't treat this as fatal since it's just clearing state from previous operations.
+    CUptiResult _status = cuptiGetLastError();
+    if (_status != CUPTI_SUCCESS && _status != CUPTI_ERROR_NOT_INITIALIZED)
+    {
+        const char *pErrorString;
+        cuptiGetResultString(_status, &pErrorString);
+        // Log but don't exit - this is just informational
+        std::cerr << "Warning: Cleared previous CUPTI error(" << _status << "): " << pErrorString << "\n";
+    }
+
+    switch (domain)
+    {
+        case CUPTI_CB_DOMAIN_STATE:
+            HandleDomainStateCallback(callbackId, (CUpti_StateData *)pCallbackData);
+            break;
+        case CUPTI_CB_DOMAIN_RESOURCE:
+        {
+            CUpti_ResourceData *pResourceData = (CUpti_ResourceData *)pCallbackData;
+            switch (callbackId)
+            {
+                case CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED:
+                {
+                    // Do not store info for the nodes that are created during graph instantiate.
+                    if (s_pFunctionName && !strncmp(s_pFunctionName, "cudaGraphInstantiate", strlen("cudaGraphInstantiate")))
+                    {
+                        break;
+                    }
+                    CUpti_GraphData *callbackData = (CUpti_GraphData *) pResourceData->resourceDescriptor;
+                    uint64_t nodeId;
+
+                    // Query the graph node ID and store the API correlation id and function name.
+                    CUPTI_API_CALL(cuptiGetGraphNodeId(callbackData->node, &nodeId));
+                    ApiData apiData;
+                    apiData.correlationId = s_correlationId;
+                    apiData.pFunctionName = s_pFunctionName;
+                    nodeIdCorrelationMap[nodeId] = apiData;
+                    break;
+                }
+                case CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED:
+                {
+                    CUpti_GraphData *callbackData = (CUpti_GraphData *) pResourceData->resourceDescriptor;
+                    uint64_t nodeId, originalNodeId;
+
+                    // Overwrite the map entry with node ID of the cloned graph node.
+                    CUPTI_API_CALL(cuptiGetGraphNodeId(callbackData->originalNode, &originalNodeId));
+                    NodeIdApiDataMap::iterator it = nodeIdCorrelationMap.find(originalNodeId);
+                    if (it != nodeIdCorrelationMap.end())
+                    {
+                        CUPTI_API_CALL(cuptiGetGraphNodeId(callbackData->node, &nodeId));
+                        ApiData apiData = it->second;
+                        nodeIdCorrelationMap.erase(it);
+                        nodeIdCorrelationMap[nodeId] = apiData;
+                    }
+                    break;
+                }
+                default:
+                    break;
+            }
+            break;
+        }
+        case CUPTI_CB_DOMAIN_DRIVER_API:
+        {
+            switch (callbackId)
+            {
+                case CUPTI_DRIVER_TRACE_CBID_cuProfilerStart:
+                {
+                    // We start profiling collection on exit of the API.
+                    if (pCallbackInfo->callbackSite == CUPTI_API_EXIT)
+                    {
+                        OnProfilerStart(pCallbackInfo->context);
+                    }
+                    break;
+                }
+                case CUPTI_DRIVER_TRACE_CBID_cuProfilerStop:
+                {
+                    // We stop profiling collection on entry of the API.
+                    if (pCallbackInfo->callbackSite == CUPTI_API_ENTER)
+                    {
+                        OnProfilerStop(pCallbackInfo->context);
+                    }
+                    break;
+                }
+                default:
+                    break;
+            }
+            break;
+        }
+        case CUPTI_CB_DOMAIN_RUNTIME_API:
+        {
+            if (pCallbackInfo->callbackSite == CUPTI_API_ENTER)
+            {
+                s_correlationId = pCallbackInfo->correlationId;
+                s_pFunctionName = pCallbackInfo->functionName;
+            }
+
+            switch (callbackId)
+            {
+                case CUPTI_RUNTIME_TRACE_CBID_cudaDeviceReset_v3020:
+                {
+                    if (pCallbackInfo->callbackSite == CUPTI_API_ENTER)
+                    {
+                        CUPTI_API_CALL(OnCudaDeviceReset());
+                    }
+                    break;
+                }
+                default:
+                    break;
+            }
+            break;
+        }
+        default:
+            break;
+    }
+}
+
+static void
+SetupCupti(void)
+{
+    UserData *pUserData = (UserData *)malloc(sizeof(UserData));
+    MEMORY_ALLOCATION_CALL(pUserData);
+
+    memset(pUserData, 0, sizeof(UserData));
+    pUserData->pPostProcessActivityRecords = GraphTraceRecords;
+    pUserData->printActivityRecords        = 1;
+
+    // Common CUPTI Initialization.
+    // Configure output file from environment variable or use default
+    const char *outputPath = getenv("CUPTI_TRACE_OUTPUT_FILE");
+    if (!outputPath) {
+        outputPath = "cupti_trace_output.txt";  // Default filename
+    }
+
+    FILE *outputFile = stdout;  // Default to stdout
+    if (strcmp(outputPath, "stdout") != 0) {
+        outputFile = fopen(outputPath, "w");
+        if (!outputFile) {
+            std::cerr << "Failed to open output file '" << outputPath << "', falling back to stdout\n";
+            outputFile = stdout;
+        } else {
+            std::cout << "CUPTI trace output will be written to: " << outputPath << "\n";
+        }
+    }
+    InitCuptiTrace(pUserData, (void *)InjectionCallbackHandler, outputFile);
+
+    injectionGlobals.subscriberHandle = globals.subscriberHandle;
+
+    // Subscribe Driver callback to call OnProfilerStart/OnProfilerStop function.
+    CUPTI_API_CALL_VERBOSE(cuptiEnableCallback(1, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuProfilerStart));
+    CUPTI_API_CALL_VERBOSE(cuptiEnableCallback(1, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuProfilerStop));
+
+    // Enable callbacks for CUDA graph node tracking.
+    CUPTI_API_CALL_VERBOSE(cuptiEnableCallback(1, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED));
+    CUPTI_API_CALL_VERBOSE(cuptiEnableCallback(1, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED));
+    CUPTI_API_CALL_VERBOSE(cuptiEnableDomain(1, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RUNTIME_API));
+
+    // Enable CUPTI activities.
+    CUPTI_API_CALL(EnableCuptiActivities(NULL));
+}
+
+#ifdef _WIN32
+extern "C" __declspec(dllexport) int
+InitializeInjection(void)
+#else
+extern "C" int
+InitializeInjection(void)
+#endif
+{
+    if (injectionGlobals.initialized)
+    {
+        // Return 1 to indicate that the injection is already successfully initialized.
+        return 1;
+    }
+
+    initializeInjectionMutex.lock();
+
+    // Initialize injection global options.
+    InitializeInjectionGlobals();
+
+    RegisterAtExitHandler();
+
+    // Initialize CUPTI.
+    SetupCupti();
+
+    injectionGlobals.tracingEnabled = 1;
+    injectionGlobals.initialized = 1;
+
+    initializeInjectionMutex.unlock();
+
+    // Return 1 to indicate that the injection is successfully initialized.
+    return 1;
+}
--- a/src/xpu/flamegraph/cupti_trace/helper_cupti.h
+++ b/src/xpu/flamegraph/cupti_trace/helper_cupti.h
@@ -0,0 +1,184 @@
+/**
+ * Copyright 2022-2024 NVIDIA Corporation.  All rights reserved.
+ *
+ * Please refer to the NVIDIA end user license agreement (EULA) associated
+ * with this source code for terms and conditions that govern your use of
+ * this software. Any use, reproduction, disclosure, or distribution of
+ * this software and related documentation outside the terms of the EULA
+ * is strictly prohibited.
+ *
+ */
+
+////////////////////////////////////////////////////////////////////////////////
+
+#ifndef HELPER_CUPTI_H_
+#define HELPER_CUPTI_H_
+
+#pragma once
+
+#include <iostream>
+
+#ifndef EXIT_WAIVED
+#define EXIT_WAIVED 2
+#endif
+
+#if defined(WIN32) || defined(_WIN32)
+#define stricmp _stricmp
+#else
+#define stricmp strcasecmp
+#endif
+
+#define CUDA_MAX_DEVICES    256     // consider theoretical max devices as 256
+#define DEV_NAME_LEN 256
+
+#ifndef DRIVER_API_CALL
+#define DRIVER_API_CALL(apiFunctionCall)                                            \
+do                                                                                  \
+{                                                                                   \
+    CUresult _status = apiFunctionCall;                                             \
+    if (_status != CUDA_SUCCESS)                                                    \
+    {                                                                               \
+        const char *pErrorString;                                                   \
+        cuGetErrorString(_status, &pErrorString);                                   \
+                                                                                    \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ << ": Function "  \
+        << #apiFunctionCall << " failed with error(" << _status << "): "            \
+        << pErrorString << ".\n\n";                                                 \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#ifndef RUNTIME_API_CALL
+#define RUNTIME_API_CALL(apiFunctionCall)                                           \
+do                                                                                  \
+{                                                                                   \
+    cudaError_t _status = apiFunctionCall;                                          \
+    if (_status != cudaSuccess)                                                     \
+    {                                                                               \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ << ": Function "  \
+        << #apiFunctionCall << " failed with error(" << _status << "): "            \
+        << cudaGetErrorString(_status) << ".\n\n";                                  \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#ifndef CUPTI_API_CALL
+#define CUPTI_API_CALL(apiFunctionCall)                                             \
+do                                                                                  \
+{                                                                                   \
+    CUptiResult _status = apiFunctionCall;                                          \
+    if (_status != CUPTI_SUCCESS)                                                   \
+    {                                                                               \
+        const char *pErrorString;                                                   \
+        cuptiGetResultString(_status, &pErrorString);                               \
+                                                                                    \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ << ": Function "  \
+        << #apiFunctionCall << " failed with error(" << _status << "): "            \
+        << pErrorString << ".\n\n";                                                 \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#ifndef CUPTI_API_CALL_VERBOSE
+#define CUPTI_API_CALL_VERBOSE(apiFunctionCall)                                     \
+do                                                                                  \
+{                                                                                   \
+    std::cout << "Calling CUPTI API: " << #apiFunctionCall << "\n";                 \
+                                                                                    \
+    CUptiResult _status = apiFunctionCall;                                          \
+    if (_status != CUPTI_SUCCESS)                                                   \
+    {                                                                               \
+        const char *pErrorString;                                                   \
+        cuptiGetResultString(_status, &pErrorString);                               \
+                                                                                    \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ << ": Function "  \
+        << #apiFunctionCall << " failed with error(" << _status << "): "            \
+        << pErrorString << ".\n\n";                                                 \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#ifndef CUPTI_UTIL_CALL
+#define CUPTI_UTIL_CALL(apiFunctionCall)                                            \
+do                                                                                  \
+{                                                                                   \
+    CUptiUtilResult _status = apiFunctionCall;                                      \
+    if (_status != CUPTI_UTIL_SUCCESS)                                              \
+    {                                                                               \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ << ": Function "  \
+        << #apiFunctionCall << " failed with error: " << _status << "\n\n";         \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#ifndef NVPW_API_CALL
+#define NVPW_API_CALL(apiFunctionCall)                                              \
+do                                                                                  \
+{                                                                                   \
+    NVPA_Status _status = apiFunctionCall;                                          \
+    if (_status != NVPA_STATUS_SUCCESS)                                             \
+    {                                                                               \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ << ": Function "  \
+        << #apiFunctionCall << " failed with error: " << _status << "\n\n";         \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#ifndef MEMORY_ALLOCATION_CALL
+#define MEMORY_ALLOCATION_CALL(variable)                                            \
+do                                                                                  \
+{                                                                                   \
+    if (variable == NULL)                                                           \
+    {                                                                               \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ <<                \
+        " Memory allocation failed.\n\n";                                           \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#ifndef CHECK_CONDITION
+#define CHECK_CONDITION(condition)                                                  \
+do                                                                                  \
+{                                                                                   \
+    if (!(condition))                                                               \
+    {                                                                               \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ << ": Condition " \
+        << #condition << " failed.\n\n";                                            \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#ifndef CHECK_INTEGER_CONDITION
+#define CHECK_INTEGER_CONDITION(argument1, operator, argument2)                     \
+do                                                                                  \
+{                                                                                   \
+    if (!(argument1 operator argument2))                                            \
+    {                                                                               \
+        std::cerr << "\n\nError: " << __FILE__ << ":" << __LINE__ << ": Condition " \
+        << #argument1 << " " << #operator << " " << #argument2 << " fails. " <<     \
+        #argument1 << " = " << argument1 << ", " << #argument2 << " = " <<          \
+        argument2 << "\n\n";                                                        \
+                                                                                    \
+        exit(EXIT_FAILURE);                                                         \
+    }                                                                               \
+} while (0)
+#endif
+
+#endif // HELPER_CUPTI_H_
+
--- a/src/xpu/flamegraph/cupti_trace/helper_cupti_activity.h
+++ b/src/xpu/flamegraph/cupti_trace/helper_cupti_activity.h
--- a/src/xpu/flamegraph/cupti_trace_parser.py
+++ b/src/xpu/flamegraph/cupti_trace_parser.py
@@ -0,0 +1,314 @@
+
+#!/usr/bin/env python3
+"""
+CUPTI Trace Parser Module
+Parses CUPTI trace data and converts to Chrome Trace Format
+"""
+
+import re
+import json
+from typing import List, Dict, Any
+
+
+class CuptiTraceParser:
+    """Parser for CUPTI trace data"""
+    
+    def __init__(self):
+        # Regular expressions for different trace line formats
+        self.runtime_pattern = r'RUNTIME \[ (\d+), (\d+) \] duration (\d+), "([^"]+)", cbid (\d+), processId (\d+), threadId (\d+), correlationId (\d+)'
+        self.driver_pattern = r'DRIVER \[ (\d+), (\d+) \] duration (\d+), "([^"]+)", cbid (\d+), processId (\d+), threadId (\d+), correlationId (\d+)'
+        self.kernel_pattern = r'CONCURRENT_KERNEL \[ (\d+), (\d+) \] duration (\d+), "([^"]+)", correlationId (\d+)'
+        self.overhead_pattern = r'OVERHEAD ([A-Z_]+) \[ (\d+), (\d+) \] duration (\d+), (\w+), id (\d+), correlation id (\d+)'
+        self.memory_pattern = r'MEMORY2 \[ (\d+) \] memoryOperationType (\w+), memoryKind (\w+), size (\d+), address (\d+)'
+        self.memcpy_pattern = r'MEMCPY "([^"]+)" \[ (\d+), (\d+) \] duration (\d+), size (\d+), copyCount (\d+), srcKind (\w+), dstKind (\w+), correlationId (\d+)'
+        self.grid_pattern = r'\s+grid \[ (\d+), (\d+), (\d+) \], block \[ (\d+), (\d+), (\d+) \]'
+        self.device_pattern = r'\s+deviceId (\d+), contextId (\d+), streamId (\d+)'
+        
+    def parse_file(self, filename: str) -> List[Dict[str, Any]]:
+        """Parse CUPTI trace file and return list of events"""
+        with open(filename, 'r') as f:
+            lines = f.readlines()
+        
+        return self.parse_lines(lines)
+    
+    def parse_lines(self, lines: List[str]) -> List[Dict[str, Any]]:
+        """Parse CUPTI trace lines and return list of events"""
+        events = []
+        i = 0
+        
+        while i < len(lines):
+            line = lines[i].strip()
+            
+            # Skip empty lines or non-trace lines
+            if not line or self._should_skip_line(line):
+                i += 1
+                continue
+            
+            # Try parsing different event types
+            event = None
+            lines_consumed = 1
+            
+            # Parse RUNTIME events
+            match = re.search(self.runtime_pattern, line)
+            if match:
+                event = self._parse_runtime_event(match)
+            else:
+                # Parse DRIVER events
+                match = re.search(self.driver_pattern, line)
+                if match:
+                    event = self._parse_driver_event(match)
+                else:
+                    # Parse CONCURRENT_KERNEL events
+                    match = re.search(self.kernel_pattern, line)
+                    if match:
+                        event, lines_consumed = self._parse_kernel_event(match, lines, i)
+                    else:
+                        # Parse OVERHEAD events
+                        match = re.search(self.overhead_pattern, line)
+                        if match:
+                            event = self._parse_overhead_event(match)
+                        else:
+                            # Parse MEMCPY events
+                            match = re.search(self.memcpy_pattern, line)
+                            if match:
+                                event, lines_consumed = self._parse_memcpy_event(match, lines, i)
+                            else:
+                                # Parse MEMORY2 events
+                                match = re.search(self.memory_pattern, line)
+                                if match:
+                                    event = self._parse_memory_event(match)
+            
+            if event:
+                events.append(event)
+            
+            i += lines_consumed
+        
+        return events
+    
+    def _should_skip_line(self, line: str) -> bool:
+        """Check if line should be skipped"""
+        skip_prefixes = [
+            'Calling CUPTI', 'Enabling', 'Disabling', 'Found',
+            'Configuring', 'It took', 'Activity buffer', 'CUPTI trace output',
+            'Running command', 'Trace output:', 'Started target', 
+            'Starting CPU', 'Stopping CPU', 'CPU profile'
+        ]
+        return any(line.startswith(prefix) for prefix in skip_prefixes)
+    
+    def _parse_runtime_event(self, match) -> Dict[str, Any]:
+        """Parse RUNTIME event"""
+        start_time = int(match.group(1))
+        duration = int(match.group(3))
+        name = match.group(4)
+        cbid = match.group(5)
+        process_id = int(match.group(6))
+        thread_id = int(match.group(7))
+        correlation_id = int(match.group(8))
+        
+        return {
+            "name": f"Runtime: {name}",
+            "ph": "X",  # Complete event
+            "ts": start_time / 1000,  # Convert ns to µs
+            "dur": duration / 1000,
+            "tid": thread_id,
+            "pid": process_id,
+            "cat": "CUDA_Runtime",
+            "args": {
+                "cbid": cbid,
+                "correlationId": correlation_id
+            }
+        }
+    
+    def _parse_driver_event(self, match) -> Dict[str, Any]:
+        """Parse DRIVER event"""
+        start_time = int(match.group(1))
+        duration = int(match.group(3))
+        name = match.group(4)
+        cbid = match.group(5)
+        process_id = int(match.group(6))
+        thread_id = int(match.group(7))
+        correlation_id = int(match.group(8))
+        
+        return {
+            "name": f"Driver: {name}",
+            "ph": "X",
+            "ts": start_time / 1000,
+            "dur": duration / 1000,
+            "tid": thread_id,
+            "pid": process_id,
+            "cat": "CUDA_Driver",
+            "args": {
+                "cbid": cbid,
+                "correlationId": correlation_id
+            }
+        }
+    
+    def _parse_kernel_event(self, match, lines: List[str], current_index: int) -> tuple:
+        """Parse CONCURRENT_KERNEL event with optional additional info"""
+        start_time = int(match.group(1))
+        duration = int(match.group(3))
+        name = match.group(4)
+        correlation_id = int(match.group(5))
+        
+        kernel_info = {
+            "name": f"Kernel: {name}",
+            "ph": "X",
+            "ts": start_time / 1000,
+            "dur": duration / 1000,
+            "cat": "GPU_Kernel",
+            "args": {
+                "correlationId": correlation_id
+            }
+        }
+        
+        lines_consumed = 1
+        
+        # Check next lines for additional kernel info
+        if current_index + 1 < len(lines):
+            next_line = lines[current_index + 1].strip()
+            grid_match = re.search(self.grid_pattern, next_line)
+            if grid_match:
+                kernel_info["args"]["grid"] = [
+                    int(grid_match.group(1)),
+                    int(grid_match.group(2)),
+                    int(grid_match.group(3))
+                ]
+                kernel_info["args"]["block"] = [
+                    int(grid_match.group(4)),
+                    int(grid_match.group(5)),
+                    int(grid_match.group(6))
+                ]
+                lines_consumed += 1
+                
+        if current_index + lines_consumed < len(lines):
+            next_line = lines[current_index + lines_consumed].strip()
+            device_match = re.search(self.device_pattern, next_line)
+            if device_match:
+                device_id = int(device_match.group(1))
+                context_id = int(device_match.group(2))
+                stream_id = int(device_match.group(3))
+                
+                kernel_info["tid"] = f"GPU{device_id}_Stream{stream_id}"
+                kernel_info["pid"] = f"Device_{device_id}"
+                kernel_info["args"]["deviceId"] = device_id
+                kernel_info["args"]["contextId"] = context_id
+                kernel_info["args"]["streamId"] = stream_id
+                lines_consumed += 1
+        
+        return kernel_info, lines_consumed
+    
+    def _parse_overhead_event(self, match) -> Dict[str, Any]:
+        """Parse OVERHEAD event"""
+        overhead_type = match.group(1)
+        start_time = int(match.group(2))
+        duration = int(match.group(4))
+        overhead_target = match.group(5)
+        overhead_id = int(match.group(6))
+        correlation_id = int(match.group(7))
+        
+        return {
+            "name": f"Overhead: {overhead_type}",
+            "ph": "X",
+            "ts": start_time / 1000,
+            "dur": duration / 1000,
+            "tid": overhead_id,
+            "pid": "CUPTI_Overhead",
+            "cat": "Overhead",
+            "args": {
+                "type": overhead_type,
+                "target": overhead_target,
+                "correlationId": correlation_id
+            }
+        }
+    
+    def _parse_memcpy_event(self, match, lines: List[str], current_index: int) -> tuple:
+        """Parse MEMCPY event with optional device info"""
+        copy_type = match.group(1)
+        start_time = int(match.group(2))
+        duration = int(match.group(4))
+        size = int(match.group(5))
+        copy_count = int(match.group(6))
+        src_kind = match.group(7)
+        dst_kind = match.group(8)
+        correlation_id = int(match.group(9))
+        
+        memcpy_info = {
+            "name": f"MemCopy: {copy_type}",
+            "ph": "X",
+            "ts": start_time / 1000,
+            "dur": duration / 1000,
+            "cat": "MemCopy",
+            "args": {
+                "type": copy_type,
+                "size": size,
+                "copyCount": copy_count,
+                "srcKind": src_kind,
+                "dstKind": dst_kind,
+                "correlationId": correlation_id
+            }
+        }
+        
+        lines_consumed = 1
+        
+        # Check next line for device info
+        if current_index + 1 < len(lines):
+            next_line = lines[current_index + 1].strip()
+            device_match = re.search(self.device_pattern, next_line)
+            if device_match:
+                device_id = int(device_match.group(1))
+                context_id = int(device_match.group(2))
+                stream_id = int(device_match.group(3))
+                
+                memcpy_info["tid"] = f"GPU{device_id}_Stream{stream_id}"
+                memcpy_info["pid"] = f"Device_{device_id}"
+                memcpy_info["args"]["deviceId"] = device_id
+                memcpy_info["args"]["contextId"] = context_id
+                memcpy_info["args"]["streamId"] = stream_id
+                lines_consumed += 1
+            else:
+                memcpy_info["tid"] = "MemCopy_Operations"
+                memcpy_info["pid"] = "MemCopy"
+        
+        return memcpy_info, lines_consumed
+    
+    def _parse_memory_event(self, match) -> Dict[str, Any]:
+        """Parse MEMORY2 event"""
+        timestamp = int(match.group(1))
+        operation = match.group(2)
+        memory_kind = match.group(3)
+        size = int(match.group(4))
+        address = int(match.group(5))
+        
+        return {
+            "name": f"Memory: {operation} ({memory_kind})",
+            "ph": "i",  # Instant event
+            "ts": timestamp / 1000,
+            "tid": "Memory_Operations",
+            "pid": "Memory",
+            "cat": "Memory",
+            "s": "g",  # Global scope
+            "args": {
+                "operation": operation,
+                "kind": memory_kind,
+                "size": size,
+                "address": hex(address)
+            }
+        }
+    
+    def to_chrome_trace(self, events: List[Dict[str, Any]], metadata: Dict[str, Any] = None) -> Dict[str, Any]:
+        """Convert events to Chrome Trace Format"""
+        trace_data = {
+            "traceEvents": events,
+            "displayTimeUnit": "ms",
+            "metadata": metadata or {
+                "tool": "CUPTI Trace Parser",
+                "format": "Chrome Trace Format"
+            }
+        }
+        return trace_data
+    
+    def save_chrome_trace(self, events: List[Dict[str, Any]], output_file: str, metadata: Dict[str, Any] = None):
+        """Save events as Chrome Trace Format JSON"""
+        trace_data = self.to_chrome_trace(events, metadata)
+        with open(output_file, 'w') as f:
+            json.dump(trace_data, f, indent=2)
--- a/src/xpu/flamegraph/gpuperf.py
+++ b/src/xpu/flamegraph/gpuperf.py
@@ -0,0 +1,415 @@
+#!/usr/bin/env python3
+
+import os
+import sys
+import argparse
+import subprocess
+import tempfile
+import atexit
+import time
+import json
+from pathlib import Path
+from cupti_trace_parser import CuptiTraceParser
+from merge_gpu_cpu_trace import TraceMerger
+
+class GPUPerf:
+    def __init__(self):
+        self.script_dir = Path(__file__).parent.absolute()
+        self.injection_lib = self.script_dir / "cupti_trace/libcupti_trace_injection.so"
+        self.output_file = None
+        self.temp_trace_file = None
+        self.profiler_proc = None
+        self.profiler_output = None
+        self.parser = CuptiTraceParser()  # Initialize the parser
+        
+        # Path to CPU profiler
+        script_dir = Path(__file__).parent.resolve()
+        self.cpu_profiler = script_dir / "profiler/target/release/profile"
+        if not self.cpu_profiler.exists():
+            print(f"Warning: CPU profiler not found at {self.cpu_profiler}", file=sys.stderr)
+            self.cpu_profiler = None
+        
+        # Find CUPTI library path
+        cuda_paths = [
+            "/usr/local/cuda-13.0/extras/CUPTI/lib64",
+            "/usr/local/cuda/extras/CUPTI/lib64",
+            "/usr/local/cuda-12.0/extras/CUPTI/lib64",
+        ]
+        
+        self.cupti_lib = None
+        for path in cuda_paths:
+            cupti_path = Path(path) / "libcupti.so"
+            if cupti_path.exists():
+                self.cupti_lib = str(cupti_path)
+                self.cupti_lib_dir = str(Path(path))
+                break
+                
+        if not self.cupti_lib:
+            print("Warning: Could not find CUPTI library. NVTX annotations may not work.", file=sys.stderr)
+    
+    def parse_cupti_trace(self, filename):
+        """Parse CUPTI trace data using the parser module"""
+        return self.parser.parse_file(filename)
+    
+    def start_cpu_profiler(self, pid=None, cpu_output_file=None, cuda_lib_path=None):
+        """Start CPU profiler with cudaLaunchKernel uprobe"""
+        if not self.cpu_profiler:
+            return None
+
+        if not cpu_output_file:
+            cpu_output_file = f"cpu_profile_{pid if pid else 'cuda'}.txt"
+
+        self.profiler_output = cpu_output_file
+
+        # Find CUDA runtime library if not specified
+        if not cuda_lib_path:
+            cuda_paths = [
+                "/usr/local/cuda-12.9/lib64/libcudart.so.12",
+                "/usr/local/cuda-13.0/lib64/libcudart.so.12",
+                "/usr/local/cuda/lib64/libcudart.so.12",
+                "/usr/local/cuda-12.8/lib64/libcudart.so.12",
+            ]
+            for path in cuda_paths:
+                if Path(path).exists():
+                    cuda_lib_path = path
+                    break
+
+        if not cuda_lib_path:
+            print("Warning: Could not find CUDA runtime library for uprobe", file=sys.stderr)
+            return None
+
+        print(f"Starting CPU profiler with cudaLaunchKernel hook")
+        print(f"  CUDA library: {cuda_lib_path}")
+        print(f"  Output: {cpu_output_file}")
+
+        try:
+            # Run profiler with cudaLaunchKernel uprobe in extended folded format
+            # Format: timestamp_ns comm pid tid cpu stack1;stack2;stack3
+            cmd = ["sudo", str(self.cpu_profiler),
+                   "--uprobe", f"{cuda_lib_path}:cudaLaunchKernel",
+                   "-E"]  # -E for extended folded format with timestamps
+
+            self.profiler_proc = subprocess.Popen(
+                cmd,
+                stdout=open(cpu_output_file, 'w'),
+                stderr=subprocess.PIPE
+            )
+            # Give it a moment to attach
+            time.sleep(1.0)
+            return self.profiler_proc
+        except Exception as e:
+            print(f"Warning: Failed to start CPU profiler: {e}", file=sys.stderr)
+            return None
+    
+    def stop_cpu_profiler(self):
+        """Stop the CPU profiler gracefully"""
+        if self.profiler_proc and self.profiler_proc.poll() is None:
+            print("Stopping CPU profiler...")
+            self.profiler_proc.terminate()
+            try:
+                self.profiler_proc.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                self.profiler_proc.kill()
+                self.profiler_proc.wait()
+            
+            if self.profiler_output and os.path.exists(self.profiler_output):
+                print(f"CPU profile saved to: {self.profiler_output}")
+    
+    def run_with_trace(self, command, output_trace=None, chrome_trace=None, cpu_profile=None, merged_trace=None, no_merge=False):
+        """Run a command with CUPTI tracing and optional CPU profiling enabled"""
+        
+        # Determine if we're doing GPU profiling
+        do_gpu_profiling = output_trace is not None or chrome_trace is not None
+        
+        # Check if injection library exists (only if we're doing GPU profiling)
+        if do_gpu_profiling and not self.injection_lib.exists():
+            print(f"Error: CUPTI injection library not found at {self.injection_lib}", file=sys.stderr)
+            print("Please build it first using 'make' in the cupti_trace directory", file=sys.stderr)
+            return 1
+        
+        # Set up trace output file for GPU profiling
+        trace_file = None
+        if do_gpu_profiling:
+            if output_trace:
+                trace_file = output_trace
+            else:
+                # Create temporary file for trace output
+                fd, trace_file = tempfile.mkstemp(suffix=".txt", prefix="gpuperf_trace_")
+                os.close(fd)
+                self.temp_trace_file = trace_file
+                atexit.register(self.cleanup_temp_files)
+        
+        # Set up environment variables
+        env = os.environ.copy()
+        env['CUDA_INJECTION64_PATH'] = str(self.injection_lib)
+        env['CUPTI_TRACE_OUTPUT_FILE'] = trace_file
+        
+        if self.cupti_lib:
+            env['NVTX_INJECTION64_PATH'] = self.cupti_lib
+            if 'LD_LIBRARY_PATH' in env:
+                env['LD_LIBRARY_PATH'] = f"{self.cupti_lib_dir}:{env['LD_LIBRARY_PATH']}"
+            else:
+                env['LD_LIBRARY_PATH'] = self.cupti_lib_dir
+        
+        print(f"Running command with GPU profiling: {' '.join(command)}")
+        print(f"Trace output: {trace_file}")
+        
+        # Start the target process
+        target_proc = None
+
+        try:
+            # Start CPU profiler FIRST if available and requested
+            if cpu_profile and self.cpu_profiler:
+                # Start profiler BEFORE target process to catch all kernel launches
+                self.start_cpu_profiler(cpu_output_file=cpu_profile)
+
+            # Then start the target process
+            target_proc = subprocess.Popen(command, env=env)
+            target_pid = target_proc.pid
+            print(f"Started target process with PID: {target_pid}")
+            
+            # Wait for the target process to complete
+            return_code = target_proc.wait()
+            
+        except KeyboardInterrupt:
+            print("\nInterrupted by user")
+            if target_proc:
+                target_proc.terminate()
+                try:
+                    target_proc.wait(timeout=5)
+                except subprocess.TimeoutExpired:
+                    target_proc.kill()
+            return_code = 130
+        except Exception as e:
+            print(f"Error running command: {e}", file=sys.stderr)
+            return_code = 1
+        finally:
+            # Stop CPU profiler if running
+            self.stop_cpu_profiler()
+        
+        # Convert to Chrome trace if requested
+        if chrome_trace and os.path.exists(trace_file):
+            print(f"\nConverting trace to Chrome format: {chrome_trace}")
+            try:
+                events = self.parse_cupti_trace(trace_file)
+                print(f"Parsed {len(events)} events")
+                
+                metadata = {
+                    "tool": "gpuperf - GPU Performance Profiler",
+                    "format": "Chrome Trace Format",
+                    "command": ' '.join(command)
+                }
+                
+                self.parser.save_chrome_trace(events, chrome_trace, metadata)
+                
+                print(f"\nChrome trace file written to: {chrome_trace}")
+                print("\nTo visualize the trace:")
+                print("1. Open Chrome or Edge browser")
+                print("2. Navigate to chrome://tracing or edge://tracing")
+                print("3. Click 'Load' and select the generated JSON file")
+                print("\nAlternatively, visit https://ui.perfetto.dev/ and drag the JSON file there")
+            except Exception as e:
+                print(f"Error converting trace: {e}", file=sys.stderr)
+        
+        # Clean up temporary file if not keeping raw trace
+        if not output_trace and self.temp_trace_file:
+            try:
+                os.unlink(self.temp_trace_file)
+            except:
+                pass
+        
+        # Generate merged folded trace if both CPU and GPU traces are available (and not disabled)
+        if not no_merge and cpu_profile and (chrome_trace or output_trace):
+            merged_output = merged_trace if merged_trace else "merged_trace.folded"
+            self.generate_merged_trace(
+                cpu_trace=cpu_profile,
+                gpu_trace=chrome_trace if chrome_trace else None,
+                gpu_raw_trace=trace_file if do_gpu_profiling else None,
+                output_file=merged_output
+            )
+        
+        return return_code
+    
+    def generate_merged_trace(self, cpu_trace=None, gpu_trace=None, gpu_raw_trace=None, output_file=None):
+        """Generate merged CPU+GPU folded trace using TraceMerger"""
+        if not cpu_trace or not (gpu_trace or gpu_raw_trace):
+            return  # Need both CPU and GPU traces
+        
+        if not output_file:
+            output_file = "merged_trace.folded"
+        
+        print(f"\nGenerating merged CPU+GPU trace: {output_file}")
+        
+        try:
+            merger = TraceMerger()
+            
+            # Parse CPU trace
+            if os.path.exists(cpu_trace):
+                merger.parse_cpu_trace(cpu_trace)
+            else:
+                print(f"Warning: CPU trace not found: {cpu_trace}")
+                return
+            
+            # Parse GPU trace (prefer JSON, fallback to raw)
+            if gpu_trace and os.path.exists(gpu_trace):
+                merger.parse_gpu_trace(gpu_trace)
+            elif gpu_raw_trace and os.path.exists(gpu_raw_trace):
+                # Convert raw trace to events first
+                events = self.parse_cupti_trace(gpu_raw_trace)
+                # Create temporary JSON for merger
+                import json
+                temp_json = tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False)
+                json.dump({"traceEvents": events}, temp_json)
+                temp_json.close()
+                merger.parse_gpu_trace(temp_json.name)
+                os.unlink(temp_json.name)
+            else:
+                print(f"Warning: GPU trace not found")
+                return
+            
+            # Merge traces
+            merger.merge_traces()
+            
+            # Write folded output
+            merger.write_folded_output(output_file)
+            
+            print(f"✓ Merged trace generated: {output_file}")
+            print(f"\nTo generate flamegraph:")
+            print(f"  /root/yunwei37/systemscope/cpu-tools/combined_flamegraph.pl {output_file} > merged_flamegraph.svg")
+            
+        except Exception as e:
+            print(f"Error generating merged trace: {e}", file=sys.stderr)
+    
+    def cleanup_temp_files(self):
+        """Clean up temporary files"""
+        if self.temp_trace_file and os.path.exists(self.temp_trace_file):
+            try:
+                os.unlink(self.temp_trace_file)
+            except:
+                pass
+    
+    def convert_trace(self, input_file, output_file):
+        """Convert existing CUPTI trace to Chrome format"""
+        
+        if not os.path.exists(input_file):
+            print(f"Error: Input file '{input_file}' not found", file=sys.stderr)
+            return 1
+        
+        print(f"Converting CUPTI trace to Chrome format...")
+        print(f"Input: {input_file}")
+        print(f"Output: {output_file}")
+        
+        try:
+            events = self.parse_cupti_trace(input_file)
+            print(f"Parsed {len(events)} events")
+            
+            metadata = {
+                "tool": "gpuperf - GPU Performance Profiler",
+                "format": "Chrome Trace Format"
+            }
+            
+            self.parser.save_chrome_trace(events, output_file, metadata)
+            
+            print(f"\nChrome trace file written to: {output_file}")
+            print("\nTo visualize the trace:")
+            print("1. Open Chrome or Edge browser")
+            print("2. Navigate to chrome://tracing or edge://tracing")
+            print("3. Click 'Load' and select the generated JSON file")
+            print("\nAlternatively, visit https://ui.perfetto.dev/ and drag the JSON file there")
+            
+            return 0
+        except Exception as e:
+            print(f"Error converting trace: {e}", file=sys.stderr)
+            return 1
+
+def main():
+    # Check if first argument is 'convert' for conversion mode
+    if len(sys.argv) > 1 and sys.argv[1] == 'convert':
+        parser = argparse.ArgumentParser(
+            prog='gpuperf convert',
+            description='Convert existing CUPTI trace to Chrome format'
+        )
+        parser.add_argument('mode', help='Operation mode')  # This will be 'convert'
+        parser.add_argument('-i', '--input', required=True, help='Input CUPTI trace file')
+        parser.add_argument('-o', '--output', default='trace.json', help='Output Chrome trace JSON file')
+        args = parser.parse_args()
+        
+        profiler = GPUPerf()
+        return profiler.convert_trace(args.input, args.output)
+    
+    # Regular run mode
+    parser = argparse.ArgumentParser(
+        description='gpuperf - GPU and CPU Performance Profiler',
+        usage='gpuperf [options] command [args...]\n       gpuperf convert -i input.txt -o output.json'
+    )
+    
+    parser.add_argument('-o', '--output', help='Save raw CUPTI trace to file (default: gpu_results.txt)')
+    parser.add_argument('-c', '--chrome', help='Convert trace to Chrome format and save to file (default: gpu_results.json)')
+    parser.add_argument('-p', '--cpu-profile', help='Also capture CPU profile and save to file (default: cpu_results.txt)')
+    parser.add_argument('-m', '--merged', help='Save merged CPU+GPU folded trace (default: merged_trace.folded)')
+    parser.add_argument('--cpu-only', action='store_true', help='Only run CPU profiler without GPU tracing')
+    parser.add_argument('--no-gpu', action='store_true', help='Disable GPU profiling')
+    parser.add_argument('--no-cpu', action='store_true', help='Disable CPU profiling')
+    parser.add_argument('--no-merge', action='store_true', help='Disable automatic merged trace generation')
+    parser.add_argument('command', nargs=argparse.REMAINDER, help='Command to run with profiling')
+    
+    args = parser.parse_args()
+    
+    profiler = GPUPerf()
+    
+    # Handle run mode
+    if not args.command:
+        parser.print_help()
+        return 1
+    
+    # Use the command directly from REMAINDER
+    full_command = args.command
+    
+    # CPU-only mode
+    if args.cpu_only:
+        if not profiler.cpu_profiler:
+            print("Error: CPU profiler not available", file=sys.stderr)
+            return 1
+        
+        # Start the process and immediately profile it
+        try:
+            target_proc = subprocess.Popen(full_command)
+            target_pid = target_proc.pid
+            print(f"Started target process with PID: {target_pid}")
+            
+            cpu_output = args.cpu_profile or "cpu_results.txt"
+            profiler.start_cpu_profiler(target_pid, cpu_output)
+            
+            return_code = target_proc.wait()
+            profiler.stop_cpu_profiler()
+            return return_code
+        except Exception as e:
+            print(f"Error: {e}", file=sys.stderr)
+            return 1
+    
+    # Set up default values
+    gpu_output = args.output if args.output else ("gpu_results.txt" if not args.no_gpu else None)
+    chrome_output = args.chrome if args.chrome else ("gpu_results.json" if not args.no_gpu else None)
+    cpu_output = args.cpu_profile if args.cpu_profile else ("cpu_results.txt" if not args.no_cpu else None)
+    
+    # If user explicitly disabled GPU, don't run GPU profiling
+    if args.no_gpu:
+        gpu_output = None
+        chrome_output = None
+    
+    # If user explicitly disabled CPU, don't run CPU profiling  
+    if args.no_cpu:
+        cpu_output = None
+    
+    # Combined GPU and CPU profiling (or just one based on flags)
+    return profiler.run_with_trace(
+        full_command, 
+        output_trace=gpu_output, 
+        chrome_trace=chrome_output,
+        cpu_profile=cpu_output,
+        merged_trace=args.merged,
+        no_merge=args.no_merge
+    )
+
+if __name__ == '__main__':
+    sys.exit(main())
--- a/src/xpu/flamegraph/merge_gpu_cpu_trace.py
+++ b/src/xpu/flamegraph/merge_gpu_cpu_trace.py
@@ -0,0 +1,343 @@
+#!/usr/bin/env python3
+"""
+Merge GPU and CPU traces into folded flamegraph format
+Correlates CPU stack traces from cudaLaunchKernel uprobe with GPU kernel execution
+using CUPTI correlation IDs and timestamp matching
+"""
+
+import json
+import re
+import sys
+import argparse
+from pathlib import Path
+from typing import List, Dict, Tuple, Any, Optional
+from collections import defaultdict
+
+
+class GPUKernelEvent:
+    """Represents a GPU kernel execution event"""
+    def __init__(self, name: str, start_ns: int, end_ns: int, correlation_id: int):
+        self.name = name
+        self.start_ns = start_ns
+        self.end_ns = end_ns
+        self.correlation_id = correlation_id
+
+    def __repr__(self):
+        return f"GPUKernel({self.name}, {self.start_ns}-{self.end_ns}, corr={self.correlation_id})"
+
+
+class CudaLaunchEvent:
+    """Represents a cudaLaunchKernel runtime API call"""
+    def __init__(self, start_ns: int, end_ns: int, correlation_id: int):
+        self.start_ns = start_ns
+        self.end_ns = end_ns
+        self.correlation_id = correlation_id
+
+    def __repr__(self):
+        return f"CudaLaunch({self.start_ns}-{self.end_ns}, corr={self.correlation_id})"
+
+
+class CPUStack:
+    """Represents a CPU stack trace from cudaLaunchKernel uprobe in extended folded format"""
+    def __init__(self, timestamp_ns: int, comm: str, pid: int, tid: int, cpu: int, stack: List[str]):
+        self.timestamp_ns = timestamp_ns
+        self.comm = comm
+        self.pid = pid
+        self.tid = tid
+        self.cpu = cpu
+        self.stack = stack  # List of function names from bottom to top
+
+    def __repr__(self):
+        return f"CPUStack({self.timestamp_ns}, pid={self.pid}, tid={self.tid}, depth={len(self.stack)})"
+
+
+class TraceMerger:
+    """Merges GPU CUPTI traces with CPU stack traces from cudaLaunchKernel hooks"""
+
+    def __init__(self, timestamp_tolerance_ms=10.0):
+        self.gpu_kernels = []  # List of GPUKernelEvent
+        self.cuda_launches = {}  # correlation_id -> CudaLaunchEvent
+        self.cpu_stacks = []  # List of CPUStack from uprobe (extended folded format)
+        self.merged_stacks = defaultdict(int)  # stack_string -> count
+        self.timestamp_tolerance_ns = int(timestamp_tolerance_ms * 1_000_000)
+
+    def parse_cpu_trace(self, cpu_file: str):
+        """Parse CPU trace file in extended folded format from Rust profiler"""
+        print(f"Parsing CPU uprobe trace (extended folded format): {cpu_file}")
+
+        with open(cpu_file, 'r') as f:
+            lines = f.readlines()
+
+        stack_count = 0
+        for line in lines:
+            line = line.strip()
+            if not line:
+                continue
+
+            # Extended folded format: timestamp_ns comm pid tid cpu stack1;stack2;stack3
+            parts = line.split(None, 5)  # Split on whitespace, max 6 parts
+            if len(parts) < 6:
+                continue
+
+            try:
+                timestamp_ns = int(parts[0])
+                comm = parts[1]
+                pid = int(parts[2])
+                tid = int(parts[3])
+                cpu = int(parts[4])
+                stack_str = parts[5]
+
+                # Parse stack frames (separated by semicolons)
+                stack_frames = []
+                seen_cuda_launch = False
+                if stack_str:
+                    frames = stack_str.split(';')
+                    for frame in frames:
+                        frame = frame.strip()
+                        if frame and frame not in ['<no-symbol>', '_start', '__libc_start_main']:
+                            # Clean up cudaLaunchKernel variations - keep only first occurrence
+                            if 'cudaLaunchKernel' in frame or '__device_stub__' in frame:
+                                if not seen_cuda_launch:
+                                    frame = 'cudaLaunchKernel'
+                                    stack_frames.append(frame)
+                                    seen_cuda_launch = True
+                            else:
+                                stack_frames.append(frame)
+
+                if stack_frames:
+                    self.cpu_stacks.append(CPUStack(
+                        timestamp_ns, comm, pid, tid, cpu, stack_frames
+                    ))
+                    stack_count += 1
+
+            except (ValueError, IndexError) as e:
+                print(f"Warning: Failed to parse line: {line[:100]}... Error: {e}")
+                continue
+
+        print(f"Parsed {stack_count} CPU stack traces from cudaLaunchKernel hooks")
+
+    def parse_gpu_trace(self, gpu_json_file: str):
+        """Parse GPU trace JSON file and extract kernel events and launch correlations"""
+        print(f"Parsing GPU CUPTI trace: {gpu_json_file}")
+
+        with open(gpu_json_file, 'r') as f:
+            data = json.load(f)
+
+        events = data.get('traceEvents', [])
+        kernel_count = 0
+        launch_count = 0
+
+        for event in events:
+            name = event.get('name', '')
+            category = event.get('cat', '')
+            correlation_id = event.get('args', {}).get('correlationId', 0)
+
+            # Extract cudaLaunchKernel runtime events
+            if category == 'CUDA_Runtime' and 'LaunchKernel' in name:
+                start_us = event.get('ts', 0)
+                duration_us = event.get('dur', 0)
+
+                if start_us > 0 and duration_us > 0 and correlation_id > 0:
+                    start_ns = int(start_us * 1000)
+                    end_ns = int((start_us + duration_us) * 1000)
+
+                    self.cuda_launches[correlation_id] = CudaLaunchEvent(
+                        start_ns, end_ns, correlation_id
+                    )
+                    launch_count += 1
+
+            # Extract actual GPU kernel executions
+            elif category == 'GPU_Kernel' or name.startswith('Kernel:'):
+                kernel_name = name.replace('Kernel: ', '')
+                start_us = event.get('ts', 0)
+                duration_us = event.get('dur', 0)
+
+                if start_us > 0 and duration_us > 0 and correlation_id > 0:
+                    start_ns = int(start_us * 1000)
+                    end_ns = int((start_us + duration_us) * 1000)
+
+                    self.gpu_kernels.append(GPUKernelEvent(
+                        kernel_name,
+                        start_ns,
+                        end_ns,
+                        correlation_id
+                    ))
+                    kernel_count += 1
+
+        # Sort by correlation ID for efficient lookup
+        self.gpu_kernels.sort(key=lambda k: k.correlation_id)
+
+        print(f"Parsed {kernel_count} GPU kernel events")
+        print(f"Parsed {launch_count} cudaLaunchKernel runtime events")
+
+    def find_matching_kernel(self, cpu_stack: CPUStack) -> Optional[GPUKernelEvent]:
+        """
+        Find GPU kernel that matches the CPU stack trace.
+        Strategy:
+        1. Find cudaLaunchKernel runtime call within timestamp tolerance
+        2. Use correlation ID to find actual GPU kernel execution
+        """
+
+        # Find cudaLaunchKernel runtime event that matches timestamp
+        best_launch = None
+        min_time_diff = self.timestamp_tolerance_ns
+
+        for launch in self.cuda_launches.values():
+            # Check if CPU stack timestamp is close to launch time
+            time_diff = abs(cpu_stack.timestamp_ns - launch.start_ns)
+
+            if time_diff < min_time_diff:
+                min_time_diff = time_diff
+                best_launch = launch
+
+        if not best_launch:
+            return None
+
+        # Find GPU kernel with matching correlation ID
+        for kernel in self.gpu_kernels:
+            if kernel.correlation_id == best_launch.correlation_id:
+                return kernel
+
+        return None
+
+    def merge_traces(self):
+        """Correlate CPU stacks with GPU kernels using correlation IDs and timestamps"""
+        print("Correlating CPU stacks with GPU kernels...")
+
+        matched_count = 0
+        unmatched_count = 0
+
+        for cpu_stack in self.cpu_stacks:
+            # Find matching GPU kernel
+            gpu_kernel = self.find_matching_kernel(cpu_stack)
+
+            # Build merged stack
+            merged_stack = cpu_stack.stack.copy()
+
+            if gpu_kernel:
+                # Add GPU kernel to the top of the stack
+                merged_stack.append(f"[GPU_Kernel]{gpu_kernel.name}")
+                matched_count += 1
+            else:
+                # Mark as unmatched launch (may happen if kernel hasn't executed yet)
+                merged_stack.append("[GPU_Launch_Pending]")
+                unmatched_count += 1
+
+            # Create folded stack string
+            if merged_stack:
+                stack_str = ';'.join(merged_stack)
+                self.merged_stacks[stack_str] += 1
+
+        print(f"Matched {matched_count} CPU stacks with GPU kernels")
+        print(f"Unmatched: {unmatched_count}")
+        print(f"Total unique stacks: {len(self.merged_stacks)}")
+
+    def write_folded_output(self, output_file: str):
+        """Write folded stack format for flamegraph generation"""
+        print(f"Writing folded output to: {output_file}")
+
+        with open(output_file, 'w') as f:
+            for stack, count in sorted(self.merged_stacks.items()):
+                # Folded format: stack_frame1;stack_frame2;... count
+                f.write(f"{stack} {count}\n")
+
+        total_samples = sum(self.merged_stacks.values())
+        print(f"Wrote {len(self.merged_stacks)} unique stacks ({total_samples} total samples)")
+
+    def generate_summary(self):
+        """Generate summary statistics"""
+        print("\n=== Summary Statistics ===")
+
+        # CPU statistics
+        if self.cpu_stacks:
+            cpu_start = min(s.timestamp_ns for s in self.cpu_stacks)
+            cpu_end = max(s.timestamp_ns for s in self.cpu_stacks)
+            cpu_duration_ms = (cpu_end - cpu_start) / 1_000_000
+            print(f"CPU trace duration: {cpu_duration_ms:.2f} ms")
+            print(f"CPU stacks captured: {len(self.cpu_stacks)}")
+
+        # GPU statistics
+        if self.gpu_kernels:
+            print(f"\nGPU kernels executed: {len(self.gpu_kernels)}")
+            print(f"CUDA launch events: {len(self.cuda_launches)}")
+
+            total_kernel_time = sum(k.end_ns - k.start_ns for k in self.gpu_kernels) / 1_000_000
+            print(f"Total kernel execution time: {total_kernel_time:.2f} ms")
+
+            # Show kernel breakdown
+            kernel_names = defaultdict(int)
+            for k in self.gpu_kernels:
+                kernel_names[k.name] += 1
+
+            print("\nKernel execution counts:")
+            for name, count in sorted(kernel_names.items(), key=lambda x: -x[1]):
+                print(f"  {name}: {count}")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description='Merge GPU CUPTI traces with CPU cudaLaunchKernel stack traces'
+    )
+    parser.add_argument(
+        '-c', '--cpu',
+        default='cpu_results.txt',
+        help='CPU uprobe trace file (extended folded format, default: cpu_results.txt)'
+    )
+    parser.add_argument(
+        '-g', '--gpu',
+        default='gpu_results.json',
+        help='GPU CUPTI trace JSON file (default: gpu_results.json)'
+    )
+    parser.add_argument(
+        '-o', '--output',
+        default='merged_trace.folded',
+        help='Output folded stack file (default: merged_trace.folded)'
+    )
+    parser.add_argument(
+        '-t', '--tolerance',
+        type=float,
+        default=10.0,
+        help='Timestamp matching tolerance in milliseconds (default: 10.0)'
+    )
+    parser.add_argument(
+        '-s', '--summary',
+        action='store_true',
+        help='Print summary statistics'
+    )
+
+    args = parser.parse_args()
+
+    # Check input files exist
+    if not Path(args.cpu).exists():
+        print(f"Error: CPU trace file not found: {args.cpu}", file=sys.stderr)
+        sys.exit(1)
+
+    if not Path(args.gpu).exists():
+        print(f"Error: GPU trace file not found: {args.gpu}", file=sys.stderr)
+        sys.exit(1)
+
+    # Create merger and process traces
+    merger = TraceMerger(timestamp_tolerance_ms=args.tolerance)
+
+    # Parse inputs
+    merger.parse_cpu_trace(args.cpu)
+    merger.parse_gpu_trace(args.gpu)
+
+    # Merge traces
+    merger.merge_traces()
+
+    # Write output
+    merger.write_folded_output(args.output)
+
+    # Print summary if requested
+    if args.summary:
+        merger.generate_summary()
+
+    print(f"\nTo generate flamegraph:")
+    print(f"  flamegraph.pl {args.output} > merged_flamegraph.svg")
+    print(f"\nOr use online viewer:")
+    print(f"  https://www.speedscope.app/ (upload {args.output})")
+
+
+if __name__ == '__main__':
+    main()
--- a/src/xpu/flamegraph/mock-test/.gitignore
+++ b/src/xpu/flamegraph/mock-test/.gitignore
@@ -0,0 +1,3 @@
+llm-inference
+*.o
+*.bak
--- a/src/xpu/flamegraph/mock-test/Makefile
+++ b/src/xpu/flamegraph/mock-test/Makefile
@@ -0,0 +1,53 @@
+#
+# Makefile for mock LLM inference application
+#
+ifndef OS
+    OS   := $(shell uname)
+    HOST_ARCH := $(shell uname -m)
+endif
+
+CUDA_INSTALL_PATH ?= /usr/local/cuda-12.9
+NVCC := "$(CUDA_INSTALL_PATH)/bin/nvcc"
+INCLUDES := -I"$(CUDA_INSTALL_PATH)/include"
+
+ifeq ($(OS),Windows_NT)
+    LIB_PATH ?= ..\..\lib64
+else
+    LIB_PATH ?= $(CUDA_INSTALL_PATH)/lib64
+endif
+
+# Point to the necessary cross-compiler.
+NVCCFLAGS :=
+
+ifneq ($(TARGET_ARCH), $(HOST_ARCH))
+    ifeq ($(TARGET_ARCH), aarch64)
+        ifeq ($(TARGET_OS), linux)
+            HOST_COMPILER ?= aarch64-linux-gnu-g++
+        else ifeq ($(TARGET_OS),qnx)
+            ifeq ($(QNX_HOST),)
+                $(error ERROR - QNX_HOST must be passed to the QNX host toolchain)
+            endif
+            ifeq ($(QNX_TARGET),)
+                $(error ERROR - QNX_TARGET must be passed to the QNX target toolchain)
+            endif
+            HOST_COMPILER ?= $(QNX_HOST)/usr/bin/q++
+            ifndef QPP_CONFIG_VERSION
+                QPP_CONFIG_VERSION = 12.2.0
+            endif
+            $(info QPP_CONFIG_VERSION = $(QPP_CONFIG_VERSION))
+            NVCCFLAGS += --qpp-config $(QPP_CONFIG_VERSION),gcc_ntoaarch64le -lsocket
+        endif
+    endif
+
+    ifdef HOST_COMPILER
+        NVCC_COMPILER := -ccbin $(HOST_COMPILER)
+    endif
+endif
+
+all: llm-inference
+
+llm-inference: llm-inference.cu
+	$(NVCC) $(NVCC_COMPILER) $(INCLUDES) -o llm-inference llm-inference.cu -L $(LIB_PATH) -lcudart -std=c++17 -Wno-deprecated-gpu-targets --no-device-link
+
+clean:
+	rm -f llm-inference *.o *.bak
--- a/src/xpu/flamegraph/mock-test/llm-inference.cu
+++ b/src/xpu/flamegraph/mock-test/llm-inference.cu
@@ -0,0 +1,702 @@
+#include <iostream>
+#include <iomanip>
+#include <vector>
+#include <memory>
+#include <string>
+#include <array>
+#include <random>
+#include <chrono>
+#include <thread>
+#include <fstream>
+#include <algorithm>
+#include <cuda_runtime.h>
+#include <signal.h>
+#include <cmath>
+
+// =============================================================================
+// Configuration using constexpr
+// =============================================================================
+namespace Config {
+    constexpr size_t BATCH_SIZE = 16;
+    constexpr size_t SEQ_LENGTH = 1024;
+    constexpr size_t HIDDEN_DIM = 2048;
+    constexpr size_t NUM_HEADS = 16;
+    constexpr size_t HEAD_DIM = HIDDEN_DIM / NUM_HEADS;
+    constexpr size_t FFN_DIM = HIDDEN_DIM * 4;
+    constexpr size_t NUM_LAYERS = 4;
+    constexpr size_t VOCAB_SIZE = 4000;
+    constexpr int DURATION_SECONDS = 10;
+}
+
+// =============================================================================
+// CUDA Error Checking Wrapper
+// =============================================================================
+class CudaError : public std::runtime_error {
+public:
+    explicit CudaError(const std::string& msg) : std::runtime_error(msg) {}
+};
+
+inline void checkCuda(cudaError_t result, const char* file, int line) {
+    if (result != cudaSuccess) {
+        throw CudaError(std::string("CUDA Error: ") +
+                       cudaGetErrorString(result) +
+                       " at " + file + ":" + std::to_string(line));
+    }
+}
+
+#define CUDA_CHECK(call) checkCuda((call), __FILE__, __LINE__)
+
+// =============================================================================
+// RAII CUDA Memory Wrapper
+// =============================================================================
+template<typename T>
+class CudaDeviceMemory {
+private:
+    T* data_ = nullptr;
+    size_t size_ = 0;
+
+public:
+    explicit CudaDeviceMemory(size_t count) : size_(count) {
+        if (count > 0) {
+            CUDA_CHECK(cudaMalloc(&data_, count * sizeof(T)));
+            std::cout << "[CUDA] Allocated " << (count * sizeof(T)) / (1024.0 * 1024.0)
+                     << " MB on device" << std::endl;
+        }
+    }
+
+    ~CudaDeviceMemory() {
+        if (data_) {
+            cudaFree(data_);
+        }
+    }
+
+    // Delete copy operations
+    CudaDeviceMemory(const CudaDeviceMemory&) = delete;
+    CudaDeviceMemory& operator=(const CudaDeviceMemory&) = delete;
+
+    // Allow move operations
+    CudaDeviceMemory(CudaDeviceMemory&& other) noexcept
+        : data_(other.data_), size_(other.size_) {
+        other.data_ = nullptr;
+        other.size_ = 0;
+    }
+
+    CudaDeviceMemory& operator=(CudaDeviceMemory&& other) noexcept {
+        if (this != &other) {
+            if (data_) cudaFree(data_);
+            data_ = other.data_;
+            size_ = other.size_;
+            other.data_ = nullptr;
+            other.size_ = 0;
+        }
+        return *this;
+    }
+
+    T* get() { return data_; }
+    const T* get() const { return data_; }
+    size_t size() const { return size_; }
+
+    void copyFromHost(const std::vector<T>& host_data) {
+        if (host_data.size() != size_) {
+            throw std::runtime_error("Size mismatch in copyFromHost");
+        }
+        CUDA_CHECK(cudaMemcpy(data_, host_data.data(),
+                             size_ * sizeof(T), cudaMemcpyHostToDevice));
+    }
+
+    void copyToHost(std::vector<T>& host_data) const {
+        if (host_data.size() != size_) {
+            host_data.resize(size_);
+        }
+        CUDA_CHECK(cudaMemcpy(host_data.data(), data_,
+                             size_ * sizeof(T), cudaMemcpyDeviceToHost));
+    }
+
+    void zero() {
+        CUDA_CHECK(cudaMemset(data_, 0, size_ * sizeof(T)));
+    }
+};
+
+// =============================================================================
+// CUDA Stream Wrapper
+// =============================================================================
+class CudaStream {
+private:
+    cudaStream_t stream_ = nullptr;
+
+public:
+    CudaStream() {
+        CUDA_CHECK(cudaStreamCreate(&stream_));
+    }
+
+    ~CudaStream() {
+        if (stream_) {
+            cudaStreamDestroy(stream_);
+        }
+    }
+
+    CudaStream(const CudaStream&) = delete;
+    CudaStream& operator=(const CudaStream&) = delete;
+
+    cudaStream_t get() const { return stream_; }
+
+    void synchronize() {
+        CUDA_CHECK(cudaStreamSynchronize(stream_));
+    }
+};
+
+// =============================================================================
+// GPU Kernels
+// =============================================================================
+__global__ void attentionQKTKernel(const float* Q, const float* K, float* scores,
+                                   size_t batch, size_t seq_len, size_t head_dim) {
+    size_t b = blockIdx.z;
+    size_t i = blockIdx.y * blockDim.y + threadIdx.y;
+    size_t j = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if (b < batch && i < seq_len && j < seq_len) {
+        float sum = 0.0f;
+        for (size_t k = 0; k < head_dim; k++) {
+            size_t q_idx = b * seq_len * head_dim + i * head_dim + k;
+            size_t k_idx = b * seq_len * head_dim + j * head_dim + k;
+            sum += Q[q_idx] * K[k_idx];
+        }
+        scores[b * seq_len * seq_len + i * seq_len + j] = sum / sqrtf(static_cast<float>(head_dim));
+    }
+}
+
+__global__ void softmaxKernel(const float* input, float* output, size_t batch, size_t seq_len) {
+    size_t b = blockIdx.y;
+    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if (b < batch && i < seq_len) {
+        float max_val = -INFINITY;
+        for (size_t j = 0; j < seq_len; j++) {
+            size_t idx = b * seq_len * seq_len + i * seq_len + j;
+            max_val = fmaxf(max_val, input[idx]);
+        }
+
+        float sum = 0.0f;
+        for (size_t j = 0; j < seq_len; j++) {
+            size_t idx = b * seq_len * seq_len + i * seq_len + j;
+            output[idx] = expf(input[idx] - max_val);
+            sum += output[idx];
+        }
+
+        for (size_t j = 0; j < seq_len; j++) {
+            size_t idx = b * seq_len * seq_len + i * seq_len + j;
+            output[idx] /= sum;
+        }
+    }
+}
+
+__global__ void layerNormKernel(const float* input, float* output,
+                                const float* gamma, const float* beta,
+                                size_t batch, size_t seq_len, size_t hidden_dim) {
+    size_t b = blockIdx.y;
+    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if (b < batch && i < seq_len) {
+        float mean = 0.0f;
+        for (size_t j = 0; j < hidden_dim; j++) {
+            mean += input[b * seq_len * hidden_dim + i * hidden_dim + j];
+        }
+        mean /= hidden_dim;
+
+        float variance = 0.0f;
+        for (size_t j = 0; j < hidden_dim; j++) {
+            float diff = input[b * seq_len * hidden_dim + i * hidden_dim + j] - mean;
+            variance += diff * diff;
+        }
+        variance /= hidden_dim;
+
+        float std = sqrtf(variance + 1e-5f);
+        for (size_t j = 0; j < hidden_dim; j++) {
+            size_t idx = b * seq_len * hidden_dim + i * hidden_dim + j;
+            output[idx] = gamma[j] * (input[idx] - mean) / std + beta[j];
+        }
+    }
+}
+
+__global__ void residualAddKernel(const float* input, const float* residual,
+                                  float* output, size_t n) {
+    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n) {
+        output[idx] = input[idx] + residual[idx];
+    }
+}
+
+// =============================================================================
+// Token Embedding using modern C++
+// =============================================================================
+class TokenEmbedding {
+private:
+    std::vector<float> embeddings_;
+    size_t vocab_size_;
+    size_t embedding_dim_;
+    std::mt19937 rng_;
+    std::uniform_real_distribution<float> dist_;
+
+public:
+    TokenEmbedding(size_t vocab_size, size_t embedding_dim)
+        : vocab_size_(vocab_size)
+        , embedding_dim_(embedding_dim)
+        , rng_(std::random_device{}())
+        , dist_(-1.0f, 1.0f) {
+
+        embeddings_.resize(vocab_size * embedding_dim);
+        std::cout << "[Init] Creating TokenEmbedding: "
+                  << (embeddings_.size() * sizeof(float)) / (1024.0 * 1024.0)
+                  << " MB" << std::endl;
+
+        // Initialize with random values
+        for (auto& val : embeddings_) {
+            val = dist_(rng_);
+        }
+    }
+
+    void embed(const std::vector<int>& tokens, std::vector<float>& output) const {
+        // Output should be sized for full batch
+        size_t required_size = Config::BATCH_SIZE * Config::SEQ_LENGTH * embedding_dim_;
+        output.resize(required_size);
+        std::fill(output.begin(), output.end(), 0.0f);
+
+        // Fill first sequence with actual embeddings
+        for (size_t i = 0; i < tokens.size() && i < Config::SEQ_LENGTH; ++i) {
+            int token_id = tokens[i] % vocab_size_;
+            size_t src_offset = token_id * embedding_dim_;
+            size_t dst_offset = i * embedding_dim_;
+
+            std::copy_n(embeddings_.begin() + src_offset,
+                       embedding_dim_,
+                       output.begin() + dst_offset);
+        }
+    }
+
+    size_t getEmbeddingDim() const { return embedding_dim_; }
+};
+
+// =============================================================================
+// Transformer Layer using RAII
+// =============================================================================
+class TransformerLayer {
+private:
+    CudaDeviceMemory<float> d_Q_;
+    CudaDeviceMemory<float> d_K_;
+    CudaDeviceMemory<float> d_V_;
+    CudaDeviceMemory<float> d_attn_scores_;
+    CudaDeviceMemory<float> d_attn_probs_;
+    CudaDeviceMemory<float> d_attn_output_;
+    CudaDeviceMemory<float> d_ln_gamma_;
+    CudaDeviceMemory<float> d_ln_beta_;
+    CudaDeviceMemory<float> d_residual_;
+
+    std::vector<float> h_gamma_;
+    std::vector<float> h_beta_;
+    CudaStream stream_;
+
+public:
+    TransformerLayer()
+        : d_Q_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HEAD_DIM)
+        , d_K_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HEAD_DIM)
+        , d_V_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HEAD_DIM)
+        , d_attn_scores_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::SEQ_LENGTH)
+        , d_attn_probs_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::SEQ_LENGTH)
+        , d_attn_output_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HEAD_DIM)
+        , d_ln_gamma_(Config::HIDDEN_DIM)
+        , d_ln_beta_(Config::HIDDEN_DIM)
+        , d_residual_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HIDDEN_DIM)
+        , h_gamma_(Config::HIDDEN_DIM, 1.0f)
+        , h_beta_(Config::HIDDEN_DIM, 0.0f) {
+
+        std::cout << "[Init] Creating TransformerLayer" << std::endl;
+
+        d_ln_gamma_.copyFromHost(h_gamma_);
+        d_ln_beta_.copyFromHost(h_beta_);
+    }
+
+    void forward(const CudaDeviceMemory<float>& d_input,
+                 CudaDeviceMemory<float>& d_output) {
+
+        // Do multiple passes to increase GPU compute time
+        // Pass 1: Layer norm
+        dim3 ln_grid((Config::SEQ_LENGTH + 255) / 256, Config::BATCH_SIZE);
+        layerNormKernel<<<ln_grid, 256, 0, stream_.get()>>>(
+            d_input.get(), d_residual_.get(),
+            d_ln_gamma_.get(), d_ln_beta_.get(),
+            Config::BATCH_SIZE, Config::SEQ_LENGTH, Config::HIDDEN_DIM);
+
+        // Pass 2: Multiple softmax iterations to increase GPU compute
+        dim3 softmax_grid((Config::SEQ_LENGTH + 255) / 256, Config::BATCH_SIZE);
+        for (int i = 0; i < 22; ++i) {  // Tuned to 22 iterations for ~50% GPU
+            softmaxKernel<<<softmax_grid, 256, 0, stream_.get()>>>(
+                d_attn_scores_.get(), d_attn_probs_.get(),
+                Config::BATCH_SIZE, Config::SEQ_LENGTH);
+        }
+
+        // Pass 3: Residual add
+        size_t total_elements = Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HIDDEN_DIM;
+        for (int i = 0; i < 2; ++i) {
+            residualAddKernel<<<(total_elements + 255) / 256, 256, 0, stream_.get()>>>(
+                d_residual_.get(), d_input.get(), d_output.get(), total_elements);
+        }
+
+        // Pass 4: Multiple layer norm passes
+        for (int i = 0; i < 2; ++i) {
+            layerNormKernel<<<ln_grid, 256, 0, stream_.get()>>>(
+                d_output.get(), d_residual_.get(),
+                d_ln_gamma_.get(), d_ln_beta_.get(),
+                Config::BATCH_SIZE, Config::SEQ_LENGTH, Config::HIDDEN_DIM);
+        }
+
+        // Pass 5: Final residual
+        residualAddKernel<<<(total_elements + 255) / 256, 256, 0, stream_.get()>>>(
+            d_residual_.get(), d_input.get(), d_output.get(), total_elements);
+
+        stream_.synchronize();
+    }
+};
+
+// =============================================================================
+// File Cache Manager
+// =============================================================================
+class PromptCache {
+private:
+    std::string cache_dir_;
+    std::vector<std::string> cached_files_;
+
+public:
+    PromptCache() {
+        cache_dir_ = "/tmp/llm_cache_" + std::to_string(getpid());
+        std::string cmd = "mkdir -p " + cache_dir_;
+        std::system(cmd.c_str());
+        std::cout << "[Init] Cache directory: " << cache_dir_ << std::endl;
+    }
+
+    ~PromptCache() {
+        cleanup();
+    }
+
+    void writeCache(const std::string& key, const std::vector<float>& data, int iteration) {
+        std::string filename = cache_dir_ + "/cache_" + key + "_" + std::to_string(iteration) + ".bin";
+        std::ofstream file(filename, std::ios::binary);
+        if (file) {
+            file.write(reinterpret_cast<const char*>(data.data()),
+                      data.size() * sizeof(float));
+            cached_files_.push_back(filename);
+        }
+    }
+
+    bool readCache(const std::string& key, std::vector<float>& data, int iteration) {
+        std::string filename = cache_dir_ + "/cache_" + key + "_" + std::to_string(iteration) + ".bin";
+        std::ifstream file(filename, std::ios::binary);
+        if (!file) return false;
+
+        file.seekg(0, std::ios::end);
+        size_t size = file.tellg() / sizeof(float);
+        file.seekg(0, std::ios::beg);
+
+        data.resize(size);
+        file.read(reinterpret_cast<char*>(data.data()), size * sizeof(float));
+        return true;
+    }
+
+    void cleanup() {
+        for (const auto& file : cached_files_) {
+            std::remove(file.c_str());
+        }
+        std::string cmd = "rm -rf " + cache_dir_;
+        std::system(cmd.c_str());
+    }
+};
+
+// =============================================================================
+// Performance Timing Statistics
+// =============================================================================
+struct RequestTimings {
+    double cpu_compute_ms = 0.0;
+    double gpu_compute_ms = 0.0;
+    double io_time_ms = 0.0;
+
+    void add(const RequestTimings& other) {
+        cpu_compute_ms += other.cpu_compute_ms;
+        gpu_compute_ms += other.gpu_compute_ms;
+        io_time_ms += other.io_time_ms;
+    }
+
+    double total_ms() const {
+        return cpu_compute_ms + gpu_compute_ms + io_time_ms;
+    }
+};
+
+// =============================================================================
+// Main Inference Pipeline
+// =============================================================================
+class InferencePipeline {
+private:
+    std::unique_ptr<TokenEmbedding> embedding_;
+    std::vector<std::unique_ptr<TransformerLayer>> layers_;
+    std::unique_ptr<PromptCache> cache_;
+
+    CudaDeviceMemory<float> d_input_;
+    CudaDeviceMemory<float> d_output_;
+
+    std::vector<float> h_input_;
+    std::vector<float> h_output_;
+
+    // Performance tracking
+    std::vector<RequestTimings> request_timings_;
+    RequestTimings accumulated_timings_;
+    int request_count_ = 0;
+
+    std::array<std::string, 5> prompts_ = {
+        "What is artificial intelligence?",
+        "Explain transformer architectures",
+        "Describe deep learning techniques",
+        "What are neural networks?",
+        "How does machine learning work?"
+    };
+
+public:
+    InferencePipeline()
+        : embedding_(std::make_unique<TokenEmbedding>(Config::VOCAB_SIZE, Config::HIDDEN_DIM))
+        , cache_(std::make_unique<PromptCache>())
+        , d_input_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HIDDEN_DIM)
+        , d_output_(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HIDDEN_DIM) {
+
+        std::cout << "[Init] Creating InferencePipeline with "
+                  << Config::NUM_LAYERS << " layers" << std::endl;
+
+        // Create transformer layers
+        for (size_t i = 0; i < Config::NUM_LAYERS; ++i) {
+            std::cout << "[Init] Creating layer " << (i + 1) << "/"
+                     << Config::NUM_LAYERS << std::endl;
+            layers_.push_back(std::make_unique<TransformerLayer>());
+        }
+
+        h_input_.resize(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HIDDEN_DIM);
+        h_output_.resize(Config::BATCH_SIZE * Config::SEQ_LENGTH * Config::HIDDEN_DIM);
+
+        std::cout << "[Init] Pipeline initialization complete" << std::endl;
+    }
+
+    void runRequest(int request_id) {
+        RequestTimings timings;
+        auto start_time = std::chrono::high_resolution_clock::now();
+
+        // Select prompt
+        const auto& prompt = prompts_[request_id % prompts_.size()];
+
+        // ===== CPU COMPUTE: Tokenization =====
+        auto cpu_start = std::chrono::high_resolution_clock::now();
+        std::vector<int> tokens;
+        tokens.reserve(Config::SEQ_LENGTH);
+        for (size_t i = 0; i < Config::SEQ_LENGTH && i < prompt.length(); ++i) {
+            tokens.push_back(static_cast<int>(prompt[i]));
+        }
+        while (tokens.size() < Config::SEQ_LENGTH) {
+            tokens.push_back(0);  // Padding
+        }
+
+        // ===== CPU COMPUTE: Embedding lookup =====
+        embedding_->embed(tokens, h_input_);
+
+        // ===== CPU COMPUTE: Additional preprocessing (to increase CPU time) =====
+        // Simulate text preprocessing, normalization, etc.
+        std::vector<float> temp_buffer(Config::SEQ_LENGTH * 150);  // Increased buffer
+        for (size_t i = 0; i < temp_buffer.size(); ++i) {
+            temp_buffer[i] = std::sin(static_cast<float>(i)) * std::cos(static_cast<float>(request_id));
+        }
+
+        // Simulate some CPU-intensive work (sorting, searching, etc.)
+        for (int iter = 0; iter < 12; ++iter) {  // Tuned to 12 iterations for ~25% CPU
+            std::partial_sort(temp_buffer.begin(), temp_buffer.begin() + 1500, temp_buffer.end());
+        }
+
+        auto cpu_end = std::chrono::high_resolution_clock::now();
+        timings.cpu_compute_ms = std::chrono::duration<double, std::milli>(cpu_end - cpu_start).count();
+
+        // ===== I/O: Transfer to GPU =====
+        auto io_start = std::chrono::high_resolution_clock::now();
+        d_input_.copyFromHost(h_input_);
+        auto io_end = std::chrono::high_resolution_clock::now();
+        timings.io_time_ms += std::chrono::duration<double, std::milli>(io_end - io_start).count();
+
+        // ===== GPU COMPUTE: Forward pass through transformer layers =====
+        auto gpu_start = std::chrono::high_resolution_clock::now();
+        auto* current_input = &d_input_;
+        auto* current_output = &d_output_;
+
+        for (auto& layer : layers_) {
+            layer->forward(*current_input, *current_output);
+            std::swap(current_input, current_output);
+        }
+        auto gpu_end = std::chrono::high_resolution_clock::now();
+        timings.gpu_compute_ms = std::chrono::duration<double, std::milli>(gpu_end - gpu_start).count();
+
+        // ===== I/O: Transfer back to CPU =====
+        io_start = std::chrono::high_resolution_clock::now();
+        current_input->copyToHost(h_output_);
+        io_end = std::chrono::high_resolution_clock::now();
+        timings.io_time_ms += std::chrono::duration<double, std::milli>(io_end - io_start).count();
+
+        // ===== I/O: Cache results (file I/O) =====
+        if (request_id % 2 == 0) {
+            io_start = std::chrono::high_resolution_clock::now();
+            cache_->writeCache("prompt_" + std::to_string(request_id % prompts_.size()),
+                              h_output_, request_id);
+            io_end = std::chrono::high_resolution_clock::now();
+            timings.io_time_ms += std::chrono::duration<double, std::milli>(io_end - io_start).count();
+        }
+
+        // ===== I/O: Simulate network delay =====
+        io_start = std::chrono::high_resolution_clock::now();
+        std::this_thread::sleep_for(std::chrono::milliseconds(10));  // Reduced from 50ms to 10ms
+        io_end = std::chrono::high_resolution_clock::now();
+        timings.io_time_ms += std::chrono::duration<double, std::milli>(io_end - io_start).count();
+
+        // Track timings
+        request_timings_.push_back(timings);
+        accumulated_timings_.add(timings);
+        request_count_++;
+
+        // Report every 10 requests
+        if (request_count_ % 10 == 0) {
+            reportTimings(request_count_);
+        }
+    }
+
+    void reportTimings(int last_request_id) {
+        // Calculate statistics for last 10 requests
+        size_t start_idx = request_timings_.size() >= 10 ? request_timings_.size() - 10 : 0;
+        RequestTimings last_10;
+
+        for (size_t i = start_idx; i < request_timings_.size(); ++i) {
+            last_10.add(request_timings_[i]);
+        }
+
+        int count = request_timings_.size() - start_idx;
+        double avg_cpu = last_10.cpu_compute_ms / count;
+        double avg_gpu = last_10.gpu_compute_ms / count;
+        double avg_io = last_10.io_time_ms / count;
+        double avg_total = (avg_cpu + avg_gpu + avg_io);
+
+        std::cout << "\n[Performance Report] Requests " << (last_request_id - count + 1)
+                  << " - " << last_request_id << " (last " << count << " requests):" << std::endl;
+        std::cout << "  CPU Compute:  " << std::fixed << std::setprecision(2)
+                  << avg_cpu << " ms (" << (avg_cpu / avg_total * 100) << "%)" << std::endl;
+        std::cout << "  GPU Compute:  " << avg_gpu << " ms ("
+                  << (avg_gpu / avg_total * 100) << "%)" << std::endl;
+        std::cout << "  I/O (+ Net):  " << avg_io << " ms ("
+                  << (avg_io / avg_total * 100) << "%)" << std::endl;
+        std::cout << "  Total Time:   " << avg_total << " ms/request" << std::endl;
+    }
+
+    void printFinalReport() {
+        if (request_count_ == 0) return;
+
+        std::cout << "\n=============================================================" << std::endl;
+        std::cout << "Final Performance Report (" << request_count_ << " total requests)" << std::endl;
+        std::cout << "=============================================================" << std::endl;
+
+        double avg_cpu = accumulated_timings_.cpu_compute_ms / request_count_;
+        double avg_gpu = accumulated_timings_.gpu_compute_ms / request_count_;
+        double avg_io = accumulated_timings_.io_time_ms / request_count_;
+        double avg_total = (avg_cpu + avg_gpu + avg_io);
+
+        std::cout << "Average per request:" << std::endl;
+        std::cout << "  CPU Compute:  " << std::fixed << std::setprecision(2)
+                  << avg_cpu << " ms (" << (avg_cpu / avg_total * 100) << "%)" << std::endl;
+        std::cout << "  GPU Compute:  " << avg_gpu << " ms ("
+                  << (avg_gpu / avg_total * 100) << "%)" << std::endl;
+        std::cout << "  I/O (+ Net):  " << avg_io << " ms ("
+                  << (avg_io / avg_total * 100) << "%)" << std::endl;
+        std::cout << "  Total Time:   " << avg_total << " ms/request" << std::endl;
+        std::cout << "\nTotal time breakdown:" << std::endl;
+        std::cout << "  CPU Compute:  " << accumulated_timings_.cpu_compute_ms << " ms" << std::endl;
+        std::cout << "  GPU Compute:  " << accumulated_timings_.gpu_compute_ms << " ms" << std::endl;
+        std::cout << "  I/O (+ Net):  " << accumulated_timings_.io_time_ms << " ms" << std::endl;
+        std::cout << "=============================================================" << std::endl;
+    }
+};
+
+// =============================================================================
+// Global cleanup handler
+// =============================================================================
+std::unique_ptr<InferencePipeline> g_pipeline;
+volatile sig_atomic_t g_interrupted = 0;
+
+void signalHandler(int signum) {
+    std::cout << "\n[Signal] Received signal " << signum << ", cleaning up..." << std::endl;
+    g_interrupted = 1;
+    g_pipeline.reset();
+    std::cout << "[Cleanup] Complete. Exiting." << std::endl;
+    exit(signum);
+}
+
+// =============================================================================
+// Main
+// =============================================================================
+int main() {
+    try {
+        std::cout << "=============================================================" << std::endl;
+        std::cout << "Modern C++ LLM Inference Simulator" << std::endl;
+        std::cout << "=============================================================" << std::endl;
+        std::cout << "Configuration:" << std::endl;
+        std::cout << "  - Batch Size: " << Config::BATCH_SIZE << std::endl;
+        std::cout << "  - Sequence Length: " << Config::SEQ_LENGTH << std::endl;
+        std::cout << "  - Hidden Dimension: " << Config::HIDDEN_DIM << std::endl;
+        std::cout << "  - Number of Layers: " << Config::NUM_LAYERS << std::endl;
+        std::cout << "  - Duration: " << Config::DURATION_SECONDS << " seconds" << std::endl;
+        std::cout << "=============================================================" << std::endl;
+
+        // Initialize CUDA
+        CUDA_CHECK(cudaSetDevice(0));
+        std::cout << "[Init] CUDA device initialized" << std::endl;
+
+        // Setup signal handlers
+        signal(SIGINT, signalHandler);
+        signal(SIGTERM, signalHandler);
+
+        // Create pipeline
+        g_pipeline = std::make_unique<InferencePipeline>();
+
+        // Run request processing loop
+        auto start = std::chrono::steady_clock::now();
+        int request_id = 0;
+
+        std::cout << "\n[Starting] Processing requests for " << Config::DURATION_SECONDS
+                  << " seconds..." << std::endl;
+
+        while (!g_interrupted) {
+            auto now = std::chrono::steady_clock::now();
+            auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(now - start).count();
+
+            if (elapsed >= Config::DURATION_SECONDS) {
+                break;
+            }
+
+            g_pipeline->runRequest(request_id);
+            request_id++;
+        }
+
+        std::cout << "\n=============================================================" << std::endl;
+        std::cout << "Completed " << request_id << " requests in "
+                  << Config::DURATION_SECONDS << " seconds" << std::endl;
+        std::cout << "Average throughput: "
+                  << (request_id / static_cast<double>(Config::DURATION_SECONDS))
+                  << " requests/second" << std::endl;
+        std::cout << "=============================================================" << std::endl;
+
+        // Print final performance report
+        g_pipeline->printFinalReport();
+
+        g_pipeline.reset();
+
+        return 0;
+
+    } catch (const std::exception& e) {
+        std::cerr << "[ERROR] " << e.what() << std::endl;
+        return 1;
+    }
+}
--- a/src/xpu/flamegraph/profiler/.cargo/config.toml
+++ b/src/xpu/flamegraph/profiler/.cargo/config.toml
@@ -0,0 +1,2 @@
+[build]
+rustflags = ["-C", "linker=gcc"]
--- a/src/xpu/flamegraph/profiler/.gitignore
+++ b/src/xpu/flamegraph/profiler/.gitignore
@@ -0,0 +1,2 @@
+/src/bpf/.output
+/target
--- a/src/xpu/flamegraph/profiler/Cargo.lock
+++ b/src/xpu/flamegraph/profiler/Cargo.lock
@@ -0,0 +1,909 @@
+# This file is automatically @generated by Cargo.
+# It is not intended for manual editing.
+version = 3
+
+[[package]]
+name = "adler2"
+version = "2.0.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa"
+
+[[package]]
+name = "aho-corasick"
+version = "1.1.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8e60d3430d3a69478ad0993f19238d2df97c507009a52b3c10addcd7f6bcb916"
+dependencies = [
+ "memchr",
+]
+
+[[package]]
+name = "anstream"
+version = "0.6.20"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "3ae563653d1938f79b1ab1b5e668c87c76a9930414574a6583a7b7e11a8e6192"
+dependencies = [
+ "anstyle",
+ "anstyle-parse",
+ "anstyle-query",
+ "anstyle-wincon",
+ "colorchoice",
+ "is_terminal_polyfill",
+ "utf8parse",
+]
+
+[[package]]
+name = "anstyle"
+version = "1.0.11"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "862ed96ca487e809f1c8e5a8447f6ee2cf102f846893800b20cebdf541fc6bbd"
+
+[[package]]
+name = "anstyle-parse"
+version = "0.2.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "4e7644824f0aa2c7b9384579234ef10eb7efb6a0deb83f9630a49594dd9c15c2"
+dependencies = [
+ "utf8parse",
+]
+
+[[package]]
+name = "anstyle-query"
+version = "1.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9e231f6134f61b71076a3eab506c379d4f36122f2af15a9ff04415ea4c3339e2"
+dependencies = [
+ "windows-sys",
+]
+
+[[package]]
+name = "anstyle-wincon"
+version = "3.0.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "3e0633414522a32ffaac8ac6cc8f748e090c5717661fddeea04219e2344f5f2a"
+dependencies = [
+ "anstyle",
+ "once_cell_polyfill",
+ "windows-sys",
+]
+
+[[package]]
+name = "anyhow"
+version = "1.0.99"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b0674a1ddeecb70197781e945de4b3b8ffb61fa939a5597bcf48503737663100"
+
+[[package]]
+name = "bitflags"
+version = "2.9.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "34efbcccd345379ca2868b2b2c9d3782e9cc58ba87bc7d79d5b53d9c9ae6f25d"
+
+[[package]]
+name = "blazesym"
+version = "0.2.0-rc.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "29a810b7e5f883ad3c711208237841f051061bf59b6ee698ac4dc1fe12a3a5db"
+dependencies = [
+ "cpp_demangle",
+ "gimli",
+ "libc",
+ "memmap2 0.9.8",
+ "miniz_oxide",
+ "rustc-demangle",
+ "tracing",
+]
+
+[[package]]
+name = "camino"
+version = "1.1.11"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5d07aa9a93b00c76f71bc35d598bed923f6d4f3a9ca5c24b7737ae1a292841c0"
+dependencies = [
+ "serde",
+]
+
+[[package]]
+name = "cargo-platform"
+version = "0.1.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e35af189006b9c0f00a064685c727031e3ed2d8020f7ba284d78cc2671bd36ea"
+dependencies = [
+ "serde",
+]
+
+[[package]]
+name = "cargo_metadata"
+version = "0.15.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "eee4243f1f26fc7a42710e7439c149e2b10b05472f88090acce52632f231a73a"
+dependencies = [
+ "camino",
+ "cargo-platform",
+ "semver",
+ "serde",
+ "serde_json",
+ "thiserror",
+]
+
+[[package]]
+name = "cc"
+version = "1.2.34"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "42bc4aea80032b7bf409b0bc7ccad88853858911b7713a8062fdc0623867bedc"
+dependencies = [
+ "shlex",
+]
+
+[[package]]
+name = "cfg-if"
+version = "1.0.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2fd1289c04a9ea8cb22300a459a72a385d7c73d3259e2ed7dcb2af674838cfa9"
+
+[[package]]
+name = "cfg_aliases"
+version = "0.2.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "613afe47fcd5fac7ccf1db93babcb082c5994d996f20b8b159f2ad1658eb5724"
+
+[[package]]
+name = "clap"
+version = "4.5.46"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2c5e4fcf9c21d2e544ca1ee9d8552de13019a42aa7dbf32747fa7aaf1df76e57"
+dependencies = [
+ "clap_builder",
+ "clap_derive",
+]
+
+[[package]]
+name = "clap_builder"
+version = "4.5.46"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "fecb53a0e6fcfb055f686001bc2e2592fa527efaf38dbe81a6a9563562e57d41"
+dependencies = [
+ "anstream",
+ "anstyle",
+ "clap_lex",
+ "strsim",
+]
+
+[[package]]
+name = "clap_derive"
+version = "4.5.45"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "14cb31bb0a7d536caef2639baa7fad459e15c3144efefa6dbd1c84562c4739f6"
+dependencies = [
+ "heck",
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "clap_lex"
+version = "0.7.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b94f61472cee1439c0b966b47e3aca9ae07e45d070759512cd390ea2bebc6675"
+
+[[package]]
+name = "colorchoice"
+version = "1.0.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b05b61dc5112cbb17e4b6cd61790d9845d13888356391624cbe7e41efeac1e75"
+
+[[package]]
+name = "cpp_demangle"
+version = "0.4.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "96e58d342ad113c2b878f16d5d034c03be492ae460cdbc02b7f0f2284d310c7d"
+dependencies = [
+ "cfg-if",
+]
+
+[[package]]
+name = "equivalent"
+version = "1.0.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f"
+
+[[package]]
+name = "errno"
+version = "0.3.13"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "778e2ac28f6c47af28e4907f13ffd1e1ddbd400980a9abd7c8df189bf578a5ad"
+dependencies = [
+ "libc",
+ "windows-sys",
+]
+
+[[package]]
+name = "fallible-iterator"
+version = "0.3.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2acce4a10f12dc2fb14a218589d4f1f62ef011b2d0cc4b3cb1bba8e94da14649"
+
+[[package]]
+name = "fastrand"
+version = "2.3.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be"
+
+[[package]]
+name = "getrandom"
+version = "0.3.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "26145e563e54f2cadc477553f1ec5ee650b00862f0a58bcd12cbdc5f0ea2d2f4"
+dependencies = [
+ "cfg-if",
+ "libc",
+ "r-efi",
+ "wasi",
+]
+
+[[package]]
+name = "gimli"
+version = "0.32.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cc6298e594375a7fead9efd5568f0a46e6a154fb6a9bdcbe3c06946ffd81a5f6"
+dependencies = [
+ "fallible-iterator",
+ "indexmap",
+ "stable_deref_trait",
+]
+
+[[package]]
+name = "hashbrown"
+version = "0.15.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1"
+
+[[package]]
+name = "heck"
+version = "0.5.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea"
+
+[[package]]
+name = "indexmap"
+version = "2.11.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f2481980430f9f78649238835720ddccc57e52df14ffce1c6f37391d61b563e9"
+dependencies = [
+ "equivalent",
+ "hashbrown",
+]
+
+[[package]]
+name = "is_terminal_polyfill"
+version = "1.70.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7943c866cc5cd64cbc25b2e01621d07fa8eb2a1a23160ee81ce38704e97b8ecf"
+
+[[package]]
+name = "itoa"
+version = "1.0.15"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "4a5f13b858c8d314ee3e8f639011f7ccefe71f97f96e50151fb991f267928e2c"
+
+[[package]]
+name = "lazy_static"
+version = "1.5.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe"
+
+[[package]]
+name = "libbpf-cargo"
+version = "0.24.8"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "704727a07f185a76c58faa7b8ed08fba3194661c212183aea1174fe2970ee185"
+dependencies = [
+ "anyhow",
+ "cargo_metadata",
+ "clap",
+ "libbpf-rs",
+ "memmap2 0.5.10",
+ "regex",
+ "semver",
+ "serde",
+ "serde_json",
+ "tempfile",
+]
+
+[[package]]
+name = "libbpf-rs"
+version = "0.24.8"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "93edd9cd673087fa7518fd63ad6c87be2cd9b4e35034b1873f3e3258c018275b"
+dependencies = [
+ "bitflags",
+ "libbpf-sys",
+ "libc",
+ "vsprintf",
+]
+
+[[package]]
+name = "libbpf-sys"
+version = "1.6.1+v1.6.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e351855cbd724ac341b2a1c163568808e72acd930c491a921331c2e5347390d3"
+dependencies = [
+ "cc",
+ "nix 0.30.1",
+ "pkg-config",
+]
+
+[[package]]
+name = "libc"
+version = "0.2.175"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6a82ae493e598baaea5209805c49bbf2ea7de956d50d7da0da1164f9c6d28543"
+
+[[package]]
+name = "linux-raw-sys"
+version = "0.9.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cd945864f07fe9f5371a27ad7b52a172b4b499999f1d97574c9fa68373937e12"
+
+[[package]]
+name = "log"
+version = "0.4.27"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "13dc2df351e3202783a1fe0d44375f7295ffb4049267b0f3018346dc122a1d94"
+
+[[package]]
+name = "matchers"
+version = "0.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8263075bb86c5a1b1427b5ae862e8889656f126e9f77c484496e8b47cf5c5558"
+dependencies = [
+ "regex-automata 0.1.10",
+]
+
+[[package]]
+name = "memchr"
+version = "2.7.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "32a282da65faaf38286cf3be983213fcf1d2e2a58700e808f83f4ea9a4804bc0"
+
+[[package]]
+name = "memmap2"
+version = "0.5.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "83faa42c0a078c393f6b29d5db232d8be22776a891f8f56e5284faee4a20b327"
+dependencies = [
+ "libc",
+]
+
+[[package]]
+name = "memmap2"
+version = "0.9.8"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "843a98750cd611cc2965a8213b53b43e715f13c37a9e096c6408e69990961db7"
+dependencies = [
+ "libc",
+]
+
+[[package]]
+name = "miniz_oxide"
+version = "0.8.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1fa76a2c86f704bdb222d66965fb3d63269ce38518b83cb0575fca855ebb6316"
+dependencies = [
+ "adler2",
+ "simd-adler32",
+]
+
+[[package]]
+name = "nix"
+version = "0.29.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "71e2746dc3a24dd78b3cfcb7be93368c6de9963d30f43a6a73998a9cf4b17b46"
+dependencies = [
+ "bitflags",
+ "cfg-if",
+ "cfg_aliases",
+ "libc",
+]
+
+[[package]]
+name = "nix"
+version = "0.30.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "74523f3a35e05aba87a1d978330aef40f67b0304ac79c1c00b294c9830543db6"
+dependencies = [
+ "bitflags",
+ "cfg-if",
+ "cfg_aliases",
+ "libc",
+]
+
+[[package]]
+name = "nu-ansi-term"
+version = "0.46.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "77a8165726e8236064dbb45459242600304b42a5ea24ee2948e18e023bf7ba84"
+dependencies = [
+ "overload",
+ "winapi",
+]
+
+[[package]]
+name = "once_cell"
+version = "1.21.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "42f5e15c9953c5e4ccceeb2e7382a716482c34515315f7b03532b8b4e8393d2d"
+
+[[package]]
+name = "once_cell_polyfill"
+version = "1.70.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a4895175b425cb1f87721b59f0f286c2092bd4af812243672510e1ac53e2e0ad"
+
+[[package]]
+name = "overload"
+version = "0.1.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b15813163c1d831bf4a13c3610c05c0d03b39feb07f7e09fa234dac9b15aaf39"
+
+[[package]]
+name = "pin-project-lite"
+version = "0.2.16"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "3b3cff922bd51709b605d9ead9aa71031d81447142d828eb4a6eba76fe619f9b"
+
+[[package]]
+name = "pkg-config"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7edddbd0b52d732b21ad9a5fab5c704c14cd949e5e9a1ec5929a24fded1b904c"
+
+[[package]]
+name = "proc-macro2"
+version = "1.0.101"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "89ae43fd86e4158d6db51ad8e2b80f313af9cc74f5c0e03ccb87de09998732de"
+dependencies = [
+ "unicode-ident",
+]
+
+[[package]]
+name = "profile"
+version = "0.1.0"
+dependencies = [
+ "blazesym",
+ "clap",
+ "libbpf-cargo",
+ "libbpf-rs",
+ "libc",
+ "nix 0.29.0",
+ "tracing",
+ "tracing-subscriber",
+]
+
+[[package]]
+name = "quote"
+version = "1.0.40"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1885c039570dc00dcb4ff087a89e185fd56bae234ddc7f056a945bf36467248d"
+dependencies = [
+ "proc-macro2",
+]
+
+[[package]]
+name = "r-efi"
+version = "5.3.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f"
+
+[[package]]
+name = "regex"
+version = "1.11.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "23d7fd106d8c02486a8d64e778353d1cffe08ce79ac2e82f540c86d0facf6912"
+dependencies = [
+ "aho-corasick",
+ "memchr",
+ "regex-automata 0.4.10",
+ "regex-syntax 0.8.6",
+]
+
+[[package]]
+name = "regex-automata"
+version = "0.1.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6c230d73fb8d8c1b9c0b3135c5142a8acee3a0558fb8db5cf1cb65f8d7862132"
+dependencies = [
+ "regex-syntax 0.6.29",
+]
+
+[[package]]
+name = "regex-automata"
+version = "0.4.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6b9458fa0bfeeac22b5ca447c63aaf45f28439a709ccd244698632f9aa6394d6"
+dependencies = [
+ "aho-corasick",
+ "memchr",
+ "regex-syntax 0.8.6",
+]
+
+[[package]]
+name = "regex-syntax"
+version = "0.6.29"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f162c6dd7b008981e4d40210aca20b4bd0f9b60ca9271061b07f78537722f2e1"
+
+[[package]]
+name = "regex-syntax"
+version = "0.8.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "caf4aa5b0f434c91fe5c7f1ecb6a5ece2130b02ad2a590589dda5146df959001"
+
+[[package]]
+name = "rustc-demangle"
+version = "0.1.26"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "56f7d92ca342cea22a06f2121d944b4fd82af56988c270852495420f961d4ace"
+
+[[package]]
+name = "rustix"
+version = "1.0.8"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "11181fbabf243db407ef8df94a6ce0b2f9a733bd8be4ad02b4eda9602296cac8"
+dependencies = [
+ "bitflags",
+ "errno",
+ "libc",
+ "linux-raw-sys",
+ "windows-sys",
+]
+
+[[package]]
+name = "ryu"
+version = "1.0.20"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "28d3b2b1366ec20994f1fd18c3c594f05c5dd4bc44d8bb0c1c632c8d6829481f"
+
+[[package]]
+name = "semver"
+version = "1.0.26"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "56e6fa9c48d24d85fb3de5ad847117517440f6beceb7798af16b4a87d616b8d0"
+dependencies = [
+ "serde",
+]
+
+[[package]]
+name = "serde"
+version = "1.0.219"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5f0e2c6ed6606019b4e29e69dbaba95b11854410e5347d525002456dbbb786b6"
+dependencies = [
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_derive"
+version = "1.0.219"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5b0276cf7f2c73365f7157c8123c21cd9a50fbbd844757af28ca1f5925fc2a00"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "serde_json"
+version = "1.0.143"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d401abef1d108fbd9cbaebc3e46611f4b1021f714a0597a71f41ee463f5f4a5a"
+dependencies = [
+ "itoa",
+ "memchr",
+ "ryu",
+ "serde",
+]
+
+[[package]]
+name = "sharded-slab"
+version = "0.1.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f40ca3c46823713e0d4209592e8d6e826aa57e928f09752619fc696c499637f6"
+dependencies = [
+ "lazy_static",
+]
+
+[[package]]
+name = "shlex"
+version = "1.3.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64"
+
+[[package]]
+name = "simd-adler32"
+version = "0.3.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d66dc143e6b11c1eddc06d5c423cfc97062865baf299914ab64caa38182078fe"
+
+[[package]]
+name = "smallvec"
+version = "1.15.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03"
+
+[[package]]
+name = "stable_deref_trait"
+version = "1.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a8f112729512f8e442d81f95a8a7ddf2b7c6b8a1a6f509a95864142b30cab2d3"
+
+[[package]]
+name = "strsim"
+version = "0.11.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7da8b5736845d9f2fcb837ea5d9e2628564b3b043a70948a3f0b778838c5fb4f"
+
+[[package]]
+name = "syn"
+version = "2.0.106"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ede7c438028d4436d71104916910f5bb611972c5cfd7f89b8300a8186e6fada6"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "unicode-ident",
+]
+
+[[package]]
+name = "tempfile"
+version = "3.21.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "15b61f8f20e3a6f7e0649d825294eaf317edce30f82cf6026e7e4cb9222a7d1e"
+dependencies = [
+ "fastrand",
+ "getrandom",
+ "once_cell",
+ "rustix",
+ "windows-sys",
+]
+
+[[package]]
+name = "thiserror"
+version = "1.0.69"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b6aaf5339b578ea85b50e080feb250a3e8ae8cfcdff9a461c9ec2904bc923f52"
+dependencies = [
+ "thiserror-impl",
+]
+
+[[package]]
+name = "thiserror-impl"
+version = "1.0.69"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "4fee6c4efc90059e10f81e6d42c60a18f76588c3d74cb83a0b242a2b6c7504c1"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "thread_local"
+version = "1.1.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f60246a4944f24f6e018aa17cdeffb7818b76356965d03b07d6a9886e8962185"
+dependencies = [
+ "cfg-if",
+]
+
+[[package]]
+name = "tracing"
+version = "0.1.41"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "784e0ac535deb450455cbfa28a6f0df145ea1bb7ae51b821cf5e7927fdcfbdd0"
+dependencies = [
+ "pin-project-lite",
+ "tracing-attributes",
+ "tracing-core",
+]
+
+[[package]]
+name = "tracing-attributes"
+version = "0.1.30"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "81383ab64e72a7a8b8e13130c49e3dab29def6d0c7d76a03087b3cf71c5c6903"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "tracing-core"
+version = "0.1.34"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b9d12581f227e93f094d3af2ae690a574abb8a2b9b7a96e7cfe9647b2b617678"
+dependencies = [
+ "once_cell",
+ "valuable",
+]
+
+[[package]]
+name = "tracing-log"
+version = "0.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ee855f1f400bd0e5c02d150ae5de3840039a3f54b025156404e34c23c03f47c3"
+dependencies = [
+ "log",
+ "once_cell",
+ "tracing-core",
+]
+
+[[package]]
+name = "tracing-subscriber"
+version = "0.3.19"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e8189decb5ac0fa7bc8b96b7cb9b2701d60d48805aca84a238004d665fcc4008"
+dependencies = [
+ "matchers",
+ "nu-ansi-term",
+ "once_cell",
+ "regex",
+ "sharded-slab",
+ "smallvec",
+ "thread_local",
+ "tracing",
+ "tracing-core",
+ "tracing-log",
+]
+
+[[package]]
+name = "unicode-ident"
+version = "1.0.18"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5a5f39404a5da50712a4c1eecf25e90dd62b613502b7e925fd4e4d19b5c96512"
+
+[[package]]
+name = "utf8parse"
+version = "0.2.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "06abde3611657adf66d383f00b093d7faecc7fa57071cce2578660c9f1010821"
+
+[[package]]
+name = "valuable"
+version = "0.1.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ba73ea9cf16a25df0c8caa16c51acb937d5712a8429db78a3ee29d5dcacd3a65"
+
+[[package]]
+name = "vsprintf"
+version = "2.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "aec2f81b75ca063294776b4f7e8da71d1d5ae81c2b1b149c8d89969230265d63"
+dependencies = [
+ "cc",
+ "libc",
+]
+
+[[package]]
+name = "wasi"
+version = "0.14.2+wasi-0.2.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9683f9a5a998d873c0d21fcbe3c083009670149a8fab228644b8bd36b2c48cb3"
+dependencies = [
+ "wit-bindgen-rt",
+]
+
+[[package]]
+name = "winapi"
+version = "0.3.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419"
+dependencies = [
+ "winapi-i686-pc-windows-gnu",
+ "winapi-x86_64-pc-windows-gnu",
+]
+
+[[package]]
+name = "winapi-i686-pc-windows-gnu"
+version = "0.4.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6"
+
+[[package]]
+name = "winapi-x86_64-pc-windows-gnu"
+version = "0.4.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f"
+
+[[package]]
+name = "windows-link"
+version = "0.1.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5e6ad25900d524eaabdbbb96d20b4311e1e7ae1699af4fb28c17ae66c80d798a"
+
+[[package]]
+name = "windows-sys"
+version = "0.60.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f2f500e4d28234f72040990ec9d39e3a6b950f9f22d3dba18416c35882612bcb"
+dependencies = [
+ "windows-targets",
+]
+
+[[package]]
+name = "windows-targets"
+version = "0.53.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d5fe6031c4041849d7c496a8ded650796e7b6ecc19df1a431c1a363342e5dc91"
+dependencies = [
+ "windows-link",
+ "windows_aarch64_gnullvm",
+ "windows_aarch64_msvc",
+ "windows_i686_gnu",
+ "windows_i686_gnullvm",
+ "windows_i686_msvc",
+ "windows_x86_64_gnu",
+ "windows_x86_64_gnullvm",
+ "windows_x86_64_msvc",
+]
+
+[[package]]
+name = "windows_aarch64_gnullvm"
+version = "0.53.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "86b8d5f90ddd19cb4a147a5fa63ca848db3df085e25fee3cc10b39b6eebae764"
+
+[[package]]
+name = "windows_aarch64_msvc"
+version = "0.53.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c7651a1f62a11b8cbd5e0d42526e55f2c99886c77e007179efff86c2b137e66c"
+
+[[package]]
+name = "windows_i686_gnu"
+version = "0.53.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c1dc67659d35f387f5f6c479dc4e28f1d4bb90ddd1a5d3da2e5d97b42d6272c3"
+
+[[package]]
+name = "windows_i686_gnullvm"
+version = "0.53.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9ce6ccbdedbf6d6354471319e781c0dfef054c81fbc7cf83f338a4296c0cae11"
+
+[[package]]
+name = "windows_i686_msvc"
+version = "0.53.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "581fee95406bb13382d2f65cd4a908ca7b1e4c2f1917f143ba16efe98a589b5d"
+
+[[package]]
+name = "windows_x86_64_gnu"
+version = "0.53.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2e55b5ac9ea33f2fc1716d1742db15574fd6fc8dadc51caab1c16a3d3b4190ba"
+
+[[package]]
+name = "windows_x86_64_gnullvm"
+version = "0.53.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0a6e035dd0599267ce1ee132e51c27dd29437f63325753051e71dd9e42406c57"
+
+[[package]]
+name = "windows_x86_64_msvc"
+version = "0.53.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "271414315aff87387382ec3d271b52d7ae78726f5d44ac98b4f4030c91880486"
+
+[[package]]
+name = "wit-bindgen-rt"
+version = "0.39.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6f42320e61fe2cfd34354ecb597f86f413484a798ba44a8ca1165c58d42da6c1"
+dependencies = [
+ "bitflags",
+]
--- a/src/xpu/flamegraph/profiler/Cargo.toml
+++ b/src/xpu/flamegraph/profiler/Cargo.toml
@@ -0,0 +1,19 @@
+[package]
+name = "profile"
+version = "0.1.0"
+authors = ["Kuifeng Lee <kuifeng@fb.com>"]
+license = "GPL-2.0 OR BSD-3-Clause"
+edition = "2021"
+rust-version = "1.71"
+
+[dependencies]
+blazesym = { version = "0.2.0-rc.4",features = ["tracing"] }
+clap = { version = "4.5", features = ["derive"] }
+libbpf-rs = "0.24"
+libc = "*"
+nix = "0.29.0"
+tracing = "0.1"
+tracing-subscriber = {version = "0.3", features = ["ansi", "env-filter", "fmt"]}
+
+[build-dependencies]
+libbpf-cargo = "0.24"
--- a/src/xpu/flamegraph/profiler/build.rs
+++ b/src/xpu/flamegraph/profiler/build.rs
@@ -0,0 +1,30 @@
+use std::env;
+use std::ffi::OsStr;
+use std::path::Path;
+use std::path::PathBuf;
+
+use libbpf_cargo::SkeletonBuilder;
+
+const SRC: &str = "src/bpf/profile.bpf.c";
+
+fn main() {
+    let mut out =
+        PathBuf::from(env::var_os("OUT_DIR").expect("OUT_DIR must be set in build script"));
+    out.push("profile.skel.rs");
+
+    let arch = env::var("CARGO_CFG_TARGET_ARCH")
+        .expect("CARGO_CFG_TARGET_ARCH must be set in build script");
+
+    // Ensure we're building for a supported architecture
+    println!("cargo:warning=Building for architecture: {}", arch);
+
+    SkeletonBuilder::new()
+        .source(SRC)
+        .clang_args([
+            OsStr::new("-I"),
+            Path::new("../vmlinux").as_os_str()
+        ])
+        .build_and_generate(out)
+        .expect("bpf compilation failed");
+    println!("cargo:rerun-if-changed={}", SRC);
+}
--- a/src/xpu/flamegraph/profiler/src/bpf/profile.bpf.c
+++ b/src/xpu/flamegraph/profiler/src/bpf/profile.bpf.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+/* Copyright (c) 2022 Meta Platforms, Inc. */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+#include "profile.h"
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 256 * 1024);
+} events SEC(".maps");
+
+// Shared helper to collect stack trace
+static __always_inline int collect_stack_trace(void *ctx, u64 cookie)
+{
+	int pid = bpf_get_current_pid_tgid() >> 32;
+	int cpu_id = bpf_get_smp_processor_id();
+	struct stacktrace_event *event;
+
+	event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
+	if (!event)
+		return 1;
+
+	event->pid = pid;
+	event->cpu_id = cpu_id;
+	event->timestamp = bpf_ktime_get_ns();
+
+	if (bpf_get_current_comm(event->comm, sizeof(event->comm)))
+		event->comm[0] = 0;
+
+	// Store probe_id in cpu_id field when in probe mode
+	// In perf mode: cpu_id is actual CPU
+	// In probe mode: cpu_id is probe_id, actual CPU stored in pid high bits if needed
+	if (cookie != 0) {
+		event->cpu_id = (u32)cookie;  // probe_id from bpf_get_attach_cookie
+	}
+
+	event->kstack_sz = bpf_get_stack(ctx, event->kstack, sizeof(event->kstack), 0);
+
+	event->ustack_sz =
+		bpf_get_stack(ctx, event->ustack, sizeof(event->ustack), BPF_F_USER_STACK);
+
+	bpf_ringbuf_submit(event, 0);
+
+	return 0;
+}
+
+SEC("perf_event")
+int profile(void *ctx)
+{
+	return collect_stack_trace(ctx, 0);
+}
+
+// Generic kprobe handler
+SEC("kprobe")
+int kprobe_handler(struct pt_regs *ctx)
+{
+	u64 probe_id = bpf_get_attach_cookie(ctx);
+	return collect_stack_trace(ctx, probe_id);
+}
+
+// Generic kretprobe handler
+SEC("kretprobe")
+int kretprobe_handler(struct pt_regs *ctx)
+{
+	u64 probe_id = bpf_get_attach_cookie(ctx);
+	return collect_stack_trace(ctx, probe_id);
+}
+
+// Generic uprobe handler
+SEC("uprobe")
+int uprobe_handler(struct pt_regs *ctx)
+{
+	u64 probe_id = bpf_get_attach_cookie(ctx);
+	return collect_stack_trace(ctx, probe_id);
+}
+
+// Generic uretprobe handler
+SEC("uretprobe")
+int uretprobe_handler(struct pt_regs *ctx)
+{
+	u64 probe_id = bpf_get_attach_cookie(ctx);
+	return collect_stack_trace(ctx, probe_id);
+}
--- a/src/xpu/flamegraph/profiler/src/bpf/profile.h
+++ b/src/xpu/flamegraph/profiler/src/bpf/profile.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+/* Copyright (c) 2022 Meta Platforms, Inc. */
+#ifndef __PROFILE_H_
+#define __PROFILE_H_
+
+#ifndef TASK_COMM_LEN
+#define TASK_COMM_LEN 16
+#endif
+
+#ifndef MAX_STACK_DEPTH
+#define MAX_STACK_DEPTH 128
+#endif
+
+typedef __u64 stack_trace_t[MAX_STACK_DEPTH];
+
+struct stacktrace_event {
+	__u32 pid;
+	__u32 cpu_id;
+	__u64 timestamp;
+	char comm[TASK_COMM_LEN];
+	__s32 kstack_sz;
+	__s32 ustack_sz;
+	stack_trace_t kstack;
+	stack_trace_t ustack;
+};
+
+#endif /* __PROFILE_H_ */
--- a/src/xpu/flamegraph/profiler/src/event.rs
+++ b/src/xpu/flamegraph/profiler/src/event.rs
@@ -0,0 +1,309 @@
+use std::mem;
+use std::time::{SystemTime, UNIX_EPOCH};
+use blazesym::symbolize;
+use nix::sys::sysinfo;
+
+pub const MAX_STACK_DEPTH: usize = 128;
+pub const TASK_COMM_LEN: usize = 16;
+const ADDR_WIDTH: usize = 16;
+
+// A Rust version of stacktrace_event in profile.h
+#[repr(C)]
+pub struct StacktraceEvent {
+    pub pid: u32,
+    pub cpu_id: u32,
+    pub timestamp: u64,
+    pub comm: [u8; TASK_COMM_LEN],
+    pub kstack_size: i32,
+    pub ustack_size: i32,
+    pub kstack: [u64; MAX_STACK_DEPTH],
+    pub ustack: [u64; MAX_STACK_DEPTH],
+}
+
+pub enum OutputFormat {
+    Standard,
+    FoldedExtended,
+}
+
+pub struct EventHandler {
+    symbolizer: symbolize::Symbolizer,
+    format: OutputFormat,
+    boot_time_ns: u64,
+}
+
+impl EventHandler {
+    pub fn new(format: OutputFormat) -> Self {
+        // Get system uptime to calculate boot time
+        let boot_time_ns = Self::get_boot_time_ns();
+        
+        Self {
+            symbolizer: symbolize::Symbolizer::new(),
+            format,
+            boot_time_ns,
+        }
+    }
+
+    fn get_boot_time_ns() -> u64 {
+        // Get current Unix timestamp in nanoseconds
+        let now = SystemTime::now()
+            .duration_since(UNIX_EPOCH)
+            .expect("System time before Unix epoch");
+        let now_ns = now.as_nanos() as u64;
+        
+        // Get system uptime in nanoseconds
+        let info = sysinfo::sysinfo().expect("Failed to get sysinfo");
+        let uptime_ns = (info.uptime().as_secs_f64() * 1_000_000_000.0) as u64;
+        
+        // Boot time = current time - uptime
+        now_ns - uptime_ns
+    }
+
+    pub fn handle(&self, data: &[u8]) -> ::std::os::raw::c_int {
+        if data.len() != mem::size_of::<StacktraceEvent>() {
+            eprintln!(
+                "Invalid size {} != {}",
+                data.len(),
+                mem::size_of::<StacktraceEvent>()
+            );
+            return 1;
+        }
+
+        let event = unsafe { &*(data.as_ptr() as *const StacktraceEvent) };
+
+        if event.kstack_size <= 0 && event.ustack_size <= 0 {
+            return 1;
+        }
+
+        match self.format {
+            OutputFormat::Standard => self.handle_standard(event),
+            OutputFormat::FoldedExtended => self.handle_folded_extended(event),
+        }
+
+        0
+    }
+
+    // Helper to extract stack slice
+    fn get_stack_slice<'a>(stack: &'a [u64; MAX_STACK_DEPTH], size: i32) -> &'a [u64] {
+        if size > 0 {
+            &stack[0..(size as usize / mem::size_of::<u64>())]
+        } else {
+            &[]
+        }
+    }
+
+    // Helper to get command name
+    fn get_comm_str(comm: &[u8; TASK_COMM_LEN]) -> &str {
+        std::str::from_utf8(comm)
+            .unwrap_or("<unknown>")
+            .trim_end_matches('\0')
+    }
+
+    fn handle_standard(&self, event: &StacktraceEvent) {
+        let comm = Self::get_comm_str(&event.comm);
+        // Convert kernel timestamp to Unix timestamp
+        let unix_timestamp_ns = event.timestamp + self.boot_time_ns;
+        let timestamp_sec = unix_timestamp_ns / 1_000_000_000;
+        let timestamp_nsec = unix_timestamp_ns % 1_000_000_000;
+        println!("[{}.{:09}] COMM: {} (pid={}) @ CPU {}", 
+                 timestamp_sec, timestamp_nsec, comm, event.pid, event.cpu_id);
+
+        if event.kstack_size > 0 {
+            println!("Kernel:");
+            let kstack = Self::get_stack_slice(&event.kstack, event.kstack_size);
+            show_stack_trace(kstack, &self.symbolizer, 0);
+        } else {
+            println!("No Kernel Stack");
+        }
+
+        if event.ustack_size > 0 {
+            println!("Userspace:");
+            let ustack = Self::get_stack_slice(&event.ustack, event.ustack_size);
+            show_stack_trace(ustack, &self.symbolizer, event.pid);
+        } else {
+            println!("No Userspace Stack");
+        }
+
+        println!();
+    }
+
+    fn handle_folded_extended(&self, event: &StacktraceEvent) {
+        let comm = Self::get_comm_str(&event.comm);
+        let tid = event.pid; // For single-threaded processes, TID = PID
+        
+        let mut stack_frames = Vec::new();
+
+        // Process user stack (if present)
+        if event.ustack_size > 0 {
+            let ustack = Self::get_stack_slice(&event.ustack, event.ustack_size);
+            let user_frames = symbolize_stack_to_vec(&self.symbolizer, ustack, event.pid);
+            
+            // Add user frames in reverse order (top to bottom)
+            for frame in user_frames.iter().rev() {
+                stack_frames.push(frame.clone());
+            }
+        }
+
+        // Process kernel stack (if present)
+        if event.kstack_size > 0 {
+            let kstack = Self::get_stack_slice(&event.kstack, event.kstack_size);
+            let kernel_frames = symbolize_stack_to_vec(&self.symbolizer, kstack, 0);
+            
+            // Add kernel frames with [k] suffix in reverse order (top to bottom)
+            for frame in kernel_frames.iter().rev() {
+                stack_frames.push(format!("{}_[k]", frame));
+            }
+        }
+
+        // Format: timestamp_ns comm pid tid cpu stack1;stack2;stack3
+        // Convert kernel timestamp to Unix timestamp
+        let unix_timestamp_ns = event.timestamp + self.boot_time_ns;
+        println!(
+            "{} {} {} {} {} {}",
+            unix_timestamp_ns,
+            comm,
+            event.pid,
+            tid,
+            event.cpu_id,
+            stack_frames.join(";")
+        );
+    }
+}
+
+
+fn print_frame(
+    name: &str,
+    addr_info: Option<(blazesym::Addr, blazesym::Addr, usize)>,
+    code_info: &Option<symbolize::CodeInfo>,
+) {
+    let code_info = code_info.as_ref().map(|code_info| {
+        let path = code_info.to_path();
+        let path = path.display();
+
+        match (code_info.line, code_info.column) {
+            (Some(line), Some(col)) => format!(" {path}:{line}:{col}"),
+            (Some(line), None) => format!(" {path}:{line}"),
+            (None, _) => format!(" {path}"),
+        }
+    });
+
+    if let Some((input_addr, addr, offset)) = addr_info {
+        // If we have various address information bits we have a new symbol.
+        println!(
+            "{input_addr:#0width$x}: {name} @ {addr:#x}+{offset:#x}{code_info}",
+            code_info = code_info.as_deref().unwrap_or(""),
+            width = ADDR_WIDTH
+        )
+    } else {
+        // Otherwise we are dealing with an inlined call.
+        println!(
+            "{:width$}  {name}{code_info} [inlined]",
+            " ",
+            code_info = code_info
+                .map(|info| format!(" @{info}"))
+                .as_deref()
+                .unwrap_or(""),
+            width = ADDR_WIDTH
+        )
+    }
+}
+
+// Helper function to convert stack addresses for blazesym
+fn convert_stack_addresses(stack: &[u64]) -> Vec<blazesym::Addr> {
+    if mem::size_of::<blazesym::Addr>() != mem::size_of::<u64>() {
+        stack
+            .iter()
+            .copied()
+            .map(|addr| addr as blazesym::Addr)
+            .collect::<Vec<_>>()
+    } else {
+        // For same-sized types, still need to return owned data for consistency
+        stack.iter().copied().map(|addr| addr as blazesym::Addr).collect()
+    }
+}
+
+// Get the stack addresses as a slice (avoiding lifetime issues)
+fn get_stack_slice<'a>(stack: &'a [u64], converted: &'a [blazesym::Addr]) -> &'a [blazesym::Addr] {
+    if mem::size_of::<blazesym::Addr>() != mem::size_of::<u64>() {
+        converted
+    } else {
+        // SAFETY: `Addr` has the same size as `u64`, so it can be trivially and
+        //         safely converted.
+        unsafe { mem::transmute::<_, &[blazesym::Addr]>(stack) }
+    }
+}
+
+// Get source for symbolization based on PID (0 means kernel)
+fn get_symbolize_source(pid: u32) -> symbolize::source::Source<'static> {
+    if pid == 0 {
+        symbolize::source::Source::from(symbolize::source::Kernel::default())
+    } else {
+        symbolize::source::Source::from(symbolize::source::Process::new(pid.into()))
+    }
+}
+
+// Symbolize stack and return as vector of strings for folded format
+fn symbolize_stack_to_vec(symbolizer: &symbolize::Symbolizer, stack: &[u64], pid: u32) -> Vec<String> {
+    let converted = convert_stack_addresses(stack);
+    let stack_addrs = get_stack_slice(stack, &converted);
+    let src = get_symbolize_source(pid);
+    
+    let syms = match symbolizer.symbolize(&src, symbolize::Input::AbsAddr(stack_addrs)) {
+        Ok(syms) => syms,
+        Err(_) => {
+            // Return addresses if symbolization fails
+            return stack_addrs.iter().map(|addr| format!("{:#x}", addr)).collect();
+        }
+    };
+
+    let mut result = Vec::new();
+    for (addr, sym) in stack_addrs.iter().copied().zip(syms) {
+        match sym {
+            symbolize::Symbolized::Sym(symbolize::Sym {
+                name,
+                ..
+            }) => {
+                result.push(name.to_string());
+            }
+            symbolize::Symbolized::Unknown(..) => {
+                result.push(format!("{:#x}", addr));
+            }
+        }
+    }
+    result
+}
+
+// Pid 0 means a kernel space stack.
+fn show_stack_trace(stack: &[u64], symbolizer: &symbolize::Symbolizer, pid: u32) {
+    let converted = convert_stack_addresses(stack);
+    let stack_addrs = get_stack_slice(stack, &converted);
+    let src = get_symbolize_source(pid);
+
+    let syms = match symbolizer.symbolize(&src, symbolize::Input::AbsAddr(stack_addrs)) {
+        Ok(syms) => syms,
+        Err(err) => {
+            eprintln!("  failed to symbolize addresses: {err:#}");
+            return;
+        }
+    };
+
+    for (input_addr, sym) in stack_addrs.iter().copied().zip(syms) {
+        match sym {
+            symbolize::Symbolized::Sym(symbolize::Sym {
+                name,
+                addr,
+                offset,
+                code_info,
+                inlined,
+                ..
+            }) => {
+                print_frame(&name, Some((input_addr, addr, offset)), &code_info);
+                for frame in inlined.iter() {
+                    print_frame(&frame.name, None, &frame.code_info);
+                }
+            }
+            symbolize::Symbolized::Unknown(..) => {
+                println!("{input_addr:#0width$x}: <no-symbol>", width = ADDR_WIDTH)
+            }
+        }
+    }
+}
--- a/src/xpu/flamegraph/profiler/src/main.rs
+++ b/src/xpu/flamegraph/profiler/src/main.rs
@@ -0,0 +1,176 @@
+use std::mem::MaybeUninit;
+use std::time::Duration;
+
+use clap::ArgAction;
+use clap::Parser;
+
+use libbpf_rs::skel::OpenSkel as _;
+use libbpf_rs::skel::SkelBuilder as _;
+use libbpf_rs::UprobeOpts;
+
+use tracing::subscriber::set_global_default as set_global_subscriber;
+use tracing_subscriber::filter::LevelFilter;
+use tracing_subscriber::fmt::format::FmtSpan;
+use tracing_subscriber::fmt::time::SystemTime;
+use tracing_subscriber::FmtSubscriber;
+
+mod profile {
+    include!(concat!(env!("OUT_DIR"), "/profile.skel.rs"));
+}
+mod syscall;
+mod event;
+mod perf;
+
+use profile::*;
+
+#[derive(Parser, Debug)]
+struct Args {
+    /// Sampling frequency (only used in perf mode)
+    #[arg(short, default_value_t = 50)]
+    freq: u64,
+    /// Increase verbosity (can be supplied multiple times).
+    #[arg(short = 'v', long = "verbose", global = true, action = ArgAction::Count)]
+    verbosity: u8,
+    /// Use software event for triggering stack trace capture.
+    ///
+    /// This can be useful for compatibility reasons if hardware event is not available
+    /// (which could happen in a virtual machine, for example).
+    #[arg(long = "sw-event")]
+    sw_event: bool,
+    /// Filter by PID (optional)
+    #[arg(short = 'p', long = "pid")]
+    pid: Option<i32>,
+    /// Output in extended folded format (timestamp_ns comm pid tid cpu stack1;stack2;...)
+    #[arg(short = 'E', long = "fold-extend")]
+    fold_extend: bool,
+    /// Attach to kprobe (format: "symbol" e.g. "tcp_v4_connect")
+    /// Can be specified multiple times
+    #[arg(long = "kprobe")]
+    kprobes: Vec<String>,
+    /// Attach to kretprobe (format: "symbol")
+    #[arg(long = "kretprobe")]
+    kretprobes: Vec<String>,
+    /// Attach to uprobe (format: "binary:symbol" e.g. "/lib/libc.so.6:malloc")
+    #[arg(long = "uprobe")]
+    uprobes: Vec<String>,
+    /// Attach to uretprobe (format: "binary:symbol")
+    #[arg(long = "uretprobe")]
+    uretprobes: Vec<String>,
+}
+
+fn main() -> Result<(), libbpf_rs::Error> {
+    let args = Args::parse();
+    let level = match args.verbosity {
+        0 => LevelFilter::WARN,
+        1 => LevelFilter::INFO,
+        2 => LevelFilter::DEBUG,
+        _ => LevelFilter::TRACE,
+    };
+
+    let subscriber = FmtSubscriber::builder()
+        .with_max_level(level)
+        .with_span_events(FmtSpan::FULL)
+        .with_timer(SystemTime)
+        .finish();
+    let () = set_global_subscriber(subscriber).expect("failed to set tracing subscriber");
+
+    let skel_builder = ProfileSkelBuilder::default();
+    let mut open_object = MaybeUninit::uninit();
+    let open_skel = skel_builder.open(&mut open_object).unwrap();
+    let skel = open_skel.load().unwrap();
+
+    let _perf_links;
+    let mut pefds = Vec::new();
+    let mut _probe_links = Vec::new();
+    let mut probe_id: u32 = 1;
+
+    let has_probes = !args.kprobes.is_empty() || !args.kretprobes.is_empty()
+        || !args.uprobes.is_empty() || !args.uretprobes.is_empty();
+
+    if has_probes {
+        // Attach kprobes
+        for symbol in &args.kprobes {
+            let link = skel.progs.kprobe_handler.attach_kprobe(false, symbol)?;
+            eprintln!("Attached kprobe (id={}): {}", probe_id, symbol);
+            _probe_links.push(link);
+            probe_id += 1;
+        }
+
+        // Attach kretprobes
+        for symbol in &args.kretprobes {
+            let link = skel.progs.kretprobe_handler.attach_kprobe(true, symbol)?;
+            eprintln!("Attached kretprobe (id={}): {}", probe_id, symbol);
+            _probe_links.push(link);
+            probe_id += 1;
+        }
+
+        // Attach uprobes
+        for spec in &args.uprobes {
+            let parts: Vec<&str> = spec.split(':').collect();
+            if parts.len() != 2 {
+                eprintln!("Error: uprobe format should be 'binary:symbol'");
+                std::process::exit(1);
+            }
+            let opts = UprobeOpts {
+                func_name: parts[1].to_string(),
+                cookie: probe_id as u64,
+                retprobe: false,
+                ..Default::default()
+            };
+            let link = skel.progs.uprobe_handler.attach_uprobe_with_opts(-1, parts[0], 0, opts)?;
+            eprintln!("Attached uprobe (id={}): {} in {}", probe_id, parts[1], parts[0]);
+            _probe_links.push(link);
+            probe_id += 1;
+        }
+
+        // Attach uretprobes
+        for spec in &args.uretprobes {
+            let parts: Vec<&str> = spec.split(':').collect();
+            if parts.len() != 2 {
+                eprintln!("Error: uretprobe format should be 'binary:symbol'");
+                std::process::exit(1);
+            }
+            let opts = UprobeOpts {
+                func_name: parts[1].to_string(),
+                cookie: probe_id as u64,
+                retprobe: true,
+                ..Default::default()
+            };
+            let link = skel.progs.uretprobe_handler.attach_uprobe_with_opts(-1, parts[0], 0, opts)?;
+            eprintln!("Attached uretprobe (id={}): {} in {}", probe_id, parts[1], parts[0]);
+            _probe_links.push(link);
+            probe_id += 1;
+        }
+    } else {
+        // Perf mode
+        let freq = if args.freq < 1 { 1 } else { args.freq };
+        pefds = perf::init_perf_monitor(freq, args.sw_event, args.pid)?;
+        _perf_links = perf::attach_perf_event(&pefds, &skel.progs.profile);
+        eprintln!("Perf mode: sampling at {} Hz", freq);
+    }
+
+    let output_format = if args.fold_extend {
+        event::OutputFormat::FoldedExtended
+    } else {
+        event::OutputFormat::Standard
+    };
+
+    let event_handler = event::EventHandler::new(output_format);
+
+    let mut builder = libbpf_rs::RingBufferBuilder::new();
+    builder
+        .add(&skel.maps.events, move |data| {
+            event_handler.handle(data)
+        })
+        .unwrap();
+
+    let ringbuf = builder.build().unwrap();
+    while ringbuf.poll(Duration::MAX).is_ok() {}
+
+    // Clean up perf events if in perf mode
+    if !pefds.is_empty() {
+        perf::close_perf_events(pefds)?;
+    }
+
+    Ok(())
+}
--- a/src/xpu/flamegraph/profiler/src/perf.rs
+++ b/src/xpu/flamegraph/profiler/src/perf.rs
@@ -0,0 +1,63 @@
+use std::io;
+use std::mem;
+use nix::unistd::close;
+use libbpf_rs::ErrorExt as _;
+
+use crate::syscall;
+
+pub fn init_perf_monitor(freq: u64, sw_event: bool, pid_filter: Option<i32>) -> Result<Vec<i32>, libbpf_rs::Error> {
+    let nprocs = libbpf_rs::num_possible_cpus().unwrap();
+    let pid = pid_filter.unwrap_or(-1);
+    let attr = syscall::perf_event_attr {
+        _type: if sw_event {
+            syscall::PERF_TYPE_SOFTWARE
+        } else {
+            syscall::PERF_TYPE_HARDWARE
+        },
+        size: mem::size_of::<syscall::perf_event_attr>() as u32,
+        config: if sw_event {
+            syscall::PERF_COUNT_SW_CPU_CLOCK
+        } else {
+            syscall::PERF_COUNT_HW_CPU_CYCLES
+        },
+        sample: syscall::sample_un { sample_freq: freq },
+        flags: 1 << 10, // freq = 1
+        ..Default::default()
+    };
+    (0..nprocs)
+        .map(|cpu| {
+            let fd = syscall::perf_event_open(&attr, pid, cpu as i32, -1, 0) as i32;
+            if fd == -1 {
+                let mut error_context = "Failed to open perf event.";
+                let os_error = io::Error::last_os_error();
+                if !sw_event && os_error.kind() == io::ErrorKind::NotFound {
+                    error_context = "Failed to open perf event.\n\
+                                    Try running the profile example with the `--sw-event` option.";
+                }
+                Err(libbpf_rs::Error::from(os_error)).context(error_context)
+            } else {
+                Ok(fd)
+            }
+        })
+        .collect()
+}
+
+pub fn attach_perf_event(
+    pefds: &[i32],
+    prog: &libbpf_rs::ProgramMut,
+) -> Vec<Result<libbpf_rs::Link, libbpf_rs::Error>> {
+    pefds
+        .iter()
+        .map(|pefd| prog.attach_perf_event(*pefd))
+        .collect()
+}
+
+pub fn close_perf_events(pefds: Vec<i32>) -> Result<(), libbpf_rs::Error> {
+    for pefd in pefds {
+        close(pefd)
+            .map_err(io::Error::from)
+            .map_err(libbpf_rs::Error::from)
+            .context("failed to close perf event")?;
+    }
+    Ok(())
+}
--- a/src/xpu/flamegraph/profiler/src/syscall.rs
+++ b/src/xpu/flamegraph/profiler/src/syscall.rs
@@ -0,0 +1,90 @@
+use std::mem;
+
+extern crate libc;
+
+#[repr(C)]
+pub union sample_un {
+    pub sample_period: u64,
+    pub sample_freq: u64,
+}
+
+#[repr(C)]
+pub union wakeup_un {
+    pub wakeup_events: u32,
+    pub wakeup_atermark: u32,
+}
+
+#[repr(C)]
+pub union bp_1_un {
+    pub bp_addr: u64,
+    pub kprobe_func: u64,
+    pub uprobe_path: u64,
+    pub config1: u64,
+}
+
+#[repr(C)]
+pub union bp_2_un {
+    pub bp_len: u64,
+    pub kprobe_addr: u64,
+    pub probe_offset: u64,
+    pub config2: u64,
+}
+
+#[repr(C)]
+pub struct perf_event_attr {
+    pub _type: u32,
+    pub size: u32,
+    pub config: u64,
+    pub sample: sample_un,
+    pub sample_type: u64,
+    pub read_format: u64,
+    pub flags: u64,
+    pub wakeup: wakeup_un,
+    pub bp_type: u32,
+    pub bp_1: bp_1_un,
+    pub bp_2: bp_2_un,
+    pub branch_sample_type: u64,
+    pub sample_regs_user: u64,
+    pub sample_stack_user: u32,
+    pub clockid: i32,
+    pub sample_regs_intr: u64,
+    pub aux_watermark: u32,
+    pub sample_max_stack: u16,
+    pub __reserved_2: u16,
+    pub aux_sample_size: u32,
+    pub __reserved_3: u32,
+}
+
+impl Default for perf_event_attr {
+    fn default() -> Self {
+        unsafe { mem::zeroed() }
+    }
+}
+
+pub const PERF_TYPE_HARDWARE: u32 = 0;
+pub const PERF_TYPE_SOFTWARE: u32 = 1;
+pub const PERF_COUNT_HW_CPU_CYCLES: u64 = 0;
+pub const PERF_COUNT_SW_CPU_CLOCK: u64 = 0;
+
+extern "C" {
+    fn syscall(number: libc::c_long, ...) -> libc::c_long;
+}
+
+pub fn perf_event_open(
+    hw_event: &perf_event_attr,
+    pid: libc::pid_t,
+    cpu: libc::c_int,
+    group_fd: libc::c_int,
+    flags: libc::c_ulong,
+) -> libc::c_long {
+    unsafe {
+        syscall(
+            libc::SYS_perf_event_open,
+            hw_event as *const perf_event_attr,
+            pid,
+            cpu,
+            group_fd,
+            flags,
+        )
+    }
+}
--- a/src/xpu/flamegraph/qwen3.cu/.gitattributes
+++ b/src/xpu/flamegraph/qwen3.cu/.gitattributes
@@ -0,0 +1 @@
+convert_hf_to_gguf_ordered.py linguist-vendored
--- a/src/xpu/flamegraph/qwen3.cu/.gitignore
+++ b/src/xpu/flamegraph/qwen3.cu/.gitignore
@@ -0,0 +1,10 @@
+*.i
+*.ii
+*.gpu
+*.ptx
+*.cubin
+*.fatbin
+runcu
+runcublas
+*.gguf
+Qwen3-0.6B-GGUF-FP32/
--- a/src/xpu/flamegraph/qwen3.cu/LICENSE
+++ b/src/xpu/flamegraph/qwen3.cu/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 William Song
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/src/xpu/flamegraph/qwen3.cu/Makefile
+++ b/src/xpu/flamegraph/qwen3.cu/Makefile
@@ -0,0 +1,68 @@
+# choose your compiler, e.g. gcc/clang
+# example override to clang: make run CC=clang
+CC = gcc
+# For CUDA compile
+CUDA_INSTALL_PATH ?= /usr/local/cuda-12.9
+NVCC := "$(CUDA_INSTALL_PATH)/bin/nvcc"
+INCLUDES := -I"$(CUDA_INSTALL_PATH)/include"
+LIB_PATH ?= $(CUDA_INSTALL_PATH)/lib64
+
+# compile the Cuda version (with dynamic libcudart for eBPF uprobe profiling)
+.PHONY: runcu
+runcu: runcu.cu
+	$(NVCC) $(INCLUDES) -O3 -Wno-deprecated-gpu-targets --no-device-link -o runcu runcu.cu -L $(LIB_PATH) -lcudart -lm
+# compile cublas included
+.PHONY: runcublas
+runcublas: runcu.cu
+	$(NVCC) $(INCLUDES) -O3 -Wno-deprecated-gpu-targets --no-device-link -DUSE_CUBLAS -o runcublas runcu.cu -L $(LIB_PATH) -lcudart -lm -lcublas
+
+# download the model
+.PHONY: download-model
+download-model:
+	@if [ -f Qwen3-0.6B-FP32.gguf ] && [ $$(stat -c%s Qwen3-0.6B-FP32.gguf) -gt 1000000 ]; then \
+		echo "Model already exists (size: $$(du -h Qwen3-0.6B-FP32.gguf | cut -f1))"; \
+	else \
+		echo "Downloading Qwen3-0.6B model (3GB - this will take a while)..."; \
+		wget -c https://huggingface.co/huggit0000/Qwen3-0.6B-GGUF-FP32/resolve/main/Qwen3-0.6B-FP32.gguf -O Qwen3-0.6B-FP32.gguf || \
+		curl -L -C - https://huggingface.co/huggit0000/Qwen3-0.6B-GGUF-FP32/resolve/main/Qwen3-0.6B-FP32.gguf -o Qwen3-0.6B-FP32.gguf; \
+		echo "Model downloaded successfully (size: $$(du -h Qwen3-0.6B-FP32.gguf | cut -f1))"; \
+	fi
+
+# =========================
+# The below is not used hree.
+
+
+# the most basic way of building that is most likely to work on most systems
+.PHONY: run
+run: run.c
+	$(CC) -O3 -o run run.c -lm
+
+# useful for a debug build, can then e.g. analyze with valgrind, example:
+# $ valgrind --leak-check=full ./run out/model.bin -n 3
+rundebug: run.c
+	$(CC) -g -o run run.c -lm
+
+# https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
+# https://simonbyrne.github.io/notes/fastmath/
+# -Ofast enables all -O3 optimizations.
+# Disregards strict standards compliance.
+# It also enables optimizations that are not valid for all standard-compliant programs.
+# It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific
+# -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens.
+# It turns off -fsemantic-interposition.
+# In our specific application this is *probably* okay to use
+#.PHONY: run
+#runfast: run.c
+#	$(CC) -O3 -o run -fopenmp -march=native run.c -lm
+
+# additionally compiles with OpenMP, allowing multithreaded runs
+# make sure to also enable multiple threads when running, e.g.:
+# OMP_NUM_THREADS=4 ./run out/model.bin
+.PHONY: runomp
+runomp: run.c
+	$(CC) -O3 -fopenmp -march=native run.c  -lm  -o run
+
+
+.PHONY: clean
+clean:
+	rm -f run runcu runcublas
--- a/src/xpu/flamegraph/qwen3.cu/README
+++ b/src/xpu/flamegraph/qwen3.cu/README
@@ -0,0 +1,166 @@
+# qwen3.cu
+
+`qwen3.cu` is a **single-file, pure CUDA C implementation** for running inference on the Qwen3 model with no external libraries, no dependencies. It’s a follow-up to my earlier weekend project, [qwen3.c](https://github.com/...), inspired by Andrej Karpathy’s [`llama2.c`](https://github.com/karpathy/llama2.c). Everything’s packed into one file from tokenization all the way to CUDA kernels, staying true to the spirit of minimalism.
+
+This implementation runs the Qwen3 0.6B model, a small but capable model. I'm using **full-precision GGUF** here, chosen for its clarity and to help others learn its ways. Also, It’s fully self-contained, so there’s no need for any format conversion out of the box. Most GGUF models are quantized to 8-bit or lower, but for this project, you’ll want to use the FP32 version which you can download as below. Or, if you make it work from the BF16 weights, you can convert them using the included `convert_hf_to_gguf_ordered.py` script; I've made sure the layers are ordered numerically so everything aligns correctly.
+
+Even though GGUF files already include a binary tokenizer, this project reads vocab and merges from plain `.txt` files. It keeps things more transparent and easier to follow. Tokenization and detokenization overhead is negligible compared to the forward pass, so it doesn’t really impact TTS.
+
+It also supports multi-turn conversation out of the box, and native support for Qwen3’s reasoning mode. For reference, there’s also a cuBLAS version included. It’s roughly 2x faster for now, but I’ll probably try to narrow that gap in the future. I’ll add more explanation on the code later.
+
+### UPDATE
+[Oct-27-25] Added single prompt mode (-q flag) for non-interactive usage
+[Oct-27-25] Updated Makefile with --no-device-link flag and download-model target
+[Aug-08-25] Remove the nonsense loop. TPS increased from ~35 to ~39. Set base for benchmarking.
+[What's next] Improve kernels
+
+## Quick Start
+
+```sh
+# Clone this repo
+git clone https://github.com/gigit0000/qwen3.cu.git
+cd qwen3.cu
+
+# Download FP32 model (3GB) - uses wget/curl, no Git LFS required
+make download-model
+
+# Compile and run (interactive mode)
+make runcu
+./runcu Qwen3-0.6B-FP32.gguf
+
+# Or use single prompt mode (runs once and exits)
+./runcu Qwen3-0.6B-FP32.gguf -q "What is CUDA?"
+```
+
+## Faster Inference
+Use cuBLAS (roughly 2x faster):
+```sh
+# Compile and run
+make runcublas
+./runcublas Qwen3-0.6B-FP32.gguf
+
+# Single prompt with cuBLAS
+./runcublas Qwen3-0.6B-FP32.gguf -q "Explain quantum computing" -r 1
+```
+
+## Makefile Improvements
+
+The Makefile now includes:
+- **`make download-model`**: Automatically downloads the 3GB FP32 model using wget/curl (no Git LFS required)
+- **Fixed compilation flags**: Added `-Wno-deprecated-gpu-targets --no-device-link` to fix build issues on newer CUDA versions
+- **Clean target**: `make clean` removes built binaries
+
+## Description
+
+You can enable reasoning (-k 1) or multi-turn (-m 1):
+```
+./runcu Qwen3-0.6B-FP32.gguf -k 1 -m 1
+```
+
+**New: Single Prompt Mode (-q)**
+Run a single query and exit (useful for scripting and automation):
+```sh
+./runcu Qwen3-0.6B-FP32.gguf -q "What is machine learning?"
+./runcu Qwen3-0.6B-FP32.gguf -q "Explain eBPF in one sentence"
+
+# Combine with other flags
+./runcu Qwen3-0.6B-FP32.gguf -q "Why is the sky blue?" -r 1  # with TPS
+./runcu Qwen3-0.6B-FP32.gguf -q "2+2=?" -t 0.3              # lower temperature
+```
+
+If you want to extract text files (vocab.txt, merges.txt and header.txt) on your own, you can use the scripts:
+```sh
+# tokenizer - vocab.txt and merges.txt
+python extract_v_m.py Qwen3-0.6B-FP32.gguf
+
+```
+
+### Inference Examples
+
+Multi-turn Conversation with the option m
+```
+# ./runcu Qwen3-0.6B-FP32.gguf -m 1 -k 0
+Multi-turn = on, thinKing = off, Temperature = 0.60, top-P = 0.95
+Press Enter to exit the chat
+Enter system prompt (or Enter to skip): Tell me in one sentence
+Q: Where is the best spot in Paris?
+A: The best spot in Paris is the Eiffel Tower.
+Q: What about the second-best spot?
+A: The second-best spot in Paris is the Louvre Museum.
+```
+
+Reasoning with the option k
+```
+# ./runcu Qwen3-0.6B-FP32.gguf -k 1
+Multi-turn = off, thinKing = on, Temperature = 0.60, top-P = 0.95
+Press Enter to exit the chat
+Enter system prompt (or Enter to skip): 
+Q: Why do stars shine? Give me a quick answer!
+A: <think>
+Okay, the user is asking why stars shine. Let me start by recalling what I know about stars. Stars are luminous objects that emit light. So, the main reason they shine is because they produce light through nuclear fusion.
+
+Wait, but I should make sure. Stars form from clouds of gas and dust in space. When these clouds cool, they start fusing hydrogen into helium, which releases energy. This energy is what we see as light. So the process is nuclear fusion of hydrogen into helium, which gives off energy.
+
+I should also mention that the energy from stars is what we perceive as light. Maybe add that this light travels through space and we see it on Earth. But the question is why they shine, so the answer should focus on the energy production.
+
+I need to keep it simple and concise. The user probably wants a quick answer, so no need for too much detail. Let me check if there's any other reason, but I think that's the main one. Alright, I think that's it.
+</think>
+
+Stars shine because they produce light through nuclear fusion of hydrogen into helium in their cores. This energy is then released as visible light, giving them their luminous glow.
+```
+You can enable and monitor TPS with the r option:
+```
+./runcu Qwen3-0.6B-FP32.gguf -r 1
+Multi-turn = off, thinKing = off, tps(R) = on, Temperature = 0.60, top-P = 0.95
+Press Enter to exit the chat
+Enter system prompt (or Enter to skip): You name is Tom.
+Q: What is your name?
+A: My name is Tom.
+tok/s: 34.482759
+```
+
+## Command-Line Options
+
+```
+Usage:   ./runcu <FP32 GGUF file> [options]
+Example: ./runcu Qwen3-0.6B-FP32.gguf
+         ./runcu Qwen3-0.6B-FP32.gguf -q "What is CUDA?"
+
+Options:
+  -t <float>  temperature in [0,inf], default 0.6
+  -p <float>  p value in top-p (nucleus) sampling in [0,1] default 0.95
+  -s <int>    random seed, default time(NULL)
+  -m <int>    multi-turn: 0 = off (default), 1 = on
+  -k <int>    reasoning: 0 = off (default), 1 = on
+  -r <int>    TPS: 0 = off (default), 1 = on
+  -q <string> single prompt mode (run once and exit)
+```
+
+**Usage Tips:**
+- Use `-q` for automation, scripting, or quick queries
+- Combine `-q` with `-r 1` to measure inference speed
+- Use `-k 1` to enable Qwen3's reasoning mode (shows thinking process)
+- Use `-m 1` for multi-turn conversations (maintains context)
+- Lower `-t` (temperature) for more deterministic outputs
+- Use `runcublas` instead of `runcu` for 2x faster inference
+
+## (Maybe) TODO
+- [ ] Kernel optimization
+- [ ] CUTLASS version
+- [ ] KV cache for multi-turn conversations
+
+## Acknoledgement
+- Inspired and baselined from Andrej Kapathy's [llama2.c](https://github.com/karpathy/llama2.c)
+- Most kernels and CUDA ports were originally adopted from @rogerallen's great repo [llama2.cu](https://github.com/rogerallen/)
+- Based on my qwen3.c [repo](https://github.com/gigit0000/qwen3.c/)
+- GGUF [llama.cpp](https://github.com/ggml-org/llama.cpp)
+- FGPF
+
+## License
+MIT
+
+
+
+
+
+
--- a/src/xpu/flamegraph/qwen3.cu/convert_hf_to_gguf_ordered.py
+++ b/src/xpu/flamegraph/qwen3.cu/convert_hf_to_gguf_ordered.py
--- a/src/xpu/flamegraph/qwen3.cu/extract_v_m.py
+++ b/src/xpu/flamegraph/qwen3.cu/extract_v_m.py
@@ -0,0 +1,143 @@
+#modified from a utility in llama.cpp
+#!/usr/bin/env python3
+import string
+import logging
+import sys
+from pathlib import Path
+
+logger = logging.getLogger("reader")
+
+# Necessary to load the local gguf package
+# sys.path.insert(0, str(Path(__file__).parent.parent))
+
+from gguf.gguf_reader import GGUFReader
+
+# 된다
+def extract_merges_to_txt(reader, output_file="merges.txt"):
+    parts = reader.fields["tokenizer.ggml.merges"].parts
+
+    # Skip initial header/metadata parts
+    start_idx = 6
+
+    # Crop to full merge pairs only
+    if (len(parts) - start_idx) % 2 != 0:
+        print(f"Merges field has odd number of parts after header. Truncating last.")
+        parts = parts[:len(parts) - 1]
+
+    with open(output_file, "w", encoding="utf-8") as f:
+        for i in range(start_idx, len(parts), 2):
+            merge_bytes = parts[i]
+            try:
+                merge_str = bytes(merge_bytes).decode("utf-8")
+            except Exception:
+                merge_str = bytes(merge_bytes).decode("utf-8", errors="replace")
+            f.write(merge_str + "\n")
+
+    print(f"Extracted {((len(parts) - start_idx) //2)} merges to {output_file}")
+
+
+def extract_vocab_to_txt(reader, output_file="vocab.txt"):
+    tokens = reader.fields["tokenizer.ggml.tokens"].parts
+    with open(output_file, "w", encoding="utf-8") as f:
+        # Start at 6 (where real tokens start)
+        for i in range(6, len(tokens), 2):
+            token_bytes = tokens[i]
+            # Only process tokens that are arrays of uint8
+            if getattr(token_bytes, 'dtype', None) == 'uint8':
+                b = bytes(token_bytes)
+                b = b.rstrip(b'\x00')
+                if b:  # skip empty
+                    try:
+                        token_str = b.decode("utf-8")
+                    except Exception:
+                        token_str = b.decode("utf-8", errors="replace")
+                    f.write(token_str + "\n")
+    print(f"Extraction complete ({(len(tokens) -6) //2} tokens written).")
+
+
+def read_gguf_file(gguf_file_path):
+    """
+    Reads and prints key-value pairs and tensor information from a GGUF file in an improved format.
+
+    Parameters:
+    - gguf_file_path: Path to the GGUF file.
+    """
+
+    reader = GGUFReader(gguf_file_path)
+    
+    extract_merges_to_txt(reader)
+    extract_vocab_to_txt(reader)
+
+    # List all key-value pairs in a columnized format
+    print("Key-Value Pairs:") # noqa: NP100
+    max_key_length = max(len(key) for key in reader.fields.keys())
+    
+    for key, field in reader.fields.items():
+        value = field.parts[field.data[0]]
+        print(f"{key:{max_key_length}} : {value}")
+
+        try:
+            value1 = ''.join(chr(x) for x in value)  # Convert [103, 112, 116, 50] to "gpt2"
+            print(f"{key:{max_key_length}} : {value1}")  # Print key and value
+        except:
+            pass    
+        #elif isinstance(value, bytes):
+        #value2 = value.tobytes().decode('utf-8')  # If value is bytes, decode to string
+        #print(f"{key:{max_key_length}} : {value2}")  # Print key and value
+
+
+    for key, field in reader.fields.items():
+        value = field.parts[field.data[0]]
+
+        # Try to convert to string if it looks like string data
+        if isinstance(value, list) and all(isinstance(x, int) for x in value):
+            # Try UTF-8 first, fallback to ASCII, else show the list
+            try:
+                value_str = bytes(value).decode('utf-8')
+            except (UnicodeDecodeError, ValueError, TypeError):
+                try:
+                    if all(32 <= x <= 126 for x in value):  # printable ASCII
+                        value_str = ''.join(chr(x) for x in value)
+                    else:
+                        value_str = str(value)
+                except Exception:
+                    value_str = str(value)
+            value = value_str
+
+        elif isinstance(value, bytes):
+            try:
+                value = value.decode('utf-8')
+            except UnicodeDecodeError:
+                value = str(value)
+
+        elif hasattr(value, 'tobytes'):  # numpy ndarray/memmap/etc
+            try:
+                value = value.tobytes().decode('utf-8')
+            except UnicodeDecodeError:
+                value = repr(value)
+                # OR, for arrays: np.array2string(value) for small arrays
+            except Exception:
+                value = repr(value)
+        else:
+            value = str(value)
+
+        print(f"{key:{max_key_length}} : {value}")
+
+    # List all tensors
+    print("Tensors:") # noqa: NP100
+    tensor_info_format = "{:<30} | Shape: {:<15} | Size: {:<12} | Quantization: {}"
+    print(tensor_info_format.format("Tensor Name", "Shape", "Size", "Quantization")) # noqa: NP100
+    print("-" * 80) # noqa: NP100
+    for tensor in reader.tensors:
+        shape_str = "x".join(map(str, tensor.shape))
+        size_str = str(tensor.n_elements)
+        quantization_str = tensor.tensor_type.name
+        print(tensor_info_format.format(tensor.name, shape_str, size_str, quantization_str)) # noqa: NP100
+
+
+if __name__ == '__main__':
+    if len(sys.argv) < 2:
+        logger.info("Usage: reader.py <path_to_gguf_file>")
+        sys.exit(1)
+    gguf_file_path = sys.argv[1]
+    read_gguf_file(gguf_file_path)
--- a/src/xpu/flamegraph/qwen3.cu/header.py
+++ b/src/xpu/flamegraph/qwen3.cu/header.py
@@ -0,0 +1,133 @@
+import struct
+import sys
+import logging
+
+logger = logging.getLogger(__name__)
+
+def read_string(f):
+    """Read a string from the file"""
+    length = struct.unpack('<Q', f.read(8))[0]
+    try:
+        return f.read(length).decode('utf-8')
+    except UnicodeDecodeError:
+        # If UTF-8 fails, try latin-1 or return as hex
+        f.seek(-length, 1)  # Go back
+        data = f.read(length)
+        return f"<binary data: {data[:50].hex()}{'...' if len(data) > 50 else ''}>"
+
+def read_gguf_file(file_path):
+    """Read GGUF file and extract header information"""
+    with open(file_path, 'rb') as f:
+        # Read magic number
+        magic = f.read(4)
+        if magic != b'GGUF':
+            raise ValueError("Not a valid GGUF file")
+        
+        # Read version
+        version = struct.unpack('<I', f.read(4))[0]
+        
+        # Read tensor count and metadata count
+        tensor_count = struct.unpack('<Q', f.read(8))[0]
+        metadata_count = struct.unpack('<Q', f.read(8))[0]
+        
+        output = []
+        output.append(f"MAGIC={magic.decode('ascii')}")
+        output.append(f"VERSION={version}")
+        output.append(f"TENSOR_COUNT={tensor_count}")
+        output.append(f"METADATA_COUNT={metadata_count}")
+        
+        # Read metadata
+        try:
+            for i in range(metadata_count):
+                key = read_string(f)
+                value_type = struct.unpack('<I', f.read(4))[0]
+                
+                # Read value based on type
+                if value_type == 0:  # UINT8
+                    value = struct.unpack('<B', f.read(1))[0]
+                elif value_type == 1:  # INT8
+                    value = struct.unpack('<b', f.read(1))[0]
+                elif value_type == 2:  # UINT16
+                    value = struct.unpack('<H', f.read(2))[0]
+                elif value_type == 3:  # INT16
+                    value = struct.unpack('<h', f.read(2))[0]
+                elif value_type == 4:  # UINT32
+                    value = struct.unpack('<I', f.read(4))[0]
+                elif value_type == 5:  # INT32
+                    value = struct.unpack('<i', f.read(4))[0]
+                elif value_type == 6:  # FLOAT32
+                    value = struct.unpack('<f', f.read(4))[0]
+                elif value_type == 7:  # BOOL
+                    value = struct.unpack('<B', f.read(1))[0] != 0
+                elif value_type == 8:  # STRING
+                    value = read_string(f)
+                elif value_type == 9:  # ARRAY
+                    array_type = struct.unpack('<I', f.read(4))[0]
+                    array_length = struct.unpack('<Q', f.read(8))[0]
+                    value = f"ARRAY_TYPE={array_type},ARRAY_LENGTH={array_length}"
+                    # Skip array data safely
+                    try:
+                        for _ in range(array_length):
+                            if array_type == 8:  # STRING array
+                                read_string(f)
+                            elif array_type == 4:  # UINT32 array
+                                f.read(4)
+                            elif array_type == 5:  # INT32 array
+                                f.read(4)
+                            elif array_type == 6:  # FLOAT32 array
+                                f.read(4)
+                            elif array_type == 0:  # UINT8 array
+                                f.read(1)
+                            elif array_type == 1:  # INT8 array
+                                f.read(1)
+                            else:
+                                # Skip unknown array type
+                                f.read(4)  # Assume 4 bytes per element
+                    except Exception as e:
+                        value = f"ARRAY_TYPE={array_type},ARRAY_LENGTH={array_length},ERROR=parse_error"
+                elif value_type == 10:  # UINT64
+                    value = struct.unpack('<Q', f.read(8))[0]
+                elif value_type == 11:  # INT64
+                    value = struct.unpack('<q', f.read(8))[0]
+                elif value_type == 12:  # FLOAT64
+                    value = struct.unpack('<d', f.read(8))[0]
+                else:
+                    value = f"UNKNOWN_TYPE={value_type}"
+                
+                # Clean key name for C compatibility
+                clean_key = key.replace('.', '_').replace('-', '_').upper()
+                output.append(f"{clean_key}={value}")
+        except Exception as e:
+            output.append(f"METADATA_ERROR={e}")
+            return
+        
+        # Read tensor info (without data)
+        for i in range(tensor_count):
+            name = read_string(f)
+            n_dimensions = struct.unpack('<I', f.read(4))[0]
+            dimensions = []
+            for _ in range(n_dimensions):
+                dimensions.append(struct.unpack('<Q', f.read(8))[0])
+            tensor_type = struct.unpack('<I', f.read(4))[0]
+            offset = struct.unpack('<Q', f.read(8))[0]
+            
+            # Clean tensor name for C compatibility
+            clean_name = name.replace('.', '_').replace('-', '_').upper()
+            output.append(f"TENSOR_{i}_NAME={clean_name}")
+            output.append(f"TENSOR_{i}_DIMENSIONS={','.join(map(str, dimensions))}")
+            output.append(f"TENSOR_{i}_TYPE={tensor_type}")
+            output.append(f"TENSOR_{i}_OFFSET={offset}")
+        
+        # Write to file
+        with open('header.txt', 'w', encoding='utf-8') as out_file:
+            out_file.write('\n'.join(output))
+        
+        print("Header information saved to header.txt")
+
+if __name__ == '__main__':
+    if len(sys.argv) < 2:
+        logger.info("Usage: reader.py <path_to_gguf_file>")
+        sys.exit(1)
+    
+    gguf_file_path = sys.argv[1]
+    read_gguf_file(gguf_file_path)
--- a/src/xpu/flamegraph/qwen3.cu/runcu.cu
+++ b/src/xpu/flamegraph/qwen3.cu/runcu.cu
				`@@ -0,0 +1 @@`
				`convert_hf_to_gguf_ordered.py linguist-vendored`