eBPF Tutorial: Python Stack Profiler
Profile Python applications at the OS level using eBPF to capture native and Python call stacks, helping identify performance bottlenecks in Python programs including data science workloads, web servers, and ML inference.
The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/trace/python-stack-profiler
Overview
Python profiling traditionally relies on instrumentation (cProfile), on an external sampler that reads interpreter state from another process (py-spy), or on system-level sampling (perf). Each approach has limitations:
- cProfile: High overhead; must be enabled at launch or wired into the code
- py-spy: Samples from userspace, may miss short-lived functions
- perf: Captures native stacks but by default can't see Python function names
This tutorial shows how to use eBPF to capture both native C stacks AND Python interpreter stacks, giving you complete visibility into where your Python application spends time.
What You'll Learn
- How to attach eBPF probes to Python processes
- Walking Python interpreter frame structures from kernel space
- Extracting Python function names, filenames, and line numbers
- Combining native and Python stacks for complete profiling
- Generating flamegraphs for Python applications
Prerequisites
- Linux kernel 5.15+ (the BPF ring buffer itself is available from kernel 5.8)
- Python 3.8+ running on your system
- Root access (for loading eBPF programs)
- Understanding of stack traces and profiling concepts
Building and Running
make
sudo ./python-stack
How It Works
The profiler samples Python processes at a fixed frequency (e.g., 49 Hz, chosen to avoid sampling in lock-step with the scheduler tick and other periodic activity). For each sample:
- Capture native stack: Use BPF stack helpers to get kernel and userspace stacks
- Identify Python threads: Check whether the sampled process is running the Python interpreter
- Walk Python frames: Read the PyFrameObject chain from CPython internals
- Extract symbols: Get function names, filenames, line numbers from PyCodeObject
- Aggregate data: Count stack occurrences for flamegraph generation
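The aggregation step can be modeled in plain userspace C. The sketch below is not the tutorial's implementation: it assumes each sample has already been rendered as a folded stack string and counts occurrences in a small fixed-size table, which is roughly the role a BPF map would play in a real profiler. The names `stack_count`, `record_sample`, `lookup_count`, and both size limits are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_STACK_LEN 128   /* hypothetical limit on a folded stack string */
#define TABLE_SLOTS   64    /* hypothetical table capacity */

struct stack_count {
    char key[MAX_STACK_LEN];   /* folded stack, e.g. "main;process_data" */
    uint64_t count;            /* 0 marks an empty slot */
};

/* FNV-1a hash over the folded stack string. */
uint32_t hash_stack(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Records one sample for `folded` using linear probing.
 * Returns the updated count, or 0 if the key is too long or the table is full. */
uint64_t record_sample(struct stack_count *table, const char *folded)
{
    assert(table != NULL && folded != NULL);
    if (strlen(folded) >= MAX_STACK_LEN)
        return 0;

    uint32_t start = hash_stack(folded) % TABLE_SLOTS;
    for (uint32_t i = 0; i < TABLE_SLOTS; i++) {
        struct stack_count *slot = &table[(start + i) % TABLE_SLOTS];
        if (slot->count == 0) {            /* empty slot: first sighting */
            strcpy(slot->key, folded);
            slot->count = 1;
            return 1;
        }
        if (strcmp(slot->key, folded) == 0)
            return ++slot->count;          /* seen before: bump the counter */
    }
    return 0;
}

/* Returns the number of samples recorded for `folded`, or 0 if unseen. */
uint64_t lookup_count(const struct stack_count *table, const char *folded)
{
    uint32_t start = hash_stack(folded) % TABLE_SLOTS;
    for (uint32_t i = 0; i < TABLE_SLOTS; i++) {
        const struct stack_count *slot = &table[(start + i) % TABLE_SLOTS];
        if (slot->count == 0)
            return 0;
        if (strcmp(slot->key, folded) == 0)
            return slot->count;
    }
    return 0;
}
```

The table must be zero-initialized before the first call. Printing each non-empty slot as its key followed by its count would yield one folded-stack line per distinct stack.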
Python Internals
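CPython's frame structure (simplified; the real layout varies between CPython versions, and 3.11+ moved per-call state into an internal _PyInterpreterFrame):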
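typedef struct _object PyObject;          // Opaque here; only pointers are used
typedef struct PyCodeObject PyCodeObject;
struct _frame {
    struct _frame *f_back;   // Previous (calling) frame; NULL at the outermost frame
    PyCodeObject *f_code;    // Code object being executed
    int f_lineno;            // Line number; CPython keeps this current only while tracing
};
struct PyCodeObject {
    PyObject *co_filename;   // Source filename (a Python str object)
    PyObject *co_name;       // Function name (a Python str object)
};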
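Walking the `f_back` chain can be sketched in plain C as follows. This is a userspace model under several assumptions, not the tutorial's BPF code: `read_user` is a hypothetical stand-in for a helper such as `bpf_probe_read_user()`, the structs are reduced stand-ins that store names as C strings instead of Python str objects, and `MAX_PY_DEPTH` is an arbitrary cap. The cap matters because BPF loops must be bounded, and it also guards against a corrupted or cyclic chain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PY_DEPTH 16   /* hypothetical cap on frames walked per sample */

/* Reduced stand-ins for the CPython structs; real layouts differ by version. */
struct py_code  { const char *co_filename; const char *co_name; };
struct py_frame { struct py_frame *f_back; struct py_code *f_code; int f_lineno; };

struct py_frame_info { const char *filename; const char *name; int lineno; };

/* Hypothetical stand-in for bpf_probe_read_user(): copies `size` bytes and
 * returns 0 on success. In a BPF program `src` would be an address inside
 * the profiled process, and the read could fail at any step. */
int read_user(void *dst, size_t size, const void *src)
{
    if (src == NULL)
        return -1;
    memcpy(dst, src, size);
    return 0;
}

/* Walks from the innermost frame outward, filling `out` leaf-first.
 * Returns the number of frames recorded (at most MAX_PY_DEPTH). */
int walk_py_frames(const struct py_frame *top, struct py_frame_info *out)
{
    const struct py_frame *cur = top;
    int depth = 0;

    for (int i = 0; i < MAX_PY_DEPTH && cur != NULL; i++) {
        struct py_frame frame;
        struct py_code code;

        /* Copy the frame first, then the code object it points to. */
        if (read_user(&frame, sizeof(frame), cur) != 0)
            break;
        if (read_user(&code, sizeof(code), frame.f_code) != 0)
            break;

        out[depth].filename = code.co_filename;
        out[depth].name = code.co_name;
        out[depth].lineno = frame.f_lineno;
        depth++;

        cur = frame.f_back;   /* step to the caller */
    }
    assert(depth <= MAX_PY_DEPTH);
    return depth;
}
```

Copying each struct before following its pointers mirrors how a BPF program has to work: it cannot dereference userspace pointers directly, so every hop along the chain is a separate checked read.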
Example Output
python-script.py:main;process_data;expensive_function 247
python-script.py:main;load_model;torch.load 189
python-script.py:main;preprocess;np.array 156
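Each line is a folded stack: the script name, then the semicolon-separated call chain from the outermost to the innermost function, followed by the number of samples in which that stack was observed.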
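A consumer of this output could split each line back into frames and a count. The following is a minimal sketch, assuming the format is exactly "frame;frame;... COUNT" with a single space before the count; `folded_sample`, `parse_folded_line`, and `MAX_FRAMES` are hypothetical names and are not part of the tutorial's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FRAMES 32   /* hypothetical cap on frames per line */

struct folded_sample {
    char *frames[MAX_FRAMES];  /* pointers into the caller's buffer, outermost first */
    int depth;                 /* number of frames stored */
    long count;                /* sample count parsed from the end of the line */
};

/* Parses one folded-stack line in place (the buffer is modified).
 * Returns 0 on success, -1 if the line has no valid trailing count. */
int parse_folded_line(char *line, struct folded_sample *out)
{
    assert(line != NULL && out != NULL);

    /* The count follows the last space on the line. */
    char *space = strrchr(line, ' ');
    if (space == NULL || space == line)
        return -1;

    char *end = NULL;
    long count = strtol(space + 1, &end, 10);
    if (end == space + 1 || (*end != '\0' && *end != '\n') || count < 0)
        return -1;
    *space = '\0';   /* terminate the stack portion */

    out->count = count;
    out->depth = 0;

    /* Split the stack portion on ';' without copying. */
    char *cursor = line;
    while (cursor != NULL && out->depth < MAX_FRAMES) {
        char *sep = strchr(cursor, ';');
        if (sep != NULL)
            *sep = '\0';
        out->frames[out->depth++] = cursor;
        cursor = (sep != NULL) ? sep + 1 : NULL;
    }
    return 0;
}
```

Because the parser writes terminators into the input buffer, callers should pass a writable copy of each line, not a string literal.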
Use Cases
- ML/AI workloads: Profile PyTorch, TensorFlow, NumPy operations
- Web servers: Find bottlenecks in Flask, Django, FastAPI
- Data processing: Optimize pandas, polars operations
- General Python: Any Python application performance analysis
Next Steps
- Extend to capture GIL contention
- Add Python object allocation tracking
- Integrate with other eBPF metrics (CPU, memory)
- Build an integrated flamegraph visualization on top of the folded-stack output