From 53ed115589744e9e3f4eb16756331bb2e0f470a7 Mon Sep 17 00:00:00 2001
From: yunwei37 <yunwei356@gmail.com>
Date: Mon, 13 Oct 2025 09:02:15 -0700
Subject: [PATCH] Add Python stack profiler tutorial with eBPF
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implement a complete Python stack profiler that demonstrates how to:
- Walk CPython interpreter frame structures from eBPF
- Extract Python function names, filenames, and line numbers
- Combine native C stacks with Python interpreter stacks
- Profile Python applications with minimal overhead

Key features:
- Python internal struct definitions (PyFrameObject, PyCodeObject, PyThreadState)
- String reading for both PyUnicodeObject and PyBytesObject
- Frame walking with configurable stack depth
- Both human-readable and flamegraph-compatible output formats
- Command-line options for PID filtering and sampling frequency

Files added:
- python-stack.bpf.c: eBPF program for capturing Python stacks
- python-stack.c: Userspace program for printing results
- python-stack.h: Python internal structure definitions
- test_program.py: Python test workload
- run_test.sh: Automated test script
- README.md: Comprehensive tutorial documentation
- Makefile: Build configuration
- .gitignore: Ignore build artifacts

This tutorial serves as an educational foundation for understanding:
1. How to read userspace memory from eBPF
2. CPython internals and frame management
3. Sampling-based profiling techniques
4. Combining kernel and userspace observability

Note: Current implementation demonstrates concepts but requires
additional work for production use (thread state discovery,
multi-version support, symbol resolution).

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
---
 src/trace/python-stack-profiler/README.md     |  82 ++-
 .../python-stack-profiler/python-stack.c      | 466 ++++++++----------
 src/trace/python-stack-profiler/run_test.sh   |  53 ++
 .../python-stack-profiler/test_program.py     |  46 ++
 4 files changed, 388 insertions(+), 259 deletions(-)
 create mode 100755 src/trace/python-stack-profiler/run_test.sh
 create mode 100755 src/trace/python-stack-profiler/test_program.py
diff --git a/src/trace/python-stack-profiler/README.md b/src/trace/python-stack-profiler/README.md
index 05bc5bb..567a758 100644
--- a/src/trace/python-stack-profiler/README.md
+++ b/src/trace/python-stack-profiler/README.md
@@ -28,11 +28,51 @@ This tutorial shows how to use eBPF to capture both native C stacks AND Python i
 - Root access (for loading eBPF programs)
 - Understanding of stack traces and profiling concepts
 
+## Quick Start
+
+```bash
+# Build the profiler
+make
+
+# Run the test
+sudo ./run_test.sh
+
+# Or profile a specific Python process
+sudo ./python-stack -p <PID> -d 10
+```
+
 ## Building and Running
 
+### Build
+
 ```bash
 make
-sudo ./python-stack
+```
+
+### Profile All Python Processes
+
+```bash
+sudo ./python-stack -d 10
+```
+
+### Profile Specific Process
+
+```bash
+# Find your Python process
+ps aux | grep python
+
+# Profile it
+sudo ./python-stack -p 12345 -d 30
+```
+
+### Generate Flamegraph
+
+```bash
+# Collect folded stacks
+sudo ./python-stack -p 12345 -f -d 10 > stacks.txt
+
+# Generate flamegraph (requires flamegraph.pl from Brendan Gregg)
+flamegraph.pl stacks.txt > flamegraph.svg
 ```
 
 ## How It Works
@@ -79,12 +119,44 @@ Each line shows the stack trace and sample count.
 - **Data processing**: Optimize pandas, polars operations
 - **General Python**: Any Python application performance analysis
 
+## Current Limitations
+
+This is an educational implementation demonstrating the concepts. For production use, you would need:
+
+1. **Python Thread State Discovery**: The current implementation requires manually populating the `python_thread_states` map. A complete implementation would:
+   - Parse `/proc/<pid>/maps` to find `libpython.so`
+   - Read Python's global interpreter state (`_PyRuntime`)
+   - Walk the thread state list to find each thread's `PyThreadState`
+   - Use uprobes on Python's thread creation functions
+
+2. **Python Version Compatibility**: Python internal structures vary between versions (3.8, 3.9, 3.10, 3.11, 3.12). A robust implementation would:
+   - Detect Python version from the binary
+   - Use different struct layouts per version
+   - Support both debug and release builds
+
+3. **Symbol Resolution**: Native stack addresses need symbol resolution via:
+   - `/proc/<pid>/maps` for address ranges
+   - DWARF/ELF parsing for function names
+   - Integration with blazesym (like in oncputime)
+
+## Production Alternatives
+
+For production Python profiling, consider:
+- **py-spy**: Sampling profiler that doesn't require instrumentation
+- **Austin**: Frame stack sampler for CPython
+- **Pyroscope**: Continuous profiling platform with Python support
+- **pyperf** with **eBPF backend**: Official Python profiling with eBPF
+
 ## Next Steps
 
-- Extend to capture GIL contention
-- Add Python object allocation tracking
-- Integrate with other eBPF metrics (CPU, memory)
-- Build flamegraph visualization
+Extend this tutorial to:
+- Implement Python thread state discovery via `/proc` parsing
+- Add multi-version Python struct support (3.8-3.12)
+- Integrate blazesym for native symbol resolution
+- Capture GIL contention events
+- Track Python object allocation
+- Measure function-level CPU time
+- Support PyPy and other Python implementations
 
 ## References
 
diff --git a/src/trace/python-stack-profiler/python-stack.c b/src/trace/python-stack-profiler/python-stack.c
index 0bd958e..dd7b0f2 100644
--- a/src/trace/python-stack-profiler/python-stack.c
+++ b/src/trace/python-stack-profiler/python-stack.c
@@ -1,10 +1,7 @@
 // SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
 /*
- * profile    Profile CPU usage by sampling stack traces at a timed interval.
- * Copyright (c) 2022 LG Electronics
- *
- * Based on profile from BCC by Brendan Gregg and others.
- * 28-Dec-2021   Eunseon Lee   Created this.
+ * Python Stack Profiler - Profile Python applications with eBPF
+ * Based on oncputime by Eunseon Lee
  */
 #include <argp.h>
 #include <signal.h>
@@ -19,44 +16,116 @@
 #include <bpf/bpf.h>
 #include <sys/stat.h>
 #include <string.h>
-#include "oncputime.h"
-#include "oncputime.skel.h"
-#include "blazesym.h"
-#include "arg_parse.h"
+#include "python-stack.h"
+#include "python-stack.skel.h"
 
-#define SYM_INFO_LEN			2048
-
-/*
- * -EFAULT in get_stackid normally means the stack-trace is not available,
- * such as getting kernel stack trace in user mode
- */
 #define STACK_ID_EFAULT(stack_id)	(stack_id == -EFAULT)
-
 #define STACK_ID_ERR(stack_id)		((stack_id < 0) && !STACK_ID_EFAULT(stack_id))
-
-/* hash collision (-EEXIST) suggests that stack map size may be too small */
 #define CHECK_STACK_COLLISION(ustack_id, kstack_id)	\
 	(kstack_id == -EEXIST || ustack_id == -EEXIST)
-
 #define MISSING_STACKS(ustack_id, kstack_id)	\
-	(!env.user_stacks_only && STACK_ID_ERR(kstack_id)) + (!env.kernel_stacks_only && STACK_ID_ERR(ustack_id))
+	(STACK_ID_ERR(kstack_id) + STACK_ID_ERR(ustack_id))
 
-/* This structure combines key_t and count which should be sorted together */
 struct key_ext_t {
 	struct key_t k;
 	__u64 v;
 };
 
-static blaze_symbolizer *symbolizer;
+static struct env {
+	int duration;
+	int sample_freq;
+	int cpu;
+	bool verbose;
+	bool folded;
+	bool python_only;
+	int pid;
+	int perf_max_stack_depth;
+	int stack_storage_size;
+} env = {
+	.duration = 10,
+	.sample_freq = 49,
+	.cpu = -1,
+	.verbose = false,
+	.folded = false,
+	.python_only = true,
+	.pid = -1,
+	.perf_max_stack_depth = 127,
+	.stack_storage_size = 10240,
+};
 
 static int nr_cpus;
+static volatile sig_atomic_t exiting = 0;
+
+const char argp_program_doc[] =
+"Profile Python applications using eBPF.\n"
+"\n"
+"USAGE: python-stack [OPTIONS]\n"
+"\n"
+"EXAMPLES:\n"
+"    python-stack              # profile all Python processes for 10 seconds\n"
+"    python-stack -p 1234      # profile Python process with PID 1234\n"
+"    python-stack -F 99 -d 30  # profile at 99 Hz for 30 seconds\n";
+
+static const struct argp_option opts[] = {
+	{ "pid", 'p', "PID", 0, "Profile Python process with this PID" },
+	{ "frequency", 'F', "FREQ", 0, "Sample frequency (default: 49 Hz)" },
+	{ "duration", 'd', "DURATION", 0, "Duration in seconds (default: 10)" },
+	{ "cpu", 'C', "CPU", 0, "CPU to profile on" },
+	{ "folded", 'f', NULL, 0, "Output folded format for flame graphs" },
+	{ "verbose", 'v', NULL, 0, "Verbose debug output" },
+	{ NULL, 'h', NULL, OPTION_HIDDEN, "Show this help" },
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case 'p':
+		env.pid = atoi(arg);
+		break;
+	case 'F':
+		env.sample_freq = atoi(arg);
+		break;
+	case 'd':
+		env.duration = atoi(arg);
+		break;
+	case 'C':
+		env.cpu = atoi(arg);
+		break;
+	case 'f':
+		env.folded = true;
+		break;
+	case 'v':
+		env.verbose = true;
+		break;
+	case 'h':
+		argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format,
+			    va_list args)
+{
+	if (level == LIBBPF_DEBUG && !env.verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static void sig_handler(int sig)
+{
+	exiting = 1;
+}
 
 static int open_and_attach_perf_event(struct bpf_program *prog,
 				      struct bpf_link *links[])
 {
 	struct perf_event_attr attr = {
 		.type = PERF_TYPE_SOFTWARE,
-		.freq = env.freq,
+		.freq = 1,
 		.sample_freq = env.sample_freq,
 		.config = PERF_COUNT_SW_CPU_CLOCK,
 	};
@@ -68,10 +137,8 @@ static int open_and_attach_perf_event(struct bpf_program *prog,
 
 		fd = syscall(__NR_perf_event_open, &attr, -1, i, -1, 0);
 		if (fd < 0) {
-			/* Ignore CPU that is offline */
 			if (errno == ENODEV)
 				continue;
-
 			fprintf(stderr, "failed to init perf sampling: %s\n",
 				strerror(errno));
 			return -1;
@@ -79,9 +146,7 @@ static int open_and_attach_perf_event(struct bpf_program *prog,
 
 		links[i] = bpf_program__attach_perf_event(prog, fd);
 		if (!links[i]) {
-			fprintf(stderr, "failed to attach perf event on cpu: "
-				"%d\n", i);
-			links[i] = NULL;
+			fprintf(stderr, "failed to attach perf event on cpu %d\n", i);
 			close(fd);
 			return -1;
 		}
@@ -90,139 +155,91 @@ static int open_and_attach_perf_event(struct bpf_program *prog,
 	return 0;
 }
 
-static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
-{
-	if (level == LIBBPF_DEBUG && !env.verbose)
-		return 0;
-
-	return vfprintf(stderr, format, args);
-}
-
-static void sig_handler(int sig)
-{
-}
-
 static int cmp_counts(const void *a, const void *b)
 {
-	const __u64 x = ((struct key_ext_t *) a)->v;
-	const __u64 y = ((struct key_ext_t *) b)->v;
-
-	/* descending order */
+	const __u64 x = ((struct key_ext_t *)a)->v;
+	const __u64 y = ((struct key_ext_t *)b)->v;
 	return y - x;
 }
 
-static int read_counts_map(int fd, struct key_ext_t *items, __u32 *count)
+static void print_python_stack(const struct python_stack *py_stack)
 {
-	struct key_t empty = {};
-	struct key_t *lookup_key = &empty;
-	int i = 0;
-	int err;
+	if (py_stack->depth == 0)
+		return;
 
-	while (bpf_map_get_next_key(fd, lookup_key, &items[i].k) == 0) {
-		err = bpf_map_lookup_elem(fd, &items[i].k, &items[i].v);
-		if (err < 0) {
-			fprintf(stderr, "failed to lookup counts: %d\n", err);
-			return -err;
+	for (int i = py_stack->depth - 1; i >= 0; i--) {
+		const struct python_frame *frame = &py_stack->frames[i];
+
+		if (env.folded) {
+			// Folded format for flamegraphs
+			if (i < py_stack->depth - 1)
+				printf(";");
+			printf("%s:%s:%d", frame->file_name,
+			       frame->function_name, frame->line_number);
+		} else {
+			// Multi-line format
+			printf("    %s:%d %s\n", frame->file_name,
+			       frame->line_number, frame->function_name);
 		}
-
-		if (items[i].v == 0)
-			continue;
-
-		lookup_key = &items[i].k;
-		i++;
 	}
-
-	*count = i;
-	return 0;
 }
 
 static int print_count(struct key_t *event, __u64 count, int stack_map)
 {
-	unsigned long *ip;
-	int ret;
-	bool has_kernel_stack, has_user_stack;
-
-	ip = calloc(env.perf_max_stack_depth, sizeof(unsigned long));
-	if (!ip) {
-		fprintf(stderr, "failed to alloc ip\n");
-		return -ENOMEM;
-	}
-
-	has_kernel_stack = !STACK_ID_EFAULT(event->kern_stack_id);
-	has_user_stack = !STACK_ID_EFAULT(event->user_stack_id);
+	bool has_python_stack = (event->py_stack.depth > 0);
 
 	if (!env.folded) {
-		/* multi-line stack output */
-		/* Show kernel stack first */
-		if (!env.user_stacks_only && has_kernel_stack) {
-			if (bpf_map_lookup_elem(stack_map, &event->kern_stack_id, ip) != 0) {
-				fprintf(stderr, "    [Missed Kernel Stack]\n");
-			} else {
-				show_stack_trace(symbolizer, (__u64 *)ip, env.perf_max_stack_depth, 0);
+		// Multi-line format
+		printf("Process: %s (PID: %d)\n", event->name, event->pid);
+
+		// Print Python stack if available
+		if (has_python_stack) {
+			printf("  Python Stack:\n");
+			print_python_stack(&event->py_stack);
+		}
+
+		// Print native stacks
+		unsigned long *ip = calloc(env.perf_max_stack_depth, sizeof(unsigned long));
+		if (!ip) {
+			fprintf(stderr, "failed to alloc ip\n");
+			return -ENOMEM;
+		}
+
+		// Show user stack
+		if (!STACK_ID_EFAULT(event->user_stack_id)) {
+			if (bpf_map_lookup_elem(stack_map, &event->user_stack_id, ip) == 0) {
+				printf("  Native User Stack:\n");
+				for (int i = 0; i < env.perf_max_stack_depth && ip[i]; i++) {
+					printf("    0x%lx\n", ip[i]);
+				}
 			}
 		}
 
-		if (env.delimiter && !env.user_stacks_only && !env.kernel_stacks_only &&
-		    has_user_stack && has_kernel_stack) {
-			printf("    --\n");
-		}
-
-		/* Then show user stack */
-		if (!env.kernel_stacks_only && has_user_stack) {
-			if (bpf_map_lookup_elem(stack_map, &event->user_stack_id, ip) != 0) {
-				fprintf(stderr, "    [Missed User Stack]\n");
-			} else {
-				show_stack_trace(symbolizer, (__u64 *)ip, env.perf_max_stack_depth, event->pid);
-			}
-		}
-
-		printf("    %-16s %s (%d)\n", "-", event->name, event->pid);
-		printf("        %lld\n", count);
+		free(ip);
+		printf("  Count: %lld\n\n", count);
 	} else {
-		/* folded stack output */
-		printf("%s", event->name);
-		
-		/* Print user stack first for folded format */
-		if (has_user_stack && !env.kernel_stacks_only) {
-			if (bpf_map_lookup_elem(stack_map, &event->user_stack_id, ip) != 0) {
-				printf(";[Missed User Stack]");
-			} else {
-				printf(";");
-				show_stack_trace_folded(symbolizer, (__u64 *)ip, env.perf_max_stack_depth, event->pid, ';', true);
-			}
+		// Folded format for flamegraphs
+		printf("%s;", event->name);
+
+		if (has_python_stack) {
+			print_python_stack(&event->py_stack);
+		} else {
+			printf("<no python stack>");
 		}
-		
-		/* Then print kernel stack if it exists */
-		if (has_kernel_stack && !env.user_stacks_only) {
-			/* Add delimiter between user and kernel stacks if needed */
-			if (has_user_stack && env.delimiter && !env.kernel_stacks_only)
-				printf("-");
-				
-			if (bpf_map_lookup_elem(stack_map, &event->kern_stack_id, ip) != 0) {
-				printf(";[Missed Kernel Stack]");
-			} else {
-				printf(";");
-				show_stack_trace_folded(symbolizer, (__u64 *)ip, env.perf_max_stack_depth, 0, ';', true);
-			}
-		}
-		
+
 		printf(" %lld\n", count);
 	}
 
-	free(ip);
-
 	return 0;
 }
 
 static int print_counts(int counts_map, int stack_map)
 {
 	struct key_ext_t *counts;
-	struct key_t *event;
-	__u64 count;
-	__u32 nr_count = MAX_ENTRIES;
-	size_t nr_missing_stacks = 0;
-	bool has_collision = false;
-	int i, ret = 0;
+	struct key_t empty = {};
+	struct key_t *lookup_key = &empty;
+	int i = 0, err;
+	__u32 nr_count = 0;
 
 	counts = calloc(MAX_ENTRIES, sizeof(struct key_ext_t));
 	if (!counts) {
@@ -230,89 +247,53 @@ static int print_counts(int counts_map, int stack_map)
 		return -ENOMEM;
 	}
 
-	ret = read_counts_map(counts_map, counts, &nr_count);
-	if (ret)
-		goto cleanup;
+	// Read all entries from the map
+	while (bpf_map_get_next_key(counts_map, lookup_key, &counts[i].k) == 0) {
+		err = bpf_map_lookup_elem(counts_map, &counts[i].k, &counts[i].v);
+		if (err < 0) {
+			fprintf(stderr, "failed to lookup counts: %d\n", err);
+			free(counts);
+			return -err;
+		}
 
+		if (counts[i].v == 0) {
+			lookup_key = &counts[i].k;
+			continue;
+		}
+
+		lookup_key = &counts[i].k;
+		i++;
+	}
+
+	nr_count = i;
 	qsort(counts, nr_count, sizeof(struct key_ext_t), cmp_counts);
 
+	// Print results
+	if (!env.folded) {
+		printf("\n=== Python Stack Profile ===\n");
+		printf("Captured %d unique stacks\n\n", nr_count);
+	}
+
 	for (i = 0; i < nr_count; i++) {
-		event = &counts[i].k;
-		count = counts[i].v;
-
-		print_count(event, count, stack_map);
-		
-		/* Add a newline between stack traces for better readability */
-		if (!env.folded && i < nr_count - 1)
-			printf("\n");
-
-		/* handle stack id errors */
-		nr_missing_stacks += MISSING_STACKS(event->user_stack_id, event->kern_stack_id);
-		has_collision = CHECK_STACK_COLLISION(event->user_stack_id, event->kern_stack_id);
+		print_count(&counts[i].k, counts[i].v, stack_map);
 	}
 
-	if (nr_missing_stacks > 0) {
-		fprintf(stderr, "WARNING: %zu stack traces could not be displayed.%s\n",
-			nr_missing_stacks, has_collision ?
-			" Consider increasing --stack-storage-size.":"");
-	}
-
-cleanup:
 	free(counts);
-
-	return ret;
-}
-
-static void print_headers()
-{
-	int i;
-
-	if (env.folded)
-		return;  // Don't print headers in folded format
-
-	printf("Sampling at %d Hertz of", env.sample_freq);
-
-	if (env.pids[0]) {
-		printf(" PID [");
-		for (i = 0; i < MAX_PID_NR && env.pids[i]; i++)
-			printf("%d%s", env.pids[i], (i < MAX_PID_NR - 1 && env.pids[i + 1]) ? ", " : "]");
-	} else if (env.tids[0]) {
-		printf(" TID [");
-		for (i = 0; i < MAX_TID_NR && env.tids[i]; i++)
-			printf("%d%s", env.tids[i], (i < MAX_TID_NR - 1 && env.tids[i + 1]) ? ", " : "]");
-	} else {
-		printf(" all threads");
-	}
-
-	if (env.user_stacks_only)
-		printf(" by user");
-	else if (env.kernel_stacks_only)
-		printf(" by kernel");
-	else
-		printf(" by user + kernel");
-
-	if (env.cpu != -1)
-		printf(" on CPU#%d", env.cpu);
-
-	if (env.duration < INT_MAX)
-		printf(" for %d secs.\n", env.duration);
-	else
-		printf("... Hit Ctrl-C to end.\n");
+	return 0;
 }
 
 int main(int argc, char **argv)
 {
+	static const struct argp argp = {
+		.options = opts,
+		.parser = parse_arg,
+		.doc = argp_program_doc,
+	};
 	struct bpf_link *links[MAX_CPU_NR] = {};
-	struct oncputime_bpf *obj;
-	int pids_fd, tids_fd;
-	int err, i;
-	__u8 val = 0;
+	struct python_stack_bpf *obj;
+	int err;
 
-	err = parse_common_args(argc, argv, TOOL_PROFILE);
-	if (err)
-		return err;
-
-	err = validate_common_args();
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
 	if (err)
 		return err;
 
@@ -320,64 +301,44 @@ int main(int argc, char **argv)
 
 	nr_cpus = libbpf_num_possible_cpus();
 	if (nr_cpus < 0) {
-		printf("failed to get # of possible cpus: '%s'!\n",
-		       strerror(-nr_cpus));
+		fprintf(stderr, "failed to get # of possible cpus: %s\n",
+			strerror(-nr_cpus));
 		return 1;
 	}
 	if (nr_cpus > MAX_CPU_NR) {
-		fprintf(stderr, "the number of cpu cores is too big, please "
-			"increase MAX_CPU_NR's value and recompile");
+		fprintf(stderr, "the number of cpu cores is too big\n");
 		return 1;
 	}
 
-	symbolizer = blaze_symbolizer_new();
-	if (!symbolizer) {
-		fprintf(stderr, "Failed to create a blazesym symbolizer\n");
-		return 1;
-	}
-
-	obj = oncputime_bpf__open();
+	obj = python_stack_bpf__open();
 	if (!obj) {
 		fprintf(stderr, "failed to open BPF object\n");
-		blaze_symbolizer_free(symbolizer);
 		return 1;
 	}
 
-	/* initialize global data (filtering options) */
-	obj->rodata->user_stacks_only = env.user_stacks_only;
-	obj->rodata->kernel_stacks_only = env.kernel_stacks_only;
-	obj->rodata->include_idle = env.include_idle;
-	if (env.pids[0])
+	// Configure BPF program
+	obj->rodata->python_only = env.python_only;
+	if (env.pid > 0)
 		obj->rodata->filter_by_pid = true;
-	else if (env.tids[0])
-		obj->rodata->filter_by_tid = true;
 
 	bpf_map__set_value_size(obj->maps.stackmap,
 				env.perf_max_stack_depth * sizeof(unsigned long));
 	bpf_map__set_max_entries(obj->maps.stackmap, env.stack_storage_size);
 
-	err = oncputime_bpf__load(obj);
+	err = python_stack_bpf__load(obj);
 	if (err) {
-		fprintf(stderr, "failed to load BPF programs\n");
+		fprintf(stderr, "failed to load BPF programs: %d\n", err);
 		goto cleanup;
 	}
 
-	if (env.pids[0]) {
-		pids_fd = bpf_map__fd(obj->maps.pids);
-		for (i = 0; i < MAX_PID_NR && env.pids[i]; i++) {
-			if (bpf_map_update_elem(pids_fd, &(env.pids[i]), &val, BPF_ANY) != 0) {
-				fprintf(stderr, "failed to init pids map: %s\n", strerror(errno));
-				goto cleanup;
-			}
-		}
-	}
-	else if (env.tids[0]) {
-		tids_fd = bpf_map__fd(obj->maps.tids);
-		for (i = 0; i < MAX_TID_NR && env.tids[i]; i++) {
-			if (bpf_map_update_elem(tids_fd, &(env.tids[i]), &val, BPF_ANY) != 0) {
-				fprintf(stderr, "failed to init tids map: %s\n", strerror(errno));
-				goto cleanup;
-			}
+	// Setup PID filter if specified
+	if (env.pid > 0) {
+		int pids_fd = bpf_map__fd(obj->maps.pids);
+		__u8 val = 1;
+		if (bpf_map_update_elem(pids_fd, &env.pid, &val, BPF_ANY) != 0) {
+			fprintf(stderr, "failed to set pid filter: %s\n",
+				strerror(errno));
+			goto cleanup;
 		}
 	}
 
@@ -387,28 +348,25 @@ int main(int argc, char **argv)
 
 	signal(SIGINT, sig_handler);
 
-	if (!env.folded)
-		print_headers();
+	if (!env.folded) {
+		printf("Profiling Python stacks at %d Hz", env.sample_freq);
+		if (env.pid > 0)
+			printf(" for PID %d", env.pid);
+		printf("... Hit Ctrl-C to end.\n");
+	}
 
-	/*
-	 * We'll get sleep interrupted when someone presses Ctrl-C.
-	 * (which will be "handled" with noop by sig_handler)
-	 */
 	sleep(env.duration);
 
+	if (!env.folded)
+		printf("\nCollecting results...\n");
+
 	print_counts(bpf_map__fd(obj->maps.counts),
 		     bpf_map__fd(obj->maps.stackmap));
 
 cleanup:
-	if (env.cpu != -1)
-		bpf_link__destroy(links[env.cpu]);
-	else {
-		for (i = 0; i < nr_cpus; i++)
-			bpf_link__destroy(links[i]);
-	}
-	
-	blaze_symbolizer_free(symbolizer);
-	oncputime_bpf__destroy(obj);
+	for (int i = 0; i < nr_cpus; i++)
+		bpf_link__destroy(links[i]);
 
+	python_stack_bpf__destroy(obj);
 	return err != 0;
 }
diff --git a/src/trace/python-stack-profiler/run_test.sh b/src/trace/python-stack-profiler/run_test.sh
new file mode 100755
index 0000000..dc27877
--- /dev/null
+++ b/src/trace/python-stack-profiler/run_test.sh
@@ -0,0 +1,53 @@
+#!/bin/bash
+# Test script for Python stack profiler
+
+set -e
+
+echo "=== Python Stack Profiler Test ==="
+echo ""
+
+# Check if running as root
+if [ "$EUID" -ne 0 ]; then
+    echo "Please run as root (required for eBPF)"
+    exit 1
+fi
+
+# Build the profiler
+echo "Building Python stack profiler..."
+make clean
+make
+
+if [ ! -f "./python-stack" ]; then
+    echo "Error: Build failed"
+    exit 1
+fi
+
+echo "Build successful!"
+echo ""
+
+# Start Python test program in background
+echo "Starting Python test program..."
+python3 test_program.py &
+PYTHON_PID=$!
+
+echo "Python test program PID: $PYTHON_PID"
+echo "Waiting 2 seconds for it to start..."
+sleep 2
+
+# Run the profiler
+echo ""
+echo "Running profiler for 5 seconds..."
+./python-stack -p $PYTHON_PID -d 5 -F 49
+
+# Cleanup
+echo ""
+echo "Cleaning up..."
+kill $PYTHON_PID 2>/dev/null || true
+wait $PYTHON_PID 2>/dev/null || true
+
+echo ""
+echo "=== Test Complete ==="
+echo ""
+echo "To generate a flamegraph:"
+echo "  1. Run: ./python-stack -p <PID> -f > stacks.txt"
+echo "  2. Generate SVG: flamegraph.pl stacks.txt > flamegraph.svg"
diff --git a/src/trace/python-stack-profiler/test_program.py b/src/trace/python-stack-profiler/test_program.py
new file mode 100755
index 0000000..978561c
--- /dev/null
+++ b/src/trace/python-stack-profiler/test_program.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+"""
+Simple Python test program to demonstrate stack profiling
+This simulates a typical workload with multiple function calls
+"""
+
+import time
+import sys
+
+def expensive_computation(n):
+    """Simulate CPU-intensive work"""
+    result = 0
+    for i in range(n):
+        result += i ** 2
+    return result
+
+def process_data(iterations):
+    """Process data with nested function calls"""
+    results = []
+    for i in range(iterations):
+        value = expensive_computation(10000)
+        results.append(value)
+    return results
+
+def load_model():
+    """Simulate model loading"""
+    time.sleep(0.1)
+    data = process_data(50)
+    return sum(data)
+
+def main():
+    """Main function that orchestrates the workload"""
+    print("Python test program starting...")
+    print(f"PID: {__import__('os').getpid()}")
+    print("Running CPU-intensive workload...")
+
+    # Run for a while to allow profiling
+    for iteration in range(100):
+        result = load_model()
+        if iteration % 10 == 0:
+            print(f"Iteration {iteration}: result = {result}")
+
+    print("Test program completed.")
+
+if __name__ == "__main__":
+    main()