From 0e19d48331705c51a725c29ac82956eebff89814 Mon Sep 17 00:00:00 2001 From: yunwei37 Date: Sat, 4 Oct 2025 23:15:03 -0700 Subject: [PATCH] docs: add comprehensive BPF Arena tutorial detailing features, use cases, and examples --- src/features/bpf_arena/README.md | 366 ++++++++++++++++++++++ src/features/bpf_iters/README.md | 501 ++++++++++++++++++++---------- src/features/bpf_wq/README.md | 511 +++++++++++++------------------ 3 files changed, 920 insertions(+), 458 deletions(-) create mode 100644 src/features/bpf_arena/README.md diff --git a/src/features/bpf_arena/README.md b/src/features/bpf_arena/README.md new file mode 100644 index 0000000..788c85c --- /dev/null +++ b/src/features/bpf_arena/README.md @@ -0,0 +1,366 @@ +# eBPF Tutorial by Example: BPF Arena for Zero-Copy Shared Memory + +Ever tried building a linked list in eBPF and got stuck using awkward integer indices instead of real pointers? Or needed to share large amounts of data between your kernel BPF program and userspace without expensive syscalls? Traditional BPF maps force you to work around pointer limitations and require system calls for every access. What if you could just use normal C pointers and have direct memory access from both kernel and userspace? + +This is what **BPF Arena** solves. Created by Alexei Starovoitov in 2024, arena provides a sparse shared memory region where BPF programs can use real pointers to build complex data structures like linked lists, trees, and graphs, while userspace gets zero-copy direct access to the same memory. In this tutorial, we'll build a linked list in arena memory and show you how both kernel and userspace can manipulate it using standard pointer operations. + +## Introduction to BPF Arena: Breaking Free from Map Limitations + +### The Problem: When BPF Maps Aren't Enough + +Traditional BPF maps are fantastic for simple key-value storage, but they have fundamental limitations when you need complex data structures or large-scale data sharing. Let's look at what developers faced before arena existed. + +**Ring buffers** only work in one direction - BPF can send data to userspace, but userspace can't write back. They're streaming-only, no random access. **Hash and array maps** require syscalls like `bpf_map_lookup_elem()` for every access from userspace. Array maps allocate all their memory upfront, wasting space if you only use a fraction of entries. Most critically, **you can't use real pointers** - you're forced to use integer indices to link data structures together. + +Building a linked list the old way looked like this mess: + +```c +struct node { + int next_idx; // Can't use pointers, must use index! + int data; +}; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 10000); + __type(value, struct node); +} nodes_map SEC(".maps"); + +// Traverse requires repeated map lookups +int idx = head_idx; +while (idx != -1) { + struct node *n = bpf_map_lookup_elem(&nodes_map, &idx); + if (!n) break; + process(n->data); + idx = n->next_idx; // No pointer following! +} +``` + +Every node access requires a map lookup. You can't just follow pointers like normal C code. The verifier won't let you use pointers across different map entries. This makes implementing trees, graphs, or any pointer-based structure incredibly awkward and slow. + +### The Solution: Sparse Shared Memory with Real Pointers + +In 2024, Alexei Starovoitov from the Linux kernel team introduced BPF arena to solve these limitations. 
Arena provides a **sparse shared memory region** between BPF programs and userspace, supporting up to 4GB of address space. Memory pages are allocated on-demand as you use them, so you don't waste space. Both kernel BPF code and userspace programs can map the same arena and access it directly. + +The game-changer: you can use **real C pointers** in BPF programs targeting arena memory. The `__arena` annotation tells the verifier that these pointers reference arena space, and special address space casts (`cast_kern()`, `cast_user()`) let you safely convert between kernel and userspace views of the same memory. Userspace gets zero-copy access through `mmap()` - no syscalls needed to read or write arena data. + +Here's what the same linked list looks like with arena: + +```c +struct node __arena { + struct node __arena *next; // Real pointer! + int data; +}; + +struct node __arena *head; + +// Traverse with normal pointer following +struct node __arena *n = head; +while (n) { + process(n->data); + n = n->next; // Just follow the pointer! +} +``` + +Clean, simple, exactly how you'd write it in normal C. The verifier understands arena pointers and lets you dereference them safely. + +### Why This Matters + +Arena was inspired by research showing the potential for complex data structures in BPF. Before arena, developers were building hash tables, queues, and trees using giant BPF array maps with integer indices instead of pointers. It worked, but the code was ugly and slow. Arena unlocks several powerful use cases. + +**In-kernel data structures** become practical. You can implement custom hash tables with collision chaining, AVL or red-black trees for sorted data, graphs for network topology mapping, all using normal pointer operations. **Key-value store accelerators** can run in the kernel for maximum performance, with userspace getting direct access to the data structure without syscall overhead. **Bidirectional communication** works naturally - both kernel and userspace can modify shared data structures using lock-free algorithms. **Large data aggregation** scales up to 4GB instead of being limited by typical map size constraints. + +## Implementation: Building a Linked List in Arena Memory + +Let's build a complete example that demonstrates arena's power. We'll create a linked list where BPF programs add and delete elements using real pointers, while userspace directly accesses the list to compute sums without any syscalls. + +### Complete BPF Program: arena_list.bpf.c + +```c +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. 
*/ +#define BPF_NO_KFUNC_PROTOTYPES +#include +#include +#include +#include +#include "bpf_experimental.h" + +struct { + __uint(type, BPF_MAP_TYPE_ARENA); + __uint(map_flags, BPF_F_MMAPABLE); + __uint(max_entries, 100); /* number of pages */ +#ifdef __TARGET_ARCH_arm64 + __ulong(map_extra, 0x1ull << 32); /* start of mmap() region */ +#else + __ulong(map_extra, 0x1ull << 44); /* start of mmap() region */ +#endif +} arena SEC(".maps"); + +#include "bpf_arena_alloc.h" +#include "bpf_arena_list.h" + +struct elem { + struct arena_list_node node; + __u64 value; +}; + +struct arena_list_head __arena *list_head; +int list_sum; +int cnt; +bool skip = false; + +#ifdef __BPF_FEATURE_ADDR_SPACE_CAST +long __arena arena_sum; +int __arena test_val = 1; +struct arena_list_head __arena global_head; +#else +long arena_sum SEC(".addr_space.1"); +int test_val SEC(".addr_space.1"); +#endif + +int zero; + +SEC("syscall") +int arena_list_add(void *ctx) +{ +#ifdef __BPF_FEATURE_ADDR_SPACE_CAST + __u64 i; + + list_head = &global_head; + + for (i = zero; i < cnt && can_loop; i++) { + struct elem __arena *n = bpf_alloc(sizeof(*n)); + + test_val++; + n->value = i; + arena_sum += i; + list_add_head(&n->node, list_head); + } +#else + skip = true; +#endif + return 0; +} + +SEC("syscall") +int arena_list_del(void *ctx) +{ +#ifdef __BPF_FEATURE_ADDR_SPACE_CAST + struct elem __arena *n; + int sum = 0; + + arena_sum = 0; + list_for_each_entry(n, list_head, node) { + sum += n->value; + arena_sum += n->value; + list_del(&n->node); + bpf_free(n); + } + list_sum = sum; +#else + skip = true; +#endif + return 0; +} + +char _license[] SEC("license") = "GPL"; +``` + +### Understanding the BPF Code + +The program starts by defining the arena map itself. `BPF_MAP_TYPE_ARENA` tells the kernel this is arena memory, and `BPF_F_MMAPABLE` makes it accessible via `mmap()` from userspace. The `max_entries` field specifies how many pages (typically 4KB each) the arena can hold - here we allow up to 100 pages, or about 400KB. The `map_extra` field sets where in the virtual address space the arena gets mapped, using different addresses for ARM64 vs x86-64 to avoid conflicts with existing mappings. + +After defining the map, we include arena helpers. The `bpf_arena_alloc.h` file provides `bpf_alloc()` and `bpf_free()` functions - a simple memory allocator that works with arena pages, similar to `malloc()` and `free()` but specifically for arena memory. The `bpf_arena_list.h` file implements doubly-linked list operations using arena pointers, including `list_add_head()` to prepend nodes and `list_for_each_entry()` to iterate safely. + +Our `elem` structure contains the actual data. The `arena_list_node` member provides the `next` and `pprev` pointers for linking nodes together - these are arena pointers marked with `__arena`. The `value` field holds our payload data. Notice the `__arena` annotation on `list_head` - this tells the verifier this pointer references arena memory, not normal kernel memory. + +The `arena_list_add()` function creates list elements. It's marked `SEC("syscall")` because userspace will trigger it using `bpf_prog_test_run()`. The loop allocates new elements using `bpf_alloc(sizeof(*n))`, which returns an arena pointer. We can then dereference `n->value` directly - the verifier allows this because `n` is an arena pointer. The `list_add_head()` call prepends the new node to the list using normal pointer manipulation, all happening in arena memory. The `can_loop` check satisfies the verifier's bounded loop requirement. 
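To make the pointer-linking concrete, here is a simplified sketch of what the list node, head, and a head-insertion helper can look like with arena pointers. The real `bpf_arena_list.h` used by this example also handles kernel/user address-space casts and safe iteration, so treat the names and bodies below as an illustration rather than the actual header contents.

```c
/* Illustrative sketch only - the real bpf_arena_list.h additionally
 * performs address-space casts between kernel and user views. */
struct arena_list_node {
	struct arena_list_node __arena *next;
	struct arena_list_node __arena * __arena *pprev;
};

struct arena_list_head {
	struct arena_list_node __arena *first;
};

/* Prepend a node: plain pointer writes into arena memory, no map lookups. */
static inline void list_add_head_sketch(struct arena_list_node __arena *n,
					struct arena_list_head __arena *h)
{
	struct arena_list_node __arena *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}
```

The key point is that `next` and `pprev` are ordinary pointers into arena memory, so insertion is a handful of pointer writes instead of a chain of `bpf_map_lookup_elem()` calls.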
+ +The `arena_list_del()` function demonstrates iteration and cleanup. The `list_for_each_entry()` macro walks the list following arena pointers. Inside the loop, we sum values and delete nodes. The `bpf_free(n)` call returns memory to the arena allocator, decreasing the reference count and potentially freeing pages when the count hits zero. + +The address space cast feature is crucial. Some compilers support `__BPF_FEATURE_ADDR_SPACE_CAST` which enables the `__arena` annotation to work as a compiler address space. Without this support, we fall back to using explicit section annotations like `SEC(".addr_space.1")`. The code checks for this feature and skips execution if it's not available, preventing runtime errors. + +### Complete User-Space Program: arena_list.c + +```c +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ +#include +#include +#include +#include +#include +#include +#include + +#include "bpf_arena_list.h" +#include "arena_list.skel.h" + +struct elem { + struct arena_list_node node; + uint64_t value; +}; + +static int list_sum(struct arena_list_head *head) +{ + struct elem __arena *n; + int sum = 0; + + list_for_each_entry(n, head, node) + sum += n->value; + return sum; +} + +static void test_arena_list_add_del(int cnt) +{ + LIBBPF_OPTS(bpf_test_run_opts, opts); + struct arena_list_bpf *skel; + int expected_sum = (u_int64_t)cnt * (cnt - 1) / 2; + int ret, sum; + + skel = arena_list_bpf__open_and_load(); + if (!skel) { + fprintf(stderr, "Failed to open and load BPF skeleton\n"); + return; + } + + skel->bss->cnt = cnt; + ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts); + if (ret != 0) { + fprintf(stderr, "Failed to run arena_list_add: %d\n", ret); + goto out; + } + if (opts.retval != 0) { + fprintf(stderr, "arena_list_add returned %d\n", opts.retval); + goto out; + } + if (skel->bss->skip) { + printf("SKIP: compiler doesn't support arena_cast\n"); + goto out; + } + sum = list_sum(skel->bss->list_head); + printf("Sum of elements: %d (expected: %d)\n", sum, expected_sum); + printf("Arena sum: %ld (expected: %d)\n", skel->bss->arena_sum, expected_sum); + printf("Number of elements: %d (expected: %d)\n", skel->data->test_val, cnt + 1); + + ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_del), &opts); + if (ret != 0) { + fprintf(stderr, "Failed to run arena_list_del: %d\n", ret); + goto out; + } + sum = list_sum(skel->bss->list_head); + printf("Sum after deletion: %d (expected: 0)\n", sum); + printf("Sum computed by BPF: %d (expected: %d)\n", skel->bss->list_sum, expected_sum); + printf("Arena sum after deletion: %ld (expected: %d)\n", skel->bss->arena_sum, expected_sum); + + printf("\nTest passed!\n"); +out: + arena_list_bpf__destroy(skel); +} + +int main(int argc, char **argv) +{ + int cnt = 10; + + if (argc > 1) { + cnt = atoi(argv[1]); + if (cnt <= 0) { + fprintf(stderr, "Invalid count: %s\n", argv[1]); + return 1; + } + } + + printf("Testing arena list with %d elements\n", cnt); + test_arena_list_add_del(cnt); + + return 0; +} +``` + +### Understanding the User-Space Code + +The userspace program demonstrates zero-copy access to arena memory. When we load the BPF skeleton using `arena_list_bpf__open_and_load()`, libbpf automatically `mmap()`s the arena into userspace. The pointer `skel->bss->list_head` points directly into this mapped arena memory. + +The `list_sum()` function walks the linked list from userspace. 
Notice we're using the same `list_for_each_entry()` macro as the BPF code. The list is in arena memory, shared between kernel and userspace. Userspace can directly dereference arena pointers to access node values and follow `next` pointers - no syscalls needed. This is the zero-copy benefit: userspace reads memory directly from the mapped region. + +The test flow orchestrates the demonstration. First, we set `skel->bss->cnt` to specify how many list elements to create. Then `bpf_prog_test_run_opts()` executes the `arena_list_add` BPF program, which builds the list in arena memory. Once that returns, userspace immediately calls `list_sum()` to verify the list by walking it directly from userspace - no syscalls, just direct memory access. The expected sum is calculated as 0+1+2+...+(cnt-1), which equals cnt*(cnt-1)/2. + +After verifying the list, we run `arena_list_del` to remove all elements. This BPF program walks the list, computes its own sum, and calls `bpf_free()` on each node. Userspace then verifies the list is empty by calling `list_sum()` again, which should return 0. We also check that `skel->bss->list_sum` matches our expected value, confirming the BPF program computed the correct sum before deleting nodes. + +## Understanding Arena Memory Allocation + +The arena allocator deserves a closer look because it shows how BPF programs can implement sophisticated memory management in arena space. The allocator in `bpf_arena_alloc.h` uses a per-CPU page fragment approach to avoid locking. + +Each CPU maintains its own current page and offset. When you call `bpf_alloc(size)`, it first rounds up the size to 8-byte alignment. If the current page has enough space at the current offset, it allocates from there by just decrementing the offset and returning a pointer. If not enough space remains, it allocates a fresh page using `bpf_arena_alloc_pages()`, which is a kernel helper that gets arena pages from the kernel's page allocator. Each page maintains a reference count in its last 8 bytes, tracking how many allocated objects point into that page. + +The `bpf_free(addr)` function implements reference-counted deallocation. It rounds the address down to the page boundary, finds the reference count, and decrements it. When the count reaches zero - meaning all objects allocated from that page have been freed - it returns the entire page to the kernel using `bpf_arena_free_pages()`. This page-level reference counting means individual `bpf_free()` calls are fast, and memory is returned to the system only when appropriate. + +This allocator design avoids locks by using per-CPU state. Since BPF programs run with preemption disabled on a single CPU, the current CPU's page fragment can be accessed without synchronization. This makes `bpf_alloc()` extremely fast - typically just a few instructions to allocate from the current page. + +## Compilation and Execution + +Navigate to the bpf_arena directory and build the example: + +```bash +cd /home/yunwei37/workspace/bpf-developer-tutorial/src/features/bpf_arena +make +``` + +The Makefile compiles the BPF program with `-D__BPF_FEATURE_ADDR_SPACE_CAST` to enable arena pointer support. It uses `bpftool gen object` to process the compiled BPF object and generate a skeleton header that userspace can include. 
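Under the hood, `make` runs steps roughly equivalent to the commands below. The exact flags and intermediate file names come from the Makefile and differ per architecture, so take this as an approximate sketch of the build pipeline rather than the literal recipe:

```bash
# Approximate equivalent of what the Makefile does (flags and paths may differ)
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 \
      -D__BPF_FEATURE_ADDR_SPACE_CAST \
      -I. -c arena_list.bpf.c -o arena_list.bpf.o

# Post-process the object and generate the userspace skeleton header
bpftool gen object arena_list.linked.o arena_list.bpf.o
bpftool gen skeleton arena_list.linked.o > arena_list.skel.h

# Build the loader against libbpf
cc -g -O2 arena_list.c -o arena_list -lbpf -lelf -lz
```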
+ +Run the arena list test with 10 elements: + +```bash +sudo ./arena_list 10 +``` + +Expected output: + +``` +Testing arena list with 10 elements +Sum of elements: 45 (expected: 45) +Arena sum: 45 (expected: 45) +Number of elements: 11 (expected: 11) +Sum after deletion: 0 (expected: 0) +Sum computed by BPF: 45 (expected: 45) +Arena sum after deletion: 45 (expected: 45) + +Test passed! +``` + +Try it with more elements to see arena scaling: + +```bash +sudo ./arena_list 100 +``` + +The sum should be 4950 (100*99/2). Notice that userspace can verify the list by directly accessing arena memory without any syscalls. This zero-copy access is what makes arena powerful for large data structures. + +## When to Use Arena vs Other BPF Maps + +Choosing the right BPF map type depends on your access patterns and data structure needs. **Use regular BPF maps** (hash, array, etc.) when you need simple key-value storage, small data structures that fit well in maps, standard map operations like atomic updates, or per-CPU statistics without complex linking. Maps excel at straightforward use cases with kernel-provided operations. + +**Use BPF Arena** when you need complex linked structures like lists, trees, or graphs, large shared memory exceeding typical map sizes, zero-copy userspace access to avoid syscall overhead, or custom memory management beyond what maps provide. Arena shines for sophisticated data structures where pointer operations are natural. + +**Use Ring Buffers** when you need one-way streaming from BPF to userspace, event logs or trace data, or sequentially processed data without random access. Ring buffers are optimized for high-throughput event streams but don't support bidirectional access or complex data structures. + +The arena vs maps trade-off fundamentally comes down to pointers and access patterns. If you find yourself encoding indices to simulate pointers in BPF maps, arena is probably the better choice. If you need large-scale data structures accessible from both kernel and userspace, arena's zero-copy shared memory model is hard to beat. + +## Summary and Next Steps + +BPF Arena solves a fundamental limitation of traditional BPF maps by providing sparse shared memory where you can use real C pointers to build complex data structures. Created by Alexei Starovoitov in 2024, arena enables linked lists, trees, graphs, and custom allocators using normal pointer operations instead of awkward integer indices. Both kernel BPF programs and userspace can map the same arena for zero-copy bidirectional access, eliminating syscall overhead. + +Our linked list example demonstrates the core arena concepts: defining an arena map, using `__arena` annotations for pointer types, allocating memory with `bpf_alloc()`, and accessing the same data structure from both kernel and userspace. The per-CPU page fragment allocator shows how BPF programs can implement sophisticated memory management in arena space. Arena unlocks new possibilities for in-kernel data structures, key-value store accelerators, and large-scale data aggregation up to 4GB. + +> If you'd like to dive deeper into eBPF, check out our tutorial repository at or visit our website at . 
+ +## References + +- **Original Arena Patches:** +- **Meta's Arena Examples:** Linux kernel tree `samples/bpf/arena_*.c` +- **Tutorial Repository:** +- **Linux Kernel Source:** `kernel/bpf/arena.c` - Arena implementation +- **LLVM Address Spaces:** Documentation on `__arena` compiler support + +This example is adapted from Meta's arena_list.c in the Linux kernel samples, with educational enhancements. Requires Linux kernel 6.10+ with `CONFIG_BPF_ARENA=y` enabled. Complete source code available in the tutorial repository. diff --git a/src/features/bpf_iters/README.md b/src/features/bpf_iters/README.md index 6548292..1530440 100644 --- a/src/features/bpf_iters/README.md +++ b/src/features/bpf_iters/README.md @@ -1,55 +1,314 @@ -# BPF Iterators Tutorial +# eBPF Tutorial by Example: BPF Iterators for Kernel Data Export -## What are BPF Iterators? +Ever tried monitoring hundreds of processes and ended up parsing thousands of `/proc` files just to find the few you care about? Or needed custom formatted kernel data but didn't want to modify the kernel itself? Traditional `/proc` filesystem access is slow, inflexible, and forces you to process tons of data in userspace even when you only need a small filtered subset. -BPF iterators allow you to iterate over kernel data structures and export formatted data to userspace via `seq_file`. They're a modern replacement for traditional `/proc` files with **programmable, filterable, in-kernel data processing**. +This is what **BPF Iterators** solve. Introduced in Linux kernel 5.8, iterators let you traverse kernel data structures directly from BPF programs, apply filters in-kernel, and output exactly the data you need in any format you want. In this tutorial, we'll build a dual-mode iterator that shows kernel stack traces and open file descriptors for processes, with in-kernel filtering by process name - dramatically faster than parsing `/proc`. -## Real-World Example: Task Stack Iterator +## Introduction to BPF Iterators: The /proc Replacement -### The Problem with Traditional Approach +### The Problem: /proc is Slow and Rigid + +Traditional Linux monitoring revolves around the `/proc` filesystem. Need to see what processes are doing? Read `/proc/*/stack`. Want open files? Parse `/proc/*/fd/*`. This works, but it's painfully inefficient when you're monitoring systems at scale or need specific filtered views of kernel data. + +The performance problem is systemic. Every `/proc` access requires a syscall, kernel mode transition, text formatting, data copy to userspace, and then you parse that text back into structures. If you want stack traces for all "bash" processes among 1000 total processes, you still read all 1000 `/proc/*/stack` files and filter in userspace. That's 1000 syscalls, 1000 text parsing operations, and megabytes of data transferred just to find a handful of matches. + +Format inflexibility compounds the problem. The kernel chooses what data to show and how to format it. Want stack traces with custom annotations? Too bad, you get the kernel's fixed format. Need to aggregate data across processes? Parse everything in userspace. The `/proc` interface is designed for human consumption, not programmatic filtering and analysis. + +Here's what traditional monitoring looks like: -**Traditional method** (using `/proc` or system tools): ```bash -# Show all process stack traces -cat /proc/*/stack +# Find stack traces for all bash processes +for pid in $(pgrep bash); do + echo "=== PID $pid ===" + cat /proc/$pid/stack +done ``` -**Problems:** -1. 
❌ **No filtering** - Must read ALL processes, parse in userspace -2. ❌ **Fixed format** - Cannot customize output -3. ❌ **High overhead** - Context switches, string formatting, massive output -4. ❌ **Post-processing** - All filtering/aggregation in userspace -5. ❌ **Inflexible** - Want different fields? Modify kernel! +This spawns `pgrep` as a subprocess, makes a syscall per matching PID to read stack files, parses text output, and does all filtering in userspace. Simple to write, horrible for performance. -### BPF Iterator Solution +### The Solution: Programmable In-Kernel Iteration + +BPF iterators flip the model. Instead of pulling all data to userspace for processing, you push your processing logic into the kernel where the data lives. An iterator is a BPF program attached to a kernel data structure traversal that gets called for each element. The kernel walks tasks, files, or sockets, invokes your BPF program with each element's context, and your code decides what to output and how to format it. + +The architecture is elegant. You write a BPF program marked `SEC("iter/task")` or `SEC("iter/task_file")` that receives each task or file during iteration. Inside this program, you have direct access to kernel struct fields, can filter based on any criteria using normal C logic, and use `BPF_SEQ_PRINTF()` to format output exactly as needed. The kernel handles the iteration mechanics while your code focuses purely on filtering and formatting. + +When userspace reads from the iterator file descriptor, the magic happens entirely in the kernel. The kernel walks the task list, calls your BPF program for each task passing the task_struct pointer. Your program checks if the task name matches your filter - if not, it returns 0 immediately with no output. If it matches, your program extracts the stack trace and formats it to a seq_file. All this happens in kernel context before any data crosses to userspace. + +The benefits are transformative. **In-kernel filtering** means only relevant data crosses the kernel boundary, eliminating wasted work. **Custom formats** let you output binary, JSON, CSV, whatever your tools need. **Single read operation** replaces thousands of individual `/proc` file accesses. **Zero parsing** because you formatted the data correctly in the kernel. **Composability** works with standard Unix tools since iterator output comes through a normal file descriptor. + +### Iterator Types and Capabilities + +The kernel provides iterators for many subsystems. **Task iterators** (`iter/task`) walk all tasks giving you access to process state, credentials, resource usage, and parent-child relationships. **File iterators** (`iter/task_file`) traverse open file descriptors showing files, sockets, pipes, and other fd types. **Network iterators** (`iter/tcp`, `iter/udp`) walk active network connections with full socket state. **BPF object iterators** (`iter/bpf_map`, `iter/bpf_prog`) enumerate loaded BPF programs and maps for introspection. + +Our tutorial focuses on task and task_file iterators because they solve common monitoring needs and demonstrate core concepts applicable to all iterator types. + +## Implementation: Dual-Mode Task Iterator + +Let's build a complete example demonstrating two iterator types in one tool. We'll create a program that can show either kernel stack traces or open file descriptors for processes, with optional filtering by process name. 
+ +### Complete BPF Program: task_stack.bpf.c + +```c +// SPDX-License-Identifier: GPL-2.0 +/* Kernel task stack and file descriptor iterator */ +#include +#include + +char _license[] SEC("license") = "GPL"; + +#define MAX_STACK_TRACE_DEPTH 64 +unsigned long entries[MAX_STACK_TRACE_DEPTH] = {}; +#define SIZE_OF_ULONG (sizeof(unsigned long)) + +/* Filter: only show stacks for tasks with this name (empty = show all) */ +char target_comm[16] = ""; +__u32 stacks_shown = 0; +__u32 files_shown = 0; + +/* Task stack iterator */ +SEC("iter/task") +int dump_task_stack(struct bpf_iter__task *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + struct task_struct *task = ctx->task; + long i, retlen; + int match = 1; + + if (task == (void *)0) { + /* End of iteration - print summary */ + if (stacks_shown > 0) { + BPF_SEQ_PRINTF(seq, "\n=== Summary: %u task stacks shown ===\n", + stacks_shown); + } + return 0; + } + + /* Filter by task name if specified */ + if (target_comm[0] != '\0') { + match = 0; + for (i = 0; i < 16; i++) { + if (task->comm[i] != target_comm[i]) + break; + if (task->comm[i] == '\0') { + match = 1; + break; + } + } + if (!match) + return 0; + } + + /* Get kernel stack trace for this task */ + retlen = bpf_get_task_stack(task, entries, + MAX_STACK_TRACE_DEPTH * SIZE_OF_ULONG, 0); + if (retlen < 0) + return 0; + + stacks_shown++; + + /* Print task info and stack trace */ + BPF_SEQ_PRINTF(seq, "=== Task: %s (pid=%u, tgid=%u) ===\n", + task->comm, task->pid, task->tgid); + BPF_SEQ_PRINTF(seq, "Stack depth: %u frames\n", retlen / SIZE_OF_ULONG); + + for (i = 0; i < MAX_STACK_TRACE_DEPTH; i++) { + if (retlen > i * SIZE_OF_ULONG) + BPF_SEQ_PRINTF(seq, " [%2ld] %pB\n", i, (void *)entries[i]); + } + BPF_SEQ_PRINTF(seq, "\n"); + + return 0; +} + +/* Task file descriptor iterator */ +SEC("iter/task_file") +int dump_task_file(struct bpf_iter__task_file *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + struct task_struct *task = ctx->task; + struct file *file = ctx->file; + __u32 fd = ctx->fd; + long i; + int match = 1; + + if (task == (void *)0 || file == (void *)0) { + if (files_shown > 0 && ctx->meta->seq_num > 0) { + BPF_SEQ_PRINTF(seq, "\n=== Summary: %u file descriptors shown ===\n", + files_shown); + } + return 0; + } + + /* Filter by task name if specified */ + if (target_comm[0] != '\0') { + match = 0; + for (i = 0; i < 16; i++) { + if (task->comm[i] != target_comm[i]) + break; + if (task->comm[i] == '\0') { + match = 1; + break; + } + } + if (!match) + return 0; + } + + if (ctx->meta->seq_num == 0) { + BPF_SEQ_PRINTF(seq, "%-16s %8s %8s %6s %s\n", + "COMM", "TGID", "PID", "FD", "FILE_OPS"); + } + + files_shown++; + + BPF_SEQ_PRINTF(seq, "%-16s %8d %8d %6d 0x%lx\n", + task->comm, task->tgid, task->pid, fd, + (long)file->f_op); + + return 0; +} +``` + +### Understanding the BPF Code + +The program implements two separate iterators sharing common filtering logic. The `SEC("iter/task")` annotation registers `dump_task_stack` as a task iterator - the kernel will call this function once for each task in the system. The context structure `bpf_iter__task` provides three critical pieces: the `meta` field containing iteration metadata and the seq_file for output, the `task` pointer to the current task_struct, and a NULL task pointer when iteration finishes so you can print summaries. + +The task stack iterator shows in-kernel filtering in action. When `task` is NULL, we've reached the end of iteration and can print summary statistics showing how many tasks matched our filter. 
For each task, we first apply filtering by comparing `task->comm` (the process name) against `target_comm`. We can't use standard library functions like `strcmp()` in BPF, so we manually loop through characters comparing byte by byte. If the names don't match and filtering is enabled, we immediately return 0 with no output - this task is skipped entirely in the kernel without crossing to userspace. + +Once a task passes filtering, we extract its kernel stack trace using `bpf_get_task_stack()`. This BPF helper captures up to 64 stack frames into our `entries` array, returning the number of bytes written. We format the output using `BPF_SEQ_PRINTF()` which writes to the kernel's seq_file infrastructure. The special `%pB` format specifier symbolizes kernel addresses, turning raw pointers into human-readable function names like `schedule+0x42/0x100`. This makes stack traces immediately useful for debugging. + +The file descriptor iterator demonstrates a different iterator type. `SEC("iter/task_file")` tells the kernel to call this function for every open file descriptor across all tasks. The context provides `task`, `file` (the kernel's struct file pointer), and `fd` (the numeric file descriptor). We apply the same task name filtering, then format output as a table. Using `ctx->meta->seq_num` to detect the first output lets us print column headers exactly once. + +Notice how filtering happens before any expensive operations. We check the task name first, and only if it matches do we extract stack traces or format file information. This minimizes work in the kernel fast path - non-matching tasks are rejected with just a string comparison, no memory allocation, no formatting, no output. + +### Complete User-Space Program: task_stack.c + +```c +// SPDX-License-Identifier: GPL-2.0 +/* Userspace program for task stack and file iterator */ +#include +#include +#include +#include +#include +#include "task_stack.skel.h" + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + return vfprintf(stderr, format, args); +} + +static void run_iterator(const char *name, struct bpf_program *prog) +{ + struct bpf_link *link; + int iter_fd, len; + char buf[8192]; + + link = bpf_program__attach_iter(prog, NULL); + if (!link) { + fprintf(stderr, "Failed to attach %s iterator\n", name); + return; + } + + iter_fd = bpf_iter_create(bpf_link__fd(link)); + if (iter_fd < 0) { + fprintf(stderr, "Failed to create %s iterator: %d\n", name, iter_fd); + bpf_link__destroy(link); + return; + } + + while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) { + buf[len] = '\0'; + printf("%s", buf); + } + + close(iter_fd); + bpf_link__destroy(link); +} + +int main(int argc, char **argv) +{ + struct task_stack_bpf *skel; + int err; + int show_files = 0; + + libbpf_set_print(libbpf_print_fn); + + /* Parse arguments */ + if (argc > 1 && strcmp(argv[1], "--files") == 0) { + show_files = 1; + argc--; + argv++; + } + + /* Open BPF application */ + skel = task_stack_bpf__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF skeleton\n"); + return 1; + } + + /* Configure filter before loading */ + if (argc > 1) { + strncpy(skel->bss->target_comm, argv[1], sizeof(skel->bss->target_comm) - 1); + printf("Filtering for tasks matching: %s\n\n", argv[1]); + } else { + printf("Usage: %s [--files] [comm]\n", argv[0]); + printf(" --files Show open file descriptors instead of stacks\n"); + printf(" comm Filter by process name\n\n"); + } + + /* Load BPF program */ + err = task_stack_bpf__load(skel); + 
if (err) { + fprintf(stderr, "Failed to load BPF skeleton\n"); + goto cleanup; + } + + if (show_files) { + printf("=== BPF Task File Descriptor Iterator ===\n\n"); + run_iterator("task_file", skel->progs.dump_task_file); + } else { + printf("=== BPF Task Stack Iterator ===\n\n"); + run_iterator("task", skel->progs.dump_task_stack); + } + +cleanup: + task_stack_bpf__destroy(skel); + return err; +} +``` + +### Understanding the User-Space Code + +The userspace program showcases how simple iterator usage is once you understand the pattern. The `run_iterator()` function encapsulates the three-step iterator lifecycle. First, `bpf_program__attach_iter()` attaches the BPF program to the iterator infrastructure, registering it to be called during iteration. Second, `bpf_iter_create()` creates a file descriptor representing an iterator instance. Third, simple `read()` calls consume the iterator output. + +Here's what makes this powerful: when you read from the iterator fd, the kernel transparently starts walking tasks or files. For each element, it calls your BPF program passing the element's context. Your BPF code filters and formats output to a seq_file buffer. The kernel accumulates this output and returns it through the read() call. From userspace's perspective, it's just reading a file - all the iteration, filtering, and formatting complexity is hidden in the kernel. + +The main function handles mode selection and configuration. We parse command-line arguments to determine whether to show stacks or files, and what process name to filter for. Critically, we set `skel->bss->target_comm` before loading the BPF program. This writes the filter string into the BPF program's global data section, making it visible to kernel code when the program runs. This is how we pass configuration from userspace to kernel without complex communication channels. + +After loading, we select which iterator to run based on the `--files` flag. Both iterators use the same filtering logic, but produce different output - one shows stack traces, the other shows file descriptors. The shared filtering code demonstrates how BPF programs can implement reusable logic across different iterator types. + +## Compilation and Execution + +Navigate to the bpf_iters directory and build: + +```bash +cd /home/yunwei37/workspace/bpf-developer-tutorial/src/features/bpf_iters +make +``` + +The Makefile compiles the BPF program with BTF support and generates a skeleton header containing the compiled bytecode embedded in C structures. This skeleton API makes BPF program loading trivial. + +Show kernel stack traces for all systemd processes: -**Our implementation** (`task_stack.bpf.c`): ```bash -# Show only systemd tasks with kernel stack traces sudo ./task_stack systemd ``` -**Benefits:** -1. ✅ **In-kernel filtering** - Only selected processes sent to userspace -2. ✅ **Custom format** - Choose exactly what fields to show -3. ✅ **Low overhead** - Filter before copying to userspace -4. ✅ **Programmable** - Add statistics, calculations, aggregations -5. 
✅ **Dynamic** - Load different filters without kernel changes - -### Performance Comparison - -| Operation | Traditional `/proc` | BPF Iterator | -|-----------|-------------------|--------------| -| Read all stacks | Parse 1000+ files | Single read() call | -| Filter by name | Userspace loop | In-kernel filter | -| Data transfer | MB of text | KB of relevant data | -| CPU usage | High (parsing) | Low (pre-filtered) | -| Customization | Recompile kernel | Load new BPF program | - -## Example Output +Expected output: ``` -$ sudo ./task_stack systemd Filtering for tasks matching: systemd === BPF Task Stack Iterator === @@ -63,143 +322,61 @@ Stack depth: 6 frames [ 4] do_syscall_64+0x7e/0x170 [ 5] entry_SYSCALL_64_after_hwframe+0x76/0x7e -=== Summary: 2 task stacks shown === +=== Summary: 1 task stacks shown === ``` -## How It Works - -### 1. BPF Program (`task_stack.bpf.c`) - -```c -SEC("iter/task") -int dump_task_stack(struct bpf_iter__task *ctx) -{ - struct task_struct *task = ctx->task; - - // In-kernel filtering by task name - if (target_comm[0] != '\0' && !match_name(task->comm)) - return 0; // Skip this task - - // Get kernel stack trace - bpf_get_task_stack(task, entries, MAX_DEPTH * SIZE_OF_ULONG, 0); - - // Format and output to seq_file - BPF_SEQ_PRINTF(seq, "Task: %s (pid=%u)\n", task->comm, task->pid); - - return 0; -} -``` - -### 2. Userspace Program (`task_stack.c`) - -```c -// Attach iterator -link = bpf_program__attach_iter(skel->progs.dump_task_stack, NULL); - -// Create iterator instance -iter_fd = bpf_iter_create(bpf_link__fd(link)); - -// Read output -while ((len = read(iter_fd, buf, sizeof(buf))) > 0) { - printf("%s", buf); -} -``` - -## Available Iterator Types - -The kernel provides many iterator types: - -### System Iterators -- `iter/task` - Iterate all tasks/processes -- `iter/ksym` - Kernel symbols (like `/proc/kallsyms`) -- `iter/bpf_map` - All BPF maps in system -- `iter/bpf_link` - All BPF links - -### Network Iterators -- `iter/tcp` - TCP sockets (replaces `/proc/net/tcp`) -- `iter/udp` - UDP sockets -- `iter/unix` - Unix domain sockets -- `iter/netlink` - Netlink sockets - -### Map Iterators -- `iter/bpf_map_elem` - Iterate map elements -- `iter/sockmap` - Socket map entries - -### Task/Process Iterators -- `iter/task_file` - Task file descriptors (like `/proc/PID/fd`) -- `iter/task_vma` - Task memory mappings (like `/proc/PID/maps`) - -## Use Cases - -### 1. Performance Monitoring -- Track high-latency network connections -- Monitor stuck processes (long-running syscalls) -- Identify memory-hungry tasks - -### 2. Debugging -- Capture stack traces of specific processes -- Dump kernel state for analysis -- Trace system calls in real-time - -### 3. Security -- Monitor process creation patterns -- Track network connection attempts -- Audit file access patterns - -### 4. 
Custom `/proc` Replacements -- Create application-specific views -- Filter and aggregate kernel data -- Reduce userspace processing overhead - -## Building and Running +Show open file descriptors for bash processes: ```bash -# Build -cd /home/yunwei37/workspace/bpf-developer-tutorial/src/features/bpf_iters -make - -# Run - show all tasks -sudo ./task_stack - -# Run - filter by task name -sudo ./task_stack systemd -sudo ./task_stack bash +sudo ./task_stack --files bash ``` -## Key Differences: Iterator Types +Expected output: -### Kernel Iterators (`SEC("iter/...")`) -- **Purpose**: Export kernel data to userspace -- **Output**: seq_file (readable via read()) -- **Activation**: Attach, create instance, read FD -- **Example**: Task stacks, TCP sockets, kernel symbols +``` +Filtering for tasks matching: bash -### Open-Coded Iterators (`bpf_for`, `bpf_iter_num`) -- **Purpose**: Loop constructs within BPF programs -- **Output**: Internal program variables -- **Activation**: Execute during program run -- **Example**: Sum numbers, count elements, iterate arrays +=== BPF Task File Descriptor Iterator === -## Advantages Over Traditional Approaches +COMM TGID PID FD FILE_OPS +bash 12345 12345 0 0xffffffff81e3c6e0 +bash 12345 12345 1 0xffffffff81e3c6e0 +bash 12345 12345 2 0xffffffff81e3c6e0 +bash 12345 12345 255 0xffffffff82145dc0 -| Feature | Traditional `/proc` | BPF Iterators | -|---------|-------------------|---------------| -| **Filtering** | Userspace only | In-kernel | -| **Performance** | High overhead | Minimal overhead | -| **Customization** | Kernel rebuild | Load BPF program | -| **Format** | Fixed | Fully programmable | -| **Statistics** | Userspace calc | In-kernel aggregation | -| **Security** | No filtering | LSM hooks available | -| **Deployment** | Static | Dynamic (load anytime) | +=== Summary: 4 file descriptors shown === +``` -## Summary +Run without filtering to see all tasks: -BPF iterators are **game-changing** for system observability: +```bash +sudo ./task_stack +``` -1. **Performance**: Filter in kernel, only send relevant data -2. **Flexibility**: Load different programs for different views -3. **Power**: Access raw kernel structures with type safety (BTF) -4. **Safety**: Verified by BPF verifier, can't crash kernel -5. **Portability**: CO-RE ensures binary works across kernel versions +This shows stacks for every task in the system. On a typical desktop, this might display hundreds of tasks. Notice how fast it runs compared to parsing `/proc/*/stack` for all processes - the iterator is dramatically more efficient. -They enable creating **custom, high-performance system monitoring tools** without modifying the kernel! +## When to Use BPF Iterators vs /proc + +Choose **BPF iterators** when you need filtered kernel data without userspace processing overhead, custom output formats that don't match `/proc` text, performance-critical monitoring that runs frequently, or integration with BPF-based observability infrastructure. Iterators excel when you're monitoring many entities but only care about a subset, or when you need to aggregate and transform data in the kernel. + +Choose **/proc** when you need simple one-off queries, are debugging or prototyping where development speed matters more than runtime performance, want maximum portability across kernel versions (iterators require relatively recent kernels), or run in restricted environments where you can't load BPF programs. + +The fundamental trade-off is processing location. 
Iterators push filtering and formatting into the kernel for efficiency and flexibility, while `/proc` keeps the kernel simple and does all processing in userspace. For production monitoring of complex systems, iterators usually win due to their performance benefits and programming flexibility. + +## Summary and Next Steps + +BPF iterators revolutionize how we export kernel data by enabling programmable, filtered iteration directly from BPF code. Instead of repeatedly reading and parsing `/proc` files, you write a BPF program that iterates kernel structures in-kernel, applies filtering at the source, and formats output exactly as needed. This eliminates massive overhead from syscalls, mode transitions, and userspace parsing while providing complete flexibility in output format. + +Our dual-mode iterator demonstrates both task and file iteration, showing how one BPF program can export multiple views of kernel data with shared filtering logic. The kernel handles complex iteration mechanics while your BPF code focuses purely on filtering and formatting. Iterators integrate seamlessly with standard Unix tools through their file descriptor interface, making them composable building blocks for sophisticated monitoring pipelines. + +> If you'd like to dive deeper into eBPF, check out our tutorial repository at or visit our website at . + +## References + +- **BPF Iterator Documentation:** +- **Kernel Iterator Selftests:** Linux kernel tree `tools/testing/selftests/bpf/*iter*.c` +- **Tutorial Repository:** +- **libbpf Iterator API:** +- **BPF Helpers Manual:** + +Examples adapted from Linux kernel BPF selftests with educational enhancements. Requires Linux kernel 5.8+ for iterator support, BTF enabled, and libbpf. Complete source code available in the tutorial repository. diff --git a/src/features/bpf_wq/README.md b/src/features/bpf_wq/README.md index aba34bc..7b9e923 100644 --- a/src/features/bpf_wq/README.md +++ b/src/features/bpf_wq/README.md @@ -1,233 +1,256 @@ -# BPF Workqueues Tutorial +# eBPF Tutorial by Example: BPF Workqueues for Asynchronous Sleepable Tasks -## What are BPF Workqueues? +Ever needed your eBPF program to sleep, allocate memory, or wait for device I/O? Traditional eBPF programs run in restricted contexts where blocking operations crash the system. But what if your HID device needs timing delays between injected key events, or your cleanup routine needs to sleep while freeing resources? -BPF workqueues allow you to schedule **asynchronous work** from BPF programs. This enables: -- Deferred processing -- Non-blocking operations -- Background task execution -- Sleepable context for long-running operations +This is what **BPF Workqueues** enable. Created by Benjamin Tissoires at Red Hat in 2024 for HID-BPF device handling, workqueues let you schedule asynchronous work that runs in process context where sleeping and blocking operations are allowed. In this tutorial, we'll explore why workqueues were created, how they differ from timers, and build a complete example demonstrating async callback execution. -## The Problem +## Introduction to BPF Workqueues: Solving the Sleep Problem -### Before bpf_wq: Limitations of bpf_timer +### The Problem: When eBPF Can't Sleep -**bpf_timer** runs in **softirq context**, which has severe limitations: -- ❌ Cannot sleep -- ❌ Cannot use `kzalloc()` (memory allocation) -- ❌ Cannot wait for device I/O -- ❌ Cannot perform any blocking operations +Before BPF workqueues existed, developers had `bpf_timer` for deferred execution. 
Timers work great for scheduling callbacks after a delay, perfect for updating counters or triggering periodic events. But there's a fundamental limitation that made timers unusable for certain critical use cases: **bpf_timer runs in softirq (software interrupt) context**. -### Real-World Use Case: HID Device Handling +Softirq context has strict rules enforced by the kernel. You cannot sleep or wait for I/O - any attempt to do so will cause kernel panics or deadlocks. You cannot allocate memory using `kzalloc()` with `GFP_KERNEL` flag because memory allocation might need to wait for pages. You cannot communicate with hardware devices that require waiting for responses. Essentially, you cannot perform any blocking operations that might cause the CPU to wait. -**Problem**: HID (Human Interface Devices - keyboards, mice, tablets) devices need to: -1. **React to events asynchronously** - Transform input, inject new events -2. **Communicate with hardware** - Re-initialize devices after sleep/wake -3. **Perform device I/O** - Send commands, wait for responses +This limitation became a real problem for Benjamin Tissoires at Red Hat when he was developing HID-BPF in 2023. HID devices (keyboards, mice, tablets, game controllers) frequently need operations that timers simply can't handle. Imagine implementing keyboard macro functionality where pressing F1 types "hello" - you need 10ms delays between each keystroke for the system to properly process events. Or consider a device with buggy firmware that needs re-initialization after system wake - you must send commands and wait for hardware responses. Timer callbacks in softirq context can't do any of this. -**These operations require sleepable context!** +As Benjamin Tissoires explained in his kernel patches: "I need something similar to bpf_timers, but not in soft IRQ context... the bpf_timer functionality would prevent me to kzalloc and wait for the device." -## The Solution: bpf_wq +### The Solution: Process Context Execution -Developed by **Benjamin Tissoires** (Red Hat) in 2024 as part of HID-BPF work. +In early 2024, Benjamin proposed and developed **bpf_wq** - essentially "bpf_timer but in process context instead of softirq." The kernel community merged it into Linux v6.10+ in April 2024. The key insight is simple but powerful: by running callbacks in process context (through the kernel's workqueue infrastructure), BPF programs gain access to the full range of kernel operations. -### Key Quote from Kernel Patches: -> "I need something similar to bpf_timers, but not in soft IRQ context... 
-> the bpf_timer functionality would prevent me to kzalloc and wait for the device" +Here's what changes with process context: -### What bpf_wq Provides: -- ✅ **Sleepable context** - Can perform blocking operations -- ✅ **Memory allocation** - Can use `kzalloc()` safely -- ✅ **Device I/O** - Can wait for hardware responses -- ✅ **Asynchronous execution** - Deferred work without blocking main path +| Feature | bpf_timer (softirq) | bpf_wq (process) | +|---------|---------------------|------------------| +| **Can sleep?** | ❌ No - will crash | ✅ Yes - safe to sleep | +| **Memory allocation** | ❌ Limited flags only | ✅ Full `kzalloc()` support | +| **Device I/O** | ❌ Cannot wait | ✅ Can wait for responses | +| **Blocking operations** | ❌ Prohibited | ✅ Fully supported | +| **Latency** | Very low (microseconds) | Higher (milliseconds) | +| **Use case** | Time-critical fast path | Sleepable slow path | -## Real-World Applications +Workqueues enable the classic "fast path + slow path" pattern. Your eBPF program handles performance-critical operations immediately in the fast path, then schedules expensive cleanup or I/O operations to run asynchronously in the slow path. The fast path stays responsive while the slow path gets the capabilities it needs. -### 1. HID Device Quirks and Fixes +### Real-World Applications -**Problem**: Many HID devices have firmware bugs or quirks requiring workarounds. +The applications span multiple domains. **HID device handling** was the original motivation - injecting keyboard macros with timing delays, fixing broken device firmware dynamically without kernel drivers, re-initializing devices after wake from sleep, transforming input events on the fly. All these require sleepable operations that only workqueues can provide. -**Before bpf_wq**: Write kernel drivers, recompile kernel -**With bpf_wq**: Load BPF program to fix device behavior dynamically +**Network packet processing** benefits from async cleanup patterns. Your XDP program enforces rate limits and drops packets in the fast path (non-blocking), while a workqueue cleans up stale tracking entries in the background. This prevents memory leaks without impacting packet processing performance. -**Example Use Cases**: -- Transform single key press into macro sequence -- Fix devices that forget to send button release events -- Invert mouse coordinates for broken hardware -- Re-initialize device after wake from sleep +**Security monitoring** can apply fast rules immediately, then use workqueues to query reputation databases or external threat intelligence services. The fast path makes instant decisions while the slow path updates policies based on complex analysis. -### 2. Network Packet Processing +**Resource cleanup** defers expensive operations. Instead of blocking the main code path while freeing memory, closing connections, or compacting data structures, you schedule a workqueue to handle cleanup in the background. -**Problem**: Rate limiting requires tracking state and cleaning up old entries. +## Implementation: Simple Workqueue Test -**Before**: Either block packet processing OR leak memory -**With bpf_wq**: -- Fast path: Check limits, drop packets (non-blocking) -- Slow path: Workqueue cleans up stale entries (async) +Let's build a complete example that demonstrates the workqueue lifecycle. We'll create a program that triggers on the `unlink` syscall, schedules async work, and verifies that both the main path and workqueue callback execute correctly. -### 3. 
Security and Monitoring - -**Problem**: Security decisions need to consult external services or databases. - -**Before**: All decisions must be instant (no waiting) -**With bpf_wq**: -- Fast path: Apply known rules immediately -- Slow path: Query reputation databases, update policy - -### 4. Resource Cleanup - -**Problem**: Freeing resources (memory, connections) can be expensive. - -**Before**: Block main path during cleanup -**With bpf_wq**: Defer cleanup to background workqueue - -## Technical Architecture - -### Comparison: bpf_timer vs bpf_wq - -| Feature | bpf_timer | bpf_wq | -|---------|-----------|--------| -| **Context** | Softirq (interrupt) | Process (workqueue) | -| **Can sleep?** | ❌ No | ✅ Yes | -| **Memory allocation** | ❌ No | ✅ Yes | -| **Device I/O** | ❌ No | ✅ Yes | -| **Latency** | Very low (μs) | Higher (ms) | -| **Use case** | Time-critical | Sleepable operations | - -### When to Use Each - -**Use bpf_timer when:** -- You need microsecond-level precision -- Operations are fast and non-blocking -- You're just updating counters or state - -**Use bpf_wq when:** -- You need to sleep or wait -- You need memory allocation -- You need device/network I/O -- Cleanup can happen later - -## Code Example: Why Workqueue Matters - -### ❌ Cannot Do with bpf_timer (softirq): -```c -// This FAILS in bpf_timer callback (softirq context) -static int timer_callback(void *map, int *key, void *value) -{ - // ERROR: Cannot allocate in softirq! - struct data *d = kmalloc(sizeof(*d), GFP_KERNEL); - - // ERROR: Cannot sleep in softirq! - send_device_command_and_wait(device); - - return 0; -} -``` - -### ✅ Works with bpf_wq (workqueue): -```c -// This WORKS in bpf_wq callback (process context) -static int wq_callback(void *map, int *key, void *value) -{ - // OK: Can allocate in process context - struct data *d = kmalloc(sizeof(*d), GFP_KERNEL); - - // OK: Can sleep/wait in process context - send_device_command_and_wait(device); - - // OK: Can do blocking I/O - write_to_file(log_file, data); - - kfree(d); - return 0; -} -``` - -## Historical Timeline - -1. **2022**: Benjamin Tissoires starts HID-BPF work -2. **2023**: Realizes bpf_timer limitations for HID device I/O -3. **Early 2024**: Proposes bpf_wq as "bpf_timer in process context" -4. **April 2024**: bpf_wq merged into kernel (v6.10+) -5. **2024-Present**: Used for HID quirks, rate limiting, async cleanup - -## Key Takeaway - -**bpf_wq exists because real-world device handling and resource management need sleepable, blocking operations that bpf_timer cannot provide.** - -It enables BPF programs to: -- Fix hardware quirks without kernel drivers -- Perform async cleanup without blocking -- Wait for I/O without hanging the system -- Do "slow work" without impacting "fast path" - -**Bottom line**: bpf_wq brings true asynchronous, sleepable programming to BPF! - -## How It Works - -### 1. 
Workqueue Structure - -Embed a `struct bpf_wq` in your map value: +### Complete BPF Program: wq_simple.bpf.c ```c +// SPDX-License-Identifier: GPL-2.0 +/* Simple BPF workqueue example */ +#include +#include +#include "bpf_experimental.h" + +char LICENSE[] SEC("license") = "GPL"; + +/* Element with embedded workqueue */ struct elem { - int value; - struct bpf_wq work; // Embedded workqueue + int value; + struct bpf_wq work; }; +/* Array to store our element */ struct { - __uint(type, BPF_MAP_TYPE_ARRAY); - __type(value, struct elem); + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, int); + __type(value, struct elem); } array SEC(".maps"); -``` -### 2. Initialize and Schedule +/* Result variables */ +__u32 wq_executed = 0; +__u32 main_executed = 0; -```c +/* Workqueue callback - runs asynchronously in workqueue context */ +static int wq_callback(void *map, int *key, void *value) +{ + struct elem *val = value; + /* This runs later in workqueue context */ + wq_executed = 1; + val->value = 42; /* Modify the value asynchronously */ + return 0; +} + +/* Main program - schedules work */ SEC("fentry/do_unlinkat") int test_workqueue(void *ctx) { - struct elem *val = bpf_map_lookup_elem(&array, &key); - struct bpf_wq *wq = &val->work; + struct elem init = {.value = 0}, *val; + struct bpf_wq *wq; + int key = 0; - // Initialize workqueue - bpf_wq_init(wq, &array, 0); + main_executed = 1; - // Set callback function - bpf_wq_set_callback(wq, callback_fn, 0); + /* Initialize element in map */ + bpf_map_update_elem(&array, &key, &init, 0); - // Schedule async execution - bpf_wq_start(wq, 0); + /* Get element from map */ + val = bpf_map_lookup_elem(&array, &key); + if (!val) + return 0; - return 0; + /* Initialize workqueue */ + wq = &val->work; + if (bpf_wq_init(wq, &array, 0) != 0) + return 0; + + /* Set callback function */ + if (bpf_wq_set_callback(wq, wq_callback, 0)) + return 0; + + /* Schedule work to run asynchronously */ + if (bpf_wq_start(wq, 0)) + return 0; + + return 0; } ``` -### 3. Callback Execution +### Understanding the BPF Code + +The program demonstrates the complete workqueue workflow from initialization through async execution. We start by defining a structure that embeds a workqueue. The `struct elem` contains both application data (`value`) and the workqueue handle (`struct bpf_wq work`). This embedding pattern is critical - the workqueue infrastructure needs to know which map contains the workqueue structure, and embedding it in the map value establishes this relationship. + +Our map is a simple array with one entry, chosen for simplicity in this example. In production code, you'd typically use hash maps to track multiple entities, each with its own embedded workqueue. The global variables `wq_executed` and `main_executed` serve as test instrumentation, letting userspace verify that both code paths ran. + +The workqueue callback shows the signature that all workqueue callbacks must follow: `int callback(void *map, int *key, void *value)`. The kernel invokes this function asynchronously in process context, passing the map containing the workqueue, the key of the entry, and a pointer to the value. This signature gives the callback full context about which element triggered it and access to the element's data. Our callback sets `wq_executed = 1` to prove it ran, and modifies `val->value = 42` to demonstrate that async modifications persist in the map. + +The main program attached to `fentry/do_unlinkat` triggers whenever the `unlink` syscall executes. 
The workqueue callback shows the signature that all workqueue callbacks must follow: `int callback(void *map, int *key, void *value)`. The kernel invokes this function asynchronously in process context, passing the map containing the workqueue, the key of the entry, and a pointer to the value. This signature gives the callback full context about which element triggered it and access to the element's data. Our callback sets `wq_executed = 1` to prove it ran, and modifies `val->value = 42` to demonstrate that async modifications persist in the map.

The main program attached to `fentry/do_unlinkat` triggers whenever the `unlink` syscall executes. This gives us an easy way to activate the program - userspace just needs to delete a file. We set `main_executed = 1` immediately to mark the synchronous path. Then we initialize an element and store it in the map using `bpf_map_update_elem()`. This is necessary because the workqueue must be embedded in a map entry.

The workqueue initialization follows a three-step sequence. First, `bpf_wq_init(wq, &array, 0)` initializes the workqueue handle, passing the map that contains it. The verifier uses this information to validate that the workqueue and its container are properly related. Second, `bpf_wq_set_callback(wq, wq_callback, 0)` registers our callback function. The verifier checks that the callback has the correct signature. Third, `bpf_wq_start(wq, 0)` schedules the workqueue to execute asynchronously. This call returns immediately - the main program continues executing while the kernel queues the work for later execution in process context.

The flags parameter in all three functions is reserved for future use and should be 0 in current kernels. The pattern allows future extensions without breaking API compatibility.

### Complete User-Space Program: wq_simple.c

```c
// SPDX-License-Identifier: GPL-2.0
/* Userspace test for BPF workqueue */
#include <stdio.h>
#include <stdarg.h>
#include <unistd.h>
#include <fcntl.h>
#include <bpf/libbpf.h>
#include "wq_simple.skel.h"

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char **argv)
{
    struct wq_simple_bpf *skel;
    int err, fd;

    libbpf_set_print(libbpf_print_fn);

    /* Open and load BPF application */
    skel = wq_simple_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to open and load BPF skeleton\n");
        return 1;
    }

    /* Attach fentry handler */
    err = wq_simple_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    printf("BPF workqueue program attached. Triggering unlink syscall...\n");

    /* Create a temporary file to trigger do_unlinkat */
    fd = open("/tmp/wq_test_file", O_CREAT | O_WRONLY, 0644);
    if (fd >= 0) {
        close(fd);
        unlink("/tmp/wq_test_file");
    }

    /* Give workqueue time to execute */
    sleep(1);

    /* Check results */
    printf("\nResults:\n");
    printf("  main_executed = %u (expected: 1)\n", skel->bss->main_executed);
    printf("  wq_executed = %u (expected: 1)\n", skel->bss->wq_executed);

    if (skel->bss->main_executed == 1 && skel->bss->wq_executed == 1) {
        printf("\n✓ Test PASSED!\n");
    } else {
        printf("\n✗ Test FAILED!\n");
        err = 1;
    }

cleanup:
    wq_simple_bpf__destroy(skel);
    return err;
}
```

### Understanding the User-Space Code

The userspace program orchestrates the test and verifies results. We use the skeleton API from libbpf, which embeds the compiled BPF bytecode in a C structure and makes loading trivial. The `wq_simple_bpf__open_and_load()` call opens the embedded object, loads the BPF program into the kernel, and creates all maps in one operation.

After loading, `wq_simple_bpf__attach()` attaches the fentry program to `do_unlinkat`. From this point, any unlink syscall will trigger our BPF program. We deliberately trigger this by creating and immediately deleting a temporary file. The `open()` creates `/tmp/wq_test_file`, we close the fd, then `unlink()` deletes it. This deletion enters the kernel's `do_unlinkat` function, triggering our fentry probe.

Here's the critical timing aspect: workqueue execution is asynchronous. Our main BPF program schedules the work and returns immediately. The kernel queues the callback for later execution by a kernel worker thread. This is why we `sleep(1)` - giving the workqueue time to execute before we check results. In production code, you'd use more sophisticated synchronization, but for a simple test, sleep is sufficient.

After the sleep, we read global variables from the BPF program's `.bss` section. The skeleton provides convenient access through `skel->bss->main_executed` and `skel->bss->wq_executed`. If both are 1, we know the synchronous path (fentry) and the async path (workqueue callback) both executed successfully.

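As one possible form of that "more sophisticated synchronization" - purely a sketch, not part of `wq_simple.c` - userspace can poll the shared `.bss` flag with a timeout instead of sleeping a fixed second, since the skeleton memory-maps those globals into the process:

```c
#include <time.h>

/* Illustrative helper (hypothetical name): wait up to timeout_ms for the
 * workqueue callback to set wq_executed, polling every 10 ms. */
static int wait_for_wq(struct wq_simple_bpf *skel, int timeout_ms)
{
    struct timespec ts = { .tv_nsec = 10 * 1000 * 1000 }; /* 10 ms */
    int waited_ms = 0;

    while (!skel->bss->wq_executed && waited_ms < timeout_ms) {
        nanosleep(&ts, NULL);
        waited_ms += 10;
    }
    return skel->bss->wq_executed ? 0 : -1; /* 0 = callback observed */
}
```

Calling `wait_for_wq(skel, 1000)` in place of `sleep(1)` would return as soon as the callback runs rather than always waiting a full second.
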
## Understanding Workqueue APIs

The workqueue API consists of three essential functions that manage the lifecycle. **`bpf_wq_init(wq, map, flags)`** initializes a workqueue handle, establishing the relationship between the workqueue and its containing map. The map parameter is crucial - it tells the verifier which map contains the value with the embedded `bpf_wq` structure. The verifier uses this to ensure memory safety across async execution. Flags should be 0 in current kernels.

**`bpf_wq_set_callback(wq, callback_fn, flags)`** registers the function to execute asynchronously. The callback must have the signature `int callback(void *map, int *key, void *value)`. The verifier checks this signature at load time and will reject programs with mismatched signatures. This type safety prevents common async programming errors. Flags should be 0.

**`bpf_wq_start(wq, flags)`** schedules the workqueue to run. This returns immediately - your BPF program continues executing synchronously. The kernel queues the callback for execution by a worker thread in process context at some point in the future. The callback might run microseconds or milliseconds later depending on system load. Flags should be 0.

The callback signature deserves attention. Unlike `bpf_timer` callbacks, which receive `(void *map, __u32 *key, void *value)`, workqueue callbacks receive `(void *map, int *key, void *value)`. Note the key type difference - `int *` vs `__u32 *`. This reflects the evolution of the API and must be matched exactly or the verifier rejects your program. The callback runs in process context, so it can safely perform operations that would crash in softirq context.

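To make the signature and context differences concrete, here is an illustrative fragment - these callbacks and the commented arming calls are not from `wq_simple.bpf.c`, and the element and map names are invented:

```c
/* Illustrative sketch only - not part of wq_simple.bpf.c. */

/* bpf_timer callback: runs in softirq context, key arrives as __u32 *. */
static int timer_cb(void *map, __u32 *key, void *value)
{
    return 0; /* only fast, non-sleeping work belongs here */
}

/* bpf_wq callback: runs in process context, key arrives as int *. */
static int wq_cb(void *map, int *key, void *value)
{
    return 0; /* may take longer and use sleepable operations */
}

/* Arming the two looks deliberately similar (elem/map names are made up):
 *
 *   bpf_timer_init(&e->timer, &my_map, CLOCK_MONOTONIC);
 *   bpf_timer_set_callback(&e->timer, timer_cb);
 *   bpf_timer_start(&e->timer, 1000000, 0);    // fires in ~1 ms
 *
 *   bpf_wq_init(&e->work, &my_map, 0);
 *   bpf_wq_set_callback(&e->work, wq_cb, 0);
 *   bpf_wq_start(&e->work, 0);                 // runs soon, no time argument
 */
```

The practical difference shows up in the last call of each group: a timer takes an expiration time in nanoseconds, while a workqueue has no time argument at all - it simply runs when a kernel worker picks it up.
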
## When to Use Workqueues vs Timers

Choose **bpf_timer** when you need microsecond-precision timing, operations are fast and non-blocking, you're updating counters or simple state, or implementing periodic fast-path operations like statistics collection or packet pacing. Timers excel at time-critical tasks that must execute with minimal latency.

Choose **bpf_wq** when you need to sleep or wait, allocate memory with `kzalloc()`, perform device or network I/O, or defer cleanup operations that can happen later. Workqueues are perfect for the "fast path + slow path" pattern where critical operations happen immediately and expensive processing runs asynchronously. Examples include HID device I/O (keyboard macro injection with delays), async map cleanup (preventing memory leaks), security policy updates (querying external databases), and background processing (compression, encryption, aggregation).

At a glance:

| Feature | bpf_timer | bpf_wq |
|---------|-----------|--------|
| **Context** | Softirq (interrupt) | Process (workqueue) |
| **Can sleep?** | ❌ No | ✅ Yes |
| **Memory allocation** | ❌ No | ✅ Yes |
| **Device I/O** | ❌ No | ✅ Yes |
| **Latency** | Very low (μs) | Higher (ms) |
| **Use case** | Time-critical | Sleepable operations |

The fundamental trade-off is latency vs capability. Timers have lower latency but restricted capabilities. Workqueues have higher latency but full process-context capabilities, including sleeping and blocking I/O.

## Compilation and Execution

Navigate to the bpf_wq directory and build:

```bash
cd /home/yunwei37/workspace/bpf-developer-tutorial/src/features/bpf_wq
make
```

The Makefile compiles the BPF program with the experimental workqueue features enabled and generates a skeleton header.

Run the simple workqueue test:

```bash
sudo ./wq_simple
```

Expected output:

```
BPF workqueue program attached. Triggering unlink syscall...

Results:
  main_executed = 1 (expected: 1)
  wq_executed = 1 (expected: 1)

✓ Test PASSED!
```

The test verifies that both the synchronous fentry probe and the asynchronous workqueue callback executed successfully. If the workqueue callback didn't run, `wq_executed` would be 0 and the test would fail.

## Historical Timeline and Context

Understanding how workqueues came to exist helps appreciate their design. In 2022, Benjamin Tissoires started work on HID-BPF, aiming to let users fix broken HID devices without kernel drivers. By 2023, he realized `bpf_timer` limitations made HID device I/O impossible - you can't wait for hardware responses in softirq context. In early 2024, he proposed `bpf_wq` as "bpf_timer in process context," collaborating with the BPF community on the design. The kernel merged workqueues in April 2024 as part of Linux v6.10. Since then, they've been used for HID quirks, rate limiting, async cleanup, and other sleepable operations.

The key quote from Benjamin's patches captures the motivation perfectly: "I need something similar to bpf_timers, but not in soft IRQ context... the bpf_timer functionality would prevent me to kzalloc and wait for the device."

This real-world need drove the design. Workqueues exist because device handling and resource management require sleepable, blocking operations that timers fundamentally cannot provide.

## Summary and Next Steps

BPF workqueues solve a fundamental limitation of eBPF by enabling sleepable, blocking operations in process context. Created specifically to support HID device handling where timing delays and device I/O are essential, workqueues unlock powerful new capabilities for eBPF programs. They enable the "fast path + slow path" pattern where performance-critical operations execute immediately while expensive cleanup and I/O happen asynchronously without blocking.

Our simple example demonstrates the core workqueue lifecycle: embedding a `bpf_wq` in a map value, initializing and configuring it, scheduling async execution, and verifying that the callback runs in process context. This same pattern scales to production use cases like network rate limiting with async cleanup, security monitoring with external service queries, and device handling with I/O operations.

> If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

## References

- **Original Kernel Patches:** Benjamin Tissoires' HID-BPF and bpf_wq patches (2023-2024)
- **Linux Kernel Source:** `kernel/bpf/helpers.c` - workqueue implementation
- **Tutorial Repository:** https://github.com/eunomia-bpf/bpf-developer-tutorial

+Example adapted from Linux kernel BPF selftests with educational enhancements. Requires Linux kernel 6.10+ for workqueue support. Complete source code available in the tutorial repository.