add English doc (#55)

* add english documents

* docs: add root english README
This commit is contained in:
云微
2023-08-08 08:55:55 +01:00
committed by GitHub
parent dc37ec0fcf
commit ef9a1d9b47
38 changed files with 5587 additions and 6 deletions

View File

@@ -1,4 +1,4 @@
# eBPF 开发者教程与知识库:通过小工具一步步学习 eBPF
# eBPF 开发者教程与知识库:Learn eBPF by example tools
[![CI](https://github.com/eunomia-bpf/bpf-developer-tutorial/actions/workflows/main.yml/badge.svg)](https://github.com/eunomia-bpf/bpf-developer-tutorial/actions/workflows/main.yml)
@@ -11,6 +11,12 @@
教程关注于可观测性、网络、安全等等方面的 eBPF 示例。
This is a development tutorial for eBPF based on CO-RE (Compile Once, Run Everywhere). It provides practical eBPF development practices from beginner to advanced, including basic concepts, code examples, and real-world applications. Unlike BCC, we use frameworks like libbpf, Cilium, libbpf-rs, and eunomia-bpf for development, with examples in languages such as C, Go, and Rust.
This tutorial does not cover complex concepts and scenario introductions. Its main purpose is to provide examples of eBPF tools (**very short, starting with twenty lines of code!**) to help eBPF application developers quickly grasp eBPF development methods and techniques. The tutorial content can be found in the directory, with each directory being an independent eBPF tool example.
The tutorial is in Chinese and English versions. For English version, please refer to [README_en.md](README_en.md) and the README_en.md in each directory.
## 目录
### 入门文档

171
README_en.md Normal file
View File

@@ -0,0 +1,171 @@
# eBPF Developer Tutorial and Knowledge Base: Learning eBPF Step by Step with Tools
[![CI](https://github.com/eunomia-bpf/bpf-developer-tutorial/actions/workflows/main.yml/badge.svg)](https://github.com/eunomia-bpf/bpf-developer-tutorial/actions/workflows/main.yml)
[GitHub](https://github.com/eunomia-bpf/bpf-developer-tutorial)
[Gitee Mirror](https://gitee.com/yunwei37/bpf-developer-tutorial)
This is a development tutorial for eBPF based on CO-RE (Compile Once, Run Everywhere). It provides practical eBPF development practices from beginner to advanced, including basic concepts, code examples, and real-world applications. Unlike BCC, we use frameworks like libbpf, Cilium, libbpf-rs, and eunomia-bpf for development, with examples in languages such as C, Go, and Rust.
This tutorial does not cover complex concepts and scenario introductions. Its main purpose is to provide examples of eBPF tools (**very short, starting with twenty lines of code!**) to help eBPF application developers quickly grasp eBPF development methods and techniques. The tutorial content can be found in the directory, with each directory being an independent eBPF tool example.
The tutorial focuses on eBPF examples in observability, networking, security, and more.
## Table of Contents
### Getting Started Documentation
Includes simple eBPF program samples and introductions.
- [lesson 0-introduce](src/0-introduce/README_en.md) Introduces basic concepts of eBPF and common development tools
- [lesson 1-helloworld](src/1-helloworld/README_en.md) Develops the simplest "Hello World" program using eBPF and introduces the basic framework and development process of eBPF
- [lesson 2-kprobe-unlink](src/2-kprobe-unlink/README_en.md) Uses kprobe in eBPF to capture the unlink system call
- [lesson 3-fentry-unlink](src/3-fentry-unlink/README_en.md) Uses fentry in eBPF to capture the unlink system call
- [lesson 4-opensnoop](src/4-opensnoop/README_en.md) Uses eBPF to capture the system call collection of processes opening files, and filters process PIDs in eBPF using global variables
- [lesson 5-uprobe-bashreadline](src/5-uprobe-bashreadline/README_en.md) Uses uprobe in eBPF to capture the readline function calls in bash
- [lesson 6-sigsnoop](src/6-sigsnoop/README_en.md) Captures the system call collection of processes sending signals and uses a hash map to store states
- [lesson 7-execsnoop](src/7-execsnoop/README_en.md) Captures process execution times and prints output to user space through perf event array
- [lesson 8-exitsnoop](src/8-exitsnoop/README_en.md) Captures process exit events and prints output to user space using a ring buffer
- [lesson 9-runqlat](src/9-runqlat/README_en.md) Captures process scheduling delays and records them in histogram format
- [lesson 10-hardirqs](src/10-hardirqs/README_en.md) Captures interrupt events using hardirqs or softirqs
- [lesson 11-bootstrap](src/11-bootstrap/README_en.md) Writes native libbpf user space code for eBPF using libbpf-bootstrap and establishes a complete libbpf project.
- [lesson 12-profile](src/12-profile/README_en.md) Performs performance analysis using eBPF
- [lesson 13-tcpconnlat](src/13-tcpconnlat/README_en.md) Records TCP connection latency and processes data in user space using libbpf
- [lesson 14-tcpstates](src/14-tcpstates/README_en.md) Records TCP connection state and TCP RTT.- [lesson 15-javagc](src/15-javagc/README_en.md) Capture user-level Java GC event duration using usdt
- [lesson 16-memleak](src/16-memleak/README_en.md) Detect memory leaks
- [lesson 17-biopattern](src/17-biopattern/README_en.md) Capture disk IO patterns
- [lesson 18-further-reading](src/18-further-reading/README_en.md) Further reading?
- [lesson 19-lsm-connect](src/19-lsm-connect/README_en.md) Use LSM for security detection and defense
- [lesson 20-tc](src/20-tc/README_en.md) Use eBPF for tc traffic control
- [lesson 21-xdp](src/21-xdp/README_en.md) Use eBPF for XDP packet processing
### Advanced Documentation and Sample Programs
This section covers advanced topics related to eBPF, including using eBPF programs on Android, possible attacks and defenses using eBPF programs, and complex tracing. Combining the user-mode and kernel-mode aspects of eBPF can bring great power (as well as security risks).
Android:
- [Using eBPF programs on Android](src/22-android/README_en.md)
Networking and tracing:
- [Tracing HTTP requests or other layer-7 protocols using eBPF](src/23-http/README_en.md)
- [Accelerating network request forwarding using sockops](src/29-sockops/README_en.md)
Security:
- [Hiding process or file information using eBPF](src/24-hide/README_en.md)
- [Terminating processes by sending signals using bpf_send_signal](src/25-signal/README_en.md)
- [Adding sudo users using eBPF](src/26-sudo/README_en.md)
- [Replacing text read or written by any program using eBPF](src/27-replace/README_en.md)
- [BPF lifecycle: Running eBPF programs continuously in Detached mode after user-mode applications exit](src/28-detach/README_en.md)
Continuously updated...
## Why write this tutorial?
In the process of learning eBPF, we have been inspired and helped by the [bcc python developer tutorial](src/bcc-documents/tutorial_bcc_python_developer.md). However, from the current perspective, using libbpf to develop eBPF applications is a relatively better choice. However, there seems to be few tutorials that focus on eBPF development based on libbpf and BPF CO-RE, introducing it through examples and tools. Therefore, we initiated this project, adopting a similar organization method as the bcc python developer tutorial, but using CO-RE's libbpf for development.
This project is mainly based on [libbpf-bootstrap](https://github.com/libbpf/libbpf-bootstrap) and [eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf) frameworks, and uses eunomia-bpf to help simplify the development of some user-space libbpf eBPF code, allowing developers to focus on kernel-space eBPF code development.
> - We also provide a small tool called GPTtrace, which uses ChatGPT to automatically write eBPF programs and trace Linux systems through natural language descriptions. This tool allows you to interactively learn eBPF programs: [GPTtrace](https://github.com/eunomia-bpf/GPTtrace)
> - Feel free to raise any questions or issues related to eBPF learning, or bugs encountered in practice, in the issue or discussion section of this repository. We will do our best to help you!
## GitHub Templates: Easily build eBPF projects and development environments, compile and run eBPF programs online with one click
When starting a new eBPF project, are you confused about how to set up the environment and choose a programming language? Don't worry, we have prepared a series of GitHub templates for you to quickly start a brand new eBPF project. Just click the `Use this template` button on GitHub to get started.- <https://github.com/eunomia-bpf/libbpf-starter-template>: eBPF project template based on the C language and libbpf framework
- <https://github.com/eunomia-bpf/cilium-ebpf-starter-template>: eBPF project template based on the Go language and cilium/ framework
- <https://github.com/eunomia-bpf/libbpf-rs-starter-template>: eBPF project template based on the Rust language and libbpf-rs framework
- <https://github.com/eunomia-bpf/eunomia-template>: eBPF project template based on the C language and eunomia-bpf framework
These starter templates include the following features:
- A Makefile to build the project with a single command
- A Dockerfile to automatically create a containerized environment for your eBPF project and publish it to GitHub Packages
- GitHub Actions to automate the build, test, and release processes
- All dependencies required for eBPF development
> By setting an existing repository as a template, you and others can quickly generate new repositories with the same basic structure, eliminating the need for manual creation and configuration. With GitHub template repositories, developers can focus on the core functionality and logic of their projects without wasting time on the setup and structure. For more information about template repositories, see the official documentation: <https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository>
When you create a new repository using one of the eBPF project templates mentioned above, you can easily set up and launch an online development environment with GitHub Codespaces. Here are the steps to compile and run eBPF programs using GitHub Codespaces:
1. Click the Code button in your new repository and select the Open with Codespaces option:
![code](imgs/code-button.png)
2. GitHub will create a new Codespace for you, which may take a few minutes depending on your network speed and the size of the repository.
3. Once your Codespace is launched and ready to use, you can open the terminal and navigate to your project directory.
4. You can follow the instructions in the corresponding repository to compile and run eBPF programs:
![codespace](imgs/codespace.png)
With Codespaces, you can easily create, manage, and share cloud-based development environments, speeding up and making your development process more reliable. You can develop with Codespaces anywhere, on any device, just need a computer with a web browser. Additionally, GitHub Codespaces supports pre-configured environments, customized development containers, and customizable development experiences to meet your development needs.
After writing code in a codespace and making a commit, GitHub Actions will compile and automatically publish the container image. Then, you can use Docker to run this eBPF program anywhere with just one command, for example:
```console
$ sudo docker run --rm -it --privileged ghcr.io/eunomia-bpf/libbpf-rs-template:latest
[sudo] password for xxx:
Tracing run queue latency higher than 10000 us
TIME COMM TID LAT(us)
12:09:19 systemd-udevd 30786 18300
12:09:19 systemd-udevd 30796 21941
12:09:19 systemd-udevd 30793 10323
12:09:19 systemd-udevd 30795 14827
12:09:19 systemd-udevd 30790 17973
12:09:19 systemd-udevd 30793 12328
12:09:19 systemd-udevd 30796 28721
```
![docker](imgs/docker.png)
## Why do we need tutorials based on libbpf and BPF CO-RE?
> In history, when it comes to developing a BPF application, one could choose the BCC framework to load the BPF program into the kernel when implementing various BPF programs for Tracepoints. BCC provides a built-in Clang compiler that can compile BPF code at runtime and customize it into a program that conforms to a specific host kernel. This is the only way to develop maintainable BPF applications under the constantly changing internal kernel environment. The portability of BPF and the introduction of CO-RE are detailed in the article "BPF Portability and CO-RE", explaining why BCC was the only viable option before and why libbpf is now considered a better choice. Last year, Libbpf saw significant improvements in functionality and complexity, eliminating many differences with BCC (especially for Tracepoints applications) and adding many new and powerful features that BCC does not support (such as global variables and BPF skeletons)
> Admittedly, BCC does its best to simplify the work of BPF developers, but sometimes it also increases the difficulty of problem localization and fixing while providing convenience. Users must remember its naming conventions and the autogenerated structures for Tracepoints, and they must rely on rewriting this code to read kernel data and access kprobe parameters. When using BPF maps, it is necessary to write half-object-oriented C code that does not completely match what happens in the kernel. Furthermore, BCC leads to the writing of a large amount of boilerplate code in user space, with manually configuring the most trivial parts.
> As mentioned above, BCC relies on runtime compilation and embeds a large LLVM/Clang library, which creates certain gaps between BCC and an ideal usage scenario:
> - High resource utilization (memory and CPU) at compile time, which may interfere with the main process in busy servers.
> - It relies on the kernel header package and needs to be installed on each target host. Even so, if certain kernel contents are not exposed through public header files, type definitions need to be copied and pasted into the BPF code to achieve the purpose.
> - Even the smallest compile-time errors can only be detected at runtime, followed by recompiling and restarting the user-space application. This greatly affects the iteration time of development (and increases frustration...).
> Libbpf + BPF CO-RE (Compile Once - Run Everywhere) takes a different approach, considering BPF programs as normal user-space programs: they only need to be compiled into small binaries that can be deployed on target hosts without modification. libbpf acts as a loader for BPF programs, responsible for configuration work (relocating, loading, and verifying BPF programs, creating BPF maps, attaching to BPF hooks, etc.), and developers only need to focus on the correctness and performance of BPF programs. This approach minimizes overhead, eliminates dependencies, and improves the overall developer experience.
> In terms of API and code conventions, libbpf adheres to the philosophy of "least surprise", where most things need to be explicitly stated: no header files are implied, and no code is rewritten. Most monotonous steps can be eliminated using simple C code and appropriate auxiliary macros. In addition, what users write is the content that needs to be executed, and the structure of BPF applications is one-to-one, finally verified and executed by the kernel.
Reference: [BCC to Libbpf Conversion Guide (Translation) - Deep Dive into eBPF](https://www.ebpf.top/post/bcc-to-libbpf-guid/)
## eunomia-bpf
[eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf) is an open-source eBPF dynamic loading runtime and development toolkit designed to simplify the development, building, distribution, and execution of eBPF programs. It is based on the libbpf CO-RE lightweight development framework.
With eunomia-bpf, you can:
- Write only the libbpf kernel mode code when writing eBPF programs or tools, automatically retrieving kernel mode export information.
- Use Wasm to develop eBPF user mode programs, controlling the entire eBPF program loading and execution, as well as handling related data within the WASM virtual machine.
- eunomia-bpf can package pre-compiled eBPF programs into universal JSON or WASM modules for distribution across architectures and kernel versions, allowing dynamic loading and execution without the need for recompilation.
eunomia-bpf consists of a compilation toolchain and a runtime library. Compared to traditional frameworks like BCC and native libbpf, it greatly simplifies the development process of eBPF programs, where in most cases, only the kernel mode code needs to be written to easily build, package, and publish complete eBPF applications. At the same time, the kernel mode eBPF code guarantees 100% compatibility with mainstream development frameworks such as libbpf, libbpfgo, libbpf-rs, and more. When user mode code needs to be written, multiple languages can be used with the help of Webassembly. Compared to script tools like bpftrace, eunomia-bpf maintains similar convenience, while not being limited to trace scenarios and can be used in various other fields such as networking and security.> - eunomia-bpf project GitHub address: <https://github.com/eunomia-bpf/eunomia-bpf>
>
> - gitee mirror: <https://gitee.com/anolis/eunomia>
## Let ChatGPT Help Us
This tutorial uses ChatGPT to learn how to write eBPF programs. At the same time, we try to teach ChatGPT how to write eBPF programs. The general steps are as follows:
1. Teach it the basic knowledge of eBPF programming.
2. Show it some cases: hello world, basic structure of eBPF programs, how to use eBPF programs for tracing, and let it start writing tutorials.
3. Manually adjust the tutorials and correct errors in the code and documents.
4. Feed the modified code back to ChatGPT for further learning.
5. Try to make ChatGPT generate eBPF programs and corresponding tutorial documents automatically! For example:
![ebpf-chatgpt-signal](imgs/ebpf-chatgpt-signal.png)
The complete conversation log can be found here: [ChatGPT.md](ChatGPT.md)
We have also built a demo of a command-line tool. Through training in this tutorial, it can automatically write eBPF programs and trace Linux systems using natural language descriptions: <https://github.com/eunomia-bpf/GPTtrace>
![ebpf-chatgpt-signal](https://github.com/eunomia-bpf/GPTtrace/blob/main/doc/result.gif)

View File

@@ -0,0 +1,145 @@
# eBPF Beginner's Development Tutorial 0: Introduction to the Basic Concepts of eBPF and Common Development Tools
## 1. Introduction to eBPF: Secure and Efficient Kernel Extension
eBPF is a revolutionary technology that originated in the Linux kernel and allows sandbox programs to run in the kernel of an operating system. It is used to securely and efficiently extend the functionality of the kernel without the need to modify the kernel's source code or load kernel modules. By allowing the execution of sandbox programs in the operating system, eBPF enables application developers to dynamically add additional functionality to the operating system at runtime. The operating system then ensures security and execution efficiency, similar to performing native compilation with the help of a Just-In-Time (JIT) compiler and verification engine. eBPF programs are portable between kernel versions and can be automatically updated, avoiding workload interruptions and node restarts.
Today, eBPF is widely used in various scenarios: in modern data centers and cloud-native environments, it provides high-performance network packet processing and load balancing; it achieves observability of various fine-grained metrics with very low resource overhead, helping application developers trace applications and provide insights for performance troubleshooting; it ensures secure execution of application and container runtimes, and more. The possibilities are endless, and the innovation unleashed by eBPF in the operating system kernel has only just begun [3].
### The Future of eBPF: Kernel's JavaScript Programmable Interface
For browsers, the introduction of JavaScript brought programmability and initiated a tremendous revolution, transforming browsers into almost independent operating systems. Now let's return to eBPF: in order to understand the programmability impact of eBPF on the Linux kernel, it is helpful to have a high-level understanding of the structure of the Linux kernel and how it interacts with applications and hardware [4].
![kernel-arch](kernel-arch.png)
The main purpose of the Linux kernel is to abstract hardware or virtual hardware and provide a consistent API (system calls) that allows applications to run and share resources. To achieve this goal, we maintain a series of subsystems and layers to distribute these responsibilities [5]. Each subsystem typically allows some level of configuration to cater to different user requirements. If the desired behavior cannot be achieved through configuration, there are two options: either modify the kernel source code and convince the Linux kernel community that the change is necessary (waiting for several years for the new kernel version to become a commodity), or write a kernel module and regularly fix it because each kernel version may break it. In practice, neither of the two solutions is commonly used: the former is too costly, and the latter lacks portability.
With eBPF, there is a new option to reprogram the behavior of the Linux kernel without modifying the kernel's source code or loading kernel modules, while ensuring a certain degree of behavioral consistency, compatibility, and security across different kernel versions [6]. To achieve this, eBPF programs also need a corresponding API that allows the execution and sharing of resources for user-defined applications. In other words, in a sense, the eBPF virtual machine also provides a mechanism similar to system calls. With the help of the communication mechanism between eBPF and user space, both the Wasm virtual machine and user space applications can have full access to this set of "system calls." On the one hand, it can programmatically extend the capabilities of traditional system calls, and on the other hand, it can achieve more efficient programmable IO processing in many layers such as networking and file systems.
![new-os](new-os-model.png)
As shown in the above figure, the Linux kernel of today is evolving towards a new kernel model: user-defined applications can run in both kernel space and user space, with user space accessing system resources through traditional system calls and kernel space interacting with various parts of the system through BPF Helper Calls. As of early 2023, there are already more than 220 helper system interfaces in the eBPF virtual machine in the kernel, covering a wide range of application scenarios.## Note
It is worth noting that BPF Helper Call and system calls are not competitive. Their programming models and performance advantages are completely different and they do not completely replace each other. For the Wasm and Wasi related ecosystems, the situation is similar. The specially designed wasi interface needs to go through a long standardization process. However, it may provide better performance and portability guarantees for user-mode applications in specific scenarios. On the other hand, eBPF can provide a fast and flexible solution for extending system interfaces, while ensuring sandbox nature and portability.
Currently, eBPF is still in the early stages. However, with the help of the kernel interfaces provided by eBPF and the ability to interact with user space, applications in the Wasm virtual machine can almost access the data and return values of any kernel or user mode function call (kprobe, uprobe...). It can collect and understand all system calls at a low cost and obtain packet-level data and socket-level data for all network operations (tracepoint, socket...). It can also add additional protocol analyzers and easily program any forwarding logic in the network packet processing solution (XDP, TC...), without leaving the packet processing environment of the Linux kernel.
Moreover, eBPF has the ability to write data to any address of a user space process (bpf_probe_write_user[7]), partially modify the return value of a kernel function (bpf_override_return[8]), and even directly execute certain system calls in kernel mode[9]. Fortunately, eBPF performs strict security checks on the bytecode before loading it into the kernel to ensure that there are no operations such as memory out-of-bounds. Moreover, many features that may expand the attack surface and pose security risks need to be explicitly enabled during kernel compilation. Before loading the bytecode into the kernel, the Wasm virtual machine can also choose to enable or disable certain eBPF features to ensure the security of the sandbox.
## 2. Some Tips on Learning eBPF Development
This article will not provide a more detailed introduction to the principles of eBPF, but here is a learning plan and reference materials that may be of value:
### Introduction to eBPF (5-7h)
- Google or other search engines: eBPF
- Ask ChatGPT-like things: What is eBPF?
Recommended:
- Read the introduction to ebpf: <https://ebpf.io/> (30min)
- Briefly understand the ebpf kernel-related documentation: <https://prototype-kernel.readthedocs.io/en/latest/bpf/> (Know where to queries, 30min)
- Read the Chinese ebpf beginner's guide: <https://www.modb.pro/db/391570> (1h)
- There are a lot of reference materials: <https://github.com/zoidbergwill/awesome-ebpf> (2-3h)
- You can choose to flip through PPTs that interest you: <https://github.com/gojue/ebpf-slide> (1-2h)
Answer three questions:
1. Understand what eBPF is? Why do we need it? Can't we use kernel modules?
2. What functions does it have? What can it do in the Linux kernel? What are the types of eBPF programs and helpers (not all of them need to be known, but need to know where to find them)?
3. What can it be used for? For example, in which scenarios can it be used? Networking, security, observability?
### Understand how to develop eBPF programs (10-15h)
Understand and try eBPF development frameworks:
- Examples of developing various tools with BCC: <https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md> (Run through, 3-4h)
- Some examples of libbpf: <https://github.com/libbpf/libbpf-bootstrap> (Run any interesting one and read the source code, 2h)
- Tutorial based on libbpf and eunomia-bpf: <https://github.com/eunomia-bpf/bpf-developer-tutorial> (Read part 1-10, 3-4h)
Other development frameworks: Go or Rust language, please search and try on your own (0-2h)
Have questions or things you want to know, whether or not they are related to this project, you can start discussing in the discussions of this project.
Answer some questions and try some experiments (2-5h):
1. How to develop the simplest eBPF program?
2. How to trace a kernel feature or function with eBPF? There are many ways, provide corresponding code examples;
3. What are the solutions for communication between user mode and kernel mode? How to send information from user mode to kernel mode? How to pass information from kernel mode to user mode? Provide code examples;
4. Write your own eBPF program to implement a feature;
5. In the entire lifecycle of an eBPF program, what does it do in user mode and kernel mode?## 3. How to use eBPF programming
Writing original eBPF programs is very tedious and difficult. To change this situation, llvm introduced in 2015 the ability to compile code written in high-level languages into eBPF bytecode, and the eBPF community wrapped primitive system calls such as `bpf()` and provided the `libbpf` library. These libraries include functions for loading bytecode into the kernel and other key functions. In the Linux source code package, there are numerous eBPF sample codes provided by Linux based on `libbpf`, located in the `samples/bpf/` directory.
A typical `libbpf`-based eBPF program consists of two files: `*_kern.c` and `*_user.c`. The mounting points and processing functions in the kernel are written in `*_kern.c`, while the user code for injecting kernel code and performing various tasks in user space is written in `*_user.c`. For more detailed tutorials, refer to [this video](https://www.bilibili.com/video/BV1f54y1h74r?spm_id_from=333.999.0.0). However, due to the difficulty in understanding and the entry barrier, most eBPF program development at the current stage is based on tools such as:
- BCC
- BPFtrace
- libbpf-bootstrap
- Go eBPF library
And there are newer tools such as `eunomia-bpf`.
## Writing eBPF programs
eBPF programs consist of a kernel space part and a user space part. The kernel space part contains the actual logic of the program, while the user space part is responsible for loading and managing the kernel space part. With the eunomia-bpf development tool, only the kernel space part needs to be written.
The code in the kernel space part needs to conform to the syntax and instruction set of eBPF. eBPF programs mainly consist of several functions, each with its own specific purpose. The available function types include:
- kprobe: probe function, executed before or after a specified kernel function.
- tracepoint: tracepoint function, executed at a specified kernel tracepoint.
- raw_tracepoint: raw tracepoint function, executed at a specified kernel raw tracepoint.
- xdp: network data processing function, intercepting and processing network packets.
- perf_event: performance event function, used to handle kernel performance events.
- kretprobe: return probe function, executed when a specified kernel function returns.
- tracepoint_return: tracepoint return function, executed when a specified kernel tracepoint returns.
- raw_tracepoint_return: raw tracepoint return function, executed when a specified kernel raw tracepoint returns.
### BCC
BCC stands for BPF Compiler Collection. The project is a Python library that includes a complete toolchain for writing, compiling, and loading BPF programs, as well as tools for debugging and diagnosing performance issues.
Since its release in 2015, BCC has been continuously improved by hundreds of contributors and now includes a large number of ready-to-use tracing tools. The [official project repository](https://github.com/iovisor/bcc/blob/master/docs/tutorial.md) provides a handy tutorial for users to quickly get started with BCC.
Users can program in high-level languages such as Python and Lua on BCC. Compared to programming directly in C, these high-level languages are much more convenient. Users only need to design BPF programs in C, and the rest, including compilation, parsing, loading, etc., can be done by BCC.
However, a drawback of using BCC is its compatibility. Each time an eBPF program based on BCC is executed, it needs to be compiled, and compiling requires users to configure related header files and corresponding implementations. In practical applications, as you may have experienced, dependency issues in compilation can be quite tricky. Therefore, in the development of this project, we have abandoned BCC and chosen the libbpf-bootstrap tool, which allows for one-time compilation and multiple runs.
### eBPF Go library
The eBPF Go library provides a general-purpose eBPF library that decouples the process of obtaining eBPF bytecode from the loading and management of eBPF programs, and implements similar CO- functionality as libbpf. eBPF programs are usually created by writing high-level languages and then compiled into eBPF bytecode using the clang/LLVM compiler.
### libbpf
`libbpf-bootstrap` is a BPF development scaffold based on the `libbpf` library, and its source code can be obtained from its [GitHub](https://github.com/libbpf/libbpf-bootstrap).
`libbpf-bootstrap` combines years of practice from the BPF community and provides a modern and convenient workflow for developers, achieving the goal of one-time compilation and reuse.
BPF programs based on `libbpf-bootstrap` have certain naming conventions for the source files. The file for generating kernel space bytecode ends with `.bpf.c`, and the file for loading bytecode in user space ends with `.c`, and the prefixes of these two files must be the same.Instructions: Translate the following Chinese text to English
while maintaining the original formatting: "Based on `libbpf-bootstrap`, BPF programs will compile `*.bpf.c` files into corresponding `.o` files,
and then generate the `skeleton` file based on this file, i.e. `*.skel.h`. This file will contain some data structures defined in the kernel space,
as well as the key functions used to load kernel space code. After the user-space code includes this file, it can call the corresponding loading function to load the bytecode into the kernel. Similarly, `libbpf-bootstrap` also has a comprehensive introduction tutorial that users can refer to [here](https://nakryiko.com/posts/libbpf-bootstrap/) for detailed introductory operations.
### eunomia-bpf
Developing, building, and distributing eBPF has always been a high-threshold task. The use of tools such as BCC and bpftrace has high development efficiency and good portability. However, when it comes to distribution and deployment, it requires the installation of LLVM, Clang, and other compilation environments, and the compilation process needs to be executed locally or remotely every time, resulting in substantial resource consumption. On the other hand, using the native CO-RE libbpf requires writing a considerable amount of user-mode loading code to help properly load eBPF programs and obtain reported information from the kernel. At the same time, there is no good solution for distributing and managing eBPF programs.
[eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf) is an open-source eBPF dynamic loading runtime and development toolchain designed to simplify the development, building, distribution, and execution of eBPF programs. It is based on the libbpf CO-RE lightweight development framework.
With eunomia-bpf, you can:
- When writing eBPF programs or tools, only write kernel space code, automatically retrieve kernel space export information, and dynamically load it as a module.
- Use WASM for user space interactive program development to control the loading and execution of the entire eBPF program, as well as the processing of related data inside the WASM virtual machine.
- eunomia-bpf can package pre-compiled eBPF programs into universal JSON or WASM modules for distribution across architectures and kernel versions. They can be dynamically loaded and run without the need for recompilation.
eunomia-bpf consists of a compilation toolchain and a runtime library. Compared with traditional frameworks such as BCC and native libbpf, it greatly simplifies the development process of eBPF programs. In most cases, only writing kernel space code is required to easily build, package, and publish complete eBPF applications. At the same time, kernel space eBPF code ensures 100% compatibility with mainstream development frameworks such as libbpf, libbpfgo, libbpf-rs, etc. When there is a need to write user-space code, it can also be developed in multiple languages with the help of WebAssembly. Compared with script tools such as bpftrace, eunomia-bpf retains similar convenience, while not only limited to tracing but also applicable to more scenarios, such as networking, security, etc.
> - eunomia-bpf project Github address: <https://github.com/eunomia-bpf/eunomia-bpf>
> - gitee mirror: <https://gitee.com/anolis/eunomia>
## References
- eBPF Introduction: <https://ebpf.io/>
- BPF Compiler Collection (BCC): <https://github.com/iovisor/bcc>
- eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>
You can also visit our tutorial code repository <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorial source code. All content is open source. We will continue to share more content about eBPF development practices to help you better understand and master eBPF technology.".

View File

@@ -1,4 +1,4 @@
# eBPF 入门开发实践教程Hello World基本框架和开发流程
# eBPF 入门开发实践教程Hello World基本框架和开发流程
在本篇博客中我们将深入探讨eBPFExtended Berkeley Packet Filter的基本框架和开发流程。eBPF是一种在Linux内核上运行的强大网络和性能分析工具它为开发者提供了在内核运行时动态加载、更新和运行用户定义代码的能力。这使得开发者可以实现高效、安全的内核级别的网络监控、性能分析和故障排查等功能。

View File

@@ -0,0 +1,190 @@
# eBPF Getting Started Tutorial 1 Hello World, Basic Framework and Development Process
In this blog post, we will delve into the basic framework and development process of eBPF (Extended Berkeley Packet Filter). eBPF is a powerful network and performance analysis tool that runs on the Linux kernel, providing developers with the ability to dynamically load, update, and run user-defined code at kernel runtime. This enables developers to implement efficient, secure kernel-level network monitoring, performance analysis, and troubleshooting functionalities.
This article is the second part of the eBPF Getting Started Tutorial, where we will focus on how to write a simple eBPF program and demonstrate the entire development process through practical examples. Before reading this tutorial, it is recommended that you first learn the concepts of eBPF by studying the first tutorial.
When developing eBPF programs, there are multiple development frameworks to choose from, such as BCC (BPF Compiler Collection) libbpf, cilium/ebpf, eunomia-bpf, etc. Although these tools have different characteristics, their basic development process is similar. In the following content, we will delve into these processes and use the Hello World program as an example to guide readers in mastering the basic skills of eBPF development.
This tutorial will help you understand the basic structure of eBPF programs, the compilation and loading process, the interaction between user space and kernel space, as well as debugging and optimization techniques. By studying this tutorial, you will master the basic knowledge of eBPF development and lay a solid foundation for further learning and practice.
## Preparation of eBPF Development Environment and Basic Development Process
Before starting to write eBPF programs, we need to prepare a suitable development environment and understand the basic development process of eBPF programs. This section will provide a detailed introduction to these subjects.
### Installing the necessary software and tools
To develop eBPF programs, you need to install the following software and tools:
- Linux kernel: Since eBPF is a kernel technology, you need to have a relatively new version of the Linux kernel (recommended version 4.8 and above) to support eBPF functionality.
- LLVM and Clang: These tools are used to compile eBPF programs. Installing the latest version of LLVM and Clang ensures that you get the best eBPF support.
An eBPF program consists of two main parts: the kernel space part and the user space part. The kernel space part contains the actual logic of the eBPF program, while the user space part is responsible for loading, running, and monitoring the kernel space program.
Once you have chosen a suitable development framework, such as BCC (BPF Compiler Collection), libbpf, cilium/ebpf, or eunomia-bpf, you can begin developing the user space and kernel space programs. Taking the BCC tool as an example, we will introduce the basic development process of eBPF programs:
1. Installing the BCC tool: Depending on your Linux distribution, follow the guidelines in the BCC documentation to install the BCC tool and its dependencies.
2. Writing an eBPF program (C language): Use the C language to write a simple eBPF program, such as the Hello World program. This program can be executed in kernel space and perform specific tasks, such as counting network packets.
3. Writing a user space program (Python or C, etc.): Use languages like Python or C to write a user space program that is responsible for loading, running, and interacting with the eBPF program. In this program, you need to use the API provided by BCC to load and manipulate the kernel space eBPF program.
4. Compiling the eBPF program: Use the BCC tool to compile the eBPF program written in C language into bytecode that can be executed by the kernel. BCC dynamically compiles the eBPF program from source code at runtime.
5. Loading and running the eBPF program: In the user space program, use the API provided by BCC to load the compiled eBPF program into kernel space and then run it.
6. Interacting with the eBPF program: The user space program interacts with the eBPF program through the API provided by BCC, implementing data collection, analysis, and display functions. For example, you can use the BCC API to read map data in the eBPF program to obtain network packet statistics.
7. Unloading the eBPF program: When the eBPF program is no longer needed, the user space program should unload it from the kernel space using the BCC API.
8. Debugging and optimization: Use tools like bpftool to debug and optimize eBPF programs, improving program performance and stability.
Through the above process, you can develop, compile, run, and debug eBPF programs using the BCC tool. Note that the development process of other frameworks, such as libbpf, cilium/ebpf, and eunomia-bpf, is similar but slightly different. Therefore, when choosing a framework, please refer to the respective official documentation and examples.
By following this process, you can develop an eBPF program that runs in the kernel. eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain. It aims to simplify the development, building, distribution, and running of eBPF programs. It is based on the libbpf CO-RE lightweight development framework, supports loading and executing eBPF programs through a user space WebAssembly (WASM) virtual machine, and packages precompiled eBPF programs into universal JSON or WASM modules for distribution. We will use eunomia-bpf for demonstration purposes.
## Download and Install eunomia-bpf Development Tools
You can download and install eunomia-bpf using the following steps:
Download the ecli tool for running eBPF programs:
```console
$ wget https://aka.pw/bpf-ecli -O ecli && chmod +x ./ecli
$ ./ecli -h
Usage: ecli [--help] [--version] [--json] [--no-cache] url-and-args
```
Download the compiler toolchain for compiling eBPF kernel code into config files or WASM modules:
```console
$ wget https://github.com/eunomia-bpf/eunomia-bpf/releases/latest/download/ecc && chmod +x ./ecc
$ ./ecc -h
eunomia-bpf compiler
Usage: ecc [OPTIONS] <SOURCE_PATH> [EXPORT_EVENT_HEADER]
....
```
You can also compile using the docker image:
```console
$ docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest # Compile using docker. `pwd` should contain *.bpf.c files and *.h files.
export PATH=PATH:~/.eunomia/bin
Compiling bpf object...
Packing ebpf object and config into /src/package.json...
```
## Hello World - minimal eBPF program
We will start with a simple eBPF program that prints a message in the kernel. We will use the eunomia-bpf compiler toolchain to compile it into a BPF bytecode file, and then load and run the program using the ecli tool. For the sake of the example, we can temporarily disregard the user space program.
```c
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#define BPF_NO_GLOBAL_DATA
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
typedef unsigned int u32;
typedef int pid_t;
const pid_t pid_filter = 0;
char LICENSE[] SEC("license") = "Dual BSD/GPL";
SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
pid_t pid = bpf_get_current_pid_tgid() >> 32;
if (pid_filter && pid != pid_filter)
return 0;
bpf_printk("BPF triggered sys_enter_write from PID %d.\n", pid);
return 0;
}
```
This program defines a handle_tp function and attaches it to the sys_enter_write tracepoint using the SEC macro (i.e., it is executed when the write system call is entered). The function retrieves the process ID of the write system call invocation using the bpf_get_current_pid_tgid and bpf_printk functions, and prints it in the kernel log.
- `bpf_trace_printk()`: A simple mechanism to output information to the trace_pipe (/sys/kernel/debug/tracing/trace_pipe). This is fine for simple use cases, but it has limitations: a maximum of 3 parameters; the first parameter must be %s (i.e., a string); and the trace_pipe is globally shared in the kernel, so other programs using the trace_pipe concurrently might disrupt its output. A better approach is to use BPF_PERF_OUTPUT(), which will be discussed later.
- `void *ctx`: ctx is originally a parameter of a specific type, but since it is not used here, it is written as void *.
- `return 0;`: This is necessary, returning 0 (to know why, refer to #139 <https://github.com/iovisor/bcc/issues/139>).
To compile and run this program, you can use the ecc tool and ecli command. First, on Ubuntu/Debian, execute the following command:
```shell
sudo apt install clang llvm
```
Compile the program using ecc:".```console
$ ./ecc minimal.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
Or compile using a docker image:
```shell
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
Then run the compiled program using ecli:
```console
$ sudo ./ecli run package.json
Running eBPF program...
```
After running this program, you can view the output of the eBPF program by checking the /sys/kernel/debug/tracing/trace_pipe file:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe | grep "BPF triggered sys_enter_write"
<...>-3840345 [010] d... 3220701.101143: bpf_trace_printk: write system call from PID 3840345.
<...>-3840345 [010] d... 3220701.101143: bpf_trace_printk: write system call from PID 3840345.
```
Once you stop the ecli process by pressing Ctrl+C, the corresponding output will also stop.
Note: If your Linux distribution (e.g. Ubuntu) does not have the tracing subsystem enabled by default, you may not see any output. Use the following command to enable this feature:
```console
$ sudo su
# echo 1 > /sys/kernel/debug/tracing/tracing_on
```
## Basic Framework of eBPF Program
As mentioned above, the basic framework of an eBPF program includes:
- Including header files: You need to include <linux/bpf.h> and <bpf/bpf_helpers.h> header files, among others.
- Defining a license: You need to define a license, typically using "Dual BSD/GPL".
- Defining a BPF function: You need to define a BPF function, for example, named handle_tp, which takes void *ctx as a parameter and returns int. This is usually written in the C language.
- Using BPF helper functions: In the BPF function, you can use BPF helper functions such as bpf_get_current_pid_tgid() and bpf_printk().
- Return value
## Tracepoints
Tracepoints are a kernel static instrumentation technique, technically just trace functions placed in the kernel source code, which are essentially probe points with control conditions inserted into the source code, allowing post-processing with additional processing functions. For example, the most common static tracing method in the kernel is printk, which outputs log messages. For example, there are tracepoints at the start and end of system calls, scheduler events, file system operations, and disk I/O. Tracepoints were first introduced in Linux version 2.6.32 in 2009. Tracepoints are a stable API and their number is limited.
## GitHub Templates: Build eBPF Projects and Development Environments Easily
When faced with creating an eBPF project, are you confused about how to set up the environment and choose a programming language? Don't worry, we have prepared a series of GitHub templates to help you quickly start a brand new eBPF project. Just click the `Use this template` button on GitHub to get started.
- <https://github.com/eunomia-bpf/libbpf-starter-template>: eBPF project template based on the C language and libbpf framework.
- <https://github.com/eunomia-bpf/cilium-ebpf-starter-template>: eBPF project template based on the C language and cilium/ebpf framework.
- <https://github.com/eunomia-bpf/libbpf-rs-starter-template>: eBPF project template based on the Rust language and libbpf-rs framework.
- <https://github.com/eunomia-bpf/eunomia-template>: eBPF project template based on the C language and eunomia-bpf framework.
These starter templates include the following features:
- A Makefile for building the project with one command.
- A Dockerfile for automatically creating a containerized environment for your eBPF project and publishing it to Github Packages.- GitHub Actions, used for automating build, test, and release processes
- All dependencies required for eBPF development
> By setting an existing repository as a template, you and others can quickly generate new repositories with the same underlying structure, eliminating the tedious process of manual creation and configuration. With GitHub template repositories, developers can focus on the core functionality and logic of their projects without wasting time on setup and structure. For more information about template repositories, please refer to the official documentation: <https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository>
## Summary
The development and usage process of eBPF programs can be summarized in the following steps:
- Define the interface and types of eBPF programs: This includes defining the interface functions of eBPF programs, defining and implementing eBPF kernel maps and shared memory (perf events), and defining and using eBPF kernel helper functions.
- Write the code for eBPF programs: This includes writing the main logic of the eBPF program, implementing read and write operations on eBPF kernel maps, and using eBPF kernel helper functions.
- Compile the eBPF program: This includes using an eBPF compiler (such as clang) to compile the eBPF program code into eBPF bytecode and generate an executable eBPF kernel module. ecc essentially calls the clang compiler to compile eBPF programs.
- Load the eBPF program into the kernel: This includes loading the compiled eBPF kernel module into the Linux kernel and attaching the eBPF program to the specified kernel events.
- Use the eBPF program: This includes monitoring the execution of the eBPF program and exchanging and sharing data using eBPF kernel maps and shared memory.
- In practical development, there may be additional steps such as configuring compilation and loading parameters, managing eBPF kernel modules and kernel maps, and using other advanced features.
It should be noted that the execution of BPF programs occurs in the kernel space, so special tools and techniques are needed to write, compile, and debug BPF programs. eunomia-bpf is an open-source BPF compiler and toolkit that can help developers write and run BPF programs quickly and easily.
You can also visit our tutorial code repository <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials, all of which are open-source. We will continue to share more about eBPF development practices to help you better understand and master eBPF technology.

View File

@@ -0,0 +1,261 @@
# eBPF Beginner's Development Tutorial Ten: Capturing Interrupt Events in eBPF Using hardirqs or softirqs
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at runtime in the kernel.
This article is the tenth part of the eBPF Beginner's Development Tutorial, focusing on capturing interrupt events using hardirqs or softirqs in eBPF.
hardirqs and softirqs are two different types of interrupt handlers in the Linux kernel. They are used to handle interrupt requests generated by hardware devices, as well as asynchronous events in the kernel. In eBPF, we can use the eBPF tools hardirqs and softirqs to capture and analyze information related to interrupt handling in the kernel.
## What are hardirqs and softirqs?
hardirqs are hardware interrupt handlers. When a hardware device generates an interrupt request, the kernel maps it to a specific interrupt vector and executes the associated hardware interrupt handler. Hardware interrupt handlers are commonly used to handle events in device drivers, such as completion of device data transfer or device errors.
softirqs are software interrupt handlers. They are a low-level asynchronous event handling mechanism in the kernel, used for handling high-priority tasks in the kernel. softirqs are commonly used to handle events in the network protocol stack, disk subsystem, and other kernel components. Compared to hardware interrupt handlers, software interrupt handlers have more flexibility and configurability.
## Implementation Details
In eBPF, we can capture and analyze hardirqs and softirqs by attaching specific kprobes or tracepoints. To capture hardirqs and softirqs, eBPF programs need to be placed on relevant kernel functions. These functions include:
- For hardirqs: irq_handler_entry and irq_handler_exit.
- For softirqs: softirq_entry and softirq_exit.
When the kernel processes hardirqs or softirqs, these eBPF programs are executed to collect relevant information such as interrupt vectors, execution time of interrupt handlers, etc. The collected information can be used for analyzing performance issues and other interrupt handling related problems in the kernel.
To capture hardirqs and softirqs, the following steps can be followed:
1. Define data structures and maps in eBPF programs for storing interrupt information.
2. Write eBPF programs and attach them to the corresponding kernel functions to capture hardirqs or softirqs.
3. In eBPF programs, collect relevant information about interrupt handlers and store this information in the maps.
4. In user space applications, read the data from the maps to analyze and display the interrupt handling information.
By following the above approach, we can use hardirqs and softirqs in eBPF to capture and analyze interrupt events in the kernel, identifying potential performance issues and interrupt handling related problems.
## Implementation of hardirqs Code
The main purpose of the hardirqs program is to obtain the name, execution count, and execution time of interrupt handlers and display the distribution of execution time in the form of a histogram. Let's analyze this code step by step.
```c
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2020 Wenbo Zhang
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "hardirqs.h"
#include "bits.bpf.h"
#include "maps.bpf.h"
#define MAX_ENTRIES 256
const volatile bool filter_cg = false;
const volatile bool targ_dist = false;
const volatile bool targ_ns = false;
const volatile bool do_count = false;
struct {
__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
__type(key, u32);
__type(value, u32);
__uint(max_entries, 1);
} cgroup_map SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 1);
__type(key, u32);
__type(value, u64);
``````c
} start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, struct irq_key);
__type(value, struct info);
} infos SEC(".maps");
static struct info zero;
static int handle_entry(int irq, struct irqaction *action)
{
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
if (do_count) {
struct irq_key key = {};
struct info *info;
bpf_probe_read_kernel_str(&key.name, sizeof(key.name), BPF_CORE_READ(action, name));
info = bpf_map_lookup_or_try_init(&infos, &key, &zero);
if (!info)
return 0;
info->count += 1;
return 0;
} else {
u64 ts = bpf_ktime_get_ns();
u32 key = 0;
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
bpf_map_update_elem(&start, &key, &ts, BPF_ANY);
return 0;
}
}
static int handle_exit(int irq, struct irqaction *action)
{
struct irq_key ikey = {};
struct info *info;
u32 key = 0;
u64 delta;
u64 *tsp;
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
tsp = bpf_map_lookup_elem(&start, &key);
if (!tsp)
return 0;
delta = bpf_ktime_get_ns() - *tsp;
if (!targ_ns)
delta /= 1000U;
bpf_probe_read_kernel_str(&ikey.name, sizeof(ikey.name), BPF_CORE_READ(action, name));
info = bpf_map_lookup_or_try_init(&infos, &ikey, &zero);
if (!info)
return 0;
if (!targ_dist) {
info->count += delta;
} else {
u64 slot;
slot = log2(delta);
if (slot >= MAX_SLOTS)
slot = MAX_SLOTS - 1;
info->slots[slot]++;
}
return 0;
}
SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry_btf, int irq, struct irqaction *action)
{
return handle_entry(irq, action);
}
SEC("tp_btf/irq_handler_exit")
int BPF_PROG(irq_handler_exit_btf, int irq, struct irqaction *action)
{
return handle_exit(irq, action);
}
SEC("raw_tp/irq_handler_entry")
int BPF_PROG(irq_handler_entry, int irq, struct irqaction *action)
{
return handle_entry(irq, action);
}
SEC("raw_tp/irq_handler_exit")
```int BPF_PROG(irq_handler_exit, int irq, struct irqaction *action)
{
return handle_exit(irq, action);
}
char LICENSE[] SEC("license") = "GPL";
```
This code is an eBPF program used to capture and analyze the execution information of hardware interrupt handlers (hardirqs) in the kernel. The main purpose of the program is to obtain the name, execution count, and execution time of the interrupt handler, and display the distribution of execution time in the form of a histogram. Let's analyze this code step by step.
1. Include necessary header files and define data structures:
```c
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "hardirqs.h"
#include "bits.bpf.h"
#include "maps.bpf.h"
```
This program includes the standard header files required for eBPF development, as well as custom header files for defining data structures and maps.
2. Define global variables and maps:
```c
#define MAX_ENTRIES 256
const volatile bool filter_cg = false;
const volatile bool targ_dist = false;
const volatile bool targ_ns = false;
const volatile bool do_count = false;
...
```
This program defines some global variables that are used to configure the behavior of the program. For example, `filter_cg` controls whether to filter cgroups, `targ_dist` controls whether to display the distribution of execution time, etc. Additionally, the program defines three maps for storing cgroup information, start timestamps, and interrupt handler information.
3. Define two helper functions `handle_entry` and `handle_exit`:
These two functions are called at the entry and exit points of the interrupt handler. `handle_entry` records the start timestamp or updates the interrupt count, while `handle_exit` calculates the execution time of the interrupt handler and stores the result in the corresponding information map.
4. Define the entry points of the eBPF program:
```c
SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry_btf, int irq, struct irqaction *action)
{
return handle_entry(irq, action);
}
SEC("tp_btf/irq_handler_exit")
int BPF_PROG(irq_handler_exit_btf, int irq, struct irqaction *action)
{
return handle_exit(irq, action);
}
SEC("raw_tp/irq_handler_entry")
int BPF_PROG(irq_handler_entry, int irq, struct irqaction *action)
{
return handle_entry(irq, action);
}
SEC("raw_tp/irq_handler_exit")
int BPF_PROG(irq_handler_exit, int irq, struct irqaction *action)
{
return handle_exit(irq, action);
}
```
Here, four entry points of the eBPF program are defined, which are used to capture the entry and exit events of the interrupt handler. `tp_btf` and `raw_tp` represent capturing events using BPF Type Format (BTF) and raw tracepoints, respectively. This ensures that the program can be ported and run on different kernel versions.
The code for Softirq is similar, and I won't elaborate on it here.
## Run code.Translated content:
"eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that combines Wasm. Its purpose is to simplify the development, building, distribution, and execution of eBPF programs. You can refer to <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compilation toolchain and ecli runtime. We use eunomia-bpf to compile and run this example.
To compile this program, use the ecc tool:
```console
$ ecc hardirqs.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
Then run:
```console
sudo ecli run ./package.json
```
## Summary
In this chapter (eBPF Getting Started Tutorial Ten: Capturing Interrupt Events in eBPF with Hardirqs or Softirqs), we learned how to capture and analyze the execution information of hardware interrupt handlers (hardirqs) in the kernel using eBPF programs. We explained the example code in detail, including how to define data structures, mappings, eBPF program entry points, and how to call helper functions to record execution information at the entry and exit points of interrupt handlers.
By studying the content of this chapter, you should have mastered the methods of capturing interrupt events with hardirqs or softirqs in eBPF, as well as how to analyze these events to identify performance issues and other problems related to interrupt handling in the kernel. These skills are crucial for analyzing and optimizing the performance of the Linux kernel.
To better understand and practice eBPF programming, we recommend reading the official documentation of eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>. In addition, we provide a complete tutorial and source code for you to view and learn from at <https://github.com/eunomia-bpf/bpf-developer-tutorial>. We hope this tutorial can help you get started with eBPF development smoothly and provide useful references for your further learning and practice."

View File

@@ -0,0 +1,627 @@
# eBPF Beginner's Development Practice Tutorial 11: Using libbpf to Develop User-Space Programs in eBPF and Trace exec() and exit() System Calls
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code during kernel runtime.
In this tutorial, we will learn how kernel-space and user-space eBPF programs work together. We will also learn how to use the native libbpf to develop user-space programs, package eBPF applications into executable files, and distribute them across different kernel versions.
## The libbpf Library and Why We Need to Use It
libbpf is a C language library that is distributed with the kernel version to assist in loading and running eBPF programs. It provides a set of C APIs for interacting with the eBPF system, allowing developers to write user-space programs more easily to load and manage eBPF programs. These user-space programs are typically used for system performance analysis, monitoring, or optimization.
There are several advantages to using the libbpf library:
- It simplifies the process of loading, updating, and running eBPF programs.
- It provides a set of easy-to-use APIs, allowing developers to focus on writing core logic instead of dealing with low-level details.
- It ensures compatibility with the eBPF subsystem in the kernel, reducing maintenance costs.
At the same time, libbpf and BTF (BPF Type Format) are important components of the eBPF ecosystem. They play critical roles in achieving compatibility across different kernel versions. BTF is a metadata format used to describe type information in eBPF programs. The primary purpose of BTF is to provide a structured way to describe data structures in the kernel so that eBPF programs can access and manipulate them more easily.
The key roles of BTF in achieving compatibility across different kernel versions are as follows:
- BTF allows eBPF programs to access detailed type information of kernel data structures without hardcoding specific kernel versions. This enables eBPF programs to adapt to different kernel versions, achieving compatibility across kernel versions.
- By using BPF CO-RE (Compile Once, Run Everywhere) technology, eBPF programs can leverage BTF to parse the type information of kernel data structures during compilation, thereby generating eBPF programs that can run on different kernel versions.
By combining libbpf and BTF, eBPF programs can run on various kernel versions without the need for separate compilation for each kernel version. This greatly improves the portability and compatibility of the eBPF ecosystem and reduces the difficulty of development and maintenance.
## What is Bootstrap
Bootstrap is a complete application that utilizes libbpf. It uses eBPF programs to trace the exec() system call in the kernel (handled by the SEC("tp/sched/sched_process_exec") handle_exec BPF program), which mainly corresponds to the creation of new processes (excluding the fork() part). In addition, it also traces the exit() system call of processes (handled by the SEC("tp/sched/sched_process_exit") handle_exit BPF program) to understand when each process exits.
These two BPF programs work together to capture interesting information about new processes, such as the file name of the binary and measure the lifecycle of processes. They also collect interesting statistics, such as exit codes or resource consumption, when a process exits. This is a good starting point to gain a deeper understanding of the inner workings of the kernel and observe how things actually operate.
Bootstrap also uses the argp API (part of libc) for command-line argument parsing, allowing users to configure the behavior of the application through command-line options. This provides flexibility and allows users to customize the program behavior according to their specific needs. While these functionalities can also be achieved using the eunomia-bpf tool, using libbpf here provides higher scalability in user space at the cost of additional complexity.
## Bootstrap
Bootstrap consists of two parts: kernel space and user space. The kernel space part is an eBPF program that traces the exec() and exit() system calls. The user space part is a C language program that uses the libbpf library to load and run the kernel space program and process the data collected from the kernel space program.
### Kernel-space eBPF Program bootstrap.bpf.c
```c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2020 Facebook */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "bootstrap.h"
char LICENSE[] SEC("license") = "Dual BSD/GPL";
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 8192);
__type(key, pid_t);
__type(value, u64);
} exec_start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024);
} rb SEC(".maps");
const volatile unsigned long long min_duration_ns = 0;
SEC("tp/sched/sched_process_exec")
int handle_exec(struct trace_event_raw_sched_process_exec *ctx)
{
struct task_struct *task;
unsigned fname_off;
struct event *e;
pid_t pid;
u64 ts;
/* remember time exec() was executed for this PID */
pid = bpf_get_current_pid_tgid() >> 32;
ts = bpf_ktime_get_ns();
bpf_map_update_elem(&exec_start, &pid, &ts, BPF_ANY);
/* don't emit exec events when minimum duration is specified */
if (min_duration_ns)
return 0;
/* reserve sample from BPF ringbuf */
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e)
return 0;
/* fill out the sample with data */
task = (struct task_struct *)bpf_get_current_task();
e->exit_event = false;
e->pid = pid;
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
bpf_get_current_comm(&e->comm, sizeof(e->comm));
fname_off = ctx->__data_loc_filename & 0xFFFF;
bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);
/* successfully submit it to user-space for post-processing */
bpf_ringbuf_submit(e, 0);
return 0;
}
SEC("tp/sched/sched_process_exit")
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
{
struct task_struct *task;
struct event *e;
pid_t pid, tid;
u64 id, ts, *start_ts, duration_ns = 0;
/* get PID and TID of exiting thread/process */
id = bpf_get_current_pid_tgid();
pid = id >> 32;
tid = (u32)id;
/* ignore thread exits */
if (pid != tid)
return 0;
/* if we recorded start of the process, calculate lifetime duration */
start_ts = bpf_map_lookup_elem(&exec_start, &pid);
if (start_ts)duration_ns = bpf_ktime_get_ns() - *start_ts;
else if (min_duration_ns)
return 0;
bpf_map_delete_elem(&exec_start, &pid);
/* if process didn't live long enough, return early */
if (min_duration_ns && duration_ns < min_duration_ns)
return 0;
/* reserve sample from BPF ringbuf */
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e)
return 0;
/* fill out the sample with data */
task = (struct task_struct *)bpf_get_current_task();
e->exit_event = true;
e->duration_ns = duration_ns;
e->pid = pid;
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
e->exit_code = (BPF_CORE_READ(task, exit_code) >> 8) & 0xff;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
/* send data to user-space for post-processing */
bpf_ringbuf_submit(e, 0);
return 0;
}
```
This code is a kernel-level eBPF program (`bootstrap.bpf.c`) used to trace `exec()` and `exit()` system calls. It captures process creation and exit events using an eBPF program and sends the relevant information to a user-space program for processing. Below is a detailed explanation of the code.
First, we include the necessary headers and define the license for the eBPF program. We also define two eBPF maps: `exec_start` and `rb`. `exec_start` is a hash type eBPF map used to store the timestamp when a process starts executing. `rb` is a ring buffer type eBPF map used to store captured event data and send it to the user-space program.
```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "bootstrap.h"
char LICENSE[] SEC("license") = "Dual BSD/GPL";
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 8192);
__type(key, pid_t);
__type(value, u64);
} exec_start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024);
} rb SEC(".maps");
const volatile unsigned long long min_duration_ns = 0;
```
Next, we define an eBPF program named `handle_exec` which is triggered when a process executes the `exec()` system call. First, we retrieve the PID from the current process, record the timestamp when the process starts executing, and store it in the `exec_start` map.
```c
SEC("tp/sched/sched_process_exec")
int handle_exec(struct trace_event_raw_sched_process_exec *ctx)
{
// ...
pid = bpf_get_current_pid_tgid() >> 32;
ts = bpf_ktime_get_ns();
bpf_map_update_elem(&exec_start, &pid, &ts, BPF_ANY);
// ...
}
```Then, we reserve an event structure from the circular buffer map `rb` and fill in the relevant data, such as the process ID, parent process ID, and process name. Afterwards, we send this data to the user-mode program for processing.
```c
// reserve sample from BPF ringbuf
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e)
return 0;
// fill out the sample with data
task = (struct task_struct *)bpf_get_current_task();
e->exit_event = false;
e->pid = pid;
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
bpf_get_current_comm(&e->comm, sizeof(e->comm));
fname_off = ctx->__data_loc_filename & 0xFFFF;
bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);
// successfully submit it to user-space for post-processing
bpf_ringbuf_submit(e, 0);
return 0;
```
Finally, we define an eBPF program named `handle_exit` that will be triggered when a process executes the `exit()` system call. First, we retrieve the PID and TID (thread ID) from the current process. If the PID and TID are not equal, it means that this is a thread exit, and we will ignore this event.
```c
SEC("tp/sched/sched_process_exit")
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
{
// ...
id = bpf_get_current_pid_tgid();
pid = id >> 32;
tid = (u32)id;
/* ignore thread exits */
if (pid != tid)
return 0;
// ...
}
```
Next, we look up the timestamp of when the process started execution, which was previously stored in the `exec_start` map. If a timestamp is found, we calculate the process's lifetime duration and then remove the record from the `exec_start` map. If a timestamp is not found and a minimum duration is specified, we return directly.
```c
// if we recorded start of the process, calculate lifetime duration
start_ts = bpf_map_lookup_elem(&exec_start, &pid);
if (start_ts)
duration_ns = bpf_ktime_get_ns() - *start_ts;
else if (min_duration_ns)
return 0;
bpf_map_delete_elem(&exec_start, &pid);
// if process didn't live long enough, return early
if (min_duration_ns && duration_ns < min_duration_ns)
return 0;
```
Then, we reserve an event structure from the circular buffer map `rb` and fill in the relevant data, such as the process ID, parent process ID, process name, and process duration. Finally, we send this data to the user-mode program for processing.
```c
/* reserve sample from BPF ringbuf */
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e)
return 0;
/* fill out the sample with data */
task = (struct task_struct *)bpf_get_current_task();
e->exit_event = true;
e->duration_ns = duration_ns;```
e->pid = pid;
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
e->exit_code = (BPF_CORE_READ(task, exit_code) >> 8) & 0xff;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
/* send data to user-space for post-processing */
bpf_ringbuf_submit(e, 0);
return 0;
}
```
This way, when a process executes the exec() or exit() system calls, our eBPF program captures the corresponding events and sends detailed information to the user space program for further processing. This allows us to easily monitor process creation and termination and obtain detailed information about the processes.
In addition, in the bootstrap.h file, we also define the data structures for interaction with user space:
```c
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/* Copyright (c) 2020 Facebook */
#ifndef __BOOTSTRAP_H
#define __BOOTSTRAP_H
#define TASK_COMM_LEN 16
#define MAX_FILENAME_LEN 127
struct event {
int pid;
int ppid;
unsigned exit_code;
unsigned long long duration_ns;
char comm[TASK_COMM_LEN];
char filename[MAX_FILENAME_LEN];
bool exit_event;
};
#endif /* __BOOTSTRAP_H */
```
### User space, bootstrap.c
```c
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
/* Copyright (c) 2020 Facebook */
#include <argp.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>
#include <bpf/libbpf.h>
#include "bootstrap.h"
#include "bootstrap.skel.h"
static struct env {
bool verbose;
long min_duration_ms;
} env;
const char *argp_program_version = "bootstrap 0.0";
const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
const char argp_program_doc[] =
"BPF bootstrap demo application.\n"
"\n"
"It traces process start and exits and shows associated \n"
"information (filename, process duration, PID and PPID, etc).\n"
"\n"
"USAGE: ./bootstrap [-d <min-duration-ms>] [-v]\n";
static const struct argp_option opts[] = {
{ "verbose", 'v', NULL, 0, "Verbose debug output" },
{ "duration", 'd', "DURATION-MS", 0, "Minimum process duration (ms) to report" },
{},
};
static error_t parse_arg(int key, char *arg, struct argp_state *state)
{
switch (key) {
case 'v':
env.verbose = true;
break;
case 'd':
errno = 0;
env.min_duration_ms = strtol(arg, NULL, 10);
if (errno || env.min_duration_ms <= 0) {
fprintf(stderr, "Invalid duration: %s\n", arg);
argp_usage(state);
}
break;
case ARGP_KEY_ARG:
argp_usage(state);
break;
default:
return ARGP_ERR_UNKNOWN;
}
return 0;
}
static const struct argp argp = {
.options = opts,
.parser = parse_arg,
.doc = argp_program_doc,
};
static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
if (level == LIBBPF_DEBUG && !env.verbose)
return 0;
return vfprintf(stderr, format, args);
}
static volatile bool exiting = false;
static void sig_handler(int sig)
{
exiting = true;
}
static int handle_event(void *ctx, void *data, size_t data_sz)
{
const struct event *e = data;
struct tm *tm;
char ts[32];
time_t t;
time(&t);
tm = localtime(&t);
strftime(ts, sizeof(ts), "%H:%M:%S", tm);
if (e->exit_event) {
printf("%-8s %-5s %-16s %-7d %-7d [%u]",
ts, "EXIT", e->comm, e->pid, e->ppid, e->exit_code);
if (e->duration_ns)
printf(" (%llums)", e->duration_ns / 1000000);
printf("\n");
} else {
printf("%-8s %-5s %-16s %-7d %-7d %s\n",
ts, "EXEC", e->comm, e->pid, e->ppid, e->filename);
}
return 0;
}
int main(int argc, char **argv)
{
struct ring_buffer *rb = NULL;
struct bootstrap_bpf *skel;
int err;
/* Parse command line arguments */
err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
if (err)
return err;
/* Set up libbpf errors and debug info callback */
libbpf_set_print(libbpf_print_fn);
/* Cleaner handling of Ctrl-C */
signal(SIGINT, sig_handler);
signal(SIGTERM, sig_handler);
/* Load and verify BPF application */
skel = bootstrap_bpf__open();
if (!skel) {
fprintf(stderr, "Failed to open and load BPF skeleton\n");
return 1;
}
/* Parameterize BPF code with minimum duration parameter */
skel->rodata->min_duration_ns = env.min_duration_ms * 1000000ULL;
/* Load & verify BPF programs */
err = bootstrap_bpf__load(skel);
if (err) {
fprintf(stderr, "Failed to load and verify BPF skeleton\n");
goto cleanup;
}
/* Attach tracepoints */
err = bootstrap_bpf__attach(skel);
if (err) {
fprintf(stderr, "Failed to attach BPF skeleton\n");
goto cleanup;
}
/* Set up ring buffer polling */
rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), handle_event, NULL, NULL);
if (!rb) {
err = -1;
fprintf(stderr, "Failed to create ring buffer\n");
goto cleanup;
}
/* Process events */
printf("%-8s %-5s %-16s %-7s %-7s %s\n",
"TIME", "EVENT", "COMM", "PID", "PPID", "FILENAME/EXIT CODE");
while (!exiting) {
err = ring_buffer__poll(rb, 100 /* timeout, ms */);
/* Ctrl-C will cause -EINTR */
if (err == -EINTR) {
err = 0;
break;
}
if (err < 0) {
printf("Error polling perf buffer: %d\n", err);
break;
}
}
cleanup:
/* Clean up */
ring_buffer__free(rb);
bootstrap_bpf__destroy(skel);
return err < 0 ? -err : 0;
}
```
This user-level program is mainly used to load, verify, attach eBPF programs, and receive event data collected by eBPF programs and print it out. We will analyze some key parts.
First, we define an env structure to store command line arguments:
```c
static struct env {
bool verbose;
long min_duration_ms;
} env;
```
Next, we use the argp library to parse command line arguments:
```c
static const struct argp_option opts[] = {
{ "verbose", 'v', NULL, 0, "Verbose debug output" },
{ "duration", 'd', "DURATION-MS", 0, "Minimum process duration (ms) to report" },
{},
};
static error_t parse_arg(int key, char *arg, struct argp_state *state)
{
// ...
}
static const struct argp argp = {
.options = opts,
.parser = parse_arg,
.doc = argp_program_doc,
};
```
In the main() function, we first parse the command line arguments, and then set the libbpf print callback function libbpf_print_fn to output debug information when needed:
```c
err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
if (err)
return err;```c
libbpf_set_print(libbpf_print_fn);
Next, we open the eBPF skeleton file, pass the minimum duration parameter to the eBPF program, and load and attach the eBPF program:
```c
skel = bootstrap_bpf__open();
if (!skel) {
fprintf(stderr, "Failed to open and load BPF skeleton\n");
return 1;
}
skel->rodata->min_duration_ns = env.min_duration_ms * 1000000ULL;
err = bootstrap_bpf__load(skel);
if (err) {
fprintf(stderr, "Failed to load and verify BPF skeleton\n");
goto cleanup;
}
err = bootstrap_bpf__attach(skel);
if (err) {
fprintf(stderr, "Failed to attach BPF skeleton\n");
goto cleanup;
}
```
Then, we create a ring buffer to receive event data sent by the eBPF program:
```c
rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), handle_event, NULL, NULL);
if (!rb) {
err = -1;
fprintf(stderr, "Failed to create ring buffer\n");
goto cleanup;
}
```
The handle_event() function handles events received from the eBPF program. Depending on the event type (process execution or exit), it extracts and prints event information such as timestamp, process name, process ID, parent process ID, file name, or exit code.
Finally, we use the ring_buffer__poll() function to poll the ring buffer and process the received event data:
```c
while (!exiting) {
err = ring_buffer__poll(rb, 100 /* timeout, ms */);
// ...
}
```
When the program receives the SIGINT or SIGTERM signal, it completes the final cleanup and exit operations, and closes and unloads the eBPF program:
```c
cleanup:
/* Clean up */
ring_buffer__free(rb);
bootstrap_bpf__destroy(skel);
return err < 0 ? -err : 0;
}
```
## Dependency Installation
Building the example requires clang, libelf, and zlib. The package names may vary in different distributions.
On Ubuntu/Debian, you need to execute the following command:
```shell
sudo apt install clang libelf1 libelf-dev zlib1g-dev
```
On CentOS/Fedora, you need to execute the following command:
```shell
sudo dnf install clang elfutils-libelf elfutils-libelf-devel zlib-devel
```
## Compile and Run
Compile and run the above code:
```console
$ make
BPF .output/bootstrap.bpf.o
GEN-SKEL .output/bootstrap.skel.h
CC .output/bootstrap.o
BINARY bootstrap
$ sudo ./bootstrap
[sudo] password for yunwei:
TIME EVENT COMM PID PPID FILENAME/EXIT CODE
03:16:41 EXEC sh 110688 80168 /bin/sh
03:16:41 EXEC which 110689 110688 /usr/bin/which
03:16:41 EXIT which 110689 110688 [0] (0ms)
03:16:41 EXIT sh 110688 80168 [0] (0ms)".
```
The complete source code can be found at <https://github.com/eunomia-bpf/bpf-developer-tutorial>
## Summary
Through this example, we have learned how to combine eBPF programs with user-space programs. This combination provides developers with a powerful toolkit for efficient data collection and processing across the kernel and user space. By using eBPF and libbpf, you can build more efficient, scalable, and secure monitoring and performance analysis tools.
In the following tutorials, we will continue to explore the advanced features of eBPF and share more about eBPF development practices. Through continuous learning and practice, you will have a better understanding and mastery of eBPF technology and apply it to solve real-world problems.
If you would like to learn more about eBPF knowledge and practices, please refer to the official documentation of eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>. You can also visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials.
## Reference
- [Building BPF applications with libbpf-bootstrap](https://nakryiko.com/posts/libbpf-bootstrap/)
- <https://github.com/libbpf/libbpf-bootstrap>

328
src/12-profile/README_en.md Normal file
View File

@@ -0,0 +1,328 @@
# eBPF Beginner's Practical Tutorial 12: Using eBPF Program Profile for Performance Analysis
This tutorial will guide you on using libbpf and eBPF programs for performance analysis. We will leverage the perf mechanism in the kernel to learn how to capture the execution time of functions and view performance data.
libbpf is a C library for interacting with eBPF. It provides the basic functionality for creating, loading, and using eBPF programs. In this tutorial, we will mainly use libbpf for development. Perf is a performance analysis tool in the Linux kernel that allows users to measure and analyze the performance of kernel and user space programs, as well as obtain corresponding call stacks. It collects performance data using hardware counters and software events in the kernel.
## eBPF Tool: profile Performance Analysis Example
The `profile` tool is implemented based on eBPF and utilizes the perf events in the Linux kernel for performance analysis. The `profile` tool periodically samples each processor to capture the execution of kernel and user space functions. It provides the following information for stack traces:
- Address: memory address of the function call
- Symbol: function name
- File Name: name of the source code file
- Line Number: line number in the source code
This information helps developers locate performance bottlenecks and optimize code. Furthermore, flame graphs can be generated based on this information for a more intuitive view of performance data.
In this example, you can compile and run it with the libbpf library (using Ubuntu/Debian as an example):
```console
$ git submodule update --init --recursive
$ sudo apt install clang libelf1 libelf-dev zlib1g-dev
$ make
$ sudo ./profile
COMM: chronyd (pid=156) @ CPU 1
Kernel:
0 [<ffffffff81ee9f56>] _raw_spin_lock_irqsave+0x16
1 [<ffffffff811527b4>] remove_wait_queue+0x14
2 [<ffffffff8132611d>] poll_freewait+0x3d
3 [<ffffffff81326d3f>] do_select+0x7bf
4 [<ffffffff81327af2>] core_sys_select+0x182
5 [<ffffffff81327f3a>] __x64_sys_pselect6+0xea
6 [<ffffffff81ed9e38>] do_syscall_64+0x38
7 [<ffffffff82000099>] entry_SYSCALL_64_after_hwframe+0x61
Userspace:
0 [<00007fab187bfe09>]
1 [<000000000ee6ae98>]
COMM: profile (pid=9843) @ CPU 6
No Kernel Stack
Userspace:
0 [<0000556deb068ac8>]
1 [<0000556dec34cad0>]
```
## Implementation Principle
The `profile` tool consists of two parts: the eBPF program in kernel space and the `profile` symbol handling program in user space. The `profile` symbol handling program is responsible for loading the eBPF program and processing the data outputted by the eBPF program.
### Kernel Space Part
The implementation logic of the eBPF program in kernel space mainly relies on perf events to periodically sample the stack of the program, thereby capturing its execution flow.
```c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2022 Meta Platforms, Inc. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "profile.h"
char LICENSE[] SEC("license") = "Dual BSD/GPL";
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024);
} events SEC(".maps");
SEC("perf_event")
int profile(void *ctx)
{
int pid = bpf_get_current_pid_tgid() >> 32;
int cpu_id = bpf_get_smp_processor_id();
struct stacktrace_event *event;
int cp;
event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
if (!event)
return 1;
event->pid = pid;
event->cpu_id = cpu_id;
if (bpf_get_current_comm(event->comm, sizeof(event->comm)))
event->comm[0] = 0;
event->kstack_sz = bpf_get_stack(ctx, event->kstack, sizeof(event->kstack), 0);
event->ustack_sz = bpf_get_stack(ctx, event->ustack, sizeof(event->ustack), BPF_F_USER_STACK);
bpf_ringbuf_submit(event, 0);
return 0;
}
```
Next, we will focus on the key part of the kernel code.
1. Define eBPF maps `events`:
```c
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024);
} events SEC(".maps");
```
Here, a eBPF maps of type `BPF_MAP_TYPE_RINGBUF` is defined. The Ring Buffer is a high-performance circular buffer used to transfer data between the kernel and user space. `max_entries` sets the maximum size of the Ring Buffer.
2. Define `perf_event` eBPF program:
```c
SEC("perf_event")
int profile(void *ctx)
```
Here, a eBPF program named `profile` is defined, which will be executed when a perf event is triggered.
3. Get process ID and CPU ID:
```c
int pid = bpf_get_current_pid_tgid() >> 32;
int cpu_id = bpf_get_smp_processor_id();
```
The function `bpf_get_current_pid_tgid()` returns the PID and TID of the current process. By right shifting 32 bits, we get the PID. The function `bpf_get_smp_processor_id()` returns the ID of the current CPU.
4. Reserve space in the Ring Buffer:
```c
event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
if (!event)
return 1;
```
Use the `bpf_ringbuf_reserve()` function to reserve space in the Ring Buffer for storing the collected stack information. If the reservation fails, return an error.
5. Get the current process name:
```c
if (bpf_get_current_comm(event->comm, sizeof(event->comm)))
event->comm[0] = 0;
```
Use the `bpf_get_current_comm()` function to get the current process name and store it in `event->comm`.
6. Get kernel stack information:
```c
event->kstack_sz = bpf_get_stack(ctx, event->kstack, sizeof(event->kstack), 0);
```
Use the `bpf_get_stack()` function to get kernel stack information. Store the result in `event->kstack` and the size in `event->kstack_sz`.
7. Get user space stack information:
```c
event->ustack_sz = bpf_get_stack(ctx, event->ustack, sizeof(event->ustack), BPF_F_USER_STACK);
```Using the `bpf_get_stack()` function with the `BPF_F_USER_STACK` flag retrieves information about the user space stack. Store the result in `event->ustack` and its size in `event->ustack_sz`.
8. Submit the event to the Ring Buffer:
```c
bpf_ringbuf_submit(event, 0);
```
Finally, use the `bpf_ringbuf_submit()` function to submit the event to the Ring Buffer for the user space program to read and process.
This kernel mode eBPF program captures the program's execution flow by sampling the kernel stack and user space stack of the program periodically. These data are stored in the Ring Buffer for the user mode `profile` program to read.
### User Mode Section
This code is mainly responsible for setting up perf events for each online CPU and attaching eBPF programs:
```c
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
return ret;
}
int main(){
...
for (cpu = 0; cpu < num_cpus; cpu++) {
/* skip offline/not present CPUs */
if (cpu >= num_online_cpus || !online_mask[cpu])
continue;
/* Set up performance monitoring on a CPU/Core */
pefd = perf_event_open(&attr, pid, cpu, -1, PERF_FLAG_FD_CLOEXEC);
if (pefd < 0) {
fprintf(stderr, "Fail to set up performance monitor on a CPU/Core\n");
err = -1;
goto cleanup;
}
pefds[cpu] = pefd;
/* Attach a BPF program on a CPU */
links[cpu] = bpf_program__attach_perf_event(skel->progs.profile, pefd);
if (!links[cpu]) {
err = -1;
goto cleanup;
}
}
...
}
```
The `perf_event_open` function is a wrapper for the perf_event_open system call. It takes a pointer to a perf_event_attr structure to specify the type and attributes of the perf event. The pid parameter is used to specify the process ID to monitor (-1 for monitoring all processes), and the cpu parameter is used to specify the CPU to monitor. The group_fd parameter is used to group perf events, and we use -1 here to indicate no grouping is needed. The flags parameter is used to set some flags, and we use PERF_FLAG_FD_CLOEXEC to ensure file descriptors are closed when executing exec series system calls.
In the main function:
```c
for (cpu = 0; cpu < num_cpus; cpu++) {
// ...
}
```
This loop sets up perf events and attaches eBPF programs for each online CPU. Firstly, it checks if the current CPU is online and skips if it's not. Then, it uses the perf_event_open() function to set up perf events for the current CPU and stores the returned file descriptor in the pefds array. Finally, it attaches the eBPF program to the perf event using the bpf_program__attach_perf_event() function. The links array is used to store the BPF links for each CPU so that they can be destroyed when the program ends.By doing so, user-mode programs set perf events for each online CPU and attach eBPF programs to these perf events to monitor all online CPUs in the system.
The following two functions are used to display stack traces and handle events received from the ring buffer:
```c
static void show_stack_trace(__u64 *stack, int stack_sz, pid_t pid)
{
const struct blazesym_result *result;
const struct blazesym_csym *sym;
sym_src_cfg src;
int i, j;
if (pid) {
src.src_type = SRC_T_PROCESS;
src.params.process.pid = pid;
} else {
src.src_type = SRC_T_KERNEL;
src.params.kernel.kallsyms = NULL;
src.params.kernel.kernel_image = NULL;
}
result = blazesym_symbolize(symbolizer, &src, 1, (const uint64_t *)stack, stack_sz);
for (i = 0; i < stack_sz; i++) {
if (!result || result->size <= i || !result->entries[i].size) {
printf(" %d [<%016llx>]\n", i, stack[i]);
continue;
}
if (result->entries[i].size == 1) {
sym = &result->entries[i].syms[0];
if (sym->path && sym->path[0]) {
printf(" %d [<%016llx>] %s+0x%llx %s:%ld\n",
i, stack[i], sym->symbol,
stack[i] - sym->start_address,
sym->path, sym->line_no);
} else {
printf(" %d [<%016llx>] %s+0x%llx\n",
i, stack[i], sym->symbol,
stack[i] - sym->start_address);
}
continue;
}
printf(" %d [<%016llx>]\n", i, stack[i]);
for (j = 0; j < result->entries[i].size; j++) {
sym = &result->entries[i].syms[j];
if (sym->path && sym->path[0]) {
printf(" %s+0x%llx %s:%ld\n",
sym->symbol, stack[i] - sym->start_address,
sym->path, sym->line_no);
} else {
printf(" %s+0x%llx\n", sym->symbol,
stack[i] - sym->start_address);
}
}
}
blazesym_result_free(result);
}
``` /* Receive events from the ring buffer. */".```c
static int event_handler(void *_ctx, void *data, size_t size)
{
struct stacktrace_event *event = data;
if (event->kstack_sz <= 0 && event->ustack_sz <= 0)
return 1;
printf("COMM: %s (pid=%d) @ CPU %d\n", event->comm, event->pid, event->cpu_id);
if (event->kstack_sz > 0) {
printf("Kernel:\n");
show_stack_trace(event->kstack, event->kstack_sz / sizeof(__u64), 0);
} else {
printf("No Kernel Stack\n");
}
if (event->ustack_sz > 0) {
printf("Userspace:\n");
show_stack_trace(event->ustack, event->ustack_sz / sizeof(__u64), event->pid);
} else {
printf("No Userspace Stack\n");
}
printf("\n");
return 0;
}
```
The `show_stack_trace()` function is used to display the stack trace of the kernel or userspace. It takes a `stack` parameter, which is a pointer to the kernel or userspace stack, and a `stack_sz` parameter, which represents the size of the stack. The `pid` parameter represents the ID of the process to be displayed (set to 0 when displaying the kernel stack). In the function, the source of the stack (kernel or userspace) is determined based on the `pid` parameter, and then the `blazesym_symbolize()` function is called to resolve the addresses in the stack to symbol names and source code locations. Finally, the resolved results are traversed and the symbol names and source code location information are outputted.
The `event_handler()` function is used to handle events received from the ring buffer. It takes a `data` parameter, which points to the data in the ring buffer, and a `size` parameter, which represents the size of the data. The function first converts the `data` pointer to a pointer of type `stacktrace_event`, and then checks the sizes of the kernel and userspace stacks. If the stacks are empty, it returns directly. Next, the function outputs the process name, process ID, and CPU ID information. Then it displays the stack traces of the kernel and userspace respectively. When calling the `show_stack_trace()` function, the addresses, sizes, and process ID of the kernel and userspace stacks are passed in separately.
These two functions are part of the eBPF profiling tool, used to display and process stack trace information collected by eBPF programs, helping users understand program performance and bottlenecks.
### Summary
Through this introductory tutorial on eBPF, we have learned how to use eBPF programs for performance analysis. In this process, we explained in detail how to create eBPF programs, monitor process performance, and retrieve data from the ring buffer for analyzing stack traces. We also learned how to use the `perf_event_open()` function to set up performance monitoring and attach BPF programs to performance events. In this tutorial, we also demonstrated how to write eBPF programs to capture the kernel and userspace stack information of processes in order to analyze program performance bottlenecks. With this example, you can understand the powerful features of eBPF in performance analysis.
If you want to learn more about eBPF knowledge and practices, please refer to the official documentation of eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>. You can also visit our tutorial code repository <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials.
The next tutorial will further explore advanced features of eBPF. We will continue to share more content about eBPF development practices to help you better understand and master eBPF technology. We hope these contents will be helpful for your learning and practice on the eBPF development journey.

View File

@@ -0,0 +1,569 @@
# eBPF Beginner's Development Practice Tutorial 13: Statistics of TCP Connection Delay and Data Processing in User Space Using libbpf
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without restarting the kernel or changing the kernel source code.
This article is the thirteenth installment of the eBPF beginner's development practice tutorial, mainly about how to use eBPF to statistics TCP connection delay and process data in user space using libbpf.
## Background
When developing backends, regardless of the programming language used, we often need to call databases such as MySQL and Redis, perform RPC remote calls, or call other RESTful APIs. The underlying implementation of these calls is usually based on the TCP protocol. This is because TCP protocol has advantages such as reliable connection, error retransmission, congestion control, etc., so TCP is more widely used in network transport layer protocols than UDP. However, TCP also has some drawbacks, such as longer connection establishment delay. Therefore, some alternative solutions have emerged, such as QUIC (Quick UDP Internet Connections).
Analyzing TCP connection delay is very useful for network performance analysis, optimization, and troubleshooting.
## Overview of tcpconnlat Tool
The `tcpconnlat` tool can trace the functions in the kernel that perform active TCP connections (such as using the `connect()` system call), measure and display connection delay, i.e., the time from sending SYN to receiving response packets.
### TCP Connection Principle
The process of establishing a TCP connection is often referred to as the "three-way handshake". Here are the steps of the entire process:
1. Client sends SYN packet to the server: The client sends SYN through the `connect()` system call. This involves local system call and CPU time cost of software interrupts.
2. SYN packet is transmitted to the server: This is a network transmission that depends on network latency.
3. Server handles the SYN packet: The server kernel receives the packet through a software interrupt, then puts it into the listen queue and sends SYN/ACK response. This mainly involves CPU time cost.
4. SYN/ACK packet is transmitted to the client: This is another network transmission.
5. Client handles the SYN/ACK: The client kernel receives and handles the SYN/ACK packet, then sends ACK. This mainly involves software interrupt handling cost.
6. ACK packet is transmitted to the server: This is the third network transmission.
7. Server receives ACK: The server kernel receives and handles the ACK, then moves the corresponding connection from the listen queue to the established queue. This involves CPU time cost of a software interrupt.
8. Wake up the server-side user process: The user process blocked by the `accept()` system call is awakened, and then the established connection is taken out from the established queue. This involves CPU cost of a context switch.
The complete flowchart is shown below:
![tcpconnlat1](tcpconnlat1.png)
From the client's perspective, under normal circumstances, the total time for a TCP connection is approximately the time consumed by one network round-trip. However, in some cases, it may cause an increase in network transmission time, an increase in CPU processing overhead, or even connection failure. When a long delay is detected, it can be analyzed in conjunction with other information.
## eBPF Implementation of tcpconnlat
To understand the process of establishing a TCP connection, we need to understand two queues used by the Linux kernel when handling TCP connections:
- Listen queue (SYN queue): Stores TCP connections that are in the process of performing three-way handshake. After the server receives the SYN packet, it stores the connection information in this queue.
- Established queue (Accept queue): Stores TCP connections that have completed three-way handshake and are waiting for the application to call the `accept()` function. After the server receives the ACK packet, it creates a new connection and adds it to this queue.
With an understanding of the purpose of these two queues, we can begin to explore the specific implementation of tcpconnlat. The implementation of tcpconnlat can be divided into two parts: kernel space and user space, which include several main trace points: `tcp_v4_connect`, `tcp_v6_connect`, and `tcp_rcv_state_process`.
These trace points are mainly located in the TCP/IP network stack in the kernel. When executing the corresponding system call or kernel function, these trace points are activated, triggering the execution of eBPF programs. This allows us to capture and measure the entire process of establishing a TCP connection.
Let's take a look at the source code of these mounting points first:
```c
SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("kprobe/tcp_v6_connect")
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("kprobe/tcp_rcv_state_process")
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
{
return handle_tcp_rcv_state_process(ctx, sk);
}
```
This code snippet shows the definition of three kernel probes (kprobe). `tcp_v4_connect` and `tcp_v6_connect` are triggered when the corresponding IPv4 and IPv6 connections are initialized, invoking the `trace_connect()` function. On the other hand, `tcp_rcv_state_process` is triggered when the TCP connection state changes in the kernel, calling the `handle_tcp_rcv_state_process()` function.
The following section will be divided into two parts: one part analyzes the kernel part of these mount points, where we will delve into the kernel source code to explain how these functions work in detail. The other part analyzes the user part, focusing on how eBPF programs collect data from these mount points and interact with user-space programs.
### Analysis of tcp_v4_connect function
The `tcp_v4_connect` function is the main way that the Linux kernel handles TCP IPv4 connection requests. When a user-space program creates a socket through the `socket` system call and then attempts to connect to a remote server through the `connect` system call, the `tcp_v4_connect` function is triggered.
```c
/* This will initiate an outgoing connection. */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
struct inet_timewait_death_row *tcp_death_row;
struct inet_sock *inet = inet_sk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct ip_options_rcu *inet_opt;
struct net *net = sock_net(sk);
__be16 orig_sport, orig_dport;
__be32 daddr, nexthop;
struct flowi4 *fl4;
struct rtable *rt;
int err;
if (addr_len < sizeof(struct sockaddr_in))
return -EINVAL;
if (usin->sin_family != AF_INET)
return -EAFNOSUPPORT;
nexthop = daddr = usin->sin_addr.s_addr;
inet_opt = rcu_dereference_protected(inet->inet_opt,
lockdep_sock_is_held(sk));
if (inet_opt && inet_opt->opt.srr) {
if (!daddr)
return -EINVAL;
nexthop = inet_opt->opt.faddr;
}
orig_sport = inet->inet_sport;
orig_dport = usin->sin_port;
fl4 = &inet->cork.fl.u.ip4;
rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
sk->sk_bound_dev_if, IPPROTO_TCP, orig_sport,
orig_dport, sk);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
if (err == -ENETUNREACH)
IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
return err;
}
if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
ip_rt_put(rt);
return -ENETUNREACH;
}
if (!inet_opt || !inet_opt->opt.srr)
daddr = fl4->daddr;
tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
if (!inet->inet_saddr) {
err = inet_bhash2_update_saddr(sk, &fl4->saddr, AF_INET);
if (err) {
ip_rt_put(rt);
return err;
}
} else {
sk_rcv_saddr_set(sk, inet->inet_saddr);
}
if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
/* Reset inherited state */
tp->rx_opt.ts_recent = 0;
tp->rx_opt.ts_recent_stamp = 0;
if (likely(!tp->repair))
WRITE_ONCE(tp->write_seq, 0);
}
inet->inet_dport = usin->sin_port;
sk_daddr_set(sk, daddr);
inet_csk(sk)->icsk_ext_hdr_len = 0;
if (inet_opt)
inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;
/* Socket identity is still unknown (sport may be zero).
* However we set state to SYN-SENT and not releasing socket
* lock select source port, enter ourselves into the hash tables and
* complete initialization after this.
*/
tcp_set_state(sk, TCP_SYN_SENT);
err = inet_hash_connect(tcp_death_row, sk);
if (err)
goto failure;
sk_set_txhash(sk);
rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
inet->inet_sport, inet->inet_dport, sk);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
rt = NULL;
goto failure;
}
/* OK, now commit destination to socket. */
sk->sk_gso_type = SKB_GSO_TCPV4;
sk_setup_caps(sk, &rt->dst);
rt = NULL;
if (likely(!tp->repair)) {
if (!tp->write_seq)
WRITE_ONCE(tp->write_seq,
secure_tcp_seq(inet->inet_saddr,
inet->inet_daddr,
inet->inet_sport,
usin->sin_port));
tp->tsoffset = secure_tcp_ts_off(net, inet->inet_saddr,
inet->inet_daddr);
}
inet->inet_id = get_random_u16();
if (tcp_fastopen_defer_connect(sk, &err))
return err;
if (err)
goto failure;
err = tcp_connect(sk);
if (err)
goto failure;
return 0;
failure:
/*".* This unhashes the socket and releases the local port,
* if necessary.
*/
tcp_set_state(sk, TCP_CLOSE);
inet_bhash2_reset_saddr(sk);
ip_rt_put(rt);
sk->sk_route_caps = 0;
inet->inet_dport = 0;
return err;
}
EXPORT_SYMBOL(tcp_v4_connect);
```
Reference link: <https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L340>
Next, let's analyze this function step by step:
First, this function takes three parameters: a socket pointer `sk`, a pointer to the socket address structure `uaddr`, and the length of the address `addr_len`.
```c
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
```
The function starts by checking the parameters, making sure the address length is correct and the address family is IPv4. If these conditions are not met, the function returns an error.
Next, the function retrieves the destination address and, if a source routing option is set (an advanced IP feature that is typically not used), it also retrieves the next hop address for the source route.
```c
nexthop = daddr = usin->sin_addr.s_addr;
inet_opt = rcu_dereference_protected(inet->inet_opt,
lockdep_sock_is_held(sk));
if (inet_opt && inet_opt->opt.srr) {
if (!daddr)
return -EINVAL;
nexthop = inet_opt->opt.faddr;
}
```
Then, using this information, the function looks for a route entry to the destination address. If a route entry cannot be found or the route entry points to a multicast or broadcast address, the function returns an error.
Next, it updates the source address, handles the state of some TCP timestamp options, and sets the destination port and address. After that, it updates some other socket and TCP options and sets the connection state to `SYN-SENT`.
Then, the function tries to add the socket to the connected sockets hash table using the `inet_hash_connect` function. If this step fails, it restores the socket state and returns an error.
If all the previous steps succeed, it then updates the route entry with the new source and destination ports. If this step fails, it cleans up resources and returns an error.
Next, it commits the destination information to the socket and selects a secure random value for the sequence offset for future segments.
Then, the function tries to establish the connection using TCP Fast Open (TFO), and if TFO is not available or the TFO attempt fails, it falls back to the regular TCP three-way handshake for connection.
Finally, if all the above steps succeed, the function returns success; otherwise, it cleans up all resources and returns an error.
In summary, the `tcp_v4_connect` function is a complex function that handles TCP connection requests. It handles many cases, including parameter checking, route lookup, source address selection, source routing, TCP option handling, TCP Fast Open, and more. Its main goal is to establish TCP connections as safely and efficiently as possible.
### Kernel Code
```c
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2020 Wenbo Zhang
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#include "tcpconnlat.h"
#define AF_INET 2
#define AF_INET6 10
const volatile __u64 targ_min_us = 0;
const volatile pid_t targ_tgid = 0;
struct piddata {
char comm[TASK_COMM_LEN];
u64 ts;
u32 tgid;
};
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 4096);
__type(key, struct sock *);
__type(value, struct piddata);
} start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
static int trace_connect(struct sock *sk)
{
u32 tgid = bpf_get_current_pid_tgid() >> 32;
struct piddata piddata = {};
if (targ_tgid && targ_tgid != tgid)
return 0;
bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
piddata.ts = bpf_ktime_get_ns();
piddata.tgid = tgid;
bpf_map_update_elem(&start, &sk, &piddata, 0);
return 0;
}
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
{
struct piddata *piddatap;
struct event event = {};
s64 delta;
u64 ts;
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
return 0;
piddatap = bpf_map_lookup_elem(&start, &sk);
if (!piddatap)
return 0;
ts = bpf_ktime_get_ns();
delta = (s64)(ts - piddatap->ts);
if (delta < 0)
goto cleanup;
event.delta_us = delta / 1000U;
if (targ_min_us && event.delta_us < targ_min_us)
goto cleanup;
__builtin_memcpy(&event.comm, piddatap->comm,
sizeof(event.comm));
event.ts_us = ts / 1000;
event.tgid = piddatap->tgid;
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
if (event.af == AF_INET) {
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
} else {
BPF_CORE_READ_INTO(&event.saddr_v6, sk,
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
BPF_CORE_READ_INTO(&event.daddr_v6, sk,
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
}
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
&event, sizeof(event));
cleanup:
bpf_map_delete_elem(&start, &sk);
return 0;
}
SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("kprobe/tcp_v6_connect")
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("kprobe/tcp_rcv_state_process")
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
{
return handle_tcp_rcv_state_process(ctx, sk);
}
SEC("fentry/tcp_v4_connect")
int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("fentry/tcp_v6_connect")
int BPF_PROG(fentry_tcp_v6_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("fentry/tcp_rcv_state_process")
int BPF_PROG(fentry_tcp_rcv_state_process, struct sock *sk)
{
return handle_tcp_rcv_state_process(ctx, sk);
}
char LICENSE[] SEC("license") = "GPL";
```
This eBPF (Extended Berkeley Packet Filter) program is mainly used to monitor and collect the time it takes to establish TCP connections, i.e., the time interval from initiating a TCP connection request (connect system call) to the completion of the connection establishment (SYN-ACK handshake process). This is very useful for monitoring network latency, service performance analysis, and other aspects.
First, two eBPF maps are defined: `start` and `events`. `start` is a hash table used to store the process information and timestamp of the initiating connection request, while `events` is a map of type `PERF_EVENT_ARRAY` used to transfer event data to user space.
```c
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 4096);
__type(key, struct sock *);
__type(value, struct piddata);
} start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
```
In the kprobe handling functions `trace_connect` of `tcp_v4_connect` and `tcp_v6_connect`, the process information (process name, process ID, and current timestamp) of the initiating connection request is recorded and stored in the `start` map with the socket structure as the key.
```c
static int trace_connect(struct sock *sk)
{
u32 tgid = bpf_get_current_pid_tgid() >> 32;
struct piddata piddata = {};
if (targ_tgid && targ_tgid != tgid)
return 0;
bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
piddata.ts = bpf_ktime_get_ns();
piddata.tgid = tgid;
bpf_map_update_elem(&start, &sk, &piddata, 0);
return 0;
}
```
When the TCP state machine processes the SYN-ACK packet, i.e., when the connection is established, the kprobe handling function `handle_tcp_rcv_state_process` of `tcp_rcv_state_process` is triggered. In this function, it first checks if the socket state is `SYN-SENT`. If it is, it looks up the process information for the socket in the `start` map. Then it calculates the time interval from the initiation of the connection to the present and sends this time interval, process information, and TCP connection details (source port, destination port, source IP, destination IP, etc.) as an event to user space using the `bpf_perf_event_output` function.
```c
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
{
struct piddata *piddatap;
struct event event = {};
s64 delta;
u64 ts;
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
return 0;
piddatap = bpf_map_lookup_elem(&start, &sk);
if (!piddatap)
return 0;
ts = bpf_ktime_get_ns();
delta = (s64)(ts - piddatap->ts);
if (delta < 0)
goto cleanup;
event.delta_us = delta / 1000U;
if (targ_min_us && event.delta_us < targ_min_us)
goto cleanup;
__builtin_memcpy(&event.comm, piddatap->comm,
sizeof(event.comm));
event.ts_us = ts / 1000;
event.tgid = piddatap->tgid;
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
if (event.af == AF_INET) {
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
} else {
BPF_CORE_READ_INTO(&event.saddr_v6, sk,
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
BPF_CORE_READ_INTO(&event.daddr_v6, sk,
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
}
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
&event, sizeof(event));
cleanup:
bpf_map_delete_elem(&start, &sk);
return 0;
}
```
This program uses a while loop to repeatedly poll the perf event buffer. If there is an error during polling (e.g., due to a signal interruption), an error message will be printed. This polling process continues until an exit flag `exiting` is received.
Next, let's take a look at the `handle_event` function, which handles every eBPF event sent from the kernel to user space:
```c
void handle_event(void* ctx, int cpu, void* data, __u32 data_sz) {
const struct event* e = data;
char src[INET6_ADDRSTRLEN];
char dst[INET6_ADDRSTRLEN];
union {
struct in_addr x4;
struct in6_addr x6;
} s, d;
static __u64 start_ts;
if (env.timestamp) {
if (start_ts == 0)
start_ts = e->ts_us;
printf("%-9.3f ", (e->ts_us - start_ts) / 1000000.0);
}
if (e->af == AF_INET) {
s.x4.s_addr = e->saddr_v4;
d.x4.s_addr = e->daddr_v4;
} else if (e->af == AF_INET6) {
memcpy(&s.x6.s6_addr, e->saddr_v6, sizeof(s.x6.s6_addr));
memcpy(&d.x6.s6_addr, e->daddr_v6, sizeof(d.x6.s6_addr));
} else {
fprintf(stderr, "broken event: event->af=%d", e->af);
return;
}
if (env.lport) {
printf("%-6d %-12.12s %-2d %-16s %-6d %-16s %-5d %.2f\n", e->tgid,
e->comm, e->af == AF_INET ? 4 : 6,
inet_ntop(e->af, &s, src, sizeof(src)), e->lport,
inet_ntop(e->af, &d, dst, sizeof(dst)), ntohs(e->dport),
e->delta_us / 1000.0);
} else {
printf("%-6d %-12.12s %-2d %-16s %-16s %-5d %.2f\n", e->tgid, e->comm,
e->af == AF_INET ? 4 : 6, inet_ntop(e->af, &s, src, sizeof(src)),
inet_ntop(e->af, &d, dst, sizeof(dst)), ntohs(e->dport),
e->delta_us / 1000.0);
}
}
```
The `handle_event` function takes arguments including the CPU number, a pointer to the data, and the size of the data. The data is a `event` structure that contains information about TCP connections computed in the kernel space.
First, it compares the timestamp of the received event with the start timestamp (if available) to calculate the relative time of the event, and then prints it. Next, it converts the source address and destination address from network byte order to host byte order based on the IP address type (IPv4 or IPv6).
Finally, depending on whether the user chooses to display the local port, it prints the process ID, process name, IP version, source IP address, local port (if available), destination IP address, destination port, and connection establishment time. This connection establishment time is calculated in the eBPF program running in the kernel space and sent to the user space.
## Compilation and Execution
```console
$ make
...
BPF .output/tcpconnlat.bpf.o".GEN-SKEL .output/tcpconnlat.skel.h
CC .output/tcpconnlat.o
BINARY tcpconnlat
$ sudo ./tcpconnlat
PID COMM IP SADDR DADDR DPORT LAT(ms)
222564 wget 4 192.168.88.15 110.242.68.3 80 25.29
222684 wget 4 192.168.88.15 167.179.101.42 443 246.76
222726 ssh 4 192.168.88.15 167.179.101.42 22 241.17
222774 ssh 4 192.168.88.15 1.15.149.151 22 25.31
```
Source code: [https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/13-tcpconnlat](https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/13-tcpconnlat)
References:
- [tcpconnlat](https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpconnlat.c) in bcc
## Summary
In this eBPF introductory tutorial, we learned how to use eBPF to track and measure the latency of TCP connections. We first explored how eBPF programs can attach to specific kernel functions in kernel-space and capture the start and end times of connection establishment to calculate latency.
We also learned how to use BPF maps to store and retrieve data in kernel-space, enabling data sharing among different parts of the eBPF program. Additionally, we discussed how to use perf events to send data from kernel-space to user-space for further processing and display.
In user-space, we introduced the usage of libbpf library APIs, such as perf_buffer__poll, to receive and process data sent from the kernel-space. We also demonstrated how to parse and print this data in a human-readable format.
If you are interested in learning more about eBPF and its practical applications, please refer to the official documentation of eunomia-bpf: [https://github.com/eunomia-bpf/eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf). You can also visit our tutorial code repository at [https://github.com/eunomia-bpf/bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial) for more examples and complete tutorials.
In the upcoming tutorials, we will dive deeper into advanced features of eBPF, such as tracing the path of network packets and fine-grained system performance monitoring. We will continue to share more content on eBPF development practices to help you better understand and master eBPF technology. We hope these resources will be valuable in your learning and practical journey with eBPF.

View File

@@ -0,0 +1,409 @@
# eBPF Introductory Practice Tutorial 14: Recording TCP Connection Status and TCP RTT
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool widely used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without restarting the kernel or changing the kernel source code.
In this article of our eBPF introductory practice tutorial series, we will introduce two sample programs: `tcpstates` and `tcprtt`. `tcpstates` is used to record the state changes of TCP connections, while `tcprtt` is used to record the Round-Trip Time (RTT) of TCP.
## `tcprtt` and `tcpstates`
Network quality is crucial in the current Internet environment. There are many factors that affect network quality, including hardware, network environment, and the quality of software programming. To help users better locate network issues, we introduce the tool `tcprtt`. `tcprtt` can monitor the Round-Trip Time of TCP connections, evaluate network quality, and help users identify potential problems.
When a TCP connection is established, `tcprtt` automatically selects the appropriate execution function based on the current system conditions. In the execution function, `tcprtt` collects various basic information of the TCP connection, such as source address, destination address, source port, destination port, and time elapsed, and updates this information to a histogram-like BPF map. After the execution is completed, `tcprtt` presents the collected information graphically to users through user-mode code.
`tcpstates` is a tool specifically designed to track and print changes in TCP connection status. It can display the duration of TCP connections in each state, measured in milliseconds. For example, for a single TCP session, `tcpstates` can print output similar to the following:
```sh
SKADDR C-PID C-COMM LADDR LPORT RADDR RPORT OLDSTATE -> NEWSTATE MS
ffff9fd7e8192000 22384 curl 100.66.100.185 0 52.33.159.26 80 CLOSE -> SYN_SENT 0.000
ffff9fd7e8192000 0 swapper/5 100.66.100.185 63446 52.33.159.26 80 SYN_SENT -> ESTABLISHED 1.373
ffff9fd7e8192000 22384 curl 100.66.100.185 63446 52.33.159.26 80 ESTABLISHED -> FIN_WAIT1 176.042
ffff9fd7e8192000 0 swapper/5 100.66.100.185 63446 52.33.159.26 80 FIN_WAIT1 -> FIN_WAIT2 0.536
ffff9fd7e8192000 0 swapper/5 100.66.100.185 63446 52.33.159.26 80 FIN_WAIT2 -> CLOSE 0.006
```
In the above output, the most time is spent in the ESTABLISHED state, which indicates that the connection has been established and data transmission is in progress. The transition from this state to the FIN_WAIT1 state (the beginning of connection closure) took 176.042 milliseconds.
In our upcoming tutorials, we will delve deeper into these two tools, explaining their implementation principles, and hopefully, these contents will help you in your work with eBPF for network and performance analysis.
## tcpstate
Due to space constraints, here we mainly discuss and analyze the corresponding eBPF kernel-mode code implementation. The following is the eBPF code for tcpstate:
```c
const volatile bool filter_by_sport = false;
const volatile bool filter_by_dport = false;
const volatile short target_family = 0;
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, __u16);
__type(value, __u16);
} sports SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
...
```__type(key, __u16);
__type(value, __u16);
} dports SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, struct sock *);
__type(value, __u64);
} timestamps SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u32));
} events SEC(".maps");
SEC("tracepoint/sock/inet_sock_set_state")
int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)
{
struct sock *sk = (struct sock *)ctx->skaddr;
__u16 family = ctx->family;
__u16 sport = ctx->sport;
__u16 dport = ctx->dport;
__u64 *tsp, delta_us, ts;
struct event event = {};
if (ctx->protocol != IPPROTO_TCP)
return 0;
if (target_family && target_family != family)
return 0;
if (filter_by_sport && !bpf_map_lookup_elem(&sports, &sport))
return 0;
if (filter_by_dport && !bpf_map_lookup_elem(&dports, &dport))
return 0;
tsp = bpf_map_lookup_elem(&timestamps, &sk);
ts = bpf_ktime_get_ns();
if (!tsp)
delta_us = 0;
else
delta_us = (ts - *tsp) / 1000;
event.skaddr = (__u64)sk;
event.ts_us = ts / 1000;
event.delta_us = delta_us;
event.pid = bpf_get_current_pid_tgid() >> 32;
event.oldstate = ctx->oldstate;
event.newstate = ctx->newstate;
event.family = family;
event.sport = sport;
event.dport = dport;
bpf_get_current_comm(&event.task, sizeof(event.task));
if (family == AF_INET) {
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_rcv_saddr);
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_daddr);
} else { /* family == AF_INET6 */
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_v6_daddr.in6_u.u6_addr32);
}.`bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
if (ctx->newstate == TCP_CLOSE)
bpf_map_delete_elem(&timestamps, &sk);
else
bpf_map_update_elem(&timestamps, &sk, &ts, BPF_ANY);
return 0;
}
```
The `tcpstates` program relies on eBPF Tracepoints to capture the state changes of TCP connections, in order to track the time spent in each state of the TCP connection.
### Define BPF Maps
In the `tcpstates` program, several BPF Maps are defined, which are the primary way of interaction between the eBPF program and the user-space program. `sports` and `dports` are used to store the source and destination ports for filtering TCP connections; `timestamps` is used to store the timestamps for each TCP connection to calculate the time spent in each state; `events` is a map of type `perf_event`, used to send event data to the user-space.
### Trace TCP Connection State Changes
The program defines a function called `handle_set_state`, which is a program of type tracepoint and is mounted on the `sock/inet_sock_set_state` kernel tracepoint. Whenever the TCP connection state changes, this tracepoint is triggered and the `handle_set_state` function is executed.
In the `handle_set_state` function, it first determines whether the current TCP connection needs to be processed through a series of conditional judgments, then retrieves the previous timestamp of the current connection from the `timestamps` map, and calculates the time spent in the current state. Then, the program places the collected data in an event structure and sends the event to the user-space using the `bpf_perf_event_output` function.
### Update Timestamps
Finally, based on the new state of the TCP connection, the program performs different operations: if the new state is TCP_CLOSE, it means the connection has been closed and the program deletes the timestamp of that connection from the `timestamps` map; otherwise, the program updates the timestamp of the connection.
## User-Space Processing
The user-space part is mainly about loading the eBPF program using libbpf and receiving event data from the kernel using perf_event:
```c
static void handle_event(void* ctx, int cpu, void* data, __u32 data_sz) {
char ts[32], saddr[26], daddr[26];
struct event* e = data;
struct tm* tm;
int family;
time_t t;
if (emit_timestamp) {
time(&t);
tm = localtime(&t);
strftime(ts, sizeof(ts), "%H:%M:%S", tm);
printf("%8s ", ts);
}
inet_ntop(e->family, &e->saddr, saddr, sizeof(saddr));
inet_ntop(e->family, &e->daddr, daddr, sizeof(daddr));
if (wide_output) {
family = e->family == AF_INET ? 4 : 6;
printf(
"%-16llx %-7d %-16s %-2d %-26s %-5d %-26s %-5d %-11s -> %-11s "
"%.3f\n",
e->skaddr, e->pid, e->task, family, saddr, e->sport, daddr,
e->dport, tcp_states[e->oldstate], tcp_states[e->newstate],
(double)e->delta_us / 1000);
} else {
printf(
"%-16llx %-7d %-10.10s %-15s %-5d %-15s %-5d %-11s -> %-11s %.3f\n",
...
```
handle_event` is a callback function that is called by perf_event. It handles new events that arrive in the kernel.
In the `handle_event` function, we first use the `inet_ntop` function to convert the binary IP address to a human-readable format. Then, based on whether the wide format is needed or not, we print different information. This information includes the timestamp of the event, source IP address, source port, destination IP address, destination port, old state, new state, and the time spent in the old state.
This allows users to see the changes in TCP connection states and the duration of each state, helping them diagnose network issues.
In summary, the user-space part of the processing involves the following steps:
1. Use libbpf to load and run the eBPF program.
2. Set up a callback function to receive events sent by the kernel.
3. Process the received events, convert them into a human-readable format, and print them.
The above is the main implementation logic of the user-space part of the `tcpstates` program. Through this chapter, you should have gained a deeper understanding of how to handle kernel events in user space. In the next chapter, we will introduce more knowledge about using eBPF for network monitoring.
### tcprtt
In this section, we will analyze the kernel BPF code of the `tcprtt` eBPF program. `tcprtt` is a program used to measure TCP Round Trip Time (RTT) and stores the RTT information in a histogram.
```c
/// @sample {"interval": 1000, "type" : "log2_hist"}
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, u64);
__type(value, struct hist);
} hists SEC(".maps");
static struct hist zero;
SEC("fentry/tcp_rcv_established")
int BPF_PROG(tcp_rcv, struct sock *sk)
{
const struct inet_sock *inet = (struct inet_sock *)(sk);
struct tcp_sock *ts;
struct hist *histp;
u64 key, slot;
u32 srtt;
if (targ_sport && targ_sport != inet->inet_sport)
return 0;
if (targ_dport && targ_dport != sk->__sk_common.skc_dport)
return 0;
if (targ_saddr && targ_saddr != inet->inet_saddr)
return 0;
if (targ_daddr && targ_daddr != sk->__sk_common.skc_daddr)
return 0;
if (targ_laddr_hist)
key = inet->inet_saddr;
else if (targ_raddr_hist)
key = inet->sk.__sk_common.skc_daddr;
else
key = 0;
histp = bpf_map_lookup_or_try_init(&hists, &key, &zero);
if (!histp)
return 0;
ts = (struct tcp_sock *)(sk);
srtt = BPF_CORE_READ(ts, srtt_us) >> 3;
if (targ_ms)
srtt /= 1000U;
slot = log2l(srtt);
if (slot >= MAX_SLOTS)
slot = MAX_SLOTS - 1;
```
The code above declares a map called `hists`, which is a hash map used to store the histogram data. The `hists` map has a maximum number of entries defined as `MAX_ENTRIES`.
The function `BPF_PROG(tcp_rcv, struct sock *sk)` is the entry point of the eBPF program for handling the `tcp_rcv_established` event. Within this function, the program retrieves various information from the network socket and checks if filtering conditions are met. Then, it performs operations on the histogram data structure. Finally, the program calculates the slot for the RTT value and updates the histogram accordingly.
This is the main code logic of the `tcprtt` eBPF program in kernel mode. The eBPF program measures the RTT of TCP connections and maintains a histogram to collect and analyze the RTT data.Instructions:
```c
__sync_fetch_and_add(&histp->slots[slot], 1);
if (targ_show_ext) {
__sync_fetch_and_add(&histp->latency, srtt);
__sync_fetch_and_add(&histp->cnt, 1);
}
return 0;
}
```
First, we define a hash type eBPF map called `hists`, which is used to store statistics information about RTT. In this map, the key is a 64-bit integer, and the value is a `hist` structure that contains an array to store the count of different RTT intervals.
Next, we define an eBPF program called `tcp_rcv` which will be called every time a TCP packet is received in the kernel. In this program, we first filter TCP connections based on filtering conditions (source/destination IP address and port). If the conditions are met, we select the corresponding key (source IP, destination IP, or 0) based on the set parameters, and then look up or initialize the corresponding histogram in the `hists` map.
Then, we read the `srtt_us` field of the TCP connection, which represents the smoothed RTT value in microseconds. We convert this RTT value to a logarithmic form and store it as a slot in the histogram.
If the `show_ext` parameter is set, we also increment the RTT value and the counter in the `latency` and `cnt` fields of the histogram.
With the above processing, we can analyze and track the RTT of each TCP connection to better understand the network performance.
In summary, the main logic of the `tcprtt` eBPF program includes the following steps:
1. Filter TCP connections based on filtering conditions.
2. Look up or initialize the corresponding histogram in the `hists` map.
3. Read the `srtt_us` field of the TCP connection, convert it to a logarithmic form, and store it in the histogram.
4. If the `show_ext` parameter is set, increment the RTT value and the counter in the `latency` and `cnt` fields of the histogram.
`tcprtt` is attached to the kernel's `tcp_rcv_established` function:
```c
void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
```
This function is the main function in the kernel for processing received TCP data and is called when a TCP connection is in the `ESTABLISHED` state. The processing logic of this function includes a fast path and a slow path. The fast path is disabled in the following cases:
- We have advertised a zero window - zero window probing can only be handled correctly in the slow path.
- Out-of-order data packets received.
- Expecting to receive urgent data.
- No remaining buffer space.
- Received unexpected TCP flags/window values/header lengths (detected by checking TCP header against the expected flags).
- Data is being transmitted in both directions. The fast path only supports pure senders or pure receivers (meaning the sequence number or acknowledgement value must remain unchanged).
- Received unexpected TCP options.
When these conditions are not met, it enters a standard receive processing, which follows RFC 793 to handle all cases. The first three cases can be ensured by setting the correct expected flags, while the remaining cases require inline checks. When everything is normal, the fast processing path is invoked in the `tcp_data_queue` function.
## Compilation and Execution
For `tcpstates`, you can compile and run the libbpf application with the following command:
```console
$ make
...
BPF .output/tcpstates.bpf.o
GEN-SKEL .output/tcpstates.skel.h
CC .output/tcpstates.o
BINARY tcpstates
$ sudo ./tcpstates
SKADDR PID COMM LADDR LPORT RADDR RPORT OLDSTATE -> NEWSTATE MS
ffff9bf61bb62bc0 164978 node 192.168.88.15 0 52.178.17.2 443 CLOSE -> SYN_SENT 0.000
ffff9bf61bb62bc0 0 swapper/0 192.168.88.15 41596 52.178.17.2 443 SYN_SENT -> ESTABLISHED 225.794".
format: Return only the translated content, not including the original text.```
"ffff9bf61bb62bc0 0 swapper/0 192.168.88.15 41596 52.178.17.2 443 ESTABLISHED -> CLOSE_WAIT 901.454
ffff9bf61bb62bc0 164978 node 192.168.88.15 41596 52.178.17.2 443 CLOSE_WAIT -> LAST_ACK 0.793
ffff9bf61bb62bc0 164978 node 192.168.88.15 41596 52.178.17.2 443 LAST_ACK -> LAST_ACK 0.086
ffff9bf61bb62bc0 228759 kworker/u6 192.168.88.15 41596 52.178.17.2 443 LAST_ACK -> CLOSE 0.193
ffff9bf6d8ee88c0 229832 redis-serv 0.0.0.0 6379 0.0.0.0 0 CLOSE -> LISTEN 0.000
ffff9bf6d8ee88c0 229832 redis-serv 0.0.0.0 6379 0.0.0.0 0 LISTEN -> CLOSE 1.763
ffff9bf7109d6900 88750 node 127.0.0.1 39755 127.0.0.1 50966 ESTABLISHED -> FIN_WAIT1 0.000
```
For tcprtt, we can use eunomia-bpf to compile and run this example:
Compile:
```shell
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
Or
```console
$ ecc runqlat.bpf.c runqlat.h
Compiling bpf object...
Generating export types...
Packing ebpf object and config into package.json...
```
Run:
```console
$ sudo ecli run package.json -h
A simple eBPF program
Usage: package.json [OPTIONS]
Options:
--verbose Whether to show libbpf debug information
--targ_laddr_hist Set value of `bool` variable targ_laddr_hist
--targ_raddr_hist Set value of `bool` variable targ_raddr_hist
--targ_show_ext Set value of `bool` variable targ_show_ext
--targ_sport <targ_sport> Set value of `__u16` variable targ_sport
--targ_dport <targ_dport> Set value of `__u16` variable targ_dport
--targ_saddr <targ_saddr> Set value of `__u32` variable targ_saddr
--targ_daddr <targ_daddr> Set value of `__u32` variable targ_daddr
--targ_ms Set value of `bool` variable targ_ms
-h, --help Print help
-V, --version Print version
Built with eunomia-bpf framework.".
```See https://github.com/eunomia-bpf/eunomia-bpf for more information.
$ sudo ecli run package.json
key = 0
latency = 0
cnt = 0
(unit) : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 4 |******************** |
1024 -> 2047 : 1 |***** |
2048 -> 4095 : 0 | |
4096 -> 8191 : 8 |****************************************|
key = 0
latency = 0
cnt = 0
(unit) : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |512 -> 1023 : 11 |*************************** |
1024 -> 2047 : 1 |** |
2048 -> 4095 : 0 | |
4096 -> 8191 : 16 |****************************************|
8192 -> 16383 : 4 |********** |
```
Complete source code:
- <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/14-tcpstates>
References:
- [tcpstates](https://github.com/iovisor/bcc/blob/master/tools/tcpstates_example.txt)
- [tcprtt](https://github.com/iovisor/bcc/blob/master/tools/tcprtt.py)
- [libbpf-tools/tcpstates](<https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpstates.bpf.c>)
## Summary
In this eBPF introductory tutorial, we learned how to use the tcpstates and tcprtt eBPF example programs to monitor and analyze the connection states and round-trip time of TCP. We understood the working principles and implementation methods of tcpstates and tcprtt, including how to store data using BPF maps, how to retrieve and process TCP connection information in eBPF programs, and how to parse and display the data collected by eBPF programs in user-space applications.
If you would like to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials. The upcoming tutorials will further explore advanced features of eBPF, and we will continue to share more content about eBPF development practices.

318
src/15-javagc/README_en.md Normal file
View File

@@ -0,0 +1,318 @@
# eBPF Introduction Tutorial 15: Capturing User-Space Java GC Event Duration Using USDT
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool widely used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without the need to restart the kernel or modify the kernel source code. This feature provides eBPF with high flexibility and performance, making it widely applicable in network and system performance analysis. Furthermore, eBPF also supports capturing user-space application behavior using User-Level Statically Defined Tracing (USDT).
In this article of our eBPF introduction tutorial series, we will explore how to use eBPF and USDT to capture and analyze the duration of Java garbage collection (GC) events.
## Introduction to USDT
USDT is a mechanism for inserting static tracepoints into applications, allowing developers to insert probes at critical points in the program for debugging and performance analysis purposes. These probes can be dynamically activated at runtime by tools such as DTrace, SystemTap, or eBPF, allowing access to the program's internal state and performance metrics without the need to restart the application or modify the program code. USDT is widely used in many open-source software applications such as MySQL, PostgreSQL, Ruby, Python, and Node.js.
### User-Level Tracing Mechanism: User-Level Dynamic Tracing and USDT
User-Level Dynamic Tracing allows us to instrument any user-level code by placing probes. For example, we can trace query requests in a MySQL server by placing a probe on the `dispatch_command()` function:
```bash
# ./uprobe 'p:cmd /opt/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcj +0(%dx):string'
Tracing uprobe cmd (p:cmd /opt/bin/mysqld:0x2dbd40 +0(%dx):string). Ctrl-C to end.
mysqld-2855 [001] d... 19957757.590926: cmd: (0x6dbd40) arg1="show tables"
mysqld-2855 [001] d... 19957759.703497: cmd: (0x6dbd40) arg1="SELECT * FROM numbers"
[...]
```
Here, we use the `uprobe `tool, which leverages Linux's built-in functionalities: ftrace (tracing framework) and uprobes (User-Level Dynamic Tracing, requires a relatively new Linux version, around 4.0 or later). Other tracing frameworks such as perf_events and SystemTap can also achieve this functionality.
Many other MySQL functions can be traced to obtain more information. We can list and count the number of these functions:
```bash
# ./uprobe -l /opt/bin/mysqld | more
account_hash_get_key
add_collation
add_compiled_collation
add_plugin_noargs
adjust_time_range
[...]
# ./uprobe -l /opt/bin/mysqld | wc -l
21809
```
There are 21,000 functions here. We can also trace library functions or even individual instruction offsets.
User-Level Dynamic Tracing capability is very powerful and can solve numerous problems. However, using it also has some challenges: identifying the code to trace, handling function parameters, and dealing with code modifications.
User-Level Statically Defined Tracing (USDT) can address some of these challenges. USDT probes (or "markers" at the user level) are trace macros inserted at critical positions in the code, providing a stable and well-documented API. This makes the tracing work simpler.
With USDT, we can easily trace a probe called `mysql:query__start` instead of tracing the C++ symbol `_Z16dispatch_command19enum_server_commandP3THDPcj`, which is the `dispatch_command()` function. Of course, we can still trace `dispatch_command()` and the other 21,000 mysqld functions when needed, but only when USDT probes cannot solve the problem.In Linux, USDT (User Statically Defined Tracing) has actually existed in various forms for decades. It has recently gained attention again due to the popularity of Sun's DTrace tool, which has led to many common applications, including MySQL, PostgreSQL, Node.js, Java, etc., adding USDT support. SystemTap has developed a way to consume these DTrace probes.
You may be running a Linux application that already includes USDT probes, or you may need to recompile it (usually with --enable-dtrace). You can use `readelf` to check, for example, for Node.js:
```bash
# readelf -n node
[...]
Notes at offset 0x00c43058 with length 0x00000494:
Owner Data size Description
stapsdt 0x0000003c NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: gc__start
Location: 0x0000000000bf44b4, Base: 0x0000000000f22464, Semaphore: 0x0000000001243028
Arguments: 4@%esi 4@%edx 8@%rdi
[...]
stapsdt 0x00000082 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__client__request
Location: 0x0000000000bf48ff, Base: 0x0000000000f22464, Semaphore: 0x0000000001243024
Arguments: 8@%rax 8@%rdx 8@-136(%rbp) -4@-140(%rbp) 8@-72(%rbp) 8@-80(%rbp) -4@-144(%rbp)
[...]
```
This is a Node.js recompiled with --enable-dtrace and installed with the systemtap-sdt-dev package that provides "dtrace" functionality to support USDT. Here are two probes displayed: node:gc__start (garbage collection start) and node:http__client__request.
At this point, you can use SystemTap or LTTng to trace these probes. However, built-in Linux tracers like ftrace and perf_events currently cannot do this (although perf_events support is under development).
## Introduction to Java GC
Java, as a high-level programming language, has automatic garbage collection (GC) as one of its core features. The goal of Java GC is to automatically reclaim memory space that is no longer used by the program, thereby relieving programmers of the burden of memory management. However, the GC process may cause application pauses, which can impact program performance and response time. Therefore, monitoring and analyzing Java GC events are essential for understanding and optimizing the performance of Java applications.
In the following tutorial, we will demonstrate how to use eBPF and USDT to monitor and analyze the duration of Java GC events. We hope this content will be helpful to you in your work with eBPF for application performance analysis.
## eBPF Implementation Mechanism
The eBPF program for Java GC is divided into two parts: kernel space and user space. We will introduce the implementation mechanisms of these two parts separately.
### Kernel Space Program
```c
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/* Copyright (c) 2022 Chen Tao */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/usdt.bpf.h>
#include "javagc.h"
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 100);
__type(key, uint32_t);
__type(value, struct data_t);
} data_map SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
``````cpp
__type(key, int);
__type(value, int);
} perf_map SEC(".maps");
__u32 time;
static int gc_start(struct pt_regs *ctx)
{
struct data_t data = {};
data.cpu = bpf_get_smp_processor_id();
data.pid = bpf_get_current_pid_tgid() >> 32;
data.ts = bpf_ktime_get_ns();
bpf_map_update_elem(&data_map, &data.pid, &data, 0);
return 0;
}
static int gc_end(struct pt_regs *ctx)
{
struct data_t data = {};
struct data_t *p;
__u32 val;
data.cpu = bpf_get_smp_processor_id();
data.pid = bpf_get_current_pid_tgid() >> 32;
data.ts = bpf_ktime_get_ns();
p = bpf_map_lookup_elem(&data_map, &data.pid);
if (!p)
return 0;
val = data.ts - p->ts;
if (val > time) {
data.ts = val;
bpf_perf_event_output(ctx, &perf_map, BPF_F_CURRENT_CPU, &data, sizeof(data));
}
bpf_map_delete_elem(&data_map, &data.pid);
return 0;
}
SEC("usdt")
int handle_gc_start(struct pt_regs *ctx)
{
return gc_start(ctx);
}
SEC("usdt")
int handle_gc_end(struct pt_regs *ctx)
{
return gc_end(ctx);
}
SEC("usdt")
int handle_mem_pool_gc_start(struct pt_regs *ctx)
{
return gc_start(ctx);
}
SEC("usdt")
int handle_mem_pool_gc_end(struct pt_regs *ctx)
{
return gc_end(ctx);
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
```
First, we define two maps:
- `data_map`: This hashmap stores the start time of garbage collection for each process ID. The `data_t` structure contains the process ID, CPU ID, and timestamp.
- `perf_map`: This is a perf event array used to send data back to the user-space program.
Then, we have four handler functions: `gc_start`, `gc_end`, and two USDT handler functions `handle_mem_pool_gc_start` and `handle_mem_pool_gc_end`. These functions are all annotated with the BPF `SEC("usdt")` macro to capture USDT events related to garbage collection in a Java process.
The `gc_start` function is called when garbage collection starts. It first gets the current CPU ID, process ID, and timestamp, and then stores this data in `data_map`.
The `gc_end` function is called when garbage collection ends. It performs similar operations as `gc_start`, but it also retrieves the start time from `data_map` and calculates the duration of garbage collection. If the duration exceeds a set threshold (`time` variable), it sends the data back to the user-space program.
`handle_gc_start` and `handle_gc_end` are handler functions for the garbage collection start and end events, respectively, and they call `gc_start` and `gc_end`, respectively.
`handle_mem_pool_gc_start` and `handle_mem_pool_gc_end` are handler functions for the garbage collection start and end events in the memory pool, and they also call `gc_start` and `gc_end`, respectively.Finally, we have a `LICENSE` array that declares the license of the BPF program, which is required for loading the BPF program.
### User-space Program
The main goal of the user-space program is to load and run eBPF programs, as well as process data from the kernel-space program. This is achieved through the use of the libbpf library. Here, we are omitting some common code for loading and running eBPF programs and only showing the parts related to USDT.
The first function `get_jvmso_path` is used to obtain the path of the `libjvm.so` library for the running Java Virtual Machine (JVM). First, it opens the `/proc/<pid>/maps` file, which contains the memory mapping information of the process address space. Then, it searches for the line that contains `libjvm.so` in the file and copies the path of that line to the provided argument.
```c
static int get_jvmso_path(char *path)
{
char mode[16], line[128], buf[64];
size_t seg_start, seg_end, seg_off;
FILE *f;
int i = 0;
sprintf(buf, "/proc/%d/maps", env.pid);
f = fopen(buf, "r");
if (!f)
return -1;
while (fscanf(f, "%zx-%zx %s %zx %*s %*d%[^\n]\n",
&seg_start, &seg_end, mode, &seg_off, line) == 5) {
i = 0;
while (isblank(line[i]))
i++;
if (strstr(line + i, "libjvm.so")) {
break;
}
}
strcpy(path, line + i);
fclose(f);
return 0;
}
```
Next, we see the attachment of the eBPF programs (`handle_gc_start` and `handle_gc_end`) to the relevant USDT probes in the Java process. Each program achieves this by calling the `bpf_program__attach_usdt` function, which takes as parameters the BPF program, the process ID, the binary path, and the provider and name of the probe. If the probe is successfully attached, `bpf_program__attach_usdt` will return a link object, which is stored in the skeleton's link member. If the attachment fails, the program will print an error message and perform cleanup.
```c
skel->links.handle_mem_pool_gc_start = bpf_program__attach_usdt(skel->progs.handle_gc_start, env.pid,
binary_path, "hotspot", "mem__pool__gc__begin", NULL);
if (!skel->links.handle_mem_pool_gc_start) {
err = errno;
fprintf(stderr, "attach usdt mem__pool__gc__begin failed: %s\n", strerror(err));
goto cleanup;
}
skel->links.handle_mem_pool_gc_end = bpf_program__attach_usdt(skel->progs.handle_gc_end, env.pid,
binary_path, "hotspot", "mem__pool__gc__end", NULL);
if (!skel->links.handle_mem_pool_gc_end) {
err = errno;
fprintf(stderr, "attach usdt mem__pool__gc__end failed: %s\n", strerror(err));
goto cleanup;
}
skel->links.handle_gc_start = bpf_program__attach_usdt(skel->progs.handle_gc_start, env.pid,
binary_path, "hotspot", "gc__begin", NULL);
if (!skel->links.handle_gc_start) {
err = errno;
fprintf(stderr, "attach usdt gc__begin failed: %s\n", strerror(err));
goto cleanup;
}
skel->links.handle_gc_end = bpf_program__attach_usdt(skel->progs.handle_gc_end, env.pid,
binary_path, "hotspot", "gc__end", NULL);
if (!skel->links.handle_gc_end) {
err = errno;
fprintf(stderr, "attach usdt gc__end failed: %s\n", strerror(err));
goto cleanup;
}
```
The last function `handle_event` is a callback function used to handle data received from the perf event array. This function is triggered by the perf event array and is called each time a new event is received. The function first converts the data to a `data_t` structure, then formats the current time as a string, and finally prints the timestamp, CPU ID, process ID, and duration of the garbage collection.
```c
static void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
{
struct data_t *e = (struct data_t *)data;
struct tm *tm = NULL;
char ts[16];
time_t t;
time(&t);
tm = localtime(&t);
strftime(ts, sizeof(ts), "%H:%M:%S", tm);
printf("%-8s %-7d %-7d %-7lld\n", ts, e->cpu, e->pid, e->ts/1000);
}
```
## Installing Dependencies
To build the example, you need clang, libelf, and zlib. The package names may vary with different distributions.
On Ubuntu/Debian, run the following command:
```shell
sudo apt install clang libelf1 libelf-dev zlib1g-dev
```
On CentOS/Fedora, run the following command:
```shell
sudo dnf install clang elfutils-libelf elfutils-libelf-devel zlib-devel
```
## Compiling and Running
In the corresponding directory, run Make to compile and run the code:
```console
$ make
$ sudo ./javagc -p 12345
Tracing javagc time... Hit Ctrl-C to end.
TIME CPU PID GC TIME
10:00:01 10% 12345 50ms
10:00:02 12% 12345 55ms
10:00:03 9% 12345 47ms
10:00:04 13% 12345 52ms
10:00:05 11% 12345 50ms
```
Complete source code:
- <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/15-javagc>
References:
- <https://www.brendangregg.com/blog/2015-07-03/hacking-linux-usdt-ftrace.html>
- <https://github.com/iovisor/bcc/blob/master/libbpf-tools/javagc.c>
Summary.Through this introductory eBPF tutorial, we have learned how to use eBPF and USDT for dynamic tracing and analysis of Java garbage collection (GC) events. We have understood how to set USDT tracepoints in user space applications and how to write eBPF programs to capture information from these tracepoints, thereby gaining a deeper understanding and optimizing the behavior and performance of Java GC.
Additionally, we have also introduced some basic knowledge and practical techniques related to Java GC, USDT, and eBPF. This knowledge and skills are valuable for developers who want to delve into the field of network and system performance analysis.
If you would like to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> to get more examples and the complete tutorial.

View File

@@ -375,6 +375,7 @@ int memleak__kfree(void *ctx)
return gen_free_enter(ptr);
}
```
上述代码片段定义了一个函数memleak__kfree这是一个bpf程序会在内核调用kfree函数时执行。首先该函数检查是否存在kfree函数。如果存在则会读取传递给kfree函数的参数即要释放的内存块的地址并保存到变量ptr中否则会读取传递给kmem_free函数的参数即要释放的内存块的地址并保存到变量ptr中。接着该函数会调用之前定义的gen_free_enter函数来处理该内存块的释放。
```c
@@ -454,14 +455,14 @@ int attach_uprobes(struct memleak_bpf *skel)
注意一些内存分配函数可能并不存在或已弃用比如valloc、pvalloc等因此它们的附加可能会失败。在这种情况下我们允许附加失败并不会阻止程序的执行。这是因为我们更关注的是主流和常用的内存分配函数而这些已经被弃用的函数往往在实际应用中较少使用。
完整的源代码https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/16-memleak
完整的源代码:<https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/16-memleak>
参考:<https://github.com/iovisor/bcc/blob/master/libbpf-tools/memleak.c>
## 编译运行
```console
$ git clone https://github.com/iovisor/bcc.git --recurse-submodules
$ cd libbpf-tools/
$ make memleak
$ make
$ sudo ./memleak
using default object: libc.so.6
using page size: 4096

445
src/16-memleak/README_en.md Normal file
View File

@@ -0,0 +1,445 @@
# eBPF Getting Started Tutorial 16: Writing eBPF Program Memleak for Monitoring Memory Leaks
eBPF (extended Berkeley Packet Filter) is a powerful network and performance analysis tool that is widely used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without restarting the kernel or modifying its source code.
In this tutorial, we will explore how to write a Memleak program using eBPF to monitor memory leaks in programs.
## Background and Importance
Memory leaks are a common problem in computer programming and should not be underestimated. When memory leaks occur, programs gradually consume more memory resources without properly releasing them. Over time, this behavior can lead to a gradual depletion of system memory, significantly reducing the overall performance of the program and system.
There are many possible causes of memory leaks. It may be due to misconfiguration, such as a program incorrectly configuring dynamic allocation of certain resources. It may also be due to software bugs or incorrect memory management strategies, such as forgetting to release memory that is no longer needed during program execution. Additionally, if an application's memory usage is too high, system performance may significantly decrease due to paging/swapping, or it may even cause the application to be forcibly terminated by the system's OOM killer (Out of Memory Killer).
### Challenges of Debugging Memory Leaks
Debugging memory leak issues is a complex and challenging task. This involves detailed examination of the program's configuration, memory allocation, and deallocation, often requiring specialized tools to aid in diagnosis. For example, there are tools that can associate malloc() function calls with specific detection tools, such as Valgrind memcheck, which can simulate the CPU to check all memory accesses, but may greatly slow down the application's execution speed. Another option is to use heap analyzers, such as libtcmalloc, which are relatively faster but may still decrease the application's execution speed by more than five times. Additionally, there are tools like gdb that can obtain core dumps of applications and perform post-processing analysis of memory usage. However, these tools often require pausing the application during core dump acquisition or calling the free() function after the application terminates.
## Role of eBPF
In this context, the role of eBPF becomes particularly important. eBPF provides an efficient mechanism for monitoring and tracking system-level events, including memory allocation and deallocation. With eBPF, we can trace memory allocation and deallocation requests and collect the call stacks for each allocation. We can then analyze this information to identify call stacks that perform memory allocations but do not perform subsequent deallocations, helping us identify the source of memory leaks. The advantage of this approach is that it can be done in real-time within a running application without pausing the application or performing complex post-processing.
The `memleak` eBPF tool can trace and match memory allocation and deallocation requests, and collect the call stacks for each allocation. Subsequently, `memleak` can print a summary indicating which call stacks executed allocations but did not perform subsequent deallocations. For example, running the command:
```console
# ./memleak -p $(pidof allocs)
Attaching to pid 5193, Ctrl+C to quit.
[11:16:33] Top 2 stacks with outstanding allocations:
80 bytes in 5 allocations from stack
main+0x6d [allocs]
__libc_start_main+0xf0 [libc-2.21.so]
[11:16:34] Top 2 stacks with outstanding allocations:
160 bytes in 10 allocations from stack
main+0x6d [allocs]
__libc_start_main+0xf0 [libc-2.21.so]
```
After running this command, we can see which stacks the allocated but not deallocated memory came from, as well as the size and quantity of these unreleased memory blocks.
Over time, it becomes evident that the `main` function of the `allocs` process is leaking memory, 16 bytes at a time. Fortunately, we don't need to inspect each allocation; we have a nice summary that tells us which stack is responsible for the significant leaks.
## Implementation Principle of memleak
At a basic level, `memleak` operates by installing monitoring devices on the memory allocation and deallocation paths. It achieves this by inserting eBPF programs into memory allocation and deallocation functions. This means that when these functions are called, `memleak` will record important information, such as the caller's process ID (PID), the allocated memory address, and the size of the allocated memory. When the function for freeing memory is called, `memleak` will delete the corresponding memory allocation record in its internal map. This mechanism allows `memleak` to accurately trace which memory blocks have been allocated but not deallocated.For commonly used memory allocation functions in user space, such as `malloc` and `calloc`, `memleak` uses user space probing (uprobe) technology for monitoring. Uprobe is a dynamic tracing technology for user space applications, which can set breakpoints at any location at runtime without modifying the binary files, thus achieving tracing of specific function calls.
For kernel space memory allocation functions, such as `kmalloc`, `memleak` chooses to use tracepoints for monitoring. Tracepoint is a dynamic tracing technology provided in the Linux kernel, which can dynamically trace specific events in the kernel at runtime without recompiling the kernel or loading kernel modules.
## Kernel Space eBPF Program Implementation
## `memleak` Kernel Space eBPF Program Implementation
The kernel space eBPF program of `memleak` contains some key functions for tracking memory allocation and deallocation. Before delving into these functions, let's first take a look at some data structures defined by `memleak`, which are used in both its kernel space and user space programs.
```c
#ifndef __MEMLEAK_H
#define __MEMLEAK_H
#define ALLOCS_MAX_ENTRIES 1000000
#define COMBINED_ALLOCS_MAX_ENTRIES 10240
struct alloc_info {
__u64 size; // Size of allocated memory
__u64 timestamp_ns; // Timestamp when allocation occurs, in nanoseconds
int stack_id; // Call stack ID when allocation occurs
};
union combined_alloc_info {
struct {
__u64 total_size : 40; // Total size of all unreleased allocations
__u64 number_of_allocs : 24; // Total number of unreleased allocations
};
__u64 bits; // Bitwise representation of the structure
};
#endif /* __MEMLEAK_H */
```
Here, two main data structures are defined: `alloc_info` and `combined_alloc_info`.
The `alloc_info` structure contains basic information about a memory allocation, including the allocated memory size `size`, the timestamp `timestamp_ns` when the allocation occurs, and the call stack ID `stack_id` that triggers the allocation.
The `combined_alloc_info` is a union that contains an embedded structure and a `__u64` type bitwise representation `bits`. The embedded structure has two members: `total_size` and `number_of_allocs`, representing the total size and total count of unreleased allocations, respectively. The numbers 40 and 24 indicate the number of bits occupied by the `total_size` and `number_of_allocs` members, limiting their size. By using this limitation, storage space for the `combined_alloc_info` structure can be saved. Moreover, since `total_size` and `number_of_allocs` share the same `unsigned long long` type variable `bits` for storage, bitwise operations on the member variable `bits` can be used to access and modify `total_size` and `number_of_allocs`, avoiding the complexity of defining additional variables and functions in the program.
Next, `memleak` defines a series of eBPF maps for storing memory allocation information and analysis results. These maps are defined in the form of `SEC(".maps")`, indicating that they belong to the mapping section of the eBPF program.
```c
const volatile size_t min_size = 0;
const volatile size_t max_size = -1;
const volatile size_t page_size = 4096;
const volatile __u64 sample_rate = 1;
const volatile bool trace_all = false;
const volatile __u64 stack_flags = 0;
const volatile bool wa_missing_free = false;
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, pid_t);
__type(value, u64);
__uint(max_entries, 10240);
} sizes SEC(".maps");
struct {
//... (continued)__uint(type, BPF_MAP_TYPE_HASH);
__type(key, u64); /* address */
__type(value, struct alloc_info);
__uint(max_entries, ALLOCS_MAX_ENTRIES);
} allocs SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, u64); /* stack id */
__type(value, union combined_alloc_info);
__uint(max_entries, COMBINED_ALLOCS_MAX_ENTRIES);
} combined_allocs SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, u64);
__type(value, u64);
__uint(max_entries, 10240);
} memptrs SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_STACK_TRACE);
__type(key, u32);
} stack_traces SEC(".maps");
static union combined_alloc_info initial_cinfo;
```
The code first defines some configurable parameters, such as `min_size`, `max_size`, `page_size`, `sample_rate`, `trace_all`, `stack_flags`, and `wa_missing_free`, representing the minimum allocation size, maximum allocation size, page size, sample rate, whether to trace all allocations, stack flags, and whether to work in missing free mode.
Then, five maps are defined:
1. `sizes`: This is a hash-type map with the key as the process ID and the value as `u64` type, storing the allocation size of each process.
2. `allocs`: This is also a hash-type map with the key as the allocation address and the value as the `alloc_info` structure, storing detailed information about each memory allocation.
3. `combined_allocs`: This is another hash-type map with the key as the stack ID and the value as the `combined_alloc_info` union, storing the total size and count of all unreleased allocations.
4. `memptrs`: This is also a hash-type map with both the key and value as `u64` type, used to pass memory pointers between user space and kernel space.
5. `stack_traces`: This is a stack trace-type map with the key as `u32` type, used to store stack IDs.
Taking the user-space memory allocation tracing as an example, it mainly hooks memory-related function calls such as `malloc`, `free`, `calloc`, `realloc`, `mmap`, and `munmap` to record data when these functions are called. In user space, `memleak` mainly uses uprobes technology for hooking.
Each function call is divided into "enter" and "exit" parts. The "enter" part records the function call parameters, such as the size of the allocation or the address being freed. The "exit" part is mainly used to obtain the return value of the function, such as the memory address obtained from the allocation.
Here, `gen_alloc_enter`, `gen_alloc_exit`, `gen_free_enter` are functions that implement the recording behavior, and they are used to record relevant information when allocation starts, allocation ends, and freeing starts, respectively.
The function prototype is as follows:
```c
SEC("uprobe")
int BPF_KPROBE(malloc_enter, size_t size)
{
// Record relevant information when allocation starts
return gen_alloc_enter(size);
}
SEC("uretprobe")
int BPF_KRETPROBE(malloc_exit)
{
// Record relevant information when allocation ends
return gen_alloc_exit(ctx);
}
SEC("uprobe")
int BPF_KPROBE(free_enter, void *address)
{
// Record relevant information when freeing starts
return gen_free_enter(address);
}
```
`malloc_enter` and `free_enter` are probes mounted at the entry points of the `malloc` and `free` functions, respectively, to record data during function calls. `malloc_exit` is a probe mounted at the return point of the `malloc` function to record the return value of the function.
These functions are declared using the `BPF_KPROBE` and `BPF_KRETPROBE` macros, which are used to declare kprobes (kernel probes) and kretprobes (kernel return probes), respectively. Specifically, kprobe is triggered during function calls, while kretprobe is triggered during function returns.
The `gen_alloc_enter` function is called at the beginning of a memory allocation request. This function is mainly responsible for collecting some basic information when the function that allocates memory is called. Now, let's take a deep dive into the implementation of this function.
```c
static int gen_alloc_enter(size_t size)
{
if (size < min_size || size > max_size)
return 0;
if (sample_rate > 1) {
if (bpf_ktime_get_ns() % sample_rate != 0)
return 0;
}
const pid_t pid = bpf_get_current_pid_tgid() >> 32;
bpf_map_update_elem(&sizes, &pid, &size, BPF_ANY);
if (trace_all)
bpf_printk("alloc entered, size = %lu\n", size);
return 0;
}
SEC("uprobe")
int BPF_KPROBE(malloc_enter, size_t size)
{
return gen_alloc_enter(size);
}
```
First, the `gen_alloc_enter` function takes a `size` parameter that represents the size of the requested memory allocation. If this value is not between `min_size` and `max_size`, the function will return directly without performing any further operations. This allows the tool to focus on tracing memory allocation requests within a specific range and filter out uninteresting allocation requests.
Next, the function checks the sampling rate `sample_rate`. If `sample_rate` is greater than 1, it means that we don't need to trace all memory allocation requests, but rather trace them periodically. Here, `bpf_ktime_get_ns` is used to get the current timestamp, and the modulus operation is used to determine whether to trace the current memory allocation request. This is a common sampling technique used to reduce performance overhead while providing a representative sample for analysis.
Then, the function uses the `bpf_get_current_pid_tgid` function to retrieve the current process's PID. Note that the PID here is actually a combination of the process ID and thread ID, and we shift it right by 32 bits to get the actual process ID.
The function then updates the `sizes` map, which uses the process ID as the key and the requested memory allocation size as the value. `BPF_ANY` indicates that if the key already exists, the value will be updated; otherwise, a new entry will be created.
Finally, if the `trace_all` flag is enabled, the function will print a message indicating that a memory allocation has occurred.
The `BPF_KPROBE` macro is used to intercept the execution of the `malloc` function with a BPF uprobe when the `malloc_enter` function is called, and it records the memory allocation size using `gen_alloc_enter`.
We have just analyzed the entry function `gen_alloc_enter` of memory allocation, now let's focus on the exit part of this process. Specifically, we will discuss the `gen_alloc_exit2` function and how to obtain the returned memory address from the memory allocation call.
```c
static int gen_alloc_exit2(void *ctx, u64 address)
{
const pid_t pid = bpf_get_current_pid_tgid() >> 32;
struct alloc_info info;
const u64* size = bpf_map_lookup_elem(&sizes, &pid);
if (!size)
return 0; // missed alloc entry
__builtin_memset(&info, 0, sizeof(info));
info.size = *size;bpf_map_delete_elem(&sizes, &pid);
if (address != 0) {
info.timestamp_ns = bpf_ktime_get_ns();
info.stack_id = bpf_get_stackid(ctx, &stack_traces, stack_flags);
bpf_map_update_elem(&allocs, &address, &info, BPF_ANY);
update_statistics_add(info.stack_id, info.size);
}
if (trace_all) {
bpf_printk("alloc exited, size = %lu, result = %lx\n",
info.size, address);
}
return 0;
}
static int gen_alloc_exit(struct pt_regs *ctx)
{
return gen_alloc_exit2(ctx, PT_REGS_RC(ctx));
}
SEC("uretprobe")
int BPF_KRETPROBE(malloc_exit)
{
return gen_alloc_exit(ctx);
}
```
`gen_alloc_exit2` function is called when the memory allocation operation is completed. This function takes two parameters, one is the context `ctx` and the other is the memory address returned by the memory allocation function `address`.
First, it obtains the PID (Process ID) of the current thread and uses it as a key to look up the corresponding memory allocation size in the `sizes` map. If not found (i.e., no entry for the memory allocation operation), the function simply returns.
Then, it clears the content of the `info` structure and sets its `size` field to the memory allocation size found in the map. It also removes the corresponding element from the `sizes` map because the memory allocation operation has completed and this information is no longer needed.
Next, if `address` is not zero (indicating a successful memory allocation operation), the function further collects some additional information. First, it obtains the current timestamp as the completion time of the memory allocation and fetches the current stack trace. These pieces of information are stored in the `info` structure and subsequently updated in the `allocs` map.
Finally, the function calls `update_statistics_add` to update the statistics data and, if tracing of all memory allocation operations is enabled, it prints some information about the memory allocation operation.
Note that, `gen_alloc_exit` is a wrapper for `gen_alloc_exit2`, which passes `PT_REGS_RC(ctx)` as the `address` parameter to `gen_alloc_exit2`.
In our discussion, we just mentioned that `update_statistics_add` function is called in the `gen_alloc_exit2` function to update the statistics data for memory allocations. Now let's take a closer look at the implementation of this function.
```c
static void update_statistics_add(u64 stack_id, u64 sz)
{
union combined_alloc_info *existing_cinfo;
existing_cinfo = bpf_map_lookup_or_try_init(&combined_allocs, &stack_id, &initial_cinfo);
if (!existing_cinfo)
return;
const union combined_alloc_info incremental_cinfo = {
.total_size = sz,
.number_of_allocs = 1
};
__sync_fetch_and_add(&existing_cinfo->bits, incremental_cinfo.bits);
}
```
The `update_statistics_add` function takes two parameters: the current stack ID `stack_id` and the size of the memory allocation `sz`. These two parameters are collected in the memory allocation event and used to update the statistics data for memory allocations.First, the function tries to find the element with the current stack ID as the key in the `combined_allocs` map. If it is not found, a new element is initialized with `initial_cinfo` (which is a default `combined_alloc_info` structure with all fields set to zero).
Next, the function creates an `incremental_cinfo` and sets its `total_size` to the current memory allocation size and `number_of_allocs` to 1. This is because each call to the `update_statistics_add` function represents a new memory allocation event, and the size of this event's memory allocation is `sz`.
Finally, the function atomically adds the value of `incremental_cinfo` to `existing_cinfo` using the `__sync_fetch_and_add` function. Note that this step is thread-safe, so even if multiple threads call the `update_statistics_add` function concurrently, each memory allocation event will be correctly recorded in the statistics.
In summary, the `update_statistics_add` function implements the logic for updating memory allocation statistics. By maintaining the total amount and number of memory allocations for each stack ID, we can gain insight into the memory allocation behavior of the program.
In our process of tracking memory allocation statistics, we not only need to count memory allocations but also consider memory releases. In the above code, we define a function called `update_statistics_del` that updates the statistics when memory is freed. The function `gen_free_enter` is executed when the process calls the `free` function.
The `update_statistics_del` function takes the stack ID and the size of the memory block to be freed as parameters. First, the function uses the current stack ID as the key to look up the corresponding `combined_alloc_info` structure in the `combined_allocs` map. If it is not found, an error message is output and the function returns. If it is found, a `decremental_cinfo` `combined_alloc_info` structure is constructed with its `total_size` set to the size of the memory to be freed and `number_of_allocs` set to 1. Then the `__sync_fetch_and_sub` function is used to atomically subtract the value of `decremental_cinfo` from `existing_cinfo`. Note that the `number_of_allocs` here is negative, indicating a decrease in memory allocation.
The `gen_free_enter` function takes the address to be freed as a parameter. It first converts the address to an unsigned 64-bit integer (`u64`). Then it looks up the `alloc_info` structure in the `allocs` map using the address as the key. If it is not found, the function returns 0. If it is found, the `alloc_info` structure is deleted from the `allocs` map, and the `update_statistics_del` function is called with the stack ID and size from `info`. If `trace_all` is true, an information message is output.
```c
int BPF_KPROBE(free_enter, void *address)
{
return gen_free_enter(address);
}
```
Next, let's look at the `gen_free_enter` function. It takes an address as a parameter, which is the result of memory allocation, i.e., the starting address of the memory to be freed. The function first uses this address as a key to search for the corresponding `alloc_info` structure in the `allocs` map. If it is not found, it simply returns because it means that this address has not been allocated. If it is found, the element is deleted, and the `update_statistics_del` function is called to update the statistics data. Finally, if global tracking is enabled, a message is also output, including this address and its size.
While tracking and profiling memory allocation, we also need to track kernel-mode memory allocation and deallocation. In the Linux kernel, the `kmem_cache_alloc` function and the `kfree` function are used for kernel-mode memory allocation and deallocation, respectively.
```c
SEC("tracepoint/kmem/kfree")
int memleak__kfree(void *ctx)
{
const void *ptr;
if (has_kfree()) {
struct trace_event_raw_kfree___x *args = ctx;
ptr = BPF_CORE_READ(args, ptr);
} else {
struct trace_event_raw_kmem_free___x *args = ctx;
ptr = BPF_CORE_READ(args, ptr);
}
return gen_free_enter(ptr);
}
```
The above code snippet defines a function `memleak__kfree`. This is a BPF program that will be executed when the `kfree` function is called in the kernel. First, the function checks if `kfree` exists. If it does, it reads the argument passed to the `kfree` function (i.e., the address of the memory block to be freed) and saves it in the variable `ptr`. Otherwise, it reads the argument passed to the `kmem_free` function (i.e., the address of the memory block to be freed) and saves it in the variable `ptr`. Then, the function calls the previously defined `gen_free_enter` function to handle the release of this memory block.
```c
SEC("tracepoint/kmem/kmem_cache_alloc")
int memleak__kmem_cache_alloc(struct trace_event_raw_kmem_alloc *ctx)
{
if (wa_missing_free)
gen_free_enter(ctx->ptr);
gen_alloc_enter(ctx->bytes_alloc);
return gen_alloc_exit2(ctx, (u64)(ctx->ptr));
}
```
This code snippet defines a function `memleak__kmem_cache_alloc`. This is also a BPF program that will be executed when the `kmem_cache_alloc` function is called in the kernel. If the `wa_missing_free` flag is set, it calls the `gen_free_enter` function to handle possible missed release operations. Then, the function calls the `gen_alloc_enter` function to handle memory allocation and finally calls the `gen_alloc_exit2` function to record the allocation result.
Both of these BPF programs use the `SEC` macro to define the corresponding tracepoints, so that they can be executed when the corresponding kernel functions are called. In the Linux kernel, a tracepoint is a static hook that can be inserted into the kernel to collect runtime kernel information. It is very useful for debugging and performance analysis.
In the process of understanding this code, pay attention to the use of the `BPF_CORE_READ` macro. This macro is used to read kernel data in BPF programs. In BPF programs, we cannot directly access kernel memory and need to use such macros to safely read data.
### User-Space Program
After understanding the BPF kernel part, let's switch to the user-space program. The user-space program works closely with the BPF kernel program. It is responsible for loading BPF programs into the kernel, setting up and managing BPF maps, and handling data collected from BPF programs. The user-space program is longer, but here we can briefly refer to its mount point.
```c
int attach_uprobes(struct memleak_bpf *skel)
{
ATTACH_UPROBE_CHECKED(skel, malloc, malloc_enter);
ATTACH_URETPROBE_CHECKED(skel, malloc, malloc_exit);
ATTACH_UPROBE_CHECKED(skel, calloc, calloc_enter);
ATTACH_URETPROBE_CHECKED(skel, calloc, calloc_exit);
ATTACH_UPROBE_CHECKED(skel, realloc, realloc_enter);
ATTACH_URETPROBE_CHECKED(skel, realloc, realloc_exit);
ATTACH_UPROBE_CHECKED(skel, mmap, mmap_enter);
ATTACH_URETPROBE_CHECKED(skel, mmap, mmap_exit);
ATTACH_UPROBE_CHECKED(skel, posix_memalign, posix_memalign_enter);
ATTACH_URETPROBE_CHECKED(skel, posix_memalign, posix_memalign_exit);
ATTACH_UPROBE_CHECKED(skel, memalign, memalign_enter);
ATTACH_URETPROBE_CHECKED(skel, memalign, memalign_exit);
ATTACH_UPROBE_CHECKED(skel, free, free_enter);
ATTACH_UPROBE_CHECKED(skel, munmap, munmap_enter);
// the following probes are intentionally allowed to fail attachment
// deprecated in libc.so bionic
ATTACH_UPROBE(skel, valloc, valloc_enter);
ATTACH_URETPROBE(skel, valloc, valloc_exit);
// deprecated in libc.so bionic
ATTACH_UPROBE(skel, pvalloc, pvalloc_enter);
ATTACH_URETPROBE(skel, pvalloc, pvalloc_exit);
// added in C11
ATTACH_UPROBE(skel, aligned_alloc, aligned_alloc_enter);
ATTACH_URETPROBE(skel, aligned_alloc, aligned_alloc_exit);
return 0;
}
```
In this code snippet, we see a function called `attach_uprobes` that mounts uprobes (user space probes) onto memory allocation and deallocation functions. In Linux, uprobes are a kernel mechanism that allows setting breakpoints at arbitrary locations in user space programs, enabling precise observation and control over the behavior of user space programs.
Here, each memory-related function is traced using two uprobes: one at the entry (enter) of the function and one at the exit. Thus, every time these functions are called or return, a uprobes event is triggered, which in turn triggers the corresponding BPF program.
In the actual implementation, we use two macros, `ATTACH_UPROBE` and `ATTACH_URETPROBE`, to attach uprobes and uretprobes (function return probes), respectively. Each macro takes three arguments: the skeleton of the BPF program (skel), the name of the function to monitor, and the name of the BPF program to trigger.
These mount points include common memory allocation functions such as malloc, calloc, realloc, mmap, posix_memalign, memalign, free, and their corresponding exit points. Additionally, we also observe some possible allocation functions such as valloc, pvalloc, aligned_alloc, although they may not always exist.
The goal of these mount points is to capture all possible memory allocation and deallocation events, allowing our memory leak detection tool to obtain as comprehensive data as possible. This approach enables us to track not only memory allocation and deallocation but also their contextual information such as call stacks and invocation counts, helping us to pinpoint and fix memory leak issues.
Note that some memory allocation functions may not exist or may have been deprecated, such as valloc and pvalloc. Thus, their attachment may fail. In such cases, we allow for attachment failures, which do not prevent the program from executing. This is because we are more focused on mainstream and commonly used memory allocation functions, while these deprecated functions are often used less frequently in practical applications.
Complete source code: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/16-memleak>
Reference: <https://github.com/iovisor/bcc/blob/master/libbpf-tools/memleak.c>
## Compile and Run
```console
$ make
$ sudo ./memleak
using default object: libc.so.6
using page size: 4096
tracing kernel: true
Tracing outstanding memory allocs... Hit Ctrl-C to end
[17:17:27] Top 10 stacks with outstanding allocations:
1236992 bytes in 302 allocations from stack
0 [<ffffffff812c8f43>] <null sym>
1 [<ffffffff812c8f43>] <null sym>
2 [<ffffffff812a9d42>] <null sym>
3 [<ffffffff812aa392>] <null sym>
4 [<ffffffff810df0cb>] <null sym>
5 [<ffffffff81edc3fd>] <null sym>
6 [<ffffffff82000b62>] <null sym>
...
```
## Summary
Through this eBPF introductory tutorial, you have learned how to write a Memleak eBPF monitoring program to monitor memory leaks in real time. You have also learned about the application of eBPF in memory monitoring, how to write eBPF programs using the BPF API, create and use eBPF maps, and how to use eBPF tools to monitor and analyze memory leak issues. We have provided a detailed example to help you understand the execution flow and principles of eBPF code.
You can visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials.
The next tutorial will further explore advanced features of eBPF, and we will continue to share more content related to eBPF development practices. We hope that this knowledge and skills will help you better understand and use eBPF to solve problems encountered in practical work.

View File

@@ -0,0 +1,21 @@
# eBPF Getting Started Tutorial: Writing eBPF Program Biopattern: Statistical Random/Sequential Disk I/O
## Background
Biopattern can statistically count the ratio of random/sequential disk I/O.
TODO
## Implementation Principle
The ebpf code of Biopattern is implemented under the mount point tracepoint/block/block_rq_complete. After the disk completes an IO request, the program will pass through this mount point. Biopattern has an internal hash table with device number as the primary key. When the program passes through the mount point, Biopattern obtains the operation information and determines whether the current operation is random or sequential IO based on the previous operation record of the device in the hash table, and updates the operation count.
## Writing eBPF Program
TODO
### Summary
Biopattern can show the ratio of random/sequential disk I/O, which is very helpful for developers to grasp the overall I/O situation.
TODO

View File

@@ -0,0 +1,3 @@
# More Reference Materials
TODO

View File

@@ -0,0 +1,168 @@
# eBPF Getting Started Tutorial: Security Detection and Defense using LSM
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool widely used in the Linux kernel. eBPF allows developers to dynamically load, update, and run user-defined code without restarting the kernel or modifying the kernel source code. This feature enables eBPF to provide high flexibility and performance, making it widely applicable in network and system performance analysis. The same applies to eBPF applications in security, and this article will introduce how to use the eBPF LSM (Linux Security Modules) mechanism to implement a simple security check program.
## Background
LSM has been an official security framework in the Linux kernel since Linux 2.6, and security implementations based on it include SELinux and AppArmor. With the introduction of BPF LSM in Linux 5.7, system developers have been able to freely implement function-level security checks. This article provides an example of limiting access to a specific IPv4 address through the socket connect function using a BPF LSM program. (This demonstrates its high control precision.)
## Overview of LSM
LSM (Linux Security Modules) is a framework in the Linux kernel that supports various computer security models. LSM predefines a set of hook points on critical paths related to Linux kernel security, decoupling the kernel from security modules. This allows different security modules to be loaded/unloaded in the kernel freely without modifying the existing kernel code, thus enabling them to provide security inspection features.
In the past, using LSM mainly involved configuring existing security modules like SELinux and AppArmor or writing custom kernel modules. However, with the introduction of the BPF LSM mechanism in Linux 5.7, everything changed. Now, developers can write custom security policies using eBPF and dynamically load them into the LSM mount points in the kernel without configuring or writing kernel modules.
Some of the hook points currently supported by LSM include:
+ File open, creation, deletion, and movement;
+ Filesystem mounting;
+ Operations on tasks and processes;
+ Operations on sockets (creating, binding sockets, sending and receiving messages, etc.);
For more hook points, refer to [lsm_hooks.h](https://github.com/torvalds/linux/blob/master/include/linux/lsm_hooks.h).
## Verifying BPF LSM Availability
First, please confirm that your kernel version is higher than 5.7. Next, you can use the following command to check if BPF LSM support is enabled:
```console
$ cat /boot/config-$(uname -r) | grep BPF_LSM
CONFIG_BPF_LSM=y
```
If the output contains `CONFIG_BPF_LSM=y`, BPF LSM is supported. Provided that the above conditions are met, you can use the following command to check if the output includes the `bpf` option:
```console
$ cat /sys/kernel/security/lsm
ndlock,lockdown,yama,integrity,apparmor
```
If the output does not include the `bpf` option (as in the example above), you can modify `/etc/default/grub`:
```conf
GRUB_CMDLINE_LINUX="lsm=ndlock,lockdown,yama,integrity,apparmor,bpf"
```
Then, update the grub configuration using the `update-grub2` command (the corresponding command may vary depending on the system), and restart the system.
## Writing eBPF Programs
```C
// lsm-connect.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
char LICENSE[] SEC("license") = "GPL";
#define EPERM 1
#define AF_INET 2
const __u32 blockme = 16843009; // 1.1.1.1 -> int
SEC("lsm/socket_connect")
int BPF_PROG(restrict_connect, struct socket *sock, struct sockaddr *address, int addrlen, int ret)
{
// Satisfying "cannot override a denial" rule
if (ret != 0)
{
return ret;
}
// Only IPv4 in this example
if (address->sa_family != AF_INET)
{
return 0;
}
// Cast the address to an IPv4 socket address
struct sockaddr_in *addr = (struct sockaddr_in *)address;
// Where do you want to go?
__u32 dest = addr->sin_addr.s_addr;
bpf_printk("lsm: found connect to %d", dest);
if (dest == blockme)
{
bpf_printk("lsm: blocking %d", dest);
return -EPERM;
}
return 0;
}
```
This is eBPF code implemented in C on the kernel side. It blocks all connection operations through a socket to 1.1.1.1. The following information is included:
+ The `SEC("lsm/socket_connect")` macro indicates the expected mount point for this program.
+ The program is defined by the `BPF_PROG` macro (see [tools/lib/bpf/bpf_tracing.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/lib/bpf/bpf_tracing.h) for details).
+ `restrict_connect` is the program name required by the `BPF_PROG` macro.
+ `ret` is the return value of the LSM check program (potential) before the current function on this mount point.
The overall idea of the program is not difficult to understand:
+ First, if the return value of other security check functions is non-zero (failed), there is no need to check further and the connection is rejected.
+ Next, it determines whether it is an IPv4 connection request and compares the address being connected to with 1.1.1.1.
+ If the requested address is 1.1.1.1, the connection is blocked; otherwise, the connection is allowed.
During the execution of the program, all connection operations through a socket will be output to `/sys/kernel/debug/tracing/trace_pipe`.
## Compilation and Execution
Compile using a container:
```console
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
Or compile using `ecc`:
```console
$ ecc lsm-connect.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
And run using `ecli`:
```shell
sudo ecli run package.json
```
Next, open another terminal and try to access 1.1.1.1:
```console
$ ping 1.1.1.1
ping: connect: Operation not permitted
$ curl 1.1.1.1
curl: (7) Couldn't connect to server
$ wget 1.1.1.1
--2023-04-23 08:41:18-- (try: 2) http://1.1.1.1/
Connecting to 1.1.1.1:80... failed: Operation not permitted.
Retrying.
```
At the same time, we can view the output of `bpf_printk`:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
ping-7054 [000] d...1 6313.430872: bpf_trace_printk: lsm: found connect to 16843009
ping-7054 [000] d...1 6313.430874: bpf_trace_printk: lsm: blocking 16843009
curl-7058 [000] d...1 6316.346582: bpf_trace_printk: lsm: found connect to 16843009
curl-7058 [000] d...1 6316.346584: bpf_trace_printk: lsm: blocking 16843009".```
wget-7061 [000] d...1 6318.800698: bpf_trace_printk: lsm: found connect to 16843009
wget-7061 [000] d...1 6318.800700: bpf_trace_printk: lsm: blocking 16843009
```
Complete source code: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/19-lsm-connect>
## Summary
This article introduces how to use BPF LSM to restrict access to a specific IPv4 address through a socket. We can enable the LSM BPF mount point by modifying the GRUB configuration file. In the eBPF program, we define functions using the `BPF_PROG` macro and specify the mount point using the `SEC` macro. In the implementation of the function, we follow the principle of "cannot override a denial" in the LSM security-checking module and restrict the socket connection request based on the destination address of the request.
If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials.
## References
+ <https://github.com/leodido/demo-cloud-native-ebpf-day>
+ <https://aya-rs.dev/book/programs/lsm/#writing-lsm-bpf-program>

View File

@@ -0,0 +1,150 @@
# eBPF Beginner's Development Practice Tutorial 2: Using kprobe to Monitor the unlink System Call in eBPF
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at runtime.
This article is the second part of the eBPF beginner's development practice tutorial, focusing on using kprobe to capture the unlink system call in eBPF. The article will first explain the basic concepts and technical background of kprobes, and then introduce how to use kprobe to capture the unlink system call in eBPF.
## Background of kprobes Technology
During the debugging process of the kernel or modules, developers often need to know whether certain functions are called, when they are called, whether the execution is correct, and what the input and return values of the functions are. A simple approach is to add log print information to the corresponding functions in the kernel code. However, this approach often requires recompiling the kernel or modules, restarting the device, etc., which is complex and may disrupt the original code execution process.
By using the kprobes technology, users can define their own callback functions and dynamically insert probes into almost all functions in the kernel or modules (some functions cannot be probed, such as the kprobes' own implementation functions, which will be explained in detail later). When the kernel execution flow reaches the specified probe function, it will invoke the callback function, allowing the user to collect the desired information. The kernel will then return to the normal execution flow. If the user has collected sufficient information and no longer needs to continue probing, the probes can be dynamically removed. Therefore, the kprobes technology has the advantages of minimal impact on the kernel execution flow and easy operation.
The kprobes technology includes three detection methods: kprobe, jprobe, and kretprobe. First, kprobe is the most basic detection method and serves as the basis for the other two. It allows probes to be placed at any position (including within a function). It provides three callback modes for probes: `pre_handler`, `post_handler`, and `fault_handler`. The `pre_handler` function is called before the probed instruction is executed, the `post_handler` is called after the probed instruction is completed (note that it is not the probed function), and the `fault_handler` is called when a memory access error occurs. The jprobe is based on kprobe and is used to obtain the input values of the probed function. Finally, as the name suggests, kretprobe is also based on kprobe and is used to obtain the return values of the probed function.
The kprobes technology is not only implemented through software but also requires support from the hardware architecture. This involves CPU exception handling and single-step debugging techniques. The former is used to make the program's execution flow enter the user-registered callback function, and the latter is used to single-step execute the probed instruction. Therefore, not all architectures support kprobes. Currently, kprobes technology supports various architectures, including i386, x86_64, ppc64, ia64, sparc64, arm, ppc, and mips (note that some architecture implementations may not be complete, see the kernel's Documentation/kprobes.txt for details).
Features and Usage Restrictions of kprobes:
1. kprobes allows multiple kprobes to be registered at the same probe position, but jprobe currently does not support this. It is also not allowed to use other jprobe callback functions or the `post_handler` callback function of kprobe as probe points.
2. In general, any function in the kernel can be probed, including interrupt handlers. However, the functions used to implement kprobes themselves in kernel/kprobes.c and arch/*/kernel/kprobes.c are not allowed to be probed. Additionally, `do_page_fault` and `notifier_call_chain` are also not allowed.
3. If an inline function is used as a probe point, kprobes may not be able to guarantee that probe points are registered for all instances of that function. Since gcc may automatically optimize certain functions as inline functions, the desired probing effect may not be achieved.
4. The callback function of a probe point may modify the runtime context of the probed function, such as by modifying the kernel's data structure or saving register information before triggering the prober in the `struct pt_regs` structure. Therefore, kprobes can be used to install bug fixes or inject fault testing code.
5. kprobes avoids calling the callback function of another probe point again when processing the probe point function. For example, if a probe point is registered on the `printk()` function and the callback function may call `printk()` again, the callback for the `printk` probe point will not be triggered again. Only the `nmissed` field in the `kprobe` structure will be incremented.
6. mutex locks and dynamic memory allocation are not used in the registration and removal process of kprobes.
format: markdown7. During the execution of kprobes callback functions, kernel preemption is disabled, and it may also be executed with interrupts disabled, which depends on the CPU architecture. Therefore, regardless of the situation, do not call functions that will give up the CPU in the callback function (such as semaphore, mutex lock, etc.);
8. kretprobe is implemented by replacing the return address with the pre-defined trampoline address, so stack backtraces and gcc inline function `__builtin_return_address()` will return the address of the trampoline instead of the actual return address of the probed function;
9. If the number of function calls and return calls of a function are unequal, registering kretprobe on such a function may not achieve the expected effect, for example, the `do_exit()` function will have problems, while the `do_execve()` function and `do_fork()` function will not;
10. When entering and exiting a function, if the CPU is running on a stack that does not belong to the current task, registering kretprobe on that function may have unpredictable consequences. Therefore, kprobes does not support registering kretprobe for the `__switch_to()` function under the X86_64 architecture and will directly return `-EINVAL`.
## kprobe Example
The complete code is as follows:
```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
char LICENSE[] SEC("license") = "Dual BSD/GPL";
SEC("kprobe/do_unlinkat")
int BPF_KPROBE(do_unlinkat, int dfd, struct filename *name)
{
pid_t pid;
const char *filename;
pid = bpf_get_current_pid_tgid() >> 32;
filename = BPF_CORE_READ(name, name);
bpf_printk("KPROBE ENTRY pid = %d, filename = %s\n", pid, filename);
return 0;
}
SEC("kretprobe/do_unlinkat")
int BPF_KRETPROBE(do_unlinkat_exit, long ret)
{
pid_t pid;
pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("KPROBE EXIT: pid = %d, ret = %ld\n", pid, ret);
return 0;
}
```
This code is a simple eBPF program used to monitor and capture the unlink system call executed in the Linux kernel. The unlink system call is used to delete a file. This eBPF program traces this system call by placing hooks at the entry and exit points of the `do_unlinkat` function using a kprobe (kernel probe).
First, we import necessary header files such as vmlinux.h, bpf_helpers.h, bpf_tracing.h, and bpf_core_read.h. Then, we define a license to allow the program to run in the kernel.
```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
char LICENSE[] SEC("license") = "Dual BSD/GPL";
```
Next, we define a kprobe named `BPF_KPROBE(do_unlinkat)` which gets triggered when the `do_unlinkat` function is entered. It takes two parameters: `dfd` (file descriptor) and `name` (filename structure pointer). In this kprobe, we retrieve the PID (process identifier) of the current process and then read the filename. Finally, we use the `bpf_printk` function to print the PID and filename in the kernel log.
```c
SEC("kprobe/do_unlinkat")
int BPF_KPROBE(do_unlinkat, int dfd, struct filename *name)
{
pid_t pid;
const char *filename;
pid = bpf_get_current_pid_tgid() >> 32;
filename = BPF_CORE_READ(name, name);".```
bpf_printk("KPROBE ENTRY pid = %d, filename = %s\n", pid, filename);
return 0;
}
```
Next, we define a kretprobe named `BPF_KRETPROBE(do_unlinkat_exit)` that will be triggered when exiting the `do_unlinkat` function. The purpose of this kretprobe is to capture the return value (`ret`) of the function. We again obtain the PID of the current process and use the `bpf_printk` function to print the PID and return value in the kernel log.
```c
SEC("kretprobe/do_unlinkat")
int BPF_KRETPROBE(do_unlinkat_exit, long ret)
{
pid_t pid;
pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("KPROBE EXIT: pid = %d, ret = %ld\n", pid, ret);
return 0;
}
```
eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that combines with Wasm. Its goal is to simplify the development, build, distribution, and execution of eBPF programs. You can refer to <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and ecli runtime.
To compile this program, use the ecc tool:
```console
$ ecc kprobe-link.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
Then run:
```console
sudo ecli run package.json
```
In another window:
```shell
touch test1
rm test1
touch test2
rm test2
```
You should see kprobe demo output similar to the following in the /sys/kernel/debug/tracing/trace_pipe file:
```shell
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
rm-9346 [005] d..3 4710.951696: bpf_trace_printk: KPROBE ENTRY pid = 9346, filename = test1
rm-9346 [005] d..4 4710.951819: bpf_trace_printk: KPROBE EXIT: ret = 0
rm-9346 [005] d..3 4710.951852: bpf_trace_printk: KPROBE ENTRY pid = 9346, filename = test2
rm-9346 [005] d..4 4710.951895: bpf_trace_printk: KPROBE EXIT: ret = 0
```
## Summary
In this article's example, we learned how to use eBPF's kprobe and kretprobe to capture the unlink system call. For more examples and detailed development guides, please refer to the official documentation of eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>
This article is the second part of the introductory eBPF development tutorial. The next article will explain how to use fentry to monitor and capture the unlink system call in eBPF.
If you'd like to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials.

109
src/20-tc/README_en.md Normal file
View File

@@ -0,0 +1,109 @@
# eBPF Introductory Practice Tutorial 20: Use eBPF for tc Traffic Control
## Background
Linux's Traffic Control (tc) subsystem has been present in the kernel for many years. Similar to the relationship between iptables and netfilter, tc includes a user-space tc program and a kernel-level traffic control framework. It is mainly used to control the sending and receiving of packets in terms of rate, sequence, and other aspects. Starting from Linux 4.1, tc has added some new attachment points and supports loading eBPF programs as filters onto these attachment points.
## Overview of tc
From the protocol stack perspective, tc is located at the link layer. Its position has already completed the allocation of sk_buff and is later than xdp. In order to control the sending and receiving of packets, tc uses a queue structure to temporarily store and organize packets. In the tc subsystem, the corresponding data structure and algorithm control mechanism are abstracted as qdisc (Queueing discipline). It exposes two callback interfaces for enqueuing and dequeuing packets externally, and internally hides the implementation of queuing algorithms. In qdisc, we can implement complex tree structures based on filters and classes. Filters are mounted on qdisc or class to implement specific filtering logic, and the return value determines whether the packet belongs to a specific class.
When a packet reaches the top-level qdisc, its enqueue interface is called, and the mounted filters are executed one by one until a filter matches successfully. Then the packet is sent to the class pointed to by that filter and enters the qdisc processing process configured by that class. The tc framework provides the so-called classifier-action mechanism, that is, when a packet matches a specific filter, the action mounted by that filter is executed to process the packet, implementing a complete packet classification and processing mechanism.
The existing tc provides eBPF with the direct-action mode, which allows an eBPF program loaded as a filter to return values such as `TC_ACT_OK` as tc actions, instead of just returning a classid like traditional filters and handing over the packet processing to the action module. Now, eBPF programs can be mounted on specific qdiscs to perform packet classification and processing actions.
## Writing eBPF Programs
```c
#include <vmlinux.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#define TC_ACT_OK 0
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
/// @tchook {"ifindex":1, "attach_point":"BPF_TC_INGRESS"}
/// @tcopts {"handle":1, "priority":1}
SEC("tc")
int tc_ingress(struct __sk_buff *ctx)
{
void *data_end = (void *)(__u64)ctx->data_end;
void *data = (void *)(__u64)ctx->data;
struct ethhdr *l2;
struct iphdr *l3;
if (ctx->protocol != bpf_htons(ETH_P_IP))
return TC_ACT_OK;
l2 = data;
if ((void *)(l2 + 1) > data_end)
return TC_ACT_OK;
l3 = (struct iphdr *)(l2 + 1);
if ((void *)(l3 + 1) > data_end)
return TC_ACT_OK;
bpf_printk("Got IP packet: tot_len: %d, ttl: %d", bpf_ntohs(l3->tot_len), l3->ttl);
return TC_ACT_OK;
}
char __license[] SEC("license") = "GPL";
```
This code defines an eBPF program that can capture and process packets through Linux TC (Transmission Control). In this program, we limit it to capture only IPv4 protocol packets, and then print out the total length and Time-To-Live (TTL) value of the packet using the bpf_printk function.Here is the translated text:
"
What needs to be noted is that we use some BPF library functions in the code, such as the functions bpf_htons and bpf_ntohs, which are used for conversion between network byte order and host byte order. In addition, we also use some comments to provide additional points and option information for TC. For example, at the beginning of this code, we use the following comments:
```c
/// @tchook {"ifindex":1, "attach_point":"BPF_TC_INGRESS"}
/// @tcopts {"handle":1, "priority":1}
```
These comments tell TC to attach the eBPF program to the ingress attachment point of the network interface, and specify the values of the handle and priority options. You can refer to the introduction in [patchwork](https://patchwork.kernel.org/project/netdevbpf/patch/20210512103451.989420-3-memxor@gmail.com/) for tc-related APIs in libbpf.
In summary, this code implements a simple eBPF program that captures packets and prints out their information.
## Compilation and Execution
Compile using a container:
```console
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
Or compile using `ecc`:
```console
$ ecc tc.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
And run using `ecli`:
```shell
sudo ecli run ./package.json
```
You can view the output of the program in the following way:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
node-1254811 [007] ..s1 8737831.671074: 0: Got IP packet: tot_len: 79, ttl: 64
sshd-1254728 [006] ..s1 8737831.674334: 0: Got IP packet: tot_len: 79, ttl: 64
sshd-1254728 [006] ..s1 8737831.674349: 0: Got IP packet: tot_len: 72, ttl: 64
node-1254811 [007] ..s1 8737831.674550: 0: Got IP packet: tot_len: 71, ttl: 64
```
## Summary
This article introduces how to mount eBPF type filters to the TC traffic control subsystem to achieve queuing processing of link layer packets. Based on the solution provided by eunomia-bpf to pass parameters to libbpf through comments, we can mount our own tc BPF program to the target network device with specified options and use the sk_buff structure of the kernel to filter and process packets.
If you want to learn more about eBPF knowledge and practice, you can visit our tutorial code repository <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials.
## References
+ <http://just4coding.com/2022/08/05/tc/>
+ <https://arthurchiao.art/blog/understanding-tc-da-mode-zh/>
"

102
src/21-xdp/README_en.md Normal file
View File

@@ -0,0 +1,102 @@
# eBPF 入门实践教程二十一:使用 xdp 实现可编程包处理
## 背景
xdpeXpress Data Path是 Linux 内核中新兴的一种绕过内核的、可编程的包处理方案。相较于 cBPFxdp 的挂载点非常底层,位于网络设备驱动的软中断处理过程,甚至早于 skb_buff 结构的分配。因此,在 xdp 上挂载 eBPF 程序适用于很多简单但次数极多的包处理操作(如防御 Dos 攻击可以达到很高的性能24Mpps/core
## XDP 概述
xdp 不是第一个支持可编程包处理的系统,在此之前,以 DPDKData Plane Development Kit为代表的内核旁路方案甚至能够取得更高的性能其思路为完全绕过内核由用户态的网络应用接管网络设备从而避免了用户态和内核态的切换开销。然而这样的方式具有很多天然的缺陷
+ 无法与内核中成熟的网络模块集成,而不得不在用户态将其重新实现;
+ 破坏了内核的安全边界,使得内核提供的很多网络工具变得不可用;
+ 在与常规的 socket 交互时,需要从用户态重新将包注入到内核;
+ 需要占用一个或多个单独的 CPU 来进行包处理;
除此之外,利用内核模块和内核网络协议栈中的 hook 点也是一种思路,然而前者对内核的改动大,出错的代价高昂;后者在整套包处理流程中位置偏后,其效率不够理想。
总而言之xdp + eBPF 为可编程包处理系统提出了一种更为稳健的思路,在某种程度上权衡了上述方案的种种优点和不足,获取较高性能的同时又不会对内核的包处理流程进行过多的改变,同时借助 eBPF 虚拟机的优势将用户定义的包处理过程进行隔离和限制,提高了安全性。
## 编写 eBPF 程序
```C
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
/// @ifindex 1
/// @flags 0
/// @xdpopts {"old_prog_fd":0}
SEC("xdp")
int xdp_pass(struct xdp_md* ctx) {
void* data = (void*)(long)ctx->data;
void* data_end = (void*)(long)ctx->data_end;
int pkt_sz = data_end - data;
bpf_printk("packet size is %d", pkt_sz);
return XDP_PASS;
}
char __license[] SEC("license") = "GPL";
```
这是一段 C 语言实现的 eBPF 内核侧代码,它能够通过 xdp 捕获所有经过目标网络设备的数据包,计算其大小并输出到 `trace_pipe` 中。
值得注意的是,在代码中我们使用了以下注释:
```C
/// @ifindex 1
/// @flags 0
/// @xdpopts {"old_prog_fd":0}
```
这是由 eunomia-bpf 提供的功能,我们可以通过这样的注释告知 eunomia-bpf 加载器此 xdp 程序想要挂载的目标网络设备编号,挂载的标志和选项。
这些变量的设计基于 libbpf 提供的 API可以通过 [patchwork](https://patchwork.kernel.org/project/netdevbpf/patch/20220120061422.2710637-2-andrii@kernel.org/#24705508) 查看接口的详细介绍。
`SEC("xdp")` 宏指出 BPF 程序的类型,`ctx` 是此 BPF 程序执行的上下文,用于包处理流程。
在程序的最后,我们返回了 `XDP_PASS`,这表示我们的 xdp 程序会将经过目标网络设备的包正常交付给内核的网络协议栈。可以通过 [XDP actions](https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/xdp_actions.html) 了解更多 xdp 的处理动作。
## 编译运行
通过容器编译:
```console
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
或是通过 `ecc` 编译:
```console
$ ecc xdp.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
并通过 `ecli` 运行:
```console
sudo ecli run package.json
```
可以通过如下方式查看程序的输出:
```console
$ sudo cat /sys/kernel/tracing/trace_pipe
node-1939 [000] d.s11 1601.190413: bpf_trace_printk: packet size is 177
node-1939 [000] d.s11 1601.190479: bpf_trace_printk: packet size is 66
ksoftirqd/1-19 [001] d.s.1 1601.237507: bpf_trace_printk: packet size is 66
node-1939 [000] d.s11 1601.275860: bpf_trace_printk: packet size is 344
```
## 总结
本文介绍了如何使用 xdp 来处理经过特定网络设备的包,基于 eunomia-bpf 提供的通过注释向 libbpf 传递参数的方案,我们可以将自己编写的 xdp BPF 程序以指定选项挂载到目标设备,并在网络包进入内核网络协议栈之前就对其进行处理,从而获取高性能的可编程包处理能力。
如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 <https://github.com/eunomia-bpf/bpf-developer-tutorial> 以获取更多示例和完整的教程。
## 参考资料
+ <http://arthurchiao.art/blog/xdp-paper-acm-2018-zh/>
+ <http://arthurchiao.art/blog/linux-net-stack-implementation-rx-zh/>
+ <https://github.com/xdp-project/xdp-tutorial/tree/master/basic01-xdp-pass>

147
src/22-android/README_en.md Normal file
View File

@@ -0,0 +1,147 @@
# Using eBPF Programs on Android
> This article mainly documents the author's exploration process, results, and issues encountered while testing the level of support for CO-RE technology based on the libbpf library on high version Android kernels in the Android Studio Emulator. The test was conducted by building a Debian environment in the Android Shell environment and attempting to build the eunomia-bpf toolchain and run its test cases based on this.
## Background
As of now (2023-04), Android has not provided good support for dynamic loading of eBPF programs. Both the compiler distribution scheme represented by bcc and the CO-RE scheme based on btf and libbpf rely heavily on Linux environment support and cannot run well on the Android system.[^WeiShu]
However, there have been some successful cases of trying eBPF on the Android platform. In addition to the solution provided by Google to modify `Android.bp` to build and mount eBPF programs with the entire system[^Google], some people have proposed building a Linux environment based on the Android kernel and running the eBPF toolchain using this approach, and have developed related tools.
Currently available information mostly focuses on the testing of bcc and bpftrace toolchains based on the adeb/eadb sandbox built on the Android kernel, with less testing work on the CO-RE scheme. There is more reference material available for using the bcc tool on Android, such as:
+ SeeFlowerX: <https://blog.seeflower.dev/category/eBPF/>
+ evilpan: <https://bbs.kanxue.com/thread-271043.htm>
The main idea is to use chroot to run a Debian image on the Android kernel and build the entire bcc toolchain within it in order to use eBPF tools. The same principle applies to using bpftrace.
In fact, higher versions of the Android kernel already support the btf option, which means that the emerging CO-RE technology in the eBPF field should also be applicable to Linux systems based on the Android kernel. This article will test and run eunomia-bpf in the emulator environment based on this.
> [eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf) is an open-source project that combines libbpf and WebAssembly technology, aiming to simplify the writing, compilation, and deployment of eBPF programs. This project can be seen as a practical way of implementing CO-RE, with libbpf as its core dependency. It is believed that the testing work of eunomia-bpf can provide reference for other CO-RE schemes.
## Test Environment
+ Android Emulator (Android Studio Flamingo | 2022.2.1)
+ AVD: Pixel 6
+ Android Image: Tiramisu Android 13.0 x86_64 (5.15.41-android13-8-00055-g4f5025129fe8-ab8949913)
## Environment Setup[^SeeFlowerX]
1. Obtain `debianfs-amd64-full.tar.gz` from the releases page of the [eadb repository](https://github.com/tiann/eadb) as the rootfs of the Linux environment. Also, get the `assets` directory from this project to build the environment.
2. Configure and start the Android Virtual Device in the Android Studio Device Manager.
3. Push `debianfs-amd64-full.tar.gz` and the `assets` directory to the AVD using the adb tool from the Android Studio SDK:
+ `./adb push debianfs-amd64-full.tar.gz /data/local/tmp/deb.tar.gz`
+ `./adb push assets /data/local/tmp/assets`
4. Use adb to enter the Android shell environment and obtain root permissions:
+ `./adb shell`
+ `su`
5. Build and enter the debian environment in the Android shell:
+ `mkdir -p /data/eadb`
+ `mv /data/local/tmp/assets/* /data/eadb`
+ `mv /data/local/tmp/deb.tar.gz /data/eadb/deb.tar.gz`+ `rm -r /data/local/tmp/assets`
+ `chmod +x /data/eadb/device-*`
+ `/data/eadb/device-unpack`
+ `/data/eadb/run /data/eadb/debian`
At this point, the Linux environment required for testing eBPF has been set up. In addition, in the Android shell (before entering debian), you can use `zcat /proc/config.gz` in conjunction with `grep` to view kernel compilation options.
>Currently, the debian environment packaged by eadb has a low version of libc and lacks many tool dependencies. Additionally, due to different kernel compilation options, some eBPF features may not be available.
## Build Tools
Clone the eunomia-bpf repository into the local debian environment. For the specific build process, refer to the repository's [build.md](https://github.com/eunomia-bpf/eunomia-bpf/blob/master/documents/build.md). In this test, I used the `ecc` compilation method to generate the `package.json`. Please refer to the [repository page](https://github.com/eunomia-bpf/eunomia-bpf/tree/master/compiler) for the build and usage instructions for this tool.
>During the build process, you may need to manually install tools such as `curl`, `pkg-config`, `libssl-dev`, etc.
## Results
Some eBPF programs can be successfully executed on Android, but there are also some applications that cannot be executed successfully for various reasons.
### Success Cases
#### [bootstrap](https://github.com/eunomia-bpf/eunomia-bpf/tree/master/examples/bpftools/bootstrap)
The output of running is as follows:
```console
TIME PID PPID EXIT_CODE DURATION_NS COMM FILENAME EXIT_EVENT
09:09:19 10217 479 0 0 sh /system/bin/sh 0
09:09:19 10217 479 0 0 ps /system/bin/ps 0
09:09:19 10217 479 0 54352100 ps 1
09:09:21 10219 479 0 0 sh /system/bin/sh 0
09:09:21 10219 479 0 0 ps /system/bin/ps 0
09:09:21 10219 479 0 44260900 ps 1
```
#### [tcpstates](https://github.com/eunomia-bpf/eunomia-bpf/tree/master/examples/bpftools/tcpstates)
After starting monitoring, download a web page using `wget` in the Linux environment:
```console
TIME SADDR DADDR SKADDR TS_US DELTA_US PID OLDSTATE NEWSTATE FAMILY SPORT DPORT TASK
09:07:46 0x4007000200005000000000000f02000a 0x5000000000000f02000a8bc53f77 18446635827774444352 3315344998 0 10115 7 2 2 0 80 wget
09:07:46 0x40020002d98e50003d99f8090f02000a 0xd98e50003d99f8090f02000a8bc53f77 18446635827774444352 3315465870 120872 0 2 1 2 55694 80 swapper/0
09:07:46 0x40010002d98e50003d99f8090f02000a 0xd98e50003d99f8090f02000a8bc53f77 18446635827774444352 3315668799 202929 10115 1 4 2 55694 80 wget
09:07:46 0x40040002d98e50003d99f8090f02000a 0xd98e50003d99f8090f02000a8bc53f77 18446635827774444352 3315670037 1237 0 4 5 2 55694 80 swapper/0
09:07:46 0x40050002000050003d99f8090f02000a 0x50003d99f8090f02000a8bc53f77 18446635827774444352 3315670225 188 0 5 7 2 55694 80 swapper/0
09:07:47 0x400200020000bb01565811650f02000a 0xbb01565811650f02000a6aa0d9ac 18446635828348806592 3316433261 0 2546 2 7 2 49970 443 ChromiumNet
09:07:47 0x400200020000bb01db794a690f02000a 0xbb01db794a690f02000aea2afb8e 18446635827774427776 3316535591 0 1469 2 7 2 37386 443 ChromiumNet
```
Start the detection and open the Chrome browser in the Android Studio simulation interface to access the Baidu page:
```console
TIME SADDR DADDR SKADDR TS_US DELTA_US PID OLDSTATE NEWSTATE FAMILY SPORT DPORT TASK
07:46:58 0x400700020000bb01000000000f02000a 0xbb01000000000f02000aeb6f2270 18446631020066638144 192874641 0 3305 7 2 2 0 443 NetworkService
07:46:58 0x40020002d28abb01494b6ebe0f02000a 0xd28abb01494b6ebe0f02000aeb6f2270 18446631020066638144 192921938 47297 3305 2 1 2 53898 443 NetworkService
07:46:58 0x400700020000bb01000000000f02000a 0xbb01000000000f02000ae7e7e8b7 18446631020132433920 193111426 0 3305 7 2 2 0 443 NetworkService
07:46:58 0x40020002b4a0bb0179ff85e80f02000a 0xb4a0bb0179ff85e80f02000ae7e7e8b7 18446631020132433920 193124670 13244 3305 2 1 2 46240 443 NetworkService
07:46:58 0x40010002b4a0bb0179ff85e80f02000a 0xb4a0bb0179ff85e80f02000ae7e7e8b7 18446631020132433920 193185397 60727 3305 1 4 2 46240 443 NetworkService
07:46:58 0x40040002b4a0bb0179ff85e80f02000a 0xb4a0bb0179ff85e80f02000ae7e7e8b7 18446631020132433920 193186122 724 3305 4 5 2 46240 443 NetworkService
07:46:58 0x400500020000bb0179ff85e80f02000a 0xbb0179ff85e80f02000ae7e7e8b7 18446631020132433920 193186244 122 3305 5 7 2 46240 443 NetworkService".07:46:59 0x40010002d01ebb01d0c52f5c0f02000a 0xd01ebb01d0c52f5c0f02000a51449c27 18446631020103553856 194110884 0 5130 1 8 2 53278 443 ThreadPoolForeg
07:46:59 0x400800020000bb01d0c52f5c0f02000a 0xbb01d0c52f5c0f02000a51449c27 18446631020103553856 194121000 10116 3305 8 7 2 53278 443 NetworkService
07:46:59 0x400700020000bb01000000000f02000a 0xbb01000000000f02000aeb6f2270 18446631020099513920 194603677 0 3305 7 2 2 0 443 NetworkService
07:46:59 0x40020002d28ebb0182dd92990f02000a 0xd28ebb0182dd92990f02000aeb6f2270 18446631020099513920 194649313 45635 12 2 1 2 53902 443 ksoftirqd/0
07:47:00 0x400700020000bb01000000000f02000a 0xbb01000000000f02000a26f6e878 18446631020132433920 195193350 0 3305 7 2 2 0 443 NetworkService
07:47:00 0x40020002ba32bb01e0e09e3a0f02000a 0xba32bb01e0e09e3a0f02000a26f6e878 18446631020132433920 195206992 13642 0 2 1 2 47666 443 swapper/0
07:47:00 0x400700020000bb01000000000f02000a 0xbb01000000000f02000ae7e7e8b7 18446631020132448128 195233125 0 3305 7 2 2 0 443 NetworkService
07:47:00 0x40020002b4a8bb0136cac8dd0f02000a 0xb4a8bb0136cac8dd0f02000ae7e7e8b7 18446631020132448128 195246569 13444 3305 2 1 2 46248 443 NetworkService
07:47:00 0xf02000affff00000000000000000000 0x1aca06cffff00000000000000000000 18446631019225912320 195383897 0 947 7 2 10 0 80 Thread-11
07:47:00 0x40010002b4a8bb0136cac8dd0f02000a 0xb4a8bb0136cac8dd0f02000ae7e7e8b7 18446631020132448128 195421584 175014 3305 1 4 2 46248 443 NetworkService
07:47:00 0x40040002b4a8bb0136cac8dd0f02000a 0xb4a8bb0136cac8dd0f02000ae7e7e8b7 18446631020132448128 195422361 777 3305 4 5 2 46248 443 NetworkService
07:47:00 0x400500020000bb0136cac8dd0f02000a 0xbb0136cac8dd0f02000ae7e7e8b7 18446631020132448128 195422450 88 3305 5 7 2 46248 443 NetworkService
07:47:01 0x400700020000bb01000000000f02000a 0xbb01000000000f02000aea2afb8e 18446631020099528128 196321556 0 1315 7 2 2 0 443 ChromiumNet
```
Note: some error messages may appear in the Android shell during the test:
```console
libbpf: failed to determine tracepoint 'syscalls/sys_enter_open' perf event ID: No such file or directory
libbpf: prog 'tracepoint__syscalls__sys_enter_open': failed to create tracepoint 'syscalls/sys_enter_open' perf event: No such file or directory
libbpf: prog 'tracepoint__syscalls__sys_enter_open': failed to auto-attach: -2
failed to attach skeleton
Error: BpfError("load and attach ebpf program failed")
```
Later, after investigation, it was found that the kernel did not enable the `CONFIG_FTRACE_SYSCALLS` option, which resulted in the inability to use the tracepoint of syscalls.
## Summary
The `CONFIG_DEBUG_INFO_BTF` option is enabled by default when viewing the kernel compilation options in the Android shell. Based on this, the examples provided by the eunomia-bpf project already have some successful cases, such as monitoring the execution of the `exec` family of functions and the status of TCP connections.
For some cases that cannot run, the reasons are mainly the following:
1. The kernel compilation options do not support the relevant eBPF functionality;
2. The Linux environment packaged by eadb is weak and lacks necessary dependencies;
Currently, using eBPF tools in the Android system still requires building a complete Linux runtime environment. However, the Android kernel itself has comprehensive support for eBPF. This test proves that higher versions of the Android kernel support BTF debugging information and CO-RE dependent eBPF programs.
The development of eBPF tools in the Android system requires the addition of official new features. Currently, it seems that using eBPF tools directly through an Android app requires a lot of effort. At the same time, since eBPF tools require root privileges, ordinary Android users will encounter more difficulties in using them.
If you want to learn more about eBPF knowledge and practice, you can visit our tutorial code repository <https://github.com/eunomia-bpf/bpf-developer-tutorial> to get more examples and complete tutorials.
## Reference
+ [Google android docs](https://source.android.google.cn/docs/core/architecture/kernel/bpf)
+ [weixin WeiShu](https://mp.weixin.qq.com/s/mul4n5D3nXThjxuHV7GpMA)
+ [SeeFlowerX](https://blog.seeflower.dev/archives/138/>)

3
src/23-http/README_en.md Normal file
View File

@@ -0,0 +1,3 @@
# http
TODO

1
src/24-hide/README_en.md Normal file
View File

@@ -0,0 +1 @@
# TODO: translate into English

View File

@@ -0,0 +1,24 @@
# Terminate Malicious Processes Using bpf_send_signal
Compile:
```bash
make
```
Usage:
```bash
sudo ./bpfdos
```
This program sends a `SIG_KILL` signal to any program that tries to use the `ptrace` system call, such as `strace`.
Once bpf-dos starts running, you can test it by running the following command:
```bash
strace /bin/whoami
```
## References
- <https://github.com/pathtofile/bad-bpf>.

21
src/26-sudo/README_en.md Normal file
View File

@@ -0,0 +1,21 @@
# Using eBPF to add sudo user
Compilation:
```bash
make
```
Usage:
```sh
sudo ./sudoadd --username lowpriv-user
```
This program allows a user with lower privileges to become root using `sudo`.
It works by intercepting `sudo` reading the `/etc/sudoers` file and overwriting the first line with `<username> ALL=(ALL:ALL) NOPASSWD:ALL #`. This tricks `sudo` into thinking that the user is allowed to become root. Other programs like `cat` or `sudoedit` are not affected, so the file remains unchanged and the user does not have these permissions. The `#` at the end of the line ensures that the rest of the line is treated as a comment, so it does not break the logic of the file.
## References
- [https://github.com/pathtofile/bad-bpf](https://github.com/pathtofile/bad-bpf)

View File

@@ -0,0 +1,36 @@
# Replace Text Read or Written by Any Program with eBPF
Compile:
```bash
make
```
Usage:
```sh
sudo ./replace --filename /path/to/file --input foo --replace bar
```
This program will replace all text in the file that matches 'input' with 'replace' text.
There are many use cases for this, such as:
Hiding the kernel module 'joydev' to avoid detection by tools like 'lsmod':
```bash
./replace -f /proc/modules -i 'joydev' -r 'cryptd'
```
Spoofing the MAC address of the 'eth0' interface:
```bash
./replace -f /sys/class/net/eth0/address -i '00:15:5d:01:ca:05' -r '00:00:00:00:00:00'
```
Malware performing anti-sandbox checks may look for MAC addresses as an indication of whether it is running in a virtual machine or sandbox, rather than on a "real" machine.
**Note:** The lengths of 'input' and 'replace' must be the same to avoid introducing NULL characters in the middle of the text block. To input a newline character at a bash prompt, use `$'\n'`, for example `--replace $'text\n'`.
## References
- <https://github.com/pathtofile/bad-bpf>.

View File

@@ -0,0 +1,57 @@
# Running eBPF Programs After User-Space Application Exits: The Lifecycle of eBPF Programs
By using the detach method to run eBPF programs, the user space loader can exit without stopping the eBPF program.
## The Lifecycle of eBPF Programs
First, we need to understand some key concepts, such as BPF objects (including programs, maps, and debug information), file descriptors (FDs), reference counting (refcnt), etc. In the eBPF system, user space accesses BPF objects through file descriptors, and each object has a reference count. When an object is created, its reference count is initialized to 1. If the object is no longer in use (i.e., no other programs or file descriptors reference it), its reference count will decrease to 0 and be cleaned up in memory after the RCU grace period.
Next, we need to understand the lifecycle of eBPF programs. First, when you create a BPF program and attach it to a "hook" (e.g., a network interface, a system call, etc.), its reference count increases. Then, even if the user space process that originally created and loaded the program exits, as long as the reference count of the BPF program is greater than 0, it will remain active. However, there is an important point in this process: not all hooks are equal. Some hooks are global, such as XDP, tc's clsact, and cgroup-based hooks. These global hooks keep the BPF program active until the objects themselves disappear. Some hooks are local and only run during the lifetime of the process that owns them.
For managing the lifecycle of BPF objects (programs or maps), another key operation is "detach." This operation prevents any future execution of the attached program. Then, in cases where you need to replace a BPF program, you can use the replace operation. This is a complex process because you need to ensure that during the replacement process, no events being processed are lost, and the old and new program may run simultaneously on different CPUs.
Finally, in addition to managing the lifecycle of BPF objects through file descriptors and reference counting, there is another method called BPFFS, which is the "BPF Filesystem." User space processes can "pin" a BPF program or map in BPFFS, which increases the reference count of the object, keeping the BPF object active even if the BPF program is not attached anywhere or the BPF map is not used by any program.
So when we talk about running eBPF programs in the background, we need to understand the meaning of this process. In some cases, even if the user space process has exited, we may still want the BPF program to keep running. This requires us to manage the lifecycle of BPF objects correctly.
## Running
Here, we still use the example of string replacement used in the previous application to demonstrate potential security risks. By using `--detach` to run the program, the user space loader can exit without stopping the eBPF program.
Compilation:
```bash
make
```
Before running, please make sure that the BPF file system has been mounted:
```bash
sudo mount bpffs -t bpf /sys/fs/bpf
mkdir /sys/fs/bpf/textreplace
```
Then, you can run text-replace2 with detach:
```bash
./textreplace2 -f /proc/modules -i 'joydev' -r 'cryptd' -d
```
This will create some eBPF link files under `/sys/fs/bpf/textreplace`. Once the loader is successfully running, you can check the log by running the following command:
```bash
sudo cat /sys/kernel/debug/tracing/trace_pipe
# Confirm that the link files exist
sudo ls -l /sys/fs/bpf/textreplace
```
Finally, to stop, simply delete the link files:
```bash
sudo rm -r /sys/fs/bpf/textreplace
```
## References
- <https://github.com/pathtofile/bad-bpf>
- <https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html>

View File

@@ -0,0 +1,75 @@
# eBPF sockops Example
## Performance Optimization using eBPF sockops
Network connections are essentially communication between sockets. eBPF provides a [bpf_msg_redirect_hash](https://man7.org/linux/man-pages/man7/bpf-helpers.7.html) function, which allows packets sent by applications to be directly forwarded to the destination socket, greatly accelerating the packet processing flow in the kernel.
Here, sock_map is the crucial part that records socket rules, i.e., based on the current packet information, a socket connection is selected from sock_map to forward the request. Therefore, it is necessary to save the socket information to sock_map at the hook of sockops or elsewhere, and provide a key-based rule (generally a four-tuple) to find the socket.
The Merbridge project uses eBPF instead of iptables to accelerate Istio. After using Merbridge (eBPF) optimization, inbound and outbound traffic will bypass many kernel modules, significantly improving performance, as shown in the figure below:
![merbridge](merbridge.png)
## Running the Example
This example program redirects traffic from the sender's socket (outbound) to the receiver's socket (inbound), **bypassing the TCP/IP kernel network stack**. In this example, we assume that the sender and receiver are running on the **same machine**.
### Compiling the eBPF Program
```shell
# Compile the bpf_sockops program
clang -O2 -g -Wall -target bpf -c bpf_sockops.c -o bpf_sockops.o
clang -O2 -g -Wall -target bpf -c bpf_redir.c -o bpf_redir.o
```
### Loading the eBPF Program
```shell
sudo ./load.sh
```
You can use the [bpftool utility](https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool-prog.rst) to check if these two eBPF programs have been loaded.
```console
$ sudo bpftool prog show
63: sock_ops name bpf_sockmap tag 275467be1d69253d gpl
loaded_at 2019-01-24T13:07:17+0200 uid 0
xlated 1232B jited 750B memlock 4096B map_ids 58
64: sk_msg name bpf_redir tag bc78074aa9dd96f4 gpl
loaded_at 2019-01-24T13:07:17+0200 uid 0
xlated 304B jited 233B memlock 4096B map_ids 58
```
### Running the [iperf3](https://iperf.fr/) Server
```shell
iperf3 -s -p 10000
```
### Running the [iperf3](https://iperf.fr/) Client
```shell
iperf3 -c 127.0.0.1 -t 10 -l 64k -p 10000
```
### Collecting Traces
```console
$ ./trace.sh
iperf3-9516 [001] .... 22500.634108: 0: <<< ipv4 op = 4, port 18583 --> 4135
iperf3-9516 [001] ..s1 22500.634137: 0: <<< ipv4 op = 5, port 4135 --> 18583
iperf3-9516 [001] .... 22500.634523: 0: <<< ipv4 op = 4, port 19095 --> 4135
iperf3-9516 [001] ..s1 22500.634536: 0: <<< ipv4 op = 5, port 4135 --> 19095
```
You should be able to see 4 events for socket establishment. If you don't see any events, the eBPF program may not have been attached correctly.
### Unloading the eBPF Program
```shell
sudo ./unload.sh
```
## References
- <https://github.com/zachidan/ebpf-sockops>- <https://github.com/merbridge/merbridge>

View File

@@ -0,0 +1,88 @@
# eBPF Introductory Development Practice Tutorial 3: Detecting Captured Unlink System Calls in eBPF
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and execute user-defined code at runtime in the kernel.
This article is the third part of the eBPF introductory development practice tutorial, focusing on capturing unlink system calls using fentry in eBPF.
## Fentry
fentry (function entry) and fexit (function exit) are two types of probes in eBPF (Extended Berkeley Packet Filter) used for tracing at the entry and exit points of Linux kernel functions. They allow developers to collect information, modify parameters, or observe return values at specific stages of kernel function execution. This tracing and monitoring functionality is very useful in performance analysis, troubleshooting, and security analysis scenarios.
Compared to kprobes, fentry and fexit programs have higher performance and availability. In this example, we can directly access the pointers to the functions' parameters, just like in regular C code, without needing various read helpers. The main difference between fexit and kretprobe programs is that fexit programs can access both the input parameters and return values of a function, while kretprobe programs can only access the return value. Starting from the 5.5 kernel, fentry and fexit are available for eBPF programs.
```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
char LICENSE[] SEC("license") = "Dual BSD/GPL";
SEC("fentry/do_unlinkat")
int BPF_PROG(do_unlinkat, int dfd, struct filename *name)
{
pid_t pid;
pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("fentry: pid = %d, filename = %s\n", pid, name->name);
return 0;
}
SEC("fexit/do_unlinkat")
int BPF_PROG(do_unlinkat_exit, int dfd, struct filename *name, long ret)
{
pid_t pid;
pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("fexit: pid = %d, filename = %s, ret = %ld\n", pid, name->name, ret);
return 0;
}
```
This program is an eBPF (Extended Berkeley Packet Filter) program written in the C language. It uses BPF fentry and fexit probes to trace the Linux kernel function `do_unlinkat`. In this tutorial, we will use this program as an example to learn how to use fentry in eBPF to detect and capture unlink system calls.
The program consists of the following parts:
1. Include header files: including vmlinux.h (for accessing kernel data structures), bpf/bpf_helpers.h (which includes eBPF helper functions), bpf/bpf_tracing.h (for eBPF tracing-related functionalities).
2. Define license: Here, a character array named `LICENSE` is defined, containing the license information "Dual BSD/GPL".
3. Define fentry probe: We define an fentry probe named `BPF_PROG(do_unlinkat)` that is triggered at the entry point of the `do_unlinkat` function. This probe retrieves the PID (Process ID) of the current process and prints it along with the filename to the kernel log.
4. Define fexit probe: We also define an fexit probe named `BPF_PROG(do_unlinkat_exit)` that is triggered at the exit point of the `do_unlinkat` function. Similar to the fentry probe, this probe also retrieves the PID of the current process and prints it along with the filename and return value to the kernel log.
Through this example, you can learn how to use fentry and fexit probes in eBPF to monitor and capture kernel function calls, such as the unlink system call in this tutorial.Instructions: Translate the following Chinese text to English
while maintaining the original formatting: "eunomia-bpf is an open source eBPF dynamic loading runtime and development toolchain combined with Wasm. Its goal is to simplify the development, building, distribution, and running of eBPF programs. You can refer to [here](https://github.com/eunomia-bpf/eunomia-bpf) to download and install the ecc compilation toolchain and ecli runtime. We use eunomia-bpf to compile and run this example.
To compile and run the above code:
```console
$ ecc fentry-link.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
$ sudo ecli run package.json
Running eBPF program...
```
In another window:
```shell
touch test_file
rm test_file
touch test_file2
rm test_file2
```
After running this program, you can view the output of the eBPF program by examining the `/sys/kernel/debug/tracing/trace_pipe` file:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
rm-9290 [004] d..2 4637.798698: bpf_trace_printk: fentry: pid = 9290, filename = test_file
rm-9290 [004] d..2 4637.798843: bpf_trace_printk: fexit: pid = 9290, filename = test_file, ret = 0
rm-9290 [004] d..2 4637.798698: bpf_trace_printk: fentry: pid = 9290, filename = test_file2
rm-9290 [004] d..2 4637.798843: bpf_trace_printk: fexit: pid = 9290, filename = test_file2, ret = 0
```
## Summary
This program is an eBPF program that captures the `do_unlinkat` and `do_unlinkat_exit` functions using fentry and fexit, and uses `bpf_get_current_pid_tgid` and `bpf_printk` functions to obtain the ID, filename, and return value of the process calling do_unlinkat, and print them in the kernel log.
To compile this program, you can use the ecc tool, and to run it, you can use the ecli command, and view the output of the eBPF program by checking the `/sys/kernel/debug/tracing/trace_pipe` file. For more examples and detailed development guide, please refer to the official documentation of eunomia-bpf: [here](https://github.com/eunomia-bpf/eunomia-bpf)
If you want to learn more about eBPF knowledge and practice, you can visit our tutorial code repository [here](https://github.com/eunomia-bpf/bpf-developer-tutorial) for more examples and complete tutorials.".

View File

@@ -0,0 +1,5 @@
# eBPF openssl
TODO: make it work
from [https://github.com/kiosk404/openssl_tracer](https://github.com/kiosk404/openssl_tracer).

View File

@@ -0,0 +1,7 @@
# goroutine trace
TODO: make this work and implement the documentation
Modify from [https://github.com/deepflowio/deepflow](https://github.com/deepflowio/deepflow)
This example is provided as GPL license.

View File

@@ -0,0 +1,5 @@
# func latency
TODO: make it work
from https://github.com/iovisor/bcc/blob/master/libbpf-tools/funclatency.c.

View File

@@ -0,0 +1,124 @@
# eBPF Getting Started Development Tutorial 4: Capturing the System Call Collection of Process Opening Files in eBPF and Using Global Variables to Filter Process PIDs
eBPF (Extended Berkeley Packet Filter) is a kernel execution environment that allows users to run secure and efficient programs in the kernel. It is commonly used for network filtering, performance analysis, security monitoring, and other scenarios. The power of eBPF lies in its ability to capture and modify network packets or system calls at runtime in the kernel, enabling monitoring and adjustment of the operating system's behavior.
This article is the fourth part of the eBPF Getting Started Development Tutorial, mainly focusing on how to capture the system call collection of process opening files and filtering process PIDs using global variables in eBPF.
In Linux system, the interaction between processes and files is achieved through system calls. System calls serve as the interface between user space programs and kernel space programs, allowing user programs to request specific operations from the kernel. In this tutorial, we focus on the sys_openat system call, which is used to open files.
When a process opens a file, it issues a sys_openat system call to the kernel and passes relevant parameters (such as file path, open mode, etc.). The kernel handles this request and returns a file descriptor, which serves as a reference for subsequent file operations. By capturing the sys_openat system call, we can understand when and how a process opens a file.
## Capturing the System Call Collection of Process Opening Files in eBPF
First, we need to write an eBPF program to capture the system call of a process opening a file. The specific implementation is as follows:
```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
/// @description "Process ID to trace"
const volatile int pid_target = 0;
SEC("tracepoint/syscalls/sys_enter_openat")
int tracepoint__syscalls__sys_enter_openat(struct trace_event_raw_sys_enter* ctx)
{
u64 id = bpf_get_current_pid_tgid();
u32 pid = id;
if (pid_target && pid_target != pid)
return false;
// Use bpf_printk to print the process information
bpf_printk("Process ID: %d enter sys openat\n", pid);
return 0;
}
/// "Trace open family syscalls."
char LICENSE[] SEC("license") = "GPL";
```
This eBPF program implements the following:
1. Include header files: <vmlinux.h> contains the definition of kernel data structures, and <bpf/bpf_helpers.h> contains the helper functions required by eBPF programs.
2. Define the global variable `pid_target` for filtering a specified process ID. Setting it to 0 captures sys_openat calls from all processes.
3. Use the `SEC` macro to define an eBPF program associated with the tracepoint "tracepoint/syscalls/sys_enter_openat". This tracepoint is triggered when a process initiates the `sys_openat` system call.
4. Implement the eBPF program `tracepoint__syscalls__sys_enter_openat`, which takes a parameter `ctx` of type `struct trace_event_raw_sys_enter`. This structure contains information about the system call.
5. Use the `bpf_get_current_pid_tgid()` function to retrieve the PID and TGID (Thread Group ID) of the current process. Since we only care about the PID, we assign it to the `u32` variable `pid`.
6. Check if the `pid_target` variable is equal to the current process's PID. If `pid_target` is not 0 and is not equal to the current process's PID, return `false` to skip capturing the `sys_openat` call of that process.
7. Use the `bpf_printk()` function to print the captured process ID and relevant information about the `sys_openat` call. These information can be viewed in user space using BPF tools.
8. Set the program license to "GPL", which is a necessary condition for running eBPF programs.### Instructions
Translate the following Chinese text to English while maintaining the original formatting:
"This eBPF program can be loaded into the kernel and executed using tools like libbpf or eunomia-bpf. It captures the sys_openat system call of the specified process (or all processes) and outputs relevant information in user-space.
eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain combined with Wasm. Its purpose is to simplify the development, building, distribution, and execution of eBPF programs. You can refer to <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compilation toolchain and ecli runtime. We will use eunomia-bpf to compile and run this example.
Compile and run the above code:
```console
$ ecc opensnoop.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
$ sudo ecli run package.json
Running eBPF program...
```
After running this program, you can view the output of the eBPF program by viewing the `/sys/kernel/debug/tracing/trace_pipe` file:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
<...>-3840345 [010] d... 3220701.101179: bpf_trace_printk: Process ID: 3840345 enter sys openat
<...>-3840345 [010] d... 3220702.158000: bpf_trace_printk: Process ID: 3840345 enter sys openat
```
At this point, we are able to capture the sys_openat system call for opening files by processes.
## Filtering Process PID in eBPF using Global Variables
Global variables act as a data sharing mechanism in eBPF programs, allowing data interaction between user space programs and eBPF programs. This is very useful when filtering specific conditions or modifying the behavior of eBPF programs. This design allows user space programs to dynamically control the behavior of eBPF programs at runtime.
In our example, the global variable `pid_target` is used to filter process PIDs. User space programs can set the value of this variable to capture only the `sys_openat` system calls related to the specified PID in the eBPF program.
The principle of using global variables is that they are defined and stored in the data section of eBPF programs. When the eBPF program is loaded into the kernel and executed, these global variables are retained in the kernel and can be accessed through BPF system calls. User space programs can use certain features of BPF system calls, such as `bpf_obj_get_info_by_fd` and `bpf_obj_get_info`, to obtain information about the eBPF object, including the position and value of global variables.
You can view the help information for opensnoop by executing the command `ecli -h`:
```console
$ ecli package.json -h
Usage: opensnoop_bpf [--help] [--version] [--verbose] [--pid_target VAR]
Trace open family syscalls.
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
--verbose prints libbpf debug information
--pid_target Process ID to trace
Built with eunomia-bpf framework.
See https://github.com/eunomia-bpf/eunomia-bpf for more information.
```
You can specify the PID of the process to capture by using the `--pid_target` option, for example:
```console
$ sudo ./ecli run package.json --pid_target 618
Running eBPF program...
```
After running this program, you can view the output of the eBPF program by viewing the `/sys/kernel/debug/tracing/trace_pipe` file:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe".\-3840345 [010] d... 3220701.101179: bpf_trace_printk: Process ID: 618 enter sys openat
\-3840345 [010] d... 3220702.158000: bpf_trace_printk: Process ID: 618 enter sys openat
```
## Summary
This article introduces how to use eBPF programs to capture the system calls for process file opening. In an eBPF program, we can capture the system calls for process file opening by defining functions `tracepoint__syscalls__sys_enter_open` and `tracepoint__syscalls__sys_enter_openat` and attaching them to the tracepoints `sys_enter_open` and `sys_enter_openat` using the `SEC` macro. We can use the `bpf_get_current_pid_tgid` function to get the process ID that calls the open or openat system call, and print it out in the kernel log using the `bpf_printk` function. In an eBPF program, we can also filter the output by defining a global variable `pid_target` to specify the pid of the process to be captured, only outputting the information of the specified process.
By learning this tutorial, you should have a deeper understanding of how to capture and filter system calls for specific processes in eBPF. This method has widespread applications in system monitoring, performance analysis, and security auditing.
For more examples and detailed development guides, please refer to the official documentation of eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>
If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and a complete tutorial.

View File

@@ -0,0 +1,124 @@
# eBPF Beginner's Development Tutorial 5: Capturing readline Function Calls in eBPF
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel that allows developers to dynamically load, update, and run user-defined code at runtime.
This article is the fifth part of the eBPF beginner's development tutorial, which mainly introduces how to capture readline function calls in bash using uprobe.
## What is uprobe
uprobe is a user-space probe that allows dynamic instrumentation in user-space programs. The probe locations include function entry, specific offsets, and function returns. When we define an uprobe, the kernel creates fast breakpoint instructions (int3 instructions on x86 machines) on the attached instructions. When the program executes this instruction, the kernel triggers an event, causing the program to enter kernel mode and call the probe function through a callback function. After executing the probe function, the program returns to user mode to continue executing subsequent instructions.
uprobe is file-based. When a function in a binary file is traced, all processes that use the file are instrumented, including those that have not yet been started, allowing system calls to be tracked system-wide.
uprobe is suitable for parsing some traffic in user mode that cannot be resolved by kernel mode probes, such as HTTP/2 traffic (where the header is encoded and cannot be decoded by the kernel) and HTTPS traffic (which is encrypted and cannot be decrypted by the kernel).
## Capturing readline Function Calls in bash using uprobe
uprobe is an eBPF probe used to capture user-space function calls, allowing us to capture system functions called by user-space programs.
For example, we can use uprobe to capture readline function calls in bash and get the command line input from the user. The example code is as follows:
```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#define TASK_COMM_LEN 16
#define MAX_LINE_SIZE 80
SEC("uretprobe//bin/bash:readline")
int BPF_KRETPROBE(printret, const void *ret)
{
char str[MAX_LINE_SIZE];
char comm[TASK_COMM_LEN];
u32 pid;
if (!ret)
return 0;
bpf_get_current_comm(&comm, sizeof(comm));
pid = bpf_get_current_pid_tgid() >> 32;
bpf_probe_read_user_str(str, sizeof(str), ret);
bpf_printk("PID %d (%s) read: %s ", pid, comm, str);
return 0;
};
char LICENSE[] SEC("license") = "GPL";
```
The purpose of this code is to execute the specified BPF_PROBE function (printret function) when the readline function in bash returns.
In the printret function, we first obtain the process name and process ID of the process calling the readline function. Then, we use the bpf_probe_read_user_str function to read the user input command line string. Lastly, we use the bpf_printk function to print the process ID, process name, and input command line string.
In addition, we also need to define the uprobe probe using the SEC macro and define the probe function using the BPF_KRETPROBE macro.In the `SEC` macro in the code above, we need to specify the type of the uprobe, the path of the binary file to capture, and the name of the function to capture. For example, the definition of the `SEC` macro in the code above is as follows:
```c
SEC("uprobe//bin/bash:readline")
```
This indicates that we want to capture the `readline` function in the `/bin/bash` binary file.
Next, we need to use the `BPF_KRETPROBE` macro to define the probe function. For example:
```c
BPF_KRETPROBE(printret, const void *ret)
```
Here, `printret` is the name of the probe function, and `const void *ret` is the parameter of the probe function, which represents the return value of the captured function.
Then, we use the `bpf_get_current_comm` function to get the name of the current task and store it in the `comm` array.
```c
bpf_get_current_comm(&comm, sizeof(comm));
```
We use the `bpf_get_current_pid_tgid` function to get the PID of the current process and store it in the `pid` variable.
```c
pid = bpf_get_current_pid_tgid() >> 32;
```
We use the `bpf_probe_read_user_str` function to read the return value of the `readline` function from the user space and store it in the `str` array.
```c
bpf_probe_read_user_str(str, sizeof(str), ret);
```
Finally, we use the `bpf_printk` function to output the PID, task name, and user input string.
```c
bpf_printk("PID %d (%s) read: %s ", pid, comm, str);
```
eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain combined with Wasm. Its purpose is to simplify the development, build, distribution, and running of eBPF programs. You can refer to <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and ecli runtime. We use eunomia-bpf to compile and run this example.
Compile and run the above code:
```console
$ ecc bashreadline.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
$ sudo ecli run package.json
Running eBPF program...
```
After running this program, you can view the output of the eBPF program by checking the file `/sys/kernel/debug/tracing/trace_pipe`:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
bash-32969 [000] d..31 64001.375748: bpf_trace_printk: PID 32969 (bash) read: fff
bash-32969 [000] d..31 64002.056951: bpf_trace_printk: PID 32969 (bash) read: fff
```
You can see that we have successfully captured the `readline` function call of `bash` and obtained the command line entered by the user in `bash`.
## Summary
In the above code, we used the `SEC` macro to define an uprobe probe, which specifies the user space program (`bin/bash`) to be captured and the function (`readline`) to be captured. In addition, we used the `BPF_KRETPROBE` macro to define a callback function (`printret`) for handling the return value of the `readline` function. This function can retrieve the return value of the `readline` function and print it to the kernel log. In this way, we can use eBPF to capture the `readline` function call of `bash` and obtain the command line entered by the user in `bash`.
For more examples and detailed development guides, please refer to the official documentation of eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>
If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository <https://github.com/eunomia-bpf/bpf-developer-tutorial> to get more examples and complete tutorials.

141
src/6-sigsnoop/README_en.md Executable file
View File

@@ -0,0 +1,141 @@
# eBPF Beginner's Development Practice Tutorial 6: Capturing a Collection of System Calls that Send Signals to Processes, Using a Hash Map to Store State
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel that allows developers to dynamically load, update, and run user-defined code at runtime.
This article is the sixth part of the eBPF beginner's development practice tutorial, which mainly introduces how to implement an eBPF tool that captures a collection of system calls that send signals to processes and uses a hash map to store state.
## sigsnoop
The example code is as follows:
```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#define MAX_ENTRIES 10240
#define TASK_COMM_LEN 16
struct event {
unsigned int pid;
unsigned int tpid;
int sig;
int ret;
char comm[TASK_COMM_LEN];
};
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, __u32);
__type(value, struct event);
} values SEC(".maps");
static int probe_entry(pid_t tpid, int sig)
{
struct event event = {};
__u64 pid_tgid;
__u32 tid;
pid_tgid = bpf_get_current_pid_tgid();
tid = (__u32)pid_tgid;
event.pid = pid_tgid >> 32;
event.tpid = tpid;
event.sig = sig;
bpf_get_current_comm(event.comm, sizeof(event.comm));
bpf_map_update_elem(&values, &tid, &event, BPF_ANY);
return 0;
}
static int probe_exit(void *ctx, int ret)
{
__u64 pid_tgid = bpf_get_current_pid_tgid();
__u32 tid = (__u32)pid_tgid;
struct event *eventp;
eventp = bpf_map_lookup_elem(&values, &tid);
if (!eventp)
return 0;
eventp->ret = ret;
bpf_printk("PID %d (%s) sent signal %d ",
eventp->pid, eventp->comm, eventp->sig);
bpf_printk("to PID %d, ret = %d",
eventp->tpid, ret);
cleanup:
bpf_map_delete_elem(&values, &tid);
return 0;
}
SEC("tracepoint/syscalls/sys_enter_kill")
int kill_entry(struct trace_event_raw_sys_enter *ctx)
{
pid_t tpid = (pid_t)ctx->args[0];
int sig = (int)ctx->args[1];
return probe_entry(tpid, sig);
}
SEC("tracepoint/syscalls/sys_exit_kill")
int kill_exit(struct trace_event_raw_sys_exit *ctx)
{
return probe_exit(ctx, ctx->ret);
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
```
The above code defines an eBPF program for capturing system calls that send signals to processes, including kill, tkill, and tgkill. It captures the enter and exit events of system calls by using tracepoints, and executes specified probe functions such as `probe_entry` and `probe_exit` when these events occur.Instructions: Translate the following Chinese text to English
while maintaining the original formatting: "In the probe function, we use the bpf_map to store the captured event information, including the process ID of the sending signal, the process ID of the receiving signal, the signal value, and the return value of the system call. When the system call exits, we retrieve the event information stored in the bpf_map and use bpf_printk to print the process ID, process name, sent signal, and return value of the system call.
Finally, we also need to use the SEC macro to define the probe and specify the name of the system call to be captured and the probe function to be executed.
eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that combines with Wasm. Its purpose is to simplify the development, building, distribution, and running of eBPF programs. You can refer to <https://github.com/eunomia-bpf/eunomia-bpf> for downloading and installing the ecc compilation toolchain and ecli runtime. We use eunomia-bpf to compile and run this example.
Compile and run the above code:
```shell
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
or
```console
$ ecc sigsnoop.bpf.c
Compiling bpf object...
Generating export types...
Packing ebpf object and config into package.json...
$ sudo ecli run package.json
Running eBPF program...
```
After running this program, you can view the output of the eBPF program by checking the /sys/kernel/debug/tracing/trace_pipe file:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
systemd-journal-363 [000] d...1 672.563868: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
systemd-journal-363 [000] d...1 672.563869: bpf_trace_printk: to PID 1400, ret = 0
systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: PID 363 (systemd-journal) sent signal 0
systemd-journal-363 [000] d...1 672.563870: bpf_trace_printk: to PID 1527, ret = -3
```
## Summary
This article mainly introduces how to implement an eBPF tool to capture the collection of system calls sent by processes using signals and save the state using a hash map. Using a hash map requires defining a struct:
```c
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, __u32);
__type(value, struct event);
} values SEC(".maps");
```
And using corresponding APIs for access, such as bpf_map_lookup_elem, bpf_map_update_elem, bpf_map_delete_elem, etc.
For more examples and detailed development guides, please refer to the official documentation of eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>
If you want to learn more about eBPF knowledge and practice, you can visit our tutorial code repository <https://github.com/eunomia-bpf/bpf-developer-tutorial> to get more examples and complete tutorials."

View File

@@ -0,0 +1,124 @@
# eBPF Beginner's Practical Tutorial Seven: Capturing Process Execution/Exit Time, Printing Output to User Space via perf event array
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel that allows developers to dynamically load, update, and run user-defined code at runtime.
This article is the seventh part of the eBPF beginner's development tutorial and mainly introduces how to capture process execution events in the Linux kernel and print output to the user command line via a perf event array. This eliminates the need to view the output of eBPF programs by checking the `/sys/kernel/debug/tracing/trace_pipe` file. After sending information to user space via the perf event array, complex data processing and analysis can be performed.
## perf buffer
eBPF provides two circular buffers for transferring information from eBPF programs to user space controllers. The first one is the perf circular buffer, which has existed since at least kernel v4.15. The second one is the BPF circular buffer introduced later. This article only considers the perf circular buffer.
## execsnoop
To print output to the user command line via the perf event array, a header file and a C source file need to be written. The example code is as follows:
Header file: execsnoop.h
```c
#ifndef __EXECSNOOP_H
#define __EXECSNOOP_H
#define TASK_COMM_LEN 16
struct event {
int pid;
int ppid;
int uid;
int retval;
bool is_exit;
char comm[TASK_COMM_LEN];
};
#endif /* __EXECSNOOP_H */
```
Source file: execsnoop.bpf.c
```c
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include "execsnoop.h"
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
SEC("tracepoint/syscalls/sys_enter_execve")
int tracepoint_syscalls_sys_enter_execve(struct trace_event_raw_sys_enter* ctx)
{
u64 id;
pid_t pid, tgid;
struct event event={0};
struct task_struct *task;
uid_t uid = (u32)bpf_get_current_uid_gid();
id = bpf_get_current_pid_tgid();
tgid = id >> 32;
event.pid = tgid;
event.uid = uid;
task = (struct task_struct*)bpf_get_current_task();
event.ppid = BPF_CORE_READ(task, real_parent, tgid);
bpf_get_current_comm(&event.comm, sizeof(event.comm));
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
return 0;
}
char LICENSE[] SEC("license") = "GPL";
```
This code defines an eBPF program for capturing the entry of the `execve` system call.
In the entry program, we first obtain the process ID and user ID of the current process, then use the `bpf_get_current_task` function to obtain the `task_struct` structure of the current process, and use the `bpf_get_current_comm` function to read the process name. Finally, we use the `bpf_perf_event_output` function to output the process execution event to the perf buffer.
With this code, we can capture process execution events in the Linux kernel and analyze the process execution conditions.Instructions: Translate the following Chinese text to English while maintaining the original formatting:
"eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that combines with Wasm. Its goal is to simplify the development, building, distribution, and execution of eBPF programs. You can refer to the following link to download and install the ecc compilation toolchain and ecli runtime: [https://github.com/eunomia-bpf/eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf). We use eunomia-bpf to compile and execute this example.
Compile using a container:
```shell
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
Or compile using ecc:
```shell
ecc execsnoop.bpf.c execsnoop.h
```
Run:
```console
$ sudo ./ecli run package.json
TIME PID PPID UID COMM
21:28:30 40747 3517 1000 node
21:28:30 40748 40747 1000 sh
21:28:30 40749 3517 1000 node
21:28:30 40750 40749 1000 sh
21:28:30 40751 3517 1000 node
21:28:30 40752 40751 1000 sh
21:28:30 40753 40752 1000 cpuUsage.sh
```
## Summary
This article introduces how to capture events of processes running in the Linux kernel and print output to the user command-line using the perf event array. After sending information to the user space via the perf event array, complex data processing and analysis can be performed. In the corresponding kernel code of libbpf, a structure and corresponding header file can be defined as follows:
```c
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
```
This allows sending information directly to the user space.
For more examples and detailed development guide, please refer to the official documentation of eunomia-bpf: [https://github.com/eunomia-bpf/eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf).
If you want to learn more about eBPF knowledge and practice, you can visit our tutorial code repository [https://github.com/eunomia-bpf/bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial) to get more examples and complete tutorials."

View File

@@ -0,0 +1,159 @@
# eBPF Introductory Development Tutorial 8: Monitoring Process Exit Events with eBPF, Using Ring Buffer to Print Output to User Space
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at runtime in the kernel.
This article is the eighth part of the eBPF Introductory Development Tutorial, focusing on monitoring process exit events with eBPF.
## Ring Buffer
There is now a new BPF data structure available called the eBPF ring buffer. It solves the memory efficiency and event reordering issues of the BPF perf buffer, which is currently the de facto standard for sending data from the kernel to user space. It provides compatibility with perf buffer for easy migration while also introducing new reserved/commit APIs for improved usability. Additionally, synthetic and real-world benchmark tests have shown that in nearly all cases, the eBPF ring buffer should be the default choice for sending data from BPF programs to user space.
### eBPF Ring Buffer vs eBPF Perf Buffer
Whenever a BPF program needs to send collected data to user space for post-processing and logging, it typically uses the BPF perf buffer (perfbuf). Perfbuf is a collection of per-CPU circular buffers that allow efficient data exchange between the kernel and user space. It works well in practice, but it has two main drawbacks that have proven to be inconvenient: inefficient memory usage and event reordering.
To address these issues, starting from Linux 5.8, BPF introduces a new BPF data structure called BPF ring buffer. It is a multiple producer, single consumer (MPSC) queue that can be safely shared across multiple CPUs.
The BPF ring buffer supports familiar features from BPF perf buffer:
- Variable-length data records.
- Efficient reading of data from user space through memory-mapped regions without additional memory copies and/or entering kernel system calls.
- Support for epoll notifications and busy loop operations with absolute minimal latency.
At the same time, the BPF ring buffer solves the following problems of the BPF perf buffer:
- Memory overhead.
- Data ordering.
- Unnecessary work and additional data copying.
## exitsnoop
This article is the eighth part of the eBPF Introductory Development Tutorial, focusing on monitoring process exit events with eBPF and using the ring buffer to print output to user space.
The steps for printing output to user space using the ring buffer are similar to perf buffer. First, a header file needs to be defined:
Header File: exitsnoop.h
```c
#ifndef __BOOTSTRAP_H
#define __BOOTSTRAP_H
#define TASK_COMM_LEN 16
#define MAX_FILENAME_LEN 127
struct event {
int pid;
int ppid;
unsigned exit_code;
unsigned long long duration_ns;
char comm[TASK_COMM_LEN];
};
#endif /* __BOOTSTRAP_H */
```
Source File: exitsnoop.bpf.c
```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "exitsnoop.h"
char LICENSE[] SEC("license") = "Dual BSD/GPL";
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024);
} rb SEC(".maps");
SEC("tp/sched/sched_process_exit")
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
{
struct task_struct *task;
struct event *e;
pid_t pid, tid;
u64 id, ts, *start_ts, duration_ns = 0;
/* get PID and TID of exiting thread/process */
id = bpf_get_current_pid_tgid();
pid = id >> 32;
format: rawtid = (u32)id;
/* ignore thread exits */
if (pid != tid)
return 0;
/* reserve sample from BPF ringbuf */
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e)
return 0;
/* fill out the sample with data */
task = (struct task_struct *)bpf_get_current_task();
e->duration_ns = duration_ns;
e->pid = pid;
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
e->exit_code = (BPF_CORE_READ(task, exit_code) >> 8) & 0xff;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
/* send data to user-space for post-processing */
bpf_ringbuf_submit(e, 0);
return 0;
}
```
This code demonstrates how to monitor process exit events using exitsnoop and print output to user space using a ring buffer:
1. First, we include the required headers and exitsnoop.h.
2. We define a global variable named "LICENSE" with the content "Dual BSD/GPL", which is the license requirement for eBPF programs.
3. We define a mapping named rb of type BPF_MAP_TYPE_RINGBUF, which will be used to transfer data from kernel space to user space. We specify max_entries as 256 * 1024, representing the maximum capacity of the ring buffer.
4. We define an eBPF program named handle_exit, which will be executed when a process exit event is triggered. It takes a trace_event_raw_sched_process_template struct pointer named ctx as the parameter.
5. We use the bpf_get_current_pid_tgid() function to obtain the PID and TID of the current task. For the main thread, the PID and TID are the same; for child threads, they are different. Since we only care about the exit of the process (main thread), we return 0 if the PID and TID are different, ignoring the exit events of child threads.
6. We use the bpf_ringbuf_reserve function to reserve space for the event struct e in the ring buffer. If the reservation fails, we return 0.
7. We use the bpf_get_current_task() function to obtain a task_struct structure pointer for the current task.
8. We fill in the process-related information into the reserved event struct e, including the duration of the process, PID, PPID, exit code, and process name.
9. Finally, we use the bpf_ringbuf_submit function to submit the filled event struct e to the ring buffer, for further processing and output in user space.
This example demonstrates how to capture process exit events using exitsnoop and a ring buffer in an eBPF program, and transfer relevant information to user space. This is useful for analyzing process exit reasons and monitoring system behavior.
## Compile and Run
eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that combines with Wasm. Its purpose is to simplify the development, build, distribution, and execution of eBPF programs. You can refer to <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and ecli runtime. We will use eunomia-bpf to compile and run this example.
Compile:
```shell
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
Or
```console
$ ecc exitsnoop.bpf.c exitsnoop.h
Compiling bpf object...
Generating export types...
Packing ebpf object and config into package.json...
```
Run:
```console
$ sudo ./ecli run package.json
TIME PID PPID EXIT_CODE DURATION_NS COMM".
format: Return only the translated content, not including the original text.21:40:09 42050 42049 0 0 which
21:40:09 42049 3517 0 0 sh
21:40:09 42052 42051 0 0 ps
21:40:09 42051 3517 0 0 sh
21:40:09 42055 42054 0 0 sed
21:40:09 42056 42054 0 0 cat
21:40:09 42057 42054 0 0 cat
21:40:09 42058 42054 0 0 cat
21:40:09 42059 42054 0 0 cat
## Summary
This article introduces how to develop a simple BPF program using eunomia-bpf that can monitor process exit events in a Linux system and send the captured events to user space programs via a ring buffer. In this article, we compiled and ran this example using eunomia-bpf.
To better understand and practice eBPF programming, we recommend reading the official documentation of eunomia-bpf at: <https://github.com/eunomia-bpf/eunomia-bpf>. Additionally, we provide a complete tutorial and source code for you to view and learn from at <https://github.com/eunomia-bpf/bpf-developer-tutorial>. We hope this tutorial helps you get started with eBPF development and provides useful references for your further learning and practice.

374
src/9-runqlat/README_en.md Executable file
View File

@@ -0,0 +1,374 @@
# eBPF Beginner's Development Tutorial 9: Capturing Process Scheduling Latency and Recording as Histogram
eBPF (Extended Berkeley Packet Filter) is a powerful network and performance analysis tool on the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at runtime.
runqlat is an eBPF tool used for analyzing the scheduling performance of the Linux system. Specifically, runqlat is used to measure the time a task waits in the run queue before being scheduled to run on a CPU. This information is very useful for identifying performance bottlenecks and improving the overall efficiency of the Linux kernel scheduling algorithm.
## runqlat Principle
This tutorial is the ninth part of the eBPF beginner's development series, with the topic "Capturing Process Scheduling Latency". Here, we will introduce a program called runqlat, which records process scheduling latency as a histogram.
The Linux operating system uses processes to execute all system and user tasks. These processes can be blocked, killed, running, or waiting to run. The number of processes in the latter two states determines the length of the CPU run queue.
Processes can have several possible states, such as:
- Runnable or running
- Interruptible sleep
- Uninterruptible sleep
- Stopped
- Zombie process
Processes waiting for resources or other function signals are in the interruptible or uninterruptible sleep state: the process is put to sleep until the resource it needs becomes available. Then, depending on the type of sleep, the process can transition to the runnable state or remain asleep.
Even when a process has all the resources it needs, it does not start running immediately. It transitions to the runnable state and is queued together with other processes in the same state. The CPU can execute these processes in the next few seconds or milliseconds. The scheduler arranges the processes for the CPU and determines the next process to run.
Depending on the hardware configuration of the system, the length of this runnable queue (known as the CPU run queue) can be short or long. A short run queue length indicates that the CPU is not being fully utilized. On the other hand, if the run queue is long, it may mean that the CPU is not powerful enough to handle all the processes or that the number of CPU cores is insufficient. In an ideal CPU utilization, the length of the run queue will be equal to the number of cores in the system.
Process scheduling latency, also known as "run queue latency," is the time it takes for a thread to go from becoming runnable (e.g., receiving an interrupt that prompts it to do more work) to actually running on the CPU. In the case of CPU saturation, you can imagine that the thread has to wait for its turn. But in other peculiar scenarios, this can also happen, and in some cases, it can be reduced by tuning to improve the overall system performance.
We will illustrate how to use the runqlat tool through an example. This is a heavily loaded system:
```shell
# runqlat
Tracing run queue latency... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 233 |*********** |
2 -> 3 : 742 |************************************ |
4 -> 7 : 203 |********** |
8 -> 15 : 173 |******** |
16 -> 31 : 24 |* |
32 -> 63 : 0 | |
64 -> 127 : 30 |* |
128 -> 255 : 6 | |
256 -> 511 : 3 | |
512 -> 1023 : 5 | |
1024 -> 2047 : 27 |* |".
format: Return only the translated content, not including the original text.## runqlat Code Implementation
### runqlat.bpf.c
First, we need to write a source code file `runqlat.bpf.c`:
```c
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2020 Wenbo Zhang
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#include "runqlat.h"
#include "bits.bpf.h"
#include "maps.bpf.h"
#include "core_fixes.bpf.h"
#define MAX_ENTRIES 10240
#define TASK_RUNNING 0
const volatile bool filter_cg = false;
const volatile bool targ_per_process = false;
const volatile bool targ_per_thread = false;
const volatile bool targ_per_pidns = false;
const volatile bool targ_ms = false;
const volatile pid_t targ_tgid = 0;
struct {
__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
__type(key, u32);
__type(value, u32);
__uint(max_entries, 1);
} cgroup_map SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, u32);
__type(value, u64);
} start SEC(".maps");
static struct hist zero;
/// @sample {"interval": 1000, "type" : "log2_hist"}
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, u32);
__type(value, struct hist);
} hists SEC(".maps");
static int trace_enqueue(u32 tgid, u32 pid)
{
``````cpp
u64 ts;
if (!pid)
return 0;
if (targ_tgid && targ_tgid != tgid)
return 0;
ts = bpf_ktime_get_ns();
bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
return 0;
}
static unsigned int pid_namespace(struct task_struct *task)
{
struct pid *pid;
unsigned int level;
struct upid upid;
unsigned int inum;
/* get the pid namespace by following task_active_pid_ns(),
* pid->numbers[pid->level].ns
*/
pid = BPF_CORE_READ(task, thread_pid);
level = BPF_CORE_READ(pid, level);
bpf_core_read(&upid, sizeof(upid), &pid->numbers[level]);
inum = BPF_CORE_READ(upid.ns, ns.inum);
return inum;
}
static int handle_switch(bool preempt, struct task_struct *prev, struct task_struct *next)
{
struct hist *histp;
u64 *tsp, slot;
u32 pid, hkey;
s64 delta;
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
if (get_task_state(prev) == TASK_RUNNING)
trace_enqueue(BPF_CORE_READ(prev, tgid), BPF_CORE_READ(prev, pid));
pid = BPF_CORE_READ(next, pid);
tsp = bpf_map_lookup_elem(&start, &pid);
if (!tsp)
return 0;
delta = bpf_ktime_get_ns() - *tsp;
if (delta < 0)
goto cleanup;
if (targ_per_process)
hkey = BPF_CORE_READ(next, tgid);
else if (targ_per_thread)
hkey = pid;
else if (targ_per_pidns)
hkey = pid_namespace(next);
else
hkey = -1;
histp = bpf_map_lookup_or_try_init(&hists, &hkey, &zero);
if (!histp)
goto cleanup;
if (!histp->comm[0])
bpf_probe_read_kernel_str(&histp->comm, sizeof(histp->comm),
next->comm);
if (targ_ms)
delta /= 1000000U;
else
delta /= 1000U;
slot = log2l(delta);
if (slot >= MAX_SLOTS)
slot = MAX_SLOTS - 1;
__sync_fetch_and_add(&histp->slots[slot], 1);
cleanup:
bpf_map_delete_elem(&start, &pid);
return 0;
}
SEC("raw_tp/sched_wakeup")
int BPF_PROG(handle_sched_wakeup, struct task_struct *p)
{
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
}
SEC("raw_tp/sched_wakeup_new")
```# BPF_PROG(handle_sched_wakeup_new, struct task_struct *p) Definition
The code defines a function named `BPF_PROG(handle_sched_wakeup_new, struct task_struct *p)`. It takes a handle and a `task_struct` pointer as parameters. The function checks if a filter condition `filter_cg` is true and whether the current task is under the cgroup using the `bpf_current_task_under_cgroup` function with the `cgroup_map` parameter. If the filter condition is true and the task is not under the cgroup, the function returns 0. Otherwise, it calls the `trace_enqueue` function with the `BPF_CORE_READ(p, tgid)` and `BPF_CORE_READ(p, pid)` values and returns the result.
## SEC("raw_tp/sched_switch") Definition
The code defines another function named `BPF_PROG(handle_sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)`. It is associated with the `raw_tp/sched_switch` security section. It takes a boolean parameter `preempt` and two `task_struct` pointers named `prev` and `next`. The function calls the `handle_switch` function with the `preempt`, `prev`, and `next` parameters and returns the result.
## LICENSE Declaration
The code declares a character array named `LICENSE` and assigns it the value "GPL". It is associated with the `license` security section.
## Constants and Global Variables
The code defines several constants and volatile global variables used for filtering corresponding tracing targets. These variables include:
- `MAX_ENTRIES`: The maximum number of map entries.
- `TASK_RUNNING`: The task status value.
- `filter_cg`, `targ_per_process`, `targ_per_thread`, `targ_per_pidns`, `targ_ms`, `targ_tgid`: Boolean variables for filtering and target options. These options can be set by user-space programs to customize the behavior of the eBPF program.
## eBPF Maps
The code defines several eBPF maps including:
- `cgroup_map`: A cgroup array map used for filtering cgroups.
- `start`: A hash map used to store timestamps when processes are enqueued.
- `hists`: A hash map used to store histogram data for recording process scheduling delays.
## Helper Functions
The code includes two helper functions:
- `trace_enqueue`: This function is used to record the timestamp when a process is enqueued. It takes the `tgid` and `pid` values as parameters. If the `pid` value is 0 or the `targ_tgid` value is not 0 and not equal to `tgid`, the function returns 0. Otherwise, it retrieves the current timestamp using `bpf_ktime_get_ns` and updates the `start` map with the `pid` key and the timestamp value.
- `pid_namespace`: This function is used to get the PID namespace of a process. It takes a `task_struct` pointer as a parameter and returns the PID namespace of the process. The function retrieves the PID namespace by following `task_active_pid_ns()` and `pid->numbers[pid->level].ns`.
Please note that the translation of function names and variable names may require further context.```
level = BPF_CORE_READ(pid, level);
bpf_core_read(&upid, sizeof(upid), &pid->numbers[level]);
inum = BPF_CORE_READ(upid.ns, ns.inum);
return inum;
}
```
The `handle_switch` function is the core part, used to handle scheduling switch events, calculate process scheduling latency, and update histogram data:
```c
static int handle_switch(bool preempt, struct task_struct *prev, struct task_struct *next)
{
...
}
```
Firstly, the function determines whether to filter cgroup based on the setting of `filter_cg`. Then, if the previous process state is `TASK_RUNNING`, the `trace_enqueue` function is called to record the enqueue time of the process. Then, the function looks up the enqueue timestamp of the next process. If it is not found, it returns directly. The scheduling latency (delta) is calculated, and the key for the histogram map (hkey) is determined based on different options (targ_per_process, targ_per_thread, targ_per_pidns). Then, the histogram map is looked up or initialized, and the histogram data is updated. Finally, the enqueue timestamp record of the process is deleted.
Next is the entry point of the eBPF program. The program uses three entry points to capture different scheduling events:
- `handle_sched_wakeup`: Used to handle the `sched_wakeup` event triggered when a process is woken up from sleep state.
- `handle_sched_wakeup_new`: Used to handle the `sched_wakeup_new` event triggered when a newly created process is woken up.
- `handle_sched_switch`: Used to handle the `sched_switch` event triggered when the scheduler selects a new process to run.
These entry points handle different scheduling events, but all call the `handle_switch` function to calculate the scheduling latency of the process and update the histogram data.
Finally, the program includes a license declaration:
```c
char LICENSE[] SEC("license") = "GPL";
```
This declaration specifies the license type of the eBPF program, which is "GPL" in this case. This is required for many kernel features as they require eBPF programs to follow the GPL license.
### runqlat.h
Next, we need to define a header file `runqlat.h` for handling events reported from kernel mode to user mode:
```c
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __RUNQLAT_H
#define __RUNQLAT_H
#define TASK_COMM_LEN 16
#define MAX_SLOTS 26
struct hist {
__u32 slots[MAX_SLOTS];
char comm[TASK_COMM_LEN];
};
#endif /* __RUNQLAT_H */
```
## Compilation and Execution
`eunomia-bpf` is an open-source eBPF dynamic loading runtime and development toolkit combined with Wasm.
Its purpose is to simplify the development, build, distribution, and execution of eBPF programs.
You can refer to <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the `ecc` compilation toolkit and `ecli` runtime.
We will use `eunomia-bpf` to compile and run this example.
Compile:
```shell
docker run -it -v `pwd`/:/src/ ghcr.io/eunomia-bpf/ecc-`uname -m`:latest
```
or
```console
$ ecc runqlat.bpf.c runqlat.h
Compiling bpf object...
Generating export types...
Packing ebpf object and config into package.json...
```
Run:
```console
$ sudo ecli run examples/bpftools/runqlat/package.json -h
Usage: runqlat_bpf [--help] [--version] [--verbose] [--filter_cg] [--targ_per_process] [--targ_per_thread] [--targ_per_pidns] [--targ_ms] [--targ_tgid VAR]
A simple eBPF program
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
--verbose prints libbpf debug information
--filter_cg set value of bool variable filter_cg
--targ_per_process set value of bool variable targ_per_process
--targ_per_thread set value of bool variable targ_per_thread
--targ_per_pidns set value of bool variable targ_per_pidns
--targ_ms set value of bool variable targ_ms
--targ_tgid set value of pid_t variable targ_tgid
$ sudo ecli run examples/bpftools/runqlat/package.json
key = 4294967295
comm = rcu_preempt
(unit) : count distribution
0 -> 1 : 9 |**** |
2 -> 3 : 6 |** |
4 -> 7 : 12 |***** |
8 -> 15 : 28 |************* |
16 -> 31 : 40 |******************* |
32 -> 63 : 83 |****************************************|
64 -> 127 : 57 |*************************** |
128 -> 255 : 19 |********* |
256 -> 511 : 11 |***** |
512 -> 1023 : 2 | |
1024 -> 2047 : 2 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 1 | |
$ sudo ecli run examples/bpftools/runqlat/package.json --targ_per_process
key = 3189
comm = cpptools
(unit) : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |*** |
16 -> 31 : 2 |******* |
32 -> 63 : 11 |****************************************|
64 -> 127 : 8 |***************************** |
128 -> 255 : 3 |********** |
```
Complete source code can be found at: <https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/9-runqlat>
References:
- <https://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html>
- <https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlat.c>
## Summary
runqlat is a Linux kernel BPF program that summarizes scheduler run queue latency using a bar chart to show the length of time tasks wait to run on a CPU. To compile this program, you can use the `ecc` tool and to run it, you can use the `ecli` command.
runqlat is a tool for monitoring process scheduling latency in the Linux kernel. It can help you understand the time processes spend waiting to run in the kernel and optimize process scheduling based on this information to improve system performance. The original source code can be found in libbpf-tools: <https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlat.bpf.c>
For more examples and detailed development guide, please refer to the official documentation of eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>
If you want to learn more about eBPF knowledge and practices, you can visit our tutorial code repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> for more examples and complete tutorials.

43
src/SUMMARY_en.md Normal file
View File

@@ -0,0 +1,43 @@
# Summary
# eBPF Practice Tutorial: Based on libbpf and CO-RE
- [Introduction to basic concepts of eBPF and common development tools](0-introduce/README.md)
- [eBPF Hello World, basic framework and development process](1-helloworld/README.md)
- [Monitoring and capturing unlink system calls using kprobe](2-kprobe-unlink/README.md)
- [Monitoring and capturing unlink system calls using fentry](3-fentry-unlink/README.md)
- [Collection of system calls for capturing processes opening files, filtering process pid using global variables](4-opensnoop/README.md)
- [Capturing readline function calls of bash using uprobe](5-uprobe-bashreadline/README.md)
- [Collection of system calls for capturing process signal sending, saving state using hash map](6-sigsnoop/README.md)
- [Capturing process execution/exit time, printing output to user space using perf event array](7-execsnoop/README.md)
- [Monitoring process exit events using exitsnoop, printing output to user space using ring buffer](8-exitsnoop/README.md)
- [A Linux kernel BPF program that summarizes scheduler run queue latency using histograms, displaying time length tasks wait to run on the CPU](9-runqlat/README.md)
- [Capturing interrupt events using hardirqs or softirqs](10-hardirqs/README.md)
- [Developing user space programs and tracing exec() and exit() system calls using bootstrap](11-bootstrap/README.md)
- [Developing programs to measure TCP connection latency using libbpf-bootstrap](13-tcpconnlat/README.md)
- [Recording TCP connection state and TCP RTT using libbpf-bootstrap](14-tcpstates/README.md)
- [Capturing user space Java GC event duration using USDT](15-javagc/README.md)
- [Writing eBPF program Memleak to monitor memory leaks](16-memleak/README.md)
- [Writing eBPF program Biopattern to measure random/sequential disk I/O](17-biopattern/README.md)
- [More reference materials](18-further-reading/README.md)
- [Performing security detection and defense using LSM](19-lsm-connect/README.md)
- [Performing traffic control using eBPF and tc](20-tc/README.md)
# Advanced Features and Advanced Topics of eBPF
- [Using eBPF programs on Android](22-android/README.md)
- [Tracing HTTP requests or other layer 7 protocols using eBPF](23-http/README.md)
- [Accelerating network request forwarding using sockops](29-sockops/README.md)
- [Hiding process or file information using eBPF](24-hide/README.md)
- [Terminating processes by sending signals using bpf_send_signal](25-signal/README.md)
- [Adding sudo users using eBPF](26-sudo/README.md)
- [Replacing text read or written by any program using eBPF](27-replace/README.md)
- [BPF lifecycle: Running eBPF programs continuously after the user space application exits using Detached mode](28-detach/README.md)
# bcc tutorial
- [BPF Features by Linux Kernel Version](bcc-documents/kernel-versions.md)
- [Kernel Configuration for BPF Features](bcc-documents/kernel_config.md)
- [bcc Reference Guide](bcc-documents/reference_guide.md)
- [Special Filtering](bcc-documents/special_filtering.md)
- [bcc Tutorial](bcc-documents/tutorial.md)".- [bcc Python Developer Tutorial](bcc-documents/tutorial_bcc_python_developer.md)