mirror of
https://github.com/eunomia-bpf/bpf-developer-tutorial.git
synced 2026-02-03 10:14:44 +08:00
feat: deploy static web with mdbook (#11)
168
src/0-introduce/README.md
Normal file
@@ -0,0 +1,168 @@
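# eBPF Beginner's Development Tutorial 1: Basic Concepts of eBPF and Common Development Tools

## 1. Introduction to eBPF: Extending the Kernel Safely and Efficiently

eBPF is a revolutionary technology that originated in the Linux kernel and can run sandboxed programs inside the operating system kernel. It is used to extend the kernel's capabilities safely and efficiently, without changing kernel source code or loading kernel modules. Because sandboxed programs can run inside the operating system, application developers can programmatically add functionality to the operating system at runtime. The operating system then guarantees safety and execution efficiency, as if the code had been natively compiled, with the help of a just-in-time (JIT) compiler and a verification engine. eBPF programs are portable across kernel versions and can be updated automatically, which avoids workload interruptions and node reboots.

Today eBPF is used in a wide range of scenarios: in modern data centers and cloud-native environments it provides high-performance packet processing and load balancing; at very low resource cost it makes many fine-grained metrics observable, helping application developers trace applications and giving insight for performance troubleshooting; it secures the execution of applications and container runtimes; and so on. The possibilities are endless, and the innovation that eBPF unlocks in the operating system kernel has only just begun [3].
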
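### The Future of eBPF: A JavaScript-like Programmable Interface for the Kernel

For browsers, the programmability that JavaScript introduced started a huge revolution and turned browsers into almost independent operating systems. Now back to eBPF: to understand the impact of eBPF's programmability on the Linux kernel, it helps to have a high-level picture of how the kernel is structured and how it interacts with applications and hardware [4].



The main purpose of the Linux kernel is to abstract the hardware, or virtual hardware, and to provide a consistent API (system calls) that lets applications run and share resources. To do this, the kernel maintains a set of subsystems and layers that divide up these responsibilities [5]. Each subsystem usually allows some degree of configuration to account for different user needs. If the required behavior cannot be configured, the kernel has to be changed. Historically, there were two options for changing the kernel's behavior or for letting user-written programs run inside the kernel:

| Native kernel support | Kernel module |
| --- | --- |
| Change the kernel source code and convince the Linux kernel community that the change is necessary. Wait several years for the new kernel version to become a commodity. | Write a kernel module. Fix it regularly, because every kernel release may break it. Risk corrupting your Linux kernel because there is no security boundary. |

In practice, neither option is used often: the first costs too much, and the second has almost no portability.
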
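eBPF offers a new option: reprogram the behavior of the Linux kernel without changing kernel source code or loading a kernel module, while keeping a degree of behavioral consistency and compatibility across kernel versions, as well as safety [6]. To achieve this, eBPF programs also need a corresponding API that lets user-defined applications run and share resources. In other words, the eBPF virtual machine in a sense also provides a mechanism similar to system calls. Through the communication mechanisms between eBPF and user space, Wasm virtual machines and user-space applications can also gain full access to this set of "system calls". On the one hand, this programmably extends the capabilities of traditional system calls; on the other, it enables more efficient programmable I/O processing at many layers, such as networking and file systems.



As the figure above shows, today's Linux kernel is evolving toward a new kernel model: user-defined applications can execute in both kernel mode and user mode. User-mode code accesses system resources through traditional system calls, while kernel-mode code interacts with the various parts of the system through BPF helper calls. As of early 2023, the eBPF virtual machine in the kernel had more than 220 helper interfaces, covering a very wide range of use cases.

It is worth noting that BPF helper calls and system calls are not competitors: their programming models and the scenarios in which each has a performance advantage are completely different, and neither will fully replace the other. The situation is similar for Wasm and the WASI ecosystem: a purpose-designed WASI interface has to go through a long standardization process, but in specific scenarios it may give user-space applications better performance and portability guarantees, whereas eBPF offers a fast and flexible way to extend system interfaces while preserving its sandboxed nature and portability.

eBPF is still at an early stage. Even so, with the kernel interfaces and user-space interaction that eBPF already provides, and with Wasm-bpf translating the system interface, applications in a Wasm virtual machine can already obtain the data and return value of almost any function call in the kernel or in user space (kprobe, uprobe, ...); collect and understand all system calls at very low cost, and obtain packet-level and socket-level data for all network operations (tracepoint, socket, ...); and add extra protocol analyzers to a packet-processing solution and easily program any forwarding logic (XDP, TC, ...) to meet changing requirements, all without leaving the Linux kernel's packet-processing context.

Beyond that, eBPF can write data to arbitrary addresses in any user-space process (bpf_probe_write_user [7]), modify the return value of kernel functions within limits (bpf_override_return [8]), and even execute certain system calls directly in kernel mode [9]. Fortunately, eBPF bytecode goes through strict safety checks before it is loaded into the kernel, which ensure that it performs no out-of-bounds memory accesses or similar operations. In addition, many features that could widen the attack surface or introduce security risks are available only if they were explicitly enabled when the kernel was compiled. Before a Wasm virtual machine loads bytecode into the kernel, it can also explicitly enable or disable particular eBPF features to keep the sandbox secure.
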
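## 2. Some Suggestions on How to Learn eBPF Development

This article does not explain how eBPF works in more detail, but here is a study plan with reference material that may be useful:

### Getting started with eBPF (5-7 h)

- Search Google or another search engine for: eBPF
- Ask ChatGPT or a similar tool: What is eBPF?

Recommended:

- Read the eBPF introduction: <https://ebpf.io/> (30 min)
- Skim the kernel-related eBPF documentation: <https://prototype-kernel.readthedocs.io/en/latest/bpf/> (so that you know where to look when you have a question: 30 min)
- Read a Chinese-language eBPF beginner's guide: <https://www.modb.pro/db/391570> (1 h)
- A large collection of reference material: <https://github.com/zoidbergwill/awesome-ebpf> (2-3 h)
- Browse whichever slide decks interest you: <https://github.com/gojue/ebpf-slide> (1-2 h)

Answer three questions:

1. What is eBPF? Why does it exist, and why not simply use kernel modules?
2. What can it do? What can it accomplish inside the Linux kernel? Which eBPF program types and helpers exist (you do not need to know them all, but you do need to know where to look them up)?
3. What can it be used for? In which scenarios is it applied: networking, security, observability?
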
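### Learning how to develop eBPF programs (10-15 h)

Get to know the eBPF development frameworks and try them out:

- Examples of small tools built with BCC: <https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md> (work through it once: 3-4 h)
- Some libbpf examples: <https://github.com/libbpf/libbpf-bootstrap> (run the ones that interest you and read their source code: 2 h)
- A tutorial based on libbpf and eunomia-bpf: <https://github.com/eunomia-bpf/bpf-developer-tutorial> (read parts 1-10: 3-4 h)

Other development frameworks: for Go or Rust, search and experiment on your own (0-2 h)

If you have any questions or anything you would like to learn about, whether or not it relates to this project, feel free to start a thread in this project's discussions.

Answer some questions and try a few things (2-5 h):

1. How do you develop the simplest possible eBPF program?
2. How do you trace a kernel feature or function with eBPF? There are many ways; give code for each.
3. Which mechanisms allow user space and kernel space to communicate? How do you pass information from user space to the kernel, and from the kernel to user space? Give code examples.
4. Write an eBPF program of your own that implements some feature.
5. Over the whole lifecycle of an eBPF program, what happens in user space and what happens in kernel space?
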
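## 3. How to Program with eBPF

Writing raw eBPF programs is tedious and difficult. To change this, LLVM gained the ability in 2015 to compile code written in a high-level language into eBPF bytecode, and the eBPF community wrapped raw system calls such as `bpf()` in a first layer of abstraction, the `libbpf` library. The library contains functions for loading bytecode into the kernel, along with other key functions. The `samples/bpf/` directory of the Linux source tree contains a large number of `libbpf`-based eBPF sample programs shipped with Linux.

A typical `libbpf`-based eBPF program consists of two files, `*_kern.c` and `*_user.c`. `*_kern.c` defines the attach points in the kernel and their handler functions; `*_user.c` contains the user-space code that injects the kernel-side code and handles interaction with the user. For a more detailed walkthrough, see [this video](https://www.bilibili.com/video/BV1f54y1h74r?spm_id_from=333.999.0.0). However, this approach is still hard to understand and has a real barrier to entry, so most eBPF development today is based on tools such as:

- BCC
- bpftrace
- libbpf-bootstrap
- Go eBPF library

as well as newer tools such as `eunomia-bpf`.
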
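## Writing eBPF Programs

An eBPF program consists of a kernel-space part and a user-space part. The kernel-space part contains the program's actual logic; the user-space part loads and manages the kernel-space part. With the eunomia-bpf development tools, you only need to write the kernel-space code.

The kernel-space code must conform to eBPF's syntax and instruction set. An eBPF program is made up mainly of functions, each with a specific role. The function types you can use include:

- kprobe: a probe function that runs on entry to the specified kernel function.
- kretprobe: a return probe function that runs when the specified kernel function returns.
- tracepoint: a tracepoint function that runs at the specified kernel tracepoint.
- raw_tracepoint: a raw tracepoint function that runs at the specified kernel raw tracepoint.
- xdp: a network data processing function that intercepts and processes network packets.
- perf_event: a performance event function that handles kernel performance events.
- tracepoint_return and raw_tracepoint_return: the original list names these as running when a tracepoint or raw tracepoint "returns", but neither the kernel nor libbpf defines such program types. Tracepoints have no return variant; events that have a start and an end are exposed as separate tracepoints, for example the `sys_enter_*` and `sys_exit_*` pairs for system calls.
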
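### BCC

BCC stands for BPF Compiler Collection. The project is a Python library that includes a complete toolchain for writing, compiling, and loading BPF programs, as well as tools for debugging and diagnosing performance problems.

Since its release in 2015, BCC has been steadily improved by hundreds of contributors and now contains a large number of ready-to-use tracing tools. [Its official repository](https://github.com/iovisor/bcc/blob/master/docs/tutorial.md) provides an accessible tutorial that lets users get started with BCC quickly.

With BCC, users can program in high-level languages such as Python and Lua. Compared with programming directly in C, these languages are far more convenient: users only need to write the in-kernel BPF program in C, and BCC takes care of everything else, including compiling, parsing, and loading.

BCC has one drawback, however: its compatibility is poor. A BCC-based eBPF program is compiled every time it runs, and compilation requires the user to set up the relevant header files and their implementations. As many readers will know from experience, build dependencies are a thorny problem in practice. For this reason this project dropped BCC in favor of libbpf-bootstrap, which supports compiling once and running many times.
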
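### eBPF Go library

The eBPF Go library is a general-purpose eBPF library that decouples obtaining eBPF bytecode from loading and managing eBPF programs, and it implements CO-RE functionality similar to libbpf's. eBPF programs are usually written in a high-level language and then compiled to eBPF bytecode with the clang/LLVM compiler.
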
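### libbpf

`libbpf-bootstrap` is a scaffold for BPF development built on the `libbpf` library. Its source code is available on [github](https://github.com/libbpf/libbpf-bootstrap).

`libbpf-bootstrap` brings together years of practice from the BPF community and gives developers a modern, convenient workflow that achieves the goal of compiling once and reusing the result.

BPF programs based on `libbpf-bootstrap` follow a naming convention for source files: the file that produces the kernel-space bytecode ends in `.bpf.c`, the user-space file that loads the bytecode ends in `.c`, and the two files must share the same prefix.

When such a program is built, the `*.bpf.c` file is first compiled to the corresponding `.o` file, and a `skeleton` file, `*.skel.h`, is then generated from it. The skeleton contains some of the data structures defined on the kernel side, together with the key functions for loading the kernel-space code. After the user-space code includes this file with `include`, calling the corresponding load function loads the bytecode into the kernel. `libbpf-bootstrap` also has a very thorough getting-started tutorial; a detailed introduction is available [here](https://nakryiko.com/posts/libbpf-bootstrap/).
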
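### eunomia-bpf

Developing, building, and distributing eBPF programs has always had a high barrier to entry. Tools such as BCC and bpftrace are productive and portable, but deployment requires a build environment with LLVM and Clang, and every run performs a local or remote compilation, which is expensive. Native CO-RE libbpf, on the other hand, requires a fair amount of user-space loading code to load the eBPF program correctly and to read the information it reports from the kernel, and it offers no good solution for distributing and managing eBPF programs.

[eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf) is an open-source eBPF dynamic loading runtime and development toolchain. It is a lightweight CO-RE development framework based on libbpf, designed to simplify developing, building, distributing, and running eBPF programs.

With eunomia-bpf you can:

- write only kernel-space code when building an eBPF program or tool; the information exported by the kernel side is picked up automatically and the program is loaded dynamically as a module;
- develop the user-space interaction program in WASM, controlling the loading and execution of the whole eBPF program and processing its data from inside the WASM virtual machine;
- package precompiled eBPF programs as generic JSON or WASM modules that can be distributed across architectures and kernel versions and loaded dynamically without recompilation.

eunomia-bpf consists of a compiler toolchain and a runtime library. Compared with traditional frameworks such as BCC and native libbpf, it greatly simplifies the eBPF development workflow: in most cases you write only kernel-space code and can then build, package, and publish a complete eBPF application, while the kernel-space eBPF code stays 100% compatible with mainstream frameworks such as libbpf, libbpfgo, and libbpf-rs. When user-space code is needed, WebAssembly makes it possible to write it in several languages. Compared with scripting tools such as bpftrace, eunomia-bpf keeps a similar level of convenience but is not limited to tracing and can be used in more scenarios, such as networking and security.

> - eunomia-bpf project on GitHub: <https://github.com/eunomia-bpf/eunomia-bpf>
> - gitee mirror: <https://gitee.com/anolis/eunomia>
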
## References

- Introduction to eBPF: <https://ebpf.io/>
- BPF Compiler Collection (BCC): <https://github.com/iovisor/bcc>
- eunomia-bpf: <https://github.com/eunomia-bpf/eunomia-bpf>

The complete tutorial and its source code are fully open source and can be found at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
BIN
src/0-introduce/kernel-arch.webp
Normal file
Binary file not shown.
Size: 35 KiB
BIN
src/0-introduce/new-os-model.jpg
Normal file
Binary file not shown.
Size: 46 KiB
7
src/1-helloworld/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
.vscode
package.json
*.o
*.skel.json
*.skel.yaml
package.yaml
ecli
152
src/1-helloworld/README.md
Normal file
@@ -0,0 +1,152 @@
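# eBPF Beginner's Development Tutorial 2: Hello World, Basic Framework and Development Workflow

eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It lets developers dynamically load, update, and run user-defined code while the kernel is running.

This is the second article in the eBPF beginner's development tutorial. It introduces the basic framework of an eBPF program and the development workflow.

eBPF programs can be developed with a variety of tools, such as BCC and eunomia-bpf. Each tool has its own characteristics, but the basic workflow is broadly the same.
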
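## The eBPF Development Workflow

The following uses BCC as an example to introduce the basic workflow for developing an eBPF program.

1. Install the build environment and dependencies. Developing eBPF programs with BCC requires LLVM/Clang and bcc, plus other dependency libraries.
2. Write the eBPF program. An eBPF program has two main parts: a kernel-space part and a user-space part. The kernel-space part contains the program's actual logic; the user-space part loads, runs, and monitors the kernel-space program.
3. Compile and load the eBPF program. Use bcc to compile the eBPF program to eBPF bytecode, then use the user-space code to load and run it.
4. Run the program and process its data. While running in the kernel, the eBPF program is triggered by events and passes event-related information to the user-space program, which processes the information and prints the results.
5. Stop the program. When the eBPF program is no longer needed, the user-space program can unload it and exit.

By following these steps you can develop an eBPF program that runs in the kernel.
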
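## Developing eBPF Programs with eunomia-bpf

eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain whose goal is to simplify developing, building, distributing, and running eBPF programs. It is a lightweight CO-RE development framework based on libbpf. It supports controlling the loading and execution of eBPF programs from a user-space WASM virtual machine, and it packages precompiled eBPF programs as generic JSON or WASM modules for distribution. Using eunomia-bpf can greatly simplify the eBPF development workflow.

The workflow with eunomia-bpf is broadly the same; only the details differ.

1. Install the build environment and dependencies. Developing eBPF programs with eunomia-bpf requires the eunomia-bpf toolchain and runtime library, plus other dependency libraries.
2. Write the eBPF program. An eBPF program has two main parts: a kernel-space part and a user-space part. The kernel-space part contains the program's actual logic; the user-space part loads, runs, and monitors the kernel-space program. With eunomia-bpf you only write the kernel-space code; no user-space code is needed.
3. Compile and load the eBPF program. Use the eunomia-bpf toolchain to compile the eBPF program to eBPF bytecode and package the result as a module that can be distributed across systems. Then use the eunomia-bpf runtime library to load and run the module.
4. Run the program and process its data. While running in the kernel, the eBPF program is triggered by events and passes event-related information to user space. The eunomia-bpf runtime library processes the information and prints the results.
5. Stop the program. When the eBPF program is no longer needed, the eunomia-bpf runtime library can unload it and exit.
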
## Downloading and Installing the eunomia-bpf Development Tools

eunomia-bpf can be downloaded and installed with the following steps.

Download the ecli tool, which runs eBPF programs:

```console
$ wget https://aka.pw/bpf-ecli -O ecli && chmod +x ./ecli
$ ./ecli -h
Usage: ecli [--help] [--version] [--json] [--no-cache] url-and-args
```

Download the compiler toolchain, which compiles eBPF kernel code into a config file or a WASM module:

```console
$ wget https://github.com/eunomia-bpf/eunomia-bpf/releases/latest/download/ecc && chmod +x ./ecc
$ ./ecc -h
eunomia-bpf compiler
Usage: ecc [OPTIONS] <SOURCE_PATH> [EXPORT_EVENT_HEADER]
....
```

You can also compile with the docker image:

```console
$ docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest # compile with docker; `pwd` should contain the *.bpf.c and *.h files
export PATH=PATH:~/.eunomia/bin
Compiling bpf object...
Packing ebpf object and config into /src/package.json...
```

## Hello World - minimal eBPF program

```c
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#define BPF_NO_GLOBAL_DATA
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

typedef unsigned int u32;
typedef int pid_t;
/* 0 disables filtering, so every process is traced */
const pid_t pid_filter = 0;

char LICENSE[] SEC("license") = "Dual BSD/GPL";

/* runs on every entry to the write() system call */
SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
	/* the upper 32 bits of the return value hold the process ID (tgid) */
	pid_t pid = bpf_get_current_pid_tgid() >> 32;
	if (pid_filter && pid != pid_filter)
		return 0;
	bpf_printk("BPF triggered from PID %d.\n", pid);
	return 0;
}
```

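This program defines a function `handle_tp` and uses the SEC macro to attach it to the sys_enter_write tracepoint, so that it runs on every entry to the write system call. The function calls bpf_get_current_pid_tgid to obtain the ID of the process that issued the write call, and bpf_printk to print it to the kernel trace buffer.

- `bpf_trace_printk()`: a simple mechanism for writing output to trace_pipe (/sys/kernel/debug/tracing/trace_pipe); the `bpf_printk()` macro used above is built on this helper. It is fine for simple use cases, but it has limitations: at most 3 arguments; only one `%s` per format string; and trace_pipe is shared globally in the kernel, so other programs using it concurrently can garble its output. A better approach in BCC is BPF_PERF_OUTPUT(), which is covered later.
- `void *ctx`: ctx is really a pointer to a tracepoint-specific structure, but because this program does not use it, it is declared as `void *`.
- `return 0;`: required; the function must return 0 (for the reason, see #139 <https://github.com/iovisor/bcc/issues/139>).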
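
The shift in `handle_tp` is easier to follow in isolation. The sketch below is a user-space illustration, not part of the tutorial's sources: `tgid_of`, `tid_of`, and `should_trace` are hypothetical names, and the 64-bit argument stands in for the value that `bpf_get_current_pid_tgid()` returns in the kernel (thread group ID in the upper 32 bits, thread ID in the lower 32 bits).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; the input models the 64-bit value
 * returned by bpf_get_current_pid_tgid(). */

/* Upper 32 bits: thread group ID, which user space calls the PID. */
static inline uint32_t tgid_of(uint64_t pid_tgid)
{
	return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: thread ID, which the kernel calls the pid. */
static inline uint32_t tid_of(uint64_t pid_tgid)
{
	return (uint32_t)pid_tgid;
}

/* Mirrors the filter in handle_tp: a filter of 0 traces every process. */
static inline int should_trace(uint32_t pid_filter, uint64_t pid_tgid)
{
	return pid_filter == 0 || tgid_of(pid_tgid) == pid_filter;
}
```

Keeping the filter check in one small function like this makes it simple to reason about what `pid_filter = 0` means before the same expression is written into the BPF program.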

To compile and run this program, use the ecc tool and the ecli command. First compile the program with ecc:

```console
$ ecc hello.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```

Or compile with the docker image:

```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```

Then run the compiled program with ecli:

```console
$ sudo ecli ./package.json
Runing eBPF program...
```

Once the program is running, you can see its output by reading the /sys/kernel/debug/tracing/trace_pipe file. The message text in the capture below differs from the format string in the listing above; with that listing, each message reads `BPF triggered from PID <pid>.`

```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
<...>-3840345 [010] d... 3220701.101143: bpf_trace_printk: write system call from PID 3840345.
<...>-3840345 [010] d... 3220701.101143: bpf_trace_printk: write system call from PID 3840345.
```

## Basic Framework of an eBPF Program

As described above, the basic framework of an eBPF program includes:

- Header files: include <linux/bpf.h>, <bpf/bpf_helpers.h>, and similar headers.
- License definition: define a license, usually "Dual BSD/GPL".
- BPF function definition: define a BPF function, here named handle_tp, taking a `void *ctx` parameter and returning int. It is usually written in C.
- BPF helper functions: inside the BPF function you can call BPF helpers such as bpf_get_current_pid_tgid() and bpf_printk().
- Return value

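## tracepoints

Tracepoints are a static instrumentation technique in the kernel. Technically, a tracepoint is just a tracing function placed in the kernel source code: a probe point with a control condition, inserted in the source, to which a handler function can be attached later. The most common form of static tracing in the kernel is printk, which writes log messages. As further examples, there are tracepoints at the start and end of system calls, scheduler events, file system operations, and disk I/O. Tracepoints first became available in 2009, in Linux 2.6.32. They are a stable API, and their number is limited.
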
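## Summary

The workflow for developing and using an eBPF program can be summarized in the following steps:

- Define the program's interfaces and types: define the eBPF program's entry functions, define and implement the eBPF maps and shared buffers (perf events), and decide which eBPF helper functions to use.
- Write the program's code: write the main logic of the eBPF program, implement reads and writes on the eBPF maps, and call the eBPF helper functions.
- Compile the program: use an eBPF compiler (for example clang) to compile the source into eBPF bytecode, producing an object file that can be loaded into the kernel. ecc itself essentially calls the clang compiler to compile eBPF programs.
- Load the program into the kernel: load the compiled eBPF object into the Linux kernel and attach the eBPF program to the chosen kernel events.
- Use the program: monitor how the eBPF program is running, and exchange and share data through the eBPF maps and shared buffers.
- In real-world development there may be further steps, such as configuring compile and load parameters, managing eBPF objects and maps, and using other advanced features.

Note that BPF programs execute in kernel space, so writing, compiling, and debugging them requires special tools and techniques. eunomia-bpf is an open-source BPF compiler and toolkit that helps developers write and run BPF programs quickly and simply.

The complete tutorial and its source code are fully open source and can be found at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.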
21
src/1-helloworld/minimal.bpf.c
Normal file
@@ -0,0 +1,21 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#define BPF_NO_GLOBAL_DATA
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

typedef unsigned int u32;
typedef int pid_t;
const pid_t pid_filter = 0;

char LICENSE[] SEC("license") = "Dual BSD/GPL";

SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
	pid_t pid = bpf_get_current_pid_tgid() >> 32;
	if (pid_filter && pid != pid_filter)
		return 0;
	bpf_printk("BPF triggered from PID %d.\n", pid);
	return 0;
}
7
src/10-hardirqs/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
.vscode
package.json
*.o
*.skel.json
*.skel.yaml
package.yaml
ecli
176
src/10-hardirqs/README.md
Normal file
@@ -0,0 +1,176 @@
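# eBPF Beginner's Development Tutorial 10: Capturing Interrupt Events in eBPF with hardirqs or softirqs

eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It lets developers dynamically load, update, and run user-defined code while the kernel is running.

This is the tenth article in the eBPF beginner's development tutorial. It covers capturing interrupt events in eBPF with hardirqs or softirqs.
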
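## What is hardirqs?

hardirqs is part of the bcc-tools package, a set of utilities for system tracing and analysis on Linux. It is a tool for tracing and analyzing interrupt handlers in the Linux kernel. It uses BPF (Berkeley Packet Filter) programs to collect data about interrupt handlers and can be used to identify performance problems and other issues related to interrupt handling.

## How It Works

In the Linux kernel, each interrupt is identified by an IRQ number, and every handler registered for it carries a name (the `name` field of `struct irqaction`, which the code below uses as its map key). When the kernel receives an interrupt, it looks up the handlers registered for that interrupt and runs them. hardirqs observes this by loading BPF programs into the kernel and attaching them to the `irq_handler_entry` and `irq_handler_exit` tracepoints, which fire when a handler starts and when it finishes. In this way hardirqs can monitor the interrupt handlers that the kernel executes and collect information about them.
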
## Code Implementation

```c
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2020 Wenbo Zhang
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "hardirqs.h"
#include "bits.bpf.h"
#include "maps.bpf.h"

#define MAX_ENTRIES 256

const volatile bool filter_cg = false;
const volatile bool targ_dist = false;
const volatile bool targ_ns = false;
const volatile bool do_count = false;

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
	__type(key, u32);
	__type(value, u32);
	__uint(max_entries, 1);
} cgroup_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, MAX_ENTRIES);
	__type(key, struct irq_key);
	__type(value, struct info);
} infos SEC(".maps");

static struct info zero;

static int handle_entry(int irq, struct irqaction *action)
{
	if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
		return 0;

	if (do_count) {
		struct irq_key key = {};
		struct info *info;

		bpf_probe_read_kernel_str(&key.name, sizeof(key.name), BPF_CORE_READ(action, name));
		info = bpf_map_lookup_or_try_init(&infos, &key, &zero);
		if (!info)
			return 0;
		info->count += 1;
		return 0;
	} else {
		u64 ts = bpf_ktime_get_ns();
		u32 key = 0;

		if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
			return 0;

		bpf_map_update_elem(&start, &key, &ts, BPF_ANY);
		return 0;
	}
}

static int handle_exit(int irq, struct irqaction *action)
{
	struct irq_key ikey = {};
	struct info *info;
	u32 key = 0;
	u64 delta;
	u64 *tsp;

	if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
		return 0;

	tsp = bpf_map_lookup_elem(&start, &key);
	if (!tsp)
		return 0;

	delta = bpf_ktime_get_ns() - *tsp;
	if (!targ_ns)
		delta /= 1000U;

	bpf_probe_read_kernel_str(&ikey.name, sizeof(ikey.name), BPF_CORE_READ(action, name));
	info = bpf_map_lookup_or_try_init(&infos, &ikey, &zero);
	if (!info)
		return 0;

	if (!targ_dist) {
		info->count += delta;
	} else {
		u64 slot;

		slot = log2(delta);
		if (slot >= MAX_SLOTS)
			slot = MAX_SLOTS - 1;
		info->slots[slot]++;
	}

	return 0;
}

SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry_btf, int irq, struct irqaction *action)
{
	return handle_entry(irq, action);
}

SEC("tp_btf/irq_handler_exit")
int BPF_PROG(irq_handler_exit_btf, int irq, struct irqaction *action)
{
	return handle_exit(irq, action);
}

SEC("raw_tp/irq_handler_entry")
int BPF_PROG(irq_handler_entry, int irq, struct irqaction *action)
{
	return handle_entry(irq, action);
}

SEC("raw_tp/irq_handler_exit")
int BPF_PROG(irq_handler_exit, int irq, struct irqaction *action)
{
	return handle_exit(irq, action);
}

char LICENSE[] SEC("license") = "GPL";
```

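This is a BPF (Berkeley Packet Filter) program. BPF programs are small programs that run directly inside the Linux kernel; BPF began as a way to filter network traffic, and this particular program collects statistics about interrupt handlers in the kernel. It defines several maps (data structures that BPF programs can share with each other and with user space) and two functions, handle_entry and handle_exit, which run when the kernel enters and leaves an interrupt handler. When do_count is set, handle_entry counts how many times each handler runs; otherwise it stores the entry timestamp in the per-CPU start map. handle_exit measures the time spent in the handler (in microseconds unless targ_ns is set) and either adds it to the handler's running total or, when targ_dist is set, increments the matching slot of a log2 histogram.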
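
The histogram path in `handle_exit` can be tried out in user space. The following is a minimal sketch, assuming the same bit trick as `log2()` in `bits.bpf.h` and the same `MAX_SLOTS` value as `hardirqs.h`; the names `log2_u32` and `slot_for` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLOTS 20	/* same value as in hardirqs.h */

/* Integer floor(log2(v)), found by binary search on the bit position,
 * following log2() in bits.bpf.h; returns 0 for v == 0. */
static inline uint32_t log2_u32(uint32_t v)
{
	uint32_t shift, r;

	r = (v > 0xFFFF) << 4; v >>= r;
	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
	r |= (v >> 1);

	return r;
}

/* Choose the histogram slot for a latency value, clamping to the last
 * slot. The cast truncates to 32 bits, as passing a u64 delta to
 * log2(u32) does in handle_exit. */
static inline uint32_t slot_for(uint64_t delta)
{
	uint32_t slot = log2_u32((uint32_t)delta);

	return slot >= MAX_SLOTS ? MAX_SLOTS - 1 : slot;
}
```

Each slot `n` therefore collects latencies in the range from 2^n up to, but not including, 2^(n+1), and everything from 2^19 upward lands in the final slot.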

## Running the Code

eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that integrates with Wasm. Its goal is to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime. This example is compiled and run with eunomia-bpf.

To compile the program, use the ecc tool:

```console
$ ecc hardirqs.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```

Then run it:

```console
sudo ecli ./package.json
```

## Summary

For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>

The complete tutorial and its source code are fully open source and can be found at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
31
src/10-hardirqs/bits.bpf.h
Normal file
@@ -0,0 +1,31 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __BITS_BPF_H
#define __BITS_BPF_H

#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = val)

static __always_inline u64 log2(u32 v)
{
	u32 shift, r;

	r = (v > 0xFFFF) << 4; v >>= r;
	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
	r |= (v >> 1);

	return r;
}

static __always_inline u64 log2l(u64 v)
{
	u32 hi = v >> 32;

	if (hi)
		return log2(hi) + 32;
	else
		return log2(v);
}

#endif /* __BITS_BPF_H */
135
src/10-hardirqs/hardirqs.bpf.c
Normal file
@@ -0,0 +1,135 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2020 Wenbo Zhang
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "hardirqs.h"
#include "bits.bpf.h"
#include "maps.bpf.h"

#define MAX_ENTRIES 256

const volatile bool filter_cg = false;
const volatile bool targ_dist = false;
const volatile bool targ_ns = false;
const volatile bool do_count = false;

struct irq_key {
	char name[32];
};

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
	__type(key, u32);
	__type(value, u32);
	__uint(max_entries, 1);
} cgroup_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");

/// @sample {"interval": 1000, "type" : "log2_hist"}
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, MAX_ENTRIES);
	__type(key, struct irq_key);
	__type(value, struct info);
} infos SEC(".maps");

static struct info zero;

static int handle_entry(int irq, struct irqaction *action)
{
	if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
		return 0;

	if (do_count) {
		struct irq_key key = {};
		struct info *info;

		bpf_probe_read_kernel_str(&key.name, sizeof(key.name), BPF_CORE_READ(action, name));
		info = bpf_map_lookup_or_try_init(&infos, &key, &zero);
		if (!info)
			return 0;
		info->count += 1;
		return 0;
	} else {
		u64 ts = bpf_ktime_get_ns();
		u32 key = 0;

		if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
			return 0;

		bpf_map_update_elem(&start, &key, &ts, BPF_ANY);
		return 0;
	}
}

static int handle_exit(int irq, struct irqaction *action)
{
	struct irq_key ikey = {};
	struct info *info;
	u32 key = 0;
	u64 delta;
	u64 *tsp;

	if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
		return 0;

	tsp = bpf_map_lookup_elem(&start, &key);
	if (!tsp)
		return 0;

	delta = bpf_ktime_get_ns() - *tsp;
	if (!targ_ns)
		delta /= 1000U;

	bpf_probe_read_kernel_str(&ikey.name, sizeof(ikey.name), BPF_CORE_READ(action, name));
	info = bpf_map_lookup_or_try_init(&infos, &ikey, &zero);
	if (!info)
		return 0;

	if (!targ_dist) {
		info->count += delta;
	} else {
		u64 slot;

		slot = log2(delta);
		if (slot >= MAX_SLOTS)
			slot = MAX_SLOTS - 1;
		info->slots[slot]++;
	}

	return 0;
}

SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry_btf, int irq, struct irqaction *action)
{
	return handle_entry(irq, action);
}

SEC("tp_btf/irq_handler_exit")
int BPF_PROG(irq_handler_exit_btf, int irq, struct irqaction *action)
{
	return handle_exit(irq, action);
}

SEC("raw_tp/irq_handler_entry")
int BPF_PROG(irq_handler_entry, int irq, struct irqaction *action)
{
	return handle_entry(irq, action);
}

SEC("raw_tp/irq_handler_exit")
int BPF_PROG(irq_handler_exit, int irq, struct irqaction *action)
{
	return handle_exit(irq, action);
}

char LICENSE[] SEC("license") = "GPL";
12
src/10-hardirqs/hardirqs.h
Normal file
@@ -0,0 +1,12 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __HARDIRQS_H
#define __HARDIRQS_H

#define MAX_SLOTS 20

struct info {
	__u64 count;
	__u32 slots[MAX_SLOTS];
};

#endif /* __HARDIRQS_H */
26
src/10-hardirqs/maps.bpf.h
Normal file
@@ -0,0 +1,26 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
// Copyright (c) 2020 Anton Protopopov
#ifndef __MAPS_BPF_H
#define __MAPS_BPF_H

#include <bpf/bpf_helpers.h>
#include <asm-generic/errno.h>

static __always_inline void *
bpf_map_lookup_or_try_init(void *map, const void *key, const void *init)
{
	void *val;
	long err;

	val = bpf_map_lookup_elem(map, key);
	if (val)
		return val;

	err = bpf_map_update_elem(map, key, init, BPF_NOEXIST);
	if (err && err != -EEXIST)
		return 0;

	return bpf_map_lookup_elem(map, key);
}

#endif /* __MAPS_BPF_H */
7
src/11-bootstrap/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
.vscode
package.json
*.o
*.skel.json
*.skel.yaml
package.yaml
ecli
170
src/11-bootstrap/README.md
Normal file
@@ -0,0 +1,170 @@
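# eBPF Beginner's Development Tutorial 11: Using bootstrap in eBPF to Develop User-Space Programs and Trace the exec() and exit() System Calls

eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It lets developers dynamically load, update, and run user-defined code while the kernel is running.
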
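## What is bootstrap?

Bootstrap is a tool that uses BPF (Berkeley Packet Filter) programs to trace exec() system calls (with the SEC("tp/sched/sched_process_exec") handle_exec BPF program), which roughly corresponds to the spawning of a new process (ignoring the fork() part). It also traces exit() (with the SEC("tp/sched/sched_process_exit") handle_exit BPF program) to learn when each process exits. Working together, these two BPF programs capture interesting information about any new process, such as the file name of the binary, measure the lifetime of the process, and collect interesting statistics when the process dies, such as the exit code or the amount of resources consumed. I think this is a good starting point for digging into kernel internals and observing how things really work.

Bootstrap also uses the argp API (part of libc) to parse command-line arguments.
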
## Bootstrap

TODO: add a section on the user-space application and a complete introduction to libbpf-bootstrap. A document like http://cn-sec.com/archives/1267522.html could serve as a reference.

```c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2020 Facebook */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "bootstrap.h"

char LICENSE[] SEC("license") = "Dual BSD/GPL";

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 8192);
	__type(key, pid_t);
	__type(value, u64);
} exec_start SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

const volatile unsigned long long min_duration_ns = 0;

SEC("tp/sched/sched_process_exec")
int handle_exec(struct trace_event_raw_sched_process_exec *ctx)
{
	struct task_struct *task;
	unsigned fname_off;
	struct event *e;
	pid_t pid;
	u64 ts;

	/* remember time exec() was executed for this PID */
	pid = bpf_get_current_pid_tgid() >> 32;
	ts = bpf_ktime_get_ns();
	bpf_map_update_elem(&exec_start, &pid, &ts, BPF_ANY);

	/* don't emit exec events when minimum duration is specified */
	if (min_duration_ns)
		return 0;

	/* reserve sample from BPF ringbuf */
	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	/* fill out the sample with data */
	task = (struct task_struct *)bpf_get_current_task();

	e->exit_event = false;
	e->pid = pid;
	e->ppid = BPF_CORE_READ(task, real_parent, tgid);
	bpf_get_current_comm(&e->comm, sizeof(e->comm));

	fname_off = ctx->__data_loc_filename & 0xFFFF;
	bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);

	/* successfully submit it to user-space for post-processing */
	bpf_ringbuf_submit(e, 0);
	return 0;
}

SEC("tp/sched/sched_process_exit")
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
{
	struct task_struct *task;
	struct event *e;
	pid_t pid, tid;
	u64 id, ts, *start_ts, duration_ns = 0;

	/* get PID and TID of exiting thread/process */
	id = bpf_get_current_pid_tgid();
	pid = id >> 32;
	tid = (u32)id;

	/* ignore thread exits */
	if (pid != tid)
		return 0;

	/* if we recorded start of the process, calculate lifetime duration */
	start_ts = bpf_map_lookup_elem(&exec_start, &pid);
	if (start_ts)
		duration_ns = bpf_ktime_get_ns() - *start_ts;
	else if (min_duration_ns)
		return 0;
	bpf_map_delete_elem(&exec_start, &pid);

	/* if process didn't live long enough, return early */
	if (min_duration_ns && duration_ns < min_duration_ns)
		return 0;

	/* reserve sample from BPF ringbuf */
	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	/* fill out the sample with data */
	task = (struct task_struct *)bpf_get_current_task();

	e->exit_event = true;
	e->duration_ns = duration_ns;
	e->pid = pid;
	e->ppid = BPF_CORE_READ(task, real_parent, tgid);
	e->exit_code = (BPF_CORE_READ(task, exit_code) >> 8) & 0xff;
	bpf_get_current_comm(&e->comm, sizeof(e->comm));

	/* send data to user-space for post-processing */
	bpf_ringbuf_submit(e, 0);
	return 0;
}
```

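This is a C program that uses BPF (Berkeley Packet Filter) to trace process start and exit events and report information about them. BPF is a powerful mechanism that lets you attach small programs, called BPF programs, to various parts of the Linux kernel, where they can filter, monitor, or modify kernel behavior.

The user-space side of the program (not part of the BPF code listed above) first defines some constants and includes some header files. It then defines a struct named env that stores program options, such as verbose mode and the minimum process duration to report.

Next, it defines a function named parse_arg that parses the command-line arguments passed to the program. It takes three parameters: an integer key identifying the option being parsed, a character pointer arg holding the option's argument, and a struct argp_state pointer state describing the current parsing state. The function handles each option and sets the corresponding value in the env struct.

It also defines a function named sig_handler that sets the global flag exiting to true when called. This lets the program exit cleanly when it receives a signal.

The rest of this section describes the BPF code itself.

The program defines a BPF map named exec_start of type BPF_MAP_TYPE_HASH, with at most 8192 entries, a key type of pid_t, and a value type of u64. It maps a PID to the time at which that process called exec().

The program also defines a BPF map named rb of type BPF_MAP_TYPE_RINGBUF. For a ring buffer, max_entries is the size of the buffer in bytes, here 256 * 1024 (256 KiB).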
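If you change `max_entries` for the `rb` map, keep in mind that the kernel expects a ring buffer size that is a power of two and a multiple of the page size. The following user-space sketch checks a candidate size. The helper name `ringbuf_size_ok` and the 4096-byte page size are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed page size for illustration; real code could query
 * sysconf(_SC_PAGESIZE) instead. */
#define ASSUMED_PAGE_SIZE 4096u

/* Hypothetical helper: returns true when `size` looks usable as the
 * max_entries of a BPF_MAP_TYPE_RINGBUF map, i.e. it is a non-zero
 * power of two and a multiple of the (assumed) page size. */
static bool ringbuf_size_ok(size_t size)
{
    bool power_of_two = size != 0 && (size & (size - 1)) == 0;
    bool page_aligned = size % ASSUMED_PAGE_SIZE == 0;

    return power_of_two && page_aligned;
}
```

With these assumptions, `256 * 1024` passes the check, while a value such as `100000` does not.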
|
||||
|
||||
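The program also declares a global named min_duration_ns as `const volatile`, with a default value of 0. It is read-only from the BPF side, so user space can set it before the program is loaded.

The program defines a function named handle_exec. The SEC macro places it in the ELF section "tp/sched/sched_process_exec", which tells the loader to attach it to the sched_process_exec tracepoint. The function records the time at which exec() ran for the current PID and, if a minimum duration is specified, emits no exec event. Otherwise it reserves a sample from the BPF ring buffer, fills it with data, and submits it to user space for post-processing.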
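One step in handle_exec that is easy to miss is `ctx->__data_loc_filename & 0xFFFF`. A tracepoint `__data_loc` field packs two values into 32 bits: the low 16 bits are the byte offset of the string from the start of the event record, and the high 16 bits are its length. The sketch below decodes such a value in plain C; `struct data_loc` and `decode_data_loc` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Decoded form of a tracepoint __data_loc field. */
struct data_loc {
    uint16_t offset; /* byte offset from the start of the event record */
    uint16_t length; /* length of the dynamic data in bytes */
};

/* Hypothetical helper: split the packed 32-bit value. The offset part
 * is the same mask handle_exec applies before reading the file name. */
static struct data_loc decode_data_loc(uint32_t packed)
{
    struct data_loc loc = {
        .offset = (uint16_t)(packed & 0xFFFF),
        .length = (uint16_t)(packed >> 16),
    };

    return loc;
}
```

handle_exec only needs the offset, because `bpf_probe_read_str` stops at the terminating NUL byte or at the destination buffer size.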
|
||||
|
||||
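The program also defines a function named handle_exit, which SEC attaches to the sched_process_exit tracepoint. After determining the PID and TID, it ignores thread exits, computes the process lifetime if an exec timestamp was recorded, and then uses min_duration_ns to decide whether to emit an exit event. If the process lived long enough, it reserves a sample from the BPF ring buffer, fills it with data, and submits it to user space for post-processing.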
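The decisions handle_exit makes can be replayed in ordinary user-space C, which may help when reasoning about which events reach user space. This is only a sketch of the logic; the helper names below (`tgid_of`, `tid_of`, `is_process_exit`, `exit_status_of`, `should_report_exit`) are invented for the example and are not part of the BPF program.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper 32 bits of bpf_get_current_pid_tgid(): the thread group ID,
 * which is what user space calls the PID. */
static uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: the ID of the individual thread. */
static uint32_t tid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}

/* Only the thread-group leader (tgid == tid) counts as a process exit. */
static bool is_process_exit(uint64_t pid_tgid)
{
    return tgid_of(pid_tgid) == tid_of(pid_tgid);
}

/* Same expression as in handle_exit: bits 8..15 of the raw exit code. */
static unsigned exit_status_of(int raw_exit_code)
{
    return ((unsigned)raw_exit_code >> 8) & 0xff;
}

/* Mirrors the two early returns that depend on min_duration_ns. */
static bool should_report_exit(bool have_start_ts, uint64_t duration_ns,
                               uint64_t min_duration_ns)
{
    if (!have_start_ts && min_duration_ns)
        return false; /* exec was never seen, so the lifetime is unknown */
    if (min_duration_ns && duration_ns < min_duration_ns)
        return false; /* the process did not live long enough */
    return true;
}
```

Note that with `min_duration_ns` left at 0, every process exit is reported, even when no exec timestamp was recorded.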
|
||||
|
||||
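Finally, user space polls the BPF ring buffer (libbpf exposes this as ring_buffer__poll) and handles each new event as it arrives. The loop keeps running until the global flag exiting is set to true, at which point the program cleans up its resources and exits.

Compile and run the code above: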
|
||||
|
||||
```console
|
||||
$ ecc bootstrap.bpf.c
|
||||
Compiling bpf object...
|
||||
Packing ebpf object and config into package.json...
|
||||
$ sudo ecli package.json
|
||||
Runing eBPF program...
|
||||
```
|
||||
|
||||
|
||||
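## Summary

This is a C program that uses BPF to trace process start and exit events and report information about them. The user-space side parses command-line arguments with the argp API. The BPF side uses a hash map to remember when each PID called exec(), and a ring buffer to send events (PID, parent PID, command name, executable file name, and so on) to user space. SEC annotations attach the BPF functions to the tracepoints for process exec and exit. Finally, the information about started and exited processes is printed in the terminal.

You can compile this program with the ecc tool and run it with the ecli command. For more examples and a detailed development guide, see the official eunomia-bpf documentation: https://github.com/eunomia-bpf/eunomia-bpf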
|
||||
112
src/11-bootstrap/bootstrap.bpf.c
Normal file
@@ -0,0 +1,112 @@
|
||||
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
|
||||
/* Copyright (c) 2020 Facebook */
|
||||
#include "vmlinux.h"
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include "bootstrap.h"
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, 8192);
|
||||
__type(key, pid_t);
|
||||
__type(value, u64);
|
||||
} exec_start SEC(".maps");
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_RINGBUF);
|
||||
__uint(max_entries, 256 * 1024);
|
||||
} rb SEC(".maps");
|
||||
|
||||
const volatile unsigned long long min_duration_ns = 0;
|
||||
|
||||
SEC("tp/sched/sched_process_exec")
|
||||
int handle_exec(struct trace_event_raw_sched_process_exec *ctx)
|
||||
{
|
||||
struct task_struct *task;
|
||||
unsigned fname_off;
|
||||
struct event *e;
|
||||
pid_t pid;
|
||||
u64 ts;
|
||||
|
||||
/* remember time exec() was executed for this PID */
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
ts = bpf_ktime_get_ns();
|
||||
bpf_map_update_elem(&exec_start, &pid, &ts, BPF_ANY);
|
||||
|
||||
/* don't emit exec events when minimum duration is specified */
|
||||
if (min_duration_ns)
|
||||
return 0;
|
||||
|
||||
/* reserve sample from BPF ringbuf */
|
||||
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
|
||||
if (!e)
|
||||
return 0;
|
||||
|
||||
/* fill out the sample with data */
|
||||
task = (struct task_struct *)bpf_get_current_task();
|
||||
|
||||
e->exit_event = false;
|
||||
e->pid = pid;
|
||||
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
|
||||
bpf_get_current_comm(&e->comm, sizeof(e->comm));
|
||||
|
||||
fname_off = ctx->__data_loc_filename & 0xFFFF;
|
||||
bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);
|
||||
|
||||
/* successfully submit it to user-space for post-processing */
|
||||
bpf_ringbuf_submit(e, 0);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("tp/sched/sched_process_exit")
|
||||
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
|
||||
{
|
||||
struct task_struct *task;
|
||||
struct event *e;
|
||||
pid_t pid, tid;
|
||||
u64 id, ts, *start_ts, duration_ns = 0;
|
||||
|
||||
/* get PID and TID of exiting thread/process */
|
||||
id = bpf_get_current_pid_tgid();
|
||||
pid = id >> 32;
|
||||
tid = (u32)id;
|
||||
|
||||
/* ignore thread exits */
|
||||
if (pid != tid)
|
||||
return 0;
|
||||
|
||||
/* if we recorded start of the process, calculate lifetime duration */
|
||||
start_ts = bpf_map_lookup_elem(&exec_start, &pid);
|
||||
if (start_ts)
|
||||
duration_ns = bpf_ktime_get_ns() - *start_ts;
|
||||
else if (min_duration_ns)
|
||||
return 0;
|
||||
bpf_map_delete_elem(&exec_start, &pid);
|
||||
|
||||
/* if process didn't live long enough, return early */
|
||||
if (min_duration_ns && duration_ns < min_duration_ns)
|
||||
return 0;
|
||||
|
||||
/* reserve sample from BPF ringbuf */
|
||||
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
|
||||
if (!e)
|
||||
return 0;
|
||||
|
||||
/* fill out the sample with data */
|
||||
task = (struct task_struct *)bpf_get_current_task();
|
||||
|
||||
e->exit_event = true;
|
||||
e->duration_ns = duration_ns;
|
||||
e->pid = pid;
|
||||
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
|
||||
e->exit_code = (BPF_CORE_READ(task, exit_code) >> 8) & 0xff;
|
||||
bpf_get_current_comm(&e->comm, sizeof(e->comm));
|
||||
|
||||
/* send data to user-space for post-processing */
|
||||
bpf_ringbuf_submit(e, 0);
|
||||
return 0;
|
||||
}
|
||||
|
||||
19
src/11-bootstrap/bootstrap.h
Normal file
@@ -0,0 +1,19 @@
|
||||
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
|
||||
/* Copyright (c) 2020 Facebook */
|
||||
#ifndef __BOOTSTRAP_H
|
||||
#define __BOOTSTRAP_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
#define MAX_FILENAME_LEN 127
|
||||
|
||||
struct event {
|
||||
int pid;
|
||||
int ppid;
|
||||
unsigned exit_code;
|
||||
unsigned long long duration_ns;
|
||||
char comm[TASK_COMM_LEN];
|
||||
char filename[MAX_FILENAME_LEN];
|
||||
bool exit_event;
|
||||
};
|
||||
|
||||
#endif /* __BOOTSTRAP_H */
|
||||
104
src/12-profile/README.md
Normal file
@@ -0,0 +1,104 @@
|
||||
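## eBPF Beginner's Practical Tutorial: Writing the eBPF Program profile for Performance Analysis

### Background

`profile` is a tool for tracing a program's call flow, similar to the -g option of perf. Compared with perf, `profile` is more fine-grained: it lets you choose the level to trace, for example user space only or kernel space only.

### Implementation

`profile` relies on perf_event in Linux. Before injecting the eBPF program, the `profile` tool first registers the perf events.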
|
||||
```c
|
||||
static int open_and_attach_perf_event(int freq, struct bpf_program *prog,
|
||||
struct bpf_link *links[])
|
||||
{
|
||||
struct perf_event_attr attr = {
|
||||
.type = PERF_TYPE_SOFTWARE,
|
||||
.freq = env.freq,
|
||||
.sample_freq = env.sample_freq,
|
||||
.config = PERF_COUNT_SW_CPU_CLOCK,
|
||||
};
|
||||
int i, fd;
|
||||
|
||||
for (i = 0; i < nr_cpus; i++) {
|
||||
if (env.cpu != -1 && env.cpu != i)
|
||||
continue;
|
||||
|
||||
fd = syscall(__NR_perf_event_open, &attr, -1, i, -1, 0);
|
||||
if (fd < 0) {
|
||||
/* Ignore CPU that is offline */
|
||||
if (errno == ENODEV)
|
||||
continue;
|
||||
fprintf(stderr, "failed to init perf sampling: %s\n",
|
||||
strerror(errno));
|
||||
return -1;
|
||||
}
|
||||
links[i] = bpf_program__attach_perf_event(prog, fd);
|
||||
if (!links[i]) {
|
||||
fprintf(stderr, "failed to attach perf event on cpu: "
|
||||
"%d\n", i);
|
||||
links[i] = NULL;
|
||||
close(fd);
|
||||
return -1;
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
```
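When the `freq` bit of `perf_event_attr` is set, `sample_freq` is interpreted as a number of samples per second rather than as a period. The small sketch below shows the arithmetic that follows from this: the average gap between samples on one CPU and a rough upper bound on the number of samples in a run. The function names and the example frequency of 49 Hz are assumptions for illustration, not values taken from the code above.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Average gap between two samples on one CPU, in nanoseconds.
 * Returns 0 for the invalid frequency 0. */
static uint64_t sample_interval_ns(uint64_t sample_freq_hz)
{
    if (sample_freq_hz == 0)
        return 0;
    return NSEC_PER_SEC / sample_freq_hz;
}

/* Rough upper bound on the samples collected across all CPUs
 * when sampling for `seconds` seconds. */
static uint64_t max_samples(uint64_t sample_freq_hz, uint64_t seconds,
                            uint64_t nr_cpus)
{
    return sample_freq_hz * seconds * nr_cpus;
}
```

At an assumed 49 Hz, a sample is taken roughly every 20.4 ms per CPU, so the overhead stays low while long runs still collect enough stacks to be statistically useful.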
|
||||
The eBPF program samples the program's stack at regular intervals, thereby capturing its execution flow.
|
||||
```c
|
||||
SEC("perf_event")
|
||||
int do_perf_event(struct bpf_perf_event_data *ctx)
|
||||
{
|
||||
__u64 id = bpf_get_current_pid_tgid();
|
||||
__u32 pid = id >> 32;
|
||||
__u32 tid = id;
|
||||
__u64 *valp;
|
||||
static const __u64 zero;
|
||||
struct key_t key = {};
|
||||
|
||||
if (!include_idle && tid == 0)
|
||||
return 0;
|
||||
|
||||
if (targ_pid != -1 && targ_pid != pid)
|
||||
return 0;
|
||||
if (targ_tid != -1 && targ_tid != tid)
|
||||
return 0;
|
||||
|
||||
key.pid = pid;
|
||||
bpf_get_current_comm(&key.name, sizeof(key.name));
|
||||
|
||||
if (user_stacks_only)
|
||||
key.kern_stack_id = -1;
|
||||
else
|
||||
key.kern_stack_id = bpf_get_stackid(&ctx->regs, &stackmap, 0);
|
||||
|
||||
if (kernel_stacks_only)
|
||||
key.user_stack_id = -1;
|
||||
else
|
||||
key.user_stack_id = bpf_get_stackid(&ctx->regs, &stackmap, BPF_F_USER_STACK);
|
||||
|
||||
if (key.kern_stack_id >= 0) {
|
||||
// populate extras to fix the kernel stack
|
||||
__u64 ip = PT_REGS_IP(&ctx->regs);
|
||||
|
||||
if (is_kernel_addr(ip)) {
|
||||
key.kernel_ip = ip;
|
||||
}
|
||||
}
|
||||
|
||||
valp = bpf_map_lookup_or_try_init(&counts, &key, &zero);
|
||||
if (valp)
|
||||
__sync_fetch_and_add(valp, 1);
|
||||
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
In this way, depending on the options the user passes, it can trace the user-space call flow, the kernel-space call flow, or both.
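The effect of the two flags in `do_perf_event` can be summarized as a small truth table. The sketch below expresses it in plain C; `struct stack_choice` and `choose_stacks` are hypothetical names introduced only for this example.

```c
#include <stdbool.h>

/* Which stacks a sample ends up carrying. */
struct stack_choice {
    bool kernel; /* kernel stack is collected */
    bool user;   /* user stack is collected */
};

/* Mirrors the two if/else pairs in do_perf_event: each "only" flag
 * disables the collection of the other kind of stack. */
static struct stack_choice choose_stacks(bool user_stacks_only,
                                         bool kernel_stacks_only)
{
    struct stack_choice choice = {
        .kernel = !user_stacks_only,
        .user = !kernel_stacks_only,
    };

    return choice;
}
```

With neither flag set, both stacks are collected. Setting both flags would leave nothing to collect, so a front end would presumably reject that combination.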
|
||||
### Usage in Eunomia

### Summary

`profile` analyzes a program's execution flow, which can make debugging and similar tasks considerably more efficient for developers.
|
||||
2
src/13-tcpconnlat/.gitignore
vendored
Normal file
@@ -0,0 +1,2 @@
|
||||
.vscode
|
||||
package.json
|
||||
165
src/13-tcpconnlat/README.md
Normal file
@@ -0,0 +1,165 @@
|
||||
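# eBPF Beginner's Practical Tutorial: Using libbpf-bootstrap to Develop a Program That Measures TCP Connection Latency

## Background

In day-to-day backend development, whether you use C, Java, PHP, or Golang, you inevitably call components such as mysql and redis to fetch data, and you may also make RPC calls or call other RESTful APIs. Underneath, almost all of these calls are carried over TCP. Among transport-layer protocols, TCP offers reliable connections, retransmission on error, and congestion control, so it is currently more widely used than UDP. TCP also has drawbacks, however, such as the relatively long delay in establishing a connection. This has led to alternatives such as QUIC (Quick UDP Internet Connections).

Analyzing TCP connection latency is useful for both network performance tuning and troubleshooting.

## How tcpconnlat works

tcpconnlat traces the kernel functions that perform active TCP connections (for example, via the connect() system call) and shows the connection latency as measured locally, that is, the time from sending the SYN to receiving the response packet.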
|
||||
|
||||
### How a TCP connection is established

The whole TCP connection process is shown in the figure:
|
||||
|
||||

|
||||
|
||||
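Let's briefly look at the cost of each step in this process:

1. The client sends the SYN packet: the client usually sends the SYN through the connect system call. This involves the CPU cost of a local system call and a softirq.
2. The SYN travels to the server: the SYN leaves the client's NIC and makes a long-distance trip across the network.
3. The server processes the SYN: the kernel receives the packet in a softirq, puts the connection in the half-open (SYN) queue, and sends back a SYN/ACK. This is mostly CPU cost.
4. The SYN/ACK travels to the client: another long trip across the network.
5. The client processes the SYN/ACK: the client kernel receives and processes the SYN/ACK, which takes a few microseconds of CPU time, and then sends the ACK. This is again a softirq cost.
6. The ACK travels to the server: another long trip across the network.
7. The server receives the ACK: the server kernel receives and processes the ACK, removes the connection from the half-open queue, and puts it in the accept queue. This costs one softirq of CPU time.
8. The server-side user process wakes up: the user process blocked in the accept system call is woken and takes the established connection off the accept queue. This costs one context switch of CPU time.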
|
||||
|
||||
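From the client's point of view, under normal conditions the total time to establish a TCP connection is roughly one network RTT. In some situations, however, the network transmission time rises, the CPU processing cost grows, or the connection fails outright. When the latency turns out to be too high, it can be analyzed together with other information.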
|
||||
|
||||
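### How the eBPF program works

During the TCP three-way handshake, the Linux kernel maintains two queues:

- the half-open queue, also called the SYN queue;
- the fully established queue, also called the accept queue.

When the server receives a SYN from the client, the kernel stores the connection in the half-open queue and replies with SYN+ACK. The client then returns an ACK. When the server receives this third-handshake ACK, the kernel removes the connection from the half-open queue, creates a new fully established connection, and adds it to the accept queue, where it waits for a process to take it out by calling accept.

Our eBPF code is implemented in <https://github.com/yunwei37/Eunomia/blob/master/bpftools/tcpconnlat/tcpconnlat.bpf.c>:

It mainly uses kprobes on tcp_v4_connect, tcp_v6_connect, and tcp_rcv_state_process: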
|
||||
|
||||
```c
|
||||
|
||||
SEC("kprobe/tcp_v4_connect")
|
||||
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_v6_connect")
|
||||
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_rcv_state_process")
|
||||
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
|
||||
{
|
||||
return handle_tcp_rcv_state_process(ctx, sk);
|
||||
}
|
||||
```
|
||||
|
||||
In trace_connect, we track each new TCP connection, record the timestamp at which the connect started, and store it in a map:
|
||||
|
||||
```c
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, 4096);
|
||||
__type(key, struct sock *);
|
||||
__type(value, struct piddata);
|
||||
} start SEC(".maps");
|
||||
|
||||
static int trace_connect(struct sock *sk)
|
||||
{
|
||||
u32 tgid = bpf_get_current_pid_tgid() >> 32;
|
||||
struct piddata piddata = {};
|
||||
|
||||
if (targ_tgid && targ_tgid != tgid)
|
||||
return 0;
|
||||
|
||||
bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
|
||||
piddata.ts = bpf_ktime_get_ns();
|
||||
piddata.tgid = tgid;
|
||||
bpf_map_update_elem(&start, &sk, &piddata, 0);
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
In handle_tcp_rcv_state_process, we track received TCP packets, look up the corresponding connect event in the map, and compute the latency:
|
||||
|
||||
```c
|
||||
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
|
||||
{
|
||||
struct piddata *piddatap;
|
||||
struct event event = {};
|
||||
s64 delta;
|
||||
u64 ts;
|
||||
|
||||
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
|
||||
return 0;
|
||||
|
||||
piddatap = bpf_map_lookup_elem(&start, &sk);
|
||||
if (!piddatap)
|
||||
return 0;
|
||||
|
||||
ts = bpf_ktime_get_ns();
|
||||
delta = (s64)(ts - piddatap->ts);
|
||||
if (delta < 0)
|
||||
goto cleanup;
|
||||
|
||||
event.delta_us = delta / 1000U;
|
||||
if (targ_min_us && event.delta_us < targ_min_us)
|
||||
goto cleanup;
|
||||
__builtin_memcpy(&event.comm, piddatap->comm,
|
||||
sizeof(event.comm));
|
||||
event.ts_us = ts / 1000;
|
||||
event.tgid = piddatap->tgid;
|
||||
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
|
||||
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
|
||||
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
|
||||
if (event.af == AF_INET) {
|
||||
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
|
||||
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
|
||||
} else {
|
||||
BPF_CORE_READ_INTO(&event.saddr_v6, sk,
|
||||
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
|
||||
BPF_CORE_READ_INTO(&event.daddr_v6, sk,
|
||||
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
|
||||
}
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
|
||||
&event, sizeof(event));
|
||||
|
||||
cleanup:
|
||||
bpf_map_delete_elem(&start, &sk);
|
||||
return 0;
|
||||
}
|
||||
```
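The latency arithmetic in handle_tcp_rcv_state_process can be tried out in user space. The sketch below mirrors the signed-delta check, the conversion from nanoseconds to microseconds, and the minimum-latency filter. `connect_latency_us` is a hypothetical helper written for this illustration, not a function from the tool.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true and stores the latency in
 * microseconds in *out_us when the sample should be reported.
 * start_ns is the timestamp saved at connect time, now_ns the time
 * at which the reply was processed, min_us the optional threshold
 * (0 means "report everything"). */
static bool connect_latency_us(uint64_t start_ns, uint64_t now_ns,
                               uint64_t min_us, uint64_t *out_us)
{
    int64_t delta = (int64_t)(now_ns - start_ns);
    uint64_t us;

    if (delta < 0)
        return false; /* clock went backwards; drop the sample */

    us = (uint64_t)delta / 1000u;
    if (min_us && us < min_us)
        return false; /* faster than the requested threshold */

    *out_us = us;
    return true;
}
```

User space later divides the microsecond value by 1000.0 to print milliseconds, so a delta of 25,290,000 ns shows up as 25.29 in the LAT(ms) column.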
|
||||
|
||||
## Build and run

- ```git clone https://github.com/libbpf/libbpf-bootstrap libbpf-bootstrap-cloned```
- Copy the files in the [libbpf-bootstrap](libbpf-bootstrap) directory into ```libbpf-bootstrap-cloned/examples/c```
- Edit ```libbpf-bootstrap-cloned/examples/c/Makefile``` and add ```tcpconnlat``` to its ```APPS``` entry
- Run ```make tcpconnlat``` in ```libbpf-bootstrap-cloned/examples/c```
- ```sudo ./tcpconnlat```

## Output
|
||||
|
||||
```plain
|
||||
root@yutong-VirtualBox:~/libbpf-bootstrap/examples/c# ./tcpconnlat
|
||||
PID COMM IP SADDR DADDR DPORT LAT(ms)
|
||||
222564 wget 4 192.168.88.15 110.242.68.3 80 25.29
|
||||
222684 wget 4 192.168.88.15 167.179.101.42 443 246.76
|
||||
222726 ssh 4 192.168.88.15 167.179.101.42 22 241.17
|
||||
222774 ssh 4 192.168.88.15 1.15.149.151 22 25.31
|
||||
```
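One detail behind the DPORT column: the kernel stores the destination port (`skc_dport`) in network byte order, while the local port (`skc_num`) is already in host byte order. The user-space side therefore converts only the destination port before printing. A minimal sketch of that conversion, using a hypothetical wrapper name `dport_for_display`:

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical wrapper: convert a destination port taken from
 * skc_dport (network byte order) into a printable host-order value. */
static uint16_t dport_for_display(uint16_t dport_net)
{
    return ntohs(dport_net);
}
```

Skipping this conversion on a little-endian machine would print byte-swapped ports, for example 20480 instead of 80.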
|
||||
|
||||
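## Summary

The experiment above shows that tcpconnlat works by tracing TCP connection establishment in the kernel and can measure the latency of TCP connections. Besides command-line use, its output can be combined with metadata from containers, k8s, and so on, and analyzed with tools such as `prometheus` and `grafana` for network performance analysis.

Source: <https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpconnlat.bpf.c>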
|
||||
131
src/13-tcpconnlat/libbpf-bootstrap/tcpconnlat.bpf.c
Normal file
@@ -0,0 +1,131 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
// Copyright (c) 2020 Wenbo Zhang
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include "tcpconnlat.h"
|
||||
|
||||
#define AF_INET 2
|
||||
#define AF_INET6 10
|
||||
|
||||
const volatile __u64 targ_min_us = 0;
|
||||
const volatile pid_t targ_tgid = 0;
|
||||
|
||||
struct piddata {
|
||||
char comm[TASK_COMM_LEN];
|
||||
u64 ts;
|
||||
u32 tgid;
|
||||
};
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, 4096);
|
||||
__type(key, struct sock *);
|
||||
__type(value, struct piddata);
|
||||
} start SEC(".maps");
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
|
||||
__uint(key_size, sizeof(u32));
|
||||
__uint(value_size, sizeof(u32));
|
||||
} events SEC(".maps");
|
||||
|
||||
static int trace_connect(struct sock *sk)
|
||||
{
|
||||
u32 tgid = bpf_get_current_pid_tgid() >> 32;
|
||||
struct piddata piddata = {};
|
||||
|
||||
if (targ_tgid && targ_tgid != tgid)
|
||||
return 0;
|
||||
|
||||
bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
|
||||
piddata.ts = bpf_ktime_get_ns();
|
||||
piddata.tgid = tgid;
|
||||
bpf_map_update_elem(&start, &sk, &piddata, 0);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
|
||||
{
|
||||
struct piddata *piddatap;
|
||||
struct event event = {};
|
||||
s64 delta;
|
||||
u64 ts;
|
||||
|
||||
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
|
||||
return 0;
|
||||
|
||||
piddatap = bpf_map_lookup_elem(&start, &sk);
|
||||
if (!piddatap)
|
||||
return 0;
|
||||
|
||||
ts = bpf_ktime_get_ns();
|
||||
delta = (s64)(ts - piddatap->ts);
|
||||
if (delta < 0)
|
||||
goto cleanup;
|
||||
|
||||
event.delta_us = delta / 1000U;
|
||||
if (targ_min_us && event.delta_us < targ_min_us)
|
||||
goto cleanup;
|
||||
__builtin_memcpy(&event.comm, piddatap->comm,
|
||||
sizeof(event.comm));
|
||||
event.ts_us = ts / 1000;
|
||||
event.tgid = piddatap->tgid;
|
||||
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
|
||||
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
|
||||
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
|
||||
if (event.af == AF_INET) {
|
||||
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
|
||||
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
|
||||
} else {
|
||||
BPF_CORE_READ_INTO(&event.saddr_v6, sk,
|
||||
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
|
||||
BPF_CORE_READ_INTO(&event.daddr_v6, sk,
|
||||
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
|
||||
}
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
|
||||
&event, sizeof(event));
|
||||
|
||||
cleanup:
|
||||
bpf_map_delete_elem(&start, &sk);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_v4_connect")
|
||||
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_v6_connect")
|
||||
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_rcv_state_process")
|
||||
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
|
||||
{
|
||||
return handle_tcp_rcv_state_process(ctx, sk);
|
||||
}
|
||||
|
||||
SEC("fentry/tcp_v4_connect")
|
||||
int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("fentry/tcp_v6_connect")
|
||||
int BPF_PROG(fentry_tcp_v6_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("fentry/tcp_rcv_state_process")
|
||||
int BPF_PROG(fentry_tcp_rcv_state_process, struct sock *sk)
|
||||
{
|
||||
return handle_tcp_rcv_state_process(ctx, sk);
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
298
src/13-tcpconnlat/libbpf-bootstrap/tcpconnlat.c
Normal file
@@ -0,0 +1,298 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
// Copyright (c) 2020 Wenbo Zhang
|
||||
//
|
||||
// Based on tcpconnlat(8) from BCC by Brendan Gregg.
|
||||
// 11-Jul-2020 Wenbo Zhang Created this.
|
||||
#include "tcpconnlat.h"
|
||||
#include <argp.h>
|
||||
#include <arpa/inet.h>
|
||||
#include <bpf/bpf.h>
|
||||
#include <bpf/btf.h>
|
||||
#include <bpf/libbpf.h>
|
||||
#include <signal.h>
|
||||
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
|
||||
#include <time.h>
|
||||
#include <unistd.h>
|
||||
#include "tcpconnlat.skel.h"
|
||||
// #include "trace_helpers.h"
|
||||
|
||||
#define PERF_BUFFER_PAGES 16
|
||||
#define PERF_POLL_TIMEOUT_MS 100
|
||||
|
||||
static volatile sig_atomic_t exiting = 0;
|
||||
|
||||
static struct env {
|
||||
__u64 min_us;
|
||||
pid_t pid;
|
||||
bool timestamp;
|
||||
bool lport;
|
||||
bool verbose;
|
||||
} env;
|
||||
|
||||
const char* argp_program_version = "tcpconnlat 0.1";
|
||||
const char* argp_program_bug_address =
|
||||
"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
|
||||
const char argp_program_doc[] =
|
||||
"\nTrace TCP connects and show connection latency.\n"
|
||||
"\n"
|
||||
"USAGE: tcpconnlat [--help] [-t] [-p PID] [-L]\n"
|
||||
"\n"
|
||||
"EXAMPLES:\n"
|
||||
" tcpconnlat # trace all TCP connect()s\n"
|
||||
" tcpconnlat 1 # trace connection latency slower than 1 ms\n"
|
||||
" tcpconnlat 0.1 # trace connection latency slower than 100 "
|
||||
"us\n"
|
||||
" tcpconnlat -t # include timestamp on output\n"
|
||||
" tcpconnlat -p 185 # trace PID 185 only\n"
|
||||
" tcpconnlat -L # include LPORT while printing outputs\n";
|
||||
|
||||
static const struct argp_option opts[] = {
|
||||
{"timestamp", 't', NULL, 0, "Include timestamp on output"},
|
||||
{"pid", 'p', "PID", 0, "Trace this PID only"},
|
||||
{"lport", 'L', NULL, 0, "Include LPORT on output"},
|
||||
{"verbose", 'v', NULL, 0, "Verbose debug output"},
|
||||
{NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help"},
|
||||
{},
|
||||
};
|
||||
|
||||
static error_t parse_arg(int key, char* arg, struct argp_state* state) {
|
||||
static int pos_args;
|
||||
|
||||
switch (key) {
|
||||
case 'h':
|
||||
argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
|
||||
break;
|
||||
case 'v':
|
||||
env.verbose = true;
|
||||
break;
|
||||
case 'p':
|
||||
errno = 0;
|
||||
env.pid = strtol(arg, NULL, 10);
|
||||
if (errno) {
|
||||
fprintf(stderr, "invalid PID: %s\n", arg);
|
||||
argp_usage(state);
|
||||
}
|
||||
break;
|
||||
case 't':
|
||||
env.timestamp = true;
|
||||
break;
|
||||
case 'L':
|
||||
env.lport = true;
|
||||
break;
|
||||
case ARGP_KEY_ARG:
|
||||
if (pos_args++) {
|
||||
fprintf(stderr, "Unrecognized positional argument: %s\n", arg);
|
||||
argp_usage(state);
|
||||
}
|
||||
errno = 0;
|
||||
env.min_us = strtod(arg, NULL) * 1000;
|
||||
if (errno || env.min_us <= 0) {
|
||||
fprintf(stderr, "Invalid delay (in us): %s\n", arg);
|
||||
argp_usage(state);
|
||||
}
|
||||
break;
|
||||
default:
|
||||
return ARGP_ERR_UNKNOWN;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int libbpf_print_fn(enum libbpf_print_level level,
|
||||
const char* format,
|
||||
va_list args) {
|
||||
if (level == LIBBPF_DEBUG && !env.verbose)
|
||||
return 0;
|
||||
return vfprintf(stderr, format, args);
|
||||
}
|
||||
|
||||
static void sig_int(int signo) {
|
||||
exiting = 1;
|
||||
}
|
||||
|
||||
void handle_event(void* ctx, int cpu, void* data, __u32 data_sz) {
|
||||
const struct event* e = data;
|
||||
char src[INET6_ADDRSTRLEN];
|
||||
char dst[INET6_ADDRSTRLEN];
|
||||
union {
|
||||
struct in_addr x4;
|
||||
struct in6_addr x6;
|
||||
} s, d;
|
||||
static __u64 start_ts;
|
||||
|
||||
if (env.timestamp) {
|
||||
if (start_ts == 0)
|
||||
start_ts = e->ts_us;
|
||||
printf("%-9.3f ", (e->ts_us - start_ts) / 1000000.0);
|
||||
}
|
||||
if (e->af == AF_INET) {
|
||||
s.x4.s_addr = e->saddr_v4;
|
||||
d.x4.s_addr = e->daddr_v4;
|
||||
} else if (e->af == AF_INET6) {
|
||||
memcpy(&s.x6.s6_addr, e->saddr_v6, sizeof(s.x6.s6_addr));
|
||||
memcpy(&d.x6.s6_addr, e->daddr_v6, sizeof(d.x6.s6_addr));
|
||||
} else {
|
||||
fprintf(stderr, "broken event: event->af=%d\n", e->af);
|
||||
return;
|
||||
}
|
||||
|
||||
if (env.lport) {
|
||||
printf("%-6d %-12.12s %-2d %-16s %-6d %-16s %-5d %.2f\n", e->tgid,
|
||||
e->comm, e->af == AF_INET ? 4 : 6,
|
||||
inet_ntop(e->af, &s, src, sizeof(src)), e->lport,
|
||||
inet_ntop(e->af, &d, dst, sizeof(dst)), ntohs(e->dport),
|
||||
e->delta_us / 1000.0);
|
||||
} else {
|
||||
printf("%-6d %-12.12s %-2d %-16s %-16s %-5d %.2f\n", e->tgid, e->comm,
|
||||
e->af == AF_INET ? 4 : 6, inet_ntop(e->af, &s, src, sizeof(src)),
|
||||
inet_ntop(e->af, &d, dst, sizeof(dst)), ntohs(e->dport),
|
||||
e->delta_us / 1000.0);
|
||||
}
|
||||
}
|
||||
|
||||
void handle_lost_events(void* ctx, int cpu, __u64 lost_cnt) {
|
||||
fprintf(stderr, "lost %llu events on CPU #%d\n", lost_cnt, cpu);
|
||||
}
|
||||
static bool fentry_try_attach(int id) {
|
||||
int prog_fd, attach_fd;
|
||||
char error[4096];
|
||||
struct bpf_insn insns[] = {
|
||||
{.code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0},
|
||||
{.code = BPF_JMP | BPF_EXIT},
|
||||
};
|
||||
LIBBPF_OPTS(bpf_prog_load_opts, opts,
|
||||
.expected_attach_type = BPF_TRACE_FENTRY, .attach_btf_id = id,
|
||||
.log_buf = error, .log_size = sizeof(error), );
|
||||
|
||||
prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING, "test", "GPL", insns,
|
||||
sizeof(insns) / sizeof(struct bpf_insn), &opts);
|
||||
if (prog_fd < 0)
|
||||
return false;
|
||||
|
||||
attach_fd = bpf_raw_tracepoint_open(NULL, prog_fd);
|
||||
if (attach_fd >= 0)
|
||||
close(attach_fd);
|
||||
|
||||
close(prog_fd);
|
||||
return attach_fd >= 0;
|
||||
}
|
||||
static bool fentry_can_attach(const char* name, const char* mod) {
|
||||
struct btf *btf, *vmlinux_btf, *module_btf = NULL;
|
||||
int err, id;
|
||||
|
||||
vmlinux_btf = btf__load_vmlinux_btf();
|
||||
err = libbpf_get_error(vmlinux_btf);
|
||||
if (err)
|
||||
return false;
|
||||
|
||||
btf = vmlinux_btf;
|
||||
|
||||
if (mod) {
|
||||
module_btf = btf__load_module_btf(mod, vmlinux_btf);
|
||||
err = libbpf_get_error(module_btf);
|
||||
if (!err)
|
||||
btf = module_btf;
|
||||
}
|
||||
|
||||
id = btf__find_by_name_kind(btf, name, BTF_KIND_FUNC);
|
||||
|
||||
btf__free(module_btf);
|
||||
btf__free(vmlinux_btf);
|
||||
return id > 0 && fentry_try_attach(id);
|
||||
}
|
||||
|
||||
int main(int argc, char** argv) {
|
||||
static const struct argp argp = {
|
||||
.options = opts,
|
||||
.parser = parse_arg,
|
||||
.doc = argp_program_doc,
|
||||
};
|
||||
struct perf_buffer* pb = NULL;
|
||||
struct tcpconnlat_bpf* obj;
|
||||
int err;
|
||||
|
||||
err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
|
||||
if (err)
|
||||
return err;
|
||||
|
||||
libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
|
||||
libbpf_set_print(libbpf_print_fn);
|
||||
|
||||
obj = tcpconnlat_bpf__open();
|
||||
if (!obj) {
|
||||
fprintf(stderr, "failed to open BPF object\n");
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* initialize global data (filtering options) */
|
||||
obj->rodata->targ_min_us = env.min_us;
|
||||
obj->rodata->targ_tgid = env.pid;
|
||||
|
||||
if (fentry_can_attach("tcp_v4_connect", NULL)) {
|
||||
bpf_program__set_attach_target(obj->progs.fentry_tcp_v4_connect, 0,
|
||||
"tcp_v4_connect");
|
||||
bpf_program__set_attach_target(obj->progs.fentry_tcp_v6_connect, 0,
|
||||
"tcp_v6_connect");
|
||||
bpf_program__set_attach_target(obj->progs.fentry_tcp_rcv_state_process,
|
||||
0, "tcp_rcv_state_process");
|
||||
bpf_program__set_autoload(obj->progs.tcp_v4_connect, false);
|
||||
bpf_program__set_autoload(obj->progs.tcp_v6_connect, false);
|
||||
bpf_program__set_autoload(obj->progs.tcp_rcv_state_process, false);
|
||||
} else {
|
||||
bpf_program__set_autoload(obj->progs.fentry_tcp_v4_connect, false);
|
||||
bpf_program__set_autoload(obj->progs.fentry_tcp_v6_connect, false);
|
||||
bpf_program__set_autoload(obj->progs.fentry_tcp_rcv_state_process,
|
||||
false);
|
||||
}
|
||||
|
||||
err = tcpconnlat_bpf__load(obj);
|
||||
if (err) {
|
||||
fprintf(stderr, "failed to load BPF object: %d\n", err);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
err = tcpconnlat_bpf__attach(obj);
|
||||
if (err) {
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
|
||||
handle_event, handle_lost_events, NULL, NULL);
|
||||
if (!pb) {
|
||||
/* record the failure so that main() returns a non-zero status */
err = -errno;
fprintf(stderr, "failed to open perf buffer: %d\n", -err);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/* print header */
|
||||
if (env.timestamp)
|
||||
printf("%-9s ", ("TIME(s)"));
|
||||
if (env.lport) {
|
||||
printf("%-6s %-12s %-2s %-16s %-6s %-16s %-5s %s\n", "PID", "COMM",
|
||||
"IP", "SADDR", "LPORT", "DADDR", "DPORT", "LAT(ms)");
|
||||
} else {
|
||||
printf("%-6s %-12s %-2s %-16s %-16s %-5s %s\n", "PID", "COMM", "IP",
|
||||
"SADDR", "DADDR", "DPORT", "LAT(ms)");
|
||||
}
|
||||
|
||||
if (signal(SIGINT, sig_int) == SIG_ERR) {
|
||||
fprintf(stderr, "can't set signal handler: %s\n", strerror(errno));
|
||||
err = 1;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/* main: poll */
|
||||
while (!exiting) {
|
||||
err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
|
||||
if (err < 0 && err != -EINTR) {
|
||||
fprintf(stderr, "error polling perf buffer: %s\n", strerror(-err));
|
||||
goto cleanup;
|
||||
}
|
||||
/* reset err to return 0 if exiting */
|
||||
err = 0;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
perf_buffer__free(pb);
|
||||
tcpconnlat_bpf__destroy(obj);
|
||||
|
||||
return err != 0;
|
||||
}
|
||||
31
src/13-tcpconnlat/libbpf-bootstrap/tcpconnlat.h
Normal file
@@ -0,0 +1,31 @@
|
||||
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
|
||||
#ifndef __TCPCONNLAT_H
|
||||
#define __TCPCONNLAT_H
|
||||
|
||||
// #include <inttypes.h>
|
||||
typedef unsigned char __u8;
|
||||
typedef unsigned short __u16;
|
||||
typedef unsigned int __u32;
|
||||
typedef unsigned long long __u64;
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
|
||||
struct event {
|
||||
union {
|
||||
__u32 saddr_v4;
|
||||
__u8 saddr_v6[16];
|
||||
};
|
||||
union {
|
||||
__u32 daddr_v4;
|
||||
__u8 daddr_v6[16];
|
||||
};
|
||||
char comm[TASK_COMM_LEN];
|
||||
__u64 delta_us;
|
||||
__u64 ts_us;
|
||||
__u32 tgid;
|
||||
int af;
|
||||
__u16 lport;
|
||||
__u16 dport;
|
||||
};
|
||||
|
||||
#endif /* __TCPCONNLAT_H */
|
||||
113
src/13-tcpconnlat/tcpconnlat.bpf.c
Normal file
@@ -0,0 +1,113 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
// Copyright (c) 2020 Wenbo Zhang
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include "tcpconnlat.bpf.h"
|
||||
|
||||
#define AF_INET 2
|
||||
#define AF_INET6 10
|
||||
|
||||
const volatile __u64 targ_min_us = 0;
|
||||
const volatile pid_t targ_tgid = 0;
|
||||
|
||||
struct piddata {
|
||||
char comm[TASK_COMM_LEN];
|
||||
u64 ts;
|
||||
u32 tgid;
|
||||
};
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, 4096);
|
||||
__type(key, struct sock *);
|
||||
__type(value, struct piddata);
|
||||
} start SEC(".maps");
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
|
||||
__uint(key_size, sizeof(u32));
|
||||
__uint(value_size, sizeof(u32));
|
||||
} events SEC(".maps");
|
||||
|
||||
static int trace_connect(struct sock *sk)
|
||||
{
|
||||
u32 tgid = bpf_get_current_pid_tgid() >> 32;
|
||||
struct piddata piddata = {};
|
||||
|
||||
if (targ_tgid && targ_tgid != tgid)
|
||||
return 0;
|
||||
|
||||
bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
|
||||
piddata.ts = bpf_ktime_get_ns();
|
||||
piddata.tgid = tgid;
|
||||
bpf_map_update_elem(&start, &sk, &piddata, 0);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
|
||||
{
|
||||
struct piddata *piddatap;
|
||||
struct event event = {};
|
||||
s64 delta;
|
||||
u64 ts;
|
||||
|
||||
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
|
||||
return 0;
|
||||
|
||||
piddatap = bpf_map_lookup_elem(&start, &sk);
|
||||
if (!piddatap)
|
||||
return 0;
|
||||
|
||||
ts = bpf_ktime_get_ns();
|
||||
delta = (s64)(ts - piddatap->ts);
|
||||
if (delta < 0)
|
||||
goto cleanup;
|
||||
|
||||
event.delta_us = delta / 1000U;
|
||||
if (targ_min_us && event.delta_us < targ_min_us)
|
||||
goto cleanup;
|
||||
__builtin_memcpy(&event.comm, piddatap->comm,
|
||||
sizeof(event.comm));
|
||||
event.ts_us = ts / 1000;
|
||||
event.tgid = piddatap->tgid;
|
||||
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
|
||||
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
|
||||
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
|
||||
if (event.af == AF_INET) {
|
||||
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
|
||||
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
|
||||
} else {
|
||||
BPF_CORE_READ_INTO(&event.saddr_v6, sk,
|
||||
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
|
||||
BPF_CORE_READ_INTO(&event.daddr_v6, sk,
|
||||
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
|
||||
}
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
|
||||
&event, sizeof(event));
|
||||
|
||||
cleanup:
|
||||
bpf_map_delete_elem(&start, &sk);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_v4_connect")
|
||||
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_v6_connect")
|
||||
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_rcv_state_process")
|
||||
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
|
||||
{
|
||||
return handle_tcp_rcv_state_process(ctx, sk);
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
26
src/13-tcpconnlat/tcpconnlat.bpf.h
Normal file
@@ -0,0 +1,26 @@
|
||||
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
|
||||
#ifndef __TCPCONNLAT_H
|
||||
#define __TCPCONNLAT_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
|
||||
struct event {
|
||||
// union {
|
||||
unsigned int saddr_v4;
|
||||
unsigned char saddr_v6[16];
|
||||
// };
|
||||
// union {
|
||||
unsigned int daddr_v4;
|
||||
unsigned char daddr_v6[16];
|
||||
// };
|
||||
char comm[TASK_COMM_LEN];
|
||||
unsigned long long delta_us;
|
||||
unsigned long long ts_us;
|
||||
unsigned int tgid;
|
||||
int af;
|
||||
unsigned short lport;
|
||||
unsigned short dport;
|
||||
};
|
||||
|
||||
|
||||
#endif /* __TCPCONNLAT_H */
|
||||
187
src/13-tcpconnlat/tcpconnlat.md
Normal file
@@ -0,0 +1,187 @@
|
||||
# eBPF Beginner's Practical Tutorial: Writing the eBPF Program tcpconnlat to Measure TCP Connection Latency
|
||||
|
||||
## Code Walkthrough
|
||||
|
||||
### Background
|
||||
|
||||
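When developing back-end service interfaces, whether in C, Java, PHP, or Go, you will almost inevitably call components such as MySQL and Redis to fetch data, and you may also make RPC calls or call other RESTful APIs. Underneath, nearly all of these calls are carried over TCP. Among transport-layer protocols, TCP offers reliable connections, retransmission on error, and congestion control, so it is currently more widely used than UDP. TCP also has drawbacks, however, such as the relatively long delay in establishing a connection. This is why alternatives such as QUIC (Quick UDP Internet Connections) have appeared.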
|
||||
|
||||
Analyzing TCP connection latency is useful for network performance analysis and tuning as well as for troubleshooting.
|
||||
|
||||
### How tcpconnlat works
|
||||
|
||||
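tcpconnlat traces the kernel functions that perform active TCP connections (for example, via the connect() system call) and reports the connection latency as measured locally, that is, the time from sending the SYN to receiving the response packet.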
|
||||
|
||||
### How a TCP connection is established
|
||||
|
||||
The whole TCP connection process is shown in the figure below:
|
||||
|
||||

|
||||
|
||||
Let us briefly analyze where time is spent at each step of this process:
|
||||
|
||||
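1. The client sends the SYN packet: the client usually sends the SYN through the connect system call, which involves the CPU cost of a local system call and a soft interrupt.
2. The SYN travels to the server: the SYN leaves the client's network card and makes a long-distance trip across the network.
3. The server processes the SYN packet: the kernel receives the packet in a soft interrupt, places the connection in the half-open (SYN) queue, and then sends the SYN/ACK response. This is mainly CPU cost.
4. The SYN/ACK travels to the client: another long-distance network trip.
5. The client processes the SYN/ACK: the client kernel receives and processes the SYN/ACK, spends a few microseconds of CPU time, and then sends the ACK. Again this is soft-interrupt processing cost.
6. The ACK travels to the server: another long-distance network trip.
7. The server receives the ACK: the server kernel receives and processes the ACK, removes the connection from the half-open queue, and places it in the accept queue. This is the CPU cost of one soft interrupt.
8. The server's user process wakes up: the user process blocked in the accept system call is woken and takes the established connection from the accept queue. This is the CPU cost of one context switch.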
|
||||
|
||||
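From the client's point of view, under normal conditions the total time for one TCP connection is roughly one network RTT. In some situations, however, the network transmission time during connection setup may rise, the CPU processing cost may grow, or the connection may fail outright. When excessive latency is observed, it can be analyzed together with other information.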
|
||||
|
||||
### How the eBPF program works
|
||||
|
||||
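During the TCP three-way handshake, the Linux kernel maintains two queues:

- the half-open connection queue, also called the SYN queue;
- the fully established connection queue, also called the accept queue;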
|
||||
|
||||
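After the server receives a SYN request from a client, the kernel stores the connection in the half-open queue and replies with SYN+ACK. The client then returns an ACK. When the server receives this third-handshake ACK, the kernel removes the connection from the half-open queue, creates a fully established connection, and adds it to the accept queue, where it waits for a process to call accept and take it out.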
|
||||
|
||||
Our eBPF code is implemented in <https://github.com/yunwei37/Eunomia/blob/master/bpftools/tcpconnlat/tcpconnlat.bpf.c>:
|
||||
|
||||
It mainly uses kprobes such as kprobe/tcp_rcv_state_process and kprobe/tcp_v4_connect:
|
||||
|
||||
```c
|
||||
|
||||
SEC("kprobe/tcp_v4_connect")
|
||||
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_v6_connect")
|
||||
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
|
||||
{
|
||||
return trace_connect(sk);
|
||||
}
|
||||
|
||||
SEC("kprobe/tcp_rcv_state_process")
|
||||
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
|
||||
{
|
||||
return handle_tcp_rcv_state_process(ctx, sk);
|
||||
}
|
||||
```
|
||||
|
||||
In trace_connect, we track each new TCP connection, record its start timestamp, and store it in a map:
|
||||
|
||||
```c
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, 4096);
|
||||
__type(key, struct sock *);
|
||||
__type(value, struct piddata);
|
||||
} start SEC(".maps");
|
||||
|
||||
static int trace_connect(struct sock *sk)
|
||||
{
|
||||
u32 tgid = bpf_get_current_pid_tgid() >> 32;
|
||||
struct piddata piddata = {};
|
||||
|
||||
if (targ_tgid && targ_tgid != tgid)
|
||||
return 0;
|
||||
|
||||
bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
|
||||
piddata.ts = bpf_ktime_get_ns();
|
||||
piddata.tgid = tgid;
|
||||
bpf_map_update_elem(&start, &sk, &piddata, 0);
|
||||
return 0;
|
||||
}
|
||||
```
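
The `>> 32` in trace_connect relies on how `bpf_get_current_pid_tgid()` packs its result: the upper 32 bits hold the tgid (the process ID as seen from user space) and the lower 32 bits hold the kernel's per-thread ID. The following is a minimal user-space sketch of that split and of the `targ_tgid` filter. The helper names `tgid_of`, `tid_of`, and `tgid_matches` are hypothetical and are not part of the tutorial code.

```c
#include <stdint.h>

/* Upper 32 bits of the packed value: the tgid (user-visible process ID). */
static inline uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits of the packed value: the kernel's per-thread ID. */
static inline uint32_t tid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}

/* Mirrors `if (targ_tgid && targ_tgid != tgid) return 0;`:
 * a target of 0 means "no filter", otherwise the tgid must match. */
static inline int tgid_matches(uint32_t targ_tgid, uint64_t pid_tgid)
{
    return targ_tgid == 0 || targ_tgid == tgid_of(pid_tgid);
}
```

Filtering on the tgid rather than the thread ID means every thread of the target process is traced.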
|
||||
|
||||
In handle_tcp_rcv_state_process, we trace received TCP packets, look up the matching connect event in the map, and compute the latency:
|
||||
|
||||
```c
|
||||
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
|
||||
{
|
||||
struct piddata *piddatap;
|
||||
struct event event = {};
|
||||
s64 delta;
|
||||
u64 ts;
|
||||
|
||||
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
|
||||
return 0;
|
||||
|
||||
piddatap = bpf_map_lookup_elem(&start, &sk);
|
||||
if (!piddatap)
|
||||
return 0;
|
||||
|
||||
ts = bpf_ktime_get_ns();
|
||||
delta = (s64)(ts - piddatap->ts);
|
||||
if (delta < 0)
|
||||
goto cleanup;
|
||||
|
||||
event.delta_us = delta / 1000U;
|
||||
if (targ_min_us && event.delta_us < targ_min_us)
|
||||
goto cleanup;
|
||||
__builtin_memcpy(&event.comm, piddatap->comm,
|
||||
sizeof(event.comm));
|
||||
event.ts_us = ts / 1000;
|
||||
event.tgid = piddatap->tgid;
|
||||
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
|
||||
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
|
||||
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
|
||||
if (event.af == AF_INET) {
|
||||
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
|
||||
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
|
||||
} else {
|
||||
BPF_CORE_READ_INTO(&event.saddr_v6, sk,
|
||||
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
|
||||
BPF_CORE_READ_INTO(&event.daddr_v6, sk,
|
||||
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
|
||||
}
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
|
||||
&event, sizeof(event));
|
||||
|
||||
cleanup:
|
||||
bpf_map_delete_elem(&start, &sk);
|
||||
return 0;
|
||||
}
|
||||
```
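
The latency arithmetic in the handler can be isolated into a small pure function. The sketch below is a user-space illustration only: the function name `connect_latency_us` and its parameters are assumptions, while the steps (signed difference, drop negative deltas, convert nanoseconds to microseconds, apply the optional minimum) follow the handler above.

```c
#include <stdint.h>

/* Returns 1 and stores the latency in *delta_us when the sample should be
 * reported; returns 0 when it should be dropped. */
static int connect_latency_us(uint64_t start_ns, uint64_t now_ns,
                              uint64_t min_us, uint64_t *delta_us)
{
    /* Signed difference, as in the BPF code, so a timestamp that appears
     * to go backwards shows up as a negative value. */
    int64_t delta = (int64_t)(now_ns - start_ns);

    if (delta < 0)
        return 0;

    *delta_us = (uint64_t)delta / 1000U;   /* ns -> us */

    /* min_us == 0 disables the threshold, like targ_min_us. */
    if (min_us && *delta_us < min_us)
        return 0;

    return 1;
}
```

Note that in the BPF handler both drop paths still reach the `cleanup` label, so the map entry is removed whether or not an event is emitted.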
|
||||
|
||||
### Eunomia test demo
|
||||
|
||||
Trace from the command line:
|
||||
|
||||
```bash
|
||||
$ sudo build/bin/Release/eunomia run tcpconnlat
|
||||
[sudo] password for yunwei:
|
||||
[2022-08-07 02:13:39.601] [info] eunomia run in cmd...
|
||||
[2022-08-07 02:13:40.534] [info] press 'Ctrl C' key to exit...
|
||||
PID COMM IP SRC DEST PORT LAT(ms) CONATINER/OS
|
||||
3477 openresty 4 172.19.0.7 172.19.0.5 2379 0.05 docker-apisix_apisix_1
|
||||
3483 openresty 4 172.19.0.7 172.19.0.5 2379 0.08 docker-apisix_apisix_1
|
||||
3477 openresty 4 172.19.0.7 172.19.0.5 2379 0.04 docker-apisix_apisix_1
|
||||
3478 openresty 4 172.19.0.7 172.19.0.5 2379 0.05 docker-apisix_apisix_1
|
||||
3478 openresty 4 172.19.0.7 172.19.0.5 2379 0.03 docker-apisix_apisix_1
|
||||
3478 openresty 4 172.19.0.7 172.19.0.5 2379 0.03 docker-apisix_apisix_1
|
||||
```
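
To print columns like SRC, DEST, and PORT, a user-space consumer has to convert the raw fields of `struct event`. One detail worth noting: `lport` is read from `skc_num`, which the kernel keeps in host byte order, while `dport` is read from `skc_dport`, which is in network byte order and needs `ntohs()`. The sketch below shows one way to format a remote endpoint. The function `format_endpoint` is hypothetical and is not taken from the Eunomia source.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

/* Formats "addr:port" (or "[addr]:port" for IPv6) into buf.
 * addr points at 4 bytes (AF_INET) or 16 bytes (AF_INET6) in network order;
 * port_be is a port in network byte order, as read from skc_dport.
 * Returns 0 on success, -1 on an unsupported family or a too-small buffer. */
static int format_endpoint(int af, const void *addr, uint16_t port_be,
                           char *buf, size_t len)
{
    char ip[INET6_ADDRSTRLEN];
    unsigned port = ntohs(port_be);
    int n;

    if (!inet_ntop(af, addr, ip, sizeof(ip)))
        return -1;

    if (af == AF_INET6)
        n = snprintf(buf, len, "[%s]:%u", ip, port);
    else
        n = snprintf(buf, len, "%s:%u", ip, port);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For the local port, the value from `lport` can be printed directly without a byte-order conversion.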
|
||||
|
||||
eunomia can also be used as a Prometheus exporter. After running the command above, open the visualization panel that ships with Prometheus:
|
||||
|
||||
The following query shows a chart of the latency statistics:
|
||||
|
||||
```plain
|
||||
rate(eunomia_observed_tcpconnlat_v4_histogram_sum[5m])
|
||||
/
|
||||
rate(eunomia_observed_tcpconnlat_v4_histogram_count[5m])
|
||||
```
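
The query divides the rate of the histogram's `_sum` series by the rate of its `_count` series. Both rates are taken over the same 5-minute window, so the window length cancels and what remains is the mean latency per connection in that window. The C sketch below shows the same arithmetic on two counter snapshots. The function name `window_mean_latency` is hypothetical, and the unit is whatever the exporter records, which the source does not state.

```c
#include <stdint.h>

/* Mean latency between two snapshots of a histogram's running sum and count.
 * Returns -1.0 when no new samples arrived in the window. */
static double window_mean_latency(double sum_start, double sum_end,
                                  uint64_t count_start, uint64_t count_end)
{
    if (count_end <= count_start)
        return -1.0;

    /* (delta sum / T) / (delta count / T) == delta sum / delta count */
    return (sum_end - sum_start) / (double)(count_end - count_start);
}
```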
|
||||
|
||||
Result:
|
||||
|
||||

|
||||
|
||||
### Summary
|
||||
|
||||
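The experiment above shows that tcpconnlat works by tracing TCP connections in the kernel and can measure the latency of TCP connection setup. Besides command-line use, it can be combined with metadata from containers, k8s, and so on, and used with tools such as `prometheus` and `grafana` for network performance analysis.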
|
||||
|
||||
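> `Eunomia` is a lightweight, high-performance cloud-native monitoring tool based on eBPF and developed in C/C++. It aims to help users understand container behavior and monitor suspicious container security events, and strives to provide a lightweight open-source monitoring solution covering the full container lifecycle. It uses `Linux` `eBPF` technology to trace your system and applications at runtime and analyzes the collected events to detect suspicious behavior patterns. It currently includes performance analysis, container cluster network visualization and analysis*, container security alerting, one-click deployment, and persistent storage monitoring, and provides a variety of eBPF tracing points. Its core exporter/command-line tool needs a binary of only about 4 MB at minimum and can start on supported Linux kernels.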
|
||||
|
||||
Project repository: <https://github.com/yunwei37/Eunomia>
|
||||
|
||||
### References
|
||||
|
||||
1. <http://kerneltravel.net/blog/2020/tcpconnlat/>
|
||||
2. <https://network.51cto.com/article/640631.html>
|
||||
BIN
src/13-tcpconnlat/tcpconnlat1.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 132 KiB |
BIN
src/13-tcpconnlat/tcpconnlat_p.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 40 KiB |
5
src/14-tcpstates/.gitignore
vendored
Normal file
@@ -0,0 +1,5 @@
|
||||
.vscode
|
||||
package.json
|
||||
eunomia-exporter
|
||||
ecli
|
||||
|
||||
160
src/14-tcpstates/README.md
Normal file
@@ -0,0 +1,160 @@
|
||||
# eBPF Beginner's Practical Tutorial: Using libbpf-bootstrap to Develop a Program That Traces TCP Connection States and Their Durations
|
||||
|
||||
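```tcpstates``` is a program that traces the TCP state of TCP sockets on the current system. It works mainly by attaching to the kernel tracepoint ```inet_sock_set_state```. The collected data is sent to user space through ```perf_event```.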
|
||||
|
||||
```c
|
||||
SEC("tracepoint/sock/inet_sock_set_state")
|
||||
int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)
|
||||
```
|
||||
|
||||
An eBPF tracing function is attached at the point where a socket changes state.
|
||||
|
||||
```c
|
||||
if (ctx->protocol != IPPROTO_TCP)
|
||||
return 0;
|
||||
|
||||
if (target_family && target_family != family)
|
||||
return 0;
|
||||
|
||||
if (filter_by_sport && !bpf_map_lookup_elem(&sports, &sport))
|
||||
return 0;
|
||||
|
||||
if (filter_by_dport && !bpf_map_lookup_elem(&dports, &dport))
|
||||
return 0;
|
||||
```
|
||||
|
||||
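When the tracing function is called, it first checks whether the socket changing state matches our filter conditions; if not, the event is not recorded.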
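
The four checks form a single predicate: protocol, address family, local port set, remote port set. The sketch below reproduces that logic in plain user-space C so it can be read and tested in isolation. It is an illustration under assumptions: `struct port_set` is a hypothetical bitmap standing in for the `sports` and `dports` BPF hash maps, its `enabled` flag stands in for `filter_by_sport` and `filter_by_dport`, and `TCP_PROTO` is a local name for the value of `IPPROTO_TCP`.

```c
#include <stdbool.h>
#include <stdint.h>

#define TCP_PROTO 6   /* numeric value of IPPROTO_TCP */

/* Hypothetical stand-in for a BPF hash map keyed by port number. */
struct port_set {
    uint8_t bits[65536 / 8];
    bool enabled;          /* plays the role of filter_by_sport / filter_by_dport */
};

static void port_set_add(struct port_set *s, uint16_t port)
{
    s->bits[port / 8] |= (uint8_t)(1u << (port % 8));
    s->enabled = true;
}

static bool port_set_has(const struct port_set *s, uint16_t port)
{
    return (s->bits[port / 8] & (1u << (port % 8))) != 0;
}

/* Returns true when a state change should be recorded. */
static bool should_trace(int protocol, uint16_t family, uint16_t target_family,
                         const struct port_set *sports, uint16_t sport,
                         const struct port_set *dports, uint16_t dport)
{
    if (protocol != TCP_PROTO)
        return false;
    if (target_family && target_family != family)   /* 0 means any family */
        return false;
    if (sports->enabled && !port_set_has(sports, sport))
        return false;
    if (dports->enabled && !port_set_has(dports, dport))
        return false;
    return true;
}
```

Each filter is opt-in: with no family and no ports configured, every TCP state change passes.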
|
||||
|
||||
```c
|
||||
tsp = bpf_map_lookup_elem(&timestamps, &sk);
|
||||
ts = bpf_ktime_get_ns();
|
||||
if (!tsp)
|
||||
delta_us = 0;
|
||||
else
|
||||
delta_us = (ts - *tsp) / 1000;
|
||||
|
||||
event.skaddr = (__u64)sk;
|
||||
event.ts_us = ts / 1000;
|
||||
event.delta_us = delta_us;
|
||||
event.pid = bpf_get_current_pid_tgid() >> 32;
|
||||
event.oldstate = ctx->oldstate;
|
||||
event.newstate = ctx->newstate;
|
||||
event.family = family;
|
||||
event.sport = sport;
|
||||
event.dport = dport;
|
||||
bpf_get_current_comm(&event.task, sizeof(event.task));
|
||||
|
||||
if (family == AF_INET) {
|
||||
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_rcv_saddr);
|
||||
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_daddr);
|
||||
} else { /* family == AF_INET6 */
|
||||
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
|
||||
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_v6_daddr.in6_u.u6_addr32);
|
||||
}
|
||||
```
|
||||
|
||||
The event struct is filled in with information about the state change.
|
||||
|
||||
- ```libbpf```'s CO-RE support is used here.
|
||||
|
||||
```c
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
|
||||
```
|
||||
|
||||
The event struct is sent to the user-space program.
|
||||
|
||||
```c
|
||||
if (ctx->newstate == TCP_CLOSE)
|
||||
bpf_map_delete_elem(&timestamps, &sk);
|
||||
else
|
||||
bpf_map_update_elem(&timestamps, &sk, &ts, BPF_ANY);
|
||||
```
|
||||
|
||||
Depending on the new state of this TCP connection, we either update its timestamp record or stop tracking its timestamp.
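
Taken together, the lookup, the delta computation, and the final update-or-delete give each socket a "time spent in the previous state" value. The following user-space sketch models that bookkeeping with a small fixed-size table in place of the `timestamps` BPF hash map. The names `ts_table`, `ts_find`, and `on_state_change` are hypothetical, and `TCP_CLOSE_STATE` is a local name for the kernel's `TCP_CLOSE` value, 7.

```c
#include <stddef.h>
#include <stdint.h>

#define TCP_CLOSE_STATE 7   /* value of TCP_CLOSE in the kernel's state enum */
#define TABLE_SLOTS 64      /* toy capacity; the BPF map allows 10240 entries */

struct ts_entry {
    uint64_t sk;      /* socket address used as the key */
    uint64_t ts_ns;   /* time of the last state change */
    int used;
};

struct ts_table {
    struct ts_entry slots[TABLE_SLOTS];
};

static struct ts_entry *ts_find(struct ts_table *t, uint64_t sk)
{
    for (size_t i = 0; i < TABLE_SLOTS; i++)
        if (t->slots[i].used && t->slots[i].sk == sk)
            return &t->slots[i];
    return NULL;
}

/* Returns the microseconds spent in the previous state, or 0 when the
 * socket has not been seen before. */
static uint64_t on_state_change(struct ts_table *t, uint64_t sk,
                                int newstate, uint64_t now_ns)
{
    struct ts_entry *e = ts_find(t, sk);
    uint64_t delta_us = e ? (now_ns - e->ts_ns) / 1000 : 0;

    if (newstate == TCP_CLOSE_STATE) {
        if (e)
            e->used = 0;            /* connection is gone: forget it */
        return delta_us;
    }

    if (!e) {
        for (size_t i = 0; i < TABLE_SLOTS && !e; i++)
            if (!t->slots[i].used)
                e = &t->slots[i];
        if (!e)
            return delta_us;        /* table full: skip tracking */
    }

    e->sk = sk;
    e->ts_ns = now_ns;              /* start timing the new state */
    e->used = 1;
    return delta_us;
}
```

Deleting the entry on `TCP_CLOSE` keeps the table from filling with stale sockets, and it also means a reused socket address starts again from a delta of 0.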
|
||||
|
||||
## User-space program
|
||||
|
||||
```c
|
||||
while (!exiting) {
|
||||
err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
|
||||
if (err < 0 && err != -EINTR) {
|
||||
warn("error polling perf buffer: %s\n", strerror(-err));
|
||||
goto cleanup;
|
||||
}
|
||||
/* reset err to return 0 if exiting */
|
||||
err = 0;
|
||||
}
|
||||
```
|
||||
|
||||
Continuously poll for the ```perf event```s sent by the kernel program.
|
||||
|
||||
```c
|
||||
static void handle_event(void* ctx, int cpu, void* data, __u32 data_sz) {
|
||||
char ts[32], saddr[26], daddr[26];
|
||||
struct event* e = data;
|
||||
struct tm* tm;
|
||||
int family;
|
||||
time_t t;
|
||||
|
||||
if (emit_timestamp) {
|
||||
time(&t);
|
||||
tm = localtime(&t);
|
||||
strftime(ts, sizeof(ts), "%H:%M:%S", tm);
|
||||
printf("%8s ", ts);
|
||||
}
|
||||
|
||||
inet_ntop(e->family, &e->saddr, saddr, sizeof(saddr));
|
||||
inet_ntop(e->family, &e->daddr, daddr, sizeof(daddr));
|
||||
if (wide_output) {
|
||||
family = e->family == AF_INET ? 4 : 6;
|
||||
printf(
|
||||
"%-16llx %-7d %-16s %-2d %-26s %-5d %-26s %-5d %-11s -> %-11s "
|
||||
"%.3f\n",
|
||||
e->skaddr, e->pid, e->task, family, saddr, e->sport, daddr,
|
||||
e->dport, tcp_states[e->oldstate], tcp_states[e->newstate],
|
||||
(double)e->delta_us / 1000);
|
||||
} else {
|
||||
printf(
|
||||
"%-16llx %-7d %-10.10s %-15s %-5d %-15s %-5d %-11s -> %-11s %.3f\n",
|
||||
e->skaddr, e->pid, e->task, saddr, e->sport, daddr, e->dport,
|
||||
tcp_states[e->oldstate], tcp_states[e->newstate],
|
||||
(double)e->delta_us / 1000);
|
||||
}
|
||||
}
|
||||
|
||||
static void handle_lost_events(void* ctx, int cpu, __u64 lost_cnt) {
|
||||
warn("lost %llu events on CPU #%d\n", lost_cnt, cpu);
|
||||
}
|
||||
```
|
||||
|
||||
When an event is received, the corresponding handler function is called and prints the output.
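
The handler indexes `tcp_states[]` directly with `e->oldstate` and `e->newstate`. Index 0 of that array has no name, and a value past the end would read out of bounds. A bounds-checked lookup is one way to make the printing step more defensive. The sketch below is an assumption-labeled variant, not the tutorial's code: `tcp_state_names` and `tcp_state_name` are hypothetical names, and the table reuses the state names from `tcpstates.c`.

```c
#include <stddef.h>

/* Same state names as the tcp_states[] table in tcpstates.c. */
static const char *const tcp_state_names[] = {
    [1] = "ESTABLISHED", [2] = "SYN_SENT",    [3] = "SYN_RECV",
    [4] = "FIN_WAIT1",   [5] = "FIN_WAIT2",   [6] = "TIME_WAIT",
    [7] = "CLOSE",       [8] = "CLOSE_WAIT",  [9] = "LAST_ACK",
    [10] = "LISTEN",     [11] = "CLOSING",    [12] = "NEW_SYN_RECV",
};

/* Returns a printable name for any integer, never NULL. */
static const char *tcp_state_name(int state)
{
    size_t n = sizeof(tcp_state_names) / sizeof(tcp_state_names[0]);

    if (state < 0 || (size_t)state >= n || !tcp_state_names[state])
        return "UNKNOWN";
    return tcp_state_names[state];
}
```

With this helper, a call such as `tcp_state_name(e->oldstate)` could replace the direct array access in the `printf` arguments.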
|
||||
|
||||
## Build and run
|
||||
|
||||
- ```git clone https://github.com/libbpf/libbpf-bootstrap libbpf-bootstrap-cloned```
- Copy the files in the [libbpf-bootstrap](libbpf-bootstrap) directory into ```libbpf-bootstrap-cloned/examples/c```
- Edit ```libbpf-bootstrap-cloned/examples/c/Makefile``` and add ```tcpstates``` to its ```APPS``` entry
- Run ```make tcpstates``` in ```libbpf-bootstrap-cloned/examples/c```
- ```sudo ./tcpstates```
|
||||
|
||||
## Output
|
||||
|
||||
```plain
|
||||
root@yutong-VirtualBox:~/libbpf-bootstrap/examples/c# ./tcpstates
|
||||
SKADDR PID COMM LADDR LPORT RADDR RPORT OLDSTATE -> NEWSTATE MS
|
||||
ffff9bf61bb62bc0 164978 node 192.168.88.15 0 52.178.17.2 443 CLOSE -> SYN_SENT 0.000
|
||||
ffff9bf61bb62bc0 0 swapper/0 192.168.88.15 41596 52.178.17.2 443 SYN_SENT -> ESTABLISHED 225.794
|
||||
ffff9bf61bb62bc0 0 swapper/0 192.168.88.15 41596 52.178.17.2 443 ESTABLISHED -> CLOSE_WAIT 901.454
|
||||
ffff9bf61bb62bc0 164978 node 192.168.88.15 41596 52.178.17.2 443 CLOSE_WAIT -> LAST_ACK 0.793
|
||||
ffff9bf61bb62bc0 164978 node 192.168.88.15 41596 52.178.17.2 443 LAST_ACK -> LAST_ACK 0.086
|
||||
ffff9bf61bb62bc0 228759 kworker/u6 192.168.88.15 41596 52.178.17.2 443 LAST_ACK -> CLOSE 0.193
|
||||
ffff9bf6d8ee88c0 229832 redis-serv 0.0.0.0 6379 0.0.0.0 0 CLOSE -> LISTEN 0.000
|
||||
ffff9bf6d8ee88c0 229832 redis-serv 0.0.0.0 6379 0.0.0.0 0 LISTEN -> CLOSE 1.763
|
||||
ffff9bf7109d6900 88750 node 127.0.0.1 39755 127.0.0.1 50966 ESTABLISHED -> FIN_WAIT1 0.000
|
||||
```
|
||||
|
||||
For a detailed explanation of the output, see [README.md](README.md)
|
||||
|
||||
## Summary
|
||||
|
||||
The code here is adapted from <https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpstates.bpf.c>
|
||||
102
src/14-tcpstates/libbpf-bootstrap/tcpstates.bpf.c
Normal file
@@ -0,0 +1,102 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
/* Copyright (c) 2021 Hengqi Chen */
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include "tcpstates.h"
|
||||
|
||||
#define MAX_ENTRIES 10240
|
||||
#define AF_INET 2
|
||||
#define AF_INET6 10
|
||||
|
||||
const volatile bool filter_by_sport = false;
|
||||
const volatile bool filter_by_dport = false;
|
||||
const volatile short target_family = 0;
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, __u16);
|
||||
__type(value, __u16);
|
||||
} sports SEC(".maps");
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, __u16);
|
||||
__type(value, __u16);
|
||||
} dports SEC(".maps");
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, struct sock *);
|
||||
__type(value, __u64);
|
||||
} timestamps SEC(".maps");
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
|
||||
__uint(key_size, sizeof(__u32));
|
||||
__uint(value_size, sizeof(__u32));
|
||||
} events SEC(".maps");
|
||||
|
||||
SEC("tracepoint/sock/inet_sock_set_state")
|
||||
int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)
|
||||
{
|
||||
struct sock *sk = (struct sock *)ctx->skaddr;
|
||||
__u16 family = ctx->family;
|
||||
__u16 sport = ctx->sport;
|
||||
__u16 dport = ctx->dport;
|
||||
__u64 *tsp, delta_us, ts;
|
||||
struct event event = {};
|
||||
|
||||
if (ctx->protocol != IPPROTO_TCP)
|
||||
return 0;
|
||||
|
||||
if (target_family && target_family != family)
|
||||
return 0;
|
||||
|
||||
if (filter_by_sport && !bpf_map_lookup_elem(&sports, &sport))
|
||||
return 0;
|
||||
|
||||
if (filter_by_dport && !bpf_map_lookup_elem(&dports, &dport))
|
||||
return 0;
|
||||
|
||||
tsp = bpf_map_lookup_elem(&timestamps, &sk);
|
||||
ts = bpf_ktime_get_ns();
|
||||
if (!tsp)
|
||||
delta_us = 0;
|
||||
else
|
||||
delta_us = (ts - *tsp) / 1000;
|
||||
|
||||
event.skaddr = (__u64)sk;
|
||||
event.ts_us = ts / 1000;
|
||||
event.delta_us = delta_us;
|
||||
event.pid = bpf_get_current_pid_tgid() >> 32;
|
||||
event.oldstate = ctx->oldstate;
|
||||
event.newstate = ctx->newstate;
|
||||
event.family = family;
|
||||
event.sport = sport;
|
||||
event.dport = dport;
|
||||
bpf_get_current_comm(&event.task, sizeof(event.task));
|
||||
|
||||
if (family == AF_INET) {
|
||||
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_rcv_saddr);
|
||||
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_daddr);
|
||||
} else { /* family == AF_INET6 */
|
||||
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
|
||||
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_v6_daddr.in6_u.u6_addr32);
|
||||
}
|
||||
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
|
||||
|
||||
if (ctx->newstate == TCP_CLOSE)
|
||||
bpf_map_delete_elem(&timestamps, &sk);
|
||||
else
|
||||
bpf_map_update_elem(&timestamps, &sk, &ts, BPF_ANY);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
546
src/14-tcpstates/libbpf-bootstrap/tcpstates.c
Normal file
@@ -0,0 +1,546 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
|
||||
/*
|
||||
* tcpstates Trace TCP session state changes with durations.
|
||||
* Copyright (c) 2021 Hengqi Chen
|
||||
*
|
||||
* Based on tcpstates(8) from BCC by Brendan Gregg.
|
||||
* 18-Dec-2021 Hengqi Chen Created this.
|
||||
*/
|
||||
#include <argp.h>
|
||||
#include <arpa/inet.h>
|
||||
#include <bpf/bpf.h>
|
||||
#include <bpf/libbpf.h>
|
||||
#include <errno.h>
|
||||
#include <signal.h>
|
||||
#include <string.h>
|
||||
#include <sys/socket.h>
|
||||
#include <sys/utsname.h>
|
||||
#include <time.h>
|
||||
#include <unistd.h>
|
||||
#include <zlib.h>
|
||||
// #include "btf_helpers.h"
|
||||
#include "tcpstates.h"
|
||||
#include "tcpstates.skel.h"
|
||||
// #include "trace_helpers.h"
|
||||
|
||||
#define PERF_BUFFER_PAGES 16
|
||||
#define PERF_POLL_TIMEOUT_MS 100
|
||||
#define warn(...) fprintf(stderr, __VA_ARGS__)
|
||||
|
||||
static volatile sig_atomic_t exiting = 0;
|
||||
|
||||
static bool emit_timestamp = false;
|
||||
static short target_family = 0;
|
||||
static char* target_sports = NULL;
|
||||
static char* target_dports = NULL;
|
||||
static bool wide_output = false;
|
||||
static bool verbose = false;
|
||||
static const char* tcp_states[] = {
|
||||
[1] = "ESTABLISHED", [2] = "SYN_SENT", [3] = "SYN_RECV",
|
||||
[4] = "FIN_WAIT1", [5] = "FIN_WAIT2", [6] = "TIME_WAIT",
|
||||
[7] = "CLOSE", [8] = "CLOSE_WAIT", [9] = "LAST_ACK",
|
||||
[10] = "LISTEN", [11] = "CLOSING", [12] = "NEW_SYN_RECV",
|
||||
[13] = "UNKNOWN",
|
||||
};
|
||||
|
||||
const char* argp_program_version = "tcpstates 1.0";
|
||||
const char* argp_program_bug_address =
|
||||
"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
|
||||
const char argp_program_doc[] =
|
||||
"Trace TCP session state changes and durations.\n"
|
||||
"\n"
|
||||
"USAGE: tcpstates [-4] [-6] [-T] [-L lport] [-D dport]\n"
|
||||
"\n"
|
||||
"EXAMPLES:\n"
|
||||
" tcpstates # trace all TCP state changes\n"
|
||||
" tcpstates -T # include timestamps\n"
|
||||
" tcpstates -L 80 # only trace local port 80\n"
|
||||
" tcpstates -D 80 # only trace remote port 80\n";
|
||||
|
||||
static const struct argp_option opts[] = {
|
||||
{"verbose", 'v', NULL, 0, "Verbose debug output"},
|
||||
{"timestamp", 'T', NULL, 0, "Include timestamp on output"},
|
||||
{"ipv4", '4', NULL, 0, "Trace IPv4 family only"},
|
||||
{"ipv6", '6', NULL, 0, "Trace IPv6 family only"},
|
||||
{"wide", 'w', NULL, 0, "Wide column output (fits IPv6 addresses)"},
|
||||
{"localport", 'L', "LPORT", 0,
|
||||
"Comma-separated list of local ports to trace."},
|
||||
{"remoteport", 'D', "DPORT", 0,
|
||||
"Comma-separated list of remote ports to trace."},
|
||||
{NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help"},
|
||||
{},
|
||||
};
|
||||
|
||||
static error_t parse_arg(int key, char* arg, struct argp_state* state) {
|
||||
long port_num;
|
||||
char* port;
|
||||
|
||||
switch (key) {
|
||||
case 'v':
|
||||
verbose = true;
|
||||
break;
|
||||
case 'T':
|
||||
emit_timestamp = true;
|
||||
break;
|
||||
case '4':
|
||||
target_family = AF_INET;
|
||||
break;
|
||||
case '6':
|
||||
target_family = AF_INET6;
|
||||
break;
|
||||
case 'w':
|
||||
wide_output = true;
|
||||
break;
|
||||
case 'L':
|
||||
if (!arg) {
|
||||
warn("No ports specified\n");
|
||||
argp_usage(state);
|
||||
}
|
||||
target_sports = strdup(arg);
|
||||
port = strtok(arg, ",");
|
||||
while (port) {
|
||||
errno = 0; /* strtol only sets errno on failure, so clear any stale value */
port_num = strtol(port, NULL, 10);
if (errno || port_num <= 0 || port_num > 65535) {
|
||||
warn("Invalid ports: %s\n", arg);
|
||||
argp_usage(state);
|
||||
}
|
||||
port = strtok(NULL, ",");
|
||||
}
|
||||
break;
|
||||
case 'D':
|
||||
if (!arg) {
|
||||
warn("No ports specified\n");
|
||||
argp_usage(state);
|
||||
}
|
||||
target_dports = strdup(arg);
|
||||
port = strtok(arg, ",");
|
||||
while (port) {
|
||||
errno = 0; /* strtol only sets errno on failure, so clear any stale value */
port_num = strtol(port, NULL, 10);
if (errno || port_num <= 0 || port_num > 65535) {
|
||||
warn("Invalid ports: %s\n", arg);
|
||||
argp_usage(state);
|
||||
}
|
||||
port = strtok(NULL, ",");
|
||||
}
|
||||
break;
|
||||
case 'h':
|
||||
argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
|
||||
break;
|
||||
default:
|
||||
return ARGP_ERR_UNKNOWN;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int libbpf_print_fn(enum libbpf_print_level level,
|
||||
const char* format,
|
||||
va_list args) {
|
||||
if (level == LIBBPF_DEBUG && !verbose)
|
||||
return 0;
|
||||
|
||||
return vfprintf(stderr, format, args);
|
||||
}
|
||||
|
||||
static void sig_int(int signo) {
|
||||
exiting = 1;
|
||||
}
|
||||
|
||||
static void handle_event(void* ctx, int cpu, void* data, __u32 data_sz) {
|
||||
char ts[32], saddr[26], daddr[26];
|
||||
struct event* e = data;
|
||||
struct tm* tm;
|
||||
int family;
|
||||
time_t t;
|
||||
|
||||
if (emit_timestamp) {
|
||||
time(&t);
|
||||
tm = localtime(&t);
|
||||
strftime(ts, sizeof(ts), "%H:%M:%S", tm);
|
||||
printf("%8s ", ts);
|
||||
}
|
||||
|
||||
inet_ntop(e->family, &e->saddr, saddr, sizeof(saddr));
|
||||
inet_ntop(e->family, &e->daddr, daddr, sizeof(daddr));
|
||||
if (wide_output) {
|
||||
family = e->family == AF_INET ? 4 : 6;
|
||||
printf(
|
||||
"%-16llx %-7d %-16s %-2d %-26s %-5d %-26s %-5d %-11s -> %-11s "
|
||||
"%.3f\n",
|
||||
e->skaddr, e->pid, e->task, family, saddr, e->sport, daddr,
|
||||
e->dport, tcp_states[e->oldstate], tcp_states[e->newstate],
|
||||
(double)e->delta_us / 1000);
|
||||
} else {
|
||||
printf(
|
||||
"%-16llx %-7d %-10.10s %-15s %-5d %-15s %-5d %-11s -> %-11s %.3f\n",
|
||||
e->skaddr, e->pid, e->task, saddr, e->sport, daddr, e->dport,
|
||||
tcp_states[e->oldstate], tcp_states[e->newstate],
|
||||
(double)e->delta_us / 1000);
|
||||
}
|
||||
}
|
||||
|
||||
static void handle_lost_events(void* ctx, int cpu, __u64 lost_cnt) {
|
||||
warn("lost %llu events on CPU #%d\n", lost_cnt, cpu);
|
||||
}
|
||||
|
||||
extern unsigned char _binary_min_core_btfs_tar_gz_start[] __attribute__((weak));
|
||||
extern unsigned char _binary_min_core_btfs_tar_gz_end[] __attribute__((weak));
|
||||
|
||||
|
||||
/* tar header from
|
||||
* https://github.com/tklauser/libtar/blob/v1.2.20/lib/libtar.h#L39-L60 */
|
||||
struct tar_header {
|
||||
char name[100];
|
||||
char mode[8];
|
||||
char uid[8];
|
||||
char gid[8];
|
||||
char size[12];
|
||||
char mtime[12];
|
||||
char chksum[8];
|
||||
char typeflag;
|
||||
char linkname[100];
|
||||
char magic[6];
|
||||
char version[2];
|
||||
char uname[32];
|
||||
char gname[32];
|
||||
char devmajor[8];
|
||||
char devminor[8];
|
||||
char prefix[155];
|
||||
char padding[12];
|
||||
};
|
||||
|
||||
static char* tar_file_start(struct tar_header* tar,
|
||||
const char* name,
|
||||
int* length) {
|
||||
while (tar->name[0]) {
|
||||
sscanf(tar->size, "%o", length);
|
||||
if (!strcmp(tar->name, name))
|
||||
return (char*)(tar + 1);
|
||||
tar += 1 + (*length + 511) / 512;
|
||||
}
|
||||
return NULL;
|
||||
}
|
||||
#define FIELD_LEN 65
|
||||
#define ID_FMT "ID=%64s"
|
||||
#define VERSION_FMT "VERSION_ID=\"%64s"
|
||||
|
||||
struct os_info {
|
||||
char id[FIELD_LEN];
|
||||
char version[FIELD_LEN];
|
||||
char arch[FIELD_LEN];
|
||||
char kernel_release[FIELD_LEN];
|
||||
};
|
||||
|
||||
static struct os_info* get_os_info() {
|
||||
struct os_info* info = NULL;
|
||||
struct utsname u;
|
||||
size_t len = 0;
|
||||
ssize_t read;
|
||||
char* line = NULL;
|
||||
FILE* f;
|
||||
|
||||
if (uname(&u) == -1)
|
||||
return NULL;
|
||||
|
||||
f = fopen("/etc/os-release", "r");
|
||||
if (!f)
|
||||
return NULL;
|
||||
|
||||
info = calloc(1, sizeof(*info));
|
||||
if (!info)
|
||||
goto out;
|
||||
|
||||
strncpy(info->kernel_release, u.release, FIELD_LEN);
|
||||
strncpy(info->arch, u.machine, FIELD_LEN);
|
||||
|
||||
while ((read = getline(&line, &len, f)) != -1) {
|
||||
if (sscanf(line, ID_FMT, info->id) == 1)
|
||||
continue;
|
||||
|
||||
if (sscanf(line, VERSION_FMT, info->version) == 1) {
|
||||
/* remove '"' suffix */
|
||||
info->version[strlen(info->version) - 1] = 0;
|
||||
continue;
|
||||
}
|
||||
}
|
||||
|
||||
out:
|
||||
free(line);
|
||||
fclose(f);
|
||||
|
||||
return info;
|
||||
}
|
||||
#define INITIAL_BUF_SIZE (1024 * 1024 * 4) /* 4MB */
|
||||
|
||||
/* adapted from https://zlib.net/zlib_how.html */
|
||||
static int inflate_gz(unsigned char* src,
|
||||
int src_size,
|
||||
unsigned char** dst,
|
||||
int* dst_size) {
|
||||
size_t size = INITIAL_BUF_SIZE;
|
||||
size_t next_size = size;
|
||||
z_stream strm;
|
||||
void* tmp;
|
||||
int ret;
|
||||
|
||||
strm.zalloc = Z_NULL;
|
||||
strm.zfree = Z_NULL;
|
||||
strm.opaque = Z_NULL;
|
||||
strm.avail_in = 0;
|
||||
strm.next_in = Z_NULL;
|
||||
|
||||
ret = inflateInit2(&strm, 16 + MAX_WBITS);
|
||||
if (ret != Z_OK)
|
||||
return -EINVAL;
|
||||
|
||||
*dst = malloc(size);
|
||||
if (!*dst)
|
||||
return -ENOMEM;
|
||||
|
||||
strm.next_in = src;
|
||||
strm.avail_in = src_size;
|
||||
|
||||
/* run inflate() on input until it returns Z_STREAM_END */
|
||||
do {
|
||||
strm.next_out = *dst + strm.total_out;
|
||||
strm.avail_out = next_size;
|
||||
ret = inflate(&strm, Z_NO_FLUSH);
|
||||
if (ret != Z_OK && ret != Z_STREAM_END)
|
||||
goto out_err;
|
||||
/* we need more space */
|
||||
if (strm.avail_out == 0) {
|
||||
next_size = size;
|
||||
size *= 2;
|
||||
tmp = realloc(*dst, size);
|
||||
if (!tmp) {
|
||||
ret = -ENOMEM;
|
||||
goto out_err;
|
||||
}
|
||||
*dst = tmp;
|
||||
}
|
||||
} while (ret != Z_STREAM_END);
|
||||
|
||||
*dst_size = strm.total_out;
|
||||
|
||||
/* clean up and return */
|
||||
ret = inflateEnd(&strm);
|
||||
if (ret != Z_OK) {
|
||||
ret = -EINVAL;
|
||||
goto out_err;
|
||||
}
|
||||
return 0;
|
||||
|
||||
out_err:
|
||||
free(*dst);
|
||||
*dst = NULL;
|
||||
return ret;
|
||||
}
|
||||
struct btf *btf__load_vmlinux_btf(void);
|
||||
void btf__free(struct btf *btf);
|
||||
static bool vmlinux_btf_exists(void) {
|
||||
struct btf* btf;
|
||||
int err;
|
||||
|
||||
btf = btf__load_vmlinux_btf();
|
||||
err = libbpf_get_error(btf);
|
||||
if (err)
|
||||
return false;
|
||||
|
||||
btf__free(btf);
|
||||
return true;
|
||||
}
|
||||
|
||||
static int ensure_core_btf(struct bpf_object_open_opts* opts) {
|
||||
char name_fmt[] = "./%s/%s/%s/%s.btf";
|
||||
char btf_path[] = "/tmp/bcc-libbpf-tools.btf.XXXXXX";
|
||||
struct os_info* info = NULL;
|
||||
unsigned char* dst_buf = NULL;
|
||||
char* file_start;
|
||||
int dst_size = 0;
|
||||
char name[100];
|
||||
FILE* dst = NULL;
|
||||
int ret;
|
||||
|
||||
/* do nothing if the system provides BTF */
|
||||
if (vmlinux_btf_exists())
|
||||
return 0;
|
||||
|
||||
/* compiled without min core btfs */
|
||||
if (!_binary_min_core_btfs_tar_gz_start)
|
||||
return -EOPNOTSUPP;
|
||||
|
||||
info = get_os_info();
|
||||
if (!info)
|
||||
return -errno;
|
||||
|
||||
ret = mkstemp(btf_path);
|
||||
if (ret < 0) {
|
||||
ret = -errno;
|
||||
goto out;
|
||||
}
|
||||
|
||||
dst = fdopen(ret, "wb");
|
||||
if (!dst) {
|
||||
ret = -errno;
|
||||
goto out;
|
||||
}
|
||||
|
||||
ret = snprintf(name, sizeof(name), name_fmt, info->id, info->version,
|
||||
info->arch, info->kernel_release);
|
||||
if (ret < 0 || ret == sizeof(name)) {
|
||||
ret = -EINVAL;
|
||||
goto out;
|
||||
}
|
||||
|
||||
ret = inflate_gz(
|
||||
_binary_min_core_btfs_tar_gz_start,
|
||||
_binary_min_core_btfs_tar_gz_end - _binary_min_core_btfs_tar_gz_start,
|
||||
&dst_buf, &dst_size);
|
||||
if (ret < 0)
|
||||
goto out;
|
||||
|
||||
ret = 0;
|
||||
file_start = tar_file_start((struct tar_header*)dst_buf, name, &dst_size);
|
||||
if (!file_start) {
|
||||
ret = -EINVAL;
|
||||
goto out;
|
||||
}
|
||||
|
||||
if (fwrite(file_start, 1, dst_size, dst) != dst_size) {
|
||||
ret = -ferror(dst);
|
||||
goto out;
|
||||
}
|
||||
|
||||
opts->btf_custom_path = strdup(btf_path);
|
||||
if (!opts->btf_custom_path)
|
||||
ret = -ENOMEM;
|
||||
|
||||
out:
|
||||
free(info);
|
||||
if (dst) /* dst is still NULL when mkstemp() or fdopen() failed */
fclose(dst);
|
||||
free(dst_buf);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void cleanup_core_btf(struct bpf_object_open_opts* opts) {
|
||||
if (!opts)
|
||||
return;
|
||||
|
||||
if (!opts->btf_custom_path)
|
||||
return;
|
||||
|
||||
unlink(opts->btf_custom_path);
|
||||
free((void*)opts->btf_custom_path);
|
||||
}
|
||||
|
||||
int main(int argc, char** argv) {
|
||||
LIBBPF_OPTS(bpf_object_open_opts, open_opts);
|
||||
static const struct argp argp = {
|
||||
.options = opts,
|
||||
.parser = parse_arg,
|
||||
.doc = argp_program_doc,
|
||||
};
|
||||
struct perf_buffer* pb = NULL;
|
||||
struct tcpstates_bpf* obj;
|
||||
int err, port_map_fd;
|
||||
short port_num;
|
||||
char* port;
|
||||
|
||||
err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
|
||||
if (err)
|
||||
return err;
|
||||
|
||||
libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
|
||||
libbpf_set_print(libbpf_print_fn);
|
||||
|
||||
err = ensure_core_btf(&open_opts);
|
||||
if (err) {
|
||||
warn("failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
|
||||
return 1;
|
||||
}
|
||||
|
||||
obj = tcpstates_bpf__open_opts(&open_opts);
|
||||
if (!obj) {
|
||||
warn("failed to open BPF object\n");
|
||||
return 1;
|
||||
}
|
||||
|
||||
obj->rodata->filter_by_sport = target_sports != NULL;
|
||||
obj->rodata->filter_by_dport = target_dports != NULL;
|
||||
obj->rodata->target_family = target_family;
|
||||
|
||||
err = tcpstates_bpf__load(obj);
|
||||
if (err) {
|
||||
warn("failed to load BPF object: %d\n", err);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (target_sports) {
|
||||
port_map_fd = bpf_map__fd(obj->maps.sports);
|
||||
port = strtok(target_sports, ",");
|
||||
while (port) {
|
||||
port_num = strtol(port, NULL, 10);
|
||||
bpf_map_update_elem(port_map_fd, &port_num, &port_num, BPF_ANY);
|
||||
port = strtok(NULL, ",");
|
||||
}
|
||||
}
|
||||
if (target_dports) {
|
||||
port_map_fd = bpf_map__fd(obj->maps.dports);
|
||||
port = strtok(target_dports, ",");
|
||||
while (port) {
|
||||
port_num = strtol(port, NULL, 10);
|
||||
bpf_map_update_elem(port_map_fd, &port_num, &port_num, BPF_ANY);
|
||||
port = strtok(NULL, ",");
|
||||
}
|
||||
}
|
||||
|
||||
err = tcpstates_bpf__attach(obj);
|
||||
if (err) {
|
||||
warn("failed to attach BPF programs: %d\n", err);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
|
||||
handle_event, handle_lost_events, NULL, NULL);
|
||||
if (!pb) {
|
||||
err = -errno;
|
||||
warn("failed to open perf buffer: %d\n", err);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (signal(SIGINT, sig_int) == SIG_ERR) {
|
||||
warn("can't set signal handler: %s\n", strerror(errno));
|
||||
err = 1;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (emit_timestamp)
|
||||
printf("%-8s ", "TIME(s)");
|
||||
if (wide_output)
|
||||
printf(
|
||||
"%-16s %-7s %-16s %-2s %-26s %-5s %-26s %-5s %-11s -> %-11s %s\n",
|
||||
"SKADDR", "PID", "COMM", "IP", "LADDR", "LPORT", "RADDR", "RPORT",
|
||||
"OLDSTATE", "NEWSTATE", "MS");
|
||||
else
|
||||
printf("%-16s %-7s %-10s %-15s %-5s %-15s %-5s %-11s -> %-11s %s\n",
|
||||
"SKADDR", "PID", "COMM", "LADDR", "LPORT", "RADDR", "RPORT",
|
||||
"OLDSTATE", "NEWSTATE", "MS");
|
||||
|
||||
while (!exiting) {
|
||||
err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
|
||||
if (err < 0 && err != -EINTR) {
|
||||
warn("error polling perf buffer: %s\n", strerror(-err));
|
||||
goto cleanup;
|
||||
}
|
||||
/* reset err to return 0 if exiting */
|
||||
err = 0;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
perf_buffer__free(pb);
|
||||
tcpstates_bpf__destroy(obj);
|
||||
cleanup_core_btf(&open_opts);
|
||||
|
||||
return err != 0;
|
||||
}
|
||||
23
src/14-tcpstates/libbpf-bootstrap/tcpstates.h
Normal file
@@ -0,0 +1,23 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
/* Copyright (c) 2021 Hengqi Chen */
|
||||
#ifndef __TCPSTATES_H
|
||||
#define __TCPSTATES_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
|
||||
struct event {
|
||||
unsigned __int128 saddr;
|
||||
unsigned __int128 daddr;
|
||||
__u64 skaddr;
|
||||
__u64 ts_us;
|
||||
__u64 delta_us;
|
||||
__u32 pid;
|
||||
int oldstate;
|
||||
int newstate;
|
||||
__u16 family;
|
||||
__u16 sport;
|
||||
__u16 dport;
|
||||
char task[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
#endif /* __TCPSTATES_H */
|
||||
109
src/14-tcpstates/tcpstates.bpf.c
Normal file
@@ -0,0 +1,109 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
/* Copyright (c) 2021 Hengqi Chen */
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include "tcpstates.bpf.h"
|
||||
|
||||
#define MAX_ENTRIES 10240
|
||||
#define AF_INET 2
|
||||
#define AF_INET6 10
|
||||
|
||||
const volatile bool filter_by_sport = false;
|
||||
const volatile bool filter_by_dport = false;
|
||||
const volatile short target_family = 0;
|
||||
|
||||
struct
|
||||
{
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, __u16);
|
||||
__type(value, __u16);
|
||||
} sports SEC(".maps");
|
||||
|
||||
struct
|
||||
{
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, __u16);
|
||||
__type(value, __u16);
|
||||
} dports SEC(".maps");
|
||||
|
||||
struct
|
||||
{
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, struct sock *);
|
||||
__type(value, __u64);
|
||||
} timestamps SEC(".maps");
|
||||
|
||||
struct
|
||||
{
|
||||
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
|
||||
__uint(key_size, sizeof(__u32));
|
||||
__uint(value_size, sizeof(__u32));
|
||||
} events SEC(".maps");
|
||||
|
||||
SEC("tracepoint/sock/inet_sock_set_state")
|
||||
int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)
|
||||
{
|
||||
struct sock *sk = (struct sock *)ctx->skaddr;
|
||||
__u16 family = ctx->family;
|
||||
__u16 sport = ctx->sport;
|
||||
__u16 dport = ctx->dport;
|
||||
__u64 *tsp, delta_us, ts;
|
||||
struct event event = {};
|
||||
|
||||
if (ctx->protocol != IPPROTO_TCP)
|
||||
return 0;
|
||||
|
||||
if (target_family && target_family != family)
|
||||
return 0;
|
||||
|
||||
if (filter_by_sport && !bpf_map_lookup_elem(&sports, &sport))
|
||||
return 0;
|
||||
|
||||
if (filter_by_dport && !bpf_map_lookup_elem(&dports, &dport))
|
||||
return 0;
|
||||
|
||||
tsp = bpf_map_lookup_elem(&timestamps, &sk);
|
||||
ts = bpf_ktime_get_ns();
|
||||
if (!tsp)
|
||||
delta_us = 0;
|
||||
else
|
||||
delta_us = (ts - *tsp) / 1000;
|
||||
|
||||
event.skaddr = (__u64)sk;
|
||||
event.ts_us = ts / 1000;
|
||||
event.delta_us = delta_us;
|
||||
event.pid = bpf_get_current_pid_tgid() >> 32;
|
||||
event.oldstate = ctx->oldstate;
|
||||
event.newstate = ctx->newstate;
|
||||
event.family = family;
|
||||
event.sport = sport;
|
||||
event.dport = dport;
|
||||
bpf_get_current_comm(&event.task, sizeof(event.task));
|
||||
|
||||
if (family == AF_INET)
|
||||
{
|
||||
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_rcv_saddr);
|
||||
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_daddr);
|
||||
}
|
||||
else
|
||||
{ /* family == AF_INET6 */
|
||||
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
|
||||
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_v6_daddr.in6_u.u6_addr32);
|
||||
}
|
||||
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
|
||||
|
||||
if (ctx->newstate == TCP_CLOSE)
|
||||
bpf_map_delete_elem(&timestamps, &sk);
|
||||
else
|
||||
bpf_map_update_elem(&timestamps, &sk, &ts, BPF_ANY);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
24
src/14-tcpstates/tcpstates.bpf.h
Normal file
@@ -0,0 +1,24 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
/* Copyright (c) 2021 Hengqi Chen */
|
||||
#ifndef __TCPSTATES_H
|
||||
#define __TCPSTATES_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
|
||||
struct event
|
||||
{
|
||||
unsigned __int128 saddr;
|
||||
unsigned __int128 daddr;
|
||||
__u64 skaddr;
|
||||
__u64 ts_us;
|
||||
__u64 delta_us;
|
||||
__u32 pid;
|
||||
int oldstate;
|
||||
int newstate;
|
||||
__u16 family;
|
||||
__u16 sport;
|
||||
__u16 dport;
|
||||
char task[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
#endif /* __TCPSTATES_H */
|
||||
23
src/15-tcprtt/README.md
Normal file
@@ -0,0 +1,23 @@
|
||||
# eBPF Hands-On Tutorial for Beginners: Writing the eBPF Program Tcprtt to Measure TCP Round-Trip Time
|
||||
|
||||
## Background
|
||||
|
||||
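Network quality matters a great deal in a connected society. Poor network quality has many possible causes: it may stem from hardware, or from poorly written software. The `tcprtt` tool was proposed to help locate such problems. It monitors the round-trip time (RTT) of TCP connections so that users can analyze network quality and track down the source of a problem.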
|
||||
|
||||
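When a TCP connection is established, the tool automatically selects a suitable handler function based on what the running system supports. In that handler, `tcprtt` collects basic information about the TCP connection, including the addresses, source port, destination port, and elapsed time, and updates the histogram map with it. When the run ends, user-space code presents the result to the user.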
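As a rough illustration of the histogram update described above, the sketch below buckets RTT samples by powers of two. It is a user-space stand-in, not the `tcprtt` source: the `hist` struct, `MAX_SLOTS`, `log2_slot`, and `hist_record` are hypothetical names, and the log2 bucketing scheme is an assumption about how such a histogram could be laid out.

```c
#include <stdint.h>

#define MAX_SLOTS 27 /* hypothetical number of buckets */

/* Hypothetical user-space stand-in for the BPF histogram map value. */
struct hist {
    uint32_t slots[MAX_SLOTS];
};

/* floor(log2(v)) for v > 0; returns 0 for v == 0. */
static unsigned int log2_slot(uint64_t v)
{
    unsigned int slot = 0;

    while ((v >>= 1) != 0)
        slot++;
    return slot;
}

/* Record one RTT sample (in microseconds) in its power-of-two bucket. */
static void hist_record(struct hist *h, uint64_t rtt_us)
{
    unsigned int slot = log2_slot(rtt_us);

    if (slot >= MAX_SLOTS)
        slot = MAX_SLOTS - 1; /* clamp very large samples into the last bucket */
    h->slots[slot]++;
}
```

With these definitions, `hist_record(&h, 1000)` increments slot 9, because 512 <= 1000 < 1024. In a BPF program the same slot computation would run in the kernel and the array would live in a map that user space reads at the end of the run.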
|
||||
|
||||
## Writing the eBPF Program
|
||||
|
||||
TODO
|
||||
|
||||
## Compiling and Running
|
||||
|
||||
TODO
|
||||
|
||||
## Summary
|
||||
|
||||
TODO
|
||||
27
src/16-memleak/README.md
Normal file
@@ -0,0 +1,27 @@
|
||||
# eBPF Hands-On Tutorial for Beginners: Writing the eBPF Program Memleak to Monitor Memory Leaks
|
||||
|
||||
## Background
|
||||
|
||||
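A memory leak is a serious problem for a program. If a leaking program is left running, system memory is gradually exhausted and the program slows down noticeably. The `memleak` tool was proposed to avoid this. It traces and matches memory allocation and free requests, and prints the stack traces of allocations that have not yet been freed.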
|
||||
|
||||
## How It Works
|
||||
|
||||
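The logic of `memleak` is straightforward. It attaches eBPF programs to the paths of the commonly used dynamic memory allocation functions, and also attaches an eBPF program to `free`. When an allocation function is called, `memleak` records basic data such as the caller's pid, the address of the allocated memory, and its size. After `free` is called, `memleak` deletes the corresponding allocation record from the map. For common user-space allocation functions such as `malloc` and `calloc`, `memleak` attaches through uprobes; for kernel functions such as `kmalloc`, it uses existing tracepoints.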
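The record-on-allocate, delete-on-free bookkeeping described above can be sketched in plain user-space C. This is only an illustration under stated assumptions: the fixed-size `allocs` table stands in for the BPF hash map, and `alloc_info`, `on_alloc`, `on_free`, and `outstanding_bytes` are hypothetical names, not taken from the `memleak` source.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ALLOCS 64 /* hypothetical table capacity */

/* Hypothetical record holding the pid, address, and size described above. */
struct alloc_info {
    uint64_t addr;
    size_t size;
    uint32_t pid;
    int in_use;
};

/* Stand-in for the BPF hash map keyed by allocation address. */
static struct alloc_info allocs[MAX_ALLOCS];

/* Runs where the allocation probe would fire. Returns 0, or -1 if the table is full. */
static int on_alloc(uint32_t pid, uint64_t addr, size_t size)
{
    for (int i = 0; i < MAX_ALLOCS; i++) {
        if (!allocs[i].in_use) {
            allocs[i] = (struct alloc_info){
                .addr = addr, .size = size, .pid = pid, .in_use = 1,
            };
            return 0;
        }
    }
    return -1;
}

/* Runs where the free probe would fire: drop the matching record, if any. */
static void on_free(uint64_t addr)
{
    for (int i = 0; i < MAX_ALLOCS; i++) {
        if (allocs[i].in_use && allocs[i].addr == addr) {
            allocs[i].in_use = 0;
            return;
        }
    }
}

/* Total bytes that were allocated but have not been freed yet. */
static size_t outstanding_bytes(void)
{
    size_t total = 0;

    for (int i = 0; i < MAX_ALLOCS; i++)
        if (allocs[i].in_use)
            total += allocs[i].size;
    return total;
}
```

Whatever remains in the table when the tool reports is a leak candidate. A real tool would also store a stack ID with each record so that the allocation site can be printed.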
|
||||
|
||||
## Writing the eBPF Program
|
||||
|
||||
TODO
|
||||
|
||||
## Compiling and Running
|
||||
|
||||
TODO
|
||||
|
||||
## Summary
|
||||
|
||||
TODO
|
||||
23
src/17-biopattern/README.md
Normal file
@@ -0,0 +1,23 @@
|
||||
# eBPF Hands-On Tutorial for Beginners: Writing the eBPF Program Biopattern to Count Random and Sequential Disk I/O
|
||||
|
||||
## Background
|
||||
|
||||
Biopattern reports the ratio of random to sequential disk I/O operations.
|
||||
|
||||
TODO
|
||||
|
||||
## How It Works
|
||||
|
||||
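The eBPF code of Biopattern is attached at the `tracepoint/block/block_rq_complete` attach point, which is hit after the disk completes an I/O request. Biopattern keeps a hash table keyed by device number. Each time the attach point is hit, Biopattern reads the request information, compares it with the last operation recorded for that device in the hash table to decide whether this request is random or sequential I/O, and updates the counters.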
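The classification step described above can be sketched in user-space C. The sketch assumes a request counts as sequential when it starts exactly where the previous request on the same device ended; the `io_counter` struct and `record_io` function are hypothetical names, and treating a zero `last_end_sector` as "no previous request" is a simplification made for this example.

```c
#include <stdint.h>

/* Hypothetical per-device state; in the BPF program this would be the
 * hash-map value stored under the device number. */
struct io_counter {
    uint64_t last_end_sector; /* 0 means no request has been seen yet */
    uint32_t sequential;
    uint32_t random;
};

/* Classify one completed request and update the per-device counters. */
static void record_io(struct io_counter *c, uint64_t sector, uint32_t nr_sectors)
{
    if (c->last_end_sector != 0) {
        if (sector == c->last_end_sector)
            c->sequential++;
        else
            c->random++;
    }
    /* Remember where this request ended for the next comparison. */
    c->last_end_sector = sector + nr_sectors;
}
```

User space could then report `random * 100 / (random + sequential)` per device as the random I/O percentage.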
|
||||
|
||||
## Writing the eBPF Program
|
||||
|
||||
TODO
|
||||
|
||||
## Summary
|
||||
|
||||
Biopattern shows the ratio of random to sequential disk I/O operations, which helps developers understand the overall I/O behavior.
|
||||
|
||||
TODO
|
||||
3
src/18-further-reading/README.md
Normal file
@@ -0,0 +1,3 @@
|
||||
# Further Reading
|
||||
|
||||
TODO
|
||||
6
src/19-lsm-connect/.gitignore
vendored
Normal file
@@ -0,0 +1,6 @@
|
||||
.vscode
|
||||
package.json
|
||||
*.o
|
||||
*.skel.json
|
||||
*.skel.yaml
|
||||
package.yaml
|
||||
39
src/19-lsm-connect/README.md
Normal file
@@ -0,0 +1,39 @@
|
||||
# eBPF Hands-On Tutorial for Beginners: Security Detection and Defense with LSM
|
||||
|
||||
## Background
|
||||
|
||||
TODO
|
||||
|
||||
## LSM Overview
|
||||
|
||||
TODO
|
||||
|
||||
## Writing the eBPF Program
|
||||
|
||||
TODO
|
||||
|
||||
## Compiling and Running
|
||||
|
||||
```console
|
||||
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
|
||||
```
|
||||
|
||||
or compile with `ecc`:
|
||||
|
||||
```console
|
||||
$ ecc lsm-connect.bpf.c
|
||||
Compiling bpf object...
|
||||
Packing ebpf object and config into package.json...
|
||||
```
|
||||
|
||||
Run:
|
||||
|
||||
```console
|
||||
sudo ecli examples/bpftools/lsm-connect/package.json
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
TODO
|
||||
|
||||
Reference: <https://github.com/leodido/demo-cloud-native-ebpf-day>
|
||||
41
src/19-lsm-connect/lsm-connect.bpf.c
Normal file
@@ -0,0 +1,41 @@
|
||||
#include "vmlinux.h"
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
|
||||
#define EPERM 1
|
||||
#define AF_INET 2
|
||||
|
||||
const __u32 blockme = 16843009; // 1.1.1.1 -> int
|
||||
|
||||
SEC("lsm/socket_connect")
|
||||
int BPF_PROG(restrict_connect, struct socket *sock, struct sockaddr *address, int addrlen, int ret)
|
||||
{
|
||||
// Satisfying "cannot override a denial" rule
|
||||
if (ret != 0)
|
||||
{
|
||||
return ret;
|
||||
}
|
||||
|
||||
// Only IPv4 in this example
|
||||
if (address->sa_family != AF_INET)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
// Cast the address to an IPv4 socket address
|
||||
struct sockaddr_in *addr = (struct sockaddr_in *)address;
|
||||
|
||||
// Where do you want to go?
|
||||
__u32 dest = addr->sin_addr.s_addr;
|
||||
bpf_printk("lsm: found connect to %d", dest);
|
||||
|
||||
if (dest == blockme)
|
||||
{
|
||||
bpf_printk("lsm: blocking %d", dest);
|
||||
return -EPERM;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
7
src/2-kprobe-unlink/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
.vscode
|
||||
package.json
|
||||
*.o
|
||||
*.skel.json
|
||||
*.skel.yaml
|
||||
package.yaml
|
||||
ecli
|
||||
106
src/2-kprobe-unlink/README.md
Normal file
@@ -0,0 +1,106 @@
|
||||
# eBPF Hands-On Development Tutorial for Beginners, Part 2: Using kprobe in eBPF to Monitor the unlink System Call
|
||||
|
||||
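eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It lets developers dynamically load, update, and run user-defined code while the kernel is running.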
|
||||
|
||||
This is the second article in the eBPF hands-on development tutorial series; it uses kprobe in eBPF to capture the unlink system call.
|
||||
|
||||
## Background on kprobes
|
||||
|
||||
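While debugging the kernel or a module, developers often need to know whether certain functions are called, when they are called, whether they execute correctly, and what their arguments and return values are. A simple approach is to add log statements to the relevant kernel functions, but that usually means rebuilding the kernel or module and rebooting the device, which is cumbersome and may even disturb the original execution flow.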
|
||||
|
||||
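With kprobes, users can define their own callback functions and dynamically insert probe points into almost any function in the kernel or a module (some functions cannot be probed, such as the functions that implement kprobes themselves; see the limitations listed below). When the kernel execution flow reaches the probed function, the callback is invoked so the user can collect the information needed, and the kernel then returns to its normal flow. Once enough information has been collected and probing is no longer needed, the probe points can be removed dynamically as well. kprobes therefore has little impact on the kernel execution flow and is convenient to use.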
|
||||
|
||||
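kprobes offers three probing mechanisms: kprobe, jprobe, and kretprobe. kprobe is the most basic one and the foundation of the other two. It can place a probe point at an arbitrary location (even at an instruction inside a function) and provides three callbacks: `pre_handler`, `post_handler`, and `fault_handler`. `pre_handler` is called before the probed instruction executes, `post_handler` is called after the probed instruction (not the probed function) has executed, and `fault_handler` is called when a memory access fault occurs. jprobe is built on kprobe and is used to obtain the arguments of the probed function. kretprobe, as its name suggests, is also built on kprobe and is used to obtain the return value of the probed function.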
|
||||
|
||||
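kprobes is not a pure software mechanism; it also needs support from the hardware architecture, namely CPU exception handling and single-stepping. The former traps the execution flow into the user-registered callback, and the latter single-steps the probed instruction. As a result, not every architecture is supported. kprobes currently supports several architectures, including i386, x86_64, ppc64, ia64, sparc64, arm, ppc, and mips (some implementations may be incomplete; see Documentation/kprobes.txt in the kernel source).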
|
||||
|
||||
Characteristics and limitations of kprobes:
|
||||
|
||||
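1. kprobes allows multiple kprobes to be registered at the same probed location, but jprobe currently does not. It is also not allowed to use another jprobe callback or a kprobe `post_handler` callback as a probe point.
2. In general, any function in the kernel can be probed, including interrupt handlers. However, the functions in kernel/kprobes.c and arch/*/kernel/kprobes.c that implement kprobes themselves cannot be probed, nor can `do_page_fault` and `notifier_call_chain`.
3. If an inline function is used as a probe point, kprobes may not be able to register the probe for every instance of that function. Because gcc may automatically inline some functions, the probe may not behave as the user expects.
4. A probe callback may modify the context in which the probed function runs, for example by changing kernel data structures or the register state saved in `struct pt_regs` before the probe fired. kprobes can therefore be used to install bug fixes or to inject faults for testing.
5. kprobes avoids calling another probe's callback while a probe is being handled. For example, if a probe is registered on `printk()` and its callback calls `printk()` again, the `printk` probe callback is not triggered a second time; only the `nmissed` field of the `kprobe` structure is incremented.
6. Except during registration and unregistration, kprobes does not use mutexes or allocate memory dynamically.
7. kprobe callbacks run with kernel preemption disabled and, depending on the CPU architecture, possibly with interrupts disabled as well. In any case, a callback must not call functions that may give up the CPU (such as those that take semaphores or mutexes).
8. kretprobe works by replacing the return address with the address of a predefined trampoline, so stack backtraces and calls to the gcc builtin `__builtin_return_address()` return the trampoline address rather than the real return address of the probed function.
9. If a function is not entered and returned from the same number of times, registering a kretprobe on it may not give the expected result. `do_exit()` is such a problem case, whereas `do_execve()` and `do_fork()` are not.
10. If the CPU is running on a stack other than that of the current task when entering and leaving a function, registering a kretprobe on that function may have unpredictable results. kprobes therefore does not support registering a kretprobe on `__switch_to()` on x86_64 and returns `-EINVAL` directly.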
|
||||
|
||||
## kprobe
|
||||
|
||||
```c
|
||||
#include "vmlinux.h"
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
|
||||
SEC("kprobe/do_unlinkat")
|
||||
int BPF_KPROBE(do_unlinkat, int dfd, struct filename *name)
|
||||
{
|
||||
pid_t pid;
|
||||
const char *filename;
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
filename = BPF_CORE_READ(name, name);
|
||||
bpf_printk("KPROBE ENTRY pid = %d, filename = %s\n", pid, filename);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("kretprobe/do_unlinkat")
|
||||
int BPF_KRETPROBE(do_unlinkat_exit, long ret)
|
||||
{
|
||||
pid_t pid;
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
bpf_printk("KPROBE EXIT: pid = %d, ret = %ld\n", pid, ret);
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
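This example shows how eBPF handles kernel-space entry and exit (return) probes (kprobe and kretprobe). It attaches kprobe and kretprobe BPF programs to the `do_unlinkat()` function and uses the `bpf_printk()` macro to log the PID, the filename, and the return value.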
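Both programs above compute the PID with `bpf_get_current_pid_tgid() >> 32`. The helper returns a 64-bit value with the thread-group ID (the PID seen from user space) in the upper 32 bits and the kernel task ID (the thread ID) in the lower 32 bits. The user-space sketch below only illustrates that split; `tgid_of` and `tid_of` are hypothetical helper names.

```c
#include <stdint.h>

/* Upper 32 bits: thread-group ID, i.e. what user space calls the PID. */
static uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: kernel task ID, i.e. the thread ID. */
static uint32_t tid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}
```

For a single-threaded process such as `rm`, both halves hold the same number, which is why the distinction is easy to miss in small tests.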
|
||||
|
||||
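eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain combined with Wasm. Its goal is to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime.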
|
||||
|
||||
To compile this program, use the ecc tool:
|
||||
|
||||
```console
|
||||
$ ecc kprobe-link.bpf.c
|
||||
Compiling bpf object...
|
||||
Packing ebpf object and config into package.json...
|
||||
```
|
||||
|
||||
Then run:
|
||||
|
||||
```console
|
||||
sudo ecli package.json
|
||||
```
|
||||
|
||||
In another window:
|
||||
|
||||
```shell
|
||||
touch test1
|
||||
rm test1
|
||||
touch test2
|
||||
rm test2
|
||||
```
|
||||
|
||||
In the /sys/kernel/debug/tracing/trace_pipe file, you should see kprobe demo output similar to the following:
|
||||
|
||||
```shell
|
||||
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
|
||||
rm-9346 [005] d..3 4710.951696: bpf_trace_printk: KPROBE ENTRY pid = 9346, filename = test1
|
||||
rm-9346 [005] d..4 4710.951819: bpf_trace_printk: KPROBE EXIT: ret = 0
|
||||
rm-9346 [005] d..3 4710.951852: bpf_trace_printk: KPROBE ENTRY pid = 9346, filename = test2
|
||||
rm-9346 [005] d..4 4710.951895: bpf_trace_printk: KPROBE EXIT: ret = 0
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
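In this example we learned how to use eBPF kprobe and kretprobe programs to capture the unlink system call. For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>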
|
||||
|
||||
This is the second article in the eBPF hands-on development tutorial series. The next article covers using fentry in eBPF to monitor the unlink system call.
|
||||
|
||||
The complete tutorial and source code are open source and available at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
|
||||
30
src/2-kprobe-unlink/kprobe-link.bpf.c
Normal file
@@ -0,0 +1,30 @@
|
||||
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
|
||||
/* Copyright (c) 2021 Sartura */
|
||||
#include "vmlinux.h"
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
|
||||
SEC("kprobe/do_unlinkat")
|
||||
int BPF_KPROBE(do_unlinkat, int dfd, struct filename *name)
|
||||
{
|
||||
pid_t pid;
|
||||
const char *filename;
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
filename = BPF_CORE_READ(name, name);
|
||||
bpf_printk("KPROBE ENTRY pid = %d, filename = %s\n", pid, filename);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("kretprobe/do_unlinkat")
|
||||
int BPF_KRETPROBE(do_unlinkat_exit, long ret)
|
||||
{
|
||||
pid_t pid;
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
bpf_printk("KPROBE EXIT: pid = %d, ret = %ld\n", pid, ret);
|
||||
return 0;
|
||||
}
|
||||
10
src/20-tc/.gitignore
vendored
Executable file
@@ -0,0 +1,10 @@
|
||||
.vscode
|
||||
package.json
|
||||
*.wasm
|
||||
ewasm-skel.h
|
||||
ecli
|
||||
ewasm
|
||||
*.o
|
||||
*.skel.json
|
||||
*.skel.yaml
|
||||
package.yaml
|
||||
89
src/20-tc/README.md
Normal file
@@ -0,0 +1,89 @@
|
||||
# eBPF Hands-On Tutorial for Beginners: Traffic Control with eBPF and tc
|
||||
|
||||
## Example tc Program
|
||||
|
||||
```c
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_endian.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
#define TC_ACT_OK 0
|
||||
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
|
||||
|
||||
/// @tchook {"ifindex":1, "attach_point":"BPF_TC_INGRESS"}
|
||||
/// @tcopts {"handle":1, "priority":1}
|
||||
SEC("tc")
|
||||
int tc_ingress(struct __sk_buff *ctx)
|
||||
{
|
||||
void *data_end = (void *)(__u64)ctx->data_end;
|
||||
void *data = (void *)(__u64)ctx->data;
|
||||
struct ethhdr *l2;
|
||||
struct iphdr *l3;
|
||||
|
||||
if (ctx->protocol != bpf_htons(ETH_P_IP))
|
||||
return TC_ACT_OK;
|
||||
|
||||
l2 = data;
|
||||
if ((void *)(l2 + 1) > data_end)
|
||||
return TC_ACT_OK;
|
||||
|
||||
l3 = (struct iphdr *)(l2 + 1);
|
||||
if ((void *)(l3 + 1) > data_end)
|
||||
return TC_ACT_OK;
|
||||
|
||||
bpf_printk("Got IP packet: tot_len: %d, ttl: %d", bpf_ntohs(l3->tot_len), l3->ttl);
|
||||
return TC_ACT_OK;
|
||||
}
|
||||
|
||||
char __license[] SEC("license") = "GPL";
|
||||
```
|
||||
|
||||
This code defines an eBPF program that captures and processes packets through Linux TC (Traffic Control). The program only handles IPv4 packets and uses `bpf_printk` to print each packet's total length and Time-To-Live (TTL) field.
|
||||
|
||||
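Note that the code uses some BPF library functions, such as `bpf_htons` and `bpf_ntohs`, which convert between network byte order and host byte order. It also uses comments to provide the TC attach point and options. For example, the code contains the following comments just before the program definition: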
|
||||
|
||||
```c
|
||||
/// @tchook {"ifindex":1, "attach_point":"BPF_TC_INGRESS"}
|
||||
/// @tcopts {"handle":1, "priority":1}
|
||||
```
|
||||
|
||||
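These comments request that the eBPF program be attached at the TC ingress attach point of the network interface, and they specify the values of the `handle` and `priority` options.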
|
||||
|
||||
In short, this code implements a simple eBPF program that captures packets and prints information about them.
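The bounds checks and byte-order conversions in the program can be mirrored in ordinary user-space C, which may help when reasoning about the header offsets. The sketch below is an illustration only: `parse_ipv4`, `struct ip_summary`, and the three constants are hypothetical names, and the code reads the fields by fixed offset instead of using the kernel's `struct ethhdr` and `struct iphdr`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h> /* htons, ntohs */

#define ETH_HDR_LEN 14       /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define IPV4_MIN_HDR_LEN 20  /* IPv4 header without options */
#define ETHERTYPE_IPV4 0x0800

struct ip_summary {
    uint16_t tot_len; /* host byte order */
    uint8_t ttl;
};

/* Returns 0 and fills *out for an IPv4 frame, -1 for anything else.
 * data/data_end play the same role as ctx->data and ctx->data_end. */
static int parse_ipv4(const uint8_t *data, const uint8_t *data_end,
                      struct ip_summary *out)
{
    size_t len = (size_t)(data_end - data);
    uint16_t proto, tot_len;
    const uint8_t *ip;

    /* Like the l2 check: the Ethernet header must fit in the buffer. */
    if (len < ETH_HDR_LEN)
        return -1;
    memcpy(&proto, data + 12, sizeof(proto));
    if (proto != htons(ETHERTYPE_IPV4))
        return -1;

    /* Like the l3 check: the fixed part of the IPv4 header must fit too. */
    if (len < ETH_HDR_LEN + IPV4_MIN_HDR_LEN)
        return -1;
    ip = data + ETH_HDR_LEN;

    memcpy(&tot_len, ip + 2, sizeof(tot_len)); /* bytes 2-3: total length */
    out->tot_len = ntohs(tot_len);
    out->ttl = ip[8];                          /* byte 8: TTL */
    return 0;
}
```

In the BPF program the equivalent checks are not optional: the verifier rejects a program that reads packet bytes without first comparing the access against `data_end`.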
|
||||
|
||||
## Compiling and Running
|
||||
|
||||
```console
|
||||
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
|
||||
```
|
||||
|
||||
or compile with `ecc`:
|
||||
|
||||
```console
|
||||
$ ecc tc.bpf.c
|
||||
Compiling bpf object...
|
||||
Packing ebpf object and config into package.json...
|
||||
```
|
||||
|
||||
```shell
|
||||
$ sudo ecli ./package.json
|
||||
...
|
||||
Successfully started! Please run `sudo cat /sys/kernel/debug/tracing/trace_pipe` to see output of the BPF program.
|
||||
......
|
||||
```
|
||||
|
||||
The `tc` output in `/sys/kernel/debug/tracing/trace_pipe` should look
|
||||
something like this:
|
||||
|
||||
```console
|
||||
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
|
||||
node-1254811 [007] ..s1 8737831.671074: 0: Got IP packet: tot_len: 79, ttl: 64
|
||||
sshd-1254728 [006] ..s1 8737831.674334: 0: Got IP packet: tot_len: 79, ttl: 64
|
||||
sshd-1254728 [006] ..s1 8737831.674349: 0: Got IP packet: tot_len: 72, ttl: 64
|
||||
node-1254811 [007] ..s1 8737831.674550: 0: Got IP packet: tot_len: 71, ttl: 64
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
TODO
|
||||
36
src/20-tc/tc.bpf.c
Normal file
@@ -0,0 +1,36 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
/* Copyright (c) 2022 Hengqi Chen */
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_endian.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
#define TC_ACT_OK 0
|
||||
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
|
||||
|
||||
/// @tchook {"ifindex":1, "attach_point":"BPF_TC_INGRESS"}
|
||||
/// @tcopts {"handle":1, "priority":1}
|
||||
SEC("tc")
|
||||
int tc_ingress(struct __sk_buff *ctx)
|
||||
{
|
||||
void *data_end = (void *)(__u64)ctx->data_end;
|
||||
void *data = (void *)(__u64)ctx->data;
|
||||
struct ethhdr *l2;
|
||||
struct iphdr *l3;
|
||||
|
||||
if (ctx->protocol != bpf_htons(ETH_P_IP))
|
||||
return TC_ACT_OK;
|
||||
|
||||
l2 = data;
|
||||
if ((void *)(l2 + 1) > data_end)
|
||||
return TC_ACT_OK;
|
||||
|
||||
l3 = (struct iphdr *)(l2 + 1);
|
||||
if ((void *)(l3 + 1) > data_end)
|
||||
return TC_ACT_OK;
|
||||
|
||||
bpf_printk("Got IP packet: tot_len: %d, ttl: %d", bpf_ntohs(l3->tot_len), l3->ttl);
|
||||
return TC_ACT_OK;
|
||||
}
|
||||
|
||||
char __license[] SEC("license") = "GPL";
|
||||
1
src/21-xdp/README.md
Normal file
@@ -0,0 +1 @@
|
||||
# TODO
|
||||
7
src/3-fentry-unlink/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
.vscode
|
||||
package.json
|
||||
*.o
|
||||
*.skel.json
|
||||
*.skel.yaml
|
||||
package.yaml
|
||||
ecli
|
||||
80
src/3-fentry-unlink/README.md
Normal file
@@ -0,0 +1,80 @@
|
||||
# eBPF Hands-On Development Tutorial for Beginners, Part 3: Using fentry in eBPF to Monitor the unlink System Call
|
||||
|
||||
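eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It lets developers dynamically load, update, and run user-defined code while the kernel is running.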
|
||||
|
||||
This is the third article in the eBPF hands-on development tutorial series; it uses fentry in eBPF to capture the unlink system call.
|
||||
|
||||
## Fentry
|
||||
|
||||
```c
|
||||
#include "vmlinux.h"
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
|
||||
SEC("fentry/do_unlinkat")
|
||||
int BPF_PROG(do_unlinkat, int dfd, struct filename *name)
|
||||
{
|
||||
pid_t pid;
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
bpf_printk("fentry: pid = %d, filename = %s\n", pid, name->name);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("fexit/do_unlinkat")
|
||||
int BPF_PROG(do_unlinkat_exit, int dfd, struct filename *name, long ret)
|
||||
{
|
||||
pid_t pid;
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
bpf_printk("fexit: pid = %d, filename = %s, ret = %ld\n", pid, name->name, ret);
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
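This program defines two functions, `do_unlinkat` and `do_unlinkat_exit`, attached to the entry (fentry) and exit (fexit) of the kernel function `do_unlinkat`, so they run when `do_unlinkat` is entered and when it returns. They use `bpf_get_current_pid_tgid` and `bpf_printk` to obtain the ID of the process calling `do_unlinkat`, the filename, and the return value, and print them to the kernel log.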
|
||||
|
||||
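Compared with kprobes, fentry and fexit programs offer better performance and usability. In this example we can dereference the function's pointer arguments directly, as in ordinary C code, without the various read helpers. The main difference between fexit and kretprobe programs is that an fexit program can access both the input arguments and the return value of the function, whereas a kretprobe can only access the return value.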
|
||||
|
||||
fentry and fexit programs are available starting with kernel 5.5.
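A loader could check this version requirement from user space before attaching. The sketch below parses a kernel release string of the kind found in the `release` field filled in by `uname(2)`. `kernel_at_least` is a hypothetical helper, and a version comparison alone is only an approximation, since architecture and kernel configuration may also matter.

```c
#include <stdio.h>

/* Returns 1 if `release` (for example "5.15.0-56-generic") is at least
 * major.minor, 0 if it is older, and -1 if the string cannot be parsed. */
static int kernel_at_least(const char *release, int major, int minor)
{
    int maj, min;

    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return -1;
    if (maj != major)
        return maj > major;
    return min >= minor;
}
```

A caller would pass the `release` string obtained from `uname()` together with `5, 5`, and fall back to the kprobe version of the program when the result is not 1.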
|
||||
|
||||
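eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain combined with Wasm. Its goal is to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime. We use eunomia-bpf to compile and run this example.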
|
||||
|
||||
Compile and run the code above:
|
||||
|
||||
```console
|
||||
$ ecc fentry-link.bpf.c
|
||||
Compiling bpf object...
|
||||
Packing ebpf object and config into package.json...
|
||||
$ sudo ecli package.json
|
||||
Runing eBPF program...
|
||||
```
|
||||
|
||||
In another window:
|
||||
|
||||
```shell
|
||||
touch test_file
|
||||
rm test_file
|
||||
touch test_file2
|
||||
rm test_file2
|
||||
```
|
||||
|
||||
After running the program, view its output by reading the /sys/kernel/debug/tracing/trace_pipe file:
|
||||
|
||||
```console
|
||||
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
|
||||
rm-9290 [004] d..2 4637.798698: bpf_trace_printk: fentry: pid = 9290, filename = test_file
|
||||
rm-9290 [004] d..2 4637.798843: bpf_trace_printk: fexit: pid = 9290, filename = test_file, ret = 0
|
||||
rm-9290 [004] d..2 4637.798698: bpf_trace_printk: fentry: pid = 9290, filename = test_file2
|
||||
rm-9290 [004] d..2 4637.798843: bpf_trace_printk: fexit: pid = 9290, filename = test_file2, ret = 0
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
This eBPF program uses fentry and fexit to capture the entry and exit of `do_unlinkat`, obtains the calling process ID, the filename, and the return value with `bpf_get_current_pid_tgid` and `bpf_printk`, and prints them to the kernel log.
|
||||
|
||||
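The program can be compiled with the ecc tool and run with the ecli command, and its output can be viewed in the /sys/kernel/debug/tracing/trace_pipe file. For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>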
|
||||
|
||||
The complete tutorial and source code are open source and available at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
|
||||
27
src/3-fentry-unlink/fentry-link.bpf.c
Normal file
@@ -0,0 +1,27 @@
|
||||
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
|
||||
/* Copyright (c) 2021 Sartura */
|
||||
#include "vmlinux.h"
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
|
||||
SEC("fentry/do_unlinkat")
|
||||
int BPF_PROG(do_unlinkat, int dfd, struct filename *name)
|
||||
{
|
||||
pid_t pid;
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
bpf_printk("fentry: pid = %d, filename = %s\n", pid, name->name);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("fexit/do_unlinkat")
|
||||
int BPF_PROG(do_unlinkat_exit, int dfd, struct filename *name, long ret)
|
||||
{
|
||||
pid_t pid;
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
bpf_printk("fexit: pid = %d, filename = %s, ret = %ld\n", pid, name->name, ret);
|
||||
return 0;
|
||||
}
|
||||
7
src/4-opensnoop/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
.vscode
|
||||
package.json
|
||||
eunomia-exporter
|
||||
ecli
|
||||
*.bpf.o
|
||||
*.skel.json
|
||||
*.skel.yaml
|
||||
103
src/4-opensnoop/README.md
Normal file
@@ -0,0 +1,103 @@
|
||||
# eBPF Hands-On Development Tutorial for Beginners, Part 4: Capturing the System Calls That Open Files and Filtering by Process pid with a Global Variable
|
||||
|
||||
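eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It lets developers dynamically load, update, and run user-defined code while the kernel is running.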
|
||||
|
||||
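This is the fourth article in the eBPF hands-on development tutorial series. It shows how to capture the system calls a process uses to open files, and how to filter by process pid in eBPF with a global variable.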
|
||||
|
||||
## Capturing the System Calls That Open Files in eBPF
|
||||
|
||||
First, we write an eBPF program that captures the system call a process uses to open a file:
|
||||
|
||||
```c
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
|
||||
/// @description "Process ID to trace"
|
||||
const volatile int pid_target = 0;
|
||||
|
||||
SEC("tracepoint/syscalls/sys_enter_openat")
|
||||
int tracepoint__syscalls__sys_enter_openat(struct trace_event_raw_sys_enter* ctx)
|
||||
{
|
||||
u64 id = bpf_get_current_pid_tgid();
|
||||
u32 pid = id >> 32; /* upper 32 bits hold the process ID (tgid) */
|
||||
|
||||
if (pid_target && pid_target != pid)
|
||||
return 0;
|
||||
// Use bpf_printk to print the process information
|
||||
bpf_printk("Process ID: %d enter sys openat\n", pid);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/// "Trace open family syscalls."
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
|
||||
```
|
||||
|
||||
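The eBPF program above defines the function `tracepoint__syscalls__sys_enter_openat` and uses the `SEC` macro to attach it to the `sys_enter_openat` tracepoint, so it runs on entry to the openat system call. The function calls `bpf_get_current_pid_tgid` to obtain the ID of the process making the openat call and prints it to the kernel log with `bpf_printk`.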
|
||||
|
||||
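eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain combined with Wasm. Its goal is to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime. We use eunomia-bpf to compile and run this example.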
|
||||
|
||||
Compile and run the code above:
|
||||
|
||||
```console
|
||||
$ ecc opensnoop.bpf.c
|
||||
Compiling bpf object...
|
||||
Packing ebpf object and config into package.json...
|
||||
$ sudo ecli package.json
|
||||
Runing eBPF program...
|
||||
```
|
||||
|
||||
After running the program, view its output by reading the /sys/kernel/debug/tracing/trace_pipe file:
|
||||
|
||||
```console
|
||||
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
|
||||
<...>-3840345 [010] d... 3220701.101179: bpf_trace_printk: Process ID: 3840345 enter sys openat
|
||||
<...>-3840345 [010] d... 3220702.158000: bpf_trace_printk: Process ID: 3840345 enter sys openat
|
||||
```
|
||||
|
||||
At this point, we can capture the system calls a process uses to open files.
|
||||
|
||||
## Filtering by Process pid with a Global Variable in eBPF
|
||||
|
||||
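The program above defines a global variable `pid_target` that specifies the pid of the process to capture. The `tracepoint__syscalls__sys_enter_openat` function uses this global variable to filter its output so that only the specified process is reported; the default value 0 disables the filter.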
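The filter condition in the program, `if (pid_target && pid_target != pid) return ...;`, can be restated as a small predicate. The user-space sketch below is only an illustration of that logic; `should_trace` is a hypothetical name that does not appear in the tutorial code.

```c
#include <stdint.h>

/* Mirrors the check in the BPF program: a target of 0 means "no filter",
 * otherwise only the matching pid is traced. */
static int should_trace(int pid_target, uint32_t pid)
{
    return pid_target == 0 || (uint32_t)pid_target == pid;
}
```

Because `pid_target` is declared `const volatile`, it is placed in the program's read-only data, and a loader sets it before the program is loaded. The tcpstates loader in this same commit does this for its own globals with assignments such as `obj->rodata->target_family = target_family;`.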
|
||||
|
||||
Run ecli with the -h option to see the help text for opensnoop:
|
||||
|
||||
```console
|
||||
$ ecli package.json -h
|
||||
Usage: opensnoop_bpf [--help] [--version] [--verbose] [--pid_target VAR]
|
||||
|
||||
Trace open family syscalls.
|
||||
|
||||
Optional arguments:
|
||||
-h, --help shows help message and exits
|
||||
-v, --version prints version information and exits
|
||||
--verbose prints libbpf debug information
|
||||
--pid_target Process ID to trace
|
||||
|
||||
Built with eunomia-bpf framework.
|
||||
See https://github.com/eunomia-bpf/eunomia-bpf for more information.
|
||||
```
|
||||
|
||||
Use the --pid_target option to specify the pid of the process to capture, for example:
|
||||
|
||||
```console
|
||||
$ sudo ./ecli run package.json --pid_target 618
|
||||
Runing eBPF program...
|
||||
```
|
||||
|
||||
After running the program, view its output by reading the /sys/kernel/debug/tracing/trace_pipe file:
|
||||
|
||||
```console
|
||||
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
|
||||
<...>-3840345 [010] d... 3220701.101179: bpf_trace_printk: Process ID: 618 enter sys openat
|
||||
<...>-3840345 [010] d... 3220702.158000: bpf_trace_printk: Process ID: 618 enter sys openat
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
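This article showed how to use an eBPF program to capture the system calls that processes make to open files. We define the tracepoint__syscalls__sys_enter_open and tracepoint__syscalls__sys_enter_openat functions and use the SEC macro to attach them to the sys_enter_open and sys_enter_openat tracepoints. We call bpf_get_current_pid_tgid to get the ID of the process invoking open or openat, and bpf_printk to print it to the kernel trace log. We can also define a global variable pid_target to specify the pid of the process to trace, so that only events from that process are printed.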
|
||||
|
||||
For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>
|
||||
|
||||
The complete tutorial and source code are open source and available at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
|
||||
21
src/4-opensnoop/opensnoop.bpf.c
Normal file
@@ -0,0 +1,21 @@
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
|
||||
/// @description "Process ID to trace"
|
||||
const volatile int pid_target = 0;
|
||||
|
||||
SEC("tracepoint/syscalls/sys_enter_openat")
|
||||
int tracepoint__syscalls__sys_enter_openat(struct trace_event_raw_sys_enter* ctx)
|
||||
{
|
||||
u64 id = bpf_get_current_pid_tgid();
|
||||
u32 pid = id >> 32; /* upper 32 bits hold the process ID (tgid) */
|
||||
|
||||
if (pid_target && pid_target != pid)
|
||||
return 0;
|
||||
// Use bpf_printk to print the process information
|
||||
bpf_printk("Process ID: %d enter sys openat\n", pid);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/// "Trace open family syscalls."
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
7
src/5-uprobe-bashreadline/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
.vscode
|
||||
package.json
|
||||
ecli
|
||||
*.o
|
||||
*.skel.json
|
||||
*.skel.yaml
|
||||
package.yaml
|
||||
112
src/5-uprobe-bashreadline/README.md
Normal file
@@ -0,0 +1,112 @@
|
||||
# eBPF Beginner Development Tutorial 5: Capturing bash readline function calls with uprobe in eBPF
|
||||
|
||||
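eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at kernel runtime.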
|
||||
|
||||
This is the fifth article in the eBPF beginner development tutorial series. It shows how to use uprobe to capture calls to the bash readline function.
|
||||
|
||||
## What is uprobe
|
||||
|
||||
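uprobe is a user-space probe that allows dynamic instrumentation of user-space programs. Probe points include function entry, specific offsets, and function return. When we define a uprobe, the kernel places a fast breakpoint instruction (int3 on x86) at the attached instruction. When the program executes that instruction, the kernel triggers an event, the program traps into kernel mode, and the probe function is invoked as a callback. After the probe function finishes, execution returns to user mode and continues with the subsequent instructions.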
|
||||
|
||||
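uprobe is file-based. When a function in a binary file is traced, every process that uses that file is instrumented, including processes that have not started yet, so calls to the function can be traced system-wide.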
|
||||
|
||||
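uprobe is well suited to parsing, in user space, traffic that kernel-mode probes cannot parse, such as HTTP/2 traffic (headers are encoded and the kernel cannot decode them) and HTTPS traffic (encrypted, and the kernel cannot decrypt it).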
|
||||
|
||||
## Capturing bash readline function calls with uprobe
|
||||
|
||||
uprobe is an eBPF probe type for capturing user-space function calls. With it we can capture the functions that user-space programs call.
|
||||
|
||||
For example, we can use uprobe to capture calls to the bash readline function and obtain the command lines that users type in bash. The example code is as follows:
|
||||
|
||||
```c
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
#define MAX_LINE_SIZE 80
|
||||
|
||||
/* Format of u[ret]probe section definition supporting auto-attach:
|
||||
* u[ret]probe/binary:function[+offset]
|
||||
*
|
||||
* binary can be an absolute/relative path or a filename; the latter is resolved to a
|
||||
* full binary path via bpf_program__attach_uprobe_opts.
|
||||
*
|
||||
* Specifying uprobe+ ensures we carry out strict matching; either "uprobe" must be
|
||||
* specified (and auto-attach is not possible) or the above format is specified for
|
||||
* auto-attach.
|
||||
*/
|
||||
SEC("uretprobe//bin/bash:readline")
|
||||
int BPF_KRETPROBE(printret, const void *ret)
|
||||
{
|
||||
char str[MAX_LINE_SIZE];
|
||||
char comm[TASK_COMM_LEN];
|
||||
u32 pid;
|
||||
|
||||
if (!ret)
|
||||
return 0;
|
||||
|
||||
bpf_get_current_comm(&comm, sizeof(comm));
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
bpf_probe_read_user_str(str, sizeof(str), ret);
|
||||
|
||||
bpf_printk("PID %d (%s) read: %s ", pid, comm, str);
|
||||
|
||||
return 0;
|
||||
};
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
```
|
||||
|
||||
This code runs the function defined with BPF_KRETPROBE, namely printret, when the bash readline function returns.
|
||||
|
||||
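In printret, we first get the name and ID of the process that called readline, then read the command line string entered by the user with bpf_probe_read_user_str, and finally print the process ID, process name, and command line string with bpf_printk.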
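
bpf_probe_read_user_str copies a NUL-terminated string into a fixed-size buffer and, on success, reports the length including the trailing NUL. The user-space sketch below models that bounded-copy behavior so the truncation at MAX_LINE_SIZE is easier to picture. The function name `bounded_copy_str` is hypothetical, and the sketch does not model the fault handling that the real helper performs when reading user memory.

```c
#include <stddef.h>

/* Copy at most size - 1 bytes from src and always NUL-terminate dst.
 * Returns the number of bytes written including the trailing NUL,
 * or 0 if size is 0 and nothing could be written. */
static long bounded_copy_str(char *dst, size_t size, const char *src)
{
    size_t i;

    if (size == 0)
        return 0;
    for (i = 0; i + 1 < size && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';
    return (long)(i + 1);
}
```

With an 80-byte buffer as in the probe, a command line longer than 79 characters is cut off rather than overflowing the stack buffer.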
|
||||
|
||||
In addition, we use the SEC macro to define the uprobe probe and the BPF_KRETPROBE macro to define the probe function.
|
||||
|
||||
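In the SEC macro, we specify the probe type, the path of the binary to instrument, and the name of the function to capture. For example, the SEC macro in the code above is: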
|
||||
|
||||
```c
|
||||
SEC("uretprobe//bin/bash:readline")
|
||||
```
|
||||
|
||||
This means we capture the readline function in the /bin/bash binary.
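
To make the `u[ret]probe/binary:function` format concrete, here is a small user-space sketch that splits such a section name into its parts. It is only a model for illustration: `struct uprobe_spec` and `parse_uprobe_section` are hypothetical names, this is not how libbpf itself parses section names, and the optional `+offset` suffix is deliberately not handled.

```c
#include <stddef.h>
#include <string.h>

struct uprobe_spec {
    int is_ret;          /* 1 for uretprobe, 0 for uprobe */
    char binary[128];    /* path or file name of the target binary */
    char function[64];   /* symbol to attach to */
};

/* Split "u[ret]probe/<binary>:<function>" into its parts.
 * Returns 0 on success, -1 if the string does not match the format. */
static int parse_uprobe_section(const char *sec, struct uprobe_spec *spec)
{
    const char *rest, *colon;
    size_t blen;

    if (strncmp(sec, "uretprobe/", 10) == 0) {
        spec->is_ret = 1;
        rest = sec + 10;
    } else if (strncmp(sec, "uprobe/", 7) == 0) {
        spec->is_ret = 0;
        rest = sec + 7;
    } else {
        return -1;
    }

    /* The last ':' separates the binary from the function name. */
    colon = strrchr(rest, ':');
    if (!colon || colon == rest || colon[1] == '\0')
        return -1;

    blen = (size_t)(colon - rest);
    if (blen >= sizeof(spec->binary) ||
        strlen(colon + 1) >= sizeof(spec->function))
        return -1;

    memcpy(spec->binary, rest, blen);
    spec->binary[blen] = '\0';
    strcpy(spec->function, colon + 1);
    return 0;
}
```

This also explains the double slash in the section name: the first slash ends the probe type and the second begins the absolute path /bin/bash.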
|
||||
|
||||
Next, we use the BPF_KRETPROBE macro to define the probe function, for example:
|
||||
|
||||
```c
|
||||
BPF_KRETPROBE(printret, const void *ret)
|
||||
```
|
||||
|
||||
Here printret is the name of the probe function, and const void *ret is its parameter, which holds the return value of the captured function.
|
||||
|
||||
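eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that integrates with Wasm. It aims to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime. We use eunomia-bpf to compile and run this example.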
|
||||
|
||||
Compile and run the code above:
|
||||
|
||||
```console
|
||||
$ ecc bashreadline.bpf.c bashreadline.h
|
||||
Compiling bpf object...
|
||||
Packing ebpf object and config into package.json...
|
||||
$ sudo ecli package.json
|
||||
Runing eBPF program...
|
||||
```
|
||||
|
||||
After the program starts, you can view the eBPF program's output by reading /sys/kernel/debug/tracing/trace_pipe:
|
||||
|
||||
```console
|
||||
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
|
||||
bash-32969 [000] d..31 64001.375748: bpf_trace_printk: PID 32969 (bash) read: fff
|
||||
bash-32969 [000] d..31 64002.056951: bpf_trace_printk: PID 32969 (bash) read: fff
|
||||
```
|
||||
|
||||
As shown, we successfully captured the bash readline function calls and obtained the command lines that the user typed in bash.
|
||||
|
||||
## Summary
|
||||
|
||||
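In the code above, we used the SEC macro to define a uprobe probe that specifies the user-space program to instrument (/bin/bash) and the function to capture (readline). We also used the BPF_KRETPROBE macro to define a callback function (printret) that handles the return value of readline. The function reads the return value of readline and prints it to the kernel trace log. In this way we can use eBPF to capture bash readline function calls and obtain the command lines that users type in bash.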
|
||||
|
||||
For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>
|
||||
|
||||
The complete tutorial and source code are open source and available at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
|
||||
38
src/5-uprobe-bashreadline/bashreadline.bpf.c
Normal file
@@ -0,0 +1,38 @@
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
#define MAX_LINE_SIZE 80
|
||||
|
||||
/* Format of u[ret]probe section definition supporting auto-attach:
|
||||
* u[ret]probe/binary:function[+offset]
|
||||
*
|
||||
* binary can be an absolute/relative path or a filename; the latter is resolved to a
|
||||
* full binary path via bpf_program__attach_uprobe_opts.
|
||||
*
|
||||
* Specifying uprobe+ ensures we carry out strict matching; either "uprobe" must be
|
||||
* specified (and auto-attach is not possible) or the above format is specified for
|
||||
* auto-attach.
|
||||
*/
|
||||
SEC("uretprobe//bin/bash:readline")
|
||||
int BPF_KRETPROBE(printret, const void *ret)
|
||||
{
|
||||
char str[MAX_LINE_SIZE];
|
||||
char comm[TASK_COMM_LEN];
|
||||
u32 pid;
|
||||
|
||||
if (!ret)
|
||||
return 0;
|
||||
|
||||
bpf_get_current_comm(&comm, sizeof(comm));
|
||||
|
||||
pid = bpf_get_current_pid_tgid() >> 32;
|
||||
bpf_probe_read_user_str(str, sizeof(str), ret);
|
||||
|
||||
bpf_printk("PID %d (%s) read: %s ", pid, comm, str);
|
||||
|
||||
return 0;
|
||||
};
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
10
src/6-sigsnoop/.gitignore
vendored
Executable file
@@ -0,0 +1,10 @@
|
||||
.vscode
|
||||
package.json
|
||||
*.wasm
|
||||
ewasm-skel.h
|
||||
ecli
|
||||
ewasm
|
||||
*.o
|
||||
*.skel.json
|
||||
*.skel.yaml
|
||||
package.yaml
|
||||
141
src/6-sigsnoop/README.md
Executable file
@@ -0,0 +1,141 @@
|
||||
# eBPF Beginner Development Tutorial 6: Capturing the system calls that processes use to send signals, using a hash map to store state
|
||||
|
||||
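eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at kernel runtime.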
|
||||
|
||||
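This is the sixth article in the eBPF beginner development tutorial series. It shows how to implement an eBPF tool that captures the system calls processes use to send signals, using a hash map to store state.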
|
||||
|
||||
## sigsnoop
|
||||
|
||||
The example code is as follows:
|
||||
|
||||
```c
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
#define MAX_ENTRIES 10240
|
||||
#define TASK_COMM_LEN 16
|
||||
|
||||
struct event {
|
||||
unsigned int pid;
|
||||
unsigned int tpid;
|
||||
int sig;
|
||||
int ret;
|
||||
char comm[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, __u32);
|
||||
__type(value, struct event);
|
||||
} values SEC(".maps");
|
||||
|
||||
|
||||
static int probe_entry(pid_t tpid, int sig)
|
||||
{
|
||||
struct event event = {};
|
||||
__u64 pid_tgid;
|
||||
__u32 tid;
|
||||
|
||||
pid_tgid = bpf_get_current_pid_tgid();
|
||||
tid = (__u32)pid_tgid;
|
||||
event.pid = pid_tgid >> 32;
|
||||
event.tpid = tpid;
|
||||
event.sig = sig;
|
||||
bpf_get_current_comm(event.comm, sizeof(event.comm));
|
||||
bpf_map_update_elem(&values, &tid, &event, BPF_ANY);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int probe_exit(void *ctx, int ret)
|
||||
{
|
||||
__u64 pid_tgid = bpf_get_current_pid_tgid();
|
||||
__u32 tid = (__u32)pid_tgid;
|
||||
struct event *eventp;
|
||||
|
||||
eventp = bpf_map_lookup_elem(&values, &tid);
|
||||
if (!eventp)
|
||||
return 0;
|
||||
|
||||
eventp->ret = ret;
|
||||
bpf_printk("PID %d (%s) sent signal %d to PID %d, ret = %d",
|
||||
eventp->pid, eventp->comm, eventp->sig, eventp->tpid, ret);
|
||||
|
||||
cleanup:
|
||||
bpf_map_delete_elem(&values, &tid);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("tracepoint/syscalls/sys_enter_kill")
|
||||
int kill_entry(struct trace_event_raw_sys_enter *ctx)
|
||||
{
|
||||
pid_t tpid = (pid_t)ctx->args[0];
|
||||
int sig = (int)ctx->args[1];
|
||||
|
||||
return probe_entry(tpid, sig);
|
||||
}
|
||||
|
||||
SEC("tracepoint/syscalls/sys_exit_kill")
|
||||
int kill_exit(struct trace_event_raw_sys_exit *ctx)
|
||||
{
|
||||
return probe_exit(ctx, ctx->ret);
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
```
|
||||
|
||||
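The code above defines an eBPF program that captures the system calls processes use to send signals. That family includes kill, tkill, and tgkill; the code shown here attaches only to kill. It uses tracepoints to capture the entry and exit events of the system call and runs the probe functions probe_entry and probe_exit when those events occur.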
|
||||
|
||||
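In the probe functions, we use a BPF map to store the captured event information, including the ID of the sending process, the ID of the receiving process, the signal value, and the system call's return value. The map is keyed by the calling thread's ID, so the exit handler finds the entry recorded by the same thread at entry. When the system call exits, we look up the event stored in the map and use bpf_printk to print the process ID, process name, signal sent, and return value.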
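
The enter/exit pairing can be modeled in plain user-space C. The sketch below replaces the BPF hash map with a small fixed-size array and keeps the same three steps: store at entry, look up at exit, delete afterwards. All names here (`struct pending`, `record_entry`, `complete_exit`, `MAX_INFLIGHT`) are assumptions made for illustration; the real program uses bpf_map_update_elem, bpf_map_lookup_elem, and bpf_map_delete_elem on the `values` map.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 64

struct pending {
    int used;
    uint32_t tid;   /* key: thread that entered the syscall */
    uint32_t tpid;  /* target pid of the signal */
    int sig;        /* signal number */
};

static struct pending table[MAX_INFLIGHT];

/* Entry side: remember the arguments under the caller's thread ID.
 * An existing entry for the same tid is overwritten.
 * Returns 0 on success, -1 if the table is full. */
static int record_entry(uint32_t tid, uint32_t tpid, int sig)
{
    struct pending *slot = NULL;

    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (table[i].used && table[i].tid == tid) {
            slot = &table[i];
            break;
        }
        if (!table[i].used && !slot)
            slot = &table[i];
    }
    if (!slot)
        return -1;

    slot->used = 1;
    slot->tid = tid;
    slot->tpid = tpid;
    slot->sig = sig;
    return 0;
}

/* Exit side: find the entry recorded by the same thread, copy it out,
 * and delete it. Returns 0 if found, -1 otherwise. */
static int complete_exit(uint32_t tid, struct pending *out)
{
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (table[i].used && table[i].tid == tid) {
            *out = table[i];
            table[i].used = 0;
            return 0;
        }
    }
    return -1;
}
```

Deleting the entry at exit matters: without it, stale entries from finished system calls would accumulate until the table (or the BPF map's max_entries) is exhausted.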
|
||||
|
||||
Finally, we use the SEC macro to define the probes, specifying the name of the system call to capture and the probe function to run.
|
||||
|
||||
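eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that integrates with Wasm. It aims to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime. We use eunomia-bpf to compile and run this example.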
|
||||
|
||||
Compile and run the code above:
|
||||
|
||||
```shell
|
||||
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
|
||||
```
|
||||
|
||||
Or:
|
||||
|
||||
```console
|
||||
$ ecc sigsnoop.bpf.c sigsnoop.h
|
||||
Compiling bpf object...
|
||||
Generating export types...
|
||||
Packing ebpf object and config into package.json...
|
||||
$ sudo ecli package.json
|
||||
Runing eBPF program...
|
||||
```
|
||||
|
||||
After the program starts, you can view the eBPF program's output by reading /sys/kernel/debug/tracing/trace_pipe:
|
||||
|
||||
```console
|
||||
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
|
||||
node-3517 [003] d..31 82575.798191: bpf_trace_printk: PID 3517 (node) sent signal 0 to PID 3427, ret = 0
|
||||
node-15194 [003] d..31 82575.849227: bpf_trace_printk: PID 15194 (node) sent signal 0 to PID 3427, ret = 0
|
||||
node-30016 [003] d..31 82576.001361: bpf_trace_printk: PID 30016 (node) sent signal 0 to PID 3427, ret = 0
|
||||
cpptools-srv-38617 [002] d..31 82576.461085: bpf_trace_printk: PID 38617 (cpptools-srv) sent signal 0 to PID 30496, ret = 0
|
||||
node-30040 [002] d..31 82576.467720: bpf_trace_printk: PID 30016 (node) sent signal 0 to PID 3427, ret = 0
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
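This article showed how to implement an eBPF tool that captures the system calls processes use to send signals, using a hash map to store state. Using a hash map requires defining a structure: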
|
||||
|
||||
```c
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, __u32);
|
||||
__type(value, struct event);
|
||||
} values SEC(".maps");
|
||||
```
|
||||
|
||||
and accessing it through the corresponding APIs, such as bpf_map_lookup_elem, bpf_map_update_elem, and bpf_map_delete_elem.
|
||||
|
||||
For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>
|
||||
|
||||
The complete tutorial and source code are open source and available at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
|
||||
74
src/6-sigsnoop/sigsnoop.bpf.c
Executable file
@@ -0,0 +1,74 @@
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
|
||||
#define MAX_ENTRIES 10240
|
||||
#define TASK_COMM_LEN 16
|
||||
|
||||
struct event {
|
||||
unsigned int pid;
|
||||
unsigned int tpid;
|
||||
int sig;
|
||||
int ret;
|
||||
char comm[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, __u32);
|
||||
__type(value, struct event);
|
||||
} values SEC(".maps");
|
||||
|
||||
|
||||
static int probe_entry(pid_t tpid, int sig)
|
||||
{
|
||||
struct event event = {};
|
||||
__u64 pid_tgid;
|
||||
__u32 tid;
|
||||
|
||||
pid_tgid = bpf_get_current_pid_tgid();
|
||||
tid = (__u32)pid_tgid;
|
||||
event.pid = pid_tgid >> 32;
|
||||
event.tpid = tpid;
|
||||
event.sig = sig;
|
||||
bpf_get_current_comm(event.comm, sizeof(event.comm));
|
||||
bpf_map_update_elem(&values, &tid, &event, BPF_ANY);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int probe_exit(void *ctx, int ret)
|
||||
{
|
||||
__u64 pid_tgid = bpf_get_current_pid_tgid();
|
||||
__u32 tid = (__u32)pid_tgid;
|
||||
struct event *eventp;
|
||||
|
||||
eventp = bpf_map_lookup_elem(&values, &tid);
|
||||
if (!eventp)
|
||||
return 0;
|
||||
|
||||
eventp->ret = ret;
|
||||
bpf_printk("PID %d (%s) sent signal %d to PID %d, ret = %d",
|
||||
eventp->pid, eventp->comm, eventp->sig, eventp->tpid, ret);
|
||||
|
||||
cleanup:
|
||||
bpf_map_delete_elem(&values, &tid);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("tracepoint/syscalls/sys_enter_kill")
|
||||
int kill_entry(struct trace_event_raw_sys_enter *ctx)
|
||||
{
|
||||
pid_t tpid = (pid_t)ctx->args[0];
|
||||
int sig = (int)ctx->args[1];
|
||||
|
||||
return probe_entry(tpid, sig);
|
||||
}
|
||||
|
||||
SEC("tracepoint/syscalls/sys_exit_kill")
|
||||
int kill_exit(struct trace_event_raw_sys_exit *ctx)
|
||||
{
|
||||
return probe_exit(ctx, ctx->ret);
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
3
src/7-execsnoop/.gitignore
vendored
Normal file
@@ -0,0 +1,3 @@
|
||||
ecli
|
||||
*.json
|
||||
|
||||
125
src/7-execsnoop/README.md
Normal file
@@ -0,0 +1,125 @@
|
||||
# eBPF Beginner Tutorial 7: Capturing process execution events and printing output to user space through a perf event array
|
||||
|
||||
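eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at kernel runtime.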
|
||||
|
||||
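This is the seventh article in the eBPF beginner development tutorial series. It shows how to capture process execution events in the Linux kernel and print them to the user-space command line through a perf event array, so there is no longer any need to read /sys/kernel/debug/tracing/trace_pipe to see the eBPF program's output. Once the information reaches user space through the perf event array, it can be processed and analyzed in more complex ways.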
|
||||
|
||||
## perf buffer
|
||||
|
||||
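eBPF provides two ring buffers for transferring information from an eBPF program to a user-space controller. The first is the perf ring buffer, which has existed since at least kernel v4.15. The second is the BPF ring buffer, introduced later. This article covers only the perf ring buffer.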
|
||||
|
||||
## execsnoop
|
||||
|
||||
Printing to the user-space command line through a perf event array requires a header file and a C source file. The example code is as follows:
|
||||
|
||||
Header file: execsnoop.h
|
||||
|
||||
```c
|
||||
#ifndef __EXECSNOOP_H
|
||||
#define __EXECSNOOP_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
|
||||
struct event {
|
||||
int pid;
|
||||
int ppid;
|
||||
int uid;
|
||||
int retval;
|
||||
bool is_exit;
|
||||
char comm[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
#endif /* __EXECSNOOP_H */
|
||||
```
|
||||
|
||||
Source file: execsnoop.bpf.c
|
||||
|
||||
```c
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include "execsnoop.h"
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
|
||||
__uint(key_size, sizeof(u32));
|
||||
__uint(value_size, sizeof(u32));
|
||||
} events SEC(".maps");
|
||||
|
||||
SEC("tracepoint/syscalls/sys_enter_execve")
|
||||
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx)
|
||||
{
|
||||
u64 id;
|
||||
pid_t pid, tgid;
|
||||
struct event event = {}; /* zero-initialize so fields left unset are not sent as stack garbage */
|
||||
struct task_struct *task;
|
||||
|
||||
uid_t uid = (u32)bpf_get_current_uid_gid();
|
||||
id = bpf_get_current_pid_tgid();
|
||||
pid = (pid_t)id;
|
||||
tgid = id >> 32;
|
||||
|
||||
event.pid = tgid;
|
||||
event.uid = uid;
|
||||
task = (struct task_struct*)bpf_get_current_task();
|
||||
event.ppid = BPF_CORE_READ(task, real_parent, tgid);
|
||||
bpf_get_current_comm(&event.comm, sizeof(event.comm));
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
|
||||
return 0;
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
```
|
||||
|
||||
This code defines an eBPF program that captures entry into the execve system call.
|
||||
|
||||
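In the entry handler, we first get the process ID and user ID of the current process, then get the current process's task_struct with bpf_get_current_task to read the parent process ID, and read the process name with bpf_get_current_comm. Finally, we send the process execution event to the perf buffer with bpf_perf_event_output.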
|
||||
|
||||
With this code we can capture process execution events in the Linux kernel and analyze how processes are executed.
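
On the user-space side, each sample arrives as a block of raw bytes that has to be interpreted with the same struct layout the BPF program used, which is why the header is shared. In this tutorial the ecli runtime does that work for you; the sketch below only illustrates the idea. It assumes the event layout from src/7-execsnoop/execsnoop.h, and `format_event` is a hypothetical helper, not part of eunomia-bpf or libbpf.

```c
#include <stdio.h>
#include <string.h>

#define TASK_COMM_LEN 16

/* Same layout as the event the BPF program writes. */
struct event {
    int pid;
    int ppid;
    int uid;
    char comm[TASK_COMM_LEN];
};

/* Interpret one raw sample as a struct event and format it as a row.
 * Returns the number of characters snprintf produced, or -1 if the
 * sample is too small to hold an event. */
static int format_event(const void *data, size_t size,
                        char *out, size_t out_size)
{
    struct event e;

    if (size < sizeof(e))
        return -1;

    memcpy(&e, data, sizeof(e));        /* avoid alignment assumptions */
    e.comm[TASK_COMM_LEN - 1] = '\0';   /* defensive NUL termination */

    return snprintf(out, out_size, "%-7d %-7d %-5d %s",
                    e.pid, e.ppid, e.uid, e.comm);
}
```

Checking the sample size before copying guards against a mismatch between the kernel-side and user-space definitions of the struct.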
|
||||
|
||||
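eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that integrates with Wasm. It aims to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime. We use eunomia-bpf to compile and run this example.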
|
||||
|
||||
Compile using a container:
|
||||
|
||||
```shell
|
||||
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
|
||||
```
|
||||
|
||||
Or compile with ecc:
|
||||
|
||||
```shell
|
||||
ecc execsnoop.bpf.c execsnoop.h
|
||||
```
|
||||
|
||||
Run:
|
||||
|
||||
```console
|
||||
$ sudo ./ecli run package.json
|
||||
TIME PID PPID UID COMM
|
||||
21:28:30 40747 3517 1000 node
|
||||
21:28:30 40748 40747 1000 sh
|
||||
21:28:30 40749 3517 1000 node
|
||||
21:28:30 40750 40749 1000 sh
|
||||
21:28:30 40751 3517 1000 node
|
||||
21:28:30 40752 40751 1000 sh
|
||||
21:28:30 40753 40752 1000 cpuUsage.sh
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
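This article showed how to capture process execution events in the Linux kernel and print them to the user-space command line through a perf event array. Once the information reaches user space through the perf event array, it can be processed and analyzed in more complex ways. In the libbpf kernel-side code, defining a structure like the following, together with the corresponding header file: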
|
||||
|
||||
```c
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
|
||||
__uint(key_size, sizeof(u32));
|
||||
__uint(value_size, sizeof(u32));
|
||||
} events SEC(".maps");
|
||||
```
|
||||
|
||||
is enough to send information directly to user space.
|
||||
|
||||
For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>
|
||||
|
||||
The complete tutorial and source code are open source and available at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
|
||||
36
src/7-execsnoop/execsnoop.bpf.c
Normal file
@@ -0,0 +1,36 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include "execsnoop.h"
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
|
||||
__uint(key_size, sizeof(u32));
|
||||
__uint(value_size, sizeof(u32));
|
||||
} events SEC(".maps");
|
||||
|
||||
SEC("tracepoint/syscalls/sys_enter_execve")
|
||||
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx)
|
||||
{
|
||||
u64 id;
|
||||
pid_t pid, tgid;
|
||||
struct event event;
|
||||
struct task_struct *task;
|
||||
|
||||
uid_t uid = (u32)bpf_get_current_uid_gid();
|
||||
id = bpf_get_current_pid_tgid();
|
||||
pid = (pid_t)id;
|
||||
tgid = id >> 32;
|
||||
|
||||
event.pid = tgid;
|
||||
event.uid = uid;
|
||||
task = (struct task_struct*)bpf_get_current_task();
|
||||
event.ppid = BPF_CORE_READ(task, real_parent, tgid);
|
||||
bpf_get_current_comm(&event.comm, sizeof(event.comm));
|
||||
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
|
||||
return 0;
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
|
||||
16
src/7-execsnoop/execsnoop.h
Normal file
@@ -0,0 +1,16 @@
|
||||
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
|
||||
#ifndef __EXECSNOOP_H
|
||||
#define __EXECSNOOP_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
|
||||
struct event {
|
||||
int pid;
|
||||
int ppid;
|
||||
int uid;
|
||||
char comm[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
#endif /* __EXECSNOOP_H */
|
||||
|
||||
|
||||
4
src/8-exitsnoop/.gitignore
vendored
Normal file
@@ -0,0 +1,4 @@
|
||||
.vscode
|
||||
eunomia-exporter
|
||||
ecli
|
||||
*.json
|
||||
154
src/8-exitsnoop/README.md
Normal file
@@ -0,0 +1,154 @@
|
||||
# eBPF Beginner Development Tutorial 8: Monitoring process exit events with exitsnoop in eBPF and printing output to user space through a ring buffer
|
||||
|
||||
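eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at kernel runtime.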
|
||||
|
||||
This is the eighth article in the eBPF beginner development tutorial series. It uses exitsnoop in eBPF to monitor process exit events.
|
||||
|
||||
## ring buffer
|
||||
|
||||
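A new BPF data structure is now available: the BPF ring buffer (ringbuf). It solves the memory efficiency and event reordering problems of the BPF perf buffer (today's de facto standard for sending data from the kernel to user space) while matching or exceeding its performance. It provides a perf buffer compatible API for easy migration, as well as a new reserve/commit API with better usability. In addition, synthetic and real-world benchmarks show that in almost all cases it should be considered the default choice for sending data from BPF programs to user space.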
|
||||
|
||||
### BPF ringbuf vs BPF perfbuf
|
||||
|
||||
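Today, whenever a BPF program needs to send collected data to user space for post-processing and logging, it usually uses the BPF perf buffer (perfbuf). Perfbuf is a collection of per-CPU circular buffers that allows efficient data exchange between the kernel and user space. It works well in practice, but because of its per-CPU design it has two major shortcomings that prove inconvenient: inefficient use of memory and event reordering.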
|
||||
|
||||
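To address these problems, starting with Linux 5.8, BPF provides a new BPF data structure (BPF map): the BPF ring buffer (ringbuf). It is a multi-producer, single-consumer (MPSC) queue that can be safely shared across multiple CPUs at the same time.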
|
||||
|
||||
BPF ringbuf supports the familiar features of BPF perfbuf:
|
||||
|
||||
- Variable-length data records.
|
||||
- Efficient reading of data from user space through a memory-mapped region, without extra memory copying and/or system calls into the kernel.
|
||||
- Support for both epoll notifications and busy-looping with absolute minimal latency.
|
||||
|
||||
At the same time, BPF ringbuf solves the following problems of BPF perfbuf:
|
||||
|
||||
- Memory overhead.
|
||||
- Data ordering.
|
||||
- Wasted work and extra data copying.
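
The reserve/commit idea can be sketched in user space with a deliberately simplified model: a single shared queue of fixed-size slots, where a producer first reserves a slot, fills it in place, and then submits it. This is a single-threaded illustration under stated assumptions, not the kernel implementation: the real ringbuf handles variable-length records, concurrent producers, and memory-mapped consumption, and every name below (`struct ringbuf`, `rb_reserve`, `rb_submit`, `rb_consume`, `RB_SLOTS`) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define RB_SLOTS 4  /* power of two so the index wraps cleanly */

struct record {
    uint32_t pid;
    int committed;  /* 0 while the producer is still filling the slot */
};

struct ringbuf {
    struct record slots[RB_SLOTS];
    unsigned head;  /* total number of slots reserved so far */
    unsigned tail;  /* total number of slots consumed so far */
};

/* Reserve the next slot, or return NULL when the buffer is full
 * (the caller then drops the event, as handle_exit does). */
static struct record *rb_reserve(struct ringbuf *rb)
{
    struct record *r;

    if (rb->head - rb->tail == RB_SLOTS)
        return NULL;
    r = &rb->slots[rb->head % RB_SLOTS];
    r->committed = 0;
    rb->head++;
    return r;
}

/* Mark a previously reserved slot as ready for the consumer. */
static void rb_submit(struct record *r)
{
    r->committed = 1;
}

/* Consume the oldest record if it has been committed.
 * Returns 1 and copies it to out, or 0 if nothing is ready. */
static int rb_consume(struct ringbuf *rb, struct record *out)
{
    struct record *r;

    if (rb->tail == rb->head)
        return 0;
    r = &rb->slots[rb->tail % RB_SLOTS];
    if (!r->committed)
        return 0;
    *out = *r;
    rb->tail++;
    return 1;
}
```

Because the record is filled directly in the reserved slot, there is no separate copy from a stack buffer, and because all producers share one queue, records come out in reservation order.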
|
||||
|
||||
## exitsnoop
|
||||
|
||||
In this section we implement exitsnoop, which monitors process exit events and prints them to user space through a ring buffer.
|
||||
|
||||
The steps for printing to user space through a ring buffer are similar to those for the perf buffer. First, define a header file:
|
||||
|
||||
Header file: exitsnoop.h
|
||||
|
||||
```c
|
||||
#ifndef __BOOTSTRAP_H
|
||||
#define __BOOTSTRAP_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
#define MAX_FILENAME_LEN 127
|
||||
|
||||
struct event {
|
||||
int pid;
|
||||
int ppid;
|
||||
unsigned exit_code;
|
||||
unsigned long long duration_ns;
|
||||
char comm[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
#endif /* __BOOTSTRAP_H */
|
||||
```
|
||||
|
||||
Source file: exitsnoop.bpf.c
|
||||
|
||||
```c
|
||||
#include "vmlinux.h"
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include "exitsnoop.h"
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_RINGBUF);
|
||||
__uint(max_entries, 256 * 1024);
|
||||
} rb SEC(".maps");
|
||||
|
||||
SEC("tp/sched/sched_process_exit")
|
||||
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
|
||||
{
|
||||
struct task_struct *task;
|
||||
struct event *e;
|
||||
pid_t pid, tid;
|
||||
u64 id, ts, *start_ts, duration_ns = 0;
|
||||
|
||||
/* get PID and TID of exiting thread/process */
|
||||
id = bpf_get_current_pid_tgid();
|
||||
pid = id >> 32;
|
||||
tid = (u32)id;
|
||||
|
||||
/* ignore thread exits */
|
||||
if (pid != tid)
|
||||
return 0;
|
||||
|
||||
/* reserve sample from BPF ringbuf */
|
||||
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
|
||||
if (!e)
|
||||
return 0;
|
||||
|
||||
/* fill out the sample with data */
|
||||
task = (struct task_struct *)bpf_get_current_task();
|
||||
|
||||
e->duration_ns = duration_ns;
|
||||
e->pid = pid;
|
||||
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
|
||||
e->exit_code = (BPF_CORE_READ(task, exit_code) >> 8) & 0xff;
|
||||
bpf_get_current_comm(&e->comm, sizeof(e->comm));
|
||||
|
||||
/* send data to user-space for post-processing */
|
||||
bpf_ringbuf_submit(e, 0);
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
This code is a BPF program that monitors process exit events on a Linux system.
|
||||
|
||||
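The program monitors process exit events by attaching to a tracepoint. A tracepoint is a static instrumentation point in the kernel to which handlers, including BPF programs, can be attached in order to be notified of a specific event. The tracepoint used in this program is "tp/sched/sched_process_exit", which means the program monitors process exit events.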
|
||||
|
||||
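When a process exit event occurs, the BPF program captures it and calls handle_exit to handle it. The function first checks that the event is a process exit rather than a thread exit, then reserves an event structure in the BPF ring buffer ("rb") and fills it with information such as the process ID, parent process ID, process name, and exit code. Finally, it calls bpf_ringbuf_submit to send the captured event to the user-space program.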
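
The expression `(exit_code >> 8) & 0xff` in handle_exit extracts the exit status from a packed value. The sketch below shows that decoding in user-space C, assuming the kernel's exit_code field uses the same layout as the wait(2) status word: exit status in bits 8 to 15 and terminating signal in the low 7 bits. The function names are illustrative, and the signal decoder is an addition for comparison; the tutorial program itself records only the exit status.

```c
#include <stdint.h>

/* Exit status passed to exit(), stored in bits 8..15. */
static unsigned decode_exit_status(uint32_t raw)
{
    return (raw >> 8) & 0xff;
}

/* Terminating signal number, stored in the low 7 bits
 * (0 when the task exited normally). */
static unsigned decode_term_signal(uint32_t raw)
{
    return raw & 0x7f;
}
```

Under this assumption, a process that called exit(3) yields status 3 and signal 0, while a process killed by signal 9 yields status 0 and signal 9.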
|
||||
|
||||
In short, this BPF program monitors process exit events on a Linux system and delivers them to user space through the ring buffer.
|
||||
|
||||
## Compile and Run
|
||||
|
||||
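eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain that integrates with Wasm. It aims to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime. We use eunomia-bpf to compile and run this example.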
|
||||
|
||||
Compile:
|
||||
|
||||
```shell
|
||||
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
|
||||
```
|
||||
|
||||
Or
|
||||
|
||||
```console
|
||||
$ ecc exitsnoop.bpf.c exitsnoop.h
|
||||
Compiling bpf object...
|
||||
Generating export types...
|
||||
Packing ebpf object and config into package.json...
|
||||
```
|
||||
|
||||
Run:
|
||||
|
||||
```console
|
||||
$ sudo ./ecli run package.json
|
||||
TIME PID PPID EXIT_CODE DURATION_NS COMM
|
||||
21:40:09 42050 42049 0 0 which
|
||||
21:40:09 42049 3517 0 0 sh
|
||||
21:40:09 42052 42051 0 0 ps
|
||||
21:40:09 42051 3517 0 0 sh
|
||||
21:40:09 42055 42054 0 0 sed
|
||||
21:40:09 42056 42054 0 0 cat
|
||||
21:40:09 42057 42054 0 0 cat
|
||||
21:40:09 42058 42054 0 0 cat
|
||||
21:40:09 42059 42054 0 0 cat
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
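This article showed how to use eunomia-bpf to develop a simple BPF program that monitors process exit events on a Linux system and sends the captured events to a user-space program through a ring buffer. We compiled and ran the example with eunomia-bpf. For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>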
|
||||
|
||||
The complete tutorial and source code are open source and available at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
|
||||
50
src/8-exitsnoop/exitsnoop.bpf.c
Normal file
@@ -0,0 +1,50 @@
|
||||
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
|
||||
/* Copyright (c) 2020 Facebook */
|
||||
#include "vmlinux.h"
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include "exitsnoop.h"
|
||||
|
||||
char LICENSE[] SEC("license") = "Dual BSD/GPL";
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_RINGBUF);
|
||||
__uint(max_entries, 256 * 1024);
|
||||
} rb SEC(".maps");
|
||||
|
||||
SEC("tp/sched/sched_process_exit")
|
||||
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
|
||||
{
|
||||
struct task_struct *task;
|
||||
struct event *e;
|
||||
pid_t pid, tid;
|
||||
u64 id, ts, *start_ts, duration_ns = 0;
|
||||
|
||||
/* get PID and TID of exiting thread/process */
|
||||
id = bpf_get_current_pid_tgid();
|
||||
pid = id >> 32;
|
||||
tid = (u32)id;
|
||||
|
||||
/* ignore thread exits */
|
||||
if (pid != tid)
|
||||
return 0;
|
||||
|
||||
/* reserve sample from BPF ringbuf */
|
||||
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
|
||||
if (!e)
|
||||
return 0;
|
||||
|
||||
/* fill out the sample with data */
|
||||
task = (struct task_struct *)bpf_get_current_task();
|
||||
|
||||
e->duration_ns = duration_ns;
|
||||
e->pid = pid;
|
||||
e->ppid = BPF_CORE_READ(task, real_parent, tgid);
|
||||
e->exit_code = (BPF_CORE_READ(task, exit_code) >> 8) & 0xff;
|
||||
bpf_get_current_comm(&e->comm, sizeof(e->comm));
|
||||
|
||||
/* send data to user-space for post-processing */
|
||||
bpf_ringbuf_submit(e, 0);
|
||||
return 0;
|
||||
}
|
||||
15
src/8-exitsnoop/exitsnoop.h
Normal file
@@ -0,0 +1,15 @@
|
||||
#ifndef __BOOTSTRAP_H
|
||||
#define __BOOTSTRAP_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
#define MAX_FILENAME_LEN 127
|
||||
|
||||
struct event {
|
||||
int pid;
|
||||
int ppid;
|
||||
unsigned exit_code;
|
||||
unsigned long long duration_ns;
|
||||
char comm[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
#endif /* __BOOTSTRAP_H */
|
||||
7
src/9-runqlat/.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
.vscode
|
||||
package.json
|
||||
*.o
|
||||
*.skel.json
|
||||
*.skel.yaml
|
||||
package.yaml
|
||||
ecli
|
||||
277
src/9-runqlat/README.md
Executable file
@@ -0,0 +1,277 @@
|
||||
# eBPF Beginner Development Tutorial 9: A Linux kernel BPF program that summarizes scheduler run queue latency as a histogram, showing how long tasks wait to run on a CPU
|
||||
|
||||
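eBPF (Extended Berkeley Packet Filter) is a powerful networking and performance analysis tool in the Linux kernel. It allows developers to dynamically load, update, and run user-defined code at kernel runtime.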
|
||||
|
||||
## What is runqlat?
|
||||
|
||||
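bcc-tools is a set of tools for using BPF programs on Linux systems. runqlat is one of the bcc-tools utilities and is used to analyze the scheduling performance of a Linux system. Specifically, runqlat measures how long a task waits in the run queue before it is scheduled to run on a CPU. This information is useful for identifying performance bottlenecks and improving the overall efficiency of Linux kernel scheduling.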
|
||||
|
||||
## How runqlat works
|
||||
|
||||
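runqlat uses a combination of kernel tracepoints and function probes to measure the time a task spends in the run queue. When a task is enqueued, the trace_enqueue function records a timestamp in a map. When the task is scheduled onto a CPU, the handle_switch function retrieves the timestamp and computes the difference between the current time and the enqueue time. This difference (the delta) is then used to update a histogram that records the distribution of run queue latency. The histogram can be used to analyze the scheduling performance of the Linux kernel.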
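
The histogram update is a power-of-two bucketing: the delta is scaled to microseconds or milliseconds, its integer base-2 logarithm picks the slot, and out-of-range values are clamped to the last slot. The user-space sketch below models that step. `log2_u64`, `latency_slot`, and `record_latency` are hypothetical names standing in for the BPF helpers, and the value 26 for MAX_SLOTS is an assumption, since its definition is not shown in this section.

```c
#include <stdint.h>

#define MAX_SLOTS 26  /* assumed value; defined in runqlat.h in the real tool */

/* Integer base-2 logarithm: index of the highest set bit, 0 for v == 0. */
static unsigned log2_u64(uint64_t v)
{
    unsigned r = 0;

    while (v >>= 1)
        r++;
    return r;
}

/* Map a latency (already scaled to us or ms) to a histogram slot,
 * clamping large values to the last slot. */
static unsigned latency_slot(uint64_t delta)
{
    unsigned slot = log2_u64(delta);

    return slot >= MAX_SLOTS ? MAX_SLOTS - 1 : slot;
}

/* Scale a nanosecond delta and count it in the histogram. */
static void record_latency(unsigned hist[MAX_SLOTS], uint64_t delta_ns,
                           int use_ms)
{
    uint64_t delta = use_ms ? delta_ns / 1000000U : delta_ns / 1000U;

    hist[latency_slot(delta)]++;
}
```

Slot k therefore counts latencies in the range from 2^k up to 2^(k+1) - 1 units, which keeps the histogram small while still covering several orders of magnitude.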
|
||||
|
||||
## runqlat implementation
|
||||
|
||||
First, we write the source file runqlat.bpf.c:
|
||||
|
||||
```c
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
// Copyright (c) 2020 Wenbo Zhang
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include "runqlat.h"
|
||||
#include "bits.bpf.h"
|
||||
#include "maps.bpf.h"
|
||||
#include "core_fixes.bpf.h"
|
||||
|
||||
#define MAX_ENTRIES 10240
|
||||
#define TASK_RUNNING 0
|
||||
|
||||
const volatile bool filter_cg = false;
|
||||
const volatile bool targ_per_process = false;
|
||||
const volatile bool targ_per_thread = false;
|
||||
const volatile bool targ_per_pidns = false;
|
||||
const volatile bool targ_ms = false;
|
||||
const volatile pid_t targ_tgid = 0;
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
|
||||
__type(key, u32);
|
||||
__type(value, u32);
|
||||
__uint(max_entries, 1);
|
||||
} cgroup_map SEC(".maps");
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, u32);
|
||||
__type(value, u64);
|
||||
} start SEC(".maps");
|
||||
|
||||
static struct hist zero;
|
||||
|
||||
/// @sample {"interval": 1000, "type" : "log2_hist"}
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, u32);
|
||||
__type(value, struct hist);
|
||||
} hists SEC(".maps");
|
||||
|
||||
static int trace_enqueue(u32 tgid, u32 pid)
|
||||
{
|
||||
u64 ts;
|
||||
|
||||
if (!pid)
|
||||
return 0;
|
||||
if (targ_tgid && targ_tgid != tgid)
|
||||
return 0;
|
||||
|
||||
ts = bpf_ktime_get_ns();
|
||||
bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static unsigned int pid_namespace(struct task_struct *task)
|
||||
{
|
||||
struct pid *pid;
|
||||
unsigned int level;
|
||||
struct upid upid;
|
||||
unsigned int inum;
|
||||
|
||||
/* get the pid namespace by following task_active_pid_ns(),
|
||||
* pid->numbers[pid->level].ns
|
||||
*/
|
||||
pid = BPF_CORE_READ(task, thread_pid);
|
||||
level = BPF_CORE_READ(pid, level);
|
||||
bpf_core_read(&upid, sizeof(upid), &pid->numbers[level]);
|
||||
inum = BPF_CORE_READ(upid.ns, ns.inum);
|
||||
|
||||
return inum;
|
||||
}
|
||||
|
||||
static int handle_switch(bool preempt, struct task_struct *prev, struct task_struct *next)
|
||||
{
|
||||
struct hist *histp;
|
||||
u64 *tsp, slot;
|
||||
u32 pid, hkey;
|
||||
s64 delta;
|
||||
|
||||
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
|
||||
return 0;
|
||||
|
||||
if (get_task_state(prev) == TASK_RUNNING)
|
||||
trace_enqueue(BPF_CORE_READ(prev, tgid), BPF_CORE_READ(prev, pid));
|
||||
|
||||
pid = BPF_CORE_READ(next, pid);
|
||||
|
||||
tsp = bpf_map_lookup_elem(&start, &pid);
|
||||
if (!tsp)
|
||||
return 0;
|
||||
delta = bpf_ktime_get_ns() - *tsp;
|
||||
if (delta < 0)
|
||||
goto cleanup;
|
||||
|
||||
if (targ_per_process)
|
||||
hkey = BPF_CORE_READ(next, tgid);
|
||||
else if (targ_per_thread)
|
||||
hkey = pid;
|
||||
else if (targ_per_pidns)
|
||||
hkey = pid_namespace(next);
|
||||
else
|
||||
hkey = -1;
|
||||
histp = bpf_map_lookup_or_try_init(&hists, &hkey, &zero);
|
||||
if (!histp)
|
||||
goto cleanup;
|
||||
if (!histp->comm[0])
|
||||
bpf_probe_read_kernel_str(&histp->comm, sizeof(histp->comm),
|
||||
next->comm);
|
||||
if (targ_ms)
|
||||
delta /= 1000000U;
|
||||
else
|
||||
delta /= 1000U;
|
||||
slot = log2l(delta);
|
||||
if (slot >= MAX_SLOTS)
|
||||
slot = MAX_SLOTS - 1;
|
||||
__sync_fetch_and_add(&histp->slots[slot], 1);
|
||||
|
||||
cleanup:
|
||||
bpf_map_delete_elem(&start, &pid);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("raw_tp/sched_wakeup")
|
||||
int BPF_PROG(handle_sched_wakeup, struct task_struct *p)
|
||||
{
|
||||
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
|
||||
return 0;
|
||||
|
||||
return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
|
||||
}
|
||||
|
||||
SEC("raw_tp/sched_wakeup_new")
|
||||
int BPF_PROG(handle_sched_wakeup_new, struct task_struct *p)
|
||||
{
|
||||
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
|
||||
return 0;
|
||||
|
||||
return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
|
||||
}
|
||||
|
||||
SEC("raw_tp/sched_switch")
|
||||
int BPF_PROG(handle_sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)
|
||||
{
|
||||
return handle_switch(preempt, prev, next);
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
```
|
||||
|
||||
Next, we define a header file, `runqlat.h`, which user space uses to process the events reported from kernel space:
|
||||
|
||||
```c
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __RUNQLAT_H
#define __RUNQLAT_H

#define TASK_COMM_LEN 16
#define MAX_SLOTS 26

struct hist {
	__u32 slots[MAX_SLOTS];
	char comm[TASK_COMM_LEN];
};

#endif /* __RUNQLAT_H */
```
|
||||
|
||||
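This is a Linux kernel BPF program that collects and reports run queue latency. BPF is a Linux kernel technology that lets programs be attached to specific points in the kernel and executed safely and efficiently. Such programs can collect information about kernel behavior and implement custom behavior. This program uses BPF maps to record when tasks are enqueued on and dequeued from the kernel's run queue: a timestamp is stored in the `start` map on `sched_wakeup` and `sched_wakeup_new`, and the time the task waited before being scheduled is computed on `sched_switch`. It then uses this information to build histograms showing the distribution of run queue latency for different groups of tasks. These histograms can be used to identify and diagnose performance problems in the kernel's scheduling behavior.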
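The bucketing step at the end of `handle_switch` can be illustrated in plain user-space C. This is a minimal sketch, not the BPF code itself: `floor_log2_u64` and `latency_slot` are hypothetical names introduced here, and the loop-based logarithm stands in for the branch-free `log2l` helper from `bits.bpf.h`. Only `MAX_SLOTS` and the microsecond/millisecond division are taken from the source above.

```c
#include <stdint.h>

#define MAX_SLOTS 26 /* same value as in runqlat.h */

/* Floor of log2(v) for a 64-bit value; returns 0 for v == 0 and v == 1.
 * Hypothetical stand-in for log2l() in bits.bpf.h. */
static unsigned int floor_log2_u64(uint64_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Map a run queue wait time in nanoseconds to a histogram slot.
 * use_ms mirrors the targ_ms option: milliseconds if set, microseconds otherwise. */
static unsigned int latency_slot(uint64_t delta_ns, int use_ms)
{
	uint64_t delta = use_ms ? delta_ns / 1000000U : delta_ns / 1000U;
	unsigned int slot = floor_log2_u64(delta);

	/* Everything beyond the last bucket is folded into it. */
	if (slot >= MAX_SLOTS)
		slot = MAX_SLOTS - 1;
	return slot;
}
```

For example, a wait of 50 µs (`latency_slot(50000, 0)`) falls into slot 5, which is the `32 -> 63` bucket of the histogram.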
|
||||
|
||||
## Compile and Run
|
||||
|
||||
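eunomia-bpf is an open-source eBPF dynamic loading runtime and development toolchain combined with Wasm. Its goal is to simplify developing, building, distributing, and running eBPF programs. See <https://github.com/eunomia-bpf/eunomia-bpf> to download and install the ecc compiler toolchain and the ecli runtime. We use eunomia-bpf to compile and run this example.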
|
||||
|
||||
Compile:
|
||||
|
||||
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
|
||||
|
||||
Or:
|
||||
|
||||
```console
$ ecc runqlat.bpf.c runqlat.h
Compiling bpf object...
Generating export types...
Packing ebpf object and config into package.json...
```
|
||||
|
||||
Run:
|
||||
|
||||
```console
|
||||
$ sudo ecli examples/bpftools/runqlat/package.json -h
|
||||
Usage: runqlat_bpf [--help] [--version] [--verbose] [--filter_cg] [--targ_per_process] [--targ_per_thread] [--targ_per_pidns] [--targ_ms] [--targ_tgid VAR]
|
||||
|
||||
A simple eBPF program
|
||||
|
||||
Optional arguments:
|
||||
-h, --help shows help message and exits
|
||||
-v, --version prints version information and exits
|
||||
--verbose prints libbpf debug information
|
||||
--filter_cg set value of bool variable filter_cg
|
||||
--targ_per_process set value of bool variable targ_per_process
|
||||
--targ_per_thread set value of bool variable targ_per_thread
|
||||
--targ_per_pidns set value of bool variable targ_per_pidns
|
||||
--targ_ms set value of bool variable targ_ms
|
||||
--targ_tgid set value of pid_t variable targ_tgid
|
||||
|
||||
Built with eunomia-bpf framework.
|
||||
See https://github.com/eunomia-bpf/eunomia-bpf for more information.
|
||||
|
||||
$ sudo ecli examples/bpftools/runqlat/package.json
|
||||
key = 4294967295
|
||||
comm = rcu_preempt
|
||||
|
||||
(unit) : count distribution
|
||||
0 -> 1 : 9 |**** |
|
||||
2 -> 3 : 6 |** |
|
||||
4 -> 7 : 12 |***** |
|
||||
8 -> 15 : 28 |************* |
|
||||
16 -> 31 : 40 |******************* |
|
||||
32 -> 63 : 83 |****************************************|
|
||||
64 -> 127 : 57 |*************************** |
|
||||
128 -> 255 : 19 |********* |
|
||||
256 -> 511 : 11 |***** |
|
||||
512 -> 1023 : 2 | |
|
||||
1024 -> 2047 : 2 | |
|
||||
2048 -> 4095 : 0 | |
|
||||
4096 -> 8191 : 0 | |
|
||||
8192 -> 16383 : 0 | |
|
||||
16384 -> 32767 : 1 | |
|
||||
|
||||
$ sudo ecli examples/bpftools/runqlat/package.json --targ_per_process
|
||||
key = 3189
|
||||
comm = cpptools
|
||||
|
||||
(unit) : count distribution
|
||||
0 -> 1 : 0 | |
|
||||
2 -> 3 : 0 | |
|
||||
4 -> 7 : 0 | |
|
||||
8 -> 15 : 1 |*** |
|
||||
16 -> 31 : 2 |******* |
|
||||
32 -> 63 : 11 |****************************************|
|
||||
64 -> 127 : 8 |***************************** |
|
||||
128 -> 255 : 3 |********** |
|
||||
```
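To read the histograms above, it helps to know which values each row covers and how the bar length relates to the count. The following is a minimal sketch in user-space C; `slot_range`, `bar_length`, and `BAR_WIDTH` are hypothetical names, and the rendering code that ecli actually uses is not shown in this tutorial. The truncating integer scaling is an assumption, although it does reproduce the bar lengths in the sample output (for instance 9 out of a maximum of 83 gives 4 stars).

```c
#include <stdint.h>

#define BAR_WIDTH 40 /* widest bar in the sample output; an assumption */

/* Inclusive value range covered by a log2 slot:
 * slot 0 is 0..1, slot 1 is 2..3, slot 2 is 4..7, and so on. */
static void slot_range(unsigned int slot, uint64_t *low, uint64_t *high)
{
	*low = slot == 0 ? 0 : (uint64_t)1 << slot;
	*high = ((uint64_t)1 << (slot + 1)) - 1;
}

/* Number of '*' characters for a bucket, scaled against the largest bucket. */
static unsigned int bar_length(uint32_t count, uint32_t max_count)
{
	if (max_count == 0)
		return 0;
	return (unsigned int)((uint64_t)count * BAR_WIDTH / max_count);
}
```

With `targ_ms` unset the ranges are in microseconds, so the peak row `32 -> 63` in the first histogram means most tasks waited between 32 and 63 µs before running.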
|
||||
|
||||
## Summary
|
||||
|
||||
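runqlat is a Linux kernel BPF program that summarizes scheduler run queue latency as a histogram, showing how long tasks wait before running on a CPU. You can compile it with the ecc tool and run it with the ecli command.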
|
||||
|
||||
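runqlat is a tool for monitoring process scheduling latency in the Linux kernel. It helps you understand how long processes wait in the kernel before they run, and you can use that information to tune scheduling and improve system performance. The original source code is in libbpf-tools: <https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlat.bpf.c>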
|
||||
|
||||
For more examples and a detailed development guide, see the official eunomia-bpf documentation: <https://github.com/eunomia-bpf/eunomia-bpf>
|
||||
|
||||
The complete tutorial and source code are open source and available at <https://github.com/eunomia-bpf/bpf-developer-tutorial>.
|
||||
31
src/9-runqlat/bits.bpf.h
Normal file
@@ -0,0 +1,31 @@
|
||||
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
|
||||
#ifndef __BITS_BPF_H
|
||||
#define __BITS_BPF_H
|
||||
|
||||
#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
|
||||
#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = val)
|
||||
|
||||
static __always_inline u64 log2(u32 v)
|
||||
{
|
||||
u32 shift, r;
|
||||
|
||||
r = (v > 0xFFFF) << 4; v >>= r;
|
||||
shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
|
||||
shift = (v > 0xF) << 2; v >>= shift; r |= shift;
|
||||
shift = (v > 0x3) << 1; v >>= shift; r |= shift;
|
||||
r |= (v >> 1);
|
||||
|
||||
return r;
|
||||
}
|
||||
|
||||
static __always_inline u64 log2l(u64 v)
|
||||
{
|
||||
u32 hi = v >> 32;
|
||||
|
||||
if (hi)
|
||||
return log2(hi) + 32;
|
||||
else
|
||||
return log2(v);
|
||||
}
|
||||
|
||||
#endif /* __BITS_BPF_H */
|
||||
112
src/9-runqlat/core_fixes.bpf.h
Normal file
@@ -0,0 +1,112 @@
|
||||
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
|
||||
/* Copyright (c) 2021 Hengqi Chen */
|
||||
|
||||
#ifndef __CORE_FIXES_BPF_H
|
||||
#define __CORE_FIXES_BPF_H
|
||||
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
|
||||
/**
|
||||
* commit 2f064a59a1 ("sched: Change task_struct::state") changes
|
||||
* the name of task_struct::state to task_struct::__state
|
||||
* see:
|
||||
* https://github.com/torvalds/linux/commit/2f064a59a1
|
||||
*/
|
||||
struct task_struct___o {
|
||||
volatile long int state;
|
||||
} __attribute__((preserve_access_index));
|
||||
|
||||
struct task_struct___x {
|
||||
unsigned int __state;
|
||||
} __attribute__((preserve_access_index));
|
||||
|
||||
static __always_inline __s64 get_task_state(void *task)
|
||||
{
|
||||
struct task_struct___x *t = task;
|
||||
|
||||
if (bpf_core_field_exists(t->__state))
|
||||
return BPF_CORE_READ(t, __state);
|
||||
return BPF_CORE_READ((struct task_struct___o *)task, state);
|
||||
}
|
||||
|
||||
/**
|
||||
* commit 309dca309fc3 ("block: store a block_device pointer in struct bio")
|
||||
* adds a new member bi_bdev which is a pointer to struct block_device
|
||||
* see:
|
||||
* https://github.com/torvalds/linux/commit/309dca309fc3
|
||||
*/
|
||||
struct bio___o {
|
||||
struct gendisk *bi_disk;
|
||||
} __attribute__((preserve_access_index));
|
||||
|
||||
struct bio___x {
|
||||
struct block_device *bi_bdev;
|
||||
} __attribute__((preserve_access_index));
|
||||
|
||||
static __always_inline struct gendisk *get_gendisk(void *bio)
|
||||
{
|
||||
struct bio___x *b = bio;
|
||||
|
||||
if (bpf_core_field_exists(b->bi_bdev))
|
||||
return BPF_CORE_READ(b, bi_bdev, bd_disk);
|
||||
return BPF_CORE_READ((struct bio___o *)bio, bi_disk);
|
||||
}
|
||||
|
||||
/**
|
||||
* commit d5869fdc189f ("block: introduce block_rq_error tracepoint")
|
||||
* adds a new tracepoint block_rq_error and it shares the same arguments
|
||||
* with tracepoint block_rq_complete. As a result, the kernel BTF now has
|
||||
* a `struct trace_event_raw_block_rq_completion` instead of
|
||||
* `struct trace_event_raw_block_rq_complete`.
|
||||
* see:
|
||||
* https://github.com/torvalds/linux/commit/d5869fdc189f
|
||||
*/
|
||||
struct trace_event_raw_block_rq_complete___x {
|
||||
dev_t dev;
|
||||
sector_t sector;
|
||||
unsigned int nr_sector;
|
||||
} __attribute__((preserve_access_index));
|
||||
|
||||
struct trace_event_raw_block_rq_completion___x {
|
||||
dev_t dev;
|
||||
sector_t sector;
|
||||
unsigned int nr_sector;
|
||||
} __attribute__((preserve_access_index));
|
||||
|
||||
static __always_inline bool has_block_rq_completion()
|
||||
{
|
||||
if (bpf_core_type_exists(struct trace_event_raw_block_rq_completion___x))
|
||||
return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* commit d152c682f03c ("block: add an explicit ->disk backpointer to the
|
||||
* request_queue") and commit f3fa33acca9f ("block: remove the ->rq_disk
|
||||
* field in struct request") make some changes to `struct request` and
|
||||
* `struct request_queue`. Now, to get the `struct gendisk *` field in a CO-RE
|
||||
* way, we need both `struct request` and `struct request_queue`.
|
||||
* see:
|
||||
* https://github.com/torvalds/linux/commit/d152c682f03c
|
||||
* https://github.com/torvalds/linux/commit/f3fa33acca9f
|
||||
*/
|
||||
struct request_queue___x {
|
||||
struct gendisk *disk;
|
||||
} __attribute__((preserve_access_index));
|
||||
|
||||
struct request___x {
|
||||
struct request_queue___x *q;
|
||||
struct gendisk *rq_disk;
|
||||
} __attribute__((preserve_access_index));
|
||||
|
||||
static __always_inline struct gendisk *get_disk(void *request)
|
||||
{
|
||||
struct request___x *r = request;
|
||||
|
||||
if (bpf_core_field_exists(r->rq_disk))
|
||||
return BPF_CORE_READ(r, rq_disk);
|
||||
return BPF_CORE_READ(r, q, disk);
|
||||
}
|
||||
|
||||
#endif /* __CORE_FIXES_BPF_H */
|
||||
26
src/9-runqlat/maps.bpf.h
Normal file
@@ -0,0 +1,26 @@
|
||||
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
// Copyright (c) 2020 Anton Protopopov
|
||||
#ifndef __MAPS_BPF_H
|
||||
#define __MAPS_BPF_H
|
||||
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <asm-generic/errno.h>
|
||||
|
||||
static __always_inline void *
|
||||
bpf_map_lookup_or_try_init(void *map, const void *key, const void *init)
|
||||
{
|
||||
void *val;
|
||||
long err;
|
||||
|
||||
val = bpf_map_lookup_elem(map, key);
|
||||
if (val)
|
||||
return val;
|
||||
|
||||
err = bpf_map_update_elem(map, key, init, BPF_NOEXIST);
|
||||
if (err && err != -EEXIST)
|
||||
return 0;
|
||||
|
||||
return bpf_map_lookup_elem(map, key);
|
||||
}
|
||||
|
||||
#endif /* __MAPS_BPF_H */
|
||||
152
src/9-runqlat/runqlat.bpf.c
Normal file
@@ -0,0 +1,152 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
// Copyright (c) 2020 Wenbo Zhang
|
||||
#include <vmlinux.h>
|
||||
#include <bpf/bpf_helpers.h>
|
||||
#include <bpf/bpf_core_read.h>
|
||||
#include <bpf/bpf_tracing.h>
|
||||
#include "runqlat.h"
|
||||
#include "bits.bpf.h"
|
||||
#include "maps.bpf.h"
|
||||
#include "core_fixes.bpf.h"
|
||||
|
||||
#define MAX_ENTRIES 10240
|
||||
#define TASK_RUNNING 0
|
||||
|
||||
const volatile bool filter_cg = false;
|
||||
const volatile bool targ_per_process = false;
|
||||
const volatile bool targ_per_thread = false;
|
||||
const volatile bool targ_per_pidns = false;
|
||||
const volatile bool targ_ms = false;
|
||||
const volatile pid_t targ_tgid = 0;
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
|
||||
__type(key, u32);
|
||||
__type(value, u32);
|
||||
__uint(max_entries, 1);
|
||||
} cgroup_map SEC(".maps");
|
||||
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, u32);
|
||||
__type(value, u64);
|
||||
} start SEC(".maps");
|
||||
|
||||
static struct hist zero;
|
||||
|
||||
/// @sample {"interval": 1000, "type" : "log2_hist"}
|
||||
struct {
|
||||
__uint(type, BPF_MAP_TYPE_HASH);
|
||||
__uint(max_entries, MAX_ENTRIES);
|
||||
__type(key, u32);
|
||||
__type(value, struct hist);
|
||||
} hists SEC(".maps");
|
||||
|
||||
static int trace_enqueue(u32 tgid, u32 pid)
|
||||
{
|
||||
u64 ts;
|
||||
|
||||
if (!pid)
|
||||
return 0;
|
||||
if (targ_tgid && targ_tgid != tgid)
|
||||
return 0;
|
||||
|
||||
ts = bpf_ktime_get_ns();
|
||||
bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static unsigned int pid_namespace(struct task_struct *task)
|
||||
{
|
||||
struct pid *pid;
|
||||
unsigned int level;
|
||||
struct upid upid;
|
||||
unsigned int inum;
|
||||
|
||||
/* get the pid namespace by following task_active_pid_ns(),
|
||||
* pid->numbers[pid->level].ns
|
||||
*/
|
||||
pid = BPF_CORE_READ(task, thread_pid);
|
||||
level = BPF_CORE_READ(pid, level);
|
||||
bpf_core_read(&upid, sizeof(upid), &pid->numbers[level]);
|
||||
inum = BPF_CORE_READ(upid.ns, ns.inum);
|
||||
|
||||
return inum;
|
||||
}
|
||||
|
||||
static int handle_switch(bool preempt, struct task_struct *prev, struct task_struct *next)
|
||||
{
|
||||
struct hist *histp;
|
||||
u64 *tsp, slot;
|
||||
u32 pid, hkey;
|
||||
s64 delta;
|
||||
|
||||
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
|
||||
return 0;
|
||||
|
||||
if (get_task_state(prev) == TASK_RUNNING)
|
||||
trace_enqueue(BPF_CORE_READ(prev, tgid), BPF_CORE_READ(prev, pid));
|
||||
|
||||
pid = BPF_CORE_READ(next, pid);
|
||||
|
||||
tsp = bpf_map_lookup_elem(&start, &pid);
|
||||
if (!tsp)
|
||||
return 0;
|
||||
delta = bpf_ktime_get_ns() - *tsp;
|
||||
if (delta < 0)
|
||||
goto cleanup;
|
||||
|
||||
if (targ_per_process)
|
||||
hkey = BPF_CORE_READ(next, tgid);
|
||||
else if (targ_per_thread)
|
||||
hkey = pid;
|
||||
else if (targ_per_pidns)
|
||||
hkey = pid_namespace(next);
|
||||
else
|
||||
hkey = -1;
|
||||
histp = bpf_map_lookup_or_try_init(&hists, &hkey, &zero);
|
||||
if (!histp)
|
||||
goto cleanup;
|
||||
if (!histp->comm[0])
|
||||
bpf_probe_read_kernel_str(&histp->comm, sizeof(histp->comm),
|
||||
next->comm);
|
||||
if (targ_ms)
|
||||
delta /= 1000000U;
|
||||
else
|
||||
delta /= 1000U;
|
||||
slot = log2l(delta);
|
||||
if (slot >= MAX_SLOTS)
|
||||
slot = MAX_SLOTS - 1;
|
||||
__sync_fetch_and_add(&histp->slots[slot], 1);
|
||||
|
||||
cleanup:
|
||||
bpf_map_delete_elem(&start, &pid);
|
||||
return 0;
|
||||
}
|
||||
|
||||
SEC("raw_tp/sched_wakeup")
|
||||
int BPF_PROG(handle_sched_wakeup, struct task_struct *p)
|
||||
{
|
||||
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
|
||||
return 0;
|
||||
|
||||
return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
|
||||
}
|
||||
|
||||
SEC("raw_tp/sched_wakeup_new")
|
||||
int BPF_PROG(handle_sched_wakeup_new, struct task_struct *p)
|
||||
{
|
||||
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
|
||||
return 0;
|
||||
|
||||
return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
|
||||
}
|
||||
|
||||
SEC("raw_tp/sched_switch")
|
||||
int BPF_PROG(handle_sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)
|
||||
{
|
||||
return handle_switch(preempt, prev, next);
|
||||
}
|
||||
|
||||
char LICENSE[] SEC("license") = "GPL";
|
||||
14
src/9-runqlat/runqlat.h
Normal file
@@ -0,0 +1,14 @@
|
||||
|
||||
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
|
||||
#ifndef __RUNQLAT_H
|
||||
#define __RUNQLAT_H
|
||||
|
||||
#define TASK_COMM_LEN 16
|
||||
#define MAX_SLOTS 26
|
||||
|
||||
struct hist {
|
||||
__u32 slots[MAX_SLOTS];
|
||||
char comm[TASK_COMM_LEN];
|
||||
};
|
||||
|
||||
#endif /* __RUNQLAT_H */
|
||||
37
src/SUMMARY.md
Normal file
@@ -0,0 +1,37 @@
|
||||
# Summary

# eBPF Beginner Development Tutorial

- [eBPF Beginner Development Tutorial 1: Basic Concepts of eBPF and Common Development Tools](0-introduce/README.md)
- [eBPF Beginner Development Tutorial 2: Hello World, Basic Framework and Development Workflow](1-helloworld/README.md)
- [eBPF Beginner Development Tutorial 2: Using kprobe in eBPF to Monitor and Capture the unlink System Call](2-kprobe-unlink/README.md)
- [eBPF Beginner Development Tutorial 3: Using fentry in eBPF to Monitor and Capture the unlink System Call](3-fentry-unlink/README.md)
- [eBPF Beginner Development Tutorial 4: Capturing the System Calls That Open Files in eBPF, Filtering by Process pid with Global Variables](4-opensnoop/README.md)
- [eBPF Beginner Development Tutorial 5: Using uprobe in eBPF to Capture bash readline Function Calls](5-uprobe-bashreadline/README.md)
- [eBPF Beginner Development Tutorial 6: Capturing the System Calls That Send Signals, Using a hash map to Store State](6-sigsnoop/README.md)
- [eBPF Beginner Practice Tutorial 7: Capturing Process Execution/Exit Time, Printing to User Space via perf event array](7-execsnoop/README.md)
- [eBPF Beginner Development Tutorial 8: Using exitsnoop in eBPF to Monitor Process Exit Events, Printing to User Space via ring buffer](8-exitsnoop/README.md)
- [eBPF Beginner Development Tutorial 9: A Linux Kernel BPF Program That Summarizes Scheduler Run Queue Latency as a Histogram, Showing How Long Tasks Wait to Run on a CPU](9-runqlat/README.md)
- [eBPF Beginner Development Tutorial 10: Using hardirqs or softirqs in eBPF to Capture Interrupt Events](10-hardirqs/README.md)
- [eBPF Beginner Development Tutorial 11: Using bootstrap in eBPF to Develop User-Space Programs and Trace the exec() and exit() System Calls](11-bootstrap/README.md)

# eBPF Beginner Practice Tutorial

- [eBPF Beginner Practice Tutorial: Using libbpf-bootstrap to Develop a Program That Measures TCP Connection Latency](13-tcpconnlat/README.md)
- [eBPF Beginner Practice Tutorial: Writing the eBPF Program tcpconnlat to Measure TCP Connection Latency](13-tcpconnlat/tcpconnlat.md)
- [eBPF Beginner Practice Tutorial: Using libbpf-bootstrap to Develop a Program That Measures TCP Connection Latency](14-tcpstates/README.md)
- [eBPF Beginner Practice Tutorial: Writing the eBPF Program Tcprtt to Measure TCP Round-Trip Time](15-tcprtt/README.md)
- [eBPF Beginner Practice Tutorial: Writing the eBPF Program Memleak to Monitor Memory Leaks](16-memleak/README.md)
- [eBPF Beginner Practice Tutorial: Writing the eBPF Program Biopattern: Counting Random/Sequential Disk I/O](17-biopattern/README.md)
- [More Reference Material](18-further-reading/README.md)
- [eBPF Beginner Practice Tutorial: Using LSM for Security Detection and Defense](19-lsm-connect/README.md)
- [eBPF Beginner Practice Tutorial: Using eBPF for tc Traffic Control](20-tc/README.md)

# bcc Guide

- [BPF Features by Linux Kernel Version](bcc-documents/kernel-versions.md)
- [Kernel Configuration for BPF Features](bcc-documents/kernel_config.md)
- [bcc Reference Guide](bcc-documents/reference_guide.md)
- [Special Filtering](bcc-documents/special_filtering.md)
- [bcc Tutorial](bcc-documents/tutorial.md)
- [bcc Python Developer Tutorial](bcc-documents/tutorial_bcc_python_developer.md)
|
||||
514
src/bcc-documents/kernel-versions.md
Normal file
@@ -0,0 +1,514 @@
|
||||
# BPF Features by Linux Kernel Version
|
||||
|
||||
## eBPF support
|
||||
|
||||
Kernel version | Commit
---------------|-------
3.15 | [`bd4cf0ed331a`](https://github.com/torvalds/linux/commit/bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8)
|
||||
|
||||
## JIT compiling
|
||||
|
||||
The list of supported architectures for your kernel can be retrieved with:

```sh
git grep HAVE_EBPF_JIT arch/
```
|
||||
|
||||
Feature / Architecture | Kernel version | Commit
-----------------------|----------------|-------
x86\_64 | 3.16 | [`622582786c9e`](https://github.com/torvalds/linux/commit/622582786c9e041d0bd52bde201787adeab249f8)
ARM64 | 3.18 | [`e54bcde3d69d`](https://github.com/torvalds/linux/commit/e54bcde3d69d40023ae77727213d14f920eb264a)
s390 | 4.1 | [`054623105728`](https://github.com/torvalds/linux/commit/054623105728b06852f077299e2bf1bf3d5f2b0b)
Constant blinding for JIT machines | 4.7 | [`4f3446bb809f`](https://github.com/torvalds/linux/commit/4f3446bb809f20ad56cadf712e6006815ae7a8f9)
PowerPC64 | 4.8 | [`156d0e290e96`](https://github.com/torvalds/linux/commit/156d0e290e969caba25f1851c52417c14d141b24)
Constant blinding - PowerPC64 | 4.9 | [`b7b7013cac55`](https://github.com/torvalds/linux/commit/b7b7013cac55d794940bd9cb7b7c55c9dececac4)
Sparc64 | 4.12 | [`7a12b5031c6b`](https://github.com/torvalds/linux/commit/7a12b5031c6b947cc13918237ae652b536243b76)
MIPS | 4.13 | [`f381bf6d82f0`](https://github.com/torvalds/linux/commit/f381bf6d82f032b7410185b35d000ea370ac706b)
ARM32 | 4.14 | [`39c13c204bb1`](https://github.com/torvalds/linux/commit/39c13c204bb1150d401e27d41a9d8b332be47c49)
x86\_32 | 4.18 | [`03f5781be2c7`](https://github.com/torvalds/linux/commit/03f5781be2c7b7e728d724ac70ba10799cc710d7)
RISC-V RV64G | 5.1 | [`2353ecc6f91f`](https://github.com/torvalds/linux/commit/2353ecc6f91fd15b893fa01bf85a1c7a823ee4f2)
RISC-V RV32G | 5.7 | [`5f316b65e99f`](https://github.com/torvalds/linux/commit/5f316b65e99f109942c556dc8790abd4c75bcb34)
PowerPC32 | 5.13 | [`51c66ad849a7`](https://github.com/torvalds/linux/commit/51c66ad849a703d9bbfd7704c941827aed0fd9fd)
LoongArch | 6.1 | [`5dc615520c4d`](https://github.com/torvalds/linux/commit/5dc615520c4dfb358245680f1904bad61116648e)
|
||||
|
||||
## Main features
|
||||
|
||||
Several (but not all) of these _main features_ translate to an eBPF program type.
The list of such program types supported in your kernel can be found in file
[`include/uapi/linux/bpf.h`](https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h):

```sh
git grep -W 'bpf_prog_type {' include/uapi/linux/bpf.h
```
|
||||
|
||||
Feature | Kernel version | Commit
--------|----------------|-------
`AF_PACKET` (libpcap/tcpdump, `cls_bpf` classifier, netfilter's `xt_bpf`, team driver's load-balancing mode…) | 3.15 | [`bd4cf0ed331a`](https://github.com/torvalds/linux/commit/bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8)
Kernel helpers | 3.15 | [`bd4cf0ed331a`](https://github.com/torvalds/linux/commit/bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8)
`bpf()` syscall | 3.18 | [`99c55f7d47c0`](https://github.com/torvalds/linux/commit/99c55f7d47c0dc6fc64729f37bf435abf43f4c60)
Maps (_a.k.a._ Tables; details below) | 3.18 | [`99c55f7d47c0`](https://github.com/torvalds/linux/commit/99c55f7d47c0dc6fc64729f37bf435abf43f4c60)
BPF attached to sockets | 3.19 | [`89aa075832b0`](https://github.com/torvalds/linux/commit/89aa075832b0da4402acebd698d0411dcc82d03e)
BPF attached to `kprobes` | 4.1 | [`2541517c32be`](https://github.com/torvalds/linux/commit/2541517c32be2531e0da59dfd7efc1ce844644f5)
`cls_bpf` / `act_bpf` for `tc` | 4.1 | [`e2e9b6541dd4`](https://github.com/torvalds/linux/commit/e2e9b6541dd4b31848079da80fe2253daaafb549)
Tail calls | 4.2 | [`04fd61ab36ec`](https://github.com/torvalds/linux/commit/04fd61ab36ec065e194ab5e74ae34a5240d992bb)
Non-root programs on sockets | 4.4 | [`1be7f75d1668`](https://github.com/torvalds/linux/commit/1be7f75d1668d6296b80bf35dcf6762393530afc)
Persistent maps and programs (virtual FS) | 4.4 | [`b2197755b263`](https://github.com/torvalds/linux/commit/b2197755b2633e164a439682fb05a9b5ea48f706)
`tc`'s `direct-action` (`da`) mode | 4.4 | [`045efa82ff56`](https://github.com/torvalds/linux/commit/045efa82ff563cd4e656ca1c2e354fa5bf6bbda4)
`tc`'s `clsact` qdisc | 4.5 | [`1f211a1b929c`](https://github.com/torvalds/linux/commit/1f211a1b929c804100e138c5d3d656992cfd5622)
BPF attached to tracepoints | 4.7 | [`98b5c2c65c29`](https://github.com/torvalds/linux/commit/98b5c2c65c2951772a8fc661f50d675e450e8bce)
Direct packet access | 4.7 | [`969bf05eb3ce`](https://github.com/torvalds/linux/commit/969bf05eb3cedd5a8d4b7c346a85c2ede87a6d6d)
XDP (see below) | 4.8 | [`6a773a15a1e8`](https://github.com/torvalds/linux/commit/6a773a15a1e8874e5eccd2f29190c31085912c95)
BPF attached to perf events | 4.9 | [`0515e5999a46`](https://github.com/torvalds/linux/commit/0515e5999a466dfe6e1924f460da599bb6821487)
Hardware offload for `tc`'s `cls_bpf` | 4.9 | [`332ae8e2f6ec`](https://github.com/torvalds/linux/commit/332ae8e2f6ecda5e50c5c62ed62894963e3a83f5)
Verifier exposure and internal hooks | 4.9 | [`13a27dfc6697`](https://github.com/torvalds/linux/commit/13a27dfc669724564aafa2699976ee756029fed2)
BPF attached to cgroups for socket filtering | 4.10 | [`0e33661de493`](https://github.com/torvalds/linux/commit/0e33661de493db325435d565a4a722120ae4cbf3)
Lightweight tunnel encapsulation | 4.10 | [`3a0af8fd61f9`](https://github.com/torvalds/linux/commit/3a0af8fd61f90920f6fa04e4f1e9a6a73c1b4fd2)
**e**BPF support for `xt_bpf` module (iptables) | 4.10 | [`2c16d6033264`](https://github.com/torvalds/linux/commit/2c16d60332643e90d4fa244f4a706c454b8c7569)
BPF program tag | 4.10 | [`7bd509e311f4`](https://github.com/torvalds/linux/commit/7bd509e311f408f7a5132fcdde2069af65fa05ae)
Tracepoints to debug BPF | 4.11 (removed in 4.18) | [`a67edbf4fb6d`](https://github.com/torvalds/linux/commit/a67edbf4fb6deadcfe57a04a134abed4a5ba3bb5) [`4d220ed0f814`](https://github.com/torvalds/linux/commit/4d220ed0f8140c478ab7b0a14d96821da639b646)
Testing / benchmarking BPF programs | 4.12 | [`1cf1cae963c2`](https://github.com/torvalds/linux/commit/1cf1cae963c2e6032aebe1637e995bc2f5d330f4)
BPF programs and maps IDs | 4.13 | [`dc4bb0e23561`](https://github.com/torvalds/linux/commit/dc4bb0e2356149aee4cdae061936f3bbdd45595c)
BPF support for `sock_ops` | 4.13 | [`40304b2a1567`](https://github.com/torvalds/linux/commit/40304b2a1567fecc321f640ee4239556dd0f3ee0)
BPF support for skbs on sockets | 4.14 | [`b005fd189cec`](https://github.com/torvalds/linux/commit/b005fd189cec9407b700599e1e80e0552446ee79)
bpftool utility in kernel sources | 4.15 | [`71bb428fe2c1`](https://github.com/torvalds/linux/commit/71bb428fe2c19512ac671d5ee16ef3e73e1b49a8)
BPF attached to cgroups as device controller | 4.15 | [`ebc614f68736`](https://github.com/torvalds/linux/commit/ebc614f687369f9df99828572b1d85a7c2de3d92)
bpf2bpf function calls | 4.16 | [`cc8b0b92a169`](https://github.com/torvalds/linux/commit/cc8b0b92a1699bc32f7fec71daa2bfc90de43a4d)
BPF used for monitoring socket RX/TX data | 4.17 | [`4f738adba30a`](https://github.com/torvalds/linux/commit/4f738adba30a7cfc006f605707e7aee847ffefa0)
BPF attached to raw tracepoints | 4.17 | [`c4f6699dfcb8`](https://github.com/torvalds/linux/commit/c4f6699dfcb8558d138fe838f741b2c10f416cf9)
BPF attached to `bind()` system call | 4.17 | [`4fbac77d2d09`](https://github.com/torvalds/linux/commit/4fbac77d2d092b475dda9eea66da674369665427) [`aac3fc320d94`](https://github.com/torvalds/linux/commit/aac3fc320d9404f2665a8b1249dc3170d5fa3caf)
BPF attached to `connect()` system call | 4.17 | [`d74bad4e74ee`](https://github.com/torvalds/linux/commit/d74bad4e74ee373787a9ae24197c17b7cdc428d5)
BPF Type Format (BTF) | 4.18 | [`69b693f0aefa`](https://github.com/torvalds/linux/commit/69b693f0aefa0ed521e8bd02260523b5ae446ad7)
AF_XDP | 4.18 | [`fbfc504a24f5`](https://github.com/torvalds/linux/commit/fbfc504a24f53f7ebe128ab55cb5dba634f4ece8)
bpfilter | 4.18 | [`d2ba09c17a06`](https://github.com/torvalds/linux/commit/d2ba09c17a0647f899d6c20a11bab9e6d3382f07)
End.BPF action for seg6local LWT | 4.18 | [`004d4b274e2a`](https://github.com/torvalds/linux/commit/004d4b274e2a1a895a0e5dc66158b90a7d463d44)
BPF attached to LIRC devices | 4.18 | [`f4364dcfc86d`](https://github.com/torvalds/linux/commit/f4364dcfc86df7c1ca47b256eaf6b6d0cdd0d936)
Pass map values to map helpers | 4.18 | [`d71962f3e627`](https://github.com/torvalds/linux/commit/d71962f3e627b5941804036755c844fabfb65ff5)
BPF socket reuseport | 4.19 | [`2dbb9b9e6df6`](https://github.com/torvalds/linux/commit/2dbb9b9e6df67d444fbe425c7f6014858d337adf)
BPF flow dissector | 4.20 | [`d58e468b1112`](https://github.com/torvalds/linux/commit/d58e468b1112dcd1d5193c0a89ff9f98b5a3e8b9)
BPF 1M insn limit | 5.2 | [`c04c0d2b968a`](https://github.com/torvalds/linux/commit/c04c0d2b968ac45d6ef020316808ef6c82325a82)
BPF cgroup sysctl | 5.2 | [`7b146cebe30c`](https://github.com/torvalds/linux/commit/7b146cebe30cb481b0f70d85779da938da818637)
BPF raw tracepoint writable | 5.2 | [`9df1c28bb752`](https://github.com/torvalds/linux/commit/9df1c28bb75217b244257152ab7d788bb2a386d0)
BPF bounded loop | 5.3 | [`2589726d12a1`](https://github.com/torvalds/linux/commit/2589726d12a1b12eaaa93c7f1ea64287e383c7a5)
BPF trampoline | 5.5 | [`fec56f5890d9`](https://github.com/torvalds/linux/commit/fec56f5890d93fc2ed74166c397dc186b1c25951)
BPF LSM hook | 5.7 | [`fc611f47f218`](https://github.com/torvalds/linux/commit/fc611f47f2188ade2b48ff6902d5cce8baac0c58) [`641cd7b06c91`](https://github.com/torvalds/linux/commit/641cd7b06c911c5935c34f24850ea18690649917)
BPF iterator | 5.8 | [`180139dca8b3`](https://github.com/torvalds/linux/commit/180139dca8b38c858027b8360ee10064fdb2fbf7)
BPF socket lookup hook | 5.9 | [`e9ddbb7707ff`](https://github.com/torvalds/linux/commit/e9ddbb7707ff5891616240026062b8c1e29864ca)
Sleepable BPF programs | 5.10 | [`1e6c62a88215`](https://github.com/torvalds/linux/commit/1e6c62a8821557720a9b2ea9617359b264f2f67c)
|
||||
|
||||
### Program types
|
||||
|
||||
Program type | Kernel version | Commit | Enum
|
||||
-------------|----------------|--------|-----
|
||||
Socket filter | 3.19 | [`ddd872bc3098`](https://github.com/torvalds/linux/commit/ddd872bc3098f9d9abe1680a6b2013e59e3337f7) | BPF_PROG_TYPE_SOCKET_FILTER
|
||||
Kprobe | 4.1 | [`2541517c32be`](https://github.com/torvalds/linux/commit/2541517c32be2531e0da59dfd7efc1ce844644f5) | BPF_PROG_TYPE_KPROBE
|
||||
traffic control (TC) | 4.1 | [`96be4325f443`](https://github.com/torvalds/linux/commit/96be4325f443dbbfeb37d2a157675ac0736531a1) | BPF_PROG_TYPE_SCHED_CLS
|
||||
traffic control (TC) | 4.1 | [`94caee8c312d`](https://github.com/torvalds/linux/commit/94caee8c312d96522bcdae88791aaa9ebcd5f22c) | BPF_PROG_TYPE_SCHED_ACT
|
||||
Tracepoint | 4.7 | [`98b5c2c65c29`](https://github.com/torvalds/linux/commit/98b5c2c65c2951772a8fc661f50d675e450e8bce) | BPF_PROG_TYPE_TRACEPOINT
|
||||
XDP | 4.8 | [`6a773a15a1e8`](https://github.com/torvalds/linux/commit/6a773a15a1e8874e5eccd2f29190c31085912c95) | BPF_PROG_TYPE_XDP
|
||||
Perf event | 4.9 | [`0515e5999a46`](https://github.com/torvalds/linux/commit/0515e5999a466dfe6e1924f460da599bb6821487) | BPF_PROG_TYPE_PERF_EVENT
|
||||
cgroup socket filtering | 4.10 | [`0e33661de493`](https://github.com/torvalds/linux/commit/0e33661de493db325435d565a4a722120ae4cbf3) | BPF_PROG_TYPE_CGROUP_SKB
|
||||
cgroup sock modification | 4.10 | [`610236587600`](https://github.com/torvalds/linux/commit/61023658760032e97869b07d54be9681d2529e77) | BPF_PROG_TYPE_CGROUP_SOCK
|
||||
lightweight tunnel (IN) | 4.10 | [`3a0af8fd61f9`](https://github.com/torvalds/linux/commit/3a0af8fd61f90920f6fa04e4f1e9a6a73c1b4fd2) | BPF_PROG_TYPE_LWT_IN
|
||||
lightweight tunnel (OUT) | 4.10 | [`3a0af8fd61f9`](https://github.com/torvalds/linux/commit/3a0af8fd61f90920f6fa04e4f1e9a6a73c1b4fd2) | BPF_PROG_TYPE_LWT_OUT
|
||||
lightweight tunnel (XMIT) | 4.10 | [`3a0af8fd61f9`](https://github.com/torvalds/linux/commit/3a0af8fd61f90920f6fa04e4f1e9a6a73c1b4fd2) | BPF_PROG_TYPE_LWT_XMIT
|
||||
cgroup sock ops (per conn) | 4.13 | [`40304b2a1567`](https://github.com/torvalds/linux/commit/40304b2a1567fecc321f640ee4239556dd0f3ee0) | BPF_PROG_TYPE_SOCK_OPS
|
||||
stream parser / stream verdict | 4.14 | [`b005fd189cec`](https://github.com/torvalds/linux/commit/b005fd189cec9407b700599e1e80e0552446ee79) | BPF_PROG_TYPE_SK_SKB
|
||||
cgroup device manager | 4.15 | [`ebc614f68736`](https://github.com/torvalds/linux/commit/ebc614f687369f9df99828572b1d85a7c2de3d92) | BPF_PROG_TYPE_CGROUP_DEVICE
socket msg verdict | 4.17 | [`4f738adba30a`](https://github.com/torvalds/linux/commit/4f738adba30a7cfc006f605707e7aee847ffefa0) | BPF_PROG_TYPE_SK_MSG
Raw tracepoint | 4.17 | [`c4f6699dfcb8`](https://github.com/torvalds/linux/commit/c4f6699dfcb8558d138fe838f741b2c10f416cf9) | BPF_PROG_TYPE_RAW_TRACEPOINT
socket binding | 4.17 | [`4fbac77d2d09`](https://github.com/torvalds/linux/commit/4fbac77d2d092b475dda9eea66da674369665427) | BPF_PROG_TYPE_CGROUP_SOCK_ADDR
LWT seg6local | 4.18 | [`004d4b274e2a`](https://github.com/torvalds/linux/commit/004d4b274e2a1a895a0e5dc66158b90a7d463d44) | BPF_PROG_TYPE_LWT_SEG6LOCAL
lirc devices | 4.18 | [`f4364dcfc86d`](https://github.com/torvalds/linux/commit/f4364dcfc86df7c1ca47b256eaf6b6d0cdd0d936) | BPF_PROG_TYPE_LIRC_MODE2
lookup SO_REUSEPORT socket | 4.19 | [`2dbb9b9e6df6`](https://github.com/torvalds/linux/commit/2dbb9b9e6df67d444fbe425c7f6014858d337adf) | BPF_PROG_TYPE_SK_REUSEPORT
flow dissector | 4.20 | [`d58e468b1112`](https://github.com/torvalds/linux/commit/d58e468b1112dcd1d5193c0a89ff9f98b5a3e8b9) | BPF_PROG_TYPE_FLOW_DISSECTOR
cgroup sysctl | 5.2 | [`7b146cebe30c`](https://github.com/torvalds/linux/commit/7b146cebe30cb481b0f70d85779da938da818637) | BPF_PROG_TYPE_CGROUP_SYSCTL
writable raw tracepoints | 5.2 | [`9df1c28bb752`](https://github.com/torvalds/linux/commit/9df1c28bb75217b244257152ab7d788bb2a386d0) | BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
cgroup getsockopt/setsockopt | 5.3 | [`0d01da6afc54`](https://github.com/torvalds/linux/commit/0d01da6afc5402f60325c5da31b22f7d56689b49) | BPF_PROG_TYPE_CGROUP_SOCKOPT
Tracing (BTF/BPF trampoline) | 5.5 | [`f1b9509c2fb0`](https://github.com/torvalds/linux/commit/f1b9509c2fb0ef4db8d22dac9aef8e856a5d81f6) | BPF_PROG_TYPE_TRACING
struct ops | 5.6 | [`27ae7997a661`](https://github.com/torvalds/linux/commit/27ae7997a66174cb8afd6a75b3989f5e0c1b9e5a) | BPF_PROG_TYPE_STRUCT_OPS
extensions | 5.6 | [`be8704ff07d2`](https://github.com/torvalds/linux/commit/be8704ff07d2374bcc5c675526f95e70c6459683) | BPF_PROG_TYPE_EXT
LSM | 5.7 | [`fc611f47f218`](https://github.com/torvalds/linux/commit/fc611f47f2188ade2b48ff6902d5cce8baac0c58) | BPF_PROG_TYPE_LSM
lookup listening socket | 5.9 | [`e9ddbb7707ff`](https://github.com/torvalds/linux/commit/e9ddbb7707ff5891616240026062b8c1e29864ca) | BPF_PROG_TYPE_SK_LOOKUP
Allow executing syscalls | 5.15 | [`79a7f8bdb159`](https://github.com/torvalds/linux/commit/79a7f8bdb159d9914b58740f3d31d602a6e4aca8) | BPF_PROG_TYPE_SYSCALL

## Maps (_a.k.a._ Tables, in BCC lingo)

### Map types

The list of map types supported in your kernel can be found in file
[`include/uapi/linux/bpf.h`](https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h):
```sh
git grep -W 'bpf_map_type {' include/uapi/linux/bpf.h
```
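
Without a kernel git checkout, roughly the same list can be pulled from an installed copy of the header. The sketch below is only an illustration: the function name `list_map_types` is made up here, and the path `/usr/include/linux/bpf.h` assumes your distribution installs the kernel UAPI headers there. Installed headers describe the kernel they were packaged for, which may differ from the running one, and the pattern also picks up names that appear only in comments.

```sh
# list_map_types FILE: print every BPF_MAP_TYPE_* identifier found in FILE.
# Hypothetical helper for illustration; it matches names in comments too.
list_map_types() {
    grep -o 'BPF_MAP_TYPE_[A-Z0-9_]*' "$1" | sort -u
}

# Assumed header location; skipped quietly when the file is not installed.
if [ -r /usr/include/linux/bpf.h ]; then
    list_map_types /usr/include/linux/bpf.h
fi
```

The enum names printed this way correspond to the last column of the table of map types.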

Map type | Kernel version | Commit | Enum
----------|----------------|--------|------
Hash | 3.19 | [`0f8e4bd8a1fc`](https://github.com/torvalds/linux/commit/0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475) | BPF_MAP_TYPE_HASH
Array | 3.19 | [`28fbcfa08d8e`](https://github.com/torvalds/linux/commit/28fbcfa08d8ed7c5a50d41a0433aad222835e8e3) | BPF_MAP_TYPE_ARRAY
Prog array | 4.2 | [`04fd61ab36ec`](https://github.com/torvalds/linux/commit/04fd61ab36ec065e194ab5e74ae34a5240d992bb) | BPF_MAP_TYPE_PROG_ARRAY
Perf events | 4.3 | [`ea317b267e9d`](https://github.com/torvalds/linux/commit/ea317b267e9d03a8241893aa176fba7661d07579) | BPF_MAP_TYPE_PERF_EVENT_ARRAY
Per-CPU hash | 4.6 | [`824bd0ce6c7c`](https://github.com/torvalds/linux/commit/824bd0ce6c7c43a9e1e210abf124958e54d88342) | BPF_MAP_TYPE_PERCPU_HASH
Per-CPU array | 4.6 | [`a10423b87a7e`](https://github.com/torvalds/linux/commit/a10423b87a7eae75da79ce80a8d9475047a674ee) | BPF_MAP_TYPE_PERCPU_ARRAY
Stack trace | 4.6 | [`d5a3b1f69186`](https://github.com/torvalds/linux/commit/d5a3b1f691865be576c2bffa708549b8cdccda19) | BPF_MAP_TYPE_STACK_TRACE
cgroup array | 4.8 | [`4ed8ec521ed5`](https://github.com/torvalds/linux/commit/4ed8ec521ed57c4e207ad464ca0388776de74d4b) | BPF_MAP_TYPE_CGROUP_ARRAY
LRU hash | 4.10 | [`29ba732acbee`](https://github.com/torvalds/linux/commit/29ba732acbeece1e34c68483d1ec1f3720fa1bb3) [`3a08c2fd7634`](https://github.com/torvalds/linux/commit/3a08c2fd763450a927d1130de078d6f9e74944fb) | BPF_MAP_TYPE_LRU_HASH
LRU per-CPU hash | 4.10 | [`8f8449384ec3`](https://github.com/torvalds/linux/commit/8f8449384ec364ba2a654f11f94e754e4ff719e0) [`961578b63474`](https://github.com/torvalds/linux/commit/961578b63474d13ad0e2f615fcc2901c5197dda6) | BPF_MAP_TYPE_LRU_PERCPU_HASH
LPM trie (longest-prefix match) | 4.11 | [`b95a5c4db09b`](https://github.com/torvalds/linux/commit/b95a5c4db09bc7c253636cb84dc9b12c577fd5a0) | BPF_MAP_TYPE_LPM_TRIE
Array of maps | 4.12 | [`56f668dfe00d`](https://github.com/torvalds/linux/commit/56f668dfe00dcf086734f1c42ea999398fad6572) | BPF_MAP_TYPE_ARRAY_OF_MAPS
Hash of maps | 4.12 | [`bcc6b1b7ebf8`](https://github.com/torvalds/linux/commit/bcc6b1b7ebf857a9fe56202e2be3361131588c15) | BPF_MAP_TYPE_HASH_OF_MAPS
Netdevice references (array) | 4.14 | [`546ac1ffb70d`](https://github.com/torvalds/linux/commit/546ac1ffb70d25b56c1126940e5ec639c4dd7413) | BPF_MAP_TYPE_DEVMAP
Socket references (array) | 4.14 | [`174a79ff9515`](https://github.com/torvalds/linux/commit/174a79ff9515f400b9a6115643dafd62a635b7e6) | BPF_MAP_TYPE_SOCKMAP
CPU references | 4.15 | [`6710e1126934`](https://github.com/torvalds/linux/commit/6710e1126934d8b4372b4d2f9ae1646cd3f151bf) | BPF_MAP_TYPE_CPUMAP
AF_XDP socket (XSK) references | 4.18 | [`fbfc504a24f5`](https://github.com/torvalds/linux/commit/fbfc504a24f53f7ebe128ab55cb5dba634f4ece8) | BPF_MAP_TYPE_XSKMAP
Socket references (hashmap) | 4.18 | [`81110384441a`](https://github.com/torvalds/linux/commit/81110384441a59cff47430f20f049e69b98c17f4) | BPF_MAP_TYPE_SOCKHASH
cgroup storage | 4.19 | [`de9cbbaadba5`](https://github.com/torvalds/linux/commit/de9cbbaadba5adf88a19e46df61f7054000838f6) | BPF_MAP_TYPE_CGROUP_STORAGE
reuseport sockarray | 4.19 | [`5dc4c4b7d4e8`](https://github.com/torvalds/linux/commit/5dc4c4b7d4e8115e7cde96a030f98cb3ab2e458c) | BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
percpu cgroup storage | 4.20 | [`b741f1630346`](https://github.com/torvalds/linux/commit/b741f1630346defcbc8cc60f1a2bdae8b3b0036f) | BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE
queue | 4.20 | [`f1a2e44a3aec`](https://github.com/torvalds/linux/commit/f1a2e44a3aeccb3ff18d3ccc0b0203e70b95bd92) | BPF_MAP_TYPE_QUEUE
stack | 4.20 | [`f1a2e44a3aec`](https://github.com/torvalds/linux/commit/f1a2e44a3aeccb3ff18d3ccc0b0203e70b95bd92) | BPF_MAP_TYPE_STACK
socket local storage | 5.2 | [`6ac99e8f23d4`](https://github.com/torvalds/linux/commit/6ac99e8f23d4b10258406ca0dd7bffca5f31da9d) | BPF_MAP_TYPE_SK_STORAGE
Netdevice references (hashmap) | 5.4 | [`6f9d451ab1a3`](https://github.com/torvalds/linux/commit/6f9d451ab1a33728adb72d7ff66a7b374d665176) | BPF_MAP_TYPE_DEVMAP_HASH
struct ops | 5.6 | [`85d33df357b6`](https://github.com/torvalds/linux/commit/85d33df357b634649ddbe0a20fd2d0fc5732c3cb) | BPF_MAP_TYPE_STRUCT_OPS
ring buffer | 5.8 | [`457f44363a88`](https://github.com/torvalds/linux/commit/457f44363a8894135c85b7a9afd2bd8196db24ab) | BPF_MAP_TYPE_RINGBUF
inode storage | 5.10 | [`8ea636848aca`](https://github.com/torvalds/linux/commit/8ea636848aca35b9f97c5b5dee30225cf2dd0fe6) | BPF_MAP_TYPE_INODE_STORAGE
task storage | 5.11 | [`4cf1bc1f1045`](https://github.com/torvalds/linux/commit/4cf1bc1f10452065a29d576fc5693fc4fab5b919) | BPF_MAP_TYPE_TASK_STORAGE
Bloom filter | 5.16 | [`9330986c0300`](https://github.com/torvalds/linux/commit/9330986c03006ab1d33d243b7cfe598a7a3c1baa) | BPF_MAP_TYPE_BLOOM_FILTER
user ringbuf | 6.1 | [`583c1f420173`](https://github.com/torvalds/linux/commit/583c1f420173f7d84413a1a1fbf5109d798b4faa) | BPF_MAP_TYPE_USER_RINGBUF

### Map userspace API

Some (but not all) of these *API features* translate to a subcommand beginning with `BPF_MAP_`.
The list of subcommands supported in your kernel can be found in file
[`include/uapi/linux/bpf.h`](https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h):
```sh
git grep -W 'bpf_cmd {' include/uapi/linux/bpf.h
```
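
Each row in the tables of this document gives the first upstream kernel release that carries a feature. As a rough sketch, the running kernel can be compared against such a version with `sort -V` (a GNU coreutils option, assumed to be available). The helper name `version_ge` and the chosen threshold are illustrative only. A version comparison is approximate at best: distribution kernels often backport features, so probing the running kernel directly (for instance with `bpftool feature probe`) is more reliable.

```sh
# version_ge A B: succeed when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort), so 5.10 correctly ranks above 5.2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distribution suffix from `uname -r`, e.g. 5.15.0-91-generic -> 5.15.0
running=$(uname -r | sed 's/[^0-9.].*//')

# 4.20 is the version this document lists for LOOKUP_AND_DELETE_ELEM
if version_ge "$running" 4.20; then
    echo "kernel $running: at or above the first upstream release with this feature"
else
    echo "kernel $running: older than the first upstream release with this feature"
fi
```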

Feature | Kernel version | Commit
--------|----------------|-------
Basic operations (lookup, update, delete, `GET_NEXT_KEY`) | 3.18 | [`db20fd2b0108`](https://github.com/torvalds/linux/commit/db20fd2b01087bdfbe30bce314a198eefedcc42e)
Pass flags to `UPDATE_ELEM` | 3.19 | [`3274f52073d8`](https://github.com/torvalds/linux/commit/3274f52073d88b62f3c5ace82ae9d48546232e72)
Pre-alloc map memory by default | 4.6 | [`6c9059817432`](https://github.com/torvalds/linux/commit/6c90598174322b8888029e40dd84a4eb01f56afe)
Pass `NULL` to `GET_NEXT_KEY` | 4.12 | [`8fe45924387b`](https://github.com/torvalds/linux/commit/8fe45924387be6b5c1be59a7eb330790c61d5d10)
Creation: select NUMA node | 4.14 | [`96eabe7a40aa`](https://github.com/torvalds/linux/commit/96eabe7a40aa17e613cf3db2c742ee8b1fc764d0)
Restrict access from syscall side | 4.15 | [`6e71b04a8224`](https://github.com/torvalds/linux/commit/6e71b04a82248ccf13a94b85cbc674a9fefe53f5)
Creation: specify map name | 4.15 | [`ad5b177bd73f`](https://github.com/torvalds/linux/commit/ad5b177bd73f5107d97c36f56395c4281fb6f089)
`LOOKUP_AND_DELETE_ELEM` | 4.20 | [`bd513cd08f10`](https://github.com/torvalds/linux/commit/bd513cd08f10cbe28856f99ae951e86e86803861)
Creation: `BPF_F_ZERO_SEED` | 5.0 | [`96b3b6c9091d`](https://github.com/torvalds/linux/commit/96b3b6c9091d23289721350e32c63cc8749686be)
`BPF_F_LOCK` flag for lookup / update | 5.1 | [`96049f3afd50`](https://github.com/torvalds/linux/commit/96049f3afd50fe8db69fa0068cdca822e747b1e4)
Restrict access from BPF side | 5.2 | [`591fe9888d78`](https://github.com/torvalds/linux/commit/591fe9888d7809d9ee5c828020b6c6ae27c37229)
`FREEZE` | 5.2 | [`87df15de441b`](https://github.com/torvalds/linux/commit/87df15de441bd4add7876ef584da8cabdd9a042a)
mmap() support for array maps | 5.5 | [`fc9702273e2e`](https://github.com/torvalds/linux/commit/fc9702273e2edb90400a34b3be76f7b08fa3344b)
`LOOKUP_BATCH` | 5.6 | [`cb4d03ab499d`](https://github.com/torvalds/linux/commit/cb4d03ab499d4c040f4ab6fd4389d2b49f42b5a5)
`UPDATE_BATCH`, `DELETE_BATCH` | 5.6 | [`aa2e93b8e58e`](https://github.com/torvalds/linux/commit/aa2e93b8e58e18442edfb2427446732415bc215e)
`LOOKUP_AND_DELETE_BATCH` | 5.6 | [`057996380a42`](https://github.com/torvalds/linux/commit/057996380a42bb64ccc04383cfa9c0ace4ea11f0)
`LOOKUP_AND_DELETE_ELEM` support for hash maps | 5.14 | [`3e87f192b405`](https://github.com/torvalds/linux/commit/3e87f192b405960c0fe83e0925bd0dadf4f8cf43)

## XDP

An approximate list of drivers or components supporting XDP programs for your
kernel can be retrieved with:
```sh
git grep -l XDP_SETUP_PROG drivers/
```
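
To match a network interface against the driver names in the table below, it helps to know which driver the interface is bound to. A minimal sketch, assuming the usual sysfs layout in which `/sys/class/net/<iface>/device/driver` is a symlink to the driver directory. The function name `nic_driver` and its optional second argument (an alternate sysfs root, handy for testing) are inventions for this example.

```sh
# nic_driver IFACE [SYSFS_ROOT]: print the kernel driver bound to IFACE.
# Fails for interfaces without a backing device (lo, for example).
nic_driver() {
    [ -e "${2:-/sys}/class/net/$1/device/driver" ] || return 1
    basename "$(readlink -f "${2:-/sys}/class/net/$1/device/driver")"
}
```

For example, `nic_driver eth0` might print a name such as `ixgbe`, where `eth0` is a placeholder for one of your interfaces.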

Feature / Driver | Kernel version | Commit
-----------------|----------------|-------
XDP core architecture | 4.8 | [`6a773a15a1e8`](https://github.com/torvalds/linux/commit/6a773a15a1e8874e5eccd2f29190c31085912c95)
Action: drop | 4.8 | [`6a773a15a1e8`](https://github.com/torvalds/linux/commit/6a773a15a1e8874e5eccd2f29190c31085912c95)
Action: pass on to stack | 4.8 | [`6a773a15a1e8`](https://github.com/torvalds/linux/commit/6a773a15a1e8874e5eccd2f29190c31085912c95)
Action: direct forwarding (on same port) | 4.8 | [`6ce96ca348a9`](https://github.com/torvalds/linux/commit/6ce96ca348a9e949f8c43f4d3e98db367d93cffd)
Direct packet data write | 4.8 | [`4acf6c0b84c9`](https://github.com/torvalds/linux/commit/4acf6c0b84c91243c705303cd9ff16421914150d)
Mellanox `mlx4` driver | 4.8 | [`47a38e155037`](https://github.com/torvalds/linux/commit/47a38e155037f417c5740e24ccae6482aedf4b68)
Mellanox `mlx5` driver | 4.9 | [`86994156c736`](https://github.com/torvalds/linux/commit/86994156c736978d113e7927455d4eeeb2128b9f)
Netronome `nfp` driver | 4.10 | [`ecd63a0217d5`](https://github.com/torvalds/linux/commit/ecd63a0217d5f1e8a92f7516f5586d1177b95de2)
QLogic (Cavium) `qed*` drivers | 4.10 | [`496e05170958`](https://github.com/torvalds/linux/commit/496e051709588f832d7a6a420f44f8642b308a87)
`virtio_net` driver | 4.10 | [`f600b6905015`](https://github.com/torvalds/linux/commit/f600b690501550b94e83e07295d9c8b9c4c39f4e)
Broadcom `bnxt_en` driver | 4.11 | [`c6d30e8391b8`](https://github.com/torvalds/linux/commit/c6d30e8391b85e00eb544e6cf047ee0160ee9938)
Intel `ixgbe*` drivers | 4.12 | [`924708081629`](https://github.com/torvalds/linux/commit/9247080816297de4e31abb684939c0e53e3a8a67)
Cavium `thunderx` driver | 4.12 | [`05c773f52b96`](https://github.com/torvalds/linux/commit/05c773f52b96ef3fbc7d9bfa21caadc6247ef7a8)
Generic XDP | 4.12 | [`b5cdae3291f7`](https://github.com/torvalds/linux/commit/b5cdae3291f7be7a34e75affe4c0ec1f7f328b64)
Intel `i40e` driver | 4.13 | [`0c8493d90b6b`](https://github.com/torvalds/linux/commit/0c8493d90b6bb0f5c4fe9217db8f7203f24c0f28)
Action: redirect | 4.14 | [`6453073987ba`](https://github.com/torvalds/linux/commit/6453073987ba392510ab6c8b657844a9312c67f7)
Support for tap | 4.14 | [`761876c857cb`](https://github.com/torvalds/linux/commit/761876c857cb2ef8489fbee01907151da902af91)
Support for veth | 4.14 | [`d445516966dc`](https://github.com/torvalds/linux/commit/d445516966dcb2924741b13b27738b54df2af01a)
Intel `ixgbevf` driver | 4.17 | [`c7aec59657b6`](https://github.com/torvalds/linux/commit/c7aec59657b60f3a29fc7d3274ebefd698879301)
Freescale `dpaa2` driver | 5.0 | [`7e273a8ebdd3`](https://github.com/torvalds/linux/commit/7e273a8ebdd3b83f94eb8b49fc8ee61464f47cc2)
Socionext `netsec` driver | 5.3 | [`ba2b232108d3`](https://github.com/torvalds/linux/commit/ba2b232108d3c2951bab02930a00f23b0cffd5af)
TI `cpsw` driver | 5.3 | [`9ed4050c0d75`](https://github.com/torvalds/linux/commit/9ed4050c0d75768066a07cf66eef4f8dc9d79b52)
Intel `ice` driver | 5.5 | [`efc2214b6047`](https://github.com/torvalds/linux/commit/efc2214b6047b6f5b4ca53151eba62521b9452d6)
Solarflare `sfc` driver | 5.5 | [`eb9a36be7f3e`](https://github.com/torvalds/linux/commit/eb9a36be7f3ec414700af9a616f035eda1f1e63e)
Marvell `mvneta` driver | 5.5 | [`0db51da7a8e9`](https://github.com/torvalds/linux/commit/0db51da7a8e99f0803ec3a8e25c1a66234a219cb)
Microsoft `hv_netvsc` driver | 5.6 | [`351e1581395f`](https://github.com/torvalds/linux/commit/351e1581395fcc7fb952bbd7dda01238f69968fd)
Amazon `ena` driver | 5.6 | [`838c93dc5449`](https://github.com/torvalds/linux/commit/838c93dc5449e5d6378bae117b0a65a122cf7361)
`xen-netfront` driver | 5.9 | [`6c5aa6fc4def`](https://github.com/torvalds/linux/commit/6c5aa6fc4defc2a0977a2c59e4710d50fa1e834c)
Intel `igb` driver | 5.10 | [`9cbc948b5a20`](https://github.com/torvalds/linux/commit/9cbc948b5a20c9c054d9631099c0426c16da546b)

## Helpers

The list of helpers supported in your kernel can be found in file
[`include/uapi/linux/bpf.h`](https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h):
```sh
git grep ' FN(' include/uapi/linux/bpf.h
```
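
To check for a single helper rather than scan the whole list, the same `FN(` entries can be matched by name. This is a sketch under assumptions: `has_helper` is a hypothetical function name, the header is whatever copy of `bpf.h` you have at hand, and the pattern assumes entries look like `FN(name),` or `FN(name, ...)`, which may not hold for every kernel version.

```sh
# has_helper NAME FILE: succeed if FILE lists helper NAME in an FN(...) entry.
# NAME is the helper without the BPF_FUNC_ prefix, e.g. ringbuf_output.
has_helper() {
    grep -Eq "FN\($1[,)]" "$2"
}
```

From the root of a kernel tree, `has_helper ringbuf_output include/uapi/linux/bpf.h && echo supported` would report whether that header declares the helper.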

Alphabetical order

Helper | Kernel version | License | Commit
-------|----------------|---------|-------
`BPF_FUNC_bind()` | 4.17 | | [`d74bad4e74ee`](https://github.com/torvalds/linux/commit/d74bad4e74ee373787a9ae24197c17b7cdc428d5)
`BPF_FUNC_bprm_opts_set()` | 5.11 | | [`3f6719c7b62f`](https://github.com/torvalds/linux/commit/3f6719c7b62f0327c9091e26d0da10e65668229e)
`BPF_FUNC_btf_find_by_name_kind()` | 5.14 | | [`3d78417b60fb`](https://github.com/torvalds/linux/commit/3d78417b60fba249cc555468cb72d96f5cde2964)
`BPF_FUNC_cgrp_storage_delete()` | 6.2 | | [`c4bcfb38a95e`](https://github.com/torvalds/linux/commit/c4bcfb38a95edb1021a53f2d0356a78120ecfbe4)
`BPF_FUNC_cgrp_storage_get()` | 6.2 | | [`c4bcfb38a95e`](https://github.com/torvalds/linux/commit/c4bcfb38a95edb1021a53f2d0356a78120ecfbe4)
`BPF_FUNC_check_mtu()` | 5.12 | | [`34b2021cc616`](https://github.com/torvalds/linux/commit/34b2021cc61642d61c3cf943d9e71925b827941b)
`BPF_FUNC_clone_redirect()` | 4.2 | | [`3896d655f4d4`](https://github.com/torvalds/linux/commit/3896d655f4d491c67d669a15f275a39f713410f8)
`BPF_FUNC_copy_from_user()` | 5.10 | | [`07be4c4a3e7a`](https://github.com/torvalds/linux/commit/07be4c4a3e7a0db148e44b16c5190e753d1c8569)
`BPF_FUNC_copy_from_user_task()` | 5.18 | GPL | [`376040e47334`](https://github.com/torvalds/linux/commit/376040e47334c6dc6a939a32197acceb00fe4acf)
`BPF_FUNC_csum_diff()` | 4.6 | | [`7d672345ed29`](https://github.com/torvalds/linux/commit/7d672345ed295b1356a5d9f7111da1d1d7d65867)
`BPF_FUNC_csum_level()` | 5.7 | | [`7cdec54f9713`](https://github.com/torvalds/linux/commit/7cdec54f9713256bb170873a1fc5c75c9127c9d2)
`BPF_FUNC_csum_update()` | 4.9 | | [`36bbef52c7eb`](https://github.com/torvalds/linux/commit/36bbef52c7eb646ed6247055a2acd3851e317857)
`BPF_FUNC_current_task_under_cgroup()` | 4.9 | | [`60d20f9195b2`](https://github.com/torvalds/linux/commit/60d20f9195b260bdf0ac10c275ae9f6016f9c069)
`BPF_FUNC_d_path()` | 5.10 | | [`6e22ab9da793`](https://github.com/torvalds/linux/commit/6e22ab9da79343532cd3cde39df25e5a5478c692)
`BPF_FUNC_dynptr_data()` | 5.19 | | [`34d4ef5775f7`](https://github.com/torvalds/linux/commit/34d4ef5775f776ec4b0d53a02d588bf3195cada6)
`BPF_FUNC_dynptr_from_mem()` | 5.19 | | [`263ae152e962`](https://github.com/torvalds/linux/commit/263ae152e96253f40c2c276faad8629e096b3bad)
`BPF_FUNC_dynptr_read()` | 5.19 | | [`13bbbfbea759`](https://github.com/torvalds/linux/commit/13bbbfbea7598ea9f8d9c3d73bf053bb57f9c4b2)
`BPF_FUNC_dynptr_write()` | 5.19 | | [`13bbbfbea759`](https://github.com/torvalds/linux/commit/13bbbfbea7598ea9f8d9c3d73bf053bb57f9c4b2)
`BPF_FUNC_fib_lookup()` | 4.18 | GPL | [`87f5fc7e48dd`](https://github.com/torvalds/linux/commit/87f5fc7e48dd3175b30dd03b41564e1a8e136323)
`BPF_FUNC_find_vma()` | 5.17 | | [`7c7e3d31e785`](https://github.com/torvalds/linux/commit/7c7e3d31e7856a8260a254f8c71db416f7f9f5a1)
`BPF_FUNC_for_each_map_elem()` | 5.13 | | [`69c087ba6225`](https://github.com/torvalds/linux/commit/69c087ba6225b574afb6e505b72cb75242a3d844)
`BPF_FUNC_get_attach_cookie()` | 5.15 | | [`7adfc6c9b315`](https://github.com/torvalds/linux/commit/7adfc6c9b315e174cf8743b21b7b691c8766791b)
`BPF_FUNC_get_branch_snapshot()` | 5.16 | GPL | [`856c02dbce4f`](https://github.com/torvalds/linux/commit/856c02dbce4f8d6a5644083db22c11750aa11481)
`BPF_FUNC_get_current_ancestor_cgroup_id()` | 5.6 | | [`b4490c5c4e02`](https://github.com/torvalds/linux/commit/b4490c5c4e023f09b7d27c9a9d3e7ad7d09ea6bf)
`BPF_FUNC_get_cgroup_classid()` | 4.3 | | [`8d20aabe1c76`](https://github.com/torvalds/linux/commit/8d20aabe1c76cccac544d9fcc3ad7823d9e98a2d)
`BPF_FUNC_get_current_cgroup_id()` | 4.18 | | [`bf6fa2c893c5`](https://github.com/torvalds/linux/commit/bf6fa2c893c5237b48569a13fa3c673041430b6c)
`BPF_FUNC_get_current_comm()` | 4.2 | | [`ffeedafbf023`](https://github.com/torvalds/linux/commit/ffeedafbf0236f03aeb2e8db273b3e5ae5f5bc89)
`BPF_FUNC_get_current_pid_tgid()` | 4.2 | | [`ffeedafbf023`](https://github.com/torvalds/linux/commit/ffeedafbf0236f03aeb2e8db273b3e5ae5f5bc89)
`BPF_FUNC_get_current_task()` | 4.8 | GPL | [`606274c5abd8`](https://github.com/torvalds/linux/commit/606274c5abd8e245add01bc7145a8cbb92b69ba8)
`BPF_FUNC_get_current_task_btf()` | 5.11 | GPL | [`3ca1032ab7ab`](https://github.com/torvalds/linux/commit/3ca1032ab7ab010eccb107aa515598788f7d93bb)
`BPF_FUNC_get_current_uid_gid()` | 4.2 | | [`ffeedafbf023`](https://github.com/torvalds/linux/commit/ffeedafbf0236f03aeb2e8db273b3e5ae5f5bc89)
`BPF_FUNC_get_func_arg()` | 5.17 | | [`f92c1e183604`](https://github.com/torvalds/linux/commit/f92c1e183604c20ce00eb889315fdaa8f2d9e509)
`BPF_FUNC_get_func_arg_cnt()` | 5.17 | | [`f92c1e183604`](https://github.com/torvalds/linux/commit/f92c1e183604c20ce00eb889315fdaa8f2d9e509)
`BPF_FUNC_get_func_ip()` | 5.15 | | [`5d8b583d04ae`](https://github.com/torvalds/linux/commit/5d8b583d04aedb3bd5f6d227a334c210c7d735f9)
`BPF_FUNC_get_func_ret()` | 5.17 | | [`f92c1e183604`](https://github.com/torvalds/linux/commit/f92c1e183604c20ce00eb889315fdaa8f2d9e509)
`BPF_FUNC_get_retval()` | 5.18 | | [`b44123b4a3dc`](https://github.com/torvalds/linux/commit/b44123b4a3dcad4664d3a0f72c011ffd4c9c4d93)
`BPF_FUNC_get_hash_recalc()` | 4.8 | | [`13c5c240f789`](https://github.com/torvalds/linux/commit/13c5c240f789bbd2bcacb14a23771491485ae61f)
`BPF_FUNC_get_listener_sock()` | 5.1 | | [`dbafd7ddd623`](https://github.com/torvalds/linux/commit/dbafd7ddd62369b2f3926ab847cbf8fc40e800b7)
`BPF_FUNC_get_local_storage()` | 4.19 | | [`cd3394317653`](https://github.com/torvalds/linux/commit/cd3394317653837e2eb5c5d0904a8996102af9fc)
`BPF_FUNC_get_netns_cookie()` | 5.7 | | [`f318903c0bf4`](https://github.com/torvalds/linux/commit/f318903c0bf42448b4c884732df2bbb0ef7a2284)
`BPF_FUNC_get_ns_current_pid_tgid()` | 5.7 | | [`b4490c5c4e02`](https://github.com/torvalds/linux/commit/b4490c5c4e023f09b7d27c9a9d3e7ad7d09ea6bf)
`BPF_FUNC_get_numa_node_id()` | 4.10 | | [`2d0e30c30f84`](https://github.com/torvalds/linux/commit/2d0e30c30f84d08dc16f0f2af41f1b8a85f0755e)
`BPF_FUNC_get_prandom_u32()` | 4.1 | | [`03e69b508b6f`](https://github.com/torvalds/linux/commit/03e69b508b6f7c51743055c9f61d1dfeadf4b635)
`BPF_FUNC_get_route_realm()` | 4.4 | | [`c46646d0484f`](https://github.com/torvalds/linux/commit/c46646d0484f5d08e2bede9b45034ba5b8b489cc)
`BPF_FUNC_get_smp_processor_id()` | 4.1 | | [`c04167ce2ca0`](https://github.com/torvalds/linux/commit/c04167ce2ca0ecaeaafef006cb0d65cf01b68e42)
`BPF_FUNC_get_socket_cookie()` | 4.12 | | [`91b8270f2a4d`](https://github.com/torvalds/linux/commit/91b8270f2a4d1d9b268de90451cdca63a70052d6)
`BPF_FUNC_get_socket_uid()` | 4.12 | | [`6acc5c291068`](https://github.com/torvalds/linux/commit/6acc5c2910689fc6ee181bf63085c5efff6a42bd)
`BPF_FUNC_get_stack()` | 4.18 | GPL | [`de2ff05f48af`](https://github.com/torvalds/linux/commit/de2ff05f48afcde816ff4edb217417f62f624ab5)
`BPF_FUNC_get_stackid()` | 4.6 | GPL | [`d5a3b1f69186`](https://github.com/torvalds/linux/commit/d5a3b1f691865be576c2bffa708549b8cdccda19)
`BPF_FUNC_get_task_stack()` | 5.9 | | [`fa28dcb82a38`](https://github.com/torvalds/linux/commit/fa28dcb82a38f8e3993b0fae9106b1a80b59e4f0)
`BPF_FUNC_getsockopt()` | 4.15 | | [`cd86d1fd2102`](https://github.com/torvalds/linux/commit/cd86d1fd21025fdd6daf23d1288da405e7ad0ec6)
`BPF_FUNC_ima_file_hash()` | 5.18 | | [`174b16946e39`](https://github.com/torvalds/linux/commit/174b16946e39ebd369097e0f773536c91a8c1a4c)
`BPF_FUNC_ima_inode_hash()` | 5.11 | | [`27672f0d280a`](https://github.com/torvalds/linux/commit/27672f0d280a3f286a410a8db2004f46ace72a17)
`BPF_FUNC_inode_storage_delete()` | 5.10 | | [`8ea636848aca`](https://github.com/torvalds/linux/commit/8ea636848aca35b9f97c5b5dee30225cf2dd0fe6)
`BPF_FUNC_inode_storage_get()` | 5.10 | | [`8ea636848aca`](https://github.com/torvalds/linux/commit/8ea636848aca35b9f97c5b5dee30225cf2dd0fe6)
`BPF_FUNC_jiffies64()` | 5.5 | | [`5576b991e9c1`](https://github.com/torvalds/linux/commit/5576b991e9c1a11d2cc21c4b94fc75ec27603896)
`BPF_FUNC_kallsyms_lookup_name()` | 5.16 | | [`d6aef08a872b`](https://github.com/torvalds/linux/commit/d6aef08a872b9e23eecc92d0e92393473b13c497)
`BPF_FUNC_kptr_xchg()` | 5.19 | | [`c0a5a21c25f3`](https://github.com/torvalds/linux/commit/c0a5a21c25f37c9fd7b36072f9968cdff1e4aa13)
`BPF_FUNC_ktime_get_boot_ns()` | 5.8 | | [`71d19214776e`](https://github.com/torvalds/linux/commit/71d19214776e61b33da48f7c1b46e522c7f78221)
`BPF_FUNC_ktime_get_coarse_ns()` | 5.11 | | [`d05512618056`](https://github.com/torvalds/linux/commit/d055126180564a57fe533728a4e93d0cb53d49b3)
`BPF_FUNC_ktime_get_ns()` | 4.1 | | [`d9847d310ab4`](https://github.com/torvalds/linux/commit/d9847d310ab4003725e6ed1822682e24bd406908)
`BPF_FUNC_ktime_get_tai_ns()` | 6.1 | | [`c8996c98f703`](https://github.com/torvalds/linux/commit/c8996c98f703b09afe77a1d247dae691c9849dc1)
`BPF_FUNC_l3_csum_replace()` | 4.1 | | [`91bc4822c3d6`](https://github.com/torvalds/linux/commit/91bc4822c3d61b9bb7ef66d3b77948a4f9177954)
`BPF_FUNC_l4_csum_replace()` | 4.1 | | [`91bc4822c3d6`](https://github.com/torvalds/linux/commit/91bc4822c3d61b9bb7ef66d3b77948a4f9177954)
`BPF_FUNC_load_hdr_opt()` | 5.10 | | [`0813a841566f`](https://github.com/torvalds/linux/commit/0813a841566f0962a5551be7749b43c45f0022a0)
`BPF_FUNC_loop()` | 5.17 | | [`e6f2dd0f8067`](https://github.com/torvalds/linux/commit/e6f2dd0f80674e9d5960337b3e9c2a242441b326)
`BPF_FUNC_lwt_push_encap()` | 4.18 | | [`fe94cc290f53`](https://github.com/torvalds/linux/commit/fe94cc290f535709d3c5ebd1e472dfd0aec7ee79)
`BPF_FUNC_lwt_seg6_action()` | 4.18 | | [`fe94cc290f53`](https://github.com/torvalds/linux/commit/fe94cc290f535709d3c5ebd1e472dfd0aec7ee79)
`BPF_FUNC_lwt_seg6_adjust_srh()` | 4.18 | | [`fe94cc290f53`](https://github.com/torvalds/linux/commit/fe94cc290f535709d3c5ebd1e472dfd0aec7ee79)
`BPF_FUNC_lwt_seg6_store_bytes()` | 4.18 | | [`fe94cc290f53`](https://github.com/torvalds/linux/commit/fe94cc290f535709d3c5ebd1e472dfd0aec7ee79)
`BPF_FUNC_map_delete_elem()` | 3.19 | | [`d0003ec01c66`](https://github.com/torvalds/linux/commit/d0003ec01c667b731c139e23de3306a8b328ccf5)
`BPF_FUNC_map_lookup_elem()` | 3.19 | | [`d0003ec01c66`](https://github.com/torvalds/linux/commit/d0003ec01c667b731c139e23de3306a8b328ccf5)
`BPF_FUNC_map_lookup_percpu_elem()` | 5.19 | | [`07343110b293`](https://github.com/torvalds/linux/commit/07343110b293456d30393e89b86c4dee1ac051c8)
`BPF_FUNC_map_peek_elem()` | 4.20 | | [`f1a2e44a3aec`](https://github.com/torvalds/linux/commit/f1a2e44a3aeccb3ff18d3ccc0b0203e70b95bd92)
`BPF_FUNC_map_pop_elem()` | 4.20 | | [`f1a2e44a3aec`](https://github.com/torvalds/linux/commit/f1a2e44a3aeccb3ff18d3ccc0b0203e70b95bd92)
`BPF_FUNC_map_push_elem()` | 4.20 | | [`f1a2e44a3aec`](https://github.com/torvalds/linux/commit/f1a2e44a3aeccb3ff18d3ccc0b0203e70b95bd92)
`BPF_FUNC_map_update_elem()` | 3.19 | | [`d0003ec01c66`](https://github.com/torvalds/linux/commit/d0003ec01c667b731c139e23de3306a8b328ccf5)
`BPF_FUNC_msg_apply_bytes()` | 4.17 | | [`2a100317c9eb`](https://github.com/torvalds/linux/commit/2a100317c9ebc204a166f16294884fbf9da074ce)
`BPF_FUNC_msg_cork_bytes()` | 4.17 | | [`91843d540a13`](https://github.com/torvalds/linux/commit/91843d540a139eb8070bcff8aa10089164436deb)
`BPF_FUNC_msg_pop_data()` | 5.0 | | [`7246d8ed4dcc`](https://github.com/torvalds/linux/commit/7246d8ed4dcce23f7509949a77be15fa9f0e3d28)
`BPF_FUNC_msg_pull_data()` | 4.17 | | [`015632bb30da`](https://github.com/torvalds/linux/commit/015632bb30daaaee64e1bcac07570860e0bf3092)
`BPF_FUNC_msg_push_data()` | 4.20 | | [`6fff607e2f14`](https://github.com/torvalds/linux/commit/6fff607e2f14bd7c63c06c464a6f93b8efbabe28)
`BPF_FUNC_msg_redirect_hash()` | 4.18 | | [`81110384441a`](https://github.com/torvalds/linux/commit/81110384441a59cff47430f20f049e69b98c17f4)
`BPF_FUNC_msg_redirect_map()` | 4.17 | | [`4f738adba30a`](https://github.com/torvalds/linux/commit/4f738adba30a7cfc006f605707e7aee847ffefa0)
`BPF_FUNC_per_cpu_ptr()` | 5.10 | | [`eaa6bcb71ef6`](https://github.com/torvalds/linux/commit/eaa6bcb71ef6ed3dc18fc525ee7e293b06b4882b)
`BPF_FUNC_perf_event_output()` | 4.4 | GPL | [`a43eec304259`](https://github.com/torvalds/linux/commit/a43eec304259a6c637f4014a6d4767159b6a3aa3)
`BPF_FUNC_perf_event_read()` | 4.3 | GPL | [`35578d798400`](https://github.com/torvalds/linux/commit/35578d7984003097af2b1e34502bc943d40c1804)
`BPF_FUNC_perf_event_read_value()` | 4.15 | GPL | [`908432ca84fc`](https://github.com/torvalds/linux/commit/908432ca84fc229e906ba164219e9ad0fe56f755)
`BPF_FUNC_perf_prog_read_value()` | 4.15 | GPL | [`4bebdc7a85aa`](https://github.com/torvalds/linux/commit/4bebdc7a85aa400c0222b5329861e4ad9252f1e5)
`BPF_FUNC_probe_read()` | 4.1 | GPL | [`2541517c32be`](https://github.com/torvalds/linux/commit/2541517c32be2531e0da59dfd7efc1ce844644f5)
`BPF_FUNC_probe_read_kernel()` | 5.5 | GPL | [`6ae08ae3dea2`](https://github.com/torvalds/linux/commit/6ae08ae3dea2cfa03dd3665a3c8475c2d429ef47)
`BPF_FUNC_probe_read_kernel_str()` | 5.5 | GPL | [`6ae08ae3dea2`](https://github.com/torvalds/linux/commit/6ae08ae3dea2cfa03dd3665a3c8475c2d429ef47)
`BPF_FUNC_probe_read_user()` | 5.5 | GPL | [`6ae08ae3dea2`](https://github.com/torvalds/linux/commit/6ae08ae3dea2cfa03dd3665a3c8475c2d429ef47)
`BPF_FUNC_probe_read_user_str()` | 5.5 | GPL | [`6ae08ae3dea2`](https://github.com/torvalds/linux/commit/6ae08ae3dea2cfa03dd3665a3c8475c2d429ef47)
`BPF_FUNC_probe_read_str()` | 4.11 | GPL | [`a5e8c07059d0`](https://github.com/torvalds/linux/commit/a5e8c07059d0f0b31737408711d44794928ac218)
`BPF_FUNC_probe_write_user()` | 4.8 | GPL | [`96ae52279594`](https://github.com/torvalds/linux/commit/96ae52279594470622ff0585621a13e96b700600)
`BPF_FUNC_rc_keydown()` | 4.18 | GPL | [`f4364dcfc86d`](https://github.com/torvalds/linux/commit/f4364dcfc86df7c1ca47b256eaf6b6d0cdd0d936)
`BPF_FUNC_rc_pointer_rel()` | 5.0 | GPL | [`01d3240a04f4`](https://github.com/torvalds/linux/commit/01d3240a04f4c09392e13c77b54d4423ebce2d72)
`BPF_FUNC_rc_repeat()` | 4.18 | GPL | [`f4364dcfc86d`](https://github.com/torvalds/linux/commit/f4364dcfc86df7c1ca47b256eaf6b6d0cdd0d936)
`BPF_FUNC_read_branch_records()` | 5.6 | GPL | [`fff7b64355ea`](https://github.com/torvalds/linux/commit/fff7b64355eac6e29b50229ad1512315bc04b44e)
`BPF_FUNC_redirect()` | 4.4 | | [`27b29f63058d`](https://github.com/torvalds/linux/commit/27b29f63058d26c6c1742f1993338280d5a41dc6)
`BPF_FUNC_redirect_map()` | 4.14 | | [`97f91a7cf04f`](https://github.com/torvalds/linux/commit/97f91a7cf04ff605845c20948b8a80e54cbd3376)
`BPF_FUNC_redirect_neigh()` | 5.10 | | [`b4ab31414970`](https://github.com/torvalds/linux/commit/b4ab31414970a7a03a5d55d75083f2c101a30592)
`BPF_FUNC_redirect_peer()` | 5.10 | | [`9aa1206e8f48`](https://github.com/torvalds/linux/commit/9aa1206e8f48222f35a0c809f33b2f4aaa1e2661)
`BPF_FUNC_reserve_hdr_opt()` | 5.10 | | [`0813a841566f`](https://github.com/torvalds/linux/commit/0813a841566f0962a5551be7749b43c45f0022a0)
`BPF_FUNC_ringbuf_discard()` | 5.8 | | [`457f44363a88`](https://github.com/torvalds/linux/commit/457f44363a8894135c85b7a9afd2bd8196db24ab)
`BPF_FUNC_ringbuf_discard_dynptr()` | 5.19 | | [`bc34dee65a65`](https://github.com/torvalds/linux/commit/bc34dee65a65e9c920c420005b8a43f2a721a458)
`BPF_FUNC_ringbuf_output()` | 5.8 | | [`457f44363a88`](https://github.com/torvalds/linux/commit/457f44363a8894135c85b7a9afd2bd8196db24ab)
`BPF_FUNC_ringbuf_query()` | 5.8 | | [`457f44363a88`](https://github.com/torvalds/linux/commit/457f44363a8894135c85b7a9afd2bd8196db24ab)
`BPF_FUNC_ringbuf_reserve()` | 5.8 | | [`457f44363a88`](https://github.com/torvalds/linux/commit/457f44363a8894135c85b7a9afd2bd8196db24ab)
|
||||
`BPF_FUNC_ringbuf_reserve_dynptr()` | 5.19 | | [`bc34dee65a65`](https://github.com/torvalds/linux/commit/bc34dee65a65e9c920c420005b8a43f2a721a458)
|
||||
`BPF_FUNC_ringbuf_submit()` | 5.8 | | [`457f44363a88`](https://github.com/torvalds/linux/commit/457f44363a8894135c85b7a9afd2bd8196db24ab)
|
||||
`BPF_FUNC_ringbuf_submit_dynptr()` | 5.19 | | [`bc34dee65a65`](https://github.com/torvalds/linux/commit/bc34dee65a65e9c920c420005b8a43f2a721a458)
|
||||
`BPF_FUNC_send_signal()` | 5.3 | | [`8b401f9ed244`](https://github.com/torvalds/linux/commit/8b401f9ed2441ad9e219953927a842d24ed051fc)
|
||||
`BPF_FUNC_send_signal_thread()` | 5.5 | | [`8482941f0906`](https://github.com/torvalds/linux/commit/8482941f09067da42f9c3362e15bfb3f3c19d610)
|
||||
`BPF_FUNC_seq_printf()` | 5.7 | GPL | [`492e639f0c22`](https://github.com/torvalds/linux/commit/492e639f0c222784e2e0f121966375f641c61b15)
|
||||
`BPF_FUNC_seq_printf_btf()` | 5.10 | | [`eb411377aed9`](https://github.com/torvalds/linux/commit/eb411377aed9e27835e77ee0710ee8f4649958f3)
|
||||
`BPF_FUNC_seq_write()` | 5.7 | GPL | [`492e639f0c22`](https://github.com/torvalds/linux/commit/492e639f0c222784e2e0f121966375f641c61b15)
|
||||
`BPF_FUNC_set_hash()` | 4.13 | | [`ded092cd73c2`](https://github.com/torvalds/linux/commit/ded092cd73c2c56a394b936f86897f29b2e131c0)
|
||||
`BPF_FUNC_set_hash_invalid()` | 4.9 | | [`7a4b28c6cc9f`](https://github.com/torvalds/linux/commit/7a4b28c6cc9ffac50f791b99cc7e46106436e5d8)
|
||||
`BPF_FUNC_set_retval()` | 5.18 | | [`b44123b4a3dc`](https://github.com/torvalds/linux/commit/b44123b4a3dcad4664d3a0f72c011ffd4c9c4d93)
|
||||
`BPF_FUNC_setsockopt()` | 4.13 | | [`8c4b4c7e9ff0`](https://github.com/torvalds/linux/commit/8c4b4c7e9ff0447995750d9329949fa082520269)
|
||||
`BPF_FUNC_sk_ancestor_cgroup_id()` | 5.7 | | [`f307fa2cb4c9`](https://github.com/torvalds/linux/commit/f307fa2cb4c935f7f1ff0aeb880c7b44fb9a642b)
|
||||
`BPF_FUNC_sk_assign()` | 5.6 | | [`cf7fbe660f2d`](https://github.com/torvalds/linux/commit/cf7fbe660f2dbd738ab58aea8e9b0ca6ad232449)
|
||||
`BPF_FUNC_sk_cgroup_id()` | 5.7 | | [`f307fa2cb4c9`](https://github.com/torvalds/linux/commit/f307fa2cb4c935f7f1ff0aeb880c7b44fb9a642b)
|
||||
`BPF_FUNC_sk_fullsock()` | 5.1 | | [`46f8bc92758c`](https://github.com/torvalds/linux/commit/46f8bc92758c6259bcf945e9216098661c1587cd)
|
||||
`BPF_FUNC_sk_lookup_tcp()` | 4.20 | | [`6acc9b432e67`](https://github.com/torvalds/linux/commit/6acc9b432e6714d72d7d77ec7c27f6f8358d0c71)
|
||||
`BPF_FUNC_sk_lookup_udp()` | 4.20 | | [`6acc9b432e67`](https://github.com/torvalds/linux/commit/6acc9b432e6714d72d7d77ec7c27f6f8358d0c71)
|
||||
`BPF_FUNC_sk_redirect_hash()` | 4.18 | | [`81110384441a`](https://github.com/torvalds/linux/commit/81110384441a59cff47430f20f049e69b98c17f4)
|
||||
`BPF_FUNC_sk_redirect_map()` | 4.14 | | [`174a79ff9515`](https://github.com/torvalds/linux/commit/174a79ff9515f400b9a6115643dafd62a635b7e6)
|
||||
`BPF_FUNC_sk_release()` | 4.20 | | [`6acc9b432e67`](https://github.com/torvalds/linux/commit/6acc9b432e6714d72d7d77ec7c27f6f8358d0c71)
|
||||
`BPF_FUNC_sk_select_reuseport()` | 4.19 | | [`2dbb9b9e6df6`](https://github.com/torvalds/linux/commit/2dbb9b9e6df67d444fbe425c7f6014858d337adf)
|
||||
`BPF_FUNC_sk_storage_delete()` | 5.2 | | [`6ac99e8f23d4`](https://github.com/torvalds/linux/commit/6ac99e8f23d4b10258406ca0dd7bffca5f31da9d)
|
||||
`BPF_FUNC_sk_storage_get()` | 5.2 | | [`6ac99e8f23d4`](https://github.com/torvalds/linux/commit/6ac99e8f23d4b10258406ca0dd7bffca5f31da9d)
|
||||
`BPF_FUNC_skb_adjust_room()` | 4.13 | | [`2be7e212d541`](https://github.com/torvalds/linux/commit/2be7e212d5419a400d051c84ca9fdd083e5aacac)
|
||||
`BPF_FUNC_skb_ancestor_cgroup_id()` | 4.19 | | [`7723628101aa`](https://github.com/torvalds/linux/commit/7723628101aaeb1d723786747529b4ea65c5b5c5)
|
||||
`BPF_FUNC_skb_change_head()` | 4.10 | | [`3a0af8fd61f9`](https://github.com/torvalds/linux/commit/3a0af8fd61f90920f6fa04e4f1e9a6a73c1b4fd2)
|
||||
`BPF_FUNC_skb_change_proto()` | 4.8 | | [`6578171a7ff0`](https://github.com/torvalds/linux/commit/6578171a7ff0c31dc73258f93da7407510abf085)
|
||||
`BPF_FUNC_skb_change_tail()` | 4.9 | | [`5293efe62df8`](https://github.com/torvalds/linux/commit/5293efe62df81908f2e90c9820c7edcc8e61f5e9)
|
||||
`BPF_FUNC_skb_change_type()` | 4.8 | | [`d2485c4242a8`](https://github.com/torvalds/linux/commit/d2485c4242a826fdf493fd3a27b8b792965b9b9e)
|
||||
`BPF_FUNC_skb_cgroup_classid()` | 5.10 | | [`b426ce83baa7`](https://github.com/torvalds/linux/commit/b426ce83baa7dff947fb354118d3133f2953aac8)
|
||||
`BPF_FUNC_skb_cgroup_id()` | 4.18 | | [`cb20b08ead40`](https://github.com/torvalds/linux/commit/cb20b08ead401fd17627a36f035c0bf5bfee5567)
|
||||
`BPF_FUNC_skb_ecn_set_ce()` | 5.1 | | [`f7c917ba11a6`](https://github.com/torvalds/linux/commit/f7c917ba11a67632a8452ea99fe132f626a7a2cc)
|
||||
`BPF_FUNC_skb_get_tunnel_key()` | 4.3 | | [`d3aa45ce6b94`](https://github.com/torvalds/linux/commit/d3aa45ce6b94c65b83971257317867db13e5f492)
|
||||
`BPF_FUNC_skb_get_tunnel_opt()` | 4.6 | | [`14ca0751c96f`](https://github.com/torvalds/linux/commit/14ca0751c96f8d3d0f52e8ed3b3236f8b34d3460)
|
||||
`BPF_FUNC_skb_get_xfrm_state()` | 4.18 | | [`12bed760a78d`](https://github.com/torvalds/linux/commit/12bed760a78da6e12ac8252fec64d019a9eac523)
|
||||
`BPF_FUNC_skb_load_bytes()` | 4.5 | | [`05c74e5e53f6`](https://github.com/torvalds/linux/commit/05c74e5e53f6cb07502c3e6a820f33e2777b6605)
|
||||
`BPF_FUNC_skb_load_bytes_relative()` | 4.18 | | [`4e1ec56cdc59`](https://github.com/torvalds/linux/commit/4e1ec56cdc59746943b2acfab3c171b930187bbe)
|
||||
`BPF_FUNC_skb_output()` | 5.5 | | [`a7658e1a4164`](https://github.com/torvalds/linux/commit/a7658e1a4164ce2b9eb4a11aadbba38586e93bd6)
|
||||
`BPF_FUNC_skb_pull_data()` | 4.9 | | [`36bbef52c7eb`](https://github.com/torvalds/linux/commit/36bbef52c7eb646ed6247055a2acd3851e317857)
|
||||
`BPF_FUNC_skb_set_tstamp()` | 5.18 | | [`9bb984f28d5b`](https://github.com/torvalds/linux/commit/9bb984f28d5bcb917d35d930fcfb89f90f9449fd)
|
||||
`BPF_FUNC_skb_set_tunnel_key()` | 4.3 | | [`d3aa45ce6b94`](https://github.com/torvalds/linux/commit/d3aa45ce6b94c65b83971257317867db13e5f492)
|
||||
`BPF_FUNC_skb_set_tunnel_opt()` | 4.6 | | [`14ca0751c96f`](https://github.com/torvalds/linux/commit/14ca0751c96f8d3d0f52e8ed3b3236f8b34d3460)
|
||||
`BPF_FUNC_skb_store_bytes()` | 4.1 | | [`91bc4822c3d6`](https://github.com/torvalds/linux/commit/91bc4822c3d61b9bb7ef66d3b77948a4f9177954)
|
||||
`BPF_FUNC_skb_under_cgroup()` | 4.8 | | [`4a482f34afcc`](https://github.com/torvalds/linux/commit/4a482f34afcc162d8456f449b137ec2a95be60d8)
|
||||
`BPF_FUNC_skb_vlan_pop()` | 4.3 | | [`4e10df9a60d9`](https://github.com/torvalds/linux/commit/4e10df9a60d96ced321dd2af71da558c6b750078)
|
||||
`BPF_FUNC_skb_vlan_push()` | 4.3 | | [`4e10df9a60d9`](https://github.com/torvalds/linux/commit/4e10df9a60d96ced321dd2af71da558c6b750078)
|
||||
`BPF_FUNC_skc_lookup_tcp()` | 5.2 | | [`edbf8c01de5a`](https://github.com/torvalds/linux/commit/edbf8c01de5a104a71ed6df2bf6421ceb2836a8e)
|
||||
`BPF_FUNC_skc_to_mptcp_sock()` | 5.19 | | [`3bc253c2e652`](https://github.com/torvalds/linux/commit/3bc253c2e652cf5f12cd8c00d80d8ec55d67d1a7)
|
||||
`BPF_FUNC_skc_to_tcp_sock()` | 5.9 | | [`478cfbdf5f13`](https://github.com/torvalds/linux/commit/478cfbdf5f13dfe09cfd0b1cbac821f5e27f6108)
|
||||
`BPF_FUNC_skc_to_tcp_request_sock()` | 5.9 | | [`478cfbdf5f13`](https://github.com/torvalds/linux/commit/478cfbdf5f13dfe09cfd0b1cbac821f5e27f6108)
|
||||
`BPF_FUNC_skc_to_tcp_timewait_sock()` | 5.9 | | [`478cfbdf5f13`](https://github.com/torvalds/linux/commit/478cfbdf5f13dfe09cfd0b1cbac821f5e27f6108)
|
||||
`BPF_FUNC_skc_to_tcp6_sock()` | 5.9 | | [`af7ec1383361`](https://github.com/torvalds/linux/commit/af7ec13833619e17f03aa73a785a2f871da6d66b)
|
||||
`BPF_FUNC_skc_to_udp6_sock()` | 5.9 | | [`0d4fad3e57df`](https://github.com/torvalds/linux/commit/0d4fad3e57df2bf61e8ffc8d12a34b1caf9b8835)
|
||||
`BPF_FUNC_skc_to_unix_sock()` | 5.16 | | [`9eeb3aa33ae0`](https://github.com/torvalds/linux/commit/9eeb3aa33ae005526f672b394c1791578463513f)
|
||||
`BPF_FUNC_snprintf()` | 5.13 | | [`7b15523a989b`](https://github.com/torvalds/linux/commit/7b15523a989b63927c2bb08e9b5b0bbc10b58bef)
|
||||
`BPF_FUNC_snprintf_btf()` | 5.10 | | [`c4d0bfb45068`](https://github.com/torvalds/linux/commit/c4d0bfb45068d853a478b9067a95969b1886a30f)
|
||||
`BPF_FUNC_sock_from_file()` | 5.11 | | [`4f19cab76136`](https://github.com/torvalds/linux/commit/4f19cab76136e800a3f04d8c9aa4d8e770e3d3d8)
|
||||
`BPF_FUNC_sock_hash_update()` | 4.18 | | [`81110384441a`](https://github.com/torvalds/linux/commit/81110384441a59cff47430f20f049e69b98c17f4)
|
||||
`BPF_FUNC_sock_map_update()` | 4.14 | | [`174a79ff9515`](https://github.com/torvalds/linux/commit/174a79ff9515f400b9a6115643dafd62a635b7e6)
|
||||
`BPF_FUNC_spin_lock()` | 5.1 | | [`d83525ca62cf`](https://github.com/torvalds/linux/commit/d83525ca62cf8ebe3271d14c36fb900c294274a2)
|
||||
`BPF_FUNC_spin_unlock()` | 5.1 | | [`d83525ca62cf`](https://github.com/torvalds/linux/commit/d83525ca62cf8ebe3271d14c36fb900c294274a2)
|
||||
`BPF_FUNC_store_hdr_opt()` | 5.10 | | [`0813a841566f`](https://github.com/torvalds/linux/commit/0813a841566f0962a5551be7749b43c45f0022a0)
|
||||
`BPF_FUNC_strncmp()` | 5.17 | | [`c5fb19937455`](https://github.com/torvalds/linux/commit/c5fb19937455095573a19ddcbff32e993ed10e35)
|
||||
`BPF_FUNC_strtol()` | 5.2 | | [`d7a4cb9b6705`](https://github.com/torvalds/linux/commit/d7a4cb9b6705a89937d12c8158a35a3145dc967a)
|
||||
`BPF_FUNC_strtoul()` | 5.2 | | [`d7a4cb9b6705`](https://github.com/torvalds/linux/commit/d7a4cb9b6705a89937d12c8158a35a3145dc967a)
|
||||
`BPF_FUNC_sys_bpf()` | 5.14 | | [`79a7f8bdb159`](https://github.com/torvalds/linux/commit/79a7f8bdb159d9914b58740f3d31d602a6e4aca8)
|
||||
`BPF_FUNC_sys_close()` | 5.14 | | [`3abea089246f`](https://github.com/torvalds/linux/commit/3abea089246f76c1517b054ddb5946f3f1dbd2c0)
|
||||
`BPF_FUNC_sysctl_get_current_value()` | 5.2 | | [`1d11b3016cec`](https://github.com/torvalds/linux/commit/1d11b3016cec4ed9770b98e82a61708c8f4926e7)
|
||||
`BPF_FUNC_sysctl_get_name()` | 5.2 | | [`808649fb787d`](https://github.com/torvalds/linux/commit/808649fb787d918a48a360a668ee4ee9023f0c11)
|
||||
`BPF_FUNC_sysctl_get_new_value()` | 5.2 | | [`4e63acdff864`](https://github.com/torvalds/linux/commit/4e63acdff864654cee0ac5aaeda3913798ee78f6)
|
||||
`BPF_FUNC_sysctl_set_new_value()` | 5.2 | | [`4e63acdff864`](https://github.com/torvalds/linux/commit/4e63acdff864654cee0ac5aaeda3913798ee78f6)
|
||||
`BPF_FUNC_tail_call()` | 4.2 | | [`04fd61ab36ec`](https://github.com/torvalds/linux/commit/04fd61ab36ec065e194ab5e74ae34a5240d992bb)
|
||||
`BPF_FUNC_task_pt_regs()` | 5.15 | GPL | [`dd6e10fbd9fb`](https://github.com/torvalds/linux/commit/dd6e10fbd9fb86a571d925602c8a24bb4d09a2a7)
|
||||
`BPF_FUNC_task_storage_delete()` | 5.11 | | [`4cf1bc1f1045`](https://github.com/torvalds/linux/commit/4cf1bc1f10452065a29d576fc5693fc4fab5b919)
|
||||
`BPF_FUNC_task_storage_get()` | 5.11 | | [`4cf1bc1f1045`](https://github.com/torvalds/linux/commit/4cf1bc1f10452065a29d576fc5693fc4fab5b919)
|
||||
`BPF_FUNC_tcp_check_syncookie()` | 5.2 | | [`399040847084`](https://github.com/torvalds/linux/commit/399040847084a69f345e0a52fd62f04654e0fce3)
|
||||
`BPF_FUNC_tcp_gen_syncookie()` | 5.3 | | [`70d66244317e`](https://github.com/torvalds/linux/commit/70d66244317e958092e9c971b08dd5b7fd29d9cb#diff-05da4bf36c7fbcd176254e1615d98b28)
|
||||
`BPF_FUNC_tcp_raw_check_syncookie_ipv4()` | 6.0 | | [`33bf9885040c`](https://github.com/torvalds/linux/commit/33bf9885040c399cf6a95bd33216644126728e14)
|
||||
`BPF_FUNC_tcp_raw_check_syncookie_ipv6()` | 6.0 | | [`33bf9885040c`](https://github.com/torvalds/linux/commit/33bf9885040c399cf6a95bd33216644126728e14)
|
||||
`BPF_FUNC_tcp_raw_gen_syncookie_ipv4()` | 6.0 | | [`33bf9885040c`](https://github.com/torvalds/linux/commit/33bf9885040c399cf6a95bd33216644126728e14)
|
||||
`BPF_FUNC_tcp_raw_gen_syncookie_ipv6()` | 6.0 | | [`33bf9885040c`](https://github.com/torvalds/linux/commit/33bf9885040c399cf6a95bd33216644126728e14)
|
||||
`BPF_FUNC_tcp_send_ack()` | 5.5 | | [`206057fe020a`](https://github.com/torvalds/linux/commit/206057fe020ac5c037d5e2dd6562a9bd216ec765)
|
||||
`BPF_FUNC_tcp_sock()` | 5.1 | | [`655a51e536c0`](https://github.com/torvalds/linux/commit/655a51e536c09d15ffa3603b1b6fce2b45b85a1f)
|
||||
`BPF_FUNC_this_cpu_ptr()` | 5.10 | | [`63d9b80dcf2c`](https://github.com/torvalds/linux/commit/63d9b80dcf2c67bc5ade61cbbaa09d7af21f43f1)
|
||||
`BPF_FUNC_timer_init()` | 5.15 | | [`b00628b1c7d5`](https://github.com/torvalds/linux/commit/b00628b1c7d595ae5b544e059c27b1f5828314b4)
|
||||
`BPF_FUNC_timer_set_callback()` | 5.15 | | [`b00628b1c7d5`](https://github.com/torvalds/linux/commit/b00628b1c7d595ae5b544e059c27b1f5828314b4)
|
||||
`BPF_FUNC_timer_start()` | 5.15 | | [`b00628b1c7d5`](https://github.com/torvalds/linux/commit/b00628b1c7d595ae5b544e059c27b1f5828314b4)
|
||||
`BPF_FUNC_timer_cancel()` | 5.15 | | [`b00628b1c7d5`](https://github.com/torvalds/linux/commit/b00628b1c7d595ae5b544e059c27b1f5828314b4)
|
||||
`BPF_FUNC_trace_printk()` | 4.1 | GPL | [`9c959c863f82`](https://github.com/torvalds/linux/commit/9c959c863f8217a2ff3d7c296e8223654d240569)
|
||||
`BPF_FUNC_trace_vprintk()` | 5.16 | GPL | [`10aceb629e19`](https://github.com/torvalds/linux/commit/10aceb629e198429c849d5e995c3bb1ba7a9aaa3)
|
||||
`BPF_FUNC_user_ringbuf_drain()` | 6.1 | | [`205715673844`](https://github.com/torvalds/linux/commit/20571567384428dfc9fe5cf9f2e942e1df13c2dd)
|
||||
`BPF_FUNC_xdp_adjust_head()` | 4.10 | | [`17bedab27231`](https://github.com/torvalds/linux/commit/17bedab2723145d17b14084430743549e6943d03)
|
||||
`BPF_FUNC_xdp_adjust_meta()` | 4.15 | | [`de8f3a83b0a0`](https://github.com/torvalds/linux/commit/de8f3a83b0a0fddb2cf56e7a718127e9619ea3da)
|
||||
`BPF_FUNC_xdp_adjust_tail()` | 4.18 | | [`b32cc5b9a346`](https://github.com/torvalds/linux/commit/b32cc5b9a346319c171e3ad905e0cddda032b5eb)
|
||||
`BPF_FUNC_xdp_get_buff_len()` | 5.18 | | [`0165cc817075`](https://github.com/torvalds/linux/commit/0165cc817075cf701e4289838f1d925ff1911b3e)
|
||||
`BPF_FUNC_xdp_load_bytes()` | 5.18 | | [`3f364222d032`](https://github.com/torvalds/linux/commit/3f364222d032eea6b245780e845ad213dab28cdd)
|
||||
`BPF_FUNC_xdp_store_bytes()` | 5.18 | | [`3f364222d032`](https://github.com/torvalds/linux/commit/3f364222d032eea6b245780e845ad213dab28cdd)
|
||||
`BPF_FUNC_xdp_output()` | 5.6 | GPL | [`d831ee84bfc9`](https://github.com/torvalds/linux/commit/d831ee84bfc9173eecf30dbbc2553ae81b996c60)
|
||||
`BPF_FUNC_override_return()` | 4.16 | GPL | [`9802d86585db`](https://github.com/torvalds/linux/commit/9802d86585db91655c7d1929a4f6bbe0952ea88e)
|
||||
`BPF_FUNC_sock_ops_cb_flags_set()` | 4.16 | | [`b13d88072172`](https://github.com/torvalds/linux/commit/b13d880721729384757f235166068c315326f4a1)
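The version column above can be used to decide whether a helper is likely to exist on a given machine. The following is a minimal sketch, not part of bcc: the names `HELPER_MIN_KERNEL`, `parse_kernel_version` and `helper_available` are hypothetical, and only a few rows of the table are copied in. Versions are compared as integer tuples because a plain string comparison would wrongly rank `5.8` above `5.10`. Distribution kernels may backport helpers, so treat the result as a hint rather than a guarantee.

```python
import re

# Minimum upstream kernel versions, copied from a few rows of the table above.
HELPER_MIN_KERNEL = {
    "BPF_FUNC_trace_printk": (4, 1),
    "BPF_FUNC_ringbuf_output": (5, 8),
    "BPF_FUNC_redirect_neigh": (5, 10),
    "BPF_FUNC_user_ringbuf_drain": (6, 1),
}


def parse_kernel_version(release):
    """Return (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def helper_available(helper, release):
    """True if the upstream kernel of this release already contains the helper."""
    return parse_kernel_version(release) >= HELPER_MIN_KERNEL[helper]
```

On a running system, the release string could be taken from `platform.release()` in the Python standard library, for example `helper_available("BPF_FUNC_ringbuf_output", platform.release())`.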

Note: GPL-only BPF helpers require a GPL-compatible license. The current licenses considered GPL-compatible by the kernel are:

* GPL
* GPL v2
* GPL and additional rights
* Dual BSD/GPL
* Dual MIT/GPL
* Dual MPL/GPL

Check the list of GPL-compatible licenses in your [kernel source code](https://github.com/torvalds/linux/blob/master/include/linux/license.h).
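The rule in the note above can be modelled in a few lines. This is an illustrative sketch only, assuming an exact, case-sensitive comparison of the license string against the list above; `is_gpl_compatible` and `can_use_helper` are hypothetical names, not kernel or bcc functions.

```python
# License strings treated as GPL-compatible, taken from the list above.
GPL_COMPATIBLE_LICENSES = frozenset({
    "GPL",
    "GPL v2",
    "GPL and additional rights",
    "Dual BSD/GPL",
    "Dual MIT/GPL",
    "Dual MPL/GPL",
})


def is_gpl_compatible(license_string):
    """Exact match against the known list (assumption: no case folding)."""
    return license_string in GPL_COMPATIBLE_LICENSES


def can_use_helper(helper_is_gpl_only, program_license):
    """A GPL-only helper needs a GPL-compatible program license; others do not."""
    return (not helper_is_gpl_only) or is_gpl_compatible(program_license)
```

For example, a program declaring `Dual BSD/GPL` could call `BPF_FUNC_trace_printk()` (marked GPL in the table), while a program with any other license string could only use the helpers whose GPL column is empty.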

## Program Types

The list of program types and supported helper functions can be retrieved with:

```sh
git grep -W 'func_proto(enum bpf_func_id func_id' kernel/ net/ drivers/
```

|Program Type| Helper Functions|
|------------|-----------------|
|`BPF_PROG_TYPE_SOCKET_FILTER`|`BPF_FUNC_skb_load_bytes()` <br> `BPF_FUNC_skb_load_bytes_relative()` <br> `BPF_FUNC_get_socket_cookie()` <br> `BPF_FUNC_get_socket_uid()` <br> `BPF_FUNC_perf_event_output()` <br> `Base functions`|
|`BPF_PROG_TYPE_KPROBE`|`BPF_FUNC_perf_event_output()` <br> `BPF_FUNC_get_stackid()` <br> `BPF_FUNC_get_stack()` <br> `BPF_FUNC_perf_event_read_value()` <br> `BPF_FUNC_override_return()` <br> `Tracing functions`|
|`BPF_PROG_TYPE_SCHED_CLS` <br> `BPF_PROG_TYPE_SCHED_ACT`|`BPF_FUNC_skb_store_bytes()` <br> `BPF_FUNC_skb_load_bytes()` <br> `BPF_FUNC_skb_load_bytes_relative()` <br> `BPF_FUNC_skb_pull_data()` <br> `BPF_FUNC_csum_diff()` <br> `BPF_FUNC_csum_update()` <br> `BPF_FUNC_l3_csum_replace()` <br> `BPF_FUNC_l4_csum_replace()` <br> `BPF_FUNC_clone_redirect()` <br> `BPF_FUNC_get_cgroup_classid()` <br> `BPF_FUNC_skb_vlan_push()` <br> `BPF_FUNC_skb_vlan_pop()` <br> `BPF_FUNC_skb_change_proto()` <br> `BPF_FUNC_skb_change_type()` <br> `BPF_FUNC_skb_adjust_room()` <br> `BPF_FUNC_skb_change_tail()` <br> `BPF_FUNC_skb_get_tunnel_key()` <br> `BPF_FUNC_skb_set_tunnel_key()` <br> `BPF_FUNC_skb_get_tunnel_opt()` <br> `BPF_FUNC_skb_set_tunnel_opt()` <br> `BPF_FUNC_redirect()` <br> `BPF_FUNC_get_route_realm()` <br> `BPF_FUNC_get_hash_recalc()` <br> `BPF_FUNC_set_hash_invalid()` <br> `BPF_FUNC_set_hash()` <br> `BPF_FUNC_perf_event_output()` <br> `BPF_FUNC_get_smp_processor_id()` <br> `BPF_FUNC_skb_under_cgroup()` <br> `BPF_FUNC_get_socket_cookie()` <br> `BPF_FUNC_get_socket_uid()` <br> `BPF_FUNC_fib_lookup()` <br> `BPF_FUNC_skb_get_xfrm_state()` <br> `BPF_FUNC_skb_cgroup_id()` <br> `Base functions`|
|`BPF_PROG_TYPE_TRACEPOINT`|`BPF_FUNC_perf_event_output()` <br> `BPF_FUNC_get_stackid()` <br> `BPF_FUNC_get_stack()` <br> `BPF_FUNC_d_path()` <br> `Tracing functions`|
|`BPF_PROG_TYPE_XDP`| `BPF_FUNC_perf_event_output()` <br> `BPF_FUNC_get_smp_processor_id()` <br> `BPF_FUNC_csum_diff()` <br> `BPF_FUNC_xdp_adjust_head()` <br> `BPF_FUNC_xdp_adjust_meta()` <br> `BPF_FUNC_redirect()` <br> `BPF_FUNC_redirect_map()` <br> `BPF_FUNC_xdp_adjust_tail()` <br> `BPF_FUNC_fib_lookup()` <br> `Base functions`|
|`BPF_PROG_TYPE_PERF_EVENT`| `BPF_FUNC_perf_event_output()` <br> `BPF_FUNC_get_stackid()` <br> `BPF_FUNC_get_stack()` <br> `BPF_FUNC_perf_prog_read_value()` <br> `Tracing functions`|
|`BPF_PROG_TYPE_CGROUP_SKB`|`BPF_FUNC_skb_load_bytes()` <br> `BPF_FUNC_skb_load_bytes_relative()` <br> `BPF_FUNC_get_socket_cookie()` <br> `BPF_FUNC_get_socket_uid()` <br> `Base functions`|
|`BPF_PROG_TYPE_CGROUP_SOCK`|`BPF_FUNC_get_current_uid_gid()` <br> `Base functions`|
|`BPF_PROG_TYPE_LWT_IN`|`BPF_FUNC_lwt_push_encap()` <br> `LWT functions` <br> `Base functions`|
|`BPF_PROG_TYPE_LWT_OUT`| `LWT functions` <br> `Base functions`|
|`BPF_PROG_TYPE_LWT_XMIT`| `BPF_FUNC_skb_get_tunnel_key()` <br> `BPF_FUNC_skb_set_tunnel_key()` <br> `BPF_FUNC_skb_get_tunnel_opt()` <br> `BPF_FUNC_skb_set_tunnel_opt()` <br> `BPF_FUNC_redirect()` <br> `BPF_FUNC_clone_redirect()` <br> `BPF_FUNC_skb_change_tail()` <br> `BPF_FUNC_skb_change_head()` <br> `BPF_FUNC_skb_store_bytes()` <br> `BPF_FUNC_csum_update()` <br> `BPF_FUNC_l3_csum_replace()` <br> `BPF_FUNC_l4_csum_replace()` <br> `BPF_FUNC_set_hash_invalid()` <br> `LWT functions`|
|`BPF_PROG_TYPE_SOCK_OPS`|`BPF_FUNC_setsockopt()` <br> `BPF_FUNC_getsockopt()` <br> `BPF_FUNC_sock_ops_cb_flags_set()` <br> `BPF_FUNC_sock_map_update()` <br> `BPF_FUNC_sock_hash_update()` <br> `BPF_FUNC_get_socket_cookie()` <br> `Base functions`|
|`BPF_PROG_TYPE_SK_SKB`|`BPF_FUNC_skb_store_bytes()` <br> `BPF_FUNC_skb_load_bytes()` <br> `BPF_FUNC_skb_pull_data()` <br> `BPF_FUNC_skb_change_tail()` <br> `BPF_FUNC_skb_change_head()` <br> `BPF_FUNC_get_socket_cookie()` <br> `BPF_FUNC_get_socket_uid()` <br> `BPF_FUNC_sk_redirect_map()` <br> `BPF_FUNC_sk_redirect_hash()` <br> `BPF_FUNC_sk_lookup_tcp()` <br> `BPF_FUNC_sk_lookup_udp()` <br> `BPF_FUNC_sk_release()` <br> `Base functions`|
|`BPF_PROG_TYPE_CGROUP_DEVICE`|`BPF_FUNC_map_lookup_elem()` <br> `BPF_FUNC_map_update_elem()` <br> `BPF_FUNC_map_delete_elem()` <br> `BPF_FUNC_get_current_uid_gid()` <br> `BPF_FUNC_trace_printk()`|
|`BPF_PROG_TYPE_SK_MSG`|`BPF_FUNC_msg_redirect_map()` <br> `BPF_FUNC_msg_redirect_hash()` <br> `BPF_FUNC_msg_apply_bytes()` <br> `BPF_FUNC_msg_cork_bytes()` <br> `BPF_FUNC_msg_pull_data()` <br> `BPF_FUNC_msg_push_data()` <br> `BPF_FUNC_msg_pop_data()` <br> `Base functions`|
|`BPF_PROG_TYPE_RAW_TRACEPOINT`|`BPF_FUNC_perf_event_output()` <br> `BPF_FUNC_get_stackid()` <br> `BPF_FUNC_get_stack()` <br> `BPF_FUNC_skb_output()` <br> `Tracing functions`|
|`BPF_PROG_TYPE_CGROUP_SOCK_ADDR`|`BPF_FUNC_get_current_uid_gid()` <br> `BPF_FUNC_bind()` <br> `BPF_FUNC_get_socket_cookie()` <br> `Base functions`|
|`BPF_PROG_TYPE_LWT_SEG6LOCAL`|`BPF_FUNC_lwt_seg6_store_bytes()` <br> `BPF_FUNC_lwt_seg6_action()` <br> `BPF_FUNC_lwt_seg6_adjust_srh()` <br> `LWT functions`|
|`BPF_PROG_TYPE_LIRC_MODE2`|`BPF_FUNC_rc_repeat()` <br> `BPF_FUNC_rc_keydown()` <br> `BPF_FUNC_rc_pointer_rel()` <br> `BPF_FUNC_map_lookup_elem()` <br> `BPF_FUNC_map_update_elem()` <br> `BPF_FUNC_map_delete_elem()` <br> `BPF_FUNC_ktime_get_ns()` <br> `BPF_FUNC_tail_call()` <br> `BPF_FUNC_get_prandom_u32()` <br> `BPF_FUNC_trace_printk()`|
|`BPF_PROG_TYPE_SK_REUSEPORT`|`BPF_FUNC_sk_select_reuseport()` <br> `BPF_FUNC_skb_load_bytes()` <br> `BPF_FUNC_skb_load_bytes_relative()` <br> `Base functions`|
|`BPF_PROG_TYPE_FLOW_DISSECTOR`|`BPF_FUNC_skb_load_bytes()` <br> `Base functions`|

|Function Group| Functions|
|------------------|-------|
|`Base functions`| `BPF_FUNC_map_lookup_elem()` <br> `BPF_FUNC_map_update_elem()` <br> `BPF_FUNC_map_delete_elem()` <br> `BPF_FUNC_map_peek_elem()` <br> `BPF_FUNC_map_pop_elem()` <br> `BPF_FUNC_map_push_elem()` <br> `BPF_FUNC_get_prandom_u32()` <br> `BPF_FUNC_get_smp_processor_id()` <br> `BPF_FUNC_get_numa_node_id()` <br> `BPF_FUNC_tail_call()` <br> `BPF_FUNC_ktime_get_boot_ns()` <br> `BPF_FUNC_ktime_get_ns()` <br> `BPF_FUNC_trace_printk()` <br> `BPF_FUNC_spin_lock()` <br> `BPF_FUNC_spin_unlock()` |
|`Tracing functions`|`BPF_FUNC_map_lookup_elem()` <br> `BPF_FUNC_map_update_elem()` <br> `BPF_FUNC_map_delete_elem()` <br> `BPF_FUNC_probe_read()` <br> `BPF_FUNC_ktime_get_boot_ns()` <br> `BPF_FUNC_ktime_get_ns()` <br> `BPF_FUNC_tail_call()` <br> `BPF_FUNC_get_current_pid_tgid()` <br> `BPF_FUNC_get_current_task()` <br> `BPF_FUNC_get_current_uid_gid()` <br> `BPF_FUNC_get_current_comm()` <br> `BPF_FUNC_trace_printk()` <br> `BPF_FUNC_get_smp_processor_id()` <br> `BPF_FUNC_get_numa_node_id()` <br> `BPF_FUNC_perf_event_read()` <br> `BPF_FUNC_probe_write_user()` <br> `BPF_FUNC_current_task_under_cgroup()` <br> `BPF_FUNC_get_prandom_u32()` <br> `BPF_FUNC_probe_read_str()` <br> `BPF_FUNC_get_current_cgroup_id()` <br> `BPF_FUNC_send_signal()` <br> `BPF_FUNC_probe_read_kernel()` <br> `BPF_FUNC_probe_read_kernel_str()` <br> `BPF_FUNC_probe_read_user()` <br> `BPF_FUNC_probe_read_user_str()` <br> `BPF_FUNC_send_signal_thread()` <br> `BPF_FUNC_get_ns_current_pid_tgid()` <br> `BPF_FUNC_xdp_output()` <br> `BPF_FUNC_get_task_stack()`|
|`LWT functions`| `BPF_FUNC_skb_load_bytes()` <br> `BPF_FUNC_skb_pull_data()` <br> `BPF_FUNC_csum_diff()` <br> `BPF_FUNC_get_cgroup_classid()` <br> `BPF_FUNC_get_route_realm()` <br> `BPF_FUNC_get_hash_recalc()` <br> `BPF_FUNC_perf_event_output()` <br> `BPF_FUNC_get_smp_processor_id()` <br> `BPF_FUNC_skb_under_cgroup()`|
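The two tables above work together: a program type row may name a function group, and the group table lists what that name stands for. The sketch below shows one way to flatten them. It is only an illustration with hypothetical names (`FUNCTION_GROUPS`, `PROGRAM_TYPE_HELPERS`, `expand_helpers`), and it copies just two program types and one group from the tables.

```python
# One function group, copied from the "Function Group" table above.
FUNCTION_GROUPS = {
    "Base functions": [
        "BPF_FUNC_map_lookup_elem()", "BPF_FUNC_map_update_elem()",
        "BPF_FUNC_map_delete_elem()", "BPF_FUNC_map_peek_elem()",
        "BPF_FUNC_map_pop_elem()", "BPF_FUNC_map_push_elem()",
        "BPF_FUNC_get_prandom_u32()", "BPF_FUNC_get_smp_processor_id()",
        "BPF_FUNC_get_numa_node_id()", "BPF_FUNC_tail_call()",
        "BPF_FUNC_ktime_get_boot_ns()", "BPF_FUNC_ktime_get_ns()",
        "BPF_FUNC_trace_printk()", "BPF_FUNC_spin_lock()",
        "BPF_FUNC_spin_unlock()",
    ],
}

# Two rows copied from the "Program Type" table above.
PROGRAM_TYPE_HELPERS = {
    "BPF_PROG_TYPE_CGROUP_SOCK": ["BPF_FUNC_get_current_uid_gid()", "Base functions"],
    "BPF_PROG_TYPE_FLOW_DISSECTOR": ["BPF_FUNC_skb_load_bytes()", "Base functions"],
}


def expand_helpers(program_type):
    """Replace group names by their members, keeping order and dropping duplicates."""
    helpers = []
    for entry in PROGRAM_TYPE_HELPERS[program_type]:
        # An entry is either a group name or a single helper.
        for name in FUNCTION_GROUPS.get(entry, [entry]):
            if name not in helpers:
                helpers.append(name)
    return helpers
```

Calling `expand_helpers("BPF_PROG_TYPE_FLOW_DISSECTOR")` returns the type-specific helper followed by every member of `Base functions`.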
48
src/bcc-documents/kernel_config.md
Normal file
@@ -0,0 +1,48 @@
# Kernel Configuration for BPF Features

## BPF Related Kernel Configurations

| Functionalities | Kernel Configuration | Description |
|:----------------|:---------------------|:------------|
| **Basic** | CONFIG_BPF_SYSCALL | Enable the bpf() system call |
| | CONFIG_BPF_JIT | BPF programs are normally handled by a BPF interpreter. This option allows the kernel to generate native code when a program is loaded into the kernel. This significantly speeds up the processing of BPF programs |
| | CONFIG_HAVE_BPF_JIT | Enable BPF Just In Time compiler |
| | CONFIG_HAVE_EBPF_JIT | Extended BPF JIT (eBPF) |
| | CONFIG_HAVE_CBPF_JIT | Classic BPF JIT (cBPF) |
| | CONFIG_MODULES | Enable to build loadable kernel modules |
| | CONFIG_BPF | BPF VM interpreter |
| | CONFIG_BPF_EVENTS | Allow the user to attach BPF programs to kprobe, uprobe, and tracepoint events |
| | CONFIG_PERF_EVENTS | Kernel performance events and counters |
| | CONFIG_HAVE_PERF_EVENTS | Enable perf events |
| | CONFIG_PROFILING | Enable the extended profiling support mechanisms used by profilers |
| **BTF** | CONFIG_DEBUG_INFO_BTF | Generate deduplicated BTF type information from DWARF debug info |
| | CONFIG_PAHOLE_HAS_SPLIT_BTF | Generate BTF for each selected kernel module |
| | CONFIG_DEBUG_INFO_BTF_MODULES | Generate compact split BTF type information for kernel modules |
| **Security** | CONFIG_BPF_JIT_ALWAYS_ON | Enable the BPF JIT and remove the BPF interpreter to avoid speculative execution |
| | CONFIG_BPF_UNPRIV_DEFAULT_OFF | Disable unprivileged BPF by default by setting `/proc/sys/kernel/unprivileged_bpf_disabled` to 2 |
| **Cgroup** | CONFIG_CGROUP_BPF | Support for BPF programs attached to cgroups |
| **Network** | CONFIG_BPFILTER | BPF based packet filtering framework (BPFILTER) |
| | CONFIG_BPFILTER_UMH | This builds bpfilter kernel module with embedded user mode helper |
| | CONFIG_NET_CLS_BPF | BPF-based classifier - to classify packets based on programmable BPF (JIT'ed) filters as an alternative to ematches |
| | CONFIG_NET_ACT_BPF | Execute BPF code on packets. The BPF code will decide if the packet should be dropped or not |
| | CONFIG_BPF_STREAM_PARSER | Enable this to allow a TCP stream parser to be used with BPF_MAP_TYPE_SOCKMAP |
| | CONFIG_LWTUNNEL_BPF | Allow running BPF programs as a nexthop action following a route lookup for incoming and outgoing packets |
| | CONFIG_NETFILTER_XT_MATCH_BPF | BPF matching applies a linux socket filter to each packet and accepts those for which the filter returns non-zero |
| | CONFIG_IPV6_SEG6_BPF | Support the BPF seg6local hook (bpf: Add IPv6 Segment Routing helpers). [Reference](https://github.com/torvalds/linux/commit/fe94cc290f535709d3c5ebd1e472dfd0aec7ee7) |
| **kprobes** | CONFIG_KPROBE_EVENTS | This allows the user to add tracing events (similar to tracepoints) on the fly via the ftrace interface |
| | CONFIG_KPROBES | Enable kprobes-based dynamic events |
| | CONFIG_HAVE_KPROBES | Check if kprobes are enabled |
| | CONFIG_HAVE_REGS_AND_STACK_ACCESS_API | This symbol should be selected by an architecture if it supports the API needed to access registers and stack entries from pt_regs. For example the kprobes-based event tracer needs this API. |
| | CONFIG_KPROBES_ON_FTRACE | Have kprobes on function tracer if arch supports full passing of pt_regs to function tracing |
| **kprobe multi** | CONFIG_FPROBE | Enable fprobe to attach the probe on multiple functions at once |
| **kprobe override** | CONFIG_BPF_KPROBE_OVERRIDE | Enable BPF programs to override a kprobed function |
| **uprobes** | CONFIG_UPROBE_EVENTS | Enable uprobes-based dynamic events |
| | CONFIG_ARCH_SUPPORTS_UPROBES | Arch specific uprobes support |
| | CONFIG_UPROBES | Uprobes is the user-space counterpart to kprobes: they enable instrumentation applications (such as 'perf probe') to establish unintrusive probes in user-space binaries and libraries, by executing handler functions when the probes are hit by user-space applications. |
| | CONFIG_MMU | MMU-based virtualised addressing space support by paged memory management |
| **Tracepoints** | CONFIG_TRACEPOINTS | Enable inserting tracepoints in the kernel and connecting them to probe functions |
| | CONFIG_HAVE_SYSCALL_TRACEPOINTS | Enable syscall enter/exit tracing |
| **Raw Tracepoints** | Same as Tracepoints | |
| **LSM** | CONFIG_BPF_LSM | Enable instrumentation of the security hooks with BPF programs for implementing dynamic MAC and Audit Policies |
| **LIRC** | CONFIG_BPF_LIRC_MODE2 | Allow attaching BPF programs to a lirc device |
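To check a kernel against the table above, one option is to parse its build configuration and report the options that are not enabled. The sketch below is a hedged illustration: `parse_kernel_config` and `missing_options` are hypothetical helper names, and where the configuration text comes from is an assumption (many distributions ship it as `/boot/config-<release>`, and some kernels expose `/proc/config.gz`).

```python
def parse_kernel_config(text):
    """Map each CONFIG_ symbol to its value ('y', 'm', ...) or None if not set."""
    not_set_suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(not_set_suffix):
            # Disabled options appear as comments: "# CONFIG_FOO is not set".
            options[line[2:-len(not_set_suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_options(text, required):
    """Return the required options that are neither built in (y) nor modules (m)."""
    options = parse_kernel_config(text)
    return [name for name in required if options.get(name) not in ("y", "m")]


SAMPLE_CONFIG = """\
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
# CONFIG_BPF_LSM is not set
CONFIG_NET_CLS_BPF=m
"""
```

With `SAMPLE_CONFIG`, asking for `CONFIG_BPF_SYSCALL`, `CONFIG_BPF_LSM` and `CONFIG_DEBUG_INFO_BTF` reports the last two as missing: one is explicitly disabled and the other does not appear at all.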
2609
src/bcc-documents/reference_guide.md
Normal file
File diff suppressed because it is too large
125
src/bcc-documents/special_filtering.md
Normal file
@@ -0,0 +1,125 @@
# Special Filtering

Some tools have special filtering capabilities. The main use case is to trace
processes running in containers, but those mechanisms are generic and could
be used in other cases as well.

## Filtering by cgroups

Some tools have an option to filter by cgroup by referencing a pinned BPF hash
map managed externally.

Examples of commands:

```sh
# ./opensnoop --cgroupmap /sys/fs/bpf/test01
# ./execsnoop --cgroupmap /sys/fs/bpf/test01
# ./tcpconnect --cgroupmap /sys/fs/bpf/test01
# ./tcpaccept --cgroupmap /sys/fs/bpf/test01
# ./tcptracer --cgroupmap /sys/fs/bpf/test01
```

The commands above will only display results from processes that belong to one
of the cgroups whose id, returned by `bpf_get_current_cgroup_id()`, is in the
pinned BPF hash map.

The BPF hash map can be created by:

```sh
# bpftool map create /sys/fs/bpf/test01 type hash key 8 value 8 entries 128 \
        name cgroupset flags 0
```

To get a shell in a new cgroup, you can use:

```sh
# systemd-run --pty --unit test bash
```

The shell will be running in the cgroup
`/sys/fs/cgroup/unified/system.slice/test.service`.

The cgroup id can be discovered using the `name_to_handle_at()` system call. In
`examples/cgroupid`, you will find an example program that gets the cgroup
id.

```sh
# cd examples/cgroupid
# make
# ./cgroupid hex /sys/fs/cgroup/unified/system.slice/test.service
```

or, using Docker:

```sh
# cd examples/cgroupid
# docker build -t cgroupid .
# docker run --rm --privileged -v /sys/fs/cgroup:/sys/fs/cgroup \
        cgroupid cgroupid hex /sys/fs/cgroup/unified/system.slice/test.service
```

This prints the cgroup id as a hexadecimal string in the host endianness such
as `77 16 00 00 01 00 00 00`.

Then add this id as a key in the pinned map:

```sh
# FILE=/sys/fs/bpf/test01
# CGROUPID_HEX="77 16 00 00 01 00 00 00"
# bpftool map update pinned $FILE key hex $CGROUPID_HEX value hex 00 00 00 00 00 00 00 00 any
```

Now that the shell started by systemd-run has its cgroup id in the BPF hash
map, bcc tools will display results from this shell. Cgroups can be added and
removed from the BPF hash map without restarting the bcc tool.

This feature is useful for integrating bcc tools in external projects.
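The hex key used with `bpftool map update` in this section is the 64-bit cgroup id written out byte by byte in host byte order. The following sketch shows that conversion in both directions. It is an illustration under stated assumptions: the function names are hypothetical, the byte order is passed explicitly so the result does not depend on the machine, and obtaining the id itself (for example with the `cgroupid` example program) is left out.

```python
import sys


def cgroup_id_to_hex_key(cgroup_id, byteorder=sys.byteorder):
    """Format a 64-bit cgroup id as eight space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in cgroup_id.to_bytes(8, byteorder))


def hex_key_to_cgroup_id(hex_key, byteorder=sys.byteorder):
    """Inverse of cgroup_id_to_hex_key: parse the hex bytes back into an integer."""
    return int.from_bytes(bytes.fromhex(hex_key), byteorder)
```

On a little-endian host, the id `0x100001677` corresponds to the key `77 16 00 00 01 00 00 00` shown earlier in this section.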
|
||||
|
||||
## Filtering by mount by namespace
|
||||
|
||||
The BPF hash map can be created by:
|
||||
|
||||
```sh
|
||||
# bpftool map create /sys/fs/bpf/mnt_ns_set type hash key 8 value 4 entries 128 \
|
||||
name mnt_ns_set flags 0
|
||||
```
|
||||
|
||||
Execute the `execsnoop` tool filtering only the mount namespaces
|
||||
in `/sys/fs/bpf/mnt_ns_set`:
|
||||
|
||||
```sh
|
||||
# tools/execsnoop.py --mntnsmap /sys/fs/bpf/mnt_ns_set
|
||||
```
|
||||
|
||||
Start a terminal in a new mount namespace:
|
||||
|
||||
```sh
|
||||
# unshare -m bash
|
||||
```
|
||||
|
||||
Update the hash map with the mount namespace ID of the terminal above:
|
||||
|
||||
```sh
|
||||
FILE=/sys/fs/bpf/mnt_ns_set
|
||||
if [ $(printf '\1' | od -dAn) -eq 1 ]; then
|
||||
HOST_ENDIAN_CMD=tac
|
||||
else
|
||||
HOST_ENDIAN_CMD=cat
|
||||
fi
|
||||
|
||||
NS_ID_HEX="$(printf '%016x' $(stat -Lc '%i' /proc/self/ns/mnt) | sed 's/.\{2\}/&\n/g' | $HOST_ENDIAN_CMD)"
|
||||
bpftool map update pinned $FILE key hex $NS_ID_HEX value hex 00 00 00 00 any
|
||||
```
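
The shell pipeline above turns the namespace inode number into eight space-separated hex bytes in host byte order. As a rough illustration only, the same conversion could be sketched in Python. The function name `ns_id_to_hex_key` is hypothetical and is not part of bcc or bpftool.

```python
import sys


def ns_id_to_hex_key(ns_id, byteorder=sys.byteorder):
    """Format a 64-bit ID as space-separated hex bytes (hypothetical helper)."""
    # to_bytes() lays the integer out in the requested byte order; the
    # default is the host's order, which is what the 8-byte map key expects.
    raw = ns_id.to_bytes(8, byteorder)
    return " ".join("%02x" % b for b in raw)


# Fixed example value, so the result does not depend on the host.
print(ns_id_to_hex_key(0xF0000000, "little"))
```

On a live system the ID could be read with `os.stat("/proc/self/ns/mnt").st_ino`, mirroring the `stat -Lc '%i'` call above, and the resulting string passed to `bpftool map update` after `key hex`.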
|
||||
|
||||
Execute a command in this terminal:
|
||||
|
||||
```sh
|
||||
# ping kinvolk.io
|
||||
```
|
||||
|
||||
You'll see how on the `execsnoop` terminal you started above the call is logged:
|
||||
|
||||
```sh
|
||||
# tools/execsnoop.py --mntnsmap /sys/fs/bpf/mnt_ns_set
|
||||
[sudo] password for mvb:
|
||||
PCOMM PID PPID RET ARGS
|
||||
ping 8096 7970 0 /bin/ping kinvolk.io
|
||||
```
|
||||
422
src/bcc-documents/tutorial.md
Normal file
@@ -0,0 +1,422 @@
|
||||
# bcc Tutorial
|
||||
|
||||
This tutorial covers how to use [bcc](https://github.com/iovisor/bcc) tools to quickly solve performance, troubleshooting, and networking issues. If you want to develop new bcc tools, see [tutorial_bcc_python_developer.md](tutorial_bcc_python_developer.md) for that tutorial.
|
||||
|
||||
It is assumed for this tutorial that bcc is already installed, and you can run tools like execsnoop successfully. See [INSTALL.md](https://github.com/iovisor/bcc/tree/master/INSTALL.md). This uses enhancements added to the Linux 4.x series.
|
||||
|
||||
## Observability
|
||||
|
||||
Some quick wins.
|
||||
|
||||
### 0. Before bcc
|
||||
|
||||
Before using bcc, you should start with the Linux basics. One reference is the [Linux Performance Analysis in 60,000 Milliseconds](https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55) post, which covers these commands:
|
||||
|
||||
1. uptime
|
||||
1. dmesg | tail
|
||||
1. vmstat 1
|
||||
1. mpstat -P ALL 1
|
||||
1. pidstat 1
|
||||
1. iostat -xz 1
|
||||
1. free -m
|
||||
1. sar -n DEV 1
|
||||
1. sar -n TCP,ETCP 1
|
||||
1. top
|
||||
|
||||
### 1. General Performance
|
||||
|
||||
Here is a generic checklist for performance investigations with bcc, first as a list, then in detail:
|
||||
|
||||
1. execsnoop
|
||||
1. opensnoop
|
||||
1. ext4slower (or btrfs\*, xfs\*, zfs\*)
|
||||
1. biolatency
|
||||
1. biosnoop
|
||||
1. cachestat
|
||||
1. tcpconnect
|
||||
1. tcpaccept
|
||||
1. tcpretrans
|
||||
1. runqlat
|
||||
1. profile
|
||||
|
||||
These tools may be installed on your system under /usr/share/bcc/tools, or you can run them from the bcc GitHub repo under /tools, where they have a .py extension. Browse the 50+ tools available for more analysis options.
|
||||
|
||||
#### 1.1. execsnoop
|
||||
|
||||
```sh
|
||||
# ./execsnoop
|
||||
PCOMM PID RET ARGS
|
||||
supervise 9660 0 ./run
|
||||
supervise 9661 0 ./run
|
||||
mkdir 9662 0 /bin/mkdir -p ./main
|
||||
run 9663 0 ./run
|
||||
[...]
|
||||
```
|
||||
|
||||
execsnoop prints one line of output for each new process. Check for short-lived processes. These can consume CPU resources, but not show up in most monitoring tools that periodically take snapshots of which processes are running.
|
||||
|
||||
It works by tracing exec(), not fork(), so it will catch many types of new processes but not all (e.g., it won't see an application launching worker processes that don't exec() anything else).
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/execsnoop_example.txt).
|
||||
|
||||
#### 1.2. opensnoop
|
||||
|
||||
```sh
|
||||
# ./opensnoop
|
||||
PID COMM FD ERR PATH
|
||||
1565 redis-server 5 0 /proc/1565/stat
|
||||
1565 redis-server 5 0 /proc/1565/stat
|
||||
1565 redis-server 5 0 /proc/1565/stat
|
||||
1603 snmpd 9 0 /proc/net/dev
|
||||
1603 snmpd 11 0 /proc/net/if_inet6
|
||||
1603 snmpd -1 2 /sys/class/net/eth0/device/vendor
|
||||
1603 snmpd 11 0 /proc/sys/net/ipv4/neigh/eth0/retrans_time_ms
|
||||
1603 snmpd 11 0 /proc/sys/net/ipv6/neigh/eth0/retrans_time_ms
|
||||
1603 snmpd 11 0 /proc/sys/net/ipv6/conf/eth0/forwarding
|
||||
[...]
|
||||
```
|
||||
|
||||
opensnoop prints one line of output for each open() syscall, including details.
|
||||
|
||||
Files that are opened can tell you a lot about how applications work: identifying their data files, config files, and log files. Sometimes applications can misbehave, and perform poorly, when they are constantly attempting to read files that do not exist. opensnoop gives you a quick look.
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/opensnoop_example.txt).
|
||||
|
||||
#### 1.3. ext4slower (or btrfs\*, xfs\*, zfs\*)
|
||||
|
||||
```sh
|
||||
# ./ext4slower
|
||||
Tracing ext4 operations slower than 10 ms
|
||||
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
|
||||
06:35:01 cron 16464 R 1249 0 16.05 common-auth
|
||||
06:35:01 cron 16463 R 1249 0 16.04 common-auth
|
||||
06:35:01 cron 16465 R 1249 0 16.03 common-auth
|
||||
06:35:01 cron 16465 R 4096 0 10.62 login.defs
|
||||
06:35:01 cron 16464 R 4096 0 10.61 login.defs
|
||||
```
|
||||
|
||||
ext4slower traces the ext4 file system and times common operations, and then only prints those that exceed a threshold.
|
||||
|
||||
This is great for identifying or exonerating one type of performance issue: individually slow disk I/O, as seen via the file system. Disks process I/O asynchronously, and it can be difficult to associate latency at that layer with the latency applications experience. Tracing higher up in the kernel stack, at the VFS -> file system interface, will more closely match what an application suffers. Use this tool to identify if file system latency exceeds a given threshold.
|
||||
|
||||
Similar tools exist in bcc for other file systems: btrfsslower, xfsslower, and zfsslower. There is also fileslower, which works at the VFS layer and traces everything (although at some higher overhead).
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/ext4slower_example.txt).
|
||||
|
||||
#### 1.4. biolatency
|
||||
|
||||
```sh
|
||||
# ./biolatency
|
||||
Tracing block device I/O... Hit Ctrl-C to end.
|
||||
^C
|
||||
usecs : count distribution
|
||||
0 -> 1 : 0 | |
|
||||
2 -> 3 : 0 | |
|
||||
4 -> 7 : 0 | |
|
||||
8 -> 15 : 0 | |
|
||||
16 -> 31 : 0 | |
|
||||
32 -> 63 : 0 | |
|
||||
64 -> 127 : 1 | |
|
||||
128 -> 255 : 12 |******** |
|
||||
256 -> 511 : 15 |********** |
|
||||
512 -> 1023 : 43 |******************************* |
|
||||
1024 -> 2047 : 52 |**************************************|
|
||||
2048 -> 4095 : 47 |********************************** |
|
||||
4096 -> 8191 : 52 |**************************************|
|
||||
8192 -> 16383 : 36 |************************** |
|
||||
16384 -> 32767 : 15 |********** |
|
||||
32768 -> 65535 : 2 |* |
|
||||
65536 -> 131071 : 2 |* |
|
||||
```
|
||||
|
||||
biolatency traces disk I/O latency (time from device issue to completion), and when the tool ends (Ctrl-C, or a given interval), it prints a histogram summary of the latency.
|
||||
|
||||
This is great for understanding disk I/O latency beyond the average times given by tools like iostat. I/O latency outliers will be visible at the end of the distribution, as well as multi-mode distributions.
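
The histogram rows above are power-of-two ranges (0 -> 1, 2 -> 3, 4 -> 7, and so on). As an illustration of that bucketing only, not of how biolatency is implemented, here is a small Python sketch. The function names and the sample latencies are made up for the example.

```python
from collections import Counter


def log2_bucket(value):
    """Return the (low, high) power-of-two range that contains value."""
    if value <= 1:
        return (0, 1)
    k = value.bit_length() - 1      # index of the highest set bit
    return (1 << k, (1 << (k + 1)) - 1)


def histogram(latencies_us):
    """Count how many latencies fall into each power-of-two bucket."""
    return Counter(log2_bucket(v) for v in latencies_us)


# Made-up latencies in microseconds, for illustration only.
sample = [90, 130, 700, 800, 1500, 5000, 70000]
for (low, high), count in sorted(histogram(sample).items()):
    print("%8d -> %-8d : %d" % (low, high, count))
```

Because each bucket is twice as wide as the previous one, a single outlier such as the 70000 us sample lands in its own row at the end of the distribution instead of being hidden in an average.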
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/biolatency_example.txt).
|
||||
|
||||
#### 1.5. biosnoop
|
||||
|
||||
```sh
|
||||
# ./biosnoop
|
||||
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
|
||||
0.000004001 supervise 1950 xvda1 W 13092560 4096 0.74
|
||||
0.000178002 supervise 1950 xvda1 W 13092432 4096 0.61
|
||||
0.001469001 supervise 1956 xvda1 W 13092440 4096 1.24
|
||||
0.001588002 supervise 1956 xvda1 W 13115128 4096 1.09
|
||||
1.022346001 supervise 1950 xvda1 W 13115272 4096 0.98
|
||||
1.022568002 supervise 1950 xvda1 W 13188496 4096 0.93
|
||||
[...]
|
||||
```
|
||||
|
||||
biosnoop prints a line of output for each disk I/O, with details including latency (time from device issue to completion).
|
||||
|
||||
This allows you to examine disk I/O in more detail, and look for time-ordered patterns (eg, reads queueing behind writes). Note that the output will be verbose if your system performs disk I/O at a high rate.
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/biosnoop_example.txt).
|
||||
|
||||
#### 1.6. cachestat
|
||||
|
||||
```sh
|
||||
# ./cachestat
|
||||
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
|
||||
1074 44 13 94.9% 2.9% 1 223
|
||||
2195 170 8 92.5% 6.8% 1 143
|
||||
182 53 56 53.6% 1.3% 1 143
|
||||
62480 40960 20480 40.6% 19.8% 1 223
|
||||
7 2 5 22.2% 22.2% 1 223
|
||||
348 0 0 100.0% 0.0% 1 223
|
||||
[...]
|
||||
```
|
||||
|
||||
cachestat prints a one line summary every second (or every custom interval) showing statistics from the file system cache.
|
||||
|
||||
Use this to identify a low cache hit ratio, and a high rate of misses: which gives one lead for performance tuning.
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/cachestat_example.txt).
|
||||
|
||||
#### 1.7. tcpconnect
|
||||
|
||||
```sh
|
||||
# ./tcpconnect
|
||||
PID COMM IP SADDR DADDR DPORT
|
||||
1479 telnet 4 127.0.0.1 127.0.0.1 23
|
||||
1469 curl 4 10.201.219.236 54.245.105.25 80
|
||||
1469 curl 4 10.201.219.236 54.67.101.145 80
|
||||
1991 telnet 6 ::1 ::1 23
|
||||
2015 ssh 6 fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22
|
||||
[...]
|
||||
```
|
||||
|
||||
tcpconnect prints one line of output for every active TCP connection (eg, via connect()), with details including source and destination addresses.
|
||||
|
||||
Look for unexpected connections that may point to inefficiencies in application configuration, or an intruder.
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/tcpconnect_example.txt).
|
||||
|
||||
#### 1.8. tcpaccept
|
||||
|
||||
```sh
|
||||
# ./tcpaccept
|
||||
PID COMM IP RADDR LADDR LPORT
|
||||
907 sshd 4 192.168.56.1 192.168.56.102 22
|
||||
907 sshd 4 127.0.0.1 127.0.0.1 22
|
||||
5389 perl 6 1234:ab12:2040:5020:2299:0:5:0 1234:ab12:2040:5020:2299:0:5:0 7001
|
||||
[...]
|
||||
```
|
||||
|
||||
tcpaccept prints one line of output for every passive TCP connection (eg, via accept()), with details including source and destination addresses.
|
||||
|
||||
Look for unexpected connections that may point to inefficiencies in application configuration, or an intruder.
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/tcpaccept_example.txt).
|
||||
|
||||
#### 1.9. tcpretrans
|
||||
|
||||
```sh
|
||||
# ./tcpretrans
|
||||
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
|
||||
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
|
||||
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
|
||||
01:55:17 0 4 10.153.223.157:22 R> 69.53.245.40:22957 ESTABLISHED
|
||||
[...]
|
||||
```
|
||||
|
||||
tcpretrans prints one line of output for every TCP retransmit packet, with details including source and destination addresses, and kernel state of the TCP connection.
|
||||
|
||||
TCP retransmissions cause latency and throughput issues. For ESTABLISHED retransmits, look for patterns with networks. For SYN_SENT, this may point to target kernel CPU saturation and kernel packet drops.
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/tcpretrans_example.txt).
|
||||
|
||||
#### 1.10. runqlat
|
||||
|
||||
```sh
|
||||
# ./runqlat
|
||||
Tracing run queue latency... Hit Ctrl-C to end.
|
||||
^C
|
||||
usecs : count distribution
|
||||
0 -> 1 : 233 |*********** |
|
||||
2 -> 3 : 742 |************************************ |
|
||||
4 -> 7 : 203 |********** |
|
||||
8 -> 15 : 173 |******** |
|
||||
16 -> 31 : 24 |* |
|
||||
32 -> 63 : 0 | |
|
||||
64 -> 127 : 30 |* |
|
||||
128 -> 255 : 6 | |
|
||||
256 -> 511 : 3 | |
|
||||
512 -> 1023 : 5 | |
|
||||
1024 -> 2047 : 27 |* |
|
||||
2048 -> 4095 : 30 |* |
|
||||
4096 -> 8191 : 20 | |
|
||||
8192 -> 16383 : 29 |* |
|
||||
16384 -> 32767 : 809 |****************************************|
|
||||
32768 -> 65535 : 64 |*** |
|
||||
```
|
||||
|
||||
runqlat times how long threads were waiting on the CPU run queues, and prints this as a histogram.
|
||||
|
||||
This can help quantify time lost waiting for a turn on CPU, during periods of CPU saturation.
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/runqlat_example.txt).
|
||||
|
||||
#### 1.11. profile
|
||||
|
||||
```sh
|
||||
# ./profile
|
||||
Sampling at 49 Hertz of all threads by user + kernel stack... Hit Ctrl-C to end.
|
||||
^C
|
||||
00007f31d76c3251 [unknown]
|
||||
47a2c1e752bf47f7 [unknown]
|
||||
- sign-file (8877)
|
||||
1
|
||||
|
||||
ffffffff813d0af8 __clear_user
|
||||
ffffffff813d5277 iov_iter_zero
|
||||
ffffffff814ec5f2 read_iter_zero
|
||||
ffffffff8120be9d __vfs_read
|
||||
ffffffff8120c385 vfs_read
|
||||
ffffffff8120d786 sys_read
|
||||
ffffffff817cc076 entry_SYSCALL_64_fastpath
|
||||
00007fc5652ad9b0 read
|
||||
- dd (25036)
|
||||
4
|
||||
|
||||
0000000000400542 func_a
|
||||
0000000000400598 main
|
||||
00007f12a133e830 __libc_start_main
|
||||
083e258d4c544155 [unknown]
|
||||
- func_ab (13549)
|
||||
5
|
||||
|
||||
[...]
|
||||
|
||||
ffffffff8105eb66 native_safe_halt
|
||||
ffffffff8103659e default_idle
|
||||
ffffffff81036d1f arch_cpu_idle
|
||||
ffffffff810bba5a default_idle_call
|
||||
ffffffff810bbd07 cpu_startup_entry
|
||||
ffffffff8104df55 start_secondary
|
||||
- swapper/1 (0)
|
||||
75
|
||||
```
|
||||
|
||||
profile is a CPU profiler, which takes samples of stack traces at timed intervals, and prints a summary of unique stack traces and a count of their occurrence.
|
||||
|
||||
Use this tool to understand the code paths that are consuming CPU resources.
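
The summary step described above (unique stack traces plus a count of their occurrence) can be illustrated with a short Python sketch. This is a user-space model with made-up sample data, not the code of the profile tool, and `summarize` is a hypothetical name.

```python
from collections import Counter

# Each sample is one stack trace, innermost frame first (made-up data).
samples = [
    ("func_a", "main", "__libc_start_main"),
    ("read", "sys_read", "vfs_read"),
    ("func_a", "main", "__libc_start_main"),
]


def summarize(stack_samples):
    """Count occurrences of each unique stack, least frequent first."""
    counts = Counter(stack_samples)
    return sorted(counts.items(), key=lambda item: item[1])


for stack, count in summarize(samples):
    print("\n".join("    " + frame for frame in stack))
    print("        %d\n" % count)
```

Sorting by count puts the hottest code paths at the end of the output, which matches the ordering of the sample listing above.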
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/profile_example.txt).
|
||||
|
||||
### 2. Observability with Generic Tools
|
||||
|
||||
In addition to the above tools for performance tuning, below is a checklist for bcc generic tools, first as a list, then in detail:
|
||||
|
||||
1. trace
|
||||
1. argdist
|
||||
1. funccount
|
||||
|
||||
These generic tools may be useful to provide visibility to solve your specific problems.
|
||||
|
||||
#### 2.1. trace
|
||||
|
||||
##### Example 1
|
||||
|
||||
Suppose you want to track file ownership change. There are three syscalls, `chown`, `fchown` and `lchown` which users can use to change file ownership. The corresponding syscall entry is `SyS_[f|l]chown`. The following command can be used to print out syscall parameters and the calling process user id. You can use `id` command to find the uid of a particular user.
|
||||
|
||||
```sh
|
||||
$ trace.py \
|
||||
'p::SyS_chown "file = %s, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid' \
|
||||
'p::SyS_fchown "fd = %d, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid' \
|
||||
'p::SyS_lchown "file = %s, to_uid = %d, to_gid = %d, from_uid = %d", arg1, arg2, arg3, $uid'
|
||||
PID TID COMM FUNC -
|
||||
1269255 1269255 python3.6 SyS_lchown file = /tmp/dotsync-usisgezu/tmp, to_uid = 128203, to_gid = 100, from_uid = 128203
|
||||
1269441 1269441 zstd SyS_chown file = /tmp/dotsync-vic7ygj0/dotsync-package.zst, to_uid = 128203, to_gid = 100, from_uid = 128203
|
||||
1269255 1269255 python3.6 SyS_lchown file = /tmp/dotsync-a40zd7ev/tmp, to_uid = 128203, to_gid = 100, from_uid = 128203
|
||||
1269442 1269442 zstd SyS_chown file = /tmp/dotsync-gzp413o_/dotsync-package.zst, to_uid = 128203, to_gid = 100, from_uid = 128203
|
||||
1269255 1269255 python3.6 SyS_lchown file = /tmp/dotsync-whx4fivm/tmp/.bash_profile, to_uid = 128203, to_gid = 100, from_uid = 128203
|
||||
```
|
||||
|
||||
##### Example 2
|
||||
|
||||
Suppose you want to count nonvoluntary context switches (`nvcsw`) in your BPF-based performance monitoring tools, but you do not know the proper method. `/proc/<pid>/status` already reports this number (`nonvoluntary_ctxt_switches`) for a pid, so you can use `trace.py` to run a quick experiment and verify your method. In the kernel source code, `nvcsw` is counted in the function `__schedule` in `linux/kernel/sched/core.c`, under the condition
|
||||
```c
|
||||
!(!preempt && prev->state) // i.e., preempt || !prev->state
|
||||
```
|
||||
|
||||
The `__schedule` function is marked as `notrace`, so the best place to evaluate the above condition seems to be the `sched/sched_switch` tracepoint, which is called inside `__schedule` and defined in `linux/include/trace/events/sched.h`. `trace.py` already provides `args` as the pointer to the tracepoint's `TP_STRUCT__entry`. The above condition in function `__schedule` can be represented as
|
||||
```c
|
||||
args->prev_state == TASK_STATE_MAX || args->prev_state == 0
|
||||
```
|
||||
|
||||
The commands below can be used to count involuntary context switches (per process or per thread) and compare the result with `/proc/<pid>/status` or `/proc/<pid>/task/<task_id>/status` for correctness. This comparison is practical because, in typical cases, involuntary context switches are not very common.
|
||||
```sh
|
||||
$ trace.py -p 1134138 't:sched:sched_switch (args->prev_state == TASK_STATE_MAX || args->prev_state == 0)'
|
||||
PID TID COMM FUNC
|
||||
1134138 1134140 contention_test sched_switch
|
||||
1134138 1134142 contention_test sched_switch
|
||||
...
|
||||
$ trace.py -L 1134140 't:sched:sched_switch (args->prev_state == TASK_STATE_MAX || args->prev_state == 0)'
|
||||
PID TID COMM FUNC
|
||||
1134138 1134140 contention_test sched_switch
|
||||
1134138 1134140 contention_test sched_switch
|
||||
...
|
||||
```
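
To compare the traced events with the kernel's own counter, the `nonvoluntary_ctxt_switches` field has to be read from the status file. A minimal Python sketch of that parsing step follows. The function name is hypothetical and the status excerpt is made up; on a live system you would read `/proc/<pid>/status` instead.

```python
def parse_ctxt_switches(status_text):
    """Extract the context-switch counters from /proc/<pid>/status text."""
    counters = {}
    for line in status_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.endswith("ctxt_switches"):
            counters[name] = int(value.strip())
    return counters


# Made-up excerpt of a status file, for illustration only.
example = (
    "Name:\tcontention_test\n"
    "voluntary_ctxt_switches:\t150\n"
    "nonvoluntary_ctxt_switches:\t42\n"
)
print(parse_ctxt_switches(example)["nonvoluntary_ctxt_switches"])
```

Reading the counter before and after a tracing run and comparing the difference with the number of `sched_switch` lines printed is one way to check the filter expression.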
|
||||
|
||||
##### Example 3
|
||||
|
||||
This example is related to issue [1231](https://github.com/iovisor/bcc/issues/1231) and [1516](https://github.com/iovisor/bcc/issues/1516) where uprobe does not work at all in certain cases. First, you can do a `strace` as below
|
||||
|
||||
```sh
|
||||
$ strace trace.py 'r:bash:readline "%s", retval'
|
||||
...
|
||||
perf_event_open(0x7ffd968212f0, -1, 0, -1, 0x8 /* PERF_FLAG_??? */) = -1 EIO (Input/output error)
|
||||
...
|
||||
```
|
||||
|
||||
The `perf_event_open` syscall returns `-EIO`. Searching for `EIO` in the kernel's uprobe-related code under the `/kernel/trace` and `/kernel/events` directories, the function `uprobe_register` looks the most suspicious. Let us find out whether this function is called and, if so, what it returns. In one terminal, use the following command to print the return value of `uprobe_register`:
|
||||
```sh
|
||||
$ trace.py 'r::uprobe_register "ret = %d", retval'
|
||||
```
|
||||
In another terminal run the same bash uretprobe tracing example, and you should get
|
||||
```sh
|
||||
$ trace.py 'r::uprobe_register "ret = %d", retval'
|
||||
PID TID COMM FUNC -
|
||||
1041401 1041401 python2.7 uprobe_register ret = -5
|
||||
```
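
Kernel functions conventionally report failure as a negated errno value, so a `ret` like the one printed above can be decoded by looking up its absolute value. A small Python sketch, assuming a Linux host; `describe_ret` is a hypothetical helper, not part of bcc.

```python
import errno
import os


def describe_ret(ret):
    """Describe a kernel-style return value (hypothetical helper)."""
    if ret >= 0:
        return "success (%d)" % ret
    code = -ret
    # errno.errorcode maps numeric codes to symbolic names such as 'EIO'.
    name = errno.errorcode.get(code, "unknown")
    return "%s: %s" % (name, os.strerror(code))


print(describe_ret(-5))
```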
|
||||
|
||||
The `-5` error code is EIO. This confirms that the following code in function `uprobe_register` is the most likely culprit.
|
||||
```c
|
||||
if (!inode->i_mapping->a_ops->readpage && !shmem_mapping(inode->i_mapping))
|
||||
return -EIO;
|
||||
```
|
||||
The `shmem_mapping` function is defined as
|
||||
```c
|
||||
bool shmem_mapping(struct address_space *mapping)
|
||||
{
|
||||
return mapping->a_ops == &shmem_aops;
|
||||
}
|
||||
```
|
||||
|
||||
To confirm the theory, find out what `inode->i_mapping->a_ops` is with the following command
|
||||
```sh
|
||||
$ trace.py -I 'linux/fs.h' 'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
|
||||
PID TID COMM FUNC -
|
||||
814288 814288 python2.7 uprobe_register a_ops = ffffffff81a2adc0
|
||||
^C$ grep ffffffff81a2adc0 /proc/kallsyms
|
||||
ffffffff81a2adc0 R empty_aops
|
||||
```
|
||||
|
||||
The kernel symbol `empty_aops` does not have `readpage` defined and hence the above suspicious condition is true. Further examining the kernel source code shows that `overlayfs` does not provide its own `a_ops` while some other file systems (e.g., ext4) define their own `a_ops` (e.g., `ext4_da_aops`), and `ext4_da_aops` defines `readpage`. Hence, uprobe works fine on ext4 while not on overlayfs.
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/trace_example.txt).
|
||||
|
||||
#### 2.2. argdist
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/argdist_example.txt).
|
||||
|
||||
#### 2.3. funccount
|
||||
|
||||
More [examples](https://github.com/iovisor/bcc/tree/master/tools/funccount_example.txt).
|
||||
|
||||
## Networking
|
||||
|
||||
To do.
|
||||
724
src/bcc-documents/tutorial_bcc_python_developer.md
Normal file
@@ -0,0 +1,724 @@
|
||||
# bcc Python Developer Tutorial
|
||||
|
||||
This tutorial is about developing [bcc](https://github.com/iovisor/bcc) tools and programs using the Python interface. There are two parts: observability then networking. Snippets are taken from various programs in bcc: see their files for licences.
|
||||
|
||||
Also see the bcc developer's [reference_guide.md](reference_guide.md), and a tutorial for end-users of tools: [tutorial.md](tutorial.md). There is also a lua interface for bcc.
|
||||
|
||||
## Observability
|
||||
|
||||
This observability tutorial contains 17 lessons, and 46 enumerated things to learn.
|
||||
|
||||
### Lesson 1. Hello World
|
||||
|
||||
Start by running [examples/hello_world.py](https://github.com/iovisor/bcc/tree/master/examples/hello_world.py), while running some commands (eg, "ls") in another session. It should print "Hello, World!" for new processes. If not, start by fixing bcc: see [INSTALL.md](https://github.com/iovisor/bcc/tree/master/INSTALL.md).
|
||||
|
||||
```sh
|
||||
# ./examples/hello_world.py
|
||||
bash-13364 [002] d... 24573433.052937: : Hello, World!
|
||||
bash-13364 [003] d... 24573436.642808: : Hello, World!
|
||||
[...]
|
||||
```
|
||||
|
||||
Here's the code for hello_world.py:
|
||||
|
||||
```Python
|
||||
from bcc import BPF
|
||||
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; }').trace_print()
|
||||
```
|
||||
|
||||
There are six things to learn from this:
|
||||
|
||||
1. ```text='...'```: This defines a BPF program inline. The program is written in C.
|
||||
|
||||
1. ```kprobe__sys_clone()```: This is a short-cut for kernel dynamic tracing via kprobes. If the C function begins with ``kprobe__``, the rest is treated as a kernel function name to instrument, in this case, ```sys_clone()```.
|
||||
|
||||
1. ```void *ctx```: ctx has arguments, but since we aren't using them here, we'll just cast it to ```void *```.
|
||||
|
||||
1. ```bpf_trace_printk()```: A simple kernel facility for printf() to the common trace_pipe (/sys/kernel/debug/tracing/trace_pipe). This is ok for some quick examples, but has limitations: 3 args max, 1 %s only, and trace_pipe is globally shared, so concurrent programs will have clashing output. A better interface is via BPF_PERF_OUTPUT(), covered later.
|
||||
|
||||
1. ```return 0;```: Necessary formality (if you want to know why, see [#139](https://github.com/iovisor/bcc/issues/139)).
|
||||
|
||||
1. ```.trace_print()```: A bcc routine that reads trace_pipe and prints the output.
|
||||
|
||||
### Lesson 2. sys_sync()
|
||||
|
||||
Write a program that traces the sys_sync() kernel function. Print "sys_sync() called" when it runs. Test by running ```sync``` in another session while tracing. The hello_world.py program has everything you need for this.
|
||||
|
||||
Improve it by printing "Tracing sys_sync()... Ctrl-C to end." when the program first starts. Hint: it's just Python.
|
||||
|
||||
### Lesson 3. hello_fields.py
|
||||
|
||||
This program is in [examples/tracing/hello_fields.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/hello_fields.py). Sample output (run commands in another session):
|
||||
|
||||
```sh
|
||||
# examples/tracing/hello_fields.py
|
||||
TIME(s) COMM PID MESSAGE
|
||||
24585001.174885999 sshd 1432 Hello, World!
|
||||
24585001.195710000 sshd 15780 Hello, World!
|
||||
24585001.991976000 systemd-udevd 484 Hello, World!
|
||||
24585002.276147000 bash 15787 Hello, World!
|
||||
```
|
||||
|
||||
Code:
|
||||
|
||||
```Python
|
||||
from bcc import BPF
|
||||
|
||||
# define BPF program
|
||||
prog = """
|
||||
int hello(void *ctx) {
|
||||
bpf_trace_printk("Hello, World!\\n");
|
||||
return 0;
|
||||
}
|
||||
"""
|
||||
|
||||
# load BPF program
|
||||
b = BPF(text=prog)
|
||||
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
|
||||
|
||||
# header
|
||||
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))
|
||||
|
||||
# format output
|
||||
while 1:
|
||||
try:
|
||||
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
|
||||
except ValueError:
|
||||
continue
|
||||
print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
|
||||
```
|
||||
|
||||
This is similar to hello_world.py, and traces new processes via sys_clone() again, but has a few more things to learn:
|
||||
|
||||
1. ```prog =```: This time we declare the C program as a variable, and later refer to it. This is useful if you want to add some string substitutions based on command line arguments.
|
||||
|
||||
1. ```hello()```: Now we're just declaring a C function, instead of the ```kprobe__``` shortcut. We'll refer to this later. All C functions declared in the BPF program are expected to be executed on a probe, hence they all need to take a ```struct pt_regs *ctx``` as first argument. If you need to define a helper function that will not be executed on a probe, it needs to be defined as ```static inline``` in order to be inlined by the compiler. Sometimes you also need to add the ```__always_inline``` function attribute to it.
|
||||
|
||||
1. ```b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")```: Creates a kprobe for the kernel clone system call function, which will execute our defined hello() function. You can call attach_kprobe() more than once, and attach your C function to multiple kernel functions.
|
||||
|
||||
1. ```b.trace_fields()```: Returns a fixed set of fields from trace_pipe. Similar to trace_print(), this is handy for hacking, but for real tooling we should switch to BPF_PERF_OUTPUT().
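
The first point mentions string substitutions based on command line arguments. A minimal sketch of that idea follows, assuming a hypothetical `FILTER` placeholder and a hypothetical `build_prog()` helper. Only the program text is built here; loading it with `BPF(text=...)` is left out so the snippet runs without bcc or root.

```python
# Raw string, so the C code keeps its own "\n" escape untouched.
PROG_TEMPLATE = r"""
int hello(void *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    FILTER
    bpf_trace_printk("Hello, World!\n");
    return 0;
}
"""


def build_prog(pid=None):
    """Return the C program text, optionally restricted to one PID."""
    if pid is None:
        filter_code = ""
    else:
        # Skip events from every process except the requested one.
        filter_code = "if (pid != %d) { return 0; }" % pid
    return PROG_TEMPLATE.replace("FILTER", filter_code)


print(build_prog(pid=1234))
```

The returned string would then be passed as `BPF(text=build_prog(pid))`, with `pid` taken from, for example, an `argparse` option.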
|
||||
|
||||
### Lesson 4. sync_timing.py
|
||||
|
||||
Remember the days of sysadmins typing ```sync``` three times on a slow console before ```reboot```, to give the first asynchronous sync time to complete? Then someone thought ```sync;sync;sync``` was clever, to run them all on one line, which became industry practice despite defeating the original purpose! And then sync became synchronous, so more reasons it was silly. Anyway.
|
||||
|
||||
The following example times how quickly the ```do_sync``` function is called, and prints output if it has been called more recently than one second ago. A ```sync;sync;sync``` will print output for the 2nd and 3rd sync's:
|
||||
|
||||
```sh
|
||||
# examples/tracing/sync_timing.py
|
||||
Tracing for quick sync's... Ctrl-C to end
|
||||
At time 0.00 s: multiple syncs detected, last 95 ms ago
|
||||
At time 0.10 s: multiple syncs detected, last 96 ms ago
|
||||
```
|
||||
|
||||
This program is [examples/tracing/sync_timing.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/sync_timing.py):
|
||||
|
||||
```Python
|
||||
from __future__ import print_function
|
||||
from bcc import BPF
|
||||
|
||||
# load BPF program
|
||||
b = BPF(text="""
|
||||
#include <uapi/linux/ptrace.h>
|
||||
|
||||
BPF_HASH(last);
|
||||
|
||||
int do_trace(struct pt_regs *ctx) {
|
||||
u64 ts, *tsp, delta, key = 0;
|
||||
|
||||
// attempt to read stored timestamp
|
||||
tsp = last.lookup(&key);
|
||||
if (tsp != NULL) {
|
||||
delta = bpf_ktime_get_ns() - *tsp;
|
||||
if (delta < 1000000000) {
|
||||
// output if time is less than 1 second
|
||||
bpf_trace_printk("%d\\n", delta / 1000000);
|
||||
}
|
||||
last.delete(&key);
|
||||
}
|
||||
|
||||
// update stored timestamp
|
||||
ts = bpf_ktime_get_ns();
|
||||
last.update(&key, &ts);
|
||||
return 0;
|
||||
}
|
||||
""")
|
||||
|
||||
b.attach_kprobe(event=b.get_syscall_fnname("sync"), fn_name="do_trace")
|
||||
print("Tracing for quick sync's... Ctrl-C to end")
|
||||
|
||||
# format output
|
||||
start = 0
|
||||
while 1:
|
||||
(task, pid, cpu, flags, ts, ms) = b.trace_fields()
|
||||
if start == 0:
|
||||
start = ts
|
||||
ts = ts - start
|
||||
print("At time %.2f s: multiple syncs detected, last %s ms ago" % (ts, ms))
|
||||
```
|
||||
|
||||
Things to learn:
|
||||
|
||||
1. ```bpf_ktime_get_ns()```: Returns the time as nanoseconds.
|
||||
1. ```BPF_HASH(last)```: Creates a BPF map object that is a hash (associative array), called "last". We didn't specify any further arguments, so it defaults to key and value types of u64.
|
||||
1. ```key = 0```: We'll only store one key/value pair in this hash, where the key is hardwired to zero.
|
||||
1. ```last.lookup(&key)```: Look up the key in the hash, and return a pointer to its value if it exists, else NULL. We pass the key in by address, as a pointer.
|
||||
1. ```if (tsp != NULL) {```: The verifier requires that pointer values derived from a map lookup must be checked for a null value before they can be dereferenced and used.
|
||||
1. ```last.delete(&key)```: Delete the key from the hash. This is currently required because of [a kernel bug in `.update()`](https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=a6ed3ea65d9868fdf9eff84e6fe4f666b8d14b02) (fixed in 4.8.10).
|
||||
1. ```last.update(&key, &ts)```: Associate the value in the 2nd argument to the key, overwriting any previous value. This records the timestamp.
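
The lookup, check, delete, and update sequence listed above can be modelled in plain Python to see which calls produce output. This is a user-space model under the assumption that a dict stands in for `BPF_HASH(last)`; it is not BPF code, and the timestamps are made up.

```python
ONE_SECOND_NS = 1000000000

last = {}  # stands in for BPF_HASH(last): u64 key -> u64 timestamp


def do_trace(now_ns):
    """Return ms since the previous call if under one second, else None."""
    key = 0
    delta_ms = None
    tsp = last.get(key)              # last.lookup(&key)
    if tsp is not None:              # the NULL check the verifier requires
        delta = now_ns - tsp
        if delta < ONE_SECOND_NS:
            delta_ms = delta // 1000000
        del last[key]                # last.delete(&key)
    last[key] = now_ns               # last.update(&key, &ts)
    return delta_ms


# Three quick syncs, then one much later (timestamps in nanoseconds).
for t in (0, 95000000, 191000000, 5000000000):
    print(t, do_trace(t))
```

Only the second and third calls return a value, which matches the statement that `sync;sync;sync` prints output for the 2nd and 3rd syncs.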
|
||||
|
||||
### Lesson 5. sync_count.py
|
||||
|
||||
Modify the sync_timing.py program (prior lesson) to store the count of all kernel sync system calls (both fast and slow), and print it with the output. This count can be recorded in the BPF program by adding a new key index to the existing hash.
|
||||
|
||||
### Lesson 6. disksnoop.py
|
||||
|
||||
Browse the [examples/tracing/disksnoop.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/disksnoop.py) program to see what is new. Here is some sample output:
|
||||
|
||||
```sh
|
||||
# disksnoop.py
|
||||
TIME(s) T BYTES LAT(ms)
|
||||
16458043.436012 W 4096 3.13
|
||||
16458043.437326 W 4096 4.44
|
||||
16458044.126545 R 4096 42.82
|
||||
16458044.129872 R 4096 3.24
|
||||
[...]
|
||||
```
|
||||
|
||||
And a code snippet:
|
||||
|
||||
```Python
|
||||
[...]
|
||||
REQ_WRITE = 1 # from include/linux/blk_types.h
|
||||
|
||||
# load BPF program
|
||||
b = BPF(text="""
|
||||
#include <uapi/linux/ptrace.h>
|
||||
#include <linux/blk-mq.h>
|
||||
|
||||
BPF_HASH(start, struct request *);
|
||||
|
||||
void trace_start(struct pt_regs *ctx, struct request *req) {
|
||||
// stash start timestamp by request ptr
|
||||
u64 ts = bpf_ktime_get_ns();
|
||||
|
||||
start.update(&req, &ts);
|
||||
}
|
||||
|
||||
void trace_completion(struct pt_regs *ctx, struct request *req) {
|
||||
u64 *tsp, delta;
|
||||
|
||||
tsp = start.lookup(&req);
|
||||
if (tsp != 0) {
|
||||
delta = bpf_ktime_get_ns() - *tsp;
|
||||
bpf_trace_printk("%d %x %d\\n", req->__data_len,
|
||||
req->cmd_flags, delta / 1000);
|
||||
start.delete(&req);
|
||||
}
|
||||
}
|
||||
""")
|
||||
if BPF.get_kprobe_functions(b'blk_start_request'):
|
||||
b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
|
||||
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
|
||||
if BPF.get_kprobe_functions(b'__blk_account_io_done'):
|
||||
b.attach_kprobe(event="__blk_account_io_done", fn_name="trace_completion")
|
||||
else:
|
||||
b.attach_kprobe(event="blk_account_io_done", fn_name="trace_completion")
|
||||
[...]
|
||||
```

Things to learn:

1. ```REQ_WRITE```: We're defining a kernel constant in the Python program because we'll use it there later. If we were using REQ_WRITE in the BPF program, it should just work (without needing to be defined) with the appropriate #includes.
1. ```trace_start(struct pt_regs *ctx, struct request *req)```: This function will later be attached to kprobes. The arguments to kprobe functions are ```struct pt_regs *ctx```, for registers and BPF context, and then the actual arguments to the function. We'll attach this to blk_start_request(), where the first argument is ```struct request *```.
1. ```start.update(&req, &ts)```: We're using the pointer to the request struct as a key in our hash. What? This is commonplace in tracing. Pointers to structs turn out to be great keys, as they are unique: two structs can't have the same pointer address. (Just be careful about when it gets freed and reused.) So what we're really doing is tagging the request struct, which describes the disk I/O, with our own timestamp, so that we can time it. There are two common keys used for storing timestamps: pointers to structs, and thread IDs (for timing function entry to return).
1. ```req->__data_len```: We're dereferencing members of ```struct request```. See its definition in the kernel source for what members are there. bcc actually rewrites these expressions to be a series of ```bpf_probe_read_kernel()``` calls. Sometimes bcc can't handle a complex dereference, and you need to call ```bpf_probe_read_kernel()``` directly.
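
The Python side that consumes these messages is elided above. As a rough sketch of how the three fields printed by the ```"%d %x %d"``` format string could be decoded, and where ```REQ_WRITE``` comes in, consider the following. The helper name ```decode_io``` and the sample message are hypothetical, and the message is assumed to arrive as a whitespace-separated string:

```python
REQ_WRITE = 1  # from include/linux/blk_types.h, as in the snippet above


def decode_io(msg):
    """Split a "<bytes> <flags-hex> <usecs>" message into typed fields.

    Hypothetical helper: the field order mirrors the "%d %x %d" format
    string used by trace_completion().
    """
    bytes_s, flags_s, us_s = msg.split()
    # the flags were printed with %x, so parse them as hexadecimal
    io_type = "W" if int(flags_s, 16) & REQ_WRITE else "R"
    latency_ms = int(us_s) / 1000.0
    return int(bytes_s), io_type, latency_ms


print(decode_io("4096 1 250"))
```

Keeping the flag test in Python is why the constant is defined there rather than in the BPF C text.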

This is a pretty interesting program, and if you can understand all the code, you'll understand many important basics. We're still using the bpf_trace_printk() hack, so let's fix that next.

### Lesson 7. hello_perf_output.py

Let's finally stop using bpf_trace_printk() and use the proper BPF_PERF_OUTPUT() interface. This will also mean we stop getting the free trace_field() members like PID and timestamp, and will need to fetch them directly. Sample output while commands are run in another session:

```sh
# hello_perf_output.py
TIME(s)            COMM             PID    MESSAGE
0.000000000        bash             22986  Hello, perf_output!
0.021080275        systemd-udevd    484    Hello, perf_output!
0.021359520        systemd-udevd    484    Hello, perf_output!
0.021590610        systemd-udevd    484    Hello, perf_output!
[...]
```

Code is [examples/tracing/hello_perf_output.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/hello_perf_output.py):

```Python
from bcc import BPF

# define BPF program
prog = """
#include <linux/sched.h>

// define output data structure in C
struct data_t {
    u32 pid;
    u64 ts;
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);

int hello(struct pt_regs *ctx) {
    struct data_t data = {};

    data.pid = bpf_get_current_pid_tgid();
    data.ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    events.perf_submit(ctx, &data, sizeof(data));

    return 0;
}
"""

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# process event
start = 0
def print_event(cpu, data, size):
    global start
    event = b["events"].event(data)
    if start == 0:
        start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d %s" % (time_s, event.comm, event.pid,
        "Hello, perf_output!"))

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    b.perf_buffer_poll()
```

Things to learn:

1. ```struct data_t```: This defines the C struct we'll use to pass data from kernel to user space.
1. ```BPF_PERF_OUTPUT(events)```: This names our output channel "events".
1. ```struct data_t data = {};```: Create an empty data_t struct that we'll then populate.
1. ```bpf_get_current_pid_tgid()```: Returns the process ID in the lower 32 bits (kernel's view of the PID, which in user space is usually presented as the thread ID), and the thread group ID in the upper 32 bits (what user space often thinks of as the PID). By directly setting this to a u32, we discard the upper 32 bits. Should you be presenting the PID or the TGID? For a multi-threaded app, the TGID will be the same, so you need the PID to differentiate them, if that's what you want. It's also a question of expectations for the end user.
1. ```bpf_get_current_comm()```: Populates the first argument address with the current process name.
1. ```events.perf_submit()```: Submit the event for user space to read via a perf ring buffer.
1. ```def print_event()```: Define a Python function that will handle reading events from the ```events``` stream.
1. ```b["events"].event(data)```: Now get the event as a Python object, auto-generated from the C declaration.
1. ```b["events"].open_perf_buffer(print_event)```: Associate the Python ```print_event``` function with the ```events``` stream.
1. ```while 1: b.perf_buffer_poll()```: Block waiting for events.
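
The bit layout described for ```bpf_get_current_pid_tgid()``` can be illustrated in plain Python. This is only a sketch of the arithmetic, not bcc code; the function name ```split_pid_tgid``` and the sample IDs are made up for the example:

```python
def split_pid_tgid(pid_tgid):
    """Split the 64-bit value into (tgid, pid) as the text describes."""
    pid = pid_tgid & 0xFFFFFFFF   # lower 32 bits: kernel PID (thread ID)
    tgid = pid_tgid >> 32         # upper 32 bits: thread group ID
    return tgid, pid


# hypothetical values: thread 22990 belonging to process 22986
value = (22986 << 32) | 22990
print(split_pid_tgid(value))
```

Assigning the 64-bit value to a ```u32```, as the lesson's C code does, keeps only the lower half, which is the same as the masking step here.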

### Lesson 8. sync_perf_output.py

Rewrite sync_timing.py, from a prior lesson, to use ```BPF_PERF_OUTPUT```.

### Lesson 9. bitehist.py

The following tool records a histogram of disk I/O sizes. Sample output:

```sh
# bitehist.py
Tracing... Hit Ctrl-C to end.
^C
     kbytes          : count     distribution
       0 -> 1        : 3        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 211      |**********                            |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 0        |                                      |
      64 -> 127      : 1        |                                      |
     128 -> 255      : 800      |**************************************|
```

Code is [examples/tracing/bitehist.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/bitehist.py):

```Python
from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);

int kprobe__blk_account_io_done(struct pt_regs *ctx, struct request *req)
{
    dist.increment(bpf_log2l(req->__data_len / 1024));
    return 0;
}
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    print()

# output
b["dist"].print_log2_hist("kbytes")
```

A recap from earlier lessons:

- ```kprobe__```: This prefix means the rest will be treated as a kernel function name that will be instrumented using kprobe.
- ```struct pt_regs *ctx, struct request *req```: Arguments to kprobe. The ```ctx``` is registers and BPF context, the ```req``` is the first argument to the instrumented function: ```blk_account_io_done()```.
- ```req->__data_len```: Dereferencing that member.

New things to learn:

1. ```BPF_HISTOGRAM(dist)```: Defines a BPF map object that is a histogram, and names it "dist".
1. ```dist.increment()```: Increments the histogram bucket whose index is given as the first argument, by one by default. Optionally, a custom increment can be passed as the second argument.
1. ```bpf_log2l()```: Returns the log-2 of the provided value. This becomes the index of our histogram, so that we're constructing a power-of-2 histogram.
1. ```b["dist"].print_log2_hist("kbytes")```: Prints the "dist" histogram as power-of-2, with a column header of "kbytes". The only data transferred from kernel to user space is the bucket counts, making this efficient.
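
To make the power-of-2 bucketing concrete, here is a small pure-Python sketch. It is an illustration only: the helper names and sample sizes are invented, and bcc's internal slot numbering may differ from the zero-based index used here.

```python
from collections import Counter


def log2_bucket(value):
    """Index of the power-of-2 bucket holding value (0 and 1 share bucket 0)."""
    return max(value, 1).bit_length() - 1


def bucket_range(index):
    """Inclusive (low, high) range covered by a bucket index."""
    if index == 0:
        return 0, 1
    return 1 << index, (1 << (index + 1)) - 1


# hypothetical I/O sizes in Kbytes
sizes_kb = [0, 4, 4, 6, 128, 200, 255]
dist = Counter(log2_bucket(s) for s in sizes_kb)
for index in sorted(dist):
    low, high = bucket_range(index)
    print("%6d -> %-6d : %d" % (low, high, dist[index]))
```

Only one counter per bucket has to be stored, which is why the kernel-to-user transfer stays small no matter how many I/Os are traced.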

### Lesson 10. disklatency.py

Write a program that times disk I/O, and prints a histogram of its latency. Disk I/O instrumentation and timing can be found in the disksnoop.py program from a prior lesson, and histogram code can be found in bitehist.py from a prior lesson.

### Lesson 11. vfsreadlat.py

This example is split into separate Python and C files. Example output:

```sh
# vfsreadlat.py 1
Tracing... Hit Ctrl-C to end.
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 2        |***********                             |
         4 -> 7          : 7        |****************************************|
         8 -> 15         : 4        |**********************                  |

     usecs               : count     distribution
         0 -> 1          : 29       |****************************************|
         2 -> 3          : 28       |**************************************  |
         4 -> 7          : 4        |*****                                   |
         8 -> 15         : 8        |***********                             |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 2        |**                                      |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 4        |*****                                   |
      8192 -> 16383      : 6        |********                                |
     16384 -> 32767      : 9        |************                            |
     32768 -> 65535      : 6        |********                                |
     65536 -> 131071     : 2        |**                                      |

     usecs               : count     distribution
         0 -> 1          : 11       |****************************************|
         2 -> 3          : 2        |*******                                 |
         4 -> 7          : 10       |************************************    |
         8 -> 15         : 8        |*****************************           |
        16 -> 31         : 1        |***                                     |
        32 -> 63         : 2        |*******                                 |
[...]
```

Browse the code in [examples/tracing/vfsreadlat.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/vfsreadlat.py) and [examples/tracing/vfsreadlat.c](https://github.com/iovisor/bcc/tree/master/examples/tracing/vfsreadlat.c). Things to learn:

1. ```b = BPF(src_file = "vfsreadlat.c")```: Read the BPF C program from a separate source file.
1. ```b.attach_kretprobe(event="vfs_read", fn_name="do_return")```: Attaches the BPF C function ```do_return()``` to the return of the kernel function ```vfs_read()```. This is a kretprobe: instrumenting the return from a function, rather than its entry.
1. ```b["dist"].clear()```: Clears the histogram.
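
The linked C file is not reproduced here, so the following is not its code. It is a plain-Python sketch of the general entry/return timing pattern mentioned in Lesson 6 (timestamps keyed by thread ID), with hypothetical function names and event values standing in for the probes and the BPF hash:

```python
start = {}      # thread ID -> entry timestamp (ns); stands in for a BPF hash
latencies = []  # completed latencies in microseconds


def on_entry(tid, now_ns):
    """What an entry probe would do: stash the timestamp by thread ID."""
    start[tid] = now_ns


def on_return(tid, now_ns):
    """What a return probe would do: look up, compute the delta, delete."""
    ts = start.pop(tid, None)
    if ts is None:
        return  # missed the entry (for example, tracing started mid-call)
    latencies.append((now_ns - ts) // 1000)


# hypothetical interleaved events from two threads
on_entry(101, 1000)
on_entry(102, 2000)
on_return(101, 9000)
on_return(102, 4000)
on_return(103, 5000)  # no matching entry: ignored
print(latencies)
```

A thread can only be inside one call to the traced function at a time, which is what makes the thread ID a safe key for pairing an entry with its return.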

### Lesson 12. urandomread.py

Tracing while a ```dd if=/dev/urandom of=/dev/null bs=8k count=5``` is run:

```sh
# urandomread.py
TIME(s)            COMM             PID    GOTBITS
24652832.956994001 smtp             24690  384
24652837.726500999 dd               24692  65536
24652837.727111001 dd               24692  65536
24652837.727703001 dd               24692  65536
24652837.728294998 dd               24692  65536
24652837.728888001 dd               24692  65536
```

Hah! I caught smtp by accident. Code is [examples/tracing/urandomread.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/urandomread.py):

```Python
from __future__ import print_function
from bcc import BPF

# load BPF program
b = BPF(text="""
TRACEPOINT_PROBE(random, urandom_read) {
    // args is from /sys/kernel/debug/tracing/events/random/urandom_read/format
    bpf_trace_printk("%d\\n", args->got_bits);
    return 0;
}
""")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "GOTBITS"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
```

Things to learn:

1. ```TRACEPOINT_PROBE(random, urandom_read)```: Instrument the kernel tracepoint ```random:urandom_read```. Tracepoints have a stable API, and are therefore recommended over kprobes wherever possible. You can run ```perf list``` for a list of tracepoints. Linux >= 4.7 is required to attach BPF programs to tracepoints.
1. ```args->got_bits```: ```args``` is auto-populated to be a structure of the tracepoint arguments. The comment above says where you can see that structure. For example:

```sh
# cat /sys/kernel/debug/tracing/events/random/urandom_read/format
name: urandom_read
ID: 972
format:
    field:unsigned short common_type; offset:0; size:2; signed:0;
    field:unsigned char common_flags; offset:2; size:1; signed:0;
    field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
    field:int common_pid; offset:4; size:4; signed:1;

    field:int got_bits; offset:8; size:4; signed:1;
    field:int pool_left; offset:12; size:4; signed:1;
    field:int input_left; offset:16; size:4; signed:1;

print fmt: "got_bits %d nonblocking_pool_entropy_left %d input_entropy_left %d", REC->got_bits, REC->pool_left, REC->input_left
```

In this case, we were printing the ```got_bits``` member.
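
If you want to see which members ```args``` will expose without reading the listing by eye, the format text is easy to parse. The sketch below is a hypothetical helper, not part of bcc; it works on an excerpt of the listing above embedded as a string, so it runs without access to debugfs:

```python
import re

# excerpt of the format listing shown above
FORMAT_TEXT = """\
field:int common_pid; offset:4; size:4; signed:1;
field:int got_bits; offset:8; size:4; signed:1;
field:int pool_left; offset:12; size:4; signed:1;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
)


def parse_fields(text):
    """Return {field name: (C type, offset, size)} from a format listing."""
    fields = {}
    for match in FIELD_RE.finditer(text):
        # the declaration is "<type words> <name>"; split off the last word
        ctype, name = match.group("decl").rsplit(None, 1)
        fields[name] = (ctype, int(match.group("offset")), int(match.group("size")))
    return fields


print(parse_fields(FORMAT_TEXT)["got_bits"])
```

On a real system the same function could be fed the contents of the format file instead of the embedded string.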

### Lesson 13. disksnoop.py fixed

Convert disksnoop.py from a previous lesson to use the ```block:block_rq_issue``` and ```block:block_rq_complete``` tracepoints.

### Lesson 14. strlen_count.py

This program instruments a user-level function, the ```strlen()``` library function, and frequency counts its string argument. Example output:

```sh
# strlen_count.py
Tracing strlen()... Hit Ctrl-C to end.
^C     COUNT STRING
         1 " "
         1 "/bin/ls"
         1 "."
         1 "cpudist.py.1"
         1 ".bashrc"
         1 "ls --color=auto"
         1 "key_t"
[...]
        10 "a7:~# "
        10 "/root"
        12 "LC_ALL"
        12 "en_US.UTF-8"
        13 "en_US.UTF-8"
        20 "~"
        70 "#%^,~:-=?+/}"
       340 "\x01\x1b]0;root@bgregg-test: ~\x07\x02root@bgregg-test:~# "
```

These are various strings that are being processed by this library function while tracing, along with their frequency counts. ```strlen()``` was called on "LC_ALL" 12 times, for example.

Code is [examples/tracing/strlen_count.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/strlen_count.py):

```Python
from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>

struct key_t {
    char c[80];
};
BPF_HASH(counts, struct key_t);

int count(struct pt_regs *ctx) {
    if (!PT_REGS_PARM1(ctx))
        return 0;

    struct key_t key = {};
    u64 zero = 0, *val;

    bpf_probe_read_user(&key.c, sizeof(key.c), (void *)PT_REGS_PARM1(ctx));
    // could also use `counts.increment(key)`
    val = counts.lookup_or_try_init(&key, &zero);
    if (val) {
        (*val)++;
    }
    return 0;
};
""")
b.attach_uprobe(name="c", sym="strlen", fn_name="count")

# header
print("Tracing strlen()... Hit Ctrl-C to end.")

# sleep until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass

# print output
print("%10s %s" % ("COUNT", "STRING"))
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
    # 'string-escape' exists only on Python 2; this escaping also works on Python 3
    escaped = k.c.decode('latin-1').encode('unicode_escape').decode('ascii')
    print("%10d \"%s\"" % (v.value, escaped))
```

Things to learn:

1. ```PT_REGS_PARM1(ctx)```: This fetches the first argument to ```strlen()```, which is the string.
1. ```b.attach_uprobe(name="c", sym="strlen", fn_name="count")```: Attach to library "c" (if this is the main program, use its pathname), instrument the user-level function ```strlen()```, and on execution call our C function ```count()```.
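
The script's final loop sorts the table by count and prints each traced string with non-printable bytes shown as escapes. Here is a pure-Python sketch of that step, using an ordinary dict with made-up entries in place of the BPF table; the helper ```escape_bytes``` is hypothetical and shows one way to do the escaping on Python 3:

```python
def escape_bytes(raw):
    """Render a byte string with non-printable bytes as backslash escapes."""
    return raw.decode("latin-1").encode("unicode_escape").decode("ascii")


# hypothetical counts keyed by the traced strings
counts = {b"LC_ALL": 12, b"/root": 10, b"\x01\x1b]0;root\x07": 340}
for key, value in sorted(counts.items(), key=lambda kv: kv[1]):
    print('%10d "%s"' % (value, escape_bytes(key)))
```

Decoding as Latin-1 first maps every byte to a character one-to-one, so the escaping step never fails on arbitrary data read from the traced process.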

### Lesson 15. nodejs_http_server.py

This program instruments a user statically-defined tracing (USDT) probe, which is the user-level version of a kernel tracepoint. Sample output:

```sh
# nodejs_http_server.py 24728
TIME(s)            COMM             PID    ARGS
24653324.561322998 node             24728  path:/index.html
24653335.343401998 node             24728  path:/images/welcome.png
24653340.510164998 node             24728  path:/images/favicon.png
```

Relevant code from [examples/tracing/nodejs_http_server.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/nodejs_http_server.py):

```Python
from __future__ import print_function
from bcc import BPF, USDT
import sys

if len(sys.argv) < 2:
    print("USAGE: nodejs_http_server PID")
    exit()
pid = sys.argv[1]
debug = 0

# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int do_trace(struct pt_regs *ctx) {
    uint64_t addr;
    char path[128]={0};
    bpf_usdt_readarg(6, ctx, &addr);
    bpf_probe_read_user(&path, sizeof(path), (void *)addr);
    bpf_trace_printk("path:%s\\n", path);
    return 0;
};
"""

# enable USDT probe from given PID
u = USDT(pid=int(pid))
u.enable_probe(probe="http__server__request", fn_name="do_trace")
if debug:
    print(u.get_text())
    print(bpf_text)

# initialize BPF
b = BPF(text=bpf_text, usdt_contexts=[u])
```

Things to learn:

1. ```bpf_usdt_readarg(6, ctx, &addr)```: Read the address of argument 6 from the USDT probe into ```addr```.
1. ```bpf_probe_read_user(&path, sizeof(path), (void *)addr)```: Now read the string that ```addr``` points to into our ```path``` variable.
1. ```u = USDT(pid=int(pid))```: Initialize USDT tracing for the given PID.
1. ```u.enable_probe(probe="http__server__request", fn_name="do_trace")```: Attach our ```do_trace()``` BPF C function to the Node.js ```http__server__request``` USDT probe.
1. ```b = BPF(text=bpf_text, usdt_contexts=[u])```: Need to pass in our USDT object, ```u```, to BPF object creation.

### Lesson 16. task_switch.c

This is an older tutorial included as a bonus lesson. Use this for recap and to reinforce what you've already learned.

This is a slightly more complex tracing example than Hello World. This program
will be invoked for every task change in the kernel, and records the new and
old PIDs in a BPF map.

The C program below introduces a new concept: the prev argument. This
argument is treated specially by the BCC frontend, such that accesses
to this variable are read from the saved context that is passed by the
kprobe infrastructure. The prototype of the args starting from
position 1 should match the prototype of the kernel function being
kprobed. If done so, the program will have seamless access to the
function parameters.

```c
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct key_t {
    u32 prev_pid;
    u32 curr_pid;
};

BPF_HASH(stats, struct key_t, u64, 1024);
int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
    struct key_t key = {};
    u64 zero = 0, *val;

    key.curr_pid = bpf_get_current_pid_tgid();
    key.prev_pid = prev->pid;

    // could also use `stats.increment(key);`
    val = stats.lookup_or_try_init(&key, &zero);
    if (val) {
        (*val)++;
    }
    return 0;
}
```

The userspace component loads the file shown above, and attaches it to the
`finish_task_switch` kernel function.
The `[]` operator of the BPF object gives access to each BPF_HASH in the
program, allowing pass-through access to the values residing in the kernel. Use
the object as you would any other Python dict object: reads, updates, and deletes
are all allowed.
```python
from bcc import BPF
from time import sleep

b = BPF(src_file="task_switch.c")
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")

# generate many schedule events
for i in range(0, 100): sleep(0.01)

for k, v in b["stats"].items():
    print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value))
```

These programs can be found in the files [examples/tracing/task_switch.c](https://github.com/iovisor/bcc/tree/master/examples/tracing/task_switch.c) and [examples/tracing/task_switch.py](https://github.com/iovisor/bcc/tree/master/examples/tracing/task_switch.py) respectively.

### Lesson 17. Further Study

For further study, see Sasha Goldshtein's [linux-tracing-workshop](https://github.com/goldshtn/linux-tracing-workshop), which contains additional labs. There are also many tools in bcc /tools to study.

Please read [CONTRIBUTING-SCRIPTS.md](https://github.com/iovisor/bcc/tree/master/CONTRIBUTING-SCRIPTS.md) if you wish to contribute tools to bcc. At the bottom of the main [README.md](https://github.com/iovisor/bcc/tree/master/README.md), you'll also find methods for contacting us. Good luck, and happy tracing!

## Networking

To do.