init with documents from eunomia-bpf

yunwei37
2022-12-02 19:18:03 +08:00
parent 1179ec171e
commit 81d749a9cc
85 changed files with 11876 additions and 0 deletions

161
0-introduce/introduce.md Normal file

@@ -0,0 +1,161 @@
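# eBPF Beginner's Development Practice Guide 1: Introduction and Quick Start
<!-- TOC -->
- [1. What is eBPF](#1-what-is-ebpf)
  - [1.1. Origins](#11-origins)
  - [1.2. Execution logic](#12-execution-logic)
  - [1.3. Architecture](#13-architecture)
    - [1.3.1. Register design](#131-register-design)
    - [1.3.2. Instruction encoding format](#132-instruction-encoding-format)
  - [1.4. References for this section](#14-references-for-this-section)
- [2. How to program with eBPF](#2-how-to-program-with-ebpf)
  - [2.1. BCC](#21-bcc)
  - [2.2. libbpf-bootstrap](#22-libbpf-bootstrap)
  - [2.3. eunomia-bpf](#23-eunomia-bpf)
<!-- /TOC -->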
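## 1. What is eBPF
The Linux kernel has always been an ideal place to implement monitoring/observability, networking, and security functionality, but working directly inside the kernel is not easy. In traditional Linux software development, implementing these features usually meant modifying the kernel source or loading a kernel module. Modifying kernel source is risky: a small mistake can crash the system, and every change has to be verified by recompiling the kernel, which costs time and effort. Loading a kernel module is more flexible because it does not require recompiling the kernel, but a faulty module can still crash the kernel, and modules must be updated as kernel versions change or they stop working.

eBPF emerged against this background. It is a revolutionary technology that runs sandboxed programs inside the kernel without modifying kernel source or loading kernel modules. Users can use the interfaces it provides to trace and monitor the system from within the kernel.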
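### 1.1. Origins
eBPF grew out of BPF (Berkeley Packet Filter). BPF was proposed in 1992 by Steven McCanne and Van Jacobson in their [paper](https://www.tcpdump.org/papers/bpf-usenix93.pdf). Their goal was to provide a new way of filtering packets; the model is shown in the figure below.

![](../imgs/original_bpf.png)

Compared with other filtering methods, BPF had two major innovations. First, it used a new virtual machine designed to work efficiently on register-based CPUs. Second, it did not copy the whole packet; only the data relevant to the filter was copied, which improved efficiency. These two innovations made BPF very successful in practice. After being ported to Linux, it was used by applications such as `libpcap` and `tcpdump`.

Classic BPF is a 32-bit architecture. Its instruction encoding is:
- 16 bit: opcode
- 8 bit: jump offset taken when the test is true
- 8 bit: jump offset taken when the test is false
- 32 bit: a generic field (`k`) whose meaning depends on the instruction

Roughly two decades later, in 2013, Alexei Starovoitov thoroughly reworked BPF. The reworked BPF was named eBPF (extended BPF) and was merged into the Linux kernel source in version 3.15.

eBPF is a major change from BPF. First, it supports many more application areas: in addition to filtering network packets, it can trace events through existing Linux mechanisms such as `kprobe`, `tracepoint`, and `lsm`. It is also more flexible and convenient to use. In addition, its JIT compiler was upgraded and the interpreter was replaced, which lets eBPF programs approach the native execution performance of the platform.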
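### 1.2. Execution logic
eBPF's execution model resembles that of BPF. eBPF can be thought of as a tiny register-based "virtual machine" with a custom 64-bit RISC instruction set. It runs natively compiled eBPF programs inside the Linux kernel in a safe and controlled way, with access to a subset of kernel functions and memory.

After writing a program, we compile it with LLVM into an ELF file that uses the BPF instruction set, extract the parts that need to be injected, and call a loading function to inject them into the kernel. The user-space program and the bytecode injected into the kernel share an eBPF map that lives in the kernel and use it to exchange data. To keep the injected program from harming the kernel, the eBPF verifier strictly checks the compiled bytecode before it is loaded.

eBPF programs are event-driven, so the attachment point must be chosen in advance. Once the compiled program is injected into the kernel, it is triggered whenever that attachment point is hit and handles the event as defined.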
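### 1.3. Architecture
#### 1.3.1. Register design
eBPF has 11 registers, R0 through R10. Each register is 64 bits wide and has a corresponding 32-bit sub-register. Instructions have a fixed width of 64 bits.
#### 1.3.2. Instruction encoding format
The eBPF instruction encoding is:
- 8 bit: opcode
- 4 bit: destination register number
- 4 bit: source register number
- 16 bit: offset, whose meaning depends on the instruction type
- 32 bit: immediate value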
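
The field layout above can be sketched in plain user-space C. This is only an illustration, not kernel code: the struct `ebpf_insn_fields` and the helpers `ebpf_encode` and `ebpf_decode` are hypothetical names, and the packing assumes a little-endian view in which the destination register occupies the low nibble of the second byte.

```c
#include <stdint.h>

/* Hypothetical container for the five fields listed above. */
struct ebpf_insn_fields {
    uint8_t opcode;   /* 8 bit: operation code */
    uint8_t dst_reg;  /* 4 bit: destination register number */
    uint8_t src_reg;  /* 4 bit: source register number */
    int16_t off;      /* 16 bit: offset */
    int32_t imm;      /* 32 bit: immediate value */
};

/* Pack the fields into one 64-bit instruction word (assumed little-endian view). */
static uint64_t ebpf_encode(struct ebpf_insn_fields f)
{
    uint64_t word = 0;

    word |= (uint64_t)f.opcode;
    word |= (uint64_t)(f.dst_reg & 0x0f) << 8;
    word |= (uint64_t)(f.src_reg & 0x0f) << 12;
    word |= (uint64_t)(uint16_t)f.off << 16;
    word |= (uint64_t)(uint32_t)f.imm << 32;
    return word;
}

/* Unpack a 64-bit instruction word back into its fields. */
static struct ebpf_insn_fields ebpf_decode(uint64_t word)
{
    struct ebpf_insn_fields f;

    f.opcode  = (uint8_t)(word & 0xff);
    f.dst_reg = (uint8_t)((word >> 8) & 0x0f);
    f.src_reg = (uint8_t)((word >> 12) & 0x0f);
    f.off     = (int16_t)(uint16_t)(word >> 16);
    f.imm     = (int32_t)(uint32_t)(word >> 32);
    return f;
}
```

For example, opcode `0xb7` (a 64-bit move of an immediate) with destination register 1 and immediate 42 packs to `0x0000002a000001b7` under these assumptions.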
### 1.4. References for this section
- [A thorough introduction to eBPF](https://lwn.net/Articles/740157/)
- [An eBPF overview, part 1: Introduction](https://www.collabora.com/news-and-blog/blog/2019/04/05/an-ebpf-overview-part-1-introduction/)
- [An eBPF overview, part 2: Machine and bytecode](https://www.collabora.com/news-and-blog/blog/2019/04/15/an-ebpf-overview-part-2-machine-and-bytecode/)
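## 2. How to program with eBPF
Writing raw eBPF programs is tedious and difficult. To improve this, LLVM gained in 2015 the ability to compile code written in a high-level language into eBPF bytecode, and raw system calls such as `bpf()` were given an initial wrapper in the form of the `libbpf` library. These libraries contain functions for loading bytecode into the kernel along with other key functions. The `samples/bpf/` directory of the Linux source tree contains many `libbpf`-based eBPF samples provided by Linux.

A typical `libbpf`-based eBPF program consists of two files, `*_kern.c` and `*_user.c`. `*_kern.c` contains the kernel attachment points and handler functions, while `*_user.c` contains the user-space code that injects the kernel code and handles interaction with the user. For a more detailed tutorial, see [this video](https://www.bilibili.com/video/BV1f54y1h74r?spm_id_from=333.999.0.0).

However, this approach is still hard to understand and has a steep learning curve, so most eBPF development today relies on tools such as:
- BCC
- bpftrace
- libbpf-bootstrap

There are also newer tools. For example, `eunomia-bpf` runs CO-RE eBPF functionality as a service and includes a toolchain and a runtime. Its main features include:
- No need to write a user-space framework for each eBPF tool: in most cases writing the kernel-side program is enough to load and run the eBPF program correctly; the kernel-side code is fully compatible with libbpf, so migration is easy;
- An async Rust-based Prometheus or OpenTelemetry custom observability data collector that typically uses less than 1% of resources; writing kernel-side code and a YAML configuration file is enough to visualize eBPF data, and once compiled it can be deployed to other machines directly through API requests;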
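### 2.1. BCC
BCC stands for BPF Compiler Collection. The project is a Python library that includes a complete toolchain for writing, compiling, and loading BPF programs, as well as tools for debugging and diagnosing performance problems.

Since its release in 2015, BCC has been steadily improved by hundreds of contributors and now ships a large number of ready-to-use tracing tools. [Its official repository](https://github.com/iovisor/bcc/blob/master/docs/tutorial.md) provides an approachable tutorial that lets users get started with BCC quickly.

With BCC, users can program in high-level languages such as Python and Lua. Compared with programming directly in C, these languages are far more convenient: users only write the in-kernel BPF program in C, and BCC handles the rest, including compiling, parsing, and loading.

BCC has one drawback, however: its compatibility is poor. A BCC-based eBPF program is compiled every time it runs, and compilation requires the user to set up the matching headers and their implementations. As many readers will know from experience, build dependencies are a thorny problem. For this reason, this project dropped BCC in favor of libbpf-bootstrap, which supports compiling once and running many times.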
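### 2.2. libbpf-bootstrap
`libbpf-bootstrap` is a BPF development scaffold built on the `libbpf` library. Its source is available on [GitHub](https://github.com/libbpf/libbpf-bootstrap).

`libbpf-bootstrap` distills years of practice from the BPF community and gives developers a modern, convenient workflow that achieves the goal of compiling once and running repeatedly.

BPF programs based on `libbpf-bootstrap` follow a naming convention for source files: the file that produces the kernel-side bytecode ends in `.bpf.c`, the user-space file that loads the bytecode ends in `.c`, and the two files must share the same prefix.

At build time, the `*.bpf.c` file is first compiled into the corresponding `.o` file, and a skeleton file, `*.skel.h`, is then generated from it. The skeleton contains some of the data structures defined on the kernel side and the key functions for loading the kernel-side code. After the user-space code includes this file, calling the corresponding load functions loads the bytecode into the kernel. `libbpf-bootstrap` also has a thorough introductory tutorial; see [this post](https://nakryiko.com/posts/libbpf-bootstrap/) for a detailed walkthrough.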
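### 2.3. eunomia-bpf
Developing, building, and distributing eBPF programs has long had a high barrier to entry. Tools such as BCC and bpftrace offer high development efficiency and good portability, but deployment requires installing a build environment such as LLVM and Clang, and each run performs a local or remote compilation that consumes considerable resources. Using native CO-RE libbpf requires writing a fair amount of user-space loading code so that the eBPF program loads correctly and reported data can be read from the kernel, and there is no good solution for distributing and managing eBPF programs.

[eunomia-bpf](https://github.com/eunomia-bpf/eunomia-bpf) is an open-source eBPF dynamic loading runtime and development toolchain. It is a lightweight CO-RE development framework based on libbpf, designed to simplify developing, building, distributing, and running eBPF programs.

With eunomia-bpf, you can:
- write only kernel-side code when building an eBPF program or tool, with the data exported by the kernel side obtained automatically;
- develop the user-space interaction program in WASM, controlling the loading and execution of the whole eBPF program and processing its data inside the WASM virtual machine;
- package precompiled eBPF programs as generic JSON or WASM modules and distribute them across architectures and kernel versions, loading them dynamically without recompilation.

eunomia-bpf consists of a compilation toolchain and a runtime library. Compared with traditional frameworks such as BCC and native libbpf, it greatly simplifies the eBPF development workflow: in most cases, writing kernel-side code is enough to build, package, and publish a complete eBPF application, while the kernel-side eBPF code stays 100% compatible with mainstream frameworks such as libbpf, libbpfgo, and libbpf-rs. When user-space code is needed, WebAssembly makes it possible to write it in multiple languages. Compared with scripting tools such as bpftrace, eunomia-bpf keeps similar convenience but is not limited to tracing and can be used in more scenarios, such as networking and security.

> - eunomia-bpf project on GitHub: <https://github.com/eunomia-bpf/eunomia-bpf>
> - Gitee mirror: <https://gitee.com/anolis/eunomia>
## References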

6
1-helloworld/.gitignore vendored Normal file

@@ -0,0 +1,6 @@
.vscode
package.json
*.o
*.skel.json
*.skel.yaml
package.yaml

57
1-helloworld/README.md Normal file

@@ -0,0 +1,57 @@
---
layout: post
title: minimal
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, tracepoint, example, syscall]
summary: a minimal example of a BPF application that installs a tracepoint handler triggered by the write syscall
---
`minimal` is just that: a minimal practical BPF application example. It
doesn't use or require BPF CO-RE, so it should run on quite old kernels. It
installs a tracepoint handler that is triggered on each `write` syscall
(`tp/syscalls/sys_enter_write`). It uses the `bpf_printk()` BPF helper to
communicate with the world.
```console
$ sudo ecli examples/bpftools/minimal/package.json
Runing eBPF program...
```
To see its output,
read the `/sys/kernel/debug/tracing/trace_pipe` file as root:
```shell
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
<...>-3840345 [010] d... 3220701.101143: bpf_trace_printk: BPF triggered from PID 3840345.
<...>-3840345 [010] d... 3220702.101265: bpf_trace_printk: BPF triggered from PID 3840345.
```
`minimal` is great as a bare-bones experimental playground to quickly try out
new ideas or BPF features.
## Compile and Run
Compile:
```console
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
or compile with `ecc`:
```console
$ ecc minimal.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
Run:
```console
sudo ecli ./package.json
```


@@ -0,0 +1,21 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#define BPF_NO_GLOBAL_DATA
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
typedef unsigned int u32;
typedef int pid_t;
const pid_t pid_filter = 0;
char LICENSE[] SEC("license") = "Dual BSD/GPL";
SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
    pid_t pid = bpf_get_current_pid_tgid() >> 32;

    if (pid_filter && pid != pid_filter)
        return 0;
    bpf_printk("BPF triggered from PID %d.\n", pid);
    return 0;
}

6
10-lsm-connect/.gitignore vendored Normal file

@@ -0,0 +1,6 @@
.vscode
package.json
*.o
*.skel.json
*.skel.yaml
package.yaml

34
10-lsm-connect/README.md Normal file

@@ -0,0 +1,34 @@
---
layout: post
title: lsm-connect
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, examples, lsm, no-output]
summary: BPF LSM program (on the socket_connect hook) that blocks any connection to 1.1.1.1. Found in demo-cloud-native-ebpf-day
---
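
This example blocks connections to 1.1.1.1, which the BPF program compares against the integer constant 16843009. As a rough user-space sketch (the helper name `ipv4_to_be32` is hypothetical and not part of this example), a dotted-quad address can be converted into the value held in `sin_addr.s_addr` as shown below. Note that `s_addr` is in network byte order; 1.1.1.1 reads the same in either byte order, which is why the constant works unchanged, while most other addresses would need a byte-order conversion.

```c
#include <arpa/inet.h>
#include <stdint.h>

/*
 * Convert dotted-quad text into the network-byte-order value found in
 * sockaddr_in.sin_addr.s_addr. Returns 0 on success, -1 if parsing fails.
 */
static int ipv4_to_be32(const char *text, uint32_t *out)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, text, &addr) != 1)
        return -1;
    *out = addr.s_addr;
    return 0;
}
```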
## Compile and Run
Compile:
```console
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
or compile with `ecc`:
```console
$ ecc lsm-connect.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
Run:
```console
sudo ecli examples/bpftools/lsm-connect/package.json
```
## reference
<https://github.com/leodido/demo-cloud-native-ebpf-day>


@@ -0,0 +1,41 @@
#include "vmlinux.h"
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
char LICENSE[] SEC("license") = "GPL";
#define EPERM 1
#define AF_INET 2
const __u32 blockme = 16843009; // 1.1.1.1 -> int
SEC("lsm/socket_connect")
int BPF_PROG(restrict_connect, struct socket *sock, struct sockaddr *address, int addrlen, int ret)
{
    // Satisfying "cannot override a denial" rule
    if (ret != 0)
    {
        return ret;
    }

    // Only IPv4 in this example
    if (address->sa_family != AF_INET)
    {
        return 0;
    }

    // Cast the address to an IPv4 socket address
    struct sockaddr_in *addr = (struct sockaddr_in *)address;

    // Where do you want to go?
    __u32 dest = addr->sin_addr.s_addr;
    bpf_printk("lsm: found connect to %d", dest);

    if (dest == blockme)
    {
        bpf_printk("lsm: blocking %d", dest);
        return -EPERM;
    }
    return 0;
}

10
11-tc/.gitignore vendored Executable file

@@ -0,0 +1,10 @@
.vscode
package.json
*.wasm
ewasm-skel.h
ecli
ewasm
*.o
*.skel.json
*.skel.yaml
package.yaml

56
11-tc/README.md Normal file

@@ -0,0 +1,56 @@
---
layout: post
title: tc
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, tc, example]
summary: a minimal example of a BPF application that uses tc
---
`tc` (short for Traffic Control) is an example of handling ingress network traffic.
It creates a qdisc on the `lo` interface and attaches the `tc_ingress` BPF program to it.
It reports the metadata of the IP packets that come into the `lo` interface.
```shell
$ sudo ecli ./package.json
...
Successfully started! Please run `sudo cat /sys/kernel/debug/tracing/trace_pipe` to see output of the BPF program.
......
```
The `tc` output in `/sys/kernel/debug/tracing/trace_pipe` should look
something like this:
```
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
node-1254811 [007] ..s1 8737831.671074: 0: Got IP packet: tot_len: 79, ttl: 64
sshd-1254728 [006] ..s1 8737831.674334: 0: Got IP packet: tot_len: 79, ttl: 64
sshd-1254728 [006] ..s1 8737831.674349: 0: Got IP packet: tot_len: 72, ttl: 64
node-1254811 [007] ..s1 8737831.674550: 0: Got IP packet: tot_len: 71, ttl: 64
```
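
The `tot_len` and `ttl` values in this output come from the IPv4 header that follows the Ethernet header. A minimal user-space sketch of the same bounds-checked parsing is shown below. `parse_ipv4_meta` is a hypothetical helper that works on a raw byte buffer, and it assumes an untagged Ethernet II frame (no VLAN tag).

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN     14      /* destination MAC + source MAC + EtherType */
#define IPV4_HDR_MIN    20      /* IPv4 header without options */
#define ETHERTYPE_IPV4  0x0800

/*
 * Extract the total length and TTL of an IPv4 packet from a raw frame.
 * Returns 0 on success, -1 if the frame is too short or is not IPv4.
 */
static int parse_ipv4_meta(const uint8_t *data, size_t len,
                           uint16_t *tot_len, uint8_t *ttl)
{
    const uint8_t *ip;
    uint16_t ethertype;

    /* Check bounds before touching any header byte, in the same spirit as
     * the data_end comparisons a BPF program must perform. */
    if (len < ETH_HDR_LEN + IPV4_HDR_MIN)
        return -1;

    ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != ETHERTYPE_IPV4)
        return -1;

    ip = data + ETH_HDR_LEN;
    *tot_len = (uint16_t)((ip[2] << 8) | ip[3]);    /* big-endian on the wire */
    *ttl = ip[8];
    return 0;
}
```

Reading the two length bytes explicitly plays the same role as `bpf_ntohs()` in the BPF program: multi-byte header fields are stored in network byte order.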
## Compile and Run
Compile:
```console
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
or compile with `ecc`:
```console
$ ecc tc.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
Run:
```console
sudo ecli ./package.json
```

36
11-tc/tc.bpf.c Normal file

@@ -0,0 +1,36 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
/* Copyright (c) 2022 Hengqi Chen */
#include <vmlinux.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#define TC_ACT_OK 0
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
/// @tchook {"ifindex":1, "attach_point":"BPF_TC_INGRESS"}
/// @tcopts {"handle":1, "priority":1}
SEC("tc")
int tc_ingress(struct __sk_buff *ctx)
{
    void *data_end = (void *)(__u64)ctx->data_end;
    void *data = (void *)(__u64)ctx->data;
    struct ethhdr *l2;
    struct iphdr *l3;

    if (ctx->protocol != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    l2 = data;
    if ((void *)(l2 + 1) > data_end)
        return TC_ACT_OK;

    l3 = (struct iphdr *)(l2 + 1);
    if ((void *)(l3 + 1) > data_end)
        return TC_ACT_OK;

    bpf_printk("Got IP packet: tot_len: %d, ttl: %d", bpf_ntohs(l3->tot_len), l3->ttl);
    return TC_ACT_OK;
}
char __license[] SEC("license") = "GPL";

3
12-bindsnoop/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
.vscode
package.json
ecli

106
12-bindsnoop/README.md Normal file

@@ -0,0 +1,106 @@
---
layout: post
title: bindsnoop
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, syscall, kprobe, perf-event]
summary: This tool traces the kernel function performing socket binding and prints socket options set before the system call.
---
## origin
Adapted from:
https://github.com/iovisor/bcc/blob/master/libbpf-tools/bindsnoop.bpf.c
## Compile and Run
Compile:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
Run:
```shell
sudo ./ecli run examples/bpftools/bindsnoop/package.json
```
## details in bcc
Demonstrations of bindsnoop, the Linux eBPF/bcc version.
This tool traces the kernel function performing socket binding and
prints the socket options set before the system call invocation that might
impact bind behavior and the bound interface:
```console
SOL_IP IP_FREEBIND F....
SOL_IP IP_TRANSPARENT .T...
SOL_IP IP_BIND_ADDRESS_NO_PORT ..N..
SOL_SOCKET SO_REUSEADDR ...R.
SOL_SOCKET SO_REUSEPORT ....r
```
```console
# ./bindsnoop.py
Tracing binds ... Hit Ctrl-C to end
PID COMM PROT ADDR PORT OPTS IF
3941081 test_bind_op TCP 192.168.1.102 0 F.N.. 0
3940194 dig TCP :: 62087 ..... 0
3940219 dig UDP :: 48665 ..... 0
3940893 Acceptor Thr TCP :: 35343 ...R. 0
```
The output shows four bind system calls:
one "test_bind_op" instance with the IP_FREEBIND and IP_BIND_ADDRESS_NO_PORT
options set, a dig process that called bind for TCP and UDP sockets,
and Acceptor, which called bind for TCP with the SO_REUSEADDR option set.
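
The OPTS column packs the five options into a single byte. The sketch below shows one way to render that byte as the five-character string seen in the output. It is illustrative only: `format_bind_opts` is a hypothetical helper, and the bit positions assume that the first bitfield member occupies the least significant bit, which is the usual layout on little-endian GCC/Clang targets but is implementation-defined in C.

```c
#include <stdint.h>

/*
 * Render the option byte as "FTNRr", with '.' for each option that is
 * not set. Assumed bit order: freebind, transparent,
 * bind_address_no_port, reuseaddress, reuseport (bit 0 to bit 4).
 */
static void format_bind_opts(uint8_t opts, char out[6])
{
    static const char flags[5] = { 'F', 'T', 'N', 'R', 'r' };
    int i;

    for (i = 0; i < 5; i++)
        out[i] = (opts & (1u << i)) ? flags[i] : '.';
    out[5] = '\0';
}
```

Under these assumptions, an option byte of `0x05` (freebind plus bind_address_no_port) renders as `F.N..`, matching the "test_bind_op" row above.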
The -t option prints a timestamp column:
```console
# ./bindsnoop.py -t
TIME(s) PID COMM PROT ADDR PORT OPTS IF
0.000000 3956801 dig TCP :: 49611 ..... 0
0.011045 3956822 dig UDP :: 56343 ..... 0
2.310629 3956498 test_bind_op TCP 192.168.1.102 39609 F...r 0
```
The -U option prints a UID column:
```console
# ./bindsnoop.py -U
Tracing binds ... Hit Ctrl-C to end
UID PID COMM PROT ADDR PORT OPTS IF
127072 3956498 test_bind_op TCP 192.168.1.102 44491 F...r 0
127072 3960261 Acceptor Thr TCP :: 48869 ...R. 0
0 3960729 Acceptor Thr TCP :: 44637 ...R. 0
0 3959075 chef-client UDP :: 61722 ..... 0
```
The -u option filters by UID:
```console
# ./bindsnoop.py -Uu 0
Tracing binds ... Hit Ctrl-C to end
UID PID COMM PROT ADDR PORT OPTS IF
0 3966330 Acceptor Thr TCP :: 39319 ...R. 0
0 3968044 python3.7 TCP ::1 59371 ..... 0
0 10224 fetch TCP 0.0.0.0 42091 ...R. 0
```
The --cgroupmap option filters based on a cgroup set.
It is meant to be used with an externally created map.
```console
# ./bindsnoop.py --cgroupmap /sys/fs/bpf/test01
```
For more details, see docs/special_filtering.md.
To track heavy bind usage, use the --count option:
```console
# ./bindsnoop.py --count
Tracing binds ... Hit Ctrl-C to end
LADDR LPORT BINDS
0.0.0.0 6771 4
0.0.0.0 4433 4
127.0.0.1 33665 1
```


@@ -0,0 +1,151 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/* Copyright (c) 2021 Hengqi Chen */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_endian.h>
#include "bindsnoop.bpf.h"
#define MAX_ENTRIES 10240
#define MAX_PORTS 1024
const volatile bool filter_cg = false;
const volatile pid_t target_pid = 0;
const volatile bool ignore_errors = true;
const volatile bool filter_by_port = false;
struct {
    __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
    __type(key, u32);
    __type(value, u32);
    __uint(max_entries, 1);
} cgroup_map SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_ENTRIES);
    __type(key, __u32);
    __type(value, struct socket *);
} sockets SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_PORTS);
    __type(key, __u16);
    __type(value, __u16);
} ports SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} events SEC(".maps");

static int probe_entry(struct pt_regs *ctx, struct socket *socket)
{
    __u64 pid_tgid = bpf_get_current_pid_tgid();
    __u32 pid = pid_tgid >> 32;
    __u32 tid = (__u32)pid_tgid;

    if (target_pid && target_pid != pid)
        return 0;

    bpf_map_update_elem(&sockets, &tid, &socket, BPF_ANY);
    return 0;
}

static int probe_exit(struct pt_regs *ctx, short ver)
{
    __u64 pid_tgid = bpf_get_current_pid_tgid();
    __u32 pid = pid_tgid >> 32;
    __u32 tid = (__u32)pid_tgid;
    struct socket **socketp, *socket;
    struct inet_sock *inet_sock;
    struct sock *sock;
    union bind_options opts;
    struct bind_event event = {};
    __u16 sport = 0, *port;
    int ret;

    socketp = bpf_map_lookup_elem(&sockets, &tid);
    if (!socketp)
        return 0;

    ret = PT_REGS_RC(ctx);
    if (ignore_errors && ret != 0)
        goto cleanup;

    socket = *socketp;
    sock = BPF_CORE_READ(socket, sk);
    inet_sock = (struct inet_sock *)sock;

    sport = bpf_ntohs(BPF_CORE_READ(inet_sock, inet_sport));
    port = bpf_map_lookup_elem(&ports, &sport);
    if (filter_by_port && !port)
        goto cleanup;

    opts.fields.freebind = BPF_CORE_READ_BITFIELD_PROBED(inet_sock, freebind);
    opts.fields.transparent = BPF_CORE_READ_BITFIELD_PROBED(inet_sock, transparent);
    opts.fields.bind_address_no_port = BPF_CORE_READ_BITFIELD_PROBED(inet_sock, bind_address_no_port);
    opts.fields.reuseaddress = BPF_CORE_READ_BITFIELD_PROBED(sock, __sk_common.skc_reuse);
    opts.fields.reuseport = BPF_CORE_READ_BITFIELD_PROBED(sock, __sk_common.skc_reuseport);
    event.opts = opts.data;
    event.ts_us = bpf_ktime_get_ns() / 1000;
    event.pid = pid;
    event.port = sport;
    event.bound_dev_if = BPF_CORE_READ(sock, __sk_common.skc_bound_dev_if);
    event.ret = ret;
    event.proto = BPF_CORE_READ_BITFIELD_PROBED(sock, sk_protocol);
    bpf_get_current_comm(&event.task, sizeof(event.task));
    if (ver == 4) {
        event.ver = ver;
        bpf_probe_read_kernel(&event.addr, sizeof(event.addr), &inet_sock->inet_saddr);
    } else { /* ver == 6 */
        event.ver = ver;
        bpf_probe_read_kernel(&event.addr, sizeof(event.addr), sock->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
    }
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));

cleanup:
    bpf_map_delete_elem(&sockets, &tid);
    return 0;
}
SEC("kprobe/inet_bind")
int BPF_KPROBE(ipv4_bind_entry, struct socket *socket)
{
    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    return probe_entry(ctx, socket);
}

SEC("kretprobe/inet_bind")
int BPF_KRETPROBE(ipv4_bind_exit)
{
    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    return probe_exit(ctx, 4);
}

SEC("kprobe/inet6_bind")
int BPF_KPROBE(ipv6_bind_entry, struct socket *socket)
{
    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    return probe_entry(ctx, socket);
}

SEC("kretprobe/inet6_bind")
int BPF_KRETPROBE(ipv6_bind_exit)
{
    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    return probe_exit(ctx, 6);
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";


@@ -0,0 +1,31 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __BINDSNOOP_H
#define __BINDSNOOP_H
#define TASK_COMM_LEN 16
struct bind_event {
    unsigned __int128 addr;
    unsigned long long ts_us;
    unsigned int pid;
    unsigned int bound_dev_if;
    int ret;
    unsigned short port;
    unsigned short proto;
    unsigned char opts;
    unsigned char ver;
    char task[TASK_COMM_LEN];
};

union bind_options {
    unsigned char data;
    struct {
        unsigned char freebind : 1;
        unsigned char transparent : 1;
        unsigned char bind_address_no_port : 1;
        unsigned char reuseaddress : 1;
        unsigned char reuseport : 1;
    } fields;
};
#endif /* __BINDSNOOP_H */

95
12-bindsnoop/bindsnoop.md Normal file

@@ -0,0 +1,95 @@
## eBPF Beginner's Practice Tutorial: Writing the eBPF Program Bindsnoop to Monitor Socket Port Binding Events
### Background
Bindsnoop traces the kernel functions that perform socket port binding and prints the socket options that were set before the system call and that may affect the binding.
### How it works
Bindsnoop is implemented with kprobes. Its main attachment points are `inet_bind` and `inet6_bind`: `inet_bind` handles the bind system call for IPv4 sockets, and `inet6_bind` handles it for IPv6 sockets.

```c
SEC("kprobe/inet_bind")
int BPF_KPROBE(ipv4_bind_entry, struct socket *socket)
{
    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    return probe_entry(ctx, socket);
}

SEC("kretprobe/inet_bind")
int BPF_KRETPROBE(ipv4_bind_exit)
{
    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    return probe_exit(ctx, 4);
}

SEC("kprobe/inet6_bind")
int BPF_KPROBE(ipv6_bind_entry, struct socket *socket)
{
    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    return probe_entry(ctx, socket);
}

SEC("kretprobe/inet6_bind")
int BPF_KRETPROBE(ipv6_bind_exit)
{
    if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    return probe_exit(ctx, 6);
}
```

When the system attempts a socket port binding, the handlers attached by kprobe are triggered. On entry to the bind function, `probe_entry` is called first; it stores the socket in a map keyed by tid.

```c
static int probe_entry(struct pt_regs *ctx, struct socket *socket)
{
    __u64 pid_tgid = bpf_get_current_pid_tgid();
    __u32 pid = pid_tgid >> 32;
    __u32 tid = (__u32)pid_tgid;

    if (target_pid && target_pid != pid)
        return 0;

    bpf_map_update_elem(&sockets, &tid, &socket, BPF_ANY);
    return 0;
}
```
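
`bpf_get_current_pid_tgid()` returns a single 64-bit value, with the thread group ID (the PID as seen from user space) in the upper 32 bits and the thread ID in the lower 32 bits. The split performed in `probe_entry` can be tried out in user space with the sketch below; the helper names `tgid_of` and `tid_of` are hypothetical.

```c
#include <stdint.h>

/* Upper 32 bits: thread group ID, i.e. the process ID seen from user space. */
static uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: ID of the individual thread. */
static uint32_t tid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}
```

Keying the map by thread ID rather than process ID matters because several threads of one process may be inside the bind function at the same time; with a per-thread key, the entry and exit handlers of each call pair up without overwriting each other.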
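After the bind function returns, `probe_exit` is called. It reads the socket stored for the tid and writes it, together with other information, into the event struct, which is then output to user space.

```c
struct bind_event {
    unsigned __int128 addr;
    __u64 ts_us;
    __u32 pid;
    __u32 bound_dev_if;
    int ret;
    __u16 port;
    __u16 proto;
    __u8 opts;
    __u8 ver;
    char task[TASK_COMM_LEN];
};
```

When the user stops the tool, its user-space code reads the stored data and prints it as requested.
### Usage in Eunomia
![result](../imgs/mountsnoop.jpg)
![result](../imgs/bindsnoop-prometheus.png)
### Summary
Bindsnoop uses kprobe attachment points to monitor socket port binding, which broadens the range of Eunomia's applications.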

2
13-tcpconnlat/.gitignore vendored Normal file

@@ -0,0 +1,2 @@
.vscode
package.json

137
13-tcpconnlat/README.md Normal file

@@ -0,0 +1,137 @@
---
layout: post
title: tcpconnlat
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, syscall, network]
summary: Traces the kernel function performing active TCP connections (e.g., via a connect() syscall; accept() calls are passive connections) and shows connection latency.
---
## origin
Adapted from:
https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpconnlat.bpf.c
## Compile and Run
Compile:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
Run:
```shell
sudo ./ecli run package.json
```
TODO: support union in C
## details in bcc
Demonstrations of tcpconnect, the Linux eBPF/bcc version.
This tool traces the kernel function performing active TCP connections
(eg, via a connect() syscall; accept() are passive connections). Some example
output (IP addresses changed to protect the innocent):
```console
# ./tcpconnect
PID COMM IP SADDR DADDR DPORT
1479 telnet 4 127.0.0.1 127.0.0.1 23
1469 curl 4 10.201.219.236 54.245.105.25 80
1469 curl 4 10.201.219.236 54.67.101.145 80
1991 telnet 6 ::1 ::1 23
2015 ssh 6 fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22
```
This output shows five connections: two from a "telnet" process, two from
"curl", and one from "ssh". The output details show the IP version, source
address, destination address, and destination port. This traces attempted
connections: these may have failed.
The overhead of this tool should be negligible, since it is only tracing the
kernel functions performing connect. It is not tracing every packet and then
filtering.
The -t option prints a timestamp column:
```console
# ./tcpconnect -t
TIME(s) PID COMM IP SADDR DADDR DPORT
31.871 2482 local_agent 4 10.103.219.236 10.251.148.38 7001
31.874 2482 local_agent 4 10.103.219.236 10.101.3.132 7001
31.878 2482 local_agent 4 10.103.219.236 10.171.133.98 7101
90.917 2482 local_agent 4 10.103.219.236 10.251.148.38 7001
90.928 2482 local_agent 4 10.103.219.236 10.102.64.230 7001
90.938 2482 local_agent 4 10.103.219.236 10.115.167.169 7101
```
The output shows some periodic connections (or attempts) from a "local_agent"
process to various other addresses. A few connections occur every minute.
The -d option tracks DNS responses and tries to associate each connection with
a previous DNS query issued before it. If a DNS response matching the IP
is found, it will be printed. If no match was found, "No DNS Query" is printed
in this column. Queries for 127.0.0.1 and ::1 are automatically associated with
"localhost". If the time between when the DNS response was received and a
connect call was traced exceeds 100ms, the tool will print the time delta
after the query name. See below for www.domain.com for an example.
```console
# ./tcpconnect -d
PID COMM IP SADDR DADDR DPORT QUERY
1543 amazon-ssm-a 4 10.66.75.54 176.32.119.67 443 ec2messages.us-west-1.amazonaws.com
1479 telnet 4 127.0.0.1 127.0.0.1 23 localhost
1469 curl 4 10.201.219.236 54.245.105.25 80 www.domain.com (123.342ms)
1469 curl 4 10.201.219.236 54.67.101.145 80 No DNS Query
1991 telnet 6 ::1 ::1 23 localhost
2015 ssh 6 fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22 anotherhost.org
```
The -L option prints a LPORT column:
```console
# ./tcpconnect -L
PID COMM IP SADDR LPORT DADDR DPORT
3706 nc 4 192.168.122.205 57266 192.168.122.150 5000
3722 ssh 4 192.168.122.205 50966 192.168.122.150 22
3779 ssh 6 fe80::1 52328 fe80::2 22
```
The -U option prints a UID column:
```console
# ./tcpconnect -U
UID PID COMM IP SADDR DADDR DPORT
0 31333 telnet 6 ::1 ::1 23
0 31333 telnet 4 127.0.0.1 127.0.0.1 23
1000 31322 curl 4 127.0.0.1 127.0.0.1 80
1000 31322 curl 6 ::1 ::1 80
```
The -u option filters by UID:
```console
# ./tcpconnect -Uu 1000
UID PID COMM IP SADDR DADDR DPORT
1000 31338 telnet 6 ::1 ::1 23
1000 31338 telnet 4 127.0.0.1 127.0.0.1 23
```
To spot heavy outbound connections quickly one can use the -c flag. It will
count all active connections per source ip and destination ip/port.
```console
# ./tcpconnect.py -c
Tracing connect ... Hit Ctrl-C to end
^C
LADDR RADDR RPORT CONNECTS
192.168.10.50 172.217.21.194 443 70
192.168.10.50 172.213.11.195 443 34
192.168.10.50 172.212.22.194 443 21
[...]
```
The --cgroupmap option filters based on a cgroup set. It is meant to be used
with an externally created map.
```console
# ./tcpconnect --cgroupmap /sys/fs/bpf/test01
```
For more details, see docs/special_filtering.md


@@ -0,0 +1,113 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2020 Wenbo Zhang
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#include "tcpconnlat.bpf.h"
#define AF_INET 2
#define AF_INET6 10
const volatile __u64 targ_min_us = 0;
const volatile pid_t targ_tgid = 0;
struct piddata {
    char comm[TASK_COMM_LEN];
    u64 ts;
    u32 tgid;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, struct sock *);
    __type(value, struct piddata);
} start SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} events SEC(".maps");

static int trace_connect(struct sock *sk)
{
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    struct piddata piddata = {};

    if (targ_tgid && targ_tgid != tgid)
        return 0;

    bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
    piddata.ts = bpf_ktime_get_ns();
    piddata.tgid = tgid;
    bpf_map_update_elem(&start, &sk, &piddata, 0);
    return 0;
}

static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
{
    struct piddata *piddatap;
    struct event event = {};
    s64 delta;
    u64 ts;

    if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
        return 0;

    piddatap = bpf_map_lookup_elem(&start, &sk);
    if (!piddatap)
        return 0;

    ts = bpf_ktime_get_ns();
    delta = (s64)(ts - piddatap->ts);
    if (delta < 0)
        goto cleanup;

    event.delta_us = delta / 1000U;
    if (targ_min_us && event.delta_us < targ_min_us)
        goto cleanup;
    __builtin_memcpy(&event.comm, piddatap->comm,
                     sizeof(event.comm));
    event.ts_us = ts / 1000;
    event.tgid = piddatap->tgid;
    event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
    event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
    event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
    if (event.af == AF_INET) {
        event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
        event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
    } else {
        BPF_CORE_READ_INTO(&event.saddr_v6, sk,
                           __sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
        BPF_CORE_READ_INTO(&event.daddr_v6, sk,
                           __sk_common.skc_v6_daddr.in6_u.u6_addr32);
    }
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                          &event, sizeof(event));

cleanup:
    bpf_map_delete_elem(&start, &sk);
    return 0;
}

SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
{
    return trace_connect(sk);
}

SEC("kprobe/tcp_v6_connect")
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
{
    return trace_connect(sk);
}

SEC("kprobe/tcp_rcv_state_process")
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
{
    return handle_tcp_rcv_state_process(ctx, sk);
}

char LICENSE[] SEC("license") = "GPL";


@@ -0,0 +1,26 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __TCPCONNLAT_H
#define __TCPCONNLAT_H
#define TASK_COMM_LEN 16
struct event {
    // union {
    unsigned int saddr_v4;
    unsigned char saddr_v6[16];
    // };
    // union {
    unsigned int daddr_v4;
    unsigned char daddr_v6[16];
    // };
    char comm[TASK_COMM_LEN];
    unsigned long long delta_us;
    unsigned long long ts_us;
    unsigned int tgid;
    int af;
    unsigned short lport;
    unsigned short dport;
};

#endif /* __TCPCONNLAT_H */

186
13-tcpconnlat/tcpconnlat.md Normal file

@@ -0,0 +1,186 @@
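## eBPF Beginner's Practice Tutorial: Writing the eBPF Program tcpconnlat to Measure TCP Connection Latency
### Background
In day-to-day backend development, whether you use C, Java, PHP, or Golang, you inevitably call components such as MySQL and Redis to fetch data, and you may also make RPC calls or call other RESTful APIs. Underneath, almost all of these calls are carried over TCP. Among transport-layer protocols, TCP offers reliable connections, retransmission on error, and congestion control, so it is currently more widely used than UDP. TCP also has drawbacks, however, such as relatively long connection setup latency. This has led to alternatives such as QUIC (Quick UDP Internet Connections).

Analyzing TCP connection latency is useful for network performance analysis and optimization as well as for troubleshooting.
### How tcpconnlat works
tcpconnlat traces the kernel functions that perform active TCP connections (for example, via the connect() system call) and shows the connection latency as measured locally, that is, the time from sending the SYN to receiving the response packet.
### How a TCP connection is established
The whole TCP connection process is shown in the figure:

![tcpconnlate](tcpconnlat1.png)

Let's briefly look at where the time goes in each step of this process:
1. The client sends a SYN packet: the client usually sends the SYN through the connect system call, which involves local system-call and softirq CPU time.
2. The SYN travels to the server: the SYN leaves the client NIC and makes a long-distance trip over the network.
3. The server handles the SYN: the kernel receives the packet in a softirq, places the connection in the half-open queue, and then sends a SYN/ACK response. This is mainly CPU time.
4. The SYN/ACK travels to the client: another long-distance network trip.
5. The client handles the SYN/ACK: the client kernel receives and processes the packet, spends a few microseconds of CPU time, and then sends the ACK. This is again softirq processing overhead.
6. The ACK travels to the server: another long-distance network trip.
7. The server receives the ACK: the server kernel receives and processes the ACK, removes the connection from the half-open queue, and places it in the accept queue. This costs one softirq of CPU time.
8. The server user process wakes up: the user process blocked in the accept system call is woken and takes the established connection from the accept queue. This costs one context switch of CPU time.

From the client's point of view, under normal conditions the total time for a TCP connection is roughly one network RTT. In some situations, however, the network transfer time may rise, CPU processing overhead may increase, or the connection may even fail. When latency is found to be too high, it can be analyzed together with other information.
### eBPF implementation
During the TCP three-way handshake, the Linux kernel maintains two queues:
- the half-open queue, also called the SYN queue;
- the full-connection queue, also called the accept queue.

After the server receives the client's SYN, the kernel stores the connection in the half-open queue and replies with SYN+ACK. The client then returns an ACK. When the server receives this third-handshake ACK, the kernel removes the connection from the half-open queue, creates a new full connection, and adds it to the accept queue, where it waits for a process to call accept and take it out.

Our eBPF code is implemented in https://github.com/yunwei37/Eunomia/blob/master/bpftools/tcpconnlat/tcpconnlat.bpf.c.

It mainly uses kprobes such as `kprobe/tcp_rcv_state_process` and `kprobe/tcp_v4_connect`: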
```c
SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
{
    return trace_connect(sk);
}

SEC("kprobe/tcp_v6_connect")
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
{
    return trace_connect(sk);
}

SEC("kprobe/tcp_rcv_state_process")
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
{
    return handle_tcp_rcv_state_process(ctx, sk);
}
```

In trace_connect, we track new TCP connections, record the time at which the connection attempt starts, and add it to a map:

```c
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, struct sock *);
    __type(value, struct piddata);
} start SEC(".maps");

static int trace_connect(struct sock *sk)
{
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    struct piddata piddata = {};

    if (targ_tgid && targ_tgid != tgid)
        return 0;

    bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
    piddata.ts = bpf_ktime_get_ns();
    piddata.tgid = tgid;
    bpf_map_update_elem(&start, &sk, &piddata, 0);
    return 0;
}
```
在 handle_tcp_rcv_state_process 中,我们跟踪接收到的 tcp 数据包,从 map 从提取出对应的 connect 事件,并且计算延迟:
```c
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
{
struct piddata *piddatap;
struct event event = {};
s64 delta;
u64 ts;
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
return 0;
piddatap = bpf_map_lookup_elem(&start, &sk);
if (!piddatap)
return 0;
ts = bpf_ktime_get_ns();
delta = (s64)(ts - piddatap->ts);
if (delta < 0)
goto cleanup;
event.delta_us = delta / 1000U;
if (targ_min_us && event.delta_us < targ_min_us)
goto cleanup;
__builtin_memcpy(&event.comm, piddatap->comm,
sizeof(event.comm));
event.ts_us = ts / 1000;
event.tgid = piddatap->tgid;
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
if (event.af == AF_INET) {
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
} else {
BPF_CORE_READ_INTO(&event.saddr_v6, sk,
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
BPF_CORE_READ_INTO(&event.daddr_v6, sk,
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
}
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
&event, sizeof(event));
cleanup:
bpf_map_delete_elem(&start, &sk);
return 0;
}
```
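The latency arithmetic in handle_tcp_rcv_state_process can be tried outside the kernel. The following is a minimal user-space sketch, not code from tcpconnlat.bpf.c: the function name `connect_latency_us` and its parameters are made up for illustration, and `min_us` plays the role of `targ_min_us`.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only, not part of tcpconnlat.bpf.c).
 * Takes two nanosecond timestamps, rejects a negative interval,
 * converts the interval to microseconds and applies an optional
 * minimum threshold. Returns true and stores the latency in
 * *delta_us when the connection should be reported.
 */
bool connect_latency_us(uint64_t start_ns, uint64_t now_ns,
                        uint64_t min_us, uint64_t *delta_us)
{
    int64_t delta = (int64_t)(now_ns - start_ns);

    if (delta < 0)                      /* timestamps out of order: drop */
        return false;
    *delta_us = (uint64_t)delta / 1000U;
    if (min_us && *delta_us < min_us)   /* below the user's threshold */
        return false;
    return true;
}
```

A threshold of 0 disables the filter, which matches how the eBPF code treats an unset `targ_min_us`.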
### Eunomia test demo
Trace from the command line:
```bash
$ sudo build/bin/Release/eunomia run tcpconnlat
[sudo] password for yunwei:
[2022-08-07 02:13:39.601] [info] eunomia run in cmd...
[2022-08-07 02:13:40.534] [info] press 'Ctrl C' key to exit...
PID COMM IP SRC DEST PORT LAT(ms) CONATINER/OS
3477 openresty 4 172.19.0.7 172.19.0.5 2379 0.05 docker-apisix_apisix_1
3483 openresty 4 172.19.0.7 172.19.0.5 2379 0.08 docker-apisix_apisix_1
3477 openresty 4 172.19.0.7 172.19.0.5 2379 0.04 docker-apisix_apisix_1
3478 openresty 4 172.19.0.7 172.19.0.5 2379 0.05 docker-apisix_apisix_1
3478 openresty 4 172.19.0.7 172.19.0.5 2379 0.03 docker-apisix_apisix_1
3478 openresty 4 172.19.0.7 172.19.0.5 2379 0.03 docker-apisix_apisix_1
```
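Eunomia can also act as a Prometheus exporter. After running the command above, open Prometheus's built-in visualization panel:
Use the following query to see a chart of the latency statistics: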
```
rate(eunomia_observed_tcpconnlat_v4_histogram_sum[5m])
/
rate(eunomia_observed_tcpconnlat_v4_histogram_count[5m])
```
Result:
![result](tcpconnlat_p.png)
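### Summary
As the experiment above shows, the tcpconnlat tool works by tracing TCP connections inside the kernel and can measure TCP connection latency. Besides the command-line mode, its output can be combined with metadata such as containers and k8s and analyzed with tools like `prometheus` and `grafana` for network performance analysis.
> `Eunomia` is a lightweight, high-performance cloud-native monitoring tool based on eBPF and written in C/C++. It aims to help users understand container behavior and monitor suspicious container security events, and strives to provide a lightweight open-source monitoring solution that covers the whole container lifecycle. It uses `Linux` `eBPF` to trace your system and applications at runtime and analyzes the collected events to detect suspicious behavior patterns. It currently includes performance analysis, container cluster network visualization*, container security alerting, one-click deployment, and persistent storage monitoring, and provides a variety of eBPF trace points. Its core exporter/command-line tool needs a binary of only about 4 MB at minimum and can start on any supported Linux kernel.
Project: https://github.com/yunwei37/Eunomia
### References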
1. http://kerneltravel.net/blog/2020/tcpconnlat/
2. https://network.51cto.com/article/640631.html


5
14-tcpstates/.gitignore vendored Normal file
View File

@@ -0,0 +1,5 @@
.vscode
package.json
eunomia-exporter
ecli

56
14-tcpstates/README.md Normal file
View File

@@ -0,0 +1,56 @@
---
layout: post
title: tcpstates
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, syscall, network]
summary: Tcpstates prints TCP state change information, including the duration in each state as milliseconds
---
## Origin
Adapted from:
https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpstates.bpf.c
## Compile and Run
Compile:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
Run:
```shell
sudo ./ecli run package.json
```
## details in bcc
Demonstrations of tcpstates, the Linux BPF/bcc version.
tcpstates prints TCP state change information, including the duration in each
state as milliseconds. For example, a single TCP session:
```console
# tcpstates
SKADDR C-PID C-COMM LADDR LPORT RADDR RPORT OLDSTATE -> NEWSTATE MS
ffff9fd7e8192000 22384 curl 100.66.100.185 0 52.33.159.26 80 CLOSE -> SYN_SENT 0.000
ffff9fd7e8192000 0 swapper/5 100.66.100.185 63446 52.33.159.26 80 SYN_SENT -> ESTABLISHED 1.373
ffff9fd7e8192000 22384 curl 100.66.100.185 63446 52.33.159.26 80 ESTABLISHED -> FIN_WAIT1 176.042
ffff9fd7e8192000 0 swapper/5 100.66.100.185 63446 52.33.159.26 80 FIN_WAIT1 -> FIN_WAIT2 0.536
ffff9fd7e8192000 0 swapper/5 100.66.100.185 63446 52.33.159.26 80 FIN_WAIT2 -> CLOSE 0.006
^C
```
This showed that the most time was spent in the ESTABLISHED state (which then
transitioned to FIN_WAIT1), which was 176.042 milliseconds.
The first column is the socket address, as the output may include lines from
different sessions interleaved. The next two columns show the current on-CPU
process ID and command name: these may show the process that owns the TCP
session, depending on whether the state change executes synchronously in
process context. If that's not the case, they may show kernel details.
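To make the OLDSTATE -> NEWSTATE and MS columns concrete, here is a small user-space sketch of how a state transition could be rendered. It is an illustration, not the actual tcpstates front end: `tcp_state_name` and `format_transition` are hypothetical names, and the input is assumed to be the numeric states and a duration in microseconds. The state numbers themselves follow the kernel's `include/net/tcp_states.h`.

```c
#include <stdint.h>
#include <stdio.h>

/* TCP state numbers as defined by the Linux kernel. */
static const char *const tcp_state_names[] = {
    [1] = "ESTABLISHED", [2] = "SYN_SENT",    [3] = "SYN_RECV",
    [4] = "FIN_WAIT1",   [5] = "FIN_WAIT2",   [6] = "TIME_WAIT",
    [7] = "CLOSE",       [8] = "CLOSE_WAIT",  [9] = "LAST_ACK",
    [10] = "LISTEN",     [11] = "CLOSING",    [12] = "NEW_SYN_RECV",
};

/* Hypothetical helper: map a state number to its name. */
const char *tcp_state_name(int state)
{
    if (state < 1 || state > 12)
        return "UNKNOWN";
    return tcp_state_names[state];
}

/*
 * Hypothetical helper: write "OLD -> NEW <ms>" into buf, converting
 * the microsecond duration to milliseconds with three decimals.
 * Returns the snprintf result.
 */
int format_transition(char *buf, size_t len, int oldstate, int newstate,
                      uint64_t delta_us)
{
    return snprintf(buf, len, "%s -> %s %.3f",
                    tcp_state_name(oldstate), tcp_state_name(newstate),
                    (double)delta_us / 1000.0);
}
```

The duration belongs to the old state: it is the time the socket spent there before the transition.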

View File

@@ -0,0 +1,109 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
/* Copyright (c) 2021 Hengqi Chen */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "tcpstates.bpf.h"
#define MAX_ENTRIES 10240
#define AF_INET 2
#define AF_INET6 10
const volatile bool filter_by_sport = false;
const volatile bool filter_by_dport = false;
const volatile short target_family = 0;
struct
{
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, __u16);
__type(value, __u16);
} sports SEC(".maps");
struct
{
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, __u16);
__type(value, __u16);
} dports SEC(".maps");
struct
{
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, struct sock *);
__type(value, __u64);
} timestamps SEC(".maps");
struct
{
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u32));
} events SEC(".maps");
SEC("tracepoint/sock/inet_sock_set_state")
int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)
{
struct sock *sk = (struct sock *)ctx->skaddr;
__u16 family = ctx->family;
__u16 sport = ctx->sport;
__u16 dport = ctx->dport;
__u64 *tsp, delta_us, ts;
struct event event = {};
if (ctx->protocol != IPPROTO_TCP)
return 0;
if (target_family && target_family != family)
return 0;
if (filter_by_sport && !bpf_map_lookup_elem(&sports, &sport))
return 0;
if (filter_by_dport && !bpf_map_lookup_elem(&dports, &dport))
return 0;
tsp = bpf_map_lookup_elem(&timestamps, &sk);
ts = bpf_ktime_get_ns();
if (!tsp)
delta_us = 0;
else
delta_us = (ts - *tsp) / 1000;
event.skaddr = (__u64)sk;
event.ts_us = ts / 1000;
event.delta_us = delta_us;
event.pid = bpf_get_current_pid_tgid() >> 32;
event.oldstate = ctx->oldstate;
event.newstate = ctx->newstate;
event.family = family;
event.sport = sport;
event.dport = dport;
bpf_get_current_comm(&event.task, sizeof(event.task));
if (family == AF_INET)
{
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_rcv_saddr);
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_daddr);
}
else
{ /* family == AF_INET6 */
bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_v6_daddr.in6_u.u6_addr32);
}
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
if (ctx->newstate == TCP_CLOSE)
bpf_map_delete_elem(&timestamps, &sk);
else
bpf_map_update_elem(&timestamps, &sk, &ts, BPF_ANY);
return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";

View File

@@ -0,0 +1,24 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
/* Copyright (c) 2021 Hengqi Chen */
#ifndef __TCPSTATES_H
#define __TCPSTATES_H
#define TASK_COMM_LEN 16
struct event
{
unsigned __int128 saddr;
unsigned __int128 daddr;
__u64 skaddr;
__u64 ts_us;
__u64 delta_us;
__u32 pid;
int oldstate;
int newstate;
__u16 family;
__u16 sport;
__u16 dport;
char task[TASK_COMM_LEN];
};
#endif /* __TCPSTATES_H */

116
15-tcprtt/tcprtt.md Normal file
View File

@@ -0,0 +1,116 @@
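## eBPF Beginner's Practical Tutorial: Writing the eBPF Program Tcprtt to Measure TCP Round-Trip Time
### Background
Network quality matters a great deal in an Internet-connected society. Poor network quality has many possible causes: it may stem from hardware, or from badly
written programs. The `tcprtt` tool was proposed to make network problems easier to locate. It measures the round-trip time of TCP connections so that users can analyze
network quality and track down the source of a problem.
### How it works
`tcprtt` attaches its handler to `tcp_rcv_established`, the kernel function that processes packets received on an established TCP connection.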
```c
SEC("fentry/tcp_rcv_established")
int BPF_PROG(tcp_rcv, struct sock *sk)
{
const struct inet_sock *inet = (struct inet_sock *)(sk);
struct tcp_sock *ts;
struct hist *histp;
u64 key, slot;
u32 srtt;
if (targ_sport && targ_sport != inet->inet_sport)
return 0;
if (targ_dport && targ_dport != sk->__sk_common.skc_dport)
return 0;
if (targ_saddr && targ_saddr != inet->inet_saddr)
return 0;
if (targ_daddr && targ_daddr != sk->__sk_common.skc_daddr)
return 0;
if (targ_laddr_hist)
key = inet->inet_saddr;
else if (targ_raddr_hist)
key = inet->sk.__sk_common.skc_daddr;
else
key = 0;
histp = bpf_map_lookup_or_try_init(&hists, &key, &zero);
if (!histp)
return 0;
ts = (struct tcp_sock *)(sk);
srtt = BPF_CORE_READ(ts, srtt_us) >> 3;
if (targ_ms)
srtt /= 1000U;
slot = log2l(srtt);
if (slot >= MAX_SLOTS)
slot = MAX_SLOTS - 1;
__sync_fetch_and_add(&histp->slots[slot], 1);
if (targ_show_ext) {
__sync_fetch_and_add(&histp->latency, srtt);
__sync_fetch_and_add(&histp->cnt, 1);
}
return 0;
}
SEC("kprobe/tcp_rcv_established")
int BPF_KPROBE(tcp_rcv_kprobe, struct sock *sk)
{
const struct inet_sock *inet = (struct inet_sock *)(sk);
u32 srtt, saddr, daddr;
struct tcp_sock *ts;
struct hist *histp;
u64 key, slot;
if (targ_sport) {
u16 sport;
bpf_probe_read_kernel(&sport, sizeof(sport), &inet->inet_sport);
if (targ_sport != sport)
return 0;
}
if (targ_dport) {
u16 dport;
bpf_probe_read_kernel(&dport, sizeof(dport), &sk->__sk_common.skc_dport);
if (targ_dport != dport)
return 0;
}
bpf_probe_read_kernel(&saddr, sizeof(saddr), &inet->inet_saddr);
if (targ_saddr && targ_saddr != saddr)
return 0;
bpf_probe_read_kernel(&daddr, sizeof(daddr), &sk->__sk_common.skc_daddr);
if (targ_daddr && targ_daddr != daddr)
return 0;
if (targ_laddr_hist)
key = saddr;
else if (targ_raddr_hist)
key = daddr;
else
key = 0;
histp = bpf_map_lookup_or_try_init(&hists, &key, &zero);
if (!histp)
return 0;
ts = (struct tcp_sock *)(sk);
bpf_probe_read_kernel(&srtt, sizeof(srtt), &ts->srtt_us);
srtt >>= 3;
if (targ_ms)
srtt /= 1000U;
slot = log2l(srtt);
if (slot >= MAX_SLOTS)
slot = MAX_SLOTS - 1;
__sync_fetch_and_add(&histp->slots[slot], 1);
if (targ_show_ext) {
__sync_fetch_and_add(&histp->latency, srtt);
__sync_fetch_and_add(&histp->cnt, 1);
}
return 0;
}
```
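The tool automatically selects the suitable handler (fentry or kprobe) according to what the running system supports.
In the handler, `tcprtt` gathers basic information about the TCP connection, including the addresses, source port, destination port and round-trip time,
and updates a histogram map with it. When the run ends, user-space code presents the result to the user.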
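The bucket computation in the handlers can be reproduced in plain C. The kernel keeps `srtt_us` as eight times the smoothed RTT in microseconds, which is why the handlers shift right by 3 before bucketing. The sketch below is an assumption-laden illustration: `log2_floor` stands in for the `log2l` helper, `srtt_slot` is a hypothetical name, and `MAX_SLOTS` is set to 27 only for this example.

```c
#include <stdint.h>

#define MAX_SLOTS 27  /* assumed bucket count for this sketch */

/* floor(log2(v)); 0 and 1 both map to slot 0. */
unsigned int log2_floor(uint64_t v)
{
    unsigned int r = 0;

    while (v >>= 1)
        r++;
    return r;
}

/*
 * Hypothetical helper: turn the kernel's srtt_us field (8 x smoothed
 * RTT in microseconds) into a power-of-two histogram slot, optionally
 * in milliseconds, clamping to the last slot.
 */
unsigned int srtt_slot(uint32_t srtt_us_x8, int in_ms)
{
    uint32_t srtt = srtt_us_x8 >> 3;    /* undo the x8 scaling */
    unsigned int slot;

    if (in_ms)
        srtt /= 1000U;
    slot = log2_floor(srtt);
    if (slot >= MAX_SLOTS)
        slot = MAX_SLOTS - 1;
    return slot;
}
```

With this layout, slot `i` collects every RTT in the range from 2^i to 2^(i+1) - 1, so each bucket is twice as wide as the previous one.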
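### Usage in Eunomia
### Summary
By presenting RTTs as a histogram, `tcprtt` makes network jitter on the current system easy to see and helps developers quickly locate network problems.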

104
16-profile/profile.md Normal file
View File

@@ -0,0 +1,104 @@
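## eBPF Beginner's Practical Tutorial: Writing the eBPF Program profile for Performance Analysis
### Background
`profile` is a tool for tracing a program's call flow, similar to the -g option of perf. Compared with perf, however,
`profile` is more fine-grained: it lets the user choose the level to trace, for example only user space or only kernel space.
### How it works
`profile` is built on the Linux perf_event mechanism. Before attaching the eBPF program, the `profile` tool first
opens the perf events it needs.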
```c
static int open_and_attach_perf_event(int freq, struct bpf_program *prog,
struct bpf_link *links[])
{
struct perf_event_attr attr = {
.type = PERF_TYPE_SOFTWARE,
.freq = env.freq,
.sample_freq = env.sample_freq,
.config = PERF_COUNT_SW_CPU_CLOCK,
};
int i, fd;
for (i = 0; i < nr_cpus; i++) {
if (env.cpu != -1 && env.cpu != i)
continue;
fd = syscall(__NR_perf_event_open, &attr, -1, i, -1, 0);
if (fd < 0) {
/* Ignore CPU that is offline */
if (errno == ENODEV)
continue;
fprintf(stderr, "failed to init perf sampling: %s\n",
strerror(errno));
return -1;
}
links[i] = bpf_program__attach_perf_event(prog, fd);
if (!links[i]) {
fprintf(stderr, "failed to attach perf event on cpu: "
"%d\n", i);
links[i] = NULL;
close(fd);
return -1;
}
}
return 0;
}
```
The eBPF program samples the program's stack at a fixed rate and thereby captures its execution flow.
```c
SEC("perf_event")
int do_perf_event(struct bpf_perf_event_data *ctx)
{
__u64 id = bpf_get_current_pid_tgid();
__u32 pid = id >> 32;
__u32 tid = id;
__u64 *valp;
static const __u64 zero;
struct key_t key = {};
if (!include_idle && tid == 0)
return 0;
if (targ_pid != -1 && targ_pid != pid)
return 0;
if (targ_tid != -1 && targ_tid != tid)
return 0;
key.pid = pid;
bpf_get_current_comm(&key.name, sizeof(key.name));
if (user_stacks_only)
key.kern_stack_id = -1;
else
key.kern_stack_id = bpf_get_stackid(&ctx->regs, &stackmap, 0);
if (kernel_stacks_only)
key.user_stack_id = -1;
else
key.user_stack_id = bpf_get_stackid(&ctx->regs, &stackmap, BPF_F_USER_STACK);
if (key.kern_stack_id >= 0) {
// populate extras to fix the kernel stack
__u64 ip = PT_REGS_IP(&ctx->regs);
if (is_kernel_addr(ip)) {
key.kernel_ip = ip;
}
}
valp = bpf_map_lookup_or_try_init(&counts, &key, &zero);
if (valp)
__sync_fetch_and_add(valp, 1);
return 0;
}
```
In this way it can, depending on the user's options, trace only the user-space execution flow or only the kernel-space execution flow.
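The filtering and stack selection in do_perf_event can be sketched in plain C. This is an illustration under stated assumptions, not the tool's code: `struct sample_opts` and the three functions are hypothetical names that bundle the global options (`include_idle`, `targ_pid`, `targ_tid`, `user_stacks_only`, `kernel_stacks_only`) used by the eBPF program.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bundle of the options the eBPF program reads. */
struct sample_opts {
    bool include_idle;
    int32_t targ_pid;        /* -1 means "any process" */
    int32_t targ_tid;        /* -1 means "any thread" */
    bool user_stacks_only;
    bool kernel_stacks_only;
};

/* Decide whether a sample from the task identified by pid_tgid is kept. */
bool sample_wanted(const struct sample_opts *o, uint64_t pid_tgid)
{
    uint32_t pid = (uint32_t)(pid_tgid >> 32);  /* upper half: process id */
    uint32_t tid = (uint32_t)pid_tgid;          /* lower half: thread id */

    if (!o->include_idle && tid == 0)
        return false;
    if (o->targ_pid != -1 && (uint32_t)o->targ_pid != pid)
        return false;
    if (o->targ_tid != -1 && (uint32_t)o->targ_tid != tid)
        return false;
    return true;
}

/* A kernel stack is collected unless only user stacks were requested. */
bool want_kernel_stack(const struct sample_opts *o)
{
    return !o->user_stacks_only;
}

/* A user stack is collected unless only kernel stacks were requested. */
bool want_user_stack(const struct sample_opts *o)
{
    return !o->kernel_stacks_only;
}
```

When a stack is not wanted, the eBPF program stores -1 as its stack id, so the two halves of the key stay independent.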
### Usage in Eunomia
### Summary
`profile` analyzes a program's execution flow and can greatly improve developer efficiency in tasks such as debugging.

80
17-memleak/memleak.md Normal file
View File

@@ -0,0 +1,80 @@
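## eBPF Beginner's Practical Tutorial: Writing the eBPF Program Memleak to Monitor Memory Leaks
### Background
A memory leak is a serious problem for any program. If a leaking program is left running,
system memory is gradually exhausted and the program slows down noticeably. The `memleak` tool was proposed to prevent this.
It traces and matches memory allocation and free requests, and prints the stacks of allocations that have not yet been freed.
### How it works
`memleak`'s logic is straightforward. It attaches eBPF programs on the paths of the commonly used dynamic memory allocation functions,
and also on free. When an allocation function is called, `memleak` records basic data such as the caller's pid, the address of
the allocated memory and the allocated size. After a free, `memleak` deletes the corresponding allocation record
from the map. For common user-space allocators such as `malloc` and `calloc`, `memleak` attaches with uprobes; for
kernel functions such as `kmalloc`, it uses existing tracepoints.
The main attach points of `memleak` are: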
```c
SEC("uprobe/malloc")
SEC("uretprobe/malloc")
SEC("uprobe/calloc")
SEC("uretprobe/calloc")
SEC("uprobe/realloc")
SEC("uretprobe/realloc")
SEC("uprobe/memalign")
SEC("uretprobe/memalign")
SEC("uprobe/posix_memalign")
SEC("uretprobe/posix_memalign")
SEC("uprobe/valloc")
SEC("uretprobe/valloc")
SEC("uprobe/pvalloc")
SEC("uretprobe/pvalloc")
SEC("uprobe/aligned_alloc")
SEC("uretprobe/aligned_alloc")
SEC("uprobe/free")
SEC("tracepoint/kmem/kmalloc")
SEC("tracepoint/kmem/kfree")
SEC("tracepoint/kmem/kmalloc_node")
SEC("tracepoint/kmem/kmem_cache_alloc")
SEC("tracepoint/kmem/kmem_cache_alloc_node")
SEC("tracepoint/kmem/kmem_cache_free")
SEC("tracepoint/kmem/mm_page_alloc")
SEC("tracepoint/kmem/mm_page_free")
SEC("tracepoint/percpu/percpu_alloc_percpu")
SEC("tracepoint/percpu/percpu_free_percpu")
```
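The bookkeeping described above (record on allocation, delete on free, report what is left) can be sketched in user-space C. This is a toy model under stated assumptions, not memleak's implementation: the fixed-size array stands in for the BPF hash map, and `on_alloc`, `on_free`, `outstanding_bytes` and `MAX_ALLOCS` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ALLOCS 1024  /* arbitrary capacity for this sketch */

struct alloc_info {
    uint64_t addr;   /* address returned by the allocator */
    uint64_t size;   /* requested size in bytes */
    int pid;         /* caller's pid */
    int used;        /* slot occupied? */
};

static struct alloc_info allocs[MAX_ALLOCS];

/* Record an allocation; returns 0 on success, -1 if the table is full. */
int on_alloc(int pid, uint64_t addr, uint64_t size)
{
    for (size_t i = 0; i < MAX_ALLOCS; i++) {
        if (!allocs[i].used) {
            allocs[i] = (struct alloc_info){ addr, size, pid, 1 };
            return 0;
        }
    }
    return -1;
}

/* Forget the allocation that starts at addr, if it was recorded. */
void on_free(uint64_t addr)
{
    for (size_t i = 0; i < MAX_ALLOCS; i++) {
        if (allocs[i].used && allocs[i].addr == addr) {
            allocs[i].used = 0;
            return;
        }
    }
}

/* Sum of all recorded allocations that have not been freed yet. */
uint64_t outstanding_bytes(void)
{
    uint64_t total = 0;

    for (size_t i = 0; i < MAX_ALLOCS; i++)
        if (allocs[i].used)
            total += allocs[i].size;
    return total;
}
```

Whatever remains in the table when the tool reports is a leak candidate, not a proven leak: the program may simply not have freed it yet.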
### Usage in Eunomia
### Summary
`memleak` monitors and traces the memory allocation family of functions, which helps developers find leaks before they turn into
serious incidents.

121
18-biopattern/biolatency.md Normal file
View File

@@ -0,0 +1,121 @@
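## eBPF Beginner's Practical Tutorial: Writing the eBPF Program Biolatency to Count the I/O Events in the System
### Background
Biolatency counts the I/O events that occur in the system after the tool starts, computes how they are distributed over latency ranges, and
shows the result to the user as a histogram.
### How it works
Biolatency is implemented mainly with tracepoints: it sets handlers on the block_rq_insert, block_rq_issue and
block_rq_complete attach points. At the block_rq_insert and block_rq_issue attach points,
Biolatency stores the request (used as the map key) and the time at which the I/O operation started in a map.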
```c
int trace_rq_start(struct request *rq, int issue)
{
if (issue && targ_queued && BPF_CORE_READ(rq->q, elevator))
return 0;
u64 ts = bpf_ktime_get_ns();
if (filter_dev) {
struct gendisk *disk = get_disk(rq);
u32 dev;
dev = disk ? MKDEV(BPF_CORE_READ(disk, major),
BPF_CORE_READ(disk, first_minor)) : 0;
if (targ_dev != dev)
return 0;
}
bpf_map_update_elem(&start, &rq, &ts, 0);
return 0;
}
SEC("tp_btf/block_rq_insert")
int block_rq_insert(u64 *ctx)
{
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
if (LINUX_KERNEL_VERSION < KERNEL_VERSION(5, 11, 0))
return trace_rq_start((void *)ctx[1], false);
else
return trace_rq_start((void *)ctx[0], false);
}
SEC("tp_btf/block_rq_issue")
int block_rq_issue(u64 *ctx)
{
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
if (LINUX_KERNEL_VERSION < KERNEL_VERSION(5, 11, 0))
return trace_rq_start((void *)ctx[1], true);
else
return trace_rq_start((void *)ctx[0], true);
}
```
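At the block_rq_complete attach point, Biolatency uses the request to read from the map
the time recorded when the operation started, computes the difference from the current time to find the histogram bucket it falls into, and increments
the I/O count of that bucket.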
```c
SEC("tp_btf/block_rq_complete")
int BPF_PROG(block_rq_complete, struct request *rq, int error,
unsigned int nr_bytes)
{
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
u64 slot, *tsp, ts = bpf_ktime_get_ns();
struct hist_key hkey = {};
struct hist *histp;
s64 delta;
tsp = bpf_map_lookup_elem(&start, &rq);
if (!tsp)
return 0;
delta = (s64)(ts - *tsp);
if (delta < 0)
goto cleanup;
if (targ_per_disk) {
struct gendisk *disk = get_disk(rq);
hkey.dev = disk ? MKDEV(BPF_CORE_READ(disk, major),
BPF_CORE_READ(disk, first_minor)) : 0;
}
if (targ_per_flag)
hkey.cmd_flags = rq->cmd_flags;
histp = bpf_map_lookup_elem(&hists, &hkey);
if (!histp) {
bpf_map_update_elem(&hists, &hkey, &initial_hist, 0);
histp = bpf_map_lookup_elem(&hists, &hkey);
if (!histp)
goto cleanup;
}
if (targ_ms)
delta /= 1000000U;
else
delta /= 1000U;
slot = log2l(delta);
if (slot >= MAX_SLOTS)
slot = MAX_SLOTS - 1;
__sync_fetch_and_add(&histp->slots[slot], 1);
cleanup:
bpf_map_delete_elem(&start, &rq);
return 0;
}
```
When the user stops the program, the user-space program reads the data in the histogram map and prints it.
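How the user-space side might turn slot counts back into readable ranges can be sketched as follows. This is an assumption-based illustration rather than the tool's printing code: `slot_bounds` and `print_hist` are hypothetical names, and the bounds assume that slot `i` was filled with values whose floor(log2) is `i` (with 0 and 1 both landing in slot 0).

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: value range covered by a power-of-two slot.
 * Slot 0 covers 0..1, slot i covers 2^i .. 2^(i+1)-1.
 * Only valid for slot < 63.
 */
void slot_bounds(unsigned int slot, uint64_t *low, uint64_t *high)
{
    *low = slot ? (1ULL << slot) : 0;
    *high = (1ULL << (slot + 1)) - 1;
}

/* Hypothetical helper: print one line per non-empty slot. */
void print_hist(const uint32_t *slots, unsigned int n, const char *unit)
{
    for (unsigned int i = 0; i < n; i++) {
        uint64_t low, high;

        if (!slots[i])
            continue;
        slot_bounds(i, &low, &high);
        printf("%10llu -> %-10llu %s : %u\n",
               (unsigned long long)low, (unsigned long long)high,
               unit, (unsigned int)slots[i]);
    }
}
```

The unit string would be "usecs" or "msecs" depending on whether the eBPF side divided the delta by 1000 or by 1000000.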
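### Usage in Eunomia
### Summary
Biolatency uses tracepoint attach points to count I/O events and presents them as a
histogram, which makes it easy for developers to understand the system's I/O behavior.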

View File

@@ -0,0 +1,48 @@
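## eBPF Beginner's Practical Tutorial: Writing the eBPF Program Biopattern to Count Random/Sequential Disk I/O
### Background
Biopattern reports the ratio of random to sequential disk I/O operations.
### How it works
Biopattern's eBPF code is attached to the tracepoint/block/block_rq_complete attach point. When a disk finishes an I/O request,
execution passes through this attach point. Biopattern keeps a hash table keyed by device number. When the attach point is hit, Biopattern
takes the operation's information, uses the device's previous operation recorded in the hash table to decide whether this operation is random or sequential I/O, and updates the counters.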
```c
SEC("tracepoint/block/block_rq_complete")
int handle__block_rq_complete(struct trace_event_raw_block_rq_complete *ctx)
{
sector_t *last_sectorp, sector = ctx->sector;
struct counter *counterp, zero = {};
u32 nr_sector = ctx->nr_sector;
dev_t dev = ctx->dev;
if (targ_dev != -1 && targ_dev != dev)
return 0;
counterp = bpf_map_lookup_or_try_init(&counters, &dev, &zero);
if (!counterp)
return 0;
if (counterp->last_sector) {
if (counterp->last_sector == sector)
__sync_fetch_and_add(&counterp->sequential, 1);
else
__sync_fetch_and_add(&counterp->random, 1);
__sync_fetch_and_add(&counterp->bytes, nr_sector * 512);
}
counterp->last_sector = sector + nr_sector;
return 0;
}
```
After the user stops Biopattern, the user-space program reads the collected counters and prints them.
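The classification rule is simple enough to run in user space: a request is sequential when it starts exactly where the previous request on the same device ended. The sketch below mirrors that rule for a single device; `account_io` is a hypothetical name and the struct is a simplified stand-in for the tool's per-device counter.

```c
#include <stdint.h>

/* Simplified per-device counter for this sketch. */
struct counter {
    uint64_t last_sector;  /* sector right after the previous request */
    uint64_t bytes;        /* bytes counted so far */
    uint32_t sequential;   /* requests that continued the previous one */
    uint32_t random;       /* requests that started somewhere else */
};

/*
 * Hypothetical helper: classify one completed request.
 * The very first request only seeds last_sector and is not counted,
 * because there is nothing to compare it with.
 */
void account_io(struct counter *c, uint64_t sector, uint32_t nr_sector)
{
    if (c->last_sector) {
        if (c->last_sector == sector)
            c->sequential++;
        else
            c->random++;
        c->bytes += (uint64_t)nr_sector * 512;  /* 512-byte sectors */
    }
    c->last_sector = sector + nr_sector;
}
```

For example, requests at sectors 100, 108 and 500 (8 sectors each) yield one sequential and one random request, with the first request only used as the starting point.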
### Usage in Eunomia
Not integrated yet.
### Summary
Biopattern shows the ratio of random to sequential disk I/O, which helps developers understand the overall I/O picture.

100
18-biopattern/biostacks.md Normal file
View File

@@ -0,0 +1,100 @@
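## eBPF Beginner's Practical Tutorial: Writing the eBPF Program Biostacks to Monitor Kernel I/O Latency
### Background
Some disk I/O is not issued directly by applications, for example metadata reads and writes, so capturing disk I/O information directly may
show I/O operations that cannot be explained. Biostacks therefore traces the kernel functions that initialize I/O operations and shows
disk I/O latency as a histogram.
### How it works
Biostacks attaches to fentry/blk_account_io_start, kprobe/blk_account_io_merge_bio and
fentry/blk_account_io_done. fentry/blk_account_io_start and kprobe/blk_account_io_merge_bio
are both on the initialization path the kernel must go through to start an I/O operation. When execution passes through them, Biostacks stores data in a
map keyed by the request.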
```c
static __always_inline
int trace_start(void *ctx, struct request *rq, bool merge_bio)
{
struct internal_rqinfo *i_rqinfop = NULL, i_rqinfo = {};
struct gendisk *disk = BPF_CORE_READ(rq, rq_disk);
dev_t dev;
dev = disk ? MKDEV(BPF_CORE_READ(disk, major),
BPF_CORE_READ(disk, first_minor)) : 0;
if (targ_dev != -1 && targ_dev != dev)
return 0;
if (merge_bio)
i_rqinfop = bpf_map_lookup_elem(&rqinfos, &rq);
if (!i_rqinfop)
i_rqinfop = &i_rqinfo;
i_rqinfop->start_ts = bpf_ktime_get_ns();
i_rqinfop->rqinfo.pid = bpf_get_current_pid_tgid();
i_rqinfop->rqinfo.kern_stack_size =
bpf_get_stack(ctx, i_rqinfop->rqinfo.kern_stack,
sizeof(i_rqinfop->rqinfo.kern_stack), 0);
/* sizeof the comm array itself, not of a pointer to it */
bpf_get_current_comm(&i_rqinfop->rqinfo.comm,
sizeof(i_rqinfop->rqinfo.comm));
i_rqinfop->rqinfo.dev = dev;
if (i_rqinfop == &i_rqinfo)
bpf_map_update_elem(&rqinfos, &rq, i_rqinfop, 0);
return 0;
}
SEC("fentry/blk_account_io_start")
int BPF_PROG(blk_account_io_start, struct request *rq)
{
return trace_start(ctx, rq, false);
}
SEC("kprobe/blk_account_io_merge_bio")
int BPF_KPROBE(blk_account_io_merge_bio, struct request *rq)
{
return trace_start(ctx, rq, true);
}
```
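The device filter at the top of trace_start combines the disk's major and first minor number into a single device number and compares it with the target. A user-space sketch of that check follows; it assumes the kernel-internal `dev_t` layout with 20 minor bits, and `dev_allowed` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed kernel-internal dev_t layout: major above 20 minor bits. */
#define MINORBITS 20
#define MKDEV(ma, mi) ((((uint32_t)(ma)) << MINORBITS) | (uint32_t)(mi))

/*
 * Hypothetical helper: true when the request's device passes the
 * filter. A target of (uint32_t)-1 means "trace every device".
 */
bool dev_allowed(uint32_t targ_dev, uint32_t major, uint32_t minor)
{
    uint32_t dev = MKDEV(major, minor);

    return targ_dev == (uint32_t)-1 || targ_dev == dev;
}
```

Note that this encoding differs from the one user space sees through `stat`, so a front end has to build the target with the same layout before passing it to the eBPF program.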
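After the I/O operation completes, the handler at fentry/blk_account_io_done reads the previously stored information from the map, computes
the difference from the current time to obtain the I/O latency, and updates the map that holds the histogram data.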
```c
SEC("fentry/blk_account_io_done")
int BPF_PROG(blk_account_io_done, struct request *rq)
{
u64 slot, ts = bpf_ktime_get_ns();
struct internal_rqinfo *i_rqinfop;
struct rqinfo *rqinfop;
struct hist *histp;
s64 delta;
i_rqinfop = bpf_map_lookup_elem(&rqinfos, &rq);
if (!i_rqinfop)
return 0;
delta = (s64)(ts - i_rqinfop->start_ts);
if (delta < 0)
goto cleanup;
histp = bpf_map_lookup_or_try_init(&hists, &i_rqinfop->rqinfo, &zero);
if (!histp)
goto cleanup;
if (targ_ms)
delta /= 1000000U;
else
delta /= 1000U;
slot = log2l(delta);
if (slot >= MAX_SLOTS)
slot = MAX_SLOTS - 1;
__sync_fetch_and_add(&histp->slots[slot], 1);
cleanup:
bpf_map_delete_elem(&rqinfos, &rq);
return 0;
}
```
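After the user asks the program to exit, its user-space part reads the histogram map and prints it.
### Usage in Eunomia
### Summary
Biostacks traces I/O operations from their origin, which makes it much easier to understand disk I/O behavior.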

63
18-biopattern/bitesize.md Normal file
View File

@@ -0,0 +1,63 @@
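## eBPF Beginner's Practical Tutorial: Writing the eBPF Program Bitesize to Monitor Block Device I/O
### Background
Bitesize was developed to show the size of the disk blocks that I/O operations request. Once started, it traces
the block sizes requested by different processes and displays their distribution as a histogram.
### How it works
Bitesize attaches its handler to the block_rq_issue tracepoint. When a process issues a block I/O request to a disk,
execution passes through this attach point; the handler then obtains the request's information and stores it in the corresponding map.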
```c
static int trace_rq_issue(struct request *rq)
{
struct hist_key hkey;
struct hist *histp;
u64 slot;
if (filter_dev) {
struct gendisk *disk = get_disk(rq);
u32 dev;
dev = disk ? MKDEV(BPF_CORE_READ(disk, major),
BPF_CORE_READ(disk, first_minor)) : 0;
if (targ_dev != dev)
return 0;
}
bpf_get_current_comm(&hkey.comm, sizeof(hkey.comm));
if (!comm_allowed(hkey.comm))
return 0;
histp = bpf_map_lookup_elem(&hists, &hkey);
if (!histp) {
bpf_map_update_elem(&hists, &hkey, &initial_hist, 0);
histp = bpf_map_lookup_elem(&hists, &hkey);
if (!histp)
return 0;
}
slot = log2l(rq->__data_len / 1024);
if (slot >= MAX_SLOTS)
slot = MAX_SLOTS - 1;
__sync_fetch_and_add(&histp->slots[slot], 1);
return 0;
}
SEC("tp_btf/block_rq_issue")
int BPF_PROG(block_rq_issue)
{
if (LINUX_KERNEL_VERSION >= KERNEL_VERSION(5, 11, 0))
return trace_rq_issue((void *)ctx[0]);
else
return trace_rq_issue((void *)ctx[1]);
}
```
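After the user stops the tool, its user-space code reads the data stored in the map and shows the results per process.
### Usage in Eunomia
### Summary
Bitesize works at process granularity, which gives developers a better view of each program's disk I/O requests.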

81
19-syscount/syscount.md Normal file
View File

@@ -0,0 +1,81 @@
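## eBPF Beginner's Practical Tutorial: Writing the eBPF Program syscount to Monitor Slow System Calls
### Background
`syscount` counts the system calls made by the whole system or by one process, reporting either their number or their total time.
### How it works
`syscount`'s logic is straightforward: it attaches handlers to the two `tracepoint`s `sys_enter` and `sys_exit`.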
```c
SEC("tracepoint/raw_syscalls/sys_enter")
int sys_enter(struct trace_event_raw_sys_enter *args)
{
u64 id = bpf_get_current_pid_tgid();
pid_t pid = id >> 32;
u32 tid = id;
u64 ts;
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
if (filter_pid && pid != filter_pid)
return 0;
ts = bpf_ktime_get_ns();
bpf_map_update_elem(&start, &tid, &ts, 0);
return 0;
}
SEC("tracepoint/raw_syscalls/sys_exit")
int sys_exit(struct trace_event_raw_sys_exit *args)
{
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
u64 id = bpf_get_current_pid_tgid();
static const struct data_t zero;
pid_t pid = id >> 32;
struct data_t *val;
u64 *start_ts, lat = 0;
u32 tid = id;
u32 key;
/* this happens when there is an interrupt */
if (args->id == -1)
return 0;
if (filter_pid && pid != filter_pid)
return 0;
if (filter_failed && args->ret >= 0)
return 0;
if (filter_errno && args->ret != -filter_errno)
return 0;
if (measure_latency) {
start_ts = bpf_map_lookup_elem(&start, &tid);
if (!start_ts)
return 0;
lat = bpf_ktime_get_ns() - *start_ts;
}
key = (count_by_process) ? pid : args->id;
val = bpf_map_lookup_or_try_init(&data, &key, &zero);
if (val) {
__sync_fetch_and_add(&val->count, 1);
if (count_by_process)
save_proc_name(val);
if (measure_latency)
__sync_fetch_and_add(&val->total_ns, lat);
}
return 0;
}
```
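When a syscall starts, `syscount` records the tid and the start time in a map. When the syscall completes, `syscount` counts the call and,
if the user asked for latency, adds the syscall's duration to the total.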
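Two details of sys_exit are easy to miss: how the 64-bit value from `bpf_get_current_pid_tgid()` is split, and how the statistics key depends on `count_by_process`. The following user-space sketch illustrates both. The names `split_pid_tgid`, `stat_key`, `struct sys_stat` and `account_syscall` are hypothetical and exist only in this example.

```c
#include <stdbool.h>
#include <stdint.h>

struct ids {
    uint32_t tgid;  /* process id as user space sees it (upper 32 bits) */
    uint32_t tid;   /* thread id (lower 32 bits) */
};

/* Split the combined id into its two halves. */
struct ids split_pid_tgid(uint64_t id)
{
    struct ids r = { .tgid = (uint32_t)(id >> 32), .tid = (uint32_t)id };

    return r;
}

/* Key of the statistics entry: the process id or the syscall number. */
uint32_t stat_key(bool count_by_process, uint64_t pid_tgid,
                  uint32_t syscall_id)
{
    return count_by_process ? split_pid_tgid(pid_tgid).tgid : syscall_id;
}

/* Simplified statistics entry for this sketch. */
struct sys_stat {
    uint64_t count;
    uint64_t total_ns;
};

/* Count one syscall and, if requested, add its duration. */
void account_syscall(struct sys_stat *s, bool measure_latency,
                     uint64_t start_ns, uint64_t end_ns)
{
    s->count++;
    if (measure_latency)
        s->total_ns += end_ns - start_ns;
}
```

The start timestamp is stored per thread (tid) because each thread can be inside at most one syscall at a time, while the statistics are aggregated per process or per syscall.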
### Usage in Eunomia
### Summary
`syscount` makes it fairly easy to track the system calls made by one process or by the whole system.

6
2-fentry-unlink/.gitignore vendored Normal file
View File

@@ -0,0 +1,6 @@
.vscode
package.json
*.o
*.skel.json
*.skel.yaml
package.yaml

76
2-fentry-unlink/README.md Normal file
View File

@@ -0,0 +1,76 @@
---
layout: post
title: fentry-link
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, examples, fentry, no-output]
summary: an example that uses fentry and fexit BPF programs to trace file deletion
---
## Fentry
`fentry` is an example that uses fentry and fexit BPF programs for tracing. It
attaches `fentry` and `fexit` traces to `do_unlinkat()` which is called when a
file is deleted and logs the return value, PID, and filename to the
trace pipe.
Important differences, compared to kprobes, are improved performance and
usability. In this example, better usability is shown with the ability to
directly dereference pointer arguments, like in normal C, instead of using
various read helpers. The big distinction between **fexit** and **kretprobe**
programs is that an fexit program has access to both the input arguments and the returned
result, while a kretprobe can only access the result.
fentry and fexit programs are available starting from 5.5 kernels.
```console
$ sudo ecli examples/bpftools/fentry-link/package.json
Runing eBPF program...
```
The `fentry` output in `/sys/kernel/debug/tracing/trace_pipe` should look
something like this:
```console
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
rm-9290 [004] d..2 4637.798698: bpf_trace_printk: fentry: pid = 9290, filename = test_file
rm-9290 [004] d..2 4637.798843: bpf_trace_printk: fexit: pid = 9290, filename = test_file, ret = 0
rm-9290 [004] d..2 4637.798698: bpf_trace_printk: fentry: pid = 9290, filename = test_file2
rm-9290 [004] d..2 4637.798843: bpf_trace_printk: fexit: pid = 9290, filename = test_file2, ret = 0
```
## Run
- Compile:
```console
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
or
```console
$ ecc fentry-link.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
- Run and help:
```console
sudo ecli examples/bpftools/fentry-link/package.json -h
Usage: fentry_link_bpf [--help] [--version] [--verbose]
A simple eBPF program
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
--verbose prints libbpf debug information
Built with eunomia-bpf framework.
See https://github.com/eunomia-bpf/eunomia-bpf for more information.
```

View File

@@ -0,0 +1,27 @@
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2021 Sartura */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
char LICENSE[] SEC("license") = "Dual BSD/GPL";
SEC("fentry/do_unlinkat")
int BPF_PROG(do_unlinkat, int dfd, struct filename *name)
{
pid_t pid;
pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("fentry: pid = %d, filename = %s\n", pid, name->name);
return 0;
}
SEC("fexit/do_unlinkat")
int BPF_PROG(do_unlinkat_exit, int dfd, struct filename *name, long ret)
{
pid_t pid;
pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("fexit: pid = %d, filename = %s, ret = %ld\n", pid, name->name, ret);
return 0;
}

75
21-llcstat/llcstat.md Normal file
View File

@@ -0,0 +1,75 @@
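## eBPF Beginner's Practical Tutorial: Writing the eBPF Program llcstat to Monitor Cache Misses and Cache References
### Background
To optimize a program's performance, developers sometimes need to think about how to reduce cache misses.
But how many cache misses a program actually incurs is hard to answer. `llcstat` uses
eBPF to track cache misses and cache references, which makes it much easier for developers
to debug programs and optimize performance.
### How it works
`llcstat` builds on the Linux `perf_event` mechanism. When the program is loaded from user space,
it opens the `perf_event`s and attaches the eBPF programs to them.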
```c
if (open_and_attach_perf_event(PERF_COUNT_HW_CACHE_MISSES,
env.sample_period,
obj->progs.on_cache_miss, mlinks))
goto cleanup;
if (open_and_attach_perf_event(PERF_COUNT_HW_CACHE_REFERENCES,
env.sample_period,
obj->progs.on_cache_ref, rlinks))
```
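Meanwhile, in the kernel `llcstat` has handlers attached to the `perf_event`s. Each time a perf event
fires, the handler runs, adds to the count, and writes the result to the corresponding map.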
```c
static __always_inline
int trace_event(__u64 sample_period, bool miss)
{
struct key_info key = {};
struct value_info *infop, zero = {};
u64 pid_tgid = bpf_get_current_pid_tgid();
key.cpu = bpf_get_smp_processor_id();
key.pid = pid_tgid >> 32;
if (targ_per_thread)
key.tid = (u32)pid_tgid;
else
key.tid = key.pid;
infop = bpf_map_lookup_or_try_init(&infos, &key, &zero);
if (!infop)
return 0;
if (miss)
infop->miss += sample_period;
else
infop->ref += sample_period;
bpf_get_current_comm(infop->comm, sizeof(infop->comm));
return 0;
}
SEC("perf_event")
int on_cache_miss(struct bpf_perf_event_data *ctx)
{
return trace_event(ctx->sample_period, true);
}
SEC("perf_event")
int on_cache_ref(struct bpf_perf_event_data *ctx)
{
return trace_event(ctx->sample_period, false);
}
```
The user-space program reads the cache miss and cache reference counts stored in the map and
displays them per process.
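Because each perf event stands for `sample_period` hardware events, the handlers add the period rather than 1. From the two totals a hit ratio can be derived. The sketch below shows that arithmetic in user space; `struct llc_info`, `on_sample` and `llc_hit_percent` are hypothetical names, and the hit formula is an assumption about how a front end might summarize the counters.

```c
#include <stdint.h>

/* Simplified per-task totals for this sketch. */
struct llc_info {
    uint64_t ref;   /* estimated cache references */
    uint64_t miss;  /* estimated cache misses */
};

/* One perf sample represents sample_period underlying events. */
void on_sample(struct llc_info *info, uint64_t sample_period, int is_miss)
{
    if (is_miss)
        info->miss += sample_period;
    else
        info->ref += sample_period;
}

/*
 * Hypothetical summary: percentage of references that were hits.
 * Sampling can make miss exceed ref, so the hit count is clamped at 0,
 * and an empty entry reports 0 instead of dividing by zero.
 */
double llc_hit_percent(uint64_t ref, uint64_t miss)
{
    uint64_t hit = ref > miss ? ref - miss : 0;

    if (ref == 0)
        return 0.0;
    return (double)hit * 100.0 / (double)ref;
}
```

The resulting numbers are estimates: a smaller sample period gives finer counts at the cost of more overhead.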
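### Usage in Eunomia
### Summary
`llcstat` uses eBPF to show, concisely and efficiently, how many cache misses and cache
references a given thread incurs, giving developers a clearer quantitative metric when optimizing programs.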

6
3-kprobe-unlink/.gitignore vendored Normal file
View File

@@ -0,0 +1,6 @@
.vscode
package.json
*.o
*.skel.json
*.skel.yaml
package.yaml

55
3-kprobe-unlink/README.md Normal file
View File

@@ -0,0 +1,55 @@
---
layout: post
title: kprobe-link
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, examples, kprobe, no-output]
summary: an example of dealing with kernel-space entry and exit (return) probes, `kprobe` and `kretprobe` in libbpf lingo
---
`kprobe` is an example of dealing with kernel-space entry and exit (return)
probes, `kprobe` and `kretprobe` in libbpf lingo. It attaches `kprobe` and
`kretprobe` BPF programs to the `do_unlinkat()` function and logs the PID,
filename, and return result, respectively, using `bpf_printk()` macro.
```console
$ sudo ecli examples/bpftools/kprobe-link/package.json
Runing eBPF program...
```
The `kprobe` demo output in `/sys/kernel/debug/tracing/trace_pipe` should look
something like this:
```shell
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
rm-9346 [005] d..3 4710.951696: bpf_trace_printk: KPROBE ENTRY pid = 9346, filename = test1
rm-9346 [005] d..4 4710.951819: bpf_trace_printk: KPROBE EXIT: ret = 0
rm-9346 [005] d..3 4710.951852: bpf_trace_printk: KPROBE ENTRY pid = 9346, filename = test2
rm-9346 [005] d..4 4710.951895: bpf_trace_printk: KPROBE EXIT: ret = 0
```
## Run
Compile with docker:
```console
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
or compile with `ecc`:
```console
$ ecc kprobe-link.bpf.c
Compiling bpf object...
Packing ebpf object and config into package.json...
```
Run:
```console
sudo ecli examples/bpftools/kprobe-link/package.json
```

View File

@@ -0,0 +1,30 @@
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2021 Sartura */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
char LICENSE[] SEC("license") = "Dual BSD/GPL";
SEC("kprobe/do_unlinkat")
int BPF_KPROBE(do_unlinkat, int dfd, struct filename *name)
{
pid_t pid;
const char *filename;
pid = bpf_get_current_pid_tgid() >> 32;
filename = BPF_CORE_READ(name, name);
bpf_printk("KPROBE ENTRY pid = %d, filename = %s\n", pid, filename);
return 0;
}
SEC("kretprobe/do_unlinkat")
int BPF_KRETPROBE(do_unlinkat_exit, long ret)
{
pid_t pid;
pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("KPROBE EXIT: pid = %d, ret = %ld\n", pid, ret);
return 0;
}

7
4-opensnoop/.gitignore vendored Normal file
View File

@@ -0,0 +1,7 @@
.vscode
package.json
eunomia-exporter
ecli
*.bpf.o
*.skel.json
*.skel.yaml

263
4-opensnoop/1_opensnoop.md Normal file
View File

@@ -0,0 +1,263 @@
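## eBPF Beginner's Practical Tutorial: Writing an eBPF Program to Monitor Opened File Paths and Visualize Them with Prometheus
### Background
By monitoring the open system call, `opensnoop` shows information about every process in the system that calls open.
### Run with ecli in one step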
```console
$ # Download and install the ecli binary
$ wget https://aka.pw/bpf-ecli -O ./ecli && chmod +x ./ecli
$ # Run directly from a URL
$ ./ecli run https://eunomia-bpf.github.io/eunomia-bpf/opensnoop/package.json
running and waiting for the ebpf events from perf event...
time ts pid uid ret flags comm fname
00:58:08 0 812 0 9 524288 vmtoolsd /etc/mtab
00:58:08 0 812 0 11 0 vmtoolsd /proc/devices
00:58:08 0 34351 0 24 524288 ecli /etc/localtime
00:58:08 0 812 0 9 0 vmtoolsd /sys/class/block/sda5/../device/../../../class
00:58:08 0 812 0 -2 0 vmtoolsd /sys/class/block/sda5/../device/../../../label
00:58:08 0 812 0 9 0 vmtoolsd /sys/class/block/sda1/../device/../../../class
00:58:08 0 812 0 -2 0 vmtoolsd /sys/class/block/sda1/../device/../../../label
00:58:08 0 812 0 9 0 vmtoolsd /run/systemd/resolve/resolv.conf
00:58:08 0 812 0 9 0 vmtoolsd /proc/net/route
00:58:08 0 812 0 9 0 vmtoolsd /proc/net/ipv6_route
```
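### Implementation
With eunomia-bpf you only need to write the kernel-space program; no user-space helper framework code is required. The code you write has two parts:
- the header opensnoop.h, which defines the C structs to export;
- the source file opensnoop.bpf.c, which defines the BPF code.
Header opensnoop.h: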
```c
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __OPENSNOOP_H
#define __OPENSNOOP_H
#define TASK_COMM_LEN 16
#define NAME_MAX 255
#define INVALID_UID ((uid_t)-1)
// used for export event
struct event {
/* user terminology for pid: */
unsigned long long ts;
int pid;
int uid;
int ret;
int flags;
char comm[TASK_COMM_LEN];
char fname[NAME_MAX];
};
#endif /* __OPENSNOOP_H */
```
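The logic of `opensnoop` is fairly simple. It attaches handler functions to the `sys_enter_open` and `sys_enter_openat` tracepoints,
so a handler is triggered whenever an open system call is made. Likewise, `opensnoop` attaches handlers to the
corresponding `sys_exit_open` and `sys_exit_openat` tracepoints.
Source file opensnoop.bpf.c: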
```c
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2019 Facebook
// Copyright (c) 2020 Netflix
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "opensnoop.h"
struct args_t {
const char *fname;
int flags;
};
const volatile pid_t targ_pid = 0;
const volatile pid_t targ_tgid = 0;
const volatile uid_t targ_uid = INVALID_UID; /* 0 here would restrict tracing to root */
const volatile bool targ_failed = false;
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10240);
__type(key, u32);
__type(value, struct args_t);
} start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
static __always_inline bool valid_uid(uid_t uid) {
return uid != INVALID_UID;
}
static __always_inline
bool trace_allowed(u32 tgid, u32 pid)
{
u32 uid;
/* filters */
if (targ_tgid && targ_tgid != tgid)
return false;
if (targ_pid && targ_pid != pid)
return false;
if (valid_uid(targ_uid)) {
uid = (u32)bpf_get_current_uid_gid();
if (targ_uid != uid) {
return false;
}
}
return true;
}
SEC("tracepoint/syscalls/sys_enter_open")
int tracepoint__syscalls__sys_enter_open(struct trace_event_raw_sys_enter* ctx)
{
u64 id = bpf_get_current_pid_tgid();
/* use kernel terminology here for tgid/pid: */
u32 tgid = id >> 32;
u32 pid = id;
/* store arg info for later lookup */
if (trace_allowed(tgid, pid)) {
struct args_t args = {};
args.fname = (const char *)ctx->args[0];
args.flags = (int)ctx->args[1];
bpf_map_update_elem(&start, &pid, &args, 0);
}
return 0;
}
SEC("tracepoint/syscalls/sys_enter_openat")
int tracepoint__syscalls__sys_enter_openat(struct trace_event_raw_sys_enter* ctx)
{
u64 id = bpf_get_current_pid_tgid();
/* use kernel terminology here for tgid/pid: */
u32 tgid = id >> 32;
u32 pid = id;
/* store arg info for later lookup */
if (trace_allowed(tgid, pid)) {
struct args_t args = {};
args.fname = (const char *)ctx->args[1];
args.flags = (int)ctx->args[2];
bpf_map_update_elem(&start, &pid, &args, 0);
}
return 0;
}
static __always_inline
int trace_exit(struct trace_event_raw_sys_exit* ctx)
{
struct event event = {};
struct args_t *ap;
int ret;
u32 pid = bpf_get_current_pid_tgid();
ap = bpf_map_lookup_elem(&start, &pid);
if (!ap)
return 0; /* missed entry */
ret = ctx->ret;
if (targ_failed && ret >= 0)
goto cleanup; /* want failed only */
/* event data */
event.pid = bpf_get_current_pid_tgid() >> 32;
event.uid = bpf_get_current_uid_gid();
bpf_get_current_comm(&event.comm, sizeof(event.comm));
bpf_probe_read_user_str(&event.fname, sizeof(event.fname), ap->fname);
event.flags = ap->flags;
event.ret = ret;
/* emit event */
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
&event, sizeof(event));
cleanup:
bpf_map_delete_elem(&start, &pid);
return 0;
}
SEC("tracepoint/syscalls/sys_exit_open")
int tracepoint__syscalls__sys_exit_open(struct trace_event_raw_sys_exit* ctx)
{
return trace_exit(ctx);
}
SEC("tracepoint/syscalls/sys_exit_openat")
int tracepoint__syscalls__sys_exit_openat(struct trace_event_raw_sys_exit* ctx)
{
return trace_exit(ctx);
}
char LICENSE[] SEC("license") = "GPL";
```
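In the enter stage, `opensnoop` stores the file name pointer and the flags passed to the call in a map, keyed by the calling thread's ID. In the exit stage, `opensnoop`
looks that entry up by the same ID, combines it with the return value and the caller's pid, uid, and comm, and sends the result through the perf event array to the user-space handler, which displays it.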
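The enter/exit pairing can be modeled in plain user-space C. The sketch below is not BPF code: a small fixed-size table stands in for the `start` hash map, and the names `pending_open`, `pending_put`, `pending_take`, and `MAX_PENDING` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64 /* hypothetical capacity; the BPF map allows 10240 entries */

/* Mirrors struct args_t plus the key: what the enter hook remembers per thread. */
struct pending_open {
    bool used;
    uint32_t tid;      /* kernel "pid" (thread ID), used as the map key */
    const char *fname; /* path argument seen at syscall entry */
    int flags;         /* flags argument seen at syscall entry */
};

static struct pending_open pending[MAX_PENDING];

/* Enter hook: remember the arguments, like bpf_map_update_elem(&start, ...). */
static bool pending_put(uint32_t tid, const char *fname, int flags)
{
    struct pending_open *slot = NULL;

    assert(fname != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (pending[i].used && pending[i].tid == tid) {
            slot = &pending[i]; /* same thread again: overwrite */
            break;
        }
        if (!pending[i].used && slot == NULL)
            slot = &pending[i]; /* first free slot */
    }
    if (slot == NULL)
        return false; /* table full */
    slot->used = true;
    slot->tid = tid;
    slot->fname = fname;
    slot->flags = flags;
    return true;
}

/* Exit hook: look the entry up and delete it, like lookup + delete on the map. */
static bool pending_take(uint32_t tid, struct pending_open *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (pending[i].used && pending[i].tid == tid) {
            *out = pending[i];
            pending[i].used = false;
            return true;
        }
    }
    return false; /* missed entry: the enter hook never ran or was filtered out */
}
```

Keying by thread ID works because a thread can only be inside one system call at a time, so at most one entry per thread is pending between enter and exit.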
For the complete example code, see https://github.com/eunomia-bpf/eunomia-bpf/tree/master/examples/bpftools/opensnoop
Put the header file and the source file in a directory of their own, then compile and run:
```bash
$ # Compile in a container; this generates a package.json file containing the compiled code and some auxiliary information
$ docker run -it -v /path/to/opensnoop:/src yunwei37/ebpm:latest
$ # Run the eBPF program (root shell)
$ sudo ecli run package.json
```
### Prometheus visualization
Write a yaml configuration file:
```yaml
programs:
- name: opensnoop
metrics:
counters:
- name: eunomia_file_open_counter
description: test
labels:
- name: pid
- name: comm
- name: filename
from: fname
compiled_ebpf_filename: package.json
```
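To make the mapping from event fields to metric labels concrete, here is a minimal sketch of how one sample of this counter could be rendered in the Prometheus text exposition format. It is an illustration only: the exact output of eunomia-exporter is an assumption, and `escape_label` and `format_counter` are hypothetical helpers, not part of eunomia-bpf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Copy src into dst, escaping backslash, double quote and newline as the
 * Prometheus text exposition format requires inside label values. */
static void escape_label(char *dst, size_t cap, const char *src)
{
    size_t pos = 0;

    assert(dst != NULL && src != NULL && cap > 0);
    for (; *src != '\0' && pos + 2 < cap; src++) {
        if (*src == '\n') {
            dst[pos++] = '\\';
            dst[pos++] = 'n';
        } else {
            if (strchr("\\\"", *src) != NULL)
                dst[pos++] = '\\';
            dst[pos++] = *src;
        }
    }
    dst[pos] = '\0';
}

/* Render one sample of the counter declared in the yaml above.
 * The "filename" label is filled from the event's fname field. */
static int format_counter(char *out, size_t cap, int pid, const char *comm,
                          const char *fname, unsigned long long value)
{
    char comm_esc[64];
    char fname_esc[600];

    escape_label(comm_esc, sizeof(comm_esc), comm);
    escape_label(fname_esc, sizeof(fname_esc), fname);
    return snprintf(out, cap,
                    "eunomia_file_open_counter{pid=\"%d\",comm=\"%s\",filename=\"%s\"} %llu\n",
                    pid, comm_esc, fname_esc, value);
}
```

Each distinct combination of `pid`, `comm`, and `filename` becomes its own time series, so a busy system can produce a large number of series from this configuration.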
Use eunomia-exporter to export the data to Prometheus:
- Download eunomia-exporter from https://github.com/eunomia-bpf/eunomia-bpf/releases
```console
$ ls
config.yaml eunomia-exporter package.json
$ sudo ./eunomia-exporter
Running ebpf program opensnoop takes 46 ms
Listening on http://127.0.0.1:8526
running and waiting for the ebpf events from perf event...
Receiving request at path /metrics
```
![result](../img/opensnoop_prometheus.png)
### Summary and references
By tracing the open system call, `opensnoop` lets users conveniently see which processes in the system are currently calling open.
References:
- Source code: https://github.com/eunomia-bpf/eunomia-bpf/tree/master/examples/bpftools/opensnoop
- libbpf reference code: https://github.com/iovisor/bcc/blob/master/libbpf-tools/opensnoop.bpf.c
- eunomia-bpf manual: https://eunomia-bpf.github.io/

281
4-opensnoop/README.md Normal file

@@ -0,0 +1,281 @@
---
layout: post
title: opensnoop
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, syscall]
summary: opensnoop traces the open() syscall system-wide, and prints various details.
---
## origin
The kernel code originates from:
<https://github.com/iovisor/bcc/blob/master/libbpf-tools/opensnoop.bpf.c>
result:
```console
$ sudo ecli examples/bpftools/opensnoop/package.json -h
Usage: opensnoop_bpf [--help] [--version] [--verbose] [--pid_target VAR] [--tgid_target VAR] [--uid_target VAR] [--failed]
Trace open family syscalls.
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
--verbose prints libbpf debug information
--pid_target Process ID to trace
--tgid_target Thread ID to trace
--uid_target User ID to trace
-f, --failed trace only failed events
Built with eunomia-bpf framework.
See https://github.com/eunomia-bpf/eunomia-bpf for more information.
$ sudo ecli examples/bpftools/opensnoop/package.json
TIME TS PID UID RET FLAGS COMM FNAME
20:31:50 0 1 0 51 524288 systemd /proc/614/cgroup
20:31:50 0 33182 0 25 524288 ecli /etc/localtime
20:31:53 0 754 0 6 0 irqbalance /proc/interrupts
20:31:53 0 754 0 6 0 irqbalance /proc/stat
20:32:03 0 754 0 6 0 irqbalance /proc/interrupts
20:32:03 0 754 0 6 0 irqbalance /proc/stat
20:32:03 0 632 0 7 524288 vmtoolsd /etc/mtab
20:32:03 0 632 0 9 0 vmtoolsd /proc/devices
$ sudo ecli examples/bpftools/opensnoop/package.json --pid_target 754
TIME TS PID UID RET FLAGS COMM FNAME
20:34:13 0 754 0 6 0 irqbalance /proc/interrupts
20:34:13 0 754 0 6 0 irqbalance /proc/stat
20:34:23 0 754 0 6 0 irqbalance /proc/interrupts
20:34:23 0 754 0 6 0 irqbalance /proc/stat
```
## Compile and Run
Compile with docker:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
or compile with `ecc`:
```console
$ ecc opensnoop.bpf.c opensnoop.h
Compiling bpf object...
Generating export types...
Packing ebpf object and config into package.json...
```
Run:
```shell
sudo ./ecli run examples/bpftools/opensnoop/package.json
```
## details in bcc
Demonstrations of opensnoop, the Linux eBPF/bcc version.
opensnoop traces the open() syscall system-wide, and prints various details.
Example output:
```console
# ./opensnoop
PID COMM FD ERR PATH
17326 <...> 7 0 /sys/kernel/debug/tracing/trace_pipe
1576 snmpd 9 0 /proc/net/dev
1576 snmpd 11 0 /proc/net/if_inet6
1576 snmpd 11 0 /proc/sys/net/ipv4/neigh/eth0/retrans_time_ms
1576 snmpd 11 0 /proc/sys/net/ipv6/neigh/eth0/retrans_time_ms
1576 snmpd 11 0 /proc/sys/net/ipv6/conf/eth0/forwarding
1576 snmpd 11 0 /proc/sys/net/ipv6/neigh/eth0/base_reachable_time_ms
1576 snmpd 11 0 /proc/sys/net/ipv4/neigh/lo/retrans_time_ms
1576 snmpd 11 0 /proc/sys/net/ipv6/neigh/lo/retrans_time_ms
1576 snmpd 11 0 /proc/sys/net/ipv6/conf/lo/forwarding
1576 snmpd 11 0 /proc/sys/net/ipv6/neigh/lo/base_reachable_time_ms
1576 snmpd 9 0 /proc/diskstats
1576 snmpd 9 0 /proc/stat
1576 snmpd 9 0 /proc/vmstat
1956 supervise 9 0 supervise/status.new
1956 supervise 9 0 supervise/status.new
17358 run 3 0 /etc/ld.so.cache
17358 run 3 0 /lib/x86_64-linux-gnu/libtinfo.so.5
17358 run 3 0 /lib/x86_64-linux-gnu/libdl.so.2
17358 run 3 0 /lib/x86_64-linux-gnu/libc.so.6
17358 run -1 6 /dev/tty
17358 run 3 0 /proc/meminfo
17358 run 3 0 /etc/nsswitch.conf
17358 run 3 0 /etc/ld.so.cache
17358 run 3 0 /lib/x86_64-linux-gnu/libnss_compat.so.2
17358 run 3 0 /lib/x86_64-linux-gnu/libnsl.so.1
17358 run 3 0 /etc/ld.so.cache
17358 run 3 0 /lib/x86_64-linux-gnu/libnss_nis.so.2
17358 run 3 0 /lib/x86_64-linux-gnu/libnss_files.so.2
17358 run 3 0 /etc/passwd
17358 run 3 0 ./run
^C
```
While tracing, the snmpd process opened various /proc files (reading metrics),
and a "run" process read various libraries and config files (looks like it
was starting up: a new process).
opensnoop can be useful for discovering configuration and log files, if used
during application startup.
The -p option can be used to filter on a PID, which is filtered in-kernel. Here
I've used it with -T to print timestamps:
```console
# ./opensnoop -Tp 1956
TIME(s) PID COMM FD ERR PATH
0.000000000 1956 supervise 9 0 supervise/status.new
0.000289999 1956 supervise 9 0 supervise/status.new
1.023068000 1956 supervise 9 0 supervise/status.new
1.023381997 1956 supervise 9 0 supervise/status.new
2.046030000 1956 supervise 9 0 supervise/status.new
2.046363000 1956 supervise 9 0 supervise/status.new
3.068203997 1956 supervise 9 0 supervise/status.new
3.068544999 1956 supervise 9 0 supervise/status.new
```
This shows the supervise process is opening the status.new file twice every
second.
The -U option includes the UID in the output:
```console
# ./opensnoop -U
UID PID COMM FD ERR PATH
0 27063 vminfo 5 0 /var/run/utmp
103 628 dbus-daemon -1 2 /usr/local/share/dbus-1/system-services
103 628 dbus-daemon 18 0 /usr/share/dbus-1/system-services
103 628 dbus-daemon -1 2 /lib/dbus-1/system-services
```
The -u option filters by UID:
```console
# ./opensnoop -Uu 1000
UID PID COMM FD ERR PATH
1000 30240 ls 3 0 /etc/ld.so.cache
1000 30240 ls 3 0 /lib/x86_64-linux-gnu/libselinux.so.1
1000 30240 ls 3 0 /lib/x86_64-linux-gnu/libc.so.6
1000 30240 ls 3 0 /lib/x86_64-linux-gnu/libpcre.so.3
1000 30240 ls 3 0 /lib/x86_64-linux-gnu/libdl.so.2
1000 30240 ls 3 0 /lib/x86_64-linux-gnu/libpthread.so.0
```
The -x option only prints failed opens:
```console
# ./opensnoop -x
PID COMM FD ERR PATH
18372 run -1 6 /dev/tty
18373 run -1 6 /dev/tty
18373 multilog -1 13 lock
18372 multilog -1 13 lock
18384 df -1 2 /usr/share/locale/en_US.UTF-8/LC_MESSAGES/coreutils.mo
18384 df -1 2 /usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo
18384 df -1 2 /usr/share/locale/en_US/LC_MESSAGES/coreutils.mo
18384 df -1 2 /usr/share/locale/en.UTF-8/LC_MESSAGES/coreutils.mo
18384 df -1 2 /usr/share/locale/en.utf8/LC_MESSAGES/coreutils.mo
18384 df -1 2 /usr/share/locale/en/LC_MESSAGES/coreutils.mo
18385 run -1 6 /dev/tty
18386 run -1 6 /dev/tty
```
This caught a df command failing to open a coreutils.mo file, and trying from
different directories.
The ERR column is the system error number. Error number 2 is ENOENT: no such
file or directory.
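The FD and ERR columns can both be derived from the single return value of the system call, which is the `RET` value in the eunomia-bpf output earlier in this README. A minimal sketch, assuming the usual Linux convention that a negative return value is a negated errno; `struct open_result`, `split_ret`, and `describe_err` are hypothetical names used only here:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct open_result {
    int fd;  /* file descriptor, or -1 on failure */
    int err; /* errno value, or 0 on success */
};

/* Split a raw syscall return value into the FD and ERR columns. */
static struct open_result split_ret(long ret)
{
    struct open_result r;

    if (ret >= 0) {
        r.fd = (int)ret;
        r.err = 0;
    } else {
        r.fd = -1;
        r.err = (int)-ret;
    }
    return r;
}

/* Human-readable text for the ERR column. */
static const char *describe_err(int err)
{
    assert(err >= 0);
    return err == 0 ? "success" : strerror(err);
}
```

For example, a `RET` of -2 corresponds to FD -1 and ERR 2 (ENOENT).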
A maximum tracing duration can be set with the -d option. For example, to trace
for 2 seconds:
```console
# ./opensnoop -d 2
PID COMM FD ERR PATH
2191 indicator-multi 11 0 /sys/block
2191 indicator-multi 11 0 /sys/block
2191 indicator-multi 11 0 /sys/block
2191 indicator-multi 11 0 /sys/block
2191 indicator-multi 11 0 /sys/block
```
The -n option can be used to filter on process name using partial matches:
```console
# ./opensnoop -n ed
PID COMM FD ERR PATH
2679 sed 3 0 /etc/ld.so.cache
2679 sed 3 0 /lib/x86_64-linux-gnu/libselinux.so.1
2679 sed 3 0 /lib/x86_64-linux-gnu/libc.so.6
2679 sed 3 0 /lib/x86_64-linux-gnu/libpcre.so.3
2679 sed 3 0 /lib/x86_64-linux-gnu/libdl.so.2
2679 sed 3 0 /lib/x86_64-linux-gnu/libpthread.so.0
2679 sed 3 0 /proc/filesystems
2679 sed 3 0 /usr/lib/locale/locale-archive
2679 sed -1 2
2679 sed 3 0 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
2679 sed 3 0 /dev/null
2680 sed 3 0 /etc/ld.so.cache
2680 sed 3 0 /lib/x86_64-linux-gnu/libselinux.so.1
2680 sed 3 0 /lib/x86_64-linux-gnu/libc.so.6
2680 sed 3 0 /lib/x86_64-linux-gnu/libpcre.so.3
2680 sed 3 0 /lib/x86_64-linux-gnu/libdl.so.2
2680 sed 3 0 /lib/x86_64-linux-gnu/libpthread.so.0
2680 sed 3 0 /proc/filesystems
2680 sed 3 0 /usr/lib/locale/locale-archive
2680 sed -1 2
^C
```
This caught the 'sed' command because it partially matches 'ed' that's passed
to the '-n' option.
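Partial matching of this kind amounts to a substring search on the process name. The following is a sketch of that idea, not the bcc implementation; `comm_matches` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true when needle occurs anywhere in comm.
 * A NULL or empty needle means no name filter was requested. */
static bool comm_matches(const char *comm, const char *needle)
{
    assert(comm != NULL);
    if (needle == NULL || needle[0] == '\0')
        return true;
    return strstr(comm, needle) != NULL;
}
```

Because the match is a substring match, `-n ed` would also select processes such as `sshd-session` only if their name contains "ed"; `sed` matches, `ls` does not.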
The -e option prints out extra columns; for example, the following output
contains the flags passed to open(2), in octal:
```console
# ./opensnoop -e
PID COMM FD ERR FLAGS PATH
28512 sshd 10 0 00101101 /proc/self/oom_score_adj
28512 sshd 3 0 02100000 /etc/ld.so.cache
28512 sshd 3 0 02100000 /lib/x86_64-linux-gnu/libwrap.so.0
28512 sshd 3 0 02100000 /lib/x86_64-linux-gnu/libaudit.so.1
28512 sshd 3 0 02100000 /lib/x86_64-linux-gnu/libpam.so.0
28512 sshd 3 0 02100000 /lib/x86_64-linux-gnu/libselinux.so.1
28512 sshd 3 0 02100000 /lib/x86_64-linux-gnu/libsystemd.so.0
28512 sshd 3 0 02100000 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.2
28512 sshd 3 0 02100000 /lib/x86_64-linux-gnu/libutil.so.1
```
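The FLAGS values are bit masks, so they can be decoded into symbolic names. The sketch below assumes the flag values used by Linux on x86-64 (other architectures use different values for some flags) and covers only a subset of flags; `decode_open_flags` and the `open_flags` table are hypothetical names. The same decoding applies to the decimal FLAGS column printed by the eunomia-bpf version, where 524288 equals octal 02000000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
    int bit;
    const char *name;
};

/* Assumed values: Linux on x86-64. Unknown bits are silently ignored. */
static const struct flag_name open_flags[] = {
    { 0100, "O_CREAT" },        { 0200, "O_EXCL" },
    { 0400, "O_NOCTTY" },       { 01000, "O_TRUNC" },
    { 02000, "O_APPEND" },      { 04000, "O_NONBLOCK" },
    { 0100000, "O_LARGEFILE" }, { 0200000, "O_DIRECTORY" },
    { 02000000, "O_CLOEXEC" },
};

/* Write a "MODE|FLAG|FLAG" description of flags into out. */
static void decode_open_flags(int flags, char *out, size_t cap)
{
    /* The two lowest bits select the access mode. */
    static const char *const modes[] = { "O_RDONLY", "O_WRONLY", "O_RDWR", "mode3" };

    assert(out != NULL && cap > 0);
    snprintf(out, cap, "%s", modes[flags & 3]);
    for (size_t i = 0; i < sizeof(open_flags) / sizeof(open_flags[0]); i++) {
        if (flags & open_flags[i].bit) {
            size_t len = strlen(out);
            snprintf(out + len, cap - len, "|%s", open_flags[i].name);
        }
    }
}
```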
The -f option filters based on flags to the open(2) call, for example:
```console
# ./opensnoop -e -f O_WRONLY -f O_RDWR
PID COMM FD ERR FLAGS PATH
28084 clear_console 3 0 00100002 /dev/tty
28084 clear_console -1 13 00100002 /dev/tty0
28084 clear_console -1 13 00100001 /dev/tty0
28084 clear_console -1 13 00100002 /dev/console
28084 clear_console -1 13 00100001 /dev/console
28051 sshd 8 0 02100002 /var/run/utmp
28051 sshd 7 0 00100001 /var/log/wtmp
```
The --cgroupmap option filters based on a cgroup set. It is meant to be used
with an externally created map.
```console
# ./opensnoop --cgroupmap /sys/fs/bpf/test01
```
For more details, see docs/special_filtering.md

12
4-opensnoop/config.yaml Normal file

@@ -0,0 +1,12 @@
programs:
- name: opensnoop
metrics:
counters:
- name: eunomia_file_open_counter
description: test
labels:
- name: pid
- name: comm
- name: filename
from: fname
compiled_ebpf_filename: package.json

140
4-opensnoop/opensnoop.bpf.c Normal file

@@ -0,0 +1,140 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2019 Facebook
// Copyright (c) 2020 Netflix
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "opensnoop.h"
struct args_t {
const char *fname;
int flags;
};
/// Thread ID to trace (kernel pid)
const volatile int pid_target = 0;
/// Process ID to trace (kernel tgid)
const volatile int tgid_target = 0;
/// @description User ID to trace (-1 traces all users)
const volatile int uid_target = -1;
/// @cmdarg {"default": false, "short": "f", "long": "failed"}
/// @description trace only failed events
const volatile bool targ_failed = false;
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10240);
__type(key, u32);
__type(value, struct args_t);
} start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
static __always_inline bool valid_uid(uid_t uid) {
return uid != INVALID_UID;
}
static __always_inline
bool trace_allowed(u32 tgid, u32 pid)
{
u32 uid;
/* filters */
if (tgid_target && tgid_target != tgid)
return false;
if (pid_target && pid_target != pid)
return false;
if (valid_uid(uid_target)) {
uid = (u32)bpf_get_current_uid_gid();
if (uid_target != uid) {
return false;
}
}
return true;
}
SEC("tracepoint/syscalls/sys_enter_open")
int tracepoint__syscalls__sys_enter_open(struct trace_event_raw_sys_enter* ctx)
{
u64 id = bpf_get_current_pid_tgid();
/* use kernel terminology here for tgid/pid: */
u32 tgid = id >> 32;
u32 pid = id;
/* store arg info for later lookup */
if (trace_allowed(tgid, pid)) {
struct args_t args = {};
args.fname = (const char *)ctx->args[0];
args.flags = (int)ctx->args[1];
bpf_map_update_elem(&start, &pid, &args, 0);
}
return 0;
}
SEC("tracepoint/syscalls/sys_enter_openat")
int tracepoint__syscalls__sys_enter_openat(struct trace_event_raw_sys_enter* ctx)
{
u64 id = bpf_get_current_pid_tgid();
/* use kernel terminology here for tgid/pid: */
u32 tgid = id >> 32;
u32 pid = id;
/* store arg info for later lookup */
if (trace_allowed(tgid, pid)) {
struct args_t args = {};
args.fname = (const char *)ctx->args[1];
args.flags = (int)ctx->args[2];
bpf_map_update_elem(&start, &pid, &args, 0);
}
return 0;
}
static __always_inline
int trace_exit(struct trace_event_raw_sys_exit* ctx)
{
struct event event = {};
struct args_t *ap;
int ret;
u32 pid = bpf_get_current_pid_tgid();
ap = bpf_map_lookup_elem(&start, &pid);
if (!ap)
return 0; /* missed entry */
ret = ctx->ret;
if (targ_failed && ret >= 0)
goto cleanup; /* want failed only */
/* event data */
event.pid = bpf_get_current_pid_tgid() >> 32;
event.uid = bpf_get_current_uid_gid();
bpf_get_current_comm(&event.comm, sizeof(event.comm));
bpf_probe_read_user_str(&event.fname, sizeof(event.fname), ap->fname);
event.flags = ap->flags;
event.ret = ret;
/* emit event */
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
&event, sizeof(event));
cleanup:
bpf_map_delete_elem(&start, &pid);
return 0;
}
SEC("tracepoint/syscalls/sys_exit_open")
int tracepoint__syscalls__sys_exit_open(struct trace_event_raw_sys_exit* ctx)
{
return trace_exit(ctx);
}
SEC("tracepoint/syscalls/sys_exit_openat")
int tracepoint__syscalls__sys_exit_openat(struct trace_event_raw_sys_exit* ctx)
{
return trace_exit(ctx);
}
/// Trace open family syscalls.
char LICENSE[] SEC("license") = "GPL";

21
4-opensnoop/opensnoop.h Normal file

@@ -0,0 +1,21 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __OPENSNOOP_H
#define __OPENSNOOP_H
#define TASK_COMM_LEN 16
#define NAME_MAX 255
#define INVALID_UID ((uid_t)-1)
// used for export event
struct event {
/* user terminology for pid: */
unsigned long long ts;
int pid;
int uid;
int ret;
int flags;
char comm[TASK_COMM_LEN];
char fname[NAME_MAX];
};
#endif /* __OPENSNOOP_H */

7
5-uprobe-bashreadline/.gitignore vendored Normal file

@@ -0,0 +1,7 @@
.vscode
package.json
ecli
*.o
*.skel.json
*.skel.yaml
package.yaml


@@ -0,0 +1,79 @@
---
layout: post
title: bashreadline
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, examples, uprobe, perf event]
summary: an example of a simple (but realistic) BPF application that prints bash commands from all running bash shells on the system.
---
This prints bash commands from all running bash shells on the system.
## System requirements:
- Linux kernel > 5.5
- Eunomia's [ecli](https://github.com/eunomia-bpf/eunomia-bpf/tree/master/ecli) installed
## Run
- Compile:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
or
```shell
ecc bashreadline.bpf.c bashreadline.h
```
- Run:
```console
$ sudo ./ecli run eunomia-bpf/examples/bpftools/bootstrap/package.json
TIME PID STR
11:17:34 28796 whoami
11:17:41 28796 ps -ef
11:17:51 28796 echo "Hello eBPF!"
```
## details in bcc
```
Demonstrations of bashreadline, the Linux eBPF/bcc version.
This prints bash commands from all running bash shells on the system. For
example:
# ./bashreadline
TIME PID COMMAND
05:28:25 21176 ls -l
05:28:28 21176 date
05:28:35 21176 echo hello world
05:28:43 21176 foo this command failed
05:28:45 21176 df -h
05:29:04 3059 echo another shell
05:29:13 21176 echo first shell again
When running the script on Arch Linux, you may need to specify the location
of libreadline.so library:
# ./bashreadline -s /lib/libreadline.so
TIME PID COMMAND
11:17:34 28796 whoami
11:17:41 28796 ps -ef
11:17:51 28796 echo "Hello eBPF!"
The entered command may fail. This is just showing what command lines were
entered interactively for bash to process.
It works by tracing the return of the readline() function using uprobes
(specifically a uretprobe).
```
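One practical consequence of the event layout in `bashreadline.h` is that each captured command line is limited to `MAX_LINE_SIZE` bytes, including the terminating NUL. The sketch below models that bounded copy in user-space C. It is an assumption-laden illustration of the truncation behaviour, not the BPF code itself; `fill_event` is a hypothetical helper, and the return convention (length including the NUL) is modeled on `bpf_probe_read_user_str`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINE_SIZE 80 /* same value as in bashreadline.h */

struct str_t {
    unsigned int pid;
    char str[MAX_LINE_SIZE];
};

/* Copy at most MAX_LINE_SIZE - 1 characters of line into the event and
 * always NUL-terminate. Returns the number of bytes written, NUL included. */
static size_t fill_event(struct str_t *ev, unsigned int pid, const char *line)
{
    size_t n = 0;

    assert(ev != NULL && line != NULL);
    memset(ev, 0, sizeof(*ev)); /* avoid exporting uninitialized bytes */
    ev->pid = pid;
    while (n < MAX_LINE_SIZE - 1 && line[n] != '\0') {
        ev->str[n] = line[n];
        n++;
    }
    ev->str[n] = '\0';
    return n + 1;
}
```

Under this model a command longer than 79 characters is reported truncated rather than dropped.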


@@ -0,0 +1,48 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2021 Facebook */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bashreadline.h"
#define TASK_COMM_LEN 16
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u32));
} events SEC(".maps");
/* Format of u[ret]probe section definition supporting auto-attach:
* u[ret]probe/binary:function[+offset]
*
* binary can be an absolute/relative path or a filename; the latter is resolved to a
* full binary path via bpf_program__attach_uprobe_opts.
*
* Specifying uprobe+ ensures we carry out strict matching; either "uprobe" must be
* specified (and auto-attach is not possible) or the above format is specified for
* auto-attach.
*/
SEC("uretprobe//bin/bash:readline")
int BPF_KRETPROBE(printret, const void *ret) {
struct str_t data;
char comm[TASK_COMM_LEN];
u32 pid;
if (!ret)
return 0;
bpf_get_current_comm(&comm, sizeof(comm));
if (comm[0] != 'b' || comm[1] != 'a' || comm[2] != 's' || comm[3] != 'h' || comm[4] != 0 )
return 0;
pid = bpf_get_current_pid_tgid() >> 32;
data.pid = pid;
bpf_probe_read_user_str(&data.str, sizeof(data.str), ret);
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &data, sizeof(data));
return 0;
};
char LICENSE[] SEC("license") = "GPL";


@@ -0,0 +1,13 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/* Copyright (c) 2021 Facebook */
#ifndef __BASHREADLINE_H
#define __BASHREADLINE_H
#define MAX_LINE_SIZE 80
struct str_t {
__u32 pid;
char str[MAX_LINE_SIZE];
};
#endif /* __BASHREADLINE_H */

10
6-sigsnoop/.gitignore vendored Executable file

@@ -0,0 +1,10 @@
.vscode
package.json
*.wasm
ewasm-skel.h
ecli
ewasm
*.o
*.skel.json
*.skel.yaml
package.yaml

155
6-sigsnoop/README.md Executable file

@@ -0,0 +1,155 @@
---
layout: post
title: sigsnoop
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, syscall, kprobe, tracepoint]
summary: Trace signals generated system wide, from syscalls and others.
---
## origin
Originates from:
https://github.com/iovisor/bcc/blob/master/libbpf-tools/sigsnoop.bpf.c
## Compile and Run
Compile:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
Or compile with `ecc`:
```console
$ ecc sigsnoop.bpf.c sigsnoop.h
Compiling bpf object...
Generating export types...
Packing ebpf object and config into package.json...
```
Run:
```console
$ sudo ./ecli examples/bpftools/sigsnoop/package.json
TIME PID TPID SIG RET COMM
20:43:44 21276 3054 0 0 cpptools-srv
20:43:44 22407 3054 0 0 cpptools-srv
20:43:44 20222 3054 0 0 cpptools-srv
20:43:44 8933 3054 0 0 cpptools-srv
20:43:44 2915 2803 0 0 node
20:43:44 2943 2803 0 0 node
20:43:44 31453 3054 0 0 cpptools-srv
$ sudo ./ecli examples/bpftools/sigsnoop/package.json -h
Usage: sigsnoop_bpf [--help] [--version] [--verbose] [--filtered_pid VAR] [--target_signal VAR] [--failed_only]
A simple eBPF program
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
--verbose prints libbpf debug information
--filtered_pid set value of pid_t variable filtered_pid
--target_signal set value of int variable target_signal
--failed_only set value of bool variable failed_only
Built with eunomia-bpf framework.
See https://github.com/eunomia-bpf/eunomia-bpf for more information.
```
## WASM example
Generate WASM skel:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest gen-wasm-skel
```
> The skel has already been generated and committed, so you don't need to generate it again.
> skel includes:
>
> - eunomia-include: include headers for WASM
> - app.c: the WASM app. all library is header only.
Build WASM module:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest build-wasm
```
Run:
```console
$ sudo ./ecli run app.wasm -h
Usage: sigsnoop [-h] [-x] [-k] [-n] [-p PID] [-s SIGNAL]
Trace standard and real-time signals.
-h, --help show this help message and exit
-x, --failed failed signals only
-k, --killed kill only
-p, --pid=<int> target pid
-s, --signal=<int> target signal
$ sudo ./ecli run app.wasm
running and waiting for the ebpf events from perf event...
{"pid":185539,"tpid":185538,"sig":17,"ret":0,"comm":"cat","sig_name":"SIGCHLD"}
{"pid":185540,"tpid":185538,"sig":17,"ret":0,"comm":"grep","sig_name":"SIGCHLD"}
$ sudo ./ecli run app.wasm -p 1641
running and waiting for the ebpf events from perf event...
{"pid":1641,"tpid":2368,"sig":23,"ret":0,"comm":"YDLive","sig_name":"SIGURG"}
{"pid":1641,"tpid":2368,"sig":23,"ret":0,"comm":"YDLive","sig_name":"SIGURG"}
```
## details in bcc
Demonstrations of sigsnoop.
This traces signals generated system wide. For example:
```console
# ./sigsnoop -n
TIME PID COMM SIG TPID RESULT
19:56:14 3204808 a.out SIGSEGV 3204808 0
19:56:14 3204808 a.out SIGPIPE 3204808 0
19:56:14 3204808 a.out SIGCHLD 3204722 0
```
The first line shows that a.out (a test program) delivered a SIGSEGV signal.
The result, 0, means success.
The second and third lines show that a.out also delivered SIGPIPE and SIGCHLD
signals in succession.
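Translating between signal numbers and names, as the `-n` option does, is a table lookup. A minimal sketch, assuming the x86 Linux numbering for signals 1 to 31 (the same numbering as the `sig_name` table in `app.c`); `sig_to_name` and `name_to_sig` are hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed numbering: Linux on x86. Index 0 is intentionally left empty. */
static const char *const sig_names[] = {
    [1] = "SIGHUP",   [2] = "SIGINT",   [3] = "SIGQUIT",    [4] = "SIGILL",
    [5] = "SIGTRAP",  [6] = "SIGABRT",  [7] = "SIGBUS",     [8] = "SIGFPE",
    [9] = "SIGKILL",  [10] = "SIGUSR1", [11] = "SIGSEGV",   [12] = "SIGUSR2",
    [13] = "SIGPIPE", [14] = "SIGALRM", [15] = "SIGTERM",   [16] = "SIGSTKFLT",
    [17] = "SIGCHLD", [18] = "SIGCONT", [19] = "SIGSTOP",   [20] = "SIGTSTP",
    [21] = "SIGTTIN", [22] = "SIGTTOU", [23] = "SIGURG",    [24] = "SIGXCPU",
    [25] = "SIGXFSZ", [26] = "SIGVTALRM", [27] = "SIGPROF", [28] = "SIGWINCH",
    [29] = "SIGIO",   [30] = "SIGPWR",  [31] = "SIGSYS",
};

#define SIG_COUNT (sizeof(sig_names) / sizeof(sig_names[0]))

/* Number to name; anything outside 1..31 yields "N/A". */
static const char *sig_to_name(int sig)
{
    if (sig < 1 || (size_t)sig >= SIG_COUNT)
        return "N/A";
    assert(sig_names[sig] != NULL); /* every slot from 1 to 31 is filled */
    return sig_names[sig];
}

/* Name to number; returns -1 when the name is unknown. */
static int name_to_sig(const char *name)
{
    assert(name != NULL);
    for (size_t i = 1; i < SIG_COUNT; i++) {
        if (strcmp(sig_names[i], name) == 0)
            return (int)i;
    }
    return -1;
}
```

Real-time signals (32 and above) have no fixed names in this table, so they fall back to "N/A" and are better printed as numbers.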
USAGE message:
```console
# ./sigsnoop -h
Usage: sigsnoop [OPTION...]
Trace standard and real-time signals.
USAGE: sigsnoop [-h] [-x] [-k] [-n] [-p PID] [-s SIGNAL]
EXAMPLES:
sigsnoop # trace signals system-wide
sigsnoop -k # trace signals issued by kill syscall only
sigsnoop -x # trace failed signals only
sigsnoop -p 1216 # only trace PID 1216
sigsnoop -s 9 # only trace signal 9
-k, --kill Trace signals issued by kill syscall only.
-n, --name Output signal name instead of signal number.
-p, --pid=PID Process ID to trace
-s, --signal=SIGNAL Signal to trace.
-x, --failed Trace failed signals only.
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
```
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
Report bugs to https://github.com/iovisor/bcc/tree/master/libbpf-tools.

245
6-sigsnoop/app.c Executable file

@@ -0,0 +1,245 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdbool.h>
#include "eunomia-include/wasm-app.h"
#include "eunomia-include/entry.h"
#include "eunomia-include/argp.h"
#include "sigsnoop.bpf.h"
#include "ewasm-skel.h"
#include "eunomia-include/sigsnoop.skel.h"
#define PERF_BUFFER_PAGES 16
#define PERF_POLL_TIMEOUT_MS 100
#define warn(...) printf(__VA_ARGS__)
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
static volatile int exiting = 0;
static int target_pid = 0;
static int target_signal = 0;
static bool failed_only = false;
static bool kill_only = false;
static bool signal_name = false;
static bool verbose = false;
static const char *sig_name[] = {
[0] = "N/A",
[1] = "SIGHUP",
[2] = "SIGINT",
[3] = "SIGQUIT",
[4] = "SIGILL",
[5] = "SIGTRAP",
[6] = "SIGABRT",
[7] = "SIGBUS",
[8] = "SIGFPE",
[9] = "SIGKILL",
[10] = "SIGUSR1",
[11] = "SIGSEGV",
[12] = "SIGUSR2",
[13] = "SIGPIPE",
[14] = "SIGALRM",
[15] = "SIGTERM",
[16] = "SIGSTKFLT",
[17] = "SIGCHLD",
[18] = "SIGCONT",
[19] = "SIGSTOP",
[20] = "SIGTSTP",
[21] = "SIGTTIN",
[22] = "SIGTTOU",
[23] = "SIGURG",
[24] = "SIGXCPU",
[25] = "SIGXFSZ",
[26] = "SIGVTALRM",
[27] = "SIGPROF",
[28] = "SIGWINCH",
[29] = "SIGIO",
[30] = "SIGPWR",
[31] = "SIGSYS",
};
const char *argp_program_version = "sigsnoop 0.1";
const char *argp_program_bug_address =
"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
const char argp_program_doc[] =
"Trace standard and real-time signals.\n"
"\n"
"USAGE: sigsnoop [-h] [-x] [-k] [-n] [-p PID] [-s SIGNAL]\n"
"\n"
"EXAMPLES:\n"
" sigsnoop # trace signals system-wide\n"
" sigsnoop -k # trace signals issued by kill syscall only\n"
" sigsnoop -x # trace failed signals only\n"
" sigsnoop -p 1216 # only trace PID 1216\n"
" sigsnoop -s 9 # only trace signal 9\n";
static const struct argp_option opts[] = {
{ "failed", 'x', NULL, 0, "Trace failed signals only." },
{ "kill", 'k', NULL, 0, "Trace signals issued by kill syscall only." },
{ "pid", 'p', "PID", 0, "Process ID to trace" },
{ "signal", 's', "SIGNAL", 0, "Signal to trace." },
{ "name", 'n', NULL, 0, "Output signal name instead of signal number." },
{ "verbose", 'v', NULL, 0, "Verbose debug output" },
{ NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
{},
};
static error_t parse_arg(int key, char *arg, struct argp_state *state)
{
long pid, sig;
switch (key) {
case 'p':
errno = 0;
pid = strtol(arg, NULL, 10);
if (errno || pid <= 0) {
warn("Invalid PID: %s\n", arg);
argp_usage(state);
}
target_pid = pid;
break;
case 's':
errno = 0;
sig = strtol(arg, NULL, 10);
if (errno || sig <= 0) {
warn("Invalid SIGNAL: %s\n", arg);
argp_usage(state);
}
target_signal = sig;
break;
case 'n':
signal_name = true;
break;
case 'x':
failed_only = true;
break;
case 'k':
kill_only = true;
break;
case 'v':
verbose = true;
break;
case 'h':
argp_state_help(state, ARGP_HELP_STD_HELP);
break;
default:
return ARGP_ERR_UNKNOWN;
}
return 0;
}
static int libbpf_print_fn(const char *format, va_list args)
{
if (!verbose)
return 0;
return vprintf(format, args);
}
static void alias_parse(char *prog)
{
char *name = prog;
if (!strcmp(name, "killsnoop")) {
kill_only = true;
}
}
static void sig_int(int signo)
{
exiting = 1;
}
static void handle_event(void *ctx, int cpu, void *data, unsigned int data_sz)
{
struct event *e = data;
char ts[32] = "12:47:32";
if (signal_name && e->sig < ARRAY_SIZE(sig_name))
printf("%-8s %-7d %-16s %-9s %-7d %-6d\n",
ts, e->pid, e->comm, sig_name[e->sig], e->tpid, e->ret);
else
printf("%-8s %-7d %-16s %-9d %-7d %-6d\n",
ts, e->pid, e->comm, e->sig, e->tpid, e->ret);
}
static void handle_lost_events(void *ctx, int cpu, unsigned long long lost_cnt)
{
warn("lost %llu events on CPU #%d\n", lost_cnt, cpu);
}
int main(int argc, char **argv)
{
static const struct argp argp = {
.options = opts,
.parser = parse_arg,
.doc = argp_program_doc,
};
struct perf_buffer *pb = NULL;
struct sigsnoop_bpf *obj;
int err;
alias_parse(argv[0]);
err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
if (err)
return err;
obj = sigsnoop_bpf__open();
if (!obj) {
warn("failed to open BPF object\n");
return 1;
}
obj->rodata->filtered_pid = target_pid;
obj->rodata->target_signal = target_signal;
obj->rodata->failed_only = failed_only;
if (kill_only) {
bpf_program__set_autoload(obj->progs.sig_trace, false);
} else {
bpf_program__set_autoload(obj->progs.kill_entry, false);
bpf_program__set_autoload(obj->progs.kill_exit, false);
bpf_program__set_autoload(obj->progs.tkill_entry, false);
bpf_program__set_autoload(obj->progs.tkill_exit, false);
bpf_program__set_autoload(obj->progs.tgkill_entry, false);
bpf_program__set_autoload(obj->progs.tgkill_exit, false);
}
err = sigsnoop_bpf__load(obj);
if (err) {
warn("failed to load BPF object: %d\n", err);
goto cleanup;
}
err = sigsnoop_bpf__attach(obj);
if (err) {
warn("failed to attach BPF programs: %d\n", err);
goto cleanup;
}
pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
handle_event, handle_lost_events, NULL, NULL);
if (!pb) {
err = -errno;
warn("failed to open perf buffer: %d\n", err);
goto cleanup;
}
printf("%-8s %-7s %-16s %-9s %-7s %-6s\n",
"TIME", "PID", "COMM", "SIG", "TPID", "RESULT");
while (!exiting) {
err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
if (err < 0 && err != -EINTR) {
warn("error polling perf buffer: %s\n", strerror(-err));
goto cleanup;
}
/* reset err to return 0 if exiting */
err = 0;
}
cleanup:
perf_buffer__free(pb);
sigsnoop_bpf__destroy(obj);
return err != 0;
}


@@ -0,0 +1,96 @@
/* Name frobnication for compiling argp outside of glibc
Copyright (C) 1997 Free Software Foundation, Inc.
This file is part of the GNU C Library.
Written by Miles Bader <miles@gnu.ai.mit.edu>.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Library General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Library General Public License for more details.
You should have received a copy of the GNU Library General Public
License along with the GNU C Library; see the file COPYING.LIB. If not,
write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA. */
#if !_LIBC
/* This code is written for inclusion in gnu-libc, and uses names in the
namespace reserved for libc. If we're not compiling in libc, define those
names to be the normal ones instead. */
/* argp-parse functions */
#undef __argp_parse
#define __argp_parse argp_parse
#undef __option_is_end
#define __option_is_end _option_is_end
#undef __option_is_short
#define __option_is_short _option_is_short
#undef __argp_input
#define __argp_input _argp_input
/* argp-help functions */
#undef __argp_help
#define __argp_help argp_help
#undef __argp_error
#define __argp_error argp_error
#undef __argp_failure
#define __argp_failure argp_failure
#undef __argp_state_help
#define __argp_state_help argp_state_help
#undef __argp_usage
#define __argp_usage argp_usage
#undef __argp_basename
#define __argp_basename _argp_basename
#undef __argp_short_program_name
#define __argp_short_program_name _argp_short_program_name
/* argp-fmtstream functions */
#undef __argp_make_fmtstream
#define __argp_make_fmtstream argp_make_fmtstream
#undef __argp_fmtstream_free
#define __argp_fmtstream_free argp_fmtstream_free
#undef __argp_fmtstream_putc
#define __argp_fmtstream_putc argp_fmtstream_putc
#undef __argp_fmtstream_puts
#define __argp_fmtstream_puts argp_fmtstream_puts
#undef __argp_fmtstream_write
#define __argp_fmtstream_write argp_fmtstream_write
#undef __argp_fmtstream_printf
#define __argp_fmtstream_printf argp_fmtstream_printf
#undef __argp_fmtstream_set_lmargin
#define __argp_fmtstream_set_lmargin argp_fmtstream_set_lmargin
#undef __argp_fmtstream_set_rmargin
#define __argp_fmtstream_set_rmargin argp_fmtstream_set_rmargin
#undef __argp_fmtstream_set_wmargin
#define __argp_fmtstream_set_wmargin argp_fmtstream_set_wmargin
#undef __argp_fmtstream_point
#define __argp_fmtstream_point argp_fmtstream_point
#undef __argp_fmtstream_update
#define __argp_fmtstream_update _argp_fmtstream_update
#undef __argp_fmtstream_ensure
#define __argp_fmtstream_ensure _argp_fmtstream_ensure
#undef __argp_fmtstream_lmargin
#define __argp_fmtstream_lmargin argp_fmtstream_lmargin
#undef __argp_fmtstream_rmargin
#define __argp_fmtstream_rmargin argp_fmtstream_rmargin
#undef __argp_fmtstream_wmargin
#define __argp_fmtstream_wmargin argp_fmtstream_wmargin
/* normal libc functions we call */
#undef __sleep
#define __sleep sleep
#undef __strcasecmp
#define __strcasecmp strcasecmp
#undef __vsnprintf
#define __vsnprintf vsnprintf
#endif /* !_LIBC */
#ifndef __set_errno
#define __set_errno(e) (errno = (e))
#endif

File diff suppressed because it is too large


@@ -0,0 +1,403 @@
#ifndef ARGPARSE_C_H_
#define ARGPARSE_C_H_
/**
* Copyright (C) 2012-2015 Yecheng Fu <cofyc.jackson at gmail dot com>
* All rights reserved.
*
* Use of this source code is governed by a MIT-style license that can be found
* in the LICENSE file.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <errno.h>
#include "argparse.h"
#define OPT_UNSET 1
#define OPT_LONG (1 << 1)
/* We define these the same for all machines.
Changes from this to the outside world should be done in `_exit'. */
#define EXIT_FAILURE 1 /* Failing exit status. */
#define EXIT_SUCCESS 0 /* Successful exit status. */
static const char *
prefix_skip(const char *str, const char *prefix)
{
size_t len = strlen(prefix);
return strncmp(str, prefix, len) ? NULL : str + len;
}
static int
prefix_cmp(const char *str, const char *prefix)
{
for (;; str++, prefix++)
if (!*prefix) {
return 0;
} else if (*str != *prefix) {
return (unsigned char)*prefix - (unsigned char)*str;
}
}
static void
argparse_error(struct argparse *self, const struct argparse_option *opt,
const char *reason, int flags)
{
(void)self;
if (flags & OPT_LONG) {
printf("error: option `--%s` %s\n", opt->long_name, reason);
} else {
printf("error: option `-%c` %s\n", opt->short_name, reason);
}
exit(EXIT_FAILURE);
}
static int
argparse_getvalue(struct argparse *self, const struct argparse_option *opt,
int flags)
{
const char *s = NULL;
if (!opt->value)
goto skipped;
switch (opt->type) {
case ARGPARSE_OPT_BOOLEAN:
if (flags & OPT_UNSET) {
*(int *)opt->value = *(int *)opt->value - 1;
} else {
*(int *)opt->value = *(int *)opt->value + 1;
}
if (*(int *)opt->value < 0) {
*(int *)opt->value = 0;
}
break;
case ARGPARSE_OPT_BIT:
if (flags & OPT_UNSET) {
*(int *)opt->value &= ~opt->data;
} else {
*(int *)opt->value |= opt->data;
}
break;
case ARGPARSE_OPT_STRING:
if (self->optvalue) {
*(const char **)opt->value = self->optvalue;
self->optvalue = NULL;
} else if (self->argc > 1) {
self->argc--;
*(const char **)opt->value = *++self->argv;
} else {
argparse_error(self, opt, "requires a value", flags);
}
break;
case ARGPARSE_OPT_INTEGER: {
const char *arg = NULL; // text handed to strtol
// errno = 0;
if (self->optvalue) {
arg = self->optvalue;
self->optvalue = NULL;
} else if (self->argc > 1) {
self->argc--;
arg = *++self->argv;
} else {
argparse_error(self, opt, "requires a value", flags);
}
*(int *)opt->value = strtol(arg, (char **)&s, 0);
// if (errno == ERANGE)
// argparse_error(self, opt, "numerical result out of range", flags);
if (s == arg || s[0] != '\0') // no digits or trailing invalid characters
argparse_error(self, opt, "expects an integer value", flags);
break;
}
case ARGPARSE_OPT_FLOAT: {
const char *arg = NULL; // text handed to strtod
// errno = 0;
if (self->optvalue) {
arg = self->optvalue;
self->optvalue = NULL;
} else if (self->argc > 1) {
self->argc--;
arg = *++self->argv;
} else {
argparse_error(self, opt, "requires a value", flags);
}
*(float *)opt->value = strtod(arg, (char **)&s);
// if (errno == ERANGE)
// argparse_error(self, opt, "numerical result out of range", flags);
if (s == arg || s[0] != '\0') // no digits or trailing invalid characters
argparse_error(self, opt, "expects a numerical value", flags);
break;
}
default:
exit(EXIT_FAILURE);
}
skipped:
if (opt->callback) {
return opt->callback(self, opt);
}
return 0;
}
static void
argparse_options_check(const struct argparse_option *options)
{
for (; options->type != ARGPARSE_OPT_END; options++) {
switch (options->type) {
case ARGPARSE_OPT_END:
case ARGPARSE_OPT_BOOLEAN:
case ARGPARSE_OPT_BIT:
case ARGPARSE_OPT_INTEGER:
case ARGPARSE_OPT_FLOAT:
case ARGPARSE_OPT_STRING:
case ARGPARSE_OPT_GROUP:
continue;
default:
printf("wrong option type: %d\n", options->type);
break;
}
}
}
static int
argparse_short_opt(struct argparse *self, const struct argparse_option *options)
{
for (; options->type != ARGPARSE_OPT_END; options++) {
if (options->short_name == *self->optvalue) {
self->optvalue = self->optvalue[1] ? self->optvalue + 1 : NULL;
return argparse_getvalue(self, options, 0);
}
}
return -2;
}
static int
argparse_long_opt(struct argparse *self, const struct argparse_option *options)
{
for (; options->type != ARGPARSE_OPT_END; options++) {
const char *rest;
int opt_flags = 0;
if (!options->long_name)
continue;
rest = prefix_skip(self->argv[0] + 2, options->long_name);
if (!rest) {
// negation disabled?
if (options->flags & OPT_NONEG) {
continue;
}
// only OPT_BOOLEAN/OPT_BIT supports negation
if (options->type != ARGPARSE_OPT_BOOLEAN && options->type !=
ARGPARSE_OPT_BIT) {
continue;
}
if (prefix_cmp(self->argv[0] + 2, "no-")) {
continue;
}
rest = prefix_skip(self->argv[0] + 2 + 3, options->long_name);
if (!rest)
continue;
opt_flags |= OPT_UNSET;
}
if (*rest) {
if (*rest != '=')
continue;
self->optvalue = rest + 1;
}
return argparse_getvalue(self, options, opt_flags | OPT_LONG);
}
return -2;
}
int
argparse_init(struct argparse *self, struct argparse_option *options,
const char *const *usages, int flags)
{
memset(self, 0, sizeof(*self));
self->options = options;
self->usages = usages;
self->flags = flags;
self->description = NULL;
self->epilog = NULL;
return 0;
}
void
argparse_describe(struct argparse *self, const char *description,
const char *epilog)
{
self->description = description;
self->epilog = epilog;
}
int
argparse_parse(struct argparse *self, int argc, const char **argv)
{
self->argc = argc - 1;
self->argv = argv + 1;
self->out = argv;
argparse_options_check(self->options);
for (; self->argc; self->argc--, self->argv++) {
const char *arg = self->argv[0];
if (arg[0] != '-' || !arg[1]) {
if (self->flags & ARGPARSE_STOP_AT_NON_OPTION) {
goto end;
}
// if it's not option or is a single char '-', copy verbatim
self->out[self->cpidx++] = self->argv[0];
continue;
}
// short option
if (arg[1] != '-') {
self->optvalue = arg + 1;
switch (argparse_short_opt(self, self->options)) {
case -1:
break;
case -2:
goto unknown;
}
while (self->optvalue) {
switch (argparse_short_opt(self, self->options)) {
case -1:
break;
case -2:
goto unknown;
}
}
continue;
}
// if '--' presents
if (!arg[2]) {
self->argc--;
self->argv++;
break;
}
// long option
switch (argparse_long_opt(self, self->options)) {
case -1:
break;
case -2:
goto unknown;
}
continue;
unknown:
printf("error: unknown option `%s`\n", self->argv[0]);
argparse_usage(self);
if (!(self->flags & ARGPARSE_IGNORE_UNKNOWN_ARGS)) {
exit(EXIT_FAILURE);
}
}
end:
memmove(self->out + self->cpidx, self->argv,
self->argc * sizeof(*self->out));
self->out[self->cpidx + self->argc] = NULL;
return self->cpidx + self->argc;
}
void
argparse_usage(struct argparse *self)
{
if (self->usages) {
printf("Usage: %s\n", *self->usages++);
while (*self->usages && **self->usages)
printf(" or: %s\n", *self->usages++);
} else {
printf("Usage:\n");
}
// print description
if (self->description)
printf("%s\n", self->description);
putchar('\n');
const struct argparse_option *options;
// figure out best width
size_t usage_opts_width = 0;
size_t len;
options = self->options;
for (; options->type != ARGPARSE_OPT_END; options++) {
len = 0;
if ((options)->short_name) {
len += 2;
}
if ((options)->short_name && (options)->long_name) {
len += 2; // separator ", "
}
if ((options)->long_name) {
len += strlen((options)->long_name) + 2;
}
if (options->type == ARGPARSE_OPT_INTEGER) {
len += strlen("=<int>");
}
if (options->type == ARGPARSE_OPT_FLOAT) {
len += strlen("=<flt>");
} else if (options->type == ARGPARSE_OPT_STRING) {
len += strlen("=<str>");
}
len = (len + 3) - ((len + 3) & 3);
if (usage_opts_width < len) {
usage_opts_width = len;
}
}
usage_opts_width += 4; // 4 spaces prefix
options = self->options;
for (; options->type != ARGPARSE_OPT_END; options++) {
size_t pos = 0;
size_t pad = 0;
if (options->type == ARGPARSE_OPT_GROUP) {
putchar('\n');
printf("%s", options->help);
putchar('\n');
continue;
}
pos = printf("    "); // 4 spaces prefix, matches usage_opts_width above
if (options->short_name) {
pos += printf("-%c", options->short_name);
}
if (options->long_name && options->short_name) {
pos += printf(", ");
}
if (options->long_name) {
pos += printf("--%s", options->long_name);
}
if (options->type == ARGPARSE_OPT_INTEGER) {
pos += printf("=<int>");
} else if (options->type == ARGPARSE_OPT_FLOAT) {
pos += printf("=<flt>");
} else if (options->type == ARGPARSE_OPT_STRING) {
pos += printf("=<str>");
}
if (pos <= usage_opts_width) {
pad = usage_opts_width - pos;
} else {
putchar('\n');
pad = usage_opts_width;
}
// use the computed padding so help texts line up in one column
printf("%*s%s\n", (int)pad + 2, "", options->help);
}
// print epilog
if (self->epilog)
printf("%s\n", self->epilog);
}
int
argparse_help_cb_no_exit(struct argparse *self,
const struct argparse_option *option)
{
(void)option;
argparse_usage(self);
return (EXIT_SUCCESS);
}
int
argparse_help_cb(struct argparse *self, const struct argparse_option *option)
{
argparse_help_cb_no_exit(self, option);
exit(EXIT_SUCCESS);
}
#endif /* ARGPARSE_C_H_ */
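
The integer option handling above relies on `strtol` and its end pointer. A minimal, self-contained sketch of stricter validation follows; `parse_int_strict` and its return codes are hypothetical and not part of argparse. It assumes the caller wants to reject empty input, trailing characters, and values that do not fit in an `int`:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not part of argparse: parse `text` as an int.
 * Returns 0 on success, -1 if there are no digits or trailing characters,
 * -2 if the value does not fit in an int. */
static int parse_int_strict(const char *text, int *out)
{
    char *end = NULL;
    long value;

    assert(text != NULL && out != NULL);

    errno = 0;
    value = strtol(text, &end, 0); /* base 0 accepts 10, 0x1f and 017 */
    if (end == text || *end != '\0')
        return -1; /* nothing converted, or garbage after the number */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return -2; /* out of range for long or for int */
    *out = (int)value;
    return 0;
}
```

A caller could map `-1` to a message such as "expects an integer value" and `-2` to "numerical result out of range", the two messages that appear in the option handling above.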


@@ -0,0 +1,133 @@
/**
* Copyright (C) 2012-2015 Yecheng Fu <cofyc.jackson at gmail dot com>
* All rights reserved.
*
* Use of this source code is governed by a MIT-style license that can be found
* in the LICENSE file.
*/
#ifndef ARGPARSE_H
#define ARGPARSE_H
/* For c++ compatibility */
#ifdef __cplusplus
extern "C" {
#endif
#include <stdint.h>
struct argparse;
struct argparse_option;
typedef int argparse_callback (struct argparse *self,
const struct argparse_option *option);
enum argparse_flag {
ARGPARSE_STOP_AT_NON_OPTION = 1 << 0,
ARGPARSE_IGNORE_UNKNOWN_ARGS = 1 << 1,
};
enum argparse_option_type {
/* special */
ARGPARSE_OPT_END,
ARGPARSE_OPT_GROUP,
/* options with no arguments */
ARGPARSE_OPT_BOOLEAN,
ARGPARSE_OPT_BIT,
/* options with arguments (optional or required) */
ARGPARSE_OPT_INTEGER,
ARGPARSE_OPT_FLOAT,
ARGPARSE_OPT_STRING,
};
enum argparse_option_flags {
OPT_NONEG = 1, /* disable negation */
};
/**
* argparse option
*
* `type`:
* holds the type of the option, you must have an ARGPARSE_OPT_END last in your
* array.
*
* `short_name`:
* the character to use as a short option name, '\0' if none.
*
* `long_name`:
* the long option name, without the leading dash, NULL if none.
*
* `value`:
* stores pointer to the value to be filled.
*
* `help`:
* the short help message associated to what the option does.
* Must never be NULL (except for ARGPARSE_OPT_END).
*
* `callback`:
* function is called when corresponding argument is parsed.
*
* `data`:
* associated data. Callbacks can use it like they want.
*
* `flags`:
* option flags.
*/
struct argparse_option {
enum argparse_option_type type;
const char short_name;
const char *long_name;
void *value;
const char *help;
argparse_callback *callback;
intptr_t data;
int flags;
};
/**
* argparse
*/
struct argparse {
// user supplied
const struct argparse_option *options;
const char *const *usages;
int flags;
const char *description; // a description after usage
const char *epilog; // a description at the end
// internal context
int argc;
const char **argv;
const char **out;
int cpidx;
const char *optvalue; // current option value
};
// built-in callbacks
int argparse_help_cb(struct argparse *self,
const struct argparse_option *option);
int argparse_help_cb_no_exit(struct argparse *self,
const struct argparse_option *option);
// built-in option macros
#define OPT_END() { ARGPARSE_OPT_END, 0, NULL, NULL, 0, NULL, 0, 0 }
#define OPT_BOOLEAN(...) { ARGPARSE_OPT_BOOLEAN, __VA_ARGS__ }
#define OPT_BIT(...) { ARGPARSE_OPT_BIT, __VA_ARGS__ }
#define OPT_INTEGER(...) { ARGPARSE_OPT_INTEGER, __VA_ARGS__ }
#define OPT_FLOAT(...) { ARGPARSE_OPT_FLOAT, __VA_ARGS__ }
#define OPT_STRING(...) { ARGPARSE_OPT_STRING, __VA_ARGS__ }
#define OPT_GROUP(h) { ARGPARSE_OPT_GROUP, 0, NULL, NULL, h, NULL, 0, 0 }
#define OPT_HELP() OPT_BOOLEAN('h', "help", NULL, \
"show this help message and exit", \
argparse_help_cb, 0, OPT_NONEG)
int argparse_init(struct argparse *self, struct argparse_option *options,
const char *const *usages, int flags);
void argparse_describe(struct argparse *self, const char *description,
const char *epilog);
int argparse_parse(struct argparse *self, int argc, const char **argv);
void argparse_usage(struct argparse *self);
#ifdef __cplusplus
}
#endif
#endif
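
The `OPT_NONEG` flag above controls whether a boolean or bit option may be switched off with a `--no-` prefix. A minimal sketch of that matching rule, written independently of the library; `match_long_option`, `enum match_kind`, and the `allow_negation` parameter are hypothetical names for illustration, and the real parser may differ in details:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
enum match_kind { MATCH_NONE = 0, MATCH_SET = 1, MATCH_UNSET = 2 };

/* Match `arg` (for example "--verbose", "--no-verbose" or "--level=3")
 * against `long_name`. On a match, *value points at the text after '='
 * or is NULL when no value is attached. */
static enum match_kind match_long_option(const char *arg,
                                         const char *long_name,
                                         int allow_negation,
                                         const char **value)
{
    size_t name_len = strlen(long_name);
    enum match_kind kind = MATCH_SET;
    const char *rest;

    assert(value != NULL);
    *value = NULL;
    if (strncmp(arg, "--", 2) != 0)
        return MATCH_NONE;
    rest = arg + 2;
    if (strncmp(rest, long_name, name_len) != 0) {
        /* Not a direct match: try the negated "no-<name>" spelling. */
        if (!allow_negation || strncmp(rest, "no-", 3) != 0)
            return MATCH_NONE;
        rest += 3;
        if (strncmp(rest, long_name, name_len) != 0)
            return MATCH_NONE;
        kind = MATCH_UNSET;
    }
    rest += name_len;
    if (*rest == '=')
        *value = rest + 1;  /* "--name=value" form */
    else if (*rest != '\0')
        return MATCH_NONE;  /* longer name that merely shares a prefix */
    return kind;
}
```

Passing `allow_negation = 0` corresponds to an option declared with `OPT_NONEG`, or to an option type that takes a value.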

File diff suppressed because it is too large


@@ -0,0 +1,358 @@
/*
Copyright (c) 2009-2017 Dave Gamble and cJSON contributors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
A header only cJSON library for C and C++.
*/
#ifndef cJSON__h
#define cJSON__h
#ifdef __cplusplus
extern "C" {
#endif
#if !defined(__WINDOWS__) \
&& (defined(WIN32) || defined(WIN64) || defined(_MSC_VER) \
|| defined(_WIN32))
#define __WINDOWS__
#endif
#ifdef __WINDOWS__
/**
* When compiling for windows, we specify a specific calling convention to avoid
* issues where we are being called from a project with a different default
* calling convention. For windows you have 3 define options:
* CJSON_HIDE_SYMBOLS - Define this in the case where you don't want to ever
* dllexport symbols
* CJSON_EXPORT_SYMBOLS - Define this on library build when you want to
* dllexport symbols (default)
* CJSON_IMPORT_SYMBOLS - Define this if you want to dllimport symbol
*
* For *nix builds that support visibility attribute, you can define similar
* behavior by setting default visibility to hidden by adding
* -fvisibility=hidden (for gcc)
* or
* -xldscope=hidden (for sun cc)
* to CFLAGS, then using the CJSON_API_VISIBILITY flag to "export" the same
* symbols the way CJSON_EXPORT_SYMBOLS does
*/
#define CJSON_CDECL __cdecl
#define CJSON_STDCALL __stdcall
/* export symbols by default, this is necessary for copy pasting the C and
header file */
#if !defined(CJSON_HIDE_SYMBOLS) && !defined(CJSON_IMPORT_SYMBOLS) \
&& !defined(CJSON_EXPORT_SYMBOLS)
#define CJSON_EXPORT_SYMBOLS
#endif
#if defined(CJSON_HIDE_SYMBOLS)
#define CJSON_PUBLIC(type) type CJSON_STDCALL
#elif defined(CJSON_EXPORT_SYMBOLS)
#define CJSON_PUBLIC(type) __declspec(dllexport) type CJSON_STDCALL
#elif defined(CJSON_IMPORT_SYMBOLS)
#define CJSON_PUBLIC(type) __declspec(dllimport) type CJSON_STDCALL
#endif
#else /* !__WINDOWS__ */
#define CJSON_CDECL
#define CJSON_STDCALL
#if (defined(__GNUC__) || defined(__SUNPRO_CC) || defined(__SUNPRO_C)) \
&& defined(CJSON_API_VISIBILITY)
#define CJSON_PUBLIC(type) __attribute__((visibility("default"))) type
#else
#define CJSON_PUBLIC(type) type
#endif
#endif
/* project version */
#define CJSON_VERSION_MAJOR 1
#define CJSON_VERSION_MINOR 7
#define CJSON_VERSION_PATCH 10
#include <stddef.h>
/* cJSON Types: */
#define cJSON_Invalid (0)
#define cJSON_False (1 << 0)
#define cJSON_True (1 << 1)
#define cJSON_NULL (1 << 2)
#define cJSON_Number (1 << 3)
#define cJSON_String (1 << 4)
#define cJSON_Array (1 << 5)
#define cJSON_Object (1 << 6)
#define cJSON_Raw (1 << 7) /* raw json */
#define cJSON_IsReference 256
#define cJSON_StringIsConst 512
/* The cJSON structure: */
typedef struct cJSON {
/* next/prev allow you to walk array/object chains. Alternatively, use
GetArraySize/GetArrayItem/GetObjectItem */
struct cJSON *next;
struct cJSON *prev;
/* An array or object item will have a child pointer pointing to a chain of
the items in the array/object. */
struct cJSON *child;
/* The type of the item, as above. */
int type;
/* The item's string, if type==cJSON_String or type==cJSON_Raw */
char *valuestring;
/* writing to valueint is DEPRECATED, use cJSON_SetNumberValue instead */
int valueint;
/* The item's number, if type==cJSON_Number */
double valuedouble;
/* The item's name string, if this item is the child of, or is in the list
of subitems of an object. */
char *string;
} cJSON;
typedef struct cJSON_Hooks {
/* malloc/free are CDECL on Windows regardless of the default calling
* convention of the compiler, so ensure the hooks allow passing those
* functions directly. */
void *(CJSON_CDECL *malloc_fn)(size_t sz);
void(CJSON_CDECL *free_fn)(void *ptr);
} cJSON_Hooks;
typedef int cJSON_bool;
/* Limits how deeply nested arrays/objects can be before cJSON refuses to parse
them. This is to prevent stack overflows. */
#ifndef CJSON_NESTING_LIMIT
#define CJSON_NESTING_LIMIT 1000
#endif
/* returns the version of cJSON as a string */
CJSON_PUBLIC(const char *) cJSON_Version(void);
/* Supply malloc and free functions to cJSON */
CJSON_PUBLIC(void) cJSON_InitHooks(cJSON_Hooks *hooks);
/* Memory Management: the caller is always responsible to free the results from
* all variants of cJSON_Parse (with cJSON_Delete) and cJSON_Print (with stdlib
* free, cJSON_Hooks.free_fn, or cJSON_free as appropriate). The exception is
* cJSON_PrintPreallocated, where the caller has full responsibility of the
* buffer. */
/* Supply a block of JSON, and this returns a cJSON object you can interrogate.
*/
CJSON_PUBLIC(cJSON *) cJSON_Parse(const char *value);
/* ParseWithOpts allows you to require (and check) that the JSON is null
* terminated, and to retrieve the pointer to the final byte parsed. */
/* If you supply a ptr in return_parse_end and parsing fails, then
* return_parse_end will contain a pointer to the error so will match
* cJSON_GetErrorPtr(). */
CJSON_PUBLIC(cJSON *)
cJSON_ParseWithOpts(const char *value, const char **return_parse_end,
cJSON_bool require_null_terminated);
/* Render a cJSON entity to text for transfer/storage. */
CJSON_PUBLIC(char *) cJSON_Print(const cJSON *item);
/* Render a cJSON entity to text for transfer/storage without any formatting. */
CJSON_PUBLIC(char *) cJSON_PrintUnformatted(const cJSON *item);
/* Render a cJSON entity to text using a buffered strategy. prebuffer is a guess
* at the final size. guessing well reduces reallocation. fmt=0 gives
* unformatted, =1 gives formatted */
CJSON_PUBLIC(char *)
cJSON_PrintBuffered(const cJSON *item, int prebuffer, cJSON_bool fmt);
/* Render a cJSON entity to text using a buffer already allocated in memory with
* given length. Returns 1 on success and 0 on failure. */
/* NOTE: cJSON is not always 100% accurate in estimating how much memory it will
* use, so to be safe allocate 5 bytes more than you actually need */
CJSON_PUBLIC(cJSON_bool)
cJSON_PrintPreallocated(cJSON *item, char *buffer, const int length,
const cJSON_bool format);
/* Delete a cJSON entity and all subentities. */
CJSON_PUBLIC(void) cJSON_Delete(cJSON *c);
/* Returns the number of items in an array (or object). */
CJSON_PUBLIC(int) cJSON_GetArraySize(const cJSON *array);
/* Retrieve item number "index" from array "array". Returns NULL if
* unsuccessful. */
CJSON_PUBLIC(cJSON *) cJSON_GetArrayItem(const cJSON *array, int index);
/* Get item "string" from object. Case insensitive. */
CJSON_PUBLIC(cJSON *)
cJSON_GetObjectItem(const cJSON *const object, const char *const string);
CJSON_PUBLIC(cJSON *)
cJSON_GetObjectItemCaseSensitive(const cJSON *const object,
const char *const string);
CJSON_PUBLIC(cJSON_bool)
cJSON_HasObjectItem(const cJSON *object, const char *string);
/* For analysing failed parses. This returns a pointer to the parse error.
* You'll probably need to look a few chars back to make sense of it. Defined
* when cJSON_Parse() returns 0. 0 when cJSON_Parse() succeeds. */
CJSON_PUBLIC(const char *) cJSON_GetErrorPtr(void);
/* Check if the item is a string and return its valuestring */
CJSON_PUBLIC(char *) cJSON_GetStringValue(cJSON *item);
/* These functions check the type of an item */
CJSON_PUBLIC(cJSON_bool) cJSON_IsInvalid(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsFalse(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsTrue(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsBool(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsNull(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsNumber(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsString(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsArray(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsObject(const cJSON *const item);
CJSON_PUBLIC(cJSON_bool) cJSON_IsRaw(const cJSON *const item);
/* These calls create a cJSON item of the appropriate type. */
CJSON_PUBLIC(cJSON *) cJSON_CreateNull(void);
CJSON_PUBLIC(cJSON *) cJSON_CreateTrue(void);
CJSON_PUBLIC(cJSON *) cJSON_CreateFalse(void);
CJSON_PUBLIC(cJSON *) cJSON_CreateBool(cJSON_bool boolean);
CJSON_PUBLIC(cJSON *) cJSON_CreateNumber(double num);
CJSON_PUBLIC(cJSON *) cJSON_CreateString(const char *string);
/* raw json */
CJSON_PUBLIC(cJSON *) cJSON_CreateRaw(const char *raw);
CJSON_PUBLIC(cJSON *) cJSON_CreateArray(void);
CJSON_PUBLIC(cJSON *) cJSON_CreateObject(void);
/* Create a string where valuestring references a string so
it will not be freed by cJSON_Delete */
CJSON_PUBLIC(cJSON *) cJSON_CreateStringReference(const char *string);
/* Create an object/array that only references its elements so
they will not be freed by cJSON_Delete */
CJSON_PUBLIC(cJSON *) cJSON_CreateObjectReference(const cJSON *child);
CJSON_PUBLIC(cJSON *) cJSON_CreateArrayReference(const cJSON *child);
/* These utilities create an Array of count items. */
CJSON_PUBLIC(cJSON *) cJSON_CreateIntArray(const int *numbers, int count);
CJSON_PUBLIC(cJSON *) cJSON_CreateFloatArray(const float *numbers, int count);
CJSON_PUBLIC(cJSON *) cJSON_CreateDoubleArray(const double *numbers, int count);
CJSON_PUBLIC(cJSON *) cJSON_CreateStringArray(const char **strings, int count);
/* Append item to the specified array/object. */
CJSON_PUBLIC(cJSON_bool) cJSON_AddItemToArray(cJSON *array, cJSON *item);
CJSON_PUBLIC(cJSON_bool)
cJSON_AddItemToObject(cJSON *object, const char *string, cJSON *item);
/* Use this when string is definitely const (i.e. a literal, or as good as), and
* will definitely survive the cJSON object. WARNING: When this function was
* used, make sure to always check that (item->type & cJSON_StringIsConst) is
* zero before writing to `item->string` */
CJSON_PUBLIC(cJSON_bool)
cJSON_AddItemToObjectCS(cJSON *object, const char *string, cJSON *item);
/* Append reference to item to the specified array/object. Use this when you
* want to add an existing cJSON to a new cJSON, but don't want to corrupt your
* existing cJSON. */
CJSON_PUBLIC(cJSON_bool)
cJSON_AddItemReferenceToArray(cJSON *array, cJSON *item);
CJSON_PUBLIC(cJSON_bool)
cJSON_AddItemReferenceToObject(cJSON *object, const char *string, cJSON *item);
/* Remove/Detach items from Arrays/Objects. */
CJSON_PUBLIC(cJSON *)
cJSON_DetachItemViaPointer(cJSON *parent, cJSON *const item);
CJSON_PUBLIC(cJSON *) cJSON_DetachItemFromArray(cJSON *array, int which);
CJSON_PUBLIC(void) cJSON_DeleteItemFromArray(cJSON *array, int which);
CJSON_PUBLIC(cJSON *)
cJSON_DetachItemFromObject(cJSON *object, const char *string);
CJSON_PUBLIC(cJSON *)
cJSON_DetachItemFromObjectCaseSensitive(cJSON *object, const char *string);
CJSON_PUBLIC(void)
cJSON_DeleteItemFromObject(cJSON *object, const char *string);
CJSON_PUBLIC(void)
cJSON_DeleteItemFromObjectCaseSensitive(cJSON *object, const char *string);
/* Update array items. */
CJSON_PUBLIC(cJSON_bool)
cJSON_InsertItemInArray(
cJSON *array, int which,
cJSON *newitem); /* Shifts pre-existing items to the right. */
CJSON_PUBLIC(cJSON_bool)
cJSON_ReplaceItemViaPointer(cJSON *const parent, cJSON *const item,
cJSON *replacement);
CJSON_PUBLIC(void)
cJSON_ReplaceItemInArray(cJSON *array, int which, cJSON *newitem);
CJSON_PUBLIC(void)
cJSON_ReplaceItemInObject(cJSON *object, const char *string, cJSON *newitem);
CJSON_PUBLIC(void)
cJSON_ReplaceItemInObjectCaseSensitive(cJSON *object, const char *string,
cJSON *newitem);
/* Duplicate a cJSON item */
CJSON_PUBLIC(cJSON *) cJSON_Duplicate(const cJSON *item, cJSON_bool recurse);
/* Duplicate will create a new, identical cJSON item to the one you pass, in new
memory that will need to be released. With recurse!=0, it will duplicate any
children connected to the item. The item->next and ->prev pointers are always
zero on return from Duplicate. */
/* Recursively compare two cJSON items for equality. If either a or b is NULL or
* invalid, they will be considered unequal.
* case_sensitive determines if object keys are treated case sensitive (1) or
* case insensitive (0) */
CJSON_PUBLIC(cJSON_bool)
cJSON_Compare(const cJSON *const a, const cJSON *const b,
const cJSON_bool case_sensitive);
CJSON_PUBLIC(void) cJSON_Minify(char *json);
/* Helper functions for creating and adding items to an object at the same time.
They return the added item or NULL on failure. */
CJSON_PUBLIC(cJSON *)
cJSON_AddNullToObject(cJSON *const object, const char *const name);
CJSON_PUBLIC(cJSON *)
cJSON_AddTrueToObject(cJSON *const object, const char *const name);
CJSON_PUBLIC(cJSON *)
cJSON_AddFalseToObject(cJSON *const object, const char *const name);
CJSON_PUBLIC(cJSON *)
cJSON_AddBoolToObject(cJSON *const object, const char *const name,
const cJSON_bool boolean);
CJSON_PUBLIC(cJSON *)
cJSON_AddNumberToObject(cJSON *const object, const char *const name,
const double number);
CJSON_PUBLIC(cJSON *)
cJSON_AddStringToObject(cJSON *const object, const char *const name,
const char *const string);
CJSON_PUBLIC(cJSON *)
cJSON_AddRawToObject(cJSON *const object, const char *const name,
const char *const raw);
CJSON_PUBLIC(cJSON *)
cJSON_AddObjectToObject(cJSON *const object, const char *const name);
CJSON_PUBLIC(cJSON *)
cJSON_AddArrayToObject(cJSON *const object, const char *const name);
/* When assigning an integer value, it needs to be propagated to valuedouble
too. */
#define cJSON_SetIntValue(object, number) \
((object) ? (object)->valueint = (object)->valuedouble = (number) \
: (number))
/* helper for the cJSON_SetNumberValue macro */
CJSON_PUBLIC(double) cJSON_SetNumberHelper(cJSON *object, double number);
#define cJSON_SetNumberValue(object, number) \
((object != NULL) ? cJSON_SetNumberHelper(object, (double)number) \
: (number))
/* Macro for iterating over an array or object */
#define cJSON_ArrayForEach(element, array) \
for (element = (array != NULL) ? (array)->child : NULL; element != NULL; \
element = element->next)
/* malloc/free objects using the malloc/free functions that have been set with
cJSON_InitHooks */
CJSON_PUBLIC(void *) cJSON_malloc(size_t size);
CJSON_PUBLIC(void) cJSON_free(void *object);
#endif


@@ -0,0 +1,40 @@
#ifndef ENTRY_H_
#define ENTRY_H_
// header-only helpers for developing wasm apps
#include "cJSON/cJSON.c"
#include "helpers.h"
#define MAX_ARGS 32
int main(int argc, char **argv);
int bpf_main(char *env_json, int str_len)
{
(void)str_len; // cJSON_Parse expects a NUL-terminated string
cJSON *env = cJSON_Parse(env_json);
if (!env)
{
printf("cJSON_Parse failed for env json args.\n");
return 1;
}
if (!cJSON_IsArray(env)) {
printf("env json args is not an array.\n");
cJSON_Delete(env);
return 1;
}
int argc = cJSON_GetArraySize(env);
if (argc > MAX_ARGS) {
printf("env json args is too long.\n");
cJSON_Delete(env);
return 1;
}
char *argv[MAX_ARGS + 1]; // one extra slot for the terminating NULL
for (int i = 0; i < argc; i++) {
cJSON *item = cJSON_GetArrayItem(env, i);
if (!cJSON_IsString(item)) {
printf("env json args is not a string.\n");
cJSON_Delete(env);
return 1;
}
argv[i] = item->valuestring;
}
argv[argc] = NULL;
// argv points into env, so free it only after main returns
int ret = main(argc, argv);
cJSON_Delete(env);
return ret;
}
#endif


@@ -0,0 +1,40 @@
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef _ASM_GENERIC_ERRNO_BASE_H
#define _ASM_GENERIC_ERRNO_BASE_H
#define EPERM 1 /* Operation not permitted */
#define ENOENT 2 /* No such file or directory */
#define ESRCH 3 /* No such process */
#define EINTR 4 /* Interrupted system call */
#define EIO 5 /* I/O error */
#define ENXIO 6 /* No such device or address */
#define E2BIG 7 /* Argument list too long */
#define ENOEXEC 8 /* Exec format error */
#define EBADF 9 /* Bad file number */
#define ECHILD 10 /* No child processes */
#define EAGAIN 11 /* Try again */
#define ENOMEM 12 /* Out of memory */
#define EACCES 13 /* Permission denied */
#define EFAULT 14 /* Bad address */
#define ENOTBLK 15 /* Block device required */
#define EBUSY 16 /* Device or resource busy */
#define EEXIST 17 /* File exists */
#define EXDEV 18 /* Cross-device link */
#define ENODEV 19 /* No such device */
#define ENOTDIR 20 /* Not a directory */
#define EISDIR 21 /* Is a directory */
#define EINVAL 22 /* Invalid argument */
#define ENFILE 23 /* File table overflow */
#define EMFILE 24 /* Too many open files */
#define ENOTTY 25 /* Not a typewriter */
#define ETXTBSY 26 /* Text file busy */
#define EFBIG 27 /* File too large */
#define ENOSPC 28 /* No space left on device */
#define ESPIPE 29 /* Illegal seek */
#define EROFS 30 /* Read-only file system */
#define EMLINK 31 /* Too many links */
#define EPIPE 32 /* Broken pipe */
#define EDOM 33 /* Math argument out of domain of func */
#define ERANGE 34 /* Math result not representable */
#endif


@@ -0,0 +1,54 @@
#ifndef EWASM_APP_HELPERS_H_
#define EWASM_APP_HELPERS_H_
#include "native-ewasm.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include "cJSON/cJSON.h"
/// @brief start the eBPF program with JSON and wait for it to exit
/// @param program_data the json data of eBPF program
/// @return 0 on success, -1 on failure, the eBPF program will be terminated in failure case
int
start_bpf_program(char *program_data)
{
int res = create_bpf(program_data, strlen(program_data));
if (res < 0) {
printf("create_bpf failed %d\n", res);
return -1;
}
res = run_bpf(res);
if (res < 0) {
printf("run_bpf failed %d\n", res);
return -1;
}
res = wait_and_poll_bpf(res);
if (res < 0) {
printf("wait_and_poll_bpf failed %d\n", res);
return -1;
}
return 0;
}
/// @brief set a global variable of the bpf program to the given value
/// @param program the json program data
/// @param key name of the global variable
/// @param value value to assign to the variable
/// @return the updated eBPF program json
cJSON *
set_bpf_program_global_var(cJSON *program, char *key, cJSON *value)
{
cJSON *args = cJSON_GetObjectItem(program, "runtime_args");
if (args == NULL)
{
args = cJSON_CreateObject();
cJSON_AddItemToObject(program, "runtime_args", args);
}
cJSON_AddItemToObject(args, key, value);
return program;
}
#endif // EWASM_APP_HELPERS_H_


@@ -0,0 +1,50 @@
#ifndef EWASM_NATIVE_API_H_
#define EWASM_NATIVE_API_H_
/// C function interface to be called from wasm
#ifdef __cplusplus
extern "C" {
#endif
/// @brief create an eBPF program from JSON data
/// @param ebpf_json
/// @return id on success, -1 on failure
int
create_bpf(char *ebpf_json, int str_len);
/// @brief start running the eBPF program
/// @details load and attach the eBPF program to the kernel to run it.
/// If the program has maps that export data to user space, you also need to
/// call wait_and_poll_bpf.
int
run_bpf(int id);
/// @brief wait for the program to exit and receive data from export maps and
/// print the data
/// @details if the program has a ring buffer or perf event to export data
/// to user space, the program will help load the map info and poll the
/// events automatically.
int
wait_and_poll_bpf(int id);
#ifdef __cplusplus
}
#endif
/// @brief init the eBPF program
/// @param env_json the env config from input
/// @return 0 on success, -1 on failure, the eBPF program will be terminated in
/// failure case
int
bpf_main(char *env_json, int str_len);
/// @brief handle the event output from the eBPF program, valid only when
/// wait_and_poll_bpf is called
/// @param ctx user defined context
/// @param e json event message
/// @return 0 on success, -1 on failure,
/// the event will be sent to the next handler in the chain on success, or
/// dropped on failure
int
process_event(int ctx, char *e, int str_len);
#endif // EWASM_NATIVE_API_H_


@@ -0,0 +1,195 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/* THIS FILE IS AUTOGENERATED BY BPFTOOL! */
#ifndef __SIGSNOOP_BPF_SKEL_H__
#define __SIGSNOOP_BPF_SKEL_H__
extern int errno;
#include <stdlib.h>
struct bpf_object_skeleton;
struct bpf_object;
struct bpf_map;
struct bpf_program;
struct bpf_object_open_opts;
struct bpf_link;
struct sigsnoop_bpf {
struct bpf_object_skeleton *skeleton;
struct bpf_object *obj;
struct {
struct bpf_map *events;
struct bpf_map *values;
struct bpf_map *rodata;
} maps;
struct {
struct bpf_program *kill_entry;
struct bpf_program *kill_exit;
struct bpf_program *tkill_entry;
struct bpf_program *tkill_exit;
struct bpf_program *tgkill_entry;
struct bpf_program *tgkill_exit;
struct bpf_program *sig_trace;
} progs;
struct {
struct bpf_link *kill_entry;
struct bpf_link *kill_exit;
struct bpf_link *tkill_entry;
struct bpf_link *tkill_exit;
struct bpf_link *tgkill_entry;
struct bpf_link *tgkill_exit;
struct bpf_link *sig_trace;
} links;
struct sigsnoop_bpf__rodata {
int filtered_pid;
int target_signal;
bool failed_only;
} *rodata;
#ifdef __cplusplus
static inline struct sigsnoop_bpf *open(const struct bpf_object_open_opts *opts = nullptr);
static inline struct sigsnoop_bpf *open_and_load();
static inline int load(struct sigsnoop_bpf *skel);
static inline int attach(struct sigsnoop_bpf *skel);
static inline void detach(struct sigsnoop_bpf *skel);
static inline void destroy(struct sigsnoop_bpf *skel);
static inline const void *elf_bytes(size_t *sz);
#endif /* __cplusplus */
};
static void
sigsnoop_bpf__destroy(struct sigsnoop_bpf *obj)
{
}
static inline int
sigsnoop_bpf__create_skeleton(struct sigsnoop_bpf *obj);
static inline struct sigsnoop_bpf *
sigsnoop_bpf__open_opts(const struct bpf_object_open_opts *opts)
{
struct sigsnoop_bpf *obj;
int err;
obj = (struct sigsnoop_bpf *)calloc(1, sizeof(*obj));
if (!obj) {
errno = ENOMEM;
return NULL;
}
return obj;
}
static inline struct sigsnoop_bpf *
sigsnoop_bpf__open(void)
{
return sigsnoop_bpf__open_opts(NULL);
}
static inline int
sigsnoop_bpf__load(struct sigsnoop_bpf *obj)
{
return 0;
}
static inline struct sigsnoop_bpf *
sigsnoop_bpf__open_and_load(void)
{
return NULL;
}
static inline int
sigsnoop_bpf__attach(struct sigsnoop_bpf *obj)
{
return 0;
}
static inline void
sigsnoop_bpf__detach(struct sigsnoop_bpf *obj)
{
}
static inline const void *sigsnoop_bpf__elf_bytes(size_t *sz);
static inline int
sigsnoop_bpf__create_skeleton(struct sigsnoop_bpf *obj)
{
return 0;
}
#ifdef __cplusplus
struct sigsnoop_bpf *sigsnoop_bpf::open(const struct bpf_object_open_opts *opts) { return sigsnoop_bpf__open_opts(opts); }
struct sigsnoop_bpf *sigsnoop_bpf::open_and_load() { return sigsnoop_bpf__open_and_load(); }
int sigsnoop_bpf::load(struct sigsnoop_bpf *skel) { return sigsnoop_bpf__load(skel); }
int sigsnoop_bpf::attach(struct sigsnoop_bpf *skel) { return sigsnoop_bpf__attach(skel); }
void sigsnoop_bpf::detach(struct sigsnoop_bpf *skel) { sigsnoop_bpf__detach(skel); }
void sigsnoop_bpf::destroy(struct sigsnoop_bpf *skel) { sigsnoop_bpf__destroy(skel); }
const void *sigsnoop_bpf::elf_bytes(size_t *sz) { return sigsnoop_bpf__elf_bytes(sz); }
#endif /* __cplusplus */
__attribute__((unused)) static void
sigsnoop_bpf__assert(struct sigsnoop_bpf *s __attribute__((unused)))
{
#ifdef __cplusplus
#define _Static_assert static_assert
#endif
_Static_assert(sizeof(s->rodata->filtered_pid) == 4, "unexpected size of 'filtered_pid'");
_Static_assert(sizeof(s->rodata->target_signal) == 4, "unexpected size of 'target_signal'");
_Static_assert(sizeof(s->rodata->failed_only) == 1, "unexpected size of 'failed_only'");
#ifdef __cplusplus
#undef _Static_assert
#endif
}
struct perf_buffer;
void perf_buffer__free(struct perf_buffer *pb) {
}
int perf_buffer__poll(struct perf_buffer *pb, int timeout_ms) {
return start_bpf_program(program_data);
}
int bpf_program__set_autoload(struct bpf_program *prog, bool autoload) {
return 0;
}
char* strerror(int errnum) {
return "error";
}
int bpf_map__fd(const struct bpf_map *map) {
return 0;
}
typedef void (*perf_buffer_sample_fn)(void *ctx, int cpu,
void *data, unsigned int size);
typedef void (*perf_buffer_lost_fn)(void *ctx, int cpu, unsigned long long cnt);
struct perf_buffer;
perf_buffer_sample_fn global_cb;
struct perf_buffer_opts;
struct perf_buffer *
perf_buffer__new(int map_fd, size_t page_cnt,
perf_buffer_sample_fn sample_cb, perf_buffer_lost_fn lost_cb, void *ctx,
const struct perf_buffer_opts *opts) {
global_cb = sample_cb;
return (void*)1;
}
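int process_event(int ctx, char *e, int str_len)
{
struct event eve = {0};
cJSON *json = cJSON_Parse(e);
/* drop the event if the JSON message cannot be parsed */
if (!json)
return -1;
eve.sig = cJSON_GetObjectItem(json, "sig")->valueint;
eve.pid = cJSON_GetObjectItem(json, "pid")->valueint;
/* bounded copy: eve is zero-initialized, so comm stays NUL-terminated */
strncpy(eve.comm, cJSON_GetObjectItem(json, "comm")->valuestring, sizeof(eve.comm) - 1);
eve.tpid = cJSON_GetObjectItem(json, "tpid")->valueint;
eve.ret = cJSON_GetObjectItem(json, "ret")->valueint;
global_cb((void*)ctx, 0, &eve, str_len);
/* release the parsed JSON tree to avoid leaking it on every event */
cJSON_Delete(json);
return 0;
}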
extern const char argp_program_doc[];
void argp_state_help(const struct argp_state *__state, int flag) {
printf("%s", argp_program_doc);
exit(0);
}
#endif /* __SIGSNOOP_BPF_SKEL_H__ */


@@ -0,0 +1,8 @@
#ifndef EWASM_EWASM_APP_H_
#define EWASM_EWASM_APP_H_
// header-only helpers for developing wasm apps
#include "cJSON/cJSON.c"
#include "helpers.h"
#endif // EWASM_EWASM_APP_H_

6-sigsnoop/sigsnoop.bpf.c Executable file

@@ -0,0 +1,145 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
/* Copyright (c) 2021~2022 Hengqi Chen */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "sigsnoop.h"
#define MAX_ENTRIES 10240
const volatile pid_t filtered_pid = 0;
const volatile int target_signal = 0;
const volatile bool failed_only = false;
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, MAX_ENTRIES);
__type(key, __u32);
__type(value, struct event);
} values SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u32));
} events SEC(".maps");
static int probe_entry(pid_t tpid, int sig)
{
struct event event = {};
__u64 pid_tgid;
__u32 pid, tid;
if (target_signal && sig != target_signal)
return 0;
pid_tgid = bpf_get_current_pid_tgid();
pid = pid_tgid >> 32;
tid = (__u32)pid_tgid;
if (filtered_pid && pid != filtered_pid)
return 0;
event.pid = pid;
event.tpid = tpid;
event.sig = sig;
bpf_get_current_comm(event.comm, sizeof(event.comm));
bpf_map_update_elem(&values, &tid, &event, BPF_ANY);
return 0;
}
static int probe_exit(void *ctx, int ret)
{
__u64 pid_tgid = bpf_get_current_pid_tgid();
__u32 tid = (__u32)pid_tgid;
struct event *eventp;
eventp = bpf_map_lookup_elem(&values, &tid);
if (!eventp)
return 0;
if (failed_only && ret >= 0)
goto cleanup;
eventp->ret = ret;
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, eventp, sizeof(*eventp));
cleanup:
bpf_map_delete_elem(&values, &tid);
return 0;
}
SEC("tracepoint/syscalls/sys_enter_kill")
int kill_entry(struct trace_event_raw_sys_enter *ctx)
{
pid_t tpid = (pid_t)ctx->args[0];
int sig = (int)ctx->args[1];
return probe_entry(tpid, sig);
}
SEC("tracepoint/syscalls/sys_exit_kill")
int kill_exit(struct trace_event_raw_sys_exit *ctx)
{
return probe_exit(ctx, ctx->ret);
}
SEC("tracepoint/syscalls/sys_enter_tkill")
int tkill_entry(struct trace_event_raw_sys_enter *ctx)
{
pid_t tpid = (pid_t)ctx->args[0];
int sig = (int)ctx->args[1];
return probe_entry(tpid, sig);
}
SEC("tracepoint/syscalls/sys_exit_tkill")
int tkill_exit(struct trace_event_raw_sys_exit *ctx)
{
return probe_exit(ctx, ctx->ret);
}
SEC("tracepoint/syscalls/sys_enter_tgkill")
int tgkill_entry(struct trace_event_raw_sys_enter *ctx)
{
pid_t tpid = (pid_t)ctx->args[1];
int sig = (int)ctx->args[2];
return probe_entry(tpid, sig);
}
SEC("tracepoint/syscalls/sys_exit_tgkill")
int tgkill_exit(struct trace_event_raw_sys_exit *ctx)
{
return probe_exit(ctx, ctx->ret);
}
SEC("tracepoint/signal/signal_generate")
int sig_trace(struct trace_event_raw_signal_generate *ctx)
{
struct event event = {};
pid_t tpid = ctx->pid;
int ret = ctx->errno;
int sig = ctx->sig;
__u64 pid_tgid;
__u32 pid;
if (failed_only && ret == 0)
return 0;
if (target_signal && sig != target_signal)
return 0;
pid_tgid = bpf_get_current_pid_tgid();
pid = pid_tgid >> 32;
if (filtered_pid && pid != filtered_pid)
return 0;
event.pid = pid;
event.tpid = tpid;
event.sig = sig;
event.ret = ret;
bpf_get_current_comm(event.comm, sizeof(event.comm));
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";

6-sigsnoop/sigsnoop.h Executable file

@@ -0,0 +1,16 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
/* Copyright (c) 2021~2022 Hengqi Chen */
#ifndef __SIGSNOOP_H
#define __SIGSNOOP_H
#define TASK_COMM_LEN 16
struct event {
unsigned int pid;
unsigned int tpid;
int sig;
int ret;
char comm[TASK_COMM_LEN];
};
#endif /* __SIGSNOOP_H */

6-sigsnoop/sigsnoop.md Normal file

@@ -0,0 +1,92 @@
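## eBPF Beginner's Practical Tutorial: Writing the sigsnoop eBPF Program to Monitor System-Wide Signal Events
### Background
### Implementation
`sigsnoop` relies on Linux tracepoints: it attaches handler functions to the entry and exit tracepoints of the `kill`, `tkill`, and `tgkill` syscalls, as well as to `signal/signal_generate`.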
```c
SEC("tracepoint/syscalls/sys_enter_kill")
int kill_entry(struct trace_event_raw_sys_enter *ctx)
{
pid_t tpid = (pid_t)ctx->args[0];
int sig = (int)ctx->args[1];
return probe_entry(tpid, sig);
}
SEC("tracepoint/syscalls/sys_exit_kill")
int kill_exit(struct trace_event_raw_sys_exit *ctx)
{
return probe_exit(ctx, ctx->ret);
}
SEC("tracepoint/syscalls/sys_enter_tkill")
int tkill_entry(struct trace_event_raw_sys_enter *ctx)
{
pid_t tpid = (pid_t)ctx->args[0];
int sig = (int)ctx->args[1];
return probe_entry(tpid, sig);
}
SEC("tracepoint/syscalls/sys_exit_tkill")
int tkill_exit(struct trace_event_raw_sys_exit *ctx)
{
return probe_exit(ctx, ctx->ret);
}
SEC("tracepoint/syscalls/sys_enter_tgkill")
int tgkill_entry(struct trace_event_raw_sys_enter *ctx)
{
pid_t tpid = (pid_t)ctx->args[1];
int sig = (int)ctx->args[2];
return probe_entry(tpid, sig);
}
SEC("tracepoint/syscalls/sys_exit_tgkill")
int tgkill_exit(struct trace_event_raw_sys_exit *ctx)
{
return probe_exit(ctx, ctx->ret);
}
SEC("tracepoint/signal/signal_generate")
int sig_trace(struct trace_event_raw_signal_generate *ctx)
{
struct event event = {};
pid_t tpid = ctx->pid;
int ret = ctx->errno;
int sig = ctx->sig;
__u64 pid_tgid;
__u32 pid;
if (failed_only && ret == 0)
return 0;
if (target_signal && sig != target_signal)
return 0;
pid_tgid = bpf_get_current_pid_tgid();
pid = pid_tgid >> 32;
if (filtered_pid && pid != filtered_pid)
return 0;
event.pid = pid;
event.tpid = tpid;
event.sig = sig;
event.ret = ret;
bpf_get_current_comm(event.comm, sizeof(event.comm));
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
return 0;
}
```
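The handlers above take the 64-bit value returned by `bpf_get_current_pid_tgid()` and split it into a process ID and a thread ID before applying the optional filters. The following is a minimal user-space sketch of that arithmetic, not code from the tool itself; the helper names `tgid_of`, `tid_of`, and `passes_filters` are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Upper 32 bits: thread group ID, i.e. the PID that user-space tools show. */
static inline uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: kernel thread ID, used as the key of the `values` map. */
static inline uint32_t tid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}

/*
 * Mirrors the filter checks in sig_trace: a filter value of zero means
 * "no filter", otherwise the event must match it exactly.
 * Returns 1 if the event should be reported, 0 if it should be skipped.
 */
static inline int passes_filters(uint32_t pid, int sig,
                                 uint32_t filtered_pid, int target_signal)
{
    if (target_signal && sig != target_signal)
        return 0;
    if (filtered_pid && pid != filtered_pid)
        return 0;
    return 1;
}
```

Keying the per-call state by thread ID rather than process ID lets the entry and exit handlers of the same syscall find each other even when several threads of one process send signals at the same time.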
### Usage in Eunomia
![result](../imgs/sigsnoop.png)
![result](../imgs/sigsnoop-prometheus.png)
### Summary

7-execsnoop/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
ecli
package.json

7-execsnoop/README.md Normal file

@@ -0,0 +1,148 @@
---
layout: post
title: execsnoop
date: 2022-11-17 19:57
category: bpftools
author: yunwei37
tags: [bpftools, syscall]
summary: execsnoop traces the exec() syscall system-wide, and prints various details.
---
## origin
Originally from:
https://github.com/iovisor/bcc/blob/master/libbpf-tools/execsnoop.bpf.c
## Compile and Run
Compile:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
Run:
```
$ sudo ./ecli run package.json
running and waiting for the ebpf events from perf event...
time pid ppid uid retval args_count args_size comm args
23:07:35 32940 32783 1000 0 1 13 cat /usr/bin/cat
23:07:43 32946 24577 1000 0 1 10 bash /bin/bash
23:07:43 32948 32946 1000 0 1 18 lesspipe /usr/bin/lesspipe
23:07:43 32949 32948 1000 0 2 36 basename /usr/bin/basename
23:07:43 32951 32950 1000 0 2 35 dirname /usr/bin/dirname
23:07:43 32952 32946 1000 0 2 22 dircolors /usr/bin/dircolors
23:07:48 32953 32946 1000 0 2 25 ls /usr/bin/ls
23:07:53 32957 32946 1000 0 2 17 sleep /usr/bin/sleep
23:07:57 32959 32946 1000 0 1 17 oneko /usr/games/oneko
```
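In the `ecli` output above, `args_count` and `args_size` describe the whole argument buffer, while the `args` column appears to show only its first string. In `execsnoop.bpf.c` from this commit, each argument is copied with `bpf_probe_read_user_str`, so the strings sit back to back in `event.args`, each followed by its terminating NUL. A user-space consumer could join them roughly as sketched below; `join_args` is a hypothetical helper and is not part of `ecli` or libbpf.

```c
#include <stddef.h>

/*
 * Join the NUL-separated strings stored in args[0..args_size) with spaces.
 * Writes at most out_size - 1 characters plus a terminating NUL to `out`.
 * Returns the number of characters written, excluding the final NUL.
 */
static size_t join_args(const char *args, size_t args_size,
                        char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return 0;
    for (size_t i = 0; i < args_size && n + 1 < out_size; i++) {
        if (args[i] != '\0')
            out[n++] = args[i];
        else if (i + 1 < args_size)  /* NUL between two arguments */
            out[n++] = ' ';
    }
    out[n] = '\0';
    return n;
}
```

For the `ls` row above, a 25-byte buffer holding `/usr/bin/ls` and `--color=auto` (each with its NUL) would be consistent with the reported `args_size`, although the output alone does not show what the second argument was.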
## details in bcc
Demonstrations of execsnoop, the Linux eBPF/bcc version.
execsnoop traces the exec() syscall system-wide, and prints various details.
Example output:
```
# ./execsnoop
COMM PID PPID RET ARGS
bash 33161 24577 0 /bin/bash
lesspipe 33163 33161 0 /usr/bin/lesspipe
basename 33164 33163 0 /usr/bin/basename /usr/bin/lesspipe
dirname 33166 33165 0 /usr/bin/dirname /usr/bin/lesspipe
dircolors 33167 33161 0 /usr/bin/dircolors -b
ls 33172 33161 0 /usr/bin/ls --color=auto
top 33173 33161 0 /usr/bin/top
oneko 33174 33161 0 /usr/games/oneko
systemctl 33175 2975 0 /bin/systemctl is-enabled -q whoopsie.path
apport-checkrep 33176 2975 0 /usr/share/apport/apport-checkreports
apport-checkrep 33177 2975 0 /usr/share/apport/apport-checkreports --system
apport-checkrep 33178 2975 0 /usr/share/apport/apport-checkreports --system
```
This shows process information each time the exec system call is invoked.
USAGE message:
```
usage: execsnoop [-h] [-T] [-t] [-x] [--cgroupmap CGROUPMAP]
[--mntnsmap MNTNSMAP] [-u USER] [-q] [-n NAME]
[-l LINE] [-U] [--max-args MAX_ARGS]
Trace exec() syscalls
options:
-h, --help show this help message and exit
-T, --time include time column on output (HH:MM:SS)
-t, --timestamp include timestamp on output
-x, --fails include failed exec()s
--cgroupmap CGROUPMAP
trace cgroups in this BPF map only
--mntnsmap MNTNSMAP trace mount namespaces in this BPF map only
-u USER, --uid USER trace this UID only
-q, --quote Add quotemarks (") around arguments.
-n NAME, --name NAME only print commands matching this name (regex), any
arg
-l LINE, --line LINE only print commands where arg contains this line
(regex)
-U, --print-uid print UID column
--max-args MAX_ARGS maximum number of arguments parsed and displayed,
defaults to 20
examples:
./execsnoop # trace all exec() syscalls
./execsnoop -x # include failed exec()s
./execsnoop -T # include time (HH:MM:SS)
./execsnoop -U # include UID
./execsnoop -u 1000 # only trace UID 1000
./execsnoop -u user # get user UID and trace only them
./execsnoop -t # include timestamps
./execsnoop -q # add "quotemarks" around arguments
./execsnoop -n main # only print command lines containing "main"
./execsnoop -l tpkg # only print command where arguments contains "tpkg"
./execsnoop --cgroupmap mappath # only trace cgroups in this BPF map
./execsnoop --mntnsmap mappath # only trace mount namespaces in the map
```
The -T and -t options include the time and a timestamp in the output:
```
# ./execsnoop -T -t
TIME TIME(s) PCOMM PID PPID RET ARGS
23:35:25 4.335 bash 33360 24577 0 /bin/bash
23:35:25 4.338 lesspipe 33361 33360 0 /usr/bin/lesspipe
23:35:25 4.340 basename 33362 33361 0 /usr/bin/basename /usr/bin/lesspipe
23:35:25 4.342 dirname 33364 33363 0 /usr/bin/dirname /usr/bin/lesspipe
23:35:25 4.347 dircolors 33365 33360 0 /usr/bin/dircolors -b
23:35:40 19.327 touch 33367 33366 0 /usr/bin/touch /run/udev/gdm-machine-has-hardware-gpu
23:35:40 19.329 snap-device-hel 33368 33366 0 /usr/lib/snapd/snap-device-helper change snap_firefox_firefox /devices/pci0000:00/0000:00:02.0/drm/card0 226:0
23:35:40 19.331 snap-device-hel 33369 33366 0 /usr/lib/snapd/snap-device-helper change snap_firefox_geckodriver /devices/pci0000:00/0000:00:02.0/drm/card0 226:0
23:35:40 19.332 snap-device-hel 33370 33366 0 /usr/lib/snapd/snap-device-helper change snap_snap-store_snap-store /devices/pci0000:00/0000:00:02.0/drm/card0 226:0
```
The -u option filters by UID:
```
# ./execsnoop -Uu 1000
UID PCOMM PID PPID RET ARGS
1000 bash 33604 24577 0 /bin/bash
1000 lesspipe 33606 33604 0 /usr/bin/lesspipe
1000 basename 33607 33606 0 /usr/bin/basename /usr/bin/lesspipe
1000 dirname 33609 33608 0 /usr/bin/dirname /usr/bin/lesspipe
1000 dircolors 33610 33604 0 /usr/bin/dircolors -b
1000 sleep 33615 33604 0 /usr/bin/sleep
1000 sleep 33616 33604 0 /usr/bin/sleep 1
1000 clear 33617 33604 0 /usr/bin/clear
```
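The UID filter has a counterpart in `execsnoop.bpf.c` from this commit: `targ_uid` defaults to `INVALID_UID`, and an event is skipped only when a real target UID is set and differs from the current one. The sketch below restates that check as plain user-space C; `should_trace` is a hypothetical name chosen for illustration.

```c
#include <stdbool.h>
#include <sys/types.h>

#define INVALID_UID ((uid_t)-1)  /* sentinel meaning "no UID filter requested" */

/* Returns true when an exec by `uid` should be reported. */
static bool should_trace(uid_t targ_uid, uid_t uid)
{
    /* A valid target UID that does not match the caller filters the event out. */
    if (targ_uid != INVALID_UID && targ_uid != uid)
        return false;
    return true;
}
```

Using an out-of-range sentinel instead of zero matters here because UID 0 (root) is itself a legitimate filter target.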
Report bugs to https://github.com/iovisor/bcc/tree/master/libbpf-tools.

7-execsnoop/execsnoop.bpf.c Normal file

@@ -0,0 +1,146 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include "execsnoop.bpf.h"
const volatile bool filter_cg = false;
const volatile bool ignore_failed = true;
const volatile uid_t targ_uid = INVALID_UID;
const volatile int max_args = DEFAULT_MAXARGS;
static const struct event empty_event = {};
struct {
__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
__type(key, u32);
__type(value, u32);
__uint(max_entries, 1);
} cgroup_map SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10240);
__type(key, pid_t);
__type(value, struct event);
} execs SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
static __always_inline bool valid_uid(uid_t uid) {
return uid != INVALID_UID;
}
SEC("tracepoint/syscalls/sys_enter_execve")
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter* ctx)
{
u64 id;
pid_t pid, tgid;
unsigned int ret;
struct event *event;
struct task_struct *task;
const char **args = (const char **)(ctx->args[1]);
const char *argp;
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
uid_t uid = (u32)bpf_get_current_uid_gid();
int i;
if (valid_uid(targ_uid) && targ_uid != uid)
return 0;
id = bpf_get_current_pid_tgid();
pid = (pid_t)id;
tgid = id >> 32;
if (bpf_map_update_elem(&execs, &pid, &empty_event, BPF_NOEXIST))
return 0;
event = bpf_map_lookup_elem(&execs, &pid);
if (!event)
return 0;
event->pid = tgid;
event->uid = uid;
task = (struct task_struct*)bpf_get_current_task();
event->ppid = (pid_t)BPF_CORE_READ(task, real_parent, tgid);
event->args_count = 0;
event->args_size = 0;
ret = bpf_probe_read_user_str(event->args, ARGSIZE, (const char*)ctx->args[0]);
if (ret <= ARGSIZE) {
event->args_size += ret;
} else {
/* write an empty string */
event->args[0] = '\0';
event->args_size++;
}
event->args_count++;
#pragma unroll
for (i = 1; i < TOTAL_MAX_ARGS && i < max_args; i++) {
bpf_probe_read_user(&argp, sizeof(argp), &args[i]);
if (!argp)
return 0;
if (event->args_size > LAST_ARG)
return 0;
ret = bpf_probe_read_user_str(&event->args[event->args_size], ARGSIZE, argp);
if (ret > ARGSIZE)
return 0;
event->args_count++;
event->args_size += ret;
}
/* try to read one more argument to check if there is one */
bpf_probe_read_user(&argp, sizeof(argp), &args[max_args]);
if (!argp)
return 0;
/* pointer to max_args+1 isn't null, assume we have more arguments */
event->args_count++;
return 0;
}
SEC("tracepoint/syscalls/sys_exit_execve")
int tracepoint__syscalls__sys_exit_execve(struct trace_event_raw_sys_exit* ctx)
{
u64 id;
pid_t pid;
int ret;
struct event *event;
if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
return 0;
u32 uid = (u32)bpf_get_current_uid_gid();
if (valid_uid(targ_uid) && targ_uid != uid)
return 0;
id = bpf_get_current_pid_tgid();
pid = (pid_t)id;
event = bpf_map_lookup_elem(&execs, &pid);
if (!event)
return 0;
ret = ctx->ret;
if (ignore_failed && ret < 0)
goto cleanup;
event->retval = ret;
bpf_get_current_comm(&event->comm, sizeof(event->comm));
size_t len =((size_t)(&((struct event*)0)->args) + event->args_size);
if (len <= sizeof(*event))
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, event, len);
cleanup:
bpf_map_delete_elem(&execs, &pid);
return 0;
}
char LICENSE[] SEC("license") = "GPL";


@@ -0,0 +1,26 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __EXECSNOOP_H
#define __EXECSNOOP_H
#define ARGSIZE 128
#define TASK_COMM_LEN 16
#define TOTAL_MAX_ARGS 60
#define DEFAULT_MAXARGS 20
#define FULL_MAX_ARGS_ARR (TOTAL_MAX_ARGS * ARGSIZE)
#define INVALID_UID ((uid_t)-1)
#define LAST_ARG (FULL_MAX_ARGS_ARR - ARGSIZE)
struct event {
int pid;
int ppid;
int uid;
int retval;
int args_count;
unsigned int args_size;
char comm[TASK_COMM_LEN];
char args[FULL_MAX_ARGS_ARR];
};
#endif /* __EXECSNOOP_H */

8-runqslower/.gitignore vendored Normal file

@@ -0,0 +1,4 @@
.vscode
package.json
eunomia-exporter
ecli

8-runqslower/README.md Normal file

@@ -0,0 +1,147 @@
| layout | title | date | category | author | tags | summary |
| ------ | ---------- | ---------------- | -------- | -------- | --------------- | ----------------------------------------------- |
| post | runqslower | 2022-11-11-20:50 | bpftools | yunwei37 | bpftool syscall | runqslower Trace long process scheduling delays |
## origin
Originally from:
https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqslower.bpf.c
result:
```
$ sudo ecli/build/bin/Release/ecli run examples/bpftools/runqslower/package.json
running and waiting for the ebpf events from perf event...
time task prev_task delta_us pid prev_pid
20:11:59 gnome-shell swapper/0 32 2202 0
20:11:59 ecli swapper/3 23 3437 0
20:11:59 rcu_sched swapper/1 1 14 0
20:11:59 gnome-terminal- swapper/1 13 2714 0
20:11:59 ecli swapper/3 2 3437 0
20:11:59 kworker/3:3 swapper/3 3 215 0
20:11:59 containerd swapper/1 8 1088 0
20:11:59 ecli swapper/2 5 3437 0
20:11:59 HangDetector swapper/3 6 854 0
20:11:59 ecli swapper/2 60 3437 0
20:11:59 rcu_sched swapper/1 26 14 0
20:11:59 kworker/0:1 swapper/0 26 3414 0
20:11:59 ecli swapper/2 6 3437 0
```
## Compile and Run
Compile:
```
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
Run:
```
sudo ./ecli run examples/bpftools/runqslower/package.json
```
## details in bcc
Demonstrations of runqslower, the Linux eBPF/bcc version.
runqslower traces high scheduling delays between tasks being ready to run and them running on CPU after that. Example output:
```
# ./runqslower
Tracing run queue latency higher than 10000 us
TIME COMM TID LAT(us)
13:11:43 b'kworker/0:2' 8680 10250
13:12:18 b'irq/16-vmwgfx' 422 10838
13:12:18 b'systemd-oomd' 753 11012
13:12:18 b'containerd' 8272 11254
13:12:18 b'HangDetector' 764 12042
^C
```
This measures the time a task spends waiting on a run queue for a turn on-CPU, and shows this time as individual events. This time should be small, but a task may need to wait its turn due to CPU load.
This measures two types of run queue latency:
1. The time from a task being enqueued on a run queue to its context switch and execution. This traces ttwu_do_wakeup(), wake_up_new_task() -> finish_task_switch() with either raw tracepoints (if supported) or kprobes and instruments the run queue latency after a voluntary context switch.
2. The time from when a task was involuntarily context-switched while still in the runnable state, to when it next executed. This is instrumented from finish_task_switch() alone.
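Both latency types above reduce to the same bookkeeping: store a timestamp when a task becomes runnable, then subtract it when the task is switched onto a CPU. The following is a simplified user-space model of that idea, not the BPF program itself; the fixed-size table, `MAX_TASKS`, and the function names `on_enqueue` and `on_switch_in` are assumptions made for this sketch.

```c
#include <stdint.h>

#define MAX_TASKS 1024  /* illustrative table size, not taken from the tool */

/* Enqueue timestamp per task in nanoseconds; 0 means "nothing recorded". */
static uint64_t start_ns[MAX_TASKS];

/* Called when a task becomes runnable (wakeup, or preempted while runnable). */
static void on_enqueue(uint32_t pid, uint64_t now_ns)
{
    if (pid && pid < MAX_TASKS)
        start_ns[pid] = now_ns;
}

/*
 * Called when `pid` is switched onto a CPU. Returns the run queue latency in
 * microseconds, or -1 when no enqueue was recorded or the latency does not
 * exceed min_us (min_us == 0 disables the threshold).
 */
static int64_t on_switch_in(uint32_t pid, uint64_t now_ns, uint64_t min_us)
{
    uint64_t delta_us;

    if (pid >= MAX_TASKS || start_ns[pid] == 0)
        return -1;                 /* missed enqueue */
    delta_us = (now_ns - start_ns[pid]) / 1000;
    start_ns[pid] = 0;             /* this model always forgets the timestamp */
    if (min_us && delta_us <= min_us)
        return -1;
    return (int64_t)delta_us;
}
```

In the BPF version from this commit the table is a hash map keyed by thread ID, and the result is emitted through a perf event array instead of being returned.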
The overhead of this tool may become significant for some workloads: see the OVERHEAD section.
This works by tracing various kernel scheduler functions using dynamic tracing, and will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
```console
Usage: runqslower [-h] [-p PID | -t TID | -P] [min_us]
```
The min_us argument sets the minimum run queue latency, in microseconds, to trace:
```
# ./runqslower 100
Tracing run queue latency higher than 100 us
TIME COMM TID LAT(us)
20:48:26 b'gnome-shell' 3005 201
20:48:26 b'gnome-shell' 3005 202
20:48:26 b'gnome-shell' 3005 254
20:48:26 b'gnome-shell' 3005 208
20:48:26 b'gnome-shell' 3005 132
20:48:26 b'gnome-shell' 3005 213
20:48:26 b'gnome-shell' 3005 205
20:48:26 b'python3' 5224 127
20:48:26 b'gnome-shell' 3005 214
20:48:26 b'gnome-shell' 3005 126
20:48:26 b'gnome-shell' 3005 285
20:48:26 b'Xorg' 2869 296
20:48:26 b'gnome-shell' 3005 119
20:48:26 b'gnome-shell' 3005 206
```
The -p PID option only traces this PID:
```
# ./runqslower -p 3005
Tracing run queue latency higher than 10000 us
TIME COMM TID LAT(us)
20:46:22 b'gnome-shell' 3005 16024
20:46:45 b'gnome-shell' 3005 11494
20:46:45 b'gnome-shell' 3005 21430
20:46:45 b'gnome-shell' 3005 14948
20:47:16 b'gnome-shell' 3005 10164
20:47:16 b'gnome-shell' 3005 18070
20:47:17 b'gnome-shell' 3005 13272
20:47:18 b'gnome-shell' 3005 10451
20:47:18 b'gnome-shell' 3005 15010
20:47:18 b'gnome-shell' 3005 19449
20:47:22 b'gnome-shell' 3005 19327
20:47:23 b'gnome-shell' 3005 13178
20:47:23 b'gnome-shell' 3005 13483
20:47:23 b'gnome-shell' 3005 15562
20:47:23 b'gnome-shell' 3005 13655
20:47:23 b'gnome-shell' 3005 19571
```
The -P option also shows previous task name and TID:
```
# ./runqslower -P
Tracing run queue latency higher than 10000 us
TIME COMM TID LAT(us) PREV COMM PREV TID
20:42:48 b'sysbench' 5159 10562 b'sysbench' 5152
20:42:48 b'sysbench' 5159 10367 b'sysbench' 5152
20:42:49 b'sysbench' 5158 11818 b'sysbench' 5159
20:42:49 b'sysbench' 5160 16913 b'sysbench' 5153
20:42:49 b'sysbench' 5157 13742 b'sysbench' 5160
20:42:49 b'sysbench' 5152 13746 b'sysbench' 5160
20:42:49 b'sysbench' 5153 13731 b'sysbench' 5160
20:42:49 b'sysbench' 5158 14688 b'sysbench' 5161
20:42:50 b'sysbench' 5155 10468 b'sysbench' 5152
20:42:50 b'sysbench' 5156 17695 b'sysbench' 5158
20:42:50 b'sysbench' 5155 11251 b'sysbench' 5152
20:42:50 b'sysbench' 5154 13283 b'sysbench' 5152
20:42:50 b'sysbench' 5158 22278 b'sysbench' 5157
```
For more details, see docs/special_filtering.md

8-runqslower/core_fixes.h Normal file

@@ -0,0 +1,112 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/* Copyright (c) 2021 Hengqi Chen */
#ifndef __CORE_FIXES_BPF_H
#define __CORE_FIXES_BPF_H
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>
/**
* commit 2f064a59a1 ("sched: Change task_struct::state") changes
* the name of task_struct::state to task_struct::__state
* see:
* https://github.com/torvalds/linux/commit/2f064a59a1
*/
struct task_struct___o {
volatile long int state;
} __attribute__((preserve_access_index));
struct task_struct___x {
unsigned int __state;
} __attribute__((preserve_access_index));
static __always_inline __s64 get_task_state(void *task)
{
struct task_struct___x *t = task;
if (bpf_core_field_exists(t->__state))
return BPF_CORE_READ(t, __state);
return BPF_CORE_READ((struct task_struct___o *)task, state);
}
/**
* commit 309dca309fc3 ("block: store a block_device pointer in struct bio")
* adds a new member bi_bdev which is a pointer to struct block_device
* see:
* https://github.com/torvalds/linux/commit/309dca309fc3
*/
struct bio___o {
struct gendisk *bi_disk;
} __attribute__((preserve_access_index));
struct bio___x {
struct block_device *bi_bdev;
} __attribute__((preserve_access_index));
static __always_inline struct gendisk *get_gendisk(void *bio)
{
struct bio___x *b = bio;
if (bpf_core_field_exists(b->bi_bdev))
return BPF_CORE_READ(b, bi_bdev, bd_disk);
return BPF_CORE_READ((struct bio___o *)bio, bi_disk);
}
/**
* commit d5869fdc189f ("block: introduce block_rq_error tracepoint")
* adds a new tracepoint block_rq_error and it shares the same arguments
* with tracepoint block_rq_complete. As a result, the kernel BTF now has
* a `struct trace_event_raw_block_rq_completion` instead of
* `struct trace_event_raw_block_rq_complete`.
* see:
* https://github.com/torvalds/linux/commit/d5869fdc189f
*/
struct trace_event_raw_block_rq_complete___x {
dev_t dev;
sector_t sector;
unsigned int nr_sector;
} __attribute__((preserve_access_index));
struct trace_event_raw_block_rq_completion___x {
dev_t dev;
sector_t sector;
unsigned int nr_sector;
} __attribute__((preserve_access_index));
static __always_inline bool has_block_rq_completion()
{
if (bpf_core_type_exists(struct trace_event_raw_block_rq_completion___x))
return true;
return false;
}
/**
* commit d152c682f03c ("block: add an explicit ->disk backpointer to the
* request_queue") and commit f3fa33acca9f ("block: remove the ->rq_disk
* field in struct request") make some changes to `struct request` and
* `struct request_queue`. Now, to get the `struct gendisk *` field in a CO-RE
* way, we need both `struct request` and `struct request_queue`.
* see:
* https://github.com/torvalds/linux/commit/d152c682f03c
* https://github.com/torvalds/linux/commit/f3fa33acca9f
*/
struct request_queue___x {
struct gendisk *disk;
} __attribute__((preserve_access_index));
struct request___x {
struct request_queue___x *q;
struct gendisk *rq_disk;
} __attribute__((preserve_access_index));
static __always_inline struct gendisk *get_disk(void *request)
{
struct request___x *r = request;
if (bpf_core_field_exists(r->rq_disk))
return BPF_CORE_READ(r, rq_disk);
return BPF_CORE_READ(r, q, disk);
}
#endif /* __CORE_FIXES_BPF_H */


@@ -0,0 +1,117 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2019 Facebook
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "runqslower.bpf.h"
#include "core_fixes.h"
#define TASK_RUNNING 0
const volatile __u64 min_us = 0;
const volatile pid_t targ_pid = 0;
const volatile pid_t targ_tgid = 0;
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(u32));
	__uint(value_size, sizeof(u32));
} events SEC(".maps");
/* record enqueue timestamp */
static int trace_enqueue(u32 tgid, u32 pid)
{
	u64 ts;

	/* skip the idle task (pid 0) and anything outside the target filters */
	if (!pid)
		return 0;
	if (targ_tgid && targ_tgid != tgid)
		return 0;
	if (targ_pid && targ_pid != pid)
		return 0;

	ts = bpf_ktime_get_ns();
	bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
	return 0;
}
static int handle_switch(void *ctx, struct task_struct *prev, struct task_struct *next)
{
	struct event event = {};
	u64 *tsp, delta_us;
	u32 pid;

	/* ivcsw: treat like an enqueue event and store timestamp */
	if (get_task_state(prev) == TASK_RUNNING)
		trace_enqueue(BPF_CORE_READ(prev, tgid), BPF_CORE_READ(prev, pid));

	pid = BPF_CORE_READ(next, pid);

	/* fetch timestamp and calculate delta */
	tsp = bpf_map_lookup_elem(&start, &pid);
	if (!tsp)
		return 0;   /* missed enqueue */

	delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
	if (min_us && delta_us <= min_us)
		return 0;

	event.pid = pid;
	event.prev_pid = BPF_CORE_READ(prev, pid);
	event.delta_us = delta_us;
	bpf_probe_read_kernel_str(&event.task, sizeof(event.task), next->comm);
	bpf_probe_read_kernel_str(&event.prev_task, sizeof(event.prev_task), prev->comm);

	/* output */
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
			      &event, sizeof(event));

	bpf_map_delete_elem(&start, &pid);
	return 0;
}
SEC("tp_btf/sched_wakeup")
int BPF_PROG(sched_wakeup, struct task_struct *p)
{
	return trace_enqueue(p->tgid, p->pid);
}

SEC("tp_btf/sched_wakeup_new")
int BPF_PROG(sched_wakeup_new, struct task_struct *p)
{
	return trace_enqueue(p->tgid, p->pid);
}

SEC("tp_btf/sched_switch")
int BPF_PROG(sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)
{
	return handle_switch(ctx, prev, next);
}

SEC("raw_tp/sched_wakeup")
int BPF_PROG(handle_sched_wakeup, struct task_struct *p)
{
	return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
}

SEC("raw_tp/sched_wakeup_new")
int BPF_PROG(handle_sched_wakeup_new, struct task_struct *p)
{
	return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
}

SEC("raw_tp/sched_switch")
int BPF_PROG(handle_sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)
{
	return handle_switch(ctx, prev, next);
}
char LICENSE[] SEC("license") = "GPL";


@@ -0,0 +1,15 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __RUNQSLOWER_H
#define __RUNQSLOWER_H
#define TASK_COMM_LEN 16
struct event {
	char task[TASK_COMM_LEN];
	char prev_task[TASK_COMM_LEN];
	__u64 delta_us;
	int pid;
	int prev_pid;
};
#endif /* __RUNQSLOWER_H */

6
9-runqlat/.gitignore vendored Normal file

@@ -0,0 +1,6 @@
.vscode
package.json
*.o
*.skel.json
*.skel.yaml
package.yaml

675
9-runqlat/README.md Executable file

@@ -0,0 +1,675 @@
---
layout: post
title: runqlat
date: 2022-10-10 16:18
category: bpftools
author: yunwei37
tags: [bpftools, syscall, tracepoint]
summary: Summarize run queue (scheduler) latency as a histogram.
---
## origin
Originally from:
<https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlat.bpf.c>
This program summarizes scheduler run queue latency as a histogram, showing
how long tasks spent waiting their turn to run on-CPU.
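
The bucketing behind that histogram can be sketched in plain user-space C. This is only an illustration, assuming the same power-of-two slots as the `log2l` helper in `bits.bpf.h` and the `MAX_SLOTS` value of 26 from `runqlat.h`; the names `hist_sketch`, `hist_slot`, and `hist_record` are hypothetical and do not appear in the tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 26 /* same bucket count as runqlat.h */

/* Hypothetical user-space stand-in for the BPF-side histogram. */
struct hist_sketch {
	uint32_t slots[MAX_SLOTS];
};

/* floor(log2(v)) for v > 0; returns 0 for v == 0, so 0 and 1 share slot 0. */
static unsigned int floor_log2_u64(uint64_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Map a latency value (already scaled to us or ms) to a histogram slot. */
unsigned int hist_slot(uint64_t latency)
{
	unsigned int slot = floor_log2_u64(latency);

	return slot >= MAX_SLOTS ? MAX_SLOTS - 1 : slot;
}

/* Record one wakeup-to-switch interval, given in nanoseconds, as microseconds. */
void hist_record(struct hist_sketch *h, uint64_t wakeup_ns, uint64_t switch_ns)
{
	assert(h != NULL);
	if (switch_ns < wakeup_ns)
		return; /* clock went backwards: drop the sample */
	h->slots[hist_slot((switch_ns - wakeup_ns) / 1000)]++;
}
```

Slot 0 covers the `0 -> 1` row of the output, slot 1 covers `2 -> 3`, slot 2 covers `4 -> 7`, and so on; anything too large for the table is counted in the last slot.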
## Compile and Run
Compile:
```shell
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
```
```console
$ ecc runqlat.bpf.c runqlat.h
Compiling bpf object...
Generating export types...
Packing ebpf object and config into package.json...
```
Run:
```console
$ sudo ecli examples/bpftools/runqlat/package.json -h
Usage: runqlat_bpf [--help] [--version] [--verbose] [--filter_cg] [--targ_per_process] [--targ_per_thread] [--targ_per_pidns] [--targ_ms] [--targ_tgid VAR]
A simple eBPF program
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
--verbose prints libbpf debug information
--filter_cg set value of bool variable filter_cg
--targ_per_process set value of bool variable targ_per_process
--targ_per_thread set value of bool variable targ_per_thread
--targ_per_pidns set value of bool variable targ_per_pidns
--targ_ms set value of bool variable targ_ms
--targ_tgid set value of pid_t variable targ_tgid
Built with eunomia-bpf framework.
See https://github.com/eunomia-bpf/eunomia-bpf for more information.
$ sudo ecli examples/bpftools/runqlat/package.json
key = 4294967295
comm = rcu_preempt
(unit) : count distribution
0 -> 1 : 9 |**** |
2 -> 3 : 6 |** |
4 -> 7 : 12 |***** |
8 -> 15 : 28 |************* |
16 -> 31 : 40 |******************* |
32 -> 63 : 83 |****************************************|
64 -> 127 : 57 |*************************** |
128 -> 255 : 19 |********* |
256 -> 511 : 11 |***** |
512 -> 1023 : 2 | |
1024 -> 2047 : 2 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 1 | |
$ sudo ecli examples/bpftools/runqlat/package.json --targ_per_process
key = 3189
comm = cpptools
(unit) : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |*** |
16 -> 31 : 2 |******* |
32 -> 63 : 11 |****************************************|
64 -> 127 : 8 |***************************** |
128 -> 255 : 3 |********** |
```
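
The `--targ_per_process`, `--targ_per_thread`, and `--targ_per_pidns` switches decide which key each histogram is stored under. The sketch below mirrors the selection order used in `handle_switch` in `runqlat.bpf.c`; the `key_opts` struct and the `hist_key` function are hypothetical names introduced here for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bundle of the three grouping switches. */
struct key_opts {
	bool per_process;
	bool per_thread;
	bool per_pidns;
};

/* Pick the histogram key for a task that is about to run. */
uint32_t hist_key(const struct key_opts *o, uint32_t tgid, uint32_t pid,
		  uint32_t pidns_inum)
{
	assert(o != NULL);
	if (o->per_process)
		return tgid;
	if (o->per_thread)
		return pid;
	if (o->per_pidns)
		return pidns_inum;
	/* No grouping: one global histogram under key (u32)-1. */
	return UINT32_MAX;
}
```

With no switch set, every sample lands under the single key `(u32)-1`, which is why the first run above prints `key = 4294967295`.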
## details in bcc
```text
Demonstrations of runqlat, the Linux eBPF/bcc version.
This program summarizes scheduler run queue latency as a histogram, showing
how long tasks spent waiting their turn to run on-CPU.
Here is a heavily loaded system:
# ./runqlat
Tracing run queue latency... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 233 |*********** |
2 -> 3 : 742 |************************************ |
4 -> 7 : 203 |********** |
8 -> 15 : 173 |******** |
16 -> 31 : 24 |* |
32 -> 63 : 0 | |
64 -> 127 : 30 |* |
128 -> 255 : 6 | |
256 -> 511 : 3 | |
512 -> 1023 : 5 | |
1024 -> 2047 : 27 |* |
2048 -> 4095 : 30 |* |
4096 -> 8191 : 20 | |
8192 -> 16383 : 29 |* |
16384 -> 32767 : 809 |****************************************|
32768 -> 65535 : 64 |*** |
The distribution is bimodal, with one mode between 0 and 15 microseconds,
and another between 16 and 65 milliseconds. These modes are visible as the
spikes in the ASCII distribution (which is merely a visual representation
of the "count" column). As an example of reading one line: 809 events fell
into the 16384 to 32767 microsecond range (16 to 32 ms) while tracing.
I would expect the two modes to be due to the workload: 16 hot CPU-bound threads,
and many other mostly idle threads doing occasional work. I suspect the mostly
idle threads will run with a higher priority when they wake up, and are
the reason for the low latency mode. The high latency mode will be the
CPU-bound threads. More analysis with this and other tools can confirm.
A -m option can be used to show milliseconds instead, as well as an interval
and a count. For example, showing three five-second summaries in milliseconds:
# ./runqlat -m 5 3
Tracing run queue latency... Hit Ctrl-C to end.
msecs : count distribution
0 -> 1 : 3818 |****************************************|
2 -> 3 : 39 | |
4 -> 7 : 39 | |
8 -> 15 : 62 | |
16 -> 31 : 2214 |*********************** |
32 -> 63 : 226 |** |
msecs : count distribution
0 -> 1 : 3775 |****************************************|
2 -> 3 : 52 | |
4 -> 7 : 37 | |
8 -> 15 : 65 | |
16 -> 31 : 2230 |*********************** |
32 -> 63 : 212 |** |
msecs : count distribution
0 -> 1 : 3816 |****************************************|
2 -> 3 : 49 | |
4 -> 7 : 40 | |
8 -> 15 : 53 | |
16 -> 31 : 2228 |*********************** |
32 -> 63 : 221 |** |
This shows a similar distribution across the three summaries.
A -p option can be used to show one PID only, which is filtered in kernel for
efficiency. For example, PID 4505, and one second summaries:
# ./runqlat -mp 4505 1
Tracing run queue latency... Hit Ctrl-C to end.
msecs : count distribution
0 -> 1 : 1 |* |
2 -> 3 : 2 |*** |
4 -> 7 : 1 |* |
8 -> 15 : 0 | |
16 -> 31 : 25 |****************************************|
32 -> 63 : 3 |**** |
msecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 2 |** |
4 -> 7 : 0 | |
8 -> 15 : 1 |* |
16 -> 31 : 30 |****************************************|
32 -> 63 : 1 |* |
msecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 28 |****************************************|
32 -> 63 : 2 |** |
msecs : count distribution
0 -> 1 : 1 |* |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 27 |****************************************|
32 -> 63 : 4 |***** |
[...]
For comparison, here is pidstat(1) for that process:
# pidstat -p 4505 1
Linux 4.4.0-virtual (bgregg-xxxxxxxx) 02/08/2016 _x86_64_ (8 CPU)
08:56:11 AM UID PID %usr %system %guest %CPU CPU Command
08:56:12 AM 0 4505 9.00 3.00 0.00 12.00 0 bash
08:56:13 AM 0 4505 7.00 5.00 0.00 12.00 0 bash
08:56:14 AM 0 4505 10.00 2.00 0.00 12.00 0 bash
08:56:15 AM 0 4505 11.00 2.00 0.00 13.00 0 bash
08:56:16 AM 0 4505 9.00 3.00 0.00 12.00 0 bash
[...]
This is a synthetic workload that is CPU bound. It's only spending 12% on-CPU
each second because of high CPU demand on this server: the remaining time
is spent waiting on a run queue, as visualized by runqlat.
Here is the same system, but when it is CPU idle:
# ./runqlat 5 1
Tracing run queue latency... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 2250 |******************************** |
2 -> 3 : 2340 |********************************** |
4 -> 7 : 2746 |****************************************|
8 -> 15 : 418 |****** |
16 -> 31 : 93 |* |
32 -> 63 : 28 | |
64 -> 127 : 119 |* |
128 -> 255 : 9 | |
256 -> 511 : 4 | |
512 -> 1023 : 20 | |
1024 -> 2047 : 22 | |
2048 -> 4095 : 5 | |
4096 -> 8191 : 2 | |
Back to a microsecond scale, this time there is little run queue latency past 1
millisecond, as would be expected.
Now 16 threads are performing heavy disk I/O:
# ./runqlat 5 1
Tracing run queue latency... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 204 | |
2 -> 3 : 944 |* |
4 -> 7 : 16315 |********************* |
8 -> 15 : 29897 |****************************************|
16 -> 31 : 1044 |* |
32 -> 63 : 23 | |
64 -> 127 : 128 | |
128 -> 255 : 24 | |
256 -> 511 : 5 | |
512 -> 1023 : 13 | |
1024 -> 2047 : 15 | |
2048 -> 4095 : 13 | |
4096 -> 8191 : 10 | |
The distribution hasn't changed too much. While the disks are 100% busy, there
is still plenty of CPU headroom, and threads still don't spend much time
waiting their turn.
A -P option will print a distribution for each PID:
# ./runqlat -P
Tracing run queue latency... Hit Ctrl-C to end.
^C
pid = 0
usecs : count distribution
0 -> 1 : 351 |******************************** |
2 -> 3 : 96 |******** |
4 -> 7 : 437 |****************************************|
8 -> 15 : 12 |* |
16 -> 31 : 10 | |
32 -> 63 : 0 | |
64 -> 127 : 16 |* |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 1 | |
pid = 12929
usecs : count distribution
0 -> 1 : 1 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 1 |****************************************|
pid = 12930
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 1 |****************************************|
32 -> 63 : 0 | |
64 -> 127 : 1 |****************************************|
pid = 12931
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 |******************** |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 2 |****************************************|
pid = 12932
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 1 |****************************************|
256 -> 511 : 0 | |
512 -> 1023 : 1 |****************************************|
pid = 7
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 426 |************************************* |
4 -> 7 : 457 |****************************************|
8 -> 15 : 16 |* |
pid = 9
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 425 |****************************************|
8 -> 15 : 16 |* |
pid = 11
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 10 |****************************************|
pid = 14
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 8 |****************************************|
4 -> 7 : 2 |********** |
pid = 18
usecs : count distribution
0 -> 1 : 414 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 20 |* |
8 -> 15 : 8 | |
pid = 12928
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 |****************************************|
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 |****************************************|
pid = 1867
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 15 |****************************************|
16 -> 31 : 1 |** |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 4 |********** |
pid = 1871
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 2 |****************************************|
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1 |******************** |
pid = 1876
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 1 |****************************************|
pid = 1878
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 3 |****************************************|
pid = 1880
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 |****************************************|
pid = 9307
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
pid = 1886
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 |******************** |
8 -> 15 : 2 |****************************************|
pid = 1888
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 |****************************************|
pid = 3297
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
pid = 1892
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 1 |******************** |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 2 |****************************************|
pid = 7024
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 4 |****************************************|
pid = 16468
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 |****************************************|
pid = 12922
usecs : count distribution
0 -> 1 : 1 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
16 -> 31 : 1 |****************************************|
32 -> 63 : 0 | |
64 -> 127 : 1 |****************************************|
pid = 12923
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 |******************** |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 2 |****************************************|
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1 |******************** |
1024 -> 2047 : 1 |******************** |
pid = 12924
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 2 |******************** |
8 -> 15 : 4 |****************************************|
16 -> 31 : 1 |********** |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 1 |********** |
pid = 12925
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 |****************************************|
pid = 12926
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 1 |****************************************|
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1 |****************************************|
pid = 12927
usecs : count distribution
0 -> 1 : 1 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 1 |****************************************|
A -L option will print a distribution for each TID:
# ./runqlat -L
Tracing run queue latency... Hit Ctrl-C to end.
^C
tid = 0
usecs : count distribution
0 -> 1 : 593 |**************************** |
2 -> 3 : 829 |****************************************|
4 -> 7 : 300 |************** |
8 -> 15 : 321 |*************** |
16 -> 31 : 132 |****** |
32 -> 63 : 58 |** |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 13 | |
tid = 7
usecs : count distribution
0 -> 1 : 8 |******** |
2 -> 3 : 19 |******************** |
4 -> 7 : 37 |****************************************|
[...]
And a --pidnss option (short for PID namespaces) will print for each PID
namespace, for analyzing container performance:
# ./runqlat --pidnss -m
Tracing run queue latency... Hit Ctrl-C to end.
^C
pidns = 4026532870
msecs : count distribution
0 -> 1 : 40 |****************************************|
2 -> 3 : 1 |* |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 2 |** |
64 -> 127 : 5 |***** |
pidns = 4026532809
msecs : count distribution
0 -> 1 : 67 |****************************************|
pidns = 4026532748
msecs : count distribution
0 -> 1 : 63 |****************************************|
pidns = 4026532687
msecs : count distribution
0 -> 1 : 7 |****************************************|
pidns = 4026532626
msecs : count distribution
0 -> 1 : 45 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 3 |** |
pidns = 4026531836
msecs : count distribution
0 -> 1 : 314 |****************************************|
2 -> 3 : 1 | |
4 -> 7 : 11 |* |
8 -> 15 : 28 |*** |
16 -> 31 : 137 |***************** |
32 -> 63 : 86 |********** |
64 -> 127 : 1 | |
pidns = 4026532382
msecs : count distribution
0 -> 1 : 285 |****************************************|
2 -> 3 : 5 | |
4 -> 7 : 16 |** |
8 -> 15 : 9 |* |
16 -> 31 : 69 |********* |
32 -> 63 : 25 |*** |
Many of these distributions have two modes: the second, in this case, is
caused by capping CPU usage via CPU shares.
USAGE message:
# ./runqlat -h
usage: runqlat.py [-h] [-T] [-m] [-P] [--pidnss] [-L] [-p PID]
[interval] [count]
Summarize run queue (scheduler) latency as a histogram
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-T, --timestamp include timestamp on output
-m, --milliseconds millisecond histogram
-P, --pids print a histogram per process ID
--pidnss print a histogram per PID namespace
-L, --tids print a histogram per thread ID
-p PID, --pid PID trace this PID only
examples:
./runqlat # summarize run queue latency as a histogram
./runqlat 1 10 # print 1 second summaries, 10 times
./runqlat -mT 1 # 1s summaries, milliseconds, and timestamps
./runqlat -P # show each PID separately
./runqlat -p 185 # trace PID 185 only
```
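
The ASCII bars in the samples above can be reproduced with a few lines of C. This is a rough sketch, assuming each bar is scaled against the largest bucket over a 40-character width with integer truncation; that assumption is consistent with the first sample (233 of 809 gives 11 stars) but is not taken from the bcc source. The names `bar_stars` and `print_log2_hist` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BAR_WIDTH 40

/* Number of '*' characters for one bucket, scaled against the largest bucket. */
unsigned int bar_stars(uint32_t count, uint32_t max, unsigned int width)
{
	if (max == 0)
		return 0;
	return (unsigned int)((uint64_t)count * width / max);
}

/* Print slots[0..nslots-1] as a log2 histogram, stopping at the last non-empty slot. */
void print_log2_hist(const uint32_t *slots, unsigned int nslots, const char *unit)
{
	uint32_t max = 0;
	unsigned int i, j, last = 0;
	int any = 0;

	assert(slots != NULL && unit != NULL);
	for (i = 0; i < nslots; i++) {
		if (slots[i] > 0) {
			last = i;
			any = 1;
		}
		if (slots[i] > max)
			max = slots[i];
	}
	if (!any)
		return; /* nothing recorded */

	printf("%20s : %-8s %s\n", unit, "count", "distribution");
	for (i = 0; i <= last; i++) {
		/* Slot 0 covers 0..1; slot i covers 2^i .. 2^(i+1)-1. */
		unsigned long long low = i == 0 ? 0 : 1ULL << i;
		unsigned long long high = (1ULL << (i + 1)) - 1;
		unsigned int stars = bar_stars(slots[i], max, BAR_WIDTH);

		printf("%9llu -> %-9llu : %-8u |", low, high, (unsigned int)slots[i]);
		for (j = 0; j < stars; j++)
			putchar('*');
		for (j = stars; j < BAR_WIDTH; j++)
			putchar(' ');
		puts("|");
	}
}
```

Calling `print_log2_hist(h.slots, 26, "usecs")` on a filled histogram prints one row per bucket up to the last non-empty one, in the same layout as the samples.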

31
9-runqlat/bits.bpf.h Normal file

@@ -0,0 +1,31 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __BITS_BPF_H
#define __BITS_BPF_H
#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val))
/* Branch-free floor(log2(v)) for a 32-bit value; returns 0 for v == 0. */
static __always_inline u64 log2(u32 v)
{
	u32 shift, r;

	r = (v > 0xFFFF) << 4; v >>= r;
	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
	r |= (v >> 1);
	return r;
}

/* 64-bit variant: use the high word when it is non-zero. */
static __always_inline u64 log2l(u64 v)
{
	u32 hi = v >> 32;

	if (hi)
		return log2(hi) + 32;
	else
		return log2(v);
}
#endif /* __BITS_BPF_H */

112
9-runqlat/core_fixes.bpf.h Normal file

@@ -0,0 +1,112 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/* Copyright (c) 2021 Hengqi Chen */
#ifndef __CORE_FIXES_BPF_H
#define __CORE_FIXES_BPF_H
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>
/**
* commit 2f064a59a1 ("sched: Change task_struct::state") changes
* the name of task_struct::state to task_struct::__state
* see:
* https://github.com/torvalds/linux/commit/2f064a59a1
*/
struct task_struct___o {
volatile long int state;
} __attribute__((preserve_access_index));
struct task_struct___x {
unsigned int __state;
} __attribute__((preserve_access_index));
static __always_inline __s64 get_task_state(void *task)
{
	struct task_struct___x *t = task;

	if (bpf_core_field_exists(t->__state))
		return BPF_CORE_READ(t, __state);
	return BPF_CORE_READ((struct task_struct___o *)task, state);
}
/**
* commit 309dca309fc3 ("block: store a block_device pointer in struct bio")
* adds a new member bi_bdev which is a pointer to struct block_device
* see:
* https://github.com/torvalds/linux/commit/309dca309fc3
*/
struct bio___o {
struct gendisk *bi_disk;
} __attribute__((preserve_access_index));
struct bio___x {
struct block_device *bi_bdev;
} __attribute__((preserve_access_index));
static __always_inline struct gendisk *get_gendisk(void *bio)
{
	struct bio___x *b = bio;

	if (bpf_core_field_exists(b->bi_bdev))
		return BPF_CORE_READ(b, bi_bdev, bd_disk);
	return BPF_CORE_READ((struct bio___o *)bio, bi_disk);
}
/**
* commit d5869fdc189f ("block: introduce block_rq_error tracepoint")
* adds a new tracepoint block_rq_error and it shares the same arguments
* with tracepoint block_rq_complete. As a result, the kernel BTF now has
* a `struct trace_event_raw_block_rq_completion` instead of
* `struct trace_event_raw_block_rq_complete`.
* see:
* https://github.com/torvalds/linux/commit/d5869fdc189f
*/
struct trace_event_raw_block_rq_complete___x {
dev_t dev;
sector_t sector;
unsigned int nr_sector;
} __attribute__((preserve_access_index));
struct trace_event_raw_block_rq_completion___x {
dev_t dev;
sector_t sector;
unsigned int nr_sector;
} __attribute__((preserve_access_index));
static __always_inline bool has_block_rq_completion(void)
{
	if (bpf_core_type_exists(struct trace_event_raw_block_rq_completion___x))
		return true;
	return false;
}
/**
* commit d152c682f03c ("block: add an explicit ->disk backpointer to the
* request_queue") and commit f3fa33acca9f ("block: remove the ->rq_disk
* field in struct request") make some changes to `struct request` and
* `struct request_queue`. Now, to get the `struct gendisk *` field in a CO-RE
* way, we need both `struct request` and `struct request_queue`.
* see:
* https://github.com/torvalds/linux/commit/d152c682f03c
* https://github.com/torvalds/linux/commit/f3fa33acca9f
*/
struct request_queue___x {
struct gendisk *disk;
} __attribute__((preserve_access_index));
struct request___x {
struct request_queue___x *q;
struct gendisk *rq_disk;
} __attribute__((preserve_access_index));
static __always_inline struct gendisk *get_disk(void *request)
{
	struct request___x *r = request;

	if (bpf_core_field_exists(r->rq_disk))
		return BPF_CORE_READ(r, rq_disk);
	return BPF_CORE_READ(r, q, disk);
}
#endif /* __CORE_FIXES_BPF_H */

26
9-runqlat/maps.bpf.h Normal file

@@ -0,0 +1,26 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
// Copyright (c) 2020 Anton Protopopov
#ifndef __MAPS_BPF_H
#define __MAPS_BPF_H
#include <bpf/bpf_helpers.h>
#include <asm-generic/errno.h>
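/*
 * Look up key in map; if it is missing, try to insert *init and look it up
 * again. Returns 0 (a null pointer) if the element cannot be created.
 */
static __always_inline void *
bpf_map_lookup_or_try_init(void *map, const void *key, const void *init)
{
	void *val;
	long err;

	val = bpf_map_lookup_elem(map, key);
	if (val)
		return val;

	/* -EEXIST means another CPU inserted the key first, which is fine */
	err = bpf_map_update_elem(map, key, init, BPF_NOEXIST);
	if (err && err != -EEXIST)
		return 0;

	return bpf_map_lookup_elem(map, key);
}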
#endif /* __MAPS_BPF_H */

152
9-runqlat/runqlat.bpf.c Normal file

@@ -0,0 +1,152 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2020 Wenbo Zhang
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
#include "runqlat.h"
#include "bits.bpf.h"
#include "maps.bpf.h"
#include "core_fixes.bpf.h"
#define MAX_ENTRIES 10240
#define TASK_RUNNING 0
const volatile bool filter_cg = false;
const volatile bool targ_per_process = false;
const volatile bool targ_per_thread = false;
const volatile bool targ_per_pidns = false;
const volatile bool targ_ms = false;
const volatile pid_t targ_tgid = 0;
struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
	__type(key, u32);
	__type(value, u32);
	__uint(max_entries, 1);
} cgroup_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, MAX_ENTRIES);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");

static struct hist zero;

/// @sample {"interval": 1000, "type" : "log2_hist"}
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, MAX_ENTRIES);
	__type(key, u32);
	__type(value, struct hist);
} hists SEC(".maps");
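/* record the enqueue timestamp for pid, keyed by thread ID */
static int trace_enqueue(u32 tgid, u32 pid)
{
	u64 ts;

	/* skip the idle task (pid 0) and processes outside the target tgid */
	if (!pid)
		return 0;
	if (targ_tgid && targ_tgid != tgid)
		return 0;

	ts = bpf_ktime_get_ns();
	bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
	return 0;
}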
static unsigned int pid_namespace(struct task_struct *task)
{
	struct pid *pid;
	unsigned int level;
	struct upid upid;
	unsigned int inum;

	/* get the pid namespace by following task_active_pid_ns(),
	 * pid->numbers[pid->level].ns
	 */
	pid = BPF_CORE_READ(task, thread_pid);
	level = BPF_CORE_READ(pid, level);
	bpf_core_read(&upid, sizeof(upid), &pid->numbers[level]);
	inum = BPF_CORE_READ(upid.ns, ns.inum);

	return inum;
}
static int handle_switch(bool preempt, struct task_struct *prev, struct task_struct *next)
{
	struct hist *histp;
	u64 *tsp, slot;
	u32 pid, hkey;
	s64 delta;

	if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
		return 0;

	/* prev is still runnable (involuntary switch): treat it as an enqueue */
	if (get_task_state(prev) == TASK_RUNNING)
		trace_enqueue(BPF_CORE_READ(prev, tgid), BPF_CORE_READ(prev, pid));

	pid = BPF_CORE_READ(next, pid);

	/* look up when next was enqueued; no entry means we missed the wakeup */
	tsp = bpf_map_lookup_elem(&start, &pid);
	if (!tsp)
		return 0;
	delta = bpf_ktime_get_ns() - *tsp;
	if (delta < 0)
		goto cleanup;

	/* choose the histogram key according to the grouping options */
	if (targ_per_process)
		hkey = BPF_CORE_READ(next, tgid);
	else if (targ_per_thread)
		hkey = pid;
	else if (targ_per_pidns)
		hkey = pid_namespace(next);
	else
		hkey = -1;
	histp = bpf_map_lookup_or_try_init(&hists, &hkey, &zero);
	if (!histp)
		goto cleanup;
	if (!histp->comm[0])
		bpf_probe_read_kernel_str(&histp->comm, sizeof(histp->comm),
					  next->comm);

	/* convert nanoseconds to milliseconds or microseconds */
	if (targ_ms)
		delta /= 1000000U;
	else
		delta /= 1000U;
	slot = log2l(delta);
	if (slot >= MAX_SLOTS)
		slot = MAX_SLOTS - 1;
	__sync_fetch_and_add(&histp->slots[slot], 1);

cleanup:
	bpf_map_delete_elem(&start, &pid);
	return 0;
}
SEC("raw_tp/sched_wakeup")
int BPF_PROG(handle_sched_wakeup, struct task_struct *p)
{
	if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
		return 0;

	return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
}

SEC("raw_tp/sched_wakeup_new")
int BPF_PROG(handle_sched_wakeup_new, struct task_struct *p)
{
	if (filter_cg && !bpf_current_task_under_cgroup(&cgroup_map, 0))
		return 0;

	return trace_enqueue(BPF_CORE_READ(p, tgid), BPF_CORE_READ(p, pid));
}

SEC("raw_tp/sched_switch")
int BPF_PROG(handle_sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)
{
	return handle_switch(preempt, prev, next);
}
char LICENSE[] SEC("license") = "GPL";

14
9-runqlat/runqlat.h Normal file

@@ -0,0 +1,14 @@
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __RUNQLAT_H
#define __RUNQLAT_H
#define TASK_COMM_LEN 16
#define MAX_SLOTS 26
struct hist {
	__u32 slots[MAX_SLOTS];
	char comm[TASK_COMM_LEN];
};
#endif /* __RUNQLAT_H */