init with documents from eunomia-bpf

2026-05-09 23:32:34 +08:00 · 2022-12-02 19:18:03 +08:00
parent 1179ec171e
commit 81d749a9cc
85 changed files with 11876 additions and 0 deletions
--- a/13-tcpconnlat/.gitignore
+++ b/13-tcpconnlat/.gitignore
@@ -0,0 +1,2 @@
+.vscode
+package.json
--- a/13-tcpconnlat/README.md
+++ b/13-tcpconnlat/README.md
@@ -0,0 +1,137 @@
+---
+layout: post
+title: tcpconnlat
+date: 2022-10-10 16:18
+category: bpftools
+author: yunwei37
+tags: [bpftools, syscall, network]
+summary: Traces the kernel function performing active TCP connections(eg, via a connect() syscall; accept() are passive connections).  and show connection latency.
+---
+
+## origin
+
+origin from:
+
+https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpconnlat.bpf.c
+
+## Compile and Run
+
+Compile:
+
+```shell
+docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
+```
+
+Run:
+
+```shell
+sudo ./ecli run package.json
+```
+
+TODO: support union in C
+
+## details in bcc
+
+Demonstrations of tcpconnect, the Linux eBPF/bcc version.
+
+
+This tool traces the kernel function performing active TCP connections
+(eg, via a connect() syscall; accept() are passive connections). Some example
+output (IP addresses changed to protect the innocent):
+```console
+# ./tcpconnect
+PID    COMM         IP SADDR            DADDR            DPORT
+1479   telnet       4  127.0.0.1        127.0.0.1        23
+1469   curl         4  10.201.219.236   54.245.105.25    80
+1469   curl         4  10.201.219.236   54.67.101.145    80
+1991   telnet       6  ::1              ::1              23
+2015   ssh          6  fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22
+```
+This output shows four connections, one from a "telnet" process, two from
+"curl", and one from "ssh". The output details shows the IP version, source
+address, destination address, and destination port. This traces attempted
+connections: these may have failed.
+
+The overhead of this tool should be negligible, since it is only tracing the
+kernel functions performing connect. It is not tracing every packet and then
+filtering.
+
+
+The -t option prints a timestamp column:
+```console
+# ./tcpconnect -t
+TIME(s)  PID    COMM         IP SADDR            DADDR            DPORT
+31.871   2482   local_agent  4  10.103.219.236   10.251.148.38    7001
+31.874   2482   local_agent  4  10.103.219.236   10.101.3.132     7001
+31.878   2482   local_agent  4  10.103.219.236   10.171.133.98    7101
+90.917   2482   local_agent  4  10.103.219.236   10.251.148.38    7001
+90.928   2482   local_agent  4  10.103.219.236   10.102.64.230    7001
+90.938   2482   local_agent  4  10.103.219.236   10.115.167.169   7101
+```
+The output shows some periodic connections (or attempts) from a "local_agent"
+process to various other addresses. A few connections occur every minute.
+
+The -d option tracks DNS responses and tries to associate each connection with
+the a previous DNS query issued before it.  If a DNS response matching the IP
+is found, it will be printed. If no match was found, "No DNS Query" is printed
+in this column. Queries for 127.0.0.1 and ::1 are automatically associated with
+"localhost". If the time between when the DNS response was received and a
+connect call was traced exceeds 100ms, the tool will print the time delta
+after the query name.  See below for www.domain.com for an example.
+```console
+# ./tcpconnect -d
+PID    COMM         IP SADDR            DADDR            DPORT QUERY
+1543   amazon-ssm-a 4  10.66.75.54      176.32.119.67    443   ec2messages.us-west-1.amazonaws.com
+1479   telnet       4  127.0.0.1        127.0.0.1        23    localhost
+1469   curl         4  10.201.219.236   54.245.105.25    80    www.domain.com (123.342ms)
+1469   curl         4  10.201.219.236   54.67.101.145    80    No DNS Query
+1991   telnet       6  ::1              ::1              23    localhost
+2015   ssh          6  fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22    anotherhost.org
+```
+
+The -L option prints a LPORT column:
+```console
+# ./tcpconnect -L
+PID    COMM         IP SADDR            LPORT  DADDR            DPORT
+3706   nc           4  192.168.122.205  57266  192.168.122.150  5000
+3722   ssh          4  192.168.122.205  50966  192.168.122.150  22
+3779   ssh          6  fe80::1          52328  fe80::2          22
+```
+
+The -U option prints a UID column:
+```console
+# ./tcpconnect -U
+UID   PID    COMM         IP SADDR            DADDR            DPORT
+0     31333  telnet       6  ::1              ::1              23
+0     31333  telnet       4  127.0.0.1        127.0.0.1        23
+1000  31322  curl         4  127.0.0.1        127.0.0.1        80
+1000  31322  curl         6  ::1              ::1              80
+```
+
+The -u option filtering UID:
+```console
+# ./tcpconnect -Uu 1000
+UID   PID    COMM         IP SADDR            DADDR            DPORT
+1000  31338  telnet       6  ::1              ::1              23
+1000  31338  telnet       4  127.0.0.1        127.0.0.1        23
+```
+To spot heavy outbound connections quickly one can use the -c flag. It will
+count all active connections per source ip and destination ip/port.
+```console
+# ./tcpconnect.py -c
+Tracing connect ... Hit Ctrl-C to end
+^C
+LADDR                 RADDR                      RPORT             CONNECTS
+192.168.10.50         172.217.21.194             443               70
+192.168.10.50         172.213.11.195             443               34
+192.168.10.50         172.212.22.194             443               21
+[...]
+```
+
+The --cgroupmap option filters based on a cgroup set. It is meant to be used
+with an externally created map.
+```console
+# ./tcpconnect --cgroupmap /sys/fs/bpf/test01
+```
+For more details, see docs/special_filtering.md
+
--- a/13-tcpconnlat/tcpconnlat.bpf.c
+++ b/13-tcpconnlat/tcpconnlat.bpf.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Wenbo Zhang
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_tracing.h>
+#include "tcpconnlat.bpf.h"
+
+#define AF_INET    2
+#define AF_INET6   10
+
+const volatile __u64 targ_min_us = 0;
+const volatile pid_t targ_tgid = 0;
+
+struct piddata {
+	char comm[TASK_COMM_LEN];
+	u64 ts;
+	u32 tgid;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 4096);
+	__type(key, struct sock *);
+	__type(value, struct piddata);
+} start SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(u32));
+	__uint(value_size, sizeof(u32));
+} events SEC(".maps");
+
+static int trace_connect(struct sock *sk)
+{
+	u32 tgid = bpf_get_current_pid_tgid() >> 32;
+	struct piddata piddata = {};
+
+	if (targ_tgid && targ_tgid != tgid)
+		return 0;
+
+	bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
+	piddata.ts = bpf_ktime_get_ns();
+	piddata.tgid = tgid;
+	bpf_map_update_elem(&start, &sk, &piddata, 0);
+	return 0;
+}
+
+static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
+{
+	struct piddata *piddatap;
+	struct event event = {};
+	s64 delta;
+	u64 ts;
+
+	if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
+		return 0;
+
+	piddatap = bpf_map_lookup_elem(&start, &sk);
+	if (!piddatap)
+		return 0;
+
+	ts = bpf_ktime_get_ns();
+	delta = (s64)(ts - piddatap->ts);
+	if (delta < 0)
+		goto cleanup;
+
+	event.delta_us = delta / 1000U;
+	if (targ_min_us && event.delta_us < targ_min_us)
+		goto cleanup;
+	__builtin_memcpy(&event.comm, piddatap->comm,
+			sizeof(event.comm));
+	event.ts_us = ts / 1000;
+	event.tgid = piddatap->tgid;
+	event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
+	event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
+	event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
+	if (event.af == AF_INET) {
+		event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
+		event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
+	} else {
+		BPF_CORE_READ_INTO(&event.saddr_v6, sk,
+				__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
+		BPF_CORE_READ_INTO(&event.daddr_v6, sk,
+				__sk_common.skc_v6_daddr.in6_u.u6_addr32);
+	}
+	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
+			&event, sizeof(event));
+
+cleanup:
+	bpf_map_delete_elem(&start, &sk);
+	return 0;
+}
+
+SEC("kprobe/tcp_v4_connect")
+int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
+{
+	return trace_connect(sk);
+}
+
+SEC("kprobe/tcp_v6_connect")
+int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
+{
+	return trace_connect(sk);
+}
+
+SEC("kprobe/tcp_rcv_state_process")
+int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
+{
+	return handle_tcp_rcv_state_process(ctx, sk);
+}
+
+char LICENSE[] SEC("license") = "GPL";
--- a/13-tcpconnlat/tcpconnlat.bpf.h
+++ b/13-tcpconnlat/tcpconnlat.bpf.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __TCPCONNLAT_H
+#define __TCPCONNLAT_H
+
+#define TASK_COMM_LEN	16
+
+struct event {
+	// union {
+		unsigned int saddr_v4;
+		unsigned char saddr_v6[16];
+	// };
+	// union {
+		unsigned int daddr_v4;
+		unsigned char daddr_v6[16];
+	// };
+	char comm[TASK_COMM_LEN];
+	unsigned long long delta_us;
+	unsigned long long ts_us;
+	unsigned int tgid;
+	int af;
+	unsigned short lport;
+	unsigned short dport;
+};
+
+
+#endif /* __TCPCONNLAT_H_ */
--- a/13-tcpconnlat/tcpconnlat.md
+++ b/13-tcpconnlat/tcpconnlat.md
@@ -0,0 +1,186 @@
+## eBPF 入门实践教程：编写 eBPF 程序 tcpconnlat 测量 tcp 连接延时
+
+### 背景
+
+在互联网后端日常开发接口的时候中，不管你使用的是C、Java、PHP还是Golang，都避免不了需要调用mysql、redis等组件来获取数据，可能还需要执行一些rpc远程调用，或者再调用一些其它restful api。 在这些调用的底层，基本都是在使用TCP协议进行传输。这是因为在传输层协议中，TCP协议具备可靠的连接，错误重传，拥塞控制等优点，所以目前应用比UDP更广泛一些。但相对而言，tcp 连接也有一些缺点，例如建立连接的延时较长等。因此也会出现像 QUIC ，即 快速UDP网络连接 ( Quick UDP Internet Connections )这样的替代方案。
+
+tcp 连接延时分析对于网络性能分析优化或者故障排查都能起到不少作用。
+
+### tcpconnlat 的实现原理
+
+tcpconnlat 这个工具跟踪执行活动TCP连接的内核函数 (例如，通过connect()系统调用），并显示本地测量的连接的延迟（时间），即从发送 SYN 到响应包的时间。
+
+### tcp 连接原理
+
+tcp 连接的整个过程如图所示：
+
+![tcpconnlate](tcpconnlat1.png)
+
+在这个连接过程中，我们来简单分析一下每一步的耗时：
+
+1. 客户端发出SYNC包：客户端一般是通过connect系统调用来发出 SYN 的，这里牵涉到本机的系统调用和软中断的 CPU 耗时开销
+2. SYN传到服务器：SYN从客户端网卡被发出，这是一次长途远距离的网络传输
+3. 服务器处理SYN包：内核通过软中断来收包，然后放到半连接队列中，然后再发出SYN/ACK响应。主要是 CPU 耗时开销
+4. SYC/ACK传到客户端：长途网络跋涉
+5. 客户端处理 SYN/ACK：客户端内核收包并处理SYN后，经过几us的CPU处理，接着发出 ACK。同样是软中断处理开销
+6. ACK传到服务器：长途网络跋涉
+7. 服务端收到ACK：服务器端内核收到并处理ACK，然后把对应的连接从半连接队列中取出来，然后放到全连接队列中。一次软中断CPU开销
+8. 服务器端用户进程唤醒：正在被accpet系统调用阻塞的用户进程被唤醒，然后从全连接队列中取出来已经建立好的连接。一次上下文切换的CPU开销
+
+在客户端视角，在正常情况下一次TCP连接总的耗时也就就大约是一次网络RTT的耗时。但在某些情况下，可能会导致连接时的网络传输耗时上涨、CPU处理开销增加、甚至是连接失败。这种时候在发现延时过长之后，就可以结合其他信息进行分析。
+
+### ebpf 实现原理
+
+在 TCP 三次握手的时候，Linux 内核会维护两个队列，分别是：
+
+- 半连接队列，也称 SYN 队列；
+- 全连接队列，也称 accepet 队列；
+
+
+服务端收到客户端发起的 SYN 请求后，内核会把该连接存储到半连接队列，并向客户端响应 SYN+ACK，接着客户端会返回 ACK，服务端收到第三次握手的 ACK 后，内核会把连接从半连接队列移除，然后创建新的完全的连接，并将其添加到 accept 队列，等待进程调用 accept 函数时把连接取出来。
+
+我们的 ebpf 代码实现在 https://github.com/yunwei37/Eunomia/blob/master/bpftools/tcpconnlat/tcpconnlat.bpf.c 中：
+
+它主要使用了 trace_tcp_rcv_state_process 和 kprobe/tcp_v4_connect 这样的跟踪点：
+
+```c
+
+SEC("kprobe/tcp_v4_connect")
+int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
+{
+	return trace_connect(sk);
+}
+
+SEC("kprobe/tcp_v6_connect")
+int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
+{
+	return trace_connect(sk);
+}
+
+SEC("kprobe/tcp_rcv_state_process")
+int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
+{
+	return handle_tcp_rcv_state_process(ctx, sk);
+}
+```
+
+在 trace_connect 中，我们跟踪新的 tcp 连接，记录到达时间，并且把它加入 map 中：
+
+```c
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 4096);
+	__type(key, struct sock *);
+	__type(value, struct piddata);
+} start SEC(".maps");
+
+static int trace_connect(struct sock *sk)
+{
+	u32 tgid = bpf_get_current_pid_tgid() >> 32;
+	struct piddata piddata = {};
+
+	if (targ_tgid && targ_tgid != tgid)
+		return 0;
+
+	bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
+	piddata.ts = bpf_ktime_get_ns();
+	piddata.tgid = tgid;
+	bpf_map_update_elem(&start, &sk, &piddata, 0);
+	return 0;
+}
+```
+
+在 handle_tcp_rcv_state_process 中，我们跟踪接收到的 tcp 数据包，从 map 从提取出对应的 connect 事件，并且计算延迟：
+
+```c
+static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
+{
+	struct piddata *piddatap;
+	struct event event = {};
+	s64 delta;
+	u64 ts;
+
+	if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
+		return 0;
+
+	piddatap = bpf_map_lookup_elem(&start, &sk);
+	if (!piddatap)
+		return 0;
+
+	ts = bpf_ktime_get_ns();
+	delta = (s64)(ts - piddatap->ts);
+	if (delta < 0)
+		goto cleanup;
+
+	event.delta_us = delta / 1000U;
+	if (targ_min_us && event.delta_us < targ_min_us)
+		goto cleanup;
+	__builtin_memcpy(&event.comm, piddatap->comm,
+			sizeof(event.comm));
+	event.ts_us = ts / 1000;
+	event.tgid = piddatap->tgid;
+	event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
+	event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
+	event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
+	if (event.af == AF_INET) {
+		event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
+		event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
+	} else {
+		BPF_CORE_READ_INTO(&event.saddr_v6, sk,
+				__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
+		BPF_CORE_READ_INTO(&event.daddr_v6, sk,
+				__sk_common.skc_v6_daddr.in6_u.u6_addr32);
+	}
+	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
+			&event, sizeof(event));
+
+cleanup:
+	bpf_map_delete_elem(&start, &sk);
+	return 0;
+}
+```
+
+### Eunomia 测试 demo
+
+使用命令行进行追踪：
+
+```bash
+$ sudo build/bin/Release/eunomia run tcpconnlat
+[sudo] password for yunwei: 
+[2022-08-07 02:13:39.601] [info] eunomia run in cmd...
+[2022-08-07 02:13:40.534] [info] press 'Ctrl C' key to exit...
+PID    COMM        IP  SRC              DEST             PORT  LAT(ms) CONATINER/OS
+3477   openresty    4  172.19.0.7       172.19.0.5       2379  0.05    docker-apisix_apisix_1
+3483   openresty    4  172.19.0.7       172.19.0.5       2379  0.08    docker-apisix_apisix_1
+3477   openresty    4  172.19.0.7       172.19.0.5       2379  0.04    docker-apisix_apisix_1
+3478   openresty    4  172.19.0.7       172.19.0.5       2379  0.05    docker-apisix_apisix_1
+3478   openresty    4  172.19.0.7       172.19.0.5       2379  0.03    docker-apisix_apisix_1
+3478   openresty    4  172.19.0.7       172.19.0.5       2379  0.03    docker-apisix_apisix_1
+```
+
+还可以使用 eunomia 作为 prometheus exporter，在运行上述命令之后，打开 prometheus 自带的可视化面板：
+
+使用下述查询命令即可看到延时的统计图表：
+
+```
+  rate(eunomia_observed_tcpconnlat_v4_histogram_sum[5m])
+/
+  rate(eunomia_observed_tcpconnlat_v4_histogram_count[5m])
+```
+
+结果：
+
+![result](tcpconnlat_p.png)
+
+### 总结
+
+通过上面的实验，我们可以看到，tcpconnlat 工具的实现原理是基于内核的TCP连接的跟踪，并且可以跟踪到 tcp 连接的延迟时间；除了命令行使用方式之外，还可以将其和容器、k8s 等元信息综合起来，通过 `prometheus` 和 `grafana` 等工具进行网络性能分析。
+
+> `Eunomia` 是一个使用 C/C++ 开发的基于 eBPF的轻量级，高性能云原生监控工具，旨在帮助用户了解容器的各项行为、监控可疑的容器安全事件，力求提供覆盖容器全生命周期的轻量级开源监控解决方案。它使用 `Linux` `eBPF` 技术在运行时跟踪您的系统和应用程序，并分析收集的事件以检测可疑的行为模式。目前，它包含性能分析、容器集群网络可视化分析*、容器安全感知告警、一键部署、持久化存储监控等功能，提供了多样化的 ebpf 追踪点。其核心导出器/命令行工具最小仅需要约 4MB 大小的二进制程序，即可在支持的 Linux 内核上启动。
+
+项目地址：https://github.com/yunwei37/Eunomia
+
+### 参考资料
+
+1. http://kerneltravel.net/blog/2020/tcpconnlat/
+2. https://network.51cto.com/article/640631.html
--- a/13-tcpconnlat/tcpconnlat1.png
+++ b/13-tcpconnlat/tcpconnlat1.png
--- a/13-tcpconnlat/tcpconnlat_p.png
+++ b/13-tcpconnlat/tcpconnlat_p.png