Files

752 lines
44 KiB
HTML
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
<!DOCTYPE HTML>
<html lang="en" class="light" dir="ltr">
<head>
<!-- Book generated using mdBook -->
<meta charset="UTF-8">
<title>lesson 13-tcpconnlat - bpf-developer-tutorial</title>
<!-- Custom HTML head -->
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#ffffff">
<link rel="icon" href="../favicon.svg">
<link rel="shortcut icon" href="../favicon.png">
<link rel="stylesheet" href="../css/variables.css">
<link rel="stylesheet" href="../css/general.css">
<link rel="stylesheet" href="../css/chrome.css">
<link rel="stylesheet" href="../css/print.css" media="print">
<!-- Fonts -->
<link rel="stylesheet" href="../FontAwesome/css/font-awesome.css">
<link rel="stylesheet" href="../fonts/fonts.css">
<!-- Highlight.js Stylesheets -->
<link rel="stylesheet" href="../highlight.css">
<link rel="stylesheet" href="../tomorrow-night.css">
<link rel="stylesheet" href="../ayu-highlight.css">
<!-- Custom theme stylesheets -->
</head>
<body class="sidebar-visible no-js">
<div id="body-container">
<!-- Provide site root to javascript -->
<script>
var path_to_root = "../";
var default_theme = window.matchMedia("(prefers-color-scheme: dark)").matches ? "navy" : "light";
</script>
<!-- Work around some values being stored in localStorage wrapped in quotes -->
<script>
try {
var theme = localStorage.getItem('mdbook-theme');
var sidebar = localStorage.getItem('mdbook-sidebar');
if (theme.startsWith('"') && theme.endsWith('"')) {
localStorage.setItem('mdbook-theme', theme.slice(1, theme.length - 1));
}
if (sidebar.startsWith('"') && sidebar.endsWith('"')) {
localStorage.setItem('mdbook-sidebar', sidebar.slice(1, sidebar.length - 1));
}
} catch (e) { }
</script>
<!-- Set the theme before any content is loaded, prevents flash -->
<script>
var theme;
try { theme = localStorage.getItem('mdbook-theme'); } catch(e) { }
if (theme === null || theme === undefined) { theme = default_theme; }
var html = document.querySelector('html');
html.classList.remove('light')
html.classList.add(theme);
var body = document.querySelector('body');
body.classList.remove('no-js')
body.classList.add('js');
</script>
<input type="checkbox" id="sidebar-toggle-anchor" class="hidden">
<!-- Hide / unhide sidebar before it is displayed -->
<script>
var body = document.querySelector('body');
var sidebar = null;
var sidebar_toggle = document.getElementById("sidebar-toggle-anchor");
if (document.body.clientWidth >= 1080) {
try { sidebar = localStorage.getItem('mdbook-sidebar'); } catch(e) { }
sidebar = sidebar || 'visible';
} else {
sidebar = 'hidden';
}
sidebar_toggle.checked = sidebar === 'visible';
body.classList.remove('sidebar-visible');
body.classList.add("sidebar-" + sidebar);
</script>
<nav id="sidebar" class="sidebar" aria-label="Table of contents">
<div class="sidebar-scrollbox">
<ol class="chapter"><li class="chapter-item expanded affix "><a href="../https://github.com/eunomia-bpf/bpf-developer-tutorial.html">https://github.com/eunomia-bpf/bpf-developer-tutorial</a></li><li class="chapter-item expanded affix "><li class="part-title">入门文档</li><li class="chapter-item expanded "><a href="../0-introduce/index.html"><strong aria-hidden="true">1.</strong> lesson 0-introduce</a></li><li class="chapter-item expanded "><a href="../1-helloworld/index.html"><strong aria-hidden="true">2.</strong> lesson 1-helloworld</a></li><li class="chapter-item expanded "><a href="../2-kprobe-unlink/index.html"><strong aria-hidden="true">3.</strong> lesson 2-kprobe-unlink</a></li><li class="chapter-item expanded "><a href="../3-fentry-unlink/index.html"><strong aria-hidden="true">4.</strong> lesson 3-fentry-unlink</a></li><li class="chapter-item expanded "><a href="../4-opensnoop/index.html"><strong aria-hidden="true">5.</strong> lesson 4-opensnoop</a></li><li class="chapter-item expanded "><a href="../5-uprobe-bashreadline/index.html"><strong aria-hidden="true">6.</strong> lesson 5-uprobe-bashreadline</a></li><li class="chapter-item expanded "><a href="../6-sigsnoop/index.html"><strong aria-hidden="true">7.</strong> lesson 6-sigsnoop</a></li><li class="chapter-item expanded "><a href="../7-execsnoop/index.html"><strong aria-hidden="true">8.</strong> lesson 7-execsnoop</a></li><li class="chapter-item expanded "><a href="../8-exitsnoop/index.html"><strong aria-hidden="true">9.</strong> lesson 8-execsnoop</a></li><li class="chapter-item expanded "><a href="../9-runqlat/index.html"><strong aria-hidden="true">10.</strong> lesson 9-runqlat</a></li><li class="chapter-item expanded "><a href="../10-hardirqs/index.html"><strong aria-hidden="true">11.</strong> lesson 10-hardirqs</a></li><li class="chapter-item expanded affix "><li class="part-title">进阶文档和示例</li><li class="chapter-item expanded "><a href="../11-bootstrap/index.html"><strong aria-hidden="true">12.</strong> lesson 11-bootstrap</a></li><li class="chapter-item expanded "><a href="../12-profile/index.html"><strong aria-hidden="true">13.</strong> lesson 12-profile</a></li><li class="chapter-item expanded "><a href="../13-tcpconnlat/index.html" class="active"><strong aria-hidden="true">14.</strong> lesson 13-tcpconnlat</a></li><li class="chapter-item expanded "><a href="../14-tcpstates/index.html"><strong aria-hidden="true">15.</strong> lesson 14-tcpstates</a></li><li class="chapter-item expanded "><a href="../15-javagc/index.html"><strong aria-hidden="true">16.</strong> lesson 15-javagc</a></li><li class="chapter-item expanded "><a href="../16-memleak/index.html"><strong aria-hidden="true">17.</strong> lesson 16-memleak</a></li><li class="chapter-item expanded "><a href="../17-biopattern/index.html"><strong aria-hidden="true">18.</strong> lesson 17-biopattern</a></li><li class="chapter-item expanded "><a href="../18-further-reading/index.html"><strong aria-hidden="true">19.</strong> lesson 18-further-reading</a></li><li class="chapter-item expanded "><a href="../19-lsm-connect/index.html"><strong aria-hidden="true">20.</strong> lesson 19-lsm-connect</a></li><li class="chapter-item expanded "><a href="../20-tc/index.html"><strong aria-hidden="true">21.</strong> lesson 20-tc</a></li><li class="chapter-item expanded "><a href="../21-xdp/index.html"><strong aria-hidden="true">22.</strong> lesson 21-xdp</a></li><li class="chapter-item expanded affix "><li class="part-title">高级主题</li><li class="chapter-item expanded "><a href="../22-android/index.html"><strong aria-hidden="true">23.</strong> 在 Android 上使用 eBPF 程序</a></li><li class="chapter-item expanded "><a href="../30-sslsniff/index.html"><strong aria-hidden="true">24.</strong> 使用 uprobe 捕获多种库的 SSL/TLS 明文数据</a></li><li class="chapter-item expanded "><a href="../23-http/index.html"><strong aria-hidden="true">25.</strong> 使用 eBPF socket filter 或 syscall trace 追踪 HTTP 请求和其他七层协议</a></li><li class="chapter-item expanded "><a href="../29-sockops/index.html"><strong aria-hidden="true">26.</strong> 使用 sockops 加速网络请求转发</a></li><li class="chapter-item expanded "><a href="../24-hide/index.html"><strong aria-hidden="true">27.</strong> 使用 eBPF 隐藏进程或文件信息</a></li><li class="chapter-item expanded "><a href="../25-signal/index.html"><strong aria-hidden="true">28.</strong> 使用 bpf_send_signal 发送信号终止进程</a></li><li class="chapter-item expanded "><a href="../26-sudo/index.html"><strong aria-hidden="true">29.</strong> 使用 eBPF 添加 sudo 用户</a></li><li class="chapter-item expanded "><a href="../27-replace/index.html"><strong aria-hidden="true">30.</strong> 使用 eBPF 替换任意程序读取或写入的文本</a></li><li class="chapter-item expanded "><a href="../28-detach/index.html"><strong aria-hidden="true">31.</strong> BPF 的生命周期:使用 Detached 模式在用户态应用退出后持续运行 eBPF 程序</a></li><li class="chapter-item expanded "><a href="../18-further-reading/ebpf-security.zh.html"><strong aria-hidden="true">32.</strong> eBPF 运行时的安全性与面临的挑战</a></li><li class="chapter-item expanded "><a href="../34-syscall/index.html"><strong aria-hidden="true">33.</strong> 使用 eBPF 修改系统调用参数</a></li><li class="chapter-item expanded "><a href="../35-user-ringbuf/index.html"><strong aria-hidden="true">34.</strong> eBPF开发实践使用 user ring buffer 向内核异步发送信息</a></li><li class="chapter-item expanded "><a href="../36-userspace-ebpf/index.html"><strong aria-hidden="true">35.</strong> 用户空间 eBPF 运行时:深度解析与应用实践</a></li><li class="chapter-item expanded "><a href="../37-uprobe-rust/index.html"><strong aria-hidden="true">36.</strong> 使用 uprobe 追踪 Rust 应用程序</a></li><li class="chapter-item expanded "><a href="../38-btf-uprobe/index.html"><strong aria-hidden="true">37.</strong> 借助 eBPF 和 BTF让用户态也能一次编译、到处运行</a></li><li class="chapter-item expanded affix "><li class="part-title">bcc 和 bpftrace 教程与文档</li><li class="chapter-item expanded "><a href="../bcc-documents/kernel-versions.html"><strong aria-hidden="true">38.</strong> BPF Features by Linux Kernel Version</a></li><li class="chapter-item expanded "><a href="../bcc-documents/kernel_config.html"><strong aria-hidden="true">39.</strong> Kernel Configuration for BPF Features</a></li><li class="chapter-item expanded "><a href="../bcc-documents/reference_guide.html"><strong aria-hidden="true">40.</strong> bcc Reference Guide</a></li><li class="chapter-item expanded "><a href="../bcc-documents/special_filtering.html"><strong aria-hidden="true">41.</strong> Special Filtering</a></li><li class="chapter-item expanded "><a href="../bcc-documents/tutorial.html"><strong aria-hidden="true">42.</strong> bcc Tutorial</a></li><li class="chapter-item expanded "><a href="../bcc-documents/tutorial_bcc_python_developer.html"><strong aria-hidden="true">43.</strong> bcc Python Developer Tutorial</a></li><li class="chapter-item expanded "><a href="../bpftrace-tutorial/index.html"><strong aria-hidden="true">44.</strong> bpftrace Tutorial</a></li></ol>
</div>
<div id="sidebar-resize-handle" class="sidebar-resize-handle">
<div class="sidebar-resize-indicator"></div>
</div>
</nav>
<!-- Track and set sidebar scroll position -->
<script>
var sidebarScrollbox = document.querySelector('#sidebar .sidebar-scrollbox');
sidebarScrollbox.addEventListener('click', function(e) {
if (e.target.tagName === 'A') {
sessionStorage.setItem('sidebar-scroll', sidebarScrollbox.scrollTop);
}
}, { passive: true });
var sidebarScrollTop = sessionStorage.getItem('sidebar-scroll');
sessionStorage.removeItem('sidebar-scroll');
if (sidebarScrollTop) {
// preserve sidebar scroll position when navigating via links within sidebar
sidebarScrollbox.scrollTop = sidebarScrollTop;
} else {
// scroll sidebar to current active section when navigating via "next/previous chapter" buttons
var activeSection = document.querySelector('#sidebar .active');
if (activeSection) {
activeSection.scrollIntoView({ block: 'center' });
}
}
</script>
<div id="page-wrapper" class="page-wrapper">
<div class="page">
<div id="menu-bar-hover-placeholder"></div>
<div id="menu-bar" class="menu-bar sticky">
<div class="left-buttons">
<label id="sidebar-toggle" class="icon-button" for="sidebar-toggle-anchor" title="Toggle Table of Contents" aria-label="Toggle Table of Contents" aria-controls="sidebar">
<i class="fa fa-bars"></i>
</label>
<button id="theme-toggle" class="icon-button" type="button" title="Change theme" aria-label="Change theme" aria-haspopup="true" aria-expanded="false" aria-controls="theme-list">
<i class="fa fa-paint-brush"></i>
</button>
<ul id="theme-list" class="theme-popup" aria-label="Themes" role="menu">
<li role="none"><button role="menuitem" class="theme" id="light">Light</button></li>
<li role="none"><button role="menuitem" class="theme" id="rust">Rust</button></li>
<li role="none"><button role="menuitem" class="theme" id="coal">Coal</button></li>
<li role="none"><button role="menuitem" class="theme" id="navy">Navy</button></li>
<li role="none"><button role="menuitem" class="theme" id="ayu">Ayu</button></li>
</ul>
<button id="search-toggle" class="icon-button" type="button" title="Search. (Shortkey: s)" aria-label="Toggle Searchbar" aria-expanded="false" aria-keyshortcuts="S" aria-controls="searchbar">
<i class="fa fa-search"></i>
</button>
</div>
<h1 class="menu-title">bpf-developer-tutorial</h1>
<div class="right-buttons">
<a href="../print.html" title="Print this book" aria-label="Print this book">
<i id="print-button" class="fa fa-print"></i>
</a>
</div>
</div>
<div id="search-wrapper" class="hidden">
<form id="searchbar-outer" class="searchbar-outer">
<input type="search" id="searchbar" name="searchbar" placeholder="Search this book ..." aria-controls="searchresults-outer" aria-describedby="searchresults-header">
</form>
<div id="searchresults-outer" class="searchresults-outer hidden">
<div id="searchresults-header" class="searchresults-header"></div>
<ul id="searchresults">
</ul>
</div>
</div>
<!-- Apply ARIA attributes after the sidebar and the sidebar toggle button are added to the DOM -->
<script>
document.getElementById('sidebar-toggle').setAttribute('aria-expanded', sidebar === 'visible');
document.getElementById('sidebar').setAttribute('aria-hidden', sidebar !== 'visible');
Array.from(document.querySelectorAll('#sidebar a')).forEach(function(link) {
link.setAttribute('tabIndex', sidebar === 'visible' ? 0 : -1);
});
</script>
<div id="content" class="content">
<main>
<h1 id="ebpf入门开发实践教程十三统计-tcp-连接延时并使用-libbpf-在用户态处理数据"><a class="header" href="#ebpf入门开发实践教程十三统计-tcp-连接延时并使用-libbpf-在用户态处理数据">eBPF入门开发实践教程十三统计 TCP 连接延时,并使用 libbpf 在用户态处理数据</a></h1>
<p>eBPF (Extended Berkeley Packet Filter) 是一项强大的网络和性能分析工具,被应用在 Linux 内核上。eBPF 允许开发者动态加载、更新和运行用户定义的代码,而无需重启内核或更改内核源代码。</p>
<p>本文是 eBPF 入门开发实践教程的第十三篇,主要介绍如何使用 eBPF 统计 TCP 连接延时,并使用 libbpf 在用户态处理数据。</p>
<h2 id="背景"><a class="header" href="#背景">背景</a></h2>
<p>在进行后端开发时,不论使用何种编程语言,我们都常常需要调用 MySQL、Redis 等数据库,或执行一些 RPC 远程调用,或者调用其他的 RESTful API。这些调用的底层通常都是基于 TCP 协议进行的。原因是 TCP 协议具有可靠连接、错误重传、拥塞控制等优点因此在网络传输层协议中TCP 的应用广泛程度超过了 UDP。然而TCP 也有一些缺点,如建立连接的延时较长。因此,也出现了一些替代方案,例如 QUICQuick UDP Internet Connections快速 UDP 网络连接)。</p>
<p>分析 TCP 连接延时对网络性能分析、优化以及故障排查都非常有用。</p>
<h2 id="tcpconnlat-工具概述"><a class="header" href="#tcpconnlat-工具概述">tcpconnlat 工具概述</a></h2>
<p><code>tcpconnlat</code> 这个工具能够跟踪内核中执行活动 TCP 连接的函数(如通过 <code>connect()</code> 系统调用),并测量并显示连接延时,即从发送 SYN 到收到响应包的时间。</p>
<h3 id="tcp-连接原理"><a class="header" href="#tcp-连接原理">TCP 连接原理</a></h3>
<p>TCP 连接的建立过程常被称为“三次握手”Three-way Handshake。以下是整个过程的步骤</p>
<ol>
<li>客户端向服务器发送 SYN 包:客户端通过 <code>connect()</code> 系统调用发出 SYN。这涉及到本地的系统调用以及软中断的 CPU 时间开销。</li>
<li>SYN 包传送到服务器:这是一次网络传输,涉及到的时间取决于网络延迟。</li>
<li>服务器处理 SYN 包:服务器内核通过软中断接收包,然后将其放入半连接队列,并发送 SYN/ACK 响应。这主要涉及 CPU 时间开销。</li>
<li>SYN/ACK 包传送到客户端:这是另一次网络传输。</li>
<li>客户端处理 SYN/ACK客户端内核接收并处理 SYN/ACK 包,然后发送 ACK。这主要涉及软中断处理开销。</li>
<li>ACK 包传送到服务器:这是第三次网络传输。</li>
<li>服务器接收 ACK服务器内核接收并处理 ACK然后将对应的连接从半连接队列移动到全连接队列。这涉及到一次软中断的 CPU 开销。</li>
<li>唤醒服务器端用户进程:被 <code>accept()</code> 系统调用阻塞的用户进程被唤醒然后从全连接队列中取出来已经建立好的连接。这涉及一次上下文切换的CPU开销。</li>
</ol>
<p>完整的流程图如下所示:</p>
<p><img src="tcpconnlat1.png" alt="tcpconnlat1" /></p>
<p>在客户端视角在正常情况下一次TCP连接总的耗时也就就大约是一次网络RTT的耗时。但在某些情况下可能会导致连接时的网络传输耗时上涨、CPU处理开销增加、甚至是连接失败。这种时候在发现延时过长之后就可以结合其他信息进行分析。</p>
<h2 id="tcpconnlat-的-ebpf-实现"><a class="header" href="#tcpconnlat-的-ebpf-实现">tcpconnlat 的 eBPF 实现</a></h2>
<p>为了理解 TCP 的连接建立过程,我们需要理解 Linux 内核在处理 TCP 连接时所使用的两个队列:</p>
<ul>
<li>半连接队列SYN 队列):存储那些正在进行三次握手操作的 TCP 连接,服务器收到 SYN 包后,会将该连接信息存储在此队列中。</li>
<li>全连接队列Accept 队列):存储已经完成三次握手,等待应用程序调用 <code>accept()</code> 函数的 TCP 连接。服务器在收到 ACK 包后,会创建一个新的连接并将其添加到此队列。</li>
</ul>
<p>理解了这两个队列的用途,我们就可以开始探究 tcpconnlat 的具体实现。tcpconnlat 的实现可以分为内核态和用户态两个部分,其中包括了几个主要的跟踪点:<code>tcp_v4_connect</code>, <code>tcp_v6_connect</code><code>tcp_rcv_state_process</code></p>
<p>这些跟踪点主要位于内核中的 TCP/IP 网络栈。当执行相关的系统调用或内核函数时,这些跟踪点会被激活,从而触发 eBPF 程序的执行。这使我们能够捕获和测量 TCP 连接建立的整个过程。</p>
<p>让我们先来看一下这些挂载点的源代码:</p>
<pre><code class="language-c">SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("kprobe/tcp_v6_connect")
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("kprobe/tcp_rcv_state_process")
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
{
return handle_tcp_rcv_state_process(ctx, sk);
}
</code></pre>
<p>这段代码展示了三个内核探针kprobe的定义。<code>tcp_v4_connect</code><code>tcp_v6_connect</code> 在对应的 IPv4 和 IPv6 连接被初始化时被触发,调用 <code>trace_connect()</code> 函数,而 <code>tcp_rcv_state_process</code> 在内核处理 TCP 连接状态变化时被触发,调用 <code>handle_tcp_rcv_state_process()</code> 函数。</p>
<p>接下来的部分将分为两大块:一部分是对这些挂载点内核态部分的分析,我们将解读内核源代码来详细说明这些函数如何工作;另一部分是用户态的分析,将关注 eBPF 程序如何收集这些挂载点的数据,以及如何与用户态程序进行交互。</p>
<h3 id="tcp_v4_connect-函数解析"><a class="header" href="#tcp_v4_connect-函数解析">tcp_v4_connect 函数解析</a></h3>
<p><code>tcp_v4_connect</code>函数是Linux内核处理TCP的IPv4连接请求的主要方式。当用户态程序通过<code>socket</code>系统调用创建了一个套接字后,接着通过<code>connect</code>系统调用尝试连接到远程服务器,此时就会触发<code>tcp_v4_connect</code>函数。</p>
<pre><code class="language-c">/* This will initiate an outgoing connection. */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
struct inet_timewait_death_row *tcp_death_row;
struct inet_sock *inet = inet_sk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct ip_options_rcu *inet_opt;
struct net *net = sock_net(sk);
__be16 orig_sport, orig_dport;
__be32 daddr, nexthop;
struct flowi4 *fl4;
struct rtable *rt;
int err;
if (addr_len &lt; sizeof(struct sockaddr_in))
return -EINVAL;
if (usin-&gt;sin_family != AF_INET)
return -EAFNOSUPPORT;
nexthop = daddr = usin-&gt;sin_addr.s_addr;
inet_opt = rcu_dereference_protected(inet-&gt;inet_opt,
lockdep_sock_is_held(sk));
if (inet_opt &amp;&amp; inet_opt-&gt;opt.srr) {
if (!daddr)
return -EINVAL;
nexthop = inet_opt-&gt;opt.faddr;
}
orig_sport = inet-&gt;inet_sport;
orig_dport = usin-&gt;sin_port;
fl4 = &amp;inet-&gt;cork.fl.u.ip4;
rt = ip_route_connect(fl4, nexthop, inet-&gt;inet_saddr,
sk-&gt;sk_bound_dev_if, IPPROTO_TCP, orig_sport,
orig_dport, sk);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
if (err == -ENETUNREACH)
IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
return err;
}
if (rt-&gt;rt_flags &amp; (RTCF_MULTICAST | RTCF_BROADCAST)) {
ip_rt_put(rt);
return -ENETUNREACH;
}
if (!inet_opt || !inet_opt-&gt;opt.srr)
daddr = fl4-&gt;daddr;
tcp_death_row = &amp;sock_net(sk)-&gt;ipv4.tcp_death_row;
if (!inet-&gt;inet_saddr) {
err = inet_bhash2_update_saddr(sk, &amp;fl4-&gt;saddr, AF_INET);
if (err) {
ip_rt_put(rt);
return err;
}
} else {
sk_rcv_saddr_set(sk, inet-&gt;inet_saddr);
}
if (tp-&gt;rx_opt.ts_recent_stamp &amp;&amp; inet-&gt;inet_daddr != daddr) {
/* Reset inherited state */
tp-&gt;rx_opt.ts_recent = 0;
tp-&gt;rx_opt.ts_recent_stamp = 0;
if (likely(!tp-&gt;repair))
WRITE_ONCE(tp-&gt;write_seq, 0);
}
inet-&gt;inet_dport = usin-&gt;sin_port;
sk_daddr_set(sk, daddr);
inet_csk(sk)-&gt;icsk_ext_hdr_len = 0;
if (inet_opt)
inet_csk(sk)-&gt;icsk_ext_hdr_len = inet_opt-&gt;opt.optlen;
tp-&gt;rx_opt.mss_clamp = TCP_MSS_DEFAULT;
/* Socket identity is still unknown (sport may be zero).
* However we set state to SYN-SENT and not releasing socket
* lock select source port, enter ourselves into the hash tables and
* complete initialization after this.
*/
tcp_set_state(sk, TCP_SYN_SENT);
err = inet_hash_connect(tcp_death_row, sk);
if (err)
goto failure;
sk_set_txhash(sk);
rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
inet-&gt;inet_sport, inet-&gt;inet_dport, sk);
if (IS_ERR(rt)) {
err = PTR_ERR(rt);
rt = NULL;
goto failure;
}
/* OK, now commit destination to socket. */
sk-&gt;sk_gso_type = SKB_GSO_TCPV4;
sk_setup_caps(sk, &amp;rt-&gt;dst);
rt = NULL;
if (likely(!tp-&gt;repair)) {
if (!tp-&gt;write_seq)
WRITE_ONCE(tp-&gt;write_seq,
secure_tcp_seq(inet-&gt;inet_saddr,
inet-&gt;inet_daddr,
inet-&gt;inet_sport,
usin-&gt;sin_port));
tp-&gt;tsoffset = secure_tcp_ts_off(net, inet-&gt;inet_saddr,
inet-&gt;inet_daddr);
}
inet-&gt;inet_id = get_random_u16();
if (tcp_fastopen_defer_connect(sk, &amp;err))
return err;
if (err)
goto failure;
err = tcp_connect(sk);
if (err)
goto failure;
return 0;
failure:
/*
* This unhashes the socket and releases the local port,
* if necessary.
*/
tcp_set_state(sk, TCP_CLOSE);
inet_bhash2_reset_saddr(sk);
ip_rt_put(rt);
sk-&gt;sk_route_caps = 0;
inet-&gt;inet_dport = 0;
return err;
}
EXPORT_SYMBOL(tcp_v4_connect);
</code></pre>
<p>参考链接:<a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L340">https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L340</a></p>
<p>接下来,我们一步步分析这个函数:</p>
<p>首先,这个函数接收三个参数:一个套接字指针<code>sk</code>,一个指向套接字地址结构的指针<code>uaddr</code>和地址的长度<code>addr_len</code></p>
<pre><code class="language-c">int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
</code></pre>
<p>函数一开始就进行了参数检查确认地址长度正确而且地址的协议族必须是IPv4。不满足这些条件会导致函数返回错误。</p>
<p>接下来函数获取目标地址如果设置了源路由选项这是一个高级的IP特性通常不会被使用那么它还会获取源路由的下一跳地址。</p>
<pre><code class="language-c">nexthop = daddr = usin-&gt;sin_addr.s_addr;
inet_opt = rcu_dereference_protected(inet-&gt;inet_opt,
lockdep_sock_is_held(sk));
if (inet_opt &amp;&amp; inet_opt-&gt;opt.srr) {
if (!daddr)
return -EINVAL;
nexthop = inet_opt-&gt;opt.faddr;
}
</code></pre>
<p>然后,使用这些信息来寻找一个路由到目标地址的路由项。如果不能找到路由项或者路由项指向一个多播或广播地址,函数返回错误。</p>
<p>接下来它更新了源地址处理了一些TCP时间戳选项的状态并设置了目标端口和地址。之后它更新了一些其他的套接字和TCP选项并设置了连接状态为<code>SYN-SENT</code></p>
<p>然后,这个函数使用<code>inet_hash_connect</code>函数尝试将套接字添加到已连接的套接字的散列表中。如果这步失败,它会恢复套接字的状态并返回错误。</p>
<p>如果前面的步骤都成功了,接着,使用新的源和目标端口来更新路由项。如果这步失败,它会清理资源并返回错误。</p>
<p>接下来,它提交目标信息到套接字,并为之后的分段偏移选择一个安全的随机值。</p>
<p>然后函数尝试使用TCP Fast OpenTFO进行连接如果不能使用TFO或者TFO尝试失败它会使用普通的TCP三次握手进行连接。</p>
<p>最后,如果上面的步骤都成功了,函数返回成功,否则,它会清理所有资源并返回错误。</p>
<p>总的来说,<code>tcp_v4_connect</code>函数是一个处理TCP连接请求的复杂函数它处理了很多情况包括参数检查、路由查找、源地址选择、源路由、TCP选项处理、TCP Fast Open等等。它的主要目标是尽可能安全和有效地建立TCP连接。</p>
<h3 id="内核态代码"><a class="header" href="#内核态代码">内核态代码</a></h3>
<pre><code class="language-c">// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2020 Wenbo Zhang
#include &lt;vmlinux.h&gt;
#include &lt;bpf/bpf_helpers.h&gt;
#include &lt;bpf/bpf_core_read.h&gt;
#include &lt;bpf/bpf_tracing.h&gt;
#include "tcpconnlat.h"
#define AF_INET 2
#define AF_INET6 10
const volatile __u64 targ_min_us = 0;
const volatile pid_t targ_tgid = 0;
struct piddata {
char comm[TASK_COMM_LEN];
u64 ts;
u32 tgid;
};
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 4096);
__type(key, struct sock *);
__type(value, struct piddata);
} start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
static int trace_connect(struct sock *sk)
{
u32 tgid = bpf_get_current_pid_tgid() &gt;&gt; 32;
struct piddata piddata = {};
if (targ_tgid &amp;&amp; targ_tgid != tgid)
return 0;
bpf_get_current_comm(&amp;piddata.comm, sizeof(piddata.comm));
piddata.ts = bpf_ktime_get_ns();
piddata.tgid = tgid;
bpf_map_update_elem(&amp;start, &amp;sk, &amp;piddata, 0);
return 0;
}
static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
{
struct piddata *piddatap;
struct event event = {};
s64 delta;
u64 ts;
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
return 0;
piddatap = bpf_map_lookup_elem(&amp;start, &amp;sk);
if (!piddatap)
return 0;
ts = bpf_ktime_get_ns();
delta = (s64)(ts - piddatap-&gt;ts);
if (delta &lt; 0)
goto cleanup;
event.delta_us = delta / 1000U;
if (targ_min_us &amp;&amp; event.delta_us &lt; targ_min_us)
goto cleanup;
__builtin_memcpy(&amp;event.comm, piddatap-&gt;comm,
sizeof(event.comm));
event.ts_us = ts / 1000;
event.tgid = piddatap-&gt;tgid;
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
if (event.af == AF_INET) {
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
} else {
BPF_CORE_READ_INTO(&amp;event.saddr_v6, sk,
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
BPF_CORE_READ_INTO(&amp;event.daddr_v6, sk,
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
}
bpf_perf_event_output(ctx, &amp;events, BPF_F_CURRENT_CPU,
&amp;event, sizeof(event));
cleanup:
bpf_map_delete_elem(&amp;start, &amp;sk);
return 0;
}
SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(tcp_v4_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("kprobe/tcp_v6_connect")
int BPF_KPROBE(tcp_v6_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("kprobe/tcp_rcv_state_process")
int BPF_KPROBE(tcp_rcv_state_process, struct sock *sk)
{
return handle_tcp_rcv_state_process(ctx, sk);
}
SEC("fentry/tcp_v4_connect")
int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("fentry/tcp_v6_connect")
int BPF_PROG(fentry_tcp_v6_connect, struct sock *sk)
{
return trace_connect(sk);
}
SEC("fentry/tcp_rcv_state_process")
int BPF_PROG(fentry_tcp_rcv_state_process, struct sock *sk)
{
return handle_tcp_rcv_state_process(ctx, sk);
}
char LICENSE[] SEC("license") = "GPL";
</code></pre>
<p>这个eBPFExtended Berkeley Packet Filter程序主要用来监控并收集TCP连接的建立时间即从发起TCP连接请求(<code>connect</code>系统调用)到连接建立完成(SYN-ACK握手过程完成)的时间间隔。这对于监测网络延迟、服务性能分析等方面非常有用。</p>
<p>首先定义了两个eBPF maps<code>start</code><code>events</code><code>start</code>是一个哈希表,用于存储发起连接请求的进程信息和时间戳,而<code>events</code>是一个<code>PERF_EVENT_ARRAY</code>类型的map用于将事件数据传输到用户态。</p>
<pre><code class="language-c">struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 4096);
__type(key, struct sock *);
__type(value, struct piddata);
} start SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
</code></pre>
<p><code>tcp_v4_connect</code><code>tcp_v6_connect</code>的kprobe处理函数<code>trace_connect</code>会记录下发起连接请求的进程信息进程名、进程ID和当前时间戳并以socket结构作为key存储到<code>start</code>这个map中。</p>
<pre><code class="language-c">static int trace_connect(struct sock *sk)
{
u32 tgid = bpf_get_current_pid_tgid() &gt;&gt; 32;
struct piddata piddata = {};
if (targ_tgid &amp;&amp; targ_tgid != tgid)
return 0;
bpf_get_current_comm(&amp;piddata.comm, sizeof(piddata.comm));
piddata.ts = bpf_ktime_get_ns();
piddata.tgid = tgid;
bpf_map_update_elem(&amp;start, &amp;sk, &amp;piddata, 0);
return 0;
}
</code></pre>
<p>当TCP状态机处理到SYN-ACK包即连接建立的时候会触发<code>tcp_rcv_state_process</code>的kprobe处理函数<code>handle_tcp_rcv_state_process</code>。在这个函数中首先检查socket的状态是否为<code>SYN-SENT</code>,如果是,会从<code>start</code>这个map中查找socket对应的进程信息。然后计算出从发起连接到现在的时间间隔将该时间间隔进程信息以及TCP连接的详细信息源端口目标端口源IP目标IP等作为event通过<code>bpf_perf_event_output</code>函数发送到用户态。</p>
<pre><code class="language-c">static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
{
struct piddata *piddatap;
struct event event = {};
s64 delta;
u64 ts;
if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
return 0;
piddatap = bpf_map_lookup_elem(&amp;start, &amp;sk);
if (!piddatap)
return 0;
ts = bpf_ktime_get_ns();
delta = (s64)(ts - piddatap-&gt;ts);
if (delta &lt; 0)
goto cleanup;
event.delta_us = delta / 1000U;
if (targ_min_us &amp;&amp; event.delta_us &lt; targ_min_us)
goto
cleanup;
__builtin_memcpy(&amp;event.comm, piddatap-&gt;comm,
sizeof(event.comm));
event.ts_us = ts / 1000;
event.tgid = piddatap-&gt;tgid;
event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
if (event.af == AF_INET) {
event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
} else {
BPF_CORE_READ_INTO(&amp;event.saddr_v6, sk,
__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
BPF_CORE_READ_INTO(&amp;event.daddr_v6, sk,
__sk_common.skc_v6_daddr.in6_u.u6_addr32);
}
bpf_perf_event_output(ctx, &amp;events, BPF_F_CURRENT_CPU,
&amp;event, sizeof(event));
cleanup:
bpf_map_delete_elem(&amp;start, &amp;sk);
return 0;
}
</code></pre>
<p>理解这个程序的关键在于理解Linux内核的网络栈处理流程以及eBPF程序的运行模式。Linux内核网络栈对TCP连接建立的处理过程是首先调用<code>tcp_v4_connect</code><code>tcp_v6_connect</code>函数根据IP版本不同发起TCP连接然后在收到SYN-ACK包时通过<code>tcp_rcv_state_process</code>函数来处理。eBPF程序通过在这两个关键函数上设置kprobe可以在关键时刻得到通知并执行相应的处理代码。</p>
<p>一些关键概念说明:</p>
<ul>
<li>kprobeKernel Probe是Linux内核中用于动态追踪内核行为的机制。可以在内核函数的入口和退出处设置断点当断点被触发时会执行与kprobe关联的eBPF程序。</li>
<li>map是eBPF程序中的一种数据结构用于在内核态和用户态之间共享数据。</li>
<li>socket在Linux网络编程中socket是一个抽象概念表示一个网络连接的端点。内核中的<code>struct sock</code>结构就是对socket的实现。</li>
</ul>
<h3 id="用户态数据处理"><a class="header" href="#用户态数据处理">用户态数据处理</a></h3>
<p>用户态数据处理是使用<code>perf_buffer__poll</code>来接收并处理从内核发送到用户态的eBPF事件。<code>perf_buffer__poll</code>是libbpf库提供的一个便捷函数用于轮询perf event buffer并处理接收到的数据。</p>
<p>首先,让我们详细看一下主轮询循环:</p>
<pre><code class="language-c"> /* main: poll */
while (!exiting) {
err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
if (err &lt; 0 &amp;&amp; err != -EINTR) {
fprintf(stderr, "error polling perf buffer: %s\n", strerror(-err));
goto cleanup;
}
/* reset err to return 0 if exiting */
err = 0;
}
</code></pre>
<p>这段代码使用一个while循环来反复轮询perf event buffer。如果轮询出错例如由于信号中断会打印出错误消息。这个轮询过程会一直持续直到收到一个退出标志<code>exiting</code></p>
<p>接下来,让我们来看看<code>handle_event</code>函数这个函数将处理从内核发送到用户态的每一个eBPF事件</p>
<pre><code class="language-c">void handle_event(void* ctx, int cpu, void* data, __u32 data_sz) {
const struct event* e = data;
char src[INET6_ADDRSTRLEN];
char dst[INET6_ADDRSTRLEN];
union {
struct in_addr x4;
struct in6_addr x6;
} s, d;
static __u64 start_ts;
if (env.timestamp) {
if (start_ts == 0)
start_ts = e-&gt;ts_us;
printf("%-9.3f ", (e-&gt;ts_us - start_ts) / 1000000.0);
}
if (e-&gt;af == AF_INET) {
s.x4.s_addr = e-&gt;saddr_v4;
d.x4.s_addr = e-&gt;daddr_v4;
} else if (e-&gt;af == AF_INET6) {
memcpy(&amp;s.x6.s6_addr, e-&gt;saddr_v6, sizeof(s.x6.s6_addr));
memcpy(&amp;d.x6.s6_addr, e-&gt;daddr_v6, sizeof(d.x6.s6_addr));
} else {
fprintf(stderr, "broken event: event-&gt;af=%d", e-&gt;af);
return;
}
if (env.lport) {
printf("%-6d %-12.12s %-2d %-16s %-6d %-16s %-5d %.2f\n", e-&gt;tgid,
e-&gt;comm, e-&gt;af == AF_INET ? 4 : 6,
inet_ntop(e-&gt;af, &amp;s, src, sizeof(src)), e-&gt;lport,
inet_ntop(e-&gt;af, &amp;d, dst, sizeof(dst)), ntohs(e-&gt;dport),
e-&gt;delta_us / 1000.0);
} else {
printf("%-6d %-12.12s %-2d %-16s %-16s %-5d %.2f\n", e-&gt;tgid, e-&gt;comm,
e-&gt;af == AF_INET ? 4 : 6, inet_ntop(e-&gt;af, &amp;s, src, sizeof(src)),
inet_ntop(e-&gt;af, &amp;d, dst, sizeof(dst)), ntohs(e-&gt;dport),
e-&gt;delta_us / 1000.0);
}
}
</code></pre>
<p><code>handle_event</code>函数的参数包括了CPU编号、指向数据的指针以及数据的大小。数据是一个<code>event</code>结构体包含了之前在内核态计算得到的TCP连接的信息。</p>
<p>首先它将接收到的事件的时间戳和起始时间戳如果存在进行对比计算出事件的相对时间并打印出来。接着根据IP地址的类型IPv4或IPv6将源地址和目标地址从网络字节序转换为主机字节序。</p>
<p>最后根据用户是否选择了显示本地端口将进程ID、进程名称、IP版本、源IP地址、本地端口如果有、目标IP地址、目标端口以及连接建立时间打印出来。这个连接建立时间是我们在内核态eBPF程序中计算并发送到用户态的。</p>
<h2 id="编译运行"><a class="header" href="#编译运行">编译运行</a></h2>
<pre><code class="language-console">$ make
...
BPF .output/tcpconnlat.bpf.o
GEN-SKEL .output/tcpconnlat.skel.h
CC .output/tcpconnlat.o
BINARY tcpconnlat
$ sudo ./tcpconnlat
PID COMM IP SADDR DADDR DPORT LAT(ms)
222564 wget 4 192.168.88.15 110.242.68.3 80 25.29
222684 wget 4 192.168.88.15 167.179.101.42 443 246.76
222726 ssh 4 192.168.88.15 167.179.101.42 22 241.17
222774 ssh 4 192.168.88.15 1.15.149.151 22 25.31
</code></pre>
<p>源代码:<a href="https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/13-tcpconnlat">https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/13-tcpconnlat</a> 关于如何安装依赖,请参考:<a href="https://eunomia.dev/tutorials/11-bootstrap/">https://eunomia.dev/tutorials/11-bootstrap/</a></p>
<p>参考资料:</p>
<ul>
<li><a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/tcpconnlat.c">tcpconnlat</a></li>
</ul>
<h2 id="总结"><a class="header" href="#总结">总结</a></h2>
<p>通过本篇 eBPF 入门实践教程,我们学习了如何使用 eBPF 来跟踪和统计 TCP 连接建立的延时。我们首先深入探讨了 eBPF 程序如何在内核态监听特定的内核函数,然后通过捕获这些函数的调用,从而得到连接建立的起始时间和结束时间,计算出延时。</p>
<p>我们还进一步了解了如何使用 BPF maps 来在内核态存储和查询数据,从而在 eBPF 程序的多个部分之间共享数据。同时,我们也探讨了如何使用 perf events 来将数据从内核态发送到用户态,以便进一步处理和展示。</p>
<p>在用户态,我们介绍了如何使用 libbpf 库的 API例如 perf_buffer__poll来接收和处理内核态发送过来的数据。我们还讲解了如何对这些数据进行解析和打印使得它们能以人类可读的形式显示出来。</p>
<p>如果您希望学习更多关于 eBPF 的知识和实践,请查阅 eunomia-bpf 的官方文档:<a href="https://github.com/eunomia-bpf/eunomia-bpf">https://github.com/eunomia-bpf/eunomia-bpf</a> 。您还可以访问我们的教程代码仓库 <a href="https://github.com/eunomia-bpf/bpf-developer-tutorial">https://github.com/eunomia-bpf/bpf-developer-tutorial</a> 或网站 <a href="https://eunomia.dev/zh/tutorials/">https://eunomia.dev/zh/tutorials/</a> 以获取更多示例和完整的教程。</p>
</main>
<nav class="nav-wrapper" aria-label="Page navigation">
<!-- Mobile navigation buttons -->
<a rel="prev" href="../12-profile/index.html" class="mobile-nav-chapters previous" title="Previous chapter" aria-label="Previous chapter" aria-keyshortcuts="Left">
<i class="fa fa-angle-left"></i>
</a>
<a rel="next prefetch" href="../14-tcpstates/index.html" class="mobile-nav-chapters next" title="Next chapter" aria-label="Next chapter" aria-keyshortcuts="Right">
<i class="fa fa-angle-right"></i>
</a>
<div style="clear: both"></div>
</nav>
</div>
</div>
<nav class="nav-wide-wrapper" aria-label="Page navigation">
<a rel="prev" href="../12-profile/index.html" class="nav-chapters previous" title="Previous chapter" aria-label="Previous chapter" aria-keyshortcuts="Left">
<i class="fa fa-angle-left"></i>
</a>
<a rel="next prefetch" href="../14-tcpstates/index.html" class="nav-chapters next" title="Next chapter" aria-label="Next chapter" aria-keyshortcuts="Right">
<i class="fa fa-angle-right"></i>
</a>
</nav>
</div>
<script>
window.playground_copyable = true;
</script>
<script src="../elasticlunr.min.js"></script>
<script src="../mark.min.js"></script>
<script src="../searcher.js"></script>
<script src="../clipboard.min.js"></script>
<script src="../highlight.js"></script>
<script src="../book.js"></script>
<!-- Custom JS scripts -->
</div>
</body>
</html>