Merge pull request #54 from qianmoke/master

Translate SysCall
Dongliang Mu
2016-06-11 21:28:57 -04:00
committed by GitHub
4 changed files with 1647 additions and 0 deletions

406
SysCall/syscall-1.md Normal file

@@ -0,0 +1,406 @@
System calls in the Linux kernel. Part 1.
================================================================================
Introduction
--------------------------------------------------------------------------------
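This post opens a new chapter in [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) and, as you may understand from the title, this chapter is devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic is not accidental. In the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts, because software interrupts are the most common way to implement system calls. We will look at different aspects of system calls: how a system call is started from user space, how a set of system call handlers is implemented in the kernel, the [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts, and more.
Before we look at how the Linux kernel implements system calls, it helps to know some theory about them. Let's do that in the following paragraph.
What is a system call?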
--------------------------------------------------------------------------------
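A system call is a user space request for a kernel service. An operating system kernel provides many services. When a program reads from or writes to a file, starts listening for connections on a [socket](https://en.wikipedia.org/wiki/Network_socket), deletes or creates a directory, or even finishes its work, it uses a system call. In other words, a system call is just a [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) function in kernel space that user space programs call to handle some request.
The Linux kernel provides a set of these functions, and each architecture provides its own set. For example, [x86_64](https://en.wikipedia.org/wiki/X86-64) provides [322](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl) system calls and [x86](https://en.wikipedia.org/wiki/X86) provides [358](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_32.tbl) different system calls.
So a system call is just a function. Let's look at a simple `Hello world` example written in assembly: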
```assembly
.data
msg:
.ascii "Hello, world!\n"
len = . - msg
.text
.global _start
_start:
movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $len, %rdx
syscall
movq $60, %rax
xorq %rdi, %rdi
syscall
```
We can compile it with the following commands:
```
$ gcc -c test.S
$ ld -o test test.o
```
and run it as follows:
```
./test
Hello, world!
```
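This is a simple Linux `Hello world` assembly program for the `x86_64` architecture. It contains two sections:
* `.data`
* `.text`
The first section, `.data`, stores the initialized data of our program (the `Hello, world!` string and its length in our case). The second section, `.text`, contains the code of our program. We can split the code into two parts: the first part runs up to the first `syscall` instruction and the second part lies between the first and second `syscall` instructions. First of all, what does the `syscall` instruction do in our code and in general? As we can read in the [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):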
```
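SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX).
(The WRMSR instruction ensures that the IA32_LSTAR MSR always contains a canonical address.)
...
...
...
SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the IA32_STAR MSR.
However, the CS and SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. Instead, the descriptor caches are loaded with fixed values.
It is the responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values correspond to the fixed values loaded into the descriptor caches; the SYSCALL instruction does not ensure this correspondence.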
```
The kernel sets up the `syscall` entry point by writing the address of `entry_SYSCALL_64`, which is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and is the entry point for the `SYSCALL` instruction, to the `IA32_LSTAR` [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register).
This happens in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file:
```C
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
```
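So the `syscall` instruction invokes the handler of a given system call. But how does it know which handler to call? It gets this information from the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register). As you can see in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), each system call has a unique number. In our example the first system call is `write`, which writes data to the given file. Let's look it up in the system call table: the [write](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L10) system call has number `1`. We pass this number through the `rax` register in our example. The next general purpose registers, `%rdi`, `%rsi`, and `%rdx`, hold the parameters of the `write` system call. In our case they are the [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) (`1` is [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29)), the pointer to our string, and the size of the data. Yes, you heard right: parameters for a system call. As mentioned above, a system call is just a `C` function in kernel space. In our case the first system call is `write`, which is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like this: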
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
...
...
...
}
```
Or in other words:
```C
ssize_t write(int fd, const void *buf, size_t nbytes);
```
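Don't worry about the `SYSCALL_DEFINE3` macro for now; we'll come back to it.
The second part of our example is the same, but it calls another system call, [exit](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L69). This system call takes only one parameter:
* Return value
and handles the way our program exits. We can pass the name of our program to the [strace](https://en.wikipedia.org/wiki/Strace) utility to see the system calls it makes: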
```
$ strace ./test
execve("./test", ["./test"], [/* 62 vars */]) = 0
write(1, "Hello, world!\n", 14Hello, world!
) = 14
_exit(0) = ?
+++ exited with 0 +++
```
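In the first line of the `strace` output we can see the [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third lines show the system calls used in our program: `write` and `exit`. Note that we pass the parameters through the general purpose registers in our example. The order of the registers is not accidental; it is defined by the [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions).
This and other conventions for the `x86_64` architecture are described in a dedicated document, the [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In general, a function argument is placed either in a register or on the stack. The register order is: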
* `rdi`;
* `rsi`;
* `rdx`;
* `rcx`;
* `r8`;
* `r9`.
for the first six parameters of a function. If a function has more than six arguments, the remaining ones are placed on the stack.
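The register convention used by the assembly example can also be exercised from C. The following is a minimal sketch, not code from the kernel or from libc: `raw_write` is a hypothetical helper name, and the inline assembly assumes GCC or Clang on Linux `x86_64`. On other architectures the sketch falls back to the libc `syscall()` function, which reports errors through `errno` instead of a negative return value.

```c
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() fallback */

/*
 * raw_write is a hypothetical helper used for illustration only.
 * It issues the write system call the same way the assembly example does:
 * number in rax, arguments in rdi, rsi and rdx.
 */
long raw_write(int fd, const void *buf, size_t count)
{
#if defined(__x86_64__)
    long ret;

    /* syscall clobbers rcx (return address) and r11 (saved flags) */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_write), "D"((long)fd), "S"(buf), "d"(count)
                      : "rcx", "r11", "memory");
    return ret;            /* a negative errno value on failure */
#else
    return syscall(SYS_write, fd, buf, count);   /* -1 and errno on failure */
#endif
}
```

Calling `raw_write(1, "Hello, world!\n", 14)` should print the string and return the number of bytes written. Note that the kernel entry code expects the fourth system call argument in `r10` rather than `rcx`, because the `syscall` instruction itself overwrites `rcx`.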
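We usually do not use system calls directly in our code, but our programs use them whenever we print something, check access to a file, or read from or write to it.
For example: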
```C
#include <stdio.h>
int main(int argc, char **argv)
{
FILE *fp;
char buff[255];
fp = fopen("test.txt", "r");
fgets(buff, 255, fp);
printf("%s\n", buff);
fclose(fp);
return 0;
}
```
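In the Linux kernel there are no `fopen`, `fgets`, `printf`, or `fclose` system calls; there are `open`, `read`, `write`, and `close` instead. `fopen`, `fgets`, `printf`, and `fclose` are just functions defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). These functions are wrappers around system calls. We do not call system calls directly in our code; we use [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library instead. The main reason is simple: a system call must be performed quickly, very quickly. Because a system call must be quick, it must be small. The standard library takes responsibility for performing system calls with the correct parameters and makes various checks before it calls the given system call. Let's compile our program with the following command: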
```
$ gcc test.c -o test
```
and examine it with the [ltrace](https://en.wikipedia.org/wiki/Ltrace) utility:
```
$ ltrace ./test
__libc_start_main([ "./test" ] <unfinished ...>
fopen("test.txt", "r") = 0x602010
fgets("Hello World!\n", 255, 0x602010) = 0x7ffd2745e700
puts("Hello World!\n"Hello World!
) = 14
fclose(0x602010) = 0
+++ exited (status 0) +++
```
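The `ltrace` utility displays the user space calls of a program. The `fopen` function opens the given text file, the `fgets` function reads the file content into the `buff` buffer, the `puts` function prints it to `stdout`, and the `fclose` function closes the file. As written above, all of these functions call an appropriate system call. For example, `puts` calls the `write` system call internally, which we can see if we add the `-S` option to `ltrace`: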
```
write@SYS(1, "Hello World!\n\n", 14) = 14
```
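System calls are ubiquitous. Every program needs to open, write, or read files, make network connections, allocate memory, and do many other things that only the kernel can provide. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special files of the form `/proc/pid/syscall` that expose the number of the system call currently being executed by the process and its argument registers. For example, the process with pid 1, which is [systemd](https://en.wikipedia.org/wiki/Systemd) for me: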
```
$ sudo cat /proc/1/comm
systemd
$ sudo cat /proc/1/syscall
232 0x4 0x7ffdf82e11b0 0x1f 0xffffffff 0x100 0x7ffdf82e11bf 0x7ffdf82e11a0 0x7f9114681193
```
The system call with number `232` is [epoll_wait](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L241), which waits for an I/O event on an [epoll](https://en.wikipedia.org/wiki/Epoll) file descriptor. Or, for example, the `emacs` editor in which I'm writing this part:
```
$ ps ax | grep emacs
2093 ? Sl 2:40 emacs
$ sudo cat /proc/2093/comm
emacs
$ sudo cat /proc/2093/syscall
270 0xf 0x7fff068a5a90 0x7fff068a5b10 0x0 0x7fff068a59c0 0x7fff068a59d0 0x7fff068a59b0 0x7f777dd8813c
```
The system call with number `270` is [sys_pselect6](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L279), which allows `emacs` to monitor multiple file descriptors.
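The lines printed from `/proc/1/syscall` and `/proc/2093/syscall` can be decoded mechanically. The sketch below assumes the layout seen in the output above: the system call number, six argument registers, the stack pointer, and the program counter. `struct syscall_snapshot` and `parse_syscall_line` are hypothetical names introduced for illustration, and lines in any other form (for example for a task that is not blocked in a system call) are simply rejected.

```c
#include <stdio.h>

/* Hypothetical container for one line of /proc/<pid>/syscall. */
struct syscall_snapshot {
    long nr;                      /* system call number */
    unsigned long long args[6];   /* argument registers */
    unsigned long long sp;        /* user stack pointer */
    unsigned long long pc;        /* user program counter */
};

/* Returns 0 when all nine fields were parsed, -1 otherwise. */
int parse_syscall_line(const char *line, struct syscall_snapshot *out)
{
    int fields = sscanf(line, "%ld %llx %llx %llx %llx %llx %llx %llx %llx",
                        &out->nr,
                        &out->args[0], &out->args[1], &out->args[2],
                        &out->args[3], &out->args[4], &out->args[5],
                        &out->sp, &out->pc);

    return fields == 9 ? 0 : -1;
}
```

Feeding it the `systemd` line shown earlier yields `nr == 232`, which is `epoll_wait` in the `x86_64` system call table.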
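Now we know a little about system calls: what they are and why we need them. So let's look at the `write` system call that our program used.
Implementation of the write system call
--------------------------------------------------------------------------------
Let's look at the implementation of this system call directly in the Linux kernel source code. As we already know, the `write` system call is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like this: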
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;
if (f.file) {
loff_t pos = file_pos_read(f.file);
ret = vfs_write(f.file, buf, count, &pos);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
}
return ret;
}
```
First of all, the `SYSCALL_DEFINE3` macro is defined in the [include/linux/syscalls.h](https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h) header file and expands to the definition of the `sys_name(...)` function. Let's look at this macro:
```C
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
#define SYSCALL_DEFINEx(x, sname, ...) \
SYSCALL_METADATA(sname, x, __VA_ARGS__) \
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
```
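As we can see, the `SYSCALL_DEFINE3` macro takes the `name` parameter, which represents the name of a system call, and a variable number of arguments. This macro just expands to the `SYSCALL_DEFINEx` macro, passing along the number of parameters of the system call. `_##name` is a stub for the future name of the system call (you can read more about token concatenation with `##` in the [documentation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) of [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)). Next, the `SYSCALL_DEFINEx` macro expands to the following two macros: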
* `SYSCALL_METADATA`;
* `__SYSCALL_DEFINEx`.
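The implementation of the first macro, `SYSCALL_METADATA`, depends on the `CONFIG_FTRACE_SYSCALLS` kernel configuration option. As the name of the option suggests, it allows a tracer to catch syscall entry and exit events. If this kernel configuration option is enabled, the `SYSCALL_METADATA` macro initializes the `syscall_metadata` structure defined in the [include/trace/syscall.h](https://github.com/torvalds/linux/blob/master/include/trace/syscall.h) header file. The structure contains useful fields such as the name of a system call, its number in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), the number of parameters, the list of parameter types, and so on: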
```C
#define SYSCALL_METADATA(sname, nb, ...) \
... \
... \
... \
struct syscall_metadata __used \
__syscall_meta_##sname = { \
.name = "sys"#sname, \
.syscall_nr = -1, \
.nb_args = nb, \
.types = nb ? types_##sname : NULL, \
.args = nb ? args_##sname : NULL, \
.enter_event = &event_enter_##sname, \
.exit_event = &event_exit_##sname, \
.enter_fields = LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
}; \
static struct syscall_metadata __used \
__attribute__((section("__syscalls_metadata"))) \
*__p_syscall_meta_##sname = &__syscall_meta_##sname;
```
If the `CONFIG_FTRACE_SYSCALLS` kernel option is not enabled during kernel configuration, the `SYSCALL_METADATA` macro expands to nothing:
```C
#define SYSCALL_METADATA(sname, nb, ...)
```
The second macro, `__SYSCALL_DEFINEx`, expands to the following five function declarations and definitions:
```C
#define __SYSCALL_DEFINEx(x, name, ...) \
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
__attribute__((alias(__stringify(SyS##name)))); \
\
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \
\
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
\
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
{ \
long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
__MAP(x,__SC_TEST,__VA_ARGS__); \
__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
return ret; \
} \
\
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
```
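The first, `sys##name`, declares the system call handler function with the given name, `sys_system_call_name`. The `__SC_DECL` macro takes the `__VA_ARGS__` and combines each parameter type with its parameter name, because the macro definition itself cannot know the parameter types in advance. The `__MAP` macro applies the `__SC_DECL` macro to the `__VA_ARGS__` arguments. The other functions generated by the `__SYSCALL_DEFINEx` macro are needed to protect against [CVE-2009-0029](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0029), and we will not dive into the details here. As a result of the `SYSCALL_DEFINE3` macro, we get: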
```C
asmlinkage long sys_write(unsigned int fd, const char __user * buf, size_t count);
```
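The token-pasting trick behind `SYSCALL_DEFINE3` can be tried in user space. The following is a deliberately simplified sketch: the `TOY_` macros, `toy_sys_add3`, and `toy_nargs_add3` are hypothetical names that only mimic the shape of the kernel macros, without `__MAP`, the metadata, or the `SyS`/`SYSC` wrappers.

```c
/*
 * Toy analogue of SYSCALL_DEFINE3/SYSCALL_DEFINEx (hypothetical names).
 * The outer macro prepends "_" to the name and forwards the argument count;
 * the inner macro pastes the name onto fixed prefixes with ##.
 */
#define TOY_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3) \
    TOY_SYSCALL_DEFINEx(3, _##name, t1, a1, t2, a2, t3, a3)

#define TOY_SYSCALL_DEFINEx(x, sname, t1, a1, t2, a2, t3, a3) \
    const int toy_nargs##sname = x;                           \
    long toy_sys##sname(t1 a1, t2 a2, t3 a3)

/* Expands to: const int toy_nargs_add3 = 3; long toy_sys_add3(int a, int b, int c) */
TOY_SYSCALL_DEFINE3(add3, int, a, int, b, int, c)
{
    return (long)a + b + c;
}
```

As in the kernel, the function body follows the macro invocation, so the definition reads like an ordinary function with an unusual header.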
Now that we know a little about how a system call is defined, let's go back to the implementation of the `write` system call:
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;
if (f.file) {
loff_t pos = file_pos_read(f.file);
ret = vfs_write(f.file, buf, count, &pos);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
}
return ret;
}
```
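As we already know and can see from the code, it takes three arguments:
* `fd` - file descriptor;
* `buf` - buffer to write;
* `count` - length of the buffer to write.
and writes data from a buffer supplied by the user to the given device or file. Note that the second parameter, `buf`, is declared with the `__user` attribute. The main purpose of this attribute is to let the [sparse](https://en.wikipedia.org/wiki/Sparse) utility check the Linux kernel code. `__user` is defined in the [include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) header file and depends on the `__CHECKER__` definition in the Linux kernel. That's all the useful meta-information about the `sys_write` system call. Let's try to understand how it is implemented. It starts with the definition of `f`, a variable of the `fd` structure type that represents a file descriptor in the Linux kernel, and we store the result of the `fdget_pos` call in it. The `fdget_pos` function is defined in the [include/linux/file.h](https://github.com/torvalds/linux/blob/master/include/linux/file.h) header file and just passes the result of `__fdget_pos` to the `__to_fd` function: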
```C
static inline struct fd fdget_pos(int fd)
{
return __to_fd(__fdget_pos(fd));
}
```
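The main purpose of `fdget_pos` is to convert the given file descriptor, which is just a number, to the `fd` structure. Through a long chain of function calls, the `fdget_pos` function gets the file descriptor table of the current process, `current->files`, and tries to find the entry that corresponds to the given file descriptor number. Once we have the `fd` structure for the given file descriptor number, we check that the file exists and return `-EBADF` if it does not. We then get the current position in the file with a call to the `file_pos_read` function, which just returns the `f_pos` field of the file: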
```C
static inline loff_t file_pos_read(struct file *file)
{
return file->f_pos;
}
```
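and call the `vfs_write` function. The `vfs_write` function is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and does the actual work: it writes the given buffer to the given file, starting from the given position. We will not dive into the details of `vfs_write`, because this function is only weakly related to the `system call` concept and mostly belongs to the [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) concept, which is covered in another chapter. After `vfs_write` has finished its work, we check the result and, if it succeeded, update the position in the file with the `file_pos_write` function: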
```C
if (ret >= 0)
file_pos_write(f.file, pos);
```
which just updates `f_pos` of the given file with the given position:
```C
static inline void file_pos_write(struct file *file, loff_t pos)
{
file->f_pos = pos;
}
```
At the end of the `write` system call handler, we can see the call to the following function:
```C
fdput_pos(f);
```
which unlocks the `f_pos_lock` mutex that protects the file position during concurrent writes from threads that share a file descriptor.
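The control flow of `sys_write` (read the position, write, store the position back only on success) can be modelled in a few lines of user space C. This is only an illustrative sketch: `struct toy_file`, `toy_vfs_write`, and `toy_write` are hypothetical stand-ins, the "file" is a fixed in-memory buffer, and locking is left out entirely.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_FILE_CAP 64

/* Hypothetical in-memory "file" with a current position. */
struct toy_file {
    char data[TOY_FILE_CAP];
    long long f_pos;
};

/* Stand-in for vfs_write: copies bytes at *pos and advances *pos. */
long toy_vfs_write(struct toy_file *file, const char *buf, size_t count,
                   long long *pos)
{
    size_t room;

    if (*pos < 0 || *pos > TOY_FILE_CAP)
        return -EINVAL;
    room = TOY_FILE_CAP - (size_t)*pos;
    if (count > 0 && room == 0)
        return -ENOSPC;
    if (count > room)
        count = room;                     /* short write */
    memcpy(file->data + *pos, buf, count);
    *pos += (long long)count;
    return (long)count;
}

/* Mirrors the shape of the write handler shown above. */
long toy_write(struct toy_file *file, const char *buf, size_t count)
{
    long ret = -EBADF;

    if (file) {
        long long pos = file->f_pos;      /* like file_pos_read() */

        ret = toy_vfs_write(file, buf, count, &pos);
        if (ret >= 0)
            file->f_pos = pos;            /* like file_pos_write() */
    }
    return ret;
}
```

Two consecutive calls therefore append one after the other, because the second call starts from the position stored by the first.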
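We have now seen part of the implementation of one system call provided by the Linux kernel. Of course we skipped some parts of the `write` implementation because, as mentioned above, this chapter covers only system-call-related material and not other subsystems such as the [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system).
Conclusion
--------------------------------------------------------------------------------
This concludes the first part covering system call concepts in the Linux kernel. We have covered the theory of system calls so far; in the next part we will continue to dive into this topic and look at the Linux kernel code related to system calls.
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com), or create an [issue](https://github.com/0xAX/linux-insides/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links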
--------------------------------------------------------------------------------
* [system call](https://en.wikipedia.org/wiki/System_call)
* [vdso](https://en.wikipedia.org/wiki/VDSO)
* [vsyscall](https://lwn.net/Articles/446528/)
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
* [socket](https://en.wikipedia.org/wiki/Network_socket)
* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [x86](https://en.wikipedia.org/wiki/X86)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions)
* [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf)
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [Intel manual. PDF](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
* [system call table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)
* [GCC macro documentation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
* [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29)
* [strace](https://en.wikipedia.org/wiki/Strace)
* [standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
* [wrapper functions](https://en.wikipedia.org/wiki/Wrapper_function)
* [ltrace](https://en.wikipedia.org/wiki/Ltrace)
* [sparse](https://en.wikipedia.org/wiki/Sparse)
* [proc file system](https://en.wikipedia.org/wiki/Procfs)
* [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
* [systemd](https://en.wikipedia.org/wiki/Systemd)
* [epoll](https://en.wikipedia.org/wiki/Epoll)
* [Previous chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)

408
SysCall/syscall-2.md Normal file

@@ -0,0 +1,408 @@
System calls in the Linux kernel. Part 2.
================================================================================
How does the Linux kernel handle a system call
--------------------------------------------------------------------------------
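The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel.
In the previous part we learned what a system call is in general and looked at it from the user space point of view, including part of the implementation of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call. In this part we continue to look at system calls, starting with some theory before we move on to the Linux kernel code.
A user application does not make a system call directly from its code. We did not write our `Hello world!` program like this: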
```C
int main(int argc, char **argv)
{
...
...
...
sys_write(fd1, buf, strlen(buf));
...
...
}
```
Instead, we can do something similar with the help of the [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library):
```C
#include <unistd.h>
int main(int argc, char **argv)
{
...
...
...
write(fd1, buf, strlen(buf));
...
...
}
```
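Either way, `write` here is neither a direct system call nor a kernel function. A program must fill the general purpose registers with the correct values in the correct order and then use the `syscall` instruction to make the actual system call. In this part we will look at what happens in the Linux kernel when the processor meets a `syscall` instruction.
Initialization of the system calls table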
--------------------------------------------------------------------------------
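From the previous part we know that the system call concept is very similar to an interrupt. So, when the processor executes a `syscall` instruction from a user application, this instruction, much like an exception, transfers control to a handler in the kernel. As we know, all exception handlers (or in other words the kernel [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions that react to an exception) live in the kernel code. But how does the Linux kernel find the address of the system call handler for a given system call? The Linux kernel contains a special table called the `system call table`. The system call table is represented by the `sys_call_table` array, which is defined in the [arch/x86/entry/syscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) source code file. Let's look at its implementation: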
```C
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
[0 ... __NR_syscall_max] = &sys_ni_syscall,
#include <asm/syscalls_64.h>
};
```
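As we can see, `sys_call_table` is an array of `__NR_syscall_max + 1` elements, where the `__NR_syscall_max` macro represents the maximum system call number for the given [architecture](https://en.wikipedia.org/wiki/List_of_CPU_architectures). This book is about the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case `__NR_syscall_max` is `322`, which is the correct number at the time of writing (the current Linux kernel version is `4.2.0-rc8+`). We can see this macro in the header file generated by [Kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) during kernel compilation, `include/generated/asm-offsets.h`: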
```C
#define __NR_syscall_max 322
```
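The [arch/x86/entry/syscalls/syscall_64.tbl](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L331) table for `x86_64` contains the same number of system calls. There are two important topics here: the type of the `sys_call_table` array and the initialization of its elements. First, the type. `sys_call_ptr_t` is the type of each entry in the system call table, that is, a pointer to a system call handler. It is defined as a [typedef](https://en.wikipedia.org/wiki/Typedef) for a pointer to a function that returns nothing and takes no arguments: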
```C
typedef void (*sys_call_ptr_t)(void);
```
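The second topic is the initialization of the `sys_call_table` array. As we can see in the code above, all elements of the array initially contain a pointer to the `sys_ni_syscall` system call handler. The `sys_ni_syscall` function represents not-implemented system calls. To start with, all elements of `sys_call_table` point to the not-implemented system call. This is correct initial behavior, because we only reserve storage for the pointers to the system call handlers here and populate them later on. The implementation of `sys_ni_syscall` is pretty simple: it just returns [-errno](http://man7.org/linux/man-pages/man3/errno.3.html), or `-ENOSYS` in our case: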
```C
asmlinkage long sys_ni_syscall(void)
{
return -ENOSYS;
}
```
The `-ENOSYS` error tells us that:
```
ENOSYS Function not implemented (POSIX.1)
```
Also note the `...` in the initialization of `sys_call_table`. It uses a [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) compiler extension called [Designated Initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). This extension allows us to initialize elements in non-fixed order. As we can see, at the end of the array we include the `asm/syscalls_64.h` header file. This header file is generated by the special script at [arch/x86/entry/syscalls/syscalltbl.sh](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscalltbl.sh) from the [syscall table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl). The `asm/syscalls_64.h` file contains definitions of the following macros:
```C
__SYSCALL_COMMON(0, sys_read, sys_read)
__SYSCALL_COMMON(1, sys_write, sys_write)
__SYSCALL_COMMON(2, sys_open, sys_open)
__SYSCALL_COMMON(3, sys_close, sys_close)
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
...
...
...
```
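The `__SYSCALL_COMMON` macro is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) and expands to the `__SYSCALL_64` macro, which in turn expands to an array element initializer: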
```C
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,
```
So, after this, our `sys_call_table` takes the following form:
```C
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
[0 ... __NR_syscall_max] = &sys_ni_syscall,
[0] = sys_read,
[1] = sys_write,
[2] = sys_open,
...
...
...
};
```
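After this, all elements that point to non-implemented system calls contain the address of the `sys_ni_syscall` function, which just returns `-ENOSYS` as we saw above, and the other elements point to the `sys_syscall_name` functions.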
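The same initialization pattern can be reproduced in a small user space program. The sketch below is only an analogy: every `toy_` name is hypothetical, the table has eight slots instead of `__NR_syscall_max + 1`, and the handlers return marker values instead of doing real work. It relies on the same GNU range designator (`[first ... last]`) that the kernel uses, so it assumes GCC or Clang.

```c
#include <errno.h>

#define TOY_NR_MAX 7

/* Same shape as sys_call_ptr_t, but returning long for easy checking. */
typedef long (*toy_call_ptr_t)(void);

long toy_ni_syscall(void) { return -ENOSYS; }   /* "not implemented" */
long toy_sys_read(void)   { return 100; }       /* marker value only */
long toy_sys_write(void)  { return 101; }       /* marker value only */

const toy_call_ptr_t toy_call_table[TOY_NR_MAX + 1] = {
    /* first, every slot points to the not-implemented handler ... */
    [0 ... TOY_NR_MAX] = &toy_ni_syscall,
    /* ... then the implemented slots override their default entry */
    [0] = toy_sys_read,
    [1] = toy_sys_write,
};
```

With this table, `toy_call_table[1]()` reaches the write marker handler, while any slot that was not overridden, such as `toy_call_table[5]()`, returns `-ENOSYS`. Some compilers warn about the overridden initializers when extra warnings are enabled; that is expected for this pattern.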
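At this point the system call table is filled and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is asked to handle a system call from a user space application. Remember the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to do some preparation, such as saving the user space registers and switching to a new stack, before it calls the interrupt handler. The same is true for system call handling. The preparation comes first, but before the Linux kernel can start this preparation, the entry point of a system call must be initialized, and only the Linux kernel knows how to perform this preparation. In the next section we will look at the system call entry initialization process in the Linux kernel.
Initialization of the system call entry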
--------------------------------------------------------------------------------
When a system call occurs in the system, where are the first bytes of code that start to handle it? As we can read in the Intel manual, [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):
```
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR
```
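This means that we need to put the system call entry address into the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) of the chapter about interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during initialization. This function is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file and initializes the `non-early` exception handlers, such as divide error and [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error. Besides initializing the `non-early` exception handlers, this function calls the `cpu_init` function from [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c), which, besides initializing `per-cpu` state, calls the `syscall_init` function from the same source code file.
This function initializes the system call entry point. Let's look at its implementation. It takes no parameters and first of all fills two model specific registers: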
```C
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
```
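The first model specific register, `MSR_STAR`, contains the user code segment in bits `63:48`. These bits are used to load the `CS` and `SS` segment registers by the `sysret` instruction, which returns from a system call to user code at the corresponding privilege level. `MSR_STAR` also contains the kernel code segment in bits `47:32`, which serve as the base selector for the `CS` and `SS` segment registers when a user space application executes a system call. In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol, which represents the system call entry. `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains the code that prepares for the execution of a system call handler (we already wrote about this preparation above). We will not look at `entry_SYSCALL_64` now and will return to it later in this chapter.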
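The bit layout of `MSR_STAR` is easy to check with plain integer arithmetic. The sketch below is an illustration, not kernel code: the helper names are hypothetical, and the selector values `0x10` and `0x23` are assumed here to stand for `__KERNEL_CS` and `__USER32_CS` on `x86_64`. The derivation of the `syscall` and 64-bit `sysret` selectors follows the rules given in the Intel manual for these instructions.

```c
#include <stdint.h>

/* Assumed selector values, used for illustration only. */
#define TOY_KERNEL_CS 0x10u
#define TOY_USER32_CS 0x23u

/* Same expression as the first wrmsrl() call above. */
uint64_t toy_star_value(uint16_t user32_cs, uint16_t kernel_cs)
{
    return ((uint64_t)user32_cs << 48) | ((uint64_t)kernel_cs << 32);
}

/* syscall: CS comes from bits 47:32 (RPL cleared), SS is that base plus 8. */
uint16_t toy_syscall_cs(uint64_t star)
{
    return (uint16_t)((star >> 32) & 0xfffc);
}

uint16_t toy_syscall_ss(uint64_t star)
{
    return (uint16_t)((star >> 32) + 8);
}

/* 64-bit sysret: CS is bits 63:48 plus 16, SS is bits 63:48 plus 8, RPL 3. */
uint16_t toy_sysret64_cs(uint64_t star)
{
    return (uint16_t)(((star >> 48) + 16) | 3);
}

uint16_t toy_sysret64_ss(uint64_t star)
{
    return (uint16_t)(((star >> 48) + 8) | 3);
}
```

With the assumed values, the `syscall` side yields `0x10` and `0x18`, and the 64-bit `sysret` side yields `0x33` and `0x2b`. This is why a single 32-bit user selector in bits `63:48` is enough to describe the user code and data segments.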
After setting the system call entry point, we need to set the following model specific registers:
* `MSR_CSTAR` - target `rip` for the compatibility mode callers;
* `MSR_IA32_SYSENTER_CS` - target `cs` for the `sysenter` instruction;
* `MSR_IA32_SYSENTER_ESP` - target `esp` for the `sysenter` instruction;
* `MSR_IA32_SYSENTER_EIP` - target `eip` for the `sysenter` instruction.
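The values of these model specific registers depend on the `CONFIG_IA32_EMULATION` kernel configuration option. If this option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. In that case, we fill these model specific registers with the entry point for system calls in compatibility mode: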
```C
wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);
```
and with the kernel code segment, a zeroed stack pointer, and the address of the `entry_SYSENTER_compat` symbol as the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter):
```C
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
```
On the other hand, if the `CONFIG_IA32_EMULATION` kernel configuration option is disabled, we write the address of the `ignore_sysret` symbol to `MSR_CSTAR`:
```C
wrmsrl(MSR_CSTAR, ignore_sysret);
```
which is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and just returns the `-ENOSYS` error code:
```assembly
ENTRY(ignore_sysret)
mov $-ENOSYS, %eax
sysret
END(ignore_sysret)
```
We also need to fill the `MSR_IA32_SYSENTER_CS`, `MSR_IA32_SYSENTER_ESP`, and `MSR_IA32_SYSENTER_EIP` model specific registers, as we did in the code above for the case when `CONFIG_IA32_EMULATION` is enabled. In this case (when `CONFIG_IA32_EMULATION` is not set) we fill `MSR_IA32_SYSENTER_ESP` and `MSR_IA32_SYSENTER_EIP` with zero and put the invalid segment of the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) into the `MSR_IA32_SYSENTER_CS` model specific register:
```C
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
```
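You can read more about the `Global Descriptor Table` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the Linux kernel booting process.
At the end of the `syscall_init` function, we mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing a set of flag bits to the `MSR_SYSCALL_MASK` model specific register: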
```C
wrmsrl(MSR_SYSCALL_MASK,
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
```
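The processor clears these flags every time the `syscall` instruction is executed. That's all: this is the end of the `syscall_init` function, which means that the system call entry is ready to work. Now we can see what happens when a user application executes the `syscall` instruction.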
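The effect of the value written to `MSR_SYSCALL_MASK` can be shown with a little bit arithmetic. This is a sketch under stated assumptions: the `TOY_EFLAGS_*` constants are the architectural bit positions of the corresponding flags, while the function names are hypothetical. On `syscall`, the processor first copies the current flags to `r11` and then clears every flag whose bit is set in the mask.

```c
#include <stdint.h>

/* Architectural flag bit positions, named with a TOY_ prefix for this sketch. */
#define TOY_EFLAGS_TF   0x00000100u   /* trap flag */
#define TOY_EFLAGS_IF   0x00000200u   /* interrupt enable flag */
#define TOY_EFLAGS_DF   0x00000400u   /* direction flag */
#define TOY_EFLAGS_IOPL 0x00003000u   /* I/O privilege level (two bits) */
#define TOY_EFLAGS_NT   0x00004000u   /* nested task */
#define TOY_EFLAGS_AC   0x00040000u   /* alignment check */

/* The same combination of flags that the wrmsrl() call above writes. */
uint64_t toy_syscall_mask(void)
{
    return TOY_EFLAGS_TF | TOY_EFLAGS_DF | TOY_EFLAGS_IF |
           TOY_EFLAGS_IOPL | TOY_EFLAGS_AC | TOY_EFLAGS_NT;
}

/* Flags seen by the entry code: every masked bit is cleared. */
uint64_t toy_flags_after_syscall(uint64_t rflags, uint64_t mask)
{
    return rflags & ~mask;
}
```

For example, a user space flags value of `0x246` (which has `IF` set) becomes `0x46` at the entry point, which is why `entry_SYSCALL_64` starts with interrupts disabled.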
Preparation before a system call handler is called
--------------------------------------------------------------------------------
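As I already wrote, before a system call or an interrupt handler is called by the Linux kernel, some preparation is needed. The `idtentry` macro does the preparation required before an exception handler is executed, the `interrupt` macro does the preparation required before an interrupt handler is called, and `entry_SYSCALL_64` does the preparation required before a system call handler is executed.
`entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and starts with the following macro: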
```assembly
SWAPGS_UNSAFE_STACK
```
This macro is defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) header file and expands to the `swapgs` instruction:
```C
#define SWAPGS_UNSAFE_STACK swapgs
```
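which exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE` model specific register. In other words, GS now refers to the kernel's per-CPU data instead of user space data. After this, we save the old stack pointer in the `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable and set the stack pointer to the top of the stack for the current processor: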
```assembly
movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
```
In the next step we push the stack segment and the old stack pointer onto the stack:
```assembly
pushq $__USER_DS
pushq PER_CPU_VAR(rsp_scratch)
```
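After this we enable interrupts, because interrupts are off on entry, and save on the stack the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register) (except `bp`, `bx`, and `r12` through `r15`), the flags, `-ENOSYS` for the non-implemented system call case, and the code segment register: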
```assembly
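ENABLE_INTERRUPTS(CLBR_NONE)
pushq %r11            /* user flags, saved in r11 by syscall */
pushq $__USER_CS      /* user code segment */
pushq %rcx            /* user return address, saved in rcx by syscall */
pushq %rax            /* system call number */
pushq %rdi            /* first argument */
pushq %rsi            /* second argument */
pushq %rdx            /* third argument */
pushq %rcx            /* rcx slot (holds the return address, not an argument) */
pushq $-ENOSYS        /* default return value */
pushq %r8             /* fifth argument */
pushq %r9             /* sixth argument */
pushq %r10            /* fourth argument */
pushq %r11            /* r11 slot (holds the saved flags) */
sub $(6*8), %rsp      /* space for bp, bx, r12-r15, which are not saved here */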
```
When a system call is made by a user application, the general purpose registers have the following state:
* `rax` - contains system call number;
* `rcx` - contains return address to the user space;
* `r11` - contains register flags;
* `rdi` - contains first argument of a system call handler;
* `rsi` - contains second argument of a system call handler;
* `rdx` - contains third argument of a system call handler;
* `r10` - contains fourth argument of a system call handler;
* `r8` - contains fifth argument of a system call handler;
* `r9` - contains sixth argument of a system call handler;
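The other general purpose registers (`rbp`, `rbx`, and `r12` through `r15`) are callee-preserved in the [C ABI](http://www.x86-64.org/documentation/abi.pdf). So we push the flags register onto the stack, followed by the user code segment, the return address to user space, the system call number, the first three arguments, the dummy `-ENOSYS` error code for the non-implemented system call, and the remaining arguments.
In the next step we check `_TIF_WORK_SYSCALL_ENTRY` in the current `thread_info`: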
```assembly
testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz tracesys
```
The `_TIF_WORK_SYSCALL_ENTRY` macro is defined in the [arch/x86/include/asm/thread_info.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/thread_info.h) header file and provides the set of thread information flags related to system call tracing:
```C
#define _TIF_WORK_SYSCALL_ENTRY \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
_TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
_TIF_NOHZ)
```
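We will not consider debugging and tracing in this chapter; they will be covered in a separate chapter devoted to debugging and tracing techniques in the Linux kernel. After the `tracesys` check, the next label is `entry_SYSCALL_64_fastpath`. In `entry_SYSCALL_64_fastpath` we check `__SYSCALL_MASK`, which is defined in the [arch/x86/include/asm/unistd.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/unistd.h) header file: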
```C
# ifdef CONFIG_X86_X32_ABI
# define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
# else
# define __SYSCALL_MASK (~0)
# endif
```
where `__X32_SYSCALL_BIT` is:
```C
#define __X32_SYSCALL_BIT 0x40000000
```
As we can see, `__SYSCALL_MASK` depends on the `CONFIG_X86_X32_ABI` kernel configuration option and represents the mask for the 32-bit [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) in the 64-bit kernel.
So we check the value of `__SYSCALL_MASK`. If `CONFIG_X86_X32_ABI` is disabled, we compare the value of the `rax` register to the maximum syscall number (`__NR_syscall_max`). If `CONFIG_X86_X32_ABI` is enabled, we first mask the `eax` register to clear `__X32_SYSCALL_BIT` and then do the same comparison:
```assembly
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
```
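After this we check the result of the last comparison with the `ja` instruction, which jumps if both the `CF` and `ZF` flags are 0, that is, if the system call number is above the maximum: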
```assembly
ja 1f
```
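If we have a valid system call number, we move the fourth argument from `r10` to `rcx` to conform to the [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf) and execute the `call` instruction with the address of the system call handler: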
```assembly
movq %r10, %rcx
call *sys_call_table(, %rax, 8)
```
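Note that `sys_call_table` is the array we saw earlier in this part. As we already know, the `rax` general purpose register contains the number of the system call and each element of `sys_call_table` is 8 bytes. So we use the `*sys_call_table(, %rax, 8)` notation to find the offset of the given system call handler in `sys_call_table`.
That's all. We have done all the required preparation and the system call handler for the given system call is called, for example `sys_read`, `sys_write`, or another system call handler defined with the `SYSCALL_DEFINE[N]` macro in the Linux kernel code.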
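The number check and the table offset calculation described in this section can be written out in C for clarity. The following sketch is not the kernel's implementation: the function names are hypothetical, `TOY_NR_SYSCALL_MAX` reuses the value `322` quoted in this part, and the `x32_abi_enabled` parameter stands in for the compile-time `CONFIG_X86_X32_ABI` option. The comparison is unsigned, which matches the behavior of the `ja` instruction.

```c
#include <stdint.h>

#define TOY_NR_SYSCALL_MAX  322u
#define TOY_X32_SYSCALL_BIT 0x40000000u

/*
 * Returns 1 and stores the table index when the number is valid,
 * returns 0 when the fast path would jump away (the "ja 1f" case).
 */
int toy_check_syscall_nr(uint64_t rax, int x32_abi_enabled, uint64_t *index)
{
    if (x32_abi_enabled) {
        /* andl $__SYSCALL_MASK, %eax ; cmpl $__NR_syscall_max, %eax */
        uint32_t eax = (uint32_t)rax & ~TOY_X32_SYSCALL_BIT;

        if (eax > TOY_NR_SYSCALL_MAX)
            return 0;
        *index = eax;
    } else {
        /* cmpq $__NR_syscall_max, %rax */
        if (rax > TOY_NR_SYSCALL_MAX)
            return 0;
        *index = rax;
    }
    return 1;
}

/* Byte offset used by *sys_call_table(, %rax, 8): 8 bytes per entry. */
uint64_t toy_table_offset(uint64_t index)
{
    return index * 8;
}
```

Because the comparison is unsigned, a "negative" value in `rax` is treated as a very large number and rejected, so the table is never indexed out of bounds.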
Exit from a system call
--------------------------------------------------------------------------------
After a system call handler finishes its work, we return to [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S), right after the place where we called the system call handler:
```assembly
call *sys_call_table(, %rax, 8)
```
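The next step after we return from a system call handler is to save the handler's return value on the stack. A system call handler returns its result to the user program in the general purpose register `rax`, so after the handler finishes its work we store this value on the stack: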
```assembly
movq %rax, RAX(%rsp)
```
in the `RAX` slot.
After this we can see the `LOCKDEP_SYS_EXIT` macro from [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h):
```assembly
LOCKDEP_SYS_EXIT
```
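The implementation of this macro depends on the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option, which allows us to debug locks on exit from a system call. Again, we will not consider it in this chapter and will return to it in a separate one. At the end of the `entry_SYSCALL_64` function we restore all general purpose registers except `rcx` and `r11`, because the `rcx` register must contain the return address to the application that made the system call and the `r11` register must contain the old [flags register](https://en.wikipedia.org/wiki/FLAGS_register). After all the other general purpose registers are restored, we load `rcx` with the return address, `r11` with the flags, and `rsp` with the old stack pointer: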
```assembly
RESTORE_C_REGS_EXCEPT_RCX_R11
movq RIP(%rsp), %rcx
movq EFLAGS(%rsp), %r11
movq RSP(%rsp), %rsp
USERGS_SYSRET64
```
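Finally we just call the `USERGS_SYSRET64` macro, which expands to the `swapgs` instruction, which exchanges the user `GS` and the kernel `GS` again, followed by the `sysretq` instruction, which exits from the system call handler: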
```C
#define USERGS_SYSRET64 \
swapgs; \
sysretq;
```
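Now we know what happens when a user application makes a system call. The full path of this process is as follows:
* The user application fills the general purpose registers with values (the system call number and the arguments of the system call);
* The processor switches from user mode to kernel mode and starts executing the system call entry, `entry_SYSCALL_64`;
* `entry_SYSCALL_64` switches to the kernel stack and saves some general purpose registers, the old stack, the code segment, the flags, and so on, on the stack;
* `entry_SYSCALL_64` checks the system call number in the `rax` register and, if the number is valid, looks up the system call handler in `sys_call_table` and calls it;
* If the system call number is not valid, it jumps to the exit from the system call;
* After the system call handler finishes its work, `entry_SYSCALL_64` restores the general purpose registers, the old stack, the flags, and the return address, and exits with the `sysretq` instruction.
Conclusion
--------------------------------------------------------------------------------
This is the end of the second part about the system call concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) we saw the theory of this concept from the user application's point of view. In this part we continued to dive into system calls and saw what the Linux kernel does when a system call occurs.
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com), or create an [issue](https://github.com/0xAX/linux-insides/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send a PR to [linux-insides](https://github.com/0xAX/linux-insides).**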
Links
--------------------------------------------------------------------------------
* [system call](https://en.wikipedia.org/wiki/System_call)
* [write](http://man7.org/linux/man-pages/man2/write.2.html)
* [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
* [list of cpu architectures](https://en.wikipedia.org/wiki/List_of_CPU_architectures)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt)
* [typedef](https://en.wikipedia.org/wiki/Typedef)
* [errno](http://man7.org/linux/man-pages/man3/errno.3.html)
* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [model specific register](https://en.wikipedia.org/wiki/Model-specific_register)
* [intel 2b manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
* [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf)
* [previous chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)

403
SysCall/syscall-3.md Normal file

@@ -0,0 +1,403 @@
System calls in the Linux kernel. Part 3.
================================================================================
vsyscalls and vDSO
--------------------------------------------------------------------------------
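This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes system calls in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we saw the preparation that the kernel does before it handles a system call made by a user application, and the process of handling it. In this part we will look at two concepts that are very close to the system call concept: `vsyscall` and `vdso`.
We already know what a `system call` is: a special mechanism in the Linux kernel that lets a user space application ask for a privileged task, such as writing to a file or opening a socket. As you may know, making a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task, switch context to kernel mode, and then jump back to user space after the system call handler finishes its work. The two mechanisms, `vsyscall` and `vdso`, are designed to speed up this process for certain system calls, and in this part we will try to understand how they work.
Introduction to vsyscalls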
--------------------------------------------------------------------------------
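The `vsyscall` or `virtual system call` is the first and oldest mechanism in the Linux kernel designed to accelerate the execution of certain system calls. Its working principle is simple: the Linux kernel maps into user space a page that contains some variables and the implementation of some system calls. For the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, information about this memory region can be found in the Linux kernel [documentation](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):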
```
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
```
or:
```
~$ sudo cat /proc/1/maps | grep vsyscall
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
```
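Because of this, these system calls are executed in userspace, which means that no [context switch](https://en.wikipedia.org/wiki/Context_switch) takes place. The `vsyscall` page is mapped by the `map_vsyscall` function, which is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during Linux kernel initialization from the `setup_arch` function, which is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization chapter).
Note that the implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option: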
```C
#ifdef CONFIG_X86_VSYSCALL_EMULATION
extern void map_vsyscall(void);
#else
static inline void map_vsyscall(void) {}
#endif
```
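As we can read in its help text, the `CONFIG_X86_VSYSCALL_EMULATION` configuration option means `Enable vsyscall emulation`. Why emulate `vsyscall`? The `vsyscall` is a legacy [ABI](https://en.wikipedia.org/wiki/Application_binary_interface), kept only in this form for security reasons. Virtual system calls have fixed addresses, which means that the `vsyscall` page is at the same location every time. This location is determined in the `map_vsyscall` function. Let's look at the implementation of this function: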
```C
void __init map_vsyscall(void)
{
    extern char __vsyscall_page;
    unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
    ...
    ...
    ...
}
```
At the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw the implementation of this macro in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). The `__vsyscall_page` symbol is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):
```
ffffffff81881000 D __vsyscall_page
```
The `__vsyscall_page` symbol is placed in the `.data..page_aligned, aw` [section](https://en.wikipedia.org/wiki/Memory_segmentation) and contains calls to the following three system calls:
* `gettimeofday`;
* `time`;
* `getcpu`.
Or:
```assembly
__vsyscall_page:
mov $__NR_gettimeofday, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_time, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_getcpu, %rax
syscall
ret
```
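Let's go back to the implementation of the `map_vsyscall` function and return to the implementation of `__vsyscall_page` later. After we have the physical address of `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) address for the `vsyscall` page with the `__set_fixmap` macro: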
```C
if (vsyscall_mode != NONE)
    __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
                 vsyscall_mode == NATIVE
                 ? PAGE_KERNEL_VSYSCALL
                 : PAGE_KERNEL_VVAR);
```
The `__set_fixmap` macro takes three arguments. The first is an index into the `fixed_addresses` [enum](https://en.wikipedia.org/wiki/Enumerated_type). In our case `VSYSCALL_PAGE` is the first element of the `fixed_addresses` enum for the `x86_64` architecture:
```C
enum fixed_addresses {
...
...
...
#ifdef CONFIG_X86_VSYSCALL_EMULATION
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
#endif
...
...
...
```
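It is equal to `511`. The second argument is the physical address of the page that has to be mapped and the third argument is the flags of the page. Note that the flags of `VSYSCALL_PAGE` depend on the `vsyscall_mode` variable. They are `PAGE_KERNEL_VSYSCALL` if the `vsyscall_mode` variable is `NATIVE` and `PAGE_KERNEL_VVAR` otherwise. Both macros (`PAGE_KERNEL_VSYSCALL` and `PAGE_KERNEL_VVAR`) expand to the following flags: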
```C
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
```
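These flags represent the access rights to the `vsyscall` page. Both contain the `_PAGE_USER` flag, which means that the page can be accessed by a user-mode process running at a lower privilege level. The other part of the flags depends on the value of the `vsyscall_mode` variable. The first set of flags (`__PAGE_KERNEL_VSYSCALL`) is used when `vsyscall_mode` is `NATIVE`. This means that virtual system calls will be native `syscall` instructions. Otherwise, when the `vsyscall_mode` variable is `emulate`, the vsyscall page gets `PAGE_KERNEL_VVAR`. In this case virtual system calls are turned into traps and emulated reasonably. The `vsyscall_mode` variable gets its value in the `vsyscall_setup` function: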
```C
static int __init vsyscall_setup(char *str)
{
    if (str) {
        if (!strcmp("emulate", str))
            vsyscall_mode = EMULATE;
        else if (!strcmp("native", str))
            vsyscall_mode = NATIVE;
        else if (!strcmp("none", str))
            vsyscall_mode = NONE;
        else
            return -EINVAL;
        return 0;
    }
    return -EINVAL;
}
```
This function is called during the early parsing of kernel parameters:
```C
early_param("vsyscall", vsyscall_setup);
```
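You can read more about the `early_param` macro in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the chapter that describes the Linux kernel initialization process.
At the end of the `map_vsyscall` function we just check, with the [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) macro, that the virtual address of the `vsyscall` page is equal to the value of `VSYSCALL_ADDR`: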
```C
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
(unsigned long)VSYSCALL_ADDR);
```
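The value `511` and the `BUILD_BUG_ON` check can be reproduced with plain arithmetic. The following is only a userspace sketch: the constants and the `FIXADDR_TOP` formula are written down from memory of the x86_64 kernel headers of that period and should be treated as assumptions, and the helper names are made up for the illustration.

```C
#include <stdint.h>

/* Assumed values, recalled from x86_64 kernel headers of that era. */
#define PAGE_SHIFT     12
#define PAGE_SIZE      (UINT64_C(1) << PAGE_SHIFT)
#define PMD_SHIFT      21
#define VSYSCALL_ADDR  UINT64_C(0xffffffffff600000)   /* -10UL << 20 */

/* Round x up to the next multiple of y, where y is a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t y)
{
    return ((x - 1) | (y - 1)) + 1;
}

/* Last page of the 2 MB region that holds the vsyscall page. */
static uint64_t fixaddr_top(void)
{
    return round_up_pow2(VSYSCALL_ADDR + PAGE_SIZE,
                         UINT64_C(1) << PMD_SHIFT) - PAGE_SIZE;
}

/* Index of the vsyscall page in the fix-mapped area. */
static uint64_t vsyscall_page_index(void)
{
    return (fixaddr_top() - VSYSCALL_ADDR) >> PAGE_SHIFT;
}

/* Hypothetical counterpart of __fix_to_virt: index back to virtual address. */
static uint64_t fix_to_virt(uint64_t idx)
{
    return fixaddr_top() - (idx << PAGE_SHIFT);
}
```

With these assumptions the index is `511` and converting it back gives `VSYSCALL_ADDR` again, which is the equality that `BUILD_BUG_ON` enforces at build time.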
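That's all, the `vsyscall` page is set up. The result of all of the above is the following: if we pass the `vsyscall=native` parameter on the kernel command line, virtual system calls are handled as native `syscall` instructions in [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S). The [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of the virtual system call handlers. Note that the virtual system call handlers are aligned to `1024` (or `0x400`) bytes: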
```assembly
__vsyscall_page:
mov $__NR_gettimeofday, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_time, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_getcpu, %rax
syscall
ret
```
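The start address of the `vsyscall` page is `ffffffffff600000` every time. So [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of all virtual system call handlers. You can find the definitions of these addresses in the `glibc` source code: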
```C
#define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
#define VSYSCALL_ADDR_vtime 0xffffffffff600400
#define VSYSCALL_ADDR_vgetcpu 0xffffffffff600800
```
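All virtual system call requests land at `__vsyscall_page` + the `VSYSCALL_ADDR_vsyscall_name` offset, the number of the virtual system call is put into the `rax` general purpose [register](https://en.wikipedia.org/wiki/Processor_register) and the native x86_64 `syscall` instruction is executed.
In the second case, if we pass the `vsyscall=emulate` parameter on the kernel command line, an attempt to execute a virtual system call handler causes a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception. Remember that the `vsyscall` page has the `__PAGE_KERNEL_VVAR` access rights, which forbid execution. The `do_page_fault` function is the `#PF` or page fault handler. It tries to understand the reason for the last page fault. One of the possible reasons is a virtual system call made while the `vsyscall` mode is `emulate`. In this case the `vsyscall` is handled by the `emulate_vsyscall` function, which is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file.
The `emulate_vsyscall` function gets the number of the virtual system call and checks it. If the check fails, it prints an error and sends a [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) signal: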
```C
...
...
...
vsyscall_nr = addr_to_vsyscall_nr(address);
if (vsyscall_nr < 0) {
    warn_bad_vsyscall(KERN_WARNING, regs, "misaligned vsyscall...");
    goto sigsegv;
}
...
...
...
sigsegv:
    force_sig(SIGSEGV, current);
    return true;
```
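Since the three handlers sit `0x400` bytes apart at the start of the page, the number of a virtual system call can be recovered from the faulting address alone. Below is a userspace sketch of that idea. It follows my recollection of the kernel helper rather than quoting it, so treat the details as assumptions: `VSYSCALL_BASE` is a name chosen for the example, and the sketch returns `-1` where the kernel returns an error code.

```C
#include <stdint.h>

#define VSYSCALL_BASE UINT64_C(0xffffffffff600000)

/* Returns 0, 1 or 2 (gettimeofday, time, getcpu) or -1 for any other address. */
static int addr_to_vsyscall_nr(uint64_t addr)
{
    /* Only bits 10 and 11 may differ from the page base:
     * entry points are exactly 1024 bytes apart. */
    if ((addr & ~UINT64_C(0xC00)) != VSYSCALL_BASE)
        return -1;

    int nr = (int)((addr & UINT64_C(0xC00)) >> 10);

    return nr < 3 ? nr : -1;   /* the fourth 1024-byte slot is unused */
}
```

An address such as `0xffffffffff600401` fails the first test, which corresponds to the "misaligned vsyscall" warning in the excerpt above.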
After it has checked the number of the virtual system call, it performs some further checks, for example for `access_ok` violations, and executes the system call function that corresponds to this number:
```C
switch (vsyscall_nr) {
case 0:
    ret = sys_gettimeofday(
        (struct timeval __user *)regs->di,
        (struct timezone __user *)regs->si);
    break;
...
...
...
}
```
In the end we put the result of `sys_gettimeofday` or another virtual system call handler into the `ax` general purpose register, as we did with normal system calls, restore the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and add `8` bytes to the [stack pointer](https://en.wikipedia.org/wiki/Stack_register) register. This operation emulates the `ret` instruction.
```C
    regs->ax = ret;
do_ret:
    regs->ip = caller;
    regs->sp += 8;
    return true;
```
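To see why these three assignments behave like a `ret`, here is a small userspace model. The `fake_regs` structure and the `emulate_ret` helper are hypothetical and exist only for this illustration; the stack is modelled as an array of 8-byte slots whose first slot holds the return address that the `call` instruction pushed.

```C
#include <stdint.h>

/* Hypothetical stand-in for the few register fields used here. */
struct fake_regs {
    uint64_t ip;          /* instruction pointer */
    uint64_t ax;          /* return value register */
    const uint64_t *sp;   /* stack pointer, modelled as a pointer to 8-byte slots */
};

/* Store the result and resume at the caller, as a `ret` would. */
static void emulate_ret(struct fake_regs *regs, uint64_t result)
{
    uint64_t caller = *regs->sp;   /* return address pushed by the `call` */

    regs->ax = result;
    regs->ip = caller;
    regs->sp += 1;                 /* pop one 8-byte slot */
}
```

After the call, `ip` points back into the caller and the stack pointer has moved past the return address, which is the same state a real `ret` leaves behind.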
That's all. Now let's look at the modern concept, the `vDSO`.
Introduction to vDSO
--------------------------------------------------------------------------------
As I already wrote above, `vsyscall` is an obsolete concept that has been replaced by the `vDSO`, or `virtual dynamic shared object`. The main difference between the `vsyscall` and `vDSO` mechanisms is that `vDSO` maps memory pages into each process in the form of a [shared object](https://en.wikipedia.org/wiki/Library_%28computing%29#Shared_libraries), while `vsyscall` is static in memory and has the same address every time. For the `x86_64` architecture it is called `linux-vdso.so.1`. All userspace applications are linked with this shared library via `glibc`. For example:
```
~$ ldd /bin/uname
linux-vdso.so.1 (0x00007ffe014b7000)
libc.so.6 => /lib64/libc.so.6 (0x00007fbfee2fe000)
/lib64/ld-linux-x86-64.so.2 (0x00005559aab7c000)
```
Or:
```
~$ sudo cat /proc/1/maps | grep vdso
7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0 [vdso]
```
Here we can see that the [uname](https://en.wikipedia.org/wiki/Uname) utility was linked with three libraries:
* `linux-vdso.so.1`;
* `libc.so.6`;
* `ld-linux-x86-64.so.2`.
The first provides the `vDSO` functionality, the second is the `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (you can read more about this in the part that describes [linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)). So, the `vDSO` solves the limitations of the `vsyscall`. The implementation of the `vDSO` is similar to that of `vsyscall`.
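A process can also ask where its own copy of the `vDSO` was mapped. The sketch below assumes a Linux system with `glibc`, where `getauxval(AT_SYSINFO_EHDR)` returns the base address of the `vDSO` image, or `0` if there is none. The helper names are made up for the example.

```C
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Address at which the kernel mapped the vDSO into this process, or 0. */
static unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* The vDSO is a complete ELF image, so it starts with the ELF magic bytes. */
static int looks_like_elf(unsigned long addr)
{
    return addr != 0 && memcmp((const void *)addr, "\177ELF", 4) == 0;
}
```

Printing `vdso_base()` in two different runs of the same program will usually show two different addresses, in contrast to the fixed `ffffffffff600000` of the `vsyscall` page.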
Initialization of the `vDSO` occurs in the `init_vdso` function, which is defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file. This function starts by initializing the `vDSO` images: the 64-bit image and, depending on the `CONFIG_X86_X32_ABI` kernel configuration option, the x32 image:
```C
static int __init init_vdso(void)
{
    init_vdso_image(&vdso_image_64);
#ifdef CONFIG_X86_X32_ABI
    init_vdso_image(&vdso_image_x32);
#endif
```
Both calls initialize a `vdso_image` structure. These structures are defined in two generated source code files: [arch/x86/entry/vdso/vdso-image-64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-64.c) and [arch/x86/entry/vdso/vdso-image-x32.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-x32.c). These files are generated by the [vdso2c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso2c.c) program from different source code files and represent different approaches to making a system call, such as `int 0x80`, `sysenter` and so on. The full set of images depends on the kernel configuration.
For example, for the `x86_64` Linux kernel it will contain `vdso_image_64`:
```C
#ifdef CONFIG_X86_64
extern const struct vdso_image vdso_image_64;
#endif
```
But for the `x32` ABI - `vdso_image_x32`:
```C
#ifdef CONFIG_X86_X32
extern const struct vdso_image vdso_image_x32;
#endif
```
If our kernel is configured for the `x86` architecture, or for `x86_64` with compatibility mode, we can make a system call with the `int 0x80` interrupt or with the `sysenter` instruction. If compatibility mode is enabled, we can also use the native `syscall` instruction:
```C
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_int80;
#ifdef CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_syscall;
#endif
extern const struct vdso_image vdso_image_32_sysenter;
#endif
```
As we can understand from the name of the `vdso_image` structure, it represents the image of the `vDSO` for a certain mode of system call entry. This structure contains the size in bytes of the `vDSO` area, which is always a multiple of `PAGE_SIZE` (`4096` bytes), a pointer to the text mapping, the location and length of the `alternatives` (sets of instructions with better alternatives for a certain type of processor) and so on. For example, `vdso_image_64` looks like this:
```C
const struct vdso_image vdso_image_64 = {
    .data = raw_data,
    .size = 8192,
    .text_mapping = {
        .name = "[vdso]",
        .pages = pages,
    },
    .alt = 3145,
    .alt_len = 26,
    .sym_vvar_start = -8192,
    .sym_vvar_page = -8192,
    .sym_hpet_page = -4096,
};
```
Here `raw_data` contains the raw binary code of the 64-bit `vDSO` system calls, which occupies `2` pages:
```C
static struct page *pages[2];
```
or 8 kilobytes.
The `init_vdso_image` function is defined in the same source code file and just initializes `vdso_image.text_mapping.pages`. It first calculates the number of pages and then initializes each `vdso_image.text_mapping.pages[number_of_page]` with the `virt_to_page` macro, which converts a given address to its `page` structure:
```C
void __init init_vdso_image(const struct vdso_image *image)
{
    int i;
    int npages = (image->size) / PAGE_SIZE;

    for (i = 0; i < npages; i++)
        image->text_mapping.pages[i] =
            virt_to_page(image->data + i*PAGE_SIZE);
    ...
    ...
    ...
}
```
The `init_vdso` function is passed to the `subsys_initcall` macro, which adds it to the `initcalls` list. All functions from this list are called by the `do_initcalls` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
```C
subsys_initcall(init_vdso);
```
OK, we have just seen the initialization of the `vDSO` and of the `page` structures that describe the memory pages containing the `vDSO` system calls. But where are these pages mapped? They are mapped by the kernel when it loads a binary into memory. The Linux kernel calls the `arch_setup_additional_pages` function from the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file, which checks that the `vDSO` is enabled for `x86_64` and calls the `map_vdso` function:
```C
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
    if (!vdso64_enabled)
        return 0;
    return map_vdso(&vdso_image_64, true);
}
```
The `map_vdso` function is defined in the same source code file and maps pages for the `vDSO` and for the shared `vDSO` variables. That's all. The main difference between the `vsyscall` and `vDSO` concepts is that `vsyscall` has a static address of `ffffffffff600000` and implements `3` system calls, whereas the `vDSO` is loaded dynamically and implements four system calls:
* `__vdso_clock_gettime`;
* `__vdso_getcpu`;
* `__vdso_gettimeofday`;
* `__vdso_time`.
That's all.
Conclusion
--------------------------------------------------------------------------------
This is the end of the third part about the system call concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we discussed the preparation that the Linux kernel performs before a system call is handled and the implementation of the `exit` path from a system call handler. In this part we continued to dive into the material related to the system call concept and learned two new concepts that are very similar to the system call: the `vsyscall` and the `vDSO`.
After these three parts we know almost everything related to system calls. We know what a system call is and why user applications need them. We also know what occurs when a user application makes a system call and how the kernel handles system calls.
The next part will be the last part in this [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs a program.
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [x86_64 memory map](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [context switching](https://en.wikipedia.org/wiki/Context_switch)
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
* [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space)
* [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation)
* [enum](https://en.wikipedia.org/wiki/Enumerated_type)
* [fix-mapped addresses](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [Page fault](https://en.wikipedia.org/wiki/Page_fault)
* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
* [uname](https://en.wikipedia.org/wiki/Uname)
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html)

430
SysCall/syscall-4.md Normal file

@@ -0,0 +1,430 @@
System calls in the Linux kernel. Part 4.
================================================================================
How does the Linux kernel run a program
--------------------------------------------------------------------------------
This is the fourth part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel and, as I wrote in the conclusion of the [previous](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) part, it will be the last in this chapter. In the previous part we stopped at two new concepts:
* `vsyscall`;
* `vDSO`;
that are related and very similar to the system call concept.
This part is the last one in this chapter and, as you can understand from its title, we will see what occurs in the Linux kernel when we run our programs. So, let's start.
How do we launch our programs?
--------------------------------------------------------------------------------
There are many different ways to launch an application from a user's perspective. For example we can run a program from the [shell](https://en.wikipedia.org/wiki/Unix_shell) or double-click on the application icon. It does not matter. The Linux kernel handles the application launch regardless of how we launch the application.
In this part we will consider the case where we launch an application from the shell. As you know, the standard way to do this is the following: we launch a [terminal emulator](https://en.wikipedia.org/wiki/Terminal_emulator) application, write the name of the program and optionally pass arguments to it, for example:
![ls shell](http://s14.postimg.org/d6jgidc7l/Screenshot_from_2015_09_07_17_31_55.png)
Let's consider what occurs when we launch an application from the shell: what the shell does when we write the program name, what the Linux kernel does, and so on. But before we start to consider these interesting things, I want to warn you that this book is about the Linux kernel. That's why in this part we will mostly look at things related to the Linux kernel internals. We will not consider in detail what the shell does and we will not consider complex cases, for example subshells.
My default shell is [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29), so I will consider how the bash shell launches a program. So let's start. The `bash` shell, like any program written in the [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) programming language, starts from the [main](https://en.wikipedia.org/wiki/Entry_point) function. If you look at the source code of the `bash` shell, you will find the `main` function in the [shell.c](https://github.com/bminor/bash/blob/master/shell.c#L357) source code file. This function does many different things before the main loop of `bash` starts to work. For example this function:
* checks and tries to open `/dev/tty`;
* checks whether the shell is running in debug mode;
* parses command line arguments;
* reads shell environment;
* loads `.bashrc`, `.profile` and other configuration files;
* and many many more.
After all of these operations we can see the call of the `reader_loop` function. This function is defined in the [eval.c](https://github.com/bminor/bash/blob/master/eval.c#L67) source code file and represents the main loop, or in other words it reads and executes commands. Once the `reader_loop` function has made all its checks and read the given program name and arguments, it calls the `execute_command` function from the [execute_cmd.c](https://github.com/bminor/bash/blob/master/execute_cmd.c#L378) source code file. The `execute_command` function, through the following chain of function calls:
```
execute_command
--> execute_command_internal
----> execute_simple_command
------> execute_disk_command
--------> shell_execve
```
performs various checks, such as whether we need to start a `subshell`, whether the command is a `bash` builtin, and so on. As I already wrote above, we will not consider the details of things that are not related to the Linux kernel. At the end of this process, the `shell_execve` function calls the `execve` system call:
```C
execve (command, args, env);
```
The `execve` system call has the following signature:
```
int execve(const char *filename, char *const argv [], char *const envp[]);
```
and executes the program with the given filename, arguments and [environment variables](https://en.wikipedia.org/wiki/Environment_variable). In our case this is the first system call that is made, for example:
```
$ strace ls
execve("/bin/ls", ["ls"], [/* 62 vars */]) = 0
$ strace echo
execve("/bin/echo", ["echo"], [/* 62 vars */]) = 0
$ strace uname
execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0
```
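What the shell does at this point can be imitated in a few lines of `C`. This is only a sketch of the usual fork-then-exec pattern, not the code of `bash`: the `run_program` helper is a name chosen for the example, it passes an empty environment to keep things short, and it assumes a POSIX system.

```C
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `path` with the given argv in a child process.
 * Returns the exit status of the program, or -1 on failure. */
static int run_program(const char *path, char *const argv[])
{
    char *const envp[] = { NULL };   /* empty environment keeps the sketch small */
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(path, argv, envp);    /* returns only if the call failed */
        _exit(127);
    }

    int status;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Running such a program under `strace -f` shows the `execve` call made by the child, just like in the traces above.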
So, a user application (`bash` in our case) makes the system call and, as we already know, the next step is the Linux kernel.
execve system call
--------------------------------------------------------------------------------
In the second [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) of this chapter we saw the preparation that happens before a system call is made by a user application and after a system call handler finishes its work. We stopped at the call of the `execve` system call in the previous paragraph. This system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and, as we already know, it takes three arguments:
```
SYSCALL_DEFINE3(execve,
                const char __user *, filename,
                const char __user *const __user *, argv,
                const char __user *const __user *, envp)
{
    return do_execve(getname(filename), argv, envp);
}
```
The implementation of `execve` is pretty simple here: as we can see, it just returns the result of the `do_execve` function. The `do_execve` function is defined in the same source code file and does the following things:
* initializes two pointers to the userspace data with the given arguments and environment variables;
* returns the result of `do_execveat_common`.
We can see its implementation:
```C
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
```
The `do_execveat_common` function does the main work: it executes a new program. This function takes a similar set of arguments, but as you can see it takes five arguments instead of three. The first argument is the file descriptor that represents the directory with our application. In our case `AT_FDCWD` means that the given pathname is interpreted relative to the current working directory of the calling process. The fifth argument is the flags. In our case we passed `0` to `do_execveat_common`. The flags are checked in a later step, so we will see them later.
First of all the `do_execveat_common` function checks the `filename` pointer and returns the error if it is an error pointer. After this we check the flags of the current process to make sure that the limit of running processes is not exceeded:
```C
if (IS_ERR(filename))
    return PTR_ERR(filename);

if ((current->flags & PF_NPROC_EXCEEDED) &&
    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
    retval = -EAGAIN;
    goto out_ret;
}

current->flags &= ~PF_NPROC_EXCEEDED;
```
If these two checks are successful, we unset the `PF_NPROC_EXCEEDED` flag in the flags of the current process to prevent the `execve` from failing. In the next step we call the `unshare_files` function, which is defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and unshares the files of the current task, and check the result of this function:
```C
retval = unshare_files(&displaced);
if (retval)
    goto out_ret;
```
We need to call this function to eliminate a potential leak of the execve'd binary's [file descriptor](https://en.wikipedia.org/wiki/File_descriptor). In the next step we start the preparation of `bprm`, which is represented by the `struct linux_binprm` structure (defined in the [include/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/linux/binfmts.h) header file). The `linux_binprm` structure is used to hold the arguments that are used when loading binaries. For example it contains the `vma` field, which has the `vm_area_struct` type and represents a single memory area over a contiguous interval in the address space where our application will be loaded, the `mm` field, which is the memory descriptor of the binary, a pointer to the top of memory and many other fields.
First of all we allocate memory for this structure with the `kzalloc` function and check the result of the allocation:
```C
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
    goto out_files;
```
After this we start to prepare the `binprm` credentials with the call of the `prepare_bprm_creds` function:
```C
retval = prepare_bprm_creds(bprm);
if (retval)
    goto out_free;

check_unsafe_exec(bprm);
current->in_execve = 1;
```
Initializing the `binprm` credentials means, in other words, initializing the `cred` structure that is stored inside the `linux_binprm` structure. The `cred` structure contains the security context of a task, for example the [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID) of the task, the real [gid](https://en.wikipedia.org/wiki/Group_identifier) of the task, the `uid` and `gid` for [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) operations etc. In the next step, having prepared the `bprm` credentials, we check that we can now safely execute a program by calling the `check_unsafe_exec` function, and set the current process to the `in_execve` state.
After all of these operations we call the `do_open_execat` function. It checks the flags that we passed to the `do_execveat_common` function (remember that we have `0` in `flags`), searches for and opens the executable file on disk, checks that we are not loading the binary from a `noexec` mount point (we need to avoid executing binaries from filesystems that do not contain executable binaries, like [proc](https://en.wikipedia.org/wiki/Procfs) or [sysfs](https://en.wikipedia.org/wiki/Sysfs)), initializes a `file` structure and returns a pointer to this structure. After this we can see the call of `sched_exec`:
```C
file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
    goto out_unmark;

sched_exec();
```
The `sched_exec` function is used to determine the least loaded processor that can execute the new program and to migrate the current process to it.
After this we need to check the [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) of the given executable binary. We check whether the name of our binary file starts with the `/` symbol, or whether the path of the given executable binary is interpreted relative to the current working directory of the calling process, in other words whether the file descriptor is `AT_FDCWD` (see above).
If one of these checks is successful, we set the binary parameter filename:
```C
bprm->file = file;

if (fd == AT_FDCWD || filename->name[0] == '/') {
    bprm->filename = filename->name;
}
```
Otherwise, we set the binary parameter filename to `/dev/fd/%d` if the filename is empty, or to `/dev/fd/%d/%s` if it is not, which means that we will execute the file to which the file descriptor refers:
```C
} else {
    if (filename->name[0] == '\0')
        pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
    else
        pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
                            fd, filename->name);
    if (!pathbuf) {
        retval = -ENOMEM;
        goto out_unmark;
    }
    bprm->filename = pathbuf;
}

bprm->interp = bprm->filename;
```
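The three cases for `bprm->filename` can be tried out in userspace. The following sketch is not the kernel code: `exec_display_name` is a hypothetical helper that writes the resulting name into a caller-supplied buffer with `snprintf` instead of allocating it with `kasprintf`.

```C
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>    /* AT_FDCWD */
#include <stdio.h>

/* Write into buf the name that would end up in bprm->filename.
 * Returns the value of snprintf (length of the full name). */
static int exec_display_name(int fd, const char *name, char *buf, size_t len)
{
    /* Absolute path, or path relative to the current directory: use as is. */
    if (fd == AT_FDCWD || name[0] == '/')
        return snprintf(buf, len, "%s", name);

    /* Empty name: execute the file that the descriptor itself refers to. */
    if (name[0] == '\0')
        return snprintf(buf, len, "/dev/fd/%d", fd);

    /* Otherwise the name is relative to the directory behind the descriptor. */
    return snprintf(buf, len, "/dev/fd/%d/%s", fd, name);
}
```

For example, descriptor `3` with an empty name gives `/dev/fd/3`, and descriptor `3` with the name `ls` gives `/dev/fd/3/ls`.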
Note that we set not only `bprm->filename` but also `bprm->interp`, which will contain the name of the program interpreter. For now we just write the same name there, but later it will be updated with the real name of the program interpreter, depending on the binary format of the program. As you read above, we have already prepared `cred` for the `linux_binprm`. The next step is the initialization of the other fields of the `linux_binprm`. First of all we call the `bprm_mm_init` function and pass `bprm` to it:
```C
retval = bprm_mm_init(bprm);
if (retval)
    goto out_unmark;
```
The `bprm_mm_init` function is defined in the same source code file and, as we can understand from its name, it initializes the memory descriptor, or in other words the `mm_struct` structure. This structure is defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h) header file and represents the address space of a process. We will not consider the implementation of the `bprm_mm_init` function because we do not yet know many important things related to the Linux kernel memory manager. We just need to know that this function initializes `mm_struct` and populates it with a temporary stack `vm_area_struct`.
After this we calculate the number of command line arguments that were passed to our executable binary and the number of environment variables, and store them in `bprm->argc` and `bprm->envc` respectively:
```C
bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
    goto out;

bprm->envc = count(envp, MAX_ARG_STRINGS);
if ((retval = bprm->envc) < 0)
    goto out;
```
As you can see, we do these operations with the help of the `count` function, which is defined in the [same](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and calculates the number of strings in the `argv` array. The `MAX_ARG_STRINGS` macro is defined in the [include/uapi/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h) header file and, as we can understand from the macro's name, it represents the maximum number of strings that can be passed to the `execve` system call. The value of `MAX_ARG_STRINGS` is:
```C
#define MAX_ARG_STRINGS 0x7FFFFFFF
```
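The counting itself is easy to imitate in userspace. The sketch below is an assumption about how such a helper behaves, not the kernel's `count`: it works on an ordinary `NULL`-terminated array instead of a userspace pointer, and `count_strings` is a name invented for the example.

```C
#include <errno.h>
#include <stddef.h>

/* Count the strings in a NULL-terminated array.
 * Returns the count, or -E2BIG if there are more than `max` strings. */
static int count_strings(char *const argv[], int max)
{
    int i = 0;

    if (argv == NULL)
        return 0;

    for (;;) {
        if (argv[i] == NULL)
            break;
        if (i >= max)
            return -E2BIG;
        ++i;
    }
    return i;
}
```

A negative result plays the same role as in the excerpt above, where a negative `bprm->argc` or `bprm->envc` makes the function jump to `out`.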
After we have calculated the number of command line arguments and environment variables, we call the `prepare_binprm` function. We already called a function with a similar name earlier: `prepare_bprm_creds`, which, as we remember, initializes the `cred` structure in the `linux_binprm`. Now the `prepare_binprm` function:
```C
retval = prepare_binprm(bprm);
if (retval < 0)
    goto out;
```
fills the `linux_binprm` structure with the `uid` from the [inode](https://en.wikipedia.org/wiki/Inode) and reads `128` bytes from the binary executable file. We read only the first `128` bytes of the executable file because we need to check the type of our executable. We will read the rest of the executable file in a later step. After the preparation of the `linux_binprm` structure we copy the filename of the executable binary file, the command line arguments and the environment variables to the `linux_binprm` with calls to the `copy_strings_kernel` and `copy_strings` functions:
```C
retval = copy_strings_kernel(1, &bprm->filename, bprm);
if (retval < 0)
    goto out;

retval = copy_strings(bprm->envc, envp, bprm);
if (retval < 0)
    goto out;

retval = copy_strings(bprm->argc, argv, bprm);
if (retval < 0)
    goto out;
```
And we save the pointer to the top of the new program's stack, which we set up in the `bprm_mm_init` function:
```C
bprm->exec = bprm->p;
```
The top of the stack will contain the program filename, and we store this filename in the `exec` field of the `linux_binprm` structure.
Now that we have filled the `linux_binprm` structure, we call the `exec_binprm` function:
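The idea of copying strings to the top of a stack that grows towards lower addresses can be sketched in user space. This is only an illustration under simplifying assumptions: `struct fake_bprm`, `fake_bprm_init` and `push_string` are hypothetical names, the "stack" is a plain byte array, and `p` is an offset rather than a virtual address as in the kernel:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the part of linux_binprm that tracks the stack. */
struct fake_bprm {
    char   stack[256]; /* pretend this is the new program's stack area */
    size_t p;          /* offset of the current top; moves towards 0   */
    size_t exec;       /* offset of the copied program filename        */
};

static void fake_bprm_init(struct fake_bprm *b)
{
    memset(b->stack, 0, sizeof(b->stack));
    b->p = sizeof(b->stack);
    b->exec = 0;
}

/* Copy one string (including its NUL byte) just below the current top.
 * Returns 0 on success or -1 if there is not enough room left. */
static int push_string(struct fake_bprm *b, const char *s)
{
    size_t len = strlen(s) + 1;

    if (len > b->p)
        return -1;

    b->p -= len;
    memcpy(&b->stack[b->p], s, len);
    return 0;
}
```

In this sketch, pushing the filename first and then saving `b.exec = b.p` keeps an offset to the filename that stays valid while later strings are pushed below it.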
```C
retval = exec_binprm(bprm);
if (retval < 0)
	goto out;
```
First of all, in `exec_binprm` we store the [pid](https://en.wikipedia.org/wiki/Process_identifier) of the current task and its `pid` as seen from the [namespace](https://en.wikipedia.org/wiki/Cgroups) of its parent:
```C
old_pid = current->pid;
rcu_read_lock();
old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
rcu_read_unlock();
```
and then call the `search_binary_handler` function:
```C
search_binary_handler(bprm);
```
This function goes through the list of handlers for the different binary formats. Currently the Linux kernel supports the following binary formats:
* `binfmt_script` - support for interpreted scripts that start with the [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29) line;
* `binfmt_misc` - support for different binary formats, according to the runtime configuration of the Linux kernel;
* `binfmt_elf` - support for the [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format;
* `binfmt_aout` - support for the [a.out](https://en.wikipedia.org/wiki/A.out) format;
* `binfmt_flat` - support for the [flat](https://en.wikipedia.org/wiki/Binary_file#Structure) format;
* `binfmt_elf_fdpic` - support for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF) binaries;
* `binfmt_em86` - support for Intel [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) binaries running on [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha) machines.
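The idea of trying a list of format handlers one after another can be sketched in user space. This is a simplified illustration and not the kernel code: `struct fake_binfmt`, `load_script`, `load_elf` and `fake_search_binary_handler` are hypothetical names, only two formats are modelled, and each handler merely looks at the first bytes of the file:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified handler type: load_binary returns 0 if it
 * accepts the file header, or -ENOEXEC if the format is not recognized. */
struct fake_binfmt {
    const char *name;
    int (*load_binary)(const unsigned char *buf, size_t len);
};

static int load_script(const unsigned char *buf, size_t len)
{
    return (len >= 2 && buf[0] == '#' && buf[1] == '!') ? 0 : -ENOEXEC;
}

static int load_elf(const unsigned char *buf, size_t len)
{
    return (len >= 4 && memcmp(buf, "\177ELF", 4) == 0) ? 0 : -ENOEXEC;
}

static const struct fake_binfmt formats[] = {
    { "script", load_script },
    { "elf",    load_elf    },
};

/* Try each handler in turn and store the name of the first one that
 * accepts the header in *who. Returns 0 or -ENOEXEC. */
static int fake_search_binary_handler(const unsigned char *buf, size_t len,
                                      const char **who)
{
    for (size_t i = 0; i < sizeof(formats) / sizeof(formats[0]); i++) {
        if (formats[i].load_binary(buf, len) == 0) {
            *who = formats[i].name;
            return 0;
        }
    }
    return -ENOEXEC;
}
```

A header that starts with `#!` is claimed by the `script` handler in this sketch, one that starts with `\177ELF` by the `elf` handler, and anything else is rejected with `-ENOEXEC`.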
So, `search_binary_handler` tries to call the `load_binary` function of each handler and passes the `linux_binprm` to it. If the binary handler supports the given executable file format, it starts to prepare the executable binary for execution:
```C
int search_binary_handler(struct linux_binprm *bprm)
{
	...
	...
	...
	list_for_each_entry(fmt, &formats, lh) {
		retval = fmt->load_binary(bprm);
		if (retval < 0 && !bprm->mm) {
			force_sigsegv(SIGSEGV, current);
			return retval;
		}
	}

	return retval;
}
```
For example, the `load_binary` for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) checks the magic number (each `elf` binary file contains a magic number in its header) in the `linux_binprm` buffer (remember that we read the first `128` bytes from the executable binary file) and exits if the file is not an `elf` binary:
```C
static int load_elf_binary(struct linux_binprm *bprm)
{
	...
	...
	...
	loc->elf_ex = *((struct elfhdr *)bprm->buf);

	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
		goto out;
```
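A similar validation can be tried in user space with the definitions from `<elf.h>`. The sketch below is an illustration under stated assumptions rather than the kernel code: `check_elf_header` is a hypothetical name, it handles only 64-bit headers, and besides the magic number it also checks the file type and the target machine, which is the kind of check `load_elf_binary` performs as well:

```c
#include <elf.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: validate the start of a file as an x86_64 ELF
 * executable or shared object. Returns 0 or -ENOEXEC. */
static int check_elf_header(const unsigned char *buf, size_t len)
{
    Elf64_Ehdr ehdr;

    if (len < sizeof(ehdr))
        return -ENOEXEC;
    memcpy(&ehdr, buf, sizeof(ehdr));

    /* Magic number: 0x7f 'E' 'L' 'F' */
    if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
        return -ENOEXEC;

    /* Only executables and shared objects can be run */
    if (ehdr.e_type != ET_EXEC && ehdr.e_type != ET_DYN)
        return -ENOEXEC;

    /* This sketch accepts only x86_64 binaries */
    if (ehdr.e_machine != EM_X86_64)
        return -ENOEXEC;

    return 0;
}
```

Since `Elf64_Ehdr` is 64 bytes long, the first `128` bytes of a file are enough to feed this function.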
If the given executable file is in the `elf` format, `load_elf_binary` continues to execute. It does many different things to prepare the executable file for execution. For example, it checks the architecture and type of the executable file:
```C
if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
	goto out;
if (!elf_check_arch(&loc->elf_ex))
	goto out;
```
and exits if the architecture is wrong or the file is neither an executable nor a shared object. Then it tries to load the `program header table`:
```C
elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
if (!elf_phdata)
	goto out;
```
that describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). It reads the `program interpreter` and the libraries that are linked with our executable binary file from disk and loads them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and the [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many other things to prepare the executable file for execution.
At the end of the execution of `load_elf_binary` we call the `start_thread` function and pass three arguments to it:
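In the program header table, the request for a program interpreter shows up as an entry of type `PT_INTERP`. Looking for it can be sketched in user space as follows. This is a minimal illustration that assumes a 64-bit program header table is already in memory; `find_interp` is a hypothetical name and not a kernel function:

```c
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: return the first PT_INTERP entry in a program
 * header table, or NULL if the binary does not request an interpreter
 * (for example, a statically linked executable). */
static const Elf64_Phdr *find_interp(const Elf64_Phdr *phdrs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (phdrs[i].p_type == PT_INTERP)
            return &phdrs[i];
    }
    return NULL;
}
```

The `p_offset` and `p_filesz` fields of the returned entry tell where in the file the interpreter path is stored and how long it is.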
```C
	start_thread(regs, elf_entry, bprm->p);
	retval = 0;
out:
	kfree(loc);
out_ret:
	return retval;
```
These arguments are:
* Set of [registers](https://en.wikipedia.org/wiki/Processor_register) for the new task;
* Address of the entry point of the new task;
* Address of the top of the stack for the new task.
Judging by the function's name, we might think that it starts a new thread, but it does not. The `start_thread` function just prepares the new task's registers to be ready to run. Let's look at the implementation of this function:
```C
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	start_thread_common(regs, new_ip, new_sp,
			    __USER_CS, __USER_DS, 0);
}
```
As we can see, the `start_thread` function just calls the `start_thread_common` function, which does all the work for us:
```C
static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
		    unsigned long new_sp,
		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
	loadsegment(fs, 0);
	loadsegment(es, _ds);
	loadsegment(ds, _ds);
	load_gs_index(0);
	regs->ip	= new_ip;
	regs->sp	= new_sp;
	regs->cs	= _cs;
	regs->ss	= _ss;
	regs->flags	= X86_EFLAGS_IF;
	force_iret();
}
```
The `start_thread_common` function fills the `fs` segment register with zero and `es` and `ds` with the value of the `_ds` argument, which is `0` when it is called from `start_thread`. After this we set new values for the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter), the stack pointer, the `cs` and `ss` segment registers and the flags. At the end of the `start_thread_common` function we can see the `force_iret` macro, which forces the return from the system call to go through the `iret` instruction. Ok, we have prepared the new thread to run in userspace, so we can return from `exec_binprm` and we are in `do_execveat_common` again. After `exec_binprm` finishes its execution, we release the memory for the structures that were allocated before and return.
After we return from the `execve` system call handler, the execution of our program starts. This is possible because all the context-related information is already configured for this purpose. As we saw, the `execve` system call does not return control to the calling process on success; instead, the code, data and other segments of the caller are overwritten with the segments of the new program. The exit from our application will be implemented through the `exit` system call.
That's all. From this point our program will be executed.
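The register setup can be mimicked with an ordinary structure. The following is only a sketch: `struct fake_regs` and `fake_start_thread` are hypothetical names, the segment register loads and `force_iret` are left out, and the selector and flag values (`0x33`, `0x2b`, `0x200`) are written out as assumptions about `x86_64` Linux rather than taken from kernel headers:

```c
#include <stdint.h>

/* Assumed x86_64 Linux values; treat them as illustrative constants. */
#define FAKE_USER_CS   0x33u   /* user code segment selector    */
#define FAKE_USER_DS   0x2bu   /* user data segment selector    */
#define FAKE_EFLAGS_IF 0x200u  /* interrupt enable flag (bit 9) */

/* Hypothetical, trimmed-down stand-in for struct pt_regs. */
struct fake_regs {
    uint64_t ip, sp, cs, ss, flags;
};

/* Fill the saved register frame so that a later return to user space
 * would continue at new_ip with the stack pointer set to new_sp. */
static void fake_start_thread(struct fake_regs *regs,
                              uint64_t new_ip, uint64_t new_sp)
{
    regs->ip    = new_ip;
    regs->sp    = new_sp;
    regs->cs    = FAKE_USER_CS;
    regs->ss    = FAKE_USER_DS;
    regs->flags = FAKE_EFLAGS_IF;
}
```

Nothing starts running inside this function; it only records the state that will be restored when the kernel returns to user space, which matches the point that `start_thread` does not start a thread by itself.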
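The fact that a successful `execve` never returns to its caller can be observed from user space. The sketch below uses the standard `execve` wrapper from `<unistd.h>`; `try_exec` is a hypothetical helper name, and passing an empty environment is a simplification made for this example:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: try to replace the current process image with
 * the program at `path`. On success execve() does not return, so
 * reaching the line after it always means failure; in that case the
 * errno value set by execve() is returned. */
static int try_exec(const char *path)
{
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };

    execve(path, argv, envp);
    return errno;
}
```

Calling `try_exec` with a path that does not exist returns `ENOENT`, while calling it with a valid executable replaces the calling process, so no code after the call is ever executed.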
Conclusion
--------------------------------------------------------------------------------
This is the end of the fourth and last part about the system call concept in the Linux kernel. We saw almost everything related to the `system call` concept in these four parts. We started by understanding the `system call` concept: we learned what it is and why user applications need it. Next we saw how Linux handles a system call from a user application. We met two concepts similar to the `system call` concept, `vsyscall` and `vDSO`, and finally we saw how the Linux kernel runs a user program.
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [System call](https://en.wikipedia.org/wiki/System_call)
* [shell](https://en.wikipedia.org/wiki/Unix_shell)
* [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29)
* [entry point](https://en.wikipedia.org/wiki/Entry_point)
* [C](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [environment variables](https://en.wikipedia.org/wiki/Environment_variable)
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
* [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID)
* [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [inode](https://en.wikipedia.org/wiki/Inode)
* [pid](https://en.wikipedia.org/wiki/Process_identifier)
* [namespace](https://en.wikipedia.org/wiki/Cgroups)
* [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29)
* [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
* [a.out](https://en.wikipedia.org/wiki/A.out)
* [flat](https://en.wikipedia.org/wiki/Binary_file#Structure)
* [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha)
* [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF)
* [segments](https://en.wikipedia.org/wiki/Memory_segmentation)
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html)