mirror of https://github.com/MintCN/linux-insides-zh.git (synced 2026-05-05 00:40:08 +08:00)
Merge pull request #265 from hust-open-atom-club/gitbook_link_fixes
resolve Gitbook links
@@ -3,7 +3,7 @@
 The first steps of kernel booting
 --------------------------------------------------------------------------------

-In the [previous part](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-1.html) we started to dive into the kernel setup code and analyzed its initialization part, stopping at the call to `main` (the first function written in C), which is located in [arch/x86/boot/main.c](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18).
+In the [previous part](/Booting/linux-bootstrap-1.md) we started to dive into the kernel setup code and analyzed its initialization part, stopping at the call to `main` (the first function written in C), which is located in [arch/x86/boot/main.c](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18).

 In this part we will continue our study of the kernel boot process and look at:

 * getting to know `protected mode`
@@ -4,7 +4,7 @@
 Introduction
 --------------------------------------------------------------------------------

-This is the first part of a new chapter of [linux insides](http://0xax.gitbooks.io/linux-insides/content/). As you may guess from its title, this part will cover the [`control groups`](https://en.wikipedia.org/wiki/Cgroups), or `cgroups`, mechanism of the Linux kernel.
+This is the first part of a new chapter of [linux insides](/SUMMARY.md). As you may guess from its title, this part will cover the [`control groups`](https://en.wikipedia.org/wiki/Cgroups), or `cgroups`, mechanism of the Linux kernel.

 `Cgroups` are a mechanism provided by the Linux kernel that allows us to allocate resources - such as processor time, the number of processes per group, the amount of memory per `cgroup`, or a combination of these - to a process or a set of processes. `Cgroups` are organized hierarchically, and in this the mechanism is similar to ordinary processes: they also form a hierarchy, and child `cgroups` inherit a set of attributes from their parents. But there is a real difference: many separate hierarchies of `cgroups` may exist simultaneously, while the process tree is always single. Having multiple hierarchies is not arbitrary, because each `cgroup` hierarchy is attached to a set of `cgroup` "subsystems".
@@ -445,4 +445,4 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 * [bash](https://www.gnu.org/software/bash/)
 * [docker](https://en.wikipedia.org/wiki/Docker_\(software\))
 * [perf events](https://en.wikipedia.org/wiki/Perf_\(Linux\))
-* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html)
+* [Previous chapter](/MM/linux-mm-1.md)
@@ -91,7 +91,7 @@ early_param("percpu_alloc", percpu_alloc_setup);
 enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
 ```

-If the `percpu_alloc` parameter is not given on the kernel command line, the `embed` allocator will be used, which embeds the first per-cpu chunk into bootmem with the help of [memblock](http://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html). The last allocator is like the first-chunk `page` allocator, except that it maps the first chunk with `PAGE_SIZE` pages.
+If the `percpu_alloc` parameter is not given on the kernel command line, the `embed` allocator will be used, which embeds the first per-cpu chunk into bootmem with the help of [memblock](/MM/linux-mm-1.md). The last allocator is like the first-chunk `page` allocator, except that it maps the first chunk with `PAGE_SIZE` pages.

 As I wrote above, first of all in `setup_per_cpu_areas` we check which first-chunk allocator was chosen, and that it is not the page allocator:
@@ -10,7 +10,7 @@ CPU masks
 * [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c)
 * [kernel/cpu.c](https://github.com/torvalds/linux/blob/master/kernel/cpu.c)

-As the comment in [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h) says: Cpumasks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. We already saw a bit of cpumask in the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part, in the `boot_cpu_init` function. This function makes the first boot CPU online, active, and so on:
+As the comment in [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h) says: Cpumasks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. We already saw a bit of cpumask in the [Kernel entry point](/Initialization/linux-initialization-4.md) part, in the `boot_cpu_init` function. This function makes the first boot CPU online, active, and so on:

 ```C
 set_cpu_online(cpu, true);
@@ -214,7 +214,7 @@ static initcall_t *initcall_levels[] __initdata = {
 }
 ```

-If you are not familiar with this, you can learn more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html) of this book.
+If you are not familiar with this, you can learn more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](/Misc/linux-misc-3.md) of this book.

 As we just saw, the `do_initcall_level` function takes one parameter - the level of the `initcall` - and does the following two things: first it copies the `initcall_command_line`, which is a copy of the usual kernel [command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) that contains parameters for the individual modules, and parses it with the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/kernel/params.c) source file; it then calls the `do_one_initcall` function for each initcall of the given level:
@@ -387,9 +387,9 @@ rootfs_initcall(populate_rootfs);
 * [symbols concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
 * [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
 * [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization)
-* [Introduction to linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
+* [Introduction to linkers](/Misc/linux-misc-3.md)
 * [Linux kernel command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)
 * [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
 * [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
 * [rootfs](https://en.wikipedia.org/wiki/Initramfs)
-* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [previous part](/Misc/linux-misc-2.md)
@@ -76,7 +76,7 @@ In the first case for the `blocking notifier chains`, callbacks will be called/e
 The second, `SRCU notifier chains`, are an alternative form of `blocking notifier chains`. Blocking notifier chains use the `rw_semaphore` synchronization primitive to protect their chain links. `SRCU` notifier chains also run in process context, but use a special form of the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) mechanism which permits blocking in a read-side critical section.

-In the third case, the `atomic notifier chains` run in interrupt or atomic context and are protected by the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) synchronization primitive. The last, `raw notifier chains`, provide a special type of notifier chain without any locking restrictions on callbacks: protection rests on the shoulders of the caller. This is very useful when we want to protect our chain with a very specific locking mechanism.
+In the third case, the `atomic notifier chains` run in interrupt or atomic context and are protected by the [spinlock](/SyncPrim/linux-sync-1.md) synchronization primitive. The last, `raw notifier chains`, provide a special type of notifier chain without any locking restrictions on callbacks: protection rests on the shoulders of the caller. This is very useful when we want to protect our chain with a very specific locking mechanism.

 If we look at the implementation of the `notifier_block` structure, we see that it contains a pointer to the `next` element of a notification chain list, but no head. The head of such a list actually lives in a separate structure that depends on the type of the notification chain. For example, for the `blocking notifier chains`:
@@ -118,7 +118,7 @@ which defines head for loadable modules blocking notifier chain. The `BLOCKING_N
 } while (0)
 ```

-So we can see that it takes the name of the head of a blocking notifier chain, initializes the read/write [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) and sets the head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides the `ATOMIC_INIT_NOTIFIER_HEAD` and `RAW_INIT_NOTIFIER_HEAD` macros and the `srcu_init_notifier` function for initializing atomic and other types of notification chains.
+So we can see that it takes the name of the head of a blocking notifier chain, initializes the read/write [semaphore](/SyncPrim/linux-sync-3.md) and sets the head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides the `ATOMIC_INIT_NOTIFIER_HEAD` and `RAW_INIT_NOTIFIER_HEAD` macros and the `srcu_init_notifier` function for initializing atomic and other types of notification chains.

 After the head of a notification chain has been initialized, a subsystem that wants to receive notifications from that chain should register with a certain function, which depends on the type of the notification chain. Looking at the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, you will see the following four functions for this:
@@ -331,7 +331,7 @@ static struct notifier_block tracepoint_module_nb = {
 };
 ```

-These callbacks will be invoked when one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurs. For example, the `MODULE_STATE_LIVE` and `MODULE_STATE_COMING` notifications will be sent during execution of the [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html) [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html), while `MODULE_STATE_GOING` will be sent during execution of the [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`:
+These callbacks will be invoked when one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurs. For example, the `MODULE_STATE_LIVE` and `MODULE_STATE_COMING` notifications will be sent during execution of the [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html) [system call](/SysCall/linux-syscall-1.md), while `MODULE_STATE_GOING` will be sent during execution of the [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`:

 ```C
 SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
@@ -359,11 +359,11 @@ Links
 * [API](https://en.wikipedia.org/wiki/Application_programming_interface)
 * [callback](https://en.wikipedia.org/wiki/Callback_(computer_programming))
 * [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
-* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
+* [spinlock](/SyncPrim/linux-sync-1.md)
 * [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module)
-* [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html)
+* [semaphore](/SyncPrim/linux-sync-3.md)
 * [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt)
-* [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)
+* [system call](/SysCall/linux-syscall-1.md)
 * [init_module system call](http://man7.org/linux/man-pages/man2/init_module.2.html)
 * [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html)
-* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)
+* [previous part](/Concepts/linux-cpu-3.md)
@@ -13,7 +13,7 @@ Linux 内核中的位数组和位操作
 * [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h)

-header file. As I wrote above, `bitmaps` are heavily used in the Linux kernel. For example, a `bit array` is often used to hold a set of online/offline processors in systems that support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) CPUs (you can read more about this in the [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part), and a bit array can hold a set of allocated [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during Linux kernel initialization, and so on.
+header file. As I wrote above, `bitmaps` are heavily used in the Linux kernel. For example, a `bit array` is often used to hold a set of online/offline processors in systems that support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) CPUs (you can read more about this in the [cpumasks](/Concepts/linux-cpu-2.md) part), and a bit array can hold a set of allocated [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during Linux kernel initialization, and so on.

 So, the main goal of this part is to see how bit arrays are implemented in the Linux kernel. Let's start.
@@ -369,7 +369,7 @@ static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
 * [linked data structures](https://en.wikipedia.org/wiki/Linked_data_structure)
 * [tree data structures](https://en.wikipedia.org/wiki/Tree_%28data_structure%29)
 * [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
-* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [cpumasks](/Concepts/linux-cpu-2.md)
 * [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
 * [API](https://en.wikipedia.org/wiki/Application_programming_interface)
 * [atomic operations](https://en.wikipedia.org/wiki/Linearizability)
@@ -4,12 +4,12 @@
 Introduction
 --------------------------------------------------------------------------------

-Memory management is one of the most complex parts of an operating system kernel (and I think nothing can compete with it on this point). In the [preparations before the kernel entry point](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-3.html) part we stopped right before the call to the `start_kernel` function. `start_kernel` initializes all the kernel features (including architecture-dependent ones) before the kernel runs the first `init` process. You may remember that early page tables, identity page tables and fixmap page tables were built during boot, but the complicated memory management machinery was not yet at work. When the `start_kernel` function is called, we will see the transition from early memory management to more complex memory management data structures and techniques. To understand the kernel initialization process well, we need a clear understanding of these techniques. This chapter is an overview of the different parts of the memory management framework and its API, starting with `memblock`.
+Memory management is one of the most complex parts of an operating system kernel (and I think nothing can compete with it on this point). In the [preparations before the kernel entry point](/Initialization/linux-initialization-3.md) part we stopped right before the call to the `start_kernel` function. `start_kernel` initializes all the kernel features (including architecture-dependent ones) before the kernel runs the first `init` process. You may remember that early page tables, identity page tables and fixmap page tables were built during boot, but the complicated memory management machinery was not yet at work. When the `start_kernel` function is called, we will see the transition from early memory management to more complex memory management data structures and techniques. To understand the kernel initialization process well, we need a clear understanding of these techniques. This chapter is an overview of the different parts of the memory management framework and its API, starting with `memblock`.

 Memblock
 --------------------------------------------------------------------------------

-Memblock is one of the methods for managing memory regions during early boot, while the general-purpose kernel memory allocator is not yet up and running. It used to be called `logical memory block`, but was renamed to `memblock` after the kernel accepted a [patch by Yinghai Lu](https://lkml.org/lkml/2010/7/13/68). The kernel on the `x86_64` architecture uses this method. We already met it in the [preparations before the kernel entry point](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-3.html) part. Now it is time to get more familiar with it and see how it is implemented.
+Memblock is one of the methods for managing memory regions during early boot, while the general-purpose kernel memory allocator is not yet up and running. It used to be called `logical memory block`, but was renamed to `memblock` after the kernel accepted a [patch by Yinghai Lu](https://lkml.org/lkml/2010/7/13/68). The kernel on the `x86_64` architecture uses this method. We already met it in the [preparations before the kernel entry point](/Initialization/linux-initialization-3.md) part. Now it is time to get more familiar with it and see how it is implemented.

 We will start by learning the data structures of `memblock`. All of them are defined in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file.
@@ -160,7 +160,7 @@ On this step the initialization of the `memblock` structure has been finished an
 memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
 ```

-function. We pass it the memory block type - `memory`, the physical base address and the size of the memory region, the maximum number of nodes, and the flags. If `CONFIG_NODES_SHIFT` is not set, the maximum number of nodes is 1; otherwise it is `1 << CONFIG_NODES_SHIFT`. The `memblock_add_range` function adds a new memory region to the memory block. It starts by checking the size of the given region and returns immediately if it is 0. Then it checks whether the `memblock` already contains memory regions of the given `memblock_type`. If not, we simply fill a new `memory_region` with the given values and return (we already saw its implementation in the [first touch of the kernel memory manager framework](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-3.html)). If the `memblock_type` is not empty, we add the new memory region to the `memblock` of the given `memblock_type`.
+function. We pass it the memory block type - `memory`, the physical base address and the size of the memory region, the maximum number of nodes, and the flags. If `CONFIG_NODES_SHIFT` is not set, the maximum number of nodes is 1; otherwise it is `1 << CONFIG_NODES_SHIFT`. The `memblock_add_range` function adds a new memory region to the memory block. It starts by checking the size of the given region and returns immediately if it is 0. Then it checks whether the `memblock` already contains memory regions of the given `memblock_type`. If not, we simply fill a new `memory_region` with the given values and return (we already saw its implementation in the [first touch of the kernel memory manager framework](/Initialization/linux-initialization-3.md)). If the `memblock_type` is not empty, we add the new memory region to the `memblock` of the given `memblock_type`.

 First of all we get the end point of the memory region:
@@ -415,4 +415,4 @@ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
 * [e820](http://en.wikipedia.org/wiki/E820)
 * [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
 * [debugfs](http://en.wikipedia.org/wiki/Debugfs)
-* [First touch of the kernel memory manager framework](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-3.html)
+* [First touch of the kernel memory manager framework](/Initialization/linux-initialization-3.md)
@@ -4,7 +4,7 @@
 Fix-Mapped Addresses and ioremap
 --------------------------------------------------------------------------------

-`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be the virtual address minus `__START_KERNEL_map`. Each fix-mapped address maps one page of memory, and the kernel uses them like pointers whose addresses never change - that is the main feature of these addresses. As the comment says: "to have a constant address at compile time, but to set the physical address only in the boot process." As you saw in the [earlier parts](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) of this book, we already set up `level2_fixmap_pgt`:
+`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be the virtual address minus `__START_KERNEL_map`. Each fix-mapped address maps one page of memory, and the kernel uses them like pointers whose addresses never change - that is the main feature of these addresses. As the comment says: "to have a constant address at compile time, but to set the physical address only in the boot process." As you saw in the [earlier parts](/Initialization/linux-initialization-1.md) of this book, we already set up `level2_fixmap_pgt`:

 ```assembly
 NEXT_PAGE(level2_fixmap_pgt)
@@ -79,7 +79,7 @@ static inline unsigned long virt_to_fix(const unsigned long vaddr)
 A PFN is simply an index of a page-sized unit of physical memory. The PFN of a physical address can be trivially defined as (page_phys_addr >> PAGE_SHIFT);

-`__virt_to_fix` clears the lower 12 bits of the given address, subtracts the result from the last address of the fix-mapped area (`FIXADDR_TOP`), and shifts what remains right by `PAGE_SHIFT`, i.e. by 12 bits. Let me explain how it works. As I already wrote, the macro first clears the lower 12 bits of the given address with `x & PAGE_MASK`, then subtracts this from `FIXADDR_TOP`. Since the lower 12 bits of a virtual address represent the offset within a page, after shifting right by `PAGE_SHIFT` we get the `Page frame number`, i.e. the index of the fix-mapped page. Fix-mapped addresses are used in [many places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the kernel: the `IDT` descriptors are stored there, the [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID is stored in the fix-mapped area starting at the `FIX_TBOOT_BASE` index, and the [Xen](http://en.wikipedia.org/wiki/Xen) boot mappings, among others, live there as well. We already saw a bit about fix-mapped addresses in the [fifth part of kernel initialization](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-5.html). Next, let's look at what `ioremap` is, how it is implemented, and how it relates to fix-mapped addresses.
+`__virt_to_fix` clears the lower 12 bits of the given address, subtracts the result from the last address of the fix-mapped area (`FIXADDR_TOP`), and shifts what remains right by `PAGE_SHIFT`, i.e. by 12 bits. Let me explain how it works. As I already wrote, the macro first clears the lower 12 bits of the given address with `x & PAGE_MASK`, then subtracts this from `FIXADDR_TOP`. Since the lower 12 bits of a virtual address represent the offset within a page, after shifting right by `PAGE_SHIFT` we get the `Page frame number`, i.e. the index of the fix-mapped page. Fix-mapped addresses are used in [many places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the kernel: the `IDT` descriptors are stored there, the [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID is stored in the fix-mapped area starting at the `FIX_TBOOT_BASE` index, and the [Xen](http://en.wikipedia.org/wiki/Xen) boot mappings, among others, live there as well. We already saw a bit about fix-mapped addresses in the [fifth part of kernel initialization](/Initialization/linux-initialization-5.md). Next, let's look at what `ioremap` is, how it is implemented, and how it relates to fix-mapped addresses.
 ioremap
 --------------------------------------------------------------------------------
@@ -133,7 +133,7 @@ $ cat /proc/ioports
 * `n` - the length of the region;
 * `name` - the name of the requester of the region.

-`request_region` allocates an I/O port region. `check_region` is often called before `request_region` to check that the given address range is available, and `release_region` releases the region afterwards. `request_region` returns a pointer to a `resource` structure, which is an abstraction of a tree-like subset of system resources. We already met it in the [fifth part of kernel initialization](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-5.html); it is defined as follows:
+`request_region` allocates an I/O port region. `check_region` is often called before `request_region` to check that the given address range is available, and `release_region` releases the region afterwards. `request_region` returns a pointer to a `resource` structure, which is an abstraction of a tree-like subset of system resources. We already met it in the [fifth part of kernel initialization](/Initialization/linux-initialization-5.md); it is defined as follows:

 ```C
 struct resource {
@@ -258,13 +258,13 @@ static inline const char *e820_type_to_string(int e820_type)
 We can see them in `/proc/iomem`.

-Now let's try to understand how `ioremap` works. We already know a little about it from the [fifth part of kernel initialization](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-5.html). If you have read that part, you may remember the call to the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c) source file. The initialization of `ioremap` is split into two parts: one part works before normal `ioremap` calls are possible, while normal `ioremap` calls first require the initialization of `vmalloc` and a call to `paging_init`. We do not know anything about `vmalloc` yet, so let's look at the first part of the initialization. First of all, `early_ioremap_init` checks that the fixmap is aligned on a page middle directory boundary:
+Now let's try to understand how `ioremap` works. We already know a little about it from the [fifth part of kernel initialization](/Initialization/linux-initialization-5.md). If you have read that part, you may remember the call to the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c) source file. The initialization of `ioremap` is split into two parts: one part works before normal `ioremap` calls are possible, while normal `ioremap` calls first require the initialization of `vmalloc` and a call to `paging_init`. We do not know anything about `vmalloc` yet, so let's look at the first part of the initialization. First of all, `early_ioremap_init` checks that the fixmap is aligned on a page middle directory boundary:

 ```C
 BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
 ```

-You can read more about `BUILD_BUG_ON` in the [first part of kernel initialization](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html). The `BUILD_BUG_ON` macro raises a compile-time error if the given expression is true. In the next step after this check we see the call to the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c) source file. This function represents the generic initialization of `ioremap`: it fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps live after `__end_of_permanent_fixed_addresses` in memory, starting at `FIX_BITMAP_BEGIN` and ending at `FIX_BITMAP_END`. In fact, the early `ioremap` uses `512` temporary boot-time mappings:
+You can read more about `BUILD_BUG_ON` in the [first part of kernel initialization](/Initialization/linux-initialization-1.md). The `BUILD_BUG_ON` macro raises a compile-time error if the given expression is true. In the next step after this check we see the call to the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c) source file. This function represents the generic initialization of `ioremap`: it fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps live after `__end_of_permanent_fixed_addresses` in memory, starting at `FIX_BITMAP_BEGIN` and ending at `FIX_BITMAP_END`. In fact, the early `ioremap` uses `512` temporary boot-time mappings:

 ```
 #define NR_FIX_BTMAPS 64
@@ -319,7 +319,7 @@ pmd_populate_kernel(&init_mm, pmd, bm_pte);
 The `pmd_populate_kernel` function takes three parameters:

-* `init_mm` - the memory descriptor of the `init` process (you saw it in a [previous part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-5.html));
+* `init_mm` - the memory descriptor of the `init` process (you saw it in a [previous part](/Initialization/linux-initialization-5.md));
 * `pmd` - the page middle directory of the beginning of the `ioremap` fixmaps;
 * `bm_pte` - the early `ioremap` page table entries array, defined as:
@@ -519,5 +519,5 @@ prev_map[slot] = NULL;
 * [e820](http://en.wikipedia.org/wiki/E820)
 * [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit)
 * [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer)
-* [Paging](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/linux-theory-1.html)
-* [Linux kernel memory management Part 1.](http://xinqiu.gitbooks.io/linux-insides-cn/content/MM/linux-mm-1.html)
+* [Paging](/Theory/linux-theory-1.md)
+* [Linux kernel memory management Part 1.](/MM/linux-mm-1.md)
@@ -4,7 +4,7 @@ Linux内核内存管理 第三节
 Introduction to kmemcheck in the Linux kernel
 --------------------------------------------------------------------------------

-This is the third part of the [chapter](https://xinqiu.gitbooks.io/linux-insides-cn/content/MM/) that describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel. In the [second part](https://xinqiu.gitbooks.io/linux-insides-cn/content/MM/linux-mm-2.html) of this chapter we met two memory-management-related concepts:
+This is the third part of the [chapter](/MM/) that describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel. In the [second part](/MM/linux-mm-2.md) of this chapter we met two memory-management-related concepts:

 * `Fix-Mapped Addresses`;
 * `ioremap`.
@@ -62,11 +62,11 @@ $ sudo cat /proc/ioports
 ...
 ```

-The output of `ioports` lists the various I/O port regions registered by the physical devices in the system. The kernel cannot access device input/output addresses directly; before it can use this memory, the addresses must be mapped into the virtual address space, which is the main purpose of the `io remap` mechanism. The previous [part](https://xinqiu.gitbooks.io/linux-insides-cn/content/MM/linux-mm-2.html) covered only the early `io remap`. Soon we will look at the implementation of the regular `io remap`. But before that we need to learn some other things, for example the different types of memory allocators, without which the mechanism is hard to understand.
+The output of `ioports` lists the various I/O port regions registered by the physical devices in the system. The kernel cannot access device input/output addresses directly; before it can use this memory, the addresses must be mapped into the virtual address space, which is the main purpose of the `io remap` mechanism. The previous [part](/MM/linux-mm-2.md) covered only the early `io remap`. Soon we will look at the implementation of the regular `io remap`. But before that we need to learn some other things, for example the different types of memory allocators, without which the mechanism is hard to understand.

 Before moving on to the regular [memory management](https://en.wikipedia.org/wiki/Memory_management) of the Linux kernel, we will look at some special memory mechanisms, such as [debugging](https://en.wikipedia.org/wiki/Debugging), [memory leak](https://en.wikipedia.org/wiki/Memory_leak) detection, memory control, and so on. Learning these will help us understand the memory management of the Linux kernel.

-As you may have guessed from the title of this part, we will start our look at memory mechanisms with [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt). As in other [chapters](https://xinqiu.gitbooks.io/linux-insides-cn/content/), we will first learn what `kmemcheck` is in theory, and then how it is implemented in the Linux kernel.
+As you may have guessed from the title of this part, we will start our look at memory mechanisms with [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt). As in other [chapters](/SUMMARY.md), we will first learn what `kmemcheck` is in theory, and then how it is implemented in the Linux kernel.

 So let's start. What is `kmemcheck` in the Linux kernel? As you may guess from its name, `kmemcheck` checks memory - and you guess right: its main purpose is to check whether kernel code accesses `uninitialized memory`. Let's look at a simple [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program:
@@ -148,7 +148,7 @@ config X86
 struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
 ```

-In other words, a [page fault](https://en.wikipedia.org/wiki/Page_fault) occurs when the kernel accesses the [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29). This is because `kmemcheck` marks the memory page as `non-present` (for information about paging in Linux, see [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html)). When a `page fault` exception occurs, the exception handler handles it, and if it detects that `kmemcheck` is enabled, it hands control over to `kmemcheck`. After `kmemcheck` has done its checks, the page is marked `present` and the interrupted program continues execution. There is a subtlety here: as soon as the first instruction of the interrupted program has executed, `kmemcheck` marks the page `not present` again, so the next access to the page is caught as well.
+In other words, a [page fault](https://en.wikipedia.org/wiki/Page_fault) occurs when the kernel accesses the [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29). This is because `kmemcheck` marks the memory page as `non-present` (for information about paging in Linux, see [Paging](/Theory/linux-theory-1.md)). When a `page fault` exception occurs, the exception handler handles it, and if it detects that `kmemcheck` is enabled, it hands control over to `kmemcheck`. After `kmemcheck` has done its checks, the page is marked `present` and the interrupted program continues execution. There is a subtlety here: as soon as the first instruction of the interrupted program has executed, `kmemcheck` marks the page `not present` again, so the next access to the page is caught as well.

 Now that we have looked at `kmemcheck` from the theoretical side, let's see how the Linux kernel implements this mechanism.
@@ -167,7 +167,7 @@ struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
 

-From the seventh [part](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-7.html) of the Linux kernel initialization chapter we know that the kernel command line is parsed during initialization in functions such as `do_initcall_level` and `do_early_param`. As mentioned above, the `kmemcheck` subsystem consists of two parts, and the first of them starts early. The [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source file contains the `param_kmemcheck` function, which is used while the command line is parsed:
+From the seventh [part](/Initialization/linux-initialization-7.md) of the Linux kernel initialization chapter we know that the kernel command line is parsed during initialization in functions such as `do_initcall_level` and `do_early_param`. As mentioned above, the `kmemcheck` subsystem consists of two parts, and the first of them starts early. The [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source file contains the `param_kmemcheck` function, which is used while the command line is parsed:

 ```C
 static int __init param_kmemcheck(char *str)
@@ -190,7 +190,7 @@ early_param("kmemcheck", param_kmemcheck);
 As we saw above, the `kmemcheck` parameter can take one of three values: `0` (disabled), `1` (enabled) or `2` (one-shot mode). The implementation of `param_kmemcheck` is simple: it just converts the value of the `kmemcheck` kernel command-line parameter from a string to an integer and assigns it to the `kmemcheck_enabled` variable.
-The second stage is executed during kernel initialization, rather than from the early [initcalls](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-3.html). The second stage is represented by the `kmemcheck_init` function:
+The second stage is executed during kernel initialization, rather than from the early [initcalls](/Concepts/linux-cpu-3.md). The second stage is represented by the `kmemcheck_init` function:

 ```C
 int __init kmemcheck_init(void)
@@ -278,7 +278,7 @@ void kmemcheck_hide_pages(struct page *p, unsigned int n)
 This function walks through all the memory pages given by its parameters and tries to obtain the `page table entry` of each page. If it succeeds, it clears the present bit and sets the hidden bit of the page table entry. At the end it flushes the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer), because some pages have changed. From this point on, the pages are tracked by `kmemcheck`: since their `present` bit is cleared, a [page fault](https://en.wikipedia.org/wiki/Page_fault) is triggered as soon as `kmalloc` returns the address and some code accesses it.

-As covered in the [second part](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-2.html) of Linux kernel initialization, the `page fault` handler is the `do_page_fault` function from [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c). It begins like this:
+As covered in the [second part](/Initialization/linux-initialization-2.md) of Linux kernel initialization, the `page fault` handler is the `do_page_fault` function from [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c). It begins like this:

 ```C
 static noinline void
@@ -296,7 +296,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 }
 ```

-The `kmemcheck_active` function gets the `kmemcheck_context` [per-cpu](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-1.html) structure and returns the result of comparing its `balance` member with zero:
+The `kmemcheck_active` function gets the `kmemcheck_context` [per-cpu](/Concepts/linux-cpu-1.md) structure and returns the result of comparing its `balance` member with zero:

 ```
 bool kmemcheck_active(struct pt_regs *regs)
@@ -337,7 +337,7 @@ if (!pte)
 static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE];
 ```

-`kmemcheck` declares a special [tasklet](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/linux-interrupts-9.html):
+`kmemcheck` declares a special [tasklet](/Interrupts/linux-interrupts-9.md):

 ```C
 static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0);
@@ -423,11 +423,11 @@ Links
 * [kmemcheck documentation](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
 * [valgrind](https://en.wikipedia.org/wiki/Valgrind)
 * [page fault](https://en.wikipedia.org/wiki/Page_fault)
-* [initcalls](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-3.html)
+* [initcalls](/Concepts/linux-cpu-3.md)
 * [opcode](https://en.wikipedia.org/wiki/Opcode)
 * [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)
-* [per-cpu variables](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-1.html)
+* [per-cpu variables](/Concepts/linux-cpu-1.md)
 * [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
-* [tasklet](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/linux-interrupts-9.html)
-* [Paging](https://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/linux-theory-1.html)
-* [Previous part](https://xinqiu.gitbooks.io/linux-insides-cn/content/MM/linux-mm-2.html)
+* [tasklet](/Interrupts/linux-interrupts-9.md)
+* [Paging](/Theory/linux-theory-1.md)
+* [Previous part](/MM/linux-mm-2.md)
@@ -4,11 +4,11 @@ Linux 内核开发
|
||||
简介
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
如你所知,我从去年开始写了一系列关于 `x86_64` 架构汇编语言程序设计的[博文](http://xinqiu.gitbooks.io/categories/assembly/)。除了大学期间写过一些 `Hello World` 这样无实用价值的程序之外,我从来没写过哪怕一行的底层代码。那些程序也是很久以前的事情了,就像我刚才说的,我几乎完全没有写过底层代码。直到不久前,我才开始对这些事情感兴趣,因为我意识到我虽然可以写出程序,但是我却不知道我的程序是怎样被组织运行的。
|
||||
如你所知,我从去年开始写了一系列关于 `x86_64` 架构汇编语言程序设计的[博文](https://0xax.github.io/categories/assembler/)。除了大学期间写过一些 `Hello World` 这样无实用价值的程序之外,我从来没写过哪怕一行的底层代码。那些程序也是很久以前的事情了,就像我刚才说的,我几乎完全没有写过底层代码。直到不久前,我才开始对这些事情感兴趣,因为我意识到我虽然可以写出程序,但是我却不知道我的程序是怎样被组织运行的。
|
||||
|
||||
在写了一些汇编代码之后,我开始**大致**了解了程序在编译之后会变成什么样子。尽管如此,还是有很多其他的东西我不能够理解。例如:当 `syscall` 指令在我的汇编程序内执行时究竟发生了什么,当 `printf` 函数开始工作时又发生了什么,还有,我的程序是如何通过网络与其他计算机进行通信的。[汇编](https://en.wikipedia.org/wiki/Assembly_language#Assembler)语言并没有为这些问题带来答案,于是我决定做一番深入研究。我开始学习 Linux 内核的源代码,并且尝试着理解那些让我感兴趣的东西。然而 Linux 内核源代码也没有解答我**所有的**问题,不过我自身关于 Linux 内核及其外围流程的知识确实掌握的更好了。
|
||||
|
||||
在我开始学习 Linux 内核的九个半月之后,我写了这部分内容,并且发布了本书的[第一部分](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html)。到现在为止,本书共包括了四个部分,而这并不是终点。我之所以写这一系列关于 Linux 内核的文章其实更多的是为了我自己。你也知道,Linux 内核的代码量极其巨大,另外还非常容易忘记这一块或那一块内核代码做了什么,或者忘记某些东西是怎么实现的。出乎意料的是 [linux-insides](https://github.com/0xAX/linux-insides) 很快就火了,并且在九个月后积攒了 `9096` 个星星:
|
||||
在我开始学习 Linux 内核的九个半月之后,我写了这部分内容,并且发布了本书的[第一部分](/Booting/linux-bootstrap-1.md)。到现在为止,本书共包括了四个部分,而这并不是终点。我之所以写这一系列关于 Linux 内核的文章其实更多的是为了我自己。你也知道,Linux 内核的代码量极其巨大,另外还非常容易忘记这一块或那一块内核代码做了什么,或者忘记某些东西是怎么实现的。出乎意料的是 [linux-insides](https://github.com/0xAX/linux-insides) 很快就火了,并且在九个月后积攒了 `9096` 个星星:
@@ -1,7 +1,7 @@
介绍
---------------
在写 [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) 一书的过程中,我收到了很多邮件询问关于[链接器](https://zh.wikipedia.org/wiki/%E9%93%BE%E6%8E%A5%E5%99%A8)和链接器脚本的问题。所以我决定写这篇文章来介绍链接器和目标文件的链接方面的知识。
在写 [linux-insides](/SUMMARY.md) 一书的过程中,我收到了很多邮件询问关于[链接器](https://zh.wikipedia.org/wiki/%E9%93%BE%E6%8E%A5%E5%99%A8)和链接器脚本的问题。所以我决定写这篇文章来介绍链接器和目标文件的链接方面的知识。
如果我们打开维基百科的 `链接器` 页,我们将会看到如下定义:
@@ -567,7 +567,7 @@ Disassembly of section .data:
...
```
除了我们已经看到的命令,另外还有一些。首先是 `ASSERT(exp, message)` ,保证给定的表达式不为零。如果为零,那么链接器会退出同时返回错误码,打印错误信息。如果你已经阅读了 [linux-insides](https://xinqiu.gitbooks.io/linux-insides-cn/content/) 的 Linux 内核启动流程,你或许知道 Linux 内核的设置头的偏移为 `0x1f1`。在 Linux 内核的链接器脚本中,我们可以看到下面的校验:
除了我们已经看到的命令,另外还有一些。首先是 `ASSERT(exp, message)` ,保证给定的表达式不为零。如果为零,那么链接器会退出同时返回错误码,打印错误信息。如果你已经阅读了 [linux-insides](/Booting/README.md) 的 Linux 内核启动流程,你或许知道 Linux 内核的设置头的偏移为 `0x1f1`。在 Linux 内核的链接器脚本中,我们可以看到下面的校验:
```
. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
@@ -627,7 +627,7 @@ SECTIONS
相关链接
-----------------
* [Book about Linux kernel insides](http://0xax.gitbooks.io/linux-insides/content/)
* [Book about Linux kernel insides](/SUMMARY.md)
* [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29)
* [object files](https://en.wikipedia.org/wiki/Object_file)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
@@ -4,9 +4,9 @@
简介
--------------------------------------------------------------------------------
虽然 [linux-insides-zh](https://www.gitbook.com/book/xinqiu/linux-insides-cn/details) 大多描述的是内核相关的东西,但是我已经决定写一个大多与用户空间相关的部分。
虽然 linux-insides-zh 大多描述的是内核相关的东西,但是我已经决定写一个大多与用户空间相关的部分。
[系统调用](https://zh.wikipedia.org/wiki/%E7%B3%BB%E7%BB%9F%E8%B0%83%E7%94%A8)章节的[第四部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-4.html)已经描述了当我们想运行一个程序时 Linux 内核的行为。这一部分我想从用户空间的角度研究一下,当我们在 Linux 系统上运行一个程序时会发生什么。
[系统调用](https://zh.wikipedia.org/wiki/%E7%B3%BB%E7%BB%9F%E8%B0%83%E7%94%A8)章节的[第四部分](/SysCall/linux-syscall-4.md)已经描述了当我们想运行一个程序时 Linux 内核的行为。这一部分我想从用户空间的角度研究一下,当我们在 Linux 系统上运行一个程序时会发生什么。
我不知道你知识储备如何,但是在我的大学时期我学到,一个 `C` 程序从一个叫做 main 的函数开始执行。而且,这是部分正确的。每次当我们开始写一个新的程序时,我们都从下面的实例代码开始编程:
@@ -123,7 +123,7 @@ x + y + z = 6
> The exec() family of functions replaces the current process image with a new process image.
如果你已经阅读过[系统调用](https://zh.wikipedia.org/wiki/%E7%B3%BB%E7%BB%9F%E8%B0%83%E7%94%A8)章节的[第四部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-4.html),你可能就知道 execve 这个系统调用定义在 [files/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c#L1859) 文件中,并且如下所示,
如果你已经阅读过[系统调用](https://zh.wikipedia.org/wiki/%E7%B3%BB%E7%BB%9F%E8%B0%83%E7%94%A8)章节的[第四部分](/SysCall/linux-syscall-4.md),你可能就知道 execve 这个系统调用定义在 [files/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c#L1859) 文件中,并且如下所示,
```C
SYSCALL_DEFINE3(execve,
@@ -135,7 +135,7 @@ SYSCALL_DEFINE3(execve,
}
```
它以可执行文件的名字,命令行参数的集合以及环境变量的集合作为参数。正如你猜测的,所有工作都是由 `do_execve` 函数完成的。在这里我将不描述这个函数的实现细节,因为你可以从[这里](https://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-4.html)读到。但是,简而言之,`do_execve` 函数会检查诸如文件名是否有效,未超出进程数目限制等等。在这些检查之后,这个函数会解析 [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) 格式的可执行文件,为新的可执行文件创建内存描述符,并且在栈,堆等内存区域填上适当的值。当二进制镜像设置完成,`start_thread` 函数会设置一个新的进程。这个函数是架构相关的,对于 [x86_64](https://en.wikipedia.org/wiki/X86-64) 架构,它的定义在 [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/process_64.c#L231) 文件中。
它以可执行文件的名字,命令行参数的集合以及环境变量的集合作为参数。正如你猜测的,所有工作都是由 `do_execve` 函数完成的。在这里我将不描述这个函数的实现细节,因为你可以从[这里](/SysCall/linux-syscall-4.md)读到。但是,简而言之,`do_execve` 函数会检查诸如文件名是否有效,未超出进程数目限制等等。在这些检查之后,这个函数会解析 [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) 格式的可执行文件,为新的可执行文件创建内存描述符,并且在栈,堆等内存区域填上适当的值。当二进制镜像设置完成,`start_thread` 函数会设置一个新的进程。这个函数是架构相关的,对于 [x86_64](https://en.wikipedia.org/wiki/X86-64) 架构,它的定义在 [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/process_64.c#L231) 文件中。
`start_thread` 为[段寄存器](https://en.wikipedia.org/wiki/X86_memory_segmentation)设置新的值。从这一点开始,新进程已经准备就绪。一旦[进程切换](https://en.wikipedia.org/wiki/Context_switch)完成,控制权就会返回到用户空间,并且新的可执行文件将会执行。
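上面描述的 `execve` 流程可以从用户空间直接观察到。下面是一个示意性的草图(`run_true` 是为演示虚构的名字,并假设系统上存在 `/bin/true`):fork 出子进程后调用 `execve`,子进程的进程映像被替换,父进程通过 `waitpid` 收集其退出状态。

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork 一个子进程并用 execve() 把它的映像替换为 /bin/true,
 * 返回子进程的退出码(成功时为 0)。 */
int run_true(void)
{
    char *argv[] = { "/bin/true", NULL };
    char *envp[] = { NULL };

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve("/bin/true", argv, envp);
        _exit(127); /* 只有 execve 失败才会执行到这里 */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

子进程里 `execve` 之后的代码永远不会被执行(除非调用失败),这正是"替换进程映像"的含义。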
@@ -3,7 +3,7 @@ Linux 内核中的同步原语. 第一部分.
Introduction
--------------------------------------------------------------------------------
这一部分为 [linux-insides](https://xinqiu.gitbooks.io/linux-insides-cn/content/) 这本书开启了新的章节。定时器和时间管理相关的概念在上一个[章节](https://xinqiu.gitbooks.io/linux-insides-cn/content/Timers/index.html)已经描述过了。现在是时候继续了。就像你可能从这一部分的标题所了解的那样,本章节将会描述 Linux 内核中的[同步](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)原语。
这一部分为 [linux-insides](/SUMMARY.md) 这本书开启了新的章节。定时器和时间管理相关的概念在上一个[章节](/Timers/)已经描述过了。现在是时候继续了。就像你可能从这一部分的标题所了解的那样,本章节将会描述 Linux 内核中的[同步](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)原语。
像往常一样,在考虑一些同步相关的事情之前,我们会尝试去概括地了解什么是`同步原语`。事实上,同步原语是一种软件机制,提供了两个或者多个[并行](https://en.wikipedia.org/wiki/Parallel_computing)进程或者线程在不同时刻执行一段相同的代码段的能力。例如下面的代码片段:
@@ -20,7 +20,7 @@ clocksource_select();
...
mutex_unlock(&clocksource_mutex);
```
出自 [kernel/time/clocksource.c](https://github.com/torvalds/linux/blob/master/kernel/time/clocksource.c) 源文件。这段代码来自于 `__clocksource_register_scale` 函数,此函数添加给定的 [clocksource](https://xinqiu.gitbooks.io/linux-insides-cn/content/Timers/linux-timers-2.html) 到时钟源列表中。这个函数在注册时钟源列表中生成两个不同的操作。例如 `clocksource_enqueue` 函数就是添加给定时钟源到注册时钟源列表——`clocksource_list` 中。注意这几行代码被两个函数所包围:`mutex_lock` 和 `mutex_unlock`,这两个函数都带有一个参数——在本例中为 `clocksource_mutex`。
出自 [kernel/time/clocksource.c](https://github.com/torvalds/linux/blob/master/kernel/time/clocksource.c) 源文件。这段代码来自于 `__clocksource_register_scale` 函数,此函数添加给定的 [clocksource](/Timers/linux-timers-2.md) 到时钟源列表中。这个函数在注册时钟源列表中生成两个不同的操作。例如 `clocksource_enqueue` 函数就是添加给定时钟源到注册时钟源列表——`clocksource_list` 中。注意这几行代码被两个函数所包围:`mutex_lock` 和 `mutex_unlock`,这两个函数都带有一个参数——在本例中为 `clocksource_mutex`。
这些函数展示了基于[互斥锁 (mutex)](https://en.wikipedia.org/wiki/Mutual_exclusion) 同步原语的加锁和解锁。`mutex_lock` 执行之后,在互斥锁的持有者执行 `mutex_unlock` 之前,其他线程都无法进入这段代码,这就阻止了两个或两个以上线程同时执行它。换句话说,就是阻止在 `clocksource_list` 上的并行操作。为什么在这里需要使用`互斥锁`?考虑如果两个并行处理尝试同时注册一个时钟源会怎样。正如我们已经知道的那样,`clocksource_enqueue` 函数会在列表中找到等级最大(即系统中注册的具有最高频率)的时钟源,并把给定的时钟源插入到合适的位置,加入 `clocksource_list` 列表:
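为了直观感受 `mutex_lock`/`mutex_unlock` 这对操作的作用,下面用用户空间的 POSIX 线程互斥锁写一个极简的类比(`register_source`、`source_mutex` 等名字都是为演示虚构的,并不是内核 API):

```c
#include <pthread.h>

#define MAX_SOURCES 16

static pthread_mutex_t source_mutex = PTHREAD_MUTEX_INITIALIZER;
static int sources[MAX_SOURCES];
static int nr_sources;

/* 与时钟源注册类似的"检查再插入"操作:
 * 互斥锁保证两个线程不会同时修改 sources 列表 */
int register_source(int rating)
{
    int ret = -1;

    pthread_mutex_lock(&source_mutex);
    if (nr_sources < MAX_SOURCES) {
        sources[nr_sources++] = rating;
        ret = 0;
    }
    pthread_mutex_unlock(&source_mutex);
    return ret;
}
```

如果去掉这对加锁/解锁调用,两个线程就可能读到相同的 `nr_sources`,写入同一个槽位——这正是下文描述的竞态条件。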
@@ -39,7 +39,7 @@ static void clocksource_enqueue(struct clocksource *cs)
如果两个并行处理尝试同时去执行这个函数,那么这两个处理可能会找到相同的`入口 (entry)`,从而发生[竞态条件 (race condition)](https://en.wikipedia.org/wiki/Race_condition)。或者换句话说,第二个执行 `list_add` 的处理程序,将会重写第一个线程写入的时钟源。
除了这个简单的例子,同步原语在 Linux 内核中无处不在。如果再翻阅之前的[章节](https://xinqiu.gitbooks.io/linux-insides-cn/content/Timers/index.html)或者其他章节,或者大概看看 Linux 内核源码,就会发现许多地方都使用同步原语。我们不考虑 `mutex` 在 Linux 内核是如何实现的。事实上,Linux 内核提供了一系列不同的同步原语:
除了这个简单的例子,同步原语在 Linux 内核中无处不在。如果再翻阅之前的[章节](/Timers/)或者其他章节,或者大概看看 Linux 内核源码,就会发现许多地方都使用同步原语。我们不考虑 `mutex` 在 Linux 内核是如何实现的。事实上,Linux 内核提供了一系列不同的同步原语:
* `mutex`;
* `semaphores`;
@@ -236,7 +236,7 @@ static inline void __raw_spin_lock(raw_spinlock_t *lock)
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
```
就像你们可能了解的那样, 首先我们禁用了[抢占](https://en.wikipedia.org/wiki/Preemption_%28computing%29),通过 [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) (在 Linux 内核初始化进程章节的第九[部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-9.html)会了解到更多关于抢占)中的 `preempt_disable` 调用实现禁用。当我们将要解开给定的`自旋锁`,抢占将会再次启用:
就像你们可能了解的那样, 首先我们禁用了[抢占](https://en.wikipedia.org/wiki/Preemption_%28computing%29),通过 [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) (在 Linux 内核初始化进程章节的第九[部分](/Initialization/linux-initialization-9.md)会了解到更多关于抢占)中的 `preempt_disable` 调用实现禁用。当我们将要解开给定的`自旋锁`,抢占将会再次启用:
```C
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
@@ -409,7 +409,7 @@ head | 7 | - - - | 7 | tail
* [Concurrent computing](https://en.wikipedia.org/wiki/Concurrent_computing)
* [Synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
* [Clocksource framework](https://xinqiu.gitbooks.io/linux-insides-cn/content/Timers/linux-timers-2.html)
* [Clocksource framework](/Timers/linux-timers-2.md)
* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Race condition](https://en.wikipedia.org/wiki/Race_condition)
* [Atomic operations](https://en.wikipedia.org/wiki/Linearizability)
@@ -422,4 +422,4 @@ head | 7 | - - - | 7 | tail
* [xadd instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html)
* [NOP](https://en.wikipedia.org/wiki/NOP)
* [Memory barriers](https://www.kernel.org/doc/Documentation/memory-barriers.txt)
* [Previous chapter](https://xinqiu.gitbooks.io/linux-insides-cn/content/Timers/index.html)
* [Previous chapter](/Timers/)
@@ -4,9 +4,9 @@ Linux 内核的同步原语. 第二部分.
队列自旋锁
--------------------------------------------------------------------------------
这是本[章节](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/index.html)的第二部分,本章描述 Linux 内核中的同步原语。在本章的第一[部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/linux-sync-1.html)我们已经见到了[自旋锁](https://en.wikipedia.org/wiki/Spinlock),在这个部分我们将继续学习这种同步原语。如果阅读了上一部分的相关内容,你可能记得除了普通自旋锁,Linux 内核还提供`自旋锁`的一种特殊类型 - `队列自旋锁`。在这个部分我们将尝试理解此概念代表的含义。
这是本[章节](/SyncPrim/)的第二部分,本章描述 Linux 内核中的同步原语。在本章的第一[部分](/SyncPrim/linux-sync-1.md)我们已经见到了[自旋锁](https://en.wikipedia.org/wiki/Spinlock),在这个部分我们将继续学习这种同步原语。如果阅读了上一部分的相关内容,你可能记得除了普通自旋锁,Linux 内核还提供`自旋锁`的一种特殊类型 - `队列自旋锁`。在这个部分我们将尝试理解此概念代表的含义。
我们在上一[部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/linux-sync-1.html)已知`自旋锁`的 [API](https://en.wikipedia.org/wiki/Application_programming_interface):
我们在上一[部分](/SyncPrim/linux-sync-1.md)已知`自旋锁`的 [API](https://en.wikipedia.org/wiki/Application_programming_interface):
* `spin_lock_init` - 为给定`自旋锁`进行初始化;
* `spin_lock` - 获取给定`自旋锁`;
@@ -95,12 +95,12 @@ int unlock(lock)
第一个线程将执行 `test_and_set` 指令将 `lock` 设置为 `1`。当第二个线程调用 `lock` 函数时,它将在 `while` 循环中自旋,直到第一个线程调用 `unlock` 函数把 `lock` 置回 `0`。这个实现的性能并不理想,因为它至少有两个问题。第一个问题是该实现可能是非公平的:即使一个线程比其他线程更早调用 `lock`,它也可能等待很长时间才获得锁。第二个问题是所有想要获取锁的线程,都必须在共享内存的变量上执行很多类似 `test_and_set` 这样的`原子`操作。这会导致缓存失效:每次对 `lock` 的原子写都会使其他处理器缓存中 `lock` 的副本失效,迫使它们重新从内存读取。
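上面伪代码描述的 `test_and_set` 自旋锁可以用 C11 原子操作勾勒出来(仅为示意性的用户空间草图,不是内核实现)。注意加锁时每个等待者都在同一个共享变量上反复执行原子写,这正是上一段所说缓存行颠簸的来源:

```c
#include <stdatomic.h>

typedef atomic_flag tas_lock_t;
#define TAS_LOCK_INIT ATOMIC_FLAG_INIT

void tas_lock(tas_lock_t *l)
{
    /* test_and_set 返回旧值:返回 1 说明锁已被别人持有,继续自旋 */
    while (atomic_flag_test_and_set(l))
        ;
}

void tas_unlock(tas_lock_t *l)
{
    atomic_flag_clear(l); /* 把锁置回 0,允许某个自旋者获得锁 */
}
```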
在上一[部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/linux-sync-1.html) 我们了解了自旋锁的第二种实现 -
在上一[部分](/SyncPrim/linux-sync-1.md) 我们了解了自旋锁的第二种实现 -
`排队自旋锁(ticket spinlock)`。这一方法解决了第一个问题而且能够保证想要获取锁的线程的顺序,但是仍然存在第二个问题。
这一部分的主旨是 `队列自旋锁`。这个方法能够帮助解决上述的两个问题。`队列自旋锁`允许每个处理器对自旋过程使用他自己的内存地址。通过学习名为 [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) 锁的这种基于队列自旋锁的实现,能够最好理解基于队列自旋锁的基本原则。在了解`队列自旋锁`的实现之前,我们先尝试理解什么是 `MCS` 锁。
`MCS`锁的基本理念就在上一段已经写到了,一个线程在本地变量上自旋然后每个系统的处理器自己拥有这些变量的拷贝。换句话说这个概念建立在 Linux 内核中的 [per-cpu](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-1.html) 变量概念之上。
`MCS`锁的基本理念就在上一段已经写到了,一个线程在本地变量上自旋然后每个系统的处理器自己拥有这些变量的拷贝。换句话说这个概念建立在 Linux 内核中的 [per-cpu](/Concepts/linux-cpu-1.md) 变量概念之上。
当第一个线程想要获取锁时,它会在`队列`中注册自身,换句话说,它把代表自己的节点加入一个特殊`队列`,然后获取锁。当第二个线程在第一个线程释放锁之前想要获取同一把锁时,这个线程就会把自己的锁变量拷贝(节点)也加入到这个特殊`队列`中。这时第一个线程的节点中会有一个 `next` 字段指向第二个线程。从这一时刻起,第二个线程会一直等待,直到第一个线程释放它的锁并把这个事件通知给 `next` 线程。此时第一个线程从`队列`中删除,而第二个线程持有该锁。
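上一段描述的 MCS 加锁/解锁流程可以用 C11 原子操作大致勾勒如下(示意性草图,与内核真正的队列自旋锁实现有差距):每个线程自带一个节点,只在自己节点的 `locked` 字段上自旋,解锁时通过 `next` 指针通知队列中的下一个节点。

```c
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_int locked;
};

typedef _Atomic(struct mcs_node *) mcs_lock_t; /* 指向队尾节点 */

void mcs_lock(mcs_lock_t *lock, struct mcs_node *self)
{
    atomic_store(&self->next, NULL);
    atomic_store(&self->locked, 1);

    /* 把自己挂到队尾;prev 是之前的队尾(可能为 NULL,表示锁空闲) */
    struct mcs_node *prev = atomic_exchange(lock, self);
    if (prev) {
        atomic_store(&prev->next, self);
        while (atomic_load(&self->locked))
            ; /* 只在自己的节点上自旋,避免缓存行颠簸 */
    }
}

void mcs_unlock(mcs_lock_t *lock, struct mcs_node *self)
{
    struct mcs_node *next = atomic_load(&self->next);

    if (!next) {
        /* 看起来没有后继:尝试把队尾从 self 换成 NULL */
        struct mcs_node *expected = self;
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;
        /* 有线程正在入队,等它填好 next 指针 */
        while (!(next = atomic_load(&self->next)))
            ;
    }
    atomic_store(&next->locked, 0); /* 通知下一个等待者 */
}
```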
@@ -458,7 +458,7 @@ smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
总结
--------------------------------------------------------------------------------
这是 Linux 内核[同步原语](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)章节第二部分的结尾。在上一个[部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/linux-sync-1.html)我们已经见到了第一个同步原语`自旋锁`,以及 Linux 内核对它的一种实现 - `排队自旋锁(ticket spinlock)`。在这个部分我们了解了另一种`自旋锁`机制的实现 - `队列自旋锁`。下一个部分我们继续深入 Linux 内核同步原语。
这是 Linux 内核[同步原语](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)章节第二部分的结尾。在上一个[部分](/SyncPrim/linux-sync-1.md)我们已经见到了第一个同步原语`自旋锁`,以及 Linux 内核对它的一种实现 - `排队自旋锁(ticket spinlock)`。在这个部分我们了解了另一种`自旋锁`机制的实现 - `队列自旋锁`。下一个部分我们继续深入 Linux 内核同步原语。
如果您有疑问或者建议,请在 twitter [0xAX](https://twitter.com/0xAX) 上联系我,通过 [email](mailto:anotherworldofworld@gmail.com) 联系我,或者创建一个 [issue](https://github.com/0xAX/linux-insides/issues/new)。
@@ -473,11 +473,11 @@ smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Test and Set](https://en.wikipedia.org/wiki/Test-and-set)
* [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [per-cpu variables](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-1.html)
* [per-cpu variables](/Concepts/linux-cpu-1.md)
* [atomic instruction](https://en.wikipedia.org/wiki/Linearizability)
* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [NOP instruction](https://en.wikipedia.org/wiki/NOP)
* [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Previous part](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/linux-sync-1.html)
* [Previous part](/SyncPrim/linux-sync-1.md)
@@ -5,7 +5,7 @@
信号量
--------------------------------------------------------------------------------
这是描述内核中同步原语的本[章节](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/index.html)的第三部分。在之前的[部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/linux-sync-2.html)我们见到了特殊的[自旋锁](https://en.wikipedia.org/wiki/Spinlock) - `队列自旋锁`,更早的部分则描述了与`自旋锁`相关的内容。这一部分我们将描述更多的同步原语。
这是描述内核中同步原语的本[章节](/SyncPrim/)的第三部分。在之前的[部分](/SyncPrim/linux-sync-2.md)我们见到了特殊的[自旋锁](https://en.wikipedia.org/wiki/Spinlock) - `队列自旋锁`,更早的部分则描述了与`自旋锁`相关的内容。这一部分我们将描述更多的同步原语。
在 `自旋锁` 之后的下一个我们将要讲到的 [内核同步原语](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)是 [信号量](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)。我们会从理论角度开始学习什么是 `信号量`, 然后我们会像前几章一样讲到Linux内核是如何实现信号量的。
@@ -70,13 +70,13 @@ struct semaphore {
}
```
`__SEMAPHORE_INITIALIZER` 宏传入了 `信号量` 结构体的名字并且初始化这个结构体的各个域。首先我们使用 `__RAW_SPIN_LOCK_UNLOCKED` 宏对给予的 `信号量` 初始化一个 `自旋锁`。就像你从 [之前](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/linux-sync-1.html) 的部分看到那样,`__RAW_SPIN_LOCK_UNLOCKED` 宏是在 [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) 头文件中定义,它展开到 `__ARCH_SPIN_LOCK_UNLOCKED` 宏,而 `__ARCH_SPIN_LOCK_UNLOCKED` 宏又展开到零或者无锁状态
`__SEMAPHORE_INITIALIZER` 宏传入了 `信号量` 结构体的名字并且初始化这个结构体的各个域。首先我们使用 `__RAW_SPIN_LOCK_UNLOCKED` 宏对给予的 `信号量` 初始化一个 `自旋锁`。就像你从 [之前](/SyncPrim/linux-sync-1.md) 的部分看到那样,`__RAW_SPIN_LOCK_UNLOCKED` 宏是在 [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) 头文件中定义,它展开到 `__ARCH_SPIN_LOCK_UNLOCKED` 宏,而 `__ARCH_SPIN_LOCK_UNLOCKED` 宏又展开到零或者无锁状态
```C
#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
```
`信号量` 的最后两个域 `count` 和 `wait_list` 是通过现有资源的数量和空 [链表](https://xinqiu.gitbooks.io/linux-insides-cn/content/DataStructures/linux-datastructures-1.html)来初始化。
`信号量` 的最后两个域 `count` 和 `wait_list` 是通过现有资源的数量和空 [链表](/DataStructures/linux-datastructures-1.md)来初始化。
第二种初始化 `信号量` 的方式是将 `信号量` 和现有资源数目传送给 `sema_init` 函数。 这个函数是在 [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/master/include/linux/semaphore.h) 头文件中定义的。
```C
@@ -88,7 +88,7 @@ static inline void sema_init(struct semaphore *sem, int val)
}
```
我们来看看这个函数是如何实现的。它看起来很简单。函数使用我们刚看到的 `__SEMAPHORE_INITIALIZER` 宏对传入的 `信号量` 进行初始化。就像我们在之前 [部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/index.html) 写的那样,我们将会跳过Linux内核关于 [锁验证](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) 的部分。
我们来看看这个函数是如何实现的。它看起来很简单。函数使用我们刚看到的 `__SEMAPHORE_INITIALIZER` 宏对传入的 `信号量` 进行初始化。就像我们在之前 [部分](/SyncPrim/) 写的那样,我们将会跳过Linux内核关于 [锁验证](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) 的部分。
从现在开始我们知道如何初始化一个 `信号量`,我们看看如何上锁和解锁。Linux内核提供了如下操作 `信号量` 的 [API](https://en.wikipedia.org/wiki/Application_programming_interface)
```
@@ -104,7 +104,7 @@ int down_timeout(struct semaphore *sem, long jiffies);
`down_killable` 函数和 `down_interruptible` 函数提供类似的功能,但是它还将当前进程的 `TASK_KILLABLE` 标志置位。这表示等待的进程可以被杀死信号中断。
`down_trylock` 函数和 `spin_trylock` 函数相似。这个函数试图获取一个锁,如果操作失败则立即返回,因此想获取锁的进程不会等待。最后的 `down_timeout` 函数同样试图获取一个锁,但如果超过了传入的等待时间,处于等待状态的当前进程就会被中断并返回。除此之外你也许注意到,这个等待时间是以 [jiffies](https://xinqiu.gitbooks.io/linux-insides-cn/content/Timers/linux-timers-1.html) 计数的。
`down_trylock` 函数和 `spin_trylock` 函数相似。这个函数试图获取一个锁,如果操作失败则立即返回,因此想获取锁的进程不会等待。最后的 `down_timeout` 函数同样试图获取一个锁,但如果超过了传入的等待时间,处于等待状态的当前进程就会被中断并返回。除此之外你也许注意到,这个等待时间是以 [jiffies](/Timers/linux-timers-1.md) 计数的。
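`down_trylock` 的语义和用户空间 POSIX 信号量的 `sem_trywait` 很接近。下面是一个示意性的类比(`try_down` 是为演示虚构的名字,不是内核 API):获取成功返回 0,无法立即获取则返回 1,调用者不会阻塞。

```c
#include <errno.h>
#include <semaphore.h>

/* down_trylock 的用户空间类比:
 *  0  - 成功获取信号量
 *  1  - 信号量当前不可用,调用者不等待
 * -1  - 出错 */
int try_down(sem_t *sem)
{
    if (sem_trywait(sem) == 0)
        return 0;
    return errno == EAGAIN ? 1 : -1;
}
```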
我们刚刚看了 `信号量` [API](https://en.wikipedia.org/wiki/Application_programming_interface)的定义。我们从 `down` 函数开始看。这个函数是在 [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) 源代码定义的。我们来看看函数实现:
@@ -180,7 +180,7 @@ struct semaphore_waiter waiter;
#define current get_current()
```
`get_current` 函数返回 `current_task` [per-cpu](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-1.html) 变量的值。
`get_current` 函数返回 `current_task` [per-cpu](/Concepts/linux-cpu-1.md) 变量的值。
```C
@@ -339,14 +339,14 @@ static noinline void __sched __up(struct semaphore *sem)
* [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
* [deadlocks](https://en.wikipedia.org/wiki/Deadlock)
* [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29)
* [Doubly linked list in the Linux kernel](https://xinqiu.gitbooks.io/linux-insides-cn/content/DataStructures/linux-datastructures-1.html)
* [jiffies](https://xinqiu.gitbooks.io/linux-insides-cn/content/Timers/linux-timers-1.html)
* [Doubly linked list in the Linux kernel](/DataStructures/linux-datastructures-1.md)
* [jiffies](/Timers/linux-timers-1.md)
* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
* [per-cpu](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-1.html)
* [per-cpu](/Concepts/linux-cpu-1.md)
* [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29)
* [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL)
* [errno](https://en.wikipedia.org/wiki/Errno.h)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Previous part](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/linux-sync-2.html)
* [Previous part](/SyncPrim/linux-sync-2.md)
@@ -430,4 +430,4 @@ Links
* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html)
* [XADD instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html)
* [Previous part](/SyncPrim/linux-sync-4.md)
@@ -6,20 +6,20 @@ Introduction
This is the sixth part of the chapter which describes [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science)) in the Linux kernel. In the previous parts we finished considering the different [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitives. We will continue to learn synchronization primitives in this part and start to consider a similar synchronization primitive which can be used to avoid the `writer starvation` problem. The name of this synchronization primitive is `seqlock` or `sequential locks`.
We know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) that [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, but an exclusive lock is needed for writing or modifying data. As we may guess, it may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire a lock as long as at least one reader process which acquired the lock holds it. So, when contention is high, it will lead to a situation where a writer process which wants to acquire the lock will wait for it for a long time.
We know from the previous [part](/SyncPrim/sync-5.md) that [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, but an exclusive lock is needed for writing or modifying data. As we may guess, it may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire a lock as long as at least one reader process which acquired the lock holds it. So, when contention is high, it will lead to a situation where a writer process which wants to acquire the lock will wait for it for a long time.
The `seqlock` synchronization primitive can help solve this problem.
As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will try to consider this synchronization primitive from the theoretical side and only then will we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate `seqlocks`.
As in all previous parts of this [book](/SUMMARY.md), we will try to consider this synchronization primitive from the theoretical side and only then will we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate `seqlocks`.
So, let's start.
Sequential lock
--------------------------------------------------------------------------------
So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this paragraph. Actually `sequential locks` were introduced in the Linux kernel 2.6.x. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of the `sequential lock` synchronization primitive is the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) synchronization primitive, `sequential locks` work in situations where the protected resources are small and simple. Additionally write access must be rare and also should be fast.
So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this paragraph. Actually `sequential locks` were introduced in the Linux kernel 2.6.x. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of the `sequential lock` synchronization primitive is the [spinlock](/SyncPrim/linux-sync-1.md) synchronization primitive, `sequential locks` work in situations where the protected resources are small and simple. Additionally write access must be rare and also should be fast.
Work of this synchronization primitive is based on the sequence of events counter. Actually a `sequential lock` allows free access to a resource for readers, but each reader must check existence of conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of work of `sequential locks` is simple: Each writer which acquired a sequential lock increments this counter and additionally acquires a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html). When this writer finishes, it will release the acquired spinlock to give access to other writers and increment the counter of a sequential lock again.
Work of this synchronization primitive is based on the sequence of events counter. Actually a `sequential lock` allows free access to a resource for readers, but each reader must check existence of conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of work of `sequential locks` is simple: Each writer which acquired a sequential lock increments this counter and additionally acquires a [spinlock](/SyncPrim/linux-sync-1.md). When this writer finishes, it will release the acquired spinlock to give access to other writers and increment the counter of a sequential lock again.
Read-only access works on the following principle: a reader gets the value of the `sequential lock` counter before it enters the [critical section](https://en.wikipedia.org/wiki/Critical_section) and compares it with the value of the same `sequential lock` counter at the exit of the critical section. If their values are equal, this means that there were no writers during this period. If their values are not equal, this means that a writer has incremented the counter during the [critical section](https://en.wikipedia.org/wiki/Critical_section). This conflict means that reading of the protected data must be repeated.
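The reader/writer protocol described above can be sketched with C11 atomics (a simplified user-space model, not the kernel implementation; names like `seq_sketch` are invented for illustration). The counter is odd while a writer is in progress, and a reader retries when the counter it saw at the beginning differs from the one it sees at the end:

```c
#include <stdatomic.h>

struct seq_sketch {
    atomic_uint seq; /* odd while a writer is updating the data */
    int data;        /* the protected value */
};

unsigned sketch_read_begin(struct seq_sketch *s)
{
    unsigned v;
    /* wait until no writer is active (even counter), remember the value */
    while ((v = atomic_load(&s->seq)) & 1u)
        ;
    return v;
}

int sketch_read_retry(struct seq_sketch *s, unsigned start)
{
    /* non-zero means a writer ran during the read section: retry */
    return atomic_load(&s->seq) != start;
}

void sketch_write(struct seq_sketch *s, int value)
{
    atomic_fetch_add(&s->seq, 1); /* counter becomes odd: writer inside */
    s->data = value;
    atomic_fetch_add(&s->seq, 1); /* counter even again: writer done */
}
```

A real seqlock additionally serializes writers with a spinlock and inserts memory barriers; this sketch only shows the counter protocol.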
@@ -54,7 +54,7 @@ typedef struct {
} seqlock_t;
```
As we may see the `seqlock_t` provides two fields. These fields represent a sequential lock counter, description of which we saw above, and also a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) which will protect data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type, which is a structure:
As we may see the `seqlock_t` provides two fields. These fields represent a sequential lock counter, description of which we saw above, and also a [spinlock](/SyncPrim/linux-sync-1.md) which will protect data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type, which is a structure:
```C
typedef struct seqcount {
@@ -67,7 +67,7 @@ typedef struct seqcount {
which holds counter of a sequential lock and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related field.
As always in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), before we will consider an [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`.
As always in previous parts of this [chapter](/SyncPrim/), before we will consider an [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`.
We saw in the previous parts that often the Linux kernel provides two approaches to execute initialization of a given synchronization primitive. The same situation applies to the `seqlock_t` structure. These approaches allow us to initialize a `seqlock_t` in the two following ways:
@@ -114,7 +114,7 @@ So we just initialize counter of the given sequential lock to zero and additiona
#endif
```
As I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/) we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock` initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We will not consider the implementation of this macro here as it just initializes a [rawspinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) with architecture-specific methods (more about spinlocks you may read in the first parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/)).
As I already wrote in previous parts of this [chapter](/SyncPrim/) we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock` initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We will not consider the implementation of this macro here as it just initializes a [rawspinlock](/SyncPrim/linux-sync-1.md) with architecture-specific methods (more about spinlocks you may read in the first parts of this [chapter](/SyncPrim/)).
We have considered the first way to initialize a sequential lock. Let's consider the second way to do the same, but dynamically. We can initialize a sequential lock with the `seqlock_init` macro which is defined in the same [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.
@@ -149,7 +149,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
}
```
just initializes counter of the given `seqcount_t` with zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter.
just initializes counter of the given `seqcount_t` with zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](/SyncPrim/linux-sync-1.md) of this chapter.
So, now we know how to initialize a `sequential lock`, now let's look at how to use it. The Linux kernel provides following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `sequential locks`:
@@ -223,7 +223,7 @@ static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
which just compares value of the counter of the given `sequential lock` with the initial value of this counter. If the initial value of the counter which is obtained from `read_seqbegin()` function is odd, this means that a writer was in the middle of updating the data when our reader began to act. In this case the value of the data can be in inconsistent state, so we need to try to read it again.
This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of the [timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) chapter. A sequential lock is used to obtain the value of `jiffies` on the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:
This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](/Timers/linux-timers-1.md) of the [timers and time management in the Linux kernel](/Timers/) chapter. A sequential lock is used to obtain the value of `jiffies` on the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:

```C
u64 get_jiffies_64(void)
@@ -303,7 +303,7 @@ static inline void raw_write_seqcount_end(seqcount_t *s)

and in the end we just call the `spin_unlock` macro to give access to other readers or writers.

That's all about the `sequential lock` mechanism in the Linux kernel. Of course we did not consider the full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part, but all other functions are based on those described here. For example, the Linux kernel also provides some safe macros/functions to use the `sequential lock` mechanism in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) of [softirq](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html): `write_seqlock_irq` and `write_sequnlock_irq`:
That's all about the `sequential lock` mechanism in the Linux kernel. Of course we did not consider the full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part, but all other functions are based on those described here. For example, the Linux kernel also provides some safe macros/functions to use the `sequential lock` mechanism in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) of [softirq](/Interrupts/linux-interrupts-9.md): `write_seqlock_irq` and `write_sequnlock_irq`:

```C
static inline void write_seqlock_irq(seqlock_t *sl)
@@ -339,14 +339,14 @@ Links

* [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science))
* [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock)
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
* [spinlock](/SyncPrim/linux-sync-1.md)
* [critical section](https://en.wikipedia.org/wiki/Critical_section)
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [debugging](https://en.wikipedia.org/wiki/Debugging)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/)
* [Timers and time management in the Linux kernel](/Timers/)
* [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler)
* [softirq](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html)
* [softirq](/Interrupts/linux-interrupts-9.md)
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture))
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html)
* [Previous part](/SyncPrim/linux-sync-5.md)

@@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 1.
Introduction
--------------------------------------------------------------------------------

This post adds a new chapter to the [linux-insides](http://xinqiu.gitbooks.io/linux-insides-cn/content/) book. As the title suggests, this chapter covers the concept of the [System Call](https://en.wikipedia.org/wiki/System_call) in the Linux kernel. The choice of topic is not accidental: in the previous [chapter](http://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/index.html) we learned about interrupts and interrupt handling. The concept of a system call is very similar to that of an interrupt, because a software interrupt is the most common way to execute a system call. We will look at system calls from different points of view: what happens when a system call is issued from user space, the implementation of the set of system call handlers in the Linux kernel, the concepts of [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/), and more.
This post adds a new chapter to the [linux-insides](/SUMMARY.md) book. As the title suggests, this chapter covers the concept of the [System Call](https://en.wikipedia.org/wiki/System_call) in the Linux kernel. The choice of topic is not accidental: in the previous [chapter](/Interrupts/) we learned about interrupts and interrupt handling. The concept of a system call is very similar to that of an interrupt, because a software interrupt is the most common way to execute a system call. We will look at system calls from different points of view: what happens when a system call is issued from user space, the implementation of the set of system call handlers in the Linux kernel, the concepts of [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/), and more.

Before we look at how the Linux kernel executes a system call, let's go over some theory about system calls.

@@ -411,4 +411,4 @@
* [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
* [systemd](https://en.wikipedia.org/wiki/Systemd)
* [epoll](https://en.wikipedia.org/wiki/Epoll)
* [Previous chapter](http://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/index.html)
* [Previous chapter](/Interrupts/)

@@ -4,7 +4,7 @@ Linux 系统内核调用 第二节
|
||||
Linux 内核如何处理系统调用
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
前一[小节](http://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html) 作为本章节的第一部分描述了 Linux 内核[system call](https://en.wikipedia.org/wiki/System_call) 概念。
|
||||
前一[小节](/SysCall/linux-syscall-1.md) 作为本章节的第一部分描述了 Linux 内核[system call](https://en.wikipedia.org/wiki/System_call) 概念。
|
||||
前一节中提到通常系统调用处于内核处于操作系统层面。 前一节内容从用户空间的角度介绍,并且 [write](http://man7.org/linux/man-pages/man2/write.2.html)系统调用实现的一部分内容没有讨论。 在这一小节继续关注系统调用,在深入 Linux 内核之前,从一些理论开始。
|
||||
|
||||
程序中一个用户程序并不直接使用系统调用。 我们并未这样写 `Hello World`程序代码:
|
||||
@@ -114,7 +114,7 @@ asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
|
||||
|
||||
之后所有指向 “non-implemented” 系统调用元素的内容为 `sys_ni_syscall` 函数的地址,该函数仅返回 `-ENOSYS`。 其他元素指向 `sys_syscall_name` 函数。
|
||||
|
||||
至此,我们完成了系统调用表的填充并且 Linux 内核了解每个系统调用处理器的位置。 但是 Linux 内核在处理用户空间程序的系统调用时并未立即调用 `sys_syscall_name` 函数。 记住关于中断及中断处理的[章节](http://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/index.html)。 当 Linux 内核获得处理中断的控制权,在调用中断处理程序前,必须做一些准备如保存用户空间寄存器、切换至新的堆栈及其他很多工作。 系统调用处理也是相同的情形。 第一件事是处理系统调用的准备,但是在 Linux 内核开始这些准备之前,系统调用的入口必须完成初始化,同时只有 Linux 内核知道如何执行这些准备。 在下一章节我们将关注 Linux 内核中关于系统调用入口的初始化过程。
|
||||
至此,我们完成了系统调用表的填充并且 Linux 内核了解每个系统调用处理器的位置。 但是 Linux 内核在处理用户空间程序的系统调用时并未立即调用 `sys_syscall_name` 函数。 记住关于中断及中断处理的[章节](/Interrupts/)。 当 Linux 内核获得处理中断的控制权,在调用中断处理程序前,必须做一些准备如保存用户空间寄存器、切换至新的堆栈及其他很多工作。 系统调用处理也是相同的情形。 第一件事是处理系统调用的准备,但是在 Linux 内核开始这些准备之前,系统调用的入口必须完成初始化,同时只有 Linux 内核知道如何执行这些准备。 在下一章节我们将关注 Linux 内核中关于系统调用入口的初始化过程。
Initialization of the system call entry
--------------------------------------------------------------------------------
@@ -125,7 +125,7 @@ asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR.
```

This means that we need to put the address of the system call entry into the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This is done during Linux kernel initialization. If you have read the [fourth part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/linux-interrupts-4.html) of the chapter about interrupts and interrupt handling in the Linux kernel, you know that during initialization the Linux kernel calls the `trap_init` function. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and performs the initialization of `non-early` exception handlers (such as divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error and so on). Besides this, it calls the `cpu_init` function from [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c), which in turn calls `syscall_init` from the same source file to finish the `per-cpu` state initialization.
This means that we need to put the address of the system call entry into the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This is done during Linux kernel initialization. If you have read the [fourth part](/Interrupts/linux-interrupts-4.md) of the chapter about interrupts and interrupt handling in the Linux kernel, you know that during initialization the Linux kernel calls the `trap_init` function. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and performs the initialization of `non-early` exception handlers (such as divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error and so on). Besides this, it calls the `cpu_init` function from [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c), which in turn calls `syscall_init` from the same source file to finish the `per-cpu` state initialization.

This function performs the initialization of the system call entry point. Look at its implementation: it takes no parameters and first of all fills two model specific registers:

@@ -180,7 +180,7 @@ wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
```

You can read more about the `Global Descriptor Table` in the [chapter](http://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-2.html) that describes the booting process of the Linux kernel.
You can read more about the `Global Descriptor Table` in the [chapter](/Booting/linux-bootstrap-2.md) that describes the booting process of the Linux kernel.

At the end of the `syscall_init` function, flag bits in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) are masked by writing to the `MSR_SYSCALL_MASK` model specific register:

@@ -209,7 +209,7 @@ SWAPGS_UNSAFE_STACK
#define SWAPGS_UNSAFE_STACK swapgs
```

This macro exchanges the GS segment selector with the value in the `MSR_KERNEL_GS_BASE` model specific register. In other words, it puts us onto the kernel stack. After that, the old stack pointer is saved in the `rsp_scratch` [per-cpu](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-1.html) variable, and the stack pointer is set to the top of the stack of the current processor:
This macro exchanges the GS segment selector with the value in the `MSR_KERNEL_GS_BASE` model specific register. In other words, it puts us onto the kernel stack. After that, the old stack pointer is saved in the `rsp_scratch` [per-cpu](/Concepts/linux-cpu-1.md) variable, and the stack pointer is set to the top of the stack of the current processor:

```assembly
movq %rsp, PER_CPU_VAR(rsp_scratch)
@@ -373,7 +373,7 @@ USERGS_SYSRET64
Conclusion
--------------------------------------------------------------------------------

This is the end of the second part about the system call concept in the Linux kernel. In the [previous part](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-1.html) we discussed this concept from the point of view of a user application. In this part we continued to dive into system calls and discussed what the Linux kernel does when a system call occurs.
This is the end of the second part about the system call concept in the Linux kernel. In the [previous part](/SysCall/linux-syscall-1.md) we discussed this concept from the point of view of a user application. In this part we continued to dive into system calls and discussed what the Linux kernel does when a system call occurs.

If you have questions or suggestions, ping me on twitter @[0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/MintCN/linux-insides-zh/issues/new).

@@ -397,8 +397,8 @@ Links
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [per-cpu](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-1.html)
* [per-cpu](/Concepts/linux-cpu-1.md)
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
* [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf)
* [previous chapter](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-1.html)
* [previous chapter](/SysCall/linux-syscall-1.md)

@@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 3.
vsyscalls and vDSO
--------------------------------------------------------------------------------

This is the third part of the [chapter](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/index.html) about system calls in the Linux kernel. The [previous part](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-2.html) discussed the preparations made for a system call issued by a user-space application and the handling of the system call. In this part we discuss two concepts that are very similar to a system call: `vsyscall` and `vdso`.
This is the third part of the [chapter](/SysCall/) about system calls in the Linux kernel. The [previous part](/SysCall/linux-syscall-2.md) discussed the preparations made for a system call issued by a user-space application and the handling of the system call. In this part we discuss two concepts that are very similar to a system call: `vsyscall` and `vdso`.

We already know what a `system call` is: a special mechanism of the Linux kernel that lets a user-space application request privileged tasks, like writing to a file or opening a socket. As you know, invoking a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task, switch context to kernel mode and jump back to user space after the system call handler finishes. The two mechanisms `vsyscall` and `vdso` were designed to speed up system call handling, and in this part we will learn how they work.

@@ -24,7 +24,7 @@ ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
```

So, these system calls are executed in user space, which means no [context switch](https://en.wikipedia.org/wiki/Context_switch) takes place. The mapping of the `vsyscall` page is implemented in the `map_vsyscall` function defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during Linux kernel initialization by the `setup_arch` function defined in [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) (we discussed that function in the [fifth part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization chapter).
So, these system calls are executed in user space, which means no [context switch](https://en.wikipedia.org/wiki/Context_switch) takes place. The mapping of the `vsyscall` page is implemented in the `map_vsyscall` function defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during Linux kernel initialization by the `setup_arch` function defined in [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) (we discussed that function in the [fifth part](/Initialization/linux-initialization-5.md) of the Linux kernel initialization chapter).

Note that the implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option:

@@ -49,7 +49,7 @@ void __init map_vsyscall(void)
}
```

At the beginning of `map_vsyscall`, we obtain the physical address of the `vsyscall` page with the `__pa_symbol` macro (we discussed the implementation of this macro in the [fourth part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). `__vsyscall_page` is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):
At the beginning of `map_vsyscall`, we obtain the physical address of the `vsyscall` page with the `__pa_symbol` macro (we discussed the implementation of this macro in the [fourth part](/Initialization/linux-initialization-4.md) of the Linux kernel initialization process). `__vsyscall_page` is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):

```
ffffffff81881000 D __vsyscall_page
@@ -80,7 +80,7 @@ __vsyscall_page:
ret
```

Back to the implementation of `map_vsyscall` and `__vsyscall_page`: after we get the physical address of `__vsyscall_page`, we check the `vsyscall_mode` variable and set the [fix-mapped](http://xinqiu.gitbooks.io/linux-insides-cn/content/MM/linux-mm-2.html) address for the `vsyscall` page with `__set_fixmap`:
Back to the implementation of `map_vsyscall` and `__vsyscall_page`: after we get the physical address of `__vsyscall_page`, we check the `vsyscall_mode` variable and set the [fix-mapped](/MM/linux-mm-2.md) address for the `vsyscall` page with `__set_fixmap`:

```C
if (vsyscall_mode != NONE)
@@ -140,9 +140,9 @@ static int __init vsyscall_setup(char *str)
early_param("vsyscall", vsyscall_setup);
```

More information about the `early_param` macro can be found in the [sixth part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-6.html) of the Linux kernel initialization chapter.
More information about the `early_param` macro can be found in the [sixth part](/Initialization/linux-initialization-6.md) of the Linux kernel initialization chapter.
At the end of the `vsyscall_map` function we just check, with the [BUILD_BUG_ON](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) macro, that the virtual address of the `vsyscall` page is equal to the value of the `VSYSCALL_ADDR` variable:
At the end of the `vsyscall_map` function we just check, with the [BUILD_BUG_ON](/Initialization/linux-initialization-1.md) macro, that the virtual address of the `vsyscall` page is equal to the value of the `VSYSCALL_ADDR` variable:

```C
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
@@ -252,7 +252,7 @@ Here we can see that [uname](https://en.wikipedia.org/wiki/Uname) util was linke
* `libc.so.6`;
* `ld-linux-x86-64.so.2`.

The first provides the `vDSO` functionality, the second is the `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (you can read more about it in the part that describes [linkers](https://xinqiu.gitbooks.io/linux-insides-cn/content/Misc/linux-misc-3.html)). So, the `vDSO` solves the limitations of `vsyscall`. The implementation of the `vDSO` is similar to that of `vsyscall`.
The first provides the `vDSO` functionality, the second is the `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (you can read more about it in the part that describes [linkers](/Misc/linux-misc-3.md)). So, the `vDSO` solves the limitations of `vsyscall`. The implementation of the `vDSO` is similar to that of `vsyscall`.
Initialization of the `vDSO` occurs in the `init_vdso` function defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file. This function starts with the initialization of the `vDSO` images for 32-bit and 64-bit, depending on the `CONFIG_X86_X32_ABI` kernel configuration option:

@@ -370,11 +370,11 @@ That's all.
Conclusion
--------------------------------------------------------------------------------

This is the end of the third part about the system call concept in the Linux kernel. In the previous [part](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-2.html) we discussed the preparations made on the Linux kernel side before a system call is handled, and the implementation of the `exit` process from a system call handler. In this part we continued to dive into the stuff related to the system call concept and learned two new concepts that are very similar to a system call: `vsyscall` and `vDSO`.
This is the end of the third part about the system call concept in the Linux kernel. In the previous [part](/SysCall/linux-syscall-2.md) we discussed the preparations made on the Linux kernel side before a system call is handled, and the implementation of the `exit` process from a system call handler. In this part we continued to dive into the stuff related to the system call concept and learned two new concepts that are very similar to a system call: `vsyscall` and `vDSO`.

After all of these three parts, we know almost everything related to system calls: what a system call is and why user applications need them. We also know what happens when a user application calls a system call and how the kernel handles it.

The next part will be the last part of this [chapter](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/index.html) and we will see what happens when a user runs a program.
The next part will be the last part of this [chapter](/SysCall/) and we will see what happens when a user runs a program.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/MintCN/linux-insides-zh/issues/new).

@@ -390,14 +390,14 @@ Links
* [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space)
* [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation)
* [enum](https://en.wikipedia.org/wiki/Enumerated_type)
* [fix-mapped addresses](http://xinqiu.gitbooks.io/linux-insides-cn/content/MM/linux-mm-2.html)
* [fix-mapped addresses](/MM/linux-mm-2.md)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [BUILD_BUG_ON](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html)
* [BUILD_BUG_ON](/Initialization/linux-initialization-1.md)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [Page fault](https://en.wikipedia.org/wiki/Page_fault)
* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
* [uname](https://en.wikipedia.org/wiki/Uname)
* [Linkers](https://xinqiu.gitbooks.io/linux-insides-cn/content/Misc/linux-misc-3.html)
* [Previous part](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-2.html)
* [Linkers](/Misc/linux-misc-3.md)
* [Previous part](/SysCall/linux-syscall-2.md)

@@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 4.
How does the Linux kernel run a program
--------------------------------------------------------------------------------

This is the fourth part of the [chapter](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel, and as I wrote in the conclusion of the [previous part](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-3.html), this part will be the last in this chapter. In the previous part we stopped at two new concepts:
This is the fourth part of the [chapter](/SysCall/) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel, and as I wrote in the conclusion of the [previous part](/SysCall/linux-syscall-3.md), this part will be the last in this chapter. In the previous part we stopped at two new concepts:

* `vsyscall`;
* `vDSO`;
@@ -73,7 +73,7 @@ So, an user application (`bash` in our case) calls the system call and as we alr
execve system call
--------------------------------------------------------------------------------

We saw the preparations made before a system call is invoked by a user application, and what happens after a system call handler finishes its work, in the second [part](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and, as we already know, takes three arguments:
We saw the preparations made before a system call is invoked by a user application, and what happens after a system call handler finishes its work, in the second [part](/SysCall/linux-syscall-2.md) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and, as we already know, takes three arguments:

```
SYSCALL_DEFINE3(execve,
@@ -334,7 +334,7 @@ if (!elf_phdata)
goto out;
```

that describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). It reads the `program interpreter` and the libraries linked with our executable binary file from disk and loads them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](https://xinqiu.gitbooks.io/linux-insides-cn/content/Misc/linux-misc-3.html), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many, many other things to prepare the executable file for execution.
that describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). It reads the `program interpreter` and the libraries linked with our executable binary file from disk and loads them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](/Misc/linux-misc-3.md), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many, many other things to prepare the executable file for execution.
At the end of the execution of `load_elf_binary` we call the `start_thread` function and pass three arguments to it:

@@ -424,7 +424,7 @@ Links
* [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha)
* [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF)
* [segments](https://en.wikipedia.org/wiki/Memory_segmentation)
* [Linkers](https://xinqiu.gitbooks.io/linux-insides-cn/content/Misc/linux-misc-3.html)
* [Linkers](/Misc/linux-misc-3.md)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [Previous part](http://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-3.html)
* [Previous part](/SysCall/linux-syscall-3.md)

@@ -59,7 +59,7 @@ SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
}
```

If you have read the [previous part](https://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-4.html), you know that system calls are defined with the `SYSCALL_DEFINE` macro, and the `open` system call is no exception.
If you have read the [previous part](/SysCall/linux-syscall-4.md), you know that system calls are defined with the `SYSCALL_DEFINE` macro, and the `open` system call is no exception.
The `open` system call lives in the [fs/open.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/open.c) source file and at first glance it is very short.
@@ -406,4 +406,4 @@ The fifth part about the implementation of different system calls in the Linux kernel is over
* [inode](https://en.wikipedia.org/wiki/Inode)
* [RCU](https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt)
* [read](http://man7.org/linux/man-pages/man2/read.2.html)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html)
* [previous part](/SysCall/linux-syscall-4.md)
@@ -4,7 +4,7 @@
Introduction
--------------------------------------------------------------------------------

In the [fifth part](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html) of the Linux kernel booting process we learned what the kernel does at its earliest stage. In the next step, before we can understand how the kernel runs the first init process, the kernel initializes its remaining parts: it loads `initrd`, initializes lockdep, and does many, many other things.
In the [fifth part](/Booting/linux-bootstrap-5.md) of the Linux kernel booting process we learned what the kernel does at its earliest stage. In the next step, before we can understand how the kernel runs the first init process, the kernel initializes its remaining parts: it loads `initrd`, initializes lockdep, and does many, many other things.

Yes, there will be a lot of different things, but much, much more of the work is about **memory**.

@@ -258,4 +258,4 @@ readelf -s vmlinux | grep ffffffff81000000
* [MMU](http://en.wikipedia.org/wiki/Memory_management_unit)
* [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md)
* [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)
* [Last part - Kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html)
* [Last part - Kernel booting process](/Booting/linux-bootstrap-5.md)

@@ -40,7 +40,7 @@ The `asm` keyword may be used in place of `__asm__`, however `__asm__` is portab

If you know the assembly programming language, this looks pretty familiar. The main difficulty lies in the second, `extended`, form of inline assembly statements. This form allows us to pass parameters to an assembly statement, perform [jumps](https://en.wikipedia.org/wiki/Branch_%28computer_science%29) and so on. It does not sound difficult, but it requires knowledge of special rules in addition to knowledge of the assembly language. Every time I see yet another piece of inline assembly code in the Linux kernel, I need to refer to the official [documentation](https://gcc.gnu.org/onlinedocs/) of `GCC` to remember how a particular `qualifier` behaves or what, for example, `=&r` means.
I've decided to write this part to consolidate my knowledge of inline assembly, as inline assembly statements are quite common in the Linux kernel and we will see them in [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) parts from time to time. I thought it would be useful to have a special part that covers the more important aspects of inline assembly. Of course you may find comprehensive information about inline assembly in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C), but I like to have everything in one place.
I've decided to write this part to consolidate my knowledge of inline assembly, as inline assembly statements are quite common in the Linux kernel and we will see them in [linux-insides](/SUMMARY.md) parts from time to time. I thought it would be useful to have a special part that covers the more important aspects of inline assembly. Of course you may find comprehensive information about inline assembly in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C), but I like to have everything in one place.

**Note: This part will not be a guide to assembly programming. It is not intended to teach you to write programs with an assembler or to explain what one or another assembler instruction means. It is just a little memo for extended asm.**

@@ -4,16 +4,16 @@ Timers and time management in the Linux kernel. Part 1.
Introduction
--------------------------------------------------------------------------------

This is yet another post that opens a new chapter in the [linux-insides](http://xinqiu.gitbooks.io/linux-insides-cn/content/) book. The previous [part](https://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/linux-syscall-4.html) was the last part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concept, and now it is time to start a new chapter. As you can understand from the post's title, this chapter will be devoted to `timers` and `time management` in the Linux kernel. The choice of topic is not accidental: timers, and time management in general, are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks: different timeouts, for example in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation; knowing the current time; scheduling asynchronous functions; next-event interrupt scheduling; and many, many more.
This is yet another post that opens a new chapter in the [linux-insides](/SUMMARY.md) book. The previous [part](/SysCall/linux-syscall-4.md) was the last part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concept, and now it is time to start a new chapter. As you can understand from the post's title, this chapter will be devoted to `timers` and `time management` in the Linux kernel. The choice of topic is not accidental: timers, and time management in general, are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks: different timeouts, for example in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation; knowing the current time; scheduling asynchronous functions; next-event interrupt scheduling; and many, many more.

So, we will start to learn the implementation of the different time-management-related pieces in this part. We will see different types of timers and how the different Linux kernel subsystems use them. As always, we will start from the earliest part of the Linux kernel and go through its initialization process. We already did this in the special [chapter](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html) which describes the initialization process of the Linux kernel, but as you may remember, we missed some things there, and one of them is the initialization of timers.
So, we will start to learn the implementation of the different time-management-related pieces in this part. We will see different types of timers and how the different Linux kernel subsystems use them. As always, we will start from the earliest part of the Linux kernel and go through its initialization process. We already did this in the special [chapter](/Initialization/) which describes the initialization process of the Linux kernel, but as you may remember, we missed some things there, and one of them is the initialization of timers.
|
||||
|
||||
Let's start.
|
||||
|
||||
Initialization of non-standard PC hardware clock
--------------------------------------------------------------------------------

After the Linux kernel was decompressed (you can read more about this in the [Kernel decompression](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html) part), the architecture non-specific code starts to work in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. After initialization of the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialization of [cgroups](https://en.wikipedia.org/wiki/Cgroups), and setting the [canary](https://en.wikipedia.org/wiki/Buffer_overflow_protection) value, we can see the call of the `setup_arch` function.
After the Linux kernel was decompressed (you can read more about this in the [Kernel decompression](/Booting/linux-bootstrap-5.md) part), the architecture non-specific code starts to work in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. After initialization of the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialization of [cgroups](https://en.wikipedia.org/wiki/Cgroups), and setting the [canary](https://en.wikipedia.org/wiki/Buffer_overflow_protection) value, we can see the call of the `setup_arch` function.

As you may remember, this function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file and prepares/initializes architecture-specific things (for example, it reserves space for the [bss](https://en.wikipedia.org/wiki/.bss) section, reserves space for the [initrd](https://en.wikipedia.org/wiki/Initrd), parses the kernel command line, and much more). Besides this, we can find some time management related functions there.

@@ -433,4 +433,4 @@ Links
* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
* [seqlocks](https://en.wikipedia.org/wiki/Seqlock)
* [clocksource documentation](https://www.kernel.org/doc/Documentation/timers/timekeeping.txt)
* [Previous chapter](https://xinqiu.gitbooks.io/linux-insides-cn/content/SysCall/index.html)
* [Previous chapter](/SysCall/README.md)