CPU masks
================================================================================

Introduction

介绍
--------------------------------------------------------------------------------
`Cpumasks` is a special way provided by the Linux kernel to store information about CPUs in the system. The relevant source code and header files which contain the API for `Cpumasks` manipulation:

`Cpumasks` 是 Linux 内核提供的保存系统中 CPU 信息的特殊方法。包含 `Cpumasks` 操作 API 的相关源码和头文件:

* [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h)
* [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c)
* [kernel/cpu.c](https://github.com/torvalds/linux/blob/master/kernel/cpu.c)

As the comment in [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h) says: Cpumasks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. We already saw a bit about cpumask in the `boot_cpu_init` function from the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. This function marks the first boot cpu as online, active, etc.:

正如 [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h) 的注释所说:Cpumasks 提供了适合表示系统中 CPU 集合的位图,每个 CPU 序号占一位。我们已经在 [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) 部分的 `boot_cpu_init` 函数中看到了一点 cpumask。这个函数将第一个启动的 cpu 标记为上线、激活等等:
```C
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
```
`set_cpu_possible` sets a bit in the set of cpu IDs which can be plugged in at any time during the life of that system boot. `cpu_present` represents which CPUs are currently plugged in. `cpu_online` represents a subset of `cpu_present` and indicates CPUs which are available for scheduling. These masks depend on the `CONFIG_HOTPLUG_CPU` configuration option; if this option is disabled, `possible == present` and `active == online`. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is `true`, it calls `cpumask_set_cpu`, otherwise it calls `cpumask_clear_cpu`.

`set_cpu_possible` 设置的是系统运行期间任意时刻都可能插入的 cpu ID 集合。`cpu_present` 代表了当前插入的 CPU。`cpu_online` 是 `cpu_present` 的子集,表示可供调度的 CPU。这些掩码依赖于 `CONFIG_HOTPLUG_CPU` 配置选项,如果该选项被禁用,那么 `possible == present` 且 `active == online`。这些函数的实现很相似:检测第二个参数,如果为 `true`,就调用 `cpumask_set_cpu`,否则调用 `cpumask_clear_cpu`。
There are two ways for a `cpumask` creation. First is to use `cpumask_t`. It is defined as:

有两种方法创建 `cpumask`。第一种是用 `cpumask_t`。定义如下:

```C
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
```
It wraps the `cpumask` structure which contains one bitmask `bits` field. The `DECLARE_BITMAP` macro takes two parameters:

它封装了 `cpumask` 结构,其包含了一个位掩码 `bits` 字段。`DECLARE_BITMAP` 宏有两个参数:

* bitmap name;
* number of bits.

and creates an array of `unsigned long` with the given name. Its implementation is pretty easy:

并以给定名称创建了一个 `unsigned long` 数组。它的实现非常简单:

```C
#define DECLARE_BITMAP(name,bits) \
        unsigned long name[BITS_TO_LONGS(bits)]
```
where `BITS_TO_LONGS`:

其中 `BITS_TO_LONGS`:

```C
#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
```

As we are focusing on the `x86_64` architecture, `unsigned long` is 8 bytes (64 bits) in size, so for example with `NR_CPUS` equal to 8, our array will contain only one element:

因为我们专注于 `x86_64` 架构,`unsigned long` 是 8 字节(64 位)大小,因此例如当 `NR_CPUS` 为 8 时,我们的数组仅包含一个元素:

```
(8 + 64 - 1) / 64 = 1
```
The `NR_CPUS` macro represents the number of CPUs in the system and depends on the `CONFIG_NR_CPUS` macro which is defined in [include/linux/threads.h](https://github.com/torvalds/linux/blob/master/include/linux/threads.h) and looks like this:

`NR_CPUS` 宏表示的是系统中 CPU 的数目,且依赖于在 [include/linux/threads.h](https://github.com/torvalds/linux/blob/master/include/linux/threads.h) 中定义的 `CONFIG_NR_CPUS` 宏,看起来像这样:

```C
#ifndef CONFIG_NR_CPUS
#define CONFIG_NR_CPUS	1
#endif

#define NR_CPUS		CONFIG_NR_CPUS
```
The second way to define a cpumask is to use the `DECLARE_BITMAP` macro directly and the `to_cpumask` macro which converts the given bitmap to `struct cpumask *`:

第二种定义 cpumask 的方法是直接使用宏 `DECLARE_BITMAP` 和 `to_cpumask` 宏,后者将给定的位图转化为 `struct cpumask *`:

```C
#define to_cpumask(bitmap)                                              \
        ((struct cpumask *)(1 ? (bitmap)                                \
                            : (void *)sizeof(__check_is_bitmap(bitmap))))
```
We can see the ternary operator here which is `true` every time. The `__check_is_bitmap` inline function is defined as:

可以看到这里的三目运算符每次总是 `true`。`__check_is_bitmap` 内联函数定义为:

```C
static inline int __check_is_bitmap(const unsigned long *bitmap)
{
	return 1;
}
```
And it returns `1` every time. We need it here for only one purpose: at compile time it checks that a given `bitmap` is a bitmap, or in other words it checks that a given `bitmap` has the type `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro to convert an array of `unsigned long` to a `struct cpumask *`.

它每次都返回 `1`。我们需要它只有一个目的:在编译时检测一个给定的 `bitmap` 是一个位图,换句话说,它检测给定的 `bitmap` 是否有 `unsigned long *` 类型。因此我们把 `cpu_possible_bits` 传递给宏 `to_cpumask`,将 `unsigned long` 数组转换为 `struct cpumask *`。
cpumask API
--------------------------------------------------------------------------------
Now that we can define a cpumask with one of these methods, we can use the API the Linux kernel provides for manipulating cpumasks. Let's consider one of the functions presented above, for example `set_cpu_online`. This function takes two parameters:

既然我们可以用其中一种方法来定义 cpumask,我们就可以使用 Linux 内核提供的操作 cpumask 的 API。我们来研究下上面出现过的其中一个函数,例如 `set_cpu_online`,这个函数有两个参数:

* CPU number;
* CPU status.

* CPU 序号;
* CPU 状态。

The implementation of this function looks like:

这个函数的实现如下所示:
```C
void set_cpu_online(unsigned int cpu, bool online)
{
	if (online) {
		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
	} else {
		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
	}
}
```

First of all it checks the second `online` parameter and calls `cpumask_set_cpu` or `cpumask_clear_cpu` depending on it. Here we can see the cast of the second parameter to `struct cpumask *` in the `cpumask_set_cpu` call. In our case it is `cpu_online_bits`, which is a bitmap defined as:

该函数首先检测第二个 `online` 参数,并根据它调用 `cpumask_set_cpu` 或 `cpumask_clear_cpu`。这里我们可以看到 `cpumask_set_cpu` 的第二个参数被转换为 `struct cpumask *`。在我们的例子中是位图 `cpu_online_bits`,定义如下:

```C
static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
```
The `cpumask_set_cpu` function makes only one call to the `set_bit` function:

函数 `cpumask_set_cpu` 仅调用了一次 `set_bit` 函数:

```C
static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
	set_bit(cpumask_check(cpu), cpumask_bits(dstp));
}
```
The `set_bit` function takes two parameters too, and sets the given bit (first parameter) in the memory pointed to by the second parameter, the `cpu_online_bits` bitmap. We can see here that before `set_bit` is called, its two parameters are passed to

`set_bit` 函数也有两个参数:它在第二个参数(即 `cpu_online_bits` 位图)指向的内存中设置给定的位(第一个参数)。这里我们可以看到,在调用 `set_bit` 之前,它的两个参数会传递给

* cpumask_check;
* cpumask_bits.

Let's consider these two macros. The first, `cpumask_check`, does nothing in our case and just returns the given parameter. The second, `cpumask_bits`, just returns the `bits` field from the given `struct cpumask *` structure:

让我们细看下这两个宏。第一个 `cpumask_check` 在我们的例子里没做任何事,只是返回了给定的参数。第二个 `cpumask_bits` 只是返回了传入的 `struct cpumask *` 结构的 `bits` 字段:

```C
#define cpumask_bits(maskp) ((maskp)->bits)
```
Now let's look at the `set_bit` implementation:

现在让我们看下 `set_bit` 的实现:

```C
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
	if (IS_IMMEDIATE(nr)) {
		asm volatile(LOCK_PREFIX "orb %1,%0"
			: CONST_MASK_ADDR(nr, addr)
			: "iq" ((u8)CONST_MASK(nr))
			: "memory");
	} else {
		asm volatile(LOCK_PREFIX "bts %1,%0"
			: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
	}
}
```
This function looks scary, but it is not as hard as it seems. First of all it passes `nr`, or the number of the bit, to the `IS_IMMEDIATE` macro which just calls the GCC built-in `__builtin_constant_p` function:

这个函数看着吓人,但它没有看起来那么难。首先它把 `nr`(或者说位序号)传给 `IS_IMMEDIATE` 宏,该宏调用了 GCC 内建函数 `__builtin_constant_p`:

```C
#define IS_IMMEDIATE(nr) (__builtin_constant_p(nr))
```
`__builtin_constant_p` checks whether the given parameter is a known constant at compile time. As our `cpu` is not a compile-time constant, the `else` clause will be executed:

`__builtin_constant_p` 检查给定参数是否是编译时已知的常量。因为我们的 `cpu` 不是编译时常量,将会执行 `else` 分支:

```C
asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
```
Let's try to understand how it works step by step:

让我们试着一步一步来理解它如何工作的:

`LOCK_PREFIX` expands to the x86 `lock` instruction prefix. This prefix tells the cpu to occupy the system bus while the instruction is executed. This allows the CPU to synchronize memory access, preventing simultaneous access of multiple processors (or devices - the DMA controller for example) to one memory cell.

`LOCK_PREFIX` 展开为 x86 的 `lock` 指令前缀。这个前缀告诉 CPU 在指令执行期间占据系统总线。这允许 CPU 同步内存访问,防止多个处理器(或设备,比如 DMA 控制器)并发访问同一个内存单元。

`BITOP_ADDR` casts the given parameter to `(volatile long *)` and adds the `+m` constraint. `+` means that this operand is both read and written by the instruction. `m` shows that this is a memory operand. `BITOP_ADDR` is defined as:

`BITOP_ADDR` 将给定参数转换为 `(volatile long *)` 并加上 `+m` 约束。`+` 意味着这个操作数对于指令是可读写的。`m` 表示这是一个内存操作数。`BITOP_ADDR` 定义如下:

```C
#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
```
Next is the `memory` clobber. It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters).

接下来是 `memory` 破坏描述符。它告诉编译器这段汇编代码会对输入和输出操作数列表之外的内存进行读写(例如,访问某个输入参数所指向的内存)。

`Ir` - immediate or register operand.

`Ir` - 立即数或寄存器操作数。
The `bts` instruction sets a given bit in a bit string and stores the previous value of that bit in the `CF` flag. So we passed the cpu number, which is zero in our case, and after `set_bit` is executed, it sets the zero bit in the `cpu_online_bits` cpumask. It means that the first cpu is online at this moment.

`bts` 指令设置位串中的给定位,并将该位原来的值存储到 `CF` 标志位。我们传递了 cpu 号(在我们的例子中为 0)给 `set_bit`,它执行后就设置了 `cpu_online_bits` cpumask 中的第 0 位。这意味着第一个 cpu 此时上线了。

Besides the `set_cpu_*` API, cpumask of course provides another API for cpumasks manipulation. Let's consider it in short.

当然,除了 `set_cpu_*` API 外,cpumask 还提供了其它操作 cpumask 的 API。让我们简短看下。
Additional cpumask API

附加的 cpumask API
--------------------------------------------------------------------------------
cpumask provides a set of macros for getting the number of CPUs in various states. For example:

cpumask 提供了一系列宏来获取处于不同状态的 CPU 数量。例如:

```C
#define num_online_cpus()	cpumask_weight(cpu_online_mask)
```
This macro returns the number of `online` CPUs. It calls the `cpumask_weight` function with the `cpu_online_mask` bitmap. The `cpumask_weight` function makes one call of the `bitmap_weight` function with two parameters:

这个宏返回 `online` CPU 的数量。它以 `cpu_online_mask` 位图为参数调用 `cpumask_weight` 函数。`cpumask_weight` 函数使用两个参数调用了一次 `bitmap_weight` 函数:

* cpumask bitmap;
* `nr_cpumask_bits` - which is `NR_CPUS` in our case.

* cpumask 位图;
* `nr_cpumask_bits` - 在我们的例子中就是 `NR_CPUS`。

```C
static inline unsigned int cpumask_weight(const struct cpumask *srcp)
{
	return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
}
```
and calculates the number of set bits in the given bitmap. Besides `num_online_cpus`, cpumask provides macros for all the CPU states:

并计算给定位图中被置位的位数。除了 `num_online_cpus`,cpumask 还提供了所有 CPU 状态的宏:

* num_possible_cpus;
* num_active_cpus;
* cpu_online;
* cpu_possible.

and many more.

等等。
Besides that, the Linux kernel provides the following API for the manipulation of `cpumask`:

* `for_each_cpu` - iterates over every cpu in a mask;
* `for_each_cpu_not` - iterates over every cpu in a complemented mask;
* `cpumask_clear_cpu` - clears a cpu in a cpumask;
* `cpumask_test_cpu` - tests a cpu in a mask;
* `cpumask_setall` - sets all cpus in a mask;
* `cpumask_size` - returns the size to allocate for a 'struct cpumask' in bytes;

除此之外,Linux 内核还提供了下述操作 `cpumask` 的 API:

* `for_each_cpu` - 遍历一个 mask 中的所有 cpu;
* `for_each_cpu_not` - 遍历补集 mask 中的所有 cpu;
* `cpumask_clear_cpu` - 清除 cpumask 中的一个 cpu;
* `cpumask_test_cpu` - 测试 mask 中的一个 cpu;
* `cpumask_setall` - 设置 mask 中的所有 cpu;
* `cpumask_size` - 返回为 'struct cpumask' 分配的字节数大小;
and many many more...

还有很多……

Links

链接
--------------------------------------------------------------------------------

* [cpumask documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
The initcall mechanism

initcall 机制
================================================================================

Introduction

介绍
--------------------------------------------------------------------------------
As you may understand from the title, this part will cover an interesting and important concept in the Linux kernel which is called `initcall`. We already saw definitions like these:

就像你从标题所理解的,这部分将涉及 Linux 内核中一个有趣且重要的概念,称之为 `initcall`。在 Linux 内核中,我们可以看到类似这样的定义:

```C
early_param("debug", debug_kernel);
```

or

或者

```C
arch_initcall(init_pit_clocksource);
```
in some parts of the Linux kernel. Before we see how this mechanism is implemented in the Linux kernel, we must know what it actually is and how the Linux kernel uses it. Definitions like these represent a [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) function which will be called either during initialization of the Linux kernel or right after it. Actually the main point of the `initcall` mechanism is to determine the correct order of built-in module and subsystem initialization. For example, let's look at the following function:

在我们分析这个机制在内核中是如何实现的之前,我们必须了解这个机制是什么,以及 Linux 内核是如何使用它的。像这样的定义表示一个[回调函数](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29),它会在 Linux 内核初始化期间或之后被调用。实际上,`initcall` 机制的要点是确定内置模块和子系统初始化的正确顺序。举个例子,我们来看看下面的函数:
```C
static int __init nmi_warning_debugfs(void)
{
	debugfs_create_u64("nmi_longest_ns", 0644,
			arch_debugfs_dir, &nmi_longest_ns);
	return 0;
}
```
from the [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c) source code file. As we may see, it just creates the `nmi_longest_ns` [debugfs](https://en.wikipedia.org/wiki/Debugfs) file in the `arch_debugfs_dir` directory. Actually, this `debugfs` file may be created only after the `arch_debugfs_dir` directory has been created. Creation of this directory occurs during the architecture-specific initialization of the Linux kernel, in the `arch_kdebugfs_init` function from the [arch/x86/kernel/kdebugfs.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kdebugfs.c) source code file. Note that the `arch_kdebugfs_init` function is marked as an `initcall` too:

这个函数出自源码文件 [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c)。我们可以看到,这个函数只是在 `arch_debugfs_dir` 目录中创建 `nmi_longest_ns` [debugfs](https://en.wikipedia.org/wiki/Debugfs) 文件。实际上,只有在 `arch_debugfs_dir` 目录创建后,才能创建这个 `debugfs` 文件。这个目录是在 Linux 内核特定架构的初始化期间创建的,也就是在源码文件 [arch/x86/kernel/kdebugfs.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kdebugfs.c) 的 `arch_kdebugfs_init` 函数中。注意 `arch_kdebugfs_init` 函数也被标记为 `initcall`:

```C
arch_initcall(arch_kdebugfs_init);
```
The Linux kernel calls all architecture-specific `initcalls` before the `fs`-related `initcalls`. So, our `nmi_longest_ns` file will be created only after the `arch_debugfs_dir` directory has been created. Actually, the Linux kernel provides eight levels of main `initcalls`:

Linux 内核在调用 `fs` 相关的 `initcalls` 之前调用所有特定架构的 `initcalls`。因此,只有在 `arch_debugfs_dir` 目录创建以后,才会创建我们的 `nmi_longest_ns` 文件。实际上,Linux 内核提供了八个级别的主 `initcalls`:

* `early`;
* `core`;
* `postcore`;
* `arch`;
* `subsys`;
* `fs`;
* `device`;
* `late`.
All of their names are represented by the `initcall_level_names` array which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

它们的所有名称由数组 `initcall_level_names` 来描述,该数组定义在源码文件 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 中:

```C
static char *initcall_level_names[] __initdata = {
	"early",
	"core",
	"postcore",
	"arch",
	"subsys",
	"fs",
	"device",
	"late",
};
```
All functions which are marked as `initcall` by these identifiers will be called in that order: the `early initcalls` are called first, then the `core initcalls`, and so on. From this moment we know a little about the `initcall` mechanism, so we can start to dive into the source code of the Linux kernel to see how this mechanism is implemented.

所有用这些标识符标记为 `initcall` 的函数将会按此顺序被调用,也就是说,`early initcalls` 会首先被调用,其次是 `core initcalls`,以此类推。现在,我们对 `initcall` 机制有了一点了解,所以我们可以开始深入 Linux 内核源码,来看看这个机制是如何实现的。
Implementation of the initcall mechanism in the Linux kernel

initcall 机制在 Linux 内核中的实现
--------------------------------------------------------------------------------
The Linux kernel provides a set of macros from the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file to mark a given function as an `initcall`. All of these macros are pretty simple:

Linux 内核提供了一组来自头文件 [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) 的宏,来把给定的函数标记为 `initcall`。所有这些宏都相当简单:

```C
#define early_initcall(fn)		__define_initcall(fn, early)
#define pure_initcall(fn)		__define_initcall(fn, 0)
#define core_initcall(fn)		__define_initcall(fn, 1)
#define postcore_initcall(fn)		__define_initcall(fn, 2)
#define arch_initcall(fn)		__define_initcall(fn, 3)
#define subsys_initcall(fn)		__define_initcall(fn, 4)
#define fs_initcall(fn)			__define_initcall(fn, 5)
#define device_initcall(fn)		__define_initcall(fn, 6)
#define late_initcall(fn)		__define_initcall(fn, 7)
```
and as we may see these macros just expand to a call of the `__define_initcall` macro from the same header file. Moreover, the `__define_initcall` macro takes two arguments:

我们可以看到,这些宏只是展开为对同一个头文件中 `__define_initcall` 宏的调用。此外,`__define_initcall` 宏有两个参数:

* `fn` - callback function which will be called during the `initcalls` of the certain level;
* `id` - identifier to identify the `initcall`, to prevent errors when two identical `initcalls` point to the same handler.

* `fn` - 在调用某个级别的 `initcalls` 时调用的回调函数;
* `id` - 识别 `initcall` 的标识符,用来防止两个相同的 `initcalls` 指向同一个处理函数时出现错误。
The implementation of the `__define_initcall` macro looks like:

`__define_initcall` 宏的实现如下所示:

```C
#define __define_initcall(fn, id) \
	static initcall_t __initcall_##fn##id __used \
	__attribute__((__section__(".initcall" #id ".init"))) = fn; \
	LTO_REFERENCE_INITCALL(__initcall_##fn##id)
```
To understand the `__define_initcall` macro, first of all let's look at the `initcall_t` type. This type is defined in the same [header](https://github.com/torvalds/linux/blob/master/include/linux/init.h) file and it represents a pointer to a function which returns an [integer](https://en.wikipedia.org/wiki/Integer) which will be the result of the `initcall`:

要了解 `__define_initcall` 宏,首先让我们来看下 `initcall_t` 类型。这个类型定义在同一个[头文件](https://github.com/torvalds/linux/blob/master/include/linux/init.h)中,它表示一个函数指针,该函数返回一个[整型](https://en.wikipedia.org/wiki/Integer)值,即 `initcall` 的结果:

```C
typedef int (*initcall_t)(void);
```
Now let's return to the `__define_initcall` macro. The [##](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) operator provides the ability to concatenate two symbols. In our case, the first line of the `__define_initcall` macro produces the definition of an `__initcall_fn_id` variable, places it in the `.initcall{id}.init` [ELF section](http://www.skyfree.org/linux/references/ELF_Format.pdf) and marks it with the `__used` [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) attribute. If we look in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h) header file, which represents data for the kernel [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script, we will see that all of the `initcall` sections will be placed in the `.data` section:

现在让我们回到 `__define_initcall` 宏。[##](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) 运算符提供了连接两个符号的能力。在我们的例子中,`__define_initcall` 宏的第一行产生了 `__initcall_fn_id` 变量的定义,将其放置在 `.initcall{id}.init` [ELF 段](http://www.skyfree.org/linux/references/ELF_Format.pdf)中,并用 [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) 的 `__used` 属性标记它。如果我们查看表示内核[链接器](https://en.wikipedia.org/wiki/Linker_%28computing%29)脚本数据的 [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h) 头文件,我们会看到所有的 `initcall` 段都将放在 `.data` 段中:

```C
#define INIT_CALLS							\
		VMLINUX_SYMBOL(__initcall_start) = .;			\
		*(.initcallearly.init)					\
		INIT_CALLS_LEVEL(0)					\
		INIT_CALLS_LEVEL(1)					\
		INIT_CALLS_LEVEL(2)					\
		INIT_CALLS_LEVEL(3)					\
		INIT_CALLS_LEVEL(4)					\
		INIT_CALLS_LEVEL(5)					\
		INIT_CALLS_LEVEL(rootfs)				\
		INIT_CALLS_LEVEL(6)					\
		INIT_CALLS_LEVEL(7)					\
		VMLINUX_SYMBOL(__initcall_end) = .;
```
The second attribute, `__used`, is defined in the [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) header file and it expands to the definition of the following `gcc` attribute:

第二个属性 `__used` 定义在 [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) 头文件中,它展开为以下 `gcc` 属性的定义:

```C
#define __used			__attribute__((__used__))
```

which prevents a `variable defined but not used` warning. The last line of the `__define_initcall` macro is:

它防止出现 `定义了变量但未使用` 的警告。宏 `__define_initcall` 的最后一行是:
```C
LTO_REFERENCE_INITCALL(__initcall_##fn##id)
```

which depends on the `CONFIG_LTO` kernel configuration option and just provides a stub for the compiler's [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization):

它取决于 `CONFIG_LTO` 内核配置选项,只为编译器的[链接时优化](https://gcc.gnu.org/wiki/LinkTimeOptimization)提供存根:

```
#ifdef CONFIG_LTO
#define LTO_REFERENCE_INITCALL(x) \
	static __used __exit void *reference_##x(void) \
	{ \
		return &x; \
	}
#else
#define LTO_REFERENCE_INITCALL(x)
#endif
```
It works around an LTO problem: when there is no reference to a variable in a module, the variable may be moved to the end of the program, so the macro provides a dummy reference to it. That's all about the `__define_initcall` macro. So, all of the `*_initcall` macros will be expanded during compilation of the Linux kernel, all `initcalls` will be placed in their sections, and all of them will be available from the `.data` section; the Linux kernel knows where to find a certain `initcall` to call it during the initialization process.

它解决的是 LTO 的一个问题:当模块中的变量没有被引用时,该变量可能会被移动到程序末尾,因此这个宏为它提供了一个哑引用。这就是关于 `__define_initcall` 宏的全部了。所以,所有的 `*_initcall` 宏将会在 Linux 内核编译时展开,所有的 `initcalls` 会放置在它们的段内,并可以通过 `.data` 段来获取,Linux 内核在初始化过程中就知道在哪儿去找到 `initcall` 并调用它。

As `initcalls` can be called by the Linux kernel, let's look at how the Linux kernel does this. This process starts in the `do_basic_setup` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

既然 Linux 内核可以调用 `initcalls`,我们就来看下 Linux 内核是如何做的。这个过程从 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 源码文件的 `do_basic_setup` 函数开始:
```C
static void __init do_basic_setup(void)
{
	cpuset_init_smp();
	usermodehelper_init();
	shmem_init();
	driver_init();
	init_irq_proc();
	do_ctors();
	usermodehelper_enable();
	do_initcalls();
}
```
which is called during the initialization of the Linux kernel, right after the main initialization steps, such as memory-manager-related initialization and the `CPU` subsystem, have finished. The `do_initcalls` function just goes through the array of `initcall` levels and calls the `do_initcall_level` function for each level:

该函数在 Linux 内核初始化过程中被调用,调用时机是在主要的初始化步骤(比如内存管理器相关的初始化、`CPU` 子系统等)完成之后。`do_initcalls` 函数只是遍历 `initcall` 级别数组,并对每个级别调用 `do_initcall_level` 函数:

```C
static void __init do_initcalls(void)
{
	int level;

	for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
		do_initcall_level(level);
}
```
The `initcall_levels` array is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/init/main.c) and contains pointers to the sections which were defined in the `__define_initcall` macro:

`initcall_levels` 数组在同一个源码[文件](https://github.com/torvalds/linux/blob/master/init/main.c)中定义,包含了指向在 `__define_initcall` 宏中定义的那些段的指针:

```C
static initcall_t *initcall_levels[] __initdata = {
	__initcall0_start,
	__initcall1_start,
	__initcall2_start,
	__initcall3_start,
	__initcall4_start,
	__initcall5_start,
	__initcall6_start,
	__initcall7_start,
	__initcall_end,
};
```
If you are interested, you can find these sections in the `arch/x86/kernel/vmlinux.lds` linker script which is generated after the Linux kernel compilation:

如果你有兴趣,可以在 Linux 内核编译后生成的链接器脚本 `arch/x86/kernel/vmlinux.lds` 中找到这些段:

```
.init.data : AT(ADDR(.init.data) - 0xffffffff80000000) {
	...
	__initcall_start = .;
	*(.initcallearly.init)
	__initcall0_start = .;
	*(.initcall0.init)
	*(.initcall0s.init)
	...
	__initcall_end = .;
	...
}
```
If you are not familiar with this, you can learn more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html) of this book.

如果你对这些不熟,可以在本书专门的[部分](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)了解更多关于[链接器](https://en.wikipedia.org/wiki/Linker_%28computing%29)的信息。
As we just saw, the `do_initcall_level` function takes one parameter - the level of the `initcall` - and does the following two things: first it parses the `initcall_command_line`, which is a copy of the usual kernel [command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) that may contain parameters for modules, with the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/kernel/params.c) source code file, and then calls the `do_one_initcall` function for each entry of the given level:

正如我们刚看到的,`do_initcall_level` 函数有一个参数 - `initcall` 的级别,它做了以下两件事:首先用源码文件 [kernel/params.c](https://github.com/torvalds/linux/blob/master/kernel/params.c) 中的 `parse_args` 函数解析 `initcall_command_line`(这是内核[命令行](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)的副本,可能包含各个模块的参数),然后对给定级别的每一项调用 `do_one_initcall` 函数:

```C
for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
	do_one_initcall(*fn);
```
The `do_one_initcall` function does the main job for us. As we may see, this function takes one parameter which represents the `initcall` callback function, and calls the given callback:

`do_one_initcall` 为我们做了主要的工作。我们可以看到,这个函数有一个表示 `initcall` 回调函数的参数,并调用给定的回调函数:

```C
int __init_or_module do_one_initcall(initcall_t fn)
{
	int count = preempt_count();
	int ret;
	char msgbuf[64];

	if (initcall_blacklisted(fn))
		return -EPERM;

	if (initcall_debug)
		ret = do_one_initcall_debug(fn);
	else
		ret = fn();

	msgbuf[0] = 0;

	if (preempt_count() != count) {
		sprintf(msgbuf, "preemption imbalance ");
		preempt_count_set(count);
	}
	if (irqs_disabled()) {
		strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
		local_irq_enable();
	}
	WARN(msgbuf[0], "initcall %pF returned with %s\n", fn, msgbuf);

	return ret;
}
```
Let's try to understand what the `do_one_initcall` function does. First of all we save the [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) counter so that we can check it later to be sure that it is not imbalanced. After this step we can see the call of the `initcall_blacklisted` function, which goes over the `blacklisted_initcalls` list storing blacklisted `initcalls` and skips the given `initcall` if it is located in this list:

让我们来试着理解 `do_one_initcall` 函数做了什么。首先我们保存抢占([preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29))计数,以便我们稍后进行检查,确保它没有失衡。这步以后,我们可以看到 `initcall_blacklisted` 函数的调用,这个函数遍历保存了黑名单 `initcalls` 的 `blacklisted_initcalls` 链表,如果给定的 `initcall` 在黑名单里就跳过它:
```C
list_for_each_entry(entry, &blacklisted_initcalls, next) {
	if (!strcmp(fn_name, entry->buf)) {
		pr_debug("initcall %s blacklisted\n", fn_name);
		ret = true;
		break;
	}
}
```
The blacklisted `initcalls` are stored in the `blacklisted_initcalls` list, and this list is filled during early Linux kernel initialization from the Linux kernel command line.

黑名单的 `initcalls` 保存在 `blacklisted_initcalls` 链表中,这个链表是在早期 Linux 内核初始化时由 Linux 内核命令行来填充的。

After the blacklisted `initcalls` have been handled, the next part of the code calls the `initcall` directly:

处理完进入黑名单的 `initcalls` 之后,接下来的代码直接调用 `initcall`:
```C
if (initcall_debug)
	ret = do_one_initcall_debug(fn);
else
	ret = fn();
```
Depending on the value of the `initcall_debug` variable, either the `do_one_initcall_debug` function will call the `initcall`, or this function will do it directly via `fn()`. The `initcall_debug` variable is defined in the [same](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

取决于 `initcall_debug` 变量的值,要么由 `do_one_initcall_debug` 函数调用 `initcall`,要么直接通过 `fn()` 调用。`initcall_debug` 变量定义在[同一个源码文件](https://github.com/torvalds/linux/blob/master/init/main.c)中:

```C
bool initcall_debug;
```
and provides the ability to print some information to the kernel [log buffer](https://en.wikipedia.org/wiki/Dmesg). The value of the variable can be set from the kernel command line via the `initcall_debug` parameter. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) of the Linux kernel command line:

该变量提供了向内核[日志缓冲区](https://en.wikipedia.org/wiki/Dmesg)打印一些信息的能力。可以通过内核命令行的 `initcall_debug` 参数来设置这个变量的值。从 Linux 内核命令行[文档](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)可以看到:

```
initcall_debug	[KNL] Trace initcalls as they are executed.  Useful
		for working out where the kernel is dying during
		startup.
```
|
||||
|
||||
And that's true. If we look at the implementation of the `do_one_initcall_debug` function, we will see that it does the same as the `do_one_initcall` function, i.e. it calls the given `initcall` and additionally prints some information related to its execution (like the [pid](https://en.wikipedia.org/wiki/Process_identifier) of the currently running task, the duration of the execution of the `initcall` and so on):
确实如此。如果我们看下 `do_one_initcall_debug` 函数的实现,我们会看到它与 `do_one_initcall` 函数做了一样的事,也就是说,`do_one_initcall_debug` 函数调用了给定的 `initcall`,并打印了一些和该 `initcall` 的执行相关的信息(比如当前运行任务的 [pid](https://en.wikipedia.org/wiki/Process_identifier)、`initcall` 的执行时长等):
```C
static int __init_or_module do_one_initcall_debug(initcall_t fn)
{
	...
}
```

After an `initcall` has been called by one of the `do_one_initcall` or `do_one_initcall_debug` functions, we may see two checks at the end of the `do_one_initcall` function. The first one checks that the `__preempt_count_add` and `__preempt_count_sub` calls inside the executed initcall were balanced: if the current value is not equal to the previous value of the preemption counter, we add the `preemption imbalance` string to the message buffer and restore the correct value of the preemption counter:
当 `initcall` 被 `do_one_initcall` 或 `do_one_initcall_debug` 调用之后,我们可以看到在 `do_one_initcall` 函数末尾做了两次检查。第一个检查 initcall 执行期间 `__preempt_count_add` 和 `__preempt_count_sub` 的调用是否平衡:如果当前值和之前的抢占计数不相等,我们就把 `preemption imbalance` 字符串添加到消息缓冲区,并恢复正确的抢占计数:

```C
if (preempt_count() != count) {
	sprintf(msgbuf, "preemption imbalance ");
	preempt_count_set(count);
}
```

Later this error string will be printed. The last check is for the state of local [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29): if they are disabled, we add the `disabled interrupts` string to our message buffer and enable `IRQs` for the current processor, to prevent the state where `IRQs` were disabled by an `initcall` and never enabled again:
稍后这个错误字符串就会被打印出来。最后检查本地 [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) 的状态,如果它们被禁用了,我们就将 `disabled interrupts` 字符串添加到我们的消息缓冲区,并为当前处理器使能 `IRQs`,以防出现 `IRQs` 被 `initcall` 禁用了却没有重新使能的情况:

```C
if (irqs_disabled()) {
	strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
	local_irq_enable();
}
```

That's all. In this way the Linux kernel initializes many subsystems in the correct order. From now on, we know what the `initcall` mechanism in the Linux kernel is. We covered its main portion in this part, but left out some important concepts. Let's take a short look at them.
这就是全部了。通过这种方式,Linux 内核以正确的顺序完成了很多子系统的初始化。现在我们知道 Linux 内核的 `initcall` 机制是怎么回事了。在这部分中,我们介绍了 `initcall` 机制的主要部分,但遗留了一些重要的概念。让我们来简单看下这些概念。

First of all, we have missed one level of `initcalls`: the `rootfs initcalls`. You can find the definition of the `rootfs_initcall` macro in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file along with all the similar macros which we saw in this part:
首先,我们错过了一个级别的 `initcalls`,就是 `rootfs initcalls`。和我们在本部分看到的很多宏类似,你可以在 [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) 头文件中找到 `rootfs_initcall` 宏的定义:

```C
#define rootfs_initcall(fn) __define_initcall(fn, rootfs)
```

As we may understand from the macro's name, its main purpose is to store callbacks which are related to the [rootfs](https://en.wikipedia.org/wiki/Initramfs). Besides that, this level is also useful for initializing things that must happen after the filesystem-level initialization, but before the device-related initialization. For example, the decompression of the [initramfs](https://en.wikipedia.org/wiki/Initramfs), which occurs in the `populate_rootfs` function from the [init/initramfs.c](https://github.com/torvalds/linux/blob/master/init/initramfs.c) source code file:
从这个宏的名字我们可以理解到,它的主要目的是保存和 [rootfs](https://en.wikipedia.org/wiki/Initramfs) 相关的回调。除此之外,这个级别也可用于初始化那些必须在文件系统级别初始化之后、设备相关初始化之前完成的东西。例如,[initramfs](https://en.wikipedia.org/wiki/Initramfs) 的解压就发生在源码文件 [init/initramfs.c](https://github.com/torvalds/linux/blob/master/init/initramfs.c) 中的 `populate_rootfs` 函数里:

```C
rootfs_initcall(populate_rootfs);
```

From this place, we may see familiar output:
在这里,我们可以看到熟悉的输出:

```
[    0.199960] Unpacking initramfs...
```

Besides the `rootfs_initcall` level, there are additional `console_initcall`, `security_initcall` and other secondary `initcall` levels. The last thing that we have missed is the set of the `*_initcall_sync` levels. Almost every `*_initcall` macro that we have seen in this part has a companion macro with the `_sync` suffix:
除了 `rootfs_initcall` 级别,还有其它的 `console_initcall`、`security_initcall` 和其他辅助的 `initcall` 级别。我们遗漏的最后一件事,是 `*_initcall_sync` 级别的集合。在这部分我们看到的几乎每个 `*_initcall` 宏,都有一个带 `_sync` 后缀的伴随宏:

```C
#define core_initcall_sync(fn) __define_initcall(fn, 1s)
...
#define late_initcall_sync(fn) __define_initcall(fn, 7s)
```

The main goal of these additional levels is to wait for the completion of all module-related initialization routines of a certain level.
这些附加级别的主要目的是,等待某个级别所有与模块相关的初始化例程完成。

That's all.
这就是全部了。

Conclusion
结论
--------------------------------------------------------------------------------

In this part we saw an important mechanism of the Linux kernel which allows it to call, during initialization, functions that depend on its current state.
在这部分中,我们看到了 Linux 内核的一项重要机制,它允许在初始化期间调用依赖于内核当前状态的函数。

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
如果你有问题或建议,可随时在 twitter [0xAX](https://twitter.com/0xAX) 上联系我,给我发[邮件](anotherworldofworld@gmail.com),或者创建 [issue](https://github.com/0xAX/linux-insides/issues/new)。

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
**请注意英语不是我的母语,对此带来的不便,我很抱歉。如果你发现了任何错误,都可以给我发 PR 到 [linux-insides](https://github.com/0xAX/linux-insides)。**

Links
链接
--------------------------------------------------------------------------------

* [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29)

Per-CPU variables
Per-cpu 变量
================================================================================

Per-CPU variables are one of the kernel features. You can understand the meaning of this feature by reading its name. We can create a variable and each processor core will have its own copy of it. In this part, we take a closer look at this feature and try to understand how it is implemented and how it works.
Per-cpu 变量是一项内核特性。从它的名字你就可以理解这项特性的意义了。我们可以创建一个变量,然后每个 CPU 核心上都会有一个此变量的拷贝。本节我们来仔细看下这个特性,并试着去理解它是如何实现以及工作的。

The kernel provides an API for creating per-cpu variables - the `DEFINE_PER_CPU` macro:
内核提供了一个创建 per-cpu 变量的 API - `DEFINE_PER_CPU` 宏:

```C
#define DEFINE_PER_CPU(type, name) \
        DEFINE_PER_CPU_SECTION(type, name, "")
```

This macro, like many other macros for working with per-cpu variables, is defined in [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h). Now we will see how this feature is implemented.
正如其它许多处理 per-cpu 变量的宏一样,这个宏定义在 [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) 中。现在我们来看下这个特性是如何实现的。

Take a look at the `DEFINE_PER_CPU` definition. We see that it takes 2 parameters: `type` and `name`, so we can use it to create per-cpu variables, for example like this:
看下 `DEFINE_PER_CPU` 的定义,可以看到它使用了 2 个参数:`type` 和 `name`,因此我们可以这样创建 per-cpu 变量:

```C
DEFINE_PER_CPU(int, per_cpu_n)
```

We pass the type and the name of our variable. `DEFINE_PER_CPU` calls the `DEFINE_PER_CPU_SECTION` macro and passes the same two parameters and an empty string to it. Let's look at the definition of the `DEFINE_PER_CPU_SECTION`:
我们传入要创建变量的类型和名字,`DEFINE_PER_CPU` 调用 `DEFINE_PER_CPU_SECTION`,将两个参数和空字符串传递给后者。让我们来看下 `DEFINE_PER_CPU_SECTION` 的定义:

```C
#define DEFINE_PER_CPU_SECTION(type, name, sec)    \
        __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES   \
        __typeof__(type) name

#define __PCPU_ATTRS(sec)                                           \
        __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
        PER_CPU_ATTRIBUTES
```

where `section` is:
其中 `section` 是:

```C
#define PER_CPU_BASE_SECTION ".data..percpu"
```

After all macros are expanded we will get a global per-cpu variable:
当所有的宏展开之后,我们得到一个全局的 per-cpu 变量:

```C
__attribute__((section(".data..percpu"))) int per_cpu_n
```

It means that we will have a `per_cpu_n` variable in the `.data..percpu` section. We can find this section in the `vmlinux`:
这意味着我们在 `.data..percpu` 段有了一个 `per_cpu_n` 变量,可以在 `vmlinux` 中找到它:

```
.data..percpu 00013a58  0000000000000000  0000000001a5c000  00e00000  2**12
              CONTENTS, ALLOC, LOAD, DATA
```

Ok, now we know that when we use the `DEFINE_PER_CPU` macro, a per-cpu variable in the `.data..percpu` section will be created. When the kernel initializes, it calls the `setup_per_cpu_areas` function which loads the `.data..percpu` section multiple times, one section per CPU.
好,现在我们知道了,当我们使用 `DEFINE_PER_CPU` 宏时,一个在 `.data..percpu` 段中的 per-cpu 变量就被创建了。内核初始化时,调用 `setup_per_cpu_areas` 函数多次加载 `.data..percpu` 段,每个 CPU 一次。

Let's look at the per-CPU areas initialization process. It starts in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) with the call of the `setup_per_cpu_areas` function which is defined in [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c).
让我们来看下 per-cpu 区域初始化流程。它从 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 中调用 `setup_per_cpu_areas` 函数开始,这个函数定义在 [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c) 中。

```C
pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
        NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
```

The `setup_per_cpu_areas` function starts by printing the maximum number of CPUs set during kernel configuration with the `CONFIG_NR_CPUS` configuration option, the actual number of CPUs, `nr_cpumask_bits` (which is the same as `NR_CPUS` for the new `cpumask` operators), and the number of `NUMA` nodes.
`setup_per_cpu_areas` 首先输出在内核配置中以 `CONFIG_NR_CPUS` 配置项设置的最大 CPU 数、实际的 CPU 个数、`nr_cpumask_bits`(对于新的 `cpumask` 操作来说和 `NR_CPUS` 是一样的),还有 `NUMA` 节点个数。

We can see this output in the dmesg:
我们可以在 `dmesg` 中看到这些输出:

```
$ dmesg | grep percpu
[    0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
```

In the next step we check the `percpu` first chunk allocator. All percpu areas are allocated in chunks. The first chunk is used for the static percpu variables. The Linux kernel has a `percpu_alloc` command line parameter which selects the type of the first chunk allocator. We can read about it in the kernel documentation:
然后我们检查 `percpu` 第一个块分配器。所有的 per-cpu 区域都是以块进行分配的。第一个块用于静态 per-cpu 变量。Linux 内核提供了 `percpu_alloc` 命令行参数,用于选择第一个块分配器的类型。我们可以在内核文档中读到它的说明:

```
percpu_alloc=	Select which percpu first chunk allocator to use.
		Currently supported values are "embed" and "page".
		Archs may support subset or none of the selections.
		See comments in mm/percpu.c for details on each
		allocator.  This parameter is primarily for debugging
		and performance comparison.

percpu_alloc=	选择要使用哪个 percpu 第一个块分配器。
		当前支持的类型是 "embed" 和 "page"。
		不同架构支持这些类型的子集或不支持。
		更多分配器的细节参考 mm/percpu.c 中的注释。
		这个参数主要是为了调试和性能比较。
```

The [mm/percpu.c](https://github.com/torvalds/linux/blob/master/mm/percpu.c) file contains the handler of this command line option:
[mm/percpu.c](https://github.com/torvalds/linux/blob/master/mm/percpu.c) 包含了这个命令行选项的处理函数:

```C
early_param("percpu_alloc", percpu_alloc_setup);
```

The `percpu_alloc_setup` function sets the `pcpu_chosen_fc` variable depending on the value of the `percpu_alloc` parameter. By default the first chunk allocator is `auto`:
其中 `percpu_alloc_setup` 函数根据 `percpu_alloc` 参数值设置 `pcpu_chosen_fc` 变量。默认第一个块分配器是 `auto`:

```C
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
```

If the `percpu_alloc` parameter is not given on the kernel command line, the `embed` allocator will be used, which embeds the first percpu chunk into bootmem with the help of [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). The last allocator is the first chunk `page` allocator, which maps the first chunk with `PAGE_SIZE` pages.
如果内核命令行中没有设置 `percpu_alloc` 参数,就会使用 `embed` 分配器,借助 [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html) 将第一个 per-cpu 块嵌入 bootmem。最后一种是第一个块的 `page` 分配器,它用 `PAGE_SIZE` 大小的页来映射第一个块。

As I wrote above, first of all we check the type of the first chunk allocator in `setup_per_cpu_areas`: we check that the first chunk allocator is not `page`:
如我上面所写,首先我们在 `setup_per_cpu_areas` 中检查第一个块分配器的类型,即检查第一个块分配器不是 `page` 分配器:

```C
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
	...
}
```

If it is not `PCPU_FC_PAGE`, we will use the `embed` allocator and allocate space for the first chunk with the `pcpu_embed_first_chunk` function:
如果不是 `PCPU_FC_PAGE`,我们就使用 `embed` 分配器,并使用 `pcpu_embed_first_chunk` 函数为第一个块分配空间:

```C
rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
			    dyn_size, atom_size,
			    pcpu_cpu_distance,
			    pcpu_fc_alloc, pcpu_fc_free);
```

As shown above, the `pcpu_embed_first_chunk` function embeds the first percpu chunk into bootmem, and we pass a couple of parameters to it. They are as follows:
如前所述,函数 `pcpu_embed_first_chunk` 将第一个 per-cpu 块嵌入 bootmem,因此我们传递一些参数给 `pcpu_embed_first_chunk`。参数如下:

* `PERCPU_FIRST_CHUNK_RESERVE` - the size of the reserved space for the static `percpu` variables;
* `dyn_size` - minimum free size for dynamic allocation in bytes;
* `atom_size` - all allocations are whole multiples of this and aligned to this parameter;
* `pcpu_cpu_distance` - callback to determine distance between cpus;
* `pcpu_fc_alloc` - function to allocate `percpu` page;
* `pcpu_fc_free` - function to release `percpu` page.

* `PERCPU_FIRST_CHUNK_RESERVE` - 为静态 `percpu` 变量保留空间的大小;
* `dyn_size` - 动态分配的最少空闲字节数;
* `atom_size` - 所有的分配都是它的整数倍,并以此对齐;
* `pcpu_cpu_distance` - 决定 cpu 之间距离的回调函数;
* `pcpu_fc_alloc` - 分配 `percpu` 页的函数;
* `pcpu_fc_free` - 释放 `percpu` 页的函数。

We calculate all of these parameters before the call of the `pcpu_embed_first_chunk`:
在调用 `pcpu_embed_first_chunk` 前我们计算好所有的参数:

```C
const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
size_t atom_size;

#ifdef CONFIG_X86_64
atom_size = PMD_SIZE;
#else
atom_size = PAGE_SIZE;
#endif
```

If the first chunk allocator is `PCPU_FC_PAGE`, we will use the `pcpu_page_first_chunk` function instead of `pcpu_embed_first_chunk`. After the `percpu` areas are up, we set up the `percpu` offset and its segment for every CPU with the `setup_percpu_segment` function (only for `x86` systems) and move some early data from the arrays to the `percpu` variables (`x86_cpu_to_apicid`, `irq_stack_ptr` and so on). After the kernel finishes the initialization process, we will have N loaded `.data..percpu` sections, where N is the number of CPUs, and the section used by the bootstrap processor will contain the uninitialized variables created with the `DEFINE_PER_CPU` macro.
如果第一个块分配器是 `PCPU_FC_PAGE`,我们就用 `pcpu_page_first_chunk` 而不是 `pcpu_embed_first_chunk`。`percpu` 区域准备好以后,我们用 `setup_percpu_segment` 函数为每个 CPU 设置 `percpu` 的偏移和段(只针对 `x86` 系统),并将一些早期数据从数组移到 `percpu` 变量(`x86_cpu_to_apicid`、`irq_stack_ptr` 等等)。当内核完成初始化过程后,我们就有了 N 个加载好的 `.data..percpu` 段,其中 N 是 CPU 个数,bootstrap 处理器使用的段将会包含用 `DEFINE_PER_CPU` 宏创建的未初始化的变量。

The kernel provides an API for manipulating per-cpu variables:
内核提供了操作 per-cpu 变量的 API:

* get_cpu_var(var)
* put_cpu_var(var)

Let's look at the `get_cpu_var` implementation:
让我们来看看 `get_cpu_var` 的实现:

```C
#define get_cpu_var(var)	\
(*({				\
	preempt_disable();	\
	this_cpu_ptr(&var);	\
}))
```

The Linux kernel is preemptible, and accessing a per-cpu variable requires us to know which processor the kernel is running on. So the current code must not be preempted and moved to another CPU while accessing a per-cpu variable. That's why we first see a call of the `preempt_disable` function, and then a call of the `this_cpu_ptr` macro, which looks like:
Linux 内核是可抢占的,访问 per-cpu 变量需要我们知道内核运行在哪个处理器上。因此访问 per-cpu 变量时,当前代码不能被抢占,也不能被迁移到其它 CPU。这就是为什么首先调用 `preempt_disable` 函数,然后调用 `this_cpu_ptr` 宏,像这样:

```C
#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
```

and
以及

```C
#define raw_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
```

where `per_cpu_ptr` returns a pointer to the per-cpu variable for the given cpu (the second parameter). After we've created a per-cpu variable and made modifications to it, we must call the `put_cpu_var` macro, which enables preemption again with a call of the `preempt_enable` function. So the typical usage of a per-cpu variable is as follows:
`per_cpu_ptr` 返回一个指向给定 CPU(第 2 个参数)的 per-cpu 变量的指针。当我们创建了一个 per-cpu 变量并对其进行了修改后,我们必须调用 `put_cpu_var` 宏,它通过调用 `preempt_enable` 函数重新使能抢占。因此典型的 per-cpu 变量的使用如下:

```C
get_cpu_var(var);
...
//Do something with the 'var'
//用这个 'var' 做些啥
...
put_cpu_var(var);
```

Let's look at the `per_cpu_ptr` macro:
让我们来看下这个 `per_cpu_ptr` 宏:

```C
#define per_cpu_ptr(ptr, cpu)	\
({				\
	__verify_pcpu_ptr(ptr);	\
	SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)));	\
})
```

As I wrote above, this macro returns a pointer to the per-cpu variable for the given cpu. First of all it calls `__verify_pcpu_ptr`:
就像我上面写的,这个宏返回了一个指向给定 cpu 的 per-cpu 变量的指针。首先它调用了 `__verify_pcpu_ptr`:

```C
#define __verify_pcpu_ptr(ptr)						\
do {									\
	const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;	\
	(void)__vpp_verify;						\
} while (0)
```

which checks, at compile time, that the given `ptr` is compatible with the `const void __percpu *` type.
该宏在编译期检查给定的 `ptr` 是否与 `const void __percpu *` 类型兼容。

After this we can see the call of the `SHIFT_PERCPU_PTR` macro with two parameters. As the first parameter we pass our `ptr`, and as the second we pass the result of passing the cpu number to the `per_cpu_offset` macro:
之后,我们可以看到带两个参数调用的 `SHIFT_PERCPU_PTR` 宏。第一个参数是我们的指针,第二个参数是把 CPU 号传给 `per_cpu_offset` 宏的结果:

```C
#define per_cpu_offset(x) (__per_cpu_offset[x])
```

which expands to getting the `x`-th element of the `__per_cpu_offset` array:
该宏展开为取 `__per_cpu_offset` 数组的第 `x` 个元素:

```C
extern unsigned long __per_cpu_offset[NR_CPUS];
```

where `NR_CPUS` is the number of CPUs. The `__per_cpu_offset` array is filled with the distances between the cpu-variable copies. For example, if all per-cpu data is `X` bytes in size, `__per_cpu_offset[Y]` holds the offset `X*Y`. Let's look at the `SHIFT_PERCPU_PTR` implementation:
其中 `NR_CPUS` 是 CPU 的数目。`__per_cpu_offset` 数组中填充的是各 CPU 变量拷贝之间的距离。例如,如果所有 per-cpu 数据是 `X` 字节大小,那么 `__per_cpu_offset[Y]` 中保存的偏移就是 `X*Y`。让我们来看下 `SHIFT_PERCPU_PTR` 的实现:

```C
#define SHIFT_PERCPU_PTR(__p, __offset) \
        RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
```

`RELOC_HIDE` just returns `(typeof(ptr)) (__ptr + (off))`, that is, a pointer shifted by the given offset.
`RELOC_HIDE` 只是返回 `(typeof(ptr)) (__ptr + (off))`,也就是按给定偏移量移动后的指针。

That's all! Of course it is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) magic.
就这些了!当然这不是全部的 API,只是一个大概。开头可能比较艰难,但是要理解 per-cpu 变量,你主要需要理解 [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) 的奥秘。

Let's again look at the algorithm of getting a pointer to a per-cpu variable:
让我们再看下获得 per-cpu 变量指针的算法:

* The kernel creates multiple `.data..percpu` sections (one per CPU) during the initialization process;
* All variables created with the `DEFINE_PER_CPU` macro will be relocated to the first section, i.e. the one for CPU0;
* The `__per_cpu_offset` array is filled with the distances (`BOOT_PERCPU_OFFSET`) between the `.data..percpu` sections;
* When `per_cpu_ptr` is called, for example to get a pointer to a certain per-cpu variable for the third CPU, the `__per_cpu_offset` array is accessed, where every index points to the required CPU.

* 内核在初始化流程中创建了多个 `.data..percpu` 段(每个 CPU 一个);
* 所有用 `DEFINE_PER_CPU` 宏创建的变量都将被重定位到第一个段,即 CPU0 的段;
* `__per_cpu_offset` 数组中填充了各 `.data..percpu` 段之间的距离(`BOOT_PERCPU_OFFSET`);
* 当调用 `per_cpu_ptr` 时,例如要获取第三个 CPU 上某个 per-cpu 变量的指针,就会访问 `__per_cpu_offset` 数组,其中每个索引都指向所需的 CPU。

That's all.
就这么多了。
