diff --git a/Initialization/linux-initialization-1.md b/Initialization/linux-initialization-1.md index d619262..372b23d 100644 --- a/Initialization/linux-initialization-1.md +++ b/Initialization/linux-initialization-1.md @@ -1,23 +1,26 @@ -Kernel initialization. Part 1. +内核初始化 第一部分 ================================================================================ -First steps in the kernel code +踏入内核代码的第一步(TODO: Need proofreading) -------------------------------------------------------------------------------- -The previous [post](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html) was a last part of the Linux kernel [booting process](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/index.html) chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) will be called. 
+[上一章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html)是[引导过程](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/index.html)的最后一部分。从现在开始,我们将深入探究 Linux 内核的初始化过程。在解压缩完 Linux 内核镜像、并把它妥善地放入内存后,内核就开始工作了。我们在第一章中介绍了 Linux 内核引导程序,它的任务就是为执行内核代码做准备。而在本章中,我们将探究内核代码,看一看内核的初始化过程——即在启动 [PID](https://en.wikipedia.org/wiki/Process_identifier) 为 `1` 的 `init` 进程前,内核所做的大量工作。 -In the last [part](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html) of the previous [chapter](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/index.html) we stopped at the [jmp](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file: +本章的内容很多,介绍了在内核启动前的所有准备工作。[arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) 文件中定义了内核入口点,我们会从这里开始,逐步地深入下去。在 `start_kernel` 函数(定义在 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489)) 执行之前,我们会看到很多的初期的初始化过程,例如初期页表初始化、切换到一个新的内核空间描述符等等。 + +在[上一章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/index.html)的[最后一节](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html)中,我们跟踪到了 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) 文件中的 [jmp](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) 指令: ```assembly jmp *%rax ``` -At this moment the `rax` register contains address of the Linux kernel entry point which that was obtained as a result of the call of the `decompress_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file. 
So, our last instruction in the kernel setup code is a jump on the kernel entry point. We already know where is defined the entry point of the linux kernel, so we are able to start to learn what does the Linux kernel does after the start. +此时 `rax` 寄存器中保存的就是 Linux 内核入口点,通过调用 `decompress_kernel` ([arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c)) 函数后获得。由此可见,内核引导程序的最后一行代码是一句指向内核入口点的跳转指令。既然已经知道了内核入口点定义在哪,我们就可以继续探究 Linux 内核在引导结束后做了些什么。 -First steps in the kernel + +内核执行的第一步 -------------------------------------------------------------------------------- -Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into `rax` register and just jumped there. As we already know the entry point of the decompressed kernel image starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly source code file and at the beginning of it, we can see following definitions: +OK,在调用了 `decompress_kernel` 函数后,`rax` 寄存器中保存了解压缩后的内核镜像的地址,并且跳转了过去。解压缩后的内核镜像的入口点定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S),这个文件的开头几行如下: ```assembly __HEAD @@ -29,13 +32,13 @@ startup_64: ... 
``` -We can see definition of the `startup_64` routine that is defined in the `__HEAD` section, which is just a macro which expands to the definition of executable `.head.text` section: +我们可以看到 `startup_64` 过程定义在了 `__HEAD` 区段下。 `__HEAD` 只是一个宏,它将展开为可执行的 `.head.text` 区段: ```C #define __HEAD .section ".head.text","ax" ``` -We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S#L93) linker script: +我们可以在 [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S#L93) 链接器脚本文件中看到这个区段的定义: ``` .text : AT(ADDR(.text) - LOAD_OFFSET) { @@ -46,48 +49,48 @@ We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](htt } :text = 0x9090 ``` -Besides the definition of the `.text` section, we can understand default virtual and physical addresses from the linker script. Note that address of the `_text` is location counter which is defined as: +除了对 `.text` 区段的定义,我们还能从这个脚本文件中得知内核的默认物理地址与虚拟地址。`_text` 是一个地址计数器,对于 [x86_64](https://en.wikipedia.org/wiki/X86-64) 来说,它定义为: ``` . = __START_KERNEL; ``` -for the [x86_64](https://en.wikipedia.org/wiki/X86-64). 
The definition of the `__START_KERNEL` macro is located in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h) header file and represented by the sum of the base virtual address of the kernel mapping and physical start: +`__START_KERNEL` 宏的定义在 [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h) 头文件中,它由内核映射的虚拟基址与物理起始地址相加得到: ```C #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) #define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN) ``` -Or in other words: +换句话说: -* Base physical address of the Linux kernel - `0x1000000`; -* Base virtual address of the Linux kernel - `0xffffffff81000000`. +* Linux 内核的物理基址 - `0x1000000`; +* Linux 内核的虚拟基址 - `0xffffffff81000000`. -Now we know default physical and virtual addresses of the `startup_64` routine, but to know actual addresses we must to calculate it with the following code: +现在我们知道了 `startup_64` 过程的默认物理地址与虚拟地址,但是真正的地址必须要通过下面的代码计算得到: ```assembly leaq _text(%rip), %rbp subq $_text - __START_KERNEL_map, %rbp ``` -Yes, it defined as `0x1000000`, but it may be different, for example if [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) is enabled. So our current goal is to calculate delta between `0x1000000` and where we actually loaded. Here we just put the `rip-relative` address to the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that compiled virtual address of the `_text` is `0xffffffff81000000` and the physical address of it is `0x1000000`.
The `__START_KERNEL_map` macro expands to the `0xffffffff80000000` address, so at the second line of the assembly code, we will get following expression: +没错,虽然定义为 `0x1000000`,但是仍然有可能变化,例如启用 [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) 的时候。所以我们当前的目标是计算 `0x1000000` 与实际加载地址的差。这里我们首先将 RIP 相对地址(`rip-relative`)放入 `rbp` 寄存器,并且从中减去 `$_text - __START_KERNEL_map`。我们已经知道,`_text` 在编译后的默认虚拟地址为 `0xffffffff81000000`,物理地址为 `0x1000000`。`__START_KERNEL_map` 宏将展开为 `0xffffffff80000000`,因此对于第二行汇编代码,我们将得到如下的表达式: ``` rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000) ``` -So, after the calculation, the `rbp` will contain `0` which represents difference between addresses where we actually loaded and where the code was compiled. In our case `zero` means that the Linux kernel was loaded by default address and the [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) was disabled. +在计算过后,`rbp` 的值将为 `0`,代表了实际加载地址与编译后的默认地址之间的差值。在我们这个例子中,`0` 代表了 Linux 内核被加载到了默认地址,并且没有启用 [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux)。 -After we got the address of the `startup_64`, we need to do a check that this address is correctly aligned. We will do it with the following code: +在得到了 `startup_64` 的地址后,我们需要检查这个地址是否已经正确对齐。下面的代码将进行这项工作: ```assembly testl $~PMD_PAGE_MASK, %ebp jnz bad_address ``` -Here we just compare low part of the `rbp` register with the complemented value of the `PMD_PAGE_MASK`.
The `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [paging](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/Paging.html) about it) and defined as: +在这里我们将 `rbp` 寄存器的低 32 位与 `PMD_PAGE_MASK` 取反后的值进行测试。`PMD_PAGE_MASK` 是中层页目录(`Page middle directory`)的屏蔽值(相关信息请阅读 [paging](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/Paging.html) 一节),它的定义如下: ```C #define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1)) @@ -96,9 +99,9 @@ Here we just compare low part of the `rbp` register with the complemented value #define PMD_SHIFT 21 ``` -As we can easily calculate, `PMD_PAGE_SIZE` is `2` megabytes. Here we use standard formula for checking alignment and if `text` address is not aligned for `2` megabytes, we jump to `bad_address` label. +可以很容易得出 `PMD_PAGE_SIZE` 为 `2MB`。在这里我们使用标准公式来检查对齐问题,如果 `text` 的地址没有对齐到 `2MB`,则跳转到 `bad_address`。 -After this we check address that it is not too large by the checking of highest `18` bits: +在此之后,我们通过检查高 `18` 位来防止这个地址过大: ```assembly leaq _text(%rip), %rax @@ -106,18 +109,19 @@ After this we check address that it is not too large by the checking of highest jnz bad_address ``` -The address must not be greater than `46`-bits: +这个地址不能超过 `46` 位,即必须小于 2 的 46 次方: ```C #define MAX_PHYSMEM_BITS 46 ``` -Okay, we did some early checks and now we can move on. +OK,至此我们完成了一些初步的检查,可以继续进行后续的工作了。 -Fix base addresses of page tables + +修正页表基地址 -------------------------------------------------------------------------------- -The first step before we start to setup identity paging is to fixup following addresses: +在开始设置 Identity 分页之前,我们需要首先修正下面的地址: ```assembly addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) @@ -126,7 +130,7 @@ The first step before we start to setup identity paging is to fixup following ad addq %rbp, level2_fixmap_pgt + (506*8)(%rip) ``` -All of `early_level4_pgt`, `level3_kernel_pgt` and other address may be wrong if the `startup_64` is not equal to default `0x1000000` address.
The `rbp` register contains the delta address so we add to the certain entries of the `early_level4_pgt`, the `level3_kernel_pgt` and the `level2_fixmap_pgt`. Let's try to understand what these labels mean. First of all let's look at their definition: +如果 `startup_64` 的值不为默认的 `0x1000000` 的话,则包括 `early_level4_pgt`、`level3_kernel_pgt` 在内的很多地址都会不正确。`rbp` 寄存器中包含的是两者之间的差值,因此我们把它与 `early_level4_pgt`、`level3_kernel_pgt` 以及 `level2_fixmap_pgt` 中特定的项相加。首先我们来看一下它们的定义: ```assembly NEXT_PAGE(early_level4_pgt) @@ -151,25 +155,25 @@ NEXT_PAGE(level1_fixmap_pgt) .fill 512,8,0 ``` -Looks hard, but it isn't. First of all let's look at the `early_level4_pgt`. It starts with the (4096 - 8) bytes of zeros, it means that we don't use the first `511` entries. And after this we can see one `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map + _PAGE_TABLE` from it. As we know `__START_KERNEL_map` is a base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we will get physical address of the `level3_kernel_pgt`. Now let's look at `_PAGE_TABLE`, it is just page entry access rights: +看起来很难理解,实则不然。首先我们来看一下 `early_level4_pgt`。它的前 (4096 - 8) 个字节全为 `0`,即它的前 `511` 项均不使用,之后的一项是 `level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE`。我们知道 `__START_KERNEL_map` 是内核的虚拟基地址,因此减去 `__START_KERNEL_map` 后就得到了 `level3_kernel_pgt` 的物理地址。现在我们来看一下 `_PAGE_TABLE`,它是页表项的访问权限: ```C #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ _PAGE_ACCESSED | _PAGE_DIRTY) ``` -You can read more about it in the [paging](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/Paging.html) part. +更多信息请阅读 [分页](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/Paging.html) 部分。 -The `level3_kernel_pgt` - stores two entries which map kernel space. At the start of it's definition, we can see that it is filled with zeros `L3_START_KERNEL` or `510` times.
Here the `L3_START_KERNEL` is the index in the page upper directory which contains `__START_KERNEL_map` address and it equals `510`. After this, we can see the definition of the two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has: +`level3_kernel_pgt` 中保存的两项用来映射内核空间,它的前 `510`(即 `L3_START_KERNEL`)项均为 `0`。`L3_START_KERNEL` 是 `__START_KERNEL_map` 地址在上层页目录(Page Upper Directory)中对应的索引,它等于 `510`。随后定义的就是 `level3_kernel_pgt` 的两个表项:`level2_kernel_pgt` 和 `level2_fixmap_pgt`。前者比较容易理解,它是一条页表项,包含了指向中层页目录的指针,用来映射内核空间,并且具有如下的访问权限: ```C #define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \ _PAGE_DIRTY) ``` -access rights. The second - `level2_fixmap_pgt` is a virtual addresses which can refer to any physical addresses even under kernel space. They represented by the one `level2_fixmap_pgt` entry and `10` megabytes hole for the [vsyscalls](https://lwn.net/Articles/446528/) mapping. The next `level2_kernel_pgt` calls the `PDMS` macro which creates `512` megabytes from the `__START_KERNEL_map` for kernel `.text` (after these `512` megabytes will be modules memory space). +`level2_fixmap_pgt` 是一组虚拟地址,它们即使在内核空间中也可以指向任意的物理地址。它由一条 `level2_fixmap_pgt` 表项和一个用于 [vsyscalls](https://lwn.net/Articles/446528/) 映射的 `10` MB 空洞组成。而 `level2_kernel_pgt` 则调用了 `PMDS` 宏,从 `__START_KERNEL_map` 地址起为内核的 `.text` 创建了 `512` MB 大小的映射(这 `512` MB 空间的后面是模块内存空间)。 -Now, after we saw definitions of these symbols, let's get back to the code which is described at the beginning of the section. Remember that the `rbp` register contains delta between the address of the `startup_64` symbol which was got during kernel [linking](https://en.wikipedia.org/wiki/Linker_%28computing%29) and the actual address. So, for this moment, we just need to add add this delta to the base address of some page table entries, that they'll have correct addresses.
In our case these entries are: +现在,在看过了这些符号的定义之后,让我们回到本节开始时介绍的那几行代码。`rbp` 寄存器包含了实际地址与 `startup_64` 地址之差,其中 `startup_64` 的地址是在内核[链接](https://en.wikipedia.org/wiki/Linker_%28computing%29)时获得的。因此我们只需要把它与各个页表项的基地址相加,就能够得到正确的地址了。在我们的例子中,这些操作如下: ```assembly addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) @@ -178,9 +182,9 @@ Now, after we saw definitions of these symbols, let's get back to the code which addq %rbp, level2_fixmap_pgt + (506*8)(%rip) ``` -or the last entry of the `early_level4_pgt` which is the `level3_kernel_pgt`, last two entries of the `level3_kernel_pgt` which are the `level2_kernel_pgt` and the `level2_fixmap_pgt` and five hundreds seventh entry of the `level2_fixmap_pgt` which is `level1_fixmap_pgt` page directory. +换句话说,`early_level4_pgt` 的最后一项就是 `level3_kernel_pgt`,`level3_kernel_pgt` 的最后两项分别是 `level2_kernel_pgt` 和 `level2_fixmap_pgt`,`level2_fixmap_pgt` 的第 507 项就是 `level1_fixmap_pgt` 页目录。 -After all of this we will have: +在这之后我们就得到了: ``` early_level4_pgt[511] -> level3_kernel_pgt[0] @@ -190,19 +194,19 @@ level2_kernel_pgt[0] -> 512 MB kernel mapping level2_fixmap_pgt[507] -> level1_fixmap_pgt ``` -Note that we didn't fixup base address of the `early_level4_pgt` and some of other page table directories, because we will see this during of building/filling of structures for these page tables. As we corrected base addresses of the page tables, we can start to build it. +需要注意的是,我们并不修正 `early_level4_pgt` 以及其他页目录的基地址,我们会在构造、填充这些页目录结构的时候修正。在修正了页表基地址后,就可以开始构造这些页目录了。 -Identity mapping setup +Identity 映射设置 -------------------------------------------------------------------------------- -Now we can see the set up of identity mapping of early page tables. In Identity Mapped Paging, virtual addresses are mapped to physical addresses that have the same value, `1 : 1`. Let's look at it in detail.
First of all we get the `rip-relative` address of the `_text` and `_early_level4_pgt` and put they into `rdi` and `rbx` registers: +现在我们可以进入到对初期页表进行 Identity 映射的初始化过程了。在 Identity 映射分页中,虚拟地址会被映射到地址相同的物理地址上,即 `1 : 1`。下面我们来看一下细节。首先我们找到 `_text` 与 `early_level4_pgt` 的 RIP 相对地址,并把它们放入 `rdi` 与 `rbx` 寄存器中: ```assembly leaq _text(%rip), %rdi leaq early_level4_pgt(%rip), %rbx ``` -After this we store address of the `_text` in the `rax` and get the index of the page global directory entry which stores `_text` address, by shifting `_text` address on the `PGDIR_SHIFT`: +在此之后我们使用 `rax` 保存 `_text` 的地址。为了得到 `_text` 地址在全局页目录中对应表项的索引,我们把 `_text` 的地址右移 `PGDIR_SHIFT` 位: ```assembly movq %rdi, %rax @@ -213,7 +217,8 @@ After this we store address of the `_text` in the `rax` and get the index of the movq %rdx, 8(%rbx,%rax,8) ``` -where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories: +其中 `PGDIR_SHIFT` 为 `39`。`PGDIR_SHIFT` 表示的是全局页目录索引在虚拟地址中的位移。下面的宏定义了各级页目录对应的位移值: + ```C #define PGDIR_SHIFT 39 @@ -221,9 +226,9 @@ where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global dir #define PMD_SHIFT 21 ``` -After this we put the address of the first `level3_kernel_pgt` in the `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `level3_kernel_pgt` entries. +此后我们就将 `level3_kernel_pgt` 的地址放进 `rdx` 中,并将它的访问权限设置为 `_KERNPG_TABLE`(见上),然后将 `level3_kernel_pgt` 填入 `early_level4_pgt` 的两项中。 -After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now contains the address of the first entry of the `level3_kernel_pgt`) and put `rdi` (it now contains physical address of the `_text`) to the `rax`.
And after this we write addresses of the two page upper directory entries to the `level3_kernel_pgt`: +然后我们给 `rdx` 寄存器加上 `4096`(即 `early_level4_pgt` 的大小),并把 `rdi` 寄存器的值(即 `_text` 的物理地址)赋值给 `rax` 寄存器。之后我们把上层页目录中的两个项写入 `level3_kernel_pgt`: ```assembly addq $4096, %rdx @@ -236,7 +241,7 @@ After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now c movq %rdx, 4096(%rbx,%rax,8) ``` -In the next step we write addresses of the page middle directory entries to the `level2_kernel_pgt` and the last step is correcting of the kernel text+data virtual addresses: +下一步我们把中层页目录表项的地址写入 `level2_kernel_pgt`,然后修正内核的 text 和 data 的虚拟地址: ```assembly leaq level2_kernel_pgt(%rip), %rdi @@ -249,9 +254,9 @@ In the next step we write addresses of the page middle directory entries to the jne 1b ``` -Here we put the address of the `level2_kernel_pgt` to the `rdi` and address of the page table entry to the `r8` register. Next we check the present bit in the `level2_kernel_pgt` and if it is zero we're moving to the next page by adding 8 bytes to `rdi` which contains address of the `level2_kernel_pgt`. After this we compare it with `r8` (contains address of the page table entry) and go back to label `1` or move forward. +这里首先把 `level2_kernel_pgt` 的地址赋值给 `rdi`,并把页表项的地址赋值给 `r8` 寄存器。下一步我们检查 `level2_kernel_pgt` 中的存在位(present bit),如果其为 0,就给 `rdi` 加上 `8`(即一个表项的大小)以便指向下一个表项。然后我们将其与 `r8`(即页表项的地址)作比较,不相等的话就跳转回前面的标签 `1`,反之则继续运行。 -In the next step we correct `phys_base` physical address with `rbp` (contains physical address of the `_text`), put physical address of the `early_level4_pgt` and jump to label `1`: +接下来我们使用 `rbp`(即 `_text` 的物理地址)来修正 `phys_base` 物理地址。将 `early_level4_pgt` 的物理地址与 `rbp` 相加,然后跳转至标签 `1`: ```assembly addq %rbp, phys_base(%rip) @@ -259,12 +264,12 @@ In the next step we correct `phys_base` physical address with `rbp` (contains ph jmp 1f ``` -where `phys_base` matches the first entry of the `level2_kernel_pgt` which is `512` MB kernel mapping.
+其中 `phys_base` 与 `level2_kernel_pgt` 第一项相同,为 `512` MB 的内核映射。 -Last preparation before jump at the kernel entry point +跳转至内核入口点之前的最后准备 -------------------------------------------------------------------------------- -After that we jump to the label `1` we enable `PAE`, `PGE` (Paging Global Extension) and put the physical address of the `phys_base` (see above) to the `rax` register and fill `cr3` register with it: +此后我们就跳转至标签 `1` 来开启 `PAE` 和 `PGE`(Paging Global Extension),并且将 `phys_base` 的物理地址(见上)放入 `rax` 寄存器,再将其写入 `cr3` 寄存器: ```assembly 1: @@ -275,7 +280,8 @@ After that we jump to the label `1` we enable `PAE`, `PGE` (Paging Global Extens movq %rax, %cr3 ``` -In the next step we check that CPU supports [NX](http://en.wikipedia.org/wiki/NX_bit) bit with: +接下来我们检查 CPU 是否支持 [NX](http://en.wikipedia.org/wiki/NX_bit) 位: + ```assembly movl $0x80000001, %eax @@ -283,16 +289,18 @@ In the next step we check that CPU supports [NX](http://en.wikipedia.org/wiki/NX movl %edx,%edi ``` -We put `0x80000001` value to the `eax` and execute `cpuid` instruction for getting the extended processor info and feature bits. The result will be in the `edx` register which we put to the `edi`. +首先将 `0x80000001` 放入 `eax` 中,然后执行 `cpuid` 指令来得到扩展的处理器信息和特性位。这条指令的结果会存放在 `edx` 中,我们把它再放到 `edi` 里。 + +现在我们把 `MSR_EFER`(即 `0xc0000080`)放入 `ecx`,然后执行 `rdmsr` 指令来读取 CPU 中的 Model Specific Register (MSR)。 -Now we put `0xc0000080` or `MSR_EFER` to the `ecx` and call `rdmsr` instruction for the reading model specific register. ```assembly movl $MSR_EFER, %ecx rdmsr ``` -The result will be in the `edx:eax`. General view of the `EFER` is following: +返回结果将存放于 `edx:eax`。下面展示了 `EFER` 各个位的含义: + ``` 63 32 @@ -309,7 +317,7 @@ The result will be in the `edx:eax`. General view of the `EFER` is following: -------------------------------------------------------------------------------- ``` -We will not see all fields in details here, but we will learn about this and other `MSRs` in a special part about it.
As we read `EFER` to the `edx:eax`, we check `_EFER_SCE` or zero bit which is `System Call Extensions` with `btsl` instruction and set it to one. By the setting `SCE` bit we enable `SYSCALL` and `SYSRET` instructions. In the next step we check 20th bit in the `edi`, remember that this register stores result of the `cpuid` (see above). If `20` bit is set (`NX` bit) we just write `EFER_SCE` to the model specific register. +在这里我们不会介绍每一个位的含义,没有涉及到的位和其他的 MSR 将会在专门的部分介绍。在我们将 `EFER` 读入 `edx:eax` 之后,通过 `btsl` 将 `_EFER_SCE`(即第 0 位,`System Call Extensions`)置 1。设置 `SCE` 位将会启用 `SYSCALL` 以及 `SYSRET` 指令。下一步我们检查 `edi`(即 `cpuid` 的结果,见上)中的第 `20` 位,即 `NX` 位:如果它没有置位,我们就只把含 `SCE` 的值写入 MSR;否则还会一并设置 `_EFER_NX`。 ```assembly btsl $_EFER_SCE, %eax @@ -320,17 +328,16 @@ We will not see all fields in details here, but we will learn about this and oth 1: wrmsr ``` -If the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is supported we enable `_EFER_NX` and write it too, with the `wrmsr` instruction. After the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is set, we set some bits in the `cr0` [control register](https://en.wikipedia.org/wiki/Control_register), namely: +如果支持 [NX](https://en.wikipedia.org/wiki/NX_bit),那么我们就把 `_EFER_NX` 也置位,并通过 `wrmsr` 指令一同写入。在设置了 [NX](https://en.wikipedia.org/wiki/NX_bit) 位之后,还要对 `cr0`([control register](https://en.wikipedia.org/wiki/Control_register))中的一些位进行设置: -* `X86_CR0_PE` - system is in protected mode; -* `X86_CR0_MP` - controls interaction of WAIT/FWAIT instructions with TS flag in CR0; -* `X86_CR0_ET` - on the 386, it allowed to specify whether the external math coprocessor was an 80287 or 80387; -* `X86_CR0_NE` - enable internal x87 floating point error reporting when set, else enables PC style x87 error detection; -* `X86_CR0_WP` - when set, the CPU can't write to read-only pages when privilege level is 0; -* `X86_CR0_AM` - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3; -* `X86_CR0_PG` - enable paging.
-by the execution following assembly code: +* `X86_CR0_PE` - 系统处于保护模式; +* `X86_CR0_MP` - 与 CR0 的 TS 标志位一同控制 WAIT/FWAIT 指令的功能; +* `X86_CR0_ET` - 386 允许指定外部数学协处理器为 80287 或 80387; +* `X86_CR0_NE` - 如果置位,则启用内置的 x87 浮点错误报告,否则启用 PC 风格的 x87 错误检测; +* `X86_CR0_WP` - 如果置位,则 CPU 在特权等级为 0 时无法写入只读内存页; +* `X86_CR0_AM` - 当 AM 位置位、EFLAGS 中的 AC 位置位、特权等级为 3 时,进行对齐检查; +* `X86_CR0_PG` - 启用分页。 ```assembly #define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \ @@ -340,7 +347,7 @@ movl $CR0_STATE, %eax movq %rax, %cr0 ``` -We already know that to run any code, and even more [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code from assembly, we need to setup a stack. As always, we are doing it by the setting of [stack pointer](https://en.wikipedia.org/wiki/Stack_register) to a correct place in memory and resetting [flags](https://en.wikipedia.org/wiki/FLAGS_register) register after this: +为了从汇编执行 [C 语言](https://en.wikipedia.org/wiki/C_%28programming_language%29)代码,我们需要建立一个栈。首先将[栈指针](https://en.wikipedia.org/wiki/Stack_register)指向内存中一个合适的区域,然后重置 [FLAGS 寄存器](https://en.wikipedia.org/wiki/FLAGS_register): ```assembly movq stack_start(%rip), %rsp @@ -348,14 +355,14 @@ pushq $0 popfq ``` -The most interesting thing here is the `stack_start`. It defined in the same [source](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) code file and looks like: +在这里最有意思的地方在于 `stack_start`。它也定义在[当前的源文件](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S)中: ```assembly GLOBAL(stack_start) .quad init_thread_union+THREAD_SIZE-8 ``` -The `GLOBAL` is already familiar to us from.
It defined in the [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) header file expands to the `global` symbol definition: +对于 `GLOBAL` 我们应该很熟悉了。它在 [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) 头文件中定义如下: ```C #define GLOBAL(name) \ @@ -363,16 +370,15 @@ The `GLOBAL` is already familiar to us from. It defined in the [arch/x86/include name: ``` -The `THREAD_SIZE` macro is defined in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h) header file and depends on value of the `KASAN_STACK_ORDER` macro: +`THREAD_SIZE` 定义在 [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h),它依赖于 `KASAN_STACK_ORDER` 的值: ```C #define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) ``` -We consider when the [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) is disabled and the `PAGE_SIZE` is `4096` bytes. So the `THREAD_SIZE` will expands to `16` kilobytes and represents size of the stack of a thread. Why is `thread`? You may already know that each [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may have parent [processes](https://en.wikipedia.org/wiki/Parent_process) and [child](https://en.wikipedia.org/wiki/Child_process) processes. Actually, a parent process and child process differ in stack. A new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by the [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with the `thread_info` structure.
+首先来考虑当禁用了 [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) 并且 `PAGE_SIZE` 大小为 4096 时的情况。此时 `THREAD_SIZE` 将为 `16` KB,代表了一个线程的栈的大小。为什么是`线程`?我们知道每一个[进程](https://en.wikipedia.org/wiki/Process_%28computing%29)可能会有[父进程](https://en.wikipedia.org/wiki/Parent_process)和[子进程](https://en.wikipedia.org/wiki/Child_process)。事实上,父进程和子进程使用不同的栈空间,每一个新进程都会拥有一个新的内核栈。在 Linux 内核中,这个栈由一个包含 `thread_info` 结构的 [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B)(即 `thread_union`)表示: -And as we can see the `init_thread_union` is represented by the `thread_union`, which defined as: ```C union thread_union { @@ -381,14 +387,14 @@ union thread_union { }; ``` -and `init_thread_union` looks like: +而 `init_thread_union` 定义如下: ```C union thread_union init_thread_union __init_task_data = { INIT_THREAD_INFO(init_task) }; ``` -Where the `INIT_THREAD_INFO` macro takes `task_struct` structure which represents process descriptor in the Linux kernel and does some basic initialization of the given `task_struct` structure: +其中 `INIT_THREAD_INFO` 接受 `task_struct` 结构类型的参数,并进行一些初始化操作。`task_struct` 结构在内核中代表了对进程的描述: ```C #define INIT_THREAD_INFO(tsk) \ @@ -400,7 +406,7 @@ Where the `INIT_THREAD_INFO` macro takes `task_struct` structure which represent } ``` -So, the `thread_union` contains low-level information about a process and process's stack and placed in the bottom of stack: +因此,`thread_union` 包含了关于一个进程的低级信息,并且位于进程栈底: ``` +-----------------------+ @@ -418,15 +424,15 @@ So, the `thread_union` contains low-level information about a process and proces +-----------------------+ ``` -Note that we reserve `8` bytes at the to of stack. This is necessary to guarantee illegal access of the next page memory.
+需要注意的是我们在栈顶保留了 `8` 个字节的空间,用来防止对下一个内存页的非法访问。 -After the early boot stack is set, to update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with `lgdt` instruction: +在初期启动栈设置好之后,使用 `lgdt` 指令来更新[全局描述符表](https://en.wikipedia.org/wiki/Global_Descriptor_Table): ```assembly lgdt early_gdt_descr(%rip) ``` -where the `early_gdt_descr` is defined as: +其中 `early_gdt_descr` 定义如下: ```assembly early_gdt_descr: @@ -435,13 +441,13 @@ early_gdt_descr_base: .quad INIT_PER_CPU_VAR(gdt_page) ``` -We need to reload `Global Descriptor Table` because now kernel works in the low userspace addresses, but soon kernel will work in it's own space. Now let's look at the definition of `early_gdt_descr`. Global Descriptor Table contains `32` entries: +需要重新加载 `全局描述符表` 的原因是,虽然目前内核工作在用户空间的低地址中,但很快内核将会在它自己的内存地址空间中运行。下面让我们来看一下 `early_gdt_descr` 的定义。全局描述符表包含了 32 项,用于内核代码、数据、线程局部存储段等: ```C #define GDT_ENTRIES 32 ``` -for kernel code, data, thread local storage segments and etc... it's simple. Now let's look at the `early_gdt_descr_base`. First of `gdt_page` defined as: +现在来看一下 `early_gdt_descr_base`。首先,`gdt_page` 的定义在 [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) 中: ```C struct gdt_page { @@ -449,7 +455,7 @@ struct gdt_page { } __attribute__((aligned(PAGE_SIZE))); ``` -in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h). It contains one field `gdt` which is array of the `desc_struct` structure which is defined as: +它只包含一个由 `desc_struct` 结构组成的数组 `gdt`。`desc_struct` 定义如下: ```C struct desc_struct { @@ -468,24 +474,26 @@ struct desc_struct { } __attribute__((packed)); ``` -and presents familiar to us `GDT` descriptor. Also we can note that `gdt_page` structure aligned to `PAGE_SIZE` which is `4096` bytes. It means that `gdt` will occupy one page. Now let's try to understand what is `INIT_PER_CPU_VAR`.
`INIT_PER_CPU_VAR` is a macro which defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) and just concats `init_per_cpu__` with the given parameter: +它跟 `GDT` 描述符的定义很像。同时需要注意的是,`gdt_page` 结构是按 `PAGE_SIZE`(`4096`)对齐的,即 `gdt` 将会占用一页内存。 + +下面我们来看一下 `INIT_PER_CPU_VAR`,它定义在 [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h),只是将给定的参数与 `init_per_cpu__` 连接起来: ```C #define INIT_PER_CPU_VAR(var) init_per_cpu__##var ``` -After the `INIT_PER_CPU_VAR` macro will be expanded, we will have `init_per_cpu__gdt_page`. We can see in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S): +所以在宏展开之后,我们会得到 `init_per_cpu__gdt_page`。而在 [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) 中可以发现: ``` #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load INIT_PER_CPU(gdt_page); ``` -As we got `init_per_cpu__gdt_page` in `INIT_PER_CPU_VAR` and `INIT_PER_CPU` macro from linker script will be expanded we will get offset from the `__per_cpu_load`. After this calculations, we will have correct base address of the new GDT. +`INIT_PER_CPU` 展开后也将得到 `init_per_cpu__gdt_page`,并将它的值设置为相对于 `__per_cpu_load` 的偏移量。这样,我们就得到了新 GDT 的正确基地址。 -Generally per-CPU variables is a 2.6 kernel feature. You can understand what it is from its name. When we create `per-CPU` variable, each CPU will have will have its own copy of this variable. Here we creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with its own copy of variable and etc... So every core on multiprocessor will have its own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core.
You can read in details about `per-CPU` variables in the [Theory/per-cpu](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html) post.
+per-CPU 变量是 2.6 内核引入的特性。顾名思义,当我们创建一个 `per-CPU` 变量时,每个 CPU 都会拥有一份它自己的拷贝,在这里我们创建的是 `gdt_page` 这个 per-CPU 变量。这种类型的变量有很多优点,比如每个 CPU 都只访问自己的拷贝,因而不需要加锁等等。因此在多处理器的情况下,每一个处理器核心都将拥有一份自己的 `GDT` 表,其中的每一项都代表了一块内存,这块内存可以由在这个核心上运行的线程访问。[Theory/per-cpu](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html) 一节有关于 `per-CPU` 变量的更详细的介绍。

-As we loaded new Global Descriptor Table, we reload segments as we did it every time:
+在加载好了新的全局描述符表之后,跟之前一样,我们重新加载一下各个段寄存器:

```assembly
	xorl %eax,%eax
@@ -496,7 +504,7 @@ As we loaded new Global Descriptor Table, we reload segments as we did it every
	movl %eax,%gs
```

-After all of these steps we set up `gs` register that it post to the `irqstack` which represents special stack where [interrupts](https://en.wikipedia.org/wiki/Interrupt) will be handled on:
+在所有这些步骤都结束后,我们需要设置一下 `gs` 寄存器,令它指向一个特殊的栈 `irqstack`,用于处理[中断](https://en.wikipedia.org/wiki/Interrupt):

```assembly
	movl	$MSR_GS_BASE,%ecx
@@ -505,13 +513,15 @@ After all of these steps we set up `gs` register that it post to the `irqstack`
	wrmsr
```

-where `MSR_GS_BASE` is:
+其中,`MSR_GS_BASE` 为:

```C
#define MSR_GS_BASE 0xc0000101
```

-We need to put `MSR_GS_BASE` to the `ecx` register and load data from the `eax` and `edx` (which are point to the `initial_gs`) with `wrmsr` instruction. We don't use `cs`, `fs`, `ds` and `ss` segment registers for addressing in the 64-bit mode, but `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw it in the real mode for `cs`) and this part contains descriptor which mapped to [Model Specific Registers](https://en.wikipedia.org/wiki/Model-specific_register). So we can see above `0xc0000101` is a `gs.base` MSR address. 
When a [system call](https://en.wikipedia.org/wiki/System_call) or [interrupt](https://en.wikipedia.org/wiki/Interrupt) occurred, there is no kernel stack at the entry point, so the value of the `MSR_GS_BASE` will store address of the interrupt stack.
+我们需要把 `MSR_GS_BASE` 放入 `ecx` 寄存器,然后利用 `wrmsr` 指令,把 `eax` 和 `edx` 中的数据(它们合起来指向 `initial_gs`)写入这个 MSR。`cs`、`fs`、`ds` 和 `ss` 段寄存器在 64 位模式下不用来寻址,但 `fs` 和 `gs` 可以使用。`fs` 和 `gs` 有一个隐含的部分(与实模式下的 `cs` 段寄存器类似),这个隐含部分存储了一个描述符,其映射到[模型特定寄存器(Model Specific Registers)](https://en.wikipedia.org/wiki/Model-specific_register)。因此上面的 `0xc0000101` 就是 `gs.base` 的 MSR 地址。当发生[系统调用](https://en.wikipedia.org/wiki/System_call)或者[中断](https://en.wikipedia.org/wiki/Interrupt)时,入口点处并没有内核栈,因此 `MSR_GS_BASE` 将会用来存放中断栈的地址。
+
+接下来我们把实模式中的 bootparam 结构的地址放入 `rdi`(记住,`rsi` 从一开始就保存着指向这个结构的指针),然后跳转到C语言代码:

In the next step we put the address of the real mode bootparam structure to the `rdi` (remember `rsi` holds pointer to this structure from the start) and jump to the C code with:

```assembly
	movq	initial_code(%rip),%rax
	pushq	$0
	pushq	$__KERNEL_CS
	pushq	%rax
@@ -523,7 +533,7 @@ In the next step we put the address of the real mode bootparam structure to the
	lretq
```

-Here we put the address of the `initial_code` to the `rax` and push fake address, `__KERNEL_CS` and the address of the `initial_code` to the stack. After this we can see `lretq` instruction which means that after it return address will be extracted from stack (now there is address of the `initial_code`) and jump there. `initial_code` is defined in the same source code file and looks:
+这里我们把 `initial_code` 的地址放入 `rax` 中,并且向栈里分别压入一个无用的地址、`__KERNEL_CS` 和 `initial_code` 的地址。随后的 `lretq` 指令会从栈上弹出返回地址(此时就是 `initial_code` 的地址)并跳转过去。`initial_code` 同样定义在这个文件里:

```assembly
	.balign	8
	GLOBAL(initial_code)
	...
```

-As we can see `initial_code` contains address of the `x86_64_start_kernel`, which is defined in the [arch/x86/kerne/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this:
+可以看到 `initial_code` 包含了 `x86_64_start_kernel` 的地址,其定义在 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c):

```C
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
@@ -544,16 +554,16 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
}
```

-It has one argument is a `real_mode_data` (remember that we passed address of the real mode data to the `rdi` register previously).
+这个函数接受一个参数 `real_mode_data`(还记得吗,我们刚才把实模式数据的地址传入了 `rdi` 寄存器)。

-This is first C code in the kernel!
+这是内核中第一个执行的C语言代码!

-Next to start_kernel
+走进 start_kernel

--------------------------------------------------------------------------------

-We need to see last preparations before we can see "kernel entry point" - start_kernel function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489).
+在我们真正到达“内核入口点”,即 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) 中的 `start_kernel` 函数之前,还需要做一些最后的准备工作。

-First of all we can see some checks in the `x86_64_start_kernel` function:
+首先在 `x86_64_start_kernel` 函数中可以看到一些检查:

```C
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
@@ -566,20 +576,24 @@ BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
```

+这些检查包括:模块空间的虚拟地址不能低于内核 text 段的基地址 `__START_KERNEL_map`、包含模块的内核 text 段不小于内核镜像等等。
+
+`BUILD_BUG_ON` 宏定义如下:
+
There are checks for different things like virtual addresses of modules space is not fewer than base address of the kernel text - `__STAT_KERNEL_map`, that kernel text with modules is not less than image of the kernel and etc... 
`BUILD_BUG_ON` is a macro which looks as:

```C
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
```

-Let's try to understand how this trick works. Let's take for example first condition: `MODULES_VADDR < __START_KERNEL_map`. `!!conditions` is the same that `condition != 0`. So it means if `MODULES_VADDR < __START_KERNEL_map` is true, we will get `1` in the `!!(condition)` or zero if not. After `2*!!(condition)` we will get or `2` or `0`. In the end of calculations we can get two different behaviors:
+我们来考虑一下这个技巧是怎么工作的。以第一个条件 `MODULES_VADDR < __START_KERNEL_map` 为例:`!!(condition)` 等价于 `condition != 0`,这代表如果 `MODULES_VADDR < __START_KERNEL_map` 为真,则 `!!(condition)` 为 `1`,否则为 `0`。随后 `2*!!(condition)` 将为 `2` 或 `0`。因此,这个宏将可能产生两种不同的行为:

-* We will have compilation error, because try to get size of the char array with negative index (as can be in our case, because `MODULES_VADDR` can't be less than `__START_KERNEL_map` will be in our case);
-* No compilation errors.
+* 编译错误。因为我们尝试获取一个长度为负的字符数组的大小;
+* 没有编译错误。

-That's all. So interesting C trick for getting compile error which depends on some constants.
+这是一个有趣的 C 语言技巧:利用常量表达式,在编译期就把错误检查出来。

-In the next step we can see call of the `cr4_init_shadow` function which stores shadow copy of the `cr4` per cpu. Context switches can change bits in the `cr4` so we need to store `cr4` for each CPU. And after this we can see call of the `reset_early_page_tables` function where we resets all page global directory entries and write new pointer to the PGT in `cr3`:
+接下来 `x86_64_start_kernel` 调用了 `cr4_init_shadow` 函数,它为每个 CPU 保存一份 `cr4` 的影子拷贝(shadow copy)。上下文切换可能会修改 `cr4` 中的位,因此需要为每个 CPU 保存一份 `cr4` 的内容。在这之后将会调用 `reset_early_page_tables` 函数,它重置了所有的全局页目录项,同时向 `cr3` 中重新写入了新的全局页目录表的地址:

```C
	for (i = 0; i < PTRS_PER_PGD-1; i++)
@@ -590,26 +604,25 @@ next_early_pgt = 0;
	write_cr3(__pa_nodebug(early_level4_pgt));
```

-Soon we will build new page tables. 
Here we can see that we go through all Page Global Directory Entries (`PTRS_PER_PGD` is `512`) in the loop and make it zero. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write physical address of the `early_level4_pgt` to the `cr3`. `__pa_nodebug` is a macro which will be expanded to:
+很快我们就会建立新的页表。这里我们在循环中遍历了所有的全局页目录项(`PTRS_PER_PGD` 为 `512`),将它们清零。之后将 `next_early_pgt` 设置为 0(我们会在下一篇文章中介绍它的细节),并把 `early_level4_pgt` 的物理地址写入 `cr3`。`__pa_nodebug` 是一个宏,将被扩展为:

```C
((unsigned long)(x) - __START_KERNEL_map + phys_base)
```

-After this we clear `_bss` from the `__bss_stop` to `__bss_start` and the next step will be setup of the early `IDT` handlers, but it's big concept so we will see it in the next part.
+此后我们清空了 `_bss` 段(从 `__bss_start` 到 `__bss_stop`),下一步将是建立初期 `IDT`(中断描述符表)的处理程序。这部分内容很多,我们将留到下一部分再来探究。

-Conclusion
+总结

--------------------------------------------------------------------------------

-This is the end of the first part about linux kernel initialization.
+关于 Linux 内核初始化过程的第一部分到此结束。

-If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/MintCN/linux-insides-zh/issues/new).
+如果你有任何问题或建议,可以在 twitter 上联系我 [0xAX](https://twitter.com/0xAX),给我发[邮件](anotherworldofworld@gmail.com),或者新建一个 [issue](https://github.com/MintCN/linux-insides-zh/issues/new)。

-In the next part we will see initialization of the early interruption handlers, kernel space memory mapping and a lot more.
+在下一部分中,我们会看到初期中断处理程序的初始化、内核空间的内存映射等等。

-**Please note that English is not my first language and I am really sorry for any inconvenience. 
If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

-Links
+相关链接
--------------------------------------------------------------------------------

* [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register)
diff --git a/Initialization/linux-initialization-2.md b/Initialization/linux-initialization-2.md
index 3a307b1..0bc6002 100644
--- a/Initialization/linux-initialization-2.md
+++ b/Initialization/linux-initialization-2.md
@@ -1,38 +1,38 @@
-Kernel initialization. Part 2.
+内核初始化 第二部分
================================================================================

-Early interrupt and exception handling
+初期中断和异常处理
--------------------------------------------------------------------------------

-In the previous [part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) we stopped before setting of early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish early preparation before the main kernel code will start to work.
+在[上一部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html)中,我们停在了设置初期中断处理程序之前。目前我们已经处于解压缩后的 Linux 内核中了,还有了用于初期启动的基本[分页](https://en.wikipedia.org/wiki/Page_table)机制。我们当前的目标是在内核的主体代码开始工作前完成初期的准备工作。

-We already started to do this preparation in the previous [first](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) part of this [chapter](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html). We continue in this part and will know more about interrupt and exception handling. 
+我们已经在[本章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html)的[第一部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html)做了一些准备工作,这一部分我们会继续分析中断和异常处理相关的代码。

-Remember that we stopped before following loop:
+回忆一下,上一部分我们停在了下面这个循环之前:

```C
	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
		set_intr_gate(i, early_idt_handler_array[i]);
```

-from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) source code file. But before we started to sort out this code, we need to know about interrupts and handlers.
+这段代码位于 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c)。在分析这段代码之前,我们先来了解一些关于中断和中断处理程序的知识。

-Some theory
+理论
--------------------------------------------------------------------------------

-An interrupt is an event caused by software or hardware to the CPU. For example a user have pressed a key on keyboard. On interrupt, CPU stops the current task and transfer control to the special routine which is called - [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler). An interrupt handler handles and interrupt and transfer control back to the previously stopped task. We can split interrupts on three types:
+中断是一种由软件或硬件产生的、向 CPU 发出的事件。例如,用户按下键盘上的一个按键时就会产生中断。此时 CPU 会暂停当前的任务,并将控制流转到一个特殊的程序,即[中断处理程序(Interrupt Handler)](https://en.wikipedia.org/wiki/Interrupt_handler)。中断处理程序对中断进行处理,然后将控制权交还给之前暂停的任务。中断可以分为三类:

-* Software interrupts - when a software signals CPU that it needs kernel attention. These interrupts are generally used for system calls;
-* Hardware interrupts - when a hardware event happens, for example button is pressed on a keyboard;
-* Exceptions - interrupts generated by CPU, when the CPU detects error, for example division by zero or accessing a memory page which is not in RAM. 
+* 软件中断 - 由软件向 CPU 发出信号、表明需要内核提供服务时产生,这类中断通常用于系统调用;
+* 硬件中断 - 由硬件事件产生,例如键盘的按键被按下;
+* 异常 - 由 CPU 检测到错误时产生,例如发生了除零错误,或者访问了一个不在内存(RAM)中的内存页。

-Every interrupt and exception is assigned a unique number which called - `vector number`. `Vector number` can be any number from `0` to `255`. There is common practice to use first `32` vector numbers for exceptions, and vector numbers from `32` to `255` are used for user-defined interrupts. We can see it in the code above - `NUM_EXCEPTION_VECTORS`, which defined as:
+每一个中断和异常都由一个唯一的数来标识,这个数叫做`向量号`,它的取值范围是 `0` 到 `255`。实践中通常用前 `32` 个向量号来标识异常,`32` 到 `255` 则用于用户定义的中断。在上面的代码中可以看到 `NUM_EXCEPTION_VECTORS`,它定义为:

```C
#define NUM_EXCEPTION_VECTORS 32
```

-CPU uses vector number as an index in the `Interrupt Descriptor Table` (we will see description of it soon). CPU catch interrupts from the [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) or through it's pins. Following table shows `0-31` exceptions:
+CPU 会从 [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) 或者 CPU 引脚接收中断,并使用向量号作为`中断描述符表`的索引(稍后我们会介绍这张表)。下面的表中列出了 `0-31` 号异常:

```
----------------------------------------------------------------------------------------------
@@ -84,9 +84,9 @@ CPU uses vector number as an index in the `Interrupt Descriptor Table` (we will
----------------------------------------------------------------------------------------------
```

-To react on interrupt CPU uses special structure - Interrupt Descriptor Table or IDT. IDT is an array of 8-byte descriptors like Global Descriptor Table, but IDT entries are called `gates`. CPU multiplies vector number on 8 to find index of the IDT entry. But in 64-bit mode IDT is an array of 16-byte descriptors and CPU multiplies vector number on 16 to find index of the entry in the IDT. 
We remember from the previous part that CPU uses special `GDTR` register to locate Global Descriptor Table, so CPU uses special register `IDTR` for Interrupt Descriptor Table and `lidt` instruction for loading base address of the table into this register.
+为了能够对中断进行处理,CPU 使用了一种特殊的结构:中断描述符表(IDT)。IDT 由描述符组成,与全局描述符表一样,每个描述符占 8 个字节;不同的是,IDT 中的每一项被称为`门(gate)`。CPU 会把向量号乘以 8 来得到 IDT 中对应项的偏移;而在 64 位模式下,IDT 的每一项为 16 字节,因此要把向量号乘以 16。在前面我们已经见过,CPU 使用一个特殊的 `GDTR` 寄存器来存放全局描述符表的地址;中断描述符表也有一个类似的寄存器 `IDTR`,同时还有用于将表的基地址加载入这个寄存器的指令 `lidt`。

-64-bit mode IDT entry has following structure:
+64 位模式下 IDT 的每一项结构如下:

```
127 96
@@ -115,46 +115,46 @@ To react on interrupt CPU uses special structure - Interrupt Descriptor Table or
--------------------------------------------------------------------------------
```

-Where:
+其中:

-* `Offset` - is offset to entry point of an interrupt handler;
-* `DPL` - Descriptor Privilege Level;
-* `P` - Segment Present flag;
-* `Segment selector` - a code segment selector in GDT or LDT
-* `IST` - provides ability to switch to a new stack for interrupts handling.
+* `Offset` - 到中断处理程序入口点的偏移;
+* `DPL` - 描述符特权级别;
+* `P` - 段存在(Segment Present)标志;
+* `Segment selector` - 在 GDT 或 LDT 中的代码段选择子;
+* `IST` - 提供在中断处理时切换到新栈的能力。

-And the last `Type` field describes type of the `IDT` entry. There are three different kinds of handlers for interrupts:
+最后的 `Type` 域描述了 `IDT` 这一项的类型,中断的处理程序共分为三种:

-* Task descriptor
-* Interrupt descriptor
-* Trap descriptor
+* 任务描述符
+* 中断描述符
+* 陷阱描述符

-Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. Only one difference between these types is how CPU handles `IF` flag. If interrupt handler was accessed through interrupt gate, CPU clear the `IF` flag to prevent other interrupts while current interrupt handler executes. After that current interrupt handler executes, CPU sets the `IF` flag again with `iret` instruction. 
+中断和陷阱描述符都包含一个指向中断处理程序入口点的远(far)指针,二者唯一的不同在于 CPU 处理 `IF` 标志的方式。如果是经由中断门进入中断处理程序,CPU 会清除 `IF` 标志位,以避免当前中断处理程序执行期间被其他中断打断;中断处理程序执行完毕后,CPU 会在执行 `iret` 指令时重新置位 `IF` 标志。

-Other bits in the interrupt gate reserved and must be 0. Now let's look how CPU handles interrupts:
+中断门的其他位为保留位,必须为 0。下面我们来看一下 CPU 是如何处理中断的:

-* CPU save flags register, `CS`, and instruction pointer on the stack.
-* If interrupt causes an error code (like `#PF` for example), CPU saves an error on the stack after instruction pointer;
-* After interrupt handler executed, `iret` instruction used to return from it.
+* CPU 会在栈上保存标志寄存器、`cs` 段寄存器和指令指针;
+* 如果中断会产生错误码(比如 `#PF`),CPU 会把错误码也保存在栈上,位于指令指针之后;
+* 中断处理程序执行完毕后,由 `iret` 指令返回。

-Now let's back to code.
+OK,接下来我们继续分析代码。

-Fill and load IDT
+设置并加载 IDT
--------------------------------------------------------------------------------

-We stopped at the following point:
+我们分析到了如下代码:

```C
	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
		set_intr_gate(i, early_idt_handler_array[i]);
```

-Here we call `set_intr_gate` in the loop, which takes two parameters:
+这里循环内部调用了 `set_intr_gate`,它接受两个参数:

-* Number of an interrupt or `vector number`;
-* Address of the idt handler.
+* 中断号,即`向量号`;
+* 中断处理程序的地址。

-and inserts an interrupt gate to the `IDT` table which is represented by the `&idt_descr` array. First of all let's look on the `early_idt_handler_array` array. 
It is an array which is defined in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) header file contains addresses of the first `32` exception handlers:
+同时,这个函数还会将中断门插入到 `IDT` 表中,代码中的 `&idt_descr` 即代表 `IDT`。首先让我们来看一下 `early_idt_handler_array` 数组,它定义在 [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) 头文件中,包含了前 `32` 个异常处理程序的地址:

```C
#define EARLY_IDT_HANDLER_SIZE 9
@@ -163,11 +163,11 @@ and inserts an interrupt gate to the `IDT` table which is represented by the `&i
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
```

-The `early_idt_handler_array` is `288` bytes array which contains address of exception entry points every nine bytes. Every nine bytes of this array consist of two bytes optional instruction for pushing dummy error code if an exception does not provide it, two bytes instruction for pushing vector number to the stack and five bytes of `jump` to the common exception handler code.
+`early_idt_handler_array` 是一个大小为 `288` 字节的数组,每一项为 `9` 个字节:其中 2 个字节是可选指令,用于在异常本身没有提供错误码时向栈中压入一个默认错误码;2 个字节的指令用于向栈中压入向量号;剩余 5 个字节用于跳转到通用的异常处理程序。

-As we can see, We're filling only first 32 `IDT` entries in the loop, because all of the early setup runs with interrupts disabled, so there is no need to set up interrupt handlers for vectors greater than `32`. The `early_idt_handler_array` array contains generic idt handlers and we can find its definition in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file. For now we will skip it, but will look it soon. Before this we will look on the implementation of the `set_intr_gate` macro. 
+在上面的代码中,我们只通过一个循环向 `IDT` 中填入了前 32 项内容。这是因为在整个初期设置阶段,中断是禁用的,因此不需要为 `32` 号以后的向量设置处理程序。`early_idt_handler_array` 数组中存放的是通用的 idt 处理程序,其定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) 中。我们先暂时跳过这个数组的内容,看一下 `set_intr_gate` 的定义。

-The `set_intr_gate` macro is defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) header file and looks:
+`set_intr_gate` 宏定义在 [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h):

```C
#define set_intr_gate(n, addr)						\
@@ -180,7 +180,7 @@ The `set_intr_gate` macro is defined in the [arch/x86/include/asm/desc.h](https:
	} while (0)
```

-First of all it checks with that passed interrupt number is not greater than `255` with `BUG_ON` macro. We need to do this check because we can have only `256` interrupts. After this, it make a call of the `_set_gate` function which writes address of an interrupt gate to the `IDT`:
+首先,`BUG_ON` 宏确保了传入的中断向量号不会大于 `255`,因为我们最多只有 `256` 个中断。然后它调用了 `_set_gate` 函数,将中断门的地址写入 `IDT`:

```C
static inline void _set_gate(int gate, unsigned type, void *addr,
@@ -193,7 +193,7 @@ static inline void _set_gate(int gate, unsigned type, void *addr,
}
```

-At the start of `_set_gate` function we can see call of the `pack_gate` function which fills `gate_desc` structure with the given values:
+在 `_set_gate` 函数的开始,它调用了 `pack_gate` 函数。这个函数会使用给定的参数填充 `gate_desc` 结构:

```C
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
@@ -211,8 +211,7 @@ static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
	gate->offset_high	= PTR_HIGH(func);
}
```

-As I mentioned above, we fill gate descriptor in this function. We fill three parts of the address of the interrupt handler with the address which we got in the main loop (address of the interrupt handler entry point). 
We are using three following macros to split address on three parts:
+在这个函数里,我们把从主循环中得到的中断处理程序入口点地址拆成三个部分,填入门描述符中。下面的三个宏就用来做这个拆分工作:

```C
#define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
@@ -220,9 +219,9 @@ As I mentioned above, we fill gate descriptor in this pa
#define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
```

-With the first `PTR_LOW` macro we get the first `2` bytes of the address, with the second `PTR_MIDDLE` we get the second `2` bytes of the address and with the third `PTR_HIGH` macro we get the last `4` bytes of the address. Next we setup the segment selector for interrupt handler, it will be our kernel code segment - `__KERNEL_CS`. In the next step we fill `Interrupt Stack Table` and `Descriptor Privilege Level` (highest privilege level) with zeros. And we set `GAT_INTERRUPT` type in the end.
+调用 `PTR_LOW` 可以得到地址的低 `2` 个字节,调用 `PTR_MIDDLE` 可以得到地址中间的 `2` 个字节,调用 `PTR_HIGH` 则能够得到地址的高 `4` 个字节。接下来我们为中断处理程序设置段选择子,即内核代码段 `__KERNEL_CS`;再将 `Interrupt Stack Table` 和`描述符特权级别`(最高特权级)设置为 0;最后设置 `GATE_INTERRUPT` 类型。

-Now we have filled IDT entry and we can call `native_write_idt_entry` function which just copies filled `IDT` entry to the `IDT`:
+现在我们已经填充好了 IDT 中的一项,接下来调用 `native_write_idt_entry` 函数把它复制到 `IDT` 中:

```C
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
@@ -231,32 +230,32 @@ static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_
}
```

-After that main loop will finished, we will have filled `idt_table` array of `gate_desc` structures and we can load `Interrupt Descriptor table` with the call of the:
+主循环结束后,`idt_table`(一个 `gate_desc` 结构的数组)就已经填充完毕,我们可以通过下面的调用来加载`中断描述符表`:

```C
	load_idt((const struct desc_ptr *)&idt_descr);
```

-Where `idt_descr` is:
+其中,`idt_descr` 为:

```C
struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
```

-and `load_idt` just executes `lidt` instruction:
+`load_idt` 函数只是执行了一下 `lidt` 指令:

```C
asm 
volatile("lidt %0"::"m" (*dtr));
```

-You can note that there are calls of the `_trace_*` functions in the `_set_gate` and other functions. These functions fills `IDT` gates in the same manner that `_set_gate` but with one difference. These functions use `trace_idt_table` the `Interrupt Descriptor Table` instead of `idt_table` for tracepoints (we will cover this theme in the another part).
+你可能已经注意到,代码中还有对 `_trace_*` 一类函数的调用。这些函数以与 `_set_gate` 相同的方式设置 `IDT` 门,但仅有一处不同:它们填充的不是 `idt_table`,而是用于追踪点(tracepoint)的 `trace_idt_table`(我们将会在其他章节介绍这一部分)。

-Okay, now we have filled and loaded `Interrupt Descriptor Table`, we know how the CPU acts during an interrupt. So now time to deal with interrupts handlers.
+好了,至此我们已经填充并加载了`中断描述符表`,也知道了 CPU 在中断发生时如何动作。接下来就该研究中断处理程序本身了。

-Early interrupts handlers
+初期中断处理程序
--------------------------------------------------------------------------------

-As you can read above, we filled `IDT` with the address of the `early_idt_handler_array`. We can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file:
+在上面的代码中,我们用 `early_idt_handler_array` 的地址填充了 `IDT`,这个数组定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):

```assembly
	.globl early_idt_handler_array
@@ -273,7 +272,7 @@ early_idt_handlers:
	.endr
```

-We can see here, interrupt handlers generation for the first `32` exceptions. We check here, if exception has an error code then we do nothing, if exception does not return error code, we push zero to the stack. We do it for that would stack was uniform. After that we push exception number on the stack and jump on the `early_idt_handler_array` which is generic interrupt handler for now. As we may see above, every nine bytes of the `early_idt_handler_array` array consists from optional push of an error code, push of `vector number` and jump instruction. 
We can see it in the output of the `objdump` util:
+这段代码自动为前 `32` 个异常生成了中断处理程序。为了统一栈的布局,如果一个异常本身会提供错误码,我们就什么都不做;如果它没有提供错误码,我们就手动在栈中压入一个 `0`。然后再在栈中压入中断向量号,最后跳转到目前的通用中断处理程序 `early_idt_handler_common`。我们可以通过 `objdump` 命令的输出一探究竟:

```
$ objdump -D vmlinux
@@ -294,7 +293,7 @@ ffffffff81fe5014: 6a 02 pushq $0x2
...
```

-As i wrote above, CPU pushes flag register, `CS` and `RIP` on the stack. So before `early_idt_handler` will be executed, stack will contain following data:
+如前所述,中断发生时 CPU 会在栈上压入标志寄存器、`CS` 段寄存器和 `RIP` 寄存器的内容。因此在 `early_idt_handler` 执行前,栈的布局如下:

```
|--------------------|
@@ -305,14 +304,14 @@ As i wrote above, CPU pushes flag register, `CS` and `RIP` on the stack. So befo
|--------------------|
```

-Now let's look on the `early_idt_handler_common` implementation. It locates in the same [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) assembly file and first of all we can see check for [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). We don't need to handle it, so just ignore it in the `early_idt_handler_common`:
+下面我们来看一下 `early_idt_handler_common` 的实现。它也定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) 文件中。首先它会检查当前中断是否为[不可屏蔽中断(NMI)](http://en.wikipedia.org/wiki/Non-maskable_interrupt),这类中断我们不需要处理,直接忽略即可:

```assembly
	cmpl $2,(%rsp)
	je .Lis_nmi
```

-where `is_nmi`:
+其中 `is_nmi` 为:

```assembly
is_nmi:
@@ -320,7 +319,9 @@ is_nmi:
	INTERRUPT_RETURN
```

-drops an error code and vector number from the stack and call `INTERRUPT_RETURN` which is just expands to the `iretq` instruction. 
As we checked the vector number and it is not `NMI`, we check `early_recursion_flag` to prevent recursion in the `early_idt_handler_common` and if it's correct we save general registers on the stack:
+这段程序首先从栈顶弹出错误码和中断向量号,然后调用 `INTERRUPT_RETURN`(即 `iretq` 指令)直接返回。
+
+如果当前中断不是 `NMI`,则先检查 `early_recursion_flag`,以避免在 `early_idt_handler_common` 中递归地产生中断。如果没有问题,就把通用寄存器保存到栈上,以防从中断处理程序返回时寄存器的内容被破坏:

```assembly
	pushq %rax
@@ -334,16 +335,14 @@ drops an error code and vector number from the stack and call `INTERRUPT_RETURN`
	pushq %r11
```

-We need to do it to prevent wrong values of registers when we return from the interrupt handler. After this we check segment selector in the stack:
+然后我们检查栈上的段选择子:

```assembly
	cmpl $__KERNEL_CS,96(%rsp)
	jne 11f
```

-which must be equal to the kernel code segment and if it is not we jump on label `11` which prints `PANIC` message and makes stack dump.
-
-After the code segment was checked, we check the vector number, and if it is `#PF` or [Page Fault](https://en.wikipedia.org/wiki/Page_fault), we put value from the `cr2` to the `rdi` register and call `early_make_pgtable` (well see it soon):
+这个段选择子必须是内核代码段,否则就跳转到标签 `11`,输出 `PANIC` 信息并打印栈的内容。检查完代码段之后,我们再检查向量号,如果是 `#PF`,即[缺页中断(Page Fault)](https://en.wikipedia.org/wiki/Page_fault),就把 `cr2` 寄存器中的值放入 `rdi`,然后调用 `early_make_pgtable`(详见后文):

```assembly
	cmpl $14,72(%rsp)
@@ -354,8 +353,7 @@ After the code segment was checked, we check the vector number, and if it is `#P
	jz 20f
```

-If vector number is not `#PF`, we restore general purpose registers from the stack:
-
+如果向量号不是 `#PF`,那么就恢复通用寄存器:

```assembly
	popq %r11
@@ -368,16 +366,16 @@ If vector number is not `#PF`, we restore general purpose registers from the sta
	popq %rax
```

-and exit from the handler with `iret`.
+并通过 `iret` 从中断处理程序返回。

-It is the end of the first interrupt handler. Note that it is very early interrupt handler, so it handles only Page Fault now. 
We will see handlers for the other interrupts, but now let's look on the page fault handler.
+第一个中断处理程序到这里就结束了。由于它只是一个很早期的中断处理程序,因此目前只处理缺页中断。其他中断的处理程序我们之后再分析,下面先来看一下缺页中断处理程序。

-Page fault handling
+缺页中断处理程序
--------------------------------------------------------------------------------

-In the previous paragraph we saw first early interrupt handler which checks interrupt number for page fault and calls `early_make_pgtable` for building new page tables if it is. We need to have `#PF` handler in this step because there are plans to add ability to load kernel above `4G` and make access to `boot_params` structure above the 4G.
+在上一节中我们看到了第一个初期中断处理程序,它检查中断号是否为缺页中断,如果是就调用 `early_make_pgtable` 建立新的页表。之所以在这个阶段就需要 `#PF` 处理程序,是因为内核计划支持加载到 `4G` 以上的地址,并且需要能够访问位于 `4G` 以上的 `boot_params` 结构体。

-You can find implementation of the `early_make_pgtable` in the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and takes one parameter - address from the `cr2` register, which caused Page Fault. Let's look on it:
+`early_make_pgtable` 的实现在 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c),它接受一个参数:从 `cr2` 寄存器得到的地址,正是这个地址引发了缺页中断。下面让我们来看一下:

```C
int __init early_make_pgtable(unsigned long address)
@@ -393,60 +391,61 @@ int __init early_make_pgtable(unsigned long address)
}
```

-It starts from the definition of some variables which have `*val_t` types. All of these types are just:
+首先它定义了一些 `*val_t` 类型的变量。这些类型均为:

```C
typedef unsigned long   pgdval_t;
```

-Also we will operate with the `*_t` (not val) types, for example `pgd_t` and etc... 
All of these types defined in the [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h) and represent structures like this:
+此外,我们还会遇到 `*_t`(不带 val)的类型,比如 `pgd_t` 等等。这些类型都定义在 [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h),形式如下:

```C
typedef struct { pgdval_t pgd; } pgd_t;
```

-For example,
+例如,

```C
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
```

-Here `early_level4_pgt` presents early top-level page table directory which consists of an array of `pgd_t` types and `pgd` points to low-level page entries.
+在这里 `early_level4_pgt` 代表了初期顶层页表目录,它是一个 `pgd_t` 类型的数组,其中的 `pgd` 指向下一级的页表项。

-After we made the check that we have no invalid address, we're getting the address of the Page Global Directory entry which contains `#PF` address and put it's value to the `pgd` variable:
+在确认地址合法之后,我们取得全局页目录中包含引发 `#PF` 的地址的那一项,将它的值赋给 `pgd` 变量:

```C
pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;
```

-In the next step we check `pgd`, if it contains correct page global directory entry we put physical address of the page global directory entry and put it to the `pud_p` with:
+接下来我们检查 `pgd`,如果它包含了正确的全局页目录项,我们就把该项对应的物理地址处理后赋值给 `pud_p`:

```C
pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
```

-where `PTE_PFN_MASK` is a macro:
+其中 `PTE_PFN_MASK` 是一个宏:

```C
#define PTE_PFN_MASK            ((pteval_t)PHYSICAL_PAGE_MASK)
```

-which expands to:
+展开后为:

```C
(~(PAGE_SIZE-1)) & ((1 << 46) - 1)
```

-or
+或者写为:

```
0b1111111111111111111111111111111111111111111111
```

-which is 46 bits to mask page frame.
+即用 `46` 位来屏蔽出页帧地址。

-If `pgd` does not contain correct address we check that `next_early_pgt` is not greater than `EARLY_DYNAMIC_PAGE_TABLES` which is `64` and present a fixed number of buffers to set up new page tables on demand. 
If `next_early_pgt` is greater than `EARLY_DYNAMIC_PAGE_TABLES` we reset page tables and start again. If `next_early_pgt` is less than `EARLY_DYNAMIC_PAGE_TABLES`, we create new page upper directory pointer which points to the current dynamic page table and writes it's physical address with the `_KERPG_TABLE` access rights to the page global directory:
+如果 `pgd` 没有包含有效的地址,我们就检查 `next_early_pgt` 是否超过了 `EARLY_DYNAMIC_PAGE_TABLES`(即 `64`)。`EARLY_DYNAMIC_PAGE_TABLES` 表示一组固定数量的缓冲区,用来在需要的时候按需建立新的页表。如果 `next_early_pgt` 大于等于 `EARLY_DYNAMIC_PAGE_TABLES`,我们就重置页表并重新开始;否则,我们就新建一个上层页目录指针,使其指向当前的动态页表,并将它的物理地址连同 `_KERNPG_TABLE` 访问权限一起写入全局页目录表:

```C
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
@@ -460,30 +459,32 @@ for (i = 0; i < PTRS_PER_PUD; i++)
*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
```

-After this we fix up address of the page upper directory with:
+然后我们来修正上层页目录的地址:

```C
pud_p += pud_index(address);
pud = *pud_p;
```

-In the next step we do the same actions as we did before, but with the page middle directory. In the end we fix address of the page middle directory which contains maps kernel text+data virtual addresses:
+下面我们对中层页目录重复上面同样的操作。最后我们修正中层页目录的地址,其中包含了对内核代码段与数据段虚拟地址的映射:

```C
pmd = (physaddr & PMD_MASK) + early_pmd_flags;
pmd_p[pmd_index(address)] = pmd;
```

-After page fault handler finished it's work and as result our `early_level4_pgt` contains entries which point to the valid addresses.
+到此缺页中断处理程序就完成了它所有的工作,此时 `early_level4_pgt` 就包含了指向合法地址的项。

-Conclusion
+小结
--------------------------------------------------------------------------------

-This is the end of the second part about linux kernel insides. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/MintCN/linux-insides-zh/issues/new).
In the next part we will see all steps before kernel entry point - `start_kernel` function.
+内核初始化的第二部分到此结束。

-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+如果你有任何问题或建议,请在twitter上联系我 [0xAX](https://twitter.com/0xAX),或者通过[邮件](anotherworldofworld@gmail.com)与我沟通,还可以新开[issue](https://github.com/MintCN/linux-insides-zh/issues/new)。

-Links
+接下来我们将会看到进入内核入口点 `start_kernel` 函数之前剩下所有的准备工作。
+
+相关链接
--------------------------------------------------------------------------------

* [GNU assembly .rept](https://sourceware.org/binutils/docs-2.23/as/Rept.html)

diff --git a/Initialization/linux-initialization-3.md b/Initialization/linux-initialization-3.md
index 1c216f8..03c6208 100644
--- a/Initialization/linux-initialization-3.md
+++ b/Initialization/linux-initialization-3.md
@@ -1,21 +1,22 @@
-Kernel initialization. Part 3.
+内核初始化 第三部分
================================================================================

-Last preparations before the kernel entry point
+进入内核入口点之前最后的准备工作
--------------------------------------------------------------------------------

-This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/MintCN/linux-insides-zh/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling and will continue to dive into the linux kernel initialization process in the current part. Our next point is 'kernel entry point' - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. Yes, technically it is not kernel's entry point but the start of the generic kernel code which does not depend on certain architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue.
+
+这是 Linux 内核初始化过程的第三部分。在[上一个部分](https://github.com/MintCN/linux-insides-zh/blob/master/Initialization/linux-initialization-2.md)中我们接触到了初期中断和异常处理,而在这个部分中我们要继续看一看 Linux 内核的初始化过程。我们接下来的目标是“内核入口点”—— [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 文件中的 `start_kernel` 函数。没错,从技术上说这并不是内核的入口点,只是不依赖于特定架构的通用内核代码的开始。不过,在我们调用 `start_kernel` 之前,有些准备必须要做。下面我们就来看一看。

boot_params again
--------------------------------------------------------------------------------

-In the previous part we stopped at setting Interrupt Descriptor Table and loading it in the `IDTR` register. At the next step after this we can see a call of the `copy_bootdata` function:
+在上一个部分中我们讲到了设置中断描述符表,并将其加载进 `IDTR` 寄存器。下一步是调用 `copy_bootdata` 函数:

```C
copy_bootdata(__va(real_mode_data));
```

-This function takes one argument - virtual address of the `real_mode_data`. Remember that we passed the address of the `boot_params` structure from [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L114) to the `x86_64_start_kernel` function as first argument in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):
+这个函数接受一个参数—— `real_mode_data` 的虚拟地址。回忆一下,我们曾在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) 中把 `boot_params` 结构体(定义在 [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L114))的地址作为第一个参数传递给 `x86_64_start_kernel` 函数:

```
/* rsi is pointer to real mode structure with interesting info.
@@ -23,19 +24,19 @@ This function takes one argument - virtual address of the `real_mode_data`. Reme
movq %rsi, %rdi
```

-Now let's look at `__va` macro.
This macro defined in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c):
+下面我们来看一看 `__va` 宏。这个宏定义在 [arch/x86/include/asm/page.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page.h):

```C
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
```

-where `PAGE_OFFSET` is `__PAGE_OFFSET` which is `0xffff880000000000` and the base virtual address of the direct mapping of all physical memory. So we're getting virtual address of the `boot_params` structure and pass it to the `copy_bootdata` function, where we copy `real_mod_data` to the `boot_params` which is declared in the [arch/x86/kernel/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.h)
+其中 `PAGE_OFFSET` 就是 `__PAGE_OFFSET`(即 `0xffff880000000000`),也就是对全部物理内存进行直接映射的虚拟基地址。因此我们就得到了 `boot_params` 结构体的虚拟地址,并把它传入 `copy_bootdata` 函数。在这个函数里我们把 `real_mode_data` 拷贝进声明在 [arch/x86/kernel/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.h) 中的 `boot_params`:

```C
extern struct boot_params boot_params;
```

-Let's look at the `copy_boot_data` implementation:
+`copy_bootdata` 的实现如下:

```C
static void __init copy_bootdata(char *real_mode_data)
@@ -53,9 +54,9 @@ static void __init copy_bootdata(char *real_mode_data)
}
```
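+译注:`__va` 宏所做的只是在物理地址上加上 `PAGE_OFFSET`。下面用一段独立的 C 代码示意这一转换(这里将 `PAGE_OFFSET` 硬编码为 `0xffff880000000000`,仅作演示,并非内核代码):

```c
#include <stdint.h>

/* 直接映射区的起始虚拟地址(x86_64 上的示意值) */
#define PAGE_OFFSET 0xffff880000000000ULL

/* 对应 __va(x):物理地址 -> 直接映射区中的虚拟地址 */
static uint64_t va(uint64_t phys)
{
    return phys + PAGE_OFFSET;
}

/* 反向转换,对应内核中的 __pa(x) */
static uint64_t pa(uint64_t virt)
{
    return virt - PAGE_OFFSET;
}
```

+例如物理地址 `0x1000` 经过这样的转换之后,就得到直接映射区中的虚拟地址 `0xffff880000001000`。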
After this we're getting address of the command line with the call of the `get_cmd_line_ptr` function:
+在这个函数中首先声明了两个用于内核命令行的变量,然后使用 `memcpy` 函数将 `real_mode_data` 拷贝进 `boot_params`。如果系统引导工具(bootloader)没能正确初始化 `boot_params` 中的某些成员的话,那么在接下来调用的 `sanitize_boot_params` 函数中将会对这些成员进行清零,比如 `ext_ramdisk_image` 等。此后我们通过调用 `get_cmd_line_ptr` 函数来得到命令行的地址:

```C
unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
@@ -63,26 +64,26 @@ cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
return cmd_line_ptr;
```

-which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check `cmd_line_ptr`, getting its virtual address and copy it to the `boot_command_line` which is just an array of bytes:
+`get_cmd_line_ptr` 函数将会从内核引导头中获得命令行的64位地址并返回。最后,我们检查一下是否正确获得了 `cmd_line_ptr`,取得它的虚拟地址,并把该处的命令行内容拷贝到字节数组 `boot_command_line` 中:

```C
extern char __initdata boot_command_line[];
```

-After this we will have copied kernel command line and `boot_params` structure. In the next step we can see call of the `load_ucode_bsp` function which loads processor microcode, but we will not see it here.
+这一步完成之后,我们就得到了内核命令行和 `boot_params` 结构体。之后,内核通过调用 `load_ucode_bsp` 函数来加载处理器微代码(microcode),不过我们目前先暂时忽略这一步。

-After microcode was loaded we can see the check of the `console_loglevel` and the `early_printk` function which prints `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and i sent the patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) and you will see it in the mainline soon. So you can skip this code.
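+译注:`get_cmd_line_ptr` 拼接 64 位地址的方式可以用下面的 C 代码来示意。结构体只保留了与命令行地址相关的两个字段,程序本身只是演示,并非内核代码:

```c
#include <stdint.h>

/* 只保留与命令行地址相关的两个字段(示意) */
struct fake_boot_params {
    uint32_t cmd_line_ptr;     /* 低 32 位,位于 setup header 中 */
    uint32_t ext_cmd_line_ptr; /* 高 32 位 */
};

/* 对应 get_cmd_line_ptr:把两部分拼成完整的 64 位地址 */
static uint64_t get_cmd_line_ptr(const struct fake_boot_params *bp)
{
    uint64_t cmd_line_ptr = bp->cmd_line_ptr;
    cmd_line_ptr |= (uint64_t)bp->ext_cmd_line_ptr << 32;
    return cmd_line_ptr;
}
```

+可以看到,只有当命令行被引导程序放在 `4G` 以上时,高 32 位(`ext_cmd_line_ptr`)才会非零。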
+微代码加载之后,内核会对 `console_loglevel` 进行检查,同时通过 `early_printk` 函数来打印出字符串 `Kernel Alive`。不过这个输出不会真的被显示出来,因为这个时候 `early_printk` 还没有被初始化。这是目前内核中的一个小 bug,作者已经提交了补丁 [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2),补丁很快就能应用在主分支中了。所以你可以先跳过这段代码。

-Move on init pages
+初始化内存页
--------------------------------------------------------------------------------

-In the next step, as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call:
+至此,我们已经拷贝了 `boot_params` 结构体,接下来需要从初期页表切换到供初始化过程使用的页表。我们之前已经为这次切换设置好了初期页表,这在之前的[部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html)中已经讨论过;并且已经通过 `reset_early_page_tables` 函数丢弃了其中的大部分项(同样在之前的部分中有介绍),只保留了内核高地址的映射。然后我们调用:

```C
clear_page(init_level4_pgt);
```

-function and pass `init_level4_pgt` which also defined in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and looks:
+`init_level4_pgt` 同样定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):

```assembly
NEXT_PAGE(init_level4_pgt)
@@ -93,7 +94,7 @@ NEXT_PAGE(init_level4_pgt)
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
```

-which maps first 2 gigabytes and 512 megabytes for the kernel code, data and bss.
`clear_page` function defined in the [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S) let's look on this function:
+这段代码映射了前 2G 字节,以及为内核代码段、数据段和 bss 段准备的 512M 字节。`clear_page` 函数定义在 [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S):

```assembly
ENTRY(clear_page)
@@ -121,30 +122,30 @@ ENTRY(clear_page)
ENDPROC(clear_page)
```

-As you can understand from the function name it clears or fills with zeros page tables. First of all note that this function starts with the `CFI_STARTPROC` and `CFI_ENDPROC` which are expands to GNU assembly directives:
+顾名思义,这个函数会将整个页清零。这个函数的开始和结束部分有两个宏 `CFI_STARTPROC` 和 `CFI_ENDPROC`,它们会展开成 GNU 汇编指令,用于调试:

```C
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
```

-and used for debugging. After `CFI_STARTPROC` macro we zero out `eax` register and put 64 to the `ecx` (it will be a counter). Next we can see loop which starts with the `.Lloop` label and it starts from the `ecx` decrement. After it we put zero from the `rax` register to the `rdi` which contains the base address of the `init_level4_pgt` now and do the same procedure seven times but every time move `rdi` offset on 8. After this we will have first 64 bytes of the `init_level4_pgt` filled with zeros. In the next step we put the address of the `init_level4_pgt` with 64-bytes offset to the `rdi` again and repeat all operations until `ecx` reaches zero. In the end we will have `init_level4_pgt` filled with zeros.
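+译注:上面这段汇编的逻辑,等价于把一个 4096 字节的页清零:外层循环 64 次,每次写入 64 个字节(8 次 8 字节的写入)。用 C 语言可以这样示意(仅为演示,并非内核实现):

```c
#include <stdint.h>

#define PAGE_SIZE 4096

/* 模仿 clear_page 的循环结构:每轮写 8 个 64 位的 0,共 64 轮 */
static void clear_page_c(void *page)
{
    uint64_t *p = page;
    for (int i = 0; i < 64; i++) {  /* 对应 ecx 从 64 减到 0 */
        for (int j = 0; j < 8; j++) /* 对应 8 次写入,偏移各差 8 字节 */
            p[j] = 0;
        p += 8;                     /* 对应 rdi 前进 64 字节 */
    }
}
```

+64 轮循环、每轮 64 字节,正好覆盖 4096 字节,也就是一整个页。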
+在 `CFI_STARTPROC` 之后我们将 `eax` 寄存器清零,并将 `ecx` 赋值为 64(用作计数器)。接下来从 `.Lloop` 标签开始循环,首先就是将 `ecx` 减一。然后将 `rax` 中的值(目前为0)写入 `rdi` 指向的地址,`rdi` 中保存的是 `init_level4_pgt` 的基地址。接下来重复7次这个步骤,但是每次都相对 `rdi` 多偏移8个字节。之后 `init_level4_pgt` 的前64个字节就都被填充为0了。接下来我们将 `rdi` 中的值加上64,重复这个步骤,直到 `ecx` 减至0。最后就完成了将 `init_level4_pgt` 清零。

-As we have `init_level4_pgt` filled with zeros, we set the last `init_level4_pgt` entry to kernel high mapping with the:
+在将 `init_level4_pgt` 清零之后,再把它的最后一项设置为内核高地址的映射:

```C
init_level4_pgt[511] = early_level4_pgt[511];
```

-Remember that we dropped all `early_level4_pgt` entries in the `reset_early_page_table` function and kept only kernel high mapping there.
+在前面我们已经使用 `reset_early_page_tables` 函数清除了 `early_level4_pgt` 中的大部分项,而只保留内核高地址的映射。

-The last step in the `x86_64_start_kernel` function is the call of the:
+`x86_64_start_kernel` 函数的最后一步是调用:

```C
x86_64_start_reservations(real_mode_data);
```

-function with the `real_mode_data` as argument. The `x86_64_start_reservations` function defined in the same source code file as the `x86_64_start_kernel` function and looks:
+并传入 `real_mode_data` 参数。`x86_64_start_reservations` 函数与 `x86_64_start_kernel` 函数定义在同一个文件中:

```C
void __init x86_64_start_reservations(char *real_mode_data)
@@ -158,43 +159,43 @@ void __init x86_64_start_reservations(char *real_mode_data)
}
```
+这就是进入内核入口点之前的最后一个函数了。下面我们就来介绍一下这个函数。

-Last step before kernel entry point
+内核入口点前的最后一步
--------------------------------------------------------------------------------

-First of all we can see in the `x86_64_start_reservations` function the check for `boot_params.hdr.version`:
+在 `x86_64_start_reservations` 函数中首先检查了 `boot_params.hdr.version`:

```C
if (!boot_params.hdr.version)
copy_bootdata(__va(real_mode_data));
```

-and if it is zero we call `copy_bootdata` function again with the virtual address of the `real_mode_data` (read about about it's implementation).
+如果它为0,则再次调用 `copy_bootdata`,并传入 `real_mode_data` 的虚拟地址。

-In the next step we can see the call of the `reserve_ebda_region` function which defined in the [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c). This function reserves memory block for the `EBDA` or Extended BIOS Data Area. The Extended BIOS Data Area located in the top of conventional memory and contains data about ports, disk parameters and etc...
+接下来则调用了 `reserve_ebda_region` 函数,它定义在 [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c)。这个函数为 `EBDA`(即 Extended BIOS Data Area,扩展BIOS数据区域)预留空间。扩展BIOS数据区域位于常规内存顶部(译注:常规内存(Conventional Memory)是指前640K字节内存),包含了端口、磁盘参数等数据。

-Let's look on the `reserve_ebda_region` function. It starts from the checking is paravirtualization enabled or not:
+接下来我们来看一下 `reserve_ebda_region` 函数。它首先会检查是否启用了半虚拟化:

```C
if (paravirt_enabled())
return;
```

-we exit from the `reserve_ebda_region` function if paravirtualization is enabled because if it enabled the extended bios data area is absent.
In the next step we need to get the end of the low memory:
+如果开启了半虚拟化,那么就退出 `reserve_ebda_region` 函数,因为此时没有扩展BIOS数据区域。下面我们首先得到低地址内存的末尾地址:

```C
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
```

-We're getting the virtual address of the BIOS low memory in kilobytes and convert it to bytes with shifting it on 10 (multiply on 1024 in other words). After this we need to get the address of the extended BIOS data are with the:
+首先我们通过虚拟地址读出了 BIOS 中记录的低地址内存大小(以KB为单位),然后将其左移10位(即乘以1024)转换为以字节为单位。然后我们需要获得扩展BIOS数据区域的地址:

```C
ebda_addr = get_bios_ebda();
```

-where `get_bios_ebda` function defined in the [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bios_ebda.h) and looks like:
+其中, `get_bios_ebda` 函数定义在 [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bios_ebda.h):

```C
static inline unsigned int get_bios_ebda(void)
@@ -205,7 +206,7 @@ static inline unsigned int get_bios_ebda(void)
}
```

-Let's try to understand how it works. Here we can see that we converting physical address `0x40E` to the virtual, where `0x0040:0x000e` is the segment which contains base address of the extended BIOS data area. Don't worry that we are using `phys_to_virt` function for converting a physical address to virtual address.
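+译注:这两步地址计算可以用下面的 C 代码来示意:低地址内存大小以 KB 记录,左移 10 位换算成字节;EBDA 的段基址记录在 `0x40E` 处,左移 4 位得到物理地址。数值均为常见的示例值,程序并非内核代码:

```c
#include <stdint.h>

/* BIOS 数据区中记录的低地址内存大小(单位 KB)换算成字节 */
static uint32_t lowmem_bytes(uint16_t kilobytes)
{
    return (uint32_t)kilobytes << 10; /* 乘以 1024 */
}

/* 0x40E 处的 16 位段值左移 4 位,得到 EBDA 的物理基地址 */
static uint32_t ebda_base(uint16_t segment)
{
    return (uint32_t)segment << 4;
}
```

+例如,当 BIOS 报告 639 KB 的常规内存、EBDA 段值为 `0x9FC0` 时,两者都指向 `0x9FC00` 这一地址附近,即常规内存的顶部。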
You can note that previously we have used `__va` macro for the same point, but `phys_to_virt` is the same:
+下面我们来尝试理解一下这段代码。这段代码中,首先我们将物理地址 `0x40E` 转换为虚拟地址,`0x0040:0x000e` 就是存放扩展BIOS数据区域基地址的那个段。这里我们使用了 `phys_to_virt` 函数进行地址转换,而不是之前使用的 `__va` 宏。不过,事实上它们两个基本上是一样的:

```C
static inline void *phys_to_virt(phys_addr_t address)
@@ -214,7 +215,7 @@ static inline void *phys_to_virt(phys_addr_t address)
}
```

-only with one difference: we pass argument with the `phys_addr_t` which depends on `CONFIG_PHYS_ADDR_T_64BIT`:
+而不同之处在于,`phys_to_virt` 函数的参数类型 `phys_addr_t` 的定义依赖于 `CONFIG_PHYS_ADDR_T_64BIT`:

```C
#ifdef CONFIG_PHYS_ADDR_T_64BIT
@@ -224,9 +225,9 @@ only with one difference: we pass argument with the `phys_addr_t` which depends
#endif
```

-This configuration option is enabled by `CONFIG_PHYS_ADDR_T_64BIT`. After that we got virtual address of the segment which stores the base address of the extended BIOS data area, we shift it on 4 and return. After this `ebda_addr` variables contains the base address of the extended BIOS data area.
+具体的类型是由 `CONFIG_PHYS_ADDR_T_64BIT` 设置选项控制的。此后我们从这个虚拟地址处读出了存放扩展BIOS数据区域基地址的段值,把它左移4位后返回。这样,`ebda_addr` 变量就包含了扩展BIOS数据区域的基地址。

-In the next step we check that address of the extended BIOS data area and low memory is not less than `INSANE_CUTOFF` macro
+下一步我们来检查扩展BIOS数据区域与低地址内存的地址,看一看它们是否小于 `INSANE_CUTOFF` 宏:

```C
if (ebda_addr < INSANE_CUTOFF)
@@ -236,13 +237,13 @@ if (lowmem < INSANE_CUTOFF)
lowmem = LOWMEM_CAP;
```

-which is:
+`INSANE_CUTOFF` 为:

```C
#define INSANE_CUTOFF 0x20000U
```

-or 128 kilobytes. In the last step we get lower part in the low memory and extended bios data area and call `memblock_reserve` function which will reserve memory region for extended bios data between low memory and one megabyte mark:
+即 128 KB。
最后,我们取低地址内存末尾与扩展BIOS数据区域基地址中较小的一个,调用 `memblock_reserve` 函数,在这个地址与 1MB 之间为扩展BIOS数据预留内存区域。

```C
lowmem = min(lowmem, ebda_addr);
@@ -250,36 +251,36 @@ lowmem = min(lowmem, LOWMEM_CAP);
memblock_reserve(lowmem, 0x100000 - lowmem);
```

-`memblock_reserve` function is defined at [mm/block.c](https://github.com/torvalds/linux/blob/master/mm/block.c) and takes two parameters:
+`memblock_reserve` 函数定义在 [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c),它接受两个参数:

-* base physical address;
-* region size.
+* 物理基地址
+* 区域大小

-and reserves memory region for the given base address and size. `memblock_reserve` is the first function in this book from linux kernel memory manager framework. We will take a closer look on memory manager soon, but now let's look at its implementation.
+然后在给定的基地址处预留指定大小的内存。`memblock_reserve` 是本书中我们接触到的第一个来自 Linux 内核内存管理框架的函数。我们很快会详细地介绍内存管理,不过现在还是先来看一看这个函数的实现。

-First touch of the linux kernel memory manager framework
+Linux内核内存管理框架初探
--------------------------------------------------------------------------------

-In the previous paragraph we stopped at the call of the `memblock_reserve` function and as i sad before it is the first function from the memory manager framework. Let's try to understand how it works. `memblock_reserve` function just calls:
+在上一段中我们遇到了对 `memblock_reserve` 函数的调用。现在我们来尝试理解一下这个函数是如何工作的。 `memblock_reserve` 函数只是调用了:

```C
memblock_reserve_region(base, size, MAX_NUMNODES, 0);
```

-function and passes 4 parameters there:
+`memblock_reserve_region` 接受四个参数:

-* physical base address of the memory region;
-* size of the memory region;
-* maximum number of numa nodes;
-* flags.
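+译注:`reserve_ebda_region` 最后保留的区间可以用下面的 C 代码来示意:取低地址内存末尾与 EBDA 基地址中较小的一个,再以它为起点一直保留到 1MB。其中 `LOWMEM_CAP` 的取值只是这里的假设,具体数值以内核源码为准:

```c
#include <stdint.h>

#define LOWMEM_CAP 0x9f000u /* 假设的上限值,仅作演示 */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* 计算需要保留的区间起点;终点固定为 1MB(0x100000),
 * 对应 memblock_reserve(lowmem, 0x100000 - lowmem) */
static uint32_t reserve_start(uint32_t lowmem, uint32_t ebda_addr)
{
    lowmem = min_u32(lowmem, ebda_addr);
    lowmem = min_u32(lowmem, LOWMEM_CAP);
    return lowmem;
}
```

+例如低地址内存为 639 KB(`0x9FC00`)、EBDA 也位于 `0x9FC00` 时,起点会被 `LOWMEM_CAP` 压到 `0x9F000`,保留的区间就是 `[0x9F000, 0x100000)`。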
+
+* 内存区域的物理基地址
+* 内存区域的大小
+* 最大 NUMA 节点数
+* 标志参数 flags

-At the start of the `memblock_reserve_region` body we can see definition of the `memblock_type` structure:
+在 `memblock_reserve_region` 函数的开头,定义了一个指向 `memblock_type` 结构体的指针:

```C
struct memblock_type *_rgn = &memblock.reserved;
```

-which presents the type of the memory block and looks:
+`memblock_type` 类型代表了一类内存块,定义如下:

```C
struct memblock_type {
@@ -290,7 +291,7 @@ struct memblock_type {
};
```

-As we need to reserve memory block for extended bios data area, the type of the current memory region is reserved where `memblock` structure is:
+因为我们要为扩展BIOS数据区域预留内存块,所以当前内存区域的类型就是预留。`memblock` 结构体的定义为:

```C
struct memblock {
@@ -304,7 +305,7 @@ struct memblock {
};
```

-and describes generic memory block. You can see that we initialize `_rgn` by assigning it to the address of the `memblock.reserved`. `memblock` is the global variable which looks:
+它描述了一个通用的内存块。我们用 `memblock.reserved` 的地址来初始化 `_rgn`。`memblock` 全局变量定义如下:

```C
struct memblock memblock __initdata_memblock = {
@@ -324,27 +325,27 @@ struct memblock memblock __initdata_memblock = {
};
```

-We will not dive into detail of this variable, but we will see all details about it in the parts about memory manager. Just note that `memblock` variable defined with the `__initdata_memblock` which is:
+我们现在不会继续深究这个变量,但在内存管理部分中我们会详细地对它进行介绍。需要注意的是,这个变量的声明中使用了 `__initdata_memblock`:

```C
#define __initdata_memblock __meminitdata
```

-and `__meminit_data` is:
+而 `__meminitdata` 为:

```C
#define __meminitdata __section(.meminit.data)
```

-From this we can conclude that all memory blocks will be in the `.meminit.data` section. After we defined `_rgn` we print information about it with `memblock_dbg` macros. You can enable it by passing `memblock=debug` to the kernel command line.
+
+自此我们得出这样的结论:所有的内存块都将定义在 `.meminit.data` 区段中。在我们定义了 `_rgn` 之后,使用了 `memblock_dbg` 宏来输出相关的信息。你可以通过在内核命令行传入参数 `memblock=debug` 来开启这些输出。

-After debugging lines were printed next is the call of the following function:
+在输出了这些调试信息后,是对下面这个函数的调用:

```C
memblock_add_range(_rgn, base, size, nid, flags);
```

-which adds new memory block region into the `.meminit.data` section. As we do not initialize `_rgn` but it just contains `&memblock.reserved`, we just fill passed `_rgn` with the base address of the extended BIOS data area region, size of this region and flags:
+它向 `.meminit.data` 区段添加了一个新的内存块区域。由于 `_rgn` 的值是 `&memblock.reserved`,下面的代码就直接将扩展BIOS数据区域的基地址、大小和标志填入 `_rgn` 中:

```C
if (type->regions[0].size == 0) {
@@ -358,12 +359,12 @@ if (type->regions[0].size == 0) {
}
```

-After we filled our region we can see the call of the `memblock_set_region_node` function with two parameters:
+在填充好了区域后,接着是对 `memblock_set_region_node` 函数的调用。它接受两个参数:

-* address of the filled memory region;
-* NUMA node id.
+* 填充好的内存区域的地址
+* NUMA节点ID

-where our regions represented by the `memblock_region` structure:
+其中我们的区域由 `memblock_region` 结构体来表示:

```C
struct memblock_region {
@@ -376,13 +377,13 @@ struct memblock_region {
};
```

-NUMA node id depends on `MAX_NUMNODES` macro which is defined in the [include/linux/numa.h](https://github.com/torvalds/linux/blob/master/include/linux/numa.h):
+NUMA节点ID依赖于 `MAX_NUMNODES` 宏,定义在 [include/linux/numa.h](https://github.com/torvalds/linux/blob/master/include/linux/numa.h):

```C
#define MAX_NUMNODES (1 << NODES_SHIFT)
```

-where `NODES_SHIFT` depends on `CONFIG_NODES_SHIFT` configuration parameter and defined as:
+其中 `NODES_SHIFT` 依赖于 `CONFIG_NODES_SHIFT` 配置参数,定义如下:

```C
#ifdef CONFIG_NODES_SHIFT
@@ -392,7 +393,7 @@ where `NODES_SHIFT` depends on `CONFIG_NODES_SHIFT` configuration parameter and
#endif
```

-`memblick_set_region_node` function just fills `nid` field from `memblock_region` with the given value:
+`memblock_set_region_node` 函数只是将给定的值填入 `memblock_region` 中的
`nid` 成员:

```C
static inline void memblock_set_region_node(struct memblock_region *r, int nid)
@@ -401,28 +402,24 @@ static inline void memblock_set_region_node(struct memblock_region *r, int nid)
}
```

-After this we will have first reserved `memblock` for the extended bios data area in the `.meminit.data` section. `reserve_ebda_region` function finished its work on this step and we can go back to the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c).
+在这之后我们就在 `.meminit.data` 区段拥有了为扩展BIOS数据区域预留的第一个 `memblock`。`reserve_ebda_region` 已经完成了它该做的任务,我们回到 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) 继续。

-We finished all preparations before the kernel entry point! The last step in the `x86_64_start_reservations` function is the call of the:
+至此,进入内核入口点之前的所有准备工作就都完成了!`x86_64_start_reservations` 的最后一步是调用 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 中的:

```C
start_kernel()
```

-function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) file.
+这一部分到此结束。

-That's all for this part.
-
-Conclusion
+小结
--------------------------------------------------------------------------------

-It is the end of the third part about linux kernel insides. In next part we will see the first initialization steps in the kernel entry point - `start_kernel` function. It will be the first step before we will see launch of the first `init` process.
+内核初始化的第三部分到这里就结束了。在下一部分中,我们将会看到内核入口点 `start_kernel` 函数中最初的初始化步骤,这是启动第一个 `init` 进程之前要完成的第一步。

-If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+如果你有任何问题或建议,请在twitter上联系我 [0xAX](https://twitter.com/0xAX),或者通过[邮件](anotherworldofworld@gmail.com)与我沟通,还可以新开[issue](https://github.com/MintCN/linux-insides-zh/issues/new)。

-**Please note that English is not my first language, And I am really sorry for any inconvenience.
If you find any mistakes please send me PR to [linux-insides](https://github.com/MintCN/linux-insides-zh).** - -Links +相关链接 -------------------------------------------------------------------------------- * [BIOS data area](http://stanislavs.org/helppc/bios_data_area.html)