mirror of
https://github.com/MintCN/linux-insides-zh.git
synced 2026-02-03 02:23:23 +08:00
@@ -1,15 +1,15 @@
|
||||
Kernel booting process. Part 5.
|
||||
内核引导过程. Part 5.
|
||||
================================================================================
|
||||
|
||||
Kernel decompression
|
||||
内核解压
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the fifth part of the `Kernel booting process` series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#transition-to-the-long-mode) we saw the transition to 64-bit mode, and we continue from that point here. We will look at the last steps before the jump to the kernel code: the preparation for kernel decompression, relocation, and the decompression itself. So... let's dive into the kernel code again.
|
||||
这是`内核引导过程`系列文章的第五部分。在[前一部分](linux-bootstrap-4.md#transition-to-the-long-mode)我们看到了切换到64位模式的过程,在这一部分我们会从这里继续。我们会看到跳进内核代码的最后步骤:内核解压前的准备、重定位和直接内核解压。所以...让我们再次深入内核源码。
|
||||
|
||||
Preparation before kernel decompression
|
||||
内核解压前的准备
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We stopped right before the jump to the `64-bit` entry point, `startup_64`, which is located in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) source code file. We already saw the jump to `startup_64` in `startup_32`:
|
||||
我们停在了跳转到`64位`入口点——`startup_64`的跳转之前,它在源文件 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) 里面。在之前的部分,我们已经在`startup_32`里面看到了到`startup_64`的跳转:
|
||||
|
||||
```assembly
|
||||
pushl $__KERNEL_CS
|
||||
@@ -24,7 +24,7 @@ We stopped right before the jump on the `64-bit` entry point - `startup_64` whic
|
||||
lret
|
||||
```
|
||||
|
||||
in the previous part. Since we loaded the new `Global Descriptor Table` and the CPU switched to another mode (`64-bit` mode in our case), we can see the setup of the data segments:
|
||||
由于我们加载了新的`全局描述符表`并且在其他模式有CPU的模式转换(在我们这里是`64位`模式),我们可以在`startup_64`的开头看到数据段的建立:
|
||||
|
||||
```assembly
|
||||
.code64
|
||||
@@ -38,9 +38,9 @@ ENTRY(startup_64)
|
||||
movl %eax, %gs
|
||||
```
|
||||
|
||||
at the beginning of `startup_64`. All segment registers besides `cs` are now reset, as we have entered `long mode`.
|
||||
除`cs`之外的段寄存器在我们进入`长模式`时已经重置。
|
||||
|
||||
The next step is to compute the difference between where the kernel was compiled to run and where it was actually loaded:
|
||||
下一步是计算内核编译时的位置和它被加载的位置的差:
|
||||
|
||||
```assembly
|
||||
#ifdef CONFIG_RELOCATABLE
|
||||
@@ -60,9 +60,9 @@ The next step is computation of difference between where the kernel was compiled
|
||||
addq %rbp, %rbx
|
||||
```
|
||||
|
||||
`rbp` contains the start address of the decompressed kernel, and after this code executes the `rbx` register will contain the address to which the kernel code is relocated for decompression. We already saw code like this in `startup_32` (you can read about it in the previous part, [Calculate relocation address](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)), but we need to do this calculation again because the bootloader can use the 64-bit boot protocol, in which case `startup_32` is not executed at all.
|
||||
`rbp`包含了解压后内核的起始地址,在这段代码执行之后`rbx`会包含用于解压的重定位内核代码的地址。我们已经在`startup_32`看到类似的代码(你可以看之前的部分[计算重定位地址](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)),但是我们需要再做这个计算,因为引导加载器可以用64位引导协议,而`startup_32`在这种情况下不会执行。
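
The arithmetic behind this step can be sketched in C. This is only an illustration: most of the assembly is elided in the listing above, so the sketch assumes it rounds the load address up to the kernel alignment, falls back to `LOAD_PHYSICAL_ADDR` when the result is lower, and then adds `init_size - _end`. The function and parameter names below are hypothetical and do not appear in the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper mirroring the assembly:
 *   *base_out plays the role of rbp (start of the decompressed kernel),
 *   the return value plays the role of rbx (where the decompressor is moved).
 */
static uint64_t relocation_target(uint64_t loaded_at, uint64_t kernel_alignment,
                                  uint64_t load_physical_addr,
                                  uint64_t init_size, uint64_t image_end,
                                  uint64_t *base_out)
{
    uint64_t base = align_up(loaded_at, kernel_alignment);

    /* Never decompress below the address the kernel was configured for. */
    if (base < load_physical_addr)
        base = load_physical_addr;

    *base_out = base;
    /* Place the compressed image near the end of the decompression buffer. */
    return base + (init_size - image_end);
}
```

For example, with a load address of `0x100000`, an alignment of `0x200000` and `LOAD_PHYSICAL_ADDR` equal to `0x1000000`, the rounded address `0x200000` is below the configured address, so the base becomes `0x1000000`.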
|
||||
|
||||
In the next step we can see setup of the stack pointer and resetting of the flags register:
|
||||
下一步,我们可以看到栈指针的设置和标志寄存器的重置:
|
||||
|
||||
```assembly
|
||||
leaq boot_stack_end(%rbx), %rsp
|
||||
@@ -71,7 +71,7 @@ In the next step we can see setup of the stack pointer and resetting of the flag
|
||||
popfq
|
||||
```
|
||||
|
||||
As you can see above, the `rbx` register contains the start address of the kernel decompressor code, and we simply put this address plus the `boot_stack_end` offset into the `rsp` register, which holds the pointer to the top of the stack. After this step the stack is valid. You can find the definition of `boot_stack_end` at the end of the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) assembly source code file:
|
||||
如上所述,`rbx`寄存器包含了内核解压代码的起始地址,我们把这个地址的`boot_stack_end`偏移地址相加放到表示栈顶指针的`rsp`寄存器。在这一步之后,栈就是正确的。你可以在汇编源码文件 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) 的末尾找到`boot_stack_end`的定义:
|
||||
|
||||
```assembly
|
||||
.bss
|
||||
@@ -83,9 +83,9 @@ boot_stack:
|
||||
boot_stack_end:
|
||||
```
|
||||
|
||||
It is located at the end of the `.bss` section, right before `.pgtable`. If you look into the [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/vmlinux.lds.S) linker script, you will find the definitions of `.bss` and `.pgtable` there.
|
||||
它在`.bss`节的末尾,就在`.pgtable`前面。如果你查看 [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/vmlinux.lds.S) 链接脚本,你会找到`.bss`和`.pgtable`的定义。
|
||||
|
||||
Now that the stack is set up, we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Before going into details, let's look at this assembly code:
|
||||
由于我们设置了栈,在我们计算了解压了的内核的重定位地址后,我们可以复制压缩了的内核到以上地址。在查看细节之前,我们先看这段汇编代码:
|
||||
|
||||
```assembly
|
||||
pushq %rsi
|
||||
@@ -99,9 +99,9 @@ As we set the stack, now we can copy the compressed kernel to the address that w
|
||||
popq %rsi
|
||||
```
|
||||
|
||||
First of all we push `rsi` to the stack. We need to preserve the value of `rsi`, because this register now stores a pointer to `boot_params`, the real mode structure that contains booting related data (you should remember this structure; we filled it at the start of kernel setup). At the end of this code we restore the pointer to `boot_params` into `rsi` again.
|
||||
首先我们把`rsi`压进栈。我们需要保存`rsi`的值,因为这个寄存器现在存放指向`boot_params`的指针,这是包含引导相关数据的实模式结构体(你一定记得这个结构体,我们在开始设置内核的时候就填充了它)。在代码的结尾,我们会重新恢复指向`boot_params`的指针到`rsi`.
|
||||
|
||||
The next two `leaq` instructions calculate the effective addresses of `rip` and `rbx` with the `_bss - 8` offset and put them into `rsi` and `rdi`. Why do we calculate these addresses? The compressed kernel image is located between this copying code (from `startup_32` to the current code) and the decompression code. You can verify this by looking at the linker script, [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/vmlinux.lds.S):
|
||||
接下来两个`leaq`指令用`_bss - 8`偏移和`rip`和`rbx`计算有效地址并存放到`rsi`和`rdi`. 我们为什么要计算这些地址?实际上,压缩了的代码镜像存放在这份复制了的代码(从`startup_32`到当前的代码)和解压了的代码之间。你可以通过查看链接脚本 [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/vmlinux.lds.S) 验证:
|
||||
|
||||
```
|
||||
. = 0;
|
||||
@@ -121,7 +121,7 @@ The next two `leaq` instructions calculates effective addresses of the `rip` and
|
||||
}
|
||||
```
|
||||
|
||||
Note that `.head.text` section contains `startup_32`. You may remember it from the previous part:
|
||||
注意`.head.text`节包含了`startup_32`. 你可以从之前的部分回忆起它:
|
||||
|
||||
```assembly
|
||||
__HEAD
|
||||
@@ -132,7 +132,7 @@ ENTRY(startup_32)
|
||||
...
|
||||
```
|
||||
|
||||
The `.text` section contains decompression code:
|
||||
`.text`节包含解压代码:
|
||||
|
||||
```assembly
|
||||
.text
|
||||
@@ -146,21 +146,21 @@ relocated:
|
||||
...
|
||||
```
|
||||
|
||||
And `.rodata..compressed` contains the compressed kernel image. So `rsi` will contain the absolute address of `_bss - 8`, and `rdi` will contain the relocated address of `_bss - 8`. As we store these addresses in registers, we put the address of `_bss` in the `rcx` register. As you can see in the `vmlinux.lds.S` linker script, it is located at the end of all sections with the setup/kernel code. Now we can start to copy data from `rsi` to `rdi`, `8` bytes at a time, with the `movsq` instruction.
|
||||
`.rodata..compressed`包含了压缩了的内核镜像。所以`rsi`包含`_bss - 8`的绝对地址,`rdi`包含`_bss - 8`的重定位的相对地址。在我们把这些地址放入寄存器时,我们把`_bss`的地址放到了`rcx`寄存器。正如你在`vmlinux.lds.S`链接脚本中看到了一样,它和设置/内核代码一起在所有节的末尾。现在我们可以开始用`movsq`指令每次8字节地从`rsi`到`rdi`复制代码。
|
||||
|
||||
Note that there is an `std` instruction before data copying: it sets the `DF` flag, which means that `rsi` and `rdi` will be decremented. In other words, we will copy the bytes backwards. At the end, we clear the `DF` flag with the `cld` instruction, and restore `boot_params` structure to `rsi`.
|
||||
注意在数据复制前有`std`指令:它设置`DF`标志,意味着`rsi`和`rdi`会递减。换句话说,我们会从后往前复制这些字节。最后,我们用`cld`指令清除`DF`标志,并恢复`boot_params`到`rsi`.
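
Why the copy runs backwards is easier to see in C. The sketch below is an illustration of the technique, not the kernel's code: `copy_backwards` is a hypothetical name, and the sketch assumes the byte count is converted to a count of 8-byte words before the copy, as `rep movsq` requires. Copying the last word first means the source words are read before the overlapping destination overwrites them, which matters because the destination lies above the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `bytes` bytes from src to dst in 8-byte words, last word first,
 * the way `std; rep movsq` does. This is safe when dst is above src
 * and the two regions overlap.
 */
static void copy_backwards(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    size_t words = bytes >> 3;      /* number of 8-byte words to move */

    assert((bytes & 7) == 0);       /* whole words only */
    while (words--)                 /* index runs from words-1 down to 0 */
        dst[words] = src[words];
}
```

A forward copy of the same overlapping regions would overwrite source words before reading them.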
|
||||
|
||||
Now we have the address of the `.text` section after relocation, and we can jump to it:
|
||||
现在我们有`.text`节的重定位后的地址,我们可以跳到那里:
|
||||
|
||||
```assembly
|
||||
leaq relocated(%rbx), %rax
|
||||
jmp *%rax
|
||||
```
|
||||
|
||||
Last preparation before kernel decompression
|
||||
在内核解压前的最后准备
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In the previous paragraph we saw that the `.text` section starts with the `relocated` label. The first thing it does is clear the `.bss` section with:
|
||||
在上一段我们看到了`.text`节从`relocated`标签开始。它做的第一件事是清空`.bss`节:
|
||||
|
||||
```assembly
|
||||
xorl %eax, %eax
|
||||
@@ -171,9 +171,9 @@ In the previous paragraph we saw that the `.text` section starts with the `reloc
|
||||
rep stosq
|
||||
```
|
||||
|
||||
We need to initialize the `.bss` section, because we'll soon jump to [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code. Here we just clear `eax`, put the address of `_bss` in `rdi` and `_ebss` in `rcx`, and fill it with zeros with the `rep stosq` instruction.
|
||||
我们要初始化`.bss`节,因为我们很快要跳转到[C](https://en.wikipedia.org/wiki/C_%28programming_language%29)代码。这里我们就清空`eax`,把`_bss`的地址放到`rdi`,把`_ebss`放到`rcx`,然后用`rep stosq`填零。
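
A C equivalent of this zero fill may make the `rep stosq` step clearer. It is a sketch under the assumption that the region is a whole number of 8-byte words; `clear_words` is a hypothetical name, and its two arguments stand in for `_bss` and `_ebss`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Zero the region [start, end) one 8-byte word at a time,
 * like `rep stosq` with rax = 0 and rcx = number of words.
 */
static void clear_words(uint64_t *start, uint64_t *end)
{
    size_t count;

    assert(end >= start);
    count = (size_t)(end - start);  /* pointer difference is already in words */
    while (count--)
        *start++ = 0;
}
```

C code relies on uninitialized static data being zero, which is why this has to happen before the first C function runs.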
|
||||
|
||||
At the end, we can see the call to the `extract_kernel` function:
|
||||
最后,我们可以调用`extract_kernel`函数:
|
||||
|
||||
```assembly
|
||||
pushq %rsi
|
||||
@@ -187,49 +187,49 @@ At the end, we can see the call to the `extract_kernel` function:
|
||||
popq %rsi
|
||||
```
|
||||
|
||||
Again we set `rdi` to a pointer to the `boot_params` structure and preserve it on the stack. At the same time we set `rsi` to point to the area which should be used for kernel decompression. The last step is the preparation of the `extract_kernel` parameters and the call of this function, which will decompress the kernel. The `extract_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) source code file and takes six arguments:
|
||||
我们再一次设置`rdi`为指向`boot_params`结构体的指针并把它保存到栈中。同时我们设置`rsi`指向用于内核解压的区域。最后一步是准备`extract_kernel`的参数并调用这个解压内核的函数。`extract_kernel`函数在 [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) 源文件定义并有六个参数:
|
||||
|
||||
* `rmode` - pointer to the [boot_params](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973//arch/x86/include/uapi/asm/bootparam.h#L114) structure which is filled by bootloader or during early kernel initialization;
|
||||
* `heap` - pointer to the `boot_heap` which represents start address of the early boot heap;
|
||||
* `input_data` - pointer to the start of the compressed kernel or in other words pointer to the `arch/x86/boot/compressed/vmlinux.bin.bz2`;
|
||||
* `input_len` - size of the compressed kernel;
|
||||
* `output` - start address of the future decompressed kernel;
|
||||
* `output_len` - size of decompressed kernel;
|
||||
* `rmode` - 指向 [boot_params](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973//arch/x86/include/uapi/asm/bootparam.h#L114) 结构体的指针,`boot_params`被引导加载器填充或在早期内核初始化时填充
|
||||
* `heap` - 指向早期启动堆的起始地址 `boot_heap` 的指针
|
||||
* `input_data` - 指向压缩的内核,即 `arch/x86/boot/compressed/vmlinux.bin.bz2` 的指针
|
||||
* `input_len` - 压缩的内核的大小
|
||||
* `output` - 解压后内核的起始地址
|
||||
* `output_len` - 解压后内核的大小
|
||||
|
||||
All arguments will be passed through the registers according to [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf). We've finished all preparation and can now look at the kernel decompression.
|
||||
所有参数根据 [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf) 通过寄存器传递。我们已经完成了所有的准备工作,现在我们可以看内核解压的过程。
|
||||
|
||||
Kernel decompression
|
||||
内核解压
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As we saw in the previous paragraph, the `extract_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) source code file and takes six arguments. This function starts with the video/console initialization that we already saw in the previous parts. We need to do this again because we don't know whether we started in [real mode](https://en.wikipedia.org/wiki/Real_mode) or a bootloader was used, or whether the bootloader used the `32` or `64-bit` boot protocol.
|
||||
就像我们在之前的段落中看到了那样,`extract_kernel`函数在源文件 [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) 定义并有六个参数。正如我们在之前的部分看到的,这个函数从图形/控制台初始化开始。我们要再次做这件事,因为我们不知道我们是不是从[实模式](https://en.wikipedia.org/wiki/Real_mode)开始,或者是使用了引导加载器,或者引导加载器用了32位还是64位启动协议。
|
||||
|
||||
After the first initialization steps, we store pointers to the start of the free memory and to the end of it:
|
||||
在最早的初始化步骤后,我们保存空闲内存的起始和末尾地址。
|
||||
|
||||
```C
|
||||
free_mem_ptr = heap;
|
||||
free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
|
||||
```
|
||||
|
||||
where the `heap` is the second parameter of the `extract_kernel` function which we got in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S):
|
||||
在这里 `heap` 是我们在 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) 得到的 `extract_kernel` 函数的第二个参数:
|
||||
|
||||
```assembly
|
||||
leaq boot_heap(%rip), %rsi
|
||||
```
|
||||
|
||||
As you saw above, the `boot_heap` is defined as:
|
||||
如上所述,`boot_heap`定义为:
|
||||
|
||||
```assembly
|
||||
boot_heap:
|
||||
.fill BOOT_HEAP_SIZE, 1, 0
|
||||
```
|
||||
|
||||
where `BOOT_HEAP_SIZE` is a macro which expands to `0x10000` (`0x400000` in the case of a `bzip2` kernel) and represents the size of the heap.
|
||||
在这里`BOOT_HEAP_SIZE`是一个展开为`0x10000`(对`bzip2`内核是`0x400000`)的宏,代表堆的大小。
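
To show how a pair of pointers like `free_mem_ptr` and `free_mem_end_ptr` can serve as a heap, here is a minimal bump allocator sketch. It is an assumption about how such pointers are typically used, not a copy of the decompressor's allocator: `heap_init`, `boot_alloc` and the 4-byte alignment are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOT_HEAP_SIZE 0x10000   /* value quoted in the text for non-bzip2 kernels */

static unsigned char boot_heap[BOOT_HEAP_SIZE];
static uintptr_t free_mem_ptr;       /* next free byte */
static uintptr_t free_mem_end_ptr;   /* one past the end of the heap */

static void heap_init(void)
{
    free_mem_ptr = (uintptr_t)boot_heap;
    free_mem_end_ptr = free_mem_ptr + BOOT_HEAP_SIZE;
}

/* Hypothetical bump allocator: hands out memory linearly, NULL when exhausted. */
static void *boot_alloc(size_t size)
{
    uintptr_t p = (free_mem_ptr + 3) & ~(uintptr_t)3;   /* 4-byte alignment */

    if (p > free_mem_end_ptr || size > free_mem_end_ptr - p)
        return NULL;

    free_mem_ptr = p + size;
    return (void *)p;
}
```

A bump allocator never reuses individual blocks, which is acceptable here because the heap only has to live until decompression finishes.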
|
||||
|
||||
After heap pointers initialization, the next step is the call of the `choose_random_location` function from [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/kaslr.c#L425) source code file. As we can guess from the function name, it chooses the memory location where the kernel image will be decompressed. It may look weird that we need to find or even `choose` location where to decompress the compressed kernel image, but the Linux kernel supports [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) which allows decompression of the kernel into a random address, for security reasons.
|
||||
在堆指针初始化后,下一步是从 [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/kaslr.c#L425) 调用`choose_random_location`函数。我们可以从函数名猜到,它选择内核镜像解压到的内存地址。看起来很奇怪,我们要寻找甚至是`选择`内核解压的地址,但是Linux内核支持[kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization),为了安全,它允许解压内核到随机的地址。
|
||||
|
||||
We will not consider randomization of the Linux kernel load address in this part, but will do it in the next part.
|
||||
在这一部分,我们不会考虑Linux内核的加载地址的随机化,我们会在下一部分讨论。
|
||||
|
||||
Now let's get back to [misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c#L404). After getting the address for the kernel image, some checks are needed to make sure that the retrieved random address is correctly aligned and valid:
|
||||
现在我们回头看 [misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c#L404). 在获得内核镜像的地址后,需要有一些检查以确保获得的随机地址是正确对齐的,并且地址没有错误:
|
||||
|
||||
```C
|
||||
if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
|
||||
@@ -251,19 +251,19 @@ if (virt_addr != LOAD_PHYSICAL_ADDR)
|
||||
error("Destination virtual address changed when not relocatable");
|
||||
```
|
||||
|
||||
After all these checks we will see the familiar message:
|
||||
在所有这些检查后,我们可以看到熟悉的消息:
|
||||
|
||||
```
|
||||
Decompressing Linux...
|
||||
```
|
||||
|
||||
and call the `__decompress` function:
|
||||
然后调用解压内核的`__decompress`函数:
|
||||
|
||||
```C
|
||||
__decompress(input_data, input_len, NULL, NULL, output, output_len, NULL, error);
|
||||
```
|
||||
|
||||
which will decompress the kernel. The implementation of the `__decompress` function depends on what decompression algorithm was chosen during kernel compilation:
|
||||
`__decompress`函数的实现取决于在内核编译期间选择什么压缩算法:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_KERNEL_GZIP
|
||||
@@ -291,7 +291,7 @@ which will decompress the kernel. The implementation of the `__decompress` funct
|
||||
#endif
|
||||
```
|
||||
|
||||
After the kernel is decompressed, the last two functions are `parse_elf` and `handle_relocations`. The main point of these functions is to move the decompressed kernel image to the correct place in memory. The decompression happens [in-place](https://en.wikipedia.org/wiki/In-place_algorithm), so we still need to move the kernel to the correct address. As we already know, the kernel image is an [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) executable, so the main goal of the `parse_elf` function is to move loadable segments to the correct address. We can see loadable segments in the output of the `readelf` program:
|
||||
在内核解压之后,最后两个函数是`parse_elf`和`handle_relocations`.这些函数的主要用途是把解压后的内核移动到正确的位置。事实上,解压过程会[原地](https://en.wikipedia.org/wiki/In-place_algorithm)解压,我们还是要把内核移动到正确的地址。我们已经知道,内核镜像是一个[ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)可执行文件,所以`parse_elf`的主要目标是移动可加载的段到正确的地址。我们可以在`readelf`的输出看到可加载的段:
|
||||
|
||||
```
|
||||
readelf -l vmlinux
|
||||
@@ -313,7 +313,7 @@ Program Headers:
|
||||
0x0000000000138000 0x000000000029b000 RWE 200000
|
||||
```
|
||||
|
||||
The goal of the `parse_elf` function is to load these segments to the `output` address we got from the `choose_random_location` function. This function starts with checking the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) signature:
|
||||
`parse_elf`函数的目标是加载这些段到从`choose_random_location`函数得到的`output`地址。这个函数从检查ELF签名标志开始:
|
||||
|
||||
```C
|
||||
Elf64_Ehdr ehdr;
|
||||
@@ -330,7 +330,7 @@ if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
|
||||
}
|
||||
```
|
||||
|
||||
and if it's not valid, it prints an error message and halts. If we got a valid `ELF` file, we go through all program headers from the given `ELF` file and copy all loadable segments with correct address to the output buffer:
|
||||
如果是无效的,它会打印一条错误消息并停机。如果我们得到一个有效的`ELF`文件,我们从给定的`ELF`文件遍历所有程序头,并用正确的地址复制所有可加载的段到输出缓冲区:
|
||||
|
||||
```C
|
||||
for (i = 0; i < ehdr.e_phnum; i++) {
|
||||
@@ -352,41 +352,41 @@ and if it's not valid, it prints an error message and halts. If we got a valid `
|
||||
}
|
||||
```
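
The two steps described above, checking the signature and moving loadable segments, can be sketched in a self-contained way. The `struct segment` type and both function names below are hypothetical simplifications; the real code works on `Elf64_Ehdr` and `Elf64_Phdr` structures. `memmove` is used because source and destination may overlap when the image is decompressed in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down program header: only the fields used here. */
struct segment {
    int loadable;          /* stands in for p_type == PT_LOAD */
    size_t file_offset;    /* where the segment sits inside the image */
    size_t dest_offset;    /* where it must end up, relative to output */
    size_t size;           /* number of bytes to move */
};

/* Returns 1 if the buffer starts with the ELF magic bytes 0x7f 'E' 'L' 'F'. */
static int has_elf_magic(const unsigned char *image, size_t len)
{
    return len >= 4 && memcmp(image, "\x7f" "ELF", 4) == 0;
}

/* Move every loadable segment to its destination inside `output`. */
static void load_segments(unsigned char *output, const unsigned char *image,
                          const struct segment *segs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!segs[i].loadable)
            continue;               /* skip notes and other non-loadable headers */
        memmove(output + segs[i].dest_offset,
                image + segs[i].file_offset, segs[i].size);
    }
}
```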
|
||||
|
||||
That's all.
|
||||
这就是全部的工作。
|
||||
|
||||
From this moment, all loadable segments are in the correct place.
|
||||
从现在开始,所有可加载的段都在正确的位置。
|
||||
|
||||
The next step after `parse_elf` is the call to `handle_relocations`. Its implementation depends on the `CONFIG_X86_NEED_RELOCS` kernel configuration option: if the option is enabled, the function adjusts addresses in the kernel image, and it is called only if `CONFIG_RANDOMIZE_BASE` was enabled during kernel configuration. The implementation of `handle_relocations` is simple enough. It subtracts the value of `LOAD_PHYSICAL_ADDR` from the base load address of the kernel, which gives the difference between where the kernel was linked to load and where it was actually loaded. With that difference, the actual load address, the address the kernel was linked to run at, and the relocation table at the end of the kernel image, we can perform the relocation.
|
||||
在`parse_elf`函数之后是调用`handle_relocations`函数。这个函数的实现依赖于`CONFIG_X86_NEED_RELOCS`内核配置选项,如果它被启用,这个函数调整内核镜像的地址,只有在内核配置时启用了`CONFIG_RANDOMIZE_BASE`配置选项才会调用。`handle_relocations`函数的实现足够简单。这个函数从基准内核加载地址的值减掉`LOAD_PHYSICAL_ADDR`的值,从而我们获得内核链接后要加载的地址和实际加载地址的差值。在这之后我们可以进行内核重定位,因为我们知道内核加载的实际地址、它被链接的运行的地址和内核镜像末尾的重定位表。
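
The core idea of the relocation step can be illustrated with a simplified sketch. It assumes a plain array of slot indices in place of the real relocation table, whose on-disk layout is not shown in this part; `relocation_delta` and `apply_relocations` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Difference between where the kernel was loaded and where it was linked to load. */
static uint64_t relocation_delta(uint64_t load_addr, uint64_t load_physical_addr)
{
    return load_addr - load_physical_addr;
}

/*
 * Hypothetical sketch: add `delta` to every 32-bit word of the image
 * whose index is listed in `slots`. Words that are not listed are left alone.
 */
static void apply_relocations(uint32_t *words, const size_t *slots,
                              size_t count, uint32_t delta)
{
    for (size_t i = 0; i < count; i++)
        words[slots[i]] += delta;
}
```

If the kernel was linked for `0x1000000` but loaded at `0x3000000`, the delta is `0x2000000`, and a stored address such as `0x1000010` becomes `0x3000010`.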
|
||||
|
||||
After the kernel is relocated, we return from `extract_kernel` to [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S).
|
||||
在内核重定位后,我们从`extract_kernel`回来,到 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S).
|
||||
|
||||
The address of the kernel will be in the `rax` register and we jump to it:
|
||||
内核的地址在`rax`寄存器,我们跳到那里:
|
||||
|
||||
```assembly
|
||||
jmp *%rax
|
||||
```
|
||||
|
||||
That's all. Now we are in the kernel!
|
||||
就是这样。现在我们就在内核里!
|
||||
|
||||
Conclusion
|
||||
结论
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the fifth part about the Linux kernel booting process. There will be no more posts about kernel booting (apart from possible updates to this and previous posts), but there will be many posts about other kernel internals.
|
||||
这是关于内核引导过程的第五部分的结尾。我们不会再看到关于内核引导的文章(可能有这篇和前面的文章的更新),但是会有关于其他内核内部细节的很多文章。
|
||||
|
||||
The next chapter will describe more advanced details of the Linux kernel booting process, such as load address randomization.
|
||||
下一章会描述更高级的关于内核引导过程的细节,如加载地址随机化等等。
|
||||
|
||||
If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).
|
||||
如果你有什么问题或建议,写个评论或在 [twitter](https://twitter.com/0xAX) 找我。
|
||||
|
||||
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**
|
||||
**如果你发现文中描述有任何问题,请提交一个 PR 到 [linux-insides-zh](https://github.com/MintCN/linux-insides-zh) 。**
|
||||
|
||||
Links
|
||||
链接
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
|
||||
* [initrd](http://en.wikipedia.org/wiki/Initrd)
|
||||
* [long mode](http://en.wikipedia.org/wiki/Long_mode)
|
||||
* [initrd](https://en.wikipedia.org/wiki/Initrd)
|
||||
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
|
||||
* [bzip2](http://www.bzip.org/)
|
||||
* [RDdRand instruction](http://en.wikipedia.org/wiki/RdRand)
|
||||
* [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [Programmable Interval Timers](http://en.wikipedia.org/wiki/Intel_8253)
|
||||
* [RDRand instruction](https://en.wikipedia.org/wiki/RdRand)
|
||||
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [Programmable Interval Timers](https://en.wikipedia.org/wiki/Intel_8253)
|
||||
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md)
|
||||
|
||||
@@ -1,12 +1,12 @@
|
||||
Kernel booting process. Part 6.
|
||||
内核引导过程. Part 6.
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
简介
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the sixth part of the `Kernel booting process` series. In the [previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md) we have seen the end of the kernel boot process. But we have skipped some important advanced parts.
|
||||
这是`内核引导过程`系列文章的第六部分。在[前一部分](linux-bootstrap-5.md),我们已经看到了内核引导过程的结尾,但是我们跳过了一些高级部分。
|
||||
|
||||
As you may remember, the entry point of the Linux kernel is the `start_kernel` function from the [main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, which starts to execute at the `LOAD_PHYSICAL_ADDR` address. This address depends on the `CONFIG_PHYSICAL_START` kernel configuration option, which is `0x1000000` by default:
|
||||
你可能还记得,Linux内核的入口点是 [main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 的`start_kernel`函数,它在`LOAD_PHYSICAL_ADDR`地址开始执行。这个地址依赖于`CONFIG_PHYSICAL_START`内核配置选项,默认为`0x1000000`:
|
||||
|
||||
```
|
||||
config PHYSICAL_START
|
||||
@@ -19,18 +19,18 @@ config PHYSICAL_START
|
||||
...
|
||||
```
|
||||
|
||||
This value may be changed during kernel configuration, but the load address can also be selected as a random value. For this purpose the `CONFIG_RANDOMIZE_BASE` kernel configuration option should be enabled during kernel configuration.
|
||||
这个选项在内核配置时可以修改,但是加载地址可以选择为一个随机值。为此,`CONFIG_RANDOMIZE_BASE`内核配置选项在内核配置时应该启用。
|
||||
|
||||
In this case the physical address at which the Linux kernel image is decompressed and loaded will be randomized. This part considers the case when this option is enabled and the load address of the kernel image is randomized for [security reasons](https://en.wikipedia.org/wiki/Address_space_layout_randomization).
|
||||
在这种情况下,Linux内核镜像解压和加载的物理地址会被随机化。我们在这一部分考虑这个选项被启用,并且为了[安全原因](https://en.wikipedia.org/wiki/Address_space_layout_randomization),内核镜像的加载地址被随机化的情况。
|
||||
|
||||
Initialization of page tables
|
||||
页表的初始化
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Before the kernel decompressor starts to look for a random memory range where the kernel will be decompressed and loaded, the identity mapped page tables should be initialized. If a [bootloader](https://en.wikipedia.org/wiki/Booting) used the [16-bit or 32-bit boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt), we already have page tables. But in any case, we may need new pages on demand if the kernel decompressor selects a memory range outside of them. That's why we need to build new identity mapped page tables.
|
||||
在内核解压器要开始找随机的内核解压和加载地址之前,应该初始化恒等映射(identity mapped,虚拟地址和物理地址相同)页表。如果[引导加载器](https://en.wikipedia.org/wiki/Booting)使用[16位或32位引导协议](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt),那么我们已经有了页表。但在任何情况下,如果内核解压器选择它们之外的内存区域,我们需要新的页。这就是为什么我们需要建立新的恒等映射页表。
|
||||
|
||||
Yes, building identity mapped page tables is one of the first steps during randomization of the load address. But before we consider it, let's try to remember how we got to this point.
|
||||
是的,建立恒等映射页表是随机化加载地址的最早的步骤之一。但是在此之前,让我们回忆一下我们是怎么来到这里的。
|
||||
|
||||
In the [previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md), we saw the transition to [long mode](https://en.wikipedia.org/wiki/Long_mode) and the jump to the kernel decompressor entry point, the `extract_kernel` function. The randomization starts here with the call of the:
|
||||
在[前一部分](linux-bootstrap-5.md),我们看到了到[长模式](https://en.wikipedia.org/wiki/Long_mode)的转换,并跳转到了内核解压器的入口点——`extract_kernel`函数。随机化从调用这个函数开始:
|
||||
|
||||
```C
|
||||
void choose_random_location(unsigned long input,
|
||||
@@ -41,7 +41,7 @@ void choose_random_location(unsigned long input,
|
||||
{}
|
||||
```
|
||||
|
||||
function. As you may see, this function takes the following five parameters:
|
||||
你可以看到,这个函数有五个参数:
|
||||
|
||||
* `input`;
|
||||
* `input_size`;
|
||||
@@ -49,7 +49,7 @@ function. As you may see, this function takes following five parameters:
|
||||
* `output_size`;
|
||||
* `virt_addr`.
|
||||
|
||||
Let's try to understand what these parameters are. The first `input` parameter came from parameters of the `extract_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file:
|
||||
让我们试着理解一下这些参数是什么。第一个`input`参数来自源文件 [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) 里的`extract_kernel`函数:
|
||||
|
||||
```C
|
||||
asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
|
||||
@@ -71,13 +71,13 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
|
||||
}
|
||||
```
|
||||
|
||||
This parameter is passed from assembler code:
|
||||
这个参数由 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) 的汇编代码传递:
|
||||
|
||||
```assembly
|
||||
leaq input_data(%rip), %rdx
|
||||
```
|
||||
|
||||
from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S). The `input_data` is generated by the little [mkpiggy](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/mkpiggy.c) program. If you have a compiled Linux kernel source tree at hand, you can find the file generated by this program at `linux/arch/x86/boot/compressed/piggy.S`. In my case this file looks like this:
|
||||
`input_data`由 [mkpiggy](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/mkpiggy.c) 程序生成。如果你亲手编译过Linux内核源码,你会找到这个程序生成的文件,它应该位于 `linux/arch/x86/boot/compressed/piggy.S`. 在我这里,这个文件是这样的:
|
||||
|
||||
```assembly
|
||||
.section ".rodata..compressed","a",@progbits
|
||||
@@ -91,21 +91,21 @@ input_data:
|
||||
input_data_end:
|
||||
```
|
||||
|
||||
As you may see, it contains four global symbols. The first two, `z_input_len` and `z_output_len`, are the sizes of the compressed and uncompressed `vmlinux.bin.gz`. The third is our `input_data`, and as you may see it points to the Linux kernel image in raw binary format (all debugging symbols, comments and relocation information are stripped). The last, `input_data_end`, points to the end of the compressed Linux image.
|
||||
你能看到它有四个全局符号。前两个`z_input_len`和`z_output_len`是压缩的和解压后的`vmlinux.bin.gz`的大小。第三个是我们的`input_data`,你可以看到,它指向二进制格式(去掉所有调试符号、注释和重定位信息)的Linux内核镜像。最后的`input_data_end`指向压缩的Linux镜像的末尾。
|
||||
|
||||
So, our first parameter of the `choose_random_location` function is the pointer to the compressed kernel image that is embedded into the `piggy.o` object file.
|
||||
所以我们`choose_random_location`函数的第一个参数是指向嵌入在`piggy.o`目标文件的压缩的内核镜像的指针。
|
||||
|
||||
The second parameter of the `choose_random_location` function is the `z_input_len` that we have seen just now.
|
||||
`choose_random_location`函数的第二个参数是我们刚刚看到的`z_input_len`.
|
||||
|
||||
The third and fourth parameters of the `choose_random_location` function are the address where the decompressed kernel image is placed and the length of the decompressed kernel image, respectively. The address where the decompressed kernel is put comes from [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) and it is the address of `startup_32` aligned to a 2 megabyte boundary. The size of the decompressed kernel comes from the same `piggy.S` and it is `z_output_len`.
|
||||
`choose_random_location`函数的第三和第四个参数分别是解压后的内核镜像的位置和长度。放置解压后内核的地址来自 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S),并且它是`startup_32`对齐到 2MB 边界的地址。解压后的内核的大小来自同样的`piggy.S`,并且它是`z_output_len`.
|
||||
|
||||
The last parameter of the `choose_random_location` function is the virtual address of the kernel load address. As we may see, by default it coincides with the default physical load address:
|
||||
`choose_random_location`函数的最后一个参数是内核加载地址的虚拟地址。我们可以看到,它和默认的物理加载地址相同:
|
||||
|
||||
```C
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
```

which depends on the kernel configuration:

```C
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
@@ -113,7 +113,7 @@ which depends on kernel configuration:
				& ~(CONFIG_PHYSICAL_ALIGN - 1))
```
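
The final `& ~(CONFIG_PHYSICAL_ALIGN - 1)` step clears the low bits of an address, and combined with adding `align - 1` first it rounds the address up to the next aligned boundary. The following is a minimal sketch of that round-up arithmetic, not the kernel's own code. The `EXAMPLE_*` constants and both helper names are hypothetical, and the two values are only assumed to resemble a typical x86_64 configuration.

```C
#include <assert.h>

/* Hypothetical stand-ins for the Kconfig values; not read from a real .config. */
#define EXAMPLE_PHYSICAL_START 0x1000000UL /* 16 MiB */
#define EXAMPLE_PHYSICAL_ALIGN 0x200000UL  /* 2 MiB */

/* Round addr up to the next multiple of align (align must be a power of two). */
static inline unsigned long round_up_pow2(unsigned long addr, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + (align - 1)) & ~(align - 1);
}

/* The load address that results from the two assumed configuration values. */
static inline unsigned long example_load_physical_addr(void)
{
	return round_up_pow2(EXAMPLE_PHYSICAL_START, EXAMPLE_PHYSICAL_ALIGN);
}
```

With these assumed values the start address is already a multiple of 2 MiB, so the result equals the start address. An unaligned start such as `0x1000001` would be moved up to `0x1200000`.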

Now that we have considered the parameters of the `choose_random_location` function, let's look at its implementation. The function starts by checking for the `nokaslr` option in the kernel command line:

```C
if (cmdline_find_option_bool("nokaslr")) {
@@ -122,7 +122,7 @@ if (cmdline_find_option_bool("nokaslr")) {
}
```

and if the option was given, we exit from the `choose_random_location` function and the kernel load address will not be randomized. Related command line options can be found in the [kernel documentation](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt):

```
kaslr/nokaslr [X86]
@@ -134,15 +134,15 @@ kASLR is disabled by default. When kASLR is enabled,
hibernation will be disabled.
```

Let's assume that we didn't pass `nokaslr` on the kernel command line and that the `CONFIG_RANDOMIZE_BASE` kernel configuration option is enabled.

The next step is the call of the:

```C
initialize_identity_maps();
```

function, which is defined in the [arch/x86/boot/compressed/pagetable.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/pagetable.c) source code file. This function starts by initializing `mapping_info`, an instance of the `x86_mapping_info` structure:

```C
mapping_info.alloc_pgt_page = alloc_pgt_page;
@@ -151,7 +151,7 @@ mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE | sev_me_mask;
```

The `x86_mapping_info` structure is defined in the [arch/x86/include/asm/init.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/init.h) header file and looks like this:

```C
struct x86_mapping_info {
@@ -164,18 +164,19 @@ struct x86_mapping_info {
};
```

This structure provides information about memory mappings. As you may remember from the previous part, we have already set up initial page tables covering memory from 0 up to `4G`. Now we may need to access memory above `4G` to load the kernel at a random position. So the `initialize_identity_maps` function initializes a memory region for new page tables that may be needed. First, let's go through the fields of the `x86_mapping_info` structure.

The `alloc_pgt_page` field is a callback function that is called to allocate space for a page table entry. The `context` field is, in our case, an instance of the `alloc_pgt_data` structure, which is used to track allocated page tables. The `page_flag` and `kernpg_flag` fields are page flags. The first holds the flags for `PMD` or `PUD` entries. The second, `kernpg_flag`, holds the flags for kernel pages and can be overridden later. The `direct_gbpages` field indicates support for huge pages, and the last field, `offset`, is the offset between kernel virtual addresses and physical addresses up to the `PMD` level.

The `alloc_pgt_page` callback just checks that there is space for a new page and allocates the new page:

```C
entry = pages->pgt_buf + pages->pgt_buf_offset;
pages->pgt_buf_offset += PAGE_SIZE;
```

in the buffer described by the:

```C
struct alloc_pgt_data {
@@ -185,36 +186,36 @@ struct alloc_pgt_data {
};
```

structure and returns the address of the new page. The last goal of the `initialize_identity_maps` function is to initialize `pgt_buf_size` and `pgt_buf_offset`. As we are only in the initialization phase, the `initialize_identity_maps` function sets `pgt_buf_offset` to zero:

```C
pgt_data.pgt_buf_offset = 0;
```

and `pgt_data.pgt_buf_size` will be set to `77824` or `69632` depending on which boot protocol the bootloader uses (64-bit or 32-bit). The same goes for `pgt_data.pgt_buf`. If the bootloader loaded the kernel at `startup_32`, `pgt_data.pgt_buf` will point to the end of the page table that was already initialized in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):

```C
pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
```

where `_pgtable` points to the beginning of this page table (see [_pgtable](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S)). Otherwise, if the bootloader used the 64-bit boot protocol and loaded the kernel at `startup_64`, the early page tables should have been built by the bootloader itself and `_pgtable` will simply be overwritten:

```C
pgt_data.pgt_buf = _pgtable;
```
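
The allocation scheme described in this section, where a callback hands out consecutive pages from a fixed buffer and tracks an offset, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the kernel's implementation: `example_pgt_data`, `example_alloc_page` and `EXAMPLE_PAGE_SIZE` are hypothetical names, and the 4096-byte page size is assumed.

```C
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_PAGE_SIZE 4096UL /* assumed page size */

/* Hypothetical bookkeeping for a buffer that new page tables are carved from. */
struct example_pgt_data {
	unsigned char *buf; /* start of the reserved buffer */
	size_t buf_size;    /* total number of bytes available */
	size_t buf_offset;  /* number of bytes already handed out */
};

/* Hand out the next page of the buffer, or NULL when the buffer is exhausted. */
static inline void *example_alloc_page(struct example_pgt_data *pages)
{
	unsigned char *entry;

	assert(pages != NULL);

	/* Validate that there is still room for one more page. */
	if (pages->buf_offset + EXAMPLE_PAGE_SIZE > pages->buf_size)
		return NULL;

	entry = pages->buf + pages->buf_offset;
	pages->buf_offset += EXAMPLE_PAGE_SIZE;
	return entry;
}
```

A caller would point `buf` at the reserved region, set `buf_offset` to zero once, and then call the allocator each time a new page table is needed. Nothing is ever freed, which is acceptable for a short-lived early boot stage.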

Now that the buffer for new page tables is initialized, we may return to the `choose_random_location` function.

Avoid reserved memory ranges
--------------------------------------------------------------------------------

After the data related to identity page tables is initialized, we can start choosing a random location for the decompressed kernel image. But as you may guess, we can't choose just any address. Some address ranges in memory are reserved. They are occupied by important things such as the [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), the kernel command line and so on. The

```C
mem_avoid_init(input, input_size, *output);
```

function will help us with this. All unsafe memory regions will be collected in the:

```C
struct mem_vector {
@@ -225,7 +226,7 @@ struct mem_vector {
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
```

array, where `MEM_AVOID_MAX` comes from the `mem_avoid_index` [enum](https://en.wikipedia.org/wiki/Enumerated_type#C), which represents the different types of reserved memory regions:

```C
enum mem_avoid_index {
@@ -239,9 +240,9 @@ enum mem_avoid_index {
};
```

Both are defined in the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) source code file.


Let's look at the implementation of the `mem_avoid_init` function. The main goal of this function is to store information about the reserved memory regions described by the `mem_avoid_index` enum in the `mem_avoid` array and to create new pages for such regions in our new identity mapped buffer. Numerous parts of the `mem_avoid_init` function are similar, so let's take a look at one of them:

```C
mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
@@ -250,7 +251,7 @@ add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
		 mem_avoid[MEM_AVOID_ZO_RANGE].size);
```

At its beginning, the `mem_avoid_init` function tries to avoid the memory region that is used for the current kernel decompression. We fill an entry of the `mem_avoid` array with the start and size of this region and call the `add_identity_map` function, which should build identity mapped pages for this region. The `add_identity_map` function is defined in the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) source code file and looks like this:

```C
void add_identity_map(unsigned long start, unsigned long size)
@@ -267,18 +268,18 @@ void add_identity_map(unsigned long start, unsigned long size)
}
```

As you can see, it aligns the memory region to a 2 megabyte boundary and checks the given start and end addresses.

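
The alignment step can be illustrated with a small sketch. It assumes that the start of the region is rounded down and the end is rounded up to a 2 MiB boundary, and that an empty result is rejected. The names `example_range`, `example_align_range` and `EXAMPLE_PMD_SIZE` are hypothetical and are not taken from the kernel sources.

```C
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_PMD_SIZE (2UL << 20) /* 2 MiB, the size of one large page */

/* Hypothetical result type: the aligned [start, end) range to be mapped. */
struct example_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Widen [start, start + size) so that both ends sit on a 2 MiB boundary.
 * Returns false when the widened range is empty and nothing needs mapping.
 */
static inline bool example_align_range(unsigned long start, unsigned long size,
				       struct example_range *out)
{
	unsigned long end = start + size;

	assert(out != NULL);

	start &= ~(EXAMPLE_PMD_SIZE - 1);                             /* round down */
	end = (end + EXAMPLE_PMD_SIZE - 1) & ~(EXAMPLE_PMD_SIZE - 1); /* round up */

	if (start >= end)
		return false;

	out->start = start;
	out->end = end;
	return true;
}
```

For example, a 4 KiB region starting at `0x1234567` would be widened to the range from `0x1200000` to `0x1400000`, so one whole 2 MiB page covers it.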

In the end it just calls the `kernel_ident_mapping_init` function from the [arch/x86/mm/ident_map.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ident_map.c) source code file and passes the `mapping_info` instance that was initialized above, the address of the top level page table and the addresses of the memory region for which a new identity mapping should be built.

The `kernel_ident_mapping_init` function sets default flags for new pages if they were not given:

```C
if (!info->kernpg_flag)
	info->kernpg_flag = _KERNPG_TABLE;
```

and starts to build new 2-megabyte (because of the `PSE` bit in `mapping_info.page_flag`) page entries (`PGD -> P4D -> PUD -> PMD` in the case of [five-level page tables](https://lwn.net/Articles/717293/) or `PGD -> PUD -> PMD` in the case of [four-level page tables](https://lwn.net/Articles/117749/)) related to the given addresses.

```C
for (; addr < end; addr = next) {
@@ -295,32 +296,32 @@ for (; addr < end; addr = next) {
}
```

First of all, we find the next entry of the `Page Global Directory` for the given address, and if it is greater than the `end` of the given memory region, we set it to `end`. After this we allocate a new page with the `x86_mapping_info` callback that we already considered above and call the `ident_p4d_init` function. The `ident_p4d_init` function does the same, but for the lower-level page directories (`p4d` -> `pud` -> `pmd`).

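
The "find the next entry and clamp it to `end`" step can be sketched in isolation. The sketch assumes four-level paging, where one top-level entry covers 2^39 bytes. The `EXAMPLE_*` macros and both function names are hypothetical, and the real loop also allocates and fills page tables, which is left out here.

```C
#include <assert.h>

/* Assumed four-level layout: one top-level entry spans 2^39 bytes (512 GiB). */
#define EXAMPLE_PGDIR_SHIFT 39
#define EXAMPLE_PGDIR_SIZE (1ULL << EXAMPLE_PGDIR_SHIFT)
#define EXAMPLE_PGDIR_MASK (~(EXAMPLE_PGDIR_SIZE - 1))

/* Start of the next top-level entry after addr, clamped to end. */
static inline unsigned long long example_next_pgd_boundary(unsigned long long addr,
							   unsigned long long end)
{
	unsigned long long next;

	assert(addr < end);

	next = (addr & EXAMPLE_PGDIR_MASK) + EXAMPLE_PGDIR_SIZE;
	if (next > end)
		next = end;
	return next;
}

/* Count how many top-level entries a walk over [addr, end) would visit. */
static inline unsigned int example_count_pgd_steps(unsigned long long addr,
						   unsigned long long end)
{
	unsigned int steps = 0;

	while (addr < end) {
		addr = example_next_pgd_boundary(addr, end);
		steps++;
	}
	return steps;
}
```

A region that lies entirely inside one 512 GiB slot is handled in a single step, while a region that straddles a slot boundary takes one step per slot it touches.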

That's all.

New page entries related to the reserved addresses are now in our page tables. This is not the end of the `mem_avoid_init` function, but the other parts are similar. They just build pages for the [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), the kernel command line and so on.

Now we may return to the `choose_random_location` function.

Physical address randomization
--------------------------------------------------------------------------------

After the reserved memory regions have been stored in the `mem_avoid` array and identity mapped pages have been built for them, we select the minimal available address from which to choose a random memory region for decompressing the kernel:

```C
min_addr = min(*output, 512UL << 20);
```

As you can see, it is capped at `512` megabytes: `512UL << 20` is 512 * 2^20 bytes, so `min_addr` is the smaller of `*output` and 512 megabytes. This `512` megabytes value was selected just to avoid unknown things in lower memory.

The next step is to select random physical and virtual addresses at which to load the kernel. The physical address comes first:

```C
random_addr = find_random_phys_addr(min_addr, output_size);
```

The `find_random_phys_addr` function is defined in the [same](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) source code file:

```C
static unsigned long find_random_phys_addr(unsigned long minimum,
@@ -336,7 +337,7 @@ static unsigned long find_random_phys_addr(unsigned long minimum,
}
```

The main goal of the `process_efi_entries` function is to find all suitable memory ranges in fully accessible memory in which to load the kernel. If the kernel is compiled for and run on a system without [EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) support, we continue to search for such memory regions in the [e820](https://en.wikipedia.org/wiki/E820) regions. All memory regions that are found will be stored in the

```C
struct slot_area {
@@ -349,20 +350,20 @@ struct slot_area {
static struct slot_area slot_areas[MAX_SLOT_AREA];
```

array. The kernel decompressor should select a random index of this array, and that will be the random place where the kernel is decompressed. This selection is done by the `slots_fetch_random` function. The main goal of the `slots_fetch_random` function is to select a random memory range from the `slot_areas` array via the `kaslr_get_random_long` function:

```C
slot = kaslr_get_random_long("Physical") % slot_max;
```

The `kaslr_get_random_long` function is defined in the [arch/x86/lib/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/lib/kaslr.c) source code file and it just returns a random number. Note that the random number is obtained in different ways depending on the kernel configuration and the capabilities of the system (a random number based on the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter), [rdrand](https://en.wikipedia.org/wiki/RdRand) and so on).

That's all; from this point a random memory range has been selected.

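
The step from a random slot number to an address can be sketched as follows. This is an illustration under assumptions, not the body of `slots_fetch_random`: each area is assumed to record a base address and the number of aligned slots it offers, the slot index is passed in by the caller instead of being drawn from a random source, and `example_slot_area`, `example_pick_slot` and `EXAMPLE_SLOT_ALIGN` are hypothetical names.

```C
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_SLOT_ALIGN 0x200000UL /* assumed 2 MiB spacing between slots */

/* Hypothetical area descriptor: a base address and how many slots it offers. */
struct example_slot_area {
	unsigned long addr;
	unsigned long num;
};

/*
 * Map a global slot index onto an address by walking the areas in order
 * and skipping the slots that belong to earlier areas.
 * Returns 0 when the index is beyond the last slot.
 */
static inline unsigned long example_pick_slot(const struct example_slot_area *areas,
					      size_t count, unsigned long slot)
{
	size_t i;

	assert(areas != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (slot >= areas[i].num) {
			slot -= areas[i].num;
			continue;
		}
		return areas[i].addr + slot * EXAMPLE_SLOT_ALIGN;
	}
	return 0;
}
```

In real use the index would be a random number reduced modulo the total number of slots, so every slot in every area is equally likely to be chosen.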

Virtual address randomization
--------------------------------------------------------------------------------

After a random memory region has been selected by the kernel decompressor, new identity mapped pages will be built for this region on demand:

```C
random_addr = find_random_phys_addr(min_addr, output_size);
@@ -373,7 +374,7 @@ if (*output != random_addr) {
}
```

From this point `output` will store the base address of the memory region where the kernel will be decompressed. But so far, as you may remember, we have randomized only the physical address. The virtual address should be randomized too in the case of the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:

```C
if (IS_ENABLED(CONFIG_X86_64))
@@ -382,22 +383,22 @@ if (IS_ENABLED(CONFIG_X86_64))
*virt_addr = random_addr;
```

As you can see, in the case of a non-`x86_64` architecture, the randomized virtual address coincides with the randomized physical address. The `find_random_virt_addr` function calculates the number of virtual memory ranges that may hold the kernel image and calls the `kaslr_get_random_long` function that we already saw when we looked for a random `physical` address.

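
The idea of counting candidate virtual ranges and then picking one can be sketched as below. This is only a sketch under stated assumptions: the 1 GiB window, the 2 MiB alignment and all `example_*` names are hypothetical placeholders for the kernel's own constants and helpers, and the random value is passed in by the caller.

```C
#include <assert.h>

#define EXAMPLE_VIRT_ALIGN 0x200000UL    /* assumed 2 MiB slot spacing */
#define EXAMPLE_VIRT_WINDOW 0x40000000UL /* assumed 1 GiB window for the image */

/* Number of aligned positions at which an image of image_size still fits. */
static inline unsigned long example_virt_slots(unsigned long minimum,
					       unsigned long image_size)
{
	assert(minimum + image_size <= EXAMPLE_VIRT_WINDOW);
	return (EXAMPLE_VIRT_WINDOW - minimum - image_size) / EXAMPLE_VIRT_ALIGN + 1;
}

/* Turn a random number into one of those positions. */
static inline unsigned long example_virt_addr(unsigned long minimum,
					      unsigned long image_size,
					      unsigned long random)
{
	unsigned long slots = example_virt_slots(minimum, image_size);

	return (random % slots) * EXAMPLE_VIRT_ALIGN + minimum;
}
```

With a 16 MiB minimum and a 32 MiB image, these assumed constants give 489 possible positions, the lowest of which is the minimum address itself.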

From this moment we have both randomized base physical (`*output`) and virtual (`*virt_addr`) addresses for the decompressed kernel.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the sixth and last part about the Linux kernel booting process. We will not see more posts about kernel booting (except maybe updates to this and previous posts), but there will be many posts about other kernel internals.

The next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.

If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**

**If you find any problems with the descriptions in this translation, please submit a PR to [linux-insides-zh](https://github.com/MintCN/linux-insides-zh).**

Links
--------------------------------------------------------------------------------

@@ -27,8 +27,8 @@
|├ [1.2](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-2.md)|[@hailincai](https://github.com/hailincai)|Completed|
|├ [1.3](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-3.md)|[@hailincai](https://github.com/hailincai)|Completed|
|├ [1.4](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-4.md)|[@zmj1316](https://github.com/zmj1316)|Completed|
|├ [1.5](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-5.md)|[@mytbk](https://github.com/mytbk)|Updated to [31998d14](https://github.com/0xAX/linux-insides/commit/31998d14320f25399d67d4fff446a65178931e90)|
|└ [1.6](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-6.md)|[@mytbk](https://github.com/mytbk)|Updated to [31998d14](https://github.com/0xAX/linux-insides/commit/31998d14320f25399d67d4fff446a65178931e90)|
| 2. [Initialization](https://github.com/MintCN/linux-insides-zh/tree/master/Initialization)||In progress|
|├ [2.0](https://github.com/MintCN/linux-insides-zh/blob/master/Initialization/README.md)|[@mudongliang](https://github.com/mudongliang)|Updated to [44017507](https://github.com/0xAX/linux-insides/commit/4401750766f7150dcd16f579026f5554541a6ab9)|
|├ [2.1](https://github.com/MintCN/linux-insides-zh/blob/master/Initialization/linux-initialization-1.md)|[@dontpanic92](https://github.com/dontpanic92)|Updated to [44017507](https://github.com/0xAX/linux-insides/commit/4401750766f7150dcd16f579026f5554541a6ab9)|