mirror of
https://github.com/MintCN/linux-insides-zh.git
synced 2026-04-24 18:50:42 +08:00
add part 6 of Booting Section, and update part 5 from upstream
This commit is contained in:
@@ -9,7 +9,7 @@ This is the fifth part of the `Kernel booting process` series. We saw transition
|
||||
Preparation before kernel decompression
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We stopped right before the jump on the 64-bit entry point - `startup_64` which is located in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) source code file. We already saw the jump to the `startup_64` in the `startup_32`:
|
||||
We stopped right before the jump on the `64-bit` entry point - `startup_64` which is located in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) source code file. We already saw the jump to the `startup_64` in the `startup_32`:
|
||||
|
||||
```assembly
|
||||
pushl $__KERNEL_CS
|
||||
@@ -24,7 +24,7 @@ We stopped right before the jump on the 64-bit entry point - `startup_64` which
|
||||
lret
|
||||
```
|
||||
|
||||
in the previous part, `startup_64` starts to work. Since we loaded the new Global Descriptor Table and there was CPU transition in other mode (64-bit mode in our case), we can see the setup of the data segments:
|
||||
in the previous part. Since we loaded the new `Global Descriptor Table` and there was CPU transition in other mode (`64-bit` mode in our case), we can see the setup of the data segments:
|
||||
|
||||
```assembly
|
||||
.code64
|
||||
@@ -38,7 +38,7 @@ ENTRY(startup_64)
|
||||
movl %eax, %gs
|
||||
```
|
||||
|
||||
in the beginning of the `startup_64`. All segment registers besides `cs` now point to the `ds` which is `0x18` (if you don't understand why it is `0x18`, read the previous part).
|
||||
in the beginning of the `startup_64`. All segment registers besides `cs` register now reseted as we joined into the `long mode`.
|
||||
|
||||
The next step is computation of difference between where the kernel was compiled and where it was loaded:
|
||||
|
||||
@@ -55,10 +55,12 @@ The next step is computation of difference between where the kernel was compiled
|
||||
#endif
|
||||
movq $LOAD_PHYSICAL_ADDR, %rbp
|
||||
1:
|
||||
leaq z_extract_offset(%rbp), %rbx
|
||||
movl BP_init_size(%rsi), %ebx
|
||||
subl $_end, %ebx
|
||||
addq %rbp, %rbx
|
||||
```
|
||||
|
||||
`rbp` contains the decompressed kernel start address and after this code executes `rbx` register will contain address to relocate the kernel code for decompression. We already saw code like this in the `startup_32` ( you can read about it in the previous part - [Calculate relocation address](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)), but we need to do this calculation again because the bootloader can use 64-bit boot protocol and `startup_32` just will not be executed in this case.
|
||||
The `rbp` contains the decompressed kernel start address and after this code executes `rbx` register will contain address to relocate the kernel code for decompression. We already saw code like this in the `startup_32` ( you can read about it in the previous part - [Calculate relocation address](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)), but we need to do this calculation again because the bootloader can use 64-bit boot protocol and `startup_32` just will not be executed in this case.
|
||||
|
||||
In the next step we can see setup of the stack pointer and resetting of the flags register:
|
||||
|
||||
@@ -69,7 +71,7 @@ In the next step we can see setup of the stack pointer and resetting of the flag
|
||||
popfq
|
||||
```
|
||||
|
||||
As you can see above, the `rbx` register contains the start address of the kernel decompressor code and we just put this address with `boot_stack_end` offset to the `rsp` register which represents pointer to the top of the stack. After this step, the stack will be correct. You can find definition of the `boot_stack_end` in the end of [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file:
|
||||
As you can see above, the `rbx` register contains the start address of the kernel decompressor code and we just put this address with `boot_stack_end` offset to the `rsp` register which represents pointer to the top of the stack. After this step, the stack will be correct. You can find definition of the `boot_stack_end` in the end of [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) assembly source code file:
|
||||
|
||||
```assembly
|
||||
.bss
|
||||
@@ -81,7 +83,7 @@ boot_stack:
|
||||
boot_stack_end:
|
||||
```
|
||||
|
||||
It located in the end of the `.bss` section, right before the `.pgtable`. If you will look into [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) linker script, you will find Definition of the `.bss` and `.pgtable` there.
|
||||
It located in the end of the `.bss` section, right before the `.pgtable`. If you will look into [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/vmlinux.lds.S) linker script, you will find Definition of the `.bss` and `.pgtable` there.
|
||||
|
||||
As we set the stack, now we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Before details, let's look at this assembly code:
|
||||
|
||||
@@ -99,7 +101,7 @@ As we set the stack, now we can copy the compressed kernel to the address that w
|
||||
|
||||
First of all we push `rsi` to the stack. We need preserve the value of `rsi`, because this register now stores a pointer to the `boot_params` which is real mode structure that contains booting related data (you must remember this structure, we filled it in the start of kernel setup). In the end of this code we'll restore the pointer to the `boot_params` into `rsi` again.
|
||||
|
||||
The next two `leaq` instructions calculates effective addresses of the `rip` and `rbx` with `_bss - 8` offset and put it to the `rsi` and `rdi`. Why do we calculate these addresses? Actually the compressed kernel image is located between this copying code (from `startup_32` to the current code) and the decompression code. You can verify this by looking at the linker script - [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S):
|
||||
The next two `leaq` instructions calculates effective addresses of the `rip` and `rbx` with `_bss - 8` offset and put it to the `rsi` and `rdi`. Why do we calculate these addresses? Actually the compressed kernel image is located between this copying code (from `startup_32` to the current code) and the decompression code. You can verify this by looking at the linker script - [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/vmlinux.lds.S):
|
||||
|
||||
```
|
||||
. = 0;
|
||||
@@ -171,39 +173,35 @@ In the previous paragraph we saw that the `.text` section starts with the `reloc
|
||||
|
||||
We need to initialize the `.bss` section, because we'll soon jump to [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code. Here we just clear `eax`, put the address of `_bss` in `rdi` and `_ebss` in `rcx`, and fill it with zeros with the `rep stosq` instruction.
|
||||
|
||||
At the end, we can see the call to the `decompress_kernel` function:
|
||||
At the end, we can see the call to the `extract_kernel` function:
|
||||
|
||||
```assembly
|
||||
pushq %rsi
|
||||
movq $z_run_size, %r9
|
||||
pushq %r9
|
||||
movq %rsi, %rdi
|
||||
leaq boot_heap(%rip), %rsi
|
||||
leaq input_data(%rip), %rdx
|
||||
movl $z_input_len, %ecx
|
||||
movq %rbp, %r8
|
||||
movq $z_output_len, %r9
|
||||
call decompress_kernel
|
||||
popq %r9
|
||||
call extract_kernel
|
||||
popq %rsi
|
||||
```
|
||||
|
||||
Again we set `rdi` to a pointer to the `boot_params` structure and call `decompress_kernel` from [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) with seven arguments:
|
||||
Again we set `rdi` to a pointer to the `boot_params` structure and preserve it on the stack. In the same time we set `rsi` to point to the area which should be usedd for kernel uncompression. The last step is preparation of the `extract_kernel` parameters and call of this function which will uncompres the kernel. The `extract_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) source code file and takes six arguments:
|
||||
|
||||
* `rmode` - pointer to the [boot_params](https://github.com/torvalds/linux/blob/master//arch/x86/include/uapi/asm/bootparam.h#L114) structure which is filled by bootloader or during early kernel initialization;
|
||||
* `rmode` - pointer to the [boot_params](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973//arch/x86/include/uapi/asm/bootparam.h#L114) structure which is filled by bootloader or during early kernel initialization;
|
||||
* `heap` - pointer to the `boot_heap` which represents start address of the early boot heap;
|
||||
* `input_data` - pointer to the start of the compressed kernel or in other words pointer to the `arch/x86/boot/compressed/vmlinux.bin.bz2`;
|
||||
* `input_len` - size of the compressed kernel;
|
||||
* `output` - start address of the future decompressed kernel;
|
||||
* `output_len` - size of decompressed kernel;
|
||||
* `run_size` - amount of space needed to run the kernel including `.bss` and `.brk` sections.
|
||||
|
||||
All arguments will be passed through the registers according to [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf). We've finished all preparation and can now look at the kernel decompression.
|
||||
|
||||
Kernel decompression
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As we saw in previous paragraph, the `decompress_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file and takes seven arguments. This function starts with the video/console initialization that we already saw in the previous parts. We need to do this again because we don't know if we started in [real mode](https://en.wikipedia.org/wiki/Real_mode) or a bootloader was used, or whether the bootloader used the 32 or 64-bit boot protocol.
|
||||
As we saw in previous paragraph, the `extract_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) source code file and takes six arguments. This function starts with the video/console initialization that we already saw in the previous parts. We need to do this again because we don't know if we started in [real mode](https://en.wikipedia.org/wiki/Real_mode) or a bootloader was used, or whether the bootloader used the `32` or `64-bit` boot protocol.
|
||||
|
||||
After the first initialization steps, we store pointers to the start of the free memory and to the end of it:
|
||||
|
||||
@@ -212,7 +210,7 @@ free_mem_ptr = heap;
|
||||
free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
|
||||
```
|
||||
|
||||
where the `heap` is the second parameter of the `decompress_kernel` function which we got in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
|
||||
where the `heap` is the second parameter of the `extract_kernel` function which we got in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S):
|
||||
|
||||
```assembly
|
||||
leaq boot_heap(%rip), %rsi
|
||||
@@ -225,218 +223,47 @@ boot_heap:
|
||||
.fill BOOT_HEAP_SIZE, 1, 0
|
||||
```
|
||||
|
||||
where the `BOOT_HEAP_SIZE` is macro which expands to `0x8000` (`0x400000` in a case of `bzip2` kernel) and represents the size of the heap.
|
||||
where the `BOOT_HEAP_SIZE` is macro which expands to `0x10000` (`0x400000` in a case of `bzip2` kernel) and represents the size of the heap.
|
||||
|
||||
After heap pointers initialization, the next step is the call of the `choose_random_location` function from [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c#L425) source code file. As we can guess from the function name, it chooses the memory location where the kernel image will be decompressed. It may look weird that we need to find or even `choose` location where to decompress the compressed kernel image, but the Linux kernel supports [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) which allows decompression of the kernel into a random address, for security reasons. Let's open the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c#L425) source code file and look at `choose_random_location`.
|
||||
After heap pointers initialization, the next step is the call of the `choose_random_location` function from [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/kaslr.c#L425) source code file. As we can guess from the function name, it chooses the memory location where the kernel image will be decompressed. It may look weird that we need to find or even `choose` location where to decompress the compressed kernel image, but the Linux kernel supports [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) which allows decompression of the kernel into a random address, for security reasons.
|
||||
|
||||
First, `choose_random_location` tries to find the `kaslr` option in the Linux kernel command line if `CONFIG_HIBERNATION` is set, and `nokaslr` otherwise:
|
||||
We will not consider randomization of the Linux kernel load address in this part, but will do it in the next part.
|
||||
|
||||
Now let's back to [misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c#L404). After getting the address for the kernel image, there need to be some checks to be sure that the retrieved random address is correctly aligned and address is not wrong:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_HIBERNATION
|
||||
if (!cmdline_find_option_bool("kaslr")) {
|
||||
debug_putstr("KASLR disabled by default...\n");
|
||||
goto out;
|
||||
}
|
||||
#else
|
||||
if (cmdline_find_option_bool("nokaslr")) {
|
||||
debug_putstr("KASLR disabled by cmdline...\n");
|
||||
goto out;
|
||||
}
|
||||
#endif
|
||||
if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
|
||||
error("Destination physical address inappropriately aligned");
|
||||
|
||||
if (virt_addr & (MIN_KERNEL_ALIGN - 1))
|
||||
error("Destination virtual address inappropriately aligned");
|
||||
|
||||
if (heap > 0x3fffffffffffUL)
|
||||
error("Destination address too large");
|
||||
|
||||
if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE)
|
||||
error("Destination virtual address is beyond the kernel mapping area");
|
||||
|
||||
if ((unsigned long)output != LOAD_PHYSICAL_ADDR)
|
||||
error("Destination address does not match LOAD_PHYSICAL_ADDR");
|
||||
|
||||
if (virt_addr != LOAD_PHYSICAL_ADDR)
|
||||
error("Destination virtual address changed when not relocatable");
|
||||
```
|
||||
|
||||
If the `CONFIG_HIBERNATION` kernel configuration option is enabled during kernel configuration and there is no `kaslr` option in the Linux kernel command line, it prints `KASLR disabled by default...` and jumps to the `out` label:
|
||||
|
||||
```C
|
||||
out:
|
||||
return (unsigned char *)choice;
|
||||
```
|
||||
|
||||
which just returns the `output` parameter which we passed to the `choose_random_location`, unchanged. If the `CONFIG_HIBERNATION` kernel configuration option is disabled and the `nokaslr` option is in the kernel command line, we jump to `out` again.
|
||||
|
||||
For now, let's assume the kernel was configured with randomization enabled and try to understand what `kASLR` is. We can find information about it in the [documentation](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt):
|
||||
|
||||
```
|
||||
kaslr/nokaslr [X86]
|
||||
|
||||
Enable/disable kernel and module base offset ASLR
|
||||
(Address Space Layout Randomization) if built into
|
||||
the kernel. When CONFIG_HIBERNATION is selected,
|
||||
kASLR is disabled by default. When kASLR is enabled,
|
||||
hibernation will be disabled.
|
||||
```
|
||||
|
||||
It means that we can pass the `kaslr` option to the kernel's command line and get a random address for the decompressed kernel (you can read more about ASLR [here](https://en.wikipedia.org/wiki/Address_space_layout_randomization)). So, our current goal is to find random address where we can `safely` to decompress the Linux kernel. I repeat: `safely`. What does it mean in this context? You may remember that besides the code of decompressor and directly the kernel image, there are some unsafe places in memory. For example, the [initrd](https://en.wikipedia.org/wiki/Initrd) image is in memory too, and we must not overlap it with the decompressed kernel.
|
||||
|
||||
The next function will help us to find a safe place where we can decompress kernel. This function is `mem_avoid_init`. It defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c), and takes four arguments that we already saw in the `decompress_kernel` function:
|
||||
|
||||
* `input_data` - pointer to the start of the compressed kernel, or in other words, the pointer to `arch/x86/boot/compressed/vmlinux.bin.bz2`;
|
||||
* `input_len` - the size of the compressed kernel;
|
||||
* `output` - the start address of the future decompressed kernel;
|
||||
* `output_len` - the size of decompressed kernel.
|
||||
|
||||
The main point of this function is to fill array of the `mem_vector` structures:
|
||||
|
||||
```C
|
||||
#define MEM_AVOID_MAX 5
|
||||
|
||||
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
|
||||
```
|
||||
|
||||
where the `mem_vector` structure contains information about unsafe memory regions:
|
||||
|
||||
```C
|
||||
struct mem_vector {
|
||||
unsigned long start;
|
||||
unsigned long size;
|
||||
};
|
||||
```
|
||||
|
||||
The implementation of the `mem_avoid_init` is pretty simple. Let's look on the part of this function:
|
||||
|
||||
```C
|
||||
...
|
||||
...
|
||||
...
|
||||
initrd_start = (u64)real_mode->ext_ramdisk_image << 32;
|
||||
initrd_start |= real_mode->hdr.ramdisk_image;
|
||||
initrd_size = (u64)real_mode->ext_ramdisk_size << 32;
|
||||
initrd_size |= real_mode->hdr.ramdisk_size;
|
||||
mem_avoid[1].start = initrd_start;
|
||||
mem_avoid[1].size = initrd_size;
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Here we can see calculation of the [initrd](http://en.wikipedia.org/wiki/Initrd) start address and size. The `ext_ramdisk_image` is the high `32-bits` of the `ramdisk_image` field from the setup header, and `ext_ramdisk_size` is the high 32-bits of the `ramdisk_size` field from the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt):
|
||||
|
||||
```
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
...
|
||||
...
|
||||
...
|
||||
0218/4 2.00+ ramdisk_image initrd load address (set by boot loader)
|
||||
021C/4 2.00+ ramdisk_size initrd size (set by boot loader)
|
||||
...
|
||||
```
|
||||
|
||||
And `ext_ramdisk_image` and `ext_ramdisk_size` can be found in the [Documentation/x86/zero-page.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/zero-page.txt):
|
||||
|
||||
```
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
...
|
||||
...
|
||||
...
|
||||
0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits
|
||||
0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits
|
||||
...
|
||||
```
|
||||
|
||||
So we're taking `ext_ramdisk_image` and `ext_ramdisk_size`, shifting them left on `32` (now they will contain low 32-bits in the high 32-bit bits) and getting start address of the `initrd` and size of it. After this we store these values in the `mem_avoid` array.
|
||||
|
||||
The next step after we've collected all unsafe memory regions in the `mem_avoid` array will be searching for a random address that does not overlap with the unsafe regions, using the `find_random_addr` function. First of all we can see the alignment of the output address in the `find_random_addr` function:
|
||||
|
||||
```C
|
||||
minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
|
||||
```
|
||||
|
||||
You can remember `CONFIG_PHYSICAL_ALIGN` configuration option from the previous part. This option provides the value to which kernel should be aligned and it is `0x200000` by default. Once we have the aligned output address, we go through the memory regions which we got with the help of the BIOS [e820](https://en.wikipedia.org/wiki/E820) service and collect regions suitable for the decompressed kernel image:
|
||||
|
||||
```C
|
||||
for (i = 0; i < real_mode->e820_entries; i++) {
|
||||
process_e820_entry(&real_mode->e820_map[i], minimum, size);
|
||||
}
|
||||
```
|
||||
|
||||
Recall that we collected `e820_entries` in the second part of the [Kernel booting process part 2](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-2.md#memory-detection). The `process_e820_entry` function does some checks that an `e820` memory region is not `non-RAM`, that the start address of the memory region is not bigger than maximum allowed `aslr` offset, and that the memory region is above the minimum load location:
|
||||
|
||||
```C
|
||||
struct mem_vector region, img;
|
||||
|
||||
if (entry->type != E820_RAM)
|
||||
return;
|
||||
|
||||
if (entry->addr >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
|
||||
return;
|
||||
|
||||
if (entry->addr + entry->size < minimum)
|
||||
return;
|
||||
```
|
||||
|
||||
After this, we store an `e820` memory region start address and the size in the `mem_vector` structure (we saw definition of this structure above):
|
||||
|
||||
```C
|
||||
region.start = entry->addr;
|
||||
region.size = entry->size;
|
||||
```
|
||||
|
||||
As we store these values, we align the `region.start` as we did it in the `find_random_addr` function and check that we didn't get an address that is outside the original memory region:
|
||||
|
||||
```C
|
||||
region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
|
||||
|
||||
if (region.start > entry->addr + entry->size)
|
||||
return;
|
||||
```
|
||||
|
||||
In the next step, we reduce the size of the memory region to not include rejected regions at the start, and ensure that the last address in the memory region is smaller than `CONFIG_RANDOMIZE_BASE_MAX_OFFSET`, so that the end of the kernel image will be less than the maximum `aslr` offset:
|
||||
|
||||
```C
|
||||
region.size -= region.start - entry->addr;
|
||||
|
||||
if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
|
||||
region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start;
|
||||
```
|
||||
|
||||
Finally, we go through all unsafe memory regions and check that the region does not overlap unsafe areas, such as kernel command line, initrd, etc...:
|
||||
|
||||
```C
|
||||
for (img.start = region.start, img.size = image_size ;
|
||||
mem_contains(®ion, &img) ;
|
||||
img.start += CONFIG_PHYSICAL_ALIGN) {
|
||||
if (mem_avoid_overlap(&img))
|
||||
continue;
|
||||
slots_append(img.start);
|
||||
}
|
||||
```
|
||||
|
||||
If the memory region does not overlap unsafe regions we call the `slots_append` function with the start address of the region. `slots_append` function just collects start addresses of memory regions to the `slots` array:
|
||||
|
||||
```C
|
||||
slots[slot_max++] = addr;
|
||||
```
|
||||
|
||||
which is defined as:
|
||||
|
||||
```C
|
||||
static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
|
||||
CONFIG_PHYSICAL_ALIGN];
|
||||
static unsigned long slot_max;
|
||||
```
|
||||
|
||||
After `process_e820_entry` is done, we will have an array of addresses that are safe for the decompressed kernel. Then we call `slots_fetch_random` function to get a random item from this array:
|
||||
|
||||
```C
|
||||
if (slot_max == 0)
|
||||
return 0;
|
||||
|
||||
return slots[get_random_long() % slot_max];
|
||||
```
|
||||
|
||||
where `get_random_long` function checks different CPU flags as `X86_FEATURE_RDRAND` or `X86_FEATURE_TSC` and chooses a method for getting random number (it can be the RDRAND instruction, the time stamp counter, the programmable interval timer, etc...). After retrieving the random address, execution of the `choose_random_location` is finished.
|
||||
|
||||
Now let's back to [misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c#L404). After getting the address for the kernel image, there need to be some checks to be sure that the retrieved random address is correctly aligned and address is not wrong.
|
||||
|
||||
After all these checks we will see the familiar message:
|
||||
|
||||
```
|
||||
Decompressing Linux...
|
||||
```
|
||||
|
||||
and call the `__decompress` function which will decompress the kernel. The `__decompress` function depends on what decompression algorithm was chosen during kernel compilation:
|
||||
and call the `__decompress` function:
|
||||
|
||||
```C
|
||||
__decompress(input_data, input_len, NULL, NULL, output, output_len, NULL, error);
|
||||
```
|
||||
|
||||
which will decompress the kernel. The implementation of the `__decompress` function depends on what decompression algorithm was chosen during kernel compilation:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_KERNEL_GZIP
|
||||
@@ -495,11 +322,11 @@ Elf64_Phdr *phdrs, *phdr;
|
||||
memcpy(&ehdr, output, sizeof(ehdr));
|
||||
|
||||
if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
|
||||
ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
|
||||
ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
|
||||
ehdr.e_ident[EI_MAG3] != ELFMAG3) {
|
||||
error("Kernel is not a valid ELF file");
|
||||
return;
|
||||
ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
|
||||
ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
|
||||
ehdr.e_ident[EI_MAG3] != ELFMAG3) {
|
||||
error("Kernel is not a valid ELF file");
|
||||
return;
|
||||
}
|
||||
```
|
||||
|
||||
@@ -517,18 +344,23 @@ and if it's not valid, it prints an error message and halts. If we got a valid `
|
||||
#else
|
||||
dest = (void *)(phdr->p_paddr);
|
||||
#endif
|
||||
memcpy(dest,
|
||||
output + phdr->p_offset,
|
||||
phdr->p_filesz);
|
||||
memmove(dest, output + phdr->p_offset, phdr->p_filesz);
|
||||
break;
|
||||
default:
|
||||
break;
|
||||
default: /* Ignore other PT_* */ break;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
That's all. From now on, all loadable segments are in the correct place. The last `handle_relocations` function adjusts addresses in the kernel image, and is called only if the `kASLR` was enabled during kernel configuration.
|
||||
That's all.
|
||||
|
||||
After the kernel is relocated, we return back from the `decompress_kernel` to [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S). The address of the kernel will be in the `rax` register and we jump to it:
|
||||
From this moment, all loadable segments are in the correct place.
|
||||
|
||||
The next step after the `parse_elf` function is the call of the `handle_relocations` function. Implementation of this function depends on the `CONFIG_X86_NEED_RELOCS` kernel configuration option and if it is enabled, this function adjusts addresses in the kernel image, and is called only if the `CONFIG_RANDOMIZE_BASE` configuration option was enabled during kernel configuration. Implementation of the `handle_relocations` function is easy enough. This function subtracts value of the `LOAD_PHYSICAL_ADDR` from the value of the base load address of the kernel and thus we obtain the difference between where the kernel was linked to load and where it was actually loaded. After this we can perform kernel relocation as we know actual address where the kernel was loaded, its address where it was linked to run and relocation table which is in the end of the kernel image.
|
||||
|
||||
After the kernel is relocated, we return back from the `extract_kernel` to [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S).
|
||||
|
||||
The address of the kernel will be in the `rax` register and we jump to it:
|
||||
|
||||
```assembly
|
||||
jmp *%rax
|
||||
@@ -539,9 +371,9 @@ That's all. Now we are in the kernel!
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the fifth and the last part about linux kernel booting process. We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals.
|
||||
This is the end of the fifth part about linux kernel booting process. We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals.
|
||||
|
||||
Next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.
|
||||
Next chapter will describe more advanced details about linux kernel booting process, like a load address randomization and etc.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me in [twitter](https://twitter.com/0xAX).
|
||||
|
||||
|
||||
417
Booting/linux-bootstrap-6.md
Normal file
417
Booting/linux-bootstrap-6.md
Normal file
@@ -0,0 +1,417 @@
|
||||
Kernel booting process. Part 6.
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the sixth part of the `Kernel booting process` series. In the [previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md) we have seen the end of the kernel boot process. But we have skipped some important advanced parts.
|
||||
|
||||
As you may remember the entry point of the Linux kernel is the `start_kernel` function from the [main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file started to execute at `LOAD_PHYSICAL_ADDR` address. This address depends on the `CONFIG_PHYSICAL_START` kernel configuration option which is `0x1000000` by default:
|
||||
|
||||
```
|
||||
config PHYSICAL_START
|
||||
hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
|
||||
default "0x1000000"
|
||||
---help---
|
||||
This gives the physical address where the kernel is loaded.
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
This value may be changed during kernel configuration, but also load address can be selected as a random value. For this purpose the `CONFIG_RANDOMIZE_BASE` kernel configuration option should be enabled during kernel configuration.
|
||||
|
||||
In this case a physical address at which Linux kernel image will be decompressed and loaded will be randomized. This part considers the case when this option is enabled and load address of the kernel image will be randomized for [security reasons](https://en.wikipedia.org/wiki/Address_space_layout_randomization).
|
||||
|
||||
Initialization of page tables
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Before the kernel decompressor will start to find random memory range where the kernel will be decompressed and loaded, the identity mapped page tables should be initialized. If a [bootloader](https://en.wikipedia.org/wiki/Booting) used [16-bit or 32-bit boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt), we already have page tables. But in any case, we may need new pages by demand if the kernel decompressor selects memory range outside of them. That's why we need to build new identity mapped page tables.
|
||||
|
||||
Yes, building of identity mapped page tables is the one of the first step during randomization of load address. But before we will consider it, let's try to remember where did we come from to this point.
|
||||
|
||||
In the [previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md), we saw transition to [long mode](https://en.wikipedia.org/wiki/Long_mode) and jump to the kernel decompressor entry point - `extract_kernel` function. The randomization stuff starts here from the call of the:
|
||||
|
||||
```C
|
||||
void choose_random_location(unsigned long input,
|
||||
unsigned long input_size,
|
||||
unsigned long *output,
|
||||
unsigned long output_size,
|
||||
unsigned long *virt_addr)
|
||||
{}
|
||||
```
|
||||
|
||||
function. As you may see, this function takes following five parameters:
|
||||
|
||||
* `input`;
|
||||
* `input_size`;
|
||||
* `output`;
|
||||
* `output_isze`;
|
||||
* `virt_addr`.
|
||||
|
||||
Let's try to understand what these parameters are. The first `input` parameter came from parameters of the `extract_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file:
|
||||
|
||||
```C
|
||||
asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
|
||||
unsigned char *input_data,
|
||||
unsigned long input_len,
|
||||
unsigned char *output,
|
||||
unsigned long output_len)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
choose_random_location((unsigned long)input_data, input_len,
|
||||
(unsigned long *)&output,
|
||||
max(output_len, kernel_total_size),
|
||||
&virt_addr);
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
This parameter is passed from assembler code:
|
||||
|
||||
```C
|
||||
leaq input_data(%rip), %rdx
|
||||
```
|
||||
|
||||
from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S). The `input_data` is generated by the little [mkpiggy](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/mkpiggy.c) program. If you have compiled linux kernel source code under your hands, you may find the generated file by this program which should be placed in the `linux/arch/x86/boot/compressed/piggy.S`. In my case this file looks:
|
||||
|
||||
```assembly
|
||||
.section ".rodata..compressed","a",@progbits
|
||||
.globl z_input_len
|
||||
z_input_len = 6988196
|
||||
.globl z_output_len
|
||||
z_output_len = 29207032
|
||||
.globl input_data, input_data_end
|
||||
input_data:
|
||||
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
|
||||
input_data_end:
|
||||
```
|
||||
|
||||
As you may see it contains four global symbols. The first two `z_input_len` and `z_output_len` which are sizes of compressed and uncompressed `vmlinux.bin.gz`. The third is our `input_data` and as you may see it points to linux kernel image in raw binary format (all debugging symbols, comments and relocation information are stripped). And the last `input_data_end` points to the end of the compressed linux image.
|
||||
|
||||
So, our first parameter of the `choose_random_location` function is the pointer to the compressed kernel image that is embedded into the `piggy.o` object file.
|
||||
|
||||
The second parameter of the `choose_random_location` function is the `z_input_len` that we have seen just now.
|
||||
|
||||
The third and fourth parameters of the `choose_random_location` function are address where to place decompressed kernel image and the length of decompressed kernel image respectively. The address where to put decompressed kernel came from [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) and it is address of the `startup_32` aligned to 2 megabytes boundary. The size of the decompressed kernel came from the same `piggy.S` and it is `z_output_len`.
|
||||
|
||||
The last parameter of the `choose_random_location` function is the virtual address of the kernel load address. As we may see, by default it coincides with the default physical load address:
|
||||
|
||||
```C
|
||||
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
|
||||
```
|
||||
|
||||
which depends on kernel configuration:
|
||||
|
||||
```C
|
||||
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
|
||||
+ (CONFIG_PHYSICAL_ALIGN - 1)) \
|
||||
& ~(CONFIG_PHYSICAL_ALIGN - 1))
|
||||
```
|
||||
|
||||
Now, as we considered parameters of the `choose_random_location` function, let's look at implementation of it. This function starts from the checking of `nokaslr` option in the kernel command line:
|
||||
|
||||
```C
|
||||
if (cmdline_find_option_bool("nokaslr")) {
|
||||
warn("KASLR disabled: 'nokaslr' on cmdline.");
|
||||
return;
|
||||
}
|
||||
```
|
||||
|
||||
and if the options was given we exit from the `choose_random_location` function ad kernel load address will not be randomized. Related command line options can be found in the [kernel documentation](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt):
|
||||
|
||||
```
|
||||
kaslr/nokaslr [X86]
|
||||
|
||||
Enable/disable kernel and module base offset ASLR
|
||||
(Address Space Layout Randomization) if built into
|
||||
the kernel. When CONFIG_HIBERNATION is selected,
|
||||
kASLR is disabled by default. When kASLR is enabled,
|
||||
hibernation will be disabled.
|
||||
```
|
||||
|
||||
Let's assume that we didn't pass `nokaslr` to the kernel command line and the `CONFIG_RANDOMIZE_BASE` kernel configuration option is enabled.
|
||||
|
||||
The next step is the call of the:
|
||||
|
||||
```C
|
||||
initialize_identity_maps();
|
||||
```
|
||||
|
||||
function which is defined in the [arch/x86/boot/compressed/pagetable.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/pagetable.c) source code file. This function starts from initialization of `mapping_info` an instance of the `x86_mapping_info` structure:
|
||||
|
||||
```C
|
||||
mapping_info.alloc_pgt_page = alloc_pgt_page;
|
||||
mapping_info.context = &pgt_data;
|
||||
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
|
||||
mapping_info.kernpg_flag = _KERNPG_TABLE | sev_me_mask;
|
||||
```
|
||||
|
||||
The `x86_mapping_info` structure is defined in the [arch/x86/include/asm/init.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/init.h) header file and looks:
|
||||
|
||||
```C
|
||||
struct x86_mapping_info {
|
||||
void *(*alloc_pgt_page)(void *);
|
||||
void *context;
|
||||
unsigned long page_flag;
|
||||
unsigned long offset;
|
||||
bool direct_gbpages;
|
||||
unsigned long kernpg_flag;
|
||||
};
|
||||
```
|
||||
|
||||
This structure provides information about memory mappings. As you may remember from the previous part, we already setup'ed initial page tables from 0 up to `4G`. For now we may need to access memory above `4G` to load kernel at random position. So, the `initialize_identity_maps` function executes initialization of a memory region for a possible needed new page table. First of all let's try to look at the definition of the `x86_mapping_info` structure.
|
||||
|
||||
The `alloc_pgt_page` is a callback function that will be called to allocate space for a page table entry. The `context` field is an instance of the `alloc_pgt_data` structure in our case which will be used to track allocated page tables. The `page_flag` and `kernpg_flag` fields are page flags. The first represents flags for `PMD` or `PUD` entries. The second `kernpg_flag` field represents flags for kernel pages which can be overridden later. The `direct_gbpages` field represents support for huge pages and the last `offset` field represents offset between kernel virtual addresses and physical addresses up to `PMD` level.
|
||||
|
||||
The `alloc_pgt_page` callback just validates that there is space for a new page, allocates new page:
|
||||
|
||||
```C
|
||||
entry = pages->pgt_buf + pages->pgt_buf_offset;
|
||||
pages->pgt_buf_offset += PAGE_SIZE;
|
||||
```
|
||||
|
||||
in the buffer from the:
|
||||
|
||||
```C
|
||||
struct alloc_pgt_data {
|
||||
unsigned char *pgt_buf;
|
||||
unsigned long pgt_buf_size;
|
||||
unsigned long pgt_buf_offset;
|
||||
};
|
||||
```
|
||||
|
||||
structure and returns address of a new page. The last goal of the `initialize_identity_maps` function is to initialize `pgdt_buf_size` and `pgt_buf_offset`. As we are only in initialization phase, the `initialze_identity_maps` function sets `pgt_buf_offset` to zero:
|
||||
|
||||
```C
|
||||
pgt_data.pgt_buf_offset = 0;
|
||||
```
|
||||
|
||||
and the `pgt_data.pgt_buf_size` will be set to `77824` or `69632` depends on which boot protocol will be used by bootloader (64-bit or 32-bit). The same is for `pgt_data.pgt_buf`. If a bootloader loaded the kernel at `startup_32`, the `pgdt_data.pgdt_buf` will point to the end of the page table which already was initialzed in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
|
||||
|
||||
```C
|
||||
pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
|
||||
```
|
||||
|
||||
where `_pgtable` points to the beginning of this page table [_pgtable](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S). In other way, if a bootloader have used 64-bit boot protocol and loaded the kernel at `startup_64`, early page tables should be built by bootloader itself and `_pgtable` will be just overwrote:
|
||||
|
||||
```C
|
||||
pgt_data.pgt_buf = _pgtable
|
||||
```
|
||||
|
||||
As the buffer for new page tables is initialized, we may return back to the `choose_random_location` function.
|
||||
|
||||
Avoid reserved memory ranges
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After the stuff related to identity page tables is initilized, we may start to choose random location where to put decompressed kernel image. But as you may guess, we can't choose any address. There are some reseved addresses in memory ranges. Such addresses occupied by important things, like [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), kernel command line and etc. The
|
||||
|
||||
```C
|
||||
mem_avoid_init(input, input_size, *output);
|
||||
```
|
||||
|
||||
function will help us to do this. All non-safe memory regions will be collected in the:
|
||||
|
||||
```C
|
||||
struct mem_vector {
|
||||
unsigned long long start;
|
||||
unsigned long long size;
|
||||
};
|
||||
|
||||
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
|
||||
```
|
||||
|
||||
array. Where `MEM_AVOID_MAX` is from `mem_avoid_index` [enum](https://en.wikipedia.org/wiki/Enumerated_type#C) which represents different types of reserved memory regions:
|
||||
|
||||
```C
|
||||
enum mem_avoid_index {
|
||||
MEM_AVOID_ZO_RANGE = 0,
|
||||
MEM_AVOID_INITRD,
|
||||
MEM_AVOID_CMDLINE,
|
||||
MEM_AVOID_BOOTPARAMS,
|
||||
MEM_AVOID_MEMMAP_BEGIN,
|
||||
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
|
||||
MEM_AVOID_MAX,
|
||||
};
|
||||
```
|
||||
|
||||
Both are defined in the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) source code file.
|
||||
|
||||
Let's look at the implementation of the `mem_avoid_init` function. The main goal of this function is to store information about reseved memory regions described by the `mem_avoid_index` enum in the `mem_avoid` array and create new pages for such regions in our new identity mapped buffer. Numerous parts fo the `mem_avoid_index` function are similar, but let's take a look at the one of them:
|
||||
|
||||
```C
|
||||
mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
|
||||
mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
|
||||
add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
|
||||
mem_avoid[MEM_AVOID_ZO_RANGE].size);
|
||||
```
|
||||
|
||||
At the beginning of the `mem_avoid_init` function tries to avoid memory region that is used for current kernel decompression. We fill an entry from the `mem_avoid` array with the start and size of such region and call the `add_identity_map` function which should build identity mapped pages for this region. The `add_identity_map` function is defined in the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) source code file and looks:
|
||||
|
||||
```C
|
||||
void add_identity_map(unsigned long start, unsigned long size)
|
||||
{
|
||||
unsigned long end = start + size;
|
||||
|
||||
start = round_down(start, PMD_SIZE);
|
||||
end = round_up(end, PMD_SIZE);
|
||||
if (start >= end)
|
||||
return;
|
||||
|
||||
kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
|
||||
start, end);
|
||||
}
|
||||
```
|
||||
|
||||
As you may see it aligns memory region to 2 megabytes boundary and checks given start and end addresses.
|
||||
|
||||
In the end it just calls the `kernel_ident_mapping_init` function from the [arch/x86/mm/ident_map.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ident_map.c) source code file and pass `mapping_info` instance that was initilized above, address of the top level page table and addresses of memory region for which new identity mapping should be built.
|
||||
|
||||
The `kernel_ident_mapping_init` function sets default flags for new pages if they were not given:
|
||||
|
||||
```C
|
||||
if (!info->kernpg_flag)
|
||||
info->kernpg_flag = _KERNPG_TABLE;
|
||||
```
|
||||
|
||||
and starts to build new 2-megabytes (because of `PSE` bit in the `mapping_info.page_flag`) page entries (`PGD -> P4D -> PUD -> PMD` in a case of [five-level page tables](https://lwn.net/Articles/717293/) or `PGD -> PUD -> PMD` in a case of [four-level page tables](https://lwn.net/Articles/117749/)) related to the given addresses.
|
||||
|
||||
```C
|
||||
for (; addr < end; addr = next) {
|
||||
p4d_t *p4d;
|
||||
|
||||
next = (addr & PGDIR_MASK) + PGDIR_SIZE;
|
||||
if (next > end)
|
||||
next = end;
|
||||
|
||||
p4d = (p4d_t *)info->alloc_pgt_page(info->context);
|
||||
result = ident_p4d_init(info, p4d, addr, next);
|
||||
|
||||
return result;
|
||||
}
|
||||
```
|
||||
|
||||
First of all here we find next entry of the `Page Global Directory` for the given address and if it is greater than `end` of the given memory region, we set it to `end`. After this we allocater a new page with our `x86_mapping_info` callback that we already considered above and call the `ident_p4d_init` function. The `ident_p4d_init` function will do the same, but for low-level page directories (`p4d` -> `pud` -> `pmd`).
|
||||
|
||||
That's all.
|
||||
|
||||
New page entries related to reserved addresses are in our page tables. This is not the end of the `mem_avoid_init` function, but other parts are similar. It just build pages for [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), kernel command line and etc.
|
||||
|
||||
Now we may return back to `choose_random_location` function.
|
||||
|
||||
Physical address randomization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After the reserved memory regions were stored in the `mem_avoid` array and identity mapping pages were built for them, we select minimal available address to choose random memory region to decompress the kernel:
|
||||
|
||||
```C
|
||||
min_addr = min(*output, 512UL << 20);
|
||||
```
|
||||
|
||||
As you may see it should be smaller than `512` megabytes. This `512` megabytes value was selected just to avoid unknown things in lower memory.
|
||||
|
||||
The next step is to select random physical and virtual addresses to load kernel. The first is physical addresses:
|
||||
|
||||
```C
|
||||
random_addr = find_random_phys_addr(min_addr, output_size);
|
||||
```
|
||||
|
||||
The `find_random_phys_addr` function is defined in the [same](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) source code file:
|
||||
|
||||
```
|
||||
static unsigned long find_random_phys_addr(unsigned long minimum,
|
||||
unsigned long image_size)
|
||||
{
|
||||
minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
|
||||
|
||||
if (process_efi_entries(minimum, image_size))
|
||||
return slots_fetch_random();
|
||||
|
||||
process_e820_entries(minimum, image_size);
|
||||
return slots_fetch_random();
|
||||
}
|
||||
```
|
||||
|
||||
The main goal of `process_efi_entries` function is to find all suitable memory ranges in full accessible memory to load kernel. If the kernel compiled and runned on the system without [EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) support, we continue to search such memory regions in the [e820](https://en.wikipedia.org/wiki/E820) regions. All founded memory regions will be stored in the
|
||||
|
||||
```C
|
||||
struct slot_area {
|
||||
unsigned long addr;
|
||||
int num;
|
||||
};
|
||||
|
||||
#define MAX_SLOT_AREA 100
|
||||
|
||||
static struct slot_area slot_areas[MAX_SLOT_AREA];
|
||||
```
|
||||
|
||||
array. The kernel decompressor should select random index of this array and it will be random place where kernel will be decompressed. This selection will be executed by the `slots_fetch_random` function. The main goal of the `slots_fetch_random` function is to select random memory range from the `slot_areas` array via `kaslr_get_random_long` function:
|
||||
|
||||
```C
|
||||
slot = kaslr_get_random_long("Physical") % slot_max;
|
||||
```
|
||||
|
||||
The `kaslr_get_random_long` function is defined in the [arch/x86/lib/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/lib/kaslr.c) source code file and it just returns random number. Note that the random number will be get via different ways depends on kernel configuration and system opportunities (select random number base on [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter), [rdrand](https://en.wikipedia.org/wiki/RdRand) and so on).
|
||||
|
||||
That's all from this point random memory range will be selected.
|
||||
|
||||
Virtual address randomization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After random memory region was selected by the kernel decompressor, new identity mapped pages will be built for this region by demand:
|
||||
|
||||
```C
|
||||
random_addr = find_random_phys_addr(min_addr, output_size);
|
||||
|
||||
if (*output != random_addr) {
|
||||
add_identity_map(random_addr, output_size);
|
||||
*output = random_addr;
|
||||
}
|
||||
```
|
||||
|
||||
From this time `output` will store the base address of a memory region where kernel will be decompressed. But for this moment, as you may remember we randomized only physical address. Virtual address should be randomized too in a case of [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:
|
||||
|
||||
```C
|
||||
if (IS_ENABLED(CONFIG_X86_64))
|
||||
random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
|
||||
|
||||
*virt_addr = random_addr;
|
||||
```
|
||||
|
||||
As you may see in a case of non `x86_64` architecture, randomzed virtual address will coincide with randomized physical address. The `find_random_virt_addr` function calculates amount of virtual memory ranges that may hold kernel image and calls the `kaslr_get_random_long` that we already saw in a previous case when we tried to find random `physical` address.
|
||||
|
||||
From this moment we have both randomized base physical (`*output`) and virtual (`*virt_addr`) addresses for decompressed kernel.
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the sixth and the last part about linux kernel booting process. We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals.
|
||||
|
||||
Next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me in [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [Address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
|
||||
* [Linux kernel boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt)
|
||||
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
|
||||
* [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk)
|
||||
* [Enumerated type](https://en.wikipedia.org/wiki/Enumerated_type#C)
|
||||
* [four-level page tables](https://lwn.net/Articles/117749/)
|
||||
* [five-level page tables](https://lwn.net/Articles/717293/)
|
||||
* [EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)
|
||||
* [e820](https://en.wikipedia.org/wiki/E820)
|
||||
* [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [rdrand](https://en.wikipedia.org/wiki/RdRand)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md)
|
||||
@@ -27,7 +27,8 @@
|
||||
|├ [1.2](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-2.md)|[@hailincai](https://github.com/hailincai)|已完成|
|
||||
|├ [1.3](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-3.md)|[@hailincai](https://github.com/hailincai)|已完成|
|
||||
|├ [1.4](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-4.md)|[@zmj1316](https://github.com/zmj1316)|已完成|
|
||||
|└ [1.5](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-5.md)||未开始|
|
||||
|├ [1.5](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-5.md)||未开始|
|
||||
|└ [1.6](https://github.com/MintCN/linux-insides-zh/blob/master/Booting/linux-bootstrap-6.md)||未开始|
|
||||
| 2. [Initialization](https://github.com/MintCN/linux-insides-zh/tree/master/Initialization)||正在进行|
|
||||
|├ [2.0](https://github.com/MintCN/linux-insides-zh/blob/master/Initialization/README.md)|[@mudongliang](https://github.com/mudongliang)|更新至[44017507](https://github.com/0xAX/linux-insides/commit/4401750766f7150dcd16f579026f5554541a6ab9)|
|
||||
|├ [2.1](https://github.com/MintCN/linux-insides-zh/blob/master/Initialization/linux-initialization-1.md)|[@dontpanic92](https://github.com/dontpanic92)|更新至[44017507](https://github.com/0xAX/linux-insides/commit/4401750766f7150dcd16f579026f5554541a6ab9)|
|
||||
|
||||
Reference in New Issue
Block a user