From 2ef4f5443cb2b339955d561808444329c9559639 Mon Sep 17 00:00:00 2001 From: Dongliang Mu Date: Sat, 11 Jun 2016 21:44:40 -0400 Subject: [PATCH] Add Initialization in English version --- Initialization/linux-initialization-1.md | 619 ++++++++++++++++++++++ Initialization/linux-initialization-10.md | 473 +++++++++++++++++ Initialization/linux-initialization-2.md | 495 +++++++++++++++++ Initialization/linux-initialization-3.md | 430 +++++++++++++++ Initialization/linux-initialization-4.md | 452 ++++++++++++++++ Initialization/linux-initialization-5.md | 512 ++++++++++++++++++ Initialization/linux-initialization-6.md | 549 +++++++++++++++++++ Initialization/linux-initialization-7.md | 482 +++++++++++++++++ Initialization/linux-initialization-8.md | 479 +++++++++++++++++ Initialization/linux-initialization-9.md | 430 +++++++++++++++ 10 files changed, 4921 insertions(+) create mode 100644 Initialization/linux-initialization-1.md create mode 100644 Initialization/linux-initialization-10.md create mode 100644 Initialization/linux-initialization-2.md create mode 100644 Initialization/linux-initialization-3.md create mode 100644 Initialization/linux-initialization-4.md create mode 100644 Initialization/linux-initialization-5.md create mode 100644 Initialization/linux-initialization-6.md create mode 100644 Initialization/linux-initialization-7.md create mode 100644 Initialization/linux-initialization-8.md create mode 100644 Initialization/linux-initialization-9.md diff --git a/Initialization/linux-initialization-1.md b/Initialization/linux-initialization-1.md new file mode 100644 index 0000000..49e4eea --- /dev/null +++ b/Initialization/linux-initialization-1.md @@ -0,0 +1,619 @@ +Kernel initialization. Part 1. 
+================================================================================ + +First steps in the kernel code +-------------------------------------------------------------------------------- + +The previous [post](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) was the last part of the Linux kernel [booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, and now we start to dive into the initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in the correct place in memory, it starts to work. All previous parts described the work of the Linux kernel setup code, which does the preparation before the first bytes of the Linux kernel code are executed. From now on we are in the kernel, and all parts of this chapter are devoted to the initialization process of the kernel before it launches the process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel starts the first `init` process. Hopefully we will see all of these preparations in this big chapter. We will start from the kernel entry point, which is located in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S), and move further and further. We will see the first preparations, like early page table initialization, the switch to a new descriptor in kernel space and many more, before the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) is called. 
+ +In the last [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) we stopped at the [jmp](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file: + +```assembly +jmp *%rax +``` + +At this moment the `rax` register contains the address of the Linux kernel entry point, which was obtained as a result of the call of the `decompress_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file. So, the last instruction in the kernel setup code is a jump to the kernel entry point. We already know where the entry point of the Linux kernel is defined, so we are able to start learning what the Linux kernel does after the start. + +First steps in the kernel +-------------------------------------------------------------------------------- + +Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into the `rax` register and just jumped there. As we already know, the entry point of the decompressed kernel image is in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly source code file, and at the beginning of it we can see the following definitions: + +```assembly + __HEAD + .code64 + .globl startup_64 +startup_64: + ... + ... + ... 
+``` + +We can see the definition of the `startup_64` routine, which is placed in the `__HEAD` section. `__HEAD` is just a macro which expands to the definition of the executable `.head.text` section: + +```C +#define __HEAD .section ".head.text","ax" +``` + +We can see the definition of this section in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S#L93) linker script: + +``` +.text : AT(ADDR(.text) - LOAD_OFFSET) { + _text = .; + ... + ... + ... +} :text = 0x9090 +``` + +Besides the definition of the `.text` section, we can learn the default virtual and physical addresses from the linker script. Note that the address of `_text` is the location counter, which is defined as: + +``` +. = __START_KERNEL; +``` + +for [x86_64](https://en.wikipedia.org/wiki/X86-64). The definition of the `__START_KERNEL` macro is located in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h) header file and is the sum of the base virtual address of the kernel mapping and the physical start: + +```C +#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) + +#define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN) +``` + +Or in other words: + +* Base physical address of the Linux kernel - `0x1000000`; +* Base virtual address of the Linux kernel - `0xffffffff81000000`. + +Now we know the default physical and virtual addresses of the `startup_64` routine, but to know the actual addresses we must calculate them with the following code: + +```assembly + leaq _text(%rip), %rbp + subq $_text - __START_KERNEL_map, %rbp +``` + +Yes, the physical address is defined as `0x1000000`, but it may be different, for example if [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) is enabled. So our current goal is to calculate the delta between `0x1000000` and the address where we are actually loaded. 
Here we just put the `rip-relative` address of `_text` into the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that the compiled virtual address of `_text` is `0xffffffff81000000` and its physical address is `0x1000000`. The `__START_KERNEL_map` macro expands to the `0xffffffff80000000` address, so at the second line of the assembly code we get the following expression: + +``` +rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000) +``` + +So, after the calculation, `rbp` will contain `0`, which represents the difference between the address where we are actually loaded and the address the code was compiled for. In our case `zero` means that the Linux kernel was loaded at the default address and [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) was disabled. + +After we got the address of `startup_64`, we need to check that this address is correctly aligned. We do it with the following code: + +```assembly + testl $~PMD_PAGE_MASK, %ebp + jnz bad_address +``` + +Here we test the low part of the `rbp` register against the complemented value of `PMD_PAGE_MASK`. The `PMD_PAGE_MASK` is the mask for the `Page middle directory` (read [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) about it) and is defined as: + +```C +#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1)) + +#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT) +#define PMD_SHIFT 21 +``` + +As we can easily calculate, `PMD_PAGE_SIZE` is `2` megabytes. Here we use the standard formula for checking alignment: if the address is not aligned to `2` megabytes, we jump to the `bad_address` label. + +After this we check that the address is not too large by checking its highest `18` bits: + +```assembly + leaq _text(%rip), %rax + shrq $MAX_PHYSMEM_BITS, %rax + jnz bad_address +``` + +The address must fit in `46` bits: + +```C +#define MAX_PHYSMEM_BITS 46 +``` + +Okay, we did some early checks and now we can move on. 
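The three early checks above are plain integer arithmetic, so they can be tried outside the kernel. The following is a minimal sketch in C, not kernel code: the helper names `load_delta`, `delta_is_2mb_aligned` and `fits_in_physmem` are hypothetical, and the constants are the default values quoted in the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default values quoted in the text above. */
#define START_KERNEL_MAP 0xffffffff80000000ULL /* __START_KERNEL_map */
#define TEXT_LINK_VADDR  0xffffffff81000000ULL /* link-time virtual address of _text */
#define PMD_SHIFT        21
#define PMD_PAGE_SIZE    (1ULL << PMD_SHIFT)
#define PMD_PAGE_MASK    (~(PMD_PAGE_SIZE - 1))
#define MAX_PHYSMEM_BITS 46

/* Hypothetical helper mirroring the leaq/subq pair:
 * actual address of _text minus its default physical address. */
static inline uint64_t load_delta(uint64_t text_runtime_addr)
{
    return text_runtime_addr - (TEXT_LINK_VADDR - START_KERNEL_MAP);
}

/* Mirrors `testl $~PMD_PAGE_MASK, %ebp`: only the low 32 bits are tested. */
static inline bool delta_is_2mb_aligned(uint64_t delta)
{
    return ((uint32_t)delta & (uint32_t)~PMD_PAGE_MASK) == 0;
}

/* Mirrors `shrq $MAX_PHYSMEM_BITS, %rax`: any bit at position 46 or above is an error. */
static inline bool fits_in_physmem(uint64_t text_runtime_addr)
{
    return (text_runtime_addr >> MAX_PHYSMEM_BITS) == 0;
}
```

With the default load address `0x1000000`, `load_delta` returns `0`; a kernel loaded at `0x2000000` would give a delta of `0x1000000`, which is still 2 megabyte aligned and passes both checks.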
+ +Fix base addresses of page tables +-------------------------------------------------------------------------------- + +The first step before we start to set up identity paging is to fix up the following addresses: + +```assembly + addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) + addq %rbp, level3_kernel_pgt + (510*8)(%rip) + addq %rbp, level3_kernel_pgt + (511*8)(%rip) + addq %rbp, level2_fixmap_pgt + (506*8)(%rip) +``` + +The addresses stored in `early_level4_pgt`, `level3_kernel_pgt` and the other tables may be wrong if `startup_64` is not loaded at the default `0x1000000` address. The `rbp` register contains the delta, so we add it to certain entries of the `early_level4_pgt`, the `level3_kernel_pgt` and the `level2_fixmap_pgt`. Let's try to understand what these labels mean. First of all let's look at their definition: + +```assembly +NEXT_PAGE(early_level4_pgt) + .fill 511,8,0 + .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE + +NEXT_PAGE(level3_kernel_pgt) + .fill L3_START_KERNEL,8,0 + .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE + .quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE + +NEXT_PAGE(level2_kernel_pgt) + PMDS(0, __PAGE_KERNEL_LARGE_EXEC, + KERNEL_IMAGE_SIZE/PMD_SIZE) + +NEXT_PAGE(level2_fixmap_pgt) + .fill 506,8,0 + .quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE + .fill 5,8,0 + +NEXT_PAGE(level1_fixmap_pgt) + .fill 512,8,0 +``` + +Looks hard, but it isn't. First of all let's look at the `early_level4_pgt`. It starts with (4096 - 8) bytes of zeros, which means that we don't use the first `511` entries. After this we can see one `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map` from its address and add `_PAGE_TABLE`. As we know, `__START_KERNEL_map` is the base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we get the physical address of the `level3_kernel_pgt`. 
Now let's look at `_PAGE_TABLE`, it is just the page entry access rights: + +```C +#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ + _PAGE_ACCESSED | _PAGE_DIRTY) +``` + +You can read more about it in the [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) part. + +The `level3_kernel_pgt` stores two entries which map kernel space. At the start of its definition, we can see that it is filled with zeros `L3_START_KERNEL` or `510` times. Here `L3_START_KERNEL` is the index in the page upper directory which contains the `__START_KERNEL_map` address, and it equals `510`. After this, we can see the definition of the two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. The first is simple: it is a page table entry which contains a pointer to the page middle directory which maps kernel space, and it has: + +```C +#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \ + _PAGE_DIRTY) +``` + +access rights. The second, `level2_fixmap_pgt`, covers virtual addresses which can refer to any physical addresses, even under kernel space. They are represented by the one `level2_fixmap_pgt` entry and a `10` megabytes hole for the [vsyscalls](https://lwn.net/Articles/446528/) mapping. Next, `level2_kernel_pgt` calls the `PMDS` macro, which maps `512` megabytes from `__START_KERNEL_map` for the kernel `.text` (after these `512` megabytes comes the modules memory space). + +Now, after we saw the definitions of these symbols, let's get back to the code which is described at the beginning of the section. Remember that the `rbp` register contains the delta between the address of the `startup_64` symbol which was obtained during kernel [linking](https://en.wikipedia.org/wiki/Linker_%28computing%29) and the actual address. So, at this moment, we just need to add this delta to the base address of some page table entries, so that they have correct addresses. 
In our case these entries are: + +```assembly + addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) + addq %rbp, level3_kernel_pgt + (510*8)(%rip) + addq %rbp, level3_kernel_pgt + (511*8)(%rip) + addq %rbp, level2_fixmap_pgt + (506*8)(%rip) +``` + +or the last entry of the `early_level4_pgt` which is the `level3_kernel_pgt`, the last two entries of the `level3_kernel_pgt` which are the `level2_kernel_pgt` and the `level2_fixmap_pgt`, and the five hundred and seventh entry (index `506`) of the `level2_fixmap_pgt` which is the `level1_fixmap_pgt` page directory. + +After all of this we will have: + +``` +early_level4_pgt[511] -> level3_kernel_pgt[0] +level3_kernel_pgt[510] -> level2_kernel_pgt[0] +level3_kernel_pgt[511] -> level2_fixmap_pgt[0] +level2_kernel_pgt[0] -> 512 MB kernel mapping +level2_fixmap_pgt[506] -> level1_fixmap_pgt +``` + +Note that we didn't fix up the base address of the `early_level4_pgt` and some of the other page table directories, because we will see this while building and filling the structures for these page tables. As we corrected the base addresses of the page tables, we can start to build them. + +Identity mapping setup +-------------------------------------------------------------------------------- + +Now we can see the setup of the identity mapping of the early page tables. In identity mapped paging, virtual addresses are mapped to physical addresses that have the same value, `1 : 1`. Let's look at it in detail. 
First of all we get the `rip-relative` addresses of `_text` and `early_level4_pgt` and put them into the `rdi` and `rbx` registers: + +```assembly + leaq _text(%rip), %rdi + leaq early_level4_pgt(%rip), %rbx +``` + +After this we store the address of `_text` in `rax` and get the index of the page global directory entry which stores the `_text` address, by shifting the `_text` address right by `PGDIR_SHIFT`: + +```assembly + movq %rdi, %rax + shrq $PGDIR_SHIFT, %rax + + leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx + movq %rdx, 0(%rbx,%rax,8) + movq %rdx, 8(%rbx,%rax,8) +``` + +where `PGDIR_SHIFT` is `39`. `PGDIR_SHIFT` is the position of the page global directory bits in a virtual address. There are macros for all types of page directories: + +```C +#define PGDIR_SHIFT 39 +#define PUD_SHIFT 30 +#define PMD_SHIFT 21 +``` + +After this we put the address of the first `level3_kernel_pgt` in `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `level3_kernel_pgt` entries. + +After this we add `4096` (the size of the `early_level4_pgt`) to `rdx` (it now contains the address of the first entry of the `level3_kernel_pgt`) and put `rdi` (it now contains the physical address of `_text`) into `rax`. 
And after this we write the addresses of the two page upper directory entries to the `level3_kernel_pgt`: + +```assembly + addq $4096, %rdx + movq %rdi, %rax + shrq $PUD_SHIFT, %rax + andl $(PTRS_PER_PUD-1), %eax + movq %rdx, 4096(%rbx,%rax,8) + incl %eax + andl $(PTRS_PER_PUD-1), %eax + movq %rdx, 4096(%rbx,%rax,8) +``` + +In the next step we write the addresses of the page middle directory entries to the `level2_kernel_pgt`, and the last step is correcting the kernel text+data virtual addresses: + +```assembly + leaq level2_kernel_pgt(%rip), %rdi + leaq 4096(%rdi), %r8 +1: testq $1, 0(%rdi) + jz 2f + addq %rbp, 0(%rdi) +2: addq $8, %rdi + cmp %r8, %rdi + jne 1b +``` + +Here we put the address of the `level2_kernel_pgt` into `rdi` and the address of the end of this table (`4096` bytes further) into the `r8` register. Next we check the present bit of the current `level2_kernel_pgt` entry. If it is set, we add the delta from `rbp` to the entry; if it is zero, we skip the entry. In both cases we move to the next entry by adding 8 bytes to `rdi`. After this we compare `rdi` with `r8` (the end of the table) and either go back to label `1` or move forward. + +In the next step we correct the `phys_base` physical address with `rbp` (which contains the load delta calculated earlier), put the link-time physical address of the `early_level4_pgt` into `rax` and jump to label `1`: + +```assembly + addq %rbp, phys_base(%rip) + movq $(early_level4_pgt - __START_KERNEL_map), %rax + jmp 1f +``` + +where `phys_base` matches the first entry of the `level2_kernel_pgt` which is the `512` MB kernel mapping. 
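The table indices used throughout this section (`511`, `510`, and the result of the `shrq`/`andl` pairs) all come from the same shift-and-mask arithmetic. The following is a small sketch in C, under the assumption of 4-level paging with 512 entries per table; `table_index` and `PTRS_PER_TABLE` are hypothetical names, only the shift values are taken from the text.

```c
#include <stdint.h>

/* Shift values quoted in the text above. */
#define PGDIR_SHIFT    39
#define PUD_SHIFT      30
#define PMD_SHIFT      21
/* Assumption: every table is 4096 bytes and holds 512 entries of 8 bytes. */
#define PTRS_PER_TABLE 512

/* Hypothetical helper: index of the entry which covers `addr`
 * in the table selected by `shift` (shift right, then keep 9 bits). */
static inline unsigned int table_index(uint64_t addr, unsigned int shift)
{
    return (unsigned int)((addr >> shift) & (PTRS_PER_TABLE - 1));
}
```

For `__START_KERNEL_map` (`0xffffffff80000000`) this gives index `511` in the page global directory and `510` in the page upper directory, which matches `L4_START_KERNEL` and `L3_START_KERNEL`. For the default physical address `0x1000000` the page middle directory index is `8`, because 16 megabytes divided by 2 megabyte pages is 8.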
+ +Last preparation before the jump to the kernel entry point +-------------------------------------------------------------------------------- + +After we jump to the label `1`, we enable `PAE` and `PGE` (Paging Global Extension), add the value of `phys_base` (see above) to the `rax` register and fill the `cr3` register with the result: + +```assembly +1: + movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx + movq %rcx, %cr4 + + addq phys_base(%rip), %rax + movq %rax, %cr3 +``` + +In the next step we check whether the CPU supports the [NX](http://en.wikipedia.org/wiki/NX_bit) bit with: + +```assembly + movl $0x80000001, %eax + cpuid + movl %edx,%edi +``` + +We put the `0x80000001` value into `eax` and execute the `cpuid` instruction to get the extended processor info and feature bits. The result will be in the `edx` register, which we copy to `edi`. + +Now we put `0xc0000080` or `MSR_EFER` into `ecx` and execute the `rdmsr` instruction to read the model specific register. + +```assembly + movl $MSR_EFER, %ecx + rdmsr +``` + +The result will be in `edx:eax`. The general view of `EFER` is the following: + +``` +63 32 + -------------------------------------------------------------------------------- +| | +| Reserved MBZ | +| | + -------------------------------------------------------------------------------- +31 16 15 14 13 12 11 10 9 8 7 1 0 + -------------------------------------------------------------------------------- +| | T | | | | | | | | | | +| Reserved MBZ | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE| +| | E | | | | | | | | | | + -------------------------------------------------------------------------------- +``` + +We will not look at all fields in detail here, but we will learn about this and other `MSRs` in a special part about them. As we read `EFER` into `edx:eax`, we test `_EFER_SCE`, the zero bit, which is `System Call Extensions`, with the `btsl` instruction and set it to one. By setting the `SCE` bit we enable the `SYSCALL` and `SYSRET` instructions. 
In the next step we check the 20th bit in `edi`; remember that this register stores the result of `cpuid` (see above). If bit `20` (the `NX` bit) is not set, we just write `EFER` with the `SCE` bit set back to the model specific register. + +```assembly + btsl $_EFER_SCE, %eax + btl $20,%edi + jnc 1f + btsl $_EFER_NX, %eax + btsq $_PAGE_BIT_NX,early_pmd_flags(%rip) +1: wrmsr +``` + +If the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is supported, we also set `_EFER_NX` before the value is written with the `wrmsr` instruction. After the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is set, we set some bits in the `cr0` [control register](https://en.wikipedia.org/wiki/Control_register), namely: + +* `X86_CR0_PE` - system is in protected mode; +* `X86_CR0_MP` - controls interaction of WAIT/FWAIT instructions with TS flag in CR0; +* `X86_CR0_ET` - on the 386, it allowed to specify whether the external math coprocessor was an 80287 or 80387; +* `X86_CR0_NE` - enable internal x87 floating point error reporting when set, else enables PC style x87 error detection; +* `X86_CR0_WP` - when set, the CPU can't write to read-only pages when privilege level is 0; +* `X86_CR0_AM` - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3; +* `X86_CR0_PG` - enable paging. + +by executing the following assembly code: + +```assembly +#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \ + X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \ + X86_CR0_PG) +movl $CR0_STATE, %eax +movq %rax, %cr0 +``` + +We already know that to run any code, and especially [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code, from assembly, we need to set up a stack. 
As always, we do it by setting the [stack pointer](https://en.wikipedia.org/wiki/Stack_register) to a correct place in memory and resetting the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register after this: + +```assembly +movq stack_start(%rip), %rsp +pushq $0 +popfq +``` + +The most interesting thing here is the `stack_start`. It is defined in the same [source](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) code file and looks like: + +```assembly +GLOBAL(stack_start) +.quad init_thread_union+THREAD_SIZE-8 +``` + +The `GLOBAL` macro is already familiar to us. It is defined in the [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) header file and expands to the `global` symbol definition: + +```C +#define GLOBAL(name) \ + .globl name; \ + name: +``` + +The `THREAD_SIZE` macro is defined in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h) header file and depends on the value of the `KASAN_STACK_ORDER` macro: + +```C +#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) +#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) +``` + +We consider the case when [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) is disabled and `PAGE_SIZE` is `4096` bytes. So `THREAD_SIZE` expands to `16` kilobytes and represents the size of the stack of a thread. Why `thread`? You may already know that each [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may have a parent [process](https://en.wikipedia.org/wiki/Parent_process) and [child](https://en.wikipedia.org/wiki/Child_process) processes. A parent process and a child process have different stacks: a new kernel stack is allocated for each new process. In the Linux kernel this stack is represented by a [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with the `thread_info` structure. 
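The `16` kilobytes figure and the initial value of the stack pointer follow directly from the macros above. Here is a minimal sketch of that arithmetic in C, assuming `KASAN_STACK_ORDER` is `0` (kasan disabled) and `PAGE_SIZE` is `4096`; the helper `initial_sp_offset` is hypothetical and only mirrors the `init_thread_union+THREAD_SIZE-8` expression.

```c
/* Assumptions: kasan is disabled and pages are 4096 bytes, as in the text. */
#define PAGE_SIZE         4096UL
#define KASAN_STACK_ORDER 0
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE       (PAGE_SIZE << THREAD_SIZE_ORDER)

/* Hypothetical helper: offset of the initial stack pointer from the start
 * of init_thread_union, i.e. THREAD_SIZE minus the 8 bytes seen in stack_start. */
static inline unsigned long initial_sp_offset(void)
{
    return THREAD_SIZE - 8;
}
```

So `THREAD_SIZE` is `4096 << 2 = 16384` bytes, and the stack pointer starts `16376` bytes above the beginning of `init_thread_union`.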
+ +And as we can see, the `init_thread_union` is represented by the `thread_union`, which is defined as: + +```C +union thread_union { + struct thread_info thread_info; + unsigned long stack[THREAD_SIZE/sizeof(long)]; +}; +``` + +and `init_thread_union` looks like: + +```C +union thread_union init_thread_union __init_task_data = + { INIT_THREAD_INFO(init_task) }; +``` + +Where the `INIT_THREAD_INFO` macro takes a `task_struct` structure, which represents a process descriptor in the Linux kernel, and does some basic initialization of the `thread_info` structure that refers to it: + +```C +#define INIT_THREAD_INFO(tsk) \ +{ \ + .task = &tsk, \ + .flags = 0, \ + .cpu = 0, \ + .addr_limit = KERNEL_DS, \ +} +``` + +So, the `thread_union` contains low-level information about a process together with the process's stack, and the `thread_info` is placed at the bottom of the stack: + +``` ++-----------------------+ +| | +| | +| | +| Kernel stack | +| | +| | +| | +|-----------------------| +| | +| struct thread_info | +| | ++-----------------------+ +``` + +Note that we reserve `8` bytes at the top of the stack. This is necessary to guard against illegal access to the next memory page. + +After the early boot stack is set, we update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with the `lgdt` instruction: + +```assembly +lgdt early_gdt_descr(%rip) +``` + +where the `early_gdt_descr` is defined as: + +```assembly +early_gdt_descr: + .word GDT_ENTRIES*8-1 +early_gdt_descr_base: + .quad INIT_PER_CPU_VAR(gdt_page) +``` + +We need to reload the `Global Descriptor Table` because now the kernel works in the low userspace addresses, but soon the kernel will work in its own space. Now let's look at the definition of `early_gdt_descr`. The Global Descriptor Table contains `32` entries: + +```C +#define GDT_ENTRIES 32 +``` + +for kernel code, data, thread local storage segments and so on. That part is simple. Now let's look at the `early_gdt_descr_base`. 
First of all, `gdt_page` is defined as: + +```C +struct gdt_page { + struct desc_struct gdt[GDT_ENTRIES]; +} __attribute__((aligned(PAGE_SIZE))); +``` + +in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h). It contains one field `gdt`, which is an array of `desc_struct` structures, defined as: + +```C +struct desc_struct { + union { + struct { + unsigned int a; + unsigned int b; + }; + struct { + u16 limit0; + u16 base0; + unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1; + unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8; + }; + }; + } __attribute__((packed)); +``` + +and represents the `GDT` descriptor already familiar to us. Also we can note that the `gdt_page` structure is aligned to `PAGE_SIZE`, which is `4096` bytes. It means that `gdt` will occupy one page. Now let's try to understand what `INIT_PER_CPU_VAR` is. `INIT_PER_CPU_VAR` is a macro which is defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) and just concatenates `init_per_cpu__` with the given parameter: + +```C +#define INIT_PER_CPU_VAR(var) init_per_cpu__##var +``` + +After the `INIT_PER_CPU_VAR` macro is expanded, we will have `init_per_cpu__gdt_page`. We can see in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S): + +``` +#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load +INIT_PER_CPU(gdt_page); +``` + +As we got `init_per_cpu__gdt_page` from `INIT_PER_CPU_VAR`, and the `INIT_PER_CPU` macro from the linker script is expanded, we get the offset from `__per_cpu_load`. After these calculations, we will have the correct base address of the new GDT. + +Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a `per-CPU` variable, each CPU will have its own copy of this variable. Here we are creating the `gdt_page` per-CPU variable. 
There are many advantages to variables of this type, for example there are no locks, because each CPU works with its own copy of the variable. So every core on a multiprocessor will have its own `GDT` table, and every entry in the table will represent a memory segment which can be accessed from the thread which runs on the core. You can read in detail about `per-CPU` variables in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post. + +As we loaded the new Global Descriptor Table, we reload the segments as we do every time: + +```assembly + xorl %eax,%eax + movl %eax,%ds + movl %eax,%ss + movl %eax,%es + movl %eax,%fs + movl %eax,%gs +``` + +After all of these steps we set up the `gs` register so that it points to the `irqstack`, which represents a special stack where [interrupts](https://en.wikipedia.org/wiki/Interrupt) will be handled: + +```assembly + movl $MSR_GS_BASE,%ecx + movl initial_gs(%rip),%eax + movl initial_gs+4(%rip),%edx + wrmsr +``` + +where `MSR_GS_BASE` is: + +```C +#define MSR_GS_BASE 0xc0000101 +``` + +We need to put `MSR_GS_BASE` into the `ecx` register and write the data from `eax` and `edx` (which hold the low and high halves of `initial_gs`) with the `wrmsr` instruction. We don't use the `cs`, `ds`, `es` and `ss` segment registers for addressing in 64-bit mode, but the `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw in real mode for `cs`) and this part contains a descriptor which is mapped to [Model Specific Registers](https://en.wikipedia.org/wiki/Model-specific_register). So we can see above that `0xc0000101` is the `gs.base` MSR address. When a [system call](https://en.wikipedia.org/wiki/System_call) or [interrupt](https://en.wikipedia.org/wiki/Interrupt) occurs, there is no kernel stack at the entry point, so the value of `MSR_GS_BASE` will store the address of the interrupt stack. 
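The `wrmsr` sequence above loads the 64-bit `initial_gs` value as two 32-bit halves: the low half from `initial_gs` into `eax` and the high half from `initial_gs+4` into `edx`. The split can be sketched in C as follows; the `msr_value` structure and the `msr_split`/`msr_join` helpers are hypothetical names used only for illustration, and no MSR is actually written.

```c
#include <stdint.h>

#define MSR_GS_BASE 0xc0000101u /* goes into ecx, as in the text above */

/* Hypothetical pair of register values as wrmsr expects them. */
struct msr_value {
    uint32_t eax; /* low 32 bits, loaded from initial_gs */
    uint32_t edx; /* high 32 bits, loaded from initial_gs+4 */
};

/* Split a 64-bit value into the edx:eax pair. */
static inline struct msr_value msr_split(uint64_t value)
{
    struct msr_value v = {
        .eax = (uint32_t)(value & 0xffffffffu),
        .edx = (uint32_t)(value >> 32),
    };
    return v;
}

/* Rebuild the 64-bit value, the way rdmsr results are combined. */
static inline uint64_t msr_join(struct msr_value v)
{
    return ((uint64_t)v.edx << 32) | v.eax;
}
```

For an illustrative value such as `0xffff880012345678`, `eax` receives `0x12345678` and `edx` receives `0xffff8800`. Reading `initial_gs+4` works for the high half because x86_64 is little-endian, so the low half is stored first in memory.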
+ +In the next step we put the address of the real mode bootparam structure into `rdi` (remember that `rsi` holds the pointer to this structure from the start) and jump to the C code with: + +```assembly + movq initial_code(%rip),%rax + pushq $0 + pushq $__KERNEL_CS + pushq %rax + lretq +``` + +Here we load the value stored at `initial_code` into `rax` and push a fake return address, `__KERNEL_CS` and this value onto the stack. After this we can see the `lretq` instruction, which pops the return address (now the value from `initial_code`) and the code segment from the stack and jumps there. `initial_code` is defined in the same source code file and looks like this: + +```assembly + .balign 8 + GLOBAL(initial_code) + .quad x86_64_start_kernel + ... + ... + ... +``` + +As we can see, `initial_code` contains the address of `x86_64_start_kernel`, which is defined in the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this: + +```C +asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) { + ... + ... + ... +} +``` + +It has one argument, `real_mode_data` (remember that we passed the address of the real mode data in the `rdi` register previously). + +This is the first C code in the kernel! + +Next to start_kernel +-------------------------------------------------------------------------------- + +We need to see the last preparations before we can see the "kernel entry point" - the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489). 
+ +First of all we can see some checks in the `x86_64_start_kernel` function: + +```C +BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map); +BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE); +BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE); +BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0); +BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0); +BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL)); +BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK))); +BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END); +``` + +There are checks for different things, for example that the virtual address of the modules space is not lower than the base address of the kernel text - `__START_KERNEL_map`, that the space between `__START_KERNEL_map` and the modules area is not smaller than the image of the kernel, and so on. `BUILD_BUG_ON` is a macro which looks like this: + +```C +#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)])) +``` + +Let's try to understand how this trick works. Let's take for example the first condition: `MODULES_VADDR < __START_KERNEL_map`. `!!(condition)` is the same as `condition != 0`. So if `MODULES_VADDR < __START_KERNEL_map` is true, we get `1` from `!!(condition)`, or zero if not. After `2*!!(condition)` we get either `2` or `0`, so the array size is either `-1` or `1`. At the end of the calculation we can get two different behaviors: + +* We will have a compilation error, because we try to get the size of a char array with a negative size (this will not happen in our case, because `MODULES_VADDR` can't be less than `__START_KERNEL_map`); +* No compilation errors. + +That's all. It is an interesting C trick for getting a compile error which depends on some constants. + +In the next step we can see the call of the `cr4_init_shadow` function, which stores a shadow copy of `cr4` per CPU. Context switches can change bits in `cr4`, so we need to store `cr4` for each CPU. 
And after this we can see the call of the `reset_early_page_tables` function, where we reset the page global directory entries and write a new pointer to the PGT into `cr3`: + +```C +for (i = 0; i < PTRS_PER_PGD-1; i++) + early_level4_pgt[i].pgd = 0; + +next_early_pgt = 0; + +write_cr3(__pa_nodebug(early_level4_pgt)); +``` + +Soon we will build new page tables. Here we can see that we go through the Page Global Directory entries in the loop (`PTRS_PER_PGD` is `512`, and the last entry, which maps the kernel, is skipped) and set them to zero. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write the physical address of the `early_level4_pgt` to `cr3`. `__pa_nodebug` is a macro which will be expanded to: + +```C +((unsigned long)(x) - __START_KERNEL_map + phys_base) +``` + +After this we clear `_bss` from `__bss_start` to `__bss_stop`, and the next step will be the setup of the early `IDT` handlers, but it's a big concept so we will see it in the next part. + +Conclusion +-------------------------------------------------------------------------------- + +This is the end of the first part about Linux kernel initialization. + +If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). + +In the next part we will see the initialization of the early interrupt handlers, kernel space memory mapping and a lot more. + +**Please note that English is not my first language and I am really sorry for any inconvenience. 
If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register) +* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) +* [Previous part - Kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) +* [NX](http://en.wikipedia.org/wiki/NX_bit) +* [ASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization) diff --git a/Initialization/linux-initialization-10.md b/Initialization/linux-initialization-10.md new file mode 100644 index 0000000..a56d86f --- /dev/null +++ b/Initialization/linux-initialization-10.md @@ -0,0 +1,473 @@ +Kernel initialization. Part 10. +================================================================================ + +End of the linux kernel initialization process +================================================================================ + +This is tenth part of the chapter about linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the [previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish it. 
+ +After the call of the `acpi_early_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we can see the following code: + +```C +#ifdef CONFIG_X86_ESPFIX64 + init_espfix_bsp(); +#endif +``` + +Here we can see the call of the `init_espfix_bsp` function which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in the [arch/x86/kernel/espfix_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/espfix_64.c) and prevents leaking of bits `31:16` of the `esp` register when returning to a 16-bit stack. First of all we install the `espfix` page upper directory into the kernel page directory in the `init_espfix_bsp`: + +```C +pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)]; +pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page); +``` + +Where `ESPFIX_BASE_ADDR` is: + +```C +#define PGDIR_SHIFT 39 +#define ESPFIX_PGD_ENTRY _AC(-2, UL) +#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT) +``` + +Also we can find it in the [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt): + +``` +... unused hole ... +ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks +... unused hole ... +``` + +After we've filled the page global directory with the `espfix` pud, the next step is the call of the `init_espfix_random` and `init_espfix_ap` functions. The first function returns random locations for the `espfix` page and the second enables the `espfix` for the current CPU. After the `init_espfix_bsp` has finished its work, we can see the call of the `thread_info_cache_init` function which is defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and allocates a cache for the `thread_info` if `THREAD_SIZE` is less than `PAGE_SIZE`: + +```C +# if THREAD_SIZE >= PAGE_SIZE +... +... +...
+void thread_info_cache_init(void) +{ + thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE, + THREAD_SIZE, 0, NULL); + BUG_ON(thread_info_cache == NULL); +} +... +... +... +#endif +``` + +As we already know the `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` or `4096` bytes and `THREAD_SIZE` is `(PAGE_SIZE << THREAD_SIZE_ORDER)` or `16384` bytes for the `x86_64`, so `THREAD_SIZE` is greater than `PAGE_SIZE` here. The next function after the `thread_info_cache_init` is the `cred_init` from the [kernel/cred.c](https://github.com/torvalds/linux/blob/master/kernel/cred.c). This function just allocates a cache for the credentials (like `uid`, `gid`, etc.): + +```C +void __init cred_init(void) +{ + cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), + 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); +} +``` + +You can read more about credentials in the [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt). The next step is the `fork_init` function from the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). The `fork_init` function allocates a cache for the `task_struct`. Let's look at the implementation of the `fork_init`. First of all we can see the definition of the `ARCH_MIN_TASKALIGN` macro and the creation of a slab where task_structs will be allocated: + +```C +#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR +#ifndef ARCH_MIN_TASKALIGN +#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES +#endif + task_struct_cachep = + kmem_cache_create("task_struct", sizeof(struct task_struct), + ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL); +#endif +``` + +As we can see this code depends on the `CONFIG_ARCH_TASK_STRUCT_ALLOCATOR` kernel configuration option. This configuration option shows the presence of the `alloc_task_struct` for the given architecture. Because of the `#ifndef` guard, the code is compiled only when the architecture does not provide its own `alloc_task_struct`; as `x86_64` has no `alloc_task_struct` function, this generic code is the one that gets compiled on the `x86_64`.
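The size arithmetic mentioned above (`PAGE_SIZE` and `THREAD_SIZE`) can be checked with a small user-space sketch. All `DEMO_*` names and the `demo_*` functions are hypothetical copies made for illustration, and the value `2` for the thread size order is an assumption chosen because it matches the `16384` bytes quoted in the text. The last function mirrors the `# if THREAD_SIZE >= PAGE_SIZE` test from the snippet above.

```c
#include <assert.h>

/* User-space copies of the constant helpers, renamed for illustration. */
#define DEMO__AC(X, Y)  (X##Y)
#define DEMO_AC(X, Y)   DEMO__AC(X, Y)

#define DEMO_PAGE_SHIFT        12  /* 4096-byte pages */
#define DEMO_PAGE_SIZE         (DEMO_AC(1, UL) << DEMO_PAGE_SHIFT)
#define DEMO_THREAD_SIZE_ORDER 2   /* assumption: gives the 16384 from the text */
#define DEMO_THREAD_SIZE       (DEMO_PAGE_SIZE << DEMO_THREAD_SIZE_ORDER)

static unsigned long demo_page_size(void)
{
	return DEMO_PAGE_SIZE;
}

static unsigned long demo_thread_size(void)
{
	return DEMO_THREAD_SIZE;
}

/* Returns 1 only when a thread stack would be smaller than one page. */
static int demo_needs_thread_info_cache(void)
{
#if DEMO_THREAD_SIZE >= DEMO_PAGE_SIZE
	return 0;
#else
	return 1;
#endif
}
```

With these values `DEMO_AC(1, UL)` becomes `(1UL)`, the page size is `1UL << 12` and the thread size is four pages, so the preprocessor selects the branch without a dedicated slab cache.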
+ +Allocating cache for init task +-------------------------------------------------------------------------------- + +After this we can see the call of the `arch_task_cache_init` function in the `fork_init`: + +```C +void arch_task_cache_init(void) +{ + task_xstate_cachep = + kmem_cache_create("task_xstate", xstate_size, + __alignof__(union thread_xstate), + SLAB_PANIC | SLAB_NOTRACK, NULL); + setup_xstate_comp(); +} +``` + +The `arch_task_cache_init` does initialization of the architecture-specific caches. In our case it is `x86_64`, so as we can see, the `arch_task_cache_init` allocates a cache for the `task_xstate` which represents [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) state and sets up offsets and sizes of all extended states in the [xsave](http://www.felixcloutier.com/x86/XSAVES.html) area with the call of the `setup_xstate_comp` function. After the `arch_task_cache_init` we calculate the default maximum number of threads with: + +```C +set_max_threads(MAX_THREADS); +``` + +where the default maximum number of threads is: + +```C +#define FUTEX_TID_MASK 0x3fffffff +#define MAX_THREADS FUTEX_TID_MASK +``` + +At the end of the `fork_init` function we initialize the resource limits stored in the [signal](http://www.win.tue.nl/~aeb/linux/lk/lk-5.html) descriptor of the `init_task`: + +```C +init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2; +init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2; +init_task.signal->rlim[RLIMIT_SIGPENDING] = + init_task.signal->rlim[RLIMIT_NPROC]; +``` + +As we know the `init_task` is an instance of the `task_struct` structure, so it contains the `signal` field which represents the signal descriptor. It has the type `struct signal_struct`. On the first two lines we can see the setting of the current and maximum values of a resource limit. Every process has an associated set of resource limits. These limits specify the amount of resources which the current process can use.
Here `rlim` is a resource control limit which is represented by the: + +```C +struct rlimit { + __kernel_ulong_t rlim_cur; + __kernel_ulong_t rlim_max; +}; +``` + +structure from the [include/uapi/linux/resource.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/resource.h). In our case the resources are `RLIMIT_NPROC`, which is the maximum number of processes that a user can own, and `RLIMIT_SIGPENDING` - the maximum number of pending signals. We can see them in: + +```C +cat /proc/self/limits +Limit Soft Limit Hard Limit Units +... +... +... +Max processes 63815 63815 processes +Max pending signals 63815 63815 signals +... +... +... +``` + +Initialization of the caches +-------------------------------------------------------------------------------- + +The next function after the `fork_init` is the `proc_caches_init` from the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). This function allocates caches for the memory descriptors (or `mm_struct` structure). At the beginning of the `proc_caches_init` we can see the allocation of different [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) caches with the call of the `kmem_cache_create`: + +* `sighand_cachep` - manages information about installed signal handlers; +* `signal_cachep` - manages information about the process signal descriptor; +* `files_cachep` - manages information about opened files; +* `fs_cachep` - manages filesystem information. + +After this we allocate a `SLAB` cache for the `mm_struct` structures: + +```C +mm_cachep = kmem_cache_create("mm_struct", + sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN, + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL); +``` + + +After this we allocate a `SLAB` cache for the important `vm_area_struct` which is used by the kernel to manage virtual memory space: + +```C +vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC); +``` + +Note that we use the `KMEM_CACHE` macro here instead of the `kmem_cache_create`.
This macro is defined in the [include/linux/slab.h](https://github.com/torvalds/linux/blob/master/include/linux/slab.h) and just expands to the `kmem_cache_create` call: + +```C +#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\ + sizeof(struct __struct), __alignof__(struct __struct),\ + (__flags), NULL) +``` + +The `KMEM_CACHE` has one difference from `kmem_cache_create`. Take a look at the `__alignof__` operator. The `KMEM_CACHE` macro aligns `SLAB` objects to the alignment requirement of the given structure, while `kmem_cache_create` uses the alignment value given by the caller. After this we can see the call of the `mmap_init` and `nsproxy_cache_init` functions. The first function initializes the virtual memory area `SLAB` and the second function initializes the `SLAB` for namespaces. + +The next function after the `proc_caches_init` is `buffer_init`. This function is defined in the [fs/buffer.c](https://github.com/torvalds/linux/blob/master/fs/buffer.c) source code file and allocates a cache for the `buffer_head`. The `buffer_head` is a special structure which is defined in the [include/linux/buffer_head.h](https://github.com/torvalds/linux/blob/master/include/linux/buffer_head.h) and used for managing buffers. At the start of the `buffer_init` function we allocate a cache for the `struct buffer_head` structures with the call of the `kmem_cache_create` function as we did in the previous functions. Then we calculate the maximum number of buffer heads in memory with: + +```C +nrpages = (nr_free_buffer_pages() * 10) / 100; +max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); +``` + +which limits the buffer heads to `10%` of the `ZONE_NORMAL` (all RAM above 4GB on the `x86_64`). The next function after the `buffer_init` is `vfs_caches_init`. This function allocates `SLAB` caches and hash tables for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches.
We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points, etc. More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in a separate part. After this we can see the `signals_init` function. This function is defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represent the queue of real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to after being loaded into memory. + +Creation of the root for the procfs +-------------------------------------------------------------------------------- + +After all of these preparations we need to create the root for the [proc](http://en.wikipedia.org/wiki/Procfs) filesystem. We will do it with the call of the `proc_root_init` function from the [fs/proc/root.c](https://github.com/torvalds/linux/blob/master/fs/proc/root.c). At the start of the `proc_root_init` function we allocate the cache for the inodes and register a new filesystem in the system with: + +```C +err = register_filesystem(&proc_fs_type); + if (err) + return; +``` + +As I wrote above we will not dive into details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) and different filesystems in this chapter, but will see it in the chapter about the `VFS`.
After we've registered a new filesystem in our system, we call the `proc_self_init` function from the [fs/proc/self.c](https://github.com/torvalds/linux/blob/master/fs/proc/self.c) and this function allocates an `inode` number for the `self` (the `/proc/self` directory refers to the process accessing the `/proc` filesystem). The next step after the `proc_self_init` is `proc_setup_thread_self` which sets up the `/proc/thread-self` directory which contains information about the current thread. After this we create the `/proc/self/mounts` symlink which will contain mount points with the call of the + +```C +proc_symlink("mounts", NULL, "self/mounts"); +``` + +and a couple of directories, some of which depend on different configuration options: + +```C +#ifdef CONFIG_SYSVIPC + proc_mkdir("sysvipc", NULL); +#endif + proc_mkdir("fs", NULL); + proc_mkdir("driver", NULL); + proc_mkdir("fs/nfsd", NULL); +#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE) + proc_mkdir("openprom", NULL); +#endif + proc_mkdir("bus", NULL); + ... + ... + ... + if (!proc_mkdir("tty", NULL)) + return; + proc_mkdir("tty/ldisc", NULL); + ... + ... + ... +``` + +At the end of the `proc_root_init` we call the `proc_sys_init` function which creates the `/proc/sys` directory and initializes the [Sysctl](http://en.wikipedia.org/wiki/Sysctl). + +It is the end of the `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I skipped them, because they are not important for the generic kernel initialization and depend only on different kernel configurations.
They are `taskstats_init_early` which exports per-task statistics to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` initialize different security stuff, `check_bugs` - fixes some architecture-dependent bugs, the `ftrace_init` function executes initialization of the [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init` makes initialization of the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem, etc. Many of these parts and subsystems will be described in the other chapters. + +That's all. Finally we have passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. At the end of the `start_kernel` we can see the last call - the `rest_init` function. Let's go ahead. + +First steps after the start_kernel +-------------------------------------------------------------------------------- + +The `rest_init` function is defined in the same source code file as the `start_kernel` function, and this file is [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). At the beginning of the `rest_init` we can see the call of the two following functions: + +```C + rcu_scheduler_starting(); + smpboot_thread_init(); +``` + +The first `rcu_scheduler_starting` makes the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) scheduler active and the second `smpboot_thread_init` registers the `smpboot_thread_notifier` CPU notifier (you can read more about it in the [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)).
After this we can see the following calls: + +```C +kernel_thread(kernel_init, NULL, CLONE_FS); +pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); +``` + +Here the `kernel_thread` function (defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c)) creates a new kernel thread. As we can see the `kernel_thread` function takes three arguments: + +* Function which will be executed in a new thread; +* Parameter for this function; +* Flags. + +We will not dive into details about the `kernel_thread` implementation (we will see it in the chapter which describes the scheduler, for now we just need to say that `kernel_thread` invokes [clone](http://www.tutorialspoint.com/unix_system_calls/clone.htm)). Now we only need to know that we create a new kernel thread with the `kernel_thread` function, that the parent and child of the thread will share information about the filesystem and that the new thread will start to execute the `kernel_init` function. A kernel thread differs from a user thread in that it runs in kernel mode. So with these two `kernel_thread` calls we create two new kernel threads with the `PID = 1` for the `init` process and `PID = 2` for `kthreadd`. We already know what the `init` process is. Let's look at the `kthreadd`. It is a special kernel thread which manages and helps different parts of the kernel to create other kernel threads. We can see it in the output of the `ps` util: + +```C +$ ps -ef | grep kthread +root 2 0 0 Jan11 ? 00:00:00 [kthreadd] +``` + +Let's postpone `kernel_init` and `kthreadd` for now and go ahead in the `rest_init`. In the next step after we have created two new kernel threads we can see the following code: + +```C + rcu_read_lock(); + kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns); + rcu_read_unlock(); +``` + +The first `rcu_read_lock` function marks the beginning of an [RCU](http://en.wikipedia.org/wiki/Read-copy-update) read-side critical section and the `rcu_read_unlock` marks the end of an RCU read-side critical section.
We call these functions because we need to protect the `find_task_by_pid_ns`. The `find_task_by_pid_ns` returns a pointer to the `task_struct` by the given pid. So, here we are getting the pointer to the `task_struct` for `PID = 2` (we got it after the `kthreadd` creation with the `kernel_thread`). In the next step we call the `complete` function + +```C +complete(&kthreadd_done); +``` + +and pass the address of the `kthreadd_done`. The `kthreadd_done` is defined as + +```C +static __initdata DECLARE_COMPLETION(kthreadd_done); +``` + +where the `DECLARE_COMPLETION` macro is defined as: + +```C +#define DECLARE_COMPLETION(work) \ + struct completion work = COMPLETION_INITIALIZER(work) +``` + +and expands to the definition of the `completion` structure. This structure is defined in the [include/linux/completion.h](https://github.com/torvalds/linux/blob/master/include/linux/completion.h) and represents the `completions` concept. Completions are a code synchronization mechanism which provides a race-free solution for threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts. The first is the definition of the `completion` structure and we did it with the `DECLARE_COMPLETION`. The second is the call of the `wait_for_completion`. After the call of this function, the thread which called it will not continue to execute and will wait until another thread calls the `complete` function. Note that we call `wait_for_completion` with the `kthreadd_done` at the beginning of the `kernel_init_freeable`: + +```C +wait_for_completion(&kthreadd_done); +``` + +And the last step is to call the `complete` function as we saw above. So the `kernel_init_freeable` function will not proceed until the `kthreadd` thread has been set up.
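The three steps of the completion pattern described above can be tried in user space. The following is only a sketch under stated assumptions: it uses POSIX threads instead of kernel threads, and the `demo_*` names and the `struct demo_completion` type are hypothetical stand-ins, not the kernel implementation. The only detail borrowed from the kernel is the idea of a `done` counter protected by a lock.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for the kernel's struct completion. */
struct demo_completion {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	unsigned int    done;  /* completions signalled but not yet consumed */
};

static void demo_init_completion(struct demo_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

/* Step three: signal that the awaited state has been reached. */
static void demo_complete(struct demo_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done++;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Step two: block until somebody calls demo_complete(). */
static void demo_wait_for_completion(struct demo_completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->done == 0)
		pthread_cond_wait(&c->cond, &c->lock);
	c->done--;
	pthread_mutex_unlock(&c->lock);
}

/* Step one: the completion object itself plus the data it guards. */
static struct demo_completion demo_kthreadd_done;
static int demo_kthreadd_pid;

/* Plays the role of kernel_init_freeable: waits, then reads the pid. */
static void *demo_waiter(void *arg)
{
	int *seen = arg;

	demo_wait_for_completion(&demo_kthreadd_done);
	*seen = demo_kthreadd_pid;
	return NULL;
}

/* Plays the role of rest_init: starts the waiter first, completes later. */
static int demo_run(void)
{
	pthread_t waiter;
	int seen = 0;
	int err;

	demo_init_completion(&demo_kthreadd_done);
	demo_kthreadd_pid = 0;

	err = pthread_create(&waiter, NULL, demo_waiter, &seen);
	assert(err == 0);

	demo_kthreadd_pid = 2;               /* "kthreadd" is set up ... */
	demo_complete(&demo_kthreadd_done);  /* ... so release the waiter */

	pthread_join(waiter, NULL);
	return seen;
}
```

The waiter is started before the value is published, like `kernel_init` is created before `kthreadd` in the `rest_init`, but it can only observe the value after `demo_complete` was called, so `demo_run()` returns `2`. On some systems the program must be built with the `-pthread` compiler option.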
After the `kthreadd` was set up, we can see the three following functions in the `rest_init`: + +```C + init_idle_bootup_task(current); + schedule_preempt_disabled(); + cpu_startup_entry(CPUHP_ONLINE); +``` + +The first `init_idle_bootup_task` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) sets the scheduling class for the current process (the `idle` class in our case): + +```C +void init_idle_bootup_task(struct task_struct *idle) +{ + idle->sched_class = &idle_sched_class; +} +``` + +where the `idle` class has a low task priority: its tasks can run only when the processor doesn't have anything else to run. The second function `schedule_preempt_disabled` disables preemption in `idle` tasks. And the third function `cpu_startup_entry` is defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/kernel/sched/idle.c) and calls `cpu_idle_loop` from the same [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/kernel/sched/idle.c). The `cpu_idle_loop` function works as the process with `PID = 0` and works in the background. The main purpose of the `cpu_idle_loop` is to consume the idle CPU cycles. When there is no process to run, this process starts to work. We have one process with the `idle` scheduling class (we just set the `current` task to the `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work but just checks if there is an active task to switch to: + +```C +static void cpu_idle_loop(void) +{ + ... + ... + ... + while (1) { + while (!need_resched()) { + ... + ... + ... + } + ... + } +``` + +More about it will be in the chapter about the scheduler. So at this moment the `start_kernel` calls the `rest_init` function which spawns an `init` (`kernel_init` function) process and becomes the `idle` process itself. Now it is time to look at the `kernel_init`. Execution of the `kernel_init` function starts from the call of the `kernel_init_freeable` function.
The `kernel_init_freeable` function first of all waits for the completion of the `kthreadd` setup. I already wrote about it above: + +```C +wait_for_completion(&kthreadd_done); +``` + +After this we set `gfp_allowed_mask` to `__GFP_BITS_MASK` which means that the system is already running, set allowed [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to all CPUs and [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes with the `set_mems_allowed` function, allow the `init` process to run on any CPU with the `set_cpus_allowed_ptr`, set the pid for the `cad` or `Ctrl-Alt-Delete`, do preparation for booting of the other CPUs with the call of the `smp_prepare_cpus`, call early [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) with the `do_pre_smp_initcalls`, initialize `SMP` with the `smp_init`, initialize the [lockup_detector](https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt) with the call of the `lockup_detector_init` and initialize the scheduler with the `sched_init_smp`. + +After this we can see the call of the `do_basic_setup` function. By the time we call the `do_basic_setup` function, our kernel is already initialized. As the comment says: + +``` +Now we can finally start doing some real work.. +``` + +The `do_basic_setup` will reinitialize the [cpuset](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to the active CPUs, initialize the `khelper` - a kernel thread which is used for making calls out to userspace from within the kernel, initialize [tmpfs](http://en.wikipedia.org/wiki/Tmpfs), initialize the `drivers` subsystem, enable the user-mode helper `workqueue` and make the post-early call of the `initcalls`.
After the `do_basic_setup` we can see the opening of the `/dev/console` and two `dup` calls on file descriptor `0`, which give us the descriptors from `0` to `2`: + + +```C +if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0) + pr_err("Warning: unable to open an initial console.\n"); + +(void) sys_dup(0); +(void) sys_dup(0); +``` + +We are using two system calls here: `sys_open` and `sys_dup`. In the next chapters we will see explanation and implementation of the different system calls. After we have opened the initial console, we check whether the `rdinit=` option was passed to the kernel command line, and otherwise set the default path of the ramdisk: + +```C +if (!ramdisk_execute_command) + ramdisk_execute_command = "/init"; +``` + +Then we check access to the `ramdisk_execute_command` file and, if it is not accessible, call the `prepare_namespace` function from the [init/do_mounts.c](https://github.com/torvalds/linux/blob/master/init/do_mounts.c) which checks and mounts the [initrd](http://en.wikipedia.org/wiki/Initrd): + +```C +if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) { + ramdisk_execute_command = NULL; + prepare_namespace(); +} +``` + +This is the end of the `kernel_init_freeable` function and we need to return to the `kernel_init`. The next step after the `kernel_init_freeable` has finished its execution is the `async_synchronize_full`. This function waits until all asynchronous function calls have been done, and after it we will call the `free_initmem` which will release all memory occupied by the initialization stuff which is located between `__init_begin` and `__init_end`.
After this we protect `.rodata` with the `mark_rodata_ro` and update the state of the system from the `SYSTEM_BOOTING` to the + +```C +system_state = SYSTEM_RUNNING; +``` + +And then the kernel tries to run the `init` process: + +```C +if (ramdisk_execute_command) { + ret = run_init_process(ramdisk_execute_command); + if (!ret) + return 0; + pr_err("Failed to execute %s (error %d)\n", + ramdisk_execute_command, ret); +} +``` + +First of all it checks the `ramdisk_execute_command` which we set in the `kernel_init_freeable` function and which will be equal to the value of the `rdinit=` kernel command line parameter or `/init` by default. The `run_init_process` function fills the first element of the `argv_init` array: + +```C +static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, }; +``` + +which represents arguments of the `init` program and calls the `do_execve` function: + +```C +argv_init[0] = init_filename; +return do_execve(getname_kernel(init_filename), + (const char __user *const __user *)argv_init, + (const char __user *const __user *)envp_init); +``` + +The `do_execve` function is declared in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and runs a program with the given file name and arguments.
If we did not pass the `rdinit=` option to the kernel command line, the kernel starts to check the `execute_command` which is equal to the value of the `init=` kernel command line parameter: + +```C + if (execute_command) { + ret = run_init_process(execute_command); + if (!ret) + return 0; + panic("Requested init %s failed (error %d).", + execute_command, ret); + } +``` + +If we did not pass the `init=` kernel command line parameter either, the kernel tries to run one of the following executable files: + +```C +if (!try_to_run_init_process("/sbin/init") || + !try_to_run_init_process("/etc/init") || + !try_to_run_init_process("/bin/init") || + !try_to_run_init_process("/bin/sh")) + return 0; +``` + +Otherwise we finish with [panic](http://en.wikipedia.org/wiki/Kernel_panic): + +```C +panic("No working init found. Try passing init= option to kernel. " + "See Linux Documentation/init.txt for guidance."); +``` + +That's all! Linux kernel initialization process is finished! + +Conclusion +-------------------------------------------------------------------------------- + +It is the end of the tenth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). It is not only the `tenth` part, but also the last part which describes initialization of the linux kernel. As I wrote in the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we would go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in our system. I skipped details about different subsystems of the kernel, for example I almost did not cover the scheduler, interrupts, exception handling, etc. From the next part we will start to dive into the different kernel subsystems. Hope it will be interesting.
+ +If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). + +**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) +* [xsave](http://www.felixcloutier.com/x86/XSAVES.html) +* [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) +* [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt) +* [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt) +* [RCU](http://en.wikipedia.org/wiki/Read-copy-update) +* [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) +* [inode](http://en.wikipedia.org/wiki/Inode) +* [proc](http://en.wikipedia.org/wiki/Procfs) +* [man proc](http://linux.die.net/man/5/proc) +* [Sysctl](http://en.wikipedia.org/wiki/Sysctl) +* [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt) +* [cgroup](http://en.wikipedia.org/wiki/Cgroups) +* [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) +* [completions - wait for completion handling](https://www.kernel.org/doc/Documentation/scheduler/completion.txt) +* [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) +* [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) +* [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) +* [Tmpfs](http://en.wikipedia.org/wiki/Tmpfs) +* [initrd](http://en.wikipedia.org/wiki/Initrd) +* [panic](http://en.wikipedia.org/wiki/Kernel_panic) +* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) diff --git a/Initialization/linux-initialization-2.md 
b/Initialization/linux-initialization-2.md new file mode 100644 index 0000000..eba0cfc --- /dev/null +++ b/Initialization/linux-initialization-2.md @@ -0,0 +1,495 @@ +Kernel initialization. Part 2. +================================================================================ + +Early interrupt and exception handling +-------------------------------------------------------------------------------- + +In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we stopped before the setting of early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have a basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish early preparation before the main kernel code starts to work. + +We already started to do this preparation in the previous [first](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). We continue in this part and will learn more about interrupt and exception handling. + +Remember that we stopped before the following loop: + +```C +for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) + set_intr_gate(i, early_idt_handler_array[i]); +``` + +from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) source code file. But before we start to sort out this code, we need to know about interrupts and handlers. + +Some theory +-------------------------------------------------------------------------------- + +An interrupt is an event signalled to the CPU by software or hardware. For example, a user has pressed a key on the keyboard. On an interrupt, the CPU stops the current task and transfers control to a special routine which is called an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler).
An interrupt handler handles the interrupt and transfers control back to the previously stopped task. We can split interrupts into three types:
+
+* Software interrupts - when software signals the CPU that it needs kernel attention. These interrupts are generally used for system calls;
+* Hardware interrupts - when a hardware event happens, for example a key is pressed on a keyboard;
+* Exceptions - interrupts generated by the CPU when it detects an error, for example division by zero or an access to a memory page which is not in RAM.
+
+Every interrupt and exception is assigned a unique number which is called the `vector number`. A `vector number` can be any number from `0` to `255`. It is common practice to use the first `32` vector numbers for exceptions, while vector numbers from `32` to `255` are used for user-defined interrupts. We can see it in the code above - `NUM_EXCEPTION_VECTORS`, which is defined as:
+
+```C
+#define NUM_EXCEPTION_VECTORS 32
+```
+
+The CPU uses the vector number as an index into the `Interrupt Descriptor Table` (we will see its description soon). The CPU receives interrupts from the [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) or through its pins.
The following table shows exceptions `0-31`:
+
+```
+----------------------------------------------------------------------------------------------
+|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
+----------------------------------------------------------------------------------------------
+|0     | #DE    |Divide Error        |Fault|NO        |DIV and IDIV                          |
+|---------------------------------------------------------------------------------------------
+|1     | #DB    |Debug               |F/T  |NO        |                                      |
+|---------------------------------------------------------------------------------------------
+|2     | ---    |NMI                 |INT  |NO        |external NMI                          |
+|---------------------------------------------------------------------------------------------
+|3     | #BP    |Breakpoint          |Trap |NO        |INT 3                                 |
+|---------------------------------------------------------------------------------------------
+|4     | #OF    |Overflow            |Trap |NO        |INTO instruction                      |
+|---------------------------------------------------------------------------------------------
+|5     | #BR    |Bound Range Exceeded|Fault|NO        |BOUND instruction                     |
+|---------------------------------------------------------------------------------------------
+|6     | #UD    |Invalid Opcode      |Fault|NO        |UD2 instruction                       |
+|---------------------------------------------------------------------------------------------
+|7     | #NM    |Device Not Available|Fault|NO        |Floating point or [F]WAIT             |
+|---------------------------------------------------------------------------------------------
+|8     | #DF    |Double Fault        |Abort|YES       |Any instruction which can generate NMI|
+|---------------------------------------------------------------------------------------------
+|9     | ---    |Reserved            |Fault|NO        |                                      |
+|---------------------------------------------------------------------------------------------
+|10    | #TS    |Invalid TSS         |Fault|YES       |Task switch or TSS access             |
+|---------------------------------------------------------------------------------------------
+|11    | #NP    |Segment Not Present |Fault|YES       |Accessing segment register            |
+|---------------------------------------------------------------------------------------------
+|12    | #SS    |Stack-Segment Fault |Fault|YES       |Stack operations                      |
+|---------------------------------------------------------------------------------------------
+|13    | #GP    |General Protection  |Fault|YES       |Memory reference                      |
+|---------------------------------------------------------------------------------------------
+|14    | #PF    |Page fault          |Fault|YES       |Memory reference                      |
+|---------------------------------------------------------------------------------------------
+|15    | ---    |Reserved            |     |NO        |                                      |
+|---------------------------------------------------------------------------------------------
+|16    | #MF    |x87 FPU fp error    |Fault|NO        |Floating point or [F]WAIT             |
+|---------------------------------------------------------------------------------------------
+|17    | #AC    |Alignment Check     |Fault|YES       |Data reference                        |
+|---------------------------------------------------------------------------------------------
+|18    | #MC    |Machine Check       |Abort|NO        |                                      |
+|---------------------------------------------------------------------------------------------
+|19    | #XM    |SIMD fp exception   |Fault|NO        |SSE[2,3] instructions                 |
+|---------------------------------------------------------------------------------------------
+|20    | #VE    |Virtualization exc. |Fault|NO        |EPT violations                        |
+|---------------------------------------------------------------------------------------------
+|21-31 | ---    |Reserved            |INT  |NO        |External interrupts                   |
+----------------------------------------------------------------------------------------------
+```
+
+To react to an interrupt the CPU uses a special structure - the Interrupt Descriptor Table or IDT. The IDT is an array of 8-byte descriptors like the Global Descriptor Table, but IDT entries are called `gates`. The CPU multiplies the vector number by 8 to find the offset of the IDT entry.
But in 64-bit mode the IDT is an array of 16-byte descriptors and the CPU multiplies the vector number by 16 to find the offset of the entry in the IDT. Just as the CPU uses the special `GDTR` register to locate the Global Descriptor Table, it uses the special `IDTR` register for the Interrupt Descriptor Table and the `lidt` instruction to load the base address and limit of the table into this register.
+
+A 64-bit mode IDT entry has the following structure:
+
+```
+127                                                                            96
+ --------------------------------------------------------------------------------
+|                                                                               |
+|                                   Reserved                                    |
+|                                                                               |
+ --------------------------------------------------------------------------------
+95                                                                             64
+ --------------------------------------------------------------------------------
+|                                                                               |
+|                                 Offset 63..32                                 |
+|                                                                               |
+ --------------------------------------------------------------------------------
+63                       48  47    46 45   44   43    40 39         35 34     32
+ --------------------------------------------------------------------------------
+|                          |     |       |     |        |             |         |
+|      Offset 31..16       |  P  |  DPL  |  0  |  Type  |  0 0 0 0 0  |   IST   |
+|                          |     |       |     |        |             |         |
+ --------------------------------------------------------------------------------
+31                                    16 15                                    0
+ --------------------------------------------------------------------------------
+|                                       |                                       |
+|           Segment Selector            |             Offset 15..0              |
+|                                       |                                       |
+ --------------------------------------------------------------------------------
+```
+
+Where:
+
+* `Offset` - the offset of the entry point of an interrupt handler;
+* `DPL` - Descriptor Privilege Level;
+* `P` - Segment Present flag;
+* `Segment selector` - a code segment selector in the GDT or LDT;
+* `IST` - provides the ability to switch to a new stack for interrupt handling.
+
+And the last `Type` field describes the type of the `IDT` entry. There are three different kinds of gates (task gates are not available in 64-bit mode):
+
+* Task gate
+* Interrupt gate
+* Trap gate
+
+Interrupt and trap gates contain a far pointer to the entry point of the interrupt handler.
The only difference between these types is how the CPU handles the `IF` flag. If the interrupt handler was entered through an interrupt gate, the CPU clears the `IF` flag to prevent other interrupts while the current interrupt handler executes; with a trap gate the `IF` flag is left unchanged. When the handler finishes, the `iret` instruction restores the saved flags register and with it the previous value of the `IF` flag.
+
+Other bits in the interrupt gate are reserved and must be 0. Now let's look at how the CPU handles interrupts:
+
+* The CPU saves the flags register, `CS`, and the instruction pointer on the stack;
+* If the interrupt provides an error code (like `#PF` for example), the CPU pushes the error code on the stack after the instruction pointer;
+* After the interrupt handler has executed, the `iret` instruction is used to return from it.
+
+Now let's get back to the code.
+
+Fill and load IDT
+--------------------------------------------------------------------------------
+
+We stopped at the following point:
+
+```C
+for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
+    set_intr_gate(i, early_idt_handler_array[i]);
+```
+
+Here we call `set_intr_gate` in the loop, which takes two parameters:
+
+* Number of an interrupt or `vector number`;
+* Address of the idt handler.
+
+and inserts an interrupt gate into the `IDT`, which is represented by the `idt_table` array (the `idt_descr` structure only describes its address and size). First of all let's look at the `early_idt_handler_array` array. It is declared in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) header file and contains the entry points of the first `32` exception handlers:
+
+```C
+#define EARLY_IDT_HANDLER_SIZE 9
+#define NUM_EXCEPTION_VECTORS 32
+
+extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
+```
+
+The `early_idt_handler_array` is a `288`-byte array which contains one exception entry point every nine bytes.
Every nine bytes of this array consist of an optional two-byte instruction for pushing a dummy error code if the exception does not provide one, a two-byte instruction for pushing the vector number to the stack, and a five-byte `jump` to the common exception handler code.
+
+As we can see, we're filling only the first 32 `IDT` entries in the loop, because all of the early setup runs with interrupts disabled, so there is no need to set up interrupt handlers for vectors `32` and above. The `early_idt_handler_array` array contains generic idt handlers and we can find its definition in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file. For now we will skip it, but we will look at it soon. Before this we will look at the implementation of the `set_intr_gate` macro.
+
+The `set_intr_gate` macro is defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) header file and looks like this:
+
+```C
+#define set_intr_gate(n, addr)                                  \
+    do {                                                        \
+        BUG_ON((unsigned)n > 0xFF);                             \
+        _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,        \
+                  __KERNEL_CS);                                 \
+        _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\
+                        0, 0, __KERNEL_CS);                     \
+    } while (0)
+```
+
+First of all it checks with the `BUG_ON` macro that the passed interrupt number is not greater than `255`. We need to do this check because we can have only `256` interrupts.
After this, it calls the `_set_gate` function which writes the interrupt gate to the `IDT`:
+
+```C
+static inline void _set_gate(int gate, unsigned type, void *addr,
+                             unsigned dpl, unsigned ist, unsigned seg)
+{
+    gate_desc s;
+    pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
+    write_idt_entry(idt_table, gate, &s);
+    write_trace_idt_entry(gate, &s);
+}
+```
+
+At the start of the `_set_gate` function we can see a call of the `pack_gate` function which fills the `gate_desc` structure with the given values:
+
+```C
+static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
+                             unsigned dpl, unsigned ist, unsigned seg)
+{
+    gate->offset_low    = PTR_LOW(func);
+    gate->segment       = __KERNEL_CS;
+    gate->ist           = ist;
+    gate->p             = 1;
+    gate->dpl           = dpl;
+    gate->zero0         = 0;
+    gate->zero1         = 0;
+    gate->type          = type;
+    gate->offset_middle = PTR_MIDDLE(func);
+    gate->offset_high   = PTR_HIGH(func);
+}
+```
+
+As mentioned above, we fill the gate descriptor in this function. We fill the three parts of the handler address with the address which we got in the main loop (the address of the interrupt handler entry point). We use the following three macros to split the address into three parts:
+
+```C
+#define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
+#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
+#define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
+```
+
+With the first `PTR_LOW` macro we get the lowest `2` bytes of the address (bits `0..15`), with the second `PTR_MIDDLE` macro we get the next `2` bytes (bits `16..31`) and with the third `PTR_HIGH` macro we get the upper `4` bytes (bits `32..63`). Next we set up the segment selector for the interrupt handler; it will be our kernel code segment - `__KERNEL_CS`. In the next step we fill the `Interrupt Stack Table` and `Descriptor Privilege Level` (highest privilege level) fields with zeros. And we set the `GATE_INTERRUPT` type in the end.
+
+Now we have a filled IDT entry and we can call the `native_write_idt_entry` function which just copies the filled `IDT` entry to the `IDT`:
+
+```C
+static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
+{
+    memcpy(&idt[entry], gate, sizeof(*gate));
+}
+```
+
+After the main loop has finished, we have a filled `idt_table` array of `gate_desc` structures and we can load the `Interrupt Descriptor Table` with the call of:
+
+```C
+load_idt((const struct desc_ptr *)&idt_descr);
+```
+
+Where `idt_descr` is:
+
+```C
+struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
+```
+
+and `load_idt` just executes the `lidt` instruction:
+
+```C
+asm volatile("lidt %0"::"m" (*dtr));
+```
+
+You can note that there are calls of the `_trace_*` functions in `_set_gate` and other functions. These functions fill `IDT` gates in the same manner as `_set_gate`, but with one difference: they use the `trace_idt_table` `Interrupt Descriptor Table` instead of `idt_table`, for tracepoints (we will cover this theme in another part).
+
+Okay, now we have filled and loaded the `Interrupt Descriptor Table` and we know how the CPU acts during an interrupt. So now it is time to deal with interrupt handlers.
+
+Early interrupts handlers
+--------------------------------------------------------------------------------
+
+As you can read above, we filled the `IDT` with the addresses from `early_idt_handler_array`. We can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file:
+
+```assembly
+    .globl early_idt_handler_array
+early_idt_handler_array:
+    i = 0
+    .rept NUM_EXCEPTION_VECTORS
+    .ifeq (EXCEPTION_ERRCODE_MASK >> i) & 1
+    pushq $0        # dummy error code, only if the CPU does not push one
+    .endif
+    pushq $i        # vector number
+    jmp early_idt_handler_common
+    i = i + 1
+    .fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
+    .endr
+```
+
+Here we can see the generation of interrupt handlers for the first `32` exceptions.
We check here whether the exception has an error code: if it does, we do nothing; if the exception does not provide an error code, we push zero to the stack. We do this so that the stack layout is uniform. After that we push the exception number on the stack and jump to `early_idt_handler_common`, which is the generic interrupt handler for now. As we saw above, every nine bytes of the `early_idt_handler_array` array consist of an optional push of a dummy error code, a push of the `vector number` and a jump instruction. We can see it in the output of the `objdump` util:
+
+```
+$ objdump -D vmlinux
+...
+...
+...
+ffffffff81fe5000 <early_idt_handler_array>:
+ffffffff81fe5000:   6a 00                   pushq  $0x0
+ffffffff81fe5002:   6a 00                   pushq  $0x0
+ffffffff81fe5004:   e9 17 01 00 00          jmpq   ffffffff81fe5120 <early_idt_handler_common>
+ffffffff81fe5009:   6a 00                   pushq  $0x0
+ffffffff81fe500b:   6a 01                   pushq  $0x1
+ffffffff81fe500d:   e9 0e 01 00 00          jmpq   ffffffff81fe5120 <early_idt_handler_common>
+ffffffff81fe5012:   6a 00                   pushq  $0x0
+ffffffff81fe5014:   6a 02                   pushq  $0x2
+...
+...
+...
+```
+
+As I wrote above, the CPU pushes the flags register, `CS` and `RIP` on the stack. So before `early_idt_handler_common` is executed, the stack contains the following data:
+
+```
+|--------------------|
+| %rflags            |
+| %cs                |
+| %rip               |
+| rsp --> error code |
+|--------------------|
+```
+
+Now let's look at the `early_idt_handler_common` implementation. It is located in the same [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) assembly file and first of all we can see a check for [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). We don't need to handle it, so we just ignore it in `early_idt_handler_common`:
+
+```assembly
+    cmpl $2,(%rsp)
+    je .Lis_nmi
+```
+
+where `.Lis_nmi`:
+
+```assembly
+.Lis_nmi:
+    addq $16,%rsp
+    INTERRUPT_RETURN
+```
+
+drops the error code and the vector number from the stack and calls `INTERRUPT_RETURN`, which just expands to the `iretq` instruction.
Once we have checked the vector number and it is not `NMI`, we check `early_recursion_flag` to prevent recursion in `early_idt_handler_common`, and if everything is fine we save the general purpose registers on the stack:
+
+```assembly
+    pushq %rax
+    pushq %rcx
+    pushq %rdx
+    pushq %rsi
+    pushq %rdi
+    pushq %r8
+    pushq %r9
+    pushq %r10
+    pushq %r11
+```
+
+We need to do this so that the registers hold their original values when we return from the interrupt handler. After this we check the segment selector saved on the stack:
+
+```assembly
+    cmpl $__KERNEL_CS,96(%rsp)
+    jne 11f
+```
+
+which must be equal to the kernel code segment; if it is not, we jump to label `11` which prints a `PANIC` message and makes a stack dump.
+
+After the code segment has been checked, we check the vector number, and if it is `#PF` or [Page Fault](https://en.wikipedia.org/wiki/Page_fault), we put the value from `cr2` into the `rdi` register and call `early_make_pgtable` (we'll see it soon):
+
+```assembly
+    cmpl $14,72(%rsp)
+    jnz 10f
+    GET_CR2_INTO(%rdi)
+    call early_make_pgtable
+    andl %eax,%eax
+    jz 20f
+```
+
+If `early_make_pgtable` returns zero (the `jz 20f` branch), the page fault has been handled and we restore the general purpose registers from the stack:
+
+```assembly
+    popq %r11
+    popq %r10
+    popq %r9
+    popq %r8
+    popq %rdi
+    popq %rsi
+    popq %rdx
+    popq %rcx
+    popq %rax
+```
+
+and exit from the handler with `iret`.
+
+This is the end of the first interrupt handler. Note that it is a very early interrupt handler, so it handles only Page Fault for now. We will see handlers for the other interrupts later, but now let's look at the page fault handler.
+
+Page fault handling
+--------------------------------------------------------------------------------
+
+In the previous paragraph we saw the first early interrupt handler, which checks whether the interrupt number is a page fault and, if it is, calls `early_make_pgtable` to build new page tables. We need a `#PF` handler at this step because there are plans to add the ability to load the kernel above `4G` and to access the `boot_params` structure above 4G.
+
+You can find the implementation of `early_make_pgtable` in [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). It takes one parameter - the address from the `cr2` register which caused the Page Fault. Let's look at it:
+
+```C
+int __init early_make_pgtable(unsigned long address)
+{
+    unsigned long physaddr = address - __PAGE_OFFSET;
+    unsigned long i;
+    pgdval_t pgd, *pgd_p;
+    pudval_t pud, *pud_p;
+    pmdval_t pmd, *pmd_p;
+    ...
+    ...
+    ...
+}
+```
+
+It starts with the definition of some variables which have `*val_t` types. All of these types are just:
+
+```C
+typedef unsigned long pgdval_t;
+```
+
+We will also operate with the `*_t` (not val) types, for example `pgd_t` and so on. All of these types are defined in [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h) and represent structures like this:
+
+```C
+typedef struct { pgdval_t pgd; } pgd_t;
+```
+
+For example,
+
+```C
+extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+```
+
+Here `early_level4_pgt` represents the early top-level page table directory. It is an array of `pgd_t` entries, and each `pgd` points to lower-level page table entries.
+
+After we have checked that the address is valid, we get the address of the Page Global Directory entry which covers the `#PF` address and put its value into the `pgd` variable:
+
+```C
+pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+pgd = *pgd_p;
+```
+
+In the next step we check `pgd`: if it contains a valid page global directory entry, we take the physical address stored in it, convert it to a virtual address and put the result into `pud_p` with:
+
+```C
+pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+```
+
+where `PTE_PFN_MASK` is a macro:
+
+```C
+#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)
+```
+
+which expands to:
+
+```C
+(~(PAGE_SIZE-1)) & ((1 << 46) - 1)
+```
+
+or
+
+```
+0b1111111111111111111111111111111111000000000000
+```
+
+which is a 46-bit value with the lowest 12 bits cleared, used to mask out the page frame address.
+
+If `pgd` does not contain a valid address, we check that `next_early_pgt` is less than `EARLY_DYNAMIC_PAGE_TABLES`, which is `64` and represents the fixed number of buffers available to set up new page tables on demand. If `next_early_pgt` is not less than `EARLY_DYNAMIC_PAGE_TABLES`, we reset the page tables and start again. Otherwise we create a new page upper directory pointer which points to the current dynamic page table, and write its physical address with the `_KERNPG_TABLE` access rights to the page global directory:
+
+```C
+if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+    reset_early_page_tables();
+    goto again;
+}
+
+pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
+for (i = 0; i < PTRS_PER_PUD; i++)
+    pud_p[i] = 0;
+*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+```
+
+After this we fix up the address of the page upper directory with:
+
+```C
+pud_p += pud_index(address);
+pud = *pud_p;
+```
+
+In the next step we do the same actions as before, but with the page middle directory.
In the end we fill in the page middle directory entry which maps the kernel text+data virtual addresses:
+
+```C
+pmd = (physaddr & PMD_MASK) + early_pmd_flags;
+pmd_p[pmd_index(address)] = pmd;
+```
+
+After the page fault handler finishes its work, our `early_level4_pgt` contains entries which point to valid addresses.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the second part about linux kernel insides. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part we will see all the steps before the kernel entry point - the `start_kernel` function.
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [GNU assembly .rept](https://sourceware.org/binutils/docs-2.23/as/Rept.html)
+* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
+* [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt)
+* [Page table](https://en.wikipedia.org/wiki/Page_table)
+* [Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
+* [Page Fault](https://en.wikipedia.org/wiki/Page_fault)
+* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
diff --git a/Initialization/linux-initialization-3.md b/Initialization/linux-initialization-3.md
new file mode 100644
index 0000000..9ad6473
--- /dev/null
+++ b/Initialization/linux-initialization-3.md
@@ -0,0 +1,430 @@
+Kernel initialization. Part 3.
+================================================================================
+
+Last preparations before the kernel entry point
+--------------------------------------------------------------------------------
+
+This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling, and in the current part we will continue to dive into the linux kernel initialization process. Our next point is the 'kernel entry point' - the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. Yes, technically it is not the kernel's entry point but the start of the generic kernel code which does not depend on a certain architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue.
+
+boot_params again
+--------------------------------------------------------------------------------
+
+In the previous part we stopped at setting up the Interrupt Descriptor Table and loading it into the `IDTR` register. At the next step we can see a call of the `copy_bootdata` function:
+
+```C
+copy_bootdata(__va(real_mode_data));
+```
+
+This function takes one argument - the virtual address of `real_mode_data`. Remember that we passed the address of the `boot_params` structure from [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L114) to the `x86_64_start_kernel` function as the first argument in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):
+
+```
+    /* rsi is pointer to real mode structure with interesting info.
+       pass it to C */
+    movq %rsi, %rdi
+```
+
+Now let's look at the `__va` macro.
This macro is defined in [arch/x86/include/asm/page.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page.h):
+
+```C
+#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
+```
+
+where `PAGE_OFFSET` is `__PAGE_OFFSET`, which is `0xffff880000000000`, the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the `boot_params` structure and pass it to the `copy_bootdata` function, where we copy `real_mode_data` to `boot_params`, which is declared in [arch/x86/kernel/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.h):
+
+```C
+extern struct boot_params boot_params;
+```
+
+Let's look at the `copy_bootdata` implementation:
+
+```C
+static void __init copy_bootdata(char *real_mode_data)
+{
+    char * command_line;
+    unsigned long cmd_line_ptr;
+
+    memcpy(&boot_params, real_mode_data, sizeof boot_params);
+    sanitize_boot_params(&boot_params);
+    cmd_line_ptr = get_cmd_line_ptr();
+    if (cmd_line_ptr) {
+        command_line = __va(cmd_line_ptr);
+        memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
+    }
+}
+```
+
+First of all, note that this function is declared with the `__init` prefix. It means that this function will be used only during initialization and the memory it uses will be freed afterwards.
+
+We can see the declaration of two variables for the kernel command line and the copying of `real_mode_data` to `boot_params` with the `memcpy` function. Next comes the call of the `sanitize_boot_params` function, which zeroes some fields of the `boot_params` structure, like `ext_ramdisk_image` and so on, for bootloaders which fail to initialize unknown fields in `boot_params` to zero. After this we get the address of the command line with the call of the `get_cmd_line_ptr` function:
+
+```C
+unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
+cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
+return cmd_line_ptr;
+```
+
+which gets the 64-bit address of the command line from the kernel boot header and returns it.
In the last step we check `cmd_line_ptr`, get its virtual address and copy the command line to `boot_command_line`, which is just an array of bytes:
+
+```C
+extern char __initdata boot_command_line[];
+```
+
+After this we have copies of the kernel command line and the `boot_params` structure. In the next step we can see a call of the `load_ucode_bsp` function which loads processor microcode, but we will not look at it here.
+
+After the microcode has been loaded we can see the check of `console_loglevel` and the `early_printk` function which prints the `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and I sent a patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) - and you will see it in the mainline soon. So you can skip this code.
+
+Move on init pages
+--------------------------------------------------------------------------------
+
+In the next step, as we have copied the `boot_params` structure, we need to move from the early page tables to the page tables for the initialization process. We already set up early page tables for the switchover (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)), dropped all of them in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only the kernel high mapping.
After this we call:
+
+```C
+    clear_page(init_level4_pgt);
+```
+
+and pass it `init_level4_pgt`, which is also defined in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and looks like this:
+
+```assembly
+NEXT_PAGE(init_level4_pgt)
+    .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+    .org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
+    .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+    .org    init_level4_pgt + L4_START_KERNEL*8, 0
+    .quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+```
+
+which maps the first 2 gigabytes and 512 megabytes for the kernel code, data and bss. The `clear_page` function is defined in [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S). Let's look at this function:
+
+```assembly
+ENTRY(clear_page)
+    CFI_STARTPROC
+    xorl %eax,%eax
+    movl $4096/64,%ecx
+    .p2align 4
+    .Lloop:
+    decl %ecx
+#define PUT(x) movq %rax,x*8(%rdi)
+    movq %rax,(%rdi)
+    PUT(1)
+    PUT(2)
+    PUT(3)
+    PUT(4)
+    PUT(5)
+    PUT(6)
+    PUT(7)
+    leaq 64(%rdi),%rdi
+    jnz .Lloop
+    nop
+    ret
+    CFI_ENDPROC
+    .Lclear_page_end:
+    ENDPROC(clear_page)
+```
+
+As you can understand from the function name, it clears the page tables, in other words fills them with zeros. First of all note that this function starts with `CFI_STARTPROC` and ends with `CFI_ENDPROC`, which expand to GNU assembly directives:
+
+```C
+#define CFI_STARTPROC .cfi_startproc
+#define CFI_ENDPROC .cfi_endproc
+```
+
+and are used for debugging. After the `CFI_STARTPROC` macro we zero out the `eax` register and put 64 into `ecx` (it will be a counter). Next we can see the loop which starts with the `.Lloop` label and begins with the `ecx` decrement. After it we store zero from the `rax` register at the address in `rdi`, which now contains the base address of `init_level4_pgt`, and do the same procedure seven more times, each time moving the offset from `rdi` by 8. After this we have the first 64 bytes of `init_level4_pgt` filled with zeros.
In the next step we put the address of `init_level4_pgt` with a 64-byte offset into `rdi` again and repeat all operations until `ecx` reaches zero. In the end we have `init_level4_pgt` filled with zeros.
+
+As we have `init_level4_pgt` filled with zeros, we set the last `init_level4_pgt` entry to the kernel high mapping with:
+
+```C
+init_level4_pgt[511] = early_level4_pgt[511];
+```
+
+Remember that we dropped all `early_level4_pgt` entries in the `reset_early_page_tables` function and kept only the kernel high mapping there.
+
+The last step in the `x86_64_start_kernel` function is the call of:
+
+```C
+x86_64_start_reservations(real_mode_data);
+```
+
+with `real_mode_data` as the argument. The `x86_64_start_reservations` function is defined in the same source code file as the `x86_64_start_kernel` function and looks like this:
+
+```C
+void __init x86_64_start_reservations(char *real_mode_data)
+{
+    if (!boot_params.hdr.version)
+        copy_bootdata(__va(real_mode_data));
+
+    reserve_ebda_region();
+
+    start_kernel();
+}
+```
+
+You can see that it is the last function before we are in the kernel entry point - the `start_kernel` function. Let's look at what it does and how it works.
+
+Last step before kernel entry point
+--------------------------------------------------------------------------------
+
+First of all we can see in the `x86_64_start_reservations` function the check for `boot_params.hdr.version`:
+
+```C
+if (!boot_params.hdr.version)
+    copy_bootdata(__va(real_mode_data));
+```
+
+and if it is zero we call the `copy_bootdata` function again with the virtual address of `real_mode_data` (see its implementation above).
+
+In the next step we can see the call of the `reserve_ebda_region` function which is defined in [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c). This function reserves a memory block for the `EBDA` or Extended BIOS Data Area.
The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters and so on. + +Let's look at the `reserve_ebda_region` function. It starts by checking whether paravirtualization is enabled: + +```C +if (paravirt_enabled()) + return; +``` + +we exit from the `reserve_ebda_region` function if paravirtualization is enabled, because in that case the extended BIOS data area is absent. In the next step we need to get the end of the low memory: + +```C +lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES); +lowmem <<= 10; +``` + +We read the size of the BIOS low memory in kilobytes through its virtual address and convert it to bytes by shifting it left by 10 (multiplying by 1024 in other words). After this we need to get the address of the extended BIOS data area with the: + +```C +ebda_addr = get_bios_ebda(); +``` + +where the `get_bios_ebda` function is defined in the [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bios_ebda.h) and looks like: + +```C +static inline unsigned int get_bios_ebda(void) +{ + unsigned int address = *(unsigned short *)phys_to_virt(0x40E); + address <<= 4; + return address; +} +``` + +Let's try to understand how it works. Here we can see that we are converting the physical address `0x40E` to a virtual one, where `0x0040:0x000e` is the location which contains the segment of the base address of the extended BIOS data area. Don't worry that we are using the `phys_to_virt` function for converting a physical address to a virtual address.
You can note that previously we have used the `__va` macro for the same purpose, but `phys_to_virt` is the same: + +```C +static inline void *phys_to_virt(phys_addr_t address) +{ + return __va(address); +} +``` + +only with one difference: we pass an argument of the `phys_addr_t` type, which depends on `CONFIG_PHYS_ADDR_T_64BIT`: + +```C +#ifdef CONFIG_PHYS_ADDR_T_64BIT + typedef u64 phys_addr_t; +#else + typedef u32 phys_addr_t; +#endif +``` + +On `x86_64` this configuration option is always enabled, so `phys_addr_t` is `u64`. After we have read the segment of the extended BIOS data area through this virtual address, we shift it left by 4 and return the result. After this the `ebda_addr` variable contains the base address of the extended BIOS data area. + +In the next step we check that the address of the extended BIOS data area and the low memory are not less than the `INSANE_CUTOFF` macro + +```C +if (ebda_addr < INSANE_CUTOFF) + ebda_addr = LOWMEM_CAP; + +if (lowmem < INSANE_CUTOFF) + lowmem = LOWMEM_CAP; +``` + +which is: + +```C +#define INSANE_CUTOFF 0x20000U +``` + +or 128 kilobytes. In the last step we take the lower of the low memory and the extended BIOS data area addresses and call the `memblock_reserve` function, which will reserve the memory region for the extended BIOS data between low memory and the one megabyte mark: + +```C +lowmem = min(lowmem, ebda_addr); +lowmem = min(lowmem, LOWMEM_CAP); +memblock_reserve(lowmem, 0x100000 - lowmem); +``` + +The `memblock_reserve` function is defined in [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) and takes two parameters: + +* base physical address; +* region size. + +and reserves a memory region for the given base address and size. `memblock_reserve` is the first function in this book from the linux kernel memory manager framework. We will take a closer look at the memory manager soon, but now let's look at its implementation.
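The clamping arithmetic described above can be tried outside the kernel. The following is only a minimal user-space sketch, not the kernel code: `ebda_reserve_start` and `ebda_reserve_size` are hypothetical helpers, the two 16-bit BIOS values are passed in as plain arguments instead of being read from low memory, and the value of `LOWMEM_CAP` is an assumption because the text above does not show it.

```C
#include <stdint.h>

#define INSANE_CUTOFF 0x20000U   /* 128 kilobytes, as quoted in the text */
#define LOWMEM_CAP    0x9f000U   /* assumed value, not shown in the text */
#define ONE_MEGABYTE  0x100000U

/* Hypothetical helper: start of the region that would be reserved. */
uint32_t ebda_reserve_start(uint16_t lowmem_kb, uint16_t ebda_segment)
{
    uint32_t lowmem = (uint32_t)lowmem_kb << 10;        /* kilobytes to bytes */
    uint32_t ebda_addr = (uint32_t)ebda_segment << 4;   /* segment to linear address */

    /* Values below the cutoff are treated as bogus and replaced by the cap. */
    if (ebda_addr < INSANE_CUTOFF)
        ebda_addr = LOWMEM_CAP;
    if (lowmem < INSANE_CUTOFF)
        lowmem = LOWMEM_CAP;

    /* Take the lowest of the three candidates. */
    if (ebda_addr < lowmem)
        lowmem = ebda_addr;
    if (LOWMEM_CAP < lowmem)
        lowmem = LOWMEM_CAP;

    return lowmem;
}

/* Hypothetical helper: size of the region up to the one megabyte mark. */
uint32_t ebda_reserve_size(uint32_t start)
{
    return ONE_MEGABYTE - start;
}
```

With a BIOS that reports 512 kilobytes of low memory and an EBDA segment of `0x8000`, both candidates are `0x80000`, so under these assumptions the reserved region would start at `0x80000` and cover `0x80000` bytes.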
+ +First touch of the linux kernel memory manager framework +-------------------------------------------------------------------------------- + +In the previous paragraph we stopped at the call of the `memblock_reserve` function and as I said before it is the first function from the memory manager framework. Let's try to understand how it works. The `memblock_reserve` function just calls: + +```C +memblock_reserve_region(base, size, MAX_NUMNODES, 0); +``` + +function and passes 4 parameters there: + +* physical base address of the memory region; +* size of the memory region; +* maximum number of NUMA nodes; +* flags. + +At the start of the `memblock_reserve_region` body we can see the definition of a pointer to the `memblock_type` structure: + +```C +struct memblock_type *_rgn = &memblock.reserved; +``` + +which represents the type of the memory block and looks like this: + +```C +struct memblock_type { + unsigned long cnt; + unsigned long max; + phys_addr_t total_size; + struct memblock_region *regions; +}; +``` + +As we need to reserve a memory block for the extended BIOS data area, the type of the current memory region is reserved, where the `memblock` structure is: + +```C +struct memblock { + bool bottom_up; + phys_addr_t current_limit; + struct memblock_type memory; + struct memblock_type reserved; +#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP + struct memblock_type physmem; +#endif +}; +``` + +and describes a generic memory block. You can see that we initialize `_rgn` with the address of `memblock.reserved`.
`memblock` is a global variable which looks like this: + +```C +struct memblock memblock __initdata_memblock = { + .memory.regions = memblock_memory_init_regions, + .memory.cnt = 1, + .memory.max = INIT_MEMBLOCK_REGIONS, + .reserved.regions = memblock_reserved_init_regions, + .reserved.cnt = 1, + .reserved.max = INIT_MEMBLOCK_REGIONS, +#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP + .physmem.regions = memblock_physmem_init_regions, + .physmem.cnt = 1, + .physmem.max = INIT_PHYSMEM_REGIONS, +#endif + .bottom_up = false, + .current_limit = MEMBLOCK_ALLOC_ANYWHERE, +}; +``` + +We will not dive into the details of this variable now, but we will see them in the parts about the memory manager. Just note that the `memblock` variable is defined with `__initdata_memblock` which is: + +```C +#define __initdata_memblock __meminitdata +``` + +and `__meminitdata` is: + +```C +#define __meminitdata __section(.meminit.data) +``` + +From this we can conclude that all memory blocks will be in the `.meminit.data` section. After we defined `_rgn` we print information about it with the `memblock_dbg` macro. You can enable it by passing `memblock=debug` to the kernel command line. + +After the debugging lines are printed, next is the call of the following function: + +```C +memblock_add_range(_rgn, base, size, nid, flags); +``` + +which adds a new memory block region into the `.meminit.data` section.
As we do not initialize `_rgn` but it just contains `&memblock.reserved`, we just fill the passed `_rgn` with the base address of the extended BIOS data area region, the size of this region and flags: + +```C +if (type->regions[0].size == 0) { + WARN_ON(type->cnt != 1 || type->total_size); + type->regions[0].base = base; + type->regions[0].size = size; + type->regions[0].flags = flags; + memblock_set_region_node(&type->regions[0], nid); + type->total_size = size; + return 0; +} +``` + +After we filled our region we can see the call of the `memblock_set_region_node` function with two parameters: + +* address of the filled memory region; +* NUMA node id. + +where our regions are represented by the `memblock_region` structure: + +```C +struct memblock_region { + phys_addr_t base; + phys_addr_t size; + unsigned long flags; +#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP + int nid; +#endif +}; +``` + +NUMA node id depends on the `MAX_NUMNODES` macro which is defined in the [include/linux/numa.h](https://github.com/torvalds/linux/blob/master/include/linux/numa.h): + +```C +#define MAX_NUMNODES (1 << NODES_SHIFT) +``` + +where `NODES_SHIFT` depends on the `CONFIG_NODES_SHIFT` configuration parameter and is defined as: + +```C +#ifdef CONFIG_NODES_SHIFT + #define NODES_SHIFT CONFIG_NODES_SHIFT +#else + #define NODES_SHIFT 0 +#endif +``` + +The `memblock_set_region_node` function just fills the `nid` field of `memblock_region` with the given value: + +```C +static inline void memblock_set_region_node(struct memblock_region *r, int nid) +{ + r->nid = nid; +} +``` + +After this we will have the first reserved `memblock` for the extended BIOS data area in the `.meminit.data` section. The `reserve_ebda_region` function finished its work on this step and we can go back to the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). + +We finished all preparations before the kernel entry point!
The last step in the `x86_64_start_reservations` function is the call of the: + +```C +start_kernel() +``` + +function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) file. + +That's all for this part. + +Conclusion +-------------------------------------------------------------------------------- + +It is the end of the third part about linux kernel insides. In next part we will see the first initialization steps in the kernel entry point - `start_kernel` function. It will be the first step before we will see launch of the first `init` process. + +If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). + +**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [BIOS data area](http://stanislavs.org/helppc/bios_data_area.html) +* [What is in the extended BIOS data area on a PC?](http://www.kryslix.com/nsfaq/Q.6.html) +* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) diff --git a/Initialization/linux-initialization-4.md b/Initialization/linux-initialization-4.md new file mode 100644 index 0000000..bc23ec3 --- /dev/null +++ b/Initialization/linux-initialization-4.md @@ -0,0 +1,452 @@ +Kernel initialization. Part 4. 
+================================================================================ + +Kernel entry point +================================================================================ + +If you have read the previous part - [Last preparations before the kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md), you can remember that we finished all pre-initialization stuff and stopped right before the call to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). The `start_kernel` is the entry of the generic and architecture independent kernel code, although we will return to the `arch/` folder many times. If you look inside of the `start_kernel` function, you will see that this function is very big. For this moment it contains about `86` calls of functions. Yes, it's very big and of course this part will not cover all the processes that occur in this function. In the current part we will only start to do it. This part and all the next which will be in the [Kernel initialization process](https://github.com/0xAX/linux-insides/blob/master/Initialization/README.md) chapter will cover it. + +The main purpose of the `start_kernel` to finish kernel initialization process and launch the first `init` process. Before the first process will be started, the `start_kernel` must do many things such as: to enable [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), to initialize processor id, to enable early [cgroups](http://en.wikipedia.org/wiki/Cgroups) subsystem, to setup per-cpu areas, to initialize different caches in [vfs](http://en.wikipedia.org/wiki/Virtual_file_system), to initialize memory manager, rcu, vmalloc, scheduler, IRQs, ACPI and many many more. Only after these steps will we see the launch of the first `init` process in the last part of this chapter. So much kernel code awaits us, let's start. 
+ +**NOTE: All parts from this big chapter `Linux Kernel initialization process` will not cover anything about debugging. There will be a separate chapter about kernel debugging tips.** + +A little about function attributes +--------------------------------------------------------------------------------- + +As I wrote above, the `start_kernel` function is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). This function defined with the `__init` attribute and as you already may know from other parts, all functions which are defined with this attribute are necessary during kernel initialization. + +```C +#define __init __section(.init.text) __cold notrace +``` + +After the initialization process have finished, the kernel will release these sections with a call to the `free_initmem` function. Note also that `__init` is defined with two attributes: `__cold` and `notrace`. The purpose of the first `cold` attribute is to mark that the function is rarely used and the compiler must optimize this function for size. The second `notrace` is defined as: + +```C +#define notrace __attribute__((no_instrument_function)) +``` + +where `no_instrument_function` says to the compiler not to generate profiling function calls. + +In the definition of the `start_kernel` function, you can also see the `__visible` attribute which expands to the: + +``` +#define __visible __attribute__((externally_visible)) +``` + +where `externally_visible` tells to the compiler that something uses this function or variable, to prevent marking this function/variable as `unusable`. You can find the definition of this and other macro attributes in [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h). 
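To see how such attribute macros combine, here is a minimal user-space sketch, assuming GCC or Clang on an ELF target such as Linux. The `my_` prefixed macros, the `.my_init.text` section name and the `early_setup` function are hypothetical names chosen for this example; they only mimic the shape of the kernel's `__init` and are not taken from the kernel source.

```C
/* Hypothetical user-space analogues of the kernel macros quoted above. */
#define my_section(name) __attribute__((__section__(name)))
#define my_cold          __attribute__((__cold__))
#define my_notrace       __attribute__((no_instrument_function))

/* Place the function in its own section, mark it as rarely executed
 * and exclude it from -finstrument-functions profiling hooks. */
#define my_init my_section(".my_init.text") my_cold my_notrace

static int setup_done;

/* A one-time setup routine declared the same way as an __init function. */
int my_init early_setup(void)
{
    setup_done = 1;
    return 0;
}
```

After building with `gcc -c`, running `objdump -h` on the object file should list a separate `.my_init.text` section that holds `early_setup`. Grouping such functions into one section is what allows the kernel to free them all at once after initialization.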
+ +First steps in the start_kernel +-------------------------------------------------------------------------------- + +At the beginning of the `start_kernel` you can see the definition of these two variables: + +```C +char *command_line; +char *after_dashes; +``` + +The first represents a pointer to the kernel command line and the second will contain the result of the `parse_args` function which parses an input string with parameters in the form `name=value`, looking for specific keywords and invoking the right handlers. We will not go into the details related with these two variables at this time, but will see it in the next parts. In the next step we can see a call to the: + +```C +lockdep_init(); +``` + +function. `lockdep_init` initializes [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). Its implementation is pretty simple, it just initializes two [list_head](https://github.com/0xAX/linux-insides/blob/master/DataStructures/dlist.md) hashes and sets the `lockdep_initialized` global variable to `1`. Lock validator detects circular lock dependencies and is called when any [spinlock](http://en.wikipedia.org/wiki/Spinlock) or [mutex](http://en.wikipedia.org/wiki/Mutual_exclusion) is acquired. + +The next function is `set_task_stack_end_magic` which takes address of the `init_task` and sets `STACK_END_MAGIC` (`0x57AC6E9D`) as canary for it. `init_task` represents the initial task structure: + +```C +struct task_struct init_task = INIT_TASK(init_task); +``` + +where `task_struct` stores all the information about a process. I will not explain this structure in this book because it's very big. You can find its definition in [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L1278). At this moment `task_struct` contains more than `100` fields! 
Although you will not see the explanation of the `task_struct` in this book, we will use it very often since it is the fundamental structure which describes the `process` in the Linux kernel. I will describe the meaning of the fields of this structure as we meet them in practice. + +You can see the definition of the `init_task` and that it is initialized by the `INIT_TASK` macro. This macro is from [include/linux/init_task.h](https://github.com/torvalds/linux/blob/master/include/linux/init_task.h) and it just fills the `init_task` with the values for the first process. For example it sets: + +* init process state to zero or `runnable`. A runnable process is one which is waiting only for a CPU to run on; +* init process flags - `PF_KTHREAD` which means - kernel thread; +* a list of runnable tasks; +* process address space; +* init process stack to the `&init_thread_info` which is `init_thread_union.thread_info`, where `init_thread_union` has the type `thread_union` which contains `thread_info` and the process stack: + +```C +union thread_union { + struct thread_info thread_info; + unsigned long stack[THREAD_SIZE/sizeof(long)]; +}; +``` + +Every process has its own stack and it is 16 kilobytes or 4 page frames in `x86_64`. We can note that it is defined as an array of `unsigned long`. The other member of the `thread_union` is `thread_info`, defined as: + +```C +struct thread_info { + struct task_struct *task; + struct exec_domain *exec_domain; + __u32 flags; + __u32 status; + __u32 cpu; + int saved_preempt_count; + mm_segment_t addr_limit; + struct restart_block restart_block; + void __user *sysenter_return; + unsigned int sig_on_uaccess_error:1; + unsigned int uaccess_err:1; +}; +``` + +and occupies 52 bytes. The `thread_info` structure contains architecture-specific information on the thread. We know that on `x86_64` the stack grows down and `thread_union.thread_info` is stored at the bottom of the stack in our case.
So the process stack is 16 kilobytes and `thread_info` is at the bottom. The remaining space for the stack will be `16 kilobytes - 52 bytes = 16332 bytes`. Note that `thread_union` is represented as a [union](http://en.wikipedia.org/wiki/Union_type) and not a structure, which means that `thread_info` and the stack share the memory space. + +Schematically it can be represented as follows: + +```C ++-----------------------+ +| | +| | +| stack | +| | +|_______________________| +| | | +| | | +| | | +|__________↓____________| +--------------------+ +| | | | +| thread_info |<----------->| task_struct | +| | | | ++-----------------------+ +--------------------+ +``` + +http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct + +So the `INIT_TASK` macro fills these `task_struct` fields and many many more. As I already wrote above, I will not describe all the fields and values in the `INIT_TASK` macro but we will see them soon. + +Now let's go back to the `set_task_stack_end_magic` function. This function is defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c#L297) and sets a [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow) at the end of the `init` process stack to detect stack overflow. + +```C +void set_task_stack_end_magic(struct task_struct *tsk) +{ + unsigned long *stackend; + stackend = end_of_stack(tsk); + *stackend = STACK_END_MAGIC; /* for overflow detection */ +} +``` + +Its implementation is simple. `set_task_stack_end_magic` gets the end of the stack for the given `task_struct` with the `end_of_stack` function. The end of a process stack depends on the `CONFIG_STACK_GROWSUP` configuration option. As we learned, in the `x86_64` architecture the stack grows down.
So the end of the process stack will be: + +```C +(unsigned long *)(task_thread_info(p) + 1); +``` + +where `task_thread_info` just returns the stack which we filled with the `INIT_TASK` macro: + +```C +#define task_thread_info(task) ((struct thread_info *)(task)->stack) +``` + +As we got the end of the init process stack, we write `STACK_END_MAGIC` there. After `canary` is set, we can check it like this: + +```C +if (*end_of_stack(task) != STACK_END_MAGIC) { + // + // handle stack overflow here + // +} +``` + +The next function after the `set_task_stack_end_magic` is `smp_setup_processor_id`. This function has an empty body for `x86_64`: + +```C +void __init __weak smp_setup_processor_id(void) +{ +} +``` + +as it is implemented only for some architectures, such as [s390](http://en.wikipedia.org/wiki/IBM_ESA/390) and [arm64](http://en.wikipedia.org/wiki/ARM_architecture#64.2F32-bit_architecture). + +The next function in `start_kernel` is `debug_objects_early_init`. Implementation of this function is almost the same as `lockdep_init`, but fills hashes for object debugging. As I wrote above, we will not see the explanation of this and other functions which are for debugging purposes in this chapter. + +After the `debug_objects_early_init` function we can see the call of the `boot_init_stack_canary` function which fills `task_struct->stack_canary` with the canary value for the `-fstack-protector` gcc feature.
This function depends on the `CONFIG_CC_STACKPROTECTOR` configuration option and if this option is disabled, `boot_init_stack_canary` does nothing, otherwise it generates random numbers based on random pool and the [TSC](http://en.wikipedia.org/wiki/Time_Stamp_Counter): + +```C +get_random_bytes(&canary, sizeof(canary)); +tsc = __native_read_tsc(); +canary += tsc + (tsc << 32UL); +``` + +After we got a random number, we fill the `stack_canary` field of `task_struct` with it: + +```C +current->stack_canary = canary; +``` + +and write this value to the top of the IRQ stack with the: + +```C +this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write +``` + +Again, we will not dive into details here, we will cover it in the part about [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). As canary is set, we disable local and early boot IRQs and register the bootstrap CPU in the CPU maps. We disable local IRQs (interrupts for current CPU) with the `local_irq_disable` macro which expands to the call of the `arch_local_irq_disable` function from [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h): + +```C +static inline notrace void arch_local_irq_disable(void) +{ + native_irq_disable(); +} +``` + +Where `native_irq_disable` is the `cli` instruction for `x86_64`. As interrupts are disabled we can register the current CPU with the given ID in the CPU bitmap. + +The first processor activation +--------------------------------------------------------------------------------- + +The current function from the `start_kernel` is `boot_cpu_init`. This function initializes various CPU masks for the bootstrap processor. First of all it gets the bootstrap processor id with a call to: + +```C +int cpu = smp_processor_id(); +``` + +For now it is just zero.
If the `CONFIG_DEBUG_PREEMPT` configuration option is disabled, `smp_processor_id` just expands to the call of `raw_smp_processor_id` which expands to the: + +```C +#define raw_smp_processor_id() (this_cpu_read(cpu_number)) +``` + +`this_cpu_read`, like many other functions of this kind (`this_cpu_write`, `this_cpu_add` and so on), is defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) and represents a `this_cpu` operation. These operations provide a way of optimizing access to the [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Theory/per-cpu.html) variables which are associated with the current processor. In our case it is `this_cpu_read`: + +``` +__pcpu_size_call_return(this_cpu_read_, pcp) +``` + +Remember that we have passed `cpu_number` as `pcp` to the `this_cpu_read` from the `raw_smp_processor_id`. Now let's look at the `__pcpu_size_call_return` implementation: + +```C +#define __pcpu_size_call_return(stem, variable) \ +({ \ + typeof(variable) pscr_ret__; \ + __verify_pcpu_ptr(&(variable)); \ + switch(sizeof(variable)) { \ + case 1: pscr_ret__ = stem##1(variable); break; \ + case 2: pscr_ret__ = stem##2(variable); break; \ + case 4: pscr_ret__ = stem##4(variable); break; \ + case 8: pscr_ret__ = stem##8(variable); break; \ + default: \ + __bad_size_call_parameter(); break; \ + } \ + pscr_ret__; \ +}) +``` + +Yes, it looks a little strange but it's easy. First of all we can see the definition of the `pscr_ret__` variable with the `int` type. Why int? Ok, `variable` is `cpu_number` and it was declared as a per-cpu int variable: + +```C +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number); +``` + +In the next step we call `__verify_pcpu_ptr` with the address of `cpu_number`. `__verify_pcpu_ptr` is used to verify that the given parameter is a per-cpu pointer. After that we set the `pscr_ret__` value, which depends on the size of the variable. Our `cpu_number` variable is `int`, so it is 4 bytes in size.
It means that we will get the result of `this_cpu_read_4(cpu_number)` in `pscr_ret__`. At the end of the `__pcpu_size_call_return` we just return it. `this_cpu_read_4` is a macro: + +```C +#define this_cpu_read_4(pcp) percpu_from_op("mov", pcp) +``` + +which calls `percpu_from_op` and passes the `mov` instruction and the per-cpu variable there. `percpu_from_op` will expand to the inline assembly call: + +```C +asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (cpu_number)) +``` + +Let's try to understand how it works and what it does. The `gs` segment register contains the base of the per-cpu area. Here we just copy `cpu_number`, which is in memory, to the `pfo_ret__` with the `movl` instruction. Or in other words: + +```C +this_cpu_read(cpu_number) +``` + +is the same as: + +```C +movl %gs:$cpu_number, $pfo_ret__ +``` + +As we didn't set up the per-cpu areas yet, we have only one, for the currently running CPU, and we will get `zero` as a result of the `smp_processor_id`. + +As we got the current processor id, `boot_cpu_init` sets the given CPU online, active, present and possible with the: + +```C +set_cpu_online(cpu, true); +set_cpu_active(cpu, true); +set_cpu_present(cpu, true); +set_cpu_possible(cpu, true); +``` + +All of these functions use the concept - `cpumask`. `cpu_possible` is a set of CPU ID's which can be plugged in at any time during the life of that system boot. `cpu_present` represents which CPUs are currently plugged in. `cpu_online` represents a subset of the `cpu_present` and indicates CPUs which are available for scheduling. These masks depend on the `CONFIG_HOTPLUG_CPU` configuration option and if this option is disabled `possible == present` and `active == online`. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is `true`, it calls `cpumask_set_cpu` or `cpumask_clear_cpu` otherwise. + +For example let's look at `set_cpu_possible`.
As we passed `true` as the second parameter, the: + +```C +cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits)); +``` + +will be called. First of all let's try to understand the `to_cpumask` macro. This macro casts a bitmap to a `struct cpumask *`. CPU masks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. A CPU mask is represented by the `cpumask` structure: + +```C +typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t; +``` + +which is just a bitmap declared with the `DECLARE_BITMAP` macro: + +```C +#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)] +``` + +As we can see from its definition, the `DECLARE_BITMAP` macro expands to an array of `unsigned long`. Now let's look at how the `to_cpumask` macro is implemented: + +```C +#define to_cpumask(bitmap) \ + ((struct cpumask *)(1 ? (bitmap) \ + : (void *)sizeof(__check_is_bitmap(bitmap)))) +``` + +I don't know about you, but it looked really weird to me the first time. We can see a ternary operator here which is `true` every time, but why is the `__check_is_bitmap` here? It's simple, let's look at it: + +```C +static inline int __check_is_bitmap(const unsigned long *bitmap) +{ + return 1; +} +``` + +Yeah, it just returns `1` every time. Actually we need it here for only one purpose: at compile time it checks that the given `bitmap` is a bitmap, or in other words it checks that the given `bitmap` has a type of `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting the array of `unsigned long` to the `struct cpumask *`. Now we can call the `cpumask_set_cpu` function with the `cpu` - 0 and `struct cpumask *cpu_possible_bits`. This function makes only one call of the `set_bit` function which sets the given `cpu` in the cpumask. All of these `set_cpu_*` functions work on the same principle.
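The `to_cpumask` trick and the bit setting can be reproduced in user space. This is a minimal sketch under stated assumptions, not the kernel implementation: `NR_CPUS` is fixed to 8 for the example, `check_is_bitmap`, `mask_set_cpu` and `mask_test_cpu` are hypothetical stand-ins for `__check_is_bitmap`, `cpumask_set_cpu` and a test helper, and the bit is set with a plain non-atomic OR instead of the kernel's `set_bit`.

```C
#include <limits.h>

#define NR_CPUS 8  /* assumption for this sketch */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); };

/* Never called at run time: it appears only inside sizeof(), so the
 * compiler merely type-checks the argument against unsigned long *. */
static inline int check_is_bitmap(const unsigned long *bitmap)
{
    (void)bitmap;
    return 1;
}

/* The condition is always 1, so the result is always the bitmap itself;
 * the other branch exists only to trigger the type check. */
#define to_cpumask(bitmap) \
    ((struct cpumask *)(1 ? (bitmap) \
                          : (void *)sizeof(check_is_bitmap(bitmap))))

/* Non-atomic stand-in for cpumask_set_cpu. */
void mask_set_cpu(unsigned int cpu, struct cpumask *mask)
{
    mask->bits[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

/* Returns 1 if the bit for the given cpu is set, 0 otherwise. */
int mask_test_cpu(unsigned int cpu, const struct cpumask *mask)
{
    return (int)((mask->bits[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1UL);
}

static DECLARE_BITMAP(cpu_possible_bits, NR_CPUS);
```

Calling `mask_set_cpu(0, to_cpumask(cpu_possible_bits))` sets bit 0 of the bitmap. Passing something that is not an `unsigned long` array or pointer to `to_cpumask`, for example a `char` buffer, should make the compiler complain about an incompatible pointer type, which is the whole purpose of the `sizeof` branch.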
+ +If you're not sure that these `set_cpu_*` operations and `cpumask` are clear to you, don't worry about it. You can get more info by reading the special part about it - [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) or [documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). + +As we activated the bootstrap processor, it's time to go to the next function in the `start_kernel`. Now it is `page_address_init`, but this function does nothing in our case, because it executes only when all `RAM` can't be mapped directly. + +Print linux banner +--------------------------------------------------------------------------------- + +The next call is `pr_notice`: + +```C +#define pr_notice(fmt, ...) \ + printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__) +``` + +as you can see it just expands to the `printk` call. At this moment we use `pr_notice` to print the Linux banner: + +```C +pr_notice("%s", linux_banner); +``` + +which is just the kernel version with some additional parameters: + +``` +Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #319 SMP +``` + +Architecture-dependent parts of initialization +--------------------------------------------------------------------------------- + +The next step is architecture-specific initialization. The Linux kernel does it with the call of the `setup_arch` function. This is a very big function like `start_kernel` and we do not have time to consider all of its implementation in this part. Here we'll only start to do it and continue in the next part. As it is `architecture-specific`, we need to go again to the `arch/` directory. The `setup_arch` function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and takes only one argument - address of the kernel command line.
+ +This function starts by reserving a memory block for the kernel `_text` and `_data`, which starts from the `_text` symbol (you can remember it from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L46)) and ends before `__bss_stop`. We are using `memblock` for reserving the memory block: + +```C +memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text); +``` + +You can read about `memblock` in the [Linux kernel memory management Part 1.](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). As you can remember, the `memblock_reserve` function takes two parameters: + +* base physical address of a memory block; +* size of a memory block. + +We can get the base physical address of the `_text` symbol with the `__pa_symbol` macro: + +```C +#define __pa_symbol(x) \ + __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x))) +``` + +First of all it calls the `__phys_reloc_hide` macro on the given parameter. The `__phys_reloc_hide` macro does nothing for `x86_64` and just returns the given parameter. Implementation of the `__phys_addr_symbol` macro is easy. It just subtracts the base virtual address of the kernel text mapping (you can remember that it is `__START_KERNEL_map`) from the symbol address and adds `phys_base`, which accounts for the physical address the kernel was loaded at: + +```C +#define __phys_addr_symbol(x) \ + ((unsigned long)(x) - __START_KERNEL_map + phys_base) +``` + +After we got the physical address of the `_text` symbol, `memblock_reserve` can reserve a memory block which starts at `_text` and has the size `__bss_stop - _text`. + +Reserve memory for initrd +--------------------------------------------------------------------------------- + +The next step, after we reserved place for the kernel text and data, is reserving place for the [initrd](http://en.wikipedia.org/wiki/Initrd).
We will not see details about `initrd` in this post, you just need to know that it is a temporary root file system stored in memory and used by the kernel during its startup. The `early_reserve_initrd` function does all the work. First of all this function gets the base address of the ram disk, its size and the end address with: + +```C +u64 ramdisk_image = get_ramdisk_image(); +u64 ramdisk_size = get_ramdisk_size(); +u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size); +``` + +All of these parameters are taken from `boot_params`. If you have read the chapter about [Linux Kernel Booting Process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html), you must remember that we filled the `boot_params` structure during boot time. The kernel setup header contains a couple of fields which describe the ramdisk, for example: + +``` +Field name: ramdisk_image +Type: write (obligatory) +Offset/size: 0x218/4 +Protocol: 2.00+ + + The 32-bit linear address of the initial ramdisk or ramfs. Leave at + zero if there is no initial ramdisk/ramfs. +``` + +So we can get all the information that interests us from `boot_params`. For example let's look at `get_ramdisk_image`: + +```C +static u64 __init get_ramdisk_image(void) +{ + u64 ramdisk_image = boot_params.hdr.ramdisk_image; + + ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32; + + return ramdisk_image; +} +``` + +Here we take the low 32 bits of the ramdisk address from `boot_params.hdr.ramdisk_image` and OR them with `boot_params.ext_ramdisk_image` shifted left by `32`. We need to do it because as you can read in the [Documentation/x86/zero-page.txt](https://github.com/0xAX/linux/blob/master/Documentation/x86/zero-page.txt): + +``` +0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits +``` + +So after combining both halves, we get a 64-bit address in `ramdisk_image` and return it. `get_ramdisk_size` works on the same principle as `get_ramdisk_image`, but it uses `ext_ramdisk_size` instead of `ext_ramdisk_image`.
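The combination of the two 32-bit halves and the page alignment of the end address can be checked with a small user-space sketch. The names `combine_halves` and `page_align` are hypothetical, and a page size of 4096 bytes is assumed; the kernel itself uses the `boot_params` fields and the `PAGE_ALIGN` macro shown above.

```C
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL  /* assumed page size for this sketch */

/* Build a 64-bit value from its low and high 32-bit halves. */
uint64_t combine_halves(uint32_t low, uint32_t high)
{
    uint64_t value = low;

    /* The cast matters: shifting a 32-bit value by 32 would be undefined. */
    value |= (uint64_t)high << 32;
    return value;
}

/* Round an address up to the next page boundary. */
uint64_t page_align(uint64_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}
```

For example, a low half of `0x7f000000` and a high half of `0x1` give the address `0x17f000000`, and an end address of `0x1001` is rounded up to `0x2000`.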
After we got ramdisk's size, base address and end address, we check that bootloader provided ramdisk with the: + +```C +if (!boot_params.hdr.type_of_loader || + !ramdisk_image || !ramdisk_size) + return; +``` + +and reserve memory block with the calculated addresses for the initial ramdisk in the end: + +```C +memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image); +``` + +Conclusion +--------------------------------------------------------------------------------- + +It is the end of the fourth part about the Linux kernel initialization process. We started to dive in the kernel generic code from the `start_kernel` function in this part and stopped on the architecture-specific initialization in the `setup_arch`. In the next part we will continue with architecture-dependent initialization steps. + +If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). + +**Please note that English is not my first language, And I am really sorry for any inconvenience. 
If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [GCC function attributes](https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html) +* [this_cpu operations](https://www.kernel.org/doc/Documentation/this_cpu_ops.txt) +* [cpumask](http://www.crashcourse.ca/wiki/index.php/Cpumask) +* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) +* [cgroups](http://en.wikipedia.org/wiki/Cgroups) +* [stack buffer overflow](http://en.wikipedia.org/wiki/Stack_buffer_overflow) +* [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) +* [initrd](http://en.wikipedia.org/wiki/Initrd) +* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md) diff --git a/Initialization/linux-initialization-5.md b/Initialization/linux-initialization-5.md new file mode 100644 index 0000000..f032d44 --- /dev/null +++ b/Initialization/linux-initialization-5.md @@ -0,0 +1,512 @@ +Kernel initialization. Part 5. +================================================================================ + +Continue of architecture-specific initialization +================================================================================ + +In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L856) function and now we will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will skip functions related with it. So let's go ahead. 
The next step is the `early_trap_init` function. This function initializes the debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupt gates. If you don't know anything about interrupts, you can read about them in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In the `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You may remember, we already saw it in the [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts and exceptions: + +``` +---------------------------------------------------------------------------------------------- +|Vector|Mnemonic|Description |Type |Error Code|Source | +---------------------------------------------------------------------------------------------- +|3 | #BP |Breakpoint |Trap |NO |INT 3 | +---------------------------------------------------------------------------------------------- +``` + +Debug interrupt `#DB` is the primary method of invoking debuggers. `early_trap_init` is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). This function sets the `#DB` and `#BP` handlers and reloads the [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table): + +```C +void __init early_trap_init(void) +{ + set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK); + set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK); + load_idt(&idt_descr); +} +``` + +We already saw the implementation of `set_intr_gate` in the previous part about interrupts. Here are two similar functions `set_intr_gate_ist` and `set_system_intr_gate_ist`.
Both of these functions take three parameters: + +* number of the interrupt; +* base address of the interrupt/exception handler; +* third parameter is the `Interrupt Stack Table` index. `IST` is a new mechanism in `x86_64` and part of the [TSS](http://en.wikipedia.org/wiki/Task_state_segment). Every active thread in kernel mode has its own kernel stack which is 16 kilobytes. While a thread is in user space, its kernel stack is empty except for `thread_info` (read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)) at the bottom. In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. You can read all about these stacks in the linux kernel documentation - [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks). `x86_64` provides a feature which allows switching to a new `special` stack during events such as a non-maskable interrupt and etc... The name of this feature is `Interrupt Stack Table`. There can be up to 7 `IST` entries per CPU and every entry points to a dedicated stack. In our case this is `DEBUG_STACK`. + +`set_intr_gate_ist` and `set_system_intr_gate_ist` work by the same principle as `set_intr_gate` with only one difference. Both of these functions check the +interrupt number and call `_set_gate` inside: + +```C +BUG_ON((unsigned)n > 0xFF); +_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS); +``` + +as `set_intr_gate` does. But `set_intr_gate` calls `_set_gate` with [dpl](http://en.wikipedia.org/wiki/Privilege_level) - 0, and ist - 0, while `set_intr_gate_ist` and `set_system_intr_gate_ist` set `ist` to `DEBUG_STACK`, and `set_system_intr_gate_ist` also sets `dpl` to `0x3` which is the lowest privilege. When an interrupt occurs and the hardware loads such a descriptor, the hardware automatically sets the new stack pointer based on the IST value and then invokes the interrupt handler.
All of the special kernel stacks will be set up in the `cpu_init` function (we will see it later). + +As the `#DB` and `#BP` gates are written to the `idt_descr`, we reload the `IDT` table with `load_idt` which just calls the `lidt` instruction. Now let's look at the interrupt handlers and try to understand how they work. Of course, I can't cover all interrupt handlers in this book and I do not see the point in this. It is very interesting to delve into the linux kernel source code, so we will see how the `debug` handler is implemented in this part, and understanding how other interrupt handlers are implemented will be your task. + +#DB handler +-------------------------------------------------------------------------------- + +As you can read above, we passed the address of the `#DB` handler as `&debug` in the `set_intr_gate_ist`. [lxr.free-electrons.com](http://lxr.free-electrons.com/ident) is a great resource for searching identifiers in the linux kernel source code, but unfortunately you will not find the `debug` handler with it. All you can find is the `debug` declaration in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/traps.h): + +```C +asmlinkage void debug(void); +``` + +We can see the `asmlinkage` attribute which tells us that `debug` is a function written in [assembly](http://en.wikipedia.org/wiki/Assembly_language). Yeah, again and again assembly :). The implementation of the `#DB` handler, like other handlers, is in [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) and is defined with the `idtentry` assembly macro: + +```assembly +idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK +``` + +`idtentry` is a macro which defines an interrupt/exception entry point.
As you can see it takes five arguments: + +* name of the interrupt entry point; +* name of the interrupt handler; +* has interrupt error code or not; +* paranoid - if this parameter = 1, switch to special stack (read above); +* shift_ist - stack to switch during interrupt. + +Now let's look at the `idtentry` macro implementation. This macro is defined in the same assembly file and defines the `debug` function with the `ENTRY` macro. First, the `idtentry` macro checks that the given parameters are correct in case we need to switch to the special stack. In the next step it checks whether the given interrupt has an error code. If the interrupt has an error code, it uses `XCPT_FRAME`; if it does not (in our case `#DB` does not have an error code), it uses `INTR_FRAME`. Both of these macros, `XCPT_FRAME` and `INTR_FRAME`, generate no code and are needed only for building the initial frame state for interrupts. They use `CFI` directives and are used for debugging. More info you can find in the [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html). As the comment from the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) says: `CFI macros are used to generate dwarf2 unwind information for better backtraces. They don't change any code.` so we will ignore them. + +```assembly +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 +ENTRY(\sym) + /* Sanity check */ + .if \shift_ist != -1 && \paranoid == 0 + .error "using shift_ist requires paranoid=1" + .endif + + .if \has_error_code + XCPT_FRAME + .else + INTR_FRAME + .endif + ... + ... + ...
+``` + +You may remember from the previous part about early interrupts/exceptions handling that after an interrupt occurs, the current stack will have the following format: + +``` + +-----------------------+ + | | ++40 | SS | ++32 | RSP | ++24 | RFLAGS | ++16 | CS | ++8 | RIP | + 0 | Error Code | <---- rsp + | | + +-----------------------+ +``` + +The next two macros from the `idtentry` implementation are: + +```assembly + ASM_CLAC + PARAVIRT_ADJUST_EXCEPTION_FRAME +``` + +The first `ASM_CLAC` macro depends on the `CONFIG_X86_SMAP` configuration option and is needed for security reasons; you can read more about it [here](https://lwn.net/Articles/517475/). The second `PARAVIRT_ADJUST_EXCEPTION_FRAME` macro is for handling Xen-type exceptions (this chapter is about kernel initialization and we will not consider virtualization stuff here). + +The next piece of code checks if the interrupt has an error code or not and, if not, pushes `$-1` (which is `0xffffffffffffffff` on `x86_64`) on the stack: + +```assembly + .ifeq \has_error_code + pushq_cfi $-1 + .endif +``` + +We need to do it as a `dummy` error code for stack consistency for all interrupts. In the next step we subtract `$ORIG_RAX-R15` from the stack pointer: + +```assembly + subq $ORIG_RAX-R15, %rsp +``` + +where `ORIG_RAX`, `R15` and other macros are defined in the [arch/x86/include/asm/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/calling.h) and `ORIG_RAX-R15` is 120 bytes. General purpose registers will occupy these 120 bytes because we need to store all registers on the stack during interrupt handling. After we set up the stack for general purpose registers, the next step is checking whether the interrupt came from userspace with: + +```assembly +testl $3, CS(%rsp) +jnz 1f +``` + +Here we check the first and second bits in the `CS`. You may remember that the `CS` register contains a segment selector where the first two bits are the `RPL`.
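
The `testl $3, CS(%rsp)` check can be mirrored in a few lines of C. This is a hedged sketch rather than kernel code: the `sketch_*` helpers are hypothetical names, and the selector values `0x10` and `0x33` used in the note below are example kernel and user code selectors assumed for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a segment selector hold the privilege level. */
static unsigned int sketch_selector_rpl(uint16_t selector)
{
    return selector & 0x3;
}

/* A non-zero value in the saved CS means the interrupted code
 * was not running in ring 0. */
static bool sketch_came_from_user(uint16_t saved_cs)
{
    return sketch_selector_rpl(saved_cs) != 0;
}
```

For a saved `CS` of `0x10` the helper reports kernel mode, while `0x33` has both low bits set and is treated as userspace, which corresponds to the `jnz 1f` branch being taken.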
All privilege levels are integers in the range 0–3, where the lowest number corresponds to the highest privilege. So if the interrupt came from kernel mode we call `save_paranoid`, otherwise we jump to label `1`. In `save_paranoid` we store all general purpose registers on the stack and switch the user `gs` to the kernel `gs` if needed: + +```assembly + movl $1,%ebx + movl $MSR_GS_BASE,%ecx + rdmsr + testl %edx,%edx + js 1f + SWAPGS + xorl %ebx,%ebx +1: ret +``` + +In the next steps we put the `pt_regs` pointer into `rdi`, save the error code in `rsi` if there is one, and call the interrupt handler, which is `do_debug` in our case, from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). `do_debug` like other handlers takes two parameters: + +* pt_regs - is a structure which presents set of CPU registers which are saved in the process' memory region; +* error code - error code of interrupt. + +After the interrupt handler has finished its work, the entry code calls `paranoid_exit` which restores the stack, switches to userspace if the interrupt came from there, and calls `iret`. That's all. Of course it is not all :), but we will look at it more deeply in the separate chapter about interrupts. + +This is the general view of the `idtentry` macro for the `#DB` interrupt. All interrupts are similar to this implementation and are defined with `idtentry` too. After `early_trap_init` has finished its work, the next function is `early_cpu_init`. This function is defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) and collects information about the CPU and its vendor. + +Early ioremap initialization +-------------------------------------------------------------------------------- + +The next step is the initialization of early `ioremap`. In general there are two ways to communicate with devices: + +* I/O Ports; +* Device memory.
+ +We already saw the first method (`outb/inb` instructions) in the part about the linux kernel booting [process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html). The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM which can be mapped on memory of the I/O device. So `ioremap` is used to map device memory into the kernel address space. + +As I wrote above, the next function is `early_ioremap_init` which re-maps I/O memory to the kernel address space so the kernel can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like `ioremap` are available. The implementation of this function is in the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). At the start of `early_ioremap_init` we can see the definition of the `pmd` pointer of `pmd_t` type (which presents a page middle directory entry `typedef struct { pmdval_t pmd; } pmd_t;` where `pmdval_t` is `unsigned long`) and a check that `fixmap` is aligned in a correct way: + +```C +pmd_t *pmd; +BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1)); +``` + +`fixmap` is a set of fixed virtual address mappings which extends from `FIXADDR_START` to `FIXADDR_TOP`. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check `early_ioremap_init` calls the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c).
`early_ioremap_setup` fills the `slot_virt` array of `unsigned long` with the virtual addresses of the 512 temporary boot-time fix-mappings: + +```C +for (i = 0; i < FIX_BTMAPS_SLOTS; i++) + slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i); +``` + +After this we get the page middle directory entry for `FIX_BTMAP_BEGIN` and put it into the `pmd` variable, fill `bm_pte` (the boot time page tables) with zeros and call the `pmd_populate_kernel` function to set the given page table entry in the given page middle directory: + +```C +pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN)); +memset(bm_pte, 0, sizeof(bm_pte)); +pmd_populate_kernel(&init_mm, pmd, bm_pte); +``` + +That's all for this. If you are feeling puzzled, don't worry. There is a special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) chapter. + +Obtaining major and minor numbers for the root device +-------------------------------------------------------------------------------- + +After early `ioremap` was initialized, you can see the following code: + +```C +ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev); +``` + +This code obtains major and minor numbers for the root device where `initrd` will be mounted later in the `do_mount_root` function. The major number of the device identifies the driver associated with the device. The minor number refers to the specific device controlled by that driver. Note that `old_decode_dev` takes one parameter from the `boot_params` structure. As we can read from the x86 linux kernel boot protocol: + +``` +Field name: root_dev +Type: modify (optional) +Offset/size: 0x1fc/2 +Protocol: ALL + + The default root device device number. The use of this field is + deprecated, use the "root=" option on the command line instead. +``` + +Now let's try to understand what `old_decode_dev` does. Actually it just calls `MKDEV` inside which generates `dev_t` from the given major and minor numbers.
Its implementation is pretty simple: + +```C +static inline dev_t old_decode_dev(u16 val) +{ + return MKDEV((val >> 8) & 255, val & 255); +} +``` + +where `dev_t` is a kernel data type to present major/minor number pair. But what's the strange `old_` prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bits for the major number and 8 bits for the minor number. But there is a problem: only 256 major numbers and 256 minor numbers are possible. So the 16-bit integer was replaced by a 32-bit integer where 12 bits are reserved for the major number and 20 bits for the minor. You can see this in the `new_decode_dev` implementation: + +```C +static inline dev_t new_decode_dev(u32 dev) +{ + unsigned major = (dev & 0xfff00) >> 8; + unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00); + return MKDEV(major, minor); +} +``` + +For example, if `dev` is `0xffffffff`, this calculation gives `0xfff` (12 bits) for `major` and `0xfffff` (20 bits) for `minor`. In our case, at the end of the execution of `old_decode_dev` we will get the major and minor numbers for the root device in `ROOT_DEV`. + +Memory map setup +-------------------------------------------------------------------------------- + +The next point is the setup of the memory map with the call of the `setup_memory_map` function. But before this we set up various parameters such as information about the screen (current row and column, video page and etc...
(you can read about it in the [Video mode initialization and transition to protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html))), Extended display identification data, video mode, bootloader_type and etc...: + +```C + screen_info = boot_params.screen_info; + edid_info = boot_params.edid_info; + saved_video_mode = boot_params.hdr.vid_mode; + bootloader_type = boot_params.hdr.type_of_loader; + if ((bootloader_type >> 4) == 0xe) { + bootloader_type &= 0xf; + bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4; + } + bootloader_version = bootloader_type & 0xf; + bootloader_version |= boot_params.hdr.ext_loader_ver << 4; +``` + +All of these parameters we got during boot time and stored in the `boot_params` structure. After this we need to setup the end of the I/O memory. As you know one of the main purposes of the kernel is resource management. And one of the resource is memory. As we already know there are two ways to communicate with devices are I/O ports and device memory. All information about registered resources are available through: + +* /proc/ioports - provides a list of currently registered port regions used for input or output communication with a device; +* /proc/iomem - provides current map of the system's memory for each physical device. + +At the moment we are interested in `/proc/iomem`: + +``` +cat /proc/iomem +00000000-00000fff : reserved +00001000-0009d7ff : System RAM +0009d800-0009ffff : reserved +000a0000-000bffff : PCI Bus 0000:00 +000c0000-000cffff : Video ROM +000d0000-000d3fff : PCI Bus 0000:00 +000d4000-000d7fff : PCI Bus 0000:00 +000d8000-000dbfff : PCI Bus 0000:00 +000dc000-000dffff : PCI Bus 0000:00 +000e0000-000fffff : reserved + 000e0000-000e3fff : PCI Bus 0000:00 + 000e4000-000e7fff : PCI Bus 0000:00 + 000f0000-000fffff : System ROM +``` + +As you can see range of addresses are shown in hexadecimal notation with its owner. 
Linux kernel provides API for managing any resources in a general way. Global resources (for example PICs or I/O ports) can be divided into subsets - relating to any hardware bus slot. The main structure `resource`: + +```C +struct resource { + resource_size_t start; + resource_size_t end; + const char *name; + unsigned long flags; + struct resource *parent, *sibling, *child; +}; +``` + +presents abstraction for a tree-like subset of system resources. This structure provides range of addresses from `start` to `end` (`resource_size_t` is `phys_addr_t` or `u64` for `x86_64`) which a resource covers, `name` of a resource (you see these names in the `/proc/iomem` output) and `flags` of a resource (All resources flags defined in the [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h)). The last are three pointers to the `resource` structure. These pointers enable a tree-like structure: + +``` ++-------------+ +-------------+ +| | | | +| parent |------| sibling | +| | | | ++-------------+ +-------------+ + | + | ++-------------+ +| | +| child | +| | ++-------------+ +``` + +Every subset of resources has root range resources. For `iomem` it is `iomem_resource` which defined as: + +```C +struct resource iomem_resource = { + .name = "PCI mem", + .start = 0, + .end = -1, + .flags = IORESOURCE_MEM, +}; +EXPORT_SYMBOL(iomem_resource); +``` + +TODO EXPORT_SYMBOL + +`iomem_resource` defines root addresses range for io memory with `PCI mem` name and `IORESOURCE_MEM` (`0x00000200`) as flags. As i wrote above our current point is setup the end address of the `iomem`. We will do it with: + +```C +iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1; +``` + +Here we shift `1` on `boot_cpu_data.x86_phys_bits`. `boot_cpu_data` is `cpuinfo_x86` structure which we filled during execution of the `early_cpu_init`. 
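
As a small worked example of the `(1ULL << bits) - 1` expression, the sketch below computes the highest address that a given number of physical address bits can express. The function name `sketch_iomem_end` and the sample widths are assumptions chosen for illustration; the real value of `x86_phys_bits` depends on the CPU.

```c
#include <stdint.h>

/* Highest physical address expressible with phys_bits address bits.
 * Only meaningful for phys_bits < 64, because shifting a 64-bit
 * value by 64 or more is undefined in C. */
static uint64_t sketch_iomem_end(unsigned int phys_bits)
{
    return (1ULL << phys_bits) - 1;
}
```

With 32 bits this yields `0xFFFFFFFF`, and with 36 bits it yields `0xFFFFFFFFF`, so every extra address bit doubles the range covered by the root resource.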
As you can understand from the name of the `x86_phys_bits` field, it presents the number of bits in the maximum physical address in the system. Note also that `iomem_resource` is passed to the `EXPORT_SYMBOL` macro. This macro exports the given symbol (`iomem_resource` in our case) for dynamic linking or in other words it makes a symbol accessible to dynamically loaded modules. + +After we set the end address of the root `iomem` resource address range, as I wrote above the next step will be the setup of the memory map. It is done with the call of the `setup_memory_map` function: + +```C +void __init setup_memory_map(void) +{ + char *who; + + who = x86_init.resources.memory_setup(); + memcpy(&e820_saved, &e820, sizeof(struct e820map)); + printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n"); + e820_print_map(who); +} +``` + +First of all, look at the call of `x86_init.resources.memory_setup`. `x86_init` is an `x86_init_ops` structure which presents platform specific setup functions such as resources initialization, pci initialization and etc... The initialization of `x86_init` is in the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c). I will not give here the full description because it is very long, but only one part which interests us for now: + +```C +struct x86_init_ops x86_init __initdata = { + .resources = { + .probe_roms = probe_roms, + .reserve_resources = reserve_standard_io_resources, + .memory_setup = default_machine_specific_memory_setup, + }, + ... + ... + ... +} +``` + +As we can see here the `memory_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected at [boot time](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill the `e820map` structure with the memory regions.
As all regions are collected, print of all regions with printk. You can find this print if you execute `dmesg` command and you can see something like this: + +``` +[ 0.000000] e820: BIOS-provided physical RAM map: +[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable +[ 0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved +[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved +[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable +[ 0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS +[ 0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable +[ 0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved +[ 0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable +[ 0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved +[ 0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable +[ 0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS +[ 0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved +[ 0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable +... +... +... +``` + +Copying of the BIOS Enhanced Disk Device information +-------------------------------------------------------------------------------- + +The next two steps is parsing of the `setup_data` with `parse_setup_data` function and copying BIOS EDD to the safe place. `setup_data` is a field from the kernel boot header and as we can read from the `x86` boot protocol: + +``` +Field name: setup_data +Type: write (special) +Offset/size: 0x250/8 +Protocol: 2.09+ + + The 64-bit physical pointer to NULL terminated single linked list of + struct setup_data. This is used to define a more extensible boot + parameters passing mechanism. +``` + +It used for storing setup information for different types as device tree blob, EFI setup data and etc... 
In the second step we copy BIOS EDD information from the `boot_params` structure that we collected in the [arch/x86/boot/edd.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c) to the `edd` structure: + +```C +static inline void __init copy_edd(void) +{ + memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer, + sizeof(edd.mbr_signature)); + memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info)); + edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries; + edd.edd_info_nr = boot_params.eddbuf_entries; +} +``` + +Memory descriptor initialization +-------------------------------------------------------------------------------- + +The next step is initialization of the memory descriptor of the init process. As you already can know every process has its own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h). As every process has its own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html): + +```C +#define INIT_TASK(tsk) \ +{ + ... + ... + ... + .mm = NULL, \ + .active_mm = &init_mm, \ + ... 
+} +``` + +`mm` points to the process address space and `active_mm` points to the active address space if process has no address space such as kernel threads (more about it you can read in the [documentation](https://www.kernel.org/doc/Documentation/vm/active_mm.txt)). Now we fill memory descriptor of the initial process: + +```C + init_mm.start_code = (unsigned long) _text; + init_mm.end_code = (unsigned long) _etext; + init_mm.end_data = (unsigned long) _edata; + init_mm.brk = _brk_end; +``` + +with the kernel's text, data and brk. `init_mm` is the memory descriptor of the initial process and defined as: + +```C +struct mm_struct init_mm = { + .mm_rb = RB_ROOT, + .pgd = swapper_pg_dir, + .mm_users = ATOMIC_INIT(2), + .mm_count = ATOMIC_INIT(1), + .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem), + .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), + .mmlist = LIST_HEAD_INIT(init_mm.mmlist), + INIT_MM_CONTEXT(init_mm) +}; +``` + +where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointer to the page global directory, `mm_users` is address space users, `mm_count` is primary usage counter and `mmap_sem` is memory area semaphore. After we setup memory descriptor of the initial process, next step is initialization of the Intel Memory Protection Extensions with `mpx_mm_init`. The next step is initialization of the code/data/bss resources with: + +```C + code_resource.start = __pa_symbol(_text); + code_resource.end = __pa_symbol(_etext)-1; + data_resource.start = __pa_symbol(_etext); + data_resource.end = __pa_symbol(_edata)-1; + bss_resource.start = __pa_symbol(__bss_start); + bss_resource.end = __pa_symbol(__bss_stop)-1; +``` + +We already know a little about `resource` structure (read above). Here we fills code/data/bss resources with their physical addresses. 
You can see it in the `/proc/iomem`: + +``` +00100000-be825fff : System RAM + 01000000-015bb392 : Kernel code + 015bb393-01930c3f : Kernel data + 01a11000-01ac3fff : Kernel bss +``` + +All of these structures are defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and look like typical resource initialization: + +```C +static struct resource code_resource = { + .name = "Kernel code", + .start = 0, + .end = 0, + .flags = IORESOURCE_BUSY | IORESOURCE_MEM +}; +``` + +The last step which we will cover in this part will be `NX` configuration. The `NX-bit`, or no-execute bit, is bit 63 of a page table entry; it controls whether code can be executed from the physical pages mapped by that entry. This bit can only be used/set when the `no-execute` page-protection mechanism is enabled by setting `EFER.NXE` to 1. In the `x86_configure_nx` function we check that the CPU supports the `NX-bit` and that it is not disabled. After the check we fill `__supported_pte_mask` depending on the result: + +```C +void x86_configure_nx(void) +{ + if (cpu_has_nx && !disable_nx) + __supported_pte_mask |= _PAGE_NX; + else + __supported_pte_mask &= ~_PAGE_NX; +} +``` + +Conclusion +-------------------------------------------------------------------------------- + +It is the end of the fifth part about the linux kernel initialization process. In this part we continued to dive into the `setup_arch` function which initializes architecture-specific stuff. It was a long part, but we have not finished with it. As I already wrote, `setup_arch` is a big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap and etc... Don't worry if they are unclear for you. There is a special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md).
In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameters, early dump of the pci devices, direct Media Interface scanning and many many more. + +If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). + +**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [mm vs active_mm](https://www.kernel.org/doc/Documentation/vm/active_mm.txt) +* [e820](http://en.wikipedia.org/wiki/E820) +* [Supervisor mode access prevention](https://lwn.net/Articles/517475/) +* [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) +* [TSS](http://en.wikipedia.org/wiki/Task_state_segment) +* [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table) +* [Memory mapped I/O](http://en.wikipedia.org/wiki/Memory-mapped_I/O) +* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) +* [PDF. dwarf4 specification](http://dwarfstd.org/doc/DWARF4.pdf) +* [Call stack](http://en.wikipedia.org/wiki/Call_stack) +* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) diff --git a/Initialization/linux-initialization-6.md b/Initialization/linux-initialization-6.md new file mode 100644 index 0000000..dfed9f2 --- /dev/null +++ b/Initialization/linux-initialization-6.md @@ -0,0 +1,549 @@ +Kernel initialization. Part 6. +================================================================================ + +Architecture-specific initialization, again... 
+================================================================================ + +In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) we saw architecture-specific (`x86_64` in our case) initialization stuff from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and finished on `x86_configure_nx` function which sets the `_PAGE_NX` flag depends on support of [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before `setup_arch` function and `start_kernel` are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and as you can understand from its name, this function parses kernel command line and setups different services depends on the given parameters (all kernel command line parameters you can find are in the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)). You may remember how we setup `earlyprintk` in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). On the early stage we looked for kernel parameters and their value with the `cmdline_find_option` function and `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from the [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cmdline.c). There we're in the generic kernel part which does not depend on architecture and here we use another approach. 
If you have read the linux kernel source code, you have probably already noticed calls like this: + +```C +early_param("gbpages", parse_direct_gbpages_on); +``` + +`early_param` macro takes two parameters: + +* command line parameter name; +* function which will be called if given parameter is passed. + +and is defined as: + +```C +#define early_param(str, fn) \ + __setup_param(str, fn, fn, 1) +``` + +in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h). As you can see, the `early_param` macro just expands to the `__setup_param` macro: + +```C +#define __setup_param(str, unique_id, fn, early) \ + static const char __setup_str_##unique_id[] __initconst \ + __aligned(1) = str; \ + static struct obs_kernel_param __setup_##unique_id \ + __used __section(.init.setup) \ + __attribute__((aligned((sizeof(long))))) \ + = { __setup_str_##unique_id, fn, early } +``` + +This macro defines the `__setup_str_*` variable (where `*` depends on the given function name) and initializes it with the given command line parameter name. In the next line we can see the definition of the `__setup_*` variable whose type is `obs_kernel_param`, together with its initialization. The `obs_kernel_param` structure is defined as: + +```C +struct obs_kernel_param { + const char *str; + int (*setup_func)(char *); + int early; +}; +``` + +and contains three fields: + +* name of the kernel parameter; +* function which sets something up depending on the parameter; +* field which determines whether the parameter is early (1) or not (0). + +Note that the `__setup_param` macro defines the `__setup_*` variable with the `__section(.init.setup)` attribute. It means that all `__setup_*` structures will be placed in the `.init.setup` section, moreover, as we can see in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h), they will be placed between `__setup_start` and `__setup_end`: + +``` +#define INIT_SETUP(initsetup_align) \ + .
= ALIGN(initsetup_align); \ + VMLINUX_SYMBOL(__setup_start) = .; \ + *(.init.setup) \ + VMLINUX_SYMBOL(__setup_end) = .; +``` + +Now we know how parameters are defined, let's back to the `parse_early_param` implementation: + +```C +void __init parse_early_param(void) +{ + static int done __initdata; + static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata; + + if (done) + return; + + /* All fall through to do_early_param. */ + strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE); + parse_early_options(tmp_cmdline); + done = 1; +} +``` + +The `parse_early_param` function defines two static variables. First `done` check that `parse_early_param` already called and the second is temporary storage for kernel command line. After this we copy `boot_command_line` to the temporary command line which we just defined and call the `parse_early_options` function from the same source code `main.c` file. `parse_early_options` calls the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/) where `parse_args` parses given command line and calls `do_early_param` function. This [function](https://github.com/torvalds/linux/blob/master/init/main.c#L413) goes from the ` __setup_start` to `__setup_end`, and calls the function from the `obs_kernel_param` if a parameter is early. After this all services which are depend on early command line parameters were setup and the next call after the `parse_early_param` is `x86_report_nx`. As I wrote in the beginning of this part, we already set `NX-bit` with the `x86_configure_nx`. The next `x86_report_nx` function from the [arch/x86/mm/setup_nx.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/setup_nx.c) just prints information about the `NX`. Note that we call `x86_report_nx` not right after the `x86_configure_nx`, but after the call of the `parse_early_param`. 
The reason is simple: we call it after the `parse_early_param` because the kernel supports the `noexec` parameter: + +``` +noexec [X86] + On X86-32 available only on PAE configured kernels. + noexec=on: enable non-executable mappings (default) + noexec=off: disable non-executable mappings +``` + +We can see it at boot time: + +![NX](http://oi62.tinypic.com/swwxhy.jpg) + +After this we can see the call of the: + +```C + memblock_x86_reserve_range_setup_data(); +``` + +function. This function is defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file; it remaps memory for the `setup_data` and reserves a memory block for the `setup_data` (more about `setup_data` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) and about `ioremap` and `memblock` you can read in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)). + +In the next step we can see the following conditional statement: + +```C + if (acpi_mps_check()) { +#ifdef CONFIG_X86_LOCAL_APIC + disable_apic = 1; +#endif + setup_clear_cpu_cap(X86_FEATURE_APIC); + } +``` + +The first `acpi_mps_check` function from the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) depends on the `CONFIG_X86_LOCAL_APIC` and `CONFIG_X86_MPPARSE` configuration options: + +```C +int __init acpi_mps_check(void) +{ +#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE) + /* mptable code is not built-in*/ + if (acpi_disabled || acpi_noirq) { + printk(KERN_WARNING "MPS support code is not built-in.\n" + "Using acpi=off or acpi=noirq or pci=noacpi " + "may have problem\n"); + return 1; + } +#endif + return 0; +} +``` + +It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table.
If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_X86_MPPARSE` is not set, `acpi_mps_check` prints a warning message if one of the command line options `acpi=off`, `acpi=noirq` or `pci=noacpi` was passed to the kernel. If `acpi_mps_check` returns `1`, we disable the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clear the `X86_FEATURE_APIC` bit in the capabilities of the current CPU with the `setup_clear_cpu_cap` macro (more about CPU masks you can read in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)). + +Early PCI dump +-------------------------------------------------------------------------------- + +In the next step we make a dump of the [PCI](http://en.wikipedia.org/wiki/Conventional_PCI) devices with the following code: + +```C +#ifdef CONFIG_PCI + if (pci_early_dump_regs) + early_dump_pci_devices(); +#endif +``` + +The `pci_early_dump_regs` variable is defined in the [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c) and its value depends on the kernel command line parameter: `pci=earlydump`. We can find the definition of this parameter in the [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/arch): + +```C +early_param("pci", pci_setup); +``` + +The `pci_setup` function gets the string after the `pci=` and analyzes it. This function calls `pcibios_setup` which is defined as `__weak` in the [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/arch), and every architecture defines its own function with the same name which overrides the `__weak` one. For example the `x86_64` architecture-dependent version is in the [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c): + +```C +char *__init pcibios_setup(char *str) { + ... + ... + ... + } else if (!strcmp(str, "earlydump")) { + pci_early_dump_regs = 1; + return NULL; + } + ... + ... + ...
+} +``` + +So, if the `CONFIG_PCI` option is set and we passed the `pci=earlydump` option on the kernel command line, the next function which will be called is `early_dump_pci_devices` from the [arch/x86/pci/early.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/early.c). This function checks the `noearly` pci parameter with: + +```C +if (!early_pci_allowed()) + return; +``` + +and returns if it was passed. Each PCI domain can host up to `256` buses, each bus hosts up to `32` devices and each device can have up to `8` functions. So, we go in a loop: + +```C +for (bus = 0; bus < 256; bus++) { + for (slot = 0; slot < 32; slot++) { + for (func = 0; func < 8; func++) { + ... + ... + ... + } + } +} +``` + +and read the `pci` config with the `read_pci_config` function. + +That's all. We will not go deep into the `pci` details here, but will see more details in the special `Drivers/PCI` part. + +Finish with memory parsing +-------------------------------------------------------------------------------- + +After the `early_dump_pci_devices`, there are a couple of functions related to available memory and the [e820](http://en.wikipedia.org/wiki/E820) map which we collected in the [First steps in the kernel setup](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) part: + +```C + /* update the e820_saved too */ + e820_reserve_setup_data(); + finish_e820_parsing(); + ... + ... + ... + e820_add_kernel_range(); + trim_bios_range(); + max_pfn = e820_end_of_ram_pfn(); + early_reserve_e820_mpc_new(); +``` + +Let's look at them. As you can see the first function is `e820_reserve_setup_data`. This function does almost the same as `memblock_x86_reserve_range_setup_data` which we saw above, but it also calls `e820_update_range` which adds new regions to the `e820map` with the given type which is `E820_RESERVED_KERN` in our case. The next function is `finish_e820_parsing` which sanitizes `e820map` with the `sanitize_e820_map` function.
Besides these two functions we can see a couple of other functions related to the [e820](http://en.wikipedia.org/wiki/E820). You can see them in the listing above. The `e820_add_kernel_range` function computes the physical start address and the size of the kernel: + +```C +u64 start = __pa_symbol(_text); +u64 size = __pa_symbol(_end) - start; +``` + +checks that `.text`, `.data` and `.bss` are marked as `E820_RAM` in the `e820map` and prints a warning message if not. The next function, `trim_bios_range`, marks the first 4096 bytes in the `e820map` as `E820_RESERVED` and sanitizes it again with the call of the `sanitize_e820_map`. After this we get the last page frame number with the call of the `e820_end_of_ram_pfn` function. Every memory page has a unique number, the `Page frame number`, and the `e820_end_of_ram_pfn` function returns the maximum one with the call of the `e820_end_pfn`: + +```C +unsigned long __init e820_end_of_ram_pfn(void) +{ + return e820_end_pfn(MAX_ARCH_PFN); +} +``` + +where `e820_end_pfn` takes the maximum page frame number of the given architecture (`MAX_ARCH_PFN` is `0x400000000` for `x86_64`).
In the `e820_end_pfn` we go through all of the `e820` slots and check that the `e820` entry has the `E820_RAM` or `E820_PRAM` type, because we calculate page frame numbers only for these types. For each such entry we get the first and the last page frame number and make some checks against the limit: + +```C +for (i = 0; i < e820.nr_map; i++) { + struct e820entry *ei = &e820.map[i]; + unsigned long start_pfn; + unsigned long end_pfn; + + if (ei->type != E820_RAM && ei->type != E820_PRAM) + continue; + + start_pfn = ei->addr >> PAGE_SHIFT; + end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT; + + if (start_pfn >= limit_pfn) + continue; + if (end_pfn > limit_pfn) { + last_pfn = limit_pfn; + break; + } + if (end_pfn > last_pfn) + last_pfn = end_pfn; +} +``` + +```C + if (last_pfn > max_arch_pfn) + last_pfn = max_arch_pfn; + + printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n", + last_pfn, max_arch_pfn); + return last_pfn; +``` + +After this we check that the `last_pfn` which we got in the loop is not greater than the maximum page frame number for the given architecture (`x86_64` in our case), print information about the last page frame number and return it. We can see the `last_pfn` in the `dmesg` output: + +``` +... +[ 0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000 +... +``` + +After this, as we have calculated the biggest page frame number, we calculate `max_low_pfn` which is the biggest page frame number in the `low memory`, that is, below the first `4` gigabytes.
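To make the clamping in the loop above more concrete, here is a small user-space sketch of the same idea. The `struct region` type and the `end_pfn` function are hypothetical stand-ins for the kernel's `e820entry` and `e820_end_pfn`, not the kernel code itself; only the shift by `PAGE_SHIFT` and the handling of the limit follow the listing above.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4 KiB pages, as on x86_64 */

/* Hypothetical stand-in for an e820 entry: a byte range and a RAM flag. */
struct region {
    uint64_t addr;
    uint64_t size;
    int is_ram;
};

/* Return the highest page frame number covered by RAM, capped at limit_pfn. */
static uint64_t end_pfn(const struct region *map, size_t n, uint64_t limit_pfn)
{
    uint64_t last_pfn = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t start, end;

        if (!map[i].is_ram)
            continue;               /* only RAM contributes page frames */

        start = map[i].addr >> PAGE_SHIFT;
        end = (map[i].addr + map[i].size) >> PAGE_SHIFT;

        if (start >= limit_pfn)
            continue;               /* region lies entirely above the limit */
        if (end > limit_pfn) {
            last_pfn = limit_pfn;   /* region crosses the limit: clamp and stop */
            break;
        }
        if (end > last_pfn)
            last_pfn = end;
    }
    return last_pfn;
}
```

Calling `end_pfn` with the architecture maximum gives the analogue of `max_pfn`, while calling it with `1ULL << (32 - PAGE_SHIFT)` as the limit gives the analogue of the low-memory value, since regions that start at or above 4 gigabytes are skipped.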
If more than 4 gigabytes of RAM are installed, `max_low_pfn` will be the result of the `e820_end_of_low_ram_pfn` function, which does the same as `e820_end_of_ram_pfn` but with a 4 gigabyte limit; otherwise `max_low_pfn` will be the same as `max_pfn`: + +```C +if (max_pfn > (1UL<<(32 - PAGE_SHIFT))) + max_low_pfn = e820_end_of_low_ram_pfn(); +else + max_low_pfn = max_pfn; + +high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1; +``` + +Next we calculate `high_memory` (it defines the upper bound of the directly mapped memory) with the `__va` macro which returns a virtual address for the given physical address. + +DMI scanning +------------------------------------------------------------------------------- + +The next step after manipulations with different memory regions and `e820` slots is collecting information about the computer. We will get all information with the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface) and the following functions: + +```C +dmi_scan_machine(); +dmi_memdev_walk(); +``` + +First is `dmi_scan_machine` defined in the [drivers/firmware/dmi_scan.c](https://github.com/torvalds/linux/blob/master/drivers/firmware/dmi_scan.c). This function goes through the [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) structures and extracts information. There are two ways to gain access to the `SMBIOS` table: getting the pointer to the `SMBIOS` table from the [EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)'s configuration table, and scanning the physical memory between the `0xF0000` and `0xFFFFF` addresses (a region of `0x10000` bytes starting at `0xF0000`). Let's look at the second approach. The `dmi_scan_machine` function remaps `0x10000` bytes of memory starting at `0xF0000` with the `dmi_early_remap` which just expands to the `early_ioremap`: + +```C +void __init dmi_scan_machine(void) +{ + char __iomem *p, *q; + char buf[32]; + ... + ... + ...
+ p = dmi_early_remap(0xF0000, 0x10000); + if (p == NULL) + goto error; +``` + +and iterates over this region in 16-byte steps, searching for the `_SM_` string: + +```C +memset(buf, 0, 16); +for (q = p; q < p + 0x10000; q += 16) { + memcpy_fromio(buf + 16, q, 16); + if (!dmi_smbios3_present(buf) || !dmi_present(buf)) { + dmi_available = 1; + dmi_early_unmap(p, 0x10000); + goto out; + } + memcpy(buf, buf + 16, 16); +} +``` + +The `_SM_` string must be between `0x000F0000` and `0x000FFFFF`. Here we copy 16 bytes to the `buf` with `memcpy_fromio`, which does the same as `memcpy` but for I/O memory, and execute `dmi_smbios3_present` and `dmi_present` on the buffer. These functions check that the buffer starts with the anchor string (`_SM3_` for `dmi_smbios3_present`, `_SM_` for `dmi_present`), get the `SMBIOS` version and read the `_DMI_` attributes such as the `DMI` structure table length, the table address, etc. After one of these functions succeeds, you will see the result in the `dmesg` output: + +``` +[ 0.000000] SMBIOS 2.7 present. +[ 0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014 +``` + +At the end of the `dmi_scan_machine`, we unmap the previously remapped memory: + +```C +dmi_early_unmap(p, 0x10000); +``` + +The second function is `dmi_memdev_walk`. As you can understand from its name, it goes over memory devices. Let's look at it: + +```C +void __init dmi_memdev_walk(void) +{ + if (!dmi_available) + return; + + if (dmi_walk_early(count_mem_devices) == 0 && dmi_memdev_nr) { + dmi_memdev = dmi_alloc(sizeof(*dmi_memdev) * dmi_memdev_nr); + if (dmi_memdev) + dmi_walk_early(save_mem_devices); + } +} +``` + +It checks that `DMI` is available (we set this in the previous function, `dmi_scan_machine`) and collects information about memory devices with `dmi_walk_early` and `dmi_alloc`, whose space is reserved with: + +``` +#ifdef CONFIG_DMI +RESERVE_BRK(dmi_alloc, 65536); +#endif +``` + +`RESERVE_BRK` is defined in the [arch/x86/include/asm/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/setup.h) and reserves space of the given size in the `brk` section.
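The 16-byte-step search for the `_SM_` anchor described above can be tried out in user space on an ordinary buffer. This is only a minimal sketch under that assumption: `find_sm_anchor` is a hypothetical helper name, it works on a plain byte array instead of remapped I/O memory, and it does not validate the checksum or version like the kernel's `dmi_present` does.

```c
#include <stddef.h>
#include <string.h>

/*
 * Scan buf on 16-byte boundaries for the 4-byte "_SM_" anchor.
 * Returns the byte offset of the first match, or -1 if none is found.
 */
static long find_sm_anchor(const unsigned char *buf, size_t len)
{
    for (size_t off = 0; off + 4 <= len; off += 16) {
        if (memcmp(buf + off, "_SM_", 4) == 0)
            return (long)off;
    }
    return -1;
}
```

Because only offsets that are multiples of 16 are inspected, an anchor string that sits at any other offset is deliberately not reported, which mirrors the `q += 16` step of the loop above.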
+ +------------------------- + init_hypervisor_platform(); + x86_init.resources.probe_roms(); + insert_resource(&iomem_resource, &code_resource); + insert_resource(&iomem_resource, &data_resource); + insert_resource(&iomem_resource, &bss_resource); + early_gart_iommu_check(); + + +SMP config +-------------------------------------------------------------------------------- + +The next step is parsing of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration. We do it with the call of the `find_smp_config` function which just calls function: + +```C +static inline void find_smp_config(void) +{ + x86_init.mpparse.find_smp_config(); +} +``` + +inside. `x86_init.mpparse.find_smp_config` is the `default_find_smp_config` function from the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). In the `default_find_smp_config` function we are scanning a couple of memory regions for `SMP` config and return if they are found: + +```C +if (smp_scan_config(0x0, 0x400) || + smp_scan_config(639 * 0x400, 0x400) || + smp_scan_config(0xF0000, 0x10000)) + return; +``` + +First of all `smp_scan_config` function defines a couple of variables: + +```C +unsigned int *bp = phys_to_virt(base); +struct mpf_intel *mpf; +``` + +First is virtual address of the memory region where we will scan `SMP` config, second is the pointer to the `mpf_intel` structure. Let's try to understand what is it `mpf_intel`. All information stores in the multiprocessor configuration data structure. 
`mpf_intel` presents this structure and looks: + +```C +struct mpf_intel { + char signature[4]; + unsigned int physptr; + unsigned char length; + unsigned char specification; + unsigned char checksum; + unsigned char feature1; + unsigned char feature2; + unsigned char feature3; + unsigned char feature4; + unsigned char feature5; +}; +``` + +As we can read in the documentation - one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. And operating system must have access to this information about the multiprocessor configuration and `mpf_intel` stores the physical address (look at second parameter) of the multiprocessor configuration table. So, `smp_scan_config` going in a loop through the given memory range and tries to find `MP floating pointer structure` there. It checks that current byte points to the `SMP` signature, checks checksum, checks if `mpf->specification` is 1 or 4(it must be `1` or `4` by specification) in the loop: + +```C +while (length > 0) { +if ((*bp == SMP_MAGIC_IDENT) && + (mpf->length == 1) && + !mpf_checksum((unsigned char *)bp, 16) && + ((mpf->specification == 1) + || (mpf->specification == 4))) { + + mem = virt_to_phys(mpf); + memblock_reserve(mem, sizeof(*mpf)); + if (mpf->physptr) + smp_reserve_memory(mpf); + } +} +``` + +reserves given memory block if search is successful with `memblock_reserve` and reserves physical address of the multiprocessor configuration table. You can find documentation about this in the - [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf). You can read More details in the special part about `SMP`. + +Additional early memory initialization routines +-------------------------------------------------------------------------------- + +In the next step of the `setup_arch` we can see the call of the `early_alloc_pgt_buf` function which allocates the page table buffer for early stage. 
The page table buffer will be placed in the `brk` area. Let's look at its implementation: + +```C +void __init early_alloc_pgt_buf(void) +{ + unsigned long tables = INIT_PGT_BUF_SIZE; + phys_addr_t base; + + base = __pa(extend_brk(tables, PAGE_SIZE)); + + pgt_buf_start = base >> PAGE_SHIFT; + pgt_buf_end = pgt_buf_start; + pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT); +} +``` + +First of all it gets the size of the page table buffer; it will be `INIT_PGT_BUF_SIZE` which is `(6 * PAGE_SIZE)` in the current linux kernel 4.0. As we got the size of the page table buffer, we call the `extend_brk` function with two parameters: size and align. As you can understand from its name, this function extends the `brk` area. As we can see in the linux kernel linker script, `brk` is in memory right after the [BSS](http://en.wikipedia.org/wiki/.bss): + +```C + . = ALIGN(PAGE_SIZE); + .brk : AT(ADDR(.brk) - LOAD_OFFSET) { + __brk_base = .; + . += 64 * 1024; /* 64k alignment slop space */ + *(.brk_reservation) /* areas brk users have reserved */ + __brk_limit = .; + } +``` + +Or we can find it with the `readelf` util: + +![brk area](http://oi61.tinypic.com/71lkeu.jpg) + +After we got the physical address of the new `brk` with the `__pa` macro, we calculate the base address and the end of the page table buffer. In the next step, as we have the page table buffer, we reserve a memory block for the brk area with the `reserve_brk` function: + +```C +static void __init reserve_brk(void) +{ + if (_brk_end > _brk_start) + memblock_reserve(__pa_symbol(_brk_start), + _brk_end - _brk_start); + + _brk_start = 0; +} +``` + +Note that at the end of the `reserve_brk` we set `_brk_start` to zero, because after this we will not allocate from it anymore. In the next step, after reserving the memory block for the `brk`, we need to unmap out-of-range memory areas in the kernel mapping with the `cleanup_highmap` function.
Remember that the kernel mapping starts at `__START_KERNEL_map` and covers `_end - _text` bytes, and that `level2_kernel_pgt` maps the kernel `_text`, `data` and `bss`. At the start of the `cleanup_highmap` we define these parameters: + +```C +unsigned long vaddr = __START_KERNEL_map; +unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1; +pmd_t *pmd = level2_kernel_pgt; +pmd_t *last_pmd = pmd + PTRS_PER_PMD; +``` + +Now, as we defined the start and the end of the kernel mapping, we go in a loop through all of the kernel page middle directory entries and clear the entries which are not between `_text` and `end`: + +```C +for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) { + if (pmd_none(*pmd)) + continue; + if (vaddr < (unsigned long) _text || vaddr > end) + set_pmd(pmd, __pmd(0)); +} +``` + +After this we set the limit for the `memblock` allocation with the `memblock_set_current_limit` function (you can read more about `memblock` in the [Linux kernel memory management Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md)); it will be `ISA_END_ADDRESS` or `0x100000`. Then we fill the `memblock` information according to `e820` with the call of the `memblock_x86_fill` function.
You can see the result of this function in the kernel initialization time: + +``` +MEMBLOCK configuration: + memory size = 0x1fff7ec00 reserved size = 0x1e30000 + memory.cnt = 0x3 + memory[0x0] [0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0 + memory[0x1] [0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0 + memory[0x2] [0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0 + reserved.cnt = 0x3 + reserved[0x0] [0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0 + reserved[0x1] [0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0 + reserved[0x2] [0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0 +``` + +The rest functions after the `memblock_x86_fill` are: `early_reserve_e820_mpc_new` allocates additional slots in the `e820map` for MultiProcessor Specification table, `reserve_real_mode` - reserves low memory from `0x0` to 1 megabyte for the trampoline to the real mode (for rebooting, etc.), `trim_platform_memory_ranges` - trims certain memory regions started from `0x20050000`, `0x20110000`, etc. these regions must be excluded because [Sandy Bridge](http://en.wikipedia.org/wiki/Sandy_Bridge) has problems with these regions, `trim_low_memory_range` reserves the first 4 kilobyte page in `memblock`, `init_mem_mapping` function reconstructs direct memory mapping and setups the direct mapping of the physical memory at `PAGE_OFFSET`, `early_trap_pf_init` setups `#PF` handler (we will look on it in the chapter about interrupts) and `setup_real_mode` function setups trampoline to the [real mode](http://en.wikipedia.org/wiki/Real_mode) code. + +That's all. You can note that this part will not cover all functions which are in the `setup_arch` (like `early_gart_iommu_check`, [mtrr](http://en.wikipedia.org/wiki/Memory_type_range_register) initialization, etc.). As I already wrote many times, `setup_arch` is big, and linux kernel is big. That's why I can't cover every line in the linux kernel. 
I don't think that we missed something important, but you can say something like: each line of code is important. Yes, it's true, but I missed them anyway, because I think that it is not realistic to cover full linux kernel. Anyway we will often return to the idea that we have already seen, and if something is unfamiliar, we will cover this theme. + +Conclusion +-------------------------------------------------------------------------------- + +It is the end of the sixth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function again and it was long part, but we are not finished with it. Yes, `setup_arch` is big, hope that next part will be the last part about this function. + +If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). + +**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) +* [NX bit](http://en.wikipedia.org/wiki/NX_bit) +* [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt) +* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) +* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) +* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) +* [PCI](http://en.wikipedia.org/wiki/Conventional_PCI) +* [e820](http://en.wikipedia.org/wiki/E820) +* [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) +* [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) +* 
[EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) +* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) +* [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf) +* [BSS](http://en.wikipedia.org/wiki/.bss) +* [SMBIOS specification](http://www.dmtf.org/sites/default/files/standards/documents/DSP0134v2.5Final.pdf) +* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) diff --git a/Initialization/linux-initialization-7.md b/Initialization/linux-initialization-7.md new file mode 100644 index 0000000..ca19293 --- /dev/null +++ b/Initialization/linux-initialization-7.md @@ -0,0 +1,482 @@ +Kernel initialization. Part 7. +================================================================================ + +The End of the architecture-specific initialization, almost... +================================================================================ + +This is the seventh part of the Linux Kernel initialization process which covers insides of the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L861). As you can know from the previous [parts](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html), the `setup_arch` function does some architecture-specific (in our case it is [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) device and many many more. If you have read the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html), you can remember that we've finished it at the `setup_real_mode` function. 
In the next step, as we set limit of the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html) to the all mapped pages, we can see the call of the `setup_log_buf` function from the [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/master/kernel/printk/printk.c). + +The `setup_log_buf` function setups kernel cyclic buffer and its length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of the `CONFIG_LOG_BUF_SHIFT` it can be between `12` and `21`. In the insides, buffer defined as array of chars: + +```C +#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT) +static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN); +static char *log_buf = __log_buf; +``` + +Now let's look on the implementation of the `setup_log_buf` function. It starts with check that current buffer is empty (It must be empty, because we just setup it) and another check that it is early setup. If setup of the kernel log buffer is not early, we call the `log_buf_add_cpu` function which increase size of the buffer for every CPU: + +```C +if (log_buf != __log_buf) + return; + +if (!early && !new_log_buf_len) + log_buf_add_cpu(); +``` + +We will not research `log_buf_add_cpu` function, because as you can see in the `setup_arch`, we call `setup_log_buf` as: + +```C +setup_log_buf(1); +``` + +where `1` means that it is early setup. In the next step we check `new_log_buf_len` variable which is updated length of the kernel log buffer and allocate new space for the buffer with the `memblock_virt_alloc` function for it, or just return. + +As kernel log buffer is ready, the next function is `reserve_initrd`. You can remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). 
Now, as we reconstructed direct memory mapping in the `init_mem_mapping` function, we need to move [initrd](http://en.wikipedia.org/wiki/Initrd) into directly mapped memory. The `reserve_initrd` function starts from the definition of the base address and end address of the `initrd` and checks that `initrd` is provided by a bootloader. All the same as what we saw in the `early_reserve_initrd`. But instead of reserving a place in the `memblock` area with the call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is less than half of this area with: + +```C +mapped_size = memblock_mem_size(max_pfn_mapped); +if (ramdisk_size >= (mapped_size>>1)) + panic("initrd too large to handle, " + "disabling initrd (%lld needed, %lld available)\n", + ramdisk_size, mapped_size>>1); +``` + +You can see here that we call the `memblock_mem_size` function and pass the `max_pfn_mapped` to it, where `max_pfn_mapped` contains the highest direct mapped page frame number. If you do not remember what a `page frame number` is, the explanation is simple: the lowest `12` bits of an address represent the offset in the physical page or page frame. If we shift the physical address right by `12` bits, we discard the offset part and get the `Page Frame Number`. In the `memblock_mem_size` we go through all memblock `mem` (not reserved) regions, calculate the size of the mapped pages and return it to the `mapped_size` variable (see code above). As we got the amount of the direct mapped memory, we check that the size of the `initrd` is less than half of the mapped memory. If it is not, we just call `panic` which halts the system and prints the famous [Kernel panic](http://en.wikipedia.org/wiki/Kernel_panic) message. In the next step we print information about the `initrd` size.
We can see the result of this in the `dmesg` output: + +``` +[0.000000] RAMDISK: [mem 0x36d20000-0x37687fff] +``` + +and relocate `initrd` to the direct mapping area with the `relocate_initrd` function. In the start of the `relocate_initrd` function we try to find a free area with the `memblock_find_in_range` function: + +```C +relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), area_size, PAGE_SIZE); + +if (!relocated_ramdisk) + panic("Cannot find place for new RAMDISK of size %lld\n", + ramdisk_size); +``` + +The `memblock_find_in_range` function tries to find a free area in a given range, in our case from `0` to the maximum mapped physical address, and the size of the area must be equal to the aligned size of the `initrd`. If we didn't find an area with the given size, we call `panic` again. If all is good, we start to relocate the RAM disk into the area which we found in the directly mapped memory in the next step. + +In the end of the `reserve_initrd` function, we free the memblock memory which was occupied by the ramdisk with the call of the: + +```C +memblock_free(ramdisk_image, ramdisk_end - ramdisk_image); +``` + +After we relocated the `initrd` ramdisk image, the next function is `vsmp_init` from the [arch/x86/kernel/vsmp_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsmp_64.c). This function initializes support of the `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter will not cover initialization parts which are not related to the generic `x86_64` initialization (for example this one, or `ACPI`, etc.). So we will skip the implementation of this for now and will come back to it in the part which covers techniques of parallel computing. + +The next function is `io_delay_init` from the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/io_delay.c). This function allows to override the default I/O delay port `0x80`.
We already saw I/O delay in the [Last preparation before transition into protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html), now let's look on the `io_delay_init` implementation: + +```C +void __init io_delay_init(void) +{ + if (!io_delay_override) + dmi_check_system(io_delay_0xed_port_dmi_table); +} +``` + +This function checks the `io_delay_override` variable. If it is not set, the function calls `dmi_check_system` with the `io_delay_0xed_port_dmi_table`, so the I/O delay port can be changed for the systems which are listed in this table. If `io_delay_override` is set, the choice from the kernel command line is kept as is. We can set the `io_delay_override` variable by passing the `io_delay` option to the kernel command line. As we can read from the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt), `io_delay` option is: + +``` +io_delay= [X86] I/O delay method + 0x80 + Standard port 0x80 based delay + 0xed + Alternate port 0xed based delay (needed on some systems) + udelay + Simple two microseconds delay + none + No delay +``` + +We can see `io_delay` command line parameter setup with the `early_param` macro in the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/io_delay.c) + +```C +early_param("io_delay", io_delay_param); +``` + +More about `early_param` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). So the `io_delay_param` function which sets the `io_delay_override` variable will be called in the [do_early_param](https://github.com/torvalds/linux/blob/master/init/main.c#L413) function.
`io_delay_param` function gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depends on it: + +```C +static int __init io_delay_param(char *s) +{ + if (!s) + return -EINVAL; + + if (!strcmp(s, "0x80")) + io_delay_type = CONFIG_IO_DELAY_TYPE_0X80; + else if (!strcmp(s, "0xed")) + io_delay_type = CONFIG_IO_DELAY_TYPE_0XED; + else if (!strcmp(s, "udelay")) + io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY; + else if (!strcmp(s, "none")) + io_delay_type = CONFIG_IO_DELAY_TYPE_NONE; + else + return -EINVAL; + + io_delay_override = 1; + return 0; +} +``` + +The next functions are `acpi_boot_table_init`, `early_acpi_boot_init` and `initmem_init` after the `io_delay_init`, but as I wrote above we will not cover [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) related stuff in this `Linux Kernel initialization process` chapter. + +Allocate area for DMA +-------------------------------------------------------------------------------- + +In the next step we need to allocate area for the [Direct memory access](http://en.wikipedia.org/wiki/Direct_memory_access) with the `dma_contiguous_reserve` function which is defined in the [drivers/base/dma-contiguous.c](https://github.com/torvalds/linux/blob/master/drivers/base/dma-contiguous.c). `DMA` is a special mode when devices communicate with memory without CPU. Note that we pass one parameter - `max_pfn_mapped << PAGE_SHIFT`, to the `dma_contiguous_reserve` function and as you can understand from this expression, this is limit of the reserved memory. Let's look on the implementation of this function. 
It starts from the definition of the following variables: + +```C +phys_addr_t selected_size = 0; +phys_addr_t selected_base = 0; +phys_addr_t selected_limit = limit; +bool fixed = false; +``` + +where first represents size in bytes of the reserved area, second is base address of the reserved area, third is end address of the reserved area and the last `fixed` parameter shows where to place reserved area. If `fixed` is `1` we just reserve area with the `memblock_reserve`, if it is `0` we allocate space with the `kmemleak_alloc`. In the next step we check `size_cmdline` variable and if it is not equal to `-1` we fill all variables which you can see above with the values from the `cma` kernel command line parameter: + +```C +if (size_cmdline != -1) { + ... + ... + ... +} +``` + +You can find in this source code file definition of the early parameter: + +```C +early_param("cma", early_cma); +``` + +where `cma` is: + +``` +cma=nn[MG]@[start[MG][-end[MG]]] + [ARM,X86,KNL] + Sets the size of kernel global memory area for + contiguous memory allocations and optionally the + placement constraint by the physical address range of + memory allocations. A value of 0 disables CMA + altogether. For more information, see + include/linux/dma-contiguous.h +``` + +If we will not pass `cma` option to the kernel command line, `size_cmdline` will be equal to `-1`. In this way we need to calculate size of the reserved area which depends on the following kernel configuration options: + +* `CONFIG_CMA_SIZE_SEL_MBYTES` - size in megabytes, default global `CMA` area, which is equal to `CMA_SIZE_MBYTES * SZ_1M` or `CONFIG_CMA_SIZE_MBYTES * 1M`; +* `CONFIG_CMA_SIZE_SEL_PERCENTAGE` - percentage of total memory; +* `CONFIG_CMA_SIZE_SEL_MIN` - use lower value; +* `CONFIG_CMA_SIZE_SEL_MAX` - use higher value. 
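
The way these four options pick a size can be sketched in plain C. This is only an illustration under assumptions: the `cma_size_sel` enum and the `cma_select_size` and `cma_percent_of_memory` helpers are hypothetical names invented for this sketch, and the selection is done here at run time, while the kernel makes the same kind of choice at build time with preprocessor conditionals:

```C
#include <assert.h>
#include <stdint.h>

#define SZ_1M (1024ULL * 1024ULL)

/* Hypothetical stand-ins for the four CONFIG_CMA_SIZE_SEL_* choices. */
enum cma_size_sel {
	CMA_SEL_MBYTES,      /* fixed size in megabytes */
	CMA_SEL_PERCENTAGE,  /* percentage of total memory */
	CMA_SEL_MIN,         /* lower of the two values */
	CMA_SEL_MAX,         /* higher of the two values */
};

/* Size of the area when it is given as a percentage of total memory. */
static uint64_t cma_percent_of_memory(uint64_t total_bytes, unsigned int percent)
{
	assert(percent <= 100);
	return total_bytes / 100 * percent;
}

/* Pick the size of the reserved area in bytes for the given selection. */
static uint64_t cma_select_size(enum cma_size_sel sel, uint64_t size_mbytes,
				unsigned int percent, uint64_t total_bytes)
{
	uint64_t by_mbytes = size_mbytes * SZ_1M;
	uint64_t by_percent = cma_percent_of_memory(total_bytes, percent);

	switch (sel) {
	case CMA_SEL_MBYTES:
		return by_mbytes;
	case CMA_SEL_PERCENTAGE:
		return by_percent;
	case CMA_SEL_MIN:
		return by_mbytes < by_percent ? by_mbytes : by_percent;
	case CMA_SEL_MAX:
		return by_mbytes > by_percent ? by_mbytes : by_percent;
	}
	return 0;
}
```

For example, with `16` megabytes, `10` percent and one gigabyte of total memory, the `MIN` selection gives the `16` megabytes value and the `MAX` selection gives the percentage value.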
+ +As we calculated the size of the reserved area, we reserve the area with the call of the `dma_contiguous_reserve_area` function which first of all calls: + +```C +ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma); +``` + +function. The `cma_declare_contiguous` reserves a contiguous area from the given base address with the given size. After we reserved the area for the `DMA`, the next function is the `memblock_find_dma_reserve`. As you can understand from its name, this function counts the reserved pages in the `DMA` area. This part will not cover all details of the `CMA` and `DMA`, because they are big. We will see much more details in the special part in the Linux Kernel Memory management which covers contiguous memory allocators and areas. + +Initialization of the sparse memory +-------------------------------------------------------------------------------- + +The next step is the call of the function - `x86_init.paging.pagetable_init`. If you try to find this function in the linux kernel source code, in the end of your search, you will see the following macro: + +```C +#define native_pagetable_init paging_init +``` + +which expands as you can see to the call of the `paging_init` function from the [arch/x86/mm/init_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init_64.c). The `paging_init` function initializes sparse memory and zone sizes. First of all, what are zones and what is `Sparsemem`? The `Sparsemem` is a special foundation in the linux kernel memory manager which is used to split a memory area into different memory banks in the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) systems.
Let's look on the implementation of the `paging_init` function: + +```C +void __init paging_init(void) +{ + sparse_memory_present_with_active_regions(MAX_NUMNODES); + sparse_init(); + + node_clear_state(0, N_MEMORY); + if (N_MEMORY != N_NORMAL_MEMORY) + node_clear_state(0, N_NORMAL_MEMORY); + + zone_sizes_init(); +} +``` + +As you can see there is a call of the `sparse_memory_present_with_active_regions` function which records a memory area for every `NUMA` node to the array of the `mem_section` structure which contains a pointer to the structure of the array of `struct page`. The next `sparse_init` function allocates non-linear `mem_section` and `mem_map`. In the next step we clear the state of the movable memory nodes and initialize sizes of zones. Every `NUMA` node is divided into a number of pieces which are called - `zones`. So, the `zone_sizes_init` function from the [arch/x86/mm/init.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init.c) initializes the sizes of zones. + +Again, this part and next parts do not cover this theme in full details. There will be a special part about `NUMA`. + +vsyscall mapping +-------------------------------------------------------------------------------- + +The next step after `SparseMem` initialization is setting of the `trampoline_cr4_features` which must contain the content of the `cr4` [Control register](http://en.wikipedia.org/wiki/Control_register). First of all we need to check that the current CPU has support of the `cr4` register and if it has, we save its content to the `trampoline_cr4_features` which is storage for `cr4` in the real mode: + +```C +if (boot_cpu_data.cpuid_level >= 0) { + mmu_cr4_features = __read_cr4(); + if (trampoline_cr4_features) + *trampoline_cr4_features = mmu_cr4_features; +} +``` + +The next function which you can see is `map_vsyscall` from the [arch/x86/kernel/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_64.c).
This function maps memory space for [vsyscalls](https://lwn.net/Articles/446528/) and depends on `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option. Actually `vsyscall` is a special segment which provides fast access to the certain system calls like `getcpu`, etc. Let's look on implementation of this function: + +```C +void __init map_vsyscall(void) +{ + extern char __vsyscall_page; + unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page); + + if (vsyscall_mode != NONE) + __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall, + vsyscall_mode == NATIVE + ? PAGE_KERNEL_VSYSCALL + : PAGE_KERNEL_VVAR); + + BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != + (unsigned long)VSYSCALL_ADDR); +} +``` + +In the beginning of the `map_vsyscall` we can see definition of two variables. The first is extern variable `__vsyscall_page`. As a extern variable, it defined somewhere in other source code file. Actually we can see definition of the `__vsyscall_page` in the [arch/x86/kernel/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_emu_64.S). The `__vsyscall_page` symbol points to the aligned calls of the `vsyscalls` as `gettimeofday`, etc.: + +```assembly + .globl __vsyscall_page + .balign PAGE_SIZE, 0xcc + .type __vsyscall_page, @object +__vsyscall_page: + + mov $__NR_gettimeofday, %rax + syscall + ret + + .balign 1024, 0xcc + mov $__NR_time, %rax + syscall + ret + ... + ... + ... +``` + +The second variable is `physaddr_vsyscall` which just stores physical address of the `__vsyscall_page` symbol. 
In the next step we check the `vsyscall_mode` variable, which is `EMULATE` by default, and continue only if it is not equal to `NONE`: + +```C +static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE; +``` + +And after this check we can see the call of the `__set_fixmap` function which calls `native_set_fixmap` with the same parameters: + +```C +void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags) +{ + __native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags)); +} + +void __native_set_fixmap(enum fixed_addresses idx, pte_t pte) +{ + unsigned long address = __fix_to_virt(idx); + + if (idx >= __end_of_fixed_addresses) { + BUG(); + return; + } + set_pte_vaddr(address, pte); + fixmaps_set++; +} +``` + +Here we can see that `native_set_fixmap` makes the value of the `Page Table Entry` from the given physical address (physical address of the `__vsyscall_page` symbol in our case) and calls the internal function - `__native_set_fixmap`. The internal function gets the virtual address of the given `fixed_addresses` index (`VSYSCALL_PAGE` in our case) and checks that the given index is less than the end of the fix-mapped addresses. After this we set the page table entry with the call of the `set_pte_vaddr` function and increase the count of the fix-mapped addresses. And in the end of the `map_vsyscall` we check with the `BUILD_BUG_ON` macro that the virtual address of the `VSYSCALL_PAGE` (which is first index in the `fixed_addresses`) is equal to `VSYSCALL_ADDR`, which is `-10UL << 20` or `ffffffffff600000`. The build fails if these two values differ: + +```C +BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != + (unsigned long)VSYSCALL_ADDR); +``` + +Now the `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`. If you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). We will see more about `vsyscalls` in the `vsyscalls and vdso` part.
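
The arithmetic behind the `BUILD_BUG_ON` check in `map_vsyscall` can be tried in user space. The following sketch assumes `4` kilobyte pages, a `PMD_SHIFT` of `21` and the `x86_64` definitions of `FIXADDR_TOP` and `VSYSCALL_PAGE` as I remember them from the `arch/x86/include/asm/fixmap.h` header, so treat these macros as assumptions and not as an exact copy of the kernel headers. The `ROUND_UP` macro, the `fix_to_virt` function and the `check_vsyscall_fixmap` helper are hypothetical names for this sketch:

```C
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(UINT64_C(1) << PAGE_SHIFT)
#define PMD_SHIFT	21	/* assumption: 2 megabyte PMD entries */

/* The same value as -10UL << 20, written with a fixed-width type. */
#define VSYSCALL_ADDR	((uint64_t)-10 << 20)

/* Round x up to a multiple of y, where y must be a power of two. */
#define ROUND_UP(x, y)	((((x) - 1) | ((y) - 1)) + 1)

/* Assumed x86_64 definitions, see the note above. */
#define FIXADDR_TOP \
	(ROUND_UP(VSYSCALL_ADDR + PAGE_SIZE, UINT64_C(1) << PMD_SHIFT) - PAGE_SIZE)
#define VSYSCALL_PAGE	((FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT)

/* Fix-mapped addresses grow down from FIXADDR_TOP, one page per index. */
static uint64_t fix_to_virt(uint64_t idx)
{
	return FIXADDR_TOP - (idx << PAGE_SHIFT);
}

/* Run-time analog of the BUILD_BUG_ON check from map_vsyscall. */
static void check_vsyscall_fixmap(void)
{
	assert(fix_to_virt(VSYSCALL_PAGE) == VSYSCALL_ADDR);
}
```

Under these assumptions `VSYSCALL_ADDR` evaluates to `0xffffffffff600000` and the index of the `VSYSCALL_PAGE` is chosen exactly so that its fix-mapped virtual address lands on this value.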
+ +Getting the SMP configuration +-------------------------------------------------------------------------------- + +You may remember how we made a search of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). Now we need to get the `SMP` configuration if we found it. For this we check the `smp_found_config` variable which we set in the `smp_scan_config` function (read about it in the previous part) and call the `get_smp_config` function: + +```C +if (smp_found_config) + get_smp_config(); +``` + +The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which is defined in the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). This function defines a pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html)) and does some checks: + +```C +struct mpf_intel *mpf = mpf_found; + +if (!mpf) + return; + +if (acpi_lapic && early) + return; +``` + +Here we check that the multiprocessor configuration was found in the `smp_scan_config` function and just return from the function if it was not. The next check returns if both `acpi_lapic` and `early` are set. After these checks we start to read the `SMP` configuration. As we finished reading it, the next step is the `prefill_possible_map` function which makes preliminary filling of the possible CPU's `cpumask` (more about it you can read in the [Introduction to the cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)). + +The rest of the setup_arch +-------------------------------------------------------------------------------- + +Here we are getting to the end of the `setup_arch` function.
The rest of the function of course is important, but details about it will not be included in this part. We will just take a short look on these functions, because although they are important as I wrote above, they cover non-generic kernel features related with the `NUMA`, `SMP`, `ACPI` and `APICs`, etc. First of all, the next call is the `init_apic_mappings` function. As we can understand this function sets the address of the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The next is `x86_io_apic_ops.init` and this function initializes I/O APIC. Please note that we will see all details related with `APIC` in the chapter about interrupts and exceptions handling. In the next step we reserve standard I/O resources like `DMA`, `TIMER`, `FPU`, etc., with the call of the `x86_init.resources.reserve_resources` function. Following is the `mcheck_init` function which initializes `Machine check Exception` and the last is `register_refined_jiffies` which registers [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29) (There will be separate chapter about timers in the kernel). + +So that's all. Finally we have finished with the big `setup_arch` function in this part. Of course as I already wrote many times, we did not see full details about this function, but do not worry about it. We will be back more than once to this function from different chapters for understanding how different platform-dependent parts are initialized. + +That's all, and now we can go back to the `start_kernel` from the `setup_arch`. + +Back to the main.c +================================================================================ + +As I wrote above, we have finished with the `setup_arch` function and now we can go back to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As you may remember or saw yourself, the `start_kernel` function is as big as the `setup_arch`.
So a couple of the next parts will be dedicated to learning of this function. So, let's continue with it. After the `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) pointer to the memory descriptor `cpumask`. We can look on its implementation: + +```C +static inline void mm_init_cpumask(struct mm_struct *mm) +{ +#ifdef CONFIG_CPUMASK_OFFSTACK + mm->cpu_vm_mask_var = &mm->cpumask_allocation; +#endif + cpumask_clear(mm->cpu_vm_mask_var); +} +``` + +As you can see in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we pass the memory descriptor of the init process to the `mm_init_cpumask`. If the `CONFIG_CPUMASK_OFFSTACK` configuration option is enabled, we first point `cpu_vm_mask_var` to the `cpumask_allocation`, and in any case we clear the [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer) switch `cpumask`. + +In the next step we can see the call of the following function: + +```C +setup_command_line(command_line); +``` + +This function takes a pointer to the kernel command line and allocates a couple of buffers to store the command line. We need a couple of buffers, because one buffer is used for future reference and accessing to the command line and one for parameter parsing. We will allocate space for the following buffers: + +* `saved_command_line` - will contain boot command line; +* `initcall_command_line` - will contain boot command line. will be used in the `do_initcall_level`; +* `static_command_line` - will contain command line for parameters parsing. + +We will allocate space with the `memblock_virt_alloc` function. This function calls `memblock_virt_alloc_try_nid` which allocates boot memory block with `memblock_reserve` if [slab](http://en.wikipedia.org/wiki/Slab_allocation) is not available or uses `kzalloc_node` (more about it will be in the linux memory management chapter).
The `memblock_virt_alloc` uses `BOOTMEM_LOW_LIMIT` (physical address of the `(PAGE_OFFSET + 0x1000000)` value) and `BOOTMEM_ALLOC_ACCESSIBLE` (equal to the current value of the `memblock.current_limit`) as minimum address of the memory region and maximum address of the memory region. + +Let's look on the implementation of the `setup_command_line`: + +```C +static void __init setup_command_line(char *command_line) +{ + saved_command_line = + memblock_virt_alloc(strlen(boot_command_line) + 1, 0); + initcall_command_line = + memblock_virt_alloc(strlen(boot_command_line) + 1, 0); + static_command_line = memblock_virt_alloc(strlen(command_line) + 1, 0); + strcpy(saved_command_line, boot_command_line); + strcpy(static_command_line, command_line); +} +``` + +Here we can see that we allocate space for the three buffers which will contain the kernel command line for the different purposes (read above). And as we allocated space, we store `boot_command_line` in the `saved_command_line` and `command_line` (kernel command line from the `setup_arch`) to the `static_command_line`. + +The next function after the `setup_command_line` is the `setup_nr_cpu_ids`. This function sets `nr_cpu_ids` (number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter which describes the [cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) concept). Let's look on its implementation: + +```C +void __init setup_nr_cpu_ids(void) +{ + nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1; +} +``` + +Here `nr_cpu_ids` represents number of CPUs, `NR_CPUS` represents the maximum number of CPUs which we can set in configuration time: + +![CONFIG_NR_CPUS](http://oi59.tinypic.com/28mh45h.jpg) + +Actually we need to call this function, because `NR_CPUS` can be greater than the actual amount of the CPUs in your computer.
Here we can see that we call `find_last_bit` function and pass two parameters to it: + +* `cpu_possible_mask` bits; +* maximum number of CPUS. + +In the `setup_arch` we can find the call of the `prefill_possible_map` function which calculates and writes to the `cpu_possible_mask` actual number of the CPUs. We call the `find_last_bit` function which takes the address and maximum size to search and returns the bit number of the last set bit. We passed `cpu_possible_mask` bits and maximum number of the CPUs. First of all the `find_last_bit` function splits given `unsigned long` address to the [words](http://en.wikipedia.org/wiki/Word_%28computer_architecture%29): + +```C +words = size / BITS_PER_LONG; +``` + +where `BITS_PER_LONG` is `64` on the `x86_64`. As we got the amount of words in the given size of the search data, we need to check whether the given size contains a partial word with the following check: + +```C +if (size & (BITS_PER_LONG-1)) { + tmp = (addr[words] & (~0UL >> (BITS_PER_LONG + - (size & (BITS_PER_LONG-1))))); + if (tmp) + goto found; +} +``` + +If it contains a partial word, we mask the last word and check it. If the last word is not zero, it means that the current word contains at least one set bit. We go to the `found` label: + +```C +found: + return words * BITS_PER_LONG + __fls(tmp); +``` + +Here you can see the `__fls` function which returns the last set bit in a given word with help of the `bsr` instruction: + +```C +static inline unsigned long __fls(unsigned long word) +{ + asm("bsr %1,%0" + : "=r" (word) + : "rm" (word)); + return word; +} +``` + +The `bsr` instruction scans the given operand for the most significant set bit. If the last word is not partial, or the partial word does not contain a set bit, we go through all words in the given address from the end and try to find the last set bit: + +```C +while (words) { + tmp = addr[--words]; + if (tmp) { +found: + return words * BITS_PER_LONG + __fls(tmp); + } +} +``` + +Here we put the last word to the `tmp` variable and check that `tmp` contains at least one set bit.
If a set bit is found, we return the number of this bit. If none of the words contains a set bit, we just return the given size: + +```C +return size; +``` + +After this `nr_cpu_ids` will contain the correct amount of the available CPUs. + +That's all. + +Conclusion +================================================================================ + +It is the end of the seventh part about the linux kernel initialization process. In this part, finally we have finished with the `setup_arch` function and returned to the `start_kernel` function. In the next part we will continue to learn generic kernel code from the `start_kernel` and will continue our way to the first `init` process. + +If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). + +**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +================================================================================ + +* [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface) +* [x86_64](http://en.wikipedia.org/wiki/X86-64) +* [initrd](http://en.wikipedia.org/wiki/Initrd) +* [Kernel panic](http://en.wikipedia.org/wiki/Kernel_panic) +* [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt) +* [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) +* [Direct memory access](http://en.wikipedia.org/wiki/Direct_memory_access) +* [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) +* [Control register](http://en.wikipedia.org/wiki/Control_register) +* [vsyscalls](https://lwn.net/Articles/446528/) +* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) +* [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29) +* [Previous
part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html) diff --git a/Initialization/linux-initialization-8.md b/Initialization/linux-initialization-8.md new file mode 100644 index 0000000..e810218 --- /dev/null +++ b/Initialization/linux-initialization-8.md @@ -0,0 +1,479 @@ +Kernel initialization. Part 8. +================================================================================ + +Scheduler initialization +================================================================================ + +This is the eighth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process and we stopped on the `setup_nr_cpu_ids` function in the [previous](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) part. The main point of the current part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `setup_per_cpu_areas` function. This function setups areas for the `percpu` variables, more about it you can read in the special part about the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html). After `percpu` areas is up and running, the next step is the `smp_prepare_boot_cpu` function. 
This function does some preparations for the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing): + +```C +static inline void smp_prepare_boot_cpu(void) +{ + smp_ops.smp_prepare_boot_cpu(); +} +``` + +where the `smp_prepare_boot_cpu` expands to the call of the `native_smp_prepare_boot_cpu` function (more about `smp_ops` will be in the special parts about `SMP`): + +```C +void __init native_smp_prepare_boot_cpu(void) +{ + int me = smp_processor_id(); + switch_to_new_gdt(me); + cpumask_set_cpu(me, cpu_callout_mask); + per_cpu(cpu_state, me) = CPU_ONLINE; +} +``` + +The `native_smp_prepare_boot_cpu` function gets the id of the current CPU (which is Bootstrap processor and its `id` is zero) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we already saw it in the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. As we got processor `id` number we reload [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function: + +```C +void switch_to_new_gdt(int cpu) +{ + struct desc_ptr gdt_descr; + + gdt_descr.address = (long)get_cpu_gdt_table(cpu); + gdt_descr.size = GDT_SIZE - 1; + load_gdt(&gdt_descr); + load_percpu_segment(cpu); +} +``` + +The `gdt_descr` variable represents pointer to the `GDT` descriptor here (we already saw `desc_ptr` in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). 
We get the address and the size of the `GDT` descriptor where `GDT_SIZE` is `256` or: + +```C +#define GDT_SIZE (GDT_ENTRIES * 8) +``` + +and the address of the descriptor we will get with the `get_cpu_gdt_table`: + +```C +static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu) +{ + return per_cpu(gdt_page, cpu).gdt; +} +``` + +The `get_cpu_gdt_table` uses `per_cpu` macro for getting `gdt_page` percpu variable for the given CPU number (bootstrap processor with `id` - 0 in our case). You may ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we already saw it in this book. If you have read the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/head_64.S): + +```assembly +early_gdt_descr: + .word GDT_ENTRIES*8-1 +early_gdt_descr_base: + .quad INIT_PER_CPU_VAR(gdt_page) +``` + +and if we will look on the [linker](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) file we can see that it locates after the `__per_cpu_load` symbol: + +```C +#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load +INIT_PER_CPU(gdt_page); +``` + +and filled `gdt_page` in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c#L94): + +```C +DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { +#ifdef CONFIG_X86_64 + [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff), + [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff), + [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff), + [GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff), + [GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff), + [GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff), + ... 
+ ... + ... +``` + +You can read more about `percpu` variables in the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) part. As we got the address and size of the `GDT` descriptor, we reload the `GDT` with `load_gdt`, which just executes the `lgdt` instruction, and load the `percpu` segment with the following function: + +```C +void load_percpu_segment(int cpu) { + loadsegment(gs, 0); + wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu)); + load_stack_canary_segment(); +} +``` + +The `gs` register (or the `fs` register for `x86`) must contain the base address of the `percpu` area, so we use the `loadsegment` macro and pass `gs` to it. In the next step we write the base address of the [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) stack to the `MSR_GS_BASE` register and set up the stack [canary](http://en.wikipedia.org/wiki/Buffer_overflow_protection) (this is only for `x86_32`). After we load the new `GDT`, we set the bit of the current cpu in the `cpu_callout_mask` bitmap and mark the cpu as online by setting the `cpu_state` percpu variable of the current processor to `CPU_ONLINE`: + +```C +cpumask_set_cpu(me, cpu_callout_mask); +per_cpu(cpu_state, me) = CPU_ONLINE; +``` + +So, what is the `cpu_callout_mask` bitmap? As we initialized the bootstrap processor (the processor which is booted first on `x86`), the other processors in a multiprocessor system are known as `secondary processors`. The Linux kernel uses the following two bitmasks: + +* `cpu_callout_mask` +* `cpu_callin_mask` + +After the bootstrap processor is initialized, it updates the `cpu_callout_mask` to indicate which secondary processor can be initialized next. A secondary processor can do some initialization work beforehand and then checks the `cpu_callout_mask` for its own bit. Only after the bootstrap processor has set the bit of this secondary processor in the `cpu_callout_mask` will the secondary processor continue with the rest of its initialization. 
After a secondary processor finishes its initialization, it sets its bit in the `cpu_callin_mask`. Once the bootstrap processor finds the bit of the current secondary processor in the `cpu_callin_mask`, the bootstrap processor repeats the same procedure for one of the remaining secondary processors. In short, it works as I described, but we will see more details in the chapter about `SMP`. + +That's all. We did all `SMP` boot preparation. + +Build zonelists +----------------------------------------------------------------------- + +In the next step we can see the call of the `build_all_zonelists` function. This function sets up the order of zones that allocations are preferred from. We will understand soon what zones are and what this order is. For a start, let's see how the linux kernel considers physical memory. Physical memory is split into banks which are called `nodes`. If you have no hardware support for `NUMA`, you will see only one node: + +``` +$ cat /sys/devices/system/node/node0/numastat +numa_hit 72452442 +numa_miss 0 +numa_foreign 0 +interleave_hit 12925 +local_node 72452442 +other_node 0 +``` + +Every `node` is represented by the `struct pglist_data` in the linux kernel. Each node is divided into a number of special blocks which are called `zones`. Every zone is represented by the `struct zone` in the linux kernel and has one of the following types: + +* `ZONE_DMA` - 0-16M; +* `ZONE_DMA32` - used for 32 bit devices that can only do DMA in areas below 4G; +* `ZONE_NORMAL` - all RAM from the 4GB on the `x86_64`; +* `ZONE_HIGHMEM` - absent on the `x86_64`; +* `ZONE_MOVABLE` - zone which contains movable pages. + +which are represented by the `zone_type` enum. We can get information about zones with: + +``` +$ cat /proc/zoneinfo +Node 0, zone DMA + pages free 3975 + min 3 + low 3 + ... + ... +Node 0, zone DMA32 + pages free 694163 + min 875 + low 1093 + ... + ... +Node 0, zone Normal + pages free 2529995 + min 3146 + low 3932 + ... + ... 
+``` + +As I wrote above all nodes are described with the `pglist_data` or `pg_data_t` structure in memory. This structure is defined in the [include/linux/mmzone.h](https://github.com/torvalds/linux/blob/master/include/linux/mmzone.h). The `build_all_zonelists` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c) constructs an ordered `zonelist` (of different zones `DMA`, `DMA32`, `NORMAL`, `HIGH_MEMORY`, `MOVABLE`) which specifies the zones/nodes to visit when a selected `zone` or `node` cannot satisfy the allocation request. That's all. More about `NUMA` and multiprocessor systems will be in the special part. + +The rest of the stuff before scheduler initialization +-------------------------------------------------------------------------------- + +Before we will start to dive into linux kernel scheduler initialization process we must do a couple of things. The first thing is the `page_alloc_init` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c). This function looks pretty easy: + +```C +void __init page_alloc_init(void) +{ + hotcpu_notifier(page_alloc_cpu_notify, 0); +} +``` + +and initializes handler for the `CPU` [hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). Of course the `hotcpu_notifier` depends on the +`CONFIG_HOTPLUG_CPU` configuration option and if this option is set, it just calls `cpu_notifier` macro which expands to the call of the `register_cpu_notifier` which adds hotplug cpu handler (`page_alloc_cpu_notify` in our case). + +After this we can see the kernel command line in the initialization output: + +![kernel command line](http://oi58.tinypic.com/2m7vz10.jpg) + +And a couple of functions such as `parse_early_param` and `parse_args` which handles linux kernel command line. 
You may remember that we already saw the call of the `parse_early_param` function in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why do we call it again? The answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architectures call this function. And we need to call the second function, `parse_args`, to parse and handle non-early command line arguments. + +In the next step we can see the call of the `jump_label_init` function from the [kernel/jump_label.c](https://github.com/torvalds/linux/blob/master/kernel/jump_label.c), which initializes [jump labels](https://lwn.net/Articles/412072/). + +After this we can see the call of the `setup_log_buf` function which sets up the [printk](http://www.makelinux.net/books/lkd2/ch18lev1sec3) log buffer. We already saw this function in the seventh [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the linux kernel initialization process chapter. + +PID hash initialization +-------------------------------------------------------------------------------- + +The next is the `pidhash_init` function. As you know, each process is assigned a unique number which is called the `process identification number` or `PID`. Each process generated with fork or clone is automatically assigned a new unique `PID` value by the kernel. The management of `PIDs` is centered around two special data structures: `struct pid` and `struct upid`. The first structure represents information about a `PID` in the kernel. The second structure represents the information that is visible in a specific namespace. All `PID` instances are stored in a special hash table: + +```C +static struct hlist_head *pid_hash; +``` + +This hash table is used to find the pid instance that belongs to a numeric `PID` value. So, `pidhash_init` initializes this hash table. 
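To make the idea of a table that maps a numeric `PID` value to a pid instance more concrete, here is a minimal user-space sketch. Everything in it is hypothetical: `struct demo_pid`, the `demo_*` functions, the fixed table of 16 buckets and the trivial mask used as a hash function are illustration only and are not the kernel's `struct pid`, `hlist` API or hash function.

```C
#include <assert.h>
#include <stddef.h>

#define DEMO_HASH_BITS 4                       /* hypothetical: 2^4 buckets */
#define DEMO_HASH_SIZE (1u << DEMO_HASH_BITS)

/* Hypothetical, simplified stand-in for the kernel's struct pid. */
struct demo_pid {
    int nr;                  /* numeric PID value */
    struct demo_pid *next;   /* next entry chained in the same bucket */
};

/* One chain head per bucket; zero-initialized, so every chain starts empty. */
struct demo_pid *demo_pid_hash[DEMO_HASH_SIZE];

/* Map a numeric PID to a bucket index (a plain mask, not the kernel's hash). */
unsigned int demo_hashfn(int nr)
{
    return (unsigned int)nr & (DEMO_HASH_SIZE - 1);
}

/* Put a pid instance at the head of its bucket chain. */
void demo_pid_insert(struct demo_pid *pid)
{
    unsigned int bucket;

    assert(pid != NULL);
    bucket = demo_hashfn(pid->nr);
    pid->next = demo_pid_hash[bucket];
    demo_pid_hash[bucket] = pid;
}

/* Find the pid instance that belongs to a numeric PID value, or NULL. */
struct demo_pid *demo_pid_find(int nr)
{
    struct demo_pid *p;

    for (p = demo_pid_hash[demo_hashfn(nr)]; p != NULL; p = p->next)
        if (p->nr == nr)
            return p;
    return NULL;
}
```

Two values that fall into the same bucket (for example `1` and `17` with 16 buckets) simply end up on the same chain, and the lookup walks that chain until the number matches.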
At the start of the `pidhash_init` function we can see the call of the `alloc_large_system_hash`: + +```C +pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18, + HASH_EARLY | HASH_SMALL, + &pidhash_shift, NULL, + 0, 4096); +``` + +The number of elements of the `pid_hash` depends on the `RAM` configuration, but it can be between `2^4` and `2^12`. The `pidhash_init` computes the size +and allocates the required storage (which is `hlist` in our case - similar to a [doubly linked list](http://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html), but its head, the [struct hlist_head](https://github.com/torvalds/linux/blob/master/include/linux/types.h), contains only one pointer). The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass the `HASH_EARLY` flag (as in our case) or with `__vmalloc` if we do not pass this flag. + +We can see the result in the `dmesg` output: + +``` +$ dmesg | grep hash +[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes) +... +... +... +``` + +That's all. The rest of the stuff before scheduler initialization is the following functions: `vfs_caches_init_early` does early initialization of the [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) (more about it will be in the chapter which will describe the virtual file system), `sort_main_extable` sorts the kernel's built-in exception table entries which are between `__start___ex_table` and `__stop___ex_table`, and `trap_init` initializes trap handlers (we will learn more about the last two functions in the separate chapter about interrupts). + +The last step before the scheduler initialization is the initialization of the memory manager with the `mm_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). 
As we can see, the `mm_init` function initializes different parts of the linux kernel memory manager: + +```C +page_ext_init_flatmem(); +mem_init(); +kmem_cache_init(); +percpu_init_late(); +pgtable_init(); +vmalloc_init(); +``` + +The first is `page_ext_init_flatmem`, which depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes the kernel cache, the `percpu_init_late` replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` initializes the `page->ptl` kernel cache, and the `vmalloc_init` initializes `vmalloc`. Please **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of them in the [Linux kernel memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter. + +That's all. Now we can look at the `scheduler`. + +Scheduler initialization +-------------------------------------------------------------------------------- + +And now we come to the main purpose of this part - initialization of the task scheduler. I want to say again, as I already did many times, that you will not see the full explanation of the scheduler here; there will be a special chapter about this. OK, the next point is the `sched_init` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) and, as we can understand from the function's name, it initializes the scheduler. Let's start to dive into this function and try to understand how the scheduler is initialized. 
At the start of the `sched_init` function we can see the following code: + +```C +#ifdef CONFIG_FAIR_GROUP_SCHED + alloc_size += 2 * nr_cpu_ids * sizeof(void **); +#endif +#ifdef CONFIG_RT_GROUP_SCHED + alloc_size += 2 * nr_cpu_ids * sizeof(void **); +#endif +``` + +First of all we can see two configuration options here: + +* `CONFIG_FAIR_GROUP_SCHED` +* `CONFIG_RT_GROUP_SCHED` + +These two options provide two different planning models. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt), the current scheduler - `CFS` or `Completely Fair Scheduler` - uses a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive `1/n` of the processor time, where `n` is the number of runnable processes. The scheduler uses a special set of rules. These rules determine when and how to select a new process to run and they are called the `scheduling policy`. The Completely Fair Scheduler supports the following `normal` or `non-real-time` scheduling policies: `SCHED_NORMAL`, `SCHED_BATCH` and `SCHED_IDLE`. The `SCHED_NORMAL` is used for most normal applications, where the amount of cpu each process consumes is mostly determined by the [nice](http://en.wikipedia.org/wiki/Nice_%28Unix%29) value, the `SCHED_BATCH` is used for 100% non-interactive tasks and the `SCHED_IDLE` runs tasks only when the processor has no other task to run. The `real-time` policies are also supported for time-critical applications: `SCHED_FIFO` and `SCHED_RR`. If you've read something about the Linux kernel scheduler, you may know that it is modular. It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called `scheduler classes`. These modules encapsulate scheduling policy details and are handled by the scheduler core without knowing too much about them. 
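The `1/n` model described above is easy to express as arithmetic. The following sketch is only an illustration of that ideal model and not of how `CFS` is implemented (the real scheduler tracks virtual runtime per task). The function names, the notion of a fixed "period" and the weights are hypothetical and were chosen just for this example.

```C
#include <assert.h>
#include <stdint.h>

/*
 * Ideal share of one scheduling period for a task when nr_runnable
 * tasks of equal importance compete for the processor: period / n.
 */
uint64_t demo_equal_share_ns(uint64_t period_ns, unsigned int nr_runnable)
{
    assert(nr_runnable > 0);
    return period_ns / nr_runnable;
}

/*
 * Share proportional to a task's weight. The weights are arbitrary
 * demo numbers, not the kernel's nice-to-weight table.
 */
uint64_t demo_weighted_share_ns(uint64_t period_ns, uint64_t weight,
                                uint64_t total_weight)
{
    assert(total_weight > 0 && weight <= total_weight);
    return period_ns * weight / total_weight;
}
```

With a hypothetical period of `20ms` and four equal tasks, each task would ideally get `5ms`. If one task carried three quarters of the total weight, it would get `15ms` of that period.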
+ + +Now let's get back to our code and look at the two configuration options `CONFIG_FAIR_GROUP_SCHED` and `CONFIG_RT_GROUP_SCHED`. By default the scheduler operates on individual tasks. These options allow it to schedule groups of tasks (you can read more about it in [CFS group scheduling](http://lwn.net/Articles/240474/)). We can see that for each enabled option the `alloc_size` variable is increased by the `2 * nr_cpu_ids * sizeof(void **)` expression. It represents the size, based on the number of processors, of the memory needed for the `sched_entity` and `cfs_rq` pointer arrays, and this memory is then allocated with `kzalloc`: + +```C +ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT); + +#ifdef CONFIG_FAIR_GROUP_SCHED + root_task_group.se = (struct sched_entity **)ptr; + ptr += nr_cpu_ids * sizeof(void **); + + root_task_group.cfs_rq = (struct cfs_rq **)ptr; + ptr += nr_cpu_ids * sizeof(void **); +#endif + +``` + +The `sched_entity` is a structure which is defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and used by the scheduler to keep track of process accounting. The `cfs_rq` represents a [run queue](http://en.wikipedia.org/wiki/Run_queue). So, you can see that we allocated space of size `alloc_size` for the run queue and scheduler entity of the `root_task_group`. The `root_task_group` is an instance of the `task_group` structure from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) which contains task group related information: + +```C +struct task_group { + ... + ... + struct sched_entity **se; + struct cfs_rq **cfs_rq; + ... + ... +} +``` + +The root task group is the task group to which every task in the system belongs. 
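The trick of making one zeroed allocation and then carving it into several per-CPU pointer arrays can be tried out in user space. This is a minimal sketch under stated assumptions: `calloc` stands in for `kzalloc`, and `struct demo_group`, `struct demo_entity`, `struct demo_rq` and `demo_group_init` are hypothetical names that only mimic the shape of the kernel code above.

```C
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_entity;   /* hypothetical stand-in for struct sched_entity */
struct demo_rq;       /* hypothetical stand-in for struct cfs_rq */

struct demo_group {
    struct demo_entity **se;   /* one pointer slot per CPU */
    struct demo_rq **rq;       /* one pointer slot per CPU */
    void *block;               /* start of the single backing allocation */
};

/*
 * Allocate one zeroed block that is large enough for two arrays of
 * nr_cpus pointers and point both array fields into it.
 * Returns 0 on success and -1 if the allocation fails.
 */
int demo_group_init(struct demo_group *g, unsigned int nr_cpus)
{
    size_t alloc_size = 2 * (size_t)nr_cpus * sizeof(void *);
    uintptr_t ptr;

    assert(g != NULL && nr_cpus > 0);

    g->block = calloc(1, alloc_size);   /* user-space stand-in for kzalloc() */
    if (g->block == NULL)
        return -1;

    ptr = (uintptr_t)g->block;
    g->se = (struct demo_entity **)ptr;   /* first array starts at the block */
    ptr += nr_cpus * sizeof(void *);      /* skip over the first array */
    g->rq = (struct demo_rq **)ptr;       /* second array follows directly */
    return 0;
}
```

A caller would run `demo_group_init(&g, 4)` once and later release everything with a single `free(g.block)`. One allocation instead of two is the whole point of the pattern: both arrays live and die together.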
As we allocated space for the root task group scheduler entity and runqueue, we go over all possible CPUs (`cpu_possible_mask` bitmap) and allocate zeroed memory from a particular memory node with the `kzalloc_node` function for the `load_balance_mask` `percpu` variable: + +```C +DECLARE_PER_CPU(cpumask_var_t, load_balance_mask); +``` + +Here `cpumask_var_t` is the `cpumask_t` with one difference: a `cpumask_var_t` is allocated with only `nr_cpu_ids` bits, while a `cpumask_t` always has `NR_CPUS` bits (you can read more about `cpumask` in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part). As you can see: + +```C +#ifdef CONFIG_CPUMASK_OFFSTACK + for_each_possible_cpu(i) { + per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node( + cpumask_size(), GFP_KERNEL, cpu_to_node(i)); + } +#endif +``` + +this code depends on the `CONFIG_CPUMASK_OFFSTACK` configuration option. This configuration option says to use dynamic allocation for `cpumask`, instead of putting it on the stack. All groups have to be able to rely on the amount of CPU time. With the call of the two following functions: + +```C +init_rt_bandwidth(&def_rt_bandwidth, + global_rt_period(), global_rt_runtime()); +init_dl_bandwidth(&def_dl_bandwidth, + global_rt_period(), global_rt_runtime()); +``` + +we initialize bandwidth management for the real-time and `SCHED_DEADLINE` tasks. These functions initialize the `rt_bandwidth` and `dl_bandwidth` structures, which store information about the maximum real-time and `deadline` bandwidth of the system. 
For example, let's look at the implementation of the `init_rt_bandwidth` function: + +```C +void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime) +{ + rt_b->rt_period = ns_to_ktime(period); + rt_b->rt_runtime = runtime; + + raw_spin_lock_init(&rt_b->rt_runtime_lock); + + hrtimer_init(&rt_b->rt_period_timer, + CLOCK_MONOTONIC, HRTIMER_MODE_REL); + rt_b->rt_period_timer.function = sched_rt_period_timer; +} +``` + +It takes three parameters: + +* address of the `rt_bandwidth` structure which contains information about the allocated and consumed quota within a period; +* `period` - period over which real-time task bandwidth enforcement is measured, in nanoseconds (note the `ns_to_ktime` conversion in the code above); +* `runtime` - part of the period during which we allow tasks to run, in nanoseconds. + +As `period` and `runtime` we pass the results of the `global_rt_period` and `global_rt_runtime` functions. These are `1s` and `0.95s` by default. The `rt_bandwidth` structure is defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) and looks like this: + +```C +struct rt_bandwidth { + raw_spinlock_t rt_runtime_lock; + ktime_t rt_period; + u64 rt_runtime; + struct hrtimer rt_period_timer; +}; +``` + +As you can see, it contains `runtime` and `period` and also the two following fields: + +* `rt_runtime_lock` - [spinlock](http://en.wikipedia.org/wiki/Spinlock) for the `rt_time` protection; +* `rt_period_timer` - [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt) for unthrottling of real-time tasks. + +So, in the `init_rt_bandwidth` we initialize the `rt_bandwidth` period and runtime with the given parameters, and initialize the spinlock and the high-resolution timer. In the next step, depending on whether [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) is enabled, we initialize the root domain: + +```C +#ifdef CONFIG_SMP + init_defrootdomain(); +#endif +``` + +The real-time scheduler requires global resources to make scheduling decisions. 
But unfortunately scalability bottlenecks appear as the number of CPUs increase. The concept of root domains was introduced for improving scalability. The linux kernel provides a special mechanism for assigning a set of CPUs and memory nodes to a set of tasks and it is called - `cpuset`. If a `cpuset` contains non-overlapping with other `cpuset` CPUs, it is `exclusive cpuset`. Each exclusive cpuset defines an isolated domain or `root domain` of CPUs partitioned from other cpusets or CPUs. A `root domain` is presented by the `struct root_domain` from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) in the linux kernel and its main purpose is to narrow the scope of the global variables to per-domain variables and all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details about it in the chapter about real-time scheduler. + +After `root domain` initialization, we make initialization of the bandwidth for the real-time tasks of the root task group as we did it above: + +```C +#ifdef CONFIG_RT_GROUP_SCHED + init_rt_bandwidth(&root_task_group.rt_bandwidth, + global_rt_period(), global_rt_runtime()); +#endif +``` + +In the next step, depends on the `CONFIG_CGROUP_SCHED` kernel configuration option we initialize the `siblings` and `children` lists of the root task group. As we can read from the documentation, the `CONFIG_CGROUP_SCHED` is: + +``` +This option allows you to create arbitrary task groups using the "cgroup" pseudo +filesystem and control the cpu bandwidth allocated to each such task group. 
+``` + +As we finished with the lists initialization, we can see the call of the `autogroup_init` function: + +```C +#ifdef CONFIG_CGROUP_SCHED + list_add(&root_task_group.list, &task_groups); + INIT_LIST_HEAD(&root_task_group.children); + INIT_LIST_HEAD(&root_task_group.siblings); + autogroup_init(&init_task); +#endif +``` + +which initializes automatic process group scheduling. + +After this we go through all `possible` CPUs (you may remember that the `cpu_possible_mask` bitmap stores the CPUs that can ever be available in the system) and initialize a `runqueue` for each possible cpu: + +```C +for_each_possible_cpu(i) { + struct rq *rq; + ... + ... + ... +``` + +Each processor has its own locking and individual runqueue. All runnable tasks are stored in an active array and indexed according to their priority. When a process consumes its time slice, it is moved to an expired array. All of these arrays are stored in a special structure which is named `runqueue`. As there is no global lock and no global runqueue, we go through all possible CPUs and initialize a runqueue for every cpu. The `runqueue` is represented by the `rq` structure in the linux kernel, which is defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h). + +```C +rq = cpu_rq(i); +raw_spin_lock_init(&rq->lock); +rq->nr_running = 0; +rq->calc_load_active = 0; +rq->calc_load_update = jiffies + LOAD_FREQ; +init_cfs_rq(&rq->cfs); +init_rt_rq(&rq->rt); +init_dl_rq(&rq->dl); +rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime; +``` + +Here we get the runqueue of every CPU with the `cpu_rq` macro, which returns the `runqueues` percpu variable, and start to initialize it: the runqueue lock, the number of running tasks, the `calc_load` related fields (`calc_load_active` and `calc_load_update`) which are used in the reckoning of the CPU load, and the completely fair, real-time and deadline related fields of the runqueue. 
After this we initialize the `cpu_load` array with zeros and set the last load update tick to the value of the `jiffies` variable, which holds the number of timer ticks since the system boot: + +```C +for (j = 0; j < CPU_LOAD_IDX_MAX; j++) + rq->cpu_load[j] = 0; + +rq->last_load_update_tick = jiffies; +``` + +where `cpu_load` keeps the history of runqueue loads in the past; for now `CPU_LOAD_IDX_MAX` is 5. In the next step we fill the `runqueue` fields which are related to [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), but we will not cover them in this part. And at the end of the loop we initialize the high-resolution timer for the given `runqueue` and set the number of `iowait` tasks (more about it in the separate part about the scheduler) to zero: + +```C +init_rq_hrtick(rq); +atomic_set(&rq->nr_iowait, 0); +``` + +Now we come out of the `for_each_possible_cpu` loop, and next we need to set the load weight for the `init` task with the `set_load_weight` function. The weight of a process is calculated through its dynamic priority, which is static priority + scheduling class of the process. After this we increase the memory usage counter of the memory descriptor of the `init` process and set the scheduler class for the current process: + +```C +atomic_inc(&init_mm.mm_count); +current->sched_class = &fair_sched_class; +``` + +And we make the current process (it will be the first `init` process) `idle` and update the value of `calc_load_update` with the 5 seconds interval: + +```C +init_idle(current, smp_processor_id()); +calc_load_update = jiffies + LOAD_FREQ; +``` + +So, the `init` process will run when there are no other candidates (as it is the first process in the system). In the end we just set the `scheduler_running` variable: + +```C +scheduler_running = 1; +``` + +That's all. The Linux kernel scheduler is initialized. 
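The per-CPU loop from `sched_init` that we walked through can be imitated in user space to see its shape: walk a bitmap of possible CPUs and reset one queue per set bit. This is a hedged sketch, not kernel code. `struct demo_rq`, `demo_runqueues`, `demo_init_runqueues`, the limit of 8 CPUs and the plain `unsigned long` used as a CPU mask are all hypothetical simplifications of `struct rq`, the `runqueues` percpu variable and `cpu_possible_mask`.

```C
#include <assert.h>
#include <string.h>

#define DEMO_NR_CPUS      8   /* hypothetical upper limit of CPUs */
#define DEMO_LOAD_IDX_MAX 5   /* length of the load history, as in the text */

/* Hypothetical, heavily reduced stand-in for the kernel's struct rq. */
struct demo_rq {
    unsigned int nr_running;                     /* runnable tasks on this CPU */
    unsigned long cpu_load[DEMO_LOAD_IDX_MAX];   /* history of past loads */
    unsigned long last_load_update_tick;         /* tick of the last update */
    int initialized;                             /* demo-only marker */
};

/* One runqueue per CPU; a plain array replaces the percpu variable here. */
struct demo_rq demo_runqueues[DEMO_NR_CPUS];

/*
 * Initialize the runqueue of every CPU whose bit is set in possible_mask.
 * now plays the role of jiffies. Returns the number of initialized queues.
 */
unsigned int demo_init_runqueues(unsigned long possible_mask, unsigned long now)
{
    unsigned int cpu, count = 0;

    /* The mask must not name CPUs beyond DEMO_NR_CPUS. */
    assert((possible_mask >> DEMO_NR_CPUS) == 0);

    for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
        struct demo_rq *rq;

        if (!(possible_mask & (1UL << cpu)))
            continue;                       /* this CPU can never be present */

        rq = &demo_runqueues[cpu];
        memset(rq, 0, sizeof(*rq));         /* nr_running and cpu_load[] = 0 */
        rq->last_load_update_tick = now;
        rq->initialized = 1;
        count++;
    }
    return count;
}
```

Calling `demo_init_runqueues(0x5UL, 100)` touches only the queues of CPU `0` and CPU `2`, because only those two bits are set in the mask.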
Of course, we have skipped many different details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, rcu, etc.) work in the linux kernel, but we took a short look at the scheduler initialization process. We will look at all other details in the separate part which will be fully dedicated to the scheduler. + +Conclusion +-------------------------------------------------------------------------------- + +It is the end of the eighth part about the linux kernel initialization process. In this part, we looked at the initialization process of the scheduler. In the next part we will continue to dive into the linux kernel initialization process and will see the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and many other initialization steps. + +If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). + +**Please note that English is not my first language, and I am really sorry for any inconvenience. 
If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) +* [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt) +* [spinlock](http://en.wikipedia.org/wiki/Spinlock) +* [Run queue](http://en.wikipedia.org/wiki/Run_queue) +* [Linux kernem memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) +* [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29) +* [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) +* [Linux kernel hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) +* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) +* [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) +* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) +* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) +* [RCU](http://en.wikipedia.org/wiki/Read-copy-update) +* [CFS Scheduler documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt) +* [Real-Time group scheduling](https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt) +* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) diff --git a/Initialization/linux-initialization-9.md b/Initialization/linux-initialization-9.md new file mode 100644 index 0000000..35ec3a1 --- /dev/null +++ b/Initialization/linux-initialization-9.md @@ -0,0 +1,430 @@ +Kernel initialization. Part 9. 
+================================================================================ + +RCU initialization +================================================================================ + +This is ninth part of the [Linux Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) after the `sched_init` is the call of the `preempt_disable`. There are two macros: + +* `preempt_disable` +* `preempt_enable` + +for preemption disabling and enabling. First of all let's try to understand what is `preempt` in the context of an operating system kernel. In simple words, preemption is ability of the operating system kernel to preempt current task to run task with higher priority. Here we need to disable preemption because we will have only one `init` process for the early boot time and we don't need to stop it before we call `cpu_idle` function. The `preempt_disable` macro is defined in the [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) and depends on the `CONFIG_PREEMPT_COUNT` kernel configuration option. This macro is implemented as: + +```C +#define preempt_disable() \ +do { \ + preempt_count_inc(); \ + barrier(); \ +} while (0) +``` + +and if `CONFIG_PREEMPT_COUNT` is not set just: + +```C +#define preempt_disable() barrier() +``` + +Let's look on it. First of all we can see one difference between these macro implementations. 
The `preempt_disable` with `CONFIG_PREEMPT_COUNT` set contains the call of the `preempt_count_inc`. There is a special `percpu` variable which stores the number of held locks and `preempt_disable` calls: + +```C +DECLARE_PER_CPU(int, __preempt_count); +``` + +In the first implementation of the `preempt_disable` we increment this `__preempt_count`. There is an API for returning the value of the `__preempt_count`: it is the `preempt_count` function. As we called `preempt_disable`, first of all we increment the preemption counter with the `preempt_count_inc` macro which expands to: + +``` +#define preempt_count_inc() preempt_count_add(1) +#define preempt_count_add(val) __preempt_count_add(val) +``` + +where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count` in our case; you can read more about `percpu` variables in the part about [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). OK, we increased `__preempt_count`, and in the next step we can see the call of the `barrier` macro in both macros. The `barrier` macro inserts an optimization barrier. The compiler, and to some extent the processor, may perform independent memory access operations in a different order than the one written in the program. That's why we need a way to tell the compiler and the processor to keep a particular order. This mechanism is a memory barrier. Let's consider a simple example: + +```C +preempt_disable(); +foo(); +preempt_enable(); +``` + +The compiler can rearrange it as: + +```C +preempt_disable(); +preempt_enable(); +foo(); +``` + +In this case the non-preemptible function `foo` can be preempted. As we put the `barrier` macro in the `preempt_disable` and `preempt_enable` macros, it prevents the compiler from swapping `preempt_count_inc` with other statements. You can read more about barriers [here](http://en.wikipedia.org/wiki/Memory_barrier) and [here](https://www.kernel.org/doc/Documentation/memory-barriers.txt). 
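The counter-plus-barrier pattern can be tried in user space. The following is a minimal sketch under stated assumptions and not the kernel implementation: a single global `demo_preempt_count` replaces the `percpu` `__preempt_count`, all `demo_*` names are hypothetical, the barrier uses the GCC/Clang inline assembly syntax, and the reschedule check that a real `preempt_enable` performs is left out.

```C
#include <assert.h>

/* Hypothetical single-CPU stand-in for the percpu __preempt_count. */
int demo_preempt_count;

/*
 * Compiler-only barrier (GCC/Clang syntax): the compiler may not move
 * memory accesses across this point. It emits no CPU instruction.
 */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

void demo_preempt_disable(void)
{
    demo_preempt_count++;
    demo_barrier();   /* the protected code must stay after the increment */
}

void demo_preempt_enable(void)
{
    assert(demo_preempt_count > 0);   /* an unbalanced enable is a bug */
    demo_barrier();   /* the protected code must stay before the decrement */
    demo_preempt_count--;
}

/* Preemption is allowed only when every disable was matched by an enable. */
int demo_preemptible(void)
{
    return demo_preempt_count == 0;
}
```

Because the counter nests, two calls of `demo_preempt_disable` need two calls of `demo_preempt_enable` before `demo_preemptible` returns a non-zero value again.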
+ +In the next step we can see the following statement: + +```C +if (WARN(!irqs_disabled(), + "Interrupts were enabled *very* early, fixing it\n")) + local_irq_disable(); +``` + +which checks the state of the [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) and disables them (with the `cli` instruction for `x86_64`) if they are enabled. + +That's all. Preemption is disabled and we can go ahead. + +Initialization of the integer ID management +-------------------------------------------------------------------------------- + +In the next step we can see the call of the `idr_init_cache` function which is defined in the [lib/idr.c](https://github.com/torvalds/linux/blob/master/lib/idr.c). The `idr` library is used in various [places](http://lxr.free-electrons.com/ident?i=idr_find) in the linux kernel to manage assigning integer `IDs` to objects and looking up objects by id. + +Let's look at the implementation of the `idr_init_cache` function: + +```C +void __init idr_init_cache(void) +{ + idr_layer_cache = kmem_cache_create("idr_layer_cache", + sizeof(struct idr_layer), 0, SLAB_PANIC, NULL); +} +``` + +Here we can see the call of the `kmem_cache_create`. We already called the `kmem_cache_init` in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L485), so the allocator is ready and `kmem_cache_create` can create a new cache for objects of the given size (we will see more about caches in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). In our case it creates a `kmem_cache` which will be used by the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator. As you can see we pass five parameters to the `kmem_cache_create`: + +* name of the cache; +* size of the object to store in cache; +* required alignment of the objects; +* flags; +* constructor for the objects. + +and it will create `kmem_cache` for the integer IDs.
Integer `IDs` are a commonly used pattern for mapping a set of integer IDs to a set of pointers. We can see the usage of integer IDs in the [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) drivers subsystem. For example [drivers/i2c/i2c-core.c](https://github.com/torvalds/linux/blob/master/drivers/i2c/i2c-core.c), which represents the core of the `i2c` subsystem, defines the `i2c_adapter_idr` for the `i2c` adapters with the `DEFINE_IDR` macro: + +```C +static DEFINE_IDR(i2c_adapter_idr); +``` + +and then uses it when an `i2c` adapter is registered: + +```C +static int __i2c_add_numbered_adapter(struct i2c_adapter *adap) +{ + int id; + ... + ... + ... + id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL); + ... + ... + ... +} +``` + +and the `id` allocated from `i2c_adapter_idr` represents the bus number. + +You can read more about integer ID management [here](https://lwn.net/Articles/103209/). + +RCU initialization +-------------------------------------------------------------------------------- + +The next step is [RCU](http://en.wikipedia.org/wiki/Read-copy-update) initialization with the `rcu_init` function and its implementation depends on two kernel configuration options: + +* `CONFIG_TINY_RCU` +* `CONFIG_TREE_RCU` + +In the first case `rcu_init` will be in the [kernel/rcu/tiny.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tiny.c) and in the second case it will be defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c). We will see the implementation of the `tree rcu`, but first of all a few words about the `RCU` in general. + +`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. In its early days the linux kernel provided support and an environment for concurrently running applications, but all execution was serialized in the kernel using a single global lock.
Nowadays the linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) data structures and others. One of these mechanisms is the `read-copy update`. The `RCU` technique is designed for rarely-modified data structures. The idea of the `RCU` is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of it and make all changes in the copy. At the same time all other users of the data structure use the old version of it. Next, we need to choose a safe moment when the original version of the data structure has no users and update it with the modified copy. + +Of course this description of the `RCU` is very simplified. To understand some details about `RCU`, first of all we need to learn some terminology. Data readers in the `RCU` execute in a [critical section](http://en.wikipedia.org/wiki/Critical_section). Every time a data reader enters the critical section it calls `rcu_read_lock`, and it calls `rcu_read_unlock` on exit from the critical section. If a thread is not in the critical section, it is in a state called the `quiescent state`. A period during which every thread passes through at least one `quiescent state` is called a `grace period`. If a thread wants to remove an element from the data structure, this occurs in two steps. The first step is `removal`: it atomically removes the element from the data structure, but does not release the physical memory. After this the writer thread announces the removal and waits until the `grace period` is finished. From this moment, the removed element is available only to the reader threads that were already using it. After the `grace period` has finished, the second step of the element removal starts: it just frees the memory of the element. + +There are a couple of implementations of the `RCU`.
The old `RCU` is called classic, and the new implementation is called `tree` RCU. As you may already understand, the `CONFIG_TREE_RCU` kernel configuration option enables tree `RCU`. Another one is the `tiny` RCU which depends on `CONFIG_TINY_RCU` and `CONFIG_SMP=n`. We will see more details about the `RCU` in general in the separate chapter about synchronization primitives, but now let's look at the `rcu_init` implementation from the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c): + +```C +void __init rcu_init(void) +{ + int cpu; + + rcu_bootup_announce(); + rcu_init_geometry(); + rcu_init_one(&rcu_bh_state, &rcu_bh_data); + rcu_init_one(&rcu_sched_state, &rcu_sched_data); + __rcu_init_preempt(); + open_softirq(RCU_SOFTIRQ, rcu_process_callbacks); + + /* + * We don't need protection against CPU-hotplug here because + * this is called early in boot, before either interrupts + * or the scheduler are operational. + */ + cpu_notifier(rcu_cpu_notify, 0); + pm_notifier(rcu_pm_notify, 0); + for_each_online_cpu(cpu) + rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu); + + rcu_early_boot_tests(); +} +``` + +At the beginning of the `rcu_init` function we define the `cpu` variable and call `rcu_bootup_announce`. The `rcu_bootup_announce` function is pretty simple: + +```C +static void __init rcu_bootup_announce(void) +{ + pr_info("Hierarchical RCU implementation.\n"); + rcu_bootup_announce_oddness(); +} +``` + +It just prints information about the `RCU` with the `pr_info` function and calls `rcu_bootup_announce_oddness`, which uses `pr_info` too, to print information about the current `RCU` configuration that depends on kernel configuration options like `CONFIG_RCU_TRACE`, `CONFIG_PROVE_RCU`, `CONFIG_RCU_FANOUT_EXACT`, etc. In the next step, we can see the call of the `rcu_init_geometry` function. This function is defined in the same source code file and computes the geometry of the node tree depending on the number of CPUs.
Actually `RCU` provides scalability with extremely low internal RCU lock contention. What if a data structure is read from different CPUs? The `RCU` API provides the `rcu_state` structure which represents the global RCU state including the node hierarchy. The hierarchy is represented by the: + +```C +struct rcu_node node[NUM_RCU_NODES]; +``` + +array of structures. As we can read in the comment above this definition: + +``` +The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second +level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level +in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is +determined by the number of CPUs and by CONFIG_RCU_FANOUT. + +Small systems will have a "hierarchy" consisting of a single rcu_node. +``` + +The `rcu_node` structure is defined in the [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) and contains information about the current grace period, whether the grace period is completed or not, the CPUs or groups that need to switch in order for the current grace period to proceed, etc. Every `rcu_node` contains a lock for a couple of CPUs. These `rcu_node` structures are embedded into a linear array in the `rcu_state` structure and represented as a tree with the root as the first element, which covers all CPUs. As you can see, the number of rcu nodes is determined by `NUM_RCU_NODES`, which depends on the number of available CPUs: + +```C +#define NUM_RCU_NODES (RCU_SUM - NR_CPUS) +#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4) +``` + +where the values of the levels depend on the `CONFIG_RCU_FANOUT_LEAF` configuration option.
For example, in the simplest case one `rcu_node` covers two CPUs on a machine with eight CPUs: + +``` ++-----------------------------------------------------------------+ +| rcu_state | +| +----------------------+ | +| | root | | +| | rcu_node | | +| +----------------------+ | +| | | | +| +----v-----+ +--v-------+ | +| | | | | | +| | rcu_node | | rcu_node | | +| | | | | | +| +------------------+ +----------------+ | +| | | | | | +| | | | | | +| +----v-----+ +-------v--+ +-v--------+ +-v--------+ | +| | | | | | | | | | +| | rcu_node | | rcu_node | | rcu_node | | rcu_node | | +| | | | | | | | | | +| +----------+ +----------+ +----------+ +----------+ | +| | | | | | +| | | | | | +| | | | | | +| | | | | | ++---------|-----------------|-------------|---------------|-------+ + | | | | ++---------v-----------------v-------------v---------------v--------+ +| | | | | +| CPU1 | CPU3 | CPU5 | CPU7 | +| | | | | +| CPU2 | CPU4 | CPU6 | CPU8 | +| | | | | ++------------------------------------------------------------------+ +``` + +So, in the `rcu_init_geometry` function we just need to calculate the total number of `rcu_node` structures.
We start with the calculation of the number of `jiffies` until the first and the next `fqs`, which stands for `force-quiescent-state` (the quiescent state is described above): + +```C +d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; +if (jiffies_till_first_fqs == ULONG_MAX) + jiffies_till_first_fqs = d; +if (jiffies_till_next_fqs == ULONG_MAX) + jiffies_till_next_fqs = d; +``` + +where: + +```C +#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500)) +#define RCU_JIFFIES_FQS_DIV 256 +``` + +After we have calculated these [jiffies](http://en.wikipedia.org/wiki/Jiffy_%28time%29), we check whether the previously defined `jiffies_till_first_fqs` and `jiffies_till_next_fqs` variables are equal to [ULONG_MAX](http://www.rowleydownload.co.uk/avr/documentation/index.htm?http://www.rowleydownload.co.uk/avr/documentation/ULONG_MAX.htm) (their default value) and, if so, set them to the calculated value. As we did not touch these variables before, they are equal to `ULONG_MAX`: + +```C +static ulong jiffies_till_first_fqs = ULONG_MAX; +static ulong jiffies_till_next_fqs = ULONG_MAX; +``` + +In the next step of the `rcu_init_geometry`, we check whether `rcu_fanout_leaf` is unchanged (it still has the compile-time value of the `CONFIG_RCU_FANOUT_LEAF` configuration option) and whether `nr_cpu_ids` is equal to `NR_CPUS`. If both are true, we just return: + +```C +if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF && + nr_cpu_ids == NR_CPUS) + return; +``` + +After this we need to compute the number of nodes that an `rcu_node` tree can handle with the given number of levels: + +```C +rcu_capacity[0] = 1; +rcu_capacity[1] = rcu_fanout_leaf; +for (i = 2; i <= MAX_RCU_LVLS; i++) + rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT; +``` + +And in the last step we calculate the number of rcu_nodes at each level of the tree in the [loop](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c#L4094).
+ +As we have calculated the geometry of the `rcu_node` tree, we need to go back to the `rcu_init` function, and in the next step we need to initialize two `rcu_state` structures with the `rcu_init_one` function: + +```C +rcu_init_one(&rcu_bh_state, &rcu_bh_data); +rcu_init_one(&rcu_sched_state, &rcu_sched_data); +``` + +The `rcu_init_one` function takes two arguments: + +* Global `RCU` state; +* Per-CPU data for `RCU`. + +Both variables are declared in the [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) together with their `percpu` data: + +```C +extern struct rcu_state rcu_bh_state; +DECLARE_PER_CPU(struct rcu_data, rcu_bh_data); +``` + +You can read about these states [here](http://lwn.net/Articles/264090/). As I wrote above we need to initialize the `rcu_state` structures and the `rcu_init_one` function will help us with it. After the `rcu_state` initialization, we can see the call of the `__rcu_init_preempt` which depends on the `CONFIG_PREEMPT_RCU` kernel configuration option. It does the same as the previous calls: it initializes the `rcu_preempt_state` structure, which has the `rcu_state` type, with the `rcu_init_one` function. After this, in the `rcu_init`, we can see the call of the: + +```C +open_softirq(RCU_SOFTIRQ, rcu_process_callbacks); +``` + +function. This function registers a handler of the `pending interrupt`. A pending interrupt or `softirq` means that a part of the actions can be delayed for later execution when the system is less loaded. Pending interrupts are represented by the following structure: + +```C +struct softirq_action +{ + void (*action)(struct softirq_action *); +}; +``` + +which is defined in the [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) and contains only one field - the handler of an interrupt.
You can check the `softirqs` in your system with: + +``` +$ cat /proc/softirqs + CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 + HI: 2 0 0 1 0 2 0 0 + TIMER: 137779 108110 139573 107647 107408 114972 99653 98665 + NET_TX: 1127 0 4 0 1 1 0 0 + NET_RX: 334 221 132939 3076 451 361 292 303 + BLOCK: 5253 5596 8 779 2016 37442 28 2855 +BLOCK_IOPOLL: 0 0 0 0 0 0 0 0 + TASKLET: 66 0 2916 113 0 24 26708 0 + SCHED: 102350 75950 91705 75356 75323 82627 69279 69914 + HRTIMER: 510 302 368 260 219 255 248 246 + RCU: 81290 68062 82979 69015 68390 69385 63304 63473 +``` + +The `open_softirq` function takes two parameters: + +* index of the interrupt; +* interrupt handler. + +and adds the interrupt handler to the array of the pending interrupts: + +```C +void open_softirq(int nr, void (*action)(struct softirq_action *)) +{ + softirq_vec[nr].action = action; +} +``` + +In our case the interrupt handler is `rcu_process_callbacks` which is defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c) and does the `RCU` core processing for the current CPU. After we registered the `softirq` interrupt for the `RCU`, we can see the following code: + +```C +cpu_notifier(rcu_cpu_notify, 0); +pm_notifier(rcu_pm_notify, 0); +for_each_online_cpu(cpu) + rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu); +``` + +Here we can see the registration of the `cpu` notifier which is needed on systems that support [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt), and we will not dive into details about this topic. The last function in the `rcu_init` is the `rcu_early_boot_tests`: + +```C +void rcu_early_boot_tests(void) +{ + pr_info("Running RCU self tests\n"); + + if (rcu_self_test) + early_boot_test_call_rcu(); + if (rcu_self_test_bh) + early_boot_test_call_rcu_bh(); + if (rcu_self_test_sched) + early_boot_test_call_rcu_sched(); +} +``` + +which runs self tests for the `RCU`. + +That's all. We saw the initialization process of the `RCU` subsystem.
As I wrote above, more about the `RCU` will be in the separate chapter about synchronization primitives. + +Rest of the initialization process +-------------------------------------------------------------------------------- + +Ok, we already passed the main topic of this part which is `RCU` initialization, but it is not the end of the linux kernel initialization process. In the last section of this part we will see a couple of functions which run during initialization, but we will not dive deep into their details for different reasons. Some of the reasons not to dive into details are the following: + +* They are not very important for the generic kernel initialization process and depend on the kernel configuration; +* They are related to debugging and are not important for now; +* We will see many of these things in separate parts/chapters. + +After we initialized `RCU`, the next step which you can see in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `trace_init` function. As you can understand from its name, this function initializes the [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. You can read more about the linux kernel trace system [here](http://elinux.org/Kernel_Trace_Systems). + +After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familiar with different data structures, you can understand from the name of this function that it initializes the kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function is defined in the [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c) and you can read more about it in the part about [Radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html).
+ +In the next step we can see the functions which are related to the `interrupts handling` subsystem, they are: + +* `early_irq_init` +* `init_IRQ` +* `softirq_init` + +We will see an explanation of these functions and their implementation in the special part about interrupts and exceptions handling. After this come many different functions (like `init_timers`, `hrtimers_init`, `time_init`, etc.) which are related to timing and timers. We will see more about these functions in the chapter about timers. + +The next couple of functions are related to the [perf](https://perf.wiki.kernel.org/index.php/Main_Page) events - `perf_event_init` (there will be a separate chapter about perf) and to the initialization of `profiling` with the `profile_init`. After this we enable `irq` with the call of the: + +```C +local_irq_enable(); +``` + +which expands to the `sti` instruction, and do the post-initialization of the [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) with the call of the `kmem_cache_init_late` function (as I wrote above, we will learn about the `SLAB` in the [Linux memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). + +After the post-initialization of the `SLAB`, the next point is the initialization of the console with the `console_init` function from the [drivers/tty/tty_io.c](https://github.com/torvalds/linux/blob/master/drivers/tty/tty_io.c). + +After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt).
After this, we can see the initialization of the dynamic allocation of the `debug objects` with the `debug_objects_mem_init`, kernel memory leak [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with the `kmemleak_init`, `percpu` pageset setup with the `setup_per_cpu_pageset`, setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with the `numa_policy_init`, setting time for the scheduler with the `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with the `anon_vma_init` for the private virtual memory areas and early initialization of the [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with the `acpi_early_init`. + +This is the end of the ninth part of the [linux kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last section of this part (`Rest of the initialization process`) we went through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these things, or if you know about them but do not understand them yet. As I already wrote many times, we will see details of the implementations in other parts or other chapters. + +Conclusion +-------------------------------------------------------------------------------- + +It is the end of the ninth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked at the initialization process of the `RCU` subsystem.
In the next part we will continue to dive into the linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file and will see the start of the first process. + +If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). + +**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure) +* [kmemleak](https://www.kernel.org/doc/Documentation/kmemleak.txt) +* [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) +* [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) +* [RCU](http://en.wikipedia.org/wiki/Read-copy-update) +* [RCU documentation](https://github.com/torvalds/linux/tree/master/Documentation/RCU) +* [integer ID management](https://lwn.net/Articles/103209/) +* [Documentation/memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt) +* [Runtime locking correctness validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) +* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) +* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) +* [slab](http://en.wikipedia.org/wiki/Slab_allocation) +* [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) +* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html)