diff --git a/Booting/README.md b/Booting/README.md deleted file mode 100644 index f1e7a7e..0000000 --- a/Booting/README.md +++ /dev/null @@ -1,10 +0,0 @@ -# Kernel boot process - -This chapter describes the linux kernel boot process. You will see here a -couple of posts which describe the full cycle of the kernel loading process: - -* [From the bootloader to kernel](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) - describes all stages from turning on the computer to before the first instruction of the kernel; -* [First steps in the kernel setup code](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) - describes first steps in the kernel setup code. You will see heap initialization, querying of different parameters like EDD, IST and etc... -* [Video mode initialization and transition to protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) - describes video mode initialization in the kernel setup code and transition to protected mode. -* [Transition to 64-bit mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-4.html) - describes preparation for transition into 64-bit mode and transition into it. -* [Kernel Decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) - describes preparation before kernel decompression and directly decompression. diff --git a/Booting/linux-bootstrap-1.md b/Booting/linux-bootstrap-1.md deleted file mode 100644 index 2ccbad8..0000000 --- a/Booting/linux-bootstrap-1.md +++ /dev/null @@ -1,496 +0,0 @@ -Kernel booting process. Part 1. -================================================================================ - -From the bootloader to kernel --------------------------------------------------------------------------------- - -If you have read my previous [blog posts](http://0xax.blogspot.com/search/label/asm), you can see that sometime ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works on low-level and many many other things. So, I decided to write yet another series of posts about the Linux kernel for **x86_64**. - -Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). I appreciate it. All posts will also be accessible at [linux-insides](https://github.com/0xAX/linux-insides) and if you find something wrong with my English or the post content, feel free to send a pull request. - - -*Note that this isn't the official documentation, just learning and sharing knowledge.* - -**Required knowledge** - -* Understanding C code -* Understanding assembly code (AT&T syntax) - -Anyway, if you just started to learn some tools, I will try to explain some parts during this and the following posts. Ok, little introduction finished and now we can start to dive into the kernel and low-level stuff. - -All code is actually for kernel - 3.18. If there are changes, I will update the posts accordingly. - -The Magic Power Button, What happens next? --------------------------------------------------------------------------------- - -Despite that this is a series of posts about Linux kernel, we will not start from kernel code (at least in this paragraph). Ok, you pressed the magic power button on your laptop or desktop computer and it started to work. After the motherboard sends a signal to the [power supply](https://en.wikipedia.org/wiki/Power_supply), the power supply provides the computer with the proper amount of electricity. Once motherboard receives the [power good signal](https://en.wikipedia.org/wiki/Power_good_signal), it tries to run the CPU. The CPU resets all leftover data in its registers and sets up predefined values for every register. - - -[80386](https://en.wikipedia.org/wiki/Intel_80386) and later CPUs define the following predefined data in CPU registers after the computer resets: - -``` -IP 0xfff0 -CS selector 0xf000 -CS base 0xffff0000 -``` - -The processor starts working in [real mode](https://en.wikipedia.org/wiki/Real_mode) and we need to back up a little to understand memory segmentation in this mode. Real mode is supported in all x86-compatible processors, from [8086](https://en.wikipedia.org/wiki/Intel_8086) to modern Intel 64-bit CPUs. The 8086 processor had a 20-bit address bus, which means that it could work with 0-2^20 bytes address space (1 megabyte). But it only has 16-bit registers, and with 16-bit registers the maximum address is 2^16 or 0xffff (64 kilobytes). [Memory segmentation](http://en.wikipedia.org/wiki/Memory_segmentation) is used to make use of all of the address space available. All memory is divided into small, fixed-size segments of 65535 bytes, or 64 KB. Since we cannot address memory below 64 KB with 16 bit registers, an alternate method to do it was devised. An address consists of two parts: the beginning address of the segment and the offset from the beginning of this segment. To get a physical address in memory, we need to multiply the segment part by 16 and add the offset part: - -``` -PhysicalAddress = Segment * 16 + Offset -``` - -For example if `CS:IP` is `0x2000:0x0010`, the corresponding physical address will be: - -```python ->>> hex((0x2000 << 4) + 0x0010) -'0x20010' -``` - -But if we take the biggest segment part and offset: `0xffff:0xffff`, it will be: - -```python ->>> hex((0xffff << 4) + 0xffff) -'0x10ffef' -``` - -which is 65519 bytes over first megabyte. Since only one megabyte is accessible in real mode, `0x10ffef` becomes `0x00ffef` with disabled [A20](https://en.wikipedia.org/wiki/A20_line). - -Ok, now we know about real mode and memory addressing. Let's get back to register values after reset. - -`CS` register consists of two parts: the visible segment selector and hidden base address. We know predefined `CS` base and `IP` value, logical address will be: - -``` -0xffff0000:0xfff0 -``` - -In this way starting address formed by adding the base address to the value in the EIP register: - -```python ->>> 0xffff0000 + 0xfff0 -'0xfffffff0' -``` - -We get `0xfffffff0` which is 4GB - 16 bytes. This point is the [Reset vector](http://en.wikipedia.org/wiki/Reset_vector). This is the memory location at which CPU expects to find the first instruction to execute after reset. It contains a [jump](http://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) instruction which usually points to the BIOS entry point. For example, if we look in [coreboot](http://www.coreboot.org/) source code, we will see it: - -```assembly - .section ".reset" - .code16 -.globl reset_vector -reset_vector: - .byte 0xe9 - .int _start - ( . + 2 ) - ... -``` - -We can see here the jump instruction [opcode](http://ref.x86asm.net/coder32.html#xE9) - 0xe9 to the address `_start - ( . + 2)`. And we can see that `reset` section is 16 bytes and starts at `0xfffffff0`: - -``` -SECTIONS { - _ROMTOP = 0xfffffff0; - . = _ROMTOP; - .reset . : { - *(.reset) - . = 15 ; - BYTE(0x00); - } -} -``` - -Now the BIOS has started to work. After initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration. The function of boot order is to control which devices the kernel attempts to boot. In the case of attempting to boot a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector (512 bytes). The final two bytes of the first sector are `0x55` and `0xaa` which signals the BIOS that the device is bootable. For example: - -```assembly -; -; Note: this example is written in Intel Assembly syntax -; -[BITS 16] -[ORG 0x7c00] - -boot: - mov al, '!' - mov ah, 0x0e - mov bh, 0x00 - mov bl, 0x07 - - int 0x10 - jmp $ - -times 510-($-$$) db 0 - -db 0x55 -db 0xaa -``` - -Build and run it with: - -``` -nasm -f bin boot.nasm && qemu-system-x86_64 boot -``` - -This will instruct [QEMU](http://qemu.org) to use the `boot` binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to `0x7c00`, and we end with the magic sequence). QEMU will treat the binary as the master boot record(MBR) of a disk image. - -We will see: - -![Simple bootloader which prints only `!`](http://oi60.tinypic.com/2qbwup0.jpg) - -In this example we can see that this code will be executed in 16 bit real mode and will start at 0x7c00 in memory. After the start it calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which just prints `!` symbol. It fills rest of 510 bytes with zeros and finish with two magic bytes `0xaa` and `0x55`. - -You can see binary dump of it with `objdump` util: - -``` -nasm -f bin boot.nasm -objdump -D -b binary -mi386 -Maddr16,data16,intel boot -``` - -A real-world boot sector has code for continuing the boot process and the partition table instead of a bunch of 0's and an exclamation point :) Ok so, from this point onwards BIOS hands over the control to the bootloader and we can go ahead. - -**NOTE**: As you can read above the CPU is in real mode. In real mode, calculating the physical address in memory is done as following: - -``` -PhysicalAddress = Segment * 16 + Offset -``` - -Same as I mentioned before. But we have only 16 bit general purpose registers. The maximum value of 16 bit register is: `0xffff`; So if we take the biggest values the result will be: - -```python ->>> hex((0xffff * 16) + 0xffff) -'0x10ffef' -``` - -Where `0x10ffef` is equal to `1MB + 64KB - 16b`. But a [8086](https://en.wikipedia.org/wiki/Intel_8086) processor, which was the first processor with real mode. It had 20 bit address line and `2^20 = 1048576.0` is 1MB. So, it means that the actual memory available is 1MB. - -General real mode's memory map is: - -``` -0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table -0x00000400 - 0x000004FF - BIOS Data Area -0x00000500 - 0x00007BFF - Unused -0x00007C00 - 0x00007DFF - Our Bootloader -0x00007E00 - 0x0009FFFF - Unused -0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory -0x000B0000 - 0x000B7777 - Monochrome Video Memory -0x000B8000 - 0x000BFFFF - Color Video Memory -0x000C0000 - 0x000C7FFF - Video ROM BIOS -0x000C8000 - 0x000EFFFF - BIOS Shadow Area -0x000F0000 - 0x000FFFFF - System BIOS -``` - -But stop, at the beginning of post I wrote that first instruction executed by the CPU is located at the address `0xFFFFFFF0`, which is much bigger than `0xFFFFF` (1MB). How can CPU access it in real mode? As I write about it and you can read in [coreboot](http://www.coreboot.org/Developer_Manual/Memory_map) documentation: - -``` -0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space -``` - -At the start of execution BIOS is not in RAM, it is located in the ROM. - -Bootloader --------------------------------------------------------------------------------- - -There are a number of bootloaders which can boot Linux, such as [GRUB 2](https://www.gnu.org/software/grub/) and [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project). The Linux kernel has a [Boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt) which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2. - -Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from [boot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/boot.S;hb=HEAD). This code is very simple due to the limited amount of space available, and contains a pointer that it uses to jump to the location of GRUB 2's core image. The core image begins with [diskboot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/diskboot.S;hb=HEAD), which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image into memory, which contains GRUB 2's kernel and drivers for handling filesystems. After loading the rest of the core image, it executes [grub_main](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/main.c). - -`grub_main` initializes console, gets base address for modules, sets root device, loads/parses grub configuration file, loads modules etc. At the end of execution, `grub_main` moves grub to normal mode. `grub_normal_execute` (from `grub-core/normal/main.c`) completes last preparation and shows a menu for selecting an operating system. When we select one of grub menu entries, `grub_menu_execute_entry` begins to be executed, which executes grub `boot` command. It starts to boot the selected operating system. - -As we can read in the kernel boot protocol, the bootloader must read and fill some fields of kernel setup header which starts at `0x01f1` offset from the kernel setup code. Kernel header [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) starts from: - -```assembly - .globl hdr -hdr: - setup_sects: .byte 0 - root_flags: .word ROOT_RDONLY - syssize: .long 0 - ram_size: .word 0 - vid_mode: .word SVGA_MODE - root_dev: .word 0 - boot_flag: .word 0xAA55 -``` - -The bootloader must fill this and the rest of the headers (only marked as `write` in the Linux boot protocol, for example [this](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L354)) with values which it either got from command line or calculated. We will not see description and explanation of all fields of kernel setup header, we will get back to it when kernel uses it. Anyway, you can find description of any field in the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156). - -As we can see in kernel boot protocol, the memory map will be the following after kernel loading: - -```shell - | Protected-mode kernel | -100000 +------------------------+ - | I/O memory hole | -0A0000 +------------------------+ - | Reserved for BIOS | Leave as much as possible unused - ~ ~ - | Command line | (Can also be below the X+10000 mark) -X+10000 +------------------------+ - | Stack/heap | For use by the kernel real-mode code. -X+08000 +------------------------+ - | Kernel setup | The kernel real-mode code. - | Kernel boot sector | The kernel legacy boot sector. - X +------------------------+ - | Boot loader | <- Boot sector entry point 0x7C00 -001000 +------------------------+ - | Reserved for MBR/BIOS | -000800 +------------------------+ - | Typically used by MBR | -000600 +------------------------+ - | BIOS use only | -000000 +------------------------+ - -``` - -So after the bootloader transferred control to the kernel, it starts somewhere at: - -``` -0x1000 + X + sizeof(KernelBootSector) + 1 -``` - -where `X` is the address of kernel bootsector loaded. In my case `X` is `0x10000`, we can see it in memory dump: - -![kernel first address](http://oi57.tinypic.com/16bkco2.jpg) - -Ok, now the bootloader has loaded Linux kernel into the memory, filled header fields and jumped to it. Now we can move directly to the kernel setup code. - -Start of Kernel Setup --------------------------------------------------------------------------------- - -Finally we are in the kernel. Technically kernel didn't run yet, first of all we need to setup kernel, memory manager, process manager etc. Kernel setup execution starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at the [_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L293). It is a little strange at the first look, there are many instructions before it. - -Actually Long time ago Linux kernel had its own bootloader, but now if you run for example: - -``` -qemu-system-x86_64 vmlinuz-3.18-generic -``` - -You will see: - -![Try vmlinuz in qemu](http://oi60.tinypic.com/r02xkz.jpg) - -Actually `header.S` starts from [MZ](https://en.wikipedia.org/wiki/DOS_MZ_executable) (see image above), error message printing and following [PE](https://en.wikipedia.org/wiki/Portable_Executable) header: - -```assembly -#ifdef CONFIG_EFI_STUB -# "MZ", MS-DOS header -.byte 0x4d -.byte 0x5a -#endif -... -... -... -pe_header: - .ascii "PE" - .word 0 -``` - -It needs this for loading the operating system with [UEFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface). Here we will not see how it works (we will these later in the next parts). - -So the actual kernel setup entry point is: - -``` -// header.S line 292 -.globl _start -_start: -``` - -Bootloader (grub2 and others) knows about this point (`0x200` offset from `MZ`) and makes a jump directly to this point, despite the fact that `header.S` starts from `.bstext` section which prints error message: - -``` -// -// arch/x86/boot/setup.ld -// -. = 0; // current position -.bstext : { *(.bstext) } // put .bstext section to position 0 -.bsdata : { *(.bsdata) } -``` - -So kernel setup entry point is: - -```assembly - .globl _start -_start: - .byte 0xeb - .byte start_of_setup-1f -1: - // - // rest of the header - // -``` - -Here we can see `jmp` instruction opcode - `0xeb` to the `start_of_setup-1f` point. `Nf` notation means following: `2f` refers to the next local `2:` label. In our case it is label `1` which goes right after jump. It contains rest of setup [header](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156) and right after setup header we can see `.entrytext` section which starts at `start_of_setup` label. - -Actually it's the first code which starts to execute besides previous jump instruction. After kernel setup got the control from bootloader, first `jmp` instruction is located at `0x200` (first 512 bytes) offset from the start of kernel real mode. This we can read in Linux kernel boot protocol and also see in grub2 source code: - -```C - state.gs = state.fs = state.es = state.ds = state.ss = segment; - state.cs = segment + 0x20; -``` - -It means that segment registers will have following values after kernel setup starts to work: - -``` -fs = es = ds = ss = 0x1000 -cs = 0x1020 -``` - -for my case when kernel loaded at `0x10000`. - -After jump to `start_of_setup`, it needs to do the following things: - -* Be sure that all values of all segment registers are equal -* Setup correct stack if needed -* Setup [bss](https://en.wikipedia.org/wiki/.bss) -* Jump to C code at [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c) - -Let's look at implementation. - -Segment registers align --------------------------------------------------------------------------------- - -First of all it ensures that `ds` and `es` segment registers point to the same address and enables interrupts with `sti` instruction: - -```assembly - movw %ds, %ax - movw %ax, %es - sti -``` - -As I wrote above, grub2 loads kernel setup code at `0x10000` address and `cs` at `0x1020` because execution doesn't start from the start of file, but from: - -``` -_start: - .byte 0xeb - .byte start_of_setup-1f -``` - -`jump`, which is 512 bytes offset from the [4d 5a](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L47). Also need to align `cs` from `0x10200` to `0x10000` as all other segment registers. After that we setup the stack: - -```assembly - pushw %ds - pushw $6f - lretw -``` - -push `ds` value to stack, and address of [6](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L494) label and execute `lretw` instruction. When we call `lretw`, it loads address of label `6` to [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and `cs` with value of `ds`. After it we will have `ds` and `cs` with the same values. - -Stack Setup --------------------------------------------------------------------------------- - -Actually, almost all of the setup code is preparation for C language environment in the real mode. The next [step](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L467) is checking of `ss` register value and making of correct stack if `ss` is wrong: - -```assembly - movw %ss, %dx - cmpw %ax, %dx - movw %sp, %dx - je 2f -``` - -Generally, it can be 3 different cases: - -* `ss` has valid value 0x10000 (as all other segment registers beside `cs`) -* `ss` is invalid and `CAN_USE_HEAP` flag is set (see below) -* `ss` is invalid and `CAN_USE_HEAP` flag is not set (see below) - -Let's look at all of these cases: - -1. `ss` has a correct address (0x10000). In this case we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481): - -``` -2: andw $~3, %dx - jnz 3f - movw $0xfffc, %dx -3: movw %ax, %ss - movzwl %dx, %esp - sti -``` - -Here we can see aligning of `dx` (contains `sp` given by bootloader) to 4 bytes and checking that it is not zero. If it is zero we put `0xfffc` (4 byte aligned address before maximum segment size - 64 KB) to `dx`. If it is not zero we continue to use `sp` given by bootloader (0xf7f4 in my case). After this we put `ax` value to `ss` which stores correct segment address `0x10000` and set up correct `sp`. After it we have correct stack: - -![stack](http://oi58.tinypic.com/16iwcis.jpg) - -2. In the second case (`ss` != `ds`), first of all put [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (address of end of setup code) value in `dx`. And check `loadflags` header field with `testb` instruction too see if we can use heap or not. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as: - -```C -#define LOADED_HIGH (1<<0) -#define QUIET_FLAG (1<<5) -#define KEEP_SEGMENTS (1<<6) -#define CAN_USE_HEAP (1<<7) -``` - -And as we can read in the boot protocol: - -``` -Field name: loadflags - - This field is a bitmask. - - Bit 7 (write): CAN_USE_HEAP - Set this bit to 1 to indicate that the value entered in the - heap_end_ptr is valid. If this field is clear, some setup code - functionality will be disabled. -``` - -If `CAN_USE_HEAP` bit is set, put `heap_end_ptr` to `dx` which points to `_end` and add `STACK_SIZE` (minimal stack size - 512 bytes) to it. After this if `dx` is not carry, jump to `2` (it will not be carry, dx = _end + 512) label as in previous case and make correct stack. - -![stack](http://oi62.tinypic.com/dr7b5w.jpg) - -3. The last case when `CAN_USE_HEAP` is not set, we just use minimal stack from `_end` to `_end + STACK_SIZE`: - -![minimal stack](http://oi60.tinypic.com/28w051y.jpg) - -BSS Setup --------------------------------------------------------------------------------- - -The last two steps that need to happen before we can jump to the main C code, are that we need to set up the [BSS](https://en.wikipedia.org/wiki/.bss) area, and check the "magic" signature. Firstly, signature checking: - -```assembly -cmpl $0x5a5aaa55, setup_sig -jne setup_bad -``` - -This simply consists of comparing the [setup_sig](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L39) against the magic number `0x5a5aaa55`. If they are not equal, a fatal error is reported. - -But if the magic number matches, knowing we have a set of correct segment registers, and a stack, we need only setup the BSS section before jumping into the C code. - -The BSS section is used for storing statically allocated, uninitialized, data. Linux carefully ensures this area of memory is first blanked, using the following code: - -```assembly - movw $__bss_start, %di - movw $_end+3, %cx - xorl %eax, %eax - subw %di, %cx - shrw $2, %cx - rep; stosl -``` - -First of all the [__bss_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L47) address is moved into `di`, and the `_end + 3` address (+3 - aligns to 4 bytes) is moved into `cx`. The `eax` register is cleared (using an `xor` instruction), and the bss section size (`cx`-`di`) is calculated and put into `cx`. Then, `cx` is divided by four (the size of a 'word'), and the `stosl` instruction is repeatedly used, storing the value of `eax` (zero) into the address pointed to by `di`, and automatically increasing `di` by four (this occurs until `cx` reaches zero). The net effect of this code, is that zeros are written through all words in memory from `__bss_start` to `_end`: - -![bss](http://oi59.tinypic.com/29m2eyr.jpg) - -Jump to main --------------------------------------------------------------------------------- - -That's all, we have the stack, BSS and now we can jump to the `main()` C function: - -```assembly - calll main -``` - -The `main()` function is located in [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). What will be there? We will see it in the next part. - -Conclusion --------------------------------------------------------------------------------- - -This is the end of the first part about Linux kernel internals. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part we will see first C code which executes in Linux kernel setup, implementation of memory routines as `memset`, `memcpy`, `earlyprintk` implementation and early console initialization and many more. - -**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - - * [Intel 80386 programmer's reference manual 1986](http://css.csail.mit.edu/6.858/2014/readings/i386.pdf) - * [Minimal Boot Loader for IntelĀ® Architecture](https://www.cs.cmu.edu/~410/doc/minimal_boot.pdf) - * [8086](http://en.wikipedia.org/wiki/Intel_8086) - * [80386](http://en.wikipedia.org/wiki/Intel_80386) - * [Reset vector](http://en.wikipedia.org/wiki/Reset_vector) - * [Real mode](http://en.wikipedia.org/wiki/Real_mode) - * [Linux kernel boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) - * [CoreBoot developer manual](http://www.coreboot.org/Developer_Manual) - * [Ralf Brown's Interrupt List](http://www.ctyme.com/intr/int.htm) - * [Power supply](http://en.wikipedia.org/wiki/Power_supply) - * [Power good signal](http://en.wikipedia.org/wiki/Power_good_signal) diff --git a/Booting/linux-bootstrap-2.md b/Booting/linux-bootstrap-2.md deleted file mode 100644 index 2379001..0000000 --- a/Booting/linux-bootstrap-2.md +++ /dev/null @@ -1,547 +0,0 @@ -Kernel booting process. Part 2. -================================================================================ - -First steps in the kernel setup --------------------------------------------------------------------------------- - -We started to dive into linux kernel internals in the previous [part](linux-bootstrap-1.md) and saw the initial part of the kernel setup code. We stopped at the first call to the `main` function (which is the first function written in C) from [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). - -In this part we will continue to research the kernel setup code and -* see what `protected mode` is, -* some preparation for the transition into it, -* the heap and console initialization, -* memory detection, cpu validation, keyboard initialization -* and much much more. - -So, Let's go ahead. - -Protected mode --------------------------------------------------------------------------------- - -Before we can move to the native Intel64 [Long Mode](http://en.wikipedia.org/wiki/Long_mode), the kernel must switch the CPU into protected mode. - -What is [protected mode](https://en.wikipedia.org/wiki/Protected_mode)? Protected mode was first added to the x86 architecture in 1982 and was the main mode of Intel processors from the [80286](http://en.wikipedia.org/wiki/Intel_80286) processor until Intel 64 and long mode came. - -The main reason to move away from [Real mode](http://wiki.osdev.org/Real_Mode) is that there is very limited access to the RAM. As you may remember from the previous part, there is only 220 bytes or 1 Megabyte, sometimes even only 640 Kilobytes of RAM available in the Real mode. - -Protected mode brought many changes, but the main one is the difference in memory management. The 20-bit address bus was replaced with a 32-bit address bus. It allowed access to 4 Gigabytes of memory vs 1 Megabyte of real mode. Also [paging](http://en.wikipedia.org/wiki/Paging) support was added, which you can read about in the next sections. - -Memory management in Protected mode is divided into two, almost independent parts: - -* Segmentation -* Paging - -Here we will only see segmentation. Paging will be discussed in the next sections. - -As you can read in the previous part, addresses consist of two parts in real mode: - -* Base address of the segment -* Offset from the segment base - -And we can get the physical address if we know these two parts by: - -``` -PhysicalAddress = Segment * 16 + Offset -``` - -Memory segmentation was completely redone in protected mode. There are no 64 Kilobyte fixed-size segments. Instead, the size and location of each segment is described by an associated data structure called _Segment Descriptor_. The segment descriptors are stored in a data structure called `Global Descriptor Table` (GDT). - -The GDT is a structure which resides in memory. It has no fixed place in the memory so, its address is stored in the special `GDTR` register. Later we will see the GDT loading in the Linux kernel code. There will be an operation for loading it into memory, something like: - -```assembly -lgdt gdt -``` - -where the `lgdt` instruction loads the base address and limit(size) of global descriptor table to the `GDTR` register. `GDTR` is a 48-bit register and consists of two parts: - - * size(16-bit) of global descriptor table; - * address(32-bit) of the global descriptor table. - -As mentioned above the GDT contains `segment descriptors` which describe memory segments. Each descriptor is 64-bits in size. The general scheme of a descriptor is: - -``` -31 24 19 16 7 0 ------------------------------------------------------------- -| | |B| |A| | | | |0|E|W|A| | -| BASE 31:24 |G|/|L|V| LIMIT |P|DPL|S| TYPE | BASE 23:16 | 4 -| | |D| |L| 19:16 | | | |1|C|R|A| | ------------------------------------------------------------- -| | | -| BASE 15:0 | LIMIT 15:0 | 0 -| | | ------------------------------------------------------------- -``` - -Don't worry, I know it looks a little scary after real mode, but it's easy. For example LIMIT 15:0 means that bit 0-15 of the Descriptor contain the value for the limit. The rest of it is in LIMIT 16:19. So, the size of Limit is 0-19 i.e 20-bits. Let's take a closer look at it: - -1. Limit[20-bits] is at 0-15,16-19 bits. It defines `length_of_segment - 1`. It depends on `G`(Granularity) bit. - - * if `G` (bit 55) is 0 and segment limit is 0, the size of the segment is 1 Byte - * if `G` is 1 and segment limit is 0, the size of the segment is 4096 Bytes - * if `G` is 0 and segment limit is 0xfffff, the size of the segment is 1 Megabyte - * if `G` is 1 and segment limit is 0xfffff, the size of the segment is 4 Gigabytes - - So, it means that if - * if G is 0, Limit is interpreted in terms of 1 Byte and the maximum size of the segment can be 1 Megabyte. - * if G is 1, Limit is interpreted in terms of 4096 Bytes = 4 KBytes = 1 Page and the maximum size of the segment can be 4 Gigabytes. Actually when G is 1, the value of Limit is shifted to the left by 12 bits. So, 20 bits + 12 bits = 32 bits and 232 = 4 Gigabytes. - -2. Base[32-bits] is at (0-15, 32-39 and 56-63 bits). It defines the physical address of the segment's starting location. - -3. Type/Attribute (40-47 bits) defines the type of segment and kinds of access to it. - * `S` flag at bit 44 specifies descriptor type. If `S` is 0 then this segment is a system segment, whereas if `S` is 1 then this is a code or data segment (Stack segments are data segments which must be read/write segments). - -To determine if the segment is a code or data segment we can check its Ex(bit 43) Attribute marked as 0 in the above diagram. If it is 0, then the segment is a Data segment otherwise it is a code segment. - -A segment can be of one of the following types: - -``` -| Type Field | Descriptor Type | Description -|-----------------------------|-----------------|------------------ -| Decimal | | -| 0 E W A | | -| 0 0 0 0 0 | Data | Read-Only -| 1 0 0 0 1 | Data | Read-Only, accessed -| 2 0 0 1 0 | Data | Read/Write -| 3 0 0 1 1 | Data | Read/Write, accessed -| 4 0 1 0 0 | Data | Read-Only, expand-down -| 5 0 1 0 1 | Data | Read-Only, expand-down, accessed -| 6 0 1 1 0 | Data | Read/Write, expand-down -| 7 0 1 1 1 | Data | Read/Write, expand-down, accessed -| C R A | | -| 8 1 0 0 0 | Code | Execute-Only -| 9 1 0 0 1 | Code | Execute-Only, accessed -| 10 1 0 1 0 | Code | Execute/Read -| 11 1 0 1 1 | Code | Execute/Read, accessed -| 12 1 1 0 0 | Code | Execute-Only, conforming -| 14 1 1 0 1 | Code | Execute-Only, conforming, accessed -| 13 1 1 1 0 | Code | Execute/Read, conforming -| 15 1 1 1 1 | Code | Execute/Read, conforming, accessed -``` - -As we can see the first bit(bit 43) is `0` for a _data_ segment and `1` for a _code_ segment. The next three bits(40, 41, 42, 43) are either `EWA`(*E*xpansion *W*ritable *A*ccessible) or CRA(*C*onforming *R*eadable *A*ccessible). - * if E(bit 42) is 0, expand up other wise expand down. Read more [here](http://www.sudleyplace.com/dpmione/expanddown.html). - * if W(bit 41)(for Data Segments) is 1, write access is allowed otherwise not. Note that read access is always allowed on data segments. - * A(bit 40) - Whether the segment is accessed by processor or not. - * C(bit 43) is conforming bit(for code selectors). If C is 1, the segment code can be executed from a lower level privilege for e.g user level. If C is 0, it can only be executed from the same privilege level. - * R(bit 41)(for code segments). If 1 read access to segment is allowed otherwise not. Write access is never allowed to code segments. - -4. DPL[2-bits] (Descriptor Privilege Level) is at bits 45-46. It defines the privilege level of the segment. It can be 0-3 where 0 is the most privileged. - -5. P flag(bit 47) - indicates if the segment is present in memory or not. If P is 0, the segment will be presented as _invalid_ and the processor will refuse to read this segment. - -6. AVL flag(bit 52) - Available and reserved bits. It is ignored in Linux. - -7. L flag(bit 53) - indicates whether a code segment contains native 64-bit code. If 1 then the code segment executes in 64 bit mode. - -8. D/B flag(bit 54) - Default/Big flag represents the operand size i.e 16/32 bits. If it is set then 32 bit otherwise 16. - -Segment registers don't contain the base address of the segment as in real mode. Instead they contain a special structure - `Segment Selector`. Each Segment Descriptor has an associated Segment Selector. `Segment Selector` is a 16-bit structure: - -``` ------------------------------ -| Index | TI | RPL | ------------------------------ -``` - -Where, -* **Index** shows the index number of the descriptor in the GDT. -* **TI**(Table Indicator) shows where to search for the descriptor. If it is 0 then search in the Global Descriptor Table(GDT) otherwise it will look in Local Descriptor Table(LDT). -* And **RPL** is Requester's Privilege Level. - -Every segment register has a visible and hidden part. -* Visible - Segment Selector is stored here -* Hidden - Segment Descriptor(base, limit, attributes, flags) - -The following steps are needed to get the physical address in the protected mode: - -* The segment selector must be loaded in one of the segment registers -* The CPU tries to find a segment descriptor by GDT address + Index from selector and load the descriptor into the *hidden* part of the segment register -* Base address (from segment descriptor) + offset will be the linear address of the segment which is the physical address (if paging is disabled). - -Schematically it will look like this: - -![linear address](http://oi62.tinypic.com/2yo369v.jpg) - -The algorithm for the transition from real mode into protected mode is: - -* Disable interrupts -* Describe and load GDT with `lgdt` instruction -* Set PE (Protection Enable) bit in CR0 (Control Register 0) -* Jump to protected mode code - -We will see the complete transition to protected mode in the linux kernel in the next part, but before we can move to protected mode, we need to do some more preparations. - -Let's look at [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). We can see some routines there which perform keyboard initialization, heap initialization, etc... Let's take a look. - -Copying boot parameters into the "zeropage" --------------------------------------------------------------------------------- - -We will start from the `main` routine in "main.c". First function which is called in `main` is [`copy_boot_params(void)`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L30). It copies the kernel setup header into the field of the `boot_params` structure which is defined in the [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L113). - -The `boot_params` structure contains the `struct setup_header hdr` field. This structure contains the same fields as defined in [linux boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) and is filled by the boot loader and also at kernel compile/build time. `copy_boot_params` does two things: - -1. Copies `hdr` from [header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L281) to the `boot_params` structure in `setup_header` field - -2. Updates pointer to the kernel command line if the kernel was loaded with the old command line protocol. - -Note that it copies `hdr` with `memcpy` function which is defined in the [copy.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/copy.S) source file. Let's have a look inside: - -```assembly -GLOBAL(memcpy) - pushw %si - pushw %di - movw %ax, %di - movw %dx, %si - pushw %cx - shrw $2, %cx - rep; movsl - popw %cx - andw $3, %cx - rep; movsb - popw %di - popw %si - retl -ENDPROC(memcpy) -``` - -Yeah, we just moved to C code and now assembly again :) First of all we can see that `memcpy` and other routines which are defined here, start and end with the two macros: `GLOBAL` and `ENDPROC`. `GLOBAL` is described in [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) which defines `globl` directive and the label for it. `ENDPROC` is described in [include/linux/linkage.h](https://github.com/torvalds/linux/blob/master/include/linux/linkage.h) which marks `name` symbol as function name and ends with the size of the `name` symbol. - -Implementation of `memcpy` is easy. At first, it pushes values from `si` and `di` registers to the stack because their values will change during the `memcpy`, so it pushes them on the stack to preserve their values. `memcpy` (and other functions in copy.S) use `fastcall` calling conventions. So it gets its incoming parameters from the `ax`, `dx` and `cx` registers. Calling `memcpy` looks like this: - -```c -memcpy(&boot_params.hdr, &hdr, sizeof hdr); -``` - -So, -* `ax` will contain the address of the `boot_params.hdr` in bytes -* `dx` will contain the address of `hdr` in bytes -* `cx` will contain the size of `hdr` in bytes. - -`memcpy` puts the address of `boot_params.hdr` into `si` and saves the size on the stack. After this it shifts to the right on 2 size (or divide on 4) and copies from `si` to `di` by 4 bytes. After this we restore the size of `hdr` again, align it by 4 bytes and copy the rest of the bytes from `si` to `di` byte by byte (if there is more). Restore `si` and `di` values from the stack in the end and after this copying is finished. - -Console initialization --------------------------------------------------------------------------------- - -After the `hdr` is copied into `boot_params.hdr`, the next step is console initialization by calling the `console_init` function which is defined in [arch/x86/boot/early_serial_console.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/early_serial_console.c). - -It tries to find the `earlyprintk` option in the command line and if the search was successful, it parses the port address and baud rate of the serial port and initializes the serial port. Value of `earlyprintk` command line option can be one of the: - - * serial,0x3f8,115200 - * serial,ttyS0,115200 - * ttyS0,115200 - -After serial port initialization we can see the first output: - -```C -if (cmdline_find_option_bool("debug")) - puts("early console in setup code\n"); -``` - -The definition of `puts` is in [tty.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/tty.c). As we can see it prints character by character in a loop by calling the `putchar` function. Let's look into the `putchar` implementation: - -```C -void __attribute__((section(".inittext"))) putchar(int ch) -{ - if (ch == '\n') - putchar('\r'); - - bios_putchar(ch); - - if (early_serial_base != 0) - serial_putchar(ch); -} -``` - -`__attribute__((section(".inittext")))` means that this code will be in the `.inittext` section. We can find it in the linker file [setup.ld](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L19). - -First of all, `put_char` checks for the `\n` symbol and if it is found, prints `\r` before. After that it outputs the character on the VGA screen by calling the BIOS with the `0x10` interrupt call: - -```C -static void __attribute__((section(".inittext"))) bios_putchar(int ch) -{ - struct biosregs ireg; - - initregs(&ireg); - ireg.bx = 0x0007; - ireg.cx = 0x0001; - ireg.ah = 0x0e; - ireg.al = ch; - intcall(0x10, &ireg, NULL); -} -``` - -Here `initregs` takes the `biosregs` structure and first fills `biosregs` with zeros using the `memset` function and then fills it with register values. - -```C - memset(reg, 0, sizeof *reg); - reg->eflags |= X86_EFLAGS_CF; - reg->ds = ds(); - reg->es = ds(); - reg->fs = fs(); - reg->gs = gs(); -``` - -Let's look at the [memset](https://github.com/torvalds/linux/blob/master/arch/x86/boot/copy.S#L36) implementation: - -```assembly -GLOBAL(memset) - pushw %di - movw %ax, %di - movzbl %dl, %eax - imull $0x01010101,%eax - pushw %cx - shrw $2, %cx - rep; stosl - popw %cx - andw $3, %cx - rep; stosb - popw %di - retl -ENDPROC(memset) -``` - -As you can read above, it uses the `fastcall` calling conventions like the `memcpy` function, which means that the function gets parameters from `ax`, `dx` and `cx` registers. - -Generally `memset` is like a memcpy implementation. It saves the value of the `di` register on the stack and puts the `ax` value into `di` which is the address of the `biosregs` structure. Next is the `movzbl` instruction, which copies the `dl` value to the low 2 bytes of the `eax` register. The remaining 2 high bytes of `eax` will be filled with zeros. - -The next instruction multiplies `eax` with `0x01010101`. It needs to because `memset` will copy 4 bytes at the same time. For example, we need to fill a structure with `0x7` with memset. `eax` will contain `0x00000007` value in this case. So if we multiply `eax` with `0x01010101`, we will get `0x07070707` and now we can copy these 4 bytes into the structure. `memset` uses `rep; stosl` instructions for copying `eax` into `es:di`. - -The rest of the `memset` function does almost the same as `memcpy`. - -After that `biosregs` structure is filled with `memset`, `bios_putchar` calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which prints a character. Afterwards it checks if the serial port was initialized or not and writes a character there with [serial_putchar](https://github.com/torvalds/linux/blob/master/arch/x86/boot/tty.c#L30) and `inb/outb` instructions if it was set. - -Heap initialization --------------------------------------------------------------------------------- - -After the stack and bss section were prepared in [header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) (see previous [part](linux-bootstrap-1.md)), the kernel needs to initialize the [heap](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) with the [`init_heap`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) function. - -First of all `init_heap` checks the [`CAN_USE_HEAP`](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L21) flag from the [`loadflags`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) in the kernel setup header and calculates the end of the stack if this flag was set: - -```C - char *stack_end; - - if (boot_params.hdr.loadflags & CAN_USE_HEAP) { - asm("leal %P1(%%esp),%0" - : "=r" (stack_end) : "i" (-STACK_SIZE)); -``` - -or in other words `stack_end = esp - STACK_SIZE`. - -Then there is the `heap_end` calculation: -```c - heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200); -``` -which means `heap_end_ptr` or `_end` + `512`(`0x200h`). And at the last is checked that whether `heap_end` is greater than `stack_end`. If it is then `stack_end` is assigned to `heap_end` to make them equal. - -Now the heap is initialized and we can use it using the `GET_HEAP` method. We will see how it is used, how to use it and how the it is implemented in the next posts. - -CPU validation --------------------------------------------------------------------------------- - -The next step as we can see is cpu validation by `validate_cpu` from [arch/x86/boot/cpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cpu.c). - -It calls the [`check_cpu`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cpucheck.c#L102) function and passes cpu level and required cpu level to it and checks that the kernel launches on the right cpu level. -```c -check_cpu(&cpu_level, &req_level, &err_flags); - if (cpu_level < req_level) { - ... - return -1; - } -``` -`check_cpu` checks the cpu's flags, presence of [long mode](http://en.wikipedia.org/wiki/Long_mode) in case of x86_64(64-bit) CPU, checks the processor's vendor and makes preparation for certain vendors like turning off SSE+SSE2 for AMD if they are missing, etc. - -Memory detection --------------------------------------------------------------------------------- - -The next step is memory detection by the `detect_memory` function. `detect_memory` basically provides a map of available RAM to the cpu. It uses different programming interfaces for memory detection like `0xe820`, `0xe801` and `0x88`. We will see only the implementation of **0xE820** here. - -Let's look into the `detect_memory_e820` implementation from the [arch/x86/boot/memory.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/memory.c) source file. First of all, the `detect_memory_e820` function initializes the `biosregs` structure as we saw above and fills registers with special values for the `0xe820` call: - -```assembly - initregs(&ireg); - ireg.ax = 0xe820; - ireg.cx = sizeof buf; - ireg.edx = SMAP; - ireg.di = (size_t)&buf; -``` - -* `ax` contains the number of the function (0xe820 in our case) -* `cx` register contains size of the buffer which will contain data about memory -* `edx` must contain the `SMAP` magic number -* `es:di` must contain the address of the buffer which will contain memory data -* `ebx` has to be zero. - -Next is a loop where data about the memory will be collected. It starts from the call of the `0x15` BIOS interrupt, which writes one line from the address allocation table. For getting the next line we need to call this interrupt again (which we do in the loop). Before the next call `ebx` must contain the value returned previously: - -```C - intcall(0x15, &ireg, &oreg); - ireg.ebx = oreg.ebx; -``` - -Ultimately, it does iterations in the loop to collect data from the address allocation table and writes this data into the `e820entry` array: - -* start of memory segment -* size of memory segment -* type of memory segment (which can be reserved, usable and etc...). - -You can see the result of this in the `dmesg` output, something like: - -``` -[ 0.000000] e820: BIOS-provided physical RAM map: -[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable -[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved -[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved -[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable -[ 0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved -[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved -``` - -Keyboard initialization --------------------------------------------------------------------------------- - -The next step is the initialization of the keyboard with the call of the [`keyboard_init()`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L65) function. At first `keyboard_init` initializes registers using the `initregs` function and calling the [0x16](http://www.ctyme.com/intr/rb-1756.htm) interrupt for getting the keyboard status. -```c - initregs(&ireg); - ireg.ah = 0x02; /* Get keyboard status */ - intcall(0x16, &ireg, &oreg); - boot_params.kbd_status = oreg.al; -``` -After this it calls [0x16](http://www.ctyme.com/intr/rb-1757.htm) again to set repeat rate and delay. -```c - ireg.ax = 0x0305; /* Set keyboard repeat rate */ - intcall(0x16, &ireg, NULL); -``` - -Querying --------------------------------------------------------------------------------- - -The next couple of steps are queries for different parameters. We will not dive into details about these queries, but will get back to it in later parts. Let's take a short look at these functions: - -The [query_mca](https://github.com/torvalds/linux/blob/master/arch/x86/boot/mca.c#L18) routine calls the [0x15](http://www.ctyme.com/intr/rb-1594.htm) BIOS interrupt to get the machine model number, sub-model number, BIOS revision level, and other hardware-specific attributes: - -```c -int query_mca(void) -{ - struct biosregs ireg, oreg; - u16 len; - - initregs(&ireg); - ireg.ah = 0xc0; - intcall(0x15, &ireg, &oreg); - - if (oreg.eflags & X86_EFLAGS_CF) - return -1; /* No MCA present */ - - set_fs(oreg.es); - len = rdfs16(oreg.bx); - - if (len > sizeof(boot_params.sys_desc_table)) - len = sizeof(boot_params.sys_desc_table); - - copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len); - return 0; -} -``` - -It fills the `ah` register with `0xc0` and calls the `0x15` BIOS interruption. After the interrupt execution it checks the [carry flag](http://en.wikipedia.org/wiki/Carry_flag) and if it is set to 1, the BIOS doesn't support (**MCA**)[https://en.wikipedia.org/wiki/Micro_Channel_architecture]. If carry flag is set to 0, `ES:BX` will contain a pointer to the system information table, which looks like this: - -``` -Offset Size Description ) - 00h WORD number of bytes following - 02h BYTE model (see #00515) - 03h BYTE submodel (see #00515) - 04h BYTE BIOS revision: 0 for first release, 1 for 2nd, etc. - 05h BYTE feature byte 1 (see #00510) - 06h BYTE feature byte 2 (see #00511) - 07h BYTE feature byte 3 (see #00512) - 08h BYTE feature byte 4 (see #00513) - 09h BYTE feature byte 5 (see #00514) ----AWARD BIOS--- - 0Ah N BYTEs AWARD copyright notice ----Phoenix BIOS--- - 0Ah BYTE ??? (00h) - 0Bh BYTE major version - 0Ch BYTE minor version (BCD) - 0Dh 4 BYTEs ASCIZ string "PTL" (Phoenix Technologies Ltd) ----Quadram Quad386--- - 0Ah 17 BYTEs ASCII signature string "Quadram Quad386XT" ----Toshiba (Satellite Pro 435CDS at least)--- - 0Ah 7 BYTEs signature "TOSHIBA" - 11h BYTE ??? (8h) - 12h BYTE ??? (E7h) product ID??? (guess) - 13h 3 BYTEs "JPN" - ``` - -Next we call the `set_fs` routine and pass the value of the `es` register to it. Implementation of `set_fs` is pretty simple: - -```c -static inline void set_fs(u16 seg) -{ - asm volatile("movw %0,%%fs" : : "rm" (seg)); -} -``` - -This function contains inline assembly which gets the value of the `seg` parameter and puts it into the `fs` register. There are many functions in [boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h) like `set_fs`, for example `set_gs`, `fs`, `gs` for reading a value in it etc... - -At the end of `query_mca` it just copies the table which pointed to by `es:bx` to the `boot_params.sys_desc_table`. - -The next step is getting [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep) information by calling the `query_ist` function. First of all it checks the CPU level and if it is correct, calls `0x15` for getting info and saves the result to `boot_params`. - -The following [query_apm_bios](https://github.com/torvalds/linux/blob/master/arch/x86/boot/apm.c#L21) function gets [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management) information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interruption too, but with `ah` = `0x53` to check `APM` installation. After the `0x15` execution, `query_apm_bios` functions checks `PM` signature (it must be `0x504d`), carry flag (it must be 0 if `APM` supported) and value of the `cx` register (if it's 0x02, protected mode interface is supported). - -Next it calls the `0x15` again, but with `ax = 0x5304` for disconnecting the `APM` interface and connecting the 32-bit protected mode interface. In the end it fills `boot_params.apm_bios_info` with values obtained from the BIOS. - -Note that `query_apm_bios` will be executed only if `CONFIG_APM` or `CONFIG_APM_MODULE` was set in configuration file: - -```C -#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE) - query_apm_bios(); -#endif -``` - -The last is the [`query_edd`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c#L122) function, which queries `Enhanced Disk Drive` information from the BIOS. Let's look into the `query_edd` implementation. - -First of all it reads the [edd](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt#L1023) option from kernel's command line and if it was set to `off` then `query_edd` just returns. - -If EDD is enabled, `query_edd` goes over BIOS-supported hard disks and queries EDD information in the following loop: - -```C - for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) { - if (!get_edd_info(devno, &ei) && boot_params.eddbuf_entries < EDDMAXNR) { - memcpy(edp, &ei, sizeof ei); - edp++; - boot_params.eddbuf_entries++; - } - ... - ... - ... -``` - -where `0x80` is the first hard drive and the value of `EDD_MBR_SIG_MAX` macro is 16. It collects data into the array of [edd_info](https://github.com/torvalds/linux/blob/master/include/uapi/linux/edd.h#L172) structures. `get_edd_info` checks that EDD is present by invoking the `0x13` interrupt with `ah` as `0x41` and if EDD is present, `get_edd_info` again calls the `0x13` interrupt, but with `ah` as `0x48` and `si` containing the address of the buffer where EDD information will be stored. - -Conclusion --------------------------------------------------------------------------------- - -This is the end of the second part about Linux kernel internals. In the next part we will see video mode setting and the rest of preparations before transition to protected mode and directly transitioning into it. - -If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode) -* [Protected mode](http://wiki.osdev.org/Protected_Mode) -* [Long mode](http://en.wikipedia.org/wiki/Long_mode) -* [Nice explanation of CPU Modes with code](http://www.codeproject.com/Articles/45788/The-Real-Protected-Long-mode-assembly-tutorial-for) -* [How to Use Expand Down Segments on Intel 386 and Later CPUs](http://www.sudleyplace.com/dpmione/expanddown.html) -* [earlyprintk documentation](http://lxr.free-electrons.com/source/Documentation/x86/earlyprintk.txt) -* [Kernel Parameters](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt) -* [Serial console](https://github.com/torvalds/linux/blob/master/Documentation/serial-console.txt) -* [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep) -* [APM](https://en.wikipedia.org/wiki/Advanced_Power_Management) -* [EDD specification](http://www.t13.org/documents/UploadedDocuments/docs2004/d1572r3-EDD3.pdf) -* [TLDP documentation for Linux Boot Process](http://www.tldp.org/HOWTO/Linux-i386-Boot-Code-HOWTO/setup.html) (old) -* [Previous Part](linux-bootstrap-1.md) - diff --git a/Booting/linux-bootstrap-3.md b/Booting/linux-bootstrap-3.md deleted file mode 100644 index 8cc25ac..0000000 --- a/Booting/linux-bootstrap-3.md +++ /dev/null @@ -1,541 +0,0 @@ -Kernel booting process. Part 3. -================================================================================ - -Video mode initialization and transition to protected mode --------------------------------------------------------------------------------- - -This is the third part of the `Kernel booting process` series. In the previous [part](linux-bootstrap-2.md#kernel-booting-process-part-2), we stopped right before the call of the `set_video` routine from the [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L181). We will see video mode initialization in the kernel setup code, preparation before switching into the protected mode and transition into it in this part. - -**NOTE** If you don't know anything about protected mode, you can find some information about it in the previous [part](linux-bootstrap-2.md#protected-mode). Also there are a couple of [links](linux-bootstrap-2.md#links) which can help you. - -As i wrote above, we will start from the `set_video` function which defined in the [arch/x86/boot/video.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/video.c#L315) source code file. We can see that it starts with getting of video mode from the `boot_params.hdr` structure: - -```C -u16 mode = boot_params.hdr.vid_mode; -``` - -which we filled in the `copy_boot_params` function (you can read about it in the previous post). `vid_mode` is an obligatory field which filled by the bootloader. You can find information about it in the kernel boot protocol: - -``` -Offset Proto Name Meaning -/Size -01FA/2 ALL vid_mode Video mode control -``` - -As we can read from the linux kernel boot protocol: - -``` -vga= - here is either an integer (in C notation, either - decimal, octal, or hexadecimal) or one of the strings - "normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask" - (meaning 0xFFFD). This value should be entered into the - vid_mode field, as it is used by the kernel before the command - line is parsed. -``` - -So we can add `vga` option to the grub or another bootloader configuration file and it will pass this option to the kernel command line. This option can have different values as we can read from the description, for example it can be integer number or `ask`. If you will pass `ask`, you see menu like this: - -![video mode setup menu](http://oi59.tinypic.com/ejcz81.jpg) - -which will suggest to select video mode. We will look on it's implementation, but before we must to know about another things. - -Kernel data types --------------------------------------------------------------------------------- - -Earlier we saw definitions of different data types like `u16` and etc... in the kernel setup code. Let's look on a couple of data types provided by the kernel: - -``` -| Type | char | short | int | long | u8 | u16 | u32 | u64 | -|------|------|-------|-----|------|----|-----|-----|-----| -| Size | 1 | 2 | 4 | 8 | 1 | 2 | 4 | 8 | -``` - -If you will read source code of the kernel, you'll see it very often, so it will be good to remember about it. - -Heap API --------------------------------------------------------------------------------- - -As we got `vid_mode` from the `boot_params.hdr`, we can see call of the `RESET_HEAP` in the `set_video` function. `RESET_HEAP` is a macro which defined in the [boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h#L199) and looks as: - -```C -#define RESET_HEAP() ((void *)( HEAP = _end )) -``` - -If you read second part, you can remember that we initialized the heap with the [init_heap](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) function. Since we can use heap, we have a couple functions for it which defined in the `boot.h`. They are: - -```C -#define RESET_HEAP()... -``` - -As we saw just now. It uses for resetting the heap by setting the `HEAP` variable equal to `_end`, where `_end` is just: - -```C -extern char _end[]; -``` - -Next is `GET_HEAP` macro: - -```C -#define GET_HEAP(type, n) \ - ((type *)__get_heap(sizeof(type),__alignof__(type),(n))) -``` - -for heap allocation. It calls internal function `__get_heap` with 3 parameters: - -* size of a type in bytes, which need be allocated -* next parameter shows how type of variable is aligned -* how many bytes to allocate - -Implementation of `__get_heap` is: - -```C -static inline char *__get_heap(size_t s, size_t a, size_t n) -{ - char *tmp; - - HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1)); - tmp = HEAP; - HEAP += s*n; - return tmp; -} -``` - -and further we will see usage of it, something like this: - -```C -saved.data = GET_HEAP(u16, saved.x*saved.y); -``` - -Let's try to understand how `GET_HEAP` works. We can see here that `HEAP` (which equal to `_end` after `RESET_HEAP()`) is the address of aligned memory according to `a` parameter. After it we save memory address from `HEAP` to the `tmp` variable, move `HEAP` to the end of allocated block and return `tmp` which is start address of allocated memory. - -And the last function is: - -```C -static inline bool heap_free(size_t n) -{ - return (int)(heap_end-HEAP) >= (int)n; -} -``` - -which subtracts value of the `HEAP` from the `heap_end` (we calculated it in the previous [part](linux-bootstrap-2.md)) and returns 1 if there is enough memory for `n`. - -That's all. Now we have simple API for heap and can setup video mode. - -Setup video mode --------------------------------------------------------------------------------- - -Now we can move directly to video mode initialization. We stopped at the `RESET_HEAP()` call in the `set_video` function. The next call of `store_mode_params` which stores video mode parameters in the `boot_params.screen_info` structure which defined in the [include/uapi/linux/screen_info.h](https://github.com/0xAX/linux/blob/master/include/uapi/linux/screen_info.h). If we will look at `store_mode_params` function, we can see that it starts from the call of the `store_cursor_position` function. As you can understand from the function name, it gets information about cursor and stores it. First of all `store_cursor_position` initializes two variables which has type - `biosregs`, with `AH = 0x3` and calls `0x10` BIOS interruption. After interruption successfully executed, it returns row and column in the `DL` and `DH` registers. Row and column will be stored in the `orig_x` and `orig_y` fields from the the `boot_params.screen_info` structure. After `store_cursor_position` executed, `store_video_mode` function will be called. It just gets current video mode and stores it in the `boot_params.screen_info.orig_video_mode`. - -After this, it checks current video mode and set the `video_segment`. After the BIOS transfers control to the boot sector, the following addresses are video memory: - -``` -0xB000:0x0000 32 Kb Monochrome Text Video Memory -0xB800:0x0000 32 Kb Color Text Video Memory -``` - -So we set the `video_segment` variable to `0xb000` if current video mode is MDA, HGC, VGA in monochrome mode or `0xb800` in color mode. After setup of the address of the video segment need to store font size in the `boot_params.screen_info.orig_video_points` with: - -```C -set_fs(0); -font_size = rdfs16(0x485); -boot_params.screen_info.orig_video_points = font_size; -``` - -First of all we put 0 to the `FS` register with `set_fs` function. We already saw functions like `set_fs` in the previous part. They are all defined in the [boot.h](https://github.com/0xAX/linux/blob/master/arch/x86/boot/boot.h). Next we read value which located at address `0x485` (this memory location used to get the font size) and save font size in the `boot_params.screen_info.orig_video_points`. - -The next we get amount of columns and rows by address `0x44a` and stores they in the `boot_params.screen_info.orig_video_cols` and `boot_params.screen_info.orig_video_lines`. After this, execution of the `store_mode_params` is finished. - -The next we can see `save_screen` function which just saves screen content to the heap. This function collects all data which we got in the previous functions like rows and columns amount and etc... to the `saved_screen` structure, which defined as: - -```C -static struct saved_screen { - int x, y; - int curx, cury; - u16 *data; -} saved; -``` - -It checks that heap has free space for it with: - -```C -if (!heap_free(saved.x*saved.y*sizeof(u16)+512)) - return; -``` - -and allocates space in the heap if it is enough and stores `saved_screen` in it. - -The next call is `probe_cards(0)` from the [arch/x86/boot/video-mode.c](https://github.com/0xAX/linux/blob/master/arch/x86/boot/video-mode.c#L33). It goes over all video_cards and collects number of modes provided by the cards. Here is the interesting moment, we can see the loop: - -```C -for (card = video_cards; card < video_cards_end; card++) { - /* collecting number of modes here */ -} -``` - -but `video_cards` not declared anywhere. Answer is simple: Every video mode presented in the x86 kernel setup code has definition like this: - -```C -static __videocard video_vga = { - .card_name = "VGA", - .probe = vga_probe, - .set_mode = vga_set_mode, -}; -``` - -where `__videocard` is a macro: - -```C -#define __videocard struct card_info __attribute__((used,section(".videocards"))) -``` - -which means that `card_info` structure: - -```C -struct card_info { - const char *card_name; - int (*set_mode)(struct mode_info *mode); - int (*probe)(void); - struct mode_info *modes; - int nmodes; - int unsafe; - u16 xmode_first; - u16 xmode_n; -}; -``` - -is in the `.videocards` segment. Let's look on the [arch/x86/boot/setup.ld](https://github.com/0xAX/linux/blob/master/arch/x86/boot/setup.ld) linker file, we can see there: - -``` - .videocards : { - video_cards = .; - *(.videocards) - video_cards_end = .; - } -``` - -It means that `video_cards` is just memory address and all `card_info` structures are placed in this segment. It means that all `card_info` structures are placed between `video_cards` and `video_cards_end`, so we can use it in a loop to go over all of it. After `probe_cards` executed we have all structures like `static __videocard video_vga` with filled `nmodes` (number of video modes). - -After that `probe_cards` executed, we move to the main loop in the `setup_video` function. There is infinite loop which tries to setup video mode with the `set_mode` function or prints menu if we passed `vid_mode=ask` to the kernel command line or video mode is undefined. - -The `set_mode` function is defined in the [video-mode.c](https://github.com/0xAX/linux/blob/master/arch/x86/boot/video-mode.c#L147) and gets only one parameter - `mode` which is number of video mode (we got it or from the menu or in the start of the `setup_video`, from kernel setup header). - -`set_mode` function checks the `mode` and calls `raw_set_mode` function. The `raw_set_mode` calls `set_mode` function for selected card. We can get access to this function from the `card_info` structure, every video mode defines this structure with filled value which depends on video mode (for example for `vga` it is `video_vga.set_mode` function, see above example of `card_info` structure for `vga`). `video_vga.set_mode` is `vga_set_mode`, which checks vga mode and call function depend on mode: - -```C -static int vga_set_mode(struct mode_info *mode) -{ - vga_set_basic_mode(); - - force_x = mode->x; - force_y = mode->y; - - switch (mode->mode) { - case VIDEO_80x25: - break; - case VIDEO_8POINT: - vga_set_8font(); - break; - case VIDEO_80x43: - vga_set_80x43(); - break; - case VIDEO_80x28: - vga_set_14font(); - break; - case VIDEO_80x30: - vga_set_80x30(); - break; - case VIDEO_80x34: - vga_set_80x34(); - break; - case VIDEO_80x60: - vga_set_80x60(); - break; - } - return 0; -} -``` - -Every function which setups video mode, just call `0x10` BIOS interruption with certain value in the `AH` register. After this we have set video mode and now we can switch to the protected mode. - -Last preparation before transition into protected mode --------------------------------------------------------------------------------- - -We can see the last function call - `go_to_protected_mode` in the [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L184). As comment says: `Do the last things and invoke protected mode`, so let's see last preparation and switch into the protected mode. - -`go_to_protected_mode` defined in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c#L104). It contains some functions which make last preparations before we can jump into protected mode, so let's look on it and try to understand what they do and how it works. - -At first we see call of `realmode_switch_hook` function in the `go_to_protected_mode`. This function invokes real mode switch hook if it is present and disables [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). Hooks are used if bootloader runs in a hostile environment. More about hooks you can read in the [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) (see **ADVANCED BOOT LOADER HOOKS**). `readlmode_swtich` hook presents pointer to the 16-bit real mode far subroutine which disables non-maskable interruptions. After we checked `realmode_switch` hook (it doesn't present for me), there is disabling of non-maskable interruptions: - -```assembly -asm volatile("cli"); -outb(0x80, 0x70); -io_delay(); -``` - -At first there is inline assembly instruction with `cli` instruction which clears the interrupt flag (`IF`), after this external interrupts disabled. Next line disables NMI (non-maskable interruption). Interruption is a signal to the CPU which emitted by hardware or software. After getting signal, CPU stops to execute current instructions sequence and transfers control to the interruption handler. After interruption handler finished it's work, it transfers control to the interrupted instruction. Non-maskable interruptions (NMI) - interruptions which are always processed, independently of permission. We will not dive into details interruptions now, but will back to it in the next posts. - -Let's back to the code. We can see that second line is writing `0x80` (disabled bit) byte to the `0x70` (CMOS Address register). And call the `io_delay` function after it. `io_delay` which initiates small delay and looks like: - -``` -static inline void io_delay(void) -{ - const u16 DELAY_PORT = 0x80; - asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT)); -} -``` - -Outputting any byte to the port `0x80` should delay exactly 1 microsecond. Sow we can write any value (value from `AL` register in our case) to the `0x80` port. After this delay `realmode_switch_hook` function finished execution and we can move to the next function. - -The next function is `enable_a20`, which enables [A20 line](http://en.wikipedia.org/wiki/A20_line). This function defined in the [arch/x86/boot/a20.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/a20.c) and it tries to enable A20 gate with different methods. The first is `a20_test_short` function which checks is A20 already enabled or not with `a20_test` function: - -```C -static int a20_test(int loops) -{ - int ok = 0; - int saved, ctr; - - set_fs(0x0000); - set_gs(0xffff); - - saved = ctr = rdfs32(A20_TEST_ADDR); - - while (loops--) { - wrfs32(++ctr, A20_TEST_ADDR); - io_delay(); /* Serialize and make delay constant */ - ok = rdgs32(A20_TEST_ADDR+0x10) ^ ctr; - if (ok) - break; - } - - wrfs32(saved, A20_TEST_ADDR); - return ok; -} -``` - -First of all we put `0x0000` to the `FS` register and `0xffff` to the `GS` register. Next we read value by address `A20_TEST_ADDR` (it is `0x200`) and put this value into `saved` variable and `ctr`. Next we write updated `ctr` value into `fs:gs` with `wrfs32` function, make little delay, and read value into the `GS` register by address `A20_TEST_ADDR+0x10`, if it's not zero we already have enabled a20 line. If A20 is disabled, we try to enabled it with different method which you can find in the `a20.c`. For example with call of `0x15` BIOS interruption with `AH=0x2041` and etc... If `enabled_a20` function finished with fail, printed error message and called function `die`. You can remember it from the first source code file where we started - [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S): - -```assembly -die: - hlt - jmp die - .size die, .-die -``` - -After the a20 gate successfully enabled, there are reset coprocessor and mask all interrupts. And after all of this preparations, we can see actual transition into protected mode. - -Setup Interrupt Descriptor Table --------------------------------------------------------------------------------- - -Then next ist setup of Interrupt Descriptor table (IDT). `setup_idt`: - -```C -static void setup_idt(void) -{ - static const struct gdt_ptr null_idt = {0, 0}; - asm volatile("lidtl %0" : : "m" (null_idt)); -} -``` - -which setups `Interrupt descriptor table` (describes interrupt handlers and etc...). For now IDT is not installed (we will see it later), but now we just load IDT with `lidtl` instruction. `null_idt` contains address and size of IDT, but now they are just zero. `null_idt` is a `gdt_ptr` structure, it looks: - -```C -struct gdt_ptr { - u16 len; - u32 ptr; -} __attribute__((packed)); -``` - -where we can see - 16-bit length of IDT and 32-bit pointer to it (More details about IDT and interruptions we will see in the next posts). ` __attribute__((packed))` means here that size of `gdt_ptr` minimum as required. So size of the `gdt_ptr` will be 6 bytes here or 48 bits. (Next we will load pointer to the `gdt_ptr` to the `GDTR` register and you can remember from the previous post that it is 48-bits size). - -Setup Global Descriptor Table --------------------------------------------------------------------------------- - -The next point is setup of the Global Descriptor Table (GDT). We can see `setup_gdt` function which setups GDT (you can read about it in the [Kernel booting process. Part 2.](linux-bootstrap-2.md#protected-mode)). There is definition of the `boot_gdt` array in this function, which contains definition of the three segments: - -```C - static const u64 boot_gdt[] __attribute__((aligned(16))) = { - [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff), - [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff), - [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103), - }; -``` - -For code, data and TSS (Task state segment). We will not use task state segment for now, it was added there to make Intel VT happy as we can see in the comment line (if you're interesting you can find commit which describes it - [here](https://github.com/torvalds/linux/commit/88089519f302f1296b4739be45699f06f728ec31)). Let's look on `boot_gdt`. First of all we can note that it has `__attribute__((aligned(16)))` attribute. It means that this structure will be aligned by 16 bytes. Let's look on simple example: - -```C -#include - -struct aligned { - int a; -}__attribute__((aligned(16))); - -struct nonaligned { - int b; -}; - -int main(void) -{ - struct aligned a; - struct nonaligned na; - - printf("Not aligned - %zu \n", sizeof(na)); - printf("Aligned - %zu \n", sizeof(a)); - - return 0; -} -``` - -Technically structure which contains one `int` field, must be 4 bytes, but here `aligned` structure will be 16 bytes: - -``` -$ gcc test.c -o test && test -Not aligned - 4 -Aligned - 16 -``` - -`GDT_ENTRY_BOOT_CS` has index - 2 here, `GDT_ENTRY_BOOT_DS` is `GDT_ENTRY_BOOT_CS + 1` and etc... It starts from 2, because first is a mandatory null descriptor (index - 0) and the second is not used (index - 1). - -`GDT_ENTRY` is a macro which takes flags, base and limit and builds GDT entry. For example let's look on the code segment entry. `GDT_ENTRY` takes following values: - -* base - 0 -* limit - 0xfffff -* flags - 0xc09b - -What does it mean? Segment's base address is 0, limit (size of segment) is - `0xffff` (1 MB). Let's look on flags. It is `0xc09b` and it will be: - -``` -1100 0000 1001 1011 -``` - -in binary. Let's try to understand what every bit means. We will go through all bits from left to right: - -* 1 - (G) granularity bit -* 1 - (D) if 0 16-bit segment; 1 = 32-bit segment -* 0 - (L) executed in 64 bit mode if 1 -* 0 - (AVL) available for use by system software -* 0000 - 4 bit length 19:16 bits in the descriptor -* 1 - (P) segment presence in memory -* 00 - (DPL) - privilege level, 0 is the highest privilege -* 1 - (S) code or data segment, not a system segment -* 101 - segment type execute/read/ -* 1 - accessed bit - -You can know more about every bit in the previous [post](linux-bootstrap-2.md) or in the [IntelĀ® 64 and IA-32 Architectures Software Developer’s Manuals 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html). - -After this we get length of GDT with: - -```C -gdt.len = sizeof(boot_gdt)-1; -``` - -We get size of `boot_gdt` and subtract 1 (the last valid address in the GDT). - -Next we get pointer to the GDT with: - -```C -gdt.ptr = (u32)&boot_gdt + (ds() << 4); -``` - -Here we just get address of `boot_gdt` and add it to address of data segment shifted on 4 (remember we're in the real mode now). - -In the last we execute `lgdtl` instruction to load GDT into GDTR register: - -```C -asm volatile("lgdtl %0" : : "m" (gdt)); -``` - -Actual transition into protected mode --------------------------------------------------------------------------------- - -It is the end of `go_to_protected_mode` function. We loaded IDT, GDT, disable interruptions and now can switch CPU into protected mode. The last step we call `protected_mode_jump` function with two parameters: - -```C -protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4)); -``` - -which defined in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S#L26). It takes two parameters: - -* address of protected mode entry point -* address of `boot_params` - -Let's look inside `protected_mode_jump`. As i wrote above, you can find it in the `arch/x86/boot/pmjump.S`. First parameter will be in `eax` register and second is in `edx`. First of all we put address of `boot_params` to the `esi` register and address of code segment register `cs` (0x1000) to the `bx`. After this we shift `bx` on 4 and add address of label `2` to it (we will have physical address of label `2` in the `bx` after it) and jump to label `1`. Next we put data segment and task state segment in the `cs` and `di` registers with: - -```assembly -movw $__BOOT_DS, %cx -movw $__BOOT_TSS, %di -``` - -As you can read above `GDT_ENTRY_BOOT_CS` has index 2 and every GDT entry is 8 byte, so `CS` will be `2 * 8 = 16`, `__BOOT_DS` is 24 and etc... Next we set `PE` (Protection Enable) bit in the `CR0` control register: - -```assembly -movl %cr0, %edx -orb $X86_CR0_PE, %dl -movl %edx, %cr0 -``` - -and make long jump to the protected mode: - -```assembly - .byte 0x66, 0xea -2: .long in_pm32 - .word __BOOT_CS -``` - -where `0x66` is the operand-size prefix, which allows to mix 16-bit and 32-bit code, `0xea` - is the jump opcode, `in_pm32` is the segment offset and `__BOOT_CS` is the segment. - -After this we are in the protected mode: - -```assembly -.code32 -.section ".text32","ax" -``` - -Let's look on the first steps in the protected mode. First of all we setup data segment with: - -```assembly -movl %ecx, %ds -movl %ecx, %es -movl %ecx, %fs -movl %ecx, %gs -movl %ecx, %ss -``` - -if you read with attention, you can remember that we saved `$__BOOT_DS` in the `cx` register. Now we fill with it all segment registers besides `cs` (`cs` is already `__BOOT_CS`). Next we zero out all general purpose registers besides `eax` with: - -```assembly -xorl %ecx, %ecx -xorl %edx, %edx -xorl %ebx, %ebx -xorl %ebp, %ebp -xorl %edi, %edi -``` - -And jump to the 32-bit entry point in the end: - -``` -jmpl *%eax -``` - -remember that `eax` contains address of the 32-bit entry (we passed it as first parameter into `protected_mode_jump`). That's all we're in the protected mode and stops before it's entry point. What is happening after we joined in the 32-bit entry point we will see in the next part. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the third part about linux kernel internals. In next part we will see first steps in the protected mode and transition into the [long mode](http://en.wikipedia.org/wiki/Long_mode). - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [VGA](http://en.wikipedia.org/wiki/Video_Graphics_Array) -* [VESA BIOS Extensions](http://en.wikipedia.org/wiki/VESA_BIOS_Extensions) -* [Data structure alignment](http://en.wikipedia.org/wiki/Data_structure_alignment) -* [Non-maskable interrupt](http://en.wikipedia.org/wiki/Non-maskable_interrupt) -* [A20](http://en.wikipedia.org/wiki/A20_line) -* [GCC designated inits](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Designated-Inits.html) -* [GCC type attributes](https://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html) -* [Previous part](linux-bootstrap-2.md) diff --git a/Booting/linux-bootstrap-4.md b/Booting/linux-bootstrap-4.md deleted file mode 100644 index 7c8d7bd..0000000 --- a/Booting/linux-bootstrap-4.md +++ /dev/null @@ -1,504 +0,0 @@ -Kernel booting process. Part 4. -================================================================================ - -Transition to 64-bit mode --------------------------------------------------------------------------------- - -It is the fourth part of the `Kernel booting process` and we will see first steps in the [protected mode](http://en.wikipedia.org/wiki/Protected_mode), like checking that cpu supports the [long mode](http://en.wikipedia.org/wiki/Long_mode) and [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions), [paging](http://en.wikipedia.org/wiki/Paging) and initialization of the page tables and transition to the long mode in in the end of this part. - -**NOTE: will be much assembly code in this part, so if you have poor knowledge, read a book about it** - -In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md) we stopped at the jump to the 32-bit entry point in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S): - -```assembly -jmpl *%eax -``` - -Remind that `eax` register contains the address of the 32-bit entry point. We can read about this point from the linux kernel x86 boot protocol: - -``` -When using bzImage, the protected-mode kernel was relocated to 0x100000 -``` - -And now we can make sure that it is true. Let's look on registers value in 32-bit entry point: - -``` -eax 0x100000 1048576 -ecx 0x0 0 -edx 0x0 0 -ebx 0x0 0 -esp 0x1ff5c 0x1ff5c -ebp 0x0 0x0 -esi 0x14470 83056 -edi 0x0 0 -eip 0x100000 0x100000 -eflags 0x46 [ PF ZF ] -cs 0x10 16 -ss 0x18 24 -ds 0x18 24 -es 0x18 24 -fs 0x18 24 -gs 0x18 24 -``` - -We can see here that `cs` register contains - `0x10` (as you can remember from the previous part, it is the second index in the Global Descriptor Table), `eip` register is `0x100000` and base address of the all segments include code segment is zero. So we can get physical address, it will be `0:0x100000` or just `0x100000`, as in boot protocol. Now let's start with 32-bit entry point. - -32-bit entry point --------------------------------------------------------------------------------- - -We can find definition of the 32-bit entry point in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S): - -```assembly - __HEAD - .code32 -ENTRY(startup_32) -.... -.... -.... -ENDPROC(startup_32) -``` - -First of all why `compressed` directory? Actually `bzimage` is a gzipped `vmlinux + header + kernel setup code`. We saw the kernel setup code in the all of previous parts. So, the main goal of the `head_64.S` is to prepare for entering long mode, enter into it and decompress the kernel. We will see all of these steps besides kernel decompression in this part. - -Also you can note that there are two files in the `arch/x86/boot/compressed` directory: - -* head_32.S -* head_64.S - -We will see only `head_64.S` because we are learning linux kernel for `x86_64`. `head_32.S` even not compiled in our case. Let's look on the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile), we can see there following target: - -```Makefile -vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \ - $(obj)/string.o $(obj)/cmdline.o \ - $(obj)/piggy.o $(obj)/cpuflags.o -``` - -Note on `$(obj)/head_$(BITS).o`. It means that compilation of the head_{32,64}.o depends on value of the `$(BITS)`. We can find it in the other Makefile - [arch/x86/kernel/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/Makefile): - -```Makefile -ifeq ($(CONFIG_X86_32),y) - BITS := 32 - ... - ... -else - ... - ... - BITS := 64 -endif -``` - -Now we know where to start, so let's do it. - -Reload the segments if need --------------------------------------------------------------------------------- - -As i wrote above, we start in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S). First of all we can see before `startup_32` definition: - -```assembly - __HEAD - .code32 -ENTRY(startup_32) -``` - -`__HEAD` defined in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) and looks as: - -```C -#define __HEAD .section ".head.text","ax" -``` - -We can find this section in the [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) linker script: - -``` -SECTIONS -{ - . = 0; - .head.text : { - _head = . ; - HEAD_TEXT - _ehead = . ; - } -``` - -Note on `. = 0;`. `.` is a special variable of linker - location counter. Assigning a value to it, is an offset relative to the offset of the segment. As we assign zero to it, we can read from comments: - -``` -Be careful parts of head_64.S assume startup_32 is at address 0. -``` - -Ok, now we know where we are, and now the best time to look inside the `startup_32` function. - -In the start of the `startup_32` we can see the `cld` instruction which clears `DF` flag. After this, string operations like `stosb` and other will increment the index registers `esi` or `edi`. - -The Next we can see the check of `KEEP_SEGMENTS` flag from `loadflags`. If you remember we already saw `loadflags` in the `arch/x86/boot/head.S` (there we checked flag `CAN_USE_HEAP`). Now we need to check `KEEP_SEGMENTS` flag. We can find description of this flag in the linux boot protocol: - -``` -Bit 6 (write): KEEP_SEGMENTS - Protocol: 2.07+ - - If 0, reload the segment registers in the 32bit entry point. - - If 1, do not reload the segment registers in the 32bit entry point. - Assume that %cs %ds %ss %es are all set to flat segments with - a base of 0 (or the equivalent for their environment). -``` - -and if `KEEP_SEGMENTS` is not set, we need to set `ds`, `ss` and `es` registers to flat segment with base 0. That we do: - -```C - testb $(1 << 6), BP_loadflags(%esi) - jnz 1f - - cli - movl $(__BOOT_DS), %eax - movl %eax, %ds - movl %eax, %es - movl %eax, %ss -``` - -remember that `__BOOT_DS` is `0x18` (index of data segment in the Global Descriptor Table). If `KEEP_SEGMENTS` is not set, we jump to the label `1f` or update segment registers with `__BOOT_DS` if this flag is set. - -If you read previous the [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md), you can remember that we already updated segment registers in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S), so why we need to set up it again? Actually linux kernel has also 32-bit boot protocol, so `startup_32` can be first function which will be executed right after a bootloader transfers control to the kernel. - -As we checked `KEEP_SEGMENTS` flag and put the correct value to the segment registers, next step is calculate difference between where we loaded and compiled to run (remember that `setup.ld.S` contains `. = 0` at the start of the section): - -```assembly - leal (BP_scratch+4)(%esi), %esp - call 1f -1: popl %ebp - subl $1b, %ebp -``` - -Here `esi` register contains address of the [boot_params](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L113) structure. `boot_params` contains special field `scratch` with offset `0x1e4`. We are getting address of the `scratch` field + 4 bytes and put it to the `esp` register (we will use it as stack for these calculations). After this we can see call instruction and `1f` label as operand of it. What does it mean `call`? It means that it pushes `ebp` value in the stack, next `esp` value, next function arguments and return address in the end. After this we pop return address from the stack into `ebp` register (`ebp` will contain return address) and subtract address of the previous label `1`. - -After this we have address where we loaded in the `ebp` - `0x100000`. - -Now we can setup the stack and verify CPU that it has support of the long mode and [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions). - -Stack setup and CPU verification --------------------------------------------------------------------------------- - -The next we can see assembly code which setups new stack for kernel decompression: - -```assembly - movl $boot_stack_end, %eax - addl %ebp, %eax - movl %eax, %esp -``` - -`boots_stack_end` is in the `.bss` section, we can see definition of it in the end of `head_64.S`: - -```assembly - .bss - .balign 4 -boot_heap: - .fill BOOT_HEAP_SIZE, 1, 0 -boot_stack: - .fill BOOT_STACK_SIZE, 1, 0 -boot_stack_end: -``` - -First of all we put address of the `boot_stack_end` into `eax` register and add to it value of the `ebp` (remember that `ebp` now contains address where we loaded - `0x100000`). In the end we just put `eax` value into `esp` and that's all, we have correct stack pointer. - -The next step is CPU verification. Need to check that CPU has support of `long mode` and `SSE`: - -```assembly - call verify_cpu - testl %eax, %eax - jnz no_longmode -``` - -It just calls `verify_cpu` function from the [arch/x86/kernel/verify_cpu.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/verify_cpu.S) which contains a couple of calls of the `cpuid` instruction. `cpuid` is instruction which is used for getting information about processor. In our case it checks long mode and SSE support and returns `0` on success or `1` on fail in the `eax` register. - -If `eax` is not zero, we jump to the `no_longmode` label which just stops the CPU with `hlt` instruction while any hardware interrupt will not happen. - -```assembly -no_longmode: -1: - hlt - jmp 1b -``` - -We set stack, cheked CPU and now can move on the next step. - -Calculate relocation address --------------------------------------------------------------------------------- - -The next step is calculating relocation address for decompression if need. We can see following assembly code: - -```assembly -#ifdef CONFIG_RELOCATABLE - movl %ebp, %ebx - movl BP_kernel_alignment(%esi), %eax - decl %eax - addl %eax, %ebx - notl %eax - andl %eax, %ebx - cmpl $LOAD_PHYSICAL_ADDR, %ebx - jge 1f -#endif - movl $LOAD_PHYSICAL_ADDR, %ebx -1: - addl $z_extract_offset, %ebx -``` - -First of all note on `CONFIG_RELOCATABLE` macro. This configuration option defined in the [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig) and as we can read from it's description: - -``` -This builds a kernel image that retains relocation information -so it can be loaded someplace besides the default 1MB. - -Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address -it has been loaded at and the compile time physical address -(CONFIG_PHYSICAL_START) is used as the minimum location. -``` - -In short words, this code calculates address where to move kernel for decompression put it to `ebx` register if the kernel is relocatable or bzimage will decompress itself above `LOAD_PHYSICAL_ADDR`. - -Let's look on the code. If we have `CONFIG_RELOCATABLE=n` in our kernel configuration file, it just puts `LOAD_PHYSICAL_ADDR` to the `ebx` register and adds `z_extract_offset` to `ebx`. As `ebx` is zero for now, it will contain `z_extract_offset`. Now let's try to understand these two values. - -`LOAD_PHYSICAL_ADDR` is the macro which defined in the [arch/x86/include/asm/boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/boot.h) and it looks like this: - -```C -#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \ - + (CONFIG_PHYSICAL_ALIGN - 1)) \ - & ~(CONFIG_PHYSICAL_ALIGN - 1)) -``` - -Here we calculates aligned address where kernel is loaded (`0x100000` or 1 megabyte in our case). `PHYSICAL_ALIGN` is an alignment value to which kernel should be aligned, it ranges from `0x200000` to `0x1000000` for x86_64. With the default values we will get 2 megabytes in the `LOAD_PHYSICAL_ADDR`: - -```python ->>> 0x100000 + (0x200000 - 1) & ~(0x200000 - 1) -2097152 -``` - -After that we got alignment unit, we adds `z_extract_offset` (which is `0xe5c000` in my case) to the 2 megabytes. In the end we will get 17154048 byte offset. You can find `z_extract_offset` in the `arch/x86/boot/compressed/piggy.S`. This file generated in compile time by [mkpiggy](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/mkpiggy.c) program. - -Now let's try to understand the code if `CONFIG_RELOCATABLE` is `y`. - -First of all we put `ebp` value to the `ebx` (remember that `ebp` contains address where we loaded) and `kernel_alignment` field from kernel setup header to the `eax` register. `kernel_alignment` is a physical address of alignment required for the kernel. Next we do the same as in the previous case (when kernel is not relocatable), but we just use value of the `kernel_alignment` field as align unit and `ebx` (address where we loaded) as base address instead of `CONFIG_PHYSICAL_ALIGN` and `LOAD_PHYSICAL_ADDR`. - -After that we calculated address, we compare it with `LOAD_PHYSICAL_ADDR` and add `z_extract_offset` to it again or put `LOAD_PHYSICAL_ADDR` in the `ebx` if calculated address is less than we need. - -After all of this calculation we will have `ebp` which contains address where we loaded and `ebx` with address where to move kernel for decompression. - -Preparation before entering long mode --------------------------------------------------------------------------------- - -Now we need to do the last preparations before we can see transition to the 64-bit mode. At first we need to update Global Descriptor Table for this: - -```assembly - leal gdt(%ebp), %eax - movl %eax, gdt+2(%ebp) - lgdt gdt(%ebp) -``` - -Here we put the address from `ebp` with `gdt` offset to `eax` register, next we put this address into `ebp` with offset `gdt+2` and load Global Descriptor Table with the `lgdt` instruction. - -Let's look on Global Descriptor Table definition: - -```assembly - .data -gdt: - .word gdt_end - gdt - .long gdt - .word 0 - .quad 0x0000000000000000 /* NULL descriptor */ - .quad 0x00af9a000000ffff /* __KERNEL_CS */ - .quad 0x00cf92000000ffff /* __KERNEL_DS */ - .quad 0x0080890000000000 /* TS descriptor */ - .quad 0x0000000000000000 /* TS continued */ -``` - -It defined in the same file in the `.data` section. It contains 5 descriptors: null descriptor, for kernel code segment, kernel data segment and two task descriptors. We already loaded GDT in the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md), we're doing almost the same here, but descriptors with `CS.L = 1` and `CS.D = 0` for execution in the 64 bit mode. - -After we have loaded Global Descriptor Table, we must enable [PAE](http://en.wikipedia.org/wiki/Physical_Address_Extension) mode with putting value of `cr4` register into `eax`, setting 5 bit in it and load it again in the `cr4` : - -```assembly - movl %cr4, %eax - orl $X86_CR4_PAE, %eax - movl %eax, %cr4 -``` - -Now we finished almost with all preparations before we can move into 64-bit mode. The last step is to build page tables, but before some information about long mode. - -Long mode --------------------------------------------------------------------------------- - -Long mode is the native mode for x86_64 processors. First of all let's look on some difference between `x86_64` and `x86`. - -It provides some features as: - -* New 8 general purpose registers from `r8` to `r15` + all general purpose registers are 64-bit now -* 64-bit instruction pointer - `RIP` -* New operating mode - Long mode -* 64-Bit Addresses and Operands -* RIP Relative Addressing (we will see example if it in the next parts) - -Long mode is an extension of legacy protected mode. It consists from two sub-modes: - -* 64-bit mode -* compatibility mode - -To switch into 64-bit mode we need to do following things: - -* enable PAE (we already did it, see above) -* build page tables and load the address of top level page table into `cr3` register -* enable `EFER.LME` -* enable paging - -We already enabled `PAE` with setting the PAE bit in the `cr4` register. Now let's look on paging. - -Early page tables initialization --------------------------------------------------------------------------------- - -Before we can move in the 64-bit mode, we need to build page tables, so, let's look on building of early 4G boot page tables. - -**NOTE: I will not describe theory of virtual memory here, if you need to know more about it, see links in the end** - -Linux kernel uses 4-level paging, and generally we build 6 page tables: - -* One PML4 table -* One PDP table -* Four Page Directory tables - -Let's look on the implementation of it. First of all we clear buffer for the page tables in the memory. Every table is 4096 bytes, so we need 24 kilobytes buffer: - -```assembly - leal pgtable(%ebx), %edi - xorl %eax, %eax - movl $((4096*6)/4), %ecx - rep stosl -``` - -We put address which stored in `ebx` (remember that `ebx` contains the address where to relocate kernel for decompression) with `pgtable` offset to the `edi` register. `pgtable` defined in the end of `head_64.S` and looks: - -```assembly - .section ".pgtable","a",@nobits - .balign 4096 -pgtable: - .fill 6*4096, 1, 0 -``` - -It is in the `.pgtable` section and it size is 24 kilobytes. After we put address to the `edi`, we zero out `eax` register and writes zeros to the buffer with `rep stosl` instruction. - -Now we can build top level page table - `PML4` with: - -```assembly - leal pgtable + 0(%ebx), %edi - leal 0x1007 (%edi), %eax - movl %eax, 0(%edi) -``` - -Here we get address which stored in the `ebx` with `pgtable` offset and put it to the `edi`. Next we put this address with offset `0x1007` to the `eax` register. `0x1007` is 4096 bytes (size of the PML4) + 7 (PML4 entry flags - `PRESENT+RW+USER`) and puts `eax` to the `edi`. After this manipulations `edi` will contain the address of the first Page Directory Pointer Entry with flags - `PRESENT+RW+USER`. - -In the next step we build 4 Page Directory entry in the Page Directory Pointer table, where first entry will be with `0x7` flags and other with `0x8`: - -```assembly - leal pgtable + 0x1000(%ebx), %edi - leal 0x1007(%edi), %eax - movl $4, %ecx -1: movl %eax, 0x00(%edi) - addl $0x00001000, %eax - addl $8, %edi - decl %ecx - jnz 1b -``` - -We put base address of the page directory pointer table to the `edi` and address of the first page directory pointer entry to the `eax`. Put `4` to the `ecx` register, it will be counter in the following loop and write the address of the first page directory pointer table entry to the `edi` register. - -After this `edi` will contain address of the first page directory pointer entry with flags `0x7`. Next we just calculates address of following page directory pointer entries with flags `0x8` and writes their addresses to the `edi`. - -The next step is building of `2048` page table entries by 2 megabytes: - -```assembly - leal pgtable + 0x2000(%ebx), %edi - movl $0x00000183, %eax - movl $2048, %ecx -1: movl %eax, 0(%edi) - addl $0x00200000, %eax - addl $8, %edi - decl %ecx - jnz 1b -``` - -Here we do almost the same that in the previous example, just first entry will be with flags - `$0x00000183` - `PRESENT + WRITE + MBZ` and all another with `0x8`. In the end we will have 2048 pages by 2 megabytes. - -Our early page table structure are done, it maps 4 gigabytes of memory and now we can put address of the high-level page table - `PML4` to the `cr3` control register: - -```assembly - leal pgtable(%ebx), %eax - movl %eax, %cr3 -``` - -That's all now we can see transition to the long mode. - -Transition to the long mode --------------------------------------------------------------------------------- - -First of all we need to set `EFER.LME` flag in the [MSR](http://en.wikipedia.org/wiki/Model-specific_register) to `0xC0000080`: - -```assembly - movl $MSR_EFER, %ecx - rdmsr - btsl $_EFER_LME, %eax - wrmsr -``` - -Here we put `MSR_EFER` flag (which defined in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h#L7)) to the `ecx` register and call `rdmsr` instruction which reads [MSR](http://en.wikipedia.org/wiki/Model-specific_register) register. After `rdmsr` executed, we will have result data in the `edx:eax` which depends on `ecx` value. We check `EFER_LME` bit with `btsl` instruction and write data from `eax` to the `MSR` register with `wrmsr` instruction. - -In next step we push address of the kernel segment code to the stack (we defined it in the GDT) and put address of the `startup_64` routine to the `eax`. - -```assembly - pushl $__KERNEL_CS - leal startup_64(%ebp), %eax -``` - -After this we push this address to the stack and enable paging with setting `PG` and `PE` bits in the `cr0` register: - -```assembly - movl $(X86_CR0_PG | X86_CR0_PE), %eax - movl %eax, %cr0 -``` - -and call: - -```assembly -lret -``` - -Remember that we pushed address of the `startup_64` function to the stack in the previous step, and after `lret` instruction, CPU extracts address of it and jumps there. - -After all of these steps we're finally in the 64-bit mode: - -```assembly - .code64 - .org 0x200 -ENTRY(startup_64) -.... -.... -.... -``` - -That's all! - -Conclusion --------------------------------------------------------------------------------- - -This is the end of the fourth part linux kernel booting process. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new). - -In the next part we will see kernel decompression and many more. - -**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode) -* [IntelĀ® 64 and IA-32 Architectures Software Developer’s Manual 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) -* [GNU linker](http://www.eecs.umich.edu/courses/eecs373/readings/Linker.pdf) -* [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) -* [Paging](http://en.wikipedia.org/wiki/Paging) -* [Model specific register](http://en.wikipedia.org/wiki/Model-specific_register) -* [.fill instruction](http://www.chemie.fu-berlin.de/chemnet/use/info/gas/gas_7.html) -* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md) -* [Paging on osdev.org](http://wiki.osdev.org/Paging) -* [Paging Systems](https://www.cs.rutgers.edu/~pxk/416/notes/09a-paging.html) -* [x86 Paging Tutorial](http://www.cirosantilli.com/x86-paging/) diff --git a/Booting/linux-bootstrap-5.md b/Booting/linux-bootstrap-5.md deleted file mode 100644 index dce09cd..0000000 --- a/Booting/linux-bootstrap-5.md +++ /dev/null @@ -1,478 +0,0 @@ -Kernel booting process. Part 5. -================================================================================ - -Kernel decompression --------------------------------------------------------------------------------- - -This is the fifth part of the `Kernel booting process` series. We saw transition to the 64-bit mode in the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#transition-to-the-long-mode) and we will continue from this point in this part. We will see the last steps before we jump to the kernel code as preparation for kernel decompression, relocation and directly kernel decompression. So... let's start to dive in the kernel code again. - -Preparation before kernel decompression --------------------------------------------------------------------------------- - -We stoped right before jump on 64-bit entry point - `startup_64` which located in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) source code file. We already saw the jump to the `startup_64` in the `startup_32`: - -```assembly - pushl $__KERNEL_CS - leal startup_64(%ebp), %eax - ... - ... - ... - pushl %eax - ... - ... - ... - lret -``` - -in the previous part, `startup_64` starts to work. Since we loaded the new Global Descriptor Table and there was CPU transition in other mode (64-bit mode in our case), we can see setup of the data segments: - -```assembly - .code64 - .org 0x200 -ENTRY(startup_64) - xorl %eax, %eax - movl %eax, %ds - movl %eax, %es - movl %eax, %ss - movl %eax, %fs - movl %eax, %gs -``` - -in the beginning of the `startup_64`. All segment registers besides `cs` points now to the `ds` which is `0x18` (if you don't understand why it is `0x18`, read the previous part). - -The next step is computation of difference between where kernel was compiled and where it was loaded: - -```assembly -#ifdef CONFIG_RELOCATABLE - leaq startup_32(%rip), %rbp - movl BP_kernel_alignment(%rsi), %eax - decl %eax - addq %rax, %rbp - notq %rax - andq %rax, %rbp - cmpq $LOAD_PHYSICAL_ADDR, %rbp - jge 1f -#endif - movq $LOAD_PHYSICAL_ADDR, %rbp -1: - leaq z_extract_offset(%rbp), %rbx -``` - -`rbp` contains decompressed kernel start address and after this code executed `rbx` register will contain address where to relocate the kernel code for decompression. We already saw code like this in the `startup_32` ( you can read about it in the previous part - [Calculate relocation address](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)), but we need to do this calculation again because bootloader can use 64-bit boot protocol and `startup_32` just will not be executed in this case. - -In the next step we can see setup of the stack and reset of flags register: - -```assembly - leaq boot_stack_end(%rbx), %rsp - - pushq $0 - popfq -``` - -As you can see above `rbx` register contains the start address of the decompressing kernel code and we just put this address with `boot_stack_end` offset to the `rsp` register. After this stack will be correct. You can find definition of the `boot_stack_end` in the end of `compressed/head_64.S` file: - -```assembly - .bss - .balign 4 -boot_heap: - .fill BOOT_HEAP_SIZE, 1, 0 -boot_stack: - .fill BOOT_STACK_SIZE, 1, 0 -boot_stack_end: - -``` - -It located in the `.bss` section right before `.pgtable`. You can look at [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) to find it. - -As we set the stack, now we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Let's look on this code: - -```assembly - pushq %rsi - leaq (_bss-8)(%rip), %rsi - leaq (_bss-8)(%rbx), %rdi - movq $_bss, %rcx - shrq $3, %rcx - std - rep movsq - cld - popq %rsi -``` - -First of all we push `rsi` to the stack. We need save value of `rsi`, because this register now stores pointer to the `boot_params` real mode structure (you must remember this structure, we filled it in the start of kernel setup). In the end of this code we'll restore pointer to the `boot_params` into `rsi` again. - -The next two `leaq` instructions calculates effective address of the `rip` and `rbx` with `_bss - 8` offset and put it to the `rsi` and `rdi`. Why we calculate this addresses? Actually compressed kernel image located between this copying code (from `startup_32` to the current code) and the decompression code. You can verify this by looking on the linker script - [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S): - -``` - . = 0; - .head.text : { - _head = . ; - HEAD_TEXT - _ehead = . ; - } - .rodata..compressed : { - *(.rodata..compressed) - } - .text : { - _text = .; /* Text */ - *(.text) - *(.text.*) - _etext = . ; - } -``` - -Note that `.head.text` section contains `startup_32`. You can remember it from the previous part: - -```assembly - __HEAD - .code32 -ENTRY(startup_32) -... -... -... -``` - -`.text` section contains decompression code: - -assembly -``` - .text -relocated: -... -... -... -/* - * Do the decompression, and jump to the new kernel.. - */ -... -``` - -And `.rodata..compressed` contains compressed kernel image. - -So `rsi` will contain `rip` relative address of the `_bss - 8` and `rdi` will contain relocation relative address of the ``_bss - 8`. As we store these addresses in register, we put the address of `_bss` to the `rcx` register. As you can see in the `vmlinux.lds.S`, it located in the end of all sections with the setup/kernel code. Now we can start to copy data from `rsi` to `rdi` by 8 bytes with `movsq` instruction. - -Note that there is `std` instruction before data copying, it sets `DF` flag and it means that `rsi` and `rdi` will be decremeted or in other words, we will crbxopy bytes in backwards. - -In the end we clear `DF` flag with `cld` instruction and restore `boot_params` structure to the `rsi`. - -After it we get `.text` section address address and jump to it: - -```assembly - leaq relocated(%rbx), %rax - jmp *%rax -``` - -Last preparation before kernel decompression --------------------------------------------------------------------------------- - -`.text` sections starts with the `relocated` label. For the start there is clearing of the `bss` section with: - -```assembly - xorl %eax, %eax - leaq _bss(%rip), %rdi - leaq _ebss(%rip), %rcx - subq %rdi, %rcx - shrq $3, %rcx - rep stosq -``` - -Here we just clear `eax`, put RIP relative address of the `_bss` to the `rdi` and `_ebss` to `rcx` and fill it with zeros with `rep stosq` instructions. - -In the end we can see the call of the `decompress_kernel` routine: - -```assembly - pushq %rsi - movq $z_run_size, %r9 - pushq %r9 - movq %rsi, %rdi - leaq boot_heap(%rip), %rsi - leaq input_data(%rip), %rdx - movl $z_input_len, %ecx - movq %rbp, %r8 - movq $z_output_len, %r9 - call decompress_kernel - popq %r9 - popq %rsi -``` - -Again we save `rsi` with pointer to `boot_params` structure and call `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) with seven arguments. All arguments will be passed through the registers. We finished all preparation and now can look on the kernel decompression. - -Kernel decompression --------------------------------------------------------------------------------- - -As i wrote above, `decompress_kernel` function is in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file. This function starts with the video/console initialization that we saw in the previous parts. This calls need if bootloaded used 32 or 64-bit protocols. After this we store pointers to the start of the free memory and to the end of it: - -```C - free_mem_ptr = heap; - free_mem_end_ptr = heap + BOOT_HEAP_SIZE; -``` - -where `heap` is the second parameter of the `decompress_kernel` function which we got with: - -```assembly -leaq boot_heap(%rip), %rsi -``` - -As you saw about `boot_heap` defined as: - -```assembly -boot_heap: - .fill BOOT_HEAP_SIZE, 1, 0 -``` - -where `BOOT_HEAP_SIZE` is `0x400000` if the kernel compressed with `bzip2` or `0x8000` if not. - -In the next step we call `choose_kernel_location` function from the [arch/x86/boot/compressed/aslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/aslr.c#L298). As we can understand from the function name it chooses memory location where to decompress the kernel image. Let's look on this function. - -At the start `choose_kernel_location` tries to find `kaslr` option in the command line if `CONFIG_HIBERNATION` is set and `nokaslr` option if this configuration option `CONFIG_HIBERNATION` is not set: - -```C -#ifdef CONFIG_HIBERNATION - if (!cmdline_find_option_bool("kaslr")) { - debug_putstr("KASLR disabled by default...\n"); - goto out; - } -#else - if (cmdline_find_option_bool("nokaslr")) { - debug_putstr("KASLR disabled by cmdline...\n"); - goto out; - } -#endif -``` - -If there is no `kaslr` or `nokaslr` in the command line it jumps to `out` label: - -```C -out: - return (unsigned char *)choice; -``` - -which just returns the `output` parameter which we passed to the `choose_kernel_location` without any changes. Let's try to understand what is it `kaslr`. We can find information about it in the [documentation](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt): - -``` -kaslr/nokaslr [X86] - -Enable/disable kernel and module base offset ASLR -(Address Space Layout Randomization) if built into -the kernel. When CONFIG_HIBERNATION is selected, -kASLR is disabled by default. When kASLR is enabled, -hibernation will be disabled. -``` - -It means that we can pass `kaslr` option to the kernel's command line and get random address for the decompressed kernel (more about aslr you can read [here](https://en.wikipedia.org/wiki/Address_space_layout_randomization)). - -Let's consider the case when kernel's command line contains `kaslr` option. - -There is the call of the `mem_avoid_init` function from the same `aslr.c` source code file. This function gets the unsafe memory regions (initrd, kernel command line and etc...). We need to know about this memory regions to not overlap them with the kernel after decompression. For example: - -```C - initrd_start = (u64)real_mode->ext_ramdisk_image << 32; - initrd_start |= real_mode->hdr.ramdisk_image; - initrd_size = (u64)real_mode->ext_ramdisk_size << 32; - initrd_size |= real_mode->hdr.ramdisk_size; - mem_avoid[1].start = initrd_start; - mem_avoid[1].size = initrd_size; -``` - -Here we can see calculation of the [initrd](http://en.wikipedia.org/wiki/Initrd) start address and size. `ext_ramdisk_image` is high 32-bits of the `ramdisk_image` field from boot header and `ext_ramdisk_size` is high 32-bits of the `ramdisk_size` field from [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt): - -``` -Offset Proto Name Meaning -/Size -... -... -... -0218/4 2.00+ ramdisk_image initrd load address (set by boot loader) -021C/4 2.00+ ramdisk_size initrd size (set by boot loader) -... -``` - -And `ext_ramdisk_image` and `ext_ramdisk_size` you can find in the [Documentation/x86/zero-page.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/zero-page.txt): - -``` -Offset Proto Name Meaning -/Size -... -... -... -0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits -0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits -... -``` - -So we're taking `ext_ramdisk_image` and `ext_ramdisk_size`, shifting they left on 32 (now they will contain low 32-bits in the high 32-bit bits) and getting start address of the `initrd` and size of it. After this we store these values in the `mem_avoid` array which defined as: - -```C -#define MEM_AVOID_MAX 5 -static struct mem_vector mem_avoid[MEM_AVOID_MAX]; -``` - -where `mem_vector` structure is: - -```C -struct mem_vector { - unsigned long start; - unsigned long size; -}; -``` - -The next step after we collected all unsafe memory regions in the `mem_avoid` array will be search of the random address which does not overlap with the unsafe regions with the `find_random_addr` function. - -First of all we can see align of the output address in the `find_random_addr` function: - -```C -minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN); -``` - -you can remember `CONFIG_PHYSICAL_ALIGN` configuration option from the previous part. This option provides the value to which kernel should be aligned and it is `0x200000` by default. After that we got aligned output address, we go through the memory and collect regions which are good for decompressed kernel image: - -```C -for (i = 0; i < real_mode->e820_entries; i++) { - process_e820_entry(&real_mode->e820_map[i], minimum, size); -} -``` - -You can remember that we collected `e820_entries` in the second part of the [Kernel booting process part 2](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-2.md#memory-detection). - -First of all `process_e820_entry` function does some checks that e820 memory region is not non-RAM, that the start address of the memory region is not bigger than Maximum allowed `aslr` offset and that memory region is not less than value of kernel alignment: - -```C -struct mem_vector region, img; - -if (entry->type != E820_RAM) - return; - -if (entry->addr >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET) - return; - -if (entry->addr + entry->size < minimum) - return; -``` - -After this, we store e820 memory region start address and the size in the `mem_vector` structure (we saw definition of this structure above): - -```C -region.start = entry->addr; -region.size = entry->size; -``` - -As we store these values, we align the `region.start` as we did it in the `find_random_addr` function and check that we didn't get address that bigger than original memory region: - -```C -region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN); - -if (region.start > entry->addr + entry->size) - return; -``` - -Next we get difference between the original address and aligned and check that if the last address in the memory region is bigger than `CONFIG_RANDOMIZE_BASE_MAX_OFFSET`, we reduce the memory region size that end of kernel image will be less than maximum `aslr` offset: - -```C -region.size -= region.start - entry->addr; - -if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET) - region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start; -``` - -In the end we go through the all unsafe memory regions and check that this region does not overlap unsafe ares with kernel command line, initrd and etc...: - -```C -for (img.start = region.start, img.size = image_size ; - mem_contains(®ion, &img) ; - img.start += CONFIG_PHYSICAL_ALIGN) { - if (mem_avoid_overlap(&img)) - continue; - slots_append(img.start); - } -``` - -If memory region does not overlap unsafe regions we call `slots_append` function with the start address of the region. `slots_append` function just collects start addresses of memory regions to the `slots` array: - -```C - slots[slot_max++] = addr; -``` - -which defined as: - -```C -static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET / - CONFIG_PHYSICAL_ALIGN]; -static unsigned long slot_max; -``` - -After `process_e820_entry` will be executed, we will have array of the addresses which are safe for the decompressed kernel. Next we call `slots_fetch_random` function for getting random item from this array: - -```C -if (slot_max == 0) - return 0; - -return slots[get_random_long() % slot_max]; -``` - -where `get_random_long` function checks different CPU flags as `X86_FEATURE_RDRAND` or `X86_FEATURE_TSC` and chooses method for getting random number (it can be obtain with RDRAND instruction, Time stamp counter, programmable interval timer and etc...). After that we got random address execution of the `choose_kernel_location` is finished. - -Now let's back to the [misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c#L404). After we got address for the kernel image, there need to do some checks to be sure that gotten random address is correctly aligned and address is not wrong. - -After all these checks will see the familiar message: - -``` -Decompressing Linux... -``` - -and call `decompress` function which will decompress the kernel. `decompress` function depends on what decompression algorithm was chosen during kernel compilartion: - -```C -#ifdef CONFIG_KERNEL_GZIP -#include "../../../../lib/decompress_inflate.c" -#endif - -#ifdef CONFIG_KERNEL_BZIP2 -#include "../../../../lib/decompress_bunzip2.c" -#endif - -#ifdef CONFIG_KERNEL_LZMA -#include "../../../../lib/decompress_unlzma.c" -#endif - -#ifdef CONFIG_KERNEL_XZ -#include "../../../../lib/decompress_unxz.c" -#endif - -#ifdef CONFIG_KERNEL_LZO -#include "../../../../lib/decompress_unlzo.c" -#endif - -#ifdef CONFIG_KERNEL_LZ4 -#include "../../../../lib/decompress_unlz4.c" -#endif -``` - -After kernel will be decompressed, the last function `handle_relocations` will relocate the kernel to the address that we got from `choose_kernel_location`. After that kernel relocated we return from the `decompress_kernel` to the `head_64.S`. The address of the kernel will be in the `rax` register and we jump on it: - -```assembly -jmp *%rax -``` - -That's all. Now we are in the kernel! - -Conclusion --------------------------------------------------------------------------------- - -This is the end of the fifth and the last part about linux kernel booting process. We will not see posts about kernel booting anymore (maybe only updates in this and previous posts), but there will be many posts about other kernel internals. - -Next chapter will be about kernel initialization and we will see the first steps in the linux kernel initialization code. - -If you will have any questions or suggestions write me a comment or ping me in [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) -* [initrd](http://en.wikipedia.org/wiki/Initrd) -* [long mode](http://en.wikipedia.org/wiki/Long_mode) -* [bzip2](http://www.bzip.org/) -* [RDdRand instruction](http://en.wikipedia.org/wiki/RdRand) -* [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter) -* [Programmable Interval Timers](http://en.wikipedia.org/wiki/Intel_8253) -* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md deleted file mode 100644 index 1a41d1a..0000000 --- a/CONTRIBUTING.md +++ /dev/null @@ -1,30 +0,0 @@ -Contributing -================================================================================ - -If you want to contribute to [linux-insides](https://github.com/0xAX/linux-insides), please follow these simple rules: - -1. Press fork button: - - ![fork](http://oi58.tinypic.com/jj2trm.jpg) - -2. Clone the repo from your account with: - - ``` - git clone git@github.com:your_github_username/linux-insides.git - ``` - -3. Create branch with: - - ``` - git checkout -b "linux-bootstrap-1-fix" - ``` - -4. Make your changes. - -5. Don't forget to add yourself in `contributors.md` - -**IMPORTANT** - -Please, make the actual changes. While you made your changes, I can merge changes from somebody else and your changes can conflict with `master` branch content. Please rebase on master every time before you're going to push your changes and check that your branch doesn't conflict with `master`. - -Thank you. diff --git a/Concepts/README.md b/Concepts/README.md deleted file mode 100644 index 8ff2c3a..0000000 --- a/Concepts/README.md +++ /dev/null @@ -1,6 +0,0 @@ -# Linux kernel concepts - -This chapter describes various concepts which are used in the Linux kernel. - -* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) -* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) diff --git a/Concepts/cpumask.md b/Concepts/cpumask.md deleted file mode 100644 index 78b66e8..0000000 --- a/Concepts/cpumask.md +++ /dev/null @@ -1,225 +0,0 @@ -CPU masks -================================================================================ - -Introduction --------------------------------------------------------------------------------- - -`Cpumasks` is a special way provided by the Linux kernel to store information about CPUs in the system. The relevant source code and header files which are contains API for `Cpumasks` manipulating: - -* [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h) -* [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c) -* [kernel/cpu.c](https://github.com/torvalds/linux/blob/master/kernel/cpu.c) - -As comment says from the [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h): Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. We already saw a bit about cpumask in the `boot_cpu_init` function from the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. This function makes first boot cpu online, active and etc...: - -```C -set_cpu_online(cpu, true); -set_cpu_active(cpu, true); -set_cpu_present(cpu, true); -set_cpu_possible(cpu, true); -``` - -`set_cpu_possible` is a set of cpu ID's which can be plugged in anytime during the life of that system boot. `cpu_present` represents which CPUs are currently plugged in. `cpu_online` represents subset of the `cpu_present` and indicates CPUs which are available for scheduling. These masks depends on `CONFIG_HOTPLUG_CPU` configuration option and if this option is disabled `possible == present` and `active == online`. Implementation of the all of these functions are very similar. Every function checks the second parameter. If it is `true`, calls `cpumask_set_cpu` or `cpumask_clear_cpu` otherwise. - -There are two ways for a `cpumask` creation. First is to use `cpumask_t`. It defined as: - -```C -typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t; -``` - -It wraps `cpumask` structure which contains one bitmak `bits` field. `DECLARE_BITMAP` macro gets two parameters: - -* bitmap name; -* number of bits. - -and creates an array of `unsigned long` with the give name. It's implementation is pretty easy: - -```C -#define DECLARE_BITMAP(name,bits) \ - unsigned long name[BITS_TO_LONGS(bits)] -``` - -where `BITS_TO_LONG`: - -```C -#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long)) -#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d)) -``` - -As we learning `x86_64` architecture, `unsigned long` is 8-bytes size and our array will contain only one element: - -``` -(((8) + (8) - 1) / (8)) = 1 -``` - -`NR_CPUS` macro presents the number of the CPUs in the system and depends on the `CONFIG_NR_CPUS` macro which defined in the [include/linux/threads.h](https://github.com/torvalds/linux/blob/master/include/linux/threads.h) and looks like this: - -```C -#ifndef CONFIG_NR_CPUS - #define CONFIG_NR_CPUS 1 -#endif - -#define NR_CPUS CONFIG_NR_CPUS -``` - -The second way to define cpumask is to use `DECLARE_BITMAP` macro directly and `to_cpumask` macro which convertes given bitmap to the `struct cpumask *`: - -```C -#define to_cpumask(bitmap) \ - ((struct cpumask *)(1 ? (bitmap) \ - : (void *)sizeof(__check_is_bitmap(bitmap)))) -``` - -We can see ternary operator operator here which is `true` every time. `__check_is_bitmap` inline function defined as: - -```C -static inline int __check_is_bitmap(const unsigned long *bitmap) -{ - return 1; -} -``` - -And returns `1` every time. We need in it here only for one purpose: In compile time it checks that given `bitmap` is a bitmap, or with another words it checks that given `bitmap` has type - `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting array of `unsigned long` to the `struct cpumask *`. - -cpumask API --------------------------------------------------------------------------------- - -As we can define cpumask with one of the method, Linux kernel provides API for manipulating a cpumask. Let's consider one of the function which presented above. For example `set_cpu_online`. This function takes two parameters: - -* Number of CPU; -* CPU status; - -Implementation of this function looks as: - -```C -void set_cpu_online(unsigned int cpu, bool online) -{ - if (online) { - cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits)); - cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits)); - } else { - cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits)); - } -} -``` - -First of all it checks the second `state` parameter and calls `cpumask_set_cpu` or `cpumask_clear_cpu` depends on it. Here we can see casting to the `struct cpumask *` of the second parameter in the `cpumask_set_cpu`. In our case it is `cpu_online_bits` which is bitmap and defined as: - -```C -static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly; -``` - -`cpumask_set_cpu` function makes only one call of the `set_bit` function inside: - -```C -static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp) -{ - set_bit(cpumask_check(cpu), cpumask_bits(dstp)); -} -``` - -`set_bit` function takes two parameter too, and sets a given bit (first parameter) in the memory (second parameter or `cpu_online_bits` bitmap). We can see here that before `set_bit` will be called, its two parameter will be passed to the - -* cpumask_check; -* cpumask_bits. - -Let's consider these two macro. First if `cpumask_check` does nothing in our case and just returns given parameter. The second `cpumask_bits` just returns `bits` field from the given `struct cpumask *` structure: - -```C -#define cpumask_bits(maskp) ((maskp)->bits) -``` - -Now let's look on the `set_bit` implementation: - -```C - static __always_inline void - set_bit(long nr, volatile unsigned long *addr) - { - if (IS_IMMEDIATE(nr)) { - asm volatile(LOCK_PREFIX "orb %1,%0" - : CONST_MASK_ADDR(nr, addr) - : "iq" ((u8)CONST_MASK(nr)) - : "memory"); - } else { - asm volatile(LOCK_PREFIX "bts %1,%0" - : BITOP_ADDR(addr) : "Ir" (nr) : "memory"); - } - } -``` - -This function looks scarry, but it is not so hard as it seems. First of all it passes `nr` or number of the bit to the `IS_IMMEDIATE` macro which just makes call of the GCC internal `__builtin_constant_p` function: - -```C -#define IS_IMMEDIATE(nr) (__builtin_constant_p(nr)) -``` - -`__builtin_constant_p` checks that given parameter is known constant at compile-time. As our `cpu` is not compile-time constant, `else` clause will be executed: - -```C -asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory"); -``` - -Let's try to understand how it works step by step: - -`LOCK_PREFIX` is a x86 `lock` instruction. This instruction tells to the cpu to occupy the system bus while instruction will be executed. This allows to synchronize memory access, preventing simultaneous access of multiple processors (or devices - DMA controller for example) to one memory cell. - -`BITOP_ADDR` casts given parameter to the `(*(volatile long *)` and adds `+m` constraints. `+` means that this operand is bot read and written by the instruction. `m` shows that this is memory operand. `BITOP_ADDR` is defined as: - -```C -#define BITOP_ADDR(x) "+m" (*(volatile long *) (x)) -``` - -Next is the `memory` clobber. It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters). - -`Ir` - immideate register operand. - - -`bts` instruction sets given bit in a bit string and stores the value of a given bit in the `CF` flag. So we passed cpu number which is zero in our case and after `set_bit` will be executed, it sets zero bit in the `cpu_online_bits` cpumask. It would mean that the first cpu is online at this moment. - -Besides the `set_cpu_*` API, cpumask ofcourse provides another API for cpumasks manipulation. Let's consider it in shoft. - -Additional cpumask API --------------------------------------------------------------------------------- - -cpumask provides the set of macro for getting amount of the CPUs with different state. For example: - -```C -#define num_online_cpus() cpumask_weight(cpu_online_mask) -``` - -This macro returns amount of the `online` CPUs. It calls `cpumask_weight` function with the `cpu_online_mask` bitmap (read about about it). `cpumask_wieght` function makes an one call of the `bitmap_wiegt` function with two parameters: - -* cpumask bitmap; -* `nr_cpumask_bits` - which is `NR_CPUS` in our case. - -```C -static inline unsigned int cpumask_weight(const struct cpumask *srcp) -{ - return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits); -} -``` - -and calculates amount of the bits in the given bitmap. Besides the `num_online_cpus`, cpumask provides macros for the all CPU states: - -* num_possible_cpus; -* num_active_cpus; -* cpu_online; -* cpu_possible. - -and many more. - -Besides that Linux kernel provides following API for the manipulating of `cpumask`: - -* `for_each_cpu` - iterates over every cpu in a mask; -* `for_each_cpu_not` - iterates over every cpu in a complemented mask; -* `cpumask_clear_cpu` - clears a cpu in a cpumask; -* `cpumask_test_cpu` - tests a cpu in a mask; -* `cpumask_setall` - set all cpus in a mask; -* `cpumask_size` - returns size to allocate for a 'struct cpumask' in bytes; - -and many many more... - -Links --------------------------------------------------------------------------------- - -* [cpumask documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) diff --git a/Concepts/per-cpu.md b/Concepts/per-cpu.md deleted file mode 100644 index aa9f408..0000000 --- a/Concepts/per-cpu.md +++ /dev/null @@ -1,230 +0,0 @@ -Per-CPU variables -================================================================================ - -Per-CPU variables are one of the kernel features. You can understand what this feature means by reading its name. We can create a variable and each processor core will have its own copy of this variable. We take a closer look on this feature and try to understand how it is implemented and how it works in this part. - -The kernel provides API for creating per-cpu variables - `DEFINE_PER_CPU` macro: - -```C -#define DEFINE_PER_CPU(type, name) \ - DEFINE_PER_CPU_SECTION(type, name, "") -``` - -This macro defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) as many other macros for work with per-cpu variables. Now we will see how this feature is implemented. - -Take a look at the `DECLARE_PER_CPU` definition. We see that it takes 2 parameters: `type` and `name`, so we can use it to create per-cpu variable, for example like this: - -```C -DEFINE_PER_CPU(int, per_cpu_n) -``` - -We pass the type and the name of our variable. `DEFI_PER_CPU` calls `DEFINE_PER_CPU_SECTION` macro and passes the same two paramaters and empty string to it. Let's look at the definition of the `DEFINE_PER_CPU_SECTION`: - -```C -#define DEFINE_PER_CPU_SECTION(type, name, sec) \ - __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \ - __typeof__(type) name -``` - -```C -#define __PCPU_ATTRS(sec) \ - __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \ - PER_CPU_ATTRIBUTES -``` - -where section is: - -```C -#define PER_CPU_BASE_SECTION ".data..percpu" -``` - -After all macros are expanded we will get global per-cpu variable: - -```C -__attribute__((section(".data..percpu"))) int per_cpu_n -``` - -It means that we will have `per_cpu_n` variable in the `.data..percpu` section. We can find this section in the `vmlinux`: - -``` -.data..percpu 00013a58 0000000000000000 0000000001a5c000 00e00000 2**12 - CONTENTS, ALLOC, LOAD, DATA -``` - -Ok, now we know that when we use `DEFINE_PER_CPU` macro, per-cpu variable in the `.data..percpu` section will be created. When the kernel initilizes it calls `setup_per_cpu_areas` function which loads `.data..percpu` section multiply times, one section per CPU. - -Let's look on the per-CPU areas initialization process. It start in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) from the call of the `setup_per_cpu_areas` function which defined in the [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c). - -```C -pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n", - NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids); -``` - -The `setup_per_cpu_areas` starts from the output information about the Maximum number of CPUs set during kernel configuration with `CONFIG_NR_CPUS` configuration option, actual number of CPUs, `nr_cpumask_bits` is the same that `NR_CPUS` bit for the new `cpumask` operators and number of `NUMA` nodes. - -We can see this output in the dmesg: - -``` -$ dmesg | grep percpu -[ 0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1 -``` - -In the next step we check `percpu` first chunk allocator. All percpu areas are allocated in chunks. First chunk is used for the static percpu variables. Linux kernel has `percpu_alloc` command line parameters which provides type of the first chunk allocator. We can read about it in the kernel documentation: - -``` -percpu_alloc= Select which percpu first chunk allocator to use. - Currently supported values are "embed" and "page". - Archs may support subset or none of the selections. - See comments in mm/percpu.c for details on each - allocator. This parameter is primarily for debugging - and performance comparison. -``` - -The [mm/percpu.c](https://github.com/torvalds/linux/blob/master/mm/percpu.c) contains handler of this command line option: - -```C -early_param("percpu_alloc", percpu_alloc_setup); -``` - -Where `percpu_alloc_setup` function sets the `pcpu_chosen_fc` variable depends on the `percpu_alloc` parameter value. By default first chunk allocator is `auto`: - -```C -enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO; -``` - -If `percpu_alooc` parameter not given to the kernel command line, the `embed` allocator will be used wich as you can understand embed the first percpu chunk into bootmem with the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). The last allocator is first chunk `page` allocator which maps first chunk with `PAGE_SIZE` pages. - -As I wrote about first of all we make a check of the first chunk allocator type in the `setup_per_cpu_areas`. First of all we check that first chunk allocator is not page: - -```C -if (pcpu_chosen_fc != PCPU_FC_PAGE) { - ... - ... - ... -} -``` - -If it is not `PCPU_FC_PAGE`, we will use `embed` allocator and allocate space for the first chunk with the `pcpu_embed_first_chunk` function: - -```C -rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE, - dyn_size, atom_size, - pcpu_cpu_distance, - pcpu_fc_alloc, pcpu_fc_free); -``` - -As I wrote above, the `pcpu_embed_first_chunk` function embeds the first percpu chunk into bootmem. As you can see we pass a couple of parameters to the `pcup_embed_first_chunk`, they are - -* `PERCPU_FIRST_CHUNK_RESERVE` - the size of the reserved space for the static `percpu` variables; -* `dyn_size` - minimum free size for dynamic allocation in byte; -* `atom_size` - all allocations are whole multiples of this and aligned to this parameter; -* `pcpu_cpu_distance` - callback to determine distance between cpus; -* `pcpu_fc_alloc` - function to allocate `percpu` page; -* `pcpu_fc_free` - function to release `percpu` page. - -All of this parameters we calculat before the call of the `pcpu_embed_first_chunk`: - -```C -const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE; -size_t atom_size; -#ifdef CONFIG_X86_64 - atom_size = PMD_SIZE; -#else - atom_size = PAGE_SIZE; -#endif -``` - -If first chunk allocator is `PCPU_FC_PAGE`, we will use the `pcpu_page_first_chunk` instead of the `pcpu_embed_first_chunk`. After that `percpu` areas up, we setup `percpu` offset and its segment for the every CPU with the `setup_percpu_segment` function (only for `x86` systems) and move some early data from the arrays to the `percpu` variables (`x86_cpu_to_apicid`, `irq_stack_ptr` and etc...). After the kernel finished the initialization process, we have loaded N `.data..percpu` sections, where N is the number of CPU, and section used by bootstrap processor will contain uninitialized variable created with `DEFINE_PER_CPU` macro. - -The kernel provides API for per-cpu variables manipulating: - -* get_cpu_var(var) -* put_cpu_var(var) - - -Let's look at `get_cpu_var` implementation: - -```C -#define get_cpu_var(var) \ -(*({ \ - preempt_disable(); \ - this_cpu_ptr(&var); \ -})) -``` - -Linux kernel is preemptible and accessing a per-cpu variable requires to know which processor kernel running on. So, current code must not be preempted and moved to the another CPU while accessing a per-cpu variable. That's why first of all we can see call of the `preempt_disable` function. After this we can see call of the `this_cpu_ptr` macro, which looks as: - -```C -#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr) -``` - -and - -```C -#define raw_cpu_ptr(ptr) per_cpu_ptr(ptr, 0) -``` - -where `per_cpu_ptr` returns a pointer to the per-cpu variable for the given cpu (second parameter). After that we got per-cpu variables and made any manipulations on it, we must call `put_cpu_var` macro which enables preemption with call of `preempt_enable` function. So the typical usage of a per-cpu variable is following: - -```C -get_cpu_var(var); -... -//Do something with the 'var' -... -put_cpu_var(var); -``` - -Let's look at `per_cpu_ptr` macro: - -```C -#define per_cpu_ptr(ptr, cpu) \ -({ \ - __verify_pcpu_ptr(ptr); \ - SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \ -}) -``` - -As I wrote above, this macro returns per-cpu variable for the given cpu. First of all it calls `__verify_pcpu_ptr`: - -```C -#define __verify_pcpu_ptr(ptr) -do { - const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; - (void)__vpp_verify; -} while (0) -``` - -which makes given `ptr` type of `const void __percpu *`, - -After this we can see the call of the `SHIFT_PERCPU_PTR` macro with two parameters. At first parameter we pass our ptr and sencond we pass cpu number to the `per_cpu_offset` macro which: - -```C -#define per_cpu_offset(x) (__per_cpu_offset[x]) -``` - -expands to getting `x` element from the `__per_cpu_offset` array: - - -```C -extern unsigned long __per_cpu_offset[NR_CPUS]; -``` - -where `NR_CPUS` is the number of CPUs. `__per_cpu_offset` array filled with the distances between cpu-variables copies. For example all per-cpu data is `X` bytes size, so if we access `__per_cpu_offset[Y]`, so `X*Y` will be accessed. Let's look at the `SHIFT_PERCPU_PTR` implementation: - -```C -#define SHIFT_PERCPU_PTR(__p, __offset) \ - RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset)) -``` - -`RELOC_HIDE` just returns offset `(typeof(ptr)) (__ptr + (off))` and it will be pointer of the variable. - -That's all! Of course it is not the full API, but the general part. It can be hard for the start, but to understand per-cpu variables feature need to understand mainly [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) magic. - -Let's again look at the algorithm of getting pointer on per-cpu variable: - -* The kernel creates multiply `.data..percpu` sections (ones perc-pu) during initialization process; -* All variables created with the `DEFINE_PER_CPU` macro will be reloacated to the first section or for CPU0; -* `__per_cpu_offset` array filled with the distance (`BOOT_PERCPU_OFFSET`) between `.data..percpu` sections; -* When `per_cpu_ptr` called for example for getting pointer on the certain per-cpu variable for the third CPU, `__per_cpu_offset` array will be accessed, where every index points to the certain CPU. - -That's all. diff --git a/DataStructures/README.md b/DataStructures/README.md deleted file mode 100644 index 0c0fd4b..0000000 --- a/DataStructures/README.md +++ /dev/null @@ -1,9 +0,0 @@ -Data Structures in the Linux Kernel -======================================================================== - -Linux kernel provides implementations of a different data structures like linked list, B+ tree, prinority heap and many many more. - -This part considers these data structures and algorithms. - - * [Doubly linked list](https://github.com/0xAX/linux-insides/blob/master/DataStructures/dlist.md) - * [Radix tree](https://github.com/0xAX/linux-insides/blob/master/DataStructures/radix-tree.md) diff --git a/DataStructures/dlist.md b/DataStructures/dlist.md deleted file mode 100644 index f85561a..0000000 --- a/DataStructures/dlist.md +++ /dev/null @@ -1,247 +0,0 @@ -Data Structures in the Linux Kernel -================================================================================ - -Doubly linked list --------------------------------------------------------------------------------- - -Linux kernel provides its own doubly linked list implementation which you can find in the [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h). We will start `Data Structures in the Linux kernel` from the doubly linked list data structure. Why? Because it is very popular in the kernel, just try to [search](http://lxr.free-electrons.com/ident?i=list_head) - -First of all let's look on the main structure: - -```C -struct list_head { - struct list_head *next, *prev; -}; -``` - -You can note that it is different from many lists implementations which you could see. For example this doubly linked list structure from the [glib](http://www.gnu.org/software/libc/): - -```C -struct GList { - gpointer data; - GList *next; - GList *prev; -}; -``` - -Usually a linked list structure contains a pointer to the item. Linux kernel implementation of the list does not. So the main question is - `where does the list store the data?`. The actual implementation of lists in the kernel is - `Intrusive list`. An intrusive linked list does not contain data in its nodes - A node just contains pointers to the next and previous node and list nodes part of the data that are added to the list. This makes the data structure generic, so it does not care about entry data type anymore. - -For example: - -```C -struct nmi_desc { - spinlock_t lock; - struct list_head head; -}; -``` - -Let's look at some examples, how `list_head` is used in the kernel. As I already wrote about, there are many, really many different places where lists are used in the kernel. Let's look for example in miscellaneous character drivers. Misc character drivers API from the [drivers/char/misc.c](https://github.com/torvalds/linux/blob/master/drivers/char/misc.c) for writing small drivers for handling simple hardware or virtual devices. This drivers share major number: - -```C -#define MISC_MAJOR 10 -``` - -but has own minor number. For example you can see it with: - -``` -ls -l /dev | grep 10 -crw------- 1 root root 10, 235 Mar 21 12:01 autofs -drwxr-xr-x 10 root root 200 Mar 21 12:01 cpu -crw------- 1 root root 10, 62 Mar 21 12:01 cpu_dma_latency -crw------- 1 root root 10, 203 Mar 21 12:01 cuse -drwxr-xr-x 2 root root 100 Mar 21 12:01 dri -crw-rw-rw- 1 root root 10, 229 Mar 21 12:01 fuse -crw------- 1 root root 10, 228 Mar 21 12:01 hpet -crw------- 1 root root 10, 183 Mar 21 12:01 hwrng -crw-rw----+ 1 root kvm 10, 232 Mar 21 12:01 kvm -crw-rw---- 1 root disk 10, 237 Mar 21 12:01 loop-control -crw------- 1 root root 10, 227 Mar 21 12:01 mcelog -crw------- 1 root root 10, 59 Mar 21 12:01 memory_bandwidth -crw------- 1 root root 10, 61 Mar 21 12:01 network_latency -crw------- 1 root root 10, 60 Mar 21 12:01 network_throughput -crw-r----- 1 root kmem 10, 144 Mar 21 12:01 nvram -brw-rw---- 1 root disk 1, 10 Mar 21 12:01 ram10 -crw--w---- 1 root tty 4, 10 Mar 21 12:01 tty10 -crw-rw---- 1 root dialout 4, 74 Mar 21 12:01 ttyS10 -crw------- 1 root root 10, 63 Mar 21 12:01 vga_arbiter -crw------- 1 root root 10, 137 Mar 21 12:01 vhci -``` - -Now let's look how lists are used in the misc device drivers. First of all let's look on `miscdevice` structure: - -```C -struct miscdevice -{ - int minor; - const char *name; - const struct file_operations *fops; - struct list_head list; - struct device *parent; - struct device *this_device; - const char *nodename; - mode_t mode; -}; -``` - -We can see the fourth field in the `miscdevice` structure - `list` which is list of registered devices. In the beginning of the source code file we can see definition of the: - -```C -static LIST_HEAD(misc_list); -``` - -which expands to definition of the variables with `list_head` type: - -```C -#define LIST_HEAD(name) \ - struct list_head name = LIST_HEAD_INIT(name) -``` - -and initializes it with the `LIST_HEAD_INIT` macro which set previous and next entries: - -```C -#define LIST_HEAD_INIT(name) { &(name), &(name) } -``` - -Now let's look on the `misc_register` function which registers a miscellaneous device. At the start it initializes `miscdevice->list` with the `INIT_LIST_HEAD` function: - -```C -INIT_LIST_HEAD(&misc->list); -``` - -which does the same that `LIST_HEAD_INIT` macro: - -```C -static inline void INIT_LIST_HEAD(struct list_head *list) -{ - list->next = list; - list->prev = list; -} -``` - -In the next step after device created with the `device_create` function we add it to the miscellaneous devices list with: - -``` -list_add(&misc->list, &misc_list); -``` - -Kernel `list.h` provides this API for the addition of new entry to the list. Let's look on it's implementation: - -```C -static inline void list_add(struct list_head *new, struct list_head *head) -{ - __list_add(new, head, head->next); -} -``` - -It just calls internal function `__list_add` with the 3 given parameters: - -* new - new entry; -* head - list head after which will be inserted new item; -* head->next - next item after list head. - -Implementation of the `__list_add` is pretty simple: - -```C -static inline void __list_add(struct list_head *new, - struct list_head *prev, - struct list_head *next) -{ - next->prev = new; - new->next = next; - new->prev = prev; - prev->next = new; -} -``` - -Here we set new item between `prev` and `next`. So `misc` list which we defined at the start with the `LIST_HEAD_INIT` macro will contain previous and next pointers to the `miscdevice->list`. - -There is still only one question how to get list's entry. There is special special macro for this point: - -```C -#define list_entry(ptr, type, member) \ - container_of(ptr, type, member) -``` - -which gets three parameters: - -* ptr - the structure list_head pointer; -* type - structure type; -* member - the name of the list_head within the struct; - -For example: - -```C -const struct miscdevice *p = list_entry(v, struct miscdevice, list) -``` - -After this we can access to the any `miscdevice` field with `p->minor` or `p->name` and etc... Let's look on the `list_entry` implementation: - -```C -#define list_entry(ptr, type, member) \ - container_of(ptr, type, member) -``` - -As we can see it just calls `container_of` macro with the same arguments. For the first look `container_of` looks strange: - -```C -#define container_of(ptr, type, member) ({ \ - const typeof( ((type *)0)->member ) *__mptr = (ptr); \ - (type *)( (char *)__mptr - offsetof(type,member) );}) -``` - -First of all you can note that it consists from two expressions in curly brackets. Compiler will evaluate the whole block in the curly braces and use the value of the last expression. - -For example: - -``` -#include - -int main() { - int i = 0; - printf("i = %d\n", ({++i; ++i;})); - return 0; -} -``` - -will print `2`. - -The next point is `typeof`, it's simple. As you can understand from its name, it just returns the type of the given variable. When I first saw the implementation of the `container_of` macro, the strangest thing for me was the zero in the `((type *)0)` expression. Actually this pointer magic calculates the offset of the given field from the address of the structure, but as we have `0` here, it will be just a zero offset alongwith the field width. Let's look at a simple example: - -```C -#include - -struct s { - int field1; - char field2; - char field3; -}; - -int main() { - printf("%p\n", &((struct s*)0)->field3); - return 0; -} -``` - -will print `0x5`. - -The next offsetof macro calculates offset from the beginning of the structure to the given structure's field. Its implementation is very similar to the previous code: - -```C -#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER) -``` - -Let's summarize all about `container_of` macro. `container_of` macro returns address of the structure by the given address of the structure's field with `list_head` type, the name of the structure field with `list_head` type and type of the container structure. At the first line this macro declares the `__mptr` pointer which points to the field of the structure that `ptr` points to and assigns it to the `ptr`. Now `ptr` and `__mptr` point to the same address. Technically we don't need this line but its useful for type checking. First line ensures that that given structure (`type` parameter) has a member called `member`. In the second line it calculates offset of the field from the structure with the `offsetof` macro and subtracts it from the structure address. That's all. - -Of course `list_add` and `list_entry` is not only functions which provides ``. Implementation of the doubly linked list provides the following API: - -* list_add -* list_add_tail -* list_del -* list_replace -* list_move -* list_is_last -* list_empty -* list_cut_position -* list_splice - -and many more. diff --git a/DataStructures/radix-tree.md b/DataStructures/radix-tree.md deleted file mode 100644 index 90c0a69..0000000 --- a/DataStructures/radix-tree.md +++ /dev/null @@ -1,184 +0,0 @@ -Data Structures in the Linux Kernel -================================================================================ - -Radix tree --------------------------------------------------------------------------------- - -As you already know linux kernel provides many different libraries and functions which implement different data structures and algorithm. In this part we will consider one of these data structures - [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). There are two files which are related to `radix tree` implementation and API in the linux kernel: - -* [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/master/include/linux/radix-tree.h) -* [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c) - -Lets talk about what is `radix tree`. Radix tree is a `compressed trie` where [trie](http://en.wikipedia.org/wiki/Trie) is a data structure which implements interface of an associative array and allows to store values as `key-value`. The keys are usually strings, but any other data type can be used as well. Trie is different from any `n-tree` in its nodes. Nodes of a trie do not store keys, instead, a node of a trie stores single character labels. The key which is related to a given node is derived by traversing from the root of the tree to this node. For example: - - -``` -               +-----------+ -               |           | -               |    " "    | - | | -        +------+-----------+------+ -        |                         | -        |                         | -   +----v------+            +-----v-----+ -   |           |            |           | -   |    g      |            |     c     | - | | | | -   +-----------+            +-----------+ -        |                         | -        |                         | -   +----v------+            +-----v-----+ -   |           |            |           | -   |    o      |            |     a     | - | | | | -   +-----------+            +-----------+ -                                  | -                                  | -                            +-----v-----+ -                            |           | -                            |     t     | - | | -                            +-----------+ -``` - -So in this example, we can see the `trie` with keys, `go` and `cat`. The compressed trie or `radix tree` differs from `trie`, such that all intermediates nodes which have only one child are removed. - -Radix tree in linux kernel is the datastructure which maps values to the integer key. It is represented by the following structures from the file [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/master/include/linux/radix-tree.h): - -```C -struct radix_tree_root { - unsigned int height; - gfp_t gfp_mask; - struct radix_tree_node __rcu *rnode; -}; -``` - -This structure presents the root of a radix tree and contains three fields: - -* `height` - height of the tree; -* `gfp_mask` - tells how memory allocations are to be performed; -* `rnode` - pointer to the child node. - -The first structure we will discuss is `gfp_mask`: - -Low-level kernel memory allocation functions take a set of flags as - `gfp_mask`, which describes how that allocation is to be performed. These `GFP_` flags which control the allocation process can have following values, (`GF_NOIO` flag) be sleep and wait for memory, (`__GFP_HIGHMEM` flag) is high memory can be used, (`GFP_ATOMIC` flag) is allocation process high-priority and can't sleep etc. - -The next structure is `rnode`: - -```C -struct radix_tree_node { - unsigned int path; - unsigned int count; - union { - struct { - struct radix_tree_node *parent; - void *private_data; - }; - struct rcu_head rcu_head; - }; - /* For tree user */ - struct list_head private_list; - void __rcu *slots[RADIX_TREE_MAP_SIZE]; - unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS]; -}; -``` - -This structure contains information about the offset in a parent and height from the bottom, count of the child nodes and fields for accessing and freeing a node. The fields are described below: - -* `path` - offset in parent & height from the bottom; -* `count` - count of the child nodes; -* `parent` - pointer to the parent node; -* `private_data` - used by the user of a tree; -* `rcu_head` - used for freeing a node; -* `private_list` - used by the user of a tree; - -The two last fields of the `radix_tree_node` - `tags` and `slots` are important and interesting. Every node can contains a set of slots which are store pointers to the data. Empty slots in the linux kernel radix tree implementation store `NULL`. Radix tree in the linux kernel also supports tags which are associated with the `tags` fields in the `radix_tree_node` structure. Tags allow to set individual bits on records which are stored in the radix tree. - -Now we know about radix tree structure, time to look on its API. - -Linux kernel radix tree API ---------------------------------------------------------------------------------- - -We start from the datastructure intialization. There are two ways to initialize new radix tree. The first is to use `RADIX_TREE` macro: - -```C -RADIX_TREE(name, gfp_mask); -```` - -As you can see we pass the `name` parameter, so with the `RADIX_TREE` macro we can define and initialize radix tree with the given name. Implementation of the `RADIX_TREE` is easy: - -```C -#define RADIX_TREE(name, mask) \ - struct radix_tree_root name = RADIX_TREE_INIT(mask) - -#define RADIX_TREE_INIT(mask) { \ - .height = 0, \ - .gfp_mask = (mask), \ - .rnode = NULL, \ -} -``` - -At the beginning of the `RADIX_TREE` macro we define instance of the `radix_tree_root` structure with the given name and call `RADIX_TREE_INIT` macro with the given mask. The `RADIX_TREE_INIT` macro just initializes `radix_tree_root` structure with the default values and the given mask. - -The second way is to define `radix_tree_root` structure by hand and pass it with mask to the `INIT_RADIX_TREE` macro: - -```C -struct radix_tree_root my_radix_tree; -INIT_RADIX_TREE(my_tree, gfp_mask_for_my_radix_tree); -``` - -where: - -```C -#define INIT_RADIX_TREE(root, mask) \ -do { \ - (root)->height = 0; \ - (root)->gfp_mask = (mask); \ - (root)->rnode = NULL; \ -} while (0) -``` - -makes the same initialziation with default values as it does `RADIX_TREE_INIT` macro. - -The next are two functions for the inserting and deleting records to/from a radix tree: - -* `radix_tree_insert`; -* `radix_tree_delete`. - -The first `radix_tree_insert` function takes three parameters: - -* root of a radix tree; -* index key; -* data to insert; - -The `radix_tree_delete` function takes the same set of parameters as the `radix_tree_insert`, but without data. - -The search in a radix tree implemented in two ways: - -* `radix_tree_lookup`; -* `radix_tree_gang_lookup`; -* `radix_tree_lookup_slot`. - -The first `radix_tree_lookup` function takes two parameters: - -* root of a radix tree; -* index key; - -This function tries to find the given key in the tree and returns associated record with this key. The second `radix_tree_gang_lookup` function have the following signature - -```C -unsigned int radix_tree_gang_lookup(struct radix_tree_root *root, - void **results, - unsigned long first_index, - unsigned int max_items); -``` - -and returns number of records, sorted by the keys, starting from the first index. Number of the returned records will be not greater than `max_items` value. - -And the last `radix_tree_lookup_slot` function will return the slot which will contain the data. - -Links ---------------------------------------------------------------------------------- - -* [Radix tree](http://en.wikipedia.org/wiki/Radix_tree) -* [Trie](http://en.wikipedia.org/wiki/Trie) diff --git a/Initialization/README.md b/Initialization/README.md deleted file mode 100644 index dcb14e4..0000000 --- a/Initialization/README.md +++ /dev/null @@ -1,16 +0,0 @@ -# Kernel initialization process - -You will find here a couple of posts which describe the full cycle of kernel initialization from its first steps after the kernel has decompressed to the start of the first process run by the kernel itself. - -*Note* That there will not be description of the all kernel initialization steps. Here will be only generic kernel part, without interrupts handling, ACPI, and many other parts. All parts which I'll miss, will be described in other chapters. - -* [First steps after kernel decompression](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-1.md) - describes first steps in the kernel. -* [Early interrupt and exception handling](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) - describes early interrupts initialization and early page fault handler. -* [Last preparations before the kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md) - describes the last preparations before the call of the `start_kernel`. -* [Kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-4.md) - describes first steps in the kernel generic code. -* [Continue of architecture-specific initializations](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-5.md) - describes architecture-specific initialization. -* [Architecture-specific initializations, again...](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-6.md) - describes continue of the architecture-specific initialization process. -* [The End of the architecture-specific initializations, almost...](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) - describes the end of the `setup_arch` related stuff. -* [Scheduler initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-8.md) - describes preparation before scheduler initialization and initialization of it. -* [RCU initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-9.md) - describes the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). -* [End of the initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-10.md) - the last part about linux kernel initialization. diff --git a/Initialization/linux-initialization-1.md b/Initialization/linux-initialization-1.md deleted file mode 100644 index c93b246..0000000 --- a/Initialization/linux-initialization-1.md +++ /dev/null @@ -1,522 +0,0 @@ -Kernel initialization. Part 1. -================================================================================ - -First steps in the kernel code --------------------------------------------------------------------------------- - -In the previous post (`Kernel booting process. Part 5.`) - [Kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) we stopped at the [jump](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) on the decompressed kernel: - -```assembly -jmp *%rax -``` - -and now we are in the kernel. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) will be called. - -So let's start. - -First steps in the kernel --------------------------------------------------------------------------------- - -Okay, we got address of the kernel from the `decompress_kernel` function into `rax` register and just jumped there. Decompressed kernel code starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S): - -```assembly - __HEAD - .code64 - .globl startup_64 -startup_64: - ... - ... - ... -``` - -We can see definition of the `startup_64` routine and it defined in the `__HEAD` section, which is just: - -```C -#define __HEAD .section ".head.text","ax" -``` - -We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S#L93) linker script: - -``` -.text : AT(ADDR(.text) - LOAD_OFFSET) { - _text = .; - ... - ... - ... -} :text = 0x9090 -``` - -We can understand default virtual and physical addresses from the linker script. Note that address of the `_text` is location counter which is defined as: - -``` -. = __START_KERNEL; -``` - -for `x86_64`. We can find definition of the `__START_KERNEL` macro in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h): - -```C -#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) - -#define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN) -``` - -Here we can see that `__START_KERNEL` is the sum of the `__START_KERNEL_map` (which is `0xffffffff80000000`, see post about [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)) and `__PHYSICAL_START`. Where `__PHYSICAL_START` is aligned value of the `CONFIG_PHYSICAL_START`. So if you will not use [kASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization) and will not change `CONFIG_PHYSICAL_START` in the configuration addresses will be following: - -* Physical address - `0x1000000`; -* Virtual address - `0xffffffff81000000`. - -Now we know default physical and virtual addresses of the `startup_64` routine, but to know actual addresses we must to calculate it with the following code: - -```assembly - leaq _text(%rip), %rbp - subq $_text - __START_KERNEL_map, %rbp -``` - -Here we just put the `rip-relative` address to the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that compiled address of the `_text` is `0xffffffff81000000` and `__START_KERNEL_map` contains `0xffffffff81000000`, so `rbp` will contain physical address of the `text` - `0x1000000` after this calculation. We need to calculate it because kernel can't be run on the default address, but now we know the actual physical address. - -In the next step we checks that this address is aligned with: - -```assembly - movq %rbp, %rax - andl $~PMD_PAGE_MASK, %eax - testl %eax, %eax - jnz bad_address -``` - -Here we just put address to the `%rax` and test first bit. `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) about it) and defined as: - -```C -#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1)) - -#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT) -#define PMD_SHIFT 21 -``` - -As we can easily calculate, `PMD_PAGE_SIZE` is 2 megabytes. Here we use standard formula for checking alignment and if `text` address is not aligned for 2 megabytes, we jump to `bad_address` label. - -After this we check address that it is not too large: - -```assembly - leaq _text(%rip), %rax - shrq $MAX_PHYSMEM_BITS, %rax - jnz bad_address -``` - -Address most not be greater than 46-bits: - -```C -#define MAX_PHYSMEM_BITS 46 -``` - -Okay, we did some early checks and now we can move on. - -Fix base addresses of page tables --------------------------------------------------------------------------------- - -The first step before we started to setup identity paging, need to correct following addresses: - -```assembly - addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) - addq %rbp, level3_kernel_pgt + (510*8)(%rip) - addq %rbp, level3_kernel_pgt + (511*8)(%rip) - addq %rbp, level2_fixmap_pgt + (506*8)(%rip) -``` - -Here we need to correct `early_level4_pgt` and other addresses of the page table directories, because as I wrote above, kernel can't be run at the default `0x1000000` address. `rbp` register contains actual address so we add to the `early_level4_pgt`, `level3_kernel_pgt` and `level2_fixmap_pgt`. Let's try to understand what these labels means. First of all let's look on their definition: - -```assembly -NEXT_PAGE(early_level4_pgt) - .fill 511,8,0 - .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE - -NEXT_PAGE(level3_kernel_pgt) - .fill L3_START_KERNEL,8,0 - .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE - .quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE - -NEXT_PAGE(level2_kernel_pgt) - PMDS(0, __PAGE_KERNEL_LARGE_EXEC, - KERNEL_IMAGE_SIZE/PMD_SIZE) - -NEXT_PAGE(level2_fixmap_pgt) - .fill 506,8,0 - .quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE - .fill 5,8,0 - -NEXT_PAGE(level1_fixmap_pgt) - .fill 512,8,0 -``` - -Looks hard, but it is not true. - -First of all let's look on the `early_level4_pgt`. It starts with the (4096 - 8) bytes of zeros, it means that we don't use first 511 `early_level4_pgt` entries. And after this we can see `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map + _PAGE_TABLE` from it. As we know `__START_KERNEL_map` is a base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we will get physical address of the `level3_kernel_pgt`. Now let's look on `_PAGE_TABLE`, it is just page entry access rights: - -```C -#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ - _PAGE_ACCESSED | _PAGE_DIRTY) -``` - -more about it, you can read in the [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) post. - -`level3_kernel_pgt` - stores entries which map kernel space. At the start of it's definition, we can see that it filled with zeros `L3_START_KERNEL` times. Here `L3_START_KERNEL` is the index in the page upper directory which contains `__START_KERNEL_map` address and it equals `510`. After it we can see definition of two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has: - -```C -#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \ - _PAGE_DIRTY) -``` - -access rights. The second - `level2_fixmap_pgt` is a virtual addresses which can refer to any physical addresses even under kernel space. - -The next `level2_kernel_pgt` calls `PDMS` macro which creates 512 megabytes from the `__START_KERNEL_map` for kernel text (after these 512 megabytes will be modules memory space). - -Now we know Let's back to our code which is in the beginning of the section. Remember that `rbp` contains actual physical address of the `_text` section. We just add this address to the base address of the page tables, that they'll have correct addresses: - -```assembly - addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) - addq %rbp, level3_kernel_pgt + (510*8)(%rip) - addq %rbp, level3_kernel_pgt + (511*8)(%rip) - addq %rbp, level2_fixmap_pgt + (506*8)(%rip) -``` - -At the first line we add `rbp` to the `early_level4_pgt`, at the second line we add `rbp` to the `level2_kernel_pgt`, at the third line we add `rbp` to the `level2_fixmap_pgt` and add `rbp` to the `level1_fixmap_pgt`. - -After all of this we will have: - -``` -early_level4_pgt[511] -> level3_kernel_pgt[0] -level3_kernel_pgt[510] -> level2_kernel_pgt[0] -level3_kernel_pgt[511] -> level2_fixmap_pgt[0] -level2_kernel_pgt[0] -> 512 MB kernel mapping -level2_fixmap_pgt[506] -> level1_fixmap_pgt -``` - -As we corrected base addresses of the page tables, we can start to build it. - -Identity mapping setup --------------------------------------------------------------------------------- - -Now we can see set up the identity mapping early page tables. Identity Mapped Paging is a virtual addresses which are mapped to physical addresses that have the same value, `1 : 1`. Let's look on it in details. First of all we get the `rip-relative` address of the `_text` and `_early_level4_pgt` and put they into `rdi` and `rbx` registers: - -```assembly - leaq _text(%rip), %rdi - leaq early_level4_pgt(%rip), %rbx -``` - -After this we store physical address of the `_text` in the `rax` and get the index of the page global directory entry which stores `_text` address, by shifting `_text` address on the `PGDIR_SHIFT`: - -```assembly - movq %rdi, %rax - shrq $PGDIR_SHIFT, %rax - - leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx - movq %rdx, 0(%rbx,%rax,8) - movq %rdx, 8(%rbx,%rax,8) -``` - -where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories: - -```C -#define PGDIR_SHIFT 39 -#define PUD_SHIFT 30 -#define PMD_SHIFT 21 -``` - -After this we put the address of the first `level3_kernel_pgt` to the `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `level3_kernel_pgt` entries. - -After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now contains the address of the first entry of the `level3_kernel_pgt`) and put `rdi` (it now contains physical address of the `_text`) to the `rax`. And after this we write addresses of the two page upper directory entries to the `level3_kernel_pgt`: - -```assembly - addq $4096, %rdx - movq %rdi, %rax - shrq $PUD_SHIFT, %rax - andl $(PTRS_PER_PUD-1), %eax - movq %rdx, 4096(%rbx,%rax,8) - incl %eax - andl $(PTRS_PER_PUD-1), %eax - movq %rdx, 4096(%rbx,%rax,8) -``` - -In the next step we write addresses of the page middle directory entries to the `level2_kernel_pgt` and the last step is correcting of the kernel text+data virtual addresses: - -```assembly - leaq level2_kernel_pgt(%rip), %rdi - leaq 4096(%rdi), %r8 -1: testq $1, 0(%rdi) - jz 2f - addq %rbp, 0(%rdi) -2: addq $8, %rdi - cmp %r8, %rdi - jne 1b -``` - -Here we put the address of the `level2_kernel_pgt` to the `rdi` and address of the page table entry to the `r8` register. Next we check the present bit in the `level2_kernel_pgt` and if it is zero we're moving to the next page by adding 8 bytes to `rdi` which contaitns address of the `level2_kernel_pgt`. After this we compare it with `r8` (contains address of the page table entry) and go back to label `1` or move forward. - -In the next step we correct `phys_base` physical address with `rbp` (contains physical address of the `_text`), put physical address of the `early_level4_pgt` and jump to label `1`: - -```assembly - addq %rbp, phys_base(%rip) - movq $(early_level4_pgt - __START_KERNEL_map), %rax - jmp 1f -``` - -where `phys_base` mathes the first entry of the `level2_kernel_pgt` which is 512 MB kernel mapping. - -Last preparations --------------------------------------------------------------------------------- - -After that we jumped to the label `1` we enable `PAE`, `PGE` (Paging Global Extension) and put the physical address of the `phys_base` (see above) to the `rax` register and fill `cr3` register with it: - -```assembly -1: - movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx - movq %rcx, %cr4 - - addq phys_base(%rip), %rax - movq %rax, %cr3 -``` - -In the next step we check that CPU support [NX](http://en.wikipedia.org/wiki/NX_bit) bit with: - -```assembly - movl $0x80000001, %eax - cpuid - movl %edx,%edi -``` - -We put `0x80000001` value to the `eax` and execute `cpuid` instruction for getting extended processor info and feature bits. The result will be in the `edx` register which we put to the `edi`. - -Now we put `0xc0000080` or `MSR_EFER` to the `ecx` and call `rdmsr` instruction for the reading model specific register. - -```assembly - movl $MSR_EFER, %ecx - rdmsr -``` - -The result will be in the `edx:eax`. General view of the `EFER` is following: - -``` -63 32 - -------------------------------------------------------------------------------- -| | -| Reserved MBZ | -| | - -------------------------------------------------------------------------------- -31 16 15 14 13 12 11 10 9 8 7 1 0 - -------------------------------------------------------------------------------- -| | T | | | | | | | | | | -| Reserved MBZ | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE| -| | E | | | | | | | | | | - -------------------------------------------------------------------------------- -``` - -We will not see all fields in details here, but we will learn about this and other `MSRs` in the special part about. As we read `EFER` to the `edx:eax`, we checks `_EFER_SCE` or zero bit which is `System Call Extensions` with `btsl` instruction and set it to one. By the setting `SCE` bit we enable `SYSCALL` and `SYSRET` instructions. In the next step we check 20th bit in the `edi`, remember that this register stores result of the `cpuid` (see above). If `20` bit is set (`NX` bit) we just write `EFER_SCE` to the model specific register. - -```assembly - btsl $_EFER_SCE, %eax - btl $20,%edi - jnc 1f - btsl $_EFER_NX, %eax - btsq $_PAGE_BIT_NX,early_pmd_flags(%rip) -1: wrmsr -``` - -If `NX` bit is supported we enable `_EFER_NX` and write it too, with the `wrmsr` instruction. - -In the next step we need to update Global Descriptor table with `lgdt` instruction: - -```assembly -lgdt early_gdt_descr(%rip) -``` - -where Global Descriptor table defined as: - -```assembly -early_gdt_descr: - .word GDT_ENTRIES*8-1 -early_gdt_descr_base: - .quad INIT_PER_CPU_VAR(gdt_page) -``` - -We need to reload Global Descriptor Table because now kernel works in the userspace addresses, but soon kernel will work in it's own space. Now let's look on `early_gdt_descr` definition. Global Descriptor Table contains 32 entries: - -```C -#define GDT_ENTRIES 32 -``` - -for kernel code, data, thread local storage segments and etc... it's simple. Now let's look on the `early_gdt_descr_base`. First of `gdt_page` defined as: - -```C -struct gdt_page { - struct desc_struct gdt[GDT_ENTRIES]; -} __attribute__((aligned(PAGE_SIZE))); -``` - -in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h). It contains one field `gdt` which is array of the `desc_struct` structures which defined as: - -```C -struct desc_struct { - union { - struct { - unsigned int a; - unsigned int b; - }; - struct { - u16 limit0; - u16 base0; - unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1; - unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8; - }; - }; - } __attribute__((packed)); -``` - -and presents familiar to us GDT descriptor. Also we can note that `gdt_page` structure aligned to `PAGE_SIZE` which is 4096 bytes. It means that `gdt` will occupy one page. Now let's try to understand what is it `INIT_PER_CPU_VAR`. `INIT_PER_CPU_VAR` is a macro which defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) and just concats `init_per_cpu__` with the given parameter: - -```C -#define INIT_PER_CPU_VAR(var) init_per_cpu__##var -``` - -After this we have `init_per_cpu__gdt_page`. We can see in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S): - -``` -#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load -INIT_PER_CPU(gdt_page); -``` - -As we got `init_per_cpu__gdt_page` in `INIT_PER_CPU_VAR` and `INIT_PER_CPU` macro from linker script will be expanded we will get offset from the `__per_cpu_load`. After this calculations, we will have correct base address of the new GDT. - -Generally per-CPU variables is a 2.6 kernel feature. You can understand what is it from it's name. When we create `per-CPU` variable, each CPU will have will have it's own copy of this variable. Here we creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with it's own copy of variable and etc... So every core on multiprocessor will have it's own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about `per-CPU` variables in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post. - -As we loaded new Global Descriptor Table, we reload segments as we did it every time: - -```assembly - xorl %eax,%eax - movl %eax,%ds - movl %eax,%ss - movl %eax,%es - movl %eax,%fs - movl %eax,%gs -``` - -After all of these steps we set up `gs` register that it post to the `irqstack` (we will see information about it in the next parts): - -```assembly - movl $MSR_GS_BASE,%ecx - movl initial_gs(%rip),%eax - movl initial_gs+4(%rip),%edx - wrmsr -``` - -where `MSR_GS_BASE` is: - -```C -#define MSR_GS_BASE 0xc0000101 -``` - -We need to put `MSR_GS_BASE` to the `ecx` register and load data from the `eax` and `edx` (which are point to the `initial_gs`) with `wrmsr` instruction. We don't use `cs`, `fs`, `ds` and `ss` segment registers for addressation in the 64-bit mode, but `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw it in the real mode for `cs`) and this part contains descriptor which mapped to Model specific registers. So we can see above `0xc0000101` is a `gs.base` MSR address. - -In the next step we put the address of the real mode bootparam structure to the `rdi` (remember `rsi` holds pointer to this structure from the start) and jump to the C code with: - -```assembly - movq initial_code(%rip),%rax - pushq $0 - pushq $__KERNEL_CS - pushq %rax - lretq -``` - -Here we put the address of the `initial_code` to the `rax` and push fake address, `__KERNEL_CS` and the address of the `initial_code` to the stack. After this we can see `lretq` instruction which means that after it return address will be extracted from stack (now there is address of the `initial_code`) and jump there. `initial_code` defined in the same source code file and looks: - -```assembly - __REFDATA - .balign 8 - GLOBAL(initial_code) - .quad x86_64_start_kernel - ... - ... - ... -``` - -As we can see `initial_code` contains address of the `x86_64_start_kernel`, which defined in the [arch/x86/kerne/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this: - -```C -asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) { - ... - ... - ... -} -``` - -It has one argument is a `real_mode_data` (remember that we passed address of the real mode data to the `rdi` register previously). - -This is first C code in the kernel! - -Next to start_kernel --------------------------------------------------------------------------------- - -We need to see last preparations before we can see "kernel entry point" - start_kernel function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489). - -First of all we can see some checks in the `x86_64_start_kernel` function: - -```C -BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map); -BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE); -BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE); -BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0); -BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0); -BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL)); -BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK))); -BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END); -``` - -There are checks for different things like virtual addresses of modules space is not fewer than base address of the kernel text - `__STAT_KERNEL_map`, that kernel text with modules is not less than image of the kernel and etc... `BUILD_BUG_ON` is a macro which looks as: - -```C -#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)])) -``` - -Let's try to understand this trick works. Let's take for example first condition: `MODULES_VADDR < __START_KERNEL_map`. `!!conditions` is the same that `condition != 0`. So it means if `MODULES_VADDR < __START_KERNEL_map` is true, we will get `1` in the `!!(condition)` or zero if not. After `2*!!(condition)` we will get or `2` or `0`. In the end of calculations we can get two different behaviors: - -* We will have compilation error, because try to get size of the char array with negative index (as can be in our case, because `MODULES_VADDR` can't be less than `__START_KERNEL_map` will be in our case); -* No compilation errors. - -That's all. So interesting C trick for getting compile error which depends on some constants. - -In the next step we can see call of the `cr4_init_shadow` function which stores shadow copy of the `cr4` per cpu. Context switches can change bits in the `cr4` so we need to store `cr4` for each CPU. And after this we can see call of the `reset_early_page_tables` function where we resets all page global directory entries and write new pointer to the PGT in `cr3`: - -```C -for (i = 0; i < PTRS_PER_PGD-1; i++) - early_level4_pgt[i].pgd = 0; - -next_early_pgt = 0; - -write_cr3(__pa_nodebug(early_level4_pgt)); -``` - -soon we will build new page tables. Here we can see that we go through all Page Global Directory Entries (`PTRS_PER_PGD` is `512`) in the loop and make it zero. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write physical address of the `early_level4_pgt` to the `cr3`. `__pa_nodebug` is a macro which will be expanded to: - -```C -((unsigned long)(x) - __START_KERNEL_map + phys_base) -``` - -After this we clear `_bss` from the `__bss_stop` to `__bss_start` and the next step will be setup of the early `IDT` handlers, but it's big theme so we will see it in the next part. - -Conclusion --------------------------------------------------------------------------------- - -This is the end of the first part about linux kernel initialization. - -If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). - -In the next part we will see initialization of the early interruption handlers, kernel space memory mapping and many many more. - -**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-insides).** - -Links --------------------------------------------------------------------------------- - -* [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register) -* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) -* [Previous part - Kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) -* [NX](http://en.wikipedia.org/wiki/NX_bit) -* [ASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization) diff --git a/Initialization/linux-initialization-10.md b/Initialization/linux-initialization-10.md deleted file mode 100644 index f59f507..0000000 --- a/Initialization/linux-initialization-10.md +++ /dev/null @@ -1,473 +0,0 @@ -Kernel initialization. Part 10. -================================================================================ - -End of the linux kernel initialization process -================================================================================ - -This is tenth part of the chapter about linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the [previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish with it. - -After the call of the `acpi_early_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we can see the following code: - -```C -#ifdef CONFIG_X86_ESPFIX64 - init_espfix_bsp(); -#endif -``` - -Here we can see the call of the `init_espfix_bsp` function which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function defined in the [arch/x86/kernel/espfix_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/espfix_64.c) and prevents leaking of `31:16` bits of the `esp` register during returning to 16-bit stack. First of all we install `espfix` page upper directory into the kernel page directory in the `init_espfix_bs`: - -```C -pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)]; -pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page); -``` - -Where `ESPFIX_BASE_ADDR` is: - -```C -#define PGDIR_SHIFT 39 -#define ESPFIX_PGD_ENTRY _AC(-2, UL) -#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT) -``` - -Also we can find it in the [Documentation/arch/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt): - -``` -... unused hole ... -ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks -... unused hole ... -``` - -After we've filled page global directory with the `espfix` pud, the next step is call of the `init_espfix_random` and `init_espfix_ap` functions. The first function returns random locations for the `espfix` page and the second enables the `espfix` the current CPU. After the `init_espfix_bsp` finished to work, we can see the call of the `thread_info_cache_init` function which defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and allocates cache for the `thread_info` if its size is less than `PAGE_SIZE`: - -```C -# if THREAD_SIZE >= PAGE_SIZE -... -... -... -void thread_info_cache_init(void) -{ - thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE, - THREAD_SIZE, 0, NULL); - BUG_ON(thread_info_cache == NULL); -} -... -... -... -#endif -``` - -As we already know the `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` or `4096` bytes and `THREAD_SIZE` is `(PAGE_SIZE << THREAD_SIZE_ORDER)` or `16384` bytes for the `x86_64`. The next function after the `thread_info_cache_init` is the `cred_init` from the [kernel/cred.c](https://github.com/torvalds/linux/blob/master/kernel/cred.c). This function just allocates space for the credentials (like `uid`, `gid` and etc...): - -```C -void __init cred_init(void) -{ - cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), - 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); -} -``` - -more about credentials you can read in the [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt). Next step is the `fork_init` function from the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). The `fork_init` function allocates space for the `task_struct`. Let's look on the implementation of the `fork_init`. First of all we can see definitions of the `ARCH_MIN_TASKALIGN` macro and creation of a slab where task_structs will be allocated: - -```C -#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR -#ifndef ARCH_MIN_TASKALIGN -#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES -#endif - task_struct_cachep = - kmem_cache_create("task_struct", sizeof(struct task_struct), - ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL); -#endif -``` - -As we can see this code depends on the `CONFIG_ARCH_TASK_STRUCT_ACLLOCATOR` kernel configuration option. This configuration option shows the presence of the `alloc_task_struct` for the given architecture. As `x86_64` has no `alloc_task_struct` function, this code will not work and even will not be compiled on the `x86_64`. - -Allocating cache for init task --------------------------------------------------------------------------------- - -After this we can see the call of the `arch_task_cache_init` function in the `fork_init`: - -```C -void arch_task_cache_init(void) -{ - task_xstate_cachep = - kmem_cache_create("task_xstate", xstate_size, - __alignof__(union thread_xstate), - SLAB_PANIC | SLAB_NOTRACK, NULL); - setup_xstate_comp(); -} -``` - -The `arch_task_cache_init` does initialization of the architecture-specific caches. In our case it is `x86_64`, so as we can see, the `arch_task_cache_init` allocates space for the `task_xstate` which represents [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) state and sets up offsets and sizes of all extended states in [xsave](http://www.felixcloutier.com/x86/XSAVES.html) area with the call of the `setup_xstate_comp` function. After the `arch_task_cache_init` we calculate default maximum number of threads with the: - -```C -set_max_threads(MAX_THREADS); -``` - -where default maximum number of threads is: - -```C -#define FUTEX_TID_MASK 0x3fffffff -#define MAX_THREADS FUTEX_TID_MASK -``` - -In the end of the `fork_init` function we initalize [signal](http://www.win.tue.nl/~aeb/linux/lk/lk-5.html) handler: - -```C -init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2; -init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2; -init_task.signal->rlim[RLIMIT_SIGPENDING] = - init_task.signal->rlim[RLIMIT_NPROC]; -``` - -As we know the `init_task` is an instance of the `task_struct` structure, so it contains `signal` field which represents signal handler. It has following type `struct signal_struct`. On the first two lines we can see setting of the current and maximum limit of the `resource limits`. Every process has an associated set of resource limits. These limits specify amount of resources which current process can use. Here `rlim` is resource control limit and presented by the: - -```C -struct rlimit { - __kernel_ulong_t rlim_cur; - __kernel_ulong_t rlim_max; -}; -``` - -structure from the [include/uapi/linux/resource.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/resource.h). In our case the resource is the `RLIMIT_NPROC` which is the maximum number of process that use can own and `RLIMIT_SIGPENDING` - the maximum number of pending signals. We can see it in the: - -```C -cat /proc/self/limits -Limit Soft Limit Hard Limit Units -... -... -... -Max processes 63815 63815 processes -Max pending signals 63815 63815 signals -... -... -... -``` - -Initialization of the caches --------------------------------------------------------------------------------- - -The next function after the `fork_init` is the `proc_caches_init` from the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). This function allocates caches for the memory descriptors (or `mm_struct` structure). At the beginning of the `proc_caches_init` we can see allocation of the different [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) caches with the call of the `kmem_cache_create`: - -* `sighand_cachep` - manage information about installed signal handlers; -* `signal_cachep` - manage information about process signal descriptor; -* `files_cachep` - manage information about opened files; -* `fs_cachep` - manage filesystem information. - -After this we allocate `SLAB` cache for the `mm_struct` structures: - -```C -mm_cachep = kmem_cache_create("mm_struct", - sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN, - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL); -``` - - -After this we allocate `SLAB` cache for the important `vm_area_struct` which used by the kernel to manage virtual memory space: - -```C -vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC); -``` - -Note, that we use `KMEM_CACHE` macro here instead of the `kmem_cache_create`. This macro defined in the [include/linux/slab.h](https://github.com/torvalds/linux/blob/master/include/linux/slab.h) and just expands to the `kmem_cache_create` call: - -```C -#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\ - sizeof(struct __struct), __alignof__(struct __struct),\ - (__flags), NULL) -``` - -The `KMEM_CACHE` has one difference from `kmem_cache_create`. Take a look on `__alignof__` operator. The `KMEM_CACHE` macro aligns `SLAB` to the size of the given structure, but `kmem_cache_create` uses given value to align space. After this we can see the call of the `mmap_init` and `nsproxy_cache_init` functions. The first function initalizes virtual memory area `SLAB` and the second function initializes `SLAB` for namespaces. - -The next function after the `proc_caches_init` is `buffer_init`. This function defined in the [fs/buffer.c](https://github.com/torvalds/linux/blob/master/fs/buffer.c) source code file and allocate cache for the `buffer_head`. The `buffer_head` is a special structure which defined in the [include/linux/buffer_head.h](https://github.com/torvalds/linux/blob/master/include/linux/buffer_head.h) and used for managing buffers. In the start of the `bufer_init` function we allocate cache for the `struct buffer_head` structures with the call of the `kmem_cache_create` function as we did it in the previous functions. And calcuate the maximum size of the buffers in memory with: - -```C -nrpages = (nr_free_buffer_pages() * 10) / 100; -max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); -``` - -which will be equal to the `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on the `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points and etc... More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in the separate part. After this we can see `signals_init` function. This function defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represents queue of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to when set. - -Creation of the root for the procfs --------------------------------------------------------------------------------- - -After all of this preparations we need to create the root for the [proc](http://en.wikipedia.org/wiki/Procfs) filesystem. We will do it with the call of the `proc_root_init` function from the [fs/proc/root.c](https://github.com/torvalds/linux/blob/master/fs/proc/root.c). At the start of the `proc_root_init` function we allocate the cache for the inodes and register a new filesystem in the system with the: - -```C -err = register_filesystem(&proc_fs_type); - if (err) - return; -``` - -As I wrote above we will not dive into details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) and different filesystems in this chapter, but will see it in the chapter about the `VFS`. After we've registered a new filesystem in the our system, we call the `proc_self_init` function from the TO[fs/proc/self.c](https://github.com/torvalds/linux/blob/master/fs/proc/self.c) and this function allocates `inode` number for the `self` (`/proc/self` directory refers to the process accessing the `/proc` filesystem). The next step after the `proc_self_init` is `proc_setup_thread_self` which setups the `/proc/thread-self` directory which contains information about current thread. After this we create `/proc/self/mounts` symllink which will contains mount points with the call of the - -```C -proc_symlink("mounts", NULL, "self/mounts"); -``` - -and a couple of directories depends on the different configuration options: - -```C -#ifdef CONFIG_SYSVIPC - proc_mkdir("sysvipc", NULL); -#endif - proc_mkdir("fs", NULL); - proc_mkdir("driver", NULL); - proc_mkdir("fs/nfsd", NULL); -#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE) - proc_mkdir("openprom", NULL); -#endif - proc_mkdir("bus", NULL); - ... - ... - ... - if (!proc_mkdir("tty", NULL)) - return; - proc_mkdir("tty/ldisc", NULL); - ... - ... - ... -``` - -In the end of the `proc_root_init` we call the `proc_sys_init` function which creates `/proc/sys` directory and initializes the [Sysctl](http://en.wikipedia.org/wiki/Sysctl). - -It is the end of `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I missed it, because they are not so important for the generic kernel initialization stuff and depend on only different kernel configurations. They are `taskstats_init_early` which exports per-task statistic to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` initialize diferent security stuff, `check_bugs` - makes fix up of the some architecture-dependent bugs, `ftrace_init` function executes initialization of the [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init` makes initialization of the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem and etc... Many of these parts and subsystems will be described in the other chapters. - -That's all. Finally we passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. In the end of the `start_kernel` we can see the last call of the - `rest_init` function. Let's go ahead. - -First steps after the start_kernel --------------------------------------------------------------------------------- - -The `rest_init` function defined in the same source code file as `start_kernel` function, and this file is [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). In the beginning of the `rest_init` we can see call of the two following functions: - -```C - rcu_scheduler_starting(); - smpboot_thread_init(); -``` - -The first `rcu_scheduler_starting` makes [RCU](http://en.wikipedia.org/wiki/Read-copy-update) scheduler active and the second `smpboot_thread_init` registers the `smpboot_thread_notifier` CPU notifier (more about it you can read in the [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). After this we can see the following calls: - -```C -kernel_thread(kernel_init, NULL, CLONE_FS); -pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); -``` - -Here the `kernel_thread` function (defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c)) creates new kernel thread.As we can see the `kernel_thread` function takes three arguments: - -* Function which will be executed in a new thread; -* Parameter for the `kernel_init` function; -* Flags. - -We will not dive into details about `kernel_thread` implementation (we will see it in the chapter which will describe scheduler, just need to say that `kernel_thread` invokes [clone](http://www.tutorialspoint.com/unix_system_calls/clone.htm)). Now we only need to know that we create new kernel thread with `kernel_thread` function, parent and child of the thread will use shared information about a filesystem and it will start to execute `kernel_init` function. A kernel thread differs from an user thread that it runs in a kernel mode. So with these two `kernel_thread` calls we create two new kernel threads with the `PID = 1` for `init` process and `PID = 2` for `kthread`. We already know what is `init` process. Let's look on the `kthread`. It is special kernel thread which allows to `init` and different parts of the kernel to create another kernel threads. We can see it in the output of the `ps` util: - -```C -$ ps -ef | grep kthradd -alex 12866 4767 0 18:26 pts/0 00:00:00 grep kthradd -``` - -Let's postpone `kernel_init` and `kthreadd` for now and will go ahead in the `rest_init`. In the next step after we have created two new kernel threads we can see the following code: - -```C - rcu_read_lock(); - kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns); - rcu_read_unlock(); -``` - -The first `rcu_read_lock` function marks the beginning of an [RCU](http://en.wikipedia.org/wiki/Read-copy-update) read-side critical section and the `rcu_read_unlock` marks the end of an RCU read-side critical section. We call these functions because we need to protect the `find_task_by_pid_ns`. The `find_task_by_pid_ns` returns pointer to the `task_struct` by the given pid. So, here we are getting the pointer to the `task_struct` for the `PID = 2` (we got it after `kthreadd` creation with the `kernel_thread`). In the next step we call `complete` function - -```C -complete(&kthreadd_done); -``` - -and pass address of the `kthreadd_done`. The `kthreadd_done` defined as - -```C -static __initdata DECLARE_COMPLETION(kthreadd_done); -``` - -where `DECLARE_COMPLETION` macro defined as: - -```C -#define DECLARE_COMPLETION(work) \ - struct completion work = COMPLETION_INITIALIZER(work) -``` - -and expands to the definition of the `completion` structure. This structure defined in the [include/linux/completion.h](https://github.com/torvalds/linux/blob/master/include/linux/completion.h) and presents `completions` concept. Completions are a code synchronization mechanism which is provide race-free solution for the threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts: The first is definition of the `complete` structure and we did it with the `DECLARE_COMPLETION`. The second is call of the `wait_for_completion`. After the call of this function, a thread which called it will not continue to execute and will wait while other thread did not call `complete` function. Note that we call `wait_for_completion` with the `kthreadd_done` in the beginning of the `kernel_init_freeable`: - -```C -wait_for_completion(&kthreadd_done); -``` - -And the last step is to call `complete` function as we saw it above. After this the `kernel_init_freeable` function will not be executed while `kthreadd` thread will not be set. After the `kthreadd` was set, we can see three following functions in the `rest_init`: - -```C - init_idle_bootup_task(current); - schedule_preempt_disabled(); - cpu_startup_entry(CPUHP_ONLINE); -``` - -The first `init_idle_bootup_task` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) sets the Scheduling class for the current process (`idle` class in our case): - -```C -void init_idle_bootup_task(struct task_struct *idle) -{ - idle->sched_class = &idle_sched_class; -} -``` - -where `idle` class is a low priority tasks and tasks can be run only when the processor doesn't have to run anything besides this tasks. The second function `schedule_preempt_disabled` disables preempt in `idle` tasks. And the third function `cpu_startup_entry` defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c) and calls `cpu_idle_loop` from the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c). The `cpu_idle_loop` function works as process with `PID = 0` and works in the background. Main purpose of the `cpu_idle_loop` is usage of the idle CPU cycles. When there are no one process to run, this process starts to work. We have one process with `idle` scheduling class (we just set the `current` task to the `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work and checks that there is not active task to switch: - -```C -static void cpu_idle_loop(void) -{ - ... - ... - ... - while (1) { - while (!need_resched()) { - ... - ... - ... - } - ... - } -``` - -More about it will be in the chapter about scheduler. So for this moment the `start_kernel` calls the `rest_init` function which spawns an `init` (`kernel_init` function) process and become `idle` process itself. Now is time to look on the `kernel_init`. Execution of the `kernel_init` function starts from the call of the `kernel_init_freeable` function. The `kernel_init_freeable` function first of all waits for the completion of the `kthreadd` setup. I already wrote about it above: - -```C -wait_for_completion(&kthreadd_done); -``` - -After this we set `gfp_allowed_mask` to `__GFP_BITS_MASK` which means that already system is running, set allowed [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to all CPUs and [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes with the `set_mems_allowed` function, allow `init` process to run on any CPU with the `set_cpus_allowed_ptr`, set pid for the `cad` or `Ctrl-Alt-Delete`, do preparation for booting of the other CPUs with the call of the `smp_prepare_cpus`, call early [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) with the `do_pre_smp_initcalls`, initialization of the `SMP` with the `smp_init` and initialization of the [lockup_detector](https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt) with the call of the `lockup_detector_init` and initialize scheduler with the `sched_init_smp`. - -After this we can see the call of the following functions - `do_basic_setup`. Before we will call the `do_basic_setup` function, our kernel already initialized for this moment. As comment says: - -``` -Now we can finally start doing some real work.. -``` - -The `do_basic_setup` will reinitialize [cpuset](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to the active CPUs, initialization of the `khelper` - which is a kernel thread which used for making calls out to userspace from within the kernel, initialize [tmpfs](http://en.wikipedia.org/wiki/Tmpfs), initialize `drivers` subsystem, enable the user-mode helper `workqueue` and make post-early call of the `initcalls`. We can see openinng of the `dev/console` and dup twice file descriptors from `0` to `2` after the `do_basic_setup`: - - -```C -if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0) - pr_err("Warning: unable to open an initial console.\n"); - -(void) sys_dup(0); -(void) sys_dup(0); -``` - -We are using two system calls here `sys_open` and `sys_dup`. In the next chapters we will see explanation and implementation of the different system calls. After we opened initial console, we check that `rdinit=` option was passed to the kernel command line or set default path of the ramdisk: - -```C -if (!ramdisk_execute_command) - ramdisk_execute_command = "/init"; -``` - -Check user's permissions for the `ramdisk` and call the `prepare_namespace` function from the [init/do_mounts.c](https://github.com/torvalds/linux/blob/master/init/do_mounts.c) which checks and mounts the [initrd](http://en.wikipedia.org/wiki/Initrd): - -```C -if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) { - ramdisk_execute_command = NULL; - prepare_namespace(); -} -``` - -This is the end of the `kernel_init_freeable` function and we need return to the `kernel_init`. The next step after the `kernel_init_freeable` finished its execution is the `async_synchronize_full`. This function waits until all asynchronous function calls have been done and after it we will call the `free_initmem` which will release all memory occupied by the initialization stuff which located between `__init_begin` and `__init_end`. After this we protect `.rodata` with the `mark_rodata_ro` and update state of the system from the `SYSTEM_BOOTING` to the - -```C -system_state = SYSTEM_RUNNING; -``` - -And tries to run the `init` process: - -```C -if (ramdisk_execute_command) { - ret = run_init_process(ramdisk_execute_command); - if (!ret) - return 0; - pr_err("Failed to execute %s (error %d)\n", - ramdisk_execute_command, ret); -} -``` - -First of all it checks the `ramdisk_execute_command` which we set in the `kernel_init_freeable` function and it will be equal to the value of the `rdinit=` kernel command line parameters or `/init` by default. The `run_init_process` function fills the first element of the `argv_init` array: - -```C -static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, }; -``` - -which represents arguments of the `init` program and call `do_execve` function: - -```C -argv_init[0] = init_filename; -return do_execve(getname_kernel(init_filename), - (const char __user *const __user *)argv_init, - (const char __user *const __user *)envp_init); -``` - -The `do_execve` function defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and runs program with the given file name and arguments. If we did not pass `rdinit=` option to the kernel command line, kernel starts to check the `execute_command` which is equal to value of the `init=` kernel command line parameter: - -```C - if (execute_command) { - ret = run_init_process(execute_command); - if (!ret) - return 0; - panic("Requested init %s failed (error %d).", - execute_command, ret); - } -``` - -If we did not pass `init=` kernel command line parameter too, kernel tries to run one of the following executable files: - -```C -if (!try_to_run_init_process("/sbin/init") || - !try_to_run_init_process("/etc/init") || - !try_to_run_init_process("/bin/init") || - !try_to_run_init_process("/bin/sh")) - return 0; -``` - -In other way we finish with [panic](http://en.wikipedia.org/wiki/Kernel_panic): - -```C -panic("No working init found. Try passing init= option to kernel. " - "See Linux Documentation/init.txt for guidance."); -``` - -That's all! Linux kernel initialization process is finished! - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the tenth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). And it is not only `tenth` part, but this is the last part which describes initialization of the linux kernel. As I wrote in the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in the our system. I missed details about different subsystem of the kernel, for example I almost did not cover linux kernel scheduler or we did not see almost anything about interrupts and exceptions handling and etc... From the next part we will start to dive to the different kernel subsystems. Hope it will be interesting. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) -* [xsave](http://www.felixcloutier.com/x86/XSAVES.html) -* [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) -* [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt) -* [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt) -* [RCU](http://en.wikipedia.org/wiki/Read-copy-update) -* [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) -* [inode](http://en.wikipedia.org/wiki/Inode) -* [proc](http://en.wikipedia.org/wiki/Procfs) -* [man proc](http://linux.die.net/man/5/proc) -* [Sysctl](http://en.wikipedia.org/wiki/Sysctl) -* [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt) -* [cgroup](http://en.wikipedia.org/wiki/Cgroups) -* [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) -* [completions - wait for completion handling](https://www.kernel.org/doc/Documentation/scheduler/completion.txt) -* [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) -* [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) -* [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) -* [Tmpfs](http://en.wikipedia.org/wiki/Tmpfs) -* [initrd](http://en.wikipedia.org/wiki/Initrd) -* [panic](http://en.wikipedia.org/wiki/Kernel_panic) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) diff --git a/Initialization/linux-initialization-2.md b/Initialization/linux-initialization-2.md deleted file mode 100644 index 542deb7..0000000 --- a/Initialization/linux-initialization-2.md +++ /dev/null @@ -1,468 +0,0 @@ -Kernel initialization. Part 2. -================================================================================ - -Early interrupt and exception handling --------------------------------------------------------------------------------- - -In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we stopped before setting of early interrupt handlers. We continue in this part and will know more about interrupt and exception handling. - -Remember that we stopped before following loop: - -```C - for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) - set_intr_gate(i, early_idt_handlers[i]); -``` - -from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) source code file. But before we started to sort out this code, we need to know about interrupts and handlers. - -Some theory --------------------------------------------------------------------------------- - -Interrupt is an event caused by software or hardware to the CPU. On interrupt, CPU stops the current task and transfer control to the interrupt handler, which handles interruption and transfer control back to the previously stopped task. We can split interrupts on three types: - -* Software interrupts - when a software signals CPU that it needs kernel attention. These interrupts generally used for system calls; -* Hardware interrupts - when a hardware, for example button pressed on a keyboard; -* Exceptions - interrupts generated by CPU, when the CPU detects error, for example division by zero or accessing a memory page which is not in RAM. - -Every interrupt and exception is assigned an unique number which called - `vector number`. `Vector number` can be any number from `0` to `255`. There is common practice to use first `32` vector numbers for exceptions, and vector numbers from `31` to `255` are used for user-defined interrupts. We can see it in the code above - `NUM_EXCEPTION_VECTORS`, which defined as: - -```C -#define NUM_EXCEPTION_VECTORS 32 -``` - -CPU uses vector number as an index in the `Interrupt Descriptor Table` (we will see description of it soon). CPU catch interrupts from the [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) or through it's pins. Following table shows `0-31` exceptions: - -``` ----------------------------------------------------------------------------------------------- -|Vector|Mnemonic|Description |Type |Error Code|Source | ----------------------------------------------------------------------------------------------- -|0 | #DE |Divide Error |Fault|NO |DIV and IDIV | -|--------------------------------------------------------------------------------------------- -|1 | #DB |Reserved |F/T |NO | | -|--------------------------------------------------------------------------------------------- -|2 | --- |NMI |INT |NO |external NMI | -|--------------------------------------------------------------------------------------------- -|3 | #BP |Breakpoint |Trap |NO |INT 3 | -|--------------------------------------------------------------------------------------------- -|4 | #OF |Overflow |Trap |NO |INTO instruction | -|--------------------------------------------------------------------------------------------- -|5 | #BR |Bound Range Exceeded|Fault|NO |BOUND instruction | -|--------------------------------------------------------------------------------------------- -|6 | #UD |Invalid Opcode |Fault|NO |UD2 instruction | -|--------------------------------------------------------------------------------------------- -|7 | #NM |Device Not Available|Fault|NO |Floating point or [F]WAIT | -|--------------------------------------------------------------------------------------------- -|8 | #DF |Double Fault |Abort|YES |Ant instrctions which can generate NMI| -|--------------------------------------------------------------------------------------------- -|9 | --- |Reserved |Fault|NO | | -|--------------------------------------------------------------------------------------------- -|10 | #TS |Invalid TSS |Fault|YES |Task switch or TSS access | -|--------------------------------------------------------------------------------------------- -|11 | #NP |Segment Not Present |Fault|NO |Accessing segment register | -|--------------------------------------------------------------------------------------------- -|12 | #SS |Stack-Segment Fault |Fault|YES |Stack operations | -|--------------------------------------------------------------------------------------------- -|13 | #GP |General Protection |Fault|YES |Memory reference | -|--------------------------------------------------------------------------------------------- -|14 | #PF |Page fault |Fault|YES |Memory reference | -|--------------------------------------------------------------------------------------------- -|15 | --- |Reserved | |NO | | -|--------------------------------------------------------------------------------------------- -|16 | #MF |x87 FPU fp error |Fault|NO |Floating point or [F]Wait | -|--------------------------------------------------------------------------------------------- -|17 | #AC |Alignment Check |Fault|YES |Data reference | -|--------------------------------------------------------------------------------------------- -|18 | #MC |Machine Check |Abort|NO | | -|--------------------------------------------------------------------------------------------- -|19 | #XM |SIMD fp exception |Fault|NO |SSE[2,3] instructions | -|--------------------------------------------------------------------------------------------- -|20 | #VE |Virtualization exc. |Fault|NO |EPT violations | -|--------------------------------------------------------------------------------------------- -|21-31 | --- |Reserved |INT |NO |External interrupts | ----------------------------------------------------------------------------------------------- -``` - -To react on interrupt CPU uses special structure - Interrupt Descriptor Table or IDT. IDT is an array of 8-byte descriptors like Global Descriptor Table, but IDT entries are called `gates`. CPU multiplies vector number on 8 to find index of the IDT entry. But in 64-bit mode IDT is an array of 16-byte descriptors and CPU multiplies vector number on 16 to find index of the entry in the IDT. We remember from the previous part that CPU uses special `GDTR` register to locate Global Descriptor Table, so CPU uses special register `IDTR` for Interrupt Descriptor Table and `lidt` instruuction for loading base address of the table into this register. - -64-bit mode IDT entry has following structure: - -``` -127 96 - -------------------------------------------------------------------------------- -| | -| Reserved | -| | - -------------------------------------------------------------------------------- -95 64 - -------------------------------------------------------------------------------- -| | -| Offset 63..32 | -| | - -------------------------------------------------------------------------------- -63 48 47 46 44 42 39 34 32 - -------------------------------------------------------------------------------- -| | | D | | | | | | | -| Offset 31..16 | P | P | 0 |Type |0 0 0 | 0 | 0 | IST | -| | | L | | | | | | | - -------------------------------------------------------------------------------- -31 15 16 0 - -------------------------------------------------------------------------------- -| | | -| Segment Selector | Offset 15..0 | -| | | - -------------------------------------------------------------------------------- -``` - -Where: - -* Offset - is offset to entry point of an interrupt handler; -* DPL - Descriptor Privilege Level; -* P - Segment Present flag; -* Segment selector - a code segment selector in GDT or LDT -* IST - provides ability to switch to a new stack for interrupts handling. - -And the last `Type` field describes type of the `IDT` entry. There are three different kinds of handlers for interrupts: - -* Task descriptor -* Interrupt descriptor -* Trap descriptor - -Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. Only one difference between these types is how CPU handles `IF` flag. If interrupt handler was accessed through interrupt gate, CPU clear the `IF` flag to prevent other interrupts while current interrupt handler executes. After that current interrupt handler executes, CPU sets the `IF` flag again with `iret` instruction. - -Other bits reserved and must be 0. - -Now let's look how CPU handles interrupts: - -* CPU save flags register, `CS`, and instruction pointer on the stack. -* If interrupt causes an error code (like `#PF` for example), CPU saves an error on the stack after instruction pointer; -* After interrupt handler executed, `iret` instruction used to return from it. - -Now let's back to code. - -Fill and load IDT --------------------------------------------------------------------------------- - -We stopped at the following point: - -```C -for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) - set_intr_gate(i, early_idt_handlers[i]); -``` - -Here we call `set_intr_gate` in the loop, which takes two parameters: - -* Number of an interrupt; -* Address of the idt handler. - -and inserts an interrupt gate in the nth `IDT` entry. First of all let's look on the `early_idt_handlers`. It is an array which contains address of the first 32 interrupt handlers: - -```C -extern const char early_idt_handlers[NUM_EXCEPTION_VECTORS][2+2+5]; -``` - -We're filling only first 32 IDT entries because all of the early setup runs with interrupts disabled, so there is no need to set up early exception handlers for vectors greater than 32. `early_idt_handlers` contains generic idt handlers and we can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S), we will look it soon. - -Now let's look on `set_intr_gate` implementation: - -```C -#define set_intr_gate(n, addr) \ - do { \ - BUG_ON((unsigned)n > 0xFF); \ - _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, \ - __KERNEL_CS); \ - _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\ - 0, 0, __KERNEL_CS); \ - } while (0) -``` - -First of all it checks with that passed interrupt number is not greater than `255` with `BUG_ON` macro. We need to do this check because we can have only 256 interrupts. After this it calls `_set_gate` which writes address of an interrupt gate to the `IDT`: - - -```C -static inline void _set_gate(int gate, unsigned type, void *addr, - unsigned dpl, unsigned ist, unsigned seg) -{ - gate_desc s; - pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg); - write_idt_entry(idt_table, gate, &s); - write_trace_idt_entry(gate, &s); -} -``` - -At the start of `_set_gate` function we can see call of the `pack_gate` function which fills `gate_desc` structure with the given values: - -```C -static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func, - unsigned dpl, unsigned ist, unsigned seg) -{ - gate->offset_low = PTR_LOW(func); - gate->segment = __KERNEL_CS; - gate->ist = ist; - gate->p = 1; - gate->dpl = dpl; - gate->zero0 = 0; - gate->zero1 = 0; - gate->type = type; - gate->offset_middle = PTR_MIDDLE(func); - gate->offset_high = PTR_HIGH(func); -} -``` - -As mentioned above we fill gate descriptor in this function. We fill three parts of the address of the interrupt handler with the address which we got in the main loop (address of the interrupt handler entry point). We are using three following macro to split address on three parts: - -```C -#define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF) -#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF) -#define PTR_HIGH(x) ((unsigned long long)(x) >> 32) -``` - -With the first `PTR_LOW` macro we get the first 2 bytes of the address, with the second `PTR_MIDDLE` we get the second 2 bytes of the address and with the third `PTR_HIGH` macro we get the last 4 bytes of the address. Next we setup the segment selector for interrupt handler, it will be our kernel code segment - `__KERNEL_CS`. In the next step we fill `Interrupt Stack Table` and `Descriptor Privilege Level` (highest privilege level) with zeros. And we set `GAT_INTERRUPT` type in the end. - -Now we have filled IDT entry and we can call `native_write_idt_entry` function which just copies filled `IDT` entry to the `IDT`: - -```C -static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate) -{ - memcpy(&idt[entry], gate, sizeof(*gate)); -} -``` - -After that main loop will finished, we will have filled `idt_table` array of `gate_desc` structures and we can load `IDT` with: - -```C -load_idt((const struct desc_ptr *)&idt_descr); -``` - -Where `idt_descr` is: - -```C -struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table }; -``` - -and `load_idt` just executes `lidt` instruction: - -```C -asm volatile("lidt %0"::"m" (*dtr)); -``` - -You can note that there are calls of the `_trace_*` functions in the `_set_gate` and other functions. These functions fills `IDT` gates in the same manner that `_set_gate` but with one difference. These functions use `trace_idt_table` Interrupt Descriptor Table instead of `idt_table` for tracepoints (we will cover this theme in the another part). - -Okay, now we have filled and loaded Interrupt Descriptor Table, we know how the CPU acts during interrupt. So now time to deal with interrupts handlers. - -Early interrupts handlers --------------------------------------------------------------------------------- - -As you can read above, we filled `IDT` with the address of the `early_idt_handlers`. We can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S): - -```assembly - .globl early_idt_handlers -early_idt_handlers: - i = 0 - .rept NUM_EXCEPTION_VECTORS - .if (EXCEPTION_ERRCODE_MASK >> i) & 1 - ASM_NOP2 - .else - pushq $0 - .endif - pushq $i - jmp early_idt_handler - i = i + 1 - .endr -``` - -We can see here, interrupt handlers generation for the first 32 exceptions. We check here, if exception has error code then we do nothing, if exception does not return error code, we push zero to the stack. We do it for that would stack was uniform. After that we push exception number on the stack and jump on the `early_idt_handler` which is generic interrupt handler for now. As i wrote above, CPU pushes flag register, `CS` and `RIP` on the stack. So before `early_idt_handler` will be executed, stack will contain following data: - -``` -|--------------------| -| %rflags | -| %cs | -| %rip | -| rsp --> error code | -|--------------------| -``` - -Now let's look on the `early_idt_handler` implementation. It locates in the same [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343). First of all we can see check for [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt), we no need to handle it, so just ignore they in the `early_idt_handler`: - -```assembly - cmpl $2,(%rsp) - je is_nmi -``` - -where `is_nmi`: - -```assembly -is_nmi: - addq $16,%rsp - INTERRUPT_RETURN -``` - -we drop error code and vector number from the stack and call `INTERRUPT_RETURN` which is just `iretq`. As we checked the vector number and it is not `NMI`, we check `early_recursion_flag` to prevent recursion in the `early_idt_handler` and if it's correct we save general registers on the stack: - -```assembly - pushq %rax - pushq %rcx - pushq %rdx - pushq %rsi - pushq %rdi - pushq %r8 - pushq %r9 - pushq %r10 - pushq %r11 -``` - -we need to do it to prevent wrong values in it when we return from the interrupt handler. After this we check segment selector in the stack: - -```assembly - cmpl $__KERNEL_CS,96(%rsp) - jne 11f -``` - -it must be equal to the kernel code segment and if it is not we jump on label `11` which prints `PANIC` message and makes stack dump. - -After code segment was checked, we check the vector number, and if it is `#PF`, we put value from the `cr2` to the `rdi` register and call `early_make_pgtable` (well see it soon): - -```assembly - cmpl $14,72(%rsp) - jnz 10f - GET_CR2_INTO(%rdi) - call early_make_pgtable - andl %eax,%eax - jz 20f -``` - -If vector number is not `#PF`, we restore general purpose registers from the stack: - -```assembly - popq %r11 - popq %r10 - popq %r9 - popq %r8 - popq %rdi - popq %rsi - popq %rdx - popq %rcx - popq %rax -``` - -and exit from the handler with `iret`. - -It is the end of the first interrupt handler. Note that it is very early interrupt handler, so it handles only Page Fault now. We will see handlers for the other interrupts, but now let's look on the page fault handler. - -Page fault handling --------------------------------------------------------------------------------- - -In the previous paragraph we saw first early interrupt handler which checks interrupt number for page fault and calls `early_make_pgtable` for building new page tables if it is. We need to have `#PF` handler in this step because there are plans to add ability to load kernel above 4G and make access to `boot_params` structure above the 4G. - -You can find implementation of the `early_make_pgtable` in the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and takes one parameter - address from the `cr2` register, which caused Page Fault. Let's look on it: - -```C -int __init early_make_pgtable(unsigned long address) -{ - unsigned long physaddr = address - __PAGE_OFFSET; - unsigned long i; - pgdval_t pgd, *pgd_p; - pudval_t pud, *pud_p; - pmdval_t pmd, *pmd_p; - ... - ... - ... -} -``` - -It starts from the definition of some variables which have `*val_t` types. All of these types are just: - -```C -typedef unsigned long pgdval_t; -``` - -Also we will operate with the `*_t` (not val) types, for example `pgd_t` and etc... All of these types defined in the [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h) and represent structures like this: - -```C -typedef struct { pgdval_t pgd; } pgd_t; -``` - -For example, - -```C -extern pgd_t early_level4_pgt[PTRS_PER_PGD]; -``` - -Here `early_level4_pgt` presents early top-level page table directory which consists of an array of `pgd_t` types and `pgd` points to low-level page entries. - -After we made the check that we have no invalid address, we're getting the address of the Page Global Directory entry which contains `#PF` address and put it's value to the `pgd` variable: - -```C -pgd_p = &early_level4_pgt[pgd_index(address)].pgd; -pgd = *pgd_p; -``` - -In the next step we check `pgd`, if it contains correct page global directory entry we put physical address of the page global directory entry and put it to the `pud_p` with: - -```C -pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base); -``` - -where `PTE_PFN_MASK` is a macro: - -```C -#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK) -``` - -which expands to: - -```C -(~(PAGE_SIZE-1)) & ((1 << 46) - 1) -``` - -or - -``` -0b1111111111111111111111111111111111111111111111 -``` - -which is 46 bits to mask page frame. - -If `pgd` does not contain correct address we check that `next_early_pgt` is not greater than `EARLY_DYNAMIC_PAGE_TABLES` which is `64` and present a fixed number of buffers to set up new page tables on demand. If `next_early_pgt` is greater than `EARLY_DYNAMIC_PAGE_TABLES` we reset page tables and start again. If `next_early_pgt` is less than `EARLY_DYNAMIC_PAGE_TABLES`, we create new page upper directory pointer which points to the current dynamic page table and writes it's physical address with the `_KERPG_TABLE` access rights to the page global directory: - -```C -if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) { - reset_early_page_tables(); - goto again; -} - -pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++]; -for (i = 0; i < PTRS_PER_PUD; i++) - pud_p[i] = 0; -*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE; -``` - -After this we fix up address of the page upper directory with: - -```C -pud_p += pud_index(address); -pud = *pud_p; -``` - -In the next step we do the same actions as we did before, but with the page middle directory. In the end we fix address of the page middle directory which contains maps kernel text+data virtual addresses: - -```C -pmd = (physaddr & PMD_MASK) + early_pmd_flags; -pmd_p[pmd_index(address)] = pmd; -``` - -After page fault handler finished it's work and as result our `early_level4_pgt` contains entries which point to the valid addresses. - -Conclusion --------------------------------------------------------------------------------- - -This is the end of the second part about linux kernel internals. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part we will see all steps before kernel entry point - `start_kernel` function. - -**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [GNU assembly .rept](https://sourceware.org/binutils/docs-2.23/as/Rept.html) -* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) -* [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) diff --git a/Initialization/linux-initialization-3.md b/Initialization/linux-initialization-3.md deleted file mode 100644 index e34b6e8..0000000 --- a/Initialization/linux-initialization-3.md +++ /dev/null @@ -1,430 +0,0 @@ -Kernel initialization. Part 3. -================================================================================ - -Last preparations before the kernel entry point --------------------------------------------------------------------------------- - -This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling and will continue to dive into the linux kernel initialization process in the current part. Our next point is 'kernel entry point' - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. Yes, technically it is not kernel's entry point but the start of the generic kernel code which does not depend on certain architecture. But before we will see call of the `start_kernel` function, we must do some preparations. So let's continue. - -boot_params again --------------------------------------------------------------------------------- - -In the previous part we stopped at setting Interrupt Descriptor Table and loading it in the `IDTR` register. At the next step after this we can see a call of the `copy_bootdata` function: - -```C -copy_bootdata(__va(real_mode_data)); -``` - -This function takes one argument - virtual address of the `real_mode_data`. Remember that we passed the address of the `boot_params` structure from [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L114) to the `x86_64_start_kernel` function as first argument in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S): - -``` - /* rsi is pointer to real mode structure with interesting info. - pass it to C */ - movq %rsi, %rdi -``` - -Now let's look at `__va` macro. This macro defined in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c): - -```C -#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) -``` - -where `PAGE_OFFSET` is `__PAGE_OFFSET` which is `0xffff880000000000` and the base virtual address of the direct mapping of all physical memory. So we're getting virtual address of the `boot_params` structure and pass it to the `copy_bootdata` function, where we copy `real_mod_data` to the `boot_params` which is declared in the [arch/x86/kernel/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.h) - -```C -extern struct boot_params boot_params; -``` - -Let's look at the `copy_boot_data` implementation: - -```C -static void __init copy_bootdata(char *real_mode_data) -{ - char * command_line; - unsigned long cmd_line_ptr; - - memcpy(&boot_params, real_mode_data, sizeof boot_params); - sanitize_boot_params(&boot_params); - cmd_line_ptr = get_cmd_line_ptr(); - if (cmd_line_ptr) { - command_line = __va(cmd_line_ptr); - memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE); - } -} -``` - -First of all, note that this function is declared with `__init` prefix. It means that this function will be used only during the initialization and used memory will be freed. - -We can see declaration of two variables for the kernel command line and copying `real_mode_data` to the `boot_params` with the `memcpy` function. The next call of the `sanitize_boot_params` function which fills some fields of the `boot_params` structure like `ext_ramdisk_image` and etc... if bootloaders which fail to initialize unknown fields in `boot_params` to zero. After this we're getting address of the command line with the call of the `get_cmd_line_ptr` function: - -```C -unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr; -cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32; -return cmd_line_ptr; -``` - -which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check that we got `cmd_line_pty`, getting its virtual address and copy it to the `boot_command_line` which is just an array of bytes: - -```C -extern char __initdata boot_command_line[]; -``` - -After this we will have copied kernel command line and `boot_params` structure. In the next step we can see call of the `load_ucode_bsp` function which loads processor microcode, but we will not see it here. - -After microcode was loaded we can see the check of the `console_loglevel` and the `early_printk` function which prints `Kernel Alive` string. But you'll never see this output because `early_printk` is not initilized yet. It is a minor bug in the kernel and i sent the patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) and you will see it in the mainline soon. So you can skip this code. - -Move on init pages --------------------------------------------------------------------------------- - -In the next step as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call: - -```C - clear_page(init_level4_pgt); -``` - -function and pass `init_level4_pgt` which defined also in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and looks: - -```assembly -NEXT_PAGE(init_level4_pgt) - .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE - .org init_level4_pgt + L4_PAGE_OFFSET*8, 0 - .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE - .org init_level4_pgt + L4_START_KERNEL*8, 0 - .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE -``` - -which maps first 2 gigabytes and 512 megabytes for the kernel code, data and bss. `clear_page` function defined in the [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S) let look on this function: - -```assembly -ENTRY(clear_page) - CFI_STARTPROC - xorl %eax,%eax - movl $4096/64,%ecx - .p2align 4 - .Lloop: - decl %ecx -#define PUT(x) movq %rax,x*8(%rdi) - movq %rax,(%rdi) - PUT(1) - PUT(2) - PUT(3) - PUT(4) - PUT(5) - PUT(6) - PUT(7) - leaq 64(%rdi),%rdi - jnz .Lloop - nop - ret - CFI_ENDPROC - .Lclear_page_end: - ENDPROC(clear_page) -``` - -As you can understart from the function name it clears or fills with zeros page tables. First of all note that this function starts with the `CFI_STARTPROC` and `CFI_ENDPROC` which are expands to GNU assembly directives: - -```C -#define CFI_STARTPROC .cfi_startproc -#define CFI_ENDPROC .cfi_endproc -``` - -and used for debugging. After `CFI_STARTPROC` macro we zero out `eax` register and put 64 to the `ecx` (it will be counter). Next we can see loop which starts with the `.Lloop` label and it starts from the `ecx` decrement. After it we put zero from the `rax` register to the `rdi` which contains the base address of the `init_level4_pgt` now and do the same procedure seven times but every time move `rdi` offset on 8. After this we will have first 64 bytes of the `init_level4_pgt` filled with zeros. In the next step we put the address of the `init_level4_pgt` with 64-bytes offset to the `rdi` again and repeat all operations which `ecx` is not zero. In the end we will have `init_level4_pgt` filled with zeros. - -As we have `init_level4_pgt` filled with zeros, we set the last `init_level4_pgt` entry to kernel high mapping with the: - -```C -init_level4_pgt[511] = early_level4_pgt[511]; -``` - -Remember that we dropped all `early_level4_pgt` entries in the `reset_early_page_table` function and kept only kernel high mapping there. - -The last step in the `x86_64_start_kernel` function is the call of the: - -```C -x86_64_start_reservations(real_mode_data); -``` - -function with the `real_mode_data` as argument. The `x86_64_start_reservations` function defined in the same source code file as the `x86_64_start_kernel` function and looks: - -```C -void __init x86_64_start_reservations(char *real_mode_data) -{ - if (!boot_params.hdr.version) - copy_bootdata(__va(real_mode_data)); - - reserve_ebda_region(); - - start_kernel(); -} -``` - -You can see that it is the last function before we are in the kernel entry point - `start_kernel` function. Let's look what it does and how it works. - -Last step before kernel entry point --------------------------------------------------------------------------------- - -First of all we can see in the `x86_64_start_reservations` function check for `boot_params.hdr.version`: - -```C -if (!boot_params.hdr.version) - copy_bootdata(__va(real_mode_data)); -``` - -and if it is not we call again `copy_bootdata` function with the virtual address of the `real_mode_data` (read about about it's implementation). - -In the next step we can see the call of the `reserve_ebda_region` function which defined in the [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c). This function reserves memory block for th `EBDA` or Extended BIOS Data Area. The Extended BIOS Data Area located in the top of conventional memory and contains data about ports, disk parameters and etc... - -Let's look on the `reserve_ebda_region` function. It starts from the checking is paravirtualization enabled or not: - -```C -if (paravirt_enabled()) - return; -``` - -we exit from the `reserve_ebda_region` function if paravirtualization is enabled because if it enabled the extended bios data area is absent. In the next step we need to get the end of the low memory: - -```C -lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES); -lowmem <<= 10; -``` - -We're getting the virtual address of the BIOS low memory in kilobytes and convert it to bytes with shifting it on 10 (multiply on 1024 in other words). After this we need to get the address of the extended BIOS data are with the: - -```C -ebda_addr = get_bios_ebda(); -``` - -where `get_bios_ebda` function defined in the [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bios_ebda.h) and looks like: - -```C -static inline unsigned int get_bios_ebda(void) -{ - unsigned int address = *(unsigned short *)phys_to_virt(0x40E); - address <<= 4; - return address; -} -``` - -Let's try to understand how it works. Here we can see that we converting physical address `0x40E` to the virtual, where `0x0040:0x000e` is the segment which contains base address of the extended BIOS data area. Don't worry that we are using `phys_to_virt` function for converting a physical address to virtual address. You can note that previously we have used `__va` macro for the same point, but `phys_to_virt` is the same: - -```C -static inline void *phys_to_virt(phys_addr_t address) -{ - return __va(address); -} -``` - -only with one difference: we pass argument with the `phys_addr_t` which depends on `CONFIG_PHYS_ADDR_T_64BIT`: - -```C -#ifdef CONFIG_PHYS_ADDR_T_64BIT - typedef u64 phys_addr_t; -#else - typedef u32 phys_addr_t; -#endif -``` - -This configuration option is enabled by `CONFIG_PHYS_ADDR_T_64BIT`. After that we got virtual address of the segment which stores the base address of the extended BIOS data area, we shift it on 4 and return. After this `ebda_addr` variables contains the base address of the extended BIOS data area. - -In the next step we check that address of the extended BIOS data area and low memory is not less than `INSANE_CUTOFF` macro - -```C -if (ebda_addr < INSANE_CUTOFF) - ebda_addr = LOWMEM_CAP; - -if (lowmem < INSANE_CUTOFF) - lowmem = LOWMEM_CAP; -``` - -which is: - -```C -#define INSANE_CUTOFF 0x20000U -``` - -or 128 kilobytes. In the last step we get lower part in the low memory and extended bios data area and call `memblock_reserve` function which will reserve memory region for extended bios data between low memory and one megabyte mark: - -```C -lowmem = min(lowmem, ebda_addr); -lowmem = min(lowmem, LOWMEM_CAP); -memblock_reserve(lowmem, 0x100000 - lowmem); -``` - -`memblock_reserve` function is defined at [mm/block.c](https://github.com/torvalds/linux/blob/master/mm/block.c) and takes two parameters: - -* base physical address; -* region size. - -and reserves memory region for the given base address and size. `memblock_reserve` is the first function in this book from linux kernel memory manager framework. We will take a closer look on memory manager soon, but now let's look at its implementation. - -First touch of the linux kernel memory manager framework --------------------------------------------------------------------------------- - -In the previous paragraph we stopped at the call of the `memblock_reserve` function and as i sad before it is the first function from the memory manager framework. Let's try to understand how it works. `memblock_reserve` function just calls: - -```C -memblock_reserve_region(base, size, MAX_NUMNODES, 0); -``` - -function and passes 4 parameters there: - -* physical base address of the memory region; -* size of the memory region; -* maximum number of numa nodes; -* flags. - -At the start of the `memblock_reserve_region` body we can see definition of the `memblock_type` structure: - -```C -struct memblock_type *_rgn = &memblock.reserved; -``` - -which presents the type of the memory block and looks: - -```C -struct memblock_type { - unsigned long cnt; - unsigned long max; - phys_addr_t total_size; - struct memblock_region *regions; -}; -``` - -As we need to reserve memory block for extended bios data area, the type of the current memory region is reserved where `memblock` structure is: - -```C -struct memblock { - bool bottom_up; - phys_addr_t current_limit; - struct memblock_type memory; - struct memblock_type reserved; -#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP - struct memblock_type physmem; -#endif -}; -``` - -and describes generic memory block. You can see that we initialize `_rgn` by assigning it to the address of the `memblock.reserved`. `memblock` is the global variable which looks: - -```C -struct memblock memblock __initdata_memblock = { - .memory.regions = memblock_memory_init_regions, - .memory.cnt = 1, - .memory.max = INIT_MEMBLOCK_REGIONS, - .reserved.regions = memblock_reserved_init_regions, - .reserved.cnt = 1, - .reserved.max = INIT_MEMBLOCK_REGIONS, -#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP - .physmem.regions = memblock_physmem_init_regions, - .physmem.cnt = 1, - .physmem.max = INIT_PHYSMEM_REGIONS, -#endif - .bottom_up = false, - .current_limit = MEMBLOCK_ALLOC_ANYWHERE, -}; -``` - -We will not dive into detail of this varaible, but we will see all details about it in the parts about memory manager. Just note that `memblock` variable defined with the `__initdata_memblock` which is: - -```C -#define __initdata_memblock __meminitdata -``` - -and `__meminit_data` is: - -```C -#define __meminitdata __section(.meminit.data) -``` - -From this we can conclude that all memory blocks will be in the `.meminit.data` section. After we defined `_rgn` we print information about it with `memblock_dbg` macros. You can enable it by passing `memblock=debug` to the kernel command line. - -After debugging lines were printed next is the call of the following function: - -```C -memblock_add_range(_rgn, base, size, nid, flags); -``` - -which adds new memory block region into the `.meminit.data` section. As we do not initlieze `_rgn` but it just contains `&memblock.reserved`, we just fill passed `_rgn` with the base address of the extended BIOS data area region, size of this region and flags: - -```C -if (type->regions[0].size == 0) { - WARN_ON(type->cnt != 1 || type->total_size); - type->regions[0].base = base; - type->regions[0].size = size; - type->regions[0].flags = flags; - memblock_set_region_node(&type->regions[0], nid); - type->total_size = size; - return 0; -} -``` - -After we filled our region we can see the call of the `memblock_set_region_node` function with two parameters: - -* address of the filled memory region; -* NUMA node id. - -where our regions represented by the `memblock_region` structure: - -```C -struct memblock_region { - phys_addr_t base; - phys_addr_t size; - unsigned long flags; -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP - int nid; -#endif -}; -``` - -NUMA node id depends on `MAX_NUMNODES` macro which is defined in the [include/linux/numa.h](https://github.com/torvalds/linux/blob/master/include/linux/numa.h): - -```C -#define MAX_NUMNODES (1 << NODES_SHIFT) -``` - -where `NODES_SHIFT` depends on `CONFIG_NODES_SHIFT` configuration parameter and defined as: - -```C -#ifdef CONFIG_NODES_SHIFT - #define NODES_SHIFT CONFIG_NODES_SHIFT -#else - #define NODES_SHIFT 0 -#endif -``` - -`memblick_set_region_node` function just fills `nid` field from `memblock_region` with the given value: - -```C -static inline void memblock_set_region_node(struct memblock_region *r, int nid) -{ - r->nid = nid; -} -``` - -After this we will have first reserved `memblock` for the extended bios data area in the `.meminit.data` section. `reserve_ebda_region` function finished its work on this step and we can go back to the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). - -We finished all preparations before the kernel entry point! The last step in the `x86_64_start_reservations` function is the call of the: - -```C -start_kernel() -``` - -function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) file. - -That's all for this part. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the third part about linux kernel internals. In next part we will see the first initialization steps in the kernel entry point - `start_kernel` function. It will be the first step before we will see launch of the first `init` process. - -If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [BIOS data area](http://stanislavs.org/helppc/bios_data_area.html) -* [What is in the extended BIOS data area on a PC?](http://www.kryslix.com/nsfaq/Q.6.html) -* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) diff --git a/Initialization/linux-initialization-4.md b/Initialization/linux-initialization-4.md deleted file mode 100644 index b249d16..0000000 --- a/Initialization/linux-initialization-4.md +++ /dev/null @@ -1,452 +0,0 @@ -Kernel initialization. Part 4. -================================================================================ - -Kernel entry point -================================================================================ - -If you have read the previous part - [Last preparations before the kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md), you can remember that we finished all pre-initialization stuff and stopped right before the call to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). The `start_kernel` is the entry of the generic and architecture independent kernel code, although we will return to the `arch/` folder many times. If you look inside of the `start_kernel` function, you will see that this function is very big. For this moment it contains about `86` calls of functions. Yes, it's very big and of course this part will not cover all the processes that occur in this function. In the current part we will only start to do it. This part and all the next which will be in the [Kernel initialization process](https://github.com/0xAX/linux-insides/blob/master/Initialization/README.md) chapter will cover it. - -The main purpose of the `start_kernel` to finish kernel initialization process and launch the first `init` process. Before the first process will be started, the `start_kernel` must do many things such as: to enable [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), to initialize processor id, to enable early [cgroups](http://en.wikipedia.org/wiki/Cgroups) subsystem, to setup per-cpu areas, to initialize different caches in [vfs](http://en.wikipedia.org/wiki/Virtual_file_system), to initialize memory manager, rcu, vmalloc, scheduler, IRQs, ACPI and many many more. Only after these steps we will see the launch of the first `init` process in the last part of this chapter. So much kernel code awaits us, let's start. - -**NOTE: All parts from this big chapter `Linux Kernel initialization process` will not cover anything about debugging. There will be a separate chapter about kernel debugging tips.** - -A little about function attributes ---------------------------------------------------------------------------------- - -As I wrote above, the `start_kernel` function is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). This function defined with the `__init` attribute and as you already may know from other parts, all functions which are defined with this attribute are necessary during kernel initialization. - -```C -#define __init __section(.init.text) __cold notrace -``` - -After the initialization process will be finished, the kernel will release these sections with a call to the `free_initmem` function. Note also that `__init` is defined with two attributes: `__cold` and `notrace`. The purpose of the first `cold` attribute is to mark that the function is rarely used and the compiler must optimize this function for size. The second `notrace` is defined as: - -```C -#define notrace __attribute__((no_instrument_function)) -``` - -where `no_instrument_function` says to the compiler not to generate profiling function calls. - -In the definition of the `start_kernel` function, you can also see the `__visible` attribute which expands to the: - -``` -#define __visible __attribute__((externally_visible)) -``` - -where `externally_visible` tells to the compiler that something uses this function or variable, to prevent marking this function/variable as `unusable`. You can find the definition of this and other macro attributes in [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h). - -First steps in the start_kernel --------------------------------------------------------------------------------- - -At the beginning of the `start_kernel` you can see the definition of these two variables: - -```C -char *command_line; -char *after_dashes; -``` - -The first represents a pointer to the kernel command line and the second will contain the result of the `parse_args` function which parses an input string with parameters in the form `name=value`, looking for specific keywords and invoking the right handlers. We will not go into the details related with these two variables at this time, but will see it in the next parts. In the next step we can see a call to the: - -```C -lockdep_init(); -``` - -function. `lockdep_init` initializes [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). Its implementation is pretty simple, it just initializes two [list_head](https://github.com/0xAX/linux-insides/blob/master/DataStructures/dlist.md) hashes and sets the `lockdep_initialized` global variable to `1`. Lock validator detects circular lock dependencies and is called when any [spinlock](http://en.wikipedia.org/wiki/Spinlock) or [mutex](http://en.wikipedia.org/wiki/Mutual_exclusion) is acquired. - -The next function is `set_task_stack_end_magic` which takes address of the `init_task` and sets `STACK_END_MAGIC` (`0x57AC6E9D`) as canary for it. `init_task` represents the initial task structure: - -```C -struct task_struct init_task = INIT_TASK(init_task); -``` - -where `task_struct` stores all the information about a process. I will not explain this structure in this book because it's very big. You can find its definition in [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L1278). At this moment `task_struct` contains more than `100` fields! Although you will not see the explanation of the `task_struct` in this book, we will use it very often since it is the fundamental structure which describes the `process` in the Linux kernel. I will describe the meaning of the fields of this structure as we meet them in practice. - -You can see the definition of the `init_task` and it initialized by the `INIT_TASK` macro. This macro is from [include/linux/init_task.h](https://github.com/torvalds/linux/blob/master/include/linux/init_task.h) and it just fills the `init_task` with the values for the first process. For example it sets: - -* init process state to zero or `runnable`. A runnable process is one which is waiting only for a CPU to run on; -* init process flags - `PF_KTHREAD` which means - kernel thread; -* a list of runnable task; -* process address space; -* init process stack to the `&init_thread_info` which is `init_thread_union.thread_info` and `initthread_union` has type - `thread_union` which contains `thread_info` and process stack: - -```C -union thread_union { - struct thread_info thread_info; - unsigned long stack[THREAD_SIZE/sizeof(long)]; -}; -``` - -Every process has its own stack and it is 16 killobytes or 4 page frames. in `x86_64`. We can note that it is defined as array of `unsigned long`. The next field of the `thread_union` is - `thread_info` defined as: - -```C -struct thread_info { - struct task_struct *task; - struct exec_domain *exec_domain; - __u32 flags; - __u32 status; - __u32 cpu; - int saved_preempt_count; - mm_segment_t addr_limit; - struct restart_block restart_block; - void __user *sysenter_return; - unsigned int sig_on_uaccess_error:1; - unsigned int uaccess_err:1; -}; -``` - -and occupies 52 bytes. The `thread_info` structure contains architecture-specific information on the thread. We know that on `x86_64` the stack grows down and `thread_union.thread_info` is stored at the bottom of the stack in our case. So the process stack is 16 killobytes and `thread_info` is at the bottom. The remaining thread_size will be `16 killobytes - 62 bytes = 16332 bytes`. Note that `thread_unioun` represented as the [union](http://en.wikipedia.org/wiki/Union_type) and not structure, it means that `thread_info` and stack share the memory space. - -Schematically it can be represented as follows: - -```C -+-----------------------+ -| | -| | -| stack | -| | -|_______________________| -| | | -| | | -| | | -|__________↓____________| +--------------------+ -| | | | -| thread_info |<----------->| task_struct | -| | | | -+-----------------------+ +--------------------+ -``` - -http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct - -So the `INIT_TASK` macro fills these `task_struct's` fields and many many more. As I already wrote about, I will not describe all the fields and values in the `INIT_TASK` macro but we will see them soon. - -Now let's go back to the `set_task_stack_end_magic` function. This function defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c#L297) and sets a [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow) to the `init` process stack to prevent stack overflow. - -```C -void set_task_stack_end_magic(struct task_struct *tsk) -{ - unsigned long *stackend; - stackend = end_of_stack(tsk); - *stackend = STACK_END_MAGIC; /* for overflow detection */ -} -``` - -Its implementation is simple. `set_task_stack_end_magic` gets the end of the stack for the given `task_struct` with the `end_of_stack` function. The end of a process stack depends on the `CONFIG_STACK_GROWSUP` configuration option. As we learn in `x86_64` architecture, the stack grows down. So the end of the process stack will be: - -```C -(unsigned long *)(task_thread_info(p) + 1); -``` - -where `task_thread_info` just returns the stack which we filled with the `INIT_TASK` macro: - -```C -#define task_thread_info(task) ((struct thread_info *)(task)->stack) -``` - -As we got the end of the init process stack, we write `STACK_END_MAGIC` there. After `canary` is set, we can check it like this: - -```C -if (*end_of_stack(task) != STACK_END_MAGIC) { - // - // handle stack overflow here - // -} -``` - -The next function after the `set_task_stack_end_magic` is `smp_setup_processor_id`. This function has an empty body for `x86_64`: - -```C -void __init __weak smp_setup_processor_id(void) -{ -} -``` - -as it not implemented for all architectures, but some such as [s390](http://en.wikipedia.org/wiki/IBM_ESA/390) and [arm64](http://en.wikipedia.org/wiki/ARM_architecture#64.2F32-bit_architecture). - -The next function in `start_kernel` is `debug_objects_early_init`. Implementation of this function is almost the same as `lockdep_init`, but fills hashes for object debugging. As I wrote about, we will not see the explanation of this and other functions which are for debugging purposes in this chapter. - -After the `debug_object_early_init` function we can see the call of the `boot_init_stack_canary` function which fills `task_struct->canary` with the canary value for the `-fstack-protector` gcc feature. This function depends on the `CONFIG_CC_STACKPROTECTOR` configuration option and if this option is disabled, `boot_init_stack_canary` does nothing, otherwise it generates random numbers based on random pool and the [TSC](http://en.wikipedia.org/wiki/Time_Stamp_Counter): - -```C -get_random_bytes(&canary, sizeof(canary)); -tsc = __native_read_tsc(); -canary += tsc + (tsc << 32UL); -``` - -After we got a random number, we fill the `stack_canary` field of `task_struct` with it: - -```C -current->stack_canary = canary; -``` - -and write this value to the top of the IRQ stack with the: - -```C -this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write -``` - -Again, we will not dive into details here, we will cover it in the part about [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). As canary is set, we disable local and early boot IRQs and register the bootstrap CPU in the CPU maps. We disable local IRQs (interrupts for current CPU) with the `local_irq_disable` macro which expands to the call of the `arch_local_irq_disable` function from [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h): - -```C -static inline notrace void arch_local_irq_enable(void) -{ - native_irq_enable(); -} -``` - -Where `native_irq_enable` is `cli` instruction for `x86_64`. As interrupts are disabled we can register the current CPU with the given ID in the CPU bitmap. - -The first processor activation ---------------------------------------------------------------------------------- - -The current function from the `start_kernel` is `boot_cpu_init`. This function initializes various CPU masks for the bootstrap processor. First of all it gets the bootstrap processor id with a call to: - -```C -int cpu = smp_processor_id(); -``` - -For now it is just zero. If the `CONFIG_DEBUG_PREEMPT` configuration option is disabled, `smp_processor_id` just expands to the call of `raw_smp_processor_id` which expands to the: - -```C -#define raw_smp_processor_id() (this_cpu_read(cpu_number)) -``` - -`this_cpu_read` as many other function like this (`this_cpu_write`, `this_cpu_add` and etc...) defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) and presents `this_cpu` operation. These operations provide a way of optimizing access to the [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Theory/per-cpu.html) variables which are associated with the current processor. In our case it is `this_cpu_read`: - -``` -__pcpu_size_call_return(this_cpu_read_, pcp) -``` - -Remember that we have passed `cpu_number` as `pcp` to the `this_cpu_read` from the `raw_smp_processor_id`. Now let's look at the `__pcpu_size_call_return` implementation: - -```C -#define __pcpu_size_call_return(stem, variable) \ -({ \ - typeof(variable) pscr_ret__; \ - __verify_pcpu_ptr(&(variable)); \ - switch(sizeof(variable)) { \ - case 1: pscr_ret__ = stem##1(variable); break; \ - case 2: pscr_ret__ = stem##2(variable); break; \ - case 4: pscr_ret__ = stem##4(variable); break; \ - case 8: pscr_ret__ = stem##8(variable); break; \ - default: \ - __bad_size_call_parameter(); break; \ - } \ - pscr_ret__; \ -}) -``` - -Yes, it looks a little strange but it's easy. First of all we can see the definition of the `pscr_ret__` variable with the `int` type. Why int? Ok, `variable` is `common_cpu` and it was declared as per-cpu int variable: - -```C -DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number); -``` - -In the next step we call `__verify_pcpu_ptr` with the address of `cpu_number`. `__veryf_pcpu_ptr` used to verify that the given parameter is a per-cpu pointer. After that we set `pscr_ret__` value which depends on the size of the variable. Our `common_cpu` variable is `int`, so it 4 bytes in size. It means that we will get `this_cpu_read_4(common_cpu)` in `pscr_ret__`. In the end of the `__pcpu_size_call_return` we just call it. `this_cpu_read_4` is a macro: - -```C -#define this_cpu_read_4(pcp) percpu_from_op("mov", pcp) -``` - -which calls `percpu_from_op` and pass `mov` instruction and per-cpu variable there. `percpu_from_op` will expand to the inline assembly call: - -```C -asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (common_cpu)) -``` - -Let's try to understand how it works and what it does. The `gs` segment register contains the base of per-cpu area. Here we just copy `common_cpu` which is in memory to the `pfo_ret__` with the `movl` instruction. Or with another words: - -```C -this_cpu_read(common_cpu) -``` - -is the same as: - -```C -movl %gs:$common_cpu, $pfo_ret__ -``` - -As we didn't setup per-cpu area, we have only one - for the current running CPU, we will get `zero` as a result of the `smp_processor_id`. - -As we got the current processor id, `boot_cpu_init` sets the given CPU online, active, present and possible with the: - -```C -set_cpu_online(cpu, true); -set_cpu_active(cpu, true); -set_cpu_present(cpu, true); -set_cpu_possible(cpu, true); -``` - -All of these functions use the concept - `cpumask`. `cpu_possible` is a set of CPU ID's which can be plugged in at any time during the life of that system boot. `cpu_present` represents which CPUs are currently plugged in. `cpu_online` represents subset of the `cpu_present` and indicates CPUs which are available for scheduling. These masks depend on the `CONFIG_HOTPLUG_CPU` configuration option and if this option is disabled `possible == present` and `active == online`. Implementation of the all of these functions are very similar. Every function checks the second parameter. If it is `true`, it calls `cpumask_set_cpu` or `cpumask_clear_cpu` otherwise. - -For example let's look at `set_cpu_possible`. As we passed `true` as the second parameter, the: - -```C -cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits)); -``` - -will be called. First of all let's try to understand the `to_cpu_mask` macro. This macro casts a bitmap to a `struct cpumask *`. CPU masks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. CPU mask presented by the `cpu_mask` structure: - -```C -typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t; -``` - -which is just bitmap declared with the `DECLARE_BITMAP` macro: - -```C -#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)] -``` - -As we can see from its definition, the `DECLARE_BITMAP` macro expands to the array of `unsigned long`. Now let's look at how the `to_cpumask` macro is implemented: - -```C -#define to_cpumask(bitmap) \ - ((struct cpumask *)(1 ? (bitmap) \ - : (void *)sizeof(__check_is_bitmap(bitmap)))) -``` - -I don't know about you, but it looked really weird for me at the first time. We can see a ternary operator here which is `true` every time, but why the `__check_is_bitmap` here? It's simple, let's look at it: - -```C -static inline int __check_is_bitmap(const unsigned long *bitmap) -{ - return 1; -} -``` - -Yeah, it just returns `1` every time. Actually we need in it here only for one purpose: at compile time it checks that the given `bitmap` is a bitmap, or in other words it checks that the given `bitmap` has a type of `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting the array of `unsigned long` to the `struct cpumask *`. Now we can call `cpumask_set_cpu` function with the `cpu` - 0 and `struct cpumask *cpu_possible_bits`. This function makes only one call of the `set_bit` function which sets the given `cpu` in the cpumask. All of these `set_cpu_*` functions work on the same principle. - -If you're not sure that this `set_cpu_*` operations and `cpumask` are not clear for you, don't worry about it. You can get more info by reading the special part about it - [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) or [documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). - -As we activated the bootstrap processor, it's time to go to the next function in the `start_kernel.` Now it is `page_address_init`, but this function does nothing in our case, because it executes only when all `RAM` can't be mapped directly. - -Print linux banner ---------------------------------------------------------------------------------- - -The next call is `pr_notice`: - -```C -#define pr_notice(fmt, ...) \ - printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__) -``` - -as you can see it just expands to the `printk` call. At this moment we use `pr_notice` to print the Linux banner: - -```C -pr_notice("%s", linux_banner); -``` - -which is just the kernel version with some additional parameters: - -``` -Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #319 SMP -``` - -Architecture-dependent parts of initialization ---------------------------------------------------------------------------------- - -The next step is architecture-specific initializations. The Linux kernel does it with the call of the `setup_arch` function. This is a very big function like `start_kernel` and we do not have time to consider all of its implementation in this part. Here we'll only start to do it and continue in the next part. As it is `architecture-specific`, we need to go again to the `arch/` directory. The `setup_arch` function defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and takes only one argument - address of the kernel command line. - -This function starts from the reserving memory block for the kernel `_text` and `_data` which starts from the `_text` symbol (you can remember it from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L46)) and ends before `__bss_stop`. We are using `memblock` for the reserving of memory block: - -```C -memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text); -``` - -You can read about `memblock` in the [Linux kernel memory management Part 1.](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). As you can remember `memblock_reserve` function takes two parameters: - -* base physical address of a memory block; -* size of a memory block. - -We can get the base physical address of the `_text` symbol with the `__pa_symbol` macro: - -```C -#define __pa_symbol(x) \ - __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x))) -``` - -First of all it calls `__phys_reloc_hide` macro on the given parameter. The `__phys_reloc_hide` macro does nothing for `x86_64` and just returns the given parameter. Implementation of the `__phys_addr_symbol` macro is easy. It just subtracts the symbol address from the base address of the kernel text mapping base virtual address (you can remember that it is `__START_KERNEL_map`) and adds `phys_base` which is the base address of `_text`: - -```C -#define __phys_addr_symbol(x) \ - ((unsigned long)(x) - __START_KERNEL_map + phys_base) -``` - -After we got the physical address of the `_text` symbol, `memblock_reserve` can reserve a memory block from the `_text` to the `__bss_stop - _text`. - -Reserve memory for initrd ---------------------------------------------------------------------------------- - -In the next step after we reserved place for the kernel text and data is reserving place for the [initrd](http://en.wikipedia.org/wiki/Initrd). We will not see details about `initrd` in this post, you just may know that it is temporary root file system stored in memory and used by the kernel during its startup. The `early_reserve_initrd` function does all work. First of all this function gets the base address of the ram disk, its size and the end address with: - -```C -u64 ramdisk_image = get_ramdisk_image(); -u64 ramdisk_size = get_ramdisk_size(); -u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size); -``` - -All of these parameters are taken from `boot_params`. If you have read the chapter about [Linux Kernel Booting Process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html), you must remember that we filled the `boot_params` structure during boot time. The kernel setup header contains a couple of fields which describes ramdisk, for example: - -``` -Field name: ramdisk_image -Type: write (obligatory) -Offset/size: 0x218/4 -Protocol: 2.00+ - - The 32-bit linear address of the initial ramdisk or ramfs. Leave at - zero if there is no initial ramdisk/ramfs. -``` - -So we can get all the information that interests us from `boot_params`. For example let's look at `get_ramdisk_image`: - -```C -static u64 __init get_ramdisk_image(void) -{ - u64 ramdisk_image = boot_params.hdr.ramdisk_image; - - ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32; - - return ramdisk_image; -} -``` - -Here we get the address of the ramdisk from the `boot_params` and shift left it on `32`. We need to do it because as you can read in the [Documentation/x86/zero-page.txt](https://github.com/0xAX/linux/blob/master/Documentation/x86/zero-page.txt): - -``` -0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits -``` - -So after shifting it on 32, we're getting a 64-bit address in `ramdisk_image` and we return it. `get_ramdisk_size` works on the same principle as `get_ramdisk_image`, but it used `ext_ramdisk_size` instead of `ext_ramdisk_image`. After we got ramdisk's size, base address and end address, we check that bootloader provided ramdisk with the: - -```C -if (!boot_params.hdr.type_of_loader || - !ramdisk_image || !ramdisk_size) - return; -``` - -and reserve memory block with the calculated addresses for the initial ramdisk in the end: - -```C -memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image); -``` - -Conclusion ---------------------------------------------------------------------------------- - -It is the end of the fourth part about the Linux kernel initialization process. We started to dive in the kernel generic code from the `start_kernel` function in this part and stopped on the architecture-specific initializations in the `setup_arch`. In the next part we will continue with architecture-dependent initialization steps. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [GCC function attributes](https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html) -* [this_cpu operations](https://www.kernel.org/doc/Documentation/this_cpu_ops.txt) -* [cpumask](http://www.crashcourse.ca/wiki/index.php/Cpumask) -* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) -* [cgroups](http://en.wikipedia.org/wiki/Cgroups) -* [stack buffer overflow](http://en.wikipedia.org/wiki/Stack_buffer_overflow) -* [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) -* [initrd](http://en.wikipedia.org/wiki/Initrd) -* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md) diff --git a/Initialization/linux-initialization-5.md b/Initialization/linux-initialization-5.md deleted file mode 100644 index eb49234..0000000 --- a/Initialization/linux-initialization-5.md +++ /dev/null @@ -1,512 +0,0 @@ -Kernel initialization. Part 5. -================================================================================ - -Continue of architecture-specific initializations -================================================================================ - -In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L856) function and will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will miss functions related with it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts gate. If you don't know anything about interrupts, you can read about it in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You can remember, we already saw it in the [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts: and exceptions: - -``` ----------------------------------------------------------------------------------------------- -|Vector|Mnemonic|Description |Type |Error Code|Source | ----------------------------------------------------------------------------------------------- -|3 | #BP |Breakpoint |Trap |NO |INT 3 | ----------------------------------------------------------------------------------------------- -``` - -Debug interrupt `#DB` is the primary means of invoking debuggers. `early_trap_init` defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). This functions sets `#DB` and `#BP` handlers and reloads [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table): - -```C -void __init early_trap_init(void) -{ - set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK); - set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK); - load_idt(&idt_descr); -} -``` - -We already saw implementation of the `set_intr_gate` in the previous part about interrupts. Here are two similar functions `set_intr_gate_ist` and `set_system_intr_gate_ist`. Both of these two functions take two parameters: - -* number of the interrupt; -* base address of the interrupt/exception handler; -* third parameter is - `Interrupt Stack Table`. `IST` is a new mechanism in the `x86_64` and part of the [TSS](http://en.wikipedia.org/wiki/Task_state_segment). Every active thread in kernel mode has own kernel stack which is 16 killobytes. While a thread in user space, kernel stack is empty except `thread_info` (read about it previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)) at the bottom. In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. All about these stack you can read in the linux kernel documentation - [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks). `x86_64` provides feature which allows to switch to a new `special` stack for during any events as non-maskable interrupt and etc... And the name of this feature is - `Interrupt Stack Table`. There can be up to 7 `IST` entries per CPU and every entry points to the dedicated stack. In our case this is `DEBUG_STACK`. - -`set_intr_gate_ist` and `set_system_intr_gate_ist` work by the same principle as `set_intr_gate` with only one difference. Both of these functions checks -interrupt number and call `_set_gate` inside: - -```C -BUG_ON((unsigned)n > 0xFF); -_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS); -``` - -as `set_intr_gate` does this. But `set_intr_gate` calls `_set_gate` with [dpl](http://en.wikipedia.org/wiki/Privilege_level) - 0, and ist - 0, but `set_intr_gate_ist` and `set_system_intr_gate_ist` sets `ist` as `DEBUG_STACK` and `set_system_intr_gate_ist` sets `dpl` as `0x3` which is the lowest privilege. When an interrupt occurs and the hardware loads such a descriptor, then hardware automatically sets the new stack pointer based on the IST value, then invokes the interrupt handler. All of the special kernel stacks will be setted in the `cpu_init` function (we will see it later). - -As `#DB` and `#BP` gates written to the `idt_descr`, we reload `IDT` table with `load_idt` which just cals `ldtr` instruction. Now let's look on interrupt handlers and will try to understand how they works. Of course, I can't cover all interrupt handlers in this book and I do not see the point in this. It is very interesting to delve in the linux kernel source code, so we will see how `debug` handler implemented in this part, and understand how other interrupt handlers are implemented will be your task. - -#DB handler --------------------------------------------------------------------------------- - -As you can read above, we passed address of the `#DB` handler as `&debug` in the `set_intr_gate_ist`. [lxr.free-electorns.com](http://lxr.free-electrons.com/ident) is a great resource for searching identificators in the linux kernel source code, but unfortunately you will not find `debug` handler with it. All of you can find, it is `debug` definition in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/traps.h): - -```C -asmlinkage void debug(void); -``` - -We can see `asmlinkage` attribute which tells to us that `debug` is function written with [assembly](http://en.wikipedia.org/wiki/Assembly_language). Yeah, again and again assembly :). Implementation of the `#DB` handler as other handlers is in this [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) and defined with the `idtentry` assembly macro: - -```assembly -idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -``` - -`idtentry` is a macro which defines an interrupt/exception entry point. As you can see it takes five arguments: - -* name of the interrupt entry point; -* name of the interrupt handler; -* has interrupt error code or not; -* paranoid - if this parameter = 1, switch to special stack (read above); -* shift_ist - stack to switch during interrupt. - -Now let's look on `idtentry` macro implementation. This macro defined in the same assembly file and defines `debug` function with the `ENTRY` macro. For the start `idtentry` macro checks that given parameters are correct in case if need to switch to the special stack. In the next step it checks that give interrupt returns error code. If interrupt does not return error code (in our case `#DB` does not return error code), it calls `INTR_FRAME` or `XCPT_FRAME` if interrupt has error code. Both of these macros `XCPT_FRAME` and `INTR_FRAME` do nothing and need only for the building initial frame state for interrupts. They uses `CFI` directives and used for debugging. More info you can find in the [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html). As comment from the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) says: `CFI macros are used to generate dwarf2 unwind information for better backtraces. They don't change any code.` so we will ignore them. - -```assembly -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 -ENTRY(\sym) - /* Sanity check */ - .if \shift_ist != -1 && \paranoid == 0 - .error "using shift_ist requires paranoid=1" - .endif - - .if \has_error_code - XCPT_FRAME - .else - INTR_FRAME - .endif - ... - ... - ... -``` - -You can remember from the previous part about early interrupts/exceptions handling that after interrupt occurs, current stack will have following format: - -``` - +-----------------------+ - | | -+40 | SS | -+32 | RSP | -+24 | RFLAGS | -+16 | CS | -+8 | RIP | - 0 | Error Code | <---- rsp - | | - +-----------------------+ -``` - -The next two macro from the `idtentry` implementation are: - -```assembly - ASM_CLAC - PARAVIRT_ADJUST_EXCEPTION_FRAME -``` - -First `ASM_CLAC` macro depends on `CONFIG_X86_SMAP` configuration option and need for security resason, more about it you can read [here](https://lwn.net/Articles/517475/). The second `PARAVIRT_ADJUST_EXCEPTION_FRAME` macro is for handling handle Xen-type-exceptions (this chapter about kernel initializations and we will not consider virtualization stuff here). - -The next piece of code checks is interrupt has error code or not and pushes `$-1` which is `0xffffffffffffffff` on `x86_64` on the stack if not: - -```assembly - .ifeq \has_error_code - pushq_cfi $-1 - .endif -``` - -We need to do it as `dummy` error code for stack consistency for all interrupts. In the next step we subscract from the stack pointer `$ORIG_RAX-R15`: - -```assembly - subq $ORIG_RAX-R15, %rsp -``` - -where `ORIRG_RAX`, `R15` and other macros defined in the [arch/x86/include/asm/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/calling.h) and `ORIG_RAX-R15` is 120 bytes. General purpose registers will occupy these 120 bytes because we need to store all registers on the stack during interrupt handling. After we set stack for general purpose registers, the next step is checking that interrupt came from userspace with: - -```assembly -testl $3, CS(%rsp) -jnz 1f -``` - -Here we checks first and second bits in the `CS`. You can remember that `CS` register contains segment selector where first two bits are `RPL`. All privilege levels are integers in the range 0–3, where the lowest number corresponds to the highest privilege. So if interrupt came from the kernel mode we call `save_paranoid` or jump on label `1` if not. In the `save_paranoid` we store all general purpose registers on the stack and switch user `gs` on kernel `gs` if need: - -```assembly - movl $1,%ebx - movl $MSR_GS_BASE,%ecx - rdmsr - testl %edx,%edx - js 1f - SWAPGS - xorl %ebx,%ebx -1: ret -``` - -In the next steps we put `pt_regs` pointer to the `rdi`, save error code in the `rsi` if it is and call interrupt handler which is - `do_debug` in our case from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). `do_debug` like other handlers takes two parameters: - -* pt_regs - is a structure which presents set of CPU registers which are saved in the process' memory region; -* error code - error code of interrupt. - -After interrupt handler finished its work, calls `paranoid_exit` which restores stack, switch on userspace if interrupt came from there and calls `iret`. That's all. Of course it is not all :), but we will see more deeply in the separate chapter about interrupts. - -This is general view of the `idtentry` macro for `#DB` interrupt. All interrupts are similar on this implementation and defined with idtentry too. After `early_trap_init` finished its work, the next function is `early_cpu_init`. This function defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) and collects information about a CPU and its vendor. - -Early ioremap initialization --------------------------------------------------------------------------------- - -The next step is initialization of early `ioremap`. In general there are two ways to comminicate with devices: - -* I/O Ports; -* Device memory. - -We already saw first method (`outb/inb` instructions) in the part about linux kernel booting [process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html). The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM which can be mapped on memory of the I/O device. So `ioremap` used to map device memory into kernel address space. - -As i wrote above next function is the `early_ioremap_init` which re-maps I/O memory to kernel address space so it can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like `ioremap` are available. Implementation of this function is in the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). At the start of the `early_ioremap_init` we can see definition of the `pmd` point with `pmd_t` type (which presents page middle directory entry `typedef struct { pmdval_t pmd; } pmd_t;` where `pmdval_t` is `unsigned long`) and make a check that `fixmap` aligned in a correct way: - -```C -pmd_t *pmd; -BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1)); -``` - -`fixmap` - is fixed virtual address mappings which extends from `FIXADDR_START` to `FIXADDR_TOP`. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check `early_ioremap_init` makes a call of the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c). `early_ioremap_setup` fills `slot_virt` arry of the `unsigned long` with virtual addresses with 512 temporary boot-time fix-mappings: - -```C -for (i = 0; i < FIX_BTMAPS_SLOTS; i++) - slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i); -``` - -After this we get page middle directory entry for the `FIX_BTMAP_BEGIN` and put to the `pmd` variable, fills with zeros `bm_pte` which is boot time page tables and call `pmd_populate_kernel` function for setting given page table entry in the given page middle directory: - -```C -pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN)); -memset(bm_pte, 0, sizeof(bm_pte)); -pmd_populate_kernel(&init_mm, pmd, bm_pte); -``` - -That's all for this. If you feeling missunderstanding, don't worry. There is special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) chapter. - -Obtaining major and minor numbers for the root device --------------------------------------------------------------------------------- - -After early `ioremap` was initialized, you can see the following code: - -```C -ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev); -``` - -This code obtains major and minor numbers for the root device where `initrd` will be mounted later in the `do_mount_root` function. Major number of the device identifies a driver associated with the device. Minor number referred on the device controlled by driver. Note that `old_decode_dev` takes one parameter from the `boot_params_structure`. As we can read from the x86 linux kernel boot protocol: - -``` -Field name: root_dev -Type: modify (optional) -Offset/size: 0x1fc/2 -Protocol: ALL - - The default root device device number. The use of this field is - deprecated, use the "root=" option on the command line instead. -``` - -Now let's try understand what is it `old_decode_dev`. Actually it just calls `MKDEV` inside which generates `dev_t` from the give major and minor numbers. It's implementation pretty easy: - -```C -static inline dev_t old_decode_dev(u16 val) -{ - return MKDEV((val >> 8) & 255, val & 255); -} -``` - -where `dev_t` is a kernel data type to present major/minor number pair. But what's the strange `old_` prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bit for major number and 8 bit for minor number. But there is problem with this way: 256 major numbers and 256 minor numbers are possible. So 16-bit integer was replaced with 32-bit integer where 12 bits reserved for major number and 20 bits for minor. You can see this in the `new_decode_dev` implementation: - -```C -static inline dev_t new_decode_dev(u32 dev) -{ - unsigned major = (dev & 0xfff00) >> 8; - unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00); - return MKDEV(major, minor); -} -``` - -After calculation we will get `0xfff` or 12 bits for `major` if it is `0xffffffff` and `0xfffff` or 20 bits for `minor`. So in the end of execution of the `old_decode_dev` we will get major and minor numbers for the root device in `ROOT_DEV`. - -Memory map setup --------------------------------------------------------------------------------- - -The next point is the setup of the memory map with the call of the `setup_memory_map` function. But before this we setup different parameters as information about a screen (current row and column, video page and etc... (you can read about it in the [Video mode initialization and transition to protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html))), Extended display identification data, video mode, bootloader_type and etc...: - -```C - screen_info = boot_params.screen_info; - edid_info = boot_params.edid_info; - saved_video_mode = boot_params.hdr.vid_mode; - bootloader_type = boot_params.hdr.type_of_loader; - if ((bootloader_type >> 4) == 0xe) { - bootloader_type &= 0xf; - bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4; - } - bootloader_version = bootloader_type & 0xf; - bootloader_version |= boot_params.hdr.ext_loader_ver << 4; -``` - -All of these parameters we got during boot time and stored in the `boot_params` structure. After this we need to setup the end of the I/O memory. As you know the one of the main purposes of the kernel is resource management. And one of the resource is a memory. As we already know there are two ways to communicate with devices are I/O ports and device memory. All information about registered resources available through: - -* /proc/ioports - provides a list of currently registered port regions used for input or output communication with a device; -* /proc/iomem - provides current map of the system's memory for each physical device. - -At the moment we are interested in `/proc/iomem`: - -``` -cat /proc/iomem -00000000-00000fff : reserved -00001000-0009d7ff : System RAM -0009d800-0009ffff : reserved -000a0000-000bffff : PCI Bus 0000:00 -000c0000-000cffff : Video ROM -000d0000-000d3fff : PCI Bus 0000:00 -000d4000-000d7fff : PCI Bus 0000:00 -000d8000-000dbfff : PCI Bus 0000:00 -000dc000-000dffff : PCI Bus 0000:00 -000e0000-000fffff : reserved - 000e0000-000e3fff : PCI Bus 0000:00 - 000e4000-000e7fff : PCI Bus 0000:00 - 000f0000-000fffff : System ROM -``` - -As you can see range of addresses are shown in hexadecimal notation with its owner. Linux kernel provides API for managing any resources in a general way. Global resources (for example PICs or I/O ports) can be divided into subsets - relating to any hardware bus slot. The main structure `resource`: - -```C -struct resource { - resource_size_t start; - resource_size_t end; - const char *name; - unsigned long flags; - struct resource *parent, *sibling, *child; -}; -``` - -presents abstraction for a tree-like subset of system resources. This structure provides range of addresses from `start` to `end` (`resource_size_t` is `phys_addr_t` or `u64` for `x86_64`) which a resource covers, `name` of a resource (you see these names in the `/proc/iomem` output) and `flags` of a resource (All resources flags defined in the [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h)). The last are three pointers to the `resource` structure. These pointers enable a tree-like structure: - -``` -+-------------+ +-------------+ -| | | | -| parent |------| sibling | -| | | | -+-------------+ +-------------+ - | - | -+-------------+ -| | -| child | -| | -+-------------+ -``` - -Every subset of resources has root range resources. For `iomem` it is `iomem_resource` which defined as: - -```C -struct resource iomem_resource = { - .name = "PCI mem", - .start = 0, - .end = -1, - .flags = IORESOURCE_MEM, -}; -EXPORT_SYMBOL(iomem_resource); -``` - -TODO EXPORT_SYMBOL - -`iomem_resource` defines root addresses range for io memory with `PCI mem` name and `IORESOURCE_MEM` (`0x00000200`) as flags. As i wrote about our current point is setup the end address of the `iomem`. We will do it with: - -```C -iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1; -``` - -Here we shift `1` on `boot_cpu_data.x86_phys_bits`. `boot_cpu_data` is `cpuinfo_x86` structure which we filled during execution of the `early_cpu_init`. As you can understand from the name of the `x86_phys_bits` field, it presents maximum bits amount of the maximum physical address in the system. Note also that `iomem_resource` passed to the `EXPORT_SYMBOL` macro. This macro exports the given symbol (`iomem_resource` in our case) for dynamic linking or in another words it makes a symbol accessible to dynamically loaded modules. - -As we set the end address of the root `iomem` resource address range, as I wrote about the next step will be setup of the memory map. It will be produced with the call of the `setup_memory_map` function: - -```C -void __init setup_memory_map(void) -{ - char *who; - - who = x86_init.resources.memory_setup(); - memcpy(&e820_saved, &e820, sizeof(struct e820map)); - printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n"); - e820_print_map(who); -} -``` - -First of all we call look here the call of the `x86_init.resources.memory_setup`. `x86_init` is a `x86_init_ops` structure which presents platform specific setup functions as resources initializtion, pci initialization and etc... Initiaization of the `x86_init` is in the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c). I will not give here the full description because it is very long, but only one part which interests us for now: - -```C -struct x86_init_ops x86_init __initdata = { - .resources = { - .probe_roms = probe_roms, - .reserve_resources = reserve_standard_io_resources, - .memory_setup = default_machine_specific_memory_setup, - }, - ... - ... - ... -} -``` - -As we can see here `memry_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected in the [boot time](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill `e820map` structure with the memory regions. As all regions collect, print of all regions with printk. You can find this print if you execute `dmesg` command, you must see something like this: - -``` -[ 0.000000] e820: BIOS-provided physical RAM map: -[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable -[ 0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved -[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved -[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable -[ 0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS -[ 0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable -[ 0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved -[ 0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable -[ 0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved -[ 0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable -[ 0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS -[ 0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved -[ 0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable -... -... -... -``` - -Copying of the BIOS Enhanced Disk Device information --------------------------------------------------------------------------------- - -The next two steps is parsing of the `setup_data` with `parse_setup_data` function and copying BIOS EDD to the safe place. `setup_data` is a field from the kernel boot header and as we can read from the `x86` boot protocol: - -``` -Field name: setup_data -Type: write (special) -Offset/size: 0x250/8 -Protocol: 2.09+ - - The 64-bit physical pointer to NULL terminated single linked list of - struct setup_data. This is used to define a more extensible boot - parameters passing mechanism. -``` - -It used for storing setup information for different types as device tree blob, EFI setup data and etc... In the second step we copy BIOS EDD informantion from the `boot_params` structure that we collected in the [arch/x86/boot/edd.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c) to the `edd` structure: - -```C -static inline void __init copy_edd(void) -{ - memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer, - sizeof(edd.mbr_signature)); - memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info)); - edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries; - edd.edd_info_nr = boot_params.eddbuf_entries; -} -``` - -Memory descriptor initialization --------------------------------------------------------------------------------- - -The next step is initialization of the memory descriptor of the init process. As you already can know every process has own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h). As every process has own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html): - -```C -#define INIT_TASK(tsk) \ -{ - ... - ... - ... - .mm = NULL, \ - .active_mm = &init_mm, \ - ... -} -``` - -`mm` points to the process address space and `active_mm` points to the active address space if process has no own as kernel threads (more about it you can read in the [documentation](https://www.kernel.org/doc/Documentation/vm/active_mm.txt)). Now we fill memory descriptor of the initial process: - -```C - init_mm.start_code = (unsigned long) _text; - init_mm.end_code = (unsigned long) _etext; - init_mm.end_data = (unsigned long) _edata; - init_mm.brk = _brk_end; -``` - -with the kernel's text, data and brk. `init_mm` is memory descriptor of the initial process and defined as: - -```C -struct mm_struct init_mm = { - .mm_rb = RB_ROOT, - .pgd = swapper_pg_dir, - .mm_users = ATOMIC_INIT(2), - .mm_count = ATOMIC_INIT(1), - .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem), - .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), - .mmlist = LIST_HEAD_INIT(init_mm.mmlist), - INIT_MM_CONTEXT(init_mm) -}; -``` - -where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointer to the page global directory, `mm_users` is address space users, `mm_count` is primary usage counter and `mmap_sem` is memory area semaphore. After that we setup memory descriptor of the initiali process, next step is initialization of the intel Memory Protection Extensions with `mpx_mm_init`. The next step after it is initialization of the code/data/bss resources with: - -```C - code_resource.start = __pa_symbol(_text); - code_resource.end = __pa_symbol(_etext)-1; - data_resource.start = __pa_symbol(_etext); - data_resource.end = __pa_symbol(_edata)-1; - bss_resource.start = __pa_symbol(__bss_start); - bss_resource.end = __pa_symbol(__bss_stop)-1; -``` - -We already know a little about `resource` structure (read above). Here we fills code/data/bss resources with the physical addresses of they. You can see it in the `/proc/iomem` output: - -```C -00100000-be825fff : System RAM - 01000000-015bb392 : Kernel code - 015bb393-01930c3f : Kernel data - 01a11000-01ac3fff : Kernel bss -``` - -All of these structures defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and look like typical resource initialization: - -```C -static struct resource code_resource = { - .name = "Kernel code", - .start = 0, - .end = 0, - .flags = IORESOURCE_BUSY | IORESOURCE_MEM -}; -``` - -The last step which we will cover in this part will be `NX` configuration. `NX-bit` or no execute bit is 63-bit in the page directory entry which controls the ability to execute code from all physical pages mapped by the table entry. This bit can only be used/set when the `no-execute` page-protection mechanism is enabled by the setting `EFER.NXE` to 1. In the `x86_configure_nx` function we check that CPU has support of `NX-bit` and it does not disabled. After the check we fill `__supported_pte_mask` depend on it: - -```C -void x86_configure_nx(void) -{ - if (cpu_has_nx && !disable_nx) - __supported_pte_mask |= _PAGE_NX; - else - __supported_pte_mask &= ~_PAGE_NX; -} -``` - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the fifth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function which makes initialization of architecutre-specific stuff. It was long part, but we not finished with it. As i already wrote, the `setup_arch` is big function, and I am really not sure that we will cover full of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap and etc... Don't worry if they are unclear for you. There is special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md). In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameteres, early dump of the pci devices, direct Media Interface scanning and many many more. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [mm vs active_mm](https://www.kernel.org/doc/Documentation/vm/active_mm.txt) -* [e820](http://en.wikipedia.org/wiki/E820) -* [Supervisor mode access prevention](https://lwn.net/Articles/517475/) -* [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) -* [TSS](http://en.wikipedia.org/wiki/Task_state_segment) -* [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table) -* [Memory mapped I/O](http://en.wikipedia.org/wiki/Memory-mapped_I/O) -* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) -* [PDF. dwarf4 specification](http://dwarfstd.org/doc/DWARF4.pdf) -* [Call stack](http://en.wikipedia.org/wiki/Call_stack) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) diff --git a/Initialization/linux-initialization-6.md b/Initialization/linux-initialization-6.md deleted file mode 100644 index 3f81be0..0000000 --- a/Initialization/linux-initialization-6.md +++ /dev/null @@ -1,551 +0,0 @@ -Kernel initialization. Part 6. -================================================================================ - -Architecture-specific initializations, again... -================================================================================ - -In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) we saw architecture-specific (`x86_64` in our case) initialization stuff from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and finished on `x86_configure_nx` function which sets the `_PAGE_NX` flag depends on support of [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before `setup_arch` function and `start_kernel` are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and as you can understand from its name, this function parses kernel command line and setups different some services depends on give parameters (all kernel command line parameters you can find in the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)). You can remember how we setup `earlyprintk` in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). On the early stage we looked for kernel parameters and their value with the `cmdline_find_option` function and `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from the [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cmdline.c). There we're in the generic kernel part which does not depend on architecture and here we use another approach. If you are reading linux kernel source code, you already can note calls like this: - -```C -early_param("gbpages", parse_direct_gbpages_on); -``` - -`early_param` macro takes two parameters: - -* command line parameter name; -* function which will be called if given parameter passed. - -and defined as: - -```C -#define early_param(str, fn) \ - __setup_param(str, fn, fn, 1) -``` - -in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h). As you can see `early_param` macro just makes call of the `__setup_param` macro: - -```C -#define __setup_param(str, unique_id, fn, early) \ - static const char __setup_str_##unique_id[] __initconst \ - __aligned(1) = str; \ - static struct obs_kernel_param __setup_##unique_id \ - __used __section(.init.setup) \ - __attribute__((aligned((sizeof(long))))) \ - = { __setup_str_##unique_id, fn, early } -``` - -This macro defines `__setup_str_*_id` variable (where `*` depends on given function name) and assigns it to the given command line parameter name. In the next line we can see definition of the `__setup_*` variable which type is `obs_kernel_param` and its initialization. `obs_kernel_param` structure defined as: - -```C -struct obs_kernel_param { - const char *str; - int (*setup_func)(char *); - int early; -}; -``` - -and contains three fields: - -* name of the kernel parameter; -* function which setups something depend on parameter; -* field determinies is parameter early (1) or not (0). - -Note that `__set_param` macro defines with `__section(.init.setup)` attribute. It means that all `__setup_str_*` will be placed in the `.init.setup` section, moreover, as we can see in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h), they will be placed between `__setup_start` and `__setup_end`: - -``` -#define INIT_SETUP(initsetup_align) \ - . = ALIGN(initsetup_align); \ - VMLINUX_SYMBOL(__setup_start) = .; \ - *(.init.setup) \ - VMLINUX_SYMBOL(__setup_end) = .; -``` - -Now we know how parameters are defined, let's back to the `parse_early_param` implementation: - -```C -void __init parse_early_param(void) -{ - static int done __initdata; - static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata; - - if (done) - return; - - /* All fall through to do_early_param. */ - strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE); - parse_early_options(tmp_cmdline); - done = 1; -} -``` - -The `parse_early_param` function defines two static variables. First `done` check that `parse_early_param` already called and the second is temporary storage for kernel command line. After this we copy `boot_command_line` to the temporary commad line which we just defined and call the `parse_early_options` function from the the same source code `main.c` file. `parse_early_options` calls the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/) where `parse_args` parses given command line and calls `do_early_param` function. This [function](https://github.com/torvalds/linux/blob/master/init/main.c#L413) goes from the ` __setup_start` to `__setup_end`, and calls the function from the `obs_kernel_param` if a parameter is early. After this all services which are depend on early command line parameters were setup and the next call after the `parse_early_param` is `x86_report_nx`. As I wrote in the beginning of this part, we already set `NX-bit` with the `x86_configure_nx`. The next `x86_report_nx` function the [arch/x86/mm/setup_nx.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/setup_nx.c) just prints information about the `NX`. Note that we call `x86_report_nx` not right after the `x86_configure_nx`, but after the call of the `parse_early_param`. The answer is simple: we call it after the `parse_early_param` because the kernel support `noexec` parameter: - -``` -noexec [X86] - On X86-32 available only on PAE configured kernels. - noexec=on: enable non-executable mappings (default) - noexec=off: disable non-executable mappings -``` - -We can see it in the booting time: - -![NX](http://oi62.tinypic.com/swwxhy.jpg) - -After this we can see call of the: - -```C - memblock_x86_reserve_range_setup_data(); -``` - -function. This function defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and remaps memory for the `setup_data` and reserved memory block for the `setup_data` (more about `setup_data` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) and about `ioremap` and `memblock` you can read in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)). - -In the next step we can see following conditional statement: - -```C - if (acpi_mps_check()) { -#ifdef CONFIG_X86_LOCAL_APIC - disable_apic = 1; -#endif - setup_clear_cpu_cap(X86_FEATURE_APIC); - } -``` - -The first `acpi_mps_check` function from the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) depends on `CONFIG_X86_LOCAL_APIC` and `CNOFIG_x86_MPPARSE` configuration options: - -```C -int __init acpi_mps_check(void) -{ -#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE) - /* mptable code is not built-in*/ - if (acpi_disabled || acpi_noirq) { - printk(KERN_WARNING "MPS support code is not built-in.\n" - "Using acpi=off or acpi=noirq or pci=noacpi " - "may have problem\n"); - return 1; - } -#endif - return 0; -} -``` - -It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_x86_MPPAARSE` is not set, `acpi_mps_check` prints warning message if the one of the command line options: `acpi=off`, `acpi=noirq` or `pci=noacpi` passed to the kernel. If `acpi_mps_check` returns `1` which means that - -we disable local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clears `X86_FEATURE_APIC` bit in the of the current CPU with the `setup_clear_cpu_cap` macro. (more about CPU mask you can read in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)). - -Early PCI dump --------------------------------------------------------------------------------- - -In the next step we make a dump of the [PCI](http://en.wikipedia.org/wiki/Conventional_PCI) devices with the following code: - -```C -#ifdef CONFIG_PCI - if (pci_early_dump_regs) - early_dump_pci_devices(); -#endif -``` - -`pci_early_dump_regs` variable defined in the [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c) and its value depends on the kernel command line parameter: `pci=earlydump`. We can find defition of this parameter in the [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/arch): - -```C -early_param("pci", pci_setup); -``` - -`pci_setup` function gets the string after the `pci=` and analyzes it. This function calls `pcibios_setup` which defined as `__weak` in the [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/arch) and every architecture defines the same function which overrides `__weak` analog. For example `x86_64` architecture-depened version is in the [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c): - -```C -char *__init pcibios_setup(char *str) { - ... - ... - ... - } else if (!strcmp(str, "earlydump")) { - pci_early_dump_regs = 1; - return NULL; - } - ... - ... - ... -} -``` - -So, if `CONFIG_PCI` option is set and we passed `pci=earlydump` option to the kernel command line, next function which will be called - `early_dump_pci_devices` from the [arch/x86/pci/early.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/early.c). This function checks `noearly` pci parameter with: - -```C -if (!early_pci_allowed()) - return; -``` - -and returns if it was passed. Each PCI domain can host up to `256` buses and each bus hosts up to 32 devices. So, we goes in a loop: - -```C -for (bus = 0; bus < 256; bus++) { - for (slot = 0; slot < 32; slot++) { - for (func = 0; func < 8; func++) { - ... - ... - ... - } - } -} -``` - -and read the `pci` config with the `read_pci_config` function. - -That's all. We will no go deep in the `pci` details, but will see more details in the special `Drivers/PCI` part. - -Finish with memory parsing --------------------------------------------------------------------------------- - -After the `early_dump_pci_devices`, there are a couple of function related with available memory and [e820](http://en.wikipedia.org/wiki/E820) which we collected in the [First steps in the kernel setup](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) part: - -```C - /* update the e820_saved too */ - e820_reserve_setup_data(); - finish_e820_parsing(); - ... - ... - ... - e820_add_kernel_range(); - trim_bios_range(void); - max_pfn = e820_end_of_ram_pfn(); - early_reserve_e820_mpc_new(); -``` - -Let's look on it. As you can see the first function is `e820_reserve_setup_data`. This function does almost the same as `memblock_x86_reserve_range_setup_data` which we saw above, but it also calls `e820_update_range` which adds new regions to the `e820map` with the given type which is `E820_RESERVED_KERN` in our case. The next function is `finish_e820_parsing` which sanitazes `e820map` with the `sanitize_e820_map` function. Besides this two functions we can see a couple of functions related to the [e820](http://en.wikipedia.org/wiki/E820). You can see it in the listing which is above. `e820_add_kernel_range` function takes the physical address of the kernel start and end: - -```C -u64 start = __pa_symbol(_text); -u64 size = __pa_symbol(_end) - start; -``` - -checks that `.text` `.data` and `.bss` marked as `E820RAM` in the `e820map` and prints the warning message if not. The next function `trm_bios_range` update first 4096 bytes in `e820Map` as `E820_RESERVED` and sanitizes it again with the call of the `sanitize_e820_map`. After this we get the last page frame number with the call of the `e820_end_of_ram_pfn` function. Every memory page has an unique number - `Page frame number` and `e820_end_of_ram_pfn` function returns the maximum with the call of the `e820_end_pfn`: - -```C -unsigned long __init e820_end_of_ram_pfn(void) -{ - return e820_end_pfn(MAX_ARCH_PFN); -} -``` - -where `e820_end_pfn` takes maximum page frame number on the certain architecture (`MAX_ARCH_PFN` is `0x400000000` for `x86_64`). In the `e820_end_pfn` we go through the all `e820` slots and check that `e820` entry has `E820_RAM` or `E820_PRAM` type because we calcluate page frame numbers only for these types, gets the base address and end address of the page frame number for the current `e820` entry and makes some checks for these addresses: - -```C -for (i = 0; i < e820.nr_map; i++) { - struct e820entry *ei = &e820.map[i]; - unsigned long start_pfn; - unsigned long end_pfn; - - if (ei->type != E820_RAM && ei->type != E820_PRAM) - continue; - - start_pfn = ei->addr >> PAGE_SHIFT; - end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT; - - if (start_pfn >= limit_pfn) - continue; - if (end_pfn > limit_pfn) { - last_pfn = limit_pfn; - break; - } - if (end_pfn > last_pfn) - last_pfn = end_pfn; -} -``` - -```C - if (last_pfn > max_arch_pfn) - last_pfn = max_arch_pfn; - - printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n", - last_pfn, max_arch_pfn); - return last_pfn; -``` - -After this we check that `last_pfn` which we got in the loop is not greater that maximum page frame number for the certain architecture (`x86_64` in our case), print inofmration about last page frame number and return it. We can see the `last_pfn` in the `dmesg` output: - -``` -... -[ 0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000 -... -``` - -After this, as we have calculated the biggest page frame number, we calculate `max_low_pfn` which is the biggest page frame number in the `low memory` or bellow first `4` gigabytes. If installed more than 4 gigabytes of RAM, `max_low_pfn` will be result of the `e820_end_of_low_ram_pfn` function which does the same `e820_end_of_ram_pfn` but with 4 gigabytes limit, in other way `max_low_pfn` will be the same as `max_pfn`: - -```C -if (max_pfn > (1UL<<(32 - PAGE_SHIFT))) - max_low_pfn = e820_end_of_low_ram_pfn(); -else - max_low_pfn = max_pfn; - -high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1; -``` - -Next we calculate `high_memory` (defines the upper bound on direct map memory) with `__va` macro which returns a virtual address by the given physical. - -DMI scanning -------------------------------------------------------------------------------- - -The next step after manipulations with different memory regions and `e820` slots is collecting information about computer. We will get all information with the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface) and following functions: - -```C -dmi_scan_machine(); -dmi_memdev_walk(); -``` - -First is `dmi_scan_machine` defined in the [drivers/firmware/dmi_scan.c](https://github.com/torvalds/linux/blob/master/drivers/firmware/dmi_scan.c). This function goes through the [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) structures and extracts informantion. There are two ways specified to gain access to the `SMBIOS` table: get the pointer to the `SMBIOS` table from the [EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)'s configuration table and scanning the physycal memory between `0xF0000` and `0x10000` addresses. Let's look on the second approach. `dmi_scan_machine` function remaps memory between `0xf0000` and `0x10000` with the `dmi_early_remap` which just expands to the `early_ioremap`: - -```C -void __init dmi_scan_machine(void) -{ - char __iomem *p, *q; - char buf[32]; - ... - ... - ... - p = dmi_early_remap(0xF0000, 0x10000); - if (p == NULL) - goto error; -``` - -and iterates over all `DMI` header address and find search `_SM_` string: - -```C -memset(buf, 0, 16); -for (q = p; q < p + 0x10000; q += 16) { - memcpy_fromio(buf + 16, q, 16); - if (!dmi_smbios3_present(buf) || !dmi_present(buf)) { - dmi_available = 1; - dmi_early_unmap(p, 0x10000); - goto out; - } - memcpy(buf, buf + 16, 16); -} -``` - -`_SM_` string must be between `000F0000h` and `0x000FFFFF`. Here we copy 16 bytes to the `buf` with `memcpy_fromio` which is the same `memcpy` and execute `dmi_smbios3_present` and `dmi_present` on the buffer. These functions check that first 4 bytes is `_SM_` string, get `SMBIOS` version and gets `_DMI_` attributes as `DMI` structure table length, table address and etc... After one of these function will finish to execute, you will see the result of it in the `dmesg` output: - -``` -[ 0.000000] SMBIOS 2.7 present. -[ 0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014 -``` - -In the end of the `dmi_scan_machine`, we unmap the previously remaped memory: - -```C -dmi_early_unmap(p, 0x10000); -``` - -The second function is - `dmi_memdev_walk`. As you can understand it goes over memory devices. Let's look on it: - -```C -void __init dmi_memdev_walk(void) -{ - if (!dmi_available) - return; - - if (dmi_walk_early(count_mem_devices) == 0 && dmi_memdev_nr) { - dmi_memdev = dmi_alloc(sizeof(*dmi_memdev) * dmi_memdev_nr); - if (dmi_memdev) - dmi_walk_early(save_mem_devices); - } -} -``` - -It checks that `DMI` available (we got it in the previous function - `dmi_scan_machine`) and collects information about memory devices with `dmi_walk_early` and `dmi_alloc` which defined as: - -``` -#ifdef CONFIG_DMI -RESERVE_BRK(dmi_alloc, 65536); -#endif -``` - -`RESERVE_BRK` defined in the [arch/x86/include/asm/setup.h](http://en.wikipedia.org/wiki/Desktop_Management_Interface) and reserves space with given size in the `brk` section. - -------------------------- - init_hypervisor_platform(); - x86_init.resources.probe_roms(); - insert_resource(&iomem_resource, &code_resource); - insert_resource(&iomem_resource, &data_resource); - insert_resource(&iomem_resource, &bss_resource); - early_gart_iommu_check(); - - -SMP config --------------------------------------------------------------------------------- - -The next step is parsing of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration. We do it with the call of the `find_smp_config` function which just calls function: - -```C -static inline void find_smp_config(void) -{ - x86_init.mpparse.find_smp_config(); -} -``` - -inside. `x86_init.mpparse.find_smp_config` is a `default_find_smp_config` function from the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). In the `default_find_smp_config` function we are scanning a couple of memory regions for `SMP` config and return if they are not: - -```C -if (smp_scan_config(0x0, 0x400) || - smp_scan_config(639 * 0x400, 0x400) || - smp_scan_config(0xF0000, 0x10000)) - return; -``` - -First of all `smp_scan_config` function defines a couple of variables: - -```C -unsigned int *bp = phys_to_virt(base); -struct mpf_intel *mpf; -``` - -First is virtual address of the memory region where we will scan `SMP` config, second is the pointer to the `mpf_intel` structure. Let's try to understand what is it `mpf_intel`. All information stores in the multiprocessor configuration data structure. `mpf_intel` presents this structure and looks: - -```C -struct mpf_intel { - char signature[4]; - unsigned int physptr; - unsigned char length; - unsigned char specification; - unsigned char checksum; - unsigned char feature1; - unsigned char feature2; - unsigned char feature3; - unsigned char feature4; - unsigned char feature5; -}; -``` - -As we can read in the documentation - one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. And operating system must have access to this information about the multiprocessor configuration and `mpf_intel` stores the physical address (look at second parameter) of the multiprocessor configuration table. So, `smp_scan_config` going in a loop through the given memory range and tries to find `MP floating pointer structure` there. It checks that current byte points to the `SMP` signature, checks checksum, checks that `mpf->specification` is 1 (it must be `1` or `4` by specification) in the loop: - -```C -while (length > 0) { -if ((*bp == SMP_MAGIC_IDENT) && - (mpf->length == 1) && - !mpf_checksum((unsigned char *)bp, 16) && - ((mpf->specification == 1) - || (mpf->specification == 4))) { - - mem = virt_to_phys(mpf); - memblock_reserve(mem, sizeof(*mpf)); - if (mpf->physptr) - smp_reserve_memory(mpf); - } -} -``` - -reserves given memory block if search is successful with `memblock_reserve` and reserves physical address of the multiprocessor configuration table. All documentation about this you can find in the - [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf). More details you can read in the special part about `SMP`. - -Additional early memory initialization routines --------------------------------------------------------------------------------- - -In the next step of the `setup_arch` we can see the call of the `early_alloc_pgt_buf` function which allocates the page table buffer for early stage. The page table buffer will be place in the `brk` area. Let's look on its implementation: - -```C -void __init early_alloc_pgt_buf(void) -{ - unsigned long tables = INIT_PGT_BUF_SIZE; - phys_addr_t base; - - base = __pa(extend_brk(tables, PAGE_SIZE)); - - pgt_buf_start = base >> PAGE_SHIFT; - pgt_buf_end = pgt_buf_start; - pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT); -} -``` - -First of all it get the size of the page table buffer, it will be `INIT_PGT_BUF_SIZE` which is `(6 * PAGE_SIZE)` in the current linux kernel 4.0. As we got the size of the page table buffer, we call `extend_brk` function with two parameters: size and align. As you can understand from its name, this function extends the `brk` area. As we can see in the linux kernel linker script `brk` in memory right after the [BSS](http://en.wikipedia.org/wiki/.bss): - -```C - . = ALIGN(PAGE_SIZE); - .brk : AT(ADDR(.brk) - LOAD_OFFSET) { - __brk_base = .; - . += 64 * 1024; /* 64k alignment slop space */ - *(.brk_reservation) /* areas brk users have reserved */ - __brk_limit = .; - } -``` - -Or we can find it with `readelf` util: - -![brk area](http://oi61.tinypic.com/71lkeu.jpg) - -After that we got physical address of the new `brk` with the `__pa` macro, we calculate the base address and the end of the page table buffer. In the next step as we got page table buffer, we reserve memory block for the brk are with the `reserve_brk` function: - -```C -static void __init reserve_brk(void) -{ - if (_brk_end > _brk_start) - memblock_reserve(__pa_symbol(_brk_start), - _brk_end - _brk_start); - - _brk_start = 0; -} -``` - -Note that in the end of the `reserve_brk`, we set `brk_start` to zero, because after this we will not allocate it anymore. The next step after reserving memory block for the `brk`, we need to unmap out-of-range memory areas in the kernel mapping with the `cleanup_highmap` function. Remeber that kernel mapping is `__START_KERNEL_map` and `_end - _text` or `level2_kernel_pgt` maps the kernel `_text`, `data` and `bss`. In the start of the `clean_high_map` we define these parameters: - -```C -unsigned long vaddr = __START_KERNEL_map; -unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1; -pmd_t *pmd = level2_kernel_pgt; -pmd_t *last_pmd = pmd + PTRS_PER_PMD; -``` - -Now, as we defined start and end of the kernel mapping, we go in the loop through the all kernel page middle directory entries and clean entries which are not between `_text` and `end`: - -```C -for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) { - if (pmd_none(*pmd)) - continue; - if (vaddr < (unsigned long) _text || vaddr > end) - set_pmd(pmd, __pmd(0)); -} -``` - -After this we set the limit for the `memblock` allocation with the `memblock_set_current_limit` function (read more about `memblock` you can in the [Linux kernel memory management Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md)), it will be `ISA_END_ADDRESS` or `0x100000` and fill the `memblock` information according to `e820` with the call of the `memblock_x86_fill` function. You can see the result of this function in the kernel initialization time: - -``` -MEMBLOCK configuration: - memory size = 0x1fff7ec00 reserved size = 0x1e30000 - memory.cnt = 0x3 - memory[0x0] [0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0 - memory[0x1] [0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0 - memory[0x2] [0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0 - reserved.cnt = 0x3 - reserved[0x0] [0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0 - reserved[0x1] [0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0 - reserved[0x2] [0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0 -``` - -The rest functions after the `memblock_x86_fill` are: `early_reserve_e820_mpc_new` alocates additional slots in the `e820map` for MultiProcessor Specification table, `reserve_real_mode` - reserves low memory from `0x0` to 1 megabyte for the trampoline to the real mode (for rebootin and etc...), `trim_platform_memory_ranges` - trims certain memory regions started from `0x20050000`, `0x20110000` and etc... these regions must be excluded because [Sandy Bridge](http://en.wikipedia.org/wiki/Sandy_Bridge) has problems with these regions, `trim_low_memory_range` reserves the first 4 killobytes page in `memblock`, `init_mem_mapping` function reconstructs direct memory mapping and setups the direct mapping of the physical memory at `PAGE_OFFSET`, `early_trap_pf_init` setups `#PF` handler (we will look on it in the chapter about interrupts) and `setup_real_mode` function setups trampoline to the [real mode](http://en.wikipedia.org/wiki/Real_mode) code. - -That's all. You can note that this part will not cover all functions which are in the `setup_arch` (like `early_gart_iommu_check`, [mtrr](http://en.wikipedia.org/wiki/Memory_type_range_register) initalization and etc...). As I already wrote many times, `setup_arch` is big, and linux kernel is big. That's why I can't cover every line in the linux kernel. I don't think that we missed something important,... but you can say something like: each line of code is important. Yes, it's true, but I missed they anyway, because I think that it is not real to cover full linux kernel. Anyway we will often return to the idea that we have already seen, and if something will be unfamiliar, we will cover this theme. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the sixth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function again It was long part, but we not finished with it. Yes, `setup_arch` is big, hope that next part will be last about this function. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) -* [NX bit](http://en.wikipedia.org/wiki/NX_bit) -* [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt) -* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) -* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) -* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) -* [PCI](http://en.wikipedia.org/wiki/Conventional_PCI) -* [e820](http://en.wikipedia.org/wiki/E820) -* [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) -* [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) -* [EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) -* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) -* [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf) -* [BSS](http://en.wikipedia.org/wiki/.bss) -* [SMBIOS specification](http://www.dmtf.org/sites/default/files/standards/documents/DSP0134v2.5Final.pdf) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) diff --git a/Initialization/linux-initialization-7.md b/Initialization/linux-initialization-7.md deleted file mode 100644 index 95a123c..0000000 --- a/Initialization/linux-initialization-7.md +++ /dev/null @@ -1,482 +0,0 @@ -Kernel initialization. Part 7. -================================================================================ - -The End of the architecture-specific initializations, almost... -================================================================================ - -This is the seventh parth of the Linux Kernel initialization process which covers internals of the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L861). As you can know from the previous [parts](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html), the `setup_arch` function does some architecture-specific (in our case it is [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) device and many many more. If you have read the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html), you can remember that we've finished it at the `setup_real_mode` function. In the next step, as we set limit of the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html) to the all mapped pages, we can see the call of the `setup_log_buf` function from the [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/master/kernel/printk/printk.c). - -The `setup_log_buf` function setups kernel cyclic buffer which length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of the `CONFIG_LOG_BUF_SHIFT` it can be between `12` and `21`. In the internals, buffer defined as array of chars: - -```C -#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT) -static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN); -static char *log_buf = __log_buf; -``` - -Now let's look on the implementation of th `setup_log_buf` function. It starts with check that current buffer is empty (It must be empty, because we just setup it) and another check that it is early setup. If setup of the kernel log buffer is not early, we call the `log_buf_add_cpu` function which increase size of the buffer for every CPU: - -```C -if (log_buf != __log_buf) - return; - -if (!early && !new_log_buf_len) - log_buf_add_cpu(); -``` - -We will not research `log_buf_add_cpu` function, because as you can see in the `setup_arch`, we call `setup_log_buf` as: - -```C -setup_log_buf(1); -``` - -where `1` means that is is early setup. In the next step we check `new_log_buf_len` variable which is updated length of the kernel log buffer and allocate new space for the buffer with the `memblock_virt_alloc` function for it, or just return. - -As kernel log buffer is ready, the next function is `reserve_initrd`. You can remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). Now, as we reconstructed direct memory mapping in the `init_mem_mapping` function, we need to move [initrd](http://en.wikipedia.org/wiki/Initrd) to the down into directly mapped memory. The `reserve_initrd` function starts from the definition of the base address and end address of the `initrd` and check that `initrd` was provided by a bootloader. All the same as we saw it in the `early_reserve_initrd`. But instead of the reserving place in the `memblock` area with the call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater that this area with: - -```C -mapped_size = memblock_mem_size(max_pfn_mapped); -if (ramdisk_size >= (mapped_size>>1)) - panic("initrd too large to handle, " - "disabling initrd (%lld needed, %lld available)\n", - ramdisk_size, mapped_size>>1); -``` - -You can see here that we call `memblock_mem_size` function and pass the `max_pfn_mapped` to it, where `max_pfn_mapped` contains the highest direct mapped page frame number. If you do not remember what is it `page frame number`, explanation is simple: First `12` bits of the virtual address represent offset in the physical page or page frame. If we will shift right virtual address on `12`, we'll discard offset part and will get `Page Frame Number`. In the `memblock_mem_size` we go through the all memblock `mem` (not reserved) regions and calculates size of the mapped pages amount and return it to the `mapped_size` variable (see code above). As we got amount of the direct mapped memory, we check that size of the `initrd` is not greater than mapped pages. If it is greater we just call `panic` which halts the system and prints popular [Kernel panic](http://en.wikipedia.org/wiki/Kernel_panic) message. In the next step we print information about the `initrd` size. We can see the result of this in the `dmesg` output: - -```C -[0.000000] RAMDISK: [mem 0x36d20000-0x37687fff] -``` - -and relocate `initrd` to the direct mapping area with the `relocate_initrd` function. In the start of the `relocate_initrd` function we try to find free area with the `memblock_find_in_range` function: - -```C -relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), area_size, PAGE_SIZE); - -if (!relocated_ramdisk) - panic("Cannot find place for new RAMDISK of size %lld\n", - ramdisk_size); -``` - -The `memblock_find_in_range` function tries to find free area in a given range, in our case from `0` to the maximum mapped physical address and size must equal to the aligned size of the `initrd`. If we didn't find area with the given size, we call `panic` again. If all is good, we start to relocated RAM disk to the down of the directly mapped meory in the next step. - -In the end of the `reserve_initrd` function, we free memblock memory which occupied by the ramdisk with the call of the: - -```C -memblock_free(ramdisk_image, ramdisk_end - ramdisk_image); -``` - -After we relocated `initrd` ramdisk image, the next function is `vsmp_init` from the [arch/x86/kernel/vsmp_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsmp_64.c). This function initializes support of the `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter will not cover non-related `x86_64` initialization parts (for example as the current or `ACPI` and etc...). So we will miss implementation of this for now and will back to it in the part which will cover techniques of parallel computing. - -The next function is `io_delay_init` from the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/io_delay.c). This function allows to override default default I/O delay `0x80` port. We already saw I/O delay in the [Last preparation before transition into protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html), now let's look on the `io_delay_init` implementation: - -```C -void __init io_delay_init(void) -{ - if (!io_delay_override) - dmi_check_system(io_delay_0xed_port_dmi_table); -} -``` - -This function check `io_delay_override` variable and overrides I/O delay port if `io_delay_override` is set. We can set `io_delay_override` variably by passing `io_delay` option to the kernel command line. As we can read from the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt), `io_delay` option is: - -``` -io_delay= [X86] I/O delay method - 0x80 - Standard port 0x80 based delay - 0xed - Alternate port 0xed based delay (needed on some systems) - udelay - Simple two microseconds delay - none - No delay -``` - -We can see `io_delay` command line parameter setup with the `early_param` macro in the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/io_delay.c) - -```C -early_param("io_delay", io_delay_param); -``` - -More about `early_param` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). So the `io_delay_param` function which setups `io_delay_override` variable will be called in the [do_early_param](https://github.com/torvalds/linux/blob/master/init/main.c#L413) function. `io_delay_param` function gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depends on it: - -```C -static int __init io_delay_param(char *s) -{ - if (!s) - return -EINVAL; - - if (!strcmp(s, "0x80")) - io_delay_type = CONFIG_IO_DELAY_TYPE_0X80; - else if (!strcmp(s, "0xed")) - io_delay_type = CONFIG_IO_DELAY_TYPE_0XED; - else if (!strcmp(s, "udelay")) - io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY; - else if (!strcmp(s, "none")) - io_delay_type = CONFIG_IO_DELAY_TYPE_NONE; - else - return -EINVAL; - - io_delay_override = 1; - return 0; -} -``` - -The next functions are `acpi_boot_table_init`, `early_acpi_boot_init` and `initmem_init` after the `io_delay_init`, but as I wrote above we will not cover [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) related stuff in this `Linux Kernel initialization process` chapter. - -Allocate area for DMA --------------------------------------------------------------------------------- - -In the next step we need to allocate area for the [Direct memory access](http://en.wikipedia.org/wiki/Direct_memory_access) with the `dma_contiguous_reserve` function which defined in the [drivers/base/dma-contiguous.c](https://github.com/torvalds/linux/blob/master/drivers/base/dma-contiguous.c). `DMA` area is a special mode when devices comminicate with memory without CPU. Note that we pass one parameter - `max_pfn_mapped << PAGE_SHIFT`, to the `dma_contiguous_reserve` function and as you can understand from this expression, this is limit of the reserved memory. Let's look on the implementation of this function. It starts from the definition of the following variables: - -```C -phys_addr_t selected_size = 0; -phys_addr_t selected_base = 0; -phys_addr_t selected_limit = limit; -bool fixed = false; -``` - -where first represents size in bytes of the reserved area, second is base address of the reserved area, third is end address of the reserved area and the last `fixed` parameter shows where to place reserved area. If `fixed` is `1` we just reserve area with the `memblock_reserve`, if it is `0` we allocate space with the `kmemleak_alloc`. In the next step we check `size_cmdline` variable and if it is not equal to `-1` we fill all variables which you can see above with the values from the `cma` kernel command line parameter: - -```C -if (size_cmdline != -1) { - ... - ... - ... -} -``` - -You can find in this source code file definition of the early parameter: - -```C -early_param("cma", early_cma); -``` - -where `cma` is: - -``` -cma=nn[MG]@[start[MG][-end[MG]]] - [ARM,X86,KNL] - Sets the size of kernel global memory area for - contiguous memory allocations and optionally the - placement constraint by the physical address range of - memory allocations. A value of 0 disables CMA - altogether. For more information, see - include/linux/dma-contiguous.h -``` - -If we will not pass `cma` option to the kernel command line, `size_cmdline` will be equal to `-1`. In this way we need to calculate size of the reserved area which depends on the following kernel configuration options: - -* `CONFIG_CMA_SIZE_SEL_MBYTES` - size in megabytes, default global `CMA` area, which is equal to `CMA_SIZE_MBYTES * SZ_1M` or `CONFIG_CMA_SIZE_MBYTES * 1M`; -* `CONFIG_CMA_SIZE_SEL_PERCENTAGE` - percentage of total memory; -* `CONFIG_CMA_SIZE_SEL_MIN` - use lower value; -* `CONFIG_CMA_SIZE_SEL_MAX` - use higher value. - -As we calculated the size of the reserved area, we reserve area with the call of the `dma_contiguous_reserve_area` function which first of all calls: - -``` -ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma); -``` - -function. The `cma_declare_contiguous` reserves contiguous area from the given base address and with given size. After we reserved area for the `DMA`, next function is the `memblock_find_dma_reserve`. As you can understand from its name, this function counts the reserved pages in the `DMA` area. This part will not cover all details of the `CMA` and `DMA`, because they are big. We will see much more details in the special part in the Linux Kernel Memory management which covers contiguous memory allocators and areas. - -Initialization of the sparse memory --------------------------------------------------------------------------------- - -The next step is the call of the function - `x86_init.paging.pagetable_init`. If you will try to find this function in the linux kernel source code, in the end of your search, you will see the following macro: - -```C -#define native_pagetable_init paging_init -``` - -which expands as you can see to the call of the `paging_init` function from the [arch/x86/mm/init_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init_64.c). The `paging_init` function initializes sparse memory and zone sizes. First of all what's zones and what is it `Sparsemem`. The `Sparsemem` is a special foundation in the linux kernen memory manager which used to split memory area to the different memory banks in the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) systems. Let's look on the implementation of the `paginig_init` function: - -```C -void __init paging_init(void) -{ - sparse_memory_present_with_active_regions(MAX_NUMNODES); - sparse_init(); - - node_clear_state(0, N_MEMORY); - if (N_MEMORY != N_NORMAL_MEMORY) - node_clear_state(0, N_NORMAL_MEMORY); - - zone_sizes_init(); -} -``` - -As you can see there is call of the `sparse_memory_present_with_active_regions` function which records a memory area for every `NUMA` node to the array of the `mem_section` structure which contains a pointer to the structure of the array of `struct page`. The next `sparse_init` function allocates non-linear `mem_section` and `mem_map`. In the next step we clear state of the movable memory nodes and initialize sizes of zones. Every `NUMA` node is devided into a number of pieces which are called - `zones`. So, `zone_sizes_init` function from the [arch/x86/mm/init.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init.c) initializes size of zones. - -Again, this part and next parts do not cover this theme in full details. There will be special part about `NUMA`. - -vsyscall mapping --------------------------------------------------------------------------------- - -The next step after `SparseMem` initialization is setting of the `trampoline_cr4_features` which must contain content of the `cr4` [Control register](http://en.wikipedia.org/wiki/Control_register). First of all we need to check that current CPU has support of the `cr4` register and if it has, we save its content to the `trampoline_cr4_features` which is storage for `cr4` in the real mode: - -```C -if (boot_cpu_data.cpuid_level >= 0) { - mmu_cr4_features = __read_cr4(); - if (trampoline_cr4_features) - *trampoline_cr4_features = mmu_cr4_features; -} -``` - -The next function which you can see is `map_vsyscal` from the [arch/x86/kernel/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_64.c). This function maps memory space for [vsyscalls](https://lwn.net/Articles/446528/) and depends on `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option. Actually `vsyscall` is a special segment which provides fast access to the certain system calls like `getcpu` and etc... Let's look on implementation of this function: - -```C -void __init map_vsyscall(void) -{ - extern char __vsyscall_page; - unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page); - - if (vsyscall_mode != NONE) - __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall, - vsyscall_mode == NATIVE - ? PAGE_KERNEL_VSYSCALL - : PAGE_KERNEL_VVAR); - - BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != - (unsigned long)VSYSCALL_ADDR); -} -``` - -In the beginning of the `map_vsyscal` we can see definition of two variables. The first is extern valirable `__vsyscall_page`. As variable extern, it defined somewhere in other source code file. Actually we can see definition of the `__vsyscall_page` in the [arch/x86/kernel/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_emu_64.S). The `__vsyscall_page` symbol points to the aligned calls of the `vsyscalls` as `gettimeofday` and etc...: - -```assembly - .globl __vsyscall_page - .balign PAGE_SIZE, 0xcc - .type __vsyscall_page, @object -__vsyscall_page: - - mov $__NR_gettimeofday, %rax - syscall - ret - - .balign 1024, 0xcc - mov $__NR_time, %rax - syscall - ret - ... - ... - ... -``` - -The second variable is `physaddr_vsyscall` which just stores physical address of the `__vsyscall_page` symbol. In the next step we check the `vsyscall_mode` variable, and if it is not equal to `NONE` which is `EMULATE` by default: - -```C -static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE; -``` - -And after this check we can see the call of the `__set_fixmap` function which calls `native_set_fixmap` with the same parameters: - -```C -void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags) -{ - __native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags)); -} - -void __native_set_fixmap(enum fixed_addresses idx, pte_t pte) -{ - unsigned long address = __fix_to_virt(idx); - - if (idx >= __end_of_fixed_addresses) { - BUG(); - return; - } - set_pte_vaddr(address, pte); - fixmaps_set++; -} -``` - -Here we can see that `native_set_fixmap` makes value of `Page Table Entry` from the given physical address (physical address of the `__vsyscall_page` symbol in our case) and calls internal function - `__native_set_fixmap`. Internal function gets the virtual address of the given `fixed_addresses` index (`VSYSCALL_PAGE` in our case) and checks that given index is not greated than end of the fix-mapped addresses. After this we set page table entry with the call of the `set_pte_vaddr` function and increase count of the fix-mapped addresses. And in the end of the `map_vsyscall` we check that virtual address of the `VSYSCALL_PAGE` (which is first index in the `fixed_addresses`) is not greater than `VSYSCALL_ADDR` which is `-10UL << 20` or `ffffffffff600000` with the `BUILD_BUG_ON` macro: - -```C -BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != - (unsigned long)VSYSCALL_ADDR); -``` - -Now `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`, if you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). More about `vsyscalls` we will see in the `vsyscalls and vdso` part. - -Getting the SMP configuration --------------------------------------------------------------------------------- - -You can remember how we made a search of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). Now we need to get the `SMP` configurtaion if we found it. For this we check `smp_found_config` variable which we set in the `smp_scan_config` function (read about it the previous part) and call the `get_smp_config` function: - -```C -if (smp_found_config) - get_smp_config(); -``` - -The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which defined in the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). This function defines pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html)) and does some checks: - -```C -struct mpf_intel *mpf = mpf_found; - -if (!mpf) - return; - -if (acpi_lapic && early) - return; -``` - -Here we can see that multiprocessor configuration was found in the `smp_scan_config` function or just return from the function if not. The next check check that it is early. And as we did this checks, we start to read the `SMP` configuration. As we finished to read it, the next step is - `prefill_possible_map` function which makes preliminary filling of the possible CPUs `cpumask` (more about it you can read in the [Introduction to the cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)). - -The rest of the setup_arch --------------------------------------------------------------------------------- - -Here we are getting to the end of the `setup_arch` function. The rest function of course make important stuff, but details about these stuff will not will not be included in this part. We will just take a short look on these functions, because although they are important as I wrote above, but they cover non-generic kernel features related with the `NUMA`, `SMP`, `ACPI` and `APICs` and etc... First of all, the next call of the `init_apic_mappings` function. As we can understand this function sets the address of the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The next is `x86_io_apic_ops.init` and this function initializes I/O APIC. Please note that all details related with `APIC`, we will see in the chapter about interrupts and exceptions handling. In the next step we reserve standard I/O resources like `DMA`, `TIMER`, `FPU` and etc..., with the call of the `x86_init.resources.reserve_resources` function. Following is `mcheck_init` function initializes `Machine check Exception` and the last is `register_refined_jiffies` which registers [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29) (There will be separate chapter about timers in the kernel). - -So that's all. Finally we have finished with the big `setup_arch` function in this part. Of course as I already wrote many times, we did not see full details about this function, but do not worry about it. We will be back more than once to this function from different chapters for understanding how different platform-dependent parts are initialized. - -That's all, and now we can back to the `start_kernel` from the `setup_arch`. - -Back to the main.c -================================================================================ - -As I wrote above, we have finished with the `setup_arch` function and now we can back to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As you can remember or even you saw yourself, `start_kernel` function is very big too as the `setup_arch`. So the couple of the next part will be dedicated to the learning of this function. So, let's continue with it. After the `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask]((http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)) pointer to the memory descriptor `cpumask`. We can look on its implementation: - -```C -static inline void mm_init_cpumask(struct mm_struct *mm) -{ -#ifdef CONFIG_CPUMASK_OFFSTACK - mm->cpu_vm_mask_var = &mm->cpumask_allocation; -#endif - cpumask_clear(mm->cpu_vm_mask_var); -} -``` - -As you can see in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we passed memory descriptor of the init process to the `mm_init_cpumask` and here depend on `CONFIG_CPUMASK_OFFSTACK` configuration option we set or clear [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer) switch `cpumask`. - -In the next step we can see the call of the following function: - -```C -setup_command_line(command_line); -``` - -This function takes pointer to the kernel command line allocates a couple of buffers to store command line. We need a couple of buffers, because one buffer used for future reference and accessing to command line and one for parameter parsing. We will allocate space for the following buffers: - -* `saved_command_line` - will contain boot command line; -* `initcall_command_line` - will contain boot command line. will be used in the `do_initcall_level`; -* `static_command_line` - will contain command line for parameters parsing. - -We will allocate space with the `memblock_virt_alloc` function. This function calls `memblock_virt_alloc_try_nid` which allocates boot memory block with `memblock_reserve` if [slab](http://en.wikipedia.org/wiki/Slab_allocation) is not available or uses `kzalloc_node` (more about it will be in the linux memory management chapter). The `memblock_virt_alloc` uses `BOOTMEM_LOW_LIMIT` (physicall address of the `(PAGE_OFFSET + 0x1000000)` value) and `BOOTMEM_ALLOC_ACCESSIBLE` (equal to the current value of the `memblock.current_limit`) as minimum address of the memory egion and maximum address of the memory region. - -Let's look on the implementation of the `setup_command_line`: - -```C -static void __init setup_command_line(char *command_line) -{ - saved_command_line = - memblock_virt_alloc(strlen(boot_command_line) + 1, 0); - initcall_command_line = - memblock_virt_alloc(strlen(boot_command_line) + 1, 0); - static_command_line = memblock_virt_alloc(strlen(command_line) + 1, 0); - strcpy(saved_command_line, boot_command_line); - strcpy(static_command_line, command_line); - } - ``` - -Here we can see that we allocate space for the three buffers which will contain kernel command line for the different purposes (read above). And as we allocated space, we storing `boot_comand_line` in the `saved_command_line` and `command_line` (kernel command line from the `setup_arch` to the `static_command_line`). - -The next function after the `setup_command_line` is the `setup_nr_cpu_ids`. This function setting `nr_cpu_ids` (number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter describes [cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) concept). Let's look on its implementation: - -```C -void __init setup_nr_cpu_ids(void) -{ - nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1; -} -``` - -Here `nr_cpu_ids` represents number of CPUs, `NR_CPUS` represents the maximum number of CPUs which we can set in configuration time: - -![CONFIG_NR_CPUS](http://oi59.tinypic.com/28mh45h.jpg) - -Actually we need to call this function, because `NR_CPUS` can be greater than actual amount of the CPUs in the your computer. Here we can see that we call `find_last_bit` function and pass two parameters to it: - -* `cpu_possible_mask` bits; -* maximim number of CPUS. - -In the `setup_arch` we can find the call of the `prefill_possible_map` function which calculates and writes to the `cpu_possible_mask` actual number of the CPUs. We call the `find_last_bit` function which takes the address and maximum size to search and returns bit number of the first set bit. We passed `cpu_possible_mask` bits and maximum number of the CPUs. First of all the `find_last_bit` function splits given `unsigned long` address to the [words](http://en.wikipedia.org/wiki/Word_%28computer_architecture%29): - -```C -words = size / BITS_PER_LONG; -``` - -where `BITS_PER_LONG` is `64` on the `x86_64`. As we got amount of words in the given size of the search data, we need to check is given size does not contain partial words with the following check: - -```C -if (size & (BITS_PER_LONG-1)) { - tmp = (addr[words] & (~0UL >> (BITS_PER_LONG - - (size & (BITS_PER_LONG-1))))); - if (tmp) - goto found; -} -``` - -if it contains partial word, we mask the last word and check it. If the last word is not zero, it means that current word contains at least one set bit. We go to the `found` label: - -```C -found: - return words * BITS_PER_LONG + __fls(tmp); -``` - -Here you can see `__fls` function which returns last set bit in a given word with help of the `bsr` instruction: - -```C -static inline unsigned long __fls(unsigned long word) -{ - asm("bsr %1,%0" - : "=r" (word) - : "rm" (word)); - return word; -} -``` - -The `bsr` instruction which scans the given operand for first bit set. If the last word is not partial we going through the all words in the given address and trying to find first set bit: - -```C -while (words) { - tmp = addr[--words]; - if (tmp) { -found: - return words * BITS_PER_LONG + __fls(tmp); - } -} -``` - -Here we put the last word to the `tmp` variable and check that `tmp` contains at least one set bit. If a set bit found, we return the number of this bit. If no one words do not contains set bit we just return given size: - -```C -return size; -``` - -After this `nr_cpu_ids` will contain the correct amount of the avaliable CPUs. - -That's all. - -Conclusion -================================================================================ - -It is the end of the seventh part about the linux kernel initialization process. In this part, finally we have finsihed with the `setup_arch` function and returned to the `start_kernel` function. In the next part we will continue to learn generic kernel code from the `start_kernel` and will continue our way to the first `init` process. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links -================================================================================ - -* [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface) -* [x86_64](http://en.wikipedia.org/wiki/X86-64) -* [initrd](http://en.wikipedia.org/wiki/Initrd) -* [Kernel panic](http://en.wikipedia.org/wiki/Kernel_panic) -* [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt) -* [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) -* [Direct memory access](http://en.wikipedia.org/wiki/Direct_memory_access) -* [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) -* [Control register](http://en.wikipedia.org/wiki/Control_register) -* [vsyscalls](https://lwn.net/Articles/446528/) -* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) -* [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html) diff --git a/Initialization/linux-initialization-8.md b/Initialization/linux-initialization-8.md deleted file mode 100644 index 0adf83c..0000000 --- a/Initialization/linux-initialization-8.md +++ /dev/null @@ -1,481 +0,0 @@ -Kernel initialization. Part 8. -================================================================================ - -Scheduler initialization -================================================================================ - -This is the eighth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process and we stopped on the `setup_nr_cpu_ids` function in the [previous](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) part. The main point of the current part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `setup_per_cpu_areas` function. This function setups areas for the `percpu` variables, more about it you can read in the special part about the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html). After `percpu` areas up and running, the next step is the `smp_prepare_boot_cpu` function. This function does some preparations for the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing): - -```C -static inline void smp_prepare_boot_cpu(void) -{ - smp_ops.smp_prepare_boot_cpu(); -} -``` - -where the `smp_prepare_boot_cpu` expands to the call of the `native_smp_prepare_boot_cpu` function (more about `smp_ops` will be in the special parts about `SMP`): - -```C -void __init native_smp_prepare_boot_cpu(void) -{ - int me = smp_processor_id(); - switch_to_new_gdt(me); - cpumask_set_cpu(me, cpu_callout_mask); - per_cpu(cpu_state, me) = CPU_ONLINE; -} -``` - -The `native_smp_prepare_boot_cpu` function gets the number of the current CPU (which is Bootstrap processor and its `id` is zero) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we alread saw it in the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. As we got processor `id` number we reload [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function: - -```C -void switch_to_new_gdt(int cpu) -{ - struct desc_ptr gdt_descr; - - gdt_descr.address = (long)get_cpu_gdt_table(cpu); - gdt_descr.size = GDT_SIZE - 1; - load_gdt(&gdt_descr); - load_percpu_segment(cpu); -} -``` - -The `gdt_descr` variable represents pointer to the `GDT` descriptor here (we already saw `desc_ptr` in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). We get the address and the size of the `GDT` descriptor where `GDT_SIZE` is `256` or: - -```C -#define GDT_SIZE (GDT_ENTRIES * 8) -``` - -and the address of the descriptor we will get with the `get_cpu_gdt_table`: - -```C -static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu) -{ - return per_cpu(gdt_page, cpu).gdt; -} -``` - -The `get_cpu_gdt_table` uses `per_cpu` macro for getting `gdt_page` percpu variable for the given CPU number (bootstrap processor with `id` - 0 in our case). You can ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we alread saw it in this book. If you have read the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/head_64.S): - -```assembly -early_gdt_descr: - .word GDT_ENTRIES*8-1 -early_gdt_descr_base: - .quad INIT_PER_CPU_VAR(gdt_page) -``` - -and if we will look on the [linker](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) file we can see that it locates after the `__per_cpu_load` symbol: - -```C -#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load -INIT_PER_CPU(gdt_page); -``` - -and filled `gdt_page` in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c#L94): - -```C -DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { -#ifdef CONFIG_X86_64 - [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff), - [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff), - [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff), - [GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff), - [GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff), - [GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff), - ... - ... - ... -``` - -more about `percpu` variables you can read in the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) part. As we got address and size of the `GDT` descriptor we case reload `GDT` with the `load_gdt` which just execute `lgdt` instruct and load `percpu_segment` with the following function: - -```C -void load_percpu_segment(int cpu) { - loadsegment(gs, 0); - wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu)); - load_stack_canary_segment(); -} -``` - -The base address of the `percpu` area must contain `gs` register (or `fs` register for `x86`), so we are using `loadsegment` macro and pass `gs`. In the next step we writes the base address if the [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) stack and setup stack [canary](http://en.wikipedia.org/wiki/Buffer_overflow_protection) (this is only for `x86_32`). After we load new `GDT`, we fill `cpu_callout_mask` bitmap with the current cpu and set cpu state as online with the setting `cpu_state` percpu variable for the current processor - `CPU_ONLINE`: - -```C -cpumask_set_cpu(me, cpu_callout_mask); -per_cpu(cpu_state, me) = CPU_ONLINE; -``` - -So, what is it `cpu_callout_mask` bitmap... As we initialized bootstrap processor (procesoor which is booted the first on `x86`) the other processors in a multiprocessor system are known as `secondary processors`. Linux kernel uses two following bitmasks: - -* `cpu_callout_mask` -* `cpu_callin_mask` - -After bootstrap processor initialized, it updates the `cpu_callout_mask` to indicate which secondary processor can be initialized next. All other or secondary processors can do some initialization stuff before and check the `cpu_callout_mask` on the boostrap processor bit. Only after the bootstrap processor filled the `cpu_callout_mask` this secondary processor, it will continue the rest of its initialization. After that the certain processor will finish its initialization process, the processor sets bit in the `cpu_callin_mask`. Once the bootstrap processor finds the bit in the `cpu_callin_mask` for the current secondary processor, this processor repeats the same procedure for initialization of the rest of a secondary processors. In a short words it works as i described, but more details we will see in the chapter about `SMP`. - -That's all. We did all `SMP` boot preparation. - -Build zonelists ------------------------------------------------------------------------ - -In the next step we can see the call of the `build_all_zonelists` function. This function sets up the order of zones that allocations are preferred from. What are zones and what's order we will understand now. For the start let's see how linux kernel considers physical memory. Physical memory may be arranged into banks which are called - `nodes`. If you has no hardware with support for `NUMA`, you will see only one node: - -``` -$ cat /sys/devices/system/node/node0/numastat -numa_hit 72452442 -numa_miss 0 -numa_foreign 0 -interleave_hit 12925 -local_node 72452442 -other_node 0 -``` - -Every `node` presented by the `struct pglist data` in the linux kernel. Each node devided into a number of special blocks which are called - `zones`. Every zone presented by the `zone struct` in the linux kernel and has one of the type: - -* `ZONE_DMA` - 0-16M; -* `ZONE_DMA32` - used for 32 bit devices that can only do DMA areas below 4G; -* `ZONE_NORMAL` - all RAM from the 4GB on the `x86_64`; -* `ZONE_HIGHMEM` - absent on the `x86_64`; -* `ZONE_MOVABLE` - zone which contains movable pages. - -which are presented by the `zone_type` enum. Information about zones we can get with the: - -``` -$ cat /proc/zoneinfo -Node 0, zone DMA - pages free 3975 - min 3 - low 3 - ... - ... -Node 0, zone DMA32 - pages free 694163 - min 875 - low 1093 - ... - ... -Node 0, zone Normal - pages free 2529995 - min 3146 - low 3932 - ... - ... -``` - -As I wrote above all nodes are described with the `pglist_data` or `pg_data_t` structure in memory. This structure defined in the [include/linux/mmzone.h](https://github.com/torvalds/linux/blob/master/include/linux/mmzone.h). The `build_all_zonelists` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c) constructs an ordered `zonelist` (of different zones `DMA`, `DMA32`, `NORMAL`, `HIGH_MEMORY`, `MOVABLE`) which specifies the zones/nodes to visit when a selected `zone` or `node` cannot satisfy the allocation request. That's all. More about `NUMA` and multiprocessor systems will be in the special part. - -The rest of the stuff before scheduler initialization --------------------------------------------------------------------------------- - -Before we will start to dive into linux kernel scheduler initialization process we must to do a couple of things. The fisrt thing is the `page_alloc_init` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c). This function looks pretty easy: - -```C -void __init page_alloc_init(void) -{ - hotcpu_notifier(page_alloc_cpu_notify, 0); -} -``` - -and initializes handler for the `CPU` [hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). Of course the `hotcpu_notifier` depends on the -`CONFIG_HOTPLUG_CPU` configuration option and if this option is set, it just calls `cpu_notifier` macro which expands to the call of the `register_cpu_notifier` which adds hotplug cpu handler (`page_alloc_cpu_notify` in our case). - -After this we can see the kernel command line in the initialization output: - -![kernel command line](http://oi58.tinypic.com/2m7vz10.jpg) - -And a couple of functions as `parse_early_param` and `parse_args` which are handles linux kernel command line. You can remember that we already saw the call of the `parse_early_param` function in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why we call it again? Answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architecture calls this function. And we need in the call of the second function `parse_args` to parse and handle non-early command line arguments. - -In the next step we can see the call of the `jump_label_init` from the [kernel/jump_label.c](https://github.com/torvalds/linux/blob/master/kernel/jump_label.c). and initializes [jump label](https://lwn.net/Articles/412072/). - -After this we can see the call of the `setup_log_buf` function which setups the [printk](http://www.makelinux.net/books/lkd2/ch18lev1sec3) log buffer. We already saw this function in the seventh [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the linux kernel initialization process chapter. - -PID hash initialization --------------------------------------------------------------------------------- - -The next is `pidhash_init` function. As you know an each process has assigned unique number which called - `process identification number` or `PID`. Each process generated with fork or clone is automatically assigned a new unique `PID` value by the kernel. The management of `PIDs` centered around the two special data structures: `struct pid` and `struct upid`. First structure represents information about a `PID` in the kernel. The second structure represents the information that is visible in a specific namespace. All `PID` instances stored in the special hash table: - -```C -static struct hlist_head *pid_hash; -``` - -This hash table is used to find the pid instance that belongs to a numeric `PID` value. So, `pidhash_init` initializes this hash. In the start of the `pidhash_init` function we can see the call of the `alloc_large_system_hash`: - -```C -pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18, - HASH_EARLY | HASH_SMALL, - &pidhash_shift, NULL, - 0, 4096); -``` - -The number of elements of the `pid_hash` depends on the `RAM` configuration, but it can be between `2^4` and `2^12`. The `pidhash_init` computes the size -and allocates the required storage (which is `hlist` in our case - the same as [doubly linked list](http://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html), but contains one pointer instead on the [struct hlist_head](https://github.com/torvalds/linux/blob/master/include/linux/types.h)]. The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass `HASH_EARLY` flag (as it in our case) or with `__vmalloc` if we did no pass this flag. - -The result we can see in the `dmesg` output: - -``` -$ dmesg | grep hash -[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes) -... -... -... -``` - -That's all. The rest of the stuff before scheduler initialization is the following functions: `vfs_caches_init_early` does early initialization of the [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) (more about it will be in the chapter which will describe virtual file system), `sort_main_extable` sorts the kernel's built-in exception table entries which are between `__start___ex_table` and `__stop___ex_table,`, and `trap_init` initializies trap handlers (morea about last two function we will know in the separate chapter about interrupts). - -The last step before the scheduler initialization is initialization of the memory manager with the `mm_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As we can see, the `mm_init` function initializes different part of the linux kernel memory manager: - -```C -page_ext_init_flatmem(); -mem_init(); -kmem_cache_init(); -percpu_init_late(); -pgtable_init(); -vmalloc_init(); -``` - -The first is `page_ext_init_flatmem` depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes kernel cache, the `percpu_init_late` - replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` - initilizes the `vmalloc_init` - initializes `vmalloc`. Please, **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of they it in the [Linux kernem memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter. - -That's all. Now we can look on the `scheduler`. - -Scheduler initialization --------------------------------------------------------------------------------- - -And now we came to the main purpose of this part - initialization of the task scheduler. I want to say again as I did it already many times, you will not see the full explanation of the scheduler here, there will be special chapter about this. Ok, next point is the `sched_init` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) and as we can understand from the function's name, it initializes scheduler. Let's start to dive in this function and try to understand how the scheduler initialized. At the start of the `sched_init` function we can see the following code: - -```C -#ifdef CONFIG_FAIR_GROUP_SCHED - alloc_size += 2 * nr_cpu_ids * sizeof(void **); -#endif -#ifdef CONFIG_RT_GROUP_SCHED - alloc_size += 2 * nr_cpu_ids * sizeof(void **); -#endif -``` - -First of all we can see two configuration options here: - -* `CONFIG_FAIR_GROUP_SCHED` -* `CONFIG_RT_GROUP_SCHED` - -Both of this options provide two different planning models. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt), the current scheduler - `CFS` or `Completely Fair Scheduler` used a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive `1/n` processor time, where `n` is the number of the runnable processes. The scheduler uses the special set of rules used. These rules determine when and how to select a new process to run and they are called `scheduling policy`. The Completely Fair Scheduler supports following `normal` or `non-real-time` scheduling policies: `SCHED_NORMAL`, `SCHED_BATCH` and `SCHED_IDLE`. The `SCHED_NORMAL` is used for the most normal applications, the amount of cpu each process consumes is mostly determined by the [nice](http://en.wikipedia.org/wiki/Nice_%28Unix%29) value, the `SCHED_BATCH` used for the 100% non-interactive tasks and the `SCHED_IDLE` runs tasks only when the processor has not to run anything besides this task. The `real-time` policies are also supported for the time-critial applications: `SCHED_FIFO` and `SCHED_RR`. If you've read something about the Linux kernel scheduler, you can know that it is modular. It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called `scheduler classes`. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them. - - -Now let's back to the our code and look on the two configuration options `CONFIG_FAIR_GROUP_SCHED` and `CONFIG_RT_GROUP_SCHED`. The scheduler operates on an individual task. These options allows to schedule group tasks (more about it you can read in the [CFS group scheduling](http://lwn.net/Articles/240474/)). We can see that we assign the `alloc_size` variables which represent size based on amount of the processors to allocate for the `sched_entity` and `cfs_rq` to the `2 * nr_cpu_ids * sizeof(void **)` expression with `kzalloc`: - -```C -ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT); - -#ifdef CONFIG_FAIR_GROUP_SCHED - root_task_group.se = (struct sched_entity **)ptr; - ptr += nr_cpu_ids * sizeof(void **); - - root_task_group.cfs_rq = (struct cfs_rq **)ptr; - ptr += nr_cpu_ids * sizeof(void **); -#endif - -``` - -The `sched_entity` is struture which defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and used by the scheduler to keep track of process accounting. The `cfs_rq` presents [run queue](http://en.wikipedia.org/wiki/Run_queue). So, you can see that we allocated space with size `alloc_size` for the run queue and scheduler entity of the `root_task_group`. The `root_task_group` is an instance of the `task_group` structure from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) which contains task group related information: - -```C -struct task_group { - ... - ... - struct sched_entity **se; - struct cfs_rq **cfs_rq; - ... - ... -} -``` - -The root task group is the task group which belongs every task in system. As we allocated space for the root task group scheduler entity and runqueue, we go over all possible CPUs (`cpu_possible_mask` bitmap) and allocate zeroed memory from a particular memory node with the `kzalloc_node` function for the `load_balance_mask` `percpu` variable: - -```C -DECLARE_PER_CPU(cpumask_var_t, load_balance_mask); -``` - -Here `cpumask_var_t` is the `cpumask_t` with one difference: `cpumask_var_t` is allocated only `nr_cpu_ids` bits when the `cpumask_t` always has `NR_CPUS` bits (more about `cpumask` you can read in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part). As you can see: - -```C -#ifdef CONFIG_CPUMASK_OFFSTACK - for_each_possible_cpu(i) { - per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node( - cpumask_size(), GFP_KERNEL, cpu_to_node(i)); - } -#endif -``` - -this code depends on the `CONFIG_CPUMASK_OFFSTACK` configuration option. This configuration options says to use dynamic allocation for `cpumask`, instead of putting it on the stack. All groups have to be able to rely on the amount of CPU time. With the call of the two following functions: - -```C -init_rt_bandwidth(&def_rt_bandwidth, - global_rt_period(), global_rt_runtime()); -init_dl_bandwidth(&def_dl_bandwidth, - global_rt_period(), global_rt_runtime()); -``` - -we initialize bandwidth management for the `SCHED_DEADLINE` real-time tasks. These functions initializes `rt_bandwidth` and `dl_bandwidth` structures which are store information about maximum `deadline` bandwith of the system. For example, let's look on the implementation of the `init_rt_bandwidth` function: - -```C -void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime) -{ - rt_b->rt_period = ns_to_ktime(period); - rt_b->rt_runtime = runtime; - - raw_spin_lock_init(&rt_b->rt_runtime_lock); - - hrtimer_init(&rt_b->rt_period_timer, - CLOCK_MONOTONIC, HRTIMER_MODE_REL); - rt_b->rt_period_timer.function = sched_rt_period_timer; -} -``` - -It takes three parameters: - -* address of the `rt_bandwidth` structure which contains information about the allocated and consumed quota within a period; -* `period` - period over which real-time task bandwidth enforcement is measured in `us`; -* `runtime` - part of the period that we allow tasks to run in `us`. - -As `period` and `runtime` we pass result of the `global_rt_period` and `global_rt_runtime` functions. Which are `1s` second and and `0.95s` by default. The `rt_bandwidth` structure defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) and looks: - -```C -struct rt_bandwidth { - raw_spinlock_t rt_runtime_lock; - ktime_t rt_period; - u64 rt_runtime; - struct hrtimer rt_period_timer; -}; -``` - -As you can see, it contains `runtime` and `period` and also two following fields: - -* `rt_runtime_lock` - [spinlock](http://en.wikipedia.org/wiki/Spinlock) for the `rt_time` protection; -* `rt_period_timer` - [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt) for unthrottled of real-time tasks. - -So, in the `init_rt_bandwidth` we initialize `rt_bandwidth` period and runtime with the given parameters, initialize the spinlock and high-resolution time. In the next step, depends on the enabled [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), we make initialization of the root domain: - -```C -#ifdef CONFIG_SMP - init_defrootdomain(); -#endif -``` - -The real-time scheduler requires global resources to make scheduling decision. But unfortenatelly scalability bottlenecks appear as the number of CPUs increase. The concept of root domains was introduced for improving scalability. The linux kernel provides special mechanism for assigning a set of CPUs and memory nodes to a set of task and it is called - `cpuset`. If a `cpuset` contains non-overlapping with other `cpuset` CPUs, it is `exclusive cpuset`. Each exclusive cpuset defines an isolated domain or `root domain` of CPUs partitioned from other cpusets or CPUs. A `root domain` presented by the `struct root_domain` from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) in the linux kernel and its main purpose is to narrow the scope of the global variables to per-domain variables and all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details about it in the chapter about scheduling about real-time scheduler. - -After `root domain` initialization, we make initialization of the bandwidth for the real-time tasks of the root task group as we did it above: - -```C -#ifdef CONFIG_RT_GROUP_SCHED - init_rt_bandwidth(&root_task_group.rt_bandwidth, - global_rt_period(), global_rt_runtime()); -#endif -``` - -In the next step, depends on the `CONFIG_CGROUP_SCHED` kernel configuration option we initialze the `siblings` and `children` lists of the root task group. As we can read from the documentation, the `CONFIG_CGROUP_SCHED` is: - -``` -This option allows you to create arbitrary task groups using the "cgroup" pseudo -filesystem and control the cpu bandwidth allocated to each such task group. -``` - -As we finished with the lists initialization, we can see the call of the `autogroup_init` function: - -```C -#ifdef CONFIG_CGROUP_SCHED - list_add(&root_task_group.list, &task_groups); - INIT_LIST_HEAD(&root_task_group.children); - INIT_LIST_HEAD(&root_task_group.siblings); - autogroup_init(&init_task); -#endif -``` - -which initializes automatic process group scheduling. - -After this we are going through the all `possible` cpu (you can remember that `possible` CPUs store in the `cpu_possible_mask` bitmap of possible CPUs that can ever be available in the system) and initialize a `runqueue` for each possible cpu: - -```C -for_each_possible_cpu(i) { - struct rq *rq; - ... - ... - ... -``` - -Each processor has its own locking and individual runqueue. All runnalble tasks are stored in an active array and indexed according to its priority. When a process consumes its time slice, it is moved to an expired array. All of these arras are stored in the special structure which names is `runqueu`. As there are no global lock and runqueu, we are going through the all possible CPUs and initialize runqueue for the every cpu. The `runque` is presented by the `rq` structure in the linux kernel which defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h). - -```C -rq = cpu_rq(i); -raw_spin_lock_init(&rq->lock); -rq->nr_running = 0; -rq->calc_load_active = 0; -rq->calc_load_update = jiffies + LOAD_FREQ; -init_cfs_rq(&rq->cfs); -init_rt_rq(&rq->rt); -init_dl_rq(&rq->dl); -rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime; -``` - -Here we get the runque for the every CPU with the `cpu_rq` macto which returns `runqueues` percpu variable and start to initialize it with runqueu lock, number of running tasks, `calc_load` relative fields (`calc_load_active` and `calc_load_update`) which are used in the reckoning of a CPU load and initialization of the completely fair, real-time and deadline related fields in a runqueue. After this we initialize `cpu_load` array with zeros and set the last load update tick to the `jiffies` variable which determines the number of time ticks (cycles), since the system boot: - -```C -for (j = 0; j < CPU_LOAD_IDX_MAX; j++) - rq->cpu_load[j] = 0; - -rq->last_load_update_tick = jiffies; -``` - -where `cpu_load` keeps history of runqueue loads in the past, for now `CPU_LOAD_IDX_MAX` is 5. In the next step we fill `runqueue` fields which are related to the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), but we will not cover they in this part. And in the end of the loop we initialize high-resolution timer for the give `runqueue` and set the `iowait` (more about it in the separate part about scheduler) number: - -```C -init_rq_hrtick(rq); -atomic_set(&rq->nr_iowait, 0); -``` - -Now we came out from the `for_each_possible_cpu` loop and the next we need to set load weight for the `init` task with the `set_load_weight` function. Weight of process is calculated through its dynamic priority which is static priority + scheduling class of the process. After this we increase memory usage counter of the memory descriptor of the `init` process and set scheduler class for the current process: - -```C -atomic_inc(&init_mm.mm_count); -current->sched_class = &fair_sched_class; -``` - -And make current process (it will be the first `init` process) `idle` and update the value of the `calc_load_update` with the 5 seconds interval: - -```C -init_idle(current, smp_processor_id()); -calc_load_update = jiffies + LOAD_FREQ; -``` - -So, the `init` process will be run, when there will be no other candidates (as it is the first process in the system). In the end we just set `scheduler_running` variable: - -```C -scheduler_running = 1; -``` - -That's all. Linux kernel scheduler is initialized. Of course, we missed many different details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, rcu and etc...) works in the linux kernel , but we took a short look on the scheduler initialization process. All other details we will look in the separate part which will be fully dedicated to the scheduler. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the eighth part about the linux kernel initialization process. In this part, we looked on the initialization process of the scheduler and we will continue in the next part to dive in the linux kernel initialization process and will see initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and many more. - -and other initialization stuff in the next part. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) -* [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt) -* [spinlock](http://en.wikipedia.org/wiki/Spinlock) -* [Run queue](http://en.wikipedia.org/wiki/Run_queue) -* [Linux kernem memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) -* [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29) -* [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) -* [Linux kernel hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) -* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) -* [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) -* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) -* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) -* [RCU](http://en.wikipedia.org/wiki/Read-copy-update) -* [CFS Scheduler documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt) -* [Real-Time group scheduling](https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) diff --git a/Initialization/linux-initialization-9.md b/Initialization/linux-initialization-9.md deleted file mode 100644 index a6489ae..0000000 --- a/Initialization/linux-initialization-9.md +++ /dev/null @@ -1,430 +0,0 @@ -Kernel initialization. Part 9. -================================================================================ - -RCU initialization -================================================================================ - -This is ninth part of the [Linux Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) after the `sched_init` is the call of the `preempt_disablepreempt_disable`. There are two macros: - -* `preempt_disable` -* `preempt_enable` - -for preemption disabling and enabling. First of all let's try to understand what is it `preempt` in the context of an operating system kernel. In a simple words, preemption is ability of the operating system kernel to preempt current task to run task with higher priority. Here we need to disable preemption because we will have only one `init` process for the early boot time and we no need to stop it before we will call `cpu_idle` function. The `preempt_disable` macro defined in the [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) and depends on the `CONFIG_PREEMPT_COUNT` kernel configuration option. This maco implemeted as: - -```C -#define preempt_disable() \ -do { \ - preempt_count_inc(); \ - barrier(); \ -} while (0) -``` - -and if `CONFIG_PREEMPT_COUNT` is not set just: - -```C -#define preempt_disable() barrier() -``` - -Let's look on it. First of all we can see one difference between these macro implementations. The `preempt_disable` with `CONFIG_PREEMPT_COUNT` contains the call of the `preempt_count_inc`. There is special `percpu` variable which stores the number of held locks and `preempt_disable` calls: - -```C -DECLARE_PER_CPU(int, __preempt_count); -``` - -In the first implementation of the `preempt_disable` we increment this `__preempt_count`. There is API for returning value of the `__preempt_count`, it is the `preempt_count` function. As we called `preempt_disable`, first of all we increment preemption counter with the `preempt_count_inc` macro which expands to the: - -``` -#define preempt_count_inc() preempt_count_add(1) -#define preempt_count_add(val) __preempt_count_add(val) -``` - -where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count`) in our case (more about `precpu` variables you can read in the part about [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). Ok, we increased `__preempt_count` and th next step we can see the call of the `barrier` macro in the both macros. The `barrier` macro inserts an optimization barrier. In the processors with `x86_64` architecture independent memory access operations can be performed in any order. That's why we need in the oportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider simple example: - -```C -preempt_disable(); -foo(); -preempt_enable(); -``` - -Compiler can rearrange it as: - -```C -preempt_disable(); -preempt_enable(); -foo(); -``` - -In this case non-preemptible function `foo` can be preempted. As we put `barrier` macro in the `preempt_disable` and `preempt_enable` macros, it prevents the compiler from swapping `preempt_count_inc` with other statements. More about barriers you can read [here](http://en.wikipedia.org/wiki/Memory_barrier) and [here](https://www.kernel.org/doc/Documentation/memory-barriers.txt). - -In the next step we can see following statement: - -```C -if (WARN(!irqs_disabled(), - "Interrupts were enabled *very* early, fixing it\n")) - local_irq_disable(); -``` - -which check [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) state, and disabling (with `cli` instruction for `x86_64`) if they are enabled. - -That's all. Preemption is disabled and we can go ahead. - -Initialization of the integer ID management --------------------------------------------------------------------------------- - -In the next step we can see the call of the `idr_init_cache` function which defined in the [lib/idr.c](https://github.com/torvalds/linux/blob/master/lib/idr.c). The `idr` library used in a various [places](http://lxr.free-electrons.com/ident?i=idr_find) in the linux kernel to manage assigning integer `IDs` to objects and looking up objects by id. - -Let's look on the implementation of the `idr_init_cache` function: - -```C -void __init idr_init_cache(void) -{ - idr_layer_cache = kmem_cache_create("idr_layer_cache", - sizeof(struct idr_layer), 0, SLAB_PANIC, NULL); -} -``` - -Here we can see the call of the `kmem_cache_create`. We already called the `kmem_cache_init` in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L485). This function create generalized caches again using the `kmem_cache_alloc` (more about caches we will see in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). In our case, as we are using `kmem_cache_t` it will be used the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator and `kmem_cache_create` creates it. As you can seee we pass five parameters to the `kmem_cache_create`: - -* name of the cache; -* size of the object to store in cache; -* offset of the first object in the page; -* flags; -* constructor for the objects. - -and it will create `kmem_cache` for the integer IDs. Integer `IDs` is commonly used pattern for the to map set of integer IDs to the set of pointers. We can see usage of the integer IDs for example in the [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) drivers subsystem. For example [drivers/i2c/i2c-core.c]((https://github.com/torvalds/linux/blob/master/drivers/i2c/i2c-core) which presentes the core of the `i2c` subsystem defines `ID` for the `i2c` adapter with the `DEFINE_IDR` macro: - -```C -static DEFINE_IDR(i2c_adapter_idr); -``` - -and than it uses it for the declaration of the `i2c` adapter: - -```C -static int __i2c_add_numbered_adapter(struct i2c_adapter *adap) -{ - int id; - ... - ... - ... - id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL); - ... - ... - ... -} -``` - -and `id2_adapter_idr` presents dynamically calculated bus number. - -More about integer ID management you can read [here](https://lwn.net/Articles/103209/). - -RCU initialization --------------------------------------------------------------------------------- - -The next step is [RCU](http://en.wikipedia.org/wiki/Read-copy-update) initialization with the `rcu_init` function and it's implementation depends on two kernel configuration options: - -* `CONFIG_TINY_RCU` -* `CONFIG_TREE_RCU` - -In the first case `rcu_init` will be in the [kernel/rcu/tiny.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tiny.c) and in the second case it will be defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c). We will see the implementation of the `tree rcu`, but first of all about the `RCU` in general. - -`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the linux kernel provided support and environment for the concurently running applications, but all execution was serialized in the kernel using a single global lock. In our days linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) data structures and other. One of these mechanisms is - the `read-copy update`. The `RCU` technique designed for rarely-modified data structures. The idea of the `RCU` is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. In the same time all other users of the data structure use old version of it. Next, we need to choose safe moment when original version of the data structure will have no users and update it with the modified copy. - -Of course this description of the `RCU` is very simplified. To understand some details about `RCU`, first of all we need to learn some terminology. Data readers in the `RCU` executed in the [critical section](http://en.wikipedia.org/wiki/Critical_section). Everytime when data reader joins to the critical section, it calls the `rcu_read_lock`, and `rcu_read_unlock` on exit from the critical section. If the thread is not in the critical section, it will be in state which called - `quiescent state`. Every moment when every thread was in the `quiescent state` called - `grace period`. If a thread wants to remove element from the data structure, this occurs in two steps. First steps is `removal` - atomically removes element from the data structure, but does not release the physical memory. After this thread-writer announces and waits while it will be finsihed. From this moment, the removed element is available to the thread-readers. After the `grace perioud` will be finished, the second step of the element removal will be started, it just removes element from the physical memory. - -There a couple implementations of the `RCU`. Old `RCU` called classic, the new implemetation called `tree` RCU. As you already can undrestand, the `CONFIG_TREE_RCU` kernel configuration option enables tree `RCU`. Another is the `tiny` RCU which depends on `CONFIG_TINY_RCU` and `CONFIG_SMP=n`. We will see more details about the `RCU` in general in the separate chapter about synchronization primitives, but now let's look on the `rcu_init` implementation from the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c): - -```C -void __init rcu_init(void) -{ - int cpu; - - rcu_bootup_announce(); - rcu_init_geometry(); - rcu_init_one(&rcu_bh_state, &rcu_bh_data); - rcu_init_one(&rcu_sched_state, &rcu_sched_data); - __rcu_init_preempt(); - open_softirq(RCU_SOFTIRQ, rcu_process_callbacks); - - /* - * We don't need protection against CPU-hotplug here because - * this is called early in boot, before either interrupts - * or the scheduler are operational. - */ - cpu_notifier(rcu_cpu_notify, 0); - pm_notifier(rcu_pm_notify, 0); - for_each_online_cpu(cpu) - rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu); - - rcu_early_boot_tests(); -} -``` - -In the beginning of the `rcu_init` function we define `cpu` variable and call `rcu_bootup_announce`. The `rcu_bootup_announce` function is pretty simple: - -```C -static void __init rcu_bootup_announce(void) -{ - pr_info("Hierarchical RCU implementation.\n"); - rcu_bootup_announce_oddness(); -} -``` - -It just prints information about the `RCU` with the `pr_info` function and `rcu_bootup_announce_oddness` which uses `pr_info` too, for printing different information about the current `RCU` configuration which depends on different kernel configuration options like `CONFIG_RCU_TRACE`, `CONFIG_PROVE_RCU`, `CONFIG_RCU_FANOUT_EXACT` and etc... In the next step, we can see the call of the `rcu_init_geometry` function. This function defined in the same source code file and computes the node tree geometry depends on amount of CPUs. Actually `RCU` provides scalability with extremely low internal to RCU lock contention. What if a data structure will be read from the different CPUs? `RCU` API provides the `rcu_state` structure wihch presents RCU global state including node hierarchy. Hierachy presented by the: - -``` -struct rcu_node node[NUM_RCU_NODES]; -``` - -array of structures. As we can read in the comment which is above definition of this structure: - -``` -The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second -level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level -in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is -determined by the number of CPUs and by CONFIG_RCU_FANOUT. - -Small systems will have a "hierarchy" consisting of a single rcu_node. -``` - -The `rcu_node` structure defined in the [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) and contains information about current grace period, is grace period completed or not, CPUs or groups that need to switch in order for current grace period to proceed and etc... Every `rcu_node` contains a lock for a couple of CPUs. These `rcu_node` structures embedded into a linear array in the `rcu_state` structure and represeted as a tree with the root in the zero element and it covers all CPUs. As you can see the number of the rcu nodes determined by the `NUM_RCU_NODES` which depends on number of available CPUs: - -```C -#define NUM_RCU_NODES (RCU_SUM - NR_CPUS) -#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4) -``` - -where levels values depend on the `CONFIG_RCU_FANOUT_LEAF` configuration option. For example for the simplest case, one `rcu_node` will cover two CPU on machine with the eight CPUs: - -``` -+-----------------------------------------------------------------+ -| rcu_state | -| +----------------------+ | -| | root | | -| | rcu_node | | -| +----------------------+ | -| | | | -| +----v-----+ +--v-------+ | -| | | | | | -| | rcu_node | | rcu_node | | -| | | | | | -| +------------------+ +----------------+ | -| | | | | | -| | | | | | -| +----v-----+ +-------v--+ +-v--------+ +-v--------+ | -| | | | | | | | | | -| | rcu_node | | rcu_node | | rcu_node | | rcu_node | | -| | | | | | | | | | -| +----------+ +----------+ +----------+ +----------+ | -| | | | | | -| | | | | | -| | | | | | -| | | | | | -+---------|-----------------|-------------|---------------|-------+ - | | | | -+---------v-----------------v-------------v---------------v--------+ -| | | | | -| CPU1 | CPU3 | CPU5 | CPU7 | -| | | | | -| CPU2 | CPU4 | CPU6 | CPU8 | -| | | | | -+------------------------------------------------------------------+ -``` - -So, in the `rcu_init_geometry` function we just need to calculate the total number of `rcu_node` structures. We start to do it with the calculation of the `jiffies` till to the first and next `fqs` which is `force-quiescent-state` (read above about it): - -```C -d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; -if (jiffies_till_first_fqs == ULONG_MAX) - jiffies_till_first_fqs = d; -if (jiffies_till_next_fqs == ULONG_MAX) - jiffies_till_next_fqs = d; -``` - -where: - -```C -#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500)) -#define RCU_JIFFIES_FQS_DIV 256 -``` - -As we calculated these [jiffies](http://en.wikipedia.org/wiki/Jiffy_%28time%29), we check that previous defined `jiffies_till_first_fqs` and `jiffies_till_next_fqs` variables are equal to the [ULONG_MAX](http://www.rowleydownload.co.uk/avr/documentation/index.htm?http://www.rowleydownload.co.uk/avr/documentation/ULONG_MAX.htm) (their default values) and set they equal to the calculated value. As we did not touch these variables before, they are equal to the `ULONG_MAX`: - -```C -static ulong jiffies_till_first_fqs = ULONG_MAX; -static ulong jiffies_till_next_fqs = ULONG_MAX; -``` - -In the next step of the `rcu_init_geometry`, we check that `rcu_fanout_leaf` didn't chage (it has the same value as `CONFIG_RCU_FANOUT_LEAF` in compile-time) and equal to the value of the `CONFIG_RCU_FANOUT_LEAF` configuration option, we just return: - -```C -if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF && - nr_cpu_ids == NR_CPUS) - return; -``` - -After this we need to compute the number of nodes that can be handled an `rcu_node` tree with the given number of levels: - -```C -rcu_capacity[0] = 1; -rcu_capacity[1] = rcu_fanout_leaf; -for (i = 2; i <= MAX_RCU_LVLS; i++) - rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT; -``` - -And in the last step we calcluate the number of rcu_nodes at each level of the tree in the [loop](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c#L4094). - -As we calculated geometry of the `rcu_node` tree, we need to back to the `rcu_init` function and next step we need to initialize two `rcu_state` structures with the `rcu_init_one` function: - -```C -rcu_init_one(&rcu_bh_state, &rcu_bh_data); -rcu_init_one(&rcu_sched_state, &rcu_sched_data); -``` - -The `rcu_init_one` function takes two arguments: - -* Global `RCU` state; -* Per-CPU data for `RCU`. - -Both variables defined in the [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) with its `percpu` data: - -``` -extern struct rcu_state rcu_bh_state; -DECLARE_PER_CPU(struct rcu_data, rcu_bh_data); -``` - -About this states you can read [here](http://lwn.net/Articles/264090/). As I wrote above we need to initialize `rcu_state` structures and `rcu_init_one` function will help us with it. After the `rcu_state` initialization, we can see the call of the ` __rcu_init_preempt` which depends on the `CONFIG_PREEMPT_RCU` kernel configuration option. It does the same that previous functions - initialization of the `rcu_preempt_state` structure with the `rcu_init_one` function which has `rcu_state` type. After this, in the `rcu_init`, we can see the call of the: - -```C -open_softirq(RCU_SOFTIRQ, rcu_process_callbacks); -``` - -function. This function registers a handler of the `pending interrupt`. Pending interrupt or `softirq` supposes that part of actions cab be delayed for later execution when the system will be less loaded. Pending interrupts represeted by the following structure: - -```C -struct softirq_action -{ - void (*action)(struct softirq_action *); -}; -``` - -which defined in the [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) and contains only one field - handler of an interrupt. You can know about `softirqs` in the your system with the: - -``` -$ cat /proc/softirqs - CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 - HI: 2 0 0 1 0 2 0 0 - TIMER: 137779 108110 139573 107647 107408 114972 99653 98665 - NET_TX: 1127 0 4 0 1 1 0 0 - NET_RX: 334 221 132939 3076 451 361 292 303 - BLOCK: 5253 5596 8 779 2016 37442 28 2855 -BLOCK_IOPOLL: 0 0 0 0 0 0 0 0 - TASKLET: 66 0 2916 113 0 24 26708 0 - SCHED: 102350 75950 91705 75356 75323 82627 69279 69914 - HRTIMER: 510 302 368 260 219 255 248 246 - RCU: 81290 68062 82979 69015 68390 69385 63304 63473 -``` - -The `open_softirq` function takes two parameters: - -* index of the interrupt; -* interrupt handler. - -and adds interrupt handler to the array of the pending interrupts: - -```C -void open_softirq(int nr, void (*action)(struct softirq_action *)) -{ - softirq_vec[nr].action = action; -} -``` - -In our case the interrupt handler is - `rcu_process_callbacks` which defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c) and does the `RCU` core processing for the current CPU. After we registered `softirq` interrupt for the `RCU`, we can see the following code: - -```C -cpu_notifier(rcu_cpu_notify, 0); -pm_notifier(rcu_pm_notify, 0); -for_each_online_cpu(cpu) - rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu); -``` - -Here we can see registration of the `cpu` notifier which needs in sysmtems which supports [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) and we will not dive into details about this theme. The last function in the `rcu_init` is the `rcu_early_boot_tests`: - -```C -void rcu_early_boot_tests(void) -{ - pr_info("Running RCU self tests\n"); - - if (rcu_self_test) - early_boot_test_call_rcu(); - if (rcu_self_test_bh) - early_boot_test_call_rcu_bh(); - if (rcu_self_test_sched) - early_boot_test_call_rcu_sched(); -} -``` - -which runs self tests for the `RCU`. - -That's all. We saw initialization process of the `RCU` subsystem. As I wrote above, more about the `RCU` will be in the separate chapter about synchronization primitives. - -Rest of the initialization process --------------------------------------------------------------------------------- - -Ok, we already passed the main theme of this part which is `RCU` initialization, but it is not the end of the linux kernel initialization process. In the last paragraph of this theme we will see a couple of functions which work in the initialization time, but we will not dive into deep details around this function by different reasons. Some reasons not to dive into details are following: - -* They are not very important for the generic kernel initialization process and can depend on the different kernel configuration; -* They have the character of debugging and not important too for now; -* We will see many of this stuff in the separate parts/chapters. - -After we initilized `RCU`, the next step which you can see in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the - `trace_init` function. As you can understand from its name, this function initialize [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. More about linux kernel trace system you can read - [here](http://elinux.org/Kernel_Trace_Systems). - -After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function defined in the [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c) and more about it you can read in the part about [Radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.md). - -In the next step we can see the functions which are related to the `interrupts handling` subsystem, they are: - -* `early_irq_init` -* `init_IRQ` -* `softirq_init` - -We will see explanation about this functions and their implementation in the special part about interrupts and exceptions handling. After this many different functions (like `init_timers`, `hrtimers_init`, `time_init` and etc...) which are related to different timing and timers stuff. More about these function we will see in the chapter about timers. - -The next couple of functions related with the [perf](https://perf.wiki.kernel.org/index.php/Main_Page) events - `perf_event-init` (will be separate chapter about perf), initialization of the `profiling` with the `profile_init`. After this we enable `irq` with the call of the: - -```C -local_irq_enable(); -``` - -which expands to the `sti` instruction and making post initialization of the [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) with the call of the `kmem_cache_init_late` function (As I wrote above we will know about the `SLAB` in the [Linux memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). - -After the post initialization of the `SLAB`, next point is initialization of the console with the `console_init` function from the [drivers/tty/tty_io.c](https://github.com/torvalds/linux/blob/master/drivers/tty/tty_io.c). - -After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). After this, we can see the initialization of the dynamic allocation of the `debug objects` with the `debug_objects_mem_init`, kernel memory leack [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with the `kmemleak_init`, `percpu` pageset setup with the `setup_per_cpu_pageset`, setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with the `numa_policy_init`, setting time for the scheduler with the `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with the `anon_vma_init` for the private virtual memory areas and early initialization of the [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with the `acpi_early_init`. - -This is the end of the ninth part of the [linux kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we went thorugh the many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these stuff or you know and do not understand anything about this. As I wrote already many times, we will see details of implementations, but in the other parts or other chapters. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the ninth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked on the initialization process of the `RCU` subsystem. In the next part we will continue to dive into linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file and will see that start of the first process. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure) -* [kmemleak](https://www.kernel.org/doc/Documentation/kmemleak.txt) -* [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) -* [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) -* [RCU](http://en.wikipedia.org/wiki/Read-copy-update) -* [RCU documentation](https://github.com/torvalds/linux/tree/master/Documentation/RCU) -* [integer ID management](https://lwn.net/Articles/103209/) -* [Documentation/memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt) -* [Runtime locking correctness validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) -* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) -* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) -* [slab](http://en.wikipedia.org/wiki/Slab_allocation) -* [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) diff --git a/LINKS.md b/LINKS.md deleted file mode 100644 index d953b7f..0000000 --- a/LINKS.md +++ /dev/null @@ -1,46 +0,0 @@ -Useful links -======================== - -Linux boot ------------------------- - -* [Linux/x86 boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) -* [Linux kernel parameters](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt) - -Protected mode ------------------------- - -* [64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) - -Serial programming ------------------------- - -* [8250 UART Programming](http://en.wikibooks.org/wiki/Serial_Programming/8250_UART_Programming#UART_Registers) -* [Serial ports on OSDEV](http://wiki.osdev.org/Serial_Ports) - -VGA ------------------------- - -* [Video Graphics Array (VGA)](http://en.wikipedia.org/wiki/Video_Graphics_Array) - -IO ------------------------- - -* [IO port programming](http://www.tldp.org/HOWTO/text/IO-Port-Programming) - -GCC and GAS ------------------------- - -* [GCC type attributes](https://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html) -* [Assembler Directives](http://www.chemie.fu-berlin.de/chemnet/use/info/gas/gas_toc.html#TOC65) - - -Important data structures --------------------------- - -* [task_struct definition](http://lxr.free-electrons.com/source/include/linux/sched.h#L1274) - -Other architectures ------------------------- - -* [PowerPC and Linux Kernel Inside](http://www.systemcomputing.org/ppc/) diff --git a/Theory/ELF.md b/Theory/ELF.md deleted file mode 100644 index 325b343..0000000 --- a/Theory/ELF.md +++ /dev/null @@ -1,216 +0,0 @@ -Executable and Linkable Format -================================================================================ - -ELF (Executable and Linkable Format) is a standard file format for executable files and shared libraries. Linux as many UNIX-like operating systems uses this format. Let's look on structure of the ELF-64 Object File Format and some defintions in the linux kernel source code related with it. - -An ELF object file consists of the following parts: - -* ELF header - describes the main characteristics of the object file: type, CPU architecture, the virtual address of the entry point, the size and offset the remaining parts, etc...; -* Program header table - listing the available segments and their attributes. Program header table need loaders for placing sections of the file as virtual memory segments; -* Section header table - contains description of the sections. - -Now let's look closer on these components. - -**ELF header** - -It's located in the beginning of the object file. It's main point is to locate all other parts of the object file. File header contains following fields: - -* ELF identification - array of bytes which helps to identify the file as an ELF object file and also provides information about general object file characteristic; -* Object file type - identifies the object file type. This field can describe that ELF file is relocatable object file, executable file, etc...; -* Target architecture; -* Version of the object file format; -* Virtual address of the program entry point; -* File offset of the program header table; -* File offset of the section header table; -* Size of an ELF header; -* Size of a program header table entry; -* and other fields... - -You can find `elf64_hdr` structure which presents ELF64 header in the linux kernel source code: - -```C -typedef struct elf64_hdr { - unsigned char e_ident[EI_NIDENT]; - Elf64_Half e_type; - Elf64_Half e_machine; - Elf64_Word e_version; - Elf64_Addr e_entry; - Elf64_Off e_phoff; - Elf64_Off e_shoff; - Elf64_Word e_flags; - Elf64_Half e_ehsize; - Elf64_Half e_phentsize; - Elf64_Half e_phnum; - Elf64_Half e_shentsize; - Elf64_Half e_shnum; - Elf64_Half e_shstrndx; -} Elf64_Ehdr; -``` - -This structure defined in the [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h) - -**Sections** - -All data stores in a sections in an Elf object file. Sections identified by index in the section header table. Section header contains following fields: - -* Section name; -* Section type; -* Section attributes; -* Virtual address in memory; -* Offset in file; -* Size of section; -* Link to other section; -* Miscellaneous information; -* Address alignment boundary; -* Size of entries, if section has table; - -And presented with the following `elf64_shdr` structure in the linux kernel: - -```C -typedef struct elf64_shdr { - Elf64_Word sh_name; - Elf64_Word sh_type; - Elf64_Xword sh_flags; - Elf64_Addr sh_addr; - Elf64_Off sh_offset; - Elf64_Xword sh_size; - Elf64_Word sh_link; - Elf64_Word sh_info; - Elf64_Xword sh_addralign; - Elf64_Xword sh_entsize; -} Elf64_Shdr; -``` - -**Program header table** - -All sections are grouped into segments in an executable or shared object file. Program header is an array of structures which describe every segment. It looks like: - -```C -typedef struct elf64_phdr { - Elf64_Word p_type; - Elf64_Word p_flags; - Elf64_Off p_offset; - Elf64_Addr p_vaddr; - Elf64_Addr p_paddr; - Elf64_Xword p_filesz; - Elf64_Xword p_memsz; - Elf64_Xword p_align; -} Elf64_Phdr; -``` - -in the linux kernel source code. - -`elf64_phdr` defined in the same [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h). - -And ELF object file also contains other fields/structures which you can find in the [Documentation](http://downloads.openwatcom.org/ftp/devel/docs/elf-64-gen.pdf). Better let's look on the `vmlinux`. - -vmlinux --------------------------------------------------------------------------------- - -`vmlinux` is relocatable ELF object file too. So we can look on it with the `readelf` util. First of all let's look on a header: - -``` -$ readelf -h vmlinux -ELF Header: - Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 - Class: ELF64 - Data: 2's complement, little endian - Version: 1 (current) - OS/ABI: UNIX - System V - ABI Version: 0 - Type: EXEC (Executable file) - Machine: Advanced Micro Devices X86-64 - Version: 0x1 - Entry point address: 0x1000000 - Start of program headers: 64 (bytes into file) - Start of section headers: 381608416 (bytes into file) - Flags: 0x0 - Size of this header: 64 (bytes) - Size of program headers: 56 (bytes) - Number of program headers: 5 - Size of section headers: 64 (bytes) - Number of section headers: 73 - Section header string table index: 70 -``` - -Here we can see that `vmlinux` is 64-bit executable file. - -We can read from the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt): - -``` -ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0 -``` - -So we can find it in the `vmlinux` with: - -``` -readelf -s vmlinux | grep ffffffff81000000 - 1: ffffffff81000000 0 SECTION LOCAL DEFAULT 1 - 65099: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 _text - 90766: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 startup_64 -``` - -Note that here is address of the `startup_64` routine is not `ffffffff80000000`, but `ffffffff81000000` and now i'll explain why. - -We can see following definition in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S): - -``` - . = __START_KERNEL; - ... - ... - .. - /* Text and read-only data */ - .text : AT(ADDR(.text) - LOAD_OFFSET) { - _text = .; - ... - ... - ... - } -``` - -Where `__START_KERNEL` is: - -``` -#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) -``` - -`__START_KERNEL_map` is the value from documentation - `ffffffff80000000` and `__PHYSICAL_START` is `0x1000000`. That's why address of the `startup_64` is `ffffffff81000000`. - -And the last we can get program headers from `vmlinux` with the following command: - -``` -readelf -l vmlinux - -Elf file type is EXEC (Executable file) -Entry point 0x1000000 -There are 5 program headers, starting at offset 64 - -Program Headers: - Type Offset VirtAddr PhysAddr - FileSiz MemSiz Flags Align - LOAD 0x0000000000200000 0xffffffff81000000 0x0000000001000000 - 0x0000000000cfd000 0x0000000000cfd000 R E 200000 - LOAD 0x0000000001000000 0xffffffff81e00000 0x0000000001e00000 - 0x0000000000100000 0x0000000000100000 RW 200000 - LOAD 0x0000000001200000 0x0000000000000000 0x0000000001f00000 - 0x0000000000014d98 0x0000000000014d98 RW 200000 - LOAD 0x0000000001315000 0xffffffff81f15000 0x0000000001f15000 - 0x000000000011d000 0x0000000000279000 RWE 200000 - NOTE 0x0000000000b17284 0xffffffff81917284 0x0000000001917284 - 0x0000000000000024 0x0000000000000024 4 - - Section to Segment mapping: - Segment Sections... - 00 .text .notes __ex_table .rodata __bug_table .pci_fixup .builtin_fw - .tracedata __ksymtab __ksymtab_gpl __kcrctab __kcrctab_gpl - __ksymtab_strings __param __modver - 01 .data .vvar - 02 .data..percpu - 03 .init.text .init.data .x86_cpu_dev.init .altinstructions - .altinstr_replacement .iommu_table .apicdrivers .exit.text - .smp_locks .data_nosave .bss .brk -``` - -Here we can see five segments with sections list. All of these sections you can find in the generated linker script at - `arch/x86/kernel/vmlinux.lds`. - -That's all. Of course it's not a full description of ELF object format, but if you are interesting in it, you can find documentation - [here](http://downloads.openwatcom.org/ftp/devel/docs/elf-64-gen.pdf) diff --git a/Theory/Paging.md b/Theory/Paging.md deleted file mode 100644 index c0b896f..0000000 --- a/Theory/Paging.md +++ /dev/null @@ -1,263 +0,0 @@ -Paging -================================================================================ - -Introduction --------------------------------------------------------------------------------- - -In the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we finished to learn what and how kernel does on the earliest stage. In the next step kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many different things, before we can see how the kernel will run the first init process. - -Yeah, there will be many different things, but many many and once again many work with **memory**. - -In my view, memory management is one of the most complex part of the linux kernel and in system programming generally. So before we will proceed with the kernel initialization stuff, we will get acquainted with the paging. - -`Paging` is a process of translation a linear memory address to a physical address. If you have read previous parts, you can remember that we saw segmentation in the real mode when physical address calculated by shifting a segment register on four and adding offset. Or also we saw segmentation in the protected mode, where we used the tables of descriptors and base addresses from descriptors with offsets to calculate physical addresses. Now we are in 64-bit mode and that we will see paging. - -As Intel manual says: - -> Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program’s execution environment are mapped into physical memory as needed. - -So... I will try to explain how paging works in theory in this post. Of course it will be closely related with the linux kernel for `x86_64`, but we will not go into deep details (at least in this post). - -Enabling paging --------------------------------------------------------------------------------- - -There are three paging modes: - -* 32-bit paging; -* PAE paging; -* IA-32e paging. - -We will see explanation only last mode here. To enable `IA-32e paging` paging mode need to do following things: - -* set `CR0.PG` bit; -* set `CR4.PAE` bit; -* set `IA32_EFER.LME` bit. - -We already saw setting of this bits in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S): - -```assembly -movl $(X86_CR0_PG | X86_CR0_PE), %eax -movl %eax, %cr0 -``` - -and - -```assembly -movl $MSR_EFER, %ecx -rdmsr -btsl $_EFER_LME, %eax -wrmsr -``` - -Paging structures --------------------------------------------------------------------------------- - -Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or even external storage. This fixed size is `4096` bytes for the `x86_64` linux kernel. For a linear address translation to a physical address used special structures. Every structure is `4096` bytes size and contains `512` entries (this only for `PAE` and `IA32_EFER.LME` modes). Paging structures are hierarchical and linux kernel uses 4 level paging for `x86_64`. CPU uses a part of the linear address to identify entry of the another paging structure which is at the lower level or physical memory region (`page frame`) or physical address in this region (`page offset`). The address of the top level paging structure located in the `cr3` register. We already saw this in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S): - -```assembly -leal pgtable(%ebx), %eax -movl %eax, %cr3 -``` - -We built page table structures and put the address of the top-level structure to the `cr3` register. Here `cr3` is used to store the address of the top-level `PML4` structure or `Page Global Directory` as it calls in linux kernel. `cr3` is 64-bit register and has the following structure: - -``` -63 52 51 32 - -------------------------------------------------------------------------------- -| | | -| Reserved MBZ | Address of the top level structure | -| | | - -------------------------------------------------------------------------------- -31 12 11 5 4 3 2 0 - -------------------------------------------------------------------------------- -| | | P | P | | -| Address of the top level structure | Reserved | C | W | Reserved | -| | | D | T | | - -------------------------------------------------------------------------------- -``` - -These fields have the following meanings: - -* Bits 2:0 - ignored; -* Bits 51:12 - stores the address of the top level paging structure; -* Bit 3 and 4 - PWT or Page-Level Writethrough and PCD or Page-level cache disable indicate. These bits control the way the page or Page Table is handled by the hardware cache; -* Reserved - reserved must be 0; -* Bits 63:52 - reserved must be 0. - -The linear address translation address is following: - -* Given linear address arrives to the [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) instead of memory bus. -* 64-bit linear address splits on some parts. Only low 48 bits are significant, it means that `2^48` or 256 TBytes of linear-address space may be accessed at any given time. -* `cr3` register stores the address of the 4 top-level paging structure. -* `47:39` bits of the given linear address stores an index into the paging structure level-4, `38:30` bits stores index into the paging structure level-3, `29:21` bits stores an index into the paging structure level-2, `20:12` bits stores an index into the paging structure level-1 and `11:0` bits provide the byte offset into the physical page. - -schematically, we can imagine it like this: - -![4-level paging](http://oi58.tinypic.com/207mb0x.jpg) - -Every access to a linear address is either a supervisor-mode access or a user-mode access. This access determined by the `CPL` (current privilege level). If `CPL < 3` it is a supervisor mode access level and user mode access level in other ways. For example top level page table entry contains access bits and has the following structure: - -``` -63 62 52 51 32 - -------------------------------------------------------------------------------- -| N | | | -| | Available | Address of the paging structure on lower level | -| X | | | - -------------------------------------------------------------------------------- -31 12 11 9 8 7 6 5 4 3 2 1 0 - -------------------------------------------------------------------------------- -| | | M |I| | P | P |U|W| | -| Address of the paging structure on lower level | AVL | B |G|A| C | W | | | P | -| | | Z |N| | D | T |S|R| | - -------------------------------------------------------------------------------- -``` - -Where: - -* 63 bit - N/X bit (No Execute Bit) - presents ability to execute the code from physical pages mapped by the table entry; -* 62:52 bits - ignored by CPU, used by system software; -* 51:12 bits - stores physical address of the lower level paging structure; -* 12:9 bits - ignored by CPU; -* MBZ - must be zero bits; -* Ignored bits; -* A - accessed bit indicates was physical page or page structure accessed; -* PWT and PCD used for cache; -* U/S - user/supervisor bit controls user access to the all physical pages mapped by this table entry; -* R/W - read/write bit controls read/write access to the all physical pages mapped by this table entry; -* P - present bit. Current bit indicates was page table or physical page loaded into primary memory or not. - -Ok, now we know about paging structures and it's entries. Let's see some details about 4-level paging in linux kernel. - -Paging structures in linux kernel --------------------------------------------------------------------------------- - -As i wrote about linux kernel for `x86_64` uses 4-level page tables. Their names are: - -* Page Global Directory -* Page Upper Directory -* Page Middle Directory -* Page Table Entry - -After that you compiled and installed linux kernel, you can note `System.map` file which stores address of the functions that are used by the kernel. Note that addresses are virtual. For example: - -``` -$ grep "start_kernel" System.map -ffffffff81efe497 T x86_64_start_kernel -ffffffff81efeaa2 T start_kernel -``` - -We can see `0xffffffff81efe497` here. I'm not sure that you have so big RAM. But anyway `start_kernel` and `x86_64_start_kernel` will be executed. The address space in `x86_64` is `2^64` size, but it's too large, that's why used smaller address space, only 48-bits wide. So we have situation when physical address limited with 48 bits, but addressing still performed with 64 bit pointers. How to solve this problem? Ok, look on the diagram: - -``` -0xffffffffffffffff +-----------+ - | | - | | Kernelspace - | | - 0xffff800000000000 +-----------+ - | | - | | - | hole | - | | - | | -0x00007fffffffffff +-----------+ - | | - | | Userspace - | | -0x0000000000000000  +-----------+ -``` - -This solution is `sign extension`. Here we can see that low 48 bits of a virtual address can be used for addressing. Bits `63:48` can be or 0 or 1. Note that all virtual address space is spliten on 2 parts: - -* Kernel space -* Userspace - -Userspace occupies the lower part of the virtual address space, from `0x000000000000000` to `0x00007fffffffffff` and kernel space occupies the highest part from the `0xffff8000000000` to `0xffffffffffffffff`. Note that bits `63:48` is 0 for userspace and 1 for kernel space. All addresses which are in kernel space and in userspace or in another words which higher `63:48` bits zero or one calls `canonical` addresses. There is `non-canonical` area between these memory regions. Together this two memory regions (kernel space and user space) are exactly `2^48` bits. We can find virtual memory map with 4 level page tables in the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt): - -``` -0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm -hole caused by [48:63] sign extension -ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor -ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory -ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole -ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space -ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole -ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB) -... unused hole ... -ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB) -... unused hole ... -ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks -... unused hole ... -ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0 -ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space -ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls -ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole -``` - -We can see here memory map for user space, kernel space and non-canonical area between. User space memory map is simple. Let's take a closer look on the kernel space. We can see that it starts from the guard hole which reserved for hypervisor. We can find definition of this guard hole in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h): - -```C -#define __PAGE_OFFSET _AC(0xffff880000000000, UL) -``` - -Previously this guard hole and `__PAGE_OFFSET` was from `0xffff800000000000` to `0xffff80ffffffffff` for preventing of access to non-canonical area, but later was added 3 bits for hypervisor. - -Next is the lowest usable address in kernel space - `ffff880000000000`. This virtual memory region is for direct mapping of the all physical memory. After the memory space which mapped all physical address - guard hole, it needs to be between direct mapping of the all physical memory and vmalloc area. After the virtual memory map for the first terabyte and unused hole after it, we can see `kasan` shadow memory. It was added by the [commit](https://github.com/torvalds/linux/commit/ef7f0d6a6ca8c9e4b27d78895af86c2fbfaeedb2) and provides kernel address sanitizer. After next unused hole we can se `esp` fixup stacks (we will talk about it in the other parts) and the start of the kernel text mapping from the physical address - `0`. We can find definition of this address in the same file as the `__PAGE_OFFSET`: - -```C -#define __START_KERNEL_map _AC(0xffffffff80000000, UL) -``` - -Usually kernel's `.text` start here with the `CONFIG_PHYSICAL_START` offset. We saw it in the post about [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md): - -``` -readelf -s vmlinux | grep ffffffff81000000 - 1: ffffffff81000000 0 SECTION LOCAL DEFAULT 1 - 65099: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 _text - 90766: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 startup_64 -``` - -Here i checked `vmlinux` with the `CONFIG_PHYSICAL_START` is `0x1000000`. So we have the start point of the kernel `.text` - `0xffffffff80000000` and offset - `0x1000000`, the resulted virtual address will be `0xffffffff80000000 + 1000000 = 0xffffffff81000000`. - -After the kernel `.text` region, we can see virtual memory region for kernel modules, `vsyscalls` and 2 megabytes unused hole. - -We know how looks kernel's virtual memory map and now we can see how a virtual address translates into physical. Let's take for example following address: - -``` -0xffffffff81000000 -``` - -In binary it will be: - -``` -1111111111111111 111111111 111111110 000001000 000000000 000000000000 - 63:48 47:39 38:30 29:21 20:12 11:0 -``` - -The given virtual address split on some parts as i wrote above: - -* `63:48` - bits not used; -* `47:39` - bits of the given linear address stores an index into the paging structure level-4; -* `38:30` - bits stores index into the paging structure level-3; -* `29:21` - bits stores an index into the paging structure level-2; -* `20:12` - bits stores an index into the paging structure level-1; -* `11:0` - bits provide the byte offset into the physical page. - -That is all. Now you know a little about `paging` theory and we can go ahead in the kernel source code and see first initialization steps. - -Conclusion --------------------------------------------------------------------------------- - -It's the end of this short part about paging theory. Of course this post doesn't cover all details about paging, but soon we will see it on practice how linux kernel builds paging structures and work with it. - -**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - - -Links --------------------------------------------------------------------------------- - -* [Paging on Wikipedia](http://en.wikipedia.org/wiki/Paging) -* [Intel 64 and IA-32 architectures software developer's manual volume 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) -* [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) -* [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md) -* [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt) -* [Last part - Kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) diff --git a/Theory/README.md b/Theory/README.md deleted file mode 100644 index 7374a88..0000000 --- a/Theory/README.md +++ /dev/null @@ -1,6 +0,0 @@ -# Theory - -This chapter describes various theoretical concepts and concepts which are not directly related to practice but useful to know. - -* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) -* [Elf64 format](http://0xax.gitbooks.io/linux-insides/content/Theory/ELF.html) diff --git a/contributors.md b/contributors.md deleted file mode 100644 index 066d753..0000000 --- a/contributors.md +++ /dev/null @@ -1,66 +0,0 @@ -Thank you to all contributors: ------------------------------- - -* [Akash Shende](https://github.com/akash0x53) -* [Jakub Kramarz](https://github.com/jkramarz) -* [ckrooss](https://github.com/ckrooss) -* [ecksun](https://github.com/ecksun) -* [Maciek Makowski](https://github.com/mmakowski) -* [Thomas Marcelis](https://github.com/ThomasMarcelis) -* [Chris Costes](https://github.com/ccostes) -* [nathansoz](https://github.com/nathansoz) -* [RubanDeventhiran](https://github.com/RubanDeventhiran) -* [fuzhli](https://github.com/fuzhli) -* [andars](https://github.com/andars) -* [Alexandru Pana](https://github.com/alexpana) -* [Bogdan Rădulescu](https://github.com/bogdanr) -* [zil](https://github.com/zil) -* [codelitt](https://github.com/codelitt) -* [gulyasm](https://github.com/gulyasm) -* [alx741](https://github.com/alx741) -* [Haddayn](https://github.com/Haddayn) -* [Daniel Campoverde Carrión](https://github.com/alx741) -* [Guillaume Gomez](https://github.com/GuillaumeGomez) -* [Leandro Moreira](https://github.com/leandromoreira) -* [Jonatan PĆ„lsson](https://github.com/jonte) -* [George Horrell](https://github.com/georgehorrell) -* [Ciro Santilli](https://github.com/cirosantilli) -* [Kevin Soules](https://github.com/eax64) -* [Fabio Pozzi](https://github.com/fabiopozzi) -* [Kevin Swinton](https://github.com/kevinjswinton) -* [Leandro Moreira](https://github.com/leandromoreira) -* [LYF610400210](https://github.com/LYF610400210) -* [Cam Cope](https://github.com/ccope) -* [Miquel SabatĆ© SolĆ ](https://github.com/mssola) -* [Michael Aquilina](https://github.com/MichaelAquilina) -* [Gabriel Sullice](https://github.com/gabesullice) -* [Michael Drüing](https://github.com/darkstar) -* [Alexander Polakov](https://github.com/polachok) -* [Anton Davydov](https://github.com/davydovanton) -* [Arpan Kapoor](https://github.com/arpankapoor) -* [Brandon Fosdick](https://github.com/bfoz) -* [Ashleigh Newman-Jones](https://github.com/anewmanjones) -* [Terrell Russell](https://github.com/trel) -* [Mario](https://github.com/bedna-KU) -* [Ewoud Kohl van Wijngaarden](https://github.com/ekohl) -* [Jochen Maes](https://github.com/sejo) -* [Brother-Lal](https://github.com/Brother-Lal) -* [Brian McKenna](https://github.com/puffnfresh) -* [Josh Triplett](https://github.com/joshtriplett) -* [James Flowers](https://github.com/comjf) -* [Alexander Harding](https://github.com/aeharding) -* [Dzmitry Plashchynski](https://github.com/plashchynski) -* [Simarpreet Singh](https://github.com/simar7) -* [umatomba](https://github.com/umatomba) -* [Vaibhav Tulsyan](https://github.com/xennygrimmato) -* [Brandon Wamboldt](https://github.com/brandonwamboldt) -* [Maxime Leboeuf](https://github.com/leboeuf) -* [Maximilien Richer](https://github.com/halfa) -* [marmeladema](https://github.com/marmeladema) -* [Anisse Astier](https://github.com/anisse) -* [TheCodeArtist](https://github.com/TheCodeArtist) -* [Ehsun N](https://github.com/imehsunn) -* [Adam Shannon](https://github.com/adamdecaf) -* [Donny Nadolny](https://github.com/dnadolny) -* [Ehsun N](https://github.com/imehsunn) -* [Waqar Ahmed](https://github.com/Waqar144) diff --git a/interrupts/README.md b/interrupts/README.md deleted file mode 100644 index 8382468..0000000 --- a/interrupts/README.md +++ /dev/null @@ -1,12 +0,0 @@ -# Interrupts and Interrupt Handling - -You will find a couple of posts which describes an interrupts and an exceptions handling in the linux kernel. - -* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes an interrupts handling theory. -* [Start to dive into interrupts in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - this part starts to describe interrupts and exceptions handling related stuff from the early stage. -* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - third part describes early interrupt handlers. -* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - fourth part describes first non-early interrupt handlers. -* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - descripbes implementation of some exception handlers as double fault, divide by zero and etc. -* [Handling Non-Maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and the rest of interrupts handlers from the architecture-specific part. -* [Dive into external hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - this part describes early initialization of code which is related to handling of external hardware interrupts. -* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - this part describes non-early initialization of code which is related to handling of external hardware interrupts. diff --git a/interrupts/interrupts-1.md b/interrupts/interrupts-1.md deleted file mode 100644 index 38d527e..0000000 --- a/interrupts/interrupts-1.md +++ /dev/null @@ -1,520 +0,0 @@ -Interrupts and Interrupt Handling. Part 1. -================================================================================ - -Introduction --------------------------------------------------------------------------------- - -This is the first part of the new chapter of the [linux insides](http://0xax.gitbooks.io/linux-insides/content/) book. We have come a long way in the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book. We started from the earliest [steps](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of kernel initialization and finished with the [launch](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html) of the first `init` process. Yes, we saw several initialization steps which are related to the various kernel subsystems. But we did not dig deep into the details of these subsystems. With this chapter, we will try to understand how the various kernel subsystems work and how they are implemented. As you can already understand from the chapter's title, the first subsystem will be [interrupts](http://en.wikipedia.org/wiki/Interrupt). - -What is an Interrupt? --------------------------------------------------------------------------------- - -We have already heard of the word `interrupt` in several parts of this book. We even saw a couple of examples of interrupt handlers. In the current chapter we will start from the theory i.e. - -* What are `interrupts` ? -* What are `interrupt handlers`? - -We will then continue to dig deeper into the details of `interrupts` and how the Linux kernel handles them. - -So..., First of all what is an interrupt? An interrupt is an `event` which is emitted by software or hardware when its needs the CPU's attention. For example, we press a button on the keyboard and what do we expect next? What should the operating system and computer do after this? To simplify matters assume that each peripheral device has an interrupt line to the CPU. A device can use it to signal an interrupt to the CPU. However interrupts are not signaled directly to the CPU. In the old machines there was a [PIC](http://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) which is a chip responsible for sequentially processing multiple interrupt requests from multiple devices. In the new machines there is an [Advanced Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) commonly known as - `APIC`. An `APIC` consists of two separate devices: - -* `Local APIC` -* `I/O APIC` - -The first - `Local APIC` is located on each CPU core. The local APIC is responsible for handling the CPU-specific interrupt configuration. The local APIC is usually used to manage interrupts from the APIC-timer, thermal sensor and any other such locally connected I/O devices. - -The second - `I/O APIC` provides multi-processor interrupt management. It is used to distribute external interrupts among the CPU cores. More about the local and I/O APICs will be covered later in this chapter. As you can understand, interrupts can occur at any time. When an interrupt occurs, the operating system must handle it immediately. But what does it mean `to handle an interrupt`? When an interrupt occurs, the operating system must ensure the following steps: - -* The kernel must pause execution of the current process; (preempt current task) -* The kernel must search for the handler of the interrupt and transfer control (execute interrupt handler); -* After the interrupt handler completes execution, the interrupted process can resume execution; - -Of course there are numerous intricacies involved in this procedure of handling interrupts. But the above 3 steps form the basic skeleton of the procedure. - -Addresses of each of the interrupt handlers are maintained in a special location referred to as the - `Interrupt Descriptor Table` or `IDT`. The processor uses an unique number for recognizing the type of interruption or exception. This number is called - `vector number`. A vector number is an index in the `IDT`. There is limited amount of the vector numbers and it can be from `0` to `255`. You can note the following range-check upon the vector number within the Linux kernel source-code: - -```C -BUG_ON((unsigned)n > 0xFF); -``` - -You can find this check within the Linux kernel source code related to interrupt setup (eg. The `set_intr_gate`, `void set_system_intr_gate` in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h)). First 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). Vector numbers from `32` to `255` are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor. - -Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into 2 major classes: - -* External or hardware generated interrupts; -* Software-generated interrupts. - -The first - external interrupts are received through the `Local APIC` or pins on the processor which are connected to the `Local APIC`. The second - software-generated interrupts are caused by an exceptional condition in the processor itself (sometimes using special architecture-specific instructions). A common example for an exceptional condition is `division by zero`. Another example is exiting a program with the `syscall` instruction. - -As mentioned earlier, an interrupt can occur at any time for a reason which the code and CPU have no control over. On the other hand, exceptions are `synchronous` with program execution and can be classified into 3 categories: - -* `Faults` -* `Traps` -* `Aborts` - -A `fault` is an exception reported before the execution of a "faulty" instruction (which can then be corrected). If corrected, it allows the interrupted program to be resume. - -Next a `trap` is an exception which is reported immediately following the execution of the `trap` instruction. Traps also allow the interrupted program to be continued just as a `fault` does. - -Finally an `abort` is an exception that does not always report the exact instruction which caused the exception and does not allow the interrupted program to be resumed. - -Also we already know from the previous [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) that interrupts can be classified as `maskable` and `non-maskable`. Maskable interrupts are interrupts which can be blocked with the two following instructions for `x86_64` - `sti` and `cli`. We can find them in the Linux kernel source code: - -```C -static inline void native_irq_disable(void) -{ - asm volatile("cli": : :"memory"); -} -``` - -and - -```C -static inline void native_irq_enable(void) -{ - asm volatile("sti": : :"memory"); -} -``` - -These two instructions modify the `IF` flag bit within the interrupt register. The `sti` instruction sets the `IF` flag and the `cli` instruction clears this flag. Non-maskable interrupts are always reported. Usually any failure in the hardware is mapped to such non-maskable interrupts. - -If multiple exceptions or interrupts occur at the same time, the processor handles them in order of their predefined priorities. We can determine the priorities from the highest to the lowest in the following table: - -``` -+----------------------------------------------------------------+ -| | | -| Priority | Description | -| | | -+--------------+-------------------------------------------------+ -| | Hardware Reset and Machine Checks | -| 1 | - RESET | -| | - Machine Check | -+--------------+-------------------------------------------------+ -| | Trap on Task Switch | -| 2 | - T flag in TSS is set | -| | | -+--------------+-------------------------------------------------+ -| | External Hardware Interventions | -| | - FLUSH | -| 3 | - STOPCLK | -| | - SMI | -| | - INIT | -+--------------+-------------------------------------------------+ -| | Traps on the Previous Instruction | -| 4 | - Breakpoints | -| | - Debug Trap Exceptions | -+--------------+-------------------------------------------------+ -| 5 | Nonmaskable Interrupts | -+--------------+-------------------------------------------------+ -| 6 | Maskable Hardware Interrupts | -+--------------+-------------------------------------------------+ -| 7 | Code Breakpoint Fault | -+--------------+-------------------------------------------------+ -| 8 | Faults from Fetching Next Instruction | -| | Code-Segment Limit Violation | -| | Code Page Fault | -+--------------+-------------------------------------------------+ -| | Faults from Decoding the Next Instruction | -| | Instruction length > 15 bytes | -| 9 | Invalid Opcode | -| | Coprocessor Not Available | -| | | -+--------------+-------------------------------------------------+ -| 10 | Faults on Executing an Instruction | -| | Overflow | -| | Bound error | -| | Invalid TSS | -| | Segment Not Present | -| | Stack fault | -| | General Protection | -| | Data Page Fault | -| | Alignment Check | -| | x87 FPU Floating-point exception | -| | SIMD floating-point exception | -| | Virtualization exception | -+--------------+-------------------------------------------------+ -``` - -Now that we know a little about the various types of interrupts and exceptions, it is time to move on to a more practical part. We start with the description of the `Interrupt Descriptor Table`. As mentioned earlier, the `IDT` stores entry points of the interrupts and exceptions handlers. The `IDT` is similar in structure to the `Global Descriptor Table` which we saw in the second part of the [Kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). But of course it has some differences. Instead of `descriptors`, the `IDT` entries are called `gates`. It can contain one of the following gates: - -* Interrupt gates -* Task gates -* Trap gates. - -in the `x86` architecture. Only [long mode](http://en.wikipedia.org/wiki/Long_mode) interrupt gates and trap gates can be referenced in the `x86_64`. Like the `Global Descriptor Table`, the `Interrupt Descriptor table` is an array of 8-byte gates on `x86` and an array of 16-byte gates on `x86_64`. We can remember from the second part of the [Kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), that `Global Descriptor Table` must contain `NULL` descriptor as its first element. Unlike the `Global Descriptor Table`, the `Interrupt Descriptor Table` may contain a gate; it is not mandatory. For example, you may remember that we have loaded the Interrupt Descriptor table with the `NULL` gates only in the earlier [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) while transitioning into [protected mode](http://en.wikipedia.org/wiki/Protected_mode): - -```C -/* - * Set up the IDT - */ -static void setup_idt(void) -{ - static const struct gdt_ptr null_idt = {0, 0}; - asm volatile("lidtl %0" : : "m" (null_idt)); -} -``` - -from the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c). The `Interrupt Descriptor table` can be located anywhere in the linear address space and the base address of it must be aligned on an 8-byte boundary on `x86` or 16-byte boundary on `x86_64`. Base address of the `IDT` is stored in the special register - `IDTR`. There are two instructions on `x86`-compatible processors to modify the `IDTR` register: - -* `LIDT` -* `SIDT` - -The first instruction `LIDT` is used to load the base-address of the `IDT` i.e. the specified operand into the `IDTR`. The second instruction `SIDT` is used to read and store the contents of the `IDTR` into the specified operand. The `IDTR` register is 48-bits on the `x86` and contains following information: - -``` -+-----------------------------------+----------------------+ -| | | -| Base address of the IDT | Limit of the IDT | -| | | -+-----------------------------------+----------------------+ -47 16 15 0 -``` - -Looking at the implementation of `setup_idt`, we have prepared a `null_idt` and loaded it to the `IDTR` register with the `lidt` instruction. Note that `null_idt` has `gdt_ptr` type which is defined as: - -```C -struct gdt_ptr { - u16 len; - u32 ptr; -} __attribute__((packed)); -``` - -Here we can see the definition of the structure with the two fields of 2-bytes and 4-bytes each (a total of 48-bits) as we can see in the diagram. Now let's look at the `IDT` entries structure. The `IDT` entries structure is an array of the 16-byte entries which are called gates in the `x86_64`. They have the following structure: - -``` -127 96 -+-------------------------------------------------------------------------------+ -| | -| Reserved | -| | -+-------------------------------------------------------------------------------- -95 64 -+-------------------------------------------------------------------------------+ -| | -| Offset 63..32 | -| | -+-------------------------------------------------------------------------------+ -63 48 47 46 44 42 39 34 32 -+-------------------------------------------------------------------------------+ -| | | D | | | | | | | -| Offset 31..16 | P | P | 0 |Type |0 0 0 | 0 | 0 | IST | -| | | L | | | | | | | - -------------------------------------------------------------------------------+ -31 16 15 0 -+-------------------------------------------------------------------------------+ -| | | -| Segment Selector | Offset 15..0 | -| | | -+-------------------------------------------------------------------------------+ -``` - -To form an index into the IDT, the processor scales the exception or interrupt vector by sixteen. The processor handles the occurrence of exceptions and interrupts just like it handles calls of a procedure when it sees the `call` instruction. A processor uses an unique number or `vector number` of the interrupt or the exception as the index to find the necessary `Interrupt Descriptor Table` entry. Now let's take a closer look at an `IDT` entry. - -As we can see, `IDT` entry on the diagram consists of the following fields: - -* `0-15` bits - offset from the segment selector which is used by the processor as the base address of the entry point of the interrupt handler; -* `16-31` bits - base address of the segment select which contains the entry point of the interrupt handler; -* `IST` - a new special mechanism in the `x86_64`, will see it later; -* `DPL` - Descriptor Privilege Level; -* `P` - Segment Present flag; -* `48-63` bits - second part of the handler base address; -* `64-95` bits - third part of the base address of the handler; -* `96-127` bits - and the last bits are reserved by the CPU. - -And the last `Type` field describes the type of the `IDT` entry. There are three different kinds of handlers for interrupts: - -* Interrupt gate -* Trap gate -* Task gate - -The `IST` or `Interrupt Stack Table` is a new mechanism in the `x86_64`. It is used as an alternative to the the legacy stack-switch mechanism. Previously The `x86` architecture provided a mechanism to automatically switch stack frames in response to an interrupt. The `IST` is a modified version of the `x86` Stack switching mode. This mechanism unconditionally switches stacks when it is enabled and can be enabled for any interrupt in the `IDT` entry related with the certain interrupt (we will soon see it). From this we can understand that `IST` is not necessary for all interrupts. Some interrupts can continue to use the legacy stack switching mode. The `IST` mechanism provides up to seven `IST` pointers in the [Task State Segment](http://en.wikipedia.org/wiki/Task_state_segment) or `TSS` which is the special structure which contains information about a process. The `TSS` is used for stack switching during the execution of an interrupt or exception handler in the Linux kernel. Each pointer is referenced by an interrupt gate from the `IDT`. - -The `Interrupt Descriptor Table` represented by the array of the `gate_desc` structures: - -```C -extern gate_desc idt_table[]; -``` - -where `gate_desc` is: - -```C -#ifdef CONFIG_X86_64 -... -... -... -typedef struct gate_struct64 gate_desc; -... -... -... -#endif -``` - -and `gate_struct64` defined as: - -```C -struct gate_struct64 { - u16 offset_low; - u16 segment; - unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1; - u16 offset_middle; - u32 offset_high; - u32 zero1; -} __attribute__((packed)); -``` - -Each active thread has a large stack in the Linux kernel for the `x86_64` architecture. The stack size is defined as `THREAD_SIZE` and is equal to: - -```C -#define PAGE_SHIFT 12 -#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT) -... -... -... -#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) -#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) -``` - -The `PAGE_SIZE` is `4096`-bytes and the `THREAD_SIZE_ORDER` depends on the `KASAN_STACK_ORDER`. As we can see, the `KASAN_STACK` depends on the `CONFIG_KASAN` kernel configuration parameter and equals to the: - -```C -#ifdef CONFIG_KASAN - #define KASAN_STACK_ORDER 1 -#else - #define KASAN_STACK_ORDER 0 -#endif -``` - -`KASan` is a runtime memory [debugger](http://lwn.net/Articles/618180/). So... the `THREAD_SIZE` will be `16384` bytes if `CONFIG_KASAN` is disabled or `32768` if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the `thread_info` structure (details about this structure are available in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When the user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the `interrupt stack` used for the external hardware interrupts. Its size is determined as follows: - -```C -#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER) -#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER) -``` - -or `16384` bytes. The per-cpu interrupt stack represented by the `irq_stack_union` union in the Linux kernel for `x86_64`: - -```C -union irq_stack_union { - char irq_stack[IRQ_STACK_SIZE]; - - struct { - char gs_base[40]; - unsigned long stack_canary; - }; -}; -``` - -The first `irq_stack` field is a 16 kilobytes array. Also you can see that `irq_stack_union` contains structure with the two fields: - -* `gs_base` - The `gs` register always points to the bottom of the `irqstack` union. On the `x86_64`, the `gs` register is shared by per-cpu area and stack canary (more about `per-cpu` variables you can read in the special [part](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). All per-cpu symbols are zero based and the `gs` points to the base of per-cpu area. You already know that [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) is abolished in the long mode, but we can set base address for the two segment registers - `fs` and `gs` with the [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register) and these registers can be still be used as address registers. If you remember the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of the Linux kernel initialization process, you can remember that we have set the `gs` register: - -```assembly - movl $MSR_GS_BASE,%ecx - movl initial_gs(%rip),%eax - movl initial_gs+4(%rip),%edx - wrmsr -``` - -where `initial_gs` points to the `irq_stack_union`: - -```assembly -GLOBAL(initial_gs) -.quad INIT_PER_CPU_VAR(irq_stack_union) -``` - -* `stack_canary` - [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) for the interrupt stack is a `stack protector` -to verify that the stack hasn't been overwritten. Note that `gs_base` is an 40 bytes array. `GCC` requires that stack canary will be on the fixed offset from the base of the `gs` and its value must be `40` for the `x86_64` and `20` for the `x86`. - -The `irq_stack_union` is the first datum in the `percpu` area, we can see it in the `System.map`: - -``` -0000000000000000 D __per_cpu_start -0000000000000000 D irq_stack_union -0000000000004000 d exception_stacks -0000000000009000 D gdt_page -... -... -... -``` - -We can see its definition in the code: - -```C -DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible; -``` - -Now, its time to look at the initialization of the `irq_stack_union`. Besides the `irq_stack_union` definition, we can see the definition of the following per-cpu variables in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h): - -```C -DECLARE_PER_CPU(char *, irq_stack_ptr); -DECLARE_PER_CPU(unsigned int, irq_count); -``` - -The first is the `irq_stack_ptr`. From the variable's name, it is obvious that this is a pointer to the top of the stack. The second - `irq_count` is used to check if a CPU is already on an interrupt stack or not. Initialization of the `irq_stack_ptr` is located in the `setup_per_cpu_areas` function in [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c): - -```C -void __init setup_per_cpu_areas(void) -{ -... -... -#ifdef CONFIG_X86_64 -for_each_possible_cpu(cpu) { - ... - ... - ... - per_cpu(irq_stack_ptr, cpu) = - per_cpu(irq_stack_union.irq_stack, cpu) + - IRQ_STACK_SIZE - 64; - ... - ... - ... -#endif -... -... -} -``` - -Here we go over all the CPUs on-by-one and setup `irq_stack_ptr`. This turns out to be equal to the top of the interrupt stack minus `64`. Why `64`? If you remember, we set the stack canary in the beginning of the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) with the call of the `boot_init_stack_canary` function: - -```C -static __always_inline void boot_init_stack_canary(void) -{ - u64 canary; - ... - ... - ... - -#ifdef CONFIG_X86_64 - BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40); -#endif - // - // getting canary value here - // - - this_cpu_write(irq_stack_union.stack_canary, canary); - ... - ... - ... -} -``` - -Note that `canary` is `64` bits value. That's why we need to subtract `64` from the size of the interrupt stack to avoid overlapping with the stack canary value. Initialization of the `irq_stack_union.gs_base` is in the `load_percpu_segment` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c): - -TODO maybe more about the wrmsl - -```C -void load_percpu_segment(int cpu) -{ - ... - ... - ... - loadsegment(gs, 0); - wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu)); -} -``` - -and as we already know `gs` register points to the bottom of the interrupt stack: - -```assembly - movl $MSR_GS_BASE,%ecx - movl initial_gs(%rip),%eax - movl initial_gs+4(%rip),%edx - wrmsr - - GLOBAL(initial_gs) - .quad INIT_PER_CPU_VAR(irq_stack_union) -``` - -Here we can see the `wrmsr` instruction which loads the data from `edx:eax` into the [Model specific register](http://en.wikipedia.org/wiki/Model-specific_register) pointed by the `ecx` register. In our case model specific register is `MSR_GS_BASE` which contains the base address of the memory segment pointed by the `gs` register. `edx:eax` points to the address of the `initial_gs` which is the base address of our `irq_stack_union`. - -We already know that `x86_64` has a feature called `Interrupt Stack Table` or `IST` and this feature provides the ability to switch to a new stack for events non-maskable interrupt, double fault and etc... There can be up to seven `IST` entries per-cpu. Some of them are: - -* `DOUBLEFAULT_STACK` -* `NMI_STACK` -* `DEBUG_STACK` -* `MCE_STACK` - -or - -```C -#define DOUBLEFAULT_STACK 1 -#define NMI_STACK 2 -#define DEBUG_STACK 3 -#define MCE_STACK 4 -``` - -All interrupt-gate descriptors which switch to a new stack with the `IST` are initialized with the `set_intr_gate_ist` function. For example: - -```C -set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK); -... -... -... -set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK); -``` - -where `&nmi` and `&double_fault` are addresses of the entries to the given interrupt handlers: - -```C -asmlinkage void nmi(void); -asmlinkage void double_fault(void); -``` - -defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) - -```assembly -idtentry double_fault do_double_fault has_error_code=1 paranoid=2 -... -... -... -ENTRY(nmi) -... -... -... -END(nmi) -``` - -When an interrupt or an exception occurs, the new `ss` selector is forced to `NULL` and the `ss` selector’s `rpl` field is set to the new `cpl`. The old `ss`, `rsp`, register flags, `cs`, `rip` are pushed onto the new stack. In 64-bit mode, the size of interrupt stack-frame pushes is fixed at 8-bytes, so we will get the following stack: - -``` -+---------------+ -| | -| SS | 40 -| RSP | 32 -| RFLAGS | 24 -| CS | 16 -| RIP | 8 -| Error code | 0 -| | -+---------------+ -``` - -If the `IST` field in the interrupt gate is not `0`, we read the `IST` pointer into `rsp`. If the interrupt vector number has an error code associated with it, we then push the error code onto the stack. If the interrupt vector number has no error code, we go ahead and push the dummy error code on to the stack. We need to do this to ensure stack consistency. Next we load the segment-selector field from the gate descriptor into the CS register and must verify that the target code-segment is a 64-bit mode code segment by the checking bit `21` i.e. the `L` bit in the `Global Descriptor Table`. Finally we load the offset field from the gate descriptor into `rip` which will be the entry-point of the interrupt handler. After this the interrupt handler begins to execute. After an interrupt handler finishes its execution, it must return control to the interrupted process with the `iret` instruction. The `iret` instruction unconditionally pops the stack pointer (`ss:rsp`) to restore the stack of the interrupted process and does not depend on the `cpl` change. - -That's all. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the first part about interrupts and interrupt handling in the Linux kernel. We saw some theory and the first steps of the initialization of stuff related to interrupts and exceptions. In the next part we will continue to dive into interrupts and interrupts handling - into the more practical aspects of it. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [Advanced Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) -* [protected mode](http://en.wikipedia.org/wiki/Protected_mode) -* [long mode](http://en.wikipedia.org/wiki/Long_mode) -* [kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) -* [PIC](http://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) -* [Advanced Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) -* [long mode](http://en.wikipedia.org/wiki/Long_mode) -* [protected mode](http://en.wikipedia.org/wiki/Protected_mode) -* [Task State Segement](http://en.wikipedia.org/wiki/Task_state_segment) -* [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) -* [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register) -* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) -* [Previous chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) diff --git a/interrupts/interrupts-2.md b/interrupts/interrupts-2.md deleted file mode 100644 index 92055eb..0000000 --- a/interrupts/interrupts-2.md +++ /dev/null @@ -1,547 +0,0 @@ -Interrupts and Interrupt Handling. Part 2. -================================================================================ - -Start to dive into interrupt and exceptions handling in the Linux kernel --------------------------------------------------------------------------------- - -We saw some theory about an interrupts and an exceptions handling in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) and as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you already can note, the previous part mostly described theoretical aspects and since this part we will start to dive directly into the Linux kernel source code. We will start to do it as we did it in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest [code lines](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L292) as we saw it for example in the [Linux kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, but we will start from the earliest code which is related to the interrupts and exceptions. Since this part we will try to go through the all interrupts and exceptions related stuff which we can find in the Linux kernel source code. - -If you've read the previous parts, you can remember that the earliest place in the Linux kernel `x86_64` architecture-specifix source code which is related to the interrupt is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into the [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function by the call of the `setup_idt`: - -```C -void go_to_protected_mode(void) -{ - ... - setup_idt(); - ... -} -``` - -The `setup_idt` function defined in the same source code file as the `go_to_protected_mode` function and just loads address of the `NULL` interrupts descriptor table: - -```C -static void setup_idt(void) -{ - static const struct gdt_ptr null_idt = {0, 0}; - asm volatile("lidtl %0" : : "m" (null_idt)); -} -``` - -where `gdt_ptr` represents special 48-bit `GTDR` register which must contain base address of the `Global Descriptor Table`: - -```C -struct gdt_ptr { - u16 len; - u32 ptr; -} __attribute__((packed)); -``` - -Of course in our case the `gdt_ptr` does not represent `GDTR` register, but `IDTR` since we set `Interrupt Descriptor Table`. You will not find `idt_ptr` structure, because if it had been in the Linux kernel source code, it would have been the same as `gdt_ptr` but with different name. So, as you can understand there is no sense to have two similar structures which are differ only in a name. You can note here, that we do not fill the `Interrupt Descriptor Table` with entries, because it is too early to handle any interrupts or exceptions for this moment. That's why we just fill the `IDT` with the `NULL`. - -And after the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the - [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S). More about it you can read in the [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes transition to the protected mode. - -We already know from the earliest parts that entry of the protected mode located in the `boot_params.hdr.code32_start` and you can see that we pass the entry of the protected mode and `boot_params` to the `protected_mode_jump` in the end of the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c): - -```C -protected_mode_jump(boot_params.hdr.code32_start, - (u32)&boot_params + (ds() << 4)); -``` - -The `protected_mode_jump` defined in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S) and gets these two parameters in the `ax` and `dx` registers using one of the [8086](http://en.wikipedia.org/wiki/Intel_8086) calling [convention](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions): - -```assembly -GLOBAL(protected_mode_jump) - ... - ... - ... - .byte 0x66, 0xea # ljmpl opcode -2: .long in_pm32 # offset - .word __BOOT_CS # segment -... -... -... -ENDPROC(protected_mode_jump) -``` - -where `in_pm32` contains jump to the 32-bit entrypoint: - -```assembly -GLOBAL(in_pm32) - ... - ... - jmpl *%eax // %eax contains address of the `startup_32` - ... - ... -ENDPROC(in_pm32) -``` - -As you can remember 32-bit entry point is in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly file, although it contains `_64` in the its name. We can see the two similar files in the `arch/x86/boot/compressed` directory: - -* `arch/x86/boot/compressed/head_32.S`. -* `arch/x86/boot/compressed/head_64.S`; - -But the 32-bit mode entry point the the second file in our case. The first file even not compiled for `x86_64`. Let's look on the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile): - -``` -vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \ -... -... -``` - -We can see here that `head_*` depends on the `$(BITS)` variable which depends on the architecture. You can find it in the [arch/x86/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/Makefile): - -``` -ifeq ($(CONFIG_X86_32),y) -... - BITS := 32 -else - BITS := 64 - ... -endif -``` - -Now as we jumped on the `startup_32` from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) we will not find anything related to the interrupt handling here. The `startup_32` contains code that makes preparations before transition into the [long mode](http://en.wikipedia.org/wiki/Long_mode) and directly jumps in it. The `long mode` entry located `startup_64` and it makes preparation before the [kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in the `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c). After kernel decompressed, we jump on the `startup_64` from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). In the `startup_64` we start to build identity-mapped pages. After we have built identity-mapped pages, checked [NX](http://en.wikipedia.org/wiki/NX_bit) bit, made setup of the `Extended Feature Enable Register` (see in links), updated early `Global Descriptor Table` wit the `lgdt` instruction, we need to setup `gs` register with the following code: - -```assembly -movl $MSR_GS_BASE,%ecx -movl initial_gs(%rip),%eax -movl initial_gs+4(%rip),%edx -wrmsr -``` - -We already saw this code in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) and not time to know better what is going on here. First of all pay attention on the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE` which declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h) and looks: - -```C -#define MSR_GS_BASE 0xc0000101 -``` - -From this we can understand that `MSR_GS_BASE` defines number of the `model specific register`. Since registers `cs`, `ds`, `es`, and `ss` are not used in the 64-bit mode, their fields are ignored. But we can access memory over `fs` and `gs` registers. The model specific register provides `back door` to the hidden parts of these segment registers and allows to use 64-bit base address for segment register addressed by the `fs` and `gs`. So the `MSR_GS_BASE` is the hidden part and this part is mapped on the `GS.base` field. Let's look on the `initial_gs`: - -```assembly -GLOBAL(initial_gs) - .quad INIT_PER_CPU_VAR(irq_stack_union) -``` - -We pass `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just concatenates `init_per_cpu__` prefix with the given symbol. In our case we will get `init_per_cpu__irq_stack_union` symbol. Let's look on the [linker](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) script. There we can see following definition: - -``` -#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load -INIT_PER_CPU(irq_stack_union); -``` - -It tells us that address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where are `init_per_cpu__irq_stack_union` and `__per_cpu_load` and what they mean. The first `irq_stack_union` defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to call of the `init_per_cpu_var` macro: - -```C -DECLARE_INIT_PER_CPU(irq_stack_union); - -#define DECLARE_INIT_PER_CPU(var) \ - extern typeof(per_cpu_var(var)) init_per_cpu_var(var) - -#define init_per_cpu_var(var) init_per_cpu__##var -``` - -If we will expand all macro we will get the same `init_per_cpu__irq_stack_union` as we got after expanding of the `INIT_PER_CPU` macro, but you can note that it is already not just symbol, but variable. Let's look on the `typeof(percpu_var(var))` expression. Our `var` is `irq_stack_union` and `per_cpu_var` macro defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h): - -```C -#define PER_CPU_VAR(var) %__percpu_seg:var -``` - -where: - -```C -#ifdef CONFIG_X86_64 - #define __percpu_seg gs -endif -``` - -So, we accessing `gs:irq_stack_union` and geting its type which is `irq_union`. Ok, we defined the first variable and know its address, now let's look on the second `__per_cpu_load` symbol. There are a couple of percpu variables which are located after this symbol. The `__per_cpu_load` defined in the [include/asm-generic/sections.h](https://github.com/torvalds/linux/blob/master/include/asm-generic-sections.h): - -```C -extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[]; -``` - -and presented base address of the `per-cpu` variables from the data area. So, we know address of the `irq_stack_union`, `__per_cpu_load` and we know that `init_per_cpu__irq_stack_union` must be placed right after `__per_cpu_load`. And we can see it in the [System.map](http://en.wikipedia.org/wiki/System.map): - -``` -... -... -... -ffffffff819ed000 D __init_begin -ffffffff819ed000 D __per_cpu_load -ffffffff819ed000 A init_per_cpu__irq_stack_union -... -... -... -``` - -Now we know about `initia_gs`, so let's book to the our code: - -```assembly -movl $MSR_GS_BASE,%ecx -movl initial_gs(%rip),%eax -movl initial_gs+4(%rip),%edx -wrmsr -``` - -Here we specified model specific register with `MSR_GS_BASE`, put 64-bit address of the `initial_gs` to the `edx:eax` pair and execute `wrmsr` instruction for filling the `gs` register with base address of the `init_per_cpu__irq_stack_union` which will be bottom of the interrupt stack. After this we will jump to the C code on the `x86_64_start_kernel` from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code and on of these preparations is filling of the early `Interrupt Descriptor Table` with the interrupts handlres entries or `early_idt_handlers`. You can remember it, if you have read the part about the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) and can remember following code: - -```C -for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) - set_intr_gate(i, early_idt_handlers[i]); - -load_idt((const struct desc_ptr *)&idt_descr); -``` - -but I wrote `Early interrupt and exception handling` part when Linux kernel version was - `3.18`. For this day actual version of the Linux kernel is `4.1.0-rc6+` and ` Andy Lutomirski` sent the [patch](https://lkml.org/lkml/2015/6/2/106) and soon it will be in the mainline kernel that changes behaviour for the `early_idt_handlers`. **NOTE** While I wrote this part the [patch](https://github.com/torvalds/linux/commit/425be5679fd292a3c36cb1fe423086708a99f11a) already turned in the Linux kernel source code. Let's look on it. Now the same part looks like: - -```C -for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) - set_intr_gate(i, early_idt_handler_array[i]); - -load_idt((const struct desc_ptr *)&idt_descr); -``` - -AS you can see it has only one difference in the name of the array of the interrupts handlers entry points. Now it is `early_idt_handler_arry`: - -```C -extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE]; -``` - -where `NUM_EXCEPTION_VECTORS` and `EARLY_IDT_HANDLER_SIZE` are defined as: - -```C -#define NUM_EXCEPTION_VECTORS 32 -#define EARLY_IDT_HANDLER_SIZE 9 -``` - -So, the `early_idt_handler_array` is an array of the interrupts handlers entry points and contains one entry point on every nine bytes. You can remember that previous `early_idt_handlers` was defined in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). The `early_idt_handler_array` is defined in the same source code file too: - -```assembly -ENTRY(early_idt_handler_array) -... -... -... -ENDPROC(early_idt_handler_common) -``` - -It fills `early_idt_handler_arry` with the `.rept NUM_EXCEPTION_VECTORS` and contains entry of the `early_make_pgtable` interrupt handler (more about its implementation you can read in the part about [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). For now we come to the end of the `x86_64` architecture-specific code and the next part is the generic kernel code. Of course you already can know that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code. - -Setting stack canary for the interrupt stack -------------------------------------------------------------------------------- - -The next stop after the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) is the biggest `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). If you've read the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you must remember it. This function does all initialization stuff before kernel will launch first `init` process with the [pid](https://en.wikipedia.org/wiki/Process_identifier) - `1`. The first thing that is related to the interrupts and exceptions handling is the call of the `boot_init_stack_canary` function. - -This function sets the [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) value to protect interrupt stack overflow. We already saw a little some details about implementation of the `boot_init_stack_canary` in the previous part and now let's take a closer look on it. You can find implementation of this function in the [arch/x86/include/asm/stackprotector.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/stackprotector.h) and its depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set this function will not do anything: - -```C -#ifdef CONFIG_CC_STACKPROTECTOR -... -... -... -#else -static inline void boot_init_stack_canary(void) -{ -} -#endif -``` - -If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts from the check stat `irq_stack_union` that represents [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) interrupt stack has offset equal to forty bytes from the `stack_canary` value: - -```C -#ifdef CONFIG_X86_64 - BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40); -#endif -``` - -As we can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) the `irq_stack_union` represented by the following union: - -```C -union irq_stack_union { - char irq_stack[IRQ_STACK_SIZE]; - - struct { - char gs_base[40]; - unsigned long stack_canary; - }; -}; -``` - -which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that [unioun](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro). - -After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter): - -```C -get_random_bytes(&canary, sizeof(canary)); -tsc = __native_read_tsc(); -canary += tsc + (tsc << 32UL); -``` - -and write `canary` value to the `irq_stack_union` with the `this_cpu_write` macro: - -```C -this_cpu_write(irq_stack_union.stack_canary, canary); -``` - -more about `this_cpu_*` operation you can read in the [Linux kernel documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt). - -Disabling/Enabling local interrupts --------------------------------------------------------------------------------- - -The next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) which is related to the interrupts and interrupts handling after we have set the `canary` value to the interrupt stack - is the call of the `local_irq_disable` macro. - -This macro defined in the [include/linux/irqflags.h](https://github.com/torvalds/linux/blob/master/include/linux/irqflags.h) header file and as you can understand, we can disable interrupts for the CPU with the call of this macro. Let's look on its implementation. First of all note that it depends on the `CONFIG_TRACE_IRQFLAGS_SUPPORT` kernel configuration option: - -```C -#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT -... -#define local_irq_disable() \ - do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0) -... -#else -... -#define local_irq_disable() do { raw_local_irq_disable(); } while (0) -... -#endif -``` - -They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` for tracing `hardirq` and `stoftirq` state. In ourcase `lockdep` subsytem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The `trace_hardirqs_off` function defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c): - -```C -void trace_hardirqs_off(void) -{ - trace_hardirqs_off_caller(CALLER_ADDR0); -} -EXPORT_SYMBOL(trace_hardirqs_off); -``` - -and just calls `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` filed of the current process increment the `redundant_hardirqs_off` if call of the `local_irq_disable` was redundant or the `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_internals.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_internals.h) and located in the `lockdep_stats` structure: - -```C -struct lockdep_stats { -... -... -... -int softirqs_off_events; -int redundant_softirqs_off; -... -... -... -} -``` - -If you will set `CONFIG_DEBUG_LOCKDEP` kernel configuration option, the `lockdep_stats_debug_show` function will write all tracing information to the `/proc/lockdep`: - -```C -static void lockdep_stats_debug_show(struct seq_file *m) -{ -#ifdef CONFIG_DEBUG_LOCKDEP - unsigned long long hi1 = debug_atomic_read(hardirqs_on_events), - hi2 = debug_atomic_read(hardirqs_off_events), - hr1 = debug_atomic_read(redundant_hardirqs_on), - ... - ... - ... - seq_printf(m, " hardirq on events: %11llu\n", hi1); - seq_printf(m, " hardirq off events: %11llu\n", hi2); - seq_printf(m, " redundant hardirq ons: %11llu\n", hr1); -#endif -} -``` - -and you can see its result with the: - -``` -$ sudo cat /proc/lockdep - hardirq on events: 12838248974 - hardirq off events: 12838248979 - redundant hardirq ons: 67792 - redundant hardirq offs: 3836339146 - softirq on events: 38002159 - softirq off events: 38002187 - redundant softirq ons: 0 - redundant softirq offs: 0 -``` - -Ok, now we know a little about tracing, but more info will be in the separate part about `lockdep` and `tracing`. You can see that the both `local_disable_irq` macros have the same part - `raw_local_irq_disable`. This macro defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) and expands to the call of the: - -```C -static inline void native_irq_disable(void) -{ - asm volatile("cli": : :"memory"); -} -``` - -And you already must remember that `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines ability of a processor to handle and interrupt or an exception. Besides the `local_irq_disable`, as you already can know there is an inverse macr - `local_irq_enable`. This macro has the same tracing mechanism and very similar on the `local_irq_enable`, but as you can understand from its name, it enables interrupts with the `sti` instruction: - -```C -static inline void native_irq_enable(void) -{ - asm volatile("sti": : :"memory"); -} -``` - -Now we know how `local_irq_disable` and `local_irq_enable` work. It was the first call of the `local_irq_disable` macro, but we will meet these macros many times in the Linux kernel source code. But for now we are in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and we just disabled `local` interrupts. Why local and why we did it? Previously kernel provided a method to disable interrupts on all processors and it was called `cli`. This function was [removed](https://lwn.net/Articles/291956/) and now we have `local_irq_{enabled,disable}` to disable or enable interrupts on the current processor. After we've disabled the interrupts with the `local_irq_disable` macro, we set the: - -```C -early_boot_irqs_disabled = true; -``` - -The `early_boot_irqs_disabled` variable defined in the [include/linux/kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h): - -```C -extern bool early_boot_irqs_disabled; -``` - -and used in the different places. For example it used in the `smp_call_function_many` function from the [kernel/smp.c](https://github.com/torvalds/linux/blob/master/kernel/smp.c) for the checking possible deadlock when interrupts are disabled: - -```C -WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() - && !oops_in_progress && !early_boot_irqs_disabled); -``` - -Early trap initialization during kernel initialization --------------------------------------------------------------------------------- - -The next functions after the `local_disable_irq` are `boot_cpu_init` and `page_address_init`, but they are not related to the interrupts and exceptions (more about this functions you can read in the chapter about Linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)). The next is the `setup_arch` function. As you can remember this function located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel.setup.c) source code file and makes initialization of many different architecture-dependent [stuff](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). The first interrupts related function which we can see in the `setup_arch` is the - `early_trap_init` function. This function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and fills `Interrupt Descriptor Table` with the couple of entries: - -```C -void __init early_trap_init(void) -{ - set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK); - set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK); -#ifdef CONFIG_X86_32 - set_intr_gate(X86_TRAP_PF, page_fault); -#endif - load_idt(&idt_descr); -} -``` - -Here we can see calls of three different functions: - -* `set_intr_gate_ist` -* `set_system_intr_gate_ist` -* `set_intr_gate` - -All of these functions defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) and do the similar thing but not the same. The first `set_intr_gate_ist` function inserts new an interrupt gate in the `IDT`. Let's look on its implementation: - -```C -static inline void set_intr_gate_ist(int n, void *addr, unsigned ist) -{ - BUG_ON((unsigned)n > 0xFF); - _set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS); -} -``` - -First of all we can see the check that `n` which is [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt is not greater than `0xff` or 255. We need to check it because we remember from the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) that vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function that sets a given interrupt gate to the `IDT` table: - -```C -static inline void _set_gate(int gate, unsigned type, void *addr, - unsigned dpl, unsigned ist, unsigned seg) -{ - gate_desc s; - - pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg); - write_idt_entry(idt_table, gate, &s); - write_trace_idt_entry(gate, &s); -} -``` - -Here we start from the `pack_gate` function which takes clean `IDT` entry represented by the `gate_desc` structure and fills it with the base address and limit, [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks), [Privilege level](http://en.wikipedia.org/wiki/Privilege_level), type of an interrupt which can be one of the following values: - -* `GATE_INTERRUPT` -* `GATE_TRAP` -* `GATE_CALL` -* `GATE_TASK` - -and set the present bit for the given `IDT` entry: - -```C -static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func, - unsigned dpl, unsigned ist, unsigned seg) -{ - gate->offset_low = PTR_LOW(func); - gate->segment = __KERNEL_CS; - gate->ist = ist; - gate->p = 1; - gate->dpl = dpl; - gate->zero0 = 0; - gate->zero1 = 0; - gate->type = type; - gate->offset_middle = PTR_MIDDLE(func); - gate->offset_high = PTR_HIGH(func); -} -``` - -After this we write just filled interrupt gate to the `IDT` with the `write_idt_entry` macro which expands to the `native_write_idt_entry` and just copy the interrupt gate to the `idt_table` table by the given index: - -```C -#define write_idt_entry(dt, entry, g) native_write_idt_entry(dt, entry, g) - -static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate) -{ - memcpy(&idt[entry], gate, sizeof(*gate)); -} -``` - -where `idt_table` is just array of `gate_desc`: - -```C -extern gate_desc idt_table[]; -``` - -That's all. The second `set_system_intr_gate_ist` function has only one difference from the `set_intr_gate_ist`: - -```C -static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist) -{ - BUG_ON((unsigned)n > 0xFF); - _set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS); -} -``` - -Do you see it? Look on the fourth parameter of the `_set_gate`. It is `0x3`. In the `set_intr_gate` it was `0x0`. We know that this parameter represent `DPL` or privilege level. We also know that `0` is the highest privilge level and `3` is the lowest.Now we know how `set_system_intr_gate_ist`, `set_intr_gate_ist`, `set_intr_gate` are work and we can return to the `early_trap_init` function. Let's look on it again: - -```C -set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK); -set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK); -``` - -We set two `IDT` entries for the `#DB` interrupt and `int3`. These functions takes the same set of parameters: - -* vector number of an interrupt; -* address of an interrupt handler; -* interrupt stack table index. - -That's all. More about interrupts and handlers you will know in the next parts. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw the some theory in the previous part and started to dive into interrupts and exceptions handling in the current part. We have started from the earliest parts in the Linux kernel source code which are related to the interrupts. In the next part we will continue to dive into this interesting theme and will know more about interrupt handling process. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table) -* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode) -* [List of x86 calling conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions) -* [8086](http://en.wikipedia.org/wiki/Intel_8086) -* [Long mode](http://en.wikipedia.org/wiki/Long_mode) -* [NX](http://en.wikipedia.org/wiki/NX_bit) -* [Extended Feature Enable Register](http://en.wikipedia.org/wiki/Control_register#Additional_Control_registers_in_x86-64_series) -* [Model-specific register](http://en.wikipedia.org/wiki/Model-specific_register) -* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier) -* [lockdep](http://lwn.net/Articles/321663/) -* [irqflags tracing](https://www.kernel.org/doc/Documentation/irqflags-tracing.txt) -* [IF](http://en.wikipedia.org/wiki/Interrupt_flag) -* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) -* [Union type](http://en.wikipedia.org/wiki/Union_type) -* [this_cpu_* operations](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt) -* [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) -* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) -* [Privilege level](http://en.wikipedia.org/wiki/Privilege_level) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) diff --git a/interrupts/interrupts-3.md b/interrupts/interrupts-3.md deleted file mode 100644 index 48a7013..0000000 --- a/interrupts/interrupts-3.md +++ /dev/null @@ -1,470 +0,0 @@ -Interrupts and Interrupt Handling. Part 3. -================================================================================ - -Interrupt handlers --------------------------------------------------------------------------------- - -This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stoped in the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) on the setting of the two exceptions handlers for the two following exceptions: - -* `#DB` - debug exception, transfers control from the interrupted process to the debug handler; -* `#BP` - breakpoint exception, caused by the `int 3` instruction. - -These exceptions allow the `x86_64` architecture to have early exception processing for the purpose of debugging via the [kgdb](https://en.wikipedia.org/wiki/KGDB). - -As you can remember we set these exceptions handlers in the `early_trap_init` function: - -```C -void __init early_trap_init(void) -{ - set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK); - set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK); - load_idt(&idt_descr); -} -``` - -from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part and now we will look on the implementation of these early exceptions handlers. - -Debug and Breakpoint exceptions --------------------------------------------------------------------------------- - -Ok, we set the interrupts gates in the `early_trap_init` function for the `#DB` and `#BP` exceptions and now time is to look on their handlers. But first of all let's look on these exceptions. The first exceptions - `#DB` or debug exception occurs when a debug event occurs, for example attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers which present in processors starting from the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386) and as you can understand from its name they are used for debugging. These registers allow to set breakpoints on the code and read or write data to trace, thus tracking the place of errors. The debug registers are privileged resources available and the program in either real-address or protected mode at `CPL` is `0`, that's why we have used `set_intr_gate_ist` for the `#DB`, but not the `set_system_intr_gate_ist`. The verctor number of the `#DB` exceptions is `1` (we pass it as `X86_TRAP_DB`) and has no error code: - -``` ----------------------------------------------------------------------------------------------- -|Vector|Mnemonic|Description |Type |Error Code|Source | ----------------------------------------------------------------------------------------------- -|1 | #DB |Reserved |F/T |NO | | ----------------------------------------------------------------------------------------------- -``` - -The second is `#BP` or breakpoint exception occurs when processor executes the [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. We can add it anywhere in our code, for example let's look on the simple program: - -```C -// breakpoint.c -#include - -int main() { - int i; - while (i < 6){ - printf("i equal to: %d\n", i); - __asm__("int3"); - ++i; - } -} -``` - -If we will compile and run this program, we will see following output: - -``` -$ gcc breakpoint.c -o breakpoint -i equal to: 0 -Trace/breakpoint trap -``` - -But if will run it with gdb, we will see our breakpoint and can continue execution of our program: - -``` -$ gdb breakpoint -... -... -... -(gdb) run -Starting program: /home/alex/breakpoints -i equal to: 0 - -Program received signal SIGTRAP, Trace/breakpoint trap. -0x0000000000400585 in main () -=> 0x0000000000400585 : 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1 -(gdb) c -Continuing. -i equal to: 1 - -Program received signal SIGTRAP, Trace/breakpoint trap. -0x0000000000400585 in main () -=> 0x0000000000400585 : 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1 -(gdb) c -Continuing. -i equal to: 2 - -Program received signal SIGTRAP, Trace/breakpoint trap. -0x0000000000400585 in main () -=> 0x0000000000400585 : 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1 -... -... -... -``` - -Now we know a little about these two exceptions and we can move on to consideration of their handlers. - -Preparation before an interrupt handler --------------------------------------------------------------------------------- - -As you can note, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions takes an addresses of the exceptions handlers in the second parameter: - -* `&debug`; -* `&int3`. - -You will not find these functions in the C code. All that can be found in in the `*.c/*.h` files only definition of this functions in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h): - -```C -asmlinkage void debug(void); -asmlinkage void int3(void); -``` - -But we can see `asmlinkage` descriptor here. The `asmlinkage` is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are will be called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro: - -```assembly -idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -``` - -Actually `debug` and `int3` are not interrupts handlers. Remember that before we can execute an interrupt/exception handler, we need to do some preparations as: - -* When an interrupt or exception occured, the processor uses an exception or interrupt vector as an index to a descriptor in the `IDT`; -* In legacy mode `ss:esp` registers are pushed on the stack only if privilege level changed. In 64-bit mode `ss:rsp` pushed on the stack everytime; -* During stack switching with `IST` the new `ss` selector is forced to null. Old `ss` and `rsp` are pushed on the new stack. -* The `rflags`, `cs`, `rip` and error code pushed on the stack; -* Control transfered to an interrupt handler; -* After an interrupt handler will finish its work and finishes with the `iret` instruction, old `ss` will be poped from the stack and loaded to the `ss` register. -* `ss:rsp` will be popped from the stack unconditionally in the 64-bit mode and will be popped only if there is a privilege level change in legacy mode. -* `iret` instruction will restore `rip`, `cs` and `rflags`; -* Interrupted program will continue its execution. - -``` - +--------------------+ -+40 | ss | -+32 | rsp | -+24 | rflags | -+16 | cs | - +8 | rip | - 0 | error code | - +--------------------+ -``` - -Now we can see on the preparations before a process will transfer control to an interrupt/exception handler from practical side. As I already wrote above the first thirteen exceptions handlers defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly file with the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S#L967) macro: - -```assembly -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 -ENTRY(\sym) -... -... -... -END(\sym) -.endm -``` - -This macro defines an exception entry point and as we can see it takes `five` arguments: - -* `sym` - defines global symbol with the `.globl name`. -* `do_sym` - an interrupt handler. -* `has_error_code:req` - information about error code, The `:req` qualifier tells the assembler that the argument is required; -* `paranoid` - shows us how we need to check current mode; -* `shift_ist` - shows us what's stack to use; - -As we can see our exceptions handlers are almost the same: - -```assembly -idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK -``` - -The differences are only in the global name and name of exceptions handlers. Now let's look how `idtentry` macro implemented. It starts from the two checks: - -```assembly - .if \shift_ist != -1 && \paranoid == 0 - .error "using shift_ist requires paranoid=1" - .endif - - .if \has_error_code - XCPT_FRAME - .else - INTR_FRAME - .endif -``` - -First check makes the check that an exceptions uses `Interrupt stack table` and `paranoid` is set, in other way it emits the erorr with the [.error](https://sourceware.org/binutils/docs/as/Error.html#Error) directive. The second `if` clause checks existence of an error code and calls `XCPT_FRAME` or `INTR_FRAME` macros depends on it. These macros just expand to the set of [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) which are used by `GNU AS` to manage call frames. The `CFI` directives are used only to generate [dwarf2](http://en.wikipedia.org/wiki/DWARF) unwind information for better backtraces and they don't change any code, so we will not go into detail about it and from this point I will skip all code which is related to these directives. In the next step we check error code again and push it on the stack if an exception has it with the: - -```assembly -.ifeq \has_error_code - pushq_cfi $-1 -.endif -``` - -The `pushq_cfi` macro defined in the [arch/x86/include/asm/dwarf2.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/dwarf2.h) and expands to the `pushq` instruction which pushes given error code: - -```assembly - .macro pushq_cfi reg - pushq \reg - CFI_ADJUST_CFA_OFFSET 8 - .endm -``` - -Pay attention on the `$-1`. We already know that when an exception occrus, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack: - -```C -#define RIP 16*8 -#define CS 17*8 -#define EFLAGS 18*8 -#define RSP 19*8 -#define SS 20*8 -``` - -With the `pushq \reg` we denote that place before the `RIP` will contain error code of an exception: - -```C -#define ORIG_RAX 15*8 -``` - -The `ORIG_RAX` will contain error code of an exception, [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) number on a hardware interrupt and system call number on [system call](http://en.wikipedia.org/wiki/System_call) entry. In the next step we can see thr `ALLOC_PT_GPREGS_ON_STACK` macro which allocates space for the 15 general purpose registers on the stack: - -```assembly -.macro ALLOC_PT_GPREGS_ON_STACK addskip=0 -subq $15*8+\addskip, %rsp -CFI_ADJUST_CFA_OFFSET 15*8+\addskip -.endm -``` - -After this we check `paranoid` and if it is set we check first three `CPL` bits. We compare it with the `3` and it allows us to know did we come from userspace or not: - -```assembly -.if \paranoid - .if \paranoid == 1 - CFI_REMEMBER_STATE - testl $3, CS(%rsp) - jnz 1f - .endif - call paranoid_entry -.else - call error_entry -.endif -``` - -If we came from userspace we jump on the label `1` which starts from the `call error_entry` instruction. The `error_entry` saves all registers in the `pt_regs` structure which presetens an interrupt/exception stack frame and defined in the [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with the: - -```assembly -SAVE_C_REGS 8 -SAVE_EXTRA_REGS 8 -``` - -from `rdi` to `r15` and executes [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method to for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we will exit from the `error_entry` with the `ret` instruction. After the `error_entry` finished to execute, since we came from userspace we need to switch on kernel interrupt stack: - -```assembly - movq %rsp,%rdi - call sync_regs -``` - -We just save all registers to the `error_entry` in the `error_entry`, we put address of the `pt_regs` to the `rdi` and call `sync_regs` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c): - -```C -asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs) -{ - struct pt_regs *regs = task_pt_regs(current); - *regs = *eregs; - return regs; -} -``` - -This function switchs off the `IST` stack if we came from usermode. After this we switch on the stack which we got from the `sync_regs`: - -```assembly -movq %rax,%rsp -movq %rsp,%rdi -``` - -and put pointer of the `pt_regs` again in the `rdi`, and in the last step we call an exception handler: - -```assembly -call \do_sym -``` - -So, realy exceptions handlers are `do_debug` and `do_int3` functions. We will see these function in this part, but little later. First of all let's look on the preparations before a processor will transfer control to an interrupt handler. In another way if `paranoid` is set, but it is not 1, we call `paranoid_entry` which makes almost the same that `error_entry`, but it checks current mode with more slow but accurate way: - -```assembly -ENTRY(paranoid_entry) - SAVE_C_REGS 8 - SAVE_EXTRA_REGS 8 - ... - ... - movl $MSR_GS_BASE,%ecx - rdmsr - testl %edx,%edx - js 1f /* negative -> in kernel */ - SWAPGS - ... - ... - ret -END(paranoid_entry) -``` - -If `edx` wll be negative, we are in the kernel mode. As we store all registers on the stack, check that we are in the kernel mode, we need to setup `IST` stack if it is set for a given exception, call an exception handler and restore the exception stack: - -```assembly - .if \shift_ist != -1 - subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist) - .endif - - call \do_sym - - .if \shift_ist != -1 - addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist) - .endif -``` - -The last step when an exception handler will finish it's work all registers will be restored from the stack with the `RESTORE_C_REGS` and `RESTORE_EXTRA_REGS` macros and control will be returned an interrupted task. That's all. Now we know about preparation before an interrupt/exception handler will start to execute and we can go directly to the implementation of the handlers. - -Implementation of ainterrupts and exceptions handlers --------------------------------------------------------------------------------- - -Both handlers `do_debug` and `do_int3` defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file and have two similar things: All interrupts/exceptions handlers marked with the `dotraplinkage` prefix that expands to the: - -```C -#define dotraplinkage __visible -#define __visible __attribute__((externally_visible)) -``` - -which tells to compiler that something else uses this function (in our case these functions are called from the assembly interrupt preparation code). And also they takes two parameters: - -* pointer to the `pt_regs` structure which contains registers of the interrupted task; -* error code. - -First of all let's consider `do_debug` handler. This function starts from the getting previous state with the `ist_enter` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We call it because we need to know, did we come to the interrupt handler from the kernel mode or user mode. - -```C -prev_state = ist_enter(regs); -``` - -The `ist_enter` function returns previous state context state and executes a couple preprartions before we continue to handle an exception. It starts from the check of the previous mode with the `user_mode_vm` macro. It takes `pt_regs` structure which contains a set of registers of the interrupted task and returns `1` if we came from userspace and `0` if we came from kernel space. According to the previous mode we execute `exception_enter` if we are from the userspace or inform [RCU](https://en.wikipedia.org/wiki/Read-copy-update) if we are from krenel space: - -```C -... -if (user_mode_vm(regs)) { - prev_state = exception_enter(); -} else { - rcu_nmi_enter(); - prev_state = IN_KERNEL; -} -... -... -... -return prev_state; -``` - -After this we load the `DR6` debug registers to the `dr6` variable with the call of the `get_debugreg` macro from the [arch/x86/include/asm/debugreg.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/debugreg.h): - -```C -get_debugreg(dr6, 6); -dr6 &= ~DR6_RESERVED; -``` - -The `DR6` debug register is debug status register contains information about the reason for stopping the `#DB` or debug exception handler. After we loaded its value to the `dr6` variable we filter out all reserved bits (`4:12` bits). In the next step we check `dr6` register and previous state with the following `if` condition expression: - -```C -if (!dr6 && user_mode_vm(regs)) - user_icebp = 1; -``` - -If `dr6` does not show any reasons why we caught this trap we set `user_icebp` to one which means that user-code wants to get [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP) signal. In the next step we check was it [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) trap and if yes we go to exit: - -```C -if ((dr6 & DR_STEP) && kmemcheck_trap(regs)) - goto exit; -``` - -After we did all these checks, we clear the `dr6` register, clear the `DEBUGCTLMSR_BTF` flag which provides single-step on branches debugging, set `dr6` register for the current thread and increase `debug_stack_usage` [per-cpu]([Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)) variable with the: - -```C -set_debugreg(0, 6); -clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP); -tsk->thread.debugreg6 = dr6; -debug_stack_usage_inc(); -``` - -As we saved `dr6`, we can allow irqs: - -```C -static inline void preempt_conditional_sti(struct pt_regs *regs) -{ - preempt_count_inc(); - if (regs->flags & X86_EFLAGS_IF) - local_irq_enable(); -} -``` - -more about `local_irq_enabled` and related stuff you can read in the second part about [interrupts handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html). In the next step we check the previous mode was [virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) and handle the trap: - -```C -if (regs->flags & X86_VM_MASK) { - handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, X86_TRAP_DB); - preempt_conditional_cli(regs); - debug_stack_usage_dec(); - goto exit; -} -... -... -... -exit: - ist_exit(regs, prev_state); -``` - -If we came not from the virtual 8086 mode, we need to check `dr6` register and previous mode as we did it above. Here we check if step mode debugging is -enabled and we are not from the user mode, we enabled step mode debugging in the `dr6` copy in the current thread, set `TIF_SINGLE_STEP` falg and re-enable [Trap flag](https://en.wikipedia.org/wiki/Trap_flag) for the user mode: - -```C -if ((dr6 & DR_STEP) && !user_mode(regs)) { - tsk->thread.debugreg6 &= ~DR_STEP; - set_tsk_thread_flag(tsk, TIF_SINGLESTEP); - regs->flags &= ~X86_EFLAGS_TF; -} -``` - -Then we get `SIGTRAP` signal code: - -```C -si_code = get_si_code(tsk->thread.debugreg6); -``` - -and send it for user icebp traps: - -```C -if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp) - send_sigtrap(tsk, regs, error_code, si_code); -preempt_conditional_cli(regs); -debug_stack_usage_dec(); -exit: - ist_exit(regs, prev_state); -``` - -In the end we disabled `irqs`, decrement value of the `debug_stack_usage` and exit from the exception handler with the `ist_exit` function. - -The second exception handler is `do_int3` defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In the `do_int3` we makes almost the same that in the `do_debug` handler. We get the previous state with the `ist_enter`, increment and decrement the `debug_stack_usage` per-cpu variable, enabled and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and than sync processor cores during breakpoint patching. - -That's all. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Interrupt descriptor table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) in the previous part with the `#DB` and `#BP` gates and started to dive into preparation before control will be transfered to an exception handler and implementation of some interrupt handlers in this part. In the next part we will continue to dive into this theme and will go next by the `setup_arch` function and will try to understand interrupts handling related stuff. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [Debug registers](http://en.wikipedia.org/wiki/X86_debug_register) -* [Intel 80385](http://en.wikipedia.org/wiki/Intel_80386) -* [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) -* [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection) -* [TSS](http://en.wikipedia.org/wiki/Task_state_segment) -* [GNU assembly .error directive](https://sourceware.org/binutils/docs/as/Error.html#Error) -* [dwarf2](http://en.wikipedia.org/wiki/DWARF) -* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) -* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) -* [system call](http://en.wikipedia.org/wiki/System_call) -* [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) -* [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP) -* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) -* [kgdb](https://en.wikipedia.org/wiki/KGDB) -* [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) diff --git a/interrupts/interrupts-4.md b/interrupts/interrupts-4.md deleted file mode 100644 index 57a9c81..0000000 --- a/interrupts/interrupts-4.md +++ /dev/null @@ -1,465 +0,0 @@ -Interrupts and Interrupt Handling. Part 4. -================================================================================ - -Initialization of non-early interrupt gates --------------------------------------------------------------------------------- - -This is fourth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) we saw first early `#DB` and `#BP` exceptions handlers from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped on the right after the `early_trap_init` function that called in the `setup_arch` function which defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into an interrupts and exceptions handling in the Linux kernel for `x86_64` and continue to do it from from the place where we left off in the last part. First thing which is related to the interrupts and exceptions handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it. - -Early page fault handler --------------------------------------------------------------------------------- - -The `early_trap_pf_init` function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). It uses `set_intr_gate` macro that filles [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the given entry: - -```C -void __init early_trap_pf_init(void) -{ -#ifdef CONFIG_X86_64 - set_intr_gate(X86_TRAP_PF, page_fault); -#endif -} -``` - -This macro defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h). We already saw macros like this in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) - `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that given vector number is not greater than `255` (maximum vector number) and calls `_set_gate` function as `set_system_intr_gate` and `set_intr_gate_ist` did it: - -```C -#define set_intr_gate(n, addr) \ -do { \ - BUG_ON((unsigned)n > 0xFF); \ - _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, \ - __KERNEL_CS); \ - _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\ - 0, 0, __KERNEL_CS); \ -} while (0) -``` - -The `set_intr_gate` macro takes two parameters: - -* vector number of a interrupt; -* address of an interrupt handler; - -In our case they are: - -* `X86_TRAP_PF` - `14`; -* `page_fault` - the interrupt handler entry point. - -The `X86_TRAP_PF` is the element of enum which defined in the [arch/x86/include/asm/traprs.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traprs.h): - -```C -enum { - ... - ... - ... - ... - X86_TRAP_PF, /* 14, Page Fault */ - ... - ... - ... -} -``` - -When the `early_trap_pf_init` will be called, the `set_intr_gate` will be expanded to the call of the `_set_gate` which will fill the `IDT` with the handler for the page fault. Now let's look on the implementation of the `page_fault` handler. The `page_fault` handler defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file as all exceptions handlers. Let's look on it: - -```assembly -trace_idtentry page_fault do_page_fault has_error_code=1 -``` - -We saw in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) how `#DB` and `#BP` handlers defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option: - -```assembly -#ifdef CONFIG_TRACING -.macro trace_idtentry sym do_sym has_error_code:req -idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code -idtentry \sym \do_sym has_error_code=\has_error_code -.endm -#else -.macro trace_idtentry sym do_sym has_error_code:req -idtentry \sym \do_sym has_error_code=\has_error_code -.endm -#endif -``` - -We will not dive into exceptions [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) now. If `CONFIG_TRACING` is not set, we can see that `trace_idtentry` macro just expands to the normal `idtentry`. We already saw implementation of the `idtentry` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so let's start from the `page_fault` exception handler. - -As we can see in the `idtentry` definition, the handler of the `page_fault` is `do_page_fault` function which defined in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) and as all exceptions handlers it takes two arguments: - -* `regs` - `pt_regs` structure that holds state of an interrupted process; -* `error_code` - error code of the page fault exception. - -Let's look inside this function. First of all we read content of the [cr2](https://en.wikipedia.org/wiki/Control_register) control register: - -```C -dotraplinkage void notrace -do_page_fault(struct pt_regs *regs, unsigned long error_code) -{ - unsigned long address = read_cr2(); - ... - ... - ... -} -``` - -This register contains a linear address which caused `page fault`. In the next step we make a call of the `exception_enter` function from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/blob/master/include/context_tracking.h). The `exception_enter` and `exception_exit` are functions from context tracking subsytem in the Linux kernel used by the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) to remove its dependency on the timer tick while a processor runs in userspace. Almost in the every exception handler we will see similar code: - -```C -enum ctx_state prev_state; -prev_state = exception_enter(); -... -... // exception handler here -... -exception_exit(prev_state); -``` - -The `exception_enter` function checks that `context tracking` is enabled with the `context_tracking_is_enabled` and if it is in enabled state, we get previous context with te `this_cpu_read` (more about `this_cpu_*` operations you can read in the [Documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)). After this it calls `context_tracking_user_exit` function which informs that Inform the context tracking that the processor is exiting userspace mode and entering the kernel: - -```C -static inline enum ctx_state exception_enter(void) -{ - enum ctx_state prev_ctx; - - if (!context_tracking_is_enabled()) - return 0; - - prev_ctx = this_cpu_read(context_tracking.state); - context_tracking_user_exit(); - - return prev_ctx; -} -``` - -The state can be one of the: - -```C -enum ctx_state { - IN_KERNEL = 0, - IN_USER, -} state; -``` - -And in the end we return previous context. Between the `exception_enter` and `exception_exit` we call actual page fault handler: - -```C -__do_page_fault(regs, error_code, address); -``` - -The `__do_page_fault` is defined in the same source code file as `do_page_fault` - [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c). In the bingging of the `__do_page_fault` we check state of the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) checker. The `kmemcheck` detects warns about some uses of uninitialized memory. We need to check it because page fault can be caused by kmemcheck: - -```C -if (kmemcheck_active(regs)) - kmemcheck_hide(regs); - prefetchw(&mm->mmap_sem); -``` - -After this we can see the call of the `prefetchw` which executes instruction with the same [name](http://www.felixcloutier.com/x86/PREFETCHW.html) which fetches [X86_FEATURE_3DNOW](https://en.wikipedia.org/?title=3DNow!) to get exclusive [cache line](https://en.wikipedia.org/wiki/CPU_cache). The main purpose of prefetching is to hide the latency of a memory access. In the next step we check that we got page fault not in the kernel space with the following conditiion: - -```C -if (unlikely(fault_in_kernel_space(address))) { -... -... -... -} -``` - -where `fault_in_kernel_space` is: - -```C -static int fault_in_kernel_space(unsigned long address) -{ - return address >= TASK_SIZE_MAX; -} -``` - -The `TASK_SIZE_MAX` macro expands to the: - -```C -#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE) -``` - -or `0x00007ffffffff000`. Pay attention on `unlikely` macro. There are two macros in the Linux kernel: - -```C -#define likely(x) __builtin_expect(!!(x), 1) -#define unlikely(x) __builtin_expect(!!(x), 0) -``` - -You can [often](http://lxr.free-electrons.com/ident?i=unlikely) find these macros in the code of the Linux kernel. Main purpose of these macros is optimization. Sometimes this situation is that we need to check the condition of the code and we know that it will rarely be `true` or `false`. With these macros we can tell to the compiler about this. For example - -```C -static int proc_root_readdir(struct file *file, struct dir_context *ctx) -{ - if (ctx->pos < FIRST_PROCESS_ENTRY) { - int error = proc_readdir(file, ctx); - if (unlikely(error <= 0)) - return error; -... -... -... -} -``` - -Here we can see `proc_root_readdir` function which will be called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If condition marked with `unlikely`, compiler can put `false` code right after branching. Now let's back to the our address check. Comparison between the given address and the `0x00007ffffffff000` will give us to know, was page fault in the kernel mode or user mode. After this check we know it. After this `__do_page_fault` routine will try to understand the problem that provoked page fault exception and then will pass address to the approprite routine. It can be `kmemcheck` fault, spurious fault, [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault and etc. Will not dive into implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kerne, but will see it in the chapter about the [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) in the Linux kernel. - -Back to start_kernel --------------------------------------------------------------------------------- - -There are many different function calls after the `early_trap_pf_init` in the `setup_arch` function from different kernel subsystems, but there are no one interrupts and exceptions handling related. So, we have to go back where we came from - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L492). The first things after the `setup_arch` is the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). This function makes initialization of the remaining exceptions handlers (remember that we already setup 3 handlres for the `#DB` - debug exception, `#BP` - breakpoint exception and `#PF` - page fault exception). The `trap_init` function starts from the check of the [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture): - -```C -#ifdef CONFIG_EISA - void __iomem *p = early_ioremap(0x0FFFD9, 4); - - if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24)) - EISA_bus = 1; - early_iounmap(p, 4); -#endif -``` - -Note that it depends on the `CONFIG_EISA` kernel configuration parameter which represetns `EISA` support. Here we use `early_ioremap` function to map `I/O` memory on the page tables. We use `readl` function to read first `4` bytes from the mapped region and if they are equal to `EISA` string we set `EISA_bus` to one. In the end we just unmap previously mapped region. More about `early_ioremap` you can read in the part which describes [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). - -After this we start to fill the `Interrupt Descriptor Table` with the different interrupt gates. First of all we set `#DE` or `Divide Error` and `#NMI` or `Non-maskable Interrupt`: - -```C -set_intr_gate(X86_TRAP_DE, divide_error); -set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK); -``` - -We use `set_intr_gate` macro to set the interrupt gate for the `#DE` exception and `set_intr_gate_ist` for the `#NMI`. You can remember that we already used these macros when we have set the interrupts gates for the page fault handler, debug handler and etc, you can find explanation of it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html). After this we setup exception gates for the following exceptions: - -```C -set_system_intr_gate(X86_TRAP_OF, &overflow); -set_intr_gate(X86_TRAP_BR, bounds); -set_intr_gate(X86_TRAP_UD, invalid_op); -set_intr_gate(X86_TRAP_NM, device_not_available); -``` - -Here we can see: - -* `#OF` or `Overflow` exception. This exception indicates that an overflow trap occurred when an special [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html) instruction was executed; -* `#BR` or `BOUND Range exceeded` exception. This exception indeicates that a `BOUND-range-exceed` fault occured when a [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) instruction was executed; -* `#UD` or `Invalid Opcode` exception. Occurs when a processor attempted to execute invalid or reserved [opcode](https://en.wikipedia.org/?title=Opcode), processor attempted to execute instruction with invalid operand(s) and etc; -* `#NM` or `Device Not Available` exception. Occurs when the processor tries to execute `x87 FPU` floating point instruction while `EM` flag in the [control register](https://en.wikipedia.org/wiki/Control_register#CR0) `cr0` was set. - -In the next step we set the interrupt gate for the `#DF` or `Double fault` exception: - -```C -set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK); -``` - -This exception occurs when processor detected a second exception while calling an exception handler for a prior exception. In usual way when the processor detects another exception while trying to call an exception handler, the two exceptions can be handled serially. If the processor cannot handle them serially, it signals the double-fault or `#DF` exception. - -The following set of the interrupt gates is: - -```C -set_intr_gate(X86_TRAP_OLD_MF, &coprocessor_segment_overrun); -set_intr_gate(X86_TRAP_TS, &invalid_TSS); -set_intr_gate(X86_TRAP_NP, &segment_not_present); -set_intr_gate_ist(X86_TRAP_SS, &stack_segment, STACKFAULT_STACK); -set_intr_gate(X86_TRAP_GP, &general_protection); -set_intr_gate(X86_TRAP_SPURIOUS, &spurious_interrupt_bug); -set_intr_gate(X86_TRAP_MF, &coprocessor_error); -set_intr_gate(X86_TRAP_AC, &alignment_check); -``` - -Here we can see setup for the following exception handlers: - -* `#CSO` or `Coprocessor Segment Overrun` - this exception indicates that math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) of an old processor detected a page or segment violation. Modern processors do not generate this exception -* `#TS` or `Invalid TSS` exception - indicates that there was an error related to the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment). -* `#NP` or `Segement Not Present` exception indicates that the `present flag` of a segment or gate descriptor is clear during attempt to load one of `cs`, `ds`, `es`, `fs`, or `gs` register. -* `#SS` or `Stack Fault` exception indicates one of the stack related conditions was detected, for example a not-present stack segment is detected when attempting to load the `ss` register. -* `#GP` or `General Protection` exception indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause general-procetion exception. For example loading the `ss`, `ds`, `es`, `fs`, or `gs` register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the `Interrupt Descriptor Table` (following an interrupt or exception) that is not an interrupt, trap, or task gate and many many more. -* `Spurious Interrupt` - a hardware interrupt that is unwanted. -* `#MF` or `x87 FPU Floating-Point Error` exception caused when the [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions) has detected a floating point error. -* `#AC` or `Alignment Check` exception Indicates that the processor detected an unaligned memory operand when alignment checking was enabled. - -After that we setup this exception gates, we can see setup of the `Machine-Check` exception: - -```C -#ifdef CONFIG_X86_MCE - set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK); -#endif -``` - -Note that it depends on the `CONFIG_X86_MCE` kernel configuration option and indicates that the processor detected an internal [machine error](https://en.wikipedia.org/wiki/Machine-check_exception) or a bus error, or that an external agent detected a bus error. The next exception gate is for the [SIMD](https://en.wikipedia.org/?title=SIMD) Floating-Point exception: - -```C -set_intr_gate(X86_TRAP_XF, &simd_coprocessor_error); -``` - -which indicates the processor has detected an `SSE` or `SSE2` or `SSE3` SIMD floating-point exception. There are six classes of numeric exception conditions that can occur while executing an SIMD floating-point instruction: - -* Invalid operation -* Divide-by-zero -* Denormal operand -* Numeric overflow -* Numeric underflow -* Inexact result (Precision) - -In the next step we fill the `used_vectors` array which defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h) header file and represents `bitmap`: - -```C -DECLARE_BITMAP(used_vectors, NR_VECTORS); -``` - -of the first `32` interrupts (more about bitmaps in the Linux kernel you can read in the part which describes [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)) - -```C -for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) - set_bit(i, used_vectors) -``` - -where `FIRST_EXTERNAL_VECTOR` is: - -```C -#define FIRST_EXTERNAL_VECTOR 0x20 -``` - -After this we setup the interrupt gate for the `ia32_syscall` and add `0x80` to the `used_vectors` bitmap: - -```C -#ifdef CONFIG_IA32_EMULATION - set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall); - set_bit(IA32_SYSCALL_VECTOR, used_vectors); -#endif -``` - -There is `CONFIG_IA32_EMULATION` kernel configuration option on `x86_64` Linux kernels. This option provides ability to execute 32-bit processes in compatibility-mode. In the next parts we will see how it works, in the meantime we need only to know that there is yet another interrupt gate in the `IDT` with the vector number `0x80`. In the next step we maps `IDT` to the fixmap area: - -```C -__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO); -idt_descr.address = fix_to_virt(FIX_RO_IDT); -``` - -and write its address to the `idt_descr.address` (more about fix-mapped addresses you can read in the second part of the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function that defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c). This function makes initialization of the all `per-cpu` state. In the beginnig of the `cpu_init` we do the following things: First of all we wait while current cpu is initialized and than we call the `cr4_init_shadow` function which stores shadow copy of the `cr4` control register for the current cpu and load CPU microcode if need with the following function calls: - -```C -wait_for_master_cpu(cpu); -cr4_init_shadow(); -load_ucode_ap(); -``` - -Next we get the `Task State Segement` for the current cpu and `orig_ist` structure which represents origin `Interrupt Stack Table` values with the: - -```C -t = &per_cpu(cpu_tss, cpu); -oist = &per_cpu(orig_ist, cpu); -``` - -As we got values of the `Task State Segement` and `Interrupt Stack Table` for the current processor, we clear following bits in the `cr4` control register: - -```C -cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE); -``` - -with this we disable `vm86` extension, virtual interrupts, timestamp ([RDTSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) can only be executed with the highest privilege) and debug extension. After this we reload the `Glolbal Descripto Table` and `Interrupt Descriptor table` with the: - -```C - switch_to_new_gdt(cpu); - loadsegment(fs, 0); - load_current_idt(); -``` - -After this we setup array of the Thread-Local Storage Descriptors, configure [NX](https://en.wikipedia.org/wiki/NX_bit) and load CPU microcode. Now is time to setup and load `per-cpu` Task State Segements. We are going in a loop through the all exception stack which is `N_EXCEPTION_STACKS` or `4` and fill it with `Interrupt Stack Tables`: - -```C - if (!oist->ist[0]) { - char *estacks = per_cpu(exception_stacks, cpu); - - for (v = 0; v < N_EXCEPTION_STACKS; v++) { - estacks += exception_stack_sizes[v]; - oist->ist[v] = t->x86_tss.ist[v] = - (unsigned long)estacks; - if (v == DEBUG_STACK-1) - per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks; - } - } -``` - -As we have filled `Task State Segements` with the `Interrupt Stack Tables` we can set `TSS` descriptor for the current processor and load it with the: - -```C -set_tss_desc(cpu, t); -load_TR_desc(); -``` - -where `set_tss_desc` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) writes given descriptor to the `Global Descriptor Table` of the given processor: - -```C -#define set_tss_desc(cpu, addr) __set_tss_desc(cpu, GDT_ENTRY_TSS, addr) -static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr) -{ - struct desc_struct *d = get_cpu_gdt_table(cpu); - tss_desc tss; - set_tssldt_descriptor(&tss, (unsigned long)addr, DESC_TSS, - IO_BITMAP_OFFSET + IO_BITMAP_BYTES + - sizeof(unsigned long) - 1); - write_gdt_entry(d, entry, &tss, DESC_TSS); -} -``` - -and `load_TR_desc` macro expands to the `ltr` or `Load Task Register` instruction: - -```C -#define load_TR_desc() native_load_tr_desc() -static inline void native_load_tr_desc(void) -{ - asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8)); -} -``` - -In the end of the `trap_init` function we can see the following code: - -```C -set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK); -set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK); -... -... -... -#ifdef CONFIG_X86_64 - memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16); - set_nmi_gate(X86_TRAP_DB, &debug); - set_nmi_gate(X86_TRAP_BP, &int3); -#endif -``` - -Here we copy `idt_table` to the `nmi_dit_table` and setup exception handlers for the `#DB` or `Debug exception` and `#BR` or `Breakpoint exception`. You can remember that we already set these interrupt gates in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so why do we need to setup it again? We setup it again because when we initialized it before in the `early_trap_init` function, the `Task State Segement` was not ready yet, but now it is ready after the call of the `cpu_init` function. - -That's all. Soon we will consider all handlers of these interrupts/exceptions. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment) in this part and initialization of the different interrupt handlers as `Divide Error`, `Page Fault` excetpion and etc. You can noted that we saw just initialization stuf, and will dive into details about handlers for these exceptions. In the next part we will start to do it. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [page fault](https://en.wikipedia.org/wiki/Page_fault) -* [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) -* [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) -* [cr2](https://en.wikipedia.org/wiki/Control_register) -* [RCU](https://en.wikipedia.org/wiki/Read-copy-update) -* [this_cpu_* operations](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt) -* [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) -* [prefetchw](http://www.felixcloutier.com/x86/PREFETCHW.html) -* [3DNow](https://en.wikipedia.org/?title=3DNow!) -* [CPU caches](https://en.wikipedia.org/wiki/CPU_cache) -* [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) -* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) -* [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) -* [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture) -* [INT isntruction](https://en.wikipedia.org/wiki/INT_%28x86_instruction%29) -* [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html) -* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) -* [opcode](https://en.wikipedia.org/?title=Opcode) -* [control register](https://en.wikipedia.org/wiki/Control_register#CR0) -* [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions) -* [MCE exception](https://en.wikipedia.org/wiki/Machine-check_exception) -* [SIMD](https://en.wikipedia.org/?title=SIMD) -* [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) -* [NX](https://en.wikipedia.org/wiki/NX_bit) -* [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) diff --git a/interrupts/interrupts-5.md b/interrupts/interrupts-5.md deleted file mode 100644 index 2cfe45c..0000000 --- a/interrupts/interrupts-5.md +++ /dev/null @@ -1,493 +0,0 @@ -Interrupts and Interrupt Handling. Part 5. -================================================================================ - -Implementation of exception handlers --------------------------------------------------------------------------------- - -This is the fifth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) we stopped on the setting of interrupt gates to the [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). We did it in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. We saw only setting of these interrupt gates in the previous part and in the current part we will see implementation of the exception handlers for these gates. The preparation before an exception handler will be executed is in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and occurs in the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S#L820) macro that defines exceptions entry points: - -```assembly -idtentry divide_error do_divide_error has_error_code=0 -idtentry overflow do_overflow has_error_code=0 -idtentry invalid_op do_invalid_op has_error_code=0 -idtentry bounds do_bounds has_error_code=0 -idtentry device_not_available do_device_not_available has_error_code=0 -idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0 -idtentry invalid_TSS do_invalid_TSS has_error_code=1 -idtentry segment_not_present do_segment_not_present has_error_code=1 -idtentry spurious_interrupt_bug do_spurious_interrupt_bug has_error_code=0 -idtentry coprocessor_error do_coprocessor_error has_error_code=0 -idtentry alignment_check do_alignment_check has_error_code=1 -idtentry simd_coprocessor_error do_simd_coprocessor_error has_error_code=0 -``` - -The `idtentry` macro does following preparation before an actual exception handler (`do_divide_error` for the `divide_error`, `do_overflow` for the `overflow` and etc.) will get control. In another words the `idtentry` macro allocates place for the registers ([pt_regs](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h#L43) structure) on the stack, pushes dummy error code for the stack consistency if an interrupt/exception has no error code, checks the segment selector in the `cs` segment register and switches depends on the previous state(userspace or kernelspace). After all of these preparations it makes a call of an actual interrupt/exception handler: - -```assembly -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 -ENTRY(\sym) - ... - ... - ... - call \do_sym - ... - ... - ... -END(\sym) -.endm -``` - -After an exception handler will finish its work, the `idtentry` macro restores stack and general purpose registers of an interrupted task and executes [iret](http://x86.renejeschke.de/html/file_module_x86_id_145.html) instruction: - -```assembly -ENTRY(paranoid_exit) - ... - ... - ... - RESTORE_EXTRA_REGS - RESTORE_C_REGS - REMOVE_PT_GPREGS_FROM_STACK 8 - INTERRUPT_RETURN -END(paranoid_exit) -``` - -where `INTERRUPT_RETURN` is: - -```assembly -#define INTERRUPT_RETURN jmp native_iret -... -ENTRY(native_iret) -.global native_irq_return_iret -native_irq_return_iret: -iretq -``` - -More about the `idtentry` macro you can read in the thirt part of the [http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) chapter. Ok, now we saw the preparation before an exception handler will be executed and now time to look on the handlers. First of all let's look on the following handlers: - -* divide_error -* overflow -* invalid_op -* coprocessor_segment_overrun -* invalid_TSS -* segment_not_present -* stack_segment -* alignment_check - -All these handlers defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file with the `DO_ERROR` macro: - -```C -DO_ERROR(X86_TRAP_DE, SIGFPE, "divide error", divide_error) -DO_ERROR(X86_TRAP_OF, SIGSEGV, "overflow", overflow) -DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op) -DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun) -DO_ERROR(X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS) -DO_ERROR(X86_TRAP_NP, SIGBUS, "segment not present", segment_not_present) -DO_ERROR(X86_TRAP_SS, SIGBUS, "stack segment", stack_segment) -DO_ERROR(X86_TRAP_AC, SIGBUS, "alignment check", alignment_check) -``` - -As we can see the `DO_ERROR` macro takes 4 parameters: - -* Vector number of an interrupt; -* Signal number which will be sent to the interrupted process; -* String which describes an exception; -* Exception handler entry point. - -This macro defined in the same souce code file and expands to the function with the `do_handler` name: - -```C -#define DO_ERROR(trapnr, signr, str, name) \ -dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \ -{ \ - do_error_trap(regs, error_code, str, trapnr, signr); \ -} -``` - -Note on the `##` tokens. This is special feature - [GCC macro Concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html#Concatenation) which concatenates two given strings. For example, first `DO_ERROR` in our example will expands to the: - -```C -dotraplinkage void do_divide_error(struct pt_regs *regs, long error_code) \ -{ - ... -} -``` - -We can see that all functions which are generated by the `DO_ERROR` macro just make a call of the `do_error_trap` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). Let's look on implementation of the `do_error_trap` function. - -Trap handlers --------------------------------------------------------------------------------- - -The `do_error_trap` function starts and ends from the two following functions: - -```C -enum ctx_state prev_state = exception_enter(); -... -... -... -exception_exit(prev_state); -``` - -from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking.h). The context tracking in the Linux kernel subsystem which provide kernel boundaries probes to keep track of the transitions between level contexts with two basic initial contexts: `user` or `kernel`. The `exception_enter` function checks that context tracking is enabled. After this if it is enabled, the `exception_enter` reads previous context and compares it with the `CONTEXT_KERNEL`. If the previous context is `user`, we call `context_tracking_exit` function from the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/master/kernel/context_tracking.c) which inform the context tracking subsystem that a processor is exiting user mode and entering the kernel mode: - -```C -if (!context_tracking_is_enabled()) - return 0; - -prev_ctx = this_cpu_read(context_tracking.state); -if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); - -return prev_ctx; -``` - -If previous context is non `user`, we just return it. The `pre_ctx` has `enum ctx_state` type which defined in the [include/linux/context_tracking_state.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking_state.h) and looks as: - -```C -enum ctx_state { - CONTEXT_KERNEL = 0, - CONTEXT_USER, - CONTEXT_GUEST, -} state; -``` - -The second function is `exception_exit` defined in the same [include/linux/context_tracking.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking.h) file and checks that context tracking is enabled and call the `contert_tracking_enter` function if the previous context was `user`: - -```C -static inline void exception_exit(enum ctx_state prev_ctx) -{ - if (context_tracking_is_enabled()) { - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_enter(prev_ctx); - } -} -``` - -The `context_tracking_enter` function informs the context tracking subsystem that a processor is going to enter to the user mode from the kernel mode. We can see the following code between the `exception_enter` and `exception_exit`: - -```C -if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) != - NOTIFY_STOP) { - conditional_sti(regs); - do_trap(trapnr, signr, str, regs, error_code, - fill_trap_info(regs, signr, trapnr, &info)); -} -``` - -First of all it calls the `notify_die` function which defined in the [kernel/notifier.c](https://github.com/torvalds/linux/tree/master/kernel/notifier.c). To get notified for [kernel panic](https://en.wikipedia.org/wiki/Kernel_panic), [kernel oops](https://en.wikipedia.org/wiki/Linux_kernel_oops), [Non-Maskable Interrupt](https://en.wikipedia.org/wiki/Non-maskable_interrupt) or other events the caller needs to insert itself in the `notify_die` chain and the `notify_die` function does it. The Linux kernel has special mechanism that allows kernel to ask when something happens and this mechanism called `notifiers` or `notifier chains`. This mechanism used for example for the `USB` hotplug events (look on the [drivers/usb/core/notify.c](https://github.com/torvalds/linux/tree/master/drivers/usb/core/notify.c)), for the memory [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) (look on the [include/linux/memory.h](https://github.com/torvalds/linux/tree/master/include/linux/memory.h), the `hotplug_memory_notifier` macro and etc...), system reboots and etc. A notifier chain is thus a simple, singly-linked list. When a Linux kernel subsystem wants to be notified of specific events, it fills out a special `notifier_block` structure and passes it to the `notifier_chain_register` function. An event can be sent with the call of the `notifier_call_chain` function. First of all the `notify_die` function fills `die_args` structure with the trap number, trap string, registers and other values: - -```C -struct die_args args = { - .regs = regs, - .str = str, - .err = err, - .trapnr = trap, - .signr = sig, -} -``` - -and returns the result of the `atomic_notifier_call_chain` function with the `die_chain`: - -```C -static ATOMIC_NOTIFIER_HEAD(die_chain); -return atomic_notifier_call_chain(&die_chain, val, &args); -``` - -which just expands to the `atomit_notifier_head` structure that contains lock and `notifier_block`: - -```C -struct atomic_notifier_head { - spinlock_t lock; - struct notifier_block __rcu *head; -}; -``` - -The `atomic_notifier_call_chain` function calls each function in a notifier chain in turn and returns the value of the last notifier function called. If the `notify_die` in the `do_error_trap` does not return `NOTIFY_STOP` we execute `conditional_sti` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) that checks the value of the [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag) and enables interrupt depends on it: - -```C -static inline void conditional_sti(struct pt_regs *regs) -{ - if (regs->flags & X86_EFLAGS_IF) - local_irq_enable(); -} -``` - -more about `local_irq_enable` macro you can read in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter. The next and last call in the `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defined the `tsk` variable which has `trak_struct` type and represents the current interrupted process. After the definition of the `tsk`, we can see the call of the `do_trap_no_signal` function: - -```C -struct task_struct *tsk = current; - -if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code)) - return; -``` - -The `do_trap_no_signal` function makes two checks: - -* Did we come from the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode; -* Did we come from the kernelspace. - -```C -if (v8086_mode(regs)) { - ... -} - -if (!user_mode(regs)) { - ... -} - -return -1; -``` - -We will not consider first case because the [long mode](https://en.wikipedia.org/wiki/Long_mode) does not support the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode. In the second case we invoke `fixup_exception` function which will try to recover a fault and `die` if we can't: - -```C -if (!fixup_exception(regs)) { - tsk->thread.error_code = error_code; - tsk->thread.trap_nr = trapnr; - die(str, regs, error_code); -} -``` - -The `die` function defined in the [arch/x86/kernel/dumpstack.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/dumpstack.c) source code file, prints useful information about stack, registers, kernel modules and caused kernel [oops](https://en.wikipedia.org/wiki/Linux_kernel_oops). If we came from the userspace the `do_trap_no_signal` function will return `-1` and the execution of the `do_trap` function will continue. If we passed through the `do_trap_no_signal` function and did not exit from the `do_trap` after this, it means that previous context was - `user`. Most exceptions caused by the processor are interpreted by Linux as error conditions, for example division by zero, invalid opcode and etc. When an exception occurs the Linux kernel sends a [signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process that caused the exception to notify it of an incorrect condition. So, in the `do_trap` function we need to send a signal with the given number (`SIGFPE` for the divide error, `SIGILL` for the overflow exception and etc...). First of all we save error code and vector number in the current interrupts process with the filling `thread.error_code` and `thread_trap_nr`: - -```C -tsk->thread.error_code = error_code; -tsk->thread.trap_nr = trapnr; -``` - -After this we make a check do we need to print information about unhandled signals for the interrupted process. We check that `show_unhandled_signals` variable is set, that `unhandled_signal` function from the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) will return unhandled signal(s) and [printk](https://en.wikipedia.org/wiki/Printk) rate limit: - -```C -#ifdef CONFIG_X86_64 - if (show_unhandled_signals && unhandled_signal(tsk, signr) && - printk_ratelimit()) { - pr_info("%s[%d] trap %s ip:%lx sp:%lx error:%lx", - tsk->comm, tsk->pid, str, - regs->ip, regs->sp, error_code); - print_vma_addr(" in ", regs->ip); - pr_cont("\n"); - } -#endif -``` - -And send a given signal to interrupted process: - -```C -force_sig_info(signr, info ?: SEND_SIG_PRIV, tsk); -``` - -This is the end of the `do_trap`. We just saw generic implementation for eight different exceptions which are defined with the `DO_ERROR` macro. Now let's look on another exception handlers. - -Double fault --------------------------------------------------------------------------------- - -The next exception is `#DF` or `Double fault`. This exception occurrs when the processor detected a second exception while calling an exception handler for a prior exception. We set the trap gate for this exception in the previous part: - -```C -set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK); -``` - -Note that this exception runs on the `DOUBLEFAULT_STACK` [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) which has index - `1`: - -```C -#define DOUBLEFAULT_STACK 1 -``` - -The `double_fault` is handler for this exception and defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). The `double_fault` handler starts from the definition of two variables: string that describes excetpion and interrupted process, as other exception handlers: - -```C -static const char str[] = "double fault"; -struct task_struct *tsk = current; -``` - -The handler of the double fault exception splitted on two parts. The first part is the check which checks that a fault is a `non-IST` fault on the `espfix64` stack. Actually the `iret` instruction restores only the bottom `16` bits when returning to a `16` bit segment. The `espfix` feature solves this problem. So if the `non-IST` fault on the espfix64 stack we modify the stack to make it look like `General Protection Fault`: - -```C -struct pt_regs *normal_regs = task_pt_regs(current); - -memmove(&normal_regs->ip, (void *)regs->sp, 5*8); -ormal_regs->orig_ax = 0; -regs->ip = (unsigned long)general_protection; -regs->sp = (unsigned long)&normal_regs->orig_ax; -return; -``` - -In the second case we do almost the same that we did in the previous excetpion handlers. The first is the call of the `ist_enter` function that discards previous context, `user` in our case: - -```C -ist_enter(regs); -``` - -And after this we fill the interrupted process with the vector number of the `Double fault` excetpion and error code as we did it in the previous handlers: - -```C -tsk->thread.error_code = error_code; -tsk->thread.trap_nr = X86_TRAP_DF; -``` - -Next we print useful information about the double fault ([PID](https://en.wikipedia.org/wiki/Process_identifier) number, registers content): - -```C -#ifdef CONFIG_DOUBLEFAULT - df_debug(regs, error_code); -#endif -``` - -And die: - -``` - for (;;) - die(str, regs, error_code); -``` - -That's all. - -Device not available exception handler --------------------------------------------------------------------------------- - -The next exception is the `#NM` or `Device not available`. The `Device not available` exception can occur depending on these things: - -* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87) floating-point instruction while the EM flag in [control register](https://en.wikipedia.org/wiki/Control_register) `cr0` was set; -* The processor executed a `wait` or `fwait` instruction while the `MP` and `TS` flags of register `cr0` were set; -* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87), [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29) or [SSE](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) instruction while the `TS` falg in control register `cr0` was set and the `EM` flag is clear. - -The handler of the `Device not available` exception is the `do_device_not_available` function and it defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file too. It starts and ends from the getting of the previous context, as other traps which we saw in the beginning of this part: - -```C -enum ctx_state prev_state; -prev_state = exception_enter(); -... -... -... -exception_exit(prev_state); -``` - -In the next step we check that `FPU` is not eager: - -```C -BUG_ON(use_eager_fpu()); -``` - -When we switch into a task or interrupt we may avoid loading the `FPU` state. If a task will use it, we catch `Device not Available exception` exception. If we loading the `FPU` state during task switching, the `FPU` is eager. In the next step we check `cr0` control register on the `EM` flag which can show us is `x87` floating point unit present (flag clear) or not (flag set): - -```C -#ifdef CONFIG_MATH_EMULATION - if (read_cr0() & X86_CR0_EM) { - struct math_emu_info info = { }; - - conditional_sti(regs); - - info.regs = regs; - math_emulate(&info); - exception_exit(prev_state); - return; - } -#endif -``` - -If the `x87` floating point unit not presented, we enable interrupts with the `conditional_sti`, fill the `math_emu_info` (defined in the [arch/x86/include/asm/math_emu.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/math_emu.h)) structure with the registers of an interrupt task and call `math_emulate` function from the [arch/x86/math-emu/fpu_entry.c](https://github.com/torvalds/linux/tree/master/arch/x86/math-emu/fpu_entry.c). As you can understand from function's name, it emulates `X87 FPU` unit (more about the `x87` we will know in the special chapter). In other way, if `X86_CR0_EM` flag is clear which means that `x87 FPU` unit is presented, we call the `fpu__restore` function from the [arch/x86/kernel/fpu/core.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/fpu/core.c) which copies the `FPU` registers from the `fpustate` to the live hardware registers. After this `FPU` instructions can be used: - -```C -fpu__restore(¤t->thread.fpu); -``` - -General protection fault exception handler --------------------------------------------------------------------------------- - -The next exception is the `#GP` or `General protection fault`. This exception occurs when the processor detected one of a class of protection violations called `general-protection violations`. It can be: - -* Exceeding the segment limit when accessing the `cs`, `ds`, `es`, `fs` or `gs` segments; -* Loading the `ss`, `ds`, `es`, `fs` or `gs` register with a segment selector for a system segment.; -* Violating any of the privilege rules; -* and other... - -The exception handler for this exception is the `do_general_protection` from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). The `do_general_protection` function starts and ends as other exception handlers from the getting of the previous context: - -```C -prev_state = exception_enter(); -... -exception_exit(prev_state); -``` - -After this we enable interrupts if they were disabled and check that we came from the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode: - -```C -conditional_sti(regs); - -if (v8086_mode(regs)) { - local_irq_enable(); - handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code); - goto exit; -} -``` - -As long mode does not support this mode, we will not consider exception handling for this case. In the next step check that previous mode was kernel mode and try to fix the trap. If we can't fix the current general protection fault exception we fill the interrupted process with the vector number and error code of the exception and add it to the `notify_die` chain: - -```C -if (!user_mode(regs)) { - if (fixup_exception(regs)) - goto exit; - - tsk->thread.error_code = error_code; - tsk->thread.trap_nr = X86_TRAP_GP; - if (notify_die(DIE_GPF, "general protection fault", regs, error_code, - X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP) - die("general protection fault", regs, error_code); - goto exit; -} -``` - -If we can fix exception we go to the `exit` label which exits from exception state: - -```C -exit: - exception_exit(prev_state); -``` - -If we came from user mode we send `SIGSEGV` signal to the interrupted process from user mode as we did it in the `do_trap` function: - -```C -if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) && - printk_ratelimit()) { - pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx", - tsk->comm, task_pid_nr(tsk), - regs->ip, regs->sp, error_code); - print_vma_addr(" in ", regs->ip); - pr_cont("\n"); -} - -force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk); -``` - -That's all. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the fifth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we saw implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions and many many more. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) -* [iret instruction](http://x86.renejeschke.de/html/file_module_x86_id_145.html) -* [GCC macro Concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html#Concatenation) -* [kernel panic](https://en.wikipedia.org/wiki/Kernel_panic) -* [kernel oops](https://en.wikipedia.org/wiki/Linux_kernel_oops) -* [Non-Maskable Interrupt](https://en.wikipedia.org/wiki/Non-maskable_interrupt) -* [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) -* [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag) -* [long mode](https://en.wikipedia.org/wiki/Long_mode) -* [signal](https://en.wikipedia.org/wiki/Unix_signal) -* [printk](https://en.wikipedia.org/wiki/Printk) -* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) -* [SIMD](https://en.wikipedia.org/wiki/SIMD) -* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) -* [PID](https://en.wikipedia.org/wiki/Process_identifier) -* [x87 FPU](https://en.wikipedia.org/wiki/X87) -* [control register](https://en.wikipedia.org/wiki/Control_register) -* [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) diff --git a/interrupts/interrupts-6.md b/interrupts/interrupts-6.md deleted file mode 100644 index ea9abac..0000000 --- a/interrupts/interrupts-6.md +++ /dev/null @@ -1,480 +0,0 @@ -Interrupts and Interrupt Handling. Part 6. -================================================================================ - -Non-maskable interrupt handler --------------------------------------------------------------------------------- - -It is sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html) we saw implementation of some exception handlers for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, divide exception, invalid [opcode](https://en.wikipedia.org/wiki/Opcode) exceptions and etc. As I wrote in the previous part we will see implementations of the rest exceptions in this part. We will see implementation of the following handlers: - -* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt; -* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) Range Exceeded Exception; -* [Coprocessor](https://en.wikipedia.org/wiki/Coprocessor) exception; -* [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exception. - -in this part. So, let's start. - -Non-Maskable interrupt handling --------------------------------------------------------------------------------- - -A [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt is a hardware interrupt that cannot be ignore by standard masking techniques. In a general way, a non-maskable interrupt can be generated in either of two ways: - -* External hardware asserts the non-maskable interrupt [pin](https://en.wikipedia.org/wiki/CPU_socket) on the CPU. -* The processor receives a message on the system bus or the APIC serial bus with a delivery mode `NMI`. - -When the processor receives a `NMI` from one of these sources, the processor handles it immediately by calling the `NMI` handler pointed to by interrupt vector which has number `2` (see table in the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html)). We already filled the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the [vector number](https://en.wikipedia.org/wiki/Interrupt_vector_table), address of the `nmi` interrupt handler and `NMI_STACK` [Interrupt Stack Table entry](https://github.com/torvalds/linux/blob/master/Documentation/x86/kernel-stacks): - -```C -set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK); -``` - -in the `trap_init` function which defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file. In the previous [parts](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw that entry points of the all interrupt handlers are defined with the: - -```assembly -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 -ENTRY(\sym) -... -... -... -END(\sym) -.endm -``` - -macro from the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file. But the handler of the `Non-Maskable` interrupts is not defined with this macro. It has own entry point: - -```assembly -ENTRY(nmi) -... -... -... -END(nmi) -``` - -in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file. Lets dive into it and will try to understand how `Non-Maskable` interrupt handler works. The `nmi` handlers starts from the call of the: - -```assembly -PARAVIRT_ADJUST_EXCEPTION_FRAME -``` - -macro but we will not dive into details about it in this part, because this macro related to the [Paravirtualization](https://en.wikipedia.org/wiki/Paravirtualization) stuff which we will see in another chapter. After this save the content of the `rdx` register on the stack: - -```assembly -pushq %rdx -``` - -And allocated check that `cs` was not the kernel segment when an non-maskable interrupt occurs: - -```assembly -cmpl $__KERNEL_CS, 16(%rsp) -jne first_nmi -``` - -The `__KERNEL_CS` macro defined in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) and represented second descriptor in the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table): - -```C -#define GDT_ENTRY_KERNEL_CS 2 -#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8) -``` - -more about `GDT` you can read in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the Linux kernel booting process chapter. If `cs` is not kernel segment, it means that it is not nested `NMI` and we jump on the `first_nmi` label. Let's consider this case. First of all we put address of the current stack pointer to the `rdx` and pushes `1` to the stack in the `first_nmi` label: - -```assembly -first_nmi: - movq (%rsp), %rdx - pushq $1 -``` - -Why do we push `1` on the stack? As the comment says: `We allow breakpoints in NMIs`. On the [x86_64](https://en.wikipedia.org/wiki/X86-64), like other architectures, the CPU will not execute another `NMI` until the first `NMI` is complete. A `NMI` interrupt finished with the [iret](http://faydoc.tripod.com/cpu/iret.htm) instruction like other interrupts and exceptions do it. If the `NMI` handler triggers either a [page fault](https://en.wikipedia.org/wiki/Page_fault) or [breakpoint](https://en.wikipedia.org/wiki/Breakpoint) or another exception which are use `iret` instruction too. If this happens while in `NMI` context, the CPU will leave `NMI` context and a new `NMI` may come in. The `iret` used to return from those exceptions will re-enable `NMIs` and we will get nested non-maskable interrupts. The problem the `NMI` handler will not return to the state that it was, when the exception triggered, but instead it will return to a state that will allow new `NMIs` to preempt the running `NMI` handler. If another `NMI` comes in before the first NMI handler is complete, the new NMI will write all over the preempted `NMIs` stack. We can have nested `NMIs` where the next `NMI` is using the top of the stack of the previous `NMI`. It means that we cannot execute it because a nested non-maskable interrupt will corrupt stack of a previous non-maskable interrupt. That's why we have allocated space on the stack for temporary variable. We will check this variable that it was set when a previous `NMI` is executing and clear if it is not nested `NMI`. We push `1` here to the previously allocated space on the stack to denote that a `non-maskable` interrupt executed currently. Remember that when and `NMI` or another exception occurs we have the following [stack frame](https://en.wikipedia.org/wiki/Call_stack): - -``` -+------------------------+ -| SS | -| RSP | -| RFLAGS | -| CS | -| RIP | -+------------------------+ -``` - -and also an error code if an exception has it. So, after all of these manipulations our stack frame will look like this: - -``` -+------------------------+ -| SS | -| RSP | -| RFLAGS | -| CS | -| RIP | -| RDX | -| 1 | -+------------------------+ -``` - -In the next step we allocate yet another `40` bytes on the stack: - -```assembly -subq $(5*8), %rsp -``` - -and pushes the copy of the original stack frame after the allocated space: - -```C -.rept 5 -pushq 11*8(%rsp) -.endr -``` - -with the [.rept](http://tigcc.ticalc.org/doc/gnuasm.html#SEC116) assembly directive. We need in the copy of the original stack frame. Generally we need in two copies of the interrupt stack. First is `copied` interrupts stack: `saved` stack frame and `copied` stack frame. Now we pushes original stack frame to the `saved` stack frame which locates after the just allocated `40` bytes (`copied` stack frame). This stack frame is used to fixup the `copied` stack frame that a nested NMI may change. The second - `copied` stack frame modified by any nested `NMIs` to let the first `NMI` know that we triggered a second `NMI` and we should repeat the first `NMI` handler. Ok, we have made first copy of the original stack frame, now time to make second copy: - -```assembly -addq $(10*8), %rsp - -.rept 5 -pushq -6*8(%rsp) -.endr -subq $(5*8), %rsp -``` - -After all of these manipulations our stack frame will be like this: - -``` -+-------------------------+ -| original SS | -| original Return RSP | -| original RFLAGS | -| original CS | -| original RIP | -+-------------------------+ -| temp storage for rdx | -+-------------------------+ -| NMI executing variable | -+-------------------------+ -| copied SS | -| copied Return RSP | -| copied RFLAGS | -| copied CS | -| copied RIP | -+-------------------------+ -| Saved SS | -| Saved Return RSP | -| Saved RFLAGS | -| Saved CS | -| Saved RIP | -+-------------------------+ -``` - -After this we push dummy error code on the stack as we did it already in the previous exception handlers and allocate space for the general purpose registers on the stack: - -```assembly -pushq $-1 -ALLOC_PT_GPREGS_ON_STACK -``` - -We already saw implementation of the `ALLOC_PT_GREGS_ON_STACK` macro in the third part of the interrupts [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html). This macro defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) and yet another allocates `120` bytes on stack for the general purpose registers, from the `rdi` to the `r15`: - -```assembly -.macro ALLOC_PT_GPREGS_ON_STACK addskip=0 -addq $-(15*8+\addskip), %rsp -.endm -``` - -After space allocation for the general registers we can see call of the `paranoid_entry`: - -```assembly -call paranoid_entry -``` - -We can remember from the previous parts this label. It pushes general purpose registers on the stack, reads `MSR_GS_BASE` [Model Specific register](https://en.wikipedia.org/wiki/Model-specific_register) and checks its value. If the value of the `MSR_GS_BASE` is negative, we came from the kernel mode and just return from the `paranoid_entry`, in other way it means that we came from the usermode and need to execute `swapgs` instruction which will change user `gs` with the kernel `gs`: - -```assembly -ENTRY(paranoid_entry) - cld - SAVE_C_REGS 8 - SAVE_EXTRA_REGS 8 - movl $1, %ebx - movl $MSR_GS_BASE, %ecx - rdmsr - testl %edx, %edx - js 1f - SWAPGS - xorl %ebx, %ebx -1: ret -END(paranoid_entry) -``` - -Note that after the `swapgs` instruction we zeroed the `ebx` register. Next time we will check content of this register and if we executed `swapgs` than `ebx` must contain `0` and `1` in other way. In the next step we store value of the `cr2` [control register](https://en.wikipedia.org/wiki/Control_register) to the `r12` register, because the `NMI` handler can cause `page fault` and corrupt the value of this control register: - -```C -movq %cr2, %r12 -``` - -Now time to call actual `NMI` handler. We push the address of the `pt_regs` to the `rdi`, error code to the `rsi` and call the `do_nmi` handler: - -```assembly -movq %rsp, %rdi -movq $-1, %rsi -call do_nmi -``` - -We will back to the `do_nmi` little later in this part, but now let's look what occurs after the `do_nmi` will finish its execution. After the `do_nmi` handler will be finished we check the `cr2` register, because we can got page fault during `do_nmi` performed and if we got it we restore original `cr2`, in other way we jump on the label `1`. After this we test content of the `ebx` register (remember it must contain `0` if we have used `swapgs` instruction and `1` if we didn't use it) and execute `SWAPGS_UNSAFE_STACK` if it contains `1` or jump to the `nmi_restore` label. The `SWAPGS_UNSAFE_STACK` macro just expands to the `swapgs` instruction. In the `nmi_restore` label we restore general purpose registers, clear allocated space on the stack for this registers clear our temporary variable and exit from the interrupt handler with the `INTERRUPT_RETURN` macro: - -```assembly - movq %cr2, %rcx - cmpq %rcx, %r12 - je 1f - movq %r12, %cr2 -1: - testl %ebx, %ebx - jnz nmi_restore -nmi_swapgs: - SWAPGS_UNSAFE_STACK -nmi_restore: - RESTORE_EXTRA_REGS - RESTORE_C_REGS - /* Pop the extra iret frame at once */ - REMOVE_PT_GPREGS_FROM_STACK 6*8 - /* Clear the NMI executing stack variable */ - movq $0, 5*8(%rsp) - INTERRUPT_RETURN -``` - -where `INTERRUPT_RETURN` is defined in the [arch/x86/include/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/irqflags.h) and just expands to the `iret` instruction. That's all. - -Now let's consider case when another `NMI` interrupt occurred when previous `NMI` interrupt didn't finish its execution. You can remember from the beginning of this part that we've made a check that we came from userspace and jump on the `first_nmi` in this case: - -```assembly -cmpl $__KERNEL_CS, 16(%rsp) -jne first_nmi -``` - -Note that in this case it is first `NMI` every time, because if the first `NMI` catched page fault, breakpoint or another exception it will be executed in the kernel mode. If we didn't come from userspace, first of all we test our temporary variable: - -```assembly -cmpl $1, -8(%rsp) -je nested_nmi -``` - -and if it is set to `1` we jump to the `nested_nmi` label. If it is not `1`, we test the `IST` stack. In the case of nested `NMIs` we check that we are above the `repeat_nmi`. In this case we ignore it, in other way we check that we above than `end_repeat_nmi` and jump on the `nested_nmi_out` label. - -Now let's look on the `do_nmi` exception handler. This function defined in the [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c) source code file and takes two parameters: - -* address of the `pt_regs`; -* error code. - -as all exception handlers. The `do_nmi` starts from the call of the `nmi_nesting_preprocess` function and ends with the call of the `nmi_nesting_postprocess`. The `nmi_nesting_preprocess` function checks that we likely do not work with the debug stack and if we on the debug stack set the `update_debug_stack` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `1` and call the `debug_stack_set_zero` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c). This function increases the `debug_stack_use_ctr` per-cpu variable and loads new `Interrupt Descriptor Table`: - -```C -static inline void nmi_nesting_preprocess(struct pt_regs *regs) -{ - if (unlikely(is_debug_stack(regs->sp))) { - debug_stack_set_zero(); - this_cpu_write(update_debug_stack, 1); - } -} -``` - -The `nmi_nesting_postprocess` function checks the `update_debug_stack` per-cpu variable which we set in the `nmi_nesting_preprocess` and resets debug stack or in another words it loads origin `Interrupt Descriptor Table`. After the call of the `nmi_nesting_preprocess` function, we can see the call of the `nmi_enter` in the `do_nmi`. The `nmi_enter` increases `lockdep_recursion` field of the interrupted process, update preempt counter and informs the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) subsystem about `NMI`. There is also `nmi_exit` function that does the same stuff as `nmi_enter`, but vice-versa. After the `nmi_enter` we increase `__nmi_count` in the `irq_stat` structure and call the `default_do_nmi` function. First of all in the `default_do_nmi` we check the address of the previous nmi and update address of the last nmi to the actual: - -```C -if (regs->ip == __this_cpu_read(last_nmi_rip)) - b2b = true; -else - __this_cpu_write(swallow_nmi, false); - -__this_cpu_write(last_nmi_rip, regs->ip); -``` - -After this first of all we need to handle CPU-specific `NMIs`: - -```C -handled = nmi_handle(NMI_LOCAL, regs, b2b); -__this_cpu_add(nmi_stats.normal, handled); -``` - -And than non-specific `NMIs` depends on its reason: - -```C -reason = x86_platform.get_nmi_reason(); -if (reason & NMI_REASON_MASK) { - if (reason & NMI_REASON_SERR) - pci_serr_error(reason, regs); - else if (reason & NMI_REASON_IOCHK) - io_check_error(reason, regs); - - __this_cpu_add(nmi_stats.external, 1); - return; -} -``` - -That's all. - -Range Exceeded Exception --------------------------------------------------------------------------------- - -The next exception is the `BOUND` range exceeded exception. The `BOUND` instruction determines if the first operand (array index) is within the bounds of an array specified the second operand (bounds operand). If the index is not within bounds, a `BOUND` range exceeded exception or `#BR` is occurred. The handler of the `#BR` exception is the `do_bounds` function that defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). The `do_bounds` handler starts with the call of the `exception_enter` function and ends with the call of the `exception_exit`: - -```C -prev_state = exception_enter(); - -if (notify_die(DIE_TRAP, "bounds", regs, error_code, - X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP) - goto exit; -... -... -... -exception_exit(prev_state); -return; -``` - -After we have got the state of the previous context, we add the exception to the `notify_die` chain and if it will return `NOTIFY_STOP` we return from the exception. More about notify chains and the `context tracking` functions you can read in the [previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html). In the next step we enable interrupts if they were disabled with the `contidional_sti` function that checks `IF` flag and call the `local_irq_enable` depends on its value: - -```C -conditional_sti(regs); - -if (!user_mode(regs)) - die("bounds", regs, error_code); -``` - -and check that if we didn't came from user mode we send `SIGSEGV` signal with the `die` function. After this we check is [MPX](https://en.wikipedia.org/wiki/Intel_MPX) enabled or not, and if this feature is disabled we jump on the `exit_trap` label: - -```C -if (!cpu_feature_enabled(X86_FEATURE_MPX)) { - goto exit_trap; -} - -where we execute `do_trap` function (more about it you can find in the previous part): - -```C -exit_trap: - do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL); - exception_exit(prev_state); -``` - -If `MPX` feature is enabled we check the `BNDSTATUS` with the `get_xsave_field_ptr` function and if it is zero, it means that the `MPX` was not responsible for this exception: - -```C -bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR); -if (!bndcsr) - goto exit_trap; -``` - -After all of this, there is still only one way when `MPX` is responsible for this exception. We will not dive into the details about Intel Memory Protection Extensions in this part, but will see it in another chapter. - -Coprocessor exception and SIMD exception --------------------------------------------------------------------------------- - -The next two exceptions are [x87 FPU](https://en.wikipedia.org/wiki/X87) Floating-Point Error exception or `#MF` and [SIMD](https://en.wikipedia.org/wiki/SIMD) Floating-Point Exception or `#XF`. The first exception occurs when the `x87 FPU` has detected floating point error. For example divide by zero, numeric overflow and etc. The second exception occurs when the processor has detected [SSE/SSE2/SSE3](https://en.wikipedia.org/wiki/SSE3) `SIMD` floating-point exception. It can be the same as for the `x87 FPU`. The handlers for these exceptions are `do_coprocessor_error` and `do_simd_coprocessor_error` are defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and very similar on each other. They both make a call of the `math_error` function from the same source code file but pass different vector number. The `do_coprocessor_error` passes `X86_TRAP_MF` vector number to the `math_error`: - -```C -dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code) -{ - enum ctx_state prev_state; - - prev_state = exception_enter(); - math_error(regs, error_code, X86_TRAP_MF); - exception_exit(prev_state); -} -``` - -and `do_simd_coprocessor_error` passes `X86_TRAP_XF` to the `math_error` function: - -```C -dotraplinkage void -do_simd_coprocessor_error(struct pt_regs *regs, long error_code) -{ - enum ctx_state prev_state; - - prev_state = exception_enter(); - math_error(regs, error_code, X86_TRAP_XF); - exception_exit(prev_state); -} -``` - -First of all the `math_error` function defines current interrupted task, address of its fpu, string which describes an exception, add it to the `notify_die` chain and return from the exception handler if it will return `NOTIFY_STOP`: - -```C - struct task_struct *task = current; - struct fpu *fpu = &task->thread.fpu; - siginfo_t info; - char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" : - "simd exception"; - - if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP) - return; -``` - -After this we check that we are from the kernel mode and if yes we will try to fix an excetpion with the `fixup_exception` function. If we cannot we fill the task with the exception's error code and vector number and die: - -```C -if (!user_mode(regs)) { - if (!fixup_exception(regs)) { - task->thread.error_code = error_code; - task->thread.trap_nr = trapnr; - die(str, regs, error_code); - } - return; -} -``` - -If we came from the user mode, we save the `fpu` state, fill the task structure with the vector number of an exception and `siginfo_t` with the number of signal, `errno`, the address where exception occurred and signal code: - -```C -fpu__save(fpu); - -task->thread.trap_nr = trapnr; -task->thread.error_code = error_code; -info.si_signo = SIGFPE; -info.si_errno = 0; -info.si_addr = (void __user *)uprobe_get_trap_addr(regs); -info.si_code = fpu__exception_code(fpu, trapnr); -``` - -After this we check the signal code and if it is non-zero we return: - -```C -if (!info.si_code) - return; -``` - -Or send the `SIGFPE` signal in the end: - -```C -force_sig_info(SIGFPE, &info, task); -``` - -That's all. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the sixth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we saw implementation of some exception handlers in this part, like `non-maskable` interrupt, [SIMD](https://en.wikipedia.org/wiki/SIMD) and [x87 FPU](https://en.wikipedia.org/wiki/X87) floating point exception. Finally we have finsihed with the `trap_init` function in this part and will go ahead in the next part. The next our point is the external interrupts and the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) -* [opcode](https://en.wikipedia.org/wiki/Opcode) -* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) -* [BOUND instruction](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) -* [CPU socket](https://en.wikipedia.org/wiki/CPU_socket) -* [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) -* [Interrupt Stack Table](https://github.com/torvalds/linux/blob/master/Documentation/x86/kernel-stacks) -* [Paravirtualization](https://en.wikipedia.org/wiki/Paravirtualization) -* [.rept](http://tigcc.ticalc.org/doc/gnuasm.html#SEC116) -* [SIMD](https://en.wikipedia.org/wiki/SIMD) -* [Coprocessor](https://en.wikipedia.org/wiki/Coprocessor) -* [x86_64](https://en.wikipedia.org/wiki/X86-64) -* [iret](http://faydoc.tripod.com/cpu/iret.htm) -* [page fault](https://en.wikipedia.org/wiki/Page_fault) -* [breakpoint](https://en.wikipedia.org/wiki/Breakpoint) -* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) -* [stack frame](https://en.wikipedia.org/wiki/Call_stack) -* [Model Specific regiser](https://en.wikipedia.org/wiki/Model-specific_register) -* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) -* [RCU](https://en.wikipedia.org/wiki/Read-copy-update) -* [MPX](https://en.wikipedia.org/wiki/Intel_MPX) -* [x87 FPU](https://en.wikipedia.org/wiki/X87) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html) diff --git a/interrupts/interrupts-7.md b/interrupts/interrupts-7.md deleted file mode 100644 index 43194b1..0000000 --- a/interrupts/interrupts-7.md +++ /dev/null @@ -1,461 +0,0 @@ -Interrupts and Interrupt Handling. Part 7. -================================================================================ - -Introduction to external interrupts --------------------------------------------------------------------------------- - -This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-6.html) we have finished with the exceptions which are generated by the processor. In this part we will continue to dive to the interrupt handling and will start with the external handware interrupt handling. As you can remember, in the previous part we have finsihed with the `trap_init` function from the [arch/x86/kernel/trap.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and the next step is the call of the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). - -Interrupts are signal that are sent across [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) or `Interrupt Request Line` by a hardware or software. External hardware interrupts allow devices like keyboard, mouse and etc, to indicate that it needs attention of the processor. Once the processor receives the `Interrupt Request`, it will temporary stop execution of the running program and invoke special routine which depends on an interrupt. We already know that this routine is called interrupt handler (or how we will call it `ISR` or `Interrupt Service Routine` from this part). The `ISR` or `Interrupt Handler Routine` can be found in Interrupt Vector table that is located at fixed address in the memory. After the interrupt is handled processor resumes the interrupted process. At the boot/initialization time, the Linux kernel identifies all devices in the machine, and appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by the sending a [Unix signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process. That's why kernel is can handle an exception quickly. Unfortunatelly we can not use this approach for the external handware interrupts, because often they arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of an interrupt: - -* `I/O` interrupts; -* Timer interrupts; -* Interprocessor interrupts. - -I will try to describe all types of interrupts in this book. - -Generally, a handler of an `I/O` interrupt must be flexible enough to service several devices at the same time. For exmaple in the [PCI](https://en.wikipedia.org/wiki/Conventional_PCI) bus architecture several devices may share the same `IRQ` line. In the simplest way the Linux kernel must do following thing when an `I/O` interrupt occured: - -* Save the value of an `IRQ` and the register's contents on the kernel stack; -* Send an acknowledgment to the hardware controller which is servicing the `IRQ` line; -* Execute the interrupt service routine (next we will call it `ISR`) which is associated with the device; -* Restore registers and return from an interrupt; - -Ok, we know a little theory and now let's start with the `early_irq_init` function. The implementation of the `early_irq_init` function is in the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function make early initialziation of the `irq_desc` structure. The `irq_desc` structure is the foundation of interrupt management code in the Linux kernel. An array of this structure, which has the same name - `irq_desc`, keeps track of every interrupt request source in the Linux kernel. This structure defined in the [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h) and as you can note it depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. This kernel configuration option enables support for sparse irqs. The `irq_desc` structure contains many different fiels: - -* `irq_common_data` - per irq and chip data passed down to chip functions; -* `status_use_accessors` - contains status of the interrupt source which is can be combination of of the values from the `enum` from the [include/linux/irq.h](https://github.com/torvalds/linux/blob/master/include/linux/irq.h) and different macros which are defined in the same source code file; -* `kstat_irqs` - irq stats per-cpu; -* `handle_irq` - highlevel irq-events handler; -* `action` - identifies the interrupt service routines to be invoked when the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) occurs; -* `irq_count` - counter of interrupt occurrences on the IRQ line; -* `depth` - `0` if the IRQ line is enabled and a positive value if it has been disabled at least once; -* `last_unhandled` - aging timer for unhandled count; -* `irqs_unhandled` - count of the unhandled interrupts; -* `lock` - a spin lock used to serialize the accesses to the `IRQ` descriptor; -* `pending_mask` - pending rebalanced interrupts; -* `owner` - an owner of interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is need to proved refcount on the module which provides the interrupts; -* and etc. - -Of course it is not all fields of the `irq_desc` structure, because it is too long to describe each field of this structure, but we will see it all soon. Now let's start to dive into the implementation of the `early_irq_init` function. - -Early external interrupts initialization --------------------------------------------------------------------------------- - -Now, let's look on the implementation of the `early_irq_init` function. Note that implementation of the `early_irq_init` function depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Now we consider implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` kernel configuration option is not set. This function starts from the declaration of the following variables: `irq` descriptors counter, loop counter, memory node and the `irq_desc` descriptor: - -```C -int __init early_irq_init(void) -{ - int count, i, node = first_online_node; - struct irq_desc *desc; - ... - ... - ... -} -``` - -The `node` is an online [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) node which depends on the `MAX_NUMNODES` value which depends on the `CONFIG_NODES_SHIFT` kernel configuration parameter: - -```C -#define MAX_NUMNODES (1 << NODES_SHIFT) -... -... -... -#ifdef CONFIG_NODES_SHIFT - #define NODES_SHIFT CONFIG_NODES_SHIFT -#else - #define NODES_SHIFT 0 -#endif -``` - -As I already wrote, implementation of the `first_online_node` macro depends on the `MAX_NUMNODES` value: - -```C -#if MAX_NUMNODES > 1 - #define first_online_node first_node(node_states[N_ONLINE]) -#else - #define first_online_node 0 -``` - -The `node_states` is the [enum](https://en.wikipedia.org/wiki/Enumerated_type) which defined in the [include/linux/nodemask.h](https://github.com/torvalds/linux/blob/master/include/linux/nodemask.h) and represent the set of the states of a node. In our case we are searching an online node and it will be `0` if `MAX_NUMNODES` is one or zero. If the `MAX_NUMNODES` is greater than one, the `node_states[N_ONLINE]` will return `1` and the `first_node` macro will be expands to the call of the `__first_node` function which will return `minimal` or the first online node: - -```C -#define first_node(src) __first_node(&(src)) - -static inline int __first_node(const nodemask_t *srcp) -{ - return min_t(int, MAX_NUMNODES, find_first_bit(srcp->bits, MAX_NUMNODES)); -} -``` - -More about this will be in the another chapter about the `NUMA`. The next step after the declaration of these local variables is the call of the: - -```C -init_irq_default_affinity(); -``` - -function. The `init_irq_default_affinity` function defined in the same source code file and depends on the `CONFIG_SMP` kernel configuration option allocates a given [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) structure (in our case it is the `irq_default_affinity`): - -```C -#if defined(CONFIG_SMP) -cpumask_var_t irq_default_affinity; - -static void __init init_irq_default_affinity(void) -{ - alloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT); - cpumask_setall(irq_default_affinity); -} -#else -static void __init init_irq_default_affinity(void) -{ -} -#endif -``` - -We know that when a hardware, such as disk controller or keyboard, needs attention from the processor, it throws an interrupt. The interrupt tells to the processor that something has happened and that the processor should interrupt current process and handle an incoming event. In order to prevent mutliple devices from sending the same interrupts, the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) system was established where each device in a computer system is assigned its own special IRQ so that its interrupts are unique. Linux kernel can assign certain `IRQs` to specific processors. This is known as `SMP IRQ affinity`, and it allows you control how your system will respond to various hardware events (that's why it has certain implementation only if the `CONFIG_SMP` kernel configuration option is set). After we allocated `irq_default_affinity` cpumask, we can see `printk` output: - -```C -printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS); -``` - -which prints `NR_IRQS`: - -```C -~$ dmesg | grep NR_IRQS -[ 0.000000] NR_IRQS:4352 -``` - -The `NR_IRQS` is the maximum number of the `irq` descriptors or in another words maximum number of interrupts. Its value depends on the state of the `COFNIG_X86_IO_APIC` kernel configuration option. If the `CONFIG_X86_IO_APIC` is not set and the Linux kernel uses an old [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) chip, the `NR_IRQS` is: - -```C -#define NR_IRQS_LEGACY 16 - -#ifdef CONFIG_X86_IO_APIC -... -... -... -#else -# define NR_IRQS NR_IRQS_LEGACY -#endif -``` - -In other way, when the `CONFIG_X86_IO_APIC` kernel configuration option is set, the `NR_IRQS` depends on the amount of the processors and amount of the interrupt vectors: - -```C -#define CPU_VECTOR_LIMIT (64 * NR_CPUS) -#define NR_VECTORS 256 -#define IO_APIC_VECTOR_LIMIT ( 32 * MAX_IO_APICS ) -#define MAX_IO_APICS 128 - -# define NR_IRQS \ - (CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ? \ - (NR_VECTORS + CPU_VECTOR_LIMIT) : \ - (NR_VECTORS + IO_APIC_VECTOR_LIMIT)) -... -... -... -``` - -We remember from the previous parts, that the amount of processors we can set during Linux kernel configuration process with the `CONFIG_NR_CPUS` configuration option: - -![kernel](http://oi60.tinypic.com/1zdm1dt.jpg) - -In the first case (`CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT`), the `NR_IRQS` will be `4352`, in the second case (`CPU_VECTOR_LIMIT < IO_APIC_VECTOR_LIMIT`), the `NR_IRQS` will be `768`. In my case the `NR_CPUS` is `8` as you can see in the my configuration, the `CPU_VECTOR_LIMIT` is `512` and the `IO_APIC_VECTOR_LIMIT` is `4096`. So `NR_IRQS` for my configuration is `4352`: - -``` -~$ dmesg | grep NR_IRQS -[ 0.000000] NR_IRQS:4352 -``` - -In the next step we assign array of the IRQ descriptors to the `irq_desc` variable which we defined in the start of the `early_irq_init` function and cacluate count of the `irq_desc` array with the `ARRAY_SIZE` macro: - -```C -desc = irq_desc; -count = ARRAY_SIZE(irq_desc); -``` - -The `irq_desc` array defined in the same source code file and looks like: - -```C -struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = { - [0 ... NR_IRQS-1] = { - .handle_irq = handle_bad_irq, - .depth = 1, - .lock = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock), - } -}; -``` - -The `irq_desc` is array of the `irq` descriptors. It has three already initialized fields: - -* `handle_irq` - as I already wrote above, this field is the highlevel irq-event handler. In our case it initialized with the `handle_bad_irq` function that defined in the [kernel/irq/handle.c](https://github.com/torvalds/linux/blob/master/kernel/irq/handle.c) source code file and handles spurious and unhandled irqs; -* `depth` - `0` if the IRQ line is enabled and a positive value if it has been disabled at least once; -* `lock` - A spin lock used to serialize the accesses to the `IRQ` descriptor. - -As we calculated count of the interrupts and initialized our `irq_desc` array, we start to fill descriptors in the loop: - -```C -for (i = 0; i < count; i++) { - desc[i].kstat_irqs = alloc_percpu(unsigned int); - alloc_masks(&desc[i], GFP_KERNEL, node); - raw_spin_lock_init(&desc[i].lock); - lockdep_set_class(&desc[i].lock, &irq_desc_lock_class); - desc_set_defaults(i, &desc[i], node, NULL); -} -``` - -We are going through the all interrupt descriptors and do the following things: - -First of all we allocate [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable for the `irq` kernel statistic with the `alloc_percpu` macro. This macro allocates one instance of an object of the given type for every processor on the system. You can access kernel statistic from the userspace via `/proc/stat`: - -``` -~$ cat /proc/stat -cpu 207907 68 53904 5427850 14394 0 394 0 0 0 -cpu0 25881 11 6684 679131 1351 0 18 0 0 0 -cpu1 24791 16 5894 679994 2285 0 24 0 0 0 -cpu2 26321 4 7154 678924 664 0 71 0 0 0 -cpu3 26648 8 6931 678891 414 0 244 0 0 0 -... -... -... -``` - -Where the sixth column is the servicing interrupts. After this we allocate [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) for the given irq descriptor affinity and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. After this before the [critical section](https://en.wikipedia.org/wiki/Critical_section), the lock will be aqcuired with a call of the `raw_spin_lock` and unlocked with the call of the `raw_spin_unlock`. In the next step we call the `lockdep_set_class` macro which set the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in the separate chapter. - -In the end of the loop we call the `desc_set_defaults` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function takes four parameters: - -* number of a irq; -* interrupt descriptor; -* online `NUMA` node; -* owner of interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is need to proved refcount on the module which provides the interrupts; - -and fills the rest of the `irq_desc` fields. The `desc_set_defaults` function fills interrupt number, `irq` chip, platform-specific per-chip private data for the chip methods, per-IRQ data for the `irq_chip` methods and [MSI](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts) descriptor for the per `irq` and `irq` chip data: - -```C -desc->irq_data.irq = irq; -desc->irq_data.chip = &no_irq_chip; -desc->irq_data.chip_data = NULL; -desc->irq_data.handler_data = NULL; -desc->irq_data.msi_desc = NULL; -... -... -... -``` - -The `irq_data.chip` structure provides general `API` like the `irq_set_chip`, `irq_set_irq_type` and etc, for the irq controller [drivers](https://github.com/torvalds/linux/tree/master/drivers/irqchip). You can find it in the [kernel/irq/chip.c](https://github.com/torvalds/linux/blob/master/kernel/irq/chip.c) source code file. - -After this we set the status of the accessor for the given descriptor and set disabled state of the interrupts: - -```C -... -... -... -irq_settings_clr_and_set(desc, ~0, _IRQ_DEFAULT_INIT_FLAGS); -irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED); -... -... -... -``` - -In the next step we set the high level interrupt handlers to the `handle_bad_irq` which handles spurious and unhandled irqs (as the hardware stuff is not initialized yet, we set this handler), set `irq_desc.desc` to `1` which means that an `IRQ` is disabled, reset count of the unhandled interrupts and interrupts in general: - -```C -... -... -... -desc->handle_irq = handle_bad_irq; -desc->depth = 1; -desc->irq_count = 0; -desc->irqs_unhandled = 0; -desc->name = NULL; -desc->owner = owner; -... -... -... -``` - -After this we go through the all [possible](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) processor with the [for_each_possible_cpu](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h#L714) helper and set the `kstat_irqs` to zero for the given interrupt descriptor: - -```C - for_each_possible_cpu(cpu) - *per_cpu_ptr(desc->kstat_irqs, cpu) = 0; -``` - -and call the `desc_smp_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) that initializes `NUMA` node of the given interrupt descriptor, sets default `SMP` affinity and clears the `pending_mask` of the given interrupt descriptor depends on the value of the `CONFIG_GENERIC_PENDING_IRQ` kernel configuration option: - -```C -static void desc_smp_init(struct irq_desc *desc, int node) -{ - desc->irq_data.node = node; - cpumask_copy(desc->irq_data.affinity, irq_default_affinity); -#ifdef CONFIG_GENERIC_PENDING_IRQ - cpumask_clear(desc->pending_mask); -#endif -} -``` - -In the end of the `early_irq_init` function we return the return value of the `arch_early_irq_init` function: - -```C -return arch_early_irq_init(); -``` - -This function defined in the [kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/kernel/apic/vector.c) and contains only one call of the `arch_early_ioapic_init` function from the [kernel/apic/io_apic.c](https://github.com/torvalds/linux/blob/master/kernel/apic/io_apic.c). As we can understand from the `arch_early_ioapic_init` function's name, this function makes early initialization of the [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). First of all it make a check of the number of the legacy interrupts wit the call of the `nr_legacy_irqs` function. If we have no lagacy interrupts with the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) programmable interrupt controller we set `io_apic_irqs` to the `0xffffffffffffffff`: - -```C -if (!nr_legacy_irqs()) - io_apic_irqs = ~0UL; -``` - -After this we are going through the all `I/O APICs` and allocate space for the registers with the call of the `alloc_ioapic_saved_registers`: - -```C -for_each_ioapic(i) - alloc_ioapic_saved_registers(i); -``` - -And in the end of the `arch_early_ioapic_init` function we are going through the all legacy irqs (from `IRQ0` to `IRQ15`) in the loop and allocate space for the `irq_cfg` which represents configuration of an irq on the given `NUMA` node: - -```C -for (i = 0; i < nr_legacy_irqs(); i++) { - cfg = alloc_irq_and_cfg_at(i, node); - cfg->vector = IRQ0_VECTOR + i; - cpumask_setall(cfg->domain); -} -``` - -That's all. - -Sparse IRQs --------------------------------------------------------------------------------- - -We already saw in the beginning of this part that implementation of the `early_irq_init` function depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Previously we saw implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` configuration option is not set, not let's look on the its implementation when this option is set. Implementation of this function very similar, but little differ. We can see the same definition of variables and call of the `init_irq_default_affinity` in the beginning of the `early_irq_init` function: - -```C -#ifdef CONFIG_SPARSE_IRQ -int __init early_irq_init(void) -{ - int i, initcnt, node = first_online_node; - struct irq_desc *desc; - - init_irq_default_affinity(); - ... - ... - ... -} -#else -... -... -... -``` - -But after this we can see the following call: - -```C -initcnt = arch_probe_nr_irqs(); -``` - -The `arch_probe_nr_irqs` function defined in the [arch/x86/kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/apic/vector.c) and calculates count of the pre-allocated irqs and update `nr_irqs` with its number. But stop. Why there are pre-allocated irqs? There is alternative form of interrupts called - [Message Signaled Interrupts](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts) available in the [PCI](https://en.wikipedia.org/wiki/Conventional_PCI). Instead of assigning a fixed number of the interrupt request, the device is allowed to record a message at a particular address of RAM, in fact, the display on the [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs). `MSI` permits a device to allocate `1`, `2`, `4`, `8`, `16` or `32` interrupts and `MSI-X` permits a device to allocate up to `2048` interrupts. Now we know that irqs can be pre-allocated. More about `MSI` will be in a next part, but now let's look on the `arch_probe_nr_irqs` function. We can see the check which assign amount of the interrupt vectors for the each processor in the system to the `nr_irqs` if it is greater and calculate the `nr` which represents number of `MSI` interrupts: - -```C -int nr_irqs = NR_IRQS; - -if (nr_irqs > (NR_VECTORS * nr_cpu_ids)) - nr_irqs = NR_VECTORS * nr_cpu_ids; - -nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids; -``` - -Take a look on the `gsi_top` variable. Each `APIC` is identified with its own `ID` and with the offset where its `IRQ` starts. It is called `GSI` base or `Global System Interrupt` base. So the `gsi_top` represnters it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table (you can remember that we have parsed this table in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux Kernel initialization process chapter). - -After this we update the `nr` depends on the value of the `gsi_top`: - -```C -#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ) - if (gsi_top <= NR_IRQS_LEGACY) - nr += 8 * nr_cpu_ids; - else - nr += gsi_top * 16; -#endif -``` - -Update the `nr_irqs` if it less than `nr` and return the number of the legacy irqs: - -```C -if (nr < nr_irqs) - nr_irqs = nr; - -return nr_legacy_irqs(); -} -``` - -The next after the `arch_probe_nr_irqs` is printing information about number of `IRQs`: - -```C -printk(KERN_INFO "NR_IRQS:%d nr_irqs:%d %d\n", NR_IRQS, nr_irqs, initcnt); -``` - -We can find it in the [dmesg](https://en.wikipedia.org/wiki/Dmesg) output: - -``` -$ dmesg | grep NR_IRQS -[ 0.000000] NR_IRQS:4352 nr_irqs:488 16 -``` - -After this we do some checks that `nr_irqs` and `initcnt` values is not greater than maximum allowable number of `irqs`: - -```C -if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS)) - nr_irqs = IRQ_BITMAP_BITS; - -if (WARN_ON(initcnt > IRQ_BITMAP_BITS)) - initcnt = IRQ_BITMAP_BITS; -``` - -where `IRQ_BITMAP_BITS` is equal to the `NR_IRQS` if the `CONFIG_SPARSE_IRQ` is not set and `NR_IRQS + 8196` in other way. In the next step we are going over all interrupt descript which need to be allocated in the loop and allocate space for the descriptor and insert to the `irq_desc_tree` [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html): - -```C -for (i = 0; i < initcnt; i++) { - desc = alloc_desc(i, node, NULL); - set_bit(i, allocated_irqs); - irq_insert_desc(i, desc); -} -``` - -In the end of the `early_irq_init` function we return the value of the call of the `arch_early_irq_init` function as we did it already in the previous variant when the `CONFIG_SPARSE_IRQ` option was not set: - -```C -return arch_early_irq_init(); -``` - -That's all. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the seventh part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we started to dive into external hardware interrupts in this part. We saw early initialization of the `irq_desc` structure which represents description of an external interrupt and contains information about it like list of irq actions, information about interrupt handler, interrupts's owner, count of the unhandled interrupt and etc. In the next part we will continue to research external interrupts. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) -* [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access) -* [Enum type](https://en.wikipedia.org/wiki/Enumerated_type) -* [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) -* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) -* [spinlock](https://en.wikipedia.org/wiki/Spinlock) -* [critical section](https://en.wikipedia.org/wiki/Critical_section) -* [Lock validator](https://lwn.net/Articles/185666/) -* [MSI](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts) -* [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) -* [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs) -* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) -* [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) -* [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) -* [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html) -* [dmesg](https://en.wikipedia.org/wiki/Dmesg) diff --git a/interrupts/interrupts-8.md b/interrupts/interrupts-8.md deleted file mode 100644 index 0614375..0000000 --- a/interrupts/interrupts-8.md +++ /dev/null @@ -1,542 +0,0 @@ -Interrupts and Interrupt Handling. Part 8. -================================================================================ - -Non-early initialization of the IRQs --------------------------------------------------------------------------------- - -This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) we started to dive into the external hardware [interrupts](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). We looked on the implementation of the `early_irq_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) source code file and saw the initialization of the `irq_desc` structure in this function. Remind that `irq_desc` structure (defined in the [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h#L46) is the foundation of interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization stuff which is related to the external hardware interrupts. - -Right after the call of the `early_irq_init` function in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) we can see the call of the `init_IRQ` function. This function is architecture-specfic and defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c). The `init_IRQ` function makes initialization of the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable that defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) source code file: - -```C -... -DEFINE_PER_CPU(vector_irq_t, vector_irq) = { - [0 ... NR_VECTORS - 1] = -1, -}; -... -``` - -and represents `percpu` array of the interrupt vector numbers. The `vector_irq_t` defined in the [arch/x86/include/asm/hw_irq.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/hw_irq.h) and expands to the: - -```C -typedef int vector_irq_t[NR_VECTORS]; -``` - -where `NR_VECTORS` is count of the vector number and as you can remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter it is `256` for the [x86_64](https://en.wikipedia.org/wiki/X86-64): - -```C -#define NR_VECTORS 256 -``` - -So, in the start of the `init_IRQ` function we fill the `vecto_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) array with the vector number of the `legacy` interrupts: - -```C -void __init init_IRQ(void) -{ - int i; - - for (i = 0; i < nr_legacy_irqs(); i++) - per_cpu(vector_irq, 0)[IRQ0_VECTOR + i] = i; -... -... -... -} -``` - -This `vector_irq` will be used during the first steps of an external hardware interrupt handling in the `do_IRQ` function from the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c): - -```C -__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs) -{ - ... - ... - ... - irq = __this_cpu_read(vector_irq[vector]); - - if (!handle_irq(irq, regs)) { - ... - ... - ... - } - - exiting_irq(); - ... - ... - return 1; -} -``` - -Why is `legacy` here? Actuall all interrupts handled by the modern [IO-APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs) controller. But these interrupts (from `0x30` to `0x3f`) by legacy interrupt-controllers like [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller). If these interrupts are handled by the `I/O APIC` then this vector space will be freed and re-used. Let's look on this code closer. First of all the `nr_legacy_irqs` defined in the [arch/x86/include/asm/i8259.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/i8259.h) and just returns the `nr_legacy_irqs` field from the `legacy_pic` strucutre: - -```C -static inline int nr_legacy_irqs(void) -{ - return legacy_pic->nr_legacy_irqs; -} -``` - -This structure defined in the same header file and represents non-modern programmable interrupts controller: - -```C -struct legacy_pic { - int nr_legacy_irqs; - struct irq_chip *chip; - void (*mask)(unsigned int irq); - void (*unmask)(unsigned int irq); - void (*mask_all)(void); - void (*restore_mask)(void); - void (*init)(int auto_eoi); - int (*irq_pending)(unsigned int irq); - void (*make_irq)(unsigned int irq); -}; -``` - -Actuall default maximum number of the legacy interrupts represtented by the `NR_IRQ_LEGACY` macro from the [arch/x86/include/asm/irq_vectors.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_vectors.h): - -```C -#define NR_IRQS_LEGACY 16 -``` - -In the loop we are accessing the `vecto_irq` per-cpu array with the `per_cpu` macro by the `IRQ0_VECTOR + i` index and write the legacy vector number there. The `IRQ0_VECTOR` macro defined in the [arch/x86/include/asm/irq_vectors.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_vectors.h) header file and expands to the `0x30`: - -```C -#define FIRST_EXTERNAL_VECTOR 0x20 - -#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15) -``` - -Why is `0x30` here? You can remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter that first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture). So, it means that we fill the `vector_irq` from the `IRQ0_VECTOR` which is equal to the `32` to the `IRQ0_VECTOR + 16` (before the `0x30`). - -In the end of the `init_IRQ` functio we can see the call of the following function: - -```C -x86_init.irqs.intr_init(); -``` - -from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c) source code file. If you have read [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you can remember the `x86_init` structure. This structure contains a couple of files which are points to the function related to the platform setup (`x86_64` in our case), for example `resources` - related with the memory resources, `mpparse` - related with the parsing of the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table and etc.). As we can see the `x86_init` also contains the `irqs` field which contains three following fields: - -```C -struct x86_init_ops x86_init __initdata -{ - ... - ... - ... - .irqs = { - .pre_vector_init = init_ISA_irqs, - .intr_init = native_init_IRQ, - .trap_init = x86_init_noop, - }, - ... - ... - ... -} -``` - -Now, we are interesting in the `native_init_IRQ`. As we can note, the name of the `native_init_IRQ` function contains the `native_` prefix which means that this function is architecture-specific. It defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) and executes general initialization of the [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs) and initialization of the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture) irqs. Let's look on the implementation of the `native_init_IRQ` function and will try to understand what occurs there. The `native_init_IRQ` function starts from the execution of the following function: - -```C -x86_init.irqs.pre_vector_init(); -``` - -As we can see above, the `pre_vector_init` points to the `init_ISA_irqs` function that defined in the same [source code](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) file and as we can understand from the function's name, it makes initialization of the `ISA` related interrupts. The `init_ISA_irqs` function starts from the definition of the `chip` variable which has a `irq_chip` type: - -```C -void __init init_ISA_irqs(void) -{ - struct irq_chip *chip = legacy_pic->chip; - ... - ... - ... -``` - -The `irq_chip` structure defined in the [include/linux/irq.h](https://github.com/torvalds/linux/blob/master/include/linux/irq.h) header file and represents hardware interrupt chip descriptor. It contains: - -* `name` - name of a device. Used in the `/proc/interrupts`: - -```C -$ cat /proc/interrupts - CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 - 0: 16 0 0 0 0 0 0 0 IO-APIC 2-edge timer - 1: 2 0 0 0 0 0 0 0 IO-APIC 1-edge i8042 - 8: 1 0 0 0 0 0 0 0 IO-APIC 8-edge rtc0 -``` - -look on the last columnt; - -* `(*irq_mask)(struct irq_data *data)` - mask an interrupt source; -* `(*irq_ack)(struct irq_data *data)` - start of a new interrupt; -* `(*irq_startup)(struct irq_data *data)` - start up the interrupt; -* `(*irq_shutdown)(struct irq_data *data)` - shutdown the interrupt -* and etc. - -fields. Note that the `irq_data` structure represents set of the per irq chip data passed down to chip functions. It contains `mask` - precomputed bitmask for accessing the chip registers, `irq` - interrupt number, `hwirq` - hardware interrupt number, local to the interrupt domain chip low level interrupt hardware access and etc. - -After this depends on the `CONFIG_X86_64` and `CONFIG_X86_LOCAL_APIC` kernel configuration option call the `init_bsp_APIC` function from the [arch/x86/kernel/apic/apic.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/apic/apic.c): - -```C -#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC) - init_bsp_APIC(); -#endif -``` - -This function makes initialization of the [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) of `bootstrap processor` (or processor which starts first). It starts from the check that we found [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) config (read more about it in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux kernel initialization process chapter) and the processor has `APIC`: - -```C -if (smp_found_config || !cpu_has_apic) - return; -``` - -In other way we return from this function. In the next step we call the `clear_local_APIC` function from the same source code file that shutdowns the local `APIC` (more about it will be in the chapter about the `Advanced Programmable Interrupt Controller`) and enable `APIC` of the first processor by the setting `unsigned int value` to the `APIC_SPIV_APIC_ENABLED`: - -```C -value = apic_read(APIC_SPIV); -value &= ~APIC_VECTOR_MASK; -value |= APIC_SPIV_APIC_ENABLED; -``` - -and writing it with the help of the `apic_write` function: - -```C -apic_write(APIC_SPIV, value); -``` - -After we have enabled `APIC` for the bootstrap processor, we return to the `init_ISA_irqs` function and in the next step we initalize legacy `Programmable Interrupt Controller` and set the legacy chip and handler for the each legacy irq: - -```C -legacy_pic->init(0); - -for (i = 0; i < nr_legacy_irqs(); i++) - irq_set_chip_and_handler(i, chip, handle_level_irq); -``` - -Where can we find `init` function? The `legacy_pic` defined in the [arch/x86/kernel/i8259.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/i8259.c) and it is: - -```C -struct legacy_pic *legacy_pic = &default_legacy_pic; -``` - -Where the `default_legacy_pic` is: - -```C -struct legacy_pic default_legacy_pic = { - ... - ... - ... - .init = init_8259A, - ... - ... - ... -} -``` - -The `init_8259A` function defined in the same source code file and executes initialization of the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) ``Programmable Interrupt Controller` (more about it will be in the separate chapter abot `Programmable Interrupt Controllers` and `APIC`). - -Now we can return to the `native_init_IRQ` function, after the `init_ISA_irqs` function finished its work. The next step is the call of the `apic_intr_init` function that allocates special interrupt gates which are used by the [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) architecture for the [Inter-processor interrupt](https://en.wikipedia.org/wiki/Inter-processor_interrupt). The `alloc_intr_gate` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) used for the interrupt descriptor allocation allocation: - -```C -#define alloc_intr_gate(n, addr) \ -do { \ - alloc_system_vector(n); \ - set_intr_gate(n, addr); \ -} while (0) -``` - -As we can see, first of all it expands to the call of the `alloc_system_vector` function that checks the given vector number in the `user_vectors` bitmap (read previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) about it) and if it is not set in the `user_vectors` bitmap we set it. After this we test that the `first_system_vector` is greater than given interrupt vector number and if it is greater we assign it: - -```C -if (!test_bit(vector, used_vectors)) { - set_bit(vector, used_vectors); - if (first_system_vector > vector) - first_system_vector = vector; -} else { - BUG(); -} -``` - -We already saw the `set_bit` macro, now let's look on the `test_bit` and the `first_system_vector`. The first `test_bit` macro defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/bitops.h) and looks like this: - -```C -#define test_bit(nr, addr) \ - (__builtin_constant_p((nr)) \ - ? constant_test_bit((nr), (addr)) \ - : variable_test_bit((nr), (addr))) -``` - -We can see the [ternary operator](https://en.wikipedia.org/wiki/Ternary_operation) here make a test with the [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) built-in function `__builtin_constant_p` tests that given vector number (`nr`) is known at compile time. If you're feeling misunderstanding of the `__builtin_constant_p`, we can make simple test: - -```C -#include - -#define PREDEFINED_VAL 1 - -int main() { - int i = 5; - printf("__builtin_constant_p(i) is %d\n", __builtin_constant_p(i)); - printf("__builtin_constant_p(PREDEFINED_VAL) is %d\n", __builtin_constant_p(PREDEFINED_VAL)); - printf("__builtin_constant_p(100) is %d\n", __builtin_constant_p(100)); - - return 0; -} -``` - -and look on the result: - -``` -$ gcc test.c -o test -$ ./test -__builtin_constant_p(i) is 0 -__builtin_constant_p(PREDEFINED_VAL) is 1 -__builtin_constant_p(100) is 1 -``` - -Now I think it must be clear for you. Let's get back to the `test_bit` macro. If the `__builtin_constant_p` will return non-zero, we call `constant_test_bit` function: - -```C -static inline int constant_test_bit(int nr, const void *addr) -{ - const u32 *p = (const u32 *)addr; - - return ((1UL << (nr & 31)) & (p[nr >> 5])) != 0; -} -``` - -and the `variable_test_bit` in other way: - -```C -static inline int variable_test_bit(int nr, const void *addr) -{ - u8 v; - const u32 *p = (const u32 *)addr; - - asm("btl %2,%1; setc %0" : "=qm" (v) : "m" (*p), "Ir" (nr)); - return v; -} -``` - -What's the difference between two these functions and why do we need in two different functions for the same purpose? As you already can guess main purpose is optimization. If we will write simple example with these functions: - -```C -#define CONST 25 - -int main() { - int nr = 24; - variable_test_bit(nr, (int*)0x10000000); - constant_test_bit(CONST, (int*)0x10000000) - return 0; -} -``` - -and will look on the assembly output of our example we will see followig assembly code: - -```assembly -pushq %rbp -movq %rsp, %rbp - -movl $268435456, %esi -movl $25, %edi -call constant_test_bit -``` - -for the `constant_test_bit`, and: - -```assembly -pushq %rbp -movq %rsp, %rbp - -subq $16, %rsp -movl $24, -4(%rbp) -movl -4(%rbp), %eax -movl $268435456, %esi -movl %eax, %edi -call variable_test_bit -``` - -for the `variable_test_bit`. These two code listings starts with the same part, first of all we save base of the current stack frame in the `%rbp` register. But after this code for both examples is different. In the first example we put `$268435456` (here the `$268435456` is our second parameter - `0x10000000`) to the `esi` and `$25` (our first parameter) to the `edi` register and call `constant_test_bit`. We put functuin parameters to the `esi` and `edi` registers because as we are learning Linux kernel for the `x86_64` architecture we use `System V AMD64 ABI` [calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions). All is pretty simple. When we are using predifined constant, the compiler can just substitute its value. Now let's look on the second part. As you can see here, the compiler can not substitute value from the `nr` variable. In this case compiler must calcuate its offset on the programm's [stack frame](https://en.wikipedia.org/wiki/Call_stack). We substract `16` from the `rsp` register to allocate stack for the local variables data and put the `$24` (value of the `nr` variable) to the `rbp` with offset `-4`. Our stack frame will be like this: - -``` - <- stack grows - - %[rbp] - | -+----------+ +---------+ +---------+ +--------+ -| | | | | return | | | -| nr |-| |-| |-| argc | -| | | | | address | | | -+----------+ +---------+ +---------+ +--------+ - | - %[rsp] -``` - -After this we put this value to the `eax`, so `eax` register now contains value of the `nr`. In the end we do the same that in the first example, we put the `$268435456` (the first parameter of the `variable_test_bit` function) and the value of the `eax` (value of `nr`) to the `edi` register (the second parameter of the `variable_test_bit function`). - -The next step after the `apic_intr_init` function will finish its work is the setting interrup gates from the `FIRST_EXTERNAL_VECTOR` or `0x20` to the `0x256`: - -```C -i = FIRST_EXTERNAL_VECTOR; - -#ifndef CONFIG_X86_LOCAL_APIC -#define first_system_vector NR_VECTORS -#endif - -for_each_clear_bit_from(i, used_vectors, first_system_vector) { - set_intr_gate(i, irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR)); -} -``` - -But as we are using the `for_each_clear_bit_from` helper, we set only non-initialized interrupt gates. After this we use the same `for_each_clear_bit_from` helper to fill the non-filled interrupt gates in the interrupt table with the `spurious_interrupt`: - -```C -#ifdef CONFIG_X86_LOCAL_APIC -for_each_clear_bit_from(i, used_vectors, NR_VECTORS) - set_intr_gate(i, spurious_interrupt); -#endif -``` - -Where the `spurious_interrupt` function represent interrupt handler fro the `spurious` interrupt. Here the `used_vectors` is the `unsigned long` that contains already initialized interrupt gates. We already filled first `32` interrupt vectors in the `trap_init` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file: - -```C -for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) - set_bit(i, used_vectors); -``` - -You can remember how we did it in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-6.html) of this chapter. - -In the end of the `native_init_IRQ` function we can see the following check: - -```C -if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs()) - setup_irq(2, &irq2); -``` - -First of all let's deal with the condition. The `acpi_ioapic` variable represents existence of [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs). It defined in the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c). This variable set in the `acpi_set_irq_model_ioapic` function that called during the processing `Multiple APIC Description Table`. This occurs during initialization of the architecture-specific stuff in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) (more about it we will know in the other chapter about [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)). Note that the value of the `acpi_ioapic` variable depends on the `CONFIG_ACPI` and `CONFIG_X86_LOCAL_APIC` Linux kernel configuration options. If these options did not set, this variable will be just zero: - -```C -#define acpi_ioapic 0 -``` - -The second condition - `!of_ioapic && nr_legacy_irqs()` checks that we do not use [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware) `I/O APIC` and legacy interrupt controller. We already know about the `nr_legacy_irqs`. The second is `of_ioapic` variable defined in the [arch/x86/kernel/devicetree.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/devicetree.c) and initialized in the `dtb_ioapic_setup` function that build information about `APICs` in the [devicetree](https://en.wikipedia.org/wiki/Device_tree). Note that `of_ioapic` variable depends on the `CONFIG_OF` Linux kernel configuration opiotn. If this option is not set, the value of the `of_ioapic` will be zero too: - -```C -#ifdef CONFIG_OF -extern int of_ioapic; -... -... -... -#else -#define of_ioapic 0 -... -... -... -#endif -``` - -If the condition will return non-zero vaule we call the: - -```C -setup_irq(2, &irq2); -``` - -function. First of all about the `irq2`. The `irq2` is the `irqaction` structure that defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irqinit.c) source code file and represents `IRQ 2` line that is used to query devices connected cascade: - -```C -static struct irqaction irq2 = { - .handler = no_action, - .name = "cascade", - .flags = IRQF_NO_THREAD, -}; -``` - -Some time ago interrupt controller consisted of two chips and one was connected to second. The second chip that was connected to the first chip via this `IRQ 2` line. This chip serviced lines from `8` to `15` and after after this lines of the first chip. So, for example [Intel 8259A](https://en.wikipedia.org/wiki/Intel_8259) has following lines: - -* `IRQ 0` - system time; -* `IRQ 1` - keyboard; -* `IRQ 2` - used for devices which are cascade connected; -* `IRQ 8` - [RTC](https://en.wikipedia.org/wiki/Real-time_clock); -* `IRQ 9` - reserved; -* `IRQ 10` - reserved; -* `IRQ 11` - reserved; -* `IRQ 12` - `ps/2` mouse; -* `IRQ 13` - coprocessor; -* `IRQ 14` - hard drive controller; -* `IRQ 1` - reserved; -* `IRQ 3` - `COM2` and `COM4`; -* `IRQ 4` - `COM1` and `COM3`; -* `IRQ 5` - `LPT2`; -* `IRQ 6` - drive controller; -* `IRQ 7` - `LPT1`. - -The `setup_irq` function defined in the [kernel/irq/manage.c](https://github.com/torvalds/linux/blob/master/kernel/irq/manage.c) and takes two parameters: - -* vector number of an interrupt; -* `irqaction` structure related with an interrupt. - -This function initializes interrupt descriptor from the given vector number at the beginning: - -```C -struct irq_desc *desc = irq_to_desc(irq); -``` - -And call the `__setup_irq` function that setups given interrupt: - -```C -chip_bus_lock(desc); -retval = __setup_irq(irq, desc, act); -chip_bus_sync_unlock(desc); -return retval; -``` - -Note that the interrupt descriptor is locked during `__setup_irq` function will work. The `__setup_irq` function makes many different things: It creates a handler thread when a thread function is supplied and the interrupt does not nest into another interrupt thread, sets the flags of the chip, fills the `irqaction` structure and many many more. - -All of the above it creates `/prov/vector_number` directory and fills it, but if you are using modern computer all values will be zero there: - -``` -$ cat /proc/irq/2/node -0 - -$cat /proc/irq/2/affinity_hint -00 - -cat /proc/irq/2/spurious -count 0 -unhandled 0 -last_unhandled 0 ms -``` - -because probably `APIC` handles interrupts on the our machine. - -That's all. - -Conclusion --------------------------------------------------------------------------------- - -It is the end of the eighth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we started to do it and saw early initialization of the `IRQs`. In this part we already saw non-early interrupts initialization in the `init_IRQ` function. We saw initialization of the `vector_irq` per-cpu array which is store vector numbers of the interrupts and will be used during interrupt handling and initialization of other stuff which is related to the external hardware interrupts. - -In the next part we will continue to learn interrupts handling related stuff and will see initialization of the `softirqs`. - -If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). - -**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) -* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) -* [x86_64](https://en.wikipedia.org/wiki/X86-64) -* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) -* [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) -* [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture) -* [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) -* [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs) -* [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs) -* [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) -* [Inter-processor interrupt](https://en.wikipedia.org/wiki/Inter-processor_interrupt) -* [ternary operator](https://en.wikipedia.org/wiki/Ternary_operation) -* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) -* [calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions) -* [PDF. System V Application Binary Interface AMD64](http://x86-64.org/documentation/abi.pdf) -* [Call stack](https://en.wikipedia.org/wiki/Call_stack) -* [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware) -* [devicetree](https://en.wikipedia.org/wiki/Device_tree) -* [RTC](https://en.wikipedia.org/wiki/Real-time_clock) -* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) diff --git a/mm/README.md b/mm/README.md deleted file mode 100644 index fa0971e..0000000 --- a/mm/README.md +++ /dev/null @@ -1,7 +0,0 @@ -# Linux kernel memory management - -This chapter describes memory management in the linux kernel. You will see here a -couple of posts which describe different parts of the linux memory management framework: - -* [Memblock](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-1.md) - describes early `memblock` allocator. -* [Fix-Mapped Addresses and ioremap](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) - describes `fix-mapped` addresses and early `ioremap`. diff --git a/mm/linux-mm-1.md b/mm/linux-mm-1.md deleted file mode 100644 index e90263f..0000000 --- a/mm/linux-mm-1.md +++ /dev/null @@ -1,418 +0,0 @@ -Linux kernel memory management Part 1. -================================================================================ - -Introduction --------------------------------------------------------------------------------- - -Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the [last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember as we built early page tables, identity page tables and fixmap page tables in the boot time. No compilcated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`. - -Memblock --------------------------------------------------------------------------------- - -Memblock is one of the methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and -running yet. Previously it was called `Logical Memory Block`, but with the [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to the `memblock`. As Linux kernel for `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part. And now time to get acquainted with it closer. We will see how it is implemented. - -We will start to learn `memblock` from the data structures. Definitions of the all data structures can be found in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file. - -The first structure has the same name as this part and it is: - -```C -struct memblock { - bool bottom_up; - phys_addr_t current_limit; - struct memblock_type memory; --> array of memblock_region - struct memblock_type reserved; --> array of memblock_region -#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP - struct memblock_type physmem; -#endif -}; -``` - -This structure contains five fields. First is `bottom_up` which allows allocating memory in bottom-up mode when it is `true`. Next field is `current_limit`. This field describes the limit size of the memory block. The next three fields describe the type of the memory block. It can be: reserved, memory and physical memory if the `CONFIG_HAVE_MEMBLOCK_PHYS_MAP` configuration option is enabled. Now we see yet another data structure - `memblock_type`. Let's look at its definition: - -```C -struct memblock_type { - unsigned long cnt; - unsigned long max; - phys_addr_t total_size; - struct memblock_region *regions; -}; -``` - -This structure provides information about memory type. It contains fields which describe the number of memory regions which are inside the current memory block, the size of all memory regions, the size of the allocated array of the memory regions and pointer to the array of the `memblock_region` structures. `memblock_region` is a structure which describes a memory region. Its definition is: - -```C -struct memblock_region { - phys_addr_t base; - phys_addr_t size; - unsigned long flags; -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP - int nid; -#endif -}; -``` - -`memblock_region` provides base address and size of the memory region, flags which can be: - -```C -#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0) -#define MEMBLOCK_ALLOC_ACCESSIBLE 0 -#define MEMBLOCK_HOTPLUG 0x1 -``` - -Also `memblock_region` provides integer field - [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access) node selector, if the `CONFIG_HAVE_MEMBLOCK_NODE_MAP` configuration option is enabled. - -Schematically we can imagine it as: - -``` -+---------------------------+ +---------------------------+ -| memblock | | | -| _______________________ | | | -| | memory | | | Array of the | -| | memblock_type |-|-->| membock_region | -| |_______________________| | | | -| | +---------------------------+ -| _______________________ | +---------------------------+ -| | reserved | | | | -| | memblock_type |-|-->| Array of the | -| |_______________________| | | memblock_region | -| | | | -+---------------------------+ +---------------------------+ -``` - -These three structures: `memblock`, `memblock_type` and `memblock_region` are main in the `Memblock`. Now we know about it and can look at Memblock initialization process. - -Memblock initialization --------------------------------------------------------------------------------- - -As all API of the `memblock` described in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file, all implementation of these function is in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) source code file. Let's look at the top of the source code file and we will see the initialization of the `memblock` structure: - -```C -struct memblock memblock __initdata_memblock = { - .memory.regions = memblock_memory_init_regions, - .memory.cnt = 1, - .memory.max = INIT_MEMBLOCK_REGIONS, - - .reserved.regions = memblock_reserved_init_regions, - .reserved.cnt = 1, - .reserved.max = INIT_MEMBLOCK_REGIONS, - -#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP - .physmem.regions = memblock_physmem_init_regions, - .physmem.cnt = 1, - .physmem.max = INIT_PHYSMEM_REGIONS, -#endif - .bottom_up = false, - .current_limit = MEMBLOCK_ALLOC_ANYWHERE, -}; -``` - -Here we can see initialization of the `memblock` structure which has the same name as structure - `memblock`. First of all note on `__initdata_memblock`. Defenition of this macro looks like: - -```C -#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK - #define __init_memblock __meminit - #define __initdata_memblock __meminitdata -#else - #define __init_memblock - #define __initdata_memblock -#endif -``` - -You can note that it depends on `CONFIG_ARCH_DISCARD_MEMBLOCK`. If this configuration option is enabled, memblock code will be put to the `.init` section and it will be released after the kernel is booted up. - -Next we can see initialization of the `memblock_type memory`, `memblock_type reserved` and `memblock_type physmem` fields of the `memblock` structure. Here we are interested only in the `memblock_type.regions` initialization process. Note that every `memblock_type` field initialized by the arrays of the `memblock_region`: - -```C -static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock; -static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock; -#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP -static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock; -#endif -``` - -Every array contains 128 memory regions. We can see it in the `INIT_MEMBLOCK_REGIONS` macro definition: - -```C -#define INIT_MEMBLOCK_REGIONS 128 -``` - -Note that all arrays are also defined with the `__initdata_memblock` macro which we already saw in the `memblock` strucutre initialization (read above if you've forgot). - -The last two fields describe that `bottom_up` allocation is disabled and the limit of the current Memblock is: - -```C -#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0) -``` - -which is `0xffffffffffffffff`. - -On this step initialization of the `memblock` structure finished and we can look on the Memblock API. - -Memblock API --------------------------------------------------------------------------------- - -Ok we have finished with initilization of the `memblock` structure and now we can look on the Memblock API and its implementation. As I said above, all implementation of the `memblock` presented in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c). To understand how `memblock` works and is implemented, let's look at its usage first of all. There are a couple of [places](http://lxr.free-electrons.com/ident?i=memblock) in the linux kernel where memblock is used. For example let's take `memblock_x86_fill` function from the [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c#L1061). This function goes through the memory map provided by the [e820](http://en.wikipedia.org/wiki/E820) and adds memory regions reserved by the kernel to the `memblock` with the `memblock_add` function. As we met `memblock_add` function first, let's start from it. - -This function takes physical base address and size of the memory region and adds it to the `memblock`. `memblock_add` function does not do anything special in its body, but just calls: - -```C -memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0); -``` - -function. We pass memory block type - `memory`, physical base address and size of the memory region, maximum number of nodes which are zero if `CONFIG_NODES_SHIFT` is not set in the configuration file or `CONFIG_NODES_SHIFT` if it is set, and flags. The `memblock_add_range` function adds new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks for existence of the memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill new `memory_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is not empty, we start to add new memory region to the `memblock` with the given `memblock_type`. - -First of all we get the end of the memory region with the: - -```C -phys_addr_t end = base + memblock_cap_size(base, &size); -``` - -`memblock_cap_size` adjusts `size` that `base + size` will not overflow. Its implementation is pretty easy: - -```C -static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size) -{ - return *size = min(*size, (phys_addr_t)ULLONG_MAX - base); -} -``` - -`memblock_cap_size` returns new size which is the smallest value between the given size and `ULLONG_MAX - base`. - -After that we have the end address of the new memory region, `memblock_add_range` checks overlap and merge conditions with already added memory regions. Insertion of the new memory region to the `memblcok` consists of two steps: - -* Adding of non-overlapping parts of the new memory area as separate regions; -* Merging of all neighbouring regions. - -We are going through all the already stored memory regions and checking for overlap with the new region: - -```C - for (i = 0; i < type->cnt; i++) { - struct memblock_region *rgn = &type->regions[i]; - phys_addr_t rbase = rgn->base; - phys_addr_t rend = rbase + rgn->size; - - if (rbase >= end) - break; - if (rend <= base) - continue; - ... - ... - ... - } -``` - -If the new memory region does not overlap regions which are already stored in the `memblock`, insert this region into the memblock with and this is first step, we check that new region can fit into the memory block and call `memblock_double_array` in other way: - -```C -while (type->cnt + nr_new > type->max) - if (memblock_double_array(type, obase, size) < 0) - return -ENOMEM; - insert = true; - goto repeat; -``` - -`memblock_double_array` doubles the size of the given regions array. Then we set `insert` to `true` and go to the `repeat` label. In the second step, starting from the `repeat` label we go through the same loop and insert the current memory region into the memory block with the `memblock_insert_region` function: - -```C - if (base < end) { - nr_new++; - if (insert) - memblock_insert_region(type, i, base, end - base, - nid, flags); - } -``` - -As we set `insert` to `true` in the first step, now `memblock_insert_region` will be called. `memblock_insert_region` has almost the same implementation that we saw when we insert new region to the empty `memblock_type` (see above). This function gets the last memory region: - -```C -struct memblock_region *rgn = &type->regions[idx]; -``` - -and copies memory area with `memmove`: - -```C -memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn)); -``` - -After this fills `memblock_region` fields of the new memory region base, size and etc... and increase size of the `memblock_type`. In the end of the execution, `memblock_add_range` calls `memblock_merge_regions` which merges neighboring compatible regions in the second step. - -In the second case the new memory region can overlap already stored regions. For example we already have `region1` in the `memblock`: - -``` -0 0x1000 -+-----------------------+ -| | -| | -| region1 | -| | -| | -+-----------------------+ -``` - -And now we want to add `region2` to the `memblock` with the following base address and size: - -``` -0x100 0x2000 -+-----------------------+ -| | -| | -| region2 | -| | -| | -+-----------------------+ -``` - -In this case set the base address of the new memory region as the end address of the overlapped region with: - -```C -base = min(rend, end); -``` - -So it will be `0x1000` in our case. And insert it as we did it already in the second step with: - -``` -if (base < end) { - nr_new++; - if (insert) - memblock_insert_region(type, i, base, end - base, nid, flags); -} -``` - -In this case we insert `overlapping portion` (we insert only the higher portion, because the lower portion is already in the overlapped memory region), then the remaining portion and merge these portions with `memblock_merge_regions`. As I said above `memblock_merge_regions` function merges neighboring compatible regions. It goes through the all memory regions from the given `memblock_type`, takes two neighboring memory regions - `type->regions[i]` and `type->regions[i + 1]` and checks that these regions have the same flags, belong to the same node and that end address of the first regions is not equal to the base address of the second region: - -```C -while (i < type->cnt - 1) { - struct memblock_region *this = &type->regions[i]; - struct memblock_region *next = &type->regions[i + 1]; - if (this->base + this->size != next->base || - memblock_get_region_node(this) != - memblock_get_region_node(next) || - this->flags != next->flags) { - BUG_ON(this->base + this->size > next->base); - i++; - continue; - } -``` - -If none of these conditions are not true, we update the size of the first region with the size of the next region: - -```C -this->size += next->size; -``` - -As we update the size of the first memory region with the size of the next memory region, we copy every (in the loop) memory region which is after the current (`this`) memory region to the one index ago with the `memmove` function: - -```C -memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next)); -``` - -And decrease the count of the memory regions which are belongs to the `memblock_type`: - -```C -type->cnt--; -``` - -After this we will get two memory regions merged into one: - -``` -0 0x2000 -+------------------------------------------------+ -| | -| | -| region1 | -| | -| | -+------------------------------------------------+ -``` - -That's all. This is the whole principle of the work of the `memblock_add_range` function. - -There is also `memblock_reserve` function which does the same as `memblock_add`, but only with one difference. It stores `memblock_type.reserved` in the memblock instead of `memblock_type.memory`. - -Of course this is not the full API. Memblock provides an API for not only adding `memory` and `reserved` memory regions, but also: - -* memblock_remove - removes memory region from memblock; -* memblock_find_in_range - finds free area in given range; -* memblock_free - releases memory region in memblock; -* for_each_mem_range - iterates through memblock areas. - -and many more.... - -Getting info about memory regions --------------------------------------------------------------------------------- - -Memblock also provides an API for getting information about allocated memory regions in the `memblcok`. It is split in two parts: - -* get_allocated_memblock_memory_regions_info - getting info about memory regions; -* get_allocated_memblock_reserved_regions_info - getting info about reserved regions. - -Implementation of these functions is easy. Let's look at `get_allocated_memblock_reserved_regions_info` for example: - -```C -phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info( - phys_addr_t *addr) -{ - if (memblock.reserved.regions == memblock_reserved_init_regions) - return 0; - - *addr = __pa(memblock.reserved.regions); - - return PAGE_ALIGN(sizeof(struct memblock_region) * - memblock.reserved.max); -} -``` - -First of all this function checks that `memblock` contains reserved memory regions. If `memblock` does not contain reserved memory regions we just return zero. Otherwise we write the physical address of the reserved memory regions array to the given address and return aligned size of the allocated array. Note that there is `PAGE_ALIGN` macro used for align. Actually it depends on size of page: - -```C -#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE) -``` - -Implementation of the `get_allocated_memblock_memory_regions_info` function is the same. It has only one difference, `memblock_type.memory` used instead of `memblock_type.reserved`. - -Memblock debugging --------------------------------------------------------------------------------- - -There are many calls to `memblock_dbg` in the memblock implementation. If you pass the `memblock=debug` option to the kernel command line, this function will be called. Actually `memblock_dbg` is just a macro which expands to `printk`: - -```C -#define memblock_dbg(fmt, ...) \ - if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) -``` - -For example you can see a call of this macro in the `memblock_reserve` function: - -```C -memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n", - (unsigned long long)base, - (unsigned long long)base + size - 1, - flags, (void *)_RET_IP_); -``` - -And you will see something like this: - -![Memblock](http://oi57.tinypic.com/1zoj589.jpg) - -Memblock has also support in [debugfs](http://en.wikipedia.org/wiki/Debugfs). If you run kernel not in `X86` architecture you can access: - -* /sys/kernel/debug/memblock/memory -* /sys/kernel/debug/memblock/reserved -* /sys/kernel/debug/memblock/physmem - -for getting dump of the `memblock` contents. - -Conclusion --------------------------------------------------------------------------------- - -This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new). - -**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [e820](http://en.wikipedia.org/wiki/E820) -* [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access) -* [debugfs](http://en.wikipedia.org/wiki/Debugfs) -* [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) diff --git a/mm/linux-mm-2.md b/mm/linux-mm-2.md deleted file mode 100644 index 86136a4..0000000 --- a/mm/linux-mm-2.md +++ /dev/null @@ -1,522 +0,0 @@ -Linux kernel memory management Part 2. -================================================================================ - -Fix-Mapped Addresses and ioremap --------------------------------------------------------------------------------- - -`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical address do not have to be a linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html), we already set the `level2_fixmap_pgt`: - -```assembly -NEXT_PAGE(level2_fixmap_pgt) - .fill 506,8,0 - .quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE - .fill 5,8,0 - -NEXT_PAGE(level1_fixmap_pgt) - .fill 512,8,0 -``` - -As you can see `level2_fixmap_pgt` is right after the `level2_kernel_pgt` which is kernel code+data+bss. Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses` enum from the [arch/x86/include/asm/fixmap.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/fixmap.h). For example it contains entries for `VSYSCALL_PAGE` - if emulation of legacy vsyscall page is enabled, `FIX_APIC_BASE` for local [apic](h -ttp://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and etc... In a virtual memory fix-mapped area is placed in the modules area: - -``` - +-----------+-----------------+---------------+------------------+ - | | | | | - |kernel text| kernel | | vsyscalls | - | mapping | text | Modules | fix-mapped | - |from phys 0| data | | addresses | - | | | | | - +-----------+-----------------+---------------+------------------+ -__START_KERNEL_map __START_KERNEL MODULES_VADDR 0xffffffffffffffff -``` - -Base virtual address and size of the `fix-mapped` area are presented by the two following macro: - -```C -#define FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT) -#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE) -``` - -Here `__end_of_permanent_fixed_addresses` is an element of the `fixed_addresses` enum and as I wrote above: Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses`. `PAGE_SHIFT` determines size of a page. For example size of the one page we can get with the `1 << PAGE_SHIFT`. In our case we need to get the size of the fix-mapped area, but not only of one page, that's why we are using `__end_of_permanent_fixed_addresses` for getting the size of the fix-mapped area. In my case it's a little more than `536` killobytes. In your case it might be a different number, because the size depends on amount of the fix-mapped addresses which are depends on your kernel's configuration. - -The second `FIXADDR_START` macro just extracts from the last address of the fix-mapped area its size for getting base virtual address of the fix-mapped area. `FIXADDR_TOP` is rounded up address from the base address of the [vsyscall](https://lwn.net/Articles/446528/) space: - -```C -#define FIXADDR_TOP (round_up(VSYSCALL_ADDR + PAGE_SIZE, 1<= __end_of_fixed_addresses); - return __fix_to_virt(idx); -} -``` - -first of all it checks that the index given for the `fixed_addresses` enum is not greater or equal than `__end_of_fixed_addresses` with the `BUILD_BUG_ON` macro and then returns the result of the `__fix_to_virt` macro: - -```C -#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT)) -``` - -Here we shift left the given `fix-mapped` address index on the `PAGE_SHIFT` which determines size of a page as I wrote above and subtract it from the `FIXADDR_TOP` which is the highest address of the `fix-mapped` area. There is an inverse function for getting `fix-mapped` address from a virtual address: - -```C -static inline unsigned long virt_to_fix(const unsigned long vaddr) -{ - BUG_ON(vaddr >= FIXADDR_TOP || vaddr < FIXADDR_START); - return __virt_to_fix(vaddr); -} -``` - -`virt_to_fix` takes virtual address, checks that this address is between `FIXADDR_START` and `FIXADDR_TOP` and calls `__virt_to_fix` macro which implemented as: - -```C -#define __virt_to_fix(x) ((FIXADDR_TOP - ((x)&PAGE_MASK)) >> PAGE_SHIFT) -``` - -A PFN is simply an index within physical memory that is counted in page-sized units. PFN for a physical address could be trivially defined as (page_phys_addr >> PAGE_SHIFT); - -`__virt_to_fix` clears the first 12 bits in the given address, subtracts it from the last address the of `fix-mapped` area (`FIXADDR_TOP`) and shifts right result on `PAGE_SHIFT` which is `12`. Let me explain how it works. As I already wrote we will clear the first 12 bits in the given address with `x & PAGE_MASK`. As we subtract this from the `FIXADDR_TOP`, we will get the last 12 bits of the `FIXADDR_TOP` which are present. We know that the first 12 bits of the virtual address represent the offset in the page frame. With the shiting it on `PAGE_SHIFT` we will get `Page frame number` which is just all bits in a virtual address besides the first 12 offset bits. `Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel. `IDT` descriptor stored there, [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID stored in the `fix-mapped` area started from `FIX_TBOOT_BASE` index, [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. We used `fix-mapped` area in the early `ioremap` initialization. Let's look on it and try to understand what is it `ioremap`, how it is implemented in the kernel and how it is releated to the `fix-mapped` addresses. - -ioremap --------------------------------------------------------------------------------- - -Linux kernel provides many different primitives to manage memory. For this moment we will touch `I/O memory`. Every device is controlled by reading/writing from/to its registers. For example a driver can turn off/on a device by writing to its registers or get the state of a device by reading from its registers. Besides registers, many devices have buffers where a driver can write something or read from there. As we know for this moment there are two ways to access device's registers and data buffers: - -* through the I/O ports; -* mapping of the all registers to the memory address space; - -In the first case every control register of a device has a number of input and output port. And driver of a device can read from a port and write to it with two `in` and `out` instructions which we already saw. If you want to know about currently registered port regions, you can know they by accessing of `/proc/ioports`: - -``` -$ cat /proc/ioports -0000-0cf7 : PCI Bus 0000:00 - 0000-001f : dma1 - 0020-0021 : pic1 - 0040-0043 : timer0 - 0050-0053 : timer1 - 0060-0060 : keyboard - 0064-0064 : keyboard - 0070-0077 : rtc0 - 0080-008f : dma page reg - 00a0-00a1 : pic2 - 00c0-00df : dma2 - 00f0-00ff : fpu - 00f0-00f0 : PNP0C04:00 - 03c0-03df : vesafb - 03f8-03ff : serial - 04d0-04d1 : pnp 00:06 - 0800-087f : pnp 00:01 - 0a00-0a0f : pnp 00:04 - 0a20-0a2f : pnp 00:04 - 0a30-0a3f : pnp 00:04 -0cf8-0cff : PCI conf1 -0d00-ffff : PCI Bus 0000:00 -... -... -... -``` - -`/proc/ioporst` provides information about what driver used address of a `I/O` ports region. All of these memory regions, for example `0000-0cf7`, were claimed with the `request_region` function from the [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h). Actually `request_region` is a macro which defied as: - -```C -#define request_region(start,n,name) __request_region(&ioport_resource, (start), (n), (name), 0) -``` - -As we can see it takes three parameters: - -* `start` - begin of region; -* `n` - length of region; -* `name` - name of requester. - -`request_region` allocates `I/O` port region. Very often `check_region` function called before the `request_region` to check that the given address range is available and `release_region` to release memory region. `request_region` returns pointer to the `resource` structure. `resource` structure presents abstraction for a tree-like subset of system resources. We already saw `resource` structure in the firth part about kernel [initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) process and it looks as: - -```C -struct resource { - resource_size_t start; - resource_size_t end; - const char *name; - unsigned long flags; - struct resource *parent, *sibling, *child; -}; -``` - -and contains start and end addresses of the resource, name and etc... Every `resource` structure contains pointers to the `parent`, `slibling` and `child` resources. As it has parent and childs, it means that every subset of resuorces has root `resource` structure. For example, for `I/O` ports it is `ioport_resource` structure: - -```C -struct resource ioport_resource = { - .name = "PCI IO", - .start = 0, - .end = IO_SPACE_LIMIT, - .flags = IORESOURCE_IO, -}; -EXPORT_SYMBOL(ioport_resource); -``` - -Or for `iomem`, it is `iomem_resource` structure: - -```C -struct resource iomem_resource = { - .name = "PCI mem", - .start = 0, - .end = -1, - .flags = IORESOURCE_MEM, -}; -``` - -As I wrote about `request_regions` is used for registering of I/O port region and this macro used in many [places](http://lxr.free-electrons.com/ident?i=request_region) in the kernel. For example let's look at [drivers/char/rtc.c](https://github.com/torvalds/linux/blob/master/char/rtc.c). This source code file provides [Real Time Clock](http://en.wikipedia.org/wiki/Real-time_clock) interface in the linux kernel. As every kernel module, `rtc` module contains `module_init` definition: - -```C -module_init(rtc_init); -``` - -where `rtc_init` is `rtc` initialization function. This function defined in the same `rtc.c` source code file. In the `rtc_init` function we can see a couple calls of the `rtc_request_region` functions, which wrap `request_region` for example: - -```C -r = rtc_request_region(RTC_IO_EXTENT); -``` - -where `rtc_request_region` calls: - -```C -r = request_region(RTC_PORT(0), size, "rtc"); -``` - -Here `RTC_IO_EXTENT` is a size of memory region and it is `0x8`, `"rtc"` is a name of region and `RTC_PORT` is: - -```C -#define RTC_PORT(x) (0x70 + (x)) -``` - -So with the `request_region(RTC_PORT(0), size, "rtc")` we register memory region, started at `0x70` and with size `0x8`. Let's look on the `/proc/ioports`: - -``` -~$ sudo cat /proc/ioports | grep rtc -0070-0077 : rtc0 -``` - -So, we got it! Ok, it was ports. The second way is use of `I/O` memory. As I wrote above this way is mapping of control registers and memory of a device to the memory address space. `I/O` memory is a set of contiguous addresses which are provided by a device to CPU through a bus. All memory-mapped I/O addresses are not used by the kernel directly. There is a special `ioremap` function which allows us to covert the physical address on a bus to the kernel virtual address or in another words `ioremap` maps I/O physical memory region to access it from the kernel. The `ioremap` function takes two parameters: - -* start of the memory region; -* size of the memory region; - -I/O memory mapping API provides function for the checking, requesting and release of a memory region as this does I/O ports API. There are three functions for it: - -* `request_mem_region` -* `release_mem_region` -* `check_mem_region` - -``` -~$ sudo cat /proc/iomem -... -... -... -be826000-be82cfff : ACPI Non-volatile Storage -be82d000-bf744fff : System RAM -bf745000-bfff4fff : reserved -bfff5000-dc041fff : System RAM -dc042000-dc0d2fff : reserved -dc0d3000-dc138fff : System RAM -dc139000-dc27dfff : ACPI Non-volatile Storage -dc27e000-deffefff : reserved -defff000-deffffff : System RAM -df000000-dfffffff : RAM buffer -e0000000-feafffff : PCI Bus 0000:00 - e0000000-efffffff : PCI Bus 0000:01 - e0000000-efffffff : 0000:01:00.0 - f7c00000-f7cfffff : PCI Bus 0000:06 - f7c00000-f7c0ffff : 0000:06:00.0 - f7c10000-f7c101ff : 0000:06:00.0 - f7c10000-f7c101ff : ahci - f7d00000-f7dfffff : PCI Bus 0000:03 - f7d00000-f7d3ffff : 0000:03:00.0 - f7d00000-f7d3ffff : alx -... -... -... -``` - -Part of these addresses is from the call of the `e820_reserve_resources` function. We can find call of this function in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and the function itself defined in the [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c). `e820_reserve_resources` goes through the [e820](http://en.wikipedia.org/wiki/E820) map and inserts memory regions to the root `iomem` resource structure. All `e820` memory regions which are will be inserted to the `iomem` resource will have following types: - -```C -static inline const char *e820_type_to_string(int e820_type) -{ - switch (e820_type) { - case E820_RESERVED_KERN: - case E820_RAM: return "System RAM"; - case E820_ACPI: return "ACPI Tables"; - case E820_NVS: return "ACPI Non-volatile Storage"; - case E820_UNUSABLE: return "Unusable memory"; - default: return "reserved"; - } -} -``` - -and we can see it in the `/proc/iomem` (read above). - -Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read this part, you can remember the call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). Initialization of the `ioremap` is split inn two parts: there is the early part which we can use before the normal `ioremap` is available and the normal `ioremap` which is available after `vmalloc` initialization and call of the `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary: - -```C -BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1)); -``` - -more about `BUILD_BUG_ON` you can read in the first part about [Linux Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html). So `BUILD_BUG_ON` macro raises compilation error if the given expression is true. In the next step after this check, we can see call of the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c). This function presents generic initialization of the `ioremap`. `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps are after `__end_of_permanent_fixed_addresses` in memory. They are stats from the `FIX_BITMAP_BEGIN` (top) and ends with `FIX_BITMAP_END` (down). Actually there are `512` temporary boot-time mappings, used by early `ioremap`: - -``` -#define NR_FIX_BTMAPS 64 -#define FIX_BTMAPS_SLOTS 8 -#define TOTAL_FIX_BTMAPS (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS) -``` - -and `early_ioremap_setup`: - -```C -void __init early_ioremap_setup(void) -{ - int i; - - for (i = 0; i < FIX_BTMAPS_SLOTS; i++) - if (WARN_ON(prev_map[i])) - break; - - for (i = 0; i < FIX_BTMAPS_SLOTS; i++) - slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i); -} -``` - -the `slot_virt` and other arrays are defined in the same source code file: - -```C -static void __iomem *prev_map[FIX_BTMAPS_SLOTS] __initdata; -static unsigned long prev_size[FIX_BTMAPS_SLOTS] __initdata; -static unsigned long slot_virt[FIX_BTMAPS_SLOTS] __initdata; -``` - -`slot_virt` contains virtual addresses of the `fix-mapped` areas, `prev_map` array contains addresses of the early ioremap areas. Note that I wrote above: `Actually there are 512 temporary boot-time mappings, used by early ioremap` and you can see that all arrays defined with the `__initdata` attribute which means that this memory will be released after kernel initialization process. After `early_ioremap_setup` finished to work, we're getting page middle directory where early ioremap beginning with the `early_ioremap_pmd` function which just gets the base address of the page global directory and calculates the page middle directory for the given address: - -```C -static inline pmd_t * __init early_ioremap_pmd(unsigned long addr) -{ - pgd_t *base = __va(read_cr3()); - pgd_t *pgd = &base[pgd_index(addr)]; - pud_t *pud = pud_offset(pgd, addr); - pmd_t *pmd = pmd_offset(pud, addr); - return pmd; -} -``` - -After this we fills `bm_pte` (early ioremap page table entries) with zeros and call the `pmd_populate_kernel` function: - -```C -pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN)); -memset(bm_pte, 0, sizeof(bm_pte)); -pmd_populate_kernel(&init_mm, pmd, bm_pte); -``` - -`pmd_populate_kernel` takes three parameters: - -* `init_mm` - memory descriptor of the `init` process (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html)); -* `pmd` - page middle directory of the beginning of the `ioremap` fixmaps; -* `bm_pte` - early `ioremap` page table entries array which defined as: - -```C -static pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)] __page_aligned_bss; -``` - -The `pmd_popularte_kernel` function defined in the [arch/x86/include/asm/pgalloc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgalloc.) and populates given page middle directory (`pmd`) with the given page table entries (`bm_pte`): - -```C -static inline void pmd_populate_kernel(struct mm_struct *mm, - pmd_t *pmd, pte_t *pte) -{ - paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT); - set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE)); -} -``` - -where `set_pmd` is: - -```C -#define set_pmd(pmdp, pmd) native_set_pmd(pmdp, pmd) -``` - -and `native_set_pmd` is: - -```C -static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd) -{ - *pmdp = pmd; -} -``` - -That's all. Early `ioremap` is ready to use. There are a couple of checks in the `early_ioremap_init` function, but they are not so important, anyway initialization of the `ioremap` is finished. - -Use of early ioremap --------------------------------------------------------------------------------- - -As early `ioremap` is setup, we can use it. It provides two functions: - -* early_ioremap -* early_iounmap - -for mapping/unmapping of IO physical address to virtual address. Both functions depends on `CONFIG_MMU` configuration option. [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit) is a special block of memory management. Main purpose of this block is translation physical addresses to the virtual. Techinically memory management unit knows about high-level page table address (`pgd`) from the `cr3` control register. If `CONFIG_MMU` options is set to `n`, `early_ioremap` just returns the given physical address and `early_iounmap` does not nothing. In other way, if `CONFIG_MMU` option is set to `y`, `early_ioremap` calls `__early_ioremap` which takes three parameters: - -* `phys_addr` - base physicall address of the `I/O` memory region to map on virtual addresses; -* `size` - size of the `I/O` memroy region; -* `prot` - page table entry bits. - -First of all in the `__early_ioremap`, we goes through the all early ioremap fixmap slots and check first free are in the `prev_map` array and remember it's number in the `slot` variable and set up size as we found it: - -```C -slot = -1; -for (i = 0; i < FIX_BTMAPS_SLOTS; i++) { - if (!prev_map[i]) { - slot = i; - break; - } -} -... -... -... -prev_size[slot] = size; -last_addr = phys_addr + size - 1; -``` - - -In the next spte we can see the following code: - -```C -offset = phys_addr & ~PAGE_MASK; -phys_addr &= PAGE_MASK; -size = PAGE_ALIGN(last_addr + 1) - phys_addr; -``` - -Here we are using `PAGE_MASK` for clearing all bits in the `phys_addr` besides first 12 bits. `PAGE_MASK` macro defined as: - -```C -#define PAGE_MASK (~(PAGE_SIZE-1)) -``` - -We know that size of a page is 4096 bytes or `1000000000000` in binary. `PAGE_SIZE - 1` will be `111111111111`, but with `~`, we will get `000000000000`, but as we use `~PAGE_MASK` we will get `111111111111` again. On the second line we do the same but clear first 12 bits and getting page-aligned size of the area on the third line. We getting aligned area and now we need to get the number of pages which are occupied by the new `ioremap` are and calculate the fix-mapped index from `fixed_addresses` in the next steps: - -```C -nrpages = size >> PAGE_SHIFT; -idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot; -``` - -Now we can fill `fix-mapped` area with the given physical addresses. Every iteration in the loop, we call `__early_set_fixmap` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c), increase given physical address on page size which is `4096` bytes and update `addresses` index and number of pages: - -```C -while (nrpages > 0) { - __early_set_fixmap(idx, phys_addr, prot); - phys_addr += PAGE_SIZE; - --idx; - --nrpages; -} -``` - -The `__early_set_fixmap` function gets the page table entry (stored in the `bm_pte`, see above) for the given physical address with: - -```C -pte = early_ioremap_pte(addr); -``` - -In the next step of the `early_ioremap_pte` we check the given page flags with the `pgprot_val` macro and calls `set_pte` or `pte_clear` depends on it: - -```C -if (pgprot_val(flags)) - set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags)); - else - pte_clear(&init_mm, addr, pte); -``` - -As you can see above, we passed `FIXMAP_PAGE_IO` as flags to the `__early_ioremap`. `FIXMPA_PAGE_IO` expands to the: - -```C -(__PAGE_KERNEL_EXEC | _PAGE_NX) -``` - -flags, so we call `set_pte` function for setting page table entry which works in the same manner as `set_pmd` but for PTEs (read above about it). As we set all `PTEs` in the loop, we can see the call of the `__flush_tlb_one` function: - -```C -__flush_tlb_one(addr); -``` - -This function defined in the [arch/x86/include/asm/tlbflush.h](https://github.com/torvalds/linux/blob/master) and calls `__flush_tlb_single` or `__flush_tlb` depends on value of the `cpu_has_invlpg`: - -```C -static inline void __flush_tlb_one(unsigned long addr) -{ - if (cpu_has_invlpg) - __flush_tlb_single(addr); - else - __flush_tlb(); -} -``` - -`__flush_tlb_one` function invalidates given address in the [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer). As you just saw we updated paging structure, but `TLB` not informed of changes, that's why we need to do it manually. There are two ways how to do it. First is update `cr3` control register and `__flush_tlb` function does this: - -```C -native_write_cr3(native_read_cr3()); -``` - -The second method is to use `invlpg` instruction invalidates `TLB` entry. Let's look on `__flush_tlb_one` implementation. As you can see first of all it checks `cpu_has_invlpg` which defined as: - -```C -#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64) -# define cpu_has_invlpg 1 -#else -# define cpu_has_invlpg (boot_cpu_data.x86 > 3) -#endif -``` - -If a CPU support `invlpg` instruction, we call the `__flush_tlb_single` macro which expands to the call of the `__native_flush_tlb_single`: - -```C -static inline void __native_flush_tlb_single(unsigned long addr) -{ - asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); -} -``` - -or call `__flush_tlb` which just updates `cr3` register as we saw it above. After this step execution of the `__early_set_fixmap` function is finsihed and we can back to the `__early_ioremap` implementation. As we set fixmap area for the given address, need to save the base virtual address of the I/O Re-mapped area in the `prev_map` with the `slot` index: - -```C -prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]); -``` - -and return it. - -The second function is - `early_iounmap` - unmaps an `I/O` memory region. This function takes two parameters: base address and size of a `I/O` region and generally looks very similar on `early_ioremap`. It also goes through fixmap slots and looks for slot with the given address. After this it gets the index of the fixmap slot and calls `__late_clear_fixmap` or `__early_set_fixmap` depends on `after_paging_init` value. It calls `__early_set_fixmap` with on difference then it does `early_ioremap`: it passes `zero` as physicall address. And in the end it sets address of the I/O memory region to `NULL`: - -```C -prev_map[slot] = NULL; -``` - -That's all about `fixmaps` and `ioremap`. Of course this part does not cover full features of the `ioremap`, it was only early ioremap, but there is also normal ioremap. But we need to know more things than now before it. - -So, this is the end! - -Conclusion --------------------------------------------------------------------------------- - -This is the end of the second part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new). - -**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).** - -Links --------------------------------------------------------------------------------- - -* [apic](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) -* [vsyscall](https://lwn.net/Articles/446528/) -* [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) -* [Xen](http://en.wikipedia.org/wiki/Xen) -* [Real Time Clock](http://en.wikipedia.org/wiki/Real-time_clock) -* [e820](http://en.wikipedia.org/wiki/E820) -* [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit) -* [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer) -* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) -* [Linux kernel memory management Part 1.](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html)