boot dir created

This commit is contained in:
0xAX
2015-01-28 23:16:02 +06:00
parent 7060567758
commit 2cd7c2b1cc
2 changed files with 0 additions and 0 deletions

484
boot/linux-bootstrap-1.md Normal file
View File

@@ -0,0 +1,484 @@
Linux internals
================================================================================
Kernel booting process. Part 1.
--------------------------------------------------------------------------------
If you have read my previous [blog posts](http://0xax.blogspot.com/search/label/asm), you can see that some time ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. It is very interesting for me to understand how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works on low-level and many many other things. I decided to write yet another series of posts about the Linux kernel for **x86_64**.
Note that I'm not a professional kernel hacker, and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new). I appreciate it. All posts will also be accessible at [linux-internals](https://github.com/0xAX/linux-internals) and if you find something wrong with my English or post content, feel free to send pull request.
*Note that this isn't official documentation, just learning and sharing knowledge.*
**Required knowledge**
* Understanding C code
* Understanding assembly code (AT&T syntax)
Anyway, if you just started to learn some tools, I will try to explain some parts during this and following posts. Ok, little introduction finished and now we can start to dive into kernel and low-level stuff.
All code is actual for kernel - 3.18, if there are changes, I will update posts.
Magic power button, what's next?
--------------------------------------------------------------------------------
Despite that this is a series of posts about linux kernel, we will not start from kernel code (at least in this paragraph). Ok, you pressed magic power button on your laptop or desktop computer and it started to work. After the mother board sends a signal to the [power supply](http://en.wikipedia.org/wiki/Power_supply), the power supply provides the computer with the proper amount of electricity. Once motherboard receives the [power good signal](http://en.wikipedia.org/wiki/Power_good_signal), it tries to run the CPU. The CPU resets all leftover data in its registers and sets up predefined values for every register.
[80386](http://en.wikipedia.org/wiki/Intel_80386) and later CPUs defines the following predefined data in CPU registers after the computer resets:
```
IP 0xfff0
CS selector 0xf000
CS base 0xffff0000
```
The processor starts working in [real mode](http://en.wikipedia.org/wiki/Real_mode) now and we need to make a little retreat for understanding memory segmentation in this mode. Real mode is supported in all x86 compatible processors, from [8086](http://en.wikipedia.org/wiki/Intel_8086) to modern Intel 64bit CPUs. The 8086 processor had a 20 bit address bus, which means that it could work with 0-2^20 bytes address space (1 megabyte). But it only had 16 bit registers, and with 16 bit registers the maximum address is 2^16 or 0xffff (64 kilobytes). Memory segmentation was used to make use of all of the address space. All memory was divided into small, fixed-size segments of 65535 bytes, or 64 KB. Since we can not address memory behind 64 KB with 16 bit registers, another method to do it was devised. An address consists of two parts: the beginning address of the segment and the offset from the beginning of this segment. To get a physical address in memory, we need to multiply the segment part by 16 and add the offset part:
```
PhysicalAddress = Segment * 16 + Offset
```
For example `CS:IP` is `0x2000:0x0010`, physical address will be:
```python
>>> hex((0x2000 << 4) + 0x0010)
'0x20010'
```
But if we take the biggest segment part and offset: `0xffff:0xffff`, it will be:
```python
>>> hex((0xffff << 4) + 0xffff)
'0x10ffef'
```
which is 65519 bytes over first megabyte. Since only one megabyte is accessible in real mode, `0x10ffef` becomes `0x00ffef` with disabled [A20](http://en.wikipedia.org/wiki/A20_line).
Ok, now we know about real mode and memory addressing, let's get back to register values after reset.
`CS` register consists of two parts: the visible segment selector and hidden base address. We know predefined `CS` base and `IP` value, logical address will be:
```
0xffff0000:0xfff0
```
In this way starting address formed by adding the base address to the value in the EIP register:
```python
>>> 0xffff0000 + 0xfff0
'0xfffffff0'
```
We get `0xfffffff0` which is 4GB - 16 bytes. This point is the [Reset vector](http://en.wikipedia.org/wiki/Reset_vector). This is the memory location at which CPU expects to find the first instruction to execute after reset. It contains a [jump](http://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) instruction which usually points to the BIOS entry point. For example if we look in [coreboot](http://www.coreboot.org/) source code, we will see it:
```assembly
.section ".reset"
.code16
.globl reset_vector
reset_vector:
.byte 0xe9
.int _start - ( . + 2 )
...
```
We can see here jump instruction [opcode](http://ref.x86asm.net/coder32.html#xE9) - 0xe9 to the address `_start - ( . + 2)`. And we can see that `reset` section is 16 bytes and starts at `0xfffffff0`:
```
SECTIONS {
_ROMTOP = 0xfffffff0;
. = _ROMTOP;
.reset . : {
*(.reset)
. = 15 ;
BYTE(0x00);
}
}
```
Now BIOS has started to work. After all initializations and hardware checking, it needs to load operating system. BIOS tries to find bootable device which contains boot sector. Boot sector is the first sector on device (512 bytes) and contains sequence of `0x55` and `0xaa` at 511 and 512 byte. For example:
```assembly
[BITS 16]
[ORG 0x7c00]
jmp boot
boot:
mov al, '!'
mov ah, 0x0e
mov bh, 0x00
mov bl, 0x07
int 0x10
jmp $
times 510-($-$$) db 0
db 0x55
db 0xaa
```
Build and run it with:
```
nasm -f bin boot.nasm && qemu-system-x86_64 boot
```
We will see:
![Simple bootloader which prints only `!`](http://oi60.tinypic.com/2qbwup0.jpg)
In this example we can see that this code will be executed in 16 bit real mode and will start at 0x7c00 in memory. After the start it calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which just prints `!` symbol. It fills rest of 510 bytes with zeros and finish with two magic bytes 0xaa and 0x55.
Real world boot loader starts at the same point, ends with `0xaa55` bytes, but reads kernel code from device, loads it to memory, parses and passes boot parameters to kernel and etc... instead of printing one symbol :) Ok, so, from this moment BIOS handed control to the operating system bootloader and we can go ahead.
**NOTE**: as you can read above the CPU is in real mode. In real mode, calculating the physical address in memory is as follows:
```
PhysicalAddress = Segment * 16 + Offset
```
as I wrote above. But we have only 16 bit general purpose registers. The maximum value of 16 bit register is: `0xffff`; So if we take the biggest values, it will be:
```python
>>> hex((0xffff * 16) + 0xffff)
'0x10ffef'
```
Where `0x10ffef` is equal to `1mb + 64KB - 16b`. But [8086](http://en.wikipedia.org/wiki/Intel_8086) processor, which was first processor with real mode, had 20 bit address line, and `2^20 = 1048576.0` which is 1MB, so it means that actually available memory amount is 1MB.
General real mode memory map is:
```
0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
0x00000400 - 0x000004FF - BIOS Data Area
0x00000500 - 0x00007BFF - Unused
0x00007C00 - 0x00007DFF - Our Bootloader
0x00007E00 - 0x0009FFFF - Unused
0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
0x000B0000 - 0x000B7777 - Monochrome Video Memory
0x000B8000 - 0x000BFFFF - Color Video Memory
0x000C0000 - 0x000C7FFF - Video ROM BIOS
0x000C8000 - 0x000EFFFF - BIOS Shadow Area
0x000F0000 - 0x000FFFFF - System BIOS
```
But stop, at the beginning of post I wrote that first instruction executed by the CPU is located at address `0xfffffff0`, which is much bigger than `0xffff` (1MB). How can CPU access it in real mode? As I write about and you can read in [coreboot](http://www.coreboot.org/Developer_Manual/Memory_map) documentation:
```
0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
```
At the start of execution BIOS is not in RAM, it is located in ROM.
Bootloader
--------------------------------------------------------------------------------
Now BIOS has transferred control to the operating system bootloader and it needs to load operating system into the memory. There are a couple of bootloaders which can boot linux, like: [Grub2](http://www.gnu.org/software/grub/), [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project) and etc... Linux kernel has [Boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt) which describes how to load linux kernel.
Let us briefly consider how grub loads linux. GRUB2 execution starts from `grub-core/boot/i386/pc/boot.S`. It starts to load from device its own kernel (not to be confused with linux kernel) and executes `grub_main` after successfully loading.
`grub_main` initializes console, gets base address for modules, sets root device, loads/parses grub configuration file, loads modules etc... At the end of execution `grub_main` moves grub to normal mode. `grub_normal_execute` (from `grub-core/normal/main.c`) completes last preparation and shows a menu for selecting an operating system. When we select one of grub menu entries, `grub_menu_execute_entry` begins to be executed, which executes grub `boot` command. It starts to boot operating system.
As we can read in the kernel boot protocol, the bootloader must read and fill some fields of kernel setup header which starts at `0x01f1` offset from the kernel setup code. Kernel header [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) starts from:
```assembly
.globl hdr
hdr:
setup_sects: .byte 0
root_flags: .word ROOT_RDONLY
syssize: .long 0
ram_size: .word 0
vid_mode: .word SVGA_MODE
root_dev: .word 0
boot_flag: .word 0xAA55
```
The bootloader must fill this and the rest of the headers (only marked as `write` in the linux boot protocol, for example [this](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L354)) with values which it either got from command line or calculated. We will not see description and explanation of all fields of kernel setup header, we will get back to it when kernel uses it. Anyway, you can find description of any field in the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156).
As we can see in kernel boot protocol, the memory map will be the following after kernel loading:
```shell
| Protected-mode kernel |
100000 +------------------------+
| I/O memory hole |
0A0000 +------------------------+
| Reserved for BIOS | Leave as much as possible unused
~ ~
| Command line | (Can also be below the X+10000 mark)
X+10000 +------------------------+
| Stack/heap | For use by the kernel real-mode code.
X+08000 +------------------------+
| Kernel setup | The kernel real-mode code.
| Kernel boot sector | The kernel legacy boot sector.
X +------------------------+
| Boot loader | <- Boot sector entry point 0x7C00
001000 +------------------------+
| Reserved for MBR/BIOS |
000800 +------------------------+
| Typically used by MBR |
000600 +------------------------+
| BIOS use only |
000000 +------------------------+
```
So after the bootloader transferred control to the kernel, it starts somewhere at:
```
0x1000 + X + sizeof(KernelBootSector) + 1
```
where `X` is the address kernel bootsector loaded. In my case `X` is `0x10000` (), we can see it in memory dump:
![kernel first address](http://oi57.tinypic.com/16bkco2.jpg)
Ok, bootloader loaded linux kernel into memory, filled header fields and jumped to it. Now we can move directly to the kernel setup code.
Start of kernel setup
--------------------------------------------------------------------------------
Finally we are in the kernel. Technically kernel didn't run yet, first of all we need to setup kernel, memory manager, process manager and etc... Kernel setup execution starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at the [_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L293). It is little strange at the first look, there are many instructions before it. Actually....
Long time ago linux had its own bootloader, but now if you run for example:
```
qemu-system-x86_64 vmlinuz-3.18-generic
```
You will see:
![Try vmlinuz in qemu](http://oi60.tinypic.com/r02xkz.jpg)
Actually `header.S` starts from [MZ](http://en.wikipedia.org/wiki/DOS_MZ_executable) (see image above), error message printing and following [PE](http://en.wikipedia.org/wiki/Portable_Executable) header:
```assembly
#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.byte 0x4d
.byte 0x5a
#endif
...
...
...
pe_header:
.ascii "PE"
.word 0
```
It needs this for loading operating system with [UEFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface). Here we will not see how it works (will look into it in the next parts).
So actual kernel setup entry point is:
```
// header.S line 292
.globl _start
_start:
```
Bootloader (grub2 and others) knows about this point (`0x200` offset from `MZ`) and makes a jump directly to this point, despite the fact that `header.S` starts from `.bstext` section which prints error message:
```
//
// arch/x86/boot/setup.ld
//
. = 0; // current position
.bstext : { *(.bstext) } // put .bstext section to position 0
.bsdata : { *(.bsdata) }
```
So kernel setup entry point is:
```assembly
.globl _start
_start:
.byte 0xeb
.byte start_of_setup-1f
1:
//
// rest of the header
//
```
Here we can see `jmp` instruction opcode - `0xeb` to the `start_of_setup-1f` point. `Nf` notation means following: `2f` refers to the next local `2:` label. In our case it is label `1` which goes right after jump. It contains rest of setup [header](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156) and right after setup header we can see `.entrytext` section which starts at `start_of_setup` label.
Actually it's first code which starts to execute besides previous jump instruction. After kernel setup got the control from bootloader, first `jmp` instruction is located at `0x200` (first 512 bytes) offset from the start of kernel real mode. This we can read in linux kernel boot protocol and also see in grub2 source code:
```C
state.gs = state.fs = state.es = state.ds = state.ss = segment;
state.cs = segment + 0x20;
```
It means that segment registers will have following values after kernel setup starts to work:
```
fs = es = ds = ss = 0x1000
cs = 0x1020
```
for my case when kernel loaded at `0x10000`.
After jump to `start_of_setup`, needs to do following things:
* Be sure that all values of all segment registers are equal
* Setup correct stack if need
* Setup [bss](http://en.wikipedia.org/wiki/.bss)
* Jump to C code at [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c)
Let's look at implementation.
Segment registers align
--------------------------------------------------------------------------------
First of all it ensures that `ds` and `es` segment registers point to the same address and enables interrupts with `sti` instruction:
```assembly
movw %ds, %ax
movw %ax, %es
sti
```
As i wrote above, grub2 loads kernel setup code at `0x10000` address and `cs` at `0x1020` because execution doesn't start from the start of file, but from:
```
_start:
.byte 0xeb
.byte start_of_setup-1f
```
jump, which is 512 bytes offset from the [4d 5a](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L47). Also need to align `cs` from 0x10200 to 0x10000 as all other segment registers. After that we setup stack:
```assembly
pushw %ds
pushw $6f
lretw
```
push `ds` value to stack, and address of [6](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L494) label and execute `lretw` instruction. When we call `lretw`, it loads address of `6` label to [instruction pointer](http://en.wikipedia.org/wiki/Program_counter) register and `cs` with value of `ds`. After it we will have `ds` and `cs` with the same values.
Stack setup
--------------------------------------------------------------------------------
Actually, almost all of the setup code is preparation for C language environment in the real mode. The next [step](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L467) is checking of `ss` register value and making of correct stack if `ss` is wrong:
```assembly
movw %ss, %dx
cmpw %ax, %dx
movw %sp, %dx
je 2f
```
Generally, it can be 3 different cases:
* `ss` has valid value 0x10000 (as all other segment registers beside `cs`)
* `ss` is invalid and `CAN_USE_HEAP` flag is set (see below)
* `ss` is invalid and `CAN_USE_HEAP` flag is not set (see below)
Let's look at all of these cases:
1. `ss` has a correct address (0x10000). In this case we go to [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481) label:
```
2: andw $~3, %dx
jnz 3f
movw $0xfffc, %dx
3: movw %ax, %ss
movzwl %dx, %esp
sti
```
Here we can see aligning of `dx` (contains `sp` given by bootloader) to 4 bytes and checking that it is not zero. If it is zero we put `0xfffc` (4 byte aligned address before maximum segment size - 64 KB) to `dx`. If it is not zero we continue to use `sp` given by bootloader (0xf7f4 in my case). After this we put `ax` value to `ss` which stores correct segment address `0x10000` and set up correct `sp`. After it we have correct stack:
![stack](http://oi58.tinypic.com/16iwcis.jpg)
2. In the second case (`ss` != `ds`), first of all put [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (address of end of setup code) value in `dx`. And check `loadflags` header field with `testb` instruction too see if we can use heap or not. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as:
```C
#define LOADED_HIGH (1<<0)
#define QUIET_FLAG (1<<5)
#define KEEP_SEGMENTS (1<<6)
#define CAN_USE_HEAP (1<<7)
```
And as we can read in the boot protocol:
```
Field name: loadflags
This field is a bitmask.
Bit 7 (write): CAN_USE_HEAP
Set this bit to 1 to indicate that the value entered in the
heap_end_ptr is valid. If this field is clear, some setup code
functionality will be disabled.
```
If `CAN_USE_HEAP` bit is set, put `heap_end_ptr` to `dx` which points to `_end` and add `STACK_SIZE` (minimal stack size - 512 bytes) to it. After this if `dx` is not carry, jump to `2` (it will be not carry, dx = _end + 512) label as in previous case and make correct stack.
![stack](http://oi62.tinypic.com/dr7b5w.jpg)
3. The last case when `CAN_USE_HEAP` is not set, we just use minimal stack from `_end` to `_end + STACK_SIZE`:
![minimal stack](http://oi60.tinypic.com/28w051y.jpg)
Bss setup
--------------------------------------------------------------------------------
Last two steps before we can jump to see code need to setup [bss](http://en.wikipedia.org/wiki/.bss) and check magic signature. Signature checking:
```assembly
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
```
just consists of comparing of [setup_sig](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L39) and `0x5a5aaa55` number, and if they are not equal jump to error printing.
Ok now we have correct segment registers, stack, need only setup bss and jump to C code. Bss section used for storing statically allocated uninitialized data. Here is the code:
```assembly
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep; stosl
```
First of all we put [__bss_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L47) address in `di` and `_end + 3` (+3 - align to 4 bytes) in `cx`. Clear `eax` register with `xor` instruction and calculate size of BSS section (put in `cx`). Divide `cx` by 4 and repeat `cx` times `stosl` instruction which stores value of `eax` (it is zero) and increase `di`by the size of `eax`. In this way, we write zeros from `__bss_start` to `_end`:
![bss](http://oi59.tinypic.com/29m2eyr.jpg)
Jump to main
--------------------------------------------------------------------------------
That's all, we have stack, bss and now we can jump to `main` C function:
```assembly
calll main
```
which is in [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). What will be there? We will see it in the next part.
Conclusion
--------------------------------------------------------------------------------
This is the end of the first part about linux kernel internals. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part we will see first C code which executes in linux kernel setup, implementation of memory routines as memset, memcpy, `earlyprintk` implementation and early console initialization and many more.
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
Links
--------------------------------------------------------------------------------
* [Intel 80386 programmer's reference manual 1986](http://css.csail.mit.edu/6.858/2014/readings/i386.pdf)
* [Minimal Boot Loader for Intel® Architecture](https://www.cs.cmu.edu/~410/doc/minimal_boot.pdf)
* [8086](http://en.wikipedia.org/wiki/Intel_8086)
* [80386](http://en.wikipedia.org/wiki/Intel_80386)
* [Reset vector](http://en.wikipedia.org/wiki/Reset_vector)
* [Real mode](http://en.wikipedia.org/wiki/Real_mode)
* [Linux kernel boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt)
* [CoreBoot developer manual](http://www.coreboot.org/Developer_Manual)
* [Ralf Brown's Interrupt List](http://www.ctyme.com/intr/int.htm)
* [Power supply](http://en.wikipedia.org/wiki/Power_supply)
* [Power good signal](http://en.wikipedia.org/wiki/Power_good_signal)

467
boot/linux-bootstrap-2.md Normal file
View File

@@ -0,0 +1,467 @@
Linux internals
================================================================================
Kernel booting process. Part 2.
--------------------------------------------------------------------------------
We started to dive into linux kernel internals in the previous [part](https://github.com/0xAX/linux-insides/blob/master/linux-bootstrap-1.md) and saw the initial part of the kernel setup code. We stopped at the first call of the `main` function (which is the first function written in C) from [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). Here we will continue to research of the kernel setup code and see what is `protected mode`, some preparation for transition to it, the heap and console initialization, memory detection and many many more. So... Let's go ahead.
Protected mode
--------------------------------------------------------------------------------
Before we can move to the native Intel64 [Long mode](http://en.wikipedia.org/wiki/Long_mode), the kernel must switch the CPU into protected mode. What is it protected mode? Protected mode was first added to the x86 architecture in 1982 and was the main mode of Intel processors from [80286](http://en.wikipedia.org/wiki/Intel_80286) processor until Intel 64 and long mode. Main reason to move from the real mode that there is very limited access to the RAM. As you can remember from the previous part, there is only 2^20 bytes or 1 megabyte, or even less 640 kilobytes.
Protected mode brought many changes, but main is another memory management. 24-bit address bus was replaced with 32-bit address bus. It gives 4 gigabytes of physical address. Also [paging](http://en.wikipedia.org/wiki/Paging) support was added which we will see in next parts.
Memory management in the protected mode is divided into two, almost independent parts:
* Segmentation
* Paging
Here we can see only segmentation. As you can read in previous part, address consists from two parts in the real mode:
* Base address of segment
* Offset from the segment base
And we can get physical address if we know these two parts by:
```
PhysicalAddress = Segment * 16 + Offset
```
Memory segmentation was completely redone in the protected mode. There are no 64 kilobytes fixed-size segments. All memory segments described by `Global Descriptor Table` (GDT) instead segment registers. GDT is a structure which contains in memory. There is no fixed place for it in memory, but it's address stored in the special register - `GDTR`. Later, when we will see the GDT loading in the linux kernel code. There will be operation for loading it in memory, something like:
```assembly
lgdt gdt
```
where `lgdt` instruction loads base address and limit of global descriptor table to the `GDTR` register. `GDTR` is 48-bit register and consists of two parts:
* size - 16 bit of global descriptor table;
* address - 32-bit of the global descriptor table.
Global descriptor table contains `descriptors` which describes memory segment. Every descriptor is 64-bit. General scheme of descriptor is:
```
31 24 19 16 7 0
------------------------------------------------------------
| | |B| |A| | | | |0|E|W|A| |
| BASE 31..24 |G|/|L|V| LIMIT |P|DPL|S| TYPE | BASE 23:16 | 4
| | |D| |L| 19..16| | | |1|C|R|A| |
------------------------------------------------------------
| | |
| BASE 15..0 | LIMIT 15..0 | 0
| | |
------------------------------------------------------------
```
Don't worry, i know that it looks little scary after real mode, but it's easy. Let's look on it closer:
1. Limit (0 - 15 bits) defines a `length_of_segment - 1`. It depends on `G` bit.
* if `G` (55-bit) is 0 and segment limit is 0 - size of segment - 1 byte
* if `G` is 1 and segment limit is 0 - size of segment 4096 bytes
* if `G` is 0 and segment limit is 0xfffff - size of segment 1 megabyte
* if `G` is 1 and segment limit is 0xfffff - size of segment 4 gigabytes
2. Base (0-15, 32-39 and 56-63 bits) defines physical address of segment's start address.
3. Type (40-47 bits) defines type of segment and kinds of access to it. Next `S` flag specifies descriptor type. if `S` is 0 - this segment is system segment, if `S` is 1 - code or data segment (Stack segments are data segments which must be read/write segments). If segment is a code or data segment, it can be one of the following access types:
```
| Type Field | Descriptor Type | Description
|-----------------------------|-----------------|------------------
| Decimal | |
| 0 E W A | |
| 0 0 0 0 0 | Data | Read-Only
| 1 0 0 0 1 | Data | Read-Only, accessed
| 2 0 0 1 0 | Data | Read/Write
| 3 0 0 1 1 | Data | Read/Write, accessed
| 4 0 1 0 0 | Data | Read-Only, expand-down
| 5 0 1 0 1 | Data | Read-Only, expand-down, accessed
| 6 0 1 1 0 | Data | Read/Write, expand-down
| 7 0 1 1 1 | Data | Read/Write, expand-down, accessed
| C R A | |
| 8 1 0 0 0 | Code | Execute-Only
| 9 1 0 0 1 | Code | Execute-Only, accessed
| 10 1 0 1 0 | Code | Execute/Read
| 11 1 0 1 1 | Code | Execute/Read, accessed
| 12 1 1 0 0 | Code | Execute-Only, conforming
| 14 1 1 0 1 | Code | Execute-Only, conforming, accessed
| 13 1 1 1 0 | Code | Execute/Read, conforming
| 15 1 1 1 1 | Code | Execute/Read, conforming, accessed
```
As we can see first bit is 0 for data segment and 1 for code segment. Next three bits `EWA` are expansion direction (expand-down segment will grow down, more about it you can read [here](http://www.sudleyplace.com/dpmione/expanddown.html)), write enable and accessed for data segments. `CRA` bits are conforming (A transfer of execution into a more-privileged conforming segment allows execution to continue at the current privilege level), read enable and accessed.
4. DPL (descriptor privilege level) defines the privilege level of the segment. I can be 0-3 where 0 is the most privileged.
5. P flag - indicates if segment is presented in memory or not.
6. AVL flag - Available and reserved bits.
7. L flag - indicates whether a code segment contains native 64-bit code. If 1 than code segment executes in 64 bit mode.
8. B/D flag - default operation size/default stack pointer size and/or upper bound.
Segment registers doesn't contain base address of the segment as in the real mode. Instead it contains special structure - `segment selector`. `Selector` is a 16-bit structure:
```
-----------------------------
| Index | TI | RPL |
-----------------------------
```
Where `Index` shows the index number of the descriptor in descriptor table. `TI` shows where to search the descriptor: in the global descriptor table or local. And `RPL` is privilege level.
Every segment register has visible and hidden part. When selector is loaded into one of the segment registers, it will be stored into visible part. Hidden part contains base address, limit and access information of descriptor which pointed by the selector. There are following steps for getting physical address in the protected mode:
* Segment selector must be loaded in one of the segment registers;
* CPU tries to find (by GDT address + Index from selector) and loads descriptor into the hidden part of segment register;
* Base address (from segment descriptor) + offset will be linear address of segment which is physical address (if paging is disabled).
Schematically it will look like this:
![linear address](http://oi62.tinypic.com/2yo369v.jpg)
Algorithm for transition from the real mode into protected mode is:
* Disable interrupts;
* Describe and load GDT with `lgdt` instruction;
* Set PE (Protection Enable) bit in CR0 (Control Register 0);
* Jump to protected mode code;
We will see transition to the protected mode in the linux kernel in the next part, but before we can move to protected mode, we need to do some preparations.
Let's look on [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). We can see some routines there which make keyboard initialization, heap initialization and etc... Let's look on it.
Copying boot parameters into "zeropage"
--------------------------------------------------------------------------------
We will start from the `main` routine in the main.c. First function which called in the `main` is a [copy_boot_params](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L30). It copies kernel setup header into the field of `boot_params` structure which defined in the [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L113).
`boot_params` structure contains `struct setup_header hdr` field. This structure contains the same fields as defined in [linux boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) and filled by boot loader and also in the kernel building time. `copy_boot_params` function does two things: copies `hdr` from [header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L281) to the `boot_params` structure in `setup_header` field and updates pointer to the kernel command line if the kernel was loaded with old command line protocol.
You can note that It copies `hdr` with `memcpy` function which is defined in the [copy.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/copy.S) source file. Let's look on it:
```assembly
GLOBAL(memcpy)
pushw %si
pushw %di
movw %ax, %di
movw %dx, %si
pushw %cx
shrw $2, %cx
rep; movsl
popw %cx
andw $3, %cx
rep; movsb
popw %di
popw %si
retl
ENDPROC(memcpy)
```
Yeah, we just moved to C code and assembly again :) First of all we can see that `memcpy` and other routines which are defined here, starts and ends with the two macro: `GLOBAL` and `ENDPROC`. GLOBAL described in the [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) which defines `globl` directive and label for it. ENDPROC described in the [include/linux/linkage.h](https://github.com/torvalds/linux/blob/master/include/linux/linkage.h) which marks `name` symbol as function name and ends with the size of the `name` symbol.
Implementation of the `memcpy` is easy. At first, it pushes values from `si` and `di` registers to the stack because their values will change in the `memcpy`, so push it in the stack to preserve their values. `memcpy` (and other functions in copy.S) uses `fastcall` calling conventions. So it gets incoming parameters from the `ax`, `dx` and `cx` registers. Call of `memcpy` looks as:
```C
memcpy(&boot_params.hdr, &hdr, sizeof hdr);
```
So `ax` will contain address of the `boot_params.hdr`, `dx` will contain address of the `hdr` and `cx` will contain size of the `hdr` in bytes. memcpy puts address of `boot_params.hdr` to the `di` register and address of `hdr` to `si` and saves size in stack. After this it shifts to the right on 2 size (or divide on 4) and copies from `si` to `di` by 4 bytes. After it we restore size of `hdr` again, align it by 4 bytes and copy rest of bytes from `si` to `di` by one byte (if there is rest). Restore `si` and `di` values from the stack in the end and after this copying finished.
Console initialization
--------------------------------------------------------------------------------
After the `hdr` has copied into the `boot_params.hdr`, next step is console initialization by call of `console_init` function which defined in the [arch/x86/boot/early_serial_console.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/early_serial_console.c).
It tries to find `earlyprintk` option in the command line and if the search was successful, it parses port address and baud rate of the serial port and initializes serial port. Value of `earlyprintk` command line option can be one of the:
* serial,0x3f8,115200
* serial,ttyS0,115200
* ttyS0,115200
After serial port initialization we can see first output:
```C
if (cmdline_find_option_bool("debug"))
puts("early console in setup code\n");
```
`puts` definition is in the [tty.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/tty.c). As we can see it prints character by character in the loop with calling of `putchar` function. Let's look on `putchar` implementation:
```C
void __attribute__((section(".inittext"))) putchar(int ch)
{
if (ch == '\n')
putchar('\r');
bios_putchar(ch);
if (early_serial_base != 0)
serial_putchar(ch);
}
```
`__attribute__((section(".inittext")))` means that this code will be in the .inittext section. We can find it in the linker file [setup.ld](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L19).
First of all, `put_char` checks `\n` symbol and if it is, prints `\r` before. After it puts characters on VGA by bios with `0x10` interruption call:
```C
static void __attribute__((section(".inittext"))) bios_putchar(int ch)
{
struct biosregs ireg;
initregs(&ireg);
ireg.bx = 0x0007;
ireg.cx = 0x0001;
ireg.ah = 0x0e;
ireg.al = ch;
intcall(0x10, &ireg, NULL);
}
```
Where `initregs` takes `biosregs` structure and fills `biosregs` with zeros with `memset` function at first and than fills it with registers values.
```C
memset(reg, 0, sizeof *reg);
reg->eflags |= X86_EFLAGS_CF;
reg->ds = ds();
reg->es = ds();
reg->fs = fs();
reg->gs = gs();
```
Let's look on the [memset](https://github.com/torvalds/linux/blob/master/arch/x86/boot/copy.S#L36) implementation:
```assembly
GLOBAL(memset)
pushw %di
movw %ax, %di
movzbl %dl, %eax
imull $0x01010101,%eax
pushw %cx
shrw $2, %cx
rep; stosl
popw %cx
andw $3, %cx
rep; stosb
popw %di
retl
ENDPROC(memset)
```
As you can read above, it uses `fastcall` calling conventions as the `memcpy` function, it means that function gets parameters from `ax`, `dx` and `cx` registers.
Generally `memset` is like a memcpy implementation. It saves value of `di` register on the stack and puts `ax` value into `di` which is the address of `biosregs` structure. Next is `movzbl` instruction, which copies `dl` value to the low 2 bytes of `eax` register. The rest 2 high bytes of `eax` will be filled by zeros.
The next instruction multiplies `eax` value on the `0x01010101`. It needs because `memset` will copy by 4 bytes at the time. For example we need to fill structure with `0x7` byte with memset. `eax` will contain `0x00000007` value in this case. So if we multiply `eax` on `0x01010101`, we will get `0x07070707` and now we can copy this 4 bytes into structure. `memset` uses `rep; stosl` instructions for copying `eax` into `es:di`.
The rest of the `memset` function does almost the same as `memcpy`.
After that `biosregs` structure filled with `memset`, `bios_putchar` calls [0x10](http://www.ctyme.com/intr/rb-0106.htm) interruption which prints a character. After it checks was serial port initialized or not and writes a character there with [serial_putchar](https://github.com/torvalds/linux/blob/master/arch/x86/boot/tty.c#L30) and `inb/outb` instructions if it was set.
Heap initialization
--------------------------------------------------------------------------------
After the stack and bss section were prepared in the [header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) (see previous [part](https://github.com/0xAX/linux-insides/blob/master/linux-bootstrap-1.md)), need to initialize the [heap](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) with the [init_heap](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) function.
First of all `init_heap` checks `CAN_USE_HEAP` flag from the `loadflags` kernel setup header and calculates end of the stack if this flag was set:
```C
char *stack_end;
if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
asm("leal %P1(%%esp),%0"
: "=r" (stack_end) : "i" (-STACK_SIZE));
```
or in other words `stack_end = esp - STACK_SIZE`.
Than there is `heap_end` calculation which is `heap_end_ptr` or `_end` + 512 and check if `heap_end` greater that `stack_end` makes it equal.
From this moment we can use the heap in the kernel setup code. We will see how to use it and how implemented API for it in next posts.
CPU validation
--------------------------------------------------------------------------------
The next step as we can see cpu validation by `validate_cpu` from [arch/x86/boot/cpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cpu.c).
It calls `check_cpu` function and passes cpu level and required cpu level to it and checks that kernel launched at the right cpu. It checks cpu's flags, presence of [long mode](http://en.wikipedia.org/wiki/Long_mode) (which we will see more details about it in the next parts) for x86_64, checks processor's vendor and makes preparation for еру certain vendor like turning off SSE+SSE2 for amd if they are missing and etc...
Memory detection
--------------------------------------------------------------------------------
The next step is memory detection by `detect_memory` function. It uses different programming interfaces for memory detection like `0xe820`, `0xe801` and `0x88`. We will see only implementation of 0xE820 here. Let's look on `detect_memory_e820` implementation from the [arch/x86/boot/memory.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/memory.c) source file. First of all, `detect_memory_e820` function initializes `biosregs` structure as we saw above and filled registers with special values for `0xe820` call:
```assembly
initregs(&ireg);
ireg.ax = 0xe820;
ireg.cx = sizeof buf;
ireg.edx = SMAP;
ireg.di = (size_t)&buf;
```
`ax` register must contain number of the function (0xe820 in our case), `cx` register contains size of the buffer which will contain data about memory, `edx` must contain `SMAP` magic number, `es:di` must contain address of the buffer which will contain memory data and `ebx` must be zero.
The next is loop, where will be collected data about the memory. It starts from the call of the 0x15 bios interruption, which writes one line from the address allocation table. For getting the next line need to call this interruption again (what we do in the loop). Before the every next call `ebx` must contain value returned by previous value:
```C
intcall(0x15, &ireg, &oreg);
ireg.ebx = oreg.ebx;
```
Ultimately, it makes iterations in the loop, collecting data from address allocation table and writes this data into array of `e820entry` array:
* start of memory segment
* size of memory segment
* type of memory segment (which can be reserved, usable and etc...).
You can see the result of execution in the `dmesg` output, something like:
```
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
[ 0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
```
Keyboard initialization
--------------------------------------------------------------------------------
The next step is initialization of the keyboard with the call of `keyboard_init` function. At first `keyboard_init` initializes registers with `initregs` function and call the [0x16](http://www.ctyme.com/intr/rb-1756.htm) interruption for getting keyboard status. After this it calls the [0x16](http://www.ctyme.com/intr/rb-1757.htm) again for setting repeat rate and delay.
Querying
--------------------------------------------------------------------------------
The next couple of steps are queries of different parameters. We will not dive into details about these queries, but will back to the all of it in the next parts. Let's make a short look on this functions:
The [query_mca](https://github.com/torvalds/linux/blob/master/arch/x86/boot/mca.c#L18) routine calls [0x15](http://www.ctyme.com/intr/rb-1594.htm) BIOS interruption for getting machine model number, sub-model number, BIOS revision level, and other hardware-specific attributes:
```C
int query_mca(void)
{
struct biosregs ireg, oreg;
u16 len;
initregs(&ireg);
ireg.ah = 0xc0;
intcall(0x15, &ireg, &oreg);
if (oreg.eflags & X86_EFLAGS_CF)
return -1; /* No MCA present */
set_fs(oreg.es);
len = rdfs16(oreg.bx);
if (len > sizeof(boot_params.sys_desc_table))
len = sizeof(boot_params.sys_desc_table);
copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);
return 0;
}
```
It fills `ah` register with `0xc0` value and calls the `0x15` BIOS interruption. After the interruption execution it checks [carry flag](http://en.wikipedia.org/wiki/Carry_flag) flag and if it set to 1, BIOS doesn't support `MCA`. If carry flag is et to 0, `ES:BX` will contain pointer to the system information table, which looks like this:
```
Offset Size Description )
00h WORD number of bytes following
02h BYTE model (see #00515)
03h BYTE submodel (see #00515)
04h BYTE BIOS revision: 0 for first release, 1 for 2nd, etc.
05h BYTE feature byte 1 (see #00510)
06h BYTE feature byte 2 (see #00511)
07h BYTE feature byte 3 (see #00512)
08h BYTE feature byte 4 (see #00513)
09h BYTE feature byte 5 (see #00514)
---AWARD BIOS---
0Ah N BYTEs AWARD copyright notice
---Phoenix BIOS---
0Ah BYTE ??? (00h)
0Bh BYTE major version
0Ch BYTE minor version (BCD)
0Dh 4 BYTEs ASCIZ string "PTL" (Phoenix Technologies Ltd)
---Quadram Quad386---
0Ah 17 BYTEs ASCII signature string "Quadram Quad386XT"
---Toshiba (Satellite Pro 435CDS at least)---
0Ah 7 BYTEs signature "TOSHIBA"
11h BYTE ??? (8h)
12h BYTE ??? (E7h) product ID??? (guess)
13h 3 BYTEs "JPN"
```
Next we call `set_fs` routine and pass the value of `es` register to it. Implementation of the `set_fs` is pretty simple:
```C
static inline void set_fs(u16 seg)
{
asm volatile("movw %0,%%fs" : : "rm" (seg));
}
```
There is inline assembly which gets value of the `seg` parameter and puts it to the `fs` register. There are many functions in the [boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h), like `set_fs`, there are `set_gs`, `fs`, `gs` for reading value in it and etc...
In the end of `query_mca` it just copies table which pointed by `es:bx` to the `boot_params.sys_desc_table`.
The next is getting [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep) information with the call of `query_ist` function. First of all it checks cpu level and if it is correct, calls `0x15` for getting info and saves result to `boot_params`.
The following [query_apm_bios](https://github.com/torvalds/linux/blob/master/arch/x86/boot/apm.c#L21) function which gets [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management) information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interruption too, but with `ah` - `0x53` to check `APM` installation. After the `0x15` execution, `query_apm_bios` functions checks `PM` signature (it must be `0x504d`), carry flag (it must be 0 if `APM` supported) and value of the `cx` register (if it's 0x02, protected mode interface supported).
The next it calls the `0x15` again, but with `ax = 0x5304` for disconnecting `APM` interface and connects 32bit protected mode interface. In the end it fills `boot_params.apm_bios_info` with values obtained from the BIOS.
Note that `query_apm_bios` will be executed only if `CONFIG_APM` or `CONFIG_APM_MODULE` was set in configuration file:
```C
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
query_apm_bios();
#endif
```
The last is [query_edd](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c#L122) function, which asks `Enhanced Disk Drive` information from the BIOS. Let's look on `query_edd` implementation.
First of all it reads [edd](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt#L1023) option from kernel's command line and if it was set to `off` than `query_edd` just returns.
If EDD is enabled, `query_edd` goes over BIOS-supported hard disks and queries EDD information in the loop:
```C
for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) {
if (!get_edd_info(devno, &ei) && boot_params.eddbuf_entries < EDDMAXNR) {
memcpy(edp, &ei, sizeof ei);
edp++;
boot_params.eddbuf_entries++;
}
...
...
...
```
where the `0x80` is the first hard drive and the `EDD_MBR_SIG_MAX` macro is 16. It collects data to the array of [edd_info](https://github.com/torvalds/linux/blob/master/include/uapi/linux/edd.h#L172) structures. `get_edd_info` checks that presence of EDD by invoking `0x13` interruption with `ah` is `0x41` and If EDD is presented, `get_edd_info` again calls `0x13` interruption, but with `ah` is `0x48` and `si` address of buffer where EDD informantion will be stored.
Conclusion
--------------------------------------------------------------------------------
It is the end of the second part about linux kernel internals. In next part we will see setting of video mode and the rest of preparation before transition to protected mode and directly transition into it.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
Links
--------------------------------------------------------------------------------
* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
* [Long mode](http://en.wikipedia.org/wiki/Long_mode)
* [How to Use Expand Down Segments on Intel 386 and Later CPUs](http://www.sudleyplace.com/dpmione/expanddown.html)
* [earlyprintk documentation](http://lxr.free-electrons.com/source/Documentation/x86/earlyprintk.txt)
* [Kernel Parameters](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)
* [Serial console](https://github.com/torvalds/linux/blob/master/Documentation/serial-console.txt)
* [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep)
* [APM](https://en.wikipedia.org/wiki/Advanced_Power_Management)
* [EDD specification](http://www.t13.org/documents/UploadedDocuments/docs2004/d1572r3-EDD3.pdf)
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/linux-bootstrap-1.md)