mirror of
https://github.com/MintCN/linux-insides-zh.git
synced 2026-04-29 05:00:52 +08:00
fix conflict
This commit is contained in:
@@ -22,7 +22,7 @@
 淘汰[实模式](http://wiki.osdev.org/Real_Mode)的主要原因是因为在实模式下,系统能够访问的内存非常有限。如果你还记得我们在上一节说的,在实模式下,系统最多只能访问1M内存,而且在很多时候,实际能够访问的内存只有640K。
-保护模式带来了很多的改变,不过主要的改变都集中在内存管理方法。在保护模式中,实模式的20位地址线被替换成32位地址线,因此系统可以访问多大4GB的地址空间。另外,在保护模式中引入了[内存分页](http://en.wikipedia.org/wiki/Paging)功能,在后面的章节中我们将介绍这个功能。
+保护模式带来了很多的改变,不过主要的改变都集中在内存管理方法。在保护模式中,实模式的20位地址线被替换成32位地址线,因此系统可以访问多达4GB的地址空间。另外,在保护模式中引入了[内存分页](http://en.wikipedia.org/wiki/Paging)功能,在后面的章节中我们将介绍这个功能。
 保护模式提供了2种完全不同的内存管理机制:
Concepts/cpumask.md — new file, 225 lines
@@ -0,0 +1,225 @@

CPU masks
================================================================================

Introduction
--------------------------------------------------------------------------------

`Cpumasks` is a special way provided by the Linux kernel to store information about CPUs in the system. The relevant source and header files which contain the API for `cpumask` manipulation are:

* [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h)
* [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c)
* [kernel/cpu.c](https://github.com/torvalds/linux/blob/master/kernel/cpu.c)

As the comment in [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h) says: cpumasks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. We already saw a bit of cpumask usage in the `boot_cpu_init` function from the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. This function marks the first boot CPU as online, active, present and possible:

```C
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
```

`cpu_possible` is the set of CPU IDs which can be plugged in at any time during the life of that system boot. `cpu_present` represents which CPUs are currently plugged in. `cpu_online` is a subset of `cpu_present` and indicates the CPUs which are available for scheduling. These masks depend on the `CONFIG_HOTPLUG_CPU` configuration option; if this option is disabled, `possible == present` and `active == online`. The implementations of all of these functions are very similar: each one checks its second parameter and, if it is `true`, calls `cpumask_set_cpu`, otherwise it calls `cpumask_clear_cpu`.
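The set/clear pattern behind these calls can be sketched in user-space C. This is a toy model, not the kernel implementation; the `my_*` names are illustrative:

```c
#include <stdbool.h>

/* One bit per CPU. my_set_cpu(..., true) behaves like cpumask_set_cpu,
 * my_set_cpu(..., false) like cpumask_clear_cpu. */
typedef struct { unsigned long bits; } my_cpumask;

static void my_set_cpu(my_cpumask *mask, unsigned int cpu, bool value)
{
    if (value)
        mask->bits |= 1UL << cpu;      /* like cpumask_set_cpu   */
    else
        mask->bits &= ~(1UL << cpu);   /* like cpumask_clear_cpu */
}

static bool my_cpu_isset(const my_cpumask *mask, unsigned int cpu)
{
    return mask->bits & (1UL << cpu);
}
```

So a hypothetical `set_cpu_online(0, true)` reduces to setting bit zero in the online mask, and `set_cpu_online(0, false)` to clearing it.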
There are two ways to create a `cpumask`. The first is to use `cpumask_t`, which is defined as:

```C
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
```

It wraps the `cpumask` structure, which contains a single bitmask field, `bits`. The `DECLARE_BITMAP` macro takes two parameters:

* bitmap name;
* number of bits.

and creates an array of `unsigned long` with the given name. Its implementation is pretty simple:

```C
#define DECLARE_BITMAP(name,bits) \
	unsigned long name[BITS_TO_LONGS(bits)]
```

where `BITS_TO_LONGS`:

```C
#define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
```

As we are focusing on the `x86_64` architecture, `unsigned long` is 8 bytes (64 bits) wide, so for any `NR_CPUS` up to 64 the array contains only one element; for example, with `NR_CPUS` equal to 8:

```
((8) + (64) - 1) / (64) = 1
```
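The same arithmetic can be checked in user space by reproducing the two kernel macros verbatim (only `BITS_PER_BYTE` is re-defined here, since we are outside the kernel headers):

```c
#include <stdio.h>

#define BITS_PER_BYTE 8
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
```

On a machine with 64-bit `long`, `BITS_TO_LONGS(8)` and `BITS_TO_LONGS(64)` are both `1`, while `BITS_TO_LONGS(65)` rounds up to `2`.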
The `NR_CPUS` macro represents the number of CPUs in the system and depends on the `CONFIG_NR_CPUS` macro, which is defined in [include/linux/threads.h](https://github.com/torvalds/linux/blob/master/include/linux/threads.h) and looks like this:

```C
#ifndef CONFIG_NR_CPUS
	#define CONFIG_NR_CPUS	1
#endif

#define NR_CPUS		CONFIG_NR_CPUS
```

The second way to define a cpumask is to use the `DECLARE_BITMAP` macro directly and the `to_cpumask` macro, which converts the given bitmap to a `struct cpumask *`:

```C
#define to_cpumask(bitmap)                                              \
	((struct cpumask *)(1 ? (bitmap)                                \
			    : (void *)sizeof(__check_is_bitmap(bitmap))))
```

We can see a ternary operator here whose condition is always `true`. The `__check_is_bitmap` inline function is defined as:

```C
static inline int __check_is_bitmap(const unsigned long *bitmap)
{
	return 1;
}
```

and always returns `1`. We need it here for only one purpose: at compile time it checks that the given `bitmap` actually is a bitmap, or in other words that it has the type `unsigned long *`. So we can just pass `cpu_possible_bits` to the `to_cpumask` macro to convert an array of `unsigned long` to a `struct cpumask *`.
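The trick can be reproduced outside the kernel. In this sketch (the `my_` names are made up for illustration) the ternary always takes the first branch at run time, but the never-taken second branch forces the compiler to type-check the argument:

```c
/* Minimal user-space reproduction of the to_cpumask type-check trick. */
struct my_cpumask { unsigned long bits[1]; };

static inline int my_check_is_bitmap(const unsigned long *bitmap)
{
    (void)bitmap;   /* only the parameter type matters */
    return 1;
}

#define my_to_cpumask(bitmap) \
    ((struct my_cpumask *)(1 ? (bitmap) \
        : (void *)sizeof(my_check_is_bitmap(bitmap))))
```

Passing anything other than an `unsigned long *` (for example an `int *`) makes `my_check_is_bitmap(bitmap)` fail to compile, which is exactly the compile-time guarantee the kernel wants.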
cpumask API
--------------------------------------------------------------------------------

Now that we can define a cpumask with either method, let's look at the API the Linux kernel provides for manipulating cpumasks. Consider one of the functions presented above, for example `set_cpu_online`. This function takes two parameters:

* number of the CPU;
* CPU status.

Its implementation looks as follows:

```C
void set_cpu_online(unsigned int cpu, bool online)
{
	if (online) {
		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
	} else {
		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
	}
}
```

First of all it checks the second parameter, `online`, and calls `cpumask_set_cpu` or `cpumask_clear_cpu` depending on it. Here we can see the cast to `struct cpumask *` of the second argument passed to `cpumask_set_cpu`. In our case it is `cpu_online_bits`, which is a bitmap defined as:

```C
static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
```

The `cpumask_set_cpu` function makes only one call, to the `set_bit` function:

```C
static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
	set_bit(cpumask_check(cpu), cpumask_bits(dstp));
}
```

The `set_bit` function also takes two parameters and sets the given bit (first parameter) in the memory pointed to by the second parameter, the `cpu_online_bits` bitmap. Before `set_bit` is called, its two arguments are passed through:

* `cpumask_check`;
* `cpumask_bits`.

Let's consider these two macros. The first, `cpumask_check`, does nothing in our case and just returns the given parameter. The second, `cpumask_bits`, just returns the `bits` field of the given `struct cpumask *`:

```C
#define cpumask_bits(maskp) ((maskp)->bits)
```
Now let's look at the `set_bit` implementation:

```C
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
	if (IS_IMMEDIATE(nr)) {
		asm volatile(LOCK_PREFIX "orb %1,%0"
			: CONST_MASK_ADDR(nr, addr)
			: "iq" ((u8)CONST_MASK(nr))
			: "memory");
	} else {
		asm volatile(LOCK_PREFIX "bts %1,%0"
			: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
	}
}
```

This function looks scary, but it is not as hard as it seems. First of all it passes `nr`, the number of the bit, to the `IS_IMMEDIATE` macro, which just calls the GCC built-in `__builtin_constant_p` function:

```C
#define IS_IMMEDIATE(nr)		(__builtin_constant_p(nr))
```

`__builtin_constant_p` checks whether the given parameter is a known constant at compile time. As our `cpu` is not a compile-time constant, the `else` clause will be executed:

```C
asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
```

Let's try to understand how it works step by step:

`LOCK_PREFIX` expands to the x86 `lock` prefix. This prefix tells the CPU to own the system bus while the prefixed instruction executes. This allows the CPU to synchronize memory access, preventing simultaneous access by multiple processors (or devices, for example the DMA controller) to one memory cell.

`BITOP_ADDR` casts the given parameter to `(*(volatile long *))` and adds the `+m` constraint. `+` means that this operand is both read and written by the instruction. `m` shows that this is a memory operand. `BITOP_ADDR` is defined as:

```C
#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
```

Next is the `memory` clobber. It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters).

`Ir` means the operand may be either an immediate or a register.

The `bts` instruction sets the given bit in a bit string and stores the previous value of that bit in the `CF` flag. So we passed the CPU number, which is zero in our case, and after `set_bit` executes, bit zero is set in the `cpu_online_bits` cpumask. It means that the first CPU is online at this moment.
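What `bts` computes can be written in plain C. This sketch has the same bit-addressing semantics (the old bit value plays the role of `CF`) but none of the atomicity that the kernel gets from the `lock` prefix:

```c
#include <stdbool.h>

/* Set bit `nr` in the bitmap at `addr` and return its previous value,
 * like `bts` storing the old bit in CF. Not atomic. */
static bool test_and_set_bit_sketch(long nr, unsigned long *addr)
{
    unsigned long bits_per_long = 8 * sizeof(unsigned long);
    unsigned long mask = 1UL << (nr % bits_per_long);
    unsigned long *word = addr + nr / bits_per_long;
    bool old = *word & mask;

    *word |= mask;
    return old;
}
```

Calling it twice for the same bit returns `false` the first time and `true` the second, exactly the `CF` behaviour of `bts`.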
Besides the `set_cpu_*` API, cpumask of course provides other APIs for cpumask manipulation. Let's consider them briefly.

Additional cpumask API
--------------------------------------------------------------------------------

cpumask provides a set of macros for getting the number of CPUs in various states. For example:

```C
#define num_online_cpus()	cpumask_weight(cpu_online_mask)
```

This macro returns the number of `online` CPUs. It calls the `cpumask_weight` function with the `cpu_online_mask` bitmap. The `cpumask_weight` function makes one call to the `bitmap_weight` function with two parameters:

* cpumask bitmap;
* `nr_cpumask_bits`, which is `NR_CPUS` in our case.

```C
static inline unsigned int cpumask_weight(const struct cpumask *srcp)
{
	return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
}
```

and calculates the number of set bits in the given bitmap. Besides `num_online_cpus`, cpumask provides similar macros for all the CPU states:
* `num_possible_cpus`;
* `num_active_cpus`;
* `cpu_online`;
* `cpu_possible`.

and many more.
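The counting that `bitmap_weight` performs is a population count. A portable user-space sketch (the kernel uses optimized hweight routines instead):

```c
/* Count the set bits in the first `nbits` bits of `bitmap`,
 * which is what cpumask_weight/bitmap_weight computes. */
static unsigned int bitmap_weight_sketch(const unsigned long *bitmap,
                                         unsigned int nbits)
{
    unsigned long bits_per_long = 8 * sizeof(unsigned long);
    unsigned int i, w = 0;

    for (i = 0; i < nbits; i++)
        if (bitmap[i / bits_per_long] & (1UL << (i % bits_per_long)))
            w++;
    return w;
}
```

With an online mask of `0x13` (CPUs 0, 1 and 4), the weight is 3, which is what a hypothetical `num_online_cpus()` would report.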
Besides that, the Linux kernel provides the following API for `cpumask` manipulation:

* `for_each_cpu` - iterates over every cpu in a mask;
* `for_each_cpu_not` - iterates over every cpu in a complemented mask;
* `cpumask_clear_cpu` - clears a cpu in a cpumask;
* `cpumask_test_cpu` - tests a cpu in a mask;
* `cpumask_setall` - sets all cpus in a mask;
* `cpumask_size` - returns the size to allocate for a 'struct cpumask' in bytes;

and many, many more...

Links
--------------------------------------------------------------------------------

* [cpumask documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
Concepts/initcall.md — new file, 395 lines
@@ -0,0 +1,395 @@

The initcall mechanism
================================================================================

Introduction
--------------------------------------------------------------------------------

As you may understand from the title, this part will cover an interesting and important concept in the Linux kernel called `initcall`. We have already seen definitions like these:

```C
early_param("debug", debug_kernel);
```

or

```C
arch_initcall(init_pit_clocksource);
```

in some parts of the Linux kernel. Before we see how this mechanism is implemented in the Linux kernel, we must know what it actually is and how the Linux kernel uses it. Definitions like these represent a [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) function which will be called during initialization of the Linux kernel, or right after it. The main point of the `initcall` mechanism is to determine the correct order in which built-in modules and subsystems are initialized. For example, let's look at the following function:

```C
static int __init nmi_warning_debugfs(void)
{
	debugfs_create_u64("nmi_longest_ns", 0644,
			arch_debugfs_dir, &nmi_longest_ns);
	return 0;
}
```

from the [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c) source code file. As we may see, it just creates the `nmi_longest_ns` [debugfs](https://en.wikipedia.org/wiki/Debugfs) file in the `arch_debugfs_dir` directory. This `debugfs` file may only be created after the `arch_debugfs_dir` directory exists. Creation of this directory occurs during the architecture-specific initialization of the Linux kernel, in the `arch_kdebugfs_init` function from the [arch/x86/kernel/kdebugfs.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kdebugfs.c) source code file. Note that the `arch_kdebugfs_init` function is marked as an `initcall` too:
```C
arch_initcall(arch_kdebugfs_init);
```

The Linux kernel calls all architecture-specific `initcalls` before the `fs` related `initcalls`, so our `nmi_longest_ns` file will only be created after the `arch_kdebugfs_dir` directory has been created. The Linux kernel provides eight levels of main `initcalls`:

* `early`;
* `core`;
* `postcore`;
* `arch`;
* `subsys`;
* `fs`;
* `device`;
* `late`.

All of their names are represented by the `initcall_level_names` array which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

```C
static char *initcall_level_names[] __initdata = {
	"early",
	"core",
	"postcore",
	"arch",
	"subsys",
	"fs",
	"device",
	"late",
};
```

All functions which are marked as `initcall` by these identifiers will be called in this order: first the `early initcalls`, then the `core initcalls`, and so on. From this moment we know a little about the `initcall` mechanism, so we can start to dive into the source code of the Linux kernel to see how this mechanism is implemented.
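The ordering guarantee can be modelled in user-space C: plain arrays stand in for the kernel's linker sections, and walking them level by level gives the same "lower level always runs first" behaviour. All names here are illustrative, not the kernel API:

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

/* Record the order in which callbacks run. */
static int order[4];
static int norder;

static int early_a(void) { order[norder++] = 0; return 0; }
static int core_a(void)  { order[norder++] = 1; return 0; }
static int arch_a(void)  { order[norder++] = 2; return 0; }

/* NULL-terminated per-level arrays, standing in for the
 * .initcall<id>.init sections. */
static initcall_t level_early[] = { early_a, NULL };
static initcall_t level_core[]  = { core_a,  NULL };
static initcall_t level_arch[]  = { arch_a,  NULL };

static initcall_t *levels[] = { level_early, level_core, level_arch };

static void run_initcalls(void)
{
    for (size_t l = 0; l < sizeof(levels) / sizeof(levels[0]); l++)
        for (initcall_t *fn = levels[l]; *fn; fn++)
            (*fn)();
}
```

No matter how the callbacks were registered, `run_initcalls` always executes `early_a` before `core_a` before `arch_a`, which is the whole point of the level scheme.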
Implementation of the initcall mechanism in the Linux kernel
--------------------------------------------------------------------------------

The Linux kernel provides a set of macros from the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file to mark a given function as an `initcall`. All of these macros are pretty simple:

```C
#define early_initcall(fn)		__define_initcall(fn, early)
#define core_initcall(fn)		__define_initcall(fn, 1)
#define postcore_initcall(fn)		__define_initcall(fn, 2)
#define arch_initcall(fn)		__define_initcall(fn, 3)
#define subsys_initcall(fn)		__define_initcall(fn, 4)
#define fs_initcall(fn)			__define_initcall(fn, 5)
#define device_initcall(fn)		__define_initcall(fn, 6)
#define late_initcall(fn)		__define_initcall(fn, 7)
```

As we may see, these macros just expand to a call of the `__define_initcall` macro from the same header file. The `__define_initcall` macro takes two arguments:

* `fn` - callback function which will be called when the `initcalls` of the given level are run;
* `id` - identifier which distinguishes the `initcall` levels and prevents errors when two identical `initcalls` point to the same handler.

The implementation of the `__define_initcall` macro looks like:

```C
#define __define_initcall(fn, id) \
	static initcall_t __initcall_##fn##id __used \
	__attribute__((__section__(".initcall" #id ".init"))) = fn; \
	LTO_REFERENCE_INITCALL(__initcall_##fn##id)
```

To understand the `__define_initcall` macro, first of all let's look at the `initcall_t` type. This type is defined in the same header file and represents a pointer to a function which takes no arguments and returns an [integer](https://en.wikipedia.org/wiki/Integer), the result of the `initcall`:

```C
typedef int (*initcall_t)(void);
```

Now let's return to the `__define_initcall` macro. The [##](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) operator concatenates two symbols. In our case, the first line of the `__define_initcall` macro produces the definition of a `__initcall_<function-name><id>` variable, places it in the `.initcall<id>.init` [ELF section](http://www.skyfree.org/linux/references/ELF_Format.pdf) and marks it with the `__used` [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) attribute. If we look in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h) header file, which provides data for the kernel [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script, we will see that all of the `initcall` sections will be placed in the `.data` section:
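The name-pasting part of the macro is easy to try on its own. This stand-alone demo (the `DEMO_` macro and `my_init` are made up here) only shows the `##` concatenation, not the section placement:

```c
typedef int (*initcall_t)(void);

/* __initcall_ ## fn ## id pastes into one identifier,
 * e.g. __initcall_my_init6 below. */
#define DEMO_DEFINE_INITCALL(fn, id) \
    static initcall_t __initcall_##fn##id = fn

static int my_init(void)
{
    return 42;
}

DEMO_DEFINE_INITCALL(my_init, 6);   /* defines __initcall_my_init6 */
```

After preprocessing, the variable `__initcall_my_init6` exists and points at `my_init`; the real macro additionally pins such a variable into the `.initcall6.init` section.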
```C
#define INIT_CALLS							\
	VMLINUX_SYMBOL(__initcall_start) = .;			\
	*(.initcallearly.init)					\
	INIT_CALLS_LEVEL(0)					\
	INIT_CALLS_LEVEL(1)					\
	INIT_CALLS_LEVEL(2)					\
	INIT_CALLS_LEVEL(3)					\
	INIT_CALLS_LEVEL(4)					\
	INIT_CALLS_LEVEL(5)					\
	INIT_CALLS_LEVEL(rootfs)				\
	INIT_CALLS_LEVEL(6)					\
	INIT_CALLS_LEVEL(7)					\
	VMLINUX_SYMBOL(__initcall_end) = .;

#define INIT_DATA_SECTION(initsetup_align)	\
	.init.data : AT(ADDR(.init.data) - LOAD_OFFSET) {	\
		...						\
		INIT_CALLS					\
		...						\
	}
```

The second attribute, `__used`, is defined in the [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) header file and just expands to the following `gcc` attribute:

```C
#define __used			__attribute__((__used__))
```

which prevents a `variable defined but not used` warning. The last line of the `__define_initcall` macro is:

```C
LTO_REFERENCE_INITCALL(__initcall_##fn##id)
```

It depends on the `CONFIG_LTO` kernel configuration option and provides a stub for the compiler's [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization):

```
#ifdef CONFIG_LTO
#define LTO_REFERENCE_INITCALL(x) \
        static __used __exit void *reference_##x(void) \
        {                                              \
                return &x;                             \
        }
#else
#define LTO_REFERENCE_INITCALL(x)
#endif
```

It keeps a reference to the `initcall` variable, so that a variable which is otherwise unreferenced in a module is not discarded by the optimizer. That's all about the `__define_initcall` macro. So, all of the `*_initcall` macros are expanded during compilation of the Linux kernel, all `initcalls` are placed in their sections and are reachable from the `.data` section, so the Linux kernel knows where to find a certain `initcall` in order to call it during the initialization process.
Since `initcalls` are called by the Linux kernel, let's look at how the Linux kernel does this. The process starts in the `do_basic_setup` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

```C
static void __init do_basic_setup(void)
{
	...
	...
	...
	do_initcalls();
	...
	...
	...
}
```

which is called during the initialization of the Linux kernel, right after the main steps of initialization (the memory-manager-related initialization, the `CPU` subsystem and others) have finished. The `do_initcalls` function just goes through the array of `initcall` levels and calls the `do_initcall_level` function for each level:

```C
static void __init do_initcalls(void)
{
	int level;

	for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
		do_initcall_level(level);
}
```
The `initcall_levels` array is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/init/main.c) and contains pointers to the sections which were defined in the `__define_initcall` macro:

```C
static initcall_t *initcall_levels[] __initdata = {
	__initcall0_start,
	__initcall1_start,
	__initcall2_start,
	__initcall3_start,
	__initcall4_start,
	__initcall5_start,
	__initcall6_start,
	__initcall7_start,
	__initcall_end,
};
```

If you are interested, you can find these sections in the `arch/x86/kernel/vmlinux.lds` linker script which is generated after the Linux kernel compilation:

```
.init.data : AT(ADDR(.init.data) - 0xffffffff80000000) {
	...
	...
	...
	...
	__initcall_start = .;
	*(.initcallearly.init)
	__initcall0_start = .;
	*(.initcall0.init)
	*(.initcall0s.init)
	__initcall1_start = .;
	...
	...
}
```

If this is not familiar to you, you can learn more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html) of this book.

The `do_initcall_level` function takes one parameter, the level of the `initcall`, and does the two following things: first of all, it parses the `initcall_command_line`, a copy of the usual kernel [command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) which may contain parameters for modules, with the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/kernel/params.c) source code file, and then it calls the `do_one_initcall` function for each `initcall` of the given level:
```C
for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
	do_one_initcall(*fn);
```

`do_one_initcall` does the main job for us. As we may see, this function takes one parameter which represents the `initcall` callback function, and calls the given callback:

```C
int __init_or_module do_one_initcall(initcall_t fn)
{
	int count = preempt_count();
	int ret;
	char msgbuf[64];

	if (initcall_blacklisted(fn))
		return -EPERM;

	if (initcall_debug)
		ret = do_one_initcall_debug(fn);
	else
		ret = fn();

	msgbuf[0] = 0;

	if (preempt_count() != count) {
		sprintf(msgbuf, "preemption imbalance ");
		preempt_count_set(count);
	}
	if (irqs_disabled()) {
		strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
		local_irq_enable();
	}
	WARN(msgbuf[0], "initcall %pF returned with %s\n", fn, msgbuf);

	return ret;
}
```

Let's try to understand what the `do_one_initcall` function does. First of all we save the [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) counter, so that we can compare it later and be sure that it has not become imbalanced. After this step we can see the call of the `initcall_blacklisted` function, which goes over the `blacklisted_initcalls` list storing the blacklisted `initcalls` and skips the given `initcall` if it is found in this list:

```C
list_for_each_entry(entry, &blacklisted_initcalls, next) {
	if (!strcmp(fn_name, entry->buf)) {
		pr_debug("initcall %s blacklisted\n", fn_name);
		kfree(fn_name);
		return true;
	}
}
```
The blacklisted `initcalls` are stored in the `blacklisted_initcalls` list, and this list is filled during early Linux kernel initialization from the Linux kernel command line.

After the blacklisted `initcalls` have been handled, the next part of the code directly calls the `initcall`:

```C
if (initcall_debug)
	ret = do_one_initcall_debug(fn);
else
	ret = fn();
```

Depending on the value of the `initcall_debug` variable, either the `do_one_initcall_debug` function will call the `initcall`, or the function will do it directly via `fn()`. The `initcall_debug` variable is defined in the [same](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

```C
bool initcall_debug;
```

and provides the ability to print some information to the kernel [log buffer](https://en.wikipedia.org/wiki/Dmesg). Its value can be set from the kernel command line via the `initcall_debug` parameter. As we can read in the [documentation](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) of the Linux kernel command line:

```
initcall_debug	[KNL] Trace initcalls as they are executed.  Useful
			for working out where the kernel is dying during
			startup.
```
And that's true. If we look at the implementation of the `do_one_initcall_debug` function, we will see that it does the same as the `do_one_initcall` function, i.e. it calls the given `initcall`, but additionally prints some information related to its execution (like the [pid](https://en.wikipedia.org/wiki/Process_identifier) of the currently running task and the duration of the `initcall`'s execution):

```C
static int __init_or_module do_one_initcall_debug(initcall_t fn)
{
	ktime_t calltime, delta, rettime;
	unsigned long long duration;
	int ret;

	printk(KERN_DEBUG "calling  %pF @ %i\n", fn, task_pid_nr(current));
	calltime = ktime_get();
	ret = fn();
	rettime = ktime_get();
	delta = ktime_sub(rettime, calltime);
	duration = (unsigned long long) ktime_to_ns(delta) >> 10;
	printk(KERN_DEBUG "initcall %pF returned %d after %lld usecs\n",
		 fn, ret, duration);

	return ret;
}
```
After an `initcall` was called by either the `do_one_initcall` or the `do_one_initcall_debug` function, we may see two checks at the end of the `do_one_initcall` function. The first one compares the preemption counter (which counts `__preempt_count_add` and `__preempt_count_sub` calls made inside the executed `initcall`) with its previous value; if they are not equal, we add the `preemption imbalance` string to the message buffer and restore the correct value of the preemption counter:

```C
if (preempt_count() != count) {
	sprintf(msgbuf, "preemption imbalance ");
	preempt_count_set(count);
}
```

Later this error string will be printed. The last check concerns the state of the local [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29): if they are disabled, we add the `disabled interrupts` string to our message buffer and enable the `IRQs` for the current processor, to prevent the state where the `IRQs` were disabled by an `initcall` and never enabled again:

```C
if (irqs_disabled()) {
	strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
	local_irq_enable();
}
```

That's all. In this way the Linux kernel initializes many subsystems in the correct order. From now on, we know what the `initcall` mechanism in the Linux kernel is. We have covered its main general part here, but we left out some important concepts. Let's take a short look at them.
First of all, we have missed one level of `initcalls`: the `rootfs initcalls`. You can find the definition of `rootfs_initcall` in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file together with all the similar macros which we saw in this part:

```C
#define rootfs_initcall(fn)		__define_initcall(fn, rootfs)
```

As we may understand from the macro's name, its main purpose is to store callbacks which are related to the [rootfs](https://en.wikipedia.org/wiki/Initramfs). Besides this goal, it may be useful for initializing other things after the filesystem-level initialization, but before the device-related initialization. For example, the decompression of the [initramfs](https://en.wikipedia.org/wiki/Initramfs), which occurs in the `populate_rootfs` function from the [init/initramfs.c](https://github.com/torvalds/linux/blob/master/init/initramfs.c) source code file:

```C
rootfs_initcall(populate_rootfs);
```

From this place, we may see the familiar output:

```
[    0.199960] Unpacking initramfs...
```

Besides the `rootfs_initcall` level, there are the additional `console_initcall`, `security_initcall` and other secondary `initcall` levels. The last thing that we have missed is the set of `*_initcall_sync` levels. Almost every `*_initcall` macro that we have seen in this part has a companion macro with the `_sync` suffix:

```C
#define core_initcall_sync(fn)		__define_initcall(fn, 1s)
#define postcore_initcall_sync(fn)	__define_initcall(fn, 2s)
#define arch_initcall_sync(fn)		__define_initcall(fn, 3s)
#define subsys_initcall_sync(fn)	__define_initcall(fn, 4s)
#define fs_initcall_sync(fn)		__define_initcall(fn, 5s)
#define device_initcall_sync(fn)	__define_initcall(fn, 6s)
#define late_initcall_sync(fn)		__define_initcall(fn, 7s)
```

The main goal of these additional levels is to wait for the completion of all module-related initialization routines of a certain level.

That's all.

Conclusion
--------------------------------------------------------------------------------

In this part we saw an important mechanism of the Linux kernel which allows it to call, during initialization, functions which depend on the current state of the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------

* [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29)
* [debugfs](https://en.wikipedia.org/wiki/Debugfs)
* [integer type](https://en.wikipedia.org/wiki/Integer)
* [symbols concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization)
* [Introduction to linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Linux kernel command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [rootfs](https://en.wikipedia.org/wiki/Initramfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
230
Concepts/per-cpu.md
Normal file
@@ -0,0 +1,230 @@
Per-CPU variables
================================================================================

Per-CPU variables are one of the kernel features. You can understand what this feature means by reading its name. We can create a variable and each processor core will have its own copy of this variable. In this part, we take a closer look at this feature and try to understand how it is implemented and how it works.

The kernel provides an API for creating per-cpu variables - the `DEFINE_PER_CPU` macro:

```C
#define DEFINE_PER_CPU(type, name) \
        DEFINE_PER_CPU_SECTION(type, name, "")
```

This macro, like many other macros for working with per-cpu variables, is defined in [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h). Now we will see how this feature is implemented.

Take a look at the `DEFINE_PER_CPU` definition. We see that it takes 2 parameters: `type` and `name`, so we can use it to create per-cpu variables, for example like this:

```C
DEFINE_PER_CPU(int, per_cpu_n)
```

We pass the type and the name of our variable. `DEFINE_PER_CPU` calls the `DEFINE_PER_CPU_SECTION` macro and passes the same two parameters and an empty string to it. Let's look at the definition of `DEFINE_PER_CPU_SECTION`:

```C
#define DEFINE_PER_CPU_SECTION(type, name, sec) \
         __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \
         __typeof__(type) name
```

where `__PCPU_ATTRS` is:

```C
#define __PCPU_ATTRS(sec) \
         __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
         PER_CPU_ATTRIBUTES
```

and the base `section` is:

```C
#define PER_CPU_BASE_SECTION ".data..percpu"
```

After all macros are expanded we will get a global per-cpu variable:

```C
__attribute__((section(".data..percpu"))) int per_cpu_n
```

It means that we will have a `per_cpu_n` variable in the `.data..percpu` section. We can find this section in the `vmlinux`:

```
.data..percpu 00013a58 0000000000000000 0000000001a5c000 00e00000 2**12
      CONTENTS, ALLOC, LOAD, DATA
```

Ok, now we know that when we use the `DEFINE_PER_CPU` macro, a per-cpu variable in the `.data..percpu` section will be created. When the kernel initializes, it calls the `setup_per_cpu_areas` function which loads the `.data..percpu` section multiple times, one section per CPU.
Let's look at the per-CPU areas initialization process. It starts in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) with the call of the `setup_per_cpu_areas` function, which is defined in [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c).

```C
pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
        NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
```

The `setup_per_cpu_areas` function starts by printing the maximum number of CPUs set during kernel configuration with the `CONFIG_NR_CPUS` configuration option, the actual number of CPUs, `nr_cpumask_bits` (which is the same as `NR_CPUS` for the new `cpumask` operators) and the number of `NUMA` nodes.

We can see this output in the dmesg:

```
$ dmesg | grep percpu
[ 0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
```

In the next step we check the `percpu` first chunk allocator. All percpu areas are allocated in chunks. The first chunk is used for the static percpu variables. The Linux kernel has the `percpu_alloc` command line parameter which selects the type of the first chunk allocator. We can read about it in the kernel documentation:

```
percpu_alloc=	Select which percpu first chunk allocator to use.
		Currently supported values are "embed" and "page".
		Archs may support subset or none of the selections.
		See comments in mm/percpu.c for details on each
		allocator. This parameter is primarily for debugging
		and performance comparison.
```

The [mm/percpu.c](https://github.com/torvalds/linux/blob/master/mm/percpu.c) source code file contains the handler of this command line option:

```C
early_param("percpu_alloc", percpu_alloc_setup);
```

where the `percpu_alloc_setup` function sets the `pcpu_chosen_fc` variable depending on the `percpu_alloc` parameter value. By default the first chunk allocator is `auto`:

```C
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
```

If the `percpu_alloc` parameter is not given on the kernel command line, the `embed` allocator will be used, which embeds the first percpu chunk into bootmem with the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). The last allocator is the first chunk `page` allocator which maps the first chunk with `PAGE_SIZE` pages.

As I wrote above, first of all we check the type of the first chunk allocator in `setup_per_cpu_areas`: we check that the first chunk allocator is not `page`:

```C
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
    ...
    ...
    ...
}
```

If it is not `PCPU_FC_PAGE`, we will use the `embed` allocator and allocate space for the first chunk with the `pcpu_embed_first_chunk` function:

```C
rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
			    dyn_size, atom_size,
			    pcpu_cpu_distance,
			    pcpu_fc_alloc, pcpu_fc_free);
```

As I wrote above, the `pcpu_embed_first_chunk` function embeds the first percpu chunk into bootmem. As you can see, we pass a couple of parameters to `pcpu_embed_first_chunk`; they are:

* `PERCPU_FIRST_CHUNK_RESERVE` - the size of the space reserved for the static `percpu` variables;
* `dyn_size` - minimum free size for dynamic allocation in bytes;
* `atom_size` - all allocations are whole multiples of this and aligned to this parameter;
* `pcpu_cpu_distance` - callback to determine the distance between cpus;
* `pcpu_fc_alloc` - function to allocate a `percpu` page;
* `pcpu_fc_free` - function to release a `percpu` page.

We calculate all of these parameters before the call of `pcpu_embed_first_chunk`:

```C
const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
size_t atom_size;
#ifdef CONFIG_X86_64
		atom_size = PMD_SIZE;
#else
		atom_size = PAGE_SIZE;
#endif
```

If the first chunk allocator is `PCPU_FC_PAGE`, we will use `pcpu_page_first_chunk` instead of `pcpu_embed_first_chunk`. After the `percpu` areas are set up, we set up the `percpu` offset and its segment for every CPU with the `setup_percpu_segment` function (only for `x86` systems) and move some early data from arrays to `percpu` variables (`x86_cpu_to_apicid`, `irq_stack_ptr`, etc.). After the kernel finishes the initialization process, we will have N `.data..percpu` sections loaded, where N is the number of CPUs, and the section used by the bootstrap processor will contain the uninitialized variable created with the `DEFINE_PER_CPU` macro.
The kernel provides an API for manipulating per-cpu variables:

* get_cpu_var(var)
* put_cpu_var(var)

Let's look at the `get_cpu_var` implementation:

```C
#define get_cpu_var(var) \
(*({ \
         preempt_disable(); \
         this_cpu_ptr(&var); \
}))
```

The Linux kernel is preemptible and accessing a per-cpu variable requires us to know which processor the kernel is running on. So, the current code must not be preempted and moved to another CPU while accessing a per-cpu variable. That's why, first of all, we can see a call of the `preempt_disable` function. After this we can see a call of the `this_cpu_ptr` macro, which looks like:

```C
#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
```

and

```C
#define raw_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
```

where `per_cpu_ptr` returns a pointer to the per-cpu variable for the given cpu (second parameter). After we've accessed a per-cpu variable and made modifications to it, we must call the `put_cpu_var` macro which enables preemption with a call of the `preempt_enable` function. So the typical usage of a per-cpu variable is as follows:

```C
get_cpu_var(var);
...
//Do something with the 'var'
...
put_cpu_var(var);
```

Let's look at the `per_cpu_ptr` macro:

```C
#define per_cpu_ptr(ptr, cpu) \
({ \
        __verify_pcpu_ptr(ptr); \
        SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \
})
```

As I wrote above, this macro returns a pointer to the per-cpu variable for the given cpu. First of all it calls `__verify_pcpu_ptr`:

```C
#define __verify_pcpu_ptr(ptr) \
do { \
	const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
	(void)__vpp_verify; \
} while (0)
```

which performs a compile-time check that the given `ptr` is of type `const void __percpu *`.

After this we can see the call of the `SHIFT_PERCPU_PTR` macro with two parameters. As the first parameter we pass our `ptr`, and as the second we pass the cpu number to the `per_cpu_offset` macro:

```C
#define per_cpu_offset(x) (__per_cpu_offset[x])
```

which expands to getting the `x` element from the `__per_cpu_offset` array:

```C
extern unsigned long __per_cpu_offset[NR_CPUS];
```

where `NR_CPUS` is the number of CPUs. The `__per_cpu_offset` array is filled with the distances between cpu-variable copies. For example, if all per-cpu data is `X` bytes in size, `__per_cpu_offset[Y]` will contain the offset `X*Y`. Let's look at the `SHIFT_PERCPU_PTR` implementation:

```C
#define SHIFT_PERCPU_PTR(__p, __offset) \
         RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
```

`RELOC_HIDE` just adds the given offset to the given pointer, `(typeof(ptr)) (__ptr + (off))`, and thus returns a pointer to the variable.
That's all! Of course this is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) magic.

Let's again look at the algorithm of getting a pointer to a per-cpu variable:

* The kernel creates multiple `.data..percpu` sections (one per CPU) during the initialization process;
* All variables created with the `DEFINE_PER_CPU` macro will be placed in the first section, the one for CPU0;
* The `__per_cpu_offset` array is filled with the distances (`BOOT_PERCPU_OFFSET`) between the `.data..percpu` sections;
* When `per_cpu_ptr` is called, for example to get a pointer to a certain per-cpu variable for the third CPU, the `__per_cpu_offset` array is accessed, where every index points to the required CPU's area.

That's all.
384
DataStructures/bitmap.md
Normal file
@@ -0,0 +1,384 @@
Data Structures in the Linux Kernel
================================================================================

Bit arrays and bit operations in the Linux kernel
--------------------------------------------------------------------------------

Besides different [linked](https://en.wikipedia.org/wiki/Linked_data_structure) and [tree](https://en.wikipedia.org/wiki/Tree_%28data_structure%29) based data structures, the Linux kernel provides an [API](https://en.wikipedia.org/wiki/Application_programming_interface) for [bit arrays](https://en.wikipedia.org/wiki/Bit_array), or `bitmaps`. Bit arrays are heavily used in the Linux kernel and the following source code files contain a common `API` for working with such structures:

* [lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c)
* [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h)

Besides these two files, there is also an architecture-specific header file which provides optimized bit operations for a certain architecture. We consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case it will be the

* [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h)

header file. As I just wrote above, the `bitmap` is heavily used in the Linux kernel. For example, a `bit array` is used to store the set of online/offline processors for systems which support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) cpu (you can read more about this in the [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part), a `bit array` stores the set of allocated [irqs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during initialization of the Linux kernel, etc.

So, the main goal of this part is to see how `bit arrays` are implemented in the Linux kernel. Let's start.

Declaration of bit array
================================================================================

Before we look at the `API` for bitmap manipulation, we must know how to declare one in the Linux kernel. There are two common methods to declare your own bit array. The first simple way is to define an array of `unsigned long`. For example:

```C
unsigned long my_bitmap[8]
```

The second way is to use the `DECLARE_BITMAP` macro which is defined in the [include/linux/types.h](https://github.com/torvalds/linux/blob/master/include/linux/types.h) header file:

```C
#define DECLARE_BITMAP(name,bits) \
    unsigned long name[BITS_TO_LONGS(bits)]
```

We can see that the `DECLARE_BITMAP` macro takes two parameters:

* `name` - name of the bitmap;
* `bits` - amount of bits in the bitmap;

and just expands to the definition of an `unsigned long` array with `BITS_TO_LONGS(bits)` elements, where the `BITS_TO_LONGS` macro converts a given number of bits to the number of `longs`, or in other words it calculates how many 8-byte elements (on 64-bit systems) are needed to store `bits` bits:

```C
#define BITS_PER_BYTE 8
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
```

So, for example `DECLARE_BITMAP(my_bitmap, 64)` will produce:

```python
>>> (((64) + (64) - 1) / (64))
1
```

and:

```C
unsigned long my_bitmap[1];
```
After we are able to declare a bit array, we can start to use it.

Architecture-specific bit operations
================================================================================

We already saw above a couple of source code and header files which provide an [API](https://en.wikipedia.org/wiki/Application_programming_interface) for the manipulation of bit arrays. The most important and widely used API of bit arrays is architecture-specific and located, as we already know, in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file.

First of all let's look at the two most important functions:

* `set_bit`;
* `clear_bit`.

I think that there is no need to explain what these functions do. This must already be clear from their names. Let's look at their implementation. If you look into the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file, you will note that each of these functions is represented by two variants: [atomic](https://en.wikipedia.org/wiki/Linearizability) and not. Before we start to dive into the implementations of these functions, first of all we must know a little about `atomic` operations.

In simple words, an atomic operation guarantees that two or more operations will not be performed on the same data concurrently. The `x86` architecture provides a set of atomic instructions, for example the [xchg](http://x86.renejeschke.de/html/file_module_x86_id_328.html) instruction, the [cmpxchg](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction, etc. Besides atomic instructions, some non-atomic instructions can be made atomic with the help of the [lock](http://x86.renejeschke.de/html/file_module_x86_id_159.html) prefix. It is enough to know this about atomic operations for now, so we can begin to consider the implementation of the `set_bit` and `clear_bit` functions.

First of all, let's start by considering the `non-atomic` variants of these functions. The names of the non-atomic `set_bit` and `clear_bit` start with a double underscore. As we already know, all of these functions are defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file and the first function is `__set_bit`:

```C
static inline void __set_bit(long nr, volatile unsigned long *addr)
{
	asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
}
```

As we can see it takes two arguments:

* `nr` - number of the bit in a bit array.
* `addr` - address of the bit array where we need to set the bit.

Note that the `addr` parameter is declared with the `volatile` keyword, which tells the compiler that the value at the given address may change. The implementation of `__set_bit` is pretty easy. As we can see, it just contains one line of [inline assembler](https://en.wikipedia.org/wiki/Inline_assembler) code. In our case we are using the [bts](http://x86.renejeschke.de/html/file_module_x86_id_25.html) instruction which selects the bit specified by the first operand (`nr` in our case) from the bit array, stores the value of the selected bit in the [CF](https://en.wikipedia.org/wiki/FLAGS_register) flag of the flags register and sets this bit.

Note that we can see the usage of `nr`, but `addr` does not appear explicitly. You might already guess that the secret is in `ADDR`. `ADDR` is a macro which is defined in the same header file and expands to the given address with a `+m` constraint:

```C
#define ADDR BITOP_ADDR(addr)
#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
```

Besides `+m`, we can see other constraints in the `__set_bit` function. Let's look at them and try to understand what they mean:

* `+m` - represents a memory operand, where `+` tells that the given operand is both an input and an output operand;
* `I` - represents an integer constant;
* `r` - represents a register operand.

Besides these constraints, we also see the `memory` keyword which tells the compiler that this code will change values in memory. That's all. Now let's look at the same function in its `atomic` variant. It looks more complex than its `non-atomic` variant:

```C
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
	if (IS_IMMEDIATE(nr)) {
		asm volatile(LOCK_PREFIX "orb %1,%0"
			: CONST_MASK_ADDR(nr, addr)
			: "iq" ((u8)CONST_MASK(nr))
			: "memory");
	} else {
		asm volatile(LOCK_PREFIX "bts %1,%0"
			: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
	}
}
```

First of all note that this function takes the same set of parameters as `__set_bit`, but is additionally marked with the `__always_inline` attribute. `__always_inline` is a macro defined in [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) which just expands to the `always_inline` attribute:

```C
#define __always_inline inline __attribute__((always_inline))
```

which means that this function will always be inlined to reduce the size of the Linux kernel image. Now let's try to understand the implementation of the `set_bit` function. First of all we check the given bit number at the beginning of the `set_bit` function. The `IS_IMMEDIATE` macro is defined in the same [header](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) file and expands to a call of the builtin [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) function:

```C
#define IS_IMMEDIATE(nr) (__builtin_constant_p(nr))
```

The `__builtin_constant_p` builtin function returns `1` if the given parameter is known to be constant at compile-time and returns `0` otherwise. There is no need to use the slow `bts` instruction to set a bit if the given bit number is a compile-time constant. We can just apply a bitwise [or](https://en.wikipedia.org/wiki/Bitwise_operation#OR) to the byte at the given address which contains the given bit, with a mask where the given bit is `1` and the other bits are `0`. Otherwise, if the given bit number is not known to be a compile-time constant, we do the same as in the `__set_bit` function. The `CONST_MASK_ADDR` macro:

```C
#define CONST_MASK_ADDR(nr, addr) BITOP_ADDR((void *)(addr) + ((nr)>>3))
```

expands to the given address with an offset to the byte which contains the given bit. For example, if we have the address `0x1000` and the bit number is `0x9`, then, as `0x9` is `one byte + one bit`, our address will be `addr + 1`:

```python
>>> hex(0x1000 + (0x9 >> 3))
'0x1001'
```

The `CONST_MASK` macro represents the given bit number as a byte mask where the corresponding bit is `1` and the other bits are `0`:

```C
#define CONST_MASK(nr) (1 << ((nr) & 7))
```

```python
>>> bin(1 << (0x9 & 7))
'0b10'
```

In the end we just apply bitwise `or` to these values. So, for example, if the two bytes at our address contain the value `0x4097` and we need to set bit `0x9`, the mask `0b10` will be or-ed into the byte at `addr + 1`:

```python
>>> hex(0x4097 | (1 << 0x9))
'0x4297'
```

and the `ninth` bit will be set.
Note that all of these operations are marked with `LOCK_PREFIX`, which expands to the [lock](http://x86.renejeschke.de/html/file_module_x86_id_159.html) prefix and guarantees the atomicity of the operation.

As we already know, besides the `set_bit` and `__set_bit` operations, the Linux kernel provides two inverse functions to clear a bit in atomic and non-atomic context. They are `clear_bit` and `__clear_bit`. Both of these functions are defined in the same [header file](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) and take the same set of arguments. But not only the arguments are similar: generally, these functions are very similar to `set_bit` and `__set_bit`. Let's look at the implementation of the non-atomic `__clear_bit` function:

```C
static inline void __clear_bit(long nr, volatile unsigned long *addr)
{
	asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
}
```

Yes. As we see, it takes the same set of arguments and contains a very similar block of inline assembler. It just uses the [btr](http://x86.renejeschke.de/html/file_module_x86_id_24.html) instruction instead of `bts`. As we can understand from the function's name, it clears the given bit at the given address. The `btr` instruction acts like `bts`: it also selects the bit specified by the first operand and stores its value in the `CF` flag of the flags register, but then clears this bit in the bit array specified by the second operand.

The atomic variant of `__clear_bit` is `clear_bit`:

```C
static __always_inline void
clear_bit(long nr, volatile unsigned long *addr)
{
	if (IS_IMMEDIATE(nr)) {
		asm volatile(LOCK_PREFIX "andb %1,%0"
			: CONST_MASK_ADDR(nr, addr)
			: "iq" ((u8)~CONST_MASK(nr)));
	} else {
		asm volatile(LOCK_PREFIX "btr %1,%0"
			: BITOP_ADDR(addr)
			: "Ir" (nr));
	}
}
```

and as we can see it is very similar to `set_bit`, containing just two differences. The first difference: it uses the `btr` instruction to clear a bit where `set_bit` uses the `bts` instruction to set one. The second difference: it uses a negated mask and the `and` instruction to clear a bit in the given byte where `set_bit` uses the `or` instruction.

That's all. Now we can set and clear a bit in any bit array and we can go on to other operations on bitmasks.
|
||||
|
||||
Most widely used operations on a bit arrays are set and clear bit in a bit array in the Linux kernel. But besides this operations it is useful to do additional operations on a bit array. Yet another widely used operation in the Linux kernel - is to know is a given bit set or not in a bit array. We can achieve this with the help of the `test_bit` macro. This macro is defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file and expands to the call of the `constant_test_bit` or `variable_test_bit` depends on bit number:
|
||||
|
||||
```C
|
||||
#define test_bit(nr, addr) \
|
||||
(__builtin_constant_p((nr)) \
|
||||
? constant_test_bit((nr), (addr)) \
|
||||
: variable_test_bit((nr), (addr)))
|
||||
```
|
||||
|
||||
So, if the `nr` is known in compile time constant, the `test_bit` will be expanded to the call of the `constant_test_bit` function or `variable_test_bit` in other case. Now let's look at implementations of these functions. Let's start from the `variable_test_bit`:
|
||||
|
||||
```C
|
||||
static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
|
||||
{
|
||||
int oldbit;
|
||||
|
||||
asm volatile("bt %2,%1\n\t"
|
||||
"sbb %0,%0"
|
||||
: "=r" (oldbit)
|
||||
: "m" (*(unsigned long *)addr), "Ir" (nr));
|
||||
|
||||
return oldbit;
|
||||
}
|
||||
```
|
||||
|
||||
The `variable_test_bit` function takes similar set of arguments as `set_bit` and other function take. We also may see inline assembly code here which executes [bt](http://x86.renejeschke.de/html/file_module_x86_id_22.html) and [sbb](http://x86.renejeschke.de/html/file_module_x86_id_286.html) instruction. The `bt` or `bit test` instruction selects a given bit which is specified with first operand from the bit array which is specified with the second operand and stores its value in the [CF](https://en.wikipedia.org/wiki/FLAGS_register) bit of flags register. The second `sbb` instruction subtracts first operand from second and subtracts value of the `CF`. So, here write a value of a given bit number from a given bit array to the `CF` bit of flags register and execute `sbb` instruction which calculates: `00000000 - CF` and writes the result to the `oldbit`.
|
||||
|
||||
The `constant_test_bit` function does the same as we saw in the `set_bit`:
|
||||
|
||||
```C
|
||||
static __always_inline int constant_test_bit(long nr, const volatile unsigned long *addr)
|
||||
{
|
||||
return ((1UL << (nr & (BITS_PER_LONG-1))) &
|
||||
(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
|
||||
}
|
||||
```
|
||||
|
||||
It generates a byte where high bit is `1` and other bits are `0` (as we saw in `CONST_MASK`) and applies bitwise [and](https://en.wikipedia.org/wiki/Bitwise_operation#AND) to the byte which contains a given bit number.
|
||||
|
||||
The next widely used bit array related operation is to change bit in a bit array. The Linux kernel provides two helper for this:
|
||||
|
||||
* `__change_bit`;
|
||||
* `change_bit`.
|
||||
|
||||
As you already can guess, these two variants are atomic and non-atomic as for example `set_bit` and `__set_bit`. For the start, let's look at the implementation of the `__change_bit` function:
|
||||
|
||||
```C
|
||||
static inline void __change_bit(long nr, volatile unsigned long *addr)
|
||||
{
|
||||
asm volatile("btc %1,%0" : ADDR : "Ir" (nr));
|
||||
}
|
||||
```
|
||||
|
||||
Pretty easy, is not it? The implementation of the `__change_bit` is the same as `__set_bit`, but instead of `bts` instruction, we are using [btc](http://x86.renejeschke.de/html/file_module_x86_id_23.html). This instruction selects a given bit from a given bit array, stores its value in the `CF` and changes its value by the applying of complement operation. So, a bit with value `1` will be `0` and vice versa:
|
||||
|
||||
```python
|
||||
>>> int(not 1)
|
||||
0
|
||||
>>> int(not 0)
|
||||
1
|
||||
```
|
||||
|
||||
The atomic version of the `__change_bit` is the `change_bit` function:
|
||||
|
||||
```C
|
||||
static inline void change_bit(long nr, volatile unsigned long *addr)
|
||||
{
|
||||
if (IS_IMMEDIATE(nr)) {
|
||||
asm volatile(LOCK_PREFIX "xorb %1,%0"
|
||||
: CONST_MASK_ADDR(nr, addr)
|
||||
: "iq" ((u8)CONST_MASK(nr)));
|
||||
} else {
|
||||
asm volatile(LOCK_PREFIX "btc %1,%0"
|
||||
: BITOP_ADDR(addr)
|
||||
: "Ir" (nr));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
It is similar on `set_bit` function, but also has two differences. The first difference is `xor` operation instead of `or` and the second is `bts` instead of `bts`.
|
||||
|
||||
For this moment we know the most important architecture-specific operations with bit arrays. Time to look at generic bitmap API.

Common bit operations
================================================================================

Besides the architecture-specific API from the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file, the Linux kernel provides a common API for manipulation of bit arrays. As we know from the beginning of this part, we can find it in the [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file and additionally in the [lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c) source code file. But before these source code files, let's look into the [include/linux/bitops.h](https://github.com/torvalds/linux/blob/master/include/linux/bitops.h) header file which provides a set of useful macros. Let's look at some of them.

First of all, let's look at the following four macros:

* `for_each_set_bit`
* `for_each_set_bit_from`
* `for_each_clear_bit`
* `for_each_clear_bit_from`

All of these macros provide an iterator over a certain set of bits in a bit array. The first macro iterates over bits which are set, the second does the same, but starts from a certain bit. The last two macros do the same, but iterate over clear bits. Let's look at the implementation of the `for_each_set_bit` macro:

```C
#define for_each_set_bit(bit, addr, size) \
	for ((bit) = find_first_bit((addr), (size));	\
	     (bit) < (size);				\
	     (bit) = find_next_bit((addr), (size), (bit) + 1))
```

As we may see, it takes three arguments and expands to a loop that runs from the first set bit, which is returned as the result of the `find_first_bit` function, up to the last bit number while it is less than the given size.

Besides these four macros, the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) provides API for rotation of `64-bit` or `32-bit` values and so on.

The next [header](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) file provides API for manipulation of bit arrays. For example, it provides the following two functions:

* `bitmap_zero`;
* `bitmap_fill`.

to clear a bit array and to fill it with `1`s respectively. Let's look at the implementation of the `bitmap_zero` function:

```C
static inline void bitmap_zero(unsigned long *dst, unsigned int nbits)
{
	if (small_const_nbits(nbits))
		*dst = 0UL;
	else {
		unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
		memset(dst, 0, len);
	}
}
```

First of all we can see the check for `nbits`. The `small_const_nbits` is a macro which is defined in the same header [file](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) and looks like:

```C
#define small_const_nbits(nbits) \
	(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
```

As we may see, it checks that `nbits` is a constant known at compile time and that the `nbits` value does not exceed `BITS_PER_LONG` or `64`. If the number of bits does not exceed the amount of bits in a `long` value, we can just set the single word to zero. In the other case we need to calculate how many `long` values we need to fill our bit array and fill it with [memset](http://man7.org/linux/man-pages/man3/memset.3.html).

The implementation of the `bitmap_fill` function is similar to the implementation of the `bitmap_zero` function, except that we fill a given bit array with `0xff` values or `0b11111111`:

```C
static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
{
	unsigned int nlongs = BITS_TO_LONGS(nbits);
	if (!small_const_nbits(nbits)) {
		unsigned int len = (nlongs - 1) * sizeof(unsigned long);
		memset(dst, 0xff, len);
	}
	dst[nlongs - 1] = BITMAP_LAST_WORD_MASK(nbits);
}
```

Besides the `bitmap_fill` and `bitmap_zero` functions, the [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file provides `bitmap_copy` which is similar to `bitmap_zero`, but just uses [memcpy](http://man7.org/linux/man-pages/man3/memcpy.3.html) instead of [memset](http://man7.org/linux/man-pages/man3/memset.3.html). It also provides bitwise operations for bit arrays like `bitmap_and`, `bitmap_or`, `bitmap_xor` and so on. We will not consider the implementation of these functions because it is easy to understand them if you understood everything from this part. Anyway, if you are interested in how these functions are implemented, you may open the [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file and start to research.

That's all.

Links
================================================================================

* [bitmap](https://en.wikipedia.org/wiki/Bit_array)
* [linked data structures](https://en.wikipedia.org/wiki/Linked_data_structure)
* [tree data structures](https://en.wikipedia.org/wiki/Tree_%28data_structure%29)
* [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [atomic operations](https://en.wikipedia.org/wiki/Linearizability)
* [xchg instruction](http://x86.renejeschke.de/html/file_module_x86_id_328.html)
* [cmpxchg instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
* [lock instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [bts instruction](http://x86.renejeschke.de/html/file_module_x86_id_25.html)
* [btr instruction](http://x86.renejeschke.de/html/file_module_x86_id_24.html)
* [bt instruction](http://x86.renejeschke.de/html/file_module_x86_id_22.html)
* [sbb instruction](http://x86.renejeschke.de/html/file_module_x86_id_286.html)
* [btc instruction](http://x86.renejeschke.de/html/file_module_x86_id_23.html)
* [man memcpy](http://man7.org/linux/man-pages/man3/memcpy.3.html)
* [man memset](http://man7.org/linux/man-pages/man3/memset.3.html)
* [CF](https://en.wikipedia.org/wiki/FLAGS_register)
* [inline assembler](https://en.wikipedia.org/wiki/Inline_assembler)
* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
16
Initialization/README.md
Normal file
16
Initialization/README.md
Normal file
@@ -0,0 +1,16 @@

# Kernel initialization process

In this chapter the reader can learn about the complete cycle of kernel initialization, from the first steps after the kernel is decompressed to the first process started by the kernel itself.

*Note* This is not a description of all kernel initialization steps. Only the generic kernel content is covered here; it does not touch on interrupt handling, ACPI, or other parts. The parts not detailed here are described in other chapters.

* [First steps after kernel decompression](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-1.md) - describes the first steps in the kernel.
* [Early interrupt and exception handling](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) - describes early interrupt initialization and the early page fault handler.
* [Last preparations before the kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md) - describes the last preparations before calling start_kernel.
* [Kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-4.md) - describes the first steps in the kernel generic code.
* [Continue of architecture-specific initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-5.md) - describes architecture-specific initialization.
* [Architecture-specific initialization, again](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-6.md) - describes the continuation of the architecture-specific initialization process.
* [The last part of the architecture-specific initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) - describes the end of the architecture-specific initialization process.
* [Scheduler initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-8.md) - describes the preparations before scheduler initialization, and the scheduler initialization itself.
* [RCU initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-9.md) - describes the initialization of RCU.
* [End of initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-10.md) - the last part of Linux kernel initialization.
619
Initialization/linux-initialization-1.md
Normal file
619
Initialization/linux-initialization-1.md
Normal file
@@ -0,0 +1,619 @@

Kernel initialization. Part 1.
================================================================================

First steps in the kernel code
--------------------------------------------------------------------------------

The previous [post](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) was the last part of the Linux kernel [booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter and now we are starting to dive into the initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in the correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code are executed. From now on we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it launches the process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel starts the first `init` process. Hopefully we will see all of the preparations in this big chapter. We will start from the kernel entry point, which is located in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S), and will move further and further. We will see the first preparations like early page table initialization, the switch to a new descriptor in kernel space and many many more, before we see the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) called.

In the last [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) we stopped at the [jmp](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file:

```assembly
jmp	*%rax
```

At this moment the `rax` register contains the address of the Linux kernel entry point which was obtained as a result of the call of the `decompress_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file. So, our last instruction in the kernel setup code is a jump to the kernel entry point. We already know where the entry point of the Linux kernel is defined, so we are able to start learning what the Linux kernel does after the start.

First steps in the kernel
--------------------------------------------------------------------------------

Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into the `rax` register and just jumped there. As we already know, the entry point of the decompressed kernel image starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly source code file and at the beginning of it, we can see the following definitions:

```assembly
	__HEAD
	.code64
	.globl startup_64
startup_64:
	...
	...
	...
```

We can see the definition of the `startup_64` routine in the `__HEAD` section. `__HEAD` is just a macro which expands to the definition of the executable `.head.text` section:

```C
#define __HEAD .section ".head.text","ax"
```

We can see the definition of this section in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S#L93) linker script:

```
.text : AT(ADDR(.text) - LOAD_OFFSET) {
	_text = .;
	...
	...
	...
} :text = 0x9090
```

Besides the definition of the `.text` section, we can learn the default virtual and physical addresses from the linker script. Note that the address of `_text` is the location counter, which is defined as:

```
. = __START_KERNEL;
```

for [x86_64](https://en.wikipedia.org/wiki/X86-64). The definition of the `__START_KERNEL` macro is located in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h) header file and is represented by the sum of the base virtual address of the kernel mapping and the physical start:

```C
#define __START_KERNEL	(__START_KERNEL_map + __PHYSICAL_START)

#define __PHYSICAL_START  ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
```

Or in other words:

* Base physical address of the Linux kernel - `0x1000000`;
* Base virtual address of the Linux kernel - `0xffffffff81000000`.

Now we know the default physical and virtual addresses of the `startup_64` routine, but to know the actual addresses we must calculate them with the following code:

```assembly
	leaq	_text(%rip), %rbp
	subq	$_text - __START_KERNEL_map, %rbp
```

Yes, it is defined as `0x1000000`, but it may be different, for example if [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) is enabled. So our current goal is to calculate the delta between `0x1000000` and where we are actually loaded. Here we just put the `rip-relative` address into the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that the compiled virtual address of `_text` is `0xffffffff81000000` and its physical address is `0x1000000`. The `__START_KERNEL_map` macro expands to the `0xffffffff80000000` address, so at the second line of the assembly code, we will get the following expression:

```
rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)
```

So, after the calculation, `rbp` will contain `0`, which represents the difference between the address where we are actually loaded and the address where the code was compiled. In our case `zero` means that the Linux kernel was loaded at the default address and [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) was disabled.

After we got the address of `startup_64`, we need to check that this address is correctly aligned. We will do it with the following code:

```assembly
	testl	$~PMD_PAGE_MASK, %ebp
	jnz	bad_address
```

Here we just compare the low part of the `rbp` register with the complemented value of the `PMD_PAGE_MASK`. The `PMD_PAGE_MASK` indicates the mask for the `Page middle directory` (read [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) about it) and is defined as:

```C
#define PMD_PAGE_MASK           (~(PMD_PAGE_SIZE-1))

#define PMD_PAGE_SIZE           (_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT       21
```

As we can easily calculate, `PMD_PAGE_SIZE` is `2` megabytes. Here we use the standard formula for checking alignment and if the `text` address is not aligned to `2` megabytes, we jump to the `bad_address` label.

After this we check that the address is not too large by checking the highest `18` bits:

```assembly
	leaq	_text(%rip), %rax
	shrq	$MAX_PHYSMEM_BITS, %rax
	jnz	bad_address
```

The address must not be greater than `46` bits:

```C
#define MAX_PHYSMEM_BITS       46
```

Okay, we did some early checks and now we can move on.

Fix base addresses of page tables
--------------------------------------------------------------------------------

The first step before we start to set up identity paging is to fix up the following addresses:

```assembly
	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
```

The addresses of `early_level4_pgt`, `level3_kernel_pgt` and the others may be wrong if `startup_64` is not equal to the default `0x1000000` address. The `rbp` register contains the delta address, so we add it to the certain entries of `early_level4_pgt`, `level3_kernel_pgt` and `level2_fixmap_pgt`. Let's try to understand what these labels mean. First of all let's look at their definition:

```assembly
NEXT_PAGE(early_level4_pgt)
	.fill	511,8,0
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level3_kernel_pgt)
	.fill	L3_START_KERNEL,8,0
	.quad	level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.quad	level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level2_kernel_pgt)
	PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
		KERNEL_IMAGE_SIZE/PMD_SIZE)

NEXT_PAGE(level2_fixmap_pgt)
	.fill	506,8,0
	.quad	level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
	.fill	5,8,0

NEXT_PAGE(level1_fixmap_pgt)
	.fill	512,8,0
```

It looks hard, but it isn't. First of all let's look at `early_level4_pgt`. It starts with `(4096 - 8)` bytes of zeros, which means that we don't use the first `511` entries. After this we can see one `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map` from it and add `_PAGE_TABLE`. As we know, `__START_KERNEL_map` is the base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we will get the physical address of `level3_kernel_pgt`. Now let's look at `_PAGE_TABLE`, it is just page entry access rights:

```C
#define _PAGE_TABLE     (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                         _PAGE_ACCESSED | _PAGE_DIRTY)
```

You can read more about it in the [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) part.

The `level3_kernel_pgt` stores two entries which map kernel space. At the start of its definition, we can see that it is filled with zeros `L3_START_KERNEL` or `510` times. Here the `L3_START_KERNEL` is the index in the page upper directory which contains the `__START_KERNEL_map` address, and it equals `510`. After this, we can see the definition of the two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. The first is simple, it is a page table entry which contains a pointer to the page middle directory which maps kernel space, and it has:

```C
#define _KERNPG_TABLE   (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
                         _PAGE_DIRTY)
```

access rights. The second, `level2_fixmap_pgt`, covers virtual addresses which can refer to any physical addresses even under kernel space. It is represented by the one `level1_fixmap_pgt` entry and a `10` megabytes hole for the [vsyscalls](https://lwn.net/Articles/446528/) mapping. The next `level2_kernel_pgt` calls the `PMDS` macro which creates `512` megabytes from `__START_KERNEL_map` for the kernel `.text` (after these `512` megabytes comes the module memory space).

Now, after we saw the definitions of these symbols, let's get back to the code which is described at the beginning of this section. Remember that the `rbp` register contains the delta between the address of the `startup_64` symbol which was obtained during kernel [linking](https://en.wikipedia.org/wiki/Linker_%28computing%29) and the actual address. So, for this moment, we just need to add this delta to the base address of some page table entries, so that they'll have correct addresses. In our case these entries are:

```assembly
	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
```

or the last entry of `early_level4_pgt` which is `level3_kernel_pgt`, the last two entries of `level3_kernel_pgt` which are `level2_kernel_pgt` and `level2_fixmap_pgt`, and the 507th entry of `level2_fixmap_pgt` which is the `level1_fixmap_pgt` page directory.

After all of this we will have:

```
early_level4_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0]   -> 512 MB kernel mapping
level2_fixmap_pgt[507] -> level1_fixmap_pgt
```

Note that we didn't fix up the base address of `early_level4_pgt` and some of the other page table directories, because we will see this during the building/filling of the structures for these page tables. As we corrected the base addresses of the page tables, we can start to build them.

Identity mapping setup
--------------------------------------------------------------------------------

Now we can see the setup of the identity mapping in the early page tables. In identity mapped paging, virtual addresses are mapped to physical addresses that have the same value, `1 : 1`. Let's look at it in detail. First of all we get the `rip-relative` addresses of `_text` and `early_level4_pgt` and put them into the `rdi` and `rbx` registers:

```assembly
	leaq	_text(%rip), %rdi
	leaq	early_level4_pgt(%rip), %rbx
```

After this we store the address of `_text` in `rax` and get the index of the page global directory entry which stores the `_text` address, by shifting the `_text` address by `PGDIR_SHIFT`:

```assembly
	movq	%rdi, %rax
	shrq	$PGDIR_SHIFT, %rax

	leaq	(4096 + _KERNPG_TABLE)(%rbx), %rdx
	movq	%rdx, 0(%rbx,%rax,8)
	movq	%rdx, 8(%rbx,%rax,8)
```

where `PGDIR_SHIFT` is `39`. `PGDIR_SHIFT` indicates the mask for the page global directory bits in a virtual address. There are macros for all types of page directories:

```C
#define PGDIR_SHIFT     39
#define PUD_SHIFT       30
#define PMD_SHIFT       21
```

After this we put the address of the first `level3_kernel_pgt` into `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill `early_level4_pgt` with the two `level3_kernel_pgt` entries.

After this we add `4096` (the size of the `early_level4_pgt`) to `rdx` (it now contains the address of the first entry of the `level3_kernel_pgt`) and put `rdi` (it now contains the physical address of `_text`) into `rax`. After this we write the addresses of the two page upper directory entries to `level3_kernel_pgt`:

```assembly
	addq	$4096, %rdx
	movq	%rdi, %rax
	shrq	$PUD_SHIFT, %rax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)
	incl	%eax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)
```

In the next step we write the addresses of the page middle directory entries to `level2_kernel_pgt` and the last step is correcting the kernel text+data virtual addresses:

```assembly
	leaq	level2_kernel_pgt(%rip), %rdi
	leaq	4096(%rdi), %r8
1:	testq	$1, 0(%rdi)
	jz	2f
	addq	%rbp, 0(%rdi)
2:	addq	$8, %rdi
	cmp	%r8, %rdi
	jne	1b
```

Here we put the address of `level2_kernel_pgt` into `rdi` and the address of the end of the page table into the `r8` register. Next we check the present bit in each `level2_kernel_pgt` entry, and if it is zero we move to the next entry by adding 8 bytes to `rdi`, which contains the address of the current entry. After this we compare it with `r8` (which contains the address of the end of the page table) and go back to the label `1` or move forward.

In the next step we correct the `phys_base` physical address with `rbp` (which contains the delta), put the physical address of `early_level4_pgt` into `rax` and jump to the label `1`:

```assembly
	addq	%rbp, phys_base(%rip)
	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
	jmp 1f
```

where `phys_base` matches the first entry of `level2_kernel_pgt` which is the `512` MB kernel mapping.

Last preparation before jump at the kernel entry point
--------------------------------------------------------------------------------

After we jump to the label `1` we enable `PAE` and `PGE` (Paging Global Extension), add `phys_base` (see above) to the physical address of `early_level4_pgt` in the `rax` register and fill the `cr3` register with the result:

```assembly
1:
	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
	movq	%rcx, %cr4

	addq	phys_base(%rip), %rax
	movq	%rax, %cr3
```

In the next step we check that the CPU supports the [NX](http://en.wikipedia.org/wiki/NX_bit) bit with:

```assembly
	movl	$0x80000001, %eax
	cpuid
	movl	%edx,%edi
```

We put the `0x80000001` value into `eax` and execute the `cpuid` instruction to get the extended processor info and feature bits. The result will be in the `edx` register, which we put into `edi`.

Now we put `0xc0000080` or `MSR_EFER` into `ecx` and call the `rdmsr` instruction to read the model specific register:

```assembly
	movl	$MSR_EFER, %ecx
	rdmsr
```

The result will be in `edx:eax`. The general view of the `EFER` is the following:

```
63                                                                              32
 --------------------------------------------------------------------------------
|                                                                                |
|                                Reserved MBZ                                    |
|                                                                                |
 --------------------------------------------------------------------------------
31                            16  15  14      13    12   11  10  9   8   7 1   0
 --------------------------------------------------------------------------------
|                              | T |       |       |    |   |   |   |   |   |   |
|        Reserved MBZ          | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
|                              | E |       |       |    |   |   |   |   |   |   |
 --------------------------------------------------------------------------------
```

We will not look at all the fields in detail here, but we will learn about this and other `MSRs` in a special part about them. As we read `EFER` into `edx:eax`, we set the `_EFER_SCE` or zero bit, which is `System Call Extensions`, with the `btsl` instruction. By setting the `SCE` bit we enable the `SYSCALL` and `SYSRET` instructions. In the next step we check the 20th bit in `edi`, remember that this register stores the result of `cpuid` (see above). If the `20` bit (`NX` bit) is not set, we just write the value with the `SCE` bit to the model specific register; otherwise we additionally enable the `NX` bit first:

```assembly
	btsl	$_EFER_SCE, %eax
	btl	$20,%edi
	jnc     1f
	btsl	$_EFER_NX, %eax
	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
1:	wrmsr
```

If the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is supported we enable `_EFER_NX` and write it too, with the `wrmsr` instruction. After the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is set, we set some bits in the `cr0` [control register](https://en.wikipedia.org/wiki/Control_register), namely:

* `X86_CR0_PE` - system is in protected mode;
* `X86_CR0_MP` - controls the interaction of the WAIT/FWAIT instructions with the TS flag in CR0;
* `X86_CR0_ET` - on the 386, it allowed to specify whether the external math coprocessor was an 80287 or 80387;
* `X86_CR0_NE` - enables internal x87 floating point error reporting when set, else enables PC style x87 error detection;
* `X86_CR0_WP` - when set, the CPU can't write to read-only pages when the privilege level is 0;
* `X86_CR0_AM` - alignment check enabled if AM set, AC flag (in the EFLAGS register) set, and privilege level is 3;
* `X86_CR0_PG` - enable paging.

by executing the following assembly code:

```assembly
#define CR0_STATE	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
			 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
			 X86_CR0_PG)
	movl	$CR0_STATE, %eax
	movq	%rax, %cr0
```

We already know that to run any code, and even more so [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code from assembly, we need to set up a stack. As always, we are doing it by setting the [stack pointer](https://en.wikipedia.org/wiki/Stack_register) to a correct place in memory and resetting the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register after this:

```assembly
	movq stack_start(%rip), %rsp
	pushq $0
	popfq
```

The most interesting thing here is `stack_start`. It is defined in the same [source](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) code file and looks like:

```assembly
GLOBAL(stack_start)
	.quad  init_thread_union+THREAD_SIZE-8
```

The `GLOBAL` macro is already familiar to us. It is defined in the [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) header file and expands to a `global` symbol definition:

```C
#define GLOBAL(name)	\
	.globl name;	\
	name:
```

The `THREAD_SIZE` macro is defined in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h) header file and depends on the value of the `KASAN_STACK_ORDER` macro:

```C
#define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
```

We consider the case when [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) is disabled and `PAGE_SIZE` is `4096` bytes. So `THREAD_SIZE` will expand to `16` kilobytes and represents the size of the stack of a thread. Why `thread`? You may already know that each [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may have a parent [process](https://en.wikipedia.org/wiki/Parent_process) and [child](https://en.wikipedia.org/wiki/Child_process) processes. Actually, a parent process and child process differ in stack. A new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by the [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with the `thread_info` structure.

And as we can see, the `init_thread_union` is represented by the `thread_union`, which is defined as:

```C
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};
```

and `init_thread_union` looks like:

```C
union thread_union init_thread_union __init_task_data =
	{ INIT_THREAD_INFO(init_task) };
```

where the `INIT_THREAD_INFO` macro takes the `task_struct` structure which represents a process descriptor in the Linux kernel and does some basic initialization of the given `task_struct` structure:

```C
#define INIT_THREAD_INFO(tsk)			\
{						\
	.task		= &tsk,			\
	.flags		= 0,			\
	.cpu		= 0,			\
	.addr_limit	= KERNEL_DS,		\
}
```

So, the `thread_union` contains low-level information about a process and the process's stack, and is placed at the bottom of the stack:

```
+-----------------------+
|                       |
|                       |
|                       |
|      Kernel stack     |
|                       |
|                       |
|                       |
|-----------------------|
|                       |
|   struct thread_info  |
|                       |
+-----------------------+
```

Note that we reserve `8` bytes at the top of the stack. This is necessary to protect against illegal access of the next memory page.
|
||||
|
||||
After the early boot stack is set, we update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with the `lgdt` instruction:

```assembly
lgdt	early_gdt_descr(%rip)
```

where `early_gdt_descr` is defined as:

```assembly
early_gdt_descr:
	.word	GDT_ENTRIES*8-1
early_gdt_descr_base:
	.quad	INIT_PER_CPU_VAR(gdt_page)
```

We need to reload the `Global Descriptor Table` because now the kernel works in the low userspace addresses, but soon it will work in its own address space. Now let's look at the definition of `early_gdt_descr`. The Global Descriptor Table contains `32` entries:

```C
#define GDT_ENTRIES 32
```

for the kernel code, data, thread-local storage segments, etc. It's simple. Now let's look at `early_gdt_descr_base`. First of all, `gdt_page` is defined as:

```C
struct gdt_page {
	struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));
```

in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h). It contains one field `gdt`, which is an array of `desc_struct` structures, defined as:

```C
struct desc_struct {
	union {
		struct {
			unsigned int a;
			unsigned int b;
		};
		struct {
			u16 limit0;
			u16 base0;
			unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
			unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
		};
	};
} __attribute__((packed));
```

and represents the familiar `GDT` descriptor. Also we can note that the `gdt_page` structure is aligned to `PAGE_SIZE`, which is `4096` bytes, which means that the `gdt` will occupy one page. Now let's try to understand what `INIT_PER_CPU_VAR` is. It is a macro defined in [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) which just concatenates `init_per_cpu__` with the given parameter:

```C
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var
```

After the `INIT_PER_CPU_VAR` macro is expanded, we will have `init_per_cpu__gdt_page`. We can see it in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):

```
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
```

So `INIT_PER_CPU_VAR` gives us `init_per_cpu__gdt_page`, and when the `INIT_PER_CPU` macro from the linker script is expanded, that symbol is defined as an offset from `__per_cpu_load`. After these calculations, we have the correct base address of the new GDT.

Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name: when we create a `per-CPU` variable, each CPU has its own copy of this variable. Here we create the `gdt_page` per-CPU variable. There are many advantages to variables of this type - for example, there are no locks, because each CPU works with its own copy. So every core of a multiprocessor system has its own `GDT` table, and every entry in that table represents a memory segment which can be accessed from a thread running on that core. You can read about `per-CPU` variables in detail in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post.

Having loaded the new Global Descriptor Table, we reload the segment registers as usual:

```assembly
	xorl %eax,%eax
	movl %eax,%ds
	movl %eax,%ss
	movl %eax,%es
	movl %eax,%fs
	movl %eax,%gs
```

After all of these steps we set up the `gs` register so that it points to the `irqstack`, a special stack on which [interrupts](https://en.wikipedia.org/wiki/Interrupt) will be handled:

```assembly
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr
```

where `MSR_GS_BASE` is:

```C
#define MSR_GS_BASE 0xc0000101
```

We need to put `MSR_GS_BASE` into the `ecx` register and load the data from `eax` and `edx` (which point to `initial_gs`) with the `wrmsr` instruction. We don't use the `cs`, `fs`, `ds` and `ss` segment registers for addressing in 64-bit mode, but the `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw in real mode for `cs`), and this part contains a descriptor which is mapped to [Model Specific Registers](https://en.wikipedia.org/wiki/Model-specific_register). So we can see above that `0xc0000101` is the `gs.base` MSR address. When a [system call](https://en.wikipedia.org/wiki/System_call) or [interrupt](https://en.wikipedia.org/wiki/Interrupt) occurs, there is no kernel stack at the entry point, so the value of `MSR_GS_BASE` will store the address of the interrupt stack.

In the next step we put the address of the real-mode `bootparam` structure into `rdi` (remember that `rsi` has held the pointer to this structure since the start) and jump to the C code with:

```assembly
	movq	initial_code(%rip),%rax
	pushq	$0
	pushq	$__KERNEL_CS
	pushq	%rax
	lretq
```

Here we put the address of `initial_code` into `rax` and push a fake return address, `__KERNEL_CS` and the address of `initial_code` onto the stack. After this comes the `lretq` instruction, which pops the return address from the stack (it is now the address of `initial_code`) and jumps there. `initial_code` is defined in the same source file and looks like:

```assembly
	.balign	8
	GLOBAL(initial_code)
	.quad	x86_64_start_kernel
	...
	...
	...
```

As we can see, `initial_code` contains the address of `x86_64_start_kernel`, which is defined in [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this:

```C
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
	...
	...
	...
}
```

It takes one argument - `real_mode_data` (remember that we passed the address of the real-mode data in the `rdi` register previously).

This is the first C code in the kernel!

Next to start_kernel
--------------------------------------------------------------------------------

We need to see the last preparations before we can reach the "kernel entry point" - the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489).

First of all we can see some checks in the `x86_64_start_kernel` function:

```C
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
```

These checks verify different things - for example, that the virtual address of the module space is not lower than the base address of the kernel text, `__START_KERNEL_map`, that the kernel text together with the modules does not exceed the kernel image size, etc. `BUILD_BUG_ON` is a macro which looks like this:

```C
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
```

Let's try to understand how this trick works. Let's take the first condition as an example: `MODULES_VADDR < __START_KERNEL_map`. `!!condition` is the same as `condition != 0`. So if `MODULES_VADDR < __START_KERNEL_map` is true, we will get `1` from `!!(condition)`, or zero otherwise. After `2*!!(condition)` we will get either `2` or `0`. So the calculation can end in two different ways:

* We get a compilation error, because we try to take the size of a char array with a negative index (as would happen in our case, since `MODULES_VADDR` can't be less than `__START_KERNEL_map`);
* No compilation errors.

That's all. An interesting C trick for getting a compile-time error which depends on some constants.

In the next step we can see the call of the `cr4_init_shadow` function, which stores a shadow copy of `cr4` per CPU. Context switches can change bits in `cr4`, so we need to store a copy of `cr4` for each CPU. After this we can see the call of the `reset_early_page_tables` function, which resets all page global directory entries and writes a new pointer to the PGT into `cr3`:

```C
	for (i = 0; i < PTRS_PER_PGD-1; i++)
		early_level4_pgt[i].pgd = 0;

	next_early_pgt = 0;

	write_cr3(__pa_nodebug(early_level4_pgt));
```

Soon we will build new page tables. Here we go through all page global directory entries (`PTRS_PER_PGD` is `512`) in a loop and zero each of them. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write the physical address of `early_level4_pgt` to `cr3`. `__pa_nodebug` is a macro which expands to:

```C
((unsigned long)(x) - __START_KERNEL_map + phys_base)
```

After this we clear the `bss` section from `__bss_start` to `__bss_stop`. The next step will be the setup of the early `IDT` handlers, but that's a big concept, so we will see it in the next part.

Conclusion
--------------------------------------------------------------------------------

This is the end of the first part about Linux kernel initialization.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

In the next part we will see the initialization of the early interrupt handlers, kernel space memory mapping and a lot more.

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register)
* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
* [Previous part - Kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html)
* [NX](http://en.wikipedia.org/wiki/NX_bit)
* [ASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization)

473
Initialization/linux-initialization-10.md
Normal file
@@ -0,0 +1,473 @@

Kernel initialization. Part 10.
================================================================================

End of the linux kernel initialization process
================================================================================

This is the tenth part of the chapter about the Linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In the [previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped at the call of the `acpi_early_init` function. This will be the last part of the [Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish it.

After the call of the `acpi_early_init` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we can see the following code:

```C
#ifdef CONFIG_X86_ESPFIX64
	init_espfix_bsp();
#endif
```

Here we can see the call of the `init_espfix_bsp` function, which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in [arch/x86/kernel/espfix_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/espfix_64.c) and prevents leaking of bits `31:16` of the `esp` register when returning to a 16-bit stack. First of all we install the `espfix` page upper directory into the kernel page directory in `init_espfix_bsp`:

```C
pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
```

Where `ESPFIX_BASE_ADDR` is:

```C
#define PGDIR_SHIFT      39
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)
```

Also we can find it in the [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):

```
... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...
```

After we've filled the page global directory with the `espfix` pud, the next step is the call of the `init_espfix_random` and `init_espfix_ap` functions. The first function generates a random location for the `espfix` page and the second enables `espfix` for the current CPU. After `init_espfix_bsp` has finished its work, we can see the call of the `thread_info_cache_init` function, which is defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and allocates a cache for `thread_info` if `THREAD_SIZE` is less than `PAGE_SIZE`:

```C
# if THREAD_SIZE >= PAGE_SIZE
...
...
...
void thread_info_cache_init(void)
{
	thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
					      THREAD_SIZE, 0, NULL);
	BUG_ON(thread_info_cache == NULL);
}
...
...
...
#endif
```

As we already know, `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` or `4096` bytes, and `THREAD_SIZE` is `(PAGE_SIZE << THREAD_SIZE_ORDER)` or `16384` bytes on `x86_64`. The next function after `thread_info_cache_init` is `cred_init` from [kernel/cred.c](https://github.com/torvalds/linux/blob/master/kernel/cred.c). This function just allocates a cache for the credentials (like `uid`, `gid`, etc.):

```C
void __init cred_init(void)
{
	cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred),
				     0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
}
```

You can read more about credentials in [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt). The next step is the `fork_init` function from [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). The `fork_init` function allocates a cache for the `task_struct`. Let's look at the implementation of `fork_init`. First of all we can see the definition of the `ARCH_MIN_TASKALIGN` macro and the creation of a slab where `task_struct`s will be allocated:

```C
#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
#endif
	task_struct_cachep =
		kmem_cache_create("task_struct", sizeof(struct task_struct),
			ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
#endif
```

As we can see, this code depends on the `CONFIG_ARCH_TASK_STRUCT_ALLOCATOR` kernel configuration option. This configuration option indicates the presence of an `alloc_task_struct` function for the given architecture. As `x86_64` has no `alloc_task_struct` function, this code will not even be compiled on `x86_64`.

Allocating cache for init task
--------------------------------------------------------------------------------

After this we can see the call of the `arch_task_cache_init` function in the `fork_init`:

```C
void arch_task_cache_init(void)
{
        task_xstate_cachep =
                kmem_cache_create("task_xstate", xstate_size,
                                  __alignof__(union thread_xstate),
                                  SLAB_PANIC | SLAB_NOTRACK, NULL);
        setup_xstate_comp();
}
```

The `arch_task_cache_init` function initializes the architecture-specific caches. In our case the architecture is `x86_64`, so as we can see, `arch_task_cache_init` allocates a cache for the `task_xstate`, which represents the [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) state, and sets up the offsets and sizes of all extended states in the [xsave](http://www.felixcloutier.com/x86/XSAVES.html) area with the call of the `setup_xstate_comp` function. After `arch_task_cache_init` we calculate the default maximum number of threads with:

```C
set_max_threads(MAX_THREADS);
```

where the default maximum number of threads is:

```C
#define FUTEX_TID_MASK	0x3fffffff
#define MAX_THREADS	FUTEX_TID_MASK
```

At the end of the `fork_init` function we initialize the [signal](http://www.win.tue.nl/~aeb/linux/lk/lk-5.html) handler:

```C
	init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
	init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
	init_task.signal->rlim[RLIMIT_SIGPENDING] =
		init_task.signal->rlim[RLIMIT_NPROC];
```

As we know, `init_task` is an instance of the `task_struct` structure, so it contains a `signal` field which represents the signal handler and has the type `struct signal_struct`. In the first two lines we can see the setting of the current and maximum limits of the `resource limits`. Every process has an associated set of resource limits, which specify the amount of resources the process can use. Here `rlim` is the resource control limit, represented by the:

```C
struct rlimit {
	__kernel_ulong_t	rlim_cur;
	__kernel_ulong_t	rlim_max;
};
```

structure from [include/uapi/linux/resource.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/resource.h). In our case the resources are `RLIMIT_NPROC`, the maximum number of processes a user can own, and `RLIMIT_SIGPENDING`, the maximum number of pending signals. We can see them in:

```C
cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
...
...
...
Max processes             63815                63815                processes
Max pending signals       63815                63815                signals
...
...
...
```

Initialization of the caches
--------------------------------------------------------------------------------

The next function after `fork_init` is `proc_caches_init` from [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). This function allocates caches for the memory descriptors (the `mm_struct` structure). At the beginning of `proc_caches_init` we can see the allocation of different [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) caches with the call of `kmem_cache_create`:

* `sighand_cachep` - manages information about installed signal handlers;
* `signal_cachep` - manages information about the process signal descriptor;
* `files_cachep` - manages information about opened files;
* `fs_cachep` - manages filesystem information.

After this we allocate a `SLAB` cache for the `mm_struct` structures:

```C
	mm_cachep = kmem_cache_create("mm_struct",
			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
```

After this we allocate a `SLAB` cache for the important `vm_area_struct`, which is used by the kernel to manage virtual memory space:

```C
	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
```

Note that we use the `KMEM_CACHE` macro here instead of `kmem_cache_create`. This macro is defined in [include/linux/slab.h](https://github.com/torvalds/linux/blob/master/include/linux/slab.h) and just expands to a `kmem_cache_create` call:

```C
#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
		sizeof(struct __struct), __alignof__(struct __struct),\
		(__flags), NULL)
```

`KMEM_CACHE` has one difference from `kmem_cache_create` - take a look at the `__alignof__` operator. The `KMEM_CACHE` macro aligns the `SLAB` to the alignment of the given structure, while `kmem_cache_create` uses a given value for the alignment. After this we can see the calls of the `mmap_init` and `nsproxy_cache_init` functions. The first function initializes the virtual memory area `SLAB` and the second initializes the `SLAB` for namespaces.

The next function after `proc_caches_init` is `buffer_init`. This function is defined in the [fs/buffer.c](https://github.com/torvalds/linux/blob/master/fs/buffer.c) source code file and allocates a cache for the `buffer_head`. The `buffer_head` is a special structure defined in [include/linux/buffer_head.h](https://github.com/torvalds/linux/blob/master/include/linux/buffer_head.h) and used for managing buffers. At the start of the `buffer_init` function we allocate a cache for the `struct buffer_head` structures with the call of the `kmem_cache_create` function, as we did in the previous functions, and calculate the maximum number of buffer heads that can be held in memory:

```C
	nrpages = (nr_free_buffer_pages() * 10) / 100;
	max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
```

which will be equal to `10%` of `ZONE_NORMAL` (all RAM from 4GB up on `x86_64`). The next function after `buffer_init` is `vfs_caches_init`. This function allocates `SLAB` caches and hashtables for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function, which initialized caches for the `dcache` (or directory cache) and the [inode](http://en.wikipedia.org/wiki/Inode) cache, in the eighth part of the Linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, the private data cache, hash tables for the mount points, etc. More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in a separate part. After this we can see the `signals_init` function. This function is defined in [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) and allocates a cache for the `sigqueue` structures, which represent queues of real-time signals. The next function is `page_writeback_init`. This function initializes the ratio of dirty pages. Every low-level page entry contains a `dirty` bit which indicates whether the page has been written to after being loaded into memory.

Creation of the root for the procfs
--------------------------------------------------------------------------------

After all of these preparations we need to create the root for the [proc](http://en.wikipedia.org/wiki/Procfs) filesystem. We do this with the call of the `proc_root_init` function from [fs/proc/root.c](https://github.com/torvalds/linux/blob/master/fs/proc/root.c). At the start of the `proc_root_init` function we allocate a cache for the inodes and register a new filesystem in the system with:

```C
	err = register_filesystem(&proc_fs_type);
	if (err)
		return;
```

As I wrote above, we will not dive into the details of [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) and the different filesystems in this chapter, but will see them in the chapter about the `VFS`. After we've registered the new filesystem in our system, we call the `proc_self_init` function from [fs/proc/self.c](https://github.com/torvalds/linux/blob/master/fs/proc/self.c), which allocates an `inode` number for `self` (the `/proc/self` directory refers to the process accessing the `/proc` filesystem). The next step after `proc_self_init` is `proc_setup_thread_self`, which sets up the `/proc/thread-self` directory containing information about the current thread. After this we create the `/proc/self/mounts` symlink, which will contain the mount points, with the call of:

```C
proc_symlink("mounts", NULL, "self/mounts");
```

and a couple of directories, depending on the different configuration options:

```C
#ifdef CONFIG_SYSVIPC
	proc_mkdir("sysvipc", NULL);
#endif
	proc_mkdir("fs", NULL);
	proc_mkdir("driver", NULL);
	proc_mkdir("fs/nfsd", NULL);
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
	proc_mkdir("openprom", NULL);
#endif
	proc_mkdir("bus", NULL);
	...
	...
	...
	if (!proc_mkdir("tty", NULL))
		return;
	proc_mkdir("tty/ldisc", NULL);
	...
	...
	...
```

At the end of `proc_root_init` we call the `proc_sys_init` function, which creates the `/proc/sys` directory and initializes [Sysctl](http://en.wikipedia.org/wiki/Sysctl).

This is the end of the `start_kernel` function. I did not describe all of the functions called in `start_kernel`; I skipped them because they are not important for the generic kernel initialization and depend only on different kernel configurations. They are `taskstats_init_early`, which exports per-task statistics to userspace, `delayacct_init`, which initializes per-task delay accounting, `key_init` and `security_init`, which initialize different security facilities, `check_bugs`, which fixes some architecture-dependent bugs, the `ftrace_init` function, which initializes [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init`, which initializes the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem, etc. Many of these parts and subsystems will be described in other chapters.

That's all. We have finally passed through the long, long `start_kernel` function. But it is not the end of the Linux kernel initialization process - we haven't run the first process yet. At the end of `start_kernel` we can see the last call - the `rest_init` function. Let's go ahead.

First steps after the start_kernel
--------------------------------------------------------------------------------

The `rest_init` function is defined in the same source code file as the `start_kernel` function - [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). In the beginning of `rest_init` we can see the call of the two following functions:

```C
	rcu_scheduler_starting();
	smpboot_thread_init();
```

The first, `rcu_scheduler_starting`, makes the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) scheduler active, and the second, `smpboot_thread_init`, registers the `smpboot_thread_notifier` CPU notifier (you can read more about it in the [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)). After this we can see the following calls:

```C
	kernel_thread(kernel_init, NULL, CLONE_FS);
	pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
```

Here the `kernel_thread` function (defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c)) creates a new kernel thread. As we can see, the `kernel_thread` function takes three arguments:

* The function which will be executed in the new thread;
* The parameter for the `kernel_init` function;
* Flags.

We will not dive into the details of the `kernel_thread` implementation (we will see it in the chapter which describes the scheduler; for now it's enough to say that `kernel_thread` invokes [clone](http://www.tutorialspoint.com/unix_system_calls/clone.htm)). Now we only need to know that we create new kernel threads with the `kernel_thread` function: parent and child threads share information about the filesystem, and the new thread starts by executing the given function, `kernel_init` here. A kernel thread differs from a user thread in that it runs in kernel mode. So with these two `kernel_thread` calls we create two new kernel threads: one with `PID = 1` for the `init` process and one with `PID = 2` for `kthreadd`. We already know what the `init` process is. Let's look at `kthreadd`. It is a special kernel thread which manages and helps different parts of the kernel create other kernel threads. We can see it in the output of the `ps` utility:

```C
$ ps -ef | grep kthread
root         2     0  0 Jan11 ?        00:00:00 [kthreadd]
```

Let's postpone `kernel_init` and `kthreadd` for now and go ahead in the `rest_init`. In the next step, after we have created the two new kernel threads, we can see the following code:

```C
	rcu_read_lock();
	kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
	rcu_read_unlock();
```

The first function, `rcu_read_lock`, marks the beginning of an [RCU](http://en.wikipedia.org/wiki/Read-copy-update) read-side critical section, and `rcu_read_unlock` marks its end. We call these functions because we need to protect `find_task_by_pid_ns`. The `find_task_by_pid_ns` function returns a pointer to the `task_struct` for the given pid. So, here we are getting the pointer to the `task_struct` for `PID = 2` (we got this pid after creating `kthreadd` with `kernel_thread`). In the next step we call the `complete` function

```C
	complete(&kthreadd_done);
```

and pass the address of `kthreadd_done`. `kthreadd_done` is defined as

```C
static __initdata DECLARE_COMPLETION(kthreadd_done);
```

where the `DECLARE_COMPLETION` macro is defined as:

```C
#define DECLARE_COMPLETION(work) \
	struct completion work = COMPLETION_INITIALIZER(work)
```

and expands to the definition of the `completion` structure. This structure is defined in the [include/linux/completion.h](https://github.com/torvalds/linux/blob/master/include/linux/completion.h) and presents `completions` concept. Completions is a code synchronization mechanism which provides race-free solution for the threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts: The first is definition of the `complete` structure and we did it with the `DECLARE_COMPLETION`. The second is call of the `wait_for_completion`. After the call of this function, a thread which called it will not continue to execute and will wait while other thread did not call `complete` function. Note that we call `wait_for_completion` with the `kthreadd_done` in the beginning of the `kernel_init_freeable`:
|
||||
|
||||
```C
|
||||
wait_for_completion(&kthreadd_done);
|
||||
```

And the last part is the call of the `complete` function, as we saw above. So the `kernel_init_freeable` function will not be executed until the `kthreadd` thread has been set up. After `kthreadd` is ready, we can see the three following calls in `rest_init`:

```C
init_idle_bootup_task(current);
schedule_preempt_disabled();
cpu_startup_entry(CPUHP_ONLINE);
```

The first `init_idle_bootup_task` function from [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) sets the scheduling class for the current process (the `idle` class in our case):

```C
void init_idle_bootup_task(struct task_struct *idle)
{
	idle->sched_class = &idle_sched_class;
}
```

The `idle` class is for low-priority tasks which run only when the processor has nothing to do besides these tasks. The second function, `schedule_preempt_disabled`, disables preemption in the `idle` task. And the third function, `cpu_startup_entry`, is defined in [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/kernel/sched/idle.c) and calls `cpu_idle_loop` from the same file. The `cpu_idle_loop` function runs as the process with `PID = 0` and works in the background. The main purpose of `cpu_idle_loop` is to consume idle CPU cycles: when there is no process to run, this process starts to work. We have one process with the `idle` scheduling class (we just set the `current` task to `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does no useful work, it just checks whether there is an active task to switch to:

```C
static void cpu_idle_loop(void)
{
	...
	while (1) {
		while (!need_resched()) {
			...
		}
		...
	}
	...
}
```

More about this will be in the chapter about the scheduler. So at this point the `start_kernel` function calls the `rest_init` function, which spawns the `init` process (the `kernel_init` function) and becomes the `idle` process itself. Now it is time to look at `kernel_init`. Execution of the `kernel_init` function starts with a call to the `kernel_init_freeable` function, which first of all waits for the completion of the `kthreadd` setup. I already wrote about it above:

```C
wait_for_completion(&kthreadd_done);
```

After this we:

* set `gfp_allowed_mask` to `__GFP_BITS_MASK`, which means that the system is already running;
* set the allowed [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to all CPUs and [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes with the `set_mems_allowed` function;
* allow the `init` process to run on any CPU with `set_cpus_allowed_ptr`;
* set the pid for the `cad` or `Ctrl-Alt-Delete` handler;
* prepare the other CPUs for booting with a call to `smp_prepare_cpus`;
* call the early [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) with `do_pre_smp_initcalls`;
* initialize `SMP` with `smp_init`;
* initialize the [lockup_detector](https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt) with a call to `lockup_detector_init`;
* initialize the scheduler with `sched_init_smp`.

After this we can see the call of the `do_basic_setup` function. By the time we call `do_basic_setup`, the kernel is already initialized. As the comment says:

```
Now we can finally start doing some real work..
```

The `do_basic_setup` function reinitializes [cpuset](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to the active CPUs, initializes `khelper`, which is a kernel thread used for making calls out to userspace from within the kernel, initializes [tmpfs](http://en.wikipedia.org/wiki/Tmpfs), initializes the `drivers` subsystem, enables the user-mode helper `workqueue` and makes post-early calls of the `initcalls`. After `do_basic_setup` we can see the opening of `dev/console` and two `dup` calls which duplicate file descriptor `0`, so that descriptors `0` to `2` all refer to the console:

```C
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
	pr_err("Warning: unable to open an initial console.\n");

(void) sys_dup(0);
(void) sys_dup(0);
```

We use two system calls here, `sys_open` and `sys_dup`. In the next chapters we will see the explanation and implementation of the different system calls. After we have opened the initial console, we check whether the `rdinit=` option was passed on the kernel command line, and if not, set the default path of the ramdisk `init`:

```C
if (!ramdisk_execute_command)
	ramdisk_execute_command = "/init";
```

Then we check the user's permissions for the `ramdisk` and, if it is not accessible, call the `prepare_namespace` function from [init/do_mounts.c](https://github.com/torvalds/linux/blob/master/init/do_mounts.c), which checks and mounts the [initrd](http://en.wikipedia.org/wiki/Initrd):

```C
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
	ramdisk_execute_command = NULL;
	prepare_namespace();
}
```

This is the end of the `kernel_init_freeable` function and we need to return to `kernel_init`. The next step after `kernel_init_freeable` has finished its execution is `async_synchronize_full`. This function waits until all asynchronous function calls have been done, and after it we call `free_initmem`, which releases all memory occupied by the initialization code and data located between `__init_begin` and `__init_end`. After this we protect `.rodata` with `mark_rodata_ro` and update the state of the system from `SYSTEM_BOOTING` to:

```C
system_state = SYSTEM_RUNNING;
```

And try to run the `init` process:

```C
if (ramdisk_execute_command) {
	ret = run_init_process(ramdisk_execute_command);
	if (!ret)
		return 0;
	pr_err("Failed to execute %s (error %d)\n",
	       ramdisk_execute_command, ret);
}
```

First of all it checks the `ramdisk_execute_command` which we set in the `kernel_init_freeable` function; it is equal to the value of the `rdinit=` kernel command line parameter or `/init` by default. The `run_init_process` function fills the first element of the `argv_init` array:

```C
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
```

which represents the arguments of the `init` program, and calls the `do_execve` function:

```C
argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),
	(const char __user *const __user *)argv_init,
	(const char __user *const __user *)envp_init);
```

The `do_execve` function is declared in [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and runs a program with the given file name and arguments. If we did not pass the `rdinit=` option to the kernel command line, the kernel checks the `execute_command` variable, which is equal to the value of the `init=` kernel command line parameter:

```C
if (execute_command) {
	ret = run_init_process(execute_command);
	if (!ret)
		return 0;
	panic("Requested init %s failed (error %d).",
	      execute_command, ret);
}
```

If we did not pass the `init=` kernel command line parameter either, the kernel tries to run one of the following executable files:

```C
if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
	return 0;
```

Otherwise we finish with a [panic](http://en.wikipedia.org/wiki/Kernel_panic):

```C
panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/init.txt for guidance.");
```

That's all! The Linux kernel initialization process is finished!

Conclusion
--------------------------------------------------------------------------------

It is the end of the tenth part about the Linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). It is not only the `tenth` part, but also the last part which describes the initialization of the Linux kernel. As I wrote in the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we would go through all the steps of kernel initialization, and we did it. We started at the first architecture-independent function, `start_kernel`, and finished with the launch of the first `init` process in our system. I skipped details about the different kernel subsystems; for example, I almost did not cover the scheduler, interrupts, exception handling, etc. From the next part we will start to dive into the different kernel subsystems. Hope it will be interesting.

If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [SLAB](http://en.wikipedia.org/wiki/Slab_allocation)
* [xsave](http://www.felixcloutier.com/x86/XSAVES.html)
* [FPU](http://en.wikipedia.org/wiki/Floating-point_unit)
* [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt)
* [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
* [VFS](http://en.wikipedia.org/wiki/Virtual_file_system)
* [inode](http://en.wikipedia.org/wiki/Inode)
* [proc](http://en.wikipedia.org/wiki/Procfs)
* [man proc](http://linux.die.net/man/5/proc)
* [Sysctl](http://en.wikipedia.org/wiki/Sysctl)
* [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt)
* [cgroup](http://en.wikipedia.org/wiki/Cgroups)
* [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [completions - wait for completion handling](https://www.kernel.org/doc/Documentation/scheduler/completion.txt)
* [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
* [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt)
* [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism)
* [Tmpfs](http://en.wikipedia.org/wiki/Tmpfs)
* [initrd](http://en.wikipedia.org/wiki/Initrd)
* [panic](http://en.wikipedia.org/wiki/Kernel_panic)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html)
495
Initialization/linux-initialization-2.md
Normal file
@@ -0,0 +1,495 @@

Kernel initialization. Part 2.
================================================================================

Early interrupt and exception handling
--------------------------------------------------------------------------------

In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we stopped before setting up the early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have a basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot, and our current goal is to finish the early preparation before the main kernel code starts to work.

We already started this preparation in the previous [first](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). We continue in this part and will learn more about interrupt and exception handling.

Remember that we stopped before the following loop:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);
```

from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) source code file. But before we start to sort out this code, we need to know a little about interrupts and handlers.

Some theory
--------------------------------------------------------------------------------

An interrupt is an event caused by software or hardware that needs the CPU's attention; for example, a user pressed a key on the keyboard. On an interrupt, the CPU stops the current task and transfers control to a special routine called an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler). The interrupt handler handles the interrupt and transfers control back to the previously stopped task. We can split interrupts into three types:

* Software interrupts - when software signals the CPU that it needs kernel attention. These interrupts are generally used for system calls;
* Hardware interrupts - when a hardware event happens, for example a button is pressed on a keyboard;
* Exceptions - interrupts generated by the CPU when it detects an error, for example division by zero or accessing a memory page which is not in RAM.

Every interrupt and exception is assigned a unique number called the `vector number`. A vector number can be any number from `0` to `255`. It is common practice to use the first `32` vector numbers for exceptions, while vector numbers from `32` to `255` are used for user-defined interrupts. We can see it in the code above as `NUM_EXCEPTION_VECTORS`, which is defined as:

```C
#define NUM_EXCEPTION_VECTORS 32
```

The CPU uses the vector number as an index into the `Interrupt Descriptor Table` (we will see its description soon). The CPU catches interrupts from the [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) or through its pins. The following table shows exceptions `0-31`:

```
----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
----------------------------------------------------------------------------------------------
|0     | #DE    |Divide Error        |Fault|NO        |DIV and IDIV                          |
|---------------------------------------------------------------------------------------------
|1     | #DB    |Debug               |F/T  |NO        |                                      |
|---------------------------------------------------------------------------------------------
|2     | ---    |NMI                 |INT  |NO        |external NMI                          |
|---------------------------------------------------------------------------------------------
|3     | #BP    |Breakpoint          |Trap |NO        |INT 3                                 |
|---------------------------------------------------------------------------------------------
|4     | #OF    |Overflow            |Trap |NO        |INTO instruction                      |
|---------------------------------------------------------------------------------------------
|5     | #BR    |Bound Range Exceeded|Fault|NO        |BOUND instruction                     |
|---------------------------------------------------------------------------------------------
|6     | #UD    |Invalid Opcode      |Fault|NO        |UD2 instruction                       |
|---------------------------------------------------------------------------------------------
|7     | #NM    |Device Not Available|Fault|NO        |Floating point or [F]WAIT             |
|---------------------------------------------------------------------------------------------
|8     | #DF    |Double Fault        |Abort|YES       |Any instruction which can generate NMI|
|---------------------------------------------------------------------------------------------
|9     | ---    |Reserved            |Fault|NO        |                                      |
|---------------------------------------------------------------------------------------------
|10    | #TS    |Invalid TSS         |Fault|YES       |Task switch or TSS access             |
|---------------------------------------------------------------------------------------------
|11    | #NP    |Segment Not Present |Fault|NO        |Accessing segment register            |
|---------------------------------------------------------------------------------------------
|12    | #SS    |Stack-Segment Fault |Fault|YES       |Stack operations                      |
|---------------------------------------------------------------------------------------------
|13    | #GP    |General Protection  |Fault|YES       |Memory reference                      |
|---------------------------------------------------------------------------------------------
|14    | #PF    |Page fault          |Fault|YES       |Memory reference                      |
|---------------------------------------------------------------------------------------------
|15    | ---    |Reserved            |     |NO        |                                      |
|---------------------------------------------------------------------------------------------
|16    | #MF    |x87 FPU fp error    |Fault|NO        |Floating point or [F]Wait             |
|---------------------------------------------------------------------------------------------
|17    | #AC    |Alignment Check     |Fault|YES       |Data reference                        |
|---------------------------------------------------------------------------------------------
|18    | #MC    |Machine Check       |Abort|NO        |                                      |
|---------------------------------------------------------------------------------------------
|19    | #XM    |SIMD fp exception   |Fault|NO        |SSE[2,3] instructions                 |
|---------------------------------------------------------------------------------------------
|20    | #VE    |Virtualization exc. |Fault|NO        |EPT violations                        |
|---------------------------------------------------------------------------------------------
|21-31 | ---    |Reserved            |INT  |NO        |External interrupts                   |
----------------------------------------------------------------------------------------------
```

To react to an interrupt, the CPU uses a special structure: the Interrupt Descriptor Table or `IDT`. The `IDT` is an array of 8-byte descriptors like the Global Descriptor Table, but `IDT` entries are called `gates`. The CPU multiplies the vector number by 8 to find the index of the `IDT` entry. But in 64-bit mode the `IDT` is an array of 16-byte descriptors, so the CPU multiplies the vector number by 16 to find the index of the entry in the `IDT`. We remember from the previous part that the CPU uses the special `GDTR` register to locate the Global Descriptor Table; in the same way the CPU uses the special `IDTR` register for the Interrupt Descriptor Table and the `lidt` instruction for loading the base address of the table into this register.

A 64-bit mode IDT entry has the following structure:

```
127                                                                             96
 --------------------------------------------------------------------------------
|                                                                                |
|                                    Reserved                                    |
|                                                                                |
 --------------------------------------------------------------------------------
 95                                                                             64
 --------------------------------------------------------------------------------
|                                                                                |
|                                 Offset 63..32                                  |
|                                                                                |
 --------------------------------------------------------------------------------
 63                            48 47   46  44    42       39              34    32
 --------------------------------------------------------------------------------
|                                |    | D |   |         |        |   |   |       |
|          Offset 31..16         | P  | P | 0 |  Type   | 0 0 0  | 0 | 0 |  IST  |
|                                |    | L |   |         |        |   |   |       |
 --------------------------------------------------------------------------------
 31                            16 15                                            0
 --------------------------------------------------------------------------------
|                                |                                               |
|        Segment Selector        |                 Offset 15..0                  |
|                                |                                               |
 --------------------------------------------------------------------------------
```

Where:

* `Offset` - the offset of the entry point of an interrupt handler;
* `DPL` - Descriptor Privilege Level;
* `P` - Segment Present flag;
* `Segment selector` - a code segment selector in the GDT or LDT;
* `IST` - provides the ability to switch to a new stack for interrupt handling.

And the last `Type` field describes the type of the `IDT` entry. There are three different kinds of gates for interrupts:

* Task descriptor
* Interrupt descriptor
* Trap descriptor

Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. The only difference between these types is how the CPU handles the `IF` flag. If an interrupt handler was accessed through an interrupt gate, the CPU clears the `IF` flag to prevent other interrupts while the current interrupt handler executes. After the current interrupt handler has executed, the CPU sets the `IF` flag again with the `iret` instruction.

Other bits in the interrupt gate are reserved and must be 0. Now let's look at how the CPU handles interrupts:

* The CPU saves the flags register, `CS`, and the instruction pointer on the stack;
* If the interrupt causes an error code (like `#PF` for example), the CPU saves the error code on the stack after the instruction pointer;
* After the interrupt handler has executed, the `iret` instruction is used to return from it.

Now let's get back to the code.

Fill and load IDT
--------------------------------------------------------------------------------

We stopped at the following point:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);
```

Here we call `set_intr_gate` in the loop, which takes two parameters:

* Number of an interrupt or `vector number`;
* Address of the idt handler.

and inserts an interrupt gate into the `IDT` table, which is represented by the `idt_table` array. First of all, let's look at the `early_idt_handler_array` array. It is declared in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) header file and contains the addresses of the first `32` exception handlers:

```C
#define EARLY_IDT_HANDLER_SIZE 9
#define NUM_EXCEPTION_VECTORS 32

extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
```

The `early_idt_handler_array` is a `288`-byte array which contains exception entry points every nine bytes. Each nine bytes of this array consist of an optional two-byte instruction for pushing a dummy error code if the exception does not provide one, a two-byte instruction for pushing the vector number onto the stack, and five bytes of a `jump` to the common exception handler code.

As we can see, we fill only the first 32 `IDT` entries in the loop because all of the early setup runs with interrupts disabled, so there is no need to set up interrupt handlers for vectors greater than `32`. The `early_idt_handler_array` array contains generic idt handlers and we can find its definition in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file. For now we will skip it, but we will look at it soon. Before that, we will look at the implementation of the `set_intr_gate` macro.

The `set_intr_gate` macro is defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) header file and looks like this:

```C
#define set_intr_gate(n, addr) \
	do { \
		BUG_ON((unsigned)n > 0xFF); \
		_set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, \
			  __KERNEL_CS); \
		_trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\
				0, 0, __KERNEL_CS); \
	} while (0)
```

First of all it checks that the passed interrupt number is not greater than `255` with the `BUG_ON` macro. We need this check because we can have only `256` interrupts. After this, it calls the `_set_gate` function, which writes the address of an interrupt gate to the `IDT`:

```C
static inline void _set_gate(int gate, unsigned type, void *addr,
			     unsigned dpl, unsigned ist, unsigned seg)
{
	gate_desc s;
	pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
	write_idt_entry(idt_table, gate, &s);
	write_trace_idt_entry(gate, &s);
}
```

At the start of the `_set_gate` function we can see the call of the `pack_gate` function, which fills the `gate_desc` structure with the given values:

```C
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
			     unsigned dpl, unsigned ist, unsigned seg)
{
	gate->offset_low    = PTR_LOW(func);
	gate->segment       = __KERNEL_CS;
	gate->ist           = ist;
	gate->p             = 1;
	gate->dpl           = dpl;
	gate->zero0         = 0;
	gate->zero1         = 0;
	gate->type          = type;
	gate->offset_middle = PTR_MIDDLE(func);
	gate->offset_high   = PTR_HIGH(func);
}
```

As I mentioned above, we fill the gate descriptor in this function. We fill the three parts of the address of the interrupt handler with the address which we got in the main loop (the address of the interrupt handler entry point). We use the three following macros to split the address into three parts:

```C
#define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
```

With the first `PTR_LOW` macro we get the first `2` bytes of the address, with the second `PTR_MIDDLE` we get the second `2` bytes of the address and with the third `PTR_HIGH` macro we get the last `4` bytes of the address. Next we set up the segment selector for the interrupt handler: it will be our kernel code segment, `__KERNEL_CS`. In the next step we fill the `Interrupt Stack Table` and `Descriptor Privilege Level` (the highest privilege level) with zeros. And we set the `GATE_INTERRUPT` type at the end.

Now that we have filled the IDT entry, we can call the `native_write_idt_entry` function, which just copies the filled `IDT` entry into the `IDT`:

```C
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
	memcpy(&idt[entry], gate, sizeof(*gate));
}
```

After the main loop has finished, we have a filled `idt_table` array of `gate_desc` structures and we can load the `Interrupt Descriptor Table` with the call of:

```C
load_idt((const struct desc_ptr *)&idt_descr);
```

Where `idt_descr` is:

```C
struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
```

and `load_idt` just executes the `lidt` instruction:

```C
asm volatile("lidt %0"::"m" (*dtr));
```

You can note that there are calls of the `_trace_*` functions in `_set_gate` and other functions. These functions fill `IDT` gates in the same manner as `_set_gate`, but with one difference: they use `trace_idt_table` as the `Interrupt Descriptor Table` instead of `idt_table`, for tracepoints (we will cover this topic in another part).

Okay, now we have filled and loaded the `Interrupt Descriptor Table` and we know how the CPU acts during an interrupt. So now it is time to deal with interrupt handlers.

Early interrupt handlers
--------------------------------------------------------------------------------

As you can read above, we filled the `IDT` with the addresses of the `early_idt_handler_array`. We can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file:

```assembly
	.globl early_idt_handler_array
early_idt_handler_array:
	i = 0
	.rept NUM_EXCEPTION_VECTORS
	.if (EXCEPTION_ERRCODE_MASK >> i) & 1
	pushq $0
	.endif
	pushq $i
	jmp early_idt_handler_common
	i = i + 1
	.fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
	.endr
```

Here we can see the generation of the interrupt handlers for the first `32` exceptions. If an exception provides an error code we do nothing; if it does not, we push zero onto the stack, so that the stack layout is uniform. After that we push the exception number onto the stack and jump to `early_idt_handler_common`, which is the generic interrupt handler for now. As we may see above, every nine bytes of the `early_idt_handler_array` array consist of an optional push of an error code, a push of the `vector number` and a jump instruction. We can see it in the output of the `objdump` util:

```
$ objdump -D vmlinux
...
...
...
ffffffff81fe5000 <early_idt_handler_array>:
ffffffff81fe5000:	6a 00                	pushq  $0x0
ffffffff81fe5002:	6a 00                	pushq  $0x0
ffffffff81fe5004:	e9 17 01 00 00       	jmpq   ffffffff81fe5120 <early_idt_handler_common>
ffffffff81fe5009:	6a 00                	pushq  $0x0
ffffffff81fe500b:	6a 01                	pushq  $0x1
ffffffff81fe500d:	e9 0e 01 00 00       	jmpq   ffffffff81fe5120 <early_idt_handler_common>
ffffffff81fe5012:	6a 00                	pushq  $0x0
ffffffff81fe5014:	6a 02                	pushq  $0x2
...
...
...
```

As I wrote above, the CPU pushes the flags register, `CS` and `RIP` on the stack. So before `early_idt_handler_common` is executed, the stack will contain the following data:

```
|--------------------|
| %rflags            |
| %cs                |
| %rip               |
| rsp --> error code |
|--------------------|
```

Now let's look at the `early_idt_handler_common` implementation. It is located in the same [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) assembly file, and first of all we can see a check for [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). We don't need to handle it, so we just ignore it in `early_idt_handler_common`:

```assembly
	cmpl $2,(%rsp)
	je .Lis_nmi
```

where `.Lis_nmi`:

```assembly
.Lis_nmi:
	addq $16,%rsp
	INTERRUPT_RETURN
```

drops the error code and vector number from the stack and calls `INTERRUPT_RETURN`, which just expands to the `iretq` instruction. If the vector number is not `NMI`, we check `early_recursion_flag` to prevent recursion in `early_idt_handler_common` and, if everything is correct, save the general-purpose registers on the stack:

```assembly
	pushq %rax
	pushq %rcx
	pushq %rdx
	pushq %rsi
	pushq %rdi
	pushq %r8
	pushq %r9
	pushq %r10
	pushq %r11
```
|
||||
|
||||
We need to do it to prevent wrong values of registers when we return from the interrupt handler. After this we check segment selector in the stack:
|
||||
|
||||
```assembly
|
||||
cmpl $__KERNEL_CS,96(%rsp)
|
||||
jne 11f
|
||||
```
|
||||
|
||||
which must be equal to the kernel code segment and if it is not we jump on label `11` which prints `PANIC` message and makes stack dump.
|
||||
|
||||
After the code segment was checked, we check the vector number, and if it is `#PF` or [Page Fault](https://en.wikipedia.org/wiki/Page_fault), we put value from the `cr2` to the `rdi` register and call `early_make_pgtable` (well see it soon):

```assembly
cmpl $14,72(%rsp)
jnz 10f
GET_CR2_INTO(%rdi)
call early_make_pgtable
andl %eax,%eax
jz 20f
```

If the vector number is not `#PF`, we restore the general purpose registers from the stack:

```assembly
popq %r11
popq %r10
popq %r9
popq %r8
popq %rdi
popq %rsi
popq %rdx
popq %rcx
popq %rax
```

and exit from the handler with `iret`.

This is the end of the first interrupt handler. Note that it is a very early interrupt handler, so it handles only the Page Fault for now. We will see handlers for the other interrupts later, but now let's look at the page fault handler.

Page fault handling
--------------------------------------------------------------------------------

In the previous paragraph we saw the first early interrupt handler, which checks the interrupt number for page fault and, if it is one, calls `early_make_pgtable` to build new page tables. We need a `#PF` handler at this step because there are plans to add the ability to load the kernel above `4G` and make the `boot_params` structure accessible above 4G.

You can find the implementation of `early_make_pgtable` in [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). It takes one parameter - the address from the `cr2` register which caused the Page Fault. Let's look at it:

```C
int __init early_make_pgtable(unsigned long address)
{
	unsigned long physaddr = address - __PAGE_OFFSET;
	unsigned long i;
	pgdval_t pgd, *pgd_p;
	pudval_t pud, *pud_p;
	pmdval_t pmd, *pmd_p;
	...
}
```

It starts with the definition of some variables which have `*val_t` types. All of these types are just:

```C
typedef unsigned long pgdval_t;
```

We will also operate with the `*_t` (not val) types, for example `pgd_t`. All of these types are defined in [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h) and represent structures like this:

```C
typedef struct { pgdval_t pgd; } pgd_t;
```

For example,

```C
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
```

Here `early_level4_pgt` represents the early top-level page table directory, which consists of an array of `pgd_t` entries, and each `pgd` points to lower-level page tables.

After we have checked that the address is valid, we get the address of the Page Global Directory entry which covers the `#PF` address and put its value into the `pgd` variable:

```C
pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;
```

In the next step we check `pgd`. If it contains a correct page global directory entry, we compute the physical address of the page upper directory and put it into `pud_p`:

```C
pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
```

where `PTE_PFN_MASK` is a macro:

```C
#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)
```

which expands to:

```C
(~(PAGE_SIZE-1)) & ((1 << 46) - 1)
```

The second operand is:

```
0b1111111111111111111111111111111111111111111111
```

a mask of 46 one-bits; ANDed with the page-alignment mask it extracts the page frame bits.

If `pgd` does not contain a correct address, we check that `next_early_pgt` is not greater than or equal to `EARLY_DYNAMIC_PAGE_TABLES`, which is `64` and represents a fixed number of buffers used to set up new page tables on demand. If `next_early_pgt` has reached `EARLY_DYNAMIC_PAGE_TABLES`, we reset the page tables and start again. Otherwise, we create a new page upper directory which points at the current dynamic page table and write its physical address, with the `_KERNPG_TABLE` access rights, to the page global directory:

```C
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
	reset_early_page_tables();
	goto again;
}

pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
for (i = 0; i < PTRS_PER_PUD; i++)
	pud_p[i] = 0;
*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
```

After this we fix up the address of the page upper directory with:

```C
pud_p += pud_index(address);
pud = *pud_p;
```

In the next step we do the same actions as before, but with the page middle directory. In the end we fix the page middle directory entry which maps the kernel's text+data virtual addresses:

```C
pmd = (physaddr & PMD_MASK) + early_pmd_flags;
pmd_p[pmd_index(address)] = pmd;
```

After the page fault handler has finished its work, `early_level4_pgt` contains entries which point to valid addresses.

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part about Linux kernel insides. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part we will see all the steps before the kernel entry point - the `start_kernel` function.

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [GNU assembly .rept](https://sourceware.org/binutils/docs-2.23/as/Rept.html)
* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt)
* [Page table](https://en.wikipedia.org/wiki/Page_table)
* [Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
* [Page Fault](https://en.wikipedia.org/wiki/Page_fault)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
Kernel initialization. Part 3.
================================================================================

Last preparations before the kernel entry point
--------------------------------------------------------------------------------

This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling, and we will continue to dive into the Linux kernel initialization process in the current part. Our next point is the 'kernel entry point' - the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. Yes, technically it is not the kernel's entry point but the start of the generic kernel code which does not depend on a certain architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue.

boot_params again
--------------------------------------------------------------------------------

In the previous part we stopped at setting the Interrupt Descriptor Table and loading it into the `IDTR` register. The next step after this is a call of the `copy_bootdata` function:

```C
copy_bootdata(__va(real_mode_data));
```

This function takes one argument - the virtual address of `real_mode_data`. Remember that we passed the address of the `boot_params` structure from [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L114) to the `x86_64_start_kernel` function as the first argument in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):

```
/* rsi is pointer to real mode structure with interesting info.
   pass it to C */
movq %rsi, %rdi
```

Now let's look at the `__va` macro. This macro is defined in [arch/x86/include/asm/page.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page.h):

```C
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
```

where `PAGE_OFFSET` is `__PAGE_OFFSET`, which is `0xffff880000000000` - the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the `boot_params` structure and pass it to the `copy_bootdata` function, where we copy `real_mode_data` to `boot_params`, which is declared in [arch/x86/kernel/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.h):

```C
extern struct boot_params boot_params;
```

Let's look at the `copy_bootdata` implementation:

```C
static void __init copy_bootdata(char *real_mode_data)
{
	char * command_line;
	unsigned long cmd_line_ptr;

	memcpy(&boot_params, real_mode_data, sizeof boot_params);
	sanitize_boot_params(&boot_params);
	cmd_line_ptr = get_cmd_line_ptr();
	if (cmd_line_ptr) {
		command_line = __va(cmd_line_ptr);
		memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
	}
}
```

First of all, note that this function is declared with the `__init` prefix. It means that this function will be used only during initialization and the memory it uses will be freed afterwards.

We can see the declaration of two variables for the kernel command line and the copying of `real_mode_data` to `boot_params` with the `memcpy` function. Next comes the call of the `sanitize_boot_params` function, which zeroes some fields of the `boot_params` structure, such as `ext_ramdisk_image`, for bootloaders which fail to initialize unknown fields in `boot_params` to zero. After this we get the address of the command line with a call of the `get_cmd_line_ptr` function:

```C
unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
return cmd_line_ptr;
```

which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check `cmd_line_ptr`, get its virtual address and copy it to `boot_command_line`, which is just an array of bytes:

```C
extern char __initdata boot_command_line[];
```

After this we have copied the kernel command line and the `boot_params` structure. In the next step we can see a call of the `load_ucode_bsp` function, which loads processor microcode, but we will not cover it here.

After the microcode has been loaded, we can see the check of `console_loglevel` and the `early_printk` function which prints the `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and I sent the patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) - so you will see it in the mainline soon. You can skip this code.

Move on init pages
--------------------------------------------------------------------------------

In the next step, as we have copied the `boot_params` structure, we need to move from the early page tables to the page tables for the initialization process. We already set up early page tables for switchover (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)), dropped all of them in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only the kernel high mapping. After this we call:

```C
clear_page(init_level4_pgt);
```

and pass `init_level4_pgt`, which is also defined in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and looks like:

```assembly
NEXT_PAGE(init_level4_pgt)
	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.org    init_level4_pgt + L4_START_KERNEL*8, 0
	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
```

which maps the first 2 gigabytes and the 512 megabytes for the kernel code, data and bss. The `clear_page` function is defined in [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S); let's look at this function:

```assembly
ENTRY(clear_page)
	CFI_STARTPROC
	xorl %eax,%eax
	movl $4096/64,%ecx
	.p2align 4
.Lloop:
	decl	%ecx
#define PUT(x) movq %rax,x*8(%rdi)
	movq %rax,(%rdi)
	PUT(1)
	PUT(2)
	PUT(3)
	PUT(4)
	PUT(5)
	PUT(6)
	PUT(7)
	leaq 64(%rdi),%rdi
	jnz	.Lloop
	nop
	ret
	CFI_ENDPROC
.Lclear_page_end:
ENDPROC(clear_page)
```

As you can understand from the function name, it clears the page tables, or fills them with zeros. First of all, note that this function starts with `CFI_STARTPROC` and ends with `CFI_ENDPROC`, which expand to GNU assembly directives:

```C
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
```

used for debugging. After the `CFI_STARTPROC` macro we zero out the `eax` register and put 64 into `ecx` (it will be the counter). Next we can see the loop which starts at the `.Lloop` label with the `ecx` decrement. After it we put zero from the `rax` register to the address in `rdi`, which now contains the base address of `init_level4_pgt`, and do the same procedure seven more times, each time advancing the `rdi` offset by 8. After this the first 64 bytes of `init_level4_pgt` are filled with zeros. In the next step we advance `rdi` by 64 bytes and repeat all operations until `ecx` reaches zero. In the end we have `init_level4_pgt` filled with zeros.

As we now have `init_level4_pgt` filled with zeros, we set the last `init_level4_pgt` entry to the kernel high mapping with:

```C
init_level4_pgt[511] = early_level4_pgt[511];
```

Remember that we dropped all `early_level4_pgt` entries in the `reset_early_page_tables` function and kept only the kernel high mapping there.

The last step in the `x86_64_start_kernel` function is the call of:

```C
x86_64_start_reservations(real_mode_data);
```

with `real_mode_data` as argument. The `x86_64_start_reservations` function is defined in the same source code file as the `x86_64_start_kernel` function and looks like:

```C
void __init x86_64_start_reservations(char *real_mode_data)
{
	if (!boot_params.hdr.version)
		copy_bootdata(__va(real_mode_data));

	reserve_ebda_region();

	start_kernel();
}
```

You can see that it is the last function before we reach the kernel entry point - the `start_kernel` function. Let's look at what it does and how it works.

Last step before kernel entry point
--------------------------------------------------------------------------------

First of all we can see in the `x86_64_start_reservations` function the check for `boot_params.hdr.version`:

```C
if (!boot_params.hdr.version)
	copy_bootdata(__va(real_mode_data));
```

and if it is zero we call the `copy_bootdata` function again with the virtual address of `real_mode_data` (we read about its implementation above).

In the next step we can see the call of the `reserve_ebda_region` function, which is defined in [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c). This function reserves a memory block for the `EBDA` or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters and so on.

Let's look at the `reserve_ebda_region` function. It starts by checking whether paravirtualization is enabled:

```C
if (paravirt_enabled())
	return;
```

We exit from the `reserve_ebda_region` function if paravirtualization is enabled, because in that case there is no extended BIOS data area. In the next step we need to get the end of low memory:

```C
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
```

We read the BIOS low memory size in kilobytes through its virtual address and convert it to bytes by shifting it left by 10 (in other words, multiplying by 1024). After this we need to get the address of the extended BIOS data area with:

```C
ebda_addr = get_bios_ebda();
```

where the `get_bios_ebda` function is defined in [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bios_ebda.h) and looks like:

```C
static inline unsigned int get_bios_ebda(void)
{
	unsigned int address = *(unsigned short *)phys_to_virt(0x40E);
	address <<= 4;
	return address;
}
```

Let's try to understand how it works. Here we convert the physical address `0x40E` to a virtual one, where `0x0040:0x000e` is the segment which contains the base address of the extended BIOS data area. Don't worry that we are using the `phys_to_virt` function for converting a physical address to a virtual address instead of the `__va` macro we used before - `phys_to_virt` is the same:

```C
static inline void *phys_to_virt(phys_addr_t address)
{
	return __va(address);
}
```

with only one difference: its argument has the `phys_addr_t` type, which depends on `CONFIG_PHYS_ADDR_T_64BIT`:

```C
#ifdef CONFIG_PHYS_ADDR_T_64BIT
	typedef u64 phys_addr_t;
#else
	typedef u32 phys_addr_t;
#endif
```

This configuration option is enabled on `x86_64`. After we get the virtual address of the segment which stores the base address of the extended BIOS data area, we shift it left by 4 and return it. Now the `ebda_addr` variable contains the base address of the extended BIOS data area.

In the next step we check that the address of the extended BIOS data area and the low memory are not less than the `INSANE_CUTOFF` macro:

```C
if (ebda_addr < INSANE_CUTOFF)
	ebda_addr = LOWMEM_CAP;

if (lowmem < INSANE_CUTOFF)
	lowmem = LOWMEM_CAP;
```

which is:

```C
#define INSANE_CUTOFF 0x20000U
```

or 128 kilobytes. In the last step we take the lowest of the low memory and extended BIOS data area addresses and call the `memblock_reserve` function, which reserves the memory region for the extended BIOS data between that address and the one-megabyte mark:

```C
lowmem = min(lowmem, ebda_addr);
lowmem = min(lowmem, LOWMEM_CAP);
memblock_reserve(lowmem, 0x100000 - lowmem);
```

The `memblock_reserve` function is defined in [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) and takes two parameters:

* base physical address;
* region size.

It reserves a memory region for the given base address and size. `memblock_reserve` is the first function in this book from the Linux kernel memory manager framework. We will take a closer look at the memory manager soon, but for now let's look at its implementation.

First touch of the linux kernel memory manager framework
--------------------------------------------------------------------------------

In the previous paragraph we stopped at the call of the `memblock_reserve` function which, as I said before, is the first function from the memory manager framework. Let's try to understand how it works. `memblock_reserve` just calls:

```C
memblock_reserve_region(base, size, MAX_NUMNODES, 0);
```

and passes 4 parameters there:

* physical base address of the memory region;
* size of the memory region;
* maximum number of NUMA nodes;
* flags.

At the start of the `memblock_reserve_region` body we can see the definition of a `memblock_type` structure:

```C
struct memblock_type *_rgn = &memblock.reserved;
```

which represents the type of a memory block and looks like:

```C
struct memblock_type {
	unsigned long cnt;
	unsigned long max;
	phys_addr_t total_size;
	struct memblock_region *regions;
};
```

As we need to reserve a memory block for the extended BIOS data area, the type of the current memory region is `reserved`, where the `memblock` structure is:

```C
struct memblock {
	bool bottom_up;
	phys_addr_t current_limit;
	struct memblock_type memory;
	struct memblock_type reserved;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	struct memblock_type physmem;
#endif
};
```

and describes a generic memory block. You can see that we initialize `_rgn` by assigning it the address of `memblock.reserved`. `memblock` is a global variable which looks like:

```C
struct memblock memblock __initdata_memblock = {
	.memory.regions		= memblock_memory_init_regions,
	.memory.cnt		= 1,
	.memory.max		= INIT_MEMBLOCK_REGIONS,
	.reserved.regions	= memblock_reserved_init_regions,
	.reserved.cnt		= 1,
	.reserved.max		= INIT_MEMBLOCK_REGIONS,
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	.physmem.regions	= memblock_physmem_init_regions,
	.physmem.cnt		= 1,
	.physmem.max		= INIT_PHYSMEM_REGIONS,
#endif
	.bottom_up		= false,
	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
};
```

We will not dive into the details of this variable here; we will see all the details about it in the parts about the memory manager. Just note that the `memblock` variable is defined with `__initdata_memblock`, which is:

```C
#define __initdata_memblock __meminitdata
```

and `__meminitdata` is:

```C
#define __meminitdata __section(.meminit.data)
```

From this we can conclude that all memory blocks will be in the `.meminit.data` section. After we have defined `_rgn`, we print information about it with the `memblock_dbg` macro. You can enable this output by passing `memblock=debug` on the kernel command line.

After the debugging lines have been printed, the next step is the call of the following function:

```C
memblock_add_range(_rgn, base, size, nid, flags);
```

which adds a new memory block region. As `_rgn` contains `&memblock.reserved`, we just fill the first region of the passed `_rgn` with the base address of the extended BIOS data area region, the size of this region and the flags:

```C
if (type->regions[0].size == 0) {
	WARN_ON(type->cnt != 1 || type->total_size);
	type->regions[0].base = base;
	type->regions[0].size = size;
	type->regions[0].flags = flags;
	memblock_set_region_node(&type->regions[0], nid);
	type->total_size = size;
	return 0;
}
```
After we have filled our region, we can see the call of the `memblock_set_region_node` function with two parameters:

* address of the filled memory region;
* NUMA node id.

where our regions are represented by the `memblock_region` structure:

```C
struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
	int nid;
#endif
};
```

The NUMA node id depends on the `MAX_NUMNODES` macro, which is defined in [include/linux/numa.h](https://github.com/torvalds/linux/blob/master/include/linux/numa.h):

```C
#define MAX_NUMNODES (1 << NODES_SHIFT)
```

where `NODES_SHIFT` depends on the `CONFIG_NODES_SHIFT` configuration parameter and is defined as:

```C
#ifdef CONFIG_NODES_SHIFT
  #define NODES_SHIFT CONFIG_NODES_SHIFT
#else
  #define NODES_SHIFT 0
#endif
```
The `memblock_set_region_node` function just fills the `nid` field of `memblock_region` with the given value:

```C
static inline void memblock_set_region_node(struct memblock_region *r, int nid)
{
	r->nid = nid;
}
```

After this we have the first reserved `memblock` for the extended BIOS data area. The `reserve_ebda_region` function has finished its work at this point and we can go back to [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c).

We have finished all preparations before the kernel entry point! The last step in the `x86_64_start_reservations` function is the call of:

```C
start_kernel()
```

from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) file.

That's all for this part.

Conclusion
--------------------------------------------------------------------------------

It is the end of the third part about Linux kernel insides. In the next part we will see the first initialization steps in the kernel entry point - the `start_kernel` function. It will be the first step before we see the launch of the first `init` process.

If you have any questions or suggestions, write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [BIOS data area](http://stanislavs.org/helppc/bios_data_area.html)
* [What is in the extended BIOS data area on a PC?](http://www.kryslix.com/nsfaq/Q.6.html)
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md)
Kernel initialization. Part 4.
================================================================================

Kernel entry point
================================================================================

If you have read the previous part - [Last preparations before the kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md) - you can remember that we finished all pre-initialization stuff and stopped right before the call to the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). `start_kernel` is the entry point of the generic and architecture-independent kernel code, although we will return to the `arch/` folder many times. If you look inside the `start_kernel` function, you will see that it is very big. At this moment it contains about `86` function calls. Yes, it's very big and of course this part will not cover all the processes that occur in this function. In the current part we will only start to do so; this part and all the next ones in the [Kernel initialization process](https://github.com/0xAX/linux-insides/blob/master/Initialization/README.md) chapter will cover it.

The main purpose of `start_kernel` is to finish the kernel initialization process and launch the first `init` process. Before the first process is started, `start_kernel` must do many things, such as: enable the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialize the processor id, enable the early [cgroups](http://en.wikipedia.org/wiki/Cgroups) subsystem, set up per-cpu areas, initialize different caches in [vfs](http://en.wikipedia.org/wiki/Virtual_file_system), initialize the memory manager, rcu, vmalloc, the scheduler, IRQs, ACPI and much more. Only after these steps will we see the launch of the first `init` process in the last part of this chapter. So much kernel code awaits us; let's start.

**NOTE: All parts from this big chapter `Linux Kernel initialization process` will not cover anything about debugging. There will be a separate chapter about kernel debugging tips.**
A little about function attributes
---------------------------------------------------------------------------------

As I wrote above, the `start_kernel` function is defined in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). This function is defined with the `__init` attribute and, as you may already know from other parts, all functions which are defined with this attribute are necessary only during kernel initialization:

```C
#define __init __section(.init.text) __cold notrace
```

After the initialization process has finished, the kernel will release these sections with a call to the `free_initmem` function. Note also that `__init` is defined with two attributes: `__cold` and `notrace`. The purpose of the first, `cold`, attribute is to mark that the function is rarely used and the compiler must optimize this function for size. The second, `notrace`, is defined as:

```C
#define notrace __attribute__((no_instrument_function))
```

where `no_instrument_function` tells the compiler not to generate profiling function calls.

In the definition of the `start_kernel` function you can also see the `__visible` attribute, which expands to:

```
#define __visible __attribute__((externally_visible))
```

where `externally_visible` tells the compiler that something uses this function or variable, to prevent marking this function/variable as `unusable`. You can find the definition of this and other macro attributes in [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h).
First steps in the start_kernel
--------------------------------------------------------------------------------

At the beginning of the `start_kernel` you can see the definition of these two variables:

```C
char *command_line;
char *after_dashes;
```

The first is a pointer to the kernel command line and the second will contain the result of the `parse_args` function, which parses an input string of parameters in the form `name=value`, looking for specific keywords and invoking the right handlers. We will not go into the details of these two variables at this time, but will see them in the next parts. In the next step we can see a call to:

```C
lockdep_init();
```

The `lockdep_init` function initializes the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). Its implementation is pretty simple: it just initializes two [list_head](https://github.com/0xAX/linux-insides/blob/master/DataStructures/dlist.md) hashes and sets the `lockdep_initialized` global variable to `1`. The lock validator detects circular lock dependencies and is called when any [spinlock](http://en.wikipedia.org/wiki/Spinlock) or [mutex](http://en.wikipedia.org/wiki/Mutual_exclusion) is acquired.

The next function is `set_task_stack_end_magic`, which takes the address of `init_task` and sets `STACK_END_MAGIC` (`0x57AC6E9D`) as a canary for its stack. `init_task` represents the initial task structure:

```C
struct task_struct init_task = INIT_TASK(init_task);
```

where `task_struct` stores all the information about a process. I will not explain this structure in this book because it's very big. You can find its definition in [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L1278). At this moment `task_struct` contains more than `100` fields! Although you will not see the explanation of the `task_struct` in this book, we will use it very often since it is the fundamental structure which describes the `process` in the Linux kernel. I will describe the meaning of the fields of this structure as we meet them in practice.

As you can see in its definition, `init_task` is initialized by the `INIT_TASK` macro. This macro is from [include/linux/init_task.h](https://github.com/torvalds/linux/blob/master/include/linux/init_task.h) and it just fills the `init_task` with the values for the first process. For example it sets:

* the init process state to zero, or `runnable`. A runnable process is one which is waiting only for a CPU to run on;
* the init process flags - `PF_KTHREAD`, which means kernel thread;
* a list of runnable tasks;
* the process address space;
* the init process stack to `&init_thread_info`, which is `init_thread_union.thread_info`, and `init_thread_union` has the type `thread_union`, which contains `thread_info` and the process stack:

```C
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};
```

Every process has its own stack, which is 16 kilobytes, or 4 page frames, on `x86_64`. We can note that it is defined as an array of `unsigned long`. The next field of the `thread_union` is `thread_info`, defined as:

```C
struct thread_info {
	struct task_struct	*task;
	struct exec_domain	*exec_domain;
	__u32			flags;
	__u32			status;
	__u32			cpu;
	int			saved_preempt_count;
	mm_segment_t		addr_limit;
	struct restart_block	restart_block;
	void __user		*sysenter_return;
	unsigned int		sig_on_uaccess_error:1;
	unsigned int		uaccess_err:1;
};
```

and occupies 52 bytes. The `thread_info` structure contains architecture-specific information on the thread. We know that on `x86_64` the stack grows down and `thread_union.thread_info` is stored at the bottom of the stack in our case. So the process stack is 16 kilobytes with `thread_info` at the bottom, and the remaining space is `16 kilobytes - 52 bytes = 16332 bytes`. Note that `thread_union` is represented as a [union](http://en.wikipedia.org/wiki/Union_type) and not a structure; it means that `thread_info` and the stack share the memory space.

Schematically it can be represented as follows:

```C
+-----------------------+
|                       |
|                       |
|         stack         |
|                       |
|_______________________|
|          |            |
|          |            |
|          |            |
|__________↓____________|             +--------------------+
|                       |             |                    |
|      thread_info      |<----------->|     task_struct    |
|                       |             |                    |
+-----------------------+             +--------------------+
```

http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct

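The union trick above can be reproduced in ordinary user-space C. The sketch below (a simplified `thread_info` and an assumed 16-kilobyte `THREAD_SIZE`; the real structures are much bigger) shows that the structure and the stack array really do share the same address, and how much stack space remains above `thread_info`:

```c
#include <stddef.h>

#define THREAD_SIZE (16 * 1024)  /* 16 kilobytes, as on x86_64 */

/* Simplified stand-in for the kernel's thread_info (illustration only). */
struct thread_info_sketch {
	unsigned long flags;
	int cpu;
};

union thread_union_sketch {
	struct thread_info_sketch thread_info;
	unsigned long stack[THREAD_SIZE / sizeof(unsigned long)];
};

/* Usable stack space is what remains above thread_info at the bottom. */
static size_t usable_stack_bytes(void)
{
	return THREAD_SIZE - sizeof(struct thread_info_sketch);
}
```

Because both members start at the same address, writing past the stack's low end would clobber `thread_info` first, which is exactly why the kernel also places a magic canary there.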
So the `INIT_TASK` macro fills these `task_struct` fields and many more. As I already wrote above, I will not describe all the fields and values in the `INIT_TASK` macro, but we will see them soon.

Now let's go back to the `set_task_stack_end_magic` function. This function is defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c#L297) and places a [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow) at the end of the `init` process stack to detect stack overflow.

```C
void set_task_stack_end_magic(struct task_struct *tsk)
{
	unsigned long *stackend;
	stackend = end_of_stack(tsk);
	*stackend = STACK_END_MAGIC; /* for overflow detection */
}
```

Its implementation is simple. `set_task_stack_end_magic` gets the end of the stack for the given `task_struct` with the `end_of_stack` function. The end of a process stack depends on the `CONFIG_STACK_GROWSUP` configuration option. As we know, on the `x86_64` architecture the stack grows down, so the end of the process stack will be:

```C
(unsigned long *)(task_thread_info(p) + 1);
```

where `task_thread_info` just returns the stack which we filled with the `INIT_TASK` macro:

```C
#define task_thread_info(task)	((struct thread_info *)(task)->stack)
```

Having obtained the end of the init process stack, we write `STACK_END_MAGIC` there. After the `canary` is set, we can check it like this:

```C
if (*end_of_stack(task) != STACK_END_MAGIC) {
	//
	// handle stack overflow here
	//
}
```

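The canary scheme is easy to model outside the kernel. This sketch (a toy stack size and hypothetical helper names, not the kernel API) places the magic value in the lowest stack word and detects when an overflow has clobbered it:

```c
#define STACK_END_MAGIC 0x57AC6E9DUL
#define STACK_WORDS 64

/* A toy "process stack": on x86_64 the stack grows down, so the canary
 * lives in the lowest word, just above where thread_info would sit. */
static unsigned long stack_area[STACK_WORDS];

static void set_stack_end_magic(unsigned long *stack)
{
	stack[0] = STACK_END_MAGIC;  /* for overflow detection */
}

static int stack_overflowed(const unsigned long *stack)
{
	return stack[0] != STACK_END_MAGIC;
}
```

Note that this only detects an overflow after the fact: the canary is checked, not protected, which is also true of the kernel's `STACK_END_MAGIC`.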
The next function after `set_task_stack_end_magic` is `smp_setup_processor_id`. This function has an empty body on `x86_64`:

```C
void __init __weak smp_setup_processor_id(void)
{
}
```

as it is a weak function that is only implemented for certain architectures, such as [s390](http://en.wikipedia.org/wiki/IBM_ESA/390) and [arm64](http://en.wikipedia.org/wiki/ARM_architecture#64.2F32-bit_architecture).

The next function in `start_kernel` is `debug_objects_early_init`. Its implementation is almost the same as that of `lockdep_init`, but it fills hashes for object debugging. As I wrote above, we will not see the explanation of this and other functions which are for debugging purposes in this chapter.

After the `debug_objects_early_init` function we can see the call of the `boot_init_stack_canary` function, which fills `task_struct->stack_canary` with the canary value for the `-fstack-protector` gcc feature. This function depends on the `CONFIG_CC_STACKPROTECTOR` configuration option: if this option is disabled, `boot_init_stack_canary` does nothing; otherwise it generates a random number based on the random pool and the [TSC](http://en.wikipedia.org/wiki/Time_Stamp_Counter):

```C
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);
```

Once we have a random number, we fill the `stack_canary` field of `task_struct` with it:

```C
current->stack_canary = canary;
```

and write this value to the top of the IRQ stack with:

```C
this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write
```

Again, we will not dive into details here; we will cover them in the part about [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). As the canary is set, we disable local and early boot IRQs and register the bootstrap CPU in the CPU maps. We disable local IRQs (interrupts for the current CPU) with the `local_irq_disable` macro, which expands to the call of the `arch_local_irq_disable` function from [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h):

```C
static inline notrace void arch_local_irq_disable(void)
{
	native_irq_disable();
}
```

where `native_irq_disable` is the `cli` instruction on `x86_64`. With interrupts disabled, we can register the current CPU with the given ID in the CPU bitmap.

The first processor activation
---------------------------------------------------------------------------------

The next function in `start_kernel` is `boot_cpu_init`. This function initializes various CPU masks for the bootstrap processor. First of all it gets the bootstrap processor id with a call to:

```C
int cpu = smp_processor_id();
```

For now it is just zero. If the `CONFIG_DEBUG_PREEMPT` configuration option is disabled, `smp_processor_id` just expands to the call of `raw_smp_processor_id`, which expands to:

```C
#define raw_smp_processor_id() (this_cpu_read(cpu_number))
```

`this_cpu_read`, like many other functions of this kind (`this_cpu_write`, `this_cpu_add` and so on), is defined in [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) and represents a `this_cpu` operation. These operations provide a way of optimizing access to the [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Theory/per-cpu.html) variables which are associated with the current processor. In our case it is `this_cpu_read`:

```
__pcpu_size_call_return(this_cpu_read_, pcp)
```

Remember that we have passed `cpu_number` as `pcp` to `this_cpu_read` from `raw_smp_processor_id`. Now let's look at the `__pcpu_size_call_return` implementation:

```C
#define __pcpu_size_call_return(stem, variable)			\
({								\
	typeof(variable) pscr_ret__;				\
	__verify_pcpu_ptr(&(variable));				\
	switch(sizeof(variable)) {				\
	case 1: pscr_ret__ = stem##1(variable); break;		\
	case 2: pscr_ret__ = stem##2(variable); break;		\
	case 4: pscr_ret__ = stem##4(variable); break;		\
	case 8: pscr_ret__ = stem##8(variable); break;		\
	default:						\
		__bad_size_call_parameter(); break;		\
	}							\
	pscr_ret__;						\
})
```

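The size-dispatch idea behind `__pcpu_size_call_return` can be sketched in plain C with GCC statement expressions (the `({ ... })` syntax is a GCC/clang extension, just as in the kernel). The `read_N` helpers below are stand-ins for the real `this_cpu_read_N` operations:

```c
/* Stand-ins for this_cpu_read_1/2/4/8: each just reports which size
 * handler was selected at compile time. */
static int read_1(void) { return 1; }
static int read_2(void) { return 2; }
static int read_4(void) { return 4; }
static int read_8(void) { return 8; }

/* sizeof() is a compile-time constant, so the compiler keeps only one
 * branch of the switch for any given variable. */
#define size_call_return(variable)			\
({							\
	int ret__;					\
	switch (sizeof(variable)) {			\
	case 1: ret__ = read_1(); break;		\
	case 2: ret__ = read_2(); break;		\
	case 4: ret__ = read_4(); break;		\
	case 8: ret__ = read_8(); break;		\
	default: ret__ = -1; break;			\
	}						\
	ret__;						\
})
```

Because the dispatch happens on `sizeof(variable)`, one generic macro can serve variables of every scalar size without any run-time cost.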
Yes, it looks a little strange, but it's easy. First of all we can see the definition of the `pscr_ret__` variable with the `int` type. Why int? Because `variable` is `cpu_number`, which was declared as a per-cpu int variable:

```C
DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
```

In the next step we call `__verify_pcpu_ptr` with the address of `cpu_number`. `__verify_pcpu_ptr` is used to verify that the given parameter is a per-cpu pointer. After that we set the `pscr_ret__` value, which depends on the size of the variable. Our `cpu_number` variable is an `int`, so it is 4 bytes in size. It means that we will get `this_cpu_read_4(cpu_number)` in `pscr_ret__`. At the end of `__pcpu_size_call_return` we just call it. `this_cpu_read_4` is a macro:

```C
#define this_cpu_read_4(pcp)       percpu_from_op("mov", pcp)
```

which passes the `mov` instruction and the per-cpu variable to `percpu_from_op`. `percpu_from_op` expands to an inline assembly call:

```C
asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (cpu_number))
```

Let's try to understand how it works and what it does. The `gs` segment register contains the base of the per-cpu area. Here we just copy `cpu_number`, which is in memory, to `pfo_ret__` with the `movl` instruction. In other words:

```C
this_cpu_read(cpu_number)
```

is the same as:

```C
movl %gs:$cpu_number, $pfo_ret__
```

As we haven't set up the per-cpu areas yet, we have only one, for the current running CPU, so `smp_processor_id` returns `zero`.

Now that we have the current processor id, `boot_cpu_init` sets the given CPU online, active, present and possible with:

```C
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
```

All of these functions use the `cpumask` concept. `cpu_possible` is the set of CPU IDs which can be plugged in at any time during the life of that system boot. `cpu_present` represents which CPUs are currently plugged in. `cpu_online` represents a subset of `cpu_present` and indicates CPUs which are available for scheduling. These masks depend on the `CONFIG_HOTPLUG_CPU` configuration option: if this option is disabled, `possible == present` and `active == online`. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is `true`, it calls `cpumask_set_cpu`, otherwise `cpumask_clear_cpu`.

For example, let's look at `set_cpu_possible`. As we passed `true` as the second parameter:

```C
cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));
```

will be called. First of all let's try to understand the `to_cpumask` macro. This macro casts a bitmap to a `struct cpumask *`. CPU masks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. A CPU mask is represented by the `cpumask` structure:

```C
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
```

which is just a bitmap declared with the `DECLARE_BITMAP` macro:

```C
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]
```

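To see how such a bitmap behaves, here is a user-space sketch of `DECLARE_BITMAP` with simple set/test helpers standing in for the real `set_bit`/`test_bit` (the helper names are mine, not the kernel API):

```c
#include <limits.h>

#define NR_CPUS 64

#define BITS_PER_LONG (CHAR_BIT * sizeof(long))
#define BITS_TO_LONGS(bits) (((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

static DECLARE_BITMAP(cpu_possible_bits, NR_CPUS);

/* One bit position per CPU number, as the cpumask API does. */
static void bitmap_set_cpu(unsigned long *bits, int cpu)
{
	bits[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

static int bitmap_test_cpu(const unsigned long *bits, int cpu)
{
	return !!(bits[cpu / BITS_PER_LONG] & (1UL << (cpu % BITS_PER_LONG)));
}
```

`BITS_TO_LONGS` rounds up, so 64 CPUs fit in one `unsigned long` on a 64-bit machine but 65 would need two.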
As we can see from its definition, the `DECLARE_BITMAP` macro expands to an array of `unsigned long`. Now let's look at how the `to_cpumask` macro is implemented:

```C
#define to_cpumask(bitmap)					\
	((struct cpumask *)(1 ? (bitmap)			\
			    : (void *)sizeof(__check_is_bitmap(bitmap))))
```

It may look weird at first glance. We can see a ternary operator here which is `true` every time, so why is `__check_is_bitmap` here? It's simple; let's look at it:

```C
static inline int __check_is_bitmap(const unsigned long *bitmap)
{
	return 1;
}
```

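The compile-time check can be demonstrated in isolation. In this sketch (`cpumask_sketch` and `to_cpumask_sketch` are illustrative names), the ternary always yields the bitmap pointer at run time, while the never-taken branch forces the argument through `check_is_bitmap`, so anything that is not an `unsigned long *` fails to compile:

```c
/* Compiles only when bitmap is (convertible to) const unsigned long *;
 * sizeof() never evaluates the call, so there is zero run-time cost. */
static inline int check_is_bitmap(const unsigned long *bitmap)
{
	(void)bitmap;
	return 1;
}

struct cpumask_sketch { unsigned long bits[1]; };

#define to_cpumask_sketch(bitmap)				\
	((struct cpumask_sketch *)(1 ? (bitmap)			\
		: (void *)sizeof(check_is_bitmap(bitmap))))
```

Passing, say, an `int *` to `to_cpumask_sketch` would produce a compile error, which is exactly the protection the kernel macro provides.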
Yeah, it just returns `1` every time. We need it here for only one purpose: at compile time it checks that the given `bitmap` is a bitmap, or in other words that it has the type `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro to convert the array of `unsigned long` to a `struct cpumask *`. Now we can call the `cpumask_set_cpu` function with `cpu` - 0 and `struct cpumask *cpu_possible_bits`. This function makes only one call of the `set_bit` function, which sets the given `cpu` in the cpumask. All of the `set_cpu_*` functions work on the same principle.

If the `set_cpu_*` operations and `cpumask` are still not clear to you, don't worry. You can get more info by reading the special part about it - [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) - or the [documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt).

As we have activated the bootstrap processor, it's time to go to the next function in `start_kernel`. Now it is `page_address_init`, but this function does nothing in our case, because it executes only when all `RAM` can't be mapped directly.

Print linux banner
---------------------------------------------------------------------------------

The next call is `pr_notice`:

```C
#define pr_notice(fmt, ...) \
    printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
```

As you can see, it just expands to a `printk` call. At this moment we use `pr_notice` to print the Linux banner:

```C
pr_notice("%s", linux_banner);
```

which is just the kernel version with some additional parameters:

```
Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #319 SMP
```

Architecture-dependent parts of initialization
---------------------------------------------------------------------------------

The next step is architecture-specific initialization. The Linux kernel does it with the call of the `setup_arch` function. This is a very big function, like `start_kernel`, and we do not have time to consider all of its implementation in this part. Here we'll only start on it and continue in the next part. As it is `architecture-specific`, we need to go again to the `arch/` directory. The `setup_arch` function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and takes only one argument - the address of the kernel command line.

This function starts by reserving a memory block for the kernel `_text` and `_data` sections, which start at the `_text` symbol (you may remember it from [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L46)) and end before `__bss_stop`. We use `memblock` to reserve this memory block:

```C
memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);
```

You can read about `memblock` in [Linux kernel memory management Part 1.](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). As you may remember, the `memblock_reserve` function takes two parameters:

* base physical address of a memory block;
* size of a memory block.

We can get the base physical address of the `_text` symbol with the `__pa_symbol` macro:

```C
#define __pa_symbol(x) \
	__phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
```

First of all it calls the `__phys_reloc_hide` macro on the given parameter. The `__phys_reloc_hide` macro does nothing for `x86_64` and just returns the given parameter. The implementation of the `__phys_addr_symbol` macro is easy. It just subtracts the kernel text mapping base virtual address (you may remember that it is `__START_KERNEL_map`) from the symbol address and adds `phys_base`, which is the physical base address of `_text`:

```C
#define __phys_addr_symbol(x) \
	((unsigned long)(x) - __START_KERNEL_map + phys_base)
```

Now that we have the physical address of the `_text` symbol, `memblock_reserve` can reserve a memory block of `__bss_stop - _text` bytes starting at `_text`.

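The address arithmetic is easy to verify with sample numbers. The values below are illustrative, not the real `phys_base` of any particular boot:

```c
#include <stdint.h>

/* x86_64 kernel text mapping base virtual address. */
#define START_KERNEL_MAP 0xffffffff80000000ULL

static uint64_t phys_base = 0x1000000;  /* assumed physical load address */

/* virtual symbol address -> physical address, as __phys_addr_symbol does */
static uint64_t phys_addr_symbol(uint64_t x)
{
	return x - START_KERNEL_MAP + phys_base;
}
```

So a symbol mapped 16 MB above `__START_KERNEL_map` lands 16 MB above `phys_base` in physical memory; the mapping is a plain linear offset.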
Reserve memory for initrd
---------------------------------------------------------------------------------

The next step, after we have reserved space for the kernel text and data, is to reserve space for the [initrd](http://en.wikipedia.org/wiki/Initrd). We will not go into the details of `initrd` in this post; for now you just need to know that it is a temporary root file system stored in memory and used by the kernel during its startup. The `early_reserve_initrd` function does all the work. First of all this function gets the base address of the ram disk, its size and its end address with:

```C
u64 ramdisk_image = get_ramdisk_image();
u64 ramdisk_size  = get_ramdisk_size();
u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
```

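`PAGE_ALIGN` here simply rounds an address up to the next page boundary. A minimal sketch, assuming 4096-byte pages:

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Round addr up to the next PAGE_SIZE boundary, like the kernel's
 * PAGE_ALIGN; works because PAGE_SIZE is a power of two. */
#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
```

This guarantees that `ramdisk_end` falls on a page boundary even when the ramdisk size is not a multiple of the page size.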
All of these parameters are taken from `boot_params`. If you have read the chapter about the [Linux Kernel Booting Process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html), you may remember that we filled the `boot_params` structure during boot time. The kernel setup header contains a couple of fields which describe the ramdisk, for example:

```
Field name:	ramdisk_image
Type:		write (obligatory)
Offset/size:	0x218/4
Protocol:	2.00+

  The 32-bit linear address of the initial ramdisk or ramfs. Leave at
  zero if there is no initial ramdisk/ramfs.
```

So we can get all the information that interests us from `boot_params`. For example, let's look at `get_ramdisk_image`:

```C
static u64 __init get_ramdisk_image(void)
{
	u64 ramdisk_image = boot_params.hdr.ramdisk_image;

	ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;

	return ramdisk_image;
}
```

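Combining the two 32-bit halves into one 64-bit address can be shown in isolation (the field values below are made up):

```c
#include <stdint.h>

/* How get_ramdisk_image builds a 64-bit address: the low half comes
 * from the setup header's ramdisk_image field, the high half from
 * ext_ramdisk_image, shifted into the upper 32 bits. */
static uint64_t ramdisk_image_addr(uint32_t low, uint32_t high)
{
	uint64_t addr = low;

	addr |= (uint64_t)high << 32;
	return addr;
}
```

Note the cast to a 64-bit type before the shift: shifting a 32-bit value left by 32 would be undefined behavior in C, which is why the kernel writes `(u64)boot_params.ext_ramdisk_image << 32`.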
Here we get the low 32 bits of the ramdisk address from `boot_params.hdr` and OR in the high 32 bits, shifted left by `32`. We need to do this because, as you can read in [Documentation/x86/zero-page.txt](https://github.com/0xAX/linux/blob/master/Documentation/x86/zero-page.txt):

```
0C0/004	ALL	ext_ramdisk_image ramdisk_image high 32bits
```

So after shifting the high part left by 32, we get a 64-bit address in `ramdisk_image` and return it. `get_ramdisk_size` works on the same principle as `get_ramdisk_image`, but it uses `ext_ramdisk_size` instead of `ext_ramdisk_image`. After we have the ramdisk's size, base address and end address, we check that the bootloader provided the ramdisk with:

```C
if (!boot_params.hdr.type_of_loader ||
    !ramdisk_image || !ramdisk_size)
	return;
```

and finally reserve a memory block at the calculated addresses for the initial ramdisk:

```C
memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
```

Conclusion
---------------------------------------------------------------------------------

This is the end of the fourth part about the Linux kernel initialization process. We started to dive into the generic kernel code from the `start_kernel` function in this part and stopped at the architecture-specific initialization in `setup_arch`. In the next part we will continue with architecture-dependent initialization steps.

If you have any questions or suggestions, write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [GCC function attributes](https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html)
* [this_cpu operations](https://www.kernel.org/doc/Documentation/this_cpu_ops.txt)
* [cpumask](http://www.crashcourse.ca/wiki/index.php/Cpumask)
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [cgroups](http://en.wikipedia.org/wiki/Cgroups)
* [stack buffer overflow](http://en.wikipedia.org/wiki/Stack_buffer_overflow)
* [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [initrd](http://en.wikipedia.org/wiki/Initrd)
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md)

Kernel initialization. Part 5.
================================================================================

Continuation of architecture-specific initialization
================================================================================

In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the architecture-specific initialization in the [setup_arch](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L856) function and now we will continue with it. As we have reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), the next step is `olpc_ofw_detect`, which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform-related stuff in this book and will skip functions related to it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes the debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupt gates. If you don't know anything about interrupts, you can read about them in [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In the `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You may remember we already saw it in the [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts and exceptions:

```
----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
----------------------------------------------------------------------------------------------
|3     | #BP    |Breakpoint          |Trap |NO        |INT 3                                 |
----------------------------------------------------------------------------------------------
```

The debug interrupt `#DB` is the primary method of invoking debuggers. `early_trap_init` is defined in [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). This function sets the `#DB` and `#BP` handlers and reloads the [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table):

```C
void __init early_trap_init(void)
{
	set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
	set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
	load_idt(&idt_descr);
}
```

We already saw the implementation of `set_intr_gate` in the previous part about interrupts. Here are two similar functions, `set_intr_gate_ist` and `set_system_intr_gate_ist`. Both of these functions take three parameters:

* the number of the interrupt;
* the base address of the interrupt/exception handler;
* the `Interrupt Stack Table`. `IST` is a new mechanism in `x86_64` and part of the [TSS](http://en.wikipedia.org/wiki/Task_state_segment). Every active thread in kernel mode has its own kernel stack, which is 16 kilobytes. While a thread is in user space, its kernel stack is empty except for `thread_info` (read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)) at the bottom. In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. You can read all about these stacks in the Linux kernel documentation - [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks). `x86_64` provides a feature which allows switching to a new `special` stack during events such as non-maskable interrupts. The name of this feature is `Interrupt Stack Table`. There can be up to 7 `IST` entries per CPU and every entry points to a dedicated stack. In our case this is `DEBUG_STACK`.

`set_intr_gate_ist` and `set_system_intr_gate_ist` work on the same principle as `set_intr_gate` with only one difference. Both of these functions check the interrupt number and call `_set_gate` inside:

```C
BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
```

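For reference, the descriptor that `_set_gate` fills in has dedicated `ist` and `dpl` fields. This is a sketch of the 16-byte `x86_64` interrupt gate layout, modeled on the kernel's `gate_struct64` (field names follow the kernel; not meant as a drop-in replacement):

```c
#include <stdint.h>

/* 16-byte x86_64 interrupt/trap gate descriptor. Note the 3-bit IST
 * index and 2-bit DPL fields the text talks about. */
struct idt_gate {
	uint16_t offset_low;     /* handler address, bits 0..15 */
	uint16_t segment;        /* code segment selector, e.g. __KERNEL_CS */
	unsigned ist  : 3;       /* Interrupt Stack Table index */
	unsigned zero : 5;
	unsigned type : 5;       /* gate type, e.g. interrupt gate */
	unsigned dpl  : 2;       /* descriptor privilege level */
	unsigned p    : 1;       /* present bit */
	uint16_t offset_middle;  /* handler address, bits 16..31 */
	uint32_t offset_high;    /* handler address, bits 32..63 */
	uint32_t reserved;
} __attribute__((packed));
```

With `ist` non-zero (e.g. `DEBUG_STACK`), the CPU switches to the corresponding dedicated stack before invoking the handler; `dpl = 3` is what lets a user-space `int3` reach the `#BP` handler.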
just as `set_intr_gate` does. But `set_intr_gate` calls `_set_gate` with [dpl](http://en.wikipedia.org/wiki/Privilege_level) - 0 and ist - 0, while `set_intr_gate_ist` and `set_system_intr_gate_ist` set `ist` to `DEBUG_STACK`, and `set_system_intr_gate_ist` sets `dpl` to `0x3`, which is the lowest privilege level. When an interrupt occurs and the hardware loads such a descriptor, the hardware automatically sets the new stack pointer based on the IST value and then invokes the interrupt handler. All of the special kernel stacks will be set up in the `cpu_init` function (we will see it later).

As the `#DB` and `#BP` gates are written to the `idt_descr`, we reload the `IDT` table with `load_idt`, which just executes the `lidt` instruction. Now let's look at the interrupt handlers and try to understand how they work. Of course, I can't cover all interrupt handlers in this book and I do not see the point in doing so. It is very interesting to delve into the Linux kernel source code, so we will see how the `debug` handler is implemented in this part; understanding how other interrupt handlers are implemented will be your task.

#DB handler
--------------------------------------------------------------------------------

As you can read above, we passed the address of the `#DB` handler as `&debug` in `set_intr_gate_ist`. [lxr.free-electrons.com](http://lxr.free-electrons.com/ident) is a great resource for searching identifiers in the Linux kernel source code, but unfortunately you will not find the `debug` handler with it. All you can find is the `debug` declaration in [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/traps.h):

```C
asmlinkage void debug(void);
```

We can see the `asmlinkage` attribute, which tells us that `debug` is a function written in [assembly](http://en.wikipedia.org/wiki/Assembly_language). Yeah, again and again assembly :). The implementation of the `#DB` handler, like that of the other handlers, is in [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) and defined with the `idtentry` assembly macro:

```assembly
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
```

`idtentry` is a macro which defines an interrupt/exception entry point. As you can see, it takes five arguments:

* name of the interrupt entry point;
* name of the interrupt handler;
* whether the interrupt has an error code or not;
* paranoid - if this parameter = 1, switch to the special stack (read above);
* shift_ist - stack to switch to during the interrupt.

Now let's look on `idtentry` macro implementation. This macro defined in the same assembly file and defines `debug` function with the `ENTRY` macro. For the start `idtentry` macro checks that given parameters are correct in case if need to switch to the special stack. In the next step it checks that give interrupt returns error code. If interrupt does not return error code (in our case `#DB` does not return error code), it calls `INTR_FRAME` or `XCPT_FRAME` if interrupt has error code. Both of these macros `XCPT_FRAME` and `INTR_FRAME` do nothing and need only for the building initial frame state for interrupts. They uses `CFI` directives and used for debugging. More info you can find in the [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html). As comment from the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) says: `CFI macros are used to generate dwarf2 unwind information for better backtraces. They don't change any code.` so we will ignore them.
|
||||
|
||||
```assembly
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
	/* Sanity check */
	.if \shift_ist != -1 && \paranoid == 0
	.error "using shift_ist requires paranoid=1"
	.endif

	.if \has_error_code
	XCPT_FRAME
	.else
	INTR_FRAME
	.endif
	...
	...
	...
```

You may remember from the previous part about early interrupts/exceptions handling that after an interrupt occurs, the current stack will have the following format:

```
    +-----------------------+
    |                       |
+40 |          SS           |
+32 |          RSP          |
+24 |        RFLAGS         |
+16 |          CS           |
 +8 |          RIP          |
  0 |      Error Code       | <---- rsp
    |                       |
    +-----------------------+
```

The next two macros from the `idtentry` implementation are:

```assembly
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
```

The first, the `ASM_CLAC` macro, depends on the `CONFIG_X86_SMAP` configuration option and is needed for security reasons; you can read more about it [here](https://lwn.net/Articles/517475/). The second macro, `PARAVIRT_ADJUST_EXCEPTION_FRAME`, is for handling Xen-type exceptions (this chapter is about kernel initialization and we will not consider virtualization stuff here).

The next piece of code checks whether the interrupt has an error code or not and pushes `$-1`, which is `0xffffffffffffffff` on `x86_64`, onto the stack if it does not:

```assembly
.ifeq \has_error_code
pushq_cfi $-1
.endif
```

We need this as a `dummy` error code for stack consistency across all interrupts. In the next step we subtract `$ORIG_RAX-R15` from the stack pointer:

```assembly
subq $ORIG_RAX-R15, %rsp
```

where `ORIG_RAX`, `R15` and other macros are defined in [arch/x86/include/asm/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/calling.h) and `ORIG_RAX-R15` is 120 bytes. The general purpose registers will occupy these 120 bytes because we need to store all registers on the stack during interrupt handling. After we have set up the stack for the general purpose registers, the next step is checking whether the interrupt came from userspace with:

```assembly
testl $3, CS(%rsp)
jnz 1f
```

Here we check the first and second bits of `CS`. You may remember that the `CS` register contains the segment selector whose first two bits are the `RPL`. All privilege levels are integers in the range 0–3, where the lowest number corresponds to the highest privilege. So if the interrupt came from kernel mode we call `save_paranoid`, otherwise we jump to label `1`. In `save_paranoid` we store all general purpose registers on the stack and switch from user `gs` to kernel `gs` if needed:

```assembly
movl $1,%ebx
movl $MSR_GS_BASE,%ecx
rdmsr
testl %edx,%edx
js 1f
SWAPGS
xorl %ebx,%ebx
1: ret
```

In the next steps we put the `pt_regs` pointer into `rdi`, save the error code in `rsi` if the interrupt has one, and call the interrupt handler, which in our case is `do_debug` from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). `do_debug`, like other handlers, takes two parameters:

* pt_regs - a structure which represents the set of CPU registers saved in the process' memory region;
* error code - the error code of the interrupt.

After the interrupt handler finishes its work, it calls `paranoid_exit`, which restores the stack, switches back to userspace if the interrupt came from there, and calls `iret`. That's all. Of course it is not all :), but we will look at interrupts more deeply in a separate chapter.

This is the general view of the `idtentry` macro for the `#DB` interrupt. All interrupts are similar to this implementation and are defined with `idtentry` too. After `early_trap_init` has finished its work, the next function is `early_cpu_init`. This function is defined in [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) and collects information about the CPU and its vendor.

Early ioremap initialization
--------------------------------------------------------------------------------

The next step is the initialization of early `ioremap`. In general there are two ways to communicate with devices:

* I/O Ports;
* Device memory.

We already saw the first method (`outb/inb` instructions) in the part about the linux kernel booting [process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html). The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM or to the memory of an I/O device. So `ioremap` is used to map device memory into the kernel address space.

As I wrote above, the next function is `early_ioremap_init`, which re-maps I/O memory into the kernel address space so the kernel can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like `ioremap` are available. The implementation of this function is in [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). At the start of `early_ioremap_init` we can see the definition of the `pmd` pointer with the `pmd_t` type (which represents a page middle directory entry: `typedef struct { pmdval_t pmd; } pmd_t;` where `pmdval_t` is `unsigned long`) and a check that `fixmap` is aligned in a correct way:

```C
pmd_t *pmd;
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
```

`fixmap` is a set of fixed virtual address mappings which extends from `FIXADDR_START` to `FIXADDR_TOP`. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check, `early_ioremap_init` calls the `early_ioremap_setup` function from [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c). `early_ioremap_setup` fills the `slot_virt` array of `unsigned long` with the virtual addresses of the 512 temporary boot-time fix-mappings:

```C
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
    slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
```

After this we get the page middle directory entry for `FIX_BTMAP_BEGIN`, put it into the `pmd` variable, fill `bm_pte` (the boot time page tables) with zeros, and call the `pmd_populate_kernel` function to set the given page table entry in the given page middle directory:

```C
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);
```

That's all for now. If you are feeling puzzled, don't worry. There is a special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) chapter.

Obtaining major and minor numbers for the root device
--------------------------------------------------------------------------------

After early `ioremap` has been initialized, you can see the following code:

```C
ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
```

This code obtains the major and minor numbers of the root device where `initrd` will be mounted later by the `do_mount_root` function. The major number of a device identifies the driver associated with it. The minor number identifies a specific device controlled by that driver. Note that `old_decode_dev` takes one parameter from the `boot_params` structure. As we can read from the x86 linux kernel boot protocol:

```
Field name:	root_dev
Type:		modify (optional)
Offset/size:	0x1fc/2
Protocol:	ALL

  The default root device device number.  The use of this field is
  deprecated, use the "root=" option on the command line instead.
```

Now let's try to understand what `old_decode_dev` does. Actually it just calls `MKDEV`, which generates a `dev_t` from the given major and minor numbers. Its implementation is pretty simple:

```C
static inline dev_t old_decode_dev(u16 val)
{
	 return MKDEV((val >> 8) & 255, val & 255);
}
```

where `dev_t` is a kernel data type representing a major/minor number pair. But what is that strange `old_` prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way, major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bits for the major number and 8 bits for the minor number. But there is a problem: this allows only 256 major numbers and 256 minor numbers. So the 16-bit integer was replaced by a 32-bit integer where 12 bits are reserved for the major number and 20 bits for the minor number. You can see this in the `new_decode_dev` implementation:

```C
static inline dev_t new_decode_dev(u32 dev)
{
	unsigned major = (dev & 0xfff00) >> 8;
	unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
	return MKDEV(major, minor);
}
```

After the calculation we will get `0xfff` or 12 bits for `major` if it is `0xffffffff` and `0xfffff` or 20 bits for `minor`. So at the end of the execution of `old_decode_dev` we get the major and minor numbers of the root device in `ROOT_DEV`.

Memory map setup
--------------------------------------------------------------------------------

The next point is the setup of the memory map with the call of the `setup_memory_map` function. But before this we set up different parameters, such as information about the screen (current row and column, video page and so on; you can read about it in [Video mode initialization and transition to protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html)), Extended Display Identification Data, video mode, bootloader type and so on:

```C
	screen_info = boot_params.screen_info;
	edid_info = boot_params.edid_info;
	saved_video_mode = boot_params.hdr.vid_mode;
	bootloader_type = boot_params.hdr.type_of_loader;
	if ((bootloader_type >> 4) == 0xe) {
		bootloader_type &= 0xf;
		bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
	}
	bootloader_version  = bootloader_type & 0xf;
	bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
```

All of these parameters we obtained during boot time and stored in the `boot_params` structure. After this we need to set up the end of the I/O memory. As you know, one of the main purposes of the kernel is resource management, and one of those resources is memory. As we already know, there are two ways to communicate with devices: I/O ports and device memory. All information about registered resources is available through:

* /proc/ioports - provides a list of currently registered port regions used for input or output communication with a device;
* /proc/iomem - provides the current map of the system's memory for each physical device.

At the moment we are interested in `/proc/iomem`:

```
cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000f0000-000fffff : System ROM
```

As you can see, ranges of addresses are shown in hexadecimal notation along with their owners. The Linux kernel provides an API for managing any resources in a general way. Global resources (for example PICs or I/O ports) can be divided into subsets relating to any hardware bus slot. The main structure is `resource`:

```C
struct resource {
        resource_size_t start;
        resource_size_t end;
        const char *name;
        unsigned long flags;
        struct resource *parent, *sibling, *child;
};
```

which presents an abstraction for a tree-like subset of system resources. This structure provides the range of addresses from `start` to `end` (`resource_size_t` is `phys_addr_t` or `u64` for `x86_64`) which a resource covers, the `name` of a resource (you see these names in the `/proc/iomem` output) and the `flags` of a resource (all resource flags are defined in [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h)). Last come three pointers to `resource` structures. These pointers enable a tree-like structure:

```
+-------------+      +-------------+
|             |      |             |
|    parent   |------|   sibling   |
|             |      |             |
+-------------+      +-------------+
       |
       |
+-------------+
|             |
|    child    |
|             |
+-------------+
```

Every subset of resources has a root range resource. For `iomem` it is `iomem_resource`, which is defined as:

```C
struct resource iomem_resource = {
        .name   = "PCI mem",
        .start  = 0,
        .end    = -1,
        .flags  = IORESOURCE_MEM,
};
EXPORT_SYMBOL(iomem_resource);
```

`iomem_resource` defines the root address range for I/O memory with the `PCI mem` name and `IORESOURCE_MEM` (`0x00000200`) as flags. As I wrote above, our current task is to set up the end address of `iomem`. We will do it with:

```C
iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
```

Here we shift `1` left by `boot_cpu_data.x86_phys_bits`. `boot_cpu_data` is the `cpuinfo_x86` structure which we filled during execution of `early_cpu_init`. As you can understand from the name of the `x86_phys_bits` field, it represents the number of bits in the maximum physical address of the system. Note also that `iomem_resource` is passed to the `EXPORT_SYMBOL` macro. This macro exports the given symbol (`iomem_resource` in our case) for dynamic linking, or in other words it makes the symbol accessible to dynamically loaded modules.

After we have set the end address of the root `iomem` resource address range, as I wrote above, the next step is the setup of the memory map. It is produced with the call of the `setup_memory_map` function:

```C
void __init setup_memory_map(void)
{
	char *who;

	who = x86_init.resources.memory_setup();
	memcpy(&e820_saved, &e820, sizeof(struct e820map));
	printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n");
	e820_print_map(who);
}
```

First of all, let's look at the call of `x86_init.resources.memory_setup`. `x86_init` is an `x86_init_ops` structure which holds platform-specific setup functions, such as resources initialization, pci initialization and so on. The initialization of `x86_init` is in [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c). I will not give the full description here because it is very long, only the one part which interests us for now:

```C
struct x86_init_ops x86_init __initdata = {
	.resources = {
		.probe_roms             = probe_roms,
		.reserve_resources      = reserve_standard_io_resources,
		.memory_setup           = default_machine_specific_memory_setup,
	},
	...
	...
	...
}
```

As we can see, the `memory_setup` field is `default_machine_specific_memory_setup`, where we get the number of [e820](http://en.wikipedia.org/wiki/E820) entries which we collected at [boot time](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill the `e820map` structure with the memory regions. Once all regions are collected, they are printed with printk. You can find this output if you execute the `dmesg` command; you will see something like this:

```
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable
...
...
...
```

Copying of the BIOS Enhanced Disk Device information
--------------------------------------------------------------------------------

The next two steps are the parsing of `setup_data` with the `parse_setup_data` function and copying the BIOS EDD information to a safe place. `setup_data` is a field from the kernel boot header, and as we can read from the `x86` boot protocol:

```
Field name:	setup_data
Type:		write (special)
Offset/size:	0x250/8
Protocol:	2.09+

  The 64-bit physical pointer to NULL terminated single linked list of
  struct setup_data. This is used to define a more extensible boot
  parameters passing mechanism.
```

It is used for storing setup information of different types, such as the device tree blob, EFI setup data and so on. In the second step we copy the BIOS EDD information from the `boot_params` structure, which we collected in [arch/x86/boot/edd.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c), into the `edd` structure:

```C
static inline void __init copy_edd(void)
{
	 memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
	        sizeof(edd.mbr_signature));
	 memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
	 edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
	 edd.edd_info_nr = boot_params.eddbuf_entries;
}
```

Memory descriptor initialization
--------------------------------------------------------------------------------

The next step is the initialization of the memory descriptor of the init process. As you may already know, every process has its own address space. This address space is presented with a special data structure called the `memory descriptor`. In the linux kernel source code the memory descriptor is represented by the `mm_struct` structure. `mm_struct` contains many different fields related to the process address space, such as the start/end addresses of the kernel code/data, the start/end of brk, the number of memory areas, the list of memory areas and so on. This structure is defined in [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h). As every process has its own memory descriptor, the `task_struct` structure contains it in the `mm` and `active_mm` fields. And our first `init` process has it too. You may remember that we saw the part of the initialization of the init `task_struct` with the `INIT_TASK` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html):

```C
#define INIT_TASK(tsk)	\
{
	...
	...
	...
	.mm = NULL,		\
	.active_mm = &init_mm,	\
	...
}
```

`mm` points to the process address space and `active_mm` points to the active address space if the process has no address space of its own, as is the case for kernel threads (more about it you can read in the [documentation](https://www.kernel.org/doc/Documentation/vm/active_mm.txt)). Now we fill the memory descriptor of the initial process:

```C
	init_mm.start_code = (unsigned long) _text;
	init_mm.end_code = (unsigned long) _etext;
	init_mm.end_data = (unsigned long) _edata;
	init_mm.brk = _brk_end;
```

with the kernel's text, data and brk. `init_mm` is the memory descriptor of the initial process and is defined as:

```C
struct mm_struct init_mm = {
	.mm_rb          = RB_ROOT,
	.pgd            = swapper_pg_dir,
	.mm_users       = ATOMIC_INIT(2),
	.mm_count       = ATOMIC_INIT(1),
	.mmap_sem       = __RWSEM_INITIALIZER(init_mm.mmap_sem),
	.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
	.mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
	INIT_MM_CONTEXT(init_mm)
};
```

where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointer to the page global directory, `mm_users` is the number of address space users, `mm_count` is the primary usage counter and `mmap_sem` is the memory area semaphore. After we set up the memory descriptor of the initial process, the next step is the initialization of the Intel Memory Protection Extensions with `mpx_mm_init`. The step after that is the initialization of the code/data/bss resources with:

```C
	code_resource.start = __pa_symbol(_text);
	code_resource.end = __pa_symbol(_etext)-1;
	data_resource.start = __pa_symbol(_etext);
	data_resource.end = __pa_symbol(_edata)-1;
	bss_resource.start = __pa_symbol(__bss_start);
	bss_resource.end = __pa_symbol(__bss_stop)-1;
```

We already know a little about the `resource` structure (read above). Here we fill the code/data/bss resources with their physical addresses. You can see them in `/proc/iomem`:

```
00100000-be825fff : System RAM
  01000000-015bb392 : Kernel code
  015bb393-01930c3f : Kernel data
  01a11000-01ac3fff : Kernel bss
```

All of these structures are defined in [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and look like typical resource initialization:

```C
static struct resource code_resource = {
	.name	= "Kernel code",
	.start	= 0,
	.end	= 0,
	.flags	= IORESOURCE_BUSY | IORESOURCE_MEM
};
```

The last step which we will cover in this part is the `NX` configuration. The `NX-bit` or no-execute bit is bit 63 of a page table entry and controls the ability to execute code from all physical pages mapped by the table entry. This bit can only be used/set when the `no-execute` page-protection mechanism is enabled by setting `EFER.NXE` to 1. In the `x86_configure_nx` function we check that the CPU supports the `NX-bit` and that it is not disabled. After the check we fill `__supported_pte_mask` depending on the result:

```C
void x86_configure_nx(void)
{
        if (cpu_has_nx && !disable_nx)
                __supported_pte_mask |= _PAGE_NX;
        else
                __supported_pte_mask &= ~_PAGE_NX;
}
```

Conclusion
--------------------------------------------------------------------------------

It is the end of the fifth part about the linux kernel initialization process. In this part we continued to dive into the `setup_arch` function, which performs the initialization of architecture-specific stuff. It was a long part, but we are not finished with it. As I already wrote, `setup_arch` is a big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap and so on. Don't worry if they are unclear to you. There is a special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md). In the next part we will continue with the initialization of the architecture-specific stuff and will see the parsing of the early kernel parameters, early dump of the pci devices, Direct Media Interface scanning and many many more.

If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [mm vs active_mm](https://www.kernel.org/doc/Documentation/vm/active_mm.txt)
* [e820](http://en.wikipedia.org/wiki/E820)
* [Supervisor mode access prevention](https://lwn.net/Articles/517475/)
* [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
* [TSS](http://en.wikipedia.org/wiki/Task_state_segment)
* [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table)
* [Memory mapped I/O](http://en.wikipedia.org/wiki/Memory-mapped_I/O)
* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html)
* [PDF. dwarf4 specification](http://dwarfstd.org/doc/DWARF4.pdf)
* [Call stack](http://en.wikipedia.org/wiki/Call_stack)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)

549
Initialization/linux-initialization-6.md
Normal file
@@ -0,0 +1,549 @@
Kernel initialization. Part 6.
================================================================================

Architecture-specific initialization, again...
================================================================================

In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) we saw architecture-specific (`x86_64` in our case) initialization stuff from [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and finished on the `x86_configure_nx` function, which sets the `_PAGE_NX` flag depending on the support of the [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before, the `setup_arch` and `start_kernel` functions are very big, so in this and in the next part we will continue to learn about the architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and, as you can understand from its name, it parses the kernel command line and sets up different services depending on the given parameters (all kernel command line parameters can be found in [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)). You may remember how we set up `earlyprintk` in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). At that early stage we looked for kernel parameters and their values with the `cmdline_find_option` function and the `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cmdline.c). Now we are in the generic kernel part which does not depend on architecture, and here we use another approach. If you are reading the linux kernel source code, you have already noticed calls like this:

```C
early_param("gbpages", parse_direct_gbpages_on);
```

The `early_param` macro takes two parameters:

* command line parameter name;
* function which will be called if the given parameter is passed.

and is defined as:

```C
#define early_param(str, fn) \
	__setup_param(str, fn, fn, 1)
```

in [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h). As you can see, the `early_param` macro just calls the `__setup_param` macro:

```C
#define __setup_param(str, unique_id, fn, early)            \
	static const char __setup_str_##unique_id[] __initconst \
		__aligned(1) = str;                                 \
	static struct obs_kernel_param __setup_##unique_id      \
		__used __section(.init.setup)                       \
		__attribute__((aligned((sizeof(long)))))            \
		= { __setup_str_##unique_id, fn, early }
```

This macro defines a `__setup_str_*_id` variable (where `*` depends on the given function name) and assigns the given command line parameter name to it. In the next line we can see the definition of the `__setup_*` variable, whose type is `obs_kernel_param`, and its initialization. The `obs_kernel_param` structure is defined as:

```C
struct obs_kernel_param {
	const char *str;
	int (*setup_func)(char *);
	int early;
};
```

and contains three fields:

* name of the kernel parameter;
* function which sets something up depending on the parameter;
* field which determines whether the parameter is early (1) or not (0).

Note that the `__setup_param` macro is defined with the `__section(.init.setup)` attribute. It means that all `__setup_str_*` entries will be placed in the `.init.setup` section; moreover, as we can see in [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h), they will be placed between `__setup_start` and `__setup_end`:

```
#define INIT_SETUP(initsetup_align)            \
	. = ALIGN(initsetup_align);                \
	VMLINUX_SYMBOL(__setup_start) = .;         \
	*(.init.setup)                             \
	VMLINUX_SYMBOL(__setup_end) = .;
```

Now we know how parameters are defined, let's back to the `parse_early_param` implementation:
|
||||
|
||||
```C
|
||||
void __init parse_early_param(void)
|
||||
{
|
||||
static int done __initdata;
|
||||
static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
|
||||
|
||||
if (done)
|
||||
return;
|
||||
|
||||
/* All fall through to do_early_param. */
|
||||
strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
|
||||
parse_early_options(tmp_cmdline);
|
||||
done = 1;
|
||||
}
|
||||
```
|
||||
|
||||
The `parse_early_param` function defines two static variables. First `done` check that `parse_early_param` already called and the second is temporary storage for kernel command line. After this we copy `boot_command_line` to the temporary command line which we just defined and call the `parse_early_options` function from the same source code `main.c` file. `parse_early_options` calls the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/) where `parse_args` parses given command line and calls `do_early_param` function. This [function](https://github.com/torvalds/linux/blob/master/init/main.c#L413) goes from the ` __setup_start` to `__setup_end`, and calls the function from the `obs_kernel_param` if a parameter is early. After this all services which are depend on early command line parameters were setup and the next call after the `parse_early_param` is `x86_report_nx`. As I wrote in the beginning of this part, we already set `NX-bit` with the `x86_configure_nx`. The next `x86_report_nx` function from the [arch/x86/mm/setup_nx.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/setup_nx.c) just prints information about the `NX`. Note that we call `x86_report_nx` not right after the `x86_configure_nx`, but after the call of the `parse_early_param`. The answer is simple: we call it after the `parse_early_param` because the kernel support `noexec` parameter:
|
||||
|
||||
```
|
||||
noexec [X86]
|
||||
On X86-32 available only on PAE configured kernels.
|
||||
noexec=on: enable non-executable mappings (default)
|
||||
noexec=off: disable non-executable mappings
|
||||
```

We can see it at boot time:



After this we can see the call of the `memblock_x86_reserve_range_setup_data` function:

```C
memblock_x86_reserve_range_setup_data();
```

This function is defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file; it remaps memory for the `setup_data` and reserves a memory block for the `setup_data` (more about `setup_data` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html), and about `ioremap` and `memblock` you can read in [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)).

In the next step we can see the following conditional statement:

```C
if (acpi_mps_check()) {
#ifdef CONFIG_X86_LOCAL_APIC
	disable_apic = 1;
#endif
	setup_clear_cpu_cap(X86_FEATURE_APIC);
}
```

The first function, `acpi_mps_check` from [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c), depends on the `CONFIG_X86_LOCAL_APIC` and `CONFIG_X86_MPPARSE` configuration options:

```C
int __init acpi_mps_check(void)
{
#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE)
	/* mptable code is not built-in */
	if (acpi_disabled || acpi_noirq) {
		printk(KERN_WARNING "MPS support code is not built-in.\n"
		       "Using acpi=off or acpi=noirq or pci=noacpi "
		       "may have problem\n");
		return 1;
	}
#endif
	return 0;
}
```

It checks for the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_X86_MPPARSE` is not set, `acpi_mps_check` prints a warning message if one of the command line options `acpi=off`, `acpi=noirq` or `pci=noacpi` was passed to the kernel. If `acpi_mps_check` returns `1`, it means that we disable the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clear the `X86_FEATURE_APIC` bit in the cpu capabilities of the current CPU with the `setup_clear_cpu_cap` macro (more about CPU masks you can read in [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)).

Early PCI dump
--------------------------------------------------------------------------------

In the next step we make a dump of the [PCI](http://en.wikipedia.org/wiki/Conventional_PCI) devices with the following code:

```C
#ifdef CONFIG_PCI
	if (pci_early_dump_regs)
		early_dump_pci_devices();
#endif
```

The `pci_early_dump_regs` variable is defined in [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c) and its value depends on the kernel command line parameter `pci=earlydump`. We can find the definition of this parameter in [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/drivers/pci/pci.c):

```C
early_param("pci", pci_setup);
```

The `pci_setup` function gets the string after `pci=` and analyzes it. This function calls `pcibios_setup`, which is defined as `__weak` in [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/drivers/pci/pci.c), and every architecture defines its own function which overrides the `__weak` one. For example, the `x86_64` architecture-dependent version is in [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c):

```C
char *__init pcibios_setup(char *str) {
	...
	...
	...
	} else if (!strcmp(str, "earlydump")) {
		pci_early_dump_regs = 1;
		return NULL;
	}
	...
	...
	...
}
```

So, if the `CONFIG_PCI` option is set and we passed the `pci=earlydump` option to the kernel command line, the next function which will be called is `early_dump_pci_devices` from [arch/x86/pci/early.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/early.c). This function checks the `noearly` pci parameter with:

```C
if (!early_pci_allowed())
	return;
```

and returns if it was passed. Each PCI domain can host up to `256` buses and each bus hosts up to `32` devices. So, we go in a loop:

```C
for (bus = 0; bus < 256; bus++) {
	for (slot = 0; slot < 32; slot++) {
		for (func = 0; func < 8; func++) {
			...
			...
			...
		}
	}
}
```

and read the `pci` config with the `read_pci_config` function.

That's all. We will not go deep into `pci` details here, but will see more details in the special `Drivers/PCI` part.

Finish with memory parsing
--------------------------------------------------------------------------------

After `early_dump_pci_devices`, there are a couple of functions related to the available memory and [e820](http://en.wikipedia.org/wiki/E820) which we collected in the [First steps in the kernel setup](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) part:

```C
	/* update the e820_saved too */
	e820_reserve_setup_data();
	finish_e820_parsing();
	...
	...
	...
	e820_add_kernel_range();
	trim_bios_range(void);
	max_pfn = e820_end_of_ram_pfn();
	early_reserve_e820_mpc_new();
```

Let's look at them. As you can see, the first function is `e820_reserve_setup_data`. This function does almost the same as `memblock_x86_reserve_range_setup_data` which we saw above, but it also calls `e820_update_range`, which adds new regions to the `e820map` with the given type, which is `E820_RESERVED_KERN` in our case. The next function is `finish_e820_parsing`, which sanitizes the `e820map` with the `sanitize_e820_map` function. Besides these two functions we can see a couple of functions related to [e820](http://en.wikipedia.org/wiki/E820) in the listing above. The `e820_add_kernel_range` function takes the physical addresses of the kernel start and end:

```C
u64 start = __pa_symbol(_text);
u64 size = __pa_symbol(_end) - start;
```

checks that `.text`, `.data` and `.bss` are marked as `E820_RAM` in the `e820map` and prints a warning message if not. The next function, `trim_bios_range`, marks the first 4096 bytes in the `e820map` as `E820_RESERVED` and sanitizes the map again with a call of `sanitize_e820_map`. After this we get the last page frame number with the call of the `e820_end_of_ram_pfn` function. Every memory page has a unique number - a `Page frame number` - and `e820_end_of_ram_pfn` returns the maximum one with the call of `e820_end_pfn`:

```C
unsigned long __init e820_end_of_ram_pfn(void)
{
	return e820_end_pfn(MAX_ARCH_PFN);
}
```

where `e820_end_pfn` takes the maximum page frame number for the given architecture (`MAX_ARCH_PFN` is `0x400000000` for `x86_64`). In `e820_end_pfn` we go through all the `e820` slots and check that an `e820` entry has `E820_RAM` or `E820_PRAM` type (because we calculate page frame numbers only for these types), get the base and end page frame numbers for the current `e820` entry, and make some checks on these addresses:

```C
for (i = 0; i < e820.nr_map; i++) {
	struct e820entry *ei = &e820.map[i];
	unsigned long start_pfn;
	unsigned long end_pfn;

	if (ei->type != E820_RAM && ei->type != E820_PRAM)
		continue;

	start_pfn = ei->addr >> PAGE_SHIFT;
	end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT;

	if (start_pfn >= limit_pfn)
		continue;
	if (end_pfn > limit_pfn) {
		last_pfn = limit_pfn;
		break;
	}
	if (end_pfn > last_pfn)
		last_pfn = end_pfn;
}
```

```C
if (last_pfn > max_arch_pfn)
	last_pfn = max_arch_pfn;

printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n",
       last_pfn, max_arch_pfn);
return last_pfn;
```

After this we check that `last_pfn`, which we got in the loop, is not greater than the maximum page frame number for the given architecture (`x86_64` in our case), print information about the last page frame number and return it. We can see the `last_pfn` in the `dmesg` output:

```
...
[    0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000
...
```

After this, as we have calculated the biggest page frame number, we calculate `max_low_pfn`, which is the biggest page frame number in `low memory`, that is, below the first `4` gigabytes. If more than 4 gigabytes of RAM are installed, `max_low_pfn` will be the result of the `e820_end_of_low_ram_pfn` function, which does the same as `e820_end_of_ram_pfn` but with a 4 gigabyte limit; otherwise `max_low_pfn` will be the same as `max_pfn`:

```C
if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
	max_low_pfn = e820_end_of_low_ram_pfn();
else
	max_low_pfn = max_pfn;

high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
```

Next we calculate `high_memory` (which defines the upper bound of the direct-mapped memory) with the `__va` macro, which returns a virtual address for the given physical address.

DMI scanning
-------------------------------------------------------------------------------

The next step after the manipulations with the different memory regions and `e820` slots is collecting information about the computer. We will get all the information with the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface) and the following functions:

```C
dmi_scan_machine();
dmi_memdev_walk();
```

The first is `dmi_scan_machine`, defined in [drivers/firmware/dmi_scan.c](https://github.com/torvalds/linux/blob/master/drivers/firmware/dmi_scan.c). This function goes through the [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) structures and extracts information. There are two specified ways to gain access to the `SMBIOS` table: getting the pointer to the `SMBIOS` table from the [EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) configuration table, or scanning `0x10000` bytes of physical memory starting at the `0xF0000` address. Let's look at the second approach. The `dmi_scan_machine` function remaps this memory region with `dmi_early_remap`, which just expands to `early_ioremap`:

```C
void __init dmi_scan_machine(void)
{
	char __iomem *p, *q;
	char buf[32];
	...
	...
	...
	p = dmi_early_remap(0xF0000, 0x10000);
	if (p == NULL)
		goto error;
```

and iterates over all the `DMI` header addresses searching for the `_SM_` string:

```C
memset(buf, 0, 16);
for (q = p; q < p + 0x10000; q += 16) {
	memcpy_fromio(buf + 16, q, 16);
	if (!dmi_smbios3_present(buf) || !dmi_present(buf)) {
		dmi_available = 1;
		dmi_early_unmap(p, 0x10000);
		goto out;
	}
	memcpy(buf, buf + 16, 16);
}
```

The `_SM_` string must be between `0x000F0000` and `0x000FFFFF`. Here we copy 16 bytes to `buf` with `memcpy_fromio`, which is the same as `memcpy`, and execute `dmi_smbios3_present` and `dmi_present` on the buffer. These functions check that the first 4 bytes are the `_SM_` string, get the `SMBIOS` version and read the `_DMI_` attributes such as the `DMI` structure table length, the table address and so on. After one of these functions finishes, you will see its result in the `dmesg` output:

```
[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014
```

At the end of `dmi_scan_machine`, we unmap the previously remapped memory:

```C
dmi_early_unmap(p, 0x10000);
```

The second function is `dmi_memdev_walk`. As you can understand, it goes over memory devices. Let's look at it:

```C
void __init dmi_memdev_walk(void)
{
	if (!dmi_available)
		return;

	if (dmi_walk_early(count_mem_devices) == 0 && dmi_memdev_nr) {
		dmi_memdev = dmi_alloc(sizeof(*dmi_memdev) * dmi_memdev_nr);
		if (dmi_memdev)
			dmi_walk_early(save_mem_devices);
	}
}
```

It checks that `DMI` is available (we got this in the previous function, `dmi_scan_machine`) and collects information about memory devices with `dmi_walk_early` and `dmi_alloc`, which is defined as:

```
#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif
```

`RESERVE_BRK` is defined in [arch/x86/include/asm/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/setup.h) and reserves space of the given size in the `brk` section.

At this point `setup_arch` also makes a few calls which we will not cover in detail here:

```C
init_hypervisor_platform();
x86_init.resources.probe_roms();
insert_resource(&iomem_resource, &code_resource);
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);
early_gart_iommu_check();
```

SMP config
--------------------------------------------------------------------------------

The next step is parsing of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration. We do it with the call of the `find_smp_config` function, which just calls:

```C
static inline void find_smp_config(void)
{
        x86_init.mpparse.find_smp_config();
}
```

`x86_init.mpparse.find_smp_config` is the `default_find_smp_config` function from [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). In `default_find_smp_config` we scan a couple of memory regions for the `SMP` config and return if it is found:

```C
if (smp_scan_config(0x0, 0x400) ||
    smp_scan_config(639 * 0x400, 0x400) ||
    smp_scan_config(0xF0000, 0x10000))
	return;
```

First of all the `smp_scan_config` function defines a couple of variables:

```C
unsigned int *bp = phys_to_virt(base);
struct mpf_intel *mpf;
```

The first is the virtual address of the memory region where we will scan for the `SMP` config; the second is a pointer to an `mpf_intel` structure. Let's try to understand what `mpf_intel` is. All the information is stored in the multiprocessor configuration data structure, and `mpf_intel` represents the `MP` floating pointer structure:

```C
struct mpf_intel {
	char signature[4];
	unsigned int physptr;
	unsigned char length;
	unsigned char specification;
	unsigned char checksum;
	unsigned char feature1;
	unsigned char feature2;
	unsigned char feature3;
	unsigned char feature4;
	unsigned char feature5;
};
```

As we can read in the documentation, one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. The operating system must have access to this information about the multiprocessor configuration, and `mpf_intel` stores the physical address (look at the second field, `physptr`) of the multiprocessor configuration table. So, `smp_scan_config` goes in a loop through the given memory range and tries to find the `MP floating pointer structure` there. In the loop it checks that the current bytes point to the `SMP` signature, checks the checksum, and checks that `mpf->specification` is `1` or `4` (it must be `1` or `4` by the specification):

```C
while (length > 0) {
	if ((*bp == SMP_MAGIC_IDENT) &&
	    (mpf->length == 1) &&
	    !mpf_checksum((unsigned char *)bp, 16) &&
	    ((mpf->specification == 1)
	     || (mpf->specification == 4))) {

		mem = virt_to_phys(mpf);
		memblock_reserve(mem, sizeof(*mpf));
		if (mpf->physptr)
			smp_reserve_memory(mpf);
	}
}
```

If the search is successful, it reserves the found memory block with `memblock_reserve` and also reserves the physical address of the multiprocessor configuration table. You can find documentation about all of this in the [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf). You can read more details in the special part about `SMP`.

Additional early memory initialization routines
--------------------------------------------------------------------------------

In the next step of `setup_arch` we can see the call of the `early_alloc_pgt_buf` function, which allocates the page table buffer for the early stage. The page table buffer will be placed in the `brk` area. Let's look at its implementation:

```C
void __init early_alloc_pgt_buf(void)
{
	unsigned long tables = INIT_PGT_BUF_SIZE;
	phys_addr_t base;

	base = __pa(extend_brk(tables, PAGE_SIZE));

	pgt_buf_start = base >> PAGE_SHIFT;
	pgt_buf_end = pgt_buf_start;
	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}
```

First of all it gets the size of the page table buffer; it is `INIT_PGT_BUF_SIZE`, which is `(6 * PAGE_SIZE)` in the current linux kernel 4.0. Once we have the size of the page table buffer, we call the `extend_brk` function with two parameters: size and align. As you can understand from its name, this function extends the `brk` area. As we can see in the linux kernel linker script, `brk` is in memory right after [BSS](http://en.wikipedia.org/wiki/.bss):

```C
. = ALIGN(PAGE_SIZE);
.brk : AT(ADDR(.brk) - LOAD_OFFSET) {
	__brk_base = .;
	. += 64 * 1024;		/* 64k alignment slop space */
	*(.brk_reservation)	/* areas brk users have reserved */
	__brk_limit = .;
}
```

Or we can find it with the `readelf` util:



After we get the physical address of the new `brk` with the `__pa` macro, we calculate the base address and the end of the page table buffer. In the next step, as we have the page table buffer, we reserve the memory block for the brk area with the `reserve_brk` function:

```C
static void __init reserve_brk(void)
{
	if (_brk_end > _brk_start)
		memblock_reserve(__pa_symbol(_brk_start),
				 _brk_end - _brk_start);

	_brk_start = 0;
}
```

Note that at the end of `reserve_brk` we set `_brk_start` to zero, because after this we will not allocate from it anymore. The next step after reserving the memory block for `brk` is to unmap out-of-range memory areas in the kernel mapping with the `cleanup_highmap` function. Remember that the kernel mapping is `__START_KERNEL_map` to `_end - _text`, and `level2_kernel_pgt` maps the kernel `_text`, `data` and `bss`. At the start of `cleanup_highmap` we define these parameters:

```C
unsigned long vaddr = __START_KERNEL_map;
unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;
```

Now, as we have defined the start and end of the kernel mapping, we go in a loop through all the kernel page middle directory entries and clear the entries which are not between `_text` and `end`:

```C
for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
	if (pmd_none(*pmd))
		continue;
	if (vaddr < (unsigned long) _text || vaddr > end)
		set_pmd(pmd, __pmd(0));
}
```

After this we set the limit for `memblock` allocation with the `memblock_set_current_limit` function (you can read more about `memblock` in [Linux kernel memory management Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md)); it will be `ISA_END_ADDRESS` or `0x100000`. Then we fill the `memblock` information according to `e820` with the call of the `memblock_x86_fill` function. You can see the result of this function at kernel initialization time:

```
MEMBLOCK configuration:
 memory size = 0x1fff7ec00 reserved size = 0x1e30000
 memory.cnt  = 0x3
 memory[0x0]	[0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0
 memory[0x1]	[0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0
 memory[0x2]	[0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]	[0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0
 reserved[0x1]	[0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0
 reserved[0x2]	[0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0
```

The rest of the functions after `memblock_x86_fill` are: `early_reserve_e820_mpc_new`, which allocates additional slots in the `e820map` for the MultiProcessor Specification table; `reserve_real_mode`, which reserves low memory from `0x0` to 1 megabyte for the trampoline to real mode (for rebooting, etc.); `trim_platform_memory_ranges`, which trims certain memory regions starting at `0x20050000`, `0x20110000`, etc. (these regions must be excluded because [Sandy Bridge](http://en.wikipedia.org/wiki/Sandy_Bridge) has problems with them); `trim_low_memory_range`, which reserves the first 4-kilobyte page in `memblock`; `init_mem_mapping`, which reconstructs the direct memory mapping and sets up the direct mapping of physical memory at `PAGE_OFFSET`; `early_trap_pf_init`, which sets up the `#PF` handler (we will look at it in the chapter about interrupts); and the `setup_real_mode` function, which sets up the trampoline to the [real mode](http://en.wikipedia.org/wiki/Real_mode) code.

That's all. Note that this part does not cover all of the functions in `setup_arch` (like `early_gart_iommu_check`, [mtrr](http://en.wikipedia.org/wiki/Memory_type_range_register) initialization, etc.). As I have already written many times, `setup_arch` is big, and the linux kernel is big. That's why I can't cover every line in the linux kernel. I don't think that we missed something important, but you can say something like: each line of code is important. Yes, it's true, but I skipped them anyway, because I think that it is not realistic to cover the full linux kernel. Anyway, we will often return to ideas that we have already seen, and if something is unfamiliar, we will cover this theme.

Conclusion
--------------------------------------------------------------------------------

This is the end of the sixth part about the linux kernel initialization process. In this part we continued to dive into the `setup_arch` function, and it was a long part, but we are not finished with it. Yes, `setup_arch` is big; I hope that the next part will be the last part about this function.

If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification)
* [NX bit](http://en.wikipedia.org/wiki/NX_bit)
* [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)
* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
* [PCI](http://en.wikipedia.org/wiki/Conventional_PCI)
* [e820](http://en.wikipedia.org/wiki/E820)
* [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS)
* [EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)
* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)
* [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf)
* [BSS](http://en.wikipedia.org/wiki/.bss)
* [SMBIOS specification](http://www.dmtf.org/sites/default/files/standards/documents/DSP0134v2.5Final.pdf)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html)

Kernel initialization. Part 7.
================================================================================

The End of the architecture-specific initialization, almost...
================================================================================

This is the seventh part of the Linux Kernel initialization process which covers the insides of the `setup_arch` function from [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L861). As you know from the previous [parts](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html), the `setup_arch` function does architecture-specific (in our case [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for the kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) devices and many many more. If you have read the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html), you can remember that we finished it at the `setup_real_mode` function. In the next step, after we set the limit of [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html) to all the mapped pages, we can see the call of the `setup_log_buf` function from [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/master/kernel/printk/printk.c).

The `setup_log_buf` function sets up the kernel cyclic buffer, whose length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of `CONFIG_LOG_BUF_SHIFT`, it can be between `12` and `21`. Internally, the buffer is defined as an array of chars:

```C
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
static char *log_buf = __log_buf;
```
|
||||
|
||||
Now let's look on the implementation of the `setup_log_buf` function. It starts with check that current buffer is empty (It must be empty, because we just setup it) and another check that it is early setup. If setup of the kernel log buffer is not early, we call the `log_buf_add_cpu` function which increase size of the buffer for every CPU:
|
||||
|
||||
```C
|
||||
if (log_buf != __log_buf)
|
||||
return;
|
||||
|
||||
if (!early && !new_log_buf_len)
|
||||
log_buf_add_cpu();
|
||||
```
|
||||
|
||||
We will not research `log_buf_add_cpu` function, because as you can see in the `setup_arch`, we call `setup_log_buf` as:
|
||||
|
||||
```C
|
||||
setup_log_buf(1);
|
||||
```
|
||||
|
||||
where `1` means that it is early setup. In the next step we check `new_log_buf_len` variable which is updated length of the kernel log buffer and allocate new space for the buffer with the `memblock_virt_alloc` function for it, or just return.
|
||||
|
||||
As the kernel log buffer is ready, the next function is `reserve_initrd`. You may remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). Now, as we have reconstructed the direct memory mapping in the `init_mem_mapping` function, we need to move the [initrd](http://en.wikipedia.org/wiki/Initrd) into directly mapped memory. The `reserve_initrd` function starts with the definition of the base and end addresses of the `initrd` and a check that the `initrd` was provided by a bootloader - all the same as what we saw in `early_reserve_initrd`. But instead of reserving a place in the `memblock` area with a call to the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater than this area with:

```C
mapped_size = memblock_mem_size(max_pfn_mapped);
if (ramdisk_size >= (mapped_size>>1))
	panic("initrd too large to handle, "
	      "disabling initrd (%lld needed, %lld available)\n",
	      ramdisk_size, mapped_size>>1);
```

You can see here that we call the `memblock_mem_size` function and pass `max_pfn_mapped` to it, where `max_pfn_mapped` contains the highest directly mapped page frame number. If you do not remember what a `page frame number` is, the explanation is simple: the first `12` bits of a virtual address represent the offset in the physical page or page frame. If we right-shift a virtual address by `12` bits, we discard the offset part and get the `Page Frame Number`. In `memblock_mem_size` we go through all memblock `mem` (not reserved) regions, calculate the size of the mapped pages and return it in the `mapped_size` variable (see code above). As we have got the amount of directly mapped memory, we check that the size of the `initrd` is not greater than the mapped pages. If it is greater we just call `panic`, which halts the system and prints the famous [Kernel panic](http://en.wikipedia.org/wiki/Kernel_panic) message. In the next step we print information about the `initrd` size. We can see the result of this in the `dmesg` output:

```
[0.000000] RAMDISK: [mem 0x36d20000-0x37687fff]
```

and relocate the `initrd` to the direct mapping area with the `relocate_initrd` function. At the start of the `relocate_initrd` function we try to find a free area with the `memblock_find_in_range` function:

```C
relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), area_size, PAGE_SIZE);

if (!relocated_ramdisk)
	panic("Cannot find place for new RAMDISK of size %lld\n",
	      ramdisk_size);
```

The `memblock_find_in_range` function tries to find a free area in a given range, in our case from `0` to the maximum mapped physical address, and the size must be equal to the aligned size of the `initrd`. If we did not find an area of the given size, we call `panic` again. If all is good, we relocate the RAM disk image to the bottom of the directly mapped memory in the next step.

At the end of the `reserve_initrd` function, we free the memblock memory which was occupied by the ramdisk with a call to:

```C
memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
```

After we have relocated the `initrd` ramdisk image, the next function is `vsmp_init` from [arch/x86/kernel/vsmp_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsmp_64.c). This function initializes support for `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter does not cover initialization parts that are not generic to `x86_64` (for example this one, or `ACPI`, etc.). So we will skip its implementation for now and will return to it in the part which covers techniques of parallel computing.

The next function is `io_delay_init` from [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/io_delay.c). This function allows overriding the default I/O delay port `0x80`. We already saw I/O delay in [Last preparation before transition into protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html); now let's look at the `io_delay_init` implementation:

```C
void __init io_delay_init(void)
{
	if (!io_delay_override)
		dmi_check_system(io_delay_0xed_port_dmi_table);
}
```

This function checks the `io_delay_override` variable and overrides the I/O delay port if `io_delay_override` is set. We can set the `io_delay_override` variable by passing the `io_delay` option to the kernel command line. As we can read in [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt), the `io_delay` option is:

```
io_delay=	[X86] I/O delay method
	0x80
		Standard port 0x80 based delay
	0xed
		Alternate port 0xed based delay (needed on some systems)
	udelay
		Simple two microseconds delay
	none
		No delay
```

We can see the `io_delay` command line parameter setup with the `early_param` macro in [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/io_delay.c):

```C
early_param("io_delay", io_delay_param);
```

More about `early_param` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). So the `io_delay_param` function, which sets up the `io_delay_override` variable, will be called in the [do_early_param](https://github.com/torvalds/linux/blob/master/init/main.c#L413) function. The `io_delay_param` function gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depending on it:

```C
static int __init io_delay_param(char *s)
{
	if (!s)
		return -EINVAL;

	if (!strcmp(s, "0x80"))
		io_delay_type = CONFIG_IO_DELAY_TYPE_0X80;
	else if (!strcmp(s, "0xed"))
		io_delay_type = CONFIG_IO_DELAY_TYPE_0XED;
	else if (!strcmp(s, "udelay"))
		io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY;
	else if (!strcmp(s, "none"))
		io_delay_type = CONFIG_IO_DELAY_TYPE_NONE;
	else
		return -EINVAL;

	io_delay_override = 1;
	return 0;
}
```

The next functions after `io_delay_init` are `acpi_boot_table_init`, `early_acpi_boot_init` and `initmem_init`, but as I wrote above, we will not cover [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)-related stuff in this `Linux Kernel initialization process` chapter.

Allocate area for DMA
--------------------------------------------------------------------------------

In the next step we need to allocate an area for [Direct memory access](http://en.wikipedia.org/wiki/Direct_memory_access) with the `dma_contiguous_reserve` function which is defined in [drivers/base/dma-contiguous.c](https://github.com/torvalds/linux/blob/master/drivers/base/dma-contiguous.c). `DMA` is a special mode in which devices communicate with memory without the CPU. Note that we pass one parameter - `max_pfn_mapped << PAGE_SHIFT` - to the `dma_contiguous_reserve` function, and as you can understand from this expression, this is the limit of the reserved memory. Let's look at the implementation of this function. It starts with the definition of the following variables:

```C
phys_addr_t selected_size = 0;
phys_addr_t selected_base = 0;
phys_addr_t selected_limit = limit;
bool fixed = false;
```

where the first represents the size in bytes of the reserved area, the second is the base address of the reserved area, the third is the end address of the reserved area and the last `fixed` parameter shows whether the reserved area must be placed at the exact given base address. If `fixed` is `1` we just reserve the area with `memblock_reserve`, if it is `0` we allocate space with `kmemleak_alloc`. In the next step we check the `size_cmdline` variable, and if it is not equal to `-1` we fill all the variables which you can see above with the values from the `cma` kernel command line parameter:

```C
if (size_cmdline != -1) {
	...
	...
	...
}
```

You can find the definition of this early parameter in the same source code file:

```C
early_param("cma", early_cma);
```

where `cma` is:

```
cma=nn[MG]@[start[MG][-end[MG]]]
		[ARM,X86,KNL]
		Sets the size of kernel global memory area for
		contiguous memory allocations and optionally the
		placement constraint by the physical address range of
		memory allocations. A value of 0 disables CMA
		altogether. For more information, see
		include/linux/dma-contiguous.h
```

If we do not pass the `cma` option to the kernel command line, `size_cmdline` will be equal to `-1`. In this case we need to calculate the size of the reserved area, which depends on the following kernel configuration options:

* `CONFIG_CMA_SIZE_SEL_MBYTES` - size in megabytes, the default global `CMA` area, which is equal to `CMA_SIZE_MBYTES * SZ_1M` or `CONFIG_CMA_SIZE_MBYTES * 1M`;
* `CONFIG_CMA_SIZE_SEL_PERCENTAGE` - percentage of total memory;
* `CONFIG_CMA_SIZE_SEL_MIN` - use the lower value;
* `CONFIG_CMA_SIZE_SEL_MAX` - use the higher value.

As we have calculated the size of the reserved area, we reserve the area with a call to the `dma_contiguous_reserve_area` function which first of all calls the:

```C
ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma);
```

function. The `cma_declare_contiguous` function reserves a contiguous area from the given base address with the given size. After we have reserved the area for `DMA`, the next function is `memblock_find_dma_reserve`. As you can understand from its name, this function counts the reserved pages in the `DMA` area. This part does not cover all the details of `CMA` and `DMA`, because they are big topics. We will see many more details in the special part of the Linux kernel memory management chapter which covers contiguous memory allocators and areas.

Initialization of the sparse memory
--------------------------------------------------------------------------------

The next step is the call of the function - `x86_init.paging.pagetable_init`. If you try to find this function in the Linux kernel source code, at the end of your search you will see the following macro:

```C
#define native_pagetable_init        paging_init
```

which expands, as you can see, to the call of the `paging_init` function from [arch/x86/mm/init_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init_64.c). The `paging_init` function initializes sparse memory and zone sizes. First of all, what are zones and what is `Sparsemem`? `Sparsemem` is a special foundation in the Linux kernel memory manager which is used to split a memory area into different memory banks on [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) systems. Let's look at the implementation of the `paging_init` function:

```C
void __init paging_init(void)
{
	sparse_memory_present_with_active_regions(MAX_NUMNODES);
	sparse_init();

	node_clear_state(0, N_MEMORY);
	if (N_MEMORY != N_NORMAL_MEMORY)
		node_clear_state(0, N_NORMAL_MEMORY);

	zone_sizes_init();
}
```

As you can see, there is a call of the `sparse_memory_present_with_active_regions` function, which records a memory area for every `NUMA` node in the array of `mem_section` structures, each of which contains a pointer to an array of `struct page`. The next function, `sparse_init`, allocates the non-linear `mem_section` and `mem_map`. In the next step we clear the state of the movable memory nodes and initialize the sizes of the zones. Every `NUMA` node is divided into a number of pieces which are called `zones`. So the `zone_sizes_init` function from [arch/x86/mm/init.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init.c) initializes the sizes of the zones.

Again, this part and the next parts do not cover this theme in full detail. There will be a special part about `NUMA`.

vsyscall mapping
--------------------------------------------------------------------------------

The next step after the `SparseMem` initialization is the setting of `trampoline_cr4_features`, which must contain the content of the `cr4` [Control register](http://en.wikipedia.org/wiki/Control_register). First of all we need to check that the current CPU has support for the `cr4` register, and if it has, we save its content to `trampoline_cr4_features`, which is the storage for `cr4` in real mode:

```C
if (boot_cpu_data.cpuid_level >= 0) {
	mmu_cr4_features = __read_cr4();
	if (trampoline_cr4_features)
		*trampoline_cr4_features = mmu_cr4_features;
}
```

The next function which you can see is `map_vsyscall` from [arch/x86/kernel/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_64.c). This function maps the memory space for [vsyscalls](https://lwn.net/Articles/446528/) and depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option. Actually a `vsyscall` is a special segment which provides fast access to certain system calls like `getcpu`, etc. Let's look at the implementation of this function:

```C
void __init map_vsyscall(void)
{
	extern char __vsyscall_page;
	unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);

	if (vsyscall_mode != NONE)
		__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
			     vsyscall_mode == NATIVE
			     ? PAGE_KERNEL_VSYSCALL
			     : PAGE_KERNEL_VVAR);

	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
		     (unsigned long)VSYSCALL_ADDR);
}
```

At the beginning of `map_vsyscall` we can see the definition of two variables. The first is the extern variable `__vsyscall_page`. As an extern variable, it is defined somewhere in another source code file. Actually we can see the definition of `__vsyscall_page` in [arch/x86/kernel/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_emu_64.S). The `__vsyscall_page` symbol points to the aligned calls of the `vsyscalls` such as `gettimeofday`, etc.:

```assembly
	.globl __vsyscall_page
	.balign PAGE_SIZE, 0xcc
	.type __vsyscall_page, @object
__vsyscall_page:

	mov $__NR_gettimeofday, %rax
	syscall
	ret

	.balign 1024, 0xcc
	mov $__NR_time, %rax
	syscall
	ret
	...
	...
	...
```

The second variable is `physaddr_vsyscall`, which just stores the physical address of the `__vsyscall_page` symbol. In the next step we check that the `vsyscall_mode` variable is not equal to `NONE`; it is `EMULATE` by default:

```C
static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
```

And after this check we can see the call of the `__set_fixmap` function which calls `native_set_fixmap` with the same parameters:

```C
void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
{
	__native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags));
}

void __native_set_fixmap(enum fixed_addresses idx, pte_t pte)
{
	unsigned long address = __fix_to_virt(idx);

	if (idx >= __end_of_fixed_addresses) {
		BUG();
		return;
	}
	set_pte_vaddr(address, pte);
	fixmaps_set++;
}
```

Here we can see that `native_set_fixmap` makes a `Page Table Entry` value from the given physical address (the physical address of the `__vsyscall_page` symbol in our case) and calls the internal function - `__native_set_fixmap`. The internal function gets the virtual address of the given `fixed_addresses` index (`VSYSCALL_PAGE` in our case) and checks that the given index is not greater than the end of the fix-mapped addresses. After this we set the page table entry with a call to the `set_pte_vaddr` function and increase the count of the fix-mapped addresses. And at the end of `map_vsyscall` we check with the `BUILD_BUG_ON` macro that the virtual address of `VSYSCALL_PAGE` (which is the first index in `fixed_addresses`) is equal to `VSYSCALL_ADDR`, which is `-10UL << 20` or `ffffffffff600000`:

```C
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
	     (unsigned long)VSYSCALL_ADDR);
```

Now the `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`; if you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). We will see more about `vsyscalls` in the `vsyscalls and vdso` part.

Getting the SMP configuration
--------------------------------------------------------------------------------

You may remember how we searched for the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). Now we need to get the `SMP` configuration if we found it. For this we check the `smp_found_config` variable which we set in the `smp_scan_config` function (read about it in the previous part) and call the `get_smp_config` function:

```C
if (smp_found_config)
	get_smp_config();
```

The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which is defined in [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). This function defines a pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html)) and does some checks:

```C
struct mpf_intel *mpf = mpf_found;

if (!mpf)
	return;
 
if (acpi_lapic && early)
	return;
```

Here we check that the multiprocessor configuration was found in the `smp_scan_config` function, or just return from the function if not. The next check is `acpi_lapic` and `early`. As we have done these checks, we start to read the `SMP` configuration. When we have finished reading it, the next step is the `prefill_possible_map` function which makes a preliminary filling of the possible CPUs' `cpumask` (more about it you can read in the [Introduction to the cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)).

The rest of the setup_arch
--------------------------------------------------------------------------------

Here we are getting to the end of the `setup_arch` function. The rest of the function is of course important, but the details of this stuff will not be included in this part. We will just take a short look at these functions, because although they are important as I wrote above, they cover non-generic kernel features related to `NUMA`, `SMP`, `ACPI`, `APICs`, etc. First of all, there is the call of the `init_apic_mappings` function. As we can understand, this function sets the address of the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The next is `x86_io_apic_ops.init`, which initializes the I/O APIC. Please note that we will see all the details related to `APIC` in the chapter about interrupts and exception handling. In the next step we reserve standard I/O resources like `DMA`, `TIMER`, `FPU`, etc., with the call of the `x86_init.resources.reserve_resources` function. The following `mcheck_init` function initializes the `Machine check Exception` and the last, `register_refined_jiffies`, registers the [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29) (there will be a separate chapter about timers in the kernel).

So that's all. Finally we have finished with the big `setup_arch` function in this part. Of course, as I already wrote many times, we have not seen the full details of this function, but do not worry about it. We will come back to this function more than once from different chapters to understand how different platform-dependent parts are initialized.

That's all, and now we can return to `start_kernel` from `setup_arch`.

Back to the main.c
================================================================================

As I wrote above, we have finished with the `setup_arch` function and now we can return to the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As you may remember or have seen yourself, the `start_kernel` function is as big as `setup_arch`. So the couple of next parts will be dedicated to learning this function. So, let's continue with it. After `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) pointer to the memory descriptor `cpumask`. We can look at its implementation:

```C
static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
#endif
	cpumask_clear(mm->cpu_vm_mask_var);
}
```

As you can see in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we pass the memory descriptor of the init process to `mm_init_cpumask` and, depending on the `CONFIG_CPUMASK_OFFSTACK` configuration option, we clear the [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer) switch `cpumask`.

In the next step we can see the call of the following function:

```C
setup_command_line(command_line);
```

This function takes a pointer to the kernel command line and allocates a couple of buffers to store the command line. We need a couple of buffers, because one buffer is used for future reference and access to the command line and one for parameter parsing. We will allocate space for the following buffers:

* `saved_command_line` - will contain the boot command line;
* `initcall_command_line` - will contain the boot command line and will be used in `do_initcall_level`;
* `static_command_line` - will contain the command line for parameter parsing.

We will allocate space with the `memblock_virt_alloc` function. This function calls `memblock_virt_alloc_try_nid` which allocates a boot memory block with `memblock_reserve` if [slab](http://en.wikipedia.org/wiki/Slab_allocation) is not available, or uses `kzalloc_node` (more about it will be in the Linux memory management chapter). The `memblock_virt_alloc` uses `BOOTMEM_LOW_LIMIT` (the physical address of the `(PAGE_OFFSET + 0x1000000)` value) and `BOOTMEM_ALLOC_ACCESSIBLE` (equal to the current value of `memblock.current_limit`) as the minimum and maximum addresses of the memory region.

Let's look at the implementation of `setup_command_line`:

```C
static void __init setup_command_line(char *command_line)
{
	saved_command_line =
		memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
	initcall_command_line =
		memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
	static_command_line = memblock_virt_alloc(strlen(command_line) + 1, 0);
	strcpy(saved_command_line, boot_command_line);
	strcpy(static_command_line, command_line);
}
```

Here we can see that we allocate space for three buffers which will contain the kernel command line for different purposes (read above). As we have allocated the space, we store `boot_command_line` in `saved_command_line` and `command_line` (the kernel command line from `setup_arch`) in `static_command_line`.

The next function after `setup_command_line` is `setup_nr_cpu_ids`. This function sets `nr_cpu_ids` (the number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter which describes the [cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) concept). Let's look at its implementation:

```C
void __init setup_nr_cpu_ids(void)
{
	nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1;
}
```

Here `nr_cpu_ids` represents the number of CPUs and `NR_CPUS` represents the maximum number of CPUs which we can set at configuration time.

Actually we need to call this function, because `NR_CPUS` can be greater than the actual number of CPUs in your computer. Here we can see that we call the `find_last_bit` function and pass two parameters to it:

* `cpu_possible_mask` bits;
* maximum number of CPUs.

In `setup_arch` we can find the call of the `prefill_possible_map` function which calculates and writes the actual number of CPUs to `cpu_possible_mask`. We call the `find_last_bit` function which takes an address and the maximum size to search in, and returns the bit number of the last set bit. We passed the `cpu_possible_mask` bits and the maximum number of CPUs. First of all the `find_last_bit` function splits the given `unsigned long` address into [words](http://en.wikipedia.org/wiki/Word_%28computer_architecture%29):

```C
words = size / BITS_PER_LONG;
```

where `BITS_PER_LONG` is `64` on `x86_64`. As we have got the number of words in the given search size, we need to check whether the given size contains a partial word with the following check:

```C
if (size & (BITS_PER_LONG-1)) {
	tmp = (addr[words] & (~0UL >> (BITS_PER_LONG
				 - (size & (BITS_PER_LONG-1)))));
	if (tmp)
		goto found;
}
```

If the size contains a partial word, we mask the last word and check it. If the last word is not zero, it means that the current word contains at least one set bit and we go to the `found` label:

```C
found:
	return words * BITS_PER_LONG + __fls(tmp);
```

Here you can see the `__fls` function which returns the last set bit in a given word with the help of the `bsr` instruction:

```C
static inline unsigned long __fls(unsigned long word)
{
	asm("bsr %1,%0"
	    : "=r" (word)
	    : "rm" (word));
	return word;
}
```

The `bsr` instruction scans the given operand for its most significant set bit. If the last word is not partial, we go through all the words at the given address and try to find the last set bit:

```C
while (words) {
	tmp = addr[--words];
	if (tmp) {
found:
		return words * BITS_PER_LONG + __fls(tmp);
	}
}
```

Here we put each word into the `tmp` variable, starting from the last one, and check whether `tmp` contains at least one set bit. If a set bit is found, we return the number of this bit. If none of the words contains a set bit, we just return the given size:

```C
return size;
```

After this `nr_cpu_ids` will contain the correct number of available CPUs.

That's all.

Conclusion
================================================================================

This is the end of the seventh part about the Linux kernel initialization process. In this part we have finally finished with the `setup_arch` function and returned to the `start_kernel` function. In the next part we will continue to learn the generic kernel code from `start_kernel` and will continue our way to the first `init` process.

If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
================================================================================

* [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface)
* [x86_64](http://en.wikipedia.org/wiki/X86-64)
* [initrd](http://en.wikipedia.org/wiki/Initrd)
* [Kernel panic](http://en.wikipedia.org/wiki/Kernel_panic)
* [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)
* [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)
* [Direct memory access](http://en.wikipedia.org/wiki/Direct_memory_access)
* [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
* [Control register](http://en.wikipedia.org/wiki/Control_register)
* [vsyscalls](https://lwn.net/Articles/446528/)
* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)
* [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html)

Initialization/linux-initialization-8.md

Kernel initialization. Part 8.
================================================================================

Scheduler initialization
================================================================================

This is the eighth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process chapter, and we stopped at the `setup_nr_cpu_ids` function in the [previous](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) part. The main point of the current part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we start to learn the initialization process of the scheduler, we need to do some stuff. The next step in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `setup_per_cpu_areas` function. This function sets up areas for the `percpu` variables; more about them you can read in the special part about [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html). After the `percpu` areas are up and running, the next step is the `smp_prepare_boot_cpu` function. This function does some preparations for [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing):

```C
static inline void smp_prepare_boot_cpu(void)
{
	smp_ops.smp_prepare_boot_cpu();
}
```

where `smp_prepare_boot_cpu` expands to the call of the `native_smp_prepare_boot_cpu` function (more about `smp_ops` will be in the special parts about `SMP`):

```C
void __init native_smp_prepare_boot_cpu(void)
{
	int me = smp_processor_id();
	switch_to_new_gdt(me);
	cpumask_set_cpu(me, cpu_callout_mask);
	per_cpu(cpu_state, me) = CPU_ONLINE;
}
```

The `native_smp_prepare_boot_cpu` function gets the id of the current CPU (which is the bootstrap processor and whose `id` is zero) with the `smp_processor_id` function. I will not explain how `smp_processor_id` works, because we already saw it in the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. As we have got the processor `id` number, we reload the [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function:

```C
void switch_to_new_gdt(int cpu)
{
	struct desc_ptr gdt_descr;

	gdt_descr.address = (long)get_cpu_gdt_table(cpu);
	gdt_descr.size = GDT_SIZE - 1;
	load_gdt(&gdt_descr);
	load_percpu_segment(cpu);
}
```

The `gdt_descr` variable represents pointer to the `GDT` descriptor here (we already saw `desc_ptr` in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). We get the address and the size of the `GDT` descriptor where `GDT_SIZE` is `256` or:
|
||||
|
||||
```C
|
||||
#define GDT_SIZE (GDT_ENTRIES * 8)
|
||||
```
|
||||
|
||||
and the address of the descriptor we will get with the `get_cpu_gdt_table`:
|
||||
|
||||
```C
|
||||
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
|
||||
{
|
||||
return per_cpu(gdt_page, cpu).gdt;
|
||||
}
|
||||
```
|
||||
|
||||
The `get_cpu_gdt_table` uses `per_cpu` macro for getting `gdt_page` percpu variable for the given CPU number (bootstrap processor with `id` - 0 in our case). You may ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we already saw it in this book. If you have read the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/head_64.S):
|
||||
|
||||
```assembly
|
||||
early_gdt_descr:
|
||||
.word GDT_ENTRIES*8-1
|
||||
early_gdt_descr_base:
|
||||
.quad INIT_PER_CPU_VAR(gdt_page)
|
||||
```
|
||||
|
||||
and if we will look on the [linker](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) file we can see that it locates after the `__per_cpu_load` symbol:
|
||||
|
||||
```C
|
||||
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
|
||||
INIT_PER_CPU(gdt_page);
|
||||
```
|
||||
|
||||
and filled `gdt_page` in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c#L94):
|
||||
|
||||
```C
|
||||
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
|
||||
#ifdef CONFIG_X86_64
|
||||
[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
|
||||
[GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
|
||||
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
|
||||
[GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
|
||||
[GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
|
||||
[GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
more about `percpu` variables you can read in the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) part. As we got address and size of the `GDT` descriptor we reload `GDT` with the `load_gdt` which just execute `lgdt` instruct and load `percpu_segment` with the following function:
|
||||
|
||||
```C
|
||||
void load_percpu_segment(int cpu) {
|
||||
loadsegment(gs, 0);
|
||||
wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
|
||||
load_stack_canary_segment();
|
||||
}
|
||||
```
|
||||
|
||||
The base address of the `percpu` area must contain `gs` register (or `fs` register for `x86`), so we are using `loadsegment` macro and pass `gs`. In the next step we writes the base address if the [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) stack and setup stack [canary](http://en.wikipedia.org/wiki/Buffer_overflow_protection) (this is only for `x86_32`). After we load new `GDT`, we fill `cpu_callout_mask` bitmap with the current cpu and set cpu state as online with the setting `cpu_state` percpu variable for the current processor - `CPU_ONLINE`:
|
||||
|
||||
```C
|
||||
cpumask_set_cpu(me, cpu_callout_mask);
|
||||
per_cpu(cpu_state, me) = CPU_ONLINE;
|
||||
```
|
||||
|
||||
So, what is `cpu_callout_mask` bitmap... As we initialized bootstrap processor (processor which is booted the first on `x86`) the other processors in a multiprocessor system are known as `secondary processors`. Linux kernel uses following two bitmasks:
|
||||
|
||||
* `cpu_callout_mask`
|
||||
* `cpu_callin_mask`
|
||||
|
||||
After bootstrap processor initialized, it updates the `cpu_callout_mask` to indicate which secondary processor can be initialized next. All other or secondary processors can do some initialization stuff before and check the `cpu_callout_mask` on the boostrap processor bit. Only after the bootstrap processor filled the `cpu_callout_mask` with this secondary processor, it will continue the rest of its initialization. After that the certain processor finish its initialization process, the processor sets bit in the `cpu_callin_mask`. Once the bootstrap processor finds the bit in the `cpu_callin_mask` for the current secondary processor, this processor repeats the same procedure for initialization of one of the remaining secondary processors. In a short words it works as i described, but we will see more details in the chapter about `SMP`.

That's all; we have done all the `SMP` boot preparations.

Build zonelists
-----------------------------------------------------------------------

In the next step we can see the call of the `build_all_zonelists` function. This function sets up the order of the zones that allocations are preferred from. What zones are and what this order means we will understand soon. For a start, let's see how the Linux kernel views physical memory. Physical memory is split into banks which are called `nodes`. If your hardware has no support for `NUMA`, you will see only one node:

```
$ cat /sys/devices/system/node/node0/numastat
numa_hit 72452442
numa_miss 0
numa_foreign 0
interleave_hit 12925
local_node 72452442
other_node 0
```

Every `node` is represented by `struct pglist_data` in the Linux kernel. Each node is divided into a number of special blocks which are called `zones`. Every zone is represented by `struct zone` in the Linux kernel and has one of the following types:

* `ZONE_DMA` - 0-16M;
* `ZONE_DMA32` - used for 32-bit devices that can only do DMA to areas below 4G;
* `ZONE_NORMAL` - all RAM from 4GB upwards on `x86_64`;
* `ZONE_HIGHMEM` - absent on `x86_64`;
* `ZONE_MOVABLE` - zone which contains movable pages.

These types are represented by the `zone_type` enum. We can get information about the zones with:

```
$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3975
        min      3
        low      3
...
...
Node 0, zone    DMA32
  pages free     694163
        min      875
        low      1093
...
...
Node 0, zone   Normal
  pages free     2529995
        min      3146
        low      3932
...
...
```

As I wrote above, every node is described by the `pglist_data` (or `pg_data_t`) structure in memory. This structure is defined in [include/linux/mmzone.h](https://github.com/torvalds/linux/blob/master/include/linux/mmzone.h). The `build_all_zonelists` function from [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c) constructs an ordered `zonelist` (of the different zones `DMA`, `DMA32`, `NORMAL`, `HIGH_MEMORY`, `MOVABLE`) which specifies the zones/nodes to visit when the selected `zone` or `node` cannot satisfy the allocation request. That's all. More about `NUMA` and multiprocessor systems will come in a special part.

The rest of the stuff before scheduler initialization
--------------------------------------------------------------------------------

Before we start to dive into the Linux kernel scheduler initialization process, we must do a couple of things. The first is the `page_alloc_init` function from [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c). This function looks pretty simple:

```C
void __init page_alloc_init(void)
{
        hotcpu_notifier(page_alloc_cpu_notify, 0);
}
```

It initializes a handler for `CPU` [hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). Of course, `hotcpu_notifier` depends on the `CONFIG_HOTPLUG_CPU` configuration option; if this option is set, it just calls the `cpu_notifier` macro, which expands to a call of `register_cpu_notifier`, which adds the hotplug cpu handler (`page_alloc_cpu_notify` in our case).

After this we can see the kernel command line in the initialization output:



And a couple of functions, such as `parse_early_param` and `parse_args`, which handle the Linux kernel command line. You may remember that we already saw the call of the `parse_early_param` function in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why do we call it again? The answer is simple: we called it from architecture-specific code (`x86_64` in our case), but not every architecture calls this function. We also need to call the second function, `parse_args`, to parse and handle the non-early command line arguments.

In the next step we can see the call of `jump_label_init` from [kernel/jump_label.c](https://github.com/torvalds/linux/blob/master/kernel/jump_label.c), which initializes [jump labels](https://lwn.net/Articles/412072/).

After this we can see the call of the `setup_log_buf` function, which sets up the [printk](http://www.makelinux.net/books/lkd2/ch18lev1sec3) log buffer. We already saw this function in the seventh [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the Linux kernel initialization process chapter.

PID hash initialization
--------------------------------------------------------------------------------

Next is the `pidhash_init` function. As you know, each process is assigned a unique number called the `process identification number` or `PID`. Each process created with fork or clone is automatically assigned a new unique `PID` value by the kernel. The management of `PIDs` is centered around two special data structures: `struct pid` and `struct upid`. The first structure represents information about a `PID` in the kernel. The second structure represents the information that is visible in a specific namespace. All `PID` instances are stored in a special hash table:

```C
static struct hlist_head *pid_hash;
```

This hash table is used to find the `pid` instance that belongs to a numeric `PID` value. So, `pidhash_init` initializes this hash table. At the start of the `pidhash_init` function we can see the call of `alloc_large_system_hash`:

```C
pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
                                   HASH_EARLY | HASH_SMALL,
                                   &pidhash_shift, NULL,
                                   0, 4096);
```

The number of elements of `pid_hash` depends on the `RAM` configuration, but it can be between `2^4` and `2^12`. The `pidhash_init` function computes the size and allocates the required storage (an `hlist` in our case - similar to a [doubly linked list](http://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html), but [struct hlist_head](https://github.com/torvalds/linux/blob/master/include/linux/types.h) contains a single pointer instead of two). The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass the `HASH_EARLY` flag (as in our case), or with `__vmalloc` if we did not pass this flag.

We can see the result in the `dmesg` output:

```
$ dmesg | grep hash
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
...
...
...
```

That's all. The rest of the stuff before scheduler initialization consists of the following functions: `vfs_caches_init_early` does early initialization of the [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) (more about it in the chapter which describes the virtual file system), `sort_main_extable` sorts the kernel's built-in exception table entries which are between `__start___ex_table` and `__stop___ex_table`, and `trap_init` initializes trap handlers (we will learn more about the last two functions in the separate chapter about interrupts).

The last step before the scheduler initialization is the initialization of the memory manager with the `mm_init` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As we can see, the `mm_init` function initializes different parts of the Linux kernel memory manager:

```C
page_ext_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
pgtable_init();
vmalloc_init();
```

The first is `page_ext_init_flatmem`, which depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended per-page data handling. `mem_init` releases all of `bootmem`, `kmem_cache_init` initializes the kernel cache, `percpu_init_late` replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), `pgtable_init` initializes the `page->ptl` kernel cache, and `vmalloc_init` initializes `vmalloc`. Please **NOTE** that we will not dive into the details of all of these functions and concepts here, but we will see all of them in the [Linux kernel memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter.

That's all. Now we can look at the `scheduler`.

Scheduler initialization
--------------------------------------------------------------------------------

And now we come to the main purpose of this part - the initialization of the task scheduler. I want to say again, as I already have many times, that you will not see the full explanation of the scheduler here; there will be a special chapter about it. Ok, the next point is the `sched_init` function from [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c), and as we can understand from the function's name, it initializes the scheduler. Let's dive into this function and try to understand how the scheduler is initialized. At the start of the `sched_init` function we can see the following code:

```C
#ifdef CONFIG_FAIR_GROUP_SCHED
        alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
        alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
```

First of all we can see two configuration options here:

* `CONFIG_FAIR_GROUP_SCHED`
* `CONFIG_RT_GROUP_SCHED`

Both of these options provide two different scheduling models. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt), the current scheduler - `CFS` or `Completely Fair Scheduler` - uses a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive `1/n` of the processor time, where `n` is the number of runnable processes. The scheduler uses a special set of rules. These rules determine when and how to select a new process to run, and they are called the `scheduling policy`. The Completely Fair Scheduler supports the following `normal` or `non-real-time` scheduling policies: `SCHED_NORMAL`, `SCHED_BATCH` and `SCHED_IDLE`. `SCHED_NORMAL` is used for most normal applications; the amount of cpu each process consumes is mostly determined by the [nice](http://en.wikipedia.org/wiki/Nice_%28Unix%29) value. `SCHED_BATCH` is used for 100% non-interactive tasks, and `SCHED_IDLE` runs a task only when the processor has no other task to run. The `real-time` policies `SCHED_FIFO` and `SCHED_RR` are also supported for time-critical applications. If you have read something about the Linux kernel scheduler, you may know that it is modular. This means that it supports different algorithms to schedule different types of processes. Usually this modularity is called `scheduler classes`. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core knowing too much about them.

Now let's get back to our code and look at the two configuration options `CONFIG_FAIR_GROUP_SCHED` and `CONFIG_RT_GROUP_SCHED`. The smallest unit the scheduler operates on is an individual task. These options allow groups of tasks to be scheduled together (you can read more about this in [CFS group scheduling](http://lwn.net/Articles/240474/)). We can see that `alloc_size`, which represents the size to allocate for the `sched_entity` and `cfs_rq` pointer arrays based on the number of processors, is set to the `2 * nr_cpu_ids * sizeof(void **)` expression, and the memory is then allocated with `kzalloc`:

```C
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
        root_task_group.se = (struct sched_entity **)ptr;
        ptr += nr_cpu_ids * sizeof(void **);

        root_task_group.cfs_rq = (struct cfs_rq **)ptr;
        ptr += nr_cpu_ids * sizeof(void **);
#endif
```

`sched_entity` is a structure which is defined in [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and is used by the scheduler to keep track of process accounting. `cfs_rq` represents a [run queue](http://en.wikipedia.org/wiki/Run_queue). So, you can see that we allocated space of size `alloc_size` for the run queues and scheduler entities of the `root_task_group`. The `root_task_group` is an instance of the `task_group` structure from [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) which contains task-group-related information:

```C
struct task_group {
        ...
        ...
        struct sched_entity **se;
        struct cfs_rq **cfs_rq;
        ...
        ...
}
```

The root task group is the task group to which every task in the system belongs. As we have allocated space for the root task group's scheduler entities and runqueues, we go over all possible CPUs (the `cpu_possible_mask` bitmap) and allocate zeroed memory from the relevant memory node with the `kzalloc_node` function for the `load_balance_mask` `percpu` variable:

```C
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
```

Here `cpumask_var_t` is the same as `cpumask_t` with one difference: `cpumask_var_t` is allocated with only `nr_cpu_ids` bits, while `cpumask_t` always has `NR_CPUS` bits (you can read more about `cpumask` in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part). As you can see:

```C
#ifdef CONFIG_CPUMASK_OFFSTACK
        for_each_possible_cpu(i) {
                per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
                        cpumask_size(), GFP_KERNEL, cpu_to_node(i));
        }
#endif
```

this code depends on the `CONFIG_CPUMASK_OFFSTACK` configuration option. This option says to use dynamic allocation for a `cpumask`, instead of putting it on the stack. All task groups have to be able to rely on a certain amount of CPU time. With the call of the two following functions:

```C
init_rt_bandwidth(&def_rt_bandwidth,
                  global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth,
                  global_rt_period(), global_rt_runtime());
```

we initialize bandwidth management for real-time and `SCHED_DEADLINE` tasks. These functions initialize the `rt_bandwidth` and `dl_bandwidth` structures which store information about the maximum `real-time` and `deadline` bandwidth of the system. For example, let's look at the implementation of the `init_rt_bandwidth` function:

```C
void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
{
        rt_b->rt_period = ns_to_ktime(period);
        rt_b->rt_runtime = runtime;

        raw_spin_lock_init(&rt_b->rt_runtime_lock);

        hrtimer_init(&rt_b->rt_period_timer,
                     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        rt_b->rt_period_timer.function = sched_rt_period_timer;
}
```

It takes three parameters:

* the address of the `rt_bandwidth` structure which contains information about the allocated and consumed quota within a period;
* `period` - the period over which real-time task bandwidth enforcement is measured, in `us`;
* `runtime` - the part of the period during which we allow tasks to run, in `us`.

As `period` and `runtime` we pass the results of the `global_rt_period` and `global_rt_runtime` functions, which are `1s` and `0.95s` by default. The `rt_bandwidth` structure is defined in [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) and looks like:

```C
struct rt_bandwidth {
        raw_spinlock_t          rt_runtime_lock;
        ktime_t                 rt_period;
        u64                     rt_runtime;
        struct hrtimer          rt_period_timer;
};
```

As you can see, it contains `runtime` and `period` and also the two following fields:

* `rt_runtime_lock` - a [spinlock](http://en.wikipedia.org/wiki/Spinlock) protecting `rt_runtime`;
* `rt_period_timer` - a [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt) used to unthrottle real-time tasks.

So, in `init_rt_bandwidth` we initialize the `rt_bandwidth` period and runtime with the given parameters and initialize the spinlock and the high-resolution timer. In the next step, depending on whether [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) is enabled, we initialize the root domain:

```C
#ifdef CONFIG_SMP
        init_defrootdomain();
#endif
```

The real-time scheduler requires global resources to make scheduling decisions, but unfortunately scalability bottlenecks appear as the number of CPUs increases. The concept of root domains was introduced to improve scalability. The Linux kernel provides a special mechanism for assigning a set of CPUs and memory nodes to a set of tasks, which is called a `cpuset`. If the CPUs of a `cpuset` do not overlap with those of any other `cpuset`, it is an `exclusive cpuset`. Each exclusive cpuset defines an isolated domain, or `root domain`, of CPUs partitioned from other cpusets or CPUs. A `root domain` is represented by `struct root_domain` from [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) in the Linux kernel, and its main purpose is to narrow the scope of global variables to per-domain variables: all real-time scheduling decisions are made only within the scope of a root domain. That's all about it for now, but we will see more details in the chapter about the real-time scheduler.

After the `root domain` initialization, we initialize the bandwidth for the real-time tasks of the root task group in the same way as we did above:

```C
#ifdef CONFIG_RT_GROUP_SCHED
        init_rt_bandwidth(&root_task_group.rt_bandwidth,
                          global_rt_period(), global_rt_runtime());
#endif
```

In the next step, depending on the `CONFIG_CGROUP_SCHED` kernel configuration option, we initialize the `siblings` and `children` lists of the root task group. As we can read from the documentation, `CONFIG_CGROUP_SCHED`:

```
This option allows you to create arbitrary task groups using the "cgroup" pseudo
filesystem and control the cpu bandwidth allocated to each such task group.
```

Having finished with the lists initialization, we can see the call of the `autogroup_init` function:

```C
#ifdef CONFIG_CGROUP_SCHED
        list_add(&root_task_group.list, &task_groups);
        INIT_LIST_HEAD(&root_task_group.children);
        INIT_LIST_HEAD(&root_task_group.siblings);
        autogroup_init(&init_task);
#endif
```

which initializes automatic process group scheduling.

After this we go through all `possible` CPUs (you may remember that `possible` CPUs - all the CPUs that can ever be available in the system - are stored in the `cpu_possible_mask` bitmap) and initialize a `runqueue` for each of them:

```C
for_each_possible_cpu(i) {
        struct rq *rq;
        ...
        ...
        ...
```

Each processor has its own lock and its own individual runqueue. All runnable tasks are stored in an active array and indexed according to their priority. When a process consumes its time slice, it is moved to an expired array. All of these arrays are stored in a special structure called the `runqueue`. As there is no global lock and no global runqueue, we go through all possible CPUs and initialize the runqueue for every one of them. The `runqueue` is represented by the `rq` structure in the Linux kernel, which is defined in [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h).

```C
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
init_cfs_rq(&rq->cfs);
init_rt_rq(&rq->rt);
init_dl_rq(&rq->dl);
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
```

Here we get the runqueue for every CPU with the `cpu_rq` macro, which returns the `runqueues` percpu variable, and start to initialize it: the runqueue lock, the number of running tasks, the `calc_load`-related fields (`calc_load_active` and `calc_load_update`) which are used in the calculation of the CPU load, and the completely fair, real-time and deadline related fields of the runqueue. After this we fill the `cpu_load` array with zeros and set the last load update tick to the `jiffies` variable, which holds the number of timer ticks since the system booted:

```C
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
        rq->cpu_load[j] = 0;

rq->last_load_update_tick = jiffies;
```

where `cpu_load` keeps the history of the runqueue load in the past; for now `CPU_LOAD_IDX_MAX` is 5. In the next step we fill the `runqueue` fields which are related to [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), but we will not cover them in this part. At the end of the loop we initialize the high-resolution timer for the given `runqueue` and set the `iowait` counter (more about it in the separate part about the scheduler):

```C
init_rq_hrtick(rq);
atomic_set(&rq->nr_iowait, 0);
```

Now we exit the `for_each_possible_cpu` loop, and next we need to set the load weight for the `init` task with the `set_load_weight` function. The weight of a process is calculated from its dynamic priority, which is the static priority adjusted by the scheduling class of the process. After this we increase the `mm_count` reference counter of the memory descriptor of the `init` process and set the scheduler class for the current process:

```C
atomic_inc(&init_mm.mm_count);
current->sched_class = &fair_sched_class;
```

Then we make the current process (which will be the first `init` process) `idle` and update the value of `calc_load_update` with a 5-second interval:

```C
init_idle(current, smp_processor_id());
calc_load_update = jiffies + LOAD_FREQ;
```

So, the `init` process will run when there are no other candidates (as it is the first process in the system). At the end we just set the `scheduler_running` variable:

```C
scheduler_running = 1;
```

That's all: the Linux kernel scheduler is initialized. Of course, we have skipped many details and explanations here, because we first need to know and understand how different concepts (like processes and process groups, runqueues, `rcu`, etc.) work in the Linux kernel, but we have taken a short look at the scheduler initialization process. We will look at all the other details in the separate part which will be fully dedicated to the scheduler.

Conclusion
--------------------------------------------------------------------------------

This is the end of the eighth part about the Linux kernel initialization process. In this part, we looked at the initialization process of the scheduler. In the next part we will continue to dive into the Linux kernel initialization process and will see the initialization of [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and many other things.

If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt)
* [spinlock](http://en.wikipedia.org/wiki/Spinlock)
* [Run queue](http://en.wikipedia.org/wiki/Run_queue)
* [Linux kernel memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
* [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29)
* [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system)
* [Linux kernel hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
* [CFS Scheduler documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt)
* [Real-Time group scheduling](https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html)

Kernel initialization. Part 9.
================================================================================

RCU initialization
================================================================================

This is the ninth part of the [Linux Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive into the linux kernel initialization process and the main purpose of this part will be to learn about the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) after `sched_init` is the call of `preempt_disable`. There are two macros:

* `preempt_disable`
* `preempt_enable`

for disabling and enabling preemption. First of all, let's try to understand what `preempt` means in the context of an operating system kernel. In simple words, preemption is the ability of the operating system kernel to suspend the current task in order to run a task with a higher priority. Here we need to disable preemption because we will have only one `init` process during early boot and we don't want it to be stopped before we call the `cpu_idle` function. The `preempt_disable` macro is defined in [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) and depends on the `CONFIG_PREEMPT_COUNT` kernel configuration option. If `CONFIG_PREEMPT_COUNT` is set, this macro is implemented as:

```C
#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)
```

and if `CONFIG_PREEMPT_COUNT` is not set, just:

```C
#define preempt_disable() barrier()
```

Let's look at these two implementations. The first difference is that the `preempt_disable` variant with `CONFIG_PREEMPT_COUNT` set contains a call of `preempt_count_inc`. There is a special `percpu` variable which stores the number of held locks and `preempt_disable` calls:

```C
DECLARE_PER_CPU(int, __preempt_count);
```

In the first implementation of `preempt_disable` we increment this `__preempt_count`. There is an API for returning the value of `__preempt_count` - the `preempt_count` function. When we call `preempt_disable`, first of all we increment the preemption counter with the `preempt_count_inc` macro which expands to:

```
#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_add(val) __preempt_count_add(val)
```

where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count` in our case; more about `percpu` variables you can read in the part about [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). Ok, we increased `__preempt_count`, and as the next step we can see the call of the `barrier` macro in both implementations. The `barrier` macro inserts an optimization barrier. On processors with the `x86_64` architecture, independent memory access operations can be performed in any order. That's why we need a way to tell the compiler and the processor to respect the order of operations. This mechanism is a memory barrier. Let's consider a simple example:

```C
preempt_disable();
foo();
preempt_enable();
```

The compiler can rearrange it as:

```C
preempt_disable();
preempt_enable();
foo();
```

In this case the non-preemptible function `foo` could be preempted. As we put the `barrier` macro in the `preempt_disable` and `preempt_enable` macros, it prevents the compiler from swapping `preempt_count_inc` with other statements. More about barriers you can read [here](http://en.wikipedia.org/wiki/Memory_barrier) and [here](https://www.kernel.org/doc/Documentation/memory-barriers.txt).

In the next step we can see the following statement:

```C
if (WARN(!irqs_disabled(),
	 "Interrupts were enabled *very* early, fixing it\n"))
	local_irq_disable();
```

which checks the [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) state and disables interrupts (with the `cli` instruction on `x86_64`) if they are enabled.

That's all. Preemption is disabled and we can go ahead.

Initialization of the integer ID management
--------------------------------------------------------------------------------

In the next step we can see the call of the `idr_init_cache` function which is defined in [lib/idr.c](https://github.com/torvalds/linux/blob/master/lib/idr.c). The `idr` library is used in various [places](http://lxr.free-electrons.com/ident?i=idr_find) in the linux kernel to manage assigning integer `IDs` to objects and looking objects up by id.

Let's look at the implementation of the `idr_init_cache` function:

```C
void __init idr_init_cache(void)
{
	idr_layer_cache = kmem_cache_create("idr_layer_cache",
				sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}
```

Here we can see the call of `kmem_cache_create`. We already called `kmem_cache_init` in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L485), which created the generalized caches (more about caches we will see in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). In our case `kmem_cache_create` creates a `kmem_cache` which will be used by the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator. As you can see, we pass five parameters to `kmem_cache_create`:

* name of the cache;
* size of the objects to store in the cache;
* offset of the first object in the page;
* flags;
* constructor for the objects.

and it will create a `kmem_cache` for the integer IDs. Integer `ID` management is a commonly used pattern to map a set of integer IDs to a set of pointers. We can see usage of integer IDs in the [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) drivers subsystem. For example [drivers/i2c/i2c-core.c](https://github.com/torvalds/linux/blob/master/drivers/i2c/i2c-core.c), which represents the core of the `i2c` subsystem, defines the `ID` for an `i2c` adapter with the `DEFINE_IDR` macro:

```C
static DEFINE_IDR(i2c_adapter_idr);
```

and then uses it for the registration of the `i2c` adapter:

```C
static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
{
	int	id;
	...
	id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
	...
}
```

and `i2c_adapter_idr` holds the dynamically allocated bus numbers.

More about integer ID management you can read [here](https://lwn.net/Articles/103209/).

RCU initialization
--------------------------------------------------------------------------------

The next step is the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) initialization with the `rcu_init` function, and its implementation depends on two kernel configuration options:

* `CONFIG_TINY_RCU`
* `CONFIG_TREE_RCU`

In the first case `rcu_init` will be in [kernel/rcu/tiny.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tiny.c) and in the second case it will be defined in [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c). We will see the implementation of the `tree rcu`, but first of all a few words about `RCU` in general.

`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. At an early stage the linux kernel provided support and an environment for concurrently running applications, but all execution was serialized in the kernel using a single global lock. Nowadays the linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) data structures and others. One of these mechanisms is the `read-copy update`. The `RCU` technique is designed for rarely-modified data structures. The idea of `RCU` is simple: if somebody wants to change such a data structure, we make a copy of it and make all changes in the copy. At the same time all other users of the data structure keep using the old version. Next, we need to choose a safe moment, when the original version has no users, to replace it with the modified copy.

Of course this description of `RCU` is very simplified. To understand some details of `RCU`, first of all we need to learn some terminology. Data readers in `RCU` execute inside a [critical section](http://en.wikipedia.org/wiki/Critical_section). Every time a reader enters the critical section, it calls `rcu_read_lock`, and it calls `rcu_read_unlock` on exit from the critical section. If a thread is not in the critical section, it is in a state called the `quiescent state`. A period during which every thread has passed through a `quiescent state` is called a `grace period`. If a thread wants to remove an element from the data structure, this occurs in two steps. The first step is `removal`: the element is atomically removed from the data structure, but the physical memory is not released. After this the writer announces the removal and waits until the grace period finishes; readers which obtained a reference before the removal can still safely use the element during this time. After the grace period has finished, the second step of the element removal starts: the element is simply freed from physical memory.

There are a couple of implementations of `RCU`. The old `RCU` implementation is called classic, the new implementation is called `tree` RCU. As you may already understand, the `CONFIG_TREE_RCU` kernel configuration option enables the tree `RCU`. Another is the `tiny` RCU which depends on `CONFIG_TINY_RCU` and `CONFIG_SMP=n`. We will see more details about `RCU` in general in a separate chapter about synchronization primitives, but now let's look at the `rcu_init` implementation from [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c):

```C
void __init rcu_init(void)
{
	int cpu;

	rcu_bootup_announce();
	rcu_init_geometry();
	rcu_init_one(&rcu_bh_state, &rcu_bh_data);
	rcu_init_one(&rcu_sched_state, &rcu_sched_data);
	__rcu_init_preempt();
	open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

	/*
	 * We don't need protection against CPU-hotplug here because
	 * this is called early in boot, before either interrupts
	 * or the scheduler are operational.
	 */
	cpu_notifier(rcu_cpu_notify, 0);
	pm_notifier(rcu_pm_notify, 0);
	for_each_online_cpu(cpu)
		rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);

	rcu_early_boot_tests();
}
```

In the beginning of the `rcu_init` function we define the `cpu` variable and call `rcu_bootup_announce`. The `rcu_bootup_announce` function is pretty simple:

```C
static void __init rcu_bootup_announce(void)
{
	pr_info("Hierarchical RCU implementation.\n");
	rcu_bootup_announce_oddness();
}
```

It just prints information about the `RCU` with the `pr_info` function, and `rcu_bootup_announce_oddness`, which uses `pr_info` too, prints different information about the current `RCU` configuration, which depends on kernel configuration options like `CONFIG_RCU_TRACE`, `CONFIG_PROVE_RCU`, `CONFIG_RCU_FANOUT_EXACT`, etc. In the next step, we can see the call of the `rcu_init_geometry` function. This function is defined in the same source code file and computes the node tree geometry depending on the number of CPUs. Actually `RCU` provides scalability with extremely low internal RCU lock contention. What if a data structure will be read from different CPUs? The `RCU` API provides the `rcu_state` structure which presents the RCU global state including the node hierarchy. The hierarchy is presented by the:

```
struct rcu_node node[NUM_RCU_NODES];
```

array of structures. As we can read in the comment above this definition:

```
The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level
in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is
determined by the number of CPUs and by CONFIG_RCU_FANOUT.

Small systems will have a "hierarchy" consisting of a single rcu_node.
```

The `rcu_node` structure is defined in [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) and contains information about the current grace period, whether the grace period is completed or not, which CPUs or groups need to switch in order for the current grace period to proceed, etc. Every `rcu_node` contains a lock for a couple of CPUs. These `rcu_node` structures are embedded into a linear array in the `rcu_state` structure and represented as a tree with the root as the first element, covering all CPUs. As you can see, the number of rcu nodes is determined by `NUM_RCU_NODES`, which depends on the number of available CPUs:

```C
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
```

where the level values depend on the `CONFIG_RCU_FANOUT_LEAF` configuration option. For example, in the simplest case, one leaf `rcu_node` will cover two CPUs on a machine with eight CPUs:

```
+--------------------------------------------------------------------+
|  rcu_state                                                         |
|                       +--------------------+                       |
|                       |        root        |                       |
|                       |      rcu_node      |                       |
|                       +---------+----------+                       |
|                                 |                                  |
|                 +---------------+---------------+                  |
|                 |                               |                  |
|            +----v-----+                    +----v-----+            |
|            |          |                    |          |            |
|            | rcu_node |                    | rcu_node |            |
|            |          |                    |          |            |
|            +----+-----+                    +----+-----+            |
|                 |                               |                  |
|          +------+-------+                +------+-------+          |
|          |              |                |              |          |
|     +----v-----+   +----v-----+     +----v-----+   +----v-----+    |
|     |          |   |          |     |          |   |          |    |
|     | rcu_node |   | rcu_node |     | rcu_node |   | rcu_node |    |
|     |          |   |          |     |          |   |          |    |
|     +----+-----+   +----+-----+     +----+-----+   +----+-----+    |
|          |              |                |              |          |
+----------|--------------|----------------|--------------|---------+
           |              |                |              |
+----------v--------------v----------------v--------------v---------+
|                |                |                |                 |
|      CPU1      |      CPU3      |      CPU5      |      CPU7       |
|                |                |                |                 |
|      CPU2      |      CPU4      |      CPU6      |      CPU8       |
|                |                |                |                 |
+--------------------------------------------------------------------+
```

So, in the `rcu_init_geometry` function we just need to calculate the total number of `rcu_node` structures. We start by calculating the number of `jiffies` until the first and the next `fqs`, which is the `force-quiescent-state` (a forced switch into the quiescent state):

```C
d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
if (jiffies_till_first_fqs == ULONG_MAX)
	jiffies_till_first_fqs = d;
if (jiffies_till_next_fqs == ULONG_MAX)
	jiffies_till_next_fqs = d;
```

where:

```C
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
#define RCU_JIFFIES_FQS_DIV 256
```

As we have calculated these [jiffies](http://en.wikipedia.org/wiki/Jiffy_%28time%29), we check whether the previously defined `jiffies_till_first_fqs` and `jiffies_till_next_fqs` variables are still equal to [ULONG_MAX](http://www.rowleydownload.co.uk/avr/documentation/index.htm?http://www.rowleydownload.co.uk/avr/documentation/ULONG_MAX.htm) (their default value) and, if so, set them to the calculated value. As we did not touch these variables before, they are equal to `ULONG_MAX`:

```C
static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;
```

In the next step of `rcu_init_geometry` we check that `rcu_fanout_leaf` didn't change (it has the same value as `CONFIG_RCU_FANOUT_LEAF` at compile-time) and that the number of CPU ids equals `NR_CPUS`; in this case we just return:

```C
if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
    nr_cpu_ids == NR_CPUS)
	return;
```

After this we need to compute the number of CPUs that an `rcu_node` tree can handle with the given number of levels:

```C
rcu_capacity[0] = 1;
rcu_capacity[1] = rcu_fanout_leaf;
for (i = 2; i <= MAX_RCU_LVLS; i++)
	rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;
```

And in the last step we calculate the number of rcu_nodes at each level of the tree in [this loop](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c#L4094).

As we have calculated the geometry of the `rcu_node` tree, we need to go back to the `rcu_init` function, and as the next step we need to initialize two `rcu_state` structures with the `rcu_init_one` function:

```C
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
```

The `rcu_init_one` function takes two arguments:

* Global `RCU` state;
* Per-CPU data for `RCU`.

Both variables are defined in [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) with their `percpu` data:

```
extern struct rcu_state rcu_bh_state;
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
```

About these states you can read [here](http://lwn.net/Articles/264090/). As I wrote above, we need to initialize the `rcu_state` structures and the `rcu_init_one` function helps us with it. After the `rcu_state` initialization, we can see the call of `__rcu_init_preempt` which depends on the `CONFIG_PREEMPT_RCU` kernel configuration option. It does the same as the previous calls - initialization of the `rcu_preempt_state` structure, which has the `rcu_state` type, with the `rcu_init_one` function. After this, in `rcu_init`, we can see the call of:

```C
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
```

This function registers a handler of the `pending interrupt`. A pending interrupt, or `softirq`, supposes that part of the work of an interrupt can be delayed for later execution when the system is less loaded. Pending interrupts are represented by the following structure:

```C
struct softirq_action
{
	void (*action)(struct softirq_action *);
};
```

which is defined in [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) and contains only one field - the handler of an interrupt. You can check the `softirqs` on your system with:

```
$ cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
          HI:          2          0          0          1          0          2          0          0
       TIMER:     137779     108110     139573     107647     107408     114972      99653      98665
      NET_TX:       1127          0          4          0          1          1          0          0
      NET_RX:        334        221     132939       3076        451        361        292        303
       BLOCK:       5253       5596          8        779       2016      37442         28       2855
BLOCK_IOPOLL:          0          0          0          0          0          0          0          0
     TASKLET:         66          0       2916        113          0         24      26708          0
       SCHED:     102350      75950      91705      75356      75323      82627      69279      69914
     HRTIMER:        510        302        368        260        219        255        248        246
         RCU:      81290      68062      82979      69015      68390      69385      63304      63473
```

The `open_softirq` function takes two parameters:

* index of the interrupt;
* interrupt handler;

and adds the interrupt handler to the array of the pending interrupts:

```C
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
	softirq_vec[nr].action = action;
}
```

In our case the interrupt handler is `rcu_process_callbacks`, which is defined in [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c) and does the `RCU` core processing for the current CPU. After we have registered the `softirq` interrupt for the `RCU`, we can see the following code:

```C
cpu_notifier(rcu_cpu_notify, 0);
pm_notifier(rcu_pm_notify, 0);
for_each_online_cpu(cpu)
	rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
```

Here we can see the registration of the `cpu` notifier which is needed on systems that support [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt); we will not dive into the details of this theme here. The last function in `rcu_init` is `rcu_early_boot_tests`:

```C
void rcu_early_boot_tests(void)
{
	pr_info("Running RCU self tests\n");

	if (rcu_self_test)
		early_boot_test_call_rcu();
	if (rcu_self_test_bh)
		early_boot_test_call_rcu_bh();
	if (rcu_self_test_sched)
		early_boot_test_call_rcu_sched();
}
```

which runs self tests for the `RCU`.

That's all. We have seen the initialization process of the `RCU` subsystem. As I wrote above, more about the `RCU` will come in a separate chapter about synchronization primitives.

Rest of the initialization process
--------------------------------------------------------------------------------

Ok, we have already passed the main theme of this part, which is the `RCU` initialization, but it is not the end of the linux kernel initialization process. In the last paragraph of this part we will see a couple of functions which run at initialization time, but we will not dive into deep details about them, for several reasons:

* They are not very important for the generic kernel initialization process and depend on the kernel configuration;
* They have a debugging character and are not important for now;
* We will see many of them in separate parts/chapters.

After we initialized the `RCU`, the next step which you can see in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `trace_init` function. As you can understand from its name, this function initializes the [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. You can read more about the linux kernel trace system [here](http://elinux.org/Kernel_Trace_Systems).

After `trace_init`, we can see the call of `radix_tree_init`. If you are familiar with different data structures, you can understand from the name of this function that it initializes the kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function is defined in [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c) and you can read more about it in the part about [Radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html).

In the next step we can see the functions which are related to the `interrupt handling` subsystem. They are:

* `early_irq_init`
* `init_IRQ`
* `softirq_init`

We will see explanations of these functions and their implementations in a special part about interrupt and exception handling. After this come many different functions (like `init_timers`, `hrtimers_init`, `time_init`, etc.) which are related to timing and timer stuff. We will see more about these functions in the chapter about timers.

The next couple of functions are related to [perf](https://perf.wiki.kernel.org/index.php/Main_Page) events - `perf_event_init` (there will be a separate chapter about perf) - and to the initialization of `profiling` with `profile_init`. After this we enable `irq`s with the call of:

```C
local_irq_enable();
```

which expands to the `sti` instruction, and do the post initialization of the [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) allocator with the call of the `kmem_cache_init_late` function (as I wrote above, we will learn about `SLAB` in the [Linux memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter).

After the post initialization of the `SLAB`, the next point is the initialization of the console with the `console_init` function from [drivers/tty/tty_io.c](https://github.com/torvalds/linux/blob/master/drivers/tty/tty_io.c).

After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). After this, we can see the initialization of the dynamic allocation of `debug objects` with `debug_objects_mem_init`, the kernel memory leak [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with `kmemleak_init`, the `percpu` pageset setup with `setup_per_cpu_pageset`, the setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with `numa_policy_init`, the setup of the scheduler clock with `sched_clock_init`, the `pidmap` initialization with the call of `pidmap_init` for the initial `PID` namespace, the cache creation with `anon_vma_init` for private virtual memory areas, and the early initialization of [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with `acpi_early_init`.

This is the end of the ninth part of the [linux kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we went through many functions without diving into the details of their implementations. Do not worry if you do not know anything about this stuff, or if you know something but do not understand all of it. As I already wrote many times, we will see the details of the implementations in other parts or chapters.

Conclusion
--------------------------------------------------------------------------------

It is the end of the ninth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked at the initialization process of the `RCU` subsystem. In the next part we will continue to dive into the linux kernel initialization process and I hope that we will finish with the `start_kernel` function, go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, and see the start of the first process.

If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure)
* [kmemleak](https://www.kernel.org/doc/Documentation/kmemleak.txt)
* [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)
* [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
* [RCU documentation](https://github.com/torvalds/linux/tree/master/Documentation/RCU)
* [integer ID management](https://lwn.net/Articles/103209/)
* [Documentation/memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt)
* [Runtime locking correctness validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
* [slab](http://en.wikipedia.org/wiki/Slab_allocation)
* [i2c](http://en.wikipedia.org/wiki/I%C2%B2C)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html)
@@ -415,4 +415,4 @@ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
* [e820](http://en.wikipedia.org/wiki/E820)
* [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
* [debugfs](http://en.wikipedia.org/wiki/Debugfs)
* [A first look at the kernel memory management framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)

@@ -520,4 +520,4 @@ prev_map[slot] = NULL;
* [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit)
* [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer)
* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
* [Kernel memory management, part 1](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html)

Misc/contribute.md (new file)
|
||||
Linux kernel development
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As you may already know, I started a series of [blog posts](http://0xax.github.io/categories/assembly/) about assembly programming for the `x86_64` architecture last year. Before that I had never written a single line of low-level code, except for a couple of toy `Hello World` examples at university. That was a long time ago and, as I said, I didn't write low-level code at all. Some time ago I became interested in such things. I understood that I could write programs, but didn't actually understand how my programs were arranged.

After writing some assembly code I began to understand, **approximately**, how my program looks after compilation. But I still didn't understand many other things. For example: what occurs when the `syscall` instruction is executed in my assembly code, what occurs when the `printf` function starts to work, or how my program can talk to other computers over the network. The [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler) programming language didn't give me the answers to my questions, so I decided to go deeper in my research. I started to learn from the source code of the Linux kernel and tried to understand the things I was interested in. The source code of the Linux kernel didn't give me the answers to **all** of my questions, but my knowledge of the Linux kernel and the processes around it is now much better.

I'm writing this part nine and a half months after I started to learn from the source code of the Linux kernel and published the first [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) of this book. By now it contains forty parts, and it is not the end. I decided to write this series about the Linux kernel mostly for myself. As you know, the Linux kernel is a huge piece of code and it is easy to forget what this or that part of it does and how it is implemented. But the [linux-insides](https://github.com/0xAX/linux-insides) repo soon became popular, and after nine months it has `9096` stars:



It seems that people are interested in the insides of the Linux kernel. Besides this, in all the time that I have been writing `linux-insides`, I have received many questions from different people about how to begin contributing to the Linux kernel. Generally, people are interested in contributing to open source projects, and the Linux kernel is no exception:



So, it seems that people are interested in the Linux kernel development process. I thought it would be strange if a book about the Linux kernel did not contain a part describing how to take part in Linux kernel development, and that's why I decided to write it. You will not find information in this part about why you should be interested in contributing to the Linux kernel. But if you are interested in how to start with Linux kernel development, this part is for you.

Let's start.
How to start with the Linux kernel
---------------------------------------------------------------------------------

First of all, let's see how to get, build, and run the Linux kernel. You can run your custom build of the Linux kernel in two ways:

* Run the Linux kernel on a virtual machine;
* Run the Linux kernel on real hardware.

I'll provide descriptions for both methods. Before we start doing anything with the Linux kernel, we need to get it. There are a couple of ways to do this depending on your purpose. If you just want to update the current version of the Linux kernel on your computer, you can use the instructions specific to your Linux [distro](https://en.wikipedia.org/wiki/Linux_distribution).
In the first case you just need to download the new version of the Linux kernel with the [package manager](https://en.wikipedia.org/wiki/Package_manager). For example, to upgrade the Linux kernel to `4.1` on [Ubuntu (Vivid Vervet)](http://releases.ubuntu.com/15.04/), you just need to execute the following commands:

```
$ sudo add-apt-repository ppa:kernel-ppa/ppa
$ sudo apt-get update
```

After this, execute:

```
$ apt-cache showpkg linux-headers
```

and choose the version of the Linux kernel you are interested in. Finally, execute the next command and replace `${version}` with the version that you chose in the output of the previous command:

```
$ sudo apt-get install linux-headers-${version} linux-headers-${version}-generic linux-image-${version}-generic --fix-missing
```

and reboot your system. After the reboot you will see the new kernel in the [grub](https://en.wikipedia.org/wiki/GNU_GRUB) menu.
If, on the other hand, you are interested in Linux kernel development, you will need to get the source code of the Linux kernel. You can find it on the [kernel.org](https://kernel.org/) website and download an archive with the Linux kernel source code. Actually, the Linux kernel development process is fully built around the `git` [version control system](https://en.wikipedia.org/wiki/Version_control), so you can get it with `git` from `kernel.org`:

```
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
```

I don't know about you, but I prefer `github`. There is a [mirror](https://github.com/torvalds/linux) of the Linux kernel mainline repository, so you can clone it with:

```
$ git clone git@github.com:torvalds/linux.git
```

I use my own [fork](https://github.com/0xAX/linux) for development, and when I want to pull updates from the main repository I just execute the following commands:

```
$ git checkout master
$ git pull upstream master
```

Note that the remote name of the main repository is `upstream`. To add a new remote for the main Linux repository you can execute:

```
git remote add upstream git@github.com:torvalds/linux.git
```

After this you will have two remotes:

```
~/dev/linux (master) $ git remote -v
origin	git@github.com:0xAX/linux.git (fetch)
origin	git@github.com:0xAX/linux.git (push)
upstream	https://github.com/torvalds/linux.git (fetch)
upstream	https://github.com/torvalds/linux.git (push)
```

One is for your fork (`origin`) and the other is for the main repository (`upstream`).
Now that we have a local copy of the Linux kernel source code, we need to configure and build it. The Linux kernel can be configured in different ways. The simplest way is to just copy the configuration file of the already installed kernel, which is located in the `/boot` directory:

```
$ sudo cp /boot/config-$(uname -r) ~/dev/linux/.config
```

If your current Linux kernel was built with support for access to the `/proc/config.gz` file, you can copy your actual kernel configuration file with this command:

```
$ cat /proc/config.gz | gunzip > ~/dev/linux/.config
```
If you are not satisfied with the standard kernel configuration provided by the maintainers of your distro, you can configure the Linux kernel manually. There are a couple of ways to do it. The Linux kernel root [Makefile](https://github.com/torvalds/linux/blob/master/Makefile) provides a set of targets that allow you to configure it. For example, `menuconfig` provides a menu-driven interface for the kernel configuration:



The `defconfig` argument generates the default kernel configuration file for the current architecture, for example the [x86_64 defconfig](https://github.com/torvalds/linux/blob/master/arch/x86/configs/x86_64_defconfig). You can pass the `ARCH` command line argument to `make` to build `defconfig` for the given architecture:

```
$ make ARCH=arm64 defconfig
```

The `allnoconfig`, `allyesconfig` and `allmodconfig` arguments allow you to generate a new configuration file where all options will be disabled, enabled, and enabled as modules, respectively. The `nconfig` command line argument provides an `ncurses`-based program with a menu to configure the Linux kernel:



And there is even `randconfig`, which generates a random Linux kernel configuration file. I will not write about how to configure the Linux kernel or which options to enable, because it makes no sense to do so for two reasons: first, I do not know your hardware, and second, if you know your hardware, the only remaining task is to find out how to use the configuration programs, and all of them are pretty simple to use.
OK, we now have the source code of the Linux kernel and have configured it. The next step is the compilation of the Linux kernel. The simplest way to compile the Linux kernel is to just execute:

```
$ make
scripts/kconfig/conf  --silentoldconfig Kconfig
#
# configuration written to .config
#
  CHK     include/config/kernel.release
  UPD     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
...
...
...
  OBJCOPY arch/x86/boot/vmlinux.bin
  AS      arch/x86/boot/header.o
  LD      arch/x86/boot/setup.elf
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Setup is 15740 bytes (padded to 15872 bytes).
System is 4342 kB
CRC 82703414
Kernel: arch/x86/boot/bzImage is ready  (#73)
```

To increase the speed of kernel compilation, you can pass the `-jN` command line argument to `make`, where `N` specifies the number of commands to run simultaneously:

```
$ make -j8
```
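A reasonable value for `N` is the number of CPUs on the build machine. Instead of hard-coding it, you can ask `nproc` (part of GNU coreutils) for it:

```shell
$ make -j$(nproc)
```

This way the build scales automatically with the machine it runs on.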
If you want to build the Linux kernel for an architecture that differs from your current one, the simplest way to do it is to pass two arguments:

* the `ARCH` command line argument with the name of the target architecture;
* the `CROSS_COMPILE` command line argument with the cross-compiler tool prefix.

For example, if we want to compile the Linux kernel for [arm64](https://en.wikipedia.org/wiki/ARM_architecture#AArch64_features) with the default kernel configuration file, we need to execute the following commands:

```
$ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
$ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

As a result of the compilation we can see the compressed kernel - `arch/x86/boot/bzImage`. Now that we have compiled the kernel, we can either install it on our computer or just run it in an emulator.
Installing the Linux kernel
--------------------------------------------------------------------------------

As I already wrote, we will consider two ways to launch the new kernel: in the first case we install and run the new version of the Linux kernel on real hardware, and in the second we launch the Linux kernel on a virtual machine. In the previous paragraph we saw how to build the Linux kernel from source code, and as a result we have got a compressed image:

```
...
...
...
Kernel: arch/x86/boot/bzImage is ready  (#73)
```

After we have got the [bzImage](https://en.wikipedia.org/wiki/Vmlinux#bzImage) we need to install the `headers` and `modules` of the new Linux kernel with:

```
$ sudo make headers_install
$ sudo make modules_install
```

and the kernel itself:

```
$ sudo make install
```

From this moment we have installed the new version of the Linux kernel, and now we must tell the `bootloader` about it. Of course we can add it manually by editing the `/boot/grub2/grub.cfg` configuration file, but I prefer to use a script for this purpose. I'm using two different Linux distros, Fedora and Ubuntu, which have two different ways of updating the [grub](https://en.wikipedia.org/wiki/GNU_GRUB) configuration file. I'm using the following script for this purpose:
```shell
#!/bin/bash

source "term-colors"

DISTRIBUTIVE=$(cat /etc/*-release | grep NAME | head -1 | sed -n -e 's/NAME\=//p')
echo -e "Distributive: ${Green}${DISTRIBUTIVE}${Color_Off}"

if [[ "$DISTRIBUTIVE" == "Fedora" ]] ;
then
    su -c 'grub2-mkconfig -o /boot/grub2/grub.cfg'
else
    sudo update-grub
fi

echo "${Green}Done.${Color_Off}"
```

This is the last step of the new Linux kernel installation, and after it you can reboot your computer and select the new version of the kernel during boot.
The second case is to launch the new Linux kernel in a virtual machine. I prefer [qemu](https://en.wikipedia.org/wiki/QEMU). First of all we need to build an initial ramdisk - [initrd](https://en.wikipedia.org/wiki/Initrd) - for this. The `initrd` is a temporary root file system that is used by the Linux kernel during the initialization process while other filesystems are not mounted. We can build an `initrd` with the following commands.

First of all we need to download [busybox](https://en.wikipedia.org/wiki/BusyBox) and run `menuconfig` for its configuration:

```shell
$ mkdir initrd
$ cd initrd
$ curl http://busybox.net/downloads/busybox-1.23.2.tar.bz2 | tar xjf -
$ cd busybox-1.23.2/
$ make menuconfig
$ make -j4
```
`busybox` is an executable file - `/bin/busybox` - that contains a set of standard tools like [coreutils](https://en.wikipedia.org/wiki/GNU_Core_Utilities). In the `busybox` menu we need to enable the `Build BusyBox as a static binary (no shared libs)` option:



We can find this option in:

```
Busybox Settings
--> Build Options
```

After this we exit from the `busybox` configuration menu and execute the following commands to build and install it:

```
$ make -j4
$ sudo make install
```
Now that `busybox` is installed, we can begin building our `initrd`. To do this, we go back to the previous `initrd` directory and:

```
$ cd ..
$ mkdir -p initramfs
$ cd initramfs
$ mkdir -pv {bin,sbin,etc,proc,sys,usr/{bin,sbin}}
$ cp -av ../busybox-1.23.2/_install/* .
```

copy the `busybox` files to the `bin`, `sbin` and other directories. Now we need to create an executable `init` file that will be executed as the first process in the system. My `init` file just mounts the [procfs](https://en.wikipedia.org/wiki/Procfs) and [sysfs](https://en.wikipedia.org/wiki/Sysfs) filesystems and executes a shell:

```shell
#!/bin/sh

mount -t proc none /proc
mount -t sysfs none /sys

exec /bin/sh
```
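One detail that is easy to miss: the kernel will not be able to execute `init` unless the file has the executable bit set, so if you created it by hand, mark it executable while still in the `initramfs` directory:

```shell
$ chmod +x init
```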
Now we can create an archive that will be our `initrd`:

```
$ find . -print0 | cpio --null -ov --format=newc | gzip -9 > ~/dev/initrd_x86_64.gz
```
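Before booting the archive it can be useful to sanity-check it. `cpio -t` lists the file names stored in the archive, so you can verify that `init`, `bin/busybox` and friends actually made it in (assuming a `cpio` with `newc` support, which is standard on Linux):

```shell
$ zcat ~/dev/initrd_x86_64.gz | cpio -t | head
```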
We can now run our kernel in the virtual machine. As I already wrote, I prefer [qemu](https://en.wikipedia.org/wiki/QEMU) for this. We can run our kernel with the following command:

```
$ qemu-system-x86_64 -snapshot -m 8G -serial stdio -kernel ~/dev/linux/arch/x86/boot/bzImage -initrd ~/dev/initrd_x86_64.gz -append "root=/dev/sda1 ignore_loglevel"
```



From now on we can run the Linux kernel in the virtual machine, and this means that we can begin to change and test the kernel.

Consider using [ivandavidov/minimal](https://github.com/ivandavidov/minimal) to automate the process of generating an initrd.
Getting started with Linux kernel development
---------------------------------------------------------------------------------

The main point of this paragraph is to answer two questions: what to do and what not to do before sending your first patch to the Linux kernel. Please, do not confuse this `to do` with `todo`. I have no answer as to what you can fix in the Linux kernel. I just want to tell you my workflow for experimenting with the Linux kernel source code.

First of all I pull the latest updates from Linus's repo with the following commands:

```
$ git checkout master
$ git pull upstream master
```

After this my local repository with the Linux kernel source code is synced with the [mainline](https://github.com/torvalds/linux) repository. Now we can make some changes in the source code. As I already wrote, I have no advice for you about where to start or what `TODO` in the Linux kernel. But the best place for newbies is the `staging` tree, in other words the set of drivers from [drivers/staging](https://github.com/torvalds/linux/tree/master/drivers/staging). The maintainer of the `staging` tree is [Greg Kroah-Hartman](https://en.wikipedia.org/wiki/Greg_Kroah-Hartman), and the `staging` tree is the place where your trivial patch can be accepted. Let's look at a simple example that describes how to generate a patch, check it, and send it to the [Linux kernel mailing list](https://lkml.org/).

If we look in the driver for [Digi International EPCA PCI](https://github.com/torvalds/linux/tree/master/drivers/staging/dgap) based devices, we will see the `dgap_sindex` function on line 295:
```C
static char *dgap_sindex(char *string, char *group)
{
	char *ptr;

	if (!string || !group)
		return NULL;

	for (; *string; string++) {
		for (ptr = group; *ptr; ptr++) {
			if (*ptr == *string)
				return string;
		}
	}

	return NULL;
}
```

This function looks for the first character in `string` that matches any character in `group` and returns its position. During my research of the Linux kernel source code, I noticed that the [lib/string.c](https://github.com/torvalds/linux/blob/master/lib/string.c#L473) source code file contains an implementation of the `strpbrk` function that does the same thing as `dgap_sindex`. It is not a good idea to use a custom implementation of a function that already exists, so we can remove the `dgap_sindex` function from the [drivers/staging/dgap/dgap.c](https://github.com/torvalds/linux/blob/master/drivers/staging/dgap/dgap.c) source code file and use `strpbrk` instead.
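If you want to convince yourself that `strpbrk` really behaves like the hand-rolled loop above, a small throwaway program (not part of the kernel tree, written here only for illustration) does the job:

```shell
$ cat > strpbrk_demo.c << 'EOF'
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* strpbrk() returns a pointer to the first character in its first
	 * argument that matches any character of its second argument -
	 * exactly what dgap_sindex() implements by hand. */
	printf("%s\n", strpbrk("hello world", "ow"));
	return 0;
}
EOF
$ gcc strpbrk_demo.c -o strpbrk_demo && ./strpbrk_demo
o world
```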
First of all let's create a new `git` branch based on the current master that is synced with the Linux kernel mainline repo:

```
$ git checkout -b "dgap-remove-dgap_sindex"
```

And now we can replace `dgap_sindex` with `strpbrk`. After we have made all the changes, we need to recompile the Linux kernel or just the [dgap](https://github.com/torvalds/linux/tree/master/drivers/staging/dgap) directory. Do not forget to enable this driver in the kernel configuration. You can find it in:

```
Device Drivers
--> Staging drivers
----> Digi EPCA PCI products
```


Now it is time to commit. I'm using the following combination for this:

```
$ git add .
$ git commit -s -v
```

After the last command an editor will be opened, chosen from the `$GIT_EDITOR` or `$EDITOR` environment variable. The `-s` command line argument adds a `Signed-off-by` line by the committer at the end of the commit log message. You can find this line at the end of each commit message, for example - [00cc1633](https://github.com/torvalds/linux/commit/00cc1633816de8c95f337608a1ea64e228faf771). The main point of this line is to track who made a change. The `-v` option shows a unified diff between the HEAD commit and what would be committed, at the bottom of the commit message. It is not necessary, but very useful sometimes. A couple of words about the commit message: a commit message actually consists of two parts.

The first part is on the first line and contains a short description of the changes. It starts with the `[PATCH]` prefix, followed by a subsystem, driver or architecture name, then a `:` symbol, and then the short description itself. In our case it will be something like this:

```
[PATCH] staging/dgap: Use strpbrk() instead of dgap_sindex()
```

After the short description usually comes an empty line and then a full description of the commit. In our case it will be:

```
The <linux/string.h> provides the strpbrk() function that does the same as
dgap_sindex(). Let's use the already defined function instead of writing a
custom one.
```

And the `Signed-off-by` line goes at the end of the commit message. Note that each line of a commit message must not be longer than `80` characters, and the commit message must describe your changes in detail. Do not just write a commit message like `Custom function removed`; you need to describe what you did and why. The patch reviewers must know what they are reviewing. Besides this, such commit messages are very helpful in general. Each time we can't understand something, we can use [git blame](http://git-scm.com/docs/git-blame) to read the description of the changes.
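A quick way to check that no line of your last commit message exceeds the limit is a one-line `awk` filter (the `80` matches the limit mentioned above); it prints the number and length of any offending line, so no output means the message is fine:

```shell
$ git log -1 --pretty=%B | awk 'length > 80 { print NR ": " length " chars" }'
```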
After we have committed the changes, it is time to generate a patch. We can do it with the `format-patch` command:

```
$ git format-patch master
0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
```

We've passed the name of the branch (`master` in this case) to the `format-patch` command, which will generate a patch with the last changes that are in the `dgap-remove-dgap_sindex` branch but not in the `master` branch. As you can note, the `format-patch` command generates a file that contains the last changes and whose name is based on the commit's short description. If you want to generate a patch with a custom name, you can use the `--stdout` option:

```
$ git format-patch master --stdout > dgap-patch-1.patch
```
The last step after we have generated our patch is to send it to the Linux kernel mailing list. Of course, you can use any email client, but `git` provides a special command for this: `git send-email`. Before you send your patch, you need to know where to send it. Yes, you can just send it to the Linux kernel mailing list address, which is `linux-kernel@vger.kernel.org`, but it is very likely that the patch will be ignored because of the large flow of messages. The better choice is to send the patch to the maintainers of the subsystem where you have made changes. To find the names of these maintainers, use the `get_maintainer.pl` script. All you need to do is pass the file or directory where you wrote the code:

```
$ ./scripts/get_maintainer.pl -f drivers/staging/dgap/dgap.c
Lidza Louina <lidza.louina@gmail.com> (maintainer:DIGI EPCA PCI PRODUCTS)
Mark Hounschell <markh@compro.net> (maintainer:DIGI EPCA PCI PRODUCTS)
Daeseok Youn <daeseok.youn@gmail.com> (maintainer:DIGI EPCA PCI PRODUCTS)
Greg Kroah-Hartman <gregkh@linuxfoundation.org> (supporter:STAGING SUBSYSTEM)
driverdev-devel@linuxdriverproject.org (open list:DIGI EPCA PCI PRODUCTS)
devel@driverdev.osuosl.org (open list:STAGING SUBSYSTEM)
linux-kernel@vger.kernel.org (open list)
```

You will see the set of names and related emails. Now we can send our patch with:
```
$ git send-email --to "Lidza Louina <lidza.louina@gmail.com>" \
    --cc "Mark Hounschell <markh@compro.net>" \
    --cc "Daeseok Youn <daeseok.youn@gmail.com>" \
    --cc "Greg Kroah-Hartman <gregkh@linuxfoundation.org>" \
    --cc "driverdev-devel@linuxdriverproject.org" \
    --cc "devel@driverdev.osuosl.org" \
    --cc "linux-kernel@vger.kernel.org" \
    0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
```

That's all. The patch is sent, and now you only have to wait for feedback from the Linux kernel developers. After you send a patch and a maintainer accepts it, you will find it in the maintainer's repository (for example the [patch](https://git.kernel.org/cgit/linux/kernel/git/gregkh/staging.git/commit/?h=staging-testing&id=b9f7f1d0846f15585b8af64435b6b706b25a5c0b) that you saw in this part), and after some time the maintainer will send a pull request to Linus and you will see your patch in the mainline repository.

That's all.
Some advice
--------------------------------------------------------------------------------

At the end of this part I want to give you some advice about what to do and what not to do during development of the Linux kernel:

* Think, think, think. And think again before you decide to send a patch.

* Each time you have changed something in the Linux kernel source code - compile it. After any change. Again and again. Nobody likes changes that don't even compile.

* The Linux kernel has a coding style [guide](https://github.com/torvalds/linux/blob/master/Documentation/CodingStyle) and you need to comply with it. There is a great script which can help you check your changes: [scripts/checkpatch.pl](https://github.com/torvalds/linux/blob/master/scripts/checkpatch.pl). Just pass the source code file with your changes to it and you will see:
```
$ ./scripts/checkpatch.pl -f drivers/staging/dgap/dgap.c
WARNING: Block comments use * on subsequent lines
#94: FILE: drivers/staging/dgap/dgap.c:94:
+/*
+	SUPPORTED PRODUCTS

CHECK: spaces preferred around that '|' (ctx:VxV)
#143: FILE: drivers/staging/dgap/dgap.c:143:
+	{ PPCM, PCI_DEV_XEM_NAME, 64, (T_PCXM|T_PCLITE|T_PCIBUS) },
```

You can also see the problematic places with the help of `git diff`:



* [Linus doesn't accept github pull requests](https://github.com/torvalds/linux/pull/17#issuecomment-5654674)
* If your change consists of several different and unrelated changes, you need to split it into separate commits. The `git format-patch` command will generate a patch for each commit, and the subject of each patch will contain an `N/M` counter, where `N` is the number of the patch in a series of `M` patches. If you are planning to send a series of patches it will be helpful to pass the `--cover-letter` option to the `git format-patch` command. This will generate an additional file that contains the cover letter that you can use to describe what your patchset changes. It is also a good idea to use the `--in-reply-to` option in the `git send-email` command. This option allows you to send your patch series in reply to your cover message. The structure of your patch series will look like this to a maintainer:

```
|--> cover letter
|----> patch_1
|----> patch_2
```
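For example, assuming your branch contains two commits on top of `master`, the following command generates the numbered patches plus a cover letter template (the patch file names below are illustrative; in reality they are derived from your actual commit subjects):

```shell
$ git format-patch --cover-letter master
0000-cover-letter.patch
0001-first-change.patch
0002-second-change.patch
```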
You need to pass a `message-id` as an argument of the `--in-reply-to` option; you can find it in the output of `git send-email`.

It's important that your email be in [plain text](https://en.wikipedia.org/wiki/Plain_text) format. Generally, `send-email` and `format-patch` are very useful during development, so look at the documentation for these commands and you'll find some useful options: [git send-email](http://git-scm.com/docs/git-send-email) and [git format-patch](http://git-scm.com/docs/git-format-patch).

* Do not be surprised if you do not get an immediate answer after you send your patch. Maintainers can be very busy.

* The [scripts](https://github.com/torvalds/linux/tree/master/scripts) directory contains many different useful scripts that are related to Linux kernel development. We already saw two scripts from this directory: the `checkpatch.pl` and the `get_maintainer.pl` scripts. Outside of those scripts, you can find the [stackusage](https://github.com/torvalds/linux/blob/master/scripts/stackusage) script that will print the stack usage, [extract-vmlinux](https://github.com/torvalds/linux/blob/master/scripts/extract-vmlinux) for extracting an uncompressed kernel image, and many others. Outside of the `scripts` directory you can find some very useful [scripts](https://github.com/lorenzo-stoakes/kernel-scripts) by [Lorenzo Stoakes](https://twitter.com/ljsloz) for kernel development.

* Subscribe to the Linux kernel mailing list. There are a large number of letters every day on `lkml`, but it is very useful to read them and understand things such as the current state of the Linux kernel. Besides `lkml` there is a [set](http://vger.kernel.org/vger-lists.html) of mailing lists related to the different Linux kernel subsystems.

* If your patch is not accepted the first time and you receive feedback from the Linux kernel developers, make your changes and resend the patch with the `[PATCH vN]` prefix (where `N` is the number of the patch version). For example:

```
[PATCH v2] staging/dgap: Use strpbrk() instead of dgap_sindex()
```

It must also contain a changelog that describes all changes from the previous patch versions. Of course, this is not an exhaustive list of requirements for Linux kernel development, but some of the most important items have been addressed.
Happy Hacking!

Conclusion
--------------------------------------------------------------------------------

I hope this will help others join the Linux kernel community!
If you have any questions or suggestions, write me an [email](mailto:kuleshovmail@gmail.com) or ping [me](https://twitter.com/0xAX) on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please let me know via email or send a PR.
Links
--------------------------------------------------------------------------------

* [blog posts about assembly programming for x86_64](http://0xax.github.io/categories/assembly/)
* [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler)
* [distro](https://en.wikipedia.org/wiki/Linux_distribution)
* [package manager](https://en.wikipedia.org/wiki/Package_manager)
* [grub](https://en.wikipedia.org/wiki/GNU_GRUB)
* [kernel.org](https://kernel.org/)
* [version control system](https://en.wikipedia.org/wiki/Version_control)
* [arm64](https://en.wikipedia.org/wiki/ARM_architecture#AArch64_features)
* [bzImage](https://en.wikipedia.org/wiki/Vmlinux#bzImage)
* [qemu](https://en.wikipedia.org/wiki/QEMU)
* [initrd](https://en.wikipedia.org/wiki/Initrd)
* [busybox](https://en.wikipedia.org/wiki/BusyBox)
* [coreutils](https://en.wikipedia.org/wiki/GNU_Core_Utilities)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [Linux kernel mailing list archive](https://lkml.org/)
* [Linux kernel coding style guide](https://github.com/torvalds/linux/blob/master/Documentation/CodingStyle)
* [How to Get Your Change Into the Linux Kernel](https://github.com/torvalds/linux/blob/master/Documentation/SubmittingPatches)
* [Linux Kernel Newbies](http://kernelnewbies.org/)
* [plain text](https://en.wikipedia.org/wiki/Plain_text)
638
Misc/linkers.md
Normal file
@@ -0,0 +1,638 @@
Introduction
---------------

During the writing of the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book I have received many emails with questions related to the [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files.

If we open the `Linker` page on Wikipedia, we will see the following definition:

>In computer science, a linker or link editor is a computer program that takes one or more object files generated by a compiler and combines them into a single executable file, library file, or another object file.

If you've written at least one program in C in your life, you will have seen files with the `*.o` extension. These files are [object files](https://en.wikipedia.org/wiki/Object_file). Object files are blocks of machine code and data with placeholder addresses that reference data and functions in other object files or libraries, as well as a list of their own functions and data. The main purpose of the linker is to collect and handle the code and data of each object file, turning them into the final executable file or library. In this post we will try to go through all aspects of this process. Let's start.
Linking process
---------------

Let's create a simple project with the following structure:

```
*-linkers
*--main.c
*--lib.c
*--lib.h
```

Our `main.c` source code file contains:

```C
#include <stdio.h>

#include "lib.h"

int main(int argc, char **argv) {
	printf("factorial of 5 is: %d\n", factorial(5));
	return 0;
}
```
The `lib.c` file contains:
|
||||
|
||||
```C
|
||||
int factorial(int base) {
|
||||
int res,i = 1;
|
||||
|
||||
if (base == 0) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
while (i <= base) {
|
||||
res *= i;
|
||||
i++;
|
||||
}
|
||||
|
||||
return res;
|
||||
}
|
||||
```

And the `lib.h` file contains:

```C
#ifndef LIB_H
#define LIB_H

int factorial(int base);

#endif
```

Now let's compile only the `main.c` source code file with:

```
$ gcc -c main.c
```

If we look inside the resulting object file with the `nm` util, we will see the following output:

```
$ nm -A main.o
main.o:                 U factorial
main.o:0000000000000000 T main
main.o:                 U printf
```

The `nm` util allows us to see the list of symbols from the given object file. It consists of three columns: the first is the name of the given object file and the address of any resolved symbols. The second column contains a character that represents the status of the given symbol. In this case the `U` means `undefined` and the `T` denotes that the symbols are placed in the `.text` section of the object file. The `nm` utility shows us here that we have three symbols in the `main.c` source code file:

* `factorial` - the factorial function defined in the `lib.c` source code file. It is marked as `undefined` here because we compiled only the `main.c` source code file, and it does not know anything about code from the `lib.c` file for now;
* `main` - the main function;
* `printf` - the function from the [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) library. `main.c` does not know anything about it for now either.

What can we understand from the output of `nm` so far? The `main.o` object file contains the local symbol `main` at address `0000000000000000` (it will be filled with the correct address after it is linked), and two unresolved symbols. We can see all of this information in the disassembly output of the `main.o` object file:

```
$ objdump -S main.o

main.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	48 83 ec 10          	sub    $0x10,%rsp
   8:	89 7d fc             	mov    %edi,-0x4(%rbp)
   b:	48 89 75 f0          	mov    %rsi,-0x10(%rbp)
   f:	bf 05 00 00 00       	mov    $0x5,%edi
  14:	e8 00 00 00 00       	callq  19 <main+0x19>
  19:	89 c6                	mov    %eax,%esi
  1b:	bf 00 00 00 00       	mov    $0x0,%edi
  20:	b8 00 00 00 00       	mov    $0x0,%eax
  25:	e8 00 00 00 00       	callq  2a <main+0x2a>
  2a:	b8 00 00 00 00       	mov    $0x0,%eax
  2f:	c9                   	leaveq
  30:	c3                   	retq
```

Here we are interested only in the two `callq` operations. The two `callq` operations contain `linker stubs`, or the function name and the offset from it to the next instruction. These stubs will be updated to the real addresses of the functions. We can see these functions' names in the following `objdump` output:

```
$ objdump -S -r main.o

...
  14:	e8 00 00 00 00       	callq  19 <main+0x19>
			15: R_X86_64_PC32	factorial-0x4
  19:	89 c6                	mov    %eax,%esi
...
  25:	e8 00 00 00 00       	callq  2a <main+0x2a>
			26: R_X86_64_PC32	printf-0x4
  2a:	b8 00 00 00 00       	mov    $0x0,%eax
...
```

The `-r` or `--reloc` flags of the `objdump` util print the `relocation` entries of the file. Now let's look in more detail at the relocation process.

Relocation
------------

Relocation is the process of connecting symbolic references with symbolic definitions. Let's look at the previous snippet from the `objdump` output:

```
  14:	e8 00 00 00 00       	callq  19 <main+0x19>
			15: R_X86_64_PC32	factorial-0x4
  19:	89 c6                	mov    %eax,%esi
```

Note the `e8 00 00 00 00` on the first line. The `e8` is the [opcode](https://en.wikipedia.org/wiki/Opcode) of the `call`, and the remainder of the line is a relative offset. So the `e8 00 00 00 00` contains a one-byte operation code followed by a four-byte address. Why only 4 bytes, if an address can be 8 bytes on an `x86_64` (64-bit) machine? Actually we compiled the `main.c` source code file with `-mcmodel=small`! From the `gcc` man page:

```
-mcmodel=small

Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.
```

Of course we didn't pass this option to `gcc` when we compiled `main.c`, but it is the default. We know from the `gcc` manual extract above that our program will be linked in the lower 2 GB of the address space. Four bytes are therefore enough for this. So we have the opcode of the `call` instruction and an unknown address. When we compile `main.c` with all its dependencies to an executable file, and then look at the factorial call, we see:

```
$ gcc main.c lib.c -o factorial && objdump -S factorial | grep factorial

factorial:     file format elf64-x86-64
...
...
0000000000400506 <main>:
  40051a:	e8 18 00 00 00       	callq  400537 <factorial>
...
...
0000000000400537 <factorial>:
  400550:	75 07                	jne    400559 <factorial+0x22>
  400557:	eb 1b                	jmp    400574 <factorial+0x3d>
  400559:	eb 0e                	jmp    400569 <factorial+0x32>
  40056f:	7e ea                	jle    40055b <factorial+0x24>
...
...
```

As we can see in the previous output, the address of the `main` function is `0x0000000000400506`. Why does it not start from `0x0`? You may already know that standard C programs are linked with the `glibc` C standard library (assuming `-nostdlib` was not passed to `gcc`). The compiled code for a program includes constructor functions to initialize data in the program when the program is started. These functions need to be called before the program is started, or in other words before the `main` function is called. To make the initialization and termination functions work, the compiler must output something in the assembler code to cause those functions to be called at the appropriate time. Execution of this program will start from the code placed in the special `.init` section. We can see this at the beginning of the objdump output:

```
$ objdump -S factorial | less

factorial:     file format elf64-x86-64

Disassembly of section .init:

00000000004003a8 <_init>:
  4003a8:	48 83 ec 08          	sub    $0x8,%rsp
  4003ac:	48 8b 05 a5 05 20 00 	mov    0x2005a5(%rip),%rax        # 600958 <_DYNAMIC+0x1d0>
```

Note that it starts at the `0x00000000004003a8` address relative to the `glibc` code. We can check it also in the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) output by running `readelf`:

```
$ readelf -d factorial | grep \(INIT\)
 0x000000000000000c (INIT)               0x4003a8
```

So, the address of the `main` function is `0000000000400506` and it is offset from the `.init` section. As we can see from the output, the address of the `factorial` function is `0x0000000000400537` and the binary code for the call of the `factorial` function now is `e8 18 00 00 00`. We already know that `e8` is the opcode for the `call` instruction; the next `18 00 00 00` (note that the address is represented as little endian for `x86_64`, so it is `00 00 00 18`) is the offset from the `callq` to the `factorial` function:

```python
>>> hex(0x40051a + 0x18 + 0x5) == hex(0x400537)
True
```

So we add `0x18` and `0x5` to the address of the `call` instruction. The offset is measured from the address of the following instruction: our call instruction is 5 bytes long (`e8 18 00 00 00`), so we skip those 5 bytes, and `0x18` is the displacement from there to the `factorial` function. A compiler generally creates each object file with the program addresses starting at zero. But if a program is created from multiple object files, these would overlap.

What we have seen in this section is the `relocation` process. This process assigns load addresses to the various parts of the program, adjusting the code and data in the program to reflect the assigned addresses.

Ok, now that we know a little about linkers and relocation it is time to learn more about linkers by linking our object files.

GNU linker
-----------------

As you can understand from the title, I will use the [GNU linker](https://en.wikipedia.org/wiki/GNU_linker), or just `ld`, in this post. Of course we can use `gcc` to link our `factorial` project:

```
$ gcc main.c lib.o -o factorial
```

and after it we will get the executable file `factorial` as a result:

```
./factorial
factorial of 5 is: 120
```

But `gcc` does not link object files itself. Instead it uses `collect2`, which is just a wrapper for the `GNU ld` linker:

```
~$ /usr/lib/gcc/x86_64-linux-gnu/4.9/collect2 --version
collect2 version 4.9.3
/usr/bin/ld --version
GNU ld (GNU Binutils for Debian) 2.25
...
...
...
```

Ok, we can use `gcc` and it will produce an executable file of our program for us. But let's look at how to use the `GNU ld` linker for the same purpose. First of all let's try to link these object files with the following example:

```
ld main.o lib.o -o factorial
```

Try to do it and you will get the following error:

```
$ ld main.o lib.o -o factorial
ld: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
main.o: In function `main':
main.c:(.text+0x26): undefined reference to `printf'
```

Here we can see two problems:

* the linker can't find the `_start` symbol;
* the linker does not know anything about the `printf` function.

First of all let's try to understand what this `_start` entry symbol is that appears to be required for our program to run. When I started to learn programming I learned that the `main` function is the entry point of the program. I think you learned this too :) But it actually isn't the entry point, it's `_start` instead. The `_start` symbol is defined in the `crt1.o` object file. We can find it with the following command:

```
$ objdump -S /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o

/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_start>:
   0:	31 ed                	xor    %ebp,%ebp
   2:	49 89 d1             	mov    %rdx,%r9
...
...
...
```

We pass this object file to the `ld` command as its first argument (see below). Now let's try to link it and look at the result:

```
ld /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
main.o lib.o -o factorial

/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o: In function `_start':
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:115: undefined reference to `__libc_csu_fini'
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:116: undefined reference to `__libc_csu_init'
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:122: undefined reference to `__libc_start_main'
main.o: In function `main':
main.c:(.text+0x26): undefined reference to `printf'
```

Unfortunately we will see even more errors. We can see here the old error about the undefined `printf` and three more undefined references:

* `__libc_csu_fini`
* `__libc_csu_init`
* `__libc_start_main`

The `_start` symbol is defined in the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=0d27a38e9c02835ce17d1c9287aa01be222e72eb;hb=HEAD) assembly file in the `glibc` source code. We can find the following assembly code lines there:

```assembly
mov $__libc_csu_fini, %R8_LP
mov $__libc_csu_init, %RCX_LP
...
call __libc_start_main
```

Here we pass the addresses of the entry points to the `.init` and `.fini` sections, which contain the code that starts to execute when the program is run and the code that executes when the program terminates. And in the end we see the call of the `main` function from our program. These three symbols are defined in the [csu/elf-init.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/elf-init.c;hb=1d4bbc54bd4f7d85d774871341b49f4357af1fb7) source code file. The following two object files:

* `crtn.o`;
* `crti.o`.

define the function prologues/epilogues for the `.init` and `.fini` sections (with the `_init` and `_fini` symbols respectively).

The `crtn.o` object file contains these `.init` and `.fini` sections:

```
$ objdump -S /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o

0000000000000000 <.init>:
   0:	48 83 c4 08          	add    $0x8,%rsp
   4:	c3                   	retq

Disassembly of section .fini:

0000000000000000 <.fini>:
   0:	48 83 c4 08          	add    $0x8,%rsp
   4:	c3                   	retq
```

And the `crti.o` object file contains the `_init` and `_fini` symbols. Let's try to link again with these two object files:

```
$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o \
-o factorial
```

And anyway we will get the same errors. Now we need to pass the `-lc` option to `ld`. This option tells the linker to search for the C standard library (`libc`) in its library search paths. Let's try to link again with the `-lc` option:

```
$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o -lc \
-o factorial
```

Finally we get an executable file, but if we try to run it, we will get strange results:

```
$ ./factorial
bash: ./factorial: No such file or directory
```

What's the problem here? Let's look at the executable file with the [readelf](https://sourceware.org/binutils/docs/binutils/readelf.html) util:

```
$ readelf -l factorial

Elf file type is EXEC (Executable file)
Entry point 0x4003c0
There are 7 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x0000000000000188 0x0000000000000188  R E    8
  INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000610 0x0000000000000610  R E    200000
  LOAD           0x0000000000000610 0x0000000000600610 0x0000000000600610
                 0x00000000000001cc 0x00000000000001cc  RW     200000
  DYNAMIC        0x0000000000000610 0x0000000000600610 0x0000000000600610
                 0x0000000000000190 0x0000000000000190  RW     8
  NOTE           0x00000000000001e4 0x00000000004001e4 0x00000000004001e4
                 0x0000000000000020 0x0000000000000020  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame
   03     .dynamic .got .got.plt .data
   04     .dynamic
   05     .note.ABI-tag
   06
```

Note the line:

```
  INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
```

The `.interp` section in the `elf` file holds the path name of a program interpreter, or in other words the `.interp` section simply contains an `ascii` string that is the name of the dynamic linker. The dynamic linker is the part of Linux that loads and links shared libraries needed by an executable when it is executed, by copying the content of libraries from disk to RAM. As we can see in the output of the `readelf` command, it is the `/lib64/ld-linux-x86-64.so.2` file for the `x86_64` architecture. Now let's add the `-dynamic-linker` option with the path of `ld-linux-x86-64.so.2` to the `ld` call and see the following results:

```
$ gcc -c main.c lib.c

$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o \
-dynamic-linker /lib64/ld-linux-x86-64.so.2 \
-lc -o factorial
```

Now we can run it as a normal executable file:

```
$ ./factorial

factorial of 5 is: 120
```

It works! With the first line we compile the `main.c` and the `lib.c` source code files to object files. We get `main.o` and `lib.o` after execution of `gcc`:

```
$ file lib.o main.o
lib.o:  ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
```

and after this we link the object files of our program with the needed system object files and libraries. We just saw a simple example of how to compile and link a C program with the `gcc` compiler and `GNU ld` linker. In this example we used a couple of command line options of the `GNU linker`, but it supports many more than `-o`, `-dynamic-linker`, etc. Moreover `GNU ld` has its own language that allows us to control the linking process. In the next two paragraphs we will look into it.

Useful command line options of the GNU linker
----------------------------------------------

As I already wrote and as you can see in the manual of the `GNU linker`, it has a big set of command line options. We've seen a couple of options in this post: `-o <output>`, which tells `ld` to produce an output file called `output` as the result of linking; `-l<name>`, which adds the archive or object file specified by the name; and `-dynamic-linker`, which specifies the name of the dynamic linker. Of course `ld` supports many more command line options; let's look at some of them.

The first useful command line option is `@file`. In this case `file` specifies a filename from which command line options will be read. For example, we can create a file with the name `linker.ld`, put our command line arguments from the previous example there, and execute it with:

```
$ ld @linker.ld
```

The next command line option is `-b` or `--format`. This command line option specifies the format of the input object files: `ELF`, `COFF`, etc. There is a command line option for the same purpose but for the output file: `--oformat=output-format`.

The next command line option is `--defsym`. The full format of this command line option is `--defsym=symbol=expression`. It allows us to create a global symbol in the output file containing the absolute address given by the expression. We can find a case where this command line option is useful: in the Linux kernel source code, more precisely in the Makefile related to kernel decompression for the ARM architecture - [arch/arm/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/arm/boot/compressed/Makefile) - we can find the following definition:

```
LDFLAGS_vmlinux = --defsym _kernel_bss_size=$(KBSS_SZ)
```

As we already know, it defines the `_kernel_bss_size` symbol with the size of the `.bss` section in the output file. This symbol will be used in the first [assembly file](https://github.com/torvalds/linux/blob/master/arch/arm/boot/compressed/head.S) that will be executed during kernel decompression:

```assembly
ldr r5, =_kernel_bss_size
```

The next command line option is `-shared`, which allows us to create a shared library. The `-M` (or `--print-map`) command line option prints the link map with information about symbols; `-Map=<filename>` writes the same map to a file. In our case:

```
$ ld -M @linker.ld
...
...
...
.text           0x00000000004003c0      0x112
 *(.text.unlikely .text.*_unlikely .text.unlikely.*)
 *(.text.exit .text.exit.*)
 *(.text.startup .text.startup.*)
 *(.text.hot .text.hot.*)
 *(.text .stub .text.* .gnu.linkonce.t.*)
 .text          0x00000000004003c0       0x2a /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
...
...
...
 .text          0x00000000004003ea       0x31 main.o
                0x00000000004003ea                main
 .text          0x000000000040041b       0x3f lib.o
                0x000000000040041b                factorial
```

Of course the `GNU linker` supports the standard command line options `--help` and `--version`, which print common usage help for `ld` and its version. That's all about command line options of the `GNU linker`. Of course it is not the full set of command line options supported by the `ld` util. You can find the complete documentation of the `ld` util in the manual.
|
||||
|
||||
Control Language linker
|
||||
----------------------------------------------
|
||||
|
||||
As I wrote previously, `ld` has support for its own language. It accepts Linker Command Language files written in a superset of AT&T's Link Editor Command Language syntax, to provide explicit and total control over the linking process. Let's look on its details.
|
||||
|
||||
With the linker language we can control:
|
||||
|
||||
* input files;
|
||||
* output files;
|
||||
* file formats
|
||||
* addresses of sections;
|
||||
* etc...
|
||||
|
||||
Commands written in the linker control language are usually placed in a file called linker script. We can pass it to `ld` with the `-T` command line option. The main command in a linker script is the `SECTIONS` command. Each linker script must contain this command and it determines the `map` of the output file. The special variable `.` contains current position of the output. Let's write a simple assembly program and we will look at how we can use a linker script to control linking of this program. We will take a hello world program for this example:
|
||||
|
||||
```assembly
|
||||
section .data
|
||||
msg db "hello, world!",`\n`
|
||||
section .text
|
||||
global _start
|
||||
_start:
|
||||
mov rax, 1
|
||||
mov rdi, 1
|
||||
mov rsi, msg
|
||||
mov rdx, 14
|
||||
syscall
|
||||
mov rax, 60
|
||||
mov rdi, 0
|
||||
syscall
|
||||
```
|
||||
|
||||
We can compile and link it with the following commands:
|
||||
|
||||
```
|
||||
$ nasm -f elf64 -o hello.o hello.asm
|
||||
$ ld -o hello hello.o
|
||||
```
|
||||
|
||||
Our program consists from two sections: `.text` contains code of the program and `.data` contains initialized variables. Let's write simple linker script and try to link our `hello.asm` assembly file with it. Our script is:
|
||||
|
||||
```
|
||||
/*
|
||||
* Linker script for the factorial
|
||||
*/
|
||||
OUTPUT(hello)
|
||||
OUTPUT_FORMAT("elf64-x86-64")
|
||||
INPUT(hello.o)
|
||||
|
||||
SECTIONS
|
||||
{
|
||||
. = 0x200000;
|
||||
.text : {
|
||||
*(.text)
|
||||
}
|
||||
|
||||
. = 0x400000;
|
||||
.data : {
|
||||
*(.data)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
On the first three lines you can see a comment written in `C` style. After it the `OUTPUT` and the `OUTPUT_FORMAT` commands specify the name of our executable file and its format. The next command, `INPUT`, specifies the input file to the `ld` linker. Then, we can see the main `SECTIONS` command, which, as I already wrote, must be present in every linker script. The `SECTIONS` command represents the set and order of the sections which will be in the output file. At the beginning of the `SECTIONS` command we can see following line `. = 0x200000`. I already wrote above that `.` command points to the current position of the output. This line says that the code should be loaded at address `0x200000` and the line `. = 0x400000` says that data section should be loaded at address `0x400000`. The second line after the `. = 0x200000` defines `.text` as an output section. We can see `*(.text)` expression inside it. The `*` symbol is wildcard that matches any file name. In other words, the `*(.text)` expression says all `.text` input sections in all input files. We can rewrite it as `hello.o(.text)` for our example. After the following location counter `. = 0x400000`, we can see definition of the data section.

We can compile and link it with:

```
$ nasm -f elf64 -o hello.o hello.asm && ld -T linker.script && ./hello
hello, world!
```

If we look inside it with the `objdump` util, we can see that the `.text` section starts from the address `0x200000` and the `.data` section starts from the address `0x400000`:

```
$ objdump -D hello

Disassembly of section .text:

0000000000200000 <_start>:
  200000:	b8 01 00 00 00       	mov    $0x1,%eax
...

Disassembly of section .data:

0000000000400000 <msg>:
  400000:	68 65 6c 6c 6f       	pushq  $0x6f6c6c65
...
```

Apart from the commands we have already seen, there are a few others. The first is `ASSERT(exp, message)`, which ensures that the given expression is not zero. If it is zero, the linker exits with an error code and prints the given error message. If you've read about the Linux kernel booting process in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book, you may know that the setup header of the Linux kernel has offset `0x1f1`. In the linker script of the Linux kernel we can find a check for this:

```
. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
```

The `INCLUDE filename` command allows us to include external linker script symbols in the current one. In a linker script we can assign a value to a symbol. `ld` supports a couple of assignment operators:

* symbol = expression ;
* symbol += expression ;
* symbol -= expression ;
* symbol *= expression ;
* symbol /= expression ;
* symbol <<= expression ;
* symbol >>= expression ;
* symbol &= expression ;
* symbol |= expression ;

As you can note, all operators are C assignment operators. For example we can use them in our linker script as:

```
START_ADDRESS = 0x200000;
DATA_OFFSET   = 0x200000;

SECTIONS
{
	. = START_ADDRESS;
	.text : {
	      *(.text)
	}

	. = START_ADDRESS + DATA_OFFSET;
	.data : {
	      *(.data)
	}
}
```

As you may have already noted, the syntax for expressions in the linker script language is identical to that of C expressions. Besides this, the control language of the linker supports the following builtin functions:

* `ABSOLUTE` - returns the absolute value of the given expression;
* `ADDR` - takes a section and returns its address;
* `ALIGN` - returns the value of the location counter (the `.` operator) aligned to the boundary given by the expression;
* `DEFINED` - returns `1` if the given symbol is placed in the global symbol table and `0` otherwise;
* `MAX` and `MIN` - return the maximum and minimum of the two given expressions;
* `NEXT` - returns the next unallocated address that is a multiple of the given expression;
* `SIZEOF` - returns the size in bytes of the given named section.

That's all.

Conclusion
-----------------

This is the end of the post about linkers. We learned many things about linkers in this post, such as what a linker is and why it is needed, how to use it, etc.

If you have any questions or suggestions, write me an [email](kuleshovmail@gmail.com) or ping [me](https://twitter.com/0xAX) on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please let me know via email or send a PR.

Links
-----------------

* [Book about Linux kernel insides](http://0xax.gitbooks.io/linux-insides/content/)
* [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29)
* [object files](https://en.wikipedia.org/wiki/Object_file)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [opcode](https://en.wikipedia.org/wiki/Opcode)
* [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
* [GNU linker](https://en.wikipedia.org/wiki/GNU_linker)
* [My posts about assembly programming for x86_64](http://0xax.github.io/categories/assembly/)
* [readelf](https://sourceware.org/binutils/docs/binutils/readelf.html)
62
README.md
62
README.md
@@ -9,7 +9,6 @@ Linux Insides
|
||||
|
||||
**问题/建议**: 通过在 twitter 上 [@0xAX](https://twitter.com/0xAX) ,直接添加 [issue](https://github.com/0xAX/linux-insides/issues/new) 或者直接给我发[邮件](mailto:anotherworldofworld@gmail.com),请自由地向我提出任何问题或者建议。
|
||||
|
||||
|
||||
##翻译进度
|
||||
|
||||
| 章节|译者|翻译进度|
|
||||
@@ -21,7 +20,18 @@ Linux Insides
|
||||
|├ 1.3|[@hailincai](https://github.com/hailincai)|已完成|
|
||||
|├ 1.4|[@zmj1316](https://github.com/zmj1316)|已完成|
|
||||
|└ 1.5|[@chengong](https://github.com/chengong)|正在进行|
|
||||
| 2. Initialization|[@lijiangsheng1](https://github.com/lijiangsheng1)|正在进行|
|
||||
| 2. Initialization||正在进行|
|
||||
|├ 2.0|[@mudongliang](https://github.com/mudongliang)|已完成|
|
||||
|├ 2.1||未开始|
|
||||
|├ 2.2||未开始|
|
||||
|├ 2.3||未开始|
|
||||
|├ 2.4||未开始|
|
||||
|├ 2.5||未开始|
|
||||
|├ 2.6||未开始|
|
||||
|├ 2.7||未开始|
|
||||
|├ 2.8||未开始|
|
||||
|├ 2.9||未开始|
|
||||
|└ 2.10||未开始|
|
||||
| 3. Interrupts||正在进行|
|
||||
|├ 3.0|[@littleneko](https://github.com/littleneko)|正在进行|
|
||||
|├ 3.1|[@littleneko](https://github.com/littleneko)|正在进行|
|
||||
@@ -32,22 +42,50 @@ Linux Insides

|├ 3.6|[@cloudusers](https://github.com/cloudusers)|正在进行|
|├ 3.7|[@cloudusers](https://github.com/cloudusers)|正在进行|
|├ 3.8|[@cloudusers](https://github.com/cloudusers)|正在进行|
|├ 3.9|[@zhangyangjing](https://github.com/zhangyangjing)|正在进行|
|├ 3.9|[@zhangyangjing](https://github.com/zhangyangjing)|已完成|
|└ 3.10||未开始|
| 4. System calls|[@qianmoke](https://github.com/qianmoke)|正在进行|
| 5. Timers and time management|[@icecoobe](https://github.com/icecoobe)|正在进行|
| 6. Synchronization primitives||未开始|
| 7. Memory management|[@choleraehyq](https://github.com/choleraehyq)|已完成|
| 4. System calls||正在进行|
|├ 4.0|[@mudongliang](https://github.com/mudongliang)|已完成|
|├ 4.1|[@qianmoke](https://github.com/qianmoke)|已完成|
|├ 4.2|[@qianmoke](https://github.com/qianmoke)|已完成|
|├ 4.3||未开始|
|└ 4.4||未开始|
| 5. Timers and time management||正在进行|
|├ 5.0|[@mudongliang](https://github.com/mudongliang)|已完成|
|├ 5.1||未开始|
|├ 5.2||未开始|
|├ 5.3||未开始|
|├ 5.4||未开始|
|├ 5.5||未开始|
|├ 5.6||未开始|
|└ 5.7||未开始|
| 6. Synchronization primitives||正在进行|
|├ 6.0|[@mudongliang](https://github.com/mudongliang)|已完成|
|├ 6.1||未开始|
|├ 6.2||未开始|
|├ 6.3|[@huxq](https://github.com/huxq)|已完成|
|├ 6.4|[@huxq](https://github.com/huxq)|正在进行|
|└ 6.5||未开始|
| 7. Memory management||正在进行|
|├ 7.0|[@mudongliang](https://github.com/mudongliang)|已完成|
|├ 7.1|[@choleraehyq](https://github.com/choleraehyq)|已完成|
|└ 7.2|[@choleraehyq](https://github.com/choleraehyq)|已完成|
| 8. SMP||未开始|
| 9. Concepts||未开始|
| 10. DataStructures||已完成|
| 9. Concepts||正在进行|
|├ 9.0|[@mudongliang](https://github.com/mudongliang)|已完成|
|├ 9.1||未开始|
|├ 9.2||未开始|
|└ 9.3||未开始|
| 10. DataStructures||正在进行|
|├ 10.0|[@mudongliang](https://github.com/mudongliang)|已完成|
|├ 10.1|[@oska874](http://github.com/oska874) [@mudongliang](https://github.com/mudongliang)|已完成|
|└ 10.2|[@oska874](https://github.com/oska874)|已完成|
| 11. Theory||已完成|
|├ 10.2|[@oska874](https://github.com/oska874)|已完成|
|└ 10.3||未开始|
| 11. Theory||正在进行|
|├ 11.0|[@mudongliang](https://github.com/mudongliang)|已完成|
|├ 11.1|[@mudongliang](https://github.com/mudongliang)|已完成|
|└ 11.2|[@mudongliang](https://github.com/mudongliang)|已完成|
|├ 11.2|[@mudongliang](https://github.com/mudongliang)|已完成|
|└ 11.3||未开始|
| 12. Initial ram disk||未开始|
| 13. Misc||正在进行|
|├ 13.0|[@mudongliang](https://github.com/mudongliang)|已完成|

9 SyncPrim/README.md Normal file
@@ -0,0 +1,9 @@

# Linux 内核中的同步原语

这个章节描述内核中所有的同步原语。

* [自旋锁简介](http://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) - 这个章节的第一部分描述 Linux 内核中自旋锁机制的实现;
* [队列自旋锁](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html) - 第二部分描述自旋锁的另一种类型 - 队列自旋锁;
* [信号量](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) - 这个部分描述 Linux 内核中同步原语 `semaphore` 的实现;
* [互斥锁](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html) - 这个部分描述 Linux 内核中的 `mutex` ;
* [读者/写者信号量](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) - 这个部分描述特殊类型的信号量 - `reader/writer` 信号量;

352 SyncPrim/sync-3.md Normal file
@@ -0,0 +1,352 @@
|
||||
|
||||
内核同步原语. 第三部分.
|
||||
================================================================================
|
||||
|
||||
信号量
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
这是本章的第三部分 [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html),本章描述了内核中的同步原语,在之前的部分我们见到了特殊的 [自旋锁](https://en.wikipedia.org/wiki/Spinlock) - `排队自旋锁`。 在更前的 [部分](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html) 是和 `自旋锁` 相关的描述。我们将描述更多同步原语。
|
||||
|
||||
在 `自旋锁` 之后的下一个我们将要讲到的 [内核同步原语](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)是 [信号量](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)。我们会从理论角度开始学习什么是 `信号量`, 然后我们会像前几章一样讲到Linux内核是如何实现信号量的。
|
||||
|
||||
好吧,现在我们开始。
|
||||
|
||||
介绍Linux内核中的信号量
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
那么究竟什么是 `信号量` ?就像你可以猜到那样 - `信号量` 是另外一种支持线程或者进程的同步机制。Linux内核已经提供了一种同步机制 - `自旋锁`, 为什么我们还需要另外一种呢?为了回答这个问题,我们需要理解这两种机制。我们已经熟悉了 `自旋锁` ,因此我们从 `信号量` 机制开始。
|
||||
|
||||
`自旋锁` 的设计理念是它仅会被持有非常短的时间。 但持有自旋锁的时候我们不可以进入睡眠模式因为其他的进程在等待我们。为了防止 [死锁](https://en.wikipedia.org/wiki/Deadlock) [上下文交换](https://en.wikipedia.org/wiki/Context_switch) 也是不允许的。
|
||||
|
||||
当需要长时间持有一个锁的时候 [信号量](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) 就是一个很好的解决方案。从另一个方面看,这个机制对于需要短期持有锁的应用并不是最优。为了理解这个问题,我们需要知道什么是 `信号量`。
|
||||
|
||||
就像一般的同步原语,`信号量` 是基于变量的。这个变量可以变大或者减少,并且这个变量的状态代表了获取锁的能力。注意这个变量的值并不限于 `0` 和 `1`。有两种类型的 `信号量`:
|
||||
|
||||
* `二值信号量`;
|
||||
* `普通信号量`.
|
||||
|
||||
第一种 `信号量` 的值可以为 `1` 或者 `0`。第二种 `信号量` 的值可以为任何非负数。如果 `信号量` 的值大于 `1` 那么它被叫做 `计数信号量`,并且它允许多于 `1` 个进程获取它。这种机制允许我们记录现有的资源,而 `自旋锁` 只允许我们为一个任务上锁。除了所有这些之外,另外一个重要的点是 `信号量` 允许进入睡眠状态。 另外当某进程在等待一个被其他进程获取的锁时, [调度器](https://en.wikipedia.org/wiki/Scheduling_%28computing%29) 也许会切换别的进程。
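上面关于计数信号量的描述可以用用户态的 POSIX 信号量直观地演示一下(这只是一个类比性的示意:`sem_init`/`sem_wait`/`sem_post` 是 C 库接口,并不是内核内部的 `down()`/`up()` 实现):

```c
#include <assert.h>
#include <semaphore.h>

/* 计数信号量示意:初始计数为 2,允许两个"任务"同时持有资源 */
int counting_sem_demo(void)
{
    sem_t sem;
    int value;

    sem_init(&sem, 0, 2);              /* 计数信号量,初始资源数为 2 */

    sem_wait(&sem);                    /* 第一次获取资源,计数 2 -> 1 */
    sem_wait(&sem);                    /* 第二次获取资源,计数 1 -> 0 */

    assert(sem_trywait(&sem) == -1);   /* 资源耗尽,非阻塞获取失败 */

    sem_post(&sem);                    /* 释放一个资源,计数 0 -> 1 */
    sem_getvalue(&sem, &value);

    sem_destroy(&sem);
    return value;                      /* 返回最终计数 */
}
```

可以看到,同一个信号量被成功获取了两次,这正是 `自旋锁` 做不到的。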

信号量 API
--------------------------------------------------------------------------------

我们已经从理论方面了解了一些 `信号量` 的知识,现在来看看它在 Linux 内核中是如何实现的。所有与 `信号量` 相关的 [API](https://en.wikipedia.org/wiki/Application_programming_interface) 都位于 [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/master/include/linux/semaphore.h) 头文件中。

我们看到 `信号量` 机制是由以下结构体表示的:

```C
struct semaphore {
	raw_spinlock_t		lock;
	unsigned int		count;
	struct list_head	wait_list;
};
```

在内核中,`信号量` 结构体由三部分组成:

* `lock` - 保护 `信号量` 的 `自旋锁`;
* `count` - 现有资源的数量;
* `wait_list` - 等待获取此锁的进程队列.

在考察 Linux 内核的 `信号量` [API](https://en.wikipedia.org/wiki/Application_programming_interface) 之前,我们需要知道如何初始化一个 `信号量`。事实上,Linux 内核提供了两种初始化 `信号量` 的方法,分别是:

* `静态`;
* `动态`.

我们先来看第一种,静态初始化 `信号量`。我们可以使用 `DEFINE_SEMAPHORE` 宏将 `信号量` 静态初始化:

```C
#define DEFINE_SEMAPHORE(name)	\
	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
```

就像我们看到的那样,`DEFINE_SEMAPHORE` 宏只提供了初始化 `二值` 信号量的能力。`DEFINE_SEMAPHORE` 宏展开为 `信号量` 结构体的定义,该结构体通过 `__SEMAPHORE_INITIALIZER` 宏初始化。我们来看看这个宏的实现:

```C
#define __SEMAPHORE_INITIALIZER(name, n)			\
{								\
	.lock		= __RAW_SPIN_LOCK_UNLOCKED((name).lock),	\
	.count		= n,						\
	.wait_list	= LIST_HEAD_INIT((name).wait_list),		\
}
```

`__SEMAPHORE_INITIALIZER` 宏接受 `信号量` 结构体的名字,并初始化这个结构体的各个域。首先使用 `__RAW_SPIN_LOCK_UNLOCKED` 宏为给定的 `信号量` 初始化一个 `自旋锁`。就像你在[之前](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/sync-1.html)的部分看到的那样,`__RAW_SPIN_LOCK_UNLOCKED` 宏在 [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) 头文件中定义,它展开为 `__ARCH_SPIN_LOCK_UNLOCKED` 宏,而后者又展开为零,即无锁状态:

```C
#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
```

`信号量` 的最后两个域 `count` 和 `wait_list` 则分别用现有资源的数量和空[链表](https://xinqiu.gitbooks.io/linux-insides-cn/content/DataStructures/dlist.html)来初始化。

第二种初始化 `信号量` 的方式是将 `信号量` 和现有资源数目传给 `sema_init` 函数。这个函数在 [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/master/include/linux/semaphore.h) 头文件中定义:

```C
static inline void sema_init(struct semaphore *sem, int val)
{
	static struct lock_class_key __key;
	*sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
	lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
}
```

我们来看看这个函数的实现。它很简单:使用我们刚看到的 `__SEMAPHORE_INITIALIZER` 宏对传入的 `信号量` 进行初始化。就像在之前的[部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/SyncPrim/index.html)中写的那样,我们将跳过 Linux 内核中关于[锁验证](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)的部分。

现在我们知道了如何初始化一个 `信号量`,接下来看看如何上锁和解锁。Linux 内核提供了如下操作 `信号量` 的 [API](https://en.wikipedia.org/wiki/Application_programming_interface):

```C
void down(struct semaphore *sem);
void up(struct semaphore *sem);
int  down_interruptible(struct semaphore *sem);
int  down_killable(struct semaphore *sem);
int  down_trylock(struct semaphore *sem);
int  down_timeout(struct semaphore *sem, long jiffies);
```

前两个函数 `down` 和 `up` 分别用来获取和释放 `信号量`。`down_interruptible` 函数试图获取一个 `信号量`:如果成功,`信号量` 的计数将被减少,锁随之被获取;否则当前任务将进入受阻状态,也就是说 `TASK_INTERRUPTIBLE` 标志将被置位。`TASK_INTERRUPTIBLE` 表示该进程可以被[信号](https://en.wikipedia.org/wiki/Unix_signal)中断并退出等待。

`down_killable` 函数和 `down_interruptible` 函数提供类似的功能,但它置位的是当前进程的 `TASK_KILLABLE` 标志,表示等待中的进程只能被致命信号中断。

`down_trylock` 函数和 `spin_trylock` 函数相似:它试图获取一个锁,如果失败就立即返回,想获取锁的进程不会等待。最后的 `down_timeout` 函数也试图获取一个锁,但当超过传入的等待时间后,当前进程将被中断并退出等待状态。另外你也许注意到了,这个等待时间是以 [jiffies](https://xinqiu.gitbooks.io/linux-insides-cn/content/Timers/timers-1.html) 计数的。

我们刚刚看了 `信号量` [API](https://en.wikipedia.org/wiki/Application_programming_interface) 的定义,现在从 `down` 函数开始。这个函数在 [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) 源文件中定义,我们来看它的实现:

```C
void down(struct semaphore *sem)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&sem->lock, flags);
	if (likely(sem->count > 0))
		sem->count--;
	else
		__down(sem);
	raw_spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);
```

我们先看 `down` 函数起始处定义的 `flags` 变量。这个变量将被传入 `raw_spin_lock_irqsave` 和 `raw_spin_unlock_irqrestore` 宏。这些宏在 [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) 头文件中定义,用来保护当前 `信号量` 的计数器。事实上这两个宏的作用和 `spin_lock`、`spin_unlock` 宏相似,只不过这组宏还会保存/恢复当前的中断标志并禁止[中断](https://en.wikipedia.org/wiki/Interrupt)。

就像你猜到的那样,`down` 函数的主要工作就在 `raw_spin_lock_irqsave` 和 `raw_spin_unlock_irqrestore` 宏调用之间完成:我们将 `信号量` 的计数器和零比较,如果计数器大于零,就减少计数器,这表示我们已经获取了锁;否则计数器为零,表示所有的现有资源都已被占用,我们需要等待以获取这个锁,此时 `__down` 函数将被调用。

`__down` 函数在[相同](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c)的源文件中定义,它的实现如下:

```C
static noinline void __sched __down(struct semaphore *sem)
{
	__down_common(sem, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
```

`__down` 函数仅仅调用了 `__down_common` 函数,并传入了三个参数:

* `semaphore` - 信号量本身;
* `state` - 当前任务将被设置的状态标志;
* `timeout` - 等待 `信号量` 的最长时间.

在看 `__down_common` 函数之前,注意 `down_interruptible`、`down_killable` 和 `down_timeout` 的实现也都基于 `__down_common` 函数。`__down_interruptible` 函数:

```C
static noinline int __sched __down_interruptible(struct semaphore *sem)
{
	return __down_common(sem, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
```

`__down_killable` 函数:

```C
static noinline int __sched __down_killable(struct semaphore *sem)
{
	return __down_common(sem, TASK_KILLABLE, MAX_SCHEDULE_TIMEOUT);
}
```

`__down_timeout` 函数:

```C
static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
{
	return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
}
```

现在我们来看看 `__down_common` 函数的实现。这个函数在 [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) 源文件中定义,它从以下两个局部变量的定义开始:

```C
struct task_struct *task = current;
struct semaphore_waiter waiter;
```

第一个变量表示当前处理器上想获取锁的任务。`current` 宏在 [arch/x86/include/asm/current.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/current.h) 头文件中定义:

```C
#define current get_current()
```

`get_current` 函数返回 `current_task` [per-cpu](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html) 变量的值:

```C
DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
	return this_cpu_read_stable(current_task);
}
```

第二个变量 `waiter` 表示 `semaphore.wait_list` 列表的一个表项:

```C
struct semaphore_waiter {
	struct list_head list;
	struct task_struct *task;
	bool up;
};
```

下一步我们将当前任务加入到 `wait_list` 中,并填充 `waiter` 的各个域:

```C
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = false;
```

然后我们进入如下的无限循环:

```C
for (;;) {
	if (signal_pending_state(state, task))
		goto interrupted;

	if (unlikely(timeout <= 0))
		goto timed_out;

	__set_task_state(task, state);

	raw_spin_unlock_irq(&sem->lock);
	timeout = schedule_timeout(timeout);
	raw_spin_lock_irq(&sem->lock);

	if (waiter.up)
		return 0;
}
```

在之前的代码中我们已将 `waiter.up` 设置为 `false`,所以在 `up` 被设置为 `true` 之前,任务将在这个无限循环中循环。循环从检查当前任务是否处于 `pending` 状态开始,也就是说它的标志中包含 `TASK_INTERRUPTIBLE` 或者 `TASK_WAKEKILL` 标志。我之前写过,任务在等待获取锁的时候可以被[信号](https://en.wikipedia.org/wiki/Unix_signal)中断。`signal_pending_state` 函数在 [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) 头文件中定义,它看起来如下:

```C
static inline int signal_pending_state(long state, struct task_struct *p)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;
	if (!signal_pending(p))
		return 0;

	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}
```

我们先检测 `state` [位掩码](https://en.wikipedia.org/wiki/Mask_%28computing%29)是否包含 `TASK_INTERRUPTIBLE` 或者 `TASK_WAKEKILL` 位,如果不包含,函数退出;接着检测当前任务是否有挂起信号,如果没有,函数也退出;最后检测 `state` 位掩码的 `TASK_INTERRUPTIBLE` 位。如果任务有挂起信号,我们将跳转到 `interrupted` 标签:

```C
interrupted:
	list_del(&waiter.list);
	return -EINTR;
```

在这个标签中,我们将任务从等待锁的列表中删除,然后返回 `-EINTR` [错误码](https://en.wikipedia.org/wiki/Errno.h)。如果任务没有挂起信号,我们检测超时是否小于等于零:

```C
if (unlikely(timeout <= 0))
	goto timed_out;
```

若超时已过期,我们跳转到 `timed_out` 标签:

```C
timed_out:
	list_del(&waiter.list);
	return -ETIME;
```

在这个标签里,我们做和 `interrupted` 标签一样的事情:将任务从锁等待者列表中删除,但返回 `-ETIME` 错误码。如果任务既没有挂起信号,给定的超时也没有过期,当前任务将被设置为传入的 `state`:

```C
__set_task_state(task, state);
```

然后调用 `schedule_timeout` 函数:

```C
raw_spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
raw_spin_lock_irq(&sem->lock);
```

这个函数在 [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) 源文件中定义,它使当前任务休眠,直到设定的超时到期为止。

这就是 `__down_common` 函数的全部内容。如果一个任务想要获取一个已经被其它任务持有的锁,它将进入这个无限循环,除非它被信号中断、设定的超时过期,或者当前持有锁的任务释放了锁。现在我们来看看 `up` 函数的实现。

`up` 函数和 `down` 函数定义在[同一个](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c)源文件中。这个函数的主要功能是释放锁,它看起来如下:

```C
void up(struct semaphore *sem)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&sem->lock, flags);
	if (likely(list_empty(&sem->wait_list)))
		sem->count++;
	else
		__up(sem);
	raw_spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(up);
```

它看起来和 `down` 函数相似,但行为不同:如果等待列表是空的,我们只需增加 `semaphore` 的计数;否则说明有任务在等待,我们调用同一源文件中定义的 `__up` 函数,允许等待列表中的第一个任务获取锁:

```C
static noinline void __sched __up(struct semaphore *sem)
{
	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
					struct semaphore_waiter, list);
	list_del(&waiter->list);
	waiter->up = true;
	wake_up_process(waiter->task);
}
```

在此我们获取等待列表中的第一个任务,将它从列表中删除,并将它的 `waiter->up` 设置为 `true`。从此刻起,`__down_common` 函数中的无限循环将会停止。`wake_up_process` 函数在 `__up` 函数的结尾被调用:还记得我们在 `__down_common` 函数中调用了 `schedule_timeout` 函数,它将当前任务置于睡眠状态直到超时到期。由于我们的进程此刻可能正在睡眠,我们需要唤醒它,这就是为什么要调用 [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) 源文件中的 `wake_up_process` 函数。

以上就是全部内容了。

小结
--------------------------------------------------------------------------------

这是 Linux 内核[同步原语](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)章节第三部分的结尾。在前两部分,我们已经见到了第一个 Linux 内核同步原语 `自旋锁`,它使用 `ticket spinlock` 实现,用于持有时间很短的锁。在这一部分我们见到了另外一种同步原语 - [信号量](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)。信号量用于需要长时间持有锁的场景,因为等待它会导致[上下文切换](https://en.wikipedia.org/wiki/Context_switch)。在下一部分,我们将继续深入 Linux 内核的同步原语,并讨论另一个同步原语 - [互斥量](https://en.wikipedia.org/wiki/Mutual_exclusion)。

如果你有问题或者建议,请在 twitter 上联系我 [0xAX](https://twitter.com/0xAX),通过[邮件](mailto:anotherworldofworld@gmail.com)联系我,或者创建一个 [issue](https://github.com/0xAX/linux-insides/issues/new)。

链接
--------------------------------------------------------------------------------

* [spinlocks](https://en.wikipedia.org/wiki/Spinlock)
* [synchronization primitive](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
* [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)
* [context switch](https://en.wikipedia.org/wiki/Context_switch)
* [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
* [deadlocks](https://en.wikipedia.org/wiki/Deadlock)
* [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29)
* [Doubly linked list in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29)
* [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL)
* [errno](https://en.wikipedia.org/wiki/Errno.h)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html)
406 SysCall/syscall-1.md Normal file

@@ -0,0 +1,406 @@

Linux 内核系统调用 第一节
================================================================================

简介
--------------------------------------------------------------------------------

这次提交为 [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) 添加一个新的章节。从标题就可以知道,这一章节将介绍 Linux 内核中[系统调用](https://en.wikipedia.org/wiki/System_call)的概念。章节内容的选择并非偶然:在前一[章节](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)中我们了解了中断及中断处理。系统调用的概念与中断非常相似,这是因为软件中断是执行系统调用最常见的方式。我们将讨论系统调用概念的各个方面,例如用户空间发起系统调用的细节、内核中一组系统调用处理程序的执行过程、[VDSO](https://en.wikipedia.org/wiki/VDSO) 和 [vsyscall](https://lwn.net/Articles/446528/) 的概念等。

在了解 Linux 内核系统调用的执行过程之前,先了解一些系统调用的原理是有帮助的。我们从下面的段落开始。

什么是系统调用?
--------------------------------------------------------------------------------

系统调用是用户空间程序向内核请求服务的方式。操作系统内核提供很多服务:当程序读写文件、开始监听连接的 [socket](https://en.wikipedia.org/wiki/Network_socket)、删除或创建目录,或者在程序结束时,都会执行系统调用。换句话说,系统调用只是一些内核空间中的 [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) 函数,用户空间程序调用它们来处理请求。

Linux 内核提供的这组函数与 CPU 架构相关。例如:[x86_64](https://en.wikipedia.org/wiki/X86-64) 提供 [322](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl) 个系统调用,[x86](https://en.wikipedia.org/wiki/X86) 提供 [358](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_32.tbl) 个不同的系统调用。

系统调用仅仅是一些函数。我们来讨论一个使用汇编语言编写的简单 `Hello world` 示例:

```assembly
.data

msg:
    .ascii "Hello, world!\n"
    len = . - msg

.text
    .global _start

_start:
    movq  $1, %rax
    movq  $1, %rdi
    movq  $msg, %rsi
    movq  $len, %rdx
    syscall

    movq  $60, %rax
    xorq  %rdi, %rdi
    syscall
```

使用下面的命令可编译这些语句:

```
$ gcc -c test.S
$ ld -o test test.o
```

执行:

```
./test
Hello, world!
```

这是一个简单的 Linux `x86_64` 架构的 `Hello world` 汇编程序,代码包含两个段:

* `.data`
* `.text`

第一个段 `.data` 存储程序的初始数据(在示例中为 `Hello world` 字符串)。第二个段 `.text` 包含程序的代码。程序可分为两部分:第一部分为第一个 `syscall` 指令之前的代码,第二部分为两个 `syscall` 指令之间的代码。首先,在示例程序及一般应用中,`syscall` 指令有什么功能?[64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) 中提到:

```
SYSCALL 在特权级 0 调用操作系统的系统调用处理程序。它通过把 IA32_LSTAR MSR 加载到 RIP 来完成
(在此之前会把 SYSCALL 之后下一条指令的地址保存到 RCX 中)。
(WRMSR 指令确保 IA32_LSTAR MSR 总是包含一个规范地址。)
...
...
...
SYSCALL 将 IA32_STAR MSR 的 47:32 位加载至 CS 和 SS 段选择器。
因此,CS 和 SS 这两个段选择器对应的描述符缓存并不是从描述符表(GDT 或 LDT)中加载的,而是从固定值加载。
操作系统软件需要确保,由段选择器得到的描述符与从固定值加载至描述符缓存的描述符保持一致。SYSCALL 指令不保证两者的一致。
```

也就是说,`syscall` 指令会跳转到系统调用的入口。这个入口是 [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) 汇编文件中定义的 `entry_SYSCALL_64`,它的地址在内核初始化时被写入 `MSR_LSTAR` [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register),见 [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) 源文件:

```C
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
```

因此,`syscall` 指令会调用指定的系统调用处理程序。但是如何确定调用哪个系统调用?事实上,这一信息从通用[寄存器](https://en.wikipedia.org/wiki/Processor_register)中得到。正如系统调用[表](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)中描述的,每个系统调用对应特定的编号。上面的示例中,第一个系统调用是 `write`,它将数据写入指定文件。在系统调用表中查找,[write](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L10) 系统调用的编号为 `1`。在示例中我们通过 `rax` 寄存器传递该编号;接下来的几个通用寄存器 `%rdi`、`%rsi` 和 `%rdx` 保存 `write` 系统调用的三个参数,在示例中分别为[文件描述符](https://en.wikipedia.org/wiki/File_descriptor)(`1` 是 [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29))、指向字符串的指针、数据的大小。没错,这些就是系统调用的参数。正如上文所说,系统调用仅仅是内核空间的 `C` 函数。示例中第一个系统调用为 `write`,它在 [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) 源文件中定义如下:

```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	...
	...
	...
}
```

或者换言之:

```C
ssize_t write(int fd, const void *buf, size_t nbytes);
```

现在不用担心宏 `SYSCALL_DEFINE3`,稍后再做讨论。

示例的第二部分也是一样的,但调用了另一个系统调用 [exit](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L69)。这个系统调用仅需一个参数:

* 退出码

这个参数表示程序退出的状态。[strace](https://en.wikipedia.org/wiki/Strace) 工具可以根据程序的名称跟踪其系统调用过程:

```
$ strace test
execve("./test", ["./test"], [/* 62 vars */]) = 0
write(1, "Hello, world!\n", 14Hello, world!
)           = 14
_exit(0)                                = ?

+++ exited with 0 +++
```

`strace` 输出的第一行中,[execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) 系统调用开始执行程序,第二、三行为程序中使用的系统调用:`write` 和 `exit`。注意示例中通过通用寄存器传递系统调用的参数。寄存器的顺序是特定的,由 [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions) 调用约定定义。

`x86_64` 架构的调用约定在另一个专门的文档中声明 - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf)。通常,函数参数被放置在寄存器或者堆栈中,前六个参数对应的寄存器顺序为:

* `rdi`;
* `rsi`;
* `rdx`;
* `rcx`;
* `r8`;
* `r9`.

这些寄存器对应函数的前六个参数;若函数多于六个参数,其余参数将被放在堆栈中。
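不用汇编也可以按这个约定直接发起系统调用:glibc 提供的 `syscall(2)` 封装会把系统调用编号放进 `rax`,并把后续参数依次放入上面列出的寄存器。下面是汇编示例中第一个 `syscall` 的 C 版本示意(编号 `SYS_write` 与寄存器分配由 C 库完成):

```c
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 等价于汇编示例中的:
 *   movq $1,%rax; movq $1,%rdi; movq $msg,%rsi; movq $len,%rdx; syscall
 * SYS_write(编号 1)-> rax,三个参数依次通过 rdi、rsi、rdx 传递。 */
long raw_write(const char *msg)
{
    return syscall(SYS_write, 1, msg, strlen(msg));
}
```

`write` 成功时返回写入的字节数,所以 `raw_write("Hello, world!\n")` 返回 `14`,与上面 `strace` 输出中的返回值一致。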
|
||||
|
||||
示例代码中未直接使用系统调用,但程序通过系统调用打印输出,检查文件的权限或是从文件中读写。
|
||||
|
||||
例如:
|
||||
|
||||
```C
|
||||
#include <stdio.h>
|
||||
|
||||
int main(int argc, char **argv)
|
||||
{
|
||||
FILE *fp;
|
||||
char buff[255];
|
||||
|
||||
fp = fopen("test.txt", "r");
|
||||
fgets(buff, 255, fp);
|
||||
printf("%s\n", buff);
|
||||
fclose(fp);
|
||||
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
Linux内核中没有 `fopen`, `fgets`, `printf` 和 `fclose` 系统调用,而是 `open`, `read` `write` 和 `close`。`fopen`, `fgets`, `printf` 和 `fclose` 仅仅是 `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library)中定义的函数。事实上这些函数是系统调用的封装。代码中没有直接使用系统调用,而是通过标准库的[封装](https://en.wikipedia.org/wiki/Wrapper_function)函数。这样做的主要原因是: 系统调用执行的要快,非常快。由于系统调用快的同时也非常小。标准库在执行系统调用前,确保系统调用参数设置正确及完成其他不同的检查。对比示例程序和以下命令:
|
||||
|
||||
```
|
||||
$ gcc test.c -o test
|
||||
```
|
||||
|
||||
通过[ltrace](https://en.wikipedia.org/wiki/Ltrace)工具观察:
|
||||
|
||||
```
|
||||
$ ltrace ./test
|
||||
__libc_start_main([ "./test" ] <unfinished ...>
|
||||
fopen("test.txt", "r") = 0x602010
|
||||
fgets("Hello World!\n", 255, 0x602010) = 0x7ffd2745e700
|
||||
puts("Hello World!\n"Hello World!
|
||||
|
||||
) = 14
|
||||
fclose(0x602010) = 0
|
||||
+++ exited (status 0) +++
|
||||
```
|
||||
|
||||
`ltrace`工具显示程序用户空间的调用。 `fopen` 函数打开给定的文本文件, `fgets` 函数读取文件内容至 `buf` 缓存,`puts` 输出文件内容至 `stdout` , `fclose` 函数根据文件描述符关闭函数。如上文描述,这些函数调用特定的系统调用。例如: `puts` 内部调用 `write` 系统调用,`ltrace` 添加 `-S`可观察到这一调用:
|
||||
|
||||
```
|
||||
write@SYS(1, "Hello World!\n\n", 14) = 14
|
||||
```
|
||||
|
||||
系统调用是普遍存在的。每个程序都需要打开/写/读文件,网络连接,内存分配和许多其他功能只能由内核完成。[proc](https://en.wikipedia.org/wiki/Procfs) 文件系统有一个具有特定格式的特殊文件: `/proc/pid/systemcall`记录了正在被进程调用的系统调用的编号和参数寄存器。例如,进程号 1 的程序是[systemd](https://en.wikipedia.org/wiki/Systemd):
|
||||
|
||||
```
|
||||
$ sudo cat /proc/1/comm
|
||||
systemd
|
||||
|
||||
$ sudo cat /proc/1/syscall
|
||||
232 0x4 0x7ffdf82e11b0 0x1f 0xffffffff 0x100 0x7ffdf82e11bf 0x7ffdf82e11a0 0x7f9114681193
|
||||
```
|
||||
|
||||
编号为 - `232` 的系统调用为 [epoll_wait](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L241),该调用等待 [epoll](https://en.wikipedia.org/wiki/Epoll) 文件描述符的I/O事件. 例如我用来编写这一节的 `emacs` 编辑器:
|
||||
|
||||
```
|
||||
$ ps ax | grep emacs
|
||||
2093 ? Sl 2:40 emacs
|
||||
|
||||
$ sudo cat /proc/2093/comm
|
||||
emacs
|
||||
|
||||
$ sudo cat /proc/2093/syscall
|
||||
270 0xf 0x7fff068a5a90 0x7fff068a5b10 0x0 0x7fff068a59c0 0x7fff068a59d0 0x7fff068a59b0 0x7f777dd8813c
|
||||
```
|
||||
|
||||
编号为 `270` 的系统调用是 [sys_pselect6](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L279) ,该系统调用使 `emacs` 监控多个文件描述符。
|
||||
|
||||
现在我们对系统调用有所了解,知道什么是系统调用及为什么需要系统调用。接下来,讨论示例程序中使用的 `write` 系统调用
|
||||
|
||||
写系统调用的实现
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
查看Linux内核源文件中写系统调用的实现。[fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) 源码文件中的 `write` 系统调用定义如下:
|
||||
```C
|
||||
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
|
||||
size_t, count)
|
||||
{
|
||||
struct fd f = fdget_pos(fd);
|
||||
ssize_t ret = -EBADF;
|
||||
|
||||
if (f.file) {
|
||||
loff_t pos = file_pos_read(f.file);
|
||||
ret = vfs_write(f.file, buf, count, &pos);
|
||||
if (ret >= 0)
|
||||
file_pos_write(f.file, pos);
|
||||
fdput_pos(f);
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
```
|
||||
|
||||
首先,宏 `SYSCALL_DEFINE3` 在头文件 [include/linux/syscalls.h](https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h) 中定义并且作为 `sys_name(...)` 函数定义的扩展。宏的定义如下:
|
||||
|
||||
```C
|
||||
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
|
||||
|
||||
#define SYSCALL_DEFINEx(x, sname, ...) \
|
||||
SYSCALL_METADATA(sname, x, __VA_ARGS__) \
|
||||
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
|
||||
```
|
||||
|
||||
宏 `SYSCALL_DEFINE3` 的参数有代表系统调用的名称的 `name` 和可变个数的参数。 这个宏仅仅作为 `SYSCALL_DEFINEx` 宏的扩展确定了传入宏的参数个数。 `_##name` 作为未来系统调用名称的存根 (更多关于 `##`符号连结可参阅[documentation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) of [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection))。宏 `SYSCALL_DEFINEx` 作为以下两个宏的扩展:
|
||||
|
||||
* `SYSCALL_METADATA`;
|
||||
* `__SYSCALL_DEFINEx`.
|
||||
|
||||
第一个宏 `SYSCALL_METADATA` 的实现与内核配置选项 `CONFIG_FTRACE_SYSCALLS` 有关。从选项的名称可知,选项允许 tracer 捕获系统调用的进入和退出。若该内核配置选项开启,宏 `SYSCALL_METADATA` 执行头文件[include/trace/syscall.h](https://github.com/torvalds/linux/blob/master/include/trace/syscall.h) 中`syscall_metadata` 结构的初始化。结构中包含多种有用字段例如系统调用的名称, 系统调用[表](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)中的编号,参数个数, 参数类型列表等:
|
||||
|
||||
```C
|
||||
#define SYSCALL_METADATA(sname, nb, ...) \
|
||||
... \
|
||||
... \
|
||||
... \
|
||||
struct syscall_metadata __used \
|
||||
__syscall_meta_##sname = { \
|
||||
.name = "sys"#sname, \
|
||||
.syscall_nr = -1, \
|
||||
.nb_args = nb, \
|
||||
.types = nb ? types_##sname : NULL, \
|
||||
.args = nb ? args_##sname : NULL, \
|
||||
.enter_event = &event_enter_##sname, \
|
||||
.exit_event = &event_exit_##sname, \
|
||||
.enter_fields = LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
|
||||
}; \
|
||||
|
||||
static struct syscall_metadata __used \
|
||||
__attribute__((section("__syscalls_metadata"))) \
|
||||
*__p_syscall_meta_##sname = &__syscall_meta_##sname;
|
||||
```
|
||||
|
||||
若内核配置时 `CONFIG_FTRACE_SYSCALLS` 未开启,此时宏 `SYSCALL_METADATA`扩展为空字符串:
|
||||
|
||||
```C
|
||||
#define SYSCALL_METADATA(sname, nb, ...)
|
||||
```
|
||||
|
||||
第二个宏 `__SYSCALL_DEFINEx` 扩展为 以下五个函数的定义:
|
||||
|
||||
```C
|
||||
#define __SYSCALL_DEFINEx(x, name, ...) \
|
||||
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
|
||||
__attribute__((alias(__stringify(SyS##name)))); \
|
||||
\
|
||||
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \
|
||||
\
|
||||
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
|
||||
\
|
||||
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
|
||||
{ \
|
||||
long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
|
||||
__MAP(x,__SC_TEST,__VA_ARGS__); \
|
||||
__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
|
||||
return ret; \
|
||||
} \
|
||||
\
|
||||
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
|
||||
```
|
||||
|
||||
第一个函数 `sys##name` 是给定名称 `sys_system_call_name` 系统调用处理器函数的定义。 宏 `__SC_DECL` 的参数有 `__VA_ARGS__` 及组合调用传入参数系统类型和参数名称,因为宏定义中无法指定参数类型。宏 `__MAP` 应用宏 `__SC_DECL` 给 `__VA_ARGS__` 参数。其他由宏 `__SYSCALL_DEFINEx` 产生的函数需要 protect from the [CVE-2009-0029](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0029) 此处不必深入研究。作为宏 `SYSCALL_DEFINE3` 的结论:
|
||||
|
||||
```C
|
||||
asmlinkage long sys_write(unsigned int fd, const char __user * buf, size_t count);
|
||||
```
|
||||
|
||||
现在我们对系统调用的定义有一定了解,回头讨论 `write` 系统调用的实现:
|
||||
|
||||
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}
```
|
||||
|
||||
从代码可知,该调用有三个参数:
|
||||
|
||||
* `fd` - 文件描述符;
|
||||
* `buf` - 写缓冲区;
|
||||
* `count` - 写缓冲区大小.
|
||||
|
||||
调用的功能是将用户定义的缓冲区中的数据写入指定的设备或文件。注意第二个参数 `buf` 带有 `__user` 属性,该属性的主要目的是配合 [sparse](https://en.wikipedia.org/wiki/Sparse) 工具检查 Linux 内核代码。`__user` 定义于 [include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) 头文件中,是否生效依赖于 Linux 内核的 `__CHECKER__` 定义。以上便是关于 `sys_write` 系统调用的有用元信息。现在试着理解该系统调用的实现:函数从定义 `fd` 结构类型的变量 `f` 开始,`fd` 是 Linux 内核中表示文件描述符的结构,其值来自 `fdget_pos` 函数的返回结果。`fdget_pos` 函数在同一[源文件](https://github.com/torvalds/linux/blob/master/fs/read_write.c)中定义,仅是对 `__to_fd` 函数的封装:
|
||||
|
||||
```C
|
||||
static inline struct fd fdget_pos(int fd)
|
||||
{
|
||||
return __to_fd(__fdget_pos(fd));
|
||||
}
|
||||
```
|
||||
|
||||
`fdget_pos` 的主要目的是将仅仅是一个数字的给定文件描述符转化为 `fd` 结构。通过一长串的函数调用,`fdget_pos` 函数得到当前进程的文件描述符表 `current->files`,并尝试从表中找到与给定编号对应的文件描述符。得到给定文件描述符的 `fd` 结构后,检查其中的文件是否存在,存在时才继续处理。之后调用函数 `file_pos_read` 获取当前在文件中的位置,该函数仅返回文件的 `f_pos` 字段:
|
||||
|
||||
```C
|
||||
static inline loff_t file_pos_read(struct file *file)
|
||||
{
|
||||
return file->f_pos;
|
||||
}
|
||||
```
|
||||
|
||||
之后调用 `vfs_write` 函数。`vfs_write` 函数在源码文件 [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) 中定义,其功能是向指定文件的指定位置写入指定缓冲区中的数据。此处不深入 `vfs_write` 函数的细节,因为这个函数与`系统调用`没有太多联系,反而与另一章节 [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) 相关。`vfs_write` 完成相关工作后,检查返回结果,若执行成功,则使用 `file_pos_write` 函数更新在文件中的位置:
|
||||
|
||||
```C
|
||||
if (ret >= 0)
|
||||
file_pos_write(f.file, pos);
|
||||
```
|
||||
|
||||
该函数仅以给定的位置更新文件的 `f_pos` 字段:
|
||||
|
||||
```C
|
||||
static inline void file_pos_write(struct file *file, loff_t pos)
|
||||
{
|
||||
file->f_pos = pos;
|
||||
}
|
||||
```
|
||||
|
||||
在 `write` 系统调用处理函数的最后,调用了以下函数:
|
||||
|
||||
```C
|
||||
fdput_pos(f);
|
||||
```
|
||||
|
||||
该函数释放互斥量 `f_pos_lock`,这个互斥量用于在多个线程通过共享文件描述符并发写文件时保护文件位置。
|
||||
|
||||
我们讨论了Linux内核提供的系统调用的部分实现。显然略过了 `write` 系统调用的部分实现细节,正如文中所述, 在该章节中仅关心系统调用的相关内容,不讨论与其他子系统相关的内容,例如[Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system).
|
||||
|
||||
总结
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
本节是 Linux 内核系统调用概念的第一部分的总结。本节中讨论了系统调用的原理,接下来的一节将深入该主题,了解 Linux 内核中系统调用相关的代码。
|
||||
|
||||
若存在疑问及建议, 在twitter @[0xAX](https://twitter.com/0xAX), 通过[email](anotherworldofworld@gmail.com) 或者创建 [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
**英语不是作者的母语,由此造成的不便深感抱歉。若发现错误请提交 PR 至 [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
链接
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [system call](https://en.wikipedia.org/wiki/System_call)
|
||||
* [vdso](https://en.wikipedia.org/wiki/VDSO)
|
||||
* [vsyscall](https://lwn.net/Articles/446528/)
|
||||
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [socket](https://en.wikipedia.org/wiki/Network_socket)
|
||||
* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
|
||||
* [x86](https://en.wikipedia.org/wiki/X86)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions)
|
||||
* [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf)
|
||||
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
||||
* [Intel manual. PDF](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
|
||||
* [system call table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)
|
||||
* [GCC macro documentation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
|
||||
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
|
||||
* [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29)
|
||||
* [strace](https://en.wikipedia.org/wiki/Strace)
|
||||
* [standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
|
||||
* [wrapper functions](https://en.wikipedia.org/wiki/Wrapper_function)
|
||||
* [ltrace](https://en.wikipedia.org/wiki/Ltrace)
|
||||
* [sparse](https://en.wikipedia.org/wiki/Sparse)
|
||||
* [proc file system](https://en.wikipedia.org/wiki/Procfs)
|
||||
* [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [systemd](https://en.wikipedia.org/wiki/Systemd)
|
||||
* [epoll](https://en.wikipedia.org/wiki/Epoll)
|
||||
* [Previous chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)
|
||||
408
SysCall/syscall-2.md
Normal file
|
||||
Linux 内核系统调用 第二节
|
||||
================================================================================
|
||||
|
||||
Linux 内核如何处理系统调用
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
前一[小节](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)作为本章节的第一部分,介绍了 Linux 内核中[系统调用](https://en.wikipedia.org/wiki/System_call)的概念。
|
||||
前一节中提到,系统调用通常处于操作系统内核层面。前一节从用户空间的角度展开,并且 [write](http://man7.org/linux/man-pages/man2/write.2.html) 系统调用实现的一部分内容没有讨论。本节继续关注系统调用,在深入 Linux 内核之前,先从一些理论开始。
|
||||
|
||||
用户程序通常并不直接使用系统调用。我们不会这样编写 `Hello World` 程序的代码:
|
||||
|
||||
```C
|
||||
int main(int argc, char **argv)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
sys_write(fd1, buf, strlen(buf));
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
而是借助 [C 标准库](https://en.wikipedia.org/wiki/GNU_C_Library),以类似下面的方式编写:
|
||||
|
||||
```C
|
||||
#include <unistd.h>
|
||||
|
||||
int main(int argc, char **argv)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
write(fd1, buf, strlen(buf));
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
不管怎样,`write` 既不是系统调用本身也不是内核函数。程序必须按正确的顺序将正确的值存入通用寄存器,之后使用 `syscall` 指令发起真正的系统调用。本节我们关注处理器执行 `syscall` 指令时,Linux 内核中发生的细节。
|
||||
|
||||
系统调用表的初始化
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
从前一节可知,系统调用与中断非常相似,系统调用处理程序与软件中断的处理程序十分类似。因此,当处理器执行用户程序的 `syscall` 指令时,控制权将转移至系统调用的处理入口。众所周知,所有的处理程序(即响应中断或异常的内核 [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) 函数)都位于内核代码中。但是 Linux 内核如何查找与指定系统调用对应的处理程序的地址?Linux 内核中有一个特殊的表:`system call table`(系统调用表)。系统调用表对应于 Linux 内核源码文件 [arch/x86/entry/syscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) 中定义的数组 `sys_call_table`,其实现如下:
|
||||
|
||||
```C
|
||||
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
|
||||
[0 ... __NR_syscall_max] = &sys_ni_syscall,
|
||||
#include <asm/syscalls_64.h>
|
||||
};
|
||||
```
|
||||
|
||||
`sys_call_table` 数组的大小为 `__NR_syscall_max + 1`,宏 `__NR_syscall_max` 表示给定[架构](https://en.wikipedia.org/wiki/List_of_CPU_architectures)下系统调用的最大编号。本书关注 [x86_64](https://en.wikipedia.org/wiki/X86-64) 架构,因此在编写本书时(当时的 Linux 内核版本为 `4.2.0-rc8+`)`__NR_syscall_max` 为 `322`。编译内核时可通过 [Kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) 产生的头文件 `include/generated/asm-offsets.h` 查看该宏:
|
||||
|
||||
```C
|
||||
#define __NR_syscall_max 322
|
||||
```
|
||||
|
||||
对于 `x86_64`,[arch/x86/entry/syscalls/syscall_64.tbl](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L331) 中也有相同数量的系统调用。这里有两个值得注意的地方:`sys_call_table` 数组的元素类型及数组元素的初始值。首先,`sys_call_ptr_t` 是指向系统调用处理函数的指针类型,它通过 [typedef](https://en.wikipedia.org/wiki/Typedef) 定义为一个无返回值且无参数的函数指针:
|
||||
|
||||
```C
|
||||
typedef void (*sys_call_ptr_t)(void);
|
||||
```
|
||||
|
||||
其次是 `sys_call_table` 数组中元素的初始化。从上面的代码中可知,数组中所有元素最初都包含指向 `sys_ni_syscall` 的指针,`sys_ni_syscall` 函数表示"未实现"(not-implemented)的系统调用。让所有元素先指向"未实现"的系统调用是合理的初始化方法:此时我们只是为指向系统调用处理函数的指针占好位置,稍后再填入真正的处理函数。`sys_ni_syscall` 的实现很简单,仅返回 [errno](http://man7.org/linux/man-pages/man3/errno.3.html) 中的 `-ENOSYS`:
|
||||
|
||||
```C
|
||||
asmlinkage long sys_ni_syscall(void)
|
||||
{
|
||||
return -ENOSYS;
|
||||
}
|
||||
```
|
||||
|
||||
`-ENOSYS` 错误的含义是:
|
||||
|
||||
```
|
||||
ENOSYS Function not implemented (POSIX.1)
|
||||
```
|
||||
|
||||
同时注意 `sys_call_table` 初始化中的 `...`:这是 [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) 编译器扩展 [Designated Initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html)(指定初始化器)的写法,该扩展允许以任意顺序初始化数组元素。在数组定义的结尾,我们包含了 `asm/syscalls_64.h` 头文件。该头文件由特殊脚本 [arch/x86/entry/syscalls/syscalltbl.sh](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscalltbl.sh) 从 [syscall table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl) 生成。`asm/syscalls_64.h` 包含以下宏的定义:
|
||||
|
||||
```C
|
||||
__SYSCALL_COMMON(0, sys_read, sys_read)
|
||||
__SYSCALL_COMMON(1, sys_write, sys_write)
|
||||
__SYSCALL_COMMON(2, sys_open, sys_open)
|
||||
__SYSCALL_COMMON(3, sys_close, sys_close)
|
||||
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
宏 `__SYSCALL_COMMON` 在相同的 [源码](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c)中定义,作为宏 `__SYSCALL_64`的扩展:
|
||||
|
||||
```C
|
||||
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
|
||||
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,
|
||||
```
|
||||
|
||||
因而, 到此为止, `sys_call_table` 为如下格式:
|
||||
|
||||
```C
|
||||
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
|
||||
[0 ... __NR_syscall_max] = &sys_ni_syscall,
|
||||
[0] = sys_read,
|
||||
[1] = sys_write,
|
||||
[2] = sys_open,
|
||||
...
|
||||
...
|
||||
...
|
||||
};
|
||||
```
|
||||
|
||||
这样,所有对应"未实现"系统调用的元素将指向仅返回 `-ENOSYS` 的 `sys_ni_syscall` 函数,其余元素则指向对应的 `sys_syscall_name` 函数。
|
||||
|
||||
至此完成了系统调用表的填充,Linux 内核也因此知道每个系统调用处理函数的位置。但是在处理用户空间程序发起的系统调用时,Linux 内核并不会立即调用 `sys_syscall_name` 函数。回忆关于中断及中断处理的[章节](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html):当 Linux 内核获得处理中断的控制权时,在调用中断处理程序前必须做一些准备,如保存用户空间寄存器、切换至新的堆栈等等。系统调用的处理也是相同的情形。第一步是完成系统调用的准备工作,但在内核开始这些准备之前,系统调用的入口必须先完成初始化,而只有 Linux 内核知道如何执行这些准备。下一小节我们将关注 Linux 内核中系统调用入口的初始化过程。
|
||||
|
||||
系统调用入口初始化
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
当系统中发生系统调用, 开始处理调用的代码的第一个字节在什么地方? 阅读 Intel 的手册 - [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):
|
||||
|
||||
```
|
||||
SYSCALL 在特权级 0 调用操作系统的系统调用处理程序,它通过将 IA32_LSTAR MSR 的值加载至 RIP 来实现。
|
||||
|
||||
```
|
||||
|
||||
这就是说,我们需要将系统调用入口的地址写入 `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register)(特殊模式寄存器)。这一操作在 Linux 内核初始化过程中完成。若已阅读关于 Linux 内核中断及中断处理章节的[第四节](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html),就会知道 Linux 内核在初始化过程中调用 `trap_init` 函数。该函数在 [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) 源代码文件中定义,执行 `non-early` 异常处理程序(如除法错误、[协处理器](https://en.wikipedia.org/wiki/Coprocessor)错误等)的初始化。除此之外,初始化过程中还会调用 [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) 中的 `cpu_init` 函数完成 `per-cpu` 状态初始化,后者会调用同一源码文件中的 `syscall_init` 函数。
|
||||
|
||||
该函数执行系统调用入口的初始化。查看函数的实现:它没有参数,首先填充两个特殊模式寄存器:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
|
||||
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
|
||||
```
|
||||
|
||||
第一个特殊模式寄存器 `MSR_STAR` 的 `63:48` 位为用户代码段。这些位将被加载至 `CS` 和 `SS` 段选择符寄存器,供 `sysret` 指令在从系统调用返回用户代码时使用(返回至相应的特权级)。同时从内核的角度看,当用户空间应用程序执行系统调用时,`MSR_STAR` 的 `47:32` 位将作为 `CS` 和 `SS` 段选择符寄存器的基础值。第二行代码中,我们将系统调用入口 `entry_SYSCALL_64` 的地址写入 `MSR_LSTAR` 寄存器。`entry_SYSCALL_64` 在 [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) 汇编文件中定义,包含系统调用执行前的准备工作(上面已经提及)。目前不展开 `entry_SYSCALL_64`,将在本章后续讨论。
|
||||
|
||||
在设置系统调用的入口之后,需要以下特殊模式寄存器:
|
||||
|
||||
* `MSR_CSTAR` - 兼容模式调用者的目标 `rip`;
* `MSR_IA32_SYSENTER_CS` - `sysenter` 指令的目标 `cs`;
* `MSR_IA32_SYSENTER_ESP` - `sysenter` 指令的目标 `esp`;
* `MSR_IA32_SYSENTER_EIP` - `sysenter` 指令的目标 `eip`。
|
||||
|
||||
这些特殊模式寄存器的值与内核配置选项 `CONFIG_IA32_EMULATION` 有关。若开启该内核配置选项,则允许 64 位内核运行 32 位的程序。若 `CONFIG_IA32_EMULATION` 内核配置选项开启,将使用兼容模式的系统调用入口填充这些特殊模式寄存器:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);
|
||||
```
|
||||
|
||||
并写入内核代码段,将堆栈指针置零,把 `entry_SYSENTER_compat` 符号的地址写入[指令指针](https://en.wikipedia.org/wiki/Program_counter):
|
||||
|
||||
```C
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
|
||||
```
|
||||
|
||||
另一方面,若 `CONFIG_IA32_EMULATION` 内核配置选项未开启,将把 `ignore_sysret` 符号的地址写入 `MSR_CSTAR`:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_CSTAR, ignore_sysret);
|
||||
```
|
||||
|
||||
`ignore_sysret` 在 [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) 汇编文件中定义,仅返回 `-ENOSYS` 错误代码:
|
||||
|
||||
```assembly
|
||||
ENTRY(ignore_sysret)
|
||||
mov $-ENOSYS, %eax
|
||||
sysret
|
||||
END(ignore_sysret)
|
||||
```
|
||||
|
||||
与前面类似,还需要填充 `MSR_IA32_SYSENTER_CS`、`MSR_IA32_SYSENTER_ESP`、`MSR_IA32_SYSENTER_EIP` 特殊模式寄存器。在 `CONFIG_IA32_EMULATION` 配置选项未开启的情况下,将用零填充 `MSR_IA32_SYSENTER_ESP` 和 `MSR_IA32_SYSENTER_EIP`,并将 [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) 的无效段加载至 `MSR_IA32_SYSENTER_CS` 特殊模式寄存器:
|
||||
|
||||
```C
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
|
||||
```
|
||||
|
||||
可以从描述 Linux 内核启动过程的[章节](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html)阅读更多关于 `Global Descriptor Table` 的内容。
|
||||
|
||||
在 `syscall_init` 函数的最后,通过写入 `MSR_SYSCALL_MASK` 特殊模式寄存器,设置执行 `syscall` 指令时需要在[标志寄存器](https://en.wikipedia.org/wiki/FLAGS_register)中屏蔽的标志位:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_SYSCALL_MASK,
|
||||
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
|
||||
X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
|
||||
```
|
||||
|
||||
这些标志位将在执行 `syscall` 指令时被清除。至此 `syscall_init` 函数结束,也意味着系统调用已经可用。现在我们关注用户程序执行 `syscall` 指令时发生了什么。
|
||||
|
||||
系统调用处理执行前的准备
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
如前所述,系统调用或中断处理程序在被 Linux 内核调用前需要一些准备。宏 `idtentry` 完成异常处理程序执行前的准备,宏 `interrupt` 完成中断处理程序调用前的准备,而 `entry_SYSCALL_64` 完成系统调用处理执行前的准备。
|
||||
|
||||
`entry_SYSCALL_64` 在 [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) 汇编文件中定义 ,从下面的宏开始:
|
||||
|
||||
```assembly
|
||||
SWAPGS_UNSAFE_STACK
|
||||
```
|
||||
|
||||
该宏在 [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) 头文件中定义, 扩展 `swapgs` 指令:
|
||||
|
||||
```C
|
||||
#define SWAPGS_UNSAFE_STACK swapgs
|
||||
```
|
||||
|
||||
该宏交换 `GS` 段选择符基址与 `MSR_KERNEL_GS_BASE` 特殊模式寄存器中的值,换句话说,切换到内核一侧的 per-cpu 数据区。之后将老的堆栈指针保存至 `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) 变量,并将堆栈指针设置为当前处理器的栈顶:
|
||||
|
||||
```assembly
|
||||
movq %rsp, PER_CPU_VAR(rsp_scratch)
|
||||
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
|
||||
```
|
||||
|
||||
下一步将堆栈段及老的堆栈指针入栈:
|
||||
|
||||
```assembly
|
||||
pushq $__USER_DS
|
||||
pushq PER_CPU_VAR(rsp_scratch)
|
||||
```
|
||||
|
||||
之后使能中断(因为进入入口时中断是关闭的),并将通用目的[寄存器](https://en.wikipedia.org/wiki/Processor_register)(除 `bp`、`bx` 及 `r12` 至 `r15`)、标志位、与"未实现"系统调用相关的 `-ENOSYS` 以及代码段寄存器保存至堆栈:
|
||||
|
||||
```assembly
|
||||
ENABLE_INTERRUPTS(CLBR_NONE)
|
||||
|
||||
pushq %r11
|
||||
pushq $__USER_CS
|
||||
pushq %rcx
|
||||
pushq %rax
|
||||
pushq %rdi
|
||||
pushq %rsi
|
||||
pushq %rdx
|
||||
pushq %rcx
|
||||
pushq $-ENOSYS
|
||||
pushq %r8
|
||||
pushq %r9
|
||||
pushq %r10
|
||||
pushq %r11
|
||||
sub $(6*8), %rsp
|
||||
```
|
||||
|
||||
当系统调用由用户空间程序引起时, 通用目的寄存器状态如下:
|
||||
|
||||
* `rax` - 系统调用编号;
* `rcx` - 返回用户空间的地址;
* `r11` - 寄存器标志位;
* `rdi` - 系统调用处理程序的第一个参数;
* `rsi` - 系统调用处理程序的第二个参数;
* `rdx` - 系统调用处理程序的第三个参数;
* `r10` - 系统调用处理程序的第四个参数;
* `r8` - 系统调用处理程序的第五个参数;
* `r9` - 系统调用处理程序的第六个参数;
|
||||
|
||||
其他通用目的寄存器(如 `rbp`、`rbx` 和 `r12` 至 `r15`)按照 [C ABI](http://www.x86-64.org/documentation/abi.pdf) 由被调用方保存。至此,我们已将标志寄存器、用户代码段、用户空间返回地址、系统调用编号、前三个参数、表示"未实现"系统调用的 `-ENOSYS` 错误码以及其他信息压入堆栈。
|
||||
|
||||
下一步检查当前 `thread_info` 中的 `_TIF_WORK_SYSCALL_ENTRY`:
|
||||
|
||||
```assembly
|
||||
testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
|
||||
jnz tracesys
|
||||
```
|
||||
|
||||
宏 `_TIF_WORK_SYSCALL_ENTRY`在 [arch/x86/include/asm/thread_info.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/thread_info.h) 头文件中定义 ,提供一系列与系统调用跟踪有关的进程信息标志:
|
||||
|
||||
```C
|
||||
#define _TIF_WORK_SYSCALL_ENTRY \
|
||||
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
|
||||
_TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
|
||||
_TIF_NOHZ)
|
||||
```
|
||||
|
||||
本章节中不讨论追踪/调试相关内容,将在关于 Linux 内核调试及追踪的独立章节中讨论。在 `tracesys` 标签之后,下一个标签为 `entry_SYSCALL_64_fastpath`。在 `entry_SYSCALL_64_fastpath` 中检查头文件 [arch/x86/include/asm/unistd.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/unistd.h) 中定义的 `__SYSCALL_MASK`:
|
||||
|
||||
```C
|
||||
# ifdef CONFIG_X86_X32_ABI
|
||||
# define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
|
||||
# else
|
||||
# define __SYSCALL_MASK (~0)
|
||||
# endif
|
||||
```
|
||||
|
||||
`__X32_SYSCALL_BIT` 为:
|
||||
|
||||
```C
|
||||
#define __X32_SYSCALL_BIT 0x40000000
|
||||
```
|
||||
|
||||
众所周知, `__SYSCALL_MASK` 与 `CONFIG_X86_X32_ABI` 内核配置选项相关, 作为 64位内核中32位[ABI](https://en.wikipedia.org/wiki/Application_binary_interface) 的掩码。
|
||||
|
||||
因此我们检查 `__SYSCALL_MASK` 的值:若 `CONFIG_X86_X32_ABI` 未开启,将 `rax` 寄存器的值与系统调用最大编号(`__NR_syscall_max`)比较;若 `CONFIG_X86_X32_ABI` 开启,则先用 `__X32_SYSCALL_BIT` 对 `eax` 寄存器做掩码处理,再进行相同的比较:
|
||||
|
||||
```assembly
|
||||
#if __SYSCALL_MASK == ~0
|
||||
cmpq $__NR_syscall_max, %rax
|
||||
#else
|
||||
andl $__SYSCALL_MASK, %eax
|
||||
cmpl $__NR_syscall_max, %eax
|
||||
#endif
|
||||
```
|
||||
|
||||
之后检查比较指令的结果,`ja` 指令在 `CF` 和 `ZF` 标志均为 0 时跳转:
|
||||
|
||||
```assembly
|
||||
ja 1f
|
||||
```
|
||||
|
||||
若系统调用编号合法,则将第四个参数从 `r10` 移动至 `rcx` 以符合 [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf),并以系统调用处理程序的地址为操作数执行 `call` 指令:
|
||||
|
||||
```assembly
|
||||
movq %r10, %rcx
|
||||
call *sys_call_table(, %rax, 8)
|
||||
```
|
||||
|
||||
注意,上文提到 `sys_call_table` 是一个数组。`rax` 通用目的寄存器中为系统调用编号,且 `sys_call_table` 的每个元素为 8 字节。因此使用 `*sys_call_table(, %rax, 8)` 这种寻址方式(基址加编号乘以 8),即可定位指定系统调用处理程序在 `sys_call_table` 中的位置。
|
||||
|
||||
就这样。所需的准备完成后,相应的系统调用处理程序将被调用,例如 Linux 内核代码中由 `SYSCALL_DEFINE[N]` 宏定义的 `sys_read`、`sys_write` 及其他处理程序。
|
||||
|
||||
退出系统调用
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
在系统调用处理程序完成任务后,将返回 [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S),执行点正好位于这条调用之后:
|
||||
|
||||
```assembly
|
||||
call *sys_call_table(, %rax, 8)
|
||||
```
|
||||
|
||||
在从系统调用处理程序返回之后,下一步是将其返回值保存到堆栈上。系统调用将结果放在通用目的寄存器 `rax` 中返回给用户程序,因此在系统调用处理程序完成工作后,将该寄存器的值保存至堆栈:
|
||||
|
||||
```C
|
||||
movq %rax, RAX(%rsp)
|
||||
```
|
||||
|
||||
在 `RAX` 指定的位置。
|
||||
|
||||
之后调用在 [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) 中定义的宏 `LOCKDEP_SYS_EXIT` :
|
||||
|
||||
```assembly
|
||||
LOCKDEP_SYS_EXIT
|
||||
```
|
||||
|
||||
宏的实现与 `CONFIG_DEBUG_LOCK_ALLOC` 内核配置选项相关,该选项允许在退出系统调用时对锁进行调试。本章同样不展开,将在单独的章节讨论。在 `entry_SYSCALL_64` 函数的最后,恢复除 `rcx` 和 `r11` 外的所有通用寄存器,因为 `rcx` 寄存器将存放调用系统调用的应用程序的返回地址,`r11` 寄存器将存放老的[标志寄存器](https://en.wikipedia.org/wiki/FLAGS_register)。在恢复所有通用寄存器之后,在 `rcx` 中装入返回地址,`r11` 寄存器装入标志,`rsp` 装入老的堆栈指针:
|
||||
|
||||
```assembly
|
||||
RESTORE_C_REGS_EXCEPT_RCX_R11
|
||||
|
||||
movq RIP(%rsp), %rcx
|
||||
movq EFLAGS(%rsp), %r11
|
||||
movq RSP(%rsp), %rsp
|
||||
|
||||
USERGS_SYSRET64
|
||||
```
|
||||
|
||||
最后调用宏 `USERGS_SYSRET64`,它扩展为 `swapgs` 指令(交换用户 `GS` 和内核 `GS`)和 `sysretq` 指令(执行从系统调用处理的退出):
|
||||
|
||||
```C
|
||||
#define USERGS_SYSRET64 \
|
||||
swapgs; \
|
||||
sysretq;
|
||||
```
|
||||
|
||||
现在我们知道,当用户程序使用系统调用时发生的一切。整个过程的步骤如下:
|
||||
|
||||
* 用户程序中的代码将通用目的寄存器填入相应的值(系统调用编号和系统调用的参数);
* 处理器从用户模式切换到内核模式,开始执行系统调用入口 `entry_SYSCALL_64`;
* `entry_SYSCALL_64` 切换至内核堆栈,并在堆栈中保存通用目的寄存器、老的堆栈指针、代码段、标志位等;
* `entry_SYSCALL_64` 检查 `rax` 寄存器中的系统调用编号,编号合法时,在 `sys_call_table` 中查找并调用对应的系统调用处理程序;
* 若系统调用编号不合法,跳转至系统调用退出流程;
* 系统调用处理程序完成工作后,恢复通用寄存器、老的堆栈指针、标志位及返回地址,通过 `sysretq` 指令退出 `entry_SYSCALL_64`。
|
||||
|
||||
|
||||
结论
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
这是 Linux 内核相关概念的第二节。在前一 [节](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) ,从用户应用程序的角度讨论了这些概念的原理。在这一节继续深入系统调用概念的相关内容,讨论了系统调用发生时 Linux 内核执行的内容。
|
||||
|
||||
若存在疑问及建议, 在twitter @[0xAX](https://twitter.com/0xAX), 通过[email](anotherworldofworld@gmail.com) 或者创建 [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
**英语不是作者的母语,由此造成的不便深感抱歉。若发现错误请提交 PR 至 [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [system call](https://en.wikipedia.org/wiki/System_call)
|
||||
* [write](http://man7.org/linux/man-pages/man2/write.2.html)
|
||||
* [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
|
||||
* [list of cpu architectures](https://en.wikipedia.org/wiki/List_of_CPU_architectures)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt)
|
||||
* [typedef](https://en.wikipedia.org/wiki/Typedef)
|
||||
* [errno](http://man7.org/linux/man-pages/man3/errno.3.html)
|
||||
* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
||||
* [model specific register](https://en.wikipedia.org/wiki/Model-specific_register)
|
||||
* [intel 2b manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
|
||||
* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
|
||||
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
|
||||
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
|
||||
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
|
||||
* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
|
||||
* [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf)
|
||||
* [previous chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)
|
||||
403
SysCall/syscall-3.md
Normal file
|
||||
Linux 内核系统调用 第三节
|
||||
================================================================================
|
||||
|
||||
vsyscalls 和 vDSO
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
这是讲解 Linux 内核中系统调用[章节](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html)的第三部分,[前一节](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html)讨论了用户空间应用程序发起的系统调用的准备工作及系统调用的处理过程。在这一节将讨论两个与系统调用十分相似的概念,这两个概念是`vsyscall` 和 `vdso`。
|
||||
|
||||
我们已经了解什么是`系统调用`。这是 Linux 内核的一种特殊运行机制,使得用户空间的应用程序可以请求写入文件、打开套接字等需要特权级的任务。正如你所了解的,在 Linux 内核中发起一个系统调用是比较昂贵的操作,因为处理器需要中断当前正在执行的任务,切换至内核模式的上下文,并在系统调用处理完毕后跳回用户空间。以下两种机制 `vsyscall` 和 `vdso` 被设计用来加速系统调用的处理,在这一节我们将了解这两种机制的工作原理。
|
||||
|
||||
vsyscalls 介绍
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
`vsyscall` 或 `virtual system call` 是第一种也是最古老的一种用于加快系统调用的机制。`vsyscall` 的工作原则其实十分简单:Linux 内核在用户空间映射一个内存页,其中包含一些变量及一些系统调用的实现。对于 [x86_64](https://en.wikipedia.org/wiki/X86-64) 架构,可以在 Linux 内核的[文档](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)中找到关于这一内存区域的信息:
|
||||
|
||||
```
|
||||
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
|
||||
```
|
||||
|
||||
或:
|
||||
|
||||
```
|
||||
~$ sudo cat /proc/1/maps | grep vsyscall
|
||||
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
|
||||
```
|
||||
|
||||
因此,这些系统调用将在用户空间执行,这意味着不会发生[上下文切换](https://en.wikipedia.org/wiki/Context_switch)。`vsyscall` 内存页的映射由 [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) 源码中定义的 `map_vsyscall` 函数实现。该函数在 Linux 内核初始化时被 [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) 中定义的 `setup_arch` 函数调用(我们在 Linux 内核初始化的[第五章](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html)中讨论过该函数)。
|
||||
|
||||
注意 `map_vsyscall` 函数的实现依赖于内核配置选项 `CONFIG_X86_VSYSCALL_EMULATION` :
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_VSYSCALL_EMULATION
|
||||
extern void map_vsyscall(void);
|
||||
#else
|
||||
static inline void map_vsyscall(void) {}
|
||||
#endif
|
||||
```
|
||||
|
||||
正如帮助文档中所描述的,`CONFIG_X86_VSYSCALL_EMULATION` 配置选项的作用是"使能 vsyscall 模拟"。为何要模拟 `vsyscall`?事实上,出于安全原因,`vsyscall` 是一种遗留 [ABI](https://en.wikipedia.org/wiki/Application_binary_interface):虚拟系统调用具有固定的地址,意味着 `vsyscall` 内存页的位置在任何时刻都相同,这一位置在 `map_vsyscall` 函数中指定。该函数的实现如下:
|
||||
|
||||
```C
|
||||
void __init map_vsyscall(void)
|
||||
{
|
||||
extern char __vsyscall_page;
|
||||
unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
在 `map_vsyscall` 函数的开始,通过宏 `__pa_symbol` 获取 `vsyscall` 内存页的物理地址(我们在 Linux 内核初始化过程的[第四章](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)中讨论过该宏的实现)。`__vsyscall_page` 在 [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) 汇编源文件中定义,具有如下的[虚拟地址](https://en.wikipedia.org/wiki/Virtual_address_space):
|
||||
|
||||
```
|
||||
ffffffff81881000 D __vsyscall_page
|
||||
```
|
||||
|
||||
它位于 `.data..page_aligned, aw` [段](https://en.wikipedia.org/wiki/Memory_segmentation)中,包含如下三种系统调用:
|
||||
|
||||
* `gettimeofday`;
|
||||
* `time`;
|
||||
* `getcpu`.
|
||||
|
||||
或:
|
||||
|
||||
```assembly
|
||||
__vsyscall_page:
|
||||
mov $__NR_gettimeofday, %rax
|
||||
syscall
|
||||
ret
|
||||
|
||||
.balign 1024, 0xcc
|
||||
mov $__NR_time, %rax
|
||||
syscall
|
||||
ret
|
||||
|
||||
.balign 1024, 0xcc
|
||||
mov $__NR_getcpu, %rax
|
||||
syscall
|
||||
ret
|
||||
```
|
||||
|
||||
回到 `map_vsyscall` 函数。在得到 `__vsyscall_page` 的物理地址之后,检查 `vsyscall_mode` 变量,并使用 `__set_fixmap` 为 `vsyscall` 内存页设置 [fix-mapped](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) 地址:
|
||||
|
||||
```C
|
||||
if (vsyscall_mode != NONE)
|
||||
__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
|
||||
vsyscall_mode == NATIVE
|
||||
? PAGE_KERNEL_VSYSCALL
|
||||
: PAGE_KERNEL_VVAR);
|
||||
```
|
||||
|
||||
`__set_fixmap` 接收三个参数:第一个参数是 `fixed_addresses` [枚举](https://en.wikipedia.org/wiki/Enumerated_type)的索引。这里 `VSYSCALL_PAGE` 是 `x86_64` 架构下 `fixed_addresses` 枚举的第一个元素:
|
||||
|
||||
```C
|
||||
enum fixed_addresses {
|
||||
...
|
||||
...
|
||||
...
|
||||
#ifdef CONFIG_X86_VSYSCALL_EMULATION
|
||||
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
|
||||
#endif
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
该枚举值为 `511`。第二个参数是要映射的内存页的物理地址,第三个参数是内存页的标志位。注意 `VSYSCALL_PAGE` 的标志位依赖于变量 `vsyscall_mode`:当 `vsyscall_mode` 变量为 `NATIVE` 时,标志位为 `PAGE_KERNEL_VSYSCALL`,其他情况则是 `PAGE_KERNEL_VVAR`。两个宏都将扩展为以下标志:
|
||||
|
||||
```C
|
||||
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
|
||||
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
|
||||
```
|
||||
|
||||
标志反映了 `vsyscall` 内存页的访问权限。两个标志都带有 `_PAGE_USER`,这意味着内存页可被运行于低特权级的用户模式进程访问。第二个标志取决于 `vsyscall_mode` 变量的值:当 `vsyscall_mode` 为 `NATIVE` 时设定 `__PAGE_KERNEL_VSYSCALL`,此时虚拟系统调用将以本地 `syscall` 指令的方式执行;当 `vsyscall_mode` 为 `EMULATE` 时设定 `PAGE_KERNEL_VVAR`,此时虚拟系统调用将陷入内核并被模拟。`vsyscall_mode` 变量通过 `vsyscall_setup` 函数取值:
|
||||
|
||||
```C
|
||||
static int __init vsyscall_setup(char *str)
|
||||
{
|
||||
if (str) {
|
||||
if (!strcmp("emulate", str))
|
||||
vsyscall_mode = EMULATE;
|
||||
else if (!strcmp("native", str))
|
||||
vsyscall_mode = NATIVE;
|
||||
else if (!strcmp("none", str))
|
||||
vsyscall_mode = NONE;
|
||||
else
|
||||
return -EINVAL;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
return -EINVAL;
|
||||
}
|
||||
```
|
||||
|
||||
该函数在早期内核参数解析时被调用:
|
||||
|
||||
```C
|
||||
early_param("vsyscall", vsyscall_setup);
|
||||
```
|
||||
|
||||
关于 `early_param` 宏的更多信息可以在[第六章](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) Linux 内核初始化中找到。
|
||||
|
||||
在函数 `map_vsyscall` 的最后,仅通过 [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) 宏检查 `vsyscall` 内存页的虚拟地址是否等于 `VSYSCALL_ADDR` 的值:
|
||||
|
||||
```C
|
||||
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
|
||||
(unsigned long)VSYSCALL_ADDR);
|
||||
```
|
||||
|
||||
就这样,`vsyscall` 内存页设置完毕。上述过程的结果如下:若设置了 `vsyscall=native` 内核命令行参数,虚拟系统调用将以 [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) 文件中本地 `syscall` 指令的方式执行。[glibc](https://en.wikipedia.org/wiki/GNU_C_Library) 知道虚拟系统调用处理程序的地址。注意虚拟系统调用处理程序按 `1024`(即 `0x400`)字节对齐:
|
||||
|
||||
```assembly
|
||||
__vsyscall_page:
|
||||
mov $__NR_gettimeofday, %rax
|
||||
syscall
|
||||
ret
|
||||
|
||||
.balign 1024, 0xcc
|
||||
mov $__NR_time, %rax
|
||||
syscall
|
||||
ret
|
||||
|
||||
.balign 1024, 0xcc
|
||||
mov $__NR_getcpu, %rax
|
||||
syscall
|
||||
ret
|
||||
```
|
||||
|
||||
`vsyscall` 内存页的起始地址为 `ffffffffff600000`。因此,[glibc](https://en.wikipedia.org/wiki/GNU_C_Library) 知道所有虚拟系统调用处理程序的地址。这些地址在 `glibc` 源码中定义如下:
|
||||
|
||||
```C
|
||||
#define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
|
||||
#define VSYSCALL_ADDR_vtime 0xffffffffff600400
|
||||
#define VSYSCALL_ADDR_vgetcpu 0xffffffffff600800
|
||||
```
|
||||
|
||||
所有的虚拟系统调用请求都将落在 `__vsyscall_page` 中与 `VSYSCALL_ADDR_vsyscall_name` 对应的偏移处:将虚拟系统调用的编号置于通用目的[寄存器](https://en.wikipedia.org/wiki/Processor_register) `rax` 中,然后执行本地的 x86_64 `syscall` 指令。
|
||||
|
||||
在第二种情况中,若将 `vsyscall=emulate` 参数传递给内核命令行,尝试执行虚拟系统调用处理程序将导致 [page fault](https://en.wikipedia.org/wiki/Page_fault) 异常。谨记,此时 `vsyscall` 内存页具有 `__PAGE_KERNEL_VVAR` 访问权限,禁止执行。`do_page_fault` 函数是 `#PF`(即 page fault)的处理程序,它会尝试了解本次 page fault 的原因,其中一种可能的场景便是 `vsyscall` 模式为 `emulate` 时的虚拟系统调用。此时 `vsyscall` 将被 [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) 源码中定义的 `emulate_vsyscall` 函数处理。
|
||||
|
||||
The `emulate_vsyscall` function gets the number of a virtual system call, checks it, prints error and sends [segementation fault](https://en.wikipedia.org/wiki/Segmentation_fault) single:
|
||||
|
||||
```C
|
||||
...
|
||||
...
|
||||
...
|
||||
vsyscall_nr = addr_to_vsyscall_nr(address);
|
||||
if (vsyscall_nr < 0) {
|
||||
warn_bad_vsyscall(KERN_WARNING, regs, "misaligned vsyscall...);
|
||||
goto sigsegv;
|
||||
}
|
||||
...
|
||||
...
|
||||
...
|
||||
sigsegv:
|
||||
force_sig(SIGSEGV, current);
|
||||
reutrn true;
|
||||
```
|
||||
|
||||
As it checked number of a virtual system call, it does some yet another checks like `access_ok` violations and execute system call function depends on the number of a virtual system call:
|
||||
|
||||
```C
|
||||
switch (vsyscall_nr) {
|
||||
case 0:
|
||||
ret = sys_gettimeofday(
|
||||
(struct timeval __user *)regs->di,
|
||||
(struct timezone __user *)regs->si);
|
||||
break;
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
In the end we put the result of the `sys_gettimeofday` or another virtual system call handler to the `ax` general purpose register, as we did it with the normal system calls and restore the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and add `8` bytes to the [stack pointer](https://en.wikipedia.org/wiki/Stack_register) register. This operation emulates `ret` instruction.
|
||||
|
||||
```C
|
||||
regs->ax = ret;
|
||||
|
||||
do_ret:
|
||||
regs->ip = caller;
|
||||
regs->sp += 8;
|
||||
return true;
|
||||
```
|
||||
|
||||
That's all. Now let's look on the modern concept - `vDSO`.
|
||||
|
||||
Introduction to vDSO
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As I already wrote above, `vsyscall` is an obsolete concept and replaced by the `vDSO` or `virtual dynamic shared object`. The main difference between the `vsyscall` and `vDSO` mechanisms is that `vDSO` maps memory pages into each process in a shared object [form](https://en.wikipedia.org/wiki/Library_%28computing%29#Shared_libraries), but `vsyscall` is static in memory and has the same address every time. For the `x86_64` architecture it is called -`linux-vdso.so.1`. All userspace applications linked with this shared library via the `glibc`. For example:
|
||||
|
||||
```
|
||||
~$ ldd /bin/uname
|
||||
linux-vdso.so.1 (0x00007ffe014b7000)
|
||||
libc.so.6 => /lib64/libc.so.6 (0x00007fbfee2fe000)
|
||||
/lib64/ld-linux-x86-64.so.2 (0x00005559aab7c000)
|
||||
```
|
||||
|
||||
Or:
|
||||
|
||||
```
|
||||
~$ sudo cat /proc/1/maps | grep vdso
|
||||
7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0 [vdso]
|
||||
```
|
||||
|
||||
Here we can see that [uname](https://en.wikipedia.org/wiki/Uname) util was linked with the three libraries:
|
||||
|
||||
* `linux-vdso.so.1`;
|
||||
* `libc.so.6`;
|
||||
* `ld-linux-x86-64.so.2`.
|
||||
|
||||
The first provides `vDSO` functionality, the second is `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (more about this you can read in the part that describes [linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)). So, the `vDSO` solves limitations of the `vsyscall`. Implementation of the `vDSO` is similar to `vsyscall`.
|
||||
|
||||
Initialization of the `vDSO` occurs in the `init_vdso` function that is defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file. This function starts with the initialization of the `vDSO` images for 32-bit and 64-bit, depending on the `CONFIG_X86_X32_ABI` kernel configuration option:

```C
static int __init init_vdso(void)
{
	init_vdso_image(&vdso_image_64);

#ifdef CONFIG_X86_X32_ABI
	init_vdso_image(&vdso_image_x32);
#endif
```

Both functions initialize the `vdso_image` structure. This structure is defined in two generated source code files: [arch/x86/entry/vdso/vdso-image-64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-64.c) and [arch/x86/entry/vdso/vdso-image-x32.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-x32.c). These source code files are generated by the [vdso2c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso2c.c) program from different source code files and represent different approaches to calling a system call, like `int 0x80`, `sysenter`, etc. The full set of the images depends on the kernel configuration.

For example, for the `x86_64` Linux kernel it will contain `vdso_image_64`:

```C
#ifdef CONFIG_X86_64
extern const struct vdso_image vdso_image_64;
#endif
```

But for the `x32` ABI - `vdso_image_x32`:

```C
#ifdef CONFIG_X86_X32
extern const struct vdso_image vdso_image_x32;
#endif
```

If our kernel is configured for the `x86` architecture or for `x86_64` with compatibility mode, we will have the ability to call a system call with the `int 0x80` interrupt; if compatibility mode is enabled, we will also be able to call a system call with the `syscall` instruction, or with the `sysenter` instruction otherwise:

```C
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_int80;
#ifdef CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_syscall;
#endif
extern const struct vdso_image vdso_image_32_sysenter;
#endif
```

As we can understand from the name of the `vdso_image` structure, it represents the image of the `vDSO` for a certain mode of system call entry. This structure contains information about the size in bytes of the `vDSO` area, which is always a multiple of `PAGE_SIZE` (`4096` bytes), a pointer to the text mapping, the start and end addresses of the `alternatives` (sets of instructions with better alternatives for certain types of processors), etc. For example, `vdso_image_64` looks like this:

```C
const struct vdso_image vdso_image_64 = {
	.data = raw_data,
	.size = 8192,
	.text_mapping = {
		.name = "[vdso]",
		.pages = pages,
	},
	.alt = 3145,
	.alt_len = 26,
	.sym_vvar_start = -8192,
	.sym_vvar_page = -8192,
	.sym_hpet_page = -4096,
};
```

Where `raw_data` contains the raw binary code of the 64-bit `vDSO` system calls, which occupy `2` pages:

```C
static struct page *pages[2];
```

or 8 Kilobytes.

The `init_vdso_image` function is defined in the same source code file and just initializes `vdso_image.text_mapping.pages`. First of all, this function calculates the number of pages and initializes each `vdso_image.text_mapping.pages[number_of_page]` with the `virt_to_page` macro, which converts the given address to the `page` structure:

```C
void __init init_vdso_image(const struct vdso_image *image)
{
	int i;
	int npages = (image->size) / PAGE_SIZE;

	for (i = 0; i < npages; i++)
		image->text_mapping.pages[i] =
			virt_to_page(image->data + i*PAGE_SIZE);
	...
	...
	...
}
```

The `init_vdso` function is passed to the `subsys_initcall` macro, which adds the given function to the `initcalls` list. All functions from this list will be called in the `do_initcalls` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

```C
subsys_initcall(init_vdso);
```

Ok, we just saw the initialization of the `vDSO` and the initialization of the `page` structures that are related to the memory pages containing the `vDSO` system calls. But where are their pages mapped to? Actually, they are mapped by the kernel when it loads a binary into memory. The Linux kernel calls the `arch_setup_additional_pages` function from the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file, which checks that the `vDSO` is enabled for `x86_64` and calls the `map_vdso` function:

```C
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
	if (!vdso64_enabled)
		return 0;

	return map_vdso(&vdso_image_64, true);
}
```

The `map_vdso` function is defined in the same source code file and maps pages for the `vDSO` and for the shared `vDSO` variables. That's all. The main differences between the `vsyscall` and the `vDSO` concepts are that `vsyscall` has a static address of `ffffffffff600000` and implements `3` system calls, whereas the `vDSO` is loaded dynamically and implements four system calls:

* `__vdso_clock_gettime`;
* `__vdso_getcpu`;
* `__vdso_gettimeofday`;
* `__vdso_time`.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the third part about the system call concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we discussed the preparation done by the Linux kernel before a system call is handled and the exit process from a system call handler. In this part we continued to dive into the stuff related to the system call concept and learned about two new concepts that are very similar to system calls - the `vsyscall` and the `vDSO`.

After all of these three parts, we know almost everything related to system calls: we know what a system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls.

The next part will be the last part in this [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs a program.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [x86_64 memory map](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [context switching](https://en.wikipedia.org/wiki/Context_switch)
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
* [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space)
* [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation)
* [enum](https://en.wikipedia.org/wiki/Enumerated_type)
* [fix-mapped addresses](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [Page fault](https://en.wikipedia.org/wiki/Page_fault)
* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
* [uname](https://en.wikipedia.org/wiki/Uname)
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html)

430
SysCall/syscall-4.md
Normal file
@@ -0,0 +1,430 @@

System calls in the Linux kernel. Part 4.
================================================================================

How does the Linux kernel run a program
--------------------------------------------------------------------------------

This is the fourth part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel and, as I wrote in the conclusion of the [previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html), this part will be the last in this chapter. In the previous part we stopped at two new concepts:

* `vsyscall`;
* `vDSO`;

that are related and very similar to the system call concept.

This part will be the last part in this chapter and, as you can understand from the part's title, we will see what occurs in the Linux kernel when we run our programs. So, let's start.

How do we launch our programs?
--------------------------------------------------------------------------------

There are many different ways to launch an application from a user's perspective. For example, we can run a program from the [shell](https://en.wikipedia.org/wiki/Unix_shell) or double-click on the application icon. It does not matter: the Linux kernel handles application launch regardless of how we launch the application.

In this part we will consider the case when we launch an application from the shell. As you know, the standard way to launch an application from the shell is the following: we launch a [terminal emulator](https://en.wikipedia.org/wiki/Terminal_emulator) application, write the name of the program and optionally pass arguments to our program, for example:



Let's consider what occurs when we launch an application from the shell: what the shell does when we write the program name, what the Linux kernel does, etc. But before we start to consider these interesting things, I want to warn you that this book is about the Linux kernel. That's why we will mostly look at Linux kernel internals in this part. We will not consider in detail what the shell does, and we will not consider complex cases, for example subshells, etc.

My default shell is [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29), so I will consider how the bash shell launches a program. So let's start. The `bash` shell, as well as any program written in the [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) programming language, starts from the [main](https://en.wikipedia.org/wiki/Entry_point) function. If you look at the source code of the `bash` shell, you will find the `main` function in the [shell.c](https://github.com/bminor/bash/blob/master/shell.c#L357) source code file. This function does many different things before the main thread loop of `bash` starts to work. For example, this function:

* checks for and tries to open `/dev/tty`;
* checks whether the shell is running in debug mode;
* parses command line arguments;
* reads the shell environment;
* loads `.bashrc`, `.profile` and other configuration files;
* and many many more.

After all of these operations we can see the call of the `reader_loop` function. This function is defined in the [eval.c](https://github.com/bminor/bash/blob/master/eval.c#L67) source code file and represents the main thread loop, or in other words, it reads and executes commands. As the `reader_loop` function has made all checks and read the given program name and arguments, it calls the `execute_command` function from the [execute_cmd.c](https://github.com/bminor/bash/blob/master/execute_cmd.c#L378) source code file. The `execute_command` function, through the following chain of function calls:

```
execute_command
--> execute_command_internal
----> execute_simple_command
------> execute_disk_command
--------> shell_execve
```

makes different checks, like whether we need to start a `subshell`, whether it is a builtin `bash` function, etc. As I already wrote above, we will not consider all the details about things that are not related to the Linux kernel. At the end of this process, the `shell_execve` function calls the `execve` system call:

```C
execve (command, args, env);
```

The `execve` system call has the following signature:

```
int execve(const char *filename, char *const argv [], char *const envp[]);
```

and executes a program with the given filename, with the given arguments and [environment variables](https://en.wikipedia.org/wiki/Environment_variable). This system call is the first and only one in our case, for example:

```
$ strace ls
execve("/bin/ls", ["ls"], [/* 62 vars */]) = 0

$ strace echo
execve("/bin/echo", ["echo"], [/* 62 vars */]) = 0

$ strace uname
execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0
```

So, a user application (`bash` in our case) calls the system call and, as we already know, the next step is the Linux kernel.

execve system call
--------------------------------------------------------------------------------

We saw the preparation before a system call is invoked by a user application and what happens after a system call handler finishes its work in the second [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and, as we already know, it takes three arguments:

```
SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}
```

The implementation of `execve` is pretty simple here; as we can see, it just returns the result of the `do_execve` function. The `do_execve` function is defined in the same source code file and does the following things:

* initializes two pointers to userspace data with the given arguments and environment variables;
* returns the result of `do_execveat_common`.

We can see its implementation:

```C
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
```

The `do_execveat_common` function does the main work - it executes a new program. This function takes a similar set of arguments, but as you can see it takes five arguments instead of three. The first argument is the file descriptor that represents the directory with our application; in our case `AT_FDCWD` means that the given pathname is interpreted relative to the current working directory of the calling process. The fifth argument is flags; in our case we passed `0` to `do_execveat_common`, and we will see it used in a later step.

First of all, the `do_execveat_common` function checks the `filename` pointer and returns if it is `NULL`. After this we check the flags of the current process to make sure the limit of running processes is not exceeded:

```C
if (IS_ERR(filename))
	return PTR_ERR(filename);

if ((current->flags & PF_NPROC_EXCEEDED) &&
    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
	retval = -EAGAIN;
	goto out_ret;
}

current->flags &= ~PF_NPROC_EXCEEDED;
```

If these two checks were successful, we unset the `PF_NPROC_EXCEEDED` flag in the flags of the current process to prevent failure of the `execve`. You can see that in the next step we call the `unshare_files` function, which is defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and unshares the files of the current task, and we check the result of this function:

```C
retval = unshare_files(&displaced);
if (retval)
	goto out_ret;
```

We need to call this function to eliminate a potential leak of the execve'd binary's [file descriptor](https://en.wikipedia.org/wiki/File_descriptor). In the next step we start the preparation of the `bprm`, which is represented by the `struct linux_binprm` structure (defined in the [include/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/linux/binfmts.h) header file). The `linux_binprm` structure is used to hold the arguments that are used when loading binaries. For example, it contains the `vma` field, which has the `vm_area_struct` type and represents a single memory area over a contiguous interval in a given address space where our application will be loaded; the `mm` field, which is the memory descriptor of the binary; a pointer to the top of memory; and many other fields.

First of all we allocate memory for this structure with the `kzalloc` function and check the result of the allocation:

```C
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
	goto out_files;
```

After this we start to prepare the `binprm` credentials with the call of the `prepare_bprm_creds` function:

```C
retval = prepare_bprm_creds(bprm);
if (retval)
	goto out_free;

check_unsafe_exec(bprm);
current->in_execve = 1;
```

Initialization of the `binprm` credentials, in other words, is initialization of the `cred` structure that is stored inside the `linux_binprm` structure. The `cred` structure contains the security context of a task, for example the [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID) of the task, the real [gid](https://en.wikipedia.org/wiki/Group_identifier) of the task, the `uid` and `gid` for the [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) operations, etc. In the next step, as we have prepared the `bprm` credentials, we check that we can now safely execute the program with the call of the `check_unsafe_exec` function and set the current process to the `in_execve` state.

After all of these operations we call the `do_open_execat` function, which checks the flags that we passed to the `do_execveat_common` function (remember that we have `0` in the `flags`), searches for and opens the executable file on disk, checks that we are not loading a binary from a `noexec` mount point (we need to avoid executing binaries from filesystems that do not contain executable binaries, like [proc](https://en.wikipedia.org/wiki/Procfs) or [sysfs](https://en.wikipedia.org/wiki/Sysfs)), initializes the `file` structure and returns a pointer to this structure. Next we can see the call of `sched_exec` after this:

```C
file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
	goto out_unmark;

sched_exec();
```

The `sched_exec` function is used to determine the least loaded processor that can execute the new program and to migrate the current process to it.

After this we need to check the [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) of the given executable binary. We check whether the name of our binary file starts with the `/` symbol, or whether the path of the given executable binary is interpreted relative to the current working directory of the calling process, in other words whether the file descriptor is `AT_FDCWD` (read above about this).

If one of these checks is successful, we set the binary parameter filename:

```C
bprm->file = file;

if (fd == AT_FDCWD || filename->name[0] == '/') {
	bprm->filename = filename->name;
}
```

Otherwise, if the filename is empty, we set the binary parameter filename to `/dev/fd/%d` or `/dev/fd/%d/%s` depending on the filename of the given executable binary, which means that we will execute the file to which the file descriptor refers:

```C
} else {
	if (filename->name[0] == '\0')
		pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
	else
		pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
				    fd, filename->name);
	if (!pathbuf) {
		retval = -ENOMEM;
		goto out_unmark;
	}

	bprm->filename = pathbuf;
}

bprm->interp = bprm->filename;
```

Note that we set not only `bprm->filename` but also `bprm->interp`, which will contain the name of the program interpreter. For now we just write the same name there, but later it will be updated with the real name of the program interpreter, depending on the binary format of the program. You can read above that we already prepared `cred` for the `linux_binprm`. The next step is the initialization of other fields of the `linux_binprm`. First of all we call the `bprm_mm_init` function and pass the `bprm` to it:

```C
retval = bprm_mm_init(bprm);
if (retval)
	goto out_unmark;
```

The `bprm_mm_init` function is defined in the same source code file and, as we can understand from the function's name, it initializes the memory descriptor, or in other words, it initializes the `mm_struct` structure. This structure is defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h) header file and represents the address space of a process. We will not consider the implementation of the `bprm_mm_init` function because we do not yet know many important things related to the Linux kernel memory manager; we just need to know that this function initializes `mm_struct` and populates it with a temporary stack `vm_area_struct`.

After this we calculate the count of the command line arguments that were passed to our executable binary and the count of the environment variables, and set them to `bprm->argc` and `bprm->envc` respectively:

```C
bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
	goto out;

bprm->envc = count(envp, MAX_ARG_STRINGS);
if ((retval = bprm->envc) < 0)
	goto out;
```

As you can see, we do these operations with the help of the `count` function, which is defined in the [same](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and calculates the count of strings in the `argv` array. The `MAX_ARG_STRINGS` macro is defined in the [include/uapi/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h) header file and, as we can understand from the macro's name, represents the maximum number of strings that can be passed to the `execve` system call. The value of `MAX_ARG_STRINGS`:

```C
#define MAX_ARG_STRINGS 0x7FFFFFFF
```

After we calculated the number of the command line arguments and environment variables, we call the `prepare_binprm` function. We already call the function with the similar name before this moment. This function is called `prepare_binprm_cred` and we remember that this function initializes `cred` structure in the `linux_bprm`. Now the `prepare_binprm` function:
|
||||
|
||||
```C
|
||||
retval = prepare_binprm(bprm);
|
||||
if (retval < 0)
|
||||
goto out;
|
||||
```
|
||||
|
||||
fills the `linux_binprm` structure with the `uid` from [inode](https://en.wikipedia.org/wiki/Inode) and read `128` bytes from the binary executable file. We read only first `128` from the executable file because we need to check a type of our executable. We will read the rest of the executable file in the later step. After the preparation of the `linux_bprm` structure we copy the filename of the executable binary file, command line arguments and enviroment variables to the `linux_bprm` with the call of the `copy_strings_kernel` function:
|
||||
|
||||
```C
|
||||
retval = copy_strings_kernel(1, &bprm->filename, bprm);
|
||||
if (retval < 0)
|
||||
goto out;
|
||||
|
||||
retval = copy_strings(bprm->envc, envp, bprm);
|
||||
if (retval < 0)
|
||||
goto out;
|
||||
|
||||
retval = copy_strings(bprm->argc, argv, bprm);
|
||||
if (retval < 0)
|
||||
goto out;
|
||||
```
|
||||
|
||||
And set the pointer to the top of new program's stack that we set in the `bprm_mm_init` function:
|
||||
|
||||
```C
|
||||
bprm->exec = bprm->p;
|
||||
```
|
||||
|
||||
The top of the stack will contain the program filename and we store this fileneme tothe `exec` field of the `linux_bprm` structure.
|
||||
|
||||
Now we have filled `linux_bprm` structure, we call the `exec_binprm` function:
|
||||
|
||||
```C
|
||||
retval = exec_binprm(bprm);
|
||||
if (retval < 0)
|
||||
goto out;
|
||||
```
|
||||
|
||||
First of all we store the [pid](https://en.wikipedia.org/wiki/Process_identifier) and `pid` that seen from the [namespace](https://en.wikipedia.org/wiki/Cgroups) of the current task in the `exec_binprm`:
|
||||
|
||||
```C
|
||||
old_pid = current->pid;
|
||||
rcu_read_lock();
|
||||
old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
|
||||
rcu_read_unlock();
|
||||
```
|
||||
|
||||
and call the:
|
||||
|
||||
```C
|
||||
search_binary_handler(bprm);
|
||||
```
|
||||
|
||||
function. This function goes through the list of handlers that contains different binary formats. Currently the Linux kernel supports following binary formats:
|
||||
|
||||
* `binfmt_script` - support for interpreted scripts that are starts from the [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29) line;
|
||||
* `binfmt_misc` - support differnt binary formats, according to runtime configuration of the Linux kernel;
|
||||
* `binfmt_elf` - support [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format;
|
||||
* `binfmt_aout` - support [a.out](https://en.wikipedia.org/wiki/A.out) format;
|
||||
* `binfmt_flat` - support for [flat](https://en.wikipedia.org/wiki/Binary_file#Structure) format;
|
||||
* `binfmt_elf_fdpic` - Support for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF) binaries;
|
||||
* `binfmt_em86` - support for Intel [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) binaries running on [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha) machines.
|
||||
|
||||
So, the search-binary_handler tries to call the `load_binary` function and pass `linux_binprm` to it. If the binary handler supports the given executable file format, it starts to prepare the executable binary for execution:
|
||||
|
||||
```C
|
||||
int search_binary_handler(struct linux_binprm *bprm)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
list_for_each_entry(fmt, &formats, lh) {
|
||||
retval = fmt->load_binary(bprm);
|
||||
if (retval < 0 && !bprm->mm) {
|
||||
force_sigsegv(SIGSEGV, current);
|
||||
return retval;
|
||||
}
|
||||
}
|
||||
|
||||
return retval;
|
||||
```

Where the `load_binary`, for example for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) files, checks the magic number (each `elf` binary file contains a magic number in its header) in the `linux_bprm` buffer (remember that we read the first `128` bytes from the executable binary file) and exits if it is not an `elf` binary:

```C
static int load_elf_binary(struct linux_binprm *bprm)
{
	...
	...
	...
	loc->elf_ex = *((struct elfhdr *)bprm->buf);

	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
		goto out;
```

If the given executable file is in `elf` format, `load_elf_binary` continues to execute. The `load_elf_binary` function does many different things to prepare the executable file for execution. For example it checks the architecture and type of the executable file:

```C
	if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
		goto out;
	if (!elf_check_arch(&loc->elf_ex))
		goto out;
```

and exits if the architecture is wrong or the file is neither an executable nor a shared object. It then tries to load the `program header table`:

```C
	elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
	if (!elf_phdata)
		goto out;
```

which describes the [segments](https://en.wikipedia.org/wiki/Memory_segmentation). It reads the `program interpreter` and the libraries linked with our executable binary file from disk and loads them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many other things to prepare the executable file for execution.

At the end of `load_elf_binary` we call the `start_thread` function and pass three arguments to it:

```C
	start_thread(regs, elf_entry, bprm->p);
	retval = 0;
out:
	kfree(loc);
out_ret:
	return retval;
```

These arguments are:

* the set of [registers](https://en.wikipedia.org/wiki/Processor_register) for the new task;
* the address of the entry point of the new task;
* the address of the top of the stack for the new task.

As its name suggests, this function appears to start a new thread, but that is not quite what happens. The `start_thread` function just prepares the new task's registers to be ready to run. Let's look at the implementation of this function:

```C
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	start_thread_common(regs, new_ip, new_sp,
			    __USER_CS, __USER_DS, 0);
}
```

As we can see, the `start_thread` function just makes a call to the `start_thread_common` function, which will do everything for us:

```C
static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
		    unsigned long new_sp,
		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
	loadsegment(fs, 0);
	loadsegment(es, _ds);
	loadsegment(ds, _ds);
	load_gs_index(0);
	regs->ip	= new_ip;
	regs->sp	= new_sp;
	regs->cs	= _cs;
	regs->ss	= _ss;
	regs->flags	= X86_EFLAGS_IF;
	force_iret();
}
```

The `start_thread_common` function fills the `fs` segment register with zero and `es` and `ds` with the value of the data segment register. After this we set new values for the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter), the `cs` segment etc. At the end of the `start_thread_common` function we can see the `force_iret` macro, which forces a system call return via the `iret` instruction. Ok, we have prepared the new thread to run in userspace and now we can return from `exec_binprm`, landing back in `do_execveat_common`. After `exec_binprm` finishes its execution, we release the memory for the structures that were allocated before and return.

After we return from the `execve` system call handler, execution of our program starts. We can do this because all the context-related information is already configured for this purpose. As we saw, the `execve` system call does not return control to the calling process; instead, the code, data and other segments of the caller process are simply overwritten by the segments of the new program. The exit from our application will be implemented through the `exit` system call.

That's all. From this point on, our program will be executed.

Conclusion
--------------------------------------------------------------------------------

This is the end of the fourth and last part about the system call concept in the Linux kernel. We saw almost everything related to the `system call` concept in these four parts. We started from understanding the `system call` concept: we learned what it is and why user applications need it. Next we saw how Linux handles a system call from a user application. We met two concepts similar to the `system call` concept, the `vsyscall` and the `vDSO`, and finally we saw how the Linux kernel runs a user program.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [System call](https://en.wikipedia.org/wiki/System_call)
* [shell](https://en.wikipedia.org/wiki/Unix_shell)
* [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29)
* [entry point](https://en.wikipedia.org/wiki/Entry_point)
* [C](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [environment variables](https://en.wikipedia.org/wiki/Environment_variable)
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
* [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID)
* [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [inode](https://en.wikipedia.org/wiki/Inode)
* [pid](https://en.wikipedia.org/wiki/Process_identifier)
* [namespace](https://en.wikipedia.org/wiki/Cgroups)
* [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29)
* [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
* [a.out](https://en.wikipedia.org/wiki/A.out)
* [flat](https://en.wikipedia.org/wiki/Binary_file#Structure)
* [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha)
* [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF)
* [segments](https://en.wikipedia.org/wiki/Memory_segmentation)
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html)
|
||||
441
Theory/asm.md
Normal file
441
Theory/asm.md
Normal file
@@ -0,0 +1,441 @@
Inline assembly
================================================================================

Introduction
--------------------------------------------------------------------------------

While reading source code in the [Linux kernel](https://github.com/torvalds/linux), I often see statements like this:

```C
__asm__("andq %%rsp,%0; ":"=r" (ti) : "0" (CURRENT_MASK));
```

Yes, this is [inline assembly](https://en.wikipedia.org/wiki/Inline_assembler), or in other words assembler code which is integrated into a high level programming language. In this case the high level programming language is [C](https://en.wikipedia.org/wiki/C_%28programming_language%29). Yes, the `C` programming language is not very high-level, but still.

If you are familiar with the [assembly](https://en.wikipedia.org/wiki/Assembly_language) programming language, you may notice that `inline assembly` is not very different from normal assembler. Moreover, the special form of inline assembly which is called `basic form` is exactly the same. For example:

```C
__asm__("movq %rax, %rsp");
```

or

```C
__asm__("hlt");
```

The same code (of course without the `__asm__` prefix) you might see in plain assembly code. Yes, this is very similar, but not as simple as it might seem at first glance. Actually, [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) supports two forms of inline assembly statements:

* `basic`;
* `extended`.

The basic form consists of only two things: the `__asm__` keyword and the string with valid assembler instructions. For example it may look something like this:

```C
__asm__("movq $3, %rax\t\n"
        "movq %rsi, %rdi");
```

The `asm` keyword may be used in place of `__asm__`, however `__asm__` is portable whereas the `asm` keyword is a `GNU` [extension](https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html). In further examples I will only use the `__asm__` variant.

If you know the assembly programming language this looks pretty familiar. The main problem is with the second form of inline assembly statements - `extended`. This form allows us to pass parameters to an assembly statement, perform [jumps](https://en.wikipedia.org/wiki/Branch_%28computer_science%29) etc. It does not sound difficult, but it requires knowledge of special rules in addition to knowledge of the assembly language. Every time I see yet another piece of inline assembly code in the Linux kernel, I need to refer to the official [documentation](https://gcc.gnu.org/onlinedocs/) of `GCC` to remember how a particular `qualifier` behaves or what the meaning of `=&r` is, for example.

I've decided to write this part to consolidate my knowledge related to inline assembly, as inline assembly statements are quite common in the Linux kernel and we may see them in [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) parts sometimes. I thought that it would be useful to have a special part which contains information on the more important aspects of inline assembly. Of course you may find comprehensive information about inline assembly in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C), but I like to have everything in one place.

**Note: This part will not provide a guide to assembly programming. It is not intended to teach you to write programs with assembler or to know what one or another assembler instruction means. Just a little memo for extended asm.**

Introduction to extended inline assembly
--------------------------------------------------------------------------------

So, let's start. As I already mentioned above, the `basic` assembly statement consists of the `asm` or `__asm__` keyword and a set of assembly instructions. This form is in no way different from "normal" assembly. The most interesting part is inline assembler with operands, or `extended` assembler. An extended assembly statement looks more complicated and consists of more than two parts:

```assembly
__asm__ [volatile] [goto] (AssemblerTemplate
                           [ : OutputOperands ]
                           [ : InputOperands  ]
                           [ : Clobbers       ]
                           [ : GotoLabels     ]);
```

All parameters which are marked with square brackets are optional. You may notice that if we skip the optional parameters and the modifiers `volatile` and `goto`, we obtain the `basic` form.

Let's consider this in order. The first optional `qualifier` is `volatile`. This specifier tells the compiler that an assembly statement may produce `side effects`. In this case we need to prevent compiler optimizations related to the given assembly statement. In simple terms the `volatile` specifier instructs the compiler not to modify the statement and to place it exactly where it was in the original code. As an example let's look at the following function from the [Linux kernel](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h):

```C
static inline void native_load_gdt(const struct desc_ptr *dtr)
{
	asm volatile("lgdt %0"::"m" (*dtr));
}
```

Here we see the `native_load_gdt` function, which loads a base address from the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) into the `GDTR` register with the `lgdt` instruction. This assembly statement is marked with the `volatile` qualifier. It is very important that the compiler does not change the original place of this assembly statement in the resulting code. Otherwise the `GDTR` register may contain a wrong address for the `Global Descriptor Table`, or the address may be correct but the structure has not been filled yet. This can lead to an exception being generated, preventing the kernel from booting correctly.

The second optional `qualifier` is `goto`. This qualifier tells the compiler that the given assembly statement may perform a jump to one of the labels which are listed in the `GotoLabels`. For example:

```C
__asm__ goto("jmp %l[label]" : : : label);
```

Since we have finished with these two qualifiers, let's look at the main part of an assembly statement body. As we have seen above, the main part of an assembly statement consists of the following four parts:

* the set of assembly instructions;
* output parameters;
* input parameters;
* clobbers.

The first represents a string which contains a set of valid assembly instructions which may be separated by the `\t\n` sequence. Names of processor [registers](https://en.wikipedia.org/wiki/Processor_register) must be prefixed with the `%%` sequence in the `extended` form, and other symbols like immediates must start with the `$` symbol. The `OutputOperands` and `InputOperands` are comma-separated lists of [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) variables which may be provided with "constraints", and the `Clobbers` is a list of registers or other values which are modified by the assembler instructions from the `AssemblerTemplate` beyond those listed in the `OutputOperands`. Before we dive into the examples we have to know a little bit about `constraints`. A constraint is a string which specifies the placement of an operand. For example the value of an operand may be written to a processor register or read from memory etc.

Consider the following simple example:

```C
#include <stdio.h>

int main(void)
{
	int a = 5;
	int b = 10;
	int sum = 0;

	__asm__("addl %1,%2" : "=r" (sum) : "r" (a), "0" (b));
	printf("a + b = %d\n", sum);
	return 0;
}
```

Let's compile and run it to be sure that it works as expected:

```
$ gcc test.c -o test
$ ./test
a + b = 15
```

Ok, great. It works. Now let's look at this example in detail. Here we see a simple `C` program which calculates the sum of two variables, places the result into the `sum` variable and in the end prints the result. This example consists of three parts. The first is the assembly statement with the [add](http://x86.renejeschke.de/html/file_module_x86_id_5.html) instruction. It adds the value of the source operand to the value of the destination operand and stores the result in the destination operand. In our case:

```assembly
addl %1, %2
```

will be expanded to:

```assembly
addl a, b
```

Variables and expressions which are listed in the `OutputOperands` and `InputOperands` may be matched in the `AssemblerTemplate`. An input/output operand is designated as `%N`, where `N` is the number of the operand from left to right, beginning from `zero`. The second part of our assembly statement is located after the first `:` symbol and contains the definition of the output value:

```assembly
"=r" (sum)
```

Notice that `sum` is marked with two special symbols: `=r`. This is the first constraint that we have encountered. The actual constraint here is only the `r` itself. The `=` symbol is a `modifier` which denotes an output value. This tells the compiler that the previous value will be discarded and replaced by the new data. Besides the `=` modifier, `GCC` provides support for the following three modifiers:

* `+` - an operand is both read and written by an instruction;
* `&` - an output register shouldn't overlap an input register and should be used only for output;
* `%` - tells the compiler that operands may be [commutative](https://en.wikipedia.org/wiki/Commutative_property).

Now let's go back to the `r` constraint. As I mentioned above, a constraint denotes the placement of an operand. The `r` symbol means the value will be stored in one of the [general purpose registers](https://en.wikipedia.org/wiki/Processor_register). The last part of our assembly statement:

```assembly
"r" (a), "0" (b)
```

These are the input operands - the variables `a` and `b`. We already know what the `r` constraint does. Now we can have a look at the constraint for the variable `b`. The `0`, or any other digit from `1` to `9`, is called a "matching constraint". With this a single operand can be used for multiple roles. The value of the constraint is the source operand index. In our case `0` will match `sum`. If we look at the assembly output of our program

```assembly
0000000000400400 <main>:
  400401:	ba 05 00 00 00       	mov    $0x5,%edx
  400406:	b8 0a 00 00 00       	mov    $0xa,%eax
  40040b:	01 d0                	add    %edx,%eax
```

we see that only two general purpose registers are used: `%edx` and `%eax`. This way the `%eax` register is used both for storing the value of `b` and for storing the result of the calculation. We have looked at the input and output parameters of an inline assembly statement. Before we move on to other constraints supported by `gcc`, there is one remaining part of the inline assembly statement we have not discussed yet - `clobbers`.

Clobbers
--------------------------------------------------------------------------------

As mentioned above, the "clobbered" part should contain a comma-separated list of registers whose contents will be modified by the assembler code. This is useful if our assembly expression needs additional registers for its calculation. If we add clobbered registers to the inline assembly statement, the compiler takes this into account and the registers in question will not be simultaneously used by the compiler.

Consider the example from before, but with one additional, simple assembler instruction:

```C
__asm__("movq $100, %%rdx\t\n"
        "addl %1,%2" : "=r" (sum) : "r" (a), "0" (b));
```

If we look at the assembly output

```assembly
0000000000400400 <main>:
  400400:	ba 05 00 00 00       	mov    $0x5,%edx
  400405:	b8 0a 00 00 00       	mov    $0xa,%eax
  40040a:	48 c7 c2 64 00 00 00 	mov    $0x64,%rdx
  400411:	01 d0                	add    %edx,%eax
```

we see that the `%edx` register is overwritten with `0x64` (`100`) and the result will be `115` instead of `15`. Now if we add the `%rdx` register to the list of "clobbered" registers

```C
__asm__("movq $100, %%rdx\t\n"
        "addl %1,%2" : "=r" (sum) : "r" (a), "0" (b) : "%rdx");
```

and look at the assembler output again

```assembly
0000000000400400 <main>:
  400400:	b9 05 00 00 00       	mov    $0x5,%ecx
  400405:	b8 0a 00 00 00       	mov    $0xa,%eax
  40040a:	48 c7 c2 64 00 00 00 	mov    $0x64,%rdx
  400411:	01 c8                	add    %ecx,%eax
```

the `%ecx` register is now used for the `sum` calculation, preserving the intended semantics of the program. Besides general purpose registers, we may pass two special specifiers. They are:

* `cc`;
* `memory`.

The first - `cc` - indicates that the assembler code modifies the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register. This is typically used if the assembly contains arithmetic or logic instructions:

```C
__asm__("incq %0" : "+r"(variable) : : "cc");
```

The second, the `memory` specifier, tells the compiler that the given inline assembly statement executes read/write operations on memory not specified by operands in the output list. This prevents the compiler from keeping memory values loaded and cached in registers. Let's take a look at the following example:

```C
#include <stdio.h>

int main(void)
{
	int a[3] = {10,20,30};
	int b = 5;

	__asm__ volatile("incl %0" :: "m" (a[0]));
	printf("a[0] - b = %d\n", a[0] - b);
	return 0;
}
```

This example may be artificial, but it illustrates the main idea. Here we have an array of integers and one integer variable. The example is pretty simple: we take the first element of `a` and increment its value. After this we subtract the value of `b` from the first element of `a`. In the end we print the result. If we compile and run this simple example the result may surprise you:

```
~$ gcc -O3 test.c -o test
~$ ./test
a[0] - b = 5
```

The result is `5` here, but why? We incremented `a[0]` and subtracted `b`, so the result should be `6`. If we have a look at the assembler output for this example

```assembly
00000000004004f6 <main>:
  4004f6:	c7 44 24 f0 0a 00 00 	movl   $0xa,-0x10(%rsp)
  4004fd:	00
  4004fe:	c7 44 24 f4 14 00 00 	movl   $0x14,-0xc(%rsp)
  400505:	00
  400506:	c7 44 24 f8 1e 00 00 	movl   $0x1e,-0x8(%rsp)
  40050d:	00
  40050e:	ff 44 24 f0          	incl   -0x10(%rsp)
  400512:	b8 05 00 00 00       	mov    $0x5,%eax
```

we see that the first element of `a` contains the value `0xa` (`10`). The last two lines of code are the actual calculations. We see our increment instruction with `incl`, but then just a move of `5` into the `%eax` register. This looks strange. The problem is that we passed the `-O3` flag to `gcc`, so the compiler did some constant folding and propagation to determine the result of `a[0] - 5` at compile time and reduced it to a `mov` of the constant `5` at runtime.

Let's now add `memory` to the clobbers list

```C
__asm__ volatile("incl %0" :: "m" (a[0]) : "memory");
```

and the new result of running this is

```
~$ gcc -O3 test.c -o test
~$ ./test
a[0] - b = 6
```

Now the result is correct. If we look at the assembly output again

```assembly
00000000004004f6 <main>:
  4004f6:	c7 44 24 f0 0a 00 00 	movl   $0xa,-0x10(%rsp)
  4004fd:	00
  4004fe:	c7 44 24 f4 14 00 00 	movl   $0x14,-0xc(%rsp)
  400505:	00
  400506:	c7 44 24 f8 1e 00 00 	movl   $0x1e,-0x8(%rsp)
  40050d:	00
  40050e:	ff 44 24 f0          	incl   -0x10(%rsp)
  400512:	8b 44 24 f0          	mov    -0x10(%rsp),%eax
  400516:	83 e8 05             	sub    $0x5,%eax
  400519:	c3                   	retq
```

we will see one difference, which is in the following piece of code:

```assembly
  400512:	8b 44 24 f0          	mov    -0x10(%rsp),%eax
  400516:	83 e8 05             	sub    $0x5,%eax
```

Instead of constant folding, `GCC` now preserves the calculation in the assembly and places the value of `a[0]` in the `%eax` register afterwards. In the end it just subtracts the constant value of `b`. Besides the `memory` specifier, we also see a new constraint here - `m`. This constraint tells the compiler to use the address of `a[0]` instead of its value. So, now we are finished with `clobbers` and we may continue by looking at other constraints supported by `GCC` besides `r` and `m`, which we have already seen.

Constraints
---------------------------------------------------------------------------------

Now that we are finished with all three parts of an inline assembly statement, let's return to constraints. We already saw some constraints in the previous parts, like `r` which represents a `register` operand, `m` which represents a memory operand, and `0-9` which represent a reused, indexed operand. Besides these, `GCC` provides support for other constraints. For example the `i` constraint represents an `immediate` integer operand with a known value:

```C
#include <stdio.h>

int main(void)
{
	int a = 0;

	__asm__("movl %1, %0" : "=r"(a) : "i"(100));
	printf("a = %d\n", a);
	return 0;
}
```

The result is:

```
~$ gcc test.c -o test
~$ ./test
a = 100
```

Or for example `I`, which represents an immediate 32-bit integer. The difference between `i` and `I` is that `i` is general, whereas `I` is strictly specified for 32-bit integer data. For example if you try to compile the following

```C
int test_asm(int nr)
{
	unsigned long a = 0;

	__asm__("movq %1, %0" : "=r"(a) : "I"(0xffffffffffff));
	return a;
}
```

you will get an error:

```
$ gcc -O3 test.c -o test
test.c: In function ‘test_asm’:
test.c:7:9: warning: asm operand 1 probably doesn’t match constraints
  __asm__("movq %1, %0" : "=r"(a) : "I"(0xffffffffffff));
         ^
test.c:7:9: error: impossible constraint in ‘asm’
```

while at the same time

```C
int test_asm(int nr)
{
	unsigned long a = 0;

	__asm__("movq %1, %0" : "=r"(a) : "i"(0xffffffffffff));
	return a;
}
```

works perfectly:

```
~$ gcc -O3 test.c -o test
~$ echo $?
0
```

`GCC` also supports the `J`, `K` and `N` constraints for integer constants in the range of 0-63 bits, signed 8-bit integer constants and unsigned 8-bit integer constants respectively. The `o` constraint represents a memory operand with an `offsettable` memory address. For example:

```C
#include <stdio.h>

int main(void)
{
	static unsigned long arr[3] = {0, 1, 2};
	static unsigned long element;

	__asm__ volatile("movq 16+%1, %0" : "=r"(element) : "o"(arr));
	printf("%lu\n", element);
	return 0;
}
```

The result, as expected:

```
~$ gcc -O3 test.c -o test
~$ ./test
2
```

All of these constraints may be combined (so long as they do not conflict). In this case the compiler will choose the best one for a given situation. For example:

```C
#include <stdio.h>

int a = 1;

int main(void)
{
	int b;
	__asm__ ("movl %1,%0" : "=r"(b) : "r"(a));
	return b;
}
```

will use a memory operand:

```assembly
0000000000400400 <main>:
  400400:	8b 05 26 0c 20 00    	mov    0x200c26(%rip),%eax        # 60102c <a>
```

That's about all of the commonly used constraints in inline assembly statements. You can find more in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints).

Architecture specific constraints
--------------------------------------------------------------------------------

Before we finish, let's look at a set of special constraints. These constraints are architecture specific, and as this book is specific to the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, we will look at the constraints related to it. First of all, the set of `a` ... `d` and also the `S` and `D` constraints represent [general purpose](https://en.wikipedia.org/wiki/Processor_register) registers. In this case the `a` constraint corresponds to the `%al`, `%ax`, `%eax` or `%rax` register depending on the instruction size. The `S` and `D` constraints are the `%si` and `%di` registers respectively. For example let's take our previous example. We can see in its assembly output that the value of the `a` variable is stored in the `%eax` register. Now let's look at the assembly output of the same code, but with a different constraint:

```C
#include <stdio.h>

int a = 1;

int main(void)
{
	int b;
	__asm__ ("movl %1,%0" : "=r"(b) : "d"(a));
	return b;
}
```

Now we see that the value of the `a` variable will be stored in the `%edx` register:

```assembly
0000000000400400 <main>:
  400400:	8b 15 26 0c 20 00    	mov    0x200c26(%rip),%edx        # 60102c <a>
```

The `f` and `t` constraints represent any floating point stack register - `%st` - and the top of the floating point stack respectively. The `u` constraint represents the second value from the top of the floating point stack.

That's all. You may find more details about [x86_64](https://en.wikipedia.org/wiki/X86-64) and general constraints in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints).

Links
--------------------------------------------------------------------------------

* [Linux kernel source code](https://github.com/torvalds/linux)
* [assembly programming language](https://en.wikipedia.org/wiki/Assembly_language)
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [GNU extension](https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html)
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [Processor registers](https://en.wikipedia.org/wiki/Processor_register)
* [add instruction](http://x86.renejeschke.de/html/file_module_x86_id_5.html)
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [constraints](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints)
436
Timers/timers-1.md
Normal file
@@ -0,0 +1,436 @@
|
||||
Timers and time management in the Linux kernel. Part 1.
================================================================================

Introduction
--------------------------------------------------------------------------------

This is yet another post that opens a new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) was the last part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concept, and now it is time to start a new chapter. As you can understand from the post's title, this chapter will be devoted to `timers` and `time management` in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers, and time management in general, are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks: different timeouts, for example in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation, knowing the current time, scheduling asynchronous functions, next-event interrupt scheduling and many many more.

So, we will start to learn the implementation of the different time management related stuff in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always we will start from the earliest part of the Linux kernel and go through the initialization process of the Linux kernel. We already did this in the special [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) which describes the initialization process of the Linux kernel, but as you may remember we missed some things there. And one of them is the initialization of timers.

Let's start.

Initialization of non-standard PC hardware clock
--------------------------------------------------------------------------------

After the Linux kernel was decompressed (you can read more about this in the [Kernel decompression](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) part) the architecture non-specific code starts to work in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. After initialization of the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialization of [cgroups](https://en.wikipedia.org/wiki/Cgroups) and setting the [canary](https://en.wikipedia.org/wiki/Buffer_overflow_protection) value we can see the call of the `setup_arch` function.

As you may remember, this function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file and prepares/initializes architecture-specific stuff (for example it reserves a place for the [bss](https://en.wikipedia.org/wiki/.bss) section, reserves a place for [initrd](https://en.wikipedia.org/wiki/Initrd), parses the kernel command line and many many other things). Besides this, we can find some time management related functions there.

The first is:

```C
x86_init.timers.wallclock_init();
```

We already saw the `x86_init` structure in the chapter that describes initialization of the Linux kernel. This structure contains pointers to the default setup functions for the different platforms like [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms), [Intel CE4100](http://www.wpgholdings.com/epaper/US/newsRelease_20091215/255874.pdf) and so on. The `x86_init` structure is defined in [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c#L36) and as you can see it assumes standard PC hardware by default.
As we can see, the `x86_init` structure has the `x86_init_ops` type that provides a set of functions for platform specific setup like reserving standard resources, platform specific memory setup, initialization of interrupt handlers and so on. This structure looks like:

```C
struct x86_init_ops {
	struct x86_init_resources       resources;
	struct x86_init_mpparse         mpparse;
	struct x86_init_irqs            irqs;
	struct x86_init_oem             oem;
	struct x86_init_paging          paging;
	struct x86_init_timers          timers;
	struct x86_init_iommu           iommu;
	struct x86_init_pci             pci;
};
```

We can note the `timers` field that has the `x86_init_timers` type and, as we can understand by its name, this field is related to time management and timers. The `x86_init_timers` contains four fields which are all pointers to functions that return [void](https://en.wikipedia.org/wiki/Void_type):

* `setup_percpu_clockev` - set up the per cpu clock event device for the boot cpu;
* `tsc_pre_init` - platform function called before [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) init;
* `timer_init` - initialize the platform timer;
* `wallclock_init` - initialize the wallclock device.

So, as we already know, in our case the `wallclock_init` executes initialization of the wallclock device. If we look at the `x86_init` structure, we will see that `wallclock_init` points to the `x86_init_noop`:

```C
struct x86_init_ops x86_init __initdata = {
	...
	...
	...
	.timers = {
		.wallclock_init		= x86_init_noop,
	},
	...
	...
	...
}
```

Where the `x86_init_noop` is just a function that does nothing:

```C
void __cpuinit x86_init_noop(void) { }
```

for the standard PC hardware. Actually, the `wallclock_init` function is used on the [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms) platform. Initialization of the `x86_init.timers.wallclock_init` is located in the [arch/x86/platform/intel-mid/intel-mid.c](https://github.com/torvalds/linux/blob/master/arch/x86/platform/intel-mid/intel-mid.c) source code file, in the `x86_intel_mid_early_setup` function:

```C
void __init x86_intel_mid_early_setup(void)
{
	...
	...
	...
	x86_init.timers.wallclock_init = intel_mid_rtc_init;
	...
	...
	...
}
```

Implementation of the `intel_mid_rtc_init` function is in the [arch/x86/platform/intel-mid/intel_mid_vrtc.c](https://github.com/torvalds/linux/blob/master/arch/x86/platform/intel-mid/intel_mid_vrtc.c) source code file and looks pretty easy. First of all, this function parses the [Simple Firmware Interface](https://en.wikipedia.org/wiki/Simple_Firmware_Interface) M-Real-Time-Clock table to collect such devices into the `sfi_mrtc_array` array, and then initializes the `set_time` and `get_time` functions:

```C
void __init intel_mid_rtc_init(void)
{
	unsigned long vrtc_paddr;

	sfi_table_parse(SFI_SIG_MRTC, NULL, NULL, sfi_parse_mrtc);

	vrtc_paddr = sfi_mrtc_array[0].phys_addr;
	if (!sfi_mrtc_num || !vrtc_paddr)
		return;

	vrtc_virt_base = (void __iomem *)set_fixmap_offset_nocache(FIX_LNW_VRTC,
								vrtc_paddr);

	x86_platform.get_wallclock = vrtc_get_time;
	x86_platform.set_wallclock = vrtc_set_mmss;
}
```

That's all; after this a device based on `Intel MID` will be able to get the time from the hardware clock. As I already wrote, the standard PC [x86_64](https://en.wikipedia.org/wiki/X86-64) platform keeps the `x86_init_noop` stub here and so does nothing during the call of this function. We just saw initialization of the [real time clock](https://en.wikipedia.org/wiki/Real-time_clock) for the [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms) platform and now it is time to return to the general `x86_64` architecture and look at the time management related stuff there.
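
The default-stub-plus-early-override pattern used here is easy to model outside the kernel. Here is a minimal user-space sketch of the idea (names like `fake_intel_mid_early_setup` and `mid_rtc_init` are made up for illustration, they are not kernel symbols):

```c
/* A hook type like the kernel's x86_init_timers.wallclock_init. */
typedef void (*wallclock_init_t)(void);

/* Default stub, analogous to x86_init_noop: does nothing. */
static void noop_init(void) { }

/* Platform-specific hook; the flag lets us observe that it ran. */
static int mid_rtc_initialized;
static void mid_rtc_init(void) { mid_rtc_initialized = 1; }

struct timers_ops {
    wallclock_init_t wallclock_init;
};

/* Standard PC default: the no-op stub. */
static struct timers_ops timers = { .wallclock_init = noop_init };

/* Early platform setup swaps in the real implementation,
 * just as x86_intel_mid_early_setup() does in the kernel. */
static void fake_intel_mid_early_setup(void)
{
    timers.wallclock_init = mid_rtc_init;
}
```

On the "standard PC" path the hook stays `noop_init` and the later `timers.wallclock_init()` call is harmless; only the platform that cares pays for a real implementation.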
Acquainted with jiffies
--------------------------------------------------------------------------------

If we return to the `setup_arch` function which is located, as you remember, in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file, we will see the next call of a time management related function:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

Before we look at the implementation of this function, we must know about the [jiffy](https://en.wikipedia.org/wiki/Jiffy_%28time%29). As we can read on wikipedia:

```
Jiffy is an informal term for any unspecified short period of time
```

This definition is very similar to the `jiffy` in the Linux kernel. There is a global variable named `jiffies` which holds the number of ticks that have occurred since the system booted. The Linux kernel sets this variable to zero:

```C
extern unsigned long volatile __jiffy_data jiffies;
```

during the initialization process. This global variable is incremented during each timer interrupt. Besides this, near the `jiffies` variable we can see the definition of a similar variable

```C
extern u64 jiffies_64;
```

Actually only one of these variables is in use in the Linux kernel, and it depends on the processor type. For [x86_64](https://en.wikipedia.org/wiki/X86-64) the `u64` variable is used and for [x86](https://en.wikipedia.org/wiki/X86) the `unsigned long` one. We can see this if we look at the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) linker script:

```
#ifdef CONFIG_X86_32
...
jiffies = jiffies_64;
...
#else
...
jiffies_64 = jiffies;
...
#endif
```

In the case of `x86_32` the `jiffies` will be the lower `32` bits of the `jiffies_64` variable. Schematically, we can imagine it as follows:

```
                     jiffies_64
+-----------------------------------------------------+
|                       |                             |
|                       |                             |
|                       |     jiffies on `x86_32`     |
|                       |                             |
|                       |                             |
+-----------------------------------------------------+
63                      31                            0
```
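
This low-32-bits aliasing can be illustrated with a tiny user-space sketch (a model of the layout, not kernel code; the variable names and the sample value are invented): on a little-endian machine, truncating the 64-bit value yields exactly the low 32 bits that `x86_32` sees through the `jiffies` alias.

```c
#include <stdint.h>

/* Model of jiffies_64 with an arbitrary value for demonstration. */
static uint64_t fake_jiffies_64 = 0x123456789abcdef0ULL;

/* What an x86_32 kernel observes through the `jiffies` alias:
 * the lower 32 bits of the 64-bit counter. */
static uint32_t fake_jiffies(void)
{
    return (uint32_t)fake_jiffies_64;
}
```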
Now we know a little theory about `jiffies` and we can return to our function. There is no architecture-specific implementation for our function - `register_refined_jiffies`. This function is located in the generic kernel code - the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file. The main point of the `register_refined_jiffies` is registration of the jiffy `clocksource`. Before we look at the implementation of the `register_refined_jiffies` function, we must know what a `clocksource` is. As we can read in the comments:

```
The `clocksource` is hardware abstraction for a free-running counter.
```

I'm not sure about you, but that description didn't give me a good understanding of the `clocksource` concept. Let's try to understand what it is, but we will not go deep because this topic will be described in a separate part in much more detail. The main point of the `clocksource` is the timekeeping abstraction or in very simple words - it provides a time value to the kernel. We already know about the `jiffies` interface that represents the number of ticks that have occurred since the system booted. It is represented by a global variable in the Linux kernel and incremented on each timer interrupt. The Linux kernel can use `jiffies` for time measurement. So why do we need a separate concept like the `clocksource`? Actually, different hardware devices provide different clock sources that vary widely in their capabilities. The availability of more precise techniques for time interval measurement is hardware-dependent.

For example `x86` has an on-chip 64-bit counter that is called the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) and its frequency can be equal to the processor frequency. Or for example the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) that consists of a `64-bit` counter of at least `10 MHz` frequency. Two different timers and they are both for `x86`. If we add timers from other architectures, this only makes the problem more complex. The Linux kernel provides the `clocksource` concept to solve this problem.

The clocksource concept is represented by the `clocksource` structure in the Linux kernel. This structure is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and contains a couple of fields that describe a time counter. For example it contains the `name` field which is the name of a counter, the `flags` field that describes different properties of a counter, pointers to the `suspend` and `resume` functions, and many more.

Let's look at the `clocksource` structure for jiffies that is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file:

```C
static struct clocksource clocksource_jiffies = {
	.name		= "jiffies",
	.rating		= 1,
	.read		= jiffies_read,
	.mask		= 0xffffffff,
	.mult		= NSEC_PER_JIFFY << JIFFIES_SHIFT,
	.shift		= JIFFIES_SHIFT,
	.max_cycles	= 10,
};
```

We can see the definition of the default name here - `jiffies`. The next is the `rating` field, which allows the best registered clock source available for the specified hardware to be chosen by the clock source management code. The `rating` may have the following values:

* `1-99` - Only available for bootup and testing purposes;
* `100-199` - Functional for real use, but not desired;
* `200-299` - A correct and usable clocksource;
* `300-399` - A reasonably fast and accurate clocksource;
* `400-499` - The ideal clocksource. A must-use where available.

For example the rating of the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) is `300`, but the rating of the [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is `250`. The next field is `read` - a pointer to the function that allows reading the clocksource's cycle value, or in other words it just returns the `jiffies` variable with the `cycle_t` type:

```C
static cycle_t jiffies_read(struct clocksource *cs)
{
	return (cycle_t) jiffies;
}
```

which is just a 64-bit unsigned type:

```C
typedef u64 cycle_t;
```

The next field is the `mask` value that ensures that subtraction between counter values from non `64 bit` counters does not need special overflow logic. In our case the mask is `0xffffffff`, i.e. `32` bits. This means that counter deltas wrap around after `4294967295` cycles:

```python
>>> 0xffffffff
4294967295
```
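
A useful consequence of the `mask` field is that the delta between two counter samples comes out right even across a wrap, as long as the subtraction is reduced by the mask. A small sketch of that idea (not the kernel's actual helper):

```c
#include <stdint.h>

/* Delta between two samples of a counter that is only `mask` bits
 * wide: unsigned subtraction followed by masking handles wraparound
 * without any special overflow logic. */
static uint64_t cyc_delta(uint64_t now, uint64_t then, uint64_t mask)
{
    return (now - then) & mask;
}
```

For a 32-bit counter that wrapped from `0xfffffffe` to `5`, the masked delta is still the true `7` elapsed cycles.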
The next two fields, `mult` and `shift`, are used to convert the clocksource's period to nanoseconds per cycle. When the kernel calls the `clocksource.read` function, this function returns a value in `machine` time units represented by the `cycle_t` data type that we just saw. To convert this return value to [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) we need these two fields: `mult` and `shift`. The `clocksource` provides the `clocksource_cyc2ns` function that will do it for us with the following expression:

```C
((u64) cycles * mult) >> shift;
```

As we can see, the `mult` field is equal to:

```C
NSEC_PER_JIFFY << JIFFIES_SHIFT

#define NSEC_PER_JIFFY	((NSEC_PER_SEC+HZ/2)/HZ)
#define NSEC_PER_SEC	1000000000L
```

by default, and the `shift` is:

```C
#if HZ < 34
  #define JIFFIES_SHIFT	6
#elif HZ < 67
  #define JIFFIES_SHIFT	7
#else
  #define JIFFIES_SHIFT	8
#endif
```

The `jiffies` clock source uses the `NSEC_PER_JIFFY` multiplier conversion to specify the nanoseconds over cycle ratio. Note that the values of `JIFFIES_SHIFT` and `NSEC_PER_JIFFY` depend on the `HZ` value. The `HZ` represents the frequency of the system timer. This macro is defined in [include/asm-generic/param.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/param.h) and depends on the `CONFIG_HZ` kernel configuration option. The value of `HZ` differs for each supported architecture, but for `x86` it's defined like:

```C
#define HZ		CONFIG_HZ
```

Where `CONFIG_HZ` can be one of the following values:



This means that in our case the timer interrupt frequency is `250 HZ`, i.e. it occurs `250` times per second or one timer interrupt each `4ms`.
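
We can check these numbers with a small stand-alone sketch. Assuming `HZ = 250` (so `JIFFIES_SHIFT` is `8` and `NSEC_PER_JIFFY` works out to `4000000`), the `((u64) cycles * mult) >> shift` conversion turns one jiffy into exactly `4` milliseconds:

```c
#include <stdint.h>

#define HZ             250
#define NSEC_PER_SEC   1000000000L
#define NSEC_PER_JIFFY ((NSEC_PER_SEC + HZ/2) / HZ)  /* 4000000 ns */
#define JIFFIES_SHIFT  8                             /* since HZ >= 67 */

/* Same expression as clocksource_cyc2ns(), with the jiffies
 * clocksource's mult and shift plugged in. */
static uint64_t jiffies_cyc2ns(uint64_t cycles)
{
    uint64_t mult = (uint64_t)NSEC_PER_JIFFY << JIFFIES_SHIFT;
    return (cycles * mult) >> JIFFIES_SHIFT;
}
```

One cycle (one tick) maps to 4,000,000 ns = 4 ms, and `HZ` cycles map back to one full second.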
The last field that we can see in the definition of the `clocksource_jiffies` structure is `max_cycles`, which holds the maximum cycle value that can safely be multiplied without potentially causing an overflow.

Ok, we just saw the definition of the `clocksource_jiffies` structure, and we also know a little about `jiffies` and `clocksource`; now it is time to get back to the implementation of our function. In the beginning of this part we stopped on the call of the:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file.

As I already wrote, the main purpose of the `register_refined_jiffies` function is to register the `refined_jiffies` clocksource. We already saw that the `clocksource_jiffies` structure represents the standard `jiffies` clock source. Now, if you look in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file, you will find yet another clock source definition:

```C
struct clocksource refined_jiffies;
```

There is one difference between `refined_jiffies` and `clocksource_jiffies`: the standard `jiffies` based clock source is the lowest common denominator clock source which should function on all systems. As we already know, the `jiffies` global variable is incremented on each timer interrupt. This means that the standard `jiffies` based clock source has the same resolution as the timer interrupt frequency. From this we can understand that the standard `jiffies` based clock source may suffer from inaccuracies. The `refined_jiffies` uses `CLOCK_TICK_RATE` as the base of the `jiffies` shift.

Let's look at the implementation of this function. First of all we can see that the `refined_jiffies` clock source is based on the `clocksource_jiffies` structure:

```C
int register_refined_jiffies(long cycles_per_second)
{
	u64 nsec_per_tick, shift_hz;
	long cycles_per_tick;

	refined_jiffies = clocksource_jiffies;
	refined_jiffies.name = "refined-jiffies";
	refined_jiffies.rating++;
	...
	...
	...
```

Here we can see that we update the name of the `refined_jiffies` to `refined-jiffies` and increase the rating of this structure. As you remember, the `clocksource_jiffies` has rating `1`, so our `refined_jiffies` clocksource will have rating `2`. This means that the `refined_jiffies` will be the better selection for the clock source management code.

In the next step we need to calculate the number of cycles per one tick:

```C
cycles_per_tick = (cycles_per_second + HZ/2)/HZ;
```

Note that we have used the `NSEC_PER_SEC` macro as the base of the standard `jiffies` multiplier. Here we are using the `cycles_per_second` which is the first parameter of the `register_refined_jiffies` function. We've passed the `CLOCK_TICK_RATE` macro to the `register_refined_jiffies` function. This macro is defined in the [arch/x86/include/asm/timex.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/timex.h) header file and expands to:

```C
#define CLOCK_TICK_RATE		PIT_TICK_RATE
```

where the `PIT_TICK_RATE` macro expands to the frequency of the [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) programmable interval timer:

```C
#define PIT_TICK_RATE 1193182ul
```

After this we calculate `shift_hz` for the `register_refined_jiffies` that will store `hz << 8`, or in other words the frequency of the system timer. We shift the `cycles_per_second`, the frequency of the programmable interval timer, left by `8` in order to get extra accuracy:

```C
shift_hz = (u64)cycles_per_second << 8;
shift_hz += cycles_per_tick/2;
do_div(shift_hz, cycles_per_tick);
```

In the next step we calculate the number of nanoseconds per one tick by shifting the `NSEC_PER_SEC` left by `8` too, as we did with the `shift_hz`, and do the same calculation as before:

```C
nsec_per_tick = (u64)NSEC_PER_SEC << 8;
nsec_per_tick += (u32)shift_hz/2;
do_div(nsec_per_tick, (u32)shift_hz);
```
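
To see what these fixed-point divisions actually produce, here is a user-space replay of the arithmetic for `cycles_per_second = PIT_TICK_RATE` and an assumed `HZ` of `250` (the kernel's `do_div` is replaced by plain division; this is a sketch, not kernel code):

```c
#include <stdint.h>

#define HZ            250
#define NSEC_PER_SEC  1000000000ULL
#define PIT_TICK_RATE 1193182UL

/* Replays the register_refined_jiffies() calculations and returns
 * the refined nanoseconds-per-tick value. */
static uint64_t refined_nsec_per_tick(long cycles_per_second)
{
    long cycles_per_tick = (cycles_per_second + HZ/2) / HZ;
    uint64_t shift_hz, nsec_per_tick;

    shift_hz = (uint64_t)cycles_per_second << 8;   /* HZ << 8, refined */
    shift_hz += cycles_per_tick / 2;
    shift_hz /= cycles_per_tick;

    nsec_per_tick = NSEC_PER_SEC << 8;             /* ns/tick << 8 */
    nsec_per_tick += (uint32_t)shift_hz / 2;
    nsec_per_tick /= (uint32_t)shift_hz;

    return nsec_per_tick;
}
```

For the PIT this yields `4000250`: a refined tick of about `4.00025` ms instead of the nominal `4` ms, which is exactly the kind of inaccuracy `refined_jiffies` exists to correct.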
Finally, we update the `mult` field of the `refined_jiffies` with the refined nanoseconds-per-tick value:

```C
refined_jiffies.mult = ((u32)nsec_per_tick) << JIFFIES_SHIFT;
```

At the end of the `register_refined_jiffies` function we register the new clock source with the `__clocksource_register` function that is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and return:

```C
__clocksource_register(&refined_jiffies);
return 0;
```

The clock source management code provides the API for clock source registration and selection. As we can see, clock sources are registered by calling the `__clocksource_register` function during kernel initialization or from a kernel module. During registration, the clock source management code will choose the best clock source available in the system using the `clocksource.rating` field which we already saw when we initialized the `clocksource` structure for `jiffies`.
Using the jiffies
--------------------------------------------------------------------------------

We just saw the initialization of two `jiffies` based clock sources in the previous paragraph:

* standard `jiffies` based clock source;
* refined `jiffies` based clock source.

Don't worry if you don't understand the calculations here. They look frightening at first. Soon, step by step, we will learn these things. So, we just saw the initialization of the `jiffies` based clock sources and we also know that the Linux kernel has the global variable `jiffies` that holds the number of ticks that have occurred since the kernel started to work. Now, let's look at how to use it. To use `jiffies` we can just use the `jiffies` global variable by its name or call the `get_jiffies_64` function. This function is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file and just returns the full `64-bit` value of the `jiffies`:

```C
u64 get_jiffies_64(void)
{
	unsigned long seq;
	u64 ret;

	do {
		seq = read_seqbegin(&jiffies_lock);
		ret = jiffies_64;
	} while (read_seqretry(&jiffies_lock, seq));
	return ret;
}
EXPORT_SYMBOL(get_jiffies_64);
```

Note that the `get_jiffies_64` function is not implemented the way `jiffies_read` is, for example:

```C
static cycle_t jiffies_read(struct clocksource *cs)
{
	return (cycle_t) jiffies;
}
```

We can see that the implementation of `get_jiffies_64` is more complex. The reading of the `jiffies_64` variable is implemented using [seqlocks](https://en.wikipedia.org/wiki/Seqlock). Actually this is done for machines that cannot atomically read the full 64-bit value.
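
The seqlock protocol behind `read_seqbegin`/`read_seqretry` can be sketched with a toy single-threaded model (the real kernel primitives add memory barriers and spin while a write is in flight; this only shows the retry logic):

```c
#include <stdint.h>

static unsigned int seq;       /* even: stable, odd: write in flight */
static uint64_t value64;       /* stands in for jiffies_64 */

static void writer_update(uint64_t v)
{
    seq++;                     /* becomes odd: readers must retry */
    value64 = v;
    seq++;                     /* becomes even again: value stable */
}

static uint64_t reader_read(void)
{
    unsigned int s;
    uint64_t v;

    do {
        s = seq;
        v = value64;
    } while (s != seq || (s & 1));  /* retry if a write intervened */
    return v;
}
```

This is why a 32-bit machine can hand out a consistent 64-bit `jiffies` value without needing an atomic 64-bit load.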
If we can access the `jiffies` or the `jiffies_64` variable, we can convert it to `human` time units. To get one second we can use the following expression:

```C
jiffies / HZ
```

So, if we know this, we can get any time unit. For example:

```C
/* Thirty seconds from now */
jiffies + 30*HZ

/* Two minutes from now */
jiffies + 120*HZ

/* One millisecond from now */
jiffies + HZ / 1000
```
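
The kernel also wraps such conversions in helpers like `jiffies_to_msecs()` and `msecs_to_jiffies()`. A simplified sketch of both, assuming an `HZ` that divides `1000` evenly (the real helpers also handle the other `HZ` cases and overflow):

```c
#define HZ 250                      /* assumed config value */

/* Milliseconds covered by j ticks: each tick is (1000 / HZ) ms. */
static unsigned int sketch_jiffies_to_msecs(unsigned long j)
{
    return (unsigned int)(j * (1000 / HZ));
}

/* Ticks needed to cover m milliseconds, rounded up so a timeout
 * never fires early. */
static unsigned long sketch_msecs_to_jiffies(unsigned int m)
{
    return (m + (1000 / HZ) - 1) / (1000 / HZ);
}
```

Note the rounding direction: timeouts round up to whole ticks, so a `1 ms` timeout at `HZ = 250` still costs one full `4 ms` tick.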
That's all.

Conclusion
--------------------------------------------------------------------------------

This concludes the first part covering time and time management related concepts in the Linux kernel. We met the first two concepts and their initialization in this part: `jiffies` and `clocksource`. In the next part we will continue to dive into this interesting theme, and as I already wrote in this part, we will try to understand the internals of these and other time management concepts in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [system call](https://en.wikipedia.org/wiki/System_call)
* [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol)
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [cgroups](https://en.wikipedia.org/wiki/Cgroups)
* [bss](https://en.wikipedia.org/wiki/.bss)
* [initrd](https://en.wikipedia.org/wiki/Initrd)
* [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms)
* [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [void](https://en.wikipedia.org/wiki/Void_type)
* [Simple Firmware Interface](https://en.wikipedia.org/wiki/Simple_Firmware_Interface)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [real time clock](https://en.wikipedia.org/wiki/Real-time_clock)
* [Jiffy](https://en.wikipedia.org/wiki/Jiffy_%28time%29)
* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
* [seqlocks](https://en.wikipedia.org/wiki/Seqlock)
* [clocksource documentation](https://www.kernel.org/doc/Documentation/timers/timekeeping.txt)
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html)
451
Timers/timers-2.md
Normal file
@@ -0,0 +1,451 @@
Timers and time management in the Linux kernel. Part 2.
================================================================================

Introduction to the `clocksource` framework
--------------------------------------------------------------------------------

The previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) was the first part in the current [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:

* `jiffies`
* `clocksource`

The first is a global variable that is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file and represents the counter that is incremented during each timer interrupt. So if we can access this global variable and we know the timer interrupt rate, we can convert `jiffies` to human time units. As we already know, the timer interrupt rate is represented by the compile-time constant that is called `HZ` in the Linux kernel. The value of `HZ` is equal to the value of the `CONFIG_HZ` kernel configuration option, and if we look into the [arch/x86/configs/x86_64_defconfig](https://github.com/torvalds/linux/blob/master/arch/x86/configs/x86_64_defconfig) kernel configuration file, we will see that the:

```
CONFIG_HZ_1000=y
```

kernel configuration option is set. This means that the value of `CONFIG_HZ` will be `1000` by default for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. So, if we divide the value of `jiffies` by the value of `HZ`:

```
jiffies / HZ
```

we will get the amount of seconds that elapsed since the moment the Linux kernel started to work, or in other words we will get the system [uptime](https://en.wikipedia.org/wiki/Uptime). Since `HZ` represents the amount of timer interrupts in a second, we can set a value for some time in the future. For example:

```C
/* one minute from now */
unsigned long later = jiffies + 60*HZ;

/* five minutes from now */
unsigned long later = jiffies + 5*60*HZ;
```

This is a very common practice in the Linux kernel. For example, if you look into the [arch/x86/kernel/smpboot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/smpboot.c) source code file, you will find the `do_boot_cpu` function. This function boots all application processors besides the bootstrap processor. In it, you can find a snippet that waits ten seconds for a response from the application processor:

```C
if (!boot_error) {
	timeout = jiffies + 10*HZ;
	while (time_before(jiffies, timeout)) {
		...
		...
		...
		udelay(100);
	}
	...
	...
	...
}
```

We assign the `jiffies + 10*HZ` value to the `timeout` variable here. As I think you already understood, this means a ten second timeout. After this we enter a loop where we use the `time_before` macro to compare the current `jiffies` value and our timeout.
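
The `time_before` macro is worth a closer look: it compares through a signed subtraction, so it keeps giving the right answer even after `jiffies` wraps around. A stand-alone sketch of the trick (a model of the macro's idea, not the kernel's exact definition):

```c
#include <stdbool.h>

/* Same idea as the kernel's time_before(a, b): true if a is before b,
 * relying on unsigned subtraction plus a signed cast so that
 * wraparound of the counter does not break the comparison. */
static bool sketch_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}
```

Even when `a` was sampled just before the counter wrapped and `b` just after, the cast of the wrapped difference stays negative, so the comparison remains correct.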
|
||||
|
||||
Or for example if we look into the [sound/isa/sscape.c](https://github.com/torvalds/linux/blob/master/sound/isa/sscape) source code file which represents the driver for the [Ensoniq Soundscape Elite](https://en.wikipedia.org/wiki/Ensoniq_Soundscape_Elite) sound card, we will see the `obp_startup_ack` function that waits upto a given timeout for the On-Board Processor to return its start-up acknowledgement sequence:
|
||||
|
||||
```C
|
||||
static int obp_startup_ack(struct soundscape *s, unsigned timeout)
|
||||
{
|
||||
unsigned long end_time = jiffies + msecs_to_jiffies(timeout);
|
||||
|
||||
do {
|
||||
...
|
||||
...
|
||||
...
|
||||
x = host_read_unsafe(s->io_base);
|
||||
...
|
||||
...
|
||||
...
|
||||
if (x == 0xfe || x == 0xff)
|
||||
return 1;
|
||||
msleep(10);
|
||||
} while (time_before(jiffies, end_time));
|
||||
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
As you can see, the `jiffies` variable is very widely used in the Linux kernel [code](http://lxr.free-electrons.com/ident?i=jiffies). As I already wrote, we met yet another new time management related concept in the previous part - `clocksource`. We have only seen a short description of this concept and the API for a clock source registration. Let's take a closer look in this part.
|
||||
|
||||
Introduction to `clocksource`
--------------------------------------------------------------------------------

The `clocksource` concept represents the generic API for clock source management in the Linux kernel. Why do we need a separate framework for this? Let's go back to the beginning. The concept of `time` is fundamental for the Linux kernel, as for any other operating system kernel, and timekeeping is one of the necessities for using it. For example, the Linux kernel must know and update the time elapsed since system startup, it must determine how long the current process has been running on each processor, and much more. Where can the Linux kernel get information about time? First of all there is the Real Time Clock or [RTC](https://en.wikipedia.org/wiki/Real-time_clock), which is a nonvolatile device. You can find a set of architecture-independent real time clock drivers in the Linux kernel in the [drivers/rtc](https://github.com/torvalds/linux/tree/master/drivers/rtc) directory. Besides this, each architecture can provide a driver for an architecture-dependent real time clock, for example `CMOS/RTC` - [arch/x86/kernel/rtc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/rtc.c) for the [x86](https://en.wikipedia.org/wiki/X86) architecture. The second source is the system timer - a timer that raises [interrupts](https://en.wikipedia.org/wiki/Interrupt) at a periodic rate. For example, on [IBM PC](https://en.wikipedia.org/wiki/IBM_Personal_Computer) compatibles it was the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer).

We already know that for timekeeping purposes we can use `jiffies` in the Linux kernel. The `jiffies` can be considered a read-only global variable which is updated with `HZ` frequency. We know that `HZ` is a compile-time kernel parameter whose reasonable range is from `100` to `1000` [Hz](https://en.wikipedia.org/wiki/Hertz). So, it is only guaranteed to provide an interface for time measurement with `1` - `10` milliseconds resolution. Besides standard `jiffies`, we saw the `refined_jiffies` clock source in the previous part that is based on the `i8253/i8254` [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) tick rate, which is almost `1193182` hertz. So with `refined_jiffies` we can get a resolution of roughly `1` microsecond. These days, [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) are the favorite choice for the time value units of a clock source.

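To make the resolution numbers above concrete: one tick of a `HZ`-based interface lasts `NSEC_PER_SEC / HZ` nanoseconds. The following is a minimal user-space sketch, not kernel code; the function name is illustrative:

```C
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Length of one tick in nanoseconds for a given HZ value. */
static uint64_t tick_length_ns(uint32_t hz)
{
    return NSEC_PER_SEC / hz;
}
```

For `HZ` between `100` and `1000`, this gives exactly the `1` - `10` milliseconds range mentioned above.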
The availability of more precise techniques for time interval measurement is hardware-dependent. We have only learned a little about the `x86`-dependent timer hardware, but each architecture provides its own timer hardware, and earlier each architecture had its own implementation for this purpose. The solution to this problem is an abstraction layer and associated API in a common code framework for managing various clock sources, independent of the timer interrupt. This common code framework became the `clocksource` framework.

The generic timeofday and clock source management framework moved a lot of timekeeping code into the architecture-independent portion of the code, with the architecture-dependent portion reduced to defining and managing low-level hardware pieces of clocksources. Measuring a time interval on different architectures with different hardware takes a large amount of effort and is very complex. The implementation of each clock related service is strongly associated with an individual hardware device and, as you can understand, this results in similar implementations for different architectures.

Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. As we can see in the Linux kernel code, nanoseconds are the favorite choice for the time value units of a clock source nowadays. One of the main points of the clock source framework is to allow a user to select a clock source among a range of available hardware devices supporting clock functions when configuring the system, and to select, access and scale different clock sources.

The clocksource structure
--------------------------------------------------------------------------------

The fundamental of the `clocksource` framework is the `clocksource` structure that is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file. We already saw some fields that are provided by the `clocksource` structure in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html). Let's look at the full definition of this structure and try to describe all of its fields:

```C
struct clocksource {
	cycle_t (*read)(struct clocksource *cs);
	cycle_t mask;
	u32 mult;
	u32 shift;
	u64 max_idle_ns;
	u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
	struct arch_clocksource_data archdata;
#endif
	u64 max_cycles;
	const char *name;
	struct list_head list;
	int rating;
	int (*enable)(struct clocksource *cs);
	void (*disable)(struct clocksource *cs);
	unsigned long flags;
	void (*suspend)(struct clocksource *cs);
	void (*resume)(struct clocksource *cs);
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
	struct list_head wd_list;
	cycle_t cs_last;
	cycle_t wd_last;
#endif
	struct module *owner;
} ____cacheline_aligned;
```

We already saw the first field of the `clocksource` structure in the previous part - it is a pointer to the `read` function that returns the current value of the counter provided by a clock source. For example, we use the `jiffies_read` function to read the `jiffies` value:

```C
static struct clocksource clocksource_jiffies = {
	...
	.read = jiffies_read,
	...
};
```

where `jiffies_read` just returns:

```C
static cycle_t jiffies_read(struct clocksource *cs)
{
	return (cycle_t) jiffies;
}
```

Or the `read_tsc` function:

```C
static struct clocksource clocksource_tsc = {
	...
	.read = read_tsc,
	...
};
```

which reads the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter).

The next field is `mask`, which ensures that subtraction between counter values from non-`64-bit` counters does not need special overflow logic. After the `mask` field, we can see two fields: `mult` and `shift`. These are the fields that are the base of the mathematical functions that provide the ability to convert time values specific to each clock source. In other words, these two fields help us to convert the abstract machine time units of a counter to nanoseconds.

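To illustrate how `mask`, `mult` and `shift` cooperate, here is a minimal user-space sketch; this is not kernel code, and the helper names are illustrative:

```C
#include <stdint.h>

typedef uint64_t cycle_t;

/* The mask makes subtraction safe for counters narrower than 64 bits:
 * even when the counter wraps around, (now - last) & mask still yields
 * the number of elapsed cycles. */
static cycle_t clocksource_delta(cycle_t now, cycle_t last, cycle_t mask)
{
    return (now - last) & mask;
}

/* Convert abstract counter cycles to nanoseconds with the mult/shift pair. */
static uint64_t cycles_to_ns(cycle_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}
```

Note how the `mask` keeps the delta correct even when a `24-bit` counter wraps around: with `mask = 0xffffff`, `last = 0xfffffe` and `now = 0x000001`, the delta is `3` cycles.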
After these two fields we can see the `64` bit `max_idle_ns` field, which represents the maximum idle time permitted by the clocksource, in nanoseconds. This field is needed for a Linux kernel with the `CONFIG_NO_HZ` kernel configuration option enabled. This kernel configuration option allows the Linux kernel to run without a regular timer tick (we will see a full explanation of this in another part). The problem is that a dynamic tick allows the kernel to sleep for periods longer than a single tick; moreover, the sleep time could be unlimited. The `max_idle_ns` field represents this sleeping limit.

The next field after `max_idle_ns` is the `maxadj` field, which is the maximum adjustment value to `mult`. The main formula by which we convert cycles to nanoseconds:

```C
((u64) cycles * mult) >> shift;
```

is not `100%` accurate. Instead, the number is taken as close as possible to a nanosecond, and `maxadj` helps to correct this and allows the clocksource API to avoid `mult` values that might overflow when adjusted. The next four fields are pointers to functions:

* `enable` - optional function to enable the clocksource;
* `disable` - optional function to disable the clocksource;
* `suspend` - suspend function for the clocksource;
* `resume` - resume function for the clocksource.

The next field is `max_cycles` and, as we can understand from its name, this field represents the maximum cycle value before a potential overflow. And the last field, `owner`, represents a reference to the kernel [module](https://en.wikipedia.org/wiki/Loadable_kernel_module) that is the owner of a clocksource. That is all. We just went through all the standard fields of the `clocksource` structure. But you may have noticed that we missed some fields of the `clocksource` structure. We can divide all of the missed fields into two types: fields of the first type are already known to us. For example, the `name` field that represents the name of a `clocksource`, the `rating` field that helps the Linux kernel to select the best clocksource, and so on. The second type is fields which depend on different Linux kernel configuration options. Let's look at these fields.

The first field is `archdata`. This field has the `arch_clocksource_data` type and depends on the `CONFIG_ARCH_CLOCKSOURCE_DATA` kernel configuration option. At the moment this field is actual only for the [x86](https://en.wikipedia.org/wiki/X86) and [IA64](https://en.wikipedia.org/wiki/IA-64) architectures. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the `vDSO` clock mode:

```C
struct arch_clocksource_data {
	int vclock_mode;
};
```

for the `x86` architectures, where the `vDSO` clock mode can be one of the:

```C
#define VCLOCK_NONE	0
#define VCLOCK_TSC	1
#define VCLOCK_HPET	2
#define VCLOCK_PVCLOCK	3
```

The last three fields, `wd_list`, `cs_last` and `wd_last`, depend on the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. First of all, let's try to understand what a `watchdog` is. In simple words, a watchdog is a timer that is used for detecting computer malfunctions and recovering from them. All of these three fields contain watchdog related data that is used by the `clocksource` framework. If we grep the Linux kernel source code, we will see that only the [arch/x86/KConfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig#L54) kernel configuration file contains the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. So, why do `x86` and `x86_64` need a [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer)? You may already know that all `x86` processors have a special 64-bit register - the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). This register contains the number of [cycles](https://en.wikipedia.org/wiki/Clock_rate) since the reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not see initialization of the `watchdog` timer in this part; before that, we must learn more about timers.

That's all. From this moment we know all fields of the `clocksource` structure. This knowledge will help us to learn the internals of the `clocksource` framework.

New clock source registration
--------------------------------------------------------------------------------

We saw only one function from the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html). This function was `__clocksource_register`. This function is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/tree/master/include/linux/clocksource.h) header file and, as we can understand from the function's name, its main point is to register a new clocksource. If we look at the implementation of the `__clocksource_register` function, we will see that it just calls the `__clocksource_register_scale` function and returns its result:

```C
static inline int __clocksource_register(struct clocksource *cs)
{
	return __clocksource_register_scale(cs, 1, 0);
}
```

Before we look at the implementation of the `__clocksource_register_scale` function, we can see that `clocksource` provides an additional API for new clock source registration:

```C
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
{
	return __clocksource_register_scale(cs, 1, hz);
}

static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
{
	return __clocksource_register_scale(cs, 1000, khz);
}
```

All of these functions do the same thing. They return the value of the `__clocksource_register_scale` function, but with a different set of parameters. The `__clocksource_register_scale` function is defined in the [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c) source code file. To understand the difference between these functions, let's look at the parameters of the `clocksource_register_khz` function. As we can see, this function takes three parameters:

* `cs` - clocksource to be installed;
* `scale` - scale factor of a clock source. In other words, if we multiply the value of this parameter by the frequency, we get the rate of a clocksource in `hz`;
* `freq` - clock source frequency divided by scale.

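So `scale * freq` always yields the clock rate in hertz, which is why `clocksource_register_hz` and `clocksource_register_khz` can share one implementation. A trivial sketch of this relationship (illustrative user-space code, not the kernel's):

```C
#include <stdint.h>

/* scale * freq gives the clock rate in Hz. The PIT, for example, may be
 * described either as 1193182 Hz with scale 1, or (rounded) as 1193 kHz
 * with scale 1000. */
static uint64_t clocksource_rate_hz(uint32_t scale, uint32_t freq)
{
    return (uint64_t)scale * freq;
}
```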
Now let's look at the implementation of the `__clocksource_register_scale` function:

```C
int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
{
	__clocksource_update_freq_scale(cs, scale, freq);
	mutex_lock(&clocksource_mutex);
	clocksource_enqueue(cs);
	clocksource_enqueue_watchdog(cs);
	clocksource_select();
	mutex_unlock(&clocksource_mutex);
	return 0;
}
```

First of all we can see that the `__clocksource_register_scale` function starts with a call of the `__clocksource_update_freq_scale` function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency: if it was not passed as `zero`, we need to calculate the `mult` and `shift` parameters for the given clock source. Why do we need to check the value of the `frequency`? Actually, it can be zero. If you looked attentively at the implementation of the `__clocksource_register` function, you may have noticed that we passed the `frequency` as `0`. We do this only for some clock sources that have self-defined `mult` and `shift` parameters. Look in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) and you will see that we saw the calculation of `mult` and `shift` for `jiffies`. The `__clocksource_update_freq_scale` function will do this for us for other clock sources.

So, at the start of the `__clocksource_update_freq_scale` function we check the value of the `frequency` parameter and, if it is not zero, we need to calculate `mult` and `shift` for the given clock source. Let's look at the `mult` and `shift` calculation:

```C
void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
{
	u64 sec;

	if (freq) {
		sec = cs->mask;
		do_div(sec, freq);
		do_div(sec, scale);

		if (!sec)
			sec = 1;
		else if (sec > 600 && cs->mask > UINT_MAX)
			sec = 600;

		clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
				       NSEC_PER_SEC / scale, sec * scale);
	}
	...
	...
	...
}
```

Here we can see the calculation of the maximum number of seconds which we can run before a clock source counter will overflow. First of all we fill the `sec` variable with the value of the clock source's mask. Remember that a clock source's mask represents the maximum number of bits that are valid for the given clock source. After this, we can see two division operations. First we divide our `sec` variable by the clock source frequency and then by the scale factor. The `freq` parameter shows us how many cycles of the counter occur in one second (in units of `scale` hertz). So, we divide the `mask` value, which represents the maximum value of a counter (for example, a `jiffy`), by the frequency of the timer and get the maximum number of seconds for the certain clock source. The second division operation gives us the maximum number of seconds for the certain clock source depending on its scale factor, which can be `1` hertz or `1` kilohertz (10^3 Hz).

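As a sanity check of this arithmetic (a user-space sketch, not kernel code): for a `32-bit` counter ticking at the `i8253` rate of `1193182` Hz with `scale = 1`, the counter overflows after `0xffffffff / 1193182 = 3599` seconds, while a wide `64-bit` counter such as the time stamp counter hits the `600` second clamp (a `2.5` GHz TSC registered in kHz units is assumed here purely for illustration):

```C
#include <stdint.h>

/* Mirrors the freq-handling branch of __clocksource_update_freq_scale:
 * the maximum number of seconds before the counter overflows, clamped
 * to 600 for counters wider than 32 bits, and to at least 1. */
static uint64_t max_seconds(uint64_t mask, uint32_t freq, uint32_t scale)
{
    uint64_t sec = mask / freq / scale;

    if (sec == 0)
        sec = 1;
    else if (sec > 600 && mask > 0xffffffffULL)
        sec = 600;
    return sec;
}
```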
After we have got the maximum number of seconds, we check this value and set it to `1` or `600` depending on the result in the next step. These values are the maximum sleeping time for a clocksource, in seconds. In the next step we can see the call of `clocks_calc_mult_shift`. The main point of this function is the calculation of the `mult` and `shift` values for a given clock source. At the end of the `__clocksource_update_freq_scale` function we check that the just calculated `mult` value of the given clock source will not cause overflow after adjustment, update the `max_idle_ns` and `max_cycles` values of the given clock source with the maximum nanoseconds that can be converted to a clock source counter, and print the result to the kernel buffer:

```C
pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
	cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);
```

which we can see in the [dmesg](https://en.wikipedia.org/wiki/Dmesg) output:

```
$ dmesg | grep "clocksource:"
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns
```

After the `__clocksource_update_freq_scale` function finishes its work, we can return back to the `__clocksource_register_scale` function, which registers the new clock source. We can see the call of the following three functions:

```C
mutex_lock(&clocksource_mutex);
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
mutex_unlock(&clocksource_mutex);
```

Note that before the first one is called, we lock the `clocksource_mutex` [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion). The point of the `clocksource_mutex` mutex is to protect the `curr_clocksource` variable, which represents the currently selected `clocksource`, and the `clocksource_list` variable, which represents the list that contains the registered `clocksources`. Now, let's look at these three functions.

The first function, `clocksource_enqueue`, and the other two are defined in the same source code [file](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c). We go through all already registered `clocksources`, or in other words through all elements of the `clocksource_list`, and try to find the best place for the given `clocksource`:

```C
static void clocksource_enqueue(struct clocksource *cs)
{
	struct list_head *entry = &clocksource_list;
	struct clocksource *tmp;

	list_for_each_entry(tmp, &clocksource_list, list)
		if (tmp->rating >= cs->rating)
			entry = &tmp->list;
	list_add(&cs->list, entry);
}
```

In the end we just insert the new clocksource into the `clocksource_list`. The second function, `clocksource_enqueue_watchdog`, does almost the same as the previous function, but it inserts the new clock source into the `wd_list` depending on the flags of the clock source and starts a new [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer) timer. As I already wrote, we will not consider `watchdog` related stuff in this part, but will do it in the next parts.

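The loop in `clocksource_enqueue` keeps the `clocksource_list` sorted by descending `rating`: the new entry is placed after the last element whose rating is greater than or equal to its own. Here is a simplified user-space model of that insertion (illustrative code using a plain singly linked list instead of the kernel's `list_head`):

```C
#include <stddef.h>

/* A toy clocksource carrying only the rating. */
struct cs {
    int rating;
    struct cs *next;
};

/* Insert so that the list stays sorted by descending rating,
 * mirroring the list_for_each_entry loop in clocksource_enqueue. */
static void cs_enqueue(struct cs **head, struct cs *new_cs)
{
    struct cs **pos = head;

    while (*pos && (*pos)->rating >= new_cs->rating)
        pos = &(*pos)->next;
    new_cs->next = *pos;
    *pos = new_cs;
}
```

With such ordering, the clocksource with the best rating is always the first element of the list, which is exactly what the selection code below relies on.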
The last function is `clocksource_select`. As we can understand from the function's name, the main point of this function is to select the best `clocksource` from the registered clocksources. This function consists only of a call to a function helper:

```C
static void clocksource_select(void)
{
	return __clocksource_select(false);
}
```

Note that the `__clocksource_select` function takes one parameter (`false` in our case). This [bool](https://en.wikipedia.org/wiki/Boolean_data_type) parameter determines how to traverse the `clocksource_list`. In our case we pass `false`, which means that we will go through all entries of the `clocksource_list`. We already know that the `clocksource` with the best rating will be the first in the `clocksource_list` after the call of the `clocksource_enqueue` function, so we can easily get it from this list. After we find the clock source with the best rating, we switch to it:

```C
if (curr_clocksource != best && !timekeeping_notify(best)) {
	pr_info("Switched to clocksource %s\n", best->name);
	curr_clocksource = best;
}
```

The result of this operation we can see in the `dmesg` output:

```
$ dmesg | grep Switched
[    0.199688] clocksource: Switched to clocksource hpet
[    2.452966] clocksource: Switched to clocksource tsc
```

Note that we can see two clock sources in the `dmesg` output (`hpet` and `tsc` in our case). Yes, actually there can be many different clock sources on a particular piece of hardware. So the Linux kernel knows about all registered clock sources and switches to the clock source with the better rating each time after registration of a new clock source.

If we look at the bottom of the [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c) source code file, we will see that it has a [sysfs](https://en.wikipedia.org/wiki/Sysfs) interface. The main initialization occurs in the `init_clocksource_sysfs` function, which will be called during the device `initcalls`. Let's look at the implementation of the `init_clocksource_sysfs` function:

```C
static struct bus_type clocksource_subsys = {
	.name = "clocksource",
	.dev_name = "clocksource",
};

static int __init init_clocksource_sysfs(void)
{
	int error = subsys_system_register(&clocksource_subsys, NULL);

	if (!error)
		error = device_register(&device_clocksource);
	if (!error)
		error = device_create_file(
				&device_clocksource,
				&dev_attr_current_clocksource);
	if (!error)
		error = device_create_file(&device_clocksource,
					   &dev_attr_unbind_clocksource);
	if (!error)
		error = device_create_file(
				&device_clocksource,
				&dev_attr_available_clocksource);
	return error;
}

device_initcall(init_clocksource_sysfs);
```

First of all we can see that it registers a `clocksource` subsystem with the call of the `subsys_system_register` function. In other words, after the call of this function, we will have the following directory:

```
$ pwd
/sys/devices/system/clocksource
```

After this step, we can see the registration of the `device_clocksource` device, which is represented by the following structure:

```C
static struct device device_clocksource = {
	.id	= 0,
	.bus	= &clocksource_subsys,
};
```

and the creation of three files:

* `dev_attr_current_clocksource`;
* `dev_attr_unbind_clocksource`;
* `dev_attr_available_clocksource`.

These files provide information about the current clock source in the system and the available clock sources in the system, and an interface which allows unbinding a clock source.

After the `init_clocksource_sysfs` function has been executed, we are able to find some information about the available clock sources in:

```
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
```

Or, for example, information about the current clock source in the system:

```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
```
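
The `current_clocksource` file is also writable, which is how the switching described above is exposed to user space. For example, on a machine that lists `hpet` among the available clock sources, the current clock source can be switched by writing its name (this requires root privileges; the example below assumes such a machine):

```
# echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
```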

In the previous part we saw the API for the registration of the `jiffies` clock source, but didn't dive into the details of the `clocksource` framework. In this part we did it, and saw the implementation of new clock source registration and the selection of the clock source with the best rating value in the system. Of course, this is not all of the API that the `clocksource` framework provides. There are a couple of additional functions, like `clocksource_unregister` for removing a given clock source from the `clocksource_list`, and so on. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested, you can find them in [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c).

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the following two concepts: `jiffies` and `clocksource`. In this part we saw some examples of `jiffies` usage and learned more details about the `clocksource` concept.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
-------------------------------------------------------------------------------

* [x86](https://en.wikipedia.org/wiki/X86)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [uptime](https://en.wikipedia.org/wiki/Uptime)
* [Ensoniq Soundscape Elite](https://en.wikipedia.org/wiki/Ensoniq_Soundscape_Elite)
* [RTC](https://en.wikipedia.org/wiki/Real-time_clock)
* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
* [IBM PC](https://en.wikipedia.org/wiki/IBM_Personal_Computer)
* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
* [Hz](https://en.wikipedia.org/wiki/Hertz)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
* [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [loadable kernel module](https://en.wikipedia.org/wiki/Loadable_kernel_module)
* [IA64](https://en.wikipedia.org/wiki/IA-64)
* [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer)
* [clock rate](https://en.wikipedia.org/wiki/Clock_rate)
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)

Timers and time management in the Linux kernel. Part 3.
================================================================================

The tick broadcast framework and dyntick
--------------------------------------------------------------------------------

This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel, and we stopped at the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html). We started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters, which we already saw in the first [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of this chapter, is `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step through the Linux kernel initialization. The previous step was the call of the:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

function, which is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file and executes initialization of the `refined_jiffies` clock source for us. Recall that this function is called from the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and executes architecture-specific ([x86_64](https://en.wikipedia.org/wiki/X86-64) in our case) initialization. Look at the implementation of `setup_arch` and you will note that the call of `register_refined_jiffies` is the last step before the `setup_arch` function finishes its work.

There are many different `x86_64` specific things already configured after the end of the `setup_arch` execution. For example, some early [interrupt](https://en.wikipedia.org/wiki/Interrupt) handlers are already able to handle interrupts, memory space is reserved for the [initrd](https://en.wikipedia.org/wiki/Initrd), [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface) is scanned, the Linux kernel log buffer is already set up, which means that the [printk](https://en.wikipedia.org/wiki/Printk) function is able to work, [e820](https://en.wikipedia.org/wiki/E820) is parsed and the Linux kernel already knows about the available memory, and many many other architecture-specific things are done (if you are interested, you can read more about the `setup_arch` function and the Linux kernel initialization process in the second [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book).

Now that `setup_arch` has finished its work, we can go back to the generic Linux kernel code. Recall that the `setup_arch` function was called from the `start_kernel` function which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. So, we shall return to this function. You can see that there are many different functions called right after the `setup_arch` function inside of the `start_kernel` function, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function which is related to time management in the Linux kernel is:

```C
tick_init();
```

in the `start_kernel`. The `tick_init` function is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and does two things:

* Initialization of `tick broadcast` framework related data structures;
* Initialization of `full` tickless mode related data structures.

We haven't seen anything related to the `tick broadcast` framework in this book yet, and don't know anything about tickless mode in the Linux kernel. So, the main point of this part is to look at these concepts and learn what they are.

The idle process
--------------------------------------------------------------------------------

First of all, let's look at the implementation of the `tick_init` function. As I already wrote, this function is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and consists of two calls of the following functions:

```C
void __init tick_init(void)
{
	tick_broadcast_init();
	tick_nohz_init();
}
```

As you can understand from the paragraph's title, we are interested only in the `tick_broadcast_init` function for now. This function is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file and executes initialization of the `tick broadcast` framework related data structures. Before we look at the implementation of the `tick_broadcast_init` function and try to understand what this function does, we need to know about the `tick broadcast` framework.

Main point of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called - [idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29). When the processor has no anything to execute, the Linux kernel launches `idle` task. We already saw a little about this in the last part of the [Linux kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html). When the Linux kernel will finish all initialization processes in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, it will call the `rest_init` function from the same source code file. Main point of this function is to launch kernel `init` thread and the `kthreadd` thread, to call the `schedule` function to start task scheduling and to go to sleep by calling the `cpu_idle_loop` function that defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/kernel/sched/idle.c) source code file.
|
||||
|
||||
The `cpu_idle_loop` function represents an infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the `idle` process finishes its work and control is moved to a new runnable task with the call of the `schedule_preempt_disabled` function:

```C
static void cpu_idle_loop(void)
{
	while (1) {
		while (!need_resched()) {
			...
			...
			...
			/* the main idle function */
			cpuidle_idle_call();
		}
		...
		...
		...
		schedule_preempt_disabled();
	}
}
```
Of course, we will not consider the full implementation of the `cpu_idle_loop` function and the details of the `idle` state in this part, because they are not related to our topic. But there is one interesting moment for us. We know that a processor can execute only one task at a time. How does the Linux kernel decide to reschedule and stop the `idle` process if the processor is executing an infinite loop in the `cpu_idle_loop`? The answer is system timer interrupts. When an interrupt occurs, the processor stops the `idle` thread and transfers control to an interrupt handler. After the system timer interrupt is handled, `need_resched` will return true and the Linux kernel will stop the `idle` process and transfer control to the current runnable task. But handling system timer interrupts is not effective for [power management](https://en.wikipedia.org/wiki/Power_management), because if a processor is in the `idle` state, there is little point in sending it a system timer interrupt.
By default, the `CONFIG_HZ_PERIODIC` kernel configuration option is enabled in the Linux kernel and tells the kernel to handle every interrupt of the system timer. To solve this problem, the Linux kernel provides two additional ways of managing scheduling-clock interrupts:

The first is to omit scheduling-clock ticks on idle processors. To enable this behaviour in the Linux kernel, we need to enable the `CONFIG_NO_HZ_IDLE` kernel configuration option. This option allows the Linux kernel to avoid sending timer interrupts to idle processors. In this case periodic timer interrupts are replaced with on-demand interrupts. This mode is called `dyntick-idle` mode. But if the kernel does not handle interrupts of a system timer, how can the kernel decide if the system has nothing to do?

Whenever the idle task is selected to run, the periodic tick is disabled with the call of the `tick_nohz_idle_enter` function that is defined in the [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c) source code file and enabled with the call of the `tick_nohz_idle_exit` function. There is a special concept in the Linux kernel called `clock event devices` that are used to schedule the next interrupt. This concept provides an API for devices which can deliver interrupts at a specific time in the future and is represented by the `clock_event_device` structure in the Linux kernel. We will not dive into the implementation of the `clock_event_device` structure now; we will see it in the next part of this chapter. But there is one interesting moment for us right now.

The second way is to omit scheduling-clock ticks on processors that are either in the `idle` state or that have only one runnable task, in other words on busy processors too. We can enable this feature with the `CONFIG_NO_HZ_FULL` kernel configuration option and it allows to reduce the number of timer interrupts significantly.

Besides the `cpu_idle_loop`, an idle processor can be in a sleeping state. The Linux kernel provides the special `cpuidle` framework. The main point of this framework is to put an idle processor into sleeping states. The name of the set of these states is `C-states`. But how will a processor be woken if the local timer is disabled? The Linux kernel provides the `tick broadcast` framework for this. The main point of this framework is to assign a timer which is not affected by the `C-states`. This timer will wake a sleeping processor.
Now, after some theory, we can return to the implementation of our function. Let's recall that the `tick_init` function just calls the following two functions:

```C
void __init tick_init(void)
{
	tick_broadcast_init();
	tick_nohz_init();
}
```

Let's consider the first of them. The `tick_broadcast_init` function is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file and initializes the `tick broadcast` framework related data structures. Let's look at the implementation of the `tick_broadcast_init` function:

```C
void __init tick_broadcast_init(void)
{
	zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);
	zalloc_cpumask_var(&tick_broadcast_on, GFP_NOWAIT);
	zalloc_cpumask_var(&tmpmask, GFP_NOWAIT);
#ifdef CONFIG_TICK_ONESHOT
	zalloc_cpumask_var(&tick_broadcast_oneshot_mask, GFP_NOWAIT);
	zalloc_cpumask_var(&tick_broadcast_pending_mask, GFP_NOWAIT);
	zalloc_cpumask_var(&tick_broadcast_force_mask, GFP_NOWAIT);
#endif
}
```
As we can see, the `tick_broadcast_init` function allocates different [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function is defined in the [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c) source code file and expands to the call of the following function:

```C
bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
	return alloc_cpumask_var(mask, flags | __GFP_ZERO);
}
```

Ultimately, the memory space will be allocated for the given `cpumask` with the given flags with the help of the `kmalloc_node` function:

```C
*mask = kmalloc_node(cpumask_size(), flags, node);
```
Now let's look at the `cpumasks` that are initialized in the `tick_broadcast_init` function. As we can see, the `tick_broadcast_init` function initializes six `cpumasks`, and moreover, initialization of the last three `cpumasks` depends on the `CONFIG_TICK_ONESHOT` kernel configuration option.

The first three `cpumasks` are:

* `tick_broadcast_mask` - the bitmap which represents the list of processors that are in a sleeping mode;
* `tick_broadcast_on` - the bitmap that stores numbers of processors which are in a periodic broadcast state;
* `tmpmask` - a bitmap for temporary usage.
As we already know, the next three `cpumasks` depend on the `CONFIG_TICK_ONESHOT` kernel configuration option. Actually, each clock events device can be in one of two modes:

* `periodic` - clock events devices that support periodic events;
* `oneshot` - clock events devices that are capable of issuing events that happen only once.

The Linux kernel defines two feature masks for such clock events devices in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file:

```C
#define CLOCK_EVT_FEAT_PERIODIC        0x000001
#define CLOCK_EVT_FEAT_ONESHOT         0x000002
```
So, the last three `cpumasks` are:

* `tick_broadcast_oneshot_mask` - stores numbers of processors that must be notified;
* `tick_broadcast_pending_mask` - stores numbers of processors that are pending a broadcast;
* `tick_broadcast_force_mask` - stores numbers of processors with an enforced broadcast.

We have initialized six `cpumasks` in the `tick broadcast` framework, and now we can proceed to the implementation of this framework.
The `tick broadcast` framework
--------------------------------------------------------------------------------
Hardware may provide some clock source devices. When a processor sleeps and its local timer is stopped, there must be an additional clock source device that will handle the awakening of the processor. The Linux kernel uses these `special` clock source devices which can raise an interrupt at a specified time. We already know that such timers are called `clock events` devices in the Linux kernel. Besides the `clock events` devices, each processor in the system has its own local timer which is programmed to issue an interrupt at the time of the next deferred task. Also these timers can be programmed to do a periodical job, like updating `jiffies` and so on. These timers are represented by the `tick_device` structure in the Linux kernel. This structure is defined in the [kernel/time/tick-sched.h](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.h) header file and looks like:

```C
struct tick_device {
	struct clock_event_device *evtdev;
	enum tick_device_mode mode;
};
```

Note that the `tick_device` structure contains two fields. The first field - `evtdev` - represents a pointer to the `clock_event_device` structure that is defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and represents the descriptor of a clock event device. A `clock event` device allows registering an event that will happen in the future. As I already wrote, we will not consider the `clock_event_device` structure and the related API in this part, but will see it in the next part.

The second field of the `tick_device` structure represents the mode of the `tick_device`. As we already know, the mode can be one of the:

```C
enum tick_device_mode {
	TICKDEV_MODE_PERIODIC,
	TICKDEV_MODE_ONESHOT,
};
```
Each `clock events` device in the system registers itself by the call of the `clockevents_register_device` function or the `clockevents_config_and_register` function during the initialization process of the Linux kernel. During the registration of a new `clock events` device, the Linux kernel calls the `tick_check_new_device` function that is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and checks whether the given `clock events` device should be used by the Linux kernel. After all checks, the `tick_check_new_device` function executes a call of the:

```C
tick_install_broadcast_device(newdev);
```

function that checks whether the given `clock event` device can be a broadcast device and installs it if so. Let's look at the implementation of the `tick_install_broadcast_device` function:

```C
void tick_install_broadcast_device(struct clock_event_device *dev)
{
	struct clock_event_device *cur = tick_broadcast_device.evtdev;

	if (!tick_check_broadcast_device(cur, dev))
		return;

	if (!try_module_get(dev->owner))
		return;

	clockevents_exchange_device(cur, dev);

	if (cur)
		cur->event_handler = clockevents_handle_noop;

	tick_broadcast_device.evtdev = dev;

	if (!cpumask_empty(tick_broadcast_mask))
		tick_broadcast_start_periodic(dev);

	if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
		tick_clock_notify();
}
```
First of all we get the current `clock event` device from the `tick_broadcast_device`. The `tick_broadcast_device` is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file:

```C
static struct tick_device tick_broadcast_device;
```

and represents the external clock device that keeps track of events for a processor. The first step after we get the current clock device is the call of the `tick_check_broadcast_device` function which checks that a given clock events device can be utilized as a broadcast device. The main point of the `tick_check_broadcast_device` function is to check the value of the `features` field of the given `clock events` device. As we can understand from the name of this field, the `features` field contains the clock event device features. The available values are defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file, for example `CLOCK_EVT_FEAT_PERIODIC` - which represents a clock events device which supports periodic events - and so on. So, the `tick_check_broadcast_device` function checks the `features` flags for `CLOCK_EVT_FEAT_ONESHOT`, `CLOCK_EVT_FEAT_DUMMY` and other flags and returns `false` if the given clock events device has one of these features. Otherwise the `tick_check_broadcast_device` function compares the `ratings` of the given clock event device and the current clock event device and returns the best one.

After the `tick_check_broadcast_device` function, we can see the call of the `try_module_get` function that checks the module owner of the clock events device. We need to do this to be sure that the given `clock events` device was correctly initialized. The next step is the call of the `clockevents_exchange_device` function that is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and releases the old clock events device and replaces the previous functional handler with a dummy handler.

In the last step of the `tick_install_broadcast_device` function we check that the `tick_broadcast_mask` is not empty and start the given `clock events` device in periodic mode with the call of the `tick_broadcast_start_periodic` function:

```C
if (!cpumask_empty(tick_broadcast_mask))
	tick_broadcast_start_periodic(dev);

if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
	tick_clock_notify();
```

The `tick_broadcast_mask` is filled in the `tick_device_uses_broadcast` function that checks a `clock events` device during the registration of this `clock events` device:
```C
int cpu = smp_processor_id();

int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
{
	...
	...
	...
	if (!tick_device_is_functional(dev)) {
		...
		cpumask_set_cpu(cpu, tick_broadcast_mask);
		...
	}
	...
	...
	...
}
```
You can read more about the `smp_processor_id` macro in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter.

The `tick_broadcast_start_periodic` function checks the given `clock event` device and calls the `tick_setup_periodic` function:

```C
static void tick_broadcast_start_periodic(struct clock_event_device *bc)
{
	if (bc)
		tick_setup_periodic(bc, 1);
}
```

which is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and sets the broadcast handler for the given `clock event` device by the call of the following function:

```C
tick_set_periodic_handler(dev, broadcast);
```

This function checks the second parameter which represents the broadcast state (`on` or `off`) and sets the broadcast handler depending on its value:

```C
void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
{
	if (!broadcast)
		dev->event_handler = tick_handle_periodic;
	else
		dev->event_handler = tick_handle_periodic_broadcast;
}
```
When a `clock event` device issues an interrupt, the `dev->event_handler` will be called. For example, let's look at the interrupt handler of the [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) which is located in the [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c) source code file:

```C
static irqreturn_t hpet_interrupt_handler(int irq, void *data)
{
	struct hpet_dev *dev = (struct hpet_dev *)data;
	struct clock_event_device *hevt = &dev->evt;

	if (!hevt->event_handler) {
		printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
				dev->num);
		return IRQ_HANDLED;
	}

	hevt->event_handler(hevt);
	return IRQ_HANDLED;
}
```
The `hpet_interrupt_handler` gets the [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) specific data and checks the event handler of the `clock event` device. Recall the handler we just set in the `tick_set_periodic_handler` function. So the `tick_handle_periodic_broadcast` function will be called at the end of the high precision event timer interrupt handler.

The `tick_handle_periodic_broadcast` function calls the

```C
bc_local = tick_do_periodic_broadcast();
```

function which stores the numbers of processors which have asked to be woken up in a temporary `cpumask` and calls the `tick_do_broadcast` function:

```C
cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask);
return tick_do_broadcast(tmpmask);
```

The `tick_do_broadcast` calls the `broadcast` function of the given clock events device which sends an [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt) interrupt to the given set of processors. In the end we can call the event handler of the given `tick_device`:

```C
if (bc_local)
	td->evtdev->event_handler(td->evtdev);
```
which actually represents the interrupt handler of the local timer of a processor. After this a processor will wake up. That is all about the `tick broadcast` framework in the Linux kernel. We have missed some aspects of this framework, for example reprogramming of a `clock event` device, broadcast with the oneshot timer and so on. But the Linux kernel is very big and it is not feasible to cover all aspects of it. I think it will be interesting to dive into it yourself.

If you remember, we started this part with the call of the `tick_init` function. We have just considered the `tick_broadcast_init` function and the related theory, but the `tick_init` function contains another function call - `tick_nohz_init`. Let's look at the implementation of this function.
Initialization of dyntick related data structures
--------------------------------------------------------------------------------
We already saw some information about the `dyntick` concept in this part and we know that this concept allows the kernel to disable system timer interrupts in the `idle` state. The `tick_nohz_init` function initializes the different data structures which are related to this concept. This function is defined in the [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c) source code file and starts with a check of the value of the `tick_nohz_full_running` variable which represents the state of the tick-less mode for the `idle` state and for the state when system timer interrupts are disabled while a processor has only one runnable task:

```C
if (!tick_nohz_full_running) {
	if (tick_nohz_init_all() < 0)
		return;
}
```

If this mode is not running we call the `tick_nohz_init_all` function that is defined in the same source code file and check its result. The `tick_nohz_init_all` function tries to allocate the `tick_nohz_full_mask` with the call of `alloc_cpumask_var` that will allocate space for it. The `tick_nohz_full_mask` will store the numbers of processors that have full `NO_HZ` enabled. After a successful allocation of the `tick_nohz_full_mask` we set all bits in the `tick_nohz_full_mask`, set the `tick_nohz_full_running` and return the result to the `tick_nohz_init` function:

```C
static int tick_nohz_init_all(void)
{
	int err = -1;

#ifdef CONFIG_NO_HZ_FULL_ALL
	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
		return err;
	}
	err = 0;
	cpumask_setall(tick_nohz_full_mask);
	tick_nohz_full_running = true;
#endif
	return err;
}
```
In the next step we try to allocate memory space for the `housekeeping_mask`:

```C
if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
	WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
	cpumask_clear(tick_nohz_full_mask);
	tick_nohz_full_running = false;
	return;
}
```

This `cpumask` will store the numbers of processors for `housekeeping`, in other words we need at least one processor that will not be in `NO_HZ` mode, because it will do timekeeping and so on. After this we check the result of the architecture-specific `arch_irq_work_has_interrupt` function. This function checks the ability to send an inter-processor interrupt for a certain architecture. We need to check this, because the system timer of a processor will be disabled during `NO_HZ` mode, so there must be at least one online processor which can send an inter-processor interrupt to wake a sleeping processor. This function is defined in the [arch/x86/include/asm/irq_work.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_work.h) header file for [x86_64](https://en.wikipedia.org/wiki/X86-64) and just checks that the processor has an [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) according to [CPUID](https://en.wikipedia.org/wiki/CPUID):

```C
static inline bool arch_irq_work_has_interrupt(void)
{
	return cpu_has_apic;
}
```
If a processor has no `APIC`, the Linux kernel prints a warning message, clears the `tick_nohz_full_mask` cpumask, copies the numbers of all possible processors in the system to the `housekeeping_mask` and resets the value of the `tick_nohz_full_running` variable:

```C
if (!arch_irq_work_has_interrupt()) {
	pr_warning("NO_HZ: Can't run full dynticks because arch doesn't "
		   "support irq work self-IPIs\n");
	cpumask_clear(tick_nohz_full_mask);
	cpumask_copy(housekeeping_mask, cpu_possible_mask);
	tick_nohz_full_running = false;
	return;
}
```

After this step, we get the number of the current processor by the call of `smp_processor_id` and check for this processor in the `tick_nohz_full_mask`. If the `tick_nohz_full_mask` contains the given processor we clear the appropriate bit in the `tick_nohz_full_mask`:

```C
cpu = smp_processor_id();

if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
	pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu);
	cpumask_clear_cpu(cpu, tick_nohz_full_mask);
}
```
This is because this processor will be used for timekeeping. After this step we put the numbers of all processors that are in the `cpu_possible_mask` and not in the `tick_nohz_full_mask` into the `housekeeping_mask`:

```C
cpumask_andnot(housekeeping_mask,
	       cpu_possible_mask, tick_nohz_full_mask);
```
After this operation, the `housekeeping_mask` will contain all processors of the system except the processor for timekeeping. In the last step of the `tick_nohz_init_all` function, we go through all processors that are defined in the `tick_nohz_full_mask` and call the following function for each processor:

```C
for_each_cpu(cpu, tick_nohz_full_mask)
	context_tracking_cpu_set(cpu);
```

The `context_tracking_cpu_set` function is defined in the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/master/kernel/context_tracking.c) source code file and its main point is to set the `context_tracking.active` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `true`. When the `active` field is set to `true` for a certain processor, all [context switches](https://en.wikipedia.org/wiki/Context_switch) will be ignored by the Linux kernel context tracking subsystem for this processor.

That's all. This is the end of the `tick_nohz_init` function. After this the `NO_HZ` related data structures will be initialized. We didn't see the API of the `NO_HZ` mode, but will see it soon.
Conclusion
--------------------------------------------------------------------------------
This is the end of the third part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `clocksource` concept in the Linux kernel which represents a framework for managing different clock sources in an interrupt- and hardware-independent way. We continued to look at the Linux kernel initialization process in a time management context in this part and got acquainted with two new concepts: the `tick broadcast` framework and `tick-less` mode. The first concept helps the Linux kernel to deal with processors which are in deep sleep and the second concept represents the mode in which the kernel may work to improve power management of `idle` processors.

In the next part we will continue to dive into timer management related things in the Linux kernel and will see yet another new concept - `timers`.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
-------------------------------------------------------------------------------

* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [initrd](https://en.wikipedia.org/wiki/Initrd)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface)
* [printk](https://en.wikipedia.org/wiki/Printk)
* [CPU idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29)
* [power management](https://en.wikipedia.org/wiki/Power_management)
* [NO_HZ documentation](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt)
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt)
* [CPUID](https://en.wikipedia.org/wiki/CPUID)
* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [context switches](https://en.wikipedia.org/wiki/Context_switch)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html)
427
Timers/timers-4.md
Normal file
@@ -0,0 +1,427 @@
Timers and time management in the Linux kernel. Part 4.
================================================================================

Timers
--------------------------------------------------------------------------------
This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html) we learned about the `tick broadcast` framework and `NO_HZ` mode in the Linux kernel. We will continue to dive into time management related stuff in the Linux kernel in this part and will be acquainted with yet another concept in the Linux kernel - `timers`. Before we look at timers in the Linux kernel, we have to learn some theory about this concept. Note that we will consider software timers in this part.

The Linux kernel provides a `software timer` concept to allow kernel functions to be invoked at a future moment. Timers are widely used in the Linux kernel. For example, look in the [net/netfilter/ipset/ip_set_list_set.c](https://github.com/torvalds/linux/blob/master/net/netfilter/ipset/ip_set_list_set.c) source code file. This source code file provides an implementation of the framework for managing groups of [IP](https://en.wikipedia.org/wiki/Internet_Protocol) addresses.

We can find the `list_set` structure that contains a `gc` field in this source code file:

```C
struct list_set {
	...
	struct timer_list gc;
	...
};
```

Note that the `gc` field has the `timer_list` type. This structure is defined in the [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h) header file and its main point is to store `dynamic` timers in the Linux kernel. Actually, the Linux kernel provides two types of timers called dynamic timers and interval timers. The first type of timers is used by the kernel, and the second can be used by user mode. The `timer_list` structure contains actual `dynamic` timers. The `gc` timer from our `list_set` example represents a timer for garbage collection. This timer will be initialized in the `list_set_gc_init` function:
```C
static void
list_set_gc_init(struct ip_set *set, void (*gc)(unsigned long ul_set))
{
	struct list_set *map = set->data;
	...
	...
	...
	map->gc.function = gc;
	map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
	...
	...
	...
}
```
A function that is pointed by the `gc` pointer, will be called after timeout which is equal to the `map->gc.expires`.
|
||||
|
||||
Ok, we will not dive into this example with the [netfilter](https://en.wikipedia.org/wiki/Netfilter), because this chapter is not about [network](https://en.wikipedia.org/wiki/Computer_network) related stuff. But we saw that timers are widely used in the Linux kernel and learned that they represent concept which allows to functions to be called in future.
|
||||
|
||||
Now let's continue to research source code of Linux kernel which is related to the timers and time management stuff as we did it in all previous chapters.
|
||||
|
||||
Introduction to dynamic timers in the Linux kernel
--------------------------------------------------------------------------------

As I already wrote, we learned about the `tick broadcast` framework and `NO_HZ` mode in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html). They are initialized in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file by the call of the `tick_init` function. If we look at this source code file, we will see that the next time management related function is:

```C
init_timers();
```

This function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and contains calls of four functions:

```C
void __init init_timers(void)
{
	init_timer_cpus();
	init_timer_stats();
	timer_register_cpu_notifier();
	open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
}
```
Let's look at the implementation of each function. The first function is `init_timer_cpus`, defined in the [same](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file, which just calls the `init_timer_cpu` function for each possible processor in the system:

```C
static void __init init_timer_cpus(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		init_timer_cpu(cpu);
}
```

If you do not know or do not remember what a `possible` cpu is, you can read the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) of this book which describes the `cpumask` concept in the Linux kernel. In short, a `possible` processor is a processor which can be plugged in at any time during the life of the system.

The `init_timer_cpu` function does the main work for us, namely it executes initialization of the `tvec_base` structure for each processor. This structure is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and stores data related to `dynamic` timers for a certain processor. Let's look at the definition of this structure:
```C
struct tvec_base {
	spinlock_t lock;
	struct timer_list *running_timer;
	unsigned long timer_jiffies;
	unsigned long next_timer;
	unsigned long active_timers;
	unsigned long all_timers;
	int cpu;
	bool migration_enabled;
	bool nohz_active;
	struct tvec_root tv1;
	struct tvec tv2;
	struct tvec tv3;
	struct tvec tv4;
	struct tvec tv5;
} ____cacheline_aligned;
```

The `tvec_base` structure contains the following fields: the `lock` for `tvec_base` protection; the `running_timer` field which points to the currently running timer for the certain processor; the `timer_jiffies` field which represents the earliest expiration time (it will be used by the Linux kernel to find already expired timers). The next field - `next_timer` - contains the next pending timer for the next timer [interrupt](https://en.wikipedia.org/wiki/Interrupt) in the case when a processor goes to sleep and the `NO_HZ` mode is enabled in the Linux kernel. The `active_timers` field provides accounting of non-deferrable timers, or in other words all timers that will not be stopped while a processor goes to sleep. The `all_timers` field tracks the total number of timers, or `active_timers` + deferrable timers. The `cpu` field represents the number of the processor which owns these timers. The `migration_enabled` and `nohz_active` fields represent the possibility of timer migration to another processor and the status of the `NO_HZ` mode respectively.

The last five fields of the `tvec_base` structure represent lists of dynamic timers. The first `tv1` field has:
```C
#define TVR_SIZE (1 << TVR_BITS)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)

...
...
...

struct tvec_root {
	struct hlist_head vec[TVR_SIZE];
};
```

type. Note that the value of the `TVR_SIZE` macro depends on the `CONFIG_BASE_SMALL` kernel configuration option:



which reduces the size of the kernel data structures when enabled. The `tv1` field is an array that may contain `64` or `256` elements, where each element represents a dynamic timer that will decay within the next `255` system timer interrupts. The next three fields: `tv2`, `tv3` and `tv4` are lists with dynamic timers too, but they store dynamic timers which will decay within the next `2^14 - 1`, `2^20 - 1` and `2^26 - 1` jiffies respectively. The last `tv5` field represents a list which stores dynamic timers with a large expiration period.
So, now that we have seen the `tvec_base` structure and the description of its fields, we can look at the implementation of the `init_timer_cpu` function. As I already wrote, this function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and executes initialization of the `tvec_bases`:

```C
static void __init init_timer_cpu(int cpu)
{
	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);

	base->cpu = cpu;
	spin_lock_init(&base->lock);

	base->timer_jiffies = jiffies;
	base->next_timer = base->timer_jiffies;
}
```

The `tvec_bases` is a [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable which represents the main data structure for dynamic timers on a given processor. This `per-cpu` variable is defined in the same source code file:

```C
static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
```

First of all we get the address of the `tvec_bases` for the given processor into the `base` variable, and once we have it, we start to initialize some of the `tvec_base` fields in the `init_timer_cpu` function. After initialization of the `per-cpu` dynamic timers with the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) and the number of a possible processor, we need to initialize the `tstats_lookup_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) in the `init_timer_stats` function:
```C
void __init init_timer_stats(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		raw_spin_lock_init(&per_cpu(tstats_lookup_lock, cpu));
}
```

The `tstats_lookup_lock` variable represents a `per-cpu` raw spinlock:

```C
static DEFINE_PER_CPU(raw_spinlock_t, tstats_lookup_lock);
```

which will be used for protection of operations with the statistics of timers that can be accessed through [procfs](https://en.wikipedia.org/wiki/Procfs):

```C
static int __init init_tstats_procfs(void)
{
	struct proc_dir_entry *pe;

	pe = proc_create("timer_stats", 0644, NULL, &tstats_fops);
	if (!pe)
		return -ENOMEM;
	return 0;
}
```

For example:
```
$ cat /proc/timer_stats
Timerstats sample period: 3.888770 s
  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  15,     1 swapper          hcd_submit_urb (rh_timer_func)
   4,   959 kedac            schedule_timeout (process_timeout)
   1,     0 swapper          page_writeback_init (wb_timer_fn)
  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
...
...
...
```
The next step after initialization of the `tstats_lookup_lock` spinlock is the call of the `timer_register_cpu_notifier` function. This function depends on the `CONFIG_HOTPLUG_CPU` kernel configuration option which enables support for [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) processors in the Linux kernel.

When a processor is logically offlined, a notification is sent to the Linux kernel with the `CPU_DEAD` or the `CPU_DEAD_FROZEN` event by the call of the `cpu_notifier` macro:

```C
#ifdef CONFIG_HOTPLUG_CPU
...
...
static inline void timer_register_cpu_notifier(void)
{
	cpu_notifier(timer_cpu_notify, 0);
}
...
...
#else
...
...
static inline void timer_register_cpu_notifier(void) { }
...
...
#endif /* CONFIG_HOTPLUG_CPU */
```

In this case the `timer_cpu_notify` will be called, which checks the event type and calls the `migrate_timers` function:

```C
static int timer_cpu_notify(struct notifier_block *self,
			    unsigned long action, void *hcpu)
{
	switch (action) {
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		migrate_timers((long)hcpu);
		break;
	default:
		break;
	}

	return NOTIFY_OK;
}
```

This chapter will not describe `hotplug` related events in the Linux kernel source code, but if you are interested in such things, you can find the implementation of the `migrate_timers` function in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file.

The last step in the `init_timers` function is the call of the:
```C
open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
```

function. The `open_softirq` function may be already familiar to you if you have read the ninth [part](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) about interrupts and interrupt handling in the Linux kernel. In short, the `open_softirq` function is defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) source code file and executes initialization of the deferred interrupt handler.

In our case the deferred function is the `run_timer_softirq` function that will be called after a hardware interrupt in the `do_IRQ` function which is defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c) source code file. The main point of this function is to handle software dynamic timers. The Linux kernel does not do this during the hardware timer interrupt handling because it is a time consuming operation.

Let's look at the implementation of the `run_timer_softirq` function:

```C
static void run_timer_softirq(struct softirq_action *h)
{
	struct tvec_base *base = this_cpu_ptr(&tvec_bases);

	if (time_after_eq(jiffies, base->timer_jiffies))
		__run_timers(base);
}
```

At the beginning of the `run_timer_softirq` function we get the `dynamic` timer base for the current processor and compare the current value of [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) with the value of `timer_jiffies` for the current structure by the call of the `time_after_eq` macro which is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file:

```C
#define time_after_eq(a,b)          \
	(typecheck(unsigned long, a) && \
	 typecheck(unsigned long, b) && \
	 ((long)((a) - (b)) >= 0))
```
Recall that the `timer_jiffies` field of the `tvec_base` structure represents the relative time when functions delayed by the given timer will be executed. So we compare these two values and if the current time represented by `jiffies` is greater than or equal to `base->timer_jiffies`, we call the `__run_timers` function that is defined in the same source code file. Let's look at the implementation of this function.

As I just wrote, the `__run_timers` function runs all expired timers for a given processor. This function starts from acquiring the `tvec_base`'s lock to protect the `tvec_base` structure:

```C
static inline void __run_timers(struct tvec_base *base)
{
	struct timer_list *timer;

	spin_lock_irq(&base->lock);
	...
	...
	...
	spin_unlock_irq(&base->lock);
}
```
After this it starts a loop that runs while `timer_jiffies` does not exceed `jiffies`:

```C
while (time_after_eq(jiffies, base->timer_jiffies)) {
	...
	...
	...
}
```

We can find many different manipulations in this loop, but the main point is to find expired timers and call the delayed functions. First of all we need to calculate the `index` of the `base->tv1` list that stores the next timer to be handled, with the following expression:

```C
index = base->timer_jiffies & TVR_MASK;
```

where the `TVR_MASK` is a mask for getting the `tvec_root->vec` elements. Once we have the index of the next timer which must be handled, we check its value. If the index is zero, we go through all lists in our cascade table `tv2`, `tv3` etc., and rehash them with the call of the `cascade` function:

```C
if (!index &&
	(!cascade(base, &base->tv2, INDEX(0))) &&
		(!cascade(base, &base->tv3, INDEX(1))) &&
				!cascade(base, &base->tv4, INDEX(2)))
	cascade(base, &base->tv5, INDEX(3));
```
After this we increase the value of the `base->timer_jiffies`:

```C
++base->timer_jiffies;
```

In the last step we execute the corresponding function for each timer from the list in the following loop:

```C
hlist_move_list(base->tv1.vec + index, head);

while (!hlist_empty(head)) {
	...
	...
	...
	timer = hlist_entry(head->first, struct timer_list, entry);
	fn = timer->function;
	data = timer->data;

	spin_unlock(&base->lock);
	call_timer_fn(timer, fn, data);
	spin_lock(&base->lock);

	...
	...
	...
}
```
where the `call_timer_fn` just calls the given function:

```C
static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long),
			  unsigned long data)
{
	...
	...
	...
	fn(data);
	...
	...
	...
}
```

That's all. The Linux kernel has the infrastructure for `dynamic timers` from this moment. We did not dive deeper into this interesting theme. As I already wrote, `timers` are a [widely](http://lxr.free-electrons.com/ident?i=timer_list) used concept in the Linux kernel and neither one part, nor two, would be enough to cover in full how they are implemented and how they work. But now we know about this concept, why the Linux kernel needs it and some data structures around it.

Now let's look at the usage of `dynamic timers` in the Linux kernel.
Usage of dynamic timers
--------------------------------------------------------------------------------

As you may have already noted, if the Linux kernel provides a concept, it also provides an API for managing this concept, and the `dynamic timers` concept is no exception here. To use a timer in the Linux kernel code, we must define a variable with the `timer_list` type. We can initialize our `timer_list` structure in two ways. The first is to use the `init_timer` macro that is defined in the [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h) header file:

```C
#define init_timer(timer)    \
	__init_timer((timer), 0)

#define __init_timer(_timer, _flags)   \
	init_timer_key((_timer), (_flags), NULL, NULL)
```

where the `init_timer_key` function just calls the:

```C
do_init_timer(timer, flags, name, key);
```

function which fills the given `timer` with default values. The second way is to use the:

```C
#define TIMER_INITIALIZER(_function, _expires, _data)		\
	__TIMER_INITIALIZER((_function), (_expires), (_data), 0)
```

macro which will initialize the given `timer_list` structure too.

After a `dynamic timer` is initialized we can start this `timer` with the call of the:
```C
void add_timer(struct timer_list *timer);
```

function and stop it with the:

```C
int del_timer(struct timer_list *timer);
```

function.
That's all.
Conclusion
--------------------------------------------------------------------------------

This is the end of the fourth part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with two new concepts: the `tick broadcast` framework and the `NO_HZ` mode. In this part we continued to dive into time management related stuff and got acquainted with a new concept - the `dynamic timer`, or software timer. We did not see the implementation of the `dynamic timers` management code in detail in this part, but we saw the data structures and API around this concept.

In the next part we will continue to dive into timer management related things in the Linux kernel and will see a new concept for us - `timers`.

If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
-------------------------------------------------------------------------------

* [IP](https://en.wikipedia.org/wiki/Internet_Protocol)
* [netfilter](https://en.wikipedia.org/wiki/Netfilter)
* [network](https://en.wikipedia.org/wiki/Computer_network)
* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)
415
Timers/timers-5.md
Normal file
@@ -0,0 +1,415 @@
Timers and time management in the Linux kernel. Part 5.
================================================================================

Introduction to the `clockevents` framework
--------------------------------------------------------------------------------

This is the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. As you might have noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the [second](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) part of this chapter. It was the `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel.

At first let's refresh your memory and try to remember what the `clocksource` framework is and what its purpose is. The main goal of the `clocksource` framework is to provide a `timeline`. As described in the [documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt):

> For example issuing the command 'date' on a Linux system will eventually read the clock source to determine exactly what time it is.

The Linux kernel supports many different clock sources. You can find some of them in the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory. For example the old good [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) - a [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) with `1193182` Hz frequency, and yet another one - the [ACPI PM](http://uefi.org/sites/default/files/resources/ACPI_5.pdf) timer with `3579545` Hz frequency. Besides the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory, each architecture may provide its own architecture-specific clock sources. For example the [x86](https://en.wikipedia.org/wiki/X86) architecture provides the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), and [powerpc](https://en.wikipedia.org/wiki/PowerPC) provides access to the processor timer through the `timebase` register.

Each clock source provides a monotonic atomic counter. As I already wrote, the Linux kernel supports a huge set of different clock sources, and each clock source has its own parameters like [frequency](https://en.wikipedia.org/wiki/Frequency). The main goal of the `clocksource` framework is to provide an [API](https://en.wikipedia.org/wiki/Application_programming_interface) to select the best available clock source in the system, i.e. the clock source with the highest frequency. An additional goal of the `clocksource` framework is to represent the atomic counter provided by a clock source in human units. At this time, nanoseconds are the favorite choice for the time value units of a given clock source in the Linux kernel.

The `clocksource` framework is represented by the `clocksource` structure which is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and contains the `name` of a clock source, the rating of the certain clock source in the system (a clock source with a higher frequency has a bigger rating in the system), the `list` of all registered clock sources in the system, the `enable` and `disable` fields to enable and disable a clock source, a pointer to the `read` function which must return the atomic counter of a clock source, etc.

Additionally the `clocksource` structure provides two fields: `mult` and `shift` which are needed for translation of the atomic counter provided by a certain clock source to human units, i.e. [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond). Translation occurs via the following formula:
```
ns ~= (clocksource * mult) >> shift
```
As we already know, besides the `clocksource` structure, the `clocksource` framework provides an API for registration of clock sources with different frequency scale factors:

```C
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
```

a clock source unregistration:

```C
int clocksource_unregister(struct clocksource *cs)
```

and so on.
In addition to the `clocksource` framework, the Linux kernel provides the `clockevents` framework. As described in the [documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt):

> Clock events are the conceptual reverse of clock sources

The main goal of this framework is to manage clock event devices, or in other words - to manage devices that allow us to register an event, in other words an [interrupt](https://en.wikipedia.org/wiki/Interrupt), that is going to happen at a defined point of time in the future.

Now we know a little about the `clockevents` framework in the Linux kernel, and now it is time to look at its [API](https://en.wikipedia.org/wiki/Application_programming_interface).

API of `clockevents` framework
-------------------------------------------------------------------------------
The main structure which describes a clock event device is the `clock_event_device` structure. This structure is defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and contains a huge set of fields. Just like the `clocksource` structure, it has a `name` field which contains a human readable name of a clock event device, for example the [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) timer:

```C
static struct clock_event_device lapic_clockevent = {
	.name		= "lapic",
	...
	...
	...
}
```

It also holds the `event_handler`, `set_next_event` and `next_event` fields for a certain clock event device, which are an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler), a setter of the next event and local storage for the next event respectively. Yet another field of the `clock_event_device` structure is the `features` field. Its value may be one of the following generic features:

```C
#define CLOCK_EVT_FEAT_PERIODIC	0x000001
#define CLOCK_EVT_FEAT_ONESHOT	0x000002
```

Where the `CLOCK_EVT_FEAT_PERIODIC` represents a device which may be programmed to generate events periodically. The `CLOCK_EVT_FEAT_ONESHOT` represents a device which may generate an event only once. Besides these two features, there are also architecture-specific features. For example [x86_64](https://en.wikipedia.org/wiki/X86-64) supports additional features, such as:

```C
#define CLOCK_EVT_FEAT_C3STOP	0x000008
```

The `CLOCK_EVT_FEAT_C3STOP` feature means that a clock event device will be stopped in the [C3](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states) state. Additionally the `clock_event_device` structure has `mult` and `shift` fields, just like the `clocksource` structure. The `clock_event_device` structure also contains other fields, but we will consider them later.
After we have considered part of the `clock_event_device` structure, it is time to look at the `API` of the `clockevents` framework. To work with a clock event device, first of all we need to initialize a `clock_event_device` structure and register the clock event device. The `clockevents` framework provides the following `API` for registration of clock event devices:

```C
void clockevents_register_device(struct clock_event_device *dev)
{
	...
	...
	...
}
```

This function is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and, as we may see, the `clockevents_register_device` function takes only one parameter:

* address of a `clock_event_device` structure which represents a clock event device.

So, to register a clock event device, at first we need to initialize a `clock_event_device` structure with the parameters of a certain clock event device. Let's take a look at one random clock event device in the Linux kernel source code. We can find one in the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory or try to take a look at an architecture-specific clock event device. Let's take for example - the [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf). You can find its implementation in [drivers/clocksource/timer-atmel-pit.c](https://github.com/torvalds/linux/tree/master/drivers/clocksource/timer-atmel-pit.c).

First of all let's look at the initialization of the `clock_event_device` structure. This occurs in the `at91sam926x_pit_common_init` function:
```C
struct pit_data {
	...
	...
	struct clock_event_device clkevt;
	...
	...
};

static void __init at91sam926x_pit_common_init(struct pit_data *data)
{
	...
	...
	...
	data->clkevt.name = "pit";
	data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
	data->clkevt.shift = 32;
	data->clkevt.mult = div_sc(pit_rate, NSEC_PER_SEC, data->clkevt.shift);
	data->clkevt.rating = 100;
	data->clkevt.cpumask = cpumask_of(0);

	data->clkevt.set_state_shutdown = pit_clkevt_shutdown;
	data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
	data->clkevt.resume = at91sam926x_pit_resume;
	data->clkevt.suspend = at91sam926x_pit_suspend;
	...
}
```
Here we can see that `at91sam926x_pit_common_init` takes one parameter - a pointer to the `pit_data` structure which contains a `clock_event_device` structure which will hold the clock event related information of the `at91sam926x` [Periodic Interval Timer](https://en.wikipedia.org/wiki/Programmable_interval_timer). At the start we fill the `name` of the timer device and its `features`. In our case we deal with a periodic timer which, as we already know, may be programmed to generate events periodically.

The next two fields, `shift` and `mult`, are familiar to us. They will be used to translate the counter of our timer to nanoseconds. After this we set the rating of the timer to `100`. This means that if there are no timers with a higher rating in the system, this timer will be used for timekeeping. The next field - `cpumask` - indicates for which processors in the system the device will work. In our case, the device will work for the first processor. The `cpumask_of` macro is defined in the [include/linux/cpumask.h](https://github.com/torvalds/linux/tree/master/include/linux/cpumask.h) header file and just expands to the call of the:

```C
#define cpumask_of(cpu) (get_cpu_mask(cpu))
```

Where the `get_cpu_mask` returns the cpumask containing just the given `cpu` number. You may read more about the `cpumasks` concept in the [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part. In the last four lines of code we set the callbacks for the clock event device suspend/resume, device shutdown and update of the clock event device state.
After we have finished with the initialization of the `at91sam926x` periodic timer, we can register it with the call of the following function:

```C
clockevents_register_device(&data->clkevt);
```
Now we can consider the implementation of the `clockevents_register_device` function. As I already wrote above, this function is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and starts from the initialization of the initial event device state:

```C
clockevent_set_state(dev, CLOCK_EVT_STATE_DETACHED);
```
Actually, an event device may be in one of this states:
```C
enum clock_event_state {
	CLOCK_EVT_STATE_DETACHED,
	CLOCK_EVT_STATE_SHUTDOWN,
	CLOCK_EVT_STATE_PERIODIC,
	CLOCK_EVT_STATE_ONESHOT,
	CLOCK_EVT_STATE_ONESHOT_STOPPED,
};
```

Where:

* `CLOCK_EVT_STATE_DETACHED` - the clock event device is not used by the `clockevents` framework. Actually it is the initial state of all clock event devices;
* `CLOCK_EVT_STATE_SHUTDOWN` - the clock event device is powered-off;
* `CLOCK_EVT_STATE_PERIODIC` - the clock event device may be programmed to generate events periodically;
* `CLOCK_EVT_STATE_ONESHOT` - the clock event device may be programmed to generate an event only once;
* `CLOCK_EVT_STATE_ONESHOT_STOPPED` - the clock event device was programmed to generate an event only once and is now temporarily stopped.
The implementation of the `clockevent_set_state` function is pretty easy:

```C
static inline void clockevent_set_state(struct clock_event_device *dev,
					enum clock_event_state state)
{
	dev->state_use_accessors = state;
}
```

As we can see, it just fills the `state_use_accessors` field of the given `clock_event_device` structure with the given value, which in our case is `CLOCK_EVT_STATE_DETACHED`. Actually all clock event devices have this initial state during registration. The `state_use_accessors` field of the `clock_event_device` structure holds the `current` state of the clock event device.

After we have set the initial state of the given `clock_event_device` structure, we check that the `cpumask` of the given clock event device is not zero:

```C
if (!dev->cpumask) {
	WARN_ON(num_possible_cpus() > 1);
	dev->cpumask = cpumask_of(smp_processor_id());
}
```

Remember that we set the `cpumask` of the `at91sam926x` periodic timer to the first processor. If the `cpumask` field is zero, we print a warning if the number of possible processors in the system is greater than one, and set the `cpumask` of the given clock event device to the current processor. If you are interested in how the `smp_processor_id` macro is implemented, you can read more about it in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter.
After this check we protect the actual code of the clock event device registration with the following macros:

```C
raw_spin_lock_irqsave(&clockevents_lock, flags);
...
...
...
raw_spin_unlock_irqrestore(&clockevents_lock, flags);
```

Additionally, the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros disable local interrupts; interrupts on other processors may still occur. We need this to prevent a potential [deadlock](https://en.wikipedia.org/wiki/Deadlock) in case an interrupt from another clock event device occurs while we are adding the new clock event device to the list of clock event devices.

We can see the following clock event device registration code between the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros:
```C
list_add(&dev->list, &clockevent_devices);
tick_check_new_device(dev);
clockevents_notify_released();
```

First of all we add the given clock event device to the list of clock event devices which is represented by `clockevent_devices`:

```C
static LIST_HEAD(clockevent_devices);
```

At the next step we call the `tick_check_new_device` function which is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and checks whether the newly registered clock event device should be used. The `tick_check_new_device` function gets the currently registered tick device, which is represented by the `tick_device` structure, and compares its rating and features with those of the given `clock_event_device`. Actually, devices with the `CLOCK_EVT_FEAT_ONESHOT` feature are preferred:
```C
static bool tick_check_preferred(struct clock_event_device *curdev,
				 struct clock_event_device *newdev)
{
	if (!(newdev->features & CLOCK_EVT_FEAT_ONESHOT)) {
		if (curdev && (curdev->features & CLOCK_EVT_FEAT_ONESHOT))
			return false;
		if (tick_oneshot_mode_active())
			return false;
	}

	return !curdev ||
		newdev->rating > curdev->rating ||
		!cpumask_equal(curdev->cpumask, newdev->cpumask);
}
```
If the newly registered clock event device is preferable to the old tick device, we exchange the old and new registered devices and install the new device:

```C
clockevents_exchange_device(curdev, newdev);
tick_setup_device(td, newdev, cpu, cpumask_of(cpu));
```

The `clockevents_exchange_device` function releases, or in other words deletes, the old clock event device from the `clockevent_devices` list. The next function - `tick_setup_device`, as we may understand from its name, sets up the new tick device. This function checks the mode of the tick device and calls either the `tick_setup_periodic` function or the `tick_setup_oneshot` function, depending on that mode:

```C
if (td->mode == TICKDEV_MODE_PERIODIC)
	tick_setup_periodic(newdev, 0);
else
	tick_setup_oneshot(newdev, handler, next_event);
```

Both of these functions call `clockevents_switch_state` to change the state of the clock event device and `clockevents_program_event` to program the next event of the clock event device, clamping the delta between the current time and the time of the next event to the device's minimum and maximum. The `tick_setup_periodic`:
```C
clockevents_switch_state(dev, CLOCK_EVT_STATE_PERIODIC);
clockevents_program_event(dev, next, false);
```

and the `tick_setup_oneshot`:

```C
clockevents_switch_state(newdev, CLOCK_EVT_STATE_ONESHOT);
clockevents_program_event(newdev, next_event, true);
```
The `clockevents_switch_state` function checks that the clock event device is not already in the given state and calls the `__clockevents_switch_state` function from the same source code file:

```C
if (clockevent_get_state(dev) != state) {
	if (__clockevents_switch_state(dev, state))
		return;
```

The `__clockevents_switch_state` function just calls a certain callback, depending on the given state:
```C
static int __clockevents_switch_state(struct clock_event_device *dev,
				      enum clock_event_state state)
{
	if (dev->features & CLOCK_EVT_FEAT_DUMMY)
		return 0;

	switch (state) {
	case CLOCK_EVT_STATE_DETACHED:
	case CLOCK_EVT_STATE_SHUTDOWN:
		if (dev->set_state_shutdown)
			return dev->set_state_shutdown(dev);
		return 0;

	case CLOCK_EVT_STATE_PERIODIC:
		if (!(dev->features & CLOCK_EVT_FEAT_PERIODIC))
			return -ENOSYS;
		if (dev->set_state_periodic)
			return dev->set_state_periodic(dev);
		return 0;
	...
	...
	...
```
In our case of the `at91sam926x` periodic timer, the state will be `CLOCK_EVT_STATE_PERIODIC`, and the device provides the `CLOCK_EVT_FEAT_PERIODIC` feature:

```C
data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
```

So, the `pit_clkevt_set_periodic` callback will be called. If we read the documentation of the [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf), we will see that there is a `Periodic Interval Timer Mode Register` which allows us to control the periodic interval timer.
It looks like:

```
31                                                    25      24
+---------------------------------------------------------------+
|                                            | PITIEN |  PITEN  |
+---------------------------------------------------------------+
23                                  19                         16
+---------------------------------------------------------------+
|                                   |           PIV             |
+---------------------------------------------------------------+
15                                                              8
+---------------------------------------------------------------+
|                              PIV                              |
+---------------------------------------------------------------+
7                                                               0
+---------------------------------------------------------------+
|                              PIV                              |
+---------------------------------------------------------------+
```

Where `PIV` or `Periodic Interval Value` defines the value compared with the primary `20-bit` counter of the Periodic Interval Timer. The `PITEN` or `Periodic Interval Timer Enabled` bit enables the timer when it is `1`, and the `PITIEN` or `Periodic Interval Timer Interrupt Enable` bit enables its interrupt when it is `1`. So, to set periodic mode, we need to set bits `24` and `25` in the `Periodic Interval Timer Mode Register`. And we are doing exactly that in the `pit_clkevt_set_periodic` function:
```C
static int pit_clkevt_set_periodic(struct clock_event_device *dev)
{
	struct pit_data *data = clkevt_to_pit_data(dev);
	...
	...
	...
	pit_write(data->base, AT91_PIT_MR,
		  (data->cycle - 1) | AT91_PIT_PITEN | AT91_PIT_PITIEN);

	return 0;
}
```

Where `AT91_PIT_MR`, `AT91_PIT_PITEN` and `AT91_PIT_PITIEN` are declared as:

```C
#define AT91_PIT_MR		0x00
#define AT91_PIT_PITIEN		BIT(25)
#define AT91_PIT_PITEN		BIT(24)
```
After the setup of the new clock event device is finished, we can return to the `clockevents_register_device` function. The last function called in `clockevents_register_device` is:

```C
clockevents_notify_released();
```

This function checks the `clockevents_released` list which contains released clock event devices (remember that they may appear after the call of the `clockevents_exchange_device` function). If this list is not empty, we go through the clock event devices in it, delete each one from the `clockevents_released` list, add it back to `clockevent_devices` and re-check it with `tick_check_new_device`:
```C
static void clockevents_notify_released(void)
{
	struct clock_event_device *dev;

	while (!list_empty(&clockevents_released)) {
		dev = list_entry(clockevents_released.next,
				 struct clock_event_device, list);
		list_del(&dev->list);
		list_add(&dev->list, &clockevent_devices);
		tick_check_new_device(dev);
	}
}
```
That's all. From this moment we have registered a new clock event device. So the usage of the `clockevents` framework is simple and clear: architectures register their clock event devices in the clock events core, and users of the clock events core can get clock event devices for their use. The `clockevents` framework provides notification mechanisms for various clock related management events, such as a clock event device being registered or unregistered, or a processor being offlined in a system which supports [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt), and so on.

We saw the implementation only of the `clockevents_register_device` function. But generally, the clock event layer [API](https://en.wikipedia.org/wiki/Application_programming_interface) is small. Besides the `API` for clock event device registration, the `clockevents` framework provides functions to schedule the next event interrupt, a clock event device notification service and support for suspend and resume of clock event devices.

If you want to know more about the `clockevents` API you can start to research the following source code and header files: [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c), [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) and [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h).

That's all.
Conclusion
-------------------------------------------------------------------------------

This is the end of the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `timers` concept. In this part we continued to learn time management related stuff in the Linux kernel and saw a little of yet another framework - `clockevents`.
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
-------------------------------------------------------------------------------

* [timekeeping documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt)
* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
* [ACPI pdf](http://uefi.org/sites/default/files/resources/ACPI_5.pdf)
* [x86](https://en.wikipedia.org/wiki/X86)
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [powerpc](https://en.wikipedia.org/wiki/PowerPC)
* [frequency](https://en.wikipedia.org/wiki/Frequency)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
* [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [C3 state](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states)
* [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf)
* [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [deadlock](https://en.wikipedia.org/wiki/Deadlock)
* [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)
413
Timers/timers-6.md
Normal file
@@ -0,0 +1,413 @@
Timers and time management in the Linux kernel. Part 6.
================================================================================

x86_64 related clock sources
--------------------------------------------------------------------------------

This is the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html) we saw the `clockevents` framework, and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe the implementation of [x86](https://en.wikipedia.org/wiki/X86) architecture related clock sources (more about the `clocksource` concept you can read in the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter).

First of all we must find out what clock sources may be used on the `x86` architecture. The easiest way is to look at [sysfs](https://en.wikipedia.org/wiki/Sysfs): the `/sys/devices/system/clocksource/clocksourceN` directory provides two special files for this:
* `available_clocksource` - provides information about available clock sources in the system;
* `current_clocksource` - provides information about the currently used clock source in the system.

So, let's look:

```
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
```

We can see that there are three registered clock sources in my system:
* `tsc` - [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter);
* `hpet` - [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer);
* `acpi_pm` - [ACPI Power Management Timer](http://uefi.org/sites/default/files/resources/ACPI_5.pdf).

Now let's look at the second file which shows the best clock source (the clock source which has the best rating in the system):

```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
```

For me it is the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). As we may know from the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter, which describes the internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is the clock source with the best (highest) rating, or in other words with the highest [frequency](https://en.wikipedia.org/wiki/Frequency).

The frequency of the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) power management timer is `3.579545 MHz`. The frequency of the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is at least `10 MHz`. And the frequency of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) depends on the processor. For example, on older processors the `Time Stamp Counter` was counting internal processor clock cycles, which means its frequency changed when the processor's frequency scaling changed. The situation has changed for newer processors: they have an `invariant Time Stamp Counter` that increments at a constant rate in all operational states of the processor. Actually we can get its frequency from the output of `/proc/cpuinfo`. For example, for the first processor in the system:
```
$ cat /proc/cpuinfo
...
model name : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
...
```

And although the Intel manual says that the frequency of the `Time Stamp Counter`, while constant, is not necessarily the maximum qualified frequency of the processor or the frequency given in the brand string, we may see that it will be much higher than the frequency of the `ACPI PM` timer or the `High Precision Event Timer`. And we can see that the clock source with the best rating or highest frequency is the current one in the system.

You can note that besides these three clock sources, we don't see two other clock sources familiar to us in the output of `/sys/devices/system/clocksource/clocksource0/available_clocksource`: `jiffies` and `refined_jiffies`. We don't see them because this file shows only high resolution clock sources, or in other words clock sources with the [CLOCK_SOURCE_VALID_FOR_HRES](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h#L113) flag.

As I already wrote above, we will consider all three of these clock sources in this part, in the order of their initialization:

* `hpet`;
* `acpi_pm`;
* `tsc`.

We can make sure that the order is exactly like this in the output of the [dmesg](https://en.wikipedia.org/wiki/Dmesg) util:
```
$ dmesg | grep clocksource
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094369] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.186498] clocksource: Switched to clocksource hpet
[    0.196827] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.413685] tsc: Refined TSC clocksource calibration: 3999.981 MHz
[    1.413688] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x73509721780, max_idle_ns: 881591102108 ns
[    2.413748] clocksource: Switched to clocksource tsc
```

The first clock source is the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), so let's start with it.
High Precision Event Timer
--------------------------------------------------------------------------------

The implementation of the `High Precision Event Timer` for the [x86](https://en.wikipedia.org/wiki/X86) architecture is located in the [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c) source code file. Its initialization starts with the call of the `hpet_enable` function, which is called during Linux kernel initialization. If we look into the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, we will see that after all the architecture-specific stuff is initialized, the early console is disabled and the time management subsystem is ready, the following function is called:

```C
if (late_time_init)
	late_time_init();
```

which does initialization of the late architecture specific timers after the early jiffy counter is already initialized. The definition of the `late_time_init` function for the `x86` architecture is located in the [arch/x86/kernel/time.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/time.c) source code file. It looks pretty easy:

```C
static __init void x86_late_time_init(void)
{
	x86_init.timers.timer_init();
	tsc_init();
}
```

As we may see, it does initialization of the `x86` related timer and initialization of the `Time Stamp Counter`. The second we will see in the next paragraph, but now let's consider the call of the `x86_init.timers.timer_init` function. `timer_init` points to the `hpet_time_init` function from the same source code file. We can verify this by looking at the definition of the `x86_init` structure from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c):
```C
struct x86_init_ops x86_init __initdata = {
	...
	...
	...
	.timers = {
		.setup_percpu_clockev	= setup_boot_APIC_clock,
		.timer_init		= hpet_time_init,
		.wallclock_init		= x86_init_noop,
	},
	...
	...
	...
```

The `hpet_time_init` function sets up the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) if we can not enable the `High Precision Event Timer`, and sets up the default timer [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) for the enabled timer:
```C
void __init hpet_time_init(void)
{
	if (!hpet_enable())
		setup_pit_timer();
	setup_default_timer_irq();
}
```

First of all the `hpet_enable` function checks whether we can enable the `High Precision Event Timer` in the system by calling the `is_hpet_capable` function, and if we can, we map a virtual address space for it:

```C
int __init hpet_enable(void)
{
	if (!is_hpet_capable())
		return 0;

	hpet_set_mapping();
}
```

The `is_hpet_capable` function checks that we didn't pass `hpet=disable` to the kernel command line and that the `hpet_address` was received from the [ACPI HPET](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table. The `hpet_set_mapping` function just maps the virtual address space for the timer registers:
```C
hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
```

As we can read in the [IA-PC HPET (High Precision Event Timers) Specification](http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf):

> The timer register space is 1024 bytes

So, the `HPET_MMAP_SIZE` is `1024` bytes too:

```C
#define HPET_MMAP_SIZE		1024
```

After we mapped the virtual space for the `High Precision Event Timer`, we read the `HPET_ID` register to get the number of timers:

```C
id = hpet_readl(HPET_ID);

last = (id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT;
```

We need to get this number to allocate the correct amount of space for the `General Configuration Register` of the `High Precision Event Timer`:
```C
cfg = hpet_readl(HPET_CFG);

hpet_boot_cfg = kmalloc((last + 2) * sizeof(*hpet_boot_cfg), GFP_KERNEL);
```

After the space is allocated for the configuration register of the `High Precision Event Timer`, we allow the main counter to run and allow timer interrupts if they are enabled, by setting the `HPET_CFG_ENABLE` bit in the configuration register for all timers. In the end we just register the new clock source by the call of the `hpet_clocksource_register` function:

```C
if (hpet_clocksource_register())
	goto out_nohpet;
```
which just calls the already familiar

```C
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
```

function. Here `clocksource_hpet` is the `clocksource` structure with the rating `250` (remember, the rating of the previous `refined_jiffies` clock source was `2`), the name `hpet` and the `read_hpet` callback for reading the atomic counter provided by the `High Precision Event Timer`:

```C
static struct clocksource clocksource_hpet = {
	.name		= "hpet",
	.rating		= 250,
	.read		= read_hpet,
	.mask		= HPET_MASK,
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
	.resume		= hpet_resume_counter,
	.archdata	= { .vclock_mode = VCLOCK_HPET },
};
```

After the `clocksource_hpet` is registered, we can return to the `hpet_time_init()` function from the [arch/x86/kernel/time.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/time.c) source code file. We can remember that its last step is the call of:

```C
setup_default_timer_irq();
```

The `setup_default_timer_irq` function checks for the existence of `legacy` IRQs, or in other words support for the [i8259](https://en.wikipedia.org/wiki/Intel_8259), and sets up [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC) depending on this.

That's all. From this moment the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) clock source is registered in the Linux kernel `clock source` framework and may be used from generic kernel code via the `read_hpet`:
```C
static cycle_t read_hpet(struct clocksource *cs)
{
	return (cycle_t)hpet_readl(HPET_COUNTER);
}
```

function which just reads and returns the atomic counter from the `Main Counter Register`.

ACPI PM timer
--------------------------------------------------------------------------------

The second clock source is the [ACPI Power Management Timer](http://uefi.org/sites/default/files/resources/ACPI_5.pdf). The implementation of this clock source is located in the [drivers/clocksource/acpi_pm.c](https://github.com/torvalds/linux/blob/master/drivers/clocksource/acpi_pm.c) source code file and starts from the call of the `init_acpi_pm_clocksource` function during the `fs` [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html).
If we look at the implementation of the `init_acpi_pm_clocksource` function, we will see that it starts from the check of the value of the `pmtmr_ioport` variable:

```C
static int __init init_acpi_pm_clocksource(void)
{
	...
	...
	...
	if (!pmtmr_ioport)
		return -ENODEV;
	...
	...
	...
```

This `pmtmr_ioport` variable contains the extended address of the `Power Management Timer Control Register Block`. It gets its value in the `acpi_parse_fadt` function which is defined in the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) source code file. This function parses the `FADT` or `Fixed ACPI Description Table` [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table and tries to get the value of the `X_PM_TMR_BLK` field which contains the extended address of the `Power Management Timer Control Register Block`, represented in the `Generic Address Structure` format:
```C
static int __init acpi_parse_fadt(struct acpi_table_header *table)
{
#ifdef CONFIG_X86_PM_TIMER
	...
	...
	...
	pmtmr_ioport = acpi_gbl_FADT.xpm_timer_block.address;
	...
	...
	...
#endif
	return 0;
}
```

So, if the `CONFIG_X86_PM_TIMER` Linux kernel configuration option is disabled or something goes wrong in the `acpi_parse_fadt` function, we can't access the `Power Management Timer` register and we return from `init_acpi_pm_clocksource`. Otherwise, if the value of the `pmtmr_ioport` variable is not zero, we check the rate of this timer and register this clock source by the call of the:

```C
clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
```

function. After the call of the `clocksource_register_hz`, the `acpi_pm` clock source will be registered in the `clocksource` framework of the Linux kernel:
```C
static struct clocksource clocksource_acpi_pm = {
	.name		= "acpi_pm",
	.rating		= 200,
	.read		= acpi_pm_read,
	.mask		= (cycle_t)ACPI_PM_MASK,
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
};
```

with the rating `200` and the `acpi_pm_read` callback to read the atomic counter provided by the `acpi_pm` clock source. The `acpi_pm_read` function just executes the `read_pmtmr` function:

```C
static cycle_t acpi_pm_read(struct clocksource *cs)
{
	return (cycle_t)read_pmtmr();
}
```

which reads the value of the `Power Management Timer` register. This register has the following structure:
|
||||
|
||||
```
|
||||
+-------------------------------+----------------------------------+
|
||||
| | |
|
||||
| upper eight bits of a | running count of the |
|
||||
| 32-bit power management timer | power management timer |
|
||||
| | |
|
||||
+-------------------------------+----------------------------------+
|
||||
31 E_TMR_VAL 24 TMR_VAL 0
|
||||
```
|
||||
|
||||
Address of this register is stored in the `Fixed ACPI Description Table` [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table and we already have it in the `pmtmr_ioport`. So, the implementation of the `read_pmtmr` function is pretty easy:
|
||||
|
||||
```C
|
||||
static inline u32 read_pmtmr(void)
|
||||
{
|
||||
return inl(pmtmr_ioport) & ACPI_PM_MASK;
|
||||
}
|
||||
```
|
||||
|
||||
We just read the value of the `Power Management Timer` register and mask its `24` bits.
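Because the counter is only `24` bits wide, a consumer of this clock source has to compute deltas modulo 2^24 so that a counter roll-over does not produce a huge bogus interval. A hypothetical userspace sketch of that arithmetic (`pmtmr_delta` is my own name, not a kernel function):

```c
#include <stdint.h>

#define ACPI_PM_MASK 0xFFFFFFu          /* low 24 bits of the counter */

/* Hypothetical helper: ticks elapsed between two 24-bit counter reads.
 * Unsigned subtraction followed by masking handles a single wrap-around
 * of the counter correctly. */
static uint32_t pmtmr_delta(uint32_t start, uint32_t end)
{
    return (end - start) & ACPI_PM_MASK;
}
```

At the nominal `PMTMR_TICKS_PER_SEC` rate of 3579545 Hz, the 24-bit counter wraps roughly every 4.7 seconds, so reads must be frequent enough that at most one wrap occurs between them.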
That's all. Now we can move on to the last clock source in this part - the `Time Stamp Counter`.

Time Stamp Counter
--------------------------------------------------------------------------------

The third and last clock source in this part is the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) clock source and its implementation is located in the [arch/x86/kernel/tsc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c) source code file. We already saw the `x86_late_time_init` function in this part, and initialization of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) starts from this place. This function calls the `tsc_init()` function from the [arch/x86/kernel/tsc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c) source code file.

At the beginning of the `tsc_init` function we can see a check of whether the processor supports the `Time Stamp Counter`:

```C
void __init tsc_init(void)
{
	u64 lpj;
	int cpu;

	if (!cpu_has_tsc) {
		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
		return;
	}
	...
	...
	...
```

The `cpu_has_tsc` macro expands to the call of the `cpu_has` macro:

```C
#define cpu_has_tsc		boot_cpu_has(X86_FEATURE_TSC)

#define boot_cpu_has(bit)	cpu_has(&boot_cpu_data, bit)

#define cpu_has(c, bit)							\
	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
	 test_cpu_cap(c, bit))
```

which checks the given bit (the `X86_FEATURE_TSC` in our case) in the `boot_cpu_data` structure which is filled during early Linux kernel initialization. If the processor supports the `Time Stamp Counter`, we get its frequency by calling the `calibrate_tsc` function from the same source code file, which tries to get the frequency from different sources like a [Model Specific Register](https://en.wikipedia.org/wiki/Model-specific_register), calibration over the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer), etc. After this we initialize the frequency and the scale factor for all processors in the system:

```C
tsc_khz = x86_platform.calibrate_tsc();
cpu_khz = tsc_khz;

for_each_possible_cpu(cpu) {
	cyc2ns_init(cpu);
	set_cyc2ns_scale(cpu_khz, cpu);
}
```

because only the first bootstrap processor will call `tsc_init`. After this we check that the `Time Stamp Counter` is not disabled:
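The `cyc2ns` scale set up here boils down to converting cycle counts into nanoseconds using the calibrated frequency. A simplified sketch of the relation (the kernel uses a precomputed multiplier-and-shift form for speed; this plain-division version is only illustrative and can overflow for very large cycle counts):

```c
#include <stdint.h>

/* Illustrative only: with tsc_khz kilocycles per second, one cycle
 * lasts 10^6 / tsc_khz nanoseconds, so:
 *   ns = cycles * 10^6 / tsc_khz
 * The kernel's cyc2ns avoids the division on every call by folding
 * this ratio into a multiply and a right shift. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t tsc_khz)
{
    return cycles * 1000000ULL / tsc_khz;
}
```

For example, 3,000,000 cycles on a 3 GHz (3,000,000 kHz) processor correspond to one millisecond.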
```C
if (tsc_disabled > 0)
	return;
...
...
...
check_system_tsc_reliable();
```

and call the `check_system_tsc_reliable` function which sets the `tsc_clocksource_reliable` flag if the bootstrap processor has the `X86_FEATURE_TSC_RELIABLE` feature. Note that we went through the `tsc_init` function, but did not register our clock source. The actual registration of the `Time Stamp Counter` clock source occurs in the:

```C
static int __init init_tsc_clocksource(void)
{
	if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
		return 0;
	...
	...
	...
	if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
		clocksource_register_khz(&clocksource_tsc, tsc_khz);
		return 0;
	}
```

function. This function is called during the `device` [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html). We do this to be sure that the `Time Stamp Counter` clock source will be registered after the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) clock source.

After this, all three clock sources will be registered in the `clocksource` framework and the `Time Stamp Counter` clock source will be selected as the active one, because it has the highest rating among the other clock sources:

```C
static struct clocksource clocksource_tsc = {
	.name			= "tsc",
	.rating			= 300,
	.read			= read_tsc,
	.mask			= CLOCKSOURCE_MASK(64),
	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
				  CLOCK_SOURCE_MUST_VERIFY,
	.archdata		= { .vclock_mode = VCLOCK_TSC },
};
```

That's all.
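The "highest rating wins" selection described above can be sketched in a few lines. This is a standalone toy, not the kernel's `clocksource_select` implementation - it only illustrates why `tsc` (rating 300) beats `hpet` (250) and `acpi_pm` (200):

```c
/* Toy model of a clock source entry: just a name and a rating. */
struct clocksource {
    const char *name;
    int rating;
};

/* Return the entry with the largest rating; ties keep the earlier one. */
static const struct clocksource *
pick_best(const struct clocksource *cs, int n)
{
    const struct clocksource *best = &cs[0];
    for (int i = 1; i < n; i++)
        if (cs[i].rating > best->rating)
            best = &cs[i];
    return best;
}
```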
Conclusion
--------------------------------------------------------------------------------

This is the end of the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `clockevents` framework. In this part we continued to learn about time management related stuff in the Linux kernel and saw a little about three different clock sources which are used in the [x86](https://en.wikipedia.org/wiki/X86) architecture. The next part will be the last part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) and we will see some user space related stuff, i.e. how some time related [system calls](https://en.wikipedia.org/wiki/System_call) are implemented in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [x86](https://en.wikipedia.org/wiki/X86)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [ACPI Power Management Timer (PDF)](http://uefi.org/sites/default/files/resources/ACPI_5.pdf)
* [frequency](https://en.wikipedia.org/wiki/Frequency)
* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [IA-PC HPET (High Precision Event Timers) Specification](http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf)
* [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC)
* [i8259](https://en.wikipedia.org/wiki/Intel_8259)
* [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html)
421
Timers/timers-7.md
Normal file
@@ -0,0 +1,421 @@
Timers and time management in the Linux kernel. Part 7.
================================================================================

Time related system calls in the Linux kernel
--------------------------------------------------------------------------------

This is the seventh and last part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html) we saw [x86_64](https://en.wikipedia.org/wiki/X86-64) specific clock sources like the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) and the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the concept of `time`. Our programs need to know the time too. In this part, we will consider the implementation of some time management related [system calls](https://en.wikipedia.org/wiki/System_call). These system calls are:

* `clock_gettime`;
* `gettimeofday`;
* `nanosleep`.

We will start from a simple userspace [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program and trace the whole way from the call of a [standard library](https://en.wikipedia.org/wiki/Standard_library) function to the implementation of the corresponding system call. As each [architecture](https://github.com/torvalds/linux/tree/master/arch) provides its own implementation of a given system call, we will consider only the [x86_64](https://en.wikipedia.org/wiki/X86-64) specific implementations of these system calls, as this book is related to this architecture.

Additionally, we will not consider the concept of system calls in this part, but only the implementations of these three system calls in the Linux kernel. If you are interested in what a `system call` is, there is a special [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about this.

So, let's start from the `gettimeofday` system call.
Implementation of the `gettimeofday` system call
--------------------------------------------------------------------------------

As we can understand from the name `gettimeofday`, this function returns the current time. First of all, let's look at the following simple example:

```C
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	char buffer[40];
	struct timeval time;

	gettimeofday(&time, NULL);

	strftime(buffer, 40, "Current date/time: %m-%d-%Y/%T", localtime(&time.tv_sec));
	printf("%s\n", buffer);

	return 0;
}
```

As you can see, here we call the `gettimeofday` function which takes two parameters. The first is a pointer to the `timeval` structure, which represents an elapsed time:

```C
struct timeval {
	time_t      tv_sec;     /* seconds */
	suseconds_t tv_usec;    /* microseconds */
};
```

The second parameter of the `gettimeofday` function is a pointer to the `timezone` structure, which represents a timezone. In our example, we pass the address of the `timeval time` to the `gettimeofday` function, the Linux kernel fills the given `timeval` structure and returns it back to us. Additionally, we format the time with the `strftime` function to get something more human readable than elapsed microseconds. Let's look at the result:

```
~$ gcc date.c -o date
~$ ./date
Current date/time: 03-26-2016/16:42:02
```
As you may already know, a userspace application does not call a system call directly in the kernel space. Before the actual system call entry is reached, we call a function from the standard library. In my case it is [glibc](https://en.wikipedia.org/wiki/GNU_C_Library), so I will consider this case. The implementation of the `gettimeofday` function is located in the [sysdeps/unix/sysv/linux/x86/gettimeofday.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD) source code file. As you may already know, `gettimeofday` is not a usual system call. It is located in the special area which is called `vDSO` (you can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) which describes this concept).

The `glibc` implementation of `gettimeofday` tries to resolve the given symbol - in our case this symbol is `__vdso_gettimeofday` - by the call of the `_dl_vdso_vsym` internal function. If the symbol cannot be resolved, it returns `NULL` and we fall back to the call of the usual system call:

```C
return (_dl_vdso_vsym ("__vdso_gettimeofday", &linux26)
        ?: (void*) (&__gettimeofday_syscall));
```

The `gettimeofday` entry is located in the [arch/x86/entry/vdso/vclock_gettime.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vclock_gettime.c) source code file. As we can see, the `gettimeofday` is a weak alias of the `__vdso_gettimeofday`:

```C
int gettimeofday(struct timeval *, struct timezone *)
	__attribute__((weak, alias("__vdso_gettimeofday")));
```

The `__vdso_gettimeofday` is defined in the same source code file and calls the `do_realtime` function if the given `timeval` is not null:

```C
notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
	if (likely(tv != NULL)) {
		if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
			return vdso_fallback_gtod(tv, tz);
		tv->tv_usec /= 1000;
	}
	if (unlikely(tz != NULL)) {
		tz->tz_minuteswest = gtod->tz_minuteswest;
		tz->tz_dsttime = gtod->tz_dsttime;
	}

	return 0;
}
```

If `do_realtime` fails, we fall back to the real system call by executing the `syscall` instruction, passing the `__NR_gettimeofday` system call number and the given `timeval` and `timezone`:

```C
notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
{
	long ret;

	asm("syscall" : "=a" (ret) :
	    "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
	return ret;
}
```
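The same fallback path can be exercised from userspace by invoking the raw system call through glibc's `syscall(2)` wrapper, which bypasses the vDSO fast path entirely (the wrapper name below is my own, for illustration only):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical wrapper: issue gettimeofday as a real system call -
 * the same thing vdso_fallback_gtod does with inline assembly. */
static int gettimeofday_syscall(struct timeval *tv, struct timezone *tz)
{
    return (int)syscall(SYS_gettimeofday, tv, tz);
}
```

Comparing the cost of this wrapper against the plain `gettimeofday` library call is an easy way to observe the vDSO speedup, since the raw variant pays for a full kernel entry on every call.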
The `do_realtime` function gets the time data from the `vsyscall_gtod_data` structure which is defined in the [arch/x86/include/asm/vgtod.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/vgtod.h#L16) header file and contains a mapping of the `timespec` structure and a couple of fields which are related to the current clock source in the system. This function fills the given `timeval` structure with values from the `vsyscall_gtod_data`, which contains time related data that is updated via the timer interrupt.

First of all we try to get access to the `gtod` (`global time of day`) `vsyscall_gtod_data` structure via the call of the `gtod_read_begin` function, and we retry until the read is successful:

```C
do {
	seq = gtod_read_begin(gtod);
	mode = gtod->vclock_mode;
	ts->tv_sec = gtod->wall_time_sec;
	ns = gtod->wall_time_snsec;
	ns += vgetsns(&mode);
	ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));

ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
```

Once we have access to the `gtod`, we fill `ts->tv_sec` with `gtod->wall_time_sec`, which stores the current time in seconds obtained from the [real time clock](https://en.wikipedia.org/wiki/Real-time_clock) during initialization of the timekeeping subsystem in the Linux kernel, and do the same for the nanoseconds value. At the end of this code we just fill the given `timespec` structure with the resulting values.

That's all about the `gettimeofday` system call. The next system call in our list is `clock_gettime`.
Implementation of the clock_gettime system call
--------------------------------------------------------------------------------

The `clock_gettime` function gets the time of the clock which is specified by its first parameter. Generally the `clock_gettime` function takes two parameters:

* `clk_id` - clock identifier;
* `timespec` - address of the `timespec` structure which represents the elapsed time.

Let's look at the following simple example:

```C
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	struct timespec elapsed_from_boot;

	clock_gettime(CLOCK_BOOTTIME, &elapsed_from_boot);

	printf("%ld - seconds elapsed from boot\n", (long)elapsed_from_boot.tv_sec);

	return 0;
}
```
which prints `uptime` information:

```
~$ gcc uptime.c -o uptime
~$ ./uptime
14180 - seconds elapsed from boot
```

We can easily check the result with the help of the [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime) util:

```
~$ uptime
up  3:56
```

The `elapsed_from_boot.tv_sec` represents the elapsed time in seconds, so:

```python
>>> 14180 // 60
236
>>> 14180 // 60 // 60
3
>>> 14180 // 60 % 60
56
```
The `clock_id` may be one of the following:

* `CLOCK_REALTIME` - system wide clock which measures real or wall-clock time;
* `CLOCK_REALTIME_COARSE` - faster but less precise version of the `CLOCK_REALTIME`;
* `CLOCK_MONOTONIC` - represents monotonic time since some unspecified starting point;
* `CLOCK_MONOTONIC_COARSE` - faster but less precise version of the `CLOCK_MONOTONIC`;
* `CLOCK_MONOTONIC_RAW` - the same as the `CLOCK_MONOTONIC`, but not subject to [NTP](https://en.wikipedia.org/wiki/Network_Time_Protocol) adjustments;
* `CLOCK_BOOTTIME` - the same as the `CLOCK_MONOTONIC`, but additionally includes any time that the system was suspended;
* `CLOCK_PROCESS_CPUTIME_ID` - per-process time consumed by all threads in the process;
* `CLOCK_THREAD_CPUTIME_ID` - thread-specific clock.

The `clock_gettime` is not a usual syscall either; like `gettimeofday`, this system call is placed in the `vDSO` area. The entry of this system call is located in the same source code file - [arch/x86/entry/vdso/vclock_gettime.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vclock_gettime.c) - as the one for `gettimeofday`.
The implementation of `clock_gettime` depends on the clock id. If we have passed the `CLOCK_REALTIME` clock id, the `do_realtime` function will be called:

```C
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
	switch (clock) {
	case CLOCK_REALTIME:
		if (do_realtime(ts) == VCLOCK_NONE)
			goto fallback;
		break;
	...
	...
	...
fallback:
	return vdso_fallback_gettime(clock, ts);
}
```

In other cases, the corresponding `do_{name_of_clock_id}` function is called. The implementations of some of them are similar. For example, if we pass the `CLOCK_MONOTONIC` clock id:

```C
	...
	...
	...
	case CLOCK_MONOTONIC:
		if (do_monotonic(ts) == VCLOCK_NONE)
			goto fallback;
		break;
	...
	...
	...
```

the `do_monotonic` function will be called, which is very similar to the implementation of `do_realtime`:

```C
notrace static int __always_inline do_monotonic(struct timespec *ts)
{
	unsigned long seq;
	u64 ns;
	int mode;

	do {
		seq = gtod_read_begin(gtod);
		mode = gtod->vclock_mode;
		ts->tv_sec = gtod->monotonic_time_sec;
		ns = gtod->monotonic_time_snsec;
		ns += vgetsns(&mode);
		ns >>= gtod->shift;
	} while (unlikely(gtod_read_retry(gtod, seq)));

	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
	ts->tv_nsec = ns;

	return mode;
}
```

We already saw a little about the implementation of this function in the previous paragraph about `gettimeofday`. There is only one difference here: the `sec` and `nsec` of our `timespec` value will be based on `gtod->monotonic_time_sec` instead of `gtod->wall_time_sec`, together with `gtod->monotonic_time_snsec`, the number of [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) elapsed.
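In userspace, `CLOCK_MONOTONIC` is the usual choice for measuring intervals, precisely because it is not affected by wall-clock adjustments. A small helper for subtracting two readings (the function name is my own, not part of any library):

```c
#include <time.h>

/* Hypothetical helper: difference between two struct timespec
 * readings - e.g. taken with clock_gettime(CLOCK_MONOTONIC, ...) -
 * expressed as a floating-point number of seconds. */
static double elapsed_seconds(const struct timespec *start,
                              const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}
```

Taking one reading before and one after a piece of work and passing both to this helper gives a duration that is immune to NTP corrections and daylight-saving jumps.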
That's all.

Implementation of the `nanosleep` system call
--------------------------------------------------------------------------------

The last system call in our list is `nanosleep`. As you can understand from its name, this function provides a `sleeping` ability. Let's look at the following simple example:

```C
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

int main (void)
{
	struct timespec ts = {5, 0};

	printf("sleep five seconds\n");
	nanosleep(&ts, NULL);
	printf("end of sleep\n");

	return 0;
}
```

If we compile and run it, we will see the first line

```
~$ gcc sleep_test.c -o sleep
~$ ./sleep
sleep five seconds
end of sleep
```

and the second line after five seconds.

The `nanosleep` is not located in the `vDSO` area like the `gettimeofday` and the `clock_gettime` functions. So, let's look at how the `real` system call which is located in the kernel space will be called by the standard library. The implementation of the `nanosleep` system call will be invoked with the help of the [syscall](http://www.felixcloutier.com/x86/SYSCALL.html) instruction. Before the execution of the `syscall` instruction, the parameters of the system call must be put into the processor [registers](https://en.wikipedia.org/wiki/Processor_register) in the order described by the [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf), or in other words:

* `rdi` - first parameter;
* `rsi` - second parameter;
* `rdx` - third parameter;
* `r10` - fourth parameter;
* `r8` - fifth parameter;
* `r9` - sixth parameter.

The `nanosleep` system call takes two parameters - two pointers to `timespec` structures. The system call suspends the calling thread until the given timeout has elapsed, or until a signal interrupts its execution. The first parameter is the `timespec` which represents the sleep timeout. The second parameter is also a pointer to a `timespec` structure, and it receives the remainder of the time if the call to `nanosleep` was interrupted.
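The remainder parameter is what makes a signal-safe sleep loop possible: when a signal cuts the sleep short, `nanosleep` returns `-1` with `errno` set to `EINTR` and stores the unslept time, which can then be fed back in. A minimal sketch of such a loop (`sleep_full` is my own name, not a libc function):

```c
#define _POSIX_C_SOURCE 199309L
#include <errno.h>
#include <time.h>

/* Hypothetical wrapper: keep sleeping until the full timeout has
 * elapsed, restarting from the remainder whenever a signal
 * interrupts the call. Returns 0 on success, -1 on a real error. */
static int sleep_full(struct timespec req)
{
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;  /* continue with the unslept remainder */
    }
    return 0;
}
```

Without this loop, a stray `SIGCHLD` or similar signal would silently shorten the sleep.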
As `nanosleep` has two parameters:

```C
int nanosleep(const struct timespec *req, struct timespec *rem);
```

to call the system call, we need to put the `req` into the `rdi` register and the `rem` parameter into the `rsi` register. The [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) does this job in the `INTERNAL_SYSCALL` macro which is located in the [sysdeps/unix/sysv/linux/x86_64/sysdep.h](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h;h=d023d68174d3dfb4e698160b31ae31ad291802e1;hb=HEAD) header file:

```C
# define INTERNAL_SYSCALL(name, err, nr, args...)			\
	INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)
```

which takes the name of the system call (from which the system call number is derived - all `x86_64` system calls can be found in the [system calls table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)), storage for a possible error during execution of the system call, the number of arguments and the arguments of the given system call. The `INTERNAL_SYSCALL` macro just expands to the call of the `INTERNAL_SYSCALL_NCS` macro, which prepares the arguments of the system call (puts them into the processor registers in the correct order), executes the `syscall` instruction and returns the result:

```C
# define INTERNAL_SYSCALL_NCS(name, err, nr, args...)			\
  ({									\
    unsigned long int resultvar;					\
    LOAD_ARGS_##nr (args)						\
    LOAD_REGS_##nr							\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (name) ASM_ARGS_##nr : "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
    (long int) resultvar; })
```

The `LOAD_ARGS_##nr` macro calls the `LOAD_ARGS_N` macro where `N` is the number of arguments of the system call. In our case, it will be the `LOAD_ARGS_2` macro. Ultimately all of these macros expand to the following:

```C
# define LOAD_REGS_TYPES_1(t1, a1)					\
	register t1 _a1 asm ("rdi") = __arg1;				\
	LOAD_REGS_0

# define LOAD_REGS_TYPES_2(t1, a1, t2, a2)				\
	register t2 _a2 asm ("rsi") = __arg2;				\
	LOAD_REGS_TYPES_1(t1, a1)
...
...
...
```
After the `syscall` instruction is executed, a [context switch](https://en.wikipedia.org/wiki/Context_switch) to kernel mode occurs and the kernel transfers execution to the system call handler. The system call handler for the `nanosleep` system call is located in the [kernel/time/hrtimer.c](https://github.com/torvalds/linux/blob/master/kernel/time/hrtimer.c) source code file and defined with the `SYSCALL_DEFINE2` macro helper:

```C
SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
		struct timespec __user *, rmtp)
{
	struct timespec tu;

	if (copy_from_user(&tu, rqtp, sizeof(tu)))
		return -EFAULT;

	if (!timespec_valid(&tu))
		return -EINVAL;

	return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}
```

You can read more about the `SYSCALL_DEFINE2` macro in the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about system calls. If we look at the implementation of the `nanosleep` system call, first of all we will see that it starts from the call of the `copy_from_user` function. This function copies the given data from userspace to kernelspace. In our case we copy the timeout value into a kernelspace `timespec` structure and check that the given `timespec` is valid by calling the `timespec_valid` function:

```C
static inline bool timespec_valid(const struct timespec *ts)
{
	if (ts->tv_sec < 0)
		return false;
	if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
		return false;
	return true;
}
```
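The same check can be exercised in userspace with an ordinary copy of the function (renamed here to make clear it is a local copy for experimentation, not something exported by a kernel header):

```c
#include <stdbool.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Userspace copy of the kernel's validity check: seconds must be
 * non-negative and nanoseconds must stay below one second. The
 * unsigned cast also rejects negative tv_nsec values. */
static bool timespec_valid_user(const struct timespec *ts)
{
    if (ts->tv_sec < 0)
        return false;
    if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
        return false;
    return true;
}
```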
which just checks that the given `timespec` does not represent a date before `1970` and that the nanoseconds value does not overflow `1` second. The `nanosleep` function ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a [timer](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html) and calls the `do_nanosleep` function, which does the main job for us. This function runs the following loop:
```C
do {
	set_current_state(TASK_INTERRUPTIBLE);
	hrtimer_start_expires(&t->timer, mode);

	if (likely(t->task))
		freezable_schedule();

} while (t->task && !signal_pending(current));

__set_current_state(TASK_RUNNING);
return t->task == NULL;
```

which puts the current task to sleep for the duration of the timeout. After we set the `TASK_INTERRUPTIBLE` flag for the current task, the `hrtimer_start_expires` function starts the given high-resolution timer on the current processor. When the high-resolution timer expires, the task is set runnable again.

That's all.
Conclusion
--------------------------------------------------------------------------------

This is the end of the seventh part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we saw [x86_64](https://en.wikipedia.org/wiki/X86-64) specific clock sources. As I wrote in the beginning, this part is the last part of this chapter. In this chapter we saw important time management related concepts like the `clocksource` and `clockevents` frameworks, the `jiffies` counter, etc. Of course this does not cover all of time management in the Linux kernel. Many parts of it are mostly related to scheduling, which we will see in another chapter.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [system call](https://en.wikipedia.org/wiki/System_call)
* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [standard library](https://en.wikipedia.org/wiki/Standard_library)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [real time clock](https://en.wikipedia.org/wiki/Real-time_clock)
* [NTP](https://en.wikipedia.org/wiki/Network_Time_Protocol)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [register](https://en.wikipedia.org/wiki/Processor_register)
* [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf)
* [context switch](https://en.wikipedia.org/wiki/Context_switch)
* [Introduction to timers in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html)
* [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime)
* [system calls table for x86_64](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html)
@@ -21,3 +21,7 @@
[@hailincai](https://github.com/hailincai)

[@zmj1316](https://github.com/zmj1316)

[@zhangyangjing](https://github.com/zhangyangjing)

[@huxq](https://github.com/huxq)

526
interrupts/interrupts-9.md
Normal file
@@ -0,0 +1,526 @@
|
||||
Interrupts and Interrupt Handling. Part 9.
================================================================================

Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)
--------------------------------------------------------------------------------

This is the ninth part of the [Interrupts and Interrupt Handling](https://www.gitbook.com/book/xinqiu/linux-insides-cn/content/interrupts/index.html) chapter. In the [previous part](https://www.gitbook.com/book/xinqiu/linux-insides-cn/content/interrupts/interrupts-8.html) we analyzed the implementation of `init_IRQ` from the source file [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irqinit.c). In this part we will continue to dive into the initialization of external hardware interrupts.
Interrupt handling has a couple of peculiarities; the two most important are:

* an interrupt handler must finish quickly;
* sometimes an interrupt handler must do a large amount of work.

As you can guess, it is almost impossible to satisfy both at once, which is why interrupt handling used to be split into two parts:

* the top half;
* the bottom half.

The `bottom half` was once a concrete mechanism by which the Linux kernel deferred interrupt work, but that is no longer literally true; today the term survives as a legacy name for all of the kernel's deferred-interrupt mechanisms. As you know, interrupt-handling code runs in interrupt context, where further interrupts are disabled, so a handler must not run for long. Yet some interrupts have a lot of work to do, which is why interrupt handling is sometimes split in two. In the first part the handler does only the small amount of critical work, submits the second part to the kernel for later scheduling, and returns. Later, when the system is relatively idle and the processor context allows interrupt processing, the deferred second part runs the remaining work.
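The split above can be sketched in plain userspace C. This is only an illustration with invented names (the ring buffer and both functions are not kernel API): the fast "top half" merely records the event and marks work as pending, while the slow "bottom half" runs later and does the bulk of the processing.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 16

/* Illustrative sketch, not kernel code. */
static int ring[RING_SIZE];
static int ring_head, ring_tail;
static bool pending;

/* "Top half": as short as possible -- store the data, set a flag, return. */
void fast_interrupt_handler(int data)
{
    ring[ring_head++ % RING_SIZE] = data;
    pending = true;
}

/* "Bottom half": invoked later, outside interrupt context. */
int slow_bottom_half(void)
{
    int processed = 0;

    if (!pending)
        return 0;
    while (ring_tail < ring_head) {
        ring_tail++;        /* heavy per-event processing would go here */
        processed++;
    }
    pending = false;
    return processed;
}
```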
There are three current ways to implement deferred interrupts:

* `softirqs`;
* `tasklets`;
* `workqueues`.

In this part we will look at all three in detail. Now it is time to dive in.
Softirqs
----------------------------------------------------------------------------------

With the kernel's support for parallel processing, and for performance reasons, all new bottom-half implementations are built on a kernel thread called `ksoftirqd` (discussed in detail below). Each processor has its own such thread, named `ksoftirqd/n`, where `n` is the processor number. We can see them with the `systemd-cgls` command:

```
$ systemd-cgls -k | grep ksoft
├─   3 [ksoftirqd/0]
├─  13 [ksoftirqd/1]
├─  18 [ksoftirqd/2]
├─  23 [ksoftirqd/3]
├─  28 [ksoftirqd/4]
├─  33 [ksoftirqd/5]
├─  38 [ksoftirqd/6]
├─  43 [ksoftirqd/7]
```

These threads are spawned by the `spawn_ksoftirqd` function. As we can see, this function is called as an early [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/index.html):
```C
early_initcall(spawn_ksoftirqd);
```

Softirqs are determined statically at compile time of the Linux kernel. The `open_softirq` function takes care of `softirq` initialization; it is defined in [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c):

```C
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
	softirq_vec[nr].action = action;
}
```

This function takes two parameters:

* the index of the `softirq_vec` array;
* a pointer to the softirq handler function.

Let's look at the `softirq_vec` array first:
```C
static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
```

It is defined in the same source file. The `softirq_vec` array holds `NR_SOFTIRQS` (that is, 10) `softirq_action` entries, one for each `softirq` type. The current version of the Linux kernel defines ten softirq vectors: two for tasklets, two for networking, two for the block layer, two for timers, and one each for the scheduler and RCU. All of them are defined in an enum:

```C
enum
{
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	BLOCK_IOPOLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,
	NR_SOFTIRQS
};
```

The names of these softirqs are defined in the following array:
```C
const char * const softirq_to_name[NR_SOFTIRQS] = {
	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
	"TASKLET", "SCHED", "HRTIMER", "RCU"
};
```

We can also see them in the output of `/proc/softirqs`:

```
~$ cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
          HI:          5          0          0          0          0          0          0          0
       TIMER:     332519     310498     289555     272913     282535     279467     282895     270979
      NET_TX:       2320          0          0          2          1          1          0          0
      NET_RX:     270221        225        338        281        311        262        430        265
       BLOCK:     134282         32         40         10         12          7          8          8
BLOCK_IOPOLL:          0          0          0          0          0          0          0          0
     TASKLET:     196835          2          3          0          0          0          0          0
       SCHED:     161852     146745     129539     126064     127998     128014     120243     117391
     HRTIMER:          0          0          0          0          0          0          0          0
         RCU:     337707     289397     251874     239796     254377     254898     267497     256624
```

As we can see, the `softirq_vec` array has type `softirq_action`. This is an important data structure of the softirq mechanism; it has a single member, a pointer to the interrupt handler function:
```C
struct softirq_action
{
	void (*action)(struct softirq_action *);
};
```

So now we can see that the `open_softirq` function actually fills the `softirq_vec` array with `softirq_action` entries. A deferred handler registered with `open_softirq` is activated by a call to `raise_softirq`. This function takes only one parameter -- the softirq index `nr`. Let's look at its implementation:
```C
void raise_softirq(unsigned int nr)
{
	unsigned long flags;

	local_irq_save(flags);
	raise_softirq_irqoff(nr);
	local_irq_restore(flags);
}
```

Here we see the call to `raise_softirq_irqoff` between the `local_irq_save` and `local_irq_restore` macros. The `local_irq_save` macro, defined in the [include/linux/irqflags.h](https://github.com/torvalds/linux/blob/master/include/linux/irqflags.h) header file, saves the [IF](https://en.wikipedia.org/wiki/Interrupt_flag) flag of the [eflags](https://en.wikipedia.org/wiki/FLAGS_register) register and disables interrupts on the local processor. The `local_irq_restore` macro, defined in the same header, does exactly the opposite: it restores the saved interrupt flag and re-enables interrupts. Interrupts are disabled here because the `softirq` handling that is about to run executes in interrupt context.

The `raise_softirq_irqoff` function marks the softirq given by `nr` as pending on the current processor (in `__softirq_pending`). It does this with:

```C
__raise_softirq_irqoff(nr);
```

Then it checks the result of the `in_interrupt` function, which reads the `irq_count` value. We learned in the first [part](https://www.gitbook.com/book/xinqiu/linux-insides-cn/content/interrupts/interrupts-1.html) of this chapter that it tells us whether a cpu is in interrupt context. If we are in interrupt context, we exit `raise_softirq_irqoff`, restore the `IF` flag, and re-enable interrupts on the local processor. If we are not in interrupt context, the `wakeup_softirqd` function is called:
```C
if (!in_interrupt())
	wakeup_softirqd();
```

The `wakeup_softirqd` function wakes the `ksoftirqd` kernel thread of the local processor:

```C
static void wakeup_softirqd(void)
{
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

	if (tsk && tsk->state != TASK_RUNNING)
		wake_up_process(tsk);
}
```

Each `ksoftirqd` kernel thread runs the `run_ksoftirqd` function, which checks whether there are deferred interrupts to handle and, if so, calls `__do_softirq`. `__do_softirq` reads the local processor's `__softirq_pending` bitmap and runs the handler of every pending softirq. While a deferred function runs, new softirqs may be raised; this could keep user-space code from running for a long time while `__do_softirq` handles one deferred interrupt after another. To solve this problem, the maximum time spent handling deferred interrupts is limited:
```C
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
...
...
...
restart:
while ((softirq_bit = ffs(pending))) {
	...
	h->action(h);
	...
}
...
...
...
pending = local_softirq_pending();
if (pending) {
	if (time_before(jiffies, end) && !need_resched() &&
	    --max_restart)
		goto restart;
}
...
```

Besides this periodic check for pending deferred interrupts, the kernel also checks at certain key points. One of the main ones is the `do_IRQ` function defined in [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c), which provides the main opportunity in the Linux kernel for deferred interrupts to run. As it finishes handling an interrupt, it calls the `exiting_irq` function defined in [arch/x86/include/asm/apic.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/apic.h), which in turn calls `irq_exit`. The `irq_exit` function checks whether there are pending deferred interrupts in the current processor context and, if so, calls `invoke_softirq`:

```C
if (!in_interrupt() && local_softirq_pending())
	invoke_softirq();
```

This reaches the `__do_softirq` function described above. So every `softirq` goes through the following stages: a softirq is registered with the `open_softirq` function and activated by marking it pending with `raise_softirq`; then all pending softirqs are scheduled the next time the Linux kernel runs its periodic softirq check, and the handlers for those softirq types run.

As we have seen, softirqs are statically allocated, which is a problem for kernel modules loaded later. `Tasklets`, which are built on top of softirqs, solve that problem.
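The register/raise/dispatch lifecycle above can be condensed into a small userspace program. The names deliberately mirror the kernel's (`open_softirq`, `raise_softirq`, `softirq_vec`), but this is only a single-threaded illustration of the idea, not kernel code -- in particular, `pending` stands in for the per-cpu `__softirq_pending` bitmap:

```c
#include <assert.h>

#define NR_SOFTIRQS 10

struct softirq_action {
    void (*action)(struct softirq_action *);
};

static struct softirq_action softirq_vec[NR_SOFTIRQS];
static unsigned int pending;                  /* stand-in for __softirq_pending */

/* Registration: fill one slot of the dispatch table, as in the kernel. */
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}

/* Activation: mark the softirq as pending; raising twice sets one bit. */
void raise_softirq(int nr)
{
    pending |= 1U << nr;
}

/* Dispatch, simplified __do_softirq: clear each pending bit, run its handler. */
void do_softirq(void)
{
    for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
        if (pending & (1U << nr)) {
            pending &= ~(1U << nr);
            softirq_vec[nr].action(&softirq_vec[nr]);
        }
    }
}

static int timer_runs;
static void timer_action(struct softirq_action *a) { (void)a; timer_runs++; }
```

Note that raising the same softirq twice before dispatch still runs the handler only once -- a pending bit carries no count, which is one reason handlers must tolerate coalescing.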
Tasklets
--------------------------------------------------------------------------------

If you read the softirq-related code in the Linux kernel source, you will notice that it is used very rarely. The main way the kernel implements deferred interrupts is with `tasklets`. As noted above, `tasklets` are built on top of `softirq` interrupts -- specifically, on these two softirqs:

* `TASKLET_SOFTIRQ`;
* `HI_SOFTIRQ`.

In short, `tasklets` are softirqs that can be allocated and initialized at runtime, and, unlike softirqs, tasklets of the same type never run on different processors at the same time. We already know something about softirqs; of course the text above cannot cover all the details, but now we can study them step by step in depth by reading the code directly. Let's return to the implementation of the `softirq_init` function that we discussed at the beginning. It is defined in [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c):

```C
void __init softirq_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		per_cpu(tasklet_vec, cpu).tail =
			&per_cpu(tasklet_vec, cpu).head;
		per_cpu(tasklet_hi_vec, cpu).tail =
			&per_cpu(tasklet_hi_vec, cpu).head;
	}

	open_softirq(TASKLET_SOFTIRQ, tasklet_action);
	open_softirq(HI_SOFTIRQ, tasklet_hi_action);
}
```

We can see that the function defines an integer variable `cpu` at its beginning. It is then passed to the `for_each_possible_cpu` macro to iterate over all processors in the system. If `possible_cpu` is a new term for you, you can read more about it in the [CPU masks](https://www.gitbook.com/book/xinqiu/linux-insides-cn/content/Concepts/cpumask.html) chapter. In short, `possible_cpu` is the set of processors that may be plugged in at any time during the life of the system. All `possible` processors are stored in the `cpu_possible_bits` bitmap, whose definition you can find in [kernel/cpu.c](https://github.com/torvalds/linux/blob/master/kernel/cpu.c):
```C
static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
...
...
...
const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
```

Ok, we defined the integer variable `cpu` and walked over all processors with the `for_each_possible_cpu` macro, initializing two `per-cpu` variables:

* `tasklet_vec`;
* `tasklet_hi_vec`.

These two `per-cpu` variables are defined in the same [source file](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) as the `softirq_init` function, and both have type `tasklet_head`:

```C
static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);
```

The `tasklet_head` structure represents a list of `Tasklets`; it contains two members, head and tail:

```C
struct tasklet_head {
	struct tasklet_struct *head;
	struct tasklet_struct **tail;
};
```
The `tasklet_struct` type is defined in [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) and represents a single `Tasklet`. We have not met this word earlier in this book, so let's first try to understand what a `Tasklet` actually is. In fact, a `Tasklet` is one of the mechanisms for handling deferred interrupts. Let's look at the definition of `tasklet_struct`:

```C
struct tasklet_struct
{
	struct tasklet_struct *next;
	unsigned long state;
	atomic_t count;
	void (*func)(unsigned long);
	unsigned long data;
};
```

This structure contains the following five members:

* the next `Tasklet` in the scheduling queue;
* the state of this `Tasklet`;
* whether this `Tasklet` is currently active;
* the callback function of this `Tasklet`;
* the argument of the callback function.

In the code above, the `softirq_init` function initialized the two tasklet arrays `tasklet_vec` and `tasklet_hi_vec`: normal-priority tasklets are stored in the first, high-priority tasklets in the second. After the initialization we see that, at the end of `softirq_init` in [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c), `open_softirq` is called twice:
```C
open_softirq(TASKLET_SOFTIRQ, tasklet_action);
open_softirq(HI_SOFTIRQ, tasklet_hi_action);
```

The main purpose of the `open_softirq` function is the initialization of a softirq, so let's look at how this is used here. Two softirq handlers are related to tasklets: `tasklet_action` and `tasklet_hi_action`. `tasklet_hi_action` is associated with `HI_SOFTIRQ`, and `tasklet_action` with `TASKLET_SOFTIRQ`.

The Linux kernel provides some API for manipulating `tasklets`. First of all there is the `tasklet_init` function, which takes a `tasklet_struct`, a handler function, and an extra argument, and uses them to initialize the given `tasklet_struct`:

```C
void tasklet_init(struct tasklet_struct *t,
                  void (*func)(unsigned long), unsigned long data)
{
	t->next = NULL;
	t->state = 0;
	atomic_set(&t->count, 0);
	t->func = func;
	t->data = data;
}
```

There are also two macros that initialize a tasklet statically:
```C
DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);
```

The Linux kernel provides three functions to mark a tasklet as ready to run:

```C
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule_first(struct tasklet_struct *t);
```

The first function schedules a tasklet with normal priority, the second with high priority, and the third with higher priority still. The implementations of all three are similar, so let's look only at the first one, `tasklet_schedule`:
```C
static inline void tasklet_schedule(struct tasklet_struct *t)
{
	if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
		__tasklet_schedule(t);
}

void __tasklet_schedule(struct tasklet_struct *t)
{
	unsigned long flags;

	local_irq_save(flags);
	t->next = NULL;
	*__this_cpu_read(tasklet_vec.tail) = t;
	__this_cpu_write(tasklet_vec.tail, &(t->next));
	raise_softirq_irqoff(TASKLET_SOFTIRQ);
	local_irq_restore(flags);
}
```

As we can see, it tests and sets the `TASKLET_STATE_SCHED` state of the given tasklet and then calls `__tasklet_schedule` on it. `__tasklet_schedule` looks very similar to the `raise_softirq` function we saw earlier. It first saves the interrupt flags and disables interrupts, then appends the new tasklet to `tasklet_vec` and calls the `raise_softirq_irqoff` function we already met. When the Linux kernel scheduler decides to run deferred functions, `tasklet_action` is invoked as the deferred function associated with `TASKLET_SOFTIRQ`, and `tasklet_hi_action` as the one associated with `HI_SOFTIRQ`. These functions are very similar because they differ in only one point: `tasklet_action` uses `tasklet_vec` while `tasklet_hi_action` uses `tasklet_hi_vec`.
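The `tasklet_vec` insertion performed by `__tasklet_schedule` relies on the head/tail layout of `tasklet_head`: `tail` always points at the `next` field that should receive the next element, so appending is O(1) with no special case for an empty list. Here is a minimal userspace sketch of that pattern (the names are invented for the example and are not kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for tasklet_struct and tasklet_head. */
struct tasklet {
    struct tasklet *next;
    void (*func)(unsigned long);
    unsigned long data;
};

struct tasklet_head {
    struct tasklet *head;
    struct tasklet **tail;
};

void tasklet_head_init(struct tasklet_head *h)
{
    h->head = NULL;
    h->tail = &h->head;   /* empty list: tail points at the head pointer itself */
}

void tasklet_append(struct tasklet_head *h, struct tasklet *t)
{
    t->next = NULL;
    *h->tail = t;         /* mirrors: *__this_cpu_read(tasklet_vec.tail) = t   */
    h->tail = &t->next;   /* mirrors: __this_cpu_write(tasklet_vec.tail, ...)  */
}
```

Because `tail` is a pointer-to-pointer, the same two stores work whether the list is empty (they write `head`) or not (they write the last element's `next`).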
Let's look at the implementation of the `tasklet_action` function:

```C
static void tasklet_action(struct softirq_action *a)
{
	struct tasklet_struct *list;

	local_irq_disable();
	list = __this_cpu_read(tasklet_vec.head);
	__this_cpu_write(tasklet_vec.head, NULL);
	__this_cpu_write(tasklet_vec.tail, this_cpu_ptr(&tasklet_vec.head));
	local_irq_enable();

	while (list) {
		struct tasklet_struct *t = list;

		list = list->next;

		if (tasklet_trylock(t)) {
			t->func(t->data);
			tasklet_unlock(t);
		}
		...
		...
		...
	}
}
```

At the beginning of `tasklet_action` we disable interrupts on the local processor with the `local_irq_disable` macro (you can read about this macro in the [second part](https://www.gitbook.com/book/xinqiu/linux-insides-cn/content/interrupts/interrupts-2.html) of this chapter). Next we take the local processor's list of normal-priority tasklets and reset it to `NULL`, because all of these tasklets are about to be executed. We then re-enable interrupts on the local processor and loop over the tasklet list; on every iteration we call the `tasklet_trylock` function on the current tasklet, which updates its state to `TASKLET_STATE_RUN`:

```C
static inline int tasklet_trylock(struct tasklet_struct *t)
{
	return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
}
```

If this operation succeeded, the tasklet's handler (the one set in `tasklet_init`) is executed, and afterwards the `tasklet_unlock` function clears its `TASKLET_STATE_RUN` state.

In general, that is the whole idea behind a `tasklet`. Of course this does not cover everything about `tasklets`, but I think it is a good starting point for further study.

`tasklets` are a [widely](http://lxr.free-electrons.com/ident?i=tasklet_init) used concept in the Linux kernel, but as I wrote at the beginning of this part there is a third mechanism for deferred interrupts -- `workqueues`. Let's now see what kind of mechanism that is.
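The `TASKLET_STATE_RUN` protocol implemented by `tasklet_trylock`/`tasklet_unlock` -- atomically claiming a "running" bit so the same tasklet cannot execute on two processors at once -- can be modeled in userspace with an atomic test-and-set. In this sketch the GCC/Clang `__sync` builtins stand in for the kernel's `test_and_set_bit`; it is an illustration of the protocol, not the kernel implementation:

```c
#include <assert.h>

enum {
    TASKLET_STATE_SCHED = 0,   /* tasklet is scheduled for execution */
    TASKLET_STATE_RUN   = 1,   /* tasklet is running */
};

/* Atomically set bit nr and return its previous value. */
static int test_and_set_bit(int nr, unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    return (__sync_fetch_and_or(addr, mask) & mask) != 0;
}

static void clear_bit(int nr, unsigned long *addr)
{
    __sync_fetch_and_and(addr, ~(1UL << nr));
}

/* Succeeds (returns 1) only for the first caller; a second caller
 * sees the RUN bit already set and backs off. */
int tasklet_trylock(unsigned long *state)
{
    return !test_and_set_bit(TASKLET_STATE_RUN, state);
}

void tasklet_unlock(unsigned long *state)
{
    clear_bit(TASKLET_STATE_RUN, state);
}
```

A second CPU that loses the race simply skips the tasklet; in the kernel it is re-queued so the handler still runs exactly once at a time.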
Workqueues
--------------------------------------------------------------------------------

`Workqueues` are another concept for handling deferred functions. They are broadly similar to `tasklets`, but workqueue functions run in the context of a kernel process, while `tasklets` run in software interrupt context. This means that `workqueue` functions do not have to be atomic like `tasklets`. Tasklets always run on the processor from which they were submitted; workqueues do the same, but only by default. A `workqueue` is represented in the Linux kernel source ([kernel/workqueue.c](https://github.com/torvalds/linux/blob/master/kernel/workqueue.c)) by the following data structure:
```C
struct worker_pool {
	spinlock_t lock;
	int cpu;
	int node;
	int id;
	unsigned int flags;

	struct list_head worklist;
	int nr_workers;
...
...
...
```

I will not list all the members of this structure here, since there are very many of them; we only discuss the ones shown above.

In its most basic usage, a workqueue is an interface for creating kernel threads to handle work submitted to a queue. All of these kernel threads are called `worker threads`. The work items in a workqueue are represented by the `work_struct` structure defined in [include/linux/workqueue.h](https://github.com/torvalds/linux/blob/master/include/linux/workqueue.h) as follows:
```C
struct work_struct {
	atomic_long_t data;
	struct list_head entry;
	work_func_t func;
#ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
#endif
};
```

Two fields here are interesting: `func` -- the function that will be scheduled by the `workqueue`, and `data` -- the argument of that function. The Linux kernel provides per-cpu kernel threads called `kworker`:
```
systemd-cgls -k | grep kworker
├─    5 [kworker/0:0H]
├─   15 [kworker/1:0H]
├─   20 [kworker/2:0H]
├─   25 [kworker/3:0H]
├─   30 [kworker/4:0H]
...
...
...
```

These threads are used to schedule the deferred functions of workqueues (just as `ksoftirqd` does for `softirqs`). Besides them, we can also create a new worker thread for a `workqueue`. The Linux kernel provides the following macro to create a work item statically:
```C
#define DECLARE_WORK(n, f) \
	struct work_struct n = __WORK_INITIALIZER(n, f)
```

It takes two parameters: the name of the work item and the work item's function. We can also create work at runtime:

```C
#define INIT_WORK(_work, _func)       \
	__INIT_WORK((_work), (_func), 0)

#define __INIT_WORK(_work, _func, _onstack)             \
	do {                                                \
		__init_work((_work), _onstack);                 \
		(_work)->data = (atomic_long_t) WORK_DATA_INIT();   \
		INIT_LIST_HEAD(&(_work)->entry);                \
		(_work)->func = (_func);                        \
	} while (0)
```

This macro takes a `work_struct` that is the work item to create, together with the function to be scheduled in it. After a `work` has been created with either of these two macros, we need to put it onto a `workqueue`. This can be done with the `queue_work` or `queue_delayed_work` functions:
```C
static inline bool queue_work(struct workqueue_struct *wq,
                              struct work_struct *work)
{
	return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}
```

The `queue_work` function just calls `queue_work_on`, which specifies the processor to run on. Note that we pass the `WORK_CPU_UNBOUND` parameter to `queue_work_on`; it is part of an enum, defined in [include/linux/workqueue.h](https://github.com/torvalds/linux/blob/master/include/linux/workqueue.h), that identifies the processor a work item is bound to. The `queue_work_on` function tests and sets the `WORK_STRUCT_PENDING_BIT` flag of the given `work` and then calls `__queue_work` with the workqueue and the work item:
```C
bool queue_work_on(int cpu, struct workqueue_struct *wq,
                   struct work_struct *work)
{
	bool ret = false;
	...
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		__queue_work(cpu, wq, work);
		ret = true;
	}
	...
	return ret;
}
```

The `__queue_work` function gets to the `work pool`. Yes, the `work pool`, not the `workqueue`. Actually, `works` are not placed in a `workqueue` at all; they are placed in a `work pool`, represented in the Linux kernel by the `worker_pool` structure. As mentioned above, the `pwqs` member of the `workqueue_struct` structure is a list of `worker_pools`. When we create a `workqueue`, it creates a `worker_pool` for every processor. Each `pool_workqueue` is associated with a `worker_pool` allocated on the same processor and at the same priority; through them the `workqueue` interacts with its `worker_pools`. Inside `__queue_work`, `cpu` is set to the current processor with `raw_smp_processor_id` (you can find more information about it in the [fourth part](https://www.gitbook.com/book/xinqiu/linux-insides-cn/content/Initialization/linux-initialization-4.html) of the initialization chapter), the `pool_workqueue` corresponding to the given `work_struct` is looked up, and the `work` is inserted into the `workqueue`:
```C
static void __queue_work(int cpu, struct workqueue_struct *wq,
                         struct work_struct *work)
{
...
...
...
	if (req_cpu == WORK_CPU_UNBOUND)
		cpu = raw_smp_processor_id();

	if (!(wq->flags & WQ_UNBOUND))
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
...
...
...
	insert_work(pwq, work, worklist, work_flags);
```

Now that we can create `works` and `workqueues`, we need to understand when they are executed. As mentioned above, all `works` are executed by a kernel thread. When this kernel thread is scheduled, it starts executing the `works` of its `workqueue`. Each workqueue kernel thread runs a loop in the `worker_thread` function. This thread does many different things, some of which are similar to what we saw earlier in this part. As it starts executing, it removes every `work_struct` (or `work`) from its `workqueue`.
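The queue-then-drain behaviour described above can be condensed into a single-threaded userspace sketch: work items carry a function pointer and an argument, `queue_work_sketch` appends them, and one pass of the "worker" runs everything in FIFO order. All names here are invented for the example; in the kernel the drain loop lives in `worker_thread` inside a `kworker` thread, with locking and much more machinery:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*work_func_t)(unsigned long);

/* Simplified work item: a function and its argument, linked into a list. */
struct work_struct {
    struct work_struct *next;
    work_func_t func;
    unsigned long data;
};

static struct work_struct *wq_head;
static struct work_struct **wq_tail = &wq_head;

/* Append a work item to the queue (tail-pointer insertion). */
void queue_work_sketch(struct work_struct *work)
{
    work->next = NULL;
    *wq_tail = work;
    wq_tail = &work->next;
}

/* One pass of the worker loop: pop and run every queued item, FIFO. */
int worker_run_sketch(void)
{
    int done = 0;

    while (wq_head) {
        struct work_struct *w = wq_head;

        wq_head = w->next;
        w->func(w->data);
        done++;
    }
    wq_tail = &wq_head;
    return done;
}

static unsigned long last_data;
static void record_work(unsigned long data) { last_data = data; }
```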
Conclusion
--------------------------------------------------------------------------------

This is the end of the ninth part of the [Interrupts and Interrupt Handling](https://www.gitbook.com/book/xinqiu/linux-insides-cn/content/interrupts/index.html) chapter, in which we continued to discuss external hardware interrupts. In previous parts we saw the initialization of the `IRQs` and the `irq_desc` structure; in this part we looked at the three concepts used for deferred functions: `softirqs`, `tasklets`, and `workqueues`.

The next part will be the last one of the `Interrupts and Interrupt Handling` chapter. There we will look at a real hardware driver and try to learn how it works with the interrupts subsystem.

If you have any questions or suggestions, write me a comment or ping me at [Twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send a PR to [linux-insides](https://github.com/0xAX/linux-insides). (Translator's note: for translation issues please send a PR to [linux-insides-cn](https://www.gitbook.com/book/xinqiu/linux-insides-cn).)**
Links
--------------------------------------------------------------------------------

* [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/index.html)
* [IF](https://en.wikipedia.org/wiki/Interrupt_flag)
* [eflags](https://en.wikipedia.org/wiki/FLAGS_register)
* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [Workqueue](https://github.com/torvalds/linux/blob/master/Documentation/workqueue.txt)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-8.html)
Block a user