mirror of
https://github.com/MintCN/linux-insides-zh.git
synced 2026-05-04 06:14:17 +08:00
Add file in English version and change the corresponding README.md
This commit is contained in:
547
interrupts/interrupts-2.md
Normal file
@@ -0,0 +1,547 @@
|
||||
Interrupts and Interrupt Handling. Part 2.
|
||||
================================================================================
|
||||
|
||||
Start to dive into interrupt and exceptions handling in the Linux kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) mostly described the theory behind interrupts and exception handling. In this part we start to dive directly into the Linux kernel source code. As in other chapters, we begin at the very early places. We will not start from the earliest [code lines](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L292) as we did, for example, in the [Linux kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, but from the earliest code that is related to interrupts and exceptions. In this part we will try to go through all the interrupt and exception related code that we can find in the Linux kernel source code.
|
||||
|
||||
If you've read the previous parts, you may remember that the earliest place in the `x86_64` architecture-specific Linux kernel source code that is related to interrupts is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function, with the call to `setup_idt`:
|
||||
|
||||
```C
|
||||
void go_to_protected_mode(void)
|
||||
{
|
||||
...
|
||||
setup_idt();
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
The `setup_idt` function is defined in the same source code file as the `go_to_protected_mode` function and just loads the address of a `NULL` Interrupt Descriptor Table:
|
||||
|
||||
```C
|
||||
static void setup_idt(void)
|
||||
{
|
||||
static const struct gdt_ptr null_idt = {0, 0};
|
||||
asm volatile("lidtl %0" : : "m" (null_idt));
|
||||
}
|
||||
```
|
||||
|
||||
where `gdt_ptr` describes the contents of the special 48-bit `GDTR` register, which must contain the limit and the base address of the `Global Descriptor Table`:
|
||||
|
||||
```C
|
||||
struct gdt_ptr {
|
||||
u16 len;
|
||||
u32 ptr;
|
||||
} __attribute__((packed));
|
||||
```
|
||||
|
||||
Of course, in our case `gdt_ptr` does not describe the `GDTR` register but the `IDTR` register, since we are setting the `Interrupt Descriptor Table`. You will not find an `idt_ptr` structure: if it existed in the Linux kernel source code, it would be the same as `gdt_ptr` under a different name, and there is no sense in having two structures which differ only by name. Note that we do not fill the `Interrupt Descriptor Table` with entries here, because it is too early to handle any interrupts or exceptions at this point. That's why we just load a `NULL` `IDT`.
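To make the "48-bit" claim concrete, here is a minimal userspace sketch of the same layout. The fixed-width types stand in for the kernel's `u16` and `u32`, and `descriptor_pointer_bits` is a hypothetical helper added only for illustration, not something from the kernel:

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the layout shown above; uint16_t/uint32_t stand in
 * for the kernel's u16/u32 types. */
struct gdt_ptr {
    uint16_t len;   /* table limit */
    uint32_t ptr;   /* 32-bit linear base address */
} __attribute__((packed));

/* Without `packed` the compiler would insert two bytes of padding after
 * `len`, and the structure would no longer be a 6-byte operand. */
_Static_assert(sizeof(struct gdt_ptr) == 6, "descriptor pointer must be 48 bits");
_Static_assert(offsetof(struct gdt_ptr, ptr) == 2, "base must follow the 16-bit limit");

/* Hypothetical helper: size of the operand in bits. */
unsigned int descriptor_pointer_bits(void)
{
    return (unsigned int)sizeof(struct gdt_ptr) * 8;
}
```

The `{0, 0}` initializer used in `setup_idt` therefore describes a table with a zero limit at base address zero.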
|
||||
|
||||
After the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), the [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff, we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S). You can read more about it in the [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes the transition to protected mode.
|
||||
|
||||
We already know from the earliest parts that the entry to protected mode is located in `boot_params.hdr.code32_start`, and you can see that we pass the protected mode entry and `boot_params` to `protected_mode_jump` at the end of [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c):
|
||||
|
||||
```C
|
||||
protected_mode_jump(boot_params.hdr.code32_start,
|
||||
(u32)&boot_params + (ds() << 4));
|
||||
```
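The second argument, `(u32)&boot_params + (ds() << 4)`, converts a real-mode `segment:offset` pair into a linear address. A minimal sketch of that arithmetic follows; `real_mode_linear` is a hypothetical name used only here:

```c
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset.
 * This mirrors the (ds() << 4) + offset expression above. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

For example, with a data segment of `0x1000` and an offset of `0x0200`, the linear address is `0x10200`. Protected-mode code needs this linear form because it no longer addresses memory through real-mode segment values.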
|
||||
|
||||
The `protected_mode_jump` is defined in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S) and gets these two parameters in the `ax` and `dx` registers using one of the [8086](http://en.wikipedia.org/wiki/Intel_8086) calling [conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions):
|
||||
|
||||
```assembly
|
||||
GLOBAL(protected_mode_jump)
|
||||
...
|
||||
...
|
||||
...
|
||||
.byte 0x66, 0xea # ljmpl opcode
|
||||
2: .long in_pm32 # offset
|
||||
.word __BOOT_CS # segment
|
||||
...
|
||||
...
|
||||
...
|
||||
ENDPROC(protected_mode_jump)
|
||||
```
|
||||
|
||||
where `in_pm32` contains a jump to the 32-bit entry point:
|
||||
|
||||
```assembly
|
||||
GLOBAL(in_pm32)
|
||||
...
|
||||
...
|
||||
jmpl *%eax // %eax contains address of the `startup_32`
|
||||
...
|
||||
...
|
||||
ENDPROC(in_pm32)
|
||||
```
|
||||
|
||||
As you can remember the 32-bit entry point is in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly file, although it contains `_64` in its name. We can see the two similar files in the `arch/x86/boot/compressed` directory:
|
||||
|
||||
* `arch/x86/boot/compressed/head_32.S`;
|
||||
* `arch/x86/boot/compressed/head_64.S`.
|
||||
|
||||
But the 32-bit mode entry point is the second file in our case. The first file is not even compiled for `x86_64`. Let's look at the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile):
|
||||
|
||||
```
|
||||
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
We can see here that `head_*` depends on the `$(BITS)` variable which depends on the architecture. You can find it in the [arch/x86/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/Makefile):
|
||||
|
||||
```
|
||||
ifeq ($(CONFIG_X86_32),y)
|
||||
...
|
||||
BITS := 32
|
||||
else
|
||||
BITS := 64
|
||||
...
|
||||
endif
|
||||
```
|
||||
|
||||
Now that we have jumped to `startup_32` from [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S), we will not find anything related to interrupt handling here. The `startup_32` code makes preparations before the transition into [long mode](http://en.wikipedia.org/wiki/Long_mode) and jumps directly into it. The `long mode` entry is located in `startup_64`, which makes preparations before the [kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in `decompress_kernel` from [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c). After the kernel is decompressed, we jump to `startup_64` from [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). In this `startup_64` we start to build identity-mapped pages. After we have built the identity-mapped pages, checked the [NX](http://en.wikipedia.org/wiki/NX_bit) bit, set up the `Extended Feature Enable Register` (see in links), and updated the early `Global Descriptor Table` with the `lgdt` instruction, we need to set up the `gs` register with the following code:
|
||||
|
||||
```assembly
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
movl initial_gs(%rip),%eax
|
||||
movl initial_gs+4(%rip),%edx
|
||||
wrmsr
|
||||
```
|
||||
|
||||
We already saw this code in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html). First of all, pay attention to the last instruction, `wrmsr`. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE`, which is declared in [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h) and looks like:
|
||||
|
||||
```C
|
||||
#define MSR_GS_BASE 0xc0000101
|
||||
```
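Since `wrmsr` takes its 64-bit operand as two 32-bit halves, the assembly loads `initial_gs` into `eax` and `initial_gs+4` into `edx`. The following sketch shows the same split in plain C; the structure and function names are hypothetical and exist only for this illustration:

```c
#include <stdint.h>

/* Hypothetical container for the two register halves used by wrmsr. */
struct msr_value {
    uint32_t eax;   /* low 32 bits  */
    uint32_t edx;   /* high 32 bits */
};

/* Split a 64-bit value the way the two movl instructions do. */
struct msr_value split_for_wrmsr(uint64_t value)
{
    struct msr_value v;

    v.eax = (uint32_t)(value & 0xffffffffu);
    v.edx = (uint32_t)(value >> 32);
    return v;
}

/* Reassemble edx:eax into the 64-bit value the MSR receives. */
uint64_t join_edx_eax(struct msr_value v)
{
    return ((uint64_t)v.edx << 32) | v.eax;
}
```

Because `x86_64` is little-endian, the low half of a `.quad` lives at the lower address, which is why `initial_gs` goes to `eax` and `initial_gs+4` goes to `edx`.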
|
||||
|
||||
From this we can understand that `MSR_GS_BASE` defines the number of the `model specific register`. Since the base addresses of the `cs`, `ds`, `es`, and `ss` segments are not used in 64-bit mode, those fields are ignored. But we can still access memory through the `fs` and `gs` registers. The model specific register provides a `back door` to the hidden parts of these segment registers and allows us to use a 64-bit base address for the segments addressed by `fs` and `gs`. So `MSR_GS_BASE` is the hidden part, and it is mapped onto the `GS.base` field. Let's look at `initial_gs`:
|
||||
|
||||
```assembly
|
||||
GLOBAL(initial_gs)
|
||||
.quad INIT_PER_CPU_VAR(irq_stack_union)
|
||||
```
|
||||
|
||||
We pass `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just concatenates the `init_per_cpu__` prefix with the given symbol. In our case we will get the `init_per_cpu__irq_stack_union` symbol. Let's look at the [linker](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) script. There we can see following definition:
|
||||
|
||||
```
|
||||
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
|
||||
INIT_PER_CPU(irq_stack_union);
|
||||
```
|
||||
|
||||
It tells us that the address of `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where `irq_stack_union` and `__per_cpu_load` are defined and what they mean. The first, `irq_stack_union`, is declared in [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro, which expands to a use of the `init_per_cpu_var` macro:
|
||||
|
||||
```C
|
||||
DECLARE_INIT_PER_CPU(irq_stack_union);
|
||||
|
||||
#define DECLARE_INIT_PER_CPU(var) \
|
||||
extern typeof(per_cpu_var(var)) init_per_cpu_var(var)
|
||||
|
||||
#define init_per_cpu_var(var) init_per_cpu__##var
|
||||
```
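The `##` operator in `init_per_cpu_var` simply pastes two tokens together. A small userspace sketch can make the resulting symbol name visible; the `STRINGIFY` helpers and both functions are hypothetical additions for this illustration, and only `init_per_cpu_var` is copied from the text above:

```c
#include <string.h>

/* Copied from the snippet above. */
#define init_per_cpu_var(var) init_per_cpu__##var

/* Hypothetical helpers: two levels are needed so that the argument is
 * macro-expanded before it is turned into a string literal. */
#define STRINGIFY_RAW(x) #x
#define STRINGIFY(x) STRINGIFY_RAW(x)

/* Returns the symbol name produced by the token pasting. */
const char *init_symbol_name(void)
{
    return STRINGIFY(init_per_cpu_var(irq_stack_union));
}

/* Returns 1 when the pasted name equals `expected`. */
int symbol_matches(const char *expected)
{
    return strcmp(init_symbol_name(), expected) == 0;
}
```

Here `init_symbol_name()` yields the string `init_per_cpu__irq_stack_union`, which is the same name the `INIT_PER_CPU` macro defines in the linker script.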
|
||||
|
||||
If we expand all macros we will get the same `init_per_cpu__irq_stack_union` as we got after expanding the `INIT_PER_CPU` macro, but you can note that it is not just a symbol, but a variable. Let's look at the `typeof(per_cpu_var(var))` expression. Our `var` is `irq_stack_union` and the `per_cpu_var` macro is defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h):
|
||||
|
||||
```C
|
||||
#define PER_CPU_VAR(var) %__percpu_seg:var
|
||||
```
|
||||
|
||||
where:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_64
|
||||
#define __percpu_seg gs
|
||||
#endif
|
||||
```
|
||||
|
||||
So, we are accessing `gs:irq_stack_union` and getting its type, which is `union irq_stack_union`. Ok, we have declared the first variable and know its address; now let's look at the second symbol, `__per_cpu_load`. A couple of `per-cpu` variables are located after this symbol. `__per_cpu_load` is declared in [include/asm-generic/sections.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/sections.h):
|
||||
|
||||
```C
|
||||
extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
|
||||
```
|
||||
|
||||
and represents the base address of the `per-cpu` variables in the data area. So, we know the addresses of `irq_stack_union` and `__per_cpu_load`, and we know that `init_per_cpu__irq_stack_union` must be placed right at the start of the area that begins at `__per_cpu_load`. We can see this in the [System.map](http://en.wikipedia.org/wiki/System.map):
|
||||
|
||||
```
|
||||
...
|
||||
...
|
||||
...
|
||||
ffffffff819ed000 D __init_begin
|
||||
ffffffff819ed000 D __per_cpu_load
|
||||
ffffffff819ed000 A init_per_cpu__irq_stack_union
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Now we know about `initial_gs`, so let's look at the code:
|
||||
|
||||
```assembly
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
movl initial_gs(%rip),%eax
|
||||
movl initial_gs+4(%rip),%edx
|
||||
wrmsr
|
||||
```
|
||||
|
||||
Here we select a model specific register with `MSR_GS_BASE`, put the 64-bit value stored at `initial_gs` into the `edx:eax` pair and execute the `wrmsr` instruction to fill the `gs` base with the address of `init_per_cpu__irq_stack_union`, which will be at the bottom of the interrupt stack. After this we jump to the C code at `x86_64_start_kernel` from [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we make the last preparations before we jump into the generic, architecture-independent kernel code. One of these preparations is filling the early `Interrupt Descriptor Table` with interrupt handler entries, the `early_idt_handlers`. You may remember this, and the following code, if you have read the part about [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html):
|
||||
|
||||
```C
|
||||
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
|
||||
set_intr_gate(i, early_idt_handlers[i]);
|
||||
|
||||
load_idt((const struct desc_ptr *)&idt_descr);
|
||||
```
|
||||
|
||||
but I wrote the `Early interrupt and exception handling` part when the Linux kernel version was `3.18`. Today the current version of the Linux kernel is `4.1.0-rc6+`, and `Andy Lutomirski` has sent a [patch](https://lkml.org/lkml/2015/6/2/106) that changes the behaviour of the `early_idt_handlers` and will soon be in the mainline kernel. **NOTE**: while I was writing this part, the [patch](https://github.com/torvalds/linux/commit/425be5679fd292a3c36cb1fe423086708a99f11a) was already merged into the Linux kernel source code. Let's look at it. Now the same part looks like:
|
||||
|
||||
```C
|
||||
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
|
||||
set_intr_gate(i, early_idt_handler_array[i]);
|
||||
|
||||
load_idt((const struct desc_ptr *)&idt_descr);
|
||||
```
|
||||
|
||||
As you can see, the only difference is the name of the array of interrupt handler entry points. Now it is `early_idt_handler_array`:
|
||||
|
||||
```C
|
||||
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
|
||||
```
|
||||
|
||||
where `NUM_EXCEPTION_VECTORS` and `EARLY_IDT_HANDLER_SIZE` are defined as:
|
||||
|
||||
```C
|
||||
#define NUM_EXCEPTION_VECTORS 32
|
||||
#define EARLY_IDT_HANDLER_SIZE 9
|
||||
```
|
||||
|
||||
So, `early_idt_handler_array` is an array of interrupt handler entry points and contains one entry point every nine bytes. You may remember that the previous `early_idt_handlers` was defined in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). The `early_idt_handler_array` is defined in the same source code file too:
|
||||
|
||||
```assembly
|
||||
ENTRY(early_idt_handler_array)
|
||||
...
|
||||
...
|
||||
...
|
||||
ENDPROC(early_idt_handler_common)
|
||||
```
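The nine-byte stride can be sketched in C with a stand-in array of the same shape. The array `stub_handlers` below is a hypothetical zero-filled placeholder, not the real assembly-defined symbol, and `handler_offset` is only an illustration of how the entry for a given vector is located:

```c
#include <stddef.h>

#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9

/* Hypothetical stand-in for early_idt_handler_array; the real one is
 * emitted by the assembler and contains machine code. */
const char stub_handlers[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE] = {{0}};

/* Byte offset of the entry point for `vector` from the start of the array. */
size_t handler_offset(unsigned int vector)
{
    return (size_t)((const char *)stub_handlers[vector] -
                    (const char *)stub_handlers);
}
```

Indexing the array with a vector number `i`, as `set_intr_gate(i, early_idt_handler_array[i])` does, therefore yields the address `base + i * 9`. For example, the entry for vector 14 starts 126 bytes into the array.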
|
||||
|
||||
It fills `early_idt_handler_array` using `.rept NUM_EXCEPTION_VECTORS` and contains the entry of the `early_make_pgtable` interrupt handler (you can read more about its implementation in the part about [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). We have now come to the end of the `x86_64` architecture-specific code, and the next part is the generic kernel code. Of course, as you may already know, we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code.
|
||||
|
||||
Setting stack canary for the interrupt stack
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
The next stop after [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) is the big `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). If you've read the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you will remember it. This function does all the initialization before the kernel launches the first `init` process, with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. The first thing here that is related to interrupt and exception handling is the call to the `boot_init_stack_canary` function.
|
||||
|
||||
This function sets the [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) value to protect against interrupt stack overflow. We already saw some details of the implementation of `boot_init_stack_canary` in the previous part; now let's take a closer look at it. You can find the implementation of this function in [arch/x86/include/asm/stackprotector.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/stackprotector.h), and it depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set, the function does nothing:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_CC_STACKPROTECTOR
|
||||
...
|
||||
...
|
||||
...
|
||||
#else
|
||||
static inline void boot_init_stack_canary(void)
|
||||
{
|
||||
}
|
||||
#endif
|
||||
```
|
||||
|
||||
If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts with a check that the `stack_canary` field is located at an offset of forty bytes from the start of `irq_stack_union`, which represents the [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) interrupt stack:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_64
|
||||
BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
|
||||
#endif
|
||||
```
|
||||
|
||||
As we saw in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html), `irq_stack_union` is represented by the following union:
|
||||
|
||||
```C
|
||||
union irq_stack_union {
|
||||
char irq_stack[IRQ_STACK_SIZE];
|
||||
|
||||
struct {
|
||||
char gs_base[40];
|
||||
unsigned long stack_canary;
|
||||
};
|
||||
};
|
||||
```
|
||||
|
||||
which is defined in [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that a [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure whose members all share the same memory, so only one of them holds a value at a time. We can see here that the structure has a first field, `gs_base`, which is 40 bytes in size and represents the bottom of the `irq_stack`. So our check with the `BUILD_BUG_ON` macro should end successfully (you can read the first part about the Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interested in the `BUILD_BUG_ON` macro).
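The same layout check can be reproduced in userspace with C11's `_Static_assert`, which plays a role similar to `BUILD_BUG_ON` here. This is only a sketch: the value of `IRQ_STACK_SIZE` is an assumption made for the example, and the two helper functions are hypothetical:

```c
#include <stddef.h>

/* Assumed value for this sketch only; the kernel derives it from the
 * page size. */
#define IRQ_STACK_SIZE 16384

union irq_stack_union {
    char irq_stack[IRQ_STACK_SIZE];

    struct {
        char gs_base[40];
        unsigned long stack_canary;
    };
};

/* Fails at compile time if the canary is not 40 bytes into the union. */
_Static_assert(offsetof(union irq_stack_union, stack_canary) == 40,
               "stack_canary must live at offset 40");

/* Hypothetical helpers exposing the layout at run time. */
size_t canary_offset(void)
{
    return offsetof(union irq_stack_union, stack_canary);
}

size_t irq_stack_union_size(void)
{
    return sizeof(union irq_stack_union);
}
```

The offset matters because the `gs` base points at the start of this union, so the canary ends up at `%gs:40`, which is where GCC's stack protector looks for it on `x86_64`. Because both members overlap, the union is as large as the whole stack, not the sum of its members.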
|
||||
|
||||
After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
|
||||
|
||||
```C
|
||||
get_random_bytes(&canary, sizeof(canary));
|
||||
tsc = __native_read_tsc();
|
||||
canary += tsc + (tsc << 32UL);
|
||||
```
|
||||
|
||||
and write `canary` value to the `irq_stack_union` with the `this_cpu_write` macro:
|
||||
|
||||
```C
|
||||
this_cpu_write(irq_stack_union.stack_canary, canary);
|
||||
```
|
||||
|
||||
You can read more about the `this_cpu_*` operations in the [Linux kernel documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt).
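The mixing step `canary += tsc + (tsc << 32UL)` from the snippet above can be tried out with fixed inputs. In the sketch below, `mix_canary` is a hypothetical function; the real code takes its inputs from `get_random_bytes` and the time stamp counter, and here they are plain parameters:

```c
#include <stdint.h>

/* Adds the TSC value to both the low and the high 32-bit halves of the
 * random value; unsigned overflow wraps around modulo 2^64. */
uint64_t mix_canary(uint64_t random_bytes, uint64_t tsc)
{
    return random_bytes + tsc + (tsc << 32);
}
```

With `random_bytes = 1` and `tsc = 2` the result is `0x200000003`: the low half receives `1 + 2` and the high half receives `2`. The shift spreads the fast-changing low bits of the counter into the upper half of the canary as well.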
|
||||
|
||||
Disabling/Enabling local interrupts
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next step in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) that is related to interrupts and interrupt handling, after we have set the `canary` value for the interrupt stack, is the call of the `local_irq_disable` macro.
|
||||
|
||||
This macro is defined in the [include/linux/irqflags.h](https://github.com/torvalds/linux/blob/master/include/linux/irqflags.h) header file and, as its name suggests, disables interrupts on the current CPU. Let's look at its implementation. First of all, note that it depends on the `CONFIG_TRACE_IRQFLAGS_SUPPORT` kernel configuration option:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
|
||||
...
|
||||
#define local_irq_disable() \
|
||||
do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
|
||||
...
|
||||
#else
|
||||
...
|
||||
#define local_irq_disable() do { raw_local_irq_disable(); } while (0)
|
||||
...
|
||||
#endif
|
||||
```
|
||||
|
||||
The two definitions are similar and, as you can see, have only one difference: the `local_irq_disable` macro contains a call to `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is a special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem, `irq-flags tracing`, for tracing `hardirq` and `softirq` state. In our case the `lockdep` subsystem can give us interesting information about the hard/soft irq on/off events which occur in the system. The `trace_hardirqs_off` function is defined in [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c):
|
||||
|
||||
```C
|
||||
void trace_hardirqs_off(void)
|
||||
{
|
||||
trace_hardirqs_off_caller(CALLER_ADDR0);
|
||||
}
|
||||
EXPORT_SYMBOL(trace_hardirqs_off);
|
||||
```
|
||||
|
||||
and just calls the `trace_hardirqs_off_caller` function. `trace_hardirqs_off_caller` checks the `hardirqs_enabled` field of the current process and increases `redundant_hardirqs_off` if the call to `local_irq_disable` was redundant, or `hardirqs_off_events` if it was not. These two fields and the other `lockdep` statistics fields are defined in [kernel/locking/lockdep_internals.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_internals.h) and located in the `lockdep_stats` structure:
|
||||
|
||||
```C
|
||||
struct lockdep_stats {
|
||||
...
|
||||
...
|
||||
...
|
||||
int hardirqs_off_events;
|
||||
int redundant_hardirqs_off;
|
||||
...
|
||||
...
|
||||
...
|
||||
};
|
||||
```
|
||||
|
||||
If you set the `CONFIG_DEBUG_LOCKDEP` kernel configuration option, the `lockdep_stats_debug_show` function will write all the tracing information to `/proc/lockdep_stats`:
|
||||
|
||||
```C
|
||||
static void lockdep_stats_debug_show(struct seq_file *m)
|
||||
{
|
||||
#ifdef CONFIG_DEBUG_LOCKDEP
|
||||
unsigned long long hi1 = debug_atomic_read(hardirqs_on_events),
|
||||
hi2 = debug_atomic_read(hardirqs_off_events),
|
||||
hr1 = debug_atomic_read(redundant_hardirqs_on),
|
||||
...
|
||||
...
|
||||
...
|
||||
seq_printf(m, " hardirq on events: %11llu\n", hi1);
|
||||
seq_printf(m, " hardirq off events: %11llu\n", hi2);
|
||||
seq_printf(m, " redundant hardirq ons: %11llu\n", hr1);
|
||||
#endif
|
||||
}
|
||||
```
|
||||
|
||||
and you can see its result with the:
|
||||
|
||||
```
|
||||
$ sudo cat /proc/lockdep_stats
|
||||
hardirq on events: 12838248974
|
||||
hardirq off events: 12838248979
|
||||
redundant hardirq ons: 67792
|
||||
redundant hardirq offs: 3836339146
|
||||
softirq on events: 38002159
|
||||
softirq off events: 38002187
|
||||
redundant softirq ons: 0
|
||||
redundant softirq offs: 0
|
||||
```
|
||||
|
||||
Ok, now we know a little about tracing, but more information will come in a separate part about `lockdep` and `tracing`. You can see that both `local_irq_disable` macros have the same part, `raw_local_irq_disable`. This macro is defined in [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) and expands to a call of:
|
||||
|
||||
```C
|
||||
static inline void native_irq_disable(void)
|
||||
{
|
||||
asm volatile("cli": : :"memory");
|
||||
}
|
||||
```
|
||||
|
||||
You will remember that the `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag, which determines whether the processor responds to maskable hardware interrupts. Besides `local_irq_disable`, as you may already know, there is an inverse macro, `local_irq_enable`. This macro has the same tracing mechanism and is very similar to `local_irq_disable`, but as you can understand from its name, it enables interrupts with the `sti` instruction:
|
||||
|
||||
```C
|
||||
static inline void native_irq_enable(void)
|
||||
{
|
||||
asm volatile("sti": : :"memory");
|
||||
}
|
||||
```
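Since `cli` and `sti` are privileged instructions, they cannot be run from an ordinary program, but their effect on the flags register is easy to model. The sketch below works on a plain integer copy of the flags; the `emulate_*` functions are hypothetical, while the position of `IF` (bit 9, mask `0x200`) is architectural:

```c
/* IF is bit 9 of the (E/R)FLAGS register. */
#define X86_EFLAGS_IF 0x200UL

/* Returns 1 when the given flags value has interrupts masked. */
int irqs_disabled_flags(unsigned long flags)
{
    return !(flags & X86_EFLAGS_IF);
}

/* Models the effect of `cli` on a saved flags value. */
unsigned long emulate_cli(unsigned long flags)
{
    return flags & ~X86_EFLAGS_IF;
}

/* Models the effect of `sti` on a saved flags value. */
unsigned long emulate_sti(unsigned long flags)
{
    return flags | X86_EFLAGS_IF;
}
```

For example, starting from a flags value of `0x246`, `emulate_cli` gives `0x46` and `irqs_disabled_flags(0x46)` returns `1`. Applying `emulate_sti` restores `0x246`.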
|
||||
|
||||
Now we know how `local_irq_disable` and `local_irq_enable` work. This was the first call of the `local_irq_disable` macro, but we will meet these macros many times in the Linux kernel source code. For now we are in the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and we have just disabled `local` interrupts. Why local, and why did we do it? Previously the kernel provided a method to disable interrupts on all processors, and it was called `cli`. This function was [removed](https://lwn.net/Articles/291956/), and now we have `local_irq_{enable,disable}` to enable or disable interrupts on the current processor. After we've disabled interrupts with the `local_irq_disable` macro, we set:
|
||||
|
||||
```C
|
||||
early_boot_irqs_disabled = true;
|
||||
```
|
||||
|
||||
The `early_boot_irqs_disabled` variable is declared in [include/linux/kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h):
|
||||
|
||||
```C
|
||||
extern bool early_boot_irqs_disabled;
|
||||
```
|
||||
|
||||
and is used in different places. For example, it is used in the `smp_call_function_many` function from [kernel/smp.c](https://github.com/torvalds/linux/blob/master/kernel/smp.c) to check for a possible deadlock when interrupts are disabled:
|
||||
|
||||
```C
|
||||
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
|
||||
&& !oops_in_progress && !early_boot_irqs_disabled);
|
||||
```
|
||||
|
||||
Early trap initialization during kernel initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next functions after `local_irq_disable` are `boot_cpu_init` and `page_address_init`, but they are not related to interrupts and exceptions (you can read more about these functions in the chapter about the Linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)). Next is the `setup_arch` function. As you may remember, this function is located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and initializes many different architecture-dependent [things](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). The first interrupt-related function which we can see in `setup_arch` is the `early_trap_init` function. This function is defined in [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and fills the `Interrupt Descriptor Table` with a couple of entries:
|
||||
|
||||
```C
|
||||
void __init early_trap_init(void)
|
||||
{
|
||||
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
|
||||
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
#ifdef CONFIG_X86_32
|
||||
set_intr_gate(X86_TRAP_PF, page_fault);
|
||||
#endif
|
||||
load_idt(&idt_descr);
|
||||
}
|
||||
```
|
||||
|
||||
Here we can see calls of three different functions:
|
||||
|
||||
* `set_intr_gate_ist`
|
||||
* `set_system_intr_gate_ist`
|
||||
* `set_intr_gate`
|
||||
|
||||
All of these functions are defined in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) and do similar, but not identical, things. The first function, `set_intr_gate_ist`, inserts a new interrupt gate into the `IDT`. Let's look at its implementation:
|
||||
|
||||
```C
|
||||
static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
|
||||
{
|
||||
BUG_ON((unsigned)n > 0xFF);
|
||||
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
|
||||
}
|
||||
```
|
||||
|
||||
First of all we can see the check that `n`, which is the [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt, is not greater than `0xff`, or 255. We need this check because, as we remember from the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html), the vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function, which writes the given interrupt gate into the `IDT`:
|
||||
|
||||
```C
|
||||
static inline void _set_gate(int gate, unsigned type, void *addr,
|
||||
unsigned dpl, unsigned ist, unsigned seg)
|
||||
{
|
||||
gate_desc s;
|
||||
|
||||
pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
|
||||
write_idt_entry(idt_table, gate, &s);
|
||||
write_trace_idt_entry(gate, &s);
|
||||
}
|
||||
```
|
||||
|
||||
Here we start with the `pack_gate` function, which takes a clean `IDT` entry represented by the `gate_desc` structure and fills it with the handler address and segment selector, the [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) index, the [Privilege level](http://en.wikipedia.org/wiki/Privilege_level), and the type of the gate, which can be one of the following values:
|
||||
|
||||
* `GATE_INTERRUPT`
|
||||
* `GATE_TRAP`
|
||||
* `GATE_CALL`
|
||||
* `GATE_TASK`
|
||||
|
||||
and set the present bit for the given `IDT` entry:
|
||||
|
||||
```C
|
||||
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
|
||||
unsigned dpl, unsigned ist, unsigned seg)
|
||||
{
|
||||
gate->offset_low = PTR_LOW(func);
|
||||
gate->segment = __KERNEL_CS;
|
||||
gate->ist = ist;
|
||||
gate->p = 1;
|
||||
gate->dpl = dpl;
|
||||
gate->zero0 = 0;
|
||||
gate->zero1 = 0;
|
||||
gate->type = type;
|
||||
gate->offset_middle = PTR_MIDDLE(func);
|
||||
gate->offset_high = PTR_HIGH(func);
|
||||
}
|
||||
```
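A 64-bit gate descriptor cannot hold the handler address in one piece, so `pack_gate` splits it into three fields: 16 low bits, 16 middle bits and 32 high bits. The following sketch shows that split and its inverse. The structure and function names are hypothetical, and only the three offset fields of the descriptor are modelled:

```c
#include <stdint.h>

/* Hypothetical structure holding only the three offset fields of a gate. */
struct gate_offsets {
    uint16_t offset_low;     /* bits 0..15  of the handler address */
    uint16_t offset_middle;  /* bits 16..31 of the handler address */
    uint32_t offset_high;    /* bits 32..63 of the handler address */
};

/* Split a handler address the way PTR_LOW/PTR_MIDDLE/PTR_HIGH do. */
struct gate_offsets split_handler_address(uint64_t func)
{
    struct gate_offsets g;

    g.offset_low = (uint16_t)(func & 0xffff);
    g.offset_middle = (uint16_t)((func >> 16) & 0xffff);
    g.offset_high = (uint32_t)(func >> 32);
    return g;
}

/* Reassemble the address, as the processor does when it takes the gate. */
uint64_t join_handler_address(struct gate_offsets g)
{
    return ((uint64_t)g.offset_high << 32) |
           ((uint64_t)g.offset_middle << 16) |
           g.offset_low;
}
```

For a handler at `0xffffffff81234567`, the three fields are `0x4567`, `0x8123` and `0xffffffff`.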
|
||||
|
||||
After this we write the freshly filled interrupt gate to the `IDT` with the `write_idt_entry` macro, which expands to `native_write_idt_entry` and just copies the interrupt gate into `idt_table` at the given index:
|
||||
|
||||
```C
|
||||
#define write_idt_entry(dt, entry, g) native_write_idt_entry(dt, entry, g)
|
||||
|
||||
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
|
||||
{
|
||||
memcpy(&idt[entry], gate, sizeof(*gate));
|
||||
}
|
||||
```
|
||||
|
||||
where `idt_table` is just an array of `gate_desc`:
|
||||
|
||||
```C
|
||||
extern gate_desc idt_table[];
|
||||
```
|
||||
|
||||
That's all. The second `set_system_intr_gate_ist` function has only one difference from the `set_intr_gate_ist`:
|
||||
|
||||
```C
|
||||
static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
|
||||
{
|
||||
BUG_ON((unsigned)n > 0xFF);
|
||||
_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
|
||||
}
|
||||
```
|
||||
|
||||
Do you see it? Look at the fourth parameter of `_set_gate`. It is `0x3`. In `set_intr_gate` it was `0x0`. We know that this parameter represents the `DPL`, or privilege level. We also know that `0` is the highest privilege level and `3` is the lowest. Now we know how `set_system_intr_gate_ist`, `set_intr_gate_ist` and `set_intr_gate` work, and we can return to the `early_trap_init` function. Let's look at it again:
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
|
||||
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
```
|
||||
|
||||
We set two `IDT` entries, one for the `#DB` exception and one for `int3`. These functions take the same set of parameters:
|
||||
|
||||
* vector number of an interrupt;
|
||||
* address of an interrupt handler;
|
||||
* interrupt stack table index.
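
The practical effect of the different `DPL` values of these two gates can be sketched in a few lines. This is only an illustration of the processor's rule for software-generated interrupts (such as `int3`), not kernel code, and `software_int_allowed` is a hypothetical name: a software interrupt may pass through a gate only if the current privilege level is numerically less than or equal to the gate's `DPL`.

```c
#include <stdbool.h>

/* Privilege levels: 0 is the most privileged, 3 is the least. */
enum { SKETCH_RING0 = 0, SKETCH_RING3 = 3 };

/* Rule applied by the CPU to software interrupts (int n, int3):
 * the gate may be used only when CPL <= gate DPL.
 * Hardware-generated exceptions are not subject to this check. */
static bool software_int_allowed(unsigned cpl, unsigned gate_dpl)
{
    return cpl <= gate_dpl;
}
```

With a `DPL` of `3`, as set by `set_system_intr_gate_ist`, the check passes for userspace (`CPL` is `3`); with a `DPL` of `0` it passes only for the kernel.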
|
||||
|
||||
That's all. You will learn more about interrupts and their handlers in the next parts.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw some theory in the previous part and started to dive into interrupt and exception handling in the current part. We started from the earliest places in the Linux kernel source code that are related to interrupts. In the next part we will continue to dive into this interesting topic and learn more about the interrupt handling process.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table)
|
||||
* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
|
||||
* [List of x86 calling conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions)
|
||||
* [8086](http://en.wikipedia.org/wiki/Intel_8086)
|
||||
* [Long mode](http://en.wikipedia.org/wiki/Long_mode)
|
||||
* [NX](http://en.wikipedia.org/wiki/NX_bit)
|
||||
* [Extended Feature Enable Register](http://en.wikipedia.org/wiki/Control_register#Additional_Control_registers_in_x86-64_series)
|
||||
* [Model-specific register](http://en.wikipedia.org/wiki/Model-specific_register)
|
||||
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
|
||||
* [lockdep](http://lwn.net/Articles/321663/)
|
||||
* [irqflags tracing](https://www.kernel.org/doc/Documentation/irqflags-tracing.txt)
|
||||
* [IF](http://en.wikipedia.org/wiki/Interrupt_flag)
|
||||
* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries)
|
||||
* [Union type](http://en.wikipedia.org/wiki/Union_type)
|
||||
* [this_cpu_* operations](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)
|
||||
* [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table)
|
||||
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
|
||||
* [Privilege level](http://en.wikipedia.org/wiki/Privilege_level)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html)
|
||||
522
interrupts/interrupts-3.md
Normal file
@@ -0,0 +1,522 @@
|
||||
Interrupts and Interrupt Handling. Part 3.
|
||||
================================================================================
|
||||
|
||||
Exception Handling
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupt and exception handling in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stopped at the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) source code file.
|
||||
|
||||
We already know that this function performs the initialization of architecture-specific stuff. In our case the `setup_arch` function does [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related initialization. `setup_arch` is a big function, and in the previous part we stopped at the setup of the exception handlers for the following two exceptions:
|
||||
|
||||
* `#DB` - debug exception, transfers control from the interrupted process to the debug handler;
|
||||
* `#BP` - breakpoint exception, caused by the `int 3` instruction.
|
||||
|
||||
These exceptions allow the `x86_64` architecture to have early exception processing for the purpose of debugging via the [kgdb](https://en.wikipedia.org/wiki/KGDB).
|
||||
|
||||
As you may remember, we set these exception handlers in the `early_trap_init` function:
|
||||
|
||||
```C
|
||||
void __init early_trap_init(void)
|
||||
{
|
||||
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
|
||||
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
load_idt(&idt_descr);
|
||||
}
|
||||
```
|
||||
|
||||
from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw the implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part, and now we will look at the implementation of these two exception handlers.
|
||||
|
||||
Debug and Breakpoint exceptions
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Ok, we set up exception handlers for the `#DB` and `#BP` exceptions in the `early_trap_init` function, and now it is time to consider their implementations. But before we do this, let's first look at the details of these exceptions.
|
||||
|
||||
The first exception, `#DB` or the `debug` exception, occurs when a debug event occurs, for example an attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers that have been present in `x86` processors since the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386) processor and, as you can understand from the name of this CPU extension, the main purpose of these registers is debugging.
|
||||
|
||||
These registers allow us to set breakpoints on code and to trace reads or writes of data. Debug registers may be accessed only in privileged mode, and an attempt to read or write the debug registers when executing at any other privilege level causes a [general protection fault](https://en.wikipedia.org/wiki/General_protection_fault) exception. That's why we have used `set_intr_gate_ist` for the `#DB` exception, and not `set_system_intr_gate_ist`.
|
||||
|
||||
The vector number of the `#DB` exception is `1` (we pass it as `X86_TRAP_DB`) and, as we may read in the specification, this exception has no error code:
|
||||
|
||||
```
+-----------------------------------------------------+
|Vector|Mnemonic|Description         |Type |Error Code|
+-----------------------------------------------------+
|1     | #DB    |Reserved            |F/T  |NO        |
+-----------------------------------------------------+
```
|
||||
|
||||
The second exception is `#BP`, or the `breakpoint` exception, which occurs when the processor executes the [int 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. Unlike the `#DB` exception, the `#BP` exception may occur in userspace. We can add it anywhere in our code. For example, let's look at this simple program:
|
||||
|
||||
```C
// breakpoint.c
#include <stdio.h>

int main() {
    int i = 0; // must be initialized, otherwise the loop condition reads garbage
    while (i < 6){
        printf("i equal to: %d\n", i);
        __asm__("int3"); // raises the #BP exception
        ++i;
    }
    return 0;
}
```
|
||||
|
||||
If we compile and run this program, we will see the following output:
|
||||
|
||||
```
$ gcc breakpoint.c -o breakpoint
$ ./breakpoint
i equal to: 0
Trace/breakpoint trap
```
|
||||
|
||||
But if we run it with gdb, we will see our breakpoint and can continue execution of our program:
|
||||
|
||||
```
|
||||
$ gdb breakpoint
|
||||
...
|
||||
...
|
||||
...
|
||||
(gdb) run
|
||||
Starting program: /home/alex/breakpoints
|
||||
i equal to: 0
|
||||
|
||||
Program received signal SIGTRAP, Trace/breakpoint trap.
|
||||
0x0000000000400585 in main ()
|
||||
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
|
||||
(gdb) c
|
||||
Continuing.
|
||||
i equal to: 1
|
||||
|
||||
Program received signal SIGTRAP, Trace/breakpoint trap.
|
||||
0x0000000000400585 in main ()
|
||||
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
|
||||
(gdb) c
|
||||
Continuing.
|
||||
i equal to: 2
|
||||
|
||||
Program received signal SIGTRAP, Trace/breakpoint trap.
|
||||
0x0000000000400585 in main ()
|
||||
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
|
||||
...
|
||||
...
|
||||
...
|
||||
```
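
A debugger is not the only way to survive `int3`. As a minimal sketch (assuming a POSIX system; the names `on_sigtrap` and `run_with_breakpoints` are hypothetical), a program can install its own `SIGTRAP` handler. Since `#BP` is a trap, the saved instruction pointer already points past the `int3` instruction, so returning from the handler simply continues the loop. On non-x86 targets the sketch falls back to `raise(SIGTRAP)`.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Number of SIGTRAP signals seen so far. */
static volatile sig_atomic_t trap_count;

static void on_sigtrap(int signo)
{
    (void)signo;
    trap_count++;
}

/* Executes `iterations` breakpoints and returns how many SIGTRAPs
 * were delivered, or -1 if the handler could not be installed. */
static int run_with_breakpoints(int iterations)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigtrap;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTRAP, &sa, NULL) != 0)
        return -1;

    trap_count = 0;
    for (int i = 0; i < iterations; i++) {
#if defined(__x86_64__) || defined(__i386__)
        __asm__ volatile("int3");   /* raises #BP, delivered as SIGTRAP */
#else
        raise(SIGTRAP);             /* portable fallback for other CPUs */
#endif
    }
    return (int)trap_count;
}
```

Calling `run_with_breakpoints(6)` from `main` mirrors the loop of `breakpoint.c`, but the process is no longer killed by the first breakpoint.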
|
||||
|
||||
Now that we know a little about these two exceptions, we can move on to their handlers.
|
||||
|
||||
Preparation before an exception handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As you may have noted before, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions take the address of an exception handler as their second parameter. In our case the two exception handlers are:
|
||||
|
||||
* `debug`;
|
||||
* `int3`.
|
||||
|
||||
You will not find these functions in the C code. All that can be found in the kernel's `*.c/*.h` files are the declarations of these functions, which are located in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h) kernel header file:
|
||||
|
||||
```C
|
||||
asmlinkage void debug(void);
|
||||
```
|
||||
|
||||
and
|
||||
|
||||
```C
|
||||
asmlinkage void int3(void);
|
||||
```
|
||||
|
||||
You may note the `asmlinkage` directive in the declarations of these functions. It is a kernel macro built on attributes of [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). For functions which cross the boundary between `C` and assembly, we need an explicit declaration of the calling convention. On 32-bit `x86`, a function marked with `asmlinkage` is compiled to retrieve its parameters from the stack; on `x86_64` the arguments are still passed in registers, as the ABI requires.
|
||||
|
||||
So, both handlers are defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file with the `idtentry` macro:
|
||||
|
||||
```assembly
|
||||
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
```
|
||||
|
||||
and
|
||||
|
||||
```assembly
|
||||
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
```
|
||||
|
||||
Each exception handler may consist of two parts. The first part is generic and is the same for all exception handlers. An exception handler should save the [general purpose registers](https://en.wikipedia.org/wiki/Processor_register) on the stack, switch to the kernel stack if the exception came from userspace, and transfer control to the second part of the exception handler. The second part does work that depends on the particular exception. For example, the page fault exception handler should find the virtual page for the given address, the invalid opcode exception handler should send the `SIGILL` [signal](https://en.wikipedia.org/wiki/Unix_signal), and so on.
|
||||
|
||||
As we just saw, an exception handler starts from the definition of the `idtentry` macro from the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file, so let's look at the implementation of this macro. As we may see, the `idtentry` macro takes five arguments:
|
||||
|
||||
* `sym` - defines a global symbol with `.globl name` which will be the entry point of the exception handler;
|
||||
* `do_sym` - symbol name which represents a secondary entry of an exception handler;
|
||||
* `has_error_code` - information about existence of an error code of exception.
|
||||
|
||||
The last two parameters are optional:
|
||||
|
||||
* `paranoid` - shows us how we need to check the current mode (we will see a detailed explanation later);
|
||||
* `shift_ist` - shows us whether the exception handler runs on a stack from the `Interrupt Stack Table`.
|
||||
|
||||
The definition of the `idtentry` macro looks like this:
|
||||
|
||||
```assembly
|
||||
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
|
||||
ENTRY(\sym)
|
||||
...
|
||||
...
|
||||
...
|
||||
END(\sym)
|
||||
.endm
|
||||
```
|
||||
|
||||
Before we consider the internals of the `idtentry` macro, we should know the state of the stack when an exception occurs. As we may read in the [Intel® 64 and IA-32 Architectures Software Developer’s Manual 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html), the state of the stack when an exception occurs is the following:
|
||||
|
||||
```
|
||||
+------------+
|
||||
+40 | %SS |
|
||||
+32 | %RSP |
|
||||
+24 | %RFLAGS |
|
||||
+16 | %CS |
|
||||
+8 | %RIP |
|
||||
0 | ERROR CODE | <-- %RSP
|
||||
+------------+
|
||||
```
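
The offsets in this diagram can be double-checked with a small `C` sketch. The structure below is hypothetical (it is not a kernel type) and simply lists the six 8-byte slots in the order in which they appear in memory, starting from the address in `%RSP`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the frame pushed by the CPU on an exception,
 * lowest address first. */
struct exception_frame_sketch {
    uint64_t error_code; /* pushed only by exceptions that have one */
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/* The compiler verifies the offsets from the diagram at build time. */
_Static_assert(offsetof(struct exception_frame_sketch, error_code) == 0,  "error code");
_Static_assert(offsetof(struct exception_frame_sketch, rip)        == 8,  "rip");
_Static_assert(offsetof(struct exception_frame_sketch, cs)         == 16, "cs");
_Static_assert(offsetof(struct exception_frame_sketch, rflags)     == 24, "rflags");
_Static_assert(offsetof(struct exception_frame_sketch, rsp)        == 32, "rsp");
_Static_assert(offsetof(struct exception_frame_sketch, ss)         == 40, "ss");
```

If any offset did not match the diagram, the file would fail to compile.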
|
||||
|
||||
Now we may start to consider the implementation of the `idtentry` macro. Both the `#DB` and `#BP` exception handlers are defined as:
|
||||
|
||||
```assembly
|
||||
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
```
|
||||
|
||||
If we look at these definitions, we can see that the assembler will generate two routines named `debug` and `int3`, and that these exception handlers will call the `do_debug` and `do_int3` secondary handlers after some preparation. The third parameter defines whether an error code exists and, as we may see, neither of our exceptions has one. As we may see in the diagram above, the processor pushes an error code onto the stack if an exception provides it. In our case, the `debug` and `int3` exceptions do not have error codes. This may bring some difficulties, because the stack will look different for exceptions which provide an error code and for exceptions which do not. That's why the implementation of the `idtentry` macro starts by putting a fake error code on the stack if an exception does not provide one:
|
||||
|
||||
```assembly
|
||||
.ifeq \has_error_code
|
||||
pushq $-1
|
||||
.endif
|
||||
```
|
||||
|
||||
But `-1` is not only a fake error code. It also represents an invalid system call number, so that the system call restart logic will not be triggered.
|
||||
|
||||
The last two parameters of the `idtentry` macro, `shift_ist` and `paranoid`, tell us whether an exception handler runs on a stack from the `Interrupt Stack Table` or not. You may already know that each kernel thread in the system has its own stack. In addition to these stacks, there are some specialized stacks associated with each processor in the system. One of these stacks is the exception stack. The [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture provides a special feature called the `Interrupt Stack Table`. This feature allows switching to a new stack for designated events, such as atomic exceptions like `double fault`. So the `shift_ist` parameter tells us whether we need to switch to an `IST` stack for an exception handler or not.
|
||||
|
||||
The second parameter, `paranoid`, defines the method which helps us to know whether we came to an exception handler from userspace or not. The easiest way to determine this is via the `CPL`, or `Current Privilege Level`, in the `CS` segment register. If it is equal to `3`, we came from userspace; if it is zero, we came from kernel space:
|
||||
|
||||
```
|
||||
testl $3,CS(%rsp)
|
||||
jnz userspace
|
||||
...
|
||||
...
|
||||
...
|
||||
// we are from the kernel space
|
||||
```
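
The same test can be written in `C` as a minimal sketch. The function names are hypothetical, and the two selector values are assumptions about the usual `x86_64` Linux layout (`__KERNEL_CS` equal to `0x10` and `__USER_CS` equal to `0x33`); what matters is only the two lowest bits of the saved `CS` selector, which hold the privilege level.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed selector values for x86_64 Linux, used for illustration only. */
#define SKETCH_SEL_KERNEL_CS 0x10
#define SKETCH_SEL_USER_CS   0x33

/* The two lowest bits of a segment selector hold the privilege level. */
static unsigned selector_privilege(uint64_t cs)
{
    return (unsigned)(cs & 3);
}

/* Mirrors `testl $3, CS(%rsp); jnz userspace`. */
static bool came_from_userspace(uint64_t saved_cs)
{
    return selector_privilege(saved_cs) != 0;
}
```

For the assumed selectors, `came_from_userspace(SKETCH_SEL_USER_CS)` is true and `came_from_userspace(SKETCH_SEL_KERNEL_CS)` is false.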
|
||||
|
||||
But unfortunately this method does not give a 100% guarantee. As described in the kernel documentation:
|
||||
|
||||
> if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
|
||||
> which might have triggered right after a normal entry wrote CS to the
|
||||
> stack but before we executed SWAPGS, then the only safe way to check
|
||||
> for GS is the slower method: the RDMSR.
|
||||
|
||||
In other words, an `NMI`, for example, could happen inside the critical section around a [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. In that case we should check the value of the `MSR_GS_BASE` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register), which stores a pointer to the start of the per-cpu area. So to check whether we came from userspace or not, we should check the value of the `MSR_GS_BASE` model specific register: if it is negative we came from kernel space, otherwise we came from userspace:
|
||||
|
||||
```assembly
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
rdmsr
|
||||
testl %edx,%edx
|
||||
js 1f
|
||||
```
|
||||
|
||||
In the first two lines of code we read the value of the `MSR_GS_BASE` model specific register into the `edx:eax` pair. We can't set a negative value for `gs` from userspace. On the other hand, we know that the direct mapping of physical memory starts at the `0xffff880000000000` virtual address. So `MSR_GS_BASE` will contain an address from `0xffff880000000000` to `0xffffc7ffffffffff`. After the `rdmsr` instruction is executed, the smallest possible value in the `%edx` register will be `0xffff8800`, which is `-30720` when interpreted as a signed 4-byte integer. That's why the kernel space `gs`, which points to the start of the `per-cpu` area, will contain a negative value.
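
The sign test on `%edx` can be sketched in `C` as follows. This is only an illustration with a hypothetical function name: `rdmsr` itself is a privileged instruction, so the sketch takes the 64-bit base as an ordinary argument and checks the sign bit of its upper half, which is what `testl %edx,%edx` followed by `js` does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the given GS base looks like a kernel address,
 * that is, if the upper 32 bits are negative as a signed value. */
static bool gs_base_is_kernel(uint64_t gs_base)
{
    uint32_t edx = (uint32_t)(gs_base >> 32); /* high half, as rdmsr returns it */

    return (edx & 0x80000000u) != 0;          /* sign bit set: kernel space */
}
```

For the start of the direct mapping, `0xffff880000000000`, the function returns true; for any canonical userspace address the upper bit is clear and it returns false.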
|
||||
|
||||
After we pushed fake error code on the stack, we should allocate space for general purpose registers with:
|
||||
|
||||
```assembly
|
||||
ALLOC_PT_GPREGS_ON_STACK
|
||||
```
|
||||
|
||||
macro which is defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) header file. This macro just allocates 15*8 bytes space on the stack to preserve general purpose registers:
|
||||
|
||||
```assembly
|
||||
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
|
||||
addq $-(15*8+\addskip), %rsp
|
||||
.endm
|
||||
```
|
||||
|
||||
So the stack will look like this after execution of the `ALLOC_PT_GPREGS_ON_STACK`:
|
||||
|
||||
```
|
||||
+------------+
|
||||
+160 | %SS |
|
||||
+152 | %RSP |
|
||||
+144 | %RFLAGS |
|
||||
+136 | %CS |
|
||||
+128 | %RIP |
|
||||
+120 | ERROR CODE |
|
||||
|------------|
|
||||
+112 | |
|
||||
+104 | |
|
||||
+96 | |
|
||||
+88 | |
|
||||
+80 | |
|
||||
+72 | |
|
||||
+64 | |
|
||||
+56 | |
|
||||
+48 | |
|
||||
+40 | |
|
||||
+32 | |
|
||||
+24 | |
|
||||
+16 | |
|
||||
+8 | |
|
||||
+0 | | <- %RSP
|
||||
+------------+
|
||||
```
|
||||
|
||||
After we have allocated space for the general purpose registers, we do some checks to understand whether the exception came from userspace or not and, if it did, whether we should move back to the interrupted process stack or stay on the exception stack:
|
||||
|
||||
```assembly
|
||||
.if \paranoid
|
||||
.if \paranoid == 1
|
||||
testb $3, CS(%rsp)
|
||||
jnz 1f
|
||||
.endif
|
||||
call paranoid_entry
|
||||
.else
|
||||
call error_entry
|
||||
.endif
|
||||
```
|
||||
|
||||
Let's consider all three of these cases in turn.
|
||||
|
||||
An exception occurred in userspace
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
First, let's consider the case when an exception has `paranoid=1`, like our `debug` and `int3` exceptions. In this case we check the selector from the `CS` segment register and jump to the `1f` label if we came from userspace; otherwise `paranoid_entry` will be called.
|
||||
|
||||
Let's consider the first case, when we came to an exception handler from userspace. As described above, we should jump to the `1` label. The `1` label starts with the call of the
|
||||
|
||||
```assembly
|
||||
call error_entry
|
||||
```
|
||||
|
||||
routine which saves all general purpose registers in the previously allocated area on the stack:
|
||||
|
||||
```assembly
|
||||
SAVE_C_REGS 8
|
||||
SAVE_EXTRA_REGS 8
|
||||
```
|
||||
|
||||
Both of these macros are defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) header file and just move the values of the general purpose registers to a certain place on the stack, for example:
|
||||
|
||||
```assembly
|
||||
.macro SAVE_EXTRA_REGS offset=0
|
||||
movq %r15, 0*8+\offset(%rsp)
|
||||
movq %r14, 1*8+\offset(%rsp)
|
||||
movq %r13, 2*8+\offset(%rsp)
|
||||
movq %r12, 3*8+\offset(%rsp)
|
||||
movq %rbp, 4*8+\offset(%rsp)
|
||||
movq %rbx, 5*8+\offset(%rsp)
|
||||
.endm
|
||||
```
|
||||
|
||||
After the execution of `SAVE_C_REGS` and `SAVE_EXTRA_REGS` the stack will look like this:
|
||||
|
||||
```
|
||||
+------------+
|
||||
+160 | %SS |
|
||||
+152 | %RSP |
|
||||
+144 | %RFLAGS |
|
||||
+136 | %CS |
|
||||
+128 | %RIP |
|
||||
+120 | ERROR CODE |
|
||||
|------------|
|
||||
+112 | %RDI |
|
||||
+104 | %RSI |
|
||||
+96 | %RDX |
|
||||
+88 | %RCX |
|
||||
+80 | %RAX |
|
||||
+72 | %R8 |
|
||||
+64 | %R9 |
|
||||
+56 | %R10 |
|
||||
+48 | %R11 |
|
||||
+40 | %RBX |
|
||||
+32 | %RBP |
|
||||
+24 | %R12 |
|
||||
+16 | %R13 |
|
||||
+8 | %R14 |
|
||||
+0 | %R15 | <- %RSP
|
||||
+------------+
|
||||
```
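
This layout is what the kernel later views as a `pt_regs` structure. The following sketch uses a hypothetical `struct pt_regs_sketch` (the field names are chosen to match the diagram, not copied from the kernel headers) to show that the `15*8` bytes reserved by `ALLOC_PT_GPREGS_ON_STACK` end exactly where the error code slot begins.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the diagram above, lowest address first. */
struct pt_regs_sketch {
    /* saved by SAVE_EXTRA_REGS */
    uint64_t r15, r14, r13, r12, rbp, rbx;
    /* saved by SAVE_C_REGS */
    uint64_t r11, r10, r9, r8, rax, rcx, rdx, rsi, rdi;
    /* real or fake error code */
    uint64_t orig_rax;
    /* pushed by the CPU */
    uint64_t rip, cs, rflags, rsp, ss;
};

/* 15 general purpose registers of 8 bytes each precede the error code. */
_Static_assert(offsetof(struct pt_regs_sketch, orig_rax) == 15 * 8, "GPR area");
_Static_assert(offsetof(struct pt_regs_sketch, rbx) == 40,  "rbx");
_Static_assert(offsetof(struct pt_regs_sketch, rdi) == 112, "rdi");
_Static_assert(offsetof(struct pt_regs_sketch, rip) == 128, "rip");
_Static_assert(offsetof(struct pt_regs_sketch, ss)  == 160, "ss");
```

The static assertions restate the offsets from the diagram, so a mistake in the field order would be caught at compile time.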
|
||||
|
||||
After the kernel has saved the general purpose registers on the stack, we should check again that we came from userspace with:
|
||||
|
||||
```assembly
|
||||
testb $3, CS+8(%rsp)
|
||||
jz .Lerror_kernelspace
|
||||
```
|
||||
|
||||
because, as described in the documentation, we may potentially fault if a truncated `%RIP` was reported. Anyway, in both cases the [SWAPGS](http://www.felixcloutier.com/x86/SWAPGS.html) instruction will be executed and the values of `MSR_KERNEL_GS_BASE` and `MSR_GS_BASE` will be swapped. From this moment the `%gs` register will point to the base address of kernel structures. So, the `SWAPGS` instruction is executed, and that was the main point of the `error_entry` routine.
|
||||
|
||||
Now we can get back to the `idtentry` macro. We may see the following assembler code after the call of `error_entry`:
|
||||
|
||||
```assembly
|
||||
movq %rsp, %rdi
|
||||
call sync_regs
|
||||
```
|
||||
|
||||
Here we put the stack pointer into the `%rdi` register, which will be the first argument (according to the [x86_64 ABI](https://www.uclibc.org/docs/psABI-x86_64.pdf)) of the `sync_regs` function, and call this function, which is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file:
|
||||
|
||||
```C
|
||||
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
|
||||
{
|
||||
struct pt_regs *regs = task_pt_regs(current);
|
||||
*regs = *eregs;
|
||||
return regs;
|
||||
}
|
||||
```
|
||||
|
||||
This function takes the result of the `task_pt_regs` macro, which is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) header file, copies the registers saved on the exception stack into it and returns it. The `task_pt_regs` macro expands to the address of the `pt_regs` area located just below `thread.sp0`, which represents the pointer to the normal kernel stack:
|
||||
|
||||
```C
|
||||
#define task_pt_regs(tsk) ((struct pt_regs *)(tsk)->thread.sp0 - 1)
|
||||
```
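
The pointer arithmetic in this macro is easy to misread, so here is a minimal userspace sketch of it. All names are hypothetical and the register area is reduced to an opaque array of 21 slots; the point is only that casting `sp0` to a structure pointer and subtracting `1` yields the last structure-sized region that ends exactly at `sp0`, and that `sync_regs` copies the saved registers there.

```c
#include <stdint.h>

/* Opaque stand-in for pt_regs: 21 saved 8-byte slots. */
struct regs_sketch {
    uint64_t slot[21];
};

/* Same arithmetic as task_pt_regs, with sp0 passed in directly. */
#define task_pt_regs_sketch(sp0) ((struct regs_sketch *)(sp0) - 1)

/* Copy the registers saved on the exception stack to the top of the
 * normal kernel stack and return the new location. */
static struct regs_sketch *sync_regs_sketch(uintptr_t sp0,
                                            const struct regs_sketch *eregs)
{
    struct regs_sketch *regs = task_pt_regs_sketch(sp0);

    *regs = *eregs;
    return regs;
}
```

The returned pointer plus `sizeof(struct regs_sketch)` is equal to `sp0`, which is why the handler can simply load it into `%rsp` afterwards.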
|
||||
|
||||
As we came from userspace, this means that the exception handler will run in real process context. After we have got the stack pointer from `sync_regs`, we switch stacks:
|
||||
|
||||
```assembly
|
||||
movq %rax, %rsp
|
||||
```
|
||||
|
||||
The last two steps before an exception handler will call secondary handler are:
|
||||
|
||||
1. Passing pointer to `pt_regs` structure which contains preserved general purpose registers to the `%rdi` register:
|
||||
|
||||
```assembly
|
||||
movq %rsp, %rdi
|
||||
```
|
||||
|
||||
as it will be passed as first parameter of secondary exception handler.
|
||||
|
||||
2. Pass error code to the `%rsi` register as it will be second argument of an exception handler and set it to `-1` on the stack for the same purpose as we did it before - to prevent restart of a system call:
|
||||
|
||||
```
|
||||
.if \has_error_code
|
||||
movq ORIG_RAX(%rsp), %rsi
|
||||
movq $-1, ORIG_RAX(%rsp)
|
||||
.else
|
||||
xorl %esi, %esi
|
||||
.endif
|
||||
```
|
||||
|
||||
Additionally you may see that we zeroed the `%esi` register above in a case if an exception does not provide error code.
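
The second step can be summarized with a short `C` sketch. It is a simplified model with hypothetical names, where the `ORIG_RAX` stack slot is represented by a structure field: the error code is moved out into what will become the second argument, and the slot is overwritten with `-1`.

```c
#include <stdint.h>

/* Only the slot that holds the error code matters for this sketch. */
struct frame_sketch {
    uint64_t orig_ax;
};

/* Returns the value that ends up in %rsi for the secondary handler. */
static uint64_t take_error_code(struct frame_sketch *frame, int has_error_code)
{
    if (has_error_code) {
        uint64_t code = frame->orig_ax;

        frame->orig_ax = (uint64_t)-1; /* no system call restart */
        return code;
    }
    return 0;                          /* xorl %esi, %esi */
}
```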
|
||||
|
||||
In the end we just call secondary exception handler:
|
||||
|
||||
```assembly
|
||||
call \do_sym
|
||||
```
|
||||
|
||||
which:
|
||||
|
||||
```C
|
||||
dotraplinkage void do_debug(struct pt_regs *regs, long error_code);
|
||||
```
|
||||
|
||||
will be for `debug` exception and:
|
||||
|
||||
```C
|
||||
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code);
|
||||
```
|
||||
|
||||
will be for the `int 3` exception. In this part we will not look at the implementations of the secondary handlers, because they are very specific, but we will see some of them in one of the next parts.
|
||||
|
||||
We have just considered the first case, when an exception occurred in userspace. Let's consider the last two.
|
||||
|
||||
An exception with paranoid > 0 occurred in kernelspace
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In this case an exception occurred in kernelspace and the `idtentry` macro is defined with `paranoid=1` for this exception. This value of `paranoid` means that we should use the slower way that we saw at the beginning of this part to check whether we really came from kernelspace or not. The `paranoid_entry` routine allows us to know this:
|
||||
|
||||
```assembly
ENTRY(paranoid_entry)
	cld
	SAVE_C_REGS 8
	SAVE_EXTRA_REGS 8
	movl	$1, %ebx
	movl	$MSR_GS_BASE, %ecx
	rdmsr
	testl	%edx, %edx
	js	1f
	SWAPGS
	xorl	%ebx, %ebx
1:	ret
END(paranoid_entry)
```
|
||||
|
||||
As you may see, this function does the same thing that we covered before. We use the second (slow) method to get information about the previous state of the interrupted task. Once we have checked this and executed `SWAPGS` in case we came from userspace, we should do the same as we did before: we need to put a pointer to the structure which holds the general purpose registers into `%rdi` (which will be the first parameter of the secondary handler) and put the error code, if the exception provides it, into `%rsi` (which will be the second parameter of the secondary handler):
|
||||
|
||||
```assembly
|
||||
movq %rsp, %rdi
|
||||
|
||||
.if \has_error_code
|
||||
movq ORIG_RAX(%rsp), %rsi
|
||||
movq $-1, ORIG_RAX(%rsp)
|
||||
.else
|
||||
xorl %esi, %esi
|
||||
.endif
|
||||
```
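
The decision made by `paranoid_entry` can also be modelled in `C`. This is a sketch under stated assumptions: the two model specific registers are represented by fields of a hypothetical `struct cpu_sketch`, and the return value plays the role of `%ebx`, where `1` means that no `SWAPGS` is needed on exit and `0` means that it is.

```c
#include <stdint.h>

/* Hypothetical model of the two GS base registers of one CPU. */
struct cpu_sketch {
    uint64_t gs_base;        /* MSR_GS_BASE        */
    uint64_t kernel_gs_base; /* MSR_KERNEL_GS_BASE */
};

/* What the SWAPGS instruction does to the two registers. */
static void swapgs_sketch(struct cpu_sketch *cpu)
{
    uint64_t tmp = cpu->gs_base;

    cpu->gs_base = cpu->kernel_gs_base;
    cpu->kernel_gs_base = tmp;
}

/* Returns the flag that paranoid_entry leaves in %ebx. */
static int paranoid_entry_sketch(struct cpu_sketch *cpu)
{
    if (cpu->gs_base >> 63)  /* negative: GS already points to kernel data */
        return 1;            /* js 1f, %ebx stays 1                        */

    swapgs_sketch(cpu);      /* we interrupted code running with user GS   */
    return 0;                /* xorl %ebx, %ebx                            */
}
```

In both cases `gs_base` holds the kernel value after the call, which is the state the secondary handler relies on.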
|
||||
|
||||
The last step before the secondary handler of an exception is called is the setup of a new `IST` stack frame:
|
||||
|
||||
```assembly
|
||||
.if \shift_ist != -1
|
||||
subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
|
||||
.endif
|
||||
```
|
||||
|
||||
You may remember that we passed `shift_ist` as an argument of the `idtentry` macro. Here we check its value and, if it is not equal to `-1`, we take the stack pointer stored in the `Interrupt Stack Table` at the `shift_ist` index and move it down by `EXCEPTION_STKSZ` bytes.
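
The effect of this adjustment can be sketched in `C`. Several things here are assumptions made for illustration: the structure and function names are hypothetical, `EXCEPTION_STKSZ` is taken to be `4096`, the `IST` index is treated as 1-based as in the hardware, and the restoring addition after the call is not shown in the snippet above. The idea is that while the secondary handler runs, the `IST` entry points to a lower region, so a nested exception of the same kind would not overwrite the frame that is currently in use.

```c
#include <stdint.h>

#define SKETCH_EXCEPTION_STKSZ 4096u /* assumed size of one exception stack */

/* Hypothetical stand-in for the IST part of the TSS. */
struct tss_sketch {
    uint64_t ist[7];
};

/* Runs handler(arg) with the selected IST entry shifted down,
 * then restores the entry. shift_ist == -1 means "no IST stack". */
static void call_with_shifted_ist(struct tss_sketch *tss, int shift_ist,
                                  void (*handler)(void *), void *arg)
{
    if (shift_ist != -1)
        tss->ist[shift_ist - 1] -= SKETCH_EXCEPTION_STKSZ;

    handler(arg);

    if (shift_ist != -1)
        tss->ist[shift_ist - 1] += SKETCH_EXCEPTION_STKSZ;
}

/* Small probe used to observe the IST entry from inside the handler. */
struct probe_sketch {
    const struct tss_sketch *tss;
    int index;      /* 0-based index into ist[] */
    uint64_t seen;  /* value observed during the call */
};

static void probe_handler(void *arg)
{
    struct probe_sketch *probe = arg;

    probe->seen = probe->tss->ist[probe->index];
}
```

Calling `call_with_shifted_ist` with `probe_handler` shows the shifted value inside the handler and the original value after it returns.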
|
||||
|
||||
At the end of this second path we just call the secondary exception handler as we did before:
|
||||
|
||||
```assembly
|
||||
call \do_sym
|
||||
```
|
||||
|
||||
The last case is similar to the previous two, but the exception has `paranoid=0` and we may use the fast method to determine where we came from.
|
||||
|
||||
Exit from an exception handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After the secondary handler finishes its work, we return to the `idtentry` macro and the next step is a jump to `error_exit`:
|
||||
|
||||
```assembly
|
||||
jmp error_exit
|
||||
```
|
||||
|
||||
routine. The `error_exit` function is defined in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file. The main goal of this function is to determine where we came from (userspace or kernelspace) and to execute `SWAPGS` depending on this. It then restores the registers to their previous state and executes the `iret` instruction to transfer control back to the interrupted task.
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Interrupt descriptor table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) in the previous part with the `#DB` and `#BP` gates and started to dive into preparation before control will be transferred to an exception handler and implementation of some interrupt handlers in this part. In the next part we will continue to dive into this theme and will go next by the `setup_arch` function and will try to understand interrupts handling related stuff.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [Debug registers](http://en.wikipedia.org/wiki/X86_debug_register)
|
||||
* [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386)
|
||||
* [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3)
|
||||
* [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
||||
* [TSS](http://en.wikipedia.org/wiki/Task_state_segment)
|
||||
* [GNU assembly .error directive](https://sourceware.org/binutils/docs/as/Error.html#Error)
|
||||
* [dwarf2](http://en.wikipedia.org/wiki/DWARF)
|
||||
* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html)
|
||||
* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [system call](http://en.wikipedia.org/wiki/System_call)
|
||||
* [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html)
|
||||
* [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP)
|
||||
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [kgdb](https://en.wikipedia.org/wiki/KGDB)
|
||||
* [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)
|
||||
465
interrupts/interrupts-4.md
Normal file
@@ -0,0 +1,465 @@
|
||||
Interrupts and Interrupt Handling. Part 4.
|
||||
================================================================================
|
||||
|
||||
Initialization of non-early interrupt gates
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the fourth part about interrupt and exception handling in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) we saw the first early `#DB` and `#BP` exception handlers from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped right after the `early_trap_init` function, which is called in the `setup_arch` function defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into interrupt and exception handling in the Linux kernel for `x86_64`, starting from the place where we left off in the last part. The first thing related to interrupt and exception handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it.
|
||||
|
||||
Early page fault handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The `early_trap_pf_init` function is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). It uses the `set_intr_gate` macro, which fills the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the given entry:
|
||||
|
||||
```C
|
||||
void __init early_trap_pf_init(void)
|
||||
{
|
||||
#ifdef CONFIG_X86_64
|
||||
set_intr_gate(X86_TRAP_PF, page_fault);
|
||||
#endif
|
||||
}
|
||||
```
|
||||
|
||||
This macro is defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h). We already saw macros like this in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) - `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that the given vector number is not greater than `255` (the maximum vector number) and calls the `_set_gate` function, as `set_system_intr_gate` and `set_intr_gate_ist` did:
|
||||
|
||||
```C
|
||||
#define set_intr_gate(n, addr) \
|
||||
do { \
|
||||
BUG_ON((unsigned)n > 0xFF); \
|
||||
_set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, \
|
||||
__KERNEL_CS); \
|
||||
_trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\
|
||||
0, 0, __KERNEL_CS); \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
The `set_intr_gate` macro takes two parameters:
|
||||
|
||||
* vector number of an interrupt;
|
||||
* address of an interrupt handler;
|
||||
|
||||
In our case they are:
|
||||
|
||||
* `X86_TRAP_PF` - `14`;
|
||||
* `page_fault` - the interrupt handler entry point.
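To make the effect of filling a gate more concrete, here is a minimal user-space sketch of how a 64-bit interrupt gate could be packed from a handler address. It is not the kernel's `_set_gate`: the `gate_model` structure, the `pack_gate_model` and `gate_model_handler` helpers and the `GATE_INTERRUPT_MODEL` constant are hypothetical names introduced only for this illustration, and the layout follows the architectural 16-byte gate format.

```C
#include <assert.h>
#include <stdint.h>

/* Simplified user-space model of a 16-byte x86_64 IDT gate descriptor. */
struct gate_model {
	uint16_t offset_low;    /* handler address bits 0..15  */
	uint16_t segment;       /* code segment selector       */
	uint8_t  ist;           /* bits 0..2: IST index        */
	uint8_t  type_attr;     /* P (bit 7), DPL (bits 5..6), type (bits 0..3) */
	uint16_t offset_middle; /* handler address bits 16..31 */
	uint32_t offset_high;   /* handler address bits 32..63 */
	uint32_t reserved;
};

#define GATE_INTERRUPT_MODEL 0xE /* 64-bit interrupt gate type */

/* Hypothetical helper: split a handler address across the gate fields. */
static struct gate_model pack_gate_model(unsigned vector, uint64_t handler,
					 uint16_t selector, unsigned dpl,
					 unsigned ist)
{
	struct gate_model g;

	assert(vector <= 0xFF); /* plays the role of BUG_ON in set_intr_gate */

	g.offset_low    = (uint16_t)(handler & 0xFFFF);
	g.segment       = selector;
	g.ist           = (uint8_t)(ist & 0x7);
	g.type_attr     = (uint8_t)(0x80 | ((dpl & 0x3) << 5) | GATE_INTERRUPT_MODEL);
	g.offset_middle = (uint16_t)((handler >> 16) & 0xFFFF);
	g.offset_high   = (uint32_t)(handler >> 32);
	g.reserved      = 0;
	return g;
}

/* Reassemble the handler address from the three offset fields. */
static uint64_t gate_model_handler(const struct gate_model *g)
{
	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_middle << 16) |
	       ((uint64_t)g->offset_high << 32);
}
```

Calling `pack_gate_model(14, handler, 0x10, 0, 0)` models the page fault case: vector `14`, privilege level `0` and no `IST` stack. The selector value `0x10` is only an assumed stand-in for `__KERNEL_CS`.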
|
||||
|
||||
The `X86_TRAP_PF` is an element of the enum defined in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h):
|
||||
|
||||
```C
|
||||
enum {
|
||||
...
|
||||
...
|
||||
...
|
||||
...
|
||||
X86_TRAP_PF, /* 14, Page Fault */
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
When `early_trap_pf_init` is called, `set_intr_gate` expands to a call of `_set_gate`, which fills the `IDT` with the handler for the page fault. Now let's look at the implementation of the `page_fault` handler. The `page_fault` handler is defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file, like all exception handlers. Let's look at it:
|
||||
|
||||
```assembly
|
||||
trace_idtentry page_fault do_page_fault has_error_code=1
|
||||
```
|
||||
|
||||
We saw in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) how the `#DB` and `#BP` handlers are defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro is defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option:
|
||||
|
||||
```assembly
|
||||
#ifdef CONFIG_TRACING
|
||||
.macro trace_idtentry sym do_sym has_error_code:req
|
||||
idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
|
||||
idtentry \sym \do_sym has_error_code=\has_error_code
|
||||
.endm
|
||||
#else
|
||||
.macro trace_idtentry sym do_sym has_error_code:req
|
||||
idtentry \sym \do_sym has_error_code=\has_error_code
|
||||
.endm
|
||||
#endif
|
||||
```
|
||||
|
||||
We will not dive into exception [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) now. If `CONFIG_TRACING` is not set, we can see that the `trace_idtentry` macro just expands to the normal `idtentry`. We already saw the implementation of the `idtentry` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so let's start from the `page_fault` exception handler.
|
||||
|
||||
As we can see in the `idtentry` definition, the handler of the `page_fault` is the `do_page_fault` function, which is defined in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) and, like all exception handlers, takes two arguments:
|
||||
|
||||
* `regs` - `pt_regs` structure that holds the state of the interrupted process;
|
||||
* `error_code` - error code of the page fault exception.
|
||||
|
||||
Let's look inside this function. First of all we read the content of the [cr2](https://en.wikipedia.org/wiki/Control_register) control register:
|
||||
|
||||
```C
|
||||
dotraplinkage void notrace
|
||||
do_page_fault(struct pt_regs *regs, unsigned long error_code)
|
||||
{
|
||||
unsigned long address = read_cr2();
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
This register contains the linear address which caused the `page fault`. In the next step we call the `exception_enter` function from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/blob/master/include/linux/context_tracking.h). The `exception_enter` and `exception_exit` functions belong to the context tracking subsystem of the Linux kernel, which is used by [RCU](https://en.wikipedia.org/wiki/Read-copy-update) to remove its dependency on the timer tick while a processor runs in userspace. We will see similar code in almost every exception handler:
|
||||
|
||||
```C
|
||||
enum ctx_state prev_state;
|
||||
prev_state = exception_enter();
|
||||
...
|
||||
... // exception handler here
|
||||
...
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
The `exception_enter` function checks that `context tracking` is enabled with `context_tracking_is_enabled` and, if it is enabled, gets the previous context with `this_cpu_read` (you can read more about `this_cpu_*` operations in the [Documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)). After this it calls the `context_tracking_user_exit` function, which informs the context tracking subsystem that the processor is exiting userspace mode and entering the kernel:
|
||||
|
||||
```C
|
||||
static inline enum ctx_state exception_enter(void)
|
||||
{
|
||||
enum ctx_state prev_ctx;
|
||||
|
||||
if (!context_tracking_is_enabled())
|
||||
return 0;
|
||||
|
||||
prev_ctx = this_cpu_read(context_tracking.state);
|
||||
context_tracking_user_exit();
|
||||
|
||||
return prev_ctx;
|
||||
}
|
||||
```
|
||||
|
||||
The state can be one of the following:
|
||||
|
||||
```C
|
||||
enum ctx_state {
|
||||
IN_KERNEL = 0,
|
||||
IN_USER,
|
||||
} state;
|
||||
```
|
||||
|
||||
In the end we return the previous context. Between `exception_enter` and `exception_exit` we call the actual page fault handler:
|
||||
|
||||
```C
|
||||
__do_page_fault(regs, error_code, address);
|
||||
```
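The enter/exit pairing described above can be modeled in user space. This is only a sketch under stated assumptions: a single global variable stands in for the per-cpu `context_tracking.state`, and every name ending in `_model` is hypothetical. The restore logic in `exception_exit_model` is an assumption about what the exit side does, since the exit function body is not shown in this part.

```C
#include <assert.h>
#include <stdbool.h>

enum ctx_state_model {
	MODEL_IN_KERNEL = 0,
	MODEL_IN_USER,
};

/* Stand-ins for the per-cpu context tracking data (single CPU only). */
static bool tracking_enabled_model = true;
static enum ctx_state_model cpu_state_model = MODEL_IN_USER;

/* Remember where we came from and mark the CPU as running kernel code. */
static enum ctx_state_model exception_enter_model(void)
{
	enum ctx_state_model prev;

	if (!tracking_enabled_model)
		return MODEL_IN_KERNEL;

	prev = cpu_state_model;
	cpu_state_model = MODEL_IN_KERNEL;
	return prev;
}

/* Assumed exit logic: go back to user context only if we came from it. */
static void exception_exit_model(enum ctx_state_model prev)
{
	if (tracking_enabled_model && prev == MODEL_IN_USER)
		cpu_state_model = MODEL_IN_USER;
}
```

The value returned by `exception_enter_model` has to be passed unchanged to `exception_exit_model`, which mirrors the `prev_state` variable in the handler skeleton shown earlier.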
|
||||
|
||||
The `__do_page_fault` is defined in the same source code file as `do_page_fault` - [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c). In the beginning of `__do_page_fault` we check the state of the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) checker. `kmemcheck` detects and warns about some uses of uninitialized memory. We need to check it because a page fault can be caused by kmemcheck:
|
||||
|
||||
```C
|
||||
if (kmemcheck_active(regs))
|
||||
kmemcheck_hide(regs);
|
||||
prefetchw(&mm->mmap_sem);
|
||||
```
|
||||
|
||||
After this we can see the call of `prefetchw`, which executes the instruction of the same [name](http://www.felixcloutier.com/x86/PREFETCHW.html) (available on processors that report the [X86_FEATURE_3DNOW](https://en.wikipedia.org/?title=3DNow!) feature) to fetch the [cache line](https://en.wikipedia.org/wiki/CPU_cache) for exclusive access. The main purpose of prefetching is to hide the latency of a memory access. In the next step we check whether the faulting address is in kernel space with the following condition:
|
||||
|
||||
```C
|
||||
if (unlikely(fault_in_kernel_space(address))) {
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
where `fault_in_kernel_space` is:
|
||||
|
||||
```C
|
||||
static int fault_in_kernel_space(unsigned long address)
|
||||
{
|
||||
return address >= TASK_SIZE_MAX;
|
||||
}
|
||||
```
|
||||
|
||||
The `TASK_SIZE_MAX` macro expands to the:
|
||||
|
||||
```C
|
||||
#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
|
||||
```
|
||||
|
||||
or `0x00007ffffffff000`. Pay attention to the `unlikely` macro. There are two such macros in the Linux kernel:
|
||||
|
||||
```C
|
||||
#define likely(x) __builtin_expect(!!(x), 1)
|
||||
#define unlikely(x) __builtin_expect(!!(x), 0)
|
||||
```
|
||||
|
||||
You can [often](http://lxr.free-electrons.com/ident?i=unlikely) find these macros in the Linux kernel code. The main purpose of these macros is optimization. Sometimes we need to check a condition and we know that it will rarely be `true` or `false`. With these macros we can tell the compiler about this. For example:
|
||||
|
||||
```C
|
||||
static int proc_root_readdir(struct file *file, struct dir_context *ctx)
|
||||
{
|
||||
if (ctx->pos < FIRST_PROCESS_ENTRY) {
|
||||
int error = proc_readdir(file, ctx);
|
||||
if (unlikely(error <= 0))
|
||||
return error;
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Here we can see the `proc_root_readdir` function, which is called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the contents of the `root` directory. If a condition is marked with `unlikely`, the compiler can put the code for the `false` case right after the branch. Now let's get back to our address check. Comparing the given address with `0x00007ffffffff000` tells us whether the faulting address belongs to kernel space or to user space. After this check, the `__do_page_fault` routine tries to understand the problem that provoked the page fault exception and then passes the address to the appropriate routine. It can be a `kmemcheck` fault, a spurious fault, a [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault, etc. We will not dive into the implementation details of the page fault exception handler in this part, because we would need to know many different concepts provided by the Linux kernel, but we will see it in the chapter about [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) in the Linux kernel.
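The address check and the branch hint can be tried out in user space. The following sketch assumes a 4096-byte page and a compiler that provides `__builtin_expect` (GCC or Clang). All names with the `_model` suffix, including `classify_fault_model`, are hypothetical and exist only for this illustration.

```C
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_MODEL     4096ULL /* assumed page size */
#define TASK_SIZE_MAX_MODEL ((1ULL << 47) - PAGE_SIZE_MODEL)
#define unlikely_model(x)   __builtin_expect(!!(x), 0)

/* Same comparison as fault_in_kernel_space, on a plain 64-bit value. */
static int fault_in_kernel_space_model(uint64_t address)
{
	return address >= TASK_SIZE_MAX_MODEL;
}

/* Kernel-space faults are the rare case, so hint the branch accordingly. */
static const char *classify_fault_model(uint64_t address)
{
	if (unlikely_model(fault_in_kernel_space_model(address)))
		return "kernel";
	return "user";
}
```

The hint does not change the result of the comparison. It only tells the compiler which path to lay out as the fall-through one.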
|
||||
|
||||
Back to start_kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
There are many different function calls after `early_trap_pf_init` in the `setup_arch` function from different kernel subsystems, but none of them is related to interrupt and exception handling. So, we have to go back where we came from - the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L492). The first thing after `setup_arch` is the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). This function initializes the remaining exception handlers (remember that we have already set up 3 handlers: for the `#DB` - debug exception, `#BP` - breakpoint exception and `#PF` - page fault exception). The `trap_init` function starts with the check of the [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture):
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_EISA
|
||||
void __iomem *p = early_ioremap(0x0FFFD9, 4);
|
||||
|
||||
if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
|
||||
EISA_bus = 1;
|
||||
early_iounmap(p, 4);
|
||||
#endif
|
||||
```
|
||||
|
||||
Note that it depends on the `CONFIG_EISA` kernel configuration parameter, which represents `EISA` support. Here we use the `early_ioremap` function to map `I/O` memory into the page tables. We use the `readl` function to read the first `4` bytes from the mapped region and, if they are equal to the `EISA` string, we set `EISA_bus` to one. In the end we just unmap the previously mapped region. You can read more about `early_ioremap` in the part which describes [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html).
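The comparison with `'E' + ('I'<<8) + ('S'<<16) + ('A'<<24)` works because a 32-bit little-endian read places the first byte in the lowest bits. Here is a minimal sketch of the same check on an ordinary byte buffer; `read_le32_model` and `has_eisa_signature_model` are hypothetical helpers, and no real `I/O` memory is touched.

```C
#include <assert.h>
#include <stdint.h>

/* Assemble four bytes into a 32-bit value, lowest address first. */
static uint32_t read_le32_model(const unsigned char *p)
{
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}

/* Return 1 if the buffer starts with the ASCII bytes 'E', 'I', 'S', 'A'. */
static int has_eisa_signature_model(const unsigned char *rom)
{
	const uint32_t expected = (uint32_t)'E' + ((uint32_t)'I' << 8) +
				  ((uint32_t)'S' << 16) + ((uint32_t)'A' << 24);

	return read_le32_model(rom) == expected;
}
```

With ASCII codes `0x45`, `0x49`, `0x53` and `0x41`, the expected value is `0x41534945`.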
|
||||
|
||||
After this we start to fill the `Interrupt Descriptor Table` with the different interrupt gates. First of all we set `#DE` or `Divide Error` and `#NMI` or `Non-maskable Interrupt`:
|
||||
|
||||
```C
|
||||
set_intr_gate(X86_TRAP_DE, divide_error);
|
||||
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
|
||||
```
|
||||
|
||||
We use the `set_intr_gate` macro to set the interrupt gate for the `#DE` exception and `set_intr_gate_ist` for the `#NMI`. You may remember that we already used these macros when we set the interrupt gates for the page fault handler, the debug handler, etc.; you can find an explanation of them in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html). After this we set up exception gates for the following exceptions:
|
||||
|
||||
```C
|
||||
set_system_intr_gate(X86_TRAP_OF, &overflow);
|
||||
set_intr_gate(X86_TRAP_BR, bounds);
|
||||
set_intr_gate(X86_TRAP_UD, invalid_op);
|
||||
set_intr_gate(X86_TRAP_NM, device_not_available);
|
||||
```
|
||||
|
||||
Here we can see:
|
||||
|
||||
* `#OF` or `Overflow` exception. This exception indicates that an overflow trap occurred when a special [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html) instruction was executed;
|
||||
* `#BR` or `BOUND Range exceeded` exception. This exception indicates that a `BOUND-range-exceed` fault occurred when a [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) instruction was executed;
|
||||
* `#UD` or `Invalid Opcode` exception. Occurs when the processor attempts to execute an invalid or reserved [opcode](https://en.wikipedia.org/?title=Opcode), an instruction with invalid operand(s), etc.;
|
||||
* `#NM` or `Device Not Available` exception. Occurs when the processor tries to execute `x87 FPU` floating point instruction while `EM` flag in the [control register](https://en.wikipedia.org/wiki/Control_register#CR0) `cr0` was set.
|
||||
|
||||
In the next step we set the interrupt gate for the `#DF` or `Double fault` exception:
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
|
||||
```
|
||||
|
||||
This exception occurs when the processor detects a second exception while calling an exception handler for a prior exception. Usually, when the processor detects another exception while trying to call an exception handler, the two exceptions can be handled serially. If the processor cannot handle them serially, it signals the double-fault or `#DF` exception.
|
||||
|
||||
The following set of the interrupt gates is:
|
||||
|
||||
```C
|
||||
set_intr_gate(X86_TRAP_OLD_MF, &coprocessor_segment_overrun);
|
||||
set_intr_gate(X86_TRAP_TS, &invalid_TSS);
|
||||
set_intr_gate(X86_TRAP_NP, &segment_not_present);
|
||||
set_intr_gate_ist(X86_TRAP_SS, &stack_segment, STACKFAULT_STACK);
|
||||
set_intr_gate(X86_TRAP_GP, &general_protection);
|
||||
set_intr_gate(X86_TRAP_SPURIOUS, &spurious_interrupt_bug);
|
||||
set_intr_gate(X86_TRAP_MF, &coprocessor_error);
|
||||
set_intr_gate(X86_TRAP_AC, &alignment_check);
|
||||
```
|
||||
|
||||
Here we can see setup for the following exception handlers:
|
||||
|
||||
* `#CSO` or `Coprocessor Segment Overrun` - this exception indicates that the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) of an old processor detected a page or segment violation. Modern processors do not generate this exception.
|
||||
* `#TS` or `Invalid TSS` exception - indicates that there was an error related to the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment).
|
||||
* `#NP` or `Segment Not Present` exception indicates that the `present flag` of a segment or gate descriptor is clear during attempt to load one of `cs`, `ds`, `es`, `fs`, or `gs` register.
|
||||
* `#SS` or `Stack Fault` exception indicates one of the stack related conditions was detected, for example a not-present stack segment is detected when attempting to load the `ss` register.
|
||||
* `#GP` or `General Protection` exception indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause general-protection exception. For example loading the `ss`, `ds`, `es`, `fs`, or `gs` register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the `Interrupt Descriptor Table` (following an interrupt or exception) that is not an interrupt, trap, or task gate and many many more.
|
||||
* `Spurious Interrupt` - a hardware interrupt that is unwanted.
|
||||
* `#MF` or `x87 FPU Floating-Point Error` exception caused when the [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions) has detected a floating point error.
|
||||
* `#AC` or `Alignment Check` exception indicates that the processor detected an unaligned memory operand when alignment checking was enabled.
|
||||
|
||||
After we set up these exception gates, we can see the setup of the `Machine-Check` exception:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_MCE
|
||||
set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);
|
||||
#endif
|
||||
```
|
||||
|
||||
Note that it depends on the `CONFIG_X86_MCE` kernel configuration option. This exception indicates that the processor detected an internal [machine error](https://en.wikipedia.org/wiki/Machine-check_exception) or a bus error, or that an external agent detected a bus error. The next exception gate is for the [SIMD](https://en.wikipedia.org/?title=SIMD) Floating-Point exception:
|
||||
|
||||
```C
|
||||
set_intr_gate(X86_TRAP_XF, &simd_coprocessor_error);
|
||||
```
|
||||
|
||||
which indicates the processor has detected an `SSE` or `SSE2` or `SSE3` SIMD floating-point exception. There are six classes of numeric exception conditions that can occur while executing an SIMD floating-point instruction:
|
||||
|
||||
* Invalid operation
|
||||
* Divide-by-zero
|
||||
* Denormal operand
|
||||
* Numeric overflow
|
||||
* Numeric underflow
|
||||
* Inexact result (Precision)
|
||||
|
||||
In the next step we fill the `used_vectors` array, which is defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h) header file and represents a `bitmap`:
|
||||
|
||||
```C
|
||||
DECLARE_BITMAP(used_vectors, NR_VECTORS);
|
||||
```
|
||||
|
||||
of the first `32` interrupts (more about bitmaps in the Linux kernel you can read in the part which describes [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html))
|
||||
|
||||
```C
|
||||
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
|
||||
set_bit(i, used_vectors);
|
||||
```
|
||||
|
||||
where `FIRST_EXTERNAL_VECTOR` is:
|
||||
|
||||
```C
|
||||
#define FIRST_EXTERNAL_VECTOR 0x20
|
||||
```
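The bookkeeping for the first `32` vectors can be sketched with a plain array of `unsigned long`. This is a simplified, non-atomic model: the kernel's `set_bit` is an atomic operation, while `set_bit_model` here is not, and the vector count of `256` as well as all `_model` names are assumptions made for the illustration.

```C
#include <assert.h>
#include <limits.h>

#define NR_VECTORS_MODEL            256 /* assumed number of IDT vectors */
#define FIRST_EXTERNAL_VECTOR_MODEL 0x20
#define BITS_PER_LONG_MODEL         (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS_MODEL \
	((NR_VECTORS_MODEL + BITS_PER_LONG_MODEL - 1) / BITS_PER_LONG_MODEL)

/* One bit per vector; zero-initialized because it has static storage. */
static unsigned long used_vectors_model[BITMAP_LONGS_MODEL];

/* Non-atomic stand-in for set_bit. */
static void set_bit_model(unsigned nr, unsigned long *map)
{
	assert(nr < NR_VECTORS_MODEL);
	map[nr / BITS_PER_LONG_MODEL] |= 1UL << (nr % BITS_PER_LONG_MODEL);
}

static int test_bit_model(unsigned nr, const unsigned long *map)
{
	return (int)((map[nr / BITS_PER_LONG_MODEL] >>
		      (nr % BITS_PER_LONG_MODEL)) & 1UL);
}

/* Mark vectors 0..31 (the exception range) as used. */
static void reserve_exception_vectors_model(void)
{
	for (unsigned i = 0; i < FIRST_EXTERNAL_VECTOR_MODEL; i++)
		set_bit_model(i, used_vectors_model);
}
```

After `reserve_exception_vectors_model()` runs, vector `31` is marked as used and vector `32` is still free for external interrupts.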
|
||||
|
||||
After this we setup the interrupt gate for the `ia32_syscall` and add `0x80` to the `used_vectors` bitmap:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_IA32_EMULATION
|
||||
set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
|
||||
set_bit(IA32_SYSCALL_VECTOR, used_vectors);
|
||||
#endif
|
||||
```
|
||||
|
||||
There is the `CONFIG_IA32_EMULATION` kernel configuration option on `x86_64` Linux kernels. This option provides the ability to execute 32-bit processes in compatibility mode. In the next parts we will see how it works; in the meantime we only need to know that there is yet another interrupt gate in the `IDT` with the vector number `0x80`. In the next step we map the `IDT` to the fixmap area:
|
||||
|
||||
```C
|
||||
__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
|
||||
idt_descr.address = fix_to_virt(FIX_RO_IDT);
|
||||
```
|
||||
|
||||
and write its address to `idt_descr.address` (you can read more about fix-mapped addresses in the second part of the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function, which is defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c). This function initializes all the `per-cpu` state. In the beginning of `cpu_init` we do the following: first we wait until the current cpu is initialized, then we call the `cr4_init_shadow` function, which stores a shadow copy of the `cr4` control register for the current cpu, and load the CPU microcode if needed, with the following function calls:
|
||||
|
||||
```C
|
||||
wait_for_master_cpu(cpu);
|
||||
cr4_init_shadow();
|
||||
load_ucode_ap();
|
||||
```
|
||||
|
||||
Next we get the `Task State Segment` for the current cpu and the `orig_ist` structure, which represents the original `Interrupt Stack Table` values, with:
|
||||
|
||||
```C
|
||||
t = &per_cpu(cpu_tss, cpu);
|
||||
oist = &per_cpu(orig_ist, cpu);
|
||||
```
|
||||
|
||||
As we have got the values of the `Task State Segment` and `Interrupt Stack Table` for the current processor, we clear the following bits in the `cr4` control register:
|
||||
|
||||
```C
|
||||
cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
|
||||
```
|
||||
|
||||
with this we disable the `vm86` extension, protected-mode virtual interrupts, the time stamp restriction (with the `TSD` bit cleared, [RDTSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) can be executed at any privilege level; a set `TSD` bit would restrict it to the highest privilege) and the debugging extensions. After this we reload the `Global Descriptor Table` and `Interrupt Descriptor table` with the:
|
||||
|
||||
```C
|
||||
switch_to_new_gdt(cpu);
|
||||
loadsegment(fs, 0);
|
||||
load_current_idt();
|
||||
```
|
||||
|
||||
After this we set up the array of Thread-Local Storage descriptors, configure [NX](https://en.wikipedia.org/wiki/NX_bit) and load the CPU microcode. Now it is time to set up and load the `per-cpu` Task State Segments. We loop over all the exception stacks, whose number is `N_EXCEPTION_STACKS` or `4`, and fill the `Interrupt Stack Table` entries with their addresses:
|
||||
|
||||
```C
|
||||
if (!oist->ist[0]) {
|
||||
char *estacks = per_cpu(exception_stacks, cpu);
|
||||
|
||||
for (v = 0; v < N_EXCEPTION_STACKS; v++) {
|
||||
estacks += exception_stack_sizes[v];
|
||||
oist->ist[v] = t->x86_tss.ist[v] =
|
||||
(unsigned long)estacks;
|
||||
if (v == DEBUG_STACK-1)
|
||||
per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
|
||||
}
|
||||
}
|
||||
```
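Note that the loop above advances `estacks` before storing it, so each `IST` entry holds the address just past the end of its stack, which is the initial stack top because stacks grow downwards. A minimal sketch of that arithmetic, with hypothetical stack sizes and `_model` names chosen only for the illustration:

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_EXCEPTION_STACKS_MODEL 4

/* Hypothetical stack sizes, chosen only for the illustration. */
static const size_t exception_stack_sizes_model[N_EXCEPTION_STACKS_MODEL] = {
	4096, 4096, 8192, 4096
};

/* Store the top (highest address) of every stack into the IST model. */
static void fill_ist_model(uintptr_t stacks_base,
			   uintptr_t ist[N_EXCEPTION_STACKS_MODEL])
{
	uintptr_t estacks = stacks_base;

	for (int v = 0; v < N_EXCEPTION_STACKS_MODEL; v++) {
		estacks += exception_stack_sizes_model[v]; /* move to the stack top */
		ist[v] = estacks;
	}
}
```

With a base address of `0x10000`, the first entry becomes `0x11000`, one full stack above the base rather than the base itself.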
|
||||
|
||||
As we have filled the `Task State Segment` with the `Interrupt Stack Table` entries, we can set the `TSS` descriptor for the current processor and load it with the:
|
||||
|
||||
```C
|
||||
set_tss_desc(cpu, t);
|
||||
load_TR_desc();
|
||||
```
|
||||
|
||||
where the `set_tss_desc` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) writes the given descriptor to the `Global Descriptor Table` of the given processor:
|
||||
|
||||
```C
|
||||
#define set_tss_desc(cpu, addr) __set_tss_desc(cpu, GDT_ENTRY_TSS, addr)
|
||||
static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
|
||||
{
|
||||
struct desc_struct *d = get_cpu_gdt_table(cpu);
|
||||
tss_desc tss;
|
||||
set_tssldt_descriptor(&tss, (unsigned long)addr, DESC_TSS,
|
||||
IO_BITMAP_OFFSET + IO_BITMAP_BYTES +
|
||||
sizeof(unsigned long) - 1);
|
||||
write_gdt_entry(d, entry, &tss, DESC_TSS);
|
||||
}
|
||||
```
|
||||
|
||||
and the `load_TR_desc` macro expands to the `ltr` or `Load Task Register` instruction:
|
||||
|
||||
```C
|
||||
#define load_TR_desc() native_load_tr_desc()
|
||||
static inline void native_load_tr_desc(void)
|
||||
{
|
||||
asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
|
||||
}
|
||||
```
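The operand `GDT_ENTRY_TSS*8` is a segment selector: the descriptor index sits in bits 3 and above, bit 2 selects the table (`0` for the `GDT`) and bits 0..1 hold the requested privilege level. A small sketch of that encoding follows; the index value `8` used for `GDT_ENTRY_TSS_MODEL` is an assumption for the illustration, and both helpers are hypothetical.

```C
#include <assert.h>
#include <stdint.h>

#define GDT_ENTRY_TSS_MODEL 8 /* assumed GDT index of the TSS descriptor */

/* Build a selector from a descriptor index, table indicator and RPL. */
static uint16_t make_selector_model(unsigned index, unsigned table_indicator,
				    unsigned rpl)
{
	return (uint16_t)((index << 3) | ((table_indicator & 1) << 2) | (rpl & 3));
}

/* Recover the descriptor index from a selector. */
static unsigned selector_index_model(uint16_t selector)
{
	return selector >> 3;
}
```

With a zero table indicator and a zero privilege level, the selector is just the index multiplied by `8`, which is the value passed to `ltr` above.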
|
||||
|
||||
In the end of the `trap_init` function we can see the following code:
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
|
||||
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
...
|
||||
...
|
||||
...
|
||||
#ifdef CONFIG_X86_64
|
||||
memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
|
||||
set_nmi_gate(X86_TRAP_DB, &debug);
|
||||
set_nmi_gate(X86_TRAP_BP, &int3);
|
||||
#endif
|
||||
```
|
||||
|
||||
Here we copy `idt_table` to `nmi_idt_table` and set up exception handlers for the `#DB` or `Debug exception` and `#BP` or `Breakpoint exception`. You may remember that we already set these interrupt gates in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so why do we need to set them up again? We set them up again because when we initialized them before in the `early_trap_init` function, the `Task State Segment` was not ready yet, but now it is ready after the call of the `cpu_init` function.
|
||||
|
||||
That's all. Soon we will consider all handlers of these interrupts/exceptions.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment) in this part and the initialization of different interrupt handlers such as `Divide Error`, the `Page Fault` exception, etc. Note that we saw just the initialization, and we will dive into the details of the handlers for these exceptions in the next part.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [page fault](https://en.wikipedia.org/wiki/Page_fault)
|
||||
* [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)
|
||||
* [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29)
|
||||
* [cr2](https://en.wikipedia.org/wiki/Control_register)
|
||||
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
|
||||
* [this_cpu_* operations](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)
|
||||
* [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
|
||||
* [prefetchw](http://www.felixcloutier.com/x86/PREFETCHW.html)
|
||||
* [3DNow](https://en.wikipedia.org/?title=3DNow!)
|
||||
* [CPU caches](https://en.wikipedia.org/wiki/CPU_cache)
|
||||
* [VFS](https://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
|
||||
* [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
|
||||
* [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture)
|
||||
* [INT instruction](https://en.wikipedia.org/wiki/INT_%28x86_instruction%29)
|
||||
* [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html)
|
||||
* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm)
|
||||
* [opcode](https://en.wikipedia.org/?title=Opcode)
|
||||
* [control register](https://en.wikipedia.org/wiki/Control_register#CR0)
|
||||
* [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions)
|
||||
* [MCE exception](https://en.wikipedia.org/wiki/Machine-check_exception)
|
||||
* [SIMD](https://en.wikipedia.org/?title=SIMD)
|
||||
* [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [NX](https://en.wikipedia.org/wiki/NX_bit)
|
||||
* [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html)
|
||||
493
interrupts/interrupts-5.md
Normal file
@@ -0,0 +1,493 @@
|
||||
Interrupts and Interrupt Handling. Part 5.
|
||||
================================================================================
|
||||
|
||||
Implementation of exception handlers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the fifth part about interrupt and exception handling in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) we stopped at the setting of interrupt gates in the [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). We did it in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. We saw only the setting of these interrupt gates in the previous part, and in the current part we will see the implementation of the exception handlers for these gates. The preparation before an exception handler is executed is in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and occurs in the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S#L820) macro, which defines exception entry points:
|
||||
|
||||
```assembly
|
||||
idtentry divide_error do_divide_error has_error_code=0
|
||||
idtentry overflow do_overflow has_error_code=0
|
||||
idtentry invalid_op do_invalid_op has_error_code=0
|
||||
idtentry bounds do_bounds has_error_code=0
|
||||
idtentry device_not_available do_device_not_available has_error_code=0
|
||||
idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0
|
||||
idtentry invalid_TSS do_invalid_TSS has_error_code=1
|
||||
idtentry segment_not_present do_segment_not_present has_error_code=1
|
||||
idtentry spurious_interrupt_bug do_spurious_interrupt_bug has_error_code=0
|
||||
idtentry coprocessor_error do_coprocessor_error has_error_code=0
|
||||
idtentry alignment_check do_alignment_check has_error_code=1
|
||||
idtentry simd_coprocessor_error do_simd_coprocessor_error has_error_code=0
|
||||
```
|
||||
|
||||
The `idtentry` macro does the following preparation before an actual exception handler (`do_divide_error` for `divide_error`, `do_overflow` for `overflow`, etc.) gets control. In other words, the `idtentry` macro allocates space for the registers ([pt_regs](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h#L43) structure) on the stack, pushes a dummy error code for stack consistency if an interrupt/exception has no error code, checks the segment selector in the `cs` segment register and switches depending on the previous state (userspace or kernelspace). After all of these preparations it calls the actual interrupt/exception handler:
|
||||
|
||||
```assembly
|
||||
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
|
||||
ENTRY(\sym)
|
||||
...
|
||||
...
|
||||
...
|
||||
call \do_sym
|
||||
...
|
||||
...
|
||||
...
|
||||
END(\sym)
|
||||
.endm
|
||||
```
|
||||
|
||||
After an exception handler finishes its work, the `idtentry` macro restores the stack and general purpose registers of the interrupted task and executes the [iret](http://x86.renejeschke.de/html/file_module_x86_id_145.html) instruction:
|
||||
|
||||
```assembly
|
||||
ENTRY(paranoid_exit)
|
||||
...
|
||||
...
|
||||
...
|
||||
RESTORE_EXTRA_REGS
|
||||
RESTORE_C_REGS
|
||||
REMOVE_PT_GPREGS_FROM_STACK 8
|
||||
INTERRUPT_RETURN
|
||||
END(paranoid_exit)
|
||||
```
|
||||
|
||||
where `INTERRUPT_RETURN` is:
|
||||
|
||||
```assembly
|
||||
#define INTERRUPT_RETURN jmp native_iret
|
||||
...
|
||||
ENTRY(native_iret)
|
||||
.global native_irq_return_iret
|
||||
native_irq_return_iret:
|
||||
iretq
|
||||
```
|
||||
|
||||
You can read more about the `idtentry` macro in the [third part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) of this chapter. Ok, now that we have seen the preparation done before an exception handler is executed, it is time to look at the handlers. First of all let's look at the following handlers:
|
||||
|
||||
* divide_error
|
||||
* overflow
|
||||
* invalid_op
|
||||
* coprocessor_segment_overrun
|
||||
* invalid_TSS
|
||||
* segment_not_present
|
||||
* stack_segment
|
||||
* alignment_check
|
||||
|
||||
All of these handlers are defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file with the `DO_ERROR` macro:
|
||||
|
||||
```C
|
||||
DO_ERROR(X86_TRAP_DE, SIGFPE, "divide error", divide_error)
|
||||
DO_ERROR(X86_TRAP_OF, SIGSEGV, "overflow", overflow)
|
||||
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
|
||||
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun)
|
||||
DO_ERROR(X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS)
|
||||
DO_ERROR(X86_TRAP_NP, SIGBUS, "segment not present", segment_not_present)
|
||||
DO_ERROR(X86_TRAP_SS, SIGBUS, "stack segment", stack_segment)
|
||||
DO_ERROR(X86_TRAP_AC, SIGBUS, "alignment check", alignment_check)
|
||||
```
|
||||
|
||||
As we can see the `DO_ERROR` macro takes 4 parameters:
|
||||
|
||||
* Vector number of an interrupt;
|
||||
* Signal number which will be sent to the interrupted process;
|
||||
* String which describes an exception;
|
||||
* Exception handler entry point.
|
||||
|
||||
This macro is defined in the same source code file and expands to a function whose name is `do_` followed by the handler name:
|
||||
|
||||
```C
|
||||
#define DO_ERROR(trapnr, signr, str, name) \
|
||||
dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
|
||||
{ \
|
||||
do_error_trap(regs, error_code, str, trapnr, signr); \
|
||||
}
|
||||
```
|
||||
|
||||
Note the `##` tokens. This is a special preprocessor feature, [GCC macro Concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html#Concatenation), which concatenates two given tokens. For example, the first `DO_ERROR` in our example will expand to:
|
||||
|
||||
```C
|
||||
dotraplinkage void do_divide_error(struct pt_regs *regs, long error_code)
|
||||
{
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
We can see that all functions generated by the `DO_ERROR` macro just call the `do_error_trap` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). Let's look at the implementation of the `do_error_trap` function.
|
||||
|
||||
Trap handlers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The `do_error_trap` function starts and ends with the following two function calls:
|
||||
|
||||
```C
|
||||
enum ctx_state prev_state = exception_enter();
|
||||
...
|
||||
...
|
||||
...
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking.h). Context tracking is a Linux kernel subsystem which provides kernel boundary probes to keep track of the transitions between contexts, with two basic initial contexts: `user` or `kernel`. The `exception_enter` function first checks that context tracking is enabled. If it is enabled, `exception_enter` reads the previous context and compares it with `CONTEXT_KERNEL`. If the previous context is `user`, we call the `context_tracking_exit` function from the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/master/kernel/context_tracking.c), which informs the context tracking subsystem that the processor is exiting user mode and entering kernel mode:
|
||||
|
||||
```C
|
||||
if (!context_tracking_is_enabled())
|
||||
return 0;
|
||||
|
||||
prev_ctx = this_cpu_read(context_tracking.state);
|
||||
if (prev_ctx != CONTEXT_KERNEL)
|
||||
context_tracking_exit(prev_ctx);
|
||||
|
||||
return prev_ctx;
|
||||
```
|
||||
|
||||
If the previous context is `kernel`, we just return it. The `prev_ctx` variable has the `enum ctx_state` type, which is defined in the [include/linux/context_tracking_state.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking_state.h) and looks like this:
|
||||
|
||||
```C
|
||||
enum ctx_state {
|
||||
CONTEXT_KERNEL = 0,
|
||||
CONTEXT_USER,
|
||||
CONTEXT_GUEST,
|
||||
} state;
|
||||
```
|
||||
|
||||
The second function is `exception_exit`. It is defined in the same [include/linux/context_tracking.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking.h) file, checks that context tracking is enabled and calls the `context_tracking_enter` function if the previous context was `user`:
|
||||
|
||||
```C
|
||||
static inline void exception_exit(enum ctx_state prev_ctx)
|
||||
{
|
||||
if (context_tracking_is_enabled()) {
|
||||
if (prev_ctx != CONTEXT_KERNEL)
|
||||
context_tracking_enter(prev_ctx);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The `context_tracking_enter` function informs the context tracking subsystem that the processor is going to enter user mode from kernel mode. Between the `exception_enter` and `exception_exit` calls we can see the following code:
|
||||
|
||||
```C
|
||||
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
|
||||
NOTIFY_STOP) {
|
||||
conditional_sti(regs);
|
||||
do_trap(trapnr, signr, str, regs, error_code,
|
||||
fill_trap_info(regs, signr, trapnr, &info));
|
||||
}
|
||||
```
|
||||
|
||||
First of all it calls the `notify_die` function, which is defined in the [kernel/notifier.c](https://github.com/torvalds/linux/tree/master/kernel/notifier.c). To get notified of a [kernel panic](https://en.wikipedia.org/wiki/Kernel_panic), [kernel oops](https://en.wikipedia.org/wiki/Linux_kernel_oops), [Non-Maskable Interrupt](https://en.wikipedia.org/wiki/Non-maskable_interrupt) or other events, a caller needs to insert itself into the `die_chain` notifier chain with the `register_die_notifier` function, and the `notify_die` function calls the handlers registered there. The Linux kernel has a special mechanism that allows parts of the kernel to ask to be told when something happens; this mechanism is called `notifiers` or `notifier chains`. It is used, for example, for `USB` hotplug events (see [drivers/usb/core/notify.c](https://github.com/torvalds/linux/tree/master/drivers/usb/core/notify.c)), for memory [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) (see [include/linux/memory.h](https://github.com/torvalds/linux/tree/master/include/linux/memory.h), the `hotplug_memory_notifier` macro, etc.), system reboots and so on. A notifier chain is a simple, singly-linked list. When a Linux kernel subsystem wants to be notified of specific events, it fills out a special `notifier_block` structure and passes it to the `notifier_chain_register` function. An event can be sent with a call to the `notifier_call_chain` function. First of all the `notify_die` function fills the `die_args` structure with the trap number, trap string, registers and other values:
|
||||
|
||||
```C
|
||||
struct die_args args = {
|
||||
.regs = regs,
|
||||
.str = str,
|
||||
.err = err,
|
||||
.trapnr = trap,
|
||||
.signr = sig,
|
||||
};
|
||||
```
|
||||
|
||||
and returns the result of the `atomic_notifier_call_chain` function with the `die_chain`:
|
||||
|
||||
```C
|
||||
static ATOMIC_NOTIFIER_HEAD(die_chain);
|
||||
return atomic_notifier_call_chain(&die_chain, val, &args);
|
||||
```
|
||||
|
||||
where the `ATOMIC_NOTIFIER_HEAD` macro just expands to a definition of the `atomic_notifier_head` structure, which contains a lock and a pointer to the first `notifier_block`:
|
||||
|
||||
```C
|
||||
struct atomic_notifier_head {
|
||||
spinlock_t lock;
|
||||
struct notifier_block __rcu *head;
|
||||
};
|
||||
```
|
||||
|
||||
The `atomic_notifier_call_chain` function calls each function in the notifier chain in turn and returns the value of the last notifier function called. If `notify_die` in `do_error_trap` does not return `NOTIFY_STOP`, we execute the `conditional_sti` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c), which checks the value of the [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag) in the saved registers and enables interrupts if it was set:
|
||||
|
||||
```C
|
||||
static inline void conditional_sti(struct pt_regs *regs)
|
||||
{
|
||||
if (regs->flags & X86_EFLAGS_IF)
|
||||
local_irq_enable();
|
||||
}
|
||||
```
|
||||
|
||||
You can read more about the `local_irq_enable` macro in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter. The next and last call in `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defines the `tsk` variable, which has the `task_struct` type and represents the currently interrupted process. After the definition of `tsk`, we can see the call of the `do_trap_no_signal` function:
|
||||
|
||||
```C
|
||||
struct task_struct *tsk = current;
|
||||
|
||||
if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
|
||||
return;
|
||||
```
|
||||
|
||||
The `do_trap_no_signal` function makes two checks:
|
||||
|
||||
* Did we come from the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode;
|
||||
* Did we come from the kernelspace.
|
||||
|
||||
```C
|
||||
if (v8086_mode(regs)) {
|
||||
...
|
||||
}
|
||||
|
||||
if (!user_mode(regs)) {
|
||||
...
|
||||
}
|
||||
|
||||
return -1;
|
||||
```
|
||||
|
||||
We will not consider the first case because [long mode](https://en.wikipedia.org/wiki/Long_mode) does not support [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode. In the second case we invoke the `fixup_exception` function, which tries to recover from the fault, and call `die` if it can't:
|
||||
|
||||
```C
|
||||
if (!fixup_exception(regs)) {
|
||||
tsk->thread.error_code = error_code;
|
||||
tsk->thread.trap_nr = trapnr;
|
||||
die(str, regs, error_code);
|
||||
}
|
||||
```
|
||||
|
||||
The `die` function, defined in the [arch/x86/kernel/dumpstack.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/dumpstack.c) source code file, prints useful information about the stack, registers and kernel modules, and causes a kernel [oops](https://en.wikipedia.org/wiki/Linux_kernel_oops). If we came from userspace, the `do_trap_no_signal` function returns `-1` and the execution of the `do_trap` function continues. If we passed through `do_trap_no_signal` and did not exit from `do_trap` after this, it means that the previous context was `user`. Most exceptions raised by the processor are interpreted by Linux as error conditions, for example division by zero, invalid opcode, etc. When an exception occurs the Linux kernel sends a [signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process that caused the exception to notify it of the incorrect condition. So, in the `do_trap` function we need to send a signal with the given number (`SIGFPE` for the divide error, `SIGILL` for the invalid opcode exception, etc.). First of all we save the error code and vector number in the currently interrupted process by filling `thread.error_code` and `thread.trap_nr`:
|
||||
|
||||
```C
|
||||
tsk->thread.error_code = error_code;
|
||||
tsk->thread.trap_nr = trapnr;
|
||||
```
|
||||
|
||||
After this we check whether we need to print information about unhandled signals for the interrupted process. We check that the `show_unhandled_signals` variable is set, that the `unhandled_signal` function from the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) reports the signal as unhandled, and the [printk](https://en.wikipedia.org/wiki/Printk) rate limit:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_64
|
||||
if (show_unhandled_signals && unhandled_signal(tsk, signr) &&
|
||||
printk_ratelimit()) {
|
||||
pr_info("%s[%d] trap %s ip:%lx sp:%lx error:%lx",
|
||||
tsk->comm, tsk->pid, str,
|
||||
regs->ip, regs->sp, error_code);
|
||||
print_vma_addr(" in ", regs->ip);
|
||||
pr_cont("\n");
|
||||
}
|
||||
#endif
|
||||
```
|
||||
|
||||
And then we send the given signal to the interrupted process:
|
||||
|
||||
```C
|
||||
force_sig_info(signr, info ?: SEND_SIG_PRIV, tsk);
|
||||
```
|
||||
|
||||
This is the end of `do_trap`. We just saw the generic implementation for eight different exceptions which are defined with the `DO_ERROR` macro. Now let's look at other exception handlers.
|
||||
|
||||
Double fault
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next exception is `#DF` or `Double fault`. This exception occurs when the processor detects a second exception while calling an exception handler for a prior exception. We set the gate for this exception in the previous part:
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
|
||||
```
|
||||
|
||||
Note that this exception runs on the `DOUBLEFAULT_STACK` [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) entry, which has index `1`:
|
||||
|
||||
```C
|
||||
#define DOUBLEFAULT_STACK 1
|
||||
```
|
||||
|
||||
The `double_fault` is the handler for this exception and is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). Like other exception handlers, the `double_fault` handler starts with the definition of two variables: a string that describes the exception and the interrupted process:
|
||||
|
||||
```C
|
||||
static const char str[] = "double fault";
|
||||
struct task_struct *tsk = current;
|
||||
```
|
||||
|
||||
The handler of the double fault exception is split into two parts. The first part checks whether the fault is a `non-IST` fault on the `espfix64` stack. The `iret` instruction restores only the bottom `16` bits of the stack pointer when returning to a `16` bit stack segment, and the `espfix` feature works around this problem. So if we got a `non-IST` fault on the `espfix64` stack, we modify the stack to make it look like a `General Protection Fault`:
|
||||
|
||||
```C
|
||||
struct pt_regs *normal_regs = task_pt_regs(current);
|
||||
|
||||
memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
|
||||
normal_regs->orig_ax = 0;
|
||||
regs->ip = (unsigned long)general_protection;
|
||||
regs->sp = (unsigned long)&normal_regs->orig_ax;
|
||||
return;
|
||||
```
|
||||
|
||||
In the second case we do almost the same as in the previous exception handlers. First comes the call of the `ist_enter` function, which discards the previous context, `user` in our case:
|
||||
|
||||
```C
|
||||
ist_enter(regs);
|
||||
```
|
||||
|
||||
After this we store the vector number of the `Double fault` exception and the error code in the interrupted process, as we did in the previous handlers:
|
||||
|
||||
```C
|
||||
tsk->thread.error_code = error_code;
|
||||
tsk->thread.trap_nr = X86_TRAP_DF;
|
||||
```
|
||||
|
||||
Next we print useful information about the double fault ([PID](https://en.wikipedia.org/wiki/Process_identifier) number, registers content):
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_DOUBLEFAULT
|
||||
df_debug(regs, error_code);
|
||||
#endif
|
||||
```
|
||||
|
||||
And die:
|
||||
|
||||
```C
|
||||
for (;;)
|
||||
die(str, regs, error_code);
|
||||
```
|
||||
|
||||
That's all.
|
||||
|
||||
Device not available exception handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next exception is `#NM` or `Device not available`. The `Device not available` exception can occur in the following cases:
|
||||
|
||||
* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87) floating-point instruction while the EM flag in [control register](https://en.wikipedia.org/wiki/Control_register) `cr0` was set;
|
||||
* The processor executed a `wait` or `fwait` instruction while the `MP` and `TS` flags of register `cr0` were set;
|
||||
* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87), [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29) or [SSE](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) instruction while the `TS` flag in control register `cr0` was set and the `EM` flag is clear.
|
||||
|
||||
The handler of the `Device not available` exception is the `do_device_not_available` function, and it is also defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. It starts by saving the previous context and ends by restoring it, like the other traps which we saw in the beginning of this part:
|
||||
|
||||
```C
|
||||
enum ctx_state prev_state;
|
||||
prev_state = exception_enter();
|
||||
...
|
||||
...
|
||||
...
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
In the next step we check that `FPU` is not eager:
|
||||
|
||||
```C
|
||||
BUG_ON(use_eager_fpu());
|
||||
```
|
||||
|
||||
When we switch into a task or an interrupt we may avoid loading the `FPU` state. If the task then uses the `FPU`, we catch the `Device not available` exception. If we load the `FPU` state during task switching, the `FPU` is eager. In the next step we check the `EM` flag of the `cr0` control register, which shows whether an `x87` floating point unit is present (flag clear) or not (flag set):
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_MATH_EMULATION
|
||||
if (read_cr0() & X86_CR0_EM) {
|
||||
struct math_emu_info info = { };
|
||||
|
||||
conditional_sti(regs);
|
||||
|
||||
info.regs = regs;
|
||||
math_emulate(&info);
|
||||
exception_exit(prev_state);
|
||||
return;
|
||||
}
|
||||
#endif
|
||||
```
|
||||
|
||||
If the `x87` floating point unit is not present, we enable interrupts with `conditional_sti`, fill the `math_emu_info` structure (defined in the [arch/x86/include/asm/math_emu.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/math_emu.h)) with the registers of the interrupted task and call the `math_emulate` function from the [arch/x86/math-emu/fpu_entry.c](https://github.com/torvalds/linux/tree/master/arch/x86/math-emu/fpu_entry.c). As you can understand from the function's name, it emulates the `x87 FPU` unit (we will learn more about the `x87` in a special chapter). Otherwise, if the `X86_CR0_EM` flag is clear, which means that the `x87 FPU` unit is present, we call the `fpu__restore` function from the [arch/x86/kernel/fpu/core.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/fpu/core.c), which copies the `FPU` registers from the `fpustate` to the live hardware registers. After this `FPU` instructions can be used:
|
||||
|
||||
```C
|
||||
fpu__restore(&current->thread.fpu);
|
||||
```
|
||||
|
||||
General protection fault exception handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next exception is the `#GP` or `General protection fault`. This exception occurs when the processor detected one of a class of protection violations called `general-protection violations`. It can be:
|
||||
|
||||
* Exceeding the segment limit when accessing the `cs`, `ds`, `es`, `fs` or `gs` segments;
|
||||
* Loading the `ss`, `ds`, `es`, `fs` or `gs` register with a segment selector for a system segment;
|
||||
* Violating any of the privilege rules;
|
||||
* and other...
|
||||
|
||||
The handler for this exception is the `do_general_protection` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). Like other exception handlers, the `do_general_protection` function starts by saving the previous context and ends by restoring it:
|
||||
|
||||
```C
|
||||
prev_state = exception_enter();
|
||||
...
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
After this we enable interrupts if they were enabled in the interrupted context and check whether we came from [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode:
|
||||
|
||||
```C
|
||||
conditional_sti(regs);
|
||||
|
||||
if (v8086_mode(regs)) {
|
||||
local_irq_enable();
|
||||
handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
|
||||
goto exit;
|
||||
}
|
||||
```
|
||||
|
||||
As long mode does not support this mode, we will not consider exception handling for this case. In the next step we check whether the previous mode was kernel mode and try to fix up the trap. If we can't fix up the current general protection fault exception, we store the vector number and error code of the exception in the interrupted process, notify the `die` chain with `notify_die` and, unless a notifier returns `NOTIFY_STOP`, call `die`:
|
||||
|
||||
```C
|
||||
if (!user_mode(regs)) {
|
||||
if (fixup_exception(regs))
|
||||
goto exit;
|
||||
|
||||
tsk->thread.error_code = error_code;
|
||||
tsk->thread.trap_nr = X86_TRAP_GP;
|
||||
if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
|
||||
X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
|
||||
die("general protection fault", regs, error_code);
|
||||
goto exit;
|
||||
}
|
||||
```
|
||||
|
||||
If we can fix up the exception, we go to the `exit` label, which leaves the exception state:
|
||||
|
||||
```C
|
||||
exit:
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
If we came from user mode, we send the `SIGSEGV` signal to the interrupted process, as we did in the `do_trap` function:
|
||||
|
||||
```C
|
||||
if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
|
||||
printk_ratelimit()) {
|
||||
pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx",
|
||||
tsk->comm, task_pid_nr(tsk),
|
||||
regs->ip, regs->sp, error_code);
|
||||
print_vma_addr(" in ", regs->ip);
|
||||
pr_cont("\n");
|
||||
}
|
||||
|
||||
force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
|
||||
```
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the fifth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter, and we saw the implementation of some exception handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see the handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), the handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions and many more.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)
|
||||
* [iret instruction](http://x86.renejeschke.de/html/file_module_x86_id_145.html)
|
||||
* [GCC macro Concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html#Concatenation)
|
||||
* [kernel panic](https://en.wikipedia.org/wiki/Kernel_panic)
|
||||
* [kernel oops](https://en.wikipedia.org/wiki/Linux_kernel_oops)
|
||||
* [Non-Maskable Interrupt](https://en.wikipedia.org/wiki/Non-maskable_interrupt)
|
||||
* [hotplug](https://en.wikipedia.org/wiki/Hot_swapping)
|
||||
* [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag)
|
||||
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
|
||||
* [signal](https://en.wikipedia.org/wiki/Unix_signal)
|
||||
* [printk](https://en.wikipedia.org/wiki/Printk)
|
||||
* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
|
||||
* [SIMD](https://en.wikipedia.org/wiki/SIMD)
|
||||
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
|
||||
* [PID](https://en.wikipedia.org/wiki/Process_identifier)
|
||||
* [x87 FPU](https://en.wikipedia.org/wiki/X87)
|
||||
* [control register](https://en.wikipedia.org/wiki/Control_register)
|
||||
* [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html)
|
||||
480
interrupts/interrupts-6.md
Normal file
@@ -0,0 +1,480 @@
|
||||
Interrupts and Interrupt Handling. Part 6.
|
||||
================================================================================
|
||||
|
||||
Non-maskable interrupt handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html) we saw the implementation of some exception handlers: for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, the divide exception, the invalid [opcode](https://en.wikipedia.org/wiki/Opcode) exception, etc. As I wrote in the previous part, we will see the implementations of the remaining exceptions in this part. We will see the implementation of the following handlers:
|
||||
|
||||
* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt;
|
||||
* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) Range Exceeded Exception;
|
||||
* [Coprocessor](https://en.wikipedia.org/wiki/Coprocessor) exception;
|
||||
* [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exception.
|
||||
|
||||
So, let's start.
|
||||
|
||||
Non-Maskable interrupt handling
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
A [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt is a hardware interrupt that cannot be ignored by standard masking techniques. In a general way, a non-maskable interrupt can be generated in either of two ways:
|
||||
|
||||
* External hardware asserts the non-maskable interrupt [pin](https://en.wikipedia.org/wiki/CPU_socket) on the CPU.
|
||||
* The processor receives a message on the system bus or the APIC serial bus with a delivery mode `NMI`.
|
||||
|
||||
When the processor receives an `NMI` from one of these sources, it handles it immediately by calling the `NMI` handler pointed to by the interrupt vector number `2` (see the table in the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html)). We already filled the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the [vector number](https://en.wikipedia.org/wiki/Interrupt_vector_table), the address of the `nmi` interrupt handler and the `NMI_STACK` [Interrupt Stack Table entry](https://github.com/torvalds/linux/blob/master/Documentation/x86/kernel-stacks):
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
|
||||
```
|
||||
|
||||
in the `trap_init` function, which is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file. In the previous [parts](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw that the entry points of all interrupt handlers are defined with the:
|
||||
|
||||
```assembly
|
||||
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
|
||||
ENTRY(\sym)
|
||||
...
|
||||
...
|
||||
...
|
||||
END(\sym)
|
||||
.endm
|
||||
```
|
||||
|
||||
macro from the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file. But the handler of `Non-Maskable` interrupts is not defined with this macro. It has its own entry point:
|
||||
|
||||
```assembly
|
||||
ENTRY(nmi)
|
||||
...
|
||||
...
|
||||
...
|
||||
END(nmi)
|
||||
```
|
||||
|
||||
in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file. Let's dive into it and try to understand how the `Non-Maskable` interrupt handler works. The `nmi` handler starts with the:
|
||||
|
||||
```assembly
|
||||
PARAVIRT_ADJUST_EXCEPTION_FRAME
|
||||
```
|
||||
|
||||
macro, but we will not dive into its details in this part, because this macro is related to [Paravirtualization](https://en.wikipedia.org/wiki/Paravirtualization), which we will see in another chapter. After this we save the content of the `rdx` register on the stack:
|
||||
|
||||
```assembly
|
||||
pushq %rdx
|
||||
```
|
||||
|
||||
And check whether `cs` was the kernel code segment when the non-maskable interrupt occurred:
|
||||
|
||||
```assembly
|
||||
cmpl $__KERNEL_CS, 16(%rsp)
|
||||
jne first_nmi
|
||||
```
|
||||
|
||||
The `__KERNEL_CS` macro is defined in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) and selects the descriptor with index `2` in the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table):
|
||||
|
||||
```C
|
||||
#define GDT_ENTRY_KERNEL_CS 2
|
||||
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
|
||||
```
|
||||
|
||||
You can read more about the `GDT` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the Linux kernel booting process chapter. If `cs` is not the kernel segment, it means that this is not a nested `NMI`, and we jump to the `first_nmi` label. Let's consider this case. First of all, at the `first_nmi` label we reload `rdx` with the value that we saved on the top of the stack (without popping it) and push `1` onto the stack:
|
||||
|
||||
```assembly
|
||||
first_nmi:
|
||||
movq (%rsp), %rdx
|
||||
pushq $1
|
||||
```
|
||||
|
||||
Why do we push `1` on the stack? As the comment says: `We allow breakpoints in NMIs`. On [x86_64](https://en.wikipedia.org/wiki/X86-64), like on other architectures, the CPU will not execute another `NMI` until the first `NMI` is completed. An `NMI` finishes with the [iret](http://faydoc.tripod.com/cpu/iret.htm) instruction, as other interrupts and exceptions do. But if the `NMI` handler triggers a [page fault](https://en.wikipedia.org/wiki/Page_fault), a [breakpoint](https://en.wikipedia.org/wiki/Breakpoint) or another exception, the handler of that exception uses the `iret` instruction too. If this happens while in `NMI` context, the CPU will leave `NMI` context and a new `NMI` may come in: the `iret` used to return from that exception re-enables `NMIs`, and we can get nested non-maskable interrupts. The problem is that the `NMI` handler will not return to the state it was in when the exception triggered; instead it will return to a state that allows new `NMIs` to preempt the running `NMI` handler. If another `NMI` comes in before the first `NMI` handler is complete, the new `NMI` will write all over the preempted `NMI`'s stack, because the next `NMI` uses the top of the stack of the previous `NMI`. This means that a nested `NMI` cannot simply run as usual, because it would corrupt the stack of the previous non-maskable interrupt. That's why we allocate space on the stack for a temporary variable. We check this variable to see whether a previous `NMI` is executing, and clear it when the `NMI` is not nested. We push `1` here to denote that a non-maskable interrupt is currently executing. Remember that when an `NMI` or another exception occurs we have the following [stack frame](https://en.wikipedia.org/wiki/Call_stack):
|
||||
|
||||
```
|
||||
+------------------------+
|
||||
| SS |
|
||||
| RSP |
|
||||
| RFLAGS |
|
||||
| CS |
|
||||
| RIP |
|
||||
+------------------------+
|
||||
```

and also an error code if the exception has one. So, after all of these manipulations our stack frame will look like this:

```
+------------------------+
|          SS            |
|          RSP           |
|         RFLAGS         |
|          CS            |
|          RIP           |
|          RDX           |
|           1            |
+------------------------+
```

In the next step we allocate yet another `40` bytes on the stack:

```assembly
    subq $(5*8), %rsp
```

and push a copy of the original stack frame after the allocated space:

```assembly
    .rept 5
    pushq 11*8(%rsp)
    .endr
```

with the [.rept](http://tigcc.ticalc.org/doc/gnuasm.html#SEC116) assembly directive. Why do we need a copy of the original stack frame? In general we need two copies of the interrupt stack frame: the `saved` stack frame and the `copied` stack frame. Here we push the original stack frame into the `saved` stack frame, which is located right after the `40` bytes we have just allocated for the `copied` stack frame. The `saved` stack frame is used to fix up the `copied` stack frame, which a nested `NMI` may change. The `copied` stack frame is the one that nested `NMIs` modify to let the first `NMI` know that a second `NMI` was triggered and that the first `NMI` handler should be repeated. Ok, we have made the first copy of the original stack frame, now it is time to make the second copy:

```assembly
    addq $(10*8), %rsp

    .rept 5
    pushq -6*8(%rsp)
    .endr
    subq $(5*8), %rsp
```

After all of these manipulations our stack frame will look like this:

```
+-------------------------+
| original SS             |
| original Return RSP     |
| original RFLAGS         |
| original CS             |
| original RIP            |
+-------------------------+
| temp storage for rdx    |
+-------------------------+
| NMI executing variable  |
+-------------------------+
| copied SS               |
| copied Return RSP       |
| copied RFLAGS           |
| copied CS               |
| copied RIP              |
+-------------------------+
| Saved SS                |
| Saved Return RSP        |
| Saved RFLAGS            |
| Saved CS                |
| Saved RIP               |
+-------------------------+
```
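To see that the offsets `11*8(%rsp)` and `-6*8(%rsp)` really pick up the right words, the push and `subq`/`addq` sequence above can be replayed on a small model of the stack. This is only a sketch in plain C, not kernel code: the `sim` structure, the `TAG_*` marker values and the function names are made up for illustration, and it assumes the documented `x86` behaviour that `pushq off(%rsp)` computes the source address from `%rsp` as it was before the decrement.

```c
#include <stdint.h>

enum { WORDS = 32 };            /* size of the modelled stack in 8-byte words */

/* Marker values that stand in for the real register contents. */
enum { TAG_SS = 0x55, TAG_RSP, TAG_RFLAGS, TAG_CS, TAG_RIP, TAG_RDX };

struct sim {
    uint64_t mem[WORDS];        /* higher index = higher address */
    int sp;                     /* index of the word that %rsp points at */
};

/* pushq value: decrement %rsp, then store. */
static void push(struct sim *s, uint64_t v)
{
    s->mem[--s->sp] = v;
}

/* pushq off*8(%rsp): the source address uses %rsp as it was before the push. */
static void push_rel(struct sim *s, int off)
{
    push(s, s->mem[s->sp + off]);
}

/* Replays the sequence from the text and returns the final %rsp index. */
static int build_nmi_frames(struct sim *s)
{
    int i;

    for (i = 0; i < WORDS; i++)
        s->mem[i] = 0;
    s->sp = WORDS;

    /* Hardware frame pushed by the CPU. */
    push(s, TAG_SS);
    push(s, TAG_RSP);
    push(s, TAG_RFLAGS);
    push(s, TAG_CS);
    push(s, TAG_RIP);

    push(s, TAG_RDX);           /* temp storage for rdx */
    push(s, 1);                 /* "NMI executing" variable */

    s->sp -= 5;                 /* subq $(5*8), %rsp: room for the copied frame */
    for (i = 0; i < 5; i++)
        push_rel(s, 11);        /* fills the saved frame from the original one */

    s->sp += 10;                /* addq $(10*8), %rsp */
    for (i = 0; i < 5; i++)
        push_rel(s, -6);        /* fills the copied frame from the saved one */
    s->sp -= 5;                 /* subq $(5*8), %rsp */

    return s->sp;
}
```

After `build_nmi_frames` returns, the five words starting at the returned index hold the saved frame (`RIP` lowest, `SS` highest), the next five hold the copied frame, and the word after them is the `1` that marks an executing `NMI`, which matches the diagram.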

After this we push a dummy error code on the stack, as we already did in the previous exception handlers, and allocate space on the stack for the general purpose registers:

```assembly
    pushq $-1
    ALLOC_PT_GPREGS_ON_STACK
```

We already saw the implementation of the `ALLOC_PT_GPREGS_ON_STACK` macro in the third part of the interrupts [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html). This macro is defined in [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) and allocates another `120` bytes on the stack for the general purpose registers, from `rdi` to `r15`:

```assembly
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
    addq $-(15*8+\addskip), %rsp
.endm
```

After allocating space for the general purpose registers we can see the call of `paranoid_entry`:

```assembly
    call paranoid_entry
```

We may remember this label from the previous parts. It saves the general purpose registers on the stack, reads the `MSR_GS_BASE` [Model Specific Register](https://en.wikipedia.org/wiki/Model-specific_register) and checks its value. If the value of `MSR_GS_BASE` is negative, we came from kernel mode and just return from `paranoid_entry`. Otherwise we came from user mode and need to execute the `swapgs` instruction, which swaps the user `gs` with the kernel `gs`:

```assembly
ENTRY(paranoid_entry)
    cld
    SAVE_C_REGS 8
    SAVE_EXTRA_REGS 8
    movl $1, %ebx
    movl $MSR_GS_BASE, %ecx
    rdmsr
    testl %edx, %edx
    js 1f
    SWAPGS
    xorl %ebx, %ebx
1:  ret
END(paranoid_entry)
```
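The sign test on `edx` works because `rdmsr` returns the upper half of `MSR_GS_BASE` in `edx`, so the sign bit of `edx` is bit 63 of the full value, and kernel addresses live in the upper half of the address space. A minimal sketch of the same decision in C follows; the function name is hypothetical and the addresses used to exercise it are arbitrary sample values, not values taken from a real machine.

```c
#include <stdint.h>

/*
 * Mirrors "testl %edx, %edx; js 1f" from paranoid_entry.
 * Returns the value left in ebx:
 *   1 - the kernel gs base was already loaded, no swapgs was executed;
 *   0 - a user gs base was loaded, swapgs was executed.
 */
static int paranoid_entry_ebx(uint64_t gs_base)
{
    uint32_t edx = (uint32_t)(gs_base >> 32);   /* upper half, as rdmsr returns it */

    if (edx & 0x80000000u)                      /* sign bit set: kernel address */
        return 1;
    return 0;                                   /* swapgs, then xorl %ebx, %ebx */
}
```

For example, an upper-half value such as `0xffff88003fc00000` leaves `1` in `ebx`, while a lower-half value such as `0x00007f1234560000` leaves `0`.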

Note that after the `swapgs` instruction we zero the `ebx` register. Later we will check the content of this register: if we executed `swapgs`, `ebx` contains `0`, otherwise it contains `1`. In the next step we store the value of the `cr2` [control register](https://en.wikipedia.org/wiki/Control_register) in the `r12` register, because the `NMI` handler can cause a `page fault` and corrupt the value of this control register:

```assembly
    movq %cr2, %r12
```

Now it is time to call the actual `NMI` handler. We move the address of the `pt_regs` to `rdi`, the error code to `rsi` and call the `do_nmi` handler:

```assembly
    movq %rsp, %rdi
    movq $-1, %rsi
    call do_nmi
```

We will come back to `do_nmi` a little later in this part, but now let's look at what happens after `do_nmi` finishes its execution. First we check the `cr2` register, because we could have got a page fault while `do_nmi` was running. If `cr2` has changed we restore its original value, otherwise we jump to the label `1`. After this we test the content of the `ebx` register (remember, it contains `0` if we used the `swapgs` instruction and `1` if we did not). If it contains `0` we execute `SWAPGS_UNSAFE_STACK`, otherwise we jump straight to the `nmi_restore` label. The `SWAPGS_UNSAFE_STACK` macro just expands to the `swapgs` instruction. At the `nmi_restore` label we restore the general purpose registers, release the space allocated on the stack for these registers, clear our temporary variable and exit from the interrupt handler with the `INTERRUPT_RETURN` macro:

```assembly
    movq %cr2, %rcx
    cmpq %rcx, %r12
    je 1f
    movq %r12, %cr2
1:
    testl %ebx, %ebx
    jnz nmi_restore
nmi_swapgs:
    SWAPGS_UNSAFE_STACK
nmi_restore:
    RESTORE_EXTRA_REGS
    RESTORE_C_REGS
    /* Pop the extra iret frame at once */
    REMOVE_PT_GPREGS_FROM_STACK 6*8
    /* Clear the NMI executing stack variable */
    movq $0, 5*8(%rsp)
    INTERRUPT_RETURN
```

where `INTERRUPT_RETURN` is defined in [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) and just expands to the `iret` instruction. That's all.

Now let's consider the case when another `NMI` occurs before the previous `NMI` has finished its execution. You may remember from the beginning of this part that we check whether we came from userspace and jump to `first_nmi` in this case:

```assembly
    cmpl $__KERNEL_CS, 16(%rsp)
    jne first_nmi
```

Note that in this case it is always the first `NMI`, because if the first `NMI` caught a page fault, breakpoint or another exception, that exception is handled in kernel mode. If we did not come from userspace, first of all we test our temporary variable:

```assembly
    cmpl $1, -8(%rsp)
    je nested_nmi
```

and if it is set to `1` we jump to the `nested_nmi` label. If it is not `1`, we test the `IST` stack. In the case of nested `NMIs` we check whether we are above `repeat_nmi`. If we are not, we ignore it; otherwise we check whether we are above `end_repeat_nmi` and jump to the `nested_nmi_out` label.

Now let's look at the `do_nmi` exception handler. This function is defined in the [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c) source code file and takes two parameters:

* address of the `pt_regs`;
* error code.

like all exception handlers. `do_nmi` starts with a call of the `nmi_nesting_preprocess` function and ends with a call of `nmi_nesting_postprocess`. The `nmi_nesting_preprocess` function checks whether the interrupted code was running on the debug stack, which is unlikely. If it was, the function calls `debug_stack_set_zero` from [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) and sets the `update_debug_stack` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `1`. The `debug_stack_set_zero` function increases the `debug_stack_use_ctr` per-cpu variable and loads a new `Interrupt Descriptor Table`:

```C
static inline void nmi_nesting_preprocess(struct pt_regs *regs)
{
    if (unlikely(is_debug_stack(regs->sp))) {
        debug_stack_set_zero();
        this_cpu_write(update_debug_stack, 1);
    }
}
```

The `nmi_nesting_postprocess` function checks the `update_debug_stack` per-cpu variable which we set in `nmi_nesting_preprocess` and resets the debug stack, or in other words loads the original `Interrupt Descriptor Table`. After the call of `nmi_nesting_preprocess` we can see the call of `nmi_enter` in `do_nmi`. `nmi_enter` increases the `lockdep_recursion` field of the interrupted process, updates the preempt counter and informs the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) subsystem about the `NMI`. There is also the `nmi_exit` function, which does the same things as `nmi_enter` in reverse. After `nmi_enter` we increase `__nmi_count` in the `irq_stat` structure and call the `default_do_nmi` function. First of all, `default_do_nmi` compares the address of the previous `NMI` with the current one and then records the current address as the last one:

```C
if (regs->ip == __this_cpu_read(last_nmi_rip))
    b2b = true;
else
    __this_cpu_write(swallow_nmi, false);

__this_cpu_write(last_nmi_rip, regs->ip);
```
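The back-to-back check above can be tried outside the kernel with an ordinary structure in place of the per-cpu variables. The following is a minimal sketch: `struct nmi_track` and `nmi_is_back_to_back` are hypothetical names introduced only for this illustration, and the logic is a direct transcription of the snippet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-cpu variables used by default_do_nmi. */
struct nmi_track {
    uint64_t last_nmi_rip;      /* instruction pointer seen by the previous NMI */
    bool swallow_nmi;
};

/* Returns true when this NMI interrupted the same instruction as the last one. */
static bool nmi_is_back_to_back(struct nmi_track *t, uint64_t ip)
{
    bool b2b = false;

    if (ip == t->last_nmi_rip)
        b2b = true;
    else
        t->swallow_nmi = false; /* a different address: not a back-to-back NMI */

    t->last_nmi_rip = ip;
    return b2b;
}
```

Two consecutive calls with the same `ip` report a back-to-back `NMI` on the second call; a call with a different `ip` clears `swallow_nmi` and records the new address.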

After this we first handle CPU-specific `NMIs`:

```C
handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
```

And then the non-specific `NMIs`, depending on the reason:

```C
reason = x86_platform.get_nmi_reason();
if (reason & NMI_REASON_MASK) {
    if (reason & NMI_REASON_SERR)
        pci_serr_error(reason, regs);
    else if (reason & NMI_REASON_IOCHK)
        io_check_error(reason, regs);

    __this_cpu_add(nmi_stats.external, 1);
    return;
}
```

That's all.

Range Exceeded Exception
--------------------------------------------------------------------------------

The next exception is the `BOUND` range exceeded exception. The `BOUND` instruction determines whether the first operand (array index) is within the bounds of an array specified by the second operand (bounds operand). If the index is not within bounds, a `BOUND` range exceeded exception or `#BR` is raised. The handler of the `#BR` exception is the `do_bounds` function, defined in [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). The `do_bounds` handler starts with a call of the `exception_enter` function and ends with a call of `exception_exit`:

```C
prev_state = exception_enter();

if (notify_die(DIE_TRAP, "bounds", regs, error_code,
               X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
    goto exit;
...
...
...
exception_exit(prev_state);
return;
```
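Before going further into the handler, it may help to spell out the check that the `BOUND` instruction itself performs. The sketch below models it in C for 32-bit operands; the structure and function names are hypothetical, and it assumes the usual semantics of the instruction, where both the lower and the upper bound are inclusive.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lower and upper bound, in the order BOUND reads them from memory. */
struct bounds32 {
    int32_t lower;
    int32_t upper;
};

/* Returns true when a BOUND check on index would raise #BR. */
static bool bound_raises_br(int32_t index, const struct bounds32 *b)
{
    return index < b->lower || index > b->upper;
}
```

With bounds `{0, 9}` the indexes `0` and `9` pass the check, while `-1` and `10` would raise `#BR` and end up in `do_bounds`.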

After we have got the state of the previous context, we pass the exception to the `notify_die` chain, and if it returns `NOTIFY_STOP` we return from the exception. You can read more about notify chains and the `context tracking` functions in the [previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html). In the next step we enable interrupts with the `conditional_sti` function, which checks the `IF` flag in the saved flags and calls `local_irq_enable` depending on its value:

```C
conditional_sti(regs);

if (!user_mode(regs))
    die("bounds", regs, error_code);
```

and if we did not come from user mode, we send the `SIGSEGV` signal with the `die` function. After this we check whether [MPX](https://en.wikipedia.org/wiki/Intel_MPX) is enabled, and if this feature is disabled we jump to the `exit_trap` label:

```C
if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
    goto exit_trap;
}
```

where we execute the `do_trap` function (you can find more about it in the previous part):

```C
exit_trap:
    do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
    exception_exit(prev_state);
```

If the `MPX` feature is enabled we check `BNDSTATUS` with the `get_xsave_field_ptr` function, and if it is zero, `MPX` was not responsible for this exception:

```C
bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
if (!bndcsr)
    goto exit_trap;
```

After all of this, only one case remains: `MPX` is responsible for this exception. We will not dive into the details of Intel Memory Protection Extensions in this part, but will see them in another chapter.

Coprocessor exception and SIMD exception
--------------------------------------------------------------------------------

The next two exceptions are the [x87 FPU](https://en.wikipedia.org/wiki/X87) Floating-Point Error exception or `#MF` and the [SIMD](https://en.wikipedia.org/wiki/SIMD) Floating-Point Exception or `#XF`. The first exception occurs when the `x87 FPU` has detected a floating point error, for example a division by zero or a numeric overflow. The second exception occurs when the processor has detected an [SSE/SSE2/SSE3](https://en.wikipedia.org/wiki/SSE3) `SIMD` floating-point exception; the causes can be the same as for the `x87 FPU`. The handlers for these exceptions, `do_coprocessor_error` and `do_simd_coprocessor_error`, are defined in [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and are very similar to each other. Both call the `math_error` function from the same source code file, but pass different vector numbers. `do_coprocessor_error` passes the `X86_TRAP_MF` vector number to `math_error`:

```C
dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
{
    enum ctx_state prev_state;

    prev_state = exception_enter();
    math_error(regs, error_code, X86_TRAP_MF);
    exception_exit(prev_state);
}
```

and `do_simd_coprocessor_error` passes `X86_TRAP_XF` to the `math_error` function:

```C
dotraplinkage void
do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
{
    enum ctx_state prev_state;

    prev_state = exception_enter();
    math_error(regs, error_code, X86_TRAP_XF);
    exception_exit(prev_state);
}
```

First of all the `math_error` function defines the currently interrupted task, the address of its fpu and a string which describes the exception. It then passes the exception to the `notify_die` chain and returns from the exception handler if the chain returns `NOTIFY_STOP`:

```C
struct task_struct *task = current;
struct fpu *fpu = &task->thread.fpu;
siginfo_t info;
char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
                                      "simd exception";

if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
    return;
```

After this we check whether we came from kernel mode, and if so we try to fix the exception with the `fixup_exception` function. If we cannot, we fill the task with the exception's error code and vector number and die:

```C
if (!user_mode(regs)) {
    if (!fixup_exception(regs)) {
        task->thread.error_code = error_code;
        task->thread.trap_nr = trapnr;
        die(str, regs, error_code);
    }
    return;
}
```

If we came from user mode, we save the `fpu` state, fill the task structure with the vector number and error code of the exception, and fill `siginfo_t` with the signal number, `errno`, the address where the exception occurred and the signal code:

```C
fpu__save(fpu);

task->thread.trap_nr    = trapnr;
task->thread.error_code = error_code;
info.si_signo           = SIGFPE;
info.si_errno           = 0;
info.si_addr            = (void __user *)uprobe_get_trap_addr(regs);
info.si_code            = fpu__exception_code(fpu, trapnr);
```

After this we check the signal code and if it is zero we return:

```C
if (!info.si_code)
    return;
```

Otherwise we send the `SIGFPE` signal in the end:

```C
force_sig_info(SIGFPE, &info, task);
```

That's all.

Conclusion
--------------------------------------------------------------------------------

It is the end of the sixth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter, and in this part we saw the implementation of some exception handlers, such as the `non-maskable` interrupt, [SIMD](https://en.wikipedia.org/wiki/SIMD) and [x87 FPU](https://en.wikipedia.org/wiki/X87) floating point exceptions. We have finally finished with the `trap_init` function and will go ahead in the next part. Our next topic is external interrupts and the `early_irq_init` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c).

If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault)
* [opcode](https://en.wikipedia.org/wiki/Opcode)
* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt)
* [BOUND instruction](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm)
* [CPU socket](https://en.wikipedia.org/wiki/CPU_socket)
* [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)
* [Interrupt Stack Table](https://github.com/torvalds/linux/blob/master/Documentation/x86/kernel-stacks)
* [Paravirtualization](https://en.wikipedia.org/wiki/Paravirtualization)
* [.rept](http://tigcc.ticalc.org/doc/gnuasm.html#SEC116)
* [SIMD](https://en.wikipedia.org/wiki/SIMD)
* [Coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [iret](http://faydoc.tripod.com/cpu/iret.htm)
* [page fault](https://en.wikipedia.org/wiki/Page_fault)
* [breakpoint](https://en.wikipedia.org/wiki/Breakpoint)
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [stack frame](https://en.wikipedia.org/wiki/Call_stack)
* [Model Specific register](https://en.wikipedia.org/wiki/Model-specific_register)
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
* [MPX](https://en.wikipedia.org/wiki/Intel_MPX)
* [x87 FPU](https://en.wikipedia.org/wiki/X87)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html)
|
||||
461
interrupts/interrupts-7.md
Normal file
461
interrupts/interrupts-7.md
Normal file
@@ -0,0 +1,461 @@
Interrupts and Interrupt Handling. Part 7.
================================================================================

Introduction to external interrupts
--------------------------------------------------------------------------------

This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html), and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-6.html) we finished with the exceptions which are generated by the processor. In this part we will continue to dive into interrupt handling and will start with external hardware interrupts. As you may remember, in the previous part we finished with the `trap_init` function from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c), and the next step is the call of the `early_irq_init` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c).

Interrupts are signals that are sent across an [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) or `Interrupt Request Line` by hardware or software. External hardware interrupts allow devices such as a keyboard or a mouse to indicate that they need the attention of the processor. Once the processor receives the `Interrupt Request`, it temporarily stops execution of the running program and invokes a special routine which depends on the interrupt. We already know that this routine is called an interrupt handler (from this part on we will also call it `ISR` or `Interrupt Service Routine`). The `ISR` can be found in the Interrupt Vector table, which is located at a fixed address in memory. After the interrupt is handled, the processor resumes the interrupted process. At boot/initialization time, the Linux kernel identifies all devices in the machine and loads the appropriate interrupt handlers into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by sending a [Unix signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process. That is why the kernel can handle an exception quickly. Unfortunately we cannot use this approach for external hardware interrupts, because they often arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of the interrupt:

* `I/O` interrupts;
* Timer interrupts;
* Interprocessor interrupts.

I will try to describe all types of interrupts in this book.

Generally, a handler of an `I/O` interrupt must be flexible enough to service several devices at the same time. For example, in the [PCI](https://en.wikipedia.org/wiki/Conventional_PCI) bus architecture several devices may share the same `IRQ` line. In the simplest case the Linux kernel must do the following when an `I/O` interrupt occurs:

* Save the value of the `IRQ` and the contents of the registers on the kernel stack;
* Send an acknowledgment to the hardware controller which is servicing the `IRQ` line;
* Execute the interrupt service routine (from now on we will call it `ISR`) which is associated with the device;
* Restore registers and return from the interrupt;
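Because several devices may share one `IRQ` line, "execute the `ISR`" in the list above really means "give every handler registered on the line a chance to claim the interrupt". The following is a minimal user-space sketch of that idea; `struct action_sketch`, `run_shared_irq` and `demo_isr` are hypothetical names invented for this example and only loosely resemble what the kernel does.

```c
#include <stddef.h>

/* Hypothetical, simplified chain of handlers registered on one IRQ line. */
struct action_sketch {
    int (*handler)(int irq, void *dev);   /* returns 1 if its device raised the IRQ */
    void *dev;                            /* per-device cookie passed to the handler */
    struct action_sketch *next;
};

/* Calls every handler on the line; returns how many of them claimed the interrupt. */
static int run_shared_irq(int irq, struct action_sketch *chain)
{
    int handled = 0;
    struct action_sketch *a;

    for (a = chain; a != NULL; a = a->next)
        handled += a->handler(irq, a->dev);
    return handled;
}

/* Example ISR: the "device" is an int that is non-zero while an interrupt is pending. */
static int demo_isr(int irq, void *dev)
{
    int *pending = dev;

    (void)irq;
    if (!*pending)
        return 0;       /* not our device, let the next handler look */
    *pending = 0;       /* acknowledge the device */
    return 1;
}
```

A result of `0` from `run_shared_irq` would mean that no device on the line recognised the interrupt.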

Ok, we know a little theory, so now let's start with the `early_irq_init` function. The implementation of the `early_irq_init` function is in [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function performs early initialization of the `irq_desc` structure. The `irq_desc` structure is the foundation of the interrupt management code in the Linux kernel. An array of these structures, which has the same name, `irq_desc`, keeps track of every interrupt request source in the Linux kernel. The structure is defined in [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h) and, as you can see, it depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. This kernel configuration option enables support for sparse irqs. The `irq_desc` structure contains many different fields:

* `irq_common_data` - per irq and chip data passed down to chip functions;
* `status_use_accessors` - contains the status of the interrupt source, which is a combination of the values from the `enum` in [include/linux/irq.h](https://github.com/torvalds/linux/blob/master/include/linux/irq.h) and different macros which are defined in the same source code file;
* `kstat_irqs` - irq stats per-cpu;
* `handle_irq` - highlevel irq-events handler;
* `action` - identifies the interrupt service routines to be invoked when the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) occurs;
* `irq_count` - counter of interrupt occurrences on the IRQ line;
* `depth` - `0` if the IRQ line is enabled and a positive value if it has been disabled at least once;
* `last_unhandled` - aging timer for unhandled count;
* `irqs_unhandled` - count of the unhandled interrupts;
* `lock` - a spin lock used to serialize the accesses to the `IRQ` descriptor;
* `pending_mask` - pending rebalanced interrupts;
* `owner` - the owner of the interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is needed to provide a refcount on the module which provides the interrupts;
* etc.
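One way to read the `depth` field from the list above is as a nesting counter: every disable increments it, every enable decrements it, and the line is enabled only while the counter is `0`. The sketch below illustrates this reading with hypothetical names (`struct desc_sketch`, `sketch_disable`, `sketch_enable`); it is an assumption about how such a counter behaves, not a copy of the kernel implementation.

```c
/* Hypothetical miniature of an IRQ descriptor: only the depth counter. */
struct desc_sketch {
    unsigned int depth;   /* 0 = line enabled, >0 = number of outstanding disables */
};

static void sketch_disable(struct desc_sketch *d)
{
    d->depth++;
}

static void sketch_enable(struct desc_sketch *d)
{
    if (d->depth > 0)     /* ignore an unbalanced enable */
        d->depth--;
}

static int sketch_line_enabled(const struct desc_sketch *d)
{
    return d->depth == 0;
}
```

A descriptor that starts with `depth` equal to `1` is disabled until the first enable, and two nested disables need two enables before the line is live again.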

Of course these are not all the fields of the `irq_desc` structure, because it would take too long to describe each of them here, but we will see them all soon. Now let's start to dive into the implementation of the `early_irq_init` function.

Early external interrupts initialization
--------------------------------------------------------------------------------

Now, let's look at the implementation of the `early_irq_init` function. Note that the implementation of `early_irq_init` depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Here we consider the implementation used when the `CONFIG_SPARSE_IRQ` kernel configuration option is not set. The function starts with the declaration of the following variables: the `irq` descriptors counter, a loop counter, a memory node and an `irq_desc` descriptor:

```C
int __init early_irq_init(void)
{
    int count, i, node = first_online_node;
    struct irq_desc *desc;
    ...
    ...
    ...
}
```

The `node` is an online [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) node. How it is found depends on the `MAX_NUMNODES` value, which in turn depends on the `CONFIG_NODES_SHIFT` kernel configuration parameter:

```C
#define MAX_NUMNODES    (1 << NODES_SHIFT)
...
...
...
#ifdef CONFIG_NODES_SHIFT
    #define NODES_SHIFT     CONFIG_NODES_SHIFT
#else
    #define NODES_SHIFT     0
#endif
```

As I already wrote, the implementation of the `first_online_node` macro depends on the `MAX_NUMNODES` value:

```C
#if MAX_NUMNODES > 1
  #define first_online_node first_node(node_states[N_ONLINE])
#else
  #define first_online_node 0
#endif
```

`node_states` is an array of node masks indexed by the [enum](https://en.wikipedia.org/wiki/Enumerated_type) of the same name, both defined in [include/linux/nodemask.h](https://github.com/torvalds/linux/blob/master/include/linux/nodemask.h); each mask represents the set of nodes in a given state. In our case we are searching for an online node, and the result will be `0` if `MAX_NUMNODES` is one. If `MAX_NUMNODES` is greater than one, `node_states[N_ONLINE]` is the mask of online nodes and the `first_node` macro expands to a call of the `__first_node` function, which returns the `minimal`, or first, online node:

```C
#define first_node(src) __first_node(&(src))

static inline int __first_node(const nodemask_t *srcp)
{
    return min_t(int, MAX_NUMNODES, find_first_bit(srcp->bits, MAX_NUMNODES));
}
```
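The effect of `__first_node` can be illustrated on a plain `unsigned long` used as a bitmask. The function below is a hypothetical sketch, not the kernel helper: it returns the index of the lowest set bit, and `max_nodes` when no bit is set, which is the role `min_t` plays above.

```c
/*
 * Returns the index of the lowest set bit in mask, or max_nodes when no bit
 * below max_nodes is set. max_nodes must not exceed the number of bits in
 * an unsigned long.
 */
static int first_node_sketch(unsigned long mask, int max_nodes)
{
    int bit;

    for (bit = 0; bit < max_nodes; bit++)
        if (mask & (1UL << bit))
            return bit;
    return max_nodes;
}
```

For a mask with nodes `2` and `3` online (`0xc`) the sketch returns `2`; for an empty mask it returns `max_nodes`.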

More about this will be in another chapter about `NUMA`. The next step after the declaration of these local variables is the call of the:

```C
init_irq_default_affinity();
```

function. The `init_irq_default_affinity` function is defined in the same source code file and, depending on the `CONFIG_SMP` kernel configuration option, allocates the given [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) structure (in our case `irq_default_affinity`):

```C
#if defined(CONFIG_SMP)
cpumask_var_t irq_default_affinity;

static void __init init_irq_default_affinity(void)
{
    alloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
    cpumask_setall(irq_default_affinity);
}
#else
static void __init init_irq_default_affinity(void)
{
}
#endif
```

We know that when a piece of hardware, such as a disk controller or a keyboard, needs attention from the processor, it raises an interrupt. The interrupt tells the processor that something has happened and that the processor should interrupt the current process and handle the incoming event. In order to prevent multiple devices from sending the same interrupts, the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) system was established, where each device in a computer system is assigned its own special IRQ so that its interrupts are unique. The Linux kernel can assign certain `IRQs` to specific processors. This is known as `SMP IRQ affinity`, and it allows you to control how your system will respond to various hardware events (that is why it has a real implementation only if the `CONFIG_SMP` kernel configuration option is set). After we have allocated the `irq_default_affinity` cpumask, we can see the `printk` output:

```C
printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS);
```

which prints `NR_IRQS`:

```
~$ dmesg | grep NR_IRQS
[    0.000000] NR_IRQS:4352
```

`NR_IRQS` is the maximum number of `irq` descriptors, or in other words the maximum number of interrupts. Its value depends on the state of the `CONFIG_X86_IO_APIC` kernel configuration option. If `CONFIG_X86_IO_APIC` is not set and the Linux kernel uses an old [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) chip, `NR_IRQS` is:

```C
#define NR_IRQS_LEGACY                    16

#ifdef CONFIG_X86_IO_APIC
...
...
...
#else
# define NR_IRQS                          NR_IRQS_LEGACY
#endif
```

Otherwise, when the `CONFIG_X86_IO_APIC` kernel configuration option is set, `NR_IRQS` depends on the number of processors and the number of interrupt vectors:

```C
#define CPU_VECTOR_LIMIT               (64 * NR_CPUS)
#define NR_VECTORS                     256
#define IO_APIC_VECTOR_LIMIT           ( 32 * MAX_IO_APICS )
#define MAX_IO_APICS                   128

# define NR_IRQS                                       \
    (CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ?         \
        (NR_VECTORS + CPU_VECTOR_LIMIT)  :             \
        (NR_VECTORS + IO_APIC_VECTOR_LIMIT))
...
...
...
```

We remember from the previous parts that we can set the number of processors during the Linux kernel configuration process with the `CONFIG_NR_CPUS` configuration option:



In the first case (`CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT`) `NR_IRQS` is `NR_VECTORS + CPU_VECTOR_LIMIT`; in the second case (`CPU_VECTOR_LIMIT <= IO_APIC_VECTOR_LIMIT`) it is `NR_VECTORS + IO_APIC_VECTOR_LIMIT`, which is `256 + 4096 = 4352`. In my case `NR_CPUS` is `8`, as you can see in my configuration, so `CPU_VECTOR_LIMIT` is `512` and `IO_APIC_VECTOR_LIMIT` is `4096`. This is the second case, so `NR_IRQS` for my configuration is `4352`:

```
~$ dmesg | grep NR_IRQS
[    0.000000] NR_IRQS:4352
```
|
||||
|
||||
In the next step we assign the array of IRQ descriptors to the `desc` variable which we defined at the start of the `early_irq_init` function and calculate the size of the `irq_desc` array with the `ARRAY_SIZE` macro:
|
||||
|
||||
```C
|
||||
desc = irq_desc;
|
||||
count = ARRAY_SIZE(irq_desc);
|
||||
```
|
||||
|
||||
The `irq_desc` array is defined in the same source code file and looks like:
|
||||
|
||||
```C
|
||||
struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
|
||||
[0 ... NR_IRQS-1] = {
|
||||
.handle_irq = handle_bad_irq,
|
||||
.depth = 1,
|
||||
.lock = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
The `irq_desc` is an array of `irq` descriptors. It has three already initialized fields:
|
||||
|
||||
* `handle_irq` - as I already wrote above, this field is the high-level irq-event handler. In our case it is initialized with the `handle_bad_irq` function that is defined in the [kernel/irq/handle.c](https://github.com/torvalds/linux/blob/master/kernel/irq/handle.c) source code file and handles spurious and unhandled irqs;
|
||||
* `depth` - `0` if the IRQ line is enabled and a positive value if it has been disabled at least once;
|
||||
* `lock` - A spin lock used to serialize the accesses to the `IRQ` descriptor.
|
||||
|
||||
Now that we have calculated the count of interrupts and initialized our `irq_desc` array, we start to fill the descriptors in a loop:
|
||||
|
||||
```C
|
||||
for (i = 0; i < count; i++) {
|
||||
desc[i].kstat_irqs = alloc_percpu(unsigned int);
|
||||
alloc_masks(&desc[i], GFP_KERNEL, node);
|
||||
raw_spin_lock_init(&desc[i].lock);
|
||||
lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
|
||||
desc_set_defaults(i, &desc[i], node, NULL);
|
||||
}
|
||||
```
|
||||
|
||||
We go through all the interrupt descriptors and do the following things:
|
||||
|
||||
First of all we allocate a [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable for the `irq` kernel statistics with the `alloc_percpu` macro. This macro allocates one instance of an object of the given type for every processor on the system. You can access kernel statistics from userspace via `/proc/stat`:
|
||||
|
||||
```
|
||||
~$ cat /proc/stat
|
||||
cpu 207907 68 53904 5427850 14394 0 394 0 0 0
|
||||
cpu0 25881 11 6684 679131 1351 0 18 0 0 0
|
||||
cpu1 24791 16 5894 679994 2285 0 24 0 0 0
|
||||
cpu2 26321 4 7154 678924 664 0 71 0 0 0
|
||||
cpu3 26648 8 6931 678891 414 0 244 0 0 0
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Here the sixth numeric column is the time spent servicing interrupts. After this we allocate a [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) for the affinity of the given irq descriptor and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. From now on, the lock will be acquired with a call to `raw_spin_lock` before a [critical section](https://en.wikipedia.org/wiki/Critical_section) and released with a call to `raw_spin_unlock` after it. In the next step we call the `lockdep_set_class` macro, which sets the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in a separate chapter.
|
||||
|
||||
At the end of the loop we call the `desc_set_defaults` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function takes four parameters:
|
||||
|
||||
* number of an irq;
|
||||
* interrupt descriptor;
|
||||
* online `NUMA` node;
|
||||
* owner of the interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is needed to provide a reference count on the module which provides the interrupts;
|
||||
|
||||
and fills the rest of the `irq_desc` fields. The `desc_set_defaults` function fills in the interrupt number, the `irq` chip, platform-specific per-chip private data for the chip methods, per-IRQ data for the `irq_chip` methods and the [MSI](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts) descriptor for the per-`irq` and `irq` chip data:
|
||||
|
||||
```C
|
||||
desc->irq_data.irq = irq;
|
||||
desc->irq_data.chip = &no_irq_chip;
|
||||
desc->irq_data.chip_data = NULL;
|
||||
desc->irq_data.handler_data = NULL;
|
||||
desc->irq_data.msi_desc = NULL;
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
The `irq_data.chip` structure provides a general `API`, such as `irq_set_chip`, `irq_set_irq_type` and so on, for the irq controller [drivers](https://github.com/torvalds/linux/tree/master/drivers/irqchip). You can find it in the [kernel/irq/chip.c](https://github.com/torvalds/linux/blob/master/kernel/irq/chip.c) source code file.
|
||||
|
||||
After this we set the status of the accessor for the given descriptor and set disabled state of the interrupts:
|
||||
|
||||
```C
|
||||
...
|
||||
...
|
||||
...
|
||||
irq_settings_clr_and_set(desc, ~0, _IRQ_DEFAULT_INIT_FLAGS);
|
||||
irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED);
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
In the next step we set the high-level interrupt handler to `handle_bad_irq`, which handles spurious and unhandled irqs (we set this handler because the hardware is not initialized yet), set `irq_desc.depth` to `1`, which means that the `IRQ` is disabled, and reset the count of unhandled interrupts and of interrupts in general:
|
||||
|
||||
```C
|
||||
...
|
||||
...
|
||||
...
|
||||
desc->handle_irq = handle_bad_irq;
|
||||
desc->depth = 1;
|
||||
desc->irq_count = 0;
|
||||
desc->irqs_unhandled = 0;
|
||||
desc->name = NULL;
|
||||
desc->owner = owner;
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
After this we go through all the [possible](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) processors with the [for_each_possible_cpu](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h#L714) helper and set `kstat_irqs` to zero for the given interrupt descriptor:
|
||||
|
||||
```C
|
||||
for_each_possible_cpu(cpu)
|
||||
*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
|
||||
```
|
||||
|
||||
and call the `desc_smp_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c), which initializes the `NUMA` node of the given interrupt descriptor, sets the default `SMP` affinity and clears the `pending_mask` of the given interrupt descriptor, depending on the value of the `CONFIG_GENERIC_PENDING_IRQ` kernel configuration option:
|
||||
|
||||
```C
|
||||
static void desc_smp_init(struct irq_desc *desc, int node)
|
||||
{
|
||||
desc->irq_data.node = node;
|
||||
cpumask_copy(desc->irq_data.affinity, irq_default_affinity);
|
||||
#ifdef CONFIG_GENERIC_PENDING_IRQ
|
||||
cpumask_clear(desc->pending_mask);
|
||||
#endif
|
||||
}
|
||||
```
|
||||
|
||||
At the end of the `early_irq_init` function we return the result of the `arch_early_irq_init` function:
|
||||
|
||||
```C
|
||||
return arch_early_irq_init();
|
||||
```
|
||||
|
||||
This function is defined in the [kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/kernel/apic/vector.c) and contains only one call, to the `arch_early_ioapic_init` function from the [kernel/apic/io_apic.c](https://github.com/torvalds/linux/blob/master/kernel/apic/io_apic.c). As we can understand from its name, the `arch_early_ioapic_init` function performs early initialization of the [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). First of all it checks the number of legacy interrupts with a call to the `nr_legacy_irqs` function. If we have no legacy interrupts with the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) programmable interrupt controller, we set `io_apic_irqs` to `0xffffffffffffffff`:
|
||||
|
||||
```C
|
||||
if (!nr_legacy_irqs())
|
||||
io_apic_irqs = ~0UL;
|
||||
```
|
||||
|
||||
After this we go through all the `I/O APICs` and allocate space for their registers with a call to `alloc_ioapic_saved_registers`:
|
||||
|
||||
```C
|
||||
for_each_ioapic(i)
|
||||
alloc_ioapic_saved_registers(i);
|
||||
```
|
||||
|
||||
And at the end of the `arch_early_ioapic_init` function we go through all the legacy irqs (from `IRQ0` to `IRQ15`) in a loop and allocate space for the `irq_cfg`, which represents the configuration of an irq on the given `NUMA` node:
|
||||
|
||||
```C
|
||||
for (i = 0; i < nr_legacy_irqs(); i++) {
|
||||
cfg = alloc_irq_and_cfg_at(i, node);
|
||||
cfg->vector = IRQ0_VECTOR + i;
|
||||
cpumask_setall(cfg->domain);
|
||||
}
|
||||
```
|
||||
|
||||
That's all.
|
||||
|
||||
Sparse IRQs
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We already saw at the beginning of this part that the implementation of the `early_irq_init` function depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Previously we saw the implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` configuration option is not set; now let's look at its implementation when this option is set. The implementation is very similar, but differs a little. We can see the same variable definitions and the same call to `init_irq_default_affinity` at the beginning of the `early_irq_init` function:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_SPARSE_IRQ
|
||||
int __init early_irq_init(void)
|
||||
{
|
||||
int i, initcnt, node = first_online_node;
|
||||
struct irq_desc *desc;
|
||||
|
||||
init_irq_default_affinity();
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
#else
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
But after this we can see the following call:
|
||||
|
||||
```C
|
||||
initcnt = arch_probe_nr_irqs();
|
||||
```
|
||||
|
||||
The `arch_probe_nr_irqs` function is defined in the [arch/x86/kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/apic/vector.c); it calculates the count of pre-allocated irqs and updates `nr_irqs` with this number. But stop. Why are there pre-allocated irqs? There is an alternative form of interrupts called [Message Signaled Interrupts](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts), available in [PCI](https://en.wikipedia.org/wiki/Conventional_PCI). Instead of being assigned a fixed interrupt request number, the device is allowed to write a message to a particular memory address, which is in fact mapped to the [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs). `MSI` permits a device to allocate `1`, `2`, `4`, `8`, `16` or `32` interrupts and `MSI-X` permits a device to allocate up to `2048` interrupts. Now we know that irqs can be pre-allocated. More about `MSI` will be in a later part, but now let's look at the `arch_probe_nr_irqs` function. We can see a check which limits `nr_irqs` to the number of interrupt vectors for all processors in the system if `nr_irqs` is greater than that, followed by the calculation of `nr`, which represents the number of `MSI` interrupts:
|
||||
|
||||
```C
|
||||
int nr_irqs = NR_IRQS;
|
||||
|
||||
if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
|
||||
nr_irqs = NR_VECTORS * nr_cpu_ids;
|
||||
|
||||
nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;
|
||||
```
|
||||
|
||||
Take a look at the `gsi_top` variable. Each `APIC` is identified by its own `ID` and by the offset where its `IRQs` start. This offset is called the `GSI` base or `Global System Interrupt` base, and the `gsi_top` variable represents it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) (you may remember that we parsed this table in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux Kernel initialization process chapter).
|
||||
|
||||
After this we update `nr` depending on the value of `gsi_top`:
|
||||
|
||||
```C
|
||||
#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
|
||||
if (gsi_top <= NR_IRQS_LEGACY)
|
||||
nr += 8 * nr_cpu_ids;
|
||||
else
|
||||
nr += gsi_top * 16;
|
||||
#endif
|
||||
```
|
||||
|
||||
Then we lower `nr_irqs` to `nr` if `nr` is less than `nr_irqs`, and return the number of legacy irqs:
|
||||
|
||||
```C
|
||||
if (nr < nr_irqs)
|
||||
nr_irqs = nr;
|
||||
|
||||
return nr_legacy_irqs();
|
||||
}
|
||||
```
|
||||
|
||||
The next step after `arch_probe_nr_irqs` is printing information about the number of `IRQs`:
|
||||
|
||||
```C
|
||||
printk(KERN_INFO "NR_IRQS:%d nr_irqs:%d %d\n", NR_IRQS, nr_irqs, initcnt);
|
||||
```
|
||||
|
||||
We can find it in the [dmesg](https://en.wikipedia.org/wiki/Dmesg) output:
|
||||
|
||||
```
|
||||
$ dmesg | grep NR_IRQS
|
||||
[ 0.000000] NR_IRQS:4352 nr_irqs:488 16
|
||||
```
|
||||
|
||||
After this we check that the `nr_irqs` and `initcnt` values are not greater than the maximum allowable number of `irqs`:
|
||||
|
||||
```C
|
||||
if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
|
||||
nr_irqs = IRQ_BITMAP_BITS;
|
||||
|
||||
if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
|
||||
initcnt = IRQ_BITMAP_BITS;
|
||||
```
|
||||
|
||||
where `IRQ_BITMAP_BITS` is equal to `NR_IRQS` if `CONFIG_SPARSE_IRQ` is not set and to `NR_IRQS + 8196` otherwise. In the next step we go over all the interrupt descriptors which need to be allocated, allocate space for each descriptor and insert it into the `irq_desc_tree` [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html):
|
||||
|
||||
```C
|
||||
for (i = 0; i < initcnt; i++) {
|
||||
desc = alloc_desc(i, node, NULL);
|
||||
set_bit(i, allocated_irqs);
|
||||
irq_insert_desc(i, desc);
|
||||
}
|
||||
```
|
||||
|
||||
At the end of the `early_irq_init` function we return the result of the `arch_early_irq_init` function, as we already did in the previous variant where the `CONFIG_SPARSE_IRQ` option was not set:
|
||||
|
||||
```C
|
||||
return arch_early_irq_init();
|
||||
```
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the seventh part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter, and in this part we started to dive into external hardware interrupts. We saw the early initialization of the `irq_desc` structure, which describes an external interrupt and contains information about it such as the list of irq actions, information about the interrupt handler, the interrupt's owner, the count of unhandled interrupts, etc. In the next part we will continue to research external interrupts.
|
||||
|
||||
If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access)
|
||||
* [Enum type](https://en.wikipedia.org/wiki/Enumerated_type)
|
||||
* [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
|
||||
* [critical section](https://en.wikipedia.org/wiki/Critical_section)
|
||||
* [Lock validator](https://lwn.net/Articles/185666/)
|
||||
* [MSI](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts)
|
||||
* [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
|
||||
* [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs)
|
||||
* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259)
|
||||
* [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
|
||||
* [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification)
|
||||
* [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html)
|
||||
* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
|
||||
542
interrupts/interrupts-8.md
Normal file
@@ -0,0 +1,542 @@
|
||||
Interrupts and Interrupt Handling. Part 8.
|
||||
================================================================================
|
||||
|
||||
Non-early initialization of the IRQs
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) we started to dive into the external hardware [interrupts](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). We looked at the implementation of the `early_irq_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) source code file and saw the initialization of the `irq_desc` structure in this function. Recall that the `irq_desc` structure (defined in the [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h#L46)) is the foundation of the interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization code which is related to external hardware interrupts.
|
||||
|
||||
Right after the call to the `early_irq_init` function in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) we can see the call to the `init_IRQ` function. This function is architecture-specific and is defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c). The `init_IRQ` function initializes the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable that is defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) source code file:
|
||||
|
||||
```C
|
||||
...
|
||||
DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
|
||||
[0 ... NR_VECTORS - 1] = -1,
|
||||
};
|
||||
...
|
||||
```
|
||||
|
||||
and represents a `percpu` array of interrupt vector numbers. The `vector_irq_t` is defined in the [arch/x86/include/asm/hw_irq.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/hw_irq.h) and expands to:
|
||||
|
||||
```C
|
||||
typedef int vector_irq_t[NR_VECTORS];
|
||||
```
|
||||
|
||||
where `NR_VECTORS` is the count of vector numbers and, as you may remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter, it is `256` for the [x86_64](https://en.wikipedia.org/wiki/X86-64):
|
||||
|
||||
```C
|
||||
#define NR_VECTORS 256
|
||||
```
|
||||
|
||||
So, at the start of the `init_IRQ` function we fill the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) array with the vector numbers of the `legacy` interrupts:
|
||||
|
||||
```C
|
||||
void __init init_IRQ(void)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; i < nr_legacy_irqs(); i++)
|
||||
per_cpu(vector_irq, 0)[IRQ0_VECTOR + i] = i;
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
This `vector_irq` will be used during the first steps of an external hardware interrupt handling in the `do_IRQ` function from the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c):
|
||||
|
||||
```C
|
||||
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
irq = __this_cpu_read(vector_irq[vector]);
|
||||
|
||||
if (!handle_irq(irq, regs)) {
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
|
||||
exiting_irq();
|
||||
...
|
||||
...
|
||||
return 1;
|
||||
}
|
||||
```
|
||||
|
||||
Why `legacy` here? Actually all interrupts are handled by the modern [IO-APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs) controller. But these interrupts (from `0x30` to `0x3f`) are handled by legacy interrupt controllers like the [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller). If these interrupts are handled by the `I/O APIC`, this vector space will be freed and re-used. Let's look at this code more closely. First of all, `nr_legacy_irqs` is defined in the [arch/x86/include/asm/i8259.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/i8259.h) and just returns the `nr_legacy_irqs` field from the `legacy_pic` structure:
|
||||
|
||||
```C
|
||||
static inline int nr_legacy_irqs(void)
|
||||
{
|
||||
return legacy_pic->nr_legacy_irqs;
|
||||
}
|
||||
```
|
||||
|
||||
This structure is defined in the same header file and represents a non-modern programmable interrupt controller:
|
||||
|
||||
```C
|
||||
struct legacy_pic {
|
||||
int nr_legacy_irqs;
|
||||
struct irq_chip *chip;
|
||||
void (*mask)(unsigned int irq);
|
||||
void (*unmask)(unsigned int irq);
|
||||
void (*mask_all)(void);
|
||||
void (*restore_mask)(void);
|
||||
void (*init)(int auto_eoi);
|
||||
int (*irq_pending)(unsigned int irq);
|
||||
void (*make_irq)(unsigned int irq);
|
||||
};
|
||||
```
|
||||
|
||||
The actual default maximum number of legacy interrupts is represented by the `NR_IRQS_LEGACY` macro from the [arch/x86/include/asm/irq_vectors.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_vectors.h):
|
||||
|
||||
```C
|
||||
#define NR_IRQS_LEGACY 16
|
||||
```
|
||||
|
||||
In the loop we access the `vector_irq` per-cpu array with the `per_cpu` macro at the `IRQ0_VECTOR + i` index and write the legacy irq number there. The `IRQ0_VECTOR` macro is defined in the [arch/x86/include/asm/irq_vectors.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_vectors.h) header file and expands to `0x30`:
|
||||
|
||||
```C
|
||||
#define FIRST_EXTERNAL_VECTOR 0x20
|
||||
|
||||
#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)
|
||||
```
|
||||
|
||||
Why `0x30` here? You may remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter that the first 32 vector numbers, from `0` to `31`, are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture). So we fill `vector_irq` starting at `IRQ0_VECTOR`, which is equal to `0x30` (`48`), and ending at `IRQ0_VECTOR + 15`, which is `0x3f`.
|
||||
|
||||
In the end of the `init_IRQ` function we can see the call of the following function:
|
||||
|
||||
```C
|
||||
x86_init.irqs.intr_init();
|
||||
```
|
||||
|
||||
from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c) source code file. If you have read the [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you may remember the `x86_init` structure. This structure contains a couple of fields which point to functions related to the platform setup (`x86_64` in our case), for example `resources`, related to the memory resources, `mpparse`, related to the parsing of the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification), etc. As we can see, `x86_init` also contains the `irqs` field, which contains the following three fields:
|
||||
|
||||
```C
|
||||
struct x86_init_ops x86_init __initdata =
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
.irqs = {
|
||||
.pre_vector_init = init_ISA_irqs,
|
||||
.intr_init = native_init_IRQ,
|
||||
.trap_init = x86_init_noop,
|
||||
},
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Now we are interested in `native_init_IRQ`. As we can note, the name of the `native_init_IRQ` function contains the `native_` prefix, which means that this function is architecture-specific. It is defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) and executes general initialization of the [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs) and initialization of the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture) irqs. Let's look at the implementation of the `native_init_IRQ` function and try to understand what occurs there. The `native_init_IRQ` function starts with the execution of the following function:
|
||||
|
||||
```C
|
||||
x86_init.irqs.pre_vector_init();
|
||||
```
|
||||
|
||||
As we can see above, `pre_vector_init` points to the `init_ISA_irqs` function that is defined in the same [source code](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) file and, as we can understand from the function's name, initializes the `ISA` related interrupts. The `init_ISA_irqs` function starts with the definition of the `chip` variable, which has the `irq_chip` type:
|
||||
|
||||
```C
|
||||
void __init init_ISA_irqs(void)
|
||||
{
|
||||
struct irq_chip *chip = legacy_pic->chip;
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
The `irq_chip` structure is defined in the [include/linux/irq.h](https://github.com/torvalds/linux/blob/master/include/linux/irq.h) header file and represents a hardware interrupt chip descriptor. It contains:
|
||||
|
||||
* `name` - name of a device. Used in the `/proc/interrupts`:
|
||||
|
||||
```
|
||||
$ cat /proc/interrupts
|
||||
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
|
||||
0: 16 0 0 0 0 0 0 0 IO-APIC 2-edge timer
|
||||
1: 2 0 0 0 0 0 0 0 IO-APIC 1-edge i8042
|
||||
8: 1 0 0 0 0 0 0 0 IO-APIC 8-edge rtc0
|
||||
```
|
||||
|
||||
look at the `IO-APIC` column that follows the per-CPU counters;
|
||||
|
||||
* `(*irq_mask)(struct irq_data *data)` - mask an interrupt source;
|
||||
* `(*irq_ack)(struct irq_data *data)` - start of a new interrupt;
|
||||
* `(*irq_startup)(struct irq_data *data)` - start up the interrupt;
|
||||
* `(*irq_shutdown)(struct irq_data *data)` - shutdown the interrupt
|
||||
* etc.
|
||||
|
||||
fields. Note that the `irq_data` structure represents the set of per irq chip data passed down to the chip functions. It contains `mask` - precomputed bitmask for accessing the chip registers, `irq` - interrupt number, `hwirq` - hardware interrupt number, local to the interrupt domain, `chip` - low level interrupt hardware access, etc.
|
||||
|
||||
After this, if the `CONFIG_X86_64` or the `CONFIG_X86_LOCAL_APIC` kernel configuration option is set, we call the `init_bsp_APIC` function from the [arch/x86/kernel/apic/apic.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/apic/apic.c):
|
||||
|
||||
```C
|
||||
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
|
||||
init_bsp_APIC();
|
||||
#endif
|
||||
```
|
||||
|
||||
This function initializes the [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) of the `bootstrap processor` (the processor which starts first). It starts by checking whether we found an [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) config (read more about it in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux kernel initialization process chapter) and whether the processor has an `APIC`:
|
||||
|
||||
```C
|
||||
if (smp_found_config || !cpu_has_apic)
|
||||
return;
|
||||
```
|
||||
|
||||
If an SMP config was found or the processor has no `APIC`, we return from this function. Otherwise, in the next step we call the `clear_local_APIC` function from the same source code file, which shuts down the local `APIC` (more about it will be in the chapter about the `Advanced Programmable Interrupt Controller`), and enable the `APIC` of the first processor by setting the `APIC_SPIV_APIC_ENABLED` bit in the `unsigned int value`:
|
||||
|
||||
```C
|
||||
value = apic_read(APIC_SPIV);
|
||||
value &= ~APIC_VECTOR_MASK;
|
||||
value |= APIC_SPIV_APIC_ENABLED;
|
||||
```
|
||||
|
||||
and write it back with the help of the `apic_write` function:
|
||||
|
||||
```C
|
||||
apic_write(APIC_SPIV, value);
|
||||
```
|
||||
|
||||
After we have enabled the `APIC` of the bootstrap processor, we return to the `init_ISA_irqs` function, and in the next step we initialize the legacy `Programmable Interrupt Controller` and set the legacy chip and handler for each legacy irq:
|
||||
|
||||
```C
|
||||
legacy_pic->init(0);
|
||||
|
||||
for (i = 0; i < nr_legacy_irqs(); i++)
|
||||
irq_set_chip_and_handler(i, chip, handle_level_irq);
|
||||
```
|
||||
|
||||
Where can we find the `init` function? The `legacy_pic` is defined in the [arch/x86/kernel/i8259.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/i8259.c) and it is:
|
||||
|
||||
```C
|
||||
struct legacy_pic *legacy_pic = &default_legacy_pic;
|
||||
```
|
||||
|
||||
Where the `default_legacy_pic` is:
|
||||
|
||||
```C
|
||||
struct legacy_pic default_legacy_pic = {
|
||||
...
|
||||
...
|
||||
...
|
||||
.init = init_8259A,
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
The `init_8259A` function is defined in the same source code file and executes the initialization of the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) `Programmable Interrupt Controller` (more about it will be in the separate chapter about `Programmable Interrupt Controllers` and `APIC`).
|
||||
|
||||
Now, after the `init_ISA_irqs` function has finished its work, we can return to the `native_init_IRQ` function. The next step is the call to the `apic_intr_init` function, which allocates special interrupt gates that are used by the [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) architecture for the [Inter-processor interrupt](https://en.wikipedia.org/wiki/Inter-processor_interrupt). The `alloc_intr_gate` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) is used for the interrupt descriptor allocation:
|
||||
|
||||
```C
|
||||
#define alloc_intr_gate(n, addr) \
|
||||
do { \
|
||||
alloc_system_vector(n); \
|
||||
set_intr_gate(n, addr); \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
As we can see, first of all it expands to the call of the `alloc_system_vector` function, which checks the given vector number in the `used_vectors` bitmap (read the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) about it) and, if it is not set in the `used_vectors` bitmap, sets it. After this we test whether `first_system_vector` is greater than the given interrupt vector number, and if it is greater we assign the vector number to it:
|
||||
|
||||
```C
|
||||
if (!test_bit(vector, used_vectors)) {
|
||||
set_bit(vector, used_vectors);
|
||||
if (first_system_vector > vector)
|
||||
first_system_vector = vector;
|
||||
} else {
|
||||
BUG();
|
||||
}
|
||||
```
|
||||
|
||||
We already saw the `set_bit` macro, now let's look at `test_bit` and `first_system_vector`. The `test_bit` macro is defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/bitops.h) and looks like this:
|
||||
|
||||
```C
|
||||
#define test_bit(nr, addr) \
|
||||
(__builtin_constant_p((nr)) \
|
||||
? constant_test_bit((nr), (addr)) \
|
||||
: variable_test_bit((nr), (addr)))
|
||||
```
|
||||
|
||||
We can see the [ternary operator](https://en.wikipedia.org/wiki/Ternary_operation) here, which uses the [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) built-in function `__builtin_constant_p` to test whether the given vector number (`nr`) is known at compile time. If `__builtin_constant_p` is unfamiliar to you, we can make a simple test:

```C
#include <stdio.h>

#define PREDEFINED_VAL 1

int main() {
    int i = 5;
    printf("__builtin_constant_p(i) is %d\n", __builtin_constant_p(i));
    printf("__builtin_constant_p(PREDEFINED_VAL) is %d\n", __builtin_constant_p(PREDEFINED_VAL));
    printf("__builtin_constant_p(100) is %d\n", __builtin_constant_p(100));

    return 0;
}
```

and look at the result:

```
$ gcc test.c -o test
$ ./test
__builtin_constant_p(i) is 0
__builtin_constant_p(PREDEFINED_VAL) is 1
__builtin_constant_p(100) is 1
```

Now I think it should be clear. Let's get back to the `test_bit` macro. If `__builtin_constant_p` returns non-zero, we call the `constant_test_bit` function:

```C
static inline int constant_test_bit(int nr, const void *addr)
{
    const u32 *p = (const u32 *)addr;

    return ((1UL << (nr & 31)) & (p[nr >> 5])) != 0;
}
```

and the `variable_test_bit` function otherwise:

```C
static inline int variable_test_bit(int nr, const void *addr)
{
    u8 v;
    const u32 *p = (const u32 *)addr;

    asm("btl %2,%1; setc %0" : "=qm" (v) : "m" (*p), "Ir" (nr));
    return v;
}
```

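
Both helpers answer the same question: is bit `nr` set in the bitmap at `addr`? A portable way to see the indexing they rely on is sketched below. It assumes a bitmap made of 32-bit words, and `test_bit_sketch` is a hypothetical name used only for this illustration:

```C
#include <assert.h>
#include <stdint.h>

/*
 * Bit `nr` lives in the 32-bit word with index nr >> 5 (nr / 32),
 * at position nr & 31 (nr % 32) inside that word.
 */
static int test_bit_sketch(int nr, const void *addr)
{
    const uint32_t *p = (const uint32_t *)addr;

    assert(nr >= 0);
    return ((UINT32_C(1) << (nr & 31)) & p[nr >> 5]) != 0;
}
```

For example, bit `63` is found in word `1` at position `31`.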
What's the difference between these two functions and why do we need two different functions for the same purpose? As you can already guess, the main purpose is optimization. If we write a simple example with these functions:

```C
#define CONST 25

int main() {
    int nr = 24;
    variable_test_bit(nr, (int*)0x10000000);
    constant_test_bit(CONST, (int*)0x10000000);
    return 0;
}
```

and look at the assembly output of our example, we will see the following assembly code:

```assembly
    pushq   %rbp
    movq    %rsp, %rbp

    movl    $268435456, %esi
    movl    $25, %edi
    call    constant_test_bit
```

for the `constant_test_bit`, and:

```assembly
    pushq   %rbp
    movq    %rsp, %rbp

    subq    $16, %rsp
    movl    $24, -4(%rbp)
    movl    -4(%rbp), %eax
    movl    $268435456, %esi
    movl    %eax, %edi
    call    variable_test_bit
```

for the `variable_test_bit`. These two listings start in the same way: first of all we save the base of the current stack frame in the `%rbp` register. After this, the code for the two examples differs. In the first example we put `$268435456` (our second parameter, `0x10000000`) into `esi` and `$25` (our first parameter) into the `edi` register and call `constant_test_bit`. We put the function parameters into the `edi` and `esi` registers because, as we are learning the Linux kernel for the `x86_64` architecture, we use the `System V AMD64 ABI` [calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions). All is pretty simple. When we are using a predefined constant, the compiler can just substitute its value. Now let's look at the second part. As you can see here, the compiler can not substitute the value of the `nr` variable. In this case the compiler must address the variable through its offset in the program's [stack frame](https://en.wikipedia.org/wiki/Call_stack). We subtract `16` from the `rsp` register to allocate stack space for the local variables and put `$24` (the value of the `nr` variable) at offset `-4` from `rbp`. Our stack frame will be like this:

```
         <- stack grows

              %[rbp]
                 |
+----------+ +---------+ +---------+ +--------+
|          | |         | | return  | |        |
|    nr    |-|         |-|         |-|  argc  |
|          | |         | | address | |        |
+----------+ +---------+ +---------+ +--------+
                 |
              %[rsp]
```

After this we load this value into `eax`, so the `eax` register now contains the value of `nr`. In the end we do the same as in the first example: we put `$268435456` (the second parameter of the `variable_test_bit` function) into the `esi` register and the value of `eax` (the value of `nr`, the first parameter) into the `edi` register.

The next step after the `apic_intr_init` function finishes its work is setting the interrupt gates from `FIRST_EXTERNAL_VECTOR` (`0x20`) up to `first_system_vector`, which is just `NR_VECTORS` (`256`) when the local APIC is not configured:

```C
i = FIRST_EXTERNAL_VECTOR;

#ifndef CONFIG_X86_LOCAL_APIC
#define first_system_vector NR_VECTORS
#endif

for_each_clear_bit_from(i, used_vectors, first_system_vector) {
    set_intr_gate(i, irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR));
}
```

But as we are using the `for_each_clear_bit_from` helper, we set only the interrupt gates that are not initialized yet. After this we use the same `for_each_clear_bit_from` helper to fill the remaining unused interrupt gates in the interrupt table with `spurious_interrupt`:

```C
#ifdef CONFIG_X86_LOCAL_APIC
for_each_clear_bit_from(i, used_vectors, NR_VECTORS)
    set_intr_gate(i, spurious_interrupt);
#endif
```

Here the `spurious_interrupt` function is the interrupt handler for a `spurious` interrupt, and `used_vectors` is the bitmap of `unsigned long` words that records the already initialized interrupt gates. We already marked the first `32` interrupt vectors as used in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file:

```C
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
    set_bit(i, used_vectors);
```

You can remember how we did it in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-6.html) of this chapter.

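
The effect of iterating only over clear bits can be modelled in user space. The sketch below is an illustration under assumptions, not the kernel's helper: it uses one byte per vector instead of a real bitmap, the names `next_clear`, `count_free` and `stub_offset` are hypothetical, and the `8`-byte stub size is taken from the `irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR)` expression above:

```C
#include <assert.h>

#define NR_VECTORS            256
#define FIRST_EXTERNAL_VECTOR 0x20

/* One byte per vector keeps the sketch simple; the kernel uses a bitmap. */
static unsigned char used[NR_VECTORS];

/* Find the first clear entry in [from, limit); returns limit if there is none. */
static int next_clear(int from, int limit)
{
    assert(from >= 0 && limit <= NR_VECTORS);
    while (from < limit && used[from])
        from++;
    return from;
}

/* Count the vectors in [from, limit) that a clear-bit loop would visit. */
static int count_free(int from, int limit)
{
    int n = 0;

    for (int i = next_clear(from, limit); i < limit; i = next_clear(i + 1, limit))
        n++;
    return n;
}

/* Byte offset of the entry stub for `vector`, assuming 8-byte stubs. */
static int stub_offset(int vector)
{
    assert(vector >= FIRST_EXTERNAL_VECTOR && vector < NR_VECTORS);
    return 8 * (vector - FIRST_EXTERNAL_VECTOR);
}
```

Vectors that are already marked as used are simply skipped, so gates installed earlier are never overwritten by the loop.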
At the end of the `native_init_IRQ` function we can see the following check:

```C
if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs())
    setup_irq(2, &irq2);
```

First of all let's deal with the condition. The `acpi_ioapic` variable represents the existence of an [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs). It is defined in [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c). This variable is set in the `acpi_set_irq_model_ioapic` function, which is called while the `Multiple APIC Description Table` is processed. This occurs during initialization of the architecture-specific stuff in [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) (we will learn more about it in the other chapter about [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)). Note that the value of the `acpi_ioapic` variable depends on the `CONFIG_ACPI` and `CONFIG_X86_LOCAL_APIC` Linux kernel configuration options. If these options are not set, this variable will be just zero:

```C
#define acpi_ioapic 0
```

The second condition, `!of_ioapic && nr_legacy_irqs()`, checks that we do not use an [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware) `I/O APIC` and that legacy interrupts are present. We already know about `nr_legacy_irqs`. The `of_ioapic` variable is defined in [arch/x86/kernel/devicetree.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/devicetree.c) and initialized in the `dtb_ioapic_setup` function, which builds information about `APICs` from the [devicetree](https://en.wikipedia.org/wiki/Device_tree). Note that the `of_ioapic` variable depends on the `CONFIG_OF` Linux kernel configuration option. If this option is not set, the value of `of_ioapic` will be zero too:

```C
#ifdef CONFIG_OF
extern int of_ioapic;
...
...
...
#else
#define of_ioapic 0
...
...
...
#endif
```

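
Defining a name as the constant `0` when an option is disabled lets the same condition compile in every configuration. A minimal sketch of this pattern follows. It assumes a made-up option `DEMO_CONFIG_OF` and hypothetical names `demo_of_ioapic` and `should_setup_cascade`; it only mirrors the shape of the check in `native_init_IRQ`:

```C
#include <assert.h>

/* DEMO_CONFIG_OF is a made-up switch standing in for a kernel option. */
#ifdef DEMO_CONFIG_OF
extern int demo_of_ioapic;
#else
#define demo_of_ioapic 0
#endif

/* Same shape as the check at the end of native_init_IRQ. */
static int should_setup_cascade(int acpi_ioapic, int legacy_irqs)
{
    assert(legacy_irqs >= 0);
    return !acpi_ioapic && !demo_of_ioapic && legacy_irqs;
}
```

When the option is disabled, `!demo_of_ioapic` is the constant `1`, so the compiler can drop that part of the condition.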
If the condition is true, we call the `setup_irq` function:

```C
setup_irq(2, &irq2);
```

First of all about `irq2`. The `irq2` is an `irqaction` structure that is defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irqinit.c) source code file and represents the `IRQ 2` line, which is used to query devices connected in cascade:

```C
static struct irqaction irq2 = {
    .handler = no_action,
    .name = "cascade",
    .flags = IRQF_NO_THREAD,
};
```

Some time ago the interrupt controller consisted of two chips, with one connected to the other. The second chip was connected to the first chip via this `IRQ 2` line. The second chip serviced lines `8` to `15`, and the remaining lines of the first chip were serviced after them. So, for example, the [Intel 8259A](https://en.wikipedia.org/wiki/Intel_8259) has the following lines:

* `IRQ 0` - system timer;
* `IRQ 1` - keyboard;
* `IRQ 2` - used for devices which are cascade connected;
* `IRQ 8` - [RTC](https://en.wikipedia.org/wiki/Real-time_clock);
* `IRQ 9` - reserved;
* `IRQ 10` - reserved;
* `IRQ 11` - reserved;
* `IRQ 12` - `ps/2` mouse;
* `IRQ 13` - coprocessor;
* `IRQ 14` - hard drive controller;
* `IRQ 15` - reserved;
* `IRQ 3` - `COM2` and `COM4`;
* `IRQ 4` - `COM1` and `COM3`;
* `IRQ 5` - `LPT2`;
* `IRQ 6` - floppy drive controller;
* `IRQ 7` - `LPT1`.

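
The order of the list above follows from the cascade: lines `8` to `15` arrive through `IRQ 2`, so they are serviced between `IRQ 1` and `IRQ 3`. A small sketch of this mapping is shown below. The function names are hypothetical, and the slot given to `IRQ 2` itself is an assumption of the sketch, since that line only carries the cascade:

```C
#include <assert.h>

/* Returns 0 for the first chip and 1 for the second (cascaded) chip. */
static int pic_chip_for_irq(int irq)
{
    assert(irq >= 0 && irq <= 15);
    return irq >= 8;
}

/*
 * Position of `irq` in the service order of the list above:
 * 0, 1, then 8..15 (cascaded through IRQ 2), then 3..7.
 * IRQ 2 shares the slot of the first cascaded line.
 */
static int irq_service_rank(int irq)
{
    assert(irq >= 0 && irq <= 15);
    if (irq < 2)
        return irq;         /* 0 -> 0, 1 -> 1 */
    if (irq == 2)
        return 2;           /* cascade slot */
    if (irq >= 8)
        return irq - 6;     /* 8 -> 2, ..., 15 -> 9 */
    return irq + 7;         /* 3 -> 10, ..., 7 -> 14 */
}
```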
The `setup_irq` function is defined in [kernel/irq/manage.c](https://github.com/torvalds/linux/blob/master/kernel/irq/manage.c) and takes two parameters:

* the number of an interrupt;
* the `irqaction` structure related to the interrupt.

At the beginning, this function gets the interrupt descriptor for the given interrupt number:

```C
struct irq_desc *desc = irq_to_desc(irq);
```

And calls the `__setup_irq` function that sets up the given interrupt:

```C
chip_bus_lock(desc);
retval = __setup_irq(irq, desc, act);
chip_bus_sync_unlock(desc);
return retval;
```

Note that the interrupt descriptor is locked while the `__setup_irq` function does its work. The `__setup_irq` function does many different things: it creates a handler thread when a thread function is supplied and the interrupt does not nest into another interrupt thread, sets the flags of the chip, fills the `irqaction` structure and much more.

In addition to all of the above, it creates the `/proc/irq/<number>` directory and fills it, but if you are using a modern computer all values will be zero there:

```
$ cat /proc/irq/2/node
0

$ cat /proc/irq/2/affinity_hint
00

$ cat /proc/irq/2/spurious
count 0
unhandled 0
last_unhandled 0 ms
```

because the `APIC` probably handles interrupts on our machine.

That's all.

Conclusion
--------------------------------------------------------------------------------

It is the end of the eighth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we started to do it and saw early initialization of the `IRQs`. In this part we saw the non-early interrupt initialization in the `init_IRQ` function. We saw initialization of the `vector_irq` per-cpu array, which stores the vector numbers of the interrupts and will be used during interrupt handling, and initialization of other stuff which is related to the external hardware interrupts.

In the next part we will continue to learn about interrupt handling and will see initialization of the `softirqs`.

If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259)
* [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
* [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture)
* [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification)
* [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs)
* [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs)
* [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing)
* [Inter-processor interrupt](https://en.wikipedia.org/wiki/Inter-processor_interrupt)
* [ternary operator](https://en.wikipedia.org/wiki/Ternary_operation)
* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions)
* [PDF. System V Application Binary Interface AMD64](http://x86-64.org/documentation/abi.pdf)
* [Call stack](https://en.wikipedia.org/wiki/Call_stack)
* [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware)
* [devicetree](https://en.wikipedia.org/wiki/Device_tree)
* [RTC](https://en.wikipedia.org/wiki/Real-time_clock)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html)