Modify all linkers about interrupts

This commit is contained in:
mudongliang
2016-11-21 14:02:36 -05:00
parent 510cbba583
commit 1b60ca0443
17 changed files with 63 additions and 63 deletions

14
Interrupts/README.md Normal file
View File

@@ -0,0 +1,14 @@
# 中断和中断处理
在 linux 内核中你会发现很多关于中断和异常处理的话题
* [中断和中断处理 Part 1.](interrupts-1.md) - 描述中断处理主题
* [深入 Linux 内核中的中断](interrupts-2.md) - 这部分开始描述和初步步骤相关的中断和异常处理。
* [初步中断处理](interrupts-3.md) - 描述初步中断处理。
* [中断处理](interrupts-4.md) - fourth part describes first non-early interrupt handlers.
* [异常处理的实现](interrupts-5.md) - 一些异常处理的实现,比如双重错误、除零等等。
* [处理不可屏蔽中断](interrupts-6.md) - 描述了如何处理不可屏蔽的中断和剩下的一些与特定架构相关的中断。
* [深入外部硬件中断](interrupts-7.md) - 这部分讲述了关于处理外部硬件中断的一些早期初始化代码。
* [IRQs的非早期初始化](interrupts-8.md) - 这部分讲述了处理外部硬件中断的非早期初始化代码。
* [Softirq, Tasklets and Workqueues](interrupts-9.md) - 这部分讲述了softirqs、tasklets 和 workqueues 的内容.
* [](interrupts-10.md) - 这是中断和中断处理的最后一部分,并且我们将会看到一个真实的硬件驱动和中断。

492
Interrupts/interrupts-1.md Normal file
View File

@@ -0,0 +1,492 @@
中断和中断处理 Part 1.
================================================================================
Introduction
--------------------------------------------------------------------------------
这是 [linux 内核揭密](http://xinqiu.gitbooks.io/linux-insides-cn/content/) 这本书最新章节的第一部分。我们已经在这本书前面的[章节](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html)中走过了漫长的道路。从内核初始化的[第一步](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html)开始,结束于第一个 `init` 程序的[启动](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-10.html)。我们见证了一系列与各种内核子系统相关的初始化步骤,但是我们并没有深入这些子系统。在这一章中,我们将会试着去了解这些内核子系统是如何工作和实现的。就像你在这章标题中看到的,第一个子系统是[中断interrupts](http://en.wikipedia.org/wiki/Interrupt)。
什么是中断?
--------------------------------------------------------------------------------
我们已经在这本书的很多地方听到过 `中断interrupts` 这个词,也看到过很多关于中断的例子。在这一章中我们将会从下面的主题开始:
* 什么是 `中断interrupts`
* 什么是 `中断处理interrupt handlers`
我们将会继续深入探讨 `中断` 的细节和 Linux 内核如何处理这些中断。
所以,首先什么是中断?中断就是当软件或者硬件需要使用 CPU 时引发的 `事件event`。比如,当我们在键盘上按下一个键的时候,我们下一步期望做什么?操作系统和电脑应该怎么做?做一个简单的假设,每一个物理硬件都有一根连接 CPU 的中断线,设备可以通过它对 CPU 发起中断信号。但是中断信号并不是直接发送给 CPU。在老机器上中断信号发送给 [PIC](http://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) ,它是一个顺序处理各种设备的各种中断请求的芯片。在新机器上,则是[高级程序中断控制器Advanced Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)做这件事情,即我们熟知的 `APIC`。一个 APIC 包括两个独立的设备:
* `Local APIC`
* `I/O APIC`
第一个设备 - `Local APIC ` 存在于每个CPU核心中Local APIC 负责处理特定于 CPU 的中断配置。Local APIC 常被用于管理来自 APIC 时钟APIC-timer、热敏元件和其他与 I/O 设备连接的设备的中断。
第二个设备 - `I/O APIC` 提供了多核处理器的中断管理。它被用来在所有的 CPU 核心中分发外部中断。更多关于 local 和 I/O APIC 的内容将会在这一节的下面讲到。就如你所知道的,中断可以在任何时间发生。当一个中断发生时,操作系统必须立刻处理它。但是 `处理一个中断` 是什么意思呢?当一个中断发生时,操作系统必须确保下面的步骤顺序:
* 内核必须暂停执行当前进程(取代当前的任务)
* 内核必须搜索中断处理程序并且转交控制权(执行中断处理程序)
* 中断处理程序结束之后,被中断的进程能够恢复执行。
当然,在这个中断处理程序中会涉及到很多错综复杂的过程。但是上面 3 条是这个程序的基本骨架。
每个中断处理程序的地址都保存在一个特殊的位置,这个位置被称为 `中断描述符表Interrupt Descriptor Table` 或者 `IDT`。处理器使用一个唯一的数字来识别中断和异常的类型,这个数字被称为 `中断标识码vector number`。一个中断标识码就是一个 `IDT` 的标识。中断标识码范围是有限的,从 `0``255`。你可以在 Linux 内核源码中找到下面的中断标识码范围检查代码:
```C
BUG_ON((unsigned)n > 0xFF);
```
你可以在 Linux 内核源码中关于中断设置的地方找到这个检查(例如:`set_intr_gate`, `void set_system_intr_gate` 在 [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h)中)。从 `0``31` 的 32 个中断标识码被处理器保留,用作处理架构定义的异常和中断。你可以在 Linux 内核初始化程序的第二部分 - [早期中断和异常处理](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-2.html)中找到这个表和关于这些中断标识码的描述。从 `32``255` 的中断标识码设计为用户定义中断并且不被系统保留。这些中断通常分配给外部 I/O 设备,使这些设备可以发送中断给处理器。
现在,我们来讨论中断的类型。笼统地来讲,我们可以把中断分为两个主要类型:
* 外部或者硬件引起的中断;
* 软件引起的中断。
第一种类型 - 外部中断,由 `Local APIC` 或者与 `Local APIC` 连接的处理器针脚接收。第二种类型 - 软件引起的中断,由处理器自身的特殊情况引起(有时使用特殊架构的指令)。一个常见的关于特殊情况的例子就是 `除零`。另一个例子就是使用 `系统调用syscall` 退出程序。
就如之前提到过的,中断可以在任何时间因为超出代码和 CPU 控制的原因而发生。另一方面,异常和程序执行 `同步synchronous` ,并且可以被分为 3 类:
* `故障Faults`
* `陷入Traps`
* `终止Aborts`
`故障` 是在执行一个“不完善的”指令(可以在之后被修正)之前被报告的异常。如果发生了,它允许被中断的程序继续执行。
接下来的 `陷入` 是一个在执行了 `陷入` 指令后立刻被报告的异常。陷入同样允许被中断的程序继续执行,就像 `故障` 一样。
最后的 `终止` 是一个从不报告引起异常的精确指令的异常,并且不允许被中断的程序继续执行。
我们已经从前面的[部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-3.html)知道,中断可以分为 `可屏蔽的maskable``不可屏蔽的non-maskable`。可屏蔽的中断可以被阻塞,使用 `x86_64` 的指令 - `sti``cli`。我们可以在 Linux 内核代码中找到他们:
```C
static inline void native_irq_disable(void)
{
asm volatile("cli": : :"memory");
}
```
and
```C
static inline void native_irq_enable(void)
{
asm volatile("sti": : :"memory");
}
```
这两个指令修改了在中断寄存器中的 `IF` 标识位。 `sti` 指令设置 `IF` 标识,`cli` 指令清除这个标识。不可屏蔽的中断总是被报告。通常,任何硬件上的失败都映射为不可屏蔽中断。
如果多个异常或者中断同时发生,处理器以事先设定好的中断优先级处理他们。我们可以定义下面表中的从最低到最高的优先级:
```
+----------------------------------------------------------------+
| | |
| Priority | Description |
| | |
+--------------+-------------------------------------------------+
| | Hardware Reset and Machine Checks |
| 1 | - RESET |
| | - Machine Check |
+--------------+-------------------------------------------------+
| | Trap on Task Switch |
| 2 | - T flag in TSS is set |
| | |
+--------------+-------------------------------------------------+
| | External Hardware Interventions |
| | - FLUSH |
| 3 | - STOPCLK |
| | - SMI |
| | - INIT |
+--------------+-------------------------------------------------+
| | Traps on the Previous Instruction |
| 4 | - Breakpoints |
| | - Debug Trap Exceptions |
+--------------+-------------------------------------------------+
| 5 | Nonmaskable Interrupts |
+--------------+-------------------------------------------------+
| 6 | Maskable Hardware Interrupts |
+--------------+-------------------------------------------------+
| 7 | Code Breakpoint Fault |
+--------------+-------------------------------------------------+
| 8 | Faults from Fetching Next Instruction |
| | Code-Segment Limit Violation |
| | Code Page Fault |
+--------------+-------------------------------------------------+
| | Faults from Decoding the Next Instruction |
| | Instruction length > 15 bytes |
| 9 | Invalid Opcode |
| | Coprocessor Not Available |
| | |
+--------------+-------------------------------------------------+
| 10 | Faults on Executing an Instruction |
| | Overflow |
| | Bound error |
| | Invalid TSS |
| | Segment Not Present |
| | Stack fault |
| | General Protection |
| | Data Page Fault |
| | Alignment Check |
| | x87 FPU Floating-point exception |
| | SIMD floating-point exception |
| | Virtualization exception |
+--------------+-------------------------------------------------+
```
现在我们了解了一些关于各种类型的中断和异常的内容,是时候转到更实用的部分了。我们从 `中断描述符表IDT` 开始。就如之前所提到的,`IDT` 保存了中断和异常处理程序的入口指针。`IDT` 是一个类似于 `全局描述符表Global Descriptor Table`的结构,我们在[内核启动程序](http://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-2.html)的第二部分已经介绍过。但是他们确实有一些不同,`IDT` 的表项被称为 `门gates`,而不是 `描述符descriptors`。它可以包含下面的一种:
* 中断门Interrupt gates
* 任务门Task gates
* 陷阱门Trap gates
`x86` 架构中,只有 [long mode](http://en.wikipedia.org/wiki/Long_mode) 中断门和陷阱门可以在 `x86_64` 中引用。就像 `全局描述符表``中断描述符表``x86` 上是一个 8 字节数组门,而在 `x86_64` 上是一个 16 字节数组门。让我们回忆在[内核启动程序](http://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-2.html)的第二部分,`全局描述符表` 必须包含 `NULL` 描述符作为它的第一个元素。与 `全局描述符表` 不一样的是,`中断描述符表` 的第一个元素可以是一个门。它并不是强制要求的。比如,你可能还记得我们只是在早期的章节中过渡到[保护模式](http://en.wikipedia.org/wiki/Protected_mode)时用 `NULL` 门加载过中断描述符表:
```C
/*
* Set up the IDT
*/
static void setup_idt(void)
{
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
}
```
在 [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c)中。`中断描述符表` 可以在线性地址空间和基址的任何地方被加载,只要在 `x86` 上以 8 字节对齐,在 `x86_64` 上以 16 字节对齐。`IDT` 的基址存储在一个特殊的寄存器 - IDTR。在 `x86` 上有两个指令 - 协调工作来修改 `IDTR` 寄存器:
* `LIDT`
* `SIDT`
第一个指令 `LIDT` 用来加载 `IDT` 的基址,即在 `IDTR` 的指定操作数。第二个指令 `SIDT` 用来在指定操作数中读取和存储 `IDTR` 的内容。在 `x86``IDTR` 寄存器是 48 位,包含了下面的信息:
```
+-----------------------------------+----------------------+
| | |
| Base address of the IDT | Limit of the IDT |
| | |
+-----------------------------------+----------------------+
47 16 15 0
```
让我们看看 `setup_idt` 的实现,我们准备了一个 `null_idt`,并且使用 `lidt` 指令把它加载到 `IDTR` 寄存器。注意,`null_idt``gdt_ptr` 类型,后者定义如下:
```C
struct gdt_ptr {
u16 len;
u32 ptr;
} __attribute__((packed));
```
这里我们可以看看 `IDTR` 结构的定义,就像我们在示意图中看到的一样,由 2 字节和 4 字节(共 48 位)的两个域组成。现在,让我们看看 `IDT` 入口结构体,它是一个在 `x86` 中被称为门的 16 字节数组。它拥有下面的结构:
```
127 96
+-------------------------------------------------------------------------------+
| |
| Reserved |
| |
+--------------------------------------------------------------------------------
95 64
+-------------------------------------------------------------------------------+
| |
| Offset 63..32 |
| |
+-------------------------------------------------------------------------------+
63 48 47 46 44 42 39 34 32
+-------------------------------------------------------------------------------+
| | | D | | | | | | |
| Offset 31..16 | P | P | 0 |Type |0 0 0 | 0 | 0 | IST |
| | | L | | | | | | |
-------------------------------------------------------------------------------+
31 16 15 0
+-------------------------------------------------------------------------------+
| | |
| Segment Selector | Offset 15..0 |
| | |
+-------------------------------------------------------------------------------+
```
为了把索引格式化成 IDT 的格式,处理器把异常和中断向量分为 16 个级别。处理器处理异常和中断的发生就像它看到 `call` 指令时处理一个程序调用一样。处理器使用中断或异常的唯一的数字或 `中断标识码` 作为索引来寻找对应的 `中断描述符表` 的条目。现在让我们更近距离地看看 `IDT` 条目。
就像我们所看到的一样,在表中的 `IDT` 条目由下面的域组成:
* `0-15` bits - 段选择器偏移,处理器用它作为中断处理程序的入口指针基址;
* `16-31` bits - 段选择器基址,包含中断处理程序入口指针;
* `IST` - 在 `x86_64` 上的一个新的机制,下面我们会介绍它;
* `DPL` - 描述符特权级;
* `P` - 段存在标志;
* `48-63` bits - 中断处理程序基址的第二部分;
* `64-95` bits - 中断处理程序基址的第三部分;
* `96-127` bits - CPU 保留位.
`Type` 域描述了 `IDT` 条目的类型。有三种不同的中断处理程序:
* 中断门Interrupt gate
* 陷入门Trap gate
* 任务门Task gate
`IST` 或者说是 `Interrupt Stack Table``x86_64` 中的新机制,它用来代替传统的栈切换机制。之前的 `x86` 架构提供的机制可以在响应中断时自动切换栈帧。`IST``x86` 栈切换模式的一个修改版,在它使能之后可以无条件地切换栈,并且可以被任何与确定中断(我们将在下面介绍它)关联的 `IDT` 条目中的中断使能。从这里可以看出,`IST` 并不是所有的中断必须的,一些中断可以继续使用传统的栈切换模式。`IST` 机制在[任务状态段Task State Segment](http://en.wikipedia.org/wiki/Task_state_segment)或者 `TSS` 中提供了 7 个 `IST` 指针。`TSS` 是一个包含进程信息的特殊结构,用来在执行中断或者处理 Linux 内核异常的时候做栈切换。每一个指针都被 `IDT` 中的中断门引用。
`中断描述符表` 使用 `gate_desc` 的数组描述:
```C
extern gate_desc idt_table[];
```
`gate_desc` 定义如下:
```C
#ifdef CONFIG_X86_64
...
...
...
typedef struct gate_struct64 gate_desc;
...
...
...
#endif
```
`gate_struct64` 定义如下:
```C
struct gate_struct64 {
u16 offset_low;
u16 segment;
unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
u16 offset_middle;
u32 offset_high;
u32 zero1;
} __attribute__((packed));
```
`x86_64` 架构中,每一个活动的线程在 Linux 内核中都有一个很大的栈。这个栈的大小由 `THREAD_SIZE` 定义,而且与下面的定义相等:
```C
#define PAGE_SHIFT 12
#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
...
...
...
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
```
`PAGE_SIZE``4096` 字节,`THREAD_SIZE_ORDER` 的值依赖于 `KASAN_STACK_ORDER`。就像我们看到的,`KASAN_STACK` 依赖于 `CONFIG_KASAN` 内核配置参数,它定义如下:
```C
#ifdef CONFIG_KASAN
#define KASAN_STACK_ORDER 1
#else
#define KASAN_STACK_ORDER 0
#endif
```
`KASan` 是一个运行时内存[调试器](http://lwn.net/Articles/618180/)。所以,如果 `CONFIG_KASAN` 被禁用,`THREAD_SIZE``16384` ;如果内核配置选项打开,`THREAD_SIZE` 的值是 `32768`。这块栈空间保存着有用的数据,只要线程是活动状态或者僵尸状态。但是当线程在用户空间的时候,这个内核栈是空的,除非 `thread_info` 结构(关于这个结构的详细信息在 Linux 内核初始程序的第四[部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-4.html))在这个栈空间的底部。活动的或者僵尸线程并不是在他们栈中的唯一的线程,与每一个 CPU 关联的特殊栈也存在于这个空间。当内核在这个 CPU 上执行代码的时候,这些栈处于活动状态;当在这个 CPU 上执行用户空间代码时,这些栈不包含任何有用的信息。每一个 CPU 也有一个特殊的 per-cpu 栈。首先是给外部中断使用的 `中断栈interrupt stack`。它的大小定义如下:
```C
#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)
```
或者是 `16384` 字节。Per-cpu 的中断栈在 `x86_64` 架构中使用 `irq_stack_union` 联合描述:
```C
union irq_stack_union {
char irq_stack[IRQ_STACK_SIZE];
struct {
char gs_base[40];
unsigned long stack_canary;
};
};
```
第一个 `irq_stack` 域是一个 16KB 的数组。然后你可以看到 `irq_stack_union` 联合包含了一个结构体,这个结构体有两个域:
* `gs_base` - 总是指向 `irqstack` 联合底部的 `gs` 寄存器。在 `x86_64` 中, per-cpu更多关于 `per-cpu` 变量的信息可以阅读特定的[章节](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html) 和 stack canary 共享 `gs` 寄存器。所有的 per-cpu 标志初始值为零,并且 `gs` 指向 per-cpu 区域的开始。你已经知道[段内存模式](http://en.wikipedia.org/wiki/Memory_segmentation)已经废除很长时间了,但是我们可以使用[特殊模块寄存器Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register)给这两个段寄存器 - `fs``gs` 设置基址,并且这些寄存器仍然可以被用作地址寄存器。如果你记得 Linux 内核初始程序的第一[部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html),你会记起我们设置了 `gs` 寄存器:
```assembly
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
```
`initial_gs` 指向 `irq_stack_union`:
```assembly
GLOBAL(initial_gs)
.quad INIT_PER_CPU_VAR(irq_stack_union)
```
* `stack_canary` - [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) 对于中断栈来说是一个用来验证栈是否已经被修改的 `栈保护者stack protector``gs_base` 是一个 40 字节的数组,`GCC` 要求 stack canary 在被修正过的偏移量上,并且 `gs` 的值在 `x86_64` 架构上必须是 `40`,在 `x86` 架构上必须是 `20`
`irq_stack_union``percpu` 的第一个数据, 我们可以在 `System.map`中看到它:
```
0000000000000000 D __per_cpu_start
0000000000000000 D irq_stack_union
0000000000004000 d exception_stacks
0000000000009000 D gdt_page
...
...
...
```
我们可以看到它在代码中的定义:
```C
DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;
```
现在,是时候来看 `irq_stack_union` 的初始化过程了。除了 `irq_stack_union` 的定义,我们可以在[arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h)中查看下面的 per-cpu 变量
```C
DECLARE_PER_CPU(char *, irq_stack_ptr);
DECLARE_PER_CPU(unsigned int, irq_count);
```
第一个就是 `irq_stack_ptr`。从这个变量的名字中可以知道,它显然是一个指向这个栈顶的指针。第二个 `irq_count` 用来检查 CPU 是否已经在中断栈。`irq_stack_ptr` 的初始化在[arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c)的 `setup_per_cpu_areas` 函数中:
```C
void __init setup_per_cpu_areas(void)
{
...
...
#ifdef CONFIG_X86_64
for_each_possible_cpu(cpu) {
...
...
...
per_cpu(irq_stack_ptr, cpu) =
per_cpu(irq_stack_union.irq_stack, cpu) +
IRQ_STACK_SIZE - 64;
...
...
...
#endif
...
...
}
```
现在,我们一个一个查看所有 CPU并且设置 `irq_stack_ptr`。事实证明它等于中断栈的顶减去 `64`。为什么是 `64`TODO [[arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c)] 代码如下:
```C
void load_percpu_segment(int cpu)
{
...
...
...
loadsegment(gs, 0);
wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
}
```
就像我们所知道的一样,`gs` 寄存器指向中断栈的栈底:
```assembly
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
GLOBAL(initial_gs)
.quad INIT_PER_CPU_VAR(irq_stack_union)
```
现在我们可以看到 `wrmsr` 指令,这个指令从 `edx:eax` 加载数据到 被 `ecx` 指向的[MSR寄存器]((http://en.wikipedia.org/wiki/Model-specific_register))。在这里MSR寄存器是 `MSR_GS_BASE`,它保存了被 `gs` 寄存器指向的内存段的基址。`edx:eax` 指向 `initial_gs` 的地址,它就是 `irq_stack_union` 的基址。
我们还知道,`x86_64` 有一个叫 `中断栈表Interrupt Stack Table` 或者 `IST` 的组件当发生不可屏蔽中断、双重错误等等的时候这个组件提供了切换到新栈的功能。这可以到达7个 `IST` per-cpu 入口。其中一些如下;
There can be up to seven `IST` entries per-cpu. Some of them are:
* `DOUBLEFAULT_STACK`
* `NMI_STACK`
* `DEBUG_STACK`
* `MCE_STACK`
或者
```C
#define DOUBLEFAULT_STACK 1
#define NMI_STACK 2
#define DEBUG_STACK 3
#define MCE_STACK 4
```
所有被 `IST` 切换到新栈的中断门描述符都由 `set_intr_gate_ist` 函数初始化。例如:
```C
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
...
...
...
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
```
其中 `&nmi``&double_fault` 是中断函数的入口地址:
```C
asmlinkage void nmi(void);
asmlinkage void double_fault(void);
```
定义在 [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S)中
```assembly
idtentry double_fault do_double_fault has_error_code=1 paranoid=2
...
...
...
ENTRY(nmi)
...
...
...
END(nmi)
```
当一个中断或者异常发生时,新的 `ss` 选择器被强制置为 `NULL`,并且 `ss` 选择器的 `rpl` 域被设置为新的 `cpl`。旧的 `ss``rsp`、寄存器标志、`cs``rip` 被压入新栈。在 64 位模型下,中断栈帧大小固定为 8 字节,所以我们可以得到下面的栈:
```
+---------------+
| |
| SS | 40
| RSP | 32
| RFLAGS | 24
| CS | 16
| RIP | 8
| Error code | 0
| |
+---------------+
```
如果在中断门中 `IST` 域不是 `0`,我们把 `IST` 读到 `rsp` 中。如果它关联了一个中断向量错误码,我们再把这个错误码压入栈。如果中断向量没有错误码,就继续并且把虚拟错误码压入栈。我们必须做以上的步骤以确保栈一致性。接下来我们从门描述符中加载段选择器域到 CS 寄存器中,并且通过验证第 `21` 位的值来验证目标代码是一个 64 位代码段,例如 `L` 位在 `全局描述符表Global Descriptor Table`。最后我们从门描述符中加载偏移域到 `rip` 中,`rip` 是中断处理函数的入口指针。然后中断函数开始执行,在中断函数执行结束后,它必须通过 `iret` 指令把控制权交还给被中断进程。`iret` 指令无条件地弹出栈指针(`ss:rsp`)来恢复被中断的进程,并且不会依赖于 `cpl` 改变。
这就是中断的所有过程。
总结
--------------------------------------------------------------------------------
关于 Linux 内核的中断和中断处理的第一部分至此结束。我们初步了解了一些理论和与中断和异常相关的初始化条件。在下一部分,我会接着深入了解中断和中断处理 - 更深入了解她真实的样子。
如果你有任何问题或建议,请给我发评论或者给我发 [Twitter](https://twitter.com/0xAX)。
**请注意英语并不是我的母语,我为任何表达不清楚的地方感到抱歉。如果你发现任何错误请发 PR 到 [linux-insides](https://github.com/MintCN/linux-insides-zh)。(译者注:翻译问题请发 PR 到 [linux-insides-cn](https://www.gitbook.com/book/xinqiu/linux-insides-cn))**
链接
--------------------------------------------------------------------------------
* [PIC](http://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
* [Advanced Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [protected mode](http://en.wikipedia.org/wiki/Protected_mode)
* [long mode](http://en.wikipedia.org/wiki/Long_mode)
* [kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
* [Task State Segement](http://en.wikipedia.org/wiki/Task_state_segment)
* [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation)
* [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register)
* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries)
* [Previous chapter](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html)

503
Interrupts/interrupts-10.md Normal file
View File

@@ -0,0 +1,503 @@
中断和中断处理(十)
=====================
终结篇
-------------------------
本文是 Linux 内核[中断和中断处理](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/index.html)的第十节。在[上一节](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/interrupts-9.html),我们了解了延后中断及其相关概念,如 `softirq``tasklet``workqueue`。本节我们继续深入这个主题,现在是见识真正的硬件驱动的时候了。
以 [StringARM** SA-100/21285 评估板](http://netwinder.osuosl.org/pub/netwinder/docs/intel/datashts/27813501.pdf)串行驱动为例,我们来观察驱动程序如何请求一个 [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) 线,一个中断被触发时会发生什么之类的。驱动程序代码位于 [drivers/tty/serial/21285.c](https://github.com/torvalds/linux/blob/master/drivers/tty/serial/21285.c) 源文件。好啦,源码在手,说走就走!
一个内核模块的初始化
-----------------------------------------------
与本书其他新概念类似为了考察这个驱动程序我们从考察它的初始化过程开始。如你所知Linux 内核为驱动程序或者内核模块的初始化和终止提供了两个宏:
* `module_init`
* `module_exit`
可以在驱动程序的源代码中查阅这些宏的用法:
```C
module_init(serial21285_init);
module_exit(serial21285_exit);
```
大多数驱动程序都能编译成一个可装载的内核[模块](https://en.wikipedia.org/wiki/Loadable_kernel_module),亦或被静态地链入 Linux 内核。前一种情况下,一个设备驱动程序的初始化由 `module_init``module_exit` 宏触发。这些宏定义在 [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) 中:
```C
#define module_init(initfn) \
static inline initcall_t __inittest(void) \
{ return initfn; } \
int init_module(void) __attribute__((alias(#initfn)));
#define module_exit(exitfn) \
static inline exitcall_t __exittest(void) \
{ return exitfn; } \
void cleanup_module(void) __attribute__((alias(#exitfn)));
```
并被 [initcall](http://kernelnewbies.org/Documents/InitcallMechanism) 函数调用:
* `early_initcall`
* `pure_initcall`
* `core_initcall`
* `postcore_initcall`
* `arch_initcall`
* `subsys_initcall`
* `fs_initcall`
* `rootfs_initcall`
* `device_initcall`
* `late_initcall`
这些函数又被 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 中的 `do_initcalls` 函数调用。然而,如果设备驱动程序被静态链入 Linux 内核,那么这些宏的实现则如下所示:
```C
#define module_init(x) __initcall(x);
#define module_exit(x) __exitcall(x);
```
这种情况下,模块装载的实现位于 [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c) 源文件中,而初始化发生在 `do_init_module` 函数内。我们不打算在本章深入探讨可装载模块的细枝末节,而会在一个专门介绍 Linux 内核模块的章节中窥其真容。话说回来,`module_init` 宏接受一个参数 - 本例中这个值是 `serial21285_init`。从函数名可以得知,这个函数做了一些驱动程序初始化的相关工作。请看:
```C
static int __init serial21285_init(void)
{
int ret;
printk(KERN_INFO "Serial: 21285 driver\n");
serial21285_setup_ports();
ret = uart_register_driver(&serial21285_reg);
if (ret == 0)
uart_add_one_port(&serial21285_reg, &serial21285_port);
return ret;
}
```
如你所见,首先它把驱动程序相关信息写入内核缓冲区,然后调用 `serial21285_setup_ports` 函数。该函数设置了 `serial21285_port` 设备的基本 [uart](https://en.wikipedia.org/wiki/Universal_asynchronous_receiver/transmitter) 时钟:
```C
unsigned int mem_fclk_21285 = 50000000;
static void serial21285_setup_ports(void)
{
serial21285_port.uartclk = mem_fclk_21285 / 4;
}
```
此处的 `serial21285` 是描述 `uart` 驱动程序的结构体:
```C
static struct uart_driver serial21285_reg = {
.owner = THIS_MODULE,
.driver_name = "ttyFB",
.dev_name = "ttyFB",
.major = SERIAL_21285_MAJOR,
.minor = SERIAL_21285_MINOR,
.nr = 1,
.cons = SERIAL_21285_CONSOLE,
};
```
如果驱动程序注册成功,我们借助 [drivers/tty/serial/serial_core.c](https://github.com/torvalds/linux/blob/master/drivers/tty/serial/serial_core.c) 源文件中的 `uart_add_one_port` 函数添加由驱动程序定义的端口 `serial21285_port` 结构体,然后从 `serial21285_init` 函数返回:
```C
if (ret == 0)
uart_add_one_port(&serial21285_reg, &serial21285_port);
return ret;
```
到此为止,我们的驱动程序初始化完毕。当一个 `uart` 端口被 [drivers/tty/serial/serial_core.c](https://github.com/torvalds/linux/blob/master/drivers/tty/serial/serial_core.c) 中的 `uart_open` 函数打开,该函数会调用 `uart_startup` 函数来启动这个串行端口,后者会调用 `startup` 函数。它是 `uart_ops` 结构体的一部分。每个 `uart` 驱动程序都会定义这样一个结构体。在本例中,它是这样的:
```C
static struct uart_ops serial21285_ops = {
...
.startup = serial21285_startup,
...
}
```
可以看到,`.startup` 字段是对 `serial21285_startup` 函数的引用。这个函数的实现是我们的关注重点,因为它与中断和中断处理密切相关。
请求中断线
---------------------------------------
我们来看看 `serial21285_startup` 函数的实现:
```C
static int serial21285_startup(struct uart_port *port)
{
int ret;
tx_enabled(port) = 1;
rx_enabled(port) = 1;
ret = request_irq(IRQ_CONRX, serial21285_rx_chars, 0,
serial21285_name, port);
if (ret == 0) {
ret = request_irq(IRQ_CONTX, serial21285_tx_chars, 0,
serial21285_name, port);
if (ret)
free_irq(IRQ_CONRX, port);
}
return ret;
}
```
首先是`TX``RX`。一个设备的串行总线仅由两条线组成:一条用于发送数据,另一条用于接收数据。与此对应,串行设备应该有两个串行引脚:接收器 - `RX` 和发送器 - `TX`。通过调用 `tx_enabled``rx_enalbed` 这两个宏来激活这些线。函数接下来的部分是我们最感兴趣的。注意 `request_irq` 这个函数。它注册了一个中断处理程序,然后激活一条给定的中断线。看一下这个函数的实现细节。该函数定义在 [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) 头文件中,如下所示:
```C
static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
const char *name, void *dev)
{
return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}
```
可以看到,`request_irq` 函数接受五个参数:
* `irq` - 被请求的中断号
* `handler` - 中断处理程序指针
* `flags` - 掩码选项
* `name` - 中断拥有者的名称
* `dev` - 用于共享中断线的指针
现在我们来考察 `request_irq` 函数的调用。可以看到,第一个参数是 `IRQ_CONRX`。我们知道它是中断号,但 `CONRX` 又是什么东西?这个宏定义在 [arch/arm/mach-footbridge/include/mach/irqs.h](https://github.com/torvalds/linux/blob/master/arch/arm/mach-footbridge/include/mach/irqs.h) 头文件中。我们可以在这里找到 `21285` 主板能够产生的全部中断。注意,在第二次调用 `request_irq` 函数时,我们传入了 `IRQ_CONTX` 中断号。我们的驱动程序会在这些中断中处理 `RX``TX` 事件。这些宏的实现很简单:
```C
#define IRQ_CONRX _DC21285_IRQ(0)
#define IRQ_CONTX _DC21285_IRQ(1)
...
...
...
#define _DC21285_IRQ(x) (16 + (x))
```
这个主板的 [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture) 中断号分布在`0``15`这个范围内。因此,我们的中断号就是在此之后的头两个值:`16``17`。在 `request_irq` 函数的两次调用中,第二个参数分别是 `serial21285_rx_chars``serial21285_tx_chars` 函数。当一个 `RX``TX` 中断发生时,这些函数就会被调用。我们不会在此深入探究这些函数,因为本章讲述的是中断与中断处理,而并非设备和驱动。下一个参数是 `flags``request_irq` 函数的两次调用中,它的值都是零。所有合法的 `flags` 都在 [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) 中定义成诸如 `IRQF_*` 此类的宏。一些例子:
* `IRQF_SHARED` - 允许多个设备共享此中断号
* `IRQF_PERCPU` - 此中断号属于单独cpu的(per cpu)
* `IRQF_NO_THREAD` - 中断不能线程化
* `IRQF_NOBALANCING` - 此中断步参与irq平衡时
* `IRQF_IRQPOLL` - 此中断用于轮询
* 等等
这里,我们传入的是 `0`,也就是 `IRQF_TRIGGER_NONE`。这个标志是说,它不配置任何水平触发或边缘触发的中断行为。至于第四个参数(`name`),我们传入 `serial21285_name` ,它定义如下:
```C
static const char serial21285_name[] = "Footbridge UART";
```
它会显示在 `/proc/interrupts` 的输出中。针对最后一个参数,我们传入一个指向 `uart_port` 结构体的指针。对 `request_irq` 函数及其参数有所了解后,我们来看看它的实现。从上文可知,`request_irq` 函数内部只是调用了定义在 [kernel/irq/manage.c](https://github.com/torvalds/linux/blob/master/kernel/irq/manage.c) 源文件中的 `request_threaded_irq` 函数,并分配了一个给定的中断线。该函数起始部分是 `irqaction``irq_desc` 的定义:
```C
int request_threaded_irq(unsigned int irq, irq_handler_t handler,
irq_handler_t thread_fn, unsigned long irqflags,
const char *devname, void *dev_id)
{
struct irqaction *action;
struct irq_desc *desc;
int retval;
...
...
...
}
```
在本章,我们已经见识过 `irqaction``irq_desc` 结构体了。第一个结构体表示一个中断动作描述符,它包含中断处理程序指针,设备名称,中断号等等。第二个结构体表示一个中断描述符,包含指向 `irqaction` 的指针,中断标志等等。注意,`request_threaded_irq` 函数被 `request_irq` 调用时,带了一个额外的参数:`irq_handler_t thread_fn`。如果这个参数不为 `NULL`,它会创建 `irq` 线程,并在该线程中执行给定的 `irq` 处理程序。下一步,我们要做如下检查:
```C
if (((irqflags & IRQF_SHARED) && !dev_id) ||
(!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
return -EINVAL;
```
首先,我们确保共享中断时传入了真正的 `dev_id`(译者注:不然后面搞不清楚哪台设备产生了中断),而且 `IRQF_COND_SUSPEND` 仅对共享中断生效。否则退出函数,返回 `-EINVAL` 错误。之后,我们借助 [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) 源文件中定义的 `irq_to_desc` 函数将给定的 `irq` 中断号转换成 `irq` 中断描述符。如果不成功,则退出函数,返回 `-EINVAL` 错误:
```C
desc = irq_to_desc(irq);
if (!desc)
return -EINVAL;
```
`irq_to_desc` 函数检查给定的 `irq` 中断号是否小于最大中断号,并且返回中断描述符。这里,`irq` 中断号就是 `irq_desc` 数组的偏移量:
```C
struct irq_desc *irq_to_desc(unsigned int irq)
{
return (irq < NR_IRQS) ? irq_desc + irq : NULL;
}
```
由于我们已经把 `irq` 中断号转换成了 `irq` 中断描述符,现在来检查描述符的状态,确保我们可以请求中断:
```C
if (!irq_settings_can_request(desc) || WARN_ON(irq_settings_is_per_cpu_devid(desc)))
return -EINVAL;
```
失败则返回 `-EINVAL` 错误。接着,我们检查给定的中断处理程序(译者注:是指 `handler` 变量)。如果它没被传入 `request_irq` 函数,我们就检查 `thread_fn`。两个都是 `NULL` 则返回 `-EINVAL`。如果中断处理程序没有被传入 `request_irq` 函数而 `thread_fn` 不为空,则把 `handler` 设为 `irq_default_primary_handler`
```C
if (!handler) {
if (!thread_fn)
return -EINVAL;
handler = irq_default_primary_handler;
}
```
下一步,我们通过 `kzalloc` 函数为 `irqaction` 分配内存,若不成功则返回:
```C
action = kzalloc(sizeof(struct irqaction), GFP_KERNEL);
if (!action)
return -ENOMEM;
```
欲知 `kzalloc` 详情,请查阅专门介绍 Linux 内核[内存管理](https://xinqiu.gitbooks.io/linux-insides-cn/content/mm/index.html)的章节。为 `irqaction` 分配空间后,我们即对这个结构体进行初始化,设置它的中断处理程序,中断标志,设备名称等等:
```C
action->handler = handler;
action->thread_fn = thread_fn;
action->flags = irqflags;
action->name = devname;
action->dev_id = dev_id;
```
`request_threaded_irq` 函数末尾,我们调用 [kernel/irq/manage.c](https://github.com/torvalds/linux/blob/master/kernel/irq/manage.c) 中的 `__setup_irq` 函数,并注册一个给定的 `irqaction`。然后释放 `irqaction` 内存并返回:
```C
chip_bus_lock(desc);
retval = __setup_irq(irq, desc, action);
chip_bus_sync_unlock(desc);
if (retval)
kfree(action);
return retval;
```
注意,`__setup_irq` 函数的调用位于 `chip_bus_lock``chip_bus_sync_unlock` 函数之间。这些函数对慢速总线(如 [i2c](https://en.wikipedia.org/wiki/I%C2%B2C))芯片进行锁定/解锁。现在来看看 `__setup_irq` 函数的实现。`__setup_irq` 函数开头是各种检查。首先我们检查给定的中断描述符不为 `NULL``irqchip` 不为 `NULL`,以及给定的中断描述符模块拥有者不为 `NULL`。接下来我们检查中断是否嵌套在其他中断线程中。如果是的,我们则以 `irq_nested_primary_handler` 替换 `irq_default_priamry_handler`
下一步,如果给定的中断不是嵌套的,并且 `thread_fn` 不为空,我们就通过 `kthread_create` 创建了一个中断处理线程。
```C
if (new->thread_fn && !nested) {
struct task_struct *t;
t = kthread_create(irq_thread, new, "irq/%d-%s", irq, new->name);
...
}
```
并在最后为给定的中断描述符的剩余字段赋值。于是,我们的 `16``17` 号中断请求线注册完毕。当一个中断控制器获得这些中断的相关事件时,`serial21285_rx_chars``serial21285_tx_chars` 函数会被调用。现在我们来看一看一个中断发生时到底发生了什么。
准备处理中断
------------------------------
通过上文,我们观察了为给定的中断描述符请求中断号,为给定的中断注册 `irqaction` 结构体的过程。我们已经知道,当一个中断事件发生时,中断控制器向处理器通知该事件,处理器尝试为这个中断找到一个合适的中断门。如果你已阅读本章[第八节](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/interrupts-8.html),你应该还记得 `native_init_IRQ` 函数。这个函数会初始化本地 [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)。这个函数的如下部分是我们现在最感兴趣的地方:
```C
for_each_clear_bit_from(i, used_vectors, first_system_vector) {
set_intr_gate(i, irq_entries_start +
8 * (i - FIRST_EXTERNAL_VECTOR));
}
```
这里,我们从第 `first_system_vector` 位开始,依次向后迭代 `used_vectors` 位图中所有被清除的位:
```C
int first_system_vector = FIRST_SYSTEM_VECTOR; // 0xef
```
并且设置中断门,`i` 是向量号,`irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR)` 是起始地址。仅有一处尚不明了 - `irq_entries_start`。这个符号定义在 [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry_entry_64.S) 汇编文件中,并提供了 `irq` 入口。一起来看:
```assembly
.align 8
ENTRY(irq_entries_start)
vector=FIRST_EXTERNAL_VECTOR
.rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
pushq $(~vector+0x80)
vector=vector+1
jmp common_interrupt
.align 8
.endr
END(irq_entries_start)
```
这里我们可以看到 [GNU 汇编器](https://en.wikipedia.org/wiki/GNU_Assembler)的 `.rept` 指令。这条指令会把 `.endr` 之前的这几行代码重复 `FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR` 次。我们已经知道 `FIRST_SYSTEM_VECTOR` 的值是 `0xef`,而 `FIRST_EXTERNAL_VECTOR` 等于 `0x20`。于是,它将运行:
```python
>>> 0xef - 0x20
207
```
次。在 `.rept` 指令主体中,我们把入口程序地址压入栈中(注意,我们使用负数表示中断向量号,因为正数留作标识[系统调用](https://en.wikipedia.org/wiki/System_call)之用),将 `vector` 变量加 1并跳转到 `common_interrupt` 标签。在 `common_interrupt` 中,我们调整了栈中向量号,执行 `interrupt` 指令,参数是 `do_IRQ`
```assembly
common_interrupt:
addq $-0x80, (%rsp)
interrupt do_IRQ
```
`interrupt` 宏定义在同一个源文件中。它把[通用](https://en.wikipedia.org/wiki/Processor_register)寄存器的值保存在栈中。如果需要,它还会通过 `SWAPGS` 汇编指令在内核中改变用户空间 `gs` 寄存器。它会增加 [per-cpu](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html) 的 `irq_count` 变量,来表明我们处于中断状态,然后调用 `do_IRQ` 函数。该函数定义于 [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c) 源文件中,作用是处理我们的设备中断。让我们一起考察这个函数。`do_IRQ` 函数接受一个参数 - `pt_regs` 结构体,它存放着用户空间寄存器的值:
```C
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
struct pt_regs *old_regs = set_irq_regs(regs);
unsigned vector = ~regs->orig_ax;
unsigned irq;
irq_enter();
exit_idle();
...
...
...
}
```
函数开头调用了 `set_irq_regs` 函数,后者返回被保存的 `per-cpu` 中断寄存器指针。然后又调用 `irq_enter``exit_idle` 函数。第一个函数 `irq_enter` 进入到一个中断上下文,更新 `__preempt_count` 变量。第二个函数 `exit_idle` 检查当前进程是否是 [pid](https://en.wikipedia.org/wiki/Process_identifier) 为 `0``idle` 进程,然后把 `IDLE_END` 传送给 `idle_notifier`
接下来,我们从当前 cpu 中读取 `irq` 值,并调用 `handle_irq` 函数:
```C
irq = __this_cpu_read(vector_irq[vector]);
if (!handle_irq(irq, regs)) {
...
...
...
}
...
...
...
```
`handle_irq` 函数定义于 [arch/x86/kernel/irq_64.c](https://github.com/torvalds/linux/blob/arch/x86/kernel/irq_64.c) 源文件中,它检查给定的中断描述符,然后调用 `generic_handle_irq_desc` 函数:
```C
desc = irq_to_desc(irq);
if (unlikely(!desc))
return false;
generic_handle_irq_desc(irq, desc);
```
该函数又调用中断处理程序:
```C
static inline void generic_handle_irq_desc(unsigned int irq, struct irq_desc *desc)
{
desc->handle_irq(irq, desc);
}
```
但是,停一停……`handle_irq` 是何方神圣,为什么在知道 `irqaction` 指向真正的中断处理程序的情况下,偏偏通过中断描述符调用我们的中断处理程序?实际上,`irq_desc->handle_irq` 是一个用来调用中断处理程序的上层 API。它在[设备树](https://en.wikipedia.org/wiki/Device_tree) 和 [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) 的初始化过程中就设定好了。内核通过它选择正确的函数以及 `irq->actions(s)` 的调用链。就这样,当一个中断发生时,`serial21285_tx_chars` 或者 `serial21285_rx_chars` 函数会被调用。
`do_IRQ` 函数末尾,我们调用 `irq_exit` 函数来退出中断上下文,调用 `set_irq_regs` 函数并传入先前的用户空间寄存器,最后返回:
```C
irq_exit();
set_irq_regs(old_regs);
return 1;
```
我们已经知道,当一个 `IRQ` 工作结束之后,如果有延后中断,它们会被执行。
退出中断
---------------------
好了,中断处理程序执行完毕,我们必须从中断中返回。在 `do_IRQ` 函数将工作处理完毕后,我们将回到 [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry_entry_64.S) 汇编代码的 `ret_from_intr` 标签处。首先,我们通过 `DISABLE_INTERRUPTS` 宏禁止中断,这个宏被扩展成 `cli` 指令,将 [per-cpu](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html) 的 `irq_count` 变量值减 1。记住当我们处于中断上下文的时候这个变量的值是 `1`
```assembly
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
decl PER_CPU_VAR(irq_count)
```
最后一步,我们检查之前的上下文(用户空间或者内核空间),正确地恢复它,然后通过指令退出中断:
```assembly
INTERRUPT_RETURN
```
此处的 `INTERRUPT_RETURN` 宏是:
```C
#define INTERRUPT_RETURN jmp native_iret
```
```assembly
ENTRY(native_iret)
.global native_irq_return_iret
native_irq_return_iret:
iretq
```
本节到此结束。
总结
--------------------------
这里是[中断和中断处理](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/index.html) 章节的第十节的结尾。如你在本节开头读到的那样,这是本章的最后一节。本章开篇阐述了中断理论,我们于是明白了什么是中断,中断的类型,然后也了解了异常以及对这种类型中断的处理,延后中断。最后在本节,我们考察了硬件中断和对这些中断的处理。当然,本节甚至本章都未能覆盖到 Linux 内核中断和中断处理的所有方面。这样并不现实,至少对我而言如此。这是一项浩大工程,不知你作何感想,对我来说,它确实浩大。这个主题远远超出本章讲述的内容,我不确定地球上能否找到一本书可以涵盖这个主题。我们漏掉了关于中断和中断处理的很多内容,但我相信,深入研究中断和中断处理相关的内核源码是个不错的点子。
如果有任何疑问或者建议,撰写评论或者在 [twitter](https://twitter.com/0xAX) 上联系我。
**请注意,英语并非我的母语。任何不便之处,我深感抱歉。如果发现任何错误,请在 [linux-insides](https://github.com/0xAX/linux-insides) 向我发送 PR。(译者注:翻译问题请发送 PR 到 [linux-insides-cn](https://www.gitbook.com/book/xinqiu/linux-insides-cn))**
链接
---------------------------------
* [串行驱动文档](https://www.kernel.org/doc/Documentation/serial/driver)
* [StrongARM** SA-110/21285 评估板](http://netwinder.osuosl.org/pub/netwinder/docs/intel/datashts/27813501.pdf)
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [模块](https://en.wikipedia.org/wiki/Loadable_kernel_module)
* [initcall](http://kernelnewbies.org/Documents/InitcallMechanism)
* [uart](https://en.wikipedia.org/wiki/Universal_asynchronous_receiver/transmitter)
* [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture)
* [内存管理](https://xinqiu.gitbooks.io/linux-insides-cn/content/mm/index.html)
* [i2c](https://en.wikipedia.org/wiki/I%C2%B2C)
* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [GNU 汇编器](https://en.wikipedia.org/wiki/GNU_Assembler)
* [处理器寄存器](https://en.wikipedia.org/wiki/Processor_register)
* [per-cpu](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html)
* [pid](https://en.wikipedia.org/wiki/Process_identifier)
* [设备树](https://en.wikipedia.org/wiki/Device_tree)
* [系统调用](https://en.wikipedia.org/wiki/System_call)
* [上一节](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/interrupts-9.html)

547
Interrupts/interrupts-2.md Normal file
View File

@@ -0,0 +1,547 @@
Interrupts and Interrupt Handling. Part 2.
================================================================================
Start to dive into interrupt and exceptions handling in the Linux kernel
--------------------------------------------------------------------------------
We saw some theory about interrupts and exception handling in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-1.html) and as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you already can note, the previous part mostly described theoretical aspects and in this part we will start to dive directly into the Linux kernel source code. We will start to do it as we did it in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest [code lines](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L292) as we saw it for example in the [Linux kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, but we will start from the earliest code which is related to the interrupts and exceptions. In this part we will try to go through the all interrupts and exceptions related stuff which we can find in the Linux kernel source code.
If you've read the previous parts, you can remember that the earliest place in the Linux kernel `x86_64` architecture-specific source code which is related to the interrupt is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into the [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function by the call of the `setup_idt`:
```C
void go_to_protected_mode(void)
{
...
setup_idt();
...
}
```
The `setup_idt` function is defined in the same source code file as the `go_to_protected_mode` function and just loads the address of the `NULL` interrupts descriptor table:
```C
static void setup_idt(void)
{
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
}
```
where `gdt_ptr` represents a special 48-bit `GDTR` register which must contain the base address of the `Global Descriptor Table`:
```C
struct gdt_ptr {
u16 len;
u32 ptr;
} __attribute__((packed));
```
Of course in our case the `gdt_ptr` does not represent the `GDTR` register, but `IDTR` since we set `Interrupt Descriptor Table`. You will not find an `idt_ptr` structure, because if it had been in the Linux kernel source code, it would have been the same as `gdt_ptr` but with different name. So, as you can understand there is no sense to have two similar structures which differ only by name. You can note here, that we do not fill the `Interrupt Descriptor Table` with entries, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the `IDT` with `NULL`.
After the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the - [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S). You can read more about it in the [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes the transition to protected mode.
We already know from the earliest parts that entry to protected mode is located in the `boot_params.hdr.code32_start` and you can see that we pass the entry of the protected mode and `boot_params` to the `protected_mode_jump` in the end of the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c):
```C
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
```
The `protected_mode_jump` is defined in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S) and gets these two parameters in the `ax` and `dx` registers using one of the [8086](http://en.wikipedia.org/wiki/Intel_8086) calling [conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions):
```assembly
GLOBAL(protected_mode_jump)
...
...
...
.byte 0x66, 0xea # ljmpl opcode
2: .long in_pm32 # offset
.word __BOOT_CS # segment
...
...
...
ENDPROC(protected_mode_jump)
```
where `in_pm32` contains a jump to the 32-bit entry point:
```assembly
GLOBAL(in_pm32)
...
...
jmpl *%eax // %eax contains address of the `startup_32`
...
...
ENDPROC(in_pm32)
```
As you can remember the 32-bit entry point is in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly file, although it contains `_64` in its name. We can see the two similar files in the `arch/x86/boot/compressed` directory:
* `arch/x86/boot/compressed/head_32.S`.
* `arch/x86/boot/compressed/head_64.S`;
But the 32-bit mode entry point is the second file in our case. The first file is not even compiled for `x86_64`. Let's look at the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile):
```
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
...
...
```
We can see here that `head_*` depends on the `$(BITS)` variable which depends on the architecture. You can find it in the [arch/x86/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/Makefile):
```
ifeq ($(CONFIG_X86_32),y)
...
BITS := 32
else
BITS := 64
...
endif
```
Now as we jumped on the `startup_32` from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) we will not find anything related to the interrupt handling here. The `startup_32` contains code that makes preparations before the transition into [long mode](http://en.wikipedia.org/wiki/Long_mode) and directly jumps in to it. The `long mode` entry is located in `startup_64` and it makes preparations before the [kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in the `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c). After the kernel is decompressed, we jump on the `startup_64` from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). In the `startup_64` we start to build identity-mapped pages. After we have built identity-mapped pages, checked the [NX](http://en.wikipedia.org/wiki/NX_bit) bit, setup the `Extended Feature Enable Register` (see in links), and updated the early `Global Descriptor Table` with the `lgdt` instruction, we need to setup `gs` register with the following code:
```assembly
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
```
We already saw this code in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-1.html). First of all pay attention on the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE` which is declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h) and looks like:
```C
#define MSR_GS_BASE 0xc0000101
```
From this we can understand that `MSR_GS_BASE` defines the number of the `model specific register`. Since registers `cs`, `ds`, `es`, and `ss` are not used in the 64-bit mode, their fields are ignored. But we can access memory over `fs` and `gs` registers. The model specific register provides a `back door` to the hidden parts of these segment registers and allows to use 64-bit base address for segment register addressed by the `fs` and `gs`. So the `MSR_GS_BASE` is the hidden part and this part is mapped on the `GS.base` field. Let's look on the `initial_gs`:
```assembly
GLOBAL(initial_gs)
.quad INIT_PER_CPU_VAR(irq_stack_union)
```
We pass `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just concatenates the `init_per_cpu__` prefix with the given symbol. In our case we will get the `init_per_cpu__irq_stack_union` symbol. Let's look at the [linker](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) script. There we can see following definition:
```
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(irq_stack_union);
```
It tells us that the address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where `init_per_cpu__irq_stack_union` and `__per_cpu_load` are what they mean. The first `irq_stack_union` is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to call the `init_per_cpu_var` macro:
```C
DECLARE_INIT_PER_CPU(irq_stack_union);
#define DECLARE_INIT_PER_CPU(var) \
extern typeof(per_cpu_var(var)) init_per_cpu_var(var)
#define init_per_cpu_var(var) init_per_cpu__##var
```
If we expand all macros we will get the same `init_per_cpu__irq_stack_union` as we got after expanding the `INIT_PER_CPU` macro, but you can note that it is not just a symbol, but a variable. Let's look at the `typeof(per_cpu_var(var))` expression. Our `var` is `irq_stack_union` and the `per_cpu_var` macro is defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h):
```C
#define PER_CPU_VAR(var) %__percpu_seg:var
```
where:
```C
#ifdef CONFIG_X86_64
#define __percpu_seg gs
endif
```
So, we are accessing `gs:irq_stack_union` and getting its type which is `irq_union`. Ok, we defined the first variable and know its address, now let's look at the second `__per_cpu_load` symbol. There are a couple of `per-cpu` variables which are located after this symbol. The `__per_cpu_load` is defined in the [include/asm-generic/sections.h](https://github.com/torvalds/linux/blob/master/include/asm-generic-sections.h):
```C
extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
```
and presented base address of the `per-cpu` variables from the data area. So, we know the address of the `irq_stack_union`, `__per_cpu_load` and we know that `init_per_cpu__irq_stack_union` must be placed right after `__per_cpu_load`. And we can see it in the [System.map](http://en.wikipedia.org/wiki/System.map):
```
...
...
...
ffffffff819ed000 D __init_begin
ffffffff819ed000 D __per_cpu_load
ffffffff819ed000 A init_per_cpu__irq_stack_union
...
...
...
```
Now we know about `initial_gs`, so let's look at the code:
```assembly
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
```
Here we specified a model specific register with `MSR_GS_BASE`, put the 64-bit address of the `initial_gs` to the `edx:eax` pair and execute the `wrmsr` instruction for filling the `gs` register with the base address of the `init_per_cpu__irq_stack_union` which will be at the bottom of the interrupt stack. After this we will jump to the C code on the `x86_64_start_kernel` from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code and one of these preparations is filling the early `Interrupt Descriptor Table` with the interrupts handlers entries or `early_idt_handlers`. You can remember it, if you have read the part about the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) and can remember following code:
```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handlers[i]);
load_idt((const struct desc_ptr *)&idt_descr);
```
but I wrote `Early interrupt and exception handling` part when Linux kernel version was - `3.18`. For this day actual version of the Linux kernel is `4.1.0-rc6+` and ` Andy Lutomirski` sent the [patch](https://lkml.org/lkml/2015/6/2/106) and soon it will be in the mainline kernel that changes behaviour for the `early_idt_handlers`. **NOTE** While I wrote this part the [patch](https://github.com/torvalds/linux/commit/425be5679fd292a3c36cb1fe423086708a99f11a) already turned in the Linux kernel source code. Let's look on it. Now the same part looks like:
```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handler_array[i]);
load_idt((const struct desc_ptr *)&idt_descr);
```
AS you can see it has only one difference in the name of the array of the interrupts handlers entry points. Now it is `early_idt_handler_arry`:
```C
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
```
where `NUM_EXCEPTION_VECTORS` and `EARLY_IDT_HANDLER_SIZE` are defined as:
```C
#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9
```
So, the `early_idt_handler_array` is an array of the interrupts handlers entry points and contains one entry point on every nine bytes. You can remember that previous `early_idt_handlers` was defined in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). The `early_idt_handler_array` is defined in the same source code file too:
```assembly
ENTRY(early_idt_handler_array)
...
...
...
ENDPROC(early_idt_handler_common)
```
It fills `early_idt_handler_arry` with the `.rept NUM_EXCEPTION_VECTORS` and contains entry of the `early_make_pgtable` interrupt handler (more about its implementation you can read in the part about [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). For now we come to the end of the `x86_64` architecture-specific code and the next part is the generic kernel code. Of course you already can know that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code.
Setting stack canary for the interrupt stack
-------------------------------------------------------------------------------
The next stop after the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) is the biggest `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). If you've read the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you must remember it. This function does all initialization stuff before kernel will launch first `init` process with the [pid](https://en.wikipedia.org/wiki/Process_identifier) - `1`. The first thing that is related to the interrupts and exceptions handling is the call of the `boot_init_stack_canary` function.
This function sets the [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) value to protect interrupt stack overflow. We already saw a little some details about implementation of the `boot_init_stack_canary` in the previous part and now let's take a closer look on it. You can find implementation of this function in the [arch/x86/include/asm/stackprotector.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/stackprotector.h) and its depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set this function will not do anything:
```C
#ifdef CONFIG_CC_STACKPROTECTOR
...
...
...
#else
static inline void boot_init_stack_canary(void)
{
}
#endif
```
If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts from the check stat `irq_stack_union` that represents [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) interrupt stack has offset equal to forty bytes from the `stack_canary` value:
```C
#ifdef CONFIG_X86_64
BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
```
As we can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-1.html) the `irq_stack_union` represented by the following union:
```C
union irq_stack_union {
char irq_stack[IRQ_STACK_SIZE];
struct {
char gs_base[40];
unsigned long stack_canary;
};
};
```
which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro).
After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
```C
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);
```
and write `canary` value to the `irq_stack_union` with the `this_cpu_write` macro:
```C
this_cpu_write(irq_stack_union.stack_canary, canary);
```
more about `this_cpu_*` operation you can read in the [Linux kernel documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt).
Disabling/Enabling local interrupts
--------------------------------------------------------------------------------
The next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) which is related to the interrupts and interrupts handling after we have set the `canary` value to the interrupt stack - is the call of the `local_irq_disable` macro.
This macro defined in the [include/linux/irqflags.h](https://github.com/torvalds/linux/blob/master/include/linux/irqflags.h) header file and as you can understand, we can disable interrupts for the CPU with the call of this macro. Let's look on its implementation. First of all note that it depends on the `CONFIG_TRACE_IRQFLAGS_SUPPORT` kernel configuration option:
```C
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
...
#define local_irq_disable() \
do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
...
#else
...
#define local_irq_disable() do { raw_local_irq_disable(); } while (0)
...
#endif
```
They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` for tracing `hardirq` and `softirq` state. In our case `lockdep` subsystem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The `trace_hardirqs_off` function defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c):
```C
void trace_hardirqs_off(void)
{
trace_hardirqs_off_caller(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off);
```
and just calls `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` field of the current process and increases the `redundant_hardirqs_off` if call of the `local_irq_disable` was redundant or the `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_insides.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_insides.h) and located in the `lockdep_stats` structure:
```C
struct lockdep_stats {
...
...
...
int softirqs_off_events;
int redundant_softirqs_off;
...
...
...
}
```
If you will set `CONFIG_DEBUG_LOCKDEP` kernel configuration option, the `lockdep_stats_debug_show` function will write all tracing information to the `/proc/lockdep`:
```C
static void lockdep_stats_debug_show(struct seq_file *m)
{
#ifdef CONFIG_DEBUG_LOCKDEP
unsigned long long hi1 = debug_atomic_read(hardirqs_on_events),
hi2 = debug_atomic_read(hardirqs_off_events),
hr1 = debug_atomic_read(redundant_hardirqs_on),
...
...
...
seq_printf(m, " hardirq on events: %11llu\n", hi1);
seq_printf(m, " hardirq off events: %11llu\n", hi2);
seq_printf(m, " redundant hardirq ons: %11llu\n", hr1);
#endif
}
```
and you can see its result with the:
```
$ sudo cat /proc/lockdep
hardirq on events: 12838248974
hardirq off events: 12838248979
redundant hardirq ons: 67792
redundant hardirq offs: 3836339146
softirq on events: 38002159
softirq off events: 38002187
redundant softirq ons: 0
redundant softirq offs: 0
```
Ok, now we know a little about tracing, but more info will be in the separate part about `lockdep` and `tracing`. You can see that the both `local_disable_irq` macros have the same part - `raw_local_irq_disable`. This macro defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) and expands to the call of the:
```C
static inline void native_irq_disable(void)
{
asm volatile("cli": : :"memory");
}
```
And you already must remember that `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines ability of a processor to handle an interrupt or an exception. Besides the `local_irq_disable`, as you already can know there is an inverse macro - `local_irq_enable`. This macro has the same tracing mechanism and very similar on the `local_irq_enable`, but as you can understand from its name, it enables interrupts with the `sti` instruction:
```C
static inline void native_irq_enable(void)
{
asm volatile("sti": : :"memory");
}
```
Now we know how `local_irq_disable` and `local_irq_enable` work. It was the first call of the `local_irq_disable` macro, but we will meet these macros many times in the Linux kernel source code. But for now we are in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and we just disabled `local` interrupts. Why local and why we did it? Previously kernel provided a method to disable interrupts on all processors and it was called `cli`. This function was [removed](https://lwn.net/Articles/291956/) and now we have `local_irq_{enabled,disable}` to disable or enable interrupts on the current processor. After we've disabled the interrupts with the `local_irq_disable` macro, we set the:
```C
early_boot_irqs_disabled = true;
```
The `early_boot_irqs_disabled` variable defined in the [include/linux/kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h):
```C
extern bool early_boot_irqs_disabled;
```
and used in the different places. For example it used in the `smp_call_function_many` function from the [kernel/smp.c](https://github.com/torvalds/linux/blob/master/kernel/smp.c) for the checking possible deadlock when interrupts are disabled:
```C
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress && !early_boot_irqs_disabled);
```
Early trap initialization during kernel initialization
--------------------------------------------------------------------------------
The next functions after the `local_disable_irq` are `boot_cpu_init` and `page_address_init`, but they are not related to the interrupts and exceptions (more about this functions you can read in the chapter about Linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)). The next is the `setup_arch` function. As you can remember this function located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel.setup.c) source code file and makes initialization of many different architecture-dependent [stuff](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). The first interrupts related function which we can see in the `setup_arch` is the - `early_trap_init` function. This function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and fills `Interrupt Descriptor Table` with the couple of entries:
```C
void __init early_trap_init(void)
{
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
#ifdef CONFIG_X86_32
set_intr_gate(X86_TRAP_PF, page_fault);
#endif
load_idt(&idt_descr);
}
```
Here we can see calls of three different functions:
* `set_intr_gate_ist`
* `set_system_intr_gate_ist`
* `set_intr_gate`
All of these functions defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) and do the similar thing but not the same. The first `set_intr_gate_ist` function inserts new an interrupt gate in the `IDT`. Let's look on its implementation:
```C
static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
}
```
First of all we can see the check that `n` which is [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt is not greater than `0xff` or 255. We need to check it because we remember from the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-1.html) that vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function that sets a given interrupt gate to the `IDT` table:
```C
static inline void _set_gate(int gate, unsigned type, void *addr,
unsigned dpl, unsigned ist, unsigned seg)
{
gate_desc s;
pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
write_idt_entry(idt_table, gate, &s);
write_trace_idt_entry(gate, &s);
}
```
Here we start from the `pack_gate` function which takes clean `IDT` entry represented by the `gate_desc` structure and fills it with the base address and limit, [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks), [Privilege level](http://en.wikipedia.org/wiki/Privilege_level), type of an interrupt which can be one of the following values:
* `GATE_INTERRUPT`
* `GATE_TRAP`
* `GATE_CALL`
* `GATE_TASK`
and set the present bit for the given `IDT` entry:
```C
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
unsigned dpl, unsigned ist, unsigned seg)
{
gate->offset_low = PTR_LOW(func);
gate->segment = __KERNEL_CS;
gate->ist = ist;
gate->p = 1;
gate->dpl = dpl;
gate->zero0 = 0;
gate->zero1 = 0;
gate->type = type;
gate->offset_middle = PTR_MIDDLE(func);
gate->offset_high = PTR_HIGH(func);
}
```
After this we write just filled interrupt gate to the `IDT` with the `write_idt_entry` macro which expands to the `native_write_idt_entry` and just copy the interrupt gate to the `idt_table` table by the given index:
```C
#define write_idt_entry(dt, entry, g) native_write_idt_entry(dt, entry, g)
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
memcpy(&idt[entry], gate, sizeof(*gate));
}
```
where `idt_table` is just array of `gate_desc`:
```C
extern gate_desc idt_table[];
```
That's all. The second `set_system_intr_gate_ist` function has only one difference from the `set_intr_gate_ist`:
```C
static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}
```
Do you see it? Look on the fourth parameter of the `_set_gate`. It is `0x3`. In the `set_intr_gate` it was `0x0`. We know that this parameter represent `DPL` or privilege level. We also know that `0` is the highest privilege level and `3` is the lowest.Now we know how `set_system_intr_gate_ist`, `set_intr_gate_ist`, `set_intr_gate` are work and we can return to the `early_trap_init` function. Let's look on it again:
```C
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
```
We set two `IDT` entries for the `#DB` interrupt and `int3`. These functions takes the same set of parameters:
* vector number of an interrupt;
* address of an interrupt handler;
* interrupt stack table index.
That's all. More about interrupts and handlers you will know in the next parts.
Conclusion
--------------------------------------------------------------------------------
It is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw the some theory in the previous part and started to dive into interrupts and exceptions handling in the current part. We have started from the earliest parts in the Linux kernel source code which are related to the interrupts. In the next part we will continue to dive into this interesting theme and will know more about interrupt handling process.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table)
* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
* [List of x86 calling conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions)
* [8086](http://en.wikipedia.org/wiki/Intel_8086)
* [Long mode](http://en.wikipedia.org/wiki/Long_mode)
* [NX](http://en.wikipedia.org/wiki/NX_bit)
* [Extended Feature Enable Register](http://en.wikipedia.org/wiki/Control_register#Additional_Control_registers_in_x86-64_series)
* [Model-specific register](http://en.wikipedia.org/wiki/Model-specific_register)
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
* [lockdep](http://lwn.net/Articles/321663/)
* [irqflags tracing](https://www.kernel.org/doc/Documentation/irqflags-tracing.txt)
* [IF](http://en.wikipedia.org/wiki/Interrupt_flag)
* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries)
* [Union type](http://en.wikipedia.org/wiki/Union_type)
* [this_cpu_* operations](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)
* [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table)
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
* [Privilege level](http://en.wikipedia.org/wiki/Privilege_level)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-1.html)

522
Interrupts/interrupts-3.md Normal file
View File

@@ -0,0 +1,522 @@
Interrupts and Interrupt Handling. Part 3.
================================================================================
Exception Handling
--------------------------------------------------------------------------------
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) about an interrupts and an exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) we stopped at the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) source code file.
We already know that this function executes initialization of architecture-specfic stuff. In our case the `setup_arch` function does [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related initializations. The `setup_arch` is big function, and in the previous part we stopped on the setting of the two exceptions handlers for the two following exceptions:
* `#DB` - debug exception, transfers control from the interrupted process to the debug handler;
* `#BP` - breakpoint exception, caused by the `int 3` instruction.
These exceptions allow the `x86_64` architecture to have early exception processing for the purpose of debugging via the [kgdb](https://en.wikipedia.org/wiki/KGDB).
As you can remember we set these exceptions handlers in the `early_trap_init` function:
```C
void __init early_trap_init(void)
{
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
load_idt(&idt_descr);
}
```
from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part and now we will look on the implementation of these two exceptions handlers.
Debug and Breakpoint exceptions
--------------------------------------------------------------------------------
Ok, we setup exception handlers in the `early_trap_init` function for the `#DB` and `#BP` exceptions and now time is to consider their implementations. But before we will do this, first of all let's look on details of these exceptions.
The first exceptions - `#DB` or `debug` exception occurs when a debug event occurs. For example - attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers that were presented in `x86` processors starting from the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386) processor and as you can understand from name of this CPU extension, main purpose of these registers is debugging.
These registers allow to set breakpoints on the code and read or write data to trace it. Debug registers may be accessed only in the privileged mode and an attempt to read or write the debug registers when executing at any other privilege level causes a [general protection fault](https://en.wikipedia.org/wiki/General_protection_fault) exception. That's why we have used `set_intr_gate_ist` for the `#DB` exception, but not the `set_system_intr_gate_ist`.
The verctor number of the `#DB` exceptions is `1` (we pass it as `X86_TRAP_DB`) and as we may read in specification, this exception has no error code:
```
+-----------------------------------------------------+
|Vector|Mnemonic|Description |Type |Error Code|
+-----------------------------------------------------+
|1 | #DB |Reserved |F/T |NO |
+-----------------------------------------------------+
```
The second exception is `#BP` or `breakpoint` exception occurs when processor executes the [int 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. Unlike the `DB` exception, the `#BP` exception may occur in userspace. We can add it anywhere in our code, for example let's look on the simple program:
```C
// breakpoint.c
#include <stdio.h>
int main() {
int i;
while (i < 6){
printf("i equal to: %d\n", i);
__asm__("int3");
++i;
}
}
```
If we will compile and run this program, we will see following output:
```
$ gcc breakpoint.c -o breakpoint
i equal to: 0
Trace/breakpoint trap
```
But if will run it with gdb, we will see our breakpoint and can continue execution of our program:
```
$ gdb breakpoint
...
...
...
(gdb) run
Starting program: /home/alex/breakpoints
i equal to: 0
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 1
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 2
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
...
...
...
```
From this moment we know a little about these two exceptions and we can move on to consideration of their handlers.
Preparation before an exception handler
--------------------------------------------------------------------------------
As you may note before, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions takes an addresses of exceptions handlers in theirs second parameter. In or case our two exception handlers will be:
* `debug`;
* `int3`.
You will not find these functions in the C code. all of that could be found in the kernel's `*.c/*.h` files only definition of these functions which are located in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h) kernel header file:
```C
asmlinkage void debug(void);
```
and
```C
asmlinkage void int3(void);
```
You may note `asmlinkage` directive in definitions of these functions. The directive is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack.
So, both handlers are defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file with the `idtentry` macro:
```assembly
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
```
and
```assembly
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
```
Each exception handler may be consists from two parts. The first part is generic part and it is the same for all exception handlers. An exception handler should to save [general purpose registers](https://en.wikipedia.org/wiki/Processor_register) on the stack, switch to kernel stack if an exception came from userspace and transfer control to the second part of an exception handler. The second part of an exception handler does certain work depends on certain exception. For example page fault exception handler should find virtual page for given address, invalid opcode exception handler should send `SIGILL` [signal](https://en.wikipedia.org/wiki/Unix_signal) and etc.
As we just saw, an exception handler starts from definition of the `idtentry` macro from the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file, so let's look at implementation of this macro. As we may see, the `idtentry` macro takes five arguments:
* `sym` - defines global symbol with the `.globl name` which will be an an entry of exception handler;
* `do_sym` - symbol name which represents a secondary entry of an exception handler;
* `has_error_code` - information about existence of an error code of exception.
The last two parameters are optional:
* `paranoid` - shows us how we need to check current mode (will see explanation in details later);
* `shift_ist` - shows us is an exception running at `Interrupt Stack Table`.
Definition of the `.idtentry` macro looks:
```assembly
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm
```
Before we will consider internals of the `idtentry` macro, we should to know state of stack when an exception occurs. As we may read in the [Intel® 64 and IA-32 Architectures Software Developers Manual 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html), the state of stack when an exception occurs is following:
```
+------------+
+40 | %SS |
+32 | %RSP |
+24 | %RFLAGS |
+16 | %CS |
+8 | %RIP |
0 | ERROR CODE | <-- %RSP
+------------+
```
Now we may start to consider implementation of the `idtmacro`. Both `#DB` and `BP` exception handlers are defined as:
```assembly
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
```
If we will look at these definitions, we may know that compiler will generate two routines with `debug` and `int3` names and both of these exception handlers will call `do_debug` and `do_int3` secondary handlers after some preparation. The third parameter defines existence of error code and as we may see both our exception do not have them. As we may see on the diagram above, processor pushes error code on stack if an exception provides it. In our case, the `debug` and `int3` exception do not have error codes. This may bring some difficulties because stack will look differently for exceptions which provides error code and for exceptions which not. That's why implementation of the `idtentry` macro starts from putting a fake error code to the stack if an exception does not provide it:
```assembly
.ifeq \has_error_code
pushq $-1
.endif
```
But it is not only fake error-code. Moreover the `-1` also represents invalid system call number, so that the system call restart logic will not be triggered.
The last two parameters of the `idtentry` macro `shift_ist` and `paranoid` allow to know do an exception handler runned at stack from `Interrupt Stack Table` or not. You already may know that each kernel thread in the system has own stack. In addition to these stacks, there are some specialized stacks associated with each processor in the system. One of these stacks is - exception stack. The [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture provides special feature which is called - `Interrupt Stack Table`. This feature allows to switch to a new stack for designated events such as an atomic exceptions like `double fault` and etc. So the `shift_ist` parameter allows us to know do we need to switch on `IST` stack for an exception handler or not.
The second parameter - `paranoid` defines the method which helps us to know did we come from userspace or not to an exception handler. The easiest way to determine this is to via `CPL` or `Current Privilege Level` in `CS` segment register. If it is equal to `3`, we came from userspace, if zero we came from kernel space:
```
testl $3,CS(%rsp)
jnz userspace
...
...
...
// we are from the kernel space
```
But unfortunately this method does not give a 100% guarantee. As described in the kernel documentation:
> if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
> which might have triggered right after a normal entry wrote CS to the
> stack but before we executed SWAPGS, then the only safe way to check
> for GS is the slower method: the RDMSR.
In other words for example `NMI` could happen inside the critical section of a [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. In this way we should check value of the `MSR_GS_BASE` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register) which stores pointer to the start of per-cpu area. So to check did we come from userspace or not, we should to check value of the `MSR_GS_BASE` model specific register and if it is negative we came from kernel space, in other way we came from userspace:
```assembly
movl $MSR_GS_BASE,%ecx
rdmsr
testl %edx,%edx
js 1f
```
In first two lines of code we read value of the `MSR_GS_BASE` model specific register into `edx:eax` pair. We can't set negative value to the `gs` from userspace. But from other side we know that direct mapping of the physical memory starts from the `0xffff880000000000` virtual address. In this way, `MSR_GS_BASE` will contain an address from `0xffff880000000000` to `0xffffc7ffffffffff`. After the `rdmsr` instruction will be executed, the smallest possible value in the `%edx` register will be - `0xffff8800` which is `-30720` in unsigned 4 bytes. That's why kernel space `gs` which points to start of `per-cpu` area will contain negative value.
After we pushed fake error code on the stack, we should allocate space for general purpose registers with:
```assembly
ALLOC_PT_GPREGS_ON_STACK
```
macro which is defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) header file. This macro just allocates 15*8 bytes space on the stack to preserve general purpose registers:
```assembly
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
addq $-(15*8+\addskip), %rsp
.endm
```
So the stack will look like this after execution of the `ALLOC_PT_GPREGS_ON_STACK`:
```
+------------+
+160 | %SS |
+152 | %RSP |
+144 | %RFLAGS |
+136 | %CS |
+128 | %RIP |
+120 | ERROR CODE |
|------------|
+112 | |
+104 | |
+96 | |
+88 | |
+80 | |
+72 | |
+64 | |
+56 | |
+48 | |
+40 | |
+32 | |
+24 | |
+16 | |
+8 | |
+0 | | <- %RSP
+------------+
```
After we allocated space for general purpose registers, we do some checks to understand did an exception come from userspace or not and if yes, we should move back to an interrupted process stack or stay on exception stack:
```assembly
.if \paranoid
.if \paranoid == 1
testb $3, CS(%rsp)
jnz 1f
.endif
call paranoid_entry
.else
call error_entry
.endif
```
Let's consider all of these there cases in course.
An exception occured in userspace
--------------------------------------------------------------------------------
In the first let's consider a case when an exception has `paranoid=1` like our `debug` and `int3` exceptions. In this case we check selector from `CS` segment register and jump at `1f` label if we came from userspace or the `paranoid_entry` will be called in other way.
Let's consider first case when we came from userspace to an exception handler. As described above we should jump at `1` label. The `1` label starts from the call of the
```assembly
call error_entry
```
routine which saves all general purpose registers in the previously allocated area on the stack:
```assembly
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
```
These both macros are defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) header file and just move values of general purpose registers to a certain place at the stack, for example:
```assembly
.macro SAVE_EXTRA_REGS offset=0
movq %r15, 0*8+\offset(%rsp)
movq %r14, 1*8+\offset(%rsp)
movq %r13, 2*8+\offset(%rsp)
movq %r12, 3*8+\offset(%rsp)
movq %rbp, 4*8+\offset(%rsp)
movq %rbx, 5*8+\offset(%rsp)
.endm
```
After execution of `SAVE_C_REGS` and `SAVE_EXTRA_REGS` the stack will look:
```
+------------+
+160 | %SS |
+152 | %RSP |
+144 | %RFLAGS |
+136 | %CS |
+128 | %RIP |
+120 | ERROR CODE |
|------------|
+112 | %RDI |
+104 | %RSI |
+96 | %RDX |
+88 | %RCX |
+80 | %RAX |
+72 | %R8 |
+64 | %R9 |
+56 | %R10 |
+48 | %R11 |
+40 | %RBX |
+32 | %RBP |
+24 | %R12 |
+16 | %R13 |
+8 | %R14 |
+0 | %R15 | <- %RSP
+------------+
```
After the kernel saved general purpose registers at the stack, we should check that we came from userspace space again with:
```assembly
testb $3, CS+8(%rsp)
jz .Lerror_kernelspace
```
because we may have potentially fault if as described in documentation truncated `%RIP` was reported. Anyway, in both cases the [SWAPGS](http://www.felixcloutier.com/x86/SWAPGS.html) instruction will be executed and values from `MSR_KERNEL_GS_BASE` and `MSR_GS_BASE` will be swapped. From this moment the `%gs` register will point to the base address of kernel structures. So, the `SWAPGS` instruction is called and it was main point of the `error_entry` routing.
Now we can back to the `idtentry` macro. We may see following assembler code after the call of `error_entry`:
```assembly
movq %rsp, %rdi
call sync_regs
```
Here we put base address of stack pointer `%rdi` register which will be first argument (according to [x86_64 ABI](https://www.uclibc.org/docs/psABI-x86_64.pdf)) of the `sync_regs` function and call this function which is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file:
```C
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
{
struct pt_regs *regs = task_pt_regs(current);
*regs = *eregs;
return regs;
}
```
This function takes the result of the `task_ptr_regs` macro which is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) header file, stores it in the stack pointer and return it. The `task_ptr_regs` macro expands to the address of `thread.sp0` which represents pointer to the normal kernel stack:
```C
#define task_pt_regs(tsk) ((struct pt_regs *)(tsk)->thread.sp0 - 1)
```
As we came from userspace, this means that exception handler will run in real process context. After we got stack pointer from the `sync_regs` we switch stack:
```assembly
movq %rax, %rsp
```
The last two steps before an exception handler will call secondary handler are:
1. Passing pointer to `pt_regs` structure which contains preserved general purpose registers to the `%rdi` register:
```assembly
movq %rsp, %rdi
```
as it will be passed as first parameter of secondary exception handler.
2. Pass error code to the `%rsi` register as it will be second argument of an exception handler and set it to `-1` on the stack for the same purpose as we did it before - to prevent restart of a system call:
```
.if \has_error_code
movq ORIG_RAX(%rsp), %rsi
movq $-1, ORIG_RAX(%rsp)
.else
xorl %esi, %esi
.endif
```
Additionally you may see that we zeroed the `%esi` register above in a case if an exception does not provide error code.
In the end we just call secondary exception handler:
```assembly
call \do_sym
```
which:
```C
dotraplinkage void do_debug(struct pt_regs *regs, long error_code);
```
will be for `debug` exception and:
```C
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code);
```
will be for `int 3` exception. In this part we will not see implementations of secondary handlers, because of they are very specific, but will see some of them in one of next parts.
We just considered first case when an exception occured in userspace. Let's consider last two.
An exception with paranoid > 0 occured in kernelspace
--------------------------------------------------------------------------------
In this case an exception was occured in kernelspace and `idtentry` macro is defined with `paranoid=1` for this exception. This value of `paranoid` means that we should use slower way that we saw in the beginning of this part to check do we really came from kernelspace or not. The `paranoid_entry` routing allows us to know this:
```assembly
ENTRY(paranoid_entry)
cld
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
movl $1, %ebx
movl $MSR_GS_BASE, %ecx
rdmsr
testl %edx, %edx
js 1f
SWAPGS
xorl %ebx, %ebx
1: ret
END(paranoid_entry)
```
As you may see, this function representes the same that we covered before. We use second (slow) method to get information about previous state of an interrupted task. As we checked this and executed `SWAPGS` in a case if we came from userspace, we should to do the same that we did before: We need to put pointer to a strucutre which holds general purpose registers to the `%rdi` (which will be first parameter of a secondary handler) and put error code if an exception provides it to the `%rsi` (which will be second parameter of a secondary handler):
```assembly
movq %rsp, %rdi
.if \has_error_code
movq ORIG_RAX(%rsp), %rsi
movq $-1, ORIG_RAX(%rsp)
.else
xorl %esi, %esi
.endif
```
The last step before a secondary handler of an exception will be called is cleanup of new `IST` stack fram:
```assembly
.if \shift_ist != -1
subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
```
You may remember that we passed the `shift_ist` as argument of the `idtentry` macro. Here we check its value and if its not equal to `-1`, we get pointer to a stack from `Interrupt Stack Table` by `shift_ist` index and setup it.
In the end of this second way we just call secondary exception handler as we did it before:
```assembly
call \do_sym
```
The last method is similar to previous both, but an exception occured with `paranoid=0` and we may use fast method determination of where we are from.
Exit from an exception handler
--------------------------------------------------------------------------------
After secondary handler will finish its works, we will return to the `idtentry` macro and the next step will be jump to the `error_exit`:
```assembly
jmp error_exit
```
routine. The `error_exit` function defined in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file and the main goal of this function is to know where we are from (from userspace or kernelspace) and execute `SWPAGS` depends on this. Restore registers to previous state and execute `iret` instruction to transfer control to an interrupted task.
That's all.
Conclusion
--------------------------------------------------------------------------------
It is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Interrupt descriptor table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) in the previous part with the `#DB` and `#BP` gates and started to dive into preparation before control will be transferred to an exception handler and implementation of some interrupt handlers in this part. In the next part we will continue to dive into this theme and will go next by the `setup_arch` function and will try to understand interrupts handling related stuff.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [Debug registers](http://en.wikipedia.org/wiki/X86_debug_register)
* [Intel 80385](http://en.wikipedia.org/wiki/Intel_80386)
* [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3)
* [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [TSS](http://en.wikipedia.org/wiki/Task_state_segment)
* [GNU assembly .error directive](https://sourceware.org/binutils/docs/as/Error.html#Error)
* [dwarf2](http://en.wikipedia.org/wiki/DWARF)
* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html)
* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [system call](http://en.wikipedia.org/wiki/System_call)
* [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html)
* [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP)
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [kgdb](https://en.wikipedia.org/wiki/KGDB)
* [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html)

465
Interrupts/interrupts-4.md Normal file
View File

@@ -0,0 +1,465 @@
Interrupts and Interrupt Handling. Part 4.
================================================================================
Initialization of non-early interrupt gates
--------------------------------------------------------------------------------
This is fourth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html) we saw first early `#DB` and `#BP` exceptions handlers from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped on the right after the `early_trap_init` function that called in the `setup_arch` function which defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into an interrupts and exceptions handling in the Linux kernel for `x86_64` and continue to do it from the place where we left off in the last part. First thing which is related to the interrupts and exceptions handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it.
Early page fault handler
--------------------------------------------------------------------------------
The `early_trap_pf_init` function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). It uses `set_intr_gate` macro that fills [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the given entry:
```C
void __init early_trap_pf_init(void)
{
#ifdef CONFIG_X86_64
set_intr_gate(X86_TRAP_PF, page_fault);
#endif
}
```
This macro defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h). We already saw macros like this in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html) - `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that given vector number is not greater than `255` (maximum vector number) and calls `_set_gate` function as `set_system_intr_gate` and `set_intr_gate_ist` did it:
```C
#define set_intr_gate(n, addr) \
do { \
BUG_ON((unsigned)n > 0xFF); \
_set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, \
__KERNEL_CS); \
_trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\
0, 0, __KERNEL_CS); \
} while (0)
```
The `set_intr_gate` macro takes two parameters:
* vector number of a interrupt;
* address of an interrupt handler;
In our case they are:
* `X86_TRAP_PF` - `14`;
* `page_fault` - the interrupt handler entry point.
The `X86_TRAP_PF` is the element of enum which defined in the [arch/x86/include/asm/traprs.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traprs.h):
```C
enum {
...
...
...
...
X86_TRAP_PF, /* 14, Page Fault */
...
...
...
}
```
When the `early_trap_pf_init` will be called, the `set_intr_gate` will be expanded to the call of the `_set_gate` which will fill the `IDT` with the handler for the page fault. Now let's look on the implementation of the `page_fault` handler. The `page_fault` handler defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file as all exceptions handlers. Let's look on it:
```assembly
trace_idtentry page_fault do_page_fault has_error_code=1
```
We saw in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html) how `#DB` and `#BP` handlers defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option:
```assembly
#ifdef CONFIG_TRACING
.macro trace_idtentry sym do_sym has_error_code:req
idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#else
.macro trace_idtentry sym do_sym has_error_code:req
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#endif
```
We will not dive into exceptions [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) now. If `CONFIG_TRACING` is not set, we can see that `trace_idtentry` macro just expands to the normal `idtentry`. We already saw implementation of the `idtentry` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html), so let's start from the `page_fault` exception handler.
As we can see in the `idtentry` definition, the handler of the `page_fault` is `do_page_fault` function which defined in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) and as all exceptions handlers it takes two arguments:
* `regs` - `pt_regs` structure that holds state of an interrupted process;
* `error_code` - error code of the page fault exception.
Let's look inside this function. First of all we read content of the [cr2](https://en.wikipedia.org/wiki/Control_register) control register:
```C
dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
unsigned long address = read_cr2();
...
...
...
}
```
This register contains a linear address which caused `page fault`. In the next step we make a call of the `exception_enter` function from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/blob/master/include/context_tracking.h). The `exception_enter` and `exception_exit` are functions from context tracking subsystem in the Linux kernel used by the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) to remove its dependency on the timer tick while a processor runs in userspace. Almost in the every exception handler we will see similar code:
```C
enum ctx_state prev_state;
prev_state = exception_enter();
...
... // exception handler here
...
exception_exit(prev_state);
```
The `exception_enter` function checks that `context tracking` is enabled with the `context_tracking_is_enabled` and if it is in enabled state, we get previous context with the `this_cpu_read` (more about `this_cpu_*` operations you can read in the [Documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)). After this it calls `context_tracking_user_exit` function which informs the context tracking that the processor is exiting userspace mode and entering the kernel:
```C
static inline enum ctx_state exception_enter(void)
{
enum ctx_state prev_ctx;
if (!context_tracking_is_enabled())
return 0;
prev_ctx = this_cpu_read(context_tracking.state);
context_tracking_user_exit();
return prev_ctx;
}
```
The state can be one of the:
```C
enum ctx_state {
IN_KERNEL = 0,
IN_USER,
} state;
```
And in the end we return previous context. Between the `exception_enter` and `exception_exit` we call actual page fault handler:
```C
__do_page_fault(regs, error_code, address);
```
The `__do_page_fault` is defined in the same source code file as `do_page_fault` - [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c). In the beginning of the `__do_page_fault` we check state of the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) checker. The `kmemcheck` detects warns about some uses of uninitialized memory. We need to check it because page fault can be caused by kmemcheck:
```C
if (kmemcheck_active(regs))
kmemcheck_hide(regs);
prefetchw(&mm->mmap_sem);
```
After this we can see the call of the `prefetchw` which executes instruction with the same [name](http://www.felixcloutier.com/x86/PREFETCHW.html) which fetches [X86_FEATURE_3DNOW](https://en.wikipedia.org/?title=3DNow!) to get exclusive [cache line](https://en.wikipedia.org/wiki/CPU_cache). The main purpose of prefetching is to hide the latency of a memory access. In the next step we check that we got page fault not in the kernel space with the following condition:
```C
if (unlikely(fault_in_kernel_space(address))) {
...
...
...
}
```
where `fault_in_kernel_space` is:
```C
static int fault_in_kernel_space(unsigned long address)
{
return address >= TASK_SIZE_MAX;
}
```
The `TASK_SIZE_MAX` macro expands to the:
```C
#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
```
or `0x00007ffffffff000`. Pay attention on `unlikely` macro. There are two macros in the Linux kernel:
```C
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
```
You can [often](http://lxr.free-electrons.com/ident?i=unlikely) find these macros in the code of the Linux kernel. Main purpose of these macros is optimization. Sometimes this situation is that we need to check the condition of the code and we know that it will rarely be `true` or `false`. With these macros we can tell to the compiler about this. For example
```C
static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
if (ctx->pos < FIRST_PROCESS_ENTRY) {
int error = proc_readdir(file, ctx);
if (unlikely(error <= 0))
return error;
...
...
...
}
```
Here we can see `proc_root_readdir` function which will be called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If condition marked with `unlikely`, compiler can put `false` code right after branching. Now let's back to the our address check. Comparison between the given address and the `0x00007ffffffff000` will give us to know, was page fault in the kernel mode or user mode. After this check we know it. After this `__do_page_fault` routine will try to understand the problem that provoked page fault exception and then will pass address to the appropriate routine. It can be `kmemcheck` fault, spurious fault, [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault and etc. Will not dive into implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kernel, but will see it in the chapter about the [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) in the Linux kernel.
Back to start_kernel
--------------------------------------------------------------------------------
There are many different function calls after the `early_trap_pf_init` in the `setup_arch` function from different kernel subsystems, but there are no one interrupts and exceptions handling related. So, we have to go back where we came from - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L492). The first things after the `setup_arch` is the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). This function makes initialization of the remaining exceptions handlers (remember that we already setup 3 handlers for the `#DB` - debug exception, `#BP` - breakpoint exception and `#PF` - page fault exception). The `trap_init` function starts from the check of the [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture):
```C
#ifdef CONFIG_EISA
void __iomem *p = early_ioremap(0x0FFFD9, 4);
if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
EISA_bus = 1;
early_iounmap(p, 4);
#endif
```
Note that it depends on the `CONFIG_EISA` kernel configuration parameter which represents `EISA` support. Here we use `early_ioremap` function to map `I/O` memory on the page tables. We use `readl` function to read first `4` bytes from the mapped region and if they are equal to `EISA` string we set `EISA_bus` to one. In the end we just unmap previously mapped region. More about `early_ioremap` you can read in the part which describes [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html).
After this we start to fill the `Interrupt Descriptor Table` with the different interrupt gates. First of all we set `#DE` or `Divide Error` and `#NMI` or `Non-maskable Interrupt`:
```C
set_intr_gate(X86_TRAP_DE, divide_error);
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
```
We use `set_intr_gate` macro to set the interrupt gate for the `#DE` exception and `set_intr_gate_ist` for the `#NMI`. You can remember that we already used these macros when we have set the interrupts gates for the page fault handler, debug handler and etc, you can find explanation of it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html). After this we setup exception gates for the following exceptions:
```C
set_system_intr_gate(X86_TRAP_OF, &overflow);
set_intr_gate(X86_TRAP_BR, bounds);
set_intr_gate(X86_TRAP_UD, invalid_op);
set_intr_gate(X86_TRAP_NM, device_not_available);
```
Here we can see:
* `#OF` or `Overflow` exception. This exception indicates that an overflow trap occurred when an special [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html) instruction was executed;
* `#BR` or `BOUND Range exceeded` exception. This exception indicates that a `BOUND-range-exceed` fault occurred when a [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) instruction was executed;
* `#UD` or `Invalid Opcode` exception. Occurs when a processor attempted to execute invalid or reserved [opcode](https://en.wikipedia.org/?title=Opcode), processor attempted to execute instruction with invalid operand(s) and etc;
* `#NM` or `Device Not Available` exception. Occurs when the processor tries to execute `x87 FPU` floating point instruction while `EM` flag in the [control register](https://en.wikipedia.org/wiki/Control_register#CR0) `cr0` was set.
In the next step we set the interrupt gate for the `#DF` or `Double fault` exception:
```C
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
```
This exception occurs when processor detected a second exception while calling an exception handler for a prior exception. In usual way when the processor detects another exception while trying to call an exception handler, the two exceptions can be handled serially. If the processor cannot handle them serially, it signals the double-fault or `#DF` exception.
The following set of the interrupt gates is:
```C
set_intr_gate(X86_TRAP_OLD_MF, &coprocessor_segment_overrun);
set_intr_gate(X86_TRAP_TS, &invalid_TSS);
set_intr_gate(X86_TRAP_NP, &segment_not_present);
set_intr_gate_ist(X86_TRAP_SS, &stack_segment, STACKFAULT_STACK);
set_intr_gate(X86_TRAP_GP, &general_protection);
set_intr_gate(X86_TRAP_SPURIOUS, &spurious_interrupt_bug);
set_intr_gate(X86_TRAP_MF, &coprocessor_error);
set_intr_gate(X86_TRAP_AC, &alignment_check);
```
Here we can see setup for the following exception handlers:
* `#CSO` or `Coprocessor Segment Overrun` - this exception indicates that math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) of an old processor detected a page or segment violation. Modern processors do not generate this exception
* `#TS` or `Invalid TSS` exception - indicates that there was an error related to the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment).
* `#NP` or `Segment Not Present` exception indicates that the `present flag` of a segment or gate descriptor is clear during attempt to load one of `cs`, `ds`, `es`, `fs`, or `gs` register.
* `#SS` or `Stack Fault` exception indicates one of the stack related conditions was detected, for example a not-present stack segment is detected when attempting to load the `ss` register.
* `#GP` or `General Protection` exception indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause general-protection exception. For example loading the `ss`, `ds`, `es`, `fs`, or `gs` register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the `Interrupt Descriptor Table` (following an interrupt or exception) that is not an interrupt, trap, or task gate and many many more.
* `Spurious Interrupt` - a hardware interrupt that is unwanted.
* `#MF` or `x87 FPU Floating-Point Error` exception caused when the [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions) has detected a floating point error.
* `#AC` or `Alignment Check` exception Indicates that the processor detected an unaligned memory operand when alignment checking was enabled.
After that we setup this exception gates, we can see setup of the `Machine-Check` exception:
```C
#ifdef CONFIG_X86_MCE
set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);
#endif
```
Note that it depends on the `CONFIG_X86_MCE` kernel configuration option and indicates that the processor detected an internal [machine error](https://en.wikipedia.org/wiki/Machine-check_exception) or a bus error, or that an external agent detected a bus error. The next exception gate is for the [SIMD](https://en.wikipedia.org/?title=SIMD) Floating-Point exception:
```C
set_intr_gate(X86_TRAP_XF, &simd_coprocessor_error);
```
which indicates the processor has detected an `SSE` or `SSE2` or `SSE3` SIMD floating-point exception. There are six classes of numeric exception conditions that can occur while executing an SIMD floating-point instruction:
* Invalid operation
* Divide-by-zero
* Denormal operand
* Numeric overflow
* Numeric underflow
* Inexact result (Precision)
In the next step we fill the `used_vectors` array which defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h) header file and represents `bitmap`:
```C
DECLARE_BITMAP(used_vectors, NR_VECTORS);
```
of the first `32` interrupts (more about bitmaps in the Linux kernel you can read in the part which describes [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html))
```C
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
set_bit(i, used_vectors)
```
where `FIRST_EXTERNAL_VECTOR` is:
```C
#define FIRST_EXTERNAL_VECTOR 0x20
```
After this we setup the interrupt gate for the `ia32_syscall` and add `0x80` to the `used_vectors` bitmap:
```C
#ifdef CONFIG_IA32_EMULATION
set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif
```
There is `CONFIG_IA32_EMULATION` kernel configuration option on `x86_64` Linux kernels. This option provides ability to execute 32-bit processes in compatibility-mode. In the next parts we will see how it works, in the meantime we need only to know that there is yet another interrupt gate in the `IDT` with the vector number `0x80`. In the next step we maps `IDT` to the fixmap area:
```C
__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
idt_descr.address = fix_to_virt(FIX_RO_IDT);
```
and write its address to the `idt_descr.address` (more about fix-mapped addresses you can read in the second part of the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function that defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c). This function makes initialization of the all `per-cpu` state. In the beginning of the `cpu_init` we do the following things: First of all we wait while current cpu is initialized and than we call the `cr4_init_shadow` function which stores shadow copy of the `cr4` control register for the current cpu and load CPU microcode if need with the following function calls:
```C
wait_for_master_cpu(cpu);
cr4_init_shadow();
load_ucode_ap();
```
Next we get the `Task State Segment` for the current cpu and `orig_ist` structure which represents origin `Interrupt Stack Table` values with the:
```C
t = &per_cpu(cpu_tss, cpu);
oist = &per_cpu(orig_ist, cpu);
```
As we got values of the `Task State Segment` and `Interrupt Stack Table` for the current processor, we clear following bits in the `cr4` control register:
```C
cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
```
with this we disable `vm86` extension, virtual interrupts, timestamp ([RDTSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) can only be executed with the highest privilege) and debug extension. After this we reload the `Global Descriptor Table` and `Interrupt Descriptor table` with the:
```C
switch_to_new_gdt(cpu);
loadsegment(fs, 0);
load_current_idt();
```
After this we setup array of the Thread-Local Storage Descriptors, configure [NX](https://en.wikipedia.org/wiki/NX_bit) and load CPU microcode. Now is time to setup and load `per-cpu` Task State Segments. We are going in a loop through the all exception stack which is `N_EXCEPTION_STACKS` or `4` and fill it with `Interrupt Stack Tables`:
```C
if (!oist->ist[0]) {
char *estacks = per_cpu(exception_stacks, cpu);
for (v = 0; v < N_EXCEPTION_STACKS; v++) {
estacks += exception_stack_sizes[v];
oist->ist[v] = t->x86_tss.ist[v] =
(unsigned long)estacks;
if (v == DEBUG_STACK-1)
per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
}
}
```
As we have filled `Task State Segments` with the `Interrupt Stack Tables` we can set `TSS` descriptor for the current processor and load it with the:
```C
set_tss_desc(cpu, t);
load_TR_desc();
```
where `set_tss_desc` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) writes given descriptor to the `Global Descriptor Table` of the given processor:
```C
#define set_tss_desc(cpu, addr) __set_tss_desc(cpu, GDT_ENTRY_TSS, addr)
static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
{
struct desc_struct *d = get_cpu_gdt_table(cpu);
tss_desc tss;
set_tssldt_descriptor(&tss, (unsigned long)addr, DESC_TSS,
IO_BITMAP_OFFSET + IO_BITMAP_BYTES +
sizeof(unsigned long) - 1);
write_gdt_entry(d, entry, &tss, DESC_TSS);
}
```
and `load_TR_desc` macro expands to the `ltr` or `Load Task Register` instruction:
```C
#define load_TR_desc() native_load_tr_desc()
static inline void native_load_tr_desc(void)
{
asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
}
```
In the end of the `trap_init` function we can see the following code:
```C
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
...
...
...
#ifdef CONFIG_X86_64
memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
set_nmi_gate(X86_TRAP_DB, &debug);
set_nmi_gate(X86_TRAP_BP, &int3);
#endif
```
Here we copy `idt_table` to the `nmi_dit_table` and setup exception handlers for the `#DB` or `Debug exception` and `#BR` or `Breakpoint exception`. You can remember that we already set these interrupt gates in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html), so why do we need to setup it again? We setup it again because when we initialized it before in the `early_trap_init` function, the `Task State Segment` was not ready yet, but now it is ready after the call of the `cpu_init` function.
That's all. Soon we will consider all handlers of these interrupts/exceptions.
Conclusion
--------------------------------------------------------------------------------
It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment) in this part and initialization of the different interrupt handlers as `Divide Error`, `Page Fault` exception and etc. You can note that we saw just initialization stuff, and will dive into details about handlers for these exceptions. In the next part we will start to do it.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [page fault](https://en.wikipedia.org/wiki/Page_fault)
* [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)
* [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29)
* [cr2](https://en.wikipedia.org/wiki/Control_register)
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
* [this_cpu_* operations](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)
* [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
* [prefetchw](http://www.felixcloutier.com/x86/PREFETCHW.html)
* [3DNow](https://en.wikipedia.org/?title=3DNow!)
* [CPU caches](https://en.wikipedia.org/wiki/CPU_cache)
* [VFS](https://en.wikipedia.org/wiki/Virtual_file_system)
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
* [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
* [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture)
* [INT isntruction](https://en.wikipedia.org/wiki/INT_%28x86_instruction%29)
* [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html)
* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm)
* [opcode](https://en.wikipedia.org/?title=Opcode)
* [control register](https://en.wikipedia.org/wiki/Control_register#CR0)
* [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions)
* [MCE exception](https://en.wikipedia.org/wiki/Machine-check_exception)
* [SIMD](https://en.wikipedia.org/?title=SIMD)
* [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [NX](https://en.wikipedia.org/wiki/NX_bit)
* [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html)

493
Interrupts/interrupts-5.md Normal file
View File

@@ -0,0 +1,493 @@
Interrupts and Interrupt Handling. Part 5.
================================================================================
Implementation of exception handlers
--------------------------------------------------------------------------------
This is the fifth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) we stopped on the setting of interrupt gates to the [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). We did it in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. We saw only setting of these interrupt gates in the previous part and in the current part we will see implementation of the exception handlers for these gates. The preparation before an exception handler will be executed is in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and occurs in the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S#L820) macro that defines exceptions entry points:
```assembly
idtentry divide_error do_divide_error has_error_code=0
idtentry overflow do_overflow has_error_code=0
idtentry invalid_op do_invalid_op has_error_code=0
idtentry bounds do_bounds has_error_code=0
idtentry device_not_available do_device_not_available has_error_code=0
idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0
idtentry invalid_TSS do_invalid_TSS has_error_code=1
idtentry segment_not_present do_segment_not_present has_error_code=1
idtentry spurious_interrupt_bug do_spurious_interrupt_bug has_error_code=0
idtentry coprocessor_error do_coprocessor_error has_error_code=0
idtentry alignment_check do_alignment_check has_error_code=1
idtentry simd_coprocessor_error do_simd_coprocessor_error has_error_code=0
```
The `idtentry` macro does following preparation before an actual exception handler (`do_divide_error` for the `divide_error`, `do_overflow` for the `overflow` and etc.) will get control. In another words the `idtentry` macro allocates place for the registers ([pt_regs](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h#L43) structure) on the stack, pushes dummy error code for the stack consistency if an interrupt/exception has no error code, checks the segment selector in the `cs` segment register and switches depends on the previous state(userspace or kernelspace). After all of these preparations it makes a call of an actual interrupt/exception handler:
```assembly
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
call \do_sym
...
...
...
END(\sym)
.endm
```
After an exception handler will finish its work, the `idtentry` macro restores stack and general purpose registers of an interrupted task and executes [iret](http://x86.renejeschke.de/html/file_module_x86_id_145.html) instruction:
```assembly
ENTRY(paranoid_exit)
...
...
...
RESTORE_EXTRA_REGS
RESTORE_C_REGS
REMOVE_PT_GPREGS_FROM_STACK 8
INTERRUPT_RETURN
END(paranoid_exit)
```
where `INTERRUPT_RETURN` is:
```assembly
#define INTERRUPT_RETURN jmp native_iret
...
ENTRY(native_iret)
.global native_irq_return_iret
native_irq_return_iret:
iretq
```
More about the `idtentry` macro you can read in the third part of the [http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html) chapter. Ok, now we saw the preparation before an exception handler will be executed and now time to look on the handlers. First of all let's look on the following handlers:
* divide_error
* overflow
* invalid_op
* coprocessor_segment_overrun
* invalid_TSS
* segment_not_present
* stack_segment
* alignment_check
All these handlers defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file with the `DO_ERROR` macro:
```C
DO_ERROR(X86_TRAP_DE, SIGFPE, "divide error", divide_error)
DO_ERROR(X86_TRAP_OF, SIGSEGV, "overflow", overflow)
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun)
DO_ERROR(X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS)
DO_ERROR(X86_TRAP_NP, SIGBUS, "segment not present", segment_not_present)
DO_ERROR(X86_TRAP_SS, SIGBUS, "stack segment", stack_segment)
DO_ERROR(X86_TRAP_AC, SIGBUS, "alignment check", alignment_check)
```
As we can see the `DO_ERROR` macro takes 4 parameters:
* Vector number of an interrupt;
* Signal number which will be sent to the interrupted process;
* String which describes an exception;
* Exception handler entry point.
This macro defined in the same source code file and expands to the function with the `do_handler` name:
```C
#define DO_ERROR(trapnr, signr, str, name) \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
{ \
do_error_trap(regs, error_code, str, trapnr, signr); \
}
```
Note on the `##` tokens. This is special feature - [GCC macro Concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html#Concatenation) which concatenates two given strings. For example, first `DO_ERROR` in our example will expands to the:
```C
dotraplinkage void do_divide_error(struct pt_regs *regs, long error_code) \
{
...
}
```
We can see that all functions which are generated by the `DO_ERROR` macro just make a call of the `do_error_trap` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). Let's look on implementation of the `do_error_trap` function.
Trap handlers
--------------------------------------------------------------------------------
The `do_error_trap` function starts and ends from the two following functions:
```C
enum ctx_state prev_state = exception_enter();
...
...
...
exception_exit(prev_state);
```
from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking.h). The context tracking in the Linux kernel subsystem which provide kernel boundaries probes to keep track of the transitions between level contexts with two basic initial contexts: `user` or `kernel`. The `exception_enter` function checks that context tracking is enabled. After this if it is enabled, the `exception_enter` reads previous context and compares it with the `CONTEXT_KERNEL`. If the previous context is `user`, we call `context_tracking_exit` function from the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/master/kernel/context_tracking.c) which inform the context tracking subsystem that a processor is exiting user mode and entering the kernel mode:
```C
if (!context_tracking_is_enabled())
return 0;
prev_ctx = this_cpu_read(context_tracking.state);
if (prev_ctx != CONTEXT_KERNEL)
context_tracking_exit(prev_ctx);
return prev_ctx;
```
If previous context is non `user`, we just return it. The `pre_ctx` has `enum ctx_state` type which defined in the [include/linux/context_tracking_state.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking_state.h) and looks as:
```C
enum ctx_state {
CONTEXT_KERNEL = 0,
CONTEXT_USER,
CONTEXT_GUEST,
} state;
```
The second function is `exception_exit` defined in the same [include/linux/context_tracking.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking.h) file and checks that context tracking is enabled and call the `contert_tracking_enter` function if the previous context was `user`:
```C
static inline void exception_exit(enum ctx_state prev_ctx)
{
if (context_tracking_is_enabled()) {
if (prev_ctx != CONTEXT_KERNEL)
context_tracking_enter(prev_ctx);
}
}
```
The `context_tracking_enter` function informs the context tracking subsystem that a processor is going to enter to the user mode from the kernel mode. We can see the following code between the `exception_enter` and `exception_exit`:
```C
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
NOTIFY_STOP) {
conditional_sti(regs);
do_trap(trapnr, signr, str, regs, error_code,
fill_trap_info(regs, signr, trapnr, &info));
}
```
First of all it calls the `notify_die` function which defined in the [kernel/notifier.c](https://github.com/torvalds/linux/tree/master/kernel/notifier.c). To get notified for [kernel panic](https://en.wikipedia.org/wiki/Kernel_panic), [kernel oops](https://en.wikipedia.org/wiki/Linux_kernel_oops), [Non-Maskable Interrupt](https://en.wikipedia.org/wiki/Non-maskable_interrupt) or other events the caller needs to insert itself in the `notify_die` chain and the `notify_die` function does it. The Linux kernel has special mechanism that allows kernel to ask when something happens and this mechanism called `notifiers` or `notifier chains`. This mechanism used for example for the `USB` hotplug events (look on the [drivers/usb/core/notify.c](https://github.com/torvalds/linux/tree/master/drivers/usb/core/notify.c)), for the memory [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) (look on the [include/linux/memory.h](https://github.com/torvalds/linux/tree/master/include/linux/memory.h), the `hotplug_memory_notifier` macro and etc...), system reboots and etc. A notifier chain is thus a simple, singly-linked list. When a Linux kernel subsystem wants to be notified of specific events, it fills out a special `notifier_block` structure and passes it to the `notifier_chain_register` function. An event can be sent with the call of the `notifier_call_chain` function. First of all the `notify_die` function fills `die_args` structure with the trap number, trap string, registers and other values:
```C
struct die_args args = {
.regs = regs,
.str = str,
.err = err,
.trapnr = trap,
.signr = sig,
}
```
and returns the result of the `atomic_notifier_call_chain` function with the `die_chain`:
```C
static ATOMIC_NOTIFIER_HEAD(die_chain);
return atomic_notifier_call_chain(&die_chain, val, &args);
```
which just expands to the `atomic_notifier_head` structure that contains lock and `notifier_block`:
```C
struct atomic_notifier_head {
spinlock_t lock;
struct notifier_block __rcu *head;
};
```
The `atomic_notifier_call_chain` function calls each function in a notifier chain in turn and returns the value of the last notifier function called. If the `notify_die` in the `do_error_trap` does not return `NOTIFY_STOP` we execute `conditional_sti` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) that checks the value of the [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag) and enables interrupt depends on it:
```C
static inline void conditional_sti(struct pt_regs *regs)
{
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
}
```
more about `local_irq_enable` macro you can read in the second [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-2.html) of this chapter. The next and last call in the `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defined the `tsk` variable which has `task_struct` type and represents the current interrupted process. After the definition of the `tsk`, we can see the call of the `do_trap_no_signal` function:
```C
struct task_struct *tsk = current;
if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
return;
```
The `do_trap_no_signal` function makes two checks:
* Did we come from the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode;
* Did we come from the kernelspace.
```C
if (v8086_mode(regs)) {
...
}
if (!user_mode(regs)) {
...
}
return -1;
```
We will not consider first case because the [long mode](https://en.wikipedia.org/wiki/Long_mode) does not support the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode. In the second case we invoke `fixup_exception` function which will try to recover a fault and `die` if we can't:
```C
if (!fixup_exception(regs)) {
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = trapnr;
die(str, regs, error_code);
}
```
The `die` function defined in the [arch/x86/kernel/dumpstack.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/dumpstack.c) source code file, prints useful information about stack, registers, kernel modules and caused kernel [oops](https://en.wikipedia.org/wiki/Linux_kernel_oops). If we came from the userspace the `do_trap_no_signal` function will return `-1` and the execution of the `do_trap` function will continue. If we passed through the `do_trap_no_signal` function and did not exit from the `do_trap` after this, it means that previous context was - `user`. Most exceptions caused by the processor are interpreted by Linux as error conditions, for example division by zero, invalid opcode and etc. When an exception occurs the Linux kernel sends a [signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process that caused the exception to notify it of an incorrect condition. So, in the `do_trap` function we need to send a signal with the given number (`SIGFPE` for the divide error, `SIGILL` for the overflow exception and etc...). First of all we save error code and vector number in the current interrupts process with the filling `thread.error_code` and `thread_trap_nr`:
```C
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = trapnr;
```
After this we make a check do we need to print information about unhandled signals for the interrupted process. We check that `show_unhandled_signals` variable is set, that `unhandled_signal` function from the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) will return unhandled signal(s) and [printk](https://en.wikipedia.org/wiki/Printk) rate limit:
```C
#ifdef CONFIG_X86_64
if (show_unhandled_signals && unhandled_signal(tsk, signr) &&
printk_ratelimit()) {
pr_info("%s[%d] trap %s ip:%lx sp:%lx error:%lx",
tsk->comm, tsk->pid, str,
regs->ip, regs->sp, error_code);
print_vma_addr(" in ", regs->ip);
pr_cont("\n");
}
#endif
```
And send a given signal to interrupted process:
```C
force_sig_info(signr, info ?: SEND_SIG_PRIV, tsk);
```
This is the end of the `do_trap`. We just saw generic implementation for eight different exceptions which are defined with the `DO_ERROR` macro. Now let's look on another exception handlers.
Double fault
--------------------------------------------------------------------------------
The next exception is `#DF` or `Double fault`. This exception occurs when the processor detected a second exception while calling an exception handler for a prior exception. We set the trap gate for this exception in the previous part:
```C
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
```
Note that this exception runs on the `DOUBLEFAULT_STACK` [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) which has index - `1`:
```C
#define DOUBLEFAULT_STACK 1
```
The `double_fault` is handler for this exception and defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). The `double_fault` handler starts from the definition of two variables: string that describes exception and interrupted process, as other exception handlers:
```C
static const char str[] = "double fault";
struct task_struct *tsk = current;
```
The handler of the double fault exception split on two parts. The first part is the check which checks that a fault is a `non-IST` fault on the `espfix64` stack. Actually the `iret` instruction restores only the bottom `16` bits when returning to a `16` bit segment. The `espfix` feature solves this problem. So if the `non-IST` fault on the espfix64 stack we modify the stack to make it look like `General Protection Fault`:
```C
struct pt_regs *normal_regs = task_pt_regs(current);
memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
ormal_regs->orig_ax = 0;
regs->ip = (unsigned long)general_protection;
regs->sp = (unsigned long)&normal_regs->orig_ax;
return;
```
In the second case we do almost the same that we did in the previous exception handlers. The first is the call of the `ist_enter` function that discards previous context, `user` in our case:
```C
ist_enter(regs);
```
And after this we fill the interrupted process with the vector number of the `Double fault` exception and error code as we did it in the previous handlers:
```C
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_DF;
```
Next we print useful information about the double fault ([PID](https://en.wikipedia.org/wiki/Process_identifier) number, registers content):
```C
#ifdef CONFIG_DOUBLEFAULT
df_debug(regs, error_code);
#endif
```
And die:
```
for (;;)
die(str, regs, error_code);
```
That's all.
Device not available exception handler
--------------------------------------------------------------------------------
The next exception is the `#NM` or `Device not available`. The `Device not available` exception can occur depending on these things:
* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87) floating-point instruction while the EM flag in [control register](https://en.wikipedia.org/wiki/Control_register) `cr0` was set;
* The processor executed a `wait` or `fwait` instruction while the `MP` and `TS` flags of register `cr0` were set;
* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87), [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29) or [SSE](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) instruction while the `TS` flag in control register `cr0` was set and the `EM` flag is clear.
The handler of the `Device not available` exception is the `do_device_not_available` function and it defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file too. It starts and ends from the getting of the previous context, as other traps which we saw in the beginning of this part:
```C
enum ctx_state prev_state;
prev_state = exception_enter();
...
...
...
exception_exit(prev_state);
```
In the next step we check that `FPU` is not eager:
```C
BUG_ON(use_eager_fpu());
```
When we switch into a task or interrupt we may avoid loading the `FPU` state. If a task will use it, we catch `Device not Available exception` exception. If we loading the `FPU` state during task switching, the `FPU` is eager. In the next step we check `cr0` control register on the `EM` flag which can show us is `x87` floating point unit present (flag clear) or not (flag set):
```C
#ifdef CONFIG_MATH_EMULATION
if (read_cr0() & X86_CR0_EM) {
struct math_emu_info info = { };
conditional_sti(regs);
info.regs = regs;
math_emulate(&info);
exception_exit(prev_state);
return;
}
#endif
```
If the `x87` floating point unit not presented, we enable interrupts with the `conditional_sti`, fill the `math_emu_info` (defined in the [arch/x86/include/asm/math_emu.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/math_emu.h)) structure with the registers of an interrupt task and call `math_emulate` function from the [arch/x86/math-emu/fpu_entry.c](https://github.com/torvalds/linux/tree/master/arch/x86/math-emu/fpu_entry.c). As you can understand from function's name, it emulates `X87 FPU` unit (more about the `x87` we will know in the special chapter). In other way, if `X86_CR0_EM` flag is clear which means that `x87 FPU` unit is presented, we call the `fpu__restore` function from the [arch/x86/kernel/fpu/core.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/fpu/core.c) which copies the `FPU` registers from the `fpustate` to the live hardware registers. After this `FPU` instructions can be used:
```C
fpu__restore(&current->thread.fpu);
```
General protection fault exception handler
--------------------------------------------------------------------------------
The next exception is the `#GP` or `General protection fault`. This exception occurs when the processor detected one of a class of protection violations called `general-protection violations`. It can be:
* Exceeding the segment limit when accessing the `cs`, `ds`, `es`, `fs` or `gs` segments;
* Loading the `ss`, `ds`, `es`, `fs` or `gs` register with a segment selector for a system segment.;
* Violating any of the privilege rules;
* and other...
The exception handler for this exception is the `do_general_protection` from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). The `do_general_protection` function starts and ends as other exception handlers from the getting of the previous context:
```C
prev_state = exception_enter();
...
exception_exit(prev_state);
```
After this we enable interrupts if they were disabled and check that we came from the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode:
```C
conditional_sti(regs);
if (v8086_mode(regs)) {
local_irq_enable();
handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
goto exit;
}
```
As long mode does not support this mode, we will not consider exception handling for this case. In the next step check that previous mode was kernel mode and try to fix the trap. If we can't fix the current general protection fault exception we fill the interrupted process with the vector number and error code of the exception and add it to the `notify_die` chain:
```C
if (!user_mode(regs)) {
if (fixup_exception(regs))
goto exit;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_GP;
if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
die("general protection fault", regs, error_code);
goto exit;
}
```
If we can fix exception we go to the `exit` label which exits from exception state:
```C
exit:
exception_exit(prev_state);
```
If we came from user mode we send `SIGSEGV` signal to the interrupted process from user mode as we did it in the `do_trap` function:
```C
if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
printk_ratelimit()) {
pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx",
tsk->comm, task_pid_nr(tsk),
regs->ip, regs->sp, error_code);
print_vma_addr(" in ", regs->ip);
pr_cont("\n");
}
force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
```
That's all.
Conclusion
--------------------------------------------------------------------------------
It is the end of the fifth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we saw implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions and many many more.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)
* [iret instruction](http://x86.renejeschke.de/html/file_module_x86_id_145.html)
* [GCC macro Concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html#Concatenation)
* [kernel panic](https://en.wikipedia.org/wiki/Kernel_panic)
* [kernel oops](https://en.wikipedia.org/wiki/Linux_kernel_oops)
* [Non-Maskable Interrupt](https://en.wikipedia.org/wiki/Non-maskable_interrupt)
* [hotplug](https://en.wikipedia.org/wiki/Hot_swapping)
* [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag)
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
* [signal](https://en.wikipedia.org/wiki/Unix_signal)
* [printk](https://en.wikipedia.org/wiki/Printk)
* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
* [SIMD](https://en.wikipedia.org/wiki/SIMD)
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
* [PID](https://en.wikipedia.org/wiki/Process_identifier)
* [x87 FPU](https://en.wikipedia.org/wiki/X87)
* [control register](https://en.wikipedia.org/wiki/Control_register)
* [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-4.html)

480
Interrupts/interrupts-6.md Normal file
View File

@@ -0,0 +1,480 @@
Interrupts and Interrupt Handling. Part 6.
================================================================================
Non-maskable interrupt handler
--------------------------------------------------------------------------------
It is sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-5.html) we saw implementation of some exception handlers for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, divide exception, invalid [opcode](https://en.wikipedia.org/wiki/Opcode) exceptions and etc. As I wrote in the previous part we will see implementations of the rest exceptions in this part. We will see implementation of the following handlers:
* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt;
* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) Range Exceeded Exception;
* [Coprocessor](https://en.wikipedia.org/wiki/Coprocessor) exception;
* [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exception.
in this part. So, let's start.
Non-Maskable interrupt handling
--------------------------------------------------------------------------------
A [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt is a hardware interrupt that cannot be ignored by standard masking techniques. In a general way, a non-maskable interrupt can be generated in either of two ways:
* External hardware asserts the non-maskable interrupt [pin](https://en.wikipedia.org/wiki/CPU_socket) on the CPU.
* The processor receives a message on the system bus or the APIC serial bus with a delivery mode `NMI`.
When the processor receives a `NMI` from one of these sources, the processor handles it immediately by calling the `NMI` handler pointed to by interrupt vector which has number `2` (see table in the first [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-1.html)). We already filled the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the [vector number](https://en.wikipedia.org/wiki/Interrupt_vector_table), address of the `nmi` interrupt handler and `NMI_STACK` [Interrupt Stack Table entry](https://github.com/torvalds/linux/blob/master/Documentation/x86/kernel-stacks):
```C
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
```
in the `trap_init` function which defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file. In the previous [parts](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) we saw that entry points of the all interrupt handlers are defined with the:
```assembly
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm
```
macro from the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file. But the handler of the `Non-Maskable` interrupts is not defined with this macro. It has own entry point:
```assembly
ENTRY(nmi)
...
...
...
END(nmi)
```
in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file. Lets dive into it and will try to understand how `Non-Maskable` interrupt handler works. The `nmi` handlers starts from the call of the:
```assembly
PARAVIRT_ADJUST_EXCEPTION_FRAME
```
macro but we will not dive into details about it in this part, because this macro related to the [Paravirtualization](https://en.wikipedia.org/wiki/Paravirtualization) stuff which we will see in another chapter. After this save the content of the `rdx` register on the stack:
```assembly
pushq %rdx
```
And allocated check that `cs` was not the kernel segment when an non-maskable interrupt occurs:
```assembly
cmpl $__KERNEL_CS, 16(%rsp)
jne first_nmi
```
The `__KERNEL_CS` macro defined in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) and represented second descriptor in the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table):
```C
#define GDT_ENTRY_KERNEL_CS 2
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
```
more about `GDT` you can read in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the Linux kernel booting process chapter. If `cs` is not kernel segment, it means that it is not nested `NMI` and we jump on the `first_nmi` label. Let's consider this case. First of all we put address of the current stack pointer to the `rdx` and pushes `1` to the stack in the `first_nmi` label:
```assembly
first_nmi:
movq (%rsp), %rdx
pushq $1
```
Why do we push `1` on the stack? As the comment says: `We allow breakpoints in NMIs`. On the [x86_64](https://en.wikipedia.org/wiki/X86-64), like other architectures, the CPU will not execute another `NMI` until the first `NMI` is completed. A `NMI` interrupt finished with the [iret](http://faydoc.tripod.com/cpu/iret.htm) instruction like other interrupts and exceptions do it. If the `NMI` handler triggers either a [page fault](https://en.wikipedia.org/wiki/Page_fault) or [breakpoint](https://en.wikipedia.org/wiki/Breakpoint) or another exception which are use `iret` instruction too. If this happens while in `NMI` context, the CPU will leave `NMI` context and a new `NMI` may come in. The `iret` used to return from those exceptions will re-enable `NMIs` and we will get nested non-maskable interrupts. The problem the `NMI` handler will not return to the state that it was, when the exception triggered, but instead it will return to a state that will allow new `NMIs` to preempt the running `NMI` handler. If another `NMI` comes in before the first NMI handler is complete, the new NMI will write all over the preempted `NMIs` stack. We can have nested `NMIs` where the next `NMI` is using the top of the stack of the previous `NMI`. It means that we cannot execute it because a nested non-maskable interrupt will corrupt stack of a previous non-maskable interrupt. That's why we have allocated space on the stack for temporary variable. We will check this variable that it was set when a previous `NMI` is executing and clear if it is not nested `NMI`. We push `1` here to the previously allocated space on the stack to denote that a `non-maskable` interrupt executed currently. Remember that when and `NMI` or another exception occurs we have the following [stack frame](https://en.wikipedia.org/wiki/Call_stack):
```
+------------------------+
| SS |
| RSP |
| RFLAGS |
| CS |
| RIP |
+------------------------+
```
and also an error code if an exception has it. So, after all of these manipulations our stack frame will look like this:
```
+------------------------+
| SS |
| RSP |
| RFLAGS |
| CS |
| RIP |
| RDX |
| 1 |
+------------------------+
```
In the next step we allocate yet another `40` bytes on the stack:
```assembly
subq $(5*8), %rsp
```
and pushes the copy of the original stack frame after the allocated space:
```C
.rept 5
pushq 11*8(%rsp)
.endr
```
with the [.rept](http://tigcc.ticalc.org/doc/gnuasm.html#SEC116) assembly directive. We need in the copy of the original stack frame. Generally we need in two copies of the interrupt stack. First is `copied` interrupts stack: `saved` stack frame and `copied` stack frame. Now we pushes original stack frame to the `saved` stack frame which locates after the just allocated `40` bytes (`copied` stack frame). This stack frame is used to fixup the `copied` stack frame that a nested NMI may change. The second - `copied` stack frame modified by any nested `NMIs` to let the first `NMI` know that we triggered a second `NMI` and we should repeat the first `NMI` handler. Ok, we have made first copy of the original stack frame, now time to make second copy:
```assembly
addq $(10*8), %rsp
.rept 5
pushq -6*8(%rsp)
.endr
subq $(5*8), %rsp
```
After all of these manipulations our stack frame will be like this:
```
+-------------------------+
| original SS |
| original Return RSP |
| original RFLAGS |
| original CS |
| original RIP |
+-------------------------+
| temp storage for rdx |
+-------------------------+
| NMI executing variable |
+-------------------------+
| copied SS |
| copied Return RSP |
| copied RFLAGS |
| copied CS |
| copied RIP |
+-------------------------+
| Saved SS |
| Saved Return RSP |
| Saved RFLAGS |
| Saved CS |
| Saved RIP |
+-------------------------+
```
After this we push dummy error code on the stack as we did it already in the previous exception handlers and allocate space for the general purpose registers on the stack:
```assembly
pushq $-1
ALLOC_PT_GPREGS_ON_STACK
```
We already saw implementation of the `ALLOC_PT_GREGS_ON_STACK` macro in the third part of the interrupts [chapter](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-3.html). This macro defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) and yet another allocates `120` bytes on stack for the general purpose registers, from the `rdi` to the `r15`:
```assembly
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
addq $-(15*8+\addskip), %rsp
.endm
```
After space allocation for the general registers we can see call of the `paranoid_entry`:
```assembly
call paranoid_entry
```
We can remember from the previous parts this label. It pushes general purpose registers on the stack, reads `MSR_GS_BASE` [Model Specific register](https://en.wikipedia.org/wiki/Model-specific_register) and checks its value. If the value of the `MSR_GS_BASE` is negative, we came from the kernel mode and just return from the `paranoid_entry`, in other way it means that we came from the usermode and need to execute `swapgs` instruction which will change user `gs` with the kernel `gs`:
```assembly
ENTRY(paranoid_entry)
cld
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
movl $1, %ebx
movl $MSR_GS_BASE, %ecx
rdmsr
testl %edx, %edx
js 1f
SWAPGS
xorl %ebx, %ebx
1: ret
END(paranoid_entry)
```
Note that after the `swapgs` instruction we zeroed the `ebx` register. Next time we will check content of this register and if we executed `swapgs` than `ebx` must contain `0` and `1` in other way. In the next step we store value of the `cr2` [control register](https://en.wikipedia.org/wiki/Control_register) to the `r12` register, because the `NMI` handler can cause `page fault` and corrupt the value of this control register:
```C
movq %cr2, %r12
```
Now time to call actual `NMI` handler. We push the address of the `pt_regs` to the `rdi`, error code to the `rsi` and call the `do_nmi` handler:
```assembly
movq %rsp, %rdi
movq $-1, %rsi
call do_nmi
```
We will back to the `do_nmi` little later in this part, but now let's look what occurs after the `do_nmi` will finish its execution. After the `do_nmi` handler will be finished we check the `cr2` register, because we can got page fault during `do_nmi` performed and if we got it we restore original `cr2`, in other way we jump on the label `1`. After this we test content of the `ebx` register (remember it must contain `0` if we have used `swapgs` instruction and `1` if we didn't use it) and execute `SWAPGS_UNSAFE_STACK` if it contains `1` or jump to the `nmi_restore` label. The `SWAPGS_UNSAFE_STACK` macro just expands to the `swapgs` instruction. In the `nmi_restore` label we restore general purpose registers, clear allocated space on the stack for this registers, clear our temporary variable and exit from the interrupt handler with the `INTERRUPT_RETURN` macro:
```assembly
movq %cr2, %rcx
cmpq %rcx, %r12
je 1f
movq %r12, %cr2
1:
testl %ebx, %ebx
jnz nmi_restore
nmi_swapgs:
SWAPGS_UNSAFE_STACK
nmi_restore:
RESTORE_EXTRA_REGS
RESTORE_C_REGS
/* Pop the extra iret frame at once */
REMOVE_PT_GPREGS_FROM_STACK 6*8
/* Clear the NMI executing stack variable */
movq $0, 5*8(%rsp)
INTERRUPT_RETURN
```
where `INTERRUPT_RETURN` is defined in the [arch/x86/include/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/irqflags.h) and just expands to the `iret` instruction. That's all.
Now let's consider case when another `NMI` interrupt occurred when previous `NMI` interrupt didn't finish its execution. You can remember from the beginning of this part that we've made a check that we came from userspace and jump on the `first_nmi` in this case:
```assembly
cmpl $__KERNEL_CS, 16(%rsp)
jne first_nmi
```
Note that in this case it is first `NMI` every time, because if the first `NMI` catched page fault, breakpoint or another exception it will be executed in the kernel mode. If we didn't come from userspace, first of all we test our temporary variable:
```assembly
cmpl $1, -8(%rsp)
je nested_nmi
```
and if it is set to `1` we jump to the `nested_nmi` label. If it is not `1`, we test the `IST` stack. In the case of nested `NMIs` we check that we are above the `repeat_nmi`. In this case we ignore it, in other way we check that we above than `end_repeat_nmi` and jump on the `nested_nmi_out` label.
Now let's look on the `do_nmi` exception handler. This function defined in the [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c) source code file and takes two parameters:
* address of the `pt_regs`;
* error code.
as all exception handlers. The `do_nmi` starts from the call of the `nmi_nesting_preprocess` function and ends with the call of the `nmi_nesting_postprocess`. The `nmi_nesting_preprocess` function checks that we likely do not work with the debug stack and if we on the debug stack set the `update_debug_stack` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `1` and call the `debug_stack_set_zero` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c). This function increases the `debug_stack_use_ctr` per-cpu variable and loads new `Interrupt Descriptor Table`:
```C
static inline void nmi_nesting_preprocess(struct pt_regs *regs)
{
if (unlikely(is_debug_stack(regs->sp))) {
debug_stack_set_zero();
this_cpu_write(update_debug_stack, 1);
}
}
```
The `nmi_nesting_postprocess` function checks the `update_debug_stack` per-cpu variable which we set in the `nmi_nesting_preprocess` and resets debug stack or in another words it loads origin `Interrupt Descriptor Table`. After the call of the `nmi_nesting_preprocess` function, we can see the call of the `nmi_enter` in the `do_nmi`. The `nmi_enter` increases `lockdep_recursion` field of the interrupted process, update preempt counter and informs the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) subsystem about `NMI`. There is also `nmi_exit` function that does the same stuff as `nmi_enter`, but vice-versa. After the `nmi_enter` we increase `__nmi_count` in the `irq_stat` structure and call the `default_do_nmi` function. First of all in the `default_do_nmi` we check the address of the previous nmi and update address of the last nmi to the actual:
```C
if (regs->ip == __this_cpu_read(last_nmi_rip))
b2b = true;
else
__this_cpu_write(swallow_nmi, false);
__this_cpu_write(last_nmi_rip, regs->ip);
```
After this first of all we need to handle CPU-specific `NMIs`:
```C
handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
```
And then non-specific `NMIs` depends on its reason:
```C
reason = x86_platform.get_nmi_reason();
if (reason & NMI_REASON_MASK) {
if (reason & NMI_REASON_SERR)
pci_serr_error(reason, regs);
else if (reason & NMI_REASON_IOCHK)
io_check_error(reason, regs);
__this_cpu_add(nmi_stats.external, 1);
return;
}
```
That's all.
Range Exceeded Exception
--------------------------------------------------------------------------------
The next exception is the `BOUND` range exceeded exception. The `BOUND` instruction determines if the first operand (array index) is within the bounds of an array specified the second operand (bounds operand). If the index is not within bounds, a `BOUND` range exceeded exception or `#BR` is occurred. The handler of the `#BR` exception is the `do_bounds` function that defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). The `do_bounds` handler starts with the call of the `exception_enter` function and ends with the call of the `exception_exit`:
```C
prev_state = exception_enter();
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
goto exit;
...
...
...
exception_exit(prev_state);
return;
```
After we have got the state of the previous context, we add the exception to the `notify_die` chain and if it will return `NOTIFY_STOP` we return from the exception. More about notify chains and the `context tracking` functions you can read in the [previous part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-5.html). In the next step we enable interrupts if they were disabled with the `contidional_sti` function that checks `IF` flag and call the `local_irq_enable` depends on its value:
```C
conditional_sti(regs);
if (!user_mode(regs))
die("bounds", regs, error_code);
```
and check that if we didn't came from user mode we send `SIGSEGV` signal with the `die` function. After this we check is [MPX](https://en.wikipedia.org/wiki/Intel_MPX) enabled or not, and if this feature is disabled we jump on the `exit_trap` label:
```C
if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
goto exit_trap;
}
where we execute `do_trap` function (more about it you can find in the previous part):
```C
exit_trap:
do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
exception_exit(prev_state);
```
If `MPX` feature is enabled we check the `BNDSTATUS` with the `get_xsave_field_ptr` function and if it is zero, it means that the `MPX` was not responsible for this exception:
```C
bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
if (!bndcsr)
goto exit_trap;
```
After all of this, there is still only one way when `MPX` is responsible for this exception. We will not dive into the details about Intel Memory Protection Extensions in this part, but will see it in another chapter.
Coprocessor exception and SIMD exception
--------------------------------------------------------------------------------
The next two exceptions are [x87 FPU](https://en.wikipedia.org/wiki/X87) Floating-Point Error exception or `#MF` and [SIMD](https://en.wikipedia.org/wiki/SIMD) Floating-Point Exception or `#XF`. The first exception occurs when the `x87 FPU` has detected floating point error. For example divide by zero, numeric overflow and etc. The second exception occurs when the processor has detected [SSE/SSE2/SSE3](https://en.wikipedia.org/wiki/SSE3) `SIMD` floating-point exception. It can be the same as for the `x87 FPU`. The handlers for these exceptions are `do_coprocessor_error` and `do_simd_coprocessor_error` are defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and very similar on each other. They both make a call of the `math_error` function from the same source code file but pass different vector number. The `do_coprocessor_error` passes `X86_TRAP_MF` vector number to the `math_error`:
```C
dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
{
enum ctx_state prev_state;
prev_state = exception_enter();
math_error(regs, error_code, X86_TRAP_MF);
exception_exit(prev_state);
}
```
and `do_simd_coprocessor_error` passes `X86_TRAP_XF` to the `math_error` function:
```C
dotraplinkage void
do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
{
enum ctx_state prev_state;
prev_state = exception_enter();
math_error(regs, error_code, X86_TRAP_XF);
exception_exit(prev_state);
}
```
First of all the `math_error` function defines current interrupted task, address of its fpu, string which describes an exception, add it to the `notify_die` chain and return from the exception handler if it will return `NOTIFY_STOP`:
```C
struct task_struct *task = current;
struct fpu *fpu = &task->thread.fpu;
siginfo_t info;
char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
"simd exception";
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
return;
```
After this we check that we are from the kernel mode and if yes we will try to fix an excetpion with the `fixup_exception` function. If we cannot we fill the task with the exception's error code and vector number and die:
```C
if (!user_mode(regs)) {
if (!fixup_exception(regs)) {
task->thread.error_code = error_code;
task->thread.trap_nr = trapnr;
die(str, regs, error_code);
}
return;
}
```
If we came from the user mode, we save the `fpu` state, fill the task structure with the vector number of an exception and `siginfo_t` with the number of signal, `errno`, the address where exception occurred and signal code:
```C
fpu__save(fpu);
task->thread.trap_nr = trapnr;
task->thread.error_code = error_code;
info.si_signo = SIGFPE;
info.si_errno = 0;
info.si_addr = (void __user *)uprobe_get_trap_addr(regs);
info.si_code = fpu__exception_code(fpu, trapnr);
```
After this we check the signal code and if it is non-zero we return:
```C
if (!info.si_code)
return;
```
Or send the `SIGFPE` signal in the end:
```C
force_sig_info(SIGFPE, &info, task);
```
That's all.
Conclusion
--------------------------------------------------------------------------------
It is the end of the sixth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we saw implementation of some exception handlers in this part, like `non-maskable` interrupt, [SIMD](https://en.wikipedia.org/wiki/SIMD) and [x87 FPU](https://en.wikipedia.org/wiki/X87) floating point exception. Finally we have finsihed with the `trap_init` function in this part and will go ahead in the next part. The next our point is the external interrupts and the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c).
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault)
* [opcode](https://en.wikipedia.org/wiki/Opcode)
* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt)
* [BOUND instruction](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm)
* [CPU socket](https://en.wikipedia.org/wiki/CPU_socket)
* [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)
* [Interrupt Stack Table](https://github.com/torvalds/linux/blob/master/Documentation/x86/kernel-stacks)
* [Paravirtualization](https://en.wikipedia.org/wiki/Paravirtualization)
* [.rept](http://tigcc.ticalc.org/doc/gnuasm.html#SEC116)
* [SIMD](https://en.wikipedia.org/wiki/SIMD)
* [Coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [iret](http://faydoc.tripod.com/cpu/iret.htm)
* [page fault](https://en.wikipedia.org/wiki/Page_fault)
* [breakpoint](https://en.wikipedia.org/wiki/Breakpoint)
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [stack frame](https://en.wikipedia.org/wiki/Call_stack)
* [Model Specific regiser](https://en.wikipedia.org/wiki/Model-specific_register)
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
* [MPX](https://en.wikipedia.org/wiki/Intel_MPX)
* [x87 FPU](https://en.wikipedia.org/wiki/X87)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-5.html)

461
Interrupts/interrupts-7.md Normal file
View File

@@ -0,0 +1,461 @@
Interrupts and Interrupt Handling. Part 7.
================================================================================
Introduction to external interrupts
--------------------------------------------------------------------------------
This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-6.html) we have finished with the exceptions which are generated by the processor. In this part we will continue to dive to the interrupt handling and will start with the external hardware interrupt handling. As you can remember, in the previous part we have finished with the `trap_init` function from the [arch/x86/kernel/trap.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and the next step is the call of the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c).
Interrupts are signal that are sent across [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) or `Interrupt Request Line` by a hardware or software. External hardware interrupts allow devices like keyboard, mouse and etc, to indicate that it needs attention of the processor. Once the processor receives the `Interrupt Request`, it will temporary stop execution of the running program and invoke special routine which depends on an interrupt. We already know that this routine is called interrupt handler (or how we will call it `ISR` or `Interrupt Service Routine` from this part). The `ISR` or `Interrupt Handler Routine` can be found in Interrupt Vector table that is located at fixed address in the memory. After the interrupt is handled processor resumes the interrupted process. At the boot/initialization time, the Linux kernel identifies all devices in the machine, and appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by the sending a [Unix signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process. That's why kernel is can handle an exception quickly. Unfortunately we can not use this approach for the external hardware interrupts, because often they arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of an interrupt:
* `I/O` interrupts;
* Timer interrupts;
* Interprocessor interrupts.
I will try to describe all types of interrupts in this book.
Generally, a handler of an `I/O` interrupt must be flexible enough to service several devices at the same time. For example in the [PCI](https://en.wikipedia.org/wiki/Conventional_PCI) bus architecture several devices may share the same `IRQ` line. In the simplest way the Linux kernel must do following thing when an `I/O` interrupt occurred:
* Save the value of an `IRQ` and the register's contents on the kernel stack;
* Send an acknowledgment to the hardware controller which is servicing the `IRQ` line;
* Execute the interrupt service routine (next we will call it `ISR`) which is associated with the device;
* Restore registers and return from an interrupt;
Ok, we know a little theory and now let's start with the `early_irq_init` function. The implementation of the `early_irq_init` function is in the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function make early initialization of the `irq_desc` structure. The `irq_desc` structure is the foundation of interrupt management code in the Linux kernel. An array of this structure, which has the same name - `irq_desc`, keeps track of every interrupt request source in the Linux kernel. This structure defined in the [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h) and as you can note it depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. This kernel configuration option enables support for sparse irqs. The `irq_desc` structure contains many different files:
* `irq_common_data` - per irq and chip data passed down to chip functions;
* `status_use_accessors` - contains status of the interrupt source which is combination of the values from the `enum` from the [include/linux/irq.h](https://github.com/torvalds/linux/blob/master/include/linux/irq.h) and different macros which are defined in the same source code file;
* `kstat_irqs` - irq stats per-cpu;
* `handle_irq` - highlevel irq-events handler;
* `action` - identifies the interrupt service routines to be invoked when the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) occurs;
* `irq_count` - counter of interrupt occurrences on the IRQ line;
* `depth` - `0` if the IRQ line is enabled and a positive value if it has been disabled at least once;
* `last_unhandled` - aging timer for unhandled count;
* `irqs_unhandled` - count of the unhandled interrupts;
* `lock` - a spin lock used to serialize the accesses to the `IRQ` descriptor;
* `pending_mask` - pending rebalanced interrupts;
* `owner` - an owner of interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is need to proved refcount on the module which provides the interrupts;
* and etc.
Of course it is not all fields of the `irq_desc` structure, because it is too long to describe each field of this structure, but we will see it all soon. Now let's start to dive into the implementation of the `early_irq_init` function.
Early external interrupts initialization
--------------------------------------------------------------------------------
Now, let's look on the implementation of the `early_irq_init` function. Note that implementation of the `early_irq_init` function depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Now we consider implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` kernel configuration option is not set. This function starts from the declaration of the following variables: `irq` descriptors counter, loop counter, memory node and the `irq_desc` descriptor:
```C
int __init early_irq_init(void)
{
int count, i, node = first_online_node;
struct irq_desc *desc;
...
...
...
}
```
The `node` is an online [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) node which depends on the `MAX_NUMNODES` value which depends on the `CONFIG_NODES_SHIFT` kernel configuration parameter:
```C
#define MAX_NUMNODES (1 << NODES_SHIFT)
...
...
...
#ifdef CONFIG_NODES_SHIFT
#define NODES_SHIFT CONFIG_NODES_SHIFT
#else
#define NODES_SHIFT 0
#endif
```
As I already wrote, implementation of the `first_online_node` macro depends on the `MAX_NUMNODES` value:
```C
#if MAX_NUMNODES > 1
#define first_online_node first_node(node_states[N_ONLINE])
#else
#define first_online_node 0
```
The `node_states` is the [enum](https://en.wikipedia.org/wiki/Enumerated_type) which defined in the [include/linux/nodemask.h](https://github.com/torvalds/linux/blob/master/include/linux/nodemask.h) and represent the set of the states of a node. In our case we are searching an online node and it will be `0` if `MAX_NUMNODES` is one or zero. If the `MAX_NUMNODES` is greater than one, the `node_states[N_ONLINE]` will return `1` and the `first_node` macro will be expands to the call of the `__first_node` function which will return `minimal` or the first online node:
```C
#define first_node(src) __first_node(&(src))
static inline int __first_node(const nodemask_t *srcp)
{
return min_t(int, MAX_NUMNODES, find_first_bit(srcp->bits, MAX_NUMNODES));
}
```
More about this will be in the another chapter about the `NUMA`. The next step after the declaration of these local variables is the call of the:
```C
init_irq_default_affinity();
```
function. The `init_irq_default_affinity` function defined in the same source code file and depends on the `CONFIG_SMP` kernel configuration option allocates a given [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) structure (in our case it is the `irq_default_affinity`):
```C
#if defined(CONFIG_SMP)
cpumask_var_t irq_default_affinity;
static void __init init_irq_default_affinity(void)
{
alloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
cpumask_setall(irq_default_affinity);
}
#else
static void __init init_irq_default_affinity(void)
{
}
#endif
```
We know that when a hardware, such as disk controller or keyboard, needs attention from the processor, it throws an interrupt. The interrupt tells to the processor that something has happened and that the processor should interrupt current process and handle an incoming event. In order to prevent multiple devices from sending the same interrupts, the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) system was established where each device in a computer system is assigned its own special IRQ so that its interrupts are unique. Linux kernel can assign certain `IRQs` to specific processors. This is known as `SMP IRQ affinity`, and it allows you control how your system will respond to various hardware events (that's why it has certain implementation only if the `CONFIG_SMP` kernel configuration option is set). After we allocated `irq_default_affinity` cpumask, we can see `printk` output:
```C
printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS);
```
which prints `NR_IRQS`:
```C
~$ dmesg | grep NR_IRQS
[ 0.000000] NR_IRQS:4352
```
The `NR_IRQS` is the maximum number of the `irq` descriptors or in another words maximum number of interrupts. Its value depends on the state of the `CONFIG_X86_IO_APIC` kernel configuration option. If the `CONFIG_X86_IO_APIC` is not set and the Linux kernel uses an old [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) chip, the `NR_IRQS` is:
```C
#define NR_IRQS_LEGACY 16
#ifdef CONFIG_X86_IO_APIC
...
...
...
#else
# define NR_IRQS NR_IRQS_LEGACY
#endif
```
In other way, when the `CONFIG_X86_IO_APIC` kernel configuration option is set, the `NR_IRQS` depends on the amount of the processors and amount of the interrupt vectors:
```C
#define CPU_VECTOR_LIMIT (64 * NR_CPUS)
#define NR_VECTORS 256
#define IO_APIC_VECTOR_LIMIT ( 32 * MAX_IO_APICS )
#define MAX_IO_APICS 128
# define NR_IRQS \
(CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ? \
(NR_VECTORS + CPU_VECTOR_LIMIT) : \
(NR_VECTORS + IO_APIC_VECTOR_LIMIT))
...
...
...
```
We remember from the previous parts, that the amount of processors we can set during Linux kernel configuration process with the `CONFIG_NR_CPUS` configuration option:
![kernel](http://oi60.tinypic.com/1zdm1dt.jpg)
In the first case (`CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT`), the `NR_IRQS` will be `4352`, in the second case (`CPU_VECTOR_LIMIT < IO_APIC_VECTOR_LIMIT`), the `NR_IRQS` will be `768`. In my case the `NR_CPUS` is `8` as you can see in the my configuration, the `CPU_VECTOR_LIMIT` is `512` and the `IO_APIC_VECTOR_LIMIT` is `4096`. So `NR_IRQS` for my configuration is `4352`:
```
~$ dmesg | grep NR_IRQS
[ 0.000000] NR_IRQS:4352
```
In the next step we assign array of the IRQ descriptors to the `irq_desc` variable which we defined in the start of the `early_irq_init` function and calculate count of the `irq_desc` array with the `ARRAY_SIZE` macro:
```C
desc = irq_desc;
count = ARRAY_SIZE(irq_desc);
```
The `irq_desc` array defined in the same source code file and looks like:
```C
struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
[0 ... NR_IRQS-1] = {
.handle_irq = handle_bad_irq,
.depth = 1,
.lock = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
}
};
```
The `irq_desc` is array of the `irq` descriptors. It has three already initialized fields:
* `handle_irq` - as I already wrote above, this field is the highlevel irq-event handler. In our case it initialized with the `handle_bad_irq` function that defined in the [kernel/irq/handle.c](https://github.com/torvalds/linux/blob/master/kernel/irq/handle.c) source code file and handles spurious and unhandled irqs;
* `depth` - `0` if the IRQ line is enabled and a positive value if it has been disabled at least once;
* `lock` - A spin lock used to serialize the accesses to the `IRQ` descriptor.
As we calculated count of the interrupts and initialized our `irq_desc` array, we start to fill descriptors in the loop:
```C
for (i = 0; i < count; i++) {
desc[i].kstat_irqs = alloc_percpu(unsigned int);
alloc_masks(&desc[i], GFP_KERNEL, node);
raw_spin_lock_init(&desc[i].lock);
lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
desc_set_defaults(i, &desc[i], node, NULL);
}
```
We are going through the all interrupt descriptors and do the following things:
First of all we allocate [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable for the `irq` kernel statistic with the `alloc_percpu` macro. This macro allocates one instance of an object of the given type for every processor on the system. You can access kernel statistic from the userspace via `/proc/stat`:
```
~$ cat /proc/stat
cpu 207907 68 53904 5427850 14394 0 394 0 0 0
cpu0 25881 11 6684 679131 1351 0 18 0 0 0
cpu1 24791 16 5894 679994 2285 0 24 0 0 0
cpu2 26321 4 7154 678924 664 0 71 0 0 0
cpu3 26648 8 6931 678891 414 0 244 0 0 0
...
...
...
```
Where the sixth column is the servicing interrupts. After this we allocate [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) for the given irq descriptor affinity and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. After this before the [critical section](https://en.wikipedia.org/wiki/Critical_section), the lock will be acquired with a call of the `raw_spin_lock` and unlocked with the call of the `raw_spin_unlock`. In the next step we call the `lockdep_set_class` macro which set the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in the separate chapter.
In the end of the loop we call the `desc_set_defaults` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function takes four parameters:
* number of a irq;
* interrupt descriptor;
* online `NUMA` node;
* owner of interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is need to proved refcount on the module which provides the interrupts;
and fills the rest of the `irq_desc` fields. The `desc_set_defaults` function fills interrupt number, `irq` chip, platform-specific per-chip private data for the chip methods, per-IRQ data for the `irq_chip` methods and [MSI](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts) descriptor for the per `irq` and `irq` chip data:
```C
desc->irq_data.irq = irq;
desc->irq_data.chip = &no_irq_chip;
desc->irq_data.chip_data = NULL;
desc->irq_data.handler_data = NULL;
desc->irq_data.msi_desc = NULL;
...
...
...
```
The `irq_data.chip` structure provides general `API` like the `irq_set_chip`, `irq_set_irq_type` and etc, for the irq controller [drivers](https://github.com/torvalds/linux/tree/master/drivers/irqchip). You can find it in the [kernel/irq/chip.c](https://github.com/torvalds/linux/blob/master/kernel/irq/chip.c) source code file.
After this we set the status of the accessor for the given descriptor and set disabled state of the interrupts:
```C
...
...
...
irq_settings_clr_and_set(desc, ~0, _IRQ_DEFAULT_INIT_FLAGS);
irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED);
...
...
...
```
In the next step we set the high level interrupt handlers to the `handle_bad_irq` which handles spurious and unhandled irqs (as the hardware stuff is not initialized yet, we set this handler), set `irq_desc.desc` to `1` which means that an `IRQ` is disabled, reset count of the unhandled interrupts and interrupts in general:
```C
...
...
...
desc->handle_irq = handle_bad_irq;
desc->depth = 1;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
desc->name = NULL;
desc->owner = owner;
...
...
...
```
After this we go through the all [possible](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) processor with the [for_each_possible_cpu](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h#L714) helper and set the `kstat_irqs` to zero for the given interrupt descriptor:
```C
for_each_possible_cpu(cpu)
*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
```
and call the `desc_smp_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) that initializes `NUMA` node of the given interrupt descriptor, sets default `SMP` affinity and clears the `pending_mask` of the given interrupt descriptor depends on the value of the `CONFIG_GENERIC_PENDING_IRQ` kernel configuration option:
```C
static void desc_smp_init(struct irq_desc *desc, int node)
{
desc->irq_data.node = node;
cpumask_copy(desc->irq_data.affinity, irq_default_affinity);
#ifdef CONFIG_GENERIC_PENDING_IRQ
cpumask_clear(desc->pending_mask);
#endif
}
```
In the end of the `early_irq_init` function we return the return value of the `arch_early_irq_init` function:
```C
return arch_early_irq_init();
```
This function defined in the [kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/kernel/apic/vector.c) and contains only one call of the `arch_early_ioapic_init` function from the [kernel/apic/io_apic.c](https://github.com/torvalds/linux/blob/master/kernel/apic/io_apic.c). As we can understand from the `arch_early_ioapic_init` function's name, this function makes early initialization of the [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). First of all it make a check of the number of the legacy interrupts wit the call of the `nr_legacy_irqs` function. If we have no legacy interrupts with the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) programmable interrupt controller we set `io_apic_irqs` to the `0xffffffffffffffff`:
```C
if (!nr_legacy_irqs())
io_apic_irqs = ~0UL;
```
After this we are going through the all `I/O APICs` and allocate space for the registers with the call of the `alloc_ioapic_saved_registers`:
```C
for_each_ioapic(i)
alloc_ioapic_saved_registers(i);
```
And in the end of the `arch_early_ioapic_init` function we are going through the all legacy irqs (from `IRQ0` to `IRQ15`) in the loop and allocate space for the `irq_cfg` which represents configuration of an irq on the given `NUMA` node:
```C
for (i = 0; i < nr_legacy_irqs(); i++) {
cfg = alloc_irq_and_cfg_at(i, node);
cfg->vector = IRQ0_VECTOR + i;
cpumask_setall(cfg->domain);
}
```
That's all.
Sparse IRQs
--------------------------------------------------------------------------------
We already saw in the beginning of this part that implementation of the `early_irq_init` function depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Previously we saw implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` configuration option is not set, now let's look on the its implementation when this option is set. Implementation of this function very similar, but little differ. We can see the same definition of variables and call of the `init_irq_default_affinity` in the beginning of the `early_irq_init` function:
```C
#ifdef CONFIG_SPARSE_IRQ
int __init early_irq_init(void)
{
int i, initcnt, node = first_online_node;
struct irq_desc *desc;
init_irq_default_affinity();
...
...
...
}
#else
...
...
...
```
But after this we can see the following call:
```C
initcnt = arch_probe_nr_irqs();
```
The `arch_probe_nr_irqs` function defined in the [arch/x86/kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/apic/vector.c) and calculates count of the pre-allocated irqs and update `nr_irqs` with its number. But stop. Why there are pre-allocated irqs? There is alternative form of interrupts called - [Message Signaled Interrupts](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts) available in the [PCI](https://en.wikipedia.org/wiki/Conventional_PCI). Instead of assigning a fixed number of the interrupt request, the device is allowed to record a message at a particular address of RAM, in fact, the display on the [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs). `MSI` permits a device to allocate `1`, `2`, `4`, `8`, `16` or `32` interrupts and `MSI-X` permits a device to allocate up to `2048` interrupts. Now we know that irqs can be pre-allocated. More about `MSI` will be in a next part, but now let's look on the `arch_probe_nr_irqs` function. We can see the check which assign amount of the interrupt vectors for the each processor in the system to the `nr_irqs` if it is greater and calculate the `nr` which represents number of `MSI` interrupts:
```C
int nr_irqs = NR_IRQS;
if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
nr_irqs = NR_VECTORS * nr_cpu_ids;
nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;
```
Take a look on the `gsi_top` variable. Each `APIC` is identified with its own `ID` and with the offset where its `IRQ` starts. It is called `GSI` base or `Global System Interrupt` base. So the `gsi_top` represents it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table (you can remember that we have parsed this table in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux Kernel initialization process chapter).
After this we update the `nr` depends on the value of the `gsi_top`:
```C
#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
if (gsi_top <= NR_IRQS_LEGACY)
nr += 8 * nr_cpu_ids;
else
nr += gsi_top * 16;
#endif
```
Update the `nr_irqs` if it less than `nr` and return the number of the legacy irqs:
```C
if (nr < nr_irqs)
nr_irqs = nr;
return nr_legacy_irqs();
}
```
The next after the `arch_probe_nr_irqs` is printing information about number of `IRQs`:
```C
printk(KERN_INFO "NR_IRQS:%d nr_irqs:%d %d\n", NR_IRQS, nr_irqs, initcnt);
```
We can find it in the [dmesg](https://en.wikipedia.org/wiki/Dmesg) output:
```
$ dmesg | grep NR_IRQS
[ 0.000000] NR_IRQS:4352 nr_irqs:488 16
```
After this we do some checks that `nr_irqs` and `initcnt` values is not greater than maximum allowable number of `irqs`:
```C
if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
nr_irqs = IRQ_BITMAP_BITS;
if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
initcnt = IRQ_BITMAP_BITS;
```
where `IRQ_BITMAP_BITS` is equal to the `NR_IRQS` if the `CONFIG_SPARSE_IRQ` is not set and `NR_IRQS + 8196` in other way. In the next step we are going over all interrupt descriptors which need to be allocated in the loop and allocate space for the descriptor and insert to the `irq_desc_tree` [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html):
```C
for (i = 0; i < initcnt; i++) {
desc = alloc_desc(i, node, NULL);
set_bit(i, allocated_irqs);
irq_insert_desc(i, desc);
}
```
In the end of the `early_irq_init` function we return the value of the call of the `arch_early_irq_init` function as we did it already in the previous variant when the `CONFIG_SPARSE_IRQ` option was not set:
```C
return arch_early_irq_init();
```
That's all.
Conclusion
--------------------------------------------------------------------------------
It is the end of the seventh part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we started to dive into external hardware interrupts in this part. We saw early initialization of the `irq_desc` structure which represents description of an external interrupt and contains information about it like list of irq actions, information about interrupt handler, interrupt's owner, count of the unhandled interrupt and etc. In the next part we will continue to research external interrupts.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access)
* [Enum type](https://en.wikipedia.org/wiki/Enumerated_type)
* [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [critical section](https://en.wikipedia.org/wiki/Critical_section)
* [Lock validator](https://lwn.net/Articles/185666/)
* [MSI](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts)
* [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs)
* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259)
* [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
* [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification)
* [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html)
* [dmesg](https://en.wikipedia.org/wiki/Dmesg)

542
Interrupts/interrupts-8.md Normal file
View File

@@ -0,0 +1,542 @@
Interrupts and Interrupt Handling. Part 8.
================================================================================
Non-early initialization of the IRQs
--------------------------------------------------------------------------------
This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-7.html) we started to dive into the external hardware [interrupts](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). We looked on the implementation of the `early_irq_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) source code file and saw the initialization of the `irq_desc` structure in this function. Remind that `irq_desc` structure (defined in the [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h#L46) is the foundation of interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization stuff which is related to the external hardware interrupts.
Right after the call of the `early_irq_init` function in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) we can see the call of the `init_IRQ` function. This function is architecture-specific and defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c). The `init_IRQ` function makes initialization of the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable that defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) source code file:
```C
...
DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
[0 ... NR_VECTORS - 1] = -1,
};
...
```
and represents `percpu` array of the interrupt vector numbers. The `vector_irq_t` defined in the [arch/x86/include/asm/hw_irq.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/hw_irq.h) and expands to the:
```C
typedef int vector_irq_t[NR_VECTORS];
```
where `NR_VECTORS` is count of the vector number and as you can remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-1.html) of this chapter it is `256` for the [x86_64](https://en.wikipedia.org/wiki/X86-64):
```C
#define NR_VECTORS 256
```
So, in the start of the `init_IRQ` function we fill the `vecto_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) array with the vector number of the `legacy` interrupts:
```C
void __init init_IRQ(void)
{
int i;
for (i = 0; i < nr_legacy_irqs(); i++)
per_cpu(vector_irq, 0)[IRQ0_VECTOR + i] = i;
...
...
...
}
```
This `vector_irq` will be used during the first steps of an external hardware interrupt handling in the `do_IRQ` function from the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c):
```C
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
...
...
...
irq = __this_cpu_read(vector_irq[vector]);
if (!handle_irq(irq, regs)) {
...
...
...
}
exiting_irq();
...
...
return 1;
}
```
Why is `legacy` here? Actually all interrupts are handled by the modern [IO-APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs) controller. But these interrupts (from `0x30` to `0x3f`) by legacy interrupt-controllers like [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller). If these interrupts are handled by the `I/O APIC` then this vector space will be freed and re-used. Let's look on this code closer. First of all the `nr_legacy_irqs` defined in the [arch/x86/include/asm/i8259.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/i8259.h) and just returns the `nr_legacy_irqs` field from the `legacy_pic` structure:
```C
static inline int nr_legacy_irqs(void)
{
return legacy_pic->nr_legacy_irqs;
}
```
This structure defined in the same header file and represents non-modern programmable interrupts controller:
```C
struct legacy_pic {
int nr_legacy_irqs;
struct irq_chip *chip;
void (*mask)(unsigned int irq);
void (*unmask)(unsigned int irq);
void (*mask_all)(void);
void (*restore_mask)(void);
void (*init)(int auto_eoi);
int (*irq_pending)(unsigned int irq);
void (*make_irq)(unsigned int irq);
};
```
Actual default maximum number of the legacy interrupts represented by the `NR_IRQ_LEGACY` macro from the [arch/x86/include/asm/irq_vectors.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_vectors.h):
```C
#define NR_IRQS_LEGACY 16
```
In the loop we are accessing the `vecto_irq` per-cpu array with the `per_cpu` macro by the `IRQ0_VECTOR + i` index and write the legacy vector number there. The `IRQ0_VECTOR` macro defined in the [arch/x86/include/asm/irq_vectors.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_vectors.h) header file and expands to the `0x30`:
```C
#define FIRST_EXTERNAL_VECTOR 0x20
#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)
```
Why is `0x30` here? You can remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-1.html) of this chapter that first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture). So, it means that we fill the `vector_irq` from the `IRQ0_VECTOR` which is equal to the `32` to the `IRQ0_VECTOR + 16` (before the `0x30`).
In the end of the `init_IRQ` function we can see the call of the following function:
```C
x86_init.irqs.intr_init();
```
from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c) source code file. If you have read [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you can remember the `x86_init` structure. This structure contains a couple of files which are points to the function related to the platform setup (`x86_64` in our case), for example `resources` - related with the memory resources, `mpparse` - related with the parsing of the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table and etc.). As we can see the `x86_init` also contains the `irqs` field which contains three following fields:
```C
struct x86_init_ops x86_init __initdata
{
...
...
...
.irqs = {
.pre_vector_init = init_ISA_irqs,
.intr_init = native_init_IRQ,
.trap_init = x86_init_noop,
},
...
...
...
}
```
Now, we are interesting in the `native_init_IRQ`. As we can note, the name of the `native_init_IRQ` function contains the `native_` prefix which means that this function is architecture-specific. It defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) and executes general initialization of the [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs) and initialization of the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture) irqs. Let's look on the implementation of the `native_init_IRQ` function and will try to understand what occurs there. The `native_init_IRQ` function starts from the execution of the following function:
```C
x86_init.irqs.pre_vector_init();
```
As we can see above, the `pre_vector_init` points to the `init_ISA_irqs` function that defined in the same [source code](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) file and as we can understand from the function's name, it makes initialization of the `ISA` related interrupts. The `init_ISA_irqs` function starts from the definition of the `chip` variable which has a `irq_chip` type:
```C
void __init init_ISA_irqs(void)
{
struct irq_chip *chip = legacy_pic->chip;
...
...
...
```
The `irq_chip` structure defined in the [include/linux/irq.h](https://github.com/torvalds/linux/blob/master/include/linux/irq.h) header file and represents hardware interrupt chip descriptor. It contains:
* `name` - name of a device. Used in the `/proc/interrupts`:
```C
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 16 0 0 0 0 0 0 0 IO-APIC 2-edge timer
1: 2 0 0 0 0 0 0 0 IO-APIC 1-edge i8042
8: 1 0 0 0 0 0 0 0 IO-APIC 8-edge rtc0
```
look on the last column;
* `(*irq_mask)(struct irq_data *data)` - mask an interrupt source;
* `(*irq_ack)(struct irq_data *data)` - start of a new interrupt;
* `(*irq_startup)(struct irq_data *data)` - start up the interrupt;
* `(*irq_shutdown)(struct irq_data *data)` - shutdown the interrupt
* and etc.
fields. Note that the `irq_data` structure represents set of the per irq chip data passed down to chip functions. It contains `mask` - precomputed bitmask for accessing the chip registers, `irq` - interrupt number, `hwirq` - hardware interrupt number, local to the interrupt domain chip low level interrupt hardware access and etc.
After this depends on the `CONFIG_X86_64` and `CONFIG_X86_LOCAL_APIC` kernel configuration option call the `init_bsp_APIC` function from the [arch/x86/kernel/apic/apic.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/apic/apic.c):
```C
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
init_bsp_APIC();
#endif
```
This function makes initialization of the [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) of `bootstrap processor` (or processor which starts first). It starts from the check that we found [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) config (read more about it in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux kernel initialization process chapter) and the processor has `APIC`:
```C
if (smp_found_config || !cpu_has_apic)
return;
```
In other way we return from this function. In the next step we call the `clear_local_APIC` function from the same source code file that shutdowns the local `APIC` (more about it will be in the chapter about the `Advanced Programmable Interrupt Controller`) and enable `APIC` of the first processor by the setting `unsigned int value` to the `APIC_SPIV_APIC_ENABLED`:
```C
value = apic_read(APIC_SPIV);
value &= ~APIC_VECTOR_MASK;
value |= APIC_SPIV_APIC_ENABLED;
```
and writing it with the help of the `apic_write` function:
```C
apic_write(APIC_SPIV, value);
```
After we have enabled `APIC` for the bootstrap processor, we return to the `init_ISA_irqs` function and in the next step we initialize legacy `Programmable Interrupt Controller` and set the legacy chip and handler for the each legacy irq:
```C
legacy_pic->init(0);
for (i = 0; i < nr_legacy_irqs(); i++)
irq_set_chip_and_handler(i, chip, handle_level_irq);
```
Where can we find `init` function? The `legacy_pic` defined in the [arch/x86/kernel/i8259.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/i8259.c) and it is:
```C
struct legacy_pic *legacy_pic = &default_legacy_pic;
```
Where the `default_legacy_pic` is:
```C
struct legacy_pic default_legacy_pic = {
...
...
...
.init = init_8259A,
...
...
...
}
```
The `init_8259A` function defined in the same source code file and executes initialization of the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) ``Programmable Interrupt Controller` (more about it will be in the separate chapter about `Programmable Interrupt Controllers` and `APIC`).
Now we can return to the `native_init_IRQ` function, after the `init_ISA_irqs` function finished its work. The next step is the call of the `apic_intr_init` function that allocates special interrupt gates which are used by the [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) architecture for the [Inter-processor interrupt](https://en.wikipedia.org/wiki/Inter-processor_interrupt). The `alloc_intr_gate` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) used for the interrupt descriptor allocation:
```C
#define alloc_intr_gate(n, addr) \
do { \
alloc_system_vector(n); \
set_intr_gate(n, addr); \
} while (0)
```
As we can see, first of all it expands to the call of the `alloc_system_vector` function that checks the given vector number in the `user_vectors` bitmap (read previous [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-7.html) about it) and if it is not set in the `user_vectors` bitmap we set it. After this we test that the `first_system_vector` is greater than given interrupt vector number and if it is greater we assign it:
```C
if (!test_bit(vector, used_vectors)) {
set_bit(vector, used_vectors);
if (first_system_vector > vector)
first_system_vector = vector;
} else {
BUG();
}
```
We already saw the `set_bit` macro, now let's look on the `test_bit` and the `first_system_vector`. The first `test_bit` macro defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/bitops.h) and looks like this:
```C
#define test_bit(nr, addr) \
(__builtin_constant_p((nr)) \
? constant_test_bit((nr), (addr)) \
: variable_test_bit((nr), (addr)))
```
We can see the [ternary operator](https://en.wikipedia.org/wiki/Ternary_operation) here make a test with the [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) built-in function `__builtin_constant_p` tests that given vector number (`nr`) is known at compile time. If you're feeling misunderstanding of the `__builtin_constant_p`, we can make simple test:
```C
#include <stdio.h>
#define PREDEFINED_VAL 1
int main() {
int i = 5;
printf("__builtin_constant_p(i) is %d\n", __builtin_constant_p(i));
printf("__builtin_constant_p(PREDEFINED_VAL) is %d\n", __builtin_constant_p(PREDEFINED_VAL));
printf("__builtin_constant_p(100) is %d\n", __builtin_constant_p(100));
return 0;
}
```
and look on the result:
```
$ gcc test.c -o test
$ ./test
__builtin_constant_p(i) is 0
__builtin_constant_p(PREDEFINED_VAL) is 1
__builtin_constant_p(100) is 1
```
Now I think it must be clear for you. Let's get back to the `test_bit` macro. If the `__builtin_constant_p` will return non-zero, we call `constant_test_bit` function:
```C
static inline int constant_test_bit(int nr, const void *addr)
{
const u32 *p = (const u32 *)addr;
return ((1UL << (nr & 31)) & (p[nr >> 5])) != 0;
}
```
and the `variable_test_bit` in other way:
```C
static inline int variable_test_bit(int nr, const void *addr)
{
u8 v;
const u32 *p = (const u32 *)addr;
asm("btl %2,%1; setc %0" : "=qm" (v) : "m" (*p), "Ir" (nr));
return v;
}
```
What's the difference between two these functions and why do we need in two different functions for the same purpose? As you already can guess main purpose is optimization. If we will write simple example with these functions:
```C
#define CONST 25
int main() {
int nr = 24;
variable_test_bit(nr, (int*)0x10000000);
constant_test_bit(CONST, (int*)0x10000000)
return 0;
}
```
and will look on the assembly output of our example we will see following assembly code:
```assembly
pushq %rbp
movq %rsp, %rbp
movl $268435456, %esi
movl $25, %edi
call constant_test_bit
```
for the `constant_test_bit`, and:
```assembly
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $24, -4(%rbp)
movl -4(%rbp), %eax
movl $268435456, %esi
movl %eax, %edi
call variable_test_bit
```
for the `variable_test_bit`. These two code listings starts with the same part, first of all we save base of the current stack frame in the `%rbp` register. But after this code for both examples is different. In the first example we put `$268435456` (here the `$268435456` is our second parameter - `0x10000000`) to the `esi` and `$25` (our first parameter) to the `edi` register and call `constant_test_bit`. We put function parameters to the `esi` and `edi` registers because as we are learning Linux kernel for the `x86_64` architecture we use `System V AMD64 ABI` [calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions). All is pretty simple. When we are using predefined constant, the compiler can just substitute its value. Now let's look on the second part. As you can see here, the compiler can not substitute value from the `nr` variable. In this case compiler must calculate its offset on the program's [stack frame](https://en.wikipedia.org/wiki/Call_stack). We subtract `16` from the `rsp` register to allocate stack for the local variables data and put the `$24` (value of the `nr` variable) to the `rbp` with offset `-4`. Our stack frame will be like this:
```
<- stack grows
%[rbp]
|
+----------+ +---------+ +---------+ +--------+
| | | | | return | | |
| nr |-| |-| |-| argc |
| | | | | address | | |
+----------+ +---------+ +---------+ +--------+
|
%[rsp]
```
After this we put this value to the `eax`, so `eax` register now contains value of the `nr`. In the end we do the same that in the first example, we put the `$268435456` (the first parameter of the `variable_test_bit` function) and the value of the `eax` (value of `nr`) to the `edi` register (the second parameter of the `variable_test_bit function`).
The next step after the `apic_intr_init` function will finish its work is the setting interrupt gates from the `FIRST_EXTERNAL_VECTOR` or `0x20` to the `0x256`:
```C
i = FIRST_EXTERNAL_VECTOR;
#ifndef CONFIG_X86_LOCAL_APIC
#define first_system_vector NR_VECTORS
#endif
for_each_clear_bit_from(i, used_vectors, first_system_vector) {
set_intr_gate(i, irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR));
}
```
But as we are using the `for_each_clear_bit_from` helper, we set only non-initialized interrupt gates. After this we use the same `for_each_clear_bit_from` helper to fill the non-filled interrupt gates in the interrupt table with the `spurious_interrupt`:
```C
#ifdef CONFIG_X86_LOCAL_APIC
for_each_clear_bit_from(i, used_vectors, NR_VECTORS)
set_intr_gate(i, spurious_interrupt);
#endif
```
Where the `spurious_interrupt` function represent interrupt handler for the `spurious` interrupt. Here the `used_vectors` is the `unsigned long` that contains already initialized interrupt gates. We already filled first `32` interrupt vectors in the `trap_init` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file:
```C
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
set_bit(i, used_vectors);
```
You can remember how we did it in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-6.html) of this chapter.
In the end of the `native_init_IRQ` function we can see the following check:
```C
if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs())
setup_irq(2, &irq2);
```
First of all let's deal with the condition. The `acpi_ioapic` variable represents existence of [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs). It defined in the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c). This variable set in the `acpi_set_irq_model_ioapic` function that called during the processing `Multiple APIC Description Table`. This occurs during initialization of the architecture-specific stuff in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) (more about it we will know in the other chapter about [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)). Note that the value of the `acpi_ioapic` variable depends on the `CONFIG_ACPI` and `CONFIG_X86_LOCAL_APIC` Linux kernel configuration options. If these options did not set, this variable will be just zero:
```C
#define acpi_ioapic 0
```
The second condition - `!of_ioapic && nr_legacy_irqs()` checks that we do not use [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware) `I/O APIC` and legacy interrupt controller. We already know about the `nr_legacy_irqs`. The second is `of_ioapic` variable defined in the [arch/x86/kernel/devicetree.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/devicetree.c) and initialized in the `dtb_ioapic_setup` function that build information about `APICs` in the [devicetree](https://en.wikipedia.org/wiki/Device_tree). Note that `of_ioapic` variable depends on the `CONFIG_OF` Linux kernel configuration option. If this option is not set, the value of the `of_ioapic` will be zero too:
```C
#ifdef CONFIG_OF
extern int of_ioapic;
...
...
...
#else
#define of_ioapic 0
...
...
...
#endif
```
If the condition will return non-zero value we call the:
```C
setup_irq(2, &irq2);
```
function. First of all about the `irq2`. The `irq2` is the `irqaction` structure that defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irqinit.c) source code file and represents `IRQ 2` line that is used to query devices connected cascade:
```C
static struct irqaction irq2 = {
.handler = no_action,
.name = "cascade",
.flags = IRQF_NO_THREAD,
};
```
Some time ago interrupt controller consisted of two chips and one was connected to second. The second chip that was connected to the first chip via this `IRQ 2` line. This chip serviced lines from `8` to `15` and after this lines of the first chip. So, for example [Intel 8259A](https://en.wikipedia.org/wiki/Intel_8259) has following lines:
* `IRQ 0` - system time;
* `IRQ 1` - keyboard;
* `IRQ 2` - used for devices which are cascade connected;
* `IRQ 8` - [RTC](https://en.wikipedia.org/wiki/Real-time_clock);
* `IRQ 9` - reserved;
* `IRQ 10` - reserved;
* `IRQ 11` - reserved;
* `IRQ 12` - `ps/2` mouse;
* `IRQ 13` - coprocessor;
* `IRQ 14` - hard drive controller;
* `IRQ 1` - reserved;
* `IRQ 3` - `COM2` and `COM4`;
* `IRQ 4` - `COM1` and `COM3`;
* `IRQ 5` - `LPT2`;
* `IRQ 6` - drive controller;
* `IRQ 7` - `LPT1`.
The `setup_irq` function defined in the [kernel/irq/manage.c](https://github.com/torvalds/linux/blob/master/kernel/irq/manage.c) and takes two parameters:
* vector number of an interrupt;
* `irqaction` structure related with an interrupt.
This function initializes interrupt descriptor from the given vector number at the beginning:
```C
struct irq_desc *desc = irq_to_desc(irq);
```
And call the `__setup_irq` function that setups given interrupt:
```C
chip_bus_lock(desc);
retval = __setup_irq(irq, desc, act);
chip_bus_sync_unlock(desc);
return retval;
```
Note that the interrupt descriptor is locked during `__setup_irq` function will work. The `__setup_irq` function makes many different things: It creates a handler thread when a thread function is supplied and the interrupt does not nest into another interrupt thread, sets the flags of the chip, fills the `irqaction` structure and many many more.
All of the above it creates `/prov/vector_number` directory and fills it, but if you are using modern computer all values will be zero there:
```
$ cat /proc/irq/2/node
0
$cat /proc/irq/2/affinity_hint
00
cat /proc/irq/2/spurious
count 0
unhandled 0
last_unhandled 0 ms
```
because probably `APIC` handles interrupts on the our machine.
That's all.
Conclusion
--------------------------------------------------------------------------------
It is the end of the eighth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we started to do it and saw early initialization of the `IRQs`. In this part we already saw non-early interrupts initialization in the `init_IRQ` function. We saw initialization of the `vector_irq` per-cpu array which is store vector numbers of the interrupts and will be used during interrupt handling and initialization of other stuff which is related to the external hardware interrupts.
In the next part we will continue to learn interrupts handling related stuff and will see initialization of the `softirqs`.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259)
* [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
* [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture)
* [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification)
* [Local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#Integrated_local_APICs)
* [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs)
* [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing)
* [Inter-processor interrupt](https://en.wikipedia.org/wiki/Inter-processor_interrupt)
* [ternary operator](https://en.wikipedia.org/wiki/Ternary_operation)
* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions)
* [PDF. System V Application Binary Interface AMD64](http://x86-64.org/documentation/abi.pdf)
* [Call stack](https://en.wikipedia.org/wiki/Call_stack)
* [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware)
* [devicetree](https://en.wikipedia.org/wiki/Device_tree)
* [RTC](https://en.wikipedia.org/wiki/Real-time_clock)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-7.html)

526
Interrupts/interrupts-9.md Normal file
View File

@@ -0,0 +1,526 @@
中断和中断处理(九)
================================================================================
延后中断(软中断Tasklets 和工作队列)介绍
--------------------------------------------------------------------------------
这是 Linux 内核[中断和中断处理](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/index.html)的第九节,在[上一节](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/interrupts-8.html)我们分析了源文件 [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irqinit.c) 中的 `init_IRQ` 实现。接下来的这一节我们将继续深入学习外部硬件中断的初始化。
中断处理会有一些特点,其中最主要的两个是:
* 中断处理必须快速执行完毕
* 有时中断处理必须做很多冗长的事情
就像你所想到的,我们几乎不可能同时做到这两点,之前的中断被分为两部分:
* 前半部
* 后半部
`后半部` 曾经是 Linux 内核延后中断执行的一种方式,但现在的实际情况已经不是这样了。现在它已作为一个遗留称谓代表内核中所有延后中断的机制。如你所知,中断处理代码运行于中断处理上下文中,此时禁止响应后续的中断,所以要避免中断处理代码长时间执行。但有些中断却又需要执行很多工作,所以中断处理有时会被分为两部分。第一部分中,中断处理先只做尽量少的重要工作,接下来提交第二部分给内核调度,然后就结束运行。当系统比较空闲并且处理器上下文允许处理中断时,第二部分被延后的剩余任务就会开始执行。
当前实现延后中断的有如下三种途径:
* `软中断`
* `tasklets`
* `工作队列`
在这一小节我们将详细介绍这三种实现,现在是时间深入了解一下了。
软中断
----------------------------------------------------------------------------------
伴随着内核对并行处理的支持,出于性能考虑,所有新的下半部实现方案都基于被称之为 `ksoftirqd` (稍后将详细讨论)的内核线程。每个处理器都有自己的内核线程,名字叫做 `ksoftirqd/n`n是处理器的编号。我们可以通过系统命令 `systemd-cgls` 看到它们:
```
$ systemd-cgls -k | grep ksoft
├─ 3 [ksoftirqd/0]
├─ 13 [ksoftirqd/1]
├─ 18 [ksoftirqd/2]
├─ 23 [ksoftirqd/3]
├─ 28 [ksoftirqd/4]
├─ 33 [ksoftirqd/5]
├─ 38 [ksoftirqd/6]
├─ 43 [ksoftirqd/7]
```
`spawn_ksoftirqd` 函数启动这些线程。就像我们看到的,这个函数在早期的 [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/index.html) 被调用。
```C
early_initcall(spawn_ksoftirqd);
```
软中断在 Linux 内核编译时就静态地确定了。`open_softirq` 函数负责 `softirq` 初始化,它在 [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) 中定义:
```C
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
softirq_vec[nr].action = action;
}
```
这个函数有两个参数:
* `softirq_vec` 数组的索引序号
* 一个指向软中断处理函数的指针
我们首先来看 `softirq_vec` 数组:
```C
static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
```
它在同一源文件中定义。`softirq_vec` 数组包含了 `NR_SOFTIRQS` (其值为10)个不同 `softirq` 类型的 `softirq_action`。当前版本的 Linux 内核定义了十种软中断向量。其中两个 tasklet 相关,两个网络相关,两个块处理相关,两个定时器相关,另外调度器和 RCU 也各占一个。所有这些都在一个枚举中定义:
```C
enum
{
HI_SOFTIRQ=0,
TIMER_SOFTIRQ,
NET_TX_SOFTIRQ,
NET_RX_SOFTIRQ,
BLOCK_SOFTIRQ,
BLOCK_IOPOLL_SOFTIRQ,
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
RCU_SOFTIRQ,
NR_SOFTIRQS
};
```
以上软中断的名字在如下的数组中定义:
```C
const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
"TASKLET", "SCHED", "HRTIMER", "RCU"
};
```
我们也可以在 `/proc/softirqs` 的输出中看到他们:
```
~$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
HI: 5 0 0 0 0 0 0 0
TIMER: 332519 310498 289555 272913 282535 279467 282895 270979
NET_TX: 2320 0 0 2 1 1 0 0
NET_RX: 270221 225 338 281 311 262 430 265
BLOCK: 134282 32 40 10 12 7 8 8
BLOCK_IOPOLL: 0 0 0 0 0 0 0 0
TASKLET: 196835 2 3 0 0 0 0 0
SCHED: 161852 146745 129539 126064 127998 128014 120243 117391
HRTIMER: 0 0 0 0 0 0 0 0
RCU: 337707 289397 251874 239796 254377 254898 267497 256624
```
可以看到 `softirq_vec` 数组的类型为 `softirq_action`。这是软中断机制里一个重要的数据结构,它只有一个指向中断处理函数的成员:
```C
struct softirq_action
{
void (*action)(struct softirq_action *);
};
```
现在我们可以理解到 `open_softirq` 函数实际上用 `softirq_action` 参数填充了 `softirq_vec` 数组。由 `open_softirq` 注册的延后中断处理函数会由 `raise_softirq` 调用。这个函数只有一个参数 -- 软中断序号 `nr`。来看下它的实现:
```C
void raise_softirq(unsigned int nr)
{
unsigned long flags;
local_irq_save(flags);
raise_softirq_irqoff(nr);
local_irq_restore(flags);
}
```
可以看到在 `local_irq_save``local_irq_restore` 两个宏中间调用了 `raise_softirq_irqoff` 函数。`local_irq_save` 的定义位于 [include/linux/irqflags.h](https://github.com/torvalds/linux/blob/master/include/linux/irqflags.h) 头文件,它保存了 [eflags](https://en.wikipedia.org/wiki/FLAGS_register) 寄存器中的 [IF](https://en.wikipedia.org/wiki/Interrupt_flag) 标志位并且禁用了当前处理器的中断。`local_irq_restore` 宏定义于相同头文件中,它做了完全相反的事情:装回之前保存的中断标志位然后允许中断。这里之所以要禁用中断是因为将要运行的 `softirq` 中断处理运行于中断上下文中。
`raise_softirq_irqoff` 函数设置当前处理器上和nr参数对应的软中断标志位(`__softirq_pending`)。这是通过以下代码做到的:
```C
__raise_softirq_irqoff(nr);
```
然后,通过 `in_interrupt` 函数获得 `irq_count` 值。我们在这一章的第一[小节](https://www.gitbook.com/book/xinqiu/linux-insides-cn/content/Interrupts/interrupts-1.html)已经知道它是用来检测一个 cpu 是否处于中断环境。如果我们处于中断上下文中,我们就退出 `raise_softirq_irqoff` 函数,装回 `IF` 标志位并允许当前处理器的中断。如果不在中断上下文中,就会调用 `wakeup_softirqd` 函数:
```C
if (!in_interrupt())
wakeup_softirqd();
```
`wakeup_softirqd` 函数会激活当前处理器上的 `ksoftirqd` 内核线程:
```C
static void wakeup_softirqd(void)
{
struct task_struct *tsk = __this_cpu_read(ksoftirqd);
if (tsk && tsk->state != TASK_RUNNING)
wake_up_process(tsk);
}
```
每个 `ksoftirqd` 内核线程都运行 `run_ksoftirqd` 函数来检测是否有延后中断需要处理,如果有的话就会调用 `__do_softirq` 函数。`__do_softirq` 读取当前处理器对应的 `__softirq_pending` 软中断标记,并调用所有已被标记中断对应的处理函数。在执行一个延后函数的同时,可能会发生新的软中断。这会导致用户态代码由于 `__do_softirq` 要处理很多延后中断而很长时间不能返回。为了解决这个问题,系统限制了延后中断处理的最大耗时:
```C
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
...
...
...
restart:
while ((softirq_bit = ffs(pending))) {
...
h->action(h);
...
}
...
...
...
pending = local_softirq_pending();
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
--max_restart)
goto restart;
}
...
```
除周期性检测是否有延后中断需要执行之外,系统还会在一些关键时间点上检测。一个主要的检测时间点就是当定义在 [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c) 的 `do_IRQ` 函数被调用时,这是 Linux 内核中执行延后中断的主要时机。在这个函数将要完成中断处理时它会调用 [arch/x86/include/asm/apic.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/apic.h) 中定义的 `exiting_irq` 函数,`exiting_irq` 又调用了 `irq_exit``irq_exit` 函数会检测当前处理器上下文是否有延后中断,有的话就会调用 `invoke_softirq`
```C
if (!in_interrupt() && local_softirq_pending())
invoke_softirq();
```
这样就调用到了我们上面提到的 `__do_softirq`。每个 `softirq` 都有如下的阶段:通过 `open_softirq` 函数注册一个软中断,通过 `raise_softirq` 函数标记一个软中断来激活它,然后所有被标记的软中断将会在 Linux 内核下一次执行周期性软中断检测时得以调度,对应此类型软中断的处理函数也就得以执行。
从上述可看出,软中断是静态分配的,这对于后期加载的内核模块将是一个问题。基于软中断实现的 `tasklets` 解决了这个问题。
Tasklets
--------------------------------------------------------------------------------
如果你阅读 Linux 内核源码中软中断相关的代码,你会发现它很少会被用到。内核中实现延后中断的主要途径是 `tasklets`。正如上面说的,`tasklets` 构建于 `softirq` 中断之上,他是基于下面两个软中断实现的:
* `TASKLET_SOFTIRQ`;
* `HI_SOFTIRQ`.
简而言之,`tasklets` 是运行时分配和初始化的软中断。和软中断不同的是,同一类型的 `tasklets` 可以在同一时间运行于不同的处理器上。我们已经了解到一些关于软中断的知识,当然上面的文字并不能详细讲解所有的细节,但我们现在可以通过直接阅读代码一步步的更深入了解软中断。我们返回到开始部分讨论的 `softirq_init` 函数实现,这个函数在 [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) 中定义如下:
```C
void __init softirq_init(void)
{
int cpu;
for_each_possible_cpu(cpu) {
per_cpu(tasklet_vec, cpu).tail =
&per_cpu(tasklet_vec, cpu).head;
per_cpu(tasklet_hi_vec, cpu).tail =
&per_cpu(tasklet_hi_vec, cpu).head;
}
open_softirq(TASKLET_SOFTIRQ, tasklet_action);
open_softirq(HI_SOFTIRQ, tasklet_hi_action);
}
```
可以看到在函数开头定义了一个名为 cpu 的 integer 类型变量。接下来他会作为参数传递给宏 `for_each_possible_cpu` 来获得系统中所有的处理器。如果 `possible_cpu` 对你来说是一个新的术语,你可以阅读 [CPU masks](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/cpumask.html) 章节来了解更多知识。简单的说,`possible_cpu` 是系统运行期间插入的处理器集合。所有的 `possible processor` 存储在 `cpu_possible_bits` 位图中,你可以在 [kernel/cpu.c](https://github.com/torvalds/linux/blob/master/kernel/cpu.c) 中找到他的定义:
```C
static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
...
...
...
const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
```
好了,我们定义了 integer 类型变量 `cpu` 并且通过 `for_each_possible_cpu` 宏遍历了所有处理器,初始化了两个 `per-cpu` 变量:
* `tasklet_vec`;
* `tasklet_hi_vec`;
这两个 `per-cpu` 变量和 `softirq_init` 函数都定义在相同[代码](https://github.com/torvalds/linux/blob/master/kernel/softirq.c)中,他们被定义为 `tasklet_head` 类型:
```C
static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);
```
`tasklet_head` 结构代表一组 `Tasklets`它包含两个成员head 和 tail
```C
struct tasklet_head {
struct tasklet_struct *head;
struct tasklet_struct **tail;
};
```
`tasklet_struct` 数据类型在 [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) 中定义,它代表一个 `Tasklet`。这本书之前部分我们没有见过这个单词,那我们先试着理解一下 `Tasklet` 究竟为何物。实际上,`Tasklet` 是处理延后中断的一种机制,来看一下 `tasklet_struct` 的具体定义:
```C
struct tasklet_struct
{
struct tasklet_struct *next;
unsigned long state;
atomic_t count;
void (*func)(unsigned long);
unsigned long data;
};
```
这个数据结构包含有下面5个成员
* 调度队列中的下一个 `Tasklet`
* 当前这个 `Tasklet` 的状态
* 这个 `Tasklet` 是否处于活动状态
* `Tasklet` 的回调函数
* 回调函数的参数
上面代码中,在 `softirq_init` 函数中初始化了两个 tasklets 数组:`tasklet_vec``tasklet_hi_vec`。Tasklets 和高优先级 Tasklets 分别存储于这两个数组中。初始化完成后我们看到代码 [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) 在 `softirq_init` 函数的最后又两次调用了 `open_softirq`
```C
open_softirq(TASKLET_SOFTIRQ, tasklet_action);
open_softirq(HI_SOFTIRQ, tasklet_hi_action);
```
`open_softirq` 函数的主要作用是初始化软中断,接下来让我们看看它是怎么做的。和 Tasklets 相关的软中断处理函数有两个,分别是 `tasklet_action``tasklet_hi_action`。其中 `tasklet_hi_action``HI_SOFTIRQ` 关联在一起,`tasklet_action``TASKLET_SOFTIRQ` 关联在一起。
Linux 内核提供一些 API 供操作 Tasklets 之用。首先是 `tasklet_init` 函数,它接受一个 `task_struct` 数据结构,一个处理函数,和另外一个参数,并利用这些参数来初始化所给的 `task_struct` 结构:
```C
void tasklet_init(struct tasklet_struct *t,
void (*func)(unsigned long), unsigned long data)
{
t->next = NULL;
t->state = 0;
atomic_set(&t->count, 0);
t->func = func;
t->data = data;
}
```
另外还有如下两个宏可以静态地初始化一个 tasklet
```C
DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);
```
Linux 内核提供三个函数标记一个 tasklet 已经准备就绪:
```C
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule_first(struct tasklet_struct *t);
```
第一个函数使用普通优先级调度一个 tasklet第二个使用高优先级第三个则用更高优先级。所有这三个函数的实现都很类似所以我们只看一下第一个 `tasklet_schedule` 的实现:
```C
static inline void tasklet_schedule(struct tasklet_struct *t)
{
if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
__tasklet_schedule(t);
}
void __tasklet_schedule(struct tasklet_struct *t)
{
unsigned long flags;
local_irq_save(flags);
t->next = NULL;
*__this_cpu_read(tasklet_vec.tail) = t;
__this_cpu_write(tasklet_vec.tail, &(t->next));
raise_softirq_irqoff(TASKLET_SOFTIRQ);
local_irq_restore(flags);
}
```
我们看到它检测并设置所给的 tasklet 为 `TASKLET_STATE_SCHED` 状态,然后以所给 tasklet 为参数执行了 `__tasklet_schedule` 函数。`__tasklet_schedule` 看起来和前面见到的 `raise_softirq` 很像。一开始它保存中断标志并禁用中断,继而将新的 tasklet 添加到 `tasklet_vec`,然后调用了我们前面见过的 `raise_softirq_irqoff` 函数。当 Linux 内核调度器决定去运行一个延后函数,`tasklet_action` 函数会被作为和 `TASKLET_SOFTIRQ` 相关联的延后函数调用。同样的,`tasklet_hi_action` 会被作为和 `HI_SOFTIRQ` 相关联的延后函数调用。这些函数之所以如此相似是因为他们之间只有一个地方不同 --- `tasklet_action` 使用 `tasklet_vec``tasklet_hi_action` 使用 `tasklet_hi_vec`
让我们看下 `tasklet_action` 函数的实现:
```C
static void tasklet_action(struct softirq_action *a)
{
local_irq_disable();
list = __this_cpu_read(tasklet_vec.head);
__this_cpu_write(tasklet_vec.head, NULL);
__this_cpu_write(tasklet_vec.tail, this_cpu_ptr(&tasklet_vec.head));
local_irq_enable();
while (list) {
if (tasklet_trylock(t)) {
t->func(t->data);
tasklet_unlock(t);
}
...
...
...
}
}
```
`tasklet_action` 开始时利用 `local_irq_disable` 宏禁用了当前处理器的中断(你可以阅读本书[第二部分](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/interrupts-2.html)了解更多关于此宏的信息)。接下来获取到当前处理器对应的普通优先级 tasklet 列表并把它设置为 `NULL` ,这是因为所有的 tasklet 都将被执行。然后使能当前处理器的中断,循环遍历 tasklet 列表,每一次遍历都会对当前 tasklet 调用 `tasklet_trylock` 函数来更新它的状态为 `TASKLET_STATE_RUN`
```C
static inline int tasklet_trylock(struct tasklet_struct *t)
{
return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
}
```
如果这个操作成功了就会执行此 tasklet 的处理函数(我们在 `tasklet_init` 中所设置的),然后调用 `tasklet_unlock` 函数清除他的 `TASKLET_STATE_RUN` 状态。
通常情况下,这就是 `tasklet` 的所有概念。当然这些还不足以覆盖所有的 `tasklets`,但是我想大家可以以此为切入点继续学习下去。
`tasklets` 在 Linux 内核中是一个[广泛](http://lxr.free-electrons.com/ident?i=tasklet_init)使用的概念,但就像我在本章开头所写的,还有第三个延后中断机制 -- 工作队列。接下来我们将会看看它又是怎样一种机制。
工作队列
--------------------------------------------------------------------------------
`工作队列`是另外一个处理延后函数的概念,它大体上和 `tasklets` 类似。工作队列运行于内核进程上下文,而 `tasklets` 运行于软中断上下文。这意味着`工作队列`函数不必像 `tasklets` 一样必须是原子性的。Tasklets 总是运行于它提交自的那个处理器,工作队列在默认情况下也是这样。`工作队列`在 Linux 内核代码 [kernel/workqueue.c](https://github.com/torvalds/linux/blob/master/kernel/workqueue.c) 中由如下的数据结构表示:
```C
struct worker_pool {
spinlock_t lock;
int cpu;
int node;
int id;
unsigned int flags;
struct list_head worklist;
int nr_workers;
...
...
...
```
因为这个结构有非常多的成员,这里就不把它们全部罗列出来,下面只讨论上面列出的这几个。
工作队列最基础的用法,是作为创建内核线程的接口来处理提交到队列里的工作任务。所有这些内核线程称之为 `worker thread`。工作队列内的任务是由代码 [include/linux/workqueue.h](https://github.com/torvalds/linux/blob/master/include/linux/workqueue.h) 中定义的 `work_struct` 表示的,起定义如下:
```C
struct work_struct {
atomic_long_t data;
struct list_head entry;
work_func_t func;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
};
```
这里有两个字段比较有意思:`func` --将被`工作队列`调度执行的函数,`data` --这个函数的参数。Linux 内核提供了称之为 `kworker` 的特定于每个 cpu 的内核线程:
```
systemd-cgls -k | grep kworker
├─ 5 [kworker/0:0H]
├─ 15 [kworker/1:0H]
├─ 20 [kworker/2:0H]
├─ 25 [kworker/3:0H]
├─ 30 [kworker/4:0H]
...
...
...
```
这些线程会被用来调度执行工作队列的延后函数(就像 `ksoftirqd` 之于`软中断`)。除此之外我们还可以为一个`工作队列`创建一个新的工作线程。Linux 内核提供了如下宏静态创建一个队列任务:
```C
#define DECLARE_WORK(n, f) \
struct work_struct n = __WORK_INITIALIZER(n, f)
```
它需要两个参数:工作队列的名字和工作队列的函数。我们还可以在运行时动态创建:
```C
#define INIT_WORK(_work, _func) \
__INIT_WORK((_work), (_func), 0)
#define __INIT_WORK(_work, _func, _onstack) \
do { \
__init_work((_work), _onstack); \
(_work)->data = (atomic_long_t) WORK_DATA_INIT(); \
INIT_LIST_HEAD(&(_work)->entry); \
(_work)->func = (_func); \
} while (0)
```
这个宏需要一个 `work_struct` 数据结构作为将要创建的队列任务,和一个将在这个任务里调度运行的函数。通过这两个宏的其中一个创建一个 `work` 后,我们需要把它放到`工作队列`中去。可以通过 `queue_work` 或者 `queue_delayed_work` 来做到这一点:
```C
static inline bool queue_work(struct workqueue_struct *wq,
struct work_struct *work)
{
return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}
```
`queue_work` 只是调用了 `queue_work_on` 函数指定相应的处理器。注意这里给 `queue_work_on` 函数传递了 `WORK_CPU_UNBOUND` 参数,它作为代表队列任务要绑定到哪一个处理器的枚举一员,定义于 [include/linux/workqueue.h](https://github.com/torvalds/linux/blob/master/include/linux/workqueue.h)。`queue_work_on` 函数测试并设置所给`任务``WORK_STRUCT_PENDING_BIT` 标志位,然后以所给的工作队列和队列任务为参数执行 `__queue_work` 函数:
```C
bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
bool ret = false;
...
if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
__queue_work(cpu, wq, work);
ret = true;
}
...
return ret;
}
```
`__queue_work` 函数得到参数 `work poll`。是的,是 `work poll` 而不是 `workqueue`。实际上,所有的 `works` 都没有放在 `workqueue` 中,而是放在 Linux 内核中由 `worker_pool` 数据结构所定义的 `work poll`。如上所述,`workqueue_struct` 数据结构的 `pwqs` 成员是一个 `worker_pool` 列表。当我们创建一个 `workqueue`,他针对每一个处理器都创建了 `worker_pool`。每一个和 `worker_pool` 相关联的 `pool_workqueue` 都分配在相同的处理器上对应的优先级队列,`workqueue` 通过他们和 `worker_pool` 交互。在 `__queue_work` 函数里使用 `raw_smp_processor_id` 设置 cpu 为当前处理器在[第四章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-4.html)你可以找到更多相关信息),得到与所给 `work_struct` 对应的 `pool_workqueue` 并将 `work` 插入到 `workqueue`
```C
static void __queue_work(int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
...
...
...
if (req_cpu == WORK_CPU_UNBOUND)
cpu = raw_smp_processor_id();
if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
else
pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
...
...
...
insert_work(pwq, work, worklist, work_flags);
```
现在我们可以创建 `works``workqueue`,我们需要知道他们究竟会在何时被执行。就像前面提到的,所有的 `works` 都会在内核线程中执行。当内核线程得到调度,它开始执行 `workqueue` 中的 `works`。每一个工作队列内核线程都会在 `worker_thread` 函数里执行一个循环。这些内核线程会做很多不同的事情,其中一些和本章前面提到的很类似。当开始执行时,所有的 `work_struct``works` 都会从他的 `workqueue` 移除。
总结
--------------------------------------------------------------------------------
现在结束了[中断和中断处理](https://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/index.html)的第九节。这一节中我们继续讨论了外部硬件中断。在之前部分我们看到了 `IRQs` 的初始化和 `irq_desc` 数据结构,在这一节我们看到了用于延后函数的三个概念:`软中断``tasklet``工作队列`
下一节将是 `中断和中断处理` 的最后一节。我们将会了解真正的硬件驱动,并试着学习它是怎样和中断子系统一起工作的。
如果你有任何问题或建议,请给我发评论或者给我发 [Twitter](https://twitter.com/0xAX)。
**请注意英语并不是我的母语,我为任何表达不清楚的地方感到抱歉。如果你发现任何错误请发 PR 到 [linux-insides](https://github.com/MintCN/linux-insides-zh)。(译者注:翻译问题请发 PR 到 [linux-insides-cn](https://www.gitbook.com/book/xinqiu/linux-insides-cn))**
链接
--------------------------------------------------------------------------------
* [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/index.html)
* [IF](https://en.wikipedia.org/wiki/Interrupt_flag)
* [eflags](https://en.wikipedia.org/wiki/FLAGS_register)
* [CPU masks](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/cpumask.html)
* [per-cpu](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html)
* [Workqueue](https://github.com/torvalds/linux/blob/master/Documentation/workqueue.txt)
* [Previous part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Interrupts/interrupts-8.html)