From 0e9f666bcf680eca01fef5abb781532baf25f13b Mon Sep 17 00:00:00 2001
From: keltoy <315090132@qq.com>
Date: Tue, 13 Dec 2016 15:10:59 +0800
Subject: [PATCH 01/21] commit 6.2

---
 .DS_Store          | Bin 0 -> 8196 bytes
 README.md          |   2 +-
 SyncPrim/sync-2.md | 176 ++++++++++++++++++++++-----------------------
 3 files changed, 88 insertions(+), 90 deletions(-)
 create mode 100644 .DS_Store

diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..25adce00be7e6fb5da9b443f6781a9afc5ff3e28
GIT binary patch
literal 8196

literal 0
HcmV?d00001

diff --git a/README.md b/README.md
index fbc44c3..c5b94f2 100644
--- a/README.md
+++ b/README.md
@@ -62,7 +62,7 @@ Linux 内核揭密
 | 6.
[Synchronization primitives](https://github.com/MintCN/linux-insides-zh/tree/master/SyncPrim)||In progress|
 |├ [6.0](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/README.md)|[@mudongliang](https://github.com/mudongliang)|Updated to [6f85b63e](https://github.com/0xAX/linux-insides/commit/6f85b63e347b636e08e965e9dc22c177e972afe2)|
 |├ [6.1](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-1.md)|[@keltoy](https://github.com/keltoy)|Completed|
-|├ [6.2](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-2.md)|[@keltoy](https://github.com/keltoy)|In progress|
+|├ [6.2](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-2.md)|[@keltoy](https://github.com/keltoy)|Completed|
 |├ [6.3](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-3.md)|[@huxq](https://github.com/huxq)|Completed|
 |├ [6.4](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-4.md)|[@huxq](https://github.com/huxq)|In progress|
 |├ [6.5](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-5.md)||Not started|
diff --git a/SyncPrim/sync-2.md b/SyncPrim/sync-2.md
index 5bf78a5..5dc040b 100644
--- a/SyncPrim/sync-2.md
+++ b/SyncPrim/sync-2.md
@@ -1,25 +1,25 @@
-Synchronization primitives in the Linux kernel. Part 2.
+Synchronization primitives in the Linux kernel. Part 2.
 ================================================================================
-Queued Spinlocks
+Queued Spinlocks
 --------------------------------------------------------------------------------
-This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel and in the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter we met the first - [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn this synchronization primitive in this part.
If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides special type of `spinlocks` - `queued spinlocks`. In this part we will try to understand what does this concept represent.
+This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter we met our first synchronization primitive, the [spinlock](https://en.wikipedia.org/wiki/Spinlock); we will continue to study it in this part. If you have read the previous part, you may remember that besides normal spinlocks the Linux kernel provides a special type of `spinlock` - the `queued spinlock`. In this part we will try to understand what this concept represents.

-We saw [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `spinlock` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html):
+In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we saw the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `spinlock`:

-* `spin_lock_init` - produces initialization of the given `spinlock`;
-* `spin_lock` - acquires given `spinlock`;
-* `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquire given `spinlock`.
-* `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on local processor and preserve/not preserve previous interrupt state in the `flags`;
-* `spin_unlock` - releases given `spinlock`;
-* `spin_unlock_bh` - releases given `spinlock` and enables software interrupts;
-* `spin_is_locked` - returns the state of the given `spinlock`;
-* and etc.
+* `spin_lock_init` - initializes the given `spinlock`;
+* `spin_lock` - acquires the given `spinlock`;
+* `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquires the given `spinlock`;
+* `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on the local processor and preserve/do not preserve the previous interrupt state in `flags`;
+* `spin_unlock` - releases the given `spinlock`;
+* `spin_unlock_bh` - releases the given `spinlock` and enables software interrupts;
+* `spin_is_locked` - returns the state of the given `spinlock`;
+* and so on.

-And we know that all of these macro which are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file will be expanded to the call of the functions with `arch_spin_.*` prefix from the [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h) for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we will look at this header fill with attention, we will that these functions (`arch_spin_is_locked`, `arch_spin_lock`, `arch_spin_unlock` and etc) defined only if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is disabled:
+We also know that all of these macros, which are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file, expand on the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture into calls of functions with the `arch_spin_.*` prefix from [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h). If we look at this header file attentively, we will see that these functions (`arch_spin_is_locked`, `arch_spin_lock`, `arch_spin_unlock` and so on) are defined only if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is disabled:

-```C
+```c
 #ifdef CONFIG_QUEUED_SPINLOCKS
 #include <asm/qspinlock.h>
 #else
@@ -34,10 +34,9 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 ...
 #endif
```
+This means that the [arch/x86/include/asm/qspinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/qspinlock.h) header file provides its own implementation of these functions. Actually they are macros, and they are located in another header file - [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126). If we look into that header file, we will find the definitions of these macros:

-This means that the [arch/x86/include/asm/qspinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/qspinlock.h) header file provides own implementation of these functions. Actually they are macros and they are located in other header file. This header file is - [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126). If we will look into this header file, we will find definition of these macros:
-
-```C
+```c
 #define arch_spin_is_locked(l)          queued_spin_is_locked(l)
 #define arch_spin_is_contended(l)       queued_spin_is_contended(l)
 #define arch_spin_value_unlocked(l)     queued_spin_value_unlocked(l)
@@ -48,12 +47,12 @@
 #define arch_spin_unlock_wait(l)        queued_spin_unlock_wait(l)
```

-Before we will consider how queued spinlocks and their [API](https://en.wikipedia.org/wiki/Application_programming_interface) are implemented, we take a look on theoretical part at first.
+Before we consider how queued spinlocks and their [API](https://en.wikipedia.org/wiki/Application_programming_interface) are implemented, let's first take a look at the theory.

-Introduction to queued spinlocks
+Introduction to queued spinlocks
 -------------------------------------------------------------------------------

-Queued spinlocks is a [locking mechanism](https://en.wikipedia.org/wiki/Lock_%28computer_science%29) in the Linux kernel which is replacement for the standard `spinlocks`. At least this is true for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture.
If we will look at the following kernel configuration file - [kernel/Kconfig.locks](https://github.com/torvalds/linux/blob/master/kernel/Kconfig.locks), we will see following configuration entries:
+Queued spinlocks are a [locking mechanism](https://en.wikipedia.org/wiki/Lock_%28computer_science%29) in the Linux kernel which is a replacement for standard `spinlocks`, at least on the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we look at the following kernel configuration file - [kernel/Kconfig.locks](https://github.com/torvalds/linux/blob/master/kernel/Kconfig.locks), we will see the following configuration entries:

```
config ARCH_USE_QUEUED_SPINLOCKS
@@ -63,8 +62,7 @@ config QUEUED_SPINLOCKS
 def_bool y if ARCH_USE_QUEUED_SPINLOCKS
 depends on SMP
```
-
-This means that the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option will be enabled by default if the `ARCH_USE_QUEUED_SPINLOCKS` is enabled. We may see that the `ARCH_USE_QUEUED_SPINLOCKS` is enabled by default in the `x86_64` specific kernel configuration file - [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig):
+This means that the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is enabled by default whenever `ARCH_USE_QUEUED_SPINLOCKS` is enabled. And `ARCH_USE_QUEUED_SPINLOCKS` is enabled by default in the `x86_64`-specific kernel configuration file - [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig):

```
config X86
@@ -77,7 +75,7 @@
 ...
```

-Before we will start to consider what is it queued spinlock concept, let's look on other types of `spinlocks`. For the start let's consider how `normal` spinlocks is implemented. Usually, implementation of `normal` spinlock is based on the [test and set](https://en.wikipedia.org/wiki/Test-and-set) instruction. Principle of work of this instruction is pretty simple. This instruction writes a value to the memory location and returns old value from this memory location. Both of these operations are in atomic context i.e. this instruction is non-interruptible. So if the first thread started to execute this instruction, second thread will wait until the first processor will not finish. Basic lock can be built on top of this mechanism.
Schematically it may look like this:
+Before we start to consider what the queued spinlock concept is, let's look at other types of `spinlocks`. First, let's consider how a `normal` spinlock is implemented. Usually, the implementation of a `normal` spinlock is based on the [test and set](https://en.wikipedia.org/wiki/Test-and-set) instruction. The principle of this instruction is pretty simple: it writes a value to a memory location and returns the old value from that location. Both of these operations are performed atomically, i.e. the instruction is non-interruptible, so if a first thread has started to execute this instruction, a second thread will wait until the first processor finishes. A basic lock can be built on top of this mechanism. Schematically it may look like this:

```C
int lock(lock)
{
@@ -95,19 +93,20 @@ int unlock(lock)
 }
```

-The first thread will execute the `test_and_set` which will set the `lock` to `1`. When the second thread will call the `lock` function, it will spin in the `while` loop, until the first thread will not call the `unlock` function and the `lock` will be equal to `0`. This implementation is not very good for performance, because it has at least two problems. The first problem is that this implementation may be unfair and the thread from one processor may have long waiting time, even if it called the `lock` before other threads which are waiting for free lock too. The second problem is that all threads which want to acquire a lock, must to execute many `atomic` operations like `test_and_set` on a variable which is in shared memory. This leads to the cache invalidation as the cache of the processor will store `lock=1`, but the value of the `lock` in memory may be `1` after a thread will release this lock.
+The first thread executes `test_and_set`, which sets `lock` to `1`. When the second thread calls the `lock` function, it spins in the `while` loop until the first thread calls the `unlock` function and `lock` becomes `0`. This implementation does not perform very well, because it has at least two problems. The first problem is that it may be unfair: a thread on one processor can wait a long time even if it called `lock` before other threads that are also waiting for the lock. The second problem is that all threads that want to acquire the lock must execute many `atomic` operations like `test_and_set` on a variable in shared memory. This leads to cache invalidation, as a processor's cache stores `lock=1` while the value of `lock` in memory changes once a thread releases the lock.

-In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we saw the second type of spinlock implementation - `ticket spinlock`. This approach solves the first problem and may guarantee order of threads which want to acquire a lock, but still has a second problem.
+In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we saw a second type of spinlock implementation - the `ticket spinlock`. That approach solves the first problem and can guarantee the order of threads that want to acquire the lock, but the second problem remains.

-The topic of this part is `queued spinlocks`. This approach may help to solve both of these problems. The `queued spinlocks` allows to each processor to use its own memory location to spin. The basic principle of a queue-based spinlock can best be understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we will look at implementation of the `queued spinlocks` in the Linux kernel, we will try to understand what is it `MCS` lock.
+The topic of this part is `queued spinlocks`, an approach that may help solve both of these problems. `Queued spinlocks` allow each processor to spin on its own memory location. The basic principle of a queue-based spinlock is best understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we look at the implementation of `queued spinlocks` in the Linux kernel, we will try to understand what an `MCS` lock is.

-The basic idea of the `MCS` lock is in that as I already wrote in the previous paragraph, a thread spins on a local variable and each processor in the system has its own copy of these variable. In other words this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variables concept in the Linux kernel.
+The basic idea of the `MCS` lock was already given in the previous paragraph: a thread spins on a local variable, and each processor in the system has its own copy of this variable. In other words, this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variables concept in the Linux kernel.

-When the first thread wants to acquire a lock, it registers itself in the `queue` or in other words it will be added to the special `queue` and will acquire lock, because it is free for now. When the second thread will want to acquire the same lock before the first thread will release it, this thread adds its own copy of the lock variable into this `queue`. In this case the first thread will contain a `next` field which will point to the second thread.
From this moment, the second thread will wait until the first thread will release its lock and notify `next` thread about this event. The first thread will be deleted from the `queue` and the second thread will be owner of a lock.
+When the first thread wants to acquire the lock, it registers itself in the `queue`; in other words, it is added to a special `queue` and acquires the lock, because the lock is free for now. When a second thread wants to acquire the same lock before the first thread releases it, this thread adds its own copy of the lock variable to this `queue`. In this case the first thread's `next` field will point to the second thread. From this moment, the second thread waits until the first thread releases its lock and notifies the `next` thread about this event. The first thread is then deleted from the `queue`, and the second thread becomes the owner of the lock.

-Schematically we can represent it like:
+Schematically we can represent it like this:

-Empty queue:
+Empty queue:

```
+---------+
|         |
|  Queue  |
|         |
+---------+
```

-First thread tries to acquire a lock:
+The first thread tries to acquire the lock:

```
+---------+     +----------------------------+
|         |     |                            |
|  Queue  |---->| First thread acquired lock |
|         |     |                            |
+---------+     +----------------------------+
```

-Second thread tries to acquire a lock:
+The second thread tries to acquire the lock:

```
+---------+     +----------------------------------------+     +-------------------------+
|         |     |                                        |     |                         |
|  Queue  |---->|  Second thread waits for first thread  |<----| First thread holds lock |
|         |     |                                        |     |                         |
+---------+     +----------------------------------------+     +-------------------------+
```

-Or the pseudocode:
+Or in pseudocode:

```C
void lock(...)
{
@@ -180,14 +179,15 @@ void unlock(...)
 }
```

-The idea is simple, but the implementation of the `queued spinlocks` is must complex than this pseudocode. As I already wrote above, the `queued spinlock` mechanism is planned to be replacement for `ticket spinlocks` in the Linux kernel. But as you may remember, the usual `spinlock` fit into `32-bit` [word](https://en.wikipedia.org/wiki/Word_%28computer_architecture%29). But the `MCS` based lock does not fit to this size. As you may know `spinlock_t` type is [widely](http://lxr.free-electrons.com/ident?i=spinlock_t) used in the Linux kernel. In this case would have to rewrite a significant part of the Linux kernel, but this is unacceptable. Beside this, some kernel structures which contains a spinlock for protection can't grow.
But anyway, implementation of the `queued spinlocks` in the Linux kernel based on this concept with some modifications which allows to fit it into `32` bits.
+The idea is simple, but the implementation of `queued spinlocks` is more complex than this pseudocode. As written above, the `queued spinlock` mechanism is planned as a replacement for `ticket spinlocks` in the Linux kernel. As you may remember, a usual `spinlock` fits into a `32-bit` [word](https://en.wikipedia.org/wiki/Word_%28computer_architecture%29), but an `MCS`-based lock does not fit into this size. As you may know, the `spinlock_t` type is [widely](http://lxr.free-electrons.com/ident?i=spinlock_t) used in the Linux kernel, so a significant part of the kernel would have to be rewritten, which is unacceptable. Besides this, some kernel structures which contain a spinlock for protection cannot grow in size. Anyway, the implementation of `queued spinlocks` in the Linux kernel is based on this concept, with some modifications which allow it to fit into `32` bits.

-That's all about theory of the `queued spinlocks`, now let's consider how this mechanism is implemented in the Linux kernel. Implementation of the `queued spinlocks` looks more complex and tangled than implementation of `ticket spinlocks`, but the study with attention will lead to success.
+That is all about the theory of `queued spinlocks`; now let's consider how this mechanism is implemented in the Linux kernel. The implementation of `queued spinlocks` looks more complex and tangled than the implementation of `ticket spinlocks`, but careful study will pay off.

-API of queued spinlocks
+API of queued spinlocks
 -------------------------------------------------------------------------------

-Now we know a little about `queued spinlocks` from the theoretical side, time to see the implementation of this mechanism in the Linux kernel.
As we saw above, the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126) header files provides a set of macro which are represent API for a spinlock acquiring, releasing and etc:
+Now we know a little about `queued spinlocks` from the theoretical side; time to see the implementation of this mechanism in the Linux kernel. As we saw above, the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126) header file provides a set of macros which represent the API for acquiring, releasing, etc. a spinlock:

```C
#define arch_spin_is_locked(l)          queued_spin_is_locked(l)
@@ -200,7 +200,7 @@
 #define arch_spin_unlock_wait(l)       queued_spin_unlock_wait(l)
```

-All of these macros expand to the call of functions from the same header file. Additionally, we saw the `qspinlock` structure from the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file which represents a queued spinlock in the Linux kernel:
+All of these macros expand to calls of functions from the same header file. Additionally, the `qspinlock` structure from the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file represents a queued spinlock in the Linux kernel:

```C
typedef struct qspinlock {
@@ -208,16 +208,16 @@ typedef struct qspinlock {
 } arch_spinlock_t;
```

-As we may see, the `qspinlock` structure contains only one field - `val`. This field represents the state of a given `spinlock`. This `4` bytes field consists from following four parts:
+As we can see, the `qspinlock` structure contains only one field - `val`, which represents the state of the given `spinlock`. This `4`-byte field consists of the following four parts:

-* `0-7` - locked byte;
-* `8` - pending bit;
-* `16-17` - two bit index which represents entry of the `per-cpu` array of the `MCS` lock (will see it soon);
-* `18-31` - contains number of processor which indicates tail of the queue.
+* `0-7` - the locked byte;
+* `8` - the pending bit;
+* `16-17` - a two-bit index which represents the entry of the `per-cpu` array of `MCS` locks (we will see it soon);
+* `18-31` - the number of the processor which is the tail of the queue.

-and the `9-15` bytes are not used.
+Bits `9-15` are not used.

-As we already know, each processor in the system has own copy of the lock. The lock is represented by the following structure:
+As we already know, each processor in the system has its own copy of the lock. The lock is represented by the following structure:

```C
struct mcs_spinlock {
@@ -227,20 +227,20 @@ struct mcs_spinlock {
 };
```

-from the [kernel/locking/mcs_spinlock.h](https://github.com/torvalds/linux/blob/master/kernel/locking/mcs_spinlock.h) header file. The first field represents a pointer to the next thread in the `queue`. The second field represents the state of the current thread in the `queue`, where `1` is `lock` already acquired and `0` in other way. And the last field of the `mcs_spinlock` structure represents nested locks. To understand what is it nested lock, imagine situation when a thread acquired lock, but was interrupted by the hardware [interrupt](https://en.wikipedia.org/wiki/Interrupt) and an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) tries to take a lock too. For this case, each processor has not just copy of the `mcs_spinlock` structure but array of these structures:
+from the [kernel/locking/mcs_spinlock.h](https://github.com/torvalds/linux/blob/master/kernel/locking/mcs_spinlock.h) header file. The first field is a pointer to the next thread in the `queue`. The second field is the state of the current thread in the `queue`, where `1` means the `lock` is already acquired and `0` means the opposite. The last field of the `mcs_spinlock` structure represents nested locks. To understand what a nested lock is, imagine a situation where a thread has acquired a lock but is interrupted by a hardware [interrupt](https://en.wikipedia.org/wiki/Interrupt), and the [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) tries to take a lock too. For this case, each processor has not just one copy of the `mcs_spinlock` structure but an array of these structures:

```C
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
```

-This array allows to make four attempts of a lock acquisition for the four events in following contexts:
+This array allows four attempts at lock acquisition, one for each of the following contexts:
+* normal task context;
+* hardware interrupt context;
+* software interrupt context;
+* non-maskable interrupt context.

-* normal task context;
-* hardware interrupt context;
-* software interrupt context;
-* non-maskable interrupt context.
-Now let's return to the `qspinlock` structure and the `API` of the `queued spinlocks`. Before we will move to consider `API` of `queued spinlocks`, notice the `val` field of the `qspinlock` structure has type - `atomic_t` which represents atomic variable or one operation at a time variable. So, all operations with this field will be [atomic](https://en.wikipedia.org/wiki/Linearizability). For example let's look at the reading value of the `val` API:
+Now let's return to the `qspinlock` structure and the `API` of the `queued spinlocks`. Before we move on to that `API`, notice that the `val` field of the `qspinlock` structure has the type `atomic_t`, which represents an atomic variable, i.e. one that is operated on one operation at a time. So all operations on this field are [atomic](https://en.wikipedia.org/wiki/Linearizability). For example, let's look at the API which reads the value of `val`:

```C
static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
@@ -249,13 +249,13 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 }
```

-Ok, now we know data structures which represents queued spinlock in the Linux kernel and now time is to look at the implementation of the `main` function from the `queued spinlocks` [API](https://en.wikipedia.org/wiki/Application_programming_interface).
+Ok, now we know the data structures which represent a queued spinlock in the Linux kernel, and it is time to look at the implementation of the `main` function of the `queued spinlocks` [API](https://en.wikipedia.org/wiki/Application_programming_interface):

```C
#define arch_spin_lock(l) queued_spin_lock(l)
```

-Yes, this function is - `queued_spin_lock`. As we may understand from the function's name, it allows to acquire lock by the thread.
This function is defined in the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file and its implementation looks:
+Yes, this function is `queued_spin_lock`. As we can guess from the function's name, it allows a thread to acquire a lock. This function is defined in the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file, and its implementation looks like this:

```C
static __always_inline void queued_spin_lock(struct qspinlock *lock)
@@ -269,15 +269,14 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 }
```

-Looks pretty easy, except the `queued_spin_lock_slowpath` function. We may see that it takes only one parameter. In our case this parameter will represent `queued spinlock` which will be locked. Let's consider the situation that `queue` with locks is empty for now and the first thread wanted to acquire lock. As we may see the `queued_spin_lock` function starts from the call of the `atomic_cmpxchg_acquire` macro. As you may guess from the name of this macro, it executes atomic [CMPXCHG](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction which compares value of the second parameter (zero in our case) with the value of the first parameter (current state of the given spinlock) and if they are identical, it stores value of the `_Q_LOCKED_VAL` in the memory location which is pointed by the `&lock->val` and return the initial value from this memory location.
+This looks pretty easy, except for the `queued_spin_lock_slowpath` function, which takes only one parameter. In our case this parameter represents the `queued spinlock` which will be locked. Let's consider the situation where the `queue` is empty for now and the first thread wants to acquire the lock. As we can see, the `queued_spin_lock` function starts with a call of the `atomic_cmpxchg_acquire` macro. As you may guess from the name of this macro, it executes an atomic [CMPXCHG](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction, which compares the value of the second parameter (zero in our case) with the value of the first parameter (the current state of the given spinlock); if they are identical, it stores the value of `_Q_LOCKED_VAL` at the memory location pointed to by `&lock->val` and returns the initial value of this memory location.

-The `atomic_cmpxchg_acquire` macro is defined in the [include/linux/atomic.h](https://github.com/torvalds/linux/blob/master/include/linux/atomic.h) header file and expands to the call of the `atomic_cmpxchg` function:
+The `atomic_cmpxchg_acquire` macro is defined in the [include/linux/atomic.h](https://github.com/torvalds/linux/blob/master/include/linux/atomic.h) header file and expands to a call of the `atomic_cmpxchg` function:

```C
#define atomic_cmpxchg_acquire atomic_cmpxchg
```
-
-which is architecture specific. We consider [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case this header file will be [arch/x86/include/asm/atomic.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/atomic.h) and the implementation of the `atomic_cmpxchg` function is just returns the result of the `cmpxchg` macro:
+which is architecture-specific. We consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case this header file is [arch/x86/include/asm/atomic.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/atomic.h), and the implementation of the `atomic_cmpxchg` function just returns the result of the `cmpxchg` macro:

```C
static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
@@ -286,7 +285,7 @@ static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
 }
```

-This macro is defined in the
[arch/x86/include/asm/cmpxchg.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/cmpxchg.h) header file and looks:
+This macro is defined in the [arch/x86/include/asm/cmpxchg.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/cmpxchg.h) header file and looks like this:

```C
#define cmpxchg(ptr, old, new) \
@@ -296,7 +295,7 @@
 __raw_cmpxchg((ptr), (old), (new), (size), LOCK_PREFIX)
```

-As we may see, the `cmpxchg` macro expands to the `__cpmxchg` macro with the almost the same set of parameters. New additional parameter is the size of the atomic value. The `__cmpxchg` macro adds `LOCK_PREFIX` and expands to the `__raw_cmpxchg` macro where `LOCK_PREFIX` just [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction. After all, the `__raw_cmpxchg` does all job for us:
+As we can see, the `cmpxchg` macro expands to the `__cmpxchg` macro with almost the same set of parameters; the new, additional parameter is the size of the atomic value. The `__cmpxchg` macro adds `LOCK_PREFIX` and expands to the `__raw_cmpxchg` macro, where `LOCK_PREFIX` is just the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction. In the end, `__raw_cmpxchg` does all the work for us:

```C
#define __raw_cmpxchg(ptr, old, new, size, lock) \
@@ -315,7 +314,7 @@
 })
```

-After the `atomic_cmpxchg_acquire` macro will be executed, it returns the previous value of the memory location. Now only one thread tried to acquire a lock, so the `val` will be zero and we will return from the `queued_spin_lock` function:
+After the `atomic_cmpxchg_acquire` macro has executed, it returns the previous value of the memory location. Since only one thread has tried to acquire the lock so far, `val` will be zero and we will return from the `queued_spin_lock` function:

```C
val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
@@ -323,9 +322,10 @@ if (likely(val == 0))
 return;
```

-From this moment, our first thread will hold a lock. Notice that this behavior differs from the behavior which was described in the `MCS` algorithm. The thread acquired lock, but we didn't add it to the `queue`. As I already wrote the implementation of `queued spinlocks` concept is based on the `MCS` algorithm in the Linux kernel, but in the same time it has some difference like this for optimization purpose.
+From this moment, our first thread holds the lock. Notice that this behavior differs from the behavior described by the `MCS` algorithm: the thread acquired the lock, but we did not add it to the `queue`. As written above, the implementation of the `queued spinlocks` concept in the Linux kernel is based on the `MCS` algorithm, but at the same time it has some differences like this one, for optimization purposes.

-So the first thread have acquired lock and now let's consider that the second thread tried to acquire the same lock. The second thread will start from the same `queued_spin_lock` function, but the `lock->val` will contain `1` or `_Q_LOCKED_VAL`, because first thread already holds lock. So, in this case the `queued_spin_lock_slowpath` function will be called. The `queued_spin_lock_slowpath` function is defined in the [kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c) source code file and starts from the following checks:
+So the first thread has acquired the lock; now let's consider a second thread trying to acquire the same lock. The second thread starts from the same `queued_spin_lock` function, but `lock->val` now contains `1`, i.e. `_Q_LOCKED_VAL`, because the first thread already holds the lock. So, in this case the `queued_spin_lock_slowpath` function is called. It is defined in the [kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c) source code file and starts with the following checks:

```C
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
@@ -342,7 +342,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
```

-which check the state of the `pvqspinlock`. The `pvqspinlock` is `queued spinlock` in [paravirtualized](https://en.wikipedia.org/wiki/Paravirtualization) environment. As this chapter is related only to synchronization primitives in the Linux kernel, we skip these and other parts which are not directly related to the topic of this chapter.
After these checks we compare our value which represents lock with the value of the `_Q_PENDING_VAL` macro and do nothing while this is true:
+These checks test the state of the `pvqspinlock`. The `pvqspinlock` is the `queued spinlock` in a [paravirtualized](https://en.wikipedia.org/wiki/Paravirtualization) environment. As this chapter is related only to synchronization primitives in the Linux kernel, we skip these and other parts which are not directly related to its topic. After these checks, we compare our value, which represents the lock, with the value of the `_Q_PENDING_VAL` macro, and do nothing while they are equal:

```C
if (val == _Q_PENDING_VAL) {
@@ -350,10 +350,9 @@ if (val == _Q_PENDING_VAL) {
 cpu_relax();
}
```
+where `cpu_relax` is just a [NOP](https://en.wikipedia.org/wiki/NOP) instruction. Above, we saw that the lock contains a `pending` bit. This bit describes the situation where a thread wants to acquire the lock while it is already held by another thread and the `queue` is at the same time empty. In this case, the `pending` bit is set and the `queue` is not touched. This is done for optimization, because it avoids the unnecessary latency which cache invalidation would cause when touching the thread's own `mcs_spinlock` array.

-where `cpu_relax` is just [NOP](https://en.wikipedia.org/wiki/NOP) instruction. Above, we saw that the lock contains - `pending` bit. This bit represents thread which wanted to acquire lock, but it is already acquired by the other thread and in the same time `queue` is empty. In this case, the `pending` bit will be set and the `queue` will not be touched. This is done for optimization, because there are no need in unnecessary latency which will be caused by the cache invalidation in a touching of own `mcs_spinlock` array.

-At the next step we enter into the following loop:
+At the next step we enter the following loop:

```C
for (;;) {
@@ -372,7 +371,7 @@ for (;;) {
 }
```

-The first `if` clause here checks that state of the lock (`val`) is in locked or pending state. This means that first thread already acquired lock, second thread tried to acquire lock too, but now it is in pending state. In this case we need to start to build queue. We will consider this situation little later.
In our case we are first thread holds lock and the second thread tries to do it too. After this check we create new lock in a locked state and compare it with the state of the previous lock. As you remember, the `val` contains state of the `&lock->val` which after the second thread will call the `atomic_cmpxchg_acquire` macro will be equal to `1`. Both `new` and `val` values are equal so we set pending bit in the lock of the second thread. After this we need to check value of the `&lock->val` again, because the first thread may release lock before this moment. If the first thread did not released lock yet, the value of the `old` will be equal to the value of the `val` (because `atomic_cmpxchg_acquire` will return the value from the memory location which is pointed by the `lock->val` and now it is `1`) and we will exit from the loop. As we exited from this loop, we are waiting for the first thread until it will release lock, clear pending bit, acquire lock and return:
+The first `if` clause here checks whether the state of the lock (`val`) is locked or pending. That would mean the first thread has already acquired the lock and a second thread has also tried to acquire it and is now in the pending state; in that case we need to start building the queue. We will consider this situation a little later. In our case, the first thread holds the lock and the second thread tries to acquire it too. After this check, we create a new lock value in the locked state and compare it with the previous state of the lock. As you remember, `val` contains the state of `&lock->val`, which after the second thread's call of the `atomic_cmpxchg_acquire` macro is equal to `1`. The `new` and `val` values are equal, so we set the pending bit in the lock for the second thread. After this, we need to check the value of `&lock->val` again, because the first thread may have released the lock in the meantime. If the first thread has not released the lock yet, the value of `old` will be equal to the value of `val` (because `atomic_cmpxchg_acquire` returns the value of the memory location pointed to by `lock->val`, which is still `1`) and we will exit from the loop. Having exited the loop, we wait for the first thread to release the lock, then clear the pending bit, acquire the lock and return:

```C
smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
@@ -380,7 +379,7 @@ clear_pending_set_locked(lock);
return;
```

-Notice that we did not touch `queue` yet. We no need in it, because for two threads it just leads to unnecessary latency for memory access. In other case, the first thread may release it lock before this moment. In this case the `lock->val` will contain `_Q_LOCKED_VAL | _Q_PENDING_VAL` and we will start to build `queue`.
We start to build `queue` by the getting the local copy of the `mcs_nodes` array of the processor which executes thread:
+注意我们还没有触碰`队列`。这里我们不需要它,因为对于两个线程来说,队列只会带来不必要的内存访问延迟。另一种情况是,第一个线程可能在此之前已经释放了锁,这种情况下 `lock->val` 将包含 `_Q_LOCKED_VAL | _Q_PENDING_VAL`,我们便会开始建立`队列`。我们通过获取执行当前线程的处理器上 `mcs_nodes` 数组的本地拷贝来开始建立`队列`:

```C
node = this_cpu_ptr(&mcs_nodes[0]);
@@ -388,7 +387,7 @@ idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
```

-Additionally we calculate `tail` which will indicate the tail of the `queue` and `index` which represents an entry of the `mcs_nodes` array. After this we set the `node` to point to the correct of the `mcs_nodes` array, set `locked` to zero because this thread didn't acquire lock yet and `next` to `NULL` because we don't know anything about other `queue` entries:
+除此以外我们计算 表示`队列`尾部和代表 `mcs_nodes` 数组实体的`索引`的`tail` 。在此之后我们设置 `node` 指出 `mcs_nodes` 数组的正确,设置 `locked` 为零应为这个线程还没有获取锁,还有 `next` 为 `NULL` 因为我们不知道任何有关其他`队列`实体的信息:

```C
node += idx;
@@ -396,14 +395,14 @@ node->locked = 0;
node->next = NULL;
```

-We already touch `per-cpu` copy of the queue for the processor which executes current thread which wants to acquire lock, this means that owner of the lock may released it before this moment. So we may try to acquire lock again by the call of the `queued_spin_trylock` function. 
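译注:上面所说的 `per-cpu` 的 `mcs_nodes` 数组,本质上是给每个等待者一个只属于自己的队列节点。下面用一段基于 C11 原子操作的用户态 C 代码来示意这种 `MCS` 排队的思想。其中 `mcs_node`、`mcs_lock_acquire` 等名字都是为演示而虚构的;这只是一个教学草图,并不是内核中 `mcs_spinlock` 和`队列自旋锁`的真实实现(内核版本还要把整个状态压缩进 32 位的锁字):

```c
#include <stdatomic.h>
#include <stddef.h>

/* 虚构的简化 MCS 节点:每个等待线程一个,只在自己的 locked 字段上自旋 */
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_int locked;
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;   /* 队列尾部,空闲时为 NULL */
};

void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
    atomic_store(&node->next, NULL);
    atomic_store(&node->locked, 0);

    /* 原子地把自己挂到队尾,并取得之前的队尾 */
    struct mcs_node *prev = atomic_exchange(&lock->tail, node);
    if (prev != NULL) {
        /* 队列非空:链接到前驱,然后只在自己的节点上自旋 */
        atomic_store(&prev->next, node);
        while (atomic_load(&node->locked) == 0)
            ;   /* 内核在这里会使用 cpu_relax() */
    }
}

void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *node)
{
    struct mcs_node *next = atomic_load(&node->next);

    if (next == NULL) {
        /* 没有可见的后继:尝试把 tail 从自己换回 NULL */
        struct mcs_node *expected = node;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;   /* 确实没有等待者,解锁完成 */
        /* 有线程刚挂到队尾但还没来得及链接,等它链接好 */
        while ((next = atomic_load(&node->next)) == NULL)
            ;
    }
    atomic_store(&next->locked, 1);   /* 把锁传给下一个等待者 */
}
```

每个等待者只在自己节点的 `locked` 字段上自旋,而不是所有 CPU 一起轮询同一个共享变量,这正是 `MCS` 锁避免缓存行竞争的关键。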
+我们已经触碰了为想要获取锁的当前线程所在处理器准备的队列的 `per-cpu` 拷贝,这意味着锁的拥有者可能在这段时间里已经释放了锁。因此我们可以通过调用 `queued_spin_trylock` 函数再次尝试获取锁。

```C
if (queued_spin_trylock(lock))
	goto release;
```

-The `queued_spin_trylock` function is defined in the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file and just does the same `queued_spin_lock` function that does:
+`queued_spin_trylock` 函数在 [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) 头文件中定义,它做的事情和 `queued_spin_lock` 函数一样:

```C
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
@@ -414,21 +413,20 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
	return 0;
}
```
-
-If the lock was successfully acquired we jump to the `release` label to release a node of the `queue`:
+如果成功获取了锁,我们就跳转到 `release` 标签,释放`队列`中的一个节点:

```C
release:
	this_cpu_dec(mcs_nodes[0].count);
```

-because we no need in it anymore as lock is acquired. If the `queued_spin_trylock` was unsuccessful, we update tail of the queue:
+因为我们不再需要它了,因为锁已经获得了。如果 `queued_spin_trylock` 不成功,我们更新队列的尾部:

```C
old = xchg_tail(lock, tail);
```

-and retrieve previous tail. The next step is to check that `queue` is not empty. In this case we need to link previous entry with the new:
+然后取回之前的尾部。下一步是检查`队列`是否非空,如果非空,我们需要把之前的实体和新的实体链接起来:

```C
if (old & _Q_TAIL_MASK) {
@@ -439,7 +437,7 @@ if (old & _Q_TAIL_MASK) {
	}
}
```

-After queue entries linked, we start to wait until reaching the head of queue. As we As we reached this, we need to do a check for new node which might be added during this wait:
+队列实体链接之后,我们开始等待,直到到达队列的头部。到达头部之后,我们需要检查在等待期间是否有新的节点加入:

```C
next = READ_ONCE(node->next);
@@ -447,39 +445,39 @@ if (next)
	prefetchw(next);
```

-If the new node was added, we prefetch cache line from memory pointed by the next queue entry with the [PREFETCHW](http://www.felixcloutier.com/x86/PREFETCHW.html) instruction. 
We preload this pointer now for optimization purpose. We just became a head of queue and this means that there is upcoming `MCS` unlock operation and the next entry will be touched.
+如果添加了新节点,我们就使用 [PREFETCHW](http://www.felixcloutier.com/x86/PREFETCHW.html) 指令,预先取出下一个队列实体所指向内存的缓存行(cache line)。出于优化的目的,我们现在就预载这个指针。我们刚刚成为队列的头部,这意味着即将进行 `MCS` 解锁操作,下一个实体将会被触碰。

-Yes, from this moment we are in the head of the `queue`. But before we are able to acquire a lock, we need to wait at least two events: current owner of a lock will release it and the second thread with `pending` bit will acquire a lock too:
+是的,从这个时刻起我们处于`队列`的头部。但是在我们能够获取锁之前,至少需要等待两个事件:当前锁的拥有者释放锁,以及设置了`待定`位的第二个线程也获取到锁:

```C
smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
```

-After both threads will release a lock, the head of the `queue` will hold a lock. In the end we just need to update the tail of the `queue` and remove current head from it.
+这两个线程都释放锁之后,`队列`的头部将持有锁。最后我们只需要更新`队列`的尾部,并将当前头部从队列中移除。

-That's all.
+以上。

-Conclusion
+总结
--------------------------------------------------------------------------------

-This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we already met the first synchronization primitive `spinlock` provided by the Linux kernel which is implemented as `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel. 
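译注:在进入总结之前,用一小段 C 代码回顾这个 32 位锁字的布局会很有帮助。下面的宏取的是内核在 `NR_CPUS` 小于 16K 时使用的布局(参见 include/asm-generic/qspinlock_types.h),`encode_tail` 则是按正文描述改写的用户态示意版本,并非内核代码本身:

```c
#include <stdint.h>

/* 32 位锁字布局(NR_CPUS < 16K 时):
 *  0- 7: locked 字节
 *     8: pending 位
 *  9-15: 未使用
 * 16-17: tail index(per-cpu mcs_nodes 数组的下标)
 * 18-31: tail cpu + 1(整个 tail 为 0 表示队列为空)
 */
#define _Q_LOCKED_VAL      (1U << 0)
#define _Q_PENDING_VAL     (1U << 8)
#define _Q_TAIL_IDX_OFFSET 16
#define _Q_TAIL_CPU_OFFSET 18
#define _Q_TAIL_MASK       (~0U << _Q_TAIL_IDX_OFFSET)

/* 正文中 encode_tail 的示意:把 cpu 和 idx 编码进 tail 字段 */
uint32_t encode_tail(int cpu, int idx)
{
    uint32_t tail = (uint32_t)(cpu + 1) << _Q_TAIL_CPU_OFFSET;
    tail |= (uint32_t)idx << _Q_TAIL_IDX_OFFSET;
    return tail;
}
```

有了这些定义,正文中 `val == _Q_PENDING_VAL`、`old & _Q_TAIL_MASK` 这类判断就一目了然了:前者表示只有 `pending` 位被置位,后者表示 `tail` 非零,即队列非空。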
+这是 Linux 内核[同步原语](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)章节第二部分的结尾。在上一[部分](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)中,我们已经见到了 Linux 内核提供的第一个同步原语`自旋锁`,它是以`排队自旋锁(ticket spinlock)`的方式实现的。在这个部分我们了解了另一种`自旋锁`机制的实现,即`队列自旋锁`。下一个部分我们将继续深入 Linux 内核的同步原语。

-If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+如果您有疑问或者建议,请在 twitter [0xAX](https://twitter.com/0xAX) 上联系我,给我发[邮件](anotherworldofworld@gmail.com),或者创建一个 [issue](https://github.com/0xAX/linux-insides/issues/new)。

-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+**友情提示:英语不是我的母语,对于译文给您带来的不便我感到非常抱歉。如果您发现任何错误,请给我发送 PR 到 [linux-insides](https://github.com/0xAX/linux-insides)。**

-Links
+链接
--------------------------------------------------------------------------------

* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
-* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) 
+* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Test and Set](https://en.wikipedia.org/wiki/Test-and-set)
* [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [atomic instruction](https://en.wikipedia.org/wiki/Linearizability)
-* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html) 
+* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [NOP 
instruction](https://en.wikipedia.org/wiki/NOP) * [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html) From 9c6b2e08024ccbd35d4c3cff142e6bba12a470f2 Mon Sep 17 00:00:00 2001 From: keltoy <315090132@qq.com> Date: Tue, 13 Dec 2016 15:11:43 +0800 Subject: [PATCH 02/21] commit 6.2.2 --- .DS_Store | Bin 8196 -> 0 bytes 1 file changed, 0 insertions(+), 0 deletions(-) delete mode 100644 .DS_Store diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index 25adce00be7e6fb5da9b443f6781a9afc5ff3e28..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 8196 zcmeHM-A>d{5S{}obTPtSVBoShCSDN3U*v+AEDOYJ5FspRj2dWnH&9brvhA|Mx@K>D z2lWkn6raEc@d5OkX?N3hi{2Td=a6$|+RoQG^PTDGP9Y*yX*ZUL=84EaXIZ+9A*bGy|Fe&A{Kl0N&Z$j2qtj(N#Ta1~dcz zB?G)anCL9avFt}uwhjzZ0zejUS{iiJ0fI4OS&n5tk^;p&qv}D$P!*RLDuUy>$>xyd zSoR|s!AV7MQn6$qMHPUD`H-fHLQ+ZQ20ayO4R_kqIfp6lQtE zcLwT!UgTmBVG3gab0xXDj}>(2JUuI3{a@SLo#4WDC0oAQ7p@b8i;LgX*m&XA#H3Ly z8pS)t>)M&$sfAY94k}jbr5rhprXA~tzV8lo%YNoGJIAHjP2USG$7{9{=C7VRUc>KH z{I=H!6rs6jx%|yq;@iVj$VlHbf&-J}i2BJ?2aRDA(PGmPt8*d5fZDPSE@JOE_3Hiai z%GmP4XKk?%lDL_zgA+VuJ1d*U1~!WtQguT%4&F+R(f4q)jq`b_!)cK8ISgGZngPwg zfPwKyndka{`{wWe19L4k1Db(Px# From 968744a0a4b6857002fcd5cd8dd48a5bbf0f3972 Mon Sep 17 00:00:00 2001 From: lifangwang Date: Fri, 7 Apr 2017 15:13:36 +0800 Subject: [PATCH 03/21] =?UTF-8?q?=E8=AE=A4=E9=A2=86=E7=BF=BB=E8=AF=917.3?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f7c950c..9669fd8 100644 --- a/README.md +++ b/README.md @@ -71,7 +71,7 @@ Linux 内核揭密 |├ [7.0](https://github.com/MintCN/linux-insides-zh/blob/master/MM/README.md)|[@mudongliang](https://github.com/mudongliang)|更新至[f83c8ee2](https://github.com/0xAX/linux-insides/commit/f83c8ee29e2051a8f4c08d6a0fa8247d934e14d9)| |├ 
[7.1](https://github.com/MintCN/linux-insides-zh/blob/master/MM/linux-mm-1.md)|[@choleraehyq](https://github.com/choleraehyq)|已完成| |├ [7.2](https://github.com/MintCN/linux-insides-zh/blob/master/MM/linux-mm-2.md)|[@choleraehyq](https://github.com/choleraehyq)|已完成| -|└ [7.3](https://github.com/MintCN/linux-insides-zh/blob/master/MM/linux-mm-3.md)||未开始| +|└ [7.3](https://github.com/MintCN/linux-insides-zh/blob/master/MM/linux-mm-3.md)|[@lifangwang](https://github.com/lifangwang)|正在进行| | 8. SMP||上游未开始| | 9. [Concepts](https://github.com/MintCN/linux-insides-zh/tree/master/Concepts)||正在进行| |├ [9.0](https://github.com/MintCN/linux-insides-zh/blob/master/Concepts/README.md)|[@mudongliang](https://github.com/mudongliang)|更新至[44017507](https://github.com/0xAX/linux-insides/commit/4401750766f7150dcd16f579026f5554541a6ab9)| From eab314c9bec6503ca881e6395422c81a2c6e7a9a Mon Sep 17 00:00:00 2001 From: Dongliang Mu Date: Fri, 7 Apr 2017 16:12:25 -0400 Subject: [PATCH 04/21] fix one markdown semantics problem --- README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 9669fd8..551f61f 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,4 @@ -Linux 内核揭密 -============== +# Linux 内核揭密 [![Join the chat at https://gitter.im/MintCN/linux-insides-zh](https://badges.gitter.im/MintCN/linux-insides-zh.svg)](https://gitter.im/MintCN/linux-insides-zh?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) @@ -9,7 +8,7 @@ Linux 内核揭密 **问题/建议**: 通过在 twitter 上 [@0xAX](https://twitter.com/0xAX) ,直接添加 [issue](https://github.com/0xAX/linux-insides/issues/new) 或者直接给我发[邮件](mailto:anotherworldofworld@gmail.com),请自由地向我提出任何问题或者建议。 -##翻译进度 +## 翻译进度 | 章节|译者|翻译进度| | ------------- |:-------------:| -----:| @@ -98,7 +97,8 @@ Linux 内核揭密 | 14. 
[KernelStructures](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures)||正在进行| |├ [14.0](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures/README.md)|[@mudongliang](https://github.com/mudongliang)|更新至[3cb550c0](https://github.com/0xAX/linux-insides/commit/3cb550c089c8fc609f667290434e9e98e93fa279)| |└ [14.1](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures/idt.md)||未开始| -##翻译认领规则 + +## 翻译认领规则 为了避免多个译者同时翻译相同章节的情况出现,请按照以下规则认领自己要翻译的章节: @@ -111,21 +111,21 @@ Linux 内核揭密 翻译前建议看 [TRANSLATION_NOTES.md](https://github.com/MintCN/linux-insides-zh/blob/master/TRANSLATION_NOTES.md) 。关于翻译约定,大家有任何问题或建议也请开 issue 讨论。 -##作者 +## 作者 [@0xAX](https://twitter.com/0xAX) -##中文维护者 +## 中文维护者 [@xinqiu](https://github.com/xinqiu) [@mudongliang](https://github.com/mudongliang) -##中文贡献者 +## 中文贡献者 详见 [contributors.md](https://github.com/MintCN/linux-insides-zh/blob/master/contributors.md) -##LICENSE +## LICENSE Licensed [BY-NC-SA Creative Commons](http://creativecommons.org/licenses/by-nc-sa/4.0/). 
From a25fcd3c4ebdc49d265ba3c5773020f7c6a9e197 Mon Sep 17 00:00:00 2001 From: Dongliang Mu Date: Fri, 7 Apr 2017 16:19:43 -0400 Subject: [PATCH 05/21] assign 6.5-6.6 to @e1001925 and @hdl --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 551f61f..b6663d9 100644 --- a/README.md +++ b/README.md @@ -64,8 +64,8 @@ |├ [6.2](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-2.md)|[@keltoy](https://github.com/keltoy)|正在进行| |├ [6.3](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-3.md)|[@huxq](https://github.com/huxq)|已完成| |├ [6.4](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-4.md)|[@huxq](https://github.com/huxq)|正在进行| -|├ [6.5](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-5.md)||未开始| -|└ [6.6](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-6.md)||未开始| +|├ [6.5](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-5.md)|[@e1001925](https://github.com/e1001925)|正在进行| +|└ [6.6](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-6.md)|[@hdl](https://github.com/hdl)|正在进行| | 7. 
[Memory management](https://github.com/MintCN/linux-insides-zh/tree/master/MM)||正在进行|
|├ [7.0](https://github.com/MintCN/linux-insides-zh/blob/master/MM/README.md)|[@mudongliang](https://github.com/mudongliang)|更新至[f83c8ee2](https://github.com/0xAX/linux-insides/commit/f83c8ee29e2051a8f4c08d6a0fa8247d934e14d9)|
|├ [7.1](https://github.com/MintCN/linux-insides-zh/blob/master/MM/linux-mm-1.md)|[@choleraehyq](https://github.com/choleraehyq)|已完成|

From 4a92a32d571bc0cf7d1f0762d5dca00ec488316b Mon Sep 17 00:00:00 2001
From: Dongliang Mu 
Date: Fri, 7 Apr 2017 16:37:38 -0400
Subject: [PATCH 06/21] modify sync-2 by review advice

---
 SyncPrim/sync-2.md | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/SyncPrim/sync-2.md b/SyncPrim/sync-2.md
index 5dc040b..25a8ce1 100644
--- a/SyncPrim/sync-2.md
+++ b/SyncPrim/sync-2.md
@@ -126,7 +126,7 @@ int unlock(lock)
 +---------+     +----------------------------+
 ```
 
-第二个队列尝试获取锁
+第二个线程尝试获取锁:
 
 ```
 +---------+     +----------------------------------------+     +-------------------------+
@@ -136,7 +136,7 @@ int unlock(lock)
 +---------+     +----------------------------------------+     +-------------------------+
 ```
 
-为代码描述为:
+或者伪代码描述为:
 
 ```C
 void lock(...)
@@ -179,15 +179,14 @@ void unlock(...) 
}
```

-想法很简单,但是`队列自旋锁`的实现一定是比为代码复杂。就如同我上面写到的,`队列自旋锁`机制计划在 Linux 内核中成为`排队自旋锁`的替代品。但你们可能还记得,常用`自旋锁`适用于`32位(32-bit)`的 [字(word)](https://en.wikipedia.org/wiki/Word_%28computer_architecture%29)。而基于`MCS`的锁不能使用这个大小,你们卡能知道 `spinlock_t` 类型在 Linux 内核中的使用是[宽字符(widely)](http://lxr.free-electrons.com/ident?i=spinlock_t)的。这种情况下可能不得不重写 Linux 内核中重要的组成部分,但这是不可接受的。除了这一点,一些包含自旋锁用于保护的内核结构不能适配(can't grow)。但无论怎样,基于这一概念的 Linux 内核中的`队列自旋锁`实现有一些修改,可以适应`32`位的字。
+想法很简单,但是`队列自旋锁`的实现肯定比伪代码复杂。就如同我上面写到的,`队列自旋锁`机制计划在 Linux 内核中成为`排队自旋锁`的替代品。但你们可能还记得,常用的`自旋锁`适用于`32位(32-bit)`的[字(word)](https://en.wikipedia.org/wiki/Word_%28computer_architecture%29),而基于`MCS`的锁无法放进这个大小,同时你们可能知道 `spinlock_t` 类型在 Linux 内核中是被[广泛(widely)](http://lxr.free-electrons.com/ident?i=spinlock_t)使用的。这种情况下可能不得不重写 Linux 内核中大量重要的组成部分,这是不可接受的。除此之外,一些用自旋锁保护的内核结构的大小也不能增长。但无论怎样,Linux 内核中基于这一概念的`队列自旋锁`实现做了一些修改,从而可以适应`32`位的字。

-这就是所有有关`队列自旋锁`的理论,现在让我们考虑以下在 Linux 内核中这个机制是如何实现的。`队列自旋锁`的实现看起来比`排队自旋锁`的实现更加复杂和混乱,但是关注研究它将会有收获
-(原句:but the study with attention will lead to success.)。
+这就是所有有关`队列自旋锁`的理论,现在让我们看一下在 Linux 内核中这个机制是如何实现的。`队列自旋锁`的实现看起来比`排队自旋锁`的实现更加复杂和混乱,但是细致的研究终会有所收获。

队列自旋锁的API
-------------------------------------------------------------------------------

-现在我们从原理角度了解了一些`队列自旋锁`,是时候了解 Linux 内核中这一机制的实现了。就想我们之前了解的那样 [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126) 头文件提供一套宏,代表API中的自旋锁的获取、释放等等。
+现在我们从原理角度了解了一些`队列自旋锁`,是时候了解 Linux 内核中这一机制的实现了。就像我们之前了解的那样,[include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126) 头文件提供了一套宏,代表 API 中的自旋锁的获取、释放等操作。

```C
#define arch_spin_is_locked(l)	queued_spin_is_locked(l)
@@ -200,7 +199,7 @@ 
#define arch_spin_unlock_wait(l) queued_spin_unlock_wait(l) ``` -这些所有的宏扩展了同一头文件下的函数的调用。此外,我们发现 [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) 头文件的 `qspinlock` 结构代表了 Linux 内核队列自旋锁。 +所有这些宏扩展了同一头文件下的函数的调用。此外,我们发现 [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) 头文件的 `qspinlock` 结构代表了 Linux 内核队列自旋锁。 ```C typedef struct qspinlock { @@ -269,7 +268,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock) } ``` -看起来很简单,除了 `queued_spin_lock_slowpath` 函数。 我们可能发现它只有一个参数。在我们的例子中这个参数代表 `队列自旋锁` 被上锁。让我们考虑`队列`锁为空,现在第一个线程想要获取锁的情况。正如我们可能了解的 `queued_spin_lock` 函数从调用 `atomic_cmpxchg_acquire` 宏开始。就像你们可能从宏的名字猜到的那样,它执行原子的 [CMPXCHG](http://x86.renejeschke.de/html/file_module_x86_id_41.html) 指令,使用第一个参数(当前给定自旋锁的状态)比较第二个参数(在我们的例子为零)的值,如果他们相等,那么第二个参数在存储位置保存 `_Q_LOCKED_VAL` 的值,该存储位置通过 `&lock->val` 指向并且返回这个存储位置的初始值。 +看起来很简单,除了 `queued_spin_lock_slowpath` 函数,我们可能发现它只有一个参数。在我们的例子中这个参数代表 `队列自旋锁` 被上锁。让我们考虑`队列`锁为空,现在第一个线程想要获取锁的情况。正如我们可能了解的 `queued_spin_lock` 函数从调用 `atomic_cmpxchg_acquire` 宏开始。就像你们可能从宏的名字猜到的那样,它执行原子的 [CMPXCHG](http://x86.renejeschke.de/html/file_module_x86_id_41.html) 指令,使用第一个参数(当前给定自旋锁的状态)比较第二个参数(在我们的例子为零)的值,如果他们相等,那么第二个参数在存储位置保存 `_Q_LOCKED_VAL` 的值,该存储位置通过 `&lock->val` 指向并且返回这个存储位置的初始值。 `atomic_cmpxchg_acquire` 宏定义在 [include/linux/atomic.h](https://github.com/torvalds/linux/blob/master/include/linux/atomic.h) 头文件中并且扩展了 `atomic_cmpxchg` 函数的调用: @@ -324,8 +323,7 @@ if (likely(val == 0)) 此时此刻,我们的第一个线程持有锁。注意这个行为与在 `MCS` 算法的描述有所区别。线程获取锁,但是我们不添加此线程入`队列`。就像我之前已经写到的,`队列自旋锁` 概念的实现在 Linux 内核中基于 `MCS` 算法,但是于此同时它对优化目的有一些差异。 -所以第一个线程已经获取了锁然后现在让我们考虑第二个线程尝试获取相同的锁的情况。第二个线程将从同样的 `queued_spin_lock` 函数开始,但是 `lock->val` 会包含 -`1` 或者 `_Q_LOCKED_VAL`,因为第一个线程已经持有了锁。因此,在本例中 `queued_spin_lock_slowpath` 函数将会被调用。`queued_spin_lock_slowpath`函数定义在 
[kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c) 源码文件中并且从以下的检查开始:

```C
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
@@ -350,7 +348,7 @@ if (val == _Q_PENDING_VAL) {
	cpu_relax();
}
```
-这里 `cpu_relax` 只是 [NOP] 指令。综上,我们了解了锁饱含着 - `pending` 位。这个位代表了想要获取锁的线程,但是这个锁已经被其他线程获取了,并且与此同时`队列`为空。在本例中,`pending` 位将被设置并且`队列`不会被创建(touched)。这是优化所完成的,因为不需要考虑在引发缓存无效的自身 `mcs_spinlock` 数组的创建产生的非必需隐患(原文:This is done for optimization, because there are no need in unnecessary latency which will be caused by the cache invalidation in a touching of own `mcs_spinlock` array.)。
+这里 `cpu_relax` 只是 [NOP](https://en.wikipedia.org/wiki/NOP) 指令。综上,我们了解到锁包含 `pending` 位。这个位代表了想要获取锁的线程,但锁已经被其他线程获取,并且与此同时`队列`为空。在本例中,`pending` 位将被设置,而`队列`不会被触碰(touched)。这样做是为了优化,因为可以避免因触碰自身的 `mcs_spinlock` 数组引发缓存失效而产生的不必要延迟。

下一步我们进入下面的循环:

@@ -387,7 +385,7 @@ idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
```

-除此以外我们计算 表示`队列`尾部和代表 `mcs_nodes` 数组实体的`索引`的`tail` 。在此之后我们设置 `node` 指出 `mcs_nodes` 数组的正确,设置 `locked` 为零应为这个线程还没有获取锁,还有 `next` 为 `NULL` 因为我们不知道任何有关其他`队列`实体的信息:
+除此之外,我们还计算了表示`队列`尾部的 `tail`,以及代表 `mcs_nodes` 数组中一个实体的索引 `idx`。在此之后我们设置 `node` 指向 `mcs_nodes` 数组中正确的实体,设置 `locked` 为零,因为这个线程还没有获取锁;设置 `next` 为 `NULL`,因为我们不知道任何有关其他`队列`实体的信息:

```C
node += idx;
@@ -420,7 +418,7 @@ release:
	this_cpu_dec(mcs_nodes[0].count);
```

-因为我们不再需要它了,因为锁已经获得了。如果 `queued_spin_trylock` 不成功,我们更新队列的尾部:
+现在我们不再需要它了,因为锁已经获得了。如果 `queued_spin_trylock` 不成功,我们更新队列的尾部:

```C
old 
= xchg_tail(lock, tail); From e0ca6e25d16bbc8e284a58cd700a6ed63f1eab6a Mon Sep 17 00:00:00 2001 From: Dongliang Mu Date: Fri, 7 Apr 2017 22:12:03 -0400 Subject: [PATCH 07/21] keep asm.md sync with upstream commit id - a9e59b54f004b97a153cfe11db3ee913ddcb565c --- Theory/asm.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Theory/asm.md b/Theory/asm.md index d0c3220..8a691db 100644 --- a/Theory/asm.md +++ b/Theory/asm.md @@ -309,7 +309,7 @@ a = 100 Or for example `I` which represents an immediate 32-bit integer. The difference between `i` and `I` is that `i` is general, whereas `I` is strictly specified to 32-bit integer data. For example if you try to compile the following ```C -int test_asm(int nr) +unsigned long test_asm(int nr) { unsigned long a = 0; @@ -332,7 +332,7 @@ test.c:7:9: error: impossible constraint in ‘asm’ when at the same time ```C -int test_asm(int nr) +unsigned long test_asm(int nr) { unsigned long a = 0; @@ -360,7 +360,7 @@ int main(void) static unsigned long element; __asm__ volatile("movq 16+%1, %0" : "=r"(element) : "o"(arr)); - printf("%d\n", element); + printf("%lu\n", element); return 0; } ``` From 1547a1d591ff6b804cd67e1d20ecbb7b555f2246 Mon Sep 17 00:00:00 2001 From: Dongliang Mu Date: Sun, 9 Apr 2017 23:34:40 -0400 Subject: [PATCH 08/21] Modify the sequence of translators --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 6fd41b7..7072814 100644 --- a/README.md +++ b/README.md @@ -64,8 +64,8 @@ |├ [6.2](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-2.md)|[@keltoy](https://github.com/keltoy)|已完成| |├ [6.3](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-3.md)|[@huxq](https://github.com/huxq)|已完成| |├ [6.4](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-4.md)|[@huxq](https://github.com/huxq)|正在进行| -|├ 
[6.5](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-5.md)|[@e1001925](https://github.com/e1001925)|正在进行| -|└ [6.6](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-6.md)|[@hdl](https://github.com/hdl)|正在进行| +|├ [6.5](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-5.md)|[@hdl](https://github.com/hdl)|正在进行| +|└ [6.6](https://github.com/MintCN/linux-insides-zh/blob/master/SyncPrim/sync-6.md)|[@e1001925](https://github.com/e1001925)|正在进行| | 7. [Memory management](https://github.com/MintCN/linux-insides-zh/tree/master/MM)||正在进行| |├ [7.0](https://github.com/MintCN/linux-insides-zh/blob/master/MM/README.md)|[@mudongliang](https://github.com/mudongliang)|更新至[f83c8ee2](https://github.com/0xAX/linux-insides/commit/f83c8ee29e2051a8f4c08d6a0fa8247d934e14d9)| |├ [7.1](https://github.com/MintCN/linux-insides-zh/blob/master/MM/linux-mm-1.md)|[@choleraehyq](https://github.com/choleraehyq)|已完成| From a5df33a0b662c0f4d80018cb875df6aff30494fb Mon Sep 17 00:00:00 2001 From: lifangwang Date: Fri, 14 Apr 2017 23:23:15 +0800 Subject: [PATCH 09/21] first translation --- MM/linux-mm-3.md | 598 ++++++++++++++++++++++++----------------------- 1 file changed, 308 insertions(+), 290 deletions(-) diff --git a/MM/linux-mm-3.md b/MM/linux-mm-3.md index 6cf9648..91bd122 100644 --- a/MM/linux-mm-3.md +++ b/MM/linux-mm-3.md @@ -1,418 +1,436 @@ -Linux kernel memory management Part 3. 
+Linux 内核内存管理 第三节
================================================================================

-Introduction to the kmemcheck in the Linux kernel
+内核中 kmemcheck 介绍
--------------------------------------------------------------------------------

-This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/mm/) which describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) of this chapter we met two memory management related concepts:
+本文是描述 Linux 内核[内存管理](https://en.wikipedia.org/wiki/Memory_management)的[章节](https://0xax.gitbooks.io/linux-insides/content/mm/)的第三部分。在本章[第二节](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)中我们遇到了两个与内存管理相关的概念:

-* `Fix-Mapped Addresses`;
-* `ioremap`.
+* `固定映射地址(Fix-Mapped Addresses)`;
+* `ioremap`.

-The first concept represents special area in [virtual memory](https://en.wikipedia.org/wiki/Virtual_memory), whose corresponding physical mapping is calculated in [compile-time](https://en.wikipedia.org/wiki/Compile_time). The second concept provides ability to map input/output related memory to virtual memory.
+固定映射地址代表[虚拟内存](https://en.wikipedia.org/wiki/Virtual_memory)中的一类特殊区域,这类地址对应的物理映射是在[编译](https://en.wikipedia.org/wiki/Compile_time)期间计算出来的。`ioremap` 则提供了把输入/输出相关的内存映射到虚拟内存的能力。

-For example if you will look at the output of the `/proc/iomem`:
+例如,查看 `/proc/iomem` 的输出:

```
-$ sudo cat /proc/iomem
-
-00000000-00000fff : reserved
-00001000-0009d7ff : System RAM
-0009d800-0009ffff : reserved
-000a0000-000bffff : PCI Bus 0000:00
-000c0000-000cffff : Video ROM
-000d0000-000d3fff : PCI Bus 0000:00
-000d4000-000d7fff : PCI Bus 0000:00
-000d8000-000dbfff : PCI Bus 0000:00
-000dc000-000dffff : PCI Bus 0000:00
-000e0000-000fffff : reserved
-...
-...
-... 
+ $ sudo cat /proc/iomem + + 00000000-00000fff : reserved + 00001000-0009d7ff : System RAM + 0009d800-0009ffff : reserved + 000a0000-000bffff : PCI Bus 0000:00 + 000c0000-000cffff : Video ROM + 000d0000-000d3fff : PCI Bus 0000:00 + 000d4000-000d7fff : PCI Bus 0000:00 + 000d8000-000dbfff : PCI Bus 0000:00 + 000dc000-000dffff : PCI Bus 0000:00 + 000e0000-000fffff : reserved + ... + ... + ... ``` -you will see map of the system's memory for each physical device. Here the first column displays the memory registers used by each of the different types of memory. The second column lists the kind of memory located within those registers. Or for example: +可以看到系统中每个物理设备对应的内存映射区域。上述输出信息第一列表示各类型内存使用的内存寄存器。第二列展示了内存寄存器所包含的各种类型的内存。再例如: ``` -$ sudo cat /proc/ioports -0000-0cf7 : PCI Bus 0000:00 - 0000-001f : dma1 - 0020-0021 : pic1 - 0040-0043 : timer0 - 0050-0053 : timer1 - 0060-0060 : keyboard - 0064-0064 : keyboard - 0070-0077 : rtc0 - 0080-008f : dma page reg - 00a0-00a1 : pic2 - 00c0-00df : dma2 - 00f0-00ff : fpu - 00f0-00f0 : PNP0C04:00 - 03c0-03df : vga+ - 03f8-03ff : serial - 04d0-04d1 : pnp 00:06 - 0800-087f : pnp 00:01 - 0a00-0a0f : pnp 00:04 - 0a20-0a2f : pnp 00:04 - 0a30-0a3f : pnp 00:04 -... -... -... + $ sudo cat /proc/ioports + 0000-0cf7 : PCI Bus 0000:00 + 0000-001f : dma1 + 0020-0021 : pic1 + 0040-0043 : timer0 + 0050-0053 : timer1 + 0060-0060 : keyboard + 0064-0064 : keyboard + 0070-0077 : rtc0 + 0080-008f : dma page reg + 00a0-00a1 : pic2 + 00c0-00df : dma2 + 00f0-00ff : fpu + 00f0-00f0 : PNP0C04:00 + 03c0-03df : vga+ + 03f8-03ff : serial + 04d0-04d1 : pnp 00:06 + 0800-087f : pnp 00:01 + 0a00-0a0f : pnp 00:04 + 0a20-0a2f : pnp 00:04 + 0a30-0a3f : pnp 00:04 + ... + ... + ... ``` -can show us lists of currently registered port regions used for input or output communication with a device. All memory-mapped I/O addresses are not used by the kernel directly. 
So, before the Linux kernel can use such memory, it must to map it to the virtual memory space which is the main purpose of the `ioremap` mechanism. Note that we saw only early `ioremap` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like a different types of memory allocators and etc., because in other way it will be very difficult to understand it.
+该命令列出了系统中所有设备注册的输入输出端口。内核不能直接使用这些内存映射的输入/输出地址。所以在内核能够使用这些内存之前,必须先将这些地址映射到虚拟地址空间,这就是 `ioremap` 机制的主要目的。注意,前面的[第二节](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)中只介绍了早期(early)的 `ioremap`。很快我们就会来看非早期 `ioremap` 函数的实现。但在此之前,我们需要先学习一些其他的知识,例如不同类型的内存分配器等,否则会很难理解它。

-So, before we will move on to the non-early [memory management](https://en.wikipedia.org/wiki/Memory_management) of the Linux kernel, we will see some mechanisms which provide special abilities for [debugging](https://en.wikipedia.org/wiki/Debugging), check of [memory leaks](https://en.wikipedia.org/wiki/Memory_leak), memory control and etc. It will be easier to understand how memory management arranged in the Linux kernel after learning of all of these things.
+所以,在进入 Linux 内核非早期的[内存管理](https://en.wikipedia.org/wiki/Memory_management)之前,我们先来看一些提供特殊能力的机制,例如[调试](https://en.wikipedia.org/wiki/Debugging)、检查[内存泄漏](https://en.wikipedia.org/wiki/Memory_leak)、内存控制等等。学习这些内容之后,我们会更容易理解 Linux 内核中内存管理的组织方式。

-As you already may guess from the title of this part, we will start to consider memory mechanisms from the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt). As we always did in other [chapters](https://0xax.gitbooks.io/linux-insides/content/), we will start to consider from theoretical side and will learn what is `kmemcheck` mechanism in general and only after this, we will see how it is implemented in the Linux kernel. 
+从本节的标题中,你可能已经看出来,我们会从 [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt)开始了解内存机制。和前面的[章节](https://0xax.gitbooks.io/linux-insides/content/)一样,我们首先从理论上学习什么是`kmemcheck`,然后再来看Linux内核中是怎么实现这一机制的。 -So let's start. What is it `kmemcheck` in the Linux kernel? As you may gues from the name of this mechanism, the `kmemcheck` checks memory. That's true. Main point of the `kmemcheck` mechanism is to check that some kernel code accesses `uninitialized memory`. Let's take following simple [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program: +让我们开始吧。Linux内核中的`kmemcheck`到底是什么呢?从该机制的名称上你可能已经猜到, `kmemcheck` 是检查内存的。你猜的很对。`kmemcheck`的主要目的就是用来检查是否有内核代码访问 `未初始化的内存`。让我们看一个简单的[C](https://en.wikipedia.org/wiki/C_%28programming_language%29)程序: -```C -#include -#include - -struct A { - int a; -}; - -int main(int argc, char **argv) { - struct A *a = malloc(sizeof(struct A)); - printf("a->a = %d\n", a->a); - return 0; -} ``` -Here we allocate memory for the `A` structure and tries to print value of the `a` field. If we will compile this program without additional options: + #include + #include + + struct A { + int a; + }; + + int main(int argc, char **argv) { + struct A *a = malloc(sizeof(struct A)); + printf("a->a = %d\n", a->a); + return 0; + } +``` + + +在上面的程序中我们给结构体`A`分配了内存,然后我们尝试打印该结构体的成员`a`。如果我们不使用其他选项来编译该程序: ``` gcc test.c -o test ``` -The [compiler](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) will not show us warning that `a` filed is not unitialized. But if we will run this program with [valgrind](https://en.wikipedia.org/wiki/Valgrind) tool, we will see the following output: +[编译器](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)不会显示成员 `a`未初始化的提示信息。但是如果使用工具[valgrind](https://en.wikipedia.org/wiki/Valgrind)来运行该程序,我们会看到如下输出: ``` -~$ valgrind --leak-check=yes ./test -==28469== Memcheck, a memory error detector -==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. 
-==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info -==28469== Command: ./test -==28469== -==28469== Conditional jump or move depends on uninitialised value(s) -==28469== at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so) -==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so) -==28469== by 0x4005B9: main (in /home/alex/test) -==28469== -==28469== Use of uninitialised value of size 8 -==28469== at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so) -==28469== by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so) -==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so) -==28469== by 0x4005B9: main (in /home/alex/test) -... -... -... + + ~$ valgrind --leak-check=yes ./test + ==28469== Memcheck, a memory error detector + ==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. + ==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info + ==28469== Command: ./test + ==28469== + ==28469== Conditional jump or move depends on uninitialised value(s) + ==28469== at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so) + ==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so) + ==28469== by 0x4005B9: main (in /home/alex/test) + ==28469== + ==28469== Use of uninitialised value of size 8 + ==28469== at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so) + ==28469== by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so) + ==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so) + ==28469== by 0x4005B9: main (in /home/alex/test) + ... + ... + ... ``` -Actually the `kmemcheck` mechanism does the same for the kernel, what the `valgrind` does for userspace programs. It check unitilized memory. 
+实际上`kmemcheck`在内核空间做的事情,和`valgrind`在用户空间做的事情是一样的,都是用来检测未初始化的内存。 -To enable this mechanism in the Linux kernel, you need to enable the `CONFIG_KMEMCHECK` kernel configuration option in the: +要想在内核启用该机制,配置内核时在内核选项菜单要使能`CONFIG_KMEMCHECK`选项: ``` Kernel hacking -> Memory Debugging ``` -menu of the Linux kernel configuration: + ![kernel configuration menu](http://oi63.tinypic.com/2pzbog7.jpg) -We may not only enable support of the `kmemcheck` mechanism in the Linux kernel, but it also provides some configuration options for us. We will see all of these options in the next paragraph of this part. Last note before we will consider how does the `kmemcheck` check memory. Now this mechanism is implemented only for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. You can be sure if you will look in the [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig) `x86` related kernel configuration file, you will see following lines: +我们不仅可以在内核中使能`kmemcheck`机制,它还提供了一些配置选项。我们可以在本小节的下一个段落中看到所有的选项。最后一个需要注意的是,`kmemcheck` 仅在 [x86_64](https://en.wikipedia.org/wiki/X86-64) 体系中实现了。为了确信这一点,我们可以查看`x86`的内核配置文件 [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig): ``` -config X86 - ... - ... - ... - select HAVE_ARCH_KMEMCHECK - ... - ... - ... + + config X86 + ... + ... + ... + select HAVE_ARCH_KMEMCHECK + ... + ... + ... ``` -So, there is no anything which is specific for other architectures. +因此,对于其他的体系结构来说,`kmemcheck` 功能是不存在的。 -Ok, so we know that `kmemcheck` provides mechanism to check usage of `uninitialized memory` in the Linux kernel and how to enable it. How it does these checks? When the Linux kernel tries to allocate some memory i.e. 
something is called like this:

-```C
+```
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
```

-or in other words somebody wants to access a [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29), a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception is generated. This is achieved by the fact that the `kmemcheck` marks memory pages as `non-present` (more about this you can read in the special part which is devoted to [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)). If a `page fault` exception is occured, the exception handler knows about it and in a case when the `kmemcheck` is enabled it transfers control to it. After the `kmemcheck` will finish its checks, the page will be marked as `present` and the interrupted code will be able to continue execution. There is little subtlety in this chain. When the first instruction of interrupted code will be executed, the `kmemcheck` will mark the page as `non-present` again. In this way next access to memory will be catched again.
+或者换句话说,当有代码访问[page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29)时,会产生一个[缺页中断](https://en.wikipedia.org/wiki/Page_fault)异常。这是因为`kmemcheck`事先将内存页标记为`不存在`(关于Linux内存分页的相关信息,你可以参考[分页](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html))。如果发生了`缺页中断`异常,异常处理程序会来处理这个异常;如果异常处理程序检测到内核使能了 `kmemcheck`,就会将控制权移交给 `kmemcheck`;`kmemcheck`检查完之后,该内存页会被重新标记为`存在`,被中断的代码得以继续执行。这里有一个微妙之处:被中断代码的第一条指令执行之后,`kmemcheck`会再次把内存页标记为`不存在`,按照这种方式,下一次对该内存的访问也会被捕获。

-We just considered the `kmemcheck` mechanism from theoretical side. Now let's consider how it is implemented in the Linux kernel.
+目前我们只是从理论层面考察了 `kmemcheck`,接下来我们看一下Linux内核是怎么来实现该机制的。 -Implementation of the `kmemcheck` mechanism in the Linux kernel +`kmemcheck`机制在Linux内核中的实现方式 -------------------------------------------------------------------------------- -So, now we know what is it `kmemcheck` and what it does in the Linux kernel. Time to see at its implementation in the Linux kernel. Implementation of the `kmemcheck` is splitted in two parts. The first is generic part is located in the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file and the second [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture-specific part is located in the [arch/x86/mm/kmemcheck](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck) directory. +我们应该已经了解`kmemcheck`是做什么的以及它在Linux内核中的功能,现在是时候看一下它在Linux内核中的实现。 `kmemcheck`在内核的实现分为两部分。第一部分是架构无关的部分,位于源码 [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c);第二部分 [x86_64](https://en.wikipedia.org/wiki/X86-64)架构相关的部分位于目录[arch/x86/mm/kmemcheck](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck)中。 -Let's start from the initialization of this mechanism. We already know that to enable the `kmemcheck` mechanism in the Linux kernel, we must enable the `CONFIG_KMEMCHECK` kernel configuration option. But besides this, we need to pass one of following parameters: +我们先分析该机制的初始化过程。我们已经知道要在内核中使能`kmemcheck`机制,需要开启内核的`CONFIG_KMEMCHECK`配置项。除了这个选项,我们还需要给内核command line传递一个kmemcheck参数: * kmemcheck=0 (disabled) * kmemcheck=1 (enabled) * kmemcheck=2 (one-shot mode) -to the Linux kernel command line. The first two are clear, but the last needs a little explanation. This option switches the `kmemcheck` in a special mode when it will be turned off after detecting the first use of uninitialized memory. 
Actually this mode is enabled by default in the Linux kernel:
+前面两个值的含义很明确,但是最后一个需要一点解释。这个选项会使`kmemcheck`进入一种特殊的模式:在第一次检测到未初始化内存的使用之后,就会关闭`kmemcheck`。实际上该模式是内核的默认选项:

![kernel configuration menu](http://oi66.tinypic.com/y2eeh.jpg)

-We know from the seventh [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the chapter which describes initialization of the Linux kernel that the kernel command line is parsed during initialization of the Linux kernel in `do_initcall_level`, `do_early_param` functions. Actually the `kmemcheck` subsystem consists from two stages. The first stage is early. If we will look at the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file, we will see the `param_kmemcheck` function which is will be called during early command line parsing:
+从Linux初始化过程章节的[第七节](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html)中,我们知道在内核初始化过程中,会在`do_initcall_level`、`do_early_param`等函数中解析内核command line。前面也提到过 `kmemcheck`子系统由两部分组成,第一部分启动比较早。在源码 [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c)中有一个函数 `param_kmemcheck`,该函数在command line解析时就会用到:

-```C
-static int __init param_kmemcheck(char *str)
-{
-	int val;
-	int ret;

-	if (!str)
+```
+
+	static int __init param_kmemcheck(char *str)
+	{
+		int val;
+		int ret;
+
+		if (!str)
+			return -EINVAL;
+
+		ret = kstrtoint(str, 0, &val);
+		if (ret)
+			return ret;
+		kmemcheck_enabled = val;
+		return 0;
+	}
+
+	early_param("kmemcheck", param_kmemcheck);
+```
+
+从前面的介绍我们知道`param_kmemcheck`可能有三种取值:`0`(禁用)、`1`(使能)或 `2`(一次性模式)。`param_kmemcheck`的实现很简单:将command line传递的`kmemcheck`参数的值由字符串转换为整数,然后赋值给变量`kmemcheck_enabled`。
+
+第二阶段在内核初始化期间执行,更确切地说,是在早期 [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html) 的初始化过程中执行。第二阶段由 `kmemcheck_init` 函数实现:
+
+```
+
+	int __init kmemcheck_init(void)
+	{
+	...
+	...
+	...
+	}
+
+	early_initcall(kmemcheck_init);
+```
+
+`kmemcheck_init`的主要目的就是调用 `kmemcheck_selftest` 函数,并检查它的返回值:
+
+```
+
+	if (!kmemcheck_selftest()) {
+		printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
+		kmemcheck_enabled = 0;
 		return -EINVAL;
+	}
+
+	printk(KERN_INFO "kmemcheck: Initialized\n");
+```

-	ret = kstrtoint(str, 0, &val);
-	if (ret)
-		return ret;
-	kmemcheck_enabled = val;
-	return 0;
-}
+如果`kmemcheck_init`检测失败,就会返回 `EINVAL`。`kmemcheck_selftest`函数会检测与内存访问相关的[操作码](https://en.wikipedia.org/wiki/Opcode)(例如 `rep movsb`、`movzwq`)的大小。如果检测到的大小和预期的大小一致,`kmemcheck_selftest`返回 `true`,否则返回 `false`。
+
+如果如下代码被调用:

-early_param("kmemcheck", param_kmemcheck);
```
-
-As we already saw, the `param_kmemcheck` may have one of the following values: `0` (enabled), `1` (disabled) or `2` (one-shot). The implementation of the `param_kmemcheck` is pretty simple. We just convert string value of the `kmemcheck` command line option to integer representation and set it to the `kmemcheck_enabled` variable.
-
-The second stage will be executed during initialization of the Linux kernel, rather during intialization of early [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html). The second stage is represented by the `kmemcheck_init`:
-
-```C
-int __init kmemcheck_init(void)
-{
-	...
-	...
-	...
-}
-
-early_initcall(kmemcheck_init);
-```
-
-Main goal of the `kmemcheck_init` function is to call the `kmemcheck_selftest` function and check its result:
-
-```C
-if (!kmemcheck_selftest()) {
-	printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
-	kmemcheck_enabled = 0;
-	return -EINVAL;
-}
-
-printk(KERN_INFO "kmemcheck: Initialized\n");
-```
-
-and return with the `EINVAL` if this check is failed. The `kmemcheck_selftest` function checks sizes of different memory access related [opcodes](https://en.wikipedia.org/wiki/Opcode) like `rep movsb`, `movzwq` and etc.
If sizes of opcodes are equal to expected sizes, the `kmemcheck_selftest` will return `true` and `false` in other way. - -So when the somebody will call: - -```C struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL); ``` -through a series of different function calls the `kmem_getpages` function will be called. This function is defined in the [mm/slab.c](https://github.com/torvalds/linux/blob/master/mm/slab.c) source code file and main goal of this function tries to allocate [pages](https://en.wikipedia.org/wiki/Paging) with the given flags. In the end of this function we can see following code: +经过一系列的函数调用,`kmem_getpages`函数会被调用到,该函数的定义在源码 [mm/slab.c](https://github.com/torvalds/linux/blob/master/mm/slab.c)中,该函数的主要功能就是尝试按照指定的参数需求分配[内存页](https://en.wikipedia.org/wiki/Paging)。在该函数的结尾处有如下代码: -```C -if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) { - kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid); + + + if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) { + kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid); if (cachep->ctor) kmemcheck_mark_uninitialized_pages(page, nr_pages); else kmemcheck_mark_unallocated_pages(page, nr_pages); -} -``` - -So, here we check that the if `kmemcheck` is enabled and the `SLAB_NOTRACK` bit is not set in flags we set `non-present` bit for the just allocated page. The `SLAB_NOTRACK` bit tell us to not track uninitialized memory. Additionally we check if a cache object has constructor (details will be considered in next parts) we mark allocated page as uninitilized or unallocated in other way. 
The `kmemcheck_alloc_shadow` function is defined in the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file and does following things: - -```C -void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node) -{ - struct page *shadow; - - shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order); - - for(i = 0; i < pages; ++i) - page[i].shadow = page_address(&shadow[i]); - - kmemcheck_hide_pages(page, pages); -} -``` - -First of all it allocates memory space for the shadow bits. If this bit is set in a page, this means that this page is tracked by the `kmemcheck`. After we allocated space for the shadow bit, we fill all allocated pages with this bit. In the end we just call the `kmemcheck_hide_pages` function with the pointer to the allocated page and number of these pages. The `kmemcheck_hide_pages` is architecture-specific function, so its implementation is located in the [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c) source code file. The main goal of this function is to set `non-present` bit in given pages. Let's look at the implementation of this function: - -```C -void kmemcheck_hide_pages(struct page *p, unsigned int n) -{ - unsigned int i; - - for (i = 0; i < n; ++i) { - unsigned long address; - pte_t *pte; - unsigned int level; - - address = (unsigned long) page_address(&p[i]); - pte = lookup_address(address, &level); - BUG_ON(!pte); - BUG_ON(level != PG_LEVEL_4K); - - set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT)); - set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN)); - __flush_tlb_one(address); } -} -``` -Here we go through all pages and and tries to get `page table entry` for each page. If this operation was successful, we unset present bit and set hidden bit in each page. In the end we flush [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer), because some pages was changed. 
From this point allocated pages are tracked by the `kmemcheck`. Now, as `present` bit is unset, the [page fault](https://en.wikipedia.org/wiki/Page_fault) execution will be occured right after the `kmalloc` will return pointer to allocated space and a code will try to access this memory. -As you may remember from the [second part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) of the Linux kernel initialization chapter, the `page fault` handler is located in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) source code file and represented by the `do_page_fault` function. We can see following check from the beginning of the `do_page_fault` function: - -```C -static noinline void -__do_page_fault(struct pt_regs *regs, unsigned long error_code, - unsigned long address) -{ - ... - ... - ... - if (kmemcheck_active(regs)) - kmemcheck_hide(regs); - ... - ... - ... -} -``` - -The `kmemcheck_active` gets `kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) structure and return the result of comparision of the `balance` field of this structure with zero: +这段代码判断如果`kmemcheck`使能,并且参数中未设置`SLAB_NOTRACK`,那么就给分配的内存页设置 `non-present`标记。`SLAB_NOTRACK`标记的含义是不跟踪未初始化的内存。另外,如果缓存对象有构造函数(缓存细节在下面描述),所分配的内存页标记为未初始化,否则标记为未分配。`kmemcheck_alloc_shadow`函数在源码[mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c)中,其基本内容如下: ``` -bool kmemcheck_active(struct pt_regs *regs) -{ - struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context); - return data->balance > 0; -} + void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node) + { + struct page *shadow; + + shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order); + + for(i = 0; i < pages; ++i) + page[i].shadow = page_address(&shadow[i]); + + kmemcheck_hide_pages(page, pages); + } + ``` -The `kmemcheck_context` is structure which describes current state of the 
`kmemcheck` mechanism. It stored unitialized addresses, number of such addresses and etc. The `balance` field of this structure represents current state of the `kmemcheck` or in other words it can tell us did `kmemcheck` already hid pages or not yet. If the `data->balance` is greater than zero, the `kmemcheck_hide` function will be called. This means than `kmemecheck` already set `present` bit for given pages and now we need to hide pages again to to cause nest step page fault. This function will hide addresses of pages again by unsetting of `present` bit. This means that one session of `kmemcheck` already finished and new page fault occured. At the first step the `kmemcheck_active` will return false as the `data->balance` is zero for the start and the `kmemcheck_hide` will not be called. Next, we may see following line of code in the `do_page_fault`: +首先为shadow bits分配内存,并为内存页设置shadow位。如果内存页设置了该标记,就意味着`kmemcheck`会跟踪这个内存页。最后调用`kmemcheck_hide_pages`函数。`kmemcheck_hide_pages`是体系结构相关的函数,其代码在 [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c)源码中。该函数的功能是为指定的内存页设置`non-present`标记。该函数实现如下: -```C -if (kmemcheck_fault(regs, address, error_code)) +``` + + void kmemcheck_hide_pages(struct page *p, unsigned int n) + { + unsigned int i; + + for (i = 0; i < n; ++i) { + unsigned long address; + pte_t *pte; + unsigned int level; + + address = (unsigned long) page_address(&p[i]); + pte = lookup_address(address, &level); + BUG_ON(!pte); + BUG_ON(level != PG_LEVEL_4K); + + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT)); + set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN)); + __flush_tlb_one(address); + } + } +``` + +该函数遍历所有的内存页,并尝试获取每个内存页的`页表项`。如果获取成功,清理页表项的`present`标记,设置页表项的hidden标记。在最后刷新[translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer),因为有一些内存页已经发生了改变。从这个地方开始,内存页就进入 `kmemcheck`的跟踪系统。因为内存页的`present`标记被清除了,一旦 
`kmalloc`返回了内存地址,并且有代码访问这个地址,就会触发[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。
+
+在Linux内核初始化这章的[第二节](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)介绍过,`缺页中断`处理程序位于[arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c)的 `do_page_fault`函数中。该函数开始部分如下:
+
+```
+
+	static noinline void
+	__do_page_fault(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address)
+	{
+	...
+	...
+	...
+	if (kmemcheck_active(regs))
+		kmemcheck_hide(regs);
+	...
+	...
+	...
+	}
+```
+
+`kmemcheck_active`函数获取`kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)结构体,并返回该结构体成员`balance`和0的比较结果:
+
+```
+
+	bool kmemcheck_active(struct pt_regs *regs)
+	{
+		struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
+
+		return data->balance > 0;
+	}
+```
+
+`kmemcheck_context`结构体代表 `kmemcheck`机制的当前状态。其内部保存了未初始化的地址、地址的数量等信息。其成员 `balance`代表了 `kmemcheck`的当前状态,换句话说,`balance`表示 `kmemcheck`是否已经隐藏了内存页。如果`data->balance`大于0,`kmemcheck_hide` 函数会被调用。这意味着 `kmemcheck`已经设置了内存页的`present`标记,现在需要再次隐藏内存页,以便触发下一次的缺页中断。`kmemcheck_hide`函数会清理内存页的 `present`标记,这表示一次`kmemcheck`会话已经完成,新的缺页中断会再次被触发。在第一步,由于`data->balance` 值为0,所以`kmemcheck_active`会返回false,`kmemcheck_hide`也不会被调用。接下来,我们看`do_page_fault`的下一行代码:
+
+```
+	if (kmemcheck_fault(regs, address, error_code))
 		return;
 ```

-First of all the `kmemcheck_fault` function checks that the fault was occured by the correct reason.
At first we check the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) and check that we are in normal kernel mode: +首先 `kmemcheck_fault` 函数检查引起错误的真实原因。第一步先检查[标记寄存器](https://en.wikipedia.org/wiki/FLAGS_register)以确认进程是否处于正常的内核态: -```C -if (regs->flags & X86_VM_MASK) - return false; -if (regs->cs != __KERNEL_CS) +``` + if (regs->flags & X86_VM_MASK) + return false; + if (regs->cs != __KERNEL_CS) + return false; +``` + +如果检测失败,表明这不是`kmemcheck`相关的缺页中断,`kmemcheck_fault`会返回。如果检测成功,接下来查找发生异常的地址的`页表项`,如果找不到页表项,函数返回false: + +``` + pte = kmemcheck_pte_lookup(address); + if (!pte) return false; ``` -If these checks wasn't successful we return from the `kmemcheck_fault` function as it was not `kmemcheck` related page fault. After this we try to lookup a `page table entry` related to the faulted address and if we can't find it we return: +`kmemcheck_fault`最后一步是调用`kmemcheck_access` 函数,该函数检查对指定内存页的访问,并设置该内存页的present标记。 `kmemcheck_access`函数做了大部分工作,它检查引起缺页异常的当前指令,如果检查到了错误,那么会把该错误的上下文保存到循环队列中: -```C -pte = kmemcheck_pte_lookup(address); -if (!pte) - return false; ``` - -Last two steps of the `kmemcheck_fault` function is to call the `kmemcheck_access` function which check access to the given page and show addresses again by setting present bit in the given page. The `kmemcheck_access` function does all main job. It check current instruction which caused a page fault. 
If it will find an error, the context of this error will be saved by `kmemcheck` to the ring queue: - -```C static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE]; ``` -The `kmemcheck` mechanism declares special [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-9.html): +`kmemcheck`声明了一个特殊的 [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-9.html): -```C +``` static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0); ``` -which runs the `do_wakeup` function from the [arch/x86/mm/kmemcheck/error.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/kmemcheck/error.c) source code file when it will be scheduled to run. +该tasklet被调度执行时,会调用`do_wakeup`函数,该函数位于[arch/x86/mm/kmemcheck/error.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/kmemcheck/error.c)文件中。 -The `do_wakeup` function will call the `kmemcheck_error_recall` function which will print errors collected by `kmemcheck`. As we already saw the: +`do_wakeup`函数调用`kmemcheck_error_recall`函数以便将`kmemcheck`检测到的错误信息输出。 -```C +``` kmemcheck_show(regs); ``` -function will be called in the end of the `kmemcheck_fault` function. 
This function will set present bit for the given pages again: +`kmemcheck_fault`函数结束时会调用`kmemcheck_show`函数,该函数会再次设置内存页的present标记。 -```C -if (unlikely(data->balance != 0)) { - kmemcheck_show_all(); - kmemcheck_error_save_bug(regs); - data->balance = 0; - return; +``` + + if (unlikely(data->balance != 0)) { + kmemcheck_show_all(); + kmemcheck_error_save_bug(regs); + data->balance = 0; + return; } ``` -Where the `kmemcheck_show_all` function calls the `kmemcheck_show_addr` for each address: +`kmemcheck_show_all`函数会针对每个地址调用`kmemcheck_show_addr`: -```C -static unsigned int kmemcheck_show_all(void) -{ - struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context); - unsigned int i; - unsigned int n; - - n = 0; - for (i = 0; i < data->n_addrs; ++i) - n += kmemcheck_show_addr(data->addr[i]); - - return n; -} ``` -by the call of the `kmemcheck_show_addr`: - -```C -int kmemcheck_show_addr(unsigned long address) -{ - pte_t *pte; - - pte = kmemcheck_pte_lookup(address); - if (!pte) - return 0; - - set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT)); - __flush_tlb_one(address); - return 1; -} + static unsigned int kmemcheck_show_all(void) + { + struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context); + unsigned int i; + unsigned int n; + + n = 0; + for (i = 0; i < data->n_addrs; ++i) + n += kmemcheck_show_addr(data->addr[i]); + + return n; + } ``` -In the end of the `kmemcheck_show` function we set the [TF](https://en.wikipedia.org/wiki/Trap_flag) flag if it wasn't set: +`kmemcheck_show_addr`函数内容如下: -```C -if (!(regs->flags & X86_EFLAGS_TF)) - data->flags = regs->flags; ``` -We need to do it because we need to hide pages again after first executed instruction after a page fault will be handled. In a case when the `TF` flag, so the processor will switch into single-step mode after the first instruction will be executed. In this case `debug` exception will occured. From this moment pages will be hidden again and execution will be continued. 
As pages hidden from this moment, page fault exception will occur again and `kmemcheck` continue to check/collect errors again and print them from time to time.
+我们之所以这么处理,是因为在内存页的缺页中断处理完之后,需要再次隐藏内存页。当设置了 `TF`标记后,处理器在执行第一条指令后会进入单步模式,这会触发`debug` 异常。从这个时刻开始,内存页会被再次隐藏起来,执行流程继续。由于内存页被隐藏,访问内存页的时候又会触发缺页中断,然后`kmemcheck`就会继续检测/收集内存错误信息,并不时地显示这些错误信息。

-That's all.
+到这里`kmemcheck`的工作机制就介绍完毕了。

-Conclusion
+总结
 --------------------------------------------------------------------------------

-This is the end of the third part about linux kernel [memory management](https://en.wikipedia.org/wiki/Memory_management). If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part we will see yet another memory debugging related tool - `kmemleak`.
+Linux内核[内存管理](https://en.wikipedia.org/wiki/Memory_management)第三节介绍到此为止。如果你有任何疑问或者建议,可以直接发消息给我 [0xAX](https://twitter.com/0xAX),给我发[邮件](anotherworldofworld@gmail.com),或者创建一个 [issue](https://github.com/0xAX/linux-insides/issues/new)。在接下来的小节中,我们来看一下另一个内存调试工具 - `kmemleak`。

-**Please note that English is not my first language and I am really sorry for any inconvenience.
If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).** +**英文不是我的母语。如果你发现我的英文描述有任何问题,请提交一个PR到 [linux-insides](https://github.com/0xAX/linux-insides).** Links -------------------------------------------------------------------------------- From b1f7565fa5518f6e6454c2d6d832481917a05b6a Mon Sep 17 00:00:00 2001 From: lifangwang Date: Sat, 15 Apr 2017 20:40:53 +0800 Subject: [PATCH 10/21] fix typos --- MM/linux-mm-3.md | 480 +++++++++++++++++++++++------------------------ 1 file changed, 231 insertions(+), 249 deletions(-) diff --git a/MM/linux-mm-3.md b/MM/linux-mm-3.md index 91bd122..28b9872 100644 --- a/MM/linux-mm-3.md +++ b/MM/linux-mm-3.md @@ -11,82 +11,82 @@ Linux内存管理 [章节](https://0xax.gitbooks.io/linux-insides/content/mm/) 固定映射地址代表[虚拟内存](https://en.wikipedia.org/wiki/Virtual_memory)中的一类特殊区域, 这类地址的物理映射地址是在[编译](https://en.wikipedia.org/wiki/Compile_time)期间计算出来的。输入输出重映射表示把输入/输出相关的内存映射到虚拟内存。 -例如,查看`/proc/iomem`命令的输出: +例如,查看`/proc/iomem`命令: ``` - $ sudo cat /proc/iomem - - 00000000-00000fff : reserved - 00001000-0009d7ff : System RAM - 0009d800-0009ffff : reserved - 000a0000-000bffff : PCI Bus 0000:00 - 000c0000-000cffff : Video ROM - 000d0000-000d3fff : PCI Bus 0000:00 - 000d4000-000d7fff : PCI Bus 0000:00 - 000d8000-000dbfff : PCI Bus 0000:00 - 000dc000-000dffff : PCI Bus 0000:00 - 000e0000-000fffff : reserved - ... - ... - ... +$ sudo cat /proc/iomem + +00000000-00000fff : reserved +00001000-0009d7ff : System RAM +0009d800-0009ffff : reserved +000a0000-000bffff : PCI Bus 0000:00 +000c0000-000cffff : Video ROM +000d0000-000d3fff : PCI Bus 0000:00 +000d4000-000d7fff : PCI Bus 0000:00 +000d8000-000dbfff : PCI Bus 0000:00 +000dc000-000dffff : PCI Bus 0000:00 +000e0000-000fffff : reserved +... +... +... 
``` -可以看到系统中每个物理设备对应的内存映射区域。上述输出信息第一列表示各类型内存使用的内存寄存器。第二列展示了内存寄存器所包含的各种类型的内存。再例如: +`iomem`命令的输出显示了系统中每个物理设备所映射的内存区域。第一列为物理设备分配的内存区域,第二列为对应的各种不同类型的物理设备。再例如: + ``` +$ sudo cat /proc/ioports - $ sudo cat /proc/ioports - 0000-0cf7 : PCI Bus 0000:00 - 0000-001f : dma1 - 0020-0021 : pic1 - 0040-0043 : timer0 - 0050-0053 : timer1 - 0060-0060 : keyboard - 0064-0064 : keyboard - 0070-0077 : rtc0 - 0080-008f : dma page reg - 00a0-00a1 : pic2 - 00c0-00df : dma2 - 00f0-00ff : fpu - 00f0-00f0 : PNP0C04:00 - 03c0-03df : vga+ - 03f8-03ff : serial - 04d0-04d1 : pnp 00:06 - 0800-087f : pnp 00:01 - 0a00-0a0f : pnp 00:04 - 0a20-0a2f : pnp 00:04 - 0a30-0a3f : pnp 00:04 - ... - ... - ... +0000-0cf7 : PCI Bus 0000:00 + 0000-001f : dma1 + 0020-0021 : pic1 + 0040-0043 : timer0 + 0050-0053 : timer1 + 0060-0060 : keyboard + 0064-0064 : keyboard + 0070-0077 : rtc0 + 0080-008f : dma page reg + 00a0-00a1 : pic2 + 00c0-00df : dma2 + 00f0-00ff : fpu + 00f0-00f0 : PNP0C04:00 + 03c0-03df : vga+ + 03f8-03ff : serial + 04d0-04d1 : pnp 00:06 + 0800-087f : pnp 00:01 + 0a00-0a0f : pnp 00:04 + 0a20-0a2f : pnp 00:04 + 0a30-0a3f : pnp 00:04 +... +... +... 
```

-该命令列出了系统中所有设备注册的输入输出端口。内核不能直接访问设备的输入/输出地址。所以在内核能够使用这些内存之前,内核必须将这些地址映射到虚拟地址空间,这就是输入输出内存映射机制的主要目的。在前面[第二节](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)中只介绍了早期的输入输出重映射。很快我们就要来看一看非早期输入输出重映射的实现机制。但在此之前,我们需要学习一些其他的知识,例如不同类型的内存分配器等,不然的话我们很难理解该机制。
+`ioports`的输出列出了系统中物理设备所注册的各种类型的I/O端口。内核不能直接访问设备的输入/输出地址。在内核能够使用这些内存之前,必须将这些地址映射到虚拟地址空间,这就是`io remap`机制的主要目的。在前面[第二节](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)中只介绍了早期的`io remap`。很快我们就要来看一看常规的`io remap`实现机制。但在此之前,我们需要学习一些其他的知识,例如不同类型的内存分配器等,不然的话我们很难理解该机制。

-所以,在进入Linux内核非早期的[内存管理](https://en.wikipedia.org/wiki/Memory_management)之前,我们要看一些提供特殊功能的机制,例如[调试](https://en.wikipedia.org/wiki/Debugging),检查[内存泄漏](https://en.wikipedia.org/wiki/Memory_leak),内存控制等等。学习这些内容有助于我们理解Linux内核中内存管理机制。
+在进入Linux内核常规的[内存管理](https://en.wikipedia.org/wiki/Memory_management)之前,我们要看一些特殊的内存机制,例如[调试](https://en.wikipedia.org/wiki/Debugging),检查[内存泄漏](https://en.wikipedia.org/wiki/Memory_leak),内存控制等等。学习这些内容有助于我们理解Linux内核的内存管理。

 从本节的标题中,你可能已经看出来,我们会从 [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt)开始了解内存机制。和前面的[章节](https://0xax.gitbooks.io/linux-insides/content/)一样,我们首先从理论上学习什么是`kmemcheck`,然后再来看Linux内核中是怎么实现这一机制的。

 让我们开始吧。Linux内核中的`kmemcheck`到底是什么呢?从该机制的名称上你可能已经猜到, `kmemcheck` 是检查内存的。你猜的很对。`kmemcheck`的主要目的就是用来检查是否有内核代码访问 `未初始化的内存`。让我们看一个简单的[C](https://en.wikipedia.org/wiki/C_%28programming_language%29)程序:

-```
+```C
+#include <stdio.h>
+#include <stdlib.h>

- #include <stdio.h>
- #include <stdlib.h>
- 
- struct A {
- 	int a;
- };
- 
- int main(int argc, char **argv) {
- 	struct A *a = malloc(sizeof(struct A));
- 	printf("a->a = %d\n", a->a);
- 	return 0;
- }
+struct A {
+	int a;
+};
+
+int main(int argc, char **argv) {
+	struct A *a = malloc(sizeof(struct A));
+	printf("a->a = %d\n", a->a);
+	return 0;
+}
```

-在上面的程序中我们给结构体`A`分配了内存,然后我们尝试打印该结构体的成员`a`。如果我们不使用其他选项来编译该程序:
+在上面的程序中我们给结构体`A`分配了内存,然后我们尝试打印它的成员`a`。如果我们不使用其他选项来编译该程序:

```
gcc test.c -o test
@@ -95,56 +95,52 @@
[编译器](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)不会显示成员 `a`未初始化的提示信息。但是如果使用工具[valgrind](https://en.wikipedia.org/wiki/Valgrind)来运行该程序,我们会看到如下输出: ``` - - ~$ valgrind --leak-check=yes ./test - ==28469== Memcheck, a memory error detector - ==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. - ==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info - ==28469== Command: ./test - ==28469== - ==28469== Conditional jump or move depends on uninitialised value(s) - ==28469== at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so) - ==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so) - ==28469== by 0x4005B9: main (in /home/alex/test) - ==28469== - ==28469== Use of uninitialised value of size 8 - ==28469== at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so) - ==28469== by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so) - ==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so) - ==28469== by 0x4005B9: main (in /home/alex/test) - ... - ... - ... +~$ valgrind --leak-check=yes ./test +==28469== Memcheck, a memory error detector +==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. +==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info +==28469== Command: ./test +==28469== +==28469== Conditional jump or move depends on uninitialised value(s) +==28469== at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so) +==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so) +==28469== by 0x4005B9: main (in /home/alex/test) +==28469== +==28469== Use of uninitialised value of size 8 +==28469== at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so) +==28469== by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so) +==28469== by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so) +==28469== by 0x4005B9: main (in /home/alex/test) +... +... +... 
``` 实际上`kmemcheck`在内核空间做的事情,和`valgrind`在用户空间做的事情是一样的,都是用来检测未初始化的内存。 -要想在内核启用该机制,配置内核时在内核选项菜单要使能`CONFIG_KMEMCHECK`选项: +要想在内核中启用该机制,需要在配置内核时使能`CONFIG_KMEMCHECK`选项: ``` Kernel hacking -> Memory Debugging ``` - - ![kernel configuration menu](http://oi63.tinypic.com/2pzbog7.jpg) -我们不仅可以在内核中使能`kmemcheck`机制,它还提供了一些配置选项。我们可以在本小节的下一个段落中看到所有的选项。最后一个需要注意的是,`kmemcheck` 仅在 [x86_64](https://en.wikipedia.org/wiki/X86-64) 体系中实现了。为了确信这一点,我们可以查看`x86`的内核配置文件 [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig): +`kmemcheck`机制还提供了一些内核配置参数,我们可以在下一个段落中看到所有的可选参数。最后一个需要注意的是,`kmemcheck` 仅在 [x86_64](https://en.wikipedia.org/wiki/X86-64) 体系中实现了。为了确信这一点,我们可以查看`x86`的内核配置文件 [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig): ``` - - config X86 - ... - ... - ... - select HAVE_ARCH_KMEMCHECK - ... - ... - ... +config X86 + ... + ... + ... + select HAVE_ARCH_KMEMCHECK + ... + ... + ... ``` -因此,对于其他的体系结构来说,`kmemcheck` 功能是不存在的。 +因此,对于其他的体系结构来说是没有`kmemcheck` 功能的。 现在我们知道了`kmemcheck`可以检测内核中`未初始化内存`的使用情况,也知道了如何开启这个功能。那么`kmemcheck`是怎么做检测的呢?当内核尝试分配内存时,例如如下一段代码: @@ -152,208 +148,198 @@ Kernel hacking struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL); ``` -或者换句话说,在进程访问[page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29)时发生了[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。`kmemcheck`将内存页标记为`不存在`(关于Linux内存分页的相关信息,你可以参考[分页](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html))。如果一个 `缺页中断`异常发生了,异常处理程序会来处理这个异常,如果异常处理程序检测到内核使能了 `kmemcheck`,那么就会将控制权提交给 `kmemcheck`来处理;`kmemcheck`检查完之后,该内存页会被标记为`存在`,然后异常处理程序得到控制权继续执行下去。 这里的处理方式比较巧妙。异常处理程序第一条指令执行时,`kmemcheck`会标记内存页为`不存在`,按照这种方式,下一个对内存页的访问也会被捕获。 +或者换句话说,在内核访问[page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29)时会发生[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。这是由于`kmemcheck`将内存页标记为`不存在`(关于Linux内存分页的相关信息,你可以参考[分页](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html))。如果一个 `缺页中断`异常发生了,异常处理程序会来处理这个异常,如果异常处理程序检测到内核使能了 
`kmemcheck`,那么就会将控制权提交给 `kmemcheck`来处理;`kmemcheck`检查完之后,该内存页会被标记为`present`,然后被中断的程序得以继续执行下去。 这里的处理方式比较巧妙,被中断程序的第一条指令执行时,`kmemcheck`又会标记内存页为`not present`,按照这种方式,下一个对内存页的访问也会被捕获。 目前我们只是从理论层面考察了 `kmemcheck`,接下来我们看一下Linux内核是怎么来实现该机制的。 -`kmemcheck`机制在Linux内核中的实现方式 +`kmemcheck`机制在Linux内核中的实现 -------------------------------------------------------------------------------- 我们应该已经了解`kmemcheck`是做什么的以及它在Linux内核中的功能,现在是时候看一下它在Linux内核中的实现。 `kmemcheck`在内核的实现分为两部分。第一部分是架构无关的部分,位于源码 [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c);第二部分 [x86_64](https://en.wikipedia.org/wiki/X86-64)架构相关的部分位于目录[arch/x86/mm/kmemcheck](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck)中。 -我们先分析该机制的初始化过程。我们已经知道要在内核中使能`kmemcheck`机制,需要开启内核的`CONFIG_KMEMCHECK`配置项。除了这个选项,我们还需要给内核command line传递一个kmemcheck参数: +我们先分析该机制的初始化过程。我们已经知道要在内核中使能`kmemcheck`机制,需要开启内核的`CONFIG_KMEMCHECK`配置项。除了这个选项,我们还需要给内核command line传递一个`kmemcheck`参数: * kmemcheck=0 (disabled) * kmemcheck=1 (enabled) * kmemcheck=2 (one-shot mode) -前面两个值得含义很明确,但是最后一个需要一点解释。这个选项会使`kmemcheck`进入一种特殊的模式:在第一次检测到未初始化内存的使用之后,就会关闭`kmemcheck`。实际上该模式是内核的默认选项: +前面两个值得含义很明确,但是最后一个需要解释。这个选项会使`kmemcheck`进入一种特殊的模式:在第一次检测到未初始化内存的使用之后,就会关闭`kmemcheck`。实际上该模式是内核的默认选项: ![kernel configuration menu](http://oi66.tinypic.com/y2eeh.jpg) 从Linux初始化过程章节的第七节[part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html)中,我们知道在内核初始化过程中,会在`do_initcall_level`, `do_early_param`等函数中解析内核command line。前面也提到过 `kmemcheck`子系统由两部分组成,第一部分启动比较早。在源码 [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c)中有一个函数 `param_kmemcheck`,该函数在command line解析时就会用到: +```C +static int __init param_kmemcheck(char *str) +{ + int val; + int ret; -``` + if (!str) + return -EINVAL; - static int __init param_kmemcheck(char *str) - { - int val; - int ret; - - if (!str) - return -EINVAL; - - ret = kstrtoint(str, 0, &val); - if (ret) - return ret; - kmemcheck_enabled = val; - return 0; - } - - early_param("kmemcheck", 
param_kmemcheck);
+	ret = kstrtoint(str, 0, &val);
+	if (ret)
+		return ret;
+	kmemcheck_enabled = val;
+	return 0;
+}
+
+early_param("kmemcheck", param_kmemcheck);
```

从前面的介绍我们知道`param_kmemcheck`可能存在三种情况:`0` (禁止)、`1` (使能) 或 `2` (一次性)。`param_kmemcheck`的实现很简单:将command line传递的`kmemcheck`参数的值由字符串转换为整数,然后赋值给变量`kmemcheck_enabled`。

-第二阶段在内核初始化阶段执行,但不是在早期初始化过程 [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)。第二阶断的过程体现 `kmemcheck_init`: `kmemcheck_init`:
+第二阶段在内核初始化阶段执行,而不是在早期初始化过程 [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)。第二阶段的过程体现在 `kmemcheck_init`:

+```C
+int __init kmemcheck_init(void)
+{
+	...
+	...
+	...
+}
+
+early_initcall(kmemcheck_init);
```
-	int __init kmemcheck_init(void)
-	{
-	...
-	...
-	...
-	}
-
-	early_initcall(kmemcheck_init);
-	```
-
`kmemcheck_init`的主要目的就是调用 `kmemcheck_selftest` 函数,并检查它的返回值:

-```
+```C
+if (!kmemcheck_selftest()) {
+	printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
+	kmemcheck_enabled = 0;
+	return -EINVAL;
+}

-	if (!kmemcheck_selftest()) {
-		printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
-		kmemcheck_enabled = 0;
-		return -EINVAL;
-	}
-
-	printk(KERN_INFO "kmemcheck: Initialized\n");
-	```
+printk(KERN_INFO "kmemcheck: Initialized\n");
+```

如果`kmemcheck_init`检测失败,就返回 `-EINVAL`。

`kmemcheck_selftest`函数会检测内存访问相关的[操作码](https://en.wikipedia.org/wiki/Opcode)(例如 `rep movsb`, `movzwq`)的大小。如果检测到的大小与实际大小是一致的,`kmemcheck_selftest`返回 `true`,否则返回 `false`。

如果如下代码被调用:

-```
+```C
 struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
 ```

经过一系列的函数调用,`kmem_getpages`函数会被调用到,该函数的定义在源码 [mm/slab.c](https://github.com/torvalds/linux/blob/master/mm/slab.c)中,该函数的主要功能就是尝试按照指定的参数需求分配[内存页](https://en.wikipedia.org/wiki/Paging)。在该函数的结尾处有如下代码:

-
-
-	if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
-		kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
+```C
+if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
+	
kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid); if (cachep->ctor) kmemcheck_mark_uninitialized_pages(page, nr_pages); else kmemcheck_mark_unallocated_pages(page, nr_pages); - } - +} +``` 这段代码判断如果`kmemcheck`使能,并且参数中未设置`SLAB_NOTRACK`,那么就给分配的内存页设置 `non-present`标记。`SLAB_NOTRACK`标记的含义是不跟踪未初始化的内存。另外,如果缓存对象有构造函数(缓存细节在下面描述),所分配的内存页标记为未初始化,否则标记为未分配。`kmemcheck_alloc_shadow`函数在源码[mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c)中,其基本内容如下: -``` +```C +void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node) +{ + struct page *shadow; - void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node) - { - struct page *shadow; - - shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order); - - for(i = 0; i < pages; ++i) - page[i].shadow = page_address(&shadow[i]); - - kmemcheck_hide_pages(page, pages); - } + shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order); + for(i = 0; i < pages; ++i) + page[i].shadow = page_address(&shadow[i]); + + kmemcheck_hide_pages(page, pages); +} ``` 首先为shadow bits分配内存,并为内存页设置shadow位。如果内存页设置了该标记,就意味着`kmemcheck`会跟踪这个内存页。最后调用`kmemcheck_hide_pages`函数。`kmemcheck_hide_pages`是体系结构相关的函数,其代码在 [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c)源码中。该函数的功能是为指定的内存页设置`non-present`标记。该函数实现如下: -``` +```C +void kmemcheck_hide_pages(struct page *p, unsigned int n) +{ + unsigned int i; - void kmemcheck_hide_pages(struct page *p, unsigned int n) - { - unsigned int i; + for (i = 0; i < n; ++i) { + unsigned long address; + pte_t *pte; + unsigned int level; - for (i = 0; i < n; ++i) { - unsigned long address; - pte_t *pte; - unsigned int level; - - address = (unsigned long) page_address(&p[i]); - pte = lookup_address(address, &level); - BUG_ON(!pte); - BUG_ON(level != PG_LEVEL_4K); - - set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT)); - set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN)); - 
__flush_tlb_one(address); - } + address = (unsigned long) page_address(&p[i]); + pte = lookup_address(address, &level); + BUG_ON(!pte); + BUG_ON(level != PG_LEVEL_4K); + + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT)); + set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN)); + __flush_tlb_one(address); } +} ``` -该函数遍历所有的内存页,并尝试获取每个内存页的`页表项`。如果获取成功,清理页表项的`present`标记,设置页表项的hidden标记。在最后刷新[translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer),因为有一些内存页已经发生了改变。从这个地方开始,内存页就进入 `kmemcheck`的跟踪系统。因为内存页的`present`标记被清除了,一旦 `kmalloc`返回了内存地址,并且有代码访问这个地址,就会触发[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。 +该函数遍历参数代表的所有内存页,并尝试获取每个内存页的`页表项`。如果获取成功,清理页表项的present标记,设置页表项的hidden标记。在最后还需要刷新[TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer),因为有一些内存页已经发生了改变。从这个地方开始,内存页就进入 `kmemcheck`的跟踪系统。由于内存页的`present`标记被清除了,一旦 `kmalloc`返回了内存地址,并且有代码访问这个地址,就会触发[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。 -在Linux内核初始化这章的[第二节](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)介绍过,`缺页中断`处理程序位于[arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c)的 `do_page_fault`函数中。该函数开始部分如下: +在Linux内核初始化的[第二节](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)介绍过,`缺页中断`处理程序是[arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c)的 `do_page_fault`函数。该函数开始部分如下: -``` - - static noinline void - __do_page_fault(struct pt_regs *regs, unsigned long error_code, - unsigned long address) - { - ... - ... - ... - if (kmemcheck_active(regs)) - kmemcheck_hide(regs); - ... - ... - ... - } +```C +static noinline void +__do_page_fault(struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + ... + ... + ... + if (kmemcheck_active(regs)) + kmemcheck_hide(regs); + ... + ... + ... 
+} ``` `kmemcheck_active`函数获取`kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)结构体,并返回该结构体成员`balance`和0的比较结果: ``` +bool kmemcheck_active(struct pt_regs *regs) +{ + struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context); - bool kmemcheck_active(struct pt_regs *regs) - { - struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context); - - return data->balance > 0; - } + return data->balance > 0; +} ``` `kmemcheck_context`结构体代表 `kmemcheck`机制的当前状态。其内部保存了未初始化的地址,地址的数量等信息。其成员 `balance`代表了 `kmemcheck`的当前状态,换句话说,`balance`表示 `kmemcheck`是否已经隐藏了内存页。如果`data->balance`大于0, `kmemcheck_hide` 函数会被调用。这意味着 `kmemecheck`已经设置了内存页的`present`标记,但是我们需要再次隐藏内存页以便触发下一次的缺页中断。 `kmemcheck_hide`函数会清理内存页的 `present`标记,这表示一次`kmemcheck`会话已经完成,新的缺页中断会再次被触发。在第一步,由于`data->balance` 值为0,所以`kmemcheck_active`会返回false,所以 `kmemcheck_hide`也不会被调用。接下来,我们看`do_page_fault`的下一行代码: -``` - if (kmemcheck_fault(regs, address, error_code)) +```C +if (kmemcheck_fault(regs, address, error_code)) return; ``` 首先 `kmemcheck_fault` 函数检查引起错误的真实原因。第一步先检查[标记寄存器](https://en.wikipedia.org/wiki/FLAGS_register)以确认进程是否处于正常的内核态: -``` - if (regs->flags & X86_VM_MASK) - return false; - if (regs->cs != __KERNEL_CS) - return false; -``` - -如果检测失败,表明这不是`kmemcheck`相关的缺页中断,`kmemcheck_fault`会返回。如果检测成功,接下来查找发生异常的地址的`页表项`,如果找不到页表项,函数返回false: - -``` - pte = kmemcheck_pte_lookup(address); - if (!pte) +```C +if (regs->flags & X86_VM_MASK) + return false; +if (regs->cs != __KERNEL_CS) return false; ``` -`kmemcheck_fault`最后一步是调用`kmemcheck_access` 函数,该函数检查对指定内存页的访问,并设置该内存页的present标记。 `kmemcheck_access`函数做了大部分工作,它检查引起缺页异常的当前指令,如果检查到了错误,那么会把该错误的上下文保存到循环队列中: +如果检测失败,表明这不是`kmemcheck`相关的缺页中断,`kmemcheck_fault`会返回false。如果检测成功,接下来查找发生异常的地址的`页表项`,如果找不到页表项,函数返回false: +```C +pte = kmemcheck_pte_lookup(address); +if (!pte) + return false; ``` + +`kmemcheck_fault`最后一步是调用`kmemcheck_access` 函数,该函数检查对指定内存页的访问,并设置该内存页的present标记。 `kmemcheck_access`函数做了大部分工作,它检查引起缺页异常的当前指令,如果检查到了错误,那么会把该错误的上下文保存到环形队列中: + 
+```C static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE]; ``` `kmemcheck`声明了一个特殊的 [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-9.html): -``` +```C static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0); ``` @@ -361,74 +347,70 @@ static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0); `do_wakeup`函数调用`kmemcheck_error_recall`函数以便将`kmemcheck`检测到的错误信息输出。 -``` +```C kmemcheck_show(regs); ``` `kmemcheck_fault`函数结束时会调用`kmemcheck_show`函数,该函数会再次设置内存页的present标记。 -``` - - if (unlikely(data->balance != 0)) { - kmemcheck_show_all(); - kmemcheck_error_save_bug(regs); - data->balance = 0; - return; +```C +if (unlikely(data->balance != 0)) { + kmemcheck_show_all(); + kmemcheck_error_save_bug(regs); + data->balance = 0; + return; } ``` `kmemcheck_show_all`函数会针对每个地址调用`kmemcheck_show_addr`: -``` +```C +static unsigned int kmemcheck_show_all(void) +{ + struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context); + unsigned int i; + unsigned int n; - static unsigned int kmemcheck_show_all(void) - { - struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context); - unsigned int i; - unsigned int n; - - n = 0; - for (i = 0; i < data->n_addrs; ++i) - n += kmemcheck_show_addr(data->addr[i]); - - return n; - } + n = 0; + for (i = 0; i < data->n_addrs; ++i) + n += kmemcheck_show_addr(data->addr[i]); + + return n; +} ``` `kmemcheck_show_addr`函数内容如下: -``` +```C +int kmemcheck_show_addr(unsigned long address) +{ + pte_t *pte; - int kmemcheck_show_addr(unsigned long address) - { - pte_t *pte; - - pte = kmemcheck_pte_lookup(address); - if (!pte) - return 0; - - set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT)); - __flush_tlb_one(address); - return 1; - } + pte = kmemcheck_pte_lookup(address); + if (!pte) + return 0; + + set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT)); + __flush_tlb_one(address); + return 1; +} ``` 在函数 `kmemcheck_show`的结尾处会设置[TF](https://en.wikipedia.org/wiki/Trap_flag) 标记: +```C +if (!(regs->flags & 
X86_EFLAGS_TF)) + data->flags = regs->flags; ``` - if (!(regs->flags & X86_EFLAGS_TF)) - data->flags = regs->flags; -``` - -我们之所以这么处理,是因为我们在内存页的缺页中断处理完后需要再次隐藏内存页。当 `TF`标记被设置后,处理器在访问指令异常后会进入单步模式,这会触发`debug` 异常。从这个地方开始,内存页会被隐藏起来,执行流程继续。由于内存页不可见,那么访问内存页的时候又会触发缺页中断,然后`kmemcheck`就有机会继续检测/手机内存错误信息并显示这些错误信息。 +我们之所以这么处理,是因为我们在内存页的缺页中断处理完后需要再次隐藏内存页。当 `TF`标记被设置后,处理器在执行被中断程序的第一条指令时会进入单步模式,这会触发`debug` 异常。从这个地方开始,内存页会被隐藏起来,执行流程继续。由于内存页不可见,那么访问内存页的时候又会触发缺页中断,然后`kmemcheck`就有机会继续检测/收集并显示内存错误信息。 到这里`kmemcheck`的工作机制就介绍完毕了。 -总结 +结束语 -------------------------------------------------------------------------------- -Linux内核[内存管理](https://en.wikipedia.org/wiki/Memory_management)第三节介绍到此为止。如果你有任何疑问或者建议,你可以直接发消息给我[0xAX](https://twitter.com/0xAX), 给我发[邮件](anotherworldofworld@gmail.com),或者创建一个[issue](https://github.com/0xAX/linux-insides/issues/new). 在接下来的小节中,我们来看一下另一个内存调试工具 - `kmemleak`。 +Linux内核[内存管理](https://en.wikipedia.org/wiki/Memory_management)第三节介绍到此为止。如果你有任何疑问或者建议,你可以直接给我[0xAX](https://twitter.com/0xAX)发消息, 发[邮件](anotherworldofworld@gmail.com),或者创建一个[issue](https://github.com/0xAX/linux-insides/issues/new)。 在接下来的小节中,我们来看一下另一个内存调试工具 - `kmemleak`。 **英文不是我的母语。如果你发现我的英文描述有任何问题,请提交一个PR到 [linux-insides](https://github.com/0xAX/linux-insides).** From ed574bd8d4dd11246dc8afe6d19b72457416928f Mon Sep 17 00:00:00 2001 From: xinqiu Date: Sat, 6 May 2017 09:44:05 +0800 Subject: [PATCH 11/21] =?UTF-8?q?=E4=BB=96=20->=20=E5=AE=83?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- Booting/linux-bootstrap-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Booting/linux-bootstrap-1.md b/Booting/linux-bootstrap-1.md index d23daee..4bea266 100644 --- a/Booting/linux-bootstrap-1.md +++ b/Booting/linux-bootstrap-1.md @@ -4,7 +4,7 @@ 从引导加载程序内核 -------------------------------------------------------------------------------- -如果你已经看过我之前的[文章](http://0xax.blogspot.com/search/label/asm),就知道之前我开始和底层编程打交道。我写了一些关于 Linux x86_64 汇编的文章。同时,我开始深入研究 Linux 
源代码。底层是如果工作的,程序是如何在电脑上运行的,他们是如何在内存中定位的,内核是如何管理进程和内存,网络堆栈是如何在底层工作的等等,这些我都非常感兴趣。因此,我决定去写另外的一系列文章关于 **x86_64** 框架的 Linux 内核。
+如果你已经看过我之前的[文章](http://0xax.blogspot.com/search/label/asm),就知道之前我开始和底层编程打交道。我写了一些关于 Linux x86_64 汇编的文章。同时,我开始深入研究 Linux 源代码。底层是如何工作的,程序是如何在电脑上运行的,它们是如何在内存中定位的,内核是如何管理进程和内存,网络堆栈是如何在底层工作的等等,这些我都非常感兴趣。因此,我决定去写另外的一系列文章关于 **x86_64** 框架的 Linux 内核。
 
 *注意这不是官方文档,只是学习和分享知识*
 
From 188dd2273ffbf8db563ad5277467d1fd853a5053 Mon Sep 17 00:00:00 2001
From: xinqiu
Date: Sat, 6 May 2017 09:56:09 +0800
Subject: [PATCH 12/21] =?UTF-8?q?=E4=B8=80=E4=BA=9B=E6=B6=A6=E8=89=B2?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 Booting/linux-bootstrap-1.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Booting/linux-bootstrap-1.md b/Booting/linux-bootstrap-1.md
index 4bea266..b20039d 100644
--- a/Booting/linux-bootstrap-1.md
+++ b/Booting/linux-bootstrap-1.md
@@ -4,7 +4,7 @@
 从引导加载程序内核
 --------------------------------------------------------------------------------
 
-如果你已经看过我之前的[文章](http://0xax.blogspot.com/search/label/asm),就知道之前我开始和底层编程打交道。我写了一些关于 Linux x86_64 汇编的文章。同时,我开始深入研究 Linux 源代码。底层是如何工作的,程序是如何在电脑上运行的,它们是如何在内存中定位的,内核是如何管理进程和内存,网络堆栈是如何在底层工作的等等,这些我都非常感兴趣。因此,我决定去写另外的一系列文章关于 **x86_64** 框架的 Linux 内核。
+如果看过我在这之前的[文章](http://0xax.blogspot.com/search/label/asm),你就会知道我已经开始涉足底层的代码编写。我写了一些关于 Linux x86_64 汇编的文章。同时,我开始深入研究 Linux 源代码。底层是如何工作的,程序是如何在电脑上运行的,它们是如何在内存中定位的,内核是如何管理进程和内存,网络堆栈是如何在底层工作的等等,这些我都非常感兴趣。因此,我决定去写另外的一系列文章关于 **x86_64** 框架的 Linux 内核。
 
 *注意这不是官方文档,只是学习和分享知识*
 
@@ -20,7 +20,7 @@
 
 神奇的电源按钮,接下来会发生什么?
-------------------------------------------------------------------------------- -尽管这一系列文章关于 Linux 内核,我们还没有从内核代码(至少在这一章)开始。好了,当你按下你笔记本或台式机的神奇电源按钮,它开始工作。在主板发送一个信号给[电源](https://en.wikipedia.org/wiki/Power_supply),电源提供电脑适当量的电力。一旦主板收到了[电源备妥信号](https://en.wikipedia.org/wiki/Power_good_signal),它会尝试启动 CPU 。CPU 复位寄存器里的所有剩余数据,设置预定义的值给每个寄存器。 +尽管这一系列文章关于 Linux 内核,我们在第一章并不会从内核代码开始。电脑在你按下电源开关的时候,就开始工作。主板发送信号给[电源](https://en.wikipedia.org/wiki/Power_supply),而电源收到信号后会给电脑供应合适的电量。一旦主板收到了[电源备妥信号](https://en.wikipedia.org/wiki/Power_good_signal),它会尝试启动 CPU 。CPU 则复位寄存器的所有数据,并设置每个寄存器的预定值。 [80386](https://en.wikipedia.org/wiki/Intel_80386) From d166d39e2b46940bc82c081fe2dae5208449cfc1 Mon Sep 17 00:00:00 2001 From: Shengqiu Li Date: Wed, 10 May 2017 22:32:34 +0800 Subject: [PATCH 13/21] =?UTF-8?q?=E6=B7=BB=E5=8A=A0=E7=AC=AC=E4=BA=8C?= =?UTF-8?q?=E7=AB=A0=E7=AC=AC1=E3=80=812=E3=80=813=E8=8A=82=E7=BF=BB?= =?UTF-8?q?=E8=AF=91?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- Initialization/linux-initialization-1.md | 221 ++++++++++++----------- Initialization/linux-initialization-2.md | 175 +++++++++--------- Initialization/linux-initialization-3.md | 149 ++++++++------- 3 files changed, 278 insertions(+), 267 deletions(-) diff --git a/Initialization/linux-initialization-1.md b/Initialization/linux-initialization-1.md index d619262..372b23d 100644 --- a/Initialization/linux-initialization-1.md +++ b/Initialization/linux-initialization-1.md @@ -1,23 +1,26 @@ -Kernel initialization. Part 1. 
+内核初始化 第一部分 ================================================================================ -First steps in the kernel code +踏入内核代码的第一步(TODO: Need proofreading) -------------------------------------------------------------------------------- -The previous [post](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html) was a last part of the Linux kernel [booting process](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/index.html) chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) will be called. 
+[上一章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html)是[引导过程](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/index.html)的最后一部分。从现在开始,我们将深入探究 Linux 内核的初始化过程。在解压缩完 Linux 内核镜像、并把它妥善地放入内存后,内核就开始工作了。我们在第一章中介绍了 Linux 内核引导程序,它的任务就是为执行内核代码做准备。而在本章中,我们将探究内核代码,看一看内核的初始化过程——即在启动 [PID](https://en.wikipedia.org/wiki/Process_identifier) 为 `1` 的 `init` 进程前,内核所做的大量工作。 -In the last [part](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html) of the previous [chapter](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/index.html) we stopped at the [jmp](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file: +本章的内容很多,介绍了在内核启动前的所有准备工作。[arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) 文件中定义了内核入口点,我们会从这里开始,逐步地深入下去。在 `start_kernel` 函数(定义在 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489)) 执行之前,我们会看到很多的初期的初始化过程,例如初期页表初始化、切换到一个新的内核空间描述符等等。 + +在[上一章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/index.html)的[最后一节](https://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-5.html)中,我们跟踪到了 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) 文件中的 [jmp](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) 指令: ```assembly jmp *%rax ``` -At this moment the `rax` register contains address of the Linux kernel entry point which that was obtained as a result of the call of the `decompress_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file. 
So, our last instruction in the kernel setup code is a jump on the kernel entry point. We already know where is defined the entry point of the linux kernel, so we are able to start to learn what does the Linux kernel does after the start. +此时 `rax` 寄存器中保存的就是 Linux 内核入口点,通过调用 `decompress_kernel` ([arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c)) 函数后获得。由此可见,内核引导程序的最后一行代码是一句指向内核入口点的跳转指令。既然已经知道了内核入口点定义在哪,我们就可以继续探究 Linux 内核在引导结束后做了些什么。 -First steps in the kernel + +内核执行的第一步 -------------------------------------------------------------------------------- -Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into `rax` register and just jumped there. As we already know the entry point of the decompressed kernel image starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly source code file and at the beginning of it, we can see following definitions: +OK,在调用了 `decompress_kernel` 函数后,`rax` 寄存器中保存了解压缩后的内核镜像的地址,并且跳转了过去。解压缩后的内核镜像的入口点定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S),这个文件的开头几行如下: ```assembly __HEAD @@ -29,13 +32,13 @@ startup_64: ... 
``` -We can see definition of the `startup_64` routine that is defined in the `__HEAD` section, which is just a macro which expands to the definition of executable `.head.text` section: +我们可以看到 `startup_64` 过程定义在了 `__HEAD` 区段下。 `__HEAD` 只是一个宏,它将展开为可执行的 `.head.text` 区段: ```C #define __HEAD .section ".head.text","ax" ``` -We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S#L93) linker script: +我们可以在 [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S#L93) 链接器脚本文件中看到这个区段的定义: ``` .text : AT(ADDR(.text) - LOAD_OFFSET) { @@ -46,48 +49,48 @@ We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](htt } :text = 0x9090 ``` -Besides the definition of the `.text` section, we can understand default virtual and physical addresses from the linker script. Note that address of the `_text` is location counter which is defined as: +除了对 `.text` 区段的定义,我们还能从这个脚本文件中得知内核的默认物理地址与虚拟地址。`_text` 是一个地址计数器,对于 [x86_64](https://en.wikipedia.org/wiki/X86-64) 来说,它定义为: ``` . = __START_KERNEL; ``` -for the [x86_64](https://en.wikipedia.org/wiki/X86-64). 
The definition of the `__START_KERNEL` macro is located in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h) header file and represented by the sum of the base virtual address of the kernel mapping and physical start:
+`__START_KERNEL` 宏的定义在 [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h) 头文件中,它由内核映射的虚拟基址与物理起始地址相加得到:

```C
-#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
+#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
 #define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
```

-Or in other words:
+换句话说:

-* Base physical address of the Linux kernel - `0x1000000`;
-* Base virtual address of the Linux kernel - `0xffffffff81000000`.
+* Linux 内核的物理基址 - `0x1000000`;
+* Linux 内核的虚拟基址 - `0xffffffff81000000`.

-Now we know default physical and virtual addresses of the `startup_64` routine, but to know actual addresses we must to calculate it with the following code:
+现在我们知道了 `startup_64` 过程的默认物理地址与虚拟地址,但是真正的地址必须要通过下面的代码计算得到:

```assembly
	leaq	_text(%rip), %rbp
	subq	$_text - __START_KERNEL_map, %rbp
```

-Yes, it defined as `0x1000000`, but it may be different, for example if [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) is enabled. So our current goal is to calculate delta between `0x1000000` and where we actually loaded. Here we just put the `rip-relative` address to the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that compiled virtual address of the `_text` is `0xffffffff81000000` and the physical address of it is `0x1000000`.
The `__START_KERNEL_map` macro expands to the `0xffffffff80000000` address, so at the second line of the assembly code, we will get following expression:
+没错,虽然定义为 `0x1000000`,但是仍然有可能变化,例如启用 [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) 的时候。所以我们当前的目标是计算 `0x1000000` 与实际加载地址的差。这里我们首先将RIP相对地址(`rip-relative`)放入 `rbp` 寄存器,并且从中减去 `$_text - __START_KERNEL_map`。我们已经知道,`_text` 在编译后的默认虚拟地址为 `0xffffffff81000000`,物理地址为 `0x1000000`。`__START_KERNEL_map` 宏将展开为 `0xffffffff80000000`,因此对于第二行汇编代码,我们将得到如下的表达式:

```
rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)
```

-So, after the calculation, the `rbp` will contain `0` which represents difference between addresses where we actually loaded and where the code was compiled. In our case `zero` means that the Linux kernel was loaded by default address and the [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) was disabled.
+在计算过后,`rbp` 的值将为 `0`,代表了实际加载地址与编译后的默认地址之间的差值。在我们这个例子中,`0` 代表了 Linux 内核被加载到了默认地址,并且没有启用 [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux)。

-After we got the address of the `startup_64`, we need to do a check that this address is correctly aligned. We will do it with the following code:
+在得到了 `startup_64` 的地址后,我们需要检查这个地址是否已经正确对齐。下面的代码将进行这项工作:

```assembly
	testl	$~PMD_PAGE_MASK, %ebp
	jnz	bad_address
```

-Here we just compare low part of the `rbp` register with the complemented value of the `PMD_PAGE_MASK`.
The `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [paging](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/Paging.html) about it) and defined as: +在这里我们将 `rbp` 寄存器的低32位与 `PMD_PAGE_MASK` 进行比较。`PMD_PAGE_MASK` 代表中层页目录(`Page middle directory`)屏蔽位(相关信息请阅读 [paging](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/Paging.html) 一节),它的定义如下: ```C #define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1)) @@ -96,9 +99,9 @@ Here we just compare low part of the `rbp` register with the complemented value #define PMD_SHIFT 21 ``` -As we can easily calculate, `PMD_PAGE_SIZE` is `2` megabytes. Here we use standard formula for checking alignment and if `text` address is not aligned for `2` megabytes, we jump to `bad_address` label. +可以很容易得出 `PMD_PAGE_SIZE` 为 `2MB` 。在这里我们使用标准公式来检查对齐问题,如果 `text` 的地址没有对齐到 `2MB`,则跳转到 `bad_address`。 -After this we check address that it is not too large by the checking of highest `18` bits: +在此之后,我们通过检查高 `18` 位来防止这个地址过大: ```assembly leaq _text(%rip), %rax @@ -106,18 +109,19 @@ After this we check address that it is not too large by the checking of highest jnz bad_address ``` -The address must not be greater than `46`-bits: +这个地址必须不超过 `46` 个比特,即小于2的46次方: ```C #define MAX_PHYSMEM_BITS 46 ``` -Okay, we did some early checks and now we can move on. +OK,至此我们完成了一些初步的检查,可以继续进行后续的工作了。 -Fix base addresses of page tables + +修正页表基地址 -------------------------------------------------------------------------------- -The first step before we start to setup identity paging is to fixup following addresses: +在开始设置 Identity 分页之前,我们需要首先修正下面的地址: ```assembly addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) @@ -126,7 +130,7 @@ The first step before we start to setup identity paging is to fixup following ad addq %rbp, level2_fixmap_pgt + (506*8)(%rip) ``` -All of `early_level4_pgt`, `level3_kernel_pgt` and other address may be wrong if the `startup_64` is not equal to default `0x1000000` address. 
The `rbp` register contains the delta address so we add to the certain entries of the `early_level4_pgt`, the `level3_kernel_pgt` and the `level2_fixmap_pgt`. Let's try to understand what these labels mean. First of all let's look at their definition: +如果 `startup_64` 的值不为默认的 `0x1000000` 的话, 则包括 `early_level4_pgt`、`level3_kernel_pgt` 在内的很多地址都会不正确。`rbp`寄存器中包含的是相对地址,因此我们把它与 `early_level4_pgt`、`level3_kernel_pgt` 以及 `level2_fixmap_pgt` 中特定的项相加。首先我们来看一下它们的定义: ```assembly NEXT_PAGE(early_level4_pgt) @@ -151,25 +155,25 @@ NEXT_PAGE(level1_fixmap_pgt) .fill 512,8,0 ``` -Looks hard, but it isn't. First of all let's look at the `early_level4_pgt`. It starts with the (4096 - 8) bytes of zeros, it means that we don't use the first `511` entries. And after this we can see one `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map + _PAGE_TABLE` from it. As we know `__START_KERNEL_map` is a base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we will get physical address of the `level3_kernel_pgt`. Now let's look at `_PAGE_TABLE`, it is just page entry access rights: +看起来很难理解,实则不然。首先我们来看一下 `early_level4_pgt`。它的前 (4096 - 8) 个字节全为 `0`,即它的前 `511` 个项均不使用,之后的一项是 `level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE`。我们知道 `__START_KERNEL_map` 是内核的虚拟基地址,因此减去 `__START_KERNEL_map` 后就得到了 `level3_kernel_pgt` 的物理地址。现在我们来看一下 `_PAGE_TABLE`,它是页表项的访问权限: ```C #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ _PAGE_ACCESSED | _PAGE_DIRTY) ``` -You can read more about it in the [paging](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/Paging.html) part. +更多信息请阅读 [分页](http://xinqiu.gitbooks.io/linux-insides-cn/content/Theory/Paging.html) 部分. -The `level3_kernel_pgt` - stores two entries which map kernel space. At the start of it's definition, we can see that it is filled with zeros `L3_START_KERNEL` or `510` times. 
Here the `L3_START_KERNEL` is the index in the page upper directory which contains `__START_KERNEL_map` address and it equals `510`. After this, we can see the definition of the two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has: +`level3_kernel_pgt` 中保存的两项用来映射内核空间,在它的前 `510`(即 `L3_START_KERNEL`)项均为 `0`。这里的 `L3_START_KERNEL` 保存的是在上层页目录(Page Upper Directory)中包含`__START_KERNEL_map` 地址的那一条索引,它等于 `510`。后面一项 `level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE` 中的 `level2_kernel_pgt` 比较容易理解,它是一条页表项,包含了指向中层页目录的指针,它用来映射内核空间,并且具有如下的访问权限: ```C #define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \ _PAGE_DIRTY) ``` -access rights. The second - `level2_fixmap_pgt` is a virtual addresses which can refer to any physical addresses even under kernel space. They represented by the one `level2_fixmap_pgt` entry and `10` megabytes hole for the [vsyscalls](https://lwn.net/Articles/446528/) mapping. The next `level2_kernel_pgt` calls the `PDMS` macro which creates `512` megabytes from the `__START_KERNEL_map` for kernel `.text` (after these `512` megabytes will be modules memory space). +`level2_fixmap_pgt` 是一系列虚拟地址,它们可以在内核空间中指向任意的物理地址。它们由`level2_fixmap_pgt`作为入口点、`10`MB 大小的空间用来为 [vsyscalls](https://lwn.net/Articles/446528/) 做映射。`level2_kernel_pgt` 则调用了`PDMS` 宏,在 `__START_KERNEL_map` 地址处为内核的 `.text` 创建了 `512`MB 大小的空间(这 `512` MB空间的后面是模块内存空间)。 -Now, after we saw definitions of these symbols, let's get back to the code which is described at the beginning of the section. Remember that the `rbp` register contains delta between the address of the `startup_64` symbol which was got during kernel [linking](https://en.wikipedia.org/wiki/Linker_%28computing%29) and the actual address. So, for this moment, we just need to add add this delta to the base address of some page table entries, that they'll have correct addresses. 
In our case these entries are: +现在,在看过了这些符号的定义之后,让我们回到本节开始时介绍的那几行代码。`rbp` 寄存器包含了实际地址与 `startup_64` 地址之差,其中 `startup_64` 的地址是在内核[链接](https://en.wikipedia.org/wiki/Linker_%28computing%29)时获得的。因此我们只需要把它与各个页表项的基地址相加,就能够得到正确的地址了。在这里这些操作如下: ```assembly addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) @@ -178,9 +182,9 @@ Now, after we saw definitions of these symbols, let's get back to the code which addq %rbp, level2_fixmap_pgt + (506*8)(%rip) ``` -or the last entry of the `early_level4_pgt` which is the `level3_kernel_pgt`, last two entries of the `level3_kernel_pgt` which are the `level2_kernel_pgt` and the `level2_fixmap_pgt` and five hundreds seventh entry of the `level2_fixmap_pgt` which is `level1_fixmap_pgt` page directory. +换句话说,`early_level4_pgt` 的最后一项就是 `level3_kernel_pgt`,`level3_kernel_pgt` 的最后两项分别是 `level2_kernel_pgt` 和 `level2_fixmap_pgt`, `level2_fixmap_pgt` 的第507项就是 `level1_fixmap_pgt` 页目录。 -After all of this we will have: +在这之后我们就得到了: ``` early_level4_pgt[511] -> level3_kernel_pgt[0] @@ -190,19 +194,19 @@ level2_kernel_pgt[0] -> 512 MB kernel mapping level2_fixmap_pgt[507] -> level1_fixmap_pgt ``` -Note that we didn't fixup base address of the `early_level4_pgt` and some of other page table directories, because we will see this during of building/filling of structures for these page tables. As we corrected base addresses of the page tables, we can start to build it. +需要注意的是,我们并不修正 `early_level4_pgt` 以及其他页目录的基地址,我们会在构造、填充这些页目录结构的时候修正。我们修正了页表基地址后,就可以开始构造这些页目录了。 -Identity mapping setup +Identity Map Paging -------------------------------------------------------------------------------- -Now we can see the set up of identity mapping of early page tables. In Identity Mapped Paging, virtual addresses are mapped to physical addresses that have the same value, `1 : 1`. Let's look at it in detail. 
First of all we get the `rip-relative` address of the `_text` and `_early_level4_pgt` and put they into `rdi` and `rbx` registers:
+现在我们可以进入到对初期页表进行 Identity 映射的初始化过程了。在 Identity 映射分页中,虚拟地址会被映射到数值相同的物理地址上,即 `1 : 1`。下面我们来看一下细节。首先我们取得 `_text` 与 `_early_level4_pgt` 的 RIP 相对地址,并把它们分别放入 `rdi` 与 `rbx` 寄存器中:

```assembly
	leaq	_text(%rip), %rdi
	leaq	early_level4_pgt(%rip), %rbx
```

-After this we store address of the `_text` in the `rax` and get the index of the page global directory entry which stores `_text` address, by shifting `_text` address on the `PGDIR_SHIFT`:
+在此之后我们使用 `rax` 保存 `_text` 的地址。全局页目录表中有一项存放的是 `_text` 的地址;为了得到这一项的索引,我们把 `_text` 的地址右移 `PGDIR_SHIFT` 位:

```assembly
	movq	%rdi, %rax
@@ -213,7 +217,8 @@ After this we store address of the `_text` in the `rax` and get the index of the
	movq	%rdx, 8(%rbx,%rax,8)
```

-where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories:
+其中 `PGDIR_SHIFT` 为 `39`。`PGDIR_SHIFT` 表示的是虚拟地址中全局页目录位的屏蔽值(mask)。下面的宏定义了所有类型的页目录的屏蔽值:
+

```C
#define PGDIR_SHIFT     39
@@ -221,9 +226,9 @@ where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global dir
#define PMD_SHIFT       21
```

-After this we put the address of the first `level3_kernel_pgt` in the `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `level3_kernel_pgt` entries.
+此后我们就将 `level3_kernel_pgt` 的地址放进 `rdx` 中,并将它的访问权限设置为 `_KERNPG_TABLE`(见上),然后将 `level3_kernel_pgt` 填入 `early_level4_pgt` 的两项中。

-After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now contains the address of the first entry of the `level3_kernel_pgt`) and put `rdi` (it now contains physical address of the `_text`) to the `rax`.
And after this we write addresses of the two page upper directory entries to the `level3_kernel_pgt`: +然后我们给 `rdx` 寄存器加上 `4096`(即 `early_level4_pgt` 的大小),并把 `rdi` 寄存器的值(即 `_text` 的物理地址)赋值给 `rax` 寄存器。之后我们把上层页目录中的两个项写入 `level3_kernel_pgt`: ```assembly addq $4096, %rdx @@ -236,7 +241,7 @@ After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now c movq %rdx, 4096(%rbx,%rax,8) ``` -In the next step we write addresses of the page middle directory entries to the `level2_kernel_pgt` and the last step is correcting of the kernel text+data virtual addresses: +下一步我们把中层页目录表项的地址写入 `level2_kernel_pgt`,然后修正内核的 text 和 data 的虚拟地址: ```assembly leaq level2_kernel_pgt(%rip), %rdi @@ -249,9 +254,9 @@ In the next step we write addresses of the page middle directory entries to the jne 1b ``` -Here we put the address of the `level2_kernel_pgt` to the `rdi` and address of the page table entry to the `r8` register. Next we check the present bit in the `level2_kernel_pgt` and if it is zero we're moving to the next page by adding 8 bytes to `rdi` which contains address of the `level2_kernel_pgt`. After this we compare it with `r8` (contains address of the page table entry) and go back to label `1` or move forward. +这里首先把 `level2_kernel_pgt` 的地址赋值给 `rdi`,并把页表项的地址赋值给 `r8` 寄存器。下一步我们来检查 `level2_kernel_pgt` 中的存在位,如果其为0,就把 `rdi` 加上8以便指向下一个页。然后我们将其与 `r8`(即页表项的地址)作比较,不相等的话就跳转回前面的标签 `1` ,反之则继续运行。 -In the next step we correct `phys_base` physical address with `rbp` (contains physical address of the `_text`), put physical address of the `early_level4_pgt` and jump to label `1`: +接下来我们使用 `rbp` (即 `_text` 的物理地址)来修正 `phys_base` 物理地址。将 `early_level4_pgt` 的物理地址与 `rbp` 相加,然后跳转至标签 `1`: ```assembly addq %rbp, phys_base(%rip) @@ -259,12 +264,12 @@ In the next step we correct `phys_base` physical address with `rbp` (contains ph jmp 1f ``` -where `phys_base` matches the first entry of the `level2_kernel_pgt` which is `512` MB kernel mapping. 
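上面这些对页表项“加 delta”的修正,可以用一小段 C 代码来示意(其中的地址数值与 `fixup_entry` 函数均为演示用的假设,真实的修正由前述 `addq` 指令直接完成):

```c
#include <assert.h>
#include <stdint.h>

/* 演示:页表项的低 12 位是标志位,高位是物理地址,
 * 修正时只对地址部分加上 delta,标志位保持不变。 */
uint64_t fixup_entry(uint64_t entry, uint64_t delta)
{
    uint64_t flags = entry & 0xfffUL;   /* _KERNPG_TABLE 之类的访问权限位 */
    uint64_t addr  = entry & ~0xfffUL;  /* 页表项中保存的物理地址 */
    return (addr + delta) | flags;
}
```

例如当链接地址与实际加载地址相差 `0x400000` 时,项 `0x2000063` 会被修正为 `0x2400063`,低 12 位的权限位保持不变。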
+其中 `phys_base` 与 `level2_kernel_pgt` 的第一项相同,为 `512` MB 的内核映射。

-Last preparation before jump at the kernel entry point
+跳转至内核入口点之前的最后准备
--------------------------------------------------------------------------------

-After that we jump to the label `1` we enable `PAE`, `PGE` (Paging Global Extension) and put the physical address of the `phys_base` (see above) to the `rax` register and fill `cr3` register with it:
+此后我们就跳转至标签 `1` 来开启 `PAE` 和 `PGE` (Paging Global Extension),并且将 `phys_base` 的物理地址(见上)放入 `rax` 寄存器,再将其写入 `cr3` 寄存器:

```assembly
1:
@@ -275,7 +280,8 @@ After that we jump to the label `1` we enable `PAE`, `PGE` (Paging Global Extens
 movq %rax, %cr3
```

-In the next step we check that CPU supports [NX](http://en.wikipedia.org/wiki/NX_bit) bit with:
+接下来我们检查CPU是否支持 [NX](http://en.wikipedia.org/wiki/NX_bit) 位:
+

```assembly
 movl $0x80000001, %eax
@@ -283,16 +289,18 @@ In the next step we check that CPU supports [NX](http://en.wikipedia.org/wiki/NX
 movl %edx,%edi
```

-We put `0x80000001` value to the `eax` and execute `cpuid` instruction for getting the extended processor info and feature bits. The result will be in the `edx` register which we put to the `edi`.
+首先将 `0x80000001` 放入 `eax` 中,然后执行 `cpuid` 指令来得到扩展的处理器信息和特性标志位。这条指令的结果会存放在 `edx` 中,我们把它再放到 `edi` 里。
+
+现在我们把 `MSR_EFER`(即 `0xc0000080`)放入 `ecx`,然后执行 `rdmsr` 指令来读取CPU中的 Model Specific Register (MSR):

-Now we put `0xc0000080` or `MSR_EFER` to the `ecx` and call `rdmsr` instruction for the reading model specific register.

```assembly
 movl $MSR_EFER, %ecx
 rdmsr
```

-The result will be in the `edx:eax`. General view of the `EFER` is following:
+返回结果将存放于 `edx:eax`。下面展示了 `EFER` 各个位的含义:
+

```
63 32
@@ -309,7 +317,7 @@ The result will be in the `edx:eax`. General view of the `EFER` is following:
 --------------------------------------------------------------------------------
```
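在继续之前,可以用一小段 C 代码示意对 `EFER` 中某一位执行 `btsl` 式置位的效果(位号取自上图:SCE 为第 0 位,NXE 为第 11 位;函数本身只是演示用的位运算):

```c
#include <assert.h>
#include <stdint.h>

/* EFER 中我们关心的两个位号(见上图) */
enum { EFER_BIT_SCE = 0, EFER_BIT_NXE = 11 };

/* 演示 btsl 的效果:把指定的位置 1,其余位保持不变 */
uint64_t efer_set_bit(uint64_t efer, int bit)
{
    return efer | (1UL << bit);
}
```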
As we read `EFER` to the `edx:eax`, we check `_EFER_SCE` or zero bit which is `System Call Extensions` with `btsl` instruction and set it to one. By the setting `SCE` bit we enable `SYSCALL` and `SYSRET` instructions. In the next step we check 20th bit in the `edi`, remember that this register stores result of the `cpuid` (see above). If `20` bit is set (`NX` bit) we just write `EFER_SCE` to the model specific register.
+在这里我们不会介绍每一个位的含义,没有涉及到的位和其他的 MSR 将会在专门的部分介绍。在我们将 `EFER` 读入 `edx:eax` 之后,通过 `btsl` 将 `_EFER_SCE`(即第 0 位,System Call Extensions)置 1。设置 `SCE` 位将会启用 `SYSCALL` 以及 `SYSRET` 指令。下一步我们检查 `edi`(即 `cpuid` 的结果,见上)中的第 20 位,即 `NX` 位。注意下面的 `jnc`:如果第 `20` 位没有置位(即CPU不支持 `NX`),就直接跳到标签 `1` 处,只把设置了 `SCE` 位的值写入 MSR。

```assembly
btsl	$_EFER_SCE, %eax
@@ -320,17 +328,16 @@ We will not see all fields in details here, but we will learn about this and oth
1:	wrmsr
```

-If the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is supported we enable `_EFER_NX` and write it too, with the `wrmsr` instruction. After the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is set, we set some bits in the `cr0` [control register](https://en.wikipedia.org/wiki/Control_register), namely:
+如果支持 [NX](https://en.wikipedia.org/wiki/NX_bit),那么我们就把 `_EFER_NX` 也置位,再用 `wrmsr` 指令一并写入。在设置了 [NX](https://en.wikipedia.org/wiki/NX_bit) 后,还要对 `cr0` ([control register](https://en.wikipedia.org/wiki/Control_register)) 中的一些位进行设置:

-* `X86_CR0_PE` - system is in protected mode;
-* `X86_CR0_MP` - controls interaction of WAIT/FWAIT instructions with TS flag in CR0;
-* `X86_CR0_ET` - on the 386, it allowed to specify whether the external math coprocessor was an 80287 or 80387;
-* `X86_CR0_NE` - enable internal x87 floating point error reporting when set, else enables PC style x87 error detection;
-* `X86_CR0_WP` - when set, the CPU can't write to read-only pages when privilege level is 0;
-* `X86_CR0_AM` - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3;
-* `X86_CR0_PG` - enable paging.
by the execution following assembly code:
+* `X86_CR0_PE` - 系统处于保护模式;
+* `X86_CR0_MP` - 与CR0的TS标志位一同控制 WAIT/FWAIT 指令的功能;
+* `X86_CR0_ET` - 在386上,该位用来指定外部数学协处理器是80287还是80387;
+* `X86_CR0_NE` - 如果置位,则启用内置的x87浮点错误报告,否则启用PC风格的x87错误检测;
+* `X86_CR0_WP` - 如果置位,则CPU在特权等级为0时无法写入只读内存页;
+* `X86_CR0_AM` - 当AM位置位、EFLAGS中的AC位置位、特权等级为3时,进行对齐检查;
+* `X86_CR0_PG` - 启用分页。
+
+这些位是通过执行如下汇编代码来设置的:

```assembly
#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
@@ -340,7 +347,7 @@
movl $CR0_STATE, %eax
movq %rax, %cr0
```

-We already know that to run any code, and even more [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code from assembly, we need to setup a stack. As always, we are doing it by the setting of [stack pointer](https://en.wikipedia.org/wiki/Stack_register) to a correct place in memory and resetting [flags](https://en.wikipedia.org/wiki/FLAGS_register) register after this:
+我们已经知道,要运行任何代码,尤其是要从汇编执行 [C语言](https://en.wikipedia.org/wiki/C_%28programming_language%29) 代码,需要先建立一个栈。与往常一样,先将[栈指针](https://en.wikipedia.org/wiki/Stack_register)指向内存中一个合适的区域,然后重置 [FLAGS寄存器](https://en.wikipedia.org/wiki/FLAGS_register):

```assembly
movq stack_start(%rip), %rsp
pushq $0
popfq
```

-The most interesting thing here is the `stack_start`. It defined in the same [source](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) code file and looks like:
+在这里最有意思的地方在于 `stack_start`。它也定义在[当前的源文件](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S)中:

```assembly
GLOBAL(stack_start)
.quad init_thread_union+THREAD_SIZE-8
```

-The `GLOBAL` is already familiar to us from.
It defined in the [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) header file expands to the `global` symbol definition:
+对于 `GLOBAL` 我们应该很熟悉了。它在 [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) 头文件中定义如下,会展开为一个全局符号的定义:

```C
#define GLOBAL(name) \
@@ -363,16 +370,15 @@ The `GLOBAL` is already familiar to us from. It defined in the [arch/x86/include
name:
```

-The `THREAD_SIZE` macro is defined in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h) header file and depends on value of the `KASAN_STACK_ORDER` macro:
+`THREAD_SIZE` 宏定义在 [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h) 中,它依赖于 `KASAN_STACK_ORDER` 的值:

```C
#define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
```

-We consider when the [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) is disabled and the `PAGE_SIZE` is `4096` bytes. So the `THREAD_SIZE` will expands to `16` kilobytes and represents size of the stack of a thread. Why is `thread`? You may already know that each [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may have parent [processes](https://en.wikipedia.org/wiki/Parent_process) and [child](https://en.wikipedia.org/wiki/Child_process) processes. Actually, a parent process and child process differ in stack. A new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by the [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with the `thread_info` structure.
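根据上面的定义,可以用一小段 C 代码验证禁用 kasan(即 `KASAN_STACK_ORDER` 为 0)时 `THREAD_SIZE` 的取值(宏的写法仿照内核,`thread_size_value` 是演示用的辅助函数):

```c
#include <assert.h>

#define PAGE_SIZE         4096
#define KASAN_STACK_ORDER 0   /* 禁用 kasan 时为 0 */
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE       (PAGE_SIZE << THREAD_SIZE_ORDER)

/* 演示用的辅助函数,便于在外部检查宏的取值 */
int thread_size_value(void)
{
    return THREAD_SIZE;   /* 4096 << 2 = 16384,即 16 KB */
}
```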
+首先来考虑当禁用了 [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) 并且 `PAGE_SIZE` 大小为 4096 字节时的情况。此时 `THREAD_SIZE` 将为 `16` KB,代表了一个线程的栈的大小。为什么是`线程`?我们知道每一个[进程](https://en.wikipedia.org/wiki/Process_%28computing%29)可能会有[父进程](https://en.wikipedia.org/wiki/Parent_process)和[子进程](https://en.wikipedia.org/wiki/Child_process)。事实上,父进程和子进程使用不同的栈空间,每一个新进程都会分配到一个新的内核栈。在Linux内核中,这个栈由一个包含了 `thread_info` 结构的[联合体(union)](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) `thread_union` 来表示,其定义如下:

-And as we can see the `init_thread_union` is represented by the `thread_union`, which defined as:

```C
union thread_union {
@@ -381,14 +387,14 @@ union thread_union {
};
```

-and `init_thread_union` looks like:
+而 `init_thread_union` 的定义如下:

```C
union thread_union init_thread_union __init_task_data = {
	INIT_THREAD_INFO(init_task)
};
```

-Where the `INIT_THREAD_INFO` macro takes `task_struct` structure which represents process descriptor in the Linux kernel and does some basic initialization of the given `task_struct` structure:
+其中 `INIT_THREAD_INFO` 宏接受一个 `task_struct` 结构类型的参数,并对其进行一些基本的初始化操作:

```C
#define INIT_THREAD_INFO(tsk) \
@@ -400,7 +406,7 @@ Where the `INIT_THREAD_INFO` macro takes `task_struct` structure which represent
}
```

-So, the `thread_union` contains low-level information about a process and process's stack and placed in the bottom of stack:
+`task_struct` 结构在内核中代表了对进程的描述。因此,`thread_union` 包含了关于一个进程的低级信息,其中 `thread_info` 位于进程栈的底部:

```
+-----------------------+
@@ -418,15 +424,15 @@ So, the `thread_union` contains low-level information about a process and proces
+-----------------------+
```

-Note that we reserve `8` bytes at the to of stack. This is necessary to guarantee illegal access of the next page memory.
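可以用一个简化的例子来体会这种 union 布局(`thread_info_demo` 等名字均为演示用的假设,真实的 `thread_info` 包含更多字段):

```c
#include <assert.h>
#include <stddef.h>

#define THREAD_SIZE 16384

/* 简化示意:真实的 thread_info 包含更多字段 */
struct thread_info_demo {
    int flags;
    int cpu;
};

/* union 的大小由较大的成员 stack 决定,共 THREAD_SIZE 字节;
 * thread_info 与栈共享同一块内存,位于最低地址处 */
union thread_union_demo {
    struct thread_info_demo thread_info;
    unsigned long stack[THREAD_SIZE / sizeof(unsigned long)];
};

unsigned long thread_union_size(void)  { return sizeof(union thread_union_demo); }
unsigned long thread_info_offset(void) { return offsetof(union thread_union_demo, thread_info); }
```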
+需要注意的是我们在栈顶保留了 `8` 个字节的空间,用来防范对下一个内存页的非法访问。

-After the early boot stack is set, to update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with `lgdt` instruction:
+在初期启动栈设置好之后,使用 `lgdt` 指令来更新[全局描述符表](https://en.wikipedia.org/wiki/Global_Descriptor_Table):

```assembly
lgdt early_gdt_descr(%rip)
```

-where the `early_gdt_descr` is defined as:
+其中 `early_gdt_descr` 定义如下:

```assembly
early_gdt_descr:
@@ -435,13 +441,13 @@ early_gdt_descr_base:
.quad INIT_PER_CPU_VAR(gdt_page)
```

-We need to reload `Global Descriptor Table` because now kernel works in the low userspace addresses, but soon kernel will work in it's own space. Now let's look at the definition of `early_gdt_descr`. Global Descriptor Table contains `32` entries:
+需要重新加载 `全局描述符表` 的原因是,虽然目前内核工作在用户空间的低地址中,但很快内核将会在它自己的内存地址空间中运行。下面让我们来看一下 `early_gdt_descr` 的定义。全局描述符表包含了32项,用于内核代码、数据、线程局部存储段等:

```C
#define GDT_ENTRIES 32
```

-for kernel code, data, thread local storage segments and etc... it's simple. Now let's look at the `early_gdt_descr_base`. First of `gdt_page` defined as:
+现在来看一下 `early_gdt_descr_base`。首先,`gdt_page` 的定义在 [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) 中:

```C
struct gdt_page {
@@ -449,7 +455,7 @@ struct gdt_page {
} __attribute__((aligned(PAGE_SIZE)));
```

-in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h). It contains one field `gdt` which is array of the `desc_struct` structure which is defined as:
+它只包含一个由 `desc_struct` 结构组成的数组 `gdt`。`desc_struct` 定义如下:

```C
struct desc_struct {
@@ -468,24 +474,26 @@ struct desc_struct {
} __attribute__((packed));
```

-and presents familiar to us `GDT` descriptor. Also we can note that `gdt_page` structure aligned to `PAGE_SIZE` which is `4096` bytes. It means that `gdt` will occupy one page. Now let's try to understand what is `INIT_PER_CPU_VAR`.
`INIT_PER_CPU_VAR` is a macro which defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) and just concats `init_per_cpu__` with the given parameter: +它跟 `GDT` 描述符的定义很像。同时需要注意的是,`gdt_page`结构是 `PAGE_SIZE`(` 4096`) 对齐的,即 `gdt` 将会占用一页内存。 + +下面我们来看一下 `INIT_PER_CPU_VAR`,它定义在 [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h),只是将给定的参数与 `init_per_cpu__`连接起来: ```C #define INIT_PER_CPU_VAR(var) init_per_cpu__##var ``` -After the `INIT_PER_CPU_VAR` macro will be expanded, we will have `init_per_cpu__gdt_page`. We can see in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S): +所以在宏展开之后,我们会得到 `init_per_cpu__gdt_page`。而在 [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) 中可以发现: ``` #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load INIT_PER_CPU(gdt_page); ``` -As we got `init_per_cpu__gdt_page` in `INIT_PER_CPU_VAR` and `INIT_PER_CPU` macro from linker script will be expanded we will get offset from the `__per_cpu_load`. After this calculations, we will have correct base address of the new GDT. +`INIT_PER_CPU` 扩展后也将得到 `init_per_cpu__gdt_page` 并将它的值设置为相对于 `__per_cpu_load` 的偏移量。这样,我们就得到了新GDT的正确的基地址。 -Generally per-CPU variables is a 2.6 kernel feature. You can understand what it is from its name. When we create `per-CPU` variable, each CPU will have will have its own copy of this variable. Here we creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with its own copy of variable and etc... So every core on multiprocessor will have its own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. 
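`##` 连接运算符的展开效果可以用一个小例子验证(其中 `gdt_page_demo` 是演示用的假设变量名):

```c
#include <assert.h>

/* 与内核中相同的写法:## 把两个记号连接成一个标识符 */
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

/* 下面一行展开后,声明的实际上是变量 init_per_cpu__gdt_page_demo */
int INIT_PER_CPU_VAR(gdt_page_demo) = 42;
```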
You can read in details about `per-CPU` variables in the [Theory/per-cpu](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html) post.
+per-CPU 变量是 2.6 内核中引入的特性。顾名思义,当我们创建一个 `per-CPU` 变量时,每个CPU都会拥有一份它自己的拷贝,在这里我们创建的是 `gdt_page` 这个 per-CPU 变量。这种类型的变量有很多优点,比如每个CPU都只访问自己的变量而不需要锁等。因此在多处理器的情况下,每一个处理器核心都将拥有一份自己的 `GDT` 表,其中的每一项都代表了一块内存,这块内存可以由在这个核心上运行的线程访问。[Theory/per-cpu](http://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/per-cpu.html) 一节中有关于 `per-CPU` 变量的更详细的介绍。

-As we loaded new Global Descriptor Table, we reload segments as we did it every time:
+在加载好了新的全局描述符表之后,跟之前一样我们重新加载一下各个段寄存器:

```assembly
xorl %eax,%eax
@@ -496,7 +504,7 @@ As we loaded new Global Descriptor Table, we reload segments as we did it every
movl %eax,%gs
```

-After all of these steps we set up `gs` register that it post to the `irqstack` which represents special stack where [interrupts](https://en.wikipedia.org/wiki/Interrupt) will be handled on:
+在所有这些步骤都结束后,我们需要设置一下 `gs` 寄存器,令它指向一个特殊的栈 `irqstack`,用于处理[中断](https://en.wikipedia.org/wiki/Interrupt):

```assembly
movl $MSR_GS_BASE,%ecx
@@ -505,13 +513,15 @@ After all of these steps we set up `gs` register that it post to the `irqstack`
wrmsr
```

-where `MSR_GS_BASE` is:
+其中 `MSR_GS_BASE` 为:

```C
#define MSR_GS_BASE 0xc0000101
```

-We need to put `MSR_GS_BASE` to the `ecx` register and load data from the `eax` and `edx` (which are point to the `initial_gs`) with `wrmsr` instruction. We don't use `cs`, `fs`, `ds` and `ss` segment registers for addressing in the 64-bit mode, but `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw it in the real mode for `cs`) and this part contains descriptor which mapped to [Model Specific Registers](https://en.wikipedia.org/wiki/Model-specific_register). So we can see above `0xc0000101` is a `gs.base` MSR address.
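`wrmsr` 写入的 64 位值由 `edx:eax` 两个 32 位寄存器拼接而成(`edx` 为高 32 位,`eax` 为低 32 位),这种拆分与拼接可以用 C 来示意:

```c
#include <assert.h>
#include <stdint.h>

/* wrmsr 的输入:edx 保存高 32 位,eax 保存低 32 位 */
uint32_t msr_low(uint64_t v)  { return (uint32_t)(v & 0xffffffffUL); }
uint32_t msr_high(uint64_t v) { return (uint32_t)(v >> 32); }

/* 把 edx:eax 重新拼成完整的 64 位 MSR 值 */
uint64_t msr_join(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```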
When a [system call](https://en.wikipedia.org/wiki/System_call) or [interrupt](https://en.wikipedia.org/wiki/Interrupt) occurred, there is no kernel stack at the entry point, so the value of the `MSR_GS_BASE` will store address of the interrupt stack.
+我们需要把 `MSR_GS_BASE` 放入 `ecx` 寄存器,然后利用 `wrmsr` 指令把 `eax` 和 `edx` 中的数据(它们指向 `initial_gs`)写入这个 MSR。`cs`、`fs`、`ds` 和 `ss` 段寄存器在64位模式下不用来寻址,但 `fs` 和 `gs` 可以使用。`fs` 和 `gs` 有一个隐含的部分(与实模式下的 `cs` 段寄存器类似),这个隐含部分存储了一个描述符,其映射到 [Model Specific Registers](https://en.wikipedia.org/wiki/Model-specific_register)。因此上面的 `0xc0000101` 就是 `gs.base` 的 MSR 地址。当发生[系统调用](https://en.wikipedia.org/wiki/System_call)或者[中断](https://en.wikipedia.org/wiki/Interrupt)时,入口点处并没有内核栈,因此 `MSR_GS_BASE` 将会用来存放中断栈的地址。

+接下来我们把实模式中的 bootparam 结构的地址放入 `rdi`(还记得吗,从一开始 `rsi` 中就保存着指向这个结构的指针),然后跳转到C语言代码:

```assembly
@@ -523,7 +533,7 @@ In the next step we put the address of the real mode bootparam structure to the
 lretq
```

-Here we put the address of the `initial_code` to the `rax` and push fake address, `__KERNEL_CS` and the address of the `initial_code` to the stack. After this we can see `lretq` instruction which means that after it return address will be extracted from stack (now there is address of the `initial_code`) and jump there. `initial_code` is defined in the same source code file and looks:
+这里我们把 `initial_code` 的地址放入 `rax` 中,并且向栈里分别压入一个无用的地址、`__KERNEL_CS` 和 `initial_code` 的地址。随后的 `lretq` 指令会从栈上弹出返回地址(此时就是 `initial_code` 的地址)并跳转过去。`initial_code` 同样定义在这个文件里:

```assembly
	.balign	8
@@ -534,7 +544,7 @@
	...
```

-As we can see `initial_code` contains address of the `x86_64_start_kernel`, which is defined in the [arch/x86/kerne/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this:
+可以看到 `initial_code` 包含了 `x86_64_start_kernel` 的地址,其定义在 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) 中:

```C
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
@@ -544,16 +554,16 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
}
```

-It has one argument is a `real_mode_data` (remember that we passed address of the real mode data to the `rdi` register previously).
+这个函数接受一个参数 `real_mode_data`(还记得吗,之前我们把实模式数据的地址传入了 `rdi` 寄存器)。

-This is first C code in the kernel!
+这个函数是内核中第一个执行的C语言代码!

-Next to start_kernel
+走进 start_kernel
--------------------------------------------------------------------------------

-We need to see last preparations before we can see "kernel entry point" - start_kernel function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489).
+在我们到达“内核入口点”([init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) 中的 start_kernel 函数)之前,还需要完成最后一些准备工作。

-First of all we can see some checks in the `x86_64_start_kernel` function:
+首先在 `x86_64_start_kernel` 函数中可以看到一些检查:

```C
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
@@ -566,20 +576,24 @@ BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
```

+这些检查针对的是各种不同的条件,例如:模块空间的虚拟地址不能低于内核 text 段的基地址 `__START_KERNEL_map`,包含模块的内核 text 段不能小于内核镜像等等。
+
+`BUILD_BUG_ON` 宏定义如下:
+
```C
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
```

-Let's try to understand how this trick works. Let's take for example first condition: `MODULES_VADDR < __START_KERNEL_map`. `!!conditions` is the same that `condition != 0`. So it means if `MODULES_VADDR < __START_KERNEL_map` is true, we will get `1` in the `!!(condition)` or zero if not. After `2*!!(condition)` we will get or `2` or `0`. In the end of calculations we can get two different behaviors:
+我们来分析一下这个trick是怎么工作的。以第一个条件 `MODULES_VADDR < __START_KERNEL_map` 为例:`!!(condition)` 等价于 `condition != 0`,也就是说如果 `MODULES_VADDR < __START_KERNEL_map` 为真,则 `!!(condition)` 为1,否则为0。随后 `2*!!(condition)` 将为 `2` 或 `0`。因此,这个宏可能产生两种不同的行为:

-* We will have compilation error, because try to get size of the char array with negative index (as can be in our case, because `MODULES_VADDR` can't be less than `__START_KERNEL_map` will be in our case);
-* No compilation errors.
+* 产生编译错误,因为我们尝试获取一个长度为负数的字符数组的大小;
+* 不产生编译错误。

-That's all. So interesting C trick for getting compile error which depends on some constants.
+就是这样。这是一个有趣的C语言trick:依据编译期的常量条件来触发编译错误。

-In the next step we can see call of the `cr4_init_shadow` function which stores shadow copy of the `cr4` per cpu. Context switches can change bits in the `cr4` so we need to store `cr4` for each CPU. And after this we can see call of the `reset_early_page_tables` function where we resets all page global directory entries and write new pointer to the PGT in `cr3`:
+接下来调用了 `cr4_init_shadow` 函数,它为每个CPU保存了一份 `cr4` 的影子拷贝(Shadow Copy)。上下文切换可能会修改 `cr4` 中的位,因此需要为每个CPU保存一份 `cr4` 的内容。在这之后将会调用 `reset_early_page_tables` 函数,它重置了所有的全局页目录项,同时向 `cr3` 中重新写入了全局页目录表的物理地址:

```C
for (i = 0; i < PTRS_PER_PGD-1; i++)
@@ -590,26 +604,25 @@
next_early_pgt = 0;
write_cr3(__pa_nodebug(early_level4_pgt));
```

-Soon we will build new page tables.
Here we can see that we go through all Page Global Directory Entries (`PTRS_PER_PGD` is `512`) in the loop and make it zero. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write physical address of the `early_level4_pgt` to the `cr3`. `__pa_nodebug` is a macro which will be expanded to: +很快我们就会设置新的页表。在这里我们遍历了所有的全局页目录项(其中 `PTRS_PER_PGD` 为 `512`),将其设置为0。之后将 `next_early_pgt` 设置为0(会在下一篇文章中介绍细节),同时把 `early_level4_pgt` 的物理地址写入 `cr3`。`__pa_nodebug` 是一个宏,将被扩展为: ```C ((unsigned long)(x) - __START_KERNEL_map + phys_base) ``` -After this we clear `_bss` from the `__bss_stop` to `__bss_start` and the next step will be setup of the early `IDT` handlers, but it's big concept so we will see it in the next part. +此后我们清空了从 `__bss_stop` 到 `__bss_start` 的 `_bss` 段,下一步将是建立初期 `IDT(中断描述符表)` 的处理代码,内容很多,我们将会留到下一个部分再来探究。 -Conclusion +总结 -------------------------------------------------------------------------------- -This is the end of the first part about linux kernel initialization. +第一部分关于Linux内核的初始化过程到这里就结束了。 -If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/MintCN/linux-insides-zh/issues/new). +如果你有任何问题或建议,请在twitter上联系我 [0xAX](https://twitter.com/0xAX),或者通过[邮件](anotherworldofworld@gmail.com)与我沟通,还可以新开[issue](https://github.com/MintCN/linux-insides-zh/issues/new)。 -In the next part we will see initialization of the early interruption handlers, kernel space memory mapping and a lot more. +下一部分我们会看到初期中断处理程序的初始化过程、内核空间的内存映射等。 -**Please note that English is not my first language and I am really sorry for any inconvenience. 
If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** -Links +相关链接 -------------------------------------------------------------------------------- * [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register) diff --git a/Initialization/linux-initialization-2.md b/Initialization/linux-initialization-2.md index 3a307b1..41b738c 100644 --- a/Initialization/linux-initialization-2.md +++ b/Initialization/linux-initialization-2.md @@ -1,38 +1,38 @@ -Kernel initialization. Part 2. +内核初始化 第二部分 ================================================================================ -Early interrupt and exception handling +初期中断和异常处理 -------------------------------------------------------------------------------- -In the previous [part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) we stopped before setting of early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish early preparation before the main kernel code will start to work. +在上一个 [部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) 我们谈到了初期中断初始化。目前我们已经处于解压缩后的Linux内核中了,还有了用于初期启动的基本的[分页](https://en.wikipedia.org/wiki/Page_table)机制。我们的目标是在内核的主体代码执行前做好准备工作。 -We already started to do this preparation in the previous [first](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) part of this [chapter](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html). We continue in this part and will know more about interrupt and exception handling. 
+我们已经在[本章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html)的[第一部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html)做了一些工作,在这一部分中我们会继续分析关于中断和异常处理部分的代码。 -Remember that we stopped before following loop: +我们在上一部分谈到了下面这个循环: ```C for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) set_intr_gate(i, early_idt_handler_array[i]); ``` -from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) source code file. But before we started to sort out this code, we need to know about interrupts and handlers. +这段代码位于 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c)。在分析这段代码之前,我们先来了解一些关于中断和中断处理程序的知识。 -Some theory +理论 -------------------------------------------------------------------------------- -An interrupt is an event caused by software or hardware to the CPU. For example a user have pressed a key on keyboard. On interrupt, CPU stops the current task and transfer control to the special routine which is called - [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler). An interrupt handler handles and interrupt and transfer control back to the previously stopped task. We can split interrupts on three types: +中断是一种由软件或硬件产生的、向CPU发出的事件。例如,如果用户按下了键盘上的一个按键时,就会产生中断。此时CPU将会暂停当前的任务,并且将控制流转到特殊的程序中——[中断处理程序(Interrupt Handler)](https://en.wikipedia.org/wiki/Interrupt_handler)。一个中断处理程序会对中断进行处理,然后将控制权交还给之前暂停的任务中。中断分为三类: -* Software interrupts - when a software signals CPU that it needs kernel attention. These interrupts are generally used for system calls; -* Hardware interrupts - when a hardware event happens, for example button is pressed on a keyboard; -* Exceptions - interrupts generated by CPU, when the CPU detects error, for example division by zero or accessing a memory page which is not in RAM. 
+* 软件中断 - 当一个软件可以向CPU发出信号,表明它需要系统内核的相关功能时产生。这些中断通常用于系统调用; +* 硬件中断 - 当一个硬件有任何事件发生时产生,例如键盘的按键被按下; +* 异常 - 当CPU检测到错误时产生,例如发生了除零错误或者访问了一个不存在的内存页。 -Every interrupt and exception is assigned a unique number which called - `vector number`. `Vector number` can be any number from `0` to `255`. There is common practice to use first `32` vector numbers for exceptions, and vector numbers from `32` to `255` are used for user-defined interrupts. We can see it in the code above - `NUM_EXCEPTION_VECTORS`, which defined as: +每一个中断和异常都可以由一个数来表示,这个数叫做`向量号`,它可以取从 `0` 到 `255` 中的任何一个数。通常在实践中前 `32` 个向量号用来表示异常,`32` 到 `255` 用来表示用户定义的中断。可以看到在上面的代码中,`NUM_EXCEPTION_VECTORS` 就定义为: ```C #define NUM_EXCEPTION_VECTORS 32 ``` -CPU uses vector number as an index in the `Interrupt Descriptor Table` (we will see description of it soon). CPU catch interrupts from the [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) or through it's pins. Following table shows `0-31` exceptions: +CPU会从[APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)或者CPU引脚接收中断,并使用中断向量号作为 `中断描述符表` 的索引。下面的表中列出了 `0-31` 号异常: ``` ---------------------------------------------------------------------------------------------- @@ -84,9 +84,9 @@ CPU uses vector number as an index in the `Interrupt Descriptor Table` (we will ---------------------------------------------------------------------------------------------- ``` -To react on interrupt CPU uses special structure - Interrupt Descriptor Table or IDT. IDT is an array of 8-byte descriptors like Global Descriptor Table, but IDT entries are called `gates`. CPU multiplies vector number on 8 to find index of the IDT entry. But in 64-bit mode IDT is an array of 16-byte descriptors and CPU multiplies vector number on 16 to find index of the entry in the IDT. 
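向量号到 IDT 表项地址的换算可以用一小段 C 代码来示意(这里按 64 位模式、每个描述符 16 字节计算;`idt_entry_addr` 是演示用的函数):

```c
#include <assert.h>
#include <stdint.h>

/* 64 位模式:每个 IDT 门描述符占 16 字节 */
uint64_t idt_entry_addr(uint64_t idt_base, unsigned int vector)
{
    return idt_base + (uint64_t)vector * 16;
}
```

例如 14 号向量(#PF,缺页异常)的描述符位于基地址偏移 `14 * 16 = 224` 字节处。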
We remember from the previous part that CPU uses special `GDTR` register to locate Global Descriptor Table, so CPU uses special register `IDTR` for Interrupt Descriptor Table and `lidt` instruction for loading base address of the table into this register.
+为了能够对中断进行处理,CPU使用了一种特殊的结构 - 中断描述符表(IDT)。IDT是一个由描述符组成的数组,其中每个描述符都为8个字节,与全局描述符表一致;不过不同的是,我们把IDT中的每一项叫做`门(gate)`。为了获得某一项描述符的起始地址,CPU会把向量号乘以8,在64位模式中则会乘以16。在前面我们已经见过,CPU使用一个特殊的 `GDTR` 寄存器来存放全局描述符表的地址,中断描述符表也有一个类似的寄存器 `IDTR`,同时还有用于将基地址加载入这个寄存器的指令 `lidt`。

-64-bit mode IDT entry has following structure:
+64位模式下IDT的每一项的结构如下:

```
127                                                                             96
@@ -115,46 +115,46 @@ To react on interrupt CPU uses special structure - Interrupt Descriptor Table or
--------------------------------------------------------------------------------
```

-Where:
+其中:

-* `Offset` - is offset to entry point of an interrupt handler;
-* `DPL` - Descriptor Privilege Level;
-* `P` - Segment Present flag;
-* `Segment selector` - a code segment selector in GDT or LDT
-* `IST` - provides ability to switch to a new stack for interrupts handling.
+* `Offset` - 代表了到中断处理程序入口点的偏移;
+* `DPL` - 描述符特权级别;
+* `P` - Segment Present 标志;
+* `Segment selector` - 在GDT或LDT中的代码段选择子;
+* `IST` - 用来为中断处理提供一个新的栈。

-And the last `Type` field describes type of the `IDT` entry. There are three different kinds of handlers for interrupts:
+最后的 `Type` 域描述了这一项的类型,中断处理程序共分为三种:

-* Task descriptor
-* Interrupt descriptor
-* Trap descriptor
+* 任务描述符
+* 中断描述符
+* 陷阱描述符

-Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. Only one difference between these types is how CPU handles `IF` flag. If interrupt handler was accessed through interrupt gate, CPU clear the `IF` flag to prevent other interrupts while current interrupt handler executes. After that current interrupt handler executes, CPU sets the `IF` flag again with `iret` instruction.
+中断和陷阱描述符包含了一个指向中断处理程序的远(far)指针,二者唯一的不同在于CPU处理 `IF` 标志的方式。如果是由中断门进入中断处理程序的,CPU会清除 `IF` 标志位,这样当当前中断处理程序执行时,CPU不会对其他的中断进行处理;只有当当前的中断处理程序返回时,CPU 才在 `iret` 指令执行时重新设置 `IF` 标志位。

-Other bits in the interrupt gate reserved and must be 0. Now let's look how CPU handles interrupts:
+中断门的其他位为保留位,必须为0。下面我们来看一下CPU是如何处理中断的:

-* CPU save flags register, `CS`, and instruction pointer on the stack.
-* If interrupt causes an error code (like `#PF` for example), CPU saves an error on the stack after instruction pointer;
-* After interrupt handler executed, `iret` instruction used to return from it.
+* CPU 会在栈上保存标志寄存器、`cs`段寄存器和程序计数器IP;
+* 如果中断会产生错误码(比如 `#PF`),CPU 会在保存程序计数器之后把错误码压入栈中;
+* 在中断处理程序执行完毕后,由`iret`指令返回。

-Now let's back to code.
+OK,接下来我们继续分析代码。

-Fill and load IDT
+设置并加载 IDT
--------------------------------------------------------------------------------

-We stopped at the following point:
+我们分析到了如下代码:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);
```

-Here we call `set_intr_gate` in the loop, which takes two parameters:
+这里循环内部调用了 `set_intr_gate`,它接受两个参数:

-* Number of an interrupt or `vector number`;
-* Address of the idt handler.
+* 中断号,即 `向量号`;
+* 中断处理程序的地址。

-and inserts an interrupt gate to the `IDT` table which is represented by the `&idt_descr` array. First of all let's look on the `early_idt_handler_array` array.
It is an array which is defined in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) header file contains addresses of the first `32` exception handlers: +同时,这个函数还会将中断门插入至 `IDT` 表中,代码中的 `&idt_descr` 数组即为 `IDT`。 首先让我们来看一下 `early_idt_handler_array` 数组,它定义在 [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) 头文件中,包含了前32个异常处理程序的地址: ```C #define EARLY_IDT_HANDLER_SIZE 9 @@ -163,11 +163,11 @@ and inserts an interrupt gate to the `IDT` table which is represented by the `&i extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE]; ``` -The `early_idt_handler_array` is `288` bytes array which contains address of exception entry points every nine bytes. Every nine bytes of this array consist of two bytes optional instruction for pushing dummy error code if an exception does not provide it, two bytes instruction for pushing vector number to the stack and five bytes of `jump` to the common exception handler code. +`early_idt_handler_array` 是一个大小为 `288` 字节的数组,每一项为 `9` 个字节,其中2个字节的备用指令用于向栈中压入默认错误码(如果异常本身没有提供错误码的话),2个字节的指令用于向栈中压入向量号,剩余5个字节用于跳转到异常处理程序。 -As we can see, We're filling only first 32 `IDT` entries in the loop, because all of the early setup runs with interrupts disabled, so there is no need to set up interrupt handlers for vectors greater than `32`. The `early_idt_handler_array` array contains generic idt handlers and we can find its definition in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file. For now we will skip it, but will look it soon. Before this we will look on the implementation of the `set_intr_gate` macro. 
+在上面的代码中,我们只通过一个循环向 `IDT` 中填入了前32项内容,这是因为在整个初期设置阶段,中断是禁用的。`early_idt_handler_array` 数组中的每一项指向的都是同一个通用中断处理程序,定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S)。我们先暂时跳过这个数组的内容,看一下 `set_intr_gate` 的定义。 -The `set_intr_gate` macro is defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) header file and looks: +`set_intr_gate` 宏定义在 [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h): ```C #define set_intr_gate(n, addr) \ @@ -180,7 +180,7 @@ The `set_intr_gate` macro is defined in the [arch/x86/include/asm/desc.h](https: } while (0) ``` -First of all it checks with that passed interrupt number is not greater than `255` with `BUG_ON` macro. We need to do this check because we can have only `256` interrupts. After this, it make a call of the `_set_gate` function which writes address of an interrupt gate to the `IDT`: +首先 `BUG_ON` 宏确保了传入的中断向量号不会大于255,因为我们最多只有 `256` 个中断。然后它调用了 `_set_gate` 函数,它会将中断门写入 `IDT`: ```C static inline void _set_gate(int gate, unsigned type, void *addr, @@ -193,7 +193,7 @@ static inline void _set_gate(int gate, unsigned type, void *addr, } ``` -At the start of `_set_gate` function we can see call of the `pack_gate` function which fills `gate_desc` structure with the given values: +在 `_set_gate` 函数的开始,它调用了 `pack_gate` 函数。这个函数会使用给定的参数填充 `gate_desc` 结构: ```C static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func, @@ -211,8 +211,7 @@ static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func, gate->offset_high = PTR_HIGH(func); } ``` - -As I mentioned above, we fill gate descriptor in this function. We fill three parts of the address of the interrupt handler with the address which we got in the main loop (address of the interrupt handler entry point). 
We are using three following macros to split address on three parts:

```C
#define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
@@ -220,9 +219,9 @@ As we can see, We fill three pa
#define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
```

-With the first `PTR_LOW` macro we get the first `2` bytes of the address, with the second `PTR_MIDDLE` we get the second `2` bytes of the address and with the third `PTR_HIGH` macro we get the last `4` bytes of the address. Next we setup the segment selector for interrupt handler, it will be our kernel code segment - `__KERNEL_CS`. In the next step we fill `Interrupt Stack Table` and `Descriptor Privilege Level` (highest privilege level) with zeros. And we set `GAT_INTERRUPT` type in the end.
+调用 `PTR_LOW` 可以得到x的低 `2` 个字节,调用 `PTR_MIDDLE` 可以得到x的中间 `2` 个字节,调用 `PTR_HIGH` 则能够得到x的高 `4` 个字节。接下来我们来为中断处理程序设置段选择子,即内核代码段 `__KERNEL_CS`。然后将 `Interrupt Stack Table` 和 `描述符特权等级` (最高特权等级)设置为0,以及在最后设置 `GATE_INTERRUPT` 类型。

-Now we have filled IDT entry and we can call `native_write_idt_entry` function which just copies filled `IDT` entry to the `IDT`:
+现在我们已经设置好了IDT中的一项,那么通过调用 `native_write_idt_entry` 函数来把它复制到 `IDT`:

```C
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
@@ -231,32 +230,32 @@ static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_
}
```

-After that main loop will finished, we will have filled `idt_table` array of `gate_desc` structures and we can load `Interrupt Descriptor table` with the call of the:
+主循环结束后,`idt_table` 就已经设置完毕了,其为一个 `gate_desc` 数组。然后我们就可以通过下面的代码加载 `中断描述符表`:

```C
load_idt((const struct desc_ptr *)&idt_descr);
```

-Where `idt_descr` is:
+其中,`idt_descr` 为:

```C
struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
```

-and `load_idt` just executes `lidt` instruction:
+`load_idt` 函数只是执行了一下 `lidt` 指令:

```C
asm
volatile("lidt %0"::"m" (*dtr)); ``` -You can note that there are calls of the `_trace_*` functions in the `_set_gate` and other functions. These functions fills `IDT` gates in the same manner that `_set_gate` but with one difference. These functions use `trace_idt_table` the `Interrupt Descriptor Table` instead of `idt_table` for tracepoints (we will cover this theme in the another part). +你可能已经注意到了,在代码中还有对 `_trace_*` 函数的调用。这些函数会用跟 `_set_gate` 同样的方法对 `IDT` 门进行设置,但仅有一处不同:这些函数并不设置 `idt_table`,而是 `trace_idt_table`,用于设置追踪点(tracepoint,我们将会在其他章节介绍这一部分)。 -Okay, now we have filled and loaded `Interrupt Descriptor Table`, we know how the CPU acts during an interrupt. So now time to deal with interrupts handlers. +好了,至此我们已经了解到,通过设置并加载 `中断描述符表`,能够让CPU在发生中断时做出相应的动作。下面让我们来看一下如何编写中断处理程序。 -Early interrupts handlers +初期中断处理程序 -------------------------------------------------------------------------------- -As you can read above, we filled `IDT` with the address of the `early_idt_handler_array`. We can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file: +在上面的代码中,我们用 `early_idt_handler_array` 的地址来填充了 `IDT`,这个 `early_idt_handler_array` 定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S): ```assembly .globl early_idt_handler_array @@ -273,7 +272,7 @@ early_idt_handlers: .endr ``` -We can see here, interrupt handlers generation for the first `32` exceptions. We check here, if exception has an error code then we do nothing, if exception does not return error code, we push zero to the stack. We do it for that would stack was uniform. After that we push exception number on the stack and jump on the `early_idt_handler_array` which is generic interrupt handler for now. As we may see above, every nine bytes of the `early_idt_handler_array` array consists from optional push of an error code, push of `vector number` and jump instruction. 
We can see it in the output of the `objdump` util:
+这段代码自动为前 `32` 个异常生成了中断处理程序。首先,为了统一栈的布局,如果一个异常没有返回错误码,那么我们就手动在栈中压入一个 `0`,然后再在栈中压入中断向量号,最后跳转至通用的中断处理程序 `early_idt_handler_common`。我们可以通过 `objdump` 命令的输出一探究竟:

```
$ objdump -D vmlinux
@@ -294,7 +293,7 @@ ffffffff81fe5014:	6a 02                	pushq  $0x2
...
```

-As i wrote above, CPU pushes flag register, `CS` and `RIP` on the stack. So before `early_idt_handler` will be executed, stack will contain following data:
+在中断发生时,CPU 会在栈上压入标志寄存器、`CS` 段寄存器和 `RIP` 寄存器的内容,因此在 `early_idt_handler` 执行前,栈的布局如下:

```
|--------------------|
@@ -305,14 +304,14 @@ As i wrote above, CPU pushes flag register, `CS` and `RIP` on the stack. So befo
|--------------------|
```

-Now let's look on the `early_idt_handler_common` implementation. It locates in the same [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) assembly file and first of all we can see check for [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). We don't need to handle it, so just ignore it in the `early_idt_handler_common`:
+下面我们来看一下 `early_idt_handler_common` 的实现。它也定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) 文件中。首先它会检查当前中断是否为 [不可屏蔽中断(NMI)](http://en.wikipedia.org/wiki/Non-maskable_interrupt),如果是则简单地忽略它们:

```assembly
cmpl $2,(%rsp)
je .Lis_nmi
```

-where `is_nmi`:
+其中 `is_nmi` 为:

```assembly
is_nmi:
@@ -320,7 +319,9 @@ is_nmi:
	INTERRUPT_RETURN
```

-drops an error code and vector number from the stack and call `INTERRUPT_RETURN` which is just expands to the `iretq` instruction.
As we checked the vector number and it is not `NMI`, we check `early_recursion_flag` to prevent recursion in the `early_idt_handler_common` and if it's correct we save general registers on the stack: +这段程序首先从栈顶弹出错误码和中断向量号,然后通过调用 `INTERRUPT_RETURN`,即 `iretq` 指令直接返回。 + +如果当前中断不是 `NMI`,则首先检查 `early_recursion_flag` 以避免在 `early_idt_handler_common` 程序中递归地产生中断。如果一切都没问题,就先在栈上保存通用寄存器,为了防止中断返回时寄存器的内容错乱: ```assembly pushq %rax @@ -334,16 +335,14 @@ drops an error code and vector number from the stack and call `INTERRUPT_RETURN` pushq %r11 ``` -We need to do it to prevent wrong values of registers when we return from the interrupt handler. After this we check segment selector in the stack: +然后我们检查栈上的段选择子: ```assembly cmpl $__KERNEL_CS,96(%rsp) jne 11f ``` -which must be equal to the kernel code segment and if it is not we jump on label `11` which prints `PANIC` message and makes stack dump. - -After the code segment was checked, we check the vector number, and if it is `#PF` or [Page Fault](https://en.wikipedia.org/wiki/Page_fault), we put value from the `cr2` to the `rdi` register and call `early_make_pgtable` (well see it soon): +段选择子必须为内核代码段,如果不是则跳转到标签 `11`,输出 `PANIC` 信息并打印栈的内容。然后我们来检查向量号,如果是 `#PF` 即 [缺页中断(Page Fault)](https://en.wikipedia.org/wiki/Page_fault),那么就把 `cr2` 寄存器中的值赋值给 `rdi`,然后调用 `early_make_pgtable` (详见后文): ```assembly cmpl $14,72(%rsp) @@ -354,8 +353,7 @@ After the code segment was checked, we check the vector number, and if it is `#P jz 20f ``` -If vector number is not `#PF`, we restore general purpose registers from the stack: - +如果向量号不是 `#PF`,那么就恢复通用寄存器: ```assembly popq %r11 popq %r10 @@ -368,16 +366,16 @@ If vector number is not `#PF`, we restore general purpose registers from the sta popq %rax ``` -and exit from the handler with `iret`. +并调用 `iret` 从中断处理程序返回。 -It is the end of the first interrupt handler. Note that it is very early interrupt handler, so it handles only Page Fault now. 
We will see handlers for the other interrupts, but now let's look on the page fault handler.
+第一个中断处理程序到这里就结束了。由于它只是一个初期中断处理程序,因此目前只处理缺页中断。其他中断的处理程序我们之后再进行分析,下面先来看一下缺页中断处理程序。

-Page fault handling
+缺页中断处理程序
--------------------------------------------------------------------------------

-In the previous paragraph we saw first early interrupt handler which checks interrupt number for page fault and calls `early_make_pgtable` for building new page tables if it is. We need to have `#PF` handler in this step because there are plans to add ability to load kernel above `4G` and make access to `boot_params` structure above the 4G.
+在上一节中我们第一次见到了初期中断处理程序,它检查了中断号是否为缺页中断,如果是就调用 `early_make_pgtable` 来建立新的页表。在这个阶段我们需要提供 `#PF` 中断处理程序,是为了之后能够把内核加载到 `4G` 以上的地址,并访问位于 `4G` 以上的 `boot_params` 结构体。

-You can find implementation of the `early_make_pgtable` in the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and takes one parameter - address from the `cr2` register, which caused Page Fault. Let's look on it:
+`early_make_pgtable` 的实现在 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c),它接受一个参数:从 `cr2` 寄存器得到的地址,也就是引发缺页中断的地址。下面让我们来看一下:

```C
int __init early_make_pgtable(unsigned long address)
@@ -393,60 +391,61 @@ int __init early_make_pgtable(unsigned long address)
}
```

-It starts from the definition of some variables which have `*val_t` types. All of these types are just:
+首先它定义了一些 `*val_t` 类型的变量。这些类型均为:

```C
typedef unsigned long pgdval_t;
```

-Also we will operate with the `*_t` (not val) types, for example `pgd_t` and etc...
All of these types defined in the [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h) and represent structures like this:
+此外,我们还会遇见 `*_t` (不带val)的类型,比如 `pgd_t`……这些类型都定义在 [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h),形式如下:

```C
typedef struct { pgdval_t pgd; } pgd_t;
```

-For example,
+例如,

```C
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
```

-Here `early_level4_pgt` presents early top-level page table directory which consists of an array of `pgd_t` types and `pgd` points to low-level page entries.
+在这里 `early_level4_pgt` 代表了初期顶层页表目录,它是一个 `pgd_t` 类型的数组,其中的 `pgd` 指向了下一级页表。

-After we made the check that we have no invalid address, we're getting the address of the Page Global Directory entry which contains `#PF` address and put it's value to the `pgd` variable:
+在确认不是非法地址后,我们取得页表中包含引起 `#PF` 中断的地址的那一项,将其赋值给 `pgd` 变量:

```C
pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;
```

-In the next step we check `pgd`, if it contains correct page global directory entry we put physical address of the page global directory entry and put it to the `pud_p` with:
+接下来我们检查一下 `pgd`,如果它包含了正确的全局页表项的话,我们就把这一项的物理地址处理后赋值给 `pud_p`:
+
```C
pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
```

-where `PTE_PFN_MASK` is a macro:
+其中 `PTE_PFN_MASK` 是一个宏:

```C
#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)
```

-which expands to:
+展开后将为:

```C
(~(PAGE_SIZE-1)) & ((1 << 46) - 1)
```

-or
+或者写为:

```
0b1111111111111111111111111111111111111111111111
```

-which is 46 bits to mask page frame.
+这是一个 46 位的掩码,用来取出物理页帧地址。

-If `pgd` does not contain correct address we check that `next_early_pgt` is not greater than `EARLY_DYNAMIC_PAGE_TABLES` which is `64` and present a fixed number of buffers to set up new page tables on demand.
If `next_early_pgt` is greater than `EARLY_DYNAMIC_PAGE_TABLES` we reset page tables and start again. If `next_early_pgt` is less than `EARLY_DYNAMIC_PAGE_TABLES`, we create new page upper directory pointer which points to the current dynamic page table and writes it's physical address with the `_KERPG_TABLE` access rights to the page global directory:
+如果 `pgd` 没有包含有效的地址,我们就检查 `next_early_pgt` 是否达到了 `EARLY_DYNAMIC_PAGE_TABLES`(即 `64`)。`EARLY_DYNAMIC_PAGE_TABLES` 是一组数量固定的缓冲区,用来在需要时按需建立新的页表。如果 `next_early_pgt` 已经达到 `EARLY_DYNAMIC_PAGE_TABLES`,我们就重置页表并重新开始;否则,我们就新建一个指向当前动态页表的上层页目录指针,并将它的物理地址连同 `_KERNPG_TABLE` 访问权限一起写入全局页目录表:

```C
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
@@ -460,30 +459,32 @@ for (i = 0; i < PTRS_PER_PUD; i++)
	*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
```

-After this we fix up address of the page upper directory with:
+然后我们来修正上层页目录的地址:

```C
pud_p += pud_index(address);
pud = *pud_p;
```

-In the next step we do the same actions as we did before, but with the page middle directory. In the end we fix address of the page middle directory which contains maps kernel text+data virtual addresses:
+下面我们对中层页目录重复上面同样的操作。最后,我们修正中层页目录的地址,其中包含了内核代码段和数据段虚拟地址的映射:

```C
pmd = (physaddr & PMD_MASK) + early_pmd_flags;
pmd_p[pmd_index(address)] = pmd;
```

-After page fault handler finished it's work and as result our `early_level4_pgt` contains entries which point to the valid addresses.
+到此缺页中断处理程序就完成了它所有的工作,此时 `early_level4_pgt` 就包含了指向合法地址的项。

-Conclusion
+小结
--------------------------------------------------------------------------------

-This is the end of the second part about linux kernel insides. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/MintCN/linux-insides-zh/issues/new).
In the next part we will see all steps before kernel entry point - `start_kernel` function.
+关于 Linux 内核内部原理的第二部分到此结束了。

-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+如果你有任何问题或建议,请在twitter上联系我 [0xAX](https://twitter.com/0xAX),或者通过[邮件](anotherworldofworld@gmail.com)与我沟通,还可以新开[issue](https://github.com/MintCN/linux-insides-zh/issues/new)。

-Links
+接下来我们将会看到进入内核入口点 `start_kernel` 函数之前剩下所有的准备工作。
+
+相关链接
--------------------------------------------------------------------------------

* [GNU assembly .rept](https://sourceware.org/binutils/docs-2.23/as/Rept.html)
diff --git a/Initialization/linux-initialization-3.md b/Initialization/linux-initialization-3.md
index 1c216f8..03c6208 100644
--- a/Initialization/linux-initialization-3.md
+++ b/Initialization/linux-initialization-3.md
@@ -1,21 +1,22 @@
-Kernel initialization. Part 3.
+内核初始化 第三部分
================================================================================

-Last preparations before the kernel entry point
+进入内核入口点之前最后的准备工作
--------------------------------------------------------------------------------

-This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/MintCN/linux-insides-zh/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling and will continue to dive into the linux kernel initialization process in the current part. Our next point is 'kernel entry point' - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. Yes, technically it is not kernel's entry point but the start of the generic kernel code which does not depend on certain architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue.
+这是 Linux 内核初始化过程的第三部分。在[上一个部分](https://github.com/MintCN/linux-insides-zh/blob/master/Initialization/linux-initialization-2.md) 中我们接触到了初期中断和异常处理,而在这个部分中我们要继续看一看 Linux 内核的初始化过程。我们接下来的目标是"内核入口点"—— [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 文件中的 `start_kernel` 函数。没错,从技术上说这并不是内核的入口点,只是不依赖于特定架构的通用内核代码的开始。不过,在我们调用 `start_kernel` 之前,有些准备必须要做。下面我们就来看一看。

boot_params again
--------------------------------------------------------------------------------

-In the previous part we stopped at setting Interrupt Descriptor Table and loading it in the `IDTR` register. At the next step after this we can see a call of the `copy_bootdata` function:
+在上一个部分中我们讲到了设置中断描述符表,并将其加载进 `IDTR` 寄存器。下一步是调用 `copy_bootdata` 函数:

```C
copy_bootdata(__va(real_mode_data));
```

-This function takes one argument - virtual address of the `real_mode_data`. Remember that we passed the address of the `boot_params` structure from [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L114) to the `x86_64_start_kernel` function as first argument in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):
+这个函数接受一个参数—— `real_mode_data` 的虚拟地址。还记得吗,我们在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) 中,把定义在 [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L114) 中的 `boot_params` 结构体的地址作为第一个参数传给了 `x86_64_start_kernel` 函数:

```
/* rsi is pointer to real mode structure with interesting info.
@@ -23,19 +24,19 @@ This function takes one argument - virtual address of the `real_mode_data`. Reme
movq	%rsi, %rdi
```

-Now let's look at `__va` macro.
This macro defined in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c):
+下面我们来看一看 `__va` 宏。 这个宏定义在 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c):

```C
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
```

-where `PAGE_OFFSET` is `__PAGE_OFFSET` which is `0xffff880000000000` and the base virtual address of the direct mapping of all physical memory. So we're getting virtual address of the `boot_params` structure and pass it to the `copy_bootdata` function, where we copy `real_mod_data` to the `boot_params` which is declared in the [arch/x86/kernel/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.h)
+其中 `PAGE_OFFSET` 就是 `__PAGE_OFFSET`(即 `0xffff880000000000`),也是所有物理内存直接映射区的虚拟基地址。因此我们就得到了 `boot_params` 结构体的虚拟地址,并把它传入 `copy_bootdata` 函数中。在这个函数里我们把 `real_mod_data` 拷贝进声明在 [arch/x86/kernel/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.h) 中的 `boot_params`:

```C
extern struct boot_params boot_params;
```

-Let's look at the `copy_boot_data` implementation:
+`copy_boot_data` 的实现如下:

```C
static void __init copy_bootdata(char *real_mode_data)
@@ -53,9 +54,9 @@ This function takes one argument - virtual address of the `real_mode_data`. Reme
}
```

-First of all, note that this function is declared with `__init` prefix. It means that this function will be used only during the initialization and used memory will be freed.
+首先,这个函数的声明中有一个 `__init` 前缀,这表示这个函数只在初始化阶段使用,并且它所使用的内存将会被释放。

-We can see declaration of two variables for the kernel command line and copying `real_mode_data` to the `boot_params` with the `memcpy` function. The next call of the `sanitize_boot_params` function which fills some fields of the `boot_params` structure like `ext_ramdisk_image` and etc... if bootloaders which fail to initialize unknown fields in `boot_params` to zero.
After this we're getting address of the command line with the call of the `get_cmd_line_ptr` function: +在这个函数中首先声明了两个用于解析内核命令行的变量,然后使用`memcpy` 函数将 `real_mode_data` 拷贝进 `boot_params`。如果系统引导工具(bootloader)没能正确初始化 `boot_params` 中的某些成员的话,那么在接下来调用的 `sanitize_boot_params` 函数中将会对这些成员进行清零,比如 `ext_ramdisk_image` 等。此后我们通过调用 `get_cmd_line_ptr` 函数来得到命令行的地址: ```C unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr; @@ -63,26 +64,26 @@ cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32; return cmd_line_ptr; ``` -which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check `cmd_line_ptr`, getting its virtual address and copy it to the `boot_command_line` which is just an array of bytes: +`get_cmd_line_ptr` 函数将会从 `boot_params` 中获得命令行的64位地址并返回。最后,我们检查一下是否正确获得了 `cmd_line_ptr`,并把它的虚拟地址拷贝到一个字节数组 `boot_command_line` 中: ```C extern char __initdata boot_command_line[]; ``` -After this we will have copied kernel command line and `boot_params` structure. In the next step we can see call of the `load_ucode_bsp` function which loads processor microcode, but we will not see it here. +这一步完成之后,我们就得到了内核命令行和 `boot_params` 结构体。之后,内核通过调用 `load_ucode_bsp` 函数来加载处理器微代码(microcode),不过我们目前先暂时忽略这一步。 -After microcode was loaded we can see the check of the `console_loglevel` and the `early_printk` function which prints `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and i sent the patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) and you will see it in the mainline soon. So you can skip this code. 
+微代码加载之后,内核会对 `console_loglevel` 进行检查,同时通过 `early_printk` 函数来打印出字符串 `Kernel Alive`。不过这个输出不会真的被显示出来,因为这个时候 `early_printk` 还没有被初始化。这是目前内核中的一个小bug,作者已经提交了补丁 [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2),补丁很快就能应用在主分支中了。所以你可以先跳过这段代码。

-Move on init pages
+初始化内存页
--------------------------------------------------------------------------------

-In the next step, as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call:
+至此,我们已经拷贝了 `boot_params` 结构体,接下来需要从初期页表切换到供初始化过程使用的页表。我们之前已经初始化了用于切换的初期页表,这在之前的[部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html)中已经讨论过;随后又通过 `reset_early_page_tables` 函数将初期页表中的大部分项清零(在之前的部分也有介绍),只保留了内核高地址的映射。然后我们调用:

```C
clear_page(init_level4_pgt);
```

-function and pass `init_level4_pgt` which also defined in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and looks:
+`init_level4_pgt` 同样定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):

```assembly
NEXT_PAGE(init_level4_pgt)
@@ -93,7 +94,7 @@ NEXT_PAGE(init_level4_pgt)
	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
```

-which maps first 2 gigabytes and 512 megabytes for the kernel code, data and bss.
`clear_page` function defined in the [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S) let's look on this function:
+这段代码映射了前 2G 字节的内存,以及供内核代码段、数据段和 bss 段使用的 512M 字节。`clear_page` 函数定义在 [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S):

```assembly
ENTRY(clear_page)
@@ -121,30 +122,30 @@ ENTRY(clear_page)
ENDPROC(clear_page)
```

-As you can understand from the function name it clears or fills with zeros page tables. First of all note that this function starts with the `CFI_STARTPROC` and `CFI_ENDPROC` which are expands to GNU assembly directives:
+顾名思义,这个函数会将页表清零。这个函数的开始和结束部分有两个宏 `CFI_STARTPROC` 和 `CFI_ENDPROC`,它们会展开成 GNU 汇编指令,用于调试:

```C
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
```

-and used for debugging. After `CFI_STARTPROC` macro we zero out `eax` register and put 64 to the `ecx` (it will be a counter). Next we can see loop which starts with the `.Lloop` label and it starts from the `ecx` decrement. After it we put zero from the `rax` register to the `rdi` which contains the base address of the `init_level4_pgt` now and do the same procedure seven times but every time move `rdi` offset on 8. After this we will have first 64 bytes of the `init_level4_pgt` filled with zeros. In the next step we put the address of the `init_level4_pgt` with 64-bytes offset to the `rdi` again and repeat all operations until `ecx` reaches zero. In the end we will have `init_level4_pgt` filled with zeros.
+在 `CFI_STARTPROC` 之后我们将 `eax` 寄存器清零,并将 `ecx` 赋值为 64(用作计数器)。接下来从 `.Lloop` 标签开始循环,首先就是将 `ecx` 减一。然后将 `rax` 中的值(目前为0)写入 `rdi` 指向的地址,`rdi` 中保存的是 `init_level4_pgt` 的基地址。接下来重复7次这个步骤,但是每次都相对 `rdi` 多偏移8个字节。之后 `init_level4_pgt` 的前64个字节就都被填充为0了。接下来我们将 `rdi` 中的值加上64,重复这个步骤,直到 `ecx` 减至0。最后就完成了将 `init_level4_pgt` 填零。 -As we have `init_level4_pgt` filled with zeros, we set the last `init_level4_pgt` entry to kernel high mapping with the: +在将 `init_level4_pgt` 填0之后,再把它的最后一项设置为内核高地址的映射: ```C init_level4_pgt[511] = early_level4_pgt[511]; ``` -Remember that we dropped all `early_level4_pgt` entries in the `reset_early_page_table` function and kept only kernel high mapping there. +在前面我们已经使用 `reset_early_page_table` 函数清除 `early_level4_pgt` 中的大部分项,而只保留内核高地址的映射。 -The last step in the `x86_64_start_kernel` function is the call of the: +`x86_64_start_kernel` 函数的最后一步是调用: ```C x86_64_start_reservations(real_mode_data); ``` -function with the `real_mode_data` as argument. The `x86_64_start_reservations` function defined in the same source code file as the `x86_64_start_kernel` function and looks: +并传入 `real_mode_data` 参数。 `x86_64_start_reservations` 函数与 `x86_64_start_kernel` 函数定义在同一个文件中: ```C void __init x86_64_start_reservations(char *real_mode_data) @@ -158,43 +159,43 @@ void __init x86_64_start_reservations(char *real_mode_data) } ``` -You can see that it is the last function before we are in the kernel entry point - `start_kernel` function. Let's look what it does and how it works. 
+这就是进入内核入口点之前的最后一个函数了。下面我们就来介绍一下这个函数。

-Last step before kernel entry point
+内核入口点前的最后一步
--------------------------------------------------------------------------------

-First of all we can see in the `x86_64_start_reservations` function the check for `boot_params.hdr.version`:
+在 `x86_64_start_reservations` 函数中首先检查了 `boot_params.hdr.version`:

```C
if (!boot_params.hdr.version)
	copy_bootdata(__va(real_mode_data));
```

-and if it is zero we call `copy_bootdata` function again with the virtual address of the `real_mode_data` (read about about it's implementation).
+如果它为0,则再次调用 `copy_bootdata`,并传入 `real_mode_data` 的虚拟地址。

-In the next step we can see the call of the `reserve_ebda_region` function which defined in the [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c). This function reserves memory block for the `EBDA` or Extended BIOS Data Area. The Extended BIOS Data Area located in the top of conventional memory and contains data about ports, disk parameters and etc...
+接下来则调用了 `reserve_ebda_region` 函数,它定义在 [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c)。这个函数为 `EBDA`(即Extended BIOS Data Area,扩展BIOS数据区域)预留空间。扩展BIOS数据区域位于常规内存顶部(译注:常规内存(Conventional Memory)是指前640K字节内存),包含了端口、磁盘参数等数据。

-Let's look on the `reserve_ebda_region` function. It starts from the checking is paravirtualization enabled or not:
+接下来我们来看一下 `reserve_ebda_region` 函数。它首先会检查是否启用了半虚拟化:

```C
if (paravirt_enabled())
	return;
```

-we exit from the `reserve_ebda_region` function if paravirtualization is enabled because if it enabled the extended bios data area is absent.
In the next step we need to get the end of the low memory:
+如果开启了半虚拟化,那么就退出 `reserve_ebda_region` 函数,因为此时没有扩展BIOS数据区域。下面我们首先得到低地址内存的末尾地址:
 
```C
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
```
 
-We're getting the virtual address of the BIOS low memory in kilobytes and convert it to bytes with shifting it on 10 (multiply on 1024 in other words). After this we need to get the address of the extended BIOS data are with the:
+首先我们得到了BIOS低地址内存的虚拟地址,以KB为单位,然后将其左移10位(即乘以1024)转换为以字节为单位。然后我们需要获得扩展BIOS数据区域的地址:
 
```C
ebda_addr = get_bios_ebda();
```
 
-where `get_bios_ebda` function defined in the [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bios_ebda.h) and looks like:
+其中, `get_bios_ebda` 函数定义在 [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bios_ebda.h):
 
```C
static inline unsigned int get_bios_ebda(void)
@@ -205,7 +206,7 @@ static inline unsigned int get_bios_ebda(void)
}
```
 
-Let's try to understand how it works. Here we can see that we converting physical address `0x40E` to the virtual, where `0x0040:0x000e` is the segment which contains base address of the extended BIOS data area. Don't worry that we are using `phys_to_virt` function for converting a physical address to virtual address. 
You can note that previously we have used `__va` macro for the same point, but `phys_to_virt` is the same:
+下面我们来尝试理解一下这段代码。这段代码中,首先我们将物理地址 `0x40E` 转换为虚拟地址;`0x0040:0x000e` 这个段中保存着扩展BIOS数据区域的基地址。这里我们使用了 `phys_to_virt` 函数进行地址转换,而不是之前使用的 `__va` 宏。不过,事实上它们两个基本上是一样的:
 
```C
static inline void *phys_to_virt(phys_addr_t address)
@@ -214,7 +215,7 @@ static inline void *phys_to_virt(phys_addr_t address)
}
```
 
-only with one difference: we pass argument with the `phys_addr_t` which depends on `CONFIG_PHYS_ADDR_T_64BIT`:
+而不同之处在于,`phys_to_virt` 函数的参数类型 `phys_addr_t` 的定义依赖于 `CONFIG_PHYS_ADDR_T_64BIT`:
 
```C
#ifdef CONFIG_PHYS_ADDR_T_64BIT
@@ -224,9 +225,9 @@ only with one difference: we pass argument with the `phys_addr_t` which depends
#endif
```
 
-This configuration option is enabled by `CONFIG_PHYS_ADDR_T_64BIT`. After that we got virtual address of the segment which stores the base address of the extended BIOS data area, we shift it on 4 and return. After this `ebda_addr` variables contains the base address of the extended BIOS data area.
+具体的类型是由 `CONFIG_PHYS_ADDR_T_64BIT` 配置选项控制的。此后我们从这个虚拟地址处读出保存着扩展BIOS数据区域基地址的段值,把它左移4位后返回。这样,`ebda_addr` 变量就包含了扩展BIOS数据区域的基地址。
 
-In the next step we check that address of the extended BIOS data area and low memory is not less than `INSANE_CUTOFF` macro
+下一步我们来检查扩展BIOS数据区域与低地址内存的地址,看一看它们是否小于 `INSANE_CUTOFF` 宏:
 
```C
if (ebda_addr < INSANE_CUTOFF)
@@ -236,13 +237,13 @@ if (lowmem < INSANE_CUTOFF)
	lowmem = LOWMEM_CAP;
```
 
-which is:
+`INSANE_CUTOFF` 为:
 
```C
#define INSANE_CUTOFF 0x20000U
```
 
-or 128 kilobytes. In the last step we get lower part in the low memory and extended bios data area and call `memblock_reserve` function which will reserve memory region for extended bios data between low memory and one megabyte mark:
+即 128 KB。
上一步我们得到了低地址内存中的低地址部分以及扩展BIOS数据区域,然后调用 `memblock_reserve` 函数来在低内存地址与1MB之间为扩展BIOS数据预留内存区域。
 
```C
lowmem = min(lowmem, ebda_addr);
@@ -250,36 +251,36 @@ lowmem = min(lowmem, LOWMEM_CAP);
memblock_reserve(lowmem, 0x100000 - lowmem);
```
 
-`memblock_reserve` function is defined at [mm/block.c](https://github.com/torvalds/linux/blob/master/mm/block.c) and takes two parameters:
+`memblock_reserve` 函数定义在 [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c),它接受两个参数:
 
-* base physical address;
-* region size.
+* 物理基地址
+* 区域大小
 
-and reserves memory region for the given base address and size. `memblock_reserve` is the first function in this book from linux kernel memory manager framework. We will take a closer look on memory manager soon, but now let's look at its implementation.
+然后在给定的基地址处预留指定大小的内存。`memblock_reserve` 是在这本书中我们接触到的第一个Linux内核内存管理框架中的函数。我们很快会详细地介绍内存管理,不过现在还是先来看一看这个函数的实现。
 
-First touch of the linux kernel memory manager framework
+Linux内核内存管理框架初探
--------------------------------------------------------------------------------
 
-In the previous paragraph we stopped at the call of the `memblock_reserve` function and as i sad before it is the first function from the memory manager framework. Let's try to understand how it works. `memblock_reserve` function just calls:
+在上一段中我们遇到了对 `memblock_reserve` 函数的调用。现在我们来尝试理解一下这个函数是如何工作的。 `memblock_reserve` 函数只是调用了:
 
```C
memblock_reserve_region(base, size, MAX_NUMNODES, 0);
```
 
-function and passes 4 parameters there:
+`memblock_reserve_region` 接受四个参数:
 
-* physical base address of the memory region;
-* size of the memory region;
-* maximum number of numa nodes;
-* flags. 
+* 内存区域的物理基地址
+* 内存区域的大小
+* 最大 NUMA 节点数
+* 标志参数 flags
 
-At the start of the `memblock_reserve_region` body we can see definition of the `memblock_type` structure:
+在 `memblock_reserve_region` 函数一开始,就是一个 `memblock_type` 结构体类型的变量:
 
```C
struct memblock_type *_rgn = &memblock.reserved;
```
 
-which presents the type of the memory block and looks:
+`memblock_type` 类型代表了一块内存,定义如下:
 
```C
struct memblock_type {
@@ -290,7 +291,7 @@ struct memblock_type {
};
```
 
-As we need to reserve memory block for extended bios data area, the type of the current memory region is reserved where `memblock` structure is:
+因为我们要为扩展BIOS数据区域预留内存块,所以当前内存区域的类型就是预留。`memblock` 结构体的定义为:
 
```C
struct memblock {
@@ -304,7 +305,7 @@ struct memblock {
};
```
 
-and describes generic memory block. You can see that we initialize `_rgn` by assigning it to the address of the `memblock.reserved`. `memblock` is the global variable which looks:
+它描述了一块通用的内存块。我们用 `memblock.reserved` 的值来初始化 `_rgn`。`memblock` 全局变量定义如下:
 
```C
struct memblock memblock __initdata_memblock = {
@@ -324,27 +325,27 @@ struct memblock memblock __initdata_memblock = {
};
```
 
-We will not dive into detail of this variable, but we will see all details about it in the parts about memory manager. Just note that `memblock` variable defined with the `__initdata_memblock` which is:
+我们现在不会继续深究这个变量,但在内存管理部分中我们会详细地对它进行介绍。需要注意的是,这个变量的声明中使用了 `__initdata_memblock`:
 
```C
#define __initdata_memblock __meminitdata
```
 
-and `__meminit_data` is:
+而 `__meminit_data` 为:
 
```C
#define __meminitdata __section(.meminit.data)
```
 
-From this we can conclude that all memory blocks will be in the `.meminit.data` section. After we defined `_rgn` we print information about it with `memblock_dbg` macros. You can enable it by passing `memblock=debug` to the kernel command line. 
+自此我们得出这样的结论:所有的内存块都将定义在 `.meminit.data` 区段中。在我们定义了 `_rgn` 之后,使用了 `memblock_dbg` 宏来输出相关的信息。你可以从内核命令行传入参数 `memblock=debug` 来开启这些输出。
 
-After debugging lines were printed next is the call of the following function:
+在输出了这些调试信息后,是对下面这个函数的调用:
 
```C
memblock_add_range(_rgn, base, size, nid, flags);
```
 
-which adds new memory block region into the `.meminit.data` section. As we do not initialize `_rgn` but it just contains `&memblock.reserved`, we just fill passed `_rgn` with the base address of the extended BIOS data area region, size of this region and flags:
+它向 `.meminit.data` 区段添加了一个新的内存块区域。由于 `_rgn` 的值是 `&memblock.reserved`,下面的代码就直接将扩展BIOS数据区域的基地址、大小和标志填入 `_rgn` 中:
 
```C
if (type->regions[0].size == 0) {
@@ -358,12 +359,12 @@ if (type->regions[0].size == 0) {
}
```
 
-After we filled our region we can see the call of the `memblock_set_region_node` function with two parameters:
+在填充好了区域后,接着是对 `memblock_set_region_node` 函数的调用。它接受两个参数:
 
-* address of the filled memory region;
-* NUMA node id.
+* 填充好的内存区域的地址
+* NUMA节点ID
 
-where our regions represented by the `memblock_region` structure:
+其中我们的区域由 `memblock_region` 结构体来表示:
 
```C
struct memblock_region {
@@ -376,13 +377,13 @@ struct memblock_region {
};
```
 
-NUMA node id depends on `MAX_NUMNODES` macro which is defined in the [include/linux/numa.h](https://github.com/torvalds/linux/blob/master/include/linux/numa.h):
+NUMA节点ID依赖于 `MAX_NUMNODES` 宏,它定义在 [include/linux/numa.h](https://github.com/torvalds/linux/blob/master/include/linux/numa.h):
 
```C
#define MAX_NUMNODES (1 << NODES_SHIFT)
```
 
-where `NODES_SHIFT` depends on `CONFIG_NODES_SHIFT` configuration parameter and defined as:
+其中 `NODES_SHIFT` 依赖于 `CONFIG_NODES_SHIFT` 配置参数,定义如下:
 
```C
#ifdef CONFIG_NODES_SHIFT
@@ -392,7 +393,7 @@ where `NODES_SHIFT` depends on `CONFIG_NODES_SHIFT` configuration parameter and
#endif
```
 
-`memblick_set_region_node` function just fills `nid` field from `memblock_region` with the given value:
+`memblock_set_region_node` 函数只是填充了 `memblock_region` 中的 
`nid` 成员: ```C static inline void memblock_set_region_node(struct memblock_region *r, int nid) @@ -401,28 +402,24 @@ static inline void memblock_set_region_node(struct memblock_region *r, int nid) } ``` -After this we will have first reserved `memblock` for the extended bios data area in the `.meminit.data` section. `reserve_ebda_region` function finished its work on this step and we can go back to the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). +在这之后我们就在 `.meminit.data` 区段拥有了为扩展BIOS数据区域预留的第一个 `memblock`。`reserve_ebda_region` 已经完成了它该做的任务,我们回到 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) 继续。 -We finished all preparations before the kernel entry point! The last step in the `x86_64_start_reservations` function is the call of the: +至此我们已经结束了进入内核之前所有的准备工作。`x86_64_start_reservations` 的最后一步是调用 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 中的: ```C start_kernel() ``` -function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) file. +这一部分到此结束。 -That's all for this part. - -Conclusion +小结 -------------------------------------------------------------------------------- -It is the end of the third part about linux kernel insides. In next part we will see the first initialization steps in the kernel entry point - `start_kernel` function. It will be the first step before we will see launch of the first `init` process. +本书的第三部分到这里就结束了。在下一部分中,我们将会见到内核入口点处的初始化工作 —— 位于 `start_kernel` 函数中。这些工作是在启动第一个进程 `init` 之前首先要完成的工作。 -If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). +如果你有任何问题或建议,请在twitter上联系我 [0xAX](https://twitter.com/0xAX),或者通过[邮件](anotherworldofworld@gmail.com)与我沟通,还可以新开[issue](https://github.com/MintCN/linux-insides-zh/issues/new)。 -**Please note that English is not my first language, And I am really sorry for any inconvenience. 
If you find any mistakes please send me PR to [linux-insides](https://github.com/MintCN/linux-insides-zh).** - -Links +相关链接 -------------------------------------------------------------------------------- * [BIOS data area](http://stanislavs.org/helppc/bios_data_area.html) From 879548bbfb04df68592ed63688d89dc48608901e Mon Sep 17 00:00:00 2001 From: xinqiu Date: Thu, 11 May 2017 09:32:00 +0800 Subject: [PATCH 14/21] =?UTF-8?q?=E4=BF=AE=E5=A4=8D=E4=BA=86=E4=B8=80?= =?UTF-8?q?=E4=BA=9B=E6=A0=BC=E5=BC=8F=E4=B8=8A=E7=9A=84=E9=94=99=E8=AF=AF?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- Initialization/linux-initialization-2.md | 52 ++++++++++++------------ 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/Initialization/linux-initialization-2.md b/Initialization/linux-initialization-2.md index 41b738c..0bc6002 100644 --- a/Initialization/linux-initialization-2.md +++ b/Initialization/linux-initialization-2.md @@ -4,9 +4,9 @@ 初期中断和异常处理 -------------------------------------------------------------------------------- -在上一个 [部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) 我们谈到了初期中断初始化。目前我们已经处于解压缩后的Linux内核中了,还有了用于初期启动的基本的[分页](https://en.wikipedia.org/wiki/Page_table)机制。我们的目标是在内核的主体代码执行前做好准备工作。 +在上一个 [部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) 我们谈到了初期中断初始化。目前我们已经处于解压缩后的Linux内核中了,还有了用于初期启动的基本的 [分页](https://en.wikipedia.org/wiki/Page_table) 机制。我们的目标是在内核的主体代码执行前做好准备工作。 -我们已经在[本章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html)的[第一部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html)做了一些工作,在这一部分中我们会继续分析关于中断和异常处理部分的代码。 +我们已经在 [本章](https://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/index.html) 的 [第一部分](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-1.html) 
做了一些工作,在这一部分中我们会继续分析关于中断和异常处理部分的代码。 我们在上一部分谈到了下面这个循环: @@ -20,19 +20,19 @@ for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) 理论 -------------------------------------------------------------------------------- -中断是一种由软件或硬件产生的、向CPU发出的事件。例如,如果用户按下了键盘上的一个按键时,就会产生中断。此时CPU将会暂停当前的任务,并且将控制流转到特殊的程序中——[中断处理程序(Interrupt Handler)](https://en.wikipedia.org/wiki/Interrupt_handler)。一个中断处理程序会对中断进行处理,然后将控制权交还给之前暂停的任务中。中断分为三类: +中断是一种由软件或硬件产生的、向CPU发出的事件。例如,如果用户按下了键盘上的一个按键时,就会产生中断。此时CPU将会暂停当前的任务,并且将控制流转到特殊的程序中—— [中断处理程序(Interrupt Handler)](https://en.wikipedia.org/wiki/Interrupt_handler)。一个中断处理程序会对中断进行处理,然后将控制权交还给之前暂停的任务中。中断分为三类: * 软件中断 - 当一个软件可以向CPU发出信号,表明它需要系统内核的相关功能时产生。这些中断通常用于系统调用; * 硬件中断 - 当一个硬件有任何事件发生时产生,例如键盘的按键被按下; * 异常 - 当CPU检测到错误时产生,例如发生了除零错误或者访问了一个不存在的内存页。 -每一个中断和异常都可以由一个数来表示,这个数叫做`向量号`,它可以取从 `0` 到 `255` 中的任何一个数。通常在实践中前 `32` 个向量号用来表示异常,`32` 到 `255` 用来表示用户定义的中断。可以看到在上面的代码中,`NUM_EXCEPTION_VECTORS` 就定义为: +每一个中断和异常都可以由一个数来表示,这个数叫做 `向量号` ,它可以取从 `0` 到 `255` 中的任何一个数。通常在实践中前 `32` 个向量号用来表示异常,`32` 到 `255` 用来表示用户定义的中断。可以看到在上面的代码中,`NUM_EXCEPTION_VECTORS` 就定义为: ```C #define NUM_EXCEPTION_VECTORS 32 ``` -CPU会从[APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)或者CPU引脚接收中断,并使用中断向量号作为 `中断描述符表` 的索引。下面的表中列出了 `0-31` 号异常: +CPU会从 [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) 或者 CPU 引脚接收中断,并使用中断向量号作为 `中断描述符表` 的索引。下面的表中列出了 `0-31` 号异常: ``` ---------------------------------------------------------------------------------------------- @@ -84,9 +84,9 @@ CPU会从[APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Con ---------------------------------------------------------------------------------------------- ``` -为了能够对中断进行处理,CPU使用了一种特殊的结构 - 中断描述符表(IDT)。IDT是一个由描述符组成的数组,其中每个描述符都为8个字节,与全局描述附表一致;不过不同的是,我们把IDT中的每一项叫做`门(gate)`。为了获得某一项描述符的起始地址,CPU会把向量号乘以8,在64位模式中则会乘以16。在前面我们已经见过,CPU使用一个特殊的 `GDTR` 寄存器来存放全局描述符表的地址,中断描述符表也有一个类似的寄存器 `IDTR`,同时还有用于将基地址加载入这个寄存器的指令 `lidt`。 +为了能够对中断进行处理,CPU使用了一种特殊的结构 - 中断描述符表(IDT)。IDT 
是一个由描述符组成的数组,其中每个描述符都为8个字节,与全局描述附表一致;不过不同的是,我们把IDT中的每一项叫做 `门(gate)` 。为了获得某一项描述符的起始地址,CPU 会把向量号乘以8,在64位模式中则会乘以16。在前面我们已经见过,CPU使用一个特殊的 `GDTR` 寄存器来存放全局描述符表的地址,中断描述符表也有一个类似的寄存器 `IDTR` ,同时还有用于将基地址加载入这个寄存器的指令 `lidt` 。 -64位模式下IDT的每一项的结构如下: +64位模式下 IDT 的每一项的结构如下: ``` 127 96 @@ -129,9 +129,9 @@ CPU会从[APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Con * 中断描述符 * 陷阱描述符 -中断和陷阱描述符包含了一个指向中断处理程序的远(far)指针,二者唯一的不同在于CPU处理 `IF` 标志的方式。如果是由中断门进入中断处理程序的,CPU会清除 `IF` 标志位,这样当当前中断处理程序执行时,CPU不会对其他的中断进行处理;只有当当前的中断处理程序返回时,CPU 才在 `iret` 指令执行时重新设置 `IF` 标志位。 +中断和陷阱描述符包含了一个指向中断处理程序的远 (far) 指针,二者唯一的不同在于CPU处理 `IF` 标志的方式。如果是由中断门进入中断处理程序的,CPU 会清除 `IF` 标志位,这样当当前中断处理程序执行时,CPU 不会对其他的中断进行处理;只有当当前的中断处理程序返回时,CPU 才在 `iret` 指令执行时重新设置 `IF` 标志位。 -中断门的其他位为保留位,必须为0。下面我们来看一下CPU是如何处理中断的: +中断门的其他位为保留位,必须为0。下面我们来看一下 CPU 是如何处理中断的: * CPU 会在栈上保存标志寄存器、`cs`段寄存器和程序计数器IP; * 如果中断是由错误码引起的(比如 `#PF`), CPU会在栈上保存错误码; @@ -149,7 +149,7 @@ for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) set_intr_gate(i, early_idt_handler_array[i]); ``` -这里循环内部调用了 `set_intr_gate`,它接受两个参数: +这里循环内部调用了 `set_intr_gate` ,它接受两个参数: * 中断号,即 `向量号`; * 中断处理程序的地址。 @@ -165,7 +165,7 @@ extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDL `early_idt_handler_array` 是一个大小为 `288` 字节的数组,每一项为 `9` 个字节,其中2个字节的备用指令用于向栈中压入默认错误码(如果异常本身没有提供错误码的话),2个字节的指令用于向栈中压入向量号,剩余5个字节用于跳转到异常处理程序。 -在上面的代码中,我们只通过一个循环向 `IDT` 中填入了前32项内容,这是因为在整个初期设置阶段,中断是禁用的。`early_idt_handler_array` 数组中的每一项指向的都是同一个通用中断处理程序,定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S)。我们先暂时跳过这个数组的内容,看一下 `set_intr_gate` 的定义。 +在上面的代码中,我们只通过一个循环向 `IDT` 中填入了前32项内容,这是因为在整个初期设置阶段,中断是禁用的。`early_idt_handler_array` 数组中的每一项指向的都是同一个通用中断处理程序,定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) 。我们先暂时跳过这个数组的内容,看一下 `set_intr_gate` 的定义。 `set_intr_gate` 宏定义在 [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h): @@ -219,7 +219,7 @@ static 
inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
 #define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
 ```
 
-调用 `PTR_LOW` 可以得到x的低 `2` 个字节,调用 `PTR_MIDDLE` 可以得到x的中间 `2` 个字节,调用 `PTR_HIGH` 则能够得到x的高 `4` 个字节。接下来我们来位中断处理程序设置段选择子,即内核代码段 `__KERNEL_CS`。然后将 `Interrupt Stack Table` 和 `描述符特权等级` (最高特权等级)设置为0,以及在最后设置 `GAT_INTERRUPT` 类型。
+调用 `PTR_LOW` 可以得到 x 的低 `2` 个字节,调用 `PTR_MIDDLE` 可以得到 x 的中间 `2` 个字节,调用 `PTR_HIGH` 则能够得到 x 的高 `4` 个字节。接下来我们来为中断处理程序设置段选择子,即内核代码段 `__KERNEL_CS`。然后将 `Interrupt Stack Table` 和 `描述符特权等级` (最高特权等级)设置为0,以及在最后设置 `GATE_INTERRUPT` 类型。
 
 现在我们已经设置好了IDT中的一项,那么通过调用 `native_write_idt_entry` 函数来把它复制到 `IDT`:
 
@@ -248,14 +248,14 @@ struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
 asm volatile("lidt %0"::"m" (*dtr));
 ```
 
-你可能已经注意到了,在代码中还有对 `_trace_*` 函数的调用。这些函数会用跟 `_set_gate` 同样的方法对 `IDT` 门进行设置,但仅有一处不同:这些函数并不设置 `idt_table`,而是 `trace_idt_table`,用于设置追踪点(tracepoint,我们将会在其他章节介绍这一部分)。
+你可能已经注意到了,在代码中还有对 `_trace_*` 函数的调用。这些函数会用跟 `_set_gate` 同样的方法对 `IDT` 门进行设置,但仅有一处不同:这些函数并不设置 `idt_table` ,而是 `trace_idt_table` ,用于设置追踪点(tracepoint,我们将会在其他章节介绍这一部分)。
 
-好了,至此我们已经了解到,通过设置并加载 `中断描述符表`,能够让CPU在发生中断时做出相应的动作。下面让我们来看一下如何编写中断处理程序。
+好了,至此我们已经了解到,通过设置并加载 `中断描述符表` ,能够让CPU在发生中断时做出相应的动作。下面让我们来看一下如何编写中断处理程序。
 
 初期中断处理程序
 --------------------------------------------------------------------------------
 
-在上面的代码中,我们用 `early_idt_handler_array` 的地址来填充了 `IDT`,这个 `early_idt_handler_array` 定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):
+在上面的代码中,我们用 `early_idt_handler_array` 的地址来填充了 `IDT` ,这个 `early_idt_handler_array` 定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):
 
```assembly
.globl early_idt_handler_array
@@ -272,7 +272,7 @@ early_idt_handlers:
.endr
```
 
-这段代码自动生成为前 `32` 个异常生成了中断处理程序。首先,为了统一栈的布局,如果一个异常没有返回错误码,那么我们就手动在栈中压入一个 `0`。然后再在栈中压入中断向量号,最后跳转至通用的中断处理程序 `early_idt_handler_common`。我们可以通过 `objdump` 命令的输出一探究竟:
+这段代码自动地为前 `32` 
个异常生成了中断处理程序。首先,为了统一栈的布局,如果一个异常没有返回错误码,那么我们就手动在栈中压入一个 `0`。然后再在栈中压入中断向量号,最后跳转至通用的中断处理程序 `early_idt_handler_common` 。我们可以通过 `objdump` 命令的输出一探究竟: ``` $ objdump -D vmlinux @@ -293,7 +293,7 @@ ffffffff81fe5014: 6a 02 pushq $0x2 ... ``` -由于在中断发生时,CPU会在栈上压入标志寄存器、`CS` 段寄存器和 `RIP` 寄存器的内容。因此在 `early_idt_handler` 执行前,栈的布局如下: +由于在中断发生时,CPU 会在栈上压入标志寄存器、`CS` 段寄存器和 `RIP` 寄存器的内容。因此在 `early_idt_handler` 执行前,栈的布局如下: ``` |--------------------| @@ -304,7 +304,7 @@ ffffffff81fe5014: 6a 02 pushq $0x2 |--------------------| ``` -下面我们来看一下 `early_idt_handler_common` 的实现。它也定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) 文件中。首先它会检查当前中断是否为 [不可屏蔽中断(NMI)](http://en.wikipedia.org/wiki/Non-maskable_interrupt),如果是则简单地忽略它们: +下面我们来看一下 `early_idt_handler_common` 的实现。它也定义在 [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) 文件中。首先它会检查当前中断是否为 [不可屏蔽中断(NMI)](http://en.wikipedia.org/wiki/Non-maskable_interrupt),如果是则简单地忽略它们: ```assembly cmpl $2,(%rsp) @@ -319,9 +319,9 @@ is_nmi: INTERRUPT_RETURN ``` -这段程序首先从栈顶弹出错误码和中断向量号,然后通过调用 `INTERRUPT_RETURN`,即 `iretq` 指令直接返回。 +这段程序首先从栈顶弹出错误码和中断向量号,然后通过调用 `INTERRUPT_RETURN` ,即 `iretq` 指令直接返回。 -如果当前中断不是 `NMI`,则首先检查 `early_recursion_flag` 以避免在 `early_idt_handler_common` 程序中递归地产生中断。如果一切都没问题,就先在栈上保存通用寄存器,为了防止中断返回时寄存器的内容错乱: +如果当前中断不是 `NMI` ,则首先检查 `early_recursion_flag` 以避免在 `early_idt_handler_common` 程序中递归地产生中断。如果一切都没问题,就先在栈上保存通用寄存器,为了防止中断返回时寄存器的内容错乱: ```assembly pushq %rax @@ -342,7 +342,7 @@ is_nmi: jne 11f ``` -段选择子必须为内核代码段,如果不是则跳转到标签 `11`,输出 `PANIC` 信息并打印栈的内容。然后我们来检查向量号,如果是 `#PF` 即 [缺页中断(Page Fault)](https://en.wikipedia.org/wiki/Page_fault),那么就把 `cr2` 寄存器中的值赋值给 `rdi`,然后调用 `early_make_pgtable` (详见后文): +段选择子必须为内核代码段,如果不是则跳转到标签 `11` ,输出 `PANIC` 信息并打印栈的内容。然后我们来检查向量号,如果是 `#PF` 即 [缺页中断(Page Fault)](https://en.wikipedia.org/wiki/Page_fault),那么就把 `cr2` 寄存器中的值赋值给 `rdi` ,然后调用 `early_make_pgtable` (详见后文): ```assembly cmpl $14,72(%rsp) @@ -353,7 +353,7 @@ is_nmi: jz 
20f ``` -如果向量号不是 `#PF`,那么就恢复通用寄存器: +如果向量号不是 `#PF` ,那么就恢复通用寄存器: ```assembly popq %r11 popq %r10 @@ -373,7 +373,7 @@ is_nmi: 缺页中断处理程序 -------------------------------------------------------------------------------- -在上一节中我们第一次见到了初期中断处理程序,它检查了缺页中断的中断号,并调用了 `early_make_pgtable`来建立新的页表。在这里我们需要提供 `#PF` 中断处理程序,以便为之后将内核加载至 `4G` 地址以上,并且能访问位于4G以上的 `boot_params` 结构体。 +在上一节中我们第一次见到了初期中断处理程序,它检查了缺页中断的中断号,并调用了 `early_make_pgtable` 来建立新的页表。在这里我们需要提供 `#PF` 中断处理程序,以便为之后将内核加载至 `4G` 地址以上,并且能访问位于4G以上的 `boot_params` 结构体。 `early_make_pgtable` 的实现在 [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c),它接受一个参数:从 `cr2` 寄存器得到的地址,这个地址引发了内存中断。下面让我们来看一下: @@ -397,7 +397,7 @@ int __init early_make_pgtable(unsigned long address) typedef unsigned long pgdval_t; ``` -此外,我们还会遇见 `*_t` (不带val)的类型,比如 `pgd_t`……这些类型都定义在 [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h),形式如下: +此外,我们还会遇见 `*_t` (不带val)的类型,比如 `pgd_t` ……这些类型都定义在 [arch/x86/include/asm/pgtable_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_types.h),形式如下: ```C typedef struct { pgdval_t pgd; } pgd_t; @@ -418,7 +418,7 @@ pgd_p = &early_level4_pgt[pgd_index(address)].pgd; pgd = *pgd_p; ``` -接下来我们检查一下 `pgd`,如果它包含了正确的全局页表项的话,我们就把这一项的物理地址处理后赋值给 `pud_p`: +接下来我们检查一下 `pgd` ,如果它包含了正确的全局页表项的话,我们就把这一项的物理地址处理后赋值给 `pud_p` : ```C @@ -445,7 +445,7 @@ pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base); 它是一个46bit大小的页帧屏蔽值。 -如果 `pgd` 没有包含有效的地址,我们就检查 `next_early_pgt` 与 `EARLY_DYNAMIC_PAGE_TABLES`(即 `64`)的大小。`EARLY_DYNAMIC_PAGE_TABLES` 它是一个固定大小的缓冲区,用来在需要的时候建立新的页表。如果 `next_early_pgt` 比 `EARLY_DYNAMIC_PAGE_TABLES` 大,我们就用一个上层页目录指针指向当前的动态页表,并将它的物理地址与 `_KERPG_TABLE` 访问权限一起写入全局页目录表: +如果 `pgd` 没有包含有效的地址,我们就检查 `next_early_pgt` 与 `EARLY_DYNAMIC_PAGE_TABLES`(即 `64` )的大小。`EARLY_DYNAMIC_PAGE_TABLES` 它是一个固定大小的缓冲区,用来在需要的时候建立新的页表。如果 `next_early_pgt` 比 `EARLY_DYNAMIC_PAGE_TABLES` 
大,我们就用一个上层页目录指针指向当前的动态页表,并将它的物理地址与 `_KERPG_TABLE` 访问权限一起写入全局页目录表: ```C if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) { From 2fdaaf4d99058447382047f3757b34f3b40620aa Mon Sep 17 00:00:00 2001 From: ye11ow Date: Sun, 28 May 2017 19:34:03 +0800 Subject: [PATCH 15/21] Refined chapter 1.1 before Bootloader --- Booting/linux-bootstrap-1.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Booting/linux-bootstrap-1.md b/Booting/linux-bootstrap-1.md index b20039d..4a36351 100644 --- a/Booting/linux-bootstrap-1.md +++ b/Booting/linux-bootstrap-1.md @@ -20,7 +20,7 @@ 神奇的电源按钮,接下来会发生什么? -------------------------------------------------------------------------------- -尽管这一系列文章关于 Linux 内核,我们在第一章并不会从内核代码开始。电脑在你按下电源开关的时候,就开始工作。主板发送信号给[电源](https://en.wikipedia.org/wiki/Power_supply),而电源收到信号后会给电脑供应合适的电量。一旦主板收到了[电源备妥信号](https://en.wikipedia.org/wiki/Power_good_signal),它会尝试启动 CPU 。CPU 则复位寄存器的所有数据,并设置每个寄存器的预定值。 +尽管这是一系列关于 Linux 内核的文章,我们在第一章并不会从内核代码开始。电脑在你按下电源开关的时候,就开始工作。主板发送信号给[电源](https://en.wikipedia.org/wiki/Power_supply),而电源收到信号后会给电脑供应合适的电量。一旦主板收到了[电源备妥信号](https://en.wikipedia.org/wiki/Power_good_signal),它会尝试启动 CPU 。CPU 则复位寄存器的所有数据,并设置每个寄存器的预定值。 [80386](https://en.wikipedia.org/wiki/Intel_80386) @@ -32,13 +32,13 @@ CS selector 0xf000 CS base 0xffff0000 ``` -处理器开始在[实模式](https://en.wikipedia.org/wiki/Real_mode)工作,我们需要退回一点去理解在这种模式下的内存分割。所有 x86兼容处理器都支持实模式,从 [8086](https://en.wikipedia.org/wiki/Intel_8086)到现在的 Intel 64 位 CPU。8086 处理器有一个20位寻址总线,这意味着它可以对0到 2^20 位地址空间进行操作( 1Mb ).不过它只有16位的寄存器,通过这个16位寄存器最大寻址是 2^16 即 0xffff (64 Kb)。实模式使用[段式内存管理](http://en.wikipedia.org/wiki/Memory_segmentation) 来管理整个内存空间。所有内存被分成固定的 64KB 大小的小块。由于我们不能用16位寄存器寻址大于 64KB 的内存,一种替代的方法被设计出来了。一个地址包括两个部分:数据段起始地址和从该数据段起的偏移量。为了得到内存中的物理地址,我们要让数据段乘16并加上偏移量: +处理器开始在[实模式](https://en.wikipedia.org/wiki/Real_mode)工作。我们需要退回一点去理解在这种模式下的内存分段机制。从 [8086](https://en.wikipedia.org/wiki/Intel_8086)到现在的 Intel 64 位 CPU,所有 x86兼容处理器都支持实模式。8086 处理器有一个20位寻址总线,这意味着它可以对0到 2^20 位地址空间( 1MB )进行操作。不过它只有16位的寄存器,所以最大寻址空间是 
2^16 即 0xffff (64 KB)。实模式使用[段式内存管理](http://en.wikipedia.org/wiki/Memory_segmentation) 来管理整个内存空间。所有内存被分成固定的65536字节(64 KB) 大小的小块。由于我们不能用16位寄存器寻址大于 64KB 的内存,一种替代的方法被设计出来了。一个地址包括两个部分:数据段起始地址和从该数据段起的偏移量。为了得到内存中的物理地址,我们要让数据段乘16并加上偏移量:
 
```
PhysicalAddress = Segment * 16 + Offset
```
 
-举个例子,如果 `CS:IP` 是 `0x2000:0x0010`, 相关的物理地址将会是:
+举个例子,如果 `CS:IP` 是 `0x2000:0x0010`, 则对应的物理地址将会是:
 
```python
>>> hex((0x2000 << 4) + 0x0010)
@@ -96,7 +96,7 @@ SECTIONS {
}
```
 
-现在BIOS已经开始工作了。在初始化和检查硬件之后,需要寻找到一个可引导设备。可引导设备列表存储在在 BIOS 配置中, BIOS 将根据其中配置的顺序,尝试从不同的设备上寻找引导程序。对于硬盘,BIOS 将尝试寻找引导扇区。如果在硬盘上存在一个MBR分区,那么引导扇区储存在第一个扇区(512字节)的头446字节,引导扇区的最后必须是 `0x55` 和 `0xaa` ,这2个字节称为魔术字节,如果 BIOS 看到这2个字节,就知道这个设备是一个可引导设备。举个例子:
+现在BIOS已经开始工作了。在初始化和检查硬件之后,需要寻找到一个可引导设备。可引导设备列表存储在 BIOS 配置中, BIOS 将根据其中配置的顺序,尝试从不同的设备上寻找引导程序。对于硬盘,BIOS 将尝试寻找引导扇区。如果在硬盘上存在一个MBR分区,那么引导扇区储存在第一个扇区(512字节)的头446字节,引导扇区的最后必须是 `0x55` 和 `0xaa` ,这2个字节称为魔术字节(Magic Bytes),如果 BIOS 看到这2个字节,就知道这个设备是一个可引导设备。举个例子:
 
```assembly
;
From d4a78a1f5567580b91b034afa5815de33d2f7221 Mon Sep 17 00:00:00 2001
From: woodpenker
Date: Mon, 12 Jun 2017 19:34:36 +0800
Subject: [PATCH 16/21] update:Translate Chapter14.1

---
 KernelStructures/idt.md | 152 ++++++++++++++++++++-------------------
 1 file changed, 76 insertions(+), 76 deletions(-)

diff --git a/KernelStructures/idt.md b/KernelStructures/idt.md
index 5374350..7e0683e 100644
--- a/KernelStructures/idt.md
+++ b/KernelStructures/idt.md
@@ -1,61 +1,61 @@
-interrupt-descriptor table (IDT)
+中断描述符表 (IDT)
 ================================================================================
 
-Three general interrupt & exceptions sources:
+三个常见的中断和异常来源:
 
-* Exceptions - sync;
-* Software interrupts - sync;
-* External interrupts - async.
+* 异常 - 同步(sync);
+* 软中断 - 同步(sync);
+* 外部中断 - 异步(async)。
 
-Types of Exceptions:
+异常的类型:
 
-* Faults - are precise exceptions reported on the boundary `before` the instruction causing the exception. 
The saved `%rip` points to the faulting instruction;
-* Traps - are precise exceptions reported on the boundary `following` the instruction causing the exception. The same with `%rip`;
-* Aborts - are imprecise exceptions. Because they are imprecise, aborts typically do not allow reliable program restart.
+* 故障 - 是精确异常,在导致异常的指令`之前`的边界上报告。保存的 `%rip` 指向发生故障的指令;
+* 陷阱 - 是精确异常,在导致异常的指令`之后`的边界上报告。保存的 `%rip` 指向导致异常的指令的下一条指令;
+* 终止 - 是不精确异常。由于不精确,终止通常不允许程序可靠地重新启动。
 
-`Maskable` interrupts trigger the interrupt-handling mechanism only when RFLAGS.IF=1. Otherwise they are held pending for as long as the RFLAGS.IF bit is cleared to 0.
+只有当 RFLAGS.IF=1 时,`可屏蔽`中断才会触发中断处理机制。否则,只要 RFLAGS.IF 位被清零,它们就会一直处于挂起等待状态。
 
-`Nonmaskable` interrupts (NMI) are unaffected by the value of the rFLAGS.IF bit. However, the occurrence of an NMI masks further NMIs until an IRET instruction is executed.
+`不可屏蔽`中断(NMI)不受 rFLAGS.IF 位的影响。不过,一个 NMI 的发生会屏蔽后续的 NMI,直到执行 IRET(中断返回)指令。
+
+具体的异常和中断来源被分配了固定的向量标识号(也称"中断向量"或简称"向量")。中断处理机制使用中断向量来定位分配给该异常或中断的系统软件服务例程。至多有256个不同的中断向量可用。前32个是保留的,用于预定义的异常和中断条件。请参考[arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121)头文件中对它们的定义:
 
-Specific exception and interrupt sources are assigned a fixed vector-identification number (also called an “interrupt vector” or simply “vector”). The interrupt vector is used by the interrupt-handling mechanism to locate the system-software service routine assigned to the exception or interrupt. Up to
-256 unique interrupt vectors are available. The first 32 vectors are reserved for predefined exception and interrupt conditions. 
They are defined in the [arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121) header file:

```
-/* Interrupts/Exceptions */
+/* 中断/异常 */
 enum {
-	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
-	X86_TRAP_DB,		/*  1, Debug */
-	X86_TRAP_NMI,		/*  2, Non-maskable Interrupt */
-	X86_TRAP_BP,		/*  3, Breakpoint */
-	X86_TRAP_OF,		/*  4, Overflow */
-	X86_TRAP_BR,		/*  5, Bound Range Exceeded */
-	X86_TRAP_UD,		/*  6, Invalid Opcode */
-	X86_TRAP_NM,		/*  7, Device Not Available */
-	X86_TRAP_DF,		/*  8, Double Fault */
-	X86_TRAP_OLD_MF,	/*  9, Coprocessor Segment Overrun */
-	X86_TRAP_TS,		/* 10, Invalid TSS */
-	X86_TRAP_NP,		/* 11, Segment Not Present */
-	X86_TRAP_SS,		/* 12, Stack Segment Fault */
-	X86_TRAP_GP,		/* 13, General Protection Fault */
-	X86_TRAP_PF,		/* 14, Page Fault */
-	X86_TRAP_SPURIOUS,	/* 15, Spurious Interrupt */
-	X86_TRAP_MF,		/* 16, x87 Floating-Point Exception */
-	X86_TRAP_AC,		/* 17, Alignment Check */
-	X86_TRAP_MC,		/* 18, Machine Check */
-	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
-	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
+	X86_TRAP_DE = 0,	/*  0, 除零错误 */
+	X86_TRAP_DB,		/*  1, 调试 */
+	X86_TRAP_NMI,		/*  2, 不可屏蔽中断 */
+	X86_TRAP_BP,		/*  3, 断点 */
+	X86_TRAP_OF,		/*  4, 溢出 */
+	X86_TRAP_BR,		/*  5, 超出范围 */
+	X86_TRAP_UD,		/*  6, 操作码无效 */
+	X86_TRAP_NM,		/*  7, 设备不可用 */
+	X86_TRAP_DF,		/*  8, 双重错误(Double Fault) */
+	X86_TRAP_OLD_MF,	/*  9, 协处理器段溢出 */
+	X86_TRAP_TS,		/* 10, 无效的 TSS */
+	X86_TRAP_NP,		/* 11, 段不存在 */
+	X86_TRAP_SS,		/* 12, 堆栈段故障 */
+	X86_TRAP_GP,		/* 13, 一般保护故障 */
+	X86_TRAP_PF,		/* 14, 页错误 */
+	X86_TRAP_SPURIOUS,	/* 15, 伪中断 */
+	X86_TRAP_MF,		/* 16, x87 浮点异常 */
+	X86_TRAP_AC,		/* 17, 对齐检查 */
+	X86_TRAP_MC,		/* 18, 机器检查 */
+	X86_TRAP_XF,		/* 19, SIMD(单指令多数据)浮点异常 */
+	X86_TRAP_IRET = 32,	/* 32, IRET(中断返回)异常 */
 };
 ```
 
-Error Codes
+错误代码(Error code)
 --------------------------------------------------------------------------------
 
-The processor exception-handling mechanism reports error and status information for some 
exceptions using an error code. The error code is pushed onto the stack by the exception-mechanism during the control transfer into the exception handler. The error code has two formats:
+处理器的异常处理机制使用错误代码来报告某些异常的错误和状态信息。在控制权转移到异常处理程序期间,异常机制会把错误代码压入堆栈。错误代码有两种格式:
 
-* most error-reporting exceptions format;
-* page fault format.
+* 多数异常错误报告格式;
+* 页错误格式。
 
-Here is format of selector error code:
+选择器错误代码的格式如下:
 
```
31                                            16 15                              3 2   1   0
+-----------------------------------------------+-------------------------------+---+---+---+
|                   Reserved                    |        Selector Index         |TI |IDT|EXT|
+-----------------------------------------------+-------------------------------+---+---+---+
```
 
-Where:
+说明如下:
 
-* `EXT` - If this bit is set to 1, the exception source is external to the processor. If cleared to 0, the exception source is internal to the processor;
-* `IDT` - If this bit is set to 1, the error-code selector-index field references a gate descriptor located in the `interrupt-descriptor table`. If cleared to 0, the selector-index field references a descriptor in either the `global-descriptor table` or local-descriptor table `LDT`, as indicated by the `TI` bit;
-* `TI` - If this bit is set to 1, the error-code selector-index field references a descriptor in the `LDT`. If cleared to 0, the selector-index field references a descriptor in the `GDT`.
-* `Selector Index` - The selector-index field specifies the index into either the `GDT`, `LDT`, or `IDT`, as specified by the `IDT` and `TI` bits. 
+* `EXT` - 如果该位设置为1,则异常源在处理器外部。如果设置为0,则异常源位于处理器内部;
+* `IDT` - 如果该位设置为1,则错误代码的选择器索引字段引用位于"中断描述符表"中的门描述符。如果设置为0,则选择器索引字段引用"全局描述符表"或本地描述符表"LDT"中的描述符,具体由"TI"位指示;
+* `TI` - 如果该位设置为1,则错误代码的选择器索引字段引用"LDT"中的描述符。如果清除为0,则选择器索引字段引用"GDT"中的描述符;
+* `Selector Index` - 选择器索引字段指定"GDT"、"LDT"或"IDT"中的索引,具体由"IDT"和"TI"位指定。
 
-Page-Fault Error Code format is:
+页错误代码格式如下:
 
```
31                                                            4   3   2   1   0
+--------------------------------------------------------------+---+---+---+---+---+
|                           Reserved                           |I/D|RSV|U/S|R/W| P |
+--------------------------------------------------------------+---+---+---+---+---+
```
 
-Where:
+说明如下:
 
-* `I/D` - If this bit is set to 1, it indicates that the access that caused the page fault was an instruction fetch;
-* `RSV` - If this bit is set to 1, the page fault is a result of the processor reading a 1 from a reserved field within a page-translation-table entry;
-* `U/S` - If this bit is cleared to 0, an access in supervisor mode (`CPL=0, 1, or 2`) caused the page fault. If this bit is set to 1, an access in user mode (CPL=3) caused the page fault;
-* `R/W` - If this bit is cleared to 0, the access that caused the page fault is a memory read. If this bit is set to 1, the memory access that caused the page fault was a write;
-* `P` - If this bit is cleared to 0, the page fault was caused by a not-present page. If this bit is set to 1, the page fault was caused by a page-protection violation. 
+* `I/D` - 如果该位设置为1,表示造成页错误的访问是取指; +* `RSV` - 如果该位设置为1,则页错误是处理器从保留给分页表的区域中读取1的结果; +* `U/S` - 如果该位被设置为0,则是管理员模式(`CPL = 0,1或2`)进行访问导致了页错误。 如果该位设置为1,则是用户模式(CPL = 3)进行访问导致了页错误; +* `R/W` - 如果该位被设置为0,导致页错误的是内存读取。 如果该位设置为1,则导致页错误的是内存写入; +* `P` - 如果该位被设置为0,则页错误是由不存在的页面引起的。 如果该位设置为1,页错误是由于违反页保护引起的。 -Interrupt Control Transfers +中断控制传输(Interrupt Control Transfers) -------------------------------------------------------------------------------- -The IDT may contain any of three kinds of gate descriptors: +IDT可以包含三种门描述符中的任何一种: -* `Task Gate` - contains the segment selector for a TSS for an exception and/or interrupt handler task; -* `Interrupt Gate` - contains segment selector and offset that the processor uses to transfer program execution to a handler procedure in an interrupt handler code segment; -* `Trap Gate` - contains segment selector and offset that the processor uses to transfer program execution to a handler procedure in an exception handler code segment. +* `Task Gate(任务门)` - 包含用于异常与或中断处理程序任务的TSS的段选择器; +* `Interrupt Gate(中断门)` - 包含处理器用于将程序从执行转移到中断处理程序的段选择器和偏移量; +* `Trap Gate(陷阱门)` - 包含处理器用于将程序从执行转移到异常处理程序的段选择器和偏移量。 -General format of gates is: +门的一般格式是: ``` 127 96 @@ -130,16 +130,16 @@ General format of gates is: +-------------------------------------------------------------------------------+ ``` -Where +说明如下: -* `Selector` - Segment Selector for destination code segment; -* `Offset` - Offset to handler procedure entry point; -* `DPL` - Descriptor Privilege Level; -* `P` - Segment Present flag; -* `IST` - Interrupt Stack Table; -* `TYPE` - one of: Local descriptor-table (LDT) segment descriptor, Task-state segment (TSS) descriptor, Call-gate descriptor, Interrupt-gate descriptor, Trap-gate descriptor or Task-gate descriptor. 
+* `Selector` - 目标代码段的段选择器; +* `Offset` - 处理程序入口点的偏移量; +* `DPL` - 描述符权限级别; +* `P` - 当前段标志; +* `IST` - 中断堆栈表; +* `TYPE` - 本地描述符表(LDT)段描述符,任务状态段(TSS)描述符,调用门描述符,中断门描述符,陷阱门描述符或任务门描述符之一。 -An `IDT` descriptor is represented by the following structure in the Linux kernel (only for `x86_64`): +`IDT` 描述符在Linux内核中由以下结构表示(仅适用于`x86_64`): ```C struct gate_struct64 { @@ -152,9 +152,9 @@ struct gate_struct64 { } __attribute__((packed)); ``` -which is defined in the [arch/x86/include/asm/desc_defs.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/desc_defs.h#L51) header file. +它定义在 [arch/x86/include/asm/desc_defs.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/desc_defs.h#L51) 头文件中。 -A task gate descriptor does not contain `IST` field and its format differs from interrupt/trap gates: +任务门描述符不包含`IST`字段,并且其格式与中断/陷阱门不同: ```C struct ldttss_desc64 { @@ -167,24 +167,24 @@ struct ldttss_desc64 { } __attribute__((packed)); ``` -Exceptions During a Task Switch +任务切换期间的异常(Exceptions During a Task Switch) -------------------------------------------------------------------------------- -An exception can occur during a task switch while loading a segment selector. Page faults can also occur when accessing a TSS. In these cases, the hardware task-switch mechanism completes loading the new task state from the TSS, and then triggers the appropriate exception mechanism. 
+任务切换在加载段选择器期间可能会发生异常。页错误也可能会在访问TSS时出现。在这些情况下,由硬件任务切换机构完成从TSS加载新的任务状态,然后触发适当的异常处理。 -**In long mode, an exception cannot occur during a task switch, because the hardware task-switch mechanism is disabled.** +**在长模式下,由于硬件任务切换机构被禁用,因而在任务切换期间不会发生异常。** -Nonmaskable interrupt +不可屏蔽中断(Nonmaskable interrupt) -------------------------------------------------------------------------------- -**TODO** +**未完待续** API -------------------------------------------------------------------------------- -**TODO** +**未完待续** -Interrupt Stack Table +中断堆栈表(Interrupt Stack Table) -------------------------------------------------------------------------------- -**TODO** +**未完待续** From 28aa7957ba0c8310148d37f371f3e2ddb3ccaf47 Mon Sep 17 00:00:00 2001 From: woodpenker Date: Mon, 12 Jun 2017 19:47:16 +0800 Subject: [PATCH 17/21] update:update README & contributors --- README.md | 2 +- contributors.md | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 7072814..8d2a1de 100644 --- a/README.md +++ b/README.md @@ -96,7 +96,7 @@ |└ [13.4](https://github.com/MintCN/linux-insides-zh/blob/master/Misc/program_startup.md)|[@mudongliang](https://github.com/mudongliang)|已完成| | 14. 
[KernelStructures](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures)||正在进行| |├ [14.0](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures/README.md)|[@mudongliang](https://github.com/mudongliang)|更新至[3cb550c0](https://github.com/0xAX/linux-insides/commit/3cb550c089c8fc609f667290434e9e98e93fa279)| -|└ [14.1](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures/idt.md)||未开始| +|└ [14.1](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures/idt.md)||[@woodpenker](https://github.com/woodpenker)|更新至[4521637d](https://github.com/0xAX/linux-insides/commit/4521637d9cb76e5d4e4dc951758b264a68504927)| ## 翻译认领规则 diff --git a/contributors.md b/contributors.md index cd05b29..41c04de 100644 --- a/contributors.md +++ b/contributors.md @@ -32,4 +32,6 @@ [@a1ickgu0](https://github.com/a1ickgu0) -[@hao-lee](https://github.com/hao-lee) \ No newline at end of file +[@hao-lee](https://github.com/hao-lee) + +[@woodpenker](http://github.com/woodpenker) \ No newline at end of file From 37f72f15a87de5db149d56b445ff51cc64c2ea6f Mon Sep 17 00:00:00 2001 From: wlf Date: Mon, 19 Jun 2017 13:29:53 +0800 Subject: [PATCH 18/21] fix typos --- MM/linux-mm-3.md | 80 ++++++++++++++++++++++++------------------------ 1 file changed, 40 insertions(+), 40 deletions(-) diff --git a/MM/linux-mm-3.md b/MM/linux-mm-3.md index 28b9872..d8e369e 100644 --- a/MM/linux-mm-3.md +++ b/MM/linux-mm-3.md @@ -4,7 +4,7 @@ Linux内核内存管理 第三节 内核中 kmemcheck 介绍 -------------------------------------------------------------------------------- -Linux内存管理 [章节](https://0xax.gitbooks.io/linux-insides/content/mm/) 描述了Linux内核中 [内存管理](https://en.wikipedia.org/wiki/Memory_management);本小节是第三部分。 在本章[第二节](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)中我们遇到了两个与内存管理相关的概念: +Linux内存管理[章节](https://0xax.gitbooks.io/linux-insides/content/mm/)描述了Linux内核中[内存管理](https://en.wikipedia.org/wiki/Memory_management);本小节是第三部分。 
在本章[第二节](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)中我们遇到了两个与内存管理相关的概念: * `固定映射地址`; * `输入输出重映射`. @@ -31,7 +31,7 @@ $ sudo cat /proc/iomem ... ``` -`iomem`命令的输出显示了系统中每个物理设备所映射的内存区域。第一列为物理设备分配的内存区域,第二列为对应的各种不同类型的物理设备。再例如: +`iomem` 命令的输出显示了系统中每个物理设备所映射的内存区域。第一列为物理设备分配的内存区域,第二列为对应的各种不同类型的物理设备。再例如: ``` @@ -62,13 +62,13 @@ $ sudo cat /proc/ioports ... ``` -`ioports`的输出列出了系统中物理设备所注册的各种类型的I/O端口。内核不能直接访问设备的输入/输出地址。在内核能够使用这些内存之前,必须将这些地址映射到虚拟地址空间,这就是`io remap`机制的主要目的。在前面[第二节](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)中只介绍了早期的`io remap`。很快我们就要来看一看常规的`io remap`实现机制。但在此之前,我们需要学习一些其他的知识,例如不同类型的内存分配器等,不然的话我们很难理解该机制。 +`ioports` 的输出列出了系统中物理设备所注册的各种类型的I/O端口。内核不能直接访问设备的输入/输出地址。在内核能够使用这些内存之前,必须将这些地址映射到虚拟地址空间,这就是`io remap`机制的主要目的。在前面[第二节](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)中只介绍了早期的 `io remap` 。很快我们就要来看一看常规的 `io remap` 实现机制。但在此之前,我们需要学习一些其他的知识,例如不同类型的内存分配器等,不然的话我们很难理解该机制。 在进入Linux内核常规期的[内存管理](https://en.wikipedia.org/wiki/Memory_management)之前,我们要看一些特殊的内存机制,例如[调试](https://en.wikipedia.org/wiki/Debugging),检查[内存泄漏](https://en.wikipedia.org/wiki/Memory_leak),内存控制等等。学习这些内容有助于我们理解Linux内核的内存管理。 -从本节的标题中,你可能已经看出来,我们会从 [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt)开始了解内存机制。和前面的[章节](https://0xax.gitbooks.io/linux-insides/content/)一样,我们首先从理论上学习什么是`kmemcheck`,然后再来看Linux内核中是怎么实现这一机制的。 +从本节的标题中,你可能已经看出来,我们会从[kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt)开始了解内存机制。和前面的[章节](https://0xax.gitbooks.io/linux-insides/content/)一样,我们首先从理论上学习什么是 `kmemcheck` ,然后再来看Linux内核中是怎么实现这一机制的。 -让我们开始吧。Linux内核中的`kmemcheck`到底是什么呢?从该机制的名称上你可能已经猜到, `kmemcheck` 是检查内存的。你猜的很对。`kmemcheck`的主要目的就是用来检查是否有内核代码访问 `未初始化的内存`。让我们看一个简单的[C](https://en.wikipedia.org/wiki/C_%28programming_language%29)程序: +让我们开始吧。Linux内核中的 `kmemcheck` 到底是什么呢?从该机制的名称上你可能已经猜到, `kmemcheck` 是检查内存的。你猜的很对。`kmemcheck` 的主要目的就是用来检查是否有内核代码访问 `未初始化的内存` 。让我们看一个简单的 [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) 程序: ```C #include @@ -92,7 
+92,7 @@ int main(int argc, char **argv) { gcc test.c -o test ``` -[编译器](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)不会显示成员 `a`未初始化的提示信息。但是如果使用工具[valgrind](https://en.wikipedia.org/wiki/Valgrind)来运行该程序,我们会看到如下输出: +[编译器](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)不会显示成员 `a` 未初始化的提示信息。但是如果使用工具[valgrind](https://en.wikipedia.org/wiki/Valgrind)来运行该程序,我们会看到如下输出: ``` ~$ valgrind --leak-check=yes ./test @@ -116,9 +116,9 @@ gcc test.c -o test ... ``` -实际上`kmemcheck`在内核空间做的事情,和`valgrind`在用户空间做的事情是一样的,都是用来检测未初始化的内存。 +实际上 `kmemcheck` 在内核空间做的事情,和 `valgrind` 在用户空间做的事情是一样的,都是用来检测未初始化的内存。 -要想在内核中启用该机制,需要在配置内核时使能`CONFIG_KMEMCHECK`选项: +要想在内核中启用该机制,需要在配置内核时开启 `CONFIG_KMEMCHECK` 选项: ``` Kernel hacking @@ -127,7 +127,7 @@ Kernel hacking ![kernel configuration menu](http://oi63.tinypic.com/2pzbog7.jpg) -`kmemcheck`机制还提供了一些内核配置参数,我们可以在下一个段落中看到所有的可选参数。最后一个需要注意的是,`kmemcheck` 仅在 [x86_64](https://en.wikipedia.org/wiki/X86-64) 体系中实现了。为了确信这一点,我们可以查看`x86`的内核配置文件 [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig): +`kmemcheck` 机制还提供了一些内核配置参数,我们可以在下一个段落中看到所有的可选参数。最后一个需要注意的是,`kmemcheck` 仅在 [x86_64](https://en.wikipedia.org/wiki/X86-64) 体系中实现了。为了确信这一点,我们可以查看 `x86` 的内核配置文件 [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig): ``` config X86 @@ -140,34 +140,34 @@ config X86 ... 
``` -因此,对于其他的体系结构来说是没有`kmemcheck` 功能的。 +因此,对于其他的体系结构来说是没有 `kmemcheck` 功能的。 -现在我们知道了`kmemcheck`可以检测内核中`未初始化内存`的使用情况,也知道了如何开启这个功能。那么`kmemcheck`是怎么做检测的呢?当内核尝试分配内存时,例如如下一段代码: +现在我们知道了 `kmemcheck` 可以检测内核中`未初始化内存`的使用情况,也知道了如何开启这个功能。那么 `kmemcheck` 是怎么做检测的呢?当内核尝试分配内存时,例如如下一段代码: ``` struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL); ``` -或者换句话说,在内核访问[page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29)时会发生[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。这是由于`kmemcheck`将内存页标记为`不存在`(关于Linux内存分页的相关信息,你可以参考[分页](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html))。如果一个 `缺页中断`异常发生了,异常处理程序会来处理这个异常,如果异常处理程序检测到内核使能了 `kmemcheck`,那么就会将控制权提交给 `kmemcheck`来处理;`kmemcheck`检查完之后,该内存页会被标记为`present`,然后被中断的程序得以继续执行下去。 这里的处理方式比较巧妙,被中断程序的第一条指令执行时,`kmemcheck`又会标记内存页为`not present`,按照这种方式,下一个对内存页的访问也会被捕获。 +或者换句话说,在内核访问 [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29) 时会发生[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。这是由于 `kmemcheck` 将内存页标记为`不存在`(关于Linux内存分页的相关信息,你可以参考[分页](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html))。如果一个`缺页中断`异常发生了,异常处理程序会来处理这个异常,如果异常处理程序检测到内核使能了 `kmemcheck`,那么就会将控制权提交给 `kmemcheck` 来处理;`kmemcheck` 检查完之后,该内存页会被标记为 `present`,然后被中断的程序得以继续执行下去。 这里的处理方式比较巧妙,被中断程序的第一条指令执行时,`kmemcheck` 又会标记内存页为 `not present`,按照这种方式,下一个对内存页的访问也会被捕获。 目前我们只是从理论层面考察了 `kmemcheck`,接下来我们看一下Linux内核是怎么来实现该机制的。 -`kmemcheck`机制在Linux内核中的实现 +`kmemcheck` 机制在Linux内核中的实现 -------------------------------------------------------------------------------- -我们应该已经了解`kmemcheck`是做什么的以及它在Linux内核中的功能,现在是时候看一下它在Linux内核中的实现。 `kmemcheck`在内核的实现分为两部分。第一部分是架构无关的部分,位于源码 [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c);第二部分 [x86_64](https://en.wikipedia.org/wiki/X86-64)架构相关的部分位于目录[arch/x86/mm/kmemcheck](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck)中。 +我们应该已经了解 `kmemcheck` 是做什么的以及它在Linux内核中的功能,现在是时候看一下它在Linux内核中的实现。 `kmemcheck` 在内核的实现分为两部分。第一部分是架构无关的部分,位于源码 
[mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c);第二部分 [x86_64](https://en.wikipedia.org/wiki/X86-64)架构相关的部分位于目录[arch/x86/mm/kmemcheck](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck)中。
 
-我们先分析该机制的初始化过程。我们已经知道要在内核中使能`kmemcheck`机制,需要开启内核的`CONFIG_KMEMCHECK`配置项。除了这个选项,我们还需要给内核command line传递一个`kmemcheck`参数:
+我们先分析该机制的初始化过程。我们已经知道要在内核中使能 `kmemcheck` 机制,需要开启内核的`CONFIG_KMEMCHECK` 配置项。除了这个选项,我们还需要给内核command line传递一个 `kmemcheck` 参数:
 
 * kmemcheck=0 (disabled)
 * kmemcheck=1 (enabled)
 * kmemcheck=2 (one-shot mode)
 
-前面两个值得含义很明确,但是最后一个需要解释。这个选项会使`kmemcheck`进入一种特殊的模式:在第一次检测到未初始化内存的使用之后,就会关闭`kmemcheck`。实际上该模式是内核的默认选项:
+前面两个值的含义很明确,但是最后一个需要解释。这个选项会使 `kmemcheck` 进入一种特殊的模式:在第一次检测到未初始化内存的使用之后,就会关闭 `kmemcheck` 。实际上该模式是内核的默认选项:
 
 ![kernel configuration menu](http://oi66.tinypic.com/y2eeh.jpg)
 
-从Linux初始化过程章节的第七节[part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html)中,我们知道在内核初始化过程中,会在`do_initcall_level`, `do_early_param`等函数中解析内核command line。前面也提到过 `kmemcheck`子系统由两部分组成,第一部分启动比较早。在源码 [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c)中有一个函数 `param_kmemcheck`,该函数在command line解析时就会用到:
+从Linux初始化过程章节的第七节 [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) 中,我们知道在内核初始化过程中,会在 `do_initcall_level` , `do_early_param` 等函数中解析内核 command line。前面也提到过 `kmemcheck` 子系统由两部分组成,第一部分启动比较早。在源码 [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) 中有一个函数 `param_kmemcheck` ,该函数在command line解析时就会用到:
 
 ```C
 static int __init param_kmemcheck(char *str)
@@ -188,9 +188,9 @@ static int __init param_kmemcheck(char *str)
 early_param("kmemcheck", param_kmemcheck);
 ```
 
-从前面的介绍我们知道`param_kmemcheck`可能存在三种情况:`0` (使能), `1` (禁止) or `2` (一次性)。`param_kmemcheck`的实现很简单:将command line传递的`kmemcheck`参数的值由字符串转换为整数,然后赋值给变量`kmemcheck_enabled`。
+从前面的介绍我们知道 `param_kmemcheck` 可能存在三种情况:`0` (禁止), `1` (使能) 或 `2` (一次性)。 `param_kmemcheck` 的实现很简单:将command line传递的
`kmemcheck` 参数的值由字符串转换为整数,然后赋值给变量 `kmemcheck_enabled` 。
 
-第二阶段在内核初始化阶段执行,而不是在早期初始化过程 [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)。第二阶断的过程体现在 `kmemcheck_init`:
+第二阶段在内核初始化阶段执行,而不是在早期初始化过程 [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html) 。第二阶段的过程体现在 `kmemcheck_init` :
 
 ```C
 int __init kmemcheck_init(void)
@@ -203,7 +203,7 @@ int __init kmemcheck_init(void)
 early_initcall(kmemcheck_init);
 ```
 
-`kmemcheck_init`的主要目的就是调用 `kmemcheck_selftest` 函数,并检查它的返回值:
+`kmemcheck_init` 的主要目的就是调用 `kmemcheck_selftest` 函数,并检查它的返回值:
 
 ```C
 if (!kmemcheck_selftest()) {
@@ -215,7 +215,7 @@ if (!kmemcheck_selftest()) {
 printk(KERN_INFO "kmemcheck: Initialized\n");
 ```
 
-如果`kmemcheck_init`检测失败,就返回`EINVAL` 。 `kmemcheck_selftest`函数会检测内存访问相关的[操作码](https://en.wikipedia.org/wiki/Opcode)(例如 `rep movsb`, `movzwq`)的大小。如果检测到的大小的实际大小是一致的,`kmemcheck_selftest`返回 `true`,否则返回 `false`。
+如果 `kmemcheck_init` 检测失败,就返回 `EINVAL` 。 `kmemcheck_selftest` 函数会检测内存访问相关的[操作码](https://en.wikipedia.org/wiki/Opcode)(例如 `rep movsb`, `movzwq`)的大小。如果检测到的大小与实际大小一致,`kmemcheck_selftest` 返回 `true`,否则返回 `false`。
 
 如果如下代码被调用:
 
@@ -223,7 +223,7 @@ printk(KERN_INFO "kmemcheck: Initialized\n");
 struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
 ```
 
-经过一系列的函数调用,`kmem_getpages`函数会被调用到,该函数的定义在源码 [mm/slab.c](https://github.com/torvalds/linux/blob/master/mm/slab.c)中,该函数的主要功能就是尝试按照指定的参数需求分配[内存页](https://en.wikipedia.org/wiki/Paging)。在该函数的结尾处有如下代码:
+经过一系列的函数调用,`kmem_getpages` 函数会被调用到,该函数的定义在源码 [mm/slab.c](https://github.com/torvalds/linux/blob/master/mm/slab.c) 中,该函数的主要功能就是尝试按照指定的参数需求分配[内存页](https://en.wikipedia.org/wiki/Paging)。在该函数的结尾处有如下代码:
 
 ```C
 if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
@@ -236,7 +236,7 @@ if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
 }
 ```
 
-这段代码判断如果`kmemcheck`使能,并且参数中未设置`SLAB_NOTRACK`,那么就给分配的内存页设置 
`non-present`标记。`SLAB_NOTRACK`标记的含义是不跟踪未初始化的内存。另外,如果缓存对象有构造函数(缓存细节在下面描述),所分配的内存页标记为未初始化,否则标记为未分配。`kmemcheck_alloc_shadow`函数在源码[mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c)中,其基本内容如下: +这段代码判断如果 `kmemcheck` 使能,并且参数中未设置 `SLAB_NOTRACK` ,那么就给分配的内存页设置 `non-present` 标记。`SLAB_NOTRACK` 标记的含义是不跟踪未初始化的内存。另外,如果缓存对象有构造函数(细节在下面描述),所分配的内存页标记为未初始化,否则标记为未分配。`kmemcheck_alloc_shadow` 函数在源码 [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) 中,其基本内容如下: ```C void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node) @@ -252,7 +252,7 @@ void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node) } ``` -首先为shadow bits分配内存,并为内存页设置shadow位。如果内存页设置了该标记,就意味着`kmemcheck`会跟踪这个内存页。最后调用`kmemcheck_hide_pages`函数。`kmemcheck_hide_pages`是体系结构相关的函数,其代码在 [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c)源码中。该函数的功能是为指定的内存页设置`non-present`标记。该函数实现如下: +首先为 shadow bits 分配内存,并为内存页设置 shadow 位。如果内存页设置了该标记,就意味着 `kmemcheck` 会跟踪这个内存页。最后调用 `kmemcheck_hide_pages` 函数。 `kmemcheck_hide_pages` 是体系结构相关的函数,其代码在 [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c) 源码中。该函数的功能是为指定的内存页设置 `non-present` 标记。该函数实现如下: ```C void kmemcheck_hide_pages(struct page *p, unsigned int n) @@ -276,9 +276,9 @@ void kmemcheck_hide_pages(struct page *p, unsigned int n) } ``` -该函数遍历参数代表的所有内存页,并尝试获取每个内存页的`页表项`。如果获取成功,清理页表项的present标记,设置页表项的hidden标记。在最后还需要刷新[TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer),因为有一些内存页已经发生了改变。从这个地方开始,内存页就进入 `kmemcheck`的跟踪系统。由于内存页的`present`标记被清除了,一旦 `kmalloc`返回了内存地址,并且有代码访问这个地址,就会触发[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。 +该函数遍历参数代表的所有内存页,并尝试获取每个内存页的 `页表项` 。如果获取成功,清理页表项的present 标记,设置页表项的 hidden 标记。在最后还需要刷新 [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) ,因为有一些内存页已经发生了改变。从这个地方开始,内存页就进入 `kmemcheck` 的跟踪系统。由于内存页的 `present` 标记被清除了,一旦 `kmalloc` 
返回了内存地址,并且有代码访问这个地址,就会触发[缺页中断](https://en.wikipedia.org/wiki/Page_fault)。 -在Linux内核初始化的[第二节](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)介绍过,`缺页中断`处理程序是[arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c)的 `do_page_fault`函数。该函数开始部分如下: +在Linux内核初始化的[第二节](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)介绍过,`缺页中断`处理程序是 [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) 的 `do_page_fault` 函数。该函数开始部分如下: ```C static noinline void @@ -296,7 +296,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, } ``` -`kmemcheck_active`函数获取`kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)结构体,并返回该结构体成员`balance`和0的比较结果: +`kmemcheck_active` 函数获取 `kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) 结构体,并返回该结构体成员 `balance` 和0的比较结果: ``` bool kmemcheck_active(struct pt_regs *regs) @@ -307,7 +307,7 @@ bool kmemcheck_active(struct pt_regs *regs) } ``` -`kmemcheck_context`结构体代表 `kmemcheck`机制的当前状态。其内部保存了未初始化的地址,地址的数量等信息。其成员 `balance`代表了 `kmemcheck`的当前状态,换句话说,`balance`表示 `kmemcheck`是否已经隐藏了内存页。如果`data->balance`大于0, `kmemcheck_hide` 函数会被调用。这意味着 `kmemecheck`已经设置了内存页的`present`标记,但是我们需要再次隐藏内存页以便触发下一次的缺页中断。 `kmemcheck_hide`函数会清理内存页的 `present`标记,这表示一次`kmemcheck`会话已经完成,新的缺页中断会再次被触发。在第一步,由于`data->balance` 值为0,所以`kmemcheck_active`会返回false,所以 `kmemcheck_hide`也不会被调用。接下来,我们看`do_page_fault`的下一行代码: +`kmemcheck_context` 结构体代表 `kmemcheck` 机制的当前状态。其内部保存了未初始化的地址,地址的数量等信息。其成员 `balance` 代表了 `kmemcheck` 的当前状态,换句话说,`balance` 表示 `kmemcheck` 是否已经隐藏了内存页。如果 `data->balance` 大于0, `kmemcheck_hide` 函数会被调用。这意味着 `kmemecheck` 已经设置了内存页的 `present` 标记,但是我们需要再次隐藏内存页以便触发下一次的缺页中断。 `kmemcheck_hide` 函数会清理内存页的 `present` 标记,这表示一次 `kmemcheck` 会话已经完成,新的缺页中断会再次被触发。在第一步,由于 `data->balance` 值为0,所以 `kmemcheck_active` 会返回false,所以 `kmemcheck_hide` 也不会被调用。接下来,我们看 `do_page_fault` 
的下一行代码: ```C if (kmemcheck_fault(regs, address, error_code)) @@ -323,7 +323,7 @@ if (regs->cs != __KERNEL_CS) return false; ``` -如果检测失败,表明这不是`kmemcheck`相关的缺页中断,`kmemcheck_fault`会返回false。如果检测成功,接下来查找发生异常的地址的`页表项`,如果找不到页表项,函数返回false: +如果检测失败,表明这不是 `kmemcheck` 相关的缺页中断,`kmemcheck_fault` 会返回false。如果检测成功,接下来查找发生异常的地址的`页表项`,如果找不到页表项,函数返回false: ```C pte = kmemcheck_pte_lookup(address); @@ -331,27 +331,27 @@ if (!pte) return false; ``` -`kmemcheck_fault`最后一步是调用`kmemcheck_access` 函数,该函数检查对指定内存页的访问,并设置该内存页的present标记。 `kmemcheck_access`函数做了大部分工作,它检查引起缺页异常的当前指令,如果检查到了错误,那么会把该错误的上下文保存到环形队列中: +`kmemcheck_fault` 最后一步是调用 `kmemcheck_access` 函数,该函数检查对指定内存页的访问,并设置该内存页的present标记。 `kmemcheck_access` 函数做了大部分工作,它检查引起缺页异常的当前指令,如果检查到了错误,那么会把该错误的上下文保存到环形队列中: ```C static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE]; ``` -`kmemcheck`声明了一个特殊的 [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-9.html): +`kmemcheck` 声明了一个特殊的 [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/interrupts-9.html) : ```C static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0); ``` -该tasklet被调度执行时,会调用`do_wakeup`函数,该函数位于[arch/x86/mm/kmemcheck/error.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/kmemcheck/error.c)文件中。 +该tasklet被调度执行时,会调用 `do_wakeup` 函数,该函数位于 [arch/x86/mm/kmemcheck/error.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/kmemcheck/error.c) 文件中。 -`do_wakeup`函数调用`kmemcheck_error_recall`函数以便将`kmemcheck`检测到的错误信息输出。 +`do_wakeup` 函数调用 `kmemcheck_error_recall` 函数以便将 `kmemcheck` 检测到的错误信息输出。 ```C kmemcheck_show(regs); ``` -`kmemcheck_fault`函数结束时会调用`kmemcheck_show`函数,该函数会再次设置内存页的present标记。 +`kmemcheck_fault` 函数结束时会调用 `kmemcheck_show` 函数,该函数会再次设置内存页的present标记。 ```C if (unlikely(data->balance != 0)) { @@ -362,7 +362,7 @@ if (unlikely(data->balance != 0)) { } ``` -`kmemcheck_show_all`函数会针对每个地址调用`kmemcheck_show_addr`: +`kmemcheck_show_all` 函数会针对每个地址调用 `kmemcheck_show_addr` : ```C static unsigned int kmemcheck_show_all(void) @@ 
-379,7 +379,7 @@ static unsigned int kmemcheck_show_all(void) } ``` -`kmemcheck_show_addr`函数内容如下: +`kmemcheck_show_addr` 函数内容如下: ```C int kmemcheck_show_addr(unsigned long address) @@ -396,21 +396,21 @@ int kmemcheck_show_addr(unsigned long address) } ``` -在函数 `kmemcheck_show`的结尾处会设置[TF](https://en.wikipedia.org/wiki/Trap_flag) 标记: +在函数 `kmemcheck_show` 的结尾处会设置 [TF](https://en.wikipedia.org/wiki/Trap_flag) 标记: ```C if (!(regs->flags & X86_EFLAGS_TF)) data->flags = regs->flags; ``` -我们之所以这么处理,是因为我们在内存页的缺页中断处理完后需要再次隐藏内存页。当 `TF`标记被设置后,处理器在执行被中断程序的第一条指令时会进入单步模式,这会触发`debug` 异常。从这个地方开始,内存页会被隐藏起来,执行流程继续。由于内存页不可见,那么访问内存页的时候又会触发缺页中断,然后`kmemcheck`就有机会继续检测/收集并显示内存错误信息。 +我们之所以这么处理,是因为我们在内存页的缺页中断处理完后需要再次隐藏内存页。当 `TF` 标记被设置后,处理器在执行被中断程序的第一条指令时会进入单步模式,这会触发 `debug` 异常。从这个地方开始,内存页会被隐藏起来,执行流程继续。由于内存页不可见,那么访问内存页的时候又会触发缺页中断,然后`kmemcheck` 就有机会继续检测/收集并显示内存错误信息。 -到这里`kmemcheck`的工作机制就介绍完毕了。 +到这里 `kmemcheck` 的工作机制就介绍完毕了。 结束语 -------------------------------------------------------------------------------- -Linux内核[内存管理](https://en.wikipedia.org/wiki/Memory_management)第三节介绍到此为止。如果你有任何疑问或者建议,你可以直接给我[0xAX](https://twitter.com/0xAX)发消息, 发[邮件](anotherworldofworld@gmail.com),或者创建一个[issue](https://github.com/0xAX/linux-insides/issues/new)。 在接下来的小节中,我们来看一下另一个内存调试工具 - `kmemleak`。 +Linux内核[内存管理](https://en.wikipedia.org/wiki/Memory_management)第三节介绍到此为止。如果你有任何疑问或者建议,你可以直接给我[0xAX](https://twitter.com/0xAX)发消息, 发[邮件](anotherworldofworld@gmail.com),或者创建一个 [issue](https://github.com/0xAX/linux-insides/issues/new) 。 在接下来的小节中,我们来看一下另一个内存调试工具 - `kmemleak` 。 **英文不是我的母语。如果你发现我的英文描述有任何问题,请提交一个PR到 [linux-insides](https://github.com/0xAX/linux-insides).** From 8eb7a441eabee0a524e9adc639a0dba6ba9b6f70 Mon Sep 17 00:00:00 2001 From: woodpenker Date: Tue, 18 Jul 2017 21:46:35 +0800 Subject: [PATCH 19/21] fix:according to the Review --- KernelStructures/idt.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/KernelStructures/idt.md b/KernelStructures/idt.md index 7e0683e..1f6edf7 100644 
--- a/KernelStructures/idt.md +++ b/KernelStructures/idt.md @@ -1,4 +1,4 @@ - 中断描述符 (IDT) + 中断描述符表 (IDT) ================================================================================ 三个常见的中断和异常来源: @@ -13,11 +13,11 @@ * 陷阱 - 在指令导致异常`之后`会被准确地报告。`%rip`保存的指针同样指向故障的指令; * 终止 - 是不明确的异常。 因为它们不能被明确,中止通常不允许程序可靠地再次启动。 -只有当RFLAGS.IF = 1时,`可屏蔽`中断触发才中断处理程序。 除非RFLAGS.IF位清零,否则它们将持续处于等待处理状态。 +只有当RFLAGS.IF = 1时,`可屏蔽`中断才触发中断处理程序。 除非RFLAGS.IF位清零,否则它们将持续处于等待处理状态。 -`不可屏蔽`中断(NMI)不受rFLAGS.IF位的影响。 无论怎样一个NMI的发生都会进一步屏蔽之后的其他NMI,直到执行IRET(中断返回)指令。 +`不可屏蔽`中断(NMI)不受RFLAGS.IF位的影响。 无论怎样一个NMI的发生都会进一步屏蔽之后的其他NMI,直到执行IRET(中断返回)指令。 -具体的异常和中断来源被分配了固定的向量标识号(也称“中断向量”或简称“向量”)。中断处理程序使用中断向量来定位异常或中断,从而分配相应的系统软件服务处理程序。有至多256个特殊的中断向量可用。前32个是保留的,用于预定义的异常和中断条件。请参考[arch / x86 / include / asm / traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121)头文件中对他们的定义: +具体的异常和中断来源被分配了固定的向量标识号(也称“中断向量”或简称“向量”)。中断处理程序使用中断向量来定位异常或中断,从而分配相应的系统软件服务处理程序。有至多256个特殊的中断向量可用。前32个是保留的,用于预定义的异常和中断条件。请参考[arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121)头文件中对他们的定义: ``` @@ -139,7 +139,7 @@ IDT可以包含三种门描述符中的任何一种: * `IST` - 中断堆栈表; * `TYPE` - 本地描述符表(LDT)段描述符,任务状态段(TSS)描述符,调用门描述符,中断门描述符,陷阱门描述符或任务门描述符之一。 -`IDT` 描述符在Linux内核中由以下结构表示(仅适用于`x86_64`): +`IDT` 描述符在 Linux 内核中由以下结构表示(仅适用于`x86_64`): ```C struct gate_struct64 { @@ -170,9 +170,9 @@ struct ldttss_desc64 { 任务切换期间的异常(Exceptions During a Task Switch) -------------------------------------------------------------------------------- -任务切换在加载段选择器期间可能会发生异常。页错误也可能会在访问TSS时出现。在这些情况下,由硬件任务切换机构完成从TSS加载新的任务状态,然后触发适当的异常处理。 +任务切换在加载段选择子期间可能会发生异常。页错误也可能会在访问TSS时出现。在这些情况下,由硬件任务切换机制完成从TSS加载新的任务状态,然后触发适当的异常处理机制。 -**在长模式下,由于硬件任务切换机构被禁用,因而在任务切换期间不会发生异常。** +**在长模式下,由于硬件任务切换机制被禁用,因而在任务切换期间不会发生异常。** 不可屏蔽中断(Nonmaskable interrupt) -------------------------------------------------------------------------------- From bd84fd9a8d2812d7ad504c3f73064438835162cf Mon Sep 17 00:00:00 2001 From: woodpenker Date: Tue, 18 Jul 2017 21:54:14 +0800 Subject: 
[PATCH 20/21] =?UTF-8?q?fix:change=20selector=20to=20=E5=AD=90?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- KernelStructures/idt.md | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/KernelStructures/idt.md b/KernelStructures/idt.md index 1f6edf7..f39de0d 100644 --- a/KernelStructures/idt.md +++ b/KernelStructures/idt.md @@ -1,4 +1,4 @@ - 中断描述符表 (IDT) + 中断描述符 (IDT) ================================================================================ 三个常见的中断和异常来源: @@ -13,11 +13,11 @@ * 陷阱 - 在指令导致异常`之后`会被准确地报告。`%rip`保存的指针同样指向故障的指令; * 终止 - 是不明确的异常。 因为它们不能被明确,中止通常不允许程序可靠地再次启动。 -只有当RFLAGS.IF = 1时,`可屏蔽`中断才触发中断处理程序。 除非RFLAGS.IF位清零,否则它们将持续处于等待处理状态。 +只有当RFLAGS.IF = 1时,`可屏蔽`中断触发才中断处理程序。 除非RFLAGS.IF位清零,否则它们将持续处于等待处理状态。 -`不可屏蔽`中断(NMI)不受RFLAGS.IF位的影响。 无论怎样一个NMI的发生都会进一步屏蔽之后的其他NMI,直到执行IRET(中断返回)指令。 +`不可屏蔽`中断(NMI)不受rFLAGS.IF位的影响。 无论怎样一个NMI的发生都会进一步屏蔽之后的其他NMI,直到执行IRET(中断返回)指令。 -具体的异常和中断来源被分配了固定的向量标识号(也称“中断向量”或简称“向量”)。中断处理程序使用中断向量来定位异常或中断,从而分配相应的系统软件服务处理程序。有至多256个特殊的中断向量可用。前32个是保留的,用于预定义的异常和中断条件。请参考[arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121)头文件中对他们的定义: +具体的异常和中断来源被分配了固定的向量标识号(也称“中断向量”或简称“向量”)。中断处理程序使用中断向量来定位异常或中断,从而分配相应的系统软件服务处理程序。有至多256个特殊的中断向量可用。前32个是保留的,用于预定义的异常和中断条件。请参考[arch / x86 / include / asm / traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121)头文件中对他们的定义: ``` @@ -55,7 +55,7 @@ enum { * 多数异常错误报告格式; * 页错误格式。 -选择器错误代码的格式如下: +选择子错误代码的格式如下: ``` 31 16 15 3 2 1 0 @@ -69,9 +69,9 @@ enum { 说明如下: * `EXT` - 如果该位设置为1,则异常源在处理器外部。 如果设置为0,则异常源位于处理器的内部; -* `IDT` - 如果该位设置为1,则错误代码选择器索引字段引用位于“中断描述符表”中的门描述符。 如果设置为0,则选择器索引字段引用“全局描述符表”或本地描述符表“LDT”中的描述符,由“TI”位所指示; -* `TI` - 如果该位设置为1,则错误代码选择器索引字段引用“LDT”中的描述符。 如果清除为0,则选择器索引字段引用“GDT”中的描述符; -* `Selector Index` - 选择器索引字段指定索引为“GDT‘,“LDT”或“IDT”,它是由“IDT”和“TI”位指定的。 +* `IDT` - 如果该位设置为1,则错误代码选择子索引字段引用位于“中断描述符表”中的门描述符。 如果设置为0,则选择子索引字段引用“全局描述符表”或本地描述符表“LDT”中的描述符,由“TI”位所指示; +* `TI` 
- 如果该位设置为1,则错误代码选择子索引字段引用“LDT”中的描述符。 如果清除为0,则选择子索引字段引用“GDT”中的描述符; +* `Selector Index` - 选择子索引字段指定索引为“GDT‘,“LDT”或“IDT”,它是由“IDT”和“TI”位指定的。 页错误代码格式如下: @@ -97,9 +97,9 @@ enum { IDT可以包含三种门描述符中的任何一种: -* `Task Gate(任务门)` - 包含用于异常与或中断处理程序任务的TSS的段选择器; -* `Interrupt Gate(中断门)` - 包含处理器用于将程序从执行转移到中断处理程序的段选择器和偏移量; -* `Trap Gate(陷阱门)` - 包含处理器用于将程序从执行转移到异常处理程序的段选择器和偏移量。 +* `Task Gate(任务门)` - 包含用于异常与或中断处理程序任务的TSS的段选择子; +* `Interrupt Gate(中断门)` - 包含处理器用于将程序从执行转移到中断处理程序的段选择子和偏移量; +* `Trap Gate(陷阱门)` - 包含处理器用于将程序从执行转移到异常处理程序的段选择子和偏移量。 门的一般格式是: @@ -132,14 +132,14 @@ IDT可以包含三种门描述符中的任何一种: 说明如下: -* `Selector` - 目标代码段的段选择器; +* `Selector` - 目标代码段的段选择子; * `Offset` - 处理程序入口点的偏移量; * `DPL` - 描述符权限级别; * `P` - 当前段标志; * `IST` - 中断堆栈表; * `TYPE` - 本地描述符表(LDT)段描述符,任务状态段(TSS)描述符,调用门描述符,中断门描述符,陷阱门描述符或任务门描述符之一。 -`IDT` 描述符在 Linux 内核中由以下结构表示(仅适用于`x86_64`): +`IDT` 描述符在Linux内核中由以下结构表示(仅适用于`x86_64`): ```C struct gate_struct64 { @@ -170,9 +170,9 @@ struct ldttss_desc64 { 任务切换期间的异常(Exceptions During a Task Switch) -------------------------------------------------------------------------------- -任务切换在加载段选择子期间可能会发生异常。页错误也可能会在访问TSS时出现。在这些情况下,由硬件任务切换机制完成从TSS加载新的任务状态,然后触发适当的异常处理机制。 +任务切换在加载段选择子期间可能会发生异常。页错误也可能会在访问TSS时出现。在这些情况下,由硬件任务切换机构完成从TSS加载新的任务状态,然后触发适当的异常处理。 -**在长模式下,由于硬件任务切换机制被禁用,因而在任务切换期间不会发生异常。** +**在长模式下,由于硬件任务切换机构被禁用,因而在任务切换期间不会发生异常。** 不可屏蔽中断(Nonmaskable interrupt) -------------------------------------------------------------------------------- From 49eed1847b2ee0f2eb85c12cb75001fe76556fad Mon Sep 17 00:00:00 2001 From: Dongliang Mu Date: Thu, 27 Jul 2017 09:36:47 -0400 Subject: [PATCH 21/21] fix one markdown semantic error --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 8d2a1de..06dc863 100644 --- a/README.md +++ b/README.md @@ -96,7 +96,7 @@ |└ [13.4](https://github.com/MintCN/linux-insides-zh/blob/master/Misc/program_startup.md)|[@mudongliang](https://github.com/mudongliang)|已完成| | 14. 
[KernelStructures](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures)||正在进行| |├ [14.0](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures/README.md)|[@mudongliang](https://github.com/mudongliang)|更新至[3cb550c0](https://github.com/0xAX/linux-insides/commit/3cb550c089c8fc609f667290434e9e98e93fa279)| -|└ [14.1](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures/idt.md)||[@woodpenker](https://github.com/woodpenker)|更新至[4521637d](https://github.com/0xAX/linux-insides/commit/4521637d9cb76e5d4e4dc951758b264a68504927)| +|└ [14.1](https://github.com/MintCN/linux-insides-zh/tree/master/KernelStructures/idt.md)|[@woodpenker](https://github.com/woodpenker)|更新至[4521637d](https://github.com/0xAX/linux-insides/commit/4521637d9cb76e5d4e4dc951758b264a68504927)| ## 翻译认领规则