Linux kernel memory management Part 1.
================================================================================

Introduction
--------------------------------------------------------------------------------

Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the [last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before the call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember how we built early page tables, identity page tables and fixmap page tables at boot time. No complicated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter provides an overview of the different parts of the linux kernel memory management framework and its API, starting from `memblock`.

Memblock
--------------------------------------------------------------------------------

Memblock is one of the methods of managing memory regions during the early bootstrap period, while the usual kernel memory allocators are not up and running yet. Previously it was called `Logical Memory Block`, but with a [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu it was renamed to `memblock`. The Linux kernel for the `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part, and now it is time to get acquainted with it more closely and see how it is implemented.

We will start to learn `memblock` from its data structures. Definitions of all the data structures can be found in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file.

The first structure has the same name as this part and it is:

```C
struct memblock {
	bool bottom_up;  /* is bottom up direction? */
	phys_addr_t current_limit;
	struct memblock_type memory;
	struct memblock_type reserved;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	struct memblock_type physmem;
#endif
};
```

This structure contains five fields. The first is `bottom_up`, which allows allocating memory in bottom-up mode when it is `true`. The next field is `current_limit`, which describes the limit size of the memory block. The next three fields describe the types of memory stored in the memblock: reserved, memory and, if the `CONFIG_HAVE_MEMBLOCK_PHYS_MAP` configuration option is enabled, physical memory. Now we see yet another data structure - `memblock_type`. Let's look at its definition:

```C
struct memblock_type {
	unsigned long cnt;	/* number of regions */
	unsigned long max;	/* size of the allocated array */
	phys_addr_t total_size;	/* size of all regions */
	struct memblock_region *regions;
};
```

This structure provides information about a memory type. It contains fields which describe the number of memory regions inside the current memory block, the size of all memory regions, the size of the allocated array of memory regions and a pointer to the array of `memblock_region` structures. `memblock_region` is a structure which describes a memory region. Its definition is:

```C
struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
	int nid;
#endif
};
```

`memblock_region` provides the base address and size of the memory region, and flags which can be:

```C
#define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE	0
#define MEMBLOCK_HOTPLUG	0x1
```

Also `memblock_region` provides an integer field - the [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access) node selector - if the `CONFIG_HAVE_MEMBLOCK_NODE_MAP` configuration option is enabled.

Schematically we can imagine it as:

```
+---------------------------+   +---------------------------+
|         memblock          |   |                           |
|  _______________________  |   |                           |
| |        memory         | |   |       Array of the        |
| |      memblock_type    |-|-->|      memblock_region      |
| |_______________________| |   |                           |
|                           |   +---------------------------+
|  _______________________  |   +---------------------------+
| |       reserved        | |   |                           |
| |      memblock_type    |-|-->|       Array of the        |
| |_______________________| |   |      memblock_region      |
|                           |   |                           |
+---------------------------+   +---------------------------+
```

These three structures: `memblock`, `memblock_type` and `memblock_region` are the main parts of `Memblock`. Now that we know about them, we can look at the Memblock initialization process.

Memblock initialization
--------------------------------------------------------------------------------

While all the API of `memblock` is described in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file, the implementation of these functions is in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) source code file. Let's look at the top of the source code file, where we will see the initialization of the `memblock` structure:

```C
struct memblock memblock __initdata_memblock = {
	.memory.regions		= memblock_memory_init_regions,
	.memory.cnt		= 1,
	.memory.max		= INIT_MEMBLOCK_REGIONS,

	.reserved.regions	= memblock_reserved_init_regions,
	.reserved.cnt		= 1,
	.reserved.max		= INIT_MEMBLOCK_REGIONS,

#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	.physmem.regions	= memblock_physmem_init_regions,
	.physmem.cnt		= 1,
	.physmem.max		= INIT_PHYSMEM_REGIONS,
#endif

	.bottom_up		= false,
	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
};
```

Here we can see the initialization of a variable which has the same name as the structure - `memblock`. First of all note the `__initdata_memblock`. The definition of this macro looks like:

```C
#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
    #define __init_memblock __meminit
    #define __initdata_memblock __meminitdata
#else
    #define __init_memblock
    #define __initdata_memblock
#endif
```

You can note that it depends on `CONFIG_ARCH_DISCARD_MEMBLOCK`. If this configuration option is enabled, the memblock code will be put into the `.init` section and will be released after the kernel is booted up.

Next we can see the initialization of the `memblock_type memory`, `memblock_type reserved` and `memblock_type physmem` fields of the `memblock` structure. Here we are interested only in the `memblock_type.regions` initialization process. Note that every `memblock_type` field is initialized with an array of `memblock_region` structures:

```C
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock;
#endif
```

Every array contains 128 memory regions. We can see it in the `INIT_MEMBLOCK_REGIONS` macro definition:

```C
#define INIT_MEMBLOCK_REGIONS   128
```

Note that all the arrays are also defined with the `__initdata_memblock` macro which we already saw in the `memblock` structure initialization (read above if you've forgotten).

The last two fields describe that `bottom_up` allocation is disabled and that the limit of the current Memblock is:

```C
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
```

which is `0xffffffffffffffff`.

At this point the initialization of the `memblock` structure has been finished and we can look at the Memblock API.

Memblock API
--------------------------------------------------------------------------------

Ok, we have finished with the initialization of the `memblock` structure and now we can look at the Memblock API and its implementation. As I said above, the whole implementation of `memblock` is presented in [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c). To understand how `memblock` works and how it is implemented, let's look at its usage first. There are a couple of [places](http://lxr.free-electrons.com/ident?i=memblock) in the linux kernel where memblock is used. For example, let's take the `memblock_x86_fill` function from [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c#L1061). This function goes through the memory map provided by [e820](http://en.wikipedia.org/wiki/E820) and adds the memory regions reserved by the kernel to the `memblock` with the `memblock_add` function. As we meet the `memblock_add` function first, let's start from it.

This function takes the physical base address and the size of a memory region and adds them to the `memblock`. The `memblock_add` function does not do anything special in its body, it just calls the:

```C
memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
```

function. We pass the memory block type - `memory`, the physical base address and size of the memory region, the maximum number of nodes (which is 1 if `CONFIG_NODES_SHIFT` is not set in the configuration file, or `1 << CONFIG_NODES_SHIFT` if it is set) and the flags. The `memblock_add_range` function adds the new memory region to the memory block. It starts by checking the size of the given region and just returns if it is zero. After this, `memblock_add_range` checks for the existence of memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill a new `memblock_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is not empty, we start to add the new memory region to the `memblock` with the given `memblock_type`.

First of all we get the end of the memory region with:

```C
phys_addr_t end = base + memblock_cap_size(base, &size);
```

`memblock_cap_size` adjusts `size` so that `base + size` will not overflow. Its implementation is pretty easy:

```C
static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size)
{
	return *size = min(*size, (phys_addr_t)ULLONG_MAX - base);
}
```

`memblock_cap_size` returns the new size, which is the smallest value between the given size and `ULLONG_MAX - base`.

After that, as we have the end address of the new memory region, `memblock_add_range` checks overlap and merge conditions with the already added memory regions. Insertion of the new memory region into the `memblock` consists of two steps:

* Adding the non-overlapping parts of the new memory area as separate regions;
* Merging of all neighboring regions.

We go through all the already stored memory regions and check for overlap with the new region:

```C
for (i = 0; i < type->cnt; i++) {
	struct memblock_region *rgn = &type->regions[i];
	phys_addr_t rbase = rgn->base;
	phys_addr_t rend = rbase + rgn->size;

	if (rbase >= end)
		break;
	if (rend <= base)
		continue;
	...
}
```

If the new memory region does not overlap the regions which are already stored in the `memblock`, it is simply inserted - this is the first step. Before inserting, we check whether the new region can fit into the memory block, and call `memblock_double_array` if it cannot:

```C
while (type->cnt + nr_new > type->max)
	if (memblock_double_array(type, obase, size) < 0)
		return -ENOMEM;
insert = true;
goto repeat;
```

`memblock_double_array` doubles the size of the given regions array. Then we set `insert` to `true` and go to the `repeat` label. In the second step, starting from the `repeat` label, we go through the same loop and insert the current memory region into the memory block with the `memblock_insert_region` function:

```C
if (base < end) {
	nr_new++;
	if (insert)
		memblock_insert_region(type, i, base,
				       end - base, nid, flags);
}
```

As we set `insert` to `true` in the first step, `memblock_insert_region` will now be called. `memblock_insert_region` has almost the same implementation as the one we saw when we inserted a new region into an empty `memblock_type` (see above). This function gets the last memory region:

```C
struct memblock_region *rgn = &type->regions[idx];
```

and copies the memory area with `memmove`:

```C
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
```

After this it fills the `memblock_region` fields of the new memory region (base, size, etc.) and increases the size of the `memblock_type`. At the end of its execution, `memblock_add_range` calls `memblock_merge_regions`, which merges neighboring compatible regions in the second step.

In the second case the new memory region can overlap the already stored regions. For example, we already have `region1` in the `memblock`:

```
0                    0x1000
+-----------------------+
|                       |
|                       |
|        region1        |
|                       |
|                       |
+-----------------------+
```

And now we want to add `region2` to the `memblock` with the following base address and size:

```
0x100                 0x2000
+-----------------------+
|                       |
|                       |
|        region2        |
|                       |
|                       |
+-----------------------+
```

In this case, the base address of the new memory region is set to the end address of the overlapped region with:

```C
base = min(rend, end);
```

So it will be `0x1000` in our case. And we insert it as we already did in the second step, with:

```C
if (base < end) {
	nr_new++;
	if (insert)
		memblock_insert_region(type, i, base,
				       end - base, nid, flags);
}
```

In this case we insert the `overlapping portion` (we insert only the higher portion, because the lower portion is already in the overlapped memory region), then the remaining portion, and merge these portions with `memblock_merge_regions`. As I said above, the `memblock_merge_regions` function merges neighboring compatible regions. It goes through all the memory regions of the given `memblock_type`, takes two neighboring memory regions - `type->regions[i]` and `type->regions[i + 1]` - and checks whether these regions have the same flags, belong to the same node and whether the end address of the first region coincides with the base address of the second region:

```C
while (i < type->cnt - 1) {
	struct memblock_region *this = &type->regions[i];
	struct memblock_region *next = &type->regions[i + 1];

	if (this->base + this->size != next->base ||
	    memblock_get_region_node(this) !=
	    memblock_get_region_node(next) ||
	    this->flags != next->flags) {
		BUG_ON(this->base + this->size > next->base);
		i++;
		continue;
	}
	...
}
```

If all of these conditions are met, we update the size of the first region with the size of the next region:

```C
this->size += next->size;
```

As we updated the size of the first memory region with the size of the next memory region, we move all the memory regions which are after the (`next`) memory region one index backward with the `memmove` function:

```C
memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));
```

And decrease the count of the memory regions which belong to the `memblock_type`:

```C
type->cnt--;
```

After this we get the two memory regions merged into one:

```
0                                             0x2000
+------------------------------------------------+
|                                                |
|                                                |
|                   region1                      |
|                                                |
|                                                |
+------------------------------------------------+
```

That's all. This is the whole principle of how the `memblock_add_range` function works.

There is also a `memblock_reserve` function which does the same as `memblock_add`, but with one difference: it stores the new memory region in `memblock_type.reserved` instead of `memblock_type.memory`.

Of course this is not the full API. Memblock provides APIs not only for adding `memory` and `reserved` memory regions, but also:

* memblock_remove - removes memory region from memblock;
* memblock_find_in_range - finds free area in given range;
* memblock_free - releases memory region in memblock;
* for_each_mem_range - iterates through memblock areas.

and many more....

Getting info about memory regions
--------------------------------------------------------------------------------

Memblock also provides an API for getting information about the allocated memory regions in the `memblock`. It is split in two parts:

* get_allocated_memblock_memory_regions_info - getting info about memory regions;
* get_allocated_memblock_reserved_regions_info - getting info about reserved regions.

The implementation of these functions is easy. Let's look at `get_allocated_memblock_reserved_regions_info` for example:

```C
phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info(
					phys_addr_t *addr)
{
	if (memblock.reserved.regions == memblock_reserved_init_regions)
		return 0;

	*addr = __pa(memblock.reserved.regions);

	return PAGE_ALIGN(sizeof(struct memblock_region) *
			  memblock.reserved.max);
}
```

First of all this function checks that `memblock` contains reserved memory regions. If `memblock` does not contain reserved memory regions we just return zero. Otherwise we write the physical address of the reserved memory regions array to the given address and return the aligned size of the allocated array. Note that the `PAGE_ALIGN` macro is used for the alignment. Actually it depends on the size of a page:

```C
#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
```

The implementation of the `get_allocated_memblock_memory_regions_info` function is the same. It has only one difference: `memblock_type.memory` is used instead of `memblock_type.reserved`.

Memblock debugging
--------------------------------------------------------------------------------

There are many calls to `memblock_dbg` in the memblock implementation. If you pass the `memblock=debug` option to the kernel command line, these calls will print debug information. Actually `memblock_dbg` is just a macro which expands to `printk`:

```C
#define memblock_dbg(fmt, ...) \
        if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
```

For example, you can see a call of this macro in the `memblock_reserve` function:

```C
memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
	     (unsigned long long)base,
	     (unsigned long long)base + size - 1,
	     flags, (void *)_RET_IP_);
```

And you will see something like this:

![Memblock](http://oi57.tinypic.com/1zoj589.jpg)

Memblock also has support in [debugfs](http://en.wikipedia.org/wiki/Debugfs). If you run a kernel on an architecture other than `X86` you can access:

* /sys/kernel/debug/memblock/memory
* /sys/kernel/debug/memblock/reserved
* /sys/kernel/debug/memblock/physmem

for getting a dump of the `memblock` contents.

Conclusion
--------------------------------------------------------------------------------

This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [e820](http://en.wikipedia.org/wiki/E820)
* [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
* [debugfs](http://en.wikipedia.org/wiki/Debugfs)
* [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)
Linux kernel memory management Part 2.
================================================================================

Fix-Mapped Addresses and ioremap
--------------------------------------------------------------------------------

`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be the linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You may remember that in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we already set up the `level2_fixmap_pgt`:

```assembly
NEXT_PAGE(level2_fixmap_pgt)
	.fill	506,8,0
	.quad	level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
	.fill	5,8,0

NEXT_PAGE(level1_fixmap_pgt)
	.fill	512,8,0
```

As you can see, `level2_fixmap_pgt` is right after `level2_kernel_pgt`, which maps the kernel code+data+bss. Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses` enum from [arch/x86/include/asm/fixmap.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/fixmap.h). For example, it contains an entry for `VSYSCALL_PAGE` - if emulation of the legacy vsyscall page is enabled, `FIX_APIC_BASE` for the local [apic](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller), etc. In virtual memory the fix-mapped area is placed in the modules area:

```
+-----------+-----------------+---------------+------------------+
|           |                 |               |                  |
|kernel text|      kernel     |               |    vsyscalls     |
| mapping   |       text      |    Modules    |    fix-mapped    |
|from phys 0|       data      |               |    addresses     |
|           |                 |               |                  |
+-----------+-----------------+---------------+------------------+
__START_KERNEL_map   __START_KERNEL    MODULES_VADDR            0xffffffffffffffff
```

The base virtual address and the size of the `fix-mapped` area are presented by the two following macros:

```C
#define FIXADDR_SIZE  (__end_of_permanent_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
```

Here `__end_of_permanent_fixed_addresses` is an element of the `fixed_addresses` enum and, as I wrote above, every fix-mapped address is represented by an integer index which is defined in `fixed_addresses`. `PAGE_SHIFT` determines the size of a page. For example, we can get the size of one page with `1 << PAGE_SHIFT`. In our case we need to get the size of the whole fix-mapped area, not only of one page, and that's why we are using `__end_of_permanent_fixed_addresses` for getting it. In my case it's a little more than `536` kilobytes. In your case it might be a different number, because the size depends on the number of fix-mapped addresses, which in turn depends on your kernel's configuration.

The second macro, `FIXADDR_START`, just subtracts the fix-mapped area size from the last address of the fix-mapped area to get its base virtual address. `FIXADDR_TOP` is a rounded up address from the base address of the [vsyscall](https://lwn.net/Articles/446528/) space:

```C
#define FIXADDR_TOP (round_up(VSYSCALL_ADDR + PAGE_SIZE, 1<<PMD_SHIFT) - PAGE_SIZE)
```
|
||||
|
||||
The `fixed_addresses` enums are used as an index to get the virtual address with the `fix_to_virt` function. The implementation of this function is simple:

```C
static __always_inline unsigned long fix_to_virt(const unsigned int idx)
{
	BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
	return __fix_to_virt(idx);
}
```

|
||||
first of all it checks that the index given for the `fixed_addresses` enum is not greater or equal than `__end_of_fixed_addresses` with the `BUILD_BUG_ON` macro and then returns the result of the `__fix_to_virt` macro:
|
||||
首先它调用 `BUILD_BUG_ON` 宏检查了给定的 `fixed_addresses` 枚举量不大于等于 `__end_of_fixed_addresses`,然后返回了 `__fix_to_virt` 宏的运算结果:
|
||||
|
||||
```C
|
||||
#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))
|
||||
```
|
||||
|
||||
Here we shift the given fix-mapped address index left by `PAGE_SHIFT`, which determines the size of a page as I wrote above, and subtract the result from `FIXADDR_TOP`, which is the highest address of the fix-mapped area. There is an inverse function for getting the fix-mapped index back from a virtual address:

```C
static inline unsigned long virt_to_fix(const unsigned long vaddr)
{
	BUG_ON(vaddr >= FIXADDR_TOP || vaddr < FIXADDR_START);
	return __virt_to_fix(vaddr);
}
```

`virt_to_fix` takes a virtual address, checks that this address is between `FIXADDR_START` and `FIXADDR_TOP`, and calls the `__virt_to_fix` macro, which is implemented as:

```C
#define __virt_to_fix(x)	((FIXADDR_TOP - ((x)&PAGE_MASK)) >> PAGE_SHIFT)
```

A PFN is simply an index within physical memory, counted in page-sized units. The PFN for a physical address can be trivially defined as `(page_phys_addr >> PAGE_SHIFT)`.

`__virt_to_fix` clears the first 12 bits of the given address, subtracts the result from the last address of the fix-mapped area (`FIXADDR_TOP`) and shifts the difference right by `PAGE_SHIFT`, which is `12`. Let me explain how it works. As I already wrote, `x & PAGE_MASK` clears the first 12 bits of the given address, because those bits represent the offset within a page frame. After we subtract the page-aligned address from `FIXADDR_TOP` and shift the result right by `PAGE_SHIFT`, we get the number of pages between the given address and the top of the fix-mapped area, which is exactly the index in `fixed_addresses`. Fix-mapped addresses are used in many [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel: the `IDT` descriptor is stored there, the [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID is stored in the fix-mapped area starting from the `FIX_TBOOT_BASE` index, the [Xen](http://en.wikipedia.org/wiki/Xen) bootmap, and many more. We already saw a little about fix-mapped addresses in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization: the fix-mapped area is used in the early `ioremap` initialization. Let's look at `ioremap` and try to understand what it is, how it is implemented in the kernel, and how it is related to fix-mapped addresses.

ioremap
--------------------------------------------------------------------------------

The Linux kernel provides many different primitives to manage memory. For the moment we will touch on `I/O memory`. Every device is controlled by reading from and writing to its registers. For example, a driver can turn a device off or on by writing to its registers, or get the state of a device by reading from them. Besides registers, many devices have buffers where a driver can write data or from which it can read. As we know, for now there are two ways to access a device's registers and data buffers:

* through the I/O ports;
* mapping all of the registers to the memory address space.

In the first case, every control register of a device has an input and output port number, and the driver of a device can read from a port and write to it with the two `in` and `out` instructions which we have already seen. If you want to know about the currently registered port regions, you can find them in `/proc/ioports`:

```
$ cat /proc/ioports
0000-0cf7 : PCI Bus 0000:00
...
```

`/proc/ioports` shows which driver claimed each `I/O` port region. All of these regions, for example `0000-0cf7`, were claimed with the `request_region` function from [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h). Actually `request_region` is a macro which is defined as:

```C
#define request_region(start,n,name)	__request_region(&ioport_resource, (start), (n), (name), 0)
```

As we can see it takes three parameters:

* `start` - begin of the region;
* `n` - length of the region;
* `name` - name of the requester.

`request_region` allocates an `I/O` port region. Very often the `check_region` function is called before `request_region` to check that the given address range is available, and `release_region` is called to release the region. `request_region` returns a pointer to a `resource` structure, which presents an abstraction for a tree-like subset of system resources. We already saw the `resource` structure in the fifth part about the kernel [initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) process and it looks like:

```C
struct resource {
	resource_size_t start;
	resource_size_t end;
	const char *name;
	unsigned long flags;
	struct resource *parent, *sibling, *child;
};
```

It contains the start and end addresses of the resource, its name, and so on. Every `resource` structure contains pointers to the `parent`, `sibling` and `child` resources. As it has a parent and children, every subset of resources has a root `resource` structure. For example, for the `I/O` ports it is the `ioport_resource` structure:

```C
struct resource ioport_resource = {
	.name	= "PCI IO",
	.start	= 0,
	.end	= IO_SPACE_LIMIT,
	.flags	= IORESOURCE_IO,
};
EXPORT_SYMBOL(ioport_resource);
```

Or for `iomem`, it is the `iomem_resource` structure:

```C
struct resource iomem_resource = {
	.name	= "PCI mem",
	.start	= 0,
	.end	= -1,
	.flags	= IORESOURCE_MEM,
};
EXPORT_SYMBOL(iomem_resource);
```

As I wrote above, `request_region` is used for registering an I/O port region and this macro is used in many [places](http://lxr.free-electrons.com/ident?i=request_region) in the kernel. For example, let's look at [drivers/char/rtc.c](https://github.com/torvalds/linux/blob/master/drivers/char/rtc.c). This source code file provides the [Real Time Clock](http://en.wikipedia.org/wiki/Real-time_clock) interface in the linux kernel. Like every kernel module, the `rtc` module contains a `module_init` definition:

```C
module_init(rtc_init);
```

where `rtc_init` is the `rtc` initialization function. This function is defined in the same `rtc.c` source code file. In the `rtc_init` function we can see a couple of calls to the `rtc_request_region` function, which wraps `request_region`, for example:

```C
r = rtc_request_region(RTC_IO_EXTENT);
```

where `rtc_request_region` calls:

```C
r = request_region(RTC_PORT(0), size, "rtc");
```

Here `RTC_IO_EXTENT` is the size of the memory region, which is `0x8`, `"rtc"` is the name of the region and `RTC_PORT` is:

```C
#define RTC_PORT(x)	(0x70 + (x))
```

So with `request_region(RTC_PORT(0), size, "rtc")` we register a memory region that starts at `0x70` and has a size of `0x8`. Let's look at `/proc/ioports`:

```
~$ sudo cat /proc/ioports | grep rtc
0070-0077 : rtc0
```

So, we got it! Ok, that was the ports. The second way is the use of `I/O` memory. As I wrote above, this way is the mapping of control registers and memory of a device to the memory address space. `I/O` memory is a set of contiguous addresses which are provided by a device to the CPU through a bus. None of these memory-mapped I/O addresses are used by the kernel directly. There is a special `ioremap` function which allows us to convert the physical address on a bus to a kernel virtual address, or in other words, `ioremap` maps an I/O physical memory region to make it accessible from the kernel. The `ioremap` function takes two parameters:

* start of the memory region;
* size of the memory region.

The I/O memory mapping API provides functions for checking, requesting and releasing a memory region, just as the I/O ports API does. There are three functions for it:

* `request_mem_region`
* `release_mem_region`
* `check_mem_region`

The claimed regions are visible in `/proc/iomem`:

```
$ cat /proc/iomem
...
e0000000-feafffff : PCI Bus 0000:00
...
```

Part of these addresses comes from the call of the `e820_reserve_resources` function. We can find the call of this function in [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and the function itself is defined in [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c). `e820_reserve_resources` goes through the [e820](http://en.wikipedia.org/wiki/E820) map and inserts memory regions into the root `iomem` resource structure. All `e820` memory regions which are inserted into the `iomem` resource have the following types:

```C
static inline const char *e820_type_to_string(int e820_type)
{
	switch (e820_type) {
	case E820_RESERVED_KERN:
	case E820_RAM:	return "System RAM";
	case E820_ACPI:	return "ACPI Tables";
	case E820_NVS:	return "ACPI Non-volatile Storage";
	case E820_UNUSABLE:	return "Unusable memory";
	default:	return "reserved";
	}
}
```

and we can see them in `/proc/iomem` (as shown above).

Now let's try to understand how `ioremap` works. We already know a little about `ioremap`; we saw it in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read that part, you may remember the call of the `early_ioremap_init` function from [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). Initialization of `ioremap` is split into two parts: there is the early part, which we can use before the normal `ioremap` is available, and the normal `ioremap`, which is available after `vmalloc` initialization and the call of `paging_init`. We do not know anything about `vmalloc` for now, so let's consider the early initialization of `ioremap`. First of all, `early_ioremap_init` checks that `fixmap` is aligned on a page middle directory boundary:

```C
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
```

You can read more about `BUILD_BUG_ON` in the first part about [Linux Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html); the `BUILD_BUG_ON` macro raises a compilation error if the given expression is true. In the next step after this check, we can see the call of the `early_ioremap_setup` function from [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c), which performs the generic initialization of `ioremap`. The `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps come after `__end_of_permanent_fixed_addresses` in memory; they start from `FIX_BTMAP_BEGIN` (top) and end with `FIX_BTMAP_END` (down). Actually there are `512` temporary boot-time mappings used by the early `ioremap`:

```
#define NR_FIX_BTMAPS		64
#define FIX_BTMAPS_SLOTS	8
#define TOTAL_FIX_BTMAPS	(NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)
```

and `early_ioremap_setup`:

```C
void __init early_ioremap_setup(void)
{
	int i;

	for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
		if (WARN_ON(prev_map[i]))
			break;

	for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
		slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
}
```

The `slot_virt` and the other arrays are defined in the same source code file:

```C
static void __iomem *prev_map[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long prev_size[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long slot_virt[FIX_BTMAPS_SLOTS] __initdata;
```

`slot_virt` contains the virtual addresses of the fix-mapped areas and the `prev_map` array contains the addresses of the early ioremap areas. Note what I wrote above: `Actually there are 512 temporary boot-time mappings, used by early ioremap`, and you can see that all the arrays are defined with the `__initdata` attribute, which means that this memory will be released after the kernel initialization process. After `early_ioremap_setup` has finished its work, we get the page middle directory where the early ioremap begins with the `early_ioremap_pmd` function, which just gets the base address of the page global directory and calculates the page middle directory for the given address:

```C
static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
{
	pgd_t *base = __va(read_cr3());
	pgd_t *pgd = &base[pgd_index(addr)];
	pud_t *pud = pud_offset(pgd, addr);
	pmd_t *pmd = pmd_offset(pud, addr);

	return pmd;
}
```

After this we fill `bm_pte` (the early ioremap page table entries) with zeros and call the `pmd_populate_kernel` function:

```C
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);
```

`pmd_populate_kernel` takes three parameters:

* `init_mm` - memory descriptor of the `init` process (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html));
* `pmd` - page middle directory of the beginning of the `ioremap` fixmaps;
* `bm_pte` - early `ioremap` page table entries array, which is defined as:

```C
static pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)] __page_aligned_bss;
```

The `pmd_populate_kernel` function is defined in [arch/x86/include/asm/pgalloc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgalloc.h) and populates the given page middle directory (`pmd`) with the given page table entries (`bm_pte`):

```C
static inline void pmd_populate_kernel(struct mm_struct *mm,
				       pmd_t *pmd, pte_t *pte)
{
	paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
	set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
}
```

where `set_pmd` is:

```C
#define set_pmd(pmdp, pmd)	native_set_pmd(pmdp, pmd)
```

and `native_set_pmd` is:

```C
static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
	*pmdp = pmd;
}
```

That's all. The early `ioremap` is ready to use. There are a couple of checks in the `early_ioremap_init` function, but they are not so important; anyway, the initialization of `ioremap` is finished.

Use of early ioremap
--------------------------------------------------------------------------------

As the early `ioremap` is set up, we can use it. It provides two functions:

* early_ioremap
* early_iounmap

for mapping/unmapping an I/O physical address to/from a virtual address. Both functions depend on the `CONFIG_MMU` configuration option. The [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit) is a special block of memory management whose main purpose is the translation of physical addresses to virtual addresses. Technically, the memory management unit knows the high-level page table address (`pgd`) from the `cr3` control register. If the `CONFIG_MMU` option is set to `n`, `early_ioremap` just returns the given physical address and `early_iounmap` does nothing. Otherwise, if `CONFIG_MMU` is set to `y`, `early_ioremap` calls `__early_ioremap`, which takes three parameters:

* `phys_addr` - base physical address of the `I/O` memory region to map onto virtual addresses;
* `size` - size of the `I/O` memory region;
* `prot` - page table entry bits.

First of all, in `__early_ioremap` we go through all the early ioremap fixmap slots, look for the first free one in the `prev_map` array, remember its number in the `slot` variable and set up the size:

```C
slot = -1;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
	if (!prev_map[i]) {
		slot = i;
		break;
	}
}
...
prev_size[slot] = size;
last_addr = phys_addr + size - 1;
```

In the next step we can see the following code:

```C
offset = phys_addr & ~PAGE_MASK;
phys_addr &= PAGE_MASK;
size = PAGE_ALIGN(last_addr + 1) - phys_addr;
```

Here we use `PAGE_MASK` first to extract the offset within the page (the first 12 bits of `phys_addr`) and then to clear those bits so that `phys_addr` becomes page-aligned. The `PAGE_MASK` macro is defined as:

```C
#define PAGE_MASK	(~(PAGE_SIZE-1))
```

We know that the size of a page is 4096 bytes, i.e. `1000000000000` in binary. `PAGE_SIZE - 1` is `111111111111`, so `PAGE_MASK = ~(PAGE_SIZE - 1)` has the low 12 bits cleared and all the higher bits set, while `~PAGE_MASK` is `111111111111` again. So on the first line we keep only the page offset of the address, on the second line we clear the low 12 bits to page-align the address, and on the third line we compute the page-aligned size of the area. Now that we have the aligned area, we need to get the number of pages occupied by the new `ioremap` area and calculate the fix-mapped index from `fixed_addresses` in the next steps:

```C
nrpages = size >> PAGE_SHIFT;
idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot;
```

Now we can fill the fix-mapped area with the given physical addresses. On every iteration of the loop we call the `__early_set_fixmap` function from [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c), increase the given physical address by the page size, which is `4096` bytes, and update the `idx` index and the number of pages:

```C
while (nrpages > 0) {
	__early_set_fixmap(idx, phys_addr, prot);
	phys_addr += PAGE_SIZE;
	--idx;
	--nrpages;
}
```

The `__early_set_fixmap` function gets the page table entry (stored in `bm_pte`, see above) for the given physical address with:

```C
pte = early_ioremap_pte(addr);
```

In the next step, `__early_set_fixmap` checks the given page flags with the `pgprot_val` macro and calls `set_pte` or `pte_clear` depending on the result:

```C
if (pgprot_val(flags))
	set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
	pte_clear(&init_mm, addr, pte);
```

As you can see above, we passed `FIXMAP_PAGE_IO` as the flags to `__early_ioremap`. `FIXMAP_PAGE_IO` expands to the:

```C
(__PAGE_KERNEL_EXEC | _PAGE_NX)
```

flags, so we call the `set_pte` function to set the page table entry, which works in the same manner as `set_pmd`, but for PTEs (read about it above). After we have set all the `PTEs` in the loop, we can see the call of the `__flush_tlb_one` function:

```C
__flush_tlb_one(addr);
```

This function is defined in [arch/x86/include/asm/tlbflush.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/tlbflush.h) and calls `__flush_tlb_single` or `__flush_tlb` depending on the value of `cpu_has_invlpg`:

```C
static inline void __flush_tlb_one(unsigned long addr)
{
	if (cpu_has_invlpg)
		__flush_tlb_single(addr);
	else
		__flush_tlb();
}
```

The `__flush_tlb_one` function invalidates the given address in the [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer). As you just saw, we updated the paging structure, but the `TLB` is not informed of the changes; that's why we need to do it manually. There are two ways to do it. The first is to update the `cr3` control register, and the `__flush_tlb` function does exactly this:

```C
native_write_cr3(native_read_cr3());
```

The second method is to use the `invlpg` instruction to invalidate a single `TLB` entry. Let's look at the `__flush_tlb_one` implementation. As you can see, first of all it checks `cpu_has_invlpg`, which is defined as:

```C
#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg		1
#else
# define cpu_has_invlpg		(boot_cpu_data.x86 > 3)
#endif
```

If a CPU supports the `invlpg` instruction, we call the `__flush_tlb_single` macro, which expands to the call of `__native_flush_tlb_single`:

```C
static inline void __native_flush_tlb_single(unsigned long addr)
{
	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}
```

or we call `__flush_tlb`, which just updates the `cr3` register as we saw above. After this step, the execution of the `__early_set_fixmap` function is finished and we can go back to the `__early_ioremap` implementation. As we have set up the fixmap area for the given address, we need to save the base virtual address of the I/O re-mapped area in `prev_map` using the `slot` index:

```C
prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);
```

and return it.

The second function, `early_iounmap`, unmaps an `I/O` memory region. It takes two parameters, the base address and the size of the `I/O` region, and generally looks very similar to `early_ioremap`. It also goes through the fixmap slots and looks for the slot with the given address. After that, it gets the index of the fixmap slot and calls `__late_clear_fixmap` or `__early_set_fixmap` depending on the `after_paging_init` value. When it calls `__early_set_fixmap`, it does so with one difference from `early_ioremap`: it passes `zero` as the physical address. In the end it sets the address of the I/O memory region to `NULL`:

```C
prev_map[slot] = NULL;
```

That's all about `fixmaps` and `ioremap`. Of course this part does not cover all the features of `ioremap`; it covered only the early ioremap, and there is also the normal ioremap. But we need to know more things before we can get to it.

So, this is the end!

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [apic](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit)
* [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer)
* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
* [Linux kernel memory management Part 1.](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html)