complete2.0

kele1997
2018-03-30 22:45:32 +08:00
parent cf9aecc3b5
commit 8e75874841
2 changed files with 91 additions and 638 deletions


@@ -1,28 +1,29 @@
Kernel initialization. Part 6.
================================================================================
Architecture-specific initialization, again...
================================================================================
In the previous [part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-5.html) we looked at architecture-specific (`x86_64` in our case) initialization code from [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and stopped at the `x86_configure_nx` function, which sets the `_PAGE_NX` flag depending on support for the [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before, the `setup_arch` and `start_kernel` functions are very big, so in this part and the next one we will continue with the architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. It is defined in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and, as its name suggests, it parses the kernel command line and sets up different services depending on the given parameters (all kernel command line parameters are listed in [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)). You may remember how we set up `earlyprintk` in one of the earliest [parts](http://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-2.html). At that early stage we looked up kernel parameters and their values with the `cmdline_find_option` function and the `__cmdline_find_option` and `__cmdline_find_option_bool` helpers from [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cmdline.c). Now we are in the generic part of the kernel, which does not depend on the architecture, and here a different approach is used. If you read the Linux kernel source code, you have probably noticed calls like this:
```C
early_param("gbpages", parse_direct_gbpages_on);
```
The `early_param` macro takes two parameters:
* the command line parameter name;
* the function which will be called if the given parameter is passed.
and is defined as:
```C
#define early_param(str, fn) \
__setup_param(str, fn, fn, 1)
```
in [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h). As you can see, the `early_param` macro just calls the `__setup_param` macro:
```C
#define __setup_param(str, unique_id, fn, early) \
@@ -34,7 +35,9 @@ in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/incl
= { __setup_str_##unique_id, fn, early }
```
This macro defines the `__setup_str_*` variable (where `*` depends on the given function name) and initializes it with the given command line parameter name. In the next line we can see the definition of the `__setup_*` variable, whose type is `obs_kernel_param`, together with its initialization. The `obs_kernel_param` structure is defined as:
```C
struct obs_kernel_param {
@@ -44,13 +47,13 @@ struct obs_kernel_param {
};
```
and contains three fields:
* the name of the kernel parameter;
* the function which sets something up depending on the parameter;
* a flag which determines whether the parameter is early (1) or not (0).
Note that the `__setup_param` macro defines the `__setup_*` structures with the `__section(.init.setup)` attribute. It means that all of them will be placed in the `.init.setup` section and, as we can see in [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h), this section lies between `__setup_start` and `__setup_end`:
```
#define INIT_SETUP(initsetup_align) \
@@ -60,7 +63,7 @@ Note that `__set_param` macro defines with `__section(.init.setup)` attribute. I
VMLINUX_SYMBOL(__setup_end) = .;
```
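The registration scheme described above can be sketched in user space. This is only an illustration under stated assumptions: `param_entry`, `demo_table` and `dispatch_early` are hypothetical names, and an ordinary array stands in for the entries that the linker collects between `__setup_start` and `__setup_end`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space mirror of the three obs_kernel_param fields. */
struct param_entry {
    const char *str;               /* parameter name, e.g. "pci" */
    int (*setup_func)(char *val);  /* handler called with the parameter value */
    int early;                     /* 1 if the parameter is early, 0 otherwise */
};

static int seen_pci;    /* set by the demo handlers below */
static int seen_quiet;

static int demo_pci_setup(char *val)
{
    seen_pci = (strcmp(val, "earlydump") == 0);
    return 0;
}

static int demo_quiet_setup(char *val)
{
    (void)val;
    seen_quiet = 1;
    return 0;
}

/* Stand-in for the table placed in the .init.setup section. */
static const struct param_entry demo_table[] = {
    { "pci",   demo_pci_setup,   1 },
    { "quiet", demo_quiet_setup, 0 },
};

/* Walk the table and call the handler of every early entry named `param`.
 * Returns the number of handlers that were called. */
static int dispatch_early(const char *param, char *val)
{
    int called = 0;

    assert(param != NULL);
    for (size_t i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
        const struct param_entry *p = &demo_table[i];

        if (p->early && strcmp(param, p->str) == 0) {
            p->setup_func(val);
            called++;
        }
    }
    return called;
}
```

Calling `dispatch_early("pci", value)` runs `demo_pci_setup`, while `dispatch_early("quiet", value)` calls nothing because that entry is not marked as early.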
Now that we know how parameters are defined, let's go back to the `parse_early_param` implementation:
```C
void __init parse_early_param(void)
@@ -77,29 +80,31 @@ void __init parse_early_param(void)
done = 1;
}
```
The `parse_early_param` function defines two static variables. The first, `done`, records whether `parse_early_param` has already been called, and the second, `tmp_cmdline`, is temporary storage for the kernel command line. We copy `boot_command_line` to this temporary buffer and call the `parse_early_options` function from the same `main.c` source file. `parse_early_options` calls the `parse_args` function from [kernel/params.c](https://github.com/torvalds/linux/blob/master/), which parses the given command line and calls the `do_early_param` function. This [function](https://github.com/torvalds/linux/blob/master/init/main.c#L413) walks from `__setup_start` to `__setup_end` and calls the `setup_func` of an `obs_kernel_param` entry if the parameter is early. After this, all services which depend on early command line parameters are set up, and the next call after `parse_early_param` is `x86_report_nx`. As I wrote at the beginning of this part, we have already set the `NX` bit with `x86_configure_nx`. The `x86_report_nx` function from [arch/x86/mm/setup_nx.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/setup_nx.c) just prints information about `NX`. Note that `x86_report_nx` is called not right after `x86_configure_nx` but after `parse_early_param`. The reason is simple: the kernel supports the `noexec` parameter, so the report has to wait until that parameter has been parsed:
```
noexec [X86]
On X86-32 available only on PAE configured kernels.
noexec=on: enable non-executable mappings (default)
noexec=off: disable non-executable mappings
```
We can see it at boot time:
![NX](http://oi62.tinypic.com/swwxhy.jpg)
After this we can see the call of the:
```C
memblock_x86_reserve_range_setup_data();
```
function. It is defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source file, remaps memory for the `setup_data` and reserves a memory block for it (you can read more about `setup_data` in the previous [part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-5.html) and about `ioremap` and `memblock` in [Linux kernel memory management](http://xinqiu.gitbooks.io/linux-insides-cn/content/MM/index.html)).
In the next step we can see the following conditional statement:
```C
if (acpi_mps_check()) {
@@ -110,7 +115,7 @@ In the next step we can see following conditional statement:
}
```
The `acpi_mps_check` function from [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) depends on the `CONFIG_X86_LOCAL_APIC` and `CONFIG_X86_MPPARSE` configuration options:
```C
int __init acpi_mps_check(void)
@@ -128,12 +133,13 @@ int __init acpi_mps_check(void)
}
```
It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_X86_MPPARSE` is not set, `acpi_mps_check` prints a warning message when one of the command line options `acpi=off`, `acpi=noirq` or `pci=noacpi` is passed to the kernel. If `acpi_mps_check` returns `1`, we disable the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clear the `X86_FEATURE_APIC` bit for the current CPU with the `setup_clear_cpu_cap` macro (you can read more about CPU masks in [CPU masks](https://xinqiu.gitbooks.io/linux-insides-cn/content/Concepts/linux-cpu-2.html)).
Early PCI dump
--------------------------------------------------------------------------------
In the next step we dump the [PCI](http://en.wikipedia.org/wiki/Conventional_PCI) devices with the following code:
```C
#ifdef CONFIG_PCI
@@ -142,13 +148,13 @@ In the next step we make a dump of the [PCI](http://en.wikipedia.org/wiki/Conven
#endif
```
The `pci_early_dump_regs` variable is defined in [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c) and its value depends on the kernel command line parameter `pci=earlydump`. We can find the definition of this parameter in [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/arch):
```C
early_param("pci", pci_setup);
```
The `pci_setup` function takes the string after `pci=` and analyzes it. It calls `pcibios_setup`, which is defined as `__weak` in [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/arch), and every architecture defines its own version of this function which overrides the `__weak` one. For example, the `x86_64` version is in [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c):
```C
char *__init pcibios_setup(char *str) {
@@ -165,14 +171,14 @@ char *__init pcibios_setup(char *str) {
}
```
So, if the `CONFIG_PCI` option is set and we passed `pci=earlydump` on the kernel command line, the next function to be called is `early_dump_pci_devices` from [arch/x86/pci/early.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/early.c). This function checks the `noearly` PCI parameter with:
```C
if (!early_pci_allowed())
return;
```
and returns if it was passed. Each PCI domain can host up to `256` buses and each bus hosts up to `32` devices. So we go through them in a loop:
```C
for (bus = 0; bus < 256; bus++) {
@@ -186,15 +192,15 @@ for (bus = 0; bus < 256; bus++) {
}
```
and read the `pci` config with the `read_pci_config` function.
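To make the enumeration more concrete, here is a minimal sketch of how a bus, device and function number can be packed into a configuration address. It assumes the conventional PCI configuration mechanism #1 layout (the value written to port `0xCF8`) and 8 functions per device. The helper names `pci_conf1_address` and `pci_count_probes` are hypothetical, and masking the offset to a 4-byte boundary is a choice made for this sketch, not something taken from the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Build a configuration mechanism #1 address:
 * enable bit | bus | device (slot) | function | register offset. */
static uint32_t pci_conf1_address(uint8_t bus, uint8_t slot,
                                  uint8_t func, uint8_t offset)
{
    assert(slot < 32 && func < 8);
    return 0x80000000u |
           ((uint32_t)bus << 16) |
           ((uint32_t)slot << 11) |
           ((uint32_t)func << 8) |
           (offset & 0xFCu);   /* keep the offset aligned to 4 bytes */
}

/* Count how many bus/device/function triples a full scan visits. */
static unsigned int pci_count_probes(void)
{
    unsigned int n = 0;

    for (unsigned int bus = 0; bus < 256; bus++)
        for (unsigned int slot = 0; slot < 32; slot++)
            for (unsigned int func = 0; func < 8; func++)
                n++;
    return n;
}
```

Under these assumptions a full scan probes `256 * 32 * 8` locations, which is why an early dump of this kind is only enabled on request.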
That's all. We will not go deep into the `pci` details here, but we will see more of them in the special `Drivers/PCI` part.
Finish with memory parsing
--------------------------------------------------------------------------------
After `early_dump_pci_devices`, there are a couple of functions related to available memory and the [e820](http://en.wikipedia.org/wiki/E820) map which we collected in the [First steps in the kernel setup](http://xinqiu.gitbooks.io/linux-insides-cn/content/Booting/linux-bootstrap-2.html) part:
```C
/* update the e820_saved too */
e820_reserve_setup_data();
@@ -208,14 +214,14 @@ After the `early_dump_pci_devices`, there are a couple of function related with
early_reserve_e820_mpc_new();
```
Let's look at them. As you can see, the first function is `e820_reserve_setup_data`. It does almost the same as `memblock_x86_reserve_range_setup_data`, which we saw above, but it also calls `e820_update_range`, which adds new regions to the `e820map` with the given type, `E820_RESERVED_KERN` in our case. The next function is `finish_e820_parsing`, which sanitizes the `e820map` with the `sanitize_e820_map` function. Besides these two functions, the listing above contains a couple of other functions related to [e820](http://en.wikipedia.org/wiki/E820). The `e820_add_kernel_range` function takes the physical addresses of the kernel start and end:
```C
u64 start = __pa_symbol(_text);
u64 size = __pa_symbol(_end) - start;
```
and checks that `.text`, `.data` and `.bss` are marked as `E820_RAM` in the `e820map`, printing a warning message if they are not. The next function, `trim_bios_range`, updates the first 4096 bytes in the `e820map` as `E820_RESERVED` and sanitizes the map again with a call to `sanitize_e820_map`. After this we get the last page frame number with a call to the `e820_end_of_ram_pfn` function. Every memory page has a unique number, the `page frame number`, and the `e820_end_of_ram_pfn` function returns the maximum one by calling `e820_end_pfn`:
```C
unsigned long __init e820_end_of_ram_pfn(void)
@@ -224,7 +230,7 @@ unsigned long __init e820_end_of_ram_pfn(void)
}
```
where `e820_end_pfn` takes the maximum page frame number for the given architecture (`MAX_ARCH_PFN` is `0x400000000` for `x86_64`). In `e820_end_pfn` we go through all `e820` slots and check that the `e820` entry has the `E820_RAM` or `E820_PRAM` type, because we calculate page frame numbers only for these types. For each such entry we get the page frame numbers of its base and end addresses and make some checks on them:
```C
for (i = 0; i < e820.nr_map; i++) {
@@ -258,7 +264,7 @@ for (i = 0; i < e820.nr_map; i++) {
return last_pfn;
```
After this we check that the `last_pfn` which we got in the loop is not greater than the maximum page frame number for the given architecture (`x86_64` in our case), print information about the last page frame number and return it. We can see the `last_pfn` in the `dmesg` output:
```
...
@@ -266,7 +272,7 @@ After this we check that `last_pfn` which we got in the loop is not greater that
...
```
Now that we have calculated the biggest page frame number, we calculate `max_low_pfn`, which is the biggest page frame number in `low memory`, that is, below the first `4` gigabytes. If more than 4 gigabytes of RAM are installed, `max_low_pfn` will be the result of the `e820_end_of_low_ram_pfn` function, which does the same as `e820_end_of_ram_pfn` but with a 4 gigabyte limit; otherwise `max_low_pfn` will be the same as `max_pfn`:
```C
if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
@@ -277,19 +283,20 @@ else
high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
```
Next we calculate `high_memory` (which defines the upper bound of the direct map memory) with the `__va` macro, which returns the virtual address for a given physical address.
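The page frame arithmetic from this section can be sketched in user space as follows. This is a simplified illustration, not the kernel code: `demo_region`, `demo_end_pfn` and `demo_max_low_pfn` are hypothetical names, only one RAM type is modelled, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12   /* 4 KiB pages (assumption for this sketch) */
#define DEMO_TYPE_RAM   1    /* hypothetical stand-in for E820_RAM */

struct demo_region {
    uint64_t addr;   /* physical start address */
    uint64_t size;   /* length in bytes */
    int type;        /* region type */
};

/* Return the highest page frame number that ends a RAM region,
 * capped at limit_pfn. */
static uint64_t demo_end_pfn(const struct demo_region *map, size_t n,
                             uint64_t limit_pfn)
{
    uint64_t last_pfn = 0;

    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t start_pfn, end_pfn;

        if (map[i].type != DEMO_TYPE_RAM)
            continue;                      /* only RAM counts */
        start_pfn = map[i].addr >> DEMO_PAGE_SHIFT;
        end_pfn = (map[i].addr + map[i].size) >> DEMO_PAGE_SHIFT;
        if (start_pfn >= limit_pfn)
            continue;                      /* entirely above the limit */
        if (end_pfn > limit_pfn) {
            last_pfn = limit_pfn;          /* region crosses the limit */
            break;
        }
        if (end_pfn > last_pfn)
            last_pfn = end_pfn;
    }
    return last_pfn;
}

/* With more than 4 GiB of RAM, rescan with a 4 GiB limit;
 * otherwise max_low_pfn equals max_pfn. */
static uint64_t demo_max_low_pfn(const struct demo_region *map, size_t n,
                                 uint64_t max_pfn)
{
    const uint64_t four_gb_pfn = 1ULL << (32 - DEMO_PAGE_SHIFT);

    return max_pfn > four_gb_pfn ? demo_end_pfn(map, n, four_gb_pfn)
                                 : max_pfn;
}
```

For a machine with RAM below 2 GiB and another RAM region above 4 GiB, the first call yields the page frame at the end of the upper region and the second call yields the page frame at the end of the lower one.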
DMI scanning
-------------------------------------------------------------------------------
The next step after the manipulations with different memory regions and `e820` slots is collecting information about the computer. We get all of this information through the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface) with the following functions:
```C
dmi_scan_machine();
dmi_memdev_walk();
```
The first is `dmi_scan_machine`, defined in [drivers/firmware/dmi_scan.c](https://github.com/torvalds/linux/blob/master/drivers/firmware/dmi_scan.c). This function goes through the [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) structures and extracts information from them. There are two ways to gain access to the `SMBIOS` table: getting the pointer to it from the [EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) configuration table, or scanning the `0x10000` bytes of physical memory which start at address `0xF0000`. Let's look at the second approach. The `dmi_scan_machine` function remaps this memory with `dmi_early_remap`, which just expands to `early_ioremap`:
```C
void __init dmi_scan_machine(void)
@@ -303,8 +310,7 @@ void __init dmi_scan_machine(void)
if (p == NULL)
goto error;
```
and iterates over all `DMI` header addresses, searching for the `_SM_` string:
```C
memset(buf, 0, 16);
@@ -317,22 +323,21 @@ for (q = p; q < p + 0x10000; q += 16) {
}
memcpy(buf, buf + 16, 16);
}
```
The `_SM_` string must be between `000F0000h` and `0x000FFFFF`. Here we copy 16 bytes to `buf` with `memcpy_fromio`, which is the same as `memcpy`, and execute `dmi_smbios3_present` and `dmi_present` on the buffer. These functions check that the first 4 bytes are the `_SM_` string, get the `SMBIOS` version and read the `_DMI_` attributes such as the `DMI` structure table length, the table address and so on. After one of these functions finishes, you will see its result in the `dmesg` output:
```
[ 0.000000] SMBIOS 2.7 present.
[ 0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014
```
At the end of `dmi_scan_machine`, we unmap the previously remapped memory:
```C
dmi_early_unmap(p, 0x10000);
```
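The anchor search performed by `dmi_scan_machine` can be illustrated with a small user-space sketch. It only looks for the `_SM_` signature on 16-byte boundaries of an ordinary buffer; the real functions also validate the entry point further. The name `find_sm_anchor` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of the first 16-byte-aligned "_SM_" anchor in buf,
 * or -1 if there is none. */
static long find_sm_anchor(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t off = 0; off + 16 <= len; off += 16) {
        if (memcmp(buf + off, "_SM_", 4) == 0)
            return (long)off;
    }
    return -1;
}
```

A signature which does not start on a 16-byte boundary is deliberately ignored, which matches the 16-byte step of the scan loop shown earlier.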
The second function is `dmi_memdev_walk`. As you can guess, it goes over the memory devices. Let's look at it:
```C
void __init dmi_memdev_walk(void)
@@ -348,7 +353,7 @@ void __init dmi_memdev_walk(void)
}
```
It checks that `DMI` is available (we determined this in the previous function, `dmi_scan_machine`, which stored the result in the `dmi_available` variable) and collects information about memory devices with `dmi_walk_early` and `dmi_alloc`, which is defined as:
```
#ifdef CONFIG_DMI
@@ -356,7 +361,7 @@ RESERVE_BRK(dmi_alloc, 65536);
#endif
```
`RESERVE_BRK` is defined in [arch/x86/include/asm/setup.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/setup.h) and reserves space of the given size in the `brk` section.
-------------------------
init_hypervisor_platform();
@@ -364,13 +369,13 @@ RESERVE_BRK(dmi_alloc, 65536);
insert_resource(&iomem_resource, &code_resource);
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);
early_gart_iommu_check();
SMP config
--------------------------------------------------------------------------------
The next step is parsing of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration. We do it with a call to the `find_smp_config` function, which just calls the function:
```C
static inline void find_smp_config(void)
@@ -379,7 +384,7 @@ static inline void find_smp_config(void)
}
```
inside. `x86_init.mpparse.find_smp_config` is the `default_find_smp_config` function from [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). In `default_find_smp_config` we scan a couple of memory regions for the `SMP` config and return if it is found:
```C
if (smp_scan_config(0x0, 0x400) ||
@@ -388,14 +393,14 @@ if (smp_scan_config(0x0, 0x400) ||
return;
```
First of all, the `smp_scan_config` function defines a couple of variables:
```C
unsigned int *bp = phys_to_virt(base);
struct mpf_intel *mpf;
```
The first is the virtual address of the memory region which we will scan for the `SMP` config, and the second is a pointer to the `mpf_intel` structure. Let's try to understand what `mpf_intel` is. All information is stored in the multiprocessor configuration data structure; `mpf_intel` represents this structure and looks like this:
```C
struct mpf_intel {
@@ -412,9 +417,9 @@ struct mpf_intel {
};
```
As we can read in the documentation, one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. The operating system must have access to this information about the multiprocessor configuration, and `mpf_intel` stores the physical address of the multiprocessor configuration table (look at its second field). So `smp_scan_config` goes in a loop through the given memory range and tries to find the `MP floating pointer structure` there. In the loop it checks that the current position holds the `SMP` signature, verifies the checksum, and checks that `mpf->specification` is `1` or `4` (it must be `1` or `4` by the specification):
```C
while (length > 0) {
if ((*bp == SMP_MAGIC_IDENT) &&
(mpf->length == 1) &&
@@ -430,12 +435,13 @@ if ((*bp == SMP_MAGIC_IDENT) &&
}
```
If the search is successful, it reserves the given memory block with `memblock_reserve` and also reserves the physical address of the multiprocessor configuration table. You can find documentation about this in the [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf). You can read more details in the special part about `SMP`.
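The checks listed above can be sketched on a raw 16-byte candidate. The byte offsets used here (length at byte 8, specification revision at byte 9) are an assumption based on the layout in the MultiProcessor Specification, and `demo_mpf_checksum` and `demo_mpf_valid` are hypothetical names rather than the kernel's own helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sum all bytes modulo 256; a valid structure sums to zero. */
static int demo_mpf_checksum(const unsigned char *mp, size_t len)
{
    int sum = 0;

    while (len--)
        sum += *mp++;
    return sum & 0xFF;
}

/* Validate a 16-byte candidate: "_MP_" signature, length field of 1
 * (counted in 16-byte units), zero checksum, revision 1 or 4. */
static int demo_mpf_valid(const unsigned char *p)
{
    assert(p != NULL);
    return memcmp(p, "_MP_", 4) == 0 &&
           p[8] == 1 &&
           demo_mpf_checksum(p, 16) == 0 &&
           (p[9] == 1 || p[9] == 4);
}
```

Because the checksum byte is chosen so that all 16 bytes sum to zero, changing any single field without updating it makes the candidate invalid.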
Additional early memory initialization routines
--------------------------------------------------------------------------------
In the next step of `setup_arch` we can see the call of the `early_alloc_pgt_buf` function, which allocates the page table buffer for the early stage. The page table buffer is placed in the `brk` area. Let's look at its implementation:
```C
void __init early_alloc_pgt_buf(void)
@@ -451,7 +457,7 @@ void __init early_alloc_pgt_buf(void)
}
```
First of all it gets the size of the page table buffer, which is `INIT_PGT_BUF_SIZE`, or `(6 * PAGE_SIZE)` in the current Linux kernel 4.0. Once we have the size of the page table buffer, we call the `extend_brk` function with two parameters: size and align. As you can understand from its name, this function extends the `brk` area. As we can see in the Linux kernel linker script, `brk` is located in memory right after the [BSS](http://en.wikipedia.org/wiki/.bss):
```C
. = ALIGN(PAGE_SIZE);
@@ -463,11 +469,11 @@ First of all it get the size of the page table buffer, it will be `INIT_PGT_BUF_
}
```
Or we can find it with the `readelf` util:
![brk area](http://oi61.tinypic.com/71lkeu.jpg)
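The idea behind extending the `brk` area is a simple bump allocation with alignment. The following user-space sketch shows it under stated assumptions: a static array stands in for the reserved area, alignment is applied to the offset inside that array, and running out of space returns `NULL` instead of stopping the system. The names `demo_brk_area`, `demo_brk_end` and `demo_extend_brk` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BRK_SIZE 65536

static unsigned char demo_brk_area[DEMO_BRK_SIZE]; /* stand-in for the brk area */
static size_t demo_brk_end;                        /* offset of the first free byte */

/* Reserve `size` zeroed bytes at the next offset aligned to `align`. */
static void *demo_extend_brk(size_t size, size_t align)
{
    size_t mask = align - 1;
    size_t start;
    void *ret;

    assert(align != 0 && (align & mask) == 0);     /* power of two only */
    start = (demo_brk_end + mask) & ~mask;         /* round the offset up */
    if (start > DEMO_BRK_SIZE || size > DEMO_BRK_SIZE - start)
        return NULL;                               /* area exhausted */
    ret = demo_brk_area + start;
    demo_brk_end = start + size;
    memset(ret, 0, size);
    return ret;
}
```

Each call only moves the end offset forward, so nothing is ever freed individually; that fits an early boot stage where a full allocator is not available yet.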
After we get the physical address of the new `brk` with the `__pa` macro, we calculate the base address and the end of the page table buffer. In the next step, now that we have the page table buffer, we reserve a memory block for the `brk` area with the `reserve_brk` function:
```C
static void __init reserve_brk(void)
@@ -480,7 +486,7 @@ static void __init reserve_brk(void)
}
```
Note that at the end of `reserve_brk` we set `_brk_start` to zero, because after this we will not allocate from it anymore. The next step after reserving the memory block for the `brk` is to unmap out-of-range memory areas in the kernel mapping with the `cleanup_highmap` function. Remember that the kernel mapping is `__START_KERNEL_map` and `_end - _text`, or in other words `level2_kernel_pgt` maps the kernel `_text`, `data` and `bss`. At the start of `cleanup_highmap` we define these parameters:
```C
unsigned long vaddr = __START_KERNEL_map;
@@ -489,7 +495,7 @@ pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;
```
Now that we have defined the start and the end of the kernel mapping, we go in a loop through all kernel page middle directory entries and clear the entries which are not between `_text` and `end`:
```C
for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
@@ -500,7 +506,7 @@ for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
}
```
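The loop above can be modelled with a plain array. This is a sketch under stated assumptions: each entry is taken to map 2 MiB, a zero value means an empty entry, and the comparison against the text start and the end address is an assumed form of the elided loop body. `demo_cleanup_highmap` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PMD_SIZE (2ULL << 20)   /* each entry maps 2 MiB (assumption) */

/* Zero every non-empty entry whose base virtual address lies outside
 * [text, end]. Returns the number of entries that were cleared. */
static size_t demo_cleanup_highmap(uint64_t *pmd, size_t nr, uint64_t vaddr,
                                   uint64_t text, uint64_t end)
{
    size_t cleared = 0;

    assert(pmd != NULL);
    for (size_t i = 0; i < nr; i++, vaddr += DEMO_PMD_SIZE) {
        if (pmd[i] == 0)
            continue;                 /* already empty */
        if (vaddr < text || vaddr > end) {
            pmd[i] = 0;               /* out of the kernel image range */
            cleared++;
        }
    }
    return cleared;
}
```

Only entries whose base address falls inside the kernel image survive, so stale mappings on either side of it disappear.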
After this we set the limit for `memblock` allocation with the `memblock_set_current_limit` function (you can read more about `memblock` in [Linux kernel memory management Part 2](https://github.com/MintCN/linux-insides-zh/blob/master/MM/linux-mm-2.md)); it will be `ISA_END_ADDRESS`, or `0x100000`. Then we fill the `memblock` information according to `e820` with a call to the `memblock_x86_fill` function. You can see the result of this function at kernel initialization time:
```
MEMBLOCK configuration:
@@ -515,20 +521,20 @@ MEMBLOCK configuration:
reserved[0x2] [0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0
```
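As a rough picture of how a firmware memory map turns into a list of memory regions like the one printed above, consider the following user-space sketch. It is not `memblock_x86_fill` itself: the structures and the function are hypothetical, and the sketch assumes that the map is sorted by address, that only entries of the usable RAM type are added, and that adjacent regions are merged.

```c
#include <stddef.h>
#include <stdint.h>

/* Type codes used by this sketch: 1 for usable RAM, 2 for reserved. */
enum { SKETCH_E820_RAM = 1, SKETCH_E820_RESERVED = 2 };

struct sketch_e820_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

struct sketch_region {
    uint64_t base;
    uint64_t size;
};

/*
 * Copy usable RAM entries from `map` into `out`, merging entries that
 * touch each other. Assumes `map` is sorted by address.
 * Returns the number of regions written (at most `cap`).
 */
static size_t sketch_fill_regions(const struct sketch_e820_entry *map, size_t n,
                                  struct sketch_region *out, size_t cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (map[i].type != SKETCH_E820_RAM || map[i].size == 0)
            continue;  /* only non-empty usable RAM is added */

        /* Extend the previous region if this entry starts where it ends. */
        if (count > 0 &&
            out[count - 1].base + out[count - 1].size == map[i].addr) {
            out[count - 1].size += map[i].size;
            continue;
        }

        if (count == cap)
            break;  /* no room for another region */

        out[count].base = map[i].addr;
        out[count].size = map[i].size;
        count++;
    }
    return count;
}
```

Merging keeps the resulting list short, which is convenient because every later allocation has to walk it.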
The remaining functions after `memblock_x86_fill` are:

* `early_reserve_e820_mpc_new` allocates additional slots in the `e820map` for the MultiProcessor Specification table;
* `reserve_real_mode` reserves low memory from `0x0` to 1 megabyte for the trampoline to real mode (for rebooting, etc.);
* `trim_platform_memory_ranges` trims certain memory regions starting at `0x20050000`, `0x20110000`, etc.; these regions must be excluded because [Sandy Bridge](http://en.wikipedia.org/wiki/Sandy_Bridge) has problems with them;
* `trim_low_memory_range` reserves the first 4-kilobyte page in `memblock`;
* `init_mem_mapping` reconstructs the direct memory mapping and sets up the direct mapping of physical memory at `PAGE_OFFSET`;
* `early_trap_pf_init` sets up the `#PF` handler (we will look at it in the chapter about interrupts);
* `setup_real_mode` sets up the trampoline to the [real mode](http://en.wikipedia.org/wiki/Real_mode) code.

That's all. Note that this part does not cover all the functions in `setup_arch` (such as `early_gart_iommu_check`, [mtrr](http://en.wikipedia.org/wiki/Memory_type_range_register) initialization, etc.). As I have already written many times, `setup_arch` is big, and the Linux kernel is big, so I can't cover every line of it. I don't think we missed anything important, though you could argue that every line of code is important. That is true, but I skipped them anyway, because I think it is not realistic to cover the whole Linux kernel. In any case, we will often return to ideas we have already seen, and if something is unfamiliar, we will cover it.
Conclusion
--------------------------------------------------------------------------------
This is the end of the sixth part about the Linux kernel initialization process. In this part we dived into the `setup_arch` function again. It was a long part, but we are not finished with it yet. Yes, `setup_arch` is big; I hope the next part will be the last one about this function.
If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides) or, for this translation, to [linux-insides-zh](https://github.com/MintCN/linux-insides-zh).**
Links
--------------------------------------------------------------------------------
* [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification)
@@ -546,4 +552,4 @@ Links
* [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf)
* [BSS](http://en.wikipedia.org/wiki/.bss)
* [SMBIOS specification](http://www.dmtf.org/sites/default/files/standards/documents/DSP0134v2.5Final.pdf)
* [Previous part](http://xinqiu.gitbooks.io/linux-insides-cn/content/Initialization/linux-initialization-5.html)