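Per-CPU variables
Per-CPU variables are a kernel feature. The name says most of what you need to know: we create a variable, and every CPU gets its own copy of it. In this part we take a closer look at this feature and try to understand how it is implemented and how it works.
The kernel provides an API for creating per-cpu variables, the DEFINE_PER_CPU macro: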
#define DEFINE_PER_CPU(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "")
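Like many other macros for working with per-cpu variables, this macro is defined in include/linux/percpu-defs.h. Now let's see how this feature is implemented.
Looking at the DEFINE_PER_CPU definition, we can see that it takes two parameters, type and name, so we can create a per-cpu variable like this: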
DEFINE_PER_CPU(int, per_cpu_n)
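We pass the type and the name of the variable we want to create. DEFINE_PER_CPU calls DEFINE_PER_CPU_SECTION and passes it the same two parameters plus an empty string. Let's look at the definition of DEFINE_PER_CPU_SECTION: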
#define DEFINE_PER_CPU_SECTION(type, name, sec) \
__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \
__typeof__(type) name
#define __PCPU_ATTRS(sec) \
__percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
PER_CPU_ATTRIBUTES
where the section is:
#define PER_CPU_BASE_SECTION ".data..percpu"
After all the macros are expanded, we get a global per-cpu variable:
__attribute__((section(".data..percpu"))) int per_cpu_n
This means that we have a per_cpu_n variable in the .data..percpu section. We can find this section in vmlinux:
.data..percpu 00013a58 0000000000000000 0000000001a5c000 00e00000 2**12
CONTENTS, ALLOC, LOAD, DATA
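To get a feeling for what the section attribute does, here is a minimal userspace sketch. It assumes GCC or Clang and an ELF target linked with GNU ld or a compatible linker. The macro DEMO_DEFINE_PER_CPU, the section name demo_percpu and both helper functions are hypothetical names invented for this illustration. The sketch only places variables into one named section; it does not create per-CPU copies.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for DEFINE_PER_CPU: it only places the
 * variable into a named ELF section. No per-CPU copies are created here.
 */
#define DEMO_DEFINE_PER_CPU(type, name) \
	__attribute__((section("demo_percpu"), used)) __typeof__(type) name

DEMO_DEFINE_PER_CPU(int, per_cpu_n) = 42;
DEMO_DEFINE_PER_CPU(long, per_cpu_m) = 7;

/*
 * GNU ld provides __start_<name> and __stop_<name> for every section whose
 * name is a valid C identifier (which is why the demo does not use a name
 * with dots in it).
 */
extern char __start_demo_percpu[];
extern char __stop_demo_percpu[];

/* Returns 1 if addr lies inside the demo_percpu section, 0 otherwise. */
int in_demo_section(const void *addr)
{
	uintptr_t a = (uintptr_t)addr;

	return a >= (uintptr_t)__start_demo_percpu &&
	       a < (uintptr_t)__stop_demo_percpu;
}

/* Size in bytes of the section, i.e. of everything defined with the macro. */
unsigned long demo_section_size(void)
{
	uintptr_t start = (uintptr_t)__start_demo_percpu;
	uintptr_t stop = (uintptr_t)__stop_demo_percpu;

	assert(stop >= start);
	return (unsigned long)(stop - start);
}
```

Both variables end up next to each other between the two linker symbols, which is the property the kernel relies on: everything defined with DEFINE_PER_CPU sits in one contiguous region that can be copied as a whole.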
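Good, now we know that when we use the DEFINE_PER_CPU macro, a per-cpu variable is created in the .data..percpu section. When the kernel initializes, it calls the setup_per_cpu_areas function, which loads the .data..percpu section multiple times, once for each CPU.
Let's look at how the per-cpu areas are initialized. The process starts in init/main.c with a call to the setup_per_cpu_areas function, which is defined in arch/x86/kernel/setup_percpu.c.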
pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
setup_per_cpu_areas starts by printing four values: the maximum number of CPUs, which is set at kernel configuration time with the CONFIG_NR_CPUS option; nr_cpumask_bits, which is the same as NR_CPUS for the new cpumask operators; the actual number of CPUs (nr_cpu_ids); and the number of NUMA nodes.
We can see this output in the dmesg:
$ dmesg | grep percpu
[ 0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
In the next step we check the percpu first chunk allocator. All percpu areas are allocated in chunks, and the first chunk is used for the static percpu variables. The Linux kernel has a percpu_alloc command line parameter that selects the type of the first chunk allocator. We can read about it in the kernel documentation:
percpu_alloc= Select which percpu first chunk allocator to use.
Currently supported values are "embed" and "page".
Archs may support subset or none of the selections.
See comments in mm/percpu.c for details on each
allocator. This parameter is primarily for debugging
and performance comparison.
mm/percpu.c registers the handler for this command line option:
early_param("percpu_alloc", percpu_alloc_setup);
Here the percpu_alloc_setup function sets the pcpu_chosen_fc variable depending on the value of the percpu_alloc parameter. By default the first chunk allocator is auto:
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
If the percpu_alloc parameter is not given on the kernel command line, the embed allocator is used. It embeds the first percpu chunk into bootmem with memblock. The other allocator is the page first chunk allocator, which maps the first chunk with PAGE_SIZE pages.
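The selection logic described above can be sketched in plain userspace C. This is a simplified illustration and not the kernel's percpu_alloc_setup: every demo_ name is hypothetical, and the return values are an assumption made for the example.

```c
#include <assert.h>
#include <string.h>

/* Allocator kinds named in the text; the demo_ names are hypothetical. */
enum demo_pcpu_fc {
	DEMO_PCPU_FC_AUTO,
	DEMO_PCPU_FC_EMBED,
	DEMO_PCPU_FC_PAGE,
};

/* Default used when no percpu_alloc= value was parsed. */
enum demo_pcpu_fc demo_chosen_fc = DEMO_PCPU_FC_AUTO;

/* Human-readable name of an allocator kind. */
const char *demo_fc_name(enum demo_pcpu_fc fc)
{
	static const char *const names[] = { "auto", "embed", "page" };

	assert((unsigned int)fc < sizeof(names) / sizeof(names[0]));
	return names[fc];
}

/*
 * Simplified handler for "percpu_alloc=<value>".
 * Returns 0 if the value was recognized and -1 otherwise (an assumption of
 * this sketch); an unrecognized value leaves the current choice untouched.
 */
int demo_percpu_alloc_setup(const char *str)
{
	if (!str)
		return -1;

	if (!strcmp(str, "embed"))
		demo_chosen_fc = DEMO_PCPU_FC_EMBED;
	else if (!strcmp(str, "page"))
		demo_chosen_fc = DEMO_PCPU_FC_PAGE;
	else
		return -1;

	return 0;
}

/* Anything except "page" ends up on the embed path. */
int demo_uses_embed(void)
{
	return demo_chosen_fc != DEMO_PCPU_FC_PAGE;
}
```

Note that auto and embed take the same branch in this sketch, which matches the statement that embed is used when the parameter is absent.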
As I wrote above, setup_per_cpu_areas first checks the type of the first chunk allocator. We check that the first chunk allocator is not page:
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
...
...
...
}
If it is not PCPU_FC_PAGE, we will use the embed allocator and allocate space for the first chunk with the pcpu_embed_first_chunk function:
rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
dyn_size, atom_size,
pcpu_cpu_distance,
pcpu_fc_alloc, pcpu_fc_free);
As shown above, the pcpu_embed_first_chunk function embeds the first percpu chunk into bootmem. We pass several parameters to pcpu_embed_first_chunk. They are as follows:
- PERCPU_FIRST_CHUNK_RESERVE - the size of the reserved space for the static percpu variables;
- dyn_size - minimum free size for dynamic allocation in bytes;
- atom_size - all allocations are whole multiples of this and aligned to this parameter;
- pcpu_cpu_distance - callback to determine distance between cpus;
- pcpu_fc_alloc - function to allocate percpu page;
- pcpu_fc_free - function to release percpu page.
We calculate all of these parameters before the call of the pcpu_embed_first_chunk:
const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
size_t atom_size;
#ifdef CONFIG_X86_64
atom_size = PMD_SIZE;
#else
atom_size = PAGE_SIZE;
#endif
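To make the role of atom_size more concrete, here is a small sketch of what "whole multiples of this and aligned to this" means in practice. The helper demo_round_up is a hypothetical name and not a kernel function. The two sizes in the usage note (4 KiB for a page and 2 MiB for a PMD mapping) are the usual x86_64 values and are used here only as assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round size up to the next multiple of atom.
 * atom must be a non-zero power of two, which holds for page-like sizes.
 */
size_t demo_round_up(size_t size, size_t atom)
{
	assert(atom != 0 && (atom & (atom - 1)) == 0);
	return (size + atom - 1) & ~(atom - 1);
}
```

With an atom of 4096 bytes a request of 4097 bytes grows to 8192 bytes, while with an atom of 2 MiB even a request of 100000 bytes occupies a full 2 MiB unit.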
If the first chunk allocator is PCPU_FC_PAGE, we use pcpu_page_first_chunk instead of pcpu_embed_first_chunk. Once the percpu areas are up, we set up the percpu offset and its segment for every CPU with the setup_percpu_segment function (only on x86 systems) and move some early data from arrays to percpu variables (x86_cpu_to_apicid, irq_stack_ptr, etc.). After the kernel finishes the initialization process, we have N loaded .data..percpu sections, where N is the number of CPUs, and the section used by the bootstrap processor contains the uninitialized variables created with the DEFINE_PER_CPU macro.
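The net effect of this initialization, one copy of the per-cpu data for each CPU, can be imitated in userspace. The sketch below is only a model under stated assumptions: the struct plays the role of the .data..percpu section, malloc replaces the first chunk allocator, the template carries example initial values, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NR_CPUS 4

/* Stand-in for the contents of the .data..percpu section. */
struct demo_percpu_template {
	int per_cpu_n;
	long counter;
};

/* Example initial values; they are copied into every CPU's area. */
static const struct demo_percpu_template demo_boot_template = {
	.per_cpu_n = 5,
	.counter = 0,
};

/* Base address of each CPU's private copy, filled in by the setup below. */
static struct demo_percpu_template *demo_area[DEMO_NR_CPUS];

/*
 * Allocate one unit per CPU in a single block and copy the template into
 * each unit. Returns 0 on success and -1 if the allocation fails.
 */
int demo_setup_per_cpu_areas(void)
{
	size_t unit_size = sizeof(demo_boot_template);
	char *base = malloc(unit_size * DEMO_NR_CPUS);
	unsigned int cpu;

	if (!base)
		return -1;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		demo_area[cpu] = (struct demo_percpu_template *)(base + cpu * unit_size);
		memcpy(demo_area[cpu], &demo_boot_template, unit_size);
	}
	return 0;
}

/* Returns the private area of the given CPU. */
struct demo_percpu_template *demo_cpu_area(unsigned int cpu)
{
	assert(cpu < DEMO_NR_CPUS);
	assert(demo_area[cpu] != NULL);
	return demo_area[cpu];
}
```

After demo_setup_per_cpu_areas() returns 0, writing to demo_cpu_area(1)->per_cpu_n leaves the copies of all other CPUs unchanged, which is the whole point of the feature.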
The kernel provides an API for manipulating per-cpu variables:
- get_cpu_var(var)
- put_cpu_var(var)
Let's look at the get_cpu_var implementation:
#define get_cpu_var(var) \
(*({ \
preempt_disable(); \
this_cpu_ptr(&var); \
}))
The Linux kernel is preemptible, and accessing a per-cpu variable requires us to know which processor the kernel is running on. So the current code must not be preempted and moved to another CPU while it accesses a per-cpu variable. That is why we first see a call to the preempt_disable function and then a call to the this_cpu_ptr macro, which looks like:
#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
and
#define raw_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
where per_cpu_ptr returns a pointer to the per-cpu variable for the given cpu (second parameter). The raw_cpu_ptr definition shown here, which always passes 0, is the one used on uniprocessor builds, where CPU 0 is the only CPU. After we have obtained a per-cpu variable and finished modifying it, we must call the put_cpu_var macro, which enables preemption again with a call to the preempt_enable function. So the typical usage of a per-cpu variable is as follows:
get_cpu_var(var);
...
//Do something with the 'var'
...
put_cpu_var(var);
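The get/put pairing can be tried out in userspace with a mock. In the sketch below the per-cpu variable is simply an array with one slot per CPU, preemption is modelled by a counter, and the current CPU is a plain global variable. All demo_ names are hypothetical. The macros copy only the shape of the kernel ones and rely on the GNU statement expression extension, so GCC or Clang is assumed.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4

int demo_preempt_count;          /* > 0 stands for "preemption disabled" */
int demo_current_cpu;            /* CPU the mock task currently runs on */
int demo_hits[DEMO_NR_CPUS];     /* one counter slot per mock CPU */

/* Same shape as get_cpu_var: disable "preemption", then yield an lvalue. */
#define demo_get_cpu_var(var)			\
(*({						\
	demo_preempt_count++;			\
	&(var)[demo_current_cpu];		\
}))

/* Counterpart of put_cpu_var: re-enable "preemption". */
#define demo_put_cpu_var(var)			\
do {						\
	(void)&(var);				\
	demo_preempt_count--;			\
} while (0)

/* Increment the current CPU's counter and return its new value. */
int demo_count_hit(void)
{
	int value;

	assert(demo_current_cpu >= 0 && demo_current_cpu < DEMO_NR_CPUS);

	value = ++demo_get_cpu_var(demo_hits);  /* "preemption" is off from here */
	assert(demo_preempt_count > 0);
	demo_put_cpu_var(demo_hits);            /* and back on again */

	return value;
}
```

Every demo_get_cpu_var must be matched by exactly one demo_put_cpu_var, so the counter returns to zero after each call of demo_count_hit. The real macros pair preempt_disable with preempt_enable in the same way.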
Let's look at the per_cpu_ptr macro:
#define per_cpu_ptr(ptr, cpu) \
({ \
__verify_pcpu_ptr(ptr); \
SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \
})
As I wrote above, this macro returns a pointer to the per-cpu variable for the given cpu. First of all it calls __verify_pcpu_ptr:
#define __verify_pcpu_ptr(ptr) \
do { \
const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
(void)__vpp_verify; \
} while (0)
which verifies that the given ptr is a per-cpu pointer, that is, compatible with const void __percpu *, without evaluating ptr.
After this we can see the call of the SHIFT_PERCPU_PTR macro with two parameters. The first parameter is our ptr, and the second is the result of passing the cpu number to the per_cpu_offset macro:
#define per_cpu_offset(x) (__per_cpu_offset[x])
which expands to getting the x element from the __per_cpu_offset array:
extern unsigned long __per_cpu_offset[NR_CPUS];
where NR_CPUS is the number of CPUs. The __per_cpu_offset array is filled with the distances between the copies of the per-cpu data. For example, if all per-cpu data is X bytes in size, then __per_cpu_offset[Y] leads us to the copy at offset X*Y. Let's look at the SHIFT_PERCPU_PTR implementation:
#define SHIFT_PERCPU_PTR(__p, __offset) \
RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
RELOC_HIDE simply adds the offset to the pointer, (typeof(ptr)) (__ptr + (off)), and returns a pointer to the variable.
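The pointer arithmetic behind per_cpu_ptr can be reproduced with a userspace model. In this sketch an array of structs stands for the per-cpu areas, element 0 plays the role of the section the variables were linked into, and the offsets are computed from the array addresses. All demo_ names are hypothetical, and the cast through uintptr_t is only a stand-in for what RELOC_HIDE does.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4

/* One unit of per-cpu data; demo_units[0] acts as the link-time location. */
struct demo_unit {
	int per_cpu_n;
	long counter;
};

struct demo_unit demo_units[DEMO_NR_CPUS];

/* Mock of __per_cpu_offset: distance of each CPU's unit from unit 0. */
unsigned long demo_per_cpu_offset[DEMO_NR_CPUS];

/* Fill the offset table from the addresses of the units. */
void demo_fill_offsets(void)
{
	unsigned int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		demo_per_cpu_offset[cpu] = (unsigned long)
			((uintptr_t)&demo_units[cpu] - (uintptr_t)&demo_units[0]);
}

/* Add a byte offset to a pointer and keep the pointer's type. */
#define demo_shift_ptr(ptr, off) \
	((__typeof__(ptr))((uintptr_t)(ptr) + (off)))

/* Mock of per_cpu_ptr: shift the link-time address by the CPU's offset. */
#define demo_per_cpu_ptr(ptr, cpu) \
	demo_shift_ptr((ptr), demo_per_cpu_offset[(cpu)])

/* Pointer to the per_cpu_n copy that belongs to the given CPU. */
int *demo_per_cpu_n_ptr(unsigned int cpu)
{
	assert(cpu < DEMO_NR_CPUS);
	return demo_per_cpu_ptr(&demo_units[0].per_cpu_n, cpu);
}
```

After demo_fill_offsets() the entry for CPU Y equals Y times the unit size, so demo_per_cpu_n_ptr(2) points at demo_units[2].per_cpu_n while the address passed to the macro always refers to unit 0.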
That's all! Of course it is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the include/linux/percpu-defs.h magic.
Let's again look at the algorithm of getting a pointer to a per-cpu variable:
- The kernel creates multiple .data..percpu sections (one per CPU) during the initialization process;
- All variables created with the DEFINE_PER_CPU macro will be relocated to the first section, the one for CPU0;
- The __per_cpu_offset array is filled with the distance (BOOT_PERCPU_OFFSET) between .data..percpu sections;
- When per_cpu_ptr is called, for example to get a pointer to a certain per-cpu variable for the third CPU, the __per_cpu_offset array will be accessed, where every index points to the required CPU.
That's all.