From 5e8ad6f2219951d2a531e8a58b17ee9cd97a0486 Mon Sep 17 00:00:00 2001 From: Dongliang Mu Date: Sat, 11 Jun 2016 22:01:23 -0400 Subject: [PATCH] Change the translation status of Chapter 5 and 6 --- README.md | 13 +- Timers/timers-1.md | 436 +++++++++++++++++++++++++++++++++++++++++++ Timers/timers-2.md | 451 +++++++++++++++++++++++++++++++++++++++++++++ Timers/timers-3.md | 444 ++++++++++++++++++++++++++++++++++++++++++++ Timers/timers-4.md | 427 ++++++++++++++++++++++++++++++++++++++++++ Timers/timers-5.md | 415 +++++++++++++++++++++++++++++++++++++++++ Timers/timers-6.md | 413 +++++++++++++++++++++++++++++++++++++++++ Timers/timers-7.md | 421 ++++++++++++++++++++++++++++++++++++++++++ 8 files changed, 3018 insertions(+), 2 deletions(-) create mode 100644 Timers/timers-1.md create mode 100644 Timers/timers-2.md create mode 100644 Timers/timers-3.md create mode 100644 Timers/timers-4.md create mode 100644 Timers/timers-5.md create mode 100644 Timers/timers-6.md create mode 100644 Timers/timers-7.md diff --git a/README.md b/README.md index e8c5995..559dedc 100644 --- a/README.md +++ b/README.md @@ -50,8 +50,17 @@ Linux Insides |├ 4.2|[@qianmoke](https://github.com/qianmoke)|已完成| |├ 4.3||未开始| |└ 4.4||未开始| -| 5. Timers and time management|[@icecoobe](https://github.com/icecoobe)|正在进行| -| 6. Synchronization primitives|[@huxq](https://github.com/huxq)|正在进行| +| 5. Timers and time management||正在进行| +|├ 5.0|[@mudongliang](https://github.com/mudongliang)|已完成| +|├ 5.1||未开始| +|├ 5.2||未开始| +|├ 5.3||未开始| +|├ 5.4||未开始| +|├ 5.5||未开始| +|├ 5.6||未开始| +|└ 5.7||未开始| +| 6. Synchronization primitives||正在进行| +|├ 6.0|[@mudongliang](https://github.com/mudongliang)|已完成| |├ 6.1||未开始| |├ 6.2||未开始| |├ 6.3|[@huxq](https://github.com/huxq)|已完成| diff --git a/Timers/timers-1.md b/Timers/timers-1.md new file mode 100644 index 0000000..2e5994e --- /dev/null +++ b/Timers/timers-1.md @@ -0,0 +1,436 @@ +Timers and time management in the Linux kernel. Part 1. 
+================================================================================

Introduction
--------------------------------------------------------------------------------

This is yet another post that opens a new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) was the last part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concept, and now it is time to start a new chapter. As you can tell from the post's title, this chapter is devoted to `timers` and `time management` in the Linux kernel. The choice of topic is not accidental: timers, and time management in general, are very important and widely used in the Linux kernel. The kernel uses timers for various tasks: different timeouts (for example in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation), knowing the current time, scheduling asynchronous functions, scheduling the next event interrupt, and many more.

So, we will start to learn the implementation of different time management related stuff in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always, we will start from the earliest part of the Linux kernel and go through its initialization process. We already did this in the special [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) which describes the initialization process of the Linux kernel, but as you may remember we missed some things there. One of them is the initialization of timers.

Let's start.

Initialization of non-standard PC hardware clock
--------------------------------------------------------------------------------

After the Linux kernel is decompressed (you can read more about this in the [Kernel decompression](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) part), the architecture-independent code starts to work in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. After initialization of the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialization of [cgroups](https://en.wikipedia.org/wiki/Cgroups) and setting the [canary](https://en.wikipedia.org/wiki/Buffer_overflow_protection) value, we can see the call of the `setup_arch` function.

As you may remember, this function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file and prepares/initializes architecture-specific stuff (for example it reserves space for the [bss](https://en.wikipedia.org/wiki/.bss) section, reserves space for the [initrd](https://en.wikipedia.org/wiki/Initrd), parses the kernel command line, and many other things). Besides this, we can find some time management related functions there.

The first is:

```C
x86_init.timers.wallclock_init();
```

We already saw the `x86_init` structure in the chapter that describes the initialization of the Linux kernel. This structure contains pointers to the default setup functions for different platforms like [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms), [Intel CE4100](http://www.wpgholdings.com/epaper/US/newsRelease_20091215/255874.pdf), etc. The `x86_init` structure is defined in [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c#L36), and by default it describes standard PC hardware.
As we can see, the `x86_init` structure has the `x86_init_ops` type that provides a set of functions for platform specific setup, like reserving standard resources, platform specific memory setup, initialization of interrupt handlers, etc. This structure looks like:

```C
struct x86_init_ops {
    struct x86_init_resources resources;
    struct x86_init_mpparse   mpparse;
    struct x86_init_irqs      irqs;
    struct x86_init_oem       oem;
    struct x86_init_paging    paging;
    struct x86_init_timers    timers;
    struct x86_init_iommu     iommu;
    struct x86_init_pci       pci;
};
```

Note the `timers` field that has the `x86_init_timers` type; as we can understand from its name, this field is related to time management and timers. `x86_init_timers` contains four fields, all of which are pointers to functions that return [void](https://en.wikipedia.org/wiki/Void_type):

* `setup_percpu_clockev` - set up the per cpu clock event device for the boot cpu;
* `tsc_pre_init` - platform function called before [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) init;
* `timer_init` - initialize the platform timer;
* `wallclock_init` - initialize the wallclock device.

So, as we already know, in our case the `wallclock_init` executes initialization of the wallclock device. If we look at the `x86_init` structure, we will see that `wallclock_init` points to `x86_init_noop`:

```C
struct x86_init_ops x86_init __initdata = {
    ...
    ...
    ...
    .timers = {
        .wallclock_init = x86_init_noop,
    },
    ...
    ...
    ...
}
```

where `x86_init_noop` is just a function that does nothing:

```C
void __cpuinit x86_init_noop(void) { }
```

So, for standard PC hardware nothing happens here. Actually, the `wallclock_init` function is used on the [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms) platform.
Initialization of the `x86_init.timers.wallclock_init` is located in the [arch/x86/platform/intel-mid/intel-mid.c](https://github.com/torvalds/linux/blob/master/arch/x86/platform/intel-mid/intel-mid.c) source code file, in the `x86_intel_mid_early_setup` function:

```C
void __init x86_intel_mid_early_setup(void)
{
    ...
    ...
    ...
    x86_init.timers.wallclock_init = intel_mid_rtc_init;
    ...
    ...
    ...
}
```

The implementation of the `intel_mid_rtc_init` function is in the [arch/x86/platform/intel-mid/intel_mid_vrtc.c](https://github.com/torvalds/linux/blob/master/arch/x86/platform/intel-mid/intel_mid_vrtc.c) source code file and looks pretty simple. First of all, this function parses the [Simple Firmware Interface](https://en.wikipedia.org/wiki/Simple_Firmware_Interface) M-Real-Time-Clock table to collect such devices into the `sfi_mrtc_array` array, and then installs the `vrtc_get_time` and `vrtc_set_mmss` functions as the platform's wallclock handlers:

```C
void __init intel_mid_rtc_init(void)
{
	unsigned long vrtc_paddr;

	sfi_table_parse(SFI_SIG_MRTC, NULL, NULL, sfi_parse_mrtc);

	vrtc_paddr = sfi_mrtc_array[0].phys_addr;
	if (!sfi_mrtc_num || !vrtc_paddr)
		return;

	vrtc_virt_base = (void __iomem *)set_fixmap_offset_nocache(FIX_LNW_VRTC,
								vrtc_paddr);

	x86_platform.get_wallclock = vrtc_get_time;
	x86_platform.set_wallclock = vrtc_set_mmss;
}
```

That's all; after this a device based on `Intel MID` will be able to get the time from the hardware clock. As I already wrote, on standard PC [x86_64](https://en.wikipedia.org/wiki/X86-64) hardware `wallclock_init` points to `x86_init_noop`, which does nothing when called. We just saw initialization of the [real time clock](https://en.wikipedia.org/wiki/Real-time_clock) for the [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms) platform, and now it is time to return to the general `x86_64` architecture and look at its time management related stuff.

Acquainted with jiffies
--------------------------------------------------------------------------------

If we return to the `setup_arch` function, which, as you remember, is located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file, we will see the next time management related function call:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

Before we look at the implementation of this function, we must know about the [jiffy](https://en.wikipedia.org/wiki/Jiffy_%28time%29). As we can read on Wikipedia:

```
Jiffy is an informal term for any unspecified short period of time
```

This definition is very similar to the `jiffy` in the Linux kernel. There is a global variable called `jiffies` which holds the number of ticks that have occurred since the system booted. The Linux kernel sets this variable to zero:

```C
extern unsigned long volatile __jiffy_data jiffies;
```

during the initialization process. This global variable is incremented on each timer interrupt. Besides this, near the `jiffies` variable we can see the definition of a similar variable:

```C
extern u64 jiffies_64;
```

Actually, only one of these variables is in use in the Linux kernel, and it depends on the processor type. For [x86_64](https://en.wikipedia.org/wiki/X86-64) the `u64` variant is used, and for [x86](https://en.wikipedia.org/wiki/X86) the `unsigned long` one. We will see this if we look at the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) linker script:

```
#ifdef CONFIG_X86_32
...
jiffies = jiffies_64;
...
#else
...
jiffies_64 = jiffies;
...
#endif
```

In the case of `x86_32`, `jiffies` will be the lower `32` bits of the `jiffies_64` variable.
Schematically, we can imagine it as follows:

```
                      jiffies_64
+--------------------------+--------------------------+
|                          |                          |
|                          |                          |
|                          |   jiffies on `x86_32`    |
|                          |                          |
|                          |                          |
+--------------------------+--------------------------+
63                         31                         0
```

Now we know a little theory about `jiffies` and we can return to our function. There is no architecture-specific implementation of our function: `register_refined_jiffies` is located in the generic kernel code - the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file. The main point of `register_refined_jiffies` is the registration of a jiffies `clocksource`. Before we look at the implementation of the `register_refined_jiffies` function, we must know what a `clocksource` is. As we can read in the comments:

```
The `clocksource` is hardware abstraction for a free-running counter.
```

I'm not sure about you, but that description didn't give me a good understanding of the `clocksource` concept. Let's try to understand what it is, without going too deep, because this topic will be described in much more detail in a separate part. The main point of the `clocksource` is a timekeeping abstraction or, in very simple words, it provides a time value to the kernel. We already know about the `jiffies` interface that represents the number of ticks that have occurred since the system booted. It is represented by a global variable in the Linux kernel and is incremented on each timer interrupt. The Linux kernel can use `jiffies` for time measurement. So why do we need a separate concept like the `clocksource`? Actually, different hardware devices provide different clock sources that vary widely in their capabilities. The availability of more precise techniques for time interval measurement is hardware-dependent.

For example, `x86` has an on-chip 64-bit counter called the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter), whose frequency can be equal to the processor frequency. Or, for example, the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), which consists of a `64-bit` counter of at least `10 MHz` frequency. Two different timers, and they are both for `x86`. If we add timers from other architectures, this only makes the problem more complex. The Linux kernel provides the `clocksource` concept to solve the problem.

The clocksource concept is represented by the `clocksource` structure in the Linux kernel. This structure is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and contains a couple of fields that describe a time counter. For example, it contains the `name` field which is the name of a counter, the `flags` field that describes different properties of a counter, pointers to the `suspend` and `resume` functions, and many more.

Let's look at the `clocksource` structure for jiffies that is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file:

```C
static struct clocksource clocksource_jiffies = {
	.name		= "jiffies",
	.rating		= 1,
	.read		= jiffies_read,
	.mask		= 0xffffffff,
	.mult		= NSEC_PER_JIFFY << JIFFIES_SHIFT,
	.shift		= JIFFIES_SHIFT,
	.max_cycles	= 10,
};
```

We can see the definition of the default name here - `jiffies`. The next is the `rating` field, which allows the clock source management code to choose the best registered clock source available for the specified hardware. The `rating` may have the following values:

* `1-99` - Only available for bootup and testing purposes;
* `100-199` - Functional for real use, but not desired;
* `200-299` - A correct and usable clocksource;
* `300-399` - A reasonably fast and accurate clocksource;
+* `400-499` - The ideal clocksource. A must-use where available; + +For example rating of the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) is `300`, but rating of the [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is `250`. The next field is `read` - is pointer to the function that allows to read clocksource's cycle value or in other words it just returns `jiffies` variable with `cycle_t` type: + +```C +static cycle_t jiffies_read(struct clocksource *cs) +{ + return (cycle_t) jiffies; +} +``` + +that is just 64-bit unsigned type: + +```C +typedef u64 cycle_t; +``` + +The next field is the `mask` value ensures that subtraction between counters values from non `64 bit` counters do not need special overflow logic. In our case the mask is `0xffffffff` and it is `32` bits. This means that `jiffy` wraps around to zero after `42` seconds: + +```python +>>> 0xffffffff +4294967295 +# 42 nanoseconds +>>> 42 * pow(10, -9) +4.2000000000000006e-08 +# 43 nanoseconds +>>> 43 * pow(10, -9) +4.3e-08 +``` + +The next two fields `mult` and `shift` are used to convert the clocksource's period to nanoseconds per cycle. When the kernel calls the `clocksource.read` function, this function returns value in `machine` time units represented with `cycle_t` data type that we saw just now. To convert this return value to the [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) we need in these two fields: `mult` and `shift`. 
The `clocksource` framework provides the `clocksource_cyc2ns` function that will do this for us with the following expression:

```C
((u64) cycles * mult) >> shift;
```

As we can see, the `mult` field is equal to:

```C
NSEC_PER_JIFFY << JIFFIES_SHIFT

#define NSEC_PER_JIFFY ((NSEC_PER_SEC+HZ/2)/HZ)
#define NSEC_PER_SEC   1000000000L
```

by default, and the `shift` is:

```C
#if HZ < 34
  #define JIFFIES_SHIFT	6
#elif HZ < 67
  #define JIFFIES_SHIFT	7
#else
  #define JIFFIES_SHIFT	8
#endif
```

The `jiffies` clock source uses `NSEC_PER_JIFFY` as the multiplier, i.e. the nanoseconds per cycle ratio. Note that the values of `JIFFIES_SHIFT` and `NSEC_PER_JIFFY` depend on the `HZ` value. `HZ` represents the frequency of the system timer. This macro is defined in [include/asm-generic/param.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/param.h) and depends on the `CONFIG_HZ` kernel configuration option. The value of `HZ` differs for each supported architecture, but for `x86` it is defined like:

```C
#define HZ CONFIG_HZ
```

where `CONFIG_HZ` can be one of the following values:

![HZ](http://s9.postimg.org/xy85r3jrj/image.png)

This means that in our case the timer interrupt frequency is `250 HZ`, i.e. a timer interrupt occurs `250` times per second, or once every `4ms`.

The last field that we can see in the definition of the `clocksource_jiffies` structure is `max_cycles`, which holds the maximum cycle value that can safely be multiplied without potentially causing an overflow.

Ok, we just saw the definition of the `clocksource_jiffies` structure; we also know a little about `jiffies` and `clocksource`, so now it is time to get back to the implementation of our function.
In the beginning of this part we stopped at the call of:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file.

As I already wrote, the main purpose of the `register_refined_jiffies` function is to register the `refined_jiffies` clocksource. We already saw that the `clocksource_jiffies` structure represents the standard `jiffies` clock source. Now, if you look in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file, you will find yet another clock source definition:

```C
struct clocksource refined_jiffies;
```

There is one difference between `refined_jiffies` and `clocksource_jiffies`: the standard `jiffies` based clock source is the lowest common denominator clock source, which should function on all systems. As we already know, the `jiffies` global variable is incremented on each timer interrupt. This means that the standard `jiffies` based clock source has the same resolution as the timer interrupt frequency, and may therefore suffer from inaccuracies. The `refined_jiffies` clock source instead uses `CLOCK_TICK_RATE` as the basis for calculating a more accurate multiplier.

Let's look at the implementation of this function. First of all, we can see that the `refined_jiffies` clock source is based on the `clocksource_jiffies` structure:

```C
int register_refined_jiffies(long cycles_per_second)
{
	u64 nsec_per_tick, shift_hz;
	long cycles_per_tick;

	refined_jiffies = clocksource_jiffies;
	refined_jiffies.name = "refined-jiffies";
	refined_jiffies.rating++;
	...
	...
	...
```

Here we can see that we update the name of `refined_jiffies` to `refined-jiffies` and increase the rating of this structure. As you remember, `clocksource_jiffies` has a rating of `1`, so our `refined_jiffies` clocksource will have a rating of `2`.
This means that `refined_jiffies` will be the better selection for the clock source management code.

In the next step we need to calculate the number of cycles per tick:

```C
cycles_per_tick = (cycles_per_second + HZ/2)/HZ;
```

Note that this mirrors the rounding we saw in the `NSEC_PER_JIFFY` macro, the base of the standard `jiffies` multiplier. Here we use `cycles_per_second`, which is the first parameter of the `register_refined_jiffies` function. We passed the `CLOCK_TICK_RATE` macro to the `register_refined_jiffies` function. This macro is defined in the [arch/x86/include/asm/timex.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/timex.h) header file and expands to:

```C
#define CLOCK_TICK_RATE PIT_TICK_RATE
```

where the `PIT_TICK_RATE` macro expands to the frequency of the [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) programmable interval timer:

```C
#define PIT_TICK_RATE 1193182ul
```

After this we calculate `shift_hz` for `register_refined_jiffies`, which will store `hz << 8`, in other words the frequency of the system timer.
We shift the `cycles_per_second` (the frequency of the programmable interval timer) left by `8` bits in order to get extra accuracy:

```C
shift_hz = (u64)cycles_per_second << 8;
shift_hz += cycles_per_tick/2;
do_div(shift_hz, cycles_per_tick);
```

In the next step we calculate the number of nanoseconds per tick, shifting `NSEC_PER_SEC` left by `8` bits as we did with `shift_hz` and doing the same rounded division:

```C
nsec_per_tick = (u64)NSEC_PER_SEC << 8;
nsec_per_tick += (u32)shift_hz/2;
do_div(nsec_per_tick, (u32)shift_hz);
```

The result becomes the new multiplier of the `refined_jiffies` clock source:

```C
refined_jiffies.mult = ((u32)nsec_per_tick) << JIFFIES_SHIFT;
```

At the end of the `register_refined_jiffies` function we register the new clock source with the `__clocksource_register` function, which is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file, and return:

```C
__clocksource_register(&refined_jiffies);
return 0;
```

The clock source management code provides the API for clock source registration and selection. As we can see, clock sources are registered by calling the `__clocksource_register` function during kernel initialization or from a kernel module. During registration, the clock source management code will choose the best clock source available in the system using the `clocksource.rating` field, which we already saw when we initialized the `clocksource` structure for `jiffies`.

Using the jiffies
--------------------------------------------------------------------------------

We just saw the initialization of two `jiffies` based clock sources in the previous paragraph:

* the standard `jiffies` based clock source;
* the refined `jiffies` based clock source.

Don't worry if you don't understand the calculations here. They look frightening at first. Soon, step by step, we will learn these things.
So, we just saw the initialization of the `jiffies` based clock sources, and we also know that the Linux kernel has the global variable `jiffies` that holds the number of ticks that have occurred since the kernel started to work. Now, let's look at how to use it. To use `jiffies` we can simply access the `jiffies` global variable by its name, or call the `get_jiffies_64` function. This function is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file and just returns the full `64-bit` value of `jiffies`:

```C
u64 get_jiffies_64(void)
{
	unsigned long seq;
	u64 ret;

	do {
		seq = read_seqbegin(&jiffies_lock);
		ret = jiffies_64;
	} while (read_seqretry(&jiffies_lock, seq));
	return ret;
}
EXPORT_SYMBOL(get_jiffies_64);
```

Note that `get_jiffies_64` is not implemented as simply as `jiffies_read`, for example:

```C
static cycle_t jiffies_read(struct clocksource *cs)
{
	return (cycle_t) jiffies;
}
```

We can see that the implementation of `get_jiffies_64` is more complex: the reading of the `jiffies_64` variable is implemented using [seqlocks](https://en.wikipedia.org/wiki/Seqlock). Actually, this is done for machines that cannot atomically read the full 64-bit value.

Given access to the `jiffies` or the `jiffies_64` variable, we can convert it to `human` time units. To get the number of elapsed seconds we can use the following expression:

```C
jiffies / HZ
```

So, if we know this, we can get any time unit. For example:

```C
/* Thirty seconds from now */
jiffies + 30*HZ

/* Two minutes from now */
jiffies + 120*HZ

/* One millisecond from now */
jiffies + HZ / 1000
```

That's all.

Conclusion
--------------------------------------------------------------------------------

This concludes the first part covering time and time management related concepts in the Linux kernel. We met the first two concepts and their initialization in this part: `jiffies` and `clocksource`.
In the next part we will continue to dive into this interesting topic, and, as I already wrote in this part, we will try to understand the internals of these and other time management concepts in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [system call](https://en.wikipedia.org/wiki/System_call)
* [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol)
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [cgroups](https://en.wikipedia.org/wiki/Cgroups)
* [bss](https://en.wikipedia.org/wiki/.bss)
* [initrd](https://en.wikipedia.org/wiki/Initrd)
* [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms)
* [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [void](https://en.wikipedia.org/wiki/Void_type)
* [Simple Firmware Interface](https://en.wikipedia.org/wiki/Simple_Firmware_Interface)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [real time clock](https://en.wikipedia.org/wiki/Real-time_clock)
* [Jiffy](https://en.wikipedia.org/wiki/Jiffy_%28time%29)
* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
* [seqlocks](https://en.wikipedia.org/wiki/Seqlock)
* [clocksource documentation](https://www.kernel.org/doc/Documentation/timers/timekeeping.txt)
* [Previous
chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html)
diff --git a/Timers/timers-2.md b/Timers/timers-2.md
new file mode 100644
index 0000000..d803494
--- /dev/null
+++ b/Timers/timers-2.md
@@ -0,0 +1,451 @@
+Timers and time management in the Linux kernel. Part 2.
================================================================================

Introduction to the `clocksource` framework
--------------------------------------------------------------------------------

The previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) was the first part in the current [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:

* `jiffies`
* `clocksource`

The first is a global variable that is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file and represents the counter that is incremented during each timer interrupt. So, if we can access this global variable and we know the timer interrupt rate, we can convert `jiffies` to human time units. As we already know, the timer interrupt rate is represented by a compile-time constant called `HZ` in the Linux kernel. The value of `HZ` is equal to the value of the `CONFIG_HZ` kernel configuration option, and if we look into the [arch/x86/configs/x86_64_defconfig](https://github.com/torvalds/linux/blob/master/arch/x86/configs/x86_64_defconfig) kernel configuration file, we will see that the:

```
CONFIG_HZ_1000=y
```

kernel configuration option is set. This means that the value of `CONFIG_HZ` will be `1000` by default for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture.
So, if we divide the value of `jiffies` by the value of `HZ`:

```
jiffies / HZ
```

we will get the number of seconds that have elapsed since the Linux kernel started to work, or in other words, the system [uptime](https://en.wikipedia.org/wiki/Uptime). Since `HZ` represents the number of timer interrupts in a second, we can set a value for some time in the future. For example:

```C
/* one minute from now */
unsigned long later = jiffies + 60*HZ;

/* five minutes from now */
unsigned long later = jiffies + 5*60*HZ;
```

This is a very common practice in the Linux kernel. For example, if you look into the [arch/x86/kernel/smpboot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/smpboot.c) source code file, you will find the `do_boot_cpu` function. This function boots all processors besides the bootstrap processor. You can find a snippet that waits ten seconds for a response from the application processor:

```C
if (!boot_error) {
	timeout = jiffies + 10*HZ;
	while (time_before(jiffies, timeout)) {
		...
		...
		...
		udelay(100);
	}
	...
	...
	...
}
```

We assign the `jiffies + 10*HZ` value to the `timeout` variable here. As I think you already understood, this means a ten second timeout. After this we enter a loop where we use the `time_before` macro to compare the current `jiffies` value and our timeout.

Or, for example, if we look into the [sound/isa/sscape.c](https://github.com/torvalds/linux/blob/master/sound/isa/sscape.c) source code file, which represents the driver for the [Ensoniq Soundscape Elite](https://en.wikipedia.org/wiki/Ensoniq_Soundscape_Elite) sound card, we will see the `obp_startup_ack` function that waits up to a given timeout for the On-Board Processor to return its start-up acknowledgement sequence:

```C
static int obp_startup_ack(struct soundscape *s, unsigned timeout)
{
	unsigned long end_time = jiffies + msecs_to_jiffies(timeout);

	do {
		...
		...
		...
		x = host_read_unsafe(s->io_base);
		...
		...
		...
		if (x == 0xfe || x == 0xff)
			return 1;
		msleep(10);
	} while (time_before(jiffies, end_time));

	return 0;
}
```

As you can see, the `jiffies` variable is very widely used in the Linux kernel [code](http://lxr.free-electrons.com/ident?i=jiffies). As I already wrote, we met yet another new time management related concept in the previous part - `clocksource`. We have only seen a short description of this concept and the API for clock source registration. Let's take a closer look at it in this part.

Introduction to `clocksource`
--------------------------------------------------------------------------------

The `clocksource` concept represents a generic API for clock source management in the Linux kernel. Why do we need a separate framework for this? Let's go back to the beginning. The `time` concept is fundamental to the Linux kernel, as it is to other operating system kernels, and timekeeping is one of the prerequisites for using it. For example, the Linux kernel must know and update the time elapsed since system startup, it must determine how long the current process has been running on every processor, and much more. Where can the Linux kernel get information about time? First of all, there is the Real Time Clock or [RTC](https://en.wikipedia.org/wiki/Real-time_clock), which is represented by a nonvolatile device. You can find a set of architecture-independent real time clock drivers in the Linux kernel in the [drivers/rtc](https://github.com/torvalds/linux/tree/master/drivers/rtc) directory. Besides this, each architecture can provide a driver for the architecture-dependent real time clock, for example - `CMOS/RTC` - [arch/x86/kernel/rtc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/rtc.c) for the [x86](https://en.wikipedia.org/wiki/X86) architecture.
The second source is the system timer - a timer that raises [interrupts](https://en.wikipedia.org/wiki/Interrupt) at a periodic rate. For example, on [IBM PC](https://en.wikipedia.org/wiki/IBM_Personal_Computer) compatibles it was the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer).

We already know that for timekeeping purposes we can use `jiffies` in the Linux kernel. The `jiffies` can be considered a read-only global variable which is updated with `HZ` frequency. We know that `HZ` is a compile-time kernel parameter whose reasonable range is from `100` to `1000` [Hz](https://en.wikipedia.org/wiki/Hertz). So it is guaranteed to provide an interface for time measurement with `1` - `10` millisecond resolution. Besides the standard `jiffies`, we saw the `refined_jiffies` clock source in the previous part that is based on the `i8253/i8254` [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) tick rate, which is about `1193182` Hz. So with `refined_jiffies` we can get a resolution of roughly `1` microsecond.

The availability of more precise techniques for measuring time intervals is hardware-dependent. We have just learned a little about `x86`-dependent timer hardware, but each architecture provides its own timer hardware, and earlier each architecture had its own timekeeping implementation. The solution to this problem is an abstraction layer and associated API in a common code framework for managing various clock sources, independent of the timer interrupt. This common code framework became the `clocksource` framework.
The generic timeofday and clock source management framework moved a lot of timekeeping code into the architecture-independent portion of the code, with the architecture-dependent portion reduced to defining and managing the low-level hardware pieces of clock sources. It takes a large amount of work to support time interval measurement on different architectures with different hardware, and it is very complex: the implementation of each clock related service is strongly associated with an individual hardware device, and as you can imagine, this resulted in similar implementations being duplicated across architectures.

Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. As we can see in the Linux kernel code, nanoseconds are currently the favorite choice for the time value units of a clock source. One of the main points of the clock source framework is to allow a user to select a clock source among a range of available hardware devices supporting clock functions when configuring the system, and to select, access and scale different clock sources.

The clocksource structure
--------------------------------------------------------------------------------

The fundamental data structure of the `clocksource` framework is the `clocksource` structure, defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file. We already saw some fields that are provided by the `clocksource` structure in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html).
Let's look at the full definition of this structure and try to describe all of its fields:

```C
struct clocksource {
	cycle_t (*read)(struct clocksource *cs);
	cycle_t mask;
	u32 mult;
	u32 shift;
	u64 max_idle_ns;
	u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
	struct arch_clocksource_data archdata;
#endif
	u64 max_cycles;
	const char *name;
	struct list_head list;
	int rating;
	int (*enable)(struct clocksource *cs);
	void (*disable)(struct clocksource *cs);
	unsigned long flags;
	void (*suspend)(struct clocksource *cs);
	void (*resume)(struct clocksource *cs);
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
	struct list_head wd_list;
	cycle_t cs_last;
	cycle_t wd_last;
#endif
	struct module *owner;
} ____cacheline_aligned;
```

We already saw the first field of the `clocksource` structure in the previous part - it is a pointer to the `read` function that returns the value of the counter behind this clock source. For example, the `jiffies` clock source uses the `jiffies_read` function to read the `jiffies` value:

```C
static struct clocksource clocksource_jiffies = {
	...
	.read = jiffies_read,
	...
};
```

where `jiffies_read` just returns:

```C
static cycle_t jiffies_read(struct clocksource *cs)
{
	return (cycle_t) jiffies;
}
```

Or the `read_tsc` function:

```C
static struct clocksource clocksource_tsc = {
	...
	.read = read_tsc,
	...
};
```

which reads the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter).

The next field is `mask`, which ensures that subtraction between counter values of non-`64-bit` counters does not need special overflow logic. After the `mask` field, we can see two fields: `mult` and `shift`. These fields are the basis of the math that converts time values specific to each clock source; in other words, these two fields help us convert the abstract machine time units of a counter to nanoseconds.
After these two fields we can see the `64`-bit `max_idle_ns` field, which represents the maximum idle time permitted by the clock source in nanoseconds. This field is needed for Linux kernels with the `CONFIG_NO_HZ` kernel configuration option enabled. This kernel configuration option allows the Linux kernel to run without a regular timer tick (we will see a full explanation of this in another part). The problem is that a dynamic tick allows the kernel to sleep for periods longer than a single tick; moreover, the sleep time could be unlimited. The `max_idle_ns` field represents this sleeping limit.

The next field after `max_idle_ns` is the `maxadj` field, which is the maximum adjustment value to `mult`. The main formula by which we convert cycles to nanoseconds:

```C
((u64) cycles * mult) >> shift;
```

is not `100%` accurate. Instead, the number is taken as close as possible to a nanosecond, and `maxadj` helps to correct this and allows the clocksource API to avoid `mult` values that might overflow when adjusted. The next four fields are pointers to functions:

* `enable` - optional function to enable the clock source;
* `disable` - optional function to disable the clock source;
* `suspend` - suspend function for the clock source;
* `resume` - resume function for the clock source.

The next field is `max_cycles`, and as we can understand from its name, this field represents the maximum cycle value before a potential overflow. And the last field, `owner`, is a reference to the kernel [module](https://en.wikipedia.org/wiki/Loadable_kernel_module) that owns the clock source. This is all. We just went through all the standard fields of the `clocksource` structure. But you may have noticed that we missed some fields of the `clocksource` structure. All the missed fields can be divided into two types: fields of the first type are already known to us.
For example, the `name` field represents the name of a `clocksource`, the `rating` field helps the Linux kernel to select the best clock source, etc. The second type is fields that depend on different Linux kernel configuration options. Let's look at these fields.

The first field is `archdata`. This field has the `arch_clocksource_data` type and depends on the `CONFIG_ARCH_CLOCKSOURCE_DATA` kernel configuration option. At the moment this field is used only on the [x86](https://en.wikipedia.org/wiki/X86) and [IA64](https://en.wikipedia.org/wiki/IA-64) architectures. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the `vDSO` clock mode:

```C
struct arch_clocksource_data {
	int vclock_mode;
};
```

on the `x86` architectures, where the `vDSO` clock mode can be one of:

```C
#define VCLOCK_NONE 0
#define VCLOCK_TSC 1
#define VCLOCK_HPET 2
#define VCLOCK_PVCLOCK 3
```

The last three fields, `wd_list`, `cs_last` and `wd_last`, depend on the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. First of all, let's try to understand what a `watchdog` is. In simple words, a watchdog is a timer that is used to detect computer malfunctions and recover from them. All three of these fields contain watchdog related data that is used by the `clocksource` framework. If we grep the Linux kernel source code, we will see that only the [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig#L54) kernel configuration file contains the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. So, why do `x86` and `x86_64` need a [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer)? You may already know that all `x86` processors have a special 64-bit register - the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter).
This register contains the number of [cycles](https://en.wikipedia.org/wiki/Clock_rate) since reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not see the initialization of the `watchdog` timer in this part; before that we must learn more about timers.

That's all. From this moment we know all the fields of the `clocksource` structure. This knowledge will help us to learn the insides of the `clocksource` framework.

New clock source registration
--------------------------------------------------------------------------------

We saw only one function of the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html). This function was `__clocksource_register`. It is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/tree/master/include/linux/clocksource.h) header file and, as we can understand from the function's name, its main point is to register a new clock source. If we look at the implementation of the `__clocksource_register` function, we will see that it just calls the `__clocksource_register_scale` function and returns its result:

```C
static inline int __clocksource_register(struct clocksource *cs)
{
	return __clocksource_register_scale(cs, 1, 0);
}
```

Before we look at the implementation of the `__clocksource_register_scale` function, note that `clocksource` provides an additional API for registering a new clock source:

```C
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
{
	return __clocksource_register_scale(cs, 1, hz);
}

static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
{
	return __clocksource_register_scale(cs, 1000, khz);
}
```

All of these functions do the same thing: they return the value of the `__clocksource_register_scale` function, but with a different set of parameters.
The `__clocksource_register_scale` function is defined in the [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c) source code file. To understand the difference between these functions, let's look at the parameters it receives, for example from the `clocksource_register_khz` function. As we can see, this function takes three parameters:

* `cs` - clock source to be installed;
* `scale` - scale factor of a clock source. In other words, if we multiply the value of this parameter by the frequency, we get the `hz` of a clock source;
* `freq` - clock source frequency divided by scale.

Now let's look at the implementation of the `__clocksource_register_scale` function:

```C
int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
{
	__clocksource_update_freq_scale(cs, scale, freq);
	mutex_lock(&clocksource_mutex);
	clocksource_enqueue(cs);
	clocksource_enqueue_watchdog(cs);
	clocksource_select();
	mutex_unlock(&clocksource_mutex);
	return 0;
}
```

First of all we can see that the `__clocksource_register_scale` function starts with a call to the `__clocksource_update_freq_scale` function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency, and if it was not passed as `zero`, we need to calculate the `mult` and `shift` parameters for the given clock source. Why do we need to check the value of the `frequency`? Actually, it can be zero. If you looked attentively at the implementation of the `__clocksource_register` function, you may have noticed that we passed the `frequency` as `0`. We do this only for some clock sources that have self-defined `mult` and `shift` parameters. Look at the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) and you will see that we saw the calculation of `mult` and `shift` for `jiffies` there.
The `__clocksource_update_freq_scale` function will do it for us for other clock sources.

So, at the start of the `__clocksource_update_freq_scale` function we check the value of the `frequency` parameter, and if it is not zero we need to calculate `mult` and `shift` for the given clock source. Let's look at the `mult` and `shift` calculation:

```C
void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
{
	u64 sec;

	if (freq) {
		sec = cs->mask;
		do_div(sec, freq);
		do_div(sec, scale);

		if (!sec)
			sec = 1;
		else if (sec > 600 && cs->mask > UINT_MAX)
			sec = 600;

		clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
				       NSEC_PER_SEC / scale, sec * scale);
	}
	...
	...
	...
}
```

Here we can see the calculation of the maximum number of seconds which we can run before the clock source counter overflows. First of all we fill the `sec` variable with the value of the clock source mask. Remember that a clock source's mask represents the maximum number of bits that are valid for the given clock source. After this, we can see two division operations: first we divide our `sec` variable by the clock source frequency and then by the scale factor. The `freq` parameter shows us how many timer interrupts will occur in one second. So we divide the `mask` value, which represents the maximum value of the counter (for example of the `jiffies` counter), by the frequency of the timer and get the maximum number of seconds for the certain clock source. The second division operation gives us the maximum number of seconds for the certain clock source depending on its scale factor, which can be `1` hertz or `1` kilohertz (10³ Hz).

After we have got the maximum number of seconds, we check this value and set it to `1` or `600` depending on the result at the next step. These values bound the maximum sleeping time for a clock source in seconds. In the next step we can see the call of `clocks_calc_mult_shift`.
The main point of this function is the calculation of the `mult` and `shift` values for a given clock source. At the end of the `__clocksource_update_freq_scale` function we check that the just calculated `mult` value of the given clock source will not cause an overflow after adjustment, update the `max_idle_ns` and `max_cycles` values of the given clock source with the maximum nanoseconds that can be converted to a clock source counter, and print the result to the kernel buffer:

```C
pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
		cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);
```

which we can see in the [dmesg](https://en.wikipedia.org/wiki/Dmesg) output:

```
$ dmesg | grep "clocksource:"
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns
```

After the `__clocksource_update_freq_scale` function finishes its work, we can return to the `__clocksource_register_scale` function, which will register the new clock source. We can see calls to the following three functions:

```C
mutex_lock(&clocksource_mutex);
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
mutex_unlock(&clocksource_mutex);
```

Note that before the first of them is called, we lock the `clocksource_mutex` [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion).
The point of the `clocksource_mutex` mutex is to protect the `curr_clocksource` variable, which represents the currently selected `clocksource`, and the `clocksource_list` variable, which represents the list that contains the registered `clocksources`. Now, let's look at these three functions.

The first function, `clocksource_enqueue`, is defined - like the other two - in the same source code [file](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c). It goes through all the already registered `clocksources`, or in other words through all elements of the `clocksource_list`, and tries to find the best place for the given `clocksource`:

```C
static void clocksource_enqueue(struct clocksource *cs)
{
	struct list_head *entry = &clocksource_list;
	struct clocksource *tmp;

	list_for_each_entry(tmp, &clocksource_list, list)
		if (tmp->rating >= cs->rating)
			entry = &tmp->list;
	list_add(&cs->list, entry);
}
```

In the end we just insert the new clocksource into the `clocksource_list`. The second function, `clocksource_enqueue_watchdog`, does almost the same as the previous function, but it inserts the new clock source into the `wd_list` depending on the flags of the clock source and starts a new [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer) timer. As I already wrote, we will not consider `watchdog` related stuff in this part, but will do so in the next parts.

The last function is `clocksource_select`. As we can understand from the function's name, the main point of this function is to select the best `clocksource` from the registered clocksources. This function consists only of a call to a helper function:

```C
static void clocksource_select(void)
{
	return __clocksource_select(false);
}
```

Note that the `__clocksource_select` function takes one parameter (`false` in our case). This [bool](https://en.wikipedia.org/wiki/Boolean_data_type) parameter shows how to traverse the `clocksource_list`.
In our case we pass `false`, which means that we will go through all entries of the `clocksource_list`. We already know that the `clocksource` with the best rating will be first in the `clocksource_list` after the call of the `clocksource_enqueue` function, so we can easily get it from this list. After we have found the clock source with the best rating, we switch to it:

```C
if (curr_clocksource != best && !timekeeping_notify(best)) {
	pr_info("Switched to clocksource %s\n", best->name);
	curr_clocksource = best;
}
```

We can see the result of this operation in the `dmesg` output:

```
$ dmesg | grep Switched
[    0.199688] clocksource: Switched to clocksource hpet
[    2.452966] clocksource: Switched to clocksource tsc
```

Note that we can see two clock sources in the `dmesg` output (`hpet` and `tsc` in our case). Yes, there can actually be many different clock sources on particular hardware. So the Linux kernel knows about all registered clock sources and switches to the clock source with a better rating each time a new clock source is registered.

If we look at the bottom of the [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c) source code file, we will see that it provides a [sysfs](https://en.wikipedia.org/wiki/Sysfs) interface. The main initialization occurs in the `init_clocksource_sysfs` function, which is called during the device `initcalls`.
Let's look at the implementation of the `init_clocksource_sysfs` function:

```C
static struct bus_type clocksource_subsys = {
	.name = "clocksource",
	.dev_name = "clocksource",
};

static int __init init_clocksource_sysfs(void)
{
	int error = subsys_system_register(&clocksource_subsys, NULL);

	if (!error)
		error = device_register(&device_clocksource);
	if (!error)
		error = device_create_file(
				&device_clocksource,
				&dev_attr_current_clocksource);
	if (!error)
		error = device_create_file(&device_clocksource,
					   &dev_attr_unbind_clocksource);
	if (!error)
		error = device_create_file(
				&device_clocksource,
				&dev_attr_available_clocksource);
	return error;
}
device_initcall(init_clocksource_sysfs);
```

First of all we can see that it registers a `clocksource` subsystem with a call to the `subsys_system_register` function. In other words, after the call of this function, we will have the following directory:

```
$ pwd
/sys/devices/system/clocksource
```

After this step, we can see the registration of the `device_clocksource` device, which is represented by the following structure:

```C
static struct device device_clocksource = {
	.id	= 0,
	.bus	= &clocksource_subsys,
};
```

and the creation of three files:

* `dev_attr_current_clocksource`;
* `dev_attr_unbind_clocksource`;
* `dev_attr_available_clocksource`.

These files provide information about the current clock source in the system and the available clock sources in the system, and an interface which allows unbinding the clock source.
After the `init_clocksource_sysfs` function has been executed, we are able to find information about the available clock sources in:

```
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
```

or, for example, information about the current clock source in the system:

```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
```

In the previous part, we saw the API for the registration of the `jiffies` clock source, but didn't dive into the details of the `clocksource` framework. In this part we did, and saw the implementation of new clock source registration and the selection of the clock source with the best rating value in the system. Of course, this is not all of the API that the `clocksource` framework provides. There are a couple of additional functions, like `clocksource_unregister` for removing a given clock source from the `clocksource_list`, etc. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested, you can find them in [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c).

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part of the chapter that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the following two concepts: `jiffies` and `clocksource`. In this part we saw some examples of `jiffies` usage and learned more details about the `clocksource` concept.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience.
If you found any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
-------------------------------------------------------------------------------

* [x86](https://en.wikipedia.org/wiki/X86)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [uptime](https://en.wikipedia.org/wiki/Uptime)
* [Ensoniq Soundscape Elite](https://en.wikipedia.org/wiki/Ensoniq_Soundscape_Elite)
* [RTC](https://en.wikipedia.org/wiki/Real-time_clock)
* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
* [IBM PC](https://en.wikipedia.org/wiki/IBM_Personal_Computer)
* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
* [Hz](https://en.wikipedia.org/wiki/Hertz)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
* [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [loadable kernel module](https://en.wikipedia.org/wiki/Loadable_kernel_module)
* [IA64](https://en.wikipedia.org/wiki/IA-64)
* [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer)
* [clock rate](https://en.wikipedia.org/wiki/Clock_rate)
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)

diff --git a/Timers/timers-3.md b/Timers/timers-3.md
new file mode 100644
index 0000000..6f78619
--- /dev/null
+++ b/Timers/timers-3.md
@@ -0,0 +1,444 @@
Timers and time management in the Linux kernel. Part 3.
================================================================================

The tick broadcast framework and dyntick
--------------------------------------------------------------------------------

This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) we stopped at the `clocksource` framework. We started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters, which we already saw in the first [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of this chapter, is `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. The previous step was the call of the:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

function, which is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file and performs the initialization of the `refined_jiffies` clock source for us. Recall that this function is called from the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and performs architecture-specific ([x86_64](https://en.wikipedia.org/wiki/X86-64) in our case) initialization. Look at the implementation of `setup_arch` and you will note that the call of `register_refined_jiffies` is the last step before `setup_arch` finishes its work.

There are many different `x86_64`-specific things already configured by the end of the `setup_arch` execution.
For example, some early [interrupt](https://en.wikipedia.org/wiki/Interrupt) handlers are already able to handle interrupts, memory space is reserved for the [initrd](https://en.wikipedia.org/wiki/Initrd), [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface) is scanned, the Linux kernel log buffer is already set up (which means that the [printk](https://en.wikipedia.org/wiki/Printk) function is able to work), [e820](https://en.wikipedia.org/wiki/E820) is parsed so the Linux kernel already knows about the available memory, and many many other architecture-specific things are done (if you are interested, you can read more about the `setup_arch` function and the Linux kernel initialization process in the second [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book).

Now that `setup_arch` has finished its work, we can get back to the generic Linux kernel code. Recall that the `setup_arch` function was called from the `start_kernel` function which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. So, we shall return to this function. You can see that many different functions are called right after the `setup_arch` function inside the `start_kernel` function, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function in `start_kernel` which is related to time management in the Linux kernel is:

```C
tick_init();
```

The `tick_init` function is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and does two things:

* Initialization of `tick broadcast` framework related data structures;
* Initialization of `full` tickless mode related data structures.
We haven't seen anything related to the `tick broadcast` framework in this book yet, and we don't know anything about tickless mode in the Linux kernel. So, the main point of this part is to look at these concepts and learn what they are.

The idle process
--------------------------------------------------------------------------------

First of all, let's look at the implementation of the `tick_init` function. As I already wrote, this function is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and consists of calls to the two following functions:

```C
void __init tick_init(void)
{
	tick_broadcast_init();
	tick_nohz_init();
}
```

As you can understand from the paragraph's title, we are interested only in the `tick_broadcast_init` function for now. This function is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file and initializes the `tick broadcast` framework related data structures. Before we look at the implementation of the `tick_broadcast_init` function and try to understand what this function does, we need to learn about the `tick broadcast` framework.

The main job of a central processor is to execute programs. But sometimes a processor is in a special state when it is not being used by any program. This special state is called [idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29). When the processor has nothing to execute, the Linux kernel launches the `idle` task. We already saw a little about this in the last part of the [Linux kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html).
When the Linux kernel finishes all initialization processes in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, it calls the `rest_init` function from the same source code file. The main point of this function is to launch the kernel `init` thread and the `kthreadd` thread, to call the `schedule` function to start task scheduling, and to go to sleep by calling the `cpu_idle_loop` function that is defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/kernel/sched/idle.c) source code file.

The `cpu_idle_loop` function is an infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the `idle` process finishes its work and control is moved to the new runnable task with the call of the `schedule_preempt_disabled` function:

```C
static void cpu_idle_loop(void)
{
	while (1) {
		while (!need_resched()) {
			...
			...
			...
			/* the main idle function */
			cpuidle_idle_call();
		}
		...
		...
		...
		schedule_preempt_disabled();
	}
}
```

Of course, we will not consider the full implementation of the `cpu_idle_loop` function and the details of the `idle` state in this part, because it is not related to our topic. But there is one interesting moment for us. We know that a processor can execute only one task at a time. How does the Linux kernel decide to reschedule and stop the `idle` process if the processor is executing an infinite loop in `cpu_idle_loop`? The answer is system timer interrupts. When an interrupt occurs, the processor stops the `idle` thread and transfers control to an interrupt handler. After the system timer interrupt handler has been handled, `need_resched` will return true and the Linux kernel will stop the `idle` process and transfer control to the current runnable task.
But handling of the system timer interrupts is not effective for [power management](https://en.wikipedia.org/wiki/Power_management), because if a processor is in the `idle` state, there is little point in sending it a system timer interrupt.
+
+By default, the `CONFIG_HZ_PERIODIC` kernel configuration option is enabled in the Linux kernel and tells the kernel to handle every interrupt of the system timer. To solve this problem, the Linux kernel provides two additional ways of managing scheduling-clock interrupts:
+
+The first is to omit scheduling-clock ticks on idle processors. To enable this behaviour in the Linux kernel, we need to enable the `CONFIG_NO_HZ_IDLE` kernel configuration option. This option allows the Linux kernel to avoid sending timer interrupts to idle processors. In this case periodic timer interrupts are replaced with on-demand interrupts. This mode is called `dyntick-idle` mode. But if the kernel does not handle interrupts of a system timer, how can the kernel decide if the system has nothing to do?
+
+Whenever the idle task is selected to run, the periodic tick is disabled with the call of the `tick_nohz_idle_enter` function which is defined in the [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c) source code file and enabled again with the call of the `tick_nohz_idle_exit` function. There is a special concept in the Linux kernel called `clock event devices` which is used to schedule the next interrupt. This concept provides an API for devices which can deliver interrupts at a specific time in the future and is represented by the `clock_event_device` structure in the Linux kernel. We will not dive into the implementation of the `clock_event_device` structure now. We will see it in the next part of this chapter. But there is one interesting moment for us right now.
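To see why the `dyntick-idle` mode matters, consider a toy model of interrupt counts over a window of ticks. This is only an illustrative sketch with made-up function names, not kernel code: in periodic mode the system timer fires on every tick whether or not there is work, while in one-shot mode the clock event device is programmed only for the pending expiries:

```C
#include <assert.h>
#include <stddef.h>

/* Interrupts taken over a window of `window` ticks in periodic mode:
 * the system timer fires on every tick. */
static size_t periodic_interrupts(size_t window)
{
	return window;
}

/* Interrupts taken in a one-shot ("dyntick") model: the clock event
 * device is reprogrammed for the next pending expiry only, so an idle
 * processor takes one interrupt per event instead of one per tick. */
static size_t oneshot_interrupts(const size_t *expiries, size_t n, size_t window)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (expiries[i] < window)
			count++;
	return count;
}
```

With only three pending events over a window of a thousand ticks, the periodic model takes a thousand interrupts while the one-shot model takes three, which is the whole motivation for omitting ticks on idle processors.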
+
+The second way is to omit scheduling-clock ticks on processors that are either in the `idle` state or have only one runnable task, in other words on busy processors too. We can enable this feature with the `CONFIG_NO_HZ_FULL` kernel configuration option and it allows reducing the number of timer interrupts significantly.
+
+Besides the `cpu_idle_loop`, an idle processor can be in a sleeping state. The Linux kernel provides the special `cpuidle` framework. The main point of this framework is to put an idle processor into sleeping states. The name of the set of these states is `C-states`. But how will a processor be woken up if the local timer is disabled? The Linux kernel provides the `tick broadcast` framework for this. The main point of this framework is to assign a timer which is not affected by the `C-states`. This timer will wake a sleeping processor.
+
+Now, after some theory we can return to the implementation of our function. Let's recall that the `tick_init` function just calls the two following functions:
+
+```C
+void __init tick_init(void)
+{
+	tick_broadcast_init();
+	tick_nohz_init();
+}
+```
+
+Let's consider the first one. The `tick_broadcast_init` function is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file and executes initialization of the `tick broadcast` framework related data structures.
Let's look at the implementation of the `tick_broadcast_init` function:
+
+```C
+void __init tick_broadcast_init(void)
+{
+	zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&tick_broadcast_on, GFP_NOWAIT);
+	zalloc_cpumask_var(&tmpmask, GFP_NOWAIT);
+#ifdef CONFIG_TICK_ONESHOT
+	zalloc_cpumask_var(&tick_broadcast_oneshot_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&tick_broadcast_pending_mask, GFP_NOWAIT);
+	zalloc_cpumask_var(&tick_broadcast_force_mask, GFP_NOWAIT);
+#endif
+}
+```
+
+As we can see, the `tick_broadcast_init` function allocates different [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function is defined in the [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c) source code file and expands to the call of the following function:
+
+```C
+bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
+{
+	return alloc_cpumask_var(mask, flags | __GFP_ZERO);
+}
+```
+
+Ultimately, the memory space will be allocated for the given `cpumask` with the given flags with the help of the `kmalloc_node` function:
+
+```C
+*mask = kmalloc_node(cpumask_size(), flags, node);
+```
+
+Now let's look at the `cpumasks` that will be initialized in the `tick_broadcast_init` function. As we can see, the `tick_broadcast_init` function will initialize six `cpumasks`, and moreover, the initialization of the last three of them depends on the `CONFIG_TICK_ONESHOT` kernel configuration option.
+
+The first three `cpumasks` are:
+
+* `tick_broadcast_mask` - the bitmap which represents the list of processors that are in a sleeping mode;
+* `tick_broadcast_on` - the bitmap that stores the numbers of processors which are in a periodic broadcast state;
+* `tmpmask` - a bitmap for temporary usage.
+
+As we already know, the next three `cpumasks` depend on the `CONFIG_TICK_ONESHOT` kernel configuration option.
Actually each clock event device can be in one of two modes:
+
+* `periodic` - clock event devices that support periodic events;
+* `oneshot` - clock event devices that are capable of issuing events that happen only once.
+
+The Linux kernel defines two masks for such clock event devices in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file:
+
+```C
+#define CLOCK_EVT_FEAT_PERIODIC	0x000001
+#define CLOCK_EVT_FEAT_ONESHOT	0x000002
+```
+
+So, the last three `cpumasks` are:
+
+* `tick_broadcast_oneshot_mask` - stores the numbers of processors that must be notified;
+* `tick_broadcast_pending_mask` - stores the numbers of processors with a pending broadcast;
+* `tick_broadcast_force_mask` - stores the numbers of processors with an enforced broadcast.
+
+We have initialized six `cpumasks` in the `tick broadcast` framework, and now we can proceed to the implementation of this framework.
+
+The `tick broadcast` framework
+--------------------------------------------------------------------------------
+
+Hardware may provide some clock source devices. When a processor sleeps and its local timer is stopped, there must be an additional clock source device that will handle the awakening of the processor. The Linux kernel uses these `special` clock source devices which can raise an interrupt at a specified time. We already know that such timers are called `clock events` devices in the Linux kernel. Besides `clock events` devices, each processor in the system has its own local timer which is programmed to issue an interrupt at the time of the next deferred task. Also these timers can be programmed to do a periodic job, such as updating `jiffies`. These timers are represented by the `tick_device` structure in the Linux kernel.
This structure is defined in the [kernel/time/tick-sched.h](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.h) header file and looks like:
+
+```C
+struct tick_device {
+	struct clock_event_device *evtdev;
+	enum tick_device_mode mode;
+};
+```
+
+Note that the `tick_device` structure contains two fields. The first field - `evtdev` - is a pointer to the `clock_event_device` structure which is defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and represents the descriptor of a clock event device. A `clock event` device allows registering an event that will happen in the future. As I already wrote, we will not consider the `clock_event_device` structure and its related API in this part, but will see it in the next part.
+
+The second field of the `tick_device` structure represents the mode of the `tick_device`. As we already know, the mode can be one of the:
+
+```C
+enum tick_device_mode {
+	TICKDEV_MODE_PERIODIC,
+	TICKDEV_MODE_ONESHOT,
+};
+```
+
+Each `clock events` device in the system registers itself by the call of the `clockevents_register_device` function or the `clockevents_config_and_register` function during the initialization process of the Linux kernel. During the registration of a new `clock events` device, the Linux kernel calls the `tick_check_new_device` function which is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and checks whether the given `clock events` device should be used by the Linux kernel. After all checks, the `tick_check_new_device` function executes a call of the:
+
+```C
+tick_install_broadcast_device(newdev);
+```
+
+function which checks whether the given `clock event` device can be the broadcast device and installs it if so.
Let's look at the implementation of the `tick_install_broadcast_device` function:
+
+```C
+void tick_install_broadcast_device(struct clock_event_device *dev)
+{
+	struct clock_event_device *cur = tick_broadcast_device.evtdev;
+
+	if (!tick_check_broadcast_device(cur, dev))
+		return;
+
+	if (!try_module_get(dev->owner))
+		return;
+
+	clockevents_exchange_device(cur, dev);
+
+	if (cur)
+		cur->event_handler = clockevents_handle_noop;
+
+	tick_broadcast_device.evtdev = dev;
+
+	if (!cpumask_empty(tick_broadcast_mask))
+		tick_broadcast_start_periodic(dev);
+
+	if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
+		tick_clock_notify();
+}
+```
+
+First of all we get the current `clock event` device from the `tick_broadcast_device`. The `tick_broadcast_device` is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file:
+
+```C
+static struct tick_device tick_broadcast_device;
+```
+
+and represents the external clock device that keeps track of events for a processor. The first step after we have got the current clock device is the call of the `tick_check_broadcast_device` function which checks whether a given clock events device can be utilized as the broadcast device. The main point of the `tick_check_broadcast_device` function is to check the value of the `features` field of the given `clock events` device. As we can understand from the name of this field, the `features` field contains the clock event device features. The available values are defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file, for example `CLOCK_EVT_FEAT_PERIODIC` which represents a clock events device that supports periodic events. The `tick_check_broadcast_device` function checks the `features` flags for `CLOCK_EVT_FEAT_ONESHOT`, `CLOCK_EVT_FEAT_DUMMY` and other flags and returns `false` if the given clock events device has one of these features.
Otherwise the `tick_check_broadcast_device` function compares the `rating` of the given clock event device with that of the current clock event device and keeps the best one.
+
+After the `tick_check_broadcast_device` function, we can see the call of the `try_module_get` function which checks the module owner of the clock events device. We need to do this to be sure that the given `clock events` device was correctly initialized. The next step is the call of the `clockevents_exchange_device` function which is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and releases the old clock events device and replaces the previous functional handler with a dummy handler.
+
+In the last step of the `tick_install_broadcast_device` function we check that the `tick_broadcast_mask` is not empty and start the given `clock events` device in periodic mode with the call of the `tick_broadcast_start_periodic` function:
+
+```C
+if (!cpumask_empty(tick_broadcast_mask))
+	tick_broadcast_start_periodic(dev);
+
+if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
+	tick_clock_notify();
+```
+
+The `tick_broadcast_mask` is filled in the `tick_device_uses_broadcast` function which checks a `clock events` device during its registration (the `cpu` argument is the number of the current processor obtained with `smp_processor_id` at the call site):
+
+```C
+int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
+{
+	...
+	...
+	...
+	if (!tick_device_is_functional(dev)) {
+		...
+		cpumask_set_cpu(cpu, tick_broadcast_mask);
+		...
+	}
+	...
+	...
+	...
+}
+```
+
+You can read more about the `smp_processor_id` macro in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter.
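The `cpumask` helpers used above (`cpumask_set_cpu`, `cpumask_and` and so on) are essentially bit operations on arrays of machine words. A much simplified single-word sketch follows; the `toy_` names are made up for this illustration, and the real kernel uses arrays of `unsigned long` so it is not limited to 64 processors:

```C
#include <assert.h>
#include <stdint.h>

/* Toy single-word "cpumask": bit n set means processor n is in the mask. */
typedef uint64_t toy_cpumask;

/* Analogue of cpumask_set_cpu(): mark the given processor in the mask. */
static void toy_cpumask_set_cpu(int cpu, toy_cpumask *mask)
{
	*mask |= (uint64_t)1 << cpu;
}

/* Analogue of cpumask_test_cpu(): is the given processor in the mask? */
static int toy_cpumask_test_cpu(int cpu, const toy_cpumask *mask)
{
	return (int)((*mask >> cpu) & 1);
}

/* Analogue of cpumask_and(): processors present in both masks. */
static toy_cpumask toy_cpumask_and(toy_cpumask a, toy_cpumask b)
{
	return a & b;
}
```

For example, setting processor `2` in a toy broadcast mask and intersecting it with an "online" mask of the first four processors leaves only bit `2` set, which is exactly the `cpu_online_mask & tick_broadcast_mask` step we will see later in `tick_do_periodic_broadcast`.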
+
+The `tick_broadcast_start_periodic` function checks the given `clock event` device and calls the `tick_setup_periodic` function:
+
+```C
+static void tick_broadcast_start_periodic(struct clock_event_device *bc)
+{
+	if (bc)
+		tick_setup_periodic(bc, 1);
+}
+```
+
+which is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and sets the broadcast handler for the given `clock event` device by the call of the following function:
+
+```C
+tick_set_periodic_handler(dev, broadcast);
+```
+
+This function checks the second parameter which represents the broadcast state (`on` or `off`) and sets the broadcast handler depending on its value:
+
+```C
+void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
+{
+	if (!broadcast)
+		dev->event_handler = tick_handle_periodic;
+	else
+		dev->event_handler = tick_handle_periodic_broadcast;
+}
+```
+
+When a `clock event` device issues an interrupt, the `dev->event_handler` will be called. For example, let's look at the interrupt handler of the [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) which is located in the [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c) source code file:
+
+```C
+static irqreturn_t hpet_interrupt_handler(int irq, void *data)
+{
+	struct hpet_dev *dev = (struct hpet_dev *)data;
+	struct clock_event_device *hevt = &dev->evt;
+
+	if (!hevt->event_handler) {
+		printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
+				dev->num);
+		return IRQ_HANDLED;
+	}
+
+	hevt->event_handler(hevt);
+	return IRQ_HANDLED;
+}
+```
+
+The `hpet_interrupt_handler` gets the [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) specific data and calls the event handler of the `clock event` device - the one we just set in the `tick_set_periodic_handler` function.
So the `tick_handle_periodic_broadcast` function will be called at the end of the high precision event timer interrupt handler.
+
+The `tick_handle_periodic_broadcast` function calls the
+
+```C
+bc_local = tick_do_periodic_broadcast();
+```
+
+function which stores the numbers of processors which have asked to be woken up in the temporary `cpumask` and calls the `tick_do_broadcast` function:
+
+```C
+cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask);
+return tick_do_broadcast(tmpmask);
+```
+
+The `tick_do_broadcast` function calls the `broadcast` function of the given clock events device, which sends an [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt) interrupt to the given set of processors. In the end we can call the event handler of the given `tick_device`:
+
+```C
+if (bc_local)
+	td->evtdev->event_handler(td->evtdev);
+```
+
+which actually represents the interrupt handler of the local timer of a processor. After this a processor will wake up. That is all about the `tick broadcast` framework in the Linux kernel. We have missed some aspects of this framework, for example reprogramming of a `clock event` device, broadcast with the oneshot timer and so on. But the Linux kernel is very big and it is not realistic to cover all aspects of it. I think it will be interesting for you to dive into it yourself.
+
+If you remember, we started this part with the call of the `tick_init` function. We have just considered the `tick_broadcast_init` function and related theory, but the `tick_init` function contains a call of one more function, and this function is `tick_nohz_init`. Let's look at the implementation of this function.
+
+Initialization of dyntick related data structures
+--------------------------------------------------------------------------------
+
+We already saw some information about the `dyntick` concept in this part and we know that this concept allows the kernel to disable system timer interrupts in the `idle` state.
The `tick_nohz_init` function initializes the different data structures which are related to this concept. This function is defined in the [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c) source code file and starts by checking the value of the `tick_nohz_full_running` variable which represents the state of the tick-less mode for the `idle` state as well as the state when system timer interrupts are disabled while a processor has only one runnable task:
+
+```C
+if (!tick_nohz_full_running) {
+	if (tick_nohz_init_all() < 0)
+		return;
+}
+```
+
+If this mode is not running we call the `tick_nohz_init_all` function which is defined in the same source code file and check its result. The `tick_nohz_init_all` function tries to allocate the `tick_nohz_full_mask` with the call of `alloc_cpumask_var` which will allocate space for it. The `tick_nohz_full_mask` will store the numbers of processors that have full `NO_HZ` enabled.
After successful allocation of the `tick_nohz_full_mask` we set all of its bits, set the `tick_nohz_full_running` variable and return the result to the `tick_nohz_init` function:
+
+```C
+static int tick_nohz_init_all(void)
+{
+	int err = -1;
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
+		return err;
+	}
+	err = 0;
+	cpumask_setall(tick_nohz_full_mask);
+	tick_nohz_full_running = true;
+#endif
+	return err;
+}
+```
+
+In the next step we try to allocate memory space for the `housekeeping_mask`:
+
+```C
+if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
+	WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
+	cpumask_clear(tick_nohz_full_mask);
+	tick_nohz_full_running = false;
+	return;
+}
+```
+
+This `cpumask` will store the numbers of processors used for `housekeeping`; in other words we need at least one processor that will not be in `NO_HZ` mode, because it will do timekeeping and so on. After this we check the result of the architecture-specific `arch_irq_work_has_interrupt` function. This function checks the ability to send inter-processor interrupts on the given architecture. We need to check this because the system timer of a processor will be disabled during `NO_HZ` mode, so there must be at least one online processor which can send an inter-processor interrupt to awaken a sleeping processor.
This function is defined in the [arch/x86/include/asm/irq_work.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_work.h) header file for [x86_64](https://en.wikipedia.org/wiki/X86-64) and just checks that a processor has an [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) based on the [CPUID](https://en.wikipedia.org/wiki/CPUID) information:
+
+```C
+static inline bool arch_irq_work_has_interrupt(void)
+{
+	return cpu_has_apic;
+}
+```
+
+If a processor does not have an `APIC`, the Linux kernel prints a warning message, clears the `tick_nohz_full_mask` cpumask, copies the numbers of all possible processors in the system to the `housekeeping_mask` and resets the value of the `tick_nohz_full_running` variable:
+
+```C
+if (!arch_irq_work_has_interrupt()) {
+	pr_warning("NO_HZ: Can't run full dynticks because arch doesn't "
+		   "support irq work self-IPIs\n");
+	cpumask_clear(tick_nohz_full_mask);
+	cpumask_copy(housekeeping_mask, cpu_possible_mask);
+	tick_nohz_full_running = false;
+	return;
+}
+```
+
+After this step, we get the number of the current processor by the call of `smp_processor_id` and look for this processor in the `tick_nohz_full_mask`. If the `tick_nohz_full_mask` contains the given processor we clear the appropriate bit in the `tick_nohz_full_mask`, because this processor will be used for timekeeping:
+
+```C
+cpu = smp_processor_id();
+
+if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
+	pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu);
+	cpumask_clear_cpu(cpu, tick_nohz_full_mask);
+}
+```
+
+After this step we put the numbers of all processors that are in the `cpu_possible_mask` but not in the `tick_nohz_full_mask` into the `housekeeping_mask`:
+
+```C
+cpumask_andnot(housekeeping_mask,
+	       cpu_possible_mask, tick_nohz_full_mask);
+```
+
+After this operation, the `housekeeping_mask` will contain all processors of the system except the processor used for timekeeping.
In the last step of the `tick_nohz_init_all` function, we go through all processors that are set in the `tick_nohz_full_mask` and call the following function for each of them:
+
+```C
+for_each_cpu(cpu, tick_nohz_full_mask)
+	context_tracking_cpu_set(cpu);
+```
+
+The `context_tracking_cpu_set` function is defined in the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/master/kernel/context_tracking.c) source code file and the main point of this function is to set the `context_tracking.active` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `true`. When the `active` field is set to `true` for a certain processor, all [context switches](https://en.wikipedia.org/wiki/Context_switch) will be ignored by the Linux kernel context tracking subsystem for this processor.
+
+That's all. This is the end of the `tick_nohz_init` function. After it, the `NO_HZ` related data structures are initialized. We didn't see the API of the `NO_HZ` mode, but we will see it soon.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the third part of the chapter that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the `clocksource` concept in the Linux kernel which is a framework for managing different clock sources in an interrupt- and hardware-independent way. In this part we continued to look at the Linux kernel initialization process in a time management context and got acquainted with two new concepts: the `tick broadcast` framework and the `tick-less` mode. The first concept helps the Linux kernel to deal with processors which are in deep sleep and the second represents a mode in which the kernel may work to improve power management of `idle` processors.
+
+In the next part we will continue to dive into timer management related things in the Linux kernel and will see a new concept - `timers`.
+
+If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+-------------------------------------------------------------------------------
+
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [initrd](https://en.wikipedia.org/wiki/Initrd)
+* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
+* [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface)
+* [printk](https://en.wikipedia.org/wiki/Printk)
+* [CPU idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29)
+* [power management](https://en.wikipedia.org/wiki/Power_management)
+* [NO_HZ documentation](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt)
+* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
+* [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
+* [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt)
+* [CPUID](https://en.wikipedia.org/wiki/CPUID)
+* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
+* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
+* [context switches](https://en.wikipedia.org/wiki/Context_switch)
+* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html)
diff --git a/Timers/timers-4.md b/Timers/timers-4.md
new file mode 100644
index 0000000..e0b593e
--- /dev/null
+++ 
b/Timers/timers-4.md
@@ -0,0 +1,427 @@
+Timers and time management in the Linux kernel. Part 4.
+================================================================================
+
+Timers
+--------------------------------------------------------------------------------
+
+This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html) we learned about the `tick broadcast` framework and `NO_HZ` mode in the Linux kernel. We will continue to dive into time management related stuff in the Linux kernel in this part and will be acquainted with yet another concept in the Linux kernel - `timers`. Before we look at timers in the Linux kernel, we have to learn some theory about this concept. Note that we will consider software timers in this part.
+
+The Linux kernel provides a `software timer` concept to allow kernel functions to be invoked at a future moment. Timers are widely used in the Linux kernel. For example, look in the [net/netfilter/ipset/ip_set_list_set.c](https://github.com/torvalds/linux/blob/master/net/netfilter/ipset/ip_set_list_set.c) source code file. This source code file provides an implementation of the framework for managing groups of [IP](https://en.wikipedia.org/wiki/Internet_Protocol) addresses.
+
+We can find the `list_set` structure that contains a `gc` field in this source code file:
+
+```C
+struct list_set {
+	...
+	struct timer_list gc;
+	...
+};
+```
+
+Note that the `gc` field has the `timer_list` type. This structure is defined in the [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h) header file and its main point is to store `dynamic` timers in the Linux kernel. Actually, the Linux kernel provides two types of timers called dynamic timers and interval timers.
The first type of timers is used by the kernel, and the second can be used by user mode. The `timer_list` structure contains actual `dynamic` timers. The `gc` timer from our `list_set` example represents a timer for garbage collection. This timer will be initialized in the `list_set_gc_init` function:
+
+```C
+static void
+list_set_gc_init(struct ip_set *set, void (*gc)(unsigned long ul_set))
+{
+	struct list_set *map = set->data;
+	...
+	...
+	...
+	map->gc.function = gc;
+	map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
+	...
+	...
+	...
+}
+```
+
+The function pointed to by the `gc` pointer will be called after the timeout which is equal to `map->gc.expires`.
+
+Ok, we will not dive into this example with [netfilter](https://en.wikipedia.org/wiki/Netfilter), because this chapter is not about [network](https://en.wikipedia.org/wiki/Computer_network) related stuff. But we saw that timers are widely used in the Linux kernel and learned that they represent a concept which allows functions to be called in the future.
+
+Now let's continue to research the source code of the Linux kernel which is related to timers and time management stuff as we did in all previous chapters.
+
+Introduction to dynamic timers in the Linux kernel
+--------------------------------------------------------------------------------
+
+As I already wrote, we learned about the `tick broadcast` framework and `NO_HZ` mode in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html). They are initialized in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file by the call of the `tick_init` function.
If we look at this source code file, we will see that the next time management related function is:
+
+```C
+init_timers();
+```
+
+This function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and contains calls of four functions:
+
+```C
+void __init init_timers(void)
+{
+	init_timer_cpus();
+	init_timer_stats();
+	timer_register_cpu_notifier();
+	open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
+}
+```
+
+Let's look at the implementation of each function. The first function is `init_timer_cpus`, defined in the [same](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file, which just calls the `init_timer_cpu` function for each possible processor in the system:
+
+```C
+static void __init init_timer_cpus(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		init_timer_cpu(cpu);
+}
+```
+
+If you do not know or do not remember what a `possible` cpu is, you can read the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) of this book which describes the `cpumask` concept in the Linux kernel. In short, a `possible` processor is a processor which can be plugged in at any time during the life of the system.
+
+The `init_timer_cpu` function does the main work for us, namely it executes initialization of the `tvec_base` structure for each processor. This structure is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and stores data related to `dynamic` timers for a certain processor.
Let's look at the definition of this structure:
+
+```C
+struct tvec_base {
+	spinlock_t lock;
+	struct timer_list *running_timer;
+	unsigned long timer_jiffies;
+	unsigned long next_timer;
+	unsigned long active_timers;
+	unsigned long all_timers;
+	int cpu;
+	bool migration_enabled;
+	bool nohz_active;
+	struct tvec_root tv1;
+	struct tvec tv2;
+	struct tvec tv3;
+	struct tvec tv4;
+	struct tvec tv5;
+} ____cacheline_aligned;
+```
+
+The `tvec_base` structure contains the following fields: the `lock` for `tvec_base` protection; the `running_timer` field which points to the currently running timer for the certain processor; the `timer_jiffies` field which represents the earliest expiration time (it will be used by the Linux kernel to find already expired timers). The next field - `next_timer` - contains the next pending timer for the next timer [interrupt](https://en.wikipedia.org/wiki/Interrupt) in the case when a processor goes to sleep and the `NO_HZ` mode is enabled in the Linux kernel. The `active_timers` field provides accounting of non-deferrable timers, in other words all timers that will not be stopped while a processor is sleeping. The `all_timers` field tracks the total number of timers, that is `active_timers` plus deferrable timers. The `cpu` field represents the number of the processor which owns these timers. The `migration_enabled` and `nohz_active` fields represent the possibility of timer migration to another processor and the status of the `NO_HZ` mode respectively.
+
+The last five fields of the `tvec_base` structure represent lists of dynamic timers. The first `tv1` field has the:
+
+```C
+#define TVR_SIZE (1 << TVR_BITS)
+#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
+
+...
+...
+...
+
+struct tvec_root {
+	struct hlist_head vec[TVR_SIZE];
+};
+```
+
+type.
Note that the value of `TVR_SIZE` depends on the `CONFIG_BASE_SMALL` kernel configuration option:
+
+![base small](http://s17.postimg.org/db3towlu7/base_small.png)
+
+which, when enabled, reduces the size of the kernel data structures. The `tv1` field is an array that may contain `64` or `256` elements where each element represents a dynamic timer that will expire within the next `255` system timer interrupts. The next three fields: `tv2`, `tv3` and `tv4` are lists with dynamic timers too, but they store dynamic timers which will expire within the next `2^14 - 1`, `2^20 - 1` and `2^26 - 1` system timer interrupts respectively. The last `tv5` field represents the list which stores dynamic timers with a larger expiration period.
+
+So, now that we have seen the `tvec_base` structure and the description of its fields, we can look at the implementation of the `init_timer_cpu` function. As I already wrote, this function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and executes initialization of the `tvec_bases`:
+
+```C
+static void __init init_timer_cpu(int cpu)
+{
+	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
+
+	base->cpu = cpu;
+	spin_lock_init(&base->lock);
+
+	base->timer_jiffies = jiffies;
+	base->next_timer = base->timer_jiffies;
+}
+```
+
+The `tvec_bases` is a [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable which represents the main data structure for dynamic timers on a given processor. This `per-cpu` variable is defined in the same source code file:
+
+```C
+static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
+```
+
+First of all we get the address of the `tvec_bases` for the given processor into the `base` variable, and once we have it, we initialize some of the `tvec_base` fields in the `init_timer_cpu` function.
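The `tv1` ... `tv5` bounds described above can be illustrated with a small sketch that picks a wheel level from the distance between a timer's expiry and `timer_jiffies`. This is a simplified illustration of the index calculation the kernel performs when a timer is added, assuming `CONFIG_BASE_SMALL` is disabled (so `TVR_BITS = 8` and `TVN_BITS = 6`); the `timer_wheel_level` function name is made up for this sketch:

```C
#include <assert.h>

#define TVR_BITS 8
#define TVN_BITS 6

/* Return which list (1..5 for tv1..tv5) a timer with the given absolute
 * expiry would land in, given the base's earliest expiration time. */
static int timer_wheel_level(unsigned long expires, unsigned long timer_jiffies)
{
	unsigned long idx = expires - timer_jiffies;

	if (idx < (1UL << TVR_BITS))                  /* < 256 ticks away  */
		return 1;
	if (idx < (1UL << (TVR_BITS + TVN_BITS)))     /* < 2^14 ticks away */
		return 2;
	if (idx < (1UL << (TVR_BITS + 2 * TVN_BITS))) /* < 2^20 ticks away */
		return 3;
	if (idx < (1UL << (TVR_BITS + 3 * TVN_BITS))) /* < 2^26 ticks away */
		return 4;
	return 5;                                     /* very long timeouts */
}
```

So a timer a hundred ticks away lands in `tv1`, a timer a thousand ticks away in `tv2`, and so on; the further away an expiry is, the coarser the list it is stored in.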
After the initialization of the `per-cpu` dynamic timers with the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) and the number of a possible processor, we need to initialize the `tstats_lookup_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) in the `init_timer_stats` function:

```C
void __init init_timer_stats(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		raw_spin_lock_init(&per_cpu(tstats_lookup_lock, cpu));
}
```

The `tstats_lookup_lock` variable represents a `per-cpu` raw spinlock:

```C
static DEFINE_PER_CPU(raw_spinlock_t, tstats_lookup_lock);
```

which will be used to protect operations on timer statistics that can be accessed through [procfs](https://en.wikipedia.org/wiki/Procfs):

```C
static int __init init_tstats_procfs(void)
{
	struct proc_dir_entry *pe;

	pe = proc_create("timer_stats", 0644, NULL, &tstats_fops);
	if (!pe)
		return -ENOMEM;
	return 0;
}
```

For example:

```
$ cat /proc/timer_stats
Timerstats sample period: 3.888770 s
  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  15,     1 swapper          hcd_submit_urb (rh_timer_func)
   4,   959 kedac            schedule_timeout (process_timeout)
   1,     0 swapper          page_writeback_init (wb_timer_fn)
  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
 ...
 ...
 ...
```

The next step after the initialization of the `tstats_lookup_lock` spinlock is the call of the `timer_register_cpu_notifier` function. This function depends on the `CONFIG_HOTPLUG_CPU` kernel configuration option which enables support for [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) processors in the Linux kernel.

When a processor is logically offlined, a notification is sent to the Linux kernel with the `CPU_DEAD` or the `CPU_DEAD_FROZEN` event by the call of the `cpu_notifier` macro:

```C
#ifdef CONFIG_HOTPLUG_CPU
...
...
static inline void timer_register_cpu_notifier(void)
{
	cpu_notifier(timer_cpu_notify, 0);
}
...
...
#else
...
...
static inline void timer_register_cpu_notifier(void) { }
...
...
#endif /* CONFIG_HOTPLUG_CPU */
```

In this case the `timer_cpu_notify` function will be called. It checks the event type and calls the `migrate_timers` function:

```C
static int timer_cpu_notify(struct notifier_block *self,
			    unsigned long action, void *hcpu)
{
	switch (action) {
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		migrate_timers((long)hcpu);
		break;
	default:
		break;
	}

	return NOTIFY_OK;
}
```

This chapter will not describe `hotplug` related events in the Linux kernel source code, but if you are interested in such things, you can find the implementation of the `migrate_timers` function in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file.

The last step in the `init_timers` function is the call of the:

```C
open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
```

function. The `open_softirq` function may already be familiar to you if you have read the ninth [part](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) about interrupts and interrupt handling in the Linux kernel. In short, the `open_softirq` function is defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) source code file and performs the initialization of a deferred interrupt handler.

In our case the deferred function is the `run_timer_softirq` function, which will be called after a hardware interrupt in the `do_IRQ` function which is defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c) source code file. The main point of this function is to handle software dynamic timers.
The Linux kernel does not do this during the hardware timer interrupt handling itself, because it is a time consuming operation.

Let's look at the implementation of the `run_timer_softirq` function:

```C
static void run_timer_softirq(struct softirq_action *h)
{
	struct tvec_base *base = this_cpu_ptr(&tvec_bases);

	if (time_after_eq(jiffies, base->timer_jiffies))
		__run_timers(base);
}
```

At the beginning of the `run_timer_softirq` function we get the `dynamic` timer base for the current processor and compare the current value of [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) with the value of `timer_jiffies` for this structure by the call of the `time_after_eq` macro, which is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file:

```C
#define time_after_eq(a,b)          \
	(typecheck(unsigned long, a) && \
	 typecheck(unsigned long, b) && \
	 ((long)((a) - (b)) >= 0))
```

Recall that the `timer_jiffies` field of the `tvec_base` structure represents the relative time when functions delayed by the given timer will be executed. So we compare these two values and if the current time represented by `jiffies` is greater than or equal to `base->timer_jiffies`, we call the `__run_timers` function that is defined in the same source code file. Let's look at the implementation of this function.

As I just wrote, the `__run_timers` function runs all expired timers for a given processor. This function starts by acquiring the `tvec_base`'s lock to protect the `tvec_base` structure:

```C
static inline void __run_timers(struct tvec_base *base)
{
	struct timer_list *timer;

	spin_lock_irq(&base->lock);
	...
	...
	...
	spin_unlock_irq(&base->lock);
}
```

After this it starts a loop that runs while `timer_jiffies` does not exceed `jiffies`:

```C
while (time_after_eq(jiffies, base->timer_jiffies)) {
	...
	...
	...
}
```

We can find many different manipulations in this loop, but the main point is to find expired timers and call the delayed functions. First of all we need to calculate the `index` in the `base->tv1` list that stores the next timer to be handled, with the following expression:

```C
index = base->timer_jiffies & TVR_MASK;
```

where the `TVR_MASK` is a mask for getting the `tvec_root->vec` elements. Once we have the index of the next timer which must be handled, we check its value. If the index is zero, we go through all lists in our cascade table - `tv2`, `tv3` and so on - and rehash them with the call of the `cascade` function:

```C
if (!index &&
	(!cascade(base, &base->tv2, INDEX(0))) &&
		(!cascade(base, &base->tv3, INDEX(1))) &&
			!cascade(base, &base->tv4, INDEX(2)))
	cascade(base, &base->tv5, INDEX(3));
```

After this we increase the value of the `base->timer_jiffies`:

```C
++base->timer_jiffies;
```

In the last step we execute the corresponding function for each timer from the list in the following loop:

```C
hlist_move_list(base->tv1.vec + index, head);

while (!hlist_empty(head)) {
	...
	...
	...
	timer = hlist_entry(head->first, struct timer_list, entry);
	fn = timer->function;
	data = timer->data;

	spin_unlock(&base->lock);
	call_timer_fn(timer, fn, data);
	spin_lock(&base->lock);

	...
	...
	...
}
```

where the `call_timer_fn` just calls the given function:

```C
static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long),
			  unsigned long data)
{
	...
	...
	...
	fn(data);
	...
	...
	...
}
```

That's all. From this moment the Linux kernel has the infrastructure for `dynamic timers`. We will not dive deeper into this interesting theme here. As I already wrote, `timers` are a [widely](http://lxr.free-electrons.com/ident?i=timer_list) used concept in the Linux kernel, and neither one nor two parts would be enough to cover how it is implemented and how it works in full detail.
But now we know about this concept, why the Linux kernel needs it and some data structures around it.

Now let's look at the usage of `dynamic timers` in the Linux kernel.

Usage of dynamic timers
--------------------------------------------------------------------------------

As you may have already noted, if the Linux kernel provides a concept, it also provides an API for managing this concept, and the `dynamic timers` concept is no exception here. To use a timer in the Linux kernel code, we must define a variable of the `timer_list` type. We can initialize our `timer_list` structure in two ways. The first is to use the `init_timer` macro that is defined in the [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h) header file:

```C
#define init_timer(timer)    \
	__init_timer((timer), 0)

#define __init_timer(_timer, _flags) \
	init_timer_key((_timer), (_flags), NULL, NULL)
```

where the `init_timer_key` function just calls the:

```C
do_init_timer(timer, flags, name, key);
```

function, which fills the given `timer` with default values. The second way is to use the:

```C
#define TIMER_INITIALIZER(_function, _expires, _data) \
	__TIMER_INITIALIZER((_function), (_expires), (_data), 0)
```

macro, which will initialize the given `timer_list` structure too.

After a `dynamic timer` is initialized, we can start this `timer` with the call of the:

```C
void add_timer(struct timer_list * timer);
```

function and stop it with the:

```C
int del_timer(struct timer_list * timer);
```

function.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the fourth part of the chapter that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with two new concepts: the `tick broadcast` framework and the `NO_HZ` mode.
In this part we continued to dive into time management related stuff and got acquainted with a new concept - the `dynamic timer`, or software timer. We didn't see the implementation of the `dynamic timers` management code in detail in this part, but we saw the data structures and API around this concept.

In the next part we will continue to dive into timer management related things in the Linux kernel and will see a new concept for us - `timers`.

If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
-------------------------------------------------------------------------------

* [IP](https://en.wikipedia.org/wiki/Internet_Protocol)
* [netfilter](https://en.wikipedia.org/wiki/Netfilter)
* [network](https://en.wikipedia.org/wiki/Computer_network)
* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)
diff --git a/Timers/timers-5.md b/Timers/timers-5.md
new file mode 100644
index 0000000..dcad4fe
--- /dev/null
+++ b/Timers/timers-5.md
@@ -0,0 +1,415 @@
+Timers and time management in the Linux kernel. Part 5.
================================================================================

Introduction to the `clockevents` framework
--------------------------------------------------------------------------------

This is the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. As you might have noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the [second](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) part of this chapter. It was the `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel.

At first let's refresh your memory and try to remember what the `clocksource` framework is and what its purpose is. The main goal of the `clocksource` framework is to provide a `timeline`. As described in the [documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt):

> For example issuing the command 'date' on a Linux system will eventually read the clock source to determine exactly what time it is.

The Linux kernel supports many different clock sources. You can find some of them in [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource). For example the good old [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) - a [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) with a `1193182` Hz frequency, and yet another one - the [ACPI PM](http://uefi.org/sites/default/files/resources/ACPI_5.pdf) timer with a `3579545` Hz frequency. Besides the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory, each architecture may provide its own architecture-specific clock sources.
For example the [x86](https://en.wikipedia.org/wiki/X86) architecture provides the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), while for example [powerpc](https://en.wikipedia.org/wiki/PowerPC) provides access to the processor timer through the `timebase` register.

Each clock source provides a monotonic atomic counter. As I already wrote, the Linux kernel supports a huge set of different clock sources and each clock source has its own parameters, like its [frequency](https://en.wikipedia.org/wiki/Frequency). The main goal of the `clocksource` framework is to provide an [API](https://en.wikipedia.org/wiki/Application_programming_interface) to select the best available clock source in the system, i.e. the clock source with the highest frequency. An additional goal of the `clocksource` framework is to represent the atomic counter provided by a clock source in human units. Currently, nanoseconds are the preferred choice for the time value units of a given clock source in the Linux kernel.

The `clocksource` framework is represented by the `clocksource` structure, which is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and contains the `name` of a clock source, the rating of the certain clock source in the system (a clock source with a higher frequency has a bigger rating in the system), the `list` of all registered clock sources in the system, the `enable` and `disable` fields to enable and disable a clock source, a pointer to the `read` function which must return the atomic counter of a clock source, and so on.

Additionally the `clocksource` structure provides two fields: `mult` and `shift`, which are needed for the translation of the atomic counter which is provided by a certain clock source to human units, i.e. [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond).
Translation occurs via the following formula:

```
ns ~= (clocksource * mult) >> shift
```

As we already know, besides the `clocksource` structure, the `clocksource` framework provides an API for the registration of clock sources with different frequency scale factors:

```C
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
```

clock source unregistration:

```C
int clocksource_unregister(struct clocksource *cs)
```

and so on.

In addition to the `clocksource` framework, the Linux kernel provides the `clockevents` framework. As described in the [documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt):

> Clock events are the conceptual reverse of clock sources

The main goal of this framework is to manage clock event devices, or in other words - to manage devices that allow us to register an event, that is an [interrupt](https://en.wikipedia.org/wiki/Interrupt), that is going to happen at a defined point of time in the future.

Now that we know a little about the `clockevents` framework in the Linux kernel, it is time to look at its [API](https://en.wikipedia.org/wiki/Application_programming_interface).

API of `clockevents` framework
-------------------------------------------------------------------------------

The main structure which describes a clock event device is the `clock_event_device` structure. This structure is defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and contains a huge set of fields. Like the `clocksource` structure, it has a `name` field which contains a human readable name of a clock event device, for example the [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) timer:

```C
static struct clock_event_device lapic_clockevent = {
	.name		= "lapic",
	...
	...
	...
}
```

The `event_handler`, `set_next_event` and `next_event` fields of a certain clock event device hold an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler), a setter of the next event, and local storage for the next event, respectively. Yet another field of the `clock_event_device` structure is the `features` field. Its value may be one of the following generic features:

```C
#define CLOCK_EVT_FEAT_PERIODIC	0x000001
#define CLOCK_EVT_FEAT_ONESHOT		0x000002
```

where the `CLOCK_EVT_FEAT_PERIODIC` represents a device which may be programmed to generate events periodically and the `CLOCK_EVT_FEAT_ONESHOT` represents a device which may generate an event only once. Besides these two features, there are also architecture-specific features. For example [x86_64](https://en.wikipedia.org/wiki/X86-64) supports an additional feature:

```C
#define CLOCK_EVT_FEAT_C3STOP		0x000008
```

The `CLOCK_EVT_FEAT_C3STOP` feature means that a clock event device will be stopped in the [C3](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states) state. Additionally the `clock_event_device` structure has `mult` and `shift` fields, just like the `clocksource` structure. The `clock_event_device` structure also contains other fields, but we will consider them later.

After having considered part of the `clock_event_device` structure, it is time to look at the `API` of the `clockevents` framework. To work with a clock event device, first of all we need to initialize a `clock_event_device` structure and register the clock event device. The `clockevents` framework provides the following `API` for the registration of clock event devices:

```C
void clockevents_register_device(struct clock_event_device *dev)
{
	...
	...
	...
}
```

This function is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and as we may see, the `clockevents_register_device` function takes only one parameter:

* the address of a `clock_event_device` structure which represents a clock event device.

So, to register a clock event device, at first we need to initialize a `clock_event_device` structure with the parameters of a certain clock event device. Let's take a look at one of the clock event devices in the Linux kernel source code. We can find one in the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory or take a look at an architecture-specific clock event device. Let's take for example the [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf). You can find its implementation in [drivers/clocksource/timer-atmel-pit.c](https://github.com/torvalds/linux/tree/master/drivers/clocksource/timer-atmel-pit.c).

First of all let's look at the initialization of the `clock_event_device` structure. This occurs in the `at91sam926x_pit_common_init` function:

```C
struct pit_data {
	...
	...
	struct clock_event_device	clkevt;
	...
	...
};

static void __init at91sam926x_pit_common_init(struct pit_data *data)
{
	...
	...
	...
	data->clkevt.name = "pit";
	data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
	data->clkevt.shift = 32;
	data->clkevt.mult = div_sc(pit_rate, NSEC_PER_SEC, data->clkevt.shift);
	data->clkevt.rating = 100;
	data->clkevt.cpumask = cpumask_of(0);

	data->clkevt.set_state_shutdown = pit_clkevt_shutdown;
	data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
	data->clkevt.resume = at91sam926x_pit_resume;
	data->clkevt.suspend = at91sam926x_pit_suspend;
	...
}
```

Here we can see that `at91sam926x_pit_common_init` takes one parameter - a pointer to the `pit_data` structure, which contains a `clock_event_device` structure that will hold the clock event related information of the `at91sam926x` [Periodic Interval Timer](https://en.wikipedia.org/wiki/Programmable_interval_timer). At the start we fill in the `name` of the timer device and its `features`. In our case we deal with a periodic timer, which as we already know may be programmed to generate events periodically.

The next two fields, `shift` and `mult`, are familiar to us. They will be used to translate the counter of our timer to nanoseconds. After this we set the rating of the timer to `100`. This means that if there are no timers with a higher rating in the system, this timer will be used for timekeeping. The next field - `cpumask` - indicates for which processors in the system the device will work. In our case, the device will work for the first processor. The `cpumask_of` macro is defined in the [include/linux/cpumask.h](https://github.com/torvalds/linux/tree/master/include/linux/cpumask.h) header file and just expands to the call of the:

```C
#define cpumask_of(cpu) (get_cpu_mask(cpu))
```

where the `get_cpu_mask` returns the cpumask containing just the given `cpu` number. You may read more about the `cpumasks` concept in the [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part. In the last four lines of code we set the callbacks for clock event device suspend/resume, device shutdown and the update of the clock event device state.

After we have finished with the initialization of the `at91sam926x` periodic timer, we can register it by the call of the following function:

```C
clockevents_register_device(&data->clkevt);
```

Now we can consider the implementation of the `clockevents_register_device` function.
As I already wrote above, this function is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and starts with the initialization of the initial event device state:

```C
clockevent_set_state(dev, CLOCK_EVT_STATE_DETACHED);
```

Actually, an event device may be in one of these states:

```C
enum clock_event_state {
	CLOCK_EVT_STATE_DETACHED,
	CLOCK_EVT_STATE_SHUTDOWN,
	CLOCK_EVT_STATE_PERIODIC,
	CLOCK_EVT_STATE_ONESHOT,
	CLOCK_EVT_STATE_ONESHOT_STOPPED,
};
```

Where:

* `CLOCK_EVT_STATE_DETACHED` - a clock event device is not used by the `clockevents` framework. Actually it is the initial state of all clock event devices;
* `CLOCK_EVT_STATE_SHUTDOWN` - a clock event device is powered-off;
* `CLOCK_EVT_STATE_PERIODIC` - a clock event device may be programmed to generate events periodically;
* `CLOCK_EVT_STATE_ONESHOT` - a clock event device may be programmed to generate an event only once;
* `CLOCK_EVT_STATE_ONESHOT_STOPPED` - a clock event device was programmed to generate an event only once and is now temporarily stopped.

The implementation of the `clockevent_set_state` function is pretty easy:

```C
static inline void clockevent_set_state(struct clock_event_device *dev,
					enum clock_event_state state)
{
	dev->state_use_accessors = state;
}
```

As we can see, it just fills the `state_use_accessors` field of the given `clock_event_device` structure with the given value, which in our case is `CLOCK_EVT_STATE_DETACHED`. Actually all clock event devices have this initial state during registration. The `state_use_accessors` field of the `clock_event_device` structure provides the `current` state of the clock event device.
After we have set the initial state of the given `clock_event_device` structure, we check that the `cpumask` of the given clock event device is not empty:

```C
if (!dev->cpumask) {
	WARN_ON(num_possible_cpus() > 1);
	dev->cpumask = cpumask_of(smp_processor_id());
}
```

Remember that we have set the `cpumask` of the `at91sam926x` periodic timer to the first processor. If the `cpumask` field is empty, we emit a warning if there is more than one possible processor in the system, and we set the `cpumask` of the given clock event device to the current processor. If you are interested in how the `smp_processor_id` macro is implemented, you can read more about it in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter.

After this check, we protect the actual code of the clock event device registration with the following macros:

```C
raw_spin_lock_irqsave(&clockevents_lock, flags);
...
...
...
raw_spin_unlock_irqrestore(&clockevents_lock, flags);
```

The `raw_spin_lock_irqsave` macro additionally disables local interrupts (and `raw_spin_unlock_irqrestore` restores them); interrupts on other processors may still occur. We need to do this to prevent a potential [deadlock](https://en.wikipedia.org/wiki/Deadlock) if an interrupt from another clock event device occurs while we add the new clock event device to the list of clock event devices.
We can see the following code of clock event device registration between the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros:

```C
list_add(&dev->list, &clockevent_devices);
tick_check_new_device(dev);
clockevents_notify_released();
```

First of all we add the given clock event device to the list of clock event devices, which is represented by `clockevent_devices`:

```C
static LIST_HEAD(clockevent_devices);
```

At the next step we call the `tick_check_new_device` function, which is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and checks whether the newly registered clock event device should be used. It gets the currently registered tick device, which is represented by the `tick_device` structure, and compares the ratings and features of the two devices. Actually a device with the `CLOCK_EVT_FEAT_ONESHOT` feature is preferred:

```C
static bool tick_check_preferred(struct clock_event_device *curdev,
				 struct clock_event_device *newdev)
{
	if (!(newdev->features & CLOCK_EVT_FEAT_ONESHOT)) {
		if (curdev && (curdev->features & CLOCK_EVT_FEAT_ONESHOT))
			return false;
		if (tick_oneshot_mode_active())
			return false;
	}

	return !curdev ||
		newdev->rating > curdev->rating ||
		!cpumask_equal(curdev->cpumask, newdev->cpumask);
}
```

If the newly registered clock event device is preferable to the old tick device, we exchange the old and newly registered devices and install the new device:

```C
clockevents_exchange_device(curdev, newdev);
tick_setup_device(td, newdev, cpu, cpumask_of(cpu));
```

The `clockevents_exchange_device` function releases, or in other words deletes, the old clock event device from the `clockevent_devices` list. The next function - `tick_setup_device` - as we may understand from its name, sets up the new tick device.
This function checks the mode of the newly registered clock event device and calls either the `tick_setup_periodic` function or the `tick_setup_oneshot` function, depending on the tick device mode:

```C
if (td->mode == TICKDEV_MODE_PERIODIC)
	tick_setup_periodic(newdev, 0);
else
	tick_setup_oneshot(newdev, handler, next_event);
```

Both of these functions call the `clockevents_switch_state` function to change the state of the clock event device, and the `clockevents_program_event` function to program the next event of the clock event device based on the current time and the time of the next event. The `tick_setup_periodic`:

```C
clockevents_switch_state(dev, CLOCK_EVT_STATE_PERIODIC);
clockevents_program_event(dev, next, false);
```

and the `tick_setup_oneshot`:

```C
clockevents_switch_state(newdev, CLOCK_EVT_STATE_ONESHOT);
clockevents_program_event(newdev, next_event, true);
```

The `clockevents_switch_state` function checks that the clock event device is not already in the given state and calls the `__clockevents_switch_state` function from the same source code file:

```C
if (clockevent_get_state(dev) != state) {
	if (__clockevents_switch_state(dev, state))
		return;
```

The `__clockevents_switch_state` function just calls the corresponding callback, depending on the given state:

```C
static int __clockevents_switch_state(struct clock_event_device *dev,
				      enum clock_event_state state)
{
	if (dev->features & CLOCK_EVT_FEAT_DUMMY)
		return 0;

	switch (state) {
	case CLOCK_EVT_STATE_DETACHED:
	case CLOCK_EVT_STATE_SHUTDOWN:
		if (dev->set_state_shutdown)
			return dev->set_state_shutdown(dev);
		return 0;

	case CLOCK_EVT_STATE_PERIODIC:
		if (!(dev->features & CLOCK_EVT_FEAT_PERIODIC))
			return -ENOSYS;
		if (dev->set_state_periodic)
			return dev->set_state_periodic(dev);
		return 0;
	...
	...
	...
```

In our case of the `at91sam926x` periodic timer, the device has the `CLOCK_EVT_FEAT_PERIODIC` feature and provides a `set_state_periodic` callback:

```C
data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
```

So, the `pit_clkevt_set_periodic` callback will be called. If we read the documentation of the [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf), we will see that there is a `Periodic Interval Timer Mode Register` which allows us to control the periodic interval timer.

It looks like this:

```
31                                                      25      24
+---------------------------------------------------------------+
|                                              | PITIEN | PITEN |
+---------------------------------------------------------------+
23                                                  19        16
+---------------------------------------------------------------+
|                                              |      PIV       |
+---------------------------------------------------------------+
15                                                             8
+---------------------------------------------------------------+
|                              PIV                              |
+---------------------------------------------------------------+
7                                                              0
+---------------------------------------------------------------+
|                              PIV                              |
+---------------------------------------------------------------+
```

where `PIV` or `Periodic Interval Value` defines the value compared with the primary `20-bit` counter of the Periodic Interval Timer, the `PITEN` or `Periodic Interval Timer Enabled` bit enables the timer when it is `1`, and the `PITIEN` or `Periodic Interval Timer Interrupt Enable` bit enables its interrupt when it is `1`. So, to set periodic mode, we need to set bits `24` and `25` in the `Periodic Interval Timer Mode Register`. And we are doing exactly that in the `pit_clkevt_set_periodic` function:

```C
static int pit_clkevt_set_periodic(struct clock_event_device *dev)
{
	struct pit_data *data = clkevt_to_pit_data(dev);
	...
	...
	...
	pit_write(data->base, AT91_PIT_MR,
		  (data->cycle - 1) | AT91_PIT_PITEN | AT91_PIT_PITIEN);

	return 0;
}
```

where the `AT91_PIT_MR`, `AT91_PIT_PITEN` and `AT91_PIT_PITIEN` are declared as:

```C
#define AT91_PIT_MR		0x00
#define AT91_PIT_PITIEN	BIT(25)
#define AT91_PIT_PITEN		BIT(24)
```

After the setup of the new clock event device is finished, we can return to the `clockevents_register_device` function. The last function called in the `clockevents_register_device` function is:

```C
clockevents_notify_released();
```

This function checks the `clockevents_released` list, which contains released clock event devices (remember that they may appear after the call of the `clockevents_exchange_device` function). If this list is not empty, we go through the clock event devices in the `clockevents_released` list, remove each of them from that list, add it back to the `clockevent_devices` list and re-check it with `tick_check_new_device`:

```C
static void clockevents_notify_released(void)
{
	struct clock_event_device *dev;

	while (!list_empty(&clockevents_released)) {
		dev = list_entry(clockevents_released.next,
				 struct clock_event_device, list);
		list_del(&dev->list);
		list_add(&dev->list, &clockevent_devices);
		tick_check_new_device(dev);
	}
}
```

That's all. From this moment we have registered a new clock event device. So the usage of the `clockevents` framework is simple and clear: architectures register their clock event devices in the clockevents core, and users of the clockevents core can get clock event devices for their use. The `clockevents` framework provides notification mechanisms for various clock related management events, like a clock event device being registered or unregistered, a processor being offlined in a system which supports [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) and so on.

We only saw the implementation of the `clockevents_register_device` function. But generally, the clock event layer [API](https://en.wikipedia.org/wiki/Application_programming_interface) is small.
+Besides the `API` for clock event device registration, the `clockevents` framework provides functions to schedule the next event interrupt, clock event device notification services and support for suspend and resume of clock event devices.
+
+If you want to know more about the `clockevents` API you can start researching the following source code and header files: [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c), [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) and [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h).
+
+That's all.
+
+Conclusion
+-------------------------------------------------------------------------------
+
+This is the end of the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the `timers` concept. In this part we continued to learn time management related stuff in the Linux kernel and saw a little about yet another framework - `clockevents`.
+
+If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience.
+If you found any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+-------------------------------------------------------------------------------
+
+* [timekeeping documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt)
+* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
+* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
+* [ACPI pdf](http://uefi.org/sites/default/files/resources/ACPI_5.pdf)
+* [x86](https://en.wikipedia.org/wiki/X86)
+* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
+* [powerpc](https://en.wikipedia.org/wiki/PowerPC)
+* [frequency](https://en.wikipedia.org/wiki/Frequency)
+* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
+* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
+* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
+* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
+* [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
+* [C3 state](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states)
+* [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf)
+* [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [deadlock](https://en.wikipedia.org/wiki/Deadlock)
+* [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
+* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)
diff --git a/Timers/timers-6.md b/Timers/timers-6.md
new file mode 100644
index 0000000..0ca4fa2
--- /dev/null
+++ b/Timers/timers-6.md
@@ -0,0 +1,413 @@
+Timers and time management in the Linux kernel. Part 6.
+================================================================================
+
+x86_64 related clock sources
+--------------------------------------------------------------------------------
+
+This is the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html) we saw the `clockevents` framework and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe the implementation of the [x86](https://en.wikipedia.org/wiki/X86) architecture related clock sources (you can read more about the `clocksource` concept in the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter).
+
+First of all we must know which clock sources may be used on the `x86` architecture. This is easy to find out from [sysfs](https://en.wikipedia.org/wiki/Sysfs), by reading the content of `/sys/devices/system/clocksource/clocksource0/available_clocksource`. The `/sys/devices/system/clocksource/clocksourceN` directory provides two special files for this:
+
+* `available_clocksource` - provides information about the available clock sources in the system;
+* `current_clocksource`   - provides information about the currently used clock source in the system.
+
+So, let's look:
+
+```
+$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
+tsc hpet acpi_pm
+```
+
+We can see that there are three registered clock sources in my system:
+
+* `tsc` - [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter);
+* `hpet` - [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer);
+* `acpi_pm` - [ACPI Power Management Timer](http://uefi.org/sites/default/files/resources/ACPI_5.pdf).
+
+Now let's look at the second file, which shows the current clock source (the clock source which has the best rating in the system):
+
+```
+$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
+tsc
+```
+
+For me it is the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). As we may know from the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter, which describes the internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is the clock source with the best (highest) rating, which in practice usually means the highest [frequency](https://en.wikipedia.org/wiki/Frequency).
+
+The frequency of the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) power management timer is `3.579545 MHz`. The frequency of the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is at least `10 MHz`. And the frequency of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) depends on the processor. On older processors, the `Time Stamp Counter` counted internal processor clock cycles, which means its frequency changed when the processor's frequency scaling changed. The situation is different for newer processors: they have an `invariant Time Stamp Counter` that increments at a constant rate in all operational states of the processor. We can get its frequency from the output of `/proc/cpuinfo`. For example, for the first processor in the system:
+
+```
+$ cat /proc/cpuinfo
+...
+model name : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
+...
+```
+
+Although the Intel manual says that the frequency of the `Time Stamp Counter`, while constant, is not necessarily the maximum qualified frequency of the processor, nor the frequency given in the brand string, we can still see that it is much higher than the frequency of the `ACPI PM` timer or the `High Precision Event Timer`.
+And we can see that the clock source with the best rating or highest frequency is the current one in the system.
+
+You can note that besides these three clock sources, we don't see two other clock sources that are already familiar to us in the output of `/sys/devices/system/clocksource/clocksource0/available_clocksource`. These clock sources are `jiffies` and `refined_jiffies`. We don't see them because this file lists only high resolution clock sources, or in other words clock sources with the [CLOCK_SOURCE_VALID_FOR_HRES](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h#L113) flag.
+
+As I already wrote above, we will consider all three of these clock sources in this part. We will consider them in the order of their initialization:
+
+* `hpet`;
+* `acpi_pm`;
+* `tsc`.
+
+We can make sure that the order is exactly like this in the output of the [dmesg](https://en.wikipedia.org/wiki/Dmesg) utility:
+
+```
+$ dmesg | grep clocksource
+[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
+[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
+[    0.094369] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
+[    0.186498] clocksource: Switched to clocksource hpet
+[    0.196827] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
+[    1.413685] tsc: Refined TSC clocksource calibration: 3999.981 MHz
+[    1.413688] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x73509721780, max_idle_ns: 881591102108 ns
+[    2.413748] clocksource: Switched to clocksource tsc
+```
+
+The first clock source is the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), so let's start with it.
+
+High Precision Event Timer
+--------------------------------------------------------------------------------
+
+The implementation of the `High Precision Event Timer` for the [x86](https://en.wikipedia.org/wiki/X86) architecture is located in the [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c) source code file. Its initialization starts with the call of the `hpet_enable` function, which is called during Linux kernel initialization. If we look into the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, we will see that after all the architecture-specific stuff has been initialized, the early console is disabled and the time management subsystem is ready, the following function is called:
+
+```C
+if (late_time_init)
+    late_time_init();
+```
+
+which initializes the late architecture-specific timers after the early jiffy counter has already been initialized. The definition of the `late_time_init` function for the `x86` architecture is located in the [arch/x86/kernel/time.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/time.c) source code file. It looks pretty easy:
+
+```C
+static __init void x86_late_time_init(void)
+{
+    x86_init.timers.timer_init();
+    tsc_init();
+}
+```
+
+As we may see, it does initialization of the `x86` related timer and initialization of the `Time Stamp Counter`. We will look at the second in the next paragraph, but for now let's consider the call of the `x86_init.timers.timer_init` function. The `timer_init` points to the `hpet_time_init` function from the same source code file. We can verify this by looking at the definition of the `x86_init` structure from [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c):
+
+```C
+struct x86_init_ops x86_init __initdata = {
+    ...
+    ...
+    ...
+
+    .timers = {
+        .setup_percpu_clockev = setup_boot_APIC_clock,
+        .timer_init           = hpet_time_init,
+        .wallclock_init       = x86_init_noop,
+    },
+    ...
+    ...
+    ...
+```
+
+The `hpet_time_init` function sets up the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) if we can not enable the `High Precision Event Timer`, and sets up the default timer [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) for the enabled timer:
+
+```C
+void __init hpet_time_init(void)
+{
+    if (!hpet_enable())
+        setup_pit_timer();
+    setup_default_timer_irq();
+}
+```
+
+First of all the `hpet_enable` function checks whether we can enable the `High Precision Event Timer` in the system by calling the `is_hpet_capable` function, and if we can, we map a virtual address space for it:
+
+```C
+int __init hpet_enable(void)
+{
+    if (!is_hpet_capable())
+        return 0;
+
+    hpet_set_mapping();
+}
+```
+
+The `is_hpet_capable` function checks that we didn't pass `hpet=disable` on the kernel command line and that the `hpet_address` was obtained from the [ACPI HPET](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table.
+The `hpet_set_mapping` function just maps the virtual address space for the timer registers:
+
+```C
+hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
+```
+
+As we can read in the [IA-PC HPET (High Precision Event Timers) Specification](http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf):
+
+> The timer register space is 1024 bytes
+
+So, the `HPET_MMAP_SIZE` is `1024` bytes too:
+
+```C
+#define HPET_MMAP_SIZE 1024
+```
+
+After we have mapped the virtual address space for the `High Precision Event Timer`, we read the `HPET_ID` register to get the number of timers:
+
+```C
+id = hpet_readl(HPET_ID);
+
+last = (id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT;
+```
+
+We need this number to allocate the correct amount of space for the `General Configuration Register` of the `High Precision Event Timer`:
+
+```C
+cfg = hpet_readl(HPET_CFG);
+
+hpet_boot_cfg = kmalloc((last + 2) * sizeof(*hpet_boot_cfg), GFP_KERNEL);
+```
+
+After the space is allocated for the configuration register of the `High Precision Event Timer`, we allow the main counter to run, and allow timer interrupts if they are enabled, by setting the `HPET_CFG_ENABLE` bit in the configuration register for all timers. In the end we just register the new clock source by calling the `hpet_clocksource_register` function:
+
+```C
+if (hpet_clocksource_register())
+    goto out_nohpet;
+```
+
+which just calls the already familiar
+
+```C
+clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
+```
+
+function.
+Where the `clocksource_hpet` is the `clocksource` structure with the rating `250` (remember that the rating of the previous `refined_jiffies` clock source was `2`), the name `hpet` and the `read_hpet` callback for reading the atomic counter provided by the `High Precision Event Timer`:
+
+```C
+static struct clocksource clocksource_hpet = {
+    .name     = "hpet",
+    .rating   = 250,
+    .read     = read_hpet,
+    .mask     = HPET_MASK,
+    .flags    = CLOCK_SOURCE_IS_CONTINUOUS,
+    .resume   = hpet_resume_counter,
+    .archdata = { .vclock_mode = VCLOCK_HPET },
+};
+```
+
+After the `clocksource_hpet` is registered, we can return to the `hpet_time_init()` function from the [arch/x86/kernel/time.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/time.c) source code file. We may remember that its last step is the call of the:
+
+```C
+setup_default_timer_irq();
+```
+
+function in the `hpet_time_init()`. The `setup_default_timer_irq` function checks for the existence of `legacy` IRQs, in other words support for the [i8259](https://en.wikipedia.org/wiki/Intel_8259), and sets up [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC) depending on this.
+
+That's all. From this moment the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) clock source is registered in the Linux kernel `clocksource` framework and may be used from generic kernel code via the `read_hpet`:
+
+```C
+static cycle_t read_hpet(struct clocksource *cs)
+{
+    return (cycle_t)hpet_readl(HPET_COUNTER);
+}
+```
+
+function, which just reads and returns the atomic counter from the `Main Counter Register`.
+
+ACPI PM timer
+--------------------------------------------------------------------------------
+
+The second clock source is the [ACPI Power Management Timer](http://uefi.org/sites/default/files/resources/ACPI_5.pdf).
+Implementation of this clock source is located in the [drivers/clocksource/acpi_pm.c](https://github.com/torvalds/linux/blob/master/drivers/clocksource/acpi_pm.c) source code file and starts with the call of the `init_acpi_pm_clocksource` function during the `fs` [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html).
+
+If we look at the implementation of the `init_acpi_pm_clocksource` function, we will see that it starts with a check of the value of the `pmtmr_ioport` variable:
+
+```C
+static int __init init_acpi_pm_clocksource(void)
+{
+    ...
+    ...
+    ...
+    if (!pmtmr_ioport)
+        return -ENODEV;
+    ...
+    ...
+    ...
+```
+
+This `pmtmr_ioport` variable contains the extended address of the `Power Management Timer Control Register Block`. It gets its value in the `acpi_parse_fadt` function which is defined in the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) source code file. This function parses the `FADT` or `Fixed ACPI Description Table` [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table and tries to get the value of the `X_PM_TMR_BLK` field, which contains the extended address of the `Power Management Timer Control Register Block`, represented in `Generic Address Structure` format:
+
+```C
+static int __init acpi_parse_fadt(struct acpi_table_header *table)
+{
+#ifdef CONFIG_X86_PM_TIMER
+    ...
+    ...
+    ...
+    pmtmr_ioport = acpi_gbl_FADT.xpm_timer_block.address;
+    ...
+    ...
+    ...
+#endif
+    return 0;
+}
+```
+
+So, if the `CONFIG_X86_PM_TIMER` Linux kernel configuration option is disabled, or something goes wrong in the `acpi_parse_fadt` function, we can't access the `Power Management Timer` register and we return from `init_acpi_pm_clocksource`.
+Otherwise, if the value of the `pmtmr_ioport` variable is not zero, we verify the rate of this timer and register the clock source by the call of the:
+
+```C
+clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
+```
+
+function. After the call of `clocksource_register_hz`, the `acpi_pm` clock source will be registered in the `clocksource` framework of the Linux kernel:
+
+```C
+static struct clocksource clocksource_acpi_pm = {
+    .name   = "acpi_pm",
+    .rating = 200,
+    .read   = acpi_pm_read,
+    .mask   = (cycle_t)ACPI_PM_MASK,
+    .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
+};
+```
+
+with the rating `200` and the `acpi_pm_read` callback to read the atomic counter provided by the `acpi_pm` clock source. The `acpi_pm_read` function just executes the `read_pmtmr` function:
+
+```C
+static cycle_t acpi_pm_read(struct clocksource *cs)
+{
+    return (cycle_t)read_pmtmr();
+}
+```
+
+which reads the value of the `Power Management Timer` register. This register has the following structure:
+
+```
++-------------------------------+----------------------------------+
+|                               |                                  |
+|  upper eight bits of a        |      running count of the        |
+| 32-bit power management timer |     power management timer       |
+|                               |                                  |
++-------------------------------+----------------------------------+
+31          E_TMR_VAL           24 TMR_VAL                         0
+```
+
+The address of this register is stored in the `Fixed ACPI Description Table` [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table and we already have it in `pmtmr_ioport`. So, the implementation of the `read_pmtmr` function is pretty easy:
+
+```C
+static inline u32 read_pmtmr(void)
+{
+    return inl(pmtmr_ioport) & ACPI_PM_MASK;
+}
+```
+
+We just read the value of the `Power Management Timer` register and mask it to its lower `24` bits.
+
+That's all. Now we move to the last clock source in this part - the `Time Stamp Counter`.
+
+Time Stamp Counter
+--------------------------------------------------------------------------------
+
+The third and last clock source in this part is the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) clock source, and its implementation is located in the [arch/x86/kernel/tsc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c) source code file. We already saw the `x86_late_time_init` function in this part, and the initialization of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) starts from this place. This function calls the `tsc_init()` function from the [arch/x86/kernel/tsc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c) source code file.
+
+At the beginning of the `tsc_init` function we can see a check which verifies that the processor supports the `Time Stamp Counter`:
+
+```C
+void __init tsc_init(void)
+{
+    u64 lpj;
+    int cpu;
+
+    if (!cpu_has_tsc) {
+        setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+        return;
+    }
+    ...
+    ...
+    ...
+```
+
+The `cpu_has_tsc` macro expands to the call of the `cpu_has` macro:
+
+```C
+#define cpu_has_tsc boot_cpu_has(X86_FEATURE_TSC)
+
+#define boot_cpu_has(bit) cpu_has(&boot_cpu_data, bit)
+
+#define cpu_has(c, bit)                                         \
+    (__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 : \
+     test_cpu_cap(c, bit))
+```
+
+which checks the given bit (the `X86_FEATURE_TSC` in our case) in the `boot_cpu_data` structure which is filled during early Linux kernel initialization.
+If the processor supports the `Time Stamp Counter`, we get its frequency by calling the `calibrate_tsc` function from the same source code file, which tries to get the frequency from different sources, like a [Model Specific Register](https://en.wikipedia.org/wiki/Model-specific_register), calibration over the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) and so on. After this we initialize the frequency and scale factor for all processors in the system:
+
+```C
+tsc_khz = x86_platform.calibrate_tsc();
+cpu_khz = tsc_khz;
+
+for_each_possible_cpu(cpu) {
+    cyc2ns_init(cpu);
+    set_cyc2ns_scale(cpu_khz, cpu);
+}
+```
+
+because only the first, bootstrap processor will call `tsc_init`. After this we check that the `Time Stamp Counter` is not disabled:
+
+```
+if (tsc_disabled > 0)
+    return;
+...
+...
+...
+check_system_tsc_reliable();
+```
+
+and call the `check_system_tsc_reliable` function, which sets the `tsc_clocksource_reliable` flag if the bootstrap processor has the `X86_FEATURE_TSC_RELIABLE` feature. Note that we have gone through the `tsc_init` function but did not register our clock source. The actual registration of the `Time Stamp Counter` clock source occurs in the:
+
+```C
+static int __init init_tsc_clocksource(void)
+{
+    if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
+        return 0;
+    ...
+    ...
+    ...
+    if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
+        clocksource_register_khz(&clocksource_tsc, tsc_khz);
+        return 0;
+    }
+```
+
+function. This function is called during the `device` [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html). We do it this way to be sure that the `Time Stamp Counter` clock source will be registered after the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) clock source.
+
+After this, all three clock sources are registered in the `clocksource` framework and the `Time Stamp Counter` clock source is selected as the active one, because it has the highest rating among the clock sources:
+
+```C
+static struct clocksource clocksource_tsc = {
+    .name     = "tsc",
+    .rating   = 300,
+    .read     = read_tsc,
+    .mask     = CLOCKSOURCE_MASK(64),
+    .flags    = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_MUST_VERIFY,
+    .archdata = { .vclock_mode = VCLOCK_TSC },
+};
+```
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the `clockevents` framework. In this part we continued to learn time management related stuff in the Linux kernel and saw a little about the three different clock sources which are used on the [x86](https://en.wikipedia.org/wiki/X86) architecture. The next part will be the last part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) and we will see some user space related stuff, i.e. how some time related [system calls](https://en.wikipedia.org/wiki/System_call) are implemented in the Linux kernel.
+
+If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience.
+If you found any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [x86](https://en.wikipedia.org/wiki/X86)
+* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
+* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
+* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
+* [ACPI Power Management Timer (PDF)](http://uefi.org/sites/default/files/resources/ACPI_5.pdf)
+* [frequency](https://en.wikipedia.org/wiki/Frequency)
+* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
+* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
+* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
+* [IA-PC HPET (High Precision Event Timers) Specification](http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf)
+* [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC)
+* [i8259](https://en.wikipedia.org/wiki/Intel_8259)
+* [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html)
+* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html)
diff --git a/Timers/timers-7.md b/Timers/timers-7.md
new file mode 100644
index 0000000..08ca77d
--- /dev/null
+++ b/Timers/timers-7.md
@@ -0,0 +1,421 @@
+Timers and time management in the Linux kernel. Part 7.
+================================================================================
+
+Time related system calls in the Linux kernel
+--------------------------------------------------------------------------------
+
+This is the seventh and last part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel.
+In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html) we saw the implementation of some [x86_64](https://en.wikipedia.org/wiki/X86-64) clock sources, like the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) and the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the concept of `time`; our programs need to know the time too. In this part, we will consider the implementation of some time management related [system calls](https://en.wikipedia.org/wiki/System_call). These system calls are:
+
+* `clock_gettime`;
+* `gettimeofday`;
+* `nanosleep`.
+
+We will start from a simple userspace [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program and trace the whole way from the call of the [standard library](https://en.wikipedia.org/wiki/Standard_library) function to the implementation of the corresponding system call. As each [architecture](https://github.com/torvalds/linux/tree/master/arch) provides its own implementation of a given system call, we will consider only the [x86_64](https://en.wikipedia.org/wiki/X86-64) specific implementations of these system calls, as this book is related to this architecture.
+
+Additionally, we will not consider the concept of system calls in this part, only the implementations of these three system calls in the Linux kernel. If you are interested in what a `system call` is, there is a special [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about this.
+
+So, let's start with the `gettimeofday` system call.
+
+Implementation of the `gettimeofday` system call
+--------------------------------------------------------------------------------
+
+As we can understand from its name, the `gettimeofday` function returns the current time.
+First of all, let's look at the following simple example:
+
+```C
+#include <stdio.h>
+#include <time.h>
+#include <sys/time.h>
+
+int main(int argc, char **argv)
+{
+    char buffer[40];
+    struct timeval time;
+
+    gettimeofday(&time, NULL);
+
+    strftime(buffer, 40, "Current date/time: %m-%d-%Y/%T", localtime(&time.tv_sec));
+    printf("%s\n", buffer);
+
+    return 0;
+}
+```
+
+As you can see, here we call the `gettimeofday` function, which takes two parameters. The first is a pointer to a `timeval` structure, which represents an elapsed time:
+
+```C
+struct timeval {
+    time_t      tv_sec;     /* seconds */
+    suseconds_t tv_usec;    /* microseconds */
+};
+```
+
+The second parameter of the `gettimeofday` function is a pointer to a `timezone` structure which represents a timezone. In our example, we pass the address of the `timeval time` to the `gettimeofday` function; the Linux kernel fills the given `timeval` structure and returns it back to us. Additionally, we format the time with the `strftime` function to get something more human readable than elapsed microseconds. Let's look at the result:
+
+```
+~$ gcc date.c -o date
+~$ ./date
+Current date/time: 03-26-2016/16:42:02
+```
+
+As you may already know, a userspace application does not call a system call directly from kernel space. Before the actual system call entry is reached, we call a function from the standard library. In my case it is [glibc](https://en.wikipedia.org/wiki/GNU_C_Library), so I will consider this case. The implementation of the `gettimeofday` function is located in the [sysdeps/unix/sysv/linux/x86/gettimeofday.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD) source code file. As you may already know, the `gettimeofday` is not a usual system call. It is located in the special area which is called `vDSO` (you can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) which describes this concept).
+
+The `glibc` implementation of `gettimeofday` tries to resolve the given symbol, in our case `__vdso_gettimeofday`, by the call of the `_dl_vdso_vsym` internal function. If the symbol cannot be resolved, it returns `NULL` and we fall back to the call of the usual system call:
+
+```C
+return (_dl_vdso_vsym ("__vdso_gettimeofday", &linux26)
+        ?: (void*) (&__gettimeofday_syscall));
+```
+
+The `gettimeofday` entry is located in the [arch/x86/entry/vdso/vclock_gettime.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vclock_gettime.c) source code file. As we can see, the `gettimeofday` is a weak alias of the `__vdso_gettimeofday`:
+
+```C
+int gettimeofday(struct timeval *, struct timezone *)
+    __attribute__((weak, alias("__vdso_gettimeofday")));
+```
+
+The `__vdso_gettimeofday` is defined in the same source code file and calls the `do_realtime` function if the given `timeval` is not null:
+
+```C
+notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
+{
+    if (likely(tv != NULL)) {
+        if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+            return vdso_fallback_gtod(tv, tz);
+        tv->tv_usec /= 1000;
+    }
+    if (unlikely(tz != NULL)) {
+        tz->tz_minuteswest = gtod->tz_minuteswest;
+        tz->tz_dsttime = gtod->tz_dsttime;
+    }
+
+    return 0;
+}
+```
+
+If the `do_realtime` fails, we fall back to the real system call by executing the `syscall` instruction, passing the `__NR_gettimeofday` system call number and the given `timeval` and `timezone`:
+
+```C
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+    long ret;
+
+    asm("syscall" : "=a" (ret) :
+        "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
+    return ret;
+}
+```
+
+The `do_realtime` function gets the time data from the `vsyscall_gtod_data` structure which is defined in the [arch/x86/include/asm/vgtod.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/vgtod.h#L16) header file and contains a mapping of
the `timespec` structure and a couple of fields related to the current clock source in the system. This function fills the given `timeval` structure with values from the `vsyscall_gtod_data`, which contains time related data that is updated via the timer interrupt.

First of all we try to get access to the `gtod`, or `global time of day`, data of the `vsyscall_gtod_data` structure via the call of the `gtod_read_begin` function, and we retry until the read is successful:

```C
do {
        seq = gtod_read_begin(gtod);
        mode = gtod->vclock_mode;
        ts->tv_sec = gtod->wall_time_sec;
        ns = gtod->wall_time_snsec;
        ns += vgetsns(&mode);
        ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));

ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
```

Once we have access to the `gtod`, we fill the `ts->tv_sec` with the `gtod->wall_time_sec`, which stores the current time in seconds obtained from the [real time clock](https://en.wikipedia.org/wiki/Real-time_clock) during initialization of the timekeeping subsystem in the Linux kernel, and do the same for the nanoseconds value. At the end of this code we just fill the given `timespec` structure with the resulting values.

That's all about the `gettimeofday` system call. The next system call in our list is the `clock_gettime`.

Implementation of the clock_gettime system call
--------------------------------------------------------------------------------

The `clock_gettime` function gets the time which is specified by the second parameter. Generally, the `clock_gettime` function takes two parameters:

* `clk_id` - clock identifier;
* `timespec` - address of the `timespec` structure which represents the elapsed time.
Let's look at the following simple example (the header names were lost in this copy of the text; `stdio.h`, `unistd.h` and `time.h` are what the program needs):

```C
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct timespec elapsed_from_boot;

    clock_gettime(CLOCK_BOOTTIME, &elapsed_from_boot);

    printf("%ld - seconds elapsed from boot\n", elapsed_from_boot.tv_sec);

    return 0;
}
```

which prints the `uptime` information:

```
~$ gcc uptime.c -o uptime
~$ ./uptime
14180 - seconds elapsed from boot
```

We can easily check the result with the help of the [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime) util:

```
~$ uptime
up 3:56
```

The `elapsed_from_boot.tv_sec` represents the elapsed time in seconds, so:

```python
>>> 14180 // 60
236
>>> 14180 // 60 // 60
3
>>> 14180 // 60 % 60
56
```

The `clock_id` may be one of the following:

* `CLOCK_REALTIME` - system wide clock which measures real or wall-clock time;
* `CLOCK_REALTIME_COARSE` - faster version of the `CLOCK_REALTIME`;
* `CLOCK_MONOTONIC` - represents monotonic time since some unspecified starting point;
* `CLOCK_MONOTONIC_COARSE` - faster version of the `CLOCK_MONOTONIC`;
* `CLOCK_MONOTONIC_RAW` - the same as the `CLOCK_MONOTONIC` but provides non-[NTP](https://en.wikipedia.org/wiki/Network_Time_Protocol)-adjusted time;
* `CLOCK_BOOTTIME` - the same as the `CLOCK_MONOTONIC` plus the time that the system was suspended;
* `CLOCK_PROCESS_CPUTIME_ID` - per-process time consumed by all threads in the process;
* `CLOCK_THREAD_CPUTIME_ID` - thread-specific clock.

The `clock_gettime` is not a usual system call either; like the `gettimeofday`, this system call is placed in the `vDSO` area. The entry of this system call is located in the same source code file - [arch/x86/entry/vdso/vclock_gettime.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vclock_gettime.c) - as for `gettimeofday`.

The implementation of the `clock_gettime` depends on the clock id.
If we passed the `CLOCK_REALTIME` clock id, the `do_realtime` function will be called:

```C
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
        switch (clock) {
        case CLOCK_REALTIME:
                if (do_realtime(ts) == VCLOCK_NONE)
                        goto fallback;
                break;
        ...
        ...
        ...
fallback:
        return vdso_fallback_gettime(clock, ts);
}
```

In other cases, the `do_{name_of_clock_id}` function is called. Implementations of some of them are similar. For example, if we pass the `CLOCK_MONOTONIC` clock id:

```C
...
...
...
case CLOCK_MONOTONIC:
        if (do_monotonic(ts) == VCLOCK_NONE)
                goto fallback;
        break;
...
...
...
```

the `do_monotonic` function will be called, which is very similar to the implementation of the `do_realtime`:

```C
notrace static int __always_inline do_monotonic(struct timespec *ts)
{
        do {
                seq = gtod_read_begin(gtod);
                mode = gtod->vclock_mode;
                ts->tv_sec = gtod->monotonic_time_sec;
                ns = gtod->monotonic_time_snsec;
                ns += vgetsns(&mode);
                ns >>= gtod->shift;
        } while (unlikely(gtod_read_retry(gtod, seq)));

        ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
        ts->tv_nsec = ns;

        return mode;
}
```

We already saw a little about the implementation of this function in the previous paragraph about the `gettimeofday`. There is only one difference here: the `sec` and `nsec` of our `timespec` value will be based on the `gtod->monotonic_time_sec` instead of the `gtod->wall_time_sec`, which maps the value of the `tk->tkr_mono.xtime_nsec`, or the number of [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) elapsed.

That's all.

Implementation of the `nanosleep` system call
--------------------------------------------------------------------------------

The last system call in our list is the `nanosleep`. As you can understand from its name, this function provides the `sleeping` ability.
Let's look at the following simple example (again, the header names were lost in this copy of the text; `stdio.h` and `time.h` are what the program needs):

```C
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main (void)
{
    struct timespec ts = {5,0};

    printf("sleep five seconds\n");
    nanosleep(&ts, NULL);
    printf("end of sleep\n");

    return 0;
}
```

If we compile and run it, we will see the first line

```
~$ gcc sleep_test.c -o sleep
~$ ./sleep
sleep five seconds
end of sleep
```

and the second line after five seconds.

The `nanosleep` is not located in the `vDSO` area like the `gettimeofday` and the `clock_gettime` functions. So, let's look at how the `real` system call which is located in the kernel space will be called by the standard library. The implementation of the `nanosleep` system call will be invoked with the help of the [syscall](http://www.felixcloutier.com/x86/SYSCALL.html) instruction. Before the execution of the `syscall` instruction, the parameters of the system call must be put in processor [registers](https://en.wikipedia.org/wiki/Processor_register) in the order which is described in the [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf), or in other words:

* `rdi` - first parameter;
* `rsi` - second parameter;
* `rdx` - third parameter;
* `r10` - fourth parameter;
* `r8` - fifth parameter;
* `r9` - sixth parameter.

The `nanosleep` system call has two parameters - two pointers to `timespec` structures. The system call suspends the calling thread until the given timeout has elapsed; additionally, it will finish early if a signal interrupts its execution. The first parameter is the `timespec` which represents the timeout for the sleep. The second parameter is a pointer to a `timespec` structure too; it contains the remainder of the time if the call of the `nanosleep` was interrupted.
As we saw, the `nanosleep` has two parameters:

```C
int nanosleep(const struct timespec *req, struct timespec *rem);
```

To invoke the system call, we need to put the `req` into the `rdi` register and the `rem` parameter into the `rsi` register. The [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) does this job in the `INTERNAL_SYSCALL` macro which is located in the [sysdeps/unix/sysv/linux/x86_64/sysdep.h](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h;h=d023d68174d3dfb4e698160b31ae31ad291802e1;hb=HEAD) header file:

```C
# define INTERNAL_SYSCALL(name, err, nr, args...) \
  INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)
```

It takes the name of the system call, storage for a possible error during the execution of the system call, the number of the system call (all `x86_64` system calls can be found in the [system calls table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)) and the arguments of the given system call. The `INTERNAL_SYSCALL` macro just expands to a call of the `INTERNAL_SYSCALL_NCS` macro, which prepares the arguments of the system call (puts them into the processor registers in the correct order), executes the `syscall` instruction and returns the result:

```C
# define INTERNAL_SYSCALL_NCS(name, err, nr, args...)                   \
  ({                                                                    \
    unsigned long int resultvar;                                        \
    LOAD_ARGS_##nr (args)                                               \
    LOAD_REGS_##nr                                                      \
    asm volatile (                                                      \
    "syscall\n\t"                                                       \
    : "=a" (resultvar)                                                  \
    : "0" (name) ASM_ARGS_##nr : "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
    (long int) resultvar; })
```

The `LOAD_ARGS_##nr` macro calls the `LOAD_ARGS_N` macro where `N` is the number of arguments of the system call. In our case, it will be the `LOAD_ARGS_2` macro.
Ultimately all of these macros will be expanded to the following:

```C
# define LOAD_REGS_TYPES_1(t1, a1)                                      \
  register t1 _a1 asm ("rdi") = __arg1;                                 \
  LOAD_REGS_0

# define LOAD_REGS_TYPES_2(t1, a1, t2, a2)                              \
  register t2 _a2 asm ("rsi") = __arg2;                                 \
  LOAD_REGS_TYPES_1(t1, a1)
...
...
...
```

After the `syscall` instruction is executed, a [context switch](https://en.wikipedia.org/wiki/Context_switch) will occur and the kernel will transfer execution to the system call handler. The system call handler for the `nanosleep` system call is located in the [kernel/time/hrtimer.c](https://github.com/torvalds/linux/blob/master/kernel/time/hrtimer.c) source code file and is defined with the `SYSCALL_DEFINE2` macro helper:

```C
SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
                struct timespec __user *, rmtp)
{
        struct timespec tu;

        if (copy_from_user(&tu, rqtp, sizeof(tu)))
                return -EFAULT;

        if (!timespec_valid(&tu))
                return -EINVAL;

        return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}
```

You can read more about the `SYSCALL_DEFINE2` macro in the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about system calls. If we look at the implementation of the `nanosleep` system call, first of all we will see that it starts with the call of the `copy_from_user` function. This function copies the given data from the userspace to the kernelspace. In our case we copy the timeout value to sleep into the kernelspace `timespec` structure and check that the given `timespec` is valid by the call of the `timespec_valid` function:

```C
static inline bool timespec_valid(const struct timespec *ts)
{
        if (ts->tv_sec < 0)
                return false;
        if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
                return false;
        return true;
}
```

which just checks that the given `timespec` does not represent a date before `1970` and that the nanoseconds value does not overflow `1` second.
The `nanosleep` function ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a [timer](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html) and calls the `do_nanosleep` function. The `do_nanosleep` does the main job for us. This function contains the following loop:

```C
do {
        set_current_state(TASK_INTERRUPTIBLE);
        hrtimer_start_expires(&t->timer, mode);

        if (likely(t->task))
                freezable_schedule();

} while (t->task && !signal_pending(current));

__set_current_state(TASK_RUNNING);
return t->task == NULL;
```

which freezes the current task during the sleep. After we set the `TASK_INTERRUPTIBLE` flag for the current task, the `hrtimer_start_expires` function starts the given high-resolution timer on the current processor. When the high-resolution timer expires, the task will be running again.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the seventh part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. In the previous part we saw [x86_64](https://en.wikipedia.org/wiki/X86-64) specific clock sources. As I wrote in the beginning, this part is the last part of this chapter. In this chapter we saw important time management related concepts like the `clocksource` and `clockevents` frameworks, the `jiffies` counter and so on. Of course this does not cover all of the time management in the Linux kernel. Many parts of it are mostly related to scheduling, which we will see in another chapter.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**


Links
--------------------------------------------------------------------------------

* [system call](https://en.wikipedia.org/wiki/System_call)
* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [standard library](https://en.wikipedia.org/wiki/Standard_library)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [real time clock](https://en.wikipedia.org/wiki/Real-time_clock)
* [NTP](https://en.wikipedia.org/wiki/Network_Time_Protocol)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [register](https://en.wikipedia.org/wiki/Processor_register)
* [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf)
* [context switch](https://en.wikipedia.org/wiki/Context_switch)
* [Introduction to timers in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html)
* [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime)
* [system calls table for x86_64](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html)