linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] tick_sched_timer() is not properly aligned, fixed by chance
@ 2023-04-06  9:57 Sebastian Andrzej Siewior
  2023-04-06 10:55 ` Peter Zijlstra
  2023-04-18 13:14 ` [tip: timers/core] tick/common: Align tick period with the HZ tick tip-bot2 for Sebastian Andrzej Siewior
  0 siblings, 2 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-04-06  9:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Pavel Tatashin, Peter Zijlstra, Thomas Gleixner

With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
expire the timer_list timers. This timer is programmed accurate in
respect to CLOCK_MONOTONIC so that 0 is the first tick and the next one
is 1000/CONFIG_HZ ms later. For HZ=250 every 4 ms and so the us/ns part
of the timer is always 0. This can be observed by enabling hrtimer_start
events and looking for the tick_sched_timer timer:

| <idle>-0 [009] 137.085041: hrtimer_start: hrtimer=000000002df4f2bc function=tick_sched_timer expires=162784000000 softexpires=162784000000 mode=0xa
| <idle>-0 [026] 137.085300: hrtimer_start: hrtimer=00000000b0911b5d function=tick_sched_timer expires=316384000000 softexpires=316384000000 mode=0xa

The lower part (us, ns) is zero. This is important for certain work
loads where it is needed to either align with the timer or hide after
it.

With commit
	857baa87b6422 ("sched/clock: Enable sched clock early")

merged into v4.19-rc1 this 0 offset is gone. The problematic part is the
last hunk of the patch:

|@@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
|                return sched_clock() + __sched_clock_offset;
|
|        if (unlikely(!sched_clock_running))
|-               return 0ull;
|+               return sched_clock();
|
|        preempt_disable_notrace();
|        scd = cpu_sdc(cpu);

which returns the sched_clock() without any offset. The gain of this is
the high resolution output of the bootlog very early, so we see something
like this:
|[    0.000000] tsc: Detected 1995.083 MHz processor
|[    0.000893] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
|[    0.008179] e820: remove [mem 0x000a0000-0x000fffff] usable
|[    0.014395] last_pfn = 0xa40000 max_arch_pfn = 0x400000000
|[    0.020667] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
|[    0.029822] last_pfn = 0xbe000 max_arch_pfn = 0x400000000
|[    0.050770] found SMP MP-table at [mem 0x000fd2a0-0x000fd2af]
|[    0.057015] Kernel/User page tables isolation: disabled on command line.
|[    0.064452] Using GB pages for direct mapping
|[    0.070337] RAMDISK: [mem 0x32d23000-0x35688fff]
|[    0.075298] ACPI: Early table checksum verification disabled
…
|[    1.662577] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
|[    1.687317] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398412f243e, max_idle_ns: 881590620736 ns
|[    1.699071] Calibrating delay loop (skipped), value calculated using timer frequency.. 3990.16 BogoMIPS (lpj=7980332)
|[    1.703070] pid_max: default: 32768 minimum: 301
|[    1.707238] LSM: Security Framework initializing

but the tick_sched_timer() is no longer properly aligned:
| <idle>-0 [002] 131.286274: hrtimer_start: hrtimer=000000000b8c5637 function=tick_sched_timer expires=130680348871 softexpires=130680348871 mode=0xa
| <idle>-0 [006] 131.286275: hrtimer_start: hrtimer=00000000e08369e4 function=tick_sched_timer expires=317880348871 softexpires=317880348871 mode=0xa

instead of 0 we have "348871" here. I consider this as a bug given that
this important.

Then PeterZ had other problems and committed
	776f22913b8e5 ("sched/clock: Make local_clock() noinstr")

which is part of v6.3-rc1. The important part here is (again) the last
hunk, the local_clock() implementation:

|noinstr u64 local_clock(void)
|{
|       u64 clock;
|
|       if (static_branch_likely(&__sched_clock_stable))
|               return sched_clock() + __sched_clock_offset;
|
|       preempt_disable_notrace();
|       clock = sched_clock_local(this_scd());
|       preempt_enable_notrace();
|
|       return clock;
|}

With the static branch disabled that early in the boot we fallback to
the sched_clock_local() which has the tick-based resolution. That means
the output is back Tick based resolution for most part of the boot
process:

|[    0.000000] tsc: Detected 1995.095 MHz processor
|[    0.000661] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
|[    0.000665] e820: remove [mem 0x000a0000-0x000fffff] usable
|[    0.000677] last_pfn = 0xa40000 max_arch_pfn = 0x400000000
|[    0.000684] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
|[    0.001043] last_pfn = 0xbe000 max_arch_pfn = 0x400000000
|[    0.004000] found SMP MP-table at [mem 0x000fd2a0-0x000fd2af]
|[    0.004000] Using GB pages for direct mapping
…
|[    0.004000] Switched APIC routing to cluster x2apic.
|[    0.004000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
|[    0.004000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x3984298f70d, max_idle_ns: 881590726857 ns
|[    2.083356] Calibrating delay loop (skipped), value calculated using timer frequency.. 3990.19 BogoMIPS (lpj=7980380)
|[    2.096542] pid_max: default: 32768 minimum: 301
|[    2.113889] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, vmalloc hugepage)

and the tick_sched_timer timer is programmed as expected.

The tick_sched_timer story is that it uses ktime_get() to set the
initial expire time and then always incremented. If the returned value
is jiffy-based then it will work as expected. If it has higher precision
then the time has to be adjusted since it is not expected.

So, what do we do?
- We would need to provide stable patches for v4.19…v6.1 to address the
  timer issue.

- The high precision timestamps during boot are gone. Is it important to
  bring that back? Or could that new local_clock() implementation be
  backported?

- An alternative is to ensure that the time returned by ktime_get() in
  tick_setup_device() is aligned with zeros.

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] tick_sched_timer() is not properly aligned, fixed by chance
  2023-04-06  9:57 [RFC] tick_sched_timer() is not properly aligned, fixed by chance Sebastian Andrzej Siewior
@ 2023-04-06 10:55 ` Peter Zijlstra
  2023-04-06 11:08   ` Thomas Gleixner
  2023-04-18 13:14 ` [tip: timers/core] tick/common: Align tick period with the HZ tick tip-bot2 for Sebastian Andrzej Siewior
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2023-04-06 10:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Ingo Molnar, Pavel Tatashin, Thomas Gleixner

On Thu, Apr 06, 2023 at 11:57:35AM +0200, Sebastian Andrzej Siewior wrote:
> With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
> expire the timer_list timers. This timer is programmed accurate in
> respect to CLOCK_MONOTONIC so that 0 is the first tick and the next one
> is 1000/CONFIG_HZ ms later. For HZ=250 every 4 ms and so the us/ns part
> of the timer is always 0. This can be observed by enabling hrtimer_start
> events and looking for the tick_sched_timer timer:
> 
> | <idle>-0 [009] 137.085041: hrtimer_start: hrtimer=000000002df4f2bc function=tick_sched_timer expires=162784000000 softexpires=162784000000 mode=0xa
> | <idle>-0 [026] 137.085300: hrtimer_start: hrtimer=00000000b0911b5d function=tick_sched_timer expires=316384000000 softexpires=316384000000 mode=0xa
> 
> The lower part (us, ns) is zero. This is important for certain work
> loads where it is needed to either align with the timer or hide after
> it.
> 
> With commit
> 	857baa87b6422 ("sched/clock: Enable sched clock early")
> 
> merged into v4.19-rc1 this 0 offset is gone. The problematic part is the
> last hunk of the patch:
> 
> |@@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
> |                return sched_clock() + __sched_clock_offset;
> |
> |        if (unlikely(!sched_clock_running))
> |-               return 0ull;
> |+               return sched_clock();
> |
> |        preempt_disable_notrace();
> |        scd = cpu_sdc(cpu);
> 
> which returns the sched_clock() without any offset. The gain of this is
> the high resolution output of the bootlog very early, so we see something
> like this:
> |[    0.000000] tsc: Detected 1995.083 MHz processor
> |[    0.000893] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> |[    0.008179] e820: remove [mem 0x000a0000-0x000fffff] usable
> |[    0.014395] last_pfn = 0xa40000 max_arch_pfn = 0x400000000
> |[    0.020667] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> |[    0.029822] last_pfn = 0xbe000 max_arch_pfn = 0x400000000
> |[    0.050770] found SMP MP-table at [mem 0x000fd2a0-0x000fd2af]
> |[    0.057015] Kernel/User page tables isolation: disabled on command line.
> |[    0.064452] Using GB pages for direct mapping
> |[    0.070337] RAMDISK: [mem 0x32d23000-0x35688fff]
> |[    0.075298] ACPI: Early table checksum verification disabled
> …
> |[    1.662577] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> |[    1.687317] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398412f243e, max_idle_ns: 881590620736 ns
> |[    1.699071] Calibrating delay loop (skipped), value calculated using timer frequency.. 3990.16 BogoMIPS (lpj=7980332)
> |[    1.703070] pid_max: default: 32768 minimum: 301
> |[    1.707238] LSM: Security Framework initializing
> 
> but the tick_sched_timer() is no longer properly aligned:
> | <idle>-0 [002] 131.286274: hrtimer_start: hrtimer=000000000b8c5637 function=tick_sched_timer expires=130680348871 softexpires=130680348871 mode=0xa
> | <idle>-0 [006] 131.286275: hrtimer_start: hrtimer=00000000e08369e4 function=tick_sched_timer expires=317880348871 softexpires=317880348871 mode=0xa

But how ?!? Why does sched_clock() affect the tick_sched_timer() offset?

/me reads on

> instead of 0 we have "348871" here. I consider this as a bug given that
> this important.
> 
> Then PeterZ had other problems and committed
> 	776f22913b8e5 ("sched/clock: Make local_clock() noinstr")
> 
> which is part of v6.3-rc1. The important part here is (again) the last
> hunk, the local_clock() implementation:
> 
> |noinstr u64 local_clock(void)
> |{
> |       u64 clock;
> |
> |       if (static_branch_likely(&__sched_clock_stable))
> |               return sched_clock() + __sched_clock_offset;
> |
> |       preempt_disable_notrace();
> |       clock = sched_clock_local(this_scd());
> |       preempt_enable_notrace();
> |
> |       return clock;
> |}

And this morning I found:

https://lore.kernel.org/all/20230406040847.569970-1-dev@aaront.org/T/#u

that 'fixes' that :-)

> The tick_sched_timer story is that it uses ktime_get() to set the

But but but, ktime_get() does not use sched_clock(), it has it's own TSC
reader.

> initial expire time and then always incremented. If the returned value
> is jiffy-based then it will work as expected. If it has higher precision
> then the time has to be adjusted since it is not expected.
> 
> So, what do we do?
> - We would need to provide stable patches for v4.19…v6.1 to address the
>   timer issue.
> 
> - The high precision timestamps during boot are gone. Is it important to
>   bring that back? Or could that new local_clock() implementation be
>   backported?
> 
> - An alternative is to ensure that the time returned by ktime_get() in
>   tick_setup_device() is aligned with zeros.

The last one I think, let me try and figure out how sched_clock()
behaviour makes a different at all -- this is not making sense.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] tick_sched_timer() is not properly aligned, fixed by chance
  2023-04-06 10:55 ` Peter Zijlstra
@ 2023-04-06 11:08   ` Thomas Gleixner
  2023-04-06 15:02     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2023-04-06 11:08 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: linux-kernel, Ingo Molnar, Pavel Tatashin

On Thu, Apr 06 2023 at 12:55, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 11:57:35AM +0200, Sebastian Andrzej Siewior wrote:
>
>> The tick_sched_timer story is that it uses ktime_get() to set the
>
> But but but, ktime_get() does not use sched_clock(), it has it's own TSC
> reader.

Correct and at that point during early boot the clocksource which feeds
ktime_get() should be jiffies and not some other random clocksource.

Sebastian, can you please evaluate which clocksource is used for
ktime_get() at the point where the first clock event device is set up?

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] tick_sched_timer() is not properly aligned, fixed by chance
  2023-04-06 11:08   ` Thomas Gleixner
@ 2023-04-06 15:02     ` Sebastian Andrzej Siewior
  2023-04-06 17:13       ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-04-06 15:02 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Pavel Tatashin

On 2023-04-06 13:08:29 [+0200], Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 12:55, Peter Zijlstra wrote:
> > On Thu, Apr 06, 2023 at 11:57:35AM +0200, Sebastian Andrzej Siewior wrote:
> >
> >> The tick_sched_timer story is that it uses ktime_get() to set the
> >
> > But but but, ktime_get() does not use sched_clock(), it has it's own TSC
> > reader.
> 
> Correct and at that point during early boot the clocksource which feeds
> ktime_get() should be jiffies and not some other random clocksource.
> 
> Sebastian, can you please evaluate which clocksource is used for
> ktime_get() at the point where the first clock event device is set up?

The first one added is hpet. It points to "jiffies" as name and
jiffies_read() as ->read().

Before the change/ with PeterZ change I get from ktime_get() for
base/nsec something like:
|[    0.004000] tick_next_period Name: jiffies jiffies_read+0x0/0x10
|[    0.004000] tick_next_period base/nsec: -401771248 405771248
|[    0.004000] tick_next_period hpet 4000000

base/nsec is different on each boot but it always ends up with 4000000.
With the optimisation, ktime_get() is:
| [    1.179079] tick_next_period base/nsec: 647439581 518613145
| [    1.179646] tick_next_period hpet 1166052726

so something is using it during init. And this is
	read_persistent_wall_and_boot_offset()

> Thanks,
> 
>         tglx
> 
Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] tick_sched_timer() is not properly aligned, fixed by chance
  2023-04-06 15:02     ` Sebastian Andrzej Siewior
@ 2023-04-06 17:13       ` Thomas Gleixner
  2023-04-18 12:26         ` [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick Sebastian Andrzej Siewior
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2023-04-06 17:13 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Pavel Tatashin

On Thu, Apr 06 2023 at 17:02, Sebastian Andrzej Siewior wrote:
> On 2023-04-06 13:08:29 [+0200], Thomas Gleixner wrote:
> The first one added is hpet. It points to "jiffies" as name and
> jiffies_read() as ->read().
>
> Before the change/ with PeterZ change I get from ktime_get() for
> base/nsec something like:
> |[    0.004000] tick_next_period Name: jiffies jiffies_read+0x0/0x10
> |[    0.004000] tick_next_period base/nsec: -401771248 405771248
> |[    0.004000] tick_next_period hpet 4000000
>
> base/nsec is different on each boot but it always ends up with 4000000.
> With the optimisation, ktime_get() is:
> | [    1.179079] tick_next_period base/nsec: 647439581 518613145
> | [    1.179646] tick_next_period hpet 1166052726
>
> so something is using it during init. And this is
> 	read_persistent_wall_and_boot_offset()

Duh, yes. I remember now. That was done to take the TSC "uptime" into
account, which is sensible.

So making tick_next_period aligned to TICK_NSEC in tick_setup_device()
is the right thing to do in general then.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick.
  2023-04-06 17:13       ` Thomas Gleixner
@ 2023-04-18 12:26         ` Sebastian Andrzej Siewior
  2023-06-13 22:59           ` [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report Mathias Krause
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-04-18 12:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Gusenleitner Klaus,
	Frederic Weisbecker

With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
expire the timer_list timers. This timer is programmed accurate in
respect to CLOCK_MONOTONIC so that 0 seconds and nanoseconds is the
first tick and the next one is 1000/CONFIG_HZ ms later. For HZ=250 it is
every 4 ms and so based on the current time the next tick can be
computed.

This accuracy broke since the commit mentioned below because the jiffy
based clocksource is initialized with higher accuracy in
read_persistent_wall_and_boot_offset(). This higher accuracy is
inherited during the setup in tick_setup_device(). The timer still fires
every 4ms with HZ=250 but timer is no longer aligned with
CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part
of the timestamp. The offset differs with every boot and makes it
impossible for user land to align with the tick.

Align the tick timer with CLOCK_MONOTONIC ensuring that it is always a
multiple of 1000/CONFIG_HZ ms.

Reported-by: Gusenleitner Klaus <gus@keba.com>
Link: https://lore.kernel.org/20230406095735.0_14edn3@linutronix.de
Fixes: 857baa87b6422 ("sched/clock: Enable sched clock early")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/time/tick-common.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 46789356f856e..65b8658da829e 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -218,9 +218,19 @@ static void tick_setup_device(struct tick_device *td,
 		 * this cpu:
 		 */
 		if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
+			ktime_t next_p;
+			u32 rem;
+
 			tick_do_timer_cpu = cpu;
 
-			tick_next_period = ktime_get();
+			next_p = ktime_get();
+			div_u64_rem(next_p, TICK_NSEC, &rem);
+			if (rem) {
+				next_p -= rem;
+				next_p += TICK_NSEC;
+			}
+
+			tick_next_period = next_p;
 #ifdef CONFIG_NO_HZ_FULL
 			/*
 			 * The boot CPU may be nohz_full, in which case set
-- 
2.40.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [tip: timers/core] tick/common: Align tick period with the HZ tick.
  2023-04-06  9:57 [RFC] tick_sched_timer() is not properly aligned, fixed by chance Sebastian Andrzej Siewior
  2023-04-06 10:55 ` Peter Zijlstra
@ 2023-04-18 13:14 ` tip-bot2 for Sebastian Andrzej Siewior
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2023-04-18 13:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Gusenleitner Klaus, Sebastian Andrzej Siewior, Thomas Gleixner,
	x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     e9523a0d81899361214d118ad60ef76f0e92f71d
Gitweb:        https://git.kernel.org/tip/e9523a0d81899361214d118ad60ef76f0e92f71d
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Tue, 18 Apr 2023 14:26:39 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 18 Apr 2023 15:06:50 +02:00

tick/common: Align tick period with the HZ tick.

With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
expire the timer_list timers. This timer is programmed accurate in
respect to CLOCK_MONOTONIC so that 0 seconds and nanoseconds is the
first tick and the next one is 1000/CONFIG_HZ ms later. For HZ=250 it is
every 4 ms and so based on the current time the next tick can be
computed.

This accuracy broke since the commit mentioned below because the jiffy
based clocksource is initialized with higher accuracy in
read_persistent_wall_and_boot_offset(). This higher accuracy is
inherited during the setup in tick_setup_device(). The timer still fires
every 4ms with HZ=250 but timer is no longer aligned with
CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part
of the timestamp. The offset differs with every boot and makes it
impossible for user land to align with the tick.

Align the tick period with CLOCK_MONOTONIC ensuring that it is always a
multiple of 1000/CONFIG_HZ ms.

Fixes: 857baa87b6422 ("sched/clock: Enable sched clock early")
Reported-by: Gusenleitner Klaus <gus@keba.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20230406095735.0_14edn3@linutronix.de
Link: https://lore.kernel.org/r/20230418122639.ikgfvu3f@linutronix.de
---
 kernel/time/tick-common.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 4678935..65b8658 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -218,9 +218,19 @@ static void tick_setup_device(struct tick_device *td,
 		 * this cpu:
 		 */
 		if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
+			ktime_t next_p;
+			u32 rem;
+
 			tick_do_timer_cpu = cpu;
 
-			tick_next_period = ktime_get();
+			next_p = ktime_get();
+			div_u64_rem(next_p, TICK_NSEC, &rem);
+			if (rem) {
+				next_p -= rem;
+				next_p += TICK_NSEC;
+			}
+
+			tick_next_period = next_p;
 #ifdef CONFIG_NO_HZ_FULL
 			/*
 			 * The boot CPU may be nohz_full, in which case set

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report
  2023-04-18 12:26         ` [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick Sebastian Andrzej Siewior
@ 2023-06-13 22:59           ` Mathias Krause
  2023-06-15  9:03             ` Richard W.M. Jones
                               ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Mathias Krause @ 2023-06-13 22:59 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Gusenleitner Klaus,
	Frederic Weisbecker, stable

Hi Sebastian,

On 18.04.23 14:26, Sebastian Andrzej Siewior wrote:
> [...]. The timer still fires
> every 4ms with HZ=250 but timer is no longer aligned with
> CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part
> of the timestamp. The offset differs with every boot and makes it
> impossible for user land to align with the tick.

I can observe these per-boot offsets too, but...

> Align the tick timer with CLOCK_MONOTONIC ensuring that it is always a
> multiple of 1000/CONFIG_HZ ms.

this change doesn't seem to achieve that goal, unfortunately. Quite the 
opposite. It makes the (boot) clock run faster and, because of the per-
boot different offset, differently fast for each boot. Up to the point 
where it's running too fast to make any progress at all.

This patch causes VM boot hangs for us. It took a while to identify as 
the boot hangs were only ~1 out of 30 but it's clearly it. Reverting 
the commit got me 100 boots in a row without any issue.

Instrumenting the kernel a little gave me a clue what the bug is. When 
switching from the boot timer tick device (which is 'hpet' in my setup) 
to 'lapic-deadline', the mode of the timer isn't changed and kept at 
TICKDEV_MODE_PERIODIC. As that device doesn't support this mode, 
tick_setup_periodic() will switch over to CLOCK_EVT_STATE_ONESHOT mode 
and program the next expire event based on tick_next_period.

clockevents_program_event() will calculate the delta of that timestamp 
and ktime_get() and pass that value on to dev->set_next_event() (which 
is lapic_next_deadline()) which will write it to the IA32_TSC_DEADLINE 
MSR.

That delta value -- which is still the per-boot different offset to 
ktime_get() your patch introduces -- now gets stuck and is taken as the 
new *jiffies tick time*. That's because tick_handle_periodic() -> 
tick_periodic() will advance tick_next_period by TICK_NSEC, make 
do_timer() increment jiffies_64 by one and then program the next event 
to be in TICK_NSEC ns based on the device's old expiry time, i.e. keep 
the offset intact. This is followed by re-arming the event by a call to 
clockevents_program_event() which does the already-know delta 
calculation and writes, again, the too little value to 
IA32_TSC_DEADLINE.

This effectively makes the jiffies based clock go too fast as the timer 
IRQ comes too early (less than TICK_NSEC ns). Sometimes it's barely 
noticeable, but sometimes it's so fast that the kernel is overloaded 
with only handling the local timer IRQ without making any further 
progress, especially in (nested) VM setups.

Without commit e9523a0d8189 ("tick/common: Align tick period with the 
HZ tick."), which was backported to many stable and LTS kernels (v6.3.2 
(571c3b46c9b3), v6.2.15 (f0cb827199ec), v6.1.28 (290e26ec0d01), 
v5.15.111 (a55050c7989c), v5.10.180 (c4013689269d) and v5.4.243 
(a3e7a3d472c2)) this clock drift is gone and my VMs boot again.

Before that commit, the delta between tick_next_period and ktime_get() 
was initially zero, so tick_handle_period() had to loop, as 
clockevents_program_event() will return with -ETIME. The next attempt 
would be done with a delta of TICK_NSEC which will make 
clockevents_program_event() succeed and ensure that future events don't 
need the additional loop iteration, as the delta got stuck at TICK_NSEC 
-- exactly where it should be.

We observed the bug first on the v6.3, v6.1 and v5.15 stable branch 
updates from May 11th and then, a week later, on v5.4 too. All first 
occurrences were coinciding with the bad commit going into the 
corresponding stable and LTS kernel releases.

The issue manifests itself as a fast running clock only during boot, 
when the clock source is still jiffies based. That'll eventually lead 
to a boot hang as the timer IRQs are firing too fast.

To reproduce this you can either boot loop a VM and try to get "lucky" 
to hit a big enough 'rem' value or just apply this little diff instead:

---8<---
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 65b8658da829..b01cf18a5d42 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -225,6 +225,7 @@ static void tick_setup_device(struct tick_device *td,
 
 			next_p = ktime_get();
 			div_u64_rem(next_p, TICK_NSEC, &rem);
+			rem = TICK_NSEC - 123;
 			if (rem) {
 				next_p -= rem;
 				next_p += TICK_NSEC;
--->8---

This should make the kernel get stuck with only handling timer ticks 
but not making any further progress.

Change the subtrahend to 1234 to get a system that boots but has an 
unrealistically fast clock during kernel initialization.

As reverting that commit fixes the issue for us but it seemingly fixes 
another bug for Klaus (or at least attempted to), the now uncovered bug 
should be fixed instead.

The fundamental issue is that the jiffies based clock source cannot be 
trusted and shouldn't be used to calculate offsets to timestamps in the 
future when tick_next_period mod ktime_get() != 0.

Can we defer the offset adjustment of tick_next_period to a later point 
in time when a stable clock source gets used, like 'tsc'?

Thanks,
Mathias

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report
  2023-06-13 22:59           ` [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report Mathias Krause
@ 2023-06-15  9:03             ` Richard W.M. Jones
  2023-06-15 11:40               ` Mathias Krause
  2023-06-15  9:22             ` Sebastian Andrzej Siewior
  2023-06-16 18:53             ` [tip: timers/urgent] tick/common: Align tick period during sched_timer setup tip-bot2 for Thomas Gleixner
  2 siblings, 1 reply; 12+ messages in thread
From: Richard W.M. Jones @ 2023-06-15  9:03 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, Ingo Molnar, Gusenleitner Klaus,
	Frederic Weisbecker, stable

[-- Attachment #1: Type: text/plain, Size: 690 bytes --]

On Wed, Jun 14, 2023 at 12:59:46AM +0200, Mathias Krause wrote:
> This patch causes VM boot hangs for us. It took a while to identify as 
> the boot hangs were only ~1 out of 30 but it's clearly it. Reverting 
> the commit got me 100 boots in a row without any issue.

FWIW I have quite a nice test program for catching these sorts of boot
hangs, see attached.  You need to change the VMLINUX define to point
to your vmlinux or vmlinuz file.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

[-- Attachment #2: bootbootboot.c --]
[-- Type: text/plain, Size: 3144 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <error.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>

#define NR_ITERATIONS 10000
#define MAX_THREADS 8
#define MAX_TIME 60 /* max time to wait for qemu to complete */

//#define VMLINUX "/home/rjones/d/linux/vmlinux"
#define VMLINUX "vmlinux"
//#define QEMU "/home/rjones/d/qemu/build/qemu-system-x86_64"
#define QEMU "qemu-system-x86_64"
#define QEMU_COMMAND \
  QEMU " -no-user-config -nodefaults -display none " \
  "-machine accel=kvm:tcg,graphics=off -cpu max,la57=off -m 1280 " \
  "-no-reboot " \
  "-rtc driftfix=slew -no-hpet -global kvm-pit.lost_tick_policy=discard " \
  "-kernel " VMLINUX " " \
  "-serial stdio " \
  "-append \"panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 selinux=0 TERM=xterm-256color\""

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned iterations;

static void *start_thread (void *);

int
main (int argc, char *argv[])
{
	long n, i;
	int r;
	pthread_t thread[MAX_THREADS];

	n = sysconf (_SC_NPROCESSORS_ONLN);
	if (n == -1) error (EXIT_FAILURE, errno, "sysconf");

	if (n > MAX_THREADS)
		n = MAX_THREADS;

	for (i = 0; i < n; ++i) {
		r = pthread_create (&thread[i], NULL, start_thread, NULL);
		if (r != 0) error (EXIT_FAILURE, r, "pthread_create");
	}
	for (i = 0; i < n; ++i) {
		r = pthread_join (thread[i], NULL);
		if (r != 0) error (EXIT_FAILURE, r, "pthread_join");
	}
	printf ("\n");
	printf ("\n");
	printf ("test ok\n");
	exit (EXIT_SUCCESS);
}

static void *
start_thread (void *vp)
{
	pid_t pid;
	char tmp[] = "/tmp/kernel.out.XXXXXX";
	char cmd[1024];
	int i, r, status;

	if (mkstemp (tmp) == -1)
		error (EXIT_FAILURE, errno, "mkstemp: %s", tmp);

	snprintf (cmd, sizeof cmd, QEMU_COMMAND " >& %s", tmp);

	/* This basically runs a loop starting qemu. */
	for (;;) {
		pthread_mutex_lock (&lock);
		if (iterations >= NR_ITERATIONS) {
			pthread_mutex_unlock (&lock);
			return NULL;
		}
		if (iterations <= MAX_THREADS) { // stagger the start times
			pthread_mutex_unlock (&lock);
			usleep (rand () % 3000000);
			pthread_mutex_lock (&lock);
		}
		iterations++;
		printf ("%d... ", iterations); fflush (stdout);
		pthread_mutex_unlock (&lock);

		pid = fork ();
		if (pid == -1) error (EXIT_FAILURE, errno, "fork");
		if (pid == 0) {
			/* Child process, run qemu and wait. */
			if (system (cmd) != 0)
				_exit (EXIT_FAILURE);
			else
				_exit (EXIT_SUCCESS);
		}

		/* In the parent wait up to MAX_TIME seconds. */
		for (i = 0; i < MAX_TIME; ++i) {
			r = waitpid (pid, &status, WNOHANG);
			if (r == -1) error (EXIT_FAILURE, errno, "waitpid");
			if (r > 0) break;
			sleep (1);
		}

		if (i == MAX_TIME || status != 0) {
			/* Something failed in qemu (or it didn't
			 * exit), dump the whole log and exit with
			 * error.
			 */
			printf ("\n");
			printf ("\n");
			snprintf (cmd, sizeof cmd, "tail -20 %s", tmp);
			system (cmd);
			fprintf (stderr, "*** ERROR OR HANG ***\n");
			exit (EXIT_FAILURE);
		}
	}

	unlink (tmp);
}

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report
  2023-06-13 22:59           ` [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report Mathias Krause
  2023-06-15  9:03             ` Richard W.M. Jones
@ 2023-06-15  9:22             ` Sebastian Andrzej Siewior
  2023-06-16 18:53             ` [tip: timers/urgent] tick/common: Align tick period during sched_timer setup tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-15  9:22 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Gusenleitner Klaus, Frederic Weisbecker, stable

On 2023-06-14 00:59:46 [+0200], Mathias Krause wrote:
> Hi Sebastian,
Hi Mathias,

> this change doesn't seem to achieve that goal, unfortunately. Quite the 
> opposite. It makes the (boot) clock run faster and, because of the per-
> boot different offset, differently fast for each boot. Up to the point 
> where it's running too fast to make any progress at all.

I meant to reply to this thread but took the other by accident. Sorry.
This should be addressed by
	https://lore.kernel.org/20230615091830.RxMV2xf_@linutronix.de

Thank you for the report.

> Thanks,
> Mathias

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report
  2023-06-15  9:03             ` Richard W.M. Jones
@ 2023-06-15 11:40               ` Mathias Krause
  0 siblings, 0 replies; 12+ messages in thread
From: Mathias Krause @ 2023-06-15 11:40 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, Ingo Molnar, Gusenleitner Klaus,
	Frederic Weisbecker, stable

On 15.06.23 11:03, Richard W.M. Jones wrote:
> On Wed, Jun 14, 2023 at 12:59:46AM +0200, Mathias Krause wrote:
>> This patch causes VM boot hangs for us. It took a while to identify as 
>> the boot hangs were only ~1 out of 30 but it's clearly it. Reverting 
>> the commit got me 100 boots in a row without any issue.
> 
> FWIW I have quite a nice test program for catching these sorts of boot
> hangs, see attached.  You need to change the VMLINUX define to point
> to your vmlinux or vmlinuz file.

Heh, nice. For this specific case, though, a simple boot loop in a
single VM was sufficient:

  $ kvm -cpu host -smp 8 -m 8G -display none -serial stdio -s \
        -kernel path/to/bzImage -append 'console=ttyS0 panic=1'

This will reboot the VM in an endless loop until it hangs which is
visually noticeable as there's no more output generated to the terminal.
At that time one can attach to it via gdb and investigate further:

  $ gdb -ex 'target remote :1234' path/to/vmlinux

Yet another advantage is that one can filter the output, e.g. via:

  $ kvm ... | grep clock

Thanks,
Mathias

> 
> Rich.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip: timers/urgent] tick/common: Align tick period during sched_timer setup
  2023-06-13 22:59           ` [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report Mathias Krause
  2023-06-15  9:03             ` Richard W.M. Jones
  2023-06-15  9:22             ` Sebastian Andrzej Siewior
@ 2023-06-16 18:53             ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 12+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-06-16 18:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mathias Krause, Bhatnagar, Rishabh, Thomas Gleixner,
	Sebastian Andrzej Siewior, Richard W.M. Jones, SeongJae Park,
	stable, x86, linux-kernel

The following commit has been merged into the timers/urgent branch of tip:

Commit-ID:     13bb06f8dd42071cb9a49f6e21099eea05d4b856
Gitweb:        https://git.kernel.org/tip/13bb06f8dd42071cb9a49f6e21099eea05d4b856
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 15 Jun 2023 11:18:30 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 16 Jun 2023 20:45:28 +02:00

tick/common: Align tick period during sched_timer setup

The tick period is aligned very early while the first clock_event_device is
registered. At that point the system runs in periodic mode and switches
later to one-shot mode if possible.

The next wake-up event is programmed based on the aligned value
(tick_next_period) but the delta value, that is used to program the
clock_event_device, is computed based on ktime_get().

With the subtracted offset, the device fires earlier than the exact time
frame. With a large enough offset the system programs the timer for the
next wake-up and the remaining time left is too small to make any boot
progress. The system hangs.

Move the alignment later to the setup of tick_sched timer. At this point
the system switches to oneshot mode and a high resolution clocksource is
available. At this point it is safe to align tick_next_period because
ktime_get() will now return accurate (not jiffies based) time.

[bigeasy: Patch description + testing].

Fixes: e9523a0d81899 ("tick/common: Align tick period with the HZ tick.")
Reported-by: Mathias Krause <minipli@grsecurity.net>
Reported-by: "Bhatnagar, Rishabh" <risbhat@amazon.com>
Suggested-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/5a56290d-806e-b9a5-f37c-f21958b5a8c0@grsecurity.net
Link: https://lore.kernel.org/12c6f9a3-d087-b824-0d05-0d18c9bc1bf3@amazon.com
Link: https://lore.kernel.org/r/20230615091830.RxMV2xf_@linutronix.de
---
 kernel/time/tick-common.c | 13 +------------
 kernel/time/tick-sched.c  | 13 ++++++++++++-
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 65b8658..e9138cd 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -218,19 +218,8 @@ static void tick_setup_device(struct tick_device *td,
 		 * this cpu:
 		 */
 		if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
-			ktime_t next_p;
-			u32 rem;
-
 			tick_do_timer_cpu = cpu;
-
-			next_p = ktime_get();
-			div_u64_rem(next_p, TICK_NSEC, &rem);
-			if (rem) {
-				next_p -= rem;
-				next_p += TICK_NSEC;
-			}
-
-			tick_next_period = next_p;
+			tick_next_period = ktime_get();
 #ifdef CONFIG_NO_HZ_FULL
 			/*
 			 * The boot CPU may be nohz_full, in which case set
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5225467..42c0be3 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -161,8 +161,19 @@ static ktime_t tick_init_jiffy_update(void)
 	raw_spin_lock(&jiffies_lock);
 	write_seqcount_begin(&jiffies_seq);
 	/* Did we start the jiffies update yet ? */
-	if (last_jiffies_update == 0)
+	if (last_jiffies_update == 0) {
+		u32 rem;
+
+		/*
+		 * Ensure that the tick is aligned to a multiple of
+		 * TICK_NSEC.
+		 */
+		div_u64_rem(tick_next_period, TICK_NSEC, &rem);
+		if (rem)
+			tick_next_period += TICK_NSEC - rem;
+
 		last_jiffies_update = tick_next_period;
+	}
 	period = last_jiffies_update;
 	write_seqcount_end(&jiffies_seq);
 	raw_spin_unlock(&jiffies_lock);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-06-16 18:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-06  9:57 [RFC] tick_sched_timer() is not properly aligned, fixed by chance Sebastian Andrzej Siewior
2023-04-06 10:55 ` Peter Zijlstra
2023-04-06 11:08   ` Thomas Gleixner
2023-04-06 15:02     ` Sebastian Andrzej Siewior
2023-04-06 17:13       ` Thomas Gleixner
2023-04-18 12:26         ` [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick Sebastian Andrzej Siewior
2023-06-13 22:59           ` [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report Mathias Krause
2023-06-15  9:03             ` Richard W.M. Jones
2023-06-15 11:40               ` Mathias Krause
2023-06-15  9:22             ` Sebastian Andrzej Siewior
2023-06-16 18:53             ` [tip: timers/urgent] tick/common: Align tick period during sched_timer setup tip-bot2 for Thomas Gleixner
2023-04-18 13:14 ` [tip: timers/core] tick/common: Align tick period with the HZ tick tip-bot2 for Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).