linux-kernel.vger.kernel.org archive mirror
* infinite loop in read_hpet from ktime_get_boot_fast_ns
@ 2019-06-07 14:14 Jason A. Donenfeld
  2019-06-11 21:09 ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-07 14:14 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, clemens, Sultan Alsawaf

Hey Thomas,

After some prior discussions here about the different clocks
available, WireGuard uses ktime_get_boot_fast_ns() pretty extensively.
The requirement is for a quasi-accurate monotonic counter that takes
into account sleep time, and this seems to fit the bill pretty well.
Sultan (CC'd) reported to me a non-reproducible bug he encountered in
4.19.47 (arch's linux-lts package), where the CPU was hung in
read_hpet.

CPU: 1 PID: 7927 Comm: kworker/1:3 Tainted: G           OE     4.19.47-1-lts #1
Hardware name: Dell Inc. XPS 15 9570/02MJVY, BIOS 1.10.1 04/26/2019
Workqueue: wg-crypt-interface wg_packet_tx_worker [wireguard]
RIP: 0010:read_hpet+0x67/0xc0
Code: c0 75 11 ba 01 00 00 00 f0 0f b1 15 a3 3d 1a 01 85 c0 74 37 48
89 cf 57 9d 0f 1f 44 00 00 48 c1 ee 20 eb 04 85 c9 74 12 f3 90 <49> 8b
08 48 89 ca 48 c1 ea 20 89 d0 39 f2 74 ea c3 48 8b 05 89 56
RSP: 0018:ffffb8d382533e18 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000018a4c89e RBX: 0000000000000000 RCX: 18a4c89e00000001
RDX: 0000000018a4c89e RSI: 0000000018a4c89e RDI: ffffffffb8227980
RBP: 000006c1c3f602a2 R08: ffffffffb8205040 R09: 0000000000000000
R10: 000001d58fd28efc R11: 0000000000000000 R12: ffffffffb8259a80
R13: 00000000ffffffff R14: 0000000518a0d8c4 R15: 000000000010fa5a
FS:  0000000000000000(0000) GS:ffff9b90ac240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00003663b14d9ce8 CR3: 000000030f20a006 CR4: 00000000003606e0
Call Trace:
 ktime_get_mono_fast_ns+0x53/0xa0
 ktime_get_boot_fast_ns+0x5/0x10
 wg_packet_tx_worker+0x183/0x220 [wireguard]
 process_one_work+0x1f4/0x3e0
 worker_thread+0x2d/0x3e0
 ? process_one_work+0x3e0/0x3e0
 kthread+0x112/0x130
 ? kthread_park+0x80/0x80
 ret_from_fork+0x35/0x40
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:3:7927]

It looks like RIP is spinning in this loop in read_hpet:

do {
    cpu_relax();
    new.lockval = READ_ONCE(hpet.lockval);
} while ((new.value == old.value) && arch_spin_is_locked(&new.lock));

I imagine this could be a bug in the hpet code, or a failure of the
hpet hardware. But I thought it'd be most prudent to check, first,
whether there are actually very particular conditions on when and
where ktime_get_boot_fast_ns and friends can be called. In other
words, maybe the bug is actually in my code. I was under the
impression that invoking it from anywhere was fine, given the
documentation says "NMI safe", but maybe there are still some
requirements I should keep in mind?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-07 14:14 infinite loop in read_hpet from ktime_get_boot_fast_ns Jason A. Donenfeld
@ 2019-06-11 21:09 ` Thomas Gleixner
  2019-06-11 21:40   ` Waiman Long
                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-06-11 21:09 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: LKML, clemens, Sultan Alsawaf, Waiman Long, x86, Peter Zijlstra

Jason,

On Fri, 7 Jun 2019, Jason A. Donenfeld wrote:

Adding a few more people on cc and keeping full context.

> Hey Thomas,
> 
> After some prior discussions here about the different clocks
> available, WireGuard uses ktime_get_boot_fast_ns() pretty extensively.
> The requirement is for a quasi-accurate monotonic counter that takes
> into account sleep time, and this seems to fit the bill pretty well.
> Sultan (CC'd) reported to me a non-reproducible bug he encountered in
> 4.19.47 (arch's linux-lts package), where the CPU was hung in
> read_hpet.
> 
> CPU: 1 PID: 7927 Comm: kworker/1:3 Tainted: G           OE     4.19.47-1-lts #1
> Hardware name: Dell Inc. XPS 15 9570/02MJVY, BIOS 1.10.1 04/26/2019
> Workqueue: wg-crypt-interface wg_packet_tx_worker [wireguard]
> RIP: 0010:read_hpet+0x67/0xc0
> Code: c0 75 11 ba 01 00 00 00 f0 0f b1 15 a3 3d 1a 01 85 c0 74 37 48
> 89 cf 57 9d 0f 1f 44 00 00 48 c1 ee 20 eb 04 85 c9 74 12 f3 90 <49> 8b
> 08 48 89 ca 48 c1 ea 20 89 d0 39 f2 74 ea c3 48 8b 05 89 56
> RSP: 0018:ffffb8d382533e18 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
> RAX: 0000000018a4c89e RBX: 0000000000000000 RCX: 18a4c89e00000001
> RDX: 0000000018a4c89e RSI: 0000000018a4c89e RDI: ffffffffb8227980
> RBP: 000006c1c3f602a2 R08: ffffffffb8205040 R09: 0000000000000000
> R10: 000001d58fd28efc R11: 0000000000000000 R12: ffffffffb8259a80
> R13: 00000000ffffffff R14: 0000000518a0d8c4 R15: 000000000010fa5a
> FS:  0000000000000000(0000) GS:ffff9b90ac240000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00003663b14d9ce8 CR3: 000000030f20a006 CR4: 00000000003606e0
> Call Trace:
>  ktime_get_mono_fast_ns+0x53/0xa0
>  ktime_get_boot_fast_ns+0x5/0x10
>  wg_packet_tx_worker+0x183/0x220 [wireguard]
>  process_one_work+0x1f4/0x3e0
>  worker_thread+0x2d/0x3e0
>  ? process_one_work+0x3e0/0x3e0
>  kthread+0x112/0x130
>  ? kthread_park+0x80/0x80
>  ret_from_fork+0x35/0x40
> watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:3:7927]
> 
> It looks like RIP is spinning in this loop in read_hpet:
> 
> do {
>     cpu_relax();
>     new.lockval = READ_ONCE(hpet.lockval);
> } while ((new.value == old.value) && arch_spin_is_locked(&new.lock));
> 
> I imagine this could be a bug in the hpet code, or a failure of the
> hpet hardware. But I thought it'd be most prudent to check, first,
> whether there are actually very particular conditions on when and
> where ktime_get_boot_fast_ns and friends can be called. In other
> words, maybe the bug is actually in my code. I was under the
> impression that invoking it from anywhere was fine, given the
> documentation says "NMI safe", but maybe there are still some
> requirements I should keep in mind?

I think your code is fine. Just 'fast' is relative with the HPET selected
as clocksource (it's actually awfully slow).

It probably livelocks in the HPET optimization Waiman did for large
machines. I'm having déjà vu with that spinlock livelock we debugged last
year. Peter?

Can you please ask the reporter to try the hack below?

Thanks,

	tglx

8<---------------
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a0573f2e7763..0c9044698489 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -795,8 +795,7 @@ static u64 read_hpet(struct clocksource *cs)
 	/*
 	 * Read HPET directly if in NMI.
 	 */
-	if (in_nmi())
-		return (u64)hpet_readl(HPET_COUNTER);
+	return (u64)hpet_readl(HPET_COUNTER);
 
 	/*
 	 * Read the current state of the lock and HPET value atomically.


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-11 21:09 ` Thomas Gleixner
@ 2019-06-11 21:40   ` Waiman Long
  2019-06-12  9:02   ` Peter Zijlstra
  2019-06-12  9:29   ` infinite loop in read_hpet from ktime_get_boot_fast_ns Peter Zijlstra
  2 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2019-06-11 21:40 UTC (permalink / raw)
  To: Thomas Gleixner, Jason A. Donenfeld
  Cc: LKML, clemens, Sultan Alsawaf, x86, Peter Zijlstra

On 6/11/19 5:09 PM, Thomas Gleixner wrote:
> Jason,
>
> On Fri, 7 Jun 2019, Jason A. Donenfeld wrote:
>
> Adding a few more people on cc and keeping full context.
>
>> Hey Thomas,
>>
>> After some prior discussions here about the different clocks
>> available, WireGuard uses ktime_get_boot_fast_ns() pretty extensively.
>> The requirement is for a quasi-accurate monotonic counter that takes
>> into account sleep time, and this seems to fit the bill pretty well.
>> Sultan (CC'd) reported to me a non-reproducible bug he encountered in
>> 4.19.47 (arch's linux-lts package), where the CPU was hung in
>> read_hpet.
>>
>> CPU: 1 PID: 7927 Comm: kworker/1:3 Tainted: G           OE     4.19.47-1-lts #1
>> Hardware name: Dell Inc. XPS 15 9570/02MJVY, BIOS 1.10.1 04/26/2019
>> Workqueue: wg-crypt-interface wg_packet_tx_worker [wireguard]
>> RIP: 0010:read_hpet+0x67/0xc0
>> Code: c0 75 11 ba 01 00 00 00 f0 0f b1 15 a3 3d 1a 01 85 c0 74 37 48
>> 89 cf 57 9d 0f 1f 44 00 00 48 c1 ee 20 eb 04 85 c9 74 12 f3 90 <49> 8b
>> 08 48 89 ca 48 c1 ea 20 89 d0 39 f2 74 ea c3 48 8b 05 89 56
>> RSP: 0018:ffffb8d382533e18 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
>> RAX: 0000000018a4c89e RBX: 0000000000000000 RCX: 18a4c89e00000001
>> RDX: 0000000018a4c89e RSI: 0000000018a4c89e RDI: ffffffffb8227980
>> RBP: 000006c1c3f602a2 R08: ffffffffb8205040 R09: 0000000000000000
>> R10: 000001d58fd28efc R11: 0000000000000000 R12: ffffffffb8259a80
>> R13: 00000000ffffffff R14: 0000000518a0d8c4 R15: 000000000010fa5a
>> FS:  0000000000000000(0000) GS:ffff9b90ac240000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00003663b14d9ce8 CR3: 000000030f20a006 CR4: 00000000003606e0
>> Call Trace:
>>  ktime_get_mono_fast_ns+0x53/0xa0
>>  ktime_get_boot_fast_ns+0x5/0x10
>>  wg_packet_tx_worker+0x183/0x220 [wireguard]
>>  process_one_work+0x1f4/0x3e0
>>  worker_thread+0x2d/0x3e0
>>  ? process_one_work+0x3e0/0x3e0
>>  kthread+0x112/0x130
>>  ? kthread_park+0x80/0x80
>>  ret_from_fork+0x35/0x40
>> watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:3:7927]
>>
>> It looks like RIP is spinning in this loop in read_hpet:
>>
>> do {
>>     cpu_relax();
>>     new.lockval = READ_ONCE(hpet.lockval);
>> } while ((new.value == old.value) && arch_spin_is_locked(&new.lock));

The hang shouldn't happen unless the hpet_lock structure is somehow
corrupted by another unrelated task. If someone inadvertently changes the
content of the lock word, a livelock will happen.

>> I imagine this could be a bug in the hpet code, or a failure of the
>> hpet hardware. But I thought it'd be most prudent to check, first,
>> whether there are actually very particular conditions on when and
>> where ktime_get_boot_fast_ns and friends can be called. In other
>> words, maybe the bug is actually in my code. I was under the
>> impression that invoking it from anywhere was fine, given the
>> documentation says "NMI safe", but maybe there are still some
>> requirements I should keep in mind?
> I think your code is fine. Just 'fast' is relative with the HPET selected
> as clocksource (it's actually awfully slow).
>
> It probably livelocks in the HPET optimization Waiman did for large
> machines. I'm having déjà vu with that spinlock livelock we debugged last
> year. Peter?
>
> Can you please ask the reporter to try the hack below?
>
> Thanks,
>
> 	tglx
>
> 8<---------------
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index a0573f2e7763..0c9044698489 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -795,8 +795,7 @@ static u64 read_hpet(struct clocksource *cs)
>  	/*
>  	 * Read HPET directly if in NMI.
>  	 */
> -	if (in_nmi())
> -		return (u64)hpet_readl(HPET_COUNTER);

While at it, add:

+    WARN_ON(arch_spin_is_locked(&hpet.lock));

> +	return (u64)hpet_readl(HPET_COUNTER);
>  
>  	/*
>  	 * Read the current state of the lock and HPET value atomically.

Cheers,
Longman



* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-11 21:09 ` Thomas Gleixner
  2019-06-11 21:40   ` Waiman Long
@ 2019-06-12  9:02   ` Peter Zijlstra
  2019-06-12  9:44     ` Jason A. Donenfeld
  2019-06-12  9:29   ` infinite loop in read_hpet from ktime_get_boot_fast_ns Peter Zijlstra
  2 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2019-06-12  9:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason A. Donenfeld, LKML, clemens, Sultan Alsawaf, Waiman Long, x86

On Tue, Jun 11, 2019 at 11:09:20PM +0200, Thomas Gleixner wrote:
> Jason,
> 
> On Fri, 7 Jun 2019, Jason A. Donenfeld wrote:
> 
> Adding a few more people on cc and keeping full context.
> 
> > Hey Thomas,
> > 
> > After some prior discussions here about the different clocks
> > available, WireGuard uses ktime_get_boot_fast_ns() pretty extensively.
> > The requirement is for a quasi-accurate monotonic counter that takes
> > into account sleep time, and this seems to fit the bill pretty well.

How quasi? Do the comments in kernel/sched/clock.c look like something
you could use?

As already mentioned in the other thread, anything ktime will be
horrifically crap when it ends up using the HPET, the code in
kernel/sched/clock.c is a best effort to keep using TSC even when it is
deemed unusable for timekeeping.


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-11 21:09 ` Thomas Gleixner
  2019-06-11 21:40   ` Waiman Long
  2019-06-12  9:02   ` Peter Zijlstra
@ 2019-06-12  9:29   ` Peter Zijlstra
  2 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-06-12  9:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jason A. Donenfeld, LKML, clemens, Sultan Alsawaf, Waiman Long, x86

On Tue, Jun 11, 2019 at 11:09:20PM +0200, Thomas Gleixner wrote:

> > CPU: 1 PID: 7927 Comm: kworker/1:3 Tainted: G           OE     4.19.47-1-lts #1
> > Hardware name: Dell Inc. XPS 15 9570/02MJVY, BIOS 1.10.1 04/26/2019

That's a laptop, limited number of CPUs in those.

> > Workqueue: wg-crypt-interface wg_packet_tx_worker [wireguard]
> > RIP: 0010:read_hpet+0x67/0xc0
> > Code: c0 75 11 ba 01 00 00 00 f0 0f b1 15 a3 3d 1a 01 85 c0 74 37 48
> > 89 cf 57 9d 0f 1f 44 00 00 48 c1 ee 20 eb 04 85 c9 74 12 f3 90 <49> 8b
> > 08 48 89 ca 48 c1 ea 20 89 d0 39 f2 74 ea c3 48 8b 05 89 56
> > RSP: 0018:ffffb8d382533e18 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
> > RAX: 0000000018a4c89e RBX: 0000000000000000 RCX: 18a4c89e00000001
> > RDX: 0000000018a4c89e RSI: 0000000018a4c89e RDI: ffffffffb8227980
> > RBP: 000006c1c3f602a2 R08: ffffffffb8205040 R09: 0000000000000000
> > R10: 000001d58fd28efc R11: 0000000000000000 R12: ffffffffb8259a80
> > R13: 00000000ffffffff R14: 0000000518a0d8c4 R15: 000000000010fa5a
> > FS:  0000000000000000(0000) GS:ffff9b90ac240000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00003663b14d9ce8 CR3: 000000030f20a006 CR4: 00000000003606e0
> > Call Trace:
> >  ktime_get_mono_fast_ns+0x53/0xa0
> >  ktime_get_boot_fast_ns+0x5/0x10
> >  wg_packet_tx_worker+0x183/0x220 [wireguard]
> >  process_one_work+0x1f4/0x3e0
> >  worker_thread+0x2d/0x3e0

This is process context, no IRQs or NMIs around.

> >  ? process_one_work+0x3e0/0x3e0
> >  kthread+0x112/0x130
> >  ? kthread_park+0x80/0x80
> >  ret_from_fork+0x35/0x40
> > watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:3:7927]
> > 
> > It looks like RIP is spinning in this loop in read_hpet:
> > 
> > do {
> >     cpu_relax();
> >     new.lockval = READ_ONCE(hpet.lockval);
> > } while ((new.value == old.value) && arch_spin_is_locked(&new.lock));

> I think your code is fine. Just 'fast' is relative with the HPET selected
> as clocksource (it's actually awfully slow).

Yeah, the _fast suffix is mostly to denote the NMI safeness of the
accessor, and the corners it cuts to be so, which violate strict
monotonicity.

If you land on the HPET, there is nothing fast about it, HPET sucks,
always has, always will.

(on a personal note, I'd return any hardware that ends up on HPET as
broken to the vendor)

> It probably livelocks in the HPET optimization Waiman did for large
> machines. I'm having déjà vu with that spinlock livelock we debugged last
> year. Peter?

Sorry, no memories left of that :/

  CPU0				CPU1

				if (trylock(lock))
					new = hpet_read();
					WRITE_ONCE(hpet, new);
  old = READ_ONCE(hpet)
  r = READ_ONCE(lock)
  if (r)
    goto contended
					unlock(lock);
					return new;
				}

  do {
	  new = READ_ONCE(hpet);
  } while (new == old && READ_ONCE(lock));

I don't see how we can stay stuck in that loop; either @hpet changing or
@lock being unlocked should terminate it.

Either (or both) must happen given the code.

> Can you please ask the reporter to try the hack below?
> 
> Thanks,
> 
> 	tglx
> 
> 8<---------------
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index a0573f2e7763..0c9044698489 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -795,8 +795,7 @@ static u64 read_hpet(struct clocksource *cs)
>  	/*
>  	 * Read HPET directly if in NMI.
>  	 */
> -	if (in_nmi())
> -		return (u64)hpet_readl(HPET_COUNTER);
> +	return (u64)hpet_readl(HPET_COUNTER);

That basically kills the entire optimization. At which point we'll get
even slower. That said, the laptop doesn't have enough CPUs to
credibly livelock on this.



* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12  9:02   ` Peter Zijlstra
@ 2019-06-12  9:44     ` Jason A. Donenfeld
  2019-06-12 12:28       ` Peter Zijlstra
  2019-06-12 14:01       ` Arnd Bergmann
  0 siblings, 2 replies; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-12  9:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, clemens, Sultan Alsawaf, Waiman Long, X86 ML

Hey Peter,

On Wed, Jun 12, 2019 at 11:03 AM Peter Zijlstra <peterz@infradead.org> wrote:
> How quasi? Do the comments in kernel/sched/clock.c look like something
> you could use?
>
> As already mentioned in the other thread, anything ktime will be
> horrifically crap when it ends up using the HPET, the code in
> kernel/sched/clock.c is a best effort to keep using TSC even when it is
> deemed unusable for timekeeping.

Thanks for pointing that out. Indeed the HPET path is a bummer and I'd
like to just escape using ktime all together.

In fact, my accuracy requirements are very lax. I could probably even
deal with an inaccuracy as huge as ~200 milliseconds. But what I do
need is 64-bit, so that it doesn't wrap, allowing me to compare two
stamps taken a long time apart, and for it to take into account sleep
time, like CLOCK_BOOTTIME does, which means get_jiffies_64() doesn't
fit the bill. I was under the impression that I could only get this
with ktime_get_boot & co, because those add the sleep offset.

It looks like, though, kernel/sched/clock.c keeps track of some
offsets too -- __sched_clock_offset and __gtod_offset, and the comment
at the top mentions explicit sleep hooks. I wasn't sure which function
to use from here, though. sched_clock() seems to be based on jiffies, which
has the 32-bit wraparound issue, and the base implementation doesn't
seem to take into account sleeptime. The x86 implementation seems to use
rdtsc and then adds cyc2ns_offset which looks to be based on
cyc2ns_suspend, which I assume is what I want. But there's still the
issue of the 32-bit wraparound on the base implementation.

I guess you know this code better than my quick perusal. Is there some
clock in here that doesn't have a wrap around issue and takes into
account sleeptime, without being super slow like ktime/hpet?

Jason


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12  9:44     ` Jason A. Donenfeld
@ 2019-06-12 12:28       ` Peter Zijlstra
  2019-06-12 12:31         ` Peter Zijlstra
                           ` (3 more replies)
  2019-06-12 14:01       ` Arnd Bergmann
  1 sibling, 4 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-06-12 12:28 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Thomas Gleixner, LKML, clemens, Sultan Alsawaf, Waiman Long, X86 ML

On Wed, Jun 12, 2019 at 11:44:35AM +0200, Jason A. Donenfeld wrote:
> Hey Peter,
> 
> On Wed, Jun 12, 2019 at 11:03 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > How quasi? Do the comments in kernel/sched/clock.c look like something
> > you could use?
> >
> > As already mentioned in the other thread, anything ktime will be
> > horrifically crap when it ends up using the HPET, the code in
> > kernel/sched/clock.c is a best effort to keep using TSC even when it is
> > deemed unusable for timekeeping.
> 
> Thanks for pointing that out. Indeed the HPET path is a bummer and I'd
> like to just escape using ktime all together.
> 
> In fact, my accuracy requirements are very lax. I could probably even
> deal with an inaccuracy as huge as ~200 milliseconds. But what I do
> need is 64-bit, so that it doesn't wrap, allowing me to compare two
> stamps taken a long time apart, and for it to take into account sleep
> time, like CLOCK_BOOTTIME does, which means get_jiffies_64() doesn't
> fit the bill. I was under the impression that I could only get this
> with ktime_get_boot & co, because those add the sleep offset.
> 
> It looks like, though, kernel/sched/clock.c keeps track of some
> offsets too -- __sched_clock_offset and __gtod_offset,

Right, those are used to keep the clock values coherent (as best as
possible) when we switch modes.

When the TSC is stable sched_clock_cpu() is mapped directly to
sched_clock() for performance reasons. The moment the TSC is detected to
be unsuitable, we switch to the unstable mode, where we take a GTOD
timestamp every tick and add resolution with the CPU local TSC (plus
filters etc..).

To make this mode-switch as smooth as possible, we track those offsets.

> and the comment at the top mentions explicit sleep hooks. I wasn't
> sure which function to use from here, though. 

Either local_clock() or cpu_clock(cpu). The sleep hooks are not
something the consumer has to worry about.

> sched_clock() seems to be based on jiffies, which
> has the 32-bit wraparound issue, and the base implementation doesn't
> seem to take into account sleeptime. The x86 implementation seems to use
> rdtsc and then adds cyc2ns_offset which looks to be based on
> cyc2ns_suspend, which I assume is what I want. 

Yes.

> But there's still the
> issue of the 32-bit wraparound on the base implementation.

If an architecture doesn't provide a sched_clock(), you're on a
seriously handicapped arch. It wraps in ~500 days, and aside from
changing jiffies_lock to a latch, I don't think we can do much about it.

(the scheduler too expects sched_clock() to not wrap short of the u64
and so having those machines online for 500 days will get you 'funny'
results)

AFAICT only: alpha, h8300, hexagon, m68knommu, nds32, nios2, openrisc
are lacking any form of sched_clock(), the rest has it either natively
or through sched_clock_register().

> I guess you know this code better than my quick perusal. Is there some
> clock in here that doesn't have a wrap around issue and takes into
> account sleeptime, without being super slow like ktime/hpet?

You probably want to use local_clock().



* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12 12:28       ` Peter Zijlstra
@ 2019-06-12 12:31         ` Peter Zijlstra
  2019-06-12 12:58         ` Jason A. Donenfeld
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-06-12 12:31 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Thomas Gleixner, LKML, clemens, Sultan Alsawaf, Waiman Long,
	X86 ML, arnd

On Wed, Jun 12, 2019 at 02:28:43PM +0200, Peter Zijlstra wrote:

> AFAICT only: alpha, h8300, hexagon, m68knommu, nds32, nios2, openrisc
> are lacking any form of sched_clock(), the rest has it either natively
> or through sched_clock_register().

Scratch nds32, they use drivers/clocksource/timer-atcpit100.c

Thanks Arnd!


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12 12:28       ` Peter Zijlstra
  2019-06-12 12:31         ` Peter Zijlstra
@ 2019-06-12 12:58         ` Jason A. Donenfeld
  2019-06-12 15:27           ` Peter Zijlstra
  2019-06-12 19:46         ` Arnd Bergmann
  2019-06-18 17:34         ` Jason A. Donenfeld
  3 siblings, 1 reply; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, clemens, Sultan Alsawaf, Waiman Long, X86 ML

Hi Peter,

Thanks for the explanation.

On Wed, Jun 12, 2019 at 2:29 PM Peter Zijlstra <peterz@infradead.org> wrote:
> Either local_clock() or cpu_clock(cpu). The sleep hooks are not
> something the consumer has to worry about.

Alright. Just so long as it *is* tracking sleep, then that's fine. If
it isn't some important aspects of the protocol will be violated.

> If an architecture doesn't provide a sched_clock(), you're on a
> seriously handicapped arch. It wraps in ~500 days, and aside from
> changing jiffies_lock to a latch, I don't think we can do much about it.

Are you sure? The base definition I'm looking at uses jiffies:

unsigned long long __weak sched_clock(void)
{
        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
                                        * (NSEC_PER_SEC / HZ);
}

On a CONFIG_HZ_1000 machine, jiffies wraps in ~49.7 days:
>>> ((1<<32)-1)/1000/(60*60*24)
49.710269618055555

Why not just use get_jiffies_64()? The lock is too costly on 32bit?

> (the scheduler too expects sched_clock() to not wrap short of the u64
> and so having those machines online for 500 days will get you 'funny'
> results)

Ahh. So if, on the other hand, the whole machine explodes at the wrap
mark, I guess my silly protocol is the least of concerns, and so this
shouldn't matter?

Jason


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12  9:44     ` Jason A. Donenfeld
  2019-06-12 12:28       ` Peter Zijlstra
@ 2019-06-12 14:01       ` Arnd Bergmann
  2019-06-13 15:18         ` Jason A. Donenfeld
  1 sibling, 1 reply; 26+ messages in thread
From: Arnd Bergmann @ 2019-06-12 14:01 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

On Wed, Jun 12, 2019 at 11:46 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Wed, Jun 12, 2019 at 11:03 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > How quasi? Do the comments in kernel/sched/clock.c look like something
> > you could use?
> >
> > As already mentioned in the other thread, anything ktime will be
> > horrifically crap when it ends up using the HPET, the code in
> > kernel/sched/clock.c is a best effort to keep using TSC even when it is
> > deemed unusable for timekeeping.
>
> Thanks for pointing that out. Indeed the HPET path is a bummer and I'd
> like to just escape using ktime all together.
>
> In fact, my accuracy requirements are very lax. I could probably even
> deal with an inaccuracy as huge as ~200 milliseconds. But what I do
> need is 64-bit, so that it doesn't wrap, allowing me to compare two
> stamps taken a long time apart, and for it to take into account sleep
> time, like CLOCK_BOOTTIME does, which means get_jiffies_64() doesn't
> fit the bill. I was under the impression that I could only get this
> with ktime_get_boot & co, because those add the sleep offset.

Documentation/core-api/timekeeping.rst describes the timekeeping
interfaces. I think what you want here is ktime_get_coarse_boottime().

Note that "coarse" means "don't access the hardware clocksource"
here, which is faster than "fast", but less accurate.

This is updated as often as "jiffies_64", but is in nanosecond resolution
and takes suspended time into account.

      Arnd


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12 12:58         ` Jason A. Donenfeld
@ 2019-06-12 15:27           ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-06-12 15:27 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Thomas Gleixner, LKML, clemens, Sultan Alsawaf, Waiman Long, X86 ML

On Wed, Jun 12, 2019 at 02:58:21PM +0200, Jason A. Donenfeld wrote:
> Hi Peter,
> 
> Thanks for the explanation.
> 
> On Wed, Jun 12, 2019 at 2:29 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > Either local_clock() or cpu_clock(cpu). The sleep hooks are not
> > something the consumer has to worry about.
> 
> Alright. Just so long as it *is* tracking sleep, then that's fine. If
> it isn't some important aspects of the protocol will be violated.

The scheduler also cares about how long a task has been sleeping, so
yes, that's automagic.

> > If an architecture doesn't provide a sched_clock(), you're on a
> > seriously handicapped arch. It wraps in ~500 days, and aside from
> > changing jiffies_lock to a latch, I don't think we can do much about it.
> 
> Are you sure? The base definition I'm looking at uses jiffies:
> 
> unsigned long long __weak sched_clock(void)
> {
>         return (unsigned long long)(jiffies - INITIAL_JIFFIES)
>                                         * (NSEC_PER_SEC / HZ);
> }
> 
> On a CONFIG_HZ_1000 machine, jiffies wraps in ~49.7 days:
> >>> ((1<<32)-1)/1000/(60*60*24)
> 49.710269618055555

Bah, I must've done the math wrong (or assumed HZ=100).

> Why not just use get_jiffies_64()? The lock is too costly on 32bit?

Deadlocks when you do get_jiffies_64() from within an update. What would
be an easier update is forcing everyone to use the GENERIC_SCHED_CLOCK
fallback or something like that.

OTOH, changing jiffies_lock to a latch shouldn't be rocket science
either.

> > (the scheduler too expects sched_clock() to not wrap short of the u64
> > and so having those machines online for 500 days will get you 'funny'
> > results)
> 
> Ahh. So if, on the other hand, the whole machine explodes at the wrap
> mark, I guess my silly protocol is the least of concerns, and so this
> shouldn't matter?

That was my thinking...


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12 12:28       ` Peter Zijlstra
  2019-06-12 12:31         ` Peter Zijlstra
  2019-06-12 12:58         ` Jason A. Donenfeld
@ 2019-06-12 19:46         ` Arnd Bergmann
  2019-06-18 17:34         ` Jason A. Donenfeld
  3 siblings, 0 replies; 26+ messages in thread
From: Arnd Bergmann @ 2019-06-12 19:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason A. Donenfeld, Thomas Gleixner, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

On Wed, Jun 12, 2019 at 7:55 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 12, 2019 at 11:44:35AM +0200, Jason A. Donenfeld wrote:

> > But there's still the
> > issue of the 32-bit wraparound on the base implementation.
>
> If an architecture doesn't provide a sched_clock(), you're on a
> seriously handicapped arch. It wraps in ~500 days, and aside from
> changing jiffies_lock to a latch, I don't think we can do much about it.
>
> (the scheduler too expects sched_clock() to not wrap short of the u64
> and so having those machines online for 500 days will get you 'funny'
> results)
>
> AFAICT only: alpha, h8300, hexagon, m68knommu, nds32, nios2, openrisc
> are lacking any form of sched_clock(), the rest has it either natively
> or through sched_clock_register().

For completeness (as we already discussed on IRC), on many architectures
this depends on the clocksource driver: many (older) arm, mips, sh or
m68k implementations don't have sched_clock(). All the modern ones tend
to have one, but older ones may only support an interval timer tick that
cannot be read.

        Arnd


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12 14:01       ` Arnd Bergmann
@ 2019-06-13 15:18         ` Jason A. Donenfeld
  2019-06-13 15:40           ` Arnd Bergmann
  0 siblings, 1 reply; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-13 15:18 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

Hey Arnd, Peter,

On Wed, Jun 12, 2019 at 4:01 PM Arnd Bergmann <arnd@arndb.de> wrote:
> Documentation/core-api/timekeeping.rst describes the timekeeping
> interfaces. I think what you want here is ktime_get_coarse_boottime().
>
> Note that "coarse" means "don't access the hardware clocksource"
> here, which is faster than "fast", but less accurate.
>
> This is updated as often as "jiffies_64", but is in nanosecond resolution
> and takes suspended time into account.

Oh, thanks. Indeed ktime_get_coarse_boottime seems even better. It's
perhaps a bit slower, in that it has that seqlock, but that might give
better synchronization between CPUs as well.

Peter - any immediate downside you can think of compared to local_clock()?

Jason

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-13 15:18         ` Jason A. Donenfeld
@ 2019-06-13 15:40           ` Arnd Bergmann
  2019-06-13 16:17             ` Jason A. Donenfeld
  0 siblings, 1 reply; 26+ messages in thread
From: Arnd Bergmann @ 2019-06-13 15:40 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

On Thu, Jun 13, 2019 at 5:19 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hey Arnd, Peter,
>
> On Wed, Jun 12, 2019 at 4:01 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > Documentation/core-api/timekeeping.rst describes the timekeeping
> > interfaces. I think what you want here is ktime_get_coarse_boottime().
> >
> > Note that "coarse" means "don't access the hardware clocksource"
> > here, which is faster than "fast", but less accurate.
> >
> > This is updated as often as "jiffies_64", but is in nanosecond resolution
> > and takes suspended time into account.
>
> Oh, thanks. Indeed ktime_get_coarse_boottime seems even better. It's
> perhaps a bit slower, in that it has that seqlock, but that might give
> better synchronization between CPUs as well.

A seqlock is a very cheap synchronization primitive, I would actually
guess that this is faster than most implementations of sched_clock()
that access a hardware register for reading the time.

       Arnd

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-13 15:40           ` Arnd Bergmann
@ 2019-06-13 16:17             ` Jason A. Donenfeld
  2019-06-13 16:26               ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-13 16:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

Hey Arnd,

On Thu, Jun 13, 2019 at 5:40 PM Arnd Bergmann <arnd@arndb.de> wrote:
> A seqlock is a very cheap synchronization primitive, I would actually
> guess that this is faster than most implementations of sched_clock()
> that access a hardware register for reading the time.

It appears to me that ktime_get_coarse_boottime() has a granularity of
a whole second, which is a lot worse than jiffies. Looking at the
source, you assign base but don't then add ns like the other
functions. At first I thought this was an intentional quirk to avoid
hitting the slow hardware paths. But noticing this poor granularity
now and observing that there's actually a blank line (\n\n) where the
nanosecond addition normally would be, I wonder if something was lost
in cut-and-paste?

I'm still poking around trying to see what's up. As a quick test,
running this on every packet during a high speed test shows the left
incrementing many times per second, whereas the right increments once
per second:

static int x = 0;
if (!(x++ % 30000))
     pr_err("%llu %llu\n", local_clock(), ktime_get_coarse_boottime());

Jason

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-13 16:17             ` Jason A. Donenfeld
@ 2019-06-13 16:26               ` Thomas Gleixner
  2019-06-13 16:34                 ` Jason A. Donenfeld
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2019-06-13 16:26 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Arnd Bergmann, Peter Zijlstra, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

On Thu, 13 Jun 2019, Jason A. Donenfeld wrote:
> Hey Arnd,
> 
> On Thu, Jun 13, 2019 at 5:40 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > A seqlock is a very cheap synchronization primitive, I would actually
> > guess that this is faster than most implementations of sched_clock()
> > that access a hardware register for reading the time.
> 
> It appears to me that ktime_get_coarse_boottime() has a granularity of
> a whole second, which is a lot worse than jiffies. Looking at the
> source, you assign base but don't then add ns like the other
> functions. At first I thought this was an intentional quirk to avoid
> hitting the slow hardware paths. But noticing this poor granularity
> now and observing that there's actually a blank line (\n\n) where the
> nanosecond addition normally would be, I wonder if something was lost
> in cut-and-paste?
> 
> I'm still poking around trying to see what's up. As a quick test,
> running this on every packet during a high speed test shows the left
> incrementing many times per second, whereas the right increments once
> per second:
> 
> static int x = 0;
> if (!(x++ % 30000))
>      pr_err("%llu %llu\n", local_clock(), ktime_get_coarse_boottime());

That does not make sense. The coarse time getters use
tk->tkr_mono.base. base is updated every tick (or if the machine is
completely idle right when the first CPU wakes up again).

timekeeping_get_ns(), which is added in ktime_get_boottime(), is the time in
ns elapsed since base was updated last, which is less than a tick.

Thanks,

	tglx

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-13 16:26               ` Thomas Gleixner
@ 2019-06-13 16:34                 ` Jason A. Donenfeld
  2019-06-13 16:41                   ` Jason A. Donenfeld
  2019-06-13 19:53                   ` Thomas Gleixner
  0 siblings, 2 replies; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-13 16:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Arnd Bergmann, Peter Zijlstra, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

On Thu, Jun 13, 2019 at 6:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> That does not make sense. The coarse time getters use
> tk->tkr_mono.base. base is updated every tick (or if the machine is
> completely idle right when the first CPU wakes up again).

Sense or not, it seems to be happening, at least on 5.2-rc4:

int __init mod_init(void)
{
      for (;;) {
              pr_err("%llu %llu %llu\n", get_jiffies_64(),
                     ktime_get_coarse_boottime(), local_clock());
              msleep(200);
      }
      return 0;
}

[    0.308166] wireguard: 4294892358 270211253 308165785
[    0.512931] wireguard: 4294892410 270211253 512926052
[    0.720930] wireguard: 4294892462 270211253 720925707
[    0.928995] wireguard: 4294892514 270211253 928990019
[    1.137011] wireguard: 4294892566 270211253 1137005585
[    1.344948] wireguard: 4294892618 1270211253 1344943311
[    1.552942] wireguard: 4294892670 1270211253 1552937363
[    1.760963] wireguard: 4294892722 1270211253 1760957241
[    1.969092] wireguard: 4294892774 1270211253 1969087174
[    2.176951] wireguard: 4294892826 1270211253 2176946130
[    2.385073] wireguard: 4294892878 2270211253 2385068075
[    2.593033] wireguard: 4294892930 2270211253 2593025847
[    2.801071] wireguard: 4294892982 2270211253 2801066369
[    3.008895] wireguard: 4294893034 2270211253 3008892169
[    3.216956] wireguard: 4294893086 2270211253 3216951160
[    3.424965] wireguard: 4294893138 3270211253 3424960158
[    3.632942] wireguard: 4294893190 3270211253 3632937133
[    3.840947] wireguard: 4294893242 3270211253 3840942470
[    4.048950] wireguard: 4294893294 3270211253 4048945319
[    4.257078] wireguard: 4294893346 3270211253 4257072850
[    4.464943] wireguard: 4294893398 4270211253 4464938724
[    4.672934] wireguard: 4294893450 4270211253 4672929572
[    4.880939] wireguard: 4294893502 4270211253 4880933901
[    5.089015] wireguard: 4294893554 4270211253 5089010190
[    5.297072] wireguard: 4294893606 4270211253 5297067556
[    5.504935] wireguard: 4294893658 5270211253 5504929920
[    5.712963] wireguard: 4294893710 5270211253 5712957452
[    5.920949] wireguard: 4294893762 5270211253 5920944292
[    6.128959] wireguard: 4294893814 5270211253 6128954684
[    6.336968] wireguard: 4294893866 6270211253 6336962979
[    6.544945] wireguard: 4294893918 6270211253 6544931514
[    6.752963] wireguard: 4294893970 6270211253 6752958001
[    6.961008] wireguard: 4294894022 6270211253 6961003309
[    7.169073] wireguard: 4294894074 6270211253 7169068202
[    7.376936] wireguard: 4294894126 7270211253 7376931429
[    7.585022] wireguard: 4294894178 7270211253 7585016996
[    7.793004] wireguard: 4294894230 7270211253 7792999324
[    8.000953] wireguard: 4294894282 7270211253 8000947668
[    8.208949] wireguard: 4294894334 7270211253 8208943630
[    8.417090] wireguard: 4294894386 8270211253 8417085802
[    8.625071] wireguard: 4294894438 8270211253 8625066012
[    8.833086] wireguard: 4294894490 8270211253 8833081441
[    9.041032] wireguard: 4294894542 8270211253 9041027362
[    9.248948] wireguard: 4294894594 8270211253 9248943016
[    9.456968] wireguard: 4294894646 9270211253 9456963019
[    9.664955] wireguard: 4294894698 9270211253 9664950178
[    9.872935] wireguard: 4294894750 9270211253 9872929704

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-13 16:34                 ` Jason A. Donenfeld
@ 2019-06-13 16:41                   ` Jason A. Donenfeld
  2019-06-13 19:53                   ` Thomas Gleixner
  1 sibling, 0 replies; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-13 16:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Arnd Bergmann, Peter Zijlstra, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

Or in case that's not clear enough:

int __init mod_init(void)
{
  u64 j1 = 0, j2, k1 = 0, k2, l1 = 0, l2;
  for (;;) {
    j2 = jiffies64_to_nsecs(get_jiffies_64());
    k2 = ktime_get_coarse_boottime();
    l2 = local_clock();
    pr_err("%llu %llu %llu\n", j2 - j1, k2 - k1, l2 - l1);
    j1 = j2;
    k1 = k2;
    l1 = l2;
    msleep(200);
  }
  return 0;
}

[    0.289469] wireguard: 17179569424000000 18446744073155127569 289468886
[    0.495002] wireguard: 208000000 1000000000 205528400
[    0.702857] wireguard: 208000000 0 207855129
[    0.910836] wireguard: 208000000 0 207979087
[    1.118871] wireguard: 208000000 0 208034819
[    1.326964] wireguard: 208000000 0 208093122
[    1.534941] wireguard: 208000000 1000000000 207976614
[    1.742868] wireguard: 208000000 0 207927167
[    1.950877] wireguard: 208000000 0 208009538
[    2.158865] wireguard: 208000000 0 207987922
[    2.367012] wireguard: 208000000 0 208146016
[    2.574931] wireguard: 208000000 1000000000 207918184
[    2.782871] wireguard: 208000000 0 207939645
[    2.991017] wireguard: 208000000 0 208148426
[    3.198943] wireguard: 208000000 0 207925281
[    3.406959] wireguard: 208000000 0 208014262
[    3.614892] wireguard: 208000000 1000000000 207934524
[    3.822869] wireguard: 208000000 0 207976970
[    4.030994] wireguard: 208000000 0 208125835
[    4.238792] wireguard: 208000000 0 207800186
[    4.446853] wireguard: 208000000 0 208056350
[    4.654978] wireguard: 208000000 1000000000 208126753
[    4.862801] wireguard: 208000000 0 207822955
[    5.070999] wireguard: 208000000 0 208198517
[    5.278818] wireguard: 208000000 0 207820902
[    5.486879] wireguard: 208000000 1000000000 208059285
[    5.694959] wireguard: 208000000 0 208066514
[    5.903011] wireguard: 208000000 0 208065258
[    6.111015] wireguard: 208000000 0 208004056
[    6.318856] wireguard: 208000000 0 207841271

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-13 16:34                 ` Jason A. Donenfeld
  2019-06-13 16:41                   ` Jason A. Donenfeld
@ 2019-06-13 19:53                   ` Thomas Gleixner
  2019-06-14  9:14                     ` Jason A. Donenfeld
  2019-06-14  9:55                     ` [tip:timers/urgent] timekeeping: Repair ktime_get_coarse*() granularity tip-bot for Thomas Gleixner
  1 sibling, 2 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-06-13 19:53 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Arnd Bergmann, Peter Zijlstra, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

On Thu, 13 Jun 2019, Jason A. Donenfeld wrote:

> On Thu, Jun 13, 2019 at 6:26 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > That does not make sense. The coarse time getters use
> > tk->tkr_mono.base. base is updated every tick (or if the machine is
> > completely idle right when the first CPU wakes up again).
> 
> Sense or not, it seems to be happening, at least on 5.2-rc4:

Bah. Seems I had paged out all the subtle parts of timekeeping and answered
from my blurred memory while traveling. Stared at it for a while and of
course base is only updated every second. The nsec part uses the
accumulated nsecs (< 1sec) plus the time delta read from the hardware. So
yes, the ktime_get_coarse* stuff has been broken from day one.

Fix below.

Thanks,

	tglx

8<------------------
Subject: timekeeping: Repair ktime_get_coarse*() granularity
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 13 Jun 2019 21:40:45 +0200

Jason reported that the coarse ktime based time getters advance only once
per second and not once per tick as advertised.

The code reads only the monotonic base time, which advances once per
second. The nanoseconds are accumulated on every tick in xtime_nsec up to
a second and the regular time getters take this nanoseconds offset into
account, but the ktime_get_coarse*() implementation fails to do so.

Add the accumulated xtime_nsec value to the monotonic base time to get the
proper per tick advancing coarse time.

Fixes: b9ff604cff11 ("timekeeping: Add ktime_get_coarse_with_offset")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 kernel/time/timekeeping.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -810,17 +810,18 @@ ktime_t ktime_get_coarse_with_offset(enu
 	struct timekeeper *tk = &tk_core.timekeeper;
 	unsigned int seq;
 	ktime_t base, *offset = offsets[offs];
+	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 		base = ktime_add(tk->tkr_mono.base, *offset);
+		nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
 
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
-	return base;
-
+	return base + nsecs;
 }
 EXPORT_SYMBOL_GPL(ktime_get_coarse_with_offset);
 

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-13 19:53                   ` Thomas Gleixner
@ 2019-06-14  9:14                     ` Jason A. Donenfeld
  2019-06-14  9:44                       ` Thomas Gleixner
  2019-06-14  9:48                       ` [PATCH] timekeeping: add get_jiffies_boot_64() for jiffies including sleep Jason A. Donenfeld
  2019-06-14  9:55                     ` [tip:timers/urgent] timekeeping: Repair ktime_get_coarse*() granularity tip-bot for Thomas Gleixner
  1 sibling, 2 replies; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-14  9:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Arnd Bergmann, Peter Zijlstra, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

Hey Thomas,

> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
>         } while (read_seqcount_retry(&tk_core.seq, seq));
>
> -       return base;
> -
> +       return base + nsecs;

The rest of the file seems to use `ktime_add_ns(base, nsecs)`. I
realize, of course, that these days that macro is the same thing as
what you wrote, though.

I can confirm that this fixes it (see below), so you can add my
Tested-by if you want or care.

One thing I'm curious about is the performance comparison with various
ways of using jiffies directly:

ktime_mono_to_any(ns_to_ktime(jiffies64_to_nsecs(get_jiffies_64())), TK_OFFS_BOOT)

Or really giving up on the locking:

ktime_to_ns(tk_core.timekeeper.offs_boot) + jiffies64_to_nsecs(get_jiffies_64())

Or keeping things in units of jiffies, though that incurs a div_u64:

nsecs_to_jiffies64(ktime_to_ns(tk_core.timekeeper.offs_boot)) + get_jiffies_64()

But since offs_boot is updated somewhat rarely, that div_u64 could be
precomputed each time offs_boot is updated, allowing hypothetically:

tk_core.timekeeper.offs_boot_jiffies + get_jiffies_64()

Which then could be remade into a wrapper such as:

get_jiffies_boot_64()

The speed is indeed an important factor to me in accessing this time
value. Are any of these remotely interesting to you in that light?
Maybe I'll send a patch for the latter.

Jason

8<-----------------

int __init mod_init(void)
{
  u64 j1 = 0, j2, k1 = 0, k2, c1 = 0, c2, l1 = 0, l2;
  for (;;) {
    j2 = jiffies64_to_nsecs(get_jiffies_64());
    k2 = ktime_to_ns(ktime_mono_to_any(ns_to_ktime(jiffies64_to_nsecs(get_jiffies_64())),
                                       TK_OFFS_BOOT));
    c2 = ktime_get_coarse_boottime();
    l2 = local_clock();
    pr_err("%llu %llu %llu %llu\n", j2 - j1, k2 - k1, c2 - c1, l2 - l1);
    j1 = j2;
    k1 = k2;
    c1 = c2;
    l1 = l2;
    msleep(200);
  }
  return 0;
}

[    0.420861] wireguard: 17179569472000000 17179569472000000 312696682 420860781
[    0.628656] wireguard: 208000000 208000000 208000000 207791083
[    0.836591] wireguard: 208000000 208000000 208000000 207934734
[    1.044728] wireguard: 208000000 208000000 208000000 208137167
[    1.252593] wireguard: 208000000 208000000 208000000 207862974
[    1.460815] wireguard: 208000000 208000000 208000000 208223514
[    1.668667] wireguard: 208000000 208000000 208000000 207852437
[    1.876438] wireguard: 208000000 208000000 208000000 207773658
[    2.084627] wireguard: 208000000 208000000 208000000 208185643
[    2.292690] wireguard: 208000000 208000000 208000000 208063377
[    2.500672] wireguard: 208000000 208000000 208000000 207982209
[    2.708658] wireguard: 208000000 208000000 208000000 207986527
[    2.916686] wireguard: 208000000 208000000 208000000 208026945
[    3.124732] wireguard: 208000000 208000000 208000000 208046153
[    3.332684] wireguard: 208000000 208000000 208000000 207952302
[    3.540668] wireguard: 208000000 208000000 208000000 207978195
[    3.748633] wireguard: 208000000 208000000 208000000 207970981
[    3.956686] wireguard: 208000000 208000000 208000000 208053094
[    4.164690] wireguard: 208000000 208000000 208000000 207995376
[    4.372660] wireguard: 208000000 208000000 208000000 207978324
[    4.580787] wireguard: 208000000 208000000 208000000 208126491
[    4.788716] wireguard: 208000000 208000000 208000000 207930106
[    4.996685] wireguard: 208000000 208000000 208000000 207968555
[    5.204673] wireguard: 208000000 208000000 208000000 207988295
[    5.412676] wireguard: 208000000 208000000 208000000 207991396
[    5.620648] wireguard: 208000000 208000000 208000000 207983671
[    5.828822] wireguard: 208000000 208000000 208000000 208174230
[    6.036596] wireguard: 208000000 208000000 208000000 207773401
[    6.244615] wireguard: 208000000 208000000 208000000 208017986
[    6.452625] wireguard: 208000000 208000000 208000000 208011215
[    6.660678] wireguard: 208000000 208000000 208000000 208053134
[    6.868609] wireguard: 208000000 208000000 208000000 207929536

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-14  9:14                     ` Jason A. Donenfeld
@ 2019-06-14  9:44                       ` Thomas Gleixner
  2019-06-14  9:56                         ` Jason A. Donenfeld
  2019-06-14  9:48                       ` [PATCH] timekeeping: add get_jiffies_boot_64() for jiffies including sleep Jason A. Donenfeld
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2019-06-14  9:44 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Arnd Bergmann, Peter Zijlstra, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

Jason,

On Fri, 14 Jun 2019, Jason A. Donenfeld wrote:

> Hey Thomas,
> 
> > --- a/kernel/time/timekeeping.c
> > +++ b/kernel/time/timekeeping.c
> >         } while (read_seqcount_retry(&tk_core.seq, seq));
> >
> > -       return base;
> > -
> > +       return base + nsecs;
> 
> The rest of the file seems to use `ktime_add_ns(base, nsecs)`. I
> realize, of course, that these days that macro is the same thing as
> what you wrote, though.

Yeah, historical raisins when ktime_t was special on 32bit.

> One thing I'm curious about is the performance comparison with various
> ways of using jiffies directly:
> 
> ktime_mono_to_any(ns_to_ktime(jiffies64_to_nsecs(get_jiffies_64())),
> TK_OFFS_BOOT)
> 
> Or really giving up on the locking:
> 
> ktime_to_ns(tk_core.timekeeper.offs_boot) + jiffies64_to_nsecs(get_jiffies_64())
> 
> Or keeping things in units of jiffies, though that incurs a div_u64:
> 
> nsecs_to_jiffies64(ktime_to_ns(tk_core.timekeeper.offs_boot)) + get_jiffies_64()

jiffies64 uses a seqcount on 32bit as well.
 
> But since offs_boot is updated somewhat rarely, that div_u64 could be
> precomputed each time offs_boot is updated, allowing hypothetically:
> 
> tk_core.timekeeper.offs_boot_jiffies + get_jiffies_64()

Hrm, I'm not a great fan of these shortcuts which cut corners based on
'somewhat rarely, so it should not matter'. Should not matter always
strikes back at some point. :)

> Which then could be remade into a wrapper such as:
> 
> get_jiffies_boot_64()
> 
> The speed is indeed an important factor to me in accessing this time
> value. Are any of these remotely interesting to you in that light?
> Maybe I'll send a patch for the latter.

So what you are looking for is jiffies based on boot time?

Thanks,

	tglx

* [PATCH] timekeeping: add get_jiffies_boot_64() for jiffies including sleep
  2019-06-14  9:14                     ` Jason A. Donenfeld
  2019-06-14  9:44                       ` Thomas Gleixner
@ 2019-06-14  9:48                       ` Jason A. Donenfeld
  1 sibling, 0 replies; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-14  9:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason A. Donenfeld, Thomas Gleixner, Peter Zijlstra, Arnd Bergmann

This enables using the usual get_jiffies_64() but taking into account
time spent sleeping, giving the high performance characteristics of
querying jiffies without its drawback of ignoring suspend time.

We accomplish this by precomputing the boottime jiffies offset whenever
it is updated, rather than doing the expensive-ish div_u64 on each
query.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
---
 include/linux/timekeeper_internal.h |  2 ++
 include/linux/timekeeping.h         |  2 ++
 kernel/time/timekeeping.c           | 12 ++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 7acb953298a7..2e4c52fe0250 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -51,6 +51,7 @@ struct tk_read_base {
  * @wall_to_monotonic:	CLOCK_REALTIME to CLOCK_MONOTONIC offset
  * @offs_real:		Offset clock monotonic -> clock realtime
  * @offs_boot:		Offset clock monotonic -> clock boottime
+ * @offs_boot_jiffies64:	Offset clock monotonic -> clock boottime in jiffies64
  * @offs_tai:		Offset clock monotonic -> clock tai
  * @tai_offset:		The current UTC to TAI offset in seconds
  * @clock_was_set_seq:	The sequence number of clock was set events
@@ -93,6 +94,7 @@ struct timekeeper {
 	struct timespec64	wall_to_monotonic;
 	ktime_t			offs_real;
 	ktime_t			offs_boot;
+	u64			offs_boot_jiffies64;
 	ktime_t			offs_tai;
 	s32			tai_offset;
 	unsigned int		clock_was_set_seq;
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index a8ab0f143ac4..511803c9cae2 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -151,6 +151,8 @@ extern u64 ktime_get_raw_fast_ns(void);
 extern u64 ktime_get_boot_fast_ns(void);
 extern u64 ktime_get_real_fast_ns(void);
 
+extern u64 get_jiffies_boot_64(void);
+
 /*
  * timespec64/time64_t interfaces utilizing the ktime based ones
  * for API completeness, these could be implemented more efficiently
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 85f5912d8f70..83c675bcd88a 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -146,6 +146,7 @@ static void tk_set_wall_to_mono(struct timekeeper *tk, struct timespec64 wtm)
 static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
 {
 	tk->offs_boot = ktime_add(tk->offs_boot, delta);
+	tk->offs_boot_jiffies64 = nsecs_to_jiffies64(ktime_to_ns(tk->offs_boot));
 }
 
 /*
@@ -539,6 +540,17 @@ u64 ktime_get_real_fast_ns(void)
 }
 EXPORT_SYMBOL_GPL(ktime_get_real_fast_ns);
 
+/**
+ * get_jiffies_boot_64 - The normal get_jiffies_64(), but taking into
+ * account the time spent sleeping. This does not do any sort of locking
+ * on the time spent sleeping.
+ */
+u64 get_jiffies_boot_64(void)
+{
+	return get_jiffies_64() + tk_core.timekeeper.offs_boot_jiffies64;
+}
+EXPORT_SYMBOL(get_jiffies_boot_64);
+
 /**
  * halt_fast_timekeeper - Prevent fast timekeeper from accessing clocksource.
  * @tk: Timekeeper to snapshot.
-- 
2.21.0


* [tip:timers/urgent] timekeeping: Repair ktime_get_coarse*() granularity
  2019-06-13 19:53                   ` Thomas Gleixner
  2019-06-14  9:14                     ` Jason A. Donenfeld
@ 2019-06-14  9:55                     ` tip-bot for Thomas Gleixner
  2019-06-14 11:18                       ` Arnd Bergmann
  1 sibling, 1 reply; 26+ messages in thread
From: tip-bot for Thomas Gleixner @ 2019-06-14  9:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: sultan, longman, linux-kernel, arnd, clemens, tglx, hpa, Jason,
	mingo, peterz

Commit-ID:  e3ff9c3678b4d80e22d2557b68726174578eaf52
Gitweb:     https://git.kernel.org/tip/e3ff9c3678b4d80e22d2557b68726174578eaf52
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 13 Jun 2019 21:40:45 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 14 Jun 2019 11:51:44 +0200

timekeeping: Repair ktime_get_coarse*() granularity

Jason reported that the coarse ktime based time getters advance only once
per second and not once per tick as advertised.

The code reads only the monotonic base time, which advances once per
second. The nanoseconds are accumulated on every tick in xtime_nsec up to
a second and the regular time getters take this nanoseconds offset into
account, but the ktime_get_coarse*() implementation fails to do so.

Add the accumulated xtime_nsec value to the monotonic base time to get the
proper per tick advancing coarse time.

Fixes: b9ff604cff11 ("timekeeping: Add ktime_get_coarse_with_offset")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906132136280.1791@nanos.tec.linutronix.de

---
 kernel/time/timekeeping.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 85f5912d8f70..44b726bab4bd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -808,17 +808,18 @@ ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 	struct timekeeper *tk = &tk_core.timekeeper;
 	unsigned int seq;
 	ktime_t base, *offset = offsets[offs];
+	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 		base = ktime_add(tk->tkr_mono.base, *offset);
+		nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
 
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
-	return base;
-
+	return base + nsecs;
 }
 EXPORT_SYMBOL_GPL(ktime_get_coarse_with_offset);
 

* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-14  9:44                       ` Thomas Gleixner
@ 2019-06-14  9:56                         ` Jason A. Donenfeld
  0 siblings, 0 replies; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-14  9:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Arnd Bergmann, Peter Zijlstra, LKML, Clemens Ladisch,
	Sultan Alsawaf, Waiman Long, X86 ML

Hi Thomas,

On Fri, Jun 14, 2019 at 11:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> jiffies64 uses a seqcount on 32bit as well.

Right, but not on 64-bit, which is some sort of improvement I suppose.

> Hrm, I'm not a great fan of these shortcuts which cut corners based on
> 'somewhat rarely, so it should not matter'. Should not matter always
> strikes back at some point. :)

What's the actual rare event you have in mind? That on 32-bits, only
half of the value is updated while resuming from sleep? But isn't
there already a race in that case? Namely, either the boot offset is
never queried until the system has fully resumed (safe), or there's
actually some overlap, where the boot offset is queried on resumption
before it's updated. In the former case, a race isn't possible because
of other locking. In the latter case, there's already a race
contingent on the same instructions executing that's way worse. So I
wonder if this is in the category of "hasn't ever mattered" rather
than "should not matter".

> So what you are looking for is jiffies based on boot time?

Right. Just sent a patch.

Jason

* Re: [tip:timers/urgent] timekeeping: Repair ktime_get_coarse*() granularity
  2019-06-14  9:55                     ` [tip:timers/urgent] timekeeping: Repair ktime_get_coarse*() granularity tip-bot for Thomas Gleixner
@ 2019-06-14 11:18                       ` Arnd Bergmann
  0 siblings, 0 replies; 26+ messages in thread
From: Arnd Bergmann @ 2019-06-14 11:18 UTC (permalink / raw)
  To: Jason A. Donenfeld, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Arnd Bergmann, Clemens Ladisch, Waiman Long,
	Sultan Alsawaf, Linux Kernel Mailing List
  Cc: linux-tip-commits

On Fri, Jun 14, 2019 at 11:55 AM tip-bot for Thomas Gleixner
<tipbot@zytor.com> wrote:
>
> Commit-ID:  e3ff9c3678b4d80e22d2557b68726174578eaf52
> Gitweb:     https://git.kernel.org/tip/e3ff9c3678b4d80e22d2557b68726174578eaf52
> Author:     Thomas Gleixner <tglx@linutronix.de>
> AuthorDate: Thu, 13 Jun 2019 21:40:45 +0200
> Committer:  Thomas Gleixner <tglx@linutronix.de>
> CommitDate: Fri, 14 Jun 2019 11:51:44 +0200
>
> timekeeping: Repair ktime_get_coarse*() granularity
>
> Jason reported that the coarse ktime based time getters advance only once
> per second and not once per tick as advertised.
>
> The code reads only the monotonic base time, which advances once per
> second. The nanoseconds are accumulated on every tick in xtime_nsec up to
> a second and the regular time getters take this nanoseconds offset into
> account, but the ktime_get_coarse*() implementation fails to do so.
>
> Add the accumulated xtime_nsec value to the monotonic base time to get the
> proper per-tick advancing coarse time.

Acked-by: Arnd Bergmann <arnd@arndb.de> (too late, I know)

I checked what code is actually affected. Fortunately there are only
a very small number of callers that I could find:

arch/powerpc/xmon/xmon.c:        ktime_get_coarse_boottime_ts64(&uptime);
drivers/iio/industrialio-core.c:        return ktime_to_ns(ktime_get_coarse_real());
drivers/power/supply/ab8500_fg.c:    time64_t now = ktime_get_boottime_seconds();
drivers/net/wireless/intel/ipw2x00/ipw2100.c:    time64_t now = ktime_get_boottime_seconds();

Only two of these even read the sub-second portion, and the
iio file only calls this function if CLOCK_REALTIME_COARSE
was passed from user space.

I don't think any harm was done so far by this bug, so it's not surprising
we never noticed.

         Arnd


* Re: infinite loop in read_hpet from ktime_get_boot_fast_ns
  2019-06-12 12:28       ` Peter Zijlstra
                           ` (2 preceding siblings ...)
  2019-06-12 19:46         ` Arnd Bergmann
@ 2019-06-18 17:34         ` Jason A. Donenfeld
  3 siblings, 0 replies; 26+ messages in thread
From: Jason A. Donenfeld @ 2019-06-18 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Clemens Ladisch, Sultan Alsawaf,
	Waiman Long, X86 ML

Hi Peter,

On Wed, Jun 12, 2019 at 2:29 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > and the comment at the top mentions explicit sleep hooks. I wasn't
> > sure which function to use from here, though.
>
> Either local_clock() or cpu_clock(cpu). The sleep hooks are not
> something the consumer has to worry about.

I'm not sure whether this is another bug or just a misunderstanding of
what we meant by sleep, but I observed the following behavior with
local_clock(). Notably, local_clock() isn't advanced during sleep the
way the other mechanisms are.

Regards,
Jason

[50865.699714] wireguard: local_clock: 1, jiffies: 1, jiffies_monotoany: 1, ktime_boot: 1, ktime_boot_coarse: 1
[50866.723923] wireguard: local_clock: 2, jiffies: 2, jiffies_monotoany: 2, ktime_boot: 2, ktime_boot_coarse: 2
[50867.747896] wireguard: local_clock: 3, jiffies: 3, jiffies_monotoany: 3, ktime_boot: 3, ktime_boot_coarse: 3
[50868.772297] wireguard: local_clock: 4, jiffies: 4, jiffies_monotoany: 4, ktime_boot: 4, ktime_boot_coarse: 4
[50869.796419] wireguard: local_clock: 5, jiffies: 5, jiffies_monotoany: 5, ktime_boot: 5, ktime_boot_coarse: 5
[50870.820719] wireguard: local_clock: 6, jiffies: 6, jiffies_monotoany: 6, ktime_boot: 6, ktime_boot_coarse: 6
[50871.834768] wireguard: local_clock: 7, jiffies: 7, jiffies_monotoany: 10, ktime_boot: 10, ktime_boot_coarse: 10
[50869.918760] PM: suspend entry (deep)
[50872.846134] PM: suspend exit
[50872.874955] wireguard: local_clock: 8, jiffies: 8, jiffies_monotoany: 11, ktime_boot: 11, ktime_boot_coarse: 11
[50873.899142] wireguard: local_clock: 9, jiffies: 9, jiffies_monotoany: 12, ktime_boot: 12, ktime_boot_coarse: 12
[50874.923368] wireguard: local_clock: 10, jiffies: 10, jiffies_monotoany: 13, ktime_boot: 13, ktime_boot_coarse: 13
[50875.947641] wireguard: local_clock: 11, jiffies: 11, jiffies_monotoany: 14, ktime_boot: 14, ktime_boot_coarse: 14
[50876.971833] wireguard: local_clock: 12, jiffies: 12, jiffies_monotoany: 15, ktime_boot: 15, ktime_boot_coarse: 15
[50877.995969] wireguard: local_clock: 13, jiffies: 13, jiffies_monotoany: 16, ktime_boot: 16, ktime_boot_coarse: 16
[50879.020220] wireguard: local_clock: 14, jiffies: 14, jiffies_monotoany: 17, ktime_boot: 17, ktime_boot_coarse: 17
[50880.044395] wireguard: local_clock: 15, jiffies: 15, jiffies_monotoany: 18, ktime_boot: 18, ktime_boot_coarse: 18

static void ugh(struct work_struct *w)
{
	int i;
	/* Baseline readings of each clock at the start of the run. */
	u64 start_localclock = local_clock(), end_localclock;
	u64 start_jiffies = jiffies64_to_nsecs(get_jiffies_64()), end_jiffies;
	u64 start_jiffiesmonotoany = ktime_to_ns(ktime_mono_to_any(
		ns_to_ktime(jiffies64_to_nsecs(get_jiffies_64())), TK_OFFS_BOOT));
	u64 end_jiffiesmonotoany;
	u64 start_ktimeboot = ktime_get_boot_ns(), end_ktimeboot;
	u64 start_ktimebootcoarse = ktime_to_ns(ktime_get_coarse_boottime());
	u64 end_ktimebootcoarse;

	/* Once per second, print how far each clock has advanced. */
	for (i = 0; i < 15; ++i) {
		msleep(1000);
		end_localclock = local_clock();
		end_jiffies = jiffies64_to_nsecs(get_jiffies_64());
		end_jiffiesmonotoany = ktime_to_ns(ktime_mono_to_any(
			ns_to_ktime(jiffies64_to_nsecs(get_jiffies_64())), TK_OFFS_BOOT));
		end_ktimeboot = ktime_get_boot_ns();
		end_ktimebootcoarse = ktime_to_ns(ktime_get_coarse_boottime());
		pr_err("local_clock: %llu, jiffies: %llu, jiffies_monotoany: %llu, ktime_boot: %llu, ktime_boot_coarse: %llu\n",
		       (end_localclock - start_localclock) / NSEC_PER_SEC,
		       (end_jiffies - start_jiffies) / NSEC_PER_SEC,
		       (end_jiffiesmonotoany - start_jiffiesmonotoany) / NSEC_PER_SEC,
		       (end_ktimeboot - start_ktimeboot) / NSEC_PER_SEC,
		       (end_ktimebootcoarse - start_ktimebootcoarse) / NSEC_PER_SEC);
	}
}

static DECLARE_WORK(blah, ugh);

static int __init mod_init(void)
{
  schedule_work(&blah);
  return 0;
}


end of thread, other threads:[~2019-06-18 17:34 UTC | newest]

Thread overview: 26+ messages
-- links below jump to the message on this page --
2019-06-07 14:14 infinite loop in read_hpet from ktime_get_boot_fast_ns Jason A. Donenfeld
2019-06-11 21:09 ` Thomas Gleixner
2019-06-11 21:40   ` Waiman Long
2019-06-12  9:02   ` Peter Zijlstra
2019-06-12  9:44     ` Jason A. Donenfeld
2019-06-12 12:28       ` Peter Zijlstra
2019-06-12 12:31         ` Peter Zijlstra
2019-06-12 12:58         ` Jason A. Donenfeld
2019-06-12 15:27           ` Peter Zijlstra
2019-06-12 19:46         ` Arnd Bergmann
2019-06-18 17:34         ` Jason A. Donenfeld
2019-06-12 14:01       ` Arnd Bergmann
2019-06-13 15:18         ` Jason A. Donenfeld
2019-06-13 15:40           ` Arnd Bergmann
2019-06-13 16:17             ` Jason A. Donenfeld
2019-06-13 16:26               ` Thomas Gleixner
2019-06-13 16:34                 ` Jason A. Donenfeld
2019-06-13 16:41                   ` Jason A. Donenfeld
2019-06-13 19:53                   ` Thomas Gleixner
2019-06-14  9:14                     ` Jason A. Donenfeld
2019-06-14  9:44                       ` Thomas Gleixner
2019-06-14  9:56                         ` Jason A. Donenfeld
2019-06-14  9:48                       ` [PATCH] timekeeping: add get_jiffies_boot_64() for jiffies including sleep Jason A. Donenfeld
2019-06-14  9:55                     ` [tip:timers/urgent] timekeeping: Repair ktime_get_coarse*() granularity tip-bot for Thomas Gleixner
2019-06-14 11:18                       ` Arnd Bergmann
2019-06-12  9:29   ` infinite loop in read_hpet from ktime_get_boot_fast_ns Peter Zijlstra
