All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] psi: avoid divide-by-zero crash inside virtual machines
@ 2019-02-14 19:31 Johannes Weiner
  2019-02-14 19:58 ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Johannes Weiner @ 2019-02-14 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Łukasz Siudut, linux-kernel, kernel-team

We've been seeing hard-to-trigger psi crashes when running inside VM
instances:

[ 1255.840325] divide error: 0000 [#1] SMP PTI
[ 1255.840325] Modules linked in: [...]
[ 1255.840325] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
[ 1255.840325] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 1255.840325] Workqueue: events psi_clock
[ 1255.840325] RIP: 0010:psi_update_stats+0x270/0x490
[ 1255.840325] RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
[ 1255.840325] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
[ 1255.840325] RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
[ 1255.840325] RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
[ 1255.840325] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1255.840325] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
[ 1255.840325] FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[ 1255.840325] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1255.840325] CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
[ 1255.840325] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1255.840325] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1255.840325] Call Trace:
[ 1255.840325]  psi_clock+0x12/0x50
[ 1255.840325]  process_one_work+0x1e0/0x390
[ 1255.840325]  worker_thread+0x2b/0x3c0
[ 1255.840325]  ? rescuer_thread+0x330/0x330
[ 1255.840325]  kthread+0x113/0x130
[ 1255.840325]  ? kthread_create_worker_on_cpu+0x40/0x40
[ 1255.840325]  ? SyS_exit_group+0x10/0x10
[ 1255.840325]  ret_from_fork+0x35/0x40
[ 1255.840325] Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48

The Code-line points to `period` being 0 inside update_stats(), and we
divide by that when calculating that period's pressure percentage.

The elapsed period should never be 0. The reason this can happen is
due to an off-by-one in the idle time / missing period calculation
combined with a coarse sched_clock() in the virtual machine.

The target time for aggregation is advanced into the future on a fixed
grid to prevent clock drift. So when an aggregation runs after some
idle period, we can not just set it to "now + psi_period", but have to
calculate the downtime and advance the target time relative to itself.

However, if the aggregator was disabled exactly one psi_period (ns),
we drop one idle period in the calculation due to a > when we should
do >=. In that case, next_update will be advanced from 'now -
psi_period' to 'now' when it should be moved to 'now + psi_period'.
The run finishes with last_update == next_update == sched_clock().

With hardware clocks, this exact nanosecond match isn't likely in the
first place; but if it does happen, the clock will still have moved on
and the period non-zero by the time the worker runs. A pointlessly
short period, but besides the extra work, no harm no foul. However, a
slow sched_clock() like we have on VMs might not have advanced either
by the time the worker runs again. And when we calculate the elapsed
period, the result, our pressure divisor, will be 0. Ouch.

Fix this by correctly handling the situation when the elapsed time
between aggregation runs is precisely two periods, and advance the
expiration timestamp correctly to period into the future.

Reported-by: Łukasz Siudut <lsiudut@fb.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index c3484785b179..0e97ca9306ef 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -322,7 +322,7 @@ static bool update_stats(struct psi_group *group)
 	expires = group->next_update;
 	if (now < expires)
 		goto out;
-	if (now - expires > psi_period)
+	if (now - expires >= psi_period)
 		missed_periods = div_u64(now - expires, psi_period);
 
 	/*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] psi: avoid divide-by-zero crash inside virtual machines
  2019-02-14 19:31 [PATCH] psi: avoid divide-by-zero crash inside virtual machines Johannes Weiner
@ 2019-02-14 19:58 ` Andrew Morton
  2019-02-14 20:53   ` Johannes Weiner
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2019-02-14 19:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Łukasz Siudut, linux-kernel, kernel-team

On Thu, 14 Feb 2019 14:31:57 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -322,7 +322,7 @@ static bool update_stats(struct psi_group *group)
>  	expires = group->next_update;
>  	if (now < expires)
>  		goto out;
> -	if (now - expires > psi_period)
> +	if (now - expires >= psi_period)
>  		missed_periods = div_u64(now - expires, psi_period);
>  
>  	/*

It seems appropriate to use time_after64() and friends in this code.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] psi: avoid divide-by-zero crash inside virtual machines
  2019-02-14 19:58 ` Andrew Morton
@ 2019-02-14 20:53   ` Johannes Weiner
  2019-02-14 21:22     ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Johannes Weiner @ 2019-02-14 20:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Łukasz Siudut, linux-kernel, kernel-team

On Thu, Feb 14, 2019 at 11:58:55AM -0800, Andrew Morton wrote:
> On Thu, 14 Feb 2019 14:31:57 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -322,7 +322,7 @@ static bool update_stats(struct psi_group *group)
> >  	expires = group->next_update;
> >  	if (now < expires)
> >  		goto out;
> > -	if (now - expires > psi_period)
> > +	if (now - expires >= psi_period)
> >  		missed_periods = div_u64(now - expires, psi_period);
> >  
> >  	/*
> 
> It seems appropriate to use time_after64() and friends in this code.

These timestamps are all sourced from sched_clock(), which is defined
to be monotonic and never wrap in practice. From the "sched_clock()"
section in Documentation/timers/timekeeping.txt:

	"This function shall return the number of nanoseconds since
	 the system was started."

	"The sched_clock() function may wrap only on unsigned long
	 long boundaries, i.e. after 64 bits. Since this is a
	 nanosecond value this will mean it wraps after circa 585
	 years. (For most practical systems this means "never".)"

As far as readability goes, I have to say I find the naked comparisons
a bit easier to understand (and I'm glad we can use those here since
the code is already complicated):

	if (now < expires)

vs.

	if (time_before64(now, expires))

These macros always have me double check the argument order.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] psi: avoid divide-by-zero crash inside virtual machines
  2019-02-14 20:53   ` Johannes Weiner
@ 2019-02-14 21:22     ` Andrew Morton
  0 siblings, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2019-02-14 21:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Łukasz Siudut, linux-kernel, kernel-team

On Thu, 14 Feb 2019 15:53:43 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> 
> 	if (now < expires)
> 
> vs.
> 
> 	if (time_before64(now, expires))
> 
> These macros always have me double check the argument order.

Yeah, me too.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-02-14 21:22 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-14 19:31 [PATCH] psi: avoid divide-by-zero crash inside virtual machines Johannes Weiner
2019-02-14 19:58 ` Andrew Morton
2019-02-14 20:53   ` Johannes Weiner
2019-02-14 21:22     ` Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.