linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] psi:fix divide by zero in psi_update_stats
@ 2019-11-08  7:33 tim
  2019-11-08  9:31 ` Peter Zijlstra
  2019-11-12 15:41 ` Johannes Weiner
  0 siblings, 2 replies; 11+ messages in thread
From: tim @ 2019-11-08  7:33 UTC (permalink / raw)
  To: Johannes Weiner, Ingo Molnar, Peter Zijlstra; +Cc: linux-kernel

In psi_update_stats, it is possible that period has value like
0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
truncates u64 period to u32, results in zero divisor.
Use div64_u64() instead of div_u64()  if the divisor is u64 to avoid
truncation to 32-bit on 64-bit platforms.

Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 517e3719027e..399d6f106de5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -291,7 +291,7 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
 	}
 
 	/* Sample the most recent active period */
-	pct = div_u64(time * 100, period);
+	pct = div64_u64(time * 100, period);
 	pct *= FIXED_1;
 	avg[0] = calc_load(avg[0], EXP_10s, pct);
 	avg[1] = calc_load(avg[1], EXP_60s, pct);
-- 
2.14.4.44.g2045bb




^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-08  7:33 [PATCH] psi:fix divide by zero in psi_update_stats tim
@ 2019-11-08  9:31 ` Peter Zijlstra
  2019-11-08  9:49   ` Jingfeng Xie
  2019-11-12 15:41 ` Johannes Weiner
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2019-11-08  9:31 UTC (permalink / raw)
  To: tim; +Cc: Johannes Weiner, Ingo Molnar, linux-kernel

On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> In psi_update_stats, it is possible that period has value like
> 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which

How can this happen? Is that a valid case or should we be avoiding that?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-08  9:31 ` Peter Zijlstra
@ 2019-11-08  9:49   ` Jingfeng Xie
  2019-11-08 10:05     ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Jingfeng Xie @ 2019-11-08  9:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, Ingo Molnar, linux-kernel, Joseph Qi, Xunlei Pang

It happens multiple times on our online machines, the crash call trace is like below:
[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330
[58914.286053] RSP

With full kdump vmcore analysis,  The R8 is period in psi_update_stats which results in the zero division error.


On 2019/11/8 PM 5:31,“Peter Zijlstra”<peterz@infradead.org> wrote:

    On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
    > In psi_update_stats, it is possible that period has value like
    > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
    
    How can this happen? Is that a valid case or should we be avoiding that?
    



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-08  9:49   ` Jingfeng Xie
@ 2019-11-08 10:05     ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2019-11-08 10:05 UTC (permalink / raw)
  To: Jingfeng Xie
  Cc: Johannes Weiner, Ingo Molnar, linux-kernel, Joseph Qi, Xunlei Pang


https://people.kernel.org/tglx/notes-about-netiquette-qw89

On Fri, Nov 08, 2019 at 05:49:01PM +0800, Jingfeng Xie wrote:
> It happens multiple times on our online machines, the crash call trace is like below:
> [58914.066423] divide error: 0000 [#1] SMP
> [58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
> [58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
> [58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
> [58914.102728] Workqueue: events psi_update_work
> [58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
> [58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
> [58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
> [58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
> [58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
> [58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
> [58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
> [58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
> [58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
> [58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
> [58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [58914.200360] PKRU: 55555554
> [58914.203221] Stack:
> [58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
> [58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
> [58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [58914.228734] Call Trace:
> [58914.231337] [] psi_update_work+0x22/0x60
> [58914.237067] [] process_one_work+0x189/0x420
> [58914.243063] [] worker_thread+0x4e/0x4b0
> [58914.248701] [] ? process_one_work+0x420/0x420
> [58914.254869] [] kthread+0xe6/0x100
> [58914.259994] [] ? kthread_park+0x60/0x60
> [58914.265640] [] ret_from_fork+0x39/0x50
> [58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
> [58914.279691] RIP [] psi_update_stats+0x1c1/0x330
> [58914.286053] RSP
> 
> With full kdump vmcore analysis,  The R8 is period in psi_update_stats which results in the zero division error.

This does not answer either question I asked.

>     How can this happen? Is that a valid case or should we be avoiding that?

This does not explain how the period got so large, nor if that is a
valid/expected scenario.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-08  7:33 [PATCH] psi:fix divide by zero in psi_update_stats tim
  2019-11-08  9:31 ` Peter Zijlstra
@ 2019-11-12 15:41 ` Johannes Weiner
  2019-11-12 15:48   ` Johannes Weiner
  2019-11-12 18:33   ` Suren Baghdasaryan
  1 sibling, 2 replies; 11+ messages in thread
From: Johannes Weiner @ 2019-11-12 15:41 UTC (permalink / raw)
  To: tim; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Suren Baghdasaryan

On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> In psi_update_stats, it is possible that period has value like
> 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> truncates u64 period to u32, results in zero divisor.
> Use div64_u64() instead of div_u64()  if the divisor is u64 to avoid
> truncation to 32-bit on 64-bit platforms.
> 
> Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>

This is legit. When we stop the periodic averaging worker due to an
idle CPU, the period after restart can be much longer than the ~4 sec
in the lower 32 bits. See the missed_periods logic in update_averages.

What is surprising is that you can hit this repeatedly, as the odds
are 1 in 4,294,967,296. An extremely coarse clock source?

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

There are several more instances of div_u64 in psi.c. They all look
fine to me except for one in the psi poll() windowing code, where we
divide by the window size, which can be up to 10s. CCing Suren.

---
From 009cece5f37a38f4baeb1bebdcb432ac9ae66ef8 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 12 Nov 2019 10:35:26 -0500
Subject: [PATCH] psi: fix a division error in psi poll()

The psi window size is a u64 an can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. But div_u64 is meant
only for 32-bit divisors. Use div64_u64 for the 64-bit divisor.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 517e3719027e..84af7aa158bf 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -481,7 +481,7 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
 		u32 remaining;
 
 		remaining = win->size - elapsed;
-		growth += div_u64(win->prev_growth * remaining, win->size);
+		growth += div64_u64(win->prev_growth * remaining, win->size);
 	}
 
 	return growth;
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-12 15:41 ` Johannes Weiner
@ 2019-11-12 15:48   ` Johannes Weiner
  2019-11-12 16:08     ` Johannes Weiner
  2019-11-12 18:33   ` Suren Baghdasaryan
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Weiner @ 2019-11-12 15:48 UTC (permalink / raw)
  To: tim; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Suren Baghdasaryan

On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > In psi_update_stats, it is possible that period has value like
> > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > truncates u64 period to u32, results in zero divisor.
> > Use div64_u64() instead of div_u64()  if the divisor is u64 to avoid
> > truncation to 32-bit on 64-bit platforms.
> > 
> > Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>
> 
> This is legit. When we stop the periodic averaging worker due to an
> idle CPU, the period after restart can be much longer than the ~4 sec
> in the lower 32 bits. See the missed_periods logic in update_averages.

Argh, that's not right. Of course I notice right after hitting send.

missed_periods are subtracted out of the difference between now and
the last update, so period should be not much bigger than 2s.

Something else is going on here.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-12 15:48   ` Johannes Weiner
@ 2019-11-12 16:08     ` Johannes Weiner
  2019-11-12 17:27       ` Suren Baghdasaryan
  2019-11-29  6:37       ` Jingfeng Xie
  0 siblings, 2 replies; 11+ messages in thread
From: Johannes Weiner @ 2019-11-12 16:08 UTC (permalink / raw)
  To: tim; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Suren Baghdasaryan

On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
> On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > > In psi_update_stats, it is possible that period has value like
> > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > > truncates u64 period to u32, results in zero divisor.
> > > Use div64_u64() instead of div_u64()  if the divisor is u64 to avoid
> > > truncation to 32-bit on 64-bit platforms.
> > > 
> > > Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>
> > 
> > This is legit. When we stop the periodic averaging worker due to an
> > idle CPU, the period after restart can be much longer than the ~4 sec
> > in the lower 32 bits. See the missed_periods logic in update_averages.
> 
> Argh, that's not right. Of course I notice right after hitting send.
> 
> missed_periods are subtracted out of the difference between now and
> the last update, so period should be not much bigger than 2s.
> 
> Something else is going on here.

Tim, does this happen right after boot? I wonder if it's because we're
not initializing avg_last_update, and the initial delta between the
last update (0) and the first scheduled update (sched_clock() + 2s)
ends up bigger than 4 seconds somehow. Later on, the delta between the
last and the scheduled update should always be ~2s. But for that to
happen, it would require a pretty slow boot, or a sched_clock() that
does not start at 0.

Tim, if you have a coredump, can you extract the value of the other
variables printed in the following patch?

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84af7aa158bf..1b6836d23091 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
 	 */
 	avg_next_update = expires + ((1 + missed_periods) * psi_period);
 	period = now - (group->avg_last_update + (missed_periods * psi_period));
+
+	WARN(period >> 32, "period=%ld now=%ld expires=%ld last=%ld missed=%ld\n",
+	     period, now, expires, group->avg_last_update, missed_periods);
+
 	group->avg_last_update = now;
 
 	for (s = 0; s < NR_PSI_STATES - 1; s++) {

And we may need something like this to make the tick initialization
more robust regardless of the reported bug here:

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84af7aa158bf..ce8f6748678a 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
 
 	for_each_possible_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
-	group->avg_next_update = sched_clock() + psi_period;
+	group->avg_last_update = sched_clock();
+	group->avg_next_update = group->avg_last_update + psi_period;
 	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
 	mutex_init(&group->avgs_lock);
 	/* Init trigger-related members */

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-12 16:08     ` Johannes Weiner
@ 2019-11-12 17:27       ` Suren Baghdasaryan
  2019-11-29  6:37       ` Jingfeng Xie
  1 sibling, 0 replies; 11+ messages in thread
From: Suren Baghdasaryan @ 2019-11-12 17:27 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: tim, Ingo Molnar, Peter Zijlstra, LKML

On Tue, Nov 12, 2019 at 8:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
> > On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> > > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > > > In psi_update_stats, it is possible that period has value like
> > > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > > > truncates u64 period to u32, results in zero divisor.
> > > > Use div64_u64() instead of div_u64()  if the divisor is u64 to avoid
> > > > truncation to 32-bit on 64-bit platforms.
> > > >
> > > > Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>
> > >
> > > This is legit. When we stop the periodic averaging worker due to an
> > > idle CPU, the period after restart can be much longer than the ~4 sec
> > > in the lower 32 bits. See the missed_periods logic in update_averages.
> >
> > Argh, that's not right. Of course I notice right after hitting send.
> >
> > missed_periods are subtracted out of the difference between now and
> > the last update, so period should be not much bigger than 2s.
> >
> > Something else is going on here.
>
> Tim, does this happen right after boot? I wonder if it's because we're
> not initializing avg_last_update, and the initial delta between the
> last update (0) and the first scheduled update (sched_clock() + 2s)
> ends up bigger than 4 seconds somehow. Later on, the delta between the
> last and the scheduled update should always be ~2s. But for that to
> happen, it would require a pretty slow boot, or a sched_clock() that
> does not start at 0.
>
> Tim, if you have a coredump, can you extract the value of the other
> variables printed in the following patch?
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84af7aa158bf..1b6836d23091 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
>          */
>         avg_next_update = expires + ((1 + missed_periods) * psi_period);
>         period = now - (group->avg_last_update + (missed_periods * psi_period));
> +
> +       WARN(period >> 32, "period=%ld now=%ld expires=%ld last=%ld missed=%ld\n",
> +            period, now, expires, group->avg_last_update, missed_periods);
> +
>         group->avg_last_update = now;
>
>         for (s = 0; s < NR_PSI_STATES - 1; s++) {
>
> And we may need something like this to make the tick initialization
> more robust regardless of the reported bug here:
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84af7aa158bf..ce8f6748678a 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
>
>         for_each_possible_cpu(cpu)
>                 seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> -       group->avg_next_update = sched_clock() + psi_period;
> +       group->avg_last_update = sched_clock();
> +       group->avg_next_update = group->avg_last_update + psi_period;
>         INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
>         mutex_init(&group->avgs_lock);
>         /* Init trigger-related members */

Both fixes for group_init() and window_update() make sense to me.
window_update() division would be reproducible because win->size is
set during trigger setup and does not change afterwards. Since
userspace defines the window size in usecs this would require doing
some math and finding a value that yields zeros in 32 LSBs after
conversion into nsecs (see: t->win.size = window_us * NSEC_PER_USEC in
psi_trigger_create()). Haven't seen this issue because all my test
cases (1, 2, 10secs) had non-zero LSBs.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-12 15:41 ` Johannes Weiner
  2019-11-12 15:48   ` Johannes Weiner
@ 2019-11-12 18:33   ` Suren Baghdasaryan
  1 sibling, 0 replies; 11+ messages in thread
From: Suren Baghdasaryan @ 2019-11-12 18:33 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: tim, Ingo Molnar, Peter Zijlstra, LKML

On Tue, Nov 12, 2019 at 7:41 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > In psi_update_stats, it is possible that period has value like
> > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > truncates u64 period to u32, results in zero divisor.
> > Use div64_u64() instead of div_u64()  if the divisor is u64 to avoid
> > truncation to 32-bit on 64-bit platforms.
> >
> > Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>
>
> This is legit. When we stop the periodic averaging worker due to an
> idle CPU, the period after restart can be much longer than the ~4 sec
> in the lower 32 bits. See the missed_periods logic in update_averages.
>
> What is surprising is that you can hit this repeatedly, as the odds
> are 1 in 4,294,967,296. An extremely coarse clock source?
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> There are several more instances of div_u64 in psi.c. They all look
> fine to me except for one in the psi poll() windowing code, where we
> divide by the window size, which can be up to 10s. CCing Suren.
>
> ---
> From 009cece5f37a38f4baeb1bebdcb432ac9ae66ef8 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Tue, 12 Nov 2019 10:35:26 -0500
> Subject: [PATCH] psi: fix a division error in psi poll()
>
> The psi window size is a u64 an can be up to 10 seconds right now,
> which exceeds the lower 32 bits of the variable. But div_u64 is meant
> only for 32-bit divisors. Use div64_u64 for the 64-bit divisor.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  kernel/sched/psi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 517e3719027e..84af7aa158bf 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -481,7 +481,7 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>                 u32 remaining;
>
>                 remaining = win->size - elapsed;
> -               growth += div_u64(win->prev_growth * remaining, win->size);
> +               growth += div64_u64(win->prev_growth * remaining, win->size);
>         }
>
>         return growth;
> --
> 2.24.0
>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-12 16:08     ` Johannes Weiner
  2019-11-12 17:27       ` Suren Baghdasaryan
@ 2019-11-29  6:37       ` Jingfeng Xie
  2019-11-30  1:41         ` Suren Baghdasaryan
  1 sibling, 1 reply; 11+ messages in thread
From: Jingfeng Xie @ 2019-11-29  6:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Suren Baghdasaryan,
	Xunlei Pang, 齐江(窅默)

Weiner,
The crash does not happen right after boot, in my case,  it happens in 58914 ~ 815463 seconds range since boot 

With my coredump,some values are extracted as below:

period = 001df2dc00000000
now = 001df2dc00000000, same as period
expires = group->next_update = rdi = 00003594f700648e
group->avg_last_update  could not be known
missed_periods = 0

在 2019/11/13 上午12:08,“Johannes Weiner”<hannes@cmpxchg.org> 写入:

    On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
    > On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
    > > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
    > > > In psi_update_stats, it is possible that period has value like
    > > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
    > > > truncates u64 period to u32, results in zero divisor.
    > > > Use div64_u64() instead of div_u64()  if the divisor is u64 to avoid
    > > > truncation to 32-bit on 64-bit platforms.
    > > > 
    > > > Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>
    > > 
    > > This is legit. When we stop the periodic averaging worker due to an
    > > idle CPU, the period after restart can be much longer than the ~4 sec
    > > in the lower 32 bits. See the missed_periods logic in update_averages.
    > 
    > Argh, that's not right. Of course I notice right after hitting send.
    > 
    > missed_periods are subtracted out of the difference between now and
    > the last update, so period should be not much bigger than 2s.
    > 
    > Something else is going on here.
    
    Tim, does this happen right after boot? I wonder if it's because we're
    not initializing avg_last_update, and the initial delta between the
    last update (0) and the first scheduled update (sched_clock() + 2s)
    ends up bigger than 4 seconds somehow. Later on, the delta between the
    last and the scheduled update should always be ~2s. But for that to
    happen, it would require a pretty slow boot, or a sched_clock() that
    does not start at 0.
    
    Tim, if you have a coredump, can you extract the value of the other
    variables printed in the following patch?
    
    diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
    index 84af7aa158bf..1b6836d23091 100644
    --- a/kernel/sched/psi.c
    +++ b/kernel/sched/psi.c
    @@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
     	 */
     	avg_next_update = expires + ((1 + missed_periods) * psi_period);
     	period = now - (group->avg_last_update + (missed_periods * psi_period));
    +
    +	WARN(period >> 32, "period=%ld now=%ld expires=%ld last=%ld missed=%ld\n",
    +	     period, now, expires, group->avg_last_update, missed_periods);
    +
     	group->avg_last_update = now;
     
     	for (s = 0; s < NR_PSI_STATES - 1; s++) {
    
    And we may need something like this to make the tick initialization
    more robust regardless of the reported bug here:
    
    diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
    index 84af7aa158bf..ce8f6748678a 100644
    --- a/kernel/sched/psi.c
    +++ b/kernel/sched/psi.c
    @@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
     
     	for_each_possible_cpu(cpu)
     		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
    -	group->avg_next_update = sched_clock() + psi_period;
    +	group->avg_last_update = sched_clock();
    +	group->avg_next_update = group->avg_last_update + psi_period;
     	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
     	mutex_init(&group->avgs_lock);
     	/* Init trigger-related members */
    




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] psi:fix divide by zero in psi_update_stats
  2019-11-29  6:37       ` Jingfeng Xie
@ 2019-11-30  1:41         ` Suren Baghdasaryan
  0 siblings, 0 replies; 11+ messages in thread
From: Suren Baghdasaryan @ 2019-11-30  1:41 UTC (permalink / raw)
  To: Jingfeng Xie
  Cc: Johannes Weiner, Ingo Molnar, Peter Zijlstra, LKML, Xunlei Pang,
	齐江(窅默)

On Thu, Nov 28, 2019 at 10:37 PM Jingfeng Xie
<xiejingfeng@linux.alibaba.com> wrote:
>
> Weiner,
> The crash does not happen right after boot, in my case,  it happens in 58914 ~ 815463 seconds range since boot
>
> With my coredump,some values are extracted as below:
>
> period = 001df2dc00000000
> now = 001df2dc00000000, same as period
> expires = group->next_update = rdi = 00003594f700648e
> group->avg_last_update  could not be known
> missed_periods = 0
>
Considering that "period = now - (group->avg_last_update +
(missed_periods * psi_period))" and the above values (period==now and
missed_periods==0), group->avg_last_update must be 0 and that would
mean this is indeed the first update_averages() call.
I think this can happen if a cgroup is created long after the boot.
The following call chain would happen:
cgroup_create->psi_cgroup_alloc->group_init->INIT_DELAYED_WORK->psi_avgs_work->update_averages.
If this cgroup creation is timed so that psi_avgs_work is called when
sched_clock returns a value with LSBs of 0 then we get this problem.
The patch Johannes posted earlier which sets group->avg_last_update to
sched_clock in group_init should have fixed this problem. Tim, did you
capture this coredump after applying that patch? If not please try
applying it and see if it still happens.


> 在 2019/11/13 上午12:08,“Johannes Weiner”<hannes@cmpxchg.org> 写入:
>
>     On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
>     > On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
>     > > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
>     > > > In psi_update_stats, it is possible that period has value like
>     > > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
>     > > > truncates u64 period to u32, results in zero divisor.
>     > > > Use div64_u64() instead of div_u64()  if the divisor is u64 to avoid
>     > > > truncation to 32-bit on 64-bit platforms.
>     > > >
>     > > > Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>
>     > >
>     > > This is legit. When we stop the periodic averaging worker due to an
>     > > idle CPU, the period after restart can be much longer than the ~4 sec
>     > > in the lower 32 bits. See the missed_periods logic in update_averages.
>     >
>     > Argh, that's not right. Of course I notice right after hitting send.
>     >
>     > missed_periods are subtracted out of the difference between now and
>     > the last update, so period should be not much bigger than 2s.
>     >
>     > Something else is going on here.
>
>     Tim, does this happen right after boot? I wonder if it's because we're
>     not initializing avg_last_update, and the initial delta between the
>     last update (0) and the first scheduled update (sched_clock() + 2s)
>     ends up bigger than 4 seconds somehow. Later on, the delta between the
>     last and the scheduled update should always be ~2s. But for that to
>     happen, it would require a pretty slow boot, or a sched_clock() that
>     does not start at 0.
>
>     Tim, if you have a coredump, can you extract the value of the other
>     variables printed in the following patch?
>
>     diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>     index 84af7aa158bf..1b6836d23091 100644
>     --- a/kernel/sched/psi.c
>     +++ b/kernel/sched/psi.c
>     @@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
>          */
>         avg_next_update = expires + ((1 + missed_periods) * psi_period);
>         period = now - (group->avg_last_update + (missed_periods * psi_period));
>     +
>     +   WARN(period >> 32, "period=%ld now=%ld expires=%ld last=%ld missed=%ld\n",
>     +        period, now, expires, group->avg_last_update, missed_periods);
>     +
>         group->avg_last_update = now;
>
>         for (s = 0; s < NR_PSI_STATES - 1; s++) {
>
>     And we may need something like this to make the tick initialization
>     more robust regardless of the reported bug here:
>
>     diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>     index 84af7aa158bf..ce8f6748678a 100644
>     --- a/kernel/sched/psi.c
>     +++ b/kernel/sched/psi.c
>     @@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
>
>         for_each_possible_cpu(cpu)
>                 seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>     -   group->avg_next_update = sched_clock() + psi_period;
>     +   group->avg_last_update = sched_clock();
>     +   group->avg_next_update = group->avg_last_update + psi_period;
>         INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
>         mutex_init(&group->avgs_lock);
>         /* Init trigger-related members */
>
>
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-11-30  1:42 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-08  7:33 [PATCH] psi:fix divide by zero in psi_update_stats tim
2019-11-08  9:31 ` Peter Zijlstra
2019-11-08  9:49   ` Jingfeng Xie
2019-11-08 10:05     ` Peter Zijlstra
2019-11-12 15:41 ` Johannes Weiner
2019-11-12 15:48   ` Johannes Weiner
2019-11-12 16:08     ` Johannes Weiner
2019-11-12 17:27       ` Suren Baghdasaryan
2019-11-29  6:37       ` Jingfeng Xie
2019-11-30  1:41         ` Suren Baghdasaryan
2019-11-12 18:33   ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).