* [PATCH 0/2] psi: two division fixes
@ 2019-12-03 18:35 Johannes Weiner
  2019-12-03 18:35 ` [PATCH 1/2] psi: fix sampling error and rare div0 crashes with cgroups and high uptime Johannes Weiner
  ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread

From: Johannes Weiner @ 2019-12-03 18:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jingfeng Xie, Suren Baghdasaryan, linux-kernel

Hi Peter,

here is a patch that should fix the rare div0 crashes reported by
Jingfeng, and a second patch for a div32-with-64-bit-divisor issue
spotted during code review.

Can you please take a look and route them through the scheduler tree?

Thanks!

 kernel/sched/psi.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
* [PATCH 1/2] psi: fix sampling error and rare div0 crashes with cgroups and high uptime
  2019-12-03 18:35 [PATCH 0/2] psi: two division fixes Johannes Weiner
@ 2019-12-03 18:35 ` Johannes Weiner
  2019-12-17 12:39   ` [tip: sched/urgent] sched/psi: Fix " tip-bot2 for Johannes Weiner
  2019-12-03 18:35 ` [PATCH 2/2] psi: fix a division error in psi poll() Johannes Weiner
  2019-12-11 15:43 ` [PATCH 0/2] psi: two division fixes Peter Zijlstra
  2 siblings, 1 reply; 6+ messages in thread

From: Johannes Weiner @ 2019-12-03 18:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jingfeng Xie, Suren Baghdasaryan, linux-kernel

Jingfeng reports rare div0 crashes in psi on systems with some uptime:

[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383]  ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168]  ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337]  [] psi_update_work+0x22/0x60
[58914.237067]  [] process_one_work+0x189/0x420
[58914.243063]  [] worker_thread+0x4e/0x4b0
[58914.248701]  [] ? process_one_work+0x420/0x420
[58914.254869]  [] kthread+0xe6/0x100
[58914.259994]  [] ? kthread_park+0x60/0x60
[58914.265640]  [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330

The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.

We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.

The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.

Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.

Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 517e3719027e..970db4686dd4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
 
 	for_each_possible_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
-	group->avg_next_update = sched_clock() + psi_period;
+	group->avg_last_update = sched_clock();
+	group->avg_next_update = group->avg_last_update + psi_period;
 	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
 	mutex_init(&group->avgs_lock);
 	/* Init trigger-related members */
-- 
2.24.0
* [tip: sched/urgent] sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
  2019-12-03 18:35 ` [PATCH 1/2] psi: fix sampling error and rare div0 crashes with cgroups and high uptime Johannes Weiner
@ 2019-12-17 12:39   ` tip-bot2 for Johannes Weiner
  0 siblings, 0 replies; 6+ messages in thread

From: tip-bot2 for Johannes Weiner @ 2019-12-17 12:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Jingfeng Xie, Johannes Weiner, Peter Zijlstra (Intel),
	Suren Baghdasaryan, x86, LKML

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985
Gitweb:        https://git.kernel.org/tip/3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985
Author:        Johannes Weiner <hannes@cmpxchg.org>
AuthorDate:    Tue, 03 Dec 2019 13:35:23 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 17 Dec 2019 13:32:47 +01:00

sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime

Jingfeng reports rare div0 crashes in psi on systems with some uptime:

[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383]  ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168]  ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337]  [] psi_update_work+0x22/0x60
[58914.237067]  [] process_one_work+0x189/0x420
[58914.243063]  [] worker_thread+0x4e/0x4b0
[58914.248701]  [] ? process_one_work+0x420/0x420
[58914.254869]  [] kthread+0xe6/0x100
[58914.259994]  [] ? kthread_park+0x60/0x60
[58914.265640]  [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330

The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.

We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.

The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.

Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.

Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 517e371..970db46 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
 
 	for_each_possible_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
-	group->avg_next_update = sched_clock() + psi_period;
+	group->avg_last_update = sched_clock();
+	group->avg_next_update = group->avg_last_update + psi_period;
 	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
 	mutex_init(&group->avgs_lock);
 	/* Init trigger-related members */
* [PATCH 2/2] psi: fix a division error in psi poll()
  2019-12-03 18:35 [PATCH 0/2] psi: two division fixes Johannes Weiner
  2019-12-03 18:35 ` [PATCH 1/2] psi: fix sampling error and rare div0 crashes with cgroups and high uptime Johannes Weiner
@ 2019-12-03 18:35 ` Johannes Weiner
  2019-12-17 12:39   ` [tip: sched/urgent] psi: Fix " tip-bot2 for Johannes Weiner
  2019-12-11 15:43 ` [PATCH 0/2] psi: two division fixes Peter Zijlstra
  2 siblings, 1 reply; 6+ messages in thread

From: Johannes Weiner @ 2019-12-03 18:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jingfeng Xie, Suren Baghdasaryan, linux-kernel

The psi window size is a u64 and can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.

Use div64_u64.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 970db4686dd4..ce8f6748678a 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -482,7 +482,7 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
 		u32 remaining;
 
 		remaining = win->size - elapsed;
-		growth += div_u64(win->prev_growth * remaining, win->size);
+		growth += div64_u64(win->prev_growth * remaining, win->size);
 	}
 
 	return growth;
-- 
2.24.0
* [tip: sched/urgent] psi: Fix a division error in psi poll()
  2019-12-03 18:35 ` [PATCH 2/2] psi: fix a division error in psi poll() Johannes Weiner
@ 2019-12-17 12:39   ` tip-bot2 for Johannes Weiner
  0 siblings, 0 replies; 6+ messages in thread

From: tip-bot2 for Johannes Weiner @ 2019-12-17 12:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Johannes Weiner, Peter Zijlstra (Intel), Suren Baghdasaryan,
	Jingfeng Xie, x86, LKML

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     c3466952ca1514158d7c16c9cfc48c27d5c5dc0f
Gitweb:        https://git.kernel.org/tip/c3466952ca1514158d7c16c9cfc48c27d5c5dc0f
Author:        Johannes Weiner <hannes@cmpxchg.org>
AuthorDate:    Tue, 03 Dec 2019 13:35:24 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 17 Dec 2019 13:32:48 +01:00

psi: Fix a division error in psi poll()

The psi window size is a u64 and can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.

Use div64_u64.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 970db46..ce8f674 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -482,7 +482,7 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
 		u32 remaining;
 
 		remaining = win->size - elapsed;
-		growth += div_u64(win->prev_growth * remaining, win->size);
+		growth += div64_u64(win->prev_growth * remaining, win->size);
 	}
 
 	return growth;
* Re: [PATCH 0/2] psi: two division fixes
  2019-12-03 18:35 [PATCH 0/2] psi: two division fixes Johannes Weiner
  2019-12-03 18:35 ` [PATCH 1/2] psi: fix sampling error and rare div0 crashes with cgroups and high uptime Johannes Weiner
  2019-12-03 18:35 ` [PATCH 2/2] psi: fix a division error in psi poll() Johannes Weiner
@ 2019-12-11 15:43 ` Peter Zijlstra
  2 siblings, 0 replies; 6+ messages in thread

From: Peter Zijlstra @ 2019-12-11 15:43 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Jingfeng Xie, Suren Baghdasaryan, linux-kernel

On Tue, Dec 03, 2019 at 01:35:22PM -0500, Johannes Weiner wrote:
> Hi Peter,
> 
> here is a patch that should fix the rare div0 crashes reported by
> Jingfeng, and a second patch for a div32-with-64-bit-divisor issue
> spotted during code review.
> 
> Can you please take a look and route them through the scheduler tree?

Will do, Thanks!