* Divide-by-zero in post_init_entity_util_avg @ 2016-06-09 9:01 Chris Wilson 2016-06-09 1:33 ` Yuyang Du 2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 0 siblings, 2 replies; 14+ messages in thread From: Chris Wilson @ 2016-06-09 9:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrey Ryabinin, Yuyang Du, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel [15774.966082] divide error: 0000 [#1] SMP [15774.966137] Modules linked in: i915 intel_gtt [15774.966208] CPU: 1 PID: 15319 Comm: gemscript Not tainted 4.7.0-rc1+ #330 [15774.966252] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [15774.966317] task: ffff880276a55e80 ti: ffff880272c10000 task.ti: ffff880272c10000 [15774.966377] RIP: 0010:[<ffffffff810b12e3>] [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80 [15774.966463] RSP: 0018:ffff880272c13e30 EFLAGS: 00010057 [15774.966504] RAX: 0000000022f00000 RBX: ffff880276a51f80 RCX: 00000000000000e8 [15774.966550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880276a52000 [15774.966593] RBP: ffff880272c13e30 R08: 0000000000000000 R09: 0000000000000008 [15774.966635] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880276a52544 [15774.966678] R13: ffff880276a52000 R14: ffff880276a521d8 R15: 0000000000003bda [15774.966721] FS: 00007fe4df95a740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000 [15774.966781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [15774.966822] CR2: 00007fe4de7f0378 CR3: 0000000275e15000 CR4: 00000000001006e0 [15774.966860] Stack: [15774.966894] ffff880272c13e68 ffffffff810a69c5 0000000000000206 0000000000003bd7 [15774.966987] 0000000000000000 0000000000000000 ffff880276a51f80 ffff880272c13ee8 [15774.967073] ffffffff8107a364 0000000000003bd7 ffffffffffffffff 0000000000000000 [15774.967162] Call Trace: [15774.967208] [<ffffffff810a69c5>] wake_up_new_task+0x95/0x130 [15774.967256] [<ffffffff8107a364>] 
_do_fork+0x184/0x400 [15774.967302] [<ffffffff8107a6b7>] SyS_clone+0x37/0x50 [15774.967353] [<ffffffff8100197d>] do_syscall_64+0x5d/0xe0 [15774.967402] [<ffffffff815719bc>] entry_SYSCALL64_slow_path+0x25/0x25 [15774.967444] Code: 89 d1 48 c1 e9 3f 48 01 d1 48 d1 f9 48 85 c9 7e 38 48 85 c0 74 35 48 0f af 07 31 d2 48 89 87 e0 00 00 00 48 8b 76 70 48 83 c6 01 <48> f7 f6 48 39 c8 77 18 48 89 c1 48 89 87 e0 00 00 00 69 c9 7e [15774.968086] RIP [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80 [15774.968142] RSP <ffff880272c13e30> I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's util avg to a bounded value") to be at fault, hence the CCs. Though it may just be a victim. gdb says 0x43/0x80 is 725 if (cfs_rq->avg.util_avg != 0) { 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); 728 729 if (sa->util_avg > cap) 730 sa->util_avg = cap; 731 } else { I've run the same fork-heavy workload that seemed to hit the initial fault under kasan. kasan has not reported any errors, nor has the bug reoccurred after a day (earlier I had a couple of panics within a few hours). Is it possible there is a race window where cfs_rq->avg.load_avg is indeed -1? Any evidence of other memory corruption in the above? -Chris -- Chris Wilson, Intel Open Source Technology Centre ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson @ 2016-06-09 1:33 ` Yuyang Du 2016-06-09 13:07 ` Peter Zijlstra 2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 1 sibling, 1 reply; 14+ messages in thread From: Yuyang Du @ 2016-06-09 1:33 UTC (permalink / raw) To: Chris Wilson Cc: Peter Zijlstra, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel On Thu, Jun 09, 2016 at 10:01:42AM +0100, Chris Wilson wrote: > I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's > util avg to a bounded value") to be at fault, hence the CCs. Though it > may just be a victim. > > gdb says 0x43/0x80 is > > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { > > I've run the same fork-heavy workload that seemed to hit the initial > fault under kasan. kasan has not reported any errors, nor has the bug > reoccurred after a day (earlier I had a couple of panics within a few > hours). > > Is it possible there is a race window where cfs_rq->avg.load_avg is indeed > -1? Any evidence of other memory corruption in the above? -1 should not be possible; it sounds like a soft error. But a race is hazardous anyway. Thanks a lot, Chris. -- Subject: [PATCH] sched/fair: Avoid hazardous reading of cfs_rq->avg.load_avg without rq lock Commit 2b8c41daba327 ("sched/fair: Initiate a new task's util avg to a bounded value") reads cfs_rq->avg.load_avg without holding the rq lock, and the value is then used as a divisor (actually cfs_rq->avg.load_avg + 1). This race can cause a divide-by-zero exception. Fix it by moving the read into the rq-locked section. 
Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Yuyang Du <yuyang.du@intel.com> --- kernel/sched/core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 385c947..b9f44df 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p) */ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif + rq = __task_rq_lock(p, &rf); /* Post initialize new task's util average when its cfs_rq is set */ post_init_entity_util_avg(&p->se); - - rq = __task_rq_lock(p, &rf); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); -- ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 1:33 ` Yuyang Du @ 2016-06-09 13:07 ` Peter Zijlstra 2016-06-12 22:25 ` Yuyang Du ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-09 13:07 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel Chris Wilson reported a divide by 0 at: post_init_entity_util_avg(): > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { Which given the lack of serialization, and the code generated from update_cfs_rq_load_avg() is entirely possible. if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); sa->load_avg = max_t(long, sa->load_avg - r, 0); sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); removed_load = 1; } turns into: ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax ffffffff8108706b: 48 85 c0 test %rax,%rax ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> ffffffff81087070: 4c 89 f8 mov %r15,%rax ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) ffffffff8108707e: 4c 89 f9 mov %r15,%rcx ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx Which you'll note ends up with sa->load_avg -= r in memory at ffffffff8108707a. Ludicrous code generation if you ask me; I'd have expected something like (note, r15 holds 0): mov %r15, %rax xchg %rax, cfs_rq->removed_load_avg mov sa->load_avg, %rcx sub %rax, %rcx cmovs %r15, %rcx mov %rcx, sa->load_avg Adding the serialization (to _both_ call sites) should fix this. 
Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value") Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/core.c | 3 +-- kernel/sched/fair.c | 8 +++++++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 385c947482e1..4aff10e3bd14 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p) */ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif - /* Post initialize new task's util average when its cfs_rq is set */ + rq = __task_rq_lock(p, &rf); post_init_entity_util_avg(&p->se); - rq = __task_rq_lock(p, &rf); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c6dd8bab010c..f379da14c7dd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8468,8 +8468,9 @@ void free_fair_sched_group(struct task_group *tg) int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) { - struct cfs_rq *cfs_rq; struct sched_entity *se; + struct cfs_rq *cfs_rq; + struct rq *rq; int i; tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL); @@ -8484,6 +8485,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_bandwidth(tg_cfs_bandwidth(tg)); for_each_possible_cpu(i) { + rq = cpu_rq(i); + cfs_rq = kzalloc_node(sizeof(struct cfs_rq), GFP_KERNEL, cpu_to_node(i)); if (!cfs_rq) @@ -8497,7 +8500,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_rq(cfs_rq); init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]); init_entity_runnable_average(se); + + raw_spin_lock_irq(&rq->lock); post_init_entity_util_avg(se); + raw_spin_unlock_irq(&rq->lock); } return 1; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 13:07 ` Peter Zijlstra @ 2016-06-12 22:25 ` Yuyang Du 2016-06-14 11:25 ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra 2016-06-16 8:50 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 2 siblings, 0 replies; 14+ messages in thread From: Yuyang Du @ 2016-06-12 22:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel On Thu, Jun 09, 2016 at 03:07:50PM +0200, Peter Zijlstra wrote: > Which given the lack of serialization, and the code generated from > update_cfs_rq_load_avg() is entirely possible. > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax > ffffffff8108706b: 48 85 c0 test %rax,%rax > ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> > ffffffff81087070: 4c 89 f8 mov %r15,%rax > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > ffffffff8108707a. Surprising. I actually tweaked this a little bit, but never got desirable generated code. Can any compiler expert shed some light on it? 
> Ludicrous code generation if you ask me; I'd have expected something > like (note, r15 holds 0): > > mov %r15, %rax > xchg %rax, cfs_rq->removed_load_avg > mov sa->load_avg, %rcx > sub %rax, %rcx > cmovs %r15, %rcx > mov %rcx, sa->load_avg > > Adding the serialization (to _both_ call sites) should fix this. Absolutely, both :) I am going to remove the group entity post util initialization soon in the flat hierarchical util implementation; it is not used anywhere. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization 2016-06-09 13:07 ` Peter Zijlstra 2016-06-12 22:25 ` Yuyang Du @ 2016-06-14 11:25 ` tip-bot for Peter Zijlstra 2016-06-16 8:50 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 2 siblings, 0 replies; 14+ messages in thread From: tip-bot for Peter Zijlstra @ 2016-06-14 11:25 UTC (permalink / raw) To: linux-tip-commits Cc: aryabinin, tglx, mingo, efault, hpa, peterz, chris, torvalds, yuyang.du, linux-kernel Commit-ID: b7fa30c9cc48c4f55663420472505d3b4f6e1705 Gitweb: http://git.kernel.org/tip/b7fa30c9cc48c4f55663420472505d3b4f6e1705 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Thu, 9 Jun 2016 15:07:50 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Tue, 14 Jun 2016 10:58:34 +0200 sched/fair: Fix post_init_entity_util_avg() serialization Chris Wilson reported a divide by 0 at: post_init_entity_util_avg(): > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { Which given the lack of serialization, and the code generated from update_cfs_rq_load_avg() is entirely possible: if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); sa->load_avg = max_t(long, sa->load_avg - r, 0); sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); removed_load = 1; } turns into: ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax ffffffff8108706b: 48 85 c0 test %rax,%rax ffffffff8108706e: 74 40 je ffffffff810870b0 ffffffff81087070: 4c 89 f8 mov %r15,%rax ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) ffffffff8108707e: 4c 89 f9 mov %r15,%rcx ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) ffffffff8108708b: 49 0f 49 4d 70 cmovns 
0x70(%r13),%rcx Which you'll note ends up with 'sa->load_avg - r' in memory at ffffffff8108707a. By calling post_init_entity_util_avg() under rq->lock we're sure to be fully serialized against PELT updates and cannot observe intermediate state like this. Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Cc: bsegall@google.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: steve.muckle@linaro.org Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value") Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/core.c | 3 +-- kernel/sched/fair.c | 8 +++++++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 017d539..13d0896 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p) */ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif - /* Post initialize new task's util average when its cfs_rq is set */ + rq = __task_rq_lock(p, &rf); post_init_entity_util_avg(&p->se); - rq = __task_rq_lock(p, &rf); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 218f8e8..4e33ad1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8496,8 +8496,9 @@ void free_fair_sched_group(struct task_group *tg) int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) { - struct cfs_rq *cfs_rq; struct sched_entity *se; + struct cfs_rq *cfs_rq; + struct rq *rq; int i; tg->cfs_rq = 
kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL); @@ -8512,6 +8513,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_bandwidth(tg_cfs_bandwidth(tg)); for_each_possible_cpu(i) { + rq = cpu_rq(i); + cfs_rq = kzalloc_node(sizeof(struct cfs_rq), GFP_KERNEL, cpu_to_node(i)); if (!cfs_rq) @@ -8525,7 +8528,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_rq(cfs_rq); init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]); init_entity_runnable_average(se); + + raw_spin_lock_irq(&rq->lock); post_init_entity_util_avg(se); + raw_spin_unlock_irq(&rq->lock); } return 1; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 13:07 ` Peter Zijlstra 2016-06-12 22:25 ` Yuyang Du 2016-06-14 11:25 ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra @ 2016-06-16 8:50 ` Peter Zijlstra 2016-06-16 12:25 ` Peter Zijlstra 2 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2016-06-16 8:50 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Thu, Jun 09, 2016 at 03:07:50PM +0200, Peter Zijlstra wrote: > Which given the lack of serialization, and the code generated from > update_cfs_rq_load_avg() is entirely possible. > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax > ffffffff8108706b: 48 85 c0 test %rax,%rax > ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> > ffffffff81087070: 4c 89 f8 mov %r15,%rax > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > ffffffff8108707a. > > Ludicrous code generation if you ask me; I'd have expected something > like (note, r15 holds 0): > > mov %r15, %rax > xchg %rax, cfs_rq->removed_load_avg > mov sa->load_avg, %rcx > sub %rax, %rcx > cmovs %r15, %rcx > mov %rcx, sa->load_avg So I _should_ have looked at other unserialized users of ->load_avg, but alas. 
Luckily nikbor reported a similar /0 from task_h_load() which instantly triggered recollection of this here problem. Now, we really do not want to go grab rq->lock there, so I did the below. Which actually ends up generating the 'right' code as per the above. 3133: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) 313a: 49 8b 4d 70 mov 0x70(%r13),%rcx 313e: bb 01 00 00 00 mov $0x1,%ebx 3143: 48 29 c1 sub %rax,%rcx 3146: 49 0f 48 cf cmovs %r15,%rcx 314a: 48 69 c0 82 45 ff ff imul $0xffffffffffff4582,%rax,%rax 3151: 49 89 4d 70 mov %rcx,0x70(%r13) This ensures the negative value never hits memory and allows the unserialized use. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/fair.c | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f75930bdd326..3fd3d903e6b6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) } } +/* + * Explicitly do a load-store to ensure the temporary value never hits memory. + * This allows lockless observations without ever seeing the negative values. + * + * Incidentally, this also generates much saner code for x86. 
+ */ +#define sub_positive(type, ptr, val) do { \ + type tmp = READ_ONCE(*ptr); \ + tmp -= (val); \ + if (tmp < 0) \ + tmp = 0; \ + WRITE_ONCE(*ptr, tmp); \ +} while (0) + /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); - sa->load_avg = max_t(long, sa->load_avg - r, 0); - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); + sub_positive(long, &sa->load_avg, r); + sub_positive(s64, &sa->load_sum, r * LOAD_AVG_MAX); removed_load = 1; } if (atomic_long_read(&cfs_rq->removed_util_avg)) { long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0); - sa->util_avg = max_t(long, sa->util_avg - r, 0); - sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0); + sub_positive(long, &sa->util_avg, r); + sub_positive(s32, &sa->util_sum, r * LOAD_AVG_MAX); removed_util = 1; } @@ -2968,10 +2982,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s &se->avg, se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL); - cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0); - cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0); - cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0); - cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0); + sub_positive(long, &cfs_rq->avg.load_avg, se->avg.load_avg); + sub_positive(s64, &cfs_rq->avg.load_sum, se->avg.load_sum); + sub_positive(long, &cfs_rq->avg.util_avg, se->avg.util_avg); + sub_positive(s32, &cfs_rq->avg.util_sum, se->avg.util_sum); cfs_rq_util_change(cfs_rq); } ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-16 8:50 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra @ 2016-06-16 12:25 ` Peter Zijlstra 2016-06-16 16:16 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-16 12:25 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote: > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f75930bdd326..3fd3d903e6b6 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) > } > } > > +/* > + * Explicitly do a load-store to ensure the temporary value never hits memory. > + * This allows lockless observations without ever seeing the negative values. > + * > + * Incidentally, this also generates much saner code for x86. > + */ > +#define sub_positive(type, ptr, val) do { \ > + type tmp = READ_ONCE(*ptr); \ > + tmp -= (val); \ > + if (tmp < 0) \ > + tmp = 0; \ > + WRITE_ONCE(*ptr, tmp); \ > +} while (0) > + > /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ > static inline int > update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) > @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > - sa->load_avg = max_t(long, sa->load_avg - r, 0); > - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > + sub_positive(long, &sa->load_avg, r); > + sub_positive(s64, &sa->load_sum, r * LOAD_AVG_MAX); Hmm, so either we should change these variables to signed types as forced here, or this logic (along with the former) is plain wrong. 
As it stands any unsigned value with the MSB set will wipe the field after this subtraction. I suppose instead we'd want something like: tmp = READ_ONCE(*ptr); if (tmp > val) tmp -= val; else tmp = 0; WRITE_ONCE(*ptr, tmp); In order to generate: xchg %rax,0xa0(%r13) mov 0x78(%r13),%rcx sub %rax,%rcx cmovae %r15,%rcx mov %rcx,0x78(%r13) however, GCC isn't smart enough and generates: xchg %rax,0x98(%r13) mov 0x70(%r13),%rsi mov %rsi,%rcx sub %rax,%rcx cmp %rsi,%rax cmovae %r15,%rcx mov %rcx,0x70(%r13) Doing a CMP with the _same_ values it does the SUB with, resulting in exactly the same CC values. (this is with gcc-5.3, I'm still trying to build gcc-6.1 from the debian package which I suppose I should just give up and do a source build) Opinions? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-16 12:25 ` Peter Zijlstra @ 2016-06-16 16:16 ` Peter Zijlstra 2016-06-17 8:16 ` Andrey Ryabinin 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra 2 siblings, 0 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-16 16:16 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Thu, Jun 16, 2016 at 02:25:04PM +0200, Peter Zijlstra wrote: > On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index f75930bdd326..3fd3d903e6b6 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) > > } > > } > > > > +/* > > + * Explicitly do a load-store to ensure the temporary value never hits memory. > > + * This allows lockless observations without ever seeing the negative values. > > + * > > + * Incidentally, this also generates much saner code for x86. 
> > + */ > > +#define sub_positive(type, ptr, val) do { \ > > + type tmp = READ_ONCE(*ptr); \ > > + tmp -= (val); \ > > + if (tmp < 0) \ > > + tmp = 0; \ > > + WRITE_ONCE(*ptr, tmp); \ > > +} while (0) > > + > > /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ > > static inline int > > update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) > > @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) > > > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > > - sa->load_avg = max_t(long, sa->load_avg - r, 0); > > - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > > + sub_positive(long, &sa->load_avg, r); > > + sub_positive(s64, &sa->load_sum, r * LOAD_AVG_MAX); > > Hmm, so either we should change these variables to signed types as > forced here, or this logic (along with the former) is plain wrong. > > As it stands any unsigned value with the MSB set will wipe the field > after this subtraction. > > I suppose instead we'd want something like: > > tmp = READ_ONCE(*ptr); > if (tmp > val) > tmp -= val; > else > tmp = 0; > WRITE_ONCE(*ptr, tmp); Stackoverflow suggested this pattern for (unsigned) underflow checking: r = a - b; if ((r = a - b) > a) underflow() should generate the right asm, but no, that doesn't work either. http://stackoverflow.com/questions/24958469/subtract-and-detect-underflow-most-efficient-way-x86-64-with-gcc ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-16 12:25 ` Peter Zijlstra 2016-06-16 16:16 ` Peter Zijlstra @ 2016-06-17 8:16 ` Andrey Ryabinin 2016-06-17 8:23 ` Peter Zijlstra 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra 2 siblings, 1 reply; 14+ messages in thread From: Andrey Ryabinin @ 2016-06-17 8:16 UTC (permalink / raw) To: Peter Zijlstra, Yuyang Du Cc: Chris Wilson, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On 06/16/2016 03:25 PM, Peter Zijlstra wrote: > On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote: >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index f75930bdd326..3fd3d903e6b6 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) >> } >> } >> >> +/* >> + * Explicitly do a load-store to ensure the temporary value never hits memory. >> + * This allows lockless observations without ever seeing the negative values. >> + * >> + * Incidentally, this also generates much saner code for x86. 
>> + */ >> +#define sub_positive(type, ptr, val) do { \ >> + type tmp = READ_ONCE(*ptr); \ >> + tmp -= (val); \ >> + if (tmp < 0) \ >> + tmp = 0; \ >> + WRITE_ONCE(*ptr, tmp); \ >> +} while (0) >> + >> /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ >> static inline int >> update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) >> @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) >> >> if (atomic_long_read(&cfs_rq->removed_load_avg)) { >> s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); >> - sa->load_avg = max_t(long, sa->load_avg - r, 0); >> - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); >> + sub_positive(long, &sa->load_avg, r); >> + sub_positive(s64, &sa->load_sum, r * LOAD_AVG_MAX); > > Hmm, so either we should change these variables to signed types as > forced here, or this logic (along with the former) is plain wrong. > > As it stands any unsigned value with the MSB set will wipe the field > after this subtraction. > > I suppose instead we'd want something like: > > tmp = READ_ONCE(*ptr); > if (tmp > val) > tmp -= val; > else > tmp = 0; > WRITE_ONCE(*ptr, tmp); > > In order to generate: > > xchg %rax,0xa0(%r13) > mov 0x78(%r13),%rcx > sub %rax,%rcx > cmovae %r15,%rcx > mov %rcx,0x78(%r13) > > however, GCC isn't smart enough and generates: > > xchg %rax,0x98(%r13) > mov 0x70(%r13),%rsi > mov %rsi,%rcx > sub %rax,%rcx > cmp %rsi,%rax > cmovae %r15,%rcx > mov %rcx,0x70(%r13) > > Doing a CMP with the _same_ values it does the SUB with, resulting in > exactly the same CC values. > FYI - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3507 (Reported: 2001-07-01) > (this is with gcc-5.3, I'm still trying to build gcc-6.1 from the debian > package which I suppose I should just give up and do a source build) > > Opinions? > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-17 8:16 ` Andrey Ryabinin @ 2016-06-17 8:23 ` Peter Zijlstra 0 siblings, 0 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-17 8:23 UTC (permalink / raw) To: Andrey Ryabinin Cc: Yuyang Du, Chris Wilson, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Fri, Jun 17, 2016 at 11:16:24AM +0300, Andrey Ryabinin wrote: > > I suppose instead we'd want something like: > > > > tmp = READ_ONCE(*ptr); > > if (tmp > val) > > tmp -= val; > > else > > tmp = 0; > > WRITE_ONCE(*ptr, tmp); > > > > In order to generate: > > > > xchg %rax,0xa0(%r13) > > mov 0x78(%r13),%rcx > > sub %rax,%rcx > > cmovae %r15,%rcx > > mov %rcx,0x78(%r13) > > > > however, GCC isn't smart enough and generates: > > > > xchg %rax,0x98(%r13) > > mov 0x70(%r13),%rsi > > mov %rsi,%rcx > > sub %rax,%rcx > > cmp %rsi,%rax > > cmovae %r15,%rcx > > mov %rcx,0x70(%r13) > > > > Doing a CMP with the _same_ values it does the SUB with, resulting in > > exactly the same CC values. > > > > FYI - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3507 (Reported: 2001-07-01) > I found this one when I was googling yesterday: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30315 But yes, it seems this is a 'known' issue. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH] sched/fair: Fix cfs_rq avg tracking underflow 2016-06-16 12:25 ` Peter Zijlstra 2016-06-16 16:16 ` Peter Zijlstra 2016-06-17 8:16 ` Andrey Ryabinin @ 2016-06-17 9:19 ` Peter Zijlstra 2016-06-17 2:01 ` Yuyang Du 2016-06-20 13:24 ` [tip:sched/urgent] " tip-bot for Peter Zijlstra 2 siblings, 2 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-17 9:19 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel In any case, I've settled on the below. It generates crap code, but hopefully GCC can be fixed sometime this century. --- Subject: sched/fair: Fix cfs_rq avg tracking underflow From: Peter Zijlstra <peterz@infradead.org> Date: Thu, 16 Jun 2016 10:50:40 +0200 As per commit: b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization") > the code generated from update_cfs_rq_load_avg(): > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax > ffffffff8108706b: 48 85 c0 test %rax,%rax > ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> > ffffffff81087070: 4c 89 f8 mov %r15,%rax > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > ffffffff8108707a. So I _should_ have looked at other unserialized users of ->load_avg, but alas. 
Luckily nikbor reported a similar /0 from task_h_load() which instantly
triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed bit.
This reduces the effective width of the variables, IOW, it's effectively
the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow detection to
not rely on the signed bit. This allows the variables to use the 'full'
unsigned range. And it does so with explicit LOAD - STORE to ensure any
intermediate value will never be visible in memory, allowing these
unserialized loads.

Note: GCC generates crap code for this

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
maybe we should do clamping on add too.

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: pjt@google.com
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: steve.muckle@linaro.org
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2878,6 +2878,23 @@ static inline void cfs_rq_util_change(st
 	}
 }
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2887,15 +2904,15 @@ update_cfs_rq_load_avg(u64 now, struct c
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2968,10 +2985,10 @@ static void detach_entity_load_avg(struc
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] sched/fair: Fix cfs_rq avg tracking underflow 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra @ 2016-06-17 2:01 ` Yuyang Du 2016-06-20 13:24 ` [tip:sched/urgent] " tip-bot for Peter Zijlstra 1 sibling, 0 replies; 14+ messages in thread From: Yuyang Du @ 2016-06-17 2:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Fri, Jun 17, 2016 at 11:19:48AM +0200, Peter Zijlstra wrote: > +/* > + * Unsigned subtract and clamp on underflow. > + * > + * Explicitly do a load-store to ensure the intermediate value never hits > + * memory. This allows lockless observations without ever seeing the negative > + * values. > + */ > +#define sub_positive(_ptr, _val) do { \ > + typeof(_ptr) ptr = (_ptr); \ > + typeof(*ptr) val = (_val); \ > + typeof(*ptr) res, var = READ_ONCE(*ptr); \ > + res = var - val; \ > + if (res > var) \ > + res = 0; \ > + WRITE_ONCE(*ptr, res); \ > +} while (0) > + maybe sub_nonnegative() or sub_til_zero() ... I wonder whether it is the if statement finally allows GCC to 'registerize' load_avg, and I almost tried it... ^ permalink raw reply [flat|nested] 14+ messages in thread
* [tip:sched/urgent] sched/fair: Fix cfs_rq avg tracking underflow 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra 2016-06-17 2:01 ` Yuyang Du @ 2016-06-20 13:24 ` tip-bot for Peter Zijlstra 1 sibling, 0 replies; 14+ messages in thread From: tip-bot for Peter Zijlstra @ 2016-06-20 13:24 UTC (permalink / raw) To: linux-tip-commits Cc: peterz, linux-kernel, yuyang.du, aryabinin, hpa, chris, mingo, tglx, efault, torvalds Commit-ID: 8974189222159154c55f24ddad33e3613960521a Gitweb: http://git.kernel.org/tip/8974189222159154c55f24ddad33e3613960521a Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Thu, 16 Jun 2016 10:50:40 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 20 Jun 2016 11:29:09 +0200 sched/fair: Fix cfs_rq avg tracking underflow As per commit: b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization") > the code generated from update_cfs_rq_load_avg(): > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax > ffffffff8108706b: 48 85 c0 test %rax,%rax > ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> > ffffffff81087070: 4c 89 f8 mov %r15,%rax > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > ffffffff8108707a. So I _should_ have looked at other unserialized users of ->load_avg, but alas. 
Luckily nikbor reported a similar /0 from task_h_load() which instantly
triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed bit.
This reduces the effective width of the variables, IOW, it's effectively
the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow detection to
not rely on the signed bit. This allows the variables to use the 'full'
unsigned range. And it does so with explicit LOAD - STORE to ensure any
intermediate value will never be visible in memory, allowing these
unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2348de..2ae68f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2904,6 +2904,23 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2913,15 +2930,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2994,10 +3011,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson 2016-06-09 1:33 ` Yuyang Du @ 2016-06-09 10:29 ` Peter Zijlstra 1 sibling, 0 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-09 10:29 UTC (permalink / raw) To: Chris Wilson Cc: Andrey Ryabinin, Yuyang Du, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel On Thu, Jun 09, 2016 at 10:01:42AM +0100, Chris Wilson wrote: > > [15774.966082] divide error: 0000 [#1] SMP > [15774.966137] Modules linked in: i915 intel_gtt > [15774.966208] CPU: 1 PID: 15319 Comm: gemscript Not tainted 4.7.0-rc1+ #330 > [15774.966252] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 > [15774.966317] task: ffff880276a55e80 ti: ffff880272c10000 task.ti: ffff880272c10000 > [15774.966377] RIP: 0010:[<ffffffff810b12e3>] [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80 > [15774.966463] RSP: 0018:ffff880272c13e30 EFLAGS: 00010057 > [15774.966504] RAX: 0000000022f00000 RBX: ffff880276a51f80 RCX: 00000000000000e8 > [15774.966550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880276a52000 > [15774.966593] RBP: ffff880272c13e30 R08: 0000000000000000 R09: 0000000000000008 > [15774.966635] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880276a52544 > [15774.966678] R13: ffff880276a52000 R14: ffff880276a521d8 R15: 0000000000003bda > [15774.966721] FS: 00007fe4df95a740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000 > [15774.966781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [15774.966822] CR2: 00007fe4de7f0378 CR3: 0000000275e15000 CR4: 00000000001006e0 > [15774.966860] Stack: > [15774.966894] ffff880272c13e68 ffffffff810a69c5 0000000000000206 0000000000003bd7 > [15774.966987] 0000000000000000 0000000000000000 ffff880276a51f80 ffff880272c13ee8 > [15774.967073] ffffffff8107a364 0000000000003bd7 ffffffffffffffff 0000000000000000 > [15774.967162] Call Trace: 
> [15774.967208] [<ffffffff810a69c5>] wake_up_new_task+0x95/0x130 > [15774.967256] [<ffffffff8107a364>] _do_fork+0x184/0x400 > [15774.967302] [<ffffffff8107a6b7>] SyS_clone+0x37/0x50 > [15774.967353] [<ffffffff8100197d>] do_syscall_64+0x5d/0xe0 > [15774.967402] [<ffffffff815719bc>] entry_SYSCALL64_slow_path+0x25/0x25 > [15774.967444] Code: 89 d1 48 c1 e9 3f 48 01 d1 48 d1 f9 48 85 c9 7e 38 48 85 c0 74 35 48 0f af 07 31 d2 48 89 87 e0 00 00 00 48 8b 76 70 48 83 c6 01 <48> f7 f6 48 39 c8 77 18 48 89 c1 48 89 87 e0 00 00 00 69 c9 7e > [15774.968086] RIP [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80 > [15774.968142] RSP <ffff880272c13e30> > > I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's > util avg to a bounded value") to be at fault, hence the CCs. Though it > may just be a victim. > > gdb says 0x43/0x80 is > > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { > > I've run the same fork-heavy workload that seemed to hit the initial > fault under kasan. kasan has not reported any errors, nor has the bug > reoccurred after a day (earlier I had a couple of panics within a few > hours). > > Is it possible for a race window where cfg_rq->avg.load_avg is indeed > -1? Any evidence of other memcorruption in the above? > -Chris Maybe; I need to look harder and at the generated code; but at the very least serialization isn't right. We compute/update averages under rq->lock, but post_init_entity_avg() looks at these values without holding it. Something like the below should cure that, something similar should probably be done for the other callsite as well, I'll look harder after lunch. 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c947482e1..289d99d91883 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2536,9 +2536,9 @@ void wake_up_new_task(struct task_struct *p)
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	/* Post initialize new task's util average when its cfs_rq is set */
-	post_init_entity_util_avg(&p->se);
 
 	rq = __task_rq_lock(p, &rf);
+	post_init_entity_util_avg(&p->se);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);

^ permalink raw reply related	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2016-06-20 13:26 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-06-09 9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson 2016-06-09 1:33 ` Yuyang Du 2016-06-09 13:07 ` Peter Zijlstra 2016-06-12 22:25 ` Yuyang Du 2016-06-14 11:25 ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra 2016-06-16 8:50 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 2016-06-16 12:25 ` Peter Zijlstra 2016-06-16 16:16 ` Peter Zijlstra 2016-06-17 8:16 ` Andrey Ryabinin 2016-06-17 8:23 ` Peter Zijlstra 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra 2016-06-17 2:01 ` Yuyang Du 2016-06-20 13:24 ` [tip:sched/urgent] " tip-bot for Peter Zijlstra 2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra