linux-kernel.vger.kernel.org archive mirror
* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-09  9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson
@ 2016-06-09  1:33 ` Yuyang Du
  2016-06-09 13:07   ` Peter Zijlstra
  2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra
  1 sibling, 1 reply; 14+ messages in thread
From: Yuyang Du @ 2016-06-09  1:33 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Peter Zijlstra, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel

On Thu, Jun 09, 2016 at 10:01:42AM +0100, Chris Wilson wrote:
> I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's
> util avg to a bounded value") to be at fault, hence the CCs. Though it
> may just be a victim.
> 
> gdb says 0x43/0x80 is
> 
>    725			if (cfs_rq->avg.util_avg != 0) {
>    726				sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
> -> 727				sa->util_avg /= (cfs_rq->avg.load_avg + 1);
>    728	
>    729				if (sa->util_avg > cap)
>    730					sa->util_avg = cap;
>    731			} else {
> 
> I've run the same fork-heavy workload that seemed to hit the initial
> fault under kasan. kasan has not reported any errors, nor has the bug
> reoccurred after a day (earlier I had a couple of panics within a few
> hours). 
> 
> Is it possible there is a race window where cfs_rq->avg.load_avg is indeed
> -1? Any evidence of other memory corruption in the above?

-1 should not be possible; this sounds like a soft error.

But a race is hazardous anyway. Thanks a lot, Chris.
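
To make the failure mode concrete, here is a minimal user-space sketch (not
kernel code; the values are made up for illustration) of how an unserialized
reader that observes a transient (unsigned long)-1 in load_avg ends up with
a zero divisor:

	#include <stdio.h>

	int main(void)
	{
		unsigned long load_avg = 5;	/* stands in for cfs_rq->avg.load_avg */
		unsigned long r = 6;		/* removed load, slightly larger */

		/* what the compiled code does: subtract and store first ... */
		load_avg -= r;			/* now (unsigned long)-1 in memory */

		/* ... an unserialized reader sampling here computes: */
		unsigned long divisor = load_avg + 1;	/* wraps to 0 */
		printf("divisor = %lu\n", divisor);	/* prints 0 -> divide error */

		/* ... and only afterwards does the writer clamp to 0 */
		if ((long)load_avg < 0)
			load_avg = 0;

		return 0;
	}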

--
Subject: [PATCH] sched/fair: Avoid hazardous reading cfs_rq->avg.load_avg
 without rq lock

Commit 2b8c41daba327 ("sched/fair: Initiate a new task's util avg to a
bounded value") reads cfs_rq->avg.load_avg without holding the rq lock
and uses the value as a divisor (actually cfs_rq->avg.load_avg + 1).

This unserialized read may race with concurrent updates and cause a
divide-by-zero exception. Fix it by moving the post_init_entity_util_avg()
call into the rq-locked section.

Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c947..b9f44df 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p)
 	 */
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
+	rq = __task_rq_lock(p, &rf);
 	/* Post initialize new task's util average when its cfs_rq is set */
 	post_init_entity_util_avg(&p->se);
-
-	rq = __task_rq_lock(p, &rf);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Divide-by-zero in post_init_entity_util_avg
@ 2016-06-09  9:01 Chris Wilson
  2016-06-09  1:33 ` Yuyang Du
  2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra
  0 siblings, 2 replies; 14+ messages in thread
From: Chris Wilson @ 2016-06-09  9:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrey Ryabinin, Yuyang Du, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel


[15774.966082] divide error: 0000 [#1] SMP
[15774.966137] Modules linked in: i915 intel_gtt
[15774.966208] CPU: 1 PID: 15319 Comm: gemscript Not tainted 4.7.0-rc1+ #330
[15774.966252] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[15774.966317] task: ffff880276a55e80 ti: ffff880272c10000 task.ti: ffff880272c10000
[15774.966377] RIP: 0010:[<ffffffff810b12e3>]  [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80
[15774.966463] RSP: 0018:ffff880272c13e30  EFLAGS: 00010057
[15774.966504] RAX: 0000000022f00000 RBX: ffff880276a51f80 RCX: 00000000000000e8
[15774.966550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880276a52000
[15774.966593] RBP: ffff880272c13e30 R08: 0000000000000000 R09: 0000000000000008
[15774.966635] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880276a52544
[15774.966678] R13: ffff880276a52000 R14: ffff880276a521d8 R15: 0000000000003bda
[15774.966721] FS:  00007fe4df95a740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
[15774.966781] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15774.966822] CR2: 00007fe4de7f0378 CR3: 0000000275e15000 CR4: 00000000001006e0
[15774.966860] Stack:
[15774.966894]  ffff880272c13e68 ffffffff810a69c5 0000000000000206 0000000000003bd7
[15774.966987]  0000000000000000 0000000000000000 ffff880276a51f80 ffff880272c13ee8
[15774.967073]  ffffffff8107a364 0000000000003bd7 ffffffffffffffff 0000000000000000
[15774.967162] Call Trace:
[15774.967208]  [<ffffffff810a69c5>] wake_up_new_task+0x95/0x130
[15774.967256]  [<ffffffff8107a364>] _do_fork+0x184/0x400
[15774.967302]  [<ffffffff8107a6b7>] SyS_clone+0x37/0x50
[15774.967353]  [<ffffffff8100197d>] do_syscall_64+0x5d/0xe0
[15774.967402]  [<ffffffff815719bc>] entry_SYSCALL64_slow_path+0x25/0x25
[15774.967444] Code: 89 d1 48 c1 e9 3f 48 01 d1 48 d1 f9 48 85 c9 7e 38 48 85 c0 74 35 48 0f af 07 31 d2 48 89 87 e0 00 00 00 48 8b 76 70 48 83 c6 01 <48> f7 f6 48 39 c8 77 18 48 89 c1 48 89 87 e0 00 00 00 69 c9 7e 
[15774.968086] RIP  [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80
[15774.968142]  RSP <ffff880272c13e30>

I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's
util avg to a bounded value") to be at fault, hence the CCs. Though it
may just be a victim.

gdb says 0x43/0x80 is

   725			if (cfs_rq->avg.util_avg != 0) {
   726				sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
-> 727				sa->util_avg /= (cfs_rq->avg.load_avg + 1);
   728	
   729				if (sa->util_avg > cap)
   730					sa->util_avg = cap;
   731			} else {

I've run the same fork-heavy workload that seemed to hit the initial
fault under kasan. kasan has not reported any errors, nor has the bug
reoccurred after a day (earlier I had a couple of panics within a few
hours). 

Is it possible there is a race window where cfs_rq->avg.load_avg is indeed
-1? Any evidence of other memory corruption in the above?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-09  9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson
  2016-06-09  1:33 ` Yuyang Du
@ 2016-06-09 10:29 ` Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2016-06-09 10:29 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Andrey Ryabinin, Yuyang Du, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel

On Thu, Jun 09, 2016 at 10:01:42AM +0100, Chris Wilson wrote:
> 
> [15774.966082] divide error: 0000 [#1] SMP
> [15774.966137] Modules linked in: i915 intel_gtt
> [15774.966208] CPU: 1 PID: 15319 Comm: gemscript Not tainted 4.7.0-rc1+ #330
> [15774.966252] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
> [15774.966317] task: ffff880276a55e80 ti: ffff880272c10000 task.ti: ffff880272c10000
> [15774.966377] RIP: 0010:[<ffffffff810b12e3>]  [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80
> [15774.966463] RSP: 0018:ffff880272c13e30  EFLAGS: 00010057
> [15774.966504] RAX: 0000000022f00000 RBX: ffff880276a51f80 RCX: 00000000000000e8
> [15774.966550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880276a52000
> [15774.966593] RBP: ffff880272c13e30 R08: 0000000000000000 R09: 0000000000000008
> [15774.966635] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880276a52544
> [15774.966678] R13: ffff880276a52000 R14: ffff880276a521d8 R15: 0000000000003bda
> [15774.966721] FS:  00007fe4df95a740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
> [15774.966781] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [15774.966822] CR2: 00007fe4de7f0378 CR3: 0000000275e15000 CR4: 00000000001006e0
> [15774.966860] Stack:
> [15774.966894]  ffff880272c13e68 ffffffff810a69c5 0000000000000206 0000000000003bd7
> [15774.966987]  0000000000000000 0000000000000000 ffff880276a51f80 ffff880272c13ee8
> [15774.967073]  ffffffff8107a364 0000000000003bd7 ffffffffffffffff 0000000000000000
> [15774.967162] Call Trace:
> [15774.967208]  [<ffffffff810a69c5>] wake_up_new_task+0x95/0x130
> [15774.967256]  [<ffffffff8107a364>] _do_fork+0x184/0x400
> [15774.967302]  [<ffffffff8107a6b7>] SyS_clone+0x37/0x50
> [15774.967353]  [<ffffffff8100197d>] do_syscall_64+0x5d/0xe0
> [15774.967402]  [<ffffffff815719bc>] entry_SYSCALL64_slow_path+0x25/0x25
> [15774.967444] Code: 89 d1 48 c1 e9 3f 48 01 d1 48 d1 f9 48 85 c9 7e 38 48 85 c0 74 35 48 0f af 07 31 d2 48 89 87 e0 00 00 00 48 8b 76 70 48 83 c6 01 <48> f7 f6 48 39 c8 77 18 48 89 c1 48 89 87 e0 00 00 00 69 c9 7e 
> [15774.968086] RIP  [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80
> [15774.968142]  RSP <ffff880272c13e30>
> 
> I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's
> util avg to a bounded value") to be at fault, hence the CCs. Though it
> may just be a victim.
> 
> gdb says 0x43/0x80 is
> 
>    725			if (cfs_rq->avg.util_avg != 0) {
>    726				sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
> -> 727				sa->util_avg /= (cfs_rq->avg.load_avg + 1);
>    728	
>    729				if (sa->util_avg > cap)
>    730					sa->util_avg = cap;
>    731			} else {
> 
> I've run the same fork-heavy workload that seemed to hit the initial
> fault under kasan. kasan has not reported any errors, nor has the bug
> reoccurred after a day (earlier I had a couple of panics within a few
> hours). 
> 
> Is it possible there is a race window where cfs_rq->avg.load_avg is indeed
> -1? Any evidence of other memory corruption in the above?
> -Chris

Maybe; I need to look harder, and at the generated code; but at the very
least the serialization isn't right.

We compute/update averages under rq->lock, but post_init_entity_util_avg()
looks at these values without holding it.
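
As a hedged C-level sketch of that asymmetry (a pthread mutex standing in
for rq->lock; the names and types are illustrative, not the actual kernel
code):

	#include <pthread.h>

	struct cfs_rq_avg {
		unsigned long load_avg;
	};

	static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
	static struct cfs_rq_avg avg;

	/* updater: always runs with rq->lock held */
	void remove_load_locked(unsigned long r)
	{
		pthread_mutex_lock(&rq_lock);
		avg.load_avg -= r;		/* intermediate value is globally visible */
		if ((long)avg.load_avg < 0)
			avg.load_avg = 0;	/* clamp lands in a second store */
		pthread_mutex_unlock(&rq_lock);
	}

	/* fork path: currently reads the same field without taking rq->lock */
	unsigned long post_init_read_unlocked(void)
	{
		return avg.load_avg + 1;	/* may observe the intermediate -1, i.e. divisor 0 */
	}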

Something like the below should cure that, something similar should
probably be done for the other callsite as well, I'll look harder after
lunch.


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c947482e1..289d99d91883 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2536,9 +2536,9 @@ void wake_up_new_task(struct task_struct *p)
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	/* Post initialize new task's util average when its cfs_rq is set */
-	post_init_entity_util_avg(&p->se);
 
 	rq = __task_rq_lock(p, &rf);
+	post_init_entity_util_avg(&p->se);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-09  1:33 ` Yuyang Du
@ 2016-06-09 13:07   ` Peter Zijlstra
  2016-06-12 22:25     ` Yuyang Du
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Peter Zijlstra @ 2016-06-09 13:07 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel

Chris Wilson reported a divide by 0 at:

post_init_entity_util_avg():

>    725			if (cfs_rq->avg.util_avg != 0) {
>    726				sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
> -> 727				sa->util_avg /= (cfs_rq->avg.load_avg + 1);
>    728	
>    729				if (sa->util_avg > cap)
>    730					sa->util_avg = cap;
>    731			} else {

Which, given the lack of serialization and the code generated from
update_cfs_rq_load_avg(), is entirely possible.

	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
		sa->load_avg = max_t(long, sa->load_avg - r, 0);
		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
		removed_load = 1;
	}

turns into:

ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
ffffffff8108706b:       48 85 c0                test   %rax,%rax
ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
ffffffff81087070:       4c 89 f8                mov    %r15,%rax
ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx

Which you'll note ends up with sa->load_avg -= r in memory at
ffffffff8108707a.
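
In C terms, the sequence above behaves as if it had been written like this
(a hand-translated sketch of the compiled behaviour, with a stand-in struct;
not actual source):

	struct sched_avg_sketch {
		unsigned long load_avg;
	};

	void removed_load_sub_as_compiled(struct sched_avg_sketch *sa, unsigned long r)
	{
		sa->load_avg = sa->load_avg - r;	/* stored to memory immediately */
		if ((long)sa->load_avg < 0)		/* clamp only happens afterwards, */
			sa->load_avg = 0;		/* via a second store             */
	}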

Ludicrous code generation if you ask me; I'd have expected something
like (note, r15 holds 0):

	mov	%r15, %rax
	xchg	%rax, cfs_rq->removed_load_avg
	mov	sa->load_avg, %rcx
	sub	%rax, %rcx
	cmovs	%r15, %rcx
	mov	%rcx, sa->load_avg

Adding the serialization (to _both_ call sites) should fix this.

Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 3 +--
 kernel/sched/fair.c | 8 +++++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c947482e1..4aff10e3bd14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p)
 	 */
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
-	/* Post initialize new task's util average when its cfs_rq is set */
+	rq = __task_rq_lock(p, &rf);
 	post_init_entity_util_avg(&p->se);
 
-	rq = __task_rq_lock(p, &rf);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6dd8bab010c..f379da14c7dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8468,8 +8468,9 @@ void free_fair_sched_group(struct task_group *tg)
 
 int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
 	int i;
 
 	tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL);
@@ -8484,6 +8485,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
 	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
 		if (!cfs_rq)
@@ -8497,7 +8500,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		init_cfs_rq(cfs_rq);
 		init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
 		init_entity_runnable_average(se);
+
+		raw_spin_lock_irq(&rq->lock);
 		post_init_entity_util_avg(se);
+		raw_spin_unlock_irq(&rq->lock);
 	}
 
 	return 1;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-09 13:07   ` Peter Zijlstra
@ 2016-06-12 22:25     ` Yuyang Du
  2016-06-14 11:25     ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra
  2016-06-16  8:50     ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra
  2 siblings, 0 replies; 14+ messages in thread
From: Yuyang Du @ 2016-06-12 22:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel

On Thu, Jun 09, 2016 at 03:07:50PM +0200, Peter Zijlstra wrote:
> Which given the lack of serialization, and the code generated from
> update_cfs_rq_load_avg() is entirely possible.
> 
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
> 
> turns into:
> 
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
> 
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.
 
Surprised. I actually tweaked this a little bit, but haven't gotten the
desired generated code. Can any compiler expert shed some light on it?

> Ludicrous code generation if you ask me; I'd have expected something
> like (note, r15 holds 0):
> 
> 	mov	%r15, %rax
> 	xchg	%rax, cfs_rq->removed_load_avg
> 	mov	sa->load_avg, %rcx
> 	sub	%rax, %rcx
> 	cmovs	%r15, %rcx
> 	mov	%rcx, sa->load_avg
> 
> Adding the serialization (to _both_ call sites) should fix this.
 
Absolutely, both :)

I am going to remove the group entity post util initialization soon in the
flat hierarchical util implementation; it is not used anywhere.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization
  2016-06-09 13:07   ` Peter Zijlstra
  2016-06-12 22:25     ` Yuyang Du
@ 2016-06-14 11:25     ` tip-bot for Peter Zijlstra
  2016-06-16  8:50     ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra
  2 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Peter Zijlstra @ 2016-06-14 11:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: aryabinin, tglx, mingo, efault, hpa, peterz, chris, torvalds,
	yuyang.du, linux-kernel

Commit-ID:  b7fa30c9cc48c4f55663420472505d3b4f6e1705
Gitweb:     http://git.kernel.org/tip/b7fa30c9cc48c4f55663420472505d3b4f6e1705
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 9 Jun 2016 15:07:50 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 14 Jun 2016 10:58:34 +0200

sched/fair: Fix post_init_entity_util_avg() serialization

Chris Wilson reported a divide by 0 at:

 post_init_entity_util_avg():

 >    725	if (cfs_rq->avg.util_avg != 0) {
 >    726		sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
 > -> 727		sa->util_avg /= (cfs_rq->avg.load_avg + 1);
 >    728
 >    729		if (sa->util_avg > cap)
 >    730			sa->util_avg = cap;
 >    731	} else {

Which, given the lack of serialization and the code generated from
update_cfs_rq_load_avg(), is entirely possible:

	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
		sa->load_avg = max_t(long, sa->load_avg - r, 0);
		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
		removed_load = 1;
	}

turns into:

  ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
  ffffffff8108706b:       48 85 c0                test   %rax,%rax
  ffffffff8108706e:       74 40                   je     ffffffff810870b0
  ffffffff81087070:       4c 89 f8                mov    %r15,%rax
  ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
  ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
  ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
  ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
  ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
  ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx

Which you'll note ends up with 'sa->load_avg - r' in memory at
ffffffff8108707a.

By calling post_init_entity_util_avg() under rq->lock we're sure to be
fully serialized against PELT updates and cannot observe intermediate
state like this.

Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 3 +--
 kernel/sched/fair.c | 8 +++++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 017d539..13d0896 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p)
 	 */
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
-	/* Post initialize new task's util average when its cfs_rq is set */
+	rq = __task_rq_lock(p, &rf);
 	post_init_entity_util_avg(&p->se);
 
-	rq = __task_rq_lock(p, &rf);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e8..4e33ad1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8496,8 +8496,9 @@ void free_fair_sched_group(struct task_group *tg)
 
 int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
 	int i;
 
 	tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL);
@@ -8512,6 +8513,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
 	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
 		if (!cfs_rq)
@@ -8525,7 +8528,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		init_cfs_rq(cfs_rq);
 		init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
 		init_entity_runnable_average(se);
+
+		raw_spin_lock_irq(&rq->lock);
 		post_init_entity_util_avg(se);
+		raw_spin_unlock_irq(&rq->lock);
 	}
 
 	return 1;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-09 13:07   ` Peter Zijlstra
  2016-06-12 22:25     ` Yuyang Du
  2016-06-14 11:25     ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra
@ 2016-06-16  8:50     ` Peter Zijlstra
  2016-06-16 12:25       ` Peter Zijlstra
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2016-06-16  8:50 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel, kernel

On Thu, Jun 09, 2016 at 03:07:50PM +0200, Peter Zijlstra wrote:
> Which given the lack of serialization, and the code generated from
> update_cfs_rq_load_avg() is entirely possible.
> 
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
> 
> turns into:
> 
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
> 
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.
> 
> Ludicrous code generation if you ask me; I'd have expected something
> like (note, r15 holds 0):
> 
> 	mov	%r15, %rax
> 	xchg	%rax, cfs_rq->removed_load_avg
> 	mov	sa->load_avg, %rcx
> 	sub	%rax, %rcx
> 	cmovs	%r15, %rcx
> 	mov	%rcx, sa->load_avg

So I _should_ have looked at other unserialized users of ->load_avg, but
alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Now, we really do not want to go grab rq->lock there, so I did the
below. Which actually ends up generating the 'right' code as per the
above.

    3133:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
    313a:       49 8b 4d 70             mov    0x70(%r13),%rcx
    313e:       bb 01 00 00 00          mov    $0x1,%ebx
    3143:       48 29 c1                sub    %rax,%rcx
    3146:       49 0f 48 cf             cmovs  %r15,%rcx
    314a:       48 69 c0 82 45 ff ff    imul   $0xffffffffffff4582,%rax,%rax
    3151:       49 89 4d 70             mov    %rcx,0x70(%r13)

This ensures the negative value never hits memory and allows the
unserialized use.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f75930bdd326..3fd3d903e6b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
+/*
+ * Explicitly do a load-store to ensure the temporary value never hits memory.
+ * This allows lockless observations without ever seeing the negative values.
+ *
+ * Incidentally, this also generates much saner code for x86.
+ */
+#define sub_positive(type, ptr, val) do {			\
+	type tmp = READ_ONCE(*ptr);				\
+	tmp -= (val);						\
+	if (tmp < 0)						\
+		tmp = 0;					\
+	WRITE_ONCE(*ptr, tmp);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(long, &sa->load_avg, r);
+		sub_positive(s64,  &sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(long, &sa->util_avg, r);
+		sub_positive(s32,  &sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2968,10 +2982,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64,  cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32,  cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(long, &cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(s64,  &cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(long, &cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(s32,  &cfs_rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-16  8:50     ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra
@ 2016-06-16 12:25       ` Peter Zijlstra
  2016-06-16 16:16         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Peter Zijlstra @ 2016-06-16 12:25 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel, kernel

On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f75930bdd326..3fd3d903e6b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  	}
>  }
>  
> +/*
> + * Explicitly do a load-store to ensure the temporary value never hits memory.
> + * This allows lockless observations without ever seeing the negative values.
> + *
> + * Incidentally, this also generates much saner code for x86.
> + */
> +#define sub_positive(type, ptr, val) do {			\
> +	type tmp = READ_ONCE(*ptr);				\
> +	tmp -= (val);						\
> +	if (tmp < 0)						\
> +		tmp = 0;					\
> +	WRITE_ONCE(*ptr, tmp);					\
> +} while (0)
> +
>  /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
>  static inline int
>  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
> @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
>  
>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>  		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> -		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> -		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> +		sub_positive(long, &sa->load_avg, r);
> +		sub_positive(s64,  &sa->load_sum, r * LOAD_AVG_MAX);

Hmm, so either we should change these variables to signed types as
forced here, or this logic (along with the former) is plain wrong.

As it stands any unsigned value with the MSB set will wipe the field
after this subtraction.
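
A quick user-space illustration of that concern (hedged; the value is made
up and assumes a 64-bit long), expanding the signed-typed sub_positive()
from above by hand:

	#include <stdio.h>

	int main(void)
	{
		unsigned long load_sum = 0x8000000000000001UL;	/* MSB set, but a legitimate unsigned value */
		unsigned long r = 1;

		long tmp = (long)load_sum;	/* "type tmp = READ_ONCE(*ptr)" with a signed type */
		tmp -= r;
		if (tmp < 0)			/* negative as a signed value ... */
			tmp = 0;		/* ... so the field gets wiped    */
		load_sum = tmp;

		printf("load_sum = %lu\n", load_sum);	/* prints 0, not 0x8000000000000000 */
		return 0;
	}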

I suppose instead we'd want something like:

	tmp = READ_ONCE(*ptr);
	if (tmp > val)
	  tmp -= val;
	else
	  tmp = 0;
	WRITE_ONCE(*ptr, tmp);

In order to generate:

  xchg   %rax,0xa0(%r13)
  mov    0x78(%r13),%rcx
  sub    %rax,%rcx
  cmovae %r15,%rcx
  mov    %rcx,0x78(%r13)

however, GCC isn't smart enough and generates:

  xchg   %rax,0x98(%r13)
  mov    0x70(%r13),%rsi
  mov    %rsi,%rcx
  sub    %rax,%rcx
  cmp    %rsi,%rax
  cmovae %r15,%rcx
  mov    %rcx,0x70(%r13)

Doing a CMP with the _same_ values it does the SUB with, resulting in
exactly the same CC values.

(this is with gcc-5.3, I'm still trying to build gcc-6.1 from the debian
package which I suppose I should just give up and do a source build)

Opinions?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-16 12:25       ` Peter Zijlstra
@ 2016-06-16 16:16         ` Peter Zijlstra
  2016-06-17  8:16         ` Andrey Ryabinin
  2016-06-17  9:19         ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra
  2 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2016-06-16 16:16 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel, kernel

On Thu, Jun 16, 2016 at 02:25:04PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f75930bdd326..3fd3d903e6b6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
> >  	}
> >  }
> >  
> > +/*
> > + * Explicitly do a load-store to ensure the temporary value never hits memory.
> > + * This allows lockless observations without ever seeing the negative values.
> > + *
> > + * Incidentally, this also generates much saner code for x86.
> > + */
> > +#define sub_positive(type, ptr, val) do {			\
> > +	type tmp = READ_ONCE(*ptr);				\
> > +	tmp -= (val);						\
> > +	if (tmp < 0)						\
> > +		tmp = 0;					\
> > +	WRITE_ONCE(*ptr, tmp);					\
> > +} while (0)
> > +
> >  /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
> >  static inline int
> >  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
> > @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
> >  
> >  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> >  		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> > -		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> > -		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> > +		sub_positive(long, &sa->load_avg, r);
> > +		sub_positive(s64,  &sa->load_sum, r * LOAD_AVG_MAX);
> 
> Hmm, so either we should change these variables to signed types as
> forced here, or this logic (along with the former) is plain wrong.
> 
> As it stands any unsigned value with the MSB set will wipe the field
> after this subtraction.
> 
> I suppose instead we'd want something like:
> 
> 	tmp = READ_ONCE(*ptr);
> 	if (tmp > val)
> 	  tmp -= val;
> 	else
> 	  tmp = 0;
> 	WRITE_ONCE(*ptr, tmp);

Stackoverflow suggested this pattern for (unsigned) underflow checking:

	r = a - b;
	if (r > a)
	  underflow()

should generate the right asm, but no, that doesn't work either.

http://stackoverflow.com/questions/24958469/subtract-and-detect-underflow-most-efficient-way-x86-64-with-gcc
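
For reference, a small worked check of that pattern (the pattern itself is
fine; the complaint is only about the code GCC emits for it):

	#include <stdio.h>

	int main(void)
	{
		unsigned long a = 3, b = 5, r;

		r = a - b;		/* wraps to ULONG_MAX - 1 */
		if (r > a)		/* wrapped result exceeds a => underflow */
			printf("underflow, clamp to 0\n");
		else
			printf("r = %lu\n", r);

		return 0;
	}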

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/fair: Fix cfs_rq avg tracking underflow
  2016-06-17  9:19         ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra
@ 2016-06-17  2:01           ` Yuyang Du
  2016-06-20 13:24           ` [tip:sched/urgent] " tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: Yuyang Du @ 2016-06-17  2:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel, kernel

On Fri, Jun 17, 2016 at 11:19:48AM +0200, Peter Zijlstra wrote:
> +/*
> + * Unsigned subtract and clamp on underflow.
> + *
> + * Explicitly do a load-store to ensure the intermediate value never hits
> + * memory. This allows lockless observations without ever seeing the negative
> + * values.
> + */
> +#define sub_positive(_ptr, _val) do {				\
> +	typeof(_ptr) ptr = (_ptr);				\
> +	typeof(*ptr) val = (_val);				\
> +	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
> +	res = var - val;					\
> +	if (res > var)						\
> +		res = 0;					\
> +	WRITE_ONCE(*ptr, res);					\
> +} while (0)
> +

maybe sub_nonnegative() or sub_til_zero() ...

I wonder whether it is the if statement that finally allows GCC to
'registerize' load_avg; I almost tried it...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-16 12:25       ` Peter Zijlstra
  2016-06-16 16:16         ` Peter Zijlstra
@ 2016-06-17  8:16         ` Andrey Ryabinin
  2016-06-17  8:23           ` Peter Zijlstra
  2016-06-17  9:19         ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra
  2 siblings, 1 reply; 14+ messages in thread
From: Andrey Ryabinin @ 2016-06-17  8:16 UTC (permalink / raw)
  To: Peter Zijlstra, Yuyang Du
  Cc: Chris Wilson, Linus Torvalds, Mike Galbraith, Thomas Gleixner,
	bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel,
	kernel



On 06/16/2016 03:25 PM, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f75930bdd326..3fd3d903e6b6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>>  	}
>>  }
>>  
>> +/*
>> + * Explicitly do a load-store to ensure the temporary value never hits memory.
>> + * This allows lockless observations without ever seeing the negative values.
>> + *
>> + * Incidentally, this also generates much saner code for x86.
>> + */
>> +#define sub_positive(type, ptr, val) do {			\
>> +	type tmp = READ_ONCE(*ptr);				\
>> +	tmp -= (val);						\
>> +	if (tmp < 0)						\
>> +		tmp = 0;					\
>> +	WRITE_ONCE(*ptr, tmp);					\
>> +} while (0)
>> +
>>  /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
>>  static inline int
>>  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
>> @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
>>  
>>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>>  		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>> -		sa->load_avg = max_t(long, sa->load_avg - r, 0);
>> -		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
>> +		sub_positive(long, &sa->load_avg, r);
>> +		sub_positive(s64,  &sa->load_sum, r * LOAD_AVG_MAX);
> 
> Hmm, so either we should change these variables to signed types as
> forced here, or this logic (along with the former) is plain wrong.
> 
> As it stands any unsigned value with the MSB set will wipe the field
> after this subtraction.
> 
> I suppose instead we'd want something like:
> 
> 	tmp = READ_ONCE(*ptr);
> 	if (tmp > val)
> 	  tmp -= val;
> 	else
> 	  tmp = 0;
> 	WRITE_ONCE(*ptr, tmp);
> 
> In order to generate:
> 
>   xchg   %rax,0xa0(%r13)
>   mov    0x78(%r13),%rcx
>   sub    %rax,%rcx
>   cmovae %r15,%rcx
>   mov    %rcx,0x78(%r13)
> 
> however, GCC isn't smart enough and generates:
> 
>   xchg   %rax,0x98(%r13)
>   mov    0x70(%r13),%rsi
>   mov    %rsi,%rcx
>   sub    %rax,%rcx
>   cmp    %rsi,%rax
>   cmovae %r15,%rcx
>   mov    %rcx,0x70(%r13)
> 
> Doing a CMP with the _same_ values it does the SUB with, resulting in
> exactly the same CC values.
> 

FYI - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3507  (Reported:	2001-07-01)


> (this is with gcc-5.3, I'm still trying to build gcc-6.1 from the debian
> package which I suppose I should just give up and do a source build)
> 
> Opinions?
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Divide-by-zero in post_init_entity_util_avg
  2016-06-17  8:16         ` Andrey Ryabinin
@ 2016-06-17  8:23           ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2016-06-17  8:23 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Yuyang Du, Chris Wilson, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel, kernel

On Fri, Jun 17, 2016 at 11:16:24AM +0300, Andrey Ryabinin wrote:
> > I suppose instead we'd want something like:
> > 
> > 	tmp = READ_ONCE(*ptr);
> > 	if (tmp > val)
> > 	  tmp -= val;
> > 	else
> > 	  tmp = 0;
> > 	WRITE_ONCE(*ptr, tmp);
> > 
> > In order to generate:
> > 
> >   xchg   %rax,0xa0(%r13)
> >   mov    0x78(%r13),%rcx
> >   sub    %rax,%rcx
> >   cmovae %r15,%rcx
> >   mov    %rcx,0x78(%r13)
> > 
> > however, GCC isn't smart enough and generates:
> > 
> >   xchg   %rax,0x98(%r13)
> >   mov    0x70(%r13),%rsi
> >   mov    %rsi,%rcx
> >   sub    %rax,%rcx
> >   cmp    %rsi,%rax
> >   cmovae %r15,%rcx
> >   mov    %rcx,0x70(%r13)
> > 
> > Doing a CMP with the _same_ values it does the SUB with, resulting in
> > exactly the same CC values.
> > 
> 
> FYI - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3507  (Reported:	2001-07-01)
> 

I found this one when I was googling yesterday:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30315

But yes, it seems this is a 'known' issue.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] sched/fair: Fix cfs_rq avg tracking underflow
  2016-06-16 12:25       ` Peter Zijlstra
  2016-06-16 16:16         ` Peter Zijlstra
  2016-06-17  8:16         ` Andrey Ryabinin
@ 2016-06-17  9:19         ` Peter Zijlstra
  2016-06-17  2:01           ` Yuyang Du
  2016-06-20 13:24           ` [tip:sched/urgent] " tip-bot for Peter Zijlstra
  2 siblings, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2016-06-17  9:19 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle,
	linux-kernel, kernel


In any case, I've settled on the below. It generates crap code, but
hopefully GCC can be fixed sometime this century.

---
Subject: sched/fair: Fix cfs_rq avg tracking underflow
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 16 Jun 2016 10:50:40 +0200

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
>
> turns into:
>
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.

So I _should_ have looked at other unserialized users of ->load_avg, but
alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables; IOW, it's
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: pjt@google.com
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: steve.muckle@linaro.org
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2878,6 +2878,23 @@ static inline void cfs_rq_util_change(st
 	}
 }
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2887,15 +2904,15 @@ update_cfs_rq_load_avg(u64 now, struct c
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2968,10 +2985,10 @@ static void detach_entity_load_avg(struc
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64,  cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32,  cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }
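
Regarding Note2 above, a hedged sketch of what "clamping on add" could look
like: a saturating counterpart to sub_positive(). The add_positive() name
and form here are purely illustrative and not part of this patch:

	/*
	 * Unsigned add, saturating at the type's maximum instead of wrapping.
	 * Same load-store discipline as sub_positive(), so lockless readers
	 * never see a wrapped intermediate value.
	 */
	#define add_positive(_ptr, _val) do {				\
		typeof(_ptr) ptr = (_ptr);				\
		typeof(*ptr) val = (_val);				\
		typeof(*ptr) res, var = READ_ONCE(*ptr);		\
		res = var + val;					\
		if (res < var)		/* wrapped past the maximum */	\
			res = -1;	/* all ones == type maximum */	\
		WRITE_ONCE(*ptr, res);					\
	} while (0)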

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tip:sched/urgent] sched/fair: Fix cfs_rq avg tracking underflow
  2016-06-17  9:19         ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra
  2016-06-17  2:01           ` Yuyang Du
@ 2016-06-20 13:24           ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot for Peter Zijlstra @ 2016-06-20 13:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, linux-kernel, yuyang.du, aryabinin, hpa, chris, mingo,
	tglx, efault, torvalds

Commit-ID:  8974189222159154c55f24ddad33e3613960521a
Gitweb:     http://git.kernel.org/tip/8974189222159154c55f24ddad33e3613960521a
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 16 Jun 2016 10:50:40 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 20 Jun 2016 11:29:09 +0200

sched/fair: Fix cfs_rq avg tracking underflow

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
>
> turns into:
>
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW it's
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2348de..2ae68f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2904,6 +2904,23 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2913,15 +2930,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2994,10 +3011,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64,  cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32,  cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-06-20 13:26 UTC | newest]

Thread overview: 14+ messages
2016-06-09  9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson
2016-06-09  1:33 ` Yuyang Du
2016-06-09 13:07   ` Peter Zijlstra
2016-06-12 22:25     ` Yuyang Du
2016-06-14 11:25     ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra
2016-06-16  8:50     ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra
2016-06-16 12:25       ` Peter Zijlstra
2016-06-16 16:16         ` Peter Zijlstra
2016-06-17  8:16         ` Andrey Ryabinin
2016-06-17  8:23           ` Peter Zijlstra
2016-06-17  9:19         ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra
2016-06-17  2:01           ` Yuyang Du
2016-06-20 13:24           ` [tip:sched/urgent] " tip-bot for Peter Zijlstra
2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra
