linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
@ 2015-04-03 12:41 Konstantin Khlebnikov
  2015-04-03 12:51 ` Konstantin Khlebnikov
  2015-04-06 22:45 ` bsegall
  0 siblings, 2 replies; 10+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-03 12:41 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: Ben Segall, Roman Gushchin

Pick_next_task_fair() must be sure that there is at least one runnable
task before calling put_prev_task(), but put_prev_task() can expire the
last remains of the cfs quota and throttle all currently runnable tasks.
As a result pick_next_task_fair() cannot find the next task and crashes.

This patch leaves 1 in ->runtime_remaining when the current assignment
expires and tries to refill it right after that. In the worst case the
task will be scheduled once and throttled at the end of the slice.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 kernel/sched/fair.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..91785d077db4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
+	/* nothing to expire */
+	if (cfs_rq->runtime_remaining <= 0)
 		return;
 
-	if (cfs_rq->runtime_remaining < 0)
+	/* if the deadline is ahead of our clock, nothing to do */
+	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
 		return;
 
 	/*
@@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
+		/*
+		 * Global deadline is ahead, expiration has passed.
+		 *
+		 * Do not expire runtime completely. Otherwise put_prev_task()
+		 * can throttle all tasks when we already checked nr_running or
+		 * put_prev_entity() can throttle already chosen next entity.
+		 */
+		cfs_rq->runtime_remaining = 1;
 	}
 }
 
@@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	cfs_rq->runtime_remaining -= delta_exec;
 	expire_cfs_rq_runtime(cfs_rq);
 
-	if (likely(cfs_rq->runtime_remaining > 0))
+	if (likely(cfs_rq->runtime_remaining > 1))
 		return;
 
 	/*


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-03 12:41 [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() Konstantin Khlebnikov
@ 2015-04-03 12:51 ` Konstantin Khlebnikov
  2015-04-07 12:52   ` Peter Zijlstra
  2015-04-06 22:45 ` bsegall
  1 sibling, 1 reply; 10+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-03 12:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: Ben Segall, Roman Gushchin

On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
> Pick_next_task_fair() must be sure that there is at least one runnable
> task before calling put_prev_task(), but put_prev_task() can expire the
> last remains of the cfs quota and throttle all currently runnable tasks.
> As a result pick_next_task_fair() cannot find the next task and crashes.

Kernel crash looks like this:

<1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
<1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.719567] PGD 0
<4>[50288.719578] Oops: 0000 [#1] SMP
<4>[50288.719594] Modules linked in: vhost_net macvtap macvlan vhost 8021q mrp garp ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc netconsole configfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw ttm gf128mul drm_kms_helper drm glue_helper aes_x86_64 i2c_algo_bit sysimgblt sysfillrect i2c_core sb_edac edac_core syscopyarea microcode ipmi_si ipmi_msghandler lpc_ich ioatdma dca mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp e1000e ptp pps_core ahci libahci raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath
<4>[50288.719956]  linear
<4>[50288.719964] CPU: 27 PID: 11505 Comm: kvm Not tainted 3.18.10-7 #7
<4>[50288.719987] Hardware name:
<4>[50288.720015] task: ffff880036acbaa0 ti: ffff8808445f8000 task.ti: ffff8808445f8000
<4>[50288.720041] RIP: 0010:[<ffffffff81097b8c>]  [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.720072] RSP: 0018:ffff8808445fbbb8  EFLAGS: 00010086
<4>[50288.720091] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000bcb8
<4>[50288.720116] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88107fd72af0
<4>[50288.720141] RBP: ffff8808445fbbd8 R08: 0000000000000000 R09: 0000000000000001
<4>[50288.720165] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
<4>[50288.720190] R13: 0000000000000000 R14: ffff880b6f250030 R15: ffff88107fd72af0
<4>[50288.720214] FS:  00007f55467fc700(0000) GS:ffff88107fd60000(0000) knlGS:ffff8802175e0000
<4>[50288.720242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[50288.720262] CR2: 0000000000000038 CR3: 0000000324ede000 CR4: 00000000000427e0
<4>[50288.720287] Stack:
<4>[50288.720296]  ffff88107fd72a80 ffff88107fd72a80 0000000000000000 0000000000000000
<4>[50288.720327]  ffff8808445fbc68 ffffffff8109ead8 ffff880800000000 ffffffffa1438990
<4>[50288.720357]  ffff880b6f250000 0000000000000000 0000000000012a80 ffff880036acbaa0
<4>[50288.720388] Call Trace:
<4>[50288.720402]  [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0
<4>[50288.720429]  [<ffffffffa1438990>] ? vmx_fpu_activate.part.63+0x90/0xb0 [kvm_intel]
<4>[50288.720457]  [<ffffffff81096b95>] ? sched_clock_cpu+0x85/0xc0
<4>[50288.720479]  [<ffffffff816b5b99>] __schedule+0xf9/0x7d0
<4>[50288.720500]  [<ffffffff816bb210>] ? reboot_interrupt+0x80/0x80
<4>[50288.720522]  [<ffffffff816b630a>] _cond_resched+0x2a/0x40
<4>[50288.720549]  [<ffffffffa03dd8c5>] __vcpu_run+0xd35/0xf30 [kvm]
<4>[50288.720573]  [<ffffffff81075fc7>] ? __set_task_blocked+0x37/0x80
<4>[50288.720595]  [<ffffffff8109387e>] ? try_to_wake_up+0x21e/0x360
<4>[50288.720622]  [<ffffffffa03ddb65>] kvm_arch_vcpu_ioctl_run+0xa5/0x220 [kvm]
<4>[50288.720650]  [<ffffffffa03c48b2>] kvm_vcpu_ioctl+0x2c2/0x620 [kvm]
<4>[50288.720675]  [<ffffffff811c01c6>] do_vfs_ioctl+0x86/0x4f0
<4>[50288.720697]  [<ffffffff810d14a2>] ? SyS_futex+0x142/0x1a0
<4>[50288.720717]  [<ffffffff811c06c1>] SyS_ioctl+0x91/0xb0
<4>[50288.720737]  [<ffffffff816ba489>] system_call_fastpath+0x12/0x17
<4>[50288.720758] Code: c7 47 60 00 00 00 00 45 31 c0 e9 0c ff ff ff 66 66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 <44> 8b 4e 38 49 89 fc 45 85 c9 74 17 4c 8d 6e 10 4c 39 6f 30 74
<1>[50288.722636] RIP  [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.723533]  RSP <ffff8808445fbbb8>
<4>[50288.724406] CR2: 0000000000000038

In pick_next_task_fair() cfs_rq->nr_running was non-zero, but after
put_prev_task(rq, prev) the kernel could not find any task to schedule next.

It crashes from time to time on an unusual libvirt/kvm setup where
cfs_quota is set at two levels: at the parent cgroup which contains kvm,
and at the per-vcpu child cgroups.
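[Editorial note] For context, a two-level cgroup-v1 quota setup of the kind described might look like the following. The paths and numbers are illustrative, not taken from the report; only the `cpu.cfs_quota_us`/`cpu.cfs_period_us` interface files are the real CFS bandwidth knobs.

```shell
# Hypothetical reproduction sketch: quota on the parent "kvm" group and
# again on a per-vcpu child group (requires root and the cgroup-v1 cpu
# controller mounted at /sys/fs/cgroup/cpu).
mkdir -p /sys/fs/cgroup/cpu/kvm/vcpu0
echo 100000 > /sys/fs/cgroup/cpu/kvm/cpu.cfs_period_us
echo 200000 > /sys/fs/cgroup/cpu/kvm/cpu.cfs_quota_us        # parent: 2 CPUs worth
echo  50000 > /sys/fs/cgroup/cpu/kvm/vcpu0/cpu.cfs_quota_us  # child: half a CPU
```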

This patch isn't verified yet, but I haven't found any other possible
reason for this crash.

>
> This patch leaves 1 in ->runtime_remaining when the current assignment
> expires and tries to refill it right after that. In the worst case the
> task will be scheduled once and throttled at the end of the slice.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>   kernel/sched/fair.c |   19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..91785d077db4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   {
>   	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> +	/* nothing to expire */
> +	if (cfs_rq->runtime_remaining <= 0)
>   		return;
>
> -	if (cfs_rq->runtime_remaining < 0)
> +	/* if the deadline is ahead of our clock, nothing to do */
> +	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
>   		return;
>
>   	/*
> @@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   		/* extend local deadline, drift is bounded above by 2 ticks */
>   		cfs_rq->runtime_expires += TICK_NSEC;
>   	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> +		/*
> +		 * Global deadline is ahead, expiration has passed.
> +		 *
> +		 * Do not expire runtime completely. Otherwise put_prev_task()
> +		 * can throttle all tasks when we already checked nr_running or
> +		 * put_prev_entity() can throttle already chosen next entity.
> +		 */
> +		cfs_rq->runtime_remaining = 1;
>   	}
>   }
>
> @@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>   	cfs_rq->runtime_remaining -= delta_exec;
>   	expire_cfs_rq_runtime(cfs_rq);
>
> -	if (likely(cfs_rq->runtime_remaining > 0))
> +	if (likely(cfs_rq->runtime_remaining > 1))
>   		return;
>
>   	/*
>


-- 
Konstantin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-03 12:41 [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() Konstantin Khlebnikov
  2015-04-03 12:51 ` Konstantin Khlebnikov
@ 2015-04-06 22:45 ` bsegall
  2015-04-07 15:53   ` Konstantin Khlebnikov
  2015-06-07 17:47   ` [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair() tip-bot for Ben Segall
  1 sibling, 2 replies; 10+ messages in thread
From: bsegall @ 2015-04-06 22:45 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Roman Gushchin, pjt

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> Pick_next_task_fair() must be sure that there is at least one runnable
> task before calling put_prev_task(), but put_prev_task() can expire the
> last remains of the cfs quota and throttle all currently runnable tasks.
> As a result pick_next_task_fair() cannot find the next task and crashes.
>
> This patch leaves 1 in ->runtime_remaining when the current assignment
> expires and tries to refill it right after that. In the worst case the
> task will be scheduled once and throttled at the end of the slice.
>

I don't think expire_cfs_rq_runtime is the problem. What I believe
happens is this:

/prev/some_task is running, calls schedule() with nr_running == 2.
pick_next's first do/while loop does update_curr(/) and picks /next, and
the next iteration just sees check_cfs_rq_runtime(/next), and thus does
goto simple. However, there is now only /prev/some_task runnable, and it
hasn't checked the entire prev hierarchy for throttling, thus leading to
the crash.

This would require that check_cfs_rq_runtime(/next) return true despite
being on_rq though, which iirc is not supposed to happen (note that we
do not call update_curr(/next), and it would do nothing if we did,
because /next isn't part of the current thread's hierarchy). However,
this /can/ happen if runtime has just been (re)enabled on /next, because
tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1.

The idea was that each rq would grab runtime when they were scheduled
(pick_next_task_fair didn't ever look at throttling info), so this was
fine with the old code, but is a problem now. I think it would be
sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably
more precise option would be to only check_cfs_rq_runtime if
cfs_rq->curr is set, but the code is slightly less pretty.

Could you check this patch to see if it works (or the trivial
tg_set_bandwidth runtime_remaining = 1 patch)?

---8<----->8---

From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001
From: Ben Segall <bsegall@google.com>
Date: Mon, 6 Apr 2015 15:28:10 -0700
Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair

The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed
to trigger when cfs_rq is still an ancestor of prev. However, it was able to
trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth
set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global
pool.

Fix this by only calling check_cfs_rq_runtime if we are still in prev's
ancestry, as evidenced by cfs_rq->curr.

Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef..5cb52e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5038,18 +5038,21 @@ again:
 		 * entity, update_curr() will update its vruntime, otherwise
 		 * forget we've ever seen it.
 		 */
-		if (curr && curr->on_rq)
-			update_curr(cfs_rq);
-		else
-			curr = NULL;
+		if (curr) {
+			if (curr->on_rq)
+				update_curr(cfs_rq);
+			else
+				curr = NULL;
 
-		/*
-		 * This call to check_cfs_rq_runtime() will do the throttle and
-		 * dequeue its entity in the parent(s). Therefore the 'simple'
-		 * nr_running test will indeed be correct.
-		 */
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto simple;
+			/*
+			 * This call to check_cfs_rq_runtime() will do the
+			 * throttle and dequeue its entity in the parent(s).
+			 * Therefore the 'simple' nr_running test will indeed
+			 * be correct.
+			 */
+			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+				goto simple;
+		}
 
 		se = pick_next_entity(cfs_rq, curr);
 		cfs_rq = group_cfs_rq(se);
-- 
2.2.0.rc0.207.ga3a616c


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-03 12:51 ` Konstantin Khlebnikov
@ 2015-04-07 12:52   ` Peter Zijlstra
  2015-04-07 13:47     ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2015-04-07 12:52 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On Fri, Apr 03, 2015 at 03:51:17PM +0300, Konstantin Khlebnikov wrote:
> On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
> >Pick_next_task_fair() must be sure that there is at least one runnable
> >task before calling put_prev_task(), but put_prev_task() can expire the
> >last remains of the cfs quota and throttle all currently runnable tasks.
> >As a result pick_next_task_fair() cannot find the next task and crashes.
> 
> Kernel crash looks like this:
> 
> <1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> <1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80

> <4>[50288.720388] Call Trace:
> <4>[50288.720402]  [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0
> <4>[50288.720479]  [<ffffffff816b5b99>] __schedule+0xf9/0x7d0

Which set_next_entity() is that? There are 3 in pick_next_task_fair().

I have a vague suspicion it's in the 'simple' code, please verify.

The thinking is that if it was the 'complex' pick_next_entity()
returning NULL we'd have exploded elsewhere, the cfs_rq iteration
would've wandered off into random memory and most likely exploded on
cfs_rq->curr.

Which also suggests the check_cfs_rq_runtime() thing works just
fine: it sent us to the simple code.

> >This patch leaves 1 in ->runtime_remaining when the current assignment
> >expires and tries to refill it right after that. In the worst case the
> >task will be scheduled once and throttled at the end of the slice.

Which is a strange approach. If pick_next_task_fair() is broken, we
should fix that, no?

In any case, it appears to me that: 606dba2e2894 ("sched: Push
put_prev_task() into pick_next_task()") inverted the ->nr_running and
put_prev_task() statements.

If the above set_next_entity() is indeed the simple one, does the below
cure things?

---
 kernel/sched/fair.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..df72d61138a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5176,12 +5176,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 simple:
 	cfs_rq = &rq->cfs;
 #endif
+	put_prev_task(rq, prev);
 
 	if (!cfs_rq->nr_running)
 		goto idle;
 
-	put_prev_task(rq, prev);
-
 	do {
 		se = pick_next_entity(cfs_rq, NULL);
 		set_next_entity(cfs_rq, se);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-07 12:52   ` Peter Zijlstra
@ 2015-04-07 13:47     ` Peter Zijlstra
  2015-04-07 15:12       ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2015-04-07 13:47 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On Tue, Apr 07, 2015 at 02:52:51PM +0200, Peter Zijlstra wrote:
> If the above set_next_entity() is indeed the simple one, does the below
> cure things?
> 
> ---
>  kernel/sched/fair.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26eb7218..df72d61138a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5176,12 +5176,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>  simple:
>  	cfs_rq = &rq->cfs;
>  #endif
> +	put_prev_task(rq, prev);
>  
>  	if (!cfs_rq->nr_running)
>  		goto idle;
>  
> -	put_prev_task(rq, prev);
> -
>  	do {
>  		se = pick_next_entity(cfs_rq, NULL);
>  		set_next_entity(cfs_rq, se);

Bah, that's broken because if we end up going idle pick_next_task_idle()
is going to do put_prev_task() again.

Lemme think a bit more on that.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-07 13:47     ` Peter Zijlstra
@ 2015-04-07 15:12       ` Peter Zijlstra
  2015-04-07 15:32         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2015-04-07 15:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
> Lemme think a bit more on that.

So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
empty queue") something like this would be called for.

---
 kernel/sched/fair.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..1e47f816c976 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 	return p;
 simple:
 	cfs_rq = &rq->cfs;
+
+	if (cfs_bandwidth_used()) {
+		/*
+		 * We may dequeue prev's cfs_rq in put_prev_task().
+		 * So, we update time before cfs_rq->nr_running check.
+		 */
+		if (prev->on_rq)
+			update_curr(cfs_rq);
+
+		se = &prev->se;
+		for_each_sched_entity(se) {
+			cfs_rq = cfs_rq_of(se);
+			check_cfs_rq_runtime(cfs_rq);
+		}
+	}
 #endif
 
 	if (!cfs_rq->nr_running)

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-07 15:12       ` Peter Zijlstra
@ 2015-04-07 15:32         ` Konstantin Khlebnikov
  2015-04-07 15:43           ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-07 15:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On 07.04.2015 18:12, Peter Zijlstra wrote:
> On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
>> Lemme think a bit more on that.
>
> So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
> empty queue") something like this would be called for.
>
> ---
>   kernel/sched/fair.c | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26eb7218..1e47f816c976 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>   	return p;
>   simple:
>   	cfs_rq = &rq->cfs;

We don't need this if prev isn't from fair_sched_class: the
first "goto simple" should go directly to "if (!cfs_rq->nr_running)".

> +
> +	if (cfs_bandwidth_used()) {
> +		/*
> +		 * We may dequeue prev's cfs_rq in put_prev_task().
> +		 * So, we update time before cfs_rq->nr_running check.
> +		 */
> +		if (prev->on_rq)
> +			update_curr(cfs_rq);
> +
> +		se = &prev->se;
> +		for_each_sched_entity(se) {

Probably update_curr() should be inside the loop too?

> +			cfs_rq = cfs_rq_of(se);
> +			check_cfs_rq_runtime(cfs_rq);
> +		}
> +	}
>   #endif
>
>   	if (!cfs_rq->nr_running)
>


-- 
Konstantin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-07 15:32         ` Konstantin Khlebnikov
@ 2015-04-07 15:43           ` Peter Zijlstra
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2015-04-07 15:43 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On Tue, Apr 07, 2015 at 06:32:47PM +0300, Konstantin Khlebnikov wrote:
> On 07.04.2015 18:12, Peter Zijlstra wrote:
> >On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
> >>Lemme think a bit more on that.
> >
> >So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
> >empty queue") something like this would be called for.
> >
> >---
> >  kernel/sched/fair.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >index fdae26eb7218..1e47f816c976 100644
> >--- a/kernel/sched/fair.c
> >+++ b/kernel/sched/fair.c
> >@@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> >  	return p;
> >  simple:
> >  	cfs_rq = &rq->cfs;
> 
> We don't need this if prev isn't from fair_sched_class: the
> first "goto simple" should go directly to "if (!cfs_rq->nr_running)".

Just add that to the test here:

> >+	if (cfs_bandwidth_used()) {
> >+		/*
> >+		 * We may dequeue prev's cfs_rq in put_prev_task().
> >+		 * So, we update time before cfs_rq->nr_running check.
> >+		 */
> >+		if (prev->on_rq)
> >+			update_curr(cfs_rq);
> >+
> >+		se = &prev->se;
> >+		for_each_sched_entity(se) {
> 
> Probably update_curr() should be inside the loop too?

Ah, yes, you're right.

> >+			cfs_rq = cfs_rq_of(se);
> >+			check_cfs_rq_runtime(cfs_rq);
> >+		}
> >+	}
> >  #endif
> >
> >  	if (!cfs_rq->nr_running)
> >
> 
> 
> -- 
> Konstantin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-06 22:45 ` bsegall
@ 2015-04-07 15:53   ` Konstantin Khlebnikov
  2015-06-07 17:47   ` [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair() tip-bot for Ben Segall
  1 sibling, 0 replies; 10+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-07 15:53 UTC (permalink / raw)
  To: bsegall; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Roman Gushchin, pjt

On 07.04.2015 01:45, bsegall@google.com wrote:
> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>
>> Pick_next_task_fair() must be sure that there is at least one runnable
>> task before calling put_prev_task(), but put_prev_task() can expire the
>> last remains of the cfs quota and throttle all currently runnable tasks.
>> As a result pick_next_task_fair() cannot find the next task and crashes.
>>
>> This patch leaves 1 in ->runtime_remaining when the current assignment
>> expires and tries to refill it right after that. In the worst case the
>> task will be scheduled once and throttled at the end of the slice.
>>
>
> I don't think expire_cfs_rq_runtime is the problem. What I believe
> happens is this:
>
> /prev/some_task is running, calls schedule() with nr_running == 2.
> pick_next's first do/while loop does update_curr(/) and picks /next, and
> the next iteration just sees check_cfs_rq_runtime(/next), and thus does
> goto simple. However, there is now only /prev/some_task runnable, and it
> hasn't checked the entire prev hierarchy for throttling, thus leading to
> the crash.
>
> This would require that check_cfs_rq_runtime(/next) return true despite
> being on_rq though, which iirc is not supposed to happen (note that we
> do not call update_curr(/next), and it would do nothing if we did,
> because /next isn't part of the current thread's hierarchy). However,
> this /can/ happen if runtime has just been (re)enabled on /next, because
> tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1.

Yeah, this seems possible too.

> The idea was that each rq would grab runtime when they were scheduled
> (pick_next_task_fair didn't ever look at throttling info), so this was
> fine with the old code, but is a problem now. I think it would be
> sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably
> more precise option would be to only check_cfs_rq_runtime if
> cfs_rq->curr is set, but the code is slightly less pretty.

Probably this code should call something like
account_cfs_rq_runtime(cfs_rq, 0);

>
> Could you check this patch to see if it works (or the trivial
> tg_set_bandwidth runtime_remaining = 1 patch)?

I'm not sure that I'll see this bug again: we're replacing this setup 
with something different.

>
> ---8<----->8---
>
>  From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001
> From: Ben Segall <bsegall@google.com>
> Date: Mon, 6 Apr 2015 15:28:10 -0700
> Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair
>
> The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed
> to trigger when cfs_rq is still an ancestor of prev. However, it was able to
> trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth
> set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global
> pool.
>
> Fix this by only calling check_cfs_rq_runtime if we are still in prev's
> ancestry, as evidenced by cfs_rq->curr.
>
> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Signed-off-by: Ben Segall <bsegall@google.com>
> ---
>   kernel/sched/fair.c | 25 ++++++++++++++-----------
>   1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee595ef..5cb52e9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5038,18 +5038,21 @@ again:
>   		 * entity, update_curr() will update its vruntime, otherwise
>   		 * forget we've ever seen it.
>   		 */
> -		if (curr && curr->on_rq)
> -			update_curr(cfs_rq);
> -		else
> -			curr = NULL;
> +		if (curr) {
> +			if (curr->on_rq)
> +				update_curr(cfs_rq);
> +			else
> +				curr = NULL;
>
> -		/*
> -		 * This call to check_cfs_rq_runtime() will do the throttle and
> -		 * dequeue its entity in the parent(s). Therefore the 'simple'
> -		 * nr_running test will indeed be correct.
> -		 */
> -		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> -			goto simple;
> +			/*
> +			 * This call to check_cfs_rq_runtime() will do the
> +			 * throttle and dequeue its entity in the parent(s).
> +			 * Therefore the 'simple' nr_running test will indeed
> +			 * be correct.
> +			 */
> +			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> +				goto simple;
> +		}
>
>   		se = pick_next_entity(cfs_rq, curr);
>   		cfs_rq = group_cfs_rq(se);
>


-- 
Konstantin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair()
  2015-04-06 22:45 ` bsegall
  2015-04-07 15:53   ` Konstantin Khlebnikov
@ 2015-06-07 17:47   ` tip-bot for Ben Segall
  1 sibling, 0 replies; 10+ messages in thread
From: tip-bot for Ben Segall @ 2015-06-07 17:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, torvalds, akpm, bsegall, mnaser, khlebnikov, peterz, hpa,
	klamm, tglx, linux-kernel

Commit-ID:  54d27365cae88fbcc853b391dcd561e71acb81fa
Gitweb:     http://git.kernel.org/tip/54d27365cae88fbcc853b391dcd561e71acb81fa
Author:     Ben Segall <bsegall@google.com>
AuthorDate: Mon, 6 Apr 2015 15:28:10 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 7 Jun 2015 15:57:44 +0200

sched/fair: Prevent throttling in early pick_next_task_fair()

The optimized task selection logic optimistically selects a new task
to run without first doing a full put_prev_task(). This is so that we
can avoid a put/set on the common ancestors of the old and new task.

Similarly, we should only call check_cfs_rq_runtime() to throttle
eligible groups if they're part of the common ancestry, otherwise it
is possible to end up with no eligible task in the simple task
selection.

Imagine:
		/root
	/prev		/next
	/A		/B

If our optimistic selection ends up throttling /next, we goto simple
and our put_prev_task() ends up throttling /prev, after which we're
going to bug out in set_next_entity() because there aren't any tasks
left.

Avoid this scenario by only throttling common ancestors.

Reported-by: Mohammed Naser <mnaser@vexxhost.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Ben Segall <bsegall@google.com>
[ munged Changelog ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Fixes: 678d5718d8d0 ("sched/fair: Optimize cgroup pick_next_task_fair()")
Link: http://lkml.kernel.org/r/xm26wq1oswoq.fsf@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d4632f..84ada05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5322,18 +5322,21 @@ again:
 		 * entity, update_curr() will update its vruntime, otherwise
 		 * forget we've ever seen it.
 		 */
-		if (curr && curr->on_rq)
-			update_curr(cfs_rq);
-		else
-			curr = NULL;
+		if (curr) {
+			if (curr->on_rq)
+				update_curr(cfs_rq);
+			else
+				curr = NULL;
 
-		/*
-		 * This call to check_cfs_rq_runtime() will do the throttle and
-		 * dequeue its entity in the parent(s). Therefore the 'simple'
-		 * nr_running test will indeed be correct.
-		 */
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto simple;
+			/*
+			 * This call to check_cfs_rq_runtime() will do the
+			 * throttle and dequeue its entity in the parent(s).
+			 * Therefore the 'simple' nr_running test will indeed
+			 * be correct.
+			 */
+			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+				goto simple;
+		}
 
 		se = pick_next_entity(cfs_rq, curr);
 		cfs_rq = group_cfs_rq(se);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-06-07 17:48 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-03 12:41 [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() Konstantin Khlebnikov
2015-04-03 12:51 ` Konstantin Khlebnikov
2015-04-07 12:52   ` Peter Zijlstra
2015-04-07 13:47     ` Peter Zijlstra
2015-04-07 15:12       ` Peter Zijlstra
2015-04-07 15:32         ` Konstantin Khlebnikov
2015-04-07 15:43           ` Peter Zijlstra
2015-04-06 22:45 ` bsegall
2015-04-07 15:53   ` Konstantin Khlebnikov
2015-06-07 17:47   ` [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair() tip-bot for Ben Segall
