* [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
@ 2015-04-03 12:41 Konstantin Khlebnikov
  2015-04-03 12:51 ` Konstantin Khlebnikov
  2015-04-06 22:45 ` bsegall
  0 siblings, 2 replies; 10+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-03 12:41 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: Ben Segall, Roman Gushchin

Pick_next_task_fair() must be sure that there is at least one runnable
task before calling put_prev_task(), but put_prev_task() can expire the
last remains of the cfs quota and throttle all currently runnable
tasks. As a result, pick_next_task_fair() cannot find a next task and
crashes.

This patch leaves 1 in ->runtime_remaining when the current assignment
expires and tries to refill it right after that. In the worst case the
task will be scheduled once and throttled at the end of the slice.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 kernel/sched/fair.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..91785d077db4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
+	/* nothing to expire */
+	if (cfs_rq->runtime_remaining <= 0)
 		return;
 
-	if (cfs_rq->runtime_remaining < 0)
+	/* if the deadline is ahead of our clock, nothing to do */
+	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
 		return;
 
 	/*
@@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
+		/*
+		 * Global deadline is ahead, expiration has passed.
+		 *
+		 * Do not expire runtime completely. Otherwise put_prev_task()
+		 * can throttle all tasks when we already checked nr_running or
+		 * put_prev_entity() can throttle already chosen next entity.
+		 */
+		cfs_rq->runtime_remaining = 1;
 	}
 }
 
@@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	cfs_rq->runtime_remaining -= delta_exec;
 	expire_cfs_rq_runtime(cfs_rq);
 
-	if (likely(cfs_rq->runtime_remaining > 0))
+	if (likely(cfs_rq->runtime_remaining > 1))
 		return;
 
 	/*

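The semantics of the hunks above can be exercised in a small stand-alone model. This is only a sketch with hypothetical `toy_` names and simplified types; it models just the branch the patch changes (the local-deadline-extension path is omitted):

```c
#include <assert.h>

typedef long long s64;

/* Simplified stand-in for the two fields the patch touches. */
struct toy_cfs_rq {
	s64 runtime_remaining;
	s64 runtime_expires;
};

/* 'now' stands in for rq_clock(rq_of(cfs_rq)). */
static void toy_expire(struct toy_cfs_rq *cfs_rq, s64 now, int patched)
{
	/* nothing to expire */
	if (cfs_rq->runtime_remaining <= 0)
		return;

	/* deadline is ahead of our clock: nothing to do */
	if (now - cfs_rq->runtime_expires < 0)
		return;

	/*
	 * Expiration has passed. The patched variant leaves 1ns of
	 * runtime so that a later put_prev_task() cannot throttle
	 * every runnable task after nr_running was already checked.
	 */
	cfs_rq->runtime_remaining = patched ? 1 : 0;
}
```

With the leftover 1ns, __account_cfs_rq_runtime() must compare against > 1 rather than > 0, which is why the second hunk adjusts that threshold.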

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-03 12:41 [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() Konstantin Khlebnikov
@ 2015-04-03 12:51 ` Konstantin Khlebnikov
  2015-04-07 12:52   ` Peter Zijlstra
  2015-04-06 22:45 ` bsegall
  1 sibling, 1 reply; 10+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-03 12:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: Ben Segall, Roman Gushchin

On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
> Pick_next_task_fair() must be sure that there is at least one runnable
> task before calling put_prev_task(), but put_prev_task() can expire the
> last remains of the cfs quota and throttle all currently runnable
> tasks. As a result, pick_next_task_fair() cannot find a next task and
> crashes.

Kernel crash looks like this:

<1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
<1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.719567] PGD 0
<4>[50288.719578] Oops: 0000 [#1] SMP
<4>[50288.719594] Modules linked in: vhost_net macvtap macvlan vhost 8021q mrp garp ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc netconsole configfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw ttm gf128mul drm_kms_helper drm glue_helper aes_x86_64 i2c_algo_bit sysimgblt sysfillrect i2c_core sb_edac edac_core syscopyarea microcode ipmi_si ipmi_msghandler lpc_ich ioatdma dca mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp e1000e ptp pps_core ahci libahci raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath
<4>[50288.719956]  linear
<4>[50288.719964] CPU: 27 PID: 11505 Comm: kvm Not tainted 3.18.10-7 #7
<4>[50288.719987] Hardware name:
<4>[50288.720015] task: ffff880036acbaa0 ti: ffff8808445f8000 task.ti: ffff8808445f8000
<4>[50288.720041] RIP: 0010:[<ffffffff81097b8c>]  [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.720072] RSP: 0018:ffff8808445fbbb8  EFLAGS: 00010086
<4>[50288.720091] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000bcb8
<4>[50288.720116] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88107fd72af0
<4>[50288.720141] RBP: ffff8808445fbbd8 R08: 0000000000000000 R09: 0000000000000001
<4>[50288.720165] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
<4>[50288.720190] R13: 0000000000000000 R14: ffff880b6f250030 R15: ffff88107fd72af0
<4>[50288.720214] FS:  00007f55467fc700(0000) GS:ffff88107fd60000(0000) knlGS:ffff8802175e0000
<4>[50288.720242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[50288.720262] CR2: 0000000000000038 CR3: 0000000324ede000 CR4: 00000000000427e0
<4>[50288.720287] Stack:
<4>[50288.720296]  ffff88107fd72a80 ffff88107fd72a80 0000000000000000 0000000000000000
<4>[50288.720327]  ffff8808445fbc68 ffffffff8109ead8 ffff880800000000 ffffffffa1438990
<4>[50288.720357]  ffff880b6f250000 0000000000000000 0000000000012a80 ffff880036acbaa0
<4>[50288.720388] Call Trace:
<4>[50288.720402]  [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0
<4>[50288.720429]  [<ffffffffa1438990>] ? vmx_fpu_activate.part.63+0x90/0xb0 [kvm_intel]
<4>[50288.720457]  [<ffffffff81096b95>] ? sched_clock_cpu+0x85/0xc0
<4>[50288.720479]  [<ffffffff816b5b99>] __schedule+0xf9/0x7d0
<4>[50288.720500]  [<ffffffff816bb210>] ? reboot_interrupt+0x80/0x80
<4>[50288.720522]  [<ffffffff816b630a>] _cond_resched+0x2a/0x40
<4>[50288.720549]  [<ffffffffa03dd8c5>] __vcpu_run+0xd35/0xf30 [kvm]
<4>[50288.720573]  [<ffffffff81075fc7>] ? __set_task_blocked+0x37/0x80
<4>[50288.720595]  [<ffffffff8109387e>] ? try_to_wake_up+0x21e/0x360
<4>[50288.720622]  [<ffffffffa03ddb65>] kvm_arch_vcpu_ioctl_run+0xa5/0x220 [kvm]
<4>[50288.720650]  [<ffffffffa03c48b2>] kvm_vcpu_ioctl+0x2c2/0x620 [kvm]
<4>[50288.720675]  [<ffffffff811c01c6>] do_vfs_ioctl+0x86/0x4f0
<4>[50288.720697]  [<ffffffff810d14a2>] ? SyS_futex+0x142/0x1a0
<4>[50288.720717]  [<ffffffff811c06c1>] SyS_ioctl+0x91/0xb0
<4>[50288.720737]  [<ffffffff816ba489>] system_call_fastpath+0x12/0x17
<4>[50288.720758] Code: c7 47 60 00 00 00 00 45 31 c0 e9 0c ff ff ff 66 66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 <44> 8b 4e 38 49 89 fc 45 85 c9 74 17 4c 8d 6e 10 4c 39 6f 30 74
<1>[50288.722636] RIP  [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.723533]  RSP <ffff8808445fbbb8>
<4>[50288.724406] CR2: 0000000000000038

In pick_next_task_fair() cfs_rq->nr_running was non-zero, but after
put_prev_task(rq, prev) the kernel cannot find any task to schedule
next.

It crashes from time to time on a strange libvirt/kvm setup where
cfs_quota is set at two levels: at the parent cgroup which contains
kvm, and at the per-vcpu child cgroups.

This patch isn't verified yet, but I haven't found any other possible
reason for the crash.

>
> This patch leaves 1 in ->runtime_remaining when the current assignment
> expires and tries to refill it right after that. In the worst case the
> task will be scheduled once and throttled at the end of the slice.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>   kernel/sched/fair.c |   19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..91785d077db4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   {
>   	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> +	/* nothing to expire */
> +	if (cfs_rq->runtime_remaining <= 0)
>   		return;
>
> -	if (cfs_rq->runtime_remaining < 0)
> +	/* if the deadline is ahead of our clock, nothing to do */
> +	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
>   		return;
>
>   	/*
> @@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   		/* extend local deadline, drift is bounded above by 2 ticks */
>   		cfs_rq->runtime_expires += TICK_NSEC;
>   	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> +		/*
> +		 * Global deadline is ahead, expiration has passed.
> +		 *
> +		 * Do not expire runtime completely. Otherwise put_prev_task()
> +		 * can throttle all tasks when we already checked nr_running or
> +		 * put_prev_entity() can throttle already chosen next entity.
> +		 */
> +		cfs_rq->runtime_remaining = 1;
>   	}
>   }
>
> @@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>   	cfs_rq->runtime_remaining -= delta_exec;
>   	expire_cfs_rq_runtime(cfs_rq);
>
> -	if (likely(cfs_rq->runtime_remaining > 0))
> +	if (likely(cfs_rq->runtime_remaining > 1))
>   		return;
>
>   	/*
>


-- 
Konstantin


* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-03 12:41 [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() Konstantin Khlebnikov
  2015-04-03 12:51 ` Konstantin Khlebnikov
@ 2015-04-06 22:45 ` bsegall
  2015-04-07 15:53   ` Konstantin Khlebnikov
  2015-06-07 17:47   ` [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair() tip-bot for Ben Segall
  1 sibling, 2 replies; 10+ messages in thread
From: bsegall @ 2015-04-06 22:45 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Roman Gushchin, pjt

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> Pick_next_task_fair() must be sure that there is at least one runnable
> task before calling put_prev_task(), but put_prev_task() can expire the
> last remains of the cfs quota and throttle all currently runnable
> tasks. As a result, pick_next_task_fair() cannot find a next task and
> crashes.
>
> This patch leaves 1 in ->runtime_remaining when the current assignment
> expires and tries to refill it right after that. In the worst case the
> task will be scheduled once and throttled at the end of the slice.
>

I don't think expire_cfs_rq_runtime is the problem. What I believe
happens is this:

/prev/some_task is running, calls schedule() with nr_running == 2.
pick_next's first do/while loop does update_curr(/) and picks /next, and
the next iteration just sees check_cfs_rq_runtime(/next), and thus does
goto simple. However, there is now only /prev/some_task runnable, and it
hasn't checked the entire prev hierarchy for throttling, thus leading to
the crash.

This would require that check_cfs_rq_runtime(/next) return true despite
being on_rq though, which iirc is not supposed to happen (note that we
do not call update_curr(/next), and it would do nothing if we did,
because /next isn't part of the current thread's hierarchy). However,
this /can/ happen if runtime has just been (re)enabled on /next, because
tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1.

The idea was that each rq would grab runtime when it was scheduled
(pick_next_task_fair didn't ever look at throttling info), so this was
fine with the old code, but is a problem now. I think it would be
sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably
more precise option would be to only check_cfs_rq_runtime if
cfs_rq->curr is set, but the code is slightly less pretty.
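The condition described above can be sketched as a toy model (hypothetical `toy_` names, not the kernel code): a group whose bandwidth was just (re)enabled starts with runtime_remaining == 0, which the throttle check treats as "out of runtime" even though the group never consumed any quota.

```c
#include <assert.h>

struct toy_cfs_rq {
	int runtime_enabled;
	long long runtime_remaining;
};

/* Nonzero means "throttle", i.e. the 'goto simple' path is taken. */
static int toy_check_cfs_rq_runtime(const struct toy_cfs_rq *cfs_rq)
{
	if (!cfs_rq->runtime_enabled)
		return 0;
	return cfs_rq->runtime_remaining <= 0;
}

/* Models tg_set_cfs_bandwidth(); start_at_one is the trivial fix. */
static void toy_set_cfs_bandwidth(struct toy_cfs_rq *cfs_rq, int start_at_one)
{
	cfs_rq->runtime_enabled = 1;
	cfs_rq->runtime_remaining = start_at_one ? 1 : 0;
}
```

Starting at 1 instead of 0 makes a freshly toggled group look as if it still holds runtime until it actually accounts some execution time.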

Could you check this patch to see if it works (or the trivial
tg_set_bandwidth runtime_remaining = 1 patch)?

---8<----->8---

From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001
From: Ben Segall <bsegall@google.com>
Date: Mon, 6 Apr 2015 15:28:10 -0700
Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair

The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed
to trigger when cfs_rq is still an ancestor of prev. However, it was able to
trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth
set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global
pool.

Fix this by only calling check_cfs_rq_runtime if we are still in prev's
ancestry, as evidenced by cfs_rq->curr.

Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef..5cb52e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5038,18 +5038,21 @@ again:
 		 * entity, update_curr() will update its vruntime, otherwise
 		 * forget we've ever seen it.
 		 */
-		if (curr && curr->on_rq)
-			update_curr(cfs_rq);
-		else
-			curr = NULL;
+		if (curr) {
+			if (curr->on_rq)
+				update_curr(cfs_rq);
+			else
+				curr = NULL;
 
-		/*
-		 * This call to check_cfs_rq_runtime() will do the throttle and
-		 * dequeue its entity in the parent(s). Therefore the 'simple'
-		 * nr_running test will indeed be correct.
-		 */
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto simple;
+			/*
+			 * This call to check_cfs_rq_runtime() will do the
+			 * throttle and dequeue its entity in the parent(s).
+			 * Therefore the 'simple' nr_running test will indeed
+			 * be correct.
+			 */
+			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+				goto simple;
+		}
 
 		se = pick_next_entity(cfs_rq, curr);
 		cfs_rq = group_cfs_rq(se);
-- 
2.2.0.rc0.207.ga3a616c



* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-03 12:51 ` Konstantin Khlebnikov
@ 2015-04-07 12:52   ` Peter Zijlstra
  2015-04-07 13:47     ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2015-04-07 12:52 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On Fri, Apr 03, 2015 at 03:51:17PM +0300, Konstantin Khlebnikov wrote:
> On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
> >Pick_next_task_fair() must be sure that there is at least one runnable
> >task before calling put_prev_task(), but put_prev_task() can expire the
> >last remains of the cfs quota and throttle all currently runnable
> >tasks. As a result, pick_next_task_fair() cannot find a next task and
> >crashes.
> 
> Kernel crash looks like this:
> 
> <1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> <1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80

> <4>[50288.720388] Call Trace:
> <4>[50288.720402]  [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0
> <4>[50288.720479]  [<ffffffff816b5b99>] __schedule+0xf9/0x7d0

Which set_next_entity() is that? There are 3 in pick_next_task_fair().

I have a vague suspicion it's in the 'simple' code, please verify.

The thinking is that if it was the 'complex' pick_next_entity()
returning NULL we'd have exploded elsewhere, the cfs_rq iteration
would've wandered off into random memory and most likely exploded on
cfs_rq->curr.

Which too would suggest the check_cfs_rq_runtime() thing works just
fine: it sent us to the simple code.

> >This patch leaves 1 in ->runtime_remaining when the current assignment
> >expires and tries to refill it right after that. In the worst case the
> >task will be scheduled once and throttled at the end of the slice.

Which is a strange approach. If pick_next_task_fair() is broken, we
should fix that, no?

In any case, it appears to me that: 606dba2e2894 ("sched: Push
put_prev_task() into pick_next_task()") inverted the ->nr_running and
put_prev_task() statements.

If the above set_next_entity() is indeed the simple one, does the below
cure things?

---
 kernel/sched/fair.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..df72d61138a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5176,12 +5176,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 simple:
 	cfs_rq = &rq->cfs;
 #endif
+	put_prev_task(rq, prev);
 
 	if (!cfs_rq->nr_running)
 		goto idle;
 
-	put_prev_task(rq, prev);
-
 	do {
 		se = pick_next_entity(cfs_rq, NULL);
 		set_next_entity(cfs_rq, se);


* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-07 12:52   ` Peter Zijlstra
@ 2015-04-07 13:47     ` Peter Zijlstra
  2015-04-07 15:12       ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2015-04-07 13:47 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On Tue, Apr 07, 2015 at 02:52:51PM +0200, Peter Zijlstra wrote:
> If the above set_next_entity() is indeed the simple one, does the below
> cure things?
> 
> ---
>  kernel/sched/fair.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26eb7218..df72d61138a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5176,12 +5176,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>  simple:
>  	cfs_rq = &rq->cfs;
>  #endif
> +	put_prev_task(rq, prev);
>  
>  	if (!cfs_rq->nr_running)
>  		goto idle;
>  
> -	put_prev_task(rq, prev);
> -
>  	do {
>  		se = pick_next_entity(cfs_rq, NULL);
>  		set_next_entity(cfs_rq, se);

Bah, that's broken because if we end up going idle pick_next_task_idle()
is going to do put_prev_task() again.

Lemme think a bit more on that.


* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-07 13:47     ` Peter Zijlstra
@ 2015-04-07 15:12       ` Peter Zijlstra
  2015-04-07 15:32         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2015-04-07 15:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
> Lemme think a bit more on that.

So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
empty queue") something like this would be called for.

---
 kernel/sched/fair.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..1e47f816c976 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 	return p;
 simple:
 	cfs_rq = &rq->cfs;
+
+	if (cfs_bandwidth_used()) {
+		/*
+		 * We may dequeue prev's cfs_rq in put_prev_task().
+		 * So, we update time before cfs_rq->nr_running check.
+		 */
+		if (prev->on_rq)
+			update_curr(cfs_rq);
+
+		se = &prev->se;
+		for_each_sched_entity(se) {
+			cfs_rq = cfs_rq_of(se);
+			check_cfs_rq_runtime(cfs_rq);
+		}
+	}
 #endif
 
 	if (!cfs_rq->nr_running)


* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-07 15:12       ` Peter Zijlstra
@ 2015-04-07 15:32         ` Konstantin Khlebnikov
  2015-04-07 15:43           ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-07 15:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On 07.04.2015 18:12, Peter Zijlstra wrote:
> On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
>> Lemme think a bit more on that.
>
> So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
> empty queue") something like this would be called for.
>
> ---
>   kernel/sched/fair.c | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26eb7218..1e47f816c976 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>   	return p;
>   simple:
>   	cfs_rq = &rq->cfs;

We don't need this if prev isn't from fair_sched_class:
first "goto simple" should go directly to "if (!cfs_rq->nr_running)".

> +
> +	if (cfs_bandwidth_used()) {
> +		/*
> +		 * We may dequeue prev's cfs_rq in put_prev_task().
> +		 * So, we update time before cfs_rq->nr_running check.
> +		 */
> +		if (prev->on_rq)
> +			update_curr(cfs_rq);
> +
> +		se = &prev->se;
> +		for_each_sched_entity(se) {

Probably update_curr() should be inside the loop too?

> +			cfs_rq = cfs_rq_of(se);
> +			check_cfs_rq_runtime(cfs_rq);
> +		}
> +	}
>   #endif
>
>   	if (!cfs_rq->nr_running)
>


-- 
Konstantin


* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-07 15:32         ` Konstantin Khlebnikov
@ 2015-04-07 15:43           ` Peter Zijlstra
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2015-04-07 15:43 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Roman Gushchin

On Tue, Apr 07, 2015 at 06:32:47PM +0300, Konstantin Khlebnikov wrote:
> On 07.04.2015 18:12, Peter Zijlstra wrote:
> >On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
> >>Lemme think a bit more on that.
> >
> >So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
> >empty queue") something like this would be called for.
> >
> >---
> >  kernel/sched/fair.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >index fdae26eb7218..1e47f816c976 100644
> >--- a/kernel/sched/fair.c
> >+++ b/kernel/sched/fair.c
> >@@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> >  	return p;
> >  simple:
> >  	cfs_rq = &rq->cfs;
> 
> We don't need this if prev isn't from fair_sched_class:
> first "goto simple" should go directly to "if (!cfs_rq->nr_running)".

Just add that to the test here:

> >+	if (cfs_bandwidth_used()) {
> >+		/*
> >+		 * We may dequeue prev's cfs_rq in put_prev_task().
> >+		 * So, we update time before cfs_rq->nr_running check.
> >+		 */
> >+		if (prev->on_rq)
> >+			update_curr(cfs_rq);
> >+
> >+		se = &prev->se;
> >+		for_each_sched_entity(se) {
> 
> Probably update_curr() should be inside the loop too?

Ah, yes, you're right.

> >+			cfs_rq = cfs_rq_of(se);
> >+			check_cfs_rq_runtime(cfs_rq);
> >+		}
> >+	}
> >  #endif
> >
> >  	if (!cfs_rq->nr_running)
> >
> 
> 
> -- 
> Konstantin


* Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
  2015-04-06 22:45 ` bsegall
@ 2015-04-07 15:53   ` Konstantin Khlebnikov
  2015-06-07 17:47   ` [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair() tip-bot for Ben Segall
  1 sibling, 0 replies; 10+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-07 15:53 UTC (permalink / raw)
  To: bsegall; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Roman Gushchin, pjt

On 07.04.2015 01:45, bsegall@google.com wrote:
> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>
>> Pick_next_task_fair() must be sure that there is at least one runnable
>> task before calling put_prev_task(), but put_prev_task() can expire the
>> last remains of the cfs quota and throttle all currently runnable
>> tasks. As a result, pick_next_task_fair() cannot find a next task and
>> crashes.
>>
>> This patch leaves 1 in ->runtime_remaining when the current assignment
>> expires and tries to refill it right after that. In the worst case the
>> task will be scheduled once and throttled at the end of the slice.
>>
>
> I don't think expire_cfs_rq_runtime is the problem. What I believe
> happens is this:
>
> /prev/some_task is running, calls schedule() with nr_running == 2.
> pick_next's first do/while loop does update_curr(/) and picks /next, and
> the next iteration just sees check_cfs_rq_runtime(/next), and thus does
> goto simple. However, there is now only /prev/some_task runnable, and it
> hasn't checked the entire prev hierarchy for throttling, thus leading to
> the crash.
>
> This would require that check_cfs_rq_runtime(/next) return true despite
> being on_rq though, which iirc is not supposed to happen (note that we
> do not call update_curr(/next), and it would do nothing if we did,
> because /next isn't part of the current thread's hierarchy). However,
> this /can/ happen if runtime has just been (re)enabled on /next, because
> tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1.

Yeah, this seems possible too.

> The idea was that each rq would grab runtime when they were scheduled
> (pick_next_task_fair didn't ever look at throttling info), so this was
> fine with the old code, but is a problem now. I think it would be
> sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably
> more precise option would be to only check_cfs_rq_runtime if
> cfs_rq->curr is set, but the code is slightly less pretty.

Probably this code should call something like
account_cfs_rq_runtime(cfs_rq, 0);

>
> Could you check this patch to see if it works (or the trivial
> tg_set_bandwidth runtime_remaining = 1 patch)?

I'm not sure that I'll see this bug again: we're replacing this setup 
with something different.

>
> ---8<----->8---
>
>  From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001
> From: Ben Segall <bsegall@google.com>
> Date: Mon, 6 Apr 2015 15:28:10 -0700
> Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair
>
> The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed
> to trigger when cfs_rq is still an ancestor of prev. However, it was able to
> trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth
> set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global
> pool.
>
> Fix this by only calling check_cfs_rq_runtime if we are still in prev's
> ancestry, as evidenced by cfs_rq->curr.
>
> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Signed-off-by: Ben Segall <bsegall@google.com>
> ---
>   kernel/sched/fair.c | 25 ++++++++++++++-----------
>   1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee595ef..5cb52e9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5038,18 +5038,21 @@ again:
>   		 * entity, update_curr() will update its vruntime, otherwise
>   		 * forget we've ever seen it.
>   		 */
> -		if (curr && curr->on_rq)
> -			update_curr(cfs_rq);
> -		else
> -			curr = NULL;
> +		if (curr) {
> +			if (curr->on_rq)
> +				update_curr(cfs_rq);
> +			else
> +				curr = NULL;
>
> -		/*
> -		 * This call to check_cfs_rq_runtime() will do the throttle and
> -		 * dequeue its entity in the parent(s). Therefore the 'simple'
> -		 * nr_running test will indeed be correct.
> -		 */
> -		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> -			goto simple;
> +			/*
> +			 * This call to check_cfs_rq_runtime() will do the
> +			 * throttle and dequeue its entity in the parent(s).
> +			 * Therefore the 'simple' nr_running test will indeed
> +			 * be correct.
> +			 */
> +			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> +				goto simple;
> +		}
>
>   		se = pick_next_entity(cfs_rq, curr);
>   		cfs_rq = group_cfs_rq(se);
>


-- 
Konstantin


* [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair()
  2015-04-06 22:45 ` bsegall
  2015-04-07 15:53   ` Konstantin Khlebnikov
@ 2015-06-07 17:47   ` tip-bot for Ben Segall
  1 sibling, 0 replies; 10+ messages in thread
From: tip-bot for Ben Segall @ 2015-06-07 17:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, torvalds, akpm, bsegall, mnaser, khlebnikov, peterz, hpa,
	klamm, tglx, linux-kernel

Commit-ID:  54d27365cae88fbcc853b391dcd561e71acb81fa
Gitweb:     http://git.kernel.org/tip/54d27365cae88fbcc853b391dcd561e71acb81fa
Author:     Ben Segall <bsegall@google.com>
AuthorDate: Mon, 6 Apr 2015 15:28:10 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 7 Jun 2015 15:57:44 +0200

sched/fair: Prevent throttling in early pick_next_task_fair()

The optimized task selection logic optimistically selects a new task
to run without first doing a full put_prev_task(). This is so that we
can avoid a put/set on the common ancestors of the old and new task.

Similarly, we should only call check_cfs_rq_runtime() to throttle
eligible groups if they're part of the common ancestry, otherwise it
is possible to end up with no eligible task in the simple task
selection.

Imagine:
		/root
	/prev		/next
	/A		/B

If our optimistic selection ends up throttling /next, we goto simple
and our put_prev_task() ends up throttling /prev, after which we're
going to bug out in set_next_entity() because there aren't any tasks
left.

Avoid this scenario by only throttling common ancestors.
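As a minimal numeric illustration of that scenario (a hypothetical model, not kernel code), assume one runnable task under /prev and one under /next; throttling a group removes its tasks from the root's runnable count:

```c
#include <assert.h>

struct toy_tg {
	int nr_tasks;
	int throttled;
};

/* Runnable tasks visible at the root once throttling is applied. */
static int toy_root_nr_running(const struct toy_tg *prev_grp,
			       const struct toy_tg *next_grp)
{
	return (prev_grp->throttled ? 0 : prev_grp->nr_tasks) +
	       (next_grp->throttled ? 0 : next_grp->nr_tasks);
}
```

The nr_running check passes while both groups are runnable; if the optimistic pick then throttles /next and put_prev_task() throttles /prev, set_next_entity() is handed an empty tree.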

Reported-by: Mohammed Naser <mnaser@vexxhost.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Ben Segall <bsegall@google.com>
[ munged Changelog ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Fixes: 678d5718d8d0 ("sched/fair: Optimize cgroup pick_next_task_fair()")
Link: http://lkml.kernel.org/r/xm26wq1oswoq.fsf@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d4632f..84ada05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5322,18 +5322,21 @@ again:
 		 * entity, update_curr() will update its vruntime, otherwise
 		 * forget we've ever seen it.
 		 */
-		if (curr && curr->on_rq)
-			update_curr(cfs_rq);
-		else
-			curr = NULL;
+		if (curr) {
+			if (curr->on_rq)
+				update_curr(cfs_rq);
+			else
+				curr = NULL;
 
-		/*
-		 * This call to check_cfs_rq_runtime() will do the throttle and
-		 * dequeue its entity in the parent(s). Therefore the 'simple'
-		 * nr_running test will indeed be correct.
-		 */
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto simple;
+			/*
+			 * This call to check_cfs_rq_runtime() will do the
+			 * throttle and dequeue its entity in the parent(s).
+			 * Therefore the 'simple' nr_running test will indeed
+			 * be correct.
+			 */
+			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+				goto simple;
+		}
 
 		se = pick_next_entity(cfs_rq, curr);
 		cfs_rq = group_cfs_rq(se);

