LKML Archive on lore.kernel.org
* [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
@ 2019-08-14 18:00 Liangyan
  2019-08-15 16:36 ` Valentin Schneider
  2019-08-23 20:00 ` [PATCH] sched/fair: don't assign runtime for throttled cfs_rq bsegall
  0 siblings, 2 replies; 17+ messages in thread
From: Liangyan @ 2019-08-14 18:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel; +Cc: shanpeic, xlpang

do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime() to unthrottle cfs_rqs, but sometimes
cfs_b->runtime gets allocated entirely to one cfs_rq incorrectly.
This causes the other cfs_rqs to get no runtime and stay throttled.
We found that a throttled cfs_rq can have a positive
cfs_rq->runtime_remaining, causing an unexpected cast from s64 to u64
in this snippet of distribute_cfs_runtime():
	runtime = -cfs_rq->runtime_remaining + 1;
This cast makes runtime a huge number, and cfs_b->runtime ends up
being subtracted all the way down to zero.

This commit prevents a throttled cfs_rq from being assigned new
runtime, avoiding the incorrect type cast above.

Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fd8a2a605b..b14d67d28138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4068,6 +4068,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled)
+		return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
  2019-08-14 18:00 [PATCH] sched/fair: don't assign runtime for throttled cfs_rq Liangyan
@ 2019-08-15 16:36 ` Valentin Schneider
       [not found]   ` <7C1833A8-27A4-4755-9B1E-335C20207A66@linux.alibaba.com>
  2019-08-23 20:00 ` [PATCH] sched/fair: don't assign runtime for throttled cfs_rq bsegall
  1 sibling, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2019-08-15 16:36 UTC (permalink / raw)
  To: Liangyan, Ingo Molnar, Peter Zijlstra, linux-kernel; +Cc: shanpeic, xlpang

On 14/08/2019 19:00, Liangyan wrote:
> do_sched_cfs_period_timer() will refill cfs_b runtime and call
> distribute_cfs_runtime() to unthrottle cfs_rqs, but sometimes
> cfs_b->runtime gets allocated entirely to one cfs_rq incorrectly.

> This causes the other cfs_rqs to get no runtime and stay throttled.
> We found that a throttled cfs_rq can have a positive
> cfs_rq->runtime_remaining, causing an unexpected cast from s64 to u64
> in this snippet of distribute_cfs_runtime():
> 	runtime = -cfs_rq->runtime_remaining + 1;

> This cast makes runtime a huge number, and cfs_b->runtime ends up
> being subtracted all the way down to zero.
> 

I'm a complete CFS bandwidth noob but let me give this a try...


-Wconversion does pick this up (turning this thing on made me understand
why it's not on by default)

kernel/sched/fair.c: In function ‘distribute_cfs_runtime’:
kernel/sched/fair.c:4633:13: warning: conversion to ‘u64’ {aka ‘long long unsigned int’} from ‘s64’ {aka ‘long long int’} may change the sign of the result [-Wsign-conversion]
   runtime = -cfs_rq->runtime_remaining + 1;
             ^
kernel/sched/fair.c:4638:29: warning: conversion to ‘long long unsigned int’ from ‘s64’ {aka ‘long long int’} may change the sign of the result [-Wsign-conversion]
   cfs_rq->runtime_remaining += runtime;
                             ^~


The thing is we have a !cfs_rq_throttled() check just before the snippet
you're calling out, so AFAICT cfs_rq->runtime_remaining has to be <= 0
there (otherwise this cfs_rq wouldn't be throttled).

I doubt you can get this to fire, but just to be sure...
-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bbd90adabe2a..836948a3ae23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4630,6 +4630,8 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
----->8-----

Other than those signed/unsigned shenanigans, I only see one other scenario
that leads to a cfs_rq getting allocated all the remaining runtime: its
.runtime_remaining just has to be greater or equal (in absolute value)
than the remaining runtime.

If that's happening consistently, I suppose it could be due to long
delays between update_curr_fair() calls, but I can't think of a good
reason why that would happen.

> This commit prevents a throttled cfs_rq from being assigned new
> runtime, avoiding the incorrect type cast above.
> 
> Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
> ---
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 81fd8a2a605b..b14d67d28138 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4068,6 +4068,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
>  
> +	if (cfs_rq->throttled)
> +		return;
>  	/*
>  	 * if we're unable to extend our runtime we resched so that the active
>  	 * hierarchy can be throttled
> 


* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
       [not found]   ` <7C1833A8-27A4-4755-9B1E-335C20207A66@linux.alibaba.com>
@ 2019-08-16 14:02     ` Valentin Schneider
  2019-08-16 14:31       ` Valentin Schneider
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2019-08-16 14:02 UTC (permalink / raw)
  To: Liangyan; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, shanpeic, xlpang, pjt

On 16/08/2019 08:08, Liangyan wrote:
> Please check the dmesg log below with “WARN_ON(cfs_rq->runtime_remaining > 0)”. With my patch applied, the warning is gone. The reproducing case is appended at the end.
> 

[...]

Huh, thanks for the log & the reproducer. I'm still struggling to
understand how we could hit the condition you're adding, since
account_cfs_rq_runtime() shouldn't be called for throttled cfs_rqs (which
I guess is the bug). Also, if the cfs_rq is throttled, shouldn't we
prevent any further decrement of its ->runtime_remaining ?

I had a look at the callers of account_cfs_rq_runtime():

- update_curr(). Seems safe, but has a cfs_rq->curr check at the top. This
  won't catch throttled cfs_rq's because AFAICT their curr pointer isn't
  NULL'd on throttle.

- check_enqueue_throttle(). Already has a cfs_rq_throttled() check.

- set_next_task_fair(). Peter shuffled the whole set/put task thing
  recently but last I looked it seemed all sane.

I'll try to make sense of it, but have also Cc'd Paul since unlike me he
actually knows this stuff.


* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
  2019-08-16 14:02     ` Valentin Schneider
@ 2019-08-16 14:31       ` Valentin Schneider
       [not found]         ` <02BC41EE-6653-4473-91D4-CDEE53D8703D@linux.alibaba.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2019-08-16 14:31 UTC (permalink / raw)
  To: Liangyan; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, shanpeic, xlpang, pjt

On 16/08/2019 15:02, Valentin Schneider wrote:
> On 16/08/2019 08:08, Liangyan wrote:
>> Please check the dmesg log below with “WARN_ON(cfs_rq->runtime_remaining > 0)”. With my patch applied, the warning is gone. The reproducing case is appended at the end.
>>
> 
> [...]
> 
> Huh, thanks for the log & the reproducer. I'm still struggling to
> understand how we could hit the condition you're adding, since
> account_cfs_rq_runtime() shouldn't be called for throttled cfs_rqs (which
> I guess is the bug). Also, if the cfs_rq is throttled, shouldn't we
> prevent any further decrement of its ->runtime_remaining ?
> 
> I had a look at the callers of account_cfs_rq_runtime():
> 
> - update_curr(). Seems safe, but has a cfs_rq->curr check at the top. This
>   won't catch throttled cfs_rq's because AFAICT their curr pointer isn't
>   NULL'd on throttle.
> 
> - check_enqueue_throttle(). Already has a cfs_rq_throttled() check.
> 
> - set_next_task_fair(). Peter shuffled the whole set/put task thing
>   recently but last I looked it seemed all sane.
> 
> I'll try to make sense of it, but have also Cc'd Paul since unlike me he
> actually knows this stuff.
> 

Hah, seems like we get update_curr() calls on throttled rqs via
put_prev_entity():

[  151.538560]  put_prev_entity+0x8d/0x100
[  151.538562]  put_prev_task_fair+0x22/0x40
[  151.538564]  pick_next_task_fair+0x140/0x390
[  151.538566]  __schedule+0x122/0x6c0
[  151.538568]  schedule+0x2d/0x90
[  151.538570]  exit_to_usermode_loop+0x61/0x100
[  151.538572]  prepare_exit_to_usermode+0x91/0xa0
[  151.538573]  retint_user+0x8/0x8

Debug warns:
-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..41e0e78de4fe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -828,6 +828,8 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 }
 #endif /* CONFIG_SMP */
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+
 /*
  * Update the current task's runtime statistics.
  */
@@ -840,6 +842,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (unlikely(!curr))
 		return;
 
+	WARN_ON(cfs_rq_throttled(cfs_rq));
+
 	delta_exec = now - curr->exec_start;
 	if (unlikely((s64)delta_exec <= 0))
 		return;
@@ -10169,6 +10173,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		set_next_entity(cfs_rq, se);
+		WARN_ON(cfs_rq_throttled(cfs_rq));
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		account_cfs_rq_runtime(cfs_rq, 0);
 	}
----->8-----

So I guess what we'd want there is something like
-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..b2c40f994aa9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -828,6 +828,8 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 }
 #endif /* CONFIG_SMP */
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+
 /*
  * Update the current task's runtime statistics.
  */
@@ -840,6 +842,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (unlikely(!curr))
 		return;
 
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
 	delta_exec = now - curr->exec_start;
 	if (unlikely((s64)delta_exec <= 0))
 		return;
----->8-----

but I still don't comprehend how we can get there in the first place.


* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
       [not found]         ` <02BC41EE-6653-4473-91D4-CDEE53D8703D@linux.alibaba.com>
@ 2019-08-16 17:19           ` Valentin Schneider
  2019-08-19 17:34             ` Valentin Schneider
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2019-08-16 17:19 UTC (permalink / raw)
  To: Liangyan; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, shanpeic, xlpang, pjt

On 16/08/2019 16:39, Liangyan wrote:
> Thanks for the feedback.
> I added some debug prints and got the log below. It seems that pick_next_task_fair() throttles the cfs_rq first, then calls put_prev_entity(), which assigns runtime to this cfs_rq.
> 
[...]
> 
> Regarding the suggested change, I'm not sure whether it is OK to skip the runtime accounting for the curr task.
> 

Yeah it's probably pretty stupid. IIRC throttled cfs_rq means frozen
rq_clock, so any subsequent call to update_curr() on a throttled cfs_rq
should lead to an early bailout anyway due to delta_exec <= 0.


* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
  2019-08-16 17:19           ` Valentin Schneider
@ 2019-08-19 17:34             ` Valentin Schneider
  2019-08-20 10:54               ` [PATCH] sched/fair: Add missing unthrottle_cfs_rq() Valentin Schneider
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2019-08-19 17:34 UTC (permalink / raw)
  To: Liangyan; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, shanpeic, xlpang, pjt

On 16/08/2019 18:19, Valentin Schneider wrote:
[...]
> Yeah it's probably pretty stupid. IIRC throttled cfs_rq means frozen
> rq_clock, so any subsequent call to update_curr() on a throttled cfs_rq
> should lead to an early bailout anyway due to delta_exec <= 0.
> 

Did some more tracing, seems like the issue is we can make
->runtime_remaining positive in assign_cfs_rq_runtime() but not mark the
cfs_rq as unthrottled.

So AFAICT we'd need something like this:

-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..ffbb4dfc4b81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->throttled;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -4411,6 +4416,9 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 	cfs_rq->runtime_remaining += amount;
 
+	if (cfs_rq->runtime_remaining > 0 && cfs_rq_throttled(cfs_rq))
+		unthrottle_cfs_rq(cfs_rq);
+
 	return cfs_rq->runtime_remaining > 0;
 }
 
@@ -4439,11 +4447,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
-static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
-{
-	return cfs_bandwidth_used() && cfs_rq->throttled;
-}
-
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
----->8-----

Does that make sense? If so we *may* want to add some ->runtime_remaining
wrappers (e.g. {add/remove}_runtime()) and have the check in there to
make sure it's not forgotten.


* [PATCH] sched/fair: Add missing unthrottle_cfs_rq()
  2019-08-19 17:34             ` Valentin Schneider
@ 2019-08-20 10:54               ` Valentin Schneider
  2019-08-22  9:21                 ` Peter Zijlstra
  2019-08-22 18:48                 ` bsegall
  0 siblings, 2 replies; 17+ messages in thread
From: Valentin Schneider @ 2019-08-20 10:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, liangyan.peng, shanpeic, xlpang, pjt, stable

Turns out a cfs_rq->runtime_remaining can become positive in
assign_cfs_rq_runtime(), but this codepath has no call to
unthrottle_cfs_rq().

This can leave us in a situation where we have a throttled cfs_rq with
positive ->runtime_remaining, which breaks the math in
distribute_cfs_runtime(): this function expects a negative value so that
it may safely negate it into a positive value.

Add the missing unthrottle_cfs_rq(). While at it, add a WARN_ON where
we expect negative values, and pull in a comment from the mailing list
that didn't make it in [1].

[1]: https://lkml.kernel.org/r/BANLkTi=NmCxKX6EbDQcJYDJ5kKyG2N1ssw@mail.gmail.com

Cc: <stable@vger.kernel.org>
Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
Reported-by: Liangyan <liangyan.peng@linux.alibaba.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..219ff3f328e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->throttled;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -4411,6 +4416,9 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 	cfs_rq->runtime_remaining += amount;
 
+	if (cfs_rq->runtime_remaining > 0 && cfs_rq_throttled(cfs_rq))
+		unthrottle_cfs_rq(cfs_rq);
+
 	return cfs_rq->runtime_remaining > 0;
 }
 
@@ -4439,11 +4447,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
-static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
-{
-	return cfs_bandwidth_used() && cfs_rq->throttled;
-}
-
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
@@ -4628,6 +4631,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
+		/* Pick the minimum amount to return to a positive quota state */
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
-- 
2.22.0



* Re: [PATCH] sched/fair: Add missing unthrottle_cfs_rq()
  2019-08-20 10:54               ` [PATCH] sched/fair: Add missing unthrottle_cfs_rq() Valentin Schneider
@ 2019-08-22  9:21                 ` Peter Zijlstra
  2019-08-22 17:43                   ` bsegall
  2019-08-22 18:48                 ` bsegall
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2019-08-22  9:21 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, liangyan.peng, shanpeic, xlpang, pjt,
	stable, Ben Segall

On Tue, Aug 20, 2019 at 11:54:20AM +0100, Valentin Schneider wrote:
> Turns out a cfs_rq->runtime_remaining can become positive in
> assign_cfs_rq_runtime(), but this codepath has no call to
> unthrottle_cfs_rq().
> 
> This can leave us in a situation where we have a throttled cfs_rq with
> positive ->runtime_remaining, which breaks the math in
> distribute_cfs_runtime(): this function expects a negative value so that
> it may safely negate it into a positive value.
> 
> Add the missing unthrottle_cfs_rq(). While at it, add a WARN_ON where
> we expect negative values, and pull in a comment from the mailing list
> that didn't make it in [1].
> 
> [1]: https://lkml.kernel.org/r/BANLkTi=NmCxKX6EbDQcJYDJ5kKyG2N1ssw@mail.gmail.com
> 
> Cc: <stable@vger.kernel.org>
> Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
> Reported-by: Liangyan <liangyan.peng@linux.alibaba.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Thanks!

> ---
>  kernel/sched/fair.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1054d2cf6aaa..219ff3f328e5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
>  	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
>  }
>  
> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_bandwidth_used() && cfs_rq->throttled;
> +}
> +
>  /* returns 0 on failure to allocate runtime */
>  static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
> @@ -4411,6 +4416,9 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  
>  	cfs_rq->runtime_remaining += amount;
>  
> +	if (cfs_rq->runtime_remaining > 0 && cfs_rq_throttled(cfs_rq))
> +		unthrottle_cfs_rq(cfs_rq);
> +
>  	return cfs_rq->runtime_remaining > 0;
>  }
>  
> @@ -4439,11 +4447,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  	__account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
>  
> -static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> -{
> -	return cfs_bandwidth_used() && cfs_rq->throttled;
> -}
> -
>  /* check whether cfs_rq, or any parent, is throttled */
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>  {
> @@ -4628,6 +4631,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  		if (!cfs_rq_throttled(cfs_rq))
>  			goto next;
>  
> +		/* By the above check, this should never be true */
> +		WARN_ON(cfs_rq->runtime_remaining > 0);
> +
> +		/* Pick the minimum amount to return to a positive quota state */
>  		runtime = -cfs_rq->runtime_remaining + 1;
>  		if (runtime > remaining)
>  			runtime = remaining;
> -- 
> 2.22.0
> 


* Re: [PATCH] sched/fair: Add missing unthrottle_cfs_rq()
  2019-08-22  9:21                 ` Peter Zijlstra
@ 2019-08-22 17:43                   ` bsegall
  0 siblings, 0 replies; 17+ messages in thread
From: bsegall @ 2019-08-22 17:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, linux-kernel, mingo, liangyan.peng, shanpeic,
	xlpang, pjt, stable

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Aug 20, 2019 at 11:54:20AM +0100, Valentin Schneider wrote:
>> Turns out a cfs_rq->runtime_remaining can become positive in
>> assign_cfs_rq_runtime(), but this codepath has no call to
>> unthrottle_cfs_rq().
>> 
>> This can leave us in a situation where we have a throttled cfs_rq with
>> positive ->runtime_remaining, which breaks the math in
>> distribute_cfs_runtime(): this function expects a negative value so that
>> it may safely negate it into a positive value.
>> 
>> Add the missing unthrottle_cfs_rq(). While at it, add a WARN_ON where
>> we expect negative values, and pull in a comment from the mailing list
>> that didn't make it in [1].

This didn't exist because it's not supposed to be possible to call
account_cfs_rq_runtime on a throttled cfs_rq at all, so that's the
invariant being violated. Do you know what the code path causing this
looks like?

This would allow both list del and add while distribute is doing a
foreach, but I think that the racing behavior would just be to restart
the distribute loop, which is fine.



>> 
>> [1]: https://lkml.kernel.org/r/BANLkTi=NmCxKX6EbDQcJYDJ5kKyG2N1ssw@mail.gmail.com
>> 
>> Cc: <stable@vger.kernel.org>
>> Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
>> Reported-by: Liangyan <liangyan.peng@linux.alibaba.com>
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>
> Thanks!
>
>> ---
>>  kernel/sched/fair.c | 17 ++++++++++++-----
>>  1 file changed, 12 insertions(+), 5 deletions(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1054d2cf6aaa..219ff3f328e5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
>>  	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
>>  }
>>  
>> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>> +{
>> +	return cfs_bandwidth_used() && cfs_rq->throttled;
>> +}
>> +
>>  /* returns 0 on failure to allocate runtime */
>>  static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  {
>> @@ -4411,6 +4416,9 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  
>>  	cfs_rq->runtime_remaining += amount;
>>  
>> +	if (cfs_rq->runtime_remaining > 0 && cfs_rq_throttled(cfs_rq))
>> +		unthrottle_cfs_rq(cfs_rq);
>> +
>>  	return cfs_rq->runtime_remaining > 0;
>>  }
>>  
>> @@ -4439,11 +4447,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>>  	__account_cfs_rq_runtime(cfs_rq, delta_exec);
>>  }
>>  
>> -static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>> -{
>> -	return cfs_bandwidth_used() && cfs_rq->throttled;
>> -}
>> -
>>  /* check whether cfs_rq, or any parent, is throttled */
>>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>>  {
>> @@ -4628,6 +4631,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>>  		if (!cfs_rq_throttled(cfs_rq))
>>  			goto next;
>>  
>> +		/* By the above check, this should never be true */
>> +		WARN_ON(cfs_rq->runtime_remaining > 0);
>> +
>> +		/* Pick the minimum amount to return to a positive quota state */
>>  		runtime = -cfs_rq->runtime_remaining + 1;
>>  		if (runtime > remaining)
>>  			runtime = remaining;
>> -- 
>> 2.22.0
>> 


* Re: [PATCH] sched/fair: Add missing unthrottle_cfs_rq()
  2019-08-20 10:54               ` [PATCH] sched/fair: Add missing unthrottle_cfs_rq() Valentin Schneider
  2019-08-22  9:21                 ` Peter Zijlstra
@ 2019-08-22 18:48                 ` bsegall
  2019-08-22 20:40                   ` Valentin Schneider
  2019-08-23  7:22                   ` Liangyan
  1 sibling, 2 replies; 17+ messages in thread
From: bsegall @ 2019-08-22 18:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, liangyan.peng, shanpeic, xlpang,
	pjt, stable

Valentin Schneider <valentin.schneider@arm.com> writes:

> Turns out a cfs_rq->runtime_remaining can become positive in
> assign_cfs_rq_runtime(), but this codepath has no call to
> unthrottle_cfs_rq().
>
> This can leave us in a situation where we have a throttled cfs_rq with
> positive ->runtime_remaining, which breaks the math in
> distribute_cfs_runtime(): this function expects a negative value so that
> it may safely negate it into a positive value.
>
> Add the missing unthrottle_cfs_rq(). While at it, add a WARN_ON where
> we expect negative values, and pull in a comment from the mailing list
> that didn't make it in [1].
>
> [1]: https://lkml.kernel.org/r/BANLkTi=NmCxKX6EbDQcJYDJ5kKyG2N1ssw@mail.gmail.com
>
> Cc: <stable@vger.kernel.org>
> Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
> Reported-by: Liangyan <liangyan.peng@linux.alibaba.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Having now seen the rest of the thread:

Could you send the repro, as it doesn't seem to have reached lkml, so
that I can confirm my guess as to what's going on?

It seems most likely we throttle during one of the remove-change-adds in
set_cpus_allowed and friends or during the put half of pick_next_task
followed by idle balance to drop the lock. Then distribute races with a
later assign_cfs_rq_runtime so that the account finds runtime in the
cfs_b.

Re clock_task, it's only frozen for the purposes of pelt, not delta_exec

The other possible way to fix this would be to skip assign if throttled,
since the only time it could succeed is if we're racing with a
distribute that will unthrottle us anyways.

The main advantage of that is avoiding the risk of screwy behavior due to
unthrottling in the middle of pick_next/put_prev. The disadvantage is that we already
have the lock, if it works we don't need an ipi to trigger a preempt,
etc. (But I think one of the issues is that we may trigger the preempt
on the previous task, not the next, and I'm not 100% sure that will
carry over correctly)



> ---
>  kernel/sched/fair.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1054d2cf6aaa..219ff3f328e5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
>  	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
>  }
>  
> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_bandwidth_used() && cfs_rq->throttled;
> +}
> +
>  /* returns 0 on failure to allocate runtime */
>  static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
> @@ -4411,6 +4416,9 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  
>  	cfs_rq->runtime_remaining += amount;
>  
> +	if (cfs_rq->runtime_remaining > 0 && cfs_rq_throttled(cfs_rq))
> +		unthrottle_cfs_rq(cfs_rq);
> +
>  	return cfs_rq->runtime_remaining > 0;
>  }
>  
> @@ -4439,11 +4447,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  	__account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
>  
> -static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> -{
> -	return cfs_bandwidth_used() && cfs_rq->throttled;
> -}
> -
>  /* check whether cfs_rq, or any parent, is throttled */
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>  {
> @@ -4628,6 +4631,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  		if (!cfs_rq_throttled(cfs_rq))
>  			goto next;
>  
> +		/* By the above check, this should never be true */
> +		WARN_ON(cfs_rq->runtime_remaining > 0);
> +
> +		/* Pick the minimum amount to return to a positive quota state */
>  		runtime = -cfs_rq->runtime_remaining + 1;
>  		if (runtime > remaining)
>  			runtime = remaining;


* Re: [PATCH] sched/fair: Add missing unthrottle_cfs_rq()
  2019-08-22 18:48                 ` bsegall
@ 2019-08-22 20:40                   ` Valentin Schneider
  2019-08-22 21:10                     ` Valentin Schneider
  2019-08-23  7:22                   ` Liangyan
  1 sibling, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2019-08-22 20:40 UTC (permalink / raw)
  To: bsegall
  Cc: linux-kernel, mingo, peterz, liangyan.peng, shanpeic, xlpang,
	pjt, stable

On 22/08/2019 19:48, bsegall@google.com wrote:
> Having now seen the rest of the thread:
> 
> Could you send the repro, as it doesn't seem to have reached lkml, so
> that I can confirm my guess as to what's going on?
> 

Huh, odd. Here's the thing:

delay.c:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

unsigned long NUM_LOOPS=2500000*250;

/* simple loop based delay: */
static void delay_loop(unsigned long loops)
{
	asm volatile(
		"	test %0,%0	\n"
		"	jz 3f		\n"
		"	jmp 1f		\n"

		".align 16		\n"
		"1:	jmp 2f		\n"

		".align 16		\n"
		"2:	dec %0		\n"
		"	jnz 2b		\n"
		"3:	dec %0		\n"

		: /* we don't need output */
		:"a" (loops)
	);
}

void __const_udelay(unsigned long xloops)
{
	int d0;

	xloops *= 4;
	asm("mull %%edx"
		:"=d" (xloops), "=&a" (d0)
		:"1" (xloops), "0"
		(NUM_LOOPS));

	delay_loop(++xloops);
}

void __udelay(unsigned long usecs)
{
	__const_udelay(usecs * 0x000010c7); /* 2**32 / 1000000 (rounded up) */
}

static void *thread_start(void *arg)
{
    while(1) {
	__udelay((unsigned long)arg);
	usleep(7000);
    }
}

int main(int argc, char*argv[])
{
	int i;
	int thread;
	unsigned long timeout;
	pthread_t new_th;

	if (argc != 3) {
		printf("./delay nr_thread work_loop\n");
		exit(-1);
	}

	thread = atoi(argv[1]);
	timeout = (unsigned long)atoi(argv[2]);

	for (i = 0; i < thread; i++) {
		pthread_create(&new_th, NULL, thread_start, (void *)timeout);
		usleep(100);
	}

	while(1) {
		sleep(10);
	}
}


do-delay.sh:
#!/bin/bash

mkdir /sys/fs/cgroup/cpu/test1
echo 100000 > /sys/fs/cgroup/cpu/test1/cpu.cfs_period_us
echo 1600000 > /sys/fs/cgroup/cpu/test1/cpu.cfs_quota_us
./delay 500 1000 &
echo $! > /sys/fs/cgroup/cpu/test1/cgroup.procs
mkdir /sys/fs/cgroup/cpu/test2
echo 100000 > /sys/fs/cgroup/cpu/test2/cpu.cfs_period_us
echo 800000 > /sys/fs/cgroup/cpu/test2/cpu.cfs_quota_us
./delay 500 1000 &
echo $! > /sys/fs/cgroup/cpu/test2/cgroup.procs
prev=0;while true; do curr=`cat /sys/fs/cgroup/cpu/test1/cpuacct.usage` && echo $(($curr-$prev)) && prev=$curr && sleep 1; done


I ran the thing on a dual-socket x86 test box and could trigger the issue
quite rapidly (w/ the WARN_ON in distribute_cfs_runtime()).

> It seems most likely we throttle during one of the remove-change-adds in
> set_cpus_allowed and friends or during the put half of pick_next_task
> followed by idle balance to drop the lock. Then distribute races with a
> later assign_cfs_rq_runtime so that the account finds runtime in the
> cfs_b.
> 

I should still have a trace laying around, let me have a look.

> Re clock_task, it's only frozen for the purposes of pelt, not delta_exec
> 

Noted, thanks. But then we shouldn't expect throttled rq's to call into
update_curr(), right? Maybe just right after they've been throttled, but not
beyond that. Otherwise I fail to see how that would make sense.

> The other possible way to fix this would be to skip assign if throttled,
> since the only time it could succeed is if we're racing with a
> distribute that will unthrottle us anyway.
> 

So pretty much the change Liangyan originally proposed? (so much for trying
to help :p)

> The main advantage of that is avoiding the risk of screwy behavior due to
> unthrottling in the middle of pick_next/put_prev. The disadvantage is that
> there we already have the lock, and if it works we don't need an ipi to
> trigger a preempt, etc. (But I think one of the issues is that we may
> trigger the preempt on the previous task, not the next, and I'm not 100%
> sure that will carry over correctly.)
> 


* Re: [PATCH] sched/fair: Add missing unthrottle_cfs_rq()
  2019-08-22 20:40                   ` Valentin Schneider
@ 2019-08-22 21:10                     ` Valentin Schneider
  0 siblings, 0 replies; 17+ messages in thread
From: Valentin Schneider @ 2019-08-22 21:10 UTC (permalink / raw)
  To: bsegall
  Cc: linux-kernel, mingo, peterz, liangyan.peng, shanpeic, xlpang,
	pjt, stable

On 22/08/2019 21:40, Valentin Schneider wrote:
> On 22/08/2019 19:48, bsegall@google.com wrote:

Regarding the point that we shouldn't see account_cfs_rq_runtime() called
on throttled cfs_rq's: with this debug patch,
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 171eef3f08f9..1acb88024cad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->throttled;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -4411,6 +4416,8 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 	cfs_rq->runtime_remaining += amount;
 
+	WARN_ON(cfs_rq_throttled(cfs_rq) && cfs_rq->runtime_remaining > 0);
+
 	return cfs_rq->runtime_remaining > 0;
 }
 
@@ -4436,12 +4443,9 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
 		return;
 
-	__account_cfs_rq_runtime(cfs_rq, delta_exec);
-}
+	WARN_ON(cfs_rq_throttled(cfs_rq));
 
-static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
-{
-	return cfs_bandwidth_used() && cfs_rq->throttled;
+	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 /* check whether cfs_rq, or any parent, is throttled */
---

I get this:

[  204.798643] Call Trace:
[  204.798645]  put_prev_entity+0x8d/0x100
[  204.798647]  put_prev_task_fair+0x22/0x40
[  204.798648]  pick_next_task_idle+0x36/0x50
[  204.798650]  __schedule+0x61d/0x6c0
[  204.798651]  schedule+0x2d/0x90
[  204.798653]  exit_to_usermode_loop+0x61/0x100
[  204.798654]  prepare_exit_to_usermode+0x91/0xa0
[  204.798656]  retint_user+0x8/0x8

(this is a hit on the account_cfs_rq_runtime() WARN_ON)


* Re: [PATCH] sched/fair: Add missing unthrottle_cfs_rq()
  2019-08-22 18:48                 ` bsegall
  2019-08-22 20:40                   ` Valentin Schneider
@ 2019-08-23  7:22                   ` Liangyan
  1 sibling, 0 replies; 17+ messages in thread
From: Liangyan @ 2019-08-23  7:22 UTC (permalink / raw)
  To: bsegall, Valentin Schneider
  Cc: linux-kernel, mingo, peterz, shanpeic, xlpang, pjt, stable

Resend.

Sorry that my previous email had a formatting issue.

On 19/8/23 2:48 AM, bsegall@google.com wrote:
> Valentin Schneider <valentin.schneider@arm.com> writes:
> 
>> Turns out a cfs_rq->runtime_remaining can become positive in
>> assign_cfs_rq_runtime(), but this codepath has no call to
>> unthrottle_cfs_rq().
>>
>> This can leave us in a situation where we have a throttled cfs_rq with
>> positive ->runtime_remaining, which breaks the math in
>> distribute_cfs_runtime(): this function expects a negative value so that
>> it may safely negate it into a positive value.
>>
>> Add the missing unthrottle_cfs_rq(). While at it, add a WARN_ON where
>> we expect negative values, and pull in a comment from the mailing list
>> that didn't make it in [1].
>>
>> [1]: https://lkml.kernel.org/r/BANLkTi=NmCxKX6EbDQcJYDJ5kKyG2N1ssw@mail.gmail.com
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
>> Reported-by: Liangyan <liangyan.peng@linux.alibaba.com>
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> 
> Having now seen the rest of the thread:
> 
> Could you send the repro, as it doesn't seem to have reached lkml, so
> that I can confirm my guess as to what's going on?
> 
> It seems most likely we throttle during one of the remove-change-adds in
> set_cpus_allowed and friends or during the put half of pick_next_task
> followed by idle balance to drop the lock. Then distribute races with a
> later assign_cfs_rq_runtime so that the account finds runtime in the
> cfs_b.
> 
pick_next_task_fair() calls update_curr() but gets zero runtime because
cfs_b->runtime is 0, and then throttles the current cfs_rq because
cfs_rq->runtime_remaining is 0. It then calls put_prev_entity(), whose
__account_cfs_rq_runtime() succeeds in assigning new runtime, because
dequeue_entity() on another cpu has just called return_cfs_rq_runtime()
to return some runtime to cfs_b->runtime.


Please check below debug log to trace this logic.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..0da556c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4395,6 +4395,12 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
+	if (cfs_rq->throttled && smp_processor_id()==20) {
+		pr_err("__account_cfs_rq_runtime:cfs_rq=%p,comm=%s,pid=%d,cpu=%d\n",cfs_rq,current->comm,current->pid,smp_processor_id());
+		dump_stack();
+	}
+	//if (cfs_rq->throttled)
+	//	return;
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
@@ -4508,6 +4514,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		sub_nr_running(rq, task_delta);
 
 	cfs_rq->throttled = 1;
+	{
+		if (cfs_rq->nr_running > 0 && smp_processor_id()==20) {
+			pr_err("throttle_cfs_rq:cfs_rq=%p,comm=%s,pid=%d,cpu=%d\n",cfs_rq,current->comm,current->pid,smp_processor_id());
+			dump_stack();
+		}
+	}
 	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
 	empty = list_empty(&cfs_b->throttled_cfs_rq);

[  257.406397] throttle_cfs_rq:cfs_rq=00000000b012d731,comm=delay,pid=4307,cpu=20
[  257.407251] CPU: 20 PID: 4307 Comm: delay Tainted: G        W   E 5.2.0-rc3+ #9
[  257.408795] Call Trace:
[  257.409085]  dump_stack+0x5c/0x7b
[  257.409482]  throttle_cfs_rq+0x2c3/0x2d0
[  257.409940]  check_cfs_rq_runtime+0x2f/0x50
[  257.410430]  pick_next_task_fair+0xb1/0x740
[  257.410918]  __schedule+0x104/0x670
[  257.411333]  schedule+0x33/0x90
[  257.411706]  exit_to_usermode_loop+0x6a/0xf0
[  257.412201]  prepare_exit_to_usermode+0x80/0xc0
[  257.412734]  retint_user+0x8/0x8
[  257.413114] RIP: 0033:0x4006d0
[  257.413480] Code: 7d f8 48 8b 45 f8 48 85 c0 74 24 eb 0d 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 eb 0e 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 <48> ff c8 75 fb 48 ff c8 90 5d c3 55 48 89 e5 48 83 ec 18 48 89 7d
[  257.415630] RSP: 002b:00007f9b74abbe90 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[  257.416508] RAX: 0000000000060f7b RBX: 0000000000000000 RCX: 0000000000000000
[  257.417333] RDX: 00000000002625b3 RSI: 0000000000000000 RDI: 00000000002625b4
[  257.418155] RBP: 00007f9b74abbe90 R08: 00007f9b74abc700 R09: 00007f9b74abc700
[  257.418983] R10: 00007f9b74abc9d0 R11: 0000000000000000 R12: 00007ffee72e1afe
[  257.419813] R13: 00007ffee72e1aff R14: 00007ffee72e1b00 R15: 0000000000000000


[  257.420718] __account_cfs_rq_runtime:cfs_rq=00000000b012d731,comm=delay,pid=4307,cpu=20
[  257.421646] CPU: 20 PID: 4307 Comm: delay Tainted: G        W   E 5.2.0-rc3+ #9
[  257.422538] Call Trace:
[  257.424712]  dump_stack+0x5c/0x7b
[  257.425101]  __account_cfs_rq_runtime+0x1d7/0x1e0
[  257.425656]  put_prev_entity+0x90/0x100
[  257.426102]  pick_next_task_fair+0x334/0x740
[  257.426605]  __schedule+0x104/0x670
[  257.427013]  schedule+0x33/0x90
[  257.427384]  exit_to_usermode_loop+0x6a/0xf0
[  257.427879]  prepare_exit_to_usermode+0x80/0xc0
[  257.428406]  retint_user+0x8/0x8
[  257.428783] RIP: 0033:0x4006d0


> Re clock_task, it's only frozen for the purposes of pelt, not delta_exec
> 
> The other possible way to fix this would be to skip assign if throttled,
> since the only time it could succeed is if we're racing with a
> distribute that will unthrottle us anyway.

I posted a similar patch here earlier:
https://lkml.org/lkml/2019/8/14/1176

> 
> The main advantage of that is avoiding the risk of screwy behavior due to
> unthrottling in the middle of pick_next/put_prev. The disadvantage is that
> there we already have the lock, and if it works we don't need an ipi to
> trigger a preempt, etc. (But I think one of the issues is that we may
> trigger the preempt on the previous task, not the next, and I'm not 100%
> sure that will carry over correctly.)
> 
> 
> 
>> ---
>>   kernel/sched/fair.c | 17 ++++++++++++-----
>>   1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1054d2cf6aaa..219ff3f328e5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4385,6 +4385,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
>>   	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
>>   }
>>   
>> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>> +{
>> +	return cfs_bandwidth_used() && cfs_rq->throttled;
>> +}
>> +
>>   /* returns 0 on failure to allocate runtime */
>>   static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>   {
>> @@ -4411,6 +4416,9 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>   
>>   	cfs_rq->runtime_remaining += amount;
>>   
>> +	if (cfs_rq->runtime_remaining > 0 && cfs_rq_throttled(cfs_rq))
>> +		unthrottle_cfs_rq(cfs_rq);
>> +
>>   	return cfs_rq->runtime_remaining > 0;
>>   }
>>   
>> @@ -4439,11 +4447,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>>   	__account_cfs_rq_runtime(cfs_rq, delta_exec);
>>   }
>>   
>> -static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>> -{
>> -	return cfs_bandwidth_used() && cfs_rq->throttled;
>> -}
>> -
>>   /* check whether cfs_rq, or any parent, is throttled */
>>   static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>>   {
>> @@ -4628,6 +4631,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>>   		if (!cfs_rq_throttled(cfs_rq))
>>   			goto next;
>>   
>> +		/* By the above check, this should never be true */
>> +		WARN_ON(cfs_rq->runtime_remaining > 0);
>> +
>> +		/* Pick the minimum amount to return to a positive quota state */
>>   		runtime = -cfs_rq->runtime_remaining + 1;
>>   		if (runtime > remaining)
>>   			runtime = remaining;


* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
  2019-08-14 18:00 [PATCH] sched/fair: don't assign runtime for throttled cfs_rq Liangyan
  2019-08-15 16:36 ` Valentin Schneider
@ 2019-08-23 20:00 ` bsegall
  2019-08-23 23:19   ` Valentin Schneider
  1 sibling, 1 reply; 17+ messages in thread
From: bsegall @ 2019-08-23 20:00 UTC (permalink / raw)
  To: Liangyan; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, shanpeic, xlpang

Liangyan <liangyan.peng@linux.alibaba.com> writes:

> do_sched_cfs_period_timer() will refill cfs_b runtime and call
> distribute_cfs_runtime() to unthrottle cfs_rqs, but sometimes all of
> cfs_b->runtime gets incorrectly allocated to a single cfs_rq.
> This leaves the other cfs_rqs unable to get runtime, so they stay
> throttled. We found that a throttled cfs_rq can have non-negative
> cfs_rq->runtime_remaining, which causes an unexpected cast from s64 to
> u64 in this snippet of distribute_cfs_runtime():
> runtime = -cfs_rq->runtime_remaining + 1;
> The cast turns runtime into a huge number, and cfs_b->runtime ends up
> being drained all the way to zero.
>
> This commit prevents a cfs_rq from being assigned new runtime while it
> is throttled, avoiding the incorrect type cast above.
>
> Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>

Could you mention in the message that a throttled cfs_rq can have
account_cfs_rq_runtime() called on it because it is throttled before
idle_balance(), and idle_balance() calls update_rq_clock() to add time
that is accounted to the task?

I think this solution is less risky than unthrottling
in this area, so other than that:

Reviewed-by: Ben Segall <bsegall@google.com>


> ---
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 81fd8a2a605b..b14d67d28138 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4068,6 +4068,8 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
>  
> +	if (cfs_rq->throttled)
> +		return;
>  	/*
>  	 * if we're unable to extend our runtime we resched so that the active
>  	 * hierarchy can be throttled


* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
  2019-08-23 20:00 ` [PATCH] sched/fair: don't assign runtime for throttled cfs_rq bsegall
@ 2019-08-23 23:19   ` Valentin Schneider
  2019-08-26 17:38     ` bsegall
  0 siblings, 1 reply; 17+ messages in thread
From: Valentin Schneider @ 2019-08-23 23:19 UTC (permalink / raw)
  To: bsegall, Liangyan
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, shanpeic, xlpang

On 23/08/2019 21:00, bsegall@google.com wrote:
[...]
> Could you mention in the message that a throttled cfs_rq can have
> account_cfs_rq_runtime called on it because it is throttled before
> idle_balance, and the idle_balance calls update_rq_clock to add time
> that is accounted to the task.
> 

Mayhaps even a comment for the extra condition.

> I think this solution is less risky than unthrottling
> in this area, so other than that:
> 
> Reviewed-by: Ben Segall <bsegall@google.com>
> 

If you don't mind squashing this in:

-----8<-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1d9cec9b1ed..b47b0bcf56bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4630,6 +4630,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
+		/* By the above check, this should never be true */
+		WARN_ON(cfs_rq->runtime_remaining > 0);
+
+		/* Pick the minimum amount to return to a positive quota state */
 		runtime = -cfs_rq->runtime_remaining + 1;
 		if (runtime > remaining)
 			runtime = remaining;
----->8-----

I'm not adamant about the extra comment, but the WARN_ON would be nice IMO.


@Ben, do you reckon we want to strap

Cc: <stable@vger.kernel.org>
Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")

to the thing? AFAICT the pick_next_task_fair() + idle_balance() dance you
described should still be possible on that commit.


Other than that,

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

[...]


* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
  2019-08-23 23:19   ` Valentin Schneider
@ 2019-08-26 17:38     ` bsegall
  2019-08-27  2:45       ` Liangyan
  0 siblings, 1 reply; 17+ messages in thread
From: bsegall @ 2019-08-26 17:38 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Liangyan, Ingo Molnar, Peter Zijlstra, linux-kernel, shanpeic, xlpang

Valentin Schneider <valentin.schneider@arm.com> writes:

> On 23/08/2019 21:00, bsegall@google.com wrote:
> [...]
>> Could you mention in the message that a throttled cfs_rq can have
>> account_cfs_rq_runtime called on it because it is throttled before
>> idle_balance, and the idle_balance calls update_rq_clock to add time
>> that is accounted to the task.
>> 
>
> Mayhaps even a comment for the extra condition.
>
>> I think this solution is less risky than unthrottling
>> in this area, so other than that:
>> 
>> Reviewed-by: Ben Segall <bsegall@google.com>
>> 
>
> If you don't mind squashing this in:
>
> -----8<-----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b1d9cec9b1ed..b47b0bcf56bc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4630,6 +4630,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  		if (!cfs_rq_throttled(cfs_rq))
>  			goto next;
>  
> +		/* By the above check, this should never be true */
> +		WARN_ON(cfs_rq->runtime_remaining > 0);
> +
> +		/* Pick the minimum amount to return to a positive quota state */
>  		runtime = -cfs_rq->runtime_remaining + 1;
>  		if (runtime > remaining)
>  			runtime = remaining;
> ----->8-----
>
> I'm not adamant about the extra comment, but the WARN_ON would be nice IMO.
>
>
> @Ben, do you reckon we want to strap
>
> Cc: <stable@vger.kernel.org>
> Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
>
> to the thing? AFAICT the pick_next_task_fair() + idle_balance() dance you
> described should still be possible on that commit.

I'm not sure about stable policy in general, but it seems reasonable.
The WARN_ON might want to be WARN_ON_ONCE, and it seems fine to have it
or not.

>
>
> Other than that,
>
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
>
> [...]


* Re: [PATCH] sched/fair: don't assign runtime for throttled cfs_rq
  2019-08-26 17:38     ` bsegall
@ 2019-08-27  2:45       ` Liangyan
  0 siblings, 0 replies; 17+ messages in thread
From: Liangyan @ 2019-08-27  2:45 UTC (permalink / raw)
  To: bsegall, Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, shanpeic, xlpang



On 19/8/27 1:38 AM, bsegall@google.com wrote:
> Valentin Schneider <valentin.schneider@arm.com> writes:
> 
>> On 23/08/2019 21:00, bsegall@google.com wrote:
>> [...]
>>> Could you mention in the message that a throttled cfs_rq can have
>>> account_cfs_rq_runtime called on it because it is throttled before
>>> idle_balance, and the idle_balance calls update_rq_clock to add time
>>> that is accounted to the task.
>>>
>>
>> Mayhaps even a comment for the extra condition.
>>
>>> I think this solution is less risky than unthrottling
>>> in this area, so other than that:
>>>
>>> Reviewed-by: Ben Segall <bsegall@google.com>
>>>
>>
>> If you don't mind squashing this in:
>>
>> -----8<-----
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b1d9cec9b1ed..b47b0bcf56bc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4630,6 +4630,10 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>>   		if (!cfs_rq_throttled(cfs_rq))
>>   			goto next;
>>   
>> +		/* By the above check, this should never be true */
>> +		WARN_ON(cfs_rq->runtime_remaining > 0);
>> +
>> +		/* Pick the minimum amount to return to a positive quota state */
>>   		runtime = -cfs_rq->runtime_remaining + 1;
>>   		if (runtime > remaining)
>>   			runtime = remaining;
>> ----->8-----
>>
>> I'm not adamant about the extra comment, but the WARN_ON would be nice IMO.
>>
>>
>> @Ben, do you reckon we want to strap
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: ec12cb7f31e2 ("sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth")
>>
>> to the thing? AFAICT the pick_next_task_fair() + idle_balance() dance you
>> described should still be possible on that commit.
> 
> I'm not sure about stable policy in general, but it seems reasonable.
> The WARN_ON might want to be WARN_ON_ONCE, and it seems fine to have it
> or not.

Thanks Ben and Valentin for all of the comments. Per Xunlei's suggestion,
I used SCHED_WARN_ON instead in v3. Regarding whether to Cc stable, I'm
also not sure.

> 
>>
>>
>> Other than that,
>>
>> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
>>
>> [...]


Thread overview: 17+ messages
2019-08-14 18:00 [PATCH] sched/fair: don't assign runtime for throttled cfs_rq Liangyan
2019-08-15 16:36 ` Valentin Schneider
     [not found]   ` <7C1833A8-27A4-4755-9B1E-335C20207A66@linux.alibaba.com>
2019-08-16 14:02     ` Valentin Schneider
2019-08-16 14:31       ` Valentin Schneider
     [not found]         ` <02BC41EE-6653-4473-91D4-CDEE53D8703D@linux.alibaba.com>
2019-08-16 17:19           ` Valentin Schneider
2019-08-19 17:34             ` Valentin Schneider
2019-08-20 10:54               ` [PATCH] sched/fair: Add missing unthrottle_cfs_rq() Valentin Schneider
2019-08-22  9:21                 ` Peter Zijlstra
2019-08-22 17:43                   ` bsegall
2019-08-22 18:48                 ` bsegall
2019-08-22 20:40                   ` Valentin Schneider
2019-08-22 21:10                     ` Valentin Schneider
2019-08-23  7:22                   ` Liangyan
2019-08-23 20:00 ` [PATCH] sched/fair: don't assign runtime for throttled cfs_rq bsegall
2019-08-23 23:19   ` Valentin Schneider
2019-08-26 17:38     ` bsegall
2019-08-27  2:45       ` Liangyan
