linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched: initialize runtime to non-zero on cfs bw set
From: Vladimir Davydov @ 2013-02-08  7:10 UTC (permalink / raw)
  To: Peter Zijlstra, Paul Turner; +Cc: Ingo Molnar, devel, linux-kernel

If cfs_rq->runtime_remaining is <= 0 then either
- cfs_rq is throttled and waiting for quota redistribution, or
- cfs_rq is currently executing and will be throttled on
  put_prev_entity, or
- cfs_rq is not throttled and has not executed since its quota was set
  (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).

It is obvious that the last case is rather an exception to the rule
"runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as
soon as it finishes its execution". Moreover, it can lead to a task
hang as follows. If put_prev_task is called immediately after the
first pick_next_task after the quota was set ("immediately" meaning
rq->clock is the same in both functions), the corresponding cfs_rq
will be throttled. Besides being unfair (the cfs_rq has not actually
executed), the quota refill timer can be idle at that time, and it
will not be activated on put_prev_task, because update_curr calls
account_cfs_rq_runtime, which activates the timer, only if delta_exec
is strictly positive. As a result we can get a task "running" inside
a throttled cfs_rq that will probably never be unthrottled.

To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
will be throttled only if it has executed for some positive number of
nanoseconds.
--
Several times our customers have encountered such hangs inside a VM
(it seems time accounting there is either somewhat broken or simply different).
Analyzing crash dumps revealed that hung tasks were running inside
cfs_rq's, which had the following setup

cfs_rq->throttled=1
cfs_rq->runtime_enabled=1
cfs_rq->runtime_remaining=0
cfs_rq->tg->cfs_bandwidth.idle=1
cfs_rq->tg->cfs_bandwidth.timer_active=0

which matches the explanation given above quite nicely.
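
For illustration, below is a toy userspace model of the decision path. It is
not the kernel code (the struct and helper functions are made up and only
encode the rules described above), but it shows why runtime_remaining=0
together with delta_exec=0 ends with a throttled cfs_rq and an idle timer,
while runtime_remaining=1 does not:

#include <stdbool.h>
#include <stdio.h>

/* Toy model only: field names mirror the kernel's, but nothing here is kernel code. */
struct cfs_rq_model {
	long long runtime_remaining;	/* local quota left, in ns */
	bool runtime_enabled;
	bool throttled;
	bool timer_active;		/* models cfs_bandwidth.timer_active */
};

/* Models update_curr(): accounting is skipped entirely when delta_exec is 0. */
static void account(struct cfs_rq_model *cfs_rq, long long delta_exec)
{
	if (delta_exec <= 0)
		return;				/* the timer is never armed on this path */
	cfs_rq->runtime_remaining -= delta_exec;
	if (cfs_rq->runtime_remaining <= 0)
		cfs_rq->timer_active = true;	/* stands in for account_cfs_rq_runtime arming the timer */
}

/* Models put_prev_entity(): throttle on non-positive local runtime. */
static void put_prev(struct cfs_rq_model *cfs_rq, long long delta_exec)
{
	account(cfs_rq, delta_exec);
	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
		cfs_rq->throttled = true;
}

int main(void)
{
	/* tg_set_cfs_bandwidth() with the current initialization ... */
	struct cfs_rq_model before = { .runtime_remaining = 0, .runtime_enabled = true };
	/* ... and with the patched one. */
	struct cfs_rq_model after  = { .runtime_remaining = 1, .runtime_enabled = true };

	/* pick_next_task and put_prev_task at the same rq->clock: delta_exec == 0 */
	put_prev(&before, 0);
	put_prev(&after, 0);

	printf("before: throttled=%d timer_active=%d\n", before.throttled, before.timer_active);
	printf("after:  throttled=%d timer_active=%d\n", after.throttled, after.timer_active);
	return 0;
}

This prints throttled=1 timer_active=0 for the current initialization, i.e.
exactly the state seen in the crash dumps, and throttled=0 for the patched one.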

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 kernel/sched/core.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..c7a078f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
-- 
1.7.1



* Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set
From: Paul Turner @ 2013-02-08 14:46 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Peter Zijlstra, Ingo Molnar, devel, linux-kernel

On Fri, Feb 08, 2013 at 11:10:46AM +0400, Vladimir Davydov wrote:
> If cfs_rq->runtime_remaining is <= 0 then either
> - cfs_rq is throttled and waiting for quota redistribution, or
> - cfs_rq is currently executing and will be throttled on
>   put_prev_entity, or
> - cfs_rq is not throttled and has not executed since its quota was set
>   (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).
> 
> It is obvious that the last case is rather an exception from the rule
> "runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as
> soon as it finishes its execution". Moreover, it can lead to a task hang
> as follows. If put_prev_task is called immediately after first
> pick_next_task after quota was set, "immediately" meaning rq->clock in
> both functions is the same, then the corresponding cfs_rq will be
> throttled. Besides being unfair (the cfs_rq has not executed in fact),
> the quota refilling timer can be idle at that time and it won't be
> activated on put_prev_task because update_curr calls
> account_cfs_rq_runtime, which activates the timer, only if delta_exec is
> strictly positive. As a result we can get a task "running" inside a
> throttled cfs_rq which will probably never be unthrottled.
> 
> To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
> runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
> will be throttled only if it has executed for some positive number of
> nanoseconds.
> --
> Several times we had our customers encountered such hangs inside a VM
> (seems something is wrong or rather different in time accounting there).

Yeah, looks like!

It's not ultimately _super_ shocking; I can think of a few places where such
gremlins could lurk if they caused enough problems for someone to really go
digging.

> Analyzing crash dumps revealed that hung tasks were running inside
> cfs_rq's, which had the following setup
> 
> cfs_rq->throttled=1
> cfs_rq->runtime_enabled=1
> cfs_rq->runtime_remaining=0
> cfs_rq->tg->cfs_bandwidth.idle=1
> cfs_rq->tg->cfs_bandwidth.timer_active=0
> 
> which conforms pretty nice to the explanation given above.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
>  kernel/sched/core.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..c7a078f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>  
>  		raw_spin_lock_irq(&rq->lock);
>  		cfs_rq->runtime_enabled = runtime_enabled;
> -		cfs_rq->runtime_remaining = 0;
> +		cfs_rq->runtime_remaining = 1;

So I agree this is reasonably correct and would fix the issue identified.
However, one concern is that it would potentially grant a tick of execution
time on all cfs_rqs, which could result in large quota violations on a
many-core machine; one trick then would be to give them "expired" quota,
which would be safe against put_prev_entity->check_cfs_runtime, e.g.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..4369231 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7687,7 +7687,17 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		/*
+		 * On re-definition of bandwidth values we allocate a trivial
+		 * amount of already expired quota.  This guarantees that
+		 * put_prev_entity() cannot lead to a throttle event before we
+		 * have seen a call to account_cfs_runtime(); while not being
+		 * usable by newly waking, or set_curr_task_fair-ing, cpus
+		 * since it would be immediately expired, requiring
+		 * reassignment.
+		 */
+		cfs_rq->runtime_remaining = 1;
+		cfs_rq->runtime_expires = rq_of(cfs_rq)->clock - 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);

A perhaps more explicit approach that should be more consistent would be to
properly allocate bandwidth in the first place.  Something like (compile
tested):

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..9646c01 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7682,6 +7682,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
+		bool exhausted = false;
 		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
 		struct rq *rq = cfs_rq->rq;
 
@@ -7689,9 +7690,27 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 
+		/*
+		 * We know there's bandwidth remaining (since this loop would
+		 * have otherwise terminated) we can unthrottle up-front.
+		 */
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
+
+		if (cfs_rq->curr) {
+			/* cfs_rq is currently running, force an update */
+			account_cfs_rq_runtime(cfs_rq, 0);
+			/* If we were unable to allocate runtime then:
+			 * (a) We've sent a reschedule against cpu i
+			 * (b) There is no point in visiting further cpus as we
+			 *     have exhausted our new quota.
+			 */
+			if (!cfs_rq->runtime_remaining)
+				exhausted = true;
+		}
 		raw_spin_unlock_irq(&rq->lock);
+		if (exhausted)
+			break;
 	}
 out_unlock:
 	mutex_unlock(&cfs_constraints_mutex);


That said, I actually thought of the first patch (i.e. explicitly using expired
quota) after I wrote the second. It's perhaps more subtle, but not
unreasonable. Any thoughts?

Thanks for the report,

- Paul

* [tip:sched/urgent] sched: Initialize cfs_rq->runtime_remaining to non-zero on cfs bw set
From: tip-bot for Vladimir Davydov @ 2013-02-08 15:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, pjt, peterz, devel, tglx, vdavydov

Commit-ID:  0a702bb8af3c1b2dff355fb3c27e7f7d5285e30b
Gitweb:     http://git.kernel.org/tip/0a702bb8af3c1b2dff355fb3c27e7f7d5285e30b
Author:     Vladimir Davydov <vdavydov@parallels.com>
AuthorDate: Fri, 8 Feb 2013 11:10:46 +0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 8 Feb 2013 15:14:38 +0100

sched: Initialize cfs_rq->runtime_remaining to non-zero on cfs bw set

If cfs_rq->runtime_remaining is <= 0 then either

 - cfs_rq is throttled and waiting for quota redistribution, or
 - cfs_rq is currently executing and will be throttled on put_prev_entity, or
 - cfs_rq is not throttled and has not executed since its quota was set
   (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).

It is obvious that the last case is rather an exception to the
rule "runtime_remaining<=0 iff cfs_rq is throttled or will be
throttled as soon as it finishes its execution".

Moreover, it can lead to a task hang as follows. If
put_prev_task() is called immediately after the first
pick_next_task after the quota was set ("immediately" meaning
rq->clock is the same in both functions), the corresponding
cfs_rq will be throttled.

Besides being unfair (the cfs_rq has not actually executed), the
quota refill timer can be idle at that time, and it will not be
activated on put_prev_task, because update_curr calls
account_cfs_rq_runtime, which activates the timer, only if
delta_exec is strictly positive. As a result we can get a task
"running" inside a throttled cfs_rq that will probably never be
unthrottled.

To avoid the problem, the patch makes tg_set_cfs_bandwidth
initialize runtime_remaining of each cfs_rq to 1 instead of 0 so
that the cfs_rq will be throttled only if it has executed for
some positive number of nanoseconds.

Several times our customers have encountered such hangs inside
a VM (it seems time accounting there is either somewhat broken
or simply different). Analyzing crash dumps revealed that hung
tasks were running inside cfs_rq's, which had the following
setup:

 cfs_rq->throttled=1
 cfs_rq->runtime_enabled=1
 cfs_rq->runtime_remaining=0
 cfs_rq->tg->cfs_bandwidth.idle=1
 cfs_rq->tg->cfs_bandwidth.timer_active=0

which matches the explanation given above quite nicely.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: <devel@openvz.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/1360307446-26978-1-git-send-email-vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..c7a078f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);


* Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set
From: Vladimir Davydov @ 2013-02-08 15:26 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Ingo Molnar, <devel@openvz.org>,
	<linux-kernel@vger.kernel.org>

On Feb 8, 2013, at 6:46 PM, Paul Turner <pjt@google.com> wrote:

> On Fri, Feb 08, 2013 at 11:10:46AM +0400, Vladimir Davydov wrote:
>> [...]
>> --
>> Several times we had our customers encountered such hangs inside a VM
>> (seems something is wrong or rather different in time accounting there).
> 
> Yeah, looks like!
> 
> It's not ultimately _super_ shocking; I can think of a few  places where such
> gremlins could lurk if they caused enough problems for someone to really go
> digging.
> 
>> [...]
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 26058d0..c7a078f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> 
>> 		raw_spin_lock_irq(&rq->lock);
>> 		cfs_rq->runtime_enabled = runtime_enabled;
>> -		cfs_rq->runtime_remaining = 0;
>> +		cfs_rq->runtime_remaining = 1;
> 
> So I agree this is reasonably correct and would fix the issue identified.
> However, one concern is that it would potentially grant a tick of execution
> time on all cfs_rqs which could result in large quota violations on a many core
> machine; one trick then would be to give them "expired" quota; which would be
> safe against put_prev_entity->check_cfs_runtime, e.g.

Yeah, I missed that. Thank you for the update.

> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1dff78a..4369231 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7687,7 +7687,17 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> 
> 		raw_spin_lock_irq(&rq->lock);
> 		cfs_rq->runtime_enabled = runtime_enabled;
> -		cfs_rq->runtime_remaining = 0;
> +		/*
> +		 * On re-definition of bandwidth values we allocate a trivial
> +		 * amount of already expired quota.  This guarantees that
> +		 * put_prev_entity() cannot lead to a throttle event before we
> +		 * have seen a call to account_cfs_runtime(); while not being
> +		 * usable by newly waking, or set_curr_task_fair-ing, cpus
> +		 * since it would be immediately expired, requiring
> +		 * reassignment.
> +		 */
> +		cfs_rq->runtime_remaining = 1;
> +		cfs_rq->runtime_expires = rq_of(cfs_rq)->clock - 1;
> 
> 		if (cfs_rq->throttled)
> 			unthrottle_cfs_rq(cfs_rq);
> 
> A perhaps more explicit approach that should be more consistent would be to
> properly allocate bandwidth in the first place.  Something like (compile
> tested):

I may be mistaken, but I think it isn't quite right. The point is that the task
group can be idle when its quota is reconfigured, i.e. no cfs_rq is throttled
and no cfs_rq is currently executing. In that case the hunk added by your second
patch is skipped, and if a task is then enqueued onto this group's cfs_rq and
yields before any clock update, we will face exactly the same situation: a
running task in a throttled group.
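
To make that sequence concrete, here is the same kind of toy sketch as in my
patch mail (purely illustrative; it reuses the hypothetical cfs_rq_model and
put_prev helpers from there and is not kernel code):

	/* The group is idle when the quota is reconfigured. */
	struct cfs_rq_model idle = { .runtime_remaining = 0, .runtime_enabled = true };

	/*
	 * tg_set_cfs_bandwidth(): cfs_rq->curr == NULL here, so the forced
	 * account_cfs_rq_runtime(cfs_rq, 0) added by your second patch is
	 * skipped and runtime_remaining stays 0.
	 */

	/* A task is then enqueued, picked, and yields before rq->clock advances: */
	put_prev(&idle, 0);	/* delta_exec == 0: throttled again, timer still idle */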

Anyway, I vote for your first patch. IMO, it should work fine, and it definitely
looks much clearer.

Would you mind if I added you to the 'signed-off-by' field and resent the patch?

Thank you again for the comment.

> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1dff78a..9646c01 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7682,6 +7682,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> 	raw_spin_unlock_irq(&cfs_b->lock);
> 
> 	for_each_possible_cpu(i) {
> +		bool exhausted = false;
> 		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> 		struct rq *rq = cfs_rq->rq;
> 
> @@ -7689,9 +7690,27 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> 		cfs_rq->runtime_enabled = runtime_enabled;
> 		cfs_rq->runtime_remaining = 0;
> 
> +		/*
> +		 * We know there's bandwidth remaining (since this loop would
> +		 * have otherwise terminated) we can unthrottle up-front.
> +		 */
> 		if (cfs_rq->throttled)
> 			unthrottle_cfs_rq(cfs_rq);
> +
> +		if (cfs_rq->curr) {
> +			/* cfs_rq is currently running, force an update */
> +			account_cfs_rq_runtime(cfs_rq, 0);
> +			/* If we were unable to allocate runtime then:
> +			 * (a) We've sent a reschedule against cpu i
> +			 * (b) There is no point in visiting further cpus as we
> +			 *     have exhausted our new quota.
> +			 */
> +			if (!cfs_rq->runtime_remaining)
> +				exhausted = true;
> +		}
> 		raw_spin_unlock_irq(&rq->lock);
> +		if (exhausted)
> +			break;
> 	}
> out_unlock:
> 	mutex_unlock(&cfs_constraints_mutex);
> 
> 
> That said I actually thought of the first patch (e.g. explicitly using expired
> quota) after I wrote the second.  It's perhaps more subtle; but not
> unreasonable.  Any thoughts?
> 
> Thanks for the report,
> 
> - Paul

* Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set
From: Vladimir Davydov @ 2013-02-08 16:32 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Ingo Molnar, <devel@openvz.org>,
	<linux-kernel@vger.kernel.org>

On Feb 8, 2013, at 7:26 PM, Vladimir Davydov <VDavydov@parallels.com> wrote:

> On Feb 8, 2013, at 6:46 PM, Paul Turner <pjt@google.com> wrote:
> 
>> [...]
>> So I agree this is reasonably correct and would fix the issue identified.
>> However, one concern is that it would potentially grant a tick of execution
>> time on all cfs_rqs which could result in large quota violations on a many core
>> machine; one trick then would be to give them "expired" quota; which would be
>> safe against put_prev_entity->check_cfs_runtime, e.g.
> 

But wait, what "large quota violations" are you talking about? First, the granted
quota is rather ephemeral, because it will be forgotten at the next cfs bandwidth
period refresh. Second, we will actually grant only NR_CPUS nanoseconds, which is
just 1 us for a 1000-cpu host, because the task group will sleep in the throttled
state for each nanosecond consumed over the granted quota, i.e. there may be a
brief spike in the load, which the task group will eventually pay for. Besides,
I guess quota reconfiguration is rather a rare event, so it should not be a
concern.

Thanks

