From: Vladimir Davydov <VDavydov@parallels.com>
To: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	"<devel@openvz.org>" <devel@openvz.org>,
	"<linux-kernel@vger.kernel.org>" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set
Date: Fri, 8 Feb 2013 16:32:43 +0000	[thread overview]
Message-ID: <049D7A4CCB8B694CACDEEB03BE35F306112361EE@MSK-EXCH.sw.swsoft.com> (raw)
In-Reply-To: <BEF8F492-C44F-43F1-AB39-EA498A0063EA@parallels.com>

On Feb 8, 2013, at 7:26 PM, Vladimir Davydov <VDavydov@parallels.com> wrote:

> On Feb 8, 2013, at 6:46 PM, Paul Turner <pjt@google.com> wrote:
> 
>> On Fri, Feb 08, 2013 at 11:10:46AM +0400, Vladimir Davydov wrote:
>>> If cfs_rq->runtime_remaining is <= 0 then either
>>> - cfs_rq is throttled and waiting for quota redistribution, or
>>> - cfs_rq is currently executing and will be throttled on
>>> put_prev_entity, or
>>> - cfs_rq is not throttled and has not executed since its quota was set
>>> (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).
>>> 
>>> It is obvious that the last case is rather an exception to the rule
>>> "runtime_remaining <= 0 iff cfs_rq is throttled or will be throttled
>>> as soon as it finishes its execution". Moreover, it can lead to a
>>> task hang as follows. If put_prev_task is called immediately after
>>> the first pick_next_task after the quota was set, "immediately"
>>> meaning rq->clock in both functions is the same, then the
>>> corresponding cfs_rq will be throttled. Besides being unfair (the
>>> cfs_rq has not actually executed), the quota refilling timer can be
>>> idle at that time, and it won't be activated on put_prev_task,
>>> because update_curr calls account_cfs_rq_runtime, which activates
>>> the timer, only if delta_exec is strictly positive. As a result, we
>>> can get a task "running" inside a throttled cfs_rq which will
>>> probably never be unthrottled.
>>> 
>>> To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
>>> runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
>>> will be throttled only if it has executed for some positive number of
>>> nanoseconds.
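
A paraphrased sketch of the code path the changelog describes, based on the
3.8-era update_curr() in kernel/sched/fair.c (simplified for illustration,
not the verbatim source):

static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_of(cfs_rq)->clock_task;
	unsigned long delta_exec;

	if (unlikely(!curr))
		return;

	delta_exec = (unsigned long)(now - curr->exec_start);
	if (!delta_exec) {
		/*
		 * The clock has not advanced since exec_start: return
		 * before account_cfs_rq_runtime() can arm the bandwidth
		 * timer, even though check_cfs_rq_runtime() will still
		 * see runtime_remaining <= 0 and throttle on put_prev.
		 */
		return;
	}

	__update_curr(cfs_rq, curr, delta_exec);
	curr->exec_start = now;
	account_cfs_rq_runtime(cfs_rq, delta_exec);
}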
>>> --
>>> Several times our customers have encountered such hangs inside a VM
>>> (it seems something is wrong, or at least different, about time
>>> accounting there).
>> 
>> Yeah, looks like!
>> 
>> It's not ultimately _super_ shocking; I can think of a few places where such
>> gremlins could lurk if they caused enough problems for someone to really go
>> digging.
>> 
>>> Analyzing crash dumps revealed that hung tasks were running inside
>>> cfs_rq's, which had the following setup
>>> 
>>> cfs_rq->throttled=1
>>> cfs_rq->runtime_enabled=1
>>> cfs_rq->runtime_remaining=0
>>> cfs_rq->tg->cfs_bandwidth.idle=1
>>> cfs_rq->tg->cfs_bandwidth.timer_active=0
>>> 
>>> which conforms pretty nicely to the explanation given above.
>>> 
>>> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
>>> ---
>>> kernel/sched/core.c |    2 +-
>>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>> 
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 26058d0..c7a078f 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>>> 
>>> 		raw_spin_lock_irq(&rq->lock);
>>> 		cfs_rq->runtime_enabled = runtime_enabled;
>>> -		cfs_rq->runtime_remaining = 0;
>>> +		cfs_rq->runtime_remaining = 1;
>> 
>> So I agree this is reasonably correct and would fix the issue identified.
>> However, one concern is that it would potentially grant a tick of execution
>> time on all cfs_rqs, which could result in large quota violations on a
>> many-core machine; one trick then would be to give them "expired" quota,
>> which would be safe against put_prev_entity->check_cfs_rq_runtime, e.g.
> 
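Paul's example snippet is not preserved in this excerpt. A minimal sketch of
what the "expired quota" trick might look like inside tg_set_cfs_bandwidth(),
assuming the 3.8-era cfs_rq->runtime_expires field and direct rq->clock
access (my illustration of the idea, not Paul's actual code):

	raw_spin_lock_irq(&rq->lock);
	cfs_rq->runtime_enabled = runtime_enabled;
	/*
	 * Grant one nanosecond of already-expired quota:
	 * runtime_remaining > 0 keeps check_cfs_rq_runtime() from
	 * throttling a cfs_rq that has not run yet, while the
	 * expiration time in the past lets expire_cfs_rq_runtime()
	 * reclaim the grant as soon as any real execution time is
	 * accounted against it.
	 */
	cfs_rq->runtime_remaining = 1;
	cfs_rq->runtime_expires = rq->clock - 1;
	raw_spin_unlock_irq(&rq->lock);

This would keep the patch's "never throttle before any execution" behaviour
while avoiding even the small NR_CPUS-nanosecond over-grant discussed below.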

But wait, what "large quota violations" are you talking about? First, the
granted quota is rather ephemeral, because it will be discarded during the
next cfs bandwidth period. Second, we would actually grant only NR_CPUS
nanoseconds, which is just 1 µs on a 1000-CPU host, because the task group
will sleep in the throttled state for every nanosecond consumed over the
granted quota, i.e. there may be a spike in the load which the task group
will eventually pay for. Besides, I guess quota reconfiguration is rather a
rare event, so it should not be a concern.
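
To make the arithmetic explicit (the 100 ms figure below is the kernel's
default CFS bandwidth period, added here for scale):

	over-grant = NR_CPUS * 1 ns = 1000 * 1 ns = 1000 ns = 1 µs

i.e. about a hundred-thousandth of a single 100 ms period, and it is paid
back in full the first time the group runs over its quota.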

Thanks


Thread overview: 5+ messages
2013-02-08  7:10 [PATCH] sched: initialize runtime to non-zero on cfs bw set Vladimir Davydov
2013-02-08 14:46 ` Paul Turner
2013-02-08 15:26   ` Vladimir Davydov
     [not found]   ` <BEF8F492-C44F-43F1-AB39-EA498A0063EA@parallels.com>
2013-02-08 16:32     ` Vladimir Davydov [this message]
2013-02-08 15:17 ` [tip:sched/urgent] sched: Initialize cfs_rq->runtime_remaining to non-zero on cfs bw set tip-bot for Vladimir Davydov
