* [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
@ 2018-06-18  9:16 Xunlei Pang
  2018-06-18 18:58 ` bsegall
  2018-06-19  4:36 ` Cong Wang
  0 siblings, 2 replies; 8+ messages in thread
From: Xunlei Pang @ 2018-06-18  9:16 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Ben Segall; +Cc: linux-kernel

I noticed that the group frequently got throttled even though it
consumed little cpu time; this caused some jitters in the response
time of some of our business containers that enable cpu quota.

It's very easy to reproduce:
mkdir /sys/fs/cgroup/cpu/test
cd /sys/fs/cgroup/cpu/test
echo 100000 > cpu.cfs_quota_us
echo $$ > tasks
then repeat:
cat cpu.stat |grep nr_throttled  // nr_throttled will increase

After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after the period timer is re-armed.

The global expiration should be advanced accordingly when the
bandwidth period timer is restarted.

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 kernel/sched/fair.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f384264e832..bb006e671609 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
-	if (!cfs_b->period_active) {
-		cfs_b->period_active = 1;
-		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
-	}
+	if (cfs_b->period_active)
+		return;
+
+	cfs_b->period_active = 1;
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
+	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-- 
2.14.1.40.g8e62ba1



* Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-18  9:16 [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted Xunlei Pang
@ 2018-06-18 18:58 ` bsegall
  2018-06-19  3:14   ` Xunlei Pang
  2018-06-19  4:36 ` Cong Wang
  1 sibling, 1 reply; 8+ messages in thread
From: bsegall @ 2018-06-18 18:58 UTC (permalink / raw)
  To: Xunlei Pang; +Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, linux-kernel

Xunlei Pang <xlpang@linux.alibaba.com> writes:

> I noticed the group frequently got throttled even it consumed
> low cpu usage, this caused some jitters on the response time
> to some of our business containers enabling cpu quota.
>
> It's very easy to reproduce:
> mkdir /sys/fs/cgroup/cpu/test
> cd /sys/fs/cgroup/cpu/test
> echo 100000 > cpu.cfs_quota_us
> echo $$ > tasks
> then repeat:
> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>
> After some analysis, we found that cfs_rq::runtime_remaining will
> be cleared by expire_cfs_rq_runtime() due to two equal but stale
> "cfs_{b|q}->runtime_expires" after period timer is re-armed.

If this is after the first patch, then this is no longer what should
happen, and instead it would incorrectly /keep/ old local cfs_rq
runtime, and not __refill global runtime until the next period.
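
(If I'm reading patch 1 right, the drift check becomes a sequence
comparison, something along the lines of:

	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
		/* extend local deadline, drift is bounded above by 2 ticks */
		cfs_rq->runtime_expires += TICK_NSEC;
	} else {
		/* global deadline is ahead, expiration has passed */
		cfs_rq->runtime_remaining = 0;
	}

so re-arming the timer without bumping expires_seq leaves the two
sequence numbers equal and we keep extending the stale local runtime.)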


>
> The global expiration should be advanced accordingly when the
> bandwidth period timer is restarted.
>
> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
> ---
>  kernel/sched/fair.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9f384264e832..bb006e671609 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  
>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  {
> +	u64 overrun;
> +
>  	lockdep_assert_held(&cfs_b->lock);
>  
> -	if (!cfs_b->period_active) {
> -		cfs_b->period_active = 1;
> -		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> -		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> -	}
> +	if (cfs_b->period_active)
> +		return;
> +
> +	cfs_b->period_active = 1;
> +	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> +	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);

I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
much like tg_set_cfs_bandwidth
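
Roughly what I mean (sketch only, mirroring tg_set_cfs_bandwidth()):

	cfs_b->period_active = 1;
	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
	if (overrun)
		__refill_cfs_bandwidth_runtime(cfs_b);
	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);

where __refill resets cfs_b->runtime to the full quota and pushes
cfs_b->runtime_expires to now + period.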

> +	cfs_b->expires_seq++;
> +	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>  
>  static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)


* Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-18 18:58 ` bsegall
@ 2018-06-19  3:14   ` Xunlei Pang
  2018-06-19 17:49     ` bsegall
  0 siblings, 1 reply; 8+ messages in thread
From: Xunlei Pang @ 2018-06-19  3:14 UTC (permalink / raw)
  To: bsegall; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

On 6/19/18 2:58 AM, bsegall@google.com wrote:
> Xunlei Pang <xlpang@linux.alibaba.com> writes:
> 
>> I noticed the group frequently got throttled even it consumed
>> low cpu usage, this caused some jitters on the response time
>> to some of our business containers enabling cpu quota.
>>
>> It's very easy to reproduce:
>> mkdir /sys/fs/cgroup/cpu/test
>> cd /sys/fs/cgroup/cpu/test
>> echo 100000 > cpu.cfs_quota_us
>> echo $$ > tasks
>> then repeat:
>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
> 
> If this is after the first patch, then this is no longer what should
> happen, and instead it would incorrectly /keep/ old local cfs_rq
> runtime, and not __refill global runtime until the period.

Exactly, I can improve the changelog.

> 
> 
>>
>> The global expiration should be advanced accordingly when the
>> bandwidth period timer is restarted.
>>
>> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
>> ---
>>  kernel/sched/fair.c | 15 ++++++++++-----
>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9f384264e832..bb006e671609 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  
>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>  {
>> +	u64 overrun;
>> +
>>  	lockdep_assert_held(&cfs_b->lock);
>>  
>> -	if (!cfs_b->period_active) {
>> -		cfs_b->period_active = 1;
>> -		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> -		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>> -	}
>> +	if (cfs_b->period_active)
>> +		return;
>> +
>> +	cfs_b->period_active = 1;
>> +	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> +	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> 
> I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
> much like tg_set_cfs_bandwidth

It's a little different, e.g. periods like below:

           Pm
           |
           v
  |-----|-----|
Pn-1   Pn    Pn+1

Assuming we call start_cfs_bandwidth() at some moment Pm between "Pn"
and "Pn+1", the cfs_b->runtime_expires we want here is "Pn+1", while
__refill_cfs_bandwidth_runtime() would force it to "Pm+period" (even
though the hrtimer expires at "Pn+1").
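
Concretely, with made-up numbers: say the period is 100ms and the
boundaries fall at Pn=1100ms and Pn+1=1200ms. If start_cfs_bandwidth()
runs at Pm=1150ms, the hrtimer will next fire (and refill) at 1200ms,
so runtime_expires should also become 1200ms; __refill would instead
set it to 1150ms+100ms=1250ms, out of sync with the timer.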

Regards,
Xunlei

> 
>> +	cfs_b->expires_seq++;
>> +	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>  }
>>  
>>  static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)


* Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-18  9:16 [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted Xunlei Pang
  2018-06-18 18:58 ` bsegall
@ 2018-06-19  4:36 ` Cong Wang
  2018-06-19  5:46   ` Xunlei Pang
  1 sibling, 1 reply; 8+ messages in thread
From: Cong Wang @ 2018-06-19  4:36 UTC (permalink / raw)
  To: Xunlei Pang; +Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, LKML

On Mon, Jun 18, 2018 at 2:16 AM, Xunlei Pang <xlpang@linux.alibaba.com> wrote:
> I noticed the group frequently got throttled even it consumed
> low cpu usage, this caused some jitters on the response time
> to some of our business containers enabling cpu quota.
>
> It's very easy to reproduce:
> mkdir /sys/fs/cgroup/cpu/test
> cd /sys/fs/cgroup/cpu/test
> echo 100000 > cpu.cfs_quota_us
> echo $$ > tasks
> then repeat:
> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>
> After some analysis, we found that cfs_rq::runtime_remaining will
> be cleared by expire_cfs_rq_runtime() due to two equal but stale
> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>
> The global expiration should be advanced accordingly when the
> bandwidth period timer is restarted.
>

I observed the same problem and already sent some patches:

https://lkml.org/lkml/2018/5/22/37
https://lkml.org/lkml/2018/5/22/38
https://lkml.org/lkml/2018/5/22/35


* Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-19  4:36 ` Cong Wang
@ 2018-06-19  5:46   ` Xunlei Pang
  0 siblings, 0 replies; 8+ messages in thread
From: Xunlei Pang @ 2018-06-19  5:46 UTC (permalink / raw)
  To: Cong Wang; +Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, LKML

On 6/19/18 12:36 PM, Cong Wang wrote:
> On Mon, Jun 18, 2018 at 2:16 AM, Xunlei Pang <xlpang@linux.alibaba.com> wrote:
>> I noticed the group frequently got throttled even it consumed
>> low cpu usage, this caused some jitters on the response time
>> to some of our business containers enabling cpu quota.
>>
>> It's very easy to reproduce:
>> mkdir /sys/fs/cgroup/cpu/test
>> cd /sys/fs/cgroup/cpu/test
>> echo 100000 > cpu.cfs_quota_us
>> echo $$ > tasks
>> then repeat:
>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>>
>> The global expiration should be advanced accordingly when the
>> bandwidth period timer is restarted.
>>
> 
> I observed the same problem and already sent some patches:
> 
> https://lkml.org/lkml/2018/5/22/37
> https://lkml.org/lkml/2018/5/22/38
> https://lkml.org/lkml/2018/5/22/35
> 

Looks like they are related to the large slice setting and unused
slack with a large number of cpus; the issue I described is a little
different.


* Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-19  3:14   ` Xunlei Pang
@ 2018-06-19 17:49     ` bsegall
  2018-06-20  6:06       ` Xunlei Pang
  0 siblings, 1 reply; 8+ messages in thread
From: bsegall @ 2018-06-19 17:49 UTC (permalink / raw)
  To: Xunlei Pang; +Cc: bsegall, Peter Zijlstra, Ingo Molnar, linux-kernel

Xunlei Pang <xlpang@linux.alibaba.com> writes:

> On 6/19/18 2:58 AM, bsegall@google.com wrote:
>> Xunlei Pang <xlpang@linux.alibaba.com> writes:
>> 
>>> I noticed the group frequently got throttled even it consumed
>>> low cpu usage, this caused some jitters on the response time
>>> to some of our business containers enabling cpu quota.
>>>
>>> It's very easy to reproduce:
>>> mkdir /sys/fs/cgroup/cpu/test
>>> cd /sys/fs/cgroup/cpu/test
>>> echo 100000 > cpu.cfs_quota_us
>>> echo $$ > tasks
>>> then repeat:
>>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>>
>>> After some analysis, we found that cfs_rq::runtime_remaining will
>>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>> 
>> If this is after the first patch, then this is no longer what should
>> happen, and instead it would incorrectly /keep/ old local cfs_rq
>> runtime, and not __refill global runtime until the period.
>
> Exactly, I can improve the changelog.
>
>> 
>> 
>>>
>>> The global expiration should be advanced accordingly when the
>>> bandwidth period timer is restarted.
>>>
>>> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
>>> ---
>>>  kernel/sched/fair.c | 15 ++++++++++-----
>>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 9f384264e832..bb006e671609 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>>  
>>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>>  {
>>> +	u64 overrun;
>>> +
>>>  	lockdep_assert_held(&cfs_b->lock);
>>>  
>>> -	if (!cfs_b->period_active) {
>>> -		cfs_b->period_active = 1;
>>> -		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>> -		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>> -	}
>>> +	if (cfs_b->period_active)

>>> +		return;
>>> +
>>> +	cfs_b->period_active = 1;
>>> +	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>> +	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
>> 
>> I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
>> much like tg_set_cfs_bandwidth
>
> It's a little different, e.g. periods like below:
>
>            Pm
>            |
>            v
>   |-----|-----|
> Pn-1   Pn    Pn+1
>
> Assuming we start_cfs_bandwidth() at some moment Pm between "Pn" and
> "Pn+1", cfs_b->runtime_expires we want here should be "Pn+1", while
> __refill_cfs_bandwidth_runtime() forces it to be "Pm+period"(hrtimer
> expires at "Pn+1").
>
> Regards,
> Xunlei
>

Hmm, yeah, it would wind up out of sync with the hrtimer unless we
restarted the period boundaries.

Now that I look at it even more closely, I don't see how you're getting
problems like that, besides wasting tiny amounts of cpu time in expire:
we only disable the period timer when there has been a full period with
no activity, so we should have a full cfs_b->runtime, and a
runtime_expires from the period that was spent entirely idle (so it
should always be in the past). But with a working extend-local-deadline
check like after patch 1 it shouldn't /matter/ what the actual value of
runtime_expires is, we'll just waste tiny amounts of cpu time
incrementing runtime_expires every account_cfs_rq_runtime call rather
than aborting early.
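
(For reference, the accounting path I mean is roughly this, from
memory:

static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
	/* dock delta_exec before expiring quota (as it could span periods) */
	cfs_rq->runtime_remaining -= delta_exec;
	expire_cfs_rq_runtime(cfs_rq);

	if (likely(cfs_rq->runtime_remaining > 0))
		return;

	/*
	 * if we're unable to extend our runtime we resched so that the active
	 * hierarchy can be throttled
	 */
	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
		resched_curr(rq_of(cfs_rq));
}

so a stale runtime_expires only costs a little extra work in
expire_cfs_rq_runtime() here, not an actual throttle.)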

Actually, I think arguably the real issue may be in patch 1:

>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  	 * spread between our sched_clock and the one on which runtime was
>>  	 * issued.
>>  	 */
>> -	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>> +	if ((s64)(expires - cfs_rq->runtime_expires) > 0) {
>>  		cfs_rq->runtime_expires = expires;
>> +		cfs_rq->expires_seq = expires_seq;
>> +	}
>>  
>>  	return cfs_rq->runtime_remaining > 0;
>>  }


We don't want to clobber a tick-extended runtime_expires with the
original one, but we /do/ want to update it if expires_seq changes, no
matter what (rather than skipping updating both if we've tick-extended
past the /next/ period's runtime_expires).

I think this should actually be

if (cfs_rq->expires_seq != expires_seq) {
	cfs_rq->runtime_expires = expires;
	cfs_rq->expires_seq = expires_seq;
}

Can you easily check and see if this fixes the issue without this second
patch? That would clear up how the actual problem is happening, and I
think it's a change we'd want anyways.

That said, we should also just update runtime_expires like this patch
does regardless, and as mentioned we should have a full cfs_b->runtime
so we don't need a __refill like I was originally thinking.


* Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-19 17:49     ` bsegall
@ 2018-06-20  6:06       ` Xunlei Pang
  2018-06-20 10:26         ` Xunlei Pang
  0 siblings, 1 reply; 8+ messages in thread
From: Xunlei Pang @ 2018-06-20  6:06 UTC (permalink / raw)
  To: bsegall; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

On 6/20/18 1:49 AM, bsegall@google.com wrote:
> Xunlei Pang <xlpang@linux.alibaba.com> writes:
> 
>> On 6/19/18 2:58 AM, bsegall@google.com wrote:
>>> Xunlei Pang <xlpang@linux.alibaba.com> writes:
>>>
>>>> I noticed the group frequently got throttled even it consumed
>>>> low cpu usage, this caused some jitters on the response time
>>>> to some of our business containers enabling cpu quota.
>>>>
>>>> It's very easy to reproduce:
>>>> mkdir /sys/fs/cgroup/cpu/test
>>>> cd /sys/fs/cgroup/cpu/test
>>>> echo 100000 > cpu.cfs_quota_us
>>>> echo $$ > tasks
>>>> then repeat:
>>>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>>>
>>>> After some analysis, we found that cfs_rq::runtime_remaining will
>>>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>>>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>>>
>>> If this is after the first patch, then this is no longer what should
>>> happen, and instead it would incorrectly /keep/ old local cfs_rq
>>> runtime, and not __refill global runtime until the period.
>>
>> Exactly, I can improve the changelog.
>>
>>>
>>>
>>>>
>>>> The global expiration should be advanced accordingly when the
>>>> bandwidth period timer is restarted.
>>>>
>>>> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
>>>> ---
>>>>  kernel/sched/fair.c | 15 ++++++++++-----
>>>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 9f384264e832..bb006e671609 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>>>  
>>>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>>>  {
>>>> +	u64 overrun;
>>>> +
>>>>  	lockdep_assert_held(&cfs_b->lock);
>>>>  
>>>> -	if (!cfs_b->period_active) {
>>>> -		cfs_b->period_active = 1;
>>>> -		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>> -		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>>> -	}
>>>> +	if (cfs_b->period_active)
> 
>>>> +		return;
>>>> +
>>>> +	cfs_b->period_active = 1;
>>>> +	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>> +	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
>>>
>>> I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
>>> much like tg_set_cfs_bandwidth
>>
>> It's a little different, e.g. periods like below:
>>
>>            Pm
>>            |
>>            v
>>   |-----|-----|
>> Pn-1   Pn    Pn+1
>>
>> Assuming we start_cfs_bandwidth() at some moment Pm between "Pn" and
>> "Pn+1", cfs_b->runtime_expires we want here should be "Pn+1", while
>> __refill_cfs_bandwidth_runtime() forces it to be "Pm+period"(hrtimer
>> expires at "Pn+1").
>>
>> Regards,
>> Xunlei
>>
> 
> Hmm, yeah, it would wind up out of sync with the hrtimer unless we
> restarted the period boundaries.
> 
> Now that I look at it even more closely, I don't see how you're getting
> problems like that, besides wasting tiny amounts of cpu time in expire:
> we only disable the period timer when there has been a full period with
> no activity, so we should have a full cfs_b->runtime, and a
> runtime_expires from the period that was spent entirely idle (so it
> should always be in the past). But with a working extend-local-deadline
> check like after patch 1 it shouldn't /matter/ what the actual value of
> runtime_expires is, we'll just waste tiny amounts of cpu time
> incrementing runtime_expires every account_cfs_rq_runtime call rather
> than aborting early.
> 
> Actually, I think arguably the real issue may be in patch 1:

Ben,

Thanks for the comments.

Yes, the problematic throttle issue is solved by patch 1; patch 2 is
about a different issue. It addresses another kind of stale runtime
(I will update the changelog in v2), specifically:
An ever-running cfs_rq tends to retain some runtime (min_cfs_rq_runtime),
and it can be even more (sysctl_sched_cfs_bandwidth_slice) in the worst
case; this runtime should be discarded once the period is advanced.
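
(For reference, the min_cfs_rq_runtime retention comes from the slack
return path on dequeue, roughly like this, trimmed:

/* a cfs_rq won't donate quota below this amount */
static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;

static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;

	if (slack_runtime <= 0)
		return;

	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota != RUNTIME_INF &&
	    cfs_rq->runtime_expires == cfs_b->runtime_expires)
		cfs_b->runtime += slack_runtime;
	raw_spin_unlock(&cfs_b->lock);

	/* even if it's not valid for return we don't want to touch it */
	cfs_rq->runtime_remaining -= slack_runtime;
}

so at least min_cfs_rq_runtime always stays on the cfs_rq across the
idle period.)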

Without this patch, expire_cfs_rq_runtime() will see two equal
cfs_{rq|b}->expires_seq values after the period gets restarted
following some idle time, so it will fall into the "clock drift"
branch and go on consuming the old runtime, if any.

With the expiration update done in patch 2, we can ensure the stale
cfs_rq->runtime_remaining gets cleared at expire_cfs_rq_runtime()
once a new period begins.

> 
>>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>>  	 * spread between our sched_clock and the one on which runtime was
>>>  	 * issued.
>>>  	 */
>>> -	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>>> +	if ((s64)(expires - cfs_rq->runtime_expires) > 0) {
>>>  		cfs_rq->runtime_expires = expires;
>>> +		cfs_rq->expires_seq = expires_seq;
>>> +	}
>>>  
>>>  	return cfs_rq->runtime_remaining > 0;
>>>  }
> 
> 
> We don't want to clobber a tick-extended runtime_expires with the
> original one, but we /do/ want update if expires_seq changes, no matter
> what (rather than skipping updating both if we've tick-extended past the
> /next/ period's runtime_expires).
> 
> I think this should actually be
> 
> if (cfs_rq->expires_seq != expires_seq) {
> 	cfs_rq->runtime_expires = expires;
> 	cfs_rq->expires_seq = expires_seq;
> }
> 
> Can you easily check and see if this fixes the issue without this second
> patch? That would clear up how the actual problem is happening, and I
> think it's a change we'd want anyways.

Agree, I will have it in v2.

> 
> That said, we should also just update runtime_expires like this patch
> does regardless, and as mentioned we should have a full cfs_b->runtime
> so we don't need a __refill like I was originally thinking.
>
Regards,
Xunlei


* Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-20  6:06       ` Xunlei Pang
@ 2018-06-20 10:26         ` Xunlei Pang
  0 siblings, 0 replies; 8+ messages in thread
From: Xunlei Pang @ 2018-06-20 10:26 UTC (permalink / raw)
  To: bsegall; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

On 6/20/18 2:06 PM, Xunlei Pang wrote:
> On 6/20/18 1:49 AM, bsegall@google.com wrote:
>> Xunlei Pang <xlpang@linux.alibaba.com> writes:
>>
>>> On 6/19/18 2:58 AM, bsegall@google.com wrote:
>>>> Xunlei Pang <xlpang@linux.alibaba.com> writes:
>>>>
>>>>> I noticed the group frequently got throttled even it consumed
>>>>> low cpu usage, this caused some jitters on the response time
>>>>> to some of our business containers enabling cpu quota.
>>>>>
>>>>> It's very easy to reproduce:
>>>>> mkdir /sys/fs/cgroup/cpu/test
>>>>> cd /sys/fs/cgroup/cpu/test
>>>>> echo 100000 > cpu.cfs_quota_us
>>>>> echo $$ > tasks
>>>>> then repeat:
>>>>> cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>>>>
>>>>> After some analysis, we found that cfs_rq::runtime_remaining will
>>>>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>>>>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>>>>
>>>> If this is after the first patch, then this is no longer what should
>>>> happen, and instead it would incorrectly /keep/ old local cfs_rq
>>>> runtime, and not __refill global runtime until the period.
>>>
>>> Exactly, I can improve the changelog.
>>>
>>>>
>>>>
>>>>>
>>>>> The global expiration should be advanced accordingly when the
>>>>> bandwidth period timer is restarted.
>>>>>
>>>>> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
>>>>> ---
>>>>>  kernel/sched/fair.c | 15 ++++++++++-----
>>>>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 9f384264e832..bb006e671609 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>>>>  
>>>>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>>>>  {
>>>>> +	u64 overrun;
>>>>> +
>>>>>  	lockdep_assert_held(&cfs_b->lock);
>>>>>  
>>>>> -	if (!cfs_b->period_active) {
>>>>> -		cfs_b->period_active = 1;
>>>>> -		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>>> -		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>>>> -	}
>>>>> +	if (cfs_b->period_active)
>>
>>>>> +		return;
>>>>> +
>>>>> +	cfs_b->period_active = 1;
>>>>> +	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>>> +	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
>>>>
>>>> I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
>>>> much like tg_set_cfs_bandwidth
>>>
>>> It's a little different, e.g. periods like below:
>>>
>>>            Pm
>>>            |
>>>            v
>>>   |-----|-----|
>>> Pn-1   Pn    Pn+1
>>>
>>> Assuming we start_cfs_bandwidth() at some moment Pm between "Pn" and
>>> "Pn+1", cfs_b->runtime_expires we want here should be "Pn+1", while
>>> __refill_cfs_bandwidth_runtime() forces it to be "Pm+period"(hrtimer
>>> expires at "Pn+1").
>>>
>>> Regards,
>>> Xunlei
>>>
>>
>> Hmm, yeah, it would wind up out of sync with the hrtimer unless we
>> restarted the period boundaries.
>>
>> Now that I look at it even more closely, I don't see how you're getting
>> problems like that, besides wasting tiny amounts of cpu time in expire:
>> we only disable the period timer when there has been a full period with
>> no activity, so we should have a full cfs_b->runtime, and a
>> runtime_expires from the period that was spent entirely idle (so it
>> should always be in the past). But with a working extend-local-deadline
>> check like after patch 1 it shouldn't /matter/ what the actual value of
>> runtime_expires is, we'll just waste tiny amounts of cpu time
>> incrementing runtime_expires every account_cfs_rq_runtime call rather
>> than aborting early.
>>
>> Actually, I think arguably the real issue may be in patch 1:
> 
> Ben,
> 
> Thanks for the comments.
> 
> Yes, the problematic throttle issue is solved by patch 1, and it's a
> different issue in patch 2. It wants to address another stale runtime
> issue(I will update the changelog in v2), specifically:
> An ever-running cfs_rq tends to retain some runtime(min_cfs_rq_runtime),
> it could be even more(sysctl_sched_cfs_bandwidth_slice) in the worst
> case, these runtime should be discarded once the period is advanced.
> 
> Without this patch expire_cfs_rq_runtime() will see the two equal
> cfs_rq{b}->expires_seq after period gets restarted after some idle,
> so it will fall into the "clock drift" branch and go on consuming
> the old runtime if any.
> 
> With the expiration update done in patch 2, we can ensure the stale
> cfs_rq->runtime_remaining gets cleared at expire_cfs_rq_runtime()
> once a new period begins.

I did some tests; as you said, the throttle problem gets resolved by
patch 1. The stale cfs_rq->runtime_remaining can also be cleared
properly thanks to the differing expires_seq, so patch 2 is now mainly
about avoiding needless expire_cfs_rq_runtime() work caused by the old
expiration time.

I've sent v2, please have a look, thanks!

-Xunlei

> 
>>
>>>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>>>  	 * spread between our sched_clock and the one on which runtime was
>>>>  	 * issued.
>>>>  	 */
>>>> -	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>>>> +	if ((s64)(expires - cfs_rq->runtime_expires) > 0) {
>>>>  		cfs_rq->runtime_expires = expires;
>>>> +		cfs_rq->expires_seq = expires_seq;
>>>> +	}
>>>>  
>>>>  	return cfs_rq->runtime_remaining > 0;
>>>>  }
>>
>>
>> We don't want to clobber a tick-extended runtime_expires with the
>> original one, but we /do/ want update if expires_seq changes, no matter
>> what (rather than skipping updating both if we've tick-extended past the
>> /next/ period's runtime_expires).
>>
>> I think this should actually be
>>
>> if (cfs_rq->expires_seq != expires_seq) {
>> 	cfs_rq->runtime_expires = expires;
>> 	cfs_rq->expires_seq = expires_seq;
>> }
>>
>> Can you easily check and see if this fixes the issue without this second
>> patch? That would clear up how the actual problem is happening, and I
>> think it's a change we'd want anyways.
> 
> Agree, I will have it in v2.
> 
>>
>> That said, we should also just update runtime_expires like this patch
>> does regardless, and as mentioned we should have a full cfs_b->runtime
>> so we don't need a __refill like I was originally thinking.
>>
> Regards,
> Xunlei
> 


